arXivDaily arXiv daily academic digest, updated Monday to Friday

1. Multimodal and Vision-Language Models (17 papers)

2606.18472 2026-06-18 cs.CV New submission

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

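Sneha Paul, Zachary Patterson, Nizar Bouguila

Affiliations * Concordia University

AI summary Proposes ReFine3D, a framework that improves the domain generalization of 3D large multimodal models through regularization strategies: selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.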

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
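
The abstract mentions test-time augmentation with confidence-based aggregation but does not give the aggregation rule. As a minimal sketch, assuming each augmented view is weighted by its own maximum softmax probability (an illustrative choice, not necessarily the one used in ReFine3D), the idea could look like this. The function names and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_views(view_logits):
    """Confidence-weighted average of per-view class probabilities.

    Assumption: the weight of a view is its maximum class probability.
    """
    probs = [softmax(logits) for logits in view_logits]
    weights = [max(p) for p in probs]
    total_w = sum(weights)
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_w
        for c in range(n_classes)
    ]

# Three augmented views of one point cloud (toy logits for 3 classes).
# Two uncertain views lean slightly toward class 1; one confident view picks class 0.
views = [[4.0, 0.0, 0.0], [0.2, 0.3, 0.1], [0.1, 0.4, 0.2]]
fused = aggregate_views(views)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case a plain majority vote would choose class 1, while the confidence-weighted average follows the single confident view and chooses class 0.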

2606.18553 2026-06-18 cs.CV New submission

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

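Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

Affiliations * University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary Proposes a retrieval-augmented image captioning framework built on hierarchical multimodal article retrieval. Structure-aware retrieval and contextual refinement feed a VLM and an LLM that generate context-rich captions. The system placed 5th in the EVENTA 2025 Challenge.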

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18681 2026-06-18 cs.CV New submission

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

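Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

Affiliations * Inha University

AI summary Proposes SPARE, which recasts token pruning as a subspace reconstruction problem: it iteratively keeps the tokens with the largest projection residuals and adds an anti-relevance criterion to preserve contextual information. On LLaVA it removes up to 94% of visual tokens while retaining 95% of baseline performance.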

Comments ECCV 2026 Under Review

Abstract

Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance score can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
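
The reconstruction-driven selection described in the abstract resembles classic greedy column subset selection. The following is a minimal sketch under that assumption: at each step it keeps the token whose residual, after projecting out the span of the tokens already kept, has the largest norm. The name `select_tokens` and the toy vectors are hypothetical, the anti-relevance criterion is omitted, and this is not claimed to be SPARE's exact procedure.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_tokens(tokens, k):
    """Greedy column subset selection by largest projection residual.

    Returns the indices of up to k kept tokens, in selection order.
    """
    residuals = [list(t) for t in tokens]
    selected = []
    for _ in range(k):
        norms = [dot(r, r) for r in residuals]
        best = max(range(len(tokens)), key=norms.__getitem__)
        if norms[best] < 1e-12:
            break  # everything left is already reconstructed
        selected.append(best)
        scale = math.sqrt(norms[best])
        q = [x / scale for x in residuals[best]]
        # Remove the component along q from every residual.
        residuals = [
            [x - dot(r, q) * qi for x, qi in zip(r, q)]
            for r in residuals
        ]
    return selected

# Token 1 points in the same direction as token 0 but is 10x smaller.
tokens = [[10.0, 0.0], [1.0, 0.0], [0.0, 0.5], [0.6, 0.8]]
kept = select_tokens(tokens, 2)
```

Because the rule works on unnormalized residuals, token magnitude influences the choice, which is the property the abstract says cosine-based diversity criteria discard. Here token 1 is fully explained once token 0 is kept, so the second pick is token 3.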

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

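Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary Proposes SAMA, a semantic anchor-aligned augmentation framework: structured semantic anchors guide a multi-expert multimodal LLM to generate high-fidelity text, an anchor-preserving diffusion mechanism synthesizes images, and a dual-constraint filtering module selects samples, yielding significant gains on low-resource multimodal information extraction.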

Comments Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.18846 2026-06-18 cs.CV New submission

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

Affiliations * School of Artificial Intelligence, Jilin University; College of Computer Science, Jilin University; OPPO

AI summary Proposes ScreenAnnotator, which combines a unified annotation atom schema, an on-policy annotation loop, and a Bayesian verifier to address limited expressiveness, annotation-training decoupling, and poor data reusability in existing tools, enabling efficient multi-task data generation.

Comments 14 pages, 7 figures

Abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, an absolute gain of 35.1 percentage points. Our code is available at: https://github.com/WnQinm/Annotator.

2606.18974 2026-06-18 cs.CV New submission

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

Affiliations * Xi’an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

AI summary Proposes Visual-OPSD, a cross-modal on-policy self-distillation method that transfers the visual-thought reasoning ability of multi-step diffusion generation to a text-only student, achieving a 14.3x speedup and a 3.40 percentage point improvement.

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
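
To make the token-level JSD objective concrete, here is a minimal sketch that assumes the loss is the Jensen-Shannon divergence between teacher and student next-token distributions, averaged over the positions of one student trajectory. The averaging scheme, the function names, and the toy distributions are assumptions for illustration and are not taken from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def sequence_jsd(teacher_dists, student_dists):
    """Mean per-token JSD over one on-policy student trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

# Two token positions over a 3-word vocabulary (toy numbers).
# The teacher would be conditioned on privileged context, the student only on the question.
teacher = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
student = [[0.7, 0.2, 0.1], [0.0, 0.5, 0.5]]
loss = sequence_jsd(teacher, student)
```

The first position contributes zero because both distributions agree, so the whole loss comes from the second position. Unlike a one-directional KL term, JSD stays finite even when one distribution assigns zero probability to a token the other prefers.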

2606.18992 2026-06-18 cs.CV New submission

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

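Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

AI summary Proposes CLARA, a framework that disambiguates composed image retrieval by showing users a panel of visual alternatives to choose from and uses likelihood-ratio recalibration to keep the coverage guarantee valid across multiple rounds. It outperforms text-question baselines.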

Abstract

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
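
The abstract does not spell out how the likelihood-ratio reweighting enters calibration. One standard way to reweight split conformal calibration is a weighted quantile of the calibration scores, with the test point's mass placed at infinity. The sketch below shows only that generic construction; the function name, the weights, and the scores are hypothetical, and CLARA's actual update may differ.

```python
import math

def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Smallest score whose cumulative normalized weight reaches 1 - alpha.

    scores:      nonconformity scores of the calibration examples
    weights:     likelihood-ratio weight of each calibration example
    test_weight: weight of the test point (its mass sits at +infinity)
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1 - alpha:
            return score
    return math.inf  # not enough calibration mass: keep every candidate

scores = [0.1, 0.2, 0.3, 0.4]

# Round 1: no user feedback yet, so all calibration examples count equally.
first_round = weighted_conformal_threshold(scores, [1, 1, 1, 1], 1, alpha=0.25)

# Later round: a user selection makes the first calibration example 4x more
# representative of the remaining candidates (toy likelihood ratios).
later_round = weighted_conformal_threshold(scores, [4, 1, 1, 1], 1, alpha=0.25)
```

Candidates whose score is at or below the threshold stay in the prediction set. In the toy example the reweighting tightens the threshold from 0.4 to 0.3, so the candidate set shrinks after the user's selection while the quantile is still computed at the same nominal level.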

2606.19100 2026-06-18 cs.CV New submission

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

Affiliations * NOVA School of Science and Technology; NOVA LINCS

AI summary Addresses the lack of open-source multimodal models for European Portuguese with AMALIA-VL, built through three-stage training and a Portuguese-centric data mix. It establishes a strong baseline, with all resources slated for open release.

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.19277 2026-06-18 cs.CV New submission

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

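Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

Affiliations * Engineering, North Carolina A&T State University, Greensboro, NC, USA; College of Science Technology, North Carolina A&T State University, Greensboro, NC, USA

AI summary Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and performs remote sensing VQA with fewer than 5% trainable parameters. The hybrid FLAVA architecture offers the best balance between multimodal reasoning and retrieval.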

Comments 4 pages, 2 figures, accepted and to be presented at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with less than 5 percent of trainable parameters. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
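
A quick parameter count shows why bottleneck adapters stay within a small trainable budget. The sketch below assumes a hidden width of 768, a bottleneck width of 64, an MLP expansion ratio of 4, and one adapter after the attention layer plus one after the MLP in each block. All of these numbers are illustrative assumptions, not values reported for RS Adapter.

```python
def adapter_param_count(d_model, bottleneck):
    """Weights and biases of a down-projection followed by an up-projection."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up

def frozen_block_param_count(d_model, mlp_ratio=4):
    """Rough size of one transformer block, ignoring layer norms.

    Attention has four d_model x d_model projections (Q, K, V, output);
    the MLP expands to mlp_ratio * d_model and projects back.
    """
    attention = 4 * (d_model * d_model + d_model)
    hidden = mlp_ratio * d_model
    mlp = d_model * hidden + hidden + hidden * d_model + d_model
    return attention + mlp

d_model, bottleneck = 768, 64  # assumed sizes, for illustration only
adapters = 2 * adapter_param_count(d_model, bottleneck)  # attention + MLP adapters
frozen = frozen_block_param_count(d_model)
trainable_fraction = adapters / (adapters + frozen)
```

Under these assumptions each block gains 198,272 trainable parameters next to roughly 7.1 million frozen ones, a trainable share of about 2.7%, which is consistent with the sub-5% budget the abstract reports.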

2606.19338 2026-06-18 cs.CV New submission

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

Affiliations * Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

AI summary Proposes RNG-Bench, a benchmark suite of two games, Matching Pairs and 3D Maze, that evaluates how well multimodal large language models reconstruct past observations and act on them in non-Markov environments. Most errors stem from forgetting rather than decision making, and fine-tuning improves performance.

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

Affiliations * The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

AI summary Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle. Active perception decouples reasoning complexity from video duration, and the agent achieves state-of-the-art performance among open-source models on multiple benchmarks.

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10$\times$ larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2606.19120 2026-06-18 cs.LG cs.CV Cross-list

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Affiliations * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary Proposes ViGOS, a framework that decouples perception from reasoning to avoid text shortcuts during MLLM post-training and to improve image-grounded behavior.

Comments 29 pages, 5 figures, 8 tables

Abstract

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2507.07574 2026-06-18 cs.CV Updated version

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

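Enrico Vompa, Tanel Tammet, Mohit Vaishnav

Affiliations * Applied Artificial Intelligence Group; Tallinn University of Technology

AI summary Proposes the Linear Separability Ceiling (LSC) diagnostic framework, finds an alignment gap in VLMs, and uses a contrastive objective to restructure the visual manifold so that models significantly surpass the LSC on abstract compositional reasoning tasks.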

Comments Accepted at TMLR

Abstract

A challenge in advancing Visual-Language Models (VLMs) is determining whether their failures on abstract reasoning tasks, such as Bongard problems, stem from flawed perception or faulty top-down reasoning. To disentangle these factors, we introduce a diagnostic framework centered on the Linear Separability Ceiling (LSC), the performance achievable by a linear classifier on a VLM's raw visual embeddings. Applying this framework to state-of-the-art VLMs, we uncover a pervasive "alignment gap", where most models fail to generatively outperform the linear separability of their representations. We find that the few models surpassing this ceiling do so via two mechanisms: by further refining visual representations into a more linearly separable format or by executing non-linear decision logic. We demonstrate that this bottleneck is not a fundamental limitation but a solvable visual alignment issue. Our method augments standard next-token prediction with a contrastive objective to restructure the visual manifold into a more one-dimensionally linear geometry, improving image-to-image comparison and enabling models to significantly surpass the LSC on abstract compositional reasoning tasks.

2606.01711 2026-06-18 cs.CV Updated version

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

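Hyeonwoo Cho, Donghyeon Baek, Yewon Kim, Bumsub Ham

Affiliations * KAIST

AI summary Proposes RESTORE, a framework that improves visual token reduction by calibrating positional and attentional distortions, raising multimodal LLM performance while maintaining efficiency.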

Comments Accepted to ICML 2026

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency. Project page is available at https://cvlab.yonsei.ac.kr/projects/RESTORE

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Version update

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

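Institutions * NVIDIA

AI summary Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable general-purpose backbone for embodied agents.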

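Details
AI abstract (translated from Chinese)

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, images, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies the key modalities of Physical AI, effectively consolidating vision-language models, video generators, world simulators, and world-action models into one framework. Our evaluation shows that Cosmos 3 sets a new state of the art across a diverse range of understanding and generation tasks, demonstrating the capability of omnimodal world models as scalable, general-purpose backbones for embodied agents. At the time the technical report was written, our post-trained Cosmos 3 models were ranked by Artificial Analysis as the best open-source text-to-image and image-to-video models and by RoboArena as the best policy model. To accelerate open research and deployment in Physical AI, we release our code, model checkpoints, curated synthetic datasets, and evaluation benchmark under the Linux Foundation's OpenMDW-1.1 license at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

English abstract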

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.05409 2026-06-18 cs.CV cs.CL Version update

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

Institutions * McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

AI summary Presents the Novel Visual References Dataset (NVRD) and, by comparing how VLMs and humans generalize novel visual concepts, finds that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.

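Details
AI abstract (translated from Chinese)

Vision-language models (VLMs), like human learners, frequently encounter new visual concepts, but how they map novel visual references to language after exposure remains underexplored, especially when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts at different levels of novelty, with up to 20 increasingly perturbed versions of the original object per concept to test generalization. Unlike earlier work on visual augmentations of familiar concepts, NVRD consists of entirely novel, open-ended stimuli built from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open-source and 2 closed-source models together with 2,400 human judgments for a direct human-model comparison, and find that (i) models struggle to acquire new concepts in context when those concepts contradict prior knowledge, and (ii) although models and humans show correlated sensitivity to visual perturbations, models overgeneralize significantly, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in humans and machines.

English abstract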

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

Institutions * Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

AI summary Introduces the FutureOmni benchmark for evaluating how well multimodal large models forecast the future from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.

Comments Accepted by ICML 2026

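Details
AI abstract (translated from Chinese)

Although multimodal large language models (MLLMs) show strong omni-modal perception, their ability to forecast future events from audio-visual cues remains underexplored, because existing benchmarks focus mainly on retrospective understanding. To close this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models must perform cross-modal causal and temporal reasoning and make effective use of internal knowledge to predict future events. FutureOmni is built with a scalable, LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, especially in speech-heavy scenarios, with Gemini 3 Flash reaching the best accuracy of 64.8%. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose the Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and on popular audio-visual and video-only benchmarks show that OFF improves future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

English abstract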

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2. Embodied Intelligence, Robotics, and Autonomous Driving 13 papers

2606.18583 2026-06-18 cs.CV cs.RO New submission

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

Institutions * University of Calgary; Nanchang University; Nanyang Technological University; Wuhan University

AI summary Proposes an aerial-ground LiDAR place recognition framework that narrows the domain gap with multi-scale patch-level self-supervised learning and reduces false positives with an expanded reciprocal re-ranking algorithm, substantially improving retrieval accuracy on multiple datasets.

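Details
AI abstract (translated from Chinese)

LiDAR place recognition determines one's position on a previously acquired point cloud map. The most widely studied ground-based LiDAR place recognition suffers from drawbacks such as the need for a prior visit, incomplete coverage, and limited viewpoints. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, which makes cross-view place recognition both necessary and advantageous. However, aerial-ground LiDAR place recognition faces major challenges, including the domain gap between aerial and ground point clouds and false positives in the initial retrieval. To address these problems, we propose a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Building on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and combines them with scene-level learning to make global features more discriminative between aerial and ground point clouds. In addition, exploiting the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm that makes maximal use of neighborhood information and refines each feature based on its neighbors' features; the refined features are then used to update the similarity matrix for the final ranking. Extensive experiments show that our retrieval network outperforms existing state-of-the-art (SOTA) methods, improving average Recall@1 by 9.8% and average Recall@1% by 3.2% on the CS-Urban-Scenes dataset, while also achieving the best performance on the CS-Campus3D dataset. Furthermore, without any additional training, our ER re-ranking algorithm raises average Recall@1 by a further 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes.

English abstract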

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm that exploits neighborhood information maximally and refines each feature based on its neighbor features; the refined features are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on the CS-Urban-Scenes dataset, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.
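The abstract does not spell out the Expanded Reciprocal algorithm, so the sketch below shows only the generic idea it builds on: keep neighbors that are mutual (reciprocal) top-k matches, refine each feature with those neighbors, and re-rank on the refined similarities. All function names, the mean-pooling refinement, and the toy vectors are assumptions, not the paper's method.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def similarity_matrix(feats):
    """Pairwise cosine similarity between all feature vectors."""
    unit = [normalize(f) for f in feats]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def top_k(sim, i, k):
    """Indices of the k most similar items to item i, excluding i itself."""
    others = [j for j in range(len(sim)) if j != i]
    others.sort(key=lambda j: sim[i][j], reverse=True)
    return others[:k]


def reciprocal_neighbors(sim, i, k):
    """Neighbors j of i such that i is also among j's top-k."""
    return [j for j in top_k(sim, i, k) if i in top_k(sim, j, k)]


def refine_features(feats, k):
    """Replace each feature by the normalized mean of itself and its reciprocal neighbors."""
    sim = similarity_matrix(feats)
    refined = []
    for i, f in enumerate(feats):
        group = [normalize(f)]
        group += [normalize(feats[j]) for j in reciprocal_neighbors(sim, i, k)]
        mean = [sum(col) / len(group) for col in zip(*group)]
        refined.append(normalize(mean))
    return refined


def rerank(query, database, k=2):
    """Return database indices ordered by similarity to the query after refinement."""
    feats = refine_features([query] + database, k)
    sim = similarity_matrix(feats)
    order = sorted(range(1, len(feats)), key=lambda j: sim[0][j], reverse=True)
    return [j - 1 for j in order]


# Toy example: the query points along x; database item 1 is its closest match.
ranking = rerank([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1], [0.1, 0.9]], k=1)
print(ranking)
```

Requiring mutual membership is what suppresses false positives in this family of methods: a database item that happens to be near the query, but whose own nearest neighbors lie elsewhere, does not contribute to the refined feature.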

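2606.18687 2026-06-18 cs.CV cs.RO New submission

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

Institutions * CSIRO Robotics; University of Queensland

AI summary For heterogeneous place recognition between 4D automotive radar and dense spinning radar, proposes spatially stratified distillation (SSD), which uses an asymmetric, physically grounded spatial alignment based on radar returns: it enforces feature alignment in overlapping regions and lowers the distillation weight in sparse regions, achieving state-of-the-art performance on the HeRCULES dataset.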

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

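Details
AI abstract (translated from Chinese)

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge different hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field of view) of the 4D sensor, which captures only a small fraction of the structural density present in the spinning-radar database. Prior work addresses this problem by unifying the different radar signals, that is, by projecting both signals into a common representation space. However, its performance degrades in multi-session environments. In this paper, we propose spatially stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars have overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student radar lacks returns but the teacher radar contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluation on the recent HeRCULES dataset shows that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.

English abstract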

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals, that is, by projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
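The weighting rule described above can be sketched as a per-cell weight map applied to a feature-matching loss. This is an illustrative reading of the abstract only: the occupancy masks, the `discount=0.1` value, the zero weight for all remaining cells, and the squared-error loss are assumptions, not details taken from the paper.

```python
def stratified_weights(student_occ, teacher_occ, shared_fov, discount=0.1):
    """Per-cell distillation weights from radar-return occupancy masks.

    student_occ / teacher_occ: truthy where that radar has a return in the cell.
    shared_fov: truthy where the cell lies inside both sensors' field of view.
    discount: hypothetical weight for teacher-only cells (value is an assumption).
    """
    weights = []
    for s, t, f in zip(student_occ, teacher_occ, shared_fov):
        if s and t:
            weights.append(1.0)        # overlapping returns: strong alignment
        elif t and f:
            weights.append(discount)   # teacher-only structure in shared FOV
        else:
            weights.append(0.0)        # assumed: no supervision elsewhere
    return weights


def stratified_distillation_loss(student_feats, teacher_feats, weights):
    """Weighted mean of per-cell squared feature distances."""
    total = sum(weights)
    if total == 0:
        return 0.0
    loss = 0.0
    for fs, ft, w in zip(student_feats, teacher_feats, weights):
        loss += w * sum((a - b) ** 2 for a, b in zip(fs, ft))
    return loss / total


# Four toy cells: overlap, teacher-only in FOV, teacher-only outside FOV, student-only.
weights = stratified_weights([1, 0, 0, 1], [1, 1, 1, 0], [1, 1, 0, 1])
loss = stratified_distillation_loss(
    [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 0.0]],
    weights,
)
print(weights, loss)
```

Compared with uniform distillation, the large mismatch in the third cell contributes nothing here, so the student is not pushed to hallucinate structure it cannot observe.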

2606.18824 2026-06-18 cs.CV cs.LG New submission

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum, Edmond S. L. Ho

Institutions * School of Computing Science, University of Glasgow; James Watt School of Engineering, University of Glasgow; Department of Computer Science, Durham University

AI summary Proposes the MMPM framework, which uses a behavior-aware interaction module and a CVAE-based mode-aware trajectory predictor to model the crossing and non-crossing pedestrian modes separately, improving the accuracy of multimodal trajectory prediction from an ego-centric view.

Comments Accepted at The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

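Details
AI abstract (translated from Chinese)

Predicting pedestrian trajectories from an ego-centric camera is challenging because it depends on complex interactions with vehicles and scene context as well as on the pedestrian's intention. Modelling the correlation and intent between a pedestrian's past and future trajectories usually produces a multimodal (i.e., multi-mode) distribution. Existing stochastic predictors typically sample multiple future trajectories from a single unimodal distribution, which can yield suboptimal "mixed-mode" trajectories that fall between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that models the future trajectory distribution separately as semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM), which jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head, and hand-gesture cues; and a CVAE-based Mode-aware Trajectory Predictor (MTP), which models the future trajectory distributions of the crossing and non-crossing modes separately. A query-based decoder further enforces mode consistency during decoding. Experiments on the PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction. We also introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, showing improved frame-wise displacement errors over prior work.

English abstract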

Pedestrian trajectory prediction from an ego-centric camera is challenging because it depends on complex interactions with vehicles and scene context, as well as on the pedestrian's intention. Modelling the correlation and intent between a pedestrian's historical and future trajectories usually yields a multimodal (i.e., multi-mode) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions as semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head, and hand-gesture cues, and a CVAE-based Mode-aware Trajectory Predictor (MTP) that models the future trajectory distributions of two modes, crossing and not crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on the PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.

2606.18955 2026-06-18 cs.CV cs.RO New submission

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

Institutions * Department of Electronic Engineering, Tsinghua University; Tianfu Jiangxi Laboratory

AI summary Proposes a latent-action-based framework that uses a Hybrid Disentangled VQ-VAE to extract general action priors from unlabeled human videos, reduces action hallucinations with an intent-perception decoupling strategy, and adapts to downstream tasks with only 50 trajectories.

Comments Accepted to IROS 2026

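Details
AI abstract (translated from Chinese)

Training generalist Vision-Language-Action (VLA) models usually requires large, diverse robot datasets with high-fidelity action annotations. Although egocentric human manipulation videos are abundant and capture considerable environmental diversity, their lack of action labels makes them hard to use under conventional training paradigms. To address this, we propose a latent-action-based framework that extracts general action priors from unlabeled human videos. The architecture uses a Hybrid Disentangled VQ-VAE that separates motion dynamics from the environmental background through physical masks, which allows a cross-embodiment action codebook to be built. By pre-training on human videos with this codebook, the VLM backbone learns deep representations of action intent. To adapt to a specific embodiment, we introduce an intent-perception decoupling strategy in which the VLM predicts the action intent while a separate frozen visual encoder supplies state-specific features to the action expert, reducing action hallucinations. Results in simulation and in the real world show that our method, pre-trained only on unlabeled human videos, is competitive with state-of-the-art VLA models trained on large annotated datasets and needs only 50 trajectories for downstream adaptation.

English abstract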

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.18960 2026-06-18 cs.CV cs.RO New submission

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

Institutions * Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary Proposes Mem-World, which uses W-VMem, a 4D wrist-view surfel-indexed memory, to counter the scene forgetting caused by occlusion and motion during manipulation, enabling persistent world modeling and improving policy evaluation and policy improvement.

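Details
AI abstract (translated from Chinese)

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experiments by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and fast wrist-camera motion make the current observation insufficient for predicting future views, so models forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenes. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core is W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to surface elements that evolve over time. By explicitly modeling when and where scene elements were observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected through surfel-based rendering and scoring, which gives the prediction informative and non-redundant context. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenes, provides more reliable policy evaluation than Ctrl-World with a 14.5% higher Pearson correlation, and supports effective policy improvement through synthetic data generation, raising success rates on long-horizon tasks from 58% to 72%.

English abstract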

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

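2606.19258 2026-06-18 cs.CV cs.RO New submission

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

Institutions * University of Georgia

AI summary Proposes the CABLE framework, in which the edge device propagates the cloud segmentation mask using ego-motion compensation and residual-motion cues, forms a region of interest (ROI), and uploads only ROI-masked images, creating a mask-to-ROI-to-LMM feedback loop that achieves a 73-87% reduction in ROI pixel coverage and a 5-8x LMM prefill speedup on five datasets.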

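Details
AI abstract (translated from Chinese)

Cloud-hosted large multimodal models (LMMs) can give Vehicle-to-Everything systems strong open-vocabulary perception, but simply transmitting full-resolution frames from the edge to the cloud causes severe communication overhead and cloud-side prefill latency. We propose CABLE, a cloud-assisted, bandwidth-efficient LMM-based encoding framework for edge-cloud perception. On the edge, CABLE propagates the previous cloud segmentation mask using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions through a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, and the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show that the method achieves substantial communication savings while preserving perception: relative to full-frame inference, ROI pixel coverage falls by 73-87% and the estimated LMM prefill speedup is 5-8x, with a modest trade-off in detection quality.

English abstract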

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8x estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.

2606.18610 2026-06-18 cs.RO cs.CV Cross-list

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

Institutions * University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

AI summary Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across 7 real-world policies.

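Details
AI abstract (translated from Chinese)

Evaluating generalist robot manipulation policies in the real world is costly, slow, and hard to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, multi-view observations must stay mutually consistent, and the evaluator must generalize to policies whose behavior lies outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that turns a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary kinds of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting drift that a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other views, keeping multi-camera observations coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a confidence signal for each action chunk, terminating a rollout when the generated frames deviate from the requested actions. We also show that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval reaches a closed-loop Pearson correlation of 0.929 and an MMRV of 0.119, outperforming three strong prior video-model-based baselines, and it generalizes to new tasks.

English abstract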

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
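The two reported metrics compare per-policy success rates from the evaluator with those measured on real robots. The sketch below computes both on made-up numbers. The abstract does not define MMRV, so the code assumes the Mean Maximum Rank Violation as commonly used in prior simulation-based evaluation work: for each policy, take the largest real-world gap to any other policy whose relative ordering the evaluator got wrong, then average over policies. Function names and the toy success rates are illustrative only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mmrv(real, predicted):
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    A pair (i, j) is violated when the evaluator orders the two policies
    differently from the real world; the penalty is the real-world gap.
    """
    n = len(real)
    worst = []
    for i in range(n):
        violations = [0.0]
        for j in range(n):
            if (predicted[i] < predicted[j]) != (real[i] < real[j]):
                violations.append(abs(real[i] - real[j]))
        worst.append(max(violations))
    return sum(worst) / n


# Made-up success rates for three policies (not results from the paper).
real_success = [0.2, 0.5, 0.9]
consistent_eval = [0.3, 0.4, 0.8]   # same ordering as the real world
swapped_eval = [0.6, 0.4, 0.8]      # ranks the first two policies the wrong way

print(pearson(real_success, consistent_eval), mmrv(real_success, consistent_eval))
print(pearson(real_success, swapped_eval), mmrv(real_success, swapped_eval))
```

The two metrics are complementary: Pearson rewards a linear relationship between predicted and real success, while MMRV penalizes only ordering mistakes, weighted by how much they matter in the real world.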

2606.19067 2026-06-18 cs.RO cs.CV Cross-list

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

Institutions * Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

AI summary Addresses sensor configuration under quadruped locomotion by systematically evaluating visual, visual-inertial, and LiDAR-visual-inertial SLAM methods, and finds that stereo cameras, global shutters, and appropriate inertial integration markedly improve localization robustness.

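Details
AI abstract (translated from Chinese)

Autonomous navigation of quadruped robots in varied environments depends fundamentally on robust Simultaneous Localization and Mapping (SLAM). Although visual-inertial SLAM has matured on wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configuration affects performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinctive embodiment-induced sensing challenges, including foot impacts, high-frequency mechanical vibration, and rapid angular rotation, all of which degrade standard perception pipelines. To fill this gap, we systematically evaluate state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods on the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the effects of camera modality, shutter technology, and inertial sensor tier, and analyze their trade-offs in localization accuracy, algorithmic robustness, and use of computational resources. Our empirical results show that hardware choices strongly affect system robustness: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly reduce motion-induced tracking failures compared with rolling shutter cameras, and, crucially, under harsh legged locomotion standard inertial integration can degrade the performance of primarily vision-based frameworks. These insights also provide concrete design guidelines for tailoring sensor payloads to achieve reliable perception on agile legged systems.

English abstract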

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

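2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY Cross-list

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

Institutions * University of Waterloo

AI summary Proposes an arm kinematic correction method that deterministically reconstructs the depth of occluded joints from a constant-arm-length geometric constraint and the Pythagorean theorem, without complex modeling; it is validated against Vicon and applied successfully to teleoperation.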

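Details
AI abstract (translated from Chinese)

Markerless motion capture with a single RGB-D camera offers a low-cost, non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades under self-occlusion, especially during upper-limb motion. This paper proposes an Arm Kinematic Correction (AKC) method that improves depth estimation by imposing geometric constraints based on constant arm lengths. The method uses wrist positions and predefined arm lengths to reconstruct the depth of occluded joints through a deterministic formula based on the Pythagorean theorem, avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system shows reliable performance for both static and dynamic joint motion, evaluated with root-mean-square error (RMSE) and Pearson correlation. In addition, motion-mapping teleoperation is demonstrated successfully in both simulated and physical robot environments. The results show that AKC improves robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.

English abstract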

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
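The constant-arm-length constraint can be illustrated with a few lines of geometry. The sketch below is one possible reading of the abstract, not the authors' formulation: it assumes the wrist's 3D position is reliable, that the occluded joint's lateral coordinates (x, y) are already expressed in metres in the camera frame, and that the caller knows on which side of the wrist the joint lies in depth (the `toward_camera` flag). The function name and the clamping of negative values are likewise assumptions.

```python
import math


def correct_joint_depth(wrist_xyz, joint_xy, segment_length, toward_camera=False):
    """Recover an occluded joint's depth from a fixed segment length.

    With segment length L and lateral offsets dx, dy between wrist and joint,
    the Pythagorean theorem gives dz = sqrt(L^2 - dx^2 - dy^2).
    """
    x_w, y_w, z_w = wrist_xyz
    dx = joint_xy[0] - x_w
    dy = joint_xy[1] - y_w
    dz_squared = segment_length ** 2 - dx * dx - dy * dy
    # Noisy lateral estimates can exceed the segment length; clamp to zero.
    dz = math.sqrt(max(dz_squared, 0.0))
    return z_w - dz if toward_camera else z_w + dz


# Wrist 1.0 m from the camera, joint offset 0.3 m sideways, segment 0.5 m long.
depth_behind = correct_joint_depth((0.3, 0.0, 1.0), (0.0, 0.0), 0.5)
depth_in_front = correct_joint_depth((0.3, 0.0, 1.0), (0.0, 0.0), 0.5, toward_camera=True)
print(depth_behind, depth_in_front)
```

Because the result is a closed-form function of the current frame, there are no filter gains or probabilistic priors to tune, which matches the abstract's emphasis on a deterministic correction.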

2606.19333 2026-06-18 cs.RO cs.CV cross-listed

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

Affiliations: UC Berkeley

AI summary: Proposes DO AS I DO, an algorithm that reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robotic hands, producing executable manipulation data and outperforming existing methods.

Comments Project website: https://do-as-i-do.com/

Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2603.11417 2026-06-18 cs.CV cs.LG version update

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

Affiliations: Department of Electrical and Computer Engineering, NYU Tandon School of Engineering

AI summary: Studies how end-to-end autonomous driving models generalize under zero-shot cross-city transfer, finding that self-supervised pretraining (e.g. I-JEPA, DINOv2, MAE) substantially reduces displacement and collision degradation compared with supervised pretraining and improves out-of-distribution PDMS in closed-loop evaluation.

Abstract

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

2606.17030 2026-06-18 cs.CV version update

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale embodied world knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, and achieves the best results on several benchmarks.

Abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.17846 2026-06-18 cs.RO cs.CV cs.LG version update

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei, Tong Zhang, Jiazhao Zhang, Jie Zhang, Jingyang Fan, Gengze Zhou, Qihang Peng, Chenxu Lv, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Chenfei Wu, Xiong-Hui Chen

Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotManip, which uses a unified alignment framework (across the representation, motion, and behavioral dimensions) to co-train at scale on heterogeneous multi-source manipulation data, builds a pretraining corpus of about 38,100 hours, and surpasses prior models in generalization capabilities such as zero-shot instruction following and cross-embodiment transfer.

Comments 44 pages

Abstract

Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.

3. Image Recognition, Retrieval, and Classification (5 papers)

2606.18528 2026-06-18 cs.CV new submission

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

Kecia G. de Moura, Robert Sabourin, Rafael M. O. Cruz

Affiliations: École de technologie supérieure – Université du Québec Montreal

AI summary: Proposes a data-driven strategy based on prototypical signatures that generates diverse, informative negative samples, improving the detection of skilled forgeries as well as scalability and computational efficiency.

Comments Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

Abstract

Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experimental results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at https://github.com/kdmoura/proto_hsv.

2606.18885 2026-06-18 cs.CV cs.IR new submission

LARE: Low-Attention Region Encoding for Text-Image Retrieval

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

Affiliations: Saudi Data and Artificial Intelligence Authority (SDAIA)

AI summary: Proposes LARE, a framework that encodes low-attention regions and the full image in parallel to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

2606.19204 2026-06-18 cs.CV new submission

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Nengbo Zhang, Chang sheng

Affiliations: Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)

AI summary: Proposes ROSA-TFormer, a model that integrates SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to classify Pinus sylvestris plantations with high accuracy from Sentinel-1/2 time series, reaching 99.67% overall accuracy.

Comments journal in tree classification

Abstract

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.

2606.02045 2026-06-18 cs.CV cs.AI version update

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta

Affiliations: Department of Information and Communication Engineering, University of Murcia; Department of Irrigation, Centro de Edafología y Biología Aplicada del Segura (CEBAS-CSIC)

AI summary: Proposes a peach leaf damage classification approach based on attention mechanisms and transfer learning: a CBAM-enhanced EfficientNet reaches 93.3% accuracy on a public dataset, and transfer learning achieves a 93% macro F1-score on a local dataset, effectively handling domain shift.

Abstract

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.

2602.04401 2026-06-18 cs.RO cs.CV version update

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

Affiliations: QUT Centre for Robotics, School of Electrical Engineering and Robotics, Queensland University of Technology

AI summary: Proposes a method that transfers thresholds via quantile normalisation to automatically select the operating point of a visual place recognition system, maximising recall at 100% precision without manual tuning.

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

Abstract

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
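The abstract describes the pipeline only at a high level: pick a threshold on a labelled calibration traversal, then carry it over to deployment through the quantiles of the similarity-score distributions. The sketch below is one plausible reading of that idea, not the released code. All function names are hypothetical, and it assumes that higher scores mean better matches and that the calibration set contains at least one incorrect match.

```python
import math
from bisect import bisect_right

def calibration_threshold(scores, is_correct):
    """Highest score of any incorrect calibration match.

    Accepting only scores strictly above this value gives 100% precision
    on the calibration traversal. Assumes at least one incorrect match.
    """
    return max(s for s, ok in zip(scores, is_correct) if not ok)

def quantile_of(value, scores):
    """Empirical CDF: fraction of scores that are <= value."""
    ordered = sorted(scores)
    return bisect_right(ordered, value) / len(ordered)

def value_at_quantile(q, scores):
    """Score sitting at quantile q of the given distribution."""
    ordered = sorted(scores)
    k = math.ceil(q * len(ordered) - 1e-12) - 1  # small slack against float error
    return ordered[min(max(k, 0), len(ordered) - 1)]

def transfer_threshold(cal_scores, cal_correct, deploy_scores):
    """Map the calibration threshold to deployment via its quantile."""
    t_cal = calibration_threshold(cal_scores, cal_correct)
    q = quantile_of(t_cal, cal_scores)
    return value_at_quantile(q, deploy_scores)

# Deployment scores are systematically lower than calibration scores here,
# so a fixed threshold of 0.5 would reject every deployment match.
cal_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
cal_correct = [False, False, False, True, True]
deploy_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
threshold = transfer_threshold(cal_scores, cal_correct, deploy_scores)
accepted = [s for s in deploy_scores if s > threshold]
```

In this toy case the calibration threshold sits at the 0.6 quantile, so the deployment threshold lands at the 0.6 quantile of the shifted scores. Working in quantile space is what makes the threshold insensitive to a change in the raw score scale between environments.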

4. Object Detection, Segmentation, and Localization (8 papers)

2606.18566 2026-06-18 cs.CV cs.AI cs.GR new submission

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

Affiliations: School of Computer Science and Technology, Soochow University

AI summary: Targets crowd counting in low-light environments: builds three new benchmark datasets and proposes a Multi-Modal Hyper-Graph Fusion module and a Deformable Rectangular Sparse Attention module, which together form the Low-Light Counting Network (LCNet), achieving the best performance on the three benchmarks.

Abstract

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV new submission

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST); Massachusetts Institute of Technology (MIT)

AI summary: Proposes a network design that combines a self-supervised DINOv3 backbone, a ViT-Adapter, and a Mask2Former decoder with an inference strategy of multi-scale test-time augmentation and model ensembling, taking first place in the 64-class fine-grained off-road semantic segmentation challenge with a composite score of 76.57%.

Comments 5 pages, 4 figures

Abstract

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
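Horizontal-flip test-time augmentation, one part of the inference strategy named in the abstract, can be illustrated with a short sketch. This is a generic version of the technique rather than the team's code. It omits the multi-scale and checkpoint-ensemble parts, and `toy_model` is a hypothetical stand-in for a segmentation network that returns per-pixel class probabilities.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]

def tta_predict(model, image):
    """Average per-pixel class probabilities over the image and its mirror.

    model maps an H x W image to an H x W grid of per-class probability lists.
    The prediction on the flipped image is flipped back before averaging so
    that both predictions refer to the same pixels.
    """
    direct = model(image)
    flipped_back = hflip(model(hflip(image)))
    return [
        [[(a + b) / 2 for a, b in zip(p, q)] for p, q in zip(row_d, row_f)]
        for row_d, row_f in zip(direct, flipped_back)
    ]

def argmax_labels(probs):
    """Pick the most probable class index at every pixel."""
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]

def toy_model(image):
    # Hypothetical two-class "network": pixel intensity is the class-1 probability.
    return [[[1.0 - v, v] for v in row] for row in image]

labels = argmax_labels(tta_predict(toy_model, [[0.2, 0.9]]))
```

Averaging probabilities before the argmax, rather than voting on labels, lets a confident prediction from one view outweigh a marginal one from the other.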

2606.18783 2026-06-18 cs.CV new submission

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

Yunus Sevim, Behçet Uğur Töreyin

Affiliations: Aselsan; Istanbul Technical University

AI summary: Proposes the REEM framework, which uses the signal-to-clutter ratio as a visibility prior and differentiably modulates the soft-IoU loss to improve detection of low-visibility targets, with no additional parameters or inference overhead.

Comments Accepted at CVPR 2026 Workshops (PBVS). Published version: https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html

Abstract

Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github.com/yall-in-one/Reemm.
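To make the idea of SCR-guided modulation concrete, here is a small sketch using a common definition of the signal-to-clutter ratio and a standard soft-IoU loss. The abstract does not give REEM's modulation function, so `difficulty_weight` and its `alpha` parameter are purely hypothetical. The sketch only shows the general shape of a loss that is scaled up for low-visibility targets.

```python
import math
from statistics import mean, pstdev

def local_scr(target_pixels, background_pixels, eps=1e-6):
    """Common SCR definition: |mean(target) - mean(background)| / std(background)."""
    contrast = abs(mean(target_pixels) - mean(background_pixels))
    return contrast / (pstdev(background_pixels) + eps)

def soft_iou_loss(pred, mask, eps=1e-6):
    """1 - soft IoU between predicted probabilities and a binary mask (flat lists)."""
    inter = sum(p * g for p, g in zip(pred, mask))
    union = sum(pred) + sum(mask) - inter
    return 1.0 - (inter + eps) / (union + eps)

def difficulty_weight(scr, alpha=1.0):
    """Hypothetical weight in (1, 1 + alpha]: larger when the target is hard to see."""
    return 1.0 + alpha * math.exp(-scr)

def modulated_loss(pred, mask, scr, alpha=1.0):
    """Soft-IoU loss scaled by the SCR-derived difficulty weight."""
    return difficulty_weight(scr, alpha) * soft_iou_loss(pred, mask)

# The same prediction error costs more on a dim target (SCR 0.5) than on a bright one (SCR 5).
dim = modulated_loss([0.5, 0.0], [1, 0], scr=0.5)
bright = modulated_loss([0.5, 0.0], [1, 0], scr=5.0)
```

Since the weight depends only on the input image and the ground-truth mask, it affects training gradients but leaves the network and its inference path unchanged, which is consistent with the "no inference-time overhead" claim.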

2511.20302 2026-06-18 cs.CV version update

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Affiliations: Sun Yat-sen University; The Chinese University of Hong Kong; Shanghai Jiao Tong University; National Supercomputing Center in Shenzhen; The Hong Kong University of Science and Technology; Beijing Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary: Proposes CrossEarth-Gate, which uses a Fisher-information-guided adaptive module selection mechanism to dynamically activate the most critical cross-domain modules, achieving the best performance on 16 of 18 cross-domain benchmarks.

Abstract

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.
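Fisher-guided selection can be sketched with the usual empirical diagonal approximation, in which a module's importance is the average squared gradient of the task loss over its parameters. The abstract does not spell out the exact estimator or selection rule, so the code below is an assumption-laden illustration: the function names, the top-k rule, and the toy gradient values are all hypothetical.

```python
def fisher_importance(per_sample_grads):
    """Empirical diagonal Fisher score for one module.

    per_sample_grads: one flat list of parameter gradients per training sample.
    Returns the squared gradient norm averaged over samples.
    """
    n = len(per_sample_grads)
    return sum(g * g for grads in per_sample_grads for g in grads) / n

def select_modules(grads_by_module, k):
    """Return the k module names with the highest Fisher score."""
    scores = {name: fisher_importance(g) for name, g in grads_by_module.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy gradients for the three module types named in the abstract.
grads_by_module = {
    "spatial": [[0.1, 0.1], [0.1, 0.1]],
    "semantic": [[1.0, 0.0], [0.0, 1.0]],
    "frequency": [[0.5, 0.5], [0.5, 0.5]],
}
active = select_modules(grads_by_module, k=2)
```

Only the selected modules would then be unfrozen for fine-tuning, which is how a selection step of this kind keeps the trainable parameter count small.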

2602.07544 2026-06-18 cs.CV 版本更新

MUFASA: A Multi-Layer Framework for Slot Attention

MUFASA: 一种用于槽注意力的多层框架

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

发表机构 * TU Darmstadt(达姆施塔特工业大学) Zuse School ELIZA(康拉德·楚泽学院ELIZA)

AI总结 提出MUFASA,一种轻量级即插即用框架,通过跨ViT编码器多层计算槽注意力并融合,提升无监督对象中心学习的分割性能,达到新最优。

Comments CVPR 2026. Authors Sebastian Bock and Leonie Schüßler contributed equally. Project page: https://visinf.github.io/mufasa/

详情
AI中文摘要

无监督对象中心学习(OCL)将视觉场景分解为不同的实体。槽注意力是一种流行的方法,将单个对象表示为潜在向量,称为槽。当前方法仅从预训练视觉变换器(ViT)的最后一层获取这些槽表示,忽略了跨其他层编码的宝贵、语义丰富的信息。为了更好地利用这些潜在语义信息,我们引入了MUFASA,一种用于基于槽注意力的无监督对象分割方法的轻量级即插即用框架。我们的模型跨ViT编码器的多个特征层计算槽注意力,充分利用其语义丰富性。我们提出了一种融合策略,将在多个层上获得的槽聚合成统一的以对象为中心的表示。将MUFASA集成到现有的OCL方法中,提高了它们在多个数据集上的分割结果,在仅增加少量推理开销的同时,建立了新的最先进水平并改善了训练收敛性。

英文摘要

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot-attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

2603.13941 2026-06-18 cs.CV 版本更新

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

高分辨率RGB与低分辨率高光谱输入的双向交叉注意力融合用于多模态语义分割

Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier, Thomas Längle, Markus Klute

发表机构 * KIT, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation(弗劳恩霍夫光电、系统技术与图像处理研究所)

AI总结 提出双向交叉注意力融合(BCAF),通过局部双向交叉注意力对齐高分辨率RGB与低分辨率高光谱图像,避免预上采样或早期光谱坍缩,在实时约束下提升多模态分割性能。

Comments Submitted to Image and Vision Computing (Elsevier). 23 pages, 10 figures, 7 tables

详情
AI中文摘要

异构传感器的多模态语义分割必须协调空间分辨率和通道维度不同的模态间的互补信息。具体而言,高分辨率RGB成像提供详细的空间结构,但通常难以区分视觉上相似的材料,而高光谱成像(HSI)提供判别性光谱特征,但空间分辨率较低。我们提出双向交叉注意力融合(BCAF),通过局部化、双向交叉注意力在原生网格上对齐高分辨率RGB与低分辨率HSI,避免预上采样或早期光谱坍缩。BCAF使用两个独立骨干网络:一个用于RGB的标准Swin Transformer,以及一个用于HSI的适应型Swin骨干网络,通过带有光谱自注意力的3D令牌化保留光谱结构。尽管我们的评估针对RGB-HSI融合,但BCAF是模态无关的,适用于与低分辨率、高通道辅助传感器配准的RGB。在基准SpectralWaste数据集上,BCAF表现强劲,在55图像/秒的速度下达到75.4%。我们进一步评估了一个新的工业数据集:K3I-Cycling(首个RGB子集已在Fordatis上发布)。在该数据集上,BCAF在材料分割(纸张、金属、塑料等)上达到62.3% mIoU,在塑料类型分割(PET、PP、HDPE、LDPE、PS等)上达到66.2% mIoU。这些结果表明,保留原生网格空间细节和光谱结构可在实时约束下改善多模态分割。代码和模型检查点已公开于 https://github.com/jonasvilhofunk/BCAF_2026。

英文摘要

Multimodal semantic segmentation with heterogeneous sensors must reconcile complementary information across modalities that differ in spatial resolution and channel dimensionality. In particular, high-resolution RGB imaging provides detailed spatial structure but often fails to distinguish visually similar materials, whereas hyperspectral imaging (HSI) provides discriminative spectral signatures but at lower spatial resolution. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF delivers strong performance, achieving 75.4% at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.). These results show that preserving native-grid spatial detail and spectral structure improves multimodal segmentation under real-time constraints. Code and model checkpoints are publicly available at https://github.com/jonasvilhofunk/BCAF_2026.

2604.05527 2026-06-18 cs.CV 版本更新

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

先验引导的多模态特征融合用于光学-SAR图像变化检测

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang, Lorenzo Bruzzone

发表机构 * Institute of Geospatial Information, Information Engineering University(地理信息研究所,信息工程大学) Academy of Digital China (Fujian), Fuzhou University(数字中国研究院(福建),福州大学) The School of Electronics and Communication Engineering, Sun Yat-sen University(电子与通信工程学院,中山大学) The Department of Information Engineering and Computer Science, University of Trento(信息工程与计算机科学系,特伦托大学)

AI总结 提出STSF-Net框架,联合建模模态特定和时空共同特征,并利用视觉基础模型的语义先验自适应融合多模态特征,在三个数据集上达到最优性能。

详情
AI中文摘要

多模态变化检测(MMCD)识别多模态遥感数据中的变化区域,在土地利用监测和城市可持续发展中具有重要应用价值。然而,现有MMCD方法在跨模态交互和利用模态特定特征方面存在局限性,导致对细粒度变化信息的建模不足,从而阻碍了语义变化的精确检测。为解决这些问题,我们提出了STSF-Net,一个专为光学和SAR图像之间的MMCD设计的框架。STSF-Net联合建模模态特定特征和时空共同特征以增强变化表示。具体而言,利用模态特定特征捕获真实的语义变化信号,同时嵌入时空共同特征以抑制由成像机制差异引起的伪变化。此外,我们引入了一种光学和SAR特征融合策略,该策略基于从视觉基础模型获得的语义先验自适应调整多模态特征的重要性。最后,我们引入了新的Delta-SN6数据集,这是第一个公开可访问的多类MMCD基准,包含极高分辨率全极化SAR和光学图像。在Delta-SN6、BRIGHT和Wuhan数据集上的实验结果表明,我们的方法在mIoU上分别比最先进方法高出3.21%、0.87%和1.32%。

英文摘要

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing data, demonstrating significant application value in land use monitoring and urban sustainable development. However, existing MMCD approaches exhibit limitations in both cross-modal interaction and the exploitation of modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts multimodal feature importance based on semantic priors obtained from visual foundation models. Finally, we introduce the novel Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution fully polarimetric SAR and optical images. Experimental results on the Delta-SN6, BRIGHT, and Wuhan datasets demonstrate that our method outperforms the state of the art by 3.21%, 0.87%, and 1.32% in mIoU, respectively.

2606.08206 2026-06-18 cs.CV cs.LG 版本更新

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

SegmentAnyTreeV2:跨传感器、平台和森林的基于Transformer的树木实例分割扩展

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

发表机构 * Norwegian Institute of Bioeconomy Research (NIBIO)(挪威生物经济研究所(NIBIO))

AI总结 提出SegmentAnyTreeV2,一种传感器和平台无关的森林点云语义与实例分割框架,结合Point Transformer v3骨干网络、轻量语义头和树木交叉注意力掩码解码器,在FOR-instanceV2测试集上达到90.5%精度和80.2%召回率,并展现出强跨域泛化能力。

Comments 25 pages, 6 figures, 10 tables, Corrected bibliography metadata and minor typographical issues; results unchanged

详情
AI中文摘要

我们提出SegmentAnyTreeV2,一种传感器和平台无关的森林点云语义与实例分割框架。该模型结合了基于序列化的Point Transformer v3骨干网络、轻量级语义头以及专注于树木的交叉注意力掩码解码器。语义预测将实例解码限制在树木类体素上,而实例感知的查询初始化、一对多种子监督和非对称掩码评分改善了密集和结构复杂林分中的分离效果。我们进一步引入了FOR-instance v3,一个扩展的基准数据集,包含427个场景和26,496棵标注树木,涵盖不同生物群落、森林结构和LiDAR平台。在FOR-instanceV2测试集上,SegmentAnyTreeV2实现了90.5%的精度、80.2%的召回率、85.0%的F1分数、90.7%的覆盖率和87.6%的语义mIoU,在实例检测和掩码完整性方面均优于以往基于学习的方法。在独立站点上的零样本评估进一步证明了其强大的跨域泛化能力。

英文摘要

We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
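As a quick consistency check on the reported instance metrics, the F1 score can be recomputed from precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract: 90.5% precision, 80.2% recall.
f1 = f1_score(90.5, 80.2)
print(round(f1, 1))
```

The result rounds to 85.0, which agrees with the F1 value stated in the abstract.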

5. 视频理解与时序视觉 8 篇

2606.18441 2026-06-18 cs.CV 新提交

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出无时间标注的过程级奖励框架CF-GRPO,通过视频内在线索构建一致性帧先验,并利用一致性帧奖励优化模型帧使用与先验的对齐,提升视频推理性能。

详情
AI中文摘要

强化学习提升了大型语言模型的推理能力,但将仅结果奖励应用于视频多模态大语言模型(Video-MLLMs)时,对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发(其中一致的线索可以增强感知估计的显著性和可靠性),我们引入了一致性帧GRPO(CF-GRPO),一种无需时间标注的过程级奖励框架,用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验,包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后,它从视觉和响应表示中计算模型侧的帧使用分数,并通过一致性帧奖励(CFR)优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化,CFR提供了高对比度的奖励信号,无需人工时间标注。实验表明,VideoCFR在复杂视频推理基准上取得了有竞争力的性能,并在多个指标上优于代表性的Video-MLLM和RL基线,同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见:https://github.com/1Pansy/VideoCFR。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18558 2026-06-18 cs.CV 新提交

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

MolmoMotion: 基于语言指令的3D点轨迹预测

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

发表机构 * Allen Institute for AI(艾伦人工智能研究所) University of Washington(华盛顿大学) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出一种基于语言指令的3D点运动预测方法,通过构建大规模数据集和基准,实现类无关、视角稳定的运动轨迹预测,并在机器人操作和视频生成中验证其有效性。

详情
AI中文摘要

运动预测是视觉智能的核心:智能体必须预测物体如何运动,以规划行动、推理物理交互并合成逼真的未来场景。我们认为,世界坐标系中的3D点提供了一种通用表示,具有类无关、视角稳定、紧凑且对下游任务直接有用的特性。我们形式化了目标条件3D点运动预测任务:给定一段短视觉历史、目标物体上的一组3D查询点以及预期目标的语言描述,模型预测每个点的未来3D轨迹。我们引入了一个完整的堆栈来大规模研究此任务:(1) MolmoMotion-1M是一个大型语料库,包含从116万无约束视频中标注的动作描述、物体锚定的3D点轨迹;(2) PointMotionBench是一个人工验证的基准,涵盖111个物体类别和61种运动类型;(3) MolmoMotion是一个通用运动预测模型,支持自回归坐标预测和基于流匹配的轨迹生成。MolmoMotion能准确预测不同语言指令下的多样运动模式,并在PointMotionBench上显著优于现有运动预测基线。最后,我们展示了学习到的3D运动先验能很好地迁移到下游应用:它提高了机器人操作的训练效率和泛化能力,其预测轨迹为生成模型提供了有效的运动指导,以合成具有更真实物体运动的视频。

英文摘要

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18586 2026-06-18 cs.CV cs.AI 新提交

APT: Atomic Physical Transitions for Causal Video-Language Understanding

APT: 用于因果视频语言理解的原子物理转变

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Dolby Laboratories(杜比实验室)

AI总结 提出原子物理转变(APT)作为视频中因果状态变化的显式表示,并构建混合来源数据集,通过APT-Tune微调方法使VLM学习物理转变而不遗忘事件级知识。

详情
AI中文摘要

物理事件不仅通过其名称来理解,还通过组成它们的因果状态变化来理解。诸如“弹跳”之类的片段级标签可能是正确的,但同时隐藏了使事件在物理上有效的过程,从支撑丧失和接触开始到反弹和稳定。为了使这一隐藏过程显式化,我们引入了原子物理转变(APT):最小的、时间局部化的状态变化,将可见线索与活跃的物理机制以及前后动力学状态联系起来。APT链将视频表示为有序的因果转变序列,而不是单个聚合事件标签:事件标签说明发生了什么;APT链解释为什么会发生。为了使VLM能够学习APT,我们从人工标注和模拟器真实数据构建了混合来源的APT数据,涵盖接触、重力、摩擦和旋转/稳定性中的14种转变类型,包含1,246个试验中的27,303个计时实例。利用这些数据,我们发现当前的VLM在转变级物理理解上存在不足,零样本召回率最多为14%,错误主要由遗漏的转变主导。直接在APT链上进行微调可以改善转变检测,但会导致事件级遗忘,表明模型学习的是专门的答案格式,而不是可复用的物理表示。因此,我们提出了APT-Tune,一种参数高效的方案,教会VLM使用因果转变而不遗忘如何回答视频问题。它结合了图像填充感知监督、格式条件协同训练和机制条件域到类型解码,使APT学习具有格式鲁棒性和物理基础。在Qwen3-VL-2B上仅使用11M LoRA参数,APT-Tune显著提高了APT召回率,同时改善了事件级视频迁移。这些结果表明,APT不是一种新的答案格式,而是一种用于物理视频理解的人类对齐的因果监督信号。

英文摘要

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.19062 2026-06-18 cs.CV 新提交

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

发表机构 * Sejong University(世宗大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) Ulsan National Institute of Science and Technology(蔚山科学技术院)

AI总结 提出DREAM模型,通过双路径表示增强与对齐,结合层级视觉编码器和混合语言建模,在视频检索任务中实现新SOTA。

详情
AI中文摘要

在当今媒体驱动的世界中,视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射,限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐,但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM:双路径表示增强与对齐模型,一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略,结合掩码和排列语言建模目标,以捕捉局部和全局语言语义。在视觉方面,我们设计了一个具有级联组注意力的层级视觉编码器,通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM,分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性,并为推进跨模态表示学习的未来研究铺平了道路。

英文摘要

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
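For readers unfamiliar with the retrieval metric, the sketch below shows how a Recall@K score is typically computed from a text-to-video similarity matrix. It assumes that "R1" in the abstract denotes Recall@1 and that query `i` has video `i` as its single ground-truth match; the similarity values and the function `recall_at_k` are hypothetical.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int = 1) -> float:
    """Fraction of queries whose ground-truth video (same index) is ranked in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Hypothetical similarities: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(r1, r2)
```

Here the second query ranks the wrong video first, so Recall@1 is two thirds while Recall@2 reaches 1.0.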

2606.18732 2026-06-18 cs.LG cs.CV 交叉投稿

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

低成本神经形态跌倒检测:使用合成事件数据和混合SNN

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

发表机构 * School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Chile(瓦尔帕莱索天主教大学电气工程学院)

AI总结 提出混合SNN-CNN模型,从智能手机视频合成事件相机数据,实现高效准确的跌倒检测。

Comments 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

详情
AI中文摘要

本工作提出了混合模型,将脉冲神经网络(SNN)与卷积神经网络(CNN)组件集成,以从传统智能手机视频生成的模拟事件相机数据(动态视觉传感器,DVS)中学习。主要针对人类跌倒检测,该方法通过将视频帧转换为事件数据,利用SNN的能效和时空处理能力。通过多个数据集上的模拟评估所提出的模型,并将其性能与传统机器学习模型进行比较。结果表明,在不牺牲准确性的情况下显著提高了效率,强调了将SNN和DVS技术结合用于现实环境中复杂任务的潜力。

英文摘要

This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.

2604.22476 2026-06-18 cs.CV cs.LG 版本更新

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

全神贯注于工作流:从视频流中自动高效发现事件

Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting

发表机构 * Chair of Process and Data Science, RWTH Aachen University(过程与数据科学教授席位,亚琛工业大学) Artificial Intelligence & Machine Learning Lab, Technical University of Darmstadt(人工智能与机器学习实验室,达姆施塔特工业大学)

AI总结 提出SnapLog方法,利用图像嵌入和帧间相似矩阵进行时间分割,结合广义少样本分类从视频中提取事件数据,生成可解释的带标签时间戳帧序列。

Comments 18 pages, 6 figures, 1 table, 27 references

详情
AI中文摘要

业务流程管理和流程挖掘等学科通过基于记录的事件数据发现流程见解来帮助组织。然而,流程分析的一个障碍是数据多模态性:例如,视频形式的数据不能直接解释为事件。现有方法依赖于活动标签字典作为输入,无法提供逐帧标签解释,或依赖于过时的计算机视觉技术。在这项工作中,我们提出了SnapLog,一种通过使用图像嵌入将帧转换为特征向量,并通过帧间相似矩阵进行时间分割来从视频中提取事件数据的方法。然后使用广义少样本分类为视频片段分配标签,生成可解释为事件的带标签、时间戳的子帧序列。传统的流程挖掘技术可用于分析结果数据。我们表明,我们的方法生成的日志准确反映了视频中的流程。

英文摘要

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity labels as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
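The temporal segmentation step can be pictured with a minimal sketch. It assumes that frames are already embedded as vectors, that similarity is cosine similarity, and that a segment boundary is placed wherever consecutive frames fall below a fixed threshold. The threshold, the toy embeddings, and the function names are hypothetical simplifications and do not reproduce SnapLog's actual procedure.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(embs: List[List[float]]) -> List[List[float]]:
    """Frame-wise similarity matrix: entry [i][j] compares frame i with frame j."""
    return [[cosine(a, b) for b in embs] for a in embs]


def segment(embs: List[List[float]], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Split frames into (first, last) index ranges at low-similarity transitions."""
    sim = similarity_matrix(embs)
    segments: List[Tuple[int, int]] = []
    start = 0
    for t in range(1, len(embs)):
        if sim[t - 1][t] < threshold:  # consecutive frames differ: close the segment
            segments.append((start, t - 1))
            start = t
    if embs:
        segments.append((start, len(embs) - 1))
    return segments


# Hypothetical 2-D embeddings: two frames of one activity, then three of another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.1, 1.0]]
segments = segment(frames)
print(segments)
```

Each resulting index range would then be passed to a classifier to receive an activity label and a timestamp, yielding one event per segment.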

2606.06926 2026-06-18 cs.CV cs.MM 版本更新

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

SVHighlights: 迈向极长体育视频精彩片段检测

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

发表机构 * Ulsan National Institute of Science and Technology(蔚山科学技术院)

AI总结 针对现有方法无法处理超长视频精彩片段检测的问题,提出首个基准SVHighlights(包含320个平均时长2小时的体育视频)以及无训练的分段方法TF-SELECTOR,通过大语言模型融合多模态信息预测片段级显著性分数,在多个指标上超越现有基线。

Comments Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/

详情
AI中文摘要

尽管长视频的精彩片段检测具有重要的实际意义,但现有方法大多局限于短视频内容,这主要是由于缺乏合适的基准。为了填补这一空白,我们引入了SVHighlights,据我们所知,这是首个针对极长体育视频(每段时长超过一小时,涵盖多种体育类别)精彩片段检测的基准。SVHighlights是通过一个数据集生成流水线,从完整体育视频及其对应的官方精彩片段视频对构建而成,无需传统的逐片段显著性标注即可实现可扩展的标签生成。该基准包含320个视频,平均时长2.00小时,总时长640.18小时,显著超过以往的数据集。现有方法在长视频上也面临根本性挑战:在短视频片段上训练的模型无法泛化到小时级内容,并且它们的片段级评分缺乏识别精彩片段所需的更广泛上下文。为了解决这一问题并提供一个强基线,我们提出了TF-SELECTOR,一种无需训练的基于分段的方法,该方法通过合并相邻的具有相同语义内容的镜头,将每个视频划分为上下文感知的分段,并使用多模态输入(包括视觉描述、转录文本和音频音量)的大语言模型预测分段级显著性分数。实验表明,与视频时间定位(VTG)微调的基线相比,TF-SELECTOR在大多数指标上取得了更优的性能,在HIT@1上提升+2.50,在HIT@K上提升+4.04,在IoU上提升+2.95。这些结果确立了SVHighlights作为长视频精彩片段检测的具有挑战性的测试平台,并证明了简单的基于分段的策略可以有效地扩展到小时级视频。

英文摘要

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
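The segment construction step, merging adjacent shots that share the same semantic content, can be sketched as follows. The sketch assumes each shot already carries a semantic label and that equality of labels is the merge criterion; the labels, timestamps, and the function `merge_adjacent_shots` are hypothetical, and the abstract does not specify how semantic content is actually compared.

```python
from typing import List, Tuple

Shot = Tuple[int, int, str]  # (start_sec, end_sec, semantic label)


def merge_adjacent_shots(shots: List[Shot]) -> List[Shot]:
    """Merge consecutive shots with an identical label into one longer segment."""
    segments: List[Shot] = []
    for start, end, label in shots:
        if segments and segments[-1][2] == label:
            prev_start = segments[-1][0]
            segments[-1] = (prev_start, end, label)  # extend the open segment
        else:
            segments.append((start, end, label))
    return segments


# Hypothetical shot list for a short stretch of a match, sorted by time.
shots = [
    (0, 12, "warmup"),
    (12, 30, "warmup"),
    (30, 45, "goal"),
    (45, 80, "replay"),
    (80, 95, "replay"),
]
segments = merge_adjacent_shots(shots)
print(segments)
```

Scoring a handful of merged segments instead of every short clip is what would let a language model see enough context per decision on hour-long videos.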

2606.15632 2026-06-18 cs.CV 版本更新

Open-World Video Segmentation

开放世界视频分割

Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

发表机构 * University of Connecticut(康涅狄格大学)

AI总结 提出Savvy系统,结合分层掩码发现、延迟接纳和轨迹整合,实现零样本开放世界长时视频分割;并设计粒度感知评估套件OGA,采用n:1匹配协议,解决传统1:1匹配对开放世界方法的不公平惩罚问题。

详情
AI中文摘要

尽管视频分割在短片段和封闭集基准上取得了快速进展,但开放世界视频分割仍然在很大程度上未被探索。挑战有两方面:(1)现有方法不支持在动态自我运动的长视频中进行对象发现和身份维护;(2)现有评估协议依赖于严格的1:1匹配,不公平地惩罚了具有不匹配粒度的语义有效预测。为了解决这两个问题,我们引入了Savvy,一个实用且强大的零样本开放世界长时视频分割系统。Savvy结合了分层掩码发现、延迟接纳和轨迹整合,以支持持久对象发现、安全轨迹提升和稳定的长距离身份维护。我们进一步提出了OGA,一个用于开放世界视频分割的粒度感知评估套件。基于粒度无关(GA)匹配协议,OGA将传统的1:1匹配放宽为n:1映射,但通过断点检测支持不连续性并通过对每个参考对象的优势连贯片段进行评分来强制执行时间严谨性。这防止了碎片化或闪烁的支持被过度奖励,同时实现了GA适应的指标和结构诊断:身份持久性(IP)和身份集中性(IC)。在VIPSeg上,我们展示了标准的1:1评估严重低估了开放世界方法,而GA评估恢复了许多被抑制的性能。在更现实的长时基准ScanNet和HM3D上,Savvy在经典指标和提出的指标(包括STQ、VPQ$_\infty$、IP和IC)上始终优于强基线。这些结果共同为开放世界长时视频分割建立了一个实用的基准和一个强基线。

英文摘要

While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.
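The difference between 1:1 and n:1 matching can be made concrete with a single-frame toy example. The sketch assumes masks are sets of pixel indices, assigns each prediction to the reference object that contains most of it, and scores the union of the assigned predictions by IoU. The overlap threshold and function names are hypothetical, and the temporal parts of OGA (sever points, dominant coherent fragments) are deliberately left out.

```python
from typing import Dict, List, Set


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over union of two pixel sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_n_to_1(preds: List[Set[int]], refs: Dict[str, Set[int]],
                 min_overlap: float = 0.5) -> Dict[str, float]:
    """Let several predictions support one reference, then score their union."""
    groups: Dict[str, Set[int]] = {name: set() for name in refs}
    for pred in preds:
        best, best_frac = None, 0.0
        for name, ref in refs.items():
            frac = len(pred & ref) / len(pred) if pred else 0.0
            if frac > best_frac:
                best, best_frac = name, frac
        if best is not None and best_frac >= min_overlap:
            groups[best] |= pred
    return {name: iou(groups[name], refs[name]) for name in refs}


# One annotated object, predicted as two finer-grained parts.
refs = {"chair": set(range(0, 10))}
preds = [set(range(0, 5)), set(range(5, 10))]

one_to_one = max(iou(p, refs["chair"]) for p in preds)
n_to_one = match_n_to_1(preds, refs)["chair"]
print(one_to_one, n_to_one)
```

Under 1:1 matching the best single part only covers half of the object, while the n:1 mapping credits both parts together, which is the kind of granularity mismatch the abstract says rigid matching penalizes.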

6. 生成式视觉与世界模型 21 篇

2606.18478 2026-06-18 cs.CV 新提交

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

数据强制蒸馏:恢复少步视频生成中的多样性和保真度

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

发表机构 * University of Michigan(密歇根大学) NVIDIA(英伟达) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对分布匹配蒸馏(DMD)在少步视频生成中出现的模式坍塌和过饱和问题,提出数据强制蒸馏(DFD)框架,通过教师评分差异引导学生接近真实数据分布,仅需一行代码修改即可恢复多样性和保真度。

详情
AI中文摘要

最近的进展表明,将多步视频扩散模型蒸馏为高效的少步学生模型具有前景。其中,分布匹配蒸馏(DMD)及其后继DMD2实现了强大的生成质量和快速收敛。然而,由于反向KL目标的性质,这些方法表现出两个持续的失败模式:样本多样性大幅下降,以及明显过饱和的输出偏离真实视频外观。在这项工作中,我们提出了数据强制蒸馏(DFD),一个简单的训练后框架,通过仅一行代码更改即可恢复DMD中的多样性和保真度。其核心是教师评分差异,用于引导学生朝向真实数据分布,将其拉向缺失的模式(缓解模式坍塌)并远离真实数据中不存在的问题模式(避免过饱和)。我们提供了框架的深入理论分析,并在文本到视频、图像到视频和自回归视频生成上验证了我们的方法。仅需100-300步微调,DFD就能有效恢复Wan2.1-1.3B和Cosmos-Predict2.5-2B模型上的多样性和保真度,解决过饱和伪影,显著改善视频动态和外观,甚至优于教师模型。

英文摘要

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieve strong generation quality and fast convergence. However, due to the nature of the reverse Kullback--Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100--300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforms the teacher model.
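For context on why a reverse KL objective tends to lose diversity, the standard definition is written out below. The symbols $p_\theta$ (student distribution) and $p_T$ (teacher distribution) are notation chosen here for illustration and are not taken from the paper.

```latex
% Reverse KL divergence: the expectation is taken over student samples.
D_{\mathrm{KL}}\left(p_\theta \,\|\, p_T\right)
  = \mathbb{E}_{x \sim p_\theta}\left[\log p_\theta(x) - \log p_T(x)\right]
```

Because the expectation runs over samples from the student, regions where the teacher has mass but the student places none contribute nothing to the loss, so dropping a mode goes unpenalized. This is the usual mode-seeking explanation for the diversity loss that the abstract attributes to the reverse KL objective.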

2606.18591 2026-06-18 cs.CV 新提交

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

桥接创意意图与视觉质量:基于创作者驱动的循环视频生成与代理反馈循环

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

发表机构 * University of California, Davis(加州大学戴维斯分校) The Harker School(哈克学校) Basis Independent Silicon Valley(硅谷贝斯独立学校) Saratoga High(萨拉托加高中)

AI总结 提出CHIEF框架,通过人类-AI协作的迭代视频精炼,结合创作者驱动和代理主观反馈,提升长视频的叙事连贯性与创意方向。

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

详情
AI中文摘要

生成式AI使内容创作日益普及,但许多AI生成的视频缺乏叙事连贯性和创意方向,尤其在较长时长时问题更为突出。在编码领域,AI生成受益于可靠的反馈和循环自我改进等技术;与之不同,视频生成需要关于情节、场景和叙事的主观反馈,这自然激发了融入人类创意方向的方法。我们提出了CHIEF,一个人类-AI协同创作视频生成框架,将创作者置于人机循环迭代视频精炼的中心,并通过提供自动主观反馈来支持他们。创作者通过驱动每次迭代来融入其创意方向,而他们的修订则由专门的精炼代理整合。反馈循环由基于角色条件的多模态LLM生成,这些LLM观看生成的视频并从观众角度产生主观批评,提供仅靠自我评估无法捕捉的反馈。为测试我们提出框架的有效性,我们与没有电影制作经验的高中生和大学生合作,创作从1分钟短视频到具有复杂情节的完整10分钟短片的视频。

英文摘要

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement, and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from the audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete short 10-minute film with a complicated plot.

2606.18702 2026-06-18 cs.CV 新提交

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

UniTemp: 通过双向蒸馏实现任意时间顺序的视频生成

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

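Affiliations * University of Wisconsin Madison; Adobe Research; University of California Los Angeles; University of California Davis

AI summary Proposes UniTemp, which uses bidirectional distillation to train a single autoregressive model that generates video in any temporal direction (forward, backward, or in-between), resolves the discontinuities that a causal 3D VAE causes in backward generation, and improves controllability.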

详情
AI中文摘要

自回归视频扩散模型已成为长视频生成的一种有前景的方法,在流式设置中表现出色。然而,现有方法仅限于前向时间生成,而实际视频创作通常需要灵活的生成顺序,例如,基于未来上下文进行后向扩展,或基于过去和未来上下文进行中间插值生成。我们通过训练一个支持任意时间方向生成的自回归模型来弥合这一差距。一个关键的技术挑战来自视频扩散模型中广泛使用的因果3D VAE,它编码的潜变量严格依赖于过去上下文。虽然这种因果结构适合前向生成,但在后向生成时会导致块间不连续性。为了解决这个问题,我们引入了块级锚点潜变量,这是一组辅助潜变量,用于在后向生成过程中恢复块边界处缺失的过去上下文。基于这一设计,我们提出了UniTemp,一个双向蒸馏框架,训练单个自回归学生模型用于任意方向的视频生成。在推理时,UniTemp可以基于任意过去和/或未来帧进行条件生成,提高了双向和中间插值生成的可控性。实验表明,与仅前向方法相比,UniTemp在短和长视频生成上保持了竞争性能,同时支持多种工作流程,如双向视频扩展、中间插值生成、循环视频生成、场景转换和视觉故事生成。项目网站:此 https URL

英文摘要

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18765 2026-06-18 cs.CV 新提交

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT:流匹配DiT的时间步条件谱残差校正

Jiayu Tian

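Affiliations * Peking University

AI summary Proposes SpectralDiT, a timestep-conditioned spectral residual correction module that improves flow-matching DiT generation quality on CIFAR-10 and ImageNet-100 with very little added compute and few extra parameters, lowering FID by 5.1% and 8.7% respectively.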

详情
AI中文摘要

我们提出SpectralDiT,一种对流匹配扩散变换器(Diffusion Transformers)的轻量级修改,它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量,然后学习一个零初始化的加法门,使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中,SpectralDiT在补丁大小为1时将FID从20.78提升至19.71,并缩小了径向傅里叶谱差距。此外,我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下,SpectralDiT改进了潜在流匹配,在无分类器引导(CFG 2.0)下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.
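
The mechanism described above (split a residual update into low- and high-frequency parts on the patch-token grid, then add a gated correction that starts at zero) can be sketched in NumPy. This is a minimal illustration, not the paper's module: the radial cutoff, the toy sin/cos timestep embedding, and the names `split_low_high`, `spectral_correction`, `w_gate`, and `b_gate` are all assumptions made for the example.

```python
import numpy as np

def split_low_high(residual, cutoff=0.25):
    """Split an (H, W, C) residual on the patch grid into low/high bands.

    Assumption: a hard radial mask in normalized frequency; the paper's
    actual decomposition is not given in the abstract.
    """
    h, w, _ = residual.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = (np.sqrt(fx**2 + fy**2) <= cutoff)[..., None]
    spec = np.fft.fft2(residual, axes=(0, 1))
    low = np.fft.ifft2(spec * low_mask, axes=(0, 1)).real
    high = residual - low  # the two bands sum back to the input by construction
    return low, high

def spectral_correction(residual, t, w_gate, b_gate):
    """Return residual + g_low(t) * low + g_high(t) * high."""
    low, high = split_low_high(residual)
    t_feat = np.array([np.sin(t), np.cos(t)])  # toy timestep embedding (assumed)
    gates = t_feat @ w_gate + b_gate           # shape (2,): one gate per band
    return residual + gates[0] * low + gates[1] * high

rng = np.random.default_rng(0)
residual = rng.normal(size=(8, 8, 4))
# Zero-initialized gate parameters: the correction vanishes at initialization.
w_gate, b_gate = np.zeros((2, 2)), np.zeros(2)
out = spectral_correction(residual, t=0.3, w_gate=w_gate, b_gate=b_gate)
```

With the gate parameters at zero, `out` equals `residual`, which mirrors the abstract's statement that the model initially matches the baseline DiT. Training would then move the gates away from zero per band and per timestep.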

2606.18788 2026-06-18 cs.CV cs.CL 新提交

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

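Affiliations * Beijing Institute of Technology; Beijing Academy of Artificial Intelligence

AI summary Proposes HandwritingAgent, which uses a large reasoning model to autoregressively generate handwriting stroke sequences in SVG format without style-specific training. Style is controlled through natural language and a reference image, and the agent matches or surpasses state-of-the-art methods on imitation, recognition, multilingual synthesis, and complex mathematical expression synthesis.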

详情
AI中文摘要

教会机器模仿自然手写风格仍然是一个开放挑战,因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间,而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而,这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此,我们引入了HandwritingAgent,一个语言驱动的智能体,它可以直接在可缩放矢量图形(SVG)格式中合成自然手写序列,无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明,性能有显著提升,HandwritingAgent匹配或超越了最先进的生成式手写模型,同时提供了一种更高效、可控且泛化能力更强的合成方法。

英文摘要

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

2606.18906 2026-06-18 cs.CV 新提交

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

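Affiliations * Sookmyung Women’s University; Yonsei University; Samsung Research

AI summary Targets semantic blending and object duplication in multi-object image editing. BindEdit suppresses attention leakage within a single diffusion trajectory through joint regularization of cross- and self-attention, a cross-attention re-balancing mechanism, and a region fidelity term, enabling precise edits.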

Comments Preprint

详情
AI中文摘要

真实图像编辑能够精确操作视觉内容,但现有方法在复杂的多目标场景中常常失败,导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏,即在去噪过程中,跨空间区域和文本标记的信号变得纠缠。具体来说,我们识别出两种不同形式的泄漏:编辑-标记泄漏,其中模糊的标记-区域对齐导致对象混合;以及源主导泄漏,其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏,我们提出了\textbf{BindEdit},它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏,BindEdit联合正则化交叉注意力和自注意力,使得每个目标标记组绑定到其对应的空间区域,同时保持实例级别的分离。为了抑制源主导泄漏,一种交叉注意力重平衡机制放大目标标记的影响,并减弱可编辑区域内残留的源语义。此外,区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外,我们提出了一个全面的多目标基准,涵盖不同的对象数量和类别。大量实验表明,BindEdit在单次扩散轨迹内始终优于现有方法,在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.

2606.19073 2026-06-18 cs.CV 新提交

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑:认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

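Affiliations * Wangxuan Institute of Computer Technology, Peking University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China

AI summary Proposes the HOI-Edit benchmark and the SCPE framework, which uses the temporal generation ability of I2V models to edit dynamic human-object interactions and iteratively refines prompts through self-correction, reaching performance competitive with SOTA models.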

详情
AI中文摘要

当前的图像编辑方法在静态属性上表现出色,但在复杂的人-物交互(HOI)上失败,这是一个关键挑战,现有基准将HOI与静态属性混淆,依赖无法同时评估动态交互有效性和纠缠的人-物对保留的全局指标。因此,我们首先引入HOI-Edit,一个包含三个渐进认知层次的综合基准,其特点是自动化指标HOI-Eval,通过让VLM在思考后对包含基础人-物对的图像进行问答,可靠地评估实例级交互。考虑到任务本质是重塑动态关系,我们对图像到视频(I2V)模型进行基准测试,发现它们由于其时间生成能力而天生适合动态编辑。关键的是,除了优越的性能,这种能力提供了“失败过程的重放”,为错误原因提供了独特的可诊断性。因此,我们提出SCPE(自校正过程编辑),一种新颖的智能体自校正框架,通过迭代优化的提示约束I2V模型的生成,使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上,SCPE在交互上达到了与最先进(SOTA)编辑模型(如Nano Banana)竞争的性能。代码可在该https URL获取。

英文摘要

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this critical challenge unaddressed: they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess the validity of a dynamic interaction and the preservation of the entangled human-object pair. We therefore first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, which reliably evaluates instance-level interaction by having a VLM answer questions after thinking with images that contain grounded human-object pairs. Because the task is essentially one of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models and find them inherently suited to dynamic editing thanks to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, so that the generated videos present the target HOI more accurately. Frames extracted from these videos are the final editing results. On HOI-Edit, SCPE achieves interaction performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.

2606.19103 2026-06-18 cs.CV cs.AI 新提交

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency:通过SFT和RL改进基于指令的图像编辑中的产品身份保持

Mukund Khanna, Raj Singh Yadav, Kunal Singh

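Affiliations * Fractal Analytics

AI summary Addresses weak preservation of product features in instruction-based image editing by introducing the ProductConsistency dataset and a Cyclic Consistency reward. Combining supervised fine-tuning with reinforcement learning improves product consistency, text rendering, and visual quality.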

Comments CVPR HiGen 2026

详情
AI中文摘要

近期基于指令的图像编辑的进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而,在以产品为中心的场景中,保留产品特征、品牌和文本元素至关重要,当前的开源和闭源模型往往难以维持这种细粒度的对象身份。这一问题因缺乏具有文本保真度约束的基于指令的产品图像编辑数据集而进一步加剧,导致该能力在很大程度上被视为基于指令的图像编辑模型的隐式能力。在这项工作中,我们引入了ProductConsistency数据集,旨在改进以产品为中心的图像编辑。我们的方法包括一个用于产品编辑的包含87k样本的监督微调(SFT)数据集、一个包含869张独特产品图像的强化学习(RL)数据集,以及一个新的基准数据集ProductConsistency Benchmark,以允许对编辑模型进行严格和标准化的评估。为了指导RL训练,我们提出了一种循环一致性奖励,通过使用原始产品描述与从编辑图像生成的描述之间的字幕相似性来强制保持产品身份的语义。我们使用我们的数据集对Qwen-Image-Edit-2511和Flux.1-Kontext-dev进行了微调,并在OCR和感知指标以及基于MLLM的评估中展示了相对于基线模型的一致改进,表明更强的产品一致性、文本渲染和整体视觉质量;其中Qwen-Image-Edit-2511模型实现了字符错误率降低5倍。代码和流程可在此https URL获取。

英文摘要

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
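
The character error rate mentioned above is a standard OCR metric: the edit distance between the reference text and the recognized text, divided by the reference length. A minimal sketch follows. The paper's OCR pipeline is not described in the abstract, so the example strings and function names here are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical label text on a product, and OCR output read from an edited image.
label = "ACME COLA"
ocr_output = "ACNE C0LA"  # two substituted characters
cer = character_error_rate(label, ocr_output)
```

Under this definition, a 5x reduction means the tuned model's CER is one fifth of the baseline's on the same benchmark text.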

2606.19195 2026-06-18 cs.CV 新提交

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius: 0.2B轻量级图像修复框架,性能达10B级别

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

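Affiliations * Huazhong University of Science and Technology; VIVO AI Lab

AI summary Proposes Moebius, a lightweight image inpainting framework that pairs a Local-λ Mix Interaction block with an adaptive multi-granularity distillation strategy. With 0.22B parameters it matches or exceeds the generation quality of the 10B-level model FLUX.1-Fill-Dev and runs inference more than 15× faster.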

详情
AI中文摘要

尽管10B级别的工业基础模型推动了图像修复的边界,但其高昂的计算成本严重阻碍了实际部署。构建高度优化的任务特定专家模型是一个有前景的解决方案,然而极端的结构压缩不可避免地引发了严重的表示瓶颈。为解决这一问题,我们提出了Moebius,一个高效的轻量级修复框架。我们通过引入局部-λ混合交互($L\lambda MI$)模块系统地重构了扩散主干。该模块由局部-λ和交互-λ子模块组成,巧妙地将空间上下文和全局语义先验总结为固定大小的线性矩阵,在保留复杂潜在交互的同时大幅减少参数。此外,为了释放这种高度紧凑架构的全部表示能力,我们将其与自适应多粒度蒸馏策略协同配对。该策略严格在潜在空间内操作以避免昂贵的像素空间解码,动态平衡多个基于梯度的损失以实现高保真对齐。在自然和肖像基准上的大量实验表明,这种最优协同使Moebius能够媲美甚至超越10B级工业通用模型FLUX.1-Fill-Dev的生成质量。值得注意的是,Moebius仅使用不到2%的参数(0.22B vs. 11.9B)就实现了这一点,同时总推理时间加速超过15倍,为高保真修复设立了新的效率标准。项目页面见此https URL。

英文摘要

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a more than 15× acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

2606.19162 2026-06-18 cs.LG cs.CV 交叉投稿

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

奖励一直就在你的数据中:用判别器引导的强化学习纠正流匹配

Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

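Affiliations * FAIR at Meta; Columbia University; McGill University; Canada CIFAR AI Chair

AI summary Flow-matching models show visual defects because the matching loss is poorly aligned with sample quality. Discriminator-Guided RL (DRL) uses the logit of a discriminator in a pretrained representation space as the reward, lowering guidance-free FID and semantic-space FD and improving preference alignment.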

Comments 84 pages, including appendices

详情
AI中文摘要

得分匹配和流匹配模型通常依赖基于偏好的强化学习来实现两个目的:与主观偏好对齐,以及令人惊讶地恢复视觉真实性和连贯对象结构等属性——而这些属性本应通过匹配训练从数据本身学习。我们认为这反映了结构上的不匹配。匹配损失衡量训练时边缘分布下速度或得分场的$\ell_2$回归误差,这一代理指标与决定推理时样本质量的视觉和语义属性对齐不良。给定一个与这些属性对齐的奖励,强化学习通过评估模型自身生成的样本并直接遵循奖励景观来规避不匹配。挑战在于如何在不依赖人类偏好的情况下获得这样的奖励,因为人类偏好昂贵且会将数据真实性与标注者倾向混为一谈。我们提出判别器引导的强化学习(DRL)。DRL训练一个判别器,在预训练表示空间中区分数据样本和基础模型样本,并将其logit作为KL正则化强化学习中的奖励。预训练空间将判别器限制在感知有意义的方向上,而logit估计数据与模型之间的对数似然比,这是针对数据分布的最优奖励。在SiT、JiT、REPA和RAE上,DRL降低了无引导FID(例如,SiT上从9.38降至2.62)和语义空间FD(例如,SiT上DINOv3从88.2降至19.3),在所有骨干网络上均有一致提升,并且在没有经过偏好奖励训练的情况下改善了人类偏好奖励。在后续基于偏好的后训练中,DRL还在偏好奖励与图像保真度之间产生了更好的帕累托前沿,在提高对齐度的同时减少了过饱和和过亮等低级伪影。

英文摘要

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations. We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., $9.38 \to 2.62$ on SiT) and semantic-space FD (e.g., $88.2 \to 19.3$ on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
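
The claim that the discriminator logit is the right reward follows from two standard identities, sketched here. The sketch assumes an optimal discriminator $D^*$ trained on equal numbers of data and model samples; $p_{\text{data}}$, $p_\theta$, $r$, and $\beta$ are notation introduced for this note rather than taken from the paper.

```latex
\begin{align}
D^*(x) &= \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}
  \;\Longrightarrow\;
  \operatorname{logit} D^*(x) = \log\frac{D^*(x)}{1 - D^*(x)}
  = \log\frac{p_{\text{data}}(x)}{p_\theta(x)} \\
\pi^* &= \arg\max_{\pi}\; \mathbb{E}_{x \sim \pi}\!\left[r(x)\right]
  - \beta\, D_{\mathrm{KL}}(\pi \,\|\, p_\theta)
  \;\Longrightarrow\;
  \pi^*(x) \propto p_\theta(x)\, \exp\!\big(r(x)/\beta\big)
\end{align}
```

Substituting $r(x) = \log\big(p_{\text{data}}(x)/p_\theta(x)\big)$ with $\beta = 1$ gives $\pi^*(x) \propto p_{\text{data}}(x)$, so the KL-regularized optimum is the data distribution itself. In practice the learned discriminator only approximates $D^*$, so this holds approximately.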

2606.19325 2026-06-18 cs.SD cs.AI cs.CV 交叉投稿

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

参考驱动的野外先验多说话人音频场景生成

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

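Affiliations * Lightricks; Tel Aviv University

AI summary Proposes ScenA, which conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a natural language prompt to generate multi-speaker audio scenes, uses a high-noise-biased timestep distribution to counter the Reference Shortcut, and outperforms existing systems on the CoVoMix2-Dialogue benchmark.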

Comments Project page at https://finmickey.github.io/scena/

详情
AI中文摘要

现有的多说话人对话系统通过结构化监督(如每轮标签、多流转录或可学习说话人嵌入)将说话人与话语绑定。这些系统在仅语音的流水线中运行,生成干净的语音序列,缺乏真实对话的环境纹理。我们采取不同的方法。我们的方法ScenA将文本到音频流匹配基础模型(在大规模野外数据上预训练)直接以多个参考声音和描述整个多说话人音频场景的自由形式自然语言提示为条件。利用这样的基础模型使我们能够继承其生成自然、非录音室音频的能力:背景噪声、房间声学、重叠对话和自发的副语言事件,同时添加多说话人控制而无需任何每轮结构。具体地,参考潜在向量被连接到模型的令牌序列中,并通过轻量级的身份感知位置编码进行区分。然而,我们识别出这种方法的一个关键障碍:参考捷径。在标准噪声调度下的训练过程中,模型可以通过声学相似性识别匹配的参考与噪声目标,从而完全绕过文本提示。我们通过高噪声偏置的时间步分布来解决这个问题,迫使模型依赖文本提示进行说话人分配。我们在CoVoMix2-Dialogue基准上评估ScenA,结果表明它在说话人绑定指标上优于现有的多说话人系统,同时生成具有重叠语音、情感发声和环境声音的丰富对话音频。我们的结果证明了使用以自由形式场景描述为条件的通用音频模型,而不是通过仅语音流水线传递结构化对话脚本的优势。

英文摘要

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
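
One simple way to bias training timesteps toward high noise is to warp a uniform draw, as sketched below. The abstract does not state which distribution ScenA uses, so the power-law warp, the `bias` parameter, and the convention that t = 1 means pure noise are all assumptions made for illustration.

```python
import random

def sample_timestep(rng: random.Random, bias: float = 3.0) -> float:
    """Draw t in [0, 1], assuming t = 1 is pure noise.

    For bias > 1, u ** (1 / bias) pushes mass toward t = 1, so the noisy
    target more often carries little acoustic detail to match a reference.
    """
    u = rng.random()
    return u ** (1.0 / bias)

rng = random.Random(0)
uniform = [rng.random() for _ in range(10_000)]
biased = [sample_timestep(rng) for _ in range(10_000)]
mean_uniform = sum(uniform) / len(uniform)
mean_biased = sum(biased) / len(biased)
```

For `bias = 3` the expected timestep is 1 / (1 + 1/3) = 0.75, compared with 0.5 for uniform sampling. At those noise levels the target is too corrupted for acoustic matching, which is the condition the abstract says pushes the model to rely on the text prompt.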

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM 版本更新

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

VidCRAFT3: 面向图像到视频生成的相机、物体与光照控制

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

Affiliations * School of Data Science, Fudan University; Shanghai Innovation Institute; Zhejiang University; Huawei Noah’s Ark Lab; Westlake University; School of Data Science and MOE Frontiers Center for Brain Science, Fudan University; Fudan ISTBI–ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University

AI summary Proposes VidCRAFT3, which explicitly models cross-factor interactions among geometry, motion, and illumination to control camera motion, object motion, and lighting direction independently or jointly, achieving state-of-the-art control precision and visual coherence.

Comments Accepted to TVCG 2026

详情
AI中文摘要

可控图像到视频(I2V)生成将参考图像转换为由用户指定控制信号引导的连贯视频。虽然对相机运动、物体运动和光照的精确控制对于高保真创作至关重要,但现有方法通常独立处理这些因素,忽视了动态场景中视角、几何和光照之间的物理耦合,导致同时变化时出现阴影不匹配和透视漂移等视觉不一致问题。我们提出了VidCRAFT3,一个统一且灵活的I2V框架,显式建模几何、运动和光照之间的跨因素交互,实现对相机运动、物体运动和光照方向的独立或联合控制。Image2Cloud提供显式的3D几何先验以实现精确的相机运动控制。ObjMotionNet将稀疏物体轨迹编码为多尺度运动特征,以引导逼真的物体运动。空间三重注意力变压器通过光照交叉注意力整合光照方向,实现一致的重光照。为了解决联合标注数据的稀缺性,我们构建了VideoLightingDirection(VLD)数据集,包含精确的逐帧光照方向标注,并引入三阶段渐进训练策略,使得无需完全联合标注即可实现鲁棒学习。大量实验表明,VidCRAFT3在多种场景下的控制精度和视觉一致性上达到了最先进水平。

英文摘要

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

2510.21615 2026-06-18 cs.CV 版本更新

Epipolar Geometry Improves Video Generation Models

极线几何改进视频生成模型

Orest Kupyn, Théo Uscidda, Marta Tintore Gazulla, Fabian Manhardt, Federico Tombari, Christian Rupprecht

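Affiliations * University of Oxford; Google Research; CREST-ENSAE, Institut Polytechnique de Paris; Technical University of Munich

AI summary Addresses geometric inconsistency and motion artifacts in video generation models with preference optimization based on epipolar geometry constraints, reducing epipolar error by 31% and raising human-rated consistency from 54% to 72% while preserving visual quality.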

详情
AI中文摘要

视频生成模型通过使用整流流技术训练的潜在扩散变换器取得了显著进展。然而,这些模型仍然存在几何不一致、运动不稳定以及破坏逼真3D场景错觉的视觉伪影。3D一致的视频生成可能对生成和重建任务中的众多下游应用产生重大影响。我们探索了极线几何约束如何改进现代视频扩散模型。尽管使用了大量训练数据,这些模型未能捕捉基本的几何原理。我们通过基于偏好的优化,利用成对极线几何约束对齐扩散模型,通过数学上合理的几何约束直接解决不稳定轨迹和几何伪影。我们的方法有效地强制执行几何原理,而不需要端到端的可微性。评估表明,经典的几何约束比现代学习度量提供了更稳定的优化信号。在静态场景和动态相机上的训练确保了度量质量,同时模型泛化到各种动态场景。通过将数据驱动学习与经典计算机视觉相结合,我们将极线误差降低了31%,并将人类评分一致性从54%提高到72%,且不损害视觉质量。

英文摘要

Video generation models have advanced significantly through latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite using massive training data, these models fail to capture fundamental geometric principles. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics. Training on static scenes with dynamic cameras ensures metric quality while the model generalizes to various dynamic scenes. By bridging data-driven learning with classical computer vision, we reduce epipolar error by 31% and improve human-rated consistency from 54% to 72% without compromising visual quality.
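
A pairwise epipolar error of the kind referred to above can be computed from point matches and a fundamental matrix. The sketch below uses the Sampson (first-order) error, which is one standard choice. The abstract does not say which epipolar metric the authors use, and the matrix and matches here are a made-up toy case (pure sideways camera translation with identity intrinsics).

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Sampson epipolar error for matches x1 <-> x2, each of shape (N, 2)."""
    ones = np.ones((x1.shape[0], 1))
    p1 = np.hstack([x1, ones])           # homogeneous points in frame 1
    p2 = np.hstack([x2, ones])           # homogeneous points in frame 2
    Fp1 = p1 @ F.T                       # row i is F @ p1[i]
    Ftp2 = p2 @ F                        # row i is F.T @ p2[i]
    num = np.sum(p2 * Fp1, axis=1) ** 2  # (p2^T F p1)^2
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return num / den

# Toy fundamental matrix for translation along x: matches must keep the same y.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[10.0, 5.0], [10.0, 5.0]])
x2 = np.array([[12.0, 5.0], [12.0, 6.0]])  # second match drifts by one pixel in y
errors = sampson_error(F, x1, x2)
```

A score like this needs no gradient through the video model, which is consistent with the abstract's point that the geometric signal is used for preference-based optimization rather than end-to-end differentiable training.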

2604.03156 2026-06-18 cs.CV 版本更新

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

CAMEO: 一种条件感知与质量驱动的多智能体图像编辑编排器

Yuhan Pu, Hao Zheng, Ziqian Mo, Zirui Pang, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology; Shenzhen University; Claremont McKenna College; Research Institute of Petroleum Exploration and Development, CNPC

AI summary Proposes CAMEO, a multi-agent framework that recasts conditional image editing as a quality-aware, feedback-driven process by decomposing editing into stages and embedding an evaluation loop, raising average win rate by 20% on anomaly insertion and human pose switching.

详情
AI中文摘要

条件图像编辑旨在根据文本提示和可选的参考指导修改源图像。这种编辑在需要严格结构控制的场景中至关重要(例如,驾驶场景中的异常插入和复杂人体姿态变换)。尽管近期大规模编辑模型(如Seedream、Nano Banana等)取得了进展,但大多数方法依赖单步生成。这种范式通常缺乏显式质量控制,可能引入与原始图像的过度偏差,并经常产生结构伪影或环境不一致的修改,通常需要手动调整提示才能获得可接受的结果。我们提出\textbf{CAMEO},一个结构化的多智能体框架,将条件编辑重构为质量感知、反馈驱动的过程,而非一次性生成任务。CAMEO将编辑分解为协调的阶段:规划、结构化提示、假设生成和自适应参考定位,仅在任务复杂度需要时才调用外部指导。为克服现有方法缺乏内在质量控制的不足,评估直接嵌入编辑循环中。通过结构化反馈迭代优化中间结果,形成闭环过程,逐步纠正结构和上下文不一致性。我们在异常插入和人体姿态切换任务上评估CAMEO。在多个强编辑骨干网络和独立评估模型上,CAMEO相比多个最先进模型平均胜率提升20%,展示了在条件图像编辑中更强的鲁棒性、可控性和结构可靠性。

英文摘要

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (e.g., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (e.g., Seedream and Nano Banana), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose CAMEO, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves a 20% higher win rate on average than multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

2605.14877 2026-06-18 cs.CV 版本更新

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

HeatKV:针对视觉自回归建模的头部调制KV缓存压缩

Jonathan Cederlund, Axel Berg, William Isaksson, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

发表机构 * Dept. of Automatic Control, Lund University(隆德大学自动控制系) Arm(Arm公司)

AI总结 本文提出HeatKV方法,通过根据每个头部对先前生成尺度的注意力进行调整,实现更高效的KV缓存压缩,提升内存利用率并保持图像生成质量。

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

详情
AI中文摘要

视觉自回归(VAR)模型最近在保持低延迟的同时展示了出色的图像生成质量。然而,它们受到严重的KV缓存内存限制,通常需要每个生成图像数吉字节的内存。我们引入了HeatKV,一种新的压缩方法,该方法根据每个头部对先前生成尺度的注意力来调整缓存分配。使用一个小的离线校准集,注意力头部根据其在先前尺度上的注意力分数进行排序。基于此排序,我们构建了一个针对给定内存预算定制的静态剪枝计划。应用于Infinity-2B模型时,HeatKV在KV缓存内存分配的压缩比上比现有方法高2倍,同时保持相似或更好的图像保真度、提示对齐度和人类感知分数。我们的方法在VAR模型的KV缓存压缩中达到了新的最先进的水平,展示了细粒度、特定头部的缓存分配的有效性。

英文摘要

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a 2× higher compression ratio in KV-cache memory allocation than existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception scores. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation. Code and calibration script available at https://github.com/arm-research/heatkv.
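
The rank-then-schedule step can be illustrated with a small sketch. It is not taken from the HeatKV code: the function `build_pruning_schedule`, the proportional allocation rule, and the `min_keep` floor are assumptions made for illustration. The abstract only states that heads are ranked by their attention to prior scales and that a static schedule is built for a given memory budget.

```python
def build_pruning_schedule(head_scores, total_budget, min_keep=1):
    """Split a KV-cache token budget across heads in proportion to calibration scores.

    head_scores: per-head attention mass on prior scales (hypothetical
                 calibration statistic, any non-negative numbers).
    total_budget: total number of cached tokens allowed across all heads.
    """
    n = len(head_scores)
    if total_budget < n * min_keep:
        raise ValueError("budget too small for the per-head floor")
    # Rank heads from most to least reliant on previously generated scales.
    ranking = sorted(range(n), key=lambda h: head_scores[h], reverse=True)
    spare = total_budget - n * min_keep
    total_score = sum(head_scores) or 1.0
    # Proportional share, rounded down, on top of the per-head floor.
    budget = [min_keep + int(spare * head_scores[h] / total_score) for h in range(n)]
    # Hand rounding leftovers to the top-ranked heads first.
    leftover = total_budget - sum(budget)
    for i in range(leftover):
        budget[ranking[i % n]] += 1
    return ranking, budget


# Scores given as percentages of attention mass for four heads.
ranking, budget = build_pruning_schedule([50, 10, 30, 10], total_budget=25)
```

Because the schedule depends only on offline calibration scores and the budget, it can be computed once and reused for every generated image, which is what makes it static.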

2605.15824 2026-06-18 cs.CV 版本更新

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

FashionChameleon:迈向实时和交互式的人体服装视频定制

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

发表机构 * Xiamen University(厦门大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出FashionChameleon框架,通过单件服装视频数据实现交互式多服装视频定制,保留动作一致性,实现实时生成23.8FPS,比现有方法快30-180倍。

Comments Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

详情
AI中文摘要

以人为中心的视频定制,特别是在服装层面,已显示出显著的商业价值。然而,现有方法无法支持低延迟和交互式服装控制,这对电子商务和内容创作应用至关重要。本文研究如何在仅使用单件服装视频数据的情况下,实现交互式多服装视频定制并保持动作一致性。我们提出了FashionChameleon,一个用于自回归视频生成中的人体服装定制的实时交互框架,用户可以在生成过程中交互式切换服装。FashionChameleon包含三个关键技术:(i) 代替在多服装视频数据上训练,我们使用上下文学习在单个参考服装对上训练教师模型。通过保留图像到视频的训练范式,同时强制参考和服装图像之间不匹配,模型被鼓励在单件服装切换时隐式保持一致性。(ii) 为了在生成过程中实现一致性和效率,我们引入了带有上下文学习的流式蒸馏,通过上下文教师强制微调模型,并通过梯度加权分布匹配蒸馏提高外推一致性。(iii) 为了将模型扩展到交互式多服装视频定制,我们提出了无训练KV缓存调度,包括服装KV刷新、历史KV撤回和参考KV解耦,以在保持动作一致性的同时实现服装切换。我们的FashionChameleon独特地支持交互式定制和一致的长视频外推,同时在单个GPU上实现实时生成23.8 FPS,比现有基线快30-180倍。

英文摘要

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180× faster than existing baselines.

2605.21028 2026-06-18 cs.CV cs.AI 版本更新

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink:动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Lab. of Computer Network and Information Integration, Southeast University(东南大学计算机网络与信息集成重点实验室) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Institute of Automation, CAS(中国科学院自动化研究所)

AI总结 本文提出 DySink,一种基于检索的框架,通过维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks,以提高自回归长视频生成的动态性和时间质量。

详情
AI中文摘要

自回归长视频生成通常采用有界内存流以提高效率,通常结合局部窗口实现短期连续性与静态早期帧 sinks 作为长程锚点。然而,这种固定分配在当前视觉状态与早期帧大幅偏离时仍会缓存早期帧,而丢弃可能更相关的中间历史。结果,保留的长程上下文可能变得不适应,并偏向过时的线索;在严重情况下,RoPE 引起的相位再对齐会使头间注意力同质化并导致 sink 崩溃,其中内容会回归到 sink 帧。我们提出 DySink,一种基于检索的框架,维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks。DySink 将自适应检索与 sink 异常门相结合,后者检测检索上下文中的过度头间共识并抑制易崩溃的上下文。在分钟级视频上的实验表明,DySink 在动态度方面一致优于强基线,同时也实现了更高的时间质量。代码和模型权重将在 https://github.com/yebo0216best/DySink 上发布。

英文摘要

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
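
The two mechanisms named in the abstract, retrieval of relevant history frames and a gate on inter-head consensus, can be sketched as follows. This is a toy illustration, not the DySink implementation: cosine similarity as the relevance score, mean pairwise cosine similarity as the consensus measure, the threshold of 0.9, and all function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_dynamic_sinks(memory_bank, query, k=2):
    """Indices of the k stored frame features most similar to the current frame feature."""
    order = sorted(range(len(memory_bank)),
                   key=lambda i: cosine(memory_bank[i], query), reverse=True)
    return order[:k]


def consensus_gate(head_attn, threshold=0.9):
    """True when attention heads agree too strongly over the retrieved frames.

    head_attn holds one attention distribution over retrieved frames per head.
    A True result marks the retrieved context as collapse-prone.
    """
    pairs = [(i, j) for i in range(len(head_attn)) for j in range(i + 1, len(head_attn))]
    if not pairs:
        return False
    mean_sim = sum(cosine(head_attn[i], head_attn[j]) for i, j in pairs) / len(pairs)
    return mean_sim > threshold


bank = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = retrieve_dynamic_sinks(bank, query=[1.0, 0.0], k=2)
```

In this toy setup, frames 0 and 2 are retrieved because they point in nearly the same direction as the query, while the gate fires only when every head spreads its attention over the retrieved frames in the same way.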

2605.21431 2026-06-18 cs.CV 版本更新

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

iTryOn: 通过空间-语义引导掌握交互式视频虚拟试穿

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

发表机构 * Shenzhen Campus of Sun Yat-sen University Taobao & Tmall Group of Alibaba

AI总结 本文提出iTryOn框架,通过空间-语义引导解决交互式视频虚拟试穿中的语义模糊和复杂服装变形问题,实现了更动态可控的虚拟试穿体验。

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

详情
AI中文摘要

视频虚拟试穿(VVT)旨在无缝替换视频中人物身上的衣物。尽管现有方法在保持时间一致性方面取得了显著进展,但它们主要局限于非交互场景,其中模型仅展示衣物。这种限制忽略了现实世界服装展示中的关键方面:主动的人-衣物互动。为弥合这一差距,我们引入并正式化了一个新的挑战性任务:交互式视频虚拟试穿(Interactive VVT),其中视频中的主体主动与衣物互动。该任务引入了超出简单纹理保留的独特挑战,包括:(1)从标准姿态信息中解决交互的语义模糊性,以及(2)从视频中学习复杂的衣物变形,其中交互时刻稀少且短暂。为了解决这些挑战,我们提出了iTryOn,一种基于大规模视频扩散Transformer的新型框架。iTryOn首创多级交互注入机制,以引导复杂动态的生成。在空间层面,我们引入了服装无关的3D手先验,以提供精细的指导,精确的手-服装接触,有效解决空间模糊性。在语义层面,iTryOn利用全局描述词提供整体上下文,并利用时间戳动作描述词提供局部交互,通过我们新颖的Action-aware Rotational Position Embedding(A-RoPE)进行同步。广泛的实验表明,iTryOn不仅在传统VVT基准上实现了最先进的性能,还在新的交互设置中建立了显著的领先优势,标志着更动态和可控的虚拟试穿体验的重要一步。

英文摘要

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2606.06361 2026-06-18 cs.CV 版本更新

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

两步中的物理:在视觉细化抹去运动先验之前将其锁定

Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

发表机构 * National Institute of Standards and Technology(国家标准与技术研究院)

AI总结 本文发现图像到视频扩散模型在两步生成中比多步生成具有更好的物理一致性,通过频谱分析将原因归结为去噪过程中的相位侵蚀,并提出无需训练的PhaseLock框架,通过从两步推理中提取运动先验并利用潜在增量引导强制到高保真生成中,有效缓解相位退化,提升物理一致性平均6.2点,同时保持视觉保真度且开销极小。

Comments ICML 2026

详情
AI中文摘要

图像到视频扩散模型利用输入图像生成视觉上令人惊艳的内容,但常常产生违反物理规律的运动。我们揭示了一个令人惊讶的发现:两步生成通常比同一模型的50步输出表现出更好的物理一致性。通过频谱分析,我们将其追溯到去噪过程中的相位侵蚀:相位显著退化(从第2步到第50步下降约18%),而幅度保持相对稳定。基于这一洞察,我们提出PhaseLock,一个无需训练的框架,在整个去噪轨迹中保留来自少步推理的有效运动先验。PhaseLock不依赖全步推理来获得物理一致性,而是仅从2步中提取运动先验,并通过潜在增量引导将其强制到高保真生成中。我们的方法有效缓解了相位退化,在多种模型上平均提升物理一致性6.2点,同时基本保持视觉保真度,且开销可忽略不计(时间1.06倍,内存1.02倍),并减少了对昂贵外部引导方法(时间约5倍)的依赖。

英文摘要

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by ≈18% from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead (1.06× time, 1.02× memory) and reduced reliance on expensive external guidance methods (~5× time). Project Page: https://dnwjddl.github.io/phaselock
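
The distinction between magnitude and phase that the spectral analysis relies on can be seen in a one-dimensional toy. The sketch below is not the paper's analysis, which operates on video latents: the function names, the agreement measure (mean cosine of per-bin phase differences), and the test signal are assumptions chosen to show that a signal can keep its spectral magnitude while its phase changes completely.

```python
import cmath
import math


def dft(signal):
    """Plain discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def magnitude_and_phase_agreement(ref, other, eps=1e-9):
    """Return (magnitude ratio, phase agreement) between two equal-length signals.

    Phase agreement is the mean cosine of the phase difference over frequency
    bins that carry non-negligible energy in both signals: 1.0 means identical
    phase, values near 0 mean the phases are unrelated on average.
    """
    fa, fb = dft(ref), dft(other)
    ratio = sum(abs(c) for c in fb) / sum(abs(c) for c in fa)
    bins = [k for k in range(len(fa)) if abs(fa[k]) > eps and abs(fb[k]) > eps]
    if not bins:
        return ratio, 0.0
    agreement = sum(math.cos(cmath.phase(fa[k]) - cmath.phase(fb[k])) for k in bins) / len(bins)
    return ratio, agreement


# One sine period, and the same sine circularly shifted by a quarter period.
ref = [math.sin(2 * math.pi * t / 8) for t in range(8)]
shifted = [ref[(t - 2) % 8] for t in range(8)]
ratio, agreement = magnitude_and_phase_agreement(ref, shifted)
```

Here the magnitude ratio stays at 1 while the phase agreement drops to about 0, which is the kind of asymmetry the abstract describes: magnitude statistics can look stable even when the phase, which encodes where things are and how they move, has changed.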

2606.13376 2026-06-18 cs.CV 版本更新

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

MoVerse: 基于全景高斯支架的实时视频世界建模

Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

发表机构 * South China University of Technology Columbia University Orange Team, Youku Moku-Lab, HUJING Digital Media & Entertainment Group Singapore Management University

AI总结 提出MoVerse,从单张窄视场图像实时构建可交互漫游的360度全景世界,通过拓扑感知扩散补全视场、全景几何残差预测生成3D高斯支架,并结合双向扩散教师蒸馏为因果自回归学生实现低延迟视频渲染。

Comments Project Page: https://orange-3dv-team.github.io/MoVerse/

详情
AI中文摘要

我们提出MoVerse,一个实时视频世界模型,能够从单张窄视场图像创建可交互导航的场景。该设置具有挑战性,因为输入仅观察到环境的一小部分,而交互式漫游需要完整的周围世界、持久的几何结构、可控的相机运动以及时间上一致的高保真观测。MoVerse通过将世界构建与观测渲染分离来解决这个问题。它首先使用拓扑感知扩散将输入扩展为重力对齐的360°全景图,在3D推理之前闭合缺失的视场。然后,利用全景几何感知残差预测将全景图提升为持久的3D高斯支架,形成密集且可直接渲染的空间记忆。最后,一个高斯条件视频渲染器将沿用户指定相机轨迹的支架渲染结果转换为逼真的视频。为了使该渲染器适用于交互,我们训练了一个双向扩散教师用于高质量条件渲染,并将其蒸馏为一个因果自回归学生以实现有界延迟流式传输。这种设计结合了显式3D表示的可控性和长程一致性以及生成视频模型的感知质量。MoVerse在单个NVIDIA RTX 4090 GPU上支持8 FPS的实时场景漫游,展示了通往具有交互式视频输出的单图像世界创建的实用路径。

英文摘要

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

2606.13768 2026-06-18 cs.CV cs.AI 版本更新

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

CineOrchestra:面向电影视频生成的统一实体中心条件控制

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

发表机构 * Snap Inc.(Snap公司) UC Merced(加州大学默塞德分校)

AI总结 提出CineOrchestra,一种统一控制主体、事件、相机和镜头切换的视频扩散模型,通过实体中心条件原语和参数无关的旋转位置编码实现多轴联合控制,在密集描述跟随和镜头切换时序上超越六种专用方法。

Comments Project page: https://snap-research.github.io/CineOrchestra

详情
AI中文摘要

电影视频描绘了多个主体在特定时刻行动或互动,通过有意的相机运动捕捉,并由镜头切换拼接而成。这些元素共同要求比当前文本到视频模型更细粒度的控制。现有工作分别处理每个轴:多主体个性化、时间控制、多镜头合成或相机控制;没有先前的框架能联合集成所有四个轴。我们提出CineOrchestra,一种统一的视频扩散模型,同时控制主体、事件、相机和镜头切换。我们的关键洞察是,这些异构的电影元素共享一个基本结构:每个元素都是在特定时间间隔内行动的实体,因此都可以通过一个共享的实体中心条件原语结构来表达,并辅以视觉实体的参考图像。这种表述将架构挑战简化为单个位置编码问题,我们通过两个参数无关的协调旋转嵌入来解决:(a) 间隔采样的时间RoPE,在持续时间差异巨大的事件上产生一致注意力行为;(b) 2D实体-时间交叉注意力RoPE,消除每个实体条件的歧义,并将其路由到对应的时空区域。在两个新基准上,CineOrchestra在密集描述跟随和镜头切换时序上优于六种每轴专家方法,在成对用户研究和组件消融中持续获得增益。

英文摘要

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra

7. 3D视觉、点云与空间智能 13 篇

2606.18429 2026-06-18 cs.CV cs.AI cs.LG 新提交

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

发表机构 * University at Albany(奥尔巴尼大学)

AI总结 提出CAOA方法,结合语义感知点云补全和对称感知相对位姿估计,在Scan2CAD上实现17%精度提升,并发布S2C-Completion数据集。

Comments GitHub: https://github.com/MinhasKamal/CAOA

详情
Journal ref
Thirteenth International Conference on 3D Vision (3DV), 2026
AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度(DoF)位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐(CAOA),该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合,实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估,往往难以泛化到真实扫描。为弥合这一差距,我们引入了一种针对室内场景的合成数据生成策略,通过与广泛使用的补全数据集进行定量比较,验证了其显著减小合成到真实领域差距的效果。此外,我们发布了S2C-Completion,一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集,用于真实室内单物体补全,并作为该任务的新基准。对于物体-CAD对齐,我们通过对称感知损失融入对称信息,提高了对对称模糊的鲁棒性。在Scan2CAD基准上,CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
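
To see why symmetry matters for the rotation part of the 9-DoF pose, consider a square table: rotating it by 90° about the vertical axis yields the same shape, so a prediction that is off by exactly 90° should not be penalized. The sketch below illustrates that idea for yaw only. It is an assumption-laden toy, not the CAOA loss: the function `symmetric_yaw_error` and the restriction to k-fold rotational symmetry about one axis are introduced here for illustration.

```python
import math


def symmetric_yaw_error(pred, target, order):
    """Smallest yaw difference in radians, modulo a k-fold rotational symmetry.

    order=1 means no symmetry, order=2 a 180-degree symmetry,
    order=4 a 90-degree symmetry, and so on.
    """
    period = 2 * math.pi / order
    diff = (pred - target) % period
    # The error wraps around, so compare both directions within one period.
    return min(diff, period - diff)


err_square_table = symmetric_yaw_error(math.radians(100), math.radians(10), order=4)
err_no_symmetry = symmetric_yaw_error(math.radians(350), math.radians(10), order=1)
err_two_fold = symmetric_yaw_error(math.radians(170), math.radians(0), order=2)
```

For the square table the error is zero even though the raw angles differ by 90°, while the asymmetric object gets the usual wrapped error of 20°.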

2606.18439 2026-06-18 cs.CV cs.RO 新提交

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT:面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of California, Irvine(加利福尼亚大学尔湾分校) Nanyang Technological University(南洋理工大学)

AI总结 提出RegimeVGGT,通过逐层U形压缩(显著性引导带状合并与选择性保护K/V下采样)去除冗余,在保持重建质量的同时实现6.7倍加速。

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

详情
AI中文摘要

视觉几何基础Transformer(VGGT)通过一次前向传播从多视图图像恢复密集3D场景结构,但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算,忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域:浅层缺乏跨视图结构,中层驱动跨视图对齐,深层对密集几何是冗余的,但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩:显著性引导带状合并保护几何和边缘显著性令牌,而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练,RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

英文摘要

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18623 2026-06-18 cs.CV eess.IV 新提交

Intrinsic 4D Gaussian Segmentation from Scene Cues

基于场景线索的内在4D高斯分割

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

发表机构 * Istanbul Technical University(伊斯坦布尔理工大学) Texas A&M University(德克萨斯农工大学) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 提出Intrinsic-GS方法,无需训练和掩码,通过构建高斯原语的亲和图并利用社区检测实现4D场景分割,在Neu3D和HyperNeRF上达到与掩码监督方法相当的精度,且速度提升12.5倍。

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

详情
AI中文摘要

动态4D高斯泼溅以高保真度重建变形场景,并越来越多地被用作动态3D场景的表示。要利用此类场景进行编辑、操作或运动分析,首先需要对其进行分割:将高斯原语分组为连贯的对象。当前流程通过从基础模型(如SAM)导入2D掩码,并将其提升或蒸馏到高斯表示中来获得这种分组。在动态场景中,这些掩码必须在多个帧和视角中生成,成本高昂,并且所得分割可能强烈依赖于这些外部掩码的质量和一致性。我们探究能否从高斯本身恢复更多的对象级结构,并提出Intrinsic-GS,一种无需训练、无需掩码的方法,该方法根据外观、方向、尺度、变形轨迹和非学习渲染边界线索,在高斯原语上构建稀疏亲和图。该图通过Leiden社区检测进行划分,无需基础模型,也无需学习特征场。在标准的4D高斯分割基准Neu3D和HyperNeRF上,Intrinsic-GS在没有掩码监督的情况下恢复了大量的对象结构,在Neu3D上达到0.746 mIoU,在HyperNeRF上达到0.575;在Neu3D上,仅几何变体达到0.902 mIoU,与SAM监督的TRASE相当。在HyperNeRF上,Intrinsic-GS的运行速度比掩码监督流程中使用的掩码生成和特征渲染阶段快12.5倍。这些结果表明,大部分分割信号已经编码在高斯本身中,为3D和4D高斯分割提供了一种快速、无需掩码的方向,也可能指向在外部掩码不可靠或昂贵的情况下更可泛化、更鲁棒的分割。

英文摘要

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.
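
The graph-then-partition pipeline can be sketched in a few lines. Two substitutions are made here and should be read as assumptions: each Gaussian is reduced to a single hypothetical cue vector (the method combines appearance, orientation, scale, deformation-trajectory, and boundary cues), and connected components over thresholded edges stand in for Leiden community detection, which would need an external library.

```python
import math


def affinity(a, b, sigma=1.0):
    """Gaussian-kernel affinity between two cue vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2 * sigma ** 2))


def segment(features, k=2, min_affinity=0.5):
    """Group primitives by linking each to its k strongest neighbours.

    Stand-in for community detection: strong edges are merged with
    union-find and the resulting components are returned as labels.
    """
    n = len(features)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        # Sparse graph: only the k most similar primitives are candidates.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: affinity(features[i], features[j]),
                            reverse=True)[:k]
        for j in neighbours:
            if affinity(features[i], features[j]) >= min_affinity:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


cues = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
groups = segment(cues)
```

Restricting each primitive to its k nearest neighbours keeps the graph sparse, which matters when a scene contains hundreds of thousands of Gaussians.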

2606.18787 2026-06-18 cs.CV 新提交

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

基于UDF的点云重建中的学习半径估计

Eito Ogawa, Hiroshi Watanabe

发表机构 * Graduate School of FSE Waseda University Tokyo, Japan(日本东京早稻田大学基干理工学研究科)

AI总结 提出一种学习型逐查询半径选择器,预测连续支撑半径并插入冻结的LoSF-UDF骨干网络,通过抛物线插值获取离网目标半径进行训练,提高点云表面重建的细粒度精度。

详情
AI中文摘要

从点云进行表面重建对于消费级3D捕获(包括AR/VR和室内扫描)非常重要。局部补丁无符号距离场(UDF)方法轻量且可泛化,但其精度依赖于支撑半径,传统上半径是固定的或通过一维曲率启发式选择,无法捕捉异质局部几何。我们提出一种学习型逐查询半径选择器,预测连续支撑半径并插入冻结的LoSF-UDF骨干网络。该选择器使用通过抛物线插值从缓存的UDF误差曲线获得的离网目标半径进行训练。实验表明,该方法提高了细尺度重建精度。

英文摘要

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.
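
The off-grid target radius can be obtained with the standard three-point parabola fit around the grid minimum of an error curve. The sketch below shows that formula under two assumptions that are not stated in the abstract: the candidate radii are uniformly spaced, and the function name `parabolic_min` and the toy error curve are made up for illustration.

```python
def parabolic_min(radii, errors):
    """Refine the grid minimum of an error curve with a three-point parabola fit.

    Assumes uniformly spaced radii. Falls back to the grid minimum when it
    sits on the boundary or the three points are not convex.
    """
    i = min(range(len(errors)), key=errors.__getitem__)
    if i == 0 or i == len(errors) - 1:
        return radii[i]  # no neighbour on one side, nothing to fit
    e0, e1, e2 = errors[i - 1], errors[i], errors[i + 1]
    curvature = e0 - 2 * e1 + e2
    if curvature <= 0:
        return radii[i]  # flat or concave: the vertex is not a minimum
    h = radii[i + 1] - radii[i]
    # Vertex of the parabola through the three points.
    return radii[i] + 0.5 * h * (e0 - e2) / curvature


# Toy error curve with its true minimum between two grid points, at r = 0.37.
grid = [0.1 * k for k in range(1, 7)]
curve = [(r - 0.37) ** 2 for r in grid]
target_radius = parabolic_min(grid, curve)
```

On this quadratic toy curve the fit recovers 0.37 even though the nearest grid point is 0.4, which is the sense in which the target radius is off-grid.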

2606.18861 2026-06-18 cs.CV cs.AI 新提交

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微联合推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出KinemaForge管道,通过可微关节推理和能量一致性验证,从RGB-D序列联合估计部件形状、关节拓扑和参数,显著降低关节轴误差和仿真漂移。

详情
AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约:(i) 部件级几何重建与运动学参数估计分离,(ii) 恢复的模型常违反能量守恒等基本动态不变量,导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge,一种约束驱动管道,从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数,并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件:将关节-部件关联编码为软边的运动学约束图;通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器;以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上,KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%),从基于交互的Ditto基线的5.30度降至2.83度(-46.6%),在50秒滚动中长时仿真漂移比PARIS降低64%,初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
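
The idea behind an energy residual can be shown on the simplest articulated system, a frictionless pendulum on one revolute joint: in free motion its total mechanical energy should stay constant, so any drift signals a non-physical model. The sketch below is a toy under stated assumptions, not the KinemaForge loss: the point-mass pendulum, the mean-squared form of the residual, and the function names are all chosen for illustration.

```python
import math


def pendulum_energy(theta, omega, mass=1.0, length=1.0, g=9.81):
    """Kinetic plus potential energy of a point-mass pendulum."""
    kinetic = 0.5 * mass * (length * omega) ** 2
    potential = mass * g * length * (1.0 - math.cos(theta))
    return kinetic + potential


def energy_residual(trajectory, **params):
    """Mean squared deviation of total energy from its initial value.

    trajectory is a list of (joint angle, joint velocity) samples from a
    free response, that is, with no actuation and no friction.
    """
    e0 = pendulum_energy(*trajectory[0], **params)
    return sum((pendulum_energy(th, om, **params) - e0) ** 2
               for th, om in trajectory) / len(trajectory)


# A physically consistent swing released from 0.5 rad at rest.
angles = (0.5, 0.3, 0.1, 0.0)
consistent = [(th, math.sqrt(2 * 9.81 * (math.cos(th) - math.cos(0.5)))) for th in angles]
# The same swing with velocities halved, as a mis-estimated model might produce.
inconsistent = [(th, 0.5 * om) for th, om in consistent]

residual_ok = energy_residual(consistent)
residual_bad = energy_residual(inconsistent)
```

The consistent trajectory gives a residual at floating-point noise level, while the halved velocities give a clearly positive one, so the residual can serve as a penalty term.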

2606.19019 2026-06-18 cs.CV 新提交

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

FlowObject: 流引导以桥接生成先验与重建保真度

Yuchen Rao, Xuqian Ren, Yinyu Nie, Sayan Deb Sarkar, Biao Zhang, Vincent Lepetit, Friedrich Fraundorfer

发表机构 * Graz University of Technology Austria(奥地利格拉茨理工大学) Tampere University Finland(芬兰坦佩雷大学) Technical University of Munich Germany(德国慕尼黑工业大学) Stanford University The United States of America(美国斯坦福大学) Xi’an Jiaotong University China(中国西安交通大学) École des Ponts ParisTech France(法国国立路桥学校)

AI总结 提出FlowObject框架,通过双空间引导策略驱动流匹配模型的ODE轨迹,在利用生成先验完成未观测区域的同时保持与真实观测的一致性,并集成3DGS细化阶段弥合生成输出与真实感重建的差距,显著提升几何完整性和视角相关外观保真度。

Comments Project page: https://yuchenrao.github.io/projects/flowObject/flowObject.html

详情
AI中文摘要

从少量随意拍摄的图像中恢复物体的完整3D表示仍然是一个重大挑战。最近的3D生成模型,特别是基于流匹配(Flow-Matching, FM)的模型,可以合成高质量的纹理资产;然而,它们常常遭受“合成偏差”,即学习到的先验覆盖了观测证据,同时缺乏与观测实例的对齐。相反,基于优化的方法如3D高斯泼溅(3DGS)在可见表面上提供高保真度,但无法推理未观测的几何结构。在本文中,我们提出了FlowObject,一个将稀疏视图3D重建重新表述为无训练、引导逆问题的框架。我们的方法采用双空间引导策略来驱动流匹配模型的常微分方程(ODE)轨迹,通过学习的生成先验完成未观测区域,同时强制与真实世界观测严格一致。通过集成3DGS细化阶段,FlowObject进一步弥合了“合成外观”生成输出与真实感重建之间的差距。在合成和真实世界数据集上的全面基准测试表明,当前最先进的方法通常难以同时实现几何完整性和观测一致性,尤其是在严重遮挡下。相比之下,我们的方法在几何完整性和视角相关外观保真度方面显著优于最先进的生成模型和基于优化的框架。

英文摘要

Recovering complete 3D representations of objects from few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias" where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.
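
Steering an ODE trajectory with observations can be illustrated with plain Euler integration: the learned velocity field moves the state toward the prior, and an extra guidance term pulls the observed coordinates toward their measurements. The sketch below is a generic toy and does not reproduce FlowObject's dual-space guidance: the function `guided_euler`, the linear guidance term, the guidance weight, and the toy velocity field are all assumptions.

```python
def guided_euler(x0, velocity, observed, steps=100, guidance=5.0):
    """Integrate dx/dt = velocity(x, t) + guidance * (observation - x) from t=0 to t=1.

    observed holds a measurement per coordinate, or None where nothing was
    observed; unobserved coordinates follow the prior velocity field alone.
    """
    x = list(x0)
    dt = 1.0 / steps
    for step in range(steps):
        v = velocity(x, step * dt)
        for i in range(len(x)):
            pull = guidance * (observed[i] - x[i]) if observed[i] is not None else 0.0
            x[i] += dt * (v[i] + pull)
    return x


# Toy prior: every coordinate drifts toward 1.0.
def toy_velocity(x, t):
    return [1.0 - xi for xi in x]


# Coordinate 0 was observed at 3.0, coordinate 1 was never observed.
final = guided_euler([0.0, 0.0], toy_velocity, observed=[3.0, None])
```

The observed coordinate ends close to its measurement while the unobserved one is filled in by the prior, which mirrors the division of labor the abstract describes between observations and generative priors.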

2606.19156 2026-06-18 cs.CV 新提交

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Hand-4DGS: 用于从第一人称视频进行4D手部重建的前馈3D高斯泼溅方法

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

发表机构 * Yonsei University(延世大学) Electronics and Telecommunications Research Institute(韩国电子通信研究院) ETH Zurich(苏黎世联邦理工学院) Microsoft Spatial AI Lab(微软空间人工智能实验室) VGG, University of Oxford(牛津大学VGG实验室)

AI总结 提出Hand-4DGS,首个前馈框架,从第一人称视频直接重建动态4D手部,利用网格引导表示和时间卷积,实现快速推理和强泛化,无需3D真值标注。

Comments Project page: https://jeongminb.github.io/hand-4dgs/

详情
AI中文摘要

从第一人称视频进行动态3D手部重建对于下一代计算平台(如AR/VR和AI眼镜)至关重要。尽管其重要性,大多数先前工作要么关注多视角3D手部重建,要么关注4D人体重建。由于头部快速运动、手部快速动态、严重遮挡以及单视角观察固有的模糊性,第一人称4D手部重建仍然具有挑战性。为了解决这些挑战,我们引入了Hand-4DGS,这是第一个直接从第一人称视频重建动态4D手部的前馈框架,实现了快速(约60 FPS)推理和强泛化。我们的方法结合了用于结构先验的网格引导表示和用于建模动态运动的时间卷积。我们在两个具有挑战性的第一人称数据集H2O和ARCTIC上评估了我们的框架,并展示了相对于基线的显著改进。我们的方法受益于前馈网络的泛化能力以及通过高斯泼溅的有效2D图像监督,无需昂贵的3D手部姿态真值标注。

英文摘要

Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 新提交

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) Huawei(华为)

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
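The placement step described in the abstract (unproject a patch with its depth and camera pose, then position it by longitude and latitude as seen from the canvas origin) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the pinhole intrinsics, the camera-to-world pose convention, the axis convention (y up, longitude measured from +z toward +x), and the function names `unproject` and `canvas_coords` are all assumptions made for the example.

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Back-project pixel (u, v) with metric depth to a 3D world point.

    Assumes a pinhole camera and a camera-to-world pose (rotation is a 3x3
    nested list, translation a 3-vector); both conventions are assumptions.
    """
    cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(
        sum(rotation[i][j] * cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def canvas_coords(point, origin, width, height):
    """Map a world point to continuous equirectangular (col, row) plus range.

    Assumed axes: y is up, longitude is measured from +z toward +x.
    """
    dx, dy, dz = (point[i] - origin[i] for i in range(3))
    rng = math.sqrt(dx * dx + dy * dy + dz * dz)
    if rng == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)                           # longitude in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, dy / rng)))     # latitude in [-pi/2, pi/2]
    col = (lon / (2.0 * math.pi) + 0.5) * width        # continuous column
    row = (0.5 - lat / math.pi) * height               # continuous row, top = +y
    return col, row, rng

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
world = unproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                  identity, (0.0, 0.0, 0.0))
print(canvas_coords(world, (0.0, 0.0, 0.0), 360, 180))
```

The returned range is the quantity that the angular canvas coordinate alone discards; the abstract states that a 3D position embedding of the patch's metric coordinates restores it, which this sketch does not model.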

2606.19316 2026-06-18 cs.CV 新提交

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

NeuMesh++:基于解耦神经网格隐式场的多功能高效体积编辑

Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

发表机构 * State Key Lab of CAD&CG, College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院计算机辅助设计与图形学国家重点实验室) Ant Research(蚂蚁研究院) Google(谷歌) ByteDance(字节跳动)

AI总结 提出一种基于网格顶点的解耦神经辐射场表示,实现几何、纹理和语义引导的高效体积编辑,包括网格引导几何编辑、纹理交换填充绘制及语义编辑。

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/

详情
AI中文摘要

近年来,神经隐式渲染技术迅速发展,在新视角合成和3D场景重建方面展现出显著优势。然而,现有的用于编辑目的的神经渲染方法功能有限,例如刚性变换和类别特定编辑。在本文中,我们提出了一种新颖的基于网格的表示方法,通过在网格顶点上编码解耦的几何、纹理和语义码来编码神经辐射场,从而实现一系列高效且全面的编辑功能,包括网格引导的几何编辑、通过纹理交换、填充和绘制操作进行的指定纹理编辑,以及语义引导的编辑。为此,我们开发了几种技术,包括一种新颖的局部空间参数化以提高渲染质量和训练稳定性,一种可学习的顶点修改颜色以提高纹理编辑的保真度,一种空间感知优化策略以实现精确的纹理编辑,以及一种语义辅助区域选择以减轻隐式场编辑的繁琐标注。在真实和合成数据集上的大量实验和编辑示例证明了我们的方法在表示质量和编辑能力上的优越性。项目页面:https://zju3dv.github.io/neumeshplusplus/

英文摘要

Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/

2606.18588 2026-06-18 cs.DC cs.CV 交叉投稿

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

Splaxel:通过像素级通信实现大规模场景重建的高效分布式3D高斯泼溅训练

Wenqi Jia, Zhewen Hu, Ying Huang, Yu Gong, Stavros Kalafatis, Yuke Wang, Wei Niu, Chengming Zhang, Ang Li, Sheng Di, Yuede Ji, Bo Fang, Miao Yin

发表机构 * Independent Researcher(独立研究者) Rice University(莱斯大学) University of Georgia(佐治亚大学) University of Houston(休斯顿大学) University of Washington(华盛顿大学) Argonne National Labs(阿贡国家实验室)

AI总结 提出Splaxel框架,通过像素级局部渲染与全局组合替代高斯同步,在保持数学一致性的同时稳定通信开销,结合可见性预测和冲突消除策略,实现大规模3DGS分布式训练加速7.6倍。

Comments 17 pages, 25 figures

详情
AI中文摘要

3D高斯泼溅(3DGS)能够实现高保真、实时的3D场景重建,但将训练扩展到大规模场景需要跨多个GPU优化数亿个高斯体。现有的分布式方法要么将场景划分为孤立区域,导致全局不一致,要么依赖全局高斯级交换,导致GPU间通信量大幅增长并迅速主导迭代时间。我们提出Splaxel,一种基于像素级局部渲染和全局组合的通信高效分布式3DGS训练框架。每个GPU渲染其局部子集并仅交换部分像素值,而非同步高斯体,从而在保持数学一致性的同时,使通信成本随场景规模增长保持稳定。Splaxel通过几何和透射率可见性预测进一步减少像素级冗余,并通过无冲突的相机视图整合提高GPU利用率。在包含多达1.2亿个高斯体的大规模数据集上评估,Splaxel相比最先进的分布式3DGS框架实现了高达7.6倍的加速,同时保持高重建质量。

英文摘要

3D Gaussian Splatting (3DGS) enables high-fidelity and real-time 3D scene reconstruction, but scaling training to large-scale scenes requires optimizing hundreds of millions of Gaussians across multiple GPUs. Existing distributed approaches either partition scenes into isolated regions, causing global inconsistency, or rely on global Gaussian-level exchanges, which lead to substantial growth in inter-GPU communication and quickly dominate iteration time. We propose Splaxel, a communication-efficient distributed 3DGS training framework based on pixel-level local rendering and global composition. Instead of synchronizing Gaussians, each GPU renders its local subset and exchanges only partial pixel values, maintaining mathematical consistency while keeping communication cost stable as the scene size increases. Splaxel further reduces pixel-level redundancy through geometric and transmittance visibility prediction and improves GPU utilization via conflict-free camera-view consolidation. Evaluated on large-scale datasets with up to 120M Gaussians, Splaxel achieves up to 7.6× speedup over the state-of-the-art distributed 3DGS framework while preserving high reconstruction quality.
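One way to see why exchanging partial pixel values can stay mathematically consistent is that front-to-back alpha compositing is associative: a depth-contiguous chunk of samples can be reduced to a (color, transmittance) pair, and such pairs can be merged afterwards. The sketch below illustrates only that general property. It assumes each worker holds a depth-contiguous chunk along the ray, which the abstract does not state, and the function names `composite` and `merge` are hypothetical rather than part of Splaxel.

```python
def composite(samples):
    """Front-to-back alpha compositing of (color, alpha) samples, near to far.

    Returns the accumulated color and the transmittance left after the chunk.
    """
    color, trans = 0.0, 1.0
    for c, a in samples:
        color += trans * a * c   # contribution attenuated by what lies in front
        trans *= 1.0 - a         # light still passing through after this sample
    return color, trans

def merge(partials):
    """Combine per-worker (color, transmittance) pairs ordered near to far."""
    color, trans = 0.0, 1.0
    for c, t in partials:
        color += trans * c       # chunk color attenuated by earlier chunks
        trans *= t
    return color, trans

samples = [(1.0, 0.5), (0.2, 0.25), (0.8, 0.5), (0.4, 0.1)]
single_pass = composite(samples)
two_workers = merge([composite(samples[:2]), composite(samples[2:])])
print(single_pass, two_workers)
```

Under this assumption each worker sends two values per pixel and channel regardless of how many Gaussians it holds, which is consistent with the abstract's claim that communication cost stays stable as the scene grows.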

2606.18826 2026-06-18 physics.optics cs.CV eess.IV 交叉投稿

EDoF-NeRF: extended depth-of-field neural radiance fields using a coded aperture camera

EDoF-NeRF: 使用编码孔径相机扩展景深的神经辐射场

Yoshiyuki Shirasaki, Ryoichi Horisaki

发表机构 * Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo(东京大学大学院信息理工学系研究科信息物理与计算系)

AI总结 提出一种通过编码孔径相机扩展景深的方法,构建高保真神经辐射场,实现从不同视角图像渲染新视图,并验证其优于传统孔径相机。

详情
AI中文摘要

我们提出了一种扩展景深(DoF)的方法,用于构建高保真神经辐射场(NeRF)——一种基于隐式神经表示、从不同视角捕获的图像数据集渲染逼真新视图的新兴技术。DoF与光量之间的权衡不仅存在于传统相机中,也存在于NeRF中,因为NeRF使用的数据集是由这些相机捕获的。为了解决这个问题,我们在相机光阑处引入编码孔径,在散焦条件下保留空间频率分量。我们开发了一个将编码孔径纳入NeRF的相机模型,允许直接输入编码图像,并能够生成具有扩展景深的新视图。我们通过仿真和实验验证了所提出的方法,称为扩展景深NeRF(EDoF-NeRF),并证明了其相比传统孔径相机的优越性能。

英文摘要

We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) -- an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

2503.09439 2026-06-18 cs.CV 版本更新

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

SuperCarver: 纹理一致的3D几何超分辨率用于高保真表面细节生成

Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou

发表机构 * Tencent Games, China(腾讯游戏,中国) Department of Computer Science & Engineering, Texas A & M University(计算机科学与工程系,德克萨斯A&M大学) Department of Computer Science, City University of Hong Kong(计算机科学系,香港城市大学)

AI总结 提出SuperCarver,一种3D几何超分辨率管线,通过先验引导的法线扩散模型和噪声鲁棒的逆渲染,为粗糙网格补充纹理一致的表面细节,实现高保真细节生成。

Comments Accepted in IEEE TVCG

详情
AI中文摘要

传统的高精度网格资产生产流程需要专业3D艺术家/建模师进行繁琐且费力的手动雕刻。近年来,AI赋能的3D内容创作在从图像或文本提示生成合理结构和复杂外观方面取得了显著进展。然而,合成逼真的表面细节仍然面临巨大挑战,并且增强现有低质量3D网格(而非图像/文本到3D生成)的几何保真度仍然是一个开放问题。在本文中,我们介绍了SuperCarver,一种3D几何超分辨率管线,用于为给定的粗糙网格补充纹理一致的表面细节。我们首先从多个视角将原始纹理网格渲染到图像域。为了实现细节增强,我们构建了一个确定性先验引导的法线扩散模型,该模型在精心策划的成对细节缺乏和细节丰富的法线图渲染数据集上进行微调。为了从潜在不完美的法线图预测更新网格表面,我们设计了一种通过可变形距离场的噪声鲁棒逆渲染方案。实验表明,我们的SuperCarver能够生成由实际纹理外观描述的逼真且富有表现力的表面细节,使其成为升级历史低质量3D资产和减少高多边形网格雕刻工作量的强大工具。

英文摘要

Conventional production workflow of high-precision mesh assets necessitates a cumbersome and laborious process of manual sculpting by specialized 3D artists/modelers. The recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometry fidelity of existing lower-quality 3D meshes (instead of image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through deformable distance field. Experiments demonstrate that our SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool to both upgrade historical low-quality 3D assets and reduce the workload of sculpting high-poly meshes.

2605.17131 2026-06-18 cs.CV cs.AI cs.LG 版本更新

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

针对点云分类和分割的深度学习架构系统性调研

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

发表机构 * State University of New York at Albany(纽约州立大学阿尔巴尼分校)

AI总结 本文系统性地探讨了点云分类和分割中的深度学习架构,分析了点云数据的结构特性,分类了不同架构的工作,并评估了其在主流基准上的性能,同时指出了开放挑战和未来方向。

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

详情
Journal ref
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026
AI中文摘要

点云因其简洁性和几何保真度而成为表示3D形状和场景最广泛采用的格式。然而,其固有的无序和不规则性质,加剧了传感器噪声和遮挡的影响,给基于机器学习的方法带来了独特的挑战。为应对这些问题,已开发出多种策略,包括转换为有序格式、提取局部几何特征以及基于排列不变或自注意力的处理方法。在本文中,我们的重点是深度学习模型在3D视觉三个基本任务中的应用:点云分类、部分分割和语义分割。我们首先正式定义点云数据,然后深入讨论其结构特性。接着,我们根据其骨干结构对重要工作进行分类,并评估其在流行基准上的性能。除了经验比较外,我们还提供了架构创新和局限性的见解。我们还概述了3D点云理解中的开放挑战和有前途的未来方向。

英文摘要

Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

8. 医学影像与生物视觉 28 篇

2606.18609 2026-06-18 cs.CV 新提交

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

基于反证验证的医学视觉语言模型幻觉检测与纠正

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

发表机构 * College of Computer Science, Sichuan University(四川大学计算机科学学院) Yong Loo Lin School of Medicine, National University of Singapore(新加坡国立大学杨潞龄医学院) Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University(四川大学数据保护与智能管理教育部重点实验室) National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统国家重点实验室) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局高性能计算研究所)

AI总结 提出CoEV框架,通过文本与视觉证据的双向验证检测并纠正医学VLM幻觉,无需重新训练,在四个数据集上显著提升检测和纠正性能。

Comments MICCAI 2026 Accept. Submission Version

详情
AI中文摘要

视觉语言模型(VLM)在医学诊断中的可靠性受到幻觉的挑战,这削弱了信任。现有的幻觉检测方法主要关注识别生成文本与参考数据之间的事实不一致性。虽然一些研究分析了模型在图像中的注意力区域,但它们很少验证这种注意力是否真正反映了支持生成文本的视觉证据。为了解决这一差距,我们提出了反证验证(CoEV),一个无需训练的即插即用框架,通过基于证据的事实一致性验证来检测和纠正幻觉。CoEV在文本断言和视觉证据之间执行双向验证,测试每个陈述是否得到其对应证据区域的支持,并将每个陈述分配到一个四象限诊断图中,该图捕获文本事实性和视觉基础性的组合。CoEV检测幻觉内容,并作为事后细化工具,无需重新训练即可纠正幻觉。在四个医学数据集上的大量实验表明,CoEV能够对抗VLM中的幻觉。在幻觉检测方面,CoEV始终优于现有方法,平均PR-AUC和ROC-AUC分别提高了3.0和3.9个百分点,在特定VQA场景中提升高达18.5%。在幻觉纠正方面,它将Micro-F1提高了高达12.5%,在医学报告生成中将幻觉率降低了超过11.9%,并提高了医学VQA的准确性。这些结果表明,CoEV能够可靠地检测和纠正幻觉,为临床医生提供可靠的、基于证据的诊断线索。代码将在论文录用后发布。

英文摘要

The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0 and 3.9 absolute percentage points, respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

2606.18658 2026-06-18 cs.CV eess.IV 新提交

On-Manifold Variational Learning with Heat-Kernel Priors

基于热核先验的流形变分学习

Jiarui Xing, Tal Zeevi, Nian Wu, Jian Wang

发表机构 * Yale School of Medicine(耶鲁大学医学院) University of Virginia(弗吉尼亚大学) Harvard Medical School(哈佛医学院)

AI总结 提出一种流形锚定变分框架,利用几何感知EM算法选择热核加权潜图上的图中心点作为原型,确保原型在流形上,并通过Dirichlet能量正则化保持潜空间几何平滑,在心脏瘢痕和脑MRI基准上取得最高精度和清晰原型。

详情
AI中文摘要

学习医学影像队列的无监督表示可以揭示临床上有意义的原型,而无需专家标签,这些标签通常带有噪声且无法捕捉真实的病理异质性。然而,现有的深度潜变量模型通过欧几里得平均估计高斯混合先验,产生的原型会偏离弯曲的数据流形,并随着子种群数量的增加而退化。我们提出了一种流形锚定变分框架,基于几何感知的期望最大化(EM)算法,其M步骤选择每个子种群原型作为热核加权潜图上具有最高扩散中心性的图中心点,确保每个原型保持在流形上。Dirichlet能量正则化强制潜空间的几何平滑性,每个子种群的不确定性分数实现了无标签的质量评估。流形锚定EM是一种通用几何工具,扩展了标准EM,并易于应用于其他潜变量模型。在心脏瘢痕和脑MRI基准上,我们的框架在所有比较方法中取得了最高精度,产生了迄今为止最清晰的原型,并且在所有基线退化的较大子种群数量下保持稳定。

英文摘要

Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.
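The contrast between a Euclidean mean and a medoid chosen on a heat-kernel-weighted graph can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the paper's M-step: it uses Gaussian heat-kernel weights exp(-d²/(4t)) between latent points and takes the weighted degree as a stand-in for diffusion centrality; the function name `heat_kernel_medoid` and the bandwidth `t` are hypothetical.

```python
import math

def heat_kernel_medoid(points, t=0.5):
    """Return the point with the largest heat-kernel-weighted degree.

    Weighted degree is used here as a simple proxy for diffusion centrality
    (an assumption of this sketch). The result is always one of the inputs.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = [
        sum(math.exp(-sqdist(p, q) / (4.0 * t)) for q in points)
        for p in points
    ]
    best = max(range(len(points)), key=scores.__getitem__)
    return points[best]

# Five samples on a unit semicircle, a toy curved "manifold".
angles = [0.0, 45.0, 90.0, 135.0, 180.0]
arc = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

mean = tuple(sum(p[i] for p in arc) / len(arc) for i in range(2))
medoid = heat_kernel_medoid(arc)

print("mean radius:", math.hypot(*mean))      # well inside the circle
print("medoid radius:", math.hypot(*medoid))  # on the circle
```

Because the medoid is restricted to observed points, it cannot leave the sampled arc, whereas the arithmetic mean falls in the interior of the circle where no data lie.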

2606.18675 2026-06-18 cs.CV 新提交

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

BrainFusionNet:一种用于理解MRI图像局部、全局和序列特征以改进脑肿瘤检测的深度学习与XAI模型

Md Taimur Ahad, Bo Song, Yan Li

发表机构 * School of Mathematics, Physics and Computing, University of Southern Queensland(南方昆士兰大学数学、物理与计算学院) School of Engineering, University of Southern Queensland(南方昆士兰大学工程学院)

AI总结 提出BrainFusionNet混合模型,结合CNN、ViT和GRU提取MRI空间、上下文和序列特征,并集成SHAP、LIME和GradCAM进行可解释性分析,在公开数据集上达到98%准确率,优于SOTA CNN。

详情
Journal ref
Brain Inf. 13, 21 (2026)
AI中文摘要

磁共振成像(MRI)的噪声给深度学习(DL)带来挑战,当肿瘤边界模糊、肿瘤位置和外观复杂时尤其如此。因此,我们开发了BrainFusionNet,它结合卷积神经网络(CNN)、视觉变换器(ViT)和门控循环单元(GRU),从MRI图像中提取空间、上下文和序列特征,以改进脑肿瘤分类。此外,集成了可解释AI(如SHAP、LIME和GradCAM),以可视化和突出显示有助于BrainFusionNet决策过程的图像区域。所提出的BrainFusionNet模型在两个公开MRI数据集上进行了评估,K折验证表明在两个数据集上准确率均达到98%。该模型与六种最先进的(SOTA)CNN和迁移学习进行了比较。在SOTA CNN中,DenseNet121和VGG16达到了96%的最高准确率。BrainFusionNet的新颖之处在于,该混合模型能够有效提取MRI图像的局部和全局特征,即使在小尺度肿瘤区域和肿瘤尺寸较小的情况下也是如此。该模型具有平衡的序列CNN架构,以捕获低层和深层特征;以及定制的ViT,可捕获局部特征、稳定梯度流并降低MRI图像训练期间梯度消失的风险。CNN和ViT的输出被馈送到GRU以进行最终分类。此外,我们分析像素强度以确定MRI图像质量是否影响图像分类。我们的发现在图像解释方面非常新颖,因为我们发现MRI图像中像素强度的分布会影响DL性能。

英文摘要

Noise in Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets; K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

2606.18682 2026-06-18 cs.CV 新提交

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

使用先进深度学习模型的多类脑肿瘤分类:一项比较研究

Asad Channa, Asghar Ali Chandio, Akhtar Hussain Jalbani, Mehwish Leghari, Shahzad Memon

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学计算机科学系) Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学人工智能系) The Faculty of Artificial Intelligence and Cyber Security, Universiti Teknikal Malaysia Melaka(马来西亚马六甲技术大学人工智能与网络安全学院) Department of Data Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学数据科学系) Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London(东伦敦大学建筑、计算与工程学院计算机科学与数字技术系)

AI总结 本研究比较五种CNN架构(包括定制模型和四种预训练模型)在约10,000张MRI图像上的多类脑肿瘤分类性能,发现EfficientNetB0以95%准确率最优,尤其显著提高了脑膜瘤的召回率(89%)。

详情
AI中文摘要

尽管深度学习最近取得了进展,但从MRI图像中准确分类脑肿瘤仍然面临挑战。在本研究中,我们对五种不同的卷积神经网络(CNN)架构进行了全面评估,包括一个定制的基线模型和四个预训练模型,用于使用临床来源的约10,000张MRI图像数据集对多类脑肿瘤进行分类。这五种架构为定制CNN、VGG16、VGG19、DenseNet121和EfficientNetB0,它们都在相同的实验框架内进行了训练和测试。性能通过总体准确率和各肿瘤类别的召回率来衡量,以评估每种架构的临床相关性能。我们发现,EfficientNetB0的整体分类准确率最高,为95%,其余依次为VGG16(94.37%)、VGG19(92.29%)、DenseNet121(90.91%)和定制CNN(78.00%)。我们研究的一个特别重要的发现是,在检测脑膜瘤方面有显著改进;具体而言,简单的CNN可以以约20%的召回率检测脑膜瘤,而EfficientNetB0能够以89%的召回率检测脑膜瘤。脑膜瘤通常难以检测,因为它们在MRI图像上可能表现得非常微妙。此外,一个有趣的发现是,更深的VGG19性能不如较浅的VGG16。这表明,在处理医学图像时,CNN模型的架构效率可能比其深度更重要。总体而言,EfficientNetB0似乎在分类准确率、模型参数数量和临床有意义性能之间提供了最佳权衡。

英文摘要

Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, a customized baseline model and four pre-trained models, for classifying multi-class brain tumors using a clinically sourced dataset of approximately 10,000 MRI images. The five architectures are the customized CNN, VGG16, VGG19, DenseNet121, and EfficientNetB0, all trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to capture the clinically relevant performance of each architecture. EfficientNetB0 had the best overall classification accuracy at 95%, compared with VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding was the considerable improvement in detecting meningiomas: while simple CNNs could detect meningiomas with a recall of approximately 20%, EfficientNetB0 detected them with a recall of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, the deeper VGG19 performed worse than the shallower VGG16, which indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, number of model parameters, and clinically meaningful performance.

2606.18707 2026-06-18 cs.CV 新提交

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

PEFT-MedSAM:面向可解释皮肤病变分割的医学基础模型高效微调

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学计算机科学系) Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学人工智能系) Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi(信德伊斯兰学院大学卡拉奇城市校区计算机科学系) Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London(东伦敦大学建筑、计算与工程学院计算机科学与数字技术系)

AI总结 提出参数高效微调方法PEFT-MedSAM,冻结预训练编码器仅训练轻量解码器,在ISIC 2018上达到0.9411 Dice系数,并通过Grad-CAM可解释性增强临床可信度。

详情
AI中文摘要

使用深度学习模型对皮肤镜图像进行皮肤病变自动分割,有助于比常规检测更早发现黑色素瘤。然而,大多数现有的深度学习方法性能不佳。本文旨在提出一种名为PEFT-MedSAM的参数高效微调方法,用于适配医学分割一切模型(MedSAM)以自动分割皮肤镜皮肤病变。PEFT-MedSAM方法仅使用轻量级掩码解码器训练模型,同时保持预训练图像编码器和提示编码器冻结。在ISIC 2018基准数据集上的实验表明,与完全训练的U-Net基线(0.8715 Dice系数)和零样本MedSAM推理(0.8997 Dice系数)相比,PEFT-MedSAM获得了0.9411的Dice系数和0.8918的交并比。使用PH2数据集进行的外部验证显示Dice系数为0.9467,标准差为±0.0310。这些主张的支持证据包括比较两个数据集的Wilcoxon符号秩检验p值小于0.0001,以及bootstrap估计的95%置信区间[0.9364, 0.9447],该区间表示重复测试获得的平均Dice系数的估计范围。为了增加临床可信度,我们使用Grad-CAM可解释性以及基于指向游戏的评估方法,在验证集上评估CNN基线模型。结果表明,在包含519张图像的验证集上,准确率达到98.27%,并确认模型正确分类了包含皮肤病变的区域。

英文摘要

Automated segmentation of skin lesions in dermoscopic images using deep learning models can help detect melanomas earlier than they would normally be found. However, most deep learning methods available do not perform well. This paper presents a parameter-efficient fine-tuning method called PEFT-MedSAM for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection over union of 0.8918, compared with a fully trained U-Net baseline (0.8715 Dice coefficient) and zero-shot MedSAM inference (0.8997 Dice coefficient). External validation on the PH2 dataset shows a Dice coefficient of 0.9467 with a standard deviation of ±0.0310. Supporting evidence for these claims includes a p-value below 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [0.9364, 0.9447] for the average Dice coefficient. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed an accuracy of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
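For readers comparing the two overlap metrics reported above, the following sketch computes the Dice coefficient and intersection over union (IoU) for a pair of binary masks. It shows the standard definitions only; the flat-list mask format and the function name `dice_and_iou` are assumptions for illustration and say nothing about how the paper's evaluation code is organised.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two equal-length binary masks (flat lists)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement here
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou

pred = [1, 1, 1, 0]
target = [0, 1, 1, 1]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

For a single image the two metrics are tied by Dice = 2·IoU / (1 + IoU). Dataset-level figures are typically averages over images, so a reported mean Dice and mean IoU need not satisfy this identity exactly.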

2606.18723 2026-06-18 cs.CV cs.LG 新提交

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

发表机构 * AIM For Health Lab, Monash University(莫纳什大学AIM健康实验室) Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University(莫纳什大学信息技术学院数据科学与人工智能系) Monash University Victorian Heart Institute(莫纳什大学维多利亚心脏研究所) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) National Cerebral and Cardiovascular Center(国立循环器病研究中心) Department of Cardiology, Chonnam National University Hospital and Medical School(全南大学医院和医学院心脏病学系)

AI总结 提出GeoCat网络,通过双编码器与可微几何一致性损失,在IVUS分割中降低边界漂移和拓扑错误,提升临床几何测量精度。

Comments MICCAI2026 Accepted

详情
AI中文摘要

血管内超声(IVUS)管腔和外弹性膜(EEM)分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而,优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误,导致临床测量不准确。我们提出GeoCat,一个几何一致性网络,使用双笛卡尔-极坐标编码器,结合跨域注意力和时间融合,处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符,包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练,这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能,包括Dice/IoU、边界测量(95HD(mm)、ASSD)、拓扑违规率和临床几何误差(dmax/dmin、角度和面积)。在我们的数据集上,GeoCat实现了0.93的Dice,将95HD降低到0.14 mm,并将拓扑违规率降低到1.0%。重要的是,它显著提高了几何保真度,产生0.13-0.16 mm的直径误差和约8度的角度误差,支持可靠的斑块负荷量化。

英文摘要

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.

2606.18749 2026-06-18 cs.CV 新提交

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的免训练零样本异常检测:一种使用2D基础模型的基于批次的方法

Tai Le-Gia

发表机构 * Chungnam National University(忠南大学)

AI总结 提出CS3F框架,利用2D基础模型对3D医学图像进行零样本异常检测,通过沿多轴分解、切片编码和跨主体相似性计算异常分数,并引入粗到细的分词策略减少信号衰减。

详情
AI中文摘要

零样本异常检测(ZSAD)在医学成像中具有吸引力,因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的,它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F,一个无训练的基于批次的框架,用于3D医学图像中的ZSAD,使用2D基础模型。每个体积沿多个解剖轴分解,并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得:在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减,我们引入了一种粗到细的分词策略,无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估,并在肺部CT上验证其泛化能力,超越标准图谱对齐的脑部MRI。结果表明,冻结的2D基础模型可以支持3D医学图像中的异常定位,且细分词化的益处很大程度上取决于病灶对比度和成像模态。

英文摘要

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. These are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, as well as validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
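The cross-subject scoring idea in the abstract (a token with no close analogue in other subjects receives a higher anomaly score) can be sketched in a few lines. This is a deliberately simplified illustration, not the CS3F algorithm: it scores each token as one minus its best cosine similarity to any token from any other subject, and the names `cosine` and `anomaly_scores` as well as the toy 2D token vectors are assumptions of the example.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def anomaly_scores(batch):
    """Score every token by how poorly it matches tokens of other subjects.

    batch: list of subjects, each a list of token feature vectors.
    Score = 1 - max cosine similarity to any token of any *other* subject
    (a simplification chosen for this sketch).
    """
    if len(batch) < 2:
        raise ValueError("need at least two subjects for cross-subject scoring")
    scores = []
    for i, tokens in enumerate(batch):
        others = [tok for j, subj in enumerate(batch) if j != i for tok in subj]
        scores.append(
            [1.0 - max(cosine(tok, ref) for ref in others) for tok in tokens]
        )
    return scores

batch = [
    [[1.0, 0.0], [0.0, 1.0]],    # subject 0: two typical tokens
    [[1.0, 0.0], [0.0, 1.0]],    # subject 1: the same two patterns
    [[1.0, 0.0], [-1.0, 0.0]],   # subject 2: second token has no analogue
]
print(anomaly_scores(batch))
```

The scoring needs no training, only a batch of subjects to compare against, which matches the batch-based, training-free setting the abstract describes; the exhaustive comparison used here is exactly the cost that the paper's coarse-to-fine tokenization is said to avoid.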

2606.18753 2026-06-18 cs.CV 新提交

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART:一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology(伊利诺伊理工学院) Amsterdam University Medical Center(阿姆斯特丹大学医学中心) University College London(伦敦大学学院)

AI总结 提出SMART框架,通过解耦全局疾病动态与患者特定解剖表现,学习连续疾病时间图谱,实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情
AI中文摘要

我们介绍了SMART,一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型,缺乏灵活性、限制可解释性,并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战,该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下,SMART通过区域特异性微分方程,沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移,个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集(ADNI-1/GO/2、OASIS-3、AIBL;>1300名受试者)上评估,SMART产生了解剖学上有意义的疾病进展预测,并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18825 2026-06-18 cs.CV 新提交

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

DreamReg:基于信念驱动的世界模型用于2D-3D超声配准

Luoyao Kang, Yuelin Zhang, Jiwei Shan, Haifan Gong, Qingpeng Ding, Shing Shin Cheng

发表机构 * T Stone Robotics Institute, The Chinese University of Hong Kong(香港中文大学T Stone机器人研究所) Multi-scale Medical Robotics Center(多尺度医疗机器人中心) Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院)

AI总结 提出DreamReg框架,将2D-3D超声配准建模为信念更新,通过世界模型模拟探头运动并整合想象结果,在CAMUS和u-RegPro数据集上实现鲁棒且准确的实时配准。

详情
AI中文摘要

超声(US)广泛应用于手术导航,但由于部分可观测性、散斑噪声以及依赖于动作的US采集,术中2D切片与术前3D体积之间的实时配准仍然具有挑战性。现有方法是一次性的或短视的,难以随时间收集证据或捕捉外科医生如何根据屏幕反馈调整探头运动。我们提出DreamReg,一个基于信念驱动的世界模型框架,将2D-3D配准形式化为对刚性变换的信念更新。DreamReg维护一个潜在信念状态,总结过去的观测和位姿信息,并在新切片到达时通过学习到的动态不断细化变换。在训练期间,DreamReg暴露于模拟临床扫描行为的探头运动轨迹,并通过将位姿细化条件于当前US观测来学习更新其信念。在推理期间,DreamReg通过内部想象来细化配准:它展开学习到的世界模型以模拟候选探头运动及其预测的观测,并整合这些想象的结果以收敛到准确的刚性变换。在CAMUS和u-RegPro数据集上的实验表明,与最先进方法相比,DreamReg在实时引导中具有改进的鲁棒性和有竞争力的配准精度。

英文摘要

Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and the action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

2606.18860 2026-06-18 cs.CV cs.LG 新提交

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

医学图像分割中对抗模型的不确定性量化

Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(人工智能研究所、医学数据科学中心、维也纳医学大学,奥地利) Comprehensive Center for AI in Medicine, Medical University of Vienna, Austria(医学人工智能综合中心、维也纳医学大学,奥地利) ELLIS Unit Linz, LIT AI Lab and Institute for Machine Learning, Johannes Kepler University Linz, Austria(林茨ELLIS单位、LIT人工智能实验室和机器学习研究所、林茨约翰内斯·开普勒大学,奥地利) Institute for Machine Learning, Johannes Kepler University Linz, Austria(机器学习研究所、林茨约翰内斯·开普勒大学,奥地利) Clinical Research Center for Medical AI, Johannes Kepler University Linz, Austria(医学人工智能临床研究中心、林茨约翰内斯·开普勒大学,奥地利)

AI总结 提出QUAM-SM后处理框架,通过针对性对抗搜索识别脆弱像素,量化不确定性并分离认知与偶然不确定性,在公开数据集上优于现有方法。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

可靠的像素级不确定性量化具有通过实现高保真纵向监测和区分真实病理变化与伪影来改变临床工作流程的潜力。理想情况下,这些模型提供关键治疗计划和手术干预所需的稳定性。然而,标准深度学习模型常常遭受校准不良,产生过度自信的预测,掩盖了微妙病理边界处的潜在脆弱性。为了解决这个问题,我们提出了QUAM-SM,一种使用针对性对抗搜索来识别“对抗脆弱”像素的后处理框架。通过主动寻找暴露预测不稳定性的扰动,我们的方法突出了决策最容易被翻转的区域。重要的是,该框架将认知不确定性与偶然不确定性分离。在两个具有多个专家标注的公开数据集上的实验表明,QUAM-SM在可靠性和边界敏感性方面优于标准和最新的不确定性估计方法。代码可在以下网址获取:https://github.com/HanaJebril/quam_sm

英文摘要

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm

2606.18869 2026-06-18 cs.CV 新提交

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

学习扭曲:用于前列腺DWI校正的弱监督图像质量迁移

YuCheng Tang, Wen Yan, Alexander Ng, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, David Atkinson, Shonit Punwani, Daniel Alexander, Shaheer Ullah Saeed, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL Hawkes研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学分会) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

AI总结 提出弱监督图像质量迁移框架,利用图像质量评估信号从无失真图像学习生成真实失真,并训练校正模型,在PI-RADS和Gleason评分分类任务中优于现有无配对方法。

详情
AI中文摘要

单次激发平面回波前列腺弥散加权成像(DWI)常因几何失真而复杂化,影响从这些图像中获得可靠诊断的能力。开发自动化校正方法面临缺乏配对的失真和未失真临床扫描的挑战。本文首先提出一种新颖的弱监督图像质量迁移(IQT)框架,从无失真图像到失真图像,利用图像质量评估(IQA)信号监督迁移过程。与传统方法需要昂贵的体素级配对数据或采用无配对算法不同,我们的方法利用图像级质量标签(此处为失真与无失真)在预训练特征空间中建立潜在质量原型。认识到模拟真实失真比直接无配对校正更可靠,我们描述了一种弱监督原型流匹配算法,显式正则化生成轨迹朝向失真原型,产生模拟临床退化的真实磁敏感伪影。通过合成这些真实配对,我们能够训练第二个IQT模型进行正向失真校正。实验结果表明,我们生成的图像成功模拟了真实伪影的诊断干扰,从而产生更强大的失真校正IQT模型。除定性比较外,我们还通过评估临床下游任务性能(PI-RADS和Gleason评分分类),使用分布内和外部数据集,将我们的方法与现有无配对方法(如CycleGAN、UNIT-DDPM和OT-FM)作为正向或反向替代方案进行详尽的定量评估。

英文摘要

Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

2606.18872 2026-06-18 cs.CV 新提交

Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

桥接单一失真伪影与多因素临床质量:基于失真训练的原型网络的少样本双参数MRI质量评估

Yuheng Tang, Alexander Ng, Wen Yan, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel Alexander, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL Hawkes研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学分会) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre of Medical Imaging, Division of Medicine(医学成像中心,医学分会) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

AI总结 提出一种少样本双参数原型网络,利用失真标签元训练,通过特征融合和域对齐,仅用5个样本即可预测PI-QUAL临床质量评分,解决临床数据稀缺问题。

详情
AI中文摘要

临床前列腺多参数MRI高度依赖高质量扩散加权成像(DWI),但DWI读图常因几何失真(通常由直肠气体引起)而受损。通过PI-QUAL评分系统评估质量是新兴的临床标准,但该方法主观、耗时,且存在类别不平衡问题,其中低质量病例多样且相对稀少。以PRIME临床试验为例,6%的图像PI-QUAL评分低于4,87%的DWI问题源于失真,许多其他临床质量问题代表性不足。为解决这种标注临床数据的双重稀缺性,我们提出了一种用于自动图像质量评估(IQA)的少样本双参数原型网络。我们的框架利用双分支3D ResNet融合T2加权和DWI特征,提供解剖背景以区分真实形态与失真。为处理现实异质性,我们引入特征级线性调制(FiLM)和梯度反转层(GRL),以对齐基于不同b值的特征分布,同时抑制采集相关偏差。我们证明,仅基于相对客观、易于获取的失真标签进行元训练的模型,能够仅使用五个代表性样本有效适应预测复杂的多因素临床质量评分(如PI-QUAL)。在两个数据集上的实验结果表明,我们的方法在此具有挑战性的IQA任务中显著优于少样本学习基线,为临床工作流程中标准化前列腺MRI质量控制提供了实际可行且数据高效的解决方案。

英文摘要

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming, and suffers from a class imbalance in which low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, $6\%$ of images have PI-QUAL scores lower than 4, and $87\%$ of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual-scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.
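
The few-shot step can be sketched as a generic prototypical-network classifier: each class prototype is the mean of a handful of support embeddings, and a query takes the label of the nearest prototype. This is a minimal sketch that assumes embeddings are already computed; the dual-branch 3D ResNet, FiLM and GRL components are not modelled, and the labels and vectors below are hypothetical.

```python
def prototype(vectors):
    """Mean embedding of the support samples of one class."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(query, support):
    """support: dict mapping label -> list of embedding vectors.

    Returns the label whose prototype is closest to the query
    in squared Euclidean distance.
    """
    protos = {label: prototype(vecs) for label, vecs in support.items()}

    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(protos, key=lambda label: sqdist(query, protos[label]))

# Hypothetical 2D embeddings for two quality classes.
support = {
    "adequate": [[0.0, 0.0], [0.2, 0.1]],
    "inadequate": [[1.0, 1.0], [0.9, 1.1]],
}
label_a = classify([0.1, 0.0], support)
label_b = classify([0.8, 1.0], support)
```

With five support samples per class, as in the abstract, only the length of each list in `support` changes.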

2606.18876 2026-06-18 cs.CV cs.LG 新提交

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

光学相干断层扫描中基于轨迹对齐的时间无关流的测试时自适应

Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(人工智能研究所、医学数据科学中心、维也纳医学大学,奥地利) Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria(医学人工智能综合中心、维也纳医学大学,奥地利) Department of Ophthalmology and Optometry, Medical University of Vienna, Austria(眼科与视光学部、维也纳医学大学,奥地利) Laboratory for Ophthalmic Image Analysis, Medical University of Vienna, Austria(眼科图像分析实验室、维也纳医学大学,奥地利)

AI总结 提出一种基于流匹配的测试时自适应方法,通过直方图匹配和去除时间条件,生成高质量替代图像,在AMD分割中达到最优性能。

Comments Accepted in MICCAI

详情
AI中文摘要

光学相干断层扫描(OCT)在眼科中至关重要,但图像质量不一致,尤其是在低成本设备中,阻碍了自动化分析。为了解决这个问题,我们引入了一种基于流匹配的测试时自适应方法,从噪声输入生成高质量替代图像。通常,测试数据和训练数据之间的域差距会导致去噪过程中像素分布不匹配。我们通过将测试图像的直方图与合成参考轨迹匹配来克服这一问题,成功地将输入与预期分布对齐。此外,我们移除了网络的时间条件,以考虑真实世界噪声分布的轻微偏差。我们的方法在分割年龄相关性黄斑变性(AMD)两个阶段的关键生物标志物方面达到了最先进的性能。代码地址:https://github.com/Veit21/tta-flow

英文摘要

Optical coherence tomography (OCT) is essential in ophthalmology, but inconsistent image quality, especially in low-cost devices, hinders automated analysis. To address this, we introduce a flow-matching-based test-time adaptation method that generates high-quality surrogate images from noisy inputs. Typically, domain gaps between test and training data cause pixel distribution mismatches during the denoising process. We overcome this by matching the test image's histogram to synthetic reference trajectories, successfully aligning the input with expected distributions. Additionally, we remove the network's time conditioning to account for slight deviations in real-world noise distributions. Our approach achieves state-of-the-art performance in segmenting critical biomarkers for two stages of Age-related Macular Degeneration (AMD). Code is available at https://github.com/Veit21/tta-flow
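
The histogram-matching step mentioned in the abstract can be illustrated with plain quantile matching against a single reference set of intensities. This is a generic sketch, not the authors' trajectory-aligned procedure: images are assumed to be flattened into lists of pixel values, and `match_histogram` is a hypothetical helper name.

```python
from bisect import bisect_right

def match_histogram(source, reference):
    """Map each source intensity to the reference intensity at the same quantile.

    source, reference: flat, non-empty lists of pixel values.
    The output has the same length and pixel order as `source`.
    """
    src_sorted = sorted(source)
    ref_sorted = sorted(reference)
    n_src, n_ref = len(src_sorted), len(ref_sorted)
    out = []
    for v in source:
        count = bisect_right(src_sorted, v)  # number of source pixels <= v
        # Smallest reference rank whose empirical CDF reaches count / n_src,
        # computed with integer arithmetic to avoid rounding issues.
        idx = (count * n_ref + n_src - 1) // n_src - 1
        out.append(ref_sorted[idx])
    return out

# Hypothetical intensities: a dark test image mapped onto a brighter reference.
matched = match_histogram([30, 0, 20, 10], [100, 110, 120, 130])
```

The mapping is monotone, so the rank order of pixels is preserved while their value distribution follows the reference.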

2606.18886 2026-06-18 cs.CV 新提交

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D:通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出两阶段渐进框架DINO-Med3D,通过多切片嵌入模块、3D适配器和并行细节恢复流,将DINOv3适配到3D医学分割,在五个数据集上超越现有方法。

Comments Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

详情
AI中文摘要

尽管DINOv3在自然图像中展现了显著的语义判别能力,但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题,我们提出DINO-Med3D,一个两阶段渐进框架,将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段,我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距,同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后,我们通过在冻结的主干中添加轻量级3D适配器来增强体理解,以强制执行全局切片间连续性。最后,为补偿嵌入过程中固有的空间信息损失,我们设计了一个并行细节恢复流,以显式保留高频边界线索。在五个公共数据集上的大量实验表明,我们的方法成功地将DINOv3适应到医学领域,并显著优于最先进的基线方法。

英文摘要

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

2606.18894 2026-06-18 cs.CV 新提交

Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

基于最短路径的碳纤维增强聚合物显微图像自动铺层分析

Jonas Naumann, Jonas P. Appels, Julius Biermann, Christopher Gorsky, Timo de Wolff, Christoph Brauer

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Lightweight Systems(轻质系统研究所) Composite Process Technologies(复合材料加工技术) Institute of Analysis and Algebra(分析与代数研究所)

AI总结 提出一种自动方法,通过将语义分割掩码视为图并应用最短路径算法区分铺层实例,实现高分辨率CFRP显微图像的铺层分割与定量分析。

详情
AI中文摘要

我们提出了一种自动方法,用于在高分辨率碳纤维增强聚合物显微图像的语义分割掩码中区分铺层实例。将分割掩码解释为以像素为顶点的图,使我们能够使用最短路径算法生成铺层分隔路径。从而,我们利用全局信息弥合了语义分割和铺层实例分割之间的差距。我们成功地将该方法应用于具有广泛特征的高分辨率显微图像,例如单层或多层中人为添加的间隙、不同的堆叠顺序以及贯穿铺层的裂纹。基于计算出的路径将每个纤维像素分配给一个铺层,可以对其微观结构特性(如局部纤维体积分数以及局部分辨的铺层和中间层厚度)进行全面的定量铺层分析。这些见解有助于揭示制造引起的不均匀性,得出关于制造参数的结论,并将力学性能与潜在的微观结构缺陷联系起来。

英文摘要

We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest-path algorithm that yields the ply-separating paths. Thereby, we bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach to high-resolution micrographs featuring a broad range of characteristics, such as artificially added gaps in single or multiple plies, different stacking sequences, and ply-traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to microstructural properties such as the local fiber volume fraction as well as locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions on manufacturing parameters, and link mechanical properties to underlying microstructural imperfections.
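
The mask-as-graph idea can be sketched with Dijkstra's algorithm on a 4-connected pixel grid, searching for the cheapest path from the left image border to the right one. The cost assignment (fiber pixels expensive, matrix pixels cheap) and the function name are assumptions for illustration; the abstract does not state the edge weights or the connectivity the authors use.

```python
import heapq

def separating_path(cost):
    """cost: 2D list of positive per-pixel costs.

    Returns the cheapest 4-connected path, as a list of (row, col),
    from any pixel in the left column to any pixel in the right column.
    """
    rows, cols = len(cost), len(cost[0])
    dist = {(r, 0): cost[r][0] for r in range(rows)}
    prev = {}
    heap = [(cost[r][0], r, 0) for r in range(rows)]
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue  # stale queue entry
        if c == cols - 1:
            # First right-border pixel popped is optimal; walk back to the start.
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, nr, nc))
    return []

# Hypothetical mask: 1 = fiber pixel, 0 = matrix (interleaf) pixel.
mask = [
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
cost = [[10 if px else 1 for px in row] for row in mask]
path = separating_path(cost)
```

The returned path follows the matrix channel between the two fiber regions, so fiber pixels above and below it can be assigned to different plies.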

2606.19215 2026-06-18 cs.CV 新提交

GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

GUMP-Net: 一种用于多类盆腔分割的可解释模型-数据驱动智能算法

Liheng Wang, Yinghui Zhang, Licheng Zhang, Hailin Xu, Qiyong Cao, Chong Chen

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院数学科学国家重点实验室) University of Chinese Academy of Sciences(中国科学院大学) Department of Orthopedics, The Fourth Medical Center of Chinese PLA General Hospital(中国人民解放军总医院第四医学中心骨科) National Clinical Research Center for Orthopedics, Sports Medicine and Rehabilitation(国家骨科与运动康复临床医学研究中心) Department of Trauma and Orthopedics, People’s Hospital Peking University(北京大学人民医院创伤骨科) Department of Orthopedics and Traumatology, Beijing Jishuitan Hospital, Capital Medical University(首都医科大学附属北京积水潭医院骨科)

AI总结 提出GUMP-Net,结合改进测地线活动轮廓模型与深度神经网络,实现多类盆腔分割,在小训练数据下表现更优,并提供可解释几何视角。

Comments 26 pages, 8 figures, 3 tables

详情
AI中文摘要

盆腔分割是盆腔骨折精准智能诊疗及手术规划导航中最重要和基础的研究问题之一。通过将改进的测地线活动轮廓模型与深度神经网络相结合,我们提出了GUMP-Net,一种用于多类盆腔分割的可解释模型-数据驱动智能算法,其中设计了三个网络模块共同构成整体分割框架:用于自动水平集初始化的目标检测模块、用于学习解剖感知边缘检测函数的边缘检测器模块以及用于深度水平集演化的迭代模块。利用水平集表示和深度学习的优势,GUMP-Net在分割性能上比最先进的方法更准确、鲁棒和一致,尤其是在小训练数据情况下。在盆腔数据集上的大量实验证明了所提算法的合理性和有效性。扩展到踝关节数据集的进一步实验表明其对其他解剖结构具有更广泛的应用。所提算法不仅为复杂骨折复位提供了高效的分割方法,而且为理解深度学习分割提供了可解释的几何视角。

英文摘要

Pelvic segmentation is one of the most important and fundamental research problems in precise and intelligent diagnosis and treatment, as well as surgical planning and navigation for pelvic fractures. By combining an improved geodesic active contour model with deep neural networks, we propose GUMP-Net, an interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation, in which three network modules together constitute the overall segmentation framework: an object detection module for automatic level set initialization, an edge detector module for learning an anatomy-aware edge detector function, and an iteration module for deep level set evolution. Leveraging the advantages of level set representation and deep learning, GUMP-Net shows more accurate, robust, and consistent segmentation performance than state-of-the-art methods, especially when training data are scarce. Extensive experiments on pelvic datasets demonstrate the rationality and effectiveness of the proposed algorithm. Further experiments on an ankle dataset indicate broader applicability to other anatomies. The proposed algorithm not only provides an efficient segmentation method for complex fracture reduction, but also gives an interpretable geometric perspective for understanding deep learning segmentation.

2606.19300 2026-06-18 cs.CV cs.LG 新提交

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

置信度不等于可靠性:重新思考脑肿瘤分割中的MC Dropout

Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar

发表机构 * Centre for Doctoral Training in AI for Medical Diagnosis and Care, School of Computing, University of Leeds(利兹大学计算机学院人工智能医学诊断与护理博士培训中心) School of Computer Science, University of Leeds(利兹大学计算机科学学院)

AI总结 通过MC Dropout不确定性估计,发现全局不确定性-误差对齐(AUROC≈0.97)可能掩盖关键子区域(如增强肿瘤)的严重误校准(ECE=0.915),表明子区域校准评估对临床安全至关重要。

Comments Accepted for MIUA2016

详情
AI中文摘要

多参数MRI中的胶质瘤分割是治疗计划的关键组成部分。一个在治疗关键子区域上静默失败的分割模型会带来患者安全风险,而Dice分数等基于重叠的指标无法暴露这种风险。我们探究通过蒙特卡洛(MC)Dropout进行的体素级不确定性估计能否可靠地识别临床关键子区域中的分割错误,以及校准失败模式是否仅从标准报告指标中可检测。在126名BraTS21患者的两模型实证案例研究中,我们评估了高性能预训练SegResNet和本地训练的带有残差单元的UNet(UNet-Res)。MC dropout保持了分割准确性($|\Delta \text{Dice}|$ $<0.01$),同时实现了强不确定性-误差对齐(熵(H)的AUROC $\approx$0.97),表明不确定性正确地将错误体素排在正确体素之上。基于熵的患者分层识别出一个高不确定性亚组,其分割性能显著较低(全肿瘤Dice中位数$0.835$ vs. $0.925$),支持不确定性作为实用的分诊信号。然而,全局对齐可能掩盖重要的区域特异性差异。尽管AUROC相似,UNet-Res在增强肿瘤熵上接近零($0.054$),期望校准误差(ECE)为$0.915$,Dice仅为$0.714$,表明在最临床关键子区域上置信度严重误校准,这是标准Dice和AUROC报告无法发现的失败模式。这些发现表明,强不确定性-误差对齐对于临床安全是必要但不充分的:在选择临床部署模型时,子区域特异性校准评估必须伴随AUROC评估。

英文摘要

Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC dropout preserved segmentation accuracy ($|\Delta \text{Dice}|$ $<0.01$) while achieving strong uncertainty-error alignment (AUROC for entropy (H) $\approx$0.97), indicating uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice $0.835$ vs. $0.925$), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy ($0.054$) and Expected Calibration Error (ECE) of $0.915$, with a Dice of only $0.714$, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.
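
The two quantities contrasted in the abstract, predictive entropy and Expected Calibration Error, can be computed as in the sketch below. It assumes a binary foreground/background simplification with equal-width confidence bins; the binning scheme, the helper names and the toy probabilities are illustrative and are not taken from the paper.

```python
import math

def mc_mean(samples):
    """Average T stochastic forward passes (each a list of per-voxel probabilities)."""
    t = len(samples)
    return [sum(col) / t for col in zip(*samples)]

def binary_entropy(p):
    """Predictive entropy (in nats) of a foreground probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def ece(probs, labels, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1 - p
        idx = min(n_bins - 1, int(conf * n_bins))
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            total += len(b) / len(probs) * abs(avg_conf - accuracy)
    return total

# Hypothetical voxels that are confidently predicted as foreground but are background.
probs = [0.99, 0.99, 0.99, 0.99]
entropy = [binary_entropy(p) for p in probs]
miscalibration = ece(probs, [0, 0, 0, 0])
```

The toy case reproduces the failure mode qualitatively: entropy stays low because the model is confident, while ECE is large because that confidence is wrong.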

2606.15604 2026-06-18 eess.IV cs.CV 新提交

Parameter-Efficient Adaptation of SAM 3 for Automated ITV Generation from 4DCT Images

基于参数高效微调SAM 3从4DCT图像自动生成内靶区

Changwoo Song

发表机构 * Oncosoft Inc.(Oncosoft公司) Department of Computer Science & Engineering, Chungnam National University(忠南大学计算机科学与工程系)

AI总结 提出轻量框架,通过LoRA参数高效微调SAM 3,结合硬负样本挖掘和相位相干滤波,仅用7个标注体数据实现高精度内靶区自动生成,中位Dice达0.968。

Comments 10 pages, 4 figures, 2 tables

详情
AI中文摘要

四维计算机断层扫描(4DCT)捕获了胸部解剖结构的完整呼吸周期,然而当前的内靶区勾画流程孤立处理每个相位,丢弃了时间相干性,使轮廓易受相位特定伪影影响。我们提出一个轻量框架,通过低秩适应(LoRA)对Segment Anything Model 3(SAM 3)进行参数高效微调,仅使用七个标注的3D CT体数据,将其文本提示分割与医学领域对齐。此外,该框架结合了硬负样本挖掘策略,以改善低对比度胸部区域的边界判别。在推理时,通过相位相干时间滤波和空间连通性分析细化逐相位预测。由于呼吸运动是连续且周期性的,真实解剖结构出现在连续的相位块中,而瞬态伪影零星出现,因此被有效抑制。在肺部和心脏结构上的实验分别产生中位Dice分数0.968和0.910,95百分位Hausdorff距离分别为0.998 mm和2.931 mm。所提框架有效消除了未适应SAM 3零样本推理中固有的严重假阳性预测。仅用七个标注体数据,框架保留了超过95%的全数据准确率,且整个流水线可在单个消费级GPU上训练,展示了自适应放疗中可扩展、数据高效的解决方案。

英文摘要

Four-dimensional computed tomography (4DCT) captures the full respiratory cycle of thoracic anatomy, yet current Internal Target Volume contouring workflows process each phase in isolation, discarding temporal coherence and leaving contours vulnerable to phase-specific artifacts. We present a lightweight framework that applies parameter-efficient fine-tuning to the Segment Anything Model 3 (SAM 3) via low-rank adaptation (LoRA) to align its text-prompted segmentation with the medical domain using only seven annotated 3D CT volumes. Furthermore, the framework incorporates a hard negative mining strategy to improve boundary discrimination in low-contrast thoracic regions. At inference, phase-wise predictions are refined through phase-coherent temporal filtering and spatial connectivity analysis. Since respiratory motion is continuous and periodic, genuine anatomy appears in contiguous blocks of phases, whereas transient artifacts appear sporadically and are thus effectively suppressed. Experiments on pulmonary and cardiac structures yield median Dice scores of 0.968 and 0.910 with 95th-percentile Hausdorff distances of 0.998 mm and 2.931 mm, respectively. The proposed framework effectively eliminates the severe false-positive predictions inherent in the zero-shot inference of the unadapted SAM 3. With only seven annotated volumes, the framework retains over 95% of full-data accuracy, and the entire pipeline is trainable on a single consumer-grade GPU, demonstrating a scalable, data-efficient solution for adaptive radiotherapy.
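
The phase-coherence argument (genuine anatomy occupies contiguous phase blocks, transient artifacts appear sporadically) can be sketched per voxel as a cyclic run-length filter. The minimum run length of three phases and the function name are assumptions; the abstract does not give the actual filter parameters.

```python
def filter_phases(presence, min_run=3):
    """presence: list of 0/1 flags, one per respiratory phase, for a single voxel.

    Phases are treated as cyclic. A detection is kept only if it belongs
    to a run of at least `min_run` consecutive phases.
    """
    n = len(presence)
    if all(presence):
        return list(presence) if n >= min_run else [0] * n
    kept = [0] * n
    start = presence.index(0)  # begin after a gap so wrapped runs stay intact
    run = []
    for k in range(1, n + 1):
        i = (start + k) % n
        if presence[i]:
            run.append(i)
        else:
            if len(run) >= min_run:
                for j in run:
                    kept[j] = 1
            run = []
    return kept

# Hypothetical 10-phase voxel: a wrapped run over phases 8, 9, 0, 1 and one isolated hit.
filtered = filter_phases([1, 1, 0, 0, 1, 0, 0, 0, 1, 1])
in_itv = any(filtered)  # the voxel joins the ITV if any filtered phase remains
```

Taking the union of the filtered phase masks over all voxels then gives a target volume that ignores single-phase false positives.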

2606.18523 2026-06-18 q-bio.QM cs.CV 交叉投稿

DART: A design-aware microfluidic chip paradigm for real-time live-cell image analysis

DART: 一种设计感知的微流控芯片范式用于实时活细胞图像分析

Johannes Seiffarth, Matthias Pesch, Lukas Scholtes, Dietrich Kohlheyer, Hanno Scharr, Katharina Nöh

发表机构 * Institute for Bio- and Geosciences, IBG-1: Biotechnology(生物与地质科学研究所,IBG-1:生物技术) Computational Systems Biotechnology (AVT.CSB), RWTH Aachen University(计算系统生物技术(AVT.CSB),亚琛工业大学) Institute for Advanced Simulation, IAS-8: Data Analytics and Machine Learning(先进模拟研究所,IAS-8:数据分析与机器学习)

AI总结 提出DART范式,通过嵌入式标记和深度学习检测对齐CAD蓝图与物理芯片,实现高通量微流控芯片中所有感兴趣区域的快速定位和全自动图像处理,支持实时分析。

详情
AI中文摘要

高通量微流控活细胞成像产生丰富的单细胞数据。然而,用于定位每个包含一个细胞群体的感兴趣区域(RoI)并从记录图像中移除周围微流控结构的半自动化流程随RoI数量扩展,这阻碍了实时图像分析并将洞察时间延迟数小时至数天。我们提出了用于微流控培养芯片的设计感知和实时能力(DART)范式,该范式将CAD蓝图与物理芯片对齐,从而实现了对所有RoI的通量无关定位以及跨不同RoI几何形状和芯片布局的全自动图像处理。DART通过嵌入式基准标记和基于深度学习的标记检测建立这种对齐。我们使用瑞士军刀芯片验证DART,该芯片在1164个RoI位置上组合了八种结构不同的RoI设计。DART在五分钟内定位所有RoI,在40毫秒内从原始显微镜图像中移除微流控结构,并在每张图像1.1秒内执行全自动图像分析,包括细胞分割。这些能力共同使DART成为一个端到端的硬件-软件范式,具有实时分析能力,为闭环和结果驱动的智能显微镜铺平了道路。

英文摘要

High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.
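
Aligning a CAD blueprint with the imaged chip from detected fiducial markers can be illustrated by a least-squares 2D similarity transform (rotation, uniform scale, translation). The sketch encodes points as complex numbers so that the fit has a closed form. The transform family, the function names and the marker coordinates are assumptions, since the abstract does not describe the alignment model.

```python
def fit_similarity(cad_pts, img_pts):
    """Least-squares fit of q = a * p + b with complex a (rotation and scale) and b (shift).

    cad_pts, img_pts: equally long lists of (x, y) marker positions,
    with at least two distinct CAD points.
    """
    p = [complex(x, y) for x, y in cad_pts]
    q = [complex(x, y) for x, y in img_pts]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    b = q_mean - a * p_mean
    return a, b

def cad_to_image(pt, a, b):
    """Map a CAD coordinate (for example an RoI corner) into image coordinates."""
    z = a * complex(pt[0], pt[1]) + b
    return (z.real, z.imag)

# Hypothetical markers: the chip appears rotated by 90 degrees, scaled by 2, shifted by (5, 3).
cad_markers = [(0, 0), (10, 0), (0, 10)]
img_markers = [(5, 3), (5, 23), (-15, 3)]
a, b = fit_similarity(cad_markers, img_markers)
roi_in_image = cad_to_image((5, 5), a, b)
```

Once `a` and `b` are known, every RoI position stored in the blueprint maps to image coordinates with one multiplication and one addition, which matches the throughput-independent localization the abstract describes.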

2606.18970 2026-06-18 cs.LG cs.AI cs.CV 交叉投稿

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics(数学系) Department of Political and Social Sciences(政治与社会科学系)

AI总结 通过受控基准测试,比较量子与经典生成器在脑MRI数据增强中的性能,发现两者均未显著优于仅用真实数据训练,且量子生成器无额外优势。

Comments This work has been submitted to the IEEE for possible publication.

详情
AI中文摘要

医学图像分类常受限于有限的标注数据,因此生成式增强被提出;最近,量子生成模型被用于此目的,并经常报告准确率提升。然而,这些声称通常基于单次训练运行,未匹配量子与经典生成器的参数预算,也未表征任何收益出现的数据范围。我们提出了一个受控基准测试,隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中,在该空间中,使用变分量子生成器或参数数量几乎相同的经典生成器(1648 vs. 1632)训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器,覆盖从5%到100%的标注数据比例,通过八个随机种子进行配对显著性检验(多重比较校正)以及集内多样性和潜在分布分析。在所有比例下,没有增强变体显著优于仅用真实数据训练,且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展:合成样本分布外移,并且在数据稀缺时严重模式崩溃,而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
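
The abstract mentions a multiple-comparison correction without naming the procedure. As one possible illustration, the sketch below applies the Holm-Bonferroni step-down adjustment to a set of p-values; the choice of Holm is an assumption, and the p-values are hypothetical placeholders rather than results from the paper.

```python
def holm_adjust(pvalues):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank k-1) is scaled by (m - k + 1);
        # the running maximum keeps the adjusted values monotone.
        running = max(running, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running
    return adjusted

# Hypothetical raw p-values for three comparisons (one per data fraction).
adjusted = holm_adjust([0.01, 0.04, 0.03])
significant = [p < 0.05 for p in adjusted]
```

A comparison is declared significant only if its adjusted p-value stays below the chosen level, which guards against spurious wins when many data fractions are tested.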

2510.10779 2026-06-18 cs.CV 版本更新

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

结构化谱图表示学习用于3D CT扫描的多标签异常分析

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

发表机构 * INSA Lyon, University of Lyon, CNRS, INSERM, CREATIS UMR 5220, U1294(里昂国立应用科学学院、里昂大学、国家科学研究中心、法国国家医学研究院、CREATIS UMR 5220、U1294)

AI总结 提出一种基于谱图卷积的2.5D框架,将3D CT体积表示为结构化图,通过轴向切片三元组节点建模层间依赖,实现多标签异常分类,跨数据集泛化性能强。

Comments Accepted at MELBA Journal 2026

详情
AI中文摘要

随着CT检查数量的增长,对器官分割、异常检测和报告生成等自动化工具的需求日益增加,以支持放射科医生管理临床工作负载。由于三维数据中固有的复杂空间关系和异常的广泛变异性,3D胸部CT扫描的多标签分类仍然是一个关键但具有挑战性的问题。基于3D卷积神经网络的现有方法难以捕捉长距离依赖,而视觉Transformer通常需要在大规模领域特定数据集上进行大量预训练才能获得竞争力。在这项工作中,我们提出了一种2.5D替代方案,引入了一个新的基于图的框架,将3D CT体积表示为结构化图,其中轴向切片三元组作为节点,通过谱图卷积处理,使模型能够推理层间依赖,同时保持与临床部署兼容的复杂度。我们的方法在来自独立机构的3个数据集上进行训练和评估,实现了强大的跨数据集泛化能力,并与最先进的视觉编码器相比表现出竞争性能。我们进一步进行了全面的消融研究,以评估各种聚合策略、边加权方案和图连接模式的影响。此外,我们通过自动放射学报告生成和腹部CT数据的迁移实验展示了我们方法的更广泛适用性。

英文摘要

With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.
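
The idea of treating axial slice triplets as graph nodes and propagating information between them can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' architecture: the non-overlapping triplet grouping, the chain connectivity, and the GCN-style first-order filter D^{-1/2}(A + I)D^{-1/2} are all assumptions, and every function name is hypothetical.

```python
import math


def slice_triplets(depth):
    """Group axial slice indices into consecutive, non-overlapping triplets (one node each).

    Trailing slices that do not fill a complete triplet are dropped.
    """
    return [(i, i + 1, i + 2) for i in range(0, depth - 2, 3)]


def chain_adjacency(n):
    """Connect each node to its immediate neighbours along the axial direction."""
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        adj[i][i + 1] = adj[i + 1][i] = 1.0
    return adj


def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]


def propagate(norm_adj, features):
    """One parameter-free propagation step: each node mixes in its neighbours' features."""
    n, dim = len(features), len(features[0])
    return [[sum(norm_adj[i][k] * features[k][d] for k in range(n)) for d in range(dim)]
            for i in range(n)]
```

Stacking several such steps lets information travel across distant slices, which is the inter-slice reasoning the abstract refers to; a learned weight matrix and nonlinearity would follow each step in a trainable layer.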

2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态:基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

发表机构 * University of Cambridge(剑桥大学) Nanjing First Hospital(南京第一医院) Nanjing Medical University(南京医科大学) Johns Hopkins University(约翰霍普金斯大学) University of Dundee(邓迪大学)

AI总结 提出Δ-LFM框架,利用流匹配对齐患者潜在轨迹,通过患者特异性潜在对齐实现单调疾病进展建模,在三个纵向MRI基准上验证了可解释性和性能。

Comments ICLR 2026 accepted

详情
AI中文摘要

理解疾病进展是一项核心临床挑战,对早期诊断和个性化治疗具有直接意义。虽然最近的生成方法试图对进展进行建模,但关键的不匹配仍然存在:疾病动态本质上是连续且单调的,然而潜在表示通常是分散的,缺乏语义结构,并且基于扩散的模型因其随机去噪过程而破坏了连续性。在这项工作中,我们提出将疾病动态视为速度场,并利用流匹配(FM)来对齐患者数据的时间演变。与先前方法不同,它捕捉了疾病的内在动态,使进展更具可解释性。然而,一个关键挑战仍然存在:在潜在空间中,自动编码器(AE)不能保证跨患者的对齐或与临床严重性指标(例如年龄和疾病状况)的相关性。为了解决这个问题,我们提出学习患者特异性潜在对齐,约束患者轨迹沿着特定轴延伸,其幅度随疾病严重程度单调增加。这带来了一个一致且语义上有意义的潜在空间。综上,我们提出了Δ-LFM,一个通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上,Δ-LFM展示了强大的实证性能,更重要的是,为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which constrains patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
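
Treating progression as a velocity field can be made concrete with the straight-line path commonly used in flow matching. The sketch below is an assumption for illustration only: the abstract does not state which probability path $Δ$-LFM uses, and both function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus the constant target velocity."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return xt, velocity


def euler_integrate(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In training, a network would regress the target velocity at the sampled point `xt`; at inference, integrating the learned field from an early latent code moves it deterministically toward a later disease state, which is the continuity that a random denoising chain lacks.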

2512.10353 2026-06-18 cs.CV 版本更新

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

混合Transformer-Mamba用于弱监督体积医学分割

Yiheng Lyu, Lian Xu, Coen Arrow, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi

发表机构 * University of Western Australia(西澳大学) Harry Perkins Institute of Medical Research(哈里·珀金斯医学研究所) National Imaging Facility(国家成像设施) Fiona Stanley Hospital(菲奥娜·斯坦利医院) Victor Chang Cardiac Research Institute(张任谦心脏研究所)

AI总结 提出TranSamba混合架构,通过跨平面建模捕获3D上下文,在弱监督下实现高效体积分割,在三个数据集上达到最优性能。

详情
AI中文摘要

弱监督分割使得模型能够从平面级标签进行训练。现有方法通常依赖2D编码器,忽略了医学数据的体积特性。我们提出TranSamba,一种混合Transformer-Mamba架构,旨在通过跨平面建模捕获3D上下文。TranSamba在Vision Transformer骨干网络基础上增加跨平面Mamba块,利用线性时间建模实现相邻平面间的高效信息交换。这种交换改善了平面内自注意力以及后续用于目标定位的注意力图。TranSamba在输入体积深度上保持线性时间复杂度和恒定空间复杂度。在涵盖不同模态和病理的三个数据集上的大量实验表明,TranSamba达到了最先进的性能,展示了跨平面建模的泛化有效性。代码可在以下网址获取:https://github.com/YihengLyu/TranSamba

英文摘要

Weakly supervised segmentation enables model training from plane-level labels. Existing methods often rely on 2D encoders, neglecting the volumetric nature of medical data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context via cross-plane modeling. TranSamba augments a Vision Transformer backbone with Cross-Plane Mamba blocks, leveraging linear-time modeling for efficient information exchange across neighboring planes. This exchange improves in-plane self-attention and subsequent attention maps for object localization. TranSamba maintains linear time complexity and constant space complexity with respect to the input volume depth. Extensive experiments on three datasets covering diverse modalities and pathologies show that TranSamba achieves state-of-the-art performance, demonstrating the generalizable efficacy of cross-plane modeling. Code is available at: https://github.com/YihengLyu/TranSamba.

2606.00491 2026-06-18 cs.CV cs.AI 版本更新

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

CT分割系统的部署前鲁棒性压力测试:使用临床驱动的多损坏增强

CholMin Kanga, Jonghyun Chung, Amanpreet Kaur, Nagesh Gulkotwar, Aarthi Sivasankaran

发表机构 * Seoul National University(首尔国立大学) Google Inc.(谷歌公司)

AI总结 提出RAMP框架,通过多损坏增强提升CT分割模型在临床异质成像条件下的鲁棒性,显著缩小干净与损坏图像性能差距。

详情
AI中文摘要

基于深度学习的CT分割系统在干净基准图像上通常能达到高精度,但在噪声、分辨率损失、对比度变化、强度偏移和伪影等异质临床成像条件下,其性能可能会下降。这种不稳定性可能限制其在真实医疗成像工作流程中的可靠部署。 我们提出鲁棒性增强多损坏流水线(RAMP),这是一个面向鲁棒性的CT分割增强框架。RAMP结合了解剖约束的空间扰动、CT强度变换和随机多损坏组合,使模型在训练过程中暴露于临床可行的图像退化。 在两个CT分割评估设置中,RAMP实现了最强的损坏图像性能和最小的干净到损坏鲁棒性差距。在五器官噪声评估基准中,与nnU-Net基线相比,RAMP将平均损坏Dice从0.610提高到0.753,并将鲁棒性差距从0.264降低到0.064。在Abdomen1K中,RAMP将平均损坏Dice从0.633提高到0.789,并将鲁棒性差距从0.290降低到0.070。尽管RAMP未达到最高的干净图像Dice,但它显著减轻了严重图像退化下的最坏情况分割崩溃。 这些结果表明,多损坏增强可以作为提高CT分割系统在异质临床环境中可靠性的实用部署前策略。

英文摘要

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.
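
The reported figures can be cross-checked with a little arithmetic. The sketch below assumes the robustness gap is defined as clean Dice minus mean corrupted Dice, which the abstract does not state explicitly; under that assumption the clean Dice of each model follows from the two reported numbers. The function names are hypothetical.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def implied_clean_dice(corrupted_dice, gap):
    """Clean Dice implied by a corrupted Dice and a gap assumed to be clean minus corrupted."""
    return round(corrupted_dice + gap, 3)


# (mean corrupted Dice, robustness gap) as reported in the abstract
reported = {
    "five-organ, nnU-Net": (0.610, 0.264),
    "five-organ, RAMP": (0.753, 0.064),
    "Abdomen1K, nnU-Net": (0.633, 0.290),
    "Abdomen1K, RAMP": (0.789, 0.070),
}
implied = {name: implied_clean_dice(c, g) for name, (c, g) in reported.items()}
```

Under this assumed definition the implied clean Dice falls by roughly 0.06 with RAMP in both settings (0.874 to 0.817 and 0.923 to 0.859), which is consistent with the statement that RAMP did not achieve the highest clean-image Dice.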

2606.03827 2026-06-18 cs.CV cs.AI 版本更新

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

基于傅里叶运动建模的条件潜扩散模型用于虚拟人群合成

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

发表机构 * Centre for Computational Imaging and Modelling in Medicine (CIMIM)(计算医学成像与建模中心) University of Manchester(曼彻斯特大学) Christabel Pankhurst Institute(克里斯塔贝尔·潘克赫斯特研究所) Department of Computer Science(计算机科学系) Division of Informatics, Imaging & Data Sciences(信息学、成像与数据科学学部) Department of Electrical & Electronic Engineering(电气与电子工程系) NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester(英国国家健康与护理研究所(NIHR)曼彻斯特生物医学研究中心、曼彻斯特学术健康科学中心、曼彻斯特大学)

AI总结 提出4D F-MeshLDM框架,结合卷积网格VAE、截断傅里叶级数运动参数化和条件扩散先验,实现可控的3D+t心脏网格序列生成,在UK Biobank数据上优于基线方法。

Comments This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

详情
AI中文摘要

医疗设备的计算机模拟试验需要生成虚拟解剖人群。在心血管应用中,虚拟解剖通常表示为从生成模型采样的3D+t网格。然而,大多数现有网格生成器关注静态解剖,而序列模型往往缺乏显式周期性。为此,我们提出4D F-MeshLDM,一个条件生成框架,包括用于编码网格的卷积网格VAE、使用截断傅里叶级数参数化运动的结构化潜空间,以及学习傅里叶系数令牌上潜分布的先验扩散。通过仿射调制将扩散过程条件化于临床协变量,我们实现了可控合成。采样令牌并执行逆傅里叶合成产生周期一致的潜轨迹,可解码为3D+t心脏网格序列。在5,000名UK Biobank受试者上的实验表明,4D F-MeshLDM在解剖保真度上优于最先进的基线,并实现了接近零的周期闭合误差。此外,生成的队列准确保留了临床功能指标,突显了我们的框架在可靠的心脏计算机模拟试验中的潜力。

英文摘要

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
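
The link between a truncated Fourier parameterisation and near-zero cycle closure error can be shown directly. The sketch below evaluates one scalar latent coordinate over the cardiac phase; the single-coordinate setup, the phase normalised to [0, 1], and the function names are assumptions for illustration, not the paper's implementation.

```python
import math


def fourier_trajectory(mean, cos_coeffs, sin_coeffs, phase):
    """Evaluate a truncated Fourier series at a cardiac phase in [0, 1]."""
    value = mean
    for k, (a_k, b_k) in enumerate(zip(cos_coeffs, sin_coeffs), start=1):
        angle = 2.0 * math.pi * k * phase
        value += a_k * math.cos(angle) + b_k * math.sin(angle)
    return value


def cycle_closure_error(mean, cos_coeffs, sin_coeffs):
    """Distance between the trajectory at the end and at the start of one cycle."""
    end = fourier_trajectory(mean, cos_coeffs, sin_coeffs, 1.0)
    start = fourier_trajectory(mean, cos_coeffs, sin_coeffs, 0.0)
    return abs(end - start)
```

Because every harmonic has an integer number of periods per cycle, the trajectory returns to its starting value for any coefficients, so closure holds by construction in latent space rather than being learned.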

2606.15554 2026-06-18 cs.CV 版本更新

RaLMPH: Reliability-aware Learning for Multi-Pathologist Harmonization in Whole-Slide Image Classification

RaLMPH:全切片图像分类中面向多病理学家协调的可靠性感知学习

Sungrae Hong, Jiwon Jeong, Soeun Cheon, Donghee Han, Sol Lee, Jisu Shin, Kyungeun Kim, Mun Yong Yi

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) Seegene Medical Foundation(Seegene医学基金会)

AI总结 提出RaLMPH框架,通过可靠性场建模局部邻域结构和专家不确定性,实现多病理学家标注的全切片图像标签协调,提升多实例学习性能。

Comments Accepted by MICCAI 2026

详情
AI中文摘要

多实例学习(MIL)是全切片图像(WSI)分析的标准范式,并在计算病理学中取得了显著成果。然而,大多数MIL流程假设每张切片只有一个“金标准”标签,这与临床实践中常见的病理学家间显著差异相矛盾。现有的多标注者学习和标签细化方法通常估计全局标注者可靠性或依赖单实例假设,使其难以适应MIL以及专家意见不一致的局部诊断场景。我们提出RaLMPH(面向多病理学家协调的可靠性感知学习),一种基于MIL的标签协调框架,用于由多位病理学家标注的WSI。RaLMPH引入了一个可靠性场,该场联合建模(i)WSI特征空间中的局部邻域结构和(ii)专家不确定性(熵),从而能够识别每个样本的可信参考邻域。利用该场,RaLMPH执行样本级局部标注者排序以选择每张切片的可靠意见,并应用自适应门控机制根据局部可靠性融合标签。在由六位病理学家标注的临床WSI数据集以及受控模拟基准上的实验表明,RaLMPH始终优于现有方法。进一步分析阐明了我们的可靠性感知机制如何改进标签协调和下游MIL性能。

英文摘要

Multiple Instance Learning (MIL) is a standard paradigm for Whole-Slide Image (WSI) analysis and has achieved strong results in computational pathology. However, most MIL pipelines assume a single "gold" label per slide, which conflicts with clinical practice where substantial inter-pathologist variability is common. Existing multi-annotator learning and label-refinement methods typically estimate global annotator reliability or rely on single-instance assumptions, making them poorly suited to MIL and to localized diagnostic contexts where experts disagree. We propose RaLMPH (Reliability-aware Learning for Multi-Pathologist Harmonization), a MIL-based label reconciliation framework for WSIs annotated by multiple pathologists. RaLMPH introduces a reliability field that jointly models (i) local neighborhood structure in WSI feature space and (ii) expert uncertainty (entropy), enabling per-sample identification of trustworthy reference neighborhoods. Leveraging this field, RaLMPH performs sample-wise local annotator ranking to select reliable opinions per slide and applies an adaptive gating mechanism to fuse labels conditioned on local reliability. Experiments on a clinical WSI dataset with labels from six pathologists, as well as controlled simulated benchmarks, show that RaLMPH consistently outperforms existing approaches. Further analyses clarify how our reliability-aware mechanism improves label reconciliation and downstream MIL performance.
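
The expert-uncertainty term can be illustrated with the entropy of the pathologists' votes on one slide. How RaLMPH actually computes its entropy is not specified in the abstract, so the Shannon entropy of the empirical vote distribution below is an assumption, and the function name and label strings are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution for one slide."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A slide on which all six pathologists agree has zero entropy, while a three-to-three split between two diagnoses has one bit, so the quantity rises exactly where a single "gold" label is least defensible.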

2606.17412 2026-06-18 cs.CV cs.AI 版本更新

Enhancing Pathological VLMs with Cross-scale Reasoning

增强病理视觉语言模型的跨尺度推理能力

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电气与计算机工程系) PuzzleLogic Pte Ltd(PuzzleLogic私人有限公司) Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital(福建医科大学附属肿瘤医院病理科暨福建省肿瘤医院)

AI总结 提出首个跨尺度训练与评估范式,通过多倍率视觉问答任务增强病理视觉语言模型的跨尺度推理能力,并构建高质量基准数据集Scale-VQA及模型ScaleReasoner-R1,实现最优性能。

详情
AI中文摘要

病理图像本质上是多尺度的,要求病理学家整合从低倍放大下的整体组织结构到高倍放大下的细胞形态的证据以进行准确诊断。虽然现有的视觉语言模型(VLM)病理数据集包含多种尺度,但它们通常缺乏明确的跨尺度推理目标。这一限制阻碍了VLM捕获关键的跨尺度表示和学习基于证据的推理。为弥补这一差距,我们引入了首个跨尺度训练和评估范式,将病理解释表述为多倍率推理。然而,创建这样的任务揭示了一个关键挑战:多图像视觉问答(VQA)容易受到仅文本捷径的影响,这使得模型能够利用与放大倍数相关的伪影而非视觉证据来猜测答案。为解决此问题,我们提出了一种泄漏感知的策展流程,结合了对抗性仅文本筛选和约束引导的问题设计。利用该流程,我们构建了Scale-VQA,一个高质量基准,包含4,685个多项选择题,基于2,537张跨多个放大级别的病理图像。最后,我们提出了ScaleReasoner-R1,一个通过强化学习训练的模型,以优化跨尺度VQA任务的性能。ScaleReasoner-R1在我们的跨尺度推理基准上达到了最先进的性能,并在已有的单尺度基准上泛化到最先进的性能。研究结果表明,即使是有限的跨尺度监督也能显著改善病理理解。代码和演示将开源。

英文摘要

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to SOTA performance on established single-scale benchmarks. Our findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

2508.11211 2026-06-18 eess.IV cs.CV 版本更新

Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

面向CT视野扩展的高效图像到图像薛定谔桥

Zhenhao Li, Song Ni, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang

发表机构 * Institute of Medical Technology, Peking University Health Science Center(北京大学医学部医学技术研究院) Shanghai Cancer Center, Fudan University(复旦大学附属肿瘤医院) Department of Electrical and Computer Engineering, University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校电气与计算机工程系) Beijing Key Laboratory of Intelligent Neuromodulation and Brain Disorder Treatment(北京智能神经调控与脑疾病治疗重点实验室)

AI总结 提出基于图像到图像薛定谔桥(I²SB)扩散模型的CT视野扩展框架,通过直接学习有限视野与扩展视野图像间的随机映射,实现单步快速推理,在精度和速度上均超越现有扩散模型。

Comments 12 pages

详情
Journal ref
IEEE Transactions on Radiation and Plasma Medical Sciences 2026
AI中文摘要

计算机断层扫描(CT)是一种用于无创、高分辨率可视化内部解剖结构的基石成像模态。然而,当扫描物体超出扫描仪的视野(FOV)时,投影数据被截断,导致重建不完整并在FOV边界附近出现明显伪影。传统重建算法难以从这类数据中恢复准确的解剖结构,限制了临床可靠性。深度学习方法已被探索用于FOV扩展,其中扩散生成模型代表了图像合成的最新进展。然而,传统扩散模型由于迭代采样过程,计算量大且推理速度慢。为解决这些限制,我们提出了一种基于图像到图像薛定谔桥(I$^2$SB)扩散模型的高效CT FOV扩展框架。与从纯高斯噪声合成图像的传统扩散模型不同,I$^2$SB学习配对的有限FOV和扩展FOV图像之间的直接随机映射。这种直接对应关系产生了更可解释和可追踪的生成过程,增强了重建中的解剖一致性和结构保真度。I$^2$SB实现了优越的定量性能,在模拟噪声数据上的均方根误差(RMSE)值为49.8 HU,在真实数据上为152.0 HU,优于最先进的扩散模型,如条件去噪扩散概率模型(cDDPM)和基于块的扩散方法。此外,其单步推理使得每2D切片的重建仅需0.19秒,相比cDDPM(135秒)实现了超过700倍的加速,并超过了第二快的DiffusionGAN(0.58秒)。这种准确性和效率的结合表明I$^2$SB具有实时或临床部署的潜力。

英文摘要

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I$^2$SB has potential for real-time or clinical deployment.
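
The speed comparison can be checked from the per-slice times quoted in the abstract, and the error metric is the usual root-mean-square error in Hounsfield units. This is a small worked sketch; the dictionary and function names are hypothetical, and only the three timing values come from the abstract.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two equally sized sequences of HU values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


# Inference time per 2D slice in seconds, as quoted in the abstract
seconds_per_slice = {"I2SB": 0.19, "DiffusionGAN": 0.58, "cDDPM": 135.0}

# How many times slower each method is than one-step I2SB inference
slowdown_vs_i2sb = {
    name: t / seconds_per_slice["I2SB"] for name, t in seconds_per_slice.items()
}
```

The ratio 135 / 0.19 is about 710, matching the "over 700-fold" claim, while DiffusionGAN is about three times slower than I$^2$SB.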

9. 文档图像、OCR与图表理解 7 篇

2606.18721 2026-06-18 cs.CV 新提交

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

重新思考表格结构识别中的指针损失:面向空间局部性的几何感知指针损失

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

发表机构 * Teamreboott Inc.(Teamreboott公司)

AI总结 针对指针网络在表格结构识别中相邻单元格错误占79.6%的问题,提出几何感知指针损失,通过反距离加权重写交叉熵目标,聚焦邻近单元格梯度,在不增加推理成本下提升性能。

详情
AI中文摘要

使用指针网络的表格结构识别(TSR)通过预测HTML序列同时将标签与检测到的文本(或单元格)区域对齐,取得了令人印象深刻的结果。然而,我们的分析揭示,当指针网络失败时,79.6%的错误发生在空间相邻的单元格之间(曼哈顿距离<=2)。尽管如此,标准交叉熵损失对所有负候选样本赋予相同权重。在这项工作中,我们提出了几何感知指针(GAP)损失,它根据与真实值的空间邻近性重新加权交叉熵目标。通过应用反距离加权,GAP将梯度流集中在模型最困难的区域:相邻单元格比远处单元格获得更强的梯度。我们的方法仅需对损失计算进行简单修改,保持相同的模型架构且零额外推理成本。在PubTabNet和SynthTabNet上的大量实验表明,GAP持续减少相邻单元格错误,达到了新的最先进性能。我们的发现表明,在损失层面融入几何归纳偏置为鲁棒TSR提供了一种简单而有效的方法。我们的代码可在以下网址获取:https://github.com/teamreboott/GAP

英文摘要

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
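
One way to reweight a pointer cross-entropy by spatial proximity is to scale each negative candidate's term in the softmax denominator by the inverse of its Manhattan distance to the ground-truth cell. The sketch below is a hedged illustration of that idea only: the exact GAP formulation is not given in the abstract, so the weighting rule, the `power` parameter, and the function names are all assumptions rather than the authors' loss.

```python
import math


def manhattan(a, b):
    """Manhattan distance between two (row, column) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def gap_style_loss(logits, cells, target, power=1.0):
    """Cross-entropy whose negative terms are scaled by inverse distance to the target cell.

    Returns the loss and its gradient with respect to each logit.
    """
    weights = []
    for i, cell in enumerate(cells):
        if i == target:
            weights.append(1.0)
        else:
            d = max(1, manhattan(cell, cells[target]))
            weights.append(1.0 / d ** power)
    shift = max(logits)  # subtract the maximum for numerical stability
    z = sum(w * math.exp(l - shift) for w, l in zip(weights, logits))
    loss = -(logits[target] - shift) + math.log(z)
    grads = [w * math.exp(l - shift) / z for w, l in zip(weights, logits)]
    grads[target] -= 1.0
    return loss, grads
```

With equal logits, a negative cell at distance 1 receives three times the gradient of one at distance 3, so the penalty concentrates on adjacent-cell confusions; setting `power=0.0` recovers the standard cross-entropy. Since only the training loss changes, inference cost is unaffected, as the abstract notes.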

2606.18793 2026-06-18 cs.CV 新提交

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

模糊几何分支点建模用于结构感知的手写汉字增强

Dongbin Jiao, Yibo Lyu, Qiulu Wei, Fuxiang Lu, Shengcai Liu, Shi Yan

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系广东省类脑智能计算重点实验室)

AI总结 针对手写汉字增强中数据稀缺和结构失真问题,提出基于模糊几何的结构感知增强框架,通过模糊集建模分支点并优化,结合贝塞尔重建与多策略扰动生成样本,显著降低词级错误率。

详情
AI中文摘要

数据稀缺和结构失真严重限制了高安全性认证中的手写识别。现有的增强方法常导致拓扑和形态损伤,尤其在处理复杂汉字时,笔画交叉、连笔和急转弯使传统分支点检测不可靠。为此,本文提出一种模糊几何驱动的结构感知(FGSA)增强框架。我们将分支点建模为骨架空间中的模糊集,通过整合拓扑邻域证据和方向场散度,构建连续的分支点隶属度场。该隶属度场通过无监督代理目标自适应优化,实现无需人工标注的鲁棒笔画解耦。最后,通过参数化三次贝塞尔重建和多策略扰动合成运动学对齐样本,确保结构保真度与样本多样性之间的平衡。此外,我们建立了LZUSig,一个专门针对中文手写签名细粒度结构退化的大规模高挑战性数据集。在CASIA-HWDB1.1、ChiSig和LZUSig上的大量实验表明,FGSA显著降低了词级错误率(ΔWER),在对比基线中取得了最优识别增益。更重要的是,它在任务增益、结构保真度和判别特征保留之间实现了稳健的权衡,为手写增强提供了一种高度可控的解决方案。

英文摘要

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($Δ$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

2606.18884 2026-06-18 cs.CV 新提交

Performance Gap Analysis between Latin and Arabic Scripts HTR

拉丁文与阿拉伯文手写文本识别之间的性能差距分析

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * Luleå University of Technology Department of Computer Science, Electrical

AI总结 本研究使用统一CRNN模型在多个数据集上比较阿拉伯文和拉丁文手写文本识别性能,发现性能差距在低资源场景下显著,随数据增加而缩小但持续存在,并分析了标注质量、视觉变异性和字符分布等因素。

Comments This paper was accepted at the TIPS workshop, ICPR 2026

详情
AI中文摘要

最近的研究表明,手写文本识别(HTR)系统在阿拉伯文数据集上的表现不如拉丁文数据。然而,由于缺乏受控比较,这种差距的原因仍不清楚。在这项工作中,我们使用统一的CRNN模型对阿拉伯文和拉丁文脚本进行行级HTR的全面研究,涵盖九个数据集(包括KHATT(阿拉伯文)、Muharaf(阿拉伯文)、NUST-UHWR(乌尔都文)、PHTD(波斯文)、IAM(英文)、READ-2016(德文)等)和不同的训练规模(K ∈ {100, 500, 1000, 2000, ..., Kfull})。我们的结果显示性能差距仍然存在:在低资源设置下差距很大,随着数据增加而缩小,但在全规模下仍然存在,一致相差5-7个CER点。我们表明标注质量很重要,因为许多数据集包含标注错误。清理降低了错误率并缩小了差距,但并未消除差距。此外,我们发现由于阿拉伯文具有更高的视觉变异性,固定数量的训练样本提供的覆盖效率较低,需要更多数据来学习相似的表示。我们根据文本行数和字符数比较了跨数据集的识别性能,显示了等价权衡。我们比较了跨脚本的字符频率分布,并表明阿拉伯文比拉丁文显著更重尾。我们的错误分析显示,阿拉伯文数据集(例如KHATT)中约30%的替换错误是由视觉相似字符之间的混淆引起的,而在拉丁文数据集(如IAM)中约为15%。

英文摘要

Recent studies have shown that handwritten text recognition (HTR) systems perform worse on Arabic-script datasets than on Latin-script data. However, the reasons for this gap are still not well understood due to the lack of controlled comparisons. In this work, we present a comprehensive study of Arabic- and Latin-script HTR using a unified CRNN model for line-level HTR across nine datasets (including KHATT (Arabic), Muharaf (Arabic), NUST-UHWR (Urdu), PHTD (Persian), IAM (English), READ-2016 (German), and others) and different training sizes (K in {100, 500, 1000, 2000, ..., Kfull}). Our results show the performance gap remains: it is large in low-resource settings, decreases with more data, but remains even at full scale, with a consistent difference of 5-7 CER points. We show that annotation quality matters, as many datasets contain labeling errors. Cleaning reduces error rates and narrows the gap, but does not eliminate it. In addition, we find that a fixed number of training samples provides less effective coverage in Arabic due to higher visual variability, requiring more data to learn similar representations. We compare recognition across datasets in terms of the number of text lines and the number of characters, showing an equivalence trade-off. We compare character frequency distributions across scripts and show that Arabic is significantly more heavy-tailed than Latin. Our error analysis reveals that around 30 percent of substitution errors in Arabic datasets (e.g., KHATT) are caused by confusion between visually similar characters, compared to about 15 percent in Latin-script datasets such as IAM.
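
The gap of 5-7 CER points refers to the character error rate, conventionally the edit distance between prediction and reference divided by the reference length. A minimal sketch of that standard definition follows, expressed in percent; the function names are hypothetical and the paper's exact normalisation is not stated in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: edit distance divided by reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Python strings iterate over Unicode code points, so for Arabic-script text with combining marks the count of "characters" depends on normalisation; any cross-script CER comparison has to fix that choice consistently.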

2606.19096 2026-06-18 cs.CV 新提交

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

PorTEXTO:用于视觉文本提取的欧洲葡萄牙语基准

João Cardeira, Diogo Glória-Silva, Manuel Letras da Luz, Rafael Ferreira, Diogo Tavares, David Semedo, João Magalhães

发表机构 * NOVA School of Science and Technology(NOVA科学与技术学校) NOVA LINCS

AI总结 提出PorTEXTO,首个针对现代欧洲葡萄牙语视觉文本提取的基准,通过结合前沿LVLM转录和母语者审核构建,发现合成到真实样本性能显著下降,多语言数据比模型规模更关键。

详情
AI中文摘要

欧洲葡萄牙语(pt-PT)在OCR基准中基本缺失,这些基准偏向高资源语言。少数涵盖pt-PT的基准专注于历史文物和文献。本文针对现代OCR应用,引入PorTEXTO,首个面向当代和文化相关的pt-PT视觉文本提取基准。为确保质量,我们采用结合前沿LVLM转录和母语者详尽审核的标注流程。我们观察到大多数模型从合成样本到真实样本性能急剧下降,并发现目前专门的多语言数据比模型大小或分辨率预算更能驱动pt-PT性能,这促使我们发布开放的pt-PT OCR资源。

英文摘要

European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ascertain quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.

2606.19139 2026-06-18 cs.CV cs.CL 新提交

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Urdu Katib 手写数据集:用于离线乌尔都语手写文本识别的历史文档数据集及基于CRNN的基线评估

Ramza Basharat, Muhammad Usman Ali

发表机构 * Department of Computer Science, University of Gujrat(古杰拉特大学计算机科学系)

AI总结 为解决乌尔都语手写文本识别中数据集稀缺的问题,本文提出了首个由历史时期Katib书写的离线乌尔都语手写文本行数据集UKHD,并评估了多种CRNN混合模型,其中CNN-BGRU-CTC在字符错误率和词错误率上表现最优。

详情
AI中文摘要

自动手写文本识别(HTR)本质上是一项具有挑战性的任务,当处理草书体时,其复杂性进一步增加。尽管在各种草书体上已经做出了显著努力,但关于乌尔都语手写文本识别(UHTR)的研究相对有限。这种研究滞后主要是由于其文字带来的独特挑战,以及基准数据集的稀缺和不可用。因此,为了推进UHTR研究,本研究提出了一个专门的真实数据集,称为Urdu Katib手写数据集(UKHD)。据我们所知,这是第一个专门从历史时期Katib书写的材料中整理的离线乌尔都语手写文本行数据集。它涵盖了Nastalique书法风格中各种扁平笔尖书写变体。此外,评估了不同基于CRNN的混合模型的有效性,以确定用于Urdu Katib手写识别(UKHR)的最佳架构。在分析的模型中,CNN-BGRU-CTC模型表现出更稳健的性能,具有较低的字符错误率(CER)和词错误率(WER)。本研究工作旨在支持和鼓励研究社区开发用于保存乌尔都语手写文学的稳健识别系统。

英文摘要

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This lag in research is primarily due to the unique challenges posed by the script and to the scarcity of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text-line dataset specifically curated from materials written by Katibs in historical times. It encompasses a diverse range of flat-nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.
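
The CTC stage of a CNN-BGRU-CTC recogniser maps per-frame predictions to a label sequence. The sketch below shows standard best-path (greedy) CTC decoding; whether the paper uses greedy or beam-search decoding is not stated, so this choice, the blank index, and the function name are assumptions.

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse consecutive repeated labels, then drop blanks (best-path CTC decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

For example, the frame sequence `[0, 1, 1, 0, 2, 2, 2, 0, 2]` decodes to `[1, 2, 2]`: the blank between the two runs of label 2 is what allows a genuinely doubled character to survive the collapse step.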

2408.01526 2026-06-18 cs.CV 版本更新

Recognizing and Reconstructing a Multi-Unit Floor Plan

识别与重建多单元楼层平面图

Lukas Kratochvila, Gijs de Jong, Monique Arkesteijn, Simon Bilik, Tomas Zemcik, Karel Horak, Jan S. Rellermeyer

发表机构 * Department of Control and Instrumentation, Brno University of Technology, Brno, Czech Republic(控制与仪器系,布尔诺理工大学,布尔诺,捷克共和国) Department of Software Technology, Faculty of Electrical Engineering Mathematics and Computer Science, TU Delft, Delft, Netherlands(软件技术系,电气工程、数学与计算机科学学院,代尔夫特理工大学,代尔夫特,荷兰) Department of Management in the Built Environment, Faculty of Architecture and the Built Environment, TU Delft, Delft, Netherlands(建成环境管理系,建筑与建成环境学院,代尔夫特理工大学,代尔夫特,荷兰) Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Ostrava, Czech Republic and with Department of Informatics, Mendel University in Brno, Brno, Czech Republic(模糊建模研究与应用研究所,奥斯特拉瓦大学,奥斯特拉瓦,捷克共和国;以及信息系,布尔诺孟德尔大学,布尔诺,捷克共和国) Department of Software Technology, Faculty of Electrical Engineering Mathematics and Computer Science, TU Delft, Delft, Netherlands and with Dependable and Scalable Software Systems, Institute of Systems Engineering, Faculty of Electrical Engineering and Computer Science, Leibniz University Hannover, Hannover, Germany(软件技术系,电气工程、数学与计算机科学学院,代尔夫特理工大学,代尔夫特,荷兰;以及可靠与可扩展软件系统组,系统工程研究所,电气工程与计算机科学学院,汉诺威莱布尼茨大学,汉诺威,德国)

AI总结 提出基于MDA-Unet和MACU-Net的像素级分割方法,结合改进跳跃连接和注意力机制,从2D平面图重建3D模型,在CubiCasa数据集上平均F1达0.86。

详情
AI中文摘要

数字孪生在应急规划中具有巨大潜力,可更高效设计逃生路线、在异常情况下提供更好方向感并加快救援干预。然而,由于缺乏3D表示(仅部分新建筑有有限数量),创建数字孪生仍主要依赖手动工作。因此,本文旨在从常见的2D建筑平面图合成3D信息。我们提出两种基于MDA-Unet和MACU-Net架构的新型像素级分割方法,具有改进的跳跃连接、注意力机制以及训练目标,并结合流水线的重建部分,将分割后的平面图矢量化以创建3D模型。将所提方法与另外两种最先进技术及多个基准数据集进行比较。在常用的CubiCasa基准数据集上,我们的方法在五个检查类别上实现了平均F1分数0.86,优于其他测试的像素级方法。我们还公开了代码以支持该领域的研究。

英文摘要

Digital twins have major potential to form a significant part of urban management in emergency planning, as they allow more efficient design of escape routes, better orientation in exceptional situations, and faster rescue intervention. Nevertheless, creating the twins remains a largely manual effort, due to a lack of 3D representations, which are available only in limited amounts for some new buildings. Thus, in this paper we aim to synthesize 3D information from commonly available 2D architectural floor plans. We propose two novel pixel-wise segmentation methods based on the MDA-Unet and MACU-Net architectures with improved skip connections, an attention mechanism, and a training objective together with a reconstruction part of the pipeline, which vectorizes the segmented plans to create a 3D model. The proposed methods are compared with two other state-of-the-art techniques on several benchmark datasets. On the commonly used CubiCasa benchmark dataset, our methods achieve a mean F1 score of 0.86 over five examined classes, outperforming the other pixel-wise approaches tested. We have also made our code publicly available to support research in the field.

2605.02089 2026-06-18 cs.CV 版本更新

Cross-Lingual Learning within Arabic Script for Low-Resource HTR

阿拉伯文字内低资源手写文本识别的跨语言学习

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

AI总结 针对阿拉伯文字低资源手写文本识别,通过跨语言联合训练CRNN和HTR-VT模型,在KHATT、NUST-UHWR和PHTD数据集上显著降低字符错误率。

Comments This paper was accepted at the DALL workshop at ICDAR 2026

详情
AI中文摘要

有限标注数据下的手写文本识别(HTR)仍然是一个具有挑战性的问题,尤其是对于阿拉伯文字语言。尽管现代基于序列的识别器在高资源设置下表现良好,但随着训练数据的稀缺,其准确率急剧下降。阿拉伯文字语言共享一个书写系统,具有大量字符重叠,这促使跨语言学习成为缓解数据稀缺的一种策略。我们在低资源场景(样本数K=100、500、1000标注行)下,对阿拉伯语(KHATT)、乌尔都语(NUST-UHWR)和波斯语(PHTD)进行了受控的行级跨语言联合训练研究。基于CRNN和Vision Transformer的HTR-VT模型在多个相关阿拉伯文字数据集的联合集上进行训练以缓解数据稀缺,并在单个目标语言上进行评估。两种架构在低资源条件下均受益于跨语言训练。CRNN在目标语言数据极其有限时仍然更有效,而随着更多目标语言数据的可用,HTR-VT的跨语言训练收益变得不太一致。在波斯语(PHTD)上,联合训练实现了9.99的字符错误率(CER),尽管未使用全部可用训练数据,仍超越了先前报告的结果。在另一个乌尔都语数据集(UNHD)上,联合训练将CER从17.20降低到14.45。

英文摘要

Handwritten Text Recognition (HTR) with limited labeled data remains a challenging problem, particularly for Arabic-script languages. Although modern sequence-based recognizers perform well in high-resource settings, their accuracy degrades sharply as training data becomes scarce. Arabic-script languages share a common writing system with substantial character overlap, motivating cross-lingual learning as a strategy to mitigate data scarcity. We conduct a controlled line-level study of cross-lingual joint training for Arabic-script HTR under low-resource regimes (number of samples K = 100, 500, 1000 labeled lines) on Arabic (KHATT), Urdu (NUST-UHWR) and Persian (PHTD). CRNN and Vision Transformer-based HTR-VT models are trained on the union of multiple related Arabic-script datasets to mitigate data scarcity and are evaluated on individual target languages. Both architectures benefit from cross-language training under low-resource conditions. CRNN remains more effective under extremely limited target-language data, whereas the benefits of cross-language training for HTR-VT become less consistent as larger amounts of target-language data become available. On Persian (PHTD), joint training achieves a Character Error Rate (CER) of 9.99, surpassing previously reported results despite not using the full available training data. On an additional Urdu dataset (UNHD), joint training reduces CER from 17.20 to 14.45.
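The abstract above reports results as Character Error Rate. As a reading aid, here is a minimal sketch of how CER is commonly computed: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative and not taken from the paper, and the paper's exact normalization (for example, per line versus over the whole test set) is not stated in the abstract.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate as a fraction of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the 9.99 and 14.45 figures quoted above, assuming those are percentages.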

10. 低层视觉、计算成像与图像增强 10 篇

2606.18496 2026-06-18 cs.CV cs.AI 新提交

Neural Phase Correlation

神经相位相关

Cole Reynolds

发表机构 * Weyl Labs(Weyl实验室)

AI总结 提出相位相关的学习泛化,通过可学习基函数将变换分解,适用于非刚性形变和幺正动力学,在心脏MRI和超声数据集上达到或超越现有方法。

详情
AI中文摘要

对应关系本质上是关系性的:它寻求同一场景两次观测之间的未知变换,而非任一观测的内容。然而,主流的基于学习的方法并未将变换表示为架构中的一等对象。它们独立编码每幅图像,让学习的相似度函数或深度解码器隐式地发现映射。相位相关是典型的例外,它直接在傅里叶域测量图像间关系,但其固定基的刚性将其限制于全局平移。我们引入相位相关的学习泛化,通过学习变换分解所基于的基来解除这一限制。相同的代数原语可扩展到密集非刚性形变和幺正动力学。在ACDC心脏MRI基准上,该框架在两个配准方向上匹配或超越先前发表的基线。在CAMUS超声心动图上,它无需辅助评分或自适应平滑机制即可达到最先进水平。应用于一维量子谐振子的时间演化波函数对时,同一框架仅从观测对中恢复未知哈密顿量的埃尔米特函数本征态和量子化能级。

英文摘要

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
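For context, the classical (non-learned) phase correlation that the abstract takes as its starting point can be sketched in a few lines. This is the fixed-Fourier-basis baseline for global translation only, not the learned generalization the paper proposes; the function name `phase_correlation` is illustrative.

```python
import numpy as np


def phase_correlation(a: np.ndarray, b: np.ndarray) -> tuple:
    """Estimate the integer shift d such that a == np.roll(b, d, axis=(0, 1))."""
    fa = np.fft.fft2(a)
    fb = np.fft.fft2(b)
    # Normalized cross-power spectrum: keep only the phase difference.
    cross = fa * np.conj(fb)
    cross /= np.abs(cross) + 1e-12
    # The inverse transform of a pure phase ramp is a peak at the shift.
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Map peak indices in the upper half of each axis to negative shifts.
    return tuple(int(p) if p <= s // 2 else int(p) - s
                 for p, s in zip(peak, corr.shape))
```

The sketch assumes circular shifts and equal-sized inputs; real images with non-periodic borders usually need windowing first.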

2606.18644 2026-06-18 cs.CV 新提交

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

尖峰金字塔小波变换用于高效低能耗图像恢复

Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu, Rui Xie, Jian Yang, Ying Tai

发表机构 * Nanjing University(南京大学) Nanjing University of Science and Technology(南京理工大学) University of Science and Technology of China(中国科学技术大学) China Mobile Institute(中国移动研究院)

AI总结 提出基于尖峰神经网络和金字塔小波变换的SPWM模型,通过SDPW块建模长程依赖并利用小波域退化特性,在保持图像质量的同时显著降低计算和能耗。

Comments Accepted by Pattern Recognition

详情
AI中文摘要

尖峰神经网络(SNNs)因其高效性和生物启发的潜力在计算机视觉领域引起了广泛兴趣。虽然基于尖峰CNN的方法在图像恢复(IR)任务中显示出前景,但其性能受到CNN操作固有感受野限制的约束。在本文中,我们探索了离散小波变换的优势,并提出了一种基于尖峰金字塔小波模型(SPWM)以实现高效低能耗目标。具体来说,我们开发了一个尖峰双金字塔小波(SDPW)块来建模长程依赖并利用小波域中的退化特性。在多个基准上的实验结果表明,SPWM在保持图像质量的同时显著降低了计算成本和能耗。我们的方法展示了SNNs在IR领域的潜力,为资源受限设备的未来应用提供了新的见解。

英文摘要

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) that targets high efficiency and low energy consumption. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.

2606.19046 2026-06-18 cs.CV 新提交

Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

基于Ky Fan p-k范数分数阶正则化的低秩张量补全

Shan Fan, Feng Zhang, Jianjun Wang, Xi-Le Zhao, Tingwen Huang

发表机构 * School of Mathematics and Statistics, Southwest University(西南大学数学与统计学学院) School of Mathematical Sciences/Research Center for Image and Vision Computing, University of Electronic Science and Technology of China(电子科技大学数学科学学院/图像与视觉计算研究中心) Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology(深圳先进技术大学计算机科学与控制工程学院)

AI总结 提出张量核范数与Ky Fan p-k范数之比(TNPK)作为非凸替代,逼近张量管秩,并构建低秩张量补全模型,证明低秩张量是局部极小点,设计ADMM算法,实验验证优于现有方法。

详情
AI中文摘要

本文通过提出一种新颖的非凸替代,即张量核范数与张量Ky Fan p-k范数(TNPK)之比,来精确逼近张量管秩,从而解决低秩张量补全(LRTC)问题。TNPK具有吸引人的性质,包括尺度不变性、参数灵活性以及在特定p和k选择下存在闭式解。在特定的p和k参数设置下,它退化为张量核范数与张量Ky Fan k范数(TNK)之比或张量核范数与张量Frobenius范数(TNF)之比。我们构建了一个LRTC模型,并在张量零空间性质(NSP)下,证明了低秩张量是所提模型的局部极小点。此外,我们推导了Ky Fan p-k逆范数的近端算子,并进一步开发了一种高效的交替方向乘子法(ADMM)算法,在温和条件下保证子序列收敛。在合成和真实世界数据集上的大量实验验证了我们的方法相对于最先进竞争者的优越性能。

英文摘要

This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct a LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
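To make the ratio surrogate concrete, here is a hedged matrix-level sketch. It assumes the usual definition of the Ky Fan p-k norm as the l_p norm of the k largest singular values; the paper works with tensors under a tubal-rank framework, which this simplification does not reproduce. The function names are illustrative.

```python
import numpy as np


def ky_fan_pk_norm(mat: np.ndarray, p: float, k: int) -> float:
    """l_p norm of the k largest singular values (assumed definition)."""
    s = np.linalg.svd(mat, compute_uv=False)  # sorted in descending order
    return float(np.sum(s[:k] ** p) ** (1.0 / p))


def nuclear_over_ky_fan(mat: np.ndarray, p: float, k: int) -> float:
    """Ratio of the nuclear norm to the Ky Fan p-k norm."""
    s = np.linalg.svd(mat, compute_uv=False)
    return float(s.sum() / ky_fan_pk_norm(mat, p, k))
```

Under this definition, p = 1 gives the Ky Fan k-norm and p = 2 with k equal to the number of singular values gives the Frobenius norm, which matches the two special cases named in the abstract. The ratio is unchanged when the matrix is rescaled and equals 1 for a rank-1 matrix, so smaller values indicate lower rank.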

2606.19097 2026-06-18 cs.CV 新提交

DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

DVANet: 面向图像复原的退化感知视觉先验对齐网络

Yanjie Tu, Qingsen Yan, Axi Niu, Tao Hu, Haokui Zhang, Jiantao Zhou

发表机构 * School of Computer Science, Northwestern Polytechnical University(西北工业大学计算机学院) Shenzhen Research Institute of Northwestern Polytechnical University(西北工业大学深圳研究院) State Key Laboratory of Internet of Things for Smart City, University of Macau(澳门大学智慧城市物联网国家重点实验室)

AI总结 提出DVANet,一种基于半二次分裂优化的深度展开网络,通过退化感知观测一致性与视觉先验引导重建的协同展开,实现复杂退化下的统一图像复原,在多种退化场景和跨域任务中表现优越。

Comments All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

详情
AI中文摘要

全能图像复原旨在开发一个统一的复原框架来处理多种退化类型。现有的端到端方法通常将复原过程视为黑盒映射,缺乏明确的优化解释。尽管深度展开为图像复原提供了可解释的迭代建模范式,但现有方法大多依赖于固定的退化假设或预定义的退化信息,难以适应复杂退化和局部内容受损下的统一复原需求。这一限制制约了它们在退化抑制和结构细节恢复方面的性能。为解决这些问题,本文提出DVANet,一种受半二次分裂优化算法启发的深度展开网络,将复杂退化下的统一图像复原公式化为退化感知观测一致性与视觉先验引导重建之间的协同展开过程。具体而言,在退化感知观测一致性分支中,采用退化表示模块提取全局退化属性和局部退化线索,并利用退化条件映射增强模型对不同退化类型的适应性。在视觉先验引导重建分支中,引入DINOv3提供结构和语义信息作为层次化视觉先验,从而补充受损区域缺失的结构信息并改善细节恢复。大量实验表明,DVANet在多场景退化和跨域图像复原任务上取得了优越或具有竞争力的性能,展现出良好的退化适应性和泛化能力。

英文摘要

All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.

2204.14224 2026-06-18 cs.CV cs.LG eess.IV 版本更新

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

不完全信息条件下纹理图像重建与分类的神经网络方法研究

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

发表机构 * Satbayev University(萨特巴耶夫大学) Institute of Ionosphere LLP(电离层研究所) Information Technology Department(信息技术部门) Assiut University(阿西乌特大学)

AI总结 提出结合目标检测、GAN(CRA)修复和Transformer/CNN分类的端到端框架,发现重建质量高(PSNR 28.7dB)但分类准确率仅53%,通过置信度混合集成将MCA从48%提升至58%,揭示生成模型产生语义模糊特征的问题。

Comments IEEE ACCESS

详情
AI中文摘要

异质自然纹理的自动化分析常因物理损伤和数据丢失而受阻,这对计算机视觉构成了重大挑战。虽然深度学习在受控环境中已显示出成功,但其在信息不完全条件下对复杂地质材料的应用仍未被充分探索。本研究提出了一个用于高分辨率岩心样本图像修复和分类的集成框架。我们设计了一个端到端流水线,利用目标检测进行样本分割,随后使用具有上下文残差聚合(CRA)的生成对抗网络(GAN)进行图像修复,以重建缺失的高频细节。接着,我们在重建数据上评估了现代基于Transformer(Swin、ViT)和CNN架构的性能。实验揭示了重建质量与下游效用之间的关键分歧:尽管结构保真度高(PSNR 28.7 dB,FID 74.01),分类准确率却停滞在53%。为了改善少数类检测,我们提出了一种基于置信度的混合集成方法,将MCA从48%提升至58%。这些结果凸显了当前最先进生成模型的局限性,它们可能产生视觉上合理但语义模糊的特征(“幻觉”),从而混淆分类器。本工作深入探讨了图像重建质量与分类性能之间的依赖关系,为无损检测和材料科学领域的未来研究提供了可复现的基线。鉴于井间准确率仍处于49-53%范围,我们将所得到的系统定位为岩相解释的决策支持和筛选工具,而非完全自主的分类器。代码可在以下网址获取:https://github.com/GalymzhanAbdimanap/Lithology_recognition

英文摘要

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy plateaued at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition

2601.01200 2026-06-18 cs.CV eess.IV 版本更新

Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

点云的多尺度隐式结构相似性客观质量评估

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

发表机构 * School of Electronics and Information, Northwestern Polytechnical University(电子与信息学院,西北工业大学) Department of Computer Science, City University of Hong Kong(计算机科学系,香港城市大学) School of Telecommunication Engineering, Xidian University(电信工程学院,西安电子科技大学)

AI总结 针对点云质量评估中不规则数据匹配困难的问题,提出多尺度隐式结构相似性度量(MS-ISSM),通过径向基函数连续表示局部特征并比较隐式函数系数,结合ResGrouped-MLP网络,在多个基准上超越现有方法。

Comments IEEE TMM Accepted

详情
AI中文摘要

点云的无结构和不规则特性对精确的点云质量评估(PCQA)构成重大挑战,特别是在建立准确的感知特征对应关系方面。为了解决这一问题,我们提出了多尺度隐式结构相似性度量(MS-ISSM)。与传统的点对点匹配不同,MS-ISSM利用径向基函数(RBF)连续表示局部特征,将失真测量转化为隐式函数系数的比较。该方法有效避免了不规则数据中固有的匹配误差。此外,我们提出了ResGrouped-MLP质量评估网络,该网络能够鲁棒地将多尺度特征差异映射到感知分数。该网络架构摒弃了传统的平面多层感知器(MLP),采用分组编码策略,集成了残差块和通道注意力机制。这种分层设计使得模型能够保留亮度、色度和几何的独特物理语义,同时自适应地关注高、中、低尺度上最显著的失真特征。在多个基准上的实验结果表明,MS-ISSM在可靠性和泛化性方面均优于最先进的指标。源代码可在以下网址获取:https://github.com/ZhangChen2022/MS-ISSM。

英文摘要

The unstructured and irregular nature of points poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis function (RBF) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出后验延续框架,根据扩散噪声水平逐步暴露测量频率,结合稳定采样器实现超分辨率、修复和去模糊的先进性能。

详情
AI中文摘要

扩散后验采样通过将预训练的扩散先验与测量一致性指导相结合来解决逆问题。然而,在高噪声水平下,全频带指导可能不可靠,因为干净估计包含分数诱导误差,且高频测量方向弱可识别。我们认为后验指导应根据瞬时扩散噪声水平暴露测量频率。基于这一原则,我们提出一个后验延续框架,构建一系列中间后验,其似然强调当前可靠频带并逐渐恢复全频带一致性。我们通过一个稳定采样器实例化该框架,该采样器结合了扩散预测器、频率受限似然细化以及Haar域承诺规则,该规则提交可靠粗校正同时推迟弱可识别细节。在超分辨率、修复和去模糊任务中,我们的方法实现了具有竞争力乃至最先进的恢复性能,包括在FFHQ和ImageNet评估中,运动去模糊相比强基线PSNR提升高达5 dB。

英文摘要

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2603.05010 2026-06-18 cs.CV 版本更新

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

生成式图像恢复进展:能力、局限性与评估实践研究

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

发表机构 * Fudan University(复旦大学) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) University of the Chinese Academy of Sciences(中国科学院大学) Multimedia Laboratory, The Chinese University of Hong Kong(香港中文大学多媒体实验室) Shenzhen University of Advanced Technology(深圳先进技术大学)

AI总结 通过多维度评估管道系统比较扩散、GAN等生成式模型与PSNR导向模型,揭示从细节不足到细节质量与语义控制的范式转变,并训练了更符合人类感知的IQA模型。

Comments Accepted by CVPR 2026 Findings

详情
AI中文摘要

生成式图像恢复(GIR)在感知真实感方面取得了显著进展,但与先前方法相比,其实际能力究竟有多大提升?为回答这一问题,我们基于新的多维度评估管道开展大规模研究,该管道从细节、清晰度、语义正确性和整体质量四个维度评估模型。我们的分析涵盖多种架构,包括基于扩散的、基于GAN的、PSNR导向的以及通用生成模型,揭示了关键的性能差异。此外,我们的分析揭示了失败模式的演变,这标志着以感知为导向的低层视觉领域发生了范式转变。核心挑战正从先前的细节稀缺(欠生成)问题演变为细节质量和语义控制(防止过生成)的新前沿。我们还利用我们的基准训练了一个新的IQA模型,该模型更符合人类感知判断。最终,本工作对现代生成式图像恢复模型进行了系统研究,提供了关键见解,重新定义了对其真实状态的理解,并为未来发展指明了方向。

英文摘要

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2605.12567 2026-06-18 cs.CV cs.AI 版本更新

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

用于单次测试时超声图像去噪的金字塔自对比学习

Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong(香港大学电子与计算机工程系) Department of Biomedical Engineering, Duke University(杜克大学生物医学工程系)

AI总结 本文提出一种纯测试时训练框架,用于单次超声图像去噪,应用于合成孔径超声,通过自对比学习分离解剖相似性和噪声随机性,提升去噪效果和结构细节。

详情
AI中文摘要

固有的电子噪声和斑点噪声使超声图像的临床解读复杂化。传统去噪方法依赖显式噪声假设,其有效性在复合噪声条件下减弱。基于学习的方法通常使用标注数据集在有限的图像域内预训练,这意味着在复杂的体内环境中不可避免地出现领域偏移。本研究提出一种金字塔自对比学习(PSCL)框架,用于无需预训练的测试时超声图像去噪。给定仅来自单次成像的多个噪声样本,PSCL将解剖相似性和噪声随机性解耦到各自独立的金字塔潜在空间中。干净图像随后从解剖空间解码,而噪声空间被丢弃。我们首先将PSCL应用于合成孔径超声(SAU),其中Aperture-to-Aperture循环作为自监督代理任务以确保去噪保真度。模拟实验涵盖0至30 dB的噪声水平以及从简单到复杂的内含物几何形状,结果显示SNR和CNR分别提升69.3%和34.4%。体内结果表明,仅使用心脏(六个超声心动图切面)、肝脏和肾脏的两个孔径数据,SNR和CNR分别提升84.8%和25.7%。PSCL在多种成像目标和配置下均能得到清晰图像,为在没有领域偏移和预训练成本的情况下实现更可靠的解剖可视化铺平了道路。

英文摘要

The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods are usually pretrained in a limited image domain using a labeled dataset, which implies inevitable domain shift in complex in vivo environments. This study proposes a Pyramid Self-Contrastive Learning (PSCL) framework for test-time ultrasound image denoising without pretraining. Given multiple noisy samples from only one-shot imaging, PSCL disentangles anatomical similarity and noise randomness into separate pyramid latent spaces. The clean image is then decoded from the anatomy space while discarding the noise space. We first apply PSCL to synthetic aperture ultrasound (SAU), where an Aperture-to-Aperture loop serves as a self-supervised proxy task to ensure denoising fidelity. Simulation experiments, including noise levels from 0 to 30 dB and inclusion geometries from simple to complex, demonstrated improvements of 69.3% in SNR and 34.4% in CNR. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. PSCL delivers clear images across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization without domain shift and pretraining costs.

2506.11139 2026-06-18 eess.IV cs.AI cs.CV 版本更新

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

网格通常在压缩密集信号方面优于隐式神经表示

Namhoon Kim, Sara Fridovich-Keil

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Georgia Institute of Technology(佐治亚理工学院)

AI总结 研究发现,对于密集信号任务,带插值的正则化网格在训练速度和重建质量上优于同等参数量的隐式神经表示,而INR仅在拟合二值信号(如形状轮廓)时表现更优。

Comments Our analysis is available at https://github.com/voilalab/INR-benchmark

详情
AI中文摘要

隐式神经表示(INR)最近展示了令人印象深刻的结果,但其基本容量、隐式偏差和缩放行为仍知之甚少。我们研究了不同INR在一系列具有不同有效带宽的2D和3D真实及合成信号上的性能,以及包括断层扫描、超分辨率和去噪在内的过拟合和泛化任务。通过根据模型大小以及信号类型和带宽对性能进行分层,我们的结果揭示了不同INR和网格表示如何分配其容量。我们发现,对于许多涉及密集信号的任务,具有插值的简单正则化网格在训练速度和质量上优于或等同于具有相同参数数量的任何INR。我们还发现有限的情况——即拟合二值信号(如形状轮廓)——其中INR优于网格,以指导INR的未来开发和使用,使其应用于最有利的应用场景。

英文摘要

Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for many tasks involving dense signals, a simple regularized grid with interpolation trains faster and to higher or comparable quality than any INR with the same number of parameters. We also find limited settings -- namely fitting binary signals such as shape contours -- where INRs outperform grids, to guide future development and use of INRs towards the most advantageous applications.

11. 鲁棒性、安全、隐私与可信视觉 11 篇

2606.18318 2026-06-18 cs.CV cs.CR 新提交

Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection

预算感知的自适应对抗补丁用于黑盒目标检测

Pedram MohajerAnsari, Amir Salarpour, David Fernandez, Mert D. Pesé

AI总结 提出一种查询高效、预算自适应的黑盒攻击方法,结合上下文汤普森采样放置和NES像素更新,在严格纯图像抑制测试下,对CNN和Transformer检测器实现强抑制,并揭示查询-视觉足迹权衡。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

对抗补丁对现代目标检测器构成实际威胁。先前工作揭示了脆弱性,但三个差距限制了可操作的见解:(i) 很少有基于分数的黑盒攻击在严格查询预算下联合优化补丁的位置、纹理和大小;(ii) 成功很少与补丁的视觉足迹相关联;(iii) 评估常常混淆EOT鲁棒性与纯视图抑制。我们提出一种查询高效、预算自适应的黑盒攻击,它结合了轻量级的上下文汤普森采样放置器与NES风格的像素更新,仅在进展停滞时增大补丁。报告基于严格的纯图像抑制测试;EOT被审计但从不作为成功的替代,可选的外观/可打印性权重揭示了强度-可见性权衡。在YOLOv5、Faster R-CNN和YOLOS上,该方法在基于CNN的检测器上实现了强抑制,在基于Transformer的检测器上实现了显著抑制,使用紧凑的补丁,并相对于固定大小和启发式基线暴露了清晰的查询-足迹权衡。打印-捕获实验进一步展示了跨未见物理对象和视角的迁移。

英文摘要

Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few score-based black-box attacks jointly optimize patch location, texture, and size under tight query budgets; (ii) success is rarely tied to the patch's visual footprint; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present a query-efficient, budget-adaptive black-box attack that couples a lightweight Contextual Thompson-Sampling placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a strict plain-image suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength-visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, the proposed attack achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query-footprint trade-offs relative to fixed-size and heuristic baselines. A print-capture pilot further shows transfer across unseen physical objects and viewpoints.
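The "NES-style" updates mentioned above refer to a family of score-based gradient estimators that need only function evaluations. A generic sketch on a toy objective follows, using antithetic Gaussian sampling. It is not the paper's attack: the function name `nes_gradient`, the default `sigma`, and the sample count are illustrative assumptions, and nothing here touches a detector.

```python
import numpy as np


def nes_gradient(f, x, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of a black-box scalar function f at x.

    Uses antithetic pairs x + sigma*eps and x - sigma*eps, so each pair
    costs two queries to f.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_pairs):
        eps = rng.standard_normal(x.shape)
        grad += (f(x + sigma * eps) - f(x - sigma * eps)) * eps
    return grad / (2.0 * sigma * n_pairs)
```

The query cost is `2 * n_pairs` per estimate, which is why query budgets matter for this class of method.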

2606.18510 2026-06-18 cs.CV cs.CR 新提交

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

人脸呈现攻击检测中的架构偏差:视觉Transformer与卷积神经网络的比较研究

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

发表机构 * College of Engineering, Carnegie Mellon University(卡内基梅隆大学工程学院)

AI总结 通过比较ViT和CNN在人脸呈现攻击检测中的表现,发现预训练ViT(DeiT-S)在准确率、公平性和跨种族泛化上优于CNN,将种族间ACER差距降低83%。

Comments 8 Pages, 4 Figures, 5 Tables

详情
AI中文摘要

人脸呈现攻击检测(PAD)系统构成生物特征认证中的关键安全层;然而,现有方法在不同人口群体间表现出系统性性能差异,对深肤色个体影响尤为严重。本文通过实证比较研究,探究视觉Transformer架构相对于卷积基线是否能够减少人脸PAD系统中的人口统计偏差。实验在CASIA-SURF跨种族人脸反欺骗(CeFA)数据集上进行。评估了三种架构:从头训练的多模态ViT-Tiny、ResNet18 CNN基线,以及在CeFA上微调的预训练DeiT-S,覆盖非洲、东亚和零样本中亚人口群体。DeiT-S实现了最高总体准确率97.27%和最低等错误率0.86%,优于准确率90.15%的ResNet18。在公平性方面,DeiT-S将非洲与东亚受试者之间的种族间ACER差距降至0.13%,而基于LBP的工作[6]报告为0.75%,降低了83%。最值得注意的是,ResNet18在零样本中亚受试者上的BPCER为10.44%,而DeiT-S在相同未见群体上保持2.89%,展现出3.6倍的泛化优势。这些结果表明,预训练视觉Transformer在PAD中实现了更高的准确率,产生了更小的人口统计性能差距,并在未见人口群体上更公平地泛化,表明PAD中的跨人口公平性可能部分受架构设计影响。

英文摘要

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
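The headline ratios in this abstract can be checked from the numbers it quotes. The sketch below also includes the common definition of ACER as the mean of APCER and BPCER; the abstract does not restate that definition, so it is an assumption here, and the helper names are illustrative.

```python
def acer(apcer: float, bpcer: float) -> float:
    """Average Classification Error Rate (assumed: mean of APCER and BPCER)."""
    return (apcer + bpcer) / 2.0


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Inter-ethnic ACER gap: 0.75% (LBP-based work) vs 0.13% (DeiT-S).
gap_reduction = relative_reduction(0.75, 0.13)

# Zero-shot Central Asian BPCER: 10.44% (ResNet18) vs 2.89% (DeiT-S).
bpcer_ratio = 10.44 / 2.89
```

Here `gap_reduction` is about 0.827 and `bpcer_ratio` is about 3.61, consistent with the 83% reduction and 3.6x advantage stated above.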

2606.19184 2026-06-18 cs.CV cs.LG 新提交

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

当AUC误导:域偏移下深度伪造检测器的极化感知评估

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel, Djamila Aouada

发表机构 * Cristal Laboratory, National School of Computer Sciences, University of Manouba(马努巴大学国家计算机科学学院Cristal实验室)

AI总结 针对现有AUC评估无法反映真实场景中混合数据源和不同伪影类型的问题,提出Cross-dataset AUC(Cross-AUC)指标,通过平均每域AUC并引入预测极化度量(Wasserstein距离)来评估域偏移鲁棒性,实验证明其有效性。

详情
AI中文摘要

生成式AI的最新进展,如扩散模型和换脸工具,使得创建高度逼真的深度伪造成为可能,导致了包括金融欺诈和非自愿色情内容在内的现实危害。为此,深度伪造检测成为一个活跃的研究领域,近期方法越来越关注提高对未见操作的泛化能力。这通常通过跨多个数据集分别测量的ROC曲线下面积(AUC)来评估。然而,这种评估未能反映检测器面对混合数据源和不同伪影类型的真实场景。为解决这一局限,我们引入一种新指标——跨数据集AUC(Cross-AUC),该指标平均每域AUC并加入预测极化度量,以考虑对域偏移的鲁棒性。极化程度通过类别分数分布之间的Wasserstein距离量化。Cross-AUC不仅更真实地评估深度伪造检测器在域偏移下的泛化能力,而且具有可解释性,因为它能更好地解释性能下降的原因。在七个基准数据集上的实验证明了其实用性。

英文摘要

Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and incorporates a measure of prediction polarization to account for robustness to domain shift. The extent of polarization is quantified by the Wasserstein distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but is also interpretable, as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.
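The abstract names the ingredients of Cross-AUC (per-domain AUCs and a Wasserstein distance between class score distributions) but not how they are combined, so the sketch below only computes the ingredients and does not attempt the metric itself. The scores, domain names, and helper functions are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def auc(real_scores, fake_scores):
    """Probability that a random fake scores above a random real sample
    (ties count half). Equivalent to the area under the ROC curve."""
    sorted_real = sorted(real_scores)
    wins = 0.0
    for s in fake_scores:
        below = bisect_left(sorted_real, s)
        tied = bisect_right(sorted_real, s) - below
        wins += below + 0.5 * tied
    return wins / (len(real_scores) * len(fake_scores))


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance: integral of |CDF_a - CDF_b|."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Made-up detector scores (higher = "more likely fake") for two domains.
domains = {
    "domain_a": {"real": [0.10, 0.20], "fake": [0.80, 0.90]},
    "domain_b": {"real": [0.85, 0.88], "fake": [0.95, 0.99]},
}

per_domain_auc = {n: auc(d["real"], d["fake"]) for n, d in domains.items()}
per_domain_gap = {n: wasserstein_1d(d["real"], d["fake"]) for n, d in domains.items()}
mean_auc = sum(per_domain_auc.values()) / len(per_domain_auc)

# Mixing the domains exposes the score-scale shift that per-domain AUC hides.
pooled_real = [s for d in domains.values() for s in d["real"]]
pooled_fake = [s for d in domains.values() for s in d["fake"]]
pooled_auc = auc(pooled_real, pooled_fake)
```

In this toy case each domain is perfectly separable on its own (AUC 1.0), yet the pooled AUC drops to 0.875 because real samples in one domain score higher than fakes in the other. That is the kind of mixed-source failure the abstract says separate per-dataset AUCs miss.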

2606.19259 2026-06-18 cs.CV cs.AI 新提交

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

一个用于检测 GPT-Image-2 生成的含丰富文本图像的多领域基准

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

AI总结 针对现有基准缺乏文本丰富图像检测的问题,构建了包含8602张图像、覆盖6个类别的多领域基准,评估5种检测器,发现性能高度依赖领域且易受JPEG压缩影响。

详情
AI中文摘要

含丰富文本的图像通常包含隐私敏感、交易或决策相关信息。随着最近多模态图像生成模型合成逼真文本内容和结构化视觉设计的能力越来越强,检测AI生成的含丰富文本图像已成为数字信任和内容真实性的重要挑战。然而,现有基准主要关注以物体为中心的图像,对文本语义和布局组织至关重要的场景覆盖有限。在本文中,我们引入了一个用于检测OpenAI的GPT Image 2生成的含丰富文本图像的多领域基准。该基准包含8602张图像,涵盖六个代表性类别:商业海报、信息图表、学术海报、收据、表格和UI截图。利用该基准,我们在零样本设置下评估了五种代表性AI生成图像检测器,并分析了它们的整体性能、类别性能和后处理鲁棒性。我们的结果表明,检测器性能高度依赖于领域:在某些类别上表现良好的方法往往在其他类别上失败,即使最强的传统检测器也对JPEG压缩表现出严重敏感性。我们进一步使用多模态视觉语言模型进行了探索性评估,揭示了其在结构化格式上的潜力和局限性。这些发现突显了针对现代AI生成图像需要文本和布局感知的检测方法。我们的数据集发布于XXX。

英文摘要

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

2606.18839 2026-06-18 cs.LG cs.CV 交叉投稿

Semantic Robustness Certification for Vision-Language Models

视觉语言模型的语义鲁棒性认证

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing & Information Systems, University of Melbourne, Australia

AI总结 提出首个无需额外数据即可认证视觉语言模型在语义层面(如形状、大小、风格)鲁棒性的框架,通过文本提示作为语义代理并量化决策边界,确保预测类别在语义变换下不变。

Comments Accepted to ICML

详情
AI中文摘要

视觉语言模型(VLM)现在被广泛用于下游任务。然而,现实世界的应用常常使VLM面临由语义变化(例如形状、大小和风格)引起的分布偏移。鲁棒性认证确定当对输入应用变换时模型的预测是否改变。虽然大多数认证框架研究输入的几何或像素级变换,但本文提出了一种新颖的框架,能够在语义级变换下认证VLM的鲁棒性。利用VLM的开放词汇能力,我们使用文本提示作为语义代理来构建由控制语义变化程度的范围参数化的变换。通过以封闭形式表征VLM决策边界,我们的框架定量地认证了在语义变换下预测类别保持不变的范围区间。我们的框架是第一个在语义级变化下认证VLM鲁棒性而无需为每种变化提供额外数据的框架,使其易于应用。在合成数据和真实数据上的实验表明,我们的框架能够在各种场景下认证针对多种语义变化的鲁棒性。

英文摘要

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

2508.03483 2026-06-18 cs.CV cs.AI 版本更新

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

当汽车有刻板印象:审计文本到图像模型中对象的群体偏见

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

发表机构 * AIM Intelligence(AIM智能研究院) Yonsei University(延世大学)

AI总结 提出SODA框架,通过三个指标系统测量文本到图像模型在生成对象中的群体偏见,发现中性提示隐含偏向中年和白人,且人口统计线索导致高度偏斜的刻板输出。

详情
AI中文摘要

虽然先前关于文本到图像生成的研究主要集中在人类描绘中的偏见,但生成对象中的群体偏见仍然相对未被充分探索。我们引入了SODA(刻板对象诊断审计),这是一个新颖的框架,通过自动属性发现和三个标准化指标系统地测量这些偏见:基础与群体差异(BDS)、跨群体差异(CDS)和视觉属性集中度(VAC)。将SODA应用于五个最先进模型和八个对象类别(例如汽车)的8000张图像,我们发现“中性”提示产生的输出在视觉上最接近中年和白人,表明这些群体在模型默认设置中被隐含地过度代表。此外,人口统计线索触发了高度偏斜的刻板输出:26.6%的对象-模型-群体组合产生的结果中,所有20张生成图像共享完全相同的属性值(例如,为女性生成玫瑰金笔记本电脑)。最后,提示级别的去偏减少了群体间差异,但矛盾地压缩了群体内多样性,用一种刻板印象取代了另一种。SODA提供了一个实用的流程,使这些隐含关联变得可测量,作为迈向更负责任的人工智能发展的一步。

英文摘要

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.
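The 26.6% figure counts object-model-demographic combinations in which every generated image carries the same attribute value. A minimal sketch of that count follows; the data layout, attribute values, and function names are hypothetical, and this is not the paper's VAC metric, whose definition the abstract does not give.

```python
from collections import Counter


def top_value_share(values):
    """Share of images carrying the most common attribute value."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def fully_collapsed_fraction(groups):
    """Fraction of groups in which all images share one attribute value."""
    collapsed = sum(1 for values in groups.values() if top_value_share(values) == 1.0)
    return collapsed / len(groups)


# Hypothetical attribute labels keyed by (object, model, demographic cue);
# each list stands for the images generated for that combination.
groups = {
    ("laptop", "model_x", "women"): ["rose gold"] * 4,
    ("laptop", "model_x", "men"): ["black", "black", "silver", "black"],
    ("car", "model_x", "elderly"): ["sedan"] * 4,
    ("car", "model_x", "young"): ["coupe", "suv", "coupe", "hatchback"],
}

print(fully_collapsed_fraction(groups))  # share of fully collapsed combinations
```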

2511.20002 2026-06-18 cs.CV cs.AI cs.CR 版本更新

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

语义路由器:通过单一对抗扰动劫持多模态大语言模型的可行性研究

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)) School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China(数据科学学院、人工智能学院、香港中文大学(深圳))

AI总结 提出语义感知通用扰动(SAUP),作为语义路由器同时劫持多个无状态决策,通过理论分析和SORT优化策略实现,在Qwen上对五个目标达到66%攻击成功率。

Comments Accepted to ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在无状态系统中,例如自动驾驶和机器人技术。本文研究了一种新型威胁:语义感知劫持。我们探索了使用单一通用扰动同时劫持多个无状态决策的可行性。我们引入了语义感知通用扰动(SAUP),它充当语义路由器,“主动”感知输入语义并将其路由到不同的、攻击者定义的目标。为了实现这一点,我们对潜在空间中的几何特性进行了理论和实证分析。在这些见解的指导下,我们提出了语义导向(SORT)优化策略,并标注了一个具有细粒度语义的新数据集以评估性能。在三个代表性MLLM上的大量实验证明了这种攻击的基本可行性,在针对Qwen的五个目标上使用单帧实现了66%的攻击成功率。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2606.11615 2026-06-18 cs.CV cs.CR cs.LG 版本更新

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Adv-TGD:面向人脸识别冒充攻击的对抗性文本引导扩散

Omid Ahmadieh, Nima Karimian

发表机构 * University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出Adv-TGD框架,利用Stable Diffusion和LoRA微调生成逼真对抗人脸,在保持视觉质量的同时实现高成功率身份冒充攻击,平均ASR达85.90%。

详情
AI中文摘要

人脸识别(FR)技术的广泛普及引发了严重的隐私担忧,因为面部数据可能在未经同意的情况下被利用。为了解决这一挑战,我们提出了Adv-TGD,一个生成式对抗攻击框架,能够合成逼真的人脸,冒充目标身份并欺骗人脸识别系统。基于Stable Diffusion v2.1,Adv-TGD对每个样本进行LoRA微调,以简洁的文本提示为条件,生成自然但具有对抗性操控的身份。与传统的身份攻击方法不同,我们的方法在固定时间步的去噪过程中为每个源-目标对优化轻量级交叉注意力适配器。潜在混合受到面部局部热图掩码的约束,以确保空间精确的身份操控,同时保留非敏感区域。我们引入了一个复合目标,结合了掩码epsilon-MSE重建、FR嵌入空间中的阈值化身份差异、方向特征对齐和源相似性抑制,以平衡对抗攻击和视觉真实性。可选地,LLaVA生成的属性提示增强了细粒度语义细节,而不会重新引入身份线索。在黑盒评估协议下,Adv-TGD在IR152、IRSE50、MobileFace和FaceNet上平均攻击成功率(ASR)达到85.90%,比语义SOTA基线Adv-CPG高6.25个百分点,比基于扩散的化妆方法DiffAIM高3个百分点,比基于噪声的P3-Mask高16个百分点。尽管攻击效果强劲,Adv-TGD仍保持了高视觉保真度(PSNR = 28.18 dB,SSIM = 0.981)。此外,我们通过成功将其扩展到野外数据集(LADN)、通用对象分类(ImageNet)和基于Transformer的扩散模型(FLUX.1),展示了我们框架的灵活性。

英文摘要

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion v2.1, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a fixed-timestep denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by 6.25 points, the diffusion-based makeup method DiffAIM by 3 points, and the noise-based P3-Mask by 16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 28.18 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2504.14798 2026-06-18 cs.LG cs.CV 版本更新

RUB: Evaluating Residual Knowledge in Unlearned Models

RUB: 评估未学习模型中的残留知识

Hao Xuan, Xingyu Li

发表机构 * Electrical and Computer Engineering, University of Alberta(阿尔伯塔大学电气与计算机工程系)

AI总结 提出鲁棒未学习原则及统一基准RUB,通过未学习映射攻击(UMA)检测残留信息,揭示现有方法在对抗评估下的脆弱性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559
AI中文摘要

机器未学习(MUL)已成为隐私保护和内容监管的关键机制,然而当前技术往往无法保证完全移除敏感信息。虽然现有工作大多关注验证未学习的执行,但它们忽略了模型在面对对抗性恢复遗忘知识尝试时是否保持鲁棒性的关键问题。在这项工作中,我们倡导鲁棒未学习原则,要求模型既与重新训练的模型不可区分,又能抵御多样化的对抗威胁。为实例化这一原则,我们提出了一个统一基准RUB(鲁棒未学习基准),系统评估未学习算法在分类、图像到图像重建和文本到图像合成中的鲁棒性。在此框架内,我们引入未学习映射攻击(UMA)作为检测残留信息的通用方法,并展示现有攻击策略如何适应此框架,只要它们符合通用UMA框架。我们在判别式和生成式任务上的实验表明,最先进的未学习方法在这些评估下仍然脆弱,即使通过了标准验证指标。通过将鲁棒性定位为核心标准并提供对抗评估基准,我们希望RUB能为更可靠和安全的未学习实践铺平道路。RUB中的代码库和模型检查点将公开发布。

英文摘要

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this benchmark, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted to it, provided they conform to the generic UMA formulation. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be publicly released.

2505.03646 2026-06-18 cs.LG cs.AI cs.CV 版本更新

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

通过梯度信号恢复揭示自编码器中的隐藏漏洞

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

发表机构 * University of the Bundeswehr Munich(慕尼黑联邦国防军大学)

AI总结 针对自编码器对抗攻击中梯度消失导致鲁棒性被高估的问题,提出GRILL框架恢复梯度信号,显著提升攻击效果,暴露隐藏漏洞。

详情
AI中文摘要

深度自编码器(AE)的对抗鲁棒性受到的关注远少于判别模型,尽管其压缩的潜在表示会导致病态映射,从而放大小的输入扰动并破坏重建稳定性。现有的AE白盒攻击通过优化范数有界的对抗扰动以最大化重建损失,往往收敛到次优扰动,从而可能高估AE的鲁棒性。我们表明,这种限制与通过病态层反向传播时对抗损失梯度消失有关,这些病态层的中间权重矩阵具有接近零的奇异值。为了解决这个问题,我们提出了GRILL(病态层中的梯度信号恢复)框架,旨在减轻梯度退化并提高编码器-解码器架构中对抗鲁棒性评估的可靠性。GRILL旨在缓解优化过程中的对抗梯度退化,使攻击能够在固定范数约束下更好地逼近高失真扰动。通过在多种AE架构上的广泛实验,包括样本特定和通用攻击,以及标准和自适应攻击设置,我们表明GRILL显著提高了攻击有效性,从而暴露了现有攻击限制所隐藏的漏洞。除了AE之外,我们提供了初步证据表明现代多模态编码器-解码器架构也存在类似的漏洞。

英文摘要

Adversarial robustness of deep autoencoders (AEs) has received less attention than that of discriminative models, although their compressed latent representations induce ill-conditioned mappings that can amplify small input perturbations and destabilize reconstructions. Existing white-box attacks for AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction damage, often converge to suboptimal perturbations, thereby potentially overstating AE robustness. We show that this limitation is linked to vanishing adversarial loss gradients during backpropagation through ill-conditioned layers, associated with near-zero singular values in their intermediate weight matrices. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework that mitigates adversarial gradient degradation during optimization, enabling attacks to better approximate high-distortion perturbations under fixed norm constraints and improving the reliability of adversarial robustness evaluation in encoder-decoder architectures. Through extensive experiments across multiple AE architectures, under both sample-specific and universal attacks, as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness, thereby exposing vulnerabilities hidden by existing attack limitations. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures exhibit similar vulnerabilities.

2606.09946 2026-06-18 cs.AR cs.CV 版本更新

SPARX: Secure and Privacy-Aware Approximate CNN Acceleration with Edge RISC-V SoC

SPARX: 面向边缘RISC-V SoC的安全与隐私感知近似CNN加速

Sonu Kumar, Akash Sankhe, Mukul Lokhande, Santosh Kumar Vishvakarma

发表机构 * Dept of Science and Technology (DST), Govt of India(印度科学技术部) MeitY/SMDP-C2S(印度电子与信息化部/SMDP-C2S)

AI总结 提出SPARX框架,集成RISC-V指令扩展、近似对数CNN加速单元、差分隐私引擎和认证机制,通过近似感知决策框架选择最优乘法器,在边缘实现安全高效的CNN推理。

Comments Under review in 12th International Symposium on Smart Electronic Systems (iSES) 2026

详情
AI中文摘要

边缘AI系统日益需要在严格的能耗、性能、安全和隐私约束下进行实时CNN推理。近似计算通过利用神经网络工作负载的错误容忍性来提高硬件效率;然而,大多数近似CNN加速器并未联合考虑安全的、隐私感知的边缘部署。本文提出了SPARX,一个集成在异构RV32IMC RISC-V系统级芯片(SoC)内的安全与隐私感知近似CNN加速框架。SPARX结合了自定义RISC-V指令扩展、近似对数CNN加速单元、轻量级基于差分噪声的隐私引擎以及挑战-响应认证机制。为了指导算术选择,引入了一个近似感知决策框架,该框架使用近似严重性指数(ASI)、近似效率(AE)、近似质量(QoA)、近似品质因数(AFOM)和硬件加速效率(HAE)。对11种最先进的近似MAC架构的评估表明,迭代对数乘法器(ILM)是最合适的设计,与精确的基4 Booth MAC相比,面积减少51.7%,功耗降低81.5%,吞吐量提升2.13倍,而仅使ResNet-20/CIFAR-10的准确率降低2.82个百分点。在Xilinx VC707平台上的FPGA实现实现了250 MHz下58.4 GOPS/W的能效,而28纳米CMOS物理实现验证了ASIC的可行性。

英文摘要

Edge-AI systems increasingly require real-time CNN inference under strict energy, performance, security, and privacy constraints. Approximate computing improves hardware efficiency by exploiting the error resilience of neural network workloads; however, most approximate CNN accelerators do not jointly consider secure, privacy-aware edge deployment. This paper presents SPARX, a Secure and Privacy-Aware Approximate CNN Acceleration framework integrated within a heterogeneous RV32IMC RISC-V System-on-Chip (SoC). SPARX combines a custom RISC-V instruction extension, an approximate logarithmic CNN acceleration unit, a lightweight differential-noise-based privacy engine, and a challenge-response authentication mechanism. To guide arithmetic selection, an approximation-aware decision framework is introduced that uses the Approximation Severity Index (ASI), Approximation Efficiency (AE), Quality of Approximation (QoA), Approximation Figure-of-Merit (AFOM), and Hardware Acceleration Efficiency (HAE). Evaluation across 11 state-of-the-art approximate MAC architectures identifies the Iterative Logarithmic Multiplier (ILM) as the most suitable design, achieving 51.7% area reduction, 81.5% power reduction, and 2.13x throughput improvement compared with an accurate radix-4 Booth MAC, while only reducing ResNet-20/CIFAR-10 accuracy by 2.82 percentage points. FPGA implementation on a Xilinx VC707 platform achieves 58.4 GOPS/W energy efficiency at 250 MHz, while 28-nm CMOS physical implementation validates ASIC feasibility.
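The abstract does not describe the internals of the selected multiplier. The sketch below follows the commonly described iterative logarithmic scheme: each operand is split into its leading power of two plus a residue, the residue-times-residue term is dropped, and optional correction passes recover it by applying the same step to the residues. Whether SPARX uses this exact variant, and how many correction passes it runs, are assumptions; the function names are illustrative.

```python
def leading_one(n: int) -> int:
    """Bit position of the most significant set bit (n must be > 0)."""
    return n.bit_length() - 1


def ilm_multiply(a: int, b: int, corrections: int = 0) -> int:
    """Approximate a * b for non-negative integers using shifts and adds.

    With a = 2**ka + ra and b = 2**kb + rb, the exact product is
    2**(ka+kb) + ra*2**kb + rb*2**ka + ra*rb. Each pass keeps the first
    three terms and defers ra*rb to the next pass (if any).
    """
    product = 0
    for _ in range(corrections + 1):
        if a == 0 or b == 0:
            break  # nothing left to correct: the result is already exact
        ka, kb = leading_one(a), leading_one(b)
        ra, rb = a - (1 << ka), b - (1 << kb)
        product += (1 << (ka + kb)) + (ra << kb) + (rb << ka)
        a, b = ra, rb
    return product


for passes in range(3):
    print(passes, ilm_multiply(13, 11, corrections=passes))
```

For 13 x 11 = 143 this prints 128, 142, and 143: the approximation never overshoots, and each pass shrinks the error. This is the accuracy-versus-hardware trade-off that approximate MAC units exploit.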

12. 数据集、基准、评测与训练方法 26 篇

2606.18484 2026-06-18 cs.CV 新提交

Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

Vines-DB:用于多物种观赏藤蔓分割的RGB图像数据集

Saroj Burlakoti, Utsav Bhandari, Aaron Etienne, Shital Poudyal

发表机构 * Department of Plants, Soils and Climate, Utah State University(植物、土壤与气候系,犹他州立大学) Department of Applied Sciences, Technology and Education, Utah State University(应用科学、科技与教育系,犹他州立大学)

AI总结 为支持精准园艺和城市生态中的多类实例分割,构建了包含7种观赏藤蔓的RGB图像数据集Vines-DB,通过手动标注和增强得到2307张图像,并划分训练/验证/测试集。

Comments 7 pages, 1 figure. Source data repository: OSF (DOI: 10.17605/OSF.IO/YJHCK)

详情
AI中文摘要

Vines-DB数据集包含在美国犹他州洛根市犹他农业实验站格林维尔研究农场田间条件下采集的7种观赏藤蔓的1,218张原始高分辨率RGB图像。该数据集来自168株于2022年移植的藤本植物,在2023和2024生长季(7月至10月)的多个月份重复拍摄。图像使用配备48 MP摄像头的iPhone 16 Pro在上午10:00至下午12:00之间于日光下拍摄。藤蔓生长在1.2m x 2.4m的格架上,从1m距离处拍摄,背景为黑色或白色泡沫板,以增强对比度并减少背景噪声。数据集包括木通、凌霄花、藤绣球、金银花、凌霄'马德琳·加伦'、五叶地锦和多花紫藤。所有原始图像由训练有素的标注员在Roboflow中手动标注,生成基于多边形的实例分割掩码,共8个类别(7个物种和背景)。经过预处理和数据增强后,工作数据集扩展至2,307张图像,用于模型开发和评估。增强后的数据集通过分层抽样划分为2,019张训练图像、192张验证图像和96张测试图像,以保持平衡的代表性。Vines-DB支持精准园艺和城市生态中多类实例分割深度学习模型的开发和评估。该数据集可实现自动冠层覆盖度估计、物种识别和可扩展的田间表型分析等应用。此外,每月重复成像捕获了冠层发育和植物外观的时间变化,增加了数据集在真实田间条件下进行分割基准测试的实用性。

英文摘要

The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2m x 2.4m trellises and photographed from a distance of 1m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.
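The abstract reports a stratified split into 2,019 training, 192 validation, and 96 test images (2,307 in total) but does not say how it was produced. The sketch below shows one generic way to stratify by class label; the fractions, seed, and function name are assumptions and are not taken from the dataset's tooling.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that keep each class's share
    roughly equal across the three subsets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy labels: two species with different image counts.
labels = ["Akebia quinata"] * 20 + ["Wisteria floribunda"] * 10
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

As a consistency check on the reported numbers, 2,019 + 192 + 96 does equal the stated 2,307 images.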

2606.18554 2026-06-18 cs.CV 新提交

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

伪造灾难:扩散时代跨域合成灾难检测基准

Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(胡志明市国家大学下属理科大学) Vietnam National University, Ho Chi Minh(胡志明市国家大学) University of Information Technology, VNU-HCM(胡志明市国家大学下属信息技术大学) University of Dayton(代顿大学) National Institute of Informatics(国立信息学研究所)

AI总结 针对扩散模型生成的逼真灾难图像难以检测的问题,提出包含30000张图像(6000张真实、24000张合成)的基准数据集,实验发现微调检测器在未知生成器或灾难类型上准确率下降高达50%,零样本检测器也不稳定,凸显了跨域检测的迫切需求。

Comments SOICT 2025

详情
AI中文摘要

文本到图像扩散模型的快速进步使得创建高度逼真的合成图像成为可能,这些图像与真实照片极为相似,使得区分真实内容与AI生成的伪造品越来越困难。这对网络安全、数字取证和灾难响应构成了挑战,其中洪水、火灾或地震的虚假图像可能传播错误信息或扰乱应急行动。为此,我们引入了Forged Calamity,一个用于合成灾难检测的基准数据集,包含30000张图像,其中包括6000张真实样本和由四种扩散模型生成的24000张合成样本。在微调和零样本设置下的全面实验揭示了当前取证方法的一致弱点。微调检测器在分布内表现良好,但在未见过的生成器或灾难类型上准确率下降高达50%,显示出对模型特定伪影的过拟合。零样本通用检测器也难以保持稳定的准确率,只有少数具有鲁棒表示能力的模型表现出有限的韧性。这些发现凸显了持续存在的泛化差距,以及在扩散时代确保视觉真实性迫切需要领域和模型无关的检测方法。

英文摘要

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

2606.18555 2026-06-18 cs.CV 新提交

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

重新思考文本到图像作为室内场景识别的语义感知数据增强

Trong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM, Vietnam(越南国立大学胡志明市理科大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国立大学胡志明市分校)

AI总结 针对室内图像数据不足,提出利用稳定扩散生成合成图像进行数据增强,并通过扩散重建误差防止滥用,在MIT室内场景数据集上验证了有效性。

Comments MAPR 2024

详情
AI中文摘要

在计算机视觉领域,室内图像识别由于光照条件、遮挡以及有限空间内多样化物体排列的复杂相互作用而面临挑战。为了解决训练室内图像缺乏的问题,我们引入了一种新颖的方法,利用稳定扩散(SD)生成合成图像,作为强大的数据增强工具。SD的使用提供了一个原则性框架,用于合成多样且逼真的室内场景,从而丰富训练数据池,以构建鲁棒的室内图像识别模型。在MIT室内场景数据集上的实验结果表明,当真实数据有限时,我们提出的方法在增强深度模型训练方面具有潜力。此外,为了防止SD合成图像的滥用,我们引入了一种基于扩散重建误差(DIRE)的应对措施。强大的DIRE表示使得仅使用轻量级深度模型就能训练鲁棒的分类器。实验表明,我们的方法能够完美识别SD生成的图像,使用MobilenetV3的准确率达到100%。

英文摘要

In computer vision, indoor image recognition is challenging because of the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lack of indoor training images, we introduce a novel approach that uses Stable Diffusion (SD) to generate synthetic images, which serve as a powerful data augmentation tool. SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach recognizes SD-generated images with 100% accuracy using MobileNetV3.

2606.18565 2026-06-18 cs.CV eess.SP 新提交

Experimental Analysis of Neural Network-Based Image Classification on the CIFAR-10 Dataset

基于神经网络的CIFAR-10数据集图像分类实验分析

Necati Kagan Erkek, Emre Balci, Berkin Halay

发表机构 * Department of Electronics and Communication Engineering, Istanbul Technical University(伊斯坦布尔技术大学电子与通信工程系)

AI总结 通过全连接和卷积网络在CIFAR-10上实验,分析完整学习流程,六层卷积网络在10个epoch后验证准确率约74.77%,揭示了表示学习与记忆化的差异。

Comments 7 pages

详情
AI中文摘要

通过全连接和卷积网络公式,对CIFAR-10基准上的神经图像分类进行了实验研究。分析强调了完整的学习流程:图像向量化、归一化、独热类编码、监督损失最小化、学习率选择、小批量训练、卷积特征提取、最大池化和基于验证的泛化评估。评估了一个具有六个卷积层和三个最大池化阶段的卷积架构,使用批量大小为128、学习率为0.001的Adam优化器进行十个训练周期。验证准确率达到约74.77%,而验证损失在训练中期后开始增加,尽管训练损失持续减少。由此产生的行为说明了表示学习与记忆化之间的实际差异,并为未来关于正则化、数据增强、更深层架构和可复现图像分类教育的研究提供了紧凑的实验基线。

英文摘要

An experimental investigation of neural image classification on the CIFAR-10 benchmark is presented through fully connected and convolutional network formulations. The analysis emphasizes the complete learning pipeline: image vectorization, normalization, one-hot class encoding, supervised loss minimization, learning-rate selection, mini-batch training, convolutional feature extraction, max-pooling, and validation-based generalization assessment. A convolutional architecture with six convolutional layers and three max-pooling stages is evaluated for ten training epochs using a batch size of 128 and an Adam optimizer with a learning rate of 0.001. The validation accuracy reaches approximately 74.77%, while the validation loss begins to increase after the middle of training despite continued reduction in training loss. The resulting behavior illustrates the practical difference between representation learning and memorization, and it provides a compact experimental baseline for future studies on regularization, data augmentation, deeper architectures, and reproducible image-classification education.
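The pattern described here, with validation loss rising while training loss keeps falling, is the usual signal for early stopping. The sketch below locates that turning point on a loss history; the loss values are invented for illustration and are not the paper's curves, and the patience value is an arbitrary choice.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def early_stop_epoch(val_losses, patience=2):
    """First epoch at which validation loss has failed to improve for
    `patience` consecutive epochs, or None if that never happens."""
    best, best_idx = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            return epoch
    return None


# Invented ten-epoch history: training loss keeps falling, validation
# loss bottoms out mid-training and then climbs.
train_losses = [1.60, 1.20, 1.00, 0.85, 0.72, 0.61, 0.52, 0.44, 0.37, 0.31]
val_losses = [1.40, 1.10, 0.95, 0.85, 0.80, 0.82, 0.86, 0.91, 0.97, 1.05]

print(best_epoch(val_losses), early_stop_epoch(val_losses))
```

On this made-up history the best checkpoint is epoch index 4 and training would stop at index 6, which is the kind of regularization baseline the abstract points to as follow-up work.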

2606.18841 2026-06-18 cs.CV 新提交

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

重新思考空地协作:渐进式跨任务基准与社会化学习框架

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

发表机构 * School of Automation, Southeast University(东南大学自动化学院) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) School of Sports Training, Tianjin University of Sport(天津体育学院运动训练学院) Faculty of Information Engineering and Automation, Kunming University of Science and Technology(昆明理工大学信息工程与自动化学院) School of Artificial Intelligence, Tianjin University(天津大学人工智能学院) School of Artificial Intelligence, Hebei University of Technology(河北工业大学人工智能学院)

AI总结 提出空地渐进协作基准AGPC和社会化协同感知框架SCP,通过双层级路由器实现跨视角跨任务选择性交互,在异构空地感知中提升下游性能7.86%。

详情
AI中文摘要

空地协同感知对于真实世界动态环境中的鲁棒视觉理解至关重要。然而,现有研究通常将协作建模为单任务跨视角融合,忽视了定位、目标关联和细粒度解析之间的功能依赖关系。此外,空中和地面视角的异构性引入了显著的几何、尺度和遮挡差异,使得统一特征共享容易受到负迁移的影响。为解决这些问题,我们将空地感知建模为渐进式跨任务协作任务,并构建了空地渐进协作(AGPC)基准,这是一个包含超过745K原始视频帧的时空对齐基准。基于该基准,我们提出了社会化协同感知(SCP),一个由粗到细的框架,从空中全局定位到地面目标关联和身份感知解析,渐进式地组织协作。其核心模块——双层级路由器(DLR),将输入侧的多尺度专家选择与输出侧的任务条件调制解耦,实现了选择性的跨视角和跨任务交互,同时抑制有害干扰。大量实验证明了SCP的有效性。它实现了3.73%的协同进化增益和7.86%的平均下游性能提升。这些结果表明,对于异构空地感知,任务条件协作比统一融合更有效。代码可在 https://github.com/g1136639260-spec/AGSCP 获取。

英文摘要

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs(Anates实验室) Technical University of Munich(慕尼黑工业大学) University of Technology Nuremberg(纽伦堡技术大学) Tuebingen AI Center, University of Tuebingen(图宾根大学人工智能中心) Helmholtz AI, Munich(慕尼黑亥姆霍兹人工智能中心) Google DeepMind research(谷歌DeepMind研究)

AI总结 本文提出Physics-IQ Verified基准,通过改进提示和真值(ground truth)质量并引入样本级评分系统,使视频生成模型物理理解的评估更可靠;该基准优化了57.6%的样本,并改进了超过34.8%的提示。

详情
AI中文摘要

视频生成模型(VGMs)已成为新的前沿,不仅用于视频生成,还用于多种下游任务,包括世界建模。为推进这些任务,一个良好的视频模型必须理解世界的物理现实。评估这种理解是一个新兴领域,催生了Physics-IQ基准,该基准通过将模型生成的视频与真实物理实验视频进行比较来量化这种理解。本文系统审计了Physics-IQ基准,揭示不足并提出三种解决方案,改进如何衡量VGMs的物理理解。具体而言,我们提高了提示和真值(ground truth)质量以减少混淆因素影响,并进一步引入样本级评分系统,使每个样本和指标权重相等。我们的基准Physics-IQ Verified优化了57.6%的所有样本并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中,我们观察到中等但有意义的排名变化(Kendall's τ=0.46)。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区发展,向物理准确的VGMs迈进。该基准的代码可通过此https URL访问。

英文摘要

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
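
Kendall's τ, used above to quantify ranking changes, counts how many pairs of models keep or flip their relative order between two rankings. The sketch below is a plain tau-a without tie handling. The two six-model rankings are hypothetical and are not the paper's results.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (no tie correction)."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Positive product: the pair is ordered the same way in both rankings.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of six models under two benchmark versions.
original_rank = [1, 2, 3, 4, 5, 6]
verified_rank = [2, 1, 4, 3, 6, 5]
# 3 of the 15 pairs flip, so tau = (12 - 3) / 15.
tau = kendall_tau(original_rank, verified_rank)
```

A value of 1 means identical rankings and -1 means a fully reversed order, so values around the middle of the positive range indicate that a noticeable share of model pairs changed order.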

2606.18952 2026-06-18 cs.CV 新提交

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

SP-TransientBench: 一个真实捕获的单光子感知基准

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

发表机构 * Shanghai University(上海大学) Southern University of Science and Technology(南方科技大学) The University of Sydney(悉尼大学)

AI总结 针对单光子LiDAR在真实场景中因噪声和多回波瞬态现象导致的感知挑战,提出包含10个场景、10297个视角的真实捕获多任务基准STB,支持深度估计、多视图重建和3D语义理解评估。

详情
AI中文摘要

基于单光子雪崩二极管(SPAD)传感的单光子LiDAR(SPL)能够以极高灵敏度进行时间分辨光子测量,为光子匮乏环境下的主动3D感知提供了独特潜力。然而,由于独特的测量噪声和复杂的多回波瞬态现象,真实世界的单光子感知仍然面临根本性挑战,这些因素共同使几何重建和语义场景理解变得复杂。尽管对基于SPAD的传感兴趣日益增长,现有研究大多局限于模拟数据或小规模受控捕获。因此,在深度估计、多视图重建和3D语义理解方面,对真实世界单光子感知的系统评估仍未得到充分探索。为弥补这一空白,我们引入了SP-TransientBench(STB),一个真实捕获的多任务单光子感知基准。STB包含10个多样化场景和10297个视图,使用固态单光子LiDAR以256×192分辨率捕获。每个视图提供具有多回波行为的完整飞行时间直方图、标准化元数据和用于多视图评估的校准相机位姿。我们还为选定场景提供了13类3D语义标注。通过为每个任务提供专用数据划分和评估协议,STB能够在多个3D视觉问题上实现真实世界单光子感知的一致且可重复的基准测试。数据集和代码将在接收后发布。

英文摘要

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.19053 2026-06-18 cs.CV 新提交

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

大规模视觉-语言模型在细粒度图像任务上的基准测试:从评估到诊断

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering, Southeast University, China(东南大学计算机科学与工程学院,中国) Alibaba Group(阿里巴巴集团) School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China(东南大学计算机科学与工程学院、智能科学与工程学院以及新一代人工智能技术及其交叉应用关键实验室,中国) Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China(北京大学王选计算机研究所、多媒体信息处理国家重点实验室,中国) University of Copenhagen, Denmark(丹麦哥本哈根大学)

AI总结 提出FG-BMK基准,含101万问题和28万图像,通过人机双范式评估LVLM的细粒度语义识别与视觉判别能力,诊断失败原因,发现视觉表示、语义对齐等瓶颈。

详情
AI中文摘要

近期大规模视觉-语言模型(LVLMs)展示了显著的多模态感知和推理能力。尽管众多基准从整体或任务特定角度评估了LVLMs,但它们在细粒度图像任务(计算机视觉的基础)上的能力仍未得到充分理解。为填补这一空白,我们引入FG-BMK,一个全面的细粒度评估基准,包含101万问题和28万图像,覆盖从常见物体中心领域到专业领域的多样化场景。FG-BMK通过面向人类和面向机器的范式,联合评估对话级细粒度语义识别和特征级视觉判别能力,从而诊断分析LVLM的失败是否源于视觉表示不足、视觉-语义对齐薄弱或细粒度知识有限。通过对一系列代表性LVLM/VLM的大量实验,我们发现当前LVLMs仍是不充分的细粒度识别器,失败源于视觉表示、语义对齐、模态对齐和类别级知识中相互交织的瓶颈。我们进一步分析了提升细粒度能力的训练设计因素,并考察了视觉和语言扰动如何影响LVLM预测。这些发现为当前LVLMs的局限性提供了诊断性见解,并为未来数据构建和模型设计提供了指导,以开发更可靠的细粒度视觉任务LVLMs。我们的代码已开源,可从此https URL获取。

英文摘要

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

2606.18676 2026-06-18 cs.LG cs.CV 交叉投稿

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

InTrain: 面向零成本神经架构搜索的内在可训练性

Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen, Zhikai Hu, Weiwei Cai

发表机构 * School of Computer and Data Science, Fuzhou University(福州大学计算机与数据科学学院) School of Computer and Data Science, Minjiang University(闽江学院计算机与数据科学学院) School of Artificial Intelligence, Nanchang University(南昌大学人工智能学院) Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系) School of Interdisciplinary Medicine and Engineering, Harbin Medical University(哈尔滨医科大学跨学科医学与工程学院)

AI总结 提出统一理论代理InTrain,通过几何容量和优化韧性两个协同成分形式化架构的可训练性,在NAS基准上达到与集成方法相当的排序相关性。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
AI中文摘要

免训练神经架构搜索有望在不进行昂贵训练的情况下高效发现高性能网络。然而,现有的零成本代理依赖于碎片化的启发式方法,未能捕捉基本问题:是什么使一个架构具有可训练性?本文引入内在可训练性(InTrain),一个统一的理论代理,将可训练性形式化为由两个协同成分——几何容量和优化韧性——涌现出的架构不变性。我们通过分析神经信息处理来操作化内在可训练性。几何容量通过激活协方差特征谱的参与比量化,捕捉表示流形的有效维度。优化韧性通过累积梯度健康度测量,评估跨网络深度的反向传播鲁棒性。InTrain通过尺度不变的乘法耦合综合这些维度,我们假设这对于捕捉它们协同、非加性的关系至关重要。在标准NAS基准和搜索空间上的大量实验表明,InTrain达到了与最先进的基于集成的代理相当的排序相关性,并优于其他单指标方法。

英文摘要

Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
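
The geometric-capacity term above relies on the participation ratio of the activation covariance eigenspectrum. A minimal sketch of the standard definition, PR = (Σλ)² / Σλ², is shown below. It uses the identities Σλ = tr(C) and Σλ² = ‖C‖²_F for a symmetric covariance matrix C, so no eigendecomposition is needed. The function name and the toy activations are illustrative assumptions; how InTrain normalizes or aggregates this quantity across layers is not stated in the abstract.

```python
def participation_ratio(activations):
    """Participation ratio of the covariance of `activations`.

    activations: list of n samples, each a list of d feature values (n >= 2).
    Returns a value between 1 (one dominant direction) and d (isotropic).
    Raises ZeroDivisionError if all features are constant.
    """
    n = len(activations)
    d = len(activations[0])
    means = [sum(row[k] for row in activations) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in activations]
    # Unbiased sample covariance matrix (d x d).
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[k][k] for k in range(d))          # equals the sum of eigenvalues
    frob_sq = sum(v * v for row in cov for v in row)  # equals the sum of squared eigenvalues
    return trace * trace / frob_sq


# Toy activations: two uncorrelated features with equal variance.
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
# Toy activations: the second feature is an exact multiple of the first.
collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pr_isotropic = participation_ratio(isotropic)
pr_collapsed = participation_ratio(collapsed)
```

The isotropic toy case uses both available dimensions, while the collapsed case uses only one, which is the sense in which the ratio measures effective dimensionality.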

2303.18031 2026-06-18 cs.CV cs.AI cs.LG 版本更新

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

简单域泛化方法是开放域泛化的强基线

Masashi Noguchi, Shinichi Shirakawa

发表机构 * Graduate School of Environment and Information Sciences(环境与信息科学研究生院) Yokohama National University(横滨国立大学) Faculty of Environment(环境学系)

AI总结 本文评估现有域泛化方法在开放域泛化中的表现,发现简单方法CORAL和MMD与复杂方法DAML竞争力相当,并通过集成学习和Dirichlet混合数据增强简单扩展后性能接近DAML且计算成本更低。

Comments Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

详情
AI中文摘要

在现实应用中,机器学习模型需要处理开放集识别(OSR),即在推理过程中出现未知类别,同时还要处理域偏移,即训练和推理阶段数据分布不同。域泛化(DG)旨在处理推理阶段目标域在模型训练期间不可访问的域偏移情况。开放域泛化(ODG)同时考虑DG和OSR。域增强元学习(DAML)是一种针对ODG的方法,但其学习过程复杂。相比之下,尽管已提出多种DG方法,但它们尚未在ODG场景下进行评估。在本研究中,我们全面评估了现有DG方法在ODG中的表现,并表明两种简单的DG方法——相关对齐(CORAL)和最大均值差异(MMD)——在多种情况下与DAML具有竞争力。此外,我们通过引入DAML中使用的技术(如集成学习和Dirichlet混合数据增强)提出了CORAL和MMD的简单扩展。实验评估表明,扩展后的CORAL和MMD可以以较低的计算成本达到与DAML相当的性能。这表明简单的DG方法及其简单扩展是ODG的强基线。

英文摘要

In real-world applications, a machine learning model is required to handle an open-set recognition (OSR), where unknown classes appear during the inference, in addition to a domain shift, where the data distribution differs between the training and inference phases. Domain generalization (DG) aims to handle the domain shift situation where the target domain of the inference phase is inaccessible during the model training. Open domain generalization (ODG) considers DG and OSR. Domain-augmented meta-learning (DAML) is a method targeting ODG; however, it has a complicated learning process. By contrast, although various DG methods have been proposed, they have not been evaluated in ODG situations. In this study, we comprehensively evaluate the existing DG methods in ODG and show that the two simple DG methods, CORrelation ALignment (CORAL) and maximum mean discrepancy (MMD), are competitive with DAML in several cases. In addition, we propose simple extensions of CORAL and MMD by introducing the techniques used in DAML, such as ensemble learning and Dirichlet mixup data augmentation. The experimental evaluation demonstrates that the extended CORAL and MMD can perform comparably to DAML with lower computational costs. This suggests that the simple DG methods and their simple extensions are strong baselines for ODG.
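
Both simple baselines named above penalize a statistical gap between feature batches from different domains. The sketch below shows the Deep CORAL form of the covariance gap, ‖C_s − C_t‖²_F / (4d²), and the simplest linear-kernel version of MMD, which reduces to the squared distance between feature means. The linear kernel is a simplifying assumption made here; the abstract does not say which kernel the evaluated MMD baseline uses, and the toy features are made up.

```python
def _mean(rows):
    n, d = len(rows), len(rows[0])
    return [sum(r[k] for r in rows) / n for k in range(d)]


def _covariance(rows):
    """Unbiased sample covariance of a batch of feature vectors (needs n >= 2)."""
    n, d = len(rows), len(rows[0])
    mu = _mean(rows)
    centered = [[r[k] - mu[k] for k in range(d)] for r in rows]
    return [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = _covariance(source), _covariance(target)
    frob_sq = sum((cs[a][b] - ct[a][b]) ** 2 for a in range(d) for b in range(d))
    return frob_sq / (4 * d * d)


def linear_mmd(source, target):
    """MMD^2 with a linear kernel: squared distance between the batch means."""
    ms, mt = _mean(source), _mean(target)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))


# Toy 2-D features from two hypothetical domains.
source_feats = [[0.0, 0.0], [2.0, 2.0]]
target_feats = [[0.0, 0.0], [2.0, -2.0]]
coral_value = coral_loss(source_feats, target_feats)
mmd_value = linear_mmd(source_feats, target_feats)
```

Either value would typically be added to the classification loss with a weighting factor, so that the feature extractor is pushed toward domain-invariant statistics.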

2406.18215 2026-06-18 cs.CV 版本更新

Optimizing Incomplete, Large-Scale and Sparse Multi-Graph Matching in Bioimaging

优化生物成像中不完整、大规模和稀疏的多图匹配

Max Kahl, Sebastian Stricker, Lisa Hutschenreiter, Florian Bernard, Carsten Rother, Bogdan Savchynskyy

发表机构 * Heidelberg University(海德堡大学) Max Planck Institute for Informatics(马克斯·普朗克信息研究所) University of Bonn(波恩大学)

AI总结 针对生物成像中大规模稀疏多图匹配问题,提出稀疏排列同步范式及通用方法GREEDA,在目标值和运行时间上优于现有方法。

详情
AI中文摘要

多图匹配是计算机视觉中的一个基本问题。我们的工作受到生物成像中一个具有挑战性的应用的启发,在该应用中,需要将数十甚至数百张蠕虫的3D显微镜图像进行对应。现有数据集未覆盖这种大规模场景,且几乎所有现有方法都不适用,因为它们假设完整或密集的问题设置。为了支持进一步研究,我们的第一个贡献是基于生物成像中的问题实例构建了一个新的大规模数据集。我们的第二个贡献是对两种主要的多图匹配范式:直接法和排列同步法进行了全面分析。我们通过部分证明论证,实用的大规模方法必须明确处理问题的稀疏性和不完整性。由于标准的排列同步方法在此设置下失败,我们进一步引入了一种稀疏排列同步范式。我们的最终贡献是GREEDA,一种针对稀疏和不完整问题的通用方法,可跨成本阶和范式实例化。虽然本文重点研究最高二次阶的目标函数,但GREEDA本质上可推广到任意阶。在更大、更稀疏的实例上,GREEDA在目标值和运行时间上均优于竞争方法。例如,对于基于30张蠕虫图像的中等规模问题,GREEDA在2分钟内产生高质量解,而竞争方法至少需要半小时且结果差得多。在较小的密集问题上,GREEDA与领先方法性能相当,但速度快一个数量级。

英文摘要

Multi-graph matching is a fundamental problem in computer vision. Our work is motivated by a challenging application in bioimaging, where dozens or even hundreds of 3D microscopy images of worms must be brought into correspondence. Existing datasets do not cover this large-scale regime, and virtually all existing methods are inapplicable because they assume a complete or dense problem setting. To support further research, our first contribution is a new large-scale dataset based on problem instances from bioimaging. Our second contribution is a comprehensive analysis of the two main multi-graph matching paradigms: direct and permutation synchronization-based formulations. We argue, in part by proof, that practical large-scale methods must explicitly address problem sparsity and incompleteness. Since standard permutation synchronization approaches fail in this setting, we further introduce a sparse permutation synchronization paradigm. Our final contribution is GREEDA, a general method for sparse and incomplete problems that can be instantiated across cost orders and paradigms. While our paper focuses on objective functions up to quadratic order, GREEDA is inherently generalizable to arbitrary orders. On larger, sparse instances, GREEDA outperforms competing methods in both objective value and runtime. For example, for moderately sized problems based on 30 worm images, GREEDA produces a high-quality solution within 2 minutes, whereas competitors require at least half an hour and yield far worse results. On smaller dense problems, GREEDA remains on par with leading methods while being an order of magnitude faster.

2407.18245 2026-06-18 cs.CV cs.LG 版本更新

VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset

VGGHeads: 基于大规模合成数据集的3D多头部对齐

Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht

发表机构 * University of Oxford(牛津大学) Piñata Farms Ukrainian Catholic University(乌克兰天主教大学)

AI总结 提出VGGHeads,一个由扩散模型生成的大规模合成数据集,用于单步同时进行头部检测和3D网格重建,在真实图像上表现优异。

详情
AI中文摘要

人类头部检测、关键点估计和3D头部模型拟合是许多应用中的基本任务。然而,传统的真实世界数据集常常存在偏差、隐私和伦理问题,并且是在实验室环境中记录的,这使得训练出的模型难以泛化。在这里,我们介绍VGGHeads,一个使用扩散模型生成的大规模合成数据集,用于人类头部检测和3D网格估计。我们的数据集包含超过100万张高分辨率图像,每张图像都标注了详细的3D头部网格、面部标志和边界框。利用这个数据集,我们引入了一种新的模型架构,能够从单张图像中单步同时进行头部检测和头部网格重建。通过广泛的实验评估,我们证明了在我们的合成数据上训练的模型在真实图像上取得了强劲的性能。此外,我们数据集的多样性使其适用于广泛的任务,提供了人类头部的通用和全面表示。

英文摘要

Human head detection, keypoint estimation, and 3D head model fitting are essential tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads, a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads.

2504.01527 2026-06-18 cs.CV eess.IV 版本更新

Beyond Nearest Neighbor Interpolation in Data Augmentation

超越数据增强中的最近邻插值

Olivier Rukundo

发表机构 * Department of Electronic and Computer Engineering, University of Limerick(电子与计算机工程系,利默里克大学)

AI总结 本文提出改进的几何变换函数和均值分类过滤机制,以避免最近邻插值带来的标注误差和低通滤波影响,通过离线数据增强管道提升医学图像分割性能。

Comments 10 pages, 11 figures, 14 tables

详情
AI中文摘要

避免最近邻插值导致的未定义类别标签风险忽视了增强训练数据中像素级标注误差的加剧风险。此外,插值算法固有的低通滤波效应会加剧标注区域内的高频结构细节退化风险。为避免这些风险,作者通过修改卷积神经网络的数据转换函数,引入改进的几何变换函数,去除对最近邻插值的依赖,并整合基于均值的类别过滤机制来处理未定义的类别标签。作者还实现了离线数据增强管道,生成特定于插值的增强训练数据,从而能够定量评估插值对增强训练数据的低通滤波效应。在三个医学图像分割数据集和XBAT+数据集上的实验评估显示,在多个定量指标上均实现了性能提升。

英文摘要

Avoiding the risk of undefined categorical labels using nearest neighbor interpolation overlooks the risk of exacerbating pixel level annotation errors in augmented training data. Additionally, the inherent low pass filtering effects of interpolation algorithms exacerbate the risk of degrading high frequency structural details within annotated regions of interest. To avoid these risks, the author modified convolutional neural networks data transformation functions by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation specific augmented training data, enabling quantitative assessment of interpolation specific low pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ datasets demonstrated performance gains across multiple quantitative metrics.
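
The "undefined categorical labels" mentioned above arise when a label map is resampled with an averaging interpolator. The 1-D sketch below illustrates only that problem: linear interpolation between classes 0 and 2 produces the value 1, which is not a class present in the mask, while nearest neighbor keeps only existing classes. It does not reproduce the paper's mean-based class filtering, whose details are not given in the abstract; the function names and the toy label row are assumptions.

```python
def resize_nearest(labels, new_len):
    """Nearest-neighbor resampling: every output value is an existing label."""
    n = len(labels)
    return [labels[min(n - 1, int((i + 0.5) * n / new_len))] for i in range(new_len)]


def resize_linear(labels, new_len):
    """Linear resampling with aligned endpoints (needs new_len >= 2)."""
    n = len(labels)
    out = []
    for i in range(new_len):
        x = i * (n - 1) / (new_len - 1)  # fractional source position
        lo = int(x)
        hi = min(lo + 1, n - 1)
        w = x - lo
        out.append((1 - w) * labels[lo] + w * labels[hi])
    return out


# Toy label row containing only classes 0 and 2.
mask_row = [0, 0, 2, 2]
nearest_row = resize_nearest(mask_row, 7)
linear_row = resize_linear(mask_row, 7)
undefined = [v for v in linear_row if v not in set(mask_row)]
```

Any interpolator other than nearest neighbor therefore needs an explicit rule for mapping such in-between values back to valid classes.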

2505.21954 2026-06-18 cs.CV cs.AI 版本更新

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

重新审视主动说话人检测:面向泛化性和鲁棒性的野外基准

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Tuan Khai Nguyen, Soochahn Lee, Yong Jae Lee

发表机构 * University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Oregon State University(俄勒冈州立大学) University of Sydney(悉尼大学) Kookmin University(韩国国民大学)

AI总结 提出UniTalk数据集,涵盖多语言、嘈杂背景和拥挤场景等挑战性真实条件,评估显示现有模型在野外环境下性能不足,而UniTalk训练模型泛化性更好,为主动说话人检测建立新基准。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们提出了UniTalk,一个强调挑战性场景的新数据集,旨在增强主动说话人检测(ASD)任务的模型泛化性。先前建立的基准如AVA主要包含老电影,因此与现实世界视频存在显著领域差距。相比之下,UniTalk涵盖了反映挑战性真实条件的多种视频类型,包括代表性不足的语言、嘈杂背景和拥挤场景,同时在规模上与AVA相当。广泛评估表明,在现实条件下ASD仍未解决:在AVA上接近完美的先进模型在UniTalk上未能达到饱和。相反,在UniTalk上训练的模型能更好地泛化到现代野外数据集,包括Talkies和ASW。因此,UniTalk为ASD建立了新的基准,为研究人员开发和评估多功能且鲁棒的模型提供了宝贵资源。

英文摘要

We present UniTalk, a novel dataset emphasizing challenging scenarios to enhance model generalization for the task of active speaker detection (ASD). Previously established benchmarks such as AVA predominantly comprise old movies and thus exhibit significant domain gaps with real-world video. In contrast, UniTalk covers diverse video types reflecting challenging real-world conditions, including underrepresented languages, noisy backgrounds, and crowded scenes, while being on par with AVA in scale. Extensive evaluations reveal that ASD remains unsolved under realistic conditions: state-of-the-art models near-perfect on AVA fail to reach saturation on UniTalk. Conversely, models trained on UniTalk generalize better to modern in-the-wild datasets including Talkies and ASW. UniTalk thus establishes a new benchmark for ASD, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

2510.21605 2026-06-18 cs.CV 版本更新

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

S3OD:基于合成数据的通用显著目标检测

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

发表机构 * University of Oxford, VGG(牛津大学,视觉几何组)

AI总结 提出S3OD方法,通过大规模合成数据生成和歧义感知架构,显著提升显著目标检测的跨数据集泛化能力,仅用合成数据训练即可降低20-50%误差。

详情
AI中文摘要

显著目标检测体现了数据受限任务的特点,昂贵的像素级精确标注迫使相关子任务(如DIS和HR-SOD)进行单独的模型训练。我们提出了一种通过大规模合成数据生成和歧义感知架构来大幅提升泛化能力的方法。我们引入了S3OD,一个包含超过139,000张高分辨率图像的数据集,通过我们的多模态扩散管道从扩散和DINO-v3特征中提取标签。迭代生成框架根据模型性能优先处理具有挑战性的类别。我们提出了一个简化的多掩码解码器,通过预测多个有效解释来处理显著目标检测中固有的歧义。仅使用合成数据训练的模型在跨数据集泛化中实现了20-50%的错误率降低,而微调版本在DIS和HR-SOD基准上达到了最先进的性能。

英文摘要

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

2602.08355 2026-06-18 cs.CV 版本更新

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-VAds:面向多模态大语言模型的电商短视频理解基准

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

发表机构 * Alimama Tech, Taobao & Tmall Group of Alibaba Huazhong University of Science Vin University

AI总结 提出电商短视频理解基准E-VAds,通过多模态信息密度评估框架量化领域复杂性,并构建多智能体生成的问答数据集,最后开发基于强化学习的推理模型E-VAds-R1,在商业意图推理上实现109.2%的性能提升。

Comments Accepted by ICML2026

详情
AI中文摘要

电商短视频代表了在线视频行业中高收入的细分领域,其特点是目标驱动的格式和密集的多模态信号。当前模型通常难以处理这些视频,因为现有基准主要关注通用任务,忽略了商业意图的推理。在这项工作中,我们首先提出了一个多模态信息密度评估框架,以量化该领域的复杂性。我们的评估显示,与主流数据集相比,电商内容在视觉、音频和文本模态上表现出显著更高的密度,为视频理解建立了更具挑战性的前沿。为了弥补这一差距,我们引入了电商视频广告基准(E-VAds),这是首个专门为电商短视频理解设计的基准。我们从淘宝精选了3,961个高质量视频,涵盖广泛的产品类别,并使用多智能体系统生成了19,785个开放式问答对。这些问题被组织成两个主要维度,即感知与认知和推理,包含五个不同的任务。最后,我们开发了E-VAds-R1,一个基于强化学习的推理模型,具有称为MG-GRPO的多粒度奖励设计。该策略为早期探索提供平滑指导,同时为专家级精度创造非线性激励。实验结果表明,E-VAds-R1在仅使用几百个训练样本的情况下,在商业意图推理上实现了109.2%的性能提升。

英文摘要

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2603.21583 2026-06-18 cs.CV 版本更新

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

HACMatch: 基于难度感知课程伪标签的半监督旋转回归

Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出一种难度感知课程学习框架,通过动态选择伪标签样本和结构化数据增强,在少量标注数据下提升半监督旋转回归性能。

Comments This is an accepted manuscript of an article published in Computer Vision and Image Understanding

详情
Journal ref
Computer Vision and Image Understanding (2026)
AI中文摘要

从2D图像回归物体的3D旋转是一项关键且具有挑战性的任务,在自动驾驶、虚拟现实和机器人控制等领域有广泛应用。现有的旋转回归模型通常依赖大量标注数据进行训练,或需要点云、CAD模型等2D图像之外的额外信息。因此,探索仅使用有限数量标注2D图像的半监督旋转回归具有重要价值。尽管最近的工作FisherMatch将半监督学习引入旋转回归,但其基于熵的刚性伪标签过滤方法未能有效区分可靠和不可靠的无标注样本。为解决这一局限,我们提出一种难度感知课程学习框架,根据样本难度动态选择伪标签样本,从简单到复杂逐步推进。我们引入了多阶段和自适应课程策略,用更灵活、难度感知的机制替代固定阈值过滤。此外,我们提出一种专门针对旋转估计的新型结构化数据增强策略,通过从增强补丁中组装复合图像来引入特征多样性,同时保持关键几何完整性。在PASCAL3D+和ObjectNet3D上的综合实验表明,我们的方法在低数据场景下尤其优于现有的监督和半监督基线,验证了课程学习框架和结构化增强方法的有效性。

英文摘要

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.
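
The easy-to-hard selection described above can be pictured as a schedule that admits a growing share of pseudo-labeled samples, ordered by a hardness score. The sketch below is a generic curriculum of that kind, not HACMatch's actual rule: the linear schedule, the `start` and `end` fractions, and the hardness scores are all hypothetical choices made for illustration.

```python
def keep_fraction(step, total_steps, start=0.2, end=1.0):
    """Share of pseudo-labeled samples admitted at `step` (linear schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def select_pseudo_labels(hardness, step, total_steps):
    """Return the indices of the easiest samples admitted at this step.

    hardness: one score per unlabeled sample, lower means easier.
    """
    k = max(1, round(keep_fraction(step, total_steps) * len(hardness)))
    order = sorted(range(len(hardness)), key=lambda i: hardness[i])
    return sorted(order[:k])


# Hypothetical hardness scores for five unlabeled images.
scores = [0.9, 0.1, 0.5, 0.3, 0.7]
early = select_pseudo_labels(scores, step=0, total_steps=10)
middle = select_pseudo_labels(scores, step=5, total_steps=10)
late = select_pseudo_labels(scores, step=10, total_steps=10)
```

Compared with a fixed confidence threshold, a schedule like this changes which samples count as usable as training progresses, which is the flexibility the abstract contrasts with fixed-threshold filtering.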

2604.20822 2026-06-18 cs.CV cs.LG 版本更新

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

全球海上风电基础设施:基于密集Sentinel-1时间序列的部署与运行动态

Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

发表机构 * Earth Observation Center (EOC), German Aerospace Center (DLR)(地球观测中心(EOC),德国航空航天中心(DLR)) Institute for Geography and Geology, University of Wuerzburg(地理与地质研究所,维尔茨堡大学)

AI总结 提出全球Sentinel-1 SAR时间序列数据集,通过目标检测和规则分类器识别海上风电基础设施的部署与运行阶段,支持全球尺度动态分析。

Comments 29 pages, 18 figures

详情
AI中文摘要

海上风电行业正在快速扩张,增加了对全球范围内基础设施部署和运行进行独立、高时间分辨率监测的需求。虽然基于地球观测的海上风电基础设施测绘在空间定位方面已经成熟,但现有的开放数据集缺乏关于建设和运行动态的时间密集且语义精细的信息。我们引入了一个全球Sentinel-1合成孔径雷达(SAR)时间序列数据语料库,该语料库解析了2016年第一季度至2025年第一季度海上风电基础设施的部署和运行阶段。基于更新的目标检测工作流程,我们在检测到的基础设施位置编译了15,606条时间序列,共有14,840,637个事件作为分析就绪的一维SAR后向散射剖面,每个剖面对应一次Sentinel-1采集和一个位置。为了便于直接使用和基准测试,我们发布了(i)分析就绪的一维SAR剖面,(ii)由基于规则的分类器生成的事件级基线语义标签,以及(iii)包含553条时间序列和328,657个事件标签的专家标注基准数据集。基线分类器在事件评估中实现了0.84的宏F1分数,在折叠编辑相似性-质量阈值曲线下面积(AUC)为0.785,表明时间一致性。我们证明,由此产生的语料库支持全球尺度的部署动态分析、区域部署模式差异的识别、船只交互和运行事件,并为开发和比较海上风电基础设施监测的时间序列分类方法提供了参考。

英文摘要

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, comprising 14,840,637 events in total as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis-ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
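
The macro F1 score reported above averages per-class F1 with equal weight per class, so rare event types count as much as frequent ones. A minimal sketch follows, assuming one label per event; the class names and predictions are made up and do not come from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical event labels for four acquisitions at one location.
truth = ["construction", "construction", "operation", "operation"]
predicted = ["construction", "operation", "operation", "operation"]
score = macro_f1(truth, predicted)
```

In this toy case the per-class scores are 2/3 and 4/5, so the macro average is lower than the plain accuracy of 3/4 would suggest for the better-predicted class alone.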

2605.05547 2026-06-18 cs.CV 版本更新

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

利用地理空间AlphaEarth嵌入表征巴西大西洋森林恢复结果

Alice Heiman

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本研究利用AlphaEarth基础模型的卫星嵌入,通过余弦相似度定义参考轨迹嵌入,评估巴西圣保罗1729个恢复点的早期恢复成效,发现不同土地利用类型在嵌入空间中形成聚类,但信号存在噪声。

Comments Presented as a workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)

详情
AI中文摘要

巴西的大西洋森林是一个关键生物多样性热点,但其原始覆盖面积不足12-15%。尽管大规模监测森林恢复至关重要,但传统方法受限于实地报告在大尺度上的不可行性以及遥感指数(如NDVI)的饱和效应。此外,与森林砍伐导致的快速光谱变化不同,再造林是一个渐进过程。在本研究中,我们利用AlphaEarth Foundation模型的卫星嵌入,检查了圣保罗的1,729个恢复点,以评估其在表征早期恢复成功方面的有效性。我们引入了“参考轨迹嵌入”的概念,基于与成熟次生林参考点的余弦相似度定义恢复成功的度量。我们观察到不同土地利用和土地覆盖(LULC)类型在嵌入空间中形成不同的聚类,并且能够识别出具有明显变化向量的地点。然而,信号可能存在噪声,嵌入可能需要进一步微调以捕获和预测超出LULC的地点元数据。

英文摘要

The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
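
The restoration metric above is based on cosine similarity between a site's embedding and reference sites of mature secondary forest. The sketch below compares a site against the centroid of the reference embeddings. Using the centroid is an assumption made here for illustration; the abstract does not specify how the reference embeddings are aggregated, and the 3-D vectors are toy values rather than real AlphaEarth embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def reference_similarity(site_embedding, reference_embeddings):
    """Similarity of one site to the centroid of the reference sites."""
    d = len(site_embedding)
    n = len(reference_embeddings)
    centroid = [sum(e[k] for e in reference_embeddings) / n for k in range(d)]
    return cosine_similarity(site_embedding, centroid)


# Toy embeddings: two reference forest sites and two restoration sites.
reference_sites = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
recovering_site = [0.9, 0.1, 0.0]
pasture_like_site = [0.0, 0.0, 1.0]
high = reference_similarity(recovering_site, reference_sites)
low = reference_similarity(pasture_like_site, reference_sites)
```

Tracking this similarity for the same site across years would give one possible reading of a trajectory toward the reference forest.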

2606.05368 2026-06-18 cs.CV 版本更新

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Biomazon:亚马逊盆地三维森林结构与生物量建模的多模态数据集

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

Affiliations: Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich; School of Engineering and Natural Sciences (SENS), University of Iceland; Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences

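AI summary: Existing methods do not learn forest vertical structure as an ordered profile. The paper proposes Biomazon, a multimodal benchmark dataset that pairs GEDI RH and AGBD targets with multi-sensor predictors, runs ablation studies with a shared encoder-decoder framework, and establishes a reference benchmark for structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.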

Comments 32 pages, 21 figures, 8 tables

详情
AI中文摘要

准确、空间明确的描述热带森林结构对于碳核算和生态系统监测至关重要,然而大多数机器学习流程预测冠层顶部高度代理(例如RH95/RH98)或AGBD作为单独的标量目标,而不是将森林垂直结构作为有序轮廓学习。社区缺乏一个ML就绪的多模态基准,用于联合预测整个GEDI RH轮廓与AGBD,或评估强制RH百分位数之间物理一致排序的方法。我们通过Biomazon解决了这一问题,这是一个覆盖亚马逊盆地的20米多模态基准数据集,在标准化的空间划分和评估协议下,将GEDI RH和AGBD目标与多传感器预测因子(Sentinel-1/2、ALOS-2 PALSAR-2、Copernicus DEM、Dynamic World LULC和AlphaEarth嵌入)配对。使用共享编码器-解码器与任务特定头作为基线框架,我们对(i)骨干/模型规模、(ii)模态贡献以及(iii)在独立和融合设置下使用辅助嵌入进行了全面的消融研究,并报告了单目标和联合目标结果,以量化统一训练协议下的权衡。最后,我们通过与现有网格化产品(包括GEDI L4D RH10-RH98和AGBD)在匹配时间尺度上的区域对齐比较,将基线性能置于背景中。Biomazon连同随附的协议和基线结果,为未来热带森林中结构一致的RH轮廓预测和结构-生物量建模工作建立了参考基准。

英文摘要

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks an ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
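One common way to guarantee a physically consistent ordering across RH percentiles (RH10 ≤ RH20 ≤ ... ≤ RH98) is to predict a base height plus non-negative increments and take a cumulative sum. The sketch below illustrates that generic construction; it is an assumption for illustration, not necessarily how the Biomazon baselines enforce ordering, and `ordered_profile` is a hypothetical name.

```python
import math
from itertools import accumulate


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordered_profile(base, raw_increments):
    """Map unconstrained model outputs to a non-decreasing height profile.

    base           -- predicted lowest percentile height (may be negative)
    raw_increments -- unconstrained outputs, one per remaining percentile
    """
    steps = [softplus(r) for r in raw_increments]
    # Cumulative sum of non-negative steps cannot decrease
    return list(accumulate(steps, initial=base))


# Hypothetical raw network outputs for a five-level profile
profile = ordered_profile(-1.5, [0.3, -2.0, 1.0, -0.5])
```

Because the ordering holds by construction, no post-hoc sorting or penalty term is needed to rule out crossing percentiles.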

2606.05883 2026-06-18 cs.CV 版本更新

Geometry-Aware Dataset Condensation for Diffusion Model Training

面向扩散模型训练的几何感知数据集压缩

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

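AI summary: For diffusion model training, the paper proposes real-subset selection via geometry-aware distribution alignment. It uses one-sided partial optimal transport to preserve geometric structure, adds lightweight feature-statistics and semantic-consistency regularization, and solves the objective with a two-stage discrete optimization strategy.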

Comments ICML 2026

详情
AI中文摘要

数据集压缩旨在通过合成或选择从真实数据中构建紧凑数据集。然而,现有方法不适用于扩散模型训练:合成数据生成通常产生不适合真实建模的低保真样本,而真实子集选择通常无法保留扩散似然目标所需的分布几何结构。为解决此问题,我们提出将真实子集选择重新表述为几何感知分布对齐问题。通过引入单侧部分最优传输,我们的方法选择性地将紧凑子集与完整数据分布对齐,同时允许低密度区域中的未匹配质量,确保保留扩散模型训练所需的有效几何结构。为进一步保证分布保真度,我们用轻量级特征统计和语义一致性正则化补充几何对齐。提出了一种高效的两阶段离散优化策略来实现该对齐目标。在扩散变体、子集大小、图像分辨率和训练轮次上的大量实验表明,我们的方法在扩散模型训练中实现了优越的保真度和分布覆盖。代码可在 https://github.com/2018cx/GADC 获取。

英文摘要

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, thereby preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to solve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at https://github.com/2018cx/GADC.

2606.14702 2026-06-18 cs.CV 版本更新

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

OmniVideo-100K:通过结构化脚本和证据链进行音视频推理的数据集

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

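AI summary: Proposes the OmniVideo-100K dataset, built with entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA. Fine-tuned models improve substantially on multiple benchmarks.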

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

详情
AI中文摘要

当前的音视频问答(QA)自动化流水线通常采用“视频-字幕-QA”范式。然而,这些方法通常将视频分割成短片段,并为音频和视觉模态生成独立的描述。这种解耦处理切断了声音与其视觉来源之间的固有关联,而独立的片段处理常常导致同一实体在不同片段中的描述不一致。此外,将长文本理解和QA合成耦合到单一步骤中,往往将模型限制在局部事件上,生成的问答缺乏长期时间连接和深度跨模态推理。为了解决这些问题,我们提出了一种自动化数据引擎,包含两种机制:(1)**实体锚定视频脚本**将视频转换为结构化脚本,包括摘要、主要实体列表和逐片段的音视频描述。实体列表作为全局先验,确保跨片段引用一致性并重建音视频关联。(2)**线索引导的QA生成**提示模型首先从脚本中挖掘跨片段、多模态线索,然后基于这些高价值线索生成QA对。利用这一流水线,我们构建了指令微调数据集**OmniVideo-100K**和人工验证的测试集**OmniVideo-Test**。在OmniVideo-100K上微调VITA-1.5、Qwen2.5-Omni-7B和Qwen3-Omni-30B,在OmniVideo-Test上获得了高达20.59%的性能提升,并在Daily-Omni和JointAVBench等现有基准上表现出强大的泛化能力(提升高达12.64%)。

英文摘要

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2606.17188 2026-06-18 cs.CV cs.CL 版本更新

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

并非真正的多语言:脚本一致性作为VLM评估中缺失的维度

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

Affiliations: RediMinds Inc.; The University of Texas at Austin; Independent Researcher

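AI summary: Proposes the PuMVR benchmark, evaluates 10 VLMs across Punjabi's three scripts, finds a substantial Script Gap, and proposes the Script Consistency Rate (SCR) as a mandatory evaluation metric.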

详情
AI中文摘要

当前视觉语言模型(VLM)的多语言评估假设语言与正字法一一对应,忽略了使用多种文字语言的数十亿用户。我们引入了PuMVR(旁遮普多模态视觉推理),这是一个包含1000个严格平行图像-文本实例的基准,覆盖旁遮普语的三种活跃文字:古木基文、沙穆基文和罗马文。评估10个最先进的VLM,我们暴露了一个显著且系统的脚本差距。模型经常在一种文字上解决视觉任务,而在另一种文字上失败,准确率差异高达16%。关键的是,视觉输入均匀地提升了绝对性能,但并未缩小正字法差距。此外,跨文字的上下文迁移非常脆弱,揭示了脚本锁定的知识表示。通过所有文字对的McNemar检验支持,我们的发现表明当前的“多语言”VLM并非真正的多文字。我们提出脚本一致性率(SCR),在我们的基准上低至24.8%,作为脚本无关评估的强制性指标,以确保公平的AI访问。数据和代码可在以下网址获取:this https URL。

英文摘要

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.
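The paired comparisons mentioned above can be illustrated with an exact McNemar test, which looks only at items where two scripts disagree. The sketch also includes a consistency rate, but note that the abstract does not define SCR: the version here (share of items answered correctly in every script) is an assumption, and both function names and the toy flags are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired per-item correctness flags."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def script_consistency_rate(correct_by_script):
    """Assumed definition: fraction of items correct in every script."""
    per_item = list(zip(*correct_by_script.values()))
    return sum(all(flags) for flags in per_item) / len(per_item)


# Hypothetical correctness flags for four parallel items
results = {
    "Gurmukhi": [1, 1, 0, 1],
    "Shahmukhi": [1, 0, 0, 1],
    "Roman": [1, 1, 0, 0],
}
scr = script_consistency_rate(results)
p_value = mcnemar_exact(results["Gurmukhi"], results["Roman"])
```

A consistency-style metric can be far lower than any single-script accuracy, because an item counts only if the model succeeds on all parallel versions.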

2406.14399 2026-06-18 cs.LG cs.CV physics.ao-ph stat.ML 版本更新

Benchmarking Physics-Informed Time-Series Models for Operational Global Station Weather Forecasting

面向全球站点业务天气预报的物理信息时间序列模型基准测试

Tao Han, Zhibin Wen, Zhenghao Chen, Dazhao Du, Song Guo, Lei Bai

Affiliations: Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; School of Computer and Information Sciences, University of Newcastle, Newcastle, Australia; Hangzhou Innovation Institute of Beihang University, Hangzhou, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

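AI summary: Proposes the large-scale observational dataset WEATHER-5K and the physics-informed model PhysicsFormer, which strengthens physical consistency through pressure-wind alignment and energy-aware smoothness losses, and assesses the gap between academic models and operational systems across multiple weather variables and extreme-event prediction.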

Comments Accepted by ICML 2026

详情
AI中文摘要

时间序列预测(TSF)模型的发展常受限于缺乏全面的数据集,尤其是在全球站点天气预报(GSWF)中,现有数据集规模小、时间短且空间稀疏。为解决这一问题,我们引入了WEATHER-5K,一个大规模观测天气数据集,能更好地反映真实世界条件,支持改进模型训练和评估。尽管最近的TSF方法在基准测试上表现良好,但在捕捉复杂天气动态和极端事件方面落后于业务数值天气预报系统。我们提出了PhysicsFormer,一种物理信息预测模型,结合动态核心与Transformer残差来预测未来天气状态。通过压力-风对齐和能量感知平滑损失强制物理一致性,确保在捕捉复杂时间模式的同时保持合理的动力学。我们将PhysicsFormer及其他TSF模型与业务系统在多个天气变量、极端事件预测和模型复杂度上进行基准测试,全面评估学术TSF模型与业务预报之间的差距。数据集和基准测试实现可在以下网址获取:this https URL。

英文摘要

The development of Time-Series Forecasting (TSF) models is often constrained by the lack of comprehensive datasets, especially in Global Station Weather Forecasting (GSWF), where existing datasets are small, temporally short, and spatially sparse. To address this, we introduce WEATHER-5K, a large-scale observational weather dataset that better reflects real-world conditions, supporting improved model training and evaluation. While recent TSF methods perform well on benchmarks, they lag behind operational Numerical Weather Prediction systems in capturing complex weather dynamics and extreme events. We propose PhysicsFormer, a physics-informed forecasting model combining a dynamic core with a Transformer residual to predict future weather states. Physical consistency is enforced via pressure-wind alignment and energy-aware smoothness losses, ensuring plausible dynamics while capturing complex temporal patterns. We benchmark PhysicsFormer and other TSF models against operational systems across several weather variables, extreme event prediction, and model complexity, providing a comprehensive assessment of the gap between academic TSF models and operational forecasting. The dataset and benchmark implementation are available at: https://github.com/taohan10200/WEATHER-5K.

2503.08038 2026-06-18 cs.LG cs.AI cs.CV 版本更新

Generalized Kullback-Leibler Divergence Loss

广义Kullback-Leibler散度损失

Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong

Affiliations: Hefei University of Technology; University of Science and Technology of China; Nanyang Technological University; The Chinese University of Hong Kong; The University of Hong Kong; Harbin Institute of Technology, Shenzhen

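AI summary: Proposes the Generalized KL (GKL) Divergence loss. The KL loss is decoupled into a weighted MSE loss and a cross-entropy loss with soft labels, then improved by breaking its asymmetric optimization property and adding class-wise global information. The result is state-of-the-art adversarial robustness on RobustBench and competitive knowledge distillation performance.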

Comments TPAMI 2026, extension of our NeurIPS paper "Decoupled Kullback-Leibler Divergence Loss". arXiv admin note: substantial text overlap with arXiv:2305.13948

详情
AI中文摘要

在本文中,我们深入探讨了Kullback-Leibler (KL) 散度损失,并从数学上证明它等价于由(1)加权均方误差(wMSE)损失和(2)包含软标签的交叉熵损失组成的解耦Kullback-Leibler (DKL) 散度损失。得益于DKL损失的解耦结构,我们确定了两个改进方向。首先,我们通过打破KL损失的不对称优化性质并引入更平滑的权重函数,解决了其在知识蒸馏等场景中的局限性。这一修改有效缓解了优化中的收敛困难,特别是对于软标签中预测分数较高的类别。其次,我们将类别级别的全局信息引入KL/DKL,以减少单个样本带来的偏差。通过这两项改进,我们推导出广义Kullback-Leibler (GKL) 散度损失,并通过在CIFAR-10/100、ImageNet和视觉-语言数据集上进行实验,聚焦于对抗训练和知识蒸馏任务,评估其有效性。具体来说,我们在公开排行榜RobustBench上实现了新的最先进对抗鲁棒性,并在CIFAR/ImageNet模型和CLIP模型上取得了具有竞争力的知识蒸馏性能,展示了其重要的实际价值。我们的代码可在该https URL获取。

英文摘要

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of the DKL loss, we identify two areas for improvement. First, we address the limitation of the KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property and introducing a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Second, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness through experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating its substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.
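As background for why a soft-label cross-entropy term shows up in the decoupled form, the sketch below checks the textbook identity KL(p‖q) = CE(p, q) − H(p) numerically. This is not the paper's DKL decomposition (the wMSE term and its weights are derived in the paper and are not reproduced here); the logits are arbitrary illustrative values.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy of predictions q against soft labels p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Hypothetical teacher (p) and student (q) distributions
p = softmax([2.0, 1.0, 0.1])
q = softmax([1.5, 0.5, 1.0])
gap = kl_divergence(p, q) - (cross_entropy(p, q) - entropy(p))
```

When the soft labels p are held fixed, H(p) is a constant, so minimizing KL(p‖q) with respect to q gives the same gradients as minimizing the soft-label cross-entropy.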

2606.17639 2026-06-18 cs.RO cs.CV 版本更新

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

ERQA-Plus:具身AI推理的诊断基准

Hong Yang, Basura Fernando

Affiliations: Centre for Frontier AI Research, Agency for Science, Technology and Research; College of Computing and Data Science, Nanyang Technological University

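AI summary: Proposes the ERQA-Plus benchmark, with 1,766 question-answer instances grounded in robot-centric images and covering perceptual, action-centric, social-interaction, navigation-environmental, and commonsense reasoning, for diagnosing the reasoning abilities of embodied AI.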

详情
AI中文摘要

通用具身智能体需要的不仅仅是物体识别:它们必须从情境视觉观察中推理空间关系、动作、程序、人类意图、环境约束和常识后果。然而,现有的视觉和具身问答基准通常对测试的推理依赖关系控制有限,使得难以将基于具身的推理与基于捷径的视觉或语言模式匹配区分开来。我们提出了ERQA-Plus,一个用于具身AI推理的诊断基准。ERQA-Plus包含1766个问答实例,这些实例基于711张以机器人为中心的图像,并根据一个结构化的分类法组织,涵盖感知、动作中心、社交交互、导航环境和上下文常识推理。该数据集使用多阶段生成和验证流程构建,结合了分类法引导的问题生成、自动质量判断、迭代修订和人工评估,以改进视觉基础、答案有效性和推理质量。我们对代表性的通用视觉语言模型和具身模型进行了基准测试,包括LLaVA-NeXT-8B、Prismatic-7B、MiniCPM-V-4.5-8B、Qwen3-VL、RoboRefer-8B和RoboBrain2.5-8B。尽管最强的模型Qwen3-VL-32B达到了83.4%的整体准确率和61.4的SBERT分数,但类别级别的结果揭示了空间推理、程序推理、事件预测和意图推理方面的持续弱点。因此,ERQA-Plus提供了一个细粒度的评估框架,不仅衡量具身智能体是否回答正确,还衡量它们能够可靠地执行哪些形式的具身推理。数据集可在https://this https URL获取,项目页面在https://this https URL。

英文摘要

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.

13. Other / General Vision (12 papers)

2606.18661 2026-06-18 cs.CV cs.AI 新提交

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

Affiliations: Central South University

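AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-oriented vision-language model LandslideVLM, and the domain-rule-enhanced agent LandslideAgent, enabling autonomous landslide identification and analysis.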

详情
AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.19249 2026-06-18 cs.CV cs.LG 新提交

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Transformer几何观测站TGO-I:谱几何观测站

Kaustubh Kapil, Kishor P. Upla

Affiliations: Sardar Vallabhbhai National Institute of Technology (SVNIT), Surat, India

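AI summary: Proposes the TGO framework, which analyzes the spectral geometry of ViT representations (effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, and related quantities). During training, dimensional utilization increases, anisotropy decreases, and spectral entropy and participation ratio rise. The final CLS token representation has the highest effective dimensionality and the lowest anisotropy.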

详情
AI中文摘要

尽管Vision Transformers(ViTs)被广泛采用并在众多计算机视觉应用中取得成功,对其维度和表示几何的基本理解仍然相对未被充分探索。为了弥补这一差距,我们引入了Transformer几何观测站(TGO),这是一个系统的实验和分析流程框架,旨在研究Vision Transformers的表示几何和动态。TGO-I是该框架的第一部分,专注于ViT表示的谱几何。使用在ImageNet-100上训练的ViT-Small/16模型,我们分析了训练过程中的有效秩、稳定秩、参与比、谱熵、谱平坦度、谱各向异性、协方差结构、特征谱和奇异值谱。我们的结果揭示了维度利用的一致增加,伴随着各向异性降低、谱熵增加、参与比增加以及逐渐平坦的特征谱。与常见的直觉(即训练应将信息集中到少数主导方向)相反,我们观察到方差在表示维度上的逐渐重新分布。这一现象在最终的CLS标记表示中尤为明显,该表示在网络中表现出最高的有效维度和最低的各向异性。

英文摘要

Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
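Several of the spectral quantities named above have widely used definitions that can be computed directly from a spectrum. The sketch below uses those common definitions (entropy-based effective rank, participation ratio, stable rank). Applying them to covariance eigenvalues, and the toy spectra, are assumptions for illustration; the paper's exact conventions may differ.

```python
import math


def spectral_entropy(eigenvalues):
    """Shannon entropy (nats) of the spectrum normalized to sum to 1."""
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_rank(eigenvalues):
    """Exponential of the spectral entropy; equals n for a flat spectrum."""
    return math.exp(spectral_entropy(eigenvalues))


def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues."""
    return sum(eigenvalues) ** 2 / sum(v * v for v in eigenvalues)


def stable_rank(singular_values):
    """Squared Frobenius norm divided by squared spectral norm."""
    return sum(s * s for s in singular_values) / max(singular_values) ** 2


# Toy spectra: variance spread evenly vs. concentrated in one direction
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [5.0, 0.0, 0.0, 0.0]
flat_rank = effective_rank(flat)
peaked_rank = effective_rank(peaked)
```

All three measures rise toward the ambient dimension as the spectrum flattens, which matches the trend of increasing dimensional utilization described in the abstract.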

2606.19151 2026-06-18 cs.CY cs.CV 交叉投稿

The Market in the Model: Latent Diffusion as Neural Economy

模型中的市场:潜在扩散作为神经经济

Eryk Salvaggio

Affiliations: Cambridge Digital Humanities; University of Cambridge; Machine Visual Culture Research Group; Max Planck Institute

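AI summary: Starting from the computer vision engineering problems that latent diffusion models were introduced to solve, the paper analyzes the model's mechanisms, argues that it operates as a neural economy that abstracts social communication into commensurable vectors, and warns that critique focused solely on copyright and commodity defenses risks reinforcing the fetishism the model produces.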

详情
AI中文摘要

在视觉文化和人文学科中,对生成图像模型的有价值批评强调了数据集在塑造其生成图像中的作用。然而,对嵌入模型机制的意识形态立场的细致研究一直被忽视,使得它们被想象为“黑箱”。为了扩展而非取代数据集批评,本文从潜在扩散模型被引入以解决计算机视觉工程师问题的角度,以及每个组件被赋予自动化决策的任务,审视了其机制。我通过其各部分的历史以及系统刻入每个生成图像中的视觉理论来解释这个集成。借鉴Impett和Offert的神经交换价值概念,我提出这一分析以论证该模型作为神经经济运作:一个封闭的符号系统,将社会交流抽象为可通约向量,同时将社会领域转化为待售包裹。逐组件追踪训练和生成流程揭示了每个操作取代了什么,以及它如何进一步巩固平台经济和注意力经济对社会交流的逻辑。本文警告,任何只关注版权和商品防御的批评都可能重申模型所产生的拜物教,并主张以社会交换为中心。

英文摘要

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

2506.13506 2026-06-18 cs.CV q-bio.NC 版本更新

Stimulus Motion Perception Studies Imply Specific Neural Computations in Human Visual Stabilization

刺激运动知觉研究暗示人类视觉稳定中的特定神经计算

David W Arathorn, Josephine C. D'Angelo, Austin Roorda

Affiliations: Montana State University, Dept of Electrical and Computer Engineering; University of California, Berkeley, Herbert Wertheim School of Optometry and Vision Science

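AI summary: Drawing on psychophysical experiments on motion perception during fixational eye jitter, the paper finds that visual stabilization is more nuanced than camera stabilization or the simplest evolutionary solution would suggest, and proposes a functional model based on specific operations on retinal signals together with a possible neural-circuit implementation.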

详情
AI中文摘要

即使在注视期间,人眼也持续进行低幅度运动,以高达100Hz的频率在随机方向上小角度抖动。这种运动导致视网膜上图像的所有特征不断穿过多个视锥细胞,然而世界中稳定的物体被感知为稳定,而任何运动的物体被感知为运动。一系列持续十多年的实验揭示了视觉稳定的心理物理学比可能假设的(例如,从相机图像稳定的机制,或从进化角度可能假设的最简单解决方案)更为微妙。实验揭示的心理物理学强烈暗示了视网膜信号上的一组特定操作,导致了观察到的稳定行为。报告分为两个层次。首先是对很可能负责实验观察行为的机制的功能描述。其次是对可能实现功能行为的电路级神经元的更推测性提议。

英文摘要

Even during fixation, the human eye is constantly in low-amplitude motion, jittering over small angles in random directions at up to 100 Hz. This motion results in all features of the image on the retina constantly traversing a number of cones, yet objects that are stable in the world are perceived to be stable, and any object that is moving in the world is perceived to be moving. A series of experiments carried out over a dozen years revealed the psychophysics of visual stabilization to be more nuanced than might be assumed, say, from the mechanics of stabilization of camera images, or from what might be assumed to be the simplest solution from an evolutionary perspective. The psychophysics revealed by the experiments strongly implies a specific set of operations on retinal signals resulting in the observed stabilization behavior. The presentation is at two levels. The first is a functional description of the action of the mechanism that is very likely responsible for the experimentally observed behavior. The second is a more speculative proposal of circuit-level neural elements that might implement the functional behavior.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL 版本更新

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo:通过神经符号推理解决立体几何问题

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

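AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use conditional description language (CDL) and a theorem bank for rigorous reasoning on solid geometry problems, reaching SOTA performance on SolidFGeo2k and MathVerse-Solid.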

Comments Computer Vision and Pattern Recognition (CVPR), 2026

详情
AI中文摘要

几何问题求解作为一种典型的多模态推理问题,近年来受到广泛关注并取得了很大进展,然而大多数工作集中于平面几何,由于三维空间图和复杂推理,通常在立体几何中失败。为弥补这一差距,我们引入了Hilbert-Geo,这是第一个用于立体几何的统一形式语言框架,包括一个广泛的谓词库和一个专用的定理库。基于该框架,我们提出了一种Parse2Reason方法,包含先解析后推理两个步骤。在解析步骤中,我们利用条件描述语言(CDL),一种由专门用于构建几何条件的谓词组成的形式化语言,来表示问题描述(自然文本)和立体图(视觉图像)。在推理步骤中,我们利用这些形式化CDL和定理库进行关系推理和代数计算,生成严格正确、可验证且人类可读的推理过程。值得注意的是,我们提出的Hilbert-Geo也适用于平面几何。为推进几何推理,我们策划了两个专家标注的数据集SolidFGeo2k和PlaneFGeo3k,它们配备了几何形式语言标注、解答和答案。大量实验表明,我们提出的方法在SolidFGeo2k上达到77.3%的最先进性能,在MathVerse-Solid(MathVerse中专用于立体几何的一个小子集)上达到84.1%,显著优于领先的多模态大语言模型,如Gemini-2.5-pro(在SolidFGeo2k上为54.2%)和GPT-5(在MathVerse-Solid上为62.9%)。此外,我们的方法在PlaneFGeo3k上达到80.2%的SOTA准确率,展示了Hilbert-Geo在几何推理中的通用性。我们的代码和数据集将公开提供。

英文摘要

Geometric problem solving, a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most existing work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which consists of two steps: parsing, then reasoning. In the parsing step, we use conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2509.09631 2026-06-18 cs.SD cs.CL cs.CV 版本更新

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

DiFlow-TTS: 基于离散流匹配的紧凑低延迟零样本文本转语音

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

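AI summary: Proposes the DiFlow-TTS framework, which uses discrete flow matching and a Factorized Discrete Flow Denoiser to balance generation quality and low latency in zero-shot TTS.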

Comments Accepted at Interspeech 2026 (Long Paper Track)

详情
AI中文摘要

零样本文本转语音(TTS)在复制未见过的声音方面取得了显著进展,但平衡生成质量和推理效率仍然具有挑战性。自回归模型存在高延迟问题,而基于扩散的方法受限于训练时的配置。此外,大多数基于流的方法在连续空间中运行,由于连续令牌空间本质上比离散空间更复杂,这引入了优化挑战。为了解决这些限制,我们提出了DiFlow-TTS,一种基于离散流匹配的新型零样本TTS框架。该模型由一个用于语言建模的确定性音素-内容映射器和一个同时生成韵律和声学令牌流的分解离散流去噪器组成。实验结果表明了我们的方法在多个评估指标上的有效性。

英文摘要

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

2604.14837 2026-06-18 cs.CV 版本更新

Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration

Geonwoo Baek, David H. Salat, Ikbeom Jang

Affiliations: Department of Computer Science & Engineering, Hankuk University of Foreign Studies, Seoul, Republic of Korea; Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA; Department of Radiology, Harvard Medical School, Boston, MA, USA; Neuroimaging Research for Veterans (NeRVe) Center, VA Boston Healthcare System, Boston, MA, USA

Comments Submitted to Human Brain Mapping

Details
Journal ref
Human Brain Mapping 47(8), e70548 (2026)
English abstract

Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than that of MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

2602.02370 2026-06-18 cs.CV Updated version

Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes

Uma Meleti, Jeffrey J. Nirschl

Affiliations * Department of Pathology and Laboratory Medicine, University of Wisconsin-Madison

Comments Published at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

Details
Journal ref
Proc. 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), London, United Kingdom, Apr. 8-11, 2026, pp. [1-4], 2026
English abstract

Accurate histopathologic interpretation is key for clinical decision-making; however, current deep learning models for digital pathology are often overconfident and poorly calibrated in out-of-distribution (OOD) settings, which limits trust and clinical adoption. Safety-critical medical imaging workflows benefit from intrinsic uncertainty-aware properties that can accurately reject OOD input. We implement the Spectral-normalized Neural Gaussian Process (SNGP), a set of lightweight modifications that apply spectral normalization and replace the final dense layer with a Gaussian process layer to improve single-model uncertainty estimation and OOD detection. We evaluate SNGP against deterministic and Monte Carlo dropout models on six datasets across three biomedical classification tasks: white blood cells, amyloid plaques, and colorectal histopathology. SNGP has comparable in-distribution performance while significantly improving uncertainty estimation and OOD detection. Thus, SNGP or related models offer a useful framework for uncertainty-aware classification in digital pathology, supporting safe deployment and building trust with pathologists.
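
The spectral normalization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a dense weight matrix stored as a list of rows, estimates its largest singular value by power iteration, and rescales the matrix only when that value exceeds a bound. The function names, the bound of 0.95, and the iteration count are illustrative assumptions, not values taken from the paper, and the Gaussian process output layer is not shown.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iter=50, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = [list(col) for col in zip(*W)]
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    length = norm(v)
    v = [x / length for x in v]
    for _ in range(n_iter):
        v = matvec(Wt, matvec(W, v))
        length = norm(v)
        if length == 0.0:
            return 0.0  # W maps the iterate to zero (e.g. W is all zeros)
        v = [x / length for x in v]
    return norm(matvec(W, v))


def spectral_normalize(W, bound=0.95, n_iter=50):
    """Rescale W so its spectral norm is at most `bound` (hypothetical default)."""
    sigma = spectral_norm(W, n_iter)
    if sigma <= bound:
        return W  # already within the bound, leave the weights untouched
    scale = bound / sigma
    return [[w * scale for w in row] for row in W]
```

Rescaling only when the bound is exceeded leaves well-behaved layers unchanged, which is one common way to keep hidden representations distance-aware without shrinking every weight matrix.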

2411.16934 2026-06-18 cs.CV Updated version

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

Affiliations * University of Udine; University of Catania; York University

Comments in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

Details
English abstract

Episodic memory retrieval enables wearable cameras to recall objects or events previously observed in video. However, existing formulations assume an "offline" setting with full video access at query time, limiting their applicability in real-world scenarios with power- and storage-constrained wearable devices. Towards more application-ready episodic memory systems, we introduce Online Visual Query 2D (OVQ2D), a task where models process video streams online, observing each frame only once, and retrieve object localizations using a compact memory instead of full video history. We address OVQ2D with ESOM (Egocentric Streaming Object Memory), a novel framework integrating an object discovery module, an object tracking module, and a memory module that find, track, and store spatio-temporal object information for efficient querying. Experiments on Ego4D demonstrate ESOM's superiority over other online approaches, though OVQ2D remains challenging, with top performance at only ~4% success. ESOM's accuracy increases markedly with perfect object tracking (31.91%), discovery (40.55%), or both (81.92%), underscoring the need for applied research on these components.

2510.13562 2026-06-18 physics.med-ph cs.CV cs.NA math.NA Updated version

An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET

Liyang Hu, Chong Chen

Affiliations * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China

Comments 32 pages, 11 figures, 4 tables

Details
Journal ref
IEEE Transactions on Computational Imaging 2026
English abstract

In positron emission tomography (PET), attenuation correction is indispensable for obtaining a quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out using an attenuation map estimated from computed tomography or magnetic resonance imaging. However, apart from errors in the resulting attenuation correction factors, the additional scan not only adds radiation dose and/or increases the scanning time but also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and attenuation sinogram from the time-of-flight (TOF)-PET emission data only. In particular, we make full use of the exclusively exponential form of the attenuation correction factors and constrain the total amount of activity in some mask region in the proposed model. Furthermore, we prove its well-posedness, including the existence, uniqueness and stability of the solution. We propose an alternating update algorithm to solve the model and analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method converges numerically, is robust to noise, outperforms some state-of-the-art methods in terms of accuracy and efficiency, and is capable of autonomous attenuation correction.
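
The exponential form of attenuation referred to in the abstract can be made concrete with a small sketch. It assumes the textbook Poisson model without scatter or randoms, in which the expected count in each line of response and TOF bin is the unattenuated TOF projection of the activity multiplied by exp(-g), where g is the attenuation sinogram value for that line. All names and the data layout are hypothetical, and this is not the model or algorithm proposed in the paper, which adds an activity constraint and an alternating update scheme.

```python
import math


def attenuation_factors(att_sinogram):
    """Survival probability exp(-g) for each line of response.

    The attenuation correction factor is the reciprocal, exp(+g).
    """
    return [math.exp(-g) for g in att_sinogram]


def expected_counts(tof_proj, att_sinogram):
    """Expected counts: attenuation factor times unattenuated TOF projection.

    tof_proj[i][t] is the projection for line of response i and TOF bin t.
    """
    factors = attenuation_factors(att_sinogram)
    return [[a * p for p in row] for a, row in zip(factors, tof_proj)]


def poisson_log_likelihood(counts, tof_proj, att_sinogram):
    """Poisson log-likelihood sum(y * log(m) - m), dropping the log(y!) term."""
    total = 0.0
    means = expected_counts(tof_proj, att_sinogram)
    for y_row, m_row in zip(counts, means):
        for y, m in zip(y_row, m_row):
            if y > 0:
                total += y * math.log(m)  # raises ValueError if m is 0
            total -= m
    return total
```

Because the attenuation factor depends on the line of response but not on the TOF bin, the same g scales every TOF bin of a line, which is the coupling that joint estimation from TOF data relies on.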

2507.05647 2026-06-18 eess.IV cs.CV Updated version

Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions

Jiaqi Guo, Santiago López-Tapia

Affiliations * Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA

Comments Accepted at the 2025 IEEE International Conference on Image Processing (ICIP), Workshop

Details
English abstract

Limited-Angle Computed Tomography (LACT) is a challenging inverse problem where missing angular projections lead to incomplete sinograms and severe artifacts in the reconstructed images. While recent learning-based methods have demonstrated effectiveness, most of them assume ideal, noise-free measurements and fail to address the impact of measurement noise. To overcome this limitation, we treat LACT as a sinogram inpainting task and propose a diffusion-based framework that completes missing angular views using a Mean-Reverting Stochastic Differential Equation (MR-SDE) formulation. To improve robustness under realistic noise, we propose RNSD$^+$, a novel noise-aware rectification mechanism that explicitly models inference-time uncertainty, enabling reliable and robust reconstruction. Extensive experiments demonstrate that our method consistently surpasses baseline models in data consistency and perceptual quality, and generalizes well across varying noise intensity and acquisition scenarios.
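
As a rough illustration of what a mean-reverting SDE does, the sketch below simulates the generic Ornstein-Uhlenbeck form dx = theta * (mu - x) dt + sigma dW with the Euler-Maruyama scheme, so that each sample drifts toward a target mean mu while noise is injected. Constant theta and sigma, the step count, and the function name are simplifying assumptions; the abstract does not state the paper's coefficients or schedule, and the learned reverse-time process and the RNSD$^+$ rectification are not shown.

```python
import math
import random


def simulate_mr_sde(x0, mu, theta=2.0, sigma=0.1, t_end=5.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = theta * (mu - x) dt + sigma dW.

    x0 and mu are equal-length lists, e.g. one flattened sinogram row each.
    """
    rng = random.Random(seed)
    dt = t_end / n_steps
    x = list(x0)
    for _ in range(n_steps):
        x = [
            xi + theta * (mi - xi) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi, mi in zip(x, mu)
        ]
    return x
```

With sigma set to zero the process decays deterministically toward mu, which is a convenient sanity check before adding noise.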

2406.16439 2026-06-18 cs.CV Updated version

Continual Test-Time Adaptation for Object Detection with Adaptive Monitoring and Randomized Restoration

Shilei Cao, Juepeng Zheng, Yan Liu, Baoquan Zhao, Ziqi Yuan, Weijia Li, Runmin Dong, Haohuan Fu

Affiliations * School of Artificial Intelligence, Sun Yat-Sen University; School of Information Science and Technology, University of Science and Technology of China; State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University; Tsinghua Shenzhen International Graduate School, Tsinghua University; National Supercomputing Center in Shenzhen; Ministry of Education Key Laboratory for Earth System Modeling and the Department of Earth System Science, Tsinghua University

Details
English abstract

Real-world application models are commonly deployed in dynamic environments, where the target domain distribution undergoes temporal changes. Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to continually changing target domains. Despite recent advancements in addressing CTTA, two critical issues remain: 1) Fixed thresholds for pseudo-labeling in existing methodologies lead to low-quality pseudo-labels, as model confidence varies across categories and domains; 2) Stochastic parameter restoration methods for mitigating catastrophic forgetting fail to preserve critical information effectively, due to their intrinsic randomness. To tackle these challenges for detection models in CTTA scenarios, we present AMROD, featuring three core components. Firstly, the object-level contrastive learning module extracts object-level features for contrastive learning to refine the feature representation in the target domain. Secondly, the adaptive monitoring module dynamically skips unnecessary adaptation and updates the category-specific threshold based on predicted confidence scores to improve efficiency and the quality of pseudo-labels. Lastly, the adaptive randomized restoration mechanism selectively resets inactive parameters with higher probability, ensuring the retention of essential knowledge. We demonstrate the effectiveness of AMROD on four CTTA object detection tasks, where AMROD outperforms existing methods, especially achieving a 3.2 mAP improvement and a 20% increase in efficiency on the Cityscapes-to-Cityscapes-C CTTA task. The code of this work is available at https://github.com/ShileiCao/AMROD.
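
To show why a category-specific threshold can behave differently from a fixed one, here is a minimal sketch that keeps one threshold per category and moves it toward the mean predicted confidence of that category with an exponential moving average. The update rule, the momentum, the floor, and all names are hypothetical choices made for illustration; they are not AMROD's actual rule, which is defined in the paper and the linked repository.

```python
def update_thresholds(thresholds, detections, momentum=0.9, floor=0.5):
    """Return new per-category thresholds from (category, confidence) pairs.

    Hypothetical rule: exponential moving average of the mean confidence,
    never dropping below `floor`.
    """
    by_category = {}
    for category, confidence in detections:
        by_category.setdefault(category, []).append(confidence)
    updated = dict(thresholds)
    for category, confidences in by_category.items():
        mean_conf = sum(confidences) / len(confidences)
        previous = updated.get(category, mean_conf)  # first sighting starts at the mean
        updated[category] = max(floor, momentum * previous + (1.0 - momentum) * mean_conf)
    return updated


def select_pseudo_labels(detections, thresholds, default=0.8):
    """Keep detections whose confidence meets their category's threshold."""
    return [
        (category, confidence)
        for category, confidence in detections
        if confidence >= thresholds.get(category, default)
    ]
```

A category on which the model is systematically less confident ends up with a lower threshold than a confident one, so it is not starved of pseudo-labels the way it would be under a single global cutoff.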