arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)

1. Multimodal and Vision-Language Models (11 papers)

2606.18472 2026-06-18 cs.CV New submission

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

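Affiliations: Concordia University

AI summary: Proposes ReFine3D, a framework that improves the domain generalization of 3D large multimodal models through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.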

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
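
The abstract mentions test-time augmentation with confidence-based aggregation but does not give the aggregation rule. The following is a minimal sketch under stated assumptions: each augmented view of a point cloud yields class logits, and a view's confidence is taken to be its maximum softmax probability. Both choices, and the names `softmax` and `aggregate_views`, are hypothetical illustrations and not the ReFine3D implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_views(view_logits):
    """Confidence-weighted average of per-view class probabilities."""
    probs = [softmax(l) for l in view_logits]
    weights = [max(p) for p in probs]  # assumed confidence: max probability
    z = sum(weights)
    n_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) / z
            for c in range(n_classes)]

# Three augmented views of one sample (made-up logits for three classes).
views = [[2.0, 0.1, 0.1], [0.2, 0.3, 0.1], [1.5, 0.2, 0.0]]
agg = aggregate_views(views)
pred = max(range(len(agg)), key=agg.__getitem__)
```

In this toy input the second view is nearly uniform, so it receives a low weight and the two confident views dominate the aggregated prediction.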

2606.18553 2026-06-18 cs.CV New submission

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

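Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes an image captioning framework augmented with hierarchical multi-modal article retrieval. Structure-aware retrieval and contextual refinement are combined with a VLM and an LLM to generate captions rich in contextual detail. The system placed 5th in the EVENTA 2025 challenge.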

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18681 2026-06-18 cs.CV New submission

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

Jaeyeon Lee, Shunjie Wen, Dong-Wan Choi

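Affiliations: Inha University

AI summary: Proposes SPARE, which recasts token pruning as a subspace reconstruction problem. It iteratively keeps tokens with large projection residuals and adds an anti-relevance criterion to preserve contextual information. On LLaVA it prunes 94% of visual tokens while retaining 95% of baseline performance.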

Comments ECCV 2026 Under Review

Abstract

Despite their remarkable performance, Vision Language Models (VLMs) incur substantial computational overhead due to the large number of visual tokens. While diversity maximization has become a dominant strategy for token reduction, existing methods rely on cosine-based normalized similarity that discards magnitude information, failing to faithfully approximate the original feature representation and leading to suboptimal performance, particularly on compositional multi-skill reasoning tasks. In this paper, we introduce SPARE, a subspace reconstruction method that reformulates token pruning as a column subset selection problem and explicitly minimizes reconstruction error. By iteratively selecting tokens with large projection residuals, SPARE performs reconstruction-driven pruning beyond angular diversity. Moreover, we reveal a counterintuitive anti-relevance phenomenon: tokens with lower image-text relevance scores can better preserve contextual information. Based on this finding, we incorporate anti-relevance into SPARE as an additional selection criterion to promote context-aware token selection. Extensive experiments across multiple VLMs and benchmarks demonstrate that SPARE consistently achieves state-of-the-art performance, with strong gains on compositional tasks. When applied to LLaVA, SPARE removes up to 94% of visual tokens while retaining 95% of the baseline performance, all in a fully training-free manner.
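
The idea of selecting tokens by projection residual can be illustrated with the textbook greedy procedure for column subset selection. The sketch below is that generic procedure, not SPARE itself: the anti-relevance criterion is omitted, and the function name `select_tokens` and the toy vectors are assumptions made for illustration.

```python
def select_tokens(tokens, k):
    """Greedy column-subset selection: repeatedly keep the token whose
    residual (the part not explained by already-kept tokens) is largest."""
    residuals = [list(t) for t in tokens]
    kept = []
    for _ in range(k):
        norms = [sum(x * x for x in r) for r in residuals]
        i = max(range(len(tokens)), key=norms.__getitem__)
        if norms[i] < 1e-12:
            break  # every remaining token is already reconstructed
        kept.append(i)
        # Unit direction of the chosen residual.
        q = [x / norms[i] ** 0.5 for x in residuals[i]]
        # Remove that direction from every residual.
        for r in residuals:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]
    return kept

# Toy 2-D "tokens": 0 and 1 are near-duplicates, 3 has tiny magnitude.
tokens = [[3.0, 0.0], [2.9, 0.1], [0.0, 1.0], [0.1, 0.0]]
kept = select_tokens(tokens, 2)
```

Because the score is a residual norm and not a cosine similarity, token magnitude matters here: the small token at index 3 is never preferred over the orthogonal, larger token at index 2.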

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

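Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework. Structured semantic anchors guide a multi-expert multimodal LLM to generate high-fidelity text, an anchor-preserving diffusion mechanism synthesizes images, and a dual-constraint filtering module selects samples, yielding significant gains on low-resource multimodal information extraction.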

Comments Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.18846 2026-06-18 cs.CV New submission

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

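Affiliations: School of Artificial Intelligence, Jilin University; College of Computer Science, Jilin University; OPPO

AI summary: Proposes ScreenAnnotator, which combines a unified annotation atom schema, an on-policy annotation loop, and a Bayesian verifier to address the limited expressiveness, annotation-training decoupling, and poor data reusability of existing tools, enabling efficient multi-task data generation.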

Comments 14 pages, 7 figures

Abstract

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, an absolute gain of 35.1 percentage points. Our code is available at: https://github.com/WnQinm/Annotator.

2606.18974 2026-06-18 cs.CV New submission

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

Affiliations: Xi’an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

AI summary: Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning of multi-step diffusion generation to a text-only student, achieving a 14.3× speedup and a gain of 3.40 percentage points.

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with a 14.3× speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
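
Token-level JSD distillation can be made concrete with a small sketch. It assumes the teacher and student each produce a next-token probability distribution at every position of a student-sampled trajectory; the distributions below are made up, the helper names are hypothetical, and no gradient step is shown. The last line only re-derives the quoted speedup from the two per-sample timings in the abstract.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def token_level_jsd(teacher_dists, student_dists):
    """Mean JSD over the token positions of one student trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)

# Two token positions over a three-word vocabulary (illustrative values).
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]  # same weights, sees the VTs
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]  # same weights, question only
loss = token_level_jsd(teacher, student)

speedup = 142.8 / 10.0  # per-sample timings quoted in the abstract
```

The second position contributes zero loss because the two distributions agree there, so only positions where the privileged context changes the teacher's prediction drive the distillation signal.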

2606.18992 2026-06-18 cs.CV New submission

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

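Affiliations: Amsisan Tran; Baogh Le; Tuan Kiet Pham; Sui Yang Guang

AI summary: Proposes CLARA, a framework that shows users a panel of visual alternatives to choose from and reweights calibration by a likelihood ratio to guarantee coverage across multiple rounds. It disambiguates composed image retrieval effectively and outperforms text-question baselines.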

Abstract

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
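
The abstract says CLARA reweights calibration using a likelihood ratio but gives no formula. As a hedged illustration, the sketch below shows the generic weighted conformal quantile, in which each calibration score carries a weight and the test point contributes a point mass at infinity. Treating those weights as likelihood ratios induced by a user selection is an assumption, as are the function name and the toy scores; this is not the authors' procedure.

```python
def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Smallest calibration score whose cumulative weight reaches the
    (1 - alpha) fraction of the total, counting the test point's weight."""
    total = sum(weights) + test_weight
    target = (1 - alpha) * total
    cum = 0.0
    for s, w in sorted(zip(scores, weights)):
        cum += w
        if cum >= target:
            return s
    return float("inf")  # not enough calibration mass: keep every candidate

scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # nonconformity scores

# Uniform weights reduce to ordinary split conformal prediction.
uniform = weighted_conformal_threshold(scores, [1.0] * 7, 1.0, alpha=0.25)

# Upweighting the hardest calibration example raises the threshold.
shifted = weighted_conformal_threshold(
    scores, [1, 1, 1, 1, 1, 1, 4], 1.0, alpha=0.25)
```

Candidates whose score falls at or below the returned threshold would stay in the prediction set, so a larger threshold means a larger, more cautious set after the reweighting.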

2606.19100 2026-06-18 cs.CV New submission

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

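Affiliations: NOVA School of Science and Technology; NOVA LINCS

AI summary: Addresses the lack of open-source multimodal models for European Portuguese with AMALIA-VL, which establishes a strong baseline through three-stage training and a Portuguese-centric data mix, with all resources to be released openly.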

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.19277 2026-06-18 cs.CV New submission

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

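Affiliations: Engineering, North Carolina A&T State University, Greensboro, NC, USA; College of Science Technology, North Carolina A&T State University, Greensboro, NC, USA

AI summary: Proposes RS Adapter, a parameter-efficient fine-tuning strategy that injects lightweight bottleneck adapters into three vision-language model architectures and achieves remote sensing VQA with fewer than 5% of parameters trainable. The hybrid FLAVA architecture gives the best balance between multimodal reasoning and retrieval.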

Comments 4 pages, 2 figures, accepted and to be presented at 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington D.C

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the Dual-Encoder CLIP, the Encoder-Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with fewer than 5 percent of parameters being trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
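
A bottleneck adapter of the kind described above is usually a small down-projection, a nonlinearity, and an up-projection added back to the frozen layer's output. The sketch below illustrates that common form and the resulting parameter budget. The layer width, bottleneck size, adapter count, and backbone size are hypothetical numbers chosen for the arithmetic, not values reported for RS Adapter, and biases are left out for brevity.

```python
def adapter_forward(h, w_down, w_up):
    """Residual bottleneck adapter: h + W_up @ relu(W_down @ h)."""
    z = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in w_down]
    delta = [sum(w * v for w, v in zip(row, z)) for row in w_up]
    return [x + d for x, d in zip(h, delta)]

def adapter_params(d_model, bottleneck):
    """Weights of the down- and up-projections (no biases)."""
    return 2 * d_model * bottleneck

# A zero-initialised up-projection leaves the frozen features unchanged.
h = [2.0, 1.0]
identity_out = adapter_forward(h, [[0.5, 0.5]], [[0.0], [0.0]])
adapted_out = adapter_forward(h, [[0.5, 0.5]], [[1.0], [2.0]])

# Hypothetical budget: 12 layers, one adapter after attention and one
# after the MLP, width 768, bottleneck 64, 86M frozen backbone weights.
trainable = 12 * 2 * adapter_params(768, 64)
fraction = trainable / (86_000_000 + trainable)
```

Under these assumed sizes the adapters account for roughly 2.7% of all parameters, which shows how a budget below 5 percent can arise while the backbone stays frozen.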

2606.19338 2026-06-18 cs.CV New submission

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

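Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Artificial Intelligence Laboratory; Zhejiang University; The Chinese University of Hong Kong

AI summary: Proposes the RNG-Bench benchmark suite, which uses two games, Matching Pairs and 3D Maze, to evaluate how well multimodal large language models reconstruct past observations and act on them in non-Markov environments. It finds that most errors stem from forgetting and not from decision making, and that fine-tuning improves performance.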

Abstract

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

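Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle. Active perception decouples reasoning complexity from video duration, and the agent achieves the best performance among open-source models on multiple benchmarks.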

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2. Embodied AI, Robotics, and Autonomous Driving (6 papers)

2606.18583 2026-06-18 cs.CV cs.RO New submission

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Yandi Yang, Xianghong Zou, Jianping Li, Haofeng Xie, Saurav Uprety, Hongzhou Yang, Naser El-Sheimy

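Affiliations: University of Calgary; Nanchang University; Nanyang Technological University; Wuhan University

AI summary: Proposes an aerial-ground LiDAR place recognition framework that narrows the domain gap with multi-scale patch-level self-supervised learning and reduces false positives with an expanded reciprocal re-ranking algorithm, significantly improving retrieval accuracy on multiple datasets.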

Abstract

LiDAR place recognition determines one's position on a prior point cloud map. The most studied ground-level LiDAR place recognition suffers from pre-visit requirements, incomplete coverage, and limited perspectives. Using pre-acquired, full-coverage Airborne Laser Scanning (ALS) data as an aerial prior map overcomes these drawbacks, making cross-view place recognition necessary and advantageous. However, aerial-ground LiDAR place recognition faces significant challenges, including the domain gap between aerial and ground point clouds, and false positives during initial retrieval. To address these challenges, we present a novel retrieval and re-ranking framework for aerial-ground LiDAR place recognition. Based on the prior that neighboring point cloud patches share similar semantics with the anchor patch, our retrieval network introduces patch-level self-supervised learning modules at multiple scales and integrates them with scene-level learning to improve global feature discriminativeness between aerial and ground point clouds. Furthermore, leveraging the structured spatial distribution of ALS point clouds, we introduce an Expanded Reciprocal (ER) re-ranking algorithm that exploits neighborhood information maximally and refines each feature based on its neighbor features; the refined features are then used to update the similarity matrix for final ranking. Extensive experiments demonstrate that our retrieval network outperforms existing state-of-the-art (SOTA) methods, achieving a 9.8% improvement in average Recall@1 and a 3.2% improvement in average Recall@1% on CS-Urban-Scenes, while also showing the best performance on the CS-Campus3D dataset. Additionally, our ER re-ranking algorithm further boosts the average Recall@1 by 4.9% on CS-Campus3D and 10.2% on CS-Urban-Scenes without additional training.

2606.18687 2026-06-18 cs.CV cs.RO New submission

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Sagun Singh Shrestha, Samuel Harding, Abdelwahed Khamis, Saimunur Rahman, Peyman Moghadam

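Affiliations: CSIRO Robotics; University of Queensland

AI summary: For heterogeneous place recognition between 4D automotive radars and dense spinning radars, proposes spatially stratified distillation (SSD), an asymmetric spatial alignment based on physical radar returns that enforces feature alignment in overlapping regions and lowers the distillation weight in sparse regions, achieving state-of-the-art performance on the HeRCULES dataset.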

Comments IEEE ICRA Workshop on Open Challenges for Rigorous Robot Perception 2026

Abstract

Scalable, all-weather place recognition increasingly relies on heterogeneous radar place recognition to bridge diverse hardware platforms. A notable application is matching queries from cost-effective 4D automotive radars against high-fidelity reference maps built by dense spinning radars. This process is fundamentally limited by the extreme sparsity (and narrow field-of-view) of the 4D sensor, which captures only a fraction of the structural density present in the spinning radar database. Prior efforts address this issue by unifying different radar signals, that is, by projecting both signals into a common representational space. Yet, they suffer performance degradation in multi-session environments. In this paper, we propose spatially-stratified distillation (SSD), a strategy that replaces standard uniform distillation with an asymmetric spatial alignment derived directly from physical radar returns. In regions where both radars exhibit overlapping returns, SSD enforces strong feature alignment. Crucially, in sparse regions where the 4D student lacks returns but the teacher contains valid structure within the shared field of view, SSD applies heavily discounted distillation weights. Extensive evaluations on the recent HeRCULES dataset demonstrate that SSD significantly outperforms prior place recognition methods, achieving state-of-the-art results on its challenging dynamic sequences.
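
The asymmetric weighting described above can be sketched as a per-cell weight map applied to a feature-matching loss. The sketch assumes binary occupancy grids flattened to lists, a squared-error feature loss, a discount of 0.1, and zero weight everywhere else. All of these choices and the function names are hypothetical, since the abstract does not state the actual values or loss.

```python
def ssd_weights(student_occ, teacher_occ, in_shared_fov, discount=0.1):
    """Per-cell distillation weights from physical radar returns."""
    weights = []
    for s, t, fov in zip(student_occ, teacher_occ, in_shared_fov):
        if s and t:
            weights.append(1.0)       # overlapping returns: strong alignment
        elif t and fov and not s:
            weights.append(discount)  # teacher-only structure: discounted
        else:
            weights.append(0.0)       # assumed: no supervision elsewhere
    return weights

def weighted_distill_loss(student_feat, teacher_feat, weights):
    """Weighted mean squared difference between student and teacher features."""
    num = sum(w * (s - t) ** 2
              for w, s, t in zip(weights, student_feat, teacher_feat))
    den = sum(weights)
    return num / den if den > 0 else 0.0

# Four cells: overlap, teacher-only in FoV, teacher-only out of FoV, empty.
w = ssd_weights([1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1])
loss = weighted_distill_loss([1.0, 0.0, 0.0, 0.0], [0.5, 1.0, 1.0, 0.0], w)
```

A uniform scheme would weight all four cells equally and penalise the sparse student for structure it cannot observe, which is the behaviour the stratified weights are meant to avoid.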

2606.18824 2026-06-18 cs.CV cs.LG New submission

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Yuxuan Xie, Nicolas Pugeault, Chongfeng Wei, Hubert P. H. Shum, Edmond S. L. Ho

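Affiliations: School of Computing Science, University of Glasgow; James Watt School of Engineering, University of Glasgow; Department of Computer Science, Durham University

AI summary: Proposes the MMPM framework, which uses a behavior-aware interaction module and a CVAE-based mode-aware trajectory predictor to model the crossing and non-crossing modes separately, improving the accuracy of multimodal trajectory prediction from an ego-centric view.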

Comments Accepted at The IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

AI中文摘要

从自我中心摄像头进行行人轨迹预测具有挑战性,因为它依赖于与车辆和场景上下文的复杂交互以及行人的意图。通过建模行人历史与未来轨迹的相关性和意图,通常会产生多模态(即多个模式)分布。现有的随机预测器通常从单一单峰分布中采样多个未来轨迹,这可能导致次优的“混合模式”轨迹,这些轨迹位于不同的运动模式之间,并在真实场景中变得不合理。在本文中,我们提出MMPM,一种模态感知框架,基于行人的过马路行为将未来轨迹分布分别建模为语义上有意义的模式。MMPM由两个模块组成:行为感知行人交互模块(PIM),通过引入注视、头部和手势来联合捕捉行人-车辆和行人-环境交互;以及基于CVAE的模态感知轨迹预测器(MTP)模块,分别对过马路和不过马路两种模式的未来轨迹分布进行建模。基于查询的解码器进一步在解码过程中强制执行模态一致性。在PIE和JAAD数据集上的实验表明,我们的方法超越了最先进的基线。我们提出的MTP是模型无关的,可以集成到现有框架如BiTrap-NP和SGNet-ED中,以进一步提高未来轨迹预测性能。我们还引入了一种数据驱动的验证协议,将预测与时空一致的真实轨迹匹配,展示了相比先前工作改进的逐帧位移误差。

英文摘要

Pedestrian trajectory prediction from an ego-centric camera is challenging since it depends on complex interactions with vehicles and scene context, as well as the intention of the pedestrian. Modelling correlation and intent from the historical and future trajectories of the pedestrian usually results in a multimodal (i.e. multiple-mode) distribution. Existing stochastic predictors often sample multiple futures from a single unimodal distribution, which can yield sub-optimal 'mixed-mode' trajectories that lie between distinct motion patterns and become implausible in real scenes. In this paper, we propose MMPM, a mode-aware framework that separately models future trajectory distributions as semantically meaningful modes based on the pedestrian's crossing behavior. MMPM consists of two modules: a behavior-aware Pedestrian Interaction Module (PIM) that jointly captures pedestrian-vehicle and pedestrian-environment interactions by introducing gaze, head and hand gesture cues, and a CVAE-based Mode-aware Trajectory Predictor (MTP) module that models the future trajectory distributions of the two modes, crossing and not crossing the road, separately. A query-based decoder further enforces mode consistency during decoding. Experiments on the PIE and JAAD datasets show that our method surpasses state-of-the-art baselines. Our proposed MTP is model-agnostic and can be integrated into existing frameworks such as BiTrap-NP and SGNet-ED to further improve future trajectory prediction performance. We additionally introduce a data-driven validation protocol that matches predictions to spatio-temporally consistent ground-truth trajectories, demonstrating improved frame-wise displacement errors over previous work.
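The 'mixed-mode' failure described above can be illustrated with a one-dimensional toy. This is only a sketch under stated assumptions: the lateral displacements (0 m for not crossing, 3 m for crossing), the noise level, and all function names are hypothetical, and Gaussian sampling stands in for the paper's CVAE.

```python
import random
import statistics


def sample_mode_aware(mode_means, mode_probs, sigma, n, rng):
    """Pick a mode first, then sample around that mode's mean."""
    samples = []
    for _ in range(n):
        mode = rng.choices(range(len(mode_means)), weights=mode_probs)[0]
        samples.append(rng.gauss(mode_means[mode], sigma))
    return samples


def sample_unimodal(data, n, rng):
    """Fit a single Gaussian to all data and sample from it."""
    mu = statistics.mean(data)
    sd = statistics.pstdev(data)
    return [rng.gauss(mu, sd) for _ in range(n)]


def fraction_between_modes(samples, low=1.0, high=2.0):
    """Share of samples that land in the gap between the two modes."""
    return sum(low < s < high for s in samples) / len(samples)


rng = random.Random(0)
# Toy "ground truth": half the pedestrians stay (0 m), half cross (3 m).
observed = sample_mode_aware([0.0, 3.0], [0.5, 0.5], 0.2, 2000, rng)
mixed = fraction_between_modes(sample_unimodal(observed, 2000, rng))
aware = fraction_between_modes(
    sample_mode_aware([0.0, 3.0], [0.5, 0.5], 0.2, 2000, rng))
```

The single Gaussian places a sizeable share of its samples between the two behaviors, whereas per-mode sampling leaves that gap almost empty.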

2606.18955 2026-06-18 cs.CV cs.RO 新提交

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Tianfu Jiangxi Laboratory(天府绛溪实验室)

AI总结 提出基于潜在动作的框架,利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验,通过意图-感知解耦策略减少动作幻觉,仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

AI中文摘要

训练通用视觉-语言-动作(VLA)模型通常需要大量、多样化的机器人数据集,并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性,但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题,我们提出了一种基于潜在动作的框架,旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE,通过物理掩码将运动动态与环境背景解耦,从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练,VLM骨干网络学习到动作意图的深层表示。为了适应特定实体,我们引入了一种意图-感知解耦策略,其中VLM预测动作意图,而一个独立的冻结视觉编码器为动作专家提供状态特定特征,从而减少动作幻觉。在仿真和真实环境中的结果表明,我们的方法仅在无标签人类视频上预训练,与在大量标注数据集上训练的最先进VLA模型相比具有竞争力,且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.
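For readers unfamiliar with the codebook idea, the generic VQ-VAE lookup step is sketched below: a continuous latent is replaced by its nearest codebook entry. This shows only the standard nearest-neighbor quantization, with a hypothetical three-entry codebook; the paper's hybrid disentangling with physical masks is not reproduced here.

```python
def quantize(latent, codebook):
    """Return (index, code) of the codebook entry nearest to `latent`.

    Distance is squared Euclidean; `codebook` is a list of equal-length
    vectors. The index can serve as a discrete latent-action token.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sqdist(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three latent "actions".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
```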

2606.18960 2026-06-18 cs.CV cs.RO 新提交

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
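The policy-evaluation claim is stated in terms of the Pearson correlation between success rates measured inside the world model and in the real world. A minimal sketch of that statistic follows; the success-rate numbers are made up for illustration, and the abstract does not say whether the 14.5% gain is absolute or relative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical success rates of four policies (not from the paper).
real_world = [0.20, 0.45, 0.60, 0.80]
world_model = [0.25, 0.40, 0.65, 0.75]
r = pearson(world_model, real_world)
```

A value of r close to 1 would mean the world model ranks and scales policies much like real-world trials do.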

2606.19258 2026-06-18 cs.CV cs.RO 新提交

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

CABLE: 面向V2X系统的云辅助带宽高效LMM编码框架

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

发表机构 * University of Georgia(佐治亚大学)

AI总结 提出CABLE框架,通过边缘端利用自我运动补偿和残差运动线索传播云分割掩码,生成感兴趣区域(ROI)并仅上传ROI掩码图像,形成掩码-ROI-LMM反馈循环,在五个数据集上实现73-87%的ROI像素覆盖减少和5-8倍LMM预填充加速。

AI中文摘要

云托管的大型多模态模型(LMM)可以为车联网系统提供强大的开放词汇感知能力,但简单地将全分辨率帧从边缘传输到云会导致严重的通信开销和云侧预填充延迟。我们提出了CABLE,一种用于边缘-云感知的云辅助带宽高效LMM编码框架。CABLE在边缘端利用自我运动补偿传播先前的云分割掩码,通过残差运动线索进行细化,并通过走廊包络整合断开区域,形成鲁棒的感兴趣区域(ROI)。仅上传ROI掩码图像,而云分割输出作为下一帧的先验反馈,形成掩码-ROI-LMM反馈循环。在五个数据集(nuScenes、WOD-ZB、Waymo、KITTI和CADC)上的实验表明,该方法在保持感知能力的同时实现了显著的通信节省,相对于全帧推理,ROI像素覆盖减少73-87%,估计LMM预填充加速5-8倍,检测质量略有折衷。

英文摘要

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving 73-87% ROI pixel-coverage reduction with 5-8× estimated LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.
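The ROI pixel-coverage reduction quoted above can be read as the share of pixels that are masked out before upload. The sketch below assumes a binary mask stored as nested lists and zero-filling of pixels outside the ROI; both the helper names and the fill value are assumptions, and the mask propagation itself (ego-motion compensation, corridor envelope) is not modeled.

```python
def apply_roi_mask(image, mask, fill=0):
    """Keep pixels where mask is truthy, replace the rest with `fill`."""
    return [[px if keep else fill for px, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def roi_coverage_reduction(mask):
    """Fraction of pixels removed relative to the full frame."""
    total = sum(len(row) for row in mask)
    kept = sum(1 for row in mask for keep in row if keep)
    return 1.0 - kept / total


# Toy 2x4 frame in which only the two top-left pixels fall inside the ROI.
image = [[5, 6, 7, 8], [1, 2, 3, 4]]
mask = [[1, 1, 0, 0], [0, 0, 0, 0]]
masked = apply_roi_mask(image, mask)
reduction = roi_coverage_reduction(mask)
```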

3. 图像识别、检索与分类 3 篇

2606.18528 2026-06-18 cs.CV 新提交

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

一种面向独立于书写者的离线手写签名验证的原型签名方法

Kecia G. de Moura, Robert Sabourin, Rafael M. O. Cruz

发表机构 * École de technologie supérieure – Université du Québec Montreal(魁北克大学蒙特利尔高等技术学院)

AI总结 提出基于原型签名的数据驱动策略,生成多样且信息丰富的负样本,提升对熟练伪造签名的检测能力,并提高可扩展性和计算效率。

Comments Accepted for oral presentation at the International Conference on Pattern Recognition (ICPR) 2026

AI中文摘要

离线手写签名验证旨在使用静态图像区分真实签名和伪造签名。由于真实伪造样本很少,通常从其他用户的真实签名中随机抽取负样本来创建训练数据。然而,这种随机选择往往缺乏多样性,增加冗余,并提高计算成本,导致训练效率低下。我们提出了一种数据驱动策略,使用原型签名生成多样且信息丰富的负样本,原型签名是真实签名特征的紧凑、不可识别的摘要。基于实验结果,我们得出结论:(i)原型签名产生更具信息量的负样本,改进了对熟练伪造的检测;(ii)所提出的方法与骨干网络无关,在不同架构上表现出鲁棒性;(iii)当与原始形式的线性SVM结合时,它可作为基于RBF模型的替代方案,同时显著提高可扩展性和计算效率。该方法的实现可在以下网址获取:https://github.com/kdmoura/proto_hsv。

英文摘要

Offline handwritten signature verification aims to distinguish genuine from forged signatures using static images. Since real forgeries are rarely available, negative samples are usually randomly drawn from genuine signatures of other users to create training data. However, this random selection often lacks diversity, increases redundancy, and escalates computational cost, leading to inefficient training. We propose a data-driven strategy to generate diverse, informative negative samples using prototypical signatures, which are compact, non-identifiable summaries of genuine signature features. Based on the experimental results, we conclude that (i) prototypical signatures yield more informative negative samples, improving the detection of skilled forgeries; (ii) the proposed approach is backbone-agnostic, showing robustness across architectures; and (iii) when combined with a primal-form linear SVM, it serves as an alternative to RBF-based models while significantly improving scalability and computational efficiency. Implementation of the method is available at https://github.com/kdmoura/proto_hsv.

2606.18885 2026-06-18 cs.CV cs.IR 新提交

LARE: Low-Attention Region Encoding for Text-Image Retrieval

LARE: 低注意力区域编码用于文本-图像检索

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

发表机构 * Saudi Data and Artificial Intelligence Authority (SDAIA)(沙特数据与人工智能局)

AI总结 提出LARE框架,通过并行编码低注意力区域和完整图像,解决拥挤场景下视觉编码器忽视关键细节的问题,在密集场景子集上提升检索性能。

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

AI中文摘要

拥挤场景中的图像检索尤其具有挑战性,因为传统视觉编码器存在显著性偏差,倾向于关注主要对象而忽略低注意力区域,而这些区域通常对细粒度检索至关重要。我们提出了LARE(低注意力区域编码),一个显式建模这些被忽略区域的框架。LARE采用双编码策略,并行编码图像的低注意力区域和完整图像,从而产生更多样化和信息丰富的图像嵌入。为了评估拥挤场景下的图像检索性能,我们引入了Dense-Set,一个源自COCO和Flickr30K的具有挑战性的子集。在该子集中,图像被重新标注,以提供对低注意力或先前被忽略区域的更丰富描述。该数据集突显了现有检索模型的局限性,并能够在密集拥挤场景条件下进行更严格的评估。实验结果表明,所提出的框架通过在共享潜在空间中保留微妙的非主导视觉线索来提高检索性能。

英文摘要

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

2606.19204 2026-06-18 cs.CV 新提交

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

ROSA-TFormer: 一种雷达-光学传感器感知的时间Transformer用于基于GEE导出的Sentinel-1/2时间序列的陕北樟子松人工林分类

Nengbo Zhang, Chang Sheng

发表机构 * Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences (AIRCAS)(中国科学院空天信息创新研究院遥感与数字地球重点实验室)

AI总结 提出ROSA-TFormer模型,集成SAR和光学嵌入分支、传感器感知门和时间注意力池化,利用Sentinel-1/2时间序列数据实现高精度樟子松人工林分类,总体精度达99.67%。

Comments journal in tree classification

AI中文摘要

准确识别樟子松人工林对于监测陕北地区造林质量和生态恢复具有重要意义。本文提出ROSA-TFormer,一种雷达-光学传感器感知的时间Transformer,利用Google Earth Engine生成的Sentinel-1/2时间序列数据进行樟子松分类。该模型集成了独立的SAR和光学嵌入分支、传感器感知门以及时间注意力池化,以捕获多源季节特征。在月度与半月点级数据集上的实验表明,ROSA-TFormer在HalfMonth-dataBig数据集上实现了强分类性能,总体精度99.67%,宏F1 99.56%,樟子松F1 98.91%。空间块验证和消融实验进一步表明了雷达-光学时间融合和传感器感知建模的有效性。结果展示了ROSA-TFormer在点级樟子松人工林分类中的潜力,但更广泛的wall-to-wall验证仍有必要。

英文摘要

Accurate identification of Pinus sylvestris var. mongolica plantations is important for monitoring afforestation quality and ecological restoration in northern Shaanxi. This paper proposes ROSA-TFormer, a radar-optical sensor-aware temporal Transformer for P. sylvestris classification using Sentinel-1/2 time-series data generated on Google Earth Engine. The model integrates separate SAR and optical embedding branches, a sensor-aware gate, and temporal attention pooling to capture multi-source seasonal features. Experiments on monthly and half-month point-level datasets show that ROSA-TFormer achieves strong classification performance, with 99.67% overall accuracy, 99.56% macro F1, and 98.91% P. sylvestris F1 on the HalfMonth-dataBig dataset. Spatial block validation and ablation results further indicate the effectiveness of radar-optical temporal fusion and sensor-aware modeling. The results demonstrate the potential of ROSA-TFormer for point-level P. sylvestris plantation classification, while broader wall-to-wall validation remains necessary.
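The abstract reports overall accuracy, macro F1, and a per-class F1, which can differ noticeably when classes are imbalanced. The sketch below computes the three metrics on a tiny made-up label set; the labels are hypothetical and unrelated to the HalfMonth-dataBig dataset.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score of one class, written as 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


def overall_accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels: 1 = target plantation class, 0 = everything else.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
oa = overall_accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
target_f1 = per_class_f1(y_true, y_pred, 1)
```

In this toy case the minority-class F1 is lower than the overall accuracy, which is why the per-class figure is worth reporting alongside the aggregate ones.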

4. 目标检测、分割与定位 3 篇

2606.18566 2026-06-18 cs.CV cs.AI cs.GR 新提交

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

多模态超图融合用于低光照人群计数

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 针对低光照环境下人群计数难题,构建三个新基准数据集,提出多模态超图融合模块和可变形矩形稀疏注意力模块,形成低光照计数网络LCNet,在三个基准上取得最优性能。

AI中文摘要

人群计数是计算机视觉中的一项基本任务。然而,低光照环境下的人群计数在实际世界中具有重要实用价值,却仍未得到充分探索。现有方法主要关注良好光照场景或依赖单模态红绿蓝(RGB)表示,这在极端黑暗和复杂非均匀光照下往往变得不可靠。为解决此问题,我们构建了三个新的低光照人群计数基准,包括两个合成数据集SHA_Dark和SHB_Dark,以及一个真实世界基准LC-Crowd(低光照人群数据集)。受Retinex物理建模启发,我们引入深度和Canny边缘线索作为互补的几何和结构先验,以增强低光照条件下的内在反射率表示。我们提出多模态超图融合模块,将RGB外观、深度几何和边缘结构线索统一表示为超图中的节点,并通过动态超边构建和消息传递显式捕获它们的高阶互补关系。此外,为在密集预测中自适应分配计算,我们提出可变形矩形稀疏注意力(DRSA)模块,通过锚点感知估计和自适应矩形窗口建模将计算集中在信息丰富区域。基于这些设计,我们开发了统一的低光照计数网络(LCNet)用于鲁棒的低光照人群计数。在三个基准上的大量实验表明,所提方法在整体性能上优于现有最先进(SOTA)方法。代码见补充材料。数据集将在接收后公开。

英文摘要

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA_Dark and SHB_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

2606.18582 2026-06-18 cs.CV cs.RO eess.IV 新提交

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

ICRA 2026 GOOSE 2D细粒度语义分割挑战赛技术报告:利用DINOv3实现野外机器人中的鲁棒户外场景理解

Jaeil Park, Hyobin Choi, Sangjin Lee, Hyungtae Lim, Sung-Hoon Yoon

发表机构 * Daegu Gyeongbuk Institute of Science and Technology (DGIST)(大邱庆北科学技术院) Massachusetts Institute of Technology (MIT)(麻省理工学院)

AI总结 提出一种结合DINOv3自监督骨干、ViT-Adapter和Mask2Former解码器的网络设计,以及多尺度测试增强和模型集成的推理策略,在64类细粒度越野语义分割挑战中取得第一名,复合得分76.57%。

Comments 5 pages, 4 figures

AI中文摘要

ICRA 2026野外机器人研讨会举办的GOOSE 2D细粒度语义分割挑战赛评估了越野图像在64个细粒度类别和11个评估的非空洞粗类别上的密集语义分割。我们提出了该挑战的第一名解决方案。我们的解决方案包含两个互补的改进:(a) 网络级设计,结合了自监督DINOv3 ViT-L/16骨干、ViT-Adapter和Mask2Former掩码分类解码器,以及基于全局[CLS]令牌的粗类别辅助损失;(b) 推理时聚合策略,基于多尺度和水平翻转测试时增强,以及使用Codabench分数选择的前三个检查点的集成。我们的方法达到了官方复合得分76.57%,包括69.32%的细类mIoU和83.81%的类别级mIoU,并在最终阶段排行榜上排名第一:www.codabench.org/competitions/14257/#/results-tab。

英文摘要

The GOOSE 2D Fine-Grained Semantic Segmentation Challenge at the ICRA 2026 Workshop on Field Robotics evaluates dense semantic segmentation of off-road imagery over a fine-grained taxonomy of 64 classes and 11 evaluated non-void coarse categories. We present the first-place solution to this challenge. Our solution comprises two complementary improvements: (a) a network-level design that combines a self-supervised DINOv3 ViT-L/16 backbone, a ViT-Adapter, and a Mask2Former mask-classification decoder, together with a coarse-category auxiliary loss on the global [CLS] token; and (b) an inference-time aggregation strategy based on multi-scale and horizontal-flip test-time augmentation and an ensemble of the top three checkpoints selected using Codabench scores. Our method achieves an official composite score of 76.57%, consisting of 69.32% fine-class mIoU and 83.81% category-level mIoU, and ranks first on the final phase leaderboard: www.codabench.org/competitions/14257/#/results-tab.
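As a quick arithmetic check, the three reported figures are consistent with the composite score being the plain average of the two mIoU values. This is an observation about the quoted numbers only; the official scoring rule is not stated in the abstract, so the averaging below is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def mean_of_two(a, b):
    """Average two percentages given as strings, rounded to two decimals."""
    mean = (Decimal(a) + Decimal(b)) / 2
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Fine-class mIoU and category-level mIoU as reported in the abstract.
composite = mean_of_two("69.32", "83.81")
```

Decimal arithmetic is used so that the half-way value 76.565 rounds up deterministically instead of depending on binary floating-point representation.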

2606.18783 2026-06-18 cs.CV 新提交

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

SCR引导的困难感知优化用于红外小目标检测

Yunus Sevim, Behçet Uğur Töreyin

发表机构 * Aselsan(阿塞尔桑公司) Istanbul Technical University(伊斯坦布尔理工大学)

AI总结 提出REEM框架,利用信杂比作为可见性先验,通过可微调制软IoU损失,提升低可见性目标检测性能,无需额外参数或推理开销。

Comments Accepted at CVPR 2026 Workshops (PBVS). Published version: https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html

AI中文摘要

红外小目标检测由于严重的背景杂波、低对比度和弱空间响应仍然具有挑战性,其中几何重叠单独不足以表征检测质量。在这项工作中,我们提出了REEM(重加权显式可见性增强调制),一种轻量级的SCR引导的困难感知优化框架,在训练期间将信杂比(SCR)作为物理上有意义的可见性先验。REEM不修改网络架构或直接优化SCR,而是从输入图像计算真实局部SCR,并对软IoU学习信号应用可微调制,强调低可见性目标,同时保持稳定优化和相同的推理行为。REEM集成到基于U-Net的MSHNet中,无需引入额外参数、架构修改或推理时开销。大量实验表明,与基线相比,REEM实现了持续改进,获得了更高的IoU和检测概率(Pd),同时大幅减少了虚警(FA),特别是在具有挑战性的低可见性条件下。这些结果表明,SCR引导的困难感知优化为红外小目标检测提供了有效且物理基础的补充,超越了传统的基于重叠的目标函数。代码可在https://github.com/yall-in-one/Reemm获取。

英文摘要

Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses, where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github.com/yall-in-one/Reemm.
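The idea of weighting a soft-IoU loss by target visibility can be sketched as follows. This is a hedged illustration, not the REEM formulation: the SCR definition used here (absolute mean difference divided by background standard deviation) is one common convention, and the exponential weight with a hypothetical `alpha` is an assumed stand-in for the paper's modulation. In practice the same expressions would be written in an autodiff framework so that the loss stays differentiable.

```python
import math
import statistics


def local_scr(target_pixels, background_pixels, eps=1e-6):
    """SCR as |mean(target) - mean(background)| / std(background)."""
    mu_t = statistics.mean(target_pixels)
    mu_b = statistics.mean(background_pixels)
    sigma_b = statistics.pstdev(background_pixels)
    return abs(mu_t - mu_b) / (sigma_b + eps)


def soft_iou_loss(pred, gt, eps=1e-6):
    """1 - soft IoU for flattened probability and label lists."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(pred) + sum(gt) - inter
    return 1.0 - (inter + eps) / (union + eps)


def difficulty_weight(scr, alpha=1.0):
    """Assumed modulation: low-SCR (hard) targets receive a larger weight."""
    return 1.0 + alpha * math.exp(-scr)


def modulated_loss(pred, gt, scr):
    return difficulty_weight(scr) * soft_iou_loss(pred, gt)


scr = local_scr([10, 12], [1, 3, 1, 3])   # bright target on mild clutter
loss = modulated_loss([0.8, 0.1], [1.0, 0.0], scr)
```

Because the weight depends only on the input image and the ground-truth mask, nothing changes at inference time, which matches the abstract's claim of no added parameters or inference overhead.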

5. 视频理解与时序视觉 4 篇

2606.18441 2026-06-18 cs.CV 新提交

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

AI总结 提出无时间标注的过程级奖励框架CF-GRPO,通过视频内在线索构建一致性帧先验,并利用一致性帧奖励优化模型帧使用与先验的对齐,提升视频推理性能。

AI中文摘要

强化学习提升了大型语言模型的推理能力,但将仅结果奖励应用于视频多模态大语言模型(Video-MLLMs)时,对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发(其中一致的线索可以增强感知估计的显著性和可靠性),我们引入了一致性帧GRPO(CF-GRPO),一种无需时间标注的过程级奖励框架,用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验,包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后,它从视觉和响应表示中计算模型侧的帧使用分数,并通过一致性帧奖励(CFR)优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化,CFR提供了高对比度的奖励信号,无需人工时间标注。实验表明,VideoCFR在复杂视频推理基准上取得了有竞争力的性能,并在多个指标上优于代表性的Video-MLLM和RL基线,同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见:https://github.com/1Pansy/VideoCFR。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18558 2026-06-18 cs.CV 新提交

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

MolmoMotion: 基于语言指令的3D点轨迹预测

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

发表机构 * Allen Institute for AI(艾伦人工智能研究所) University of Washington(华盛顿大学) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出一种基于语言指令的3D点运动预测方法,通过构建大规模数据集和基准,实现类无关、视角稳定的运动轨迹预测,并在机器人操作和视频生成中验证其有效性。

AI中文摘要

运动预测是视觉智能的核心:智能体必须预测物体如何运动,以规划行动、推理物理交互并合成逼真的未来场景。我们认为,世界坐标系中的3D点提供了一种通用表示,具有类无关、视角稳定、紧凑且对下游任务直接有用的特性。我们形式化了目标条件3D点运动预测任务:给定一段短视觉历史、目标物体上的一组3D查询点以及预期目标的语言描述,模型预测每个点的未来3D轨迹。我们引入了一个完整的堆栈来大规模研究此任务:(1) MolmoMotion-1M是一个大型语料库,包含从116万无约束视频中标注的动作描述、物体锚定的3D点轨迹;(2) PointMotionBench是一个人工验证的基准,涵盖111个物体类别和61种运动类型;(3) MolmoMotion是一个通用运动预测模型,支持自回归坐标预测和基于流匹配的轨迹生成。MolmoMotion能准确预测不同语言指令下的多样运动模式,并在PointMotionBench上显著优于现有运动预测基线。最后,我们展示了学习到的3D运动先验能很好地迁移到下游应用:它提高了机器人操作的训练效率和泛化能力,其预测轨迹为生成模型提供了有效的运动指导,以合成具有更真实物体运动的视频。

英文摘要

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18586 2026-06-18 cs.CV cs.AI 新提交

APT: Atomic Physical Transitions for Causal Video-Language Understanding

APT: 用于因果视频语言理解的原子物理转变

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Dolby Laboratories(杜比实验室)

AI总结 提出原子物理转变(APT)作为视频中因果状态变化的显式表示,并构建混合来源数据集,通过APT-Tune微调方法使VLM学习物理转变而不遗忘事件级知识。

AI中文摘要

物理事件不仅通过其名称来理解,还通过组成它们的因果状态变化来理解。诸如“弹跳”之类的片段级标签可能是正确的,但同时隐藏了使事件在物理上有效的过程,从支撑丧失和接触开始到反弹和稳定。为了使这一隐藏过程显式化,我们引入了原子物理转变(APT):最小的、时间局部化的状态变化,将可见线索与活跃的物理机制以及前后动力学状态联系起来。APT链将视频表示为有序的因果转变序列,而不是单个聚合事件标签:事件标签说明发生了什么;APT链解释为什么会发生。为了使VLM能够学习APT,我们从人工标注和模拟器真实数据构建了混合来源的APT数据,涵盖接触、重力、摩擦和旋转/稳定性中的14种转变类型,包含1,246个试验中的27,303个计时实例。利用这些数据,我们发现当前的VLM在转变级物理理解上存在不足,零样本召回率最多为14%,错误主要由遗漏的转变主导。直接在APT链上进行微调可以改善转变检测,但会导致事件级遗忘,表明模型学习的是专门的答案格式,而不是可复用的物理表示。因此,我们提出了APT-Tune,一种参数高效的方案,教会VLM使用因果转变而不遗忘如何回答视频问题。它结合了图像填充感知监督、格式条件协同训练和机制条件域到类型解码,使APT学习具有格式鲁棒性和物理基础。在Qwen3-VL-2B上仅使用11M LoRA参数,APT-Tune显著提高了APT召回率,同时改善了事件级视频迁移。这些结果表明,APT不是一种新的答案格式,而是一种用于物理视频理解的人类对齐的因果监督信号。

英文摘要

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.19062 2026-06-18 cs.CV 新提交

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

发表机构 * Sejong University(世宗大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) Ulsan National Institute of Science and Technology(蔚山科学技术院)

AI总结 提出DREAM模型,通过双路径表示增强与对齐,结合层级视觉编码器和混合语言建模,在视频检索任务中实现新SOTA。

AI中文摘要

在当今媒体驱动的世界中,视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射,限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐,但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM:双路径表示增强与对齐模型,一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略,结合掩码和排列语言建模目标,以捕捉局部和全局语言语义。在视觉方面,我们设计了一个具有级联组注意力的层级视觉编码器,通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM,分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性,并为推进跨模态表示学习的未来研究铺平了道路。

英文摘要

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
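The R1 scores quoted above are Recall@1 values for text-to-video retrieval. A minimal sketch of the metric follows, assuming a square similarity matrix in which query i is paired with video i; the matrix values are invented for illustration.

```python
def recall_at_k(sim, k=1):
    """Recall@k for text-to-video retrieval.

    sim[i][j] is the similarity of text query i to video j, and the
    ground-truth match of query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Hypothetical similarities for three queries and three videos.
sim = [
    [0.9, 0.1, 0.0],   # query 0 ranks its own video first
    [0.2, 0.3, 0.8],   # query 1 ranks video 2 above its own video
    [0.1, 0.2, 0.7],   # query 2 ranks its own video first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```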

6. 生成式视觉与世界模型 9 篇

2606.18478 2026-06-18 cs.CV 新提交

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

数据强制蒸馏:恢复少步视频生成中的多样性和保真度

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

发表机构 * University of Michigan(密歇根大学) NVIDIA(英伟达) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对分布匹配蒸馏(DMD)在少步视频生成中出现的模式坍塌和过饱和问题,提出数据强制蒸馏(DFD)框架,利用教师分数差异(teacher score discrepancy)引导学生模型接近真实数据分布,仅需修改一行代码即可恢复多样性和保真度。

详情
AI中文摘要

最近的进展表明,将多步视频扩散模型蒸馏为高效的少步学生模型具有前景。其中,分布匹配蒸馏(DMD)及其后继DMD2实现了强大的生成质量和快速收敛。然而,由于反向KL目标的性质,这些方法表现出两个持续的失败模式:样本多样性大幅下降,以及明显过饱和的输出偏离真实视频外观。在这项工作中,我们提出了数据强制蒸馏(DFD),一个简单的后训练框架,仅需修改一行代码即可恢复DMD中的多样性和保真度。其核心是利用教师分数差异(teacher score discrepancy)引导学生模型朝向真实数据分布,将其拉向缺失的模式(缓解模式坍塌),并使其远离真实数据中不存在的问题模式(避免过饱和)。我们提供了框架的深入理论分析,并在文本到视频、图像到视频和自回归视频生成上验证了我们的方法。仅需100-300步微调,DFD就能在Wan2.1-1.3B和Cosmos-Predict2.5-2B模型上有效恢复多样性和保真度,消除过饱和伪影,显著改善视频动态和外观,甚至优于教师模型。

英文摘要

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.
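
The abstract attributes the loss of diversity to the reverse KL objective. The toy sketch below illustrates that general mode-seeking behaviour on a four-bin discrete distribution. It is not the DFD method or the DMD loss; the distributions `target`, `one_mode`, and `broad` are made-up values chosen for illustration.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists of probabilities."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Bimodal "real data" distribution: two equally likely modes (bins 0 and 3).
target = [0.49, 0.01, 0.01, 0.49]

# Candidate students: one collapses onto a single mode, one spreads mass evenly.
one_mode = [0.97, 0.01, 0.01, 0.01]
broad = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(student || target), is the mode-seeking direction:
# it barely penalises a student that ignores one of the target modes.
print("reverse KL:", kl(one_mode, target), kl(broad, target))

# Forward KL, KL(target || student), is mass-covering:
# it heavily penalises a student that misses a target mode.
print("forward KL:", kl(target, one_mode), kl(target, broad))
```

Under reverse KL the collapsed student scores better than the broad one, while forward KL ranks them the other way round, which is the textbook reason a reverse-KL objective can trade diversity for sharpness.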

2606.18591 2026-06-18 cs.CV 新提交

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

桥接创意意图与视觉质量:基于创作者驱动的循环视频生成与代理反馈循环

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

发表机构 * University of California, Davis(加州大学戴维斯分校) The Harker School(哈克学校) Basis Independent Silicon Valley(硅谷贝斯独立学校) Saratoga High(萨拉托加高中)

AI总结 提出CHIEF框架,通过人类-AI协作的迭代视频精炼,结合创作者驱动和代理主观反馈,提升长视频的叙事连贯性与创意方向。

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

详情
AI中文摘要

生成式AI使内容创作日益普及,但许多AI生成的视频缺乏叙事连贯性和创意方向,尤其在较长时长时问题更为突出。在编程任务中,AI生成受益于可靠的反馈以及循环自我改进等技术;与之不同,视频生成需要关于情节、场景和叙事的主观反馈,这自然促使人们采用融入人类创意方向的方法。我们提出了CHIEF,一个人类-AI协同创作视频生成框架,将创作者置于人在回路的迭代视频精炼的中心,并通过提供自动主观反馈来支持他们。创作者通过驱动每次迭代来融入其创意方向,而他们的修订则由专门的精炼代理整合。反馈由以角色为条件的多模态LLM生成,这些LLM观看生成的视频并从观众视角给出主观评价,提供仅靠自我评估无法捕捉的反馈。为测试我们提出框架的有效性,我们与没有电影制作经验的高中生和大学生合作创作视频,从1分钟短视频到情节复杂的完整10分钟短片。

英文摘要

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback is generated by persona-conditioned multimodal LLMs that watch the generated videos and produce subjective critiques from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, ranging from 1-minute shorts to a complete 10-minute short film with a complicated plot.

2606.18702 2026-06-18 cs.CV 新提交

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

UniTemp: 通过双向蒸馏实现任意时间顺序的视频生成

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

发表机构 * University of Wisconsin Madison(威斯康星大学麦迪逊分校) Adobe Research(Adobe 研究院) University of California Los Angeles(加利福尼亚大学洛杉矶分校) University of California Davis(加利福尼亚大学戴维斯分校)

AI总结 提出UniTemp框架,通过双向蒸馏训练单个自回归模型,支持任意时间方向(前向、后向、中间插值)的视频生成,解决因果3D VAE在后向生成中的不连续性,提升可控性。

详情
AI中文摘要

自回归视频扩散模型已成为长视频生成的一种有前景的方法,在流式设置中表现出色。然而,现有方法仅限于前向时间生成,而实际视频创作通常需要灵活的生成顺序,例如,基于未来上下文进行后向扩展,或基于过去和未来上下文进行中间插值生成。我们通过训练一个支持任意时间方向生成的自回归模型来弥合这一差距。一个关键的技术挑战来自视频扩散模型中广泛使用的因果3D VAE,它编码的潜变量严格依赖于过去上下文。虽然这种因果结构适合前向生成,但在后向生成时会导致块间不连续性。为了解决这个问题,我们引入了块级锚点潜变量,这是一组辅助潜变量,用于在后向生成过程中恢复块边界处缺失的过去上下文。基于这一设计,我们提出了UniTemp,一个双向蒸馏框架,训练单个自回归学生模型用于任意方向的视频生成。在推理时,UniTemp可以基于任意过去和/或未来帧进行条件生成,提高了双向和中间插值生成的可控性。实验表明,与仅前向方法相比,UniTemp在短和长视频生成上保持了竞争性能,同时支持多种工作流程,如双向视频扩展、中间插值生成、循环视频生成、场景转换和视觉故事生成。项目网站:https://lzhangbj.github.io/projects/unitemp/

英文摘要

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/

2606.18765 2026-06-18 cs.CV 新提交

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT:流匹配DiT的时间步条件谱残差校正

Jiayu Tian

发表机构 * Peking University(北京大学)

AI总结 提出SpectralDiT,通过时间步条件谱残差校正模块,在CIFAR-10和ImageNet-100上以极少额外计算和参数提升流匹配DiT的生成质量,FID分别降低5.1%和8.7%。

详情
AI中文摘要

我们提出SpectralDiT,一种对流匹配扩散变换器(Diffusion Transformers)的轻量级修改,它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量,然后学习一个零初始化的加法门,使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中,SpectralDiT在补丁大小为1时将FID从20.78提升至19.71,并缩小了径向傅里叶谱差距。此外,我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下,SpectralDiT改进了潜在流匹配,在无分类器引导(CFG 2.0)下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.
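
The abstract describes splitting each residual update into low- and high-frequency parts on the patch-token grid and adding them back through a zero-initialized gate, so that the model starts out identical to the baseline. The sketch below illustrates only that identity-at-initialization property. The 3x3 box filter used as the low-pass, the single-channel grid, and the scalar gates `gate_low` and `gate_high` are assumptions for illustration; the paper's actual decomposition and its timestep conditioning are not specified in the abstract.

```python
def box_lowpass(grid):
    """3x3 mean filter with edge clamping, used here as a stand-in low-pass."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), h - 1)
                    jj = min(max(j + dj, 0), w - 1)
                    acc += grid[ii][jj]
            out[i][j] = acc / 9.0
    return out


def spectral_correct(residual, gate_low=0.0, gate_high=0.0):
    """Return residual + gate_low * low + gate_high * high.

    In a trained model the gates would be learned and depend on the timestep;
    here they are plain scalars that default to zero (hypothetical simplification).
    """
    low = box_lowpass(residual)
    high = [[r - l for r, l in zip(r_row, l_row)]
            for r_row, l_row in zip(residual, low)]
    return [[r + gate_low * l + gate_high * hf
             for r, l, hf in zip(r_row, l_row, h_row)]
            for r_row, l_row, h_row in zip(residual, low, high)]


# Toy residual on a 2x3 token grid (made-up numbers).
toy_residual = [[1.0, -2.0, 0.5], [0.0, 3.0, -1.0]]

# With zero gates the correction is a no-op, so the baseline is recovered.
print(spectral_correct(toy_residual) == toy_residual)
```

Because the low and high parts sum back to the residual, setting both gates to the same value only rescales the update; the gates matter when they differ between the two bands.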

2606.18788 2026-06-18 cs.CV cs.CL 新提交

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出HandwritingAgent,利用大型推理模型在SVG格式中自回归生成手写笔画序列,无需风格特定训练,通过自然语言和参考图像控制风格,在模仿、识别、多语言及复杂数学表达式合成等任务上达到或超越现有最优方法。

详情
AI中文摘要

教会机器模仿自然手写风格仍然是一个开放挑战,因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间,而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而,这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此,我们引入了HandwritingAgent,一个语言驱动的智能体,它可以直接在可缩放矢量图形(SVG)格式中合成自然手写序列,无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明,性能有显著提升,HandwritingAgent匹配或超越了最先进的生成式手写模型,同时提供了一种更高效、可控且泛化能力更强的合成方法。

英文摘要

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script, not only across individuals but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

2606.18906 2026-06-18 cs.CV 新提交

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

发表机构 * Sookmyung Women’s University(淑明女子大学) Yonsei University(延世大学) Samsung Research(三星研究院)

AI总结 针对多目标图像编辑中的语义混合和对象重复问题,提出BindEdit方法,通过联合正则化交叉注意力和自注意力、交叉注意力重平衡机制及区域保真项,在单次扩散轨迹内抑制注意力泄漏,实现精确编辑。

Comments Preprint

详情
AI中文摘要

真实图像编辑能够精确操作视觉内容,但现有方法在复杂的多目标场景中常常失败,导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏,即在去噪过程中,跨空间区域和文本标记的信号变得纠缠。具体来说,我们识别出两种不同形式的泄漏:编辑-标记泄漏,其中模糊的标记-区域对齐导致对象混合;以及源主导泄漏,其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏,我们提出了BindEdit,它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏,BindEdit联合正则化交叉注意力和自注意力,使得每个目标标记组绑定到其对应的空间区域,同时保持实例级别的分离。为了抑制源主导泄漏,一种交叉注意力重平衡机制放大目标标记的影响,并减弱可编辑区域内残留的源语义。此外,区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外,我们提出了一个全面的多目标基准,涵盖不同的对象数量和类别。大量实验表明,BindEdit在单次扩散轨迹内始终优于现有方法,在单目标和多目标编辑场景中均保持稳健性能。

英文摘要

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.
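
The cross-attention re-balancing idea (amplify target tokens, attenuate source tokens inside the editable region) can be pictured with a very small sketch. The version below rescales one pixel's attention weights over text tokens and renormalizes them. The function name `rebalance`, the factors `amplify` and `attenuate`, and the renormalization step are all assumptions for illustration; the abstract does not give the paper's actual formulation.

```python
def rebalance(attn, target_idx, source_idx, amplify=2.0, attenuate=0.5):
    """Rescale one pixel's cross-attention weights and renormalize to sum to 1.

    attn       : attention weights over text tokens for a pixel inside the edit mask
    target_idx : indices of tokens describing the new (target) object
    source_idx : indices of tokens describing the original (source) object
    """
    scaled = list(attn)
    for i in target_idx:
        scaled[i] *= amplify      # boost the target concept
    for i in source_idx:
        scaled[i] *= attenuate    # damp residual source semantics
    total = sum(scaled)
    return [a / total for a in scaled]


# Toy weights (made-up): token 1 is the source object, token 2 the target object.
toy_attn = [0.1, 0.6, 0.2, 0.1]
rebalanced = rebalance(toy_attn, target_idx=[2], source_idx=[1])
print(rebalanced)
```

In the toy case the source token dominates before re-balancing and the target token dominates afterwards, which is the effect the Source Dominance Leakage fix is described as aiming for.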

2606.19073 2026-06-18 cs.CV 新提交

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑:认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

发表机构 * Wangxuan Institute of Computer Technology, Peking University, Beijing, China(王选计算机研究所,北京大学,北京,中国) National Institute of Health Data Science, Peking University, Beijing, China(国家健康数据科学研究院,北京大学,北京,中国)

AI总结 提出HOI-Edit基准和SCPE框架,利用I2V模型的时间生成能力进行动态人-物交互编辑,通过自校正提示迭代优化,实现与SOTA竞争的性能。

详情
AI中文摘要

当前的图像编辑方法在静态属性上表现出色,但在复杂的人-物交互(HOI)上失败。这一关键挑战尚未被现有基准解决:它们将HOI与静态属性混为一谈,并依赖无法同时评估动态交互有效性和纠缠的人-物对保留情况的全局指标。因此,我们首先引入HOI-Edit,一个包含三个渐进认知层次的综合基准,其自动化指标HOI-Eval让VLM先对包含已定位人-物对的图像进行推理,再回答相关问题,从而可靠地评估实例级交互。考虑到任务本质是重塑动态关系,我们对图像到视频(I2V)模型进行基准测试,发现它们由于其时间生成能力而天生适合动态编辑。关键的是,除了优越的性能,这种能力提供了“失败过程的重放”,为错误原因提供了独特的可诊断性。因此,我们提出SCPE(自校正过程编辑),一种新颖的智能体自校正框架,通过迭代优化的提示约束I2V模型的生成,使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上,SCPE在交互上达到了与最先进(SOTA)编辑模型(如Nano Banana)竞争的性能。代码可在 https://github.com/oceanflowlab/HOI-Edit 获取。

英文摘要

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI). Existing benchmarks leave this critical challenge unaddressed: they conflate HOI with static attributes and rely on global metrics that cannot simultaneously assess the validity of a dynamic interaction and the preservation of the entangled human-object pair. We therefore first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels. It features an automated metric, HOI-Eval, which reliably evaluates instance-level interaction by having a VLM reason over images containing grounded human-object pairs and then answer questions about them. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Extracted frames from these videos are the final editing results. On HOI-Edit, SCPE achieves performance competitive with state-of-the-art (SOTA) editing models like Nano Banana on interaction. Code is available at https://github.com/oceanflowlab/HOI-Edit.

2606.19103 2026-06-18 cs.CV cs.AI 新提交

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency:通过SFT和RL改进基于指令的图像编辑中的产品身份保持

Mukund Khanna, Raj Singh Yadav, Kunal Singh

发表机构 * Fractal Analytics

AI总结 针对基于指令的图像编辑中产品特征保持不足的问题,提出ProductConsistency数据集和循环一致性奖励,结合监督微调与强化学习,显著提升产品一致性、文本渲染和视觉质量。

Comments CVPR HiGen 2026

详情
AI中文摘要

近期基于指令的图像编辑的进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而,在以产品为中心的场景中,保留产品特征、品牌和文本元素至关重要,当前的开源和闭源模型往往难以维持这种细粒度的对象身份。这一问题因缺乏具有文本保真度约束的基于指令的产品图像编辑数据集而进一步加剧,导致该能力在很大程度上被视为基于指令的图像编辑模型的隐式能力。在这项工作中,我们引入了ProductConsistency数据集,旨在改进以产品为中心的图像编辑。我们的方法包括一个用于产品编辑的包含87k样本的监督微调(SFT)数据集、一个包含869张独特产品图像的强化学习(RL)数据集,以及一个新的基准数据集ProductConsistency Benchmark,以允许对编辑模型进行严格和标准化的评估。为了指导RL训练,我们提出了一种循环一致性奖励,通过使用原始产品描述与从编辑图像生成的描述之间的字幕相似性来强制保持产品身份的语义。我们使用我们的数据集对Qwen-Image-Edit-2511和Flux.1-Kontext-dev进行了微调,并在OCR和感知指标以及基于MLLM的评估中展示了相对于基线模型的一致改进,表明更强的产品一致性、文本渲染和整体视觉质量;其中Qwen-Image-Edit-2511模型实现了字符错误率降低5倍。代码和流程可在此https URL获取。

英文摘要

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality; the Qwen-Image-Edit-2511 model achieves a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
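
The character error rate mentioned above is conventionally the edit distance between the text recognized in the edited image and the reference text, divided by the reference length. The sketch below shows that standard definition. Whether the paper normalizes or tokenizes the OCR output differently is not stated in the abstract, so the helper names and the sample strings are illustrative assumptions.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Made-up example: label text on a product before and after an edit.
print(character_error_rate("ACME Cola", "ACME Cola"))
print(character_error_rate("ACME Cola", "ACNE Cola"))
```

A CER of 0 means the label text survived the edit character for character, which is why the metric is a natural fit for checking brand and text preservation.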

2606.19195 2026-06-18 cs.CV 新提交

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Moebius: 0.2B轻量级图像修复框架,性能达10B级别

Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) VIVO AI Lab(维沃人工智能实验室)

AI总结 提出Moebius轻量级图像修复框架,通过局部-λ混合交互模块和自适应多粒度蒸馏策略,以0.22B参数实现与10B级模型FLUX.1-Fill-Dev相当甚至更优的生成质量,推理速度提升15倍以上。

详情
AI中文摘要

尽管10B级别的工业基础模型推动了图像修复的边界,但其高昂的计算成本严重阻碍了实际部署。构建高度优化的任务特定专家模型是一个有前景的解决方案,然而极端的结构压缩不可避免地引发了严重的表示瓶颈。为解决这一问题,我们提出了Moebius,一个高效的轻量级修复框架。我们通过引入局部-λ混合交互(LλMI)模块系统地重构了扩散主干。该模块由局部-λ和交互-λ子模块组成,巧妙地将空间上下文和全局语义先验总结为固定大小的线性矩阵,在保留复杂潜在交互的同时大幅减少参数。此外,为了释放这种高度紧凑架构的全部表示能力,我们将其与自适应多粒度蒸馏策略协同配对。该策略严格在潜在空间内操作以避免昂贵的像素空间解码,动态平衡多个基于梯度的损失以实现高保真对齐。在自然和肖像基准上的大量实验表明,这种最优协同使Moebius能够媲美甚至超越10B级工业通用模型FLUX.1-Fill-Dev的生成质量。值得注意的是,Moebius仅使用不到2%的参数(0.22B vs. 11.9B)就实现了这一点,同时总推理时间加速超过15倍,为高保真修复设立了新的效率标准。项目页面:https://hustvl.github.io/Moebius

英文摘要

While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a >15x acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting. Project page at https://hustvl.github.io/Moebius.

7. 3D视觉、点云与空间智能 9 篇

2606.18429 2026-06-18 cs.CV cs.AI cs.LG 新提交

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

发表机构 * University at Albany(奥尔巴尼大学)

AI总结 提出CAOA方法,结合语义感知点云补全和对称感知相对位姿估计,在Scan2CAD上实现17%精度提升,并发布S2C-Completion数据集。

Comments GitHub: https://github.com/MinhasKamal/CAOA

详情
Journal ref
Thirteenth International Conference on 3D Vision (3DV), 2026
AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度(DoF)位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐(CAOA),该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合,实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估,往往难以泛化到真实扫描。为弥合这一差距,我们引入了一种针对室内场景的合成数据生成策略,通过与广泛使用的补全数据集进行定量比较,验证了其显著减小合成到真实领域差距的效果。此外,我们发布了S2C-Completion,一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集,用于真实室内单物体补全,并作为该任务的新基准。对于物体-CAD对齐,我们通过对称感知损失融入对称信息,提高了对对称模糊的鲁棒性。在Scan2CAD基准上,CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
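
A symmetry-aware loss is usually built so that poses which are indistinguishable under the object's symmetry incur no penalty. The sketch below shows the idea for the simplest case, rotation about the vertical axis with n-fold symmetry: the error is the smallest yaw difference over all symmetry-equivalent ground-truth orientations. Restricting to yaw and the function names are assumptions for illustration; the abstract does not specify the paper's loss.

```python
import math


def wrap_angle(a):
    """Map an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def symmetric_yaw_error(pred, gt, n_fold=1):
    """Smallest absolute yaw error over the n_fold rotations that leave the object unchanged.

    n_fold=1 means no symmetry, n_fold=2 suits e.g. a rectangular table,
    n_fold=4 a square table (hypothetical examples).
    """
    step = 2 * math.pi / n_fold
    return min(abs(wrap_angle(pred - gt - k * step)) for k in range(n_fold))


# A square table predicted 100 degrees away from the annotated yaw:
# without symmetry the error is 100 degrees, with 4-fold symmetry only 10.
pred_yaw, gt_yaw = math.radians(100), 0.0
print(math.degrees(symmetric_yaw_error(pred_yaw, gt_yaw, n_fold=1)))
print(math.degrees(symmetric_yaw_error(pred_yaw, gt_yaw, n_fold=4)))
```

Taking the minimum over equivalent poses keeps the optimizer from being penalized for choosing a different but visually identical orientation.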

2606.18439 2026-06-18 cs.CV cs.RO 新提交

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT:面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of California, Irvine(加利福尼亚大学尔湾分校) Nanyang Technological University(南洋理工大学)

AI总结 提出RegimeVGGT,通过逐层U形压缩(显著性引导带状合并与选择性保护K/V下采样)去除冗余,在保持重建质量的同时实现6.7倍加速。

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

详情
AI中文摘要

视觉几何基础Transformer(VGGT)通过一次前向传播从多视图图像恢复密集3D场景结构,但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算,忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域:浅层缺乏跨视图结构,中层驱动跨视图对齐,深层对密集几何是冗余的,但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩:显著性引导带状合并保护几何和边缘显著性令牌,而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练,RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

英文摘要

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18623 2026-06-18 cs.CV eess.IV 新提交

Intrinsic 4D Gaussian Segmentation from Scene Cues

内在4D高斯分割:基于场景线索

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

发表机构 * Istanbul Technical University(伊斯坦布尔理工大学) Texas A&M University(德克萨斯农工大学) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 提出Intrinsic-GS方法,无需训练和掩码,通过构建高斯原语的亲和图并利用社区检测实现4D场景分割,在Neu3D和HyperNeRF上达到与掩码监督方法相当的精度,且速度提升12.5倍。

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

详情
AI中文摘要

动态4D高斯泼溅以高保真度重建变形场景,并越来越多地被用作动态3D场景的表示。要利用此类场景进行编辑、操作或运动分析,首先需要对其进行分割:将高斯原语分组为连贯的对象。当前流程通过从基础模型(如SAM)导入2D掩码,并将其提升或蒸馏到高斯表示中来获得这种分组。在动态场景中,这些掩码必须在多个帧和视角中生成,成本高昂,并且所得分割可能强烈依赖于这些外部掩码的质量和一致性。我们探究能否从高斯本身恢复更多的对象级结构,并提出Intrinsic-GS,一种无需训练、无需掩码的方法,该方法根据外观、方向、尺度、变形轨迹和非学习渲染边界线索,在高斯原语上构建稀疏亲和图。该图通过Leiden社区检测进行划分,无需基础模型,也无需学习特征场。在标准的4D高斯分割基准Neu3D和HyperNeRF上,Intrinsic-GS在没有掩码监督的情况下恢复了大量的对象结构,在Neu3D上达到0.746 mIoU,在HyperNeRF上达到0.575;在Neu3D上,仅几何变体达到0.902 mIoU,与SAM监督的TRASE相当。在HyperNeRF上,Intrinsic-GS的运行速度比掩码监督流程中使用的掩码生成和特征渲染阶段快12.5倍。这些结果表明,大部分分割信号已经编码在高斯本身中,为3D和4D高斯分割提供了一种快速、无需掩码的方向,也可能指向在外部掩码不可靠或昂贵的情况下更可泛化、更鲁棒的分割。

英文摘要

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.
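
The pipeline described above (pairwise affinities between Gaussian primitives, a sparse graph, then a graph partition) can be pictured with a small sketch. The version below uses only position and color cues and replaces Leiden community detection with plain connected components over thresholded edges, since Leiden requires an external library. The affinity formula, the bandwidths `sigma_pos` and `sigma_col`, the threshold, and the toy primitives are all assumptions for illustration, not the paper's settings.

```python
import math


def affinity(a, b, sigma_pos=1.0, sigma_col=0.5):
    """Product of Gaussian kernels on position distance and color distance."""
    d_pos = sum((x - y) ** 2 for x, y in zip(a["pos"], b["pos"]))
    d_col = sum((x - y) ** 2 for x, y in zip(a["color"], b["color"]))
    return math.exp(-d_pos / sigma_pos ** 2) * math.exp(-d_col / sigma_col ** 2)


def build_sparse_graph(prims, threshold=0.3):
    """Keep only edges whose affinity reaches the threshold (this makes the graph sparse)."""
    edges = []
    for i in range(len(prims)):
        for j in range(i + 1, len(prims)):
            w = affinity(prims[i], prims[j])
            if w >= threshold:
                edges.append((i, j, w))
    return edges


def connected_components(n, edges):
    """Union-find partition; a simple stand-in for Leiden community detection."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in edges:
        parent[find(i)] = find(j)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Toy scene: two red primitives close together, two blue ones far away.
toy_prims = [
    {"pos": (0.0, 0.0, 0.0), "color": (1.0, 0.0, 0.0)},
    {"pos": (0.2, 0.0, 0.0), "color": (1.0, 0.0, 0.0)},
    {"pos": (3.0, 0.0, 0.0), "color": (0.0, 0.0, 1.0)},
    {"pos": (3.2, 0.0, 0.0), "color": (0.0, 0.0, 1.0)},
]
toy_labels = connected_components(len(toy_prims), build_sparse_graph(toy_prims))
print(toy_labels)
```

Connected components merge anything joined by a single strong edge, so they over-merge on real scenes; community detection methods such as Leiden are designed to split such weakly connected groups.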

2606.18787 2026-06-18 cs.CV 新提交

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

基于UDF的点云重建中的学习半径估计

Eito Ogawa, Hiroshi Watanabe

发表机构 * Graduate School of FSE, Waseda University, Tokyo, Japan(日本东京早稻田大学FSE研究生院)

AI总结 提出一种学习型逐查询半径选择器,预测连续支撑半径并插入冻结的LoSF-UDF骨干网络,通过抛物线插值获取离网目标半径进行训练,提高点云表面重建的细粒度精度。

详情
AI中文摘要

从点云进行表面重建对于消费级3D捕获(包括AR/VR和室内扫描)非常重要。局部补丁无符号距离场(UDF)方法轻量且可泛化,但其精度依赖于支撑半径,传统上半径是固定的或通过一维曲率启发式选择,无法捕捉异质局部几何。我们提出一种学习型逐查询半径选择器,预测连续支撑半径并插入冻结的LoSF-UDF骨干网络。该选择器使用通过抛物线插值从缓存的UDF误差曲线获得的离网目标半径进行训练。实验表明,该方法提高了细尺度重建精度。

英文摘要

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.

2606.18861 2026-06-18 cs.CV cs.AI 新提交

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微关节推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出KinemaForge管道,通过可微关节推理和能量一致性验证,从RGB-D序列联合估计部件形状、关节拓扑和参数,显著降低关节轴误差和仿真漂移。

详情
AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约:(i) 部件级几何重建与运动学参数估计分离,(ii) 恢复的模型常违反能量守恒等基本动态不变量,导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge,一种约束驱动管道,从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数,并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件:将关节-部件关联编码为软边的运动学约束图;通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器;以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上,KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%),从基于交互的Ditto基线的5.30度降至2.83度(-46.6%),在50秒滚动中长时仿真漂移比PARIS降低64%,初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
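
The joint-axis error reported above is an angle between the estimated and ground-truth axis directions, and the quoted percentages are relative reductions of its average. The sketch below computes both. Treating an axis and its negation as the same axis (sign invariance) is an assumption made here for illustration; the abstract does not say how the paper handles axis direction.

```python
import math


def axis_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two joint axes, ignoring the sign of the direction."""
    dot = sum(p * g for p, g in zip(pred_axis, gt_axis))
    norm_p = math.sqrt(sum(p * p for p in pred_axis))
    norm_g = math.sqrt(sum(g * g for g in gt_axis))
    # abs() makes v and -v equivalent; min() guards acos against rounding above 1.
    cos_angle = min(1.0, abs(dot) / (norm_p * norm_g))
    return math.degrees(math.acos(cos_angle))


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


print(axis_error_deg((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)))
print(axis_error_deg((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))

# Consistency check of the percentages quoted in the abstract.
print(round(100 * relative_reduction(4.52, 2.83), 1))
print(round(100 * relative_reduction(5.30, 2.83), 1))
```

The last two lines reproduce the 37.4% and 46.6% figures from the reported 4.52, 5.30, and 2.83 degree averages.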

2606.19019 2026-06-18 cs.CV 新提交

FlowObject: Flow Steering for Bridging Generative Priors and Reconstruction Fidelity

FlowObject: 流引导以桥接生成先验与重建保真度

Yuchen Rao, Xuqian Ren, Yinyu Nie, Sayan Deb Sarkar, Biao Zhang, Vincent Lepetit, Friedrich Fraundorfer

发表机构 * Graz University of Technology Austria(奥地利格拉茨理工大学) Tampere University Finland(芬兰坦佩雷大学) Technical University of Munich Germany(德国慕尼黑工业大学) Stanford University The United States of America(美国斯坦福大学) Xi’an Jiaotong University China(中国西安交通大学) École des Ponts ParisTech France(法国国立路桥学校)

AI总结 提出FlowObject框架,通过双空间引导策略驱动流匹配模型的ODE轨迹,在利用生成先验完成未观测区域的同时保持与真实观测的一致性,并集成3DGS细化阶段弥合生成输出与真实感重建的差距,显著提升几何完整性和视角相关外观保真度。

Comments Project page: https://yuchenrao.github.io/projects/flowObject/flowObject.html

详情
AI中文摘要

从少量随意拍摄的图像中恢复物体的完整3D表示仍然是一个重大挑战。最近的3D生成模型,特别是基于流匹配(Flow-Matching, FM)的模型,可以合成高质量的纹理资产;然而,它们常常遭受“合成偏差”,即学习到的先验覆盖了观测证据,同时缺乏与观测实例的对齐。相反,基于优化的方法如3D高斯泼溅(3DGS)在可见表面上提供高保真度,但无法推理未观测的几何结构。在本文中,我们提出了FlowObject,一个将稀疏视图3D重建重新表述为无训练、引导逆问题的框架。我们的方法采用双空间引导策略来驱动流匹配模型的常微分方程(ODE)轨迹,通过学习的生成先验完成未观测区域,同时强制与真实世界观测严格一致。通过集成3DGS细化阶段,FlowObject进一步弥合了“合成外观”生成输出与真实感重建之间的差距。在合成和真实世界数据集上的全面基准测试表明,当前最先进的方法通常难以同时实现几何完整性和观测一致性,尤其是在严重遮挡下。相比之下,我们的方法在几何完整性和视角相关外观保真度方面显著优于最先进的生成模型和基于优化的框架。

英文摘要

Recovering complete 3D representations of objects from few casual image captures remains a significant challenge. Recent 3D generative models, particularly those based on Flow-Matching (FM), can synthesize high-quality textured assets; however, they often suffer from "synthetic bias", where learned priors override observational evidence, alongside a lack of alignment with the observed instance. Conversely, optimization-based methods like 3D Gaussian Splatting (3DGS) provide high fidelity on visible surfaces but fail to reason about unobserved geometry. In this paper, we present FlowObject, a framework that reformulates sparse-view 3D reconstruction as a training-free, guided inverse problem. Our approach applies a dual-space guidance strategy to steer the Ordinary Differential Equation (ODE) trajectory of a flow-matching model, enabling the completion of unseen regions through learned generative priors while enforcing strict consistency with real-world observations. By integrating a 3DGS refinement stage, FlowObject further bridges the gap between "synthetic-looking" generative outputs and photorealistic reconstructions. Comprehensive benchmarks on synthetic and real-world datasets demonstrate that current state-of-the-art methods often struggle to achieve geometric completeness and observational consistency simultaneously, especially under severe occlusions. In contrast, our method significantly outperforms state-of-the-art generative models and optimization-based frameworks in both geometric completeness and view-dependent appearance fidelity.

2606.19156 2026-06-18 cs.CV 新提交

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Hand-4DGS: 用于从第一人称视频进行4D手部重建的前馈3D高斯泼溅方法

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

发表机构 * Yonsei University(延世大学) Electronics and Telecommunications Research Institute(韩国电子通信研究院) ETH Zurich(苏黎世联邦理工学院) Microsoft Spatial AI Lab(微软空间人工智能实验室) VGG, University of Oxford(牛津大学VGG实验室)

AI总结 提出Hand-4DGS,首个前馈框架,从第一人称视频直接重建动态4D手部,利用网格引导表示和时间卷积,实现快速推理和强泛化,无需3D真值标注。

Comments Project page: https://jeongminb.github.io/hand-4dgs/

详情
AI中文摘要

从第一人称视频进行动态3D手部重建对于下一代计算平台(如AR/VR和AI眼镜)至关重要。尽管其重要性,大多数先前工作要么关注多视角3D手部重建,要么关注4D人体重建。由于头部快速运动、手部快速动态、严重遮挡以及单视角观察固有的模糊性,第一人称4D手部重建仍然具有挑战性。为了解决这些挑战,我们引入了Hand-4DGS,这是第一个直接从第一人称视频重建动态4D手部的前馈框架,实现了快速(约60 FPS)推理和强泛化。我们的方法结合了用于结构先验的网格引导表示和用于建模动态运动的时间卷积。我们在两个具有挑战性的第一人称数据集H2O和ARCTIC上评估了我们的框架,并展示了相对于基线的显著改进。我们的方法受益于前馈网络的泛化能力以及通过高斯泼溅的有效2D图像监督,无需昂贵的3D手部姿态真值标注。

英文摘要

Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 新提交

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑工业大学) Huawei(华为)

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.
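
The placement step described above (unproject a patch with its depth and camera pose, then read off longitude and latitude from the canvas origin) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-to-world pose convention, and the axis convention (+z forward, +x right, +y up) are all hypothetical choices.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Pixel (u, v) with metric depth -> 3D world point.

    Assumes a pinhole camera and that R (3x3 nested lists) and t (length 3)
    are the camera-to-world rotation and translation.
    """
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    return tuple(
        sum(R[i][k] * p_cam[k] for k in range(3)) + t[i] for i in range(3)
    )


def to_canvas(p_world, origin, width, height):
    """World point -> continuous (x, y) on an equirectangular canvas, plus range.

    Assumed convention: +z is longitude 0 (canvas centre), +x is right, +y is up.
    """
    dx, dy, dz = (p - o for p, o in zip(p_world, origin))
    r = math.sqrt(dx * dx + dy * dy + dz * dz)
    if r == 0.0:
        raise ValueError("point coincides with the canvas origin")
    lon = math.atan2(dx, dz)                        # in (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, dy / r)))    # in [-pi/2, pi/2]
    x = (lon / (2.0 * math.pi) + 0.5) * width
    y = (0.5 - lat / math.pi) * height
    # r is the distance that the angular coordinate alone discards.
    return x, y, r


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = unproject(320, 240, 2.0, 500.0, 500.0, 320.0, 240.0, identity, (0.0, 0.0, 0.0))
x, y, r = to_canvas(point, (0.0, 0.0, 0.0), width=1024, height=512)
```

The returned range `r` is exactly the information lost by the angular mapping, which is why the abstract adds a 3D position embedding of the metric coordinates back onto each patch feature. Re-centering the canvas on another pose only changes `origin` (and, in a fuller version, the canvas orientation).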

2606.19316 2026-06-18 cs.CV 新提交

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

NeuMesh++:基于解耦神经网格隐式场的多功能高效体积编辑

Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

发表机构 * State Key Lab of CAD&CG, College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院计算机辅助设计与图形学国家重点实验室) Ant Research(蚂蚁研究院) Google(谷歌) ByteDance(字节跳动)

AI总结 提出一种基于网格顶点的解耦神经辐射场表示,实现几何、纹理和语义引导的高效体积编辑,包括网格引导几何编辑、纹理交换填充绘制及语义编辑。

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/

详情
AI中文摘要

近年来,神经隐式渲染技术迅速发展,在新视角合成和3D场景重建方面展现出显著优势。然而,现有的用于编辑目的的神经渲染方法功能有限,例如刚性变换和类别特定编辑。在本文中,我们提出了一种新颖的基于网格的表示方法,通过在网格顶点上编码解耦的几何、纹理和语义码来编码神经辐射场,从而实现一系列高效且全面的编辑功能,包括网格引导的几何编辑、通过纹理交换、填充和绘制操作进行的指定纹理编辑,以及语义引导的编辑。为此,我们开发了几种技术,包括一种新颖的局部空间参数化以提高渲染质量和训练稳定性,一种可学习的顶点修改颜色以提高纹理编辑的保真度,一种空间感知优化策略以实现精确的纹理编辑,以及一种语义辅助区域选择以减轻隐式场编辑的繁琐标注。在真实和合成数据集上的大量实验和编辑示例证明了我们的方法在表示质量和编辑能力上的优越性。项目页面:https://zju3dv.github.io/neumeshplusplus/

英文摘要

Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/

8. 医学影像与生物视觉 18 篇

2606.18609 2026-06-18 cs.CV 新提交

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

基于反证据验证的医学视觉语言模型幻觉检测与纠正

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

发表机构 * College of Computer Science, Sichuan University(四川大学计算机科学学院) Yong Loo Lin School of Medicine, National University of Singapore(新加坡国立大学杨潞龄医学院) Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University(四川大学数据保护与智能管理教育部重点实验室) National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统国家重点实验室) Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局高性能计算研究所)

AI总结 提出CoEV框架,通过文本与视觉证据的双向验证检测并纠正医学VLM幻觉,无需重新训练,在四个数据集上显著提升检测和纠正性能。

Comments MICCAI 2026 Accept. Submission Version

详情
AI中文摘要

视觉语言模型(VLM)在医学诊断中的可靠性受到幻觉的挑战,这削弱了信任。现有的幻觉检测方法主要关注识别生成文本与参考数据之间的事实不一致性。虽然一些研究分析了模型在图像中的注意力区域,但它们很少验证这种注意力是否真正反映了支持生成文本的视觉证据。为了解决这一差距,我们提出了反证据验证(CoEV),一个无需训练的即插即用框架,通过基于证据的事实一致性验证来检测和纠正幻觉。CoEV在文本断言和视觉证据之间执行双向验证,测试每个陈述是否得到其对应证据区域的支持,并将每个陈述分配到一个四象限诊断图中,该图捕获文本事实性和视觉基础性的组合。CoEV检测幻觉内容,并作为事后细化工具,无需重新训练即可纠正幻觉。在四个医学数据集上的大量实验表明,CoEV能够对抗幻觉。在幻觉检测方面,CoEV始终优于现有方法,平均PR-AUC和ROC-AUC分别提高了3.0和3.9个绝对百分点,在特定VQA场景中提升高达18.5%。在幻觉纠正方面,它将Micro-F1提高了高达12.5%,在医学报告生成中将幻觉率降低了超过11.9%,并提高了医学VQA的准确性。这些结果表明,CoEV能够可靠地检测和纠正幻觉,为临床医生提供可靠的、基于证据的诊断线索。代码将在接收后发布。

英文摘要

The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0 and 3.9 absolute percentage points respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

2606.18658 2026-06-18 cs.CV eess.IV 新提交

On-Manifold Variational Learning with Heat-Kernel Priors

基于热核先验的流形变分学习

Jiarui Xing, Tal Zeevi, Nian Wu, Jian Wang

发表机构 * Yale School of Medicine(耶鲁大学医学院) University of Virginia(弗吉尼亚大学) Harvard Medical School(哈佛医学院)

AI总结 提出一种流形锚定变分框架,利用几何感知EM算法选择热核加权潜图上的图中心点作为原型,确保原型在流形上,并通过Dirichlet能量正则化保持潜空间几何平滑,在心脏瘢痕和脑MRI基准上取得最高精度和清晰原型。

详情
AI中文摘要

学习医学影像队列的无监督表示可以揭示临床上有意义的原型,而无需专家标签,这些标签通常带有噪声且无法捕捉真实的病理异质性。然而,现有的深度潜变量模型通过欧几里得平均估计高斯混合先验,产生的原型会偏离弯曲的数据流形,并随着子种群数量的增加而退化。我们提出了一种流形锚定变分框架,基于几何感知的期望最大化(EM)算法,其M步骤选择每个子种群原型作为热核加权潜图上具有最高扩散中心性的图中心点,确保每个原型保持在流形上。Dirichlet能量正则化强制潜空间的几何平滑性,每个子种群的不确定性分数实现了无标签的质量评估。流形锚定EM是一种通用几何工具,扩展了标准EM,并易于应用于其他潜变量模型。在心脏瘢痕和脑MRI基准上,我们的框架在所有比较方法中取得了最高精度,产生了迄今为止最清晰的原型,并且在所有基线退化的较大子种群数量下保持稳定。

英文摘要

Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.
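
The contrast between a Euclidean-mean M-step and a medoid M-step can be illustrated on a toy curved dataset. The sketch below is an assumption-laden simplification: the abstract does not define "diffusion centrality", so it is approximated here by the responsibility-weighted sum of heat-kernel affinities to the other points, and all function names are hypothetical.

```python
import math


def heat_kernel_weight(p, q, t):
    """Euclidean heat-kernel affinity exp(-||p - q||^2 / (4 t))."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (4.0 * t))


def medoid_m_step(latents, resp, t=0.5):
    """Index of the data point with the largest responsibility-weighted
    affinity to the other points (a stand-in for diffusion centrality)."""
    best_idx, best_score = -1, -math.inf
    for i, p in enumerate(latents):
        score = sum(
            r * heat_kernel_weight(p, q, t)
            for j, (q, r) in enumerate(zip(latents, resp))
            if j != i
        )
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


def euclidean_m_step(latents, resp):
    """Standard responsibility-weighted mean."""
    total = sum(resp)
    dim = len(latents[0])
    return tuple(
        sum(r * p[k] for p, r in zip(latents, resp)) / total for k in range(dim)
    )


# Toy "manifold": five latent codes on the upper unit half-circle.
arc = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
       for a in (0, 45, 90, 135, 180)]
resp = [1.0] * len(arc)

mean = euclidean_m_step(arc, resp)
medoid_index = medoid_m_step(arc, resp)
medoid = arc[medoid_index]
```

On this toy arc the Euclidean mean lands inside the circle (norm about 0.48), off the curve on which the data lie, whereas the medoid is by construction one of the data points and so stays on it.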

2606.18675 2026-06-18 cs.CV 新提交

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

BrainFusionNet:一种用于理解MRI图像局部、全局和序列特征以改进脑肿瘤检测的深度学习与XAI模型

Md Taimur Ahad, Bo Song, Yan Li

发表机构 * School of Mathematics, Physics and Computing, University of Southern Queensland(南方昆士兰大学数学、物理与计算学院) School of Engineering, University of Southern Queensland(南方昆士兰大学工程学院)

AI总结 提出BrainFusionNet混合模型,结合CNN、ViT和GRU提取MRI空间、上下文和序列特征,并集成SHAP、LIME和GradCAM进行可解释性分析,在公开数据集上达到98%准确率,优于SOTA CNN。

详情
Journal ref
Brain Inf. 13, 21 (2026)
AI中文摘要

磁共振成像(MRI)的噪声给深度学习(DL)带来挑战,当肿瘤边界模糊、肿瘤位置和外观复杂时尤其如此。因此,我们开发了BrainFusionNet,它结合卷积神经网络(CNN)、视觉变换器(ViT)和门控循环单元(GRU),从MRI图像中提取空间、上下文和序列特征,以改进脑肿瘤分类。此外,集成了可解释AI(如SHAP、LIME和GradCAM),以可视化和突出显示有助于BrainFusionNet决策过程的图像区域。所提出的BrainFusionNet模型在两个公开MRI数据集上进行了评估,K折验证表明在两个数据集上准确率均达到98%。该模型与六种最先进的(SOTA)CNN和迁移学习进行了比较。在SOTA CNN中,DenseNet121和VGG16达到了96%的最高准确率。BrainFusionNet的新颖之处在于,该混合模型能够有效提取MRI图像的局部和全局特征,即使在小尺度肿瘤区域和肿瘤尺寸较小的情况下也是如此。该模型具有平衡的序列CNN架构,以捕获低层和深层特征;以及定制的ViT,可捕获局部特征、稳定梯度流并降低MRI图像训练期间梯度消失的风险。CNN和ViT的输出被馈送到GRU以进行最终分类。此外,我们分析像素强度以确定MRI图像质量是否影响图像分类。我们的发现在图像解释方面非常新颖,因为我们发现MRI图像中像素强度的分布会影响DL性能。

英文摘要

The noise of Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets; K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are very novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

2606.18682 2026-06-18 cs.CV 新提交

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

使用先进深度学习模型的多类脑肿瘤分类:一项比较研究

Asad Channa, Asghar Ali Chandio, Akhtar Hussain Jalbani, Mehwish Leghari, Shahzad Memon

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学计算机科学系) Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学人工智能系) The Faculty of Artificial Intelligence and Cyber Security, Universiti Teknikal Malaysia Melaka(马来西亚梅拉卡技术大学人工智能与网络安全学院) Department of Data Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学数据科学系) Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London(东伦敦大学建筑、计算与工程学院计算机科学与数字技术系)

AI总结 本研究比较五种CNN架构(包括定制模型和四种预训练模型)在约10,000张MRI图像上的多类脑肿瘤分类性能,发现EfficientNetB0以95%准确率最优,尤其显著提高了脑膜瘤的召回率(89%)。

详情
AI中文摘要

尽管深度学习最近取得了进展,但从MRI图像中准确分类脑肿瘤仍然面临挑战。在本研究中,我们对五种不同的卷积神经网络(CNN)架构进行了全面评估,包括一个定制的基线模型和四个预训练模型,用于使用临床来源的约10,000张MRI图像数据集对多类脑肿瘤进行分类。四个预训练架构为VGG16、VGG19、DenseNet121和EfficientNetB0;全部五个模型都在相同的实验框架内进行了训练和测试。性能通过总体准确率和各肿瘤类别的召回率来衡量,以评估每种架构的临床相关性能。我们发现,EfficientNetB0的整体分类准确率最高,为95%,优于VGG16(94.37%)、VGG19(92.29%)、DenseNet121(90.91%)和定制CNN(78.00%)。我们研究的一个特别重要的发现是,在检测脑膜瘤方面有显著改进;具体而言,简单的CNN检测脑膜瘤的召回率约为20%,而EfficientNetB0的召回率达到89%。脑膜瘤通常难以检测,因为它们在MRI图像上可能表现得非常微妙。此外,一个有趣的发现是,更深的VGG19性能不如较浅的VGG16。这表明,在处理医学图像时,CNN模型的架构效率可能比其深度更重要。总体而言,EfficientNetB0似乎在分类准确率、模型参数数量和临床有意义性能之间提供了最佳权衡。

英文摘要

Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, comprising a customized baseline model and four pre-trained models, for classifying multi-class brain tumors using a clinically sourced dataset of approximately 10,000 MRI images. The four pre-trained architectures are VGG16, VGG19, DenseNet121, and EfficientNetB0; all five models were trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall to capture the clinically relevant performance of each architecture. EfficientNetB0 had the best overall classification accuracy at 95%, ahead of VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding of our research was the considerable improvement in detecting meningiomas: while simple CNNs detected meningiomas with a recall of approximately 20%, EfficientNetB0 detected them with a recall of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, an interesting finding was that the deeper VGG19 performed worse than the shallower VGG16. This indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, number of model parameters, and clinically meaningful performance.
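
The abstract reports tumor-wise recall alongside overall accuracy. A short sketch of why both are needed follows; the labels and counts below are invented toy data, not results from the paper, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted samples of the class
    divided by all true samples of that class."""
    recall = {}
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recall[c] = hits / support
    return recall


# Toy data: 5 meningioma scans, 15 scans of other classes.
y_true = ["meningioma"] * 5 + ["other"] * 15
# A weak model that finds only 1 of the 5 meningiomas.
y_pred = ["meningioma"] * 1 + ["other"] * 4 + ["other"] * 15

overall = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
```

In this toy case the overall accuracy is 0.8 while meningioma recall is only 0.2, which is the kind of gap that a per-class metric exposes and an aggregate accuracy hides.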

2606.18707 2026-06-18 cs.CV 新提交

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

PEFT-MedSAM:面向可解释皮肤病变分割的医学基础模型高效微调

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学计算机科学系) Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology(夸迪-艾瓦姆工程、科学与技术大学人工智能系) Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi(信德马德雷萨图尔伊斯兰大学卡拉奇城市校区计算机科学系) Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London(东伦敦大学建筑、计算与工程学院计算机科学与数字技术系)

AI总结 提出参数高效微调方法PEFT-MedSAM,冻结预训练编码器仅训练轻量解码器,在ISIC 2018上达到0.9411 Dice系数,并通过Grad-CAM可解释性增强临床可信度。

详情
AI中文摘要

使用深度学习模型对皮肤镜图像进行皮肤病变自动分割,有助于比常规检测更早发现黑色素瘤。然而,大多数现有的深度学习方法性能不佳。本文旨在提出一种名为PEFT-MedSAM的参数高效微调方法,用于适配医学分割一切模型(MedSAM)以自动分割皮肤镜皮肤病变。PEFT-MedSAM方法仅使用轻量级掩码解码器训练模型,同时保持预训练图像编码器和提示编码器冻结。在ISIC 2018基准数据集上的实验表明,与完全训练的U-Net基线(0.8715 Dice系数)和零样本MedSAM推理(0.8997 Dice系数)相比,PEFT-MedSAM获得了0.9411的Dice系数和0.8918的交并比。使用PH2数据集进行的外部验证显示Dice系数为0.9467,标准差为±0.0310。这些主张的支持证据包括比较两个数据集的Wilcoxon符号秩检验p值小于0.0001,以及bootstrap估计的95%置信区间[0.9364, 0.9447],该区间表示重复测试获得的平均Dice系数的估计范围。为了增加临床可信度,我们使用Grad-CAM可解释性以及基于指向游戏的评估方法,在验证集上评估CNN基线模型。结果表明,在包含519张图像的验证集上,准确率达到98.27%,并确认模型正确分类了包含皮肤病变的区域。

英文摘要

Automated segmentation of skin lesions in dermoscopic images using deep learning models can help detect melanomas earlier than they would normally be found. However, most available deep learning methods do not perform well. This paper presents PEFT-MedSAM, a parameter-efficient fine-tuning method for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection-over-union of 0.8918, compared with a fully trained U-Net baseline (Dice 0.8715) and zero-shot MedSAM inference (Dice 0.8997). External validation on the PH2 dataset gives a Dice coefficient of 0.9467 with a standard deviation of ±0.0310. Supporting evidence includes a p-value below 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [0.9364, 0.9447] for the average Dice coefficient. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation to assess the CNN baseline model on the validation set. The results showed an accuracy of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
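
The two overlap metrics reported above, the Dice coefficient and intersection-over-union (IoU), can be computed from binary masks as in the sketch below. The masks are a made-up 3x3 toy example and the function name is hypothetical; this is not the paper's evaluation code.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two same-shaped 2D lists of 0/1 values."""
    inter = sum(p & t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    pred_sum = sum(map(sum, pred))
    target_sum = sum(map(sum, target))
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treat as perfect agreement.
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou


pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 0],
          [0, 0, 0]]

dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are tied by IoU = Dice / (2 - Dice). Plugging the reported mean Dice of 0.9411 into this identity gives roughly 0.889, slightly below the reported IoU of 0.8918; a small gap in this direction is expected if both figures are per-image averages, since the mapping is convex and so does not commute with averaging.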

2606.18723 2026-06-18 cs.CV cs.LG 新提交

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

发表机构 * AIM For Health Lab, Monash University(莫纳什大学AIM健康实验室) Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University(莫纳什大学信息技术学院数据科学与人工智能系) Monash University Victorian Heart Institute(莫纳什大学维多利亚心脏研究所) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) National Cerebral and Cardiovascular Center(国立循环器病研究中心) Department of Cardiology, Chonnam National University Hospital and Medical School(全南大学医院和医学院心脏病学系)

AI总结 提出GeoCat网络,通过双编码器与可微几何一致性损失,在IVUS分割中降低边界漂移和拓扑错误,提升临床几何测量精度。

Comments MICCAI2026 Accepted

详情
AI中文摘要

血管内超声(IVUS)管腔和外弹性膜(EEM)分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而,优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误,导致临床测量不准确。我们提出GeoCat,一个几何一致性网络,使用双笛卡尔-极坐标编码器,结合跨域注意力和时间融合,处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符,包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练,这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能,包括Dice/IoU、边界测量(95HD(mm)、ASSD)、拓扑违规率和临床几何误差(dmax/dmin、角度和面积)。在我们的数据集上,GeoCat实现了0.93的Dice,将95HD降低到0.14 mm,并将拓扑违规率降低到1.0%。重要的是,它显著提高了几何保真度,产生0.13-0.16 mm的直径误差和约8度的角度误差,支持可靠的斑块负荷量化。

英文摘要

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.
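
How contour errors propagate into plaque measurements can be seen from the standard definitions: plaque area is the EEM cross-sectional area minus the lumen area, and plaque burden is plaque area divided by EEM area. The sketch below uses the shoelace formula on toy square contours; the contours, the 0.1 mm perturbation, and the function names are hypothetical and not taken from the paper.

```python
def polygon_area(contour):
    """Shoelace formula; contour is a list of (x, y) vertices in mm."""
    n = len(contour)
    twice_area = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def plaque_metrics(lumen, eem):
    """Return (plaque area in mm^2, plaque burden as a fraction of EEM area)."""
    eem_area = polygon_area(eem)
    plaque_area = eem_area - polygon_area(lumen)
    return plaque_area, plaque_area / eem_area


# Toy contours: a 4 mm square EEM around a 2 mm square lumen.
eem = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
lumen = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
area, burden = plaque_metrics(lumen, eem)

# Hypothetical segmentation error: the lumen boundary drifts outward by 0.1 mm.
drifted = [(0.9, 0.9), (3.1, 0.9), (3.1, 3.1), (0.9, 3.1)]
area_drifted, burden_drifted = plaque_metrics(drifted, eem)
```

In the toy case the unperturbed plaque burden is 0.75, and the small outward drift of the lumen contour lowers it, even though an overlap score between the two lumen masks would remain high. This is the motivation the abstract gives for supervising geometric descriptors directly.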

2606.18749 2026-06-18 cs.CV 新提交

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的无训练零样本异常检测:基于批次的方法使用2D基础模型

Tai Le-Gia

发表机构 * Chungnam National University(忠南大学)

AI总结 提出CS3F框架,利用2D基础模型对3D医学图像进行零样本异常检测,通过沿多轴分解、切片编码和跨主体相似性计算异常分数,并引入粗到细的分词策略减少信号衰减。

详情
AI中文摘要

零样本异常检测(ZSAD)在医学成像中具有吸引力,因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的,它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F,一个无训练的基于批次的框架,用于3D医学图像中的ZSAD,使用2D基础模型。每个体积沿多个解剖轴分解,并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得:在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减,我们引入了一种粗到细的分词策略,无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估,并在肺部CT上验证其泛化能力,超越标准图谱对齐的脑部MRI。结果表明,冻结的2D基础模型可以支持3D医学图像中的异常定位,且细分词化的益处很大程度上取决于病灶对比度和成像模态。

英文摘要

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing ZSAD methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. The slice features are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, and is validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
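
The batch-based scoring idea (a token is anomalous if no other subject in the batch has a close analogue) can be sketched as below. This is a simplified stand-in, not the CS3F algorithm: it scores each token by one minus its best cosine similarity to any token from a different subject, whereas the paper's mutual-similarity definition and coarse-to-fine matching are not specified in the abstract. The feature vectors and names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_scores(batch):
    """batch: dict mapping subject id -> list of token feature vectors.

    Returns a dict with one score per token; higher means the token has
    no close analogue in any other subject of the batch.
    """
    scores = {}
    for sid, tokens in batch.items():
        others = [tok for oid, toks in batch.items() if oid != sid for tok in toks]
        scores[sid] = [
            1.0 - max(cosine(tok, other) for other in others) for tok in tokens
        ]
    return scores


# Toy batch: every subject has a "normal" token near [1, 0];
# subject s1 additionally has a token unlike anything in s2 or s3.
batch = {
    "s1": [[1.0, 0.0], [0.0, 1.0]],
    "s2": [[1.0, 0.0]],
    "s3": [[0.9, 0.1]],
}
scores = anomaly_scores(batch)
```

Here the second token of `s1` receives a high score because neither `s2` nor `s3` contains anything similar, while its first token scores near zero. The sketch needs at least two subjects in the batch and compares every token against every other, which is the exhaustive matching that the coarse-to-fine strategy in the abstract is meant to avoid.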

2606.18753 2026-06-18 cs.CV 新提交

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART:一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology(伊利诺伊理工学院) Amsterdam University Medical Center(阿姆斯特丹大学医学中心) University College London(伦敦大学学院)

AI总结 提出SMART框架,通过解耦全局疾病动态与患者特定解剖表现,学习连续疾病时间图谱,实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情
AI中文摘要

我们介绍了SMART,一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型,缺乏灵活性、限制可解释性,并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战,该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下,SMART通过区域特异性微分方程,沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移,个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集(ADNI-1/GO/2、OASIS-3、AIBL;>1300名受试者)上评估,SMART产生了解剖学上有意义的疾病进展预测,并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18825 2026-06-18 cs.CV 新提交

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

DreamReg:基于信念驱动的世界模型用于2D-3D超声配准

Luoyao Kang, Yuelin Zhang, Jiwei Shan, Haifan Gong, Qingpeng Ding, Shing Shin Cheng

发表机构 * T Stone Robotics Institute, The Chinese University of Hong Kong(香港中文大学T Stone机器人研究所) Multi-scale Medical Robotics Center(多尺度医疗机器人中心) Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院)

AI总结 提出DreamReg框架,将2D-3D超声配准建模为信念更新,通过世界模型模拟探头运动并整合想象结果,在CAMUS和u-RegPro数据集上实现鲁棒且准确的实时配准。

AI中文摘要

超声(US)广泛应用于手术导航,但由于部分可观测性、散斑噪声以及依赖于动作的US采集,术中2D切片与术前3D体积之间的实时配准仍然具有挑战性。现有方法是一次性的或短视的,难以随时间收集证据或捕捉外科医生如何根据屏幕反馈调整探头运动。我们提出DreamReg,一个基于信念驱动的世界模型框架,将2D-3D配准形式化为对刚性变换的信念更新。DreamReg维护一个潜在信念状态,总结过去的观测和位姿信息,并在新切片到达时通过学习到的动态不断细化变换。在训练期间,DreamReg暴露于模拟临床扫描行为的探头运动轨迹,并通过将位姿细化条件于当前US观测来学习更新其信念。在推理期间,DreamReg通过内部想象来细化配准:它展开学习到的世界模型以模拟候选探头运动及其预测的观测,并整合这些想象的结果以收敛到准确的刚性变换。在CAMUS和u-RegPro数据集上的实验表明,与最先进方法相比,DreamReg在实时引导中具有改进的鲁棒性和有竞争力的配准精度。

英文摘要

Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on the CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

2606.18860 2026-06-18 cs.CV cs.LG 新提交

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

医学图像分割中对抗模型的不确定性量化

Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(人工智能研究所、医学数据科学中心、维也纳医学大学,奥地利) Comprehensive Center for AI in Medicine, Medical University of Vienna, Austria(医学人工智能综合中心、维也纳医学大学,奥地利) ELLIS Unit Linz, LIT AI Lab and Institute for Machine Learning, Johannes Kepler University Linz, Austria(林茨ELLIS单位、LIT人工智能实验室和机器学习研究所、林茨约翰内斯·开普勒大学,奥地利) Institute for Machine Learning, Johannes Kepler University Linz, Austria(机器学习研究所、林茨约翰内斯·开普勒大学,奥地利) Clinical Research Center for Medical AI, Johannes Kepler University Linz, Austria(医学人工智能临床研究中心、林茨约翰内斯·开普勒大学,奥地利)

AI总结 提出QUAM-SM后处理框架,通过针对性对抗搜索识别脆弱像素,量化不确定性并分离认知与偶然不确定性,在公开数据集上优于现有方法。

Comments Accepted at MICCAI 2026

AI中文摘要

可靠的像素级不确定性量化具有通过实现高保真纵向监测和区分真实病理变化与伪影来改变临床工作流程的潜力。理想情况下,这些模型提供关键治疗计划和手术干预所需的稳定性。然而,标准深度学习模型常常遭受校准不良,产生过度自信的预测,掩盖了微妙病理边界处的潜在脆弱性。为了解决这个问题,我们提出了QUAM-SM,一种使用针对性对抗搜索来识别“对抗脆弱”像素的后处理框架。通过主动寻找暴露预测不稳定性的扰动,我们的方法突出了决策最容易被翻转的区域。重要的是,该框架将认知不确定性与偶然不确定性分离。在两个具有多个专家标注的公开数据集上的实验表明,QUAM-SM在可靠性和边界敏感性方面优于标准和最新的不确定性估计方法。代码可在以下网址获取:https://github.com/HanaJebril/quam_sm

英文摘要

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm

2606.18869 2026-06-18 cs.CV 新提交

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

学习扭曲:用于前列腺DWI校正的弱监督图像质量迁移

YuCheng Tang, Wen Yan, Alexander Ng, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, David Atkinson, Shonit Punwani, Daniel Alexander, Shaheer Ullah Saeed, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL Hawkes研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学学部) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

AI总结 提出弱监督图像质量迁移框架,利用图像质量评估信号从无失真图像学习生成真实失真,并训练校正模型,在PI-RADS和Gleason评分分类任务中优于现有无配对方法。

AI中文摘要

单次激发平面回波前列腺弥散加权成像(DWI)常因几何失真而复杂化,影响从这些图像中获得可靠诊断的能力。开发自动化校正方法面临缺乏配对的失真和未失真临床扫描的挑战。本文首先提出一种新颖的弱监督图像质量迁移(IQT)框架,从无失真图像到失真图像,利用图像质量评估(IQA)信号监督迁移过程。与传统方法需要昂贵的体素级配对数据或采用无配对算法不同,我们的方法利用图像级质量标签(此处为失真与无失真)在预训练特征空间中建立潜在质量原型。认识到模拟真实失真比直接无配对校正更可靠,我们描述了一种弱监督原型流匹配算法,显式正则化生成轨迹朝向失真原型,产生模拟临床退化的真实磁敏感伪影。通过合成这些真实配对,我们能够训练第二个IQT模型进行正向失真校正。实验结果表明,我们生成的图像成功模拟了真实伪影的诊断干扰,从而产生更强大的失真校正IQT模型。除定性比较外,我们还通过评估临床下游任务性能(PI-RADS和Gleason评分分类),使用分布内和外部数据集,将我们的方法与现有无配对方法(如CycleGAN、UNIT-DDPM和OT-FM)作为正向或反向替代方案进行详尽的定量评估。

英文摘要

Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

2606.18872 2026-06-18 cs.CV 新提交

Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

桥接单一失真伪影与多因素临床质量:基于失真训练的原型网络的少样本双参数MRI质量评估

Yuheng Tang, Alexander Ng, Wen Yan, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel Alexander, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL Hawkes研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学分会) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre of Medical Imaging, Division of Medicine(医学成像中心,医学分会) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

AI总结 提出一种少样本双参数原型网络,利用失真标签元训练,通过特征融合和域对齐,仅用5个样本即可预测PI-QUAL临床质量评分,解决临床数据稀缺问题。

AI中文摘要

临床前列腺多参数MRI高度依赖高质量扩散加权成像(DWI),但DWI读图常因几何失真(通常由直肠气体引起)而受损。通过PI-QUAL评分系统评估质量是新兴的临床标准,但该方法主观、耗时,且存在类别不平衡问题,其中低质量病例多样且相对稀少。以PRIME临床试验为例,6%的图像PI-QUAL评分低于4,87%的DWI问题源于失真,许多其他临床质量问题代表性不足。为解决这种标注临床数据的双重稀缺性,我们提出了一种用于自动图像质量评估(IQA)的少样本双参数原型网络。我们的框架利用双分支3D ResNet融合T2加权和DWI特征,提供解剖背景以区分真实形态与失真。为处理现实异质性,我们引入特征级线性调制(FiLM)和梯度反转层(GRL),以对齐基于不同b值的特征分布,同时抑制采集相关偏差。我们证明,仅基于相对客观、易于获取的失真标签进行元训练的模型,能够仅使用五个代表性样本有效适应预测复杂的多因素临床质量评分(如PI-QUAL)。在两个数据集上的实验结果表明,我们的方法在此具有挑战性的IQA任务中显著优于少样本学习基线,为临床工作流程中标准化前列腺MRI质量控制提供了实际可行且数据高效的解决方案。

英文摘要

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming, and suffers from a class imbalance in which low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, $6\%$ of images have PI-QUAL scores lower than 4, and $87\%$ of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.
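The prototypical-network idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes embeddings are already computed (the paper's dual-branch 3D ResNet, FiLM and GRL components are not modeled), and the names `class_prototypes` and `classify` are hypothetical helpers, not the authors' code.

```python
from collections import defaultdict


def class_prototypes(support):
    """support: list of (embedding, label) pairs; returns label -> mean embedding."""
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for label, embeddings in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is nearest in squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Toy 2-D embeddings standing in for fused T2-weighted/DWI features (assumption).
support_set = [
    ([0.0, 0.0], "low_quality"),
    ([0.2, 0.0], "low_quality"),
    ([1.0, 1.0], "high_quality"),
    ([1.0, 0.8], "high_quality"),
]
prototypes = class_prototypes(support_set)
print(classify([0.9, 0.9], prototypes))
```

In a few-shot setting the support set would hold only a handful of labelled scans per quality class, which is why the class mean is used as the decision reference.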

2606.18876 2026-06-18 cs.CV cs.LG 新提交

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

光学相干断层扫描中基于轨迹对齐的时间无关流的测试时自适应

Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(人工智能研究所、医学数据科学中心、维也纳医学大学,奥地利) Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria(医学人工智能综合中心、维也纳医学大学,奥地利) Department of Ophthalmology and Optometry, Medical University of Vienna, Austria(眼科与视光学部、维也纳医学大学,奥地利) Laboratory for Ophthalmic Image Analysis, Medical University of Vienna, Austria(眼科图像分析实验室、维也纳医学大学,奥地利)

AI总结 提出一种基于流匹配的测试时自适应方法,通过直方图匹配和去除时间条件,生成高质量替代图像,在AMD分割中达到最优性能。

Comments Accepted in MICCAI

AI中文摘要

光学相干断层扫描(OCT)在眼科中至关重要,但图像质量不一致,尤其是在低成本设备中,阻碍了自动化分析。为了解决这个问题,我们引入了一种基于流匹配的测试时自适应方法,从噪声输入生成高质量替代图像。通常,测试数据和训练数据之间的域差距会导致去噪过程中像素分布不匹配。我们通过将测试图像的直方图与合成参考轨迹匹配来克服这一问题,成功地将输入与预期分布对齐。此外,我们移除了网络的时间条件,以考虑真实世界噪声分布的轻微偏差。我们的方法在分割年龄相关性黄斑变性(AMD)两个阶段的关键生物标志物方面达到了最先进的性能。代码地址:https://github.com/Veit21/tta-flow

英文摘要

Optical coherence tomography (OCT) is essential in ophthalmology, but inconsistent image quality, especially in low-cost devices, hinders automated analysis. To address this, we introduce a flow-matching-based test-time adaptation method that generates high-quality surrogate images from noisy inputs. Typically, domain gaps between test and training data cause pixel distribution mismatches during the denoising process. We overcome this by matching the test image's histogram to synthetic reference trajectories, successfully aligning the input with expected distributions. Additionally, we remove the network's time conditioning to account for slight deviations in real-world noise distributions. Our approach achieves state-of-the-art performance in segmenting critical biomarkers for two stages of Age-related Macular Degeneration (AMD). Code is available at https://github.com/Veit21/tta-flow.
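Histogram matching, which the abstract uses to align test images with reference distributions, can be sketched in a few lines. This is a generic rank-based version on flat intensity lists, assuming a single reference sample rather than the paper's synthetic reference trajectories; `match_histogram` is a hypothetical name.

```python
def match_histogram(source, reference):
    """Map each source intensity to the reference intensity at the same rank quantile."""
    if not source or not reference:
        raise ValueError("both inputs must be non-empty")
    ref_sorted = sorted(reference)
    # Indices of the source pixels ordered from darkest to brightest.
    order = sorted(range(len(source)), key=lambda i: source[i])
    matched = [0] * len(source)
    for rank, pixel_index in enumerate(order):
        # Rescale the rank so source and reference may differ in length.
        ref_position = rank * len(ref_sorted) // len(source)
        matched[pixel_index] = ref_sorted[ref_position]
    return matched


noisy = [10, 30, 20, 40]
reference = [100, 200, 300, 400]
print(match_histogram(noisy, reference))
```

Because the mapping depends only on ranks, the output keeps the brightness ordering of the input while taking on the reference intensity distribution. Tied source values are split arbitrarily in this sketch; a production version would typically interpolate cumulative distributions instead.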

2606.18886 2026-06-18 cs.CV 新提交

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D:通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出两阶段渐进框架DINO-Med3D,通过多切片嵌入模块、3D适配器和并行细节恢复流,将DINOv3适配到3D医学分割,在五个数据集上超越现有方法。

Comments Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

AI中文摘要

尽管DINOv3在自然图像中展现了显著的语义判别能力,但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题,我们提出DINO-Med3D,一个两阶段渐进框架,将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段,我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距,同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后,我们通过在冻结的主干中添加轻量级3D适配器来增强体理解,以强制执行全局切片间连续性。最后,为补偿嵌入过程中固有的空间信息损失,我们设计了一个并行细节恢复流,以显式保留高频边界线索。在五个公共数据集上的大量实验表明,我们的方法成功地将DINOv3适应到医学领域,并显著优于最先进的基线方法。

英文摘要

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

2606.18894 2026-06-18 cs.CV 新提交

Automatic ply-specific analyses of CFRP micrographs using shortest-path-based ply distinction

基于最短路径的碳纤维增强聚合物显微图像自动铺层分析

Jonas Naumann, Jonas P. Appels, Julius Biermann, Christopher Gorsky, Timo de Wolff, Christoph Brauer

发表机构 * German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Lightweight Systems(轻质系统研究所) Composite Process Technologies(复合材料加工技术) Institute of Analysis and Algebra(分析与代数研究所)

AI总结 提出一种自动方法,通过将语义分割掩码视为图并应用最短路径算法区分铺层实例,实现高分辨率CFRP显微图像的铺层分割与定量分析。

AI中文摘要

我们提出了一种自动方法,用于在高分辨率碳纤维增强聚合物显微图像的语义分割掩码中区分铺层实例。将分割掩码解释为以像素为顶点的图,使我们能够使用最短路径算法生成铺层分隔路径。从而,我们利用全局信息弥合了语义分割和铺层实例分割之间的差距。我们成功地将该方法应用于具有广泛特征的高分辨率显微图像,例如单层或多层中人为添加的间隙、不同的堆叠顺序以及贯穿铺层的裂纹。基于计算出的路径将每个纤维像素分配给一个铺层,可以对其微观结构特性(如局部纤维体积分数以及局部分辨的铺层和中间层厚度)进行全面的定量铺层分析。这些见解有助于揭示制造引起的不均匀性,得出关于制造参数的结论,并将力学性能与潜在的微观结构缺陷联系起来。

英文摘要

We present an automated approach to distinguish between ply instances in semantic segmentation masks of high-resolution carbon-fiber-reinforced polymer micrographs. Interpreting the segmentation mask as a graph with pixels as vertices enables us to use a shortest-path algorithm that yields the ply-separating paths. We thereby bridge the gap between semantic segmentation and ply instance segmentation using global information. We successfully apply our approach to high-resolution micrographs featuring a broad range of characteristics, such as artificially added gaps in single or multiple plies, different stacking sequences, and ply-traversing cracks. Assigning each fiber pixel to a ply based on the calculated paths allows for a comprehensive, quantitative ply analysis with respect to microstructural properties such as the local fiber volume fraction and the locally resolved ply and interleaf layer thickness. These insights help to reveal manufacturing-induced inhomogeneities, draw conclusions on manufacturing parameters, and link mechanical properties to underlying microstructural imperfections.
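The pixel-graph formulation can be sketched with Dijkstra's algorithm on a binary mask. The cost model (fiber pixels expensive, matrix pixels cheap), the left-to-right direction, and the function name `separating_path` are assumptions made for illustration; the abstract does not specify the authors' edge weights.

```python
import heapq


def separating_path(mask, fiber_cost=100.0, matrix_cost=1.0):
    """mask: rows of 0/1 with 1 = fiber. Returns the cheapest left-to-right pixel path."""
    rows, cols = len(mask), len(mask[0])

    def pixel_cost(r, c):
        return fiber_cost if mask[r][c] else matrix_cost

    dist, prev, heap = {}, {}, []
    # Every pixel of the left column is a possible starting vertex.
    for r in range(rows):
        dist[(r, 0)] = pixel_cost(r, 0)
        heapq.heappush(heap, (dist[(r, 0)], r, 0))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist.get((r, c), float("inf")):
            continue  # stale queue entry
        if c == cols - 1:
            # First right-column pixel popped is the cheapest; walk back to the start.
            path = [(r, c)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return path[::-1]
        # Moves: up, down, right, and the two right-leaning diagonals.
        for dr, dc in ((-1, 0), (1, 0), (0, 1), (-1, 1), (1, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + pixel_cost(nr, nc)
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (nd, nr, nc))
    return []


toy_mask = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
print(separating_path(toy_mask))
```

Because the path cost is accumulated over the whole image width, the result follows the resin-rich gap globally rather than reacting to local noise, which is the role the abstract attributes to "global information".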

2606.19215 2026-06-18 cs.CV 新提交

GUMP-Net: An interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation

GUMP-Net: 一种用于多类盆腔分割的可解释模型-数据驱动智能算法

Liheng Wang, Yinghui Zhang, Licheng Zhang, Hailin Xu, Qiyong Cao, Chong Chen

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院数学科学国家重点实验室) University of Chinese Academy of Sciences(中国科学院大学) Department of Orthopedics, The Fourth Medical Center of Chinese PLA General Hospital(中国人民解放军总医院第四医学中心骨科) National Clinical Research Center for Orthopedics, Sports Medicine and Rehabilitation(国家骨科与运动康复临床医学研究中心) Department of Trauma and Orthopedics, People’s Hospital Peking University(北京大学人民医院创伤骨科) Department of Orthopedics and Traumatology, Beijing Jishuitan Hospital, Capital Medical University(首都医科大学附属北京积水潭医院骨科)

AI总结 提出GUMP-Net,结合改进测地线活动轮廓模型与深度神经网络,实现多类盆腔分割,在小训练数据下表现更优,并提供可解释几何视角。

Comments 26 pages, 8 figures, 3 tables

AI中文摘要

盆腔分割是盆腔骨折精准智能诊疗及手术规划导航中最重要和基础的研究问题之一。通过将改进的测地线活动轮廓模型与深度神经网络相结合,我们提出了GUMP-Net,一种用于多类盆腔分割的可解释模型-数据驱动智能算法,其中设计了三个网络模块共同构成整体分割框架:用于自动水平集初始化的目标检测模块、用于学习解剖感知边缘检测函数的边缘检测器模块以及用于深度水平集演化的迭代模块。利用水平集表示和深度学习的优势,GUMP-Net在分割性能上比最先进的方法更准确、鲁棒和一致,尤其是在小训练数据情况下。在盆腔数据集上的大量实验证明了所提算法的合理性和有效性。扩展到踝关节数据集的进一步实验表明其对其他解剖结构具有更广泛的应用。所提算法不仅为复杂骨折复位提供了高效的分割方法,而且为理解深度学习分割提供了可解释的几何视角。

英文摘要

Pelvic segmentation is one of the most important and fundamental research problems in precise and intelligent diagnosis and treatment, as well as in surgical planning and navigation for pelvic fractures. By combining an improved geodesic active contour model with deep neural networks, we propose GUMP-Net, an interpretable model-data-driven intelligent algorithm for multi-class pelvic segmentation. Three network modules together constitute the overall segmentation framework: an object detection module for automatic level set initialization, an edge detector module for learning an anatomy-aware edge detector function, and an iteration module for deep level set evolution. Leveraging the advantages of level set representation and deep learning, GUMP-Net shows more accurate, robust, and consistent segmentation performance than state-of-the-art methods, especially when training data are scarce. Extensive experiments on pelvic datasets demonstrate the rationality and effectiveness of the proposed algorithm. Further experiments on an ankle dataset indicate broader applicability to other anatomies. The proposed algorithm not only provides an efficient segmentation method for complex fracture reduction, but also gives an interpretable geometric perspective for understanding deep learning segmentation.

2606.19300 2026-06-18 cs.CV cs.LG 新提交

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

置信度不等于可靠性:重新思考脑肿瘤分割中的MC Dropout

Xin Ci Wong, Duygu Sarikaya, Kieran Zucker, Marc De Kamps, Nishant Ravikumar

发表机构 * Centre for Doctoral Training in AI for Medical Diagnosis and Care, School of Computing, University of Leeds(利兹大学计算机学院人工智能医学诊断与护理博士培训中心) School of Computer Science, University of Leeds(利兹大学计算机科学学院)

AI总结 通过MC Dropout不确定性估计,发现全局不确定性-误差对齐(AUROC≈0.97)可能掩盖关键子区域(如增强肿瘤)的严重误校准(ECE=0.915),表明子区域校准评估对临床安全至关重要。

Comments Accepted for MIUA2016

AI中文摘要

多参数MRI中的胶质瘤分割是治疗计划的关键组成部分。一个在治疗关键子区域上静默失败的分割模型会带来患者安全风险,而Dice分数等基于重叠的指标无法暴露这种风险。我们探究通过蒙特卡洛(MC)Dropout进行的体素级不确定性估计能否可靠地识别临床关键子区域中的分割错误,以及校准失败模式是否仅从标准报告指标中可检测。在126名BraTS21患者的两模型实证案例研究中,我们评估了高性能预训练SegResNet和本地训练的带有残差单元的UNet(UNet-Res)。MC dropout保持了分割准确性($|\Delta \text{Dice}|$ $<0.01$),同时实现了强不确定性-误差对齐(熵(H)的AUROC $\approx$0.97),表明不确定性正确地将错误体素排在正确体素之上。基于熵的患者分层识别出一个高不确定性亚组,其分割性能显著较低(全肿瘤Dice中位数$0.835$ vs. $0.925$),支持不确定性作为实用的分诊信号。然而,全局对齐可能掩盖重要的区域特异性差异。尽管AUROC相似,UNet-Res在增强肿瘤熵上接近零($0.054$),期望校准误差(ECE)为$0.915$,Dice仅为$0.714$,表明在最临床关键子区域上置信度严重误校准,这是标准Dice和AUROC报告无法发现的失败模式。这些发现表明,强不确定性-误差对齐对于临床安全是必要但不充分的:在选择临床部署模型时,子区域特异性校准评估必须伴随AUROC评估。

英文摘要

Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC Dropout preserved segmentation accuracy ($|\Delta \text{Dice}|$ $<0.01$) while achieving strong uncertainty-error alignment (AUROC for entropy (H) $\approx$0.97), indicating that uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice $0.835$ vs. $0.925$), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy ($0.054$) and Expected Calibration Error (ECE) of $0.915$, with a Dice of only $0.714$, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.
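The two quantities contrasted in the abstract, predictive entropy and Expected Calibration Error, can be sketched for a binary foreground/background case. This is an illustrative version with equal-width bins and hypothetical function names; the paper's exact binning and multi-class handling are not stated in the abstract.

```python
import math


def predictive_entropy(mc_probs):
    """mc_probs: foreground probabilities from T stochastic (dropout) passes for one voxel."""
    p = sum(mc_probs) / len(mc_probs)
    # Binary entropy in nats; terms with zero probability contribute nothing.
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_confidence = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(accuracy - mean_confidence)
    return ece


# Confidently wrong voxels: entropy is low, yet calibration error is large.
print(predictive_entropy([0.99, 0.99, 0.99]))
print(expected_calibration_error([0.99] * 4, [False] * 4))
```

The toy case mirrors the failure mode described above: a model can be nearly certain (entropy close to zero) on voxels it labels incorrectly, which only a calibration metric such as ECE exposes.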

2606.15604 2026-06-18 eess.IV cs.CV 新提交

Parameter-Efficient Adaptation of SAM 3 for Automated ITV Generation from 4DCT Images

基于参数高效微调SAM 3从4DCT图像自动生成内靶区

Changwoo Song

发表机构 * Oncosoft Inc.(Oncosoft公司) Department of Computer Science & Engineering, Chungnam National University(忠南大学计算机科学与工程系)

AI总结 提出轻量框架,通过LoRA参数高效微调SAM 3,结合硬负样本挖掘和相位相干滤波,仅用7个标注体数据实现高精度内靶区自动生成,中位Dice达0.968。

Comments 10 pages, 4 figures, 2 tables

AI中文摘要

四维计算机断层扫描(4DCT)捕获了胸部解剖结构的完整呼吸周期,然而当前的内靶区勾画流程孤立处理每个相位,丢弃了时间相干性,使轮廓易受相位特定伪影影响。我们提出一个轻量框架,通过低秩适应(LoRA)对Segment Anything Model 3(SAM 3)进行参数高效微调,仅使用七个标注的3D CT体数据,将其文本提示分割与医学领域对齐。此外,该框架结合了硬负样本挖掘策略,以改善低对比度胸部区域的边界判别。在推理时,通过相位相干时间滤波和空间连通性分析细化逐相位预测。由于呼吸运动是连续且周期性的,真实解剖结构出现在连续的相位块中,而瞬态伪影零星出现,因此被有效抑制。在肺部和心脏结构上的实验分别产生中位Dice分数0.968和0.910,95百分位Hausdorff距离分别为0.998 mm和2.931 mm。所提框架有效消除了未适应SAM 3零样本推理中固有的严重假阳性预测。仅用七个标注体数据,框架保留了超过95%的全数据准确率,且整个流水线可在单个消费级GPU上训练,展示了自适应放疗中可扩展、数据高效的解决方案。

英文摘要

Four-dimensional computed tomography (4DCT) captures the full respiratory cycle of thoracic anatomy, yet current Internal Target Volume contouring workflows process each phase in isolation, discarding temporal coherence and leaving contours vulnerable to phase-specific artifacts. We present a lightweight framework that applies parameter-efficient fine-tuning to the Segment Anything Model 3 (SAM 3) via low-rank adaptation (LoRA) to align its text-prompted segmentation with the medical domain using only seven annotated 3D CT volumes. Furthermore, the framework incorporates a hard negative mining strategy to improve boundary discrimination in low-contrast thoracic regions. At inference, phase-wise predictions are refined through phase-coherent temporal filtering and spatial connectivity analysis. Since respiratory motion is continuous and periodic, genuine anatomy appears in contiguous blocks of phases, whereas transient artifacts appear sporadically and are thus effectively suppressed. Experiments on pulmonary and cardiac structures yield median Dice scores of 0.968 and 0.910 with 95th-percentile Hausdorff distances of 0.998 mm and 2.931 mm, respectively. The proposed framework effectively eliminates the severe false-positive predictions inherent in the zero-shot inference of the unadapted SAM 3. With only seven annotated volumes, the framework retains over 95% of full-data accuracy, and the entire pipeline is trainable on a single consumer-grade GPU, demonstrating a scalable, data-efficient solution for adaptive radiotherapy.
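The phase-coherence argument (genuine anatomy persists over contiguous respiratory phases, artifacts appear sporadically) can be sketched as a cyclic run-length filter on one voxel's per-phase detections. The minimum run length of 3 and the name `phase_coherent_filter` are assumptions; the abstract does not give the authors' filter parameters.

```python
def phase_coherent_filter(presence, min_run=3):
    """Keep detections only inside cyclic runs of at least min_run consecutive phases."""
    flags = [bool(p) for p in presence]
    n = len(flags)
    if all(flags):
        return flags  # present in every phase: trivially coherent
    # Start scanning right after a gap so that no run wraps around the scan start.
    start = flags.index(False)
    kept = [False] * n
    run = []
    for offset in range(1, n + 1):
        i = (start + offset) % n
        if flags[i]:
            run.append(i)
        else:
            if len(run) >= min_run:
                for j in run:
                    kept[j] = True
            run = []
    return kept


# Ten respiratory phases: phases 8, 9, 0, 1 form one cyclic block; phase 4 is isolated.
detections = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
print(phase_coherent_filter(detections))
```

Treating the phase axis as cyclic matters because the respiratory cycle is periodic: a structure visible in the last and first phases belongs to one contiguous block.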

9. 文档图像、OCR与图表理解 5 篇

2606.18721 2026-06-18 cs.CV 新提交

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

重新思考表格结构识别中的指针损失:面向空间局部性的几何感知指针损失

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

发表机构 * Teamreboott Inc.(Teamreboott公司)

AI总结 针对指针网络在表格结构识别中相邻单元格错误占79.6%的问题,提出几何感知指针损失,通过反距离加权重写交叉熵目标,聚焦邻近单元格梯度,在不增加推理成本下提升性能。

AI中文摘要

使用指针网络的表格结构识别(TSR)通过预测HTML序列同时将标签与检测到的文本(或单元格)区域对齐,取得了令人印象深刻的结果。然而,我们的分析揭示,当指针网络失败时,79.6%的错误发生在空间相邻的单元格之间(曼哈顿距离<=2)。尽管如此,标准交叉熵损失对所有负候选样本赋予相同权重。在这项工作中,我们提出了几何感知指针(GAP)损失,它根据与真实值的空间邻近性重新加权交叉熵目标。通过应用反距离加权,GAP将梯度流集中在模型最困难的区域:相邻单元格比远处单元格获得更强的梯度。我们的方法仅需对损失计算进行简单修改,保持相同的模型架构且零额外推理成本。在PubTabNet和SynthTabNet上的大量实验表明,GAP持续减少相邻单元格错误,达到了新的最先进性能。我们的发现表明,在损失层面融入几何归纳偏置为鲁棒TSR提供了一种简单而有效的方法。我们的代码可在以下网址获取:https://github.com/teamreboott/GAP

英文摘要

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
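One way to realize inverse-distance reweighting of a pointer cross-entropy is sketched below. The specific form (negatives weighted by 1 + alpha / Manhattan distance inside the softmax denominator) is an assumption chosen for illustration, not the published GAP formula; see the authors' repository for the actual loss. `gap_style_loss` and `alpha` are hypothetical names.

```python
import math


def gap_style_loss(logits, cells, target, alpha=1.0):
    """logits[j]: pointer score of candidate j; cells[j]: (row, col) grid position."""
    target_row, target_col = cells[target]
    terms = []
    for j, (score, (row, col)) in enumerate(zip(logits, cells)):
        if j == target:
            weight = 1.0
        else:
            distance = abs(row - target_row) + abs(col - target_col)
            # Closer negatives get a larger weight, hence a larger gradient share.
            weight = 1.0 + alpha / max(distance, 1)
        terms.append(math.log(weight) + score)
    # Numerically stable log-sum-exp over the weighted candidates.
    peak = max(terms)
    log_partition = peak + math.log(sum(math.exp(t - peak) for t in terms))
    return log_partition - logits[target]


scores = [2.0, 1.0, 0.0]
near_confuser = [(0, 0), (0, 1), (5, 5)]   # the stronger negative is adjacent
far_confuser = [(0, 0), (5, 5), (0, 1)]    # the stronger negative is far away
print(gap_style_loss(scores, near_confuser, target=0))
print(gap_style_loss(scores, far_confuser, target=0))
```

With `alpha=0` the expression reduces to ordinary cross-entropy, so the reweighting changes only training and leaves inference untouched, consistent with the zero-inference-cost property claimed in the abstract.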

2606.18793 2026-06-18 cs.CV 新提交

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

模糊几何分支点建模用于结构感知的手写汉字增强

Dongbin Jiao, Yibo Lyu, Qiulu Wei, Fuxiang Lu, Shengcai Liu, Shi Yan

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系广东省类脑智能计算重点实验室)

AI总结 针对手写汉字增强中数据稀缺和结构失真问题,提出基于模糊几何的结构感知增强框架,通过模糊集建模分支点并优化,结合贝塞尔重建与多策略扰动生成样本,显著降低字错误率。

AI中文摘要

数据稀缺和结构失真严重限制了高安全性认证中的手写识别。现有的增强方法常导致拓扑和形态损伤,尤其在处理复杂汉字时,笔画交叉、连笔和急转弯使传统分支点检测不可靠。为此,本文提出一种模糊几何驱动的结构感知(FGSA)增强框架。我们将分支点建模为骨架空间中的模糊集,通过整合拓扑邻域证据和方向场散度,构建连续的分支点隶属度场。该隶属度场通过无监督代理目标自适应优化,实现无需人工标注的鲁棒笔画解耦。最后,通过参数化三次贝塞尔重建和多策略扰动合成运动学对齐样本,确保结构保真度与样本多样性之间的平衡。此外,我们建立了LZUSig,一个专门针对中文手写签名细粒度结构退化的大规模高挑战性数据集。在CASIA-HWDB1.1、ChiSig和LZUSig上的大量实验表明,FGSA显著降低了字错误率(ΔWER),在对比基线中取得了最优识别增益。更重要的是,它在任务增益、结构保真度和判别特征保留之间实现了稳健的权衡,为手写增强提供了一种高度可控的解决方案。

英文摘要

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($\Delta$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.
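The cubic Bézier reconstruction step can be illustrated with a short sketch that evaluates a stroke segment and jitters only its interior control points. The uniform jitter and the helper names are assumptions; the paper's multi-strategy perturbations and branch-point membership field are not modeled here.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a 2-D cubic Bézier curve at t in [0, 1] using the Bernstein form."""
    u = 1.0 - t
    coefficients = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    return tuple(
        sum(w * point[axis] for w, point in zip(coefficients, (p0, p1, p2, p3)))
        for axis in range(2)
    )


def perturbed_stroke(p0, p1, p2, p3, jitter=2.0, samples=20, rng=None):
    """Jitter the two interior control points so the stroke endpoints stay fixed."""
    rng = rng or random.Random()
    q1 = tuple(v + rng.uniform(-jitter, jitter) for v in p1)
    q2 = tuple(v + rng.uniform(-jitter, jitter) for v in p2)
    return [cubic_bezier(p0, q1, q2, p3, i / (samples - 1)) for i in range(samples)]


stroke = perturbed_stroke((0, 0), (1, 2), (3, 2), (4, 0), rng=random.Random(0))
print(stroke[0], stroke[-1])
```

Keeping the endpoints fixed is one simple way to preserve where a stroke attaches to its neighbors while still varying its shape, which is the kind of structure-preserving variation the abstract aims for.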

2606.18884 2026-06-18 cs.CV 新提交

Performance Gap Analysis between Latin and Arabic Scripts HTR

拉丁文与阿拉伯文手写文本识别之间的性能差距分析

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * Luleå University of Technology Department of Computer Science, Electrical

AI总结 本研究使用统一CRNN模型在多个数据集上比较阿拉伯文和拉丁文手写文本识别性能,发现性能差距在低资源场景下显著,随数据增加而缩小但持续存在,并分析了标注质量、视觉变异性和字符分布等因素。

Comments Accepted at the TIPS workshop, ICPR 2026

AI中文摘要

最近的研究表明,手写文本识别(HTR)系统在阿拉伯文字数据集上的表现不如拉丁文字数据。然而,由于缺乏受控比较,这种差距的原因仍不清楚。在这项工作中,我们使用统一的CRNN模型对阿拉伯文字与拉丁文字的行级HTR进行全面研究,涵盖九个数据集(包括KHATT(阿拉伯文)、Muharaf(阿拉伯文)、NUST-UHWR(乌尔都文)、PHTD(波斯文)、IAM(英文)、READ-2016(德文)等)和不同的训练规模(K ∈ {100, 500, 1000, 2000, ..., Kfull})。我们的结果显示性能差距仍然存在:在低资源设置下差距很大,随着数据增加而缩小,但在全规模下仍然存在,一致相差5-7个CER点。我们表明标注质量很重要,因为许多数据集包含标注错误。清理降低了错误率并缩小了差距,但并未消除差距。此外,我们发现由于阿拉伯文字具有更高的视觉变异性,固定数量的训练样本提供的覆盖效率较低,需要更多数据来学习相似的表示。我们根据文本行数和字符数比较了跨数据集的识别性能,显示了等价权衡。我们比较了不同文字体系的字符频率分布,并表明阿拉伯文字比拉丁文字显著更重尾。我们的错误分析显示,阿拉伯文字数据集(例如KHATT)中约30%的替换错误是由视觉相似字符之间的混淆引起的,而在拉丁文字数据集(如IAM)中约为15%。

英文摘要

Recent studies have shown that handwritten text recognition (HTR) systems perform worse on Arabic-script datasets than on Latin-script data. However, the reasons for this gap are still not well understood due to the lack of controlled comparisons. In this work, we present a comprehensive study of Arabic-script and Latin-script HTR using a unified CRNN model for line-level HTR across nine datasets (including KHATT (Arabic), Muharaf (Arabic), NUST-UHWR (Urdu), PHTD (Persian), IAM (English), READ-2016 (German), and others) and different training sizes (K in {100, 500, 1000, 2000, ..., Kfull}). Our results show that the performance gap persists: it is large in low-resource settings and decreases with more data, but remains even at full scale, with a consistent difference of 5-7 CER points. We show that annotation quality matters, as many datasets contain labeling errors. Cleaning reduces error rates and narrows the gap, but does not eliminate it. In addition, we find that a fixed number of training samples provides less effective coverage in Arabic due to higher visual variability, requiring more data to learn similar representations. We compare recognition across datasets in terms of the number of text lines and the number of characters, showing an equivalence trade-off. We compare character frequency distributions across scripts and show that Arabic is significantly more heavy-tailed than Latin. Our error analysis reveals that around 30 percent of substitution errors in Arabic datasets (e.g., KHATT) are caused by confusion between visually similar characters, compared to about 15 percent in Latin-script datasets such as IAM.
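The character error rate (CER) that the abstract reports in "points" is conventionally the Levenshtein edit distance between prediction and reference divided by the reference length. A minimal sketch follows; the function names are hypothetical and the study's exact text normalization is not specified in the abstract.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # delete ref_char
                current[j - 1] + 1,                        # insert hyp_char
                previous[j - 1] + (ref_char != hyp_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("handwriting", "handwritting"))
```

A gap of 5-7 CER points then corresponds to a difference of 0.05 to 0.07 in this ratio when it is expressed as a percentage. Python strings are compared per code point here, so combining marks in Arabic-script text would need normalization before scoring.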

2606.19096 2026-06-18 cs.CV 新提交

PorTEXTO: A European Portuguese Benchmark for Visual Text Extraction

PorTEXTO:用于视觉文本提取的欧洲葡萄牙语基准

João Cardeira, Diogo Glória-Silva, Manuel Letras da Luz, Rafael Ferreira, Diogo Tavares, David Semedo, João Magalhães

发表机构 * NOVA School of Science and Technology(NOVA科学与技术学校) NOVA LINCS

AI总结 提出PorTEXTO,首个针对现代欧洲葡萄牙语视觉文本提取的基准,通过结合前沿LVLM转录和母语者审核构建,发现合成到真实样本性能显著下降,多语言数据比模型规模更关键。

详情
AI中文摘要

欧洲葡萄牙语(pt-PT)在OCR基准中基本缺失,这些基准偏向高资源语言。少数涵盖pt-PT的基准专注于历史文物和文献。本文针对现代OCR应用,引入PorTEXTO,首个面向当代和文化相关的pt-PT视觉文本提取基准。为确保质量,我们采用结合前沿LVLM转录和母语者详尽审核的标注流程。我们观察到大多数模型从合成样本到真实样本性能急剧下降,并发现目前专门的多语言数据比模型大小或分辨率预算更能驱动pt-PT性能,这促使我们发布开放的pt-PT OCR资源。

英文摘要

European Portuguese (pt-PT) is largely absent from OCR benchmarks, which skew toward high-resource languages. The few benchmarks that cover pt-PT focus on historical artifacts and literature. This work addresses modern OCR applications, introducing PorTEXTO, the first benchmark for contemporary and culturally relevant pt-PT visual text extraction. To ensure quality, we employ an annotation pipeline combining transcriptions from a frontier LVLM with exhaustive review by native speakers. We observe a sharp performance drop from synthetic to real-world samples in most models, and find that, currently, specialized multilingual data is a better driver for pt-PT performance than model size or resolution budget, motivating the release of open pt-PT OCR resources.

2606.19139 2026-06-18 cs.CV cs.CL 新提交

Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation

Urdu Katib 手写数据集:用于离线乌尔都语手写文本识别的历史文档数据集及基于CRNN的基线评估

Ramza Basharat, Muhammad Usman Ali

发表机构 * Department of Computer Science, University of Gujrat(古杰拉特大学计算机科学系)

AI总结 为解决乌尔都语手写文本识别中数据集稀缺的问题,本文提出了首个由历史时期Katib书写的离线乌尔都语手写文本行数据集UKHD,并评估了多种CRNN混合模型,其中CNN-BGRU-CTC在字符错误率和词错误率上表现最优。

详情
AI中文摘要

自动手写文本识别(HTR)本质上是一项具有挑战性的任务,当处理草书体时,其复杂性进一步增加。尽管在各种草书体上已经做出了显著努力,但关于乌尔都语手写文本识别(UHTR)的研究相对有限。这种研究滞后主要是由于其文字带来的独特挑战,以及基准数据集的稀缺和不可用。因此,为了推进UHTR研究,本研究提出了一个专门的真实数据集,称为Urdu Katib手写数据集(UKHD)。据我们所知,这是第一个专门从历史时期Katib书写的材料中整理的离线乌尔都语手写文本行数据集。它涵盖了Nastalique书法风格中各种扁平笔尖书写变体。此外,评估了不同基于CRNN的混合模型的有效性,以确定用于Urdu Katib手写识别(UKHR)的最佳架构。在分析的模型中,CNN-BGRU-CTC模型表现出更稳健的性能,具有较低的字符错误率(CER)和词错误率(WER)。本研究工作旨在支持和鼓励研究社区开发用于保存乌尔都语手写文学的稳健识别系统。

英文摘要

Automatic Handwritten Text Recognition (HTR) is inherently a challenging task, and its complexity is further increased when dealing with cursive scripts. Although significant efforts have been made on various cursive scripts, research regarding Urdu Handwritten Text Recognition (UHTR) has been relatively limited. This research lag is primarily due to the unique challenges posed by the script and to the scarcity and unavailability of benchmark datasets. Therefore, to advance research in UHTR, this study presents a specialized real dataset called the Urdu Katib Handwritten Dataset (UKHD). To the best of our knowledge, this is the first offline Urdu handwritten text-line dataset specifically curated from materials written by Katibs in historical times. It encompasses a diverse range of flat-nib writing variations in the Nastalique calligraphic style. Additionally, the effectiveness of different CRNN-based hybrid models has been evaluated to identify the optimal architecture for Urdu Katib Handwriting Recognition (UKHR). Among the analyzed models, the CNN-BGRU-CTC model showed more robust performance, with low Character Error Rate (CER) and Word Error Rate (WER). This research work aims to support and encourage the research community in developing a robust recognition system for preserving Urdu handwritten literature.

10. 低层视觉、计算成像与图像增强 4 篇

2606.18496 2026-06-18 cs.CV cs.AI 新提交

Neural Phase Correlation

神经相位相关

Cole Reynolds

发表机构 * Weyl Labs(Weyl实验室)

AI总结 提出相位相关的学习泛化,通过可学习基函数将变换分解,适用于非刚性形变和幺正动力学,在心脏MRI和超声数据集上达到或超越现有方法。

详情
AI中文摘要

对应关系本质上是关系性的:它寻求同一场景两次观测之间的未知变换,而非任一观测的内容。然而,主流的基于学习的方法并未将变换表示为架构中的一等对象。它们独立编码每幅图像,让学习的相似度函数或深度解码器隐式地发现映射。相位相关是典型的例外,它直接在傅里叶域测量图像间关系,但其固定基的刚性将其限制于全局平移。我们引入相位相关的学习泛化,通过学习变换分解所基于的基来解除这一限制。相同的代数原语可扩展到密集非刚性形变和幺正动力学。在ACDC心脏MRI基准上,该框架在两个配准方向上匹配或超越先前发表的基线。在CAMUS超声心动图上,它无需辅助评分或自适应平滑机制即可达到最先进水平。应用于一维量子谐振子的时间演化波函数对时,同一框架仅从观测对中恢复未知哈密顿量的埃尔米特函数本征态和量子化能级。

英文摘要

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
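For orientation, the classical fixed-basis method that the abstract generalizes can be sketched in a few lines. This is ordinary 1-D phase correlation with a plain DFT, not the learned-basis model from the paper. It recovers a global circular shift from the normalized cross-power spectrum; the function names and the `eps` stabilizer are illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def phase_correlation_shift(a, b, eps=1e-12):
    """Estimate the circular shift s such that b[t] == a[(t - s) % n]."""
    fa, fb = dft(a), dft(b)
    cross = []
    for x, y in zip(fa, fb):
        c = y * x.conjugate()                # cross-power spectrum
        cross.append(c / (abs(c) + eps))     # keep the phase, discard the magnitude
    corr = dft(cross, inverse=True)          # ideally a delta located at the shift
    return max(range(len(corr)), key=lambda t: corr[t].real)


signal = [1.0, 4.0, 2.0, 7.0, 3.0, 0.0, 5.0, 6.0]
shifted = [signal[(t - 3) % len(signal)] for t in range(len(signal))]
estimated = phase_correlation_shift(signal, shifted)
```

Because the Fourier basis diagonalizes only translations, this procedure cannot express non-rigid deformation, which is the restriction the abstract says a learned basis lifts.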

2606.18644 2026-06-18 cs.CV 新提交

Spiking Pyramid Wavelet Transformation for High-efficient and Low-energy Image Restoration

尖峰金字塔小波变换用于高效低能耗图像恢复

Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu, Rui Xie, Jian Yang, Ying Tai

发表机构 * Nanjing University(南京大学) Nanjing University of Science and Technology(南京理工大学) University of Science and Technology of China(中国科学技术大学) China Mobile Institute(中国移动研究院)

AI总结 提出基于尖峰神经网络和金字塔小波变换的SPWM模型,通过SDPW块建模长程依赖并利用小波域退化特性,在保持图像质量的同时显著降低计算和能耗。

Comments Accepted by Pattern Recognition

详情
AI中文摘要

尖峰神经网络(SNNs)因其高效性和生物启发的潜力在计算机视觉领域引起了广泛兴趣。虽然基于尖峰CNN的方法在图像恢复(IR)任务中显示出前景,但其性能受到CNN操作固有感受野限制的约束。在本文中,我们探索了离散小波变换的优势,并提出了一种基于尖峰金字塔小波的模型(SPWM),以实现高效、低能耗的目标。具体来说,我们开发了一个尖峰双金字塔小波(SDPW)块来建模长程依赖并利用小波域中的退化特性。在多个基准上的实验结果表明,SPWM在保持图像质量的同时显著降低了计算成本和能耗。我们的方法展示了SNNs在IR领域的潜力,为资源受限设备上的未来应用提供了新的见解。

英文摘要

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of the discrete wavelet transform and propose a spiking pyramid wavelet-based model (SPWM) that targets high efficiency and low energy consumption. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.

2606.19046 2026-06-18 cs.CV 新提交

Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

基于Ky Fan p-k范数分数阶正则化的低秩张量补全

Shan Fan, Feng Zhang, Jianjun Wang, Xi-Le Zhao, Tingwen Huang

发表机构 * School of Mathematics and Statistics, Southwest University(西南大学数学与统计学学院) School of Mathematical Sciences/Research Center for Image and Vision Computing, University of Electronic Science and Technology of China(电子科技大学数学科学学院/图像与视觉计算研究中心) Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology(深圳先进技术大学计算机科学与控制工程学院)

AI总结 提出张量核范数与Ky Fan p-k范数之比(TNPK)作为非凸替代,逼近张量管秩,并构建低秩张量补全模型,证明低秩张量是局部极小点,设计ADMM算法,实验验证优于现有方法。

详情
AI中文摘要

本文通过提出一种新颖的非凸替代,即张量核范数与张量Ky Fan p-k范数(TNPK)之比,来精确逼近张量管秩,从而解决低秩张量补全(LRTC)问题。TNPK具有吸引人的性质,包括尺度不变性、参数灵活性以及在特定p和k选择下存在闭式解。在特定的p和k参数设置下,它退化为张量核范数与张量Ky Fan k范数(TNK)之比或张量核范数与张量Frobenius范数(TNF)之比。我们构建了一个LRTC模型,并在张量零空间性质(NSP)下,证明了低秩张量是所提模型的局部极小点。此外,我们推导了Ky Fan p-k逆范数的近端算子,并进一步开发了一种高效的交替方向乘子法(ADMM)算法,在温和条件下保证子序列收敛。在合成和真实世界数据集上的大量实验验证了我们的方法相对于最先进竞争者的优越性能。

英文摘要

This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct an LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
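The scale invariance and the special cases mentioned above are easy to see in the matrix setting. The sketch below works directly on a list of singular values and uses the standard matrix Ky Fan p-k norm, (sum of the k largest singular values, each raised to the power p)^(1/p). Treating the tensor case as analogous is an assumption here, since the paper defines its norms through the tensor SVD, and the function names are illustrative.

```python
def ky_fan_pk(singular_values, p, k):
    """Matrix Ky Fan p-k norm: l_p norm of the k largest singular values."""
    top = sorted(singular_values, reverse=True)[:k]
    return sum(s ** p for s in top) ** (1.0 / p)


def nuclear_over_ky_fan(singular_values, p, k):
    """Ratio of the nuclear norm (sum of singular values) to the Ky Fan p-k norm."""
    return sum(singular_values) / ky_fan_pk(singular_values, p, k)


sv = [3.0, 4.0]
ratio = nuclear_over_ky_fan(sv, p=2, k=2)                          # p=2, k=full rank: nuclear / Frobenius
ratio_scaled = nuclear_over_ky_fan([10 * s for s in sv], p=2, k=2)
ratio_rank1 = nuclear_over_ky_fan([5.0, 0.0], p=2, k=2)            # a rank-1 spectrum
```

With p = 1 the denominator is the Ky Fan k norm, and with p = 2 and k equal to the number of singular values it is the Frobenius norm, matching the two reductions named in the abstract. The ratio is unchanged by rescaling and equals 1 for a rank-1 spectrum, which is why it can act as a rank surrogate.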

2606.19097 2026-06-18 cs.CV 新提交

DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

DVANet: 面向图像复原的退化感知视觉先验对齐网络

Yanjie Tu, Qingsen Yan, Axi Niu, Tao Hu, Haokui Zhang, Jiantao Zhou

发表机构 * School of Computer Science, Northwestern Polytechnical University(西北工业大学计算机学院) Shenzhen Research Institute of Northwestern Polytechnical University(西北工业大学深圳研究院) State Key Laboratory of Internet of Things for Smart City, University of Macau(澳门大学智慧城市物联网国家重点实验室)

AI总结 提出DVANet,一种基于半二次分裂优化的深度展开网络,通过退化感知观测一致性与视觉先验引导重建的协同展开,实现复杂退化下的统一图像复原,在多种退化场景和跨域任务中表现优越。

Comments All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior

详情
AI中文摘要

全能图像复原旨在开发一个统一的复原框架来处理多种退化类型。现有的端到端方法通常将复原过程视为黑盒映射,缺乏明确的优化解释。尽管深度展开为图像复原提供了可解释的迭代建模范式,但现有方法大多依赖于固定的退化假设或预定义的退化信息,难以适应复杂退化和局部内容受损下的统一复原需求。这一限制制约了它们在退化抑制和结构细节恢复方面的性能。为解决这些问题,本文提出DVANet,一种受半二次分裂优化算法启发的深度展开网络,将复杂退化下的统一图像复原公式化为退化感知观测一致性与视觉先验引导重建之间的协同展开过程。具体而言,在退化感知观测一致性分支中,采用退化表示模块提取全局退化属性和局部退化线索,并利用退化条件映射增强模型对不同退化类型的适应性。在视觉先验引导重建分支中,引入DINOv3提供结构和语义信息作为层次化视觉先验,从而补充受损区域缺失的结构信息并改善细节恢复。大量实验表明,DVANet在多场景退化和跨域图像复原任务上取得了优越或具有竞争力的性能,展现出良好的退化适应性和泛化能力。

英文摘要

All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.
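As background for the unfolding idea, here is a minimal sketch of plain half-quadratic splitting (HQS) on a toy 1-D denoising problem with an L1 prior. It alternates a closed-form data-consistency step with a prior (proximal) step. Deep unfolding networks in general replace such steps with learned modules; how DVANet does so is not specified beyond the abstract, so the objective, the parameters `lam` and `mu`, and the function names below are assumptions for illustration only.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v| (the prior step for an L1 prior)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def hqs_l1_denoise(y, lam=0.5, mu=1.0, iters=50):
    """HQS for: min_x,z 0.5*||y - x||^2 + lam*||z||_1 + (mu/2)*||x - z||^2."""
    z = list(y)
    x = list(y)
    for _ in range(iters):
        # Data-consistency step: closed-form minimizer over x with z fixed.
        x = [(yi + mu * zi) / (1.0 + mu) for yi, zi in zip(y, z)]
        # Prior step: minimizer over z with x fixed (soft-thresholding).
        z = [soft_threshold(xi, lam / mu) for xi in x]
    return x, z


x_hat, z_hat = hqs_l1_denoise([3.0, 0.2, -2.0])
```

Each loop iteration corresponds to one "stage" in an unfolded network. Practical HQS schemes usually increase `mu` over iterations; it is held fixed here for brevity.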

11. 鲁棒性、安全、隐私与可信视觉 4 篇

2606.18318 2026-06-18 cs.CV cs.CR 新提交

Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection

预算感知的自适应对抗补丁用于黑盒目标检测

Pedram MohajerAnsari, Amir Salarpour, David Fernandez, Mert D. Pesé

AI总结 提出一种查询高效、预算自适应的黑盒攻击方法,结合上下文汤普森采样放置和NES像素更新,在严格纯图像抑制测试下,对CNN和Transformer检测器实现强抑制,并揭示查询-视觉足迹权衡。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

对抗补丁对现代目标检测器构成实际威胁。先前工作揭示了脆弱性,但三个差距限制了可操作的见解:(i) 很少有基于分数的黑盒攻击在严格查询预算下联合优化补丁的位置、纹理和大小;(ii) 成功很少与补丁的视觉足迹相关联;(iii) 评估常常混淆EOT鲁棒性与纯视图抑制。我们提出一种查询高效、预算自适应的黑盒攻击,它结合了轻量级的上下文汤普森采样放置器与NES风格的像素更新,仅在进展停滞时增大补丁。报告基于严格的纯图像抑制测试;EOT被审计但从不作为成功的替代,可选的外观/可打印性权重揭示了强度-可见性权衡。在YOLOv5、Faster R-CNN和YOLOS上,该方法在基于CNN的检测器上实现了强抑制,在基于Transformer的检测器上实现了显著抑制,使用紧凑的补丁,并相对于固定大小和启发式基线暴露了清晰的查询-足迹权衡。打印-捕获实验进一步展示了跨未见物理对象和视角的迁移。

英文摘要

Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few score-based black-box attacks jointly optimize patch location, texture, and size under tight query budgets; (ii) success is rarely tied to the patch's visual footprint; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present a query-efficient, budget-adaptive black-box attack that couples a lightweight Contextual Thompson-Sampling placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a strict plain-image suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength-visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, the attack achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query-footprint trade-offs relative to fixed-size and heuristic baselines. A print-capture pilot further shows transfer across unseen physical objects and viewpoints.
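"NES-style pixel updates" refers to estimating a gradient from score queries alone. The sketch below shows the generic antithetic NES estimator on a toy score function. It is not the paper's attack: the placement bandit, patch growth, and pixel-range clipping are omitted, and `toy_score` is a hypothetical stand-in for a detector's confidence score.

```python
import random


def nes_gradient(score_fn, x, rng, sigma=0.1, n_samples=20):
    """Antithetic NES estimate of d(score)/dx using 2 * n_samples score queries."""
    grad = [0.0] * len(x)
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = score_fn([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = score_fn([xi - sigma * ui for xi, ui in zip(x, u)])
        weight = (plus - minus) / (2.0 * sigma * n_samples)
        for i, ui in enumerate(u):
            grad[i] += weight * ui
    return grad


def nes_minimize(score_fn, x0, steps=50, lr=0.1, seed=0):
    """Lower a black-box score by descending the estimated gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = nes_gradient(score_fn, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def toy_score(pixels):
    """Hypothetical stand-in for a detector score (not a real detector)."""
    return sum(p * p for p in pixels)


start = [1.0, 1.0, 1.0]
final = nes_minimize(toy_score, start)
```

Each step here costs 40 queries, which makes the role of a query budget concrete: the number of steps an attacker can afford is the budget divided by the per-step query cost.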

2606.18510 2026-06-18 cs.CV cs.CR 新提交

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

人脸呈现攻击检测中的架构偏差:视觉Transformer与卷积神经网络的比较研究

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

发表机构 * College of Engineering, Carnegie Mellon University(卡内基梅隆大学工程学院)

AI总结 通过比较ViT和CNN在人脸呈现攻击检测中的表现,发现预训练ViT(DeiT-S)在准确率、公平性和跨种族泛化上优于CNN,将种族间ACER差距降低83%。

Comments 8 Pages, 4 Figures, 5 Tables

详情
AI中文摘要

人脸呈现攻击检测(PAD)系统构成生物特征认证中的关键安全层;然而,现有方法在不同人口群体间表现出系统性性能差异,对深肤色个体影响尤为严重。本文通过实证比较研究,探究视觉Transformer架构相对于卷积基线是否能够减少人脸PAD系统中的人口统计偏差。实验在CASIA-SURF跨种族人脸反欺骗(CeFA)数据集上进行。评估了三种架构:从头训练的多模态ViT-Tiny、ResNet18 CNN基线,以及在CeFA上微调的预训练DeiT-S,覆盖非洲、东亚和零样本中亚人口群体。DeiT-S实现了最高总体准确率97.27%和最低等错误率0.86%,优于准确率90.15%的ResNet18。在公平性方面,DeiT-S将非洲与东亚受试者之间的种族间ACER差距降至0.13%,而基于LBP的工作[6]报告为0.75%,降低了83%。最值得注意的是,ResNet18在零样本中亚受试者上的BPCER为10.44%,而DeiT-S在相同未见群体上保持2.89%,展现出3.6倍的泛化优势。这些结果表明,预训练视觉Transformer在PAD中实现了更高的准确率,产生了更小的人口统计性能差距,并在未见人口群体上更公平地泛化,表明PAD中的跨人口公平性可能部分受架构设计影响。

英文摘要

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
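The two headline ratios can be checked directly from the figures quoted in the abstract. The sketch below also includes the standard ACER definition, the mean of APCER and BPCER as in ISO/IEC 30107-3, purely for orientation; the abstract does not report the underlying APCER and BPCER values, and the helper names are illustrative.

```python
def acer(apcer: float, bpcer: float) -> float:
    """Average Classification Error Rate: mean of APCER and BPCER."""
    return (apcer + bpcer) / 2.0


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline


# Inter-ethnic ACER gap: 0.75% (LBP-based work) vs 0.13% (DeiT-S).
gap_reduction = relative_reduction(0.75, 0.13)   # about 0.827, i.e. roughly 83%

# Zero-shot Central Asian BPCER: 10.44% (ResNet18) vs 2.89% (DeiT-S).
bpcer_ratio = 10.44 / 2.89                       # about 3.61, i.e. roughly 3.6x
```

Note that the 83% figure is a relative reduction of a gap measured in percentage points, so the absolute change is 0.62 points.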

2606.19184 2026-06-18 cs.CV cs.LG 新提交

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

当AUC误导:域偏移下深度伪造检测器的极化感知评估

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel, Djamila Aouada

发表机构 * Cristal Laboratory, National School of Computer Sciences, University of Manouba(马努巴大学国家计算机科学学院Cristal实验室)

AI总结 针对现有AUC评估无法反映真实场景中混合数据源和不同伪影类型的问题,提出Cross-dataset AUC(Cross-AUC)指标,通过平均每域AUC并引入预测极化度量(Wasserstein距离)来评估域偏移鲁棒性,实验证明其有效性。

详情
AI中文摘要

生成式AI的最新进展,如扩散模型和换脸工具,使得创建高度逼真的深度伪造成为可能,导致了包括金融欺诈和非自愿色情内容在内的现实危害。为此,深度伪造检测成为一个活跃的研究领域,近期方法越来越关注提高对未见操作的泛化能力。这通常通过跨多个数据集分别测量的ROC曲线下面积(AUC)来评估。然而,这种评估未能反映检测器面对混合数据源和不同伪影类型的真实场景。为解决这一局限,我们引入一种新指标——跨数据集AUC(Cross-AUC),该指标平均每域AUC并加入预测极化度量,以考虑对域偏移的鲁棒性。极化程度通过类别分数分布之间的Wasserstein距离量化。Cross-AUC不仅更真实地评估深度伪造检测器在域偏移下的泛化能力,而且具有可解释性,因为它能更好地解释性能下降的原因。在七个基准数据集上的实验证明了其实用性。

英文摘要

Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and incorporates a measure of prediction polarization to account for robustness to domain shift. The extent of polarization is quantified by the Wasserstein distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shift more realistically, but is also interpretable, as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.
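A toy example shows why per-dataset AUC can mislead. In the sketch below, two domains are each perfectly separable, yet pooling them lowers AUC because their score distributions sit at different offsets. The code computes only the ingredients named in the abstract (per-domain AUC and a 1-D Wasserstein distance between class score distributions); how Cross-AUC combines them is not given in the abstract, so no combined formula is attempted. All scores are made-up illustrative values.

```python
def auc(real_scores, fake_scores):
    """AUC as P(fake score > real score), counting ties as 0.5."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(real_scores) * len(fake_scores))


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance for two equal-size samples."""
    assert len(a) == len(b), "this sketch assumes equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


domains = {
    "A": {"real": [0.1, 0.2, 0.3], "fake": [0.4, 0.5, 0.6]},
    "B": {"real": [0.5, 0.6, 0.7], "fake": [0.8, 0.9, 1.0]},
}

per_domain_auc = {name: auc(d["real"], d["fake"]) for name, d in domains.items()}
pooled_real = [s for d in domains.values() for s in d["real"]]
pooled_fake = [s for d in domains.values() for s in d["fake"]]
pooled_auc = auc(pooled_real, pooled_fake)
class_gap_a = wasserstein_1d(domains["A"]["real"], domains["A"]["fake"])
```

Here both per-domain AUCs are 1.0 while the pooled AUC is 29/36, roughly 0.81: a single decision threshold cannot serve both domains, which is the mixed-source situation the abstract says separate AUCs fail to capture.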

2606.19259 2026-06-18 cs.CV cs.AI 新提交

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

一个用于检测 GPT-Image-2 生成的含丰富文本图像的多领域基准

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

AI总结 针对现有基准缺乏文本丰富图像检测的问题,构建了包含8602张图像、覆盖6个类别的多领域基准,评估5种检测器,发现性能高度依赖领域且易受JPEG压缩影响。

详情
AI中文摘要

含丰富文本的图像通常包含隐私敏感、交易或决策相关信息。随着最近多模态图像生成模型合成逼真文本内容和结构化视觉设计的能力越来越强,检测AI生成的含丰富文本图像已成为数字信任和内容真实性的重要挑战。然而,现有基准主要关注以物体为中心的图像,对文本语义和布局组织至关重要的场景覆盖有限。在本文中,我们引入了一个用于检测OpenAI的GPT Image 2生成的含丰富文本图像的多领域基准。该基准包含8602张图像,涵盖六个代表性类别:商业海报、信息图表、学术海报、收据、表格和UI截图。利用该基准,我们在零样本设置下评估了五种代表性AI生成图像检测器,并分析了它们的整体性能、类别性能和后处理鲁棒性。我们的结果表明,检测器性能高度依赖于领域:在某些类别上表现良好的方法往往在其他类别上失败,即使最强的传统检测器也对JPEG压缩表现出严重敏感性。我们进一步使用多模态视觉语言模型进行了探索性评估,揭示了其在结构化格式上的潜力和局限性。这些发现突显了针对现代AI生成图像需要文本和布局感知的检测方法。我们的数据集发布于XXX。

英文摘要

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall performance, category-wise performance, and robustness to post-processing. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

12. 数据集、基准、评测与训练方法 8 篇

2606.18484 2026-06-18 cs.CV 新提交

Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

Vines-DB:用于多物种观赏藤蔓分割的RGB图像数据集

Saroj Burlakoti, Utsav Bhandari, Aaron Etienne, Shital Poudyal

发表机构 * Department of Plants, Soils and Climate, Utah State University(植物、土壤与气候系,犹他州立大学) Department of Applied Sciences, Technology and Education, Utah State University(应用科学、科技与教育系,犹他州立大学)

AI总结 为支持精准园艺和城市生态中的多类实例分割,构建了包含7种观赏藤蔓的RGB图像数据集Vines-DB,通过手动标注和增强得到2307张图像,并划分训练/验证/测试集。

Comments 7 pages, 1 figure. Source data repository: OSF (DOI: 10.17605/OSF.IO/YJHCK)

详情
AI中文摘要

Vines-DB数据集包含在美国犹他州洛根市犹他农业实验站格林维尔研究农场田间条件下采集的7种观赏藤蔓的1,218张原始高分辨率RGB图像。该数据集来自168株于2022年移植的藤本植物,在2023和2024生长季(7月至10月)的多个月份重复拍摄。图像使用配备48 MP摄像头的iPhone 16 Pro在上午10:00至下午12:00之间于日光下拍摄。藤蔓生长在1.2m x 2.4m的格架上,从1m距离处拍摄,背景为黑色或白色泡沫板,以增强对比度并减少背景噪声。数据集包括木通、凌霄花、藤绣球、金银花、凌霄'马德琳·加伦'、五叶地锦和多花紫藤。所有原始图像由训练有素的标注员在Roboflow中手动标注,生成基于多边形的实例分割掩码,共8个类别(7个物种和背景)。经过预处理和数据增强后,工作数据集扩展至2,307张图像,用于模型开发和评估。增强后的数据集通过分层抽样划分为2,019张训练图像、192张验证图像和96张测试图像,以保持平衡的代表性。Vines-DB支持精准园艺和城市生态中多类实例分割深度学习模型的开发和评估。该数据集可实现自动冠层覆盖度估计、物种识别和可扩展的田间表型分析等应用。此外,每月重复成像捕获了冠层发育和植物外观的时间变化,增加了数据集在真实田间条件下进行分割基准测试的实用性。

英文摘要

The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2m x 2.4m trellises and photographed from a distance of 1m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.

2606.18554 2026-06-18 cs.CV 新提交

Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

伪造灾难:扩散时代跨域合成灾难检测基准

Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM(胡志明市国家大学下属理科大学) Vietnam National University, Ho Chi Minh(胡志明市国家大学) University of Information Technology, VNU-HCM(胡志明市国家大学下属信息技术大学) University of Dayton(代顿大学) National Institute of Informatics(国立信息学研究所)

AI总结 针对扩散模型生成的逼真灾难图像难以检测的问题,提出包含30000张图像(6000张真实、24000张合成)的基准数据集,实验发现微调检测器在未知生成器上准确率下降50%,零样本检测器也不稳定,凸显了跨域检测的迫切需求。

Comments SOICT 2025

详情
AI中文摘要

文本到图像扩散模型的快速进步使得创建高度逼真的合成图像成为可能,这些图像与真实照片极为相似,使得区分真实内容与AI生成的伪造品越来越困难。这对网络安全、数字取证和灾难响应构成了挑战,其中洪水、火灾或地震的虚假图像可能传播错误信息或扰乱应急行动。为此,我们引入了Forged Calamity,一个用于合成灾难检测的基准数据集,包含30000张图像,其中包括6000张真实样本和由四种扩散模型生成的24000张合成样本。在微调和零样本设置下的全面实验揭示了当前取证方法的一致弱点。微调检测器在分布内表现良好,但在未见过的生成器或灾难类型上准确率下降高达50%,显示出对模型特定伪影的过拟合。零样本通用检测器也难以保持稳定的准确率,只有少数具有鲁棒表示能力的模型表现出有限的韧性。这些发现凸显了持续存在的泛化差距,以及在扩散时代确保视觉真实性迫切需要领域和模型无关的检测方法。

英文摘要

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

2606.18555 2026-06-18 cs.CV 新提交

Rethinking Text-to-Image as Semantic-Aware Data Augmentation for Indoor Scene Recognition

重新思考文本到图像作为室内场景识别的语义感知数据增强

Trong-Vu Hoang, Quang-Binh Nguyen, Dinh-Khoi Vo, Hoai-Danh Vo, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM, Vietnam(越南国立大学胡志明市理科大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国立大学胡志明市分校)

AI总结 针对室内图像数据不足,提出利用稳定扩散生成合成图像进行数据增强,并通过扩散重建误差防止滥用,在MIT室内场景数据集上验证了有效性。

Comments MAPR 2024

详情
AI中文摘要

在计算机视觉领域,室内图像识别由于光照条件、遮挡以及有限空间内多样化物体排列的复杂相互作用而面临挑战。为了解决训练室内图像缺乏的问题,我们引入了一种新颖的方法,利用稳定扩散(SD)生成合成图像,作为强大的数据增强工具。SD的使用提供了一个原则性框架,用于合成多样且逼真的室内场景,从而丰富训练数据池,以构建鲁棒的室内图像识别模型。在MIT室内场景数据集上的实验结果表明,当真实数据有限时,我们提出的方法在增强深度模型训练方面具有潜力。此外,为了防止SD合成图像的滥用,我们引入了一种基于扩散重建误差(DIRE)的应对措施。强大的DIRE表示使得仅使用轻量级深度模型就能训练鲁棒的分类器。实验表明,我们的方法能够完美识别SD生成的图像,使用MobilenetV3的准确率达到100%。

英文摘要

In the realm of computer vision, indoor image recognition presents challenges due to the intricate interplay of lighting conditions, occlusions, and diverse object arrangements within confined spaces. To address the lack of indoor training images, we introduce a novel approach leveraging Stable Diffusion (SD) for the generation of synthetic images, which serve as a powerful data augmentation tool. The utilization of SD offers a principled framework for synthesizing diverse and realistic indoor scenes, thereby enriching the training data pool for robust indoor image recognition models. Experimental findings on the MIT Indoor Scene dataset reveal the potential of our proposed approach in enhancing the training of deep models when authentic data is limited. Furthermore, to prevent the misuse of SD synthetic images, we introduce a countermeasure based on DIffusion Reconstruction Error (DIRE). The powerful DIRE representation enables training robust classifiers using only lightweight deep models. Experiments show that our approach can perfectly recognize SD-generated images, reaching 100% accuracy with MobileNetV3.

2606.18565 2026-06-18 cs.CV eess.SP 新提交

Experimental Analysis of Neural Network-Based Image Classification on the CIFAR-10 Dataset

基于神经网络的CIFAR-10数据集图像分类实验分析

Necati Kagan Erkek, Emre Balci, Berkin Halay

发表机构 * Department of Electronics and Communication Engineering, Istanbul Technical University(伊斯坦布尔技术大学电子与通信工程系)

AI总结 通过全连接和卷积网络在CIFAR-10上实验,分析完整学习流程,六层卷积网络在10个epoch后验证准确率约74.77%,揭示了表示学习与记忆化的差异。

Comments 7 pages

详情
AI中文摘要

通过全连接和卷积网络公式,对CIFAR-10基准上的神经图像分类进行了实验研究。分析强调了完整的学习流程:图像向量化、归一化、独热类编码、监督损失最小化、学习率选择、小批量训练、卷积特征提取、最大池化和基于验证的泛化评估。评估了一个具有六个卷积层和三个最大池化阶段的卷积架构,使用批量大小为128、学习率为0.001的Adam优化器进行十个训练周期。验证准确率达到约74.77%,而验证损失在训练中期后开始增加,尽管训练损失持续减少。由此产生的行为说明了表示学习与记忆化之间的实际差异,并为未来关于正则化、数据增强、更深层架构和可复现图像分类教育的研究提供了紧凑的实验基线。

英文摘要

An experimental investigation of neural image classification on the CIFAR-10 benchmark is presented through fully connected and convolutional network formulations. The analysis emphasizes the complete learning pipeline: image vectorization, normalization, one-hot class encoding, supervised loss minimization, learning-rate selection, mini-batch training, convolutional feature extraction, max-pooling, and validation-based generalization assessment. A convolutional architecture with six convolutional layers and three max-pooling stages is evaluated for ten training epochs using a batch size of 128 and an Adam optimizer with a learning rate of 0.001. The validation accuracy reaches approximately 74.77%, while the validation loss begins to increase after the middle of training despite continued reduction in training loss. The resulting behavior illustrates the practical difference between representation learning and memorization, and it provides a compact experimental baseline for future studies on regularization, data augmentation, deeper architectures, and reproducible image-classification education.
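The stated configuration can be sanity-checked with simple shape arithmetic. The sketch below assumes size-preserving ("same" padding) convolutions and a 2x2 max-pool after every second convolution. This layout is an assumption: the abstract gives only the counts (six convolutional layers, three pooling stages), not the ordering. The training-set size passed to `steps_per_epoch` is likewise an assumption, since the abstract does not state the train/validation split.

```python
import math

# Assumed layout: six conv layers and three max-pool stages, one pool per two convs.
LAYERS = ["conv", "conv", "pool"] * 3


def trace_spatial_size(size=32, layers=LAYERS):
    """Spatial side length after each layer for a size x size input."""
    sizes = [size]
    for layer in layers:
        if layer == "pool":
            size //= 2          # 2x2 max-pooling halves each side
        sizes.append(size)      # "same"-padded convs keep the size
    return sizes


def steps_per_epoch(n_train, batch_size=128):
    """Number of mini-batch updates in one pass over n_train images."""
    return math.ceil(n_train / batch_size)


sizes = trace_spatial_size()         # CIFAR-10 images are 32 x 32
updates = steps_per_epoch(50_000)    # if all 50,000 CIFAR-10 training images were used
```

Under these assumptions the feature map shrinks from 32 to 4 per side, and ten epochs at batch size 128 amount to about 3,910 Adam updates, a small budget that fits the abstract's observation that validation loss starts rising partway through training.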

2606.18841 2026-06-18 cs.CV 新提交

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

重新思考空地协作:渐进式跨任务基准与社会化学习框架

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

发表机构 * School of Automation, Southeast University(东南大学自动化学院) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) School of Sports Training, Tianjin University of Sport(天津体育学院运动训练学院) Faculty of Information Engineering and Automation, Kunming University of Science and Technology(昆明理工大学信息工程与自动化学院) School of Artificial Intelligence, Tianjin University(天津大学人工智能学院) School of Artificial Intelligence, Hebei University of Technology(河北工业大学人工智能学院)

AI总结 提出空地渐进协作基准AGPC和社会化协同感知框架SCP,通过双层级路由器实现跨视角跨任务选择性交互,在异构空地感知中提升下游性能7.86%。

详情
AI中文摘要

空地协同感知对于真实世界动态环境中的鲁棒视觉理解至关重要。然而,现有研究通常将协作建模为单任务跨视角融合,忽视了定位、目标关联和细粒度解析之间的功能依赖关系。此外,空中和地面视角的异构性引入了显著的几何、尺度和遮挡差异,使得统一特征共享容易受到负迁移的影响。为解决这些问题,我们将空地感知建模为渐进式跨任务协作任务,并构建了空地渐进协作(AGPC)基准,这是一个包含超过745K原始视频帧的时空对齐基准。基于该基准,我们提出了社会化协同感知(SCP),一个由粗到细的框架,按照从空中全局定位到地面目标关联和身份感知解析的顺序渐进地组织协作。其核心模块——双层级路由器(DLR),将输入侧的多尺度专家选择与输出侧的任务条件调制解耦,实现了选择性的跨视角和跨任务交互,同时抑制有害干扰。大量实验证明了SCP的有效性。它实现了3.73%的协同进化增益和7.86%的平均下游性能提升。这些结果表明,对于异构空地感知,任务条件协作比统一融合更有效。代码可在 https://github.com/g1136639260-spec/AGSCP 获取。

英文摘要

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs(Anates实验室) Technical University of Munich(慕尼黑工业大学) University of Technology Nuremberg(纽伦堡工业大学) Tuebingen AI Center, University of Tuebingen(图宾根大学人工智能中心) Helmholtz AI, Munich(慕尼黑亥姆霍兹人工智能研究所) Google DeepMind research(谷歌DeepMind研究)

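AI summary: Presents Physics-IQ Verified, a benchmark that improves prompt and ground-truth quality and introduces a sample-level scoring system so that video generative models' understanding of physical reality can be evaluated more reliably; it refines 57.6% of all samples and improves over 34.8% of prompts.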

详情
AI中文摘要

视频生成模型(VGMs)已成为新的前沿,不仅用于视频生成,还用于多种下游任务,包括世界建模。为推进这些任务,一个良好的视频模型必须理解世界的物理现实。评估这种理解成为新兴领域,催生了Physics-IQ基准,通过将模型生成的视频与真实物理实验视频进行比较来量化。本文系统审计了Physics-IQ基准,揭示不足并提出三种解决方案,改进如何衡量VGMs的物理理解。具体而言,我们提高了提示和地面真实质量以减少混淆因素影响,并进一步引入样本级评分系统,使每个样本和指标权重相等。我们的基准Physics-IQ Verified优化了57.6%的所有样本并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中,我们观察到中等但有意义的排名变化(Kendall's τ=0.46)。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区发展,向物理准确的VGMs迈进。该基准的代码可通过此https URL访问。

英文摘要

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
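
The abstract above summarizes the ranking changes between the original and verified benchmark with Kendall's τ. As a minimal sketch of how that statistic is computed for two rankings of the same models, the function below implements the tie-free form (τ-a) by counting concordant and discordant pairs. The two six-item rankings are made-up illustrative values, not the paper's data, and the paper may use a different variant of τ.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items (assumes no ties)."""
    if len(rank_a) != len(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must have equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both rankings order items i and j the same way
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical ranks of six models under two scoring schemes
    original = [1, 2, 3, 4, 5, 6]
    verified = [3, 2, 1, 4, 6, 5]
    print(round(kendall_tau(original, verified), 3))
```

A value of 1 means the two rankings agree on every pair, -1 means one is the exact reverse of the other, and values in between indicate partial reordering.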

2606.18952 2026-06-18 cs.CV 新提交

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

SP-TransientBench: 一个真实捕获的单光子感知基准

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

发表机构 * Shanghai University(上海大学) Southern University of Science and Technology(南方科技大学) The University of Sydney(悉尼大学)

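AI summary: To address the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes, proposes STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.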

详情
AI中文摘要

基于单光子雪崩二极管(SPAD)传感的单光子LiDAR(SPL)能够以极高灵敏度进行时间分辨光子测量,为光子匮乏环境下的主动3D感知提供了独特潜力。然而,由于独特的测量噪声和复杂的多回波瞬态现象,真实世界的单光子感知仍然面临根本性挑战,这些因素共同使几何重建和语义场景理解变得复杂。尽管对基于SPAD的传感兴趣日益增长,现有研究大多局限于模拟数据或小规模受控捕获。因此,在深度估计、多视图重建和3D语义理解方面,对真实世界单光子感知的系统评估仍未得到充分探索。为弥补这一空白,我们引入了SP-TransientBench(STB),一个真实捕获的多任务单光子感知基准。STB包含10个多样化场景和10297个视图,使用固态单光子LiDAR以256×192分辨率捕获。每个视图提供具有多回波行为的完整飞行时间直方图、标准化元数据和用于多视图评估的校准相机位姿。我们还为选定场景提供了13类3D语义标注。通过为每个任务提供专用数据划分和评估协议,STB能够在多个3D视觉问题上实现真实世界单光子感知的一致且可重复的基准测试。数据集和代码将在接收后发布。

英文摘要

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at 256×192 resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.19053 2026-06-18 cs.CV 新提交

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

大规模视觉-语言模型在细粒度图像任务上的基准测试:从评估到诊断

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering, Southeast University, China(东南大学计算机科学与工程学院,中国) Alibaba Group(阿里巴巴集团) School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China(东南大学计算机科学与工程学院、智能科学与工程学院以及新一代人工智能技术及其交叉应用关键实验室,中国) Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China(北京大学王选计算机研究所、多媒体信息处理国家重点实验室,中国) University of Copenhagen, Denmark(丹麦哥本哈根大学)

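AI summary: Proposes the FG-BMK benchmark, with 1.01 million questions and 0.28 million images, which evaluates the fine-grained semantic recognition and visual discriminability of LVLMs through human-oriented and machine-oriented paradigms, diagnoses the causes of failure, and identifies bottlenecks in areas such as visual representations and semantic grounding.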

详情
AI中文摘要

近期大规模视觉-语言模型(LVLMs)展示了显著的多模态感知和推理能力。尽管众多基准从整体或任务特定角度评估了LVLMs,但它们在细粒度图像任务(计算机视觉的基础)上的能力仍未得到充分理解。为填补这一空白,我们引入FG-BMK,一个全面的细粒度评估基准,包含101万问题和28万图像,覆盖从常见物体中心领域到专业领域的多样化场景。FG-BMK通过面向人类和面向机器的范式,联合评估对话级细粒度语义识别和特征级视觉判别能力,从而诊断分析LVLM的失败是否源于视觉表示不足、视觉-语义对齐薄弱或细粒度知识有限。通过对一系列代表性LVLM/VLM的大量实验,我们发现当前LVLMs仍是不充分的细粒度识别器,失败源于视觉表示、语义对齐、模态对齐和类别级知识中相互交织的瓶颈。我们进一步分析了提升细粒度能力的训练设计因素,并考察了视觉和语言扰动如何影响LVLM预测。这些发现为当前LVLMs的局限性提供了诊断性见解,并为未来数据构建和模型设计提供了指导,以开发更可靠的细粒度视觉任务LVLMs。我们的代码已开源,可从此https URL获取。

英文摘要

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

13. Other / General Vision (2 papers)

2606.18661 2026-06-18 cs.CV cs.AI 新提交

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench:一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University(中南大学)

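AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-oriented vision-language model LandslideVLM, and the domain-rule-enhanced agent LandslideAgent, enabling autonomous landslide identification and analysis.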

详情
AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要,然而当前范式难以同时提取视觉特征和高层次地球科学语义,而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战,我们提出一个指令驱动的智能体框架,包含三个组成部分。首先,通过多VLM交叉验证和交互式标注构建LandslideBench,这是一个多模态细粒度数据集,包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后,通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM,以增强地质语义理解。最后,以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent,采用双规则控制器,结合结构化报告元数据约束和交叉验证识别约束,来调控自动化工具调用。实验表明,LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理,实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.19249 2026-06-18 cs.CV cs.LG 新提交

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Transformer几何观测站TGO-I:谱几何观测站

Kaustubh Kapil, Kishor P. Upla

发表机构 * Sardar Vallabhai National Institute of Technology (SVNIT), Surat, India(印度苏拉特萨达尔·瓦拉巴伊国家理工学院(SVNIT))

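AI summary: Proposes the TGO framework, which analyzes the spectral geometry of ViT representations (effective rank, stable rank, participation ratio, spectral entropy, spectral flatness, spectral anisotropy, and others) and finds that during training dimensional utilization increases, anisotropy decreases, and spectral entropy and participation ratio rise; the final CLS token representation has the highest effective dimensionality and lowest anisotropy.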

详情
AI中文摘要

尽管Vision Transformers(ViTs)被广泛采用并在众多计算机视觉应用中取得成功,对其维度和表示几何的基本理解仍然相对未被充分探索。为了弥补这一差距,我们引入了Transformer几何观测站(TGO),这是一个系统的实验和分析流程框架,旨在研究Vision Transformers的表示几何和动态。TGO-I是该框架的第一部分,专注于ViT表示的谱几何。使用在ImageNet-100上训练的ViT-Small/16模型,我们分析了训练过程中的有效秩、稳定秩、参与比、谱熵、谱平坦度、谱各向异性、协方差结构、特征谱和奇异值谱。我们的结果揭示了维度利用的一致增加,伴随着各向异性降低、谱熵增加、参与比增加以及逐渐平坦的特征谱。与常见的直觉(即训练应将信息集中到少数主导方向)相反,我们观察到方差在表示维度上的逐渐重新分布。这一现象在最终的CLS标记表示中尤为明显,该表示在网络中表现出最高的有效维度和最低的各向异性。

英文摘要

Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
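
Several of the spectral quantities named in the abstract above have commonly used textbook definitions that can be computed directly from the eigenvalues of a representation's covariance matrix. The sketch below uses those common definitions; the abstract does not give the paper's exact formulas, so the function name, the choice to operate on covariance eigenvalues, and each formula here are assumptions rather than the authors' implementation.

```python
import math


def spectral_summary(eigenvalues):
    """Summarize a covariance eigenvalue spectrum with common (assumed) definitions."""
    lam = [float(v) for v in eigenvalues if v > 0]  # drop zero/negative modes
    if not lam:
        raise ValueError("need at least one positive eigenvalue")
    total = sum(lam)
    probs = [v / total for v in lam]  # normalized spectrum
    entropy = -sum(p * math.log(p) for p in probs)
    geometric_mean = math.exp(sum(math.log(v) for v in lam) / len(lam))
    return {
        # Shannon entropy of the normalized spectrum
        "spectral_entropy": entropy,
        # exp(entropy): equals the number of modes for a flat spectrum
        "effective_rank": math.exp(entropy),
        # (sum of eigenvalues)^2 / sum of squared eigenvalues
        "participation_ratio": total ** 2 / sum(v * v for v in lam),
        # sum of eigenvalues / largest eigenvalue
        "stable_rank": total / max(lam),
        # geometric mean / arithmetic mean, 1.0 for a flat spectrum
        "spectral_flatness": geometric_mean / (total / len(lam)),
    }


if __name__ == "__main__":
    print(spectral_summary([1.0, 1.0, 1.0, 1.0]))  # variance spread evenly
    print(spectral_summary([4.0, 1.0, 1.0]))       # one dominant direction
```

Under these definitions, a flatter spectrum pushes effective rank, participation ratio, and stable rank toward the number of dimensions, which is the direction of change the abstract describes over training.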