2606.18992 2026-06-18 cs.CV New submission

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

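AI summary: Proposes the CLARA framework, which lets users choose from a displayed panel of visual alternatives and combines this with likelihood-ratio recalibration to guarantee coverage over multiple rounds; it disambiguates composed image retrieval queries effectively and outperforms text-question baselines.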

Abstract

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
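
The likelihood-ratio reweighting of calibration mentioned in the abstract can be illustrated with weighted split conformal prediction. The sketch below is not CLARA's implementation: it assumes the nonconformity scores and the per-example likelihood-ratio weights are already available, and the function names `weighted_conformal_threshold` and `prediction_set` are hypothetical. It uses the standard weighted-quantile construction in which the test example's weight is attached to a score of +infinity.

```python
import math


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Return the weighted (1 - alpha) quantile of the calibration scores.

    scores      -- nonconformity scores of the calibration examples
    weights     -- non-negative likelihood-ratio weights, one per score (assumed given)
    test_weight -- weight of the test example, whose score is treated as +infinity
    alpha       -- target miscoverage level, e.g. 0.1 for 90% coverage
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights) + test_weight
    cumulative = 0.0
    # Walk the scores in increasing order, accumulating normalized weight mass.
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # The required mass sits on the test point itself: no finite threshold works.
    return math.inf


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score does not exceed the threshold."""
    return [name for name, score in candidate_scores.items() if score <= threshold]
```

With all weights equal this reduces to ordinary split conformal prediction; unequal weights shift the threshold toward calibration examples that are more likely under the distribution induced by the user's selections.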

2606.19100 2026-06-18 cs.CV New submission

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

Affiliations: NOVA School of Science and Technology; NOVA LINCS

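AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, with all resources to be released openly.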

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.19277 2026-06-18 cs.CV New submission

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

Affiliations: Engineering, North Carolina A&T State University, Greensboro, NC, USA; College of Science and Technology, North Carolina A&T State University, Greensboro, NC, USA

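AI summary: Applies RS Adapter, a parameter-efficient fine-tuning strategy, by injecting lightweight bottleneck adapters into three vision-language model architectures, achieving remote sensing VQA with fewer than 5% of parameters trainable; the hybrid FLAVA architecture strikes the best balance between multimodal reasoning and retrieval.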

Comments: 4 pages, 2 figures; accepted and to be presented at the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026), scheduled for 9 to 14 August 2026 in Washington, D.C.

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain foundation models have achieved remarkable success, their direct application to RSVQA is hindered by massive domain shifts and the computationally prohibitive nature of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision-Language Model (VLM) architectures: the dual-encoder CLIP, the encoder-decoder BLIP, and the hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation with fewer than 5 percent of the parameters trainable. Experimental results on the high-resolution RSVQA x dataset demonstrate that while all adapted models achieve convergence, the hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
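
The bottleneck adapters mentioned in the abstract follow a well-known pattern: project the hidden state down to a small dimension, apply a nonlinearity, project back up, and add the result to the input. The sketch below is a generic pure-Python illustration of that pattern, not the RS Adapter code. The class name, the ReLU nonlinearity, the zero-initialized up-projection, and the dimensions are all assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class BottleneckAdapter:
    """Residual adapter: hidden + up(relu(down(hidden)))."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Down-projection (bottleneck_dim x hidden_dim), small random init.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
                     for _ in range(bottleneck_dim)]
        # Up-projection (hidden_dim x bottleneck_dim), zero init (an assumption),
        # so the adapter is an identity map before any training.
        self.up = [[0.0] * bottleneck_dim for _ in range(hidden_dim)]

    def __call__(self, hidden):
        squeezed = [max(0.0, v) for v in matvec(self.down, hidden)]  # ReLU
        delta = matvec(self.up, squeezed)
        return [h + d for h, d in zip(hidden, delta)]  # residual connection

    def num_parameters(self):
        """Count the adapter weights (biases are omitted in this sketch)."""
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))


def trainable_fraction(adapter_params, frozen_params):
    """Share of all parameters that are trainable when the backbone is frozen."""
    return adapter_params / (adapter_params + frozen_params)
```

Because only the two small projection matrices are trained while the backbone stays frozen, the trainable share is governed by the ratio of the bottleneck width to the hidden width.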

2606.19338 2026-06-18 cs.CV New submission

Cosmos 3: An Omnimodal World Model for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

Affiliations: NVIDIA

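AI summary: Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly processes language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and provides a scalable, general-purpose backbone for embodied agents.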

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.05409 2026-06-18 cs.CV cs.CL Version update

Would you still call this Dax? Novel Visual References in VLMs and Humans

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

Affiliations: McGill University; Mila Quebec AI Institute; University of Michigan - Ann Arbor; Canada CIFAR AI Chair

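AI summary: Presents the Novel Visual References Dataset (NVRD) and compares how VLMs and humans generalize novel visual concepts, finding that models struggle to acquire new concepts that contradict prior knowledge and that they overgeneralize.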

Abstract

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM Version update

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute; Harbin Institute of Technology, Shenzhen; National University of Singapore

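AI summary: Proposes the FutureOmni benchmark for evaluating how well multimodal large language models forecast the future from audio-visual cues, finds that existing models perform poorly in speech-heavy scenarios, and designs the OFF training strategy to improve performance.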

Comments: Accepted by ICML 2026

Abstract

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2606.18583 2026-06-18 cs.CV cs.RO New submission

Mem-World: A Memory-Augmented Action-Conditioned World Model for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, Jun Shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

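AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to counter the scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.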

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

2606.19258 2026-06-18 cs.CV cs.RO New submission

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

Haohua Que, Zhipeng Bao, Qianyi Wu, Handong Yao

Affiliations: University of Georgia

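AI summary: Proposes the CABLE framework, which propagates the cloud segmentation mask on the edge using ego-motion compensation and residual-motion cues to form a region of interest (ROI), uploads only ROI-masked images, and closes a mask-to-ROI-to-LMM feedback loop; on five datasets it achieves a 73-87% reduction in ROI pixel coverage and an estimated 5-8× LMM prefill speedup.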

Abstract

Cloud-hosted large multimodal models (LMMs) can provide strong open-vocabulary perception for Vehicle-to-Everything systems, but naively transmitting full-resolution frames from edge to cloud causes severe communication overhead and high cloud-side prefill latency. We present CABLE, a cloud-assisted bandwidth-efficient LMM-based encoding framework for edge-cloud perception. CABLE propagates the previous cloud segmentation mask on the edge using ego-motion compensation, refines it with residual-motion cues, and consolidates disconnected regions via a corridor envelope to form a robust region of interest (ROI). Only ROI-masked images are uploaded, while the cloud segmentation output is fed back as the prior for the next frame, forming a mask-to-ROI-to-LMM feedback loop. Experiments on five datasets (nuScenes, WOD-ZB, Waymo, KITTI, and CADC) show consistent communication savings while largely preserving perception, achieving a 73-87% ROI pixel-coverage reduction with an estimated 5-8× LMM prefill speedup at a modest detection-quality trade-off relative to full-frame inference.
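
To make the pixel-coverage figure concrete, the sketch below masks a frame with a boolean ROI grid and reports the fraction of pixels that no longer need to be uploaded. It only illustrates the final masking and measurement step. It does not implement CABLE's ego-motion propagation, residual-motion refinement, or corridor envelope, and the function names and the zero fill value are assumptions.

```python
def apply_roi_mask(frame, mask, fill=0):
    """Keep pixels where the mask is truthy and replace the rest with `fill`."""
    return [
        [pixel if keep else fill for pixel, keep in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


def coverage_reduction(mask):
    """Fraction of pixels outside the ROI, i.e. the share that is masked out."""
    total = sum(len(row) for row in mask)
    kept = sum(1 for row in mask for keep in row if keep)
    return 1.0 - kept / total
```

For example, a mask that keeps 2 of 8 pixels gives a coverage reduction of 0.75, which falls inside the 73-87% range reported in the abstract.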

2606.18610 2026-06-18 cs.RO cs.CV Cross-list

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Wei-Cheng Tseng, Gashon Hussein, Yuzhu Dong, Allen Z. Ren, Lucy X. Shi, XuDong Wang, Sergey Levine, Zhaoshuo Li, Jinwei Gu, Florian Shkurti, Ming-Yu Liu, Quan Vuong

Affiliations: University of Toronto; Vector Institute; NVIDIA; Physical Intelligence; Stanford University; UC Berkeley; Allen Institute for AI

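AI summary: Proposes SC3-Eval, which uses forward-inverse dynamics consistency, cross-view consistency, and test-time consistency to turn a pre-trained video foundation model into an accurate policy evaluator, reaching a Pearson correlation of 0.929 across seven real-world policies.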

Abstract

Evaluating generalist robot manipulation policies in the real world is expensive, slow, and difficult to scale. Action-conditioned video world models offer a scalable alternative by simulating policy rollouts. However, autoregressive rollouts accumulate compounding errors, observations across multiple camera views must remain mutually consistent, and the evaluator must generalize to policies whose behaviors lie outside the training distribution. We address these challenges with SC3-Eval, a self-consistent video generation recipe that adapts a pre-trained video foundation model into an accurate policy evaluator by enforcing three complementary forms of consistency. First, forward-inverse dynamics consistency jointly trains the model to predict frames from actions and to recover actions from frames, anchoring generated rollouts to a physically plausible action manifold and counteracting the drift a forward-only model cannot penalize. Second, cross-view consistency trains the model to inpaint each camera view from the other, keeping the multi-camera observation coherent over long rollouts without any explicit memory mechanism. Third, test-time consistency reuses the inverse dynamics mode at inference as a per-action-chunk uncertainty signal that terminates rollouts whose generated frames drift away from the requested actions. We also demonstrate that SC3-Eval rollouts reproduce the failure modes that policies exhibit in real-world rollouts, supporting fine-grained diagnostic comparison rather than aggregate ranking alone. Across seven real-world vision-language-action policies, SC3-Eval attains a closed-loop Pearson correlation of 0.929 and MMRV of 0.119, outperforming three strong prior video-model-based baselines, and generalizes to new tasks.
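
The Pearson correlation quoted in the abstract measures how linearly the evaluator's per-policy scores track real-world performance. The sketch below computes the standard sample Pearson coefficient in pure Python; the function name and the idea of passing per-policy success rates are illustrative assumptions, and the abstract's other metric, MMRV, is not covered here.

```python
import math


def pearson_correlation(predicted, observed):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Co-deviation and squared deviations around each mean.
    covariance = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_p * var_o)
```

A value near 1 means that policies scoring higher in simulated rollouts also tend to score proportionally higher on the real robot.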

2606.19067 2026-06-18 cs.RO cs.CV Cross-list

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Roberto Corlito, Fabian Schmidt, Nils Seibert, Markus Enzweiler, Abhinav Valada, Arne Roennau

Affiliations: Machine Intelligence and Robotics Lab, Karlsruhe Institute of Technology (KIT); Institute for Intelligent Systems, Esslingen University of Applied Sciences; Department of Computer Science, University of Freiburg

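AI summary: Addresses sensor configuration under quadruped locomotion by systematically evaluating visual, visual-inertial, and LiDAR-visual-inertial SLAM methods, finding that stereo cameras and global shutters substantially improve localization robustness, whereas standard inertial integration can degrade primarily vision-based frameworks.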

Abstract

Autonomous navigation of quadrupedal robots in diverse environments fundamentally relies on resilient Simultaneous Localization and Mapping (SLAM). While visual-inertial SLAM has matured across wheeled, handheld, and aerial platforms, a critical evaluation gap remains regarding how hardware-level sensor configurations affect performance under the aggressive dynamics of legged locomotion. Quadrupeds introduce distinct embodiment-induced sensory challenges, including foot-impact shocks, high-frequency mechanical vibrations, and rapid angular rotations, which degrade standard perception pipelines. To address this gap, we present a systematic evaluation of state-of-the-art visual, visual-inertial, and LiDAR-visual-inertial SLAM methods using the GrandTour dataset recorded on an ANYmal D quadruped. We isolate and quantify the impacts of camera modalities, shutter techniques, and inertial sensor tiers, analyzing their trade-offs across localization accuracy, algorithmic robustness, and computational resource utilization. Our empirical findings demonstrate that hardware selection has substantial influence on system resilience: stereo configurations consistently outperform monocular and RGB-D modalities, global shutter cameras significantly mitigate motion-induced tracking failures compared to rolling shutter cameras, and, crucially, standard inertial integration can degrade the performance of primarily vision-based frameworks under harsh legged locomotion. These insights additionally offer concrete design guidelines for tailoring custom sensor payloads to achieve dependable perception on agile legged systems.

2606.19240 2026-06-18 cs.RO cs.CV cs.HC cs.SY eess.SY Cross-list

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Thomas M. Kwok, Nicholas Koenig, Yue Hu

Affiliations: University of Waterloo

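AI summary: Proposes an Arm Kinematic Correction method that uses the constant-arm-length geometric constraint and the Pythagorean theorem to deterministically reconstruct the depth of occluded joints without complex modeling; it is validated against Vicon and successfully applied to teleoperation.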

Abstract

Markerless, single-RGB-D-camera motion capture provides a low-cost and non-invasive alternative to conventional marker-based systems for robot teleoperation; however, depth estimation often degrades in the presence of self-occlusion, particularly during upper-limb motion. This paper presents an Arm Kinematic Correction (AKC) method that improves depth estimation by enforcing geometric constraints based on constant arm lengths. The proposed approach reconstructs occluded joint depths by leveraging wrist positions and predefined arm lengths via a deterministic formulation based on the Pythagorean theorem, thereby avoiding the need for complex probabilistic modeling or parameter tuning. Experimental validation against a Vicon reference system demonstrates reliable performance for both static and dynamic joint motions, evaluated using root-mean-square error (RMSE) and Pearson correlation. Furthermore, motion-mapping teleoperation is successfully demonstrated in both simulated and physical robot environments. The results show that AKC enhances robustness and preserves anatomical consistency under long-duration, severe self-occlusion, even when paired with less reliable temporal filters, highlighting its practicality for real-time applications such as robot teleoperation and human-robot interaction.
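
The constant-length constraint described in the abstract can be written as a one-line application of the Pythagorean theorem: if a limb segment of known length L connects the wrist to an occluded joint, and the joint's lateral position is still observable, then the depth offset is sqrt(L^2 - dx^2 - dy^2). The sketch below is one possible reading of that idea, not the paper's exact formulation. It assumes metric camera coordinates, a caller-supplied flag that resolves the front/back sign ambiguity, and clamping when noise makes the planar distance exceed the segment length; the function name is hypothetical.

```python
import math


def reconstruct_joint_depth(wrist, joint_xy, segment_length, joint_behind_wrist=True):
    """Estimate the depth (z) of an occluded joint from a fixed segment length.

    wrist              -- (x, y, z) of the wrist in metric camera coordinates
    joint_xy           -- (x, y) of the occluded joint in the same coordinates
    segment_length     -- predefined length of the segment joining the two points
    joint_behind_wrist -- assumed sign choice: True places the joint farther away
    """
    wrist_x, wrist_y, wrist_z = wrist
    joint_x, joint_y = joint_xy
    planar_sq = (joint_x - wrist_x) ** 2 + (joint_y - wrist_y) ** 2
    # Pythagorean theorem: length^2 = planar^2 + depth_offset^2.
    depth_sq = segment_length ** 2 - planar_sq
    # Clamp at zero so measurement noise cannot produce a negative radicand.
    depth_offset = math.sqrt(max(depth_sq, 0.0))
    return wrist_z + depth_offset if joint_behind_wrist else wrist_z - depth_offset
```

Because the result depends only on the current frame and a fixed length, the estimate does not accumulate drift over a long occlusion, which matches the abstract's description of the formulation as deterministic.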

2606.19333 2026-06-18 cs.RO cs.CV Cross-list

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

Affiliations: UC Berkeley

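AI summary: Proposes the DO AS I DO algorithm, which reconstructs hand-object interactions from monocular RGB human videos and retargets them to multi-fingered dexterous robotic hands, producing executable manipulation data and outperforming existing methods.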

Comments: Project website: https://do-as-i-do.com/

Abstract

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2603.11417 2026-06-18 cs.CV cs.LG Version update

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

端到端自动驾驶中的零样本跨城市泛化：自监督与监督表示

Fatemeh Naeinian, Ali Hamza, Haoran Zhu, Anna Choromanska

发表机构 * Department of Electrical and Computer Engineering, NYU Tandon School of Engineering（电气与计算机工程系，纽约大学Tandon工程学院）

AI总结研究端到端自动驾驶模型在跨城市零样本迁移中的泛化能力，发现自监督预训练（如I-JEPA、DINOv2、MAE）相比监督预训练能显著减少位移和碰撞退化，提升闭环评估中的分布外PDMS。

详情

AI中文摘要

端到端自动驾驶模型通常使用监督的ImageNet预训练骨干网络在多城市数据集上训练，但其泛化到未见城市的能力尚未得到充分检验。当训练和评估数据在地理上混合时，模型可能隐含地依赖城市特定线索，掩盖了在真实世界域偏移下泛化到新位置时可能出现的失败模式。在这项工作中，我们将零样本跨城市迁移定义为端到端自动驾驶的受控表示级压力测试，并探究视觉预训练如何影响地理域偏移下的迁移行为。我们通过将自监督骨干网络I-JEPA、DINOv2和MAE集成到规划框架中进行了全面研究。我们在nuScenes上的开环设置和NAVSIM上的闭环评估协议中，在严格的地理划分下评估性能。我们的实验揭示了当模型在不同道路拓扑、交通规则和视觉环境的城市间迁移时存在显著的泛化差距。在开环评估中，监督骨干网络在城市间迁移时表现出严重退化，而某些领域特定的自监督方法可以显著减少位移和碰撞退化。在闭环评估中，自监督预训练在多个单城市训练设置中提高了平均分布外PDMS。我们的结果提供了经验证据，表明表示学习影响跨城市规划的鲁棒性，并促使将零样本地理迁移作为评估端到端自动驾驶系统的重要压力测试。

英文摘要

End-to-end autonomous driving models are typically trained on multi-city datasets using supervised ImageNet-pretrained backbones, yet their ability to generalize to unseen cities remains largely unexamined. When training and evaluation data are geographically mixed, models may implicitly rely on city-specific cues, masking failure modes that would occur under real-world domain shifts when generalizing to new locations. In this work, we formulate zero-shot cross-city transfer as a controlled representation-level stress test for end-to-end autonomous driving and ask how visual pretraining affects transfer behavior under geographic domain shift. We conduct a comprehensive study by integrating self-supervised backbones I-JEPA, DINOv2, and MAE into planning frameworks. We evaluate performance under strict geographic splits on nuScenes in the open-loop setting and on NAVSIM in the closed-loop evaluation protocol. Our experiments reveal a substantial generalization gap when transferring models across cities with different road topologies, traffic conventions, and visual environments. In open-loop evaluation, a supervised backbone exhibits severe degradation when transferring between cities, yet some domain-specific self-supervised methods can substantially reduce both displacement and collision degradation. In closed-loop evaluation, self-supervised pretraining improves average out-of-distribution PDMS in several single-city training settings. Our results provide empirical evidence that representation learning influences the robustness of cross-city planning and motivate zero-shot geographic transfer as an important stress test for evaluating end-to-end autonomous driving systems.

2606.17030 2026-06-18 cs.CV 版本更新

域偏移下基于注意力机制和迁移学习的鲁棒桃叶损伤分类

Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta

发表机构 * Department of Information and Communication Engineering（信息与通信工程系）； University of Murcia（穆尔西亚大学）； Department of Irrigation, Centro de Edafología y Biología Aplicada del Segura CEBAS-CSIC（灌溉系，塞古拉土壤学与应用生物学中心CEBAS-CSIC）

AI总结提出基于注意力机制和迁移学习的桃叶损伤分类方法，通过CBAM增强EfficientNet模型在公共数据集上达到93.3%准确率，并在本地数据集上通过迁移学习实现93%宏F1分数，有效应对域偏移。

详情

AI中文摘要

人工智能为从图像数据评估作物损伤提供了实用框架，支持农业管理中的早期决策。在桃园中，气候变化增加了非生物胁迫和生物压力，包括病虫害，这些通常产生视觉上相似的叶片症状。这种重叠使得手动诊断变得困难，尤其是在不同环境条件下的多个田地中，凸显了对具有强泛化能力的自动化模型的需求。我们提出了一种基于图像的桃叶损伤检测分类方法。通过手动标注公开图像创建了一个基准数据集，包含六个损伤类别的1,366片桃叶。评估了几种深度学习架构。EfficientNet模型取得了最佳结果，其中EfficientNetB0达到92.9%的准确率，EfficientNetB3达到91.5%，EfficientNetB5在少数类上表现最强。DenseNet121达到92.6%的准确率。卷积块注意力模块（CBAM）的集成在多个骨干网络中提升了性能，特别是在EfficientNetB5和InceptionV3中，而在其他网络中效果有限或为负。CBAM增强的EfficientNetB5取得了93.3%的最佳总体准确率。为了评估在现实条件下的鲁棒性，收集了一个包含四个类别180张图像的本地数据集，并应用迁移学习策略来解决域偏移。测试了三种微调策略。结合CBAM的EfficientNetB3在本地域中取得了最佳性能，迁移后宏F1分数达到93%。总体而言，基于注意力的模型在少数类上表现出更强的鲁棒性，并在不同田间条件下具有更好的泛化能力。

英文摘要

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.
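The macro F1-score quoted for the local dataset averages the per-class F1 values without weighting them by class size, which is why it is sensitive to performance on minority classes. The following Python sketch illustrates that metric only; the label names and toy predictions are made up for illustration and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    f1_values = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_values.append(2 * tp / denom if denom else 0.0)
    return sum(f1_values) / len(f1_values)


# Hypothetical toy data: a classifier that ignores the minority class.
y_true = ["healthy"] * 8 + ["rare_damage"] * 2
y_pred = ["healthy"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case plain accuracy is 0.8 while macro F1 is about 0.44, because the missed minority class contributes an F1 of zero to the average.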

2602.04401 2026-06-18 cs.RO cs.CV 版本更新

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

视觉地点识别中可靠操作点选择的分位数迁移

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

发表机构 * QUT Centre for Robotics（昆士兰理工大学机器人中心）； School of Electrical Engineering and Robotics（电气工程与机器人学院）； Queensland University of Technology（昆士兰理工大学）

AI总结提出一种通过分位数归一化迁移阈值的方法，自动选择视觉地点识别系统的操作点，在100%精度下最大化召回率，无需手动调参。

Comments Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026

详情

AI中文摘要

视觉地点识别（VPR）是全球导航卫星系统（GNSS）受限环境中定位的关键组成部分，但其性能严重依赖于选择平衡精度和召回率的图像匹配阈值（操作点）。阈值通常针对特定环境离线手动调整，并在部署期间固定，导致在环境变化下性能下降。我们提出一种方法，自动选择VPR系统的操作点，以在100%精度下最大化召回率。该方法使用已知对应关系的小型校准遍历，并通过相似度得分分布的分位数归一化将阈值迁移到部署中。这种分位数迁移确保阈值在校准大小和查询子集上保持稳定。在五个基准数据集上使用七种最先进的VPR技术进行的实验表明，我们提出的方法始终优于现有基线，使底层VPR技术在大约两倍的部署场景中（中位数改进）以100%精度运行，同时在该精度下检索到的正确匹配最多增加29%。该方法通过适应新环境并在操作条件下泛化，消除了手动调整。我们的代码可在 https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR 获取。

英文摘要

Visual Place Recognition (VPR) is a key component for localisation in Global Navigation Satellite System (GNSS)-denied environments, but its performance critically depends on selecting an image matching threshold (operating point) that balances precision and recall. Thresholds are typically hand-tuned offline for a specific environment and fixed during deployment, leading to degraded performance under environmental change. We propose a method that automatically selects the operating point of a VPR system to maximise recall at 100% precision. The method uses a small calibration traversal with known correspondences and transfers thresholds to deployment via quantile normalisation of similarity score distributions. This quantile transfer ensures that thresholds remain stable across calibration sizes and query subsets. Experiments with seven state-of-the-art VPR techniques across five benchmark datasets demonstrate that our proposed approach consistently outperforms existing baselines, enabling the underlying VPR technique to operate at 100% precision in approximately twice as many deployment scenarios (median improvement), while retrieving up to 29% more correct matches at that precision. The method eliminates manual tuning by adapting to new environments and generalising across operating conditions. Our code is available at https://github.com/DhyeyR-007/Quantile-Transfer-for-Reliable-VPR.
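The abstract does not spell out the transfer procedure, so the Python sketch below is only one plausible reading of "quantile transfer", not the authors' implementation. It assumes that higher scores mean more similar images, finds the strictest calibration threshold that yields 100% precision, and maps it to the deployment score distribution at the same empirical quantile. The function names and toy scores are hypothetical.

```python
from bisect import bisect_right


def calibration_threshold(scores, is_correct):
    """Highest similarity score among incorrect calibration matches.

    Accepting only matches that score strictly above this value gives
    100% precision on the calibration traversal.
    """
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    return max(wrong) if wrong else float("-inf")


def transfer_threshold(cal_scores, cal_correct, deploy_scores):
    """Map the calibration threshold onto the deployment score scale."""
    if not cal_scores or not deploy_scores:
        raise ValueError("both score lists must be non-empty")
    cal_sorted = sorted(cal_scores)
    dep_sorted = sorted(deploy_scores)
    t_cal = calibration_threshold(cal_scores, cal_correct)
    # Rank of the threshold: how many calibration scores are <= t_cal.
    rank = bisect_right(cal_sorted, t_cal)
    n_cal, n_dep = len(cal_sorted), len(dep_sorted)
    # Same quantile level on the deployment side: ceil(rank * n_dep / n_cal),
    # computed with integers to avoid floating-point rounding.
    dep_rank = -(-rank * n_dep // n_cal)
    dep_rank = max(1, min(n_dep, dep_rank))
    return dep_sorted[dep_rank - 1]


# Hypothetical toy scores for illustration only.
cal_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
cal_correct = [False, True, False, True, True]
deploy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = transfer_threshold(cal_scores, cal_correct, deploy_scores)
```

Working with ranks rather than raw score values is the point of the design: the deployment threshold depends on where the calibration threshold sits in its distribution, not on the absolute similarity scale of the calibration environment.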

2606.18566 2026-06-18 cs.CV cs.AI cs.GR 新提交

CrossEarth-Gate：基于Fisher引导的自适应调优引擎用于高效跨域遥感语义分割

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

发表机构 * Sun Yat-sen University（中山大学）； The Chinese University of Hong Kong（香港中文大学）； Shanghai Jiao Tong University（上海交通大学）； National Supercomputing Center in Shenzhen（深圳国家超算中心）； The Hong Kong University of Science and Technology（香港科技大学）； Beijing Institute of Technology（北京理工大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Tsinghua University（清华大学）

AI总结提出CrossEarth-Gate，通过Fisher信息引导的自适应模块选择机制，动态激活最关键的跨域模块，在18个跨域基准中16个达到最优性能。

详情

AI中文摘要

在遥感（RS）中，参数高效微调（PEFT）已成为激活基础模型泛化表示能力以用于下游任务的关键方法。然而，现有的专用PEFT方法在应用于大规模地球观测任务时常常失败，因为它们无法完全处理遥感数据中固有的多面且不可预测的域差距（例如空间、语义和频率偏移）。为克服这一问题，我们提出CrossEarth-Gate，它包含两个主要贡献。首先，我们建立了一个全面的遥感模块工具箱，以解决多方面的域差距，包括空间、语义和频率模块。其次，我们开发了一种基于Fisher引导的自适应选择机制，该机制作用于该工具箱。该选择由Fisher信息引导，通过衡量每个模块对任务特定梯度流的贡献来量化其重要性。它动态地仅在适当层激活最关键模块，引导梯度流以最大化适应效果和效率。全面实验验证了我们方法的有效性和泛化能力，其中CrossEarth-Gate在18个遥感语义分割跨域基准中的16个上达到了最先进性能。

英文摘要

In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (e.g., spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance on 16 out of 18 cross-domain benchmarks for RS semantic segmentation.
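To make the idea of Fisher-guided selection concrete, the Python sketch below scores each candidate module by the empirical diagonal Fisher information (the mean squared gradient of its parameters) and keeps the top-k modules. This is a generic approximation chosen for illustration; the abstract does not state how CrossEarth-Gate computes or applies its scores, and the module names and gradient values here are hypothetical.

```python
def module_fisher_scores(per_sample_grads):
    """Empirical diagonal Fisher score per module.

    per_sample_grads maps a module name to a list of per-sample gradient
    vectors (lists of floats). The score is the mean squared gradient,
    averaged over samples and over the module's parameters.
    """
    scores = {}
    for name, grads in per_sample_grads.items():
        if not grads or not grads[0]:
            raise ValueError(f"no gradients recorded for module {name!r}")
        total = sum(g * g for vec in grads for g in vec)
        scores[name] = total / (len(grads) * len(grads[0]))
    return scores


def select_top_modules(scores, k):
    """Names of the k modules with the largest Fisher scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical gradients for three module types, two samples each.
grads = {
    "spatial": [[1.0, 1.0], [1.0, -1.0]],
    "semantic": [[0.1, 0.0], [0.0, 0.1]],
    "frequency": [[3.0, 0.0], [0.0, 0.0]],
}
scores = module_fisher_scores(grads)
active = select_top_modules(scores, k=2)
```

Averaging over parameters keeps modules of different sizes comparable; summing instead would favor larger modules, which may or may not be desirable.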

2602.07544 2026-06-18 cs.CV 版本更新

MUFASA: A Multi-Layer Framework for Slot Attention

MUFASA: 一种用于槽注意力的多层框架

Sebastian Bock, Leonie Schüßler, Krishnakant Singh, Simone Schaub-Meyer, Stefan Roth

发表机构 * TU Darmstadt（达姆施塔特工业大学）； Zuse School ELIZA（楚泽学院ELIZA）

AI总结提出MUFASA，一种轻量级即插即用框架，通过跨ViT编码器多层计算槽注意力并融合，提升无监督对象中心学习的分割性能，达到新最优。

Comments CVPR 2026. Authors Sebastian Bock and Leonie Schüßler contributed equally. Project page: https://visinf.github.io/mufasa/

详情

AI中文摘要

无监督对象中心学习（OCL）将视觉场景分解为不同的实体。槽注意力是一种流行的方法，将单个对象表示为潜在向量，称为槽。当前方法仅从预训练视觉变换器（ViT）的最后一层获取这些槽表示，忽略了跨其他层编码的宝贵、语义丰富的信息。为了更好地利用这些潜在语义信息，我们引入了MUFASA，一种用于基于槽注意力的无监督对象分割方法的轻量级即插即用框架。我们的模型跨ViT编码器的多个特征层计算槽注意力，充分利用其语义丰富性。我们提出了一种融合策略，将在多个层上获得的槽聚合成统一的以对象为中心的表示。将MUFASA集成到现有的OCL方法中，提高了它们在多个数据集上的分割结果，在仅增加少量推理开销的同时，建立了新的最先进水平并改善了训练收敛性。

英文摘要

Unsupervised object-centric learning (OCL) decomposes visual scenes into distinct entities. Slot attention is a popular approach that represents individual objects as latent vectors, called slots. Current methods obtain these slot representations solely from the last layer of a pre-trained vision transformer (ViT), ignoring valuable, semantically rich information encoded across the other layers. To better utilize this latent semantic information, we introduce MUFASA, a lightweight plug-and-play framework for slot-attention-based approaches to unsupervised object segmentation. Our model computes slot attention across multiple feature layers of the ViT encoder, fully leveraging their semantic richness. We propose a fusion strategy to aggregate slots obtained on multiple layers into a unified object-centric representation. Integrating MUFASA into existing OCL methods improves their segmentation results across multiple datasets, setting a new state of the art while simultaneously improving training convergence with only minor inference overhead.

2603.13941 2026-06-18 cs.CV 版本更新

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

高分辨率RGB与低分辨率高光谱输入的双向交叉注意力融合用于多模态语义分割

Jonas V. Funk, Lukas Roming, Andreas Michel, Paul Bäcker, Georg Maier, Thomas Längle, Markus Klute

发表机构 * KIT, Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； Fraunhofer IOSB, Fraunhofer Institute of Optronics, System Technologies and Image Exploitation（弗劳恩霍夫光电、系统技术与图像利用研究所）

AI总结提出双向交叉注意力融合（BCAF），通过局部双向交叉注意力对齐高分辨率RGB与低分辨率高光谱图像，避免预上采样或早期光谱坍缩，在实时约束下提升多模态分割性能。

Comments Submitted to Image and Vision Computing (Elsevier). 23 pages, 10 figures, 7 tables

详情

AI中文摘要

异构传感器的多模态语义分割必须协调空间分辨率和通道维度不同的模态间的互补信息。具体而言，高分辨率RGB成像提供详细的空间结构，但通常难以区分视觉上相似的材料，而高光谱成像（HSI）提供判别性光谱特征，但空间分辨率较低。我们提出双向交叉注意力融合（BCAF），通过局部化、双向交叉注意力在原生网格上对齐高分辨率RGB与低分辨率HSI，避免预上采样或早期光谱坍缩。BCAF使用两个独立骨干网络：一个用于RGB的标准Swin Transformer，以及一个用于HSI的适应型Swin骨干网络，通过带有光谱自注意力的3D令牌化保留光谱结构。尽管我们的评估针对RGB-HSI融合，但BCAF是模态无关的，适用于与低分辨率、高通道辅助传感器配准的RGB。在基准SpectralWaste数据集上，BCAF以55图像/秒的速度达到75.4%的性能。我们进一步评估了一个新的工业数据集：K3I-Cycling（首个RGB子集已在Fordatis上发布）。在该数据集上，BCAF在材料分割（纸张、金属、塑料等）上达到62.3% mIoU，在塑料类型分割（PET、PP、HDPE、LDPE、PS等）上达到66.2% mIoU。这些结果表明，保留原生网格空间细节和光谱结构可在实时约束下改善多模态分割。代码和模型检查点已公开于 https://github.com/jonasvilhofunk/BCAF_2026。

英文摘要

Multimodal semantic segmentation with heterogeneous sensors must reconcile complementary information across modalities that differ in spatial resolution and channel dimensionality. In particular, high-resolution RGB imaging provides detailed spatial structure but often fails to distinguish visually similar materials, whereas hyperspectral imaging (HSI) provides discriminative spectral signatures but at lower spatial resolution. We present Bidirectional Cross-Attention Fusion (BCAF), which aligns high-resolution RGB with low-resolution HSI at their native grids via localized, bidirectional cross-attention, avoiding pre-upsampling or early spectral collapse. BCAF uses two independent backbones: a standard Swin Transformer for RGB and an HSI-adapted Swin backbone that preserves spectral structure through 3D tokenization with spectral self-attention. Although our evaluation targets RGB-HSI fusion, BCAF is modality-agnostic and applies to co-registered RGB with lower-resolution, high-channel auxiliary sensors. On the benchmark SpectralWaste dataset, BCAF delivers strong performance, achieving 75.4% at 55 images/s. We further evaluate a novel industrial dataset: K3I-Cycling (first RGB subset already released on Fordatis). On this dataset, BCAF reaches 62.3% mIoU for material segmentation (paper, metal, plastic, etc.) and 66.2% mIoU for plastic-type segmentation (PET, PP, HDPE, LDPE, PS, etc.). These results show that preserving native-grid spatial detail and spectral structure improves multimodal segmentation under real-time constraints. Code and model checkpoints are publicly available at https://github.com/jonasvilhofunk/BCAF_2026.

2604.05527 2026-06-18 cs.CV 版本更新

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

先验引导的多模态特征融合用于光学-SAR图像变化检测

Xuanguang Liu, Lei Ding, Yujie Li, Chenguang Dai, Zhenchao Zhang, Mengmeng Li, Ziyi Yang, Yifan Sun, Yongqi Sun, Hanyun Wang, Lorenzo Bruzzone

发表机构 * Institute of Geospatial Information, Information Engineering University（地理信息研究所，信息工程大学）； Academy of Digital China (Fujian), Fuzhou University（数字中国研究院（福建），福州大学）； The School of Electronics and Communication Engineering, Sun Yat-sen University（电子与通信工程学院，中山大学）； The Department of Information Engineering and Computer Science, University of Trento（信息工程与计算机科学系，特伦托大学）

AI总结提出STSF-Net框架，联合建模模态特定和时空共同特征，并利用视觉基础模型的语义先验自适应融合多模态特征，在三个数据集上达到最优性能。

详情

AI中文摘要

多模态变化检测（MMCD）识别多模态遥感数据中的变化区域，在土地利用监测和城市可持续发展中具有重要应用价值。然而，现有MMCD方法在跨模态交互和利用模态特定特征方面存在局限性，导致对细粒度变化信息的建模不足，从而阻碍了语义变化的精确检测。为解决这些问题，我们提出了STSF-Net，一个专为光学和SAR图像之间的MMCD设计的框架。STSF-Net联合建模模态特定特征和时空共同特征以增强变化表示。具体而言，利用模态特定特征捕获真实的语义变化信号，同时嵌入时空共同特征以抑制由成像机制差异引起的伪变化。此外，我们引入了一种光学和SAR特征融合策略，该策略基于从视觉基础模型获得的语义先验自适应调整多模态特征的重要性。最后，我们引入了新的Delta-SN6数据集，这是第一个公开可访问的多类MMCD基准，包含极高分辨率全极化SAR和光学图像。在Delta-SN6、BRIGHT和Wuhan数据集上的实验结果表明，我们的方法在mIoU上分别比最先进方法高出3.21%、0.87%和1.32%。

英文摘要

Multimodal change detection (MMCD) identifies changed areas in multimodal remote sensing data, demonstrating significant application value in land use monitoring and sustainable urban development. However, existing MMCD approaches exhibit limitations in both cross-modal interaction and the exploitation of modality-specific characteristics. This leads to insufficient modeling of fine-grained change information, thus hindering the precise detection of semantic changes. To address these problems, we propose STSF-Net, a framework designed for MMCD between optical and SAR images. STSF-Net jointly models modality-specific and spatio-temporal common features to enhance change representations. Specifically, modality-specific features are exploited to capture genuine semantic change signals, while spatio-temporal common features are embedded to suppress pseudo-changes caused by differences in imaging mechanisms. Furthermore, we introduce an optical and SAR feature fusion strategy that adaptively adjusts multimodal feature importance based on semantic priors obtained from visual foundation models. Finally, we introduce the novel Delta-SN6 dataset, the first openly accessible multiclass MMCD benchmark consisting of very-high-resolution fully polarimetric SAR and optical images. Experimental results on Delta-SN6, BRIGHT, and Wuhan datasets demonstrate that our method outperforms the state-of-the-art by 3.21%, 0.87%, and 1.32% in mIoU, respectively.

2606.08206 2026-06-18 cs.CV cs.LG 版本更新

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

SegmentAnyTreeV2：跨传感器、平台和森林的基于Transformer的树木实例分割扩展

Maciej Wielgosz, Stefano Puliti, Rasmus Astrup

发表机构 * Norwegian Institute of Bioeconomy Research (NIBIO)（挪威生物经济研究所（NIBIO））

AI总结提出SegmentAnyTreeV2，一种传感器和平台无关的森林点云语义与实例分割框架，结合Point Transformer v3骨干网络、轻量语义头和树木交叉注意力掩码解码器，在FOR-instanceV2测试集上达到90.5%精度和80.2%召回率，并展现出强跨域泛化能力。

Comments 25 pages, 6 figures, 10 tables, Corrected bibliography metadata and minor typographical issues; results unchanged

详情

AI中文摘要

我们提出SegmentAnyTreeV2，一种传感器和平台无关的森林点云语义与实例分割框架。该模型结合了基于序列化的Point Transformer v3骨干网络、轻量级语义头以及专注于树木的交叉注意力掩码解码器。语义预测将实例解码限制在树木类体素上，而实例感知的查询初始化、一对多种子监督和非对称掩码评分改善了密集和结构复杂林分中的分离效果。我们进一步引入了FOR-instance v3，一个扩展的基准数据集，包含427个场景和26,496棵标注树木，涵盖不同生物群落、森林结构和LiDAR平台。在FOR-instanceV2测试集上，SegmentAnyTreeV2实现了90.5%的精度、80.2%的召回率、85.0%的F1分数、90.7%的覆盖率和87.6%的语义mIoU，在实例检测和掩码完整性方面均优于以往基于学习的方法。在独立站点上的零样本评估进一步证明了其强大的跨域泛化能力。

英文摘要

We present SegmentAnyTreeV2, a sensor- and platform-agnostic framework for semantic and instance segmentation of forest point clouds. The model combines a serialization-based Point Transformer v3 backbone with a lightweight semantic head and a tree-focused cross-attention mask decoder. Semantic predictions restrict instance decoding to tree-class voxels, while instance-aware query initialization, one-to-many seed supervision, and asymmetric mask scoring improve separation in dense and structurally complex stands. We further introduce FOR-instance v3, an expanded benchmark comprising 427 scenes and 26,496 annotated trees across diverse biomes, forest structures, and LiDAR platforms. On the FOR-instanceV2 test split, SegmentAnyTreeV2 achieves 90.5% precision, 80.2% recall, 85.0% F1, 90.7% coverage, and 87.6% semantic mIoU, outperforming previous learning-based methods in both instance detection and mask completeness. Zero-shot evaluation on independent sites further demonstrates strong cross-domain generalization.
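The reported F1 score is consistent with the reported precision and recall, since F1 is their harmonic mean. A short Python check of that arithmetic follows, using only the figures from the abstract; the function name is an illustrative choice.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, as quoted in the abstract.
f1 = f1_from_precision_recall(90.5, 80.2)
f1_rounded = round(f1, 1)
```

The result rounds to 85.0, matching the F1 value stated in the abstract.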

2606.18441 2026-06-18 cs.CV 新提交

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集：视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； Beijing University of Posts and Telecommunications（北京邮电大学）； Cloud and AI BU, Huawei（华为云与AI业务部）； School of Computing, National University of Singapore（新加坡国立大学计算机学院）

AI总结提出无时间标注的过程级奖励框架CF-GRPO，通过视频内在线索构建一致性帧先验，并利用一致性帧奖励优化模型帧使用与先验的对齐，提升视频推理性能。

详情

AI中文摘要

强化学习提升了大型语言模型的推理能力，但将仅结果奖励应用于视频多模态大语言模型（Video-MLLMs）时，对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发（其中一致的线索可以增强感知估计的显著性和可靠性），我们引入了一致性帧GRPO（CF-GRPO），一种无需时间标注的过程级奖励框架，用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验，包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后，它从视觉和响应表示中计算模型侧的帧使用分数，并通过一致性帧奖励（CFR）优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化，CFR提供了高对比度的奖励信号，无需人工时间标注。实验表明，VideoCFR在复杂视频推理基准上取得了有竞争力的性能，并在多个指标上优于代表性的Video-MLLM和RL基线，同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见：https://github.com/1Pansy/VideoCFR。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.18558 2026-06-18 cs.CV 新提交

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

MolmoMotion: 基于语言指令的3D点轨迹预测

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

发表机构 * Allen Institute for AI（艾伦人工智能研究所）； University of Washington（华盛顿大学）； UNC-Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结提出一种基于语言指令的3D点运动预测方法，通过构建大规模数据集和基准，实现类无关、视角稳定的运动轨迹预测，并在机器人操作和视频生成中验证其有效性。

详情

AI中文摘要

运动预测是视觉智能的核心：智能体必须预测物体如何运动，以规划行动、推理物理交互并合成逼真的未来场景。我们认为，世界坐标系中的3D点提供了一种通用表示，具有类无关、视角稳定、紧凑且对下游任务直接有用的特性。我们形式化了目标条件3D点运动预测任务：给定一段短视觉历史、目标物体上的一组3D查询点以及预期目标的语言描述，模型预测每个点的未来3D轨迹。我们引入了一个完整的堆栈来大规模研究此任务：(1) MolmoMotion-1M是一个大型语料库，包含从116万无约束视频中标注的动作描述、物体锚定的3D点轨迹；(2) PointMotionBench是一个人工验证的基准，涵盖111个物体类别和61种运动类型；(3) MolmoMotion是一个通用运动预测模型，支持自回归坐标预测和基于流匹配的轨迹生成。MolmoMotion能准确预测不同语言指令下的多样运动模式，并在PointMotionBench上显著优于现有运动预测基线。最后，我们展示了学习到的3D运动先验能很好地迁移到下游应用：它提高了机器人操作的训练效率和泛化能力，其预测轨迹为生成模型提供了有效的运动指导，以合成具有更真实物体运动的视频。

英文摘要

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18586 2026-06-18 cs.CV cs.AI 新提交

APT: Atomic Physical Transitions for Causal Video-Language Understanding

APT: 用于因果视频语言理解的原子物理转变

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

发表机构 * Northwestern University（西北大学）； Dolby Laboratories（杜比实验室）

AI总结提出原子物理转变（APT）作为视频中因果状态变化的显式表示，并构建混合来源数据集，通过APT-Tune微调方法使VLM学习物理转变而不遗忘事件级知识。

详情

AI中文摘要

物理事件不仅通过其名称来理解，还通过组成它们的因果状态变化来理解。诸如“弹跳”之类的片段级标签可能是正确的，但同时隐藏了使事件在物理上有效的过程，从支撑丧失和接触开始到反弹和稳定。为了使这一隐藏过程显式化，我们引入了原子物理转变（APT）：最小的、时间局部化的状态变化，将可见线索与活跃的物理机制以及前后动力学状态联系起来。APT链将视频表示为有序的因果转变序列，而不是单个聚合事件标签：事件标签说明发生了什么；APT链解释为什么会发生。为了使VLM能够学习APT，我们从人工标注和模拟器真实数据构建了混合来源的APT数据，涵盖接触、重力、摩擦和旋转/稳定性中的14种转变类型，包含1,246个试验中的27,303个计时实例。利用这些数据，我们发现当前的VLM在转变级物理理解上存在不足，零样本召回率最多为14%，错误主要由遗漏的转变主导。直接在APT链上进行微调可以改善转变检测，但会导致事件级遗忘，表明模型学习的是专门的答案格式，而不是可复用的物理表示。因此，我们提出了APT-Tune，一种参数高效的方案，教会VLM使用因果转变而不遗忘如何回答视频问题。它结合了图像填充感知监督、格式条件协同训练和机制条件域到类型解码，使APT学习具有格式鲁棒性和物理基础。在Qwen3-VL-2B上仅使用11M LoRA参数，APT-Tune显著提高了APT召回率，同时改善了事件级视频迁移。这些结果表明，APT不是一种新的答案格式，而是一种用于物理视频理解的人类对齐的因果监督信号。

英文摘要

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.19062 2026-06-18 cs.CV 新提交

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

DREAM: 通过双目标编码扩展视觉-语言模型用于跨模态检索

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

发表机构 * Sejong University（世宗大学）； Korea Advanced Institute of Science and Technology（韩国科学技术院）； Ulsan National Institute of Science and Technology（蔚山科学技术院）

AI总结提出DREAM模型，通过双路径表示增强与对齐，结合层级视觉编码器和混合语言建模，在视频检索任务中实现新SOTA。

详情

AI中文摘要

在当今媒体驱动的世界中，视频内容在监控、教育和娱乐等领域的指数级增长使得通过自然语言查询检索语义相关视频变得日益关键。早期的视频检索系统依赖于手工特征或浅层跨模态映射，限制了其捕捉复杂语义和时间动态的能力。虽然大规模视觉-语言模型改进了跨模态对齐，但在建模细粒度时间依赖和微妙语言结构方面仍存在挑战。本文介绍DREAM：双路径表示增强与对齐模型，一种通过增强视觉和文本编码来解决这些局限性的新型多模态框架。DREAM采用混合语言建模策略，结合掩码和排列语言建模目标，以捕捉局部和全局语言语义。在视觉方面，我们设计了一个具有级联组注意力的层级视觉编码器，通过多阶段令牌交互和从粗到细的注意力细化来整合空间和时间信息。我们通过在广泛使用的MSRVTT、MSVD和LSMDC基准数据集上进行全面评估来验证DREAM，分别取得了49.4%、49.7%和27.3%的新SOTA R1分数。定性分析进一步展示了模型在帧间保持连贯注意力以及将复杂查询与动态视频内容对齐的能力。这些发现强调了层级注意力和双目标文本建模在实现鲁棒、上下文感知视频检索中的有效性，并为推进跨模态表示学习的未来研究铺平了道路。

英文摘要

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.

2606.18732 2026-06-18 cs.LG cs.CV 交叉投稿

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

低成本神经形态跌倒检测：使用合成事件数据和混合SNN

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

发表机构 * School of Electrical Engineering Pontificia Universidad Católica de Valparaíso, Chile（瓦尔帕莱索天主教大学电气工程学院）

AI总结提出混合SNN-CNN模型，从智能手机视频合成事件相机数据，实现高效准确的跌倒检测。

Comments 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

2604.22476 2026-06-18 cs.CV cs.LG 版本更新

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

全神贯注于工作流：从视频流中自动高效发现事件

Marco Pegoraro, Jonas Seng, Dustin Heller, Wil M. P. van der Aalst, Kristian Kersting

发表机构 * Chair of Process and Data Science, RWTH Aachen University（过程与数据科学教授席位，亚琛工业大学）； Artificial Intelligence & Machine Learning Lab, Technical University of Darmstadt（人工智能与机器学习实验室，达姆施塔特技术大学）

AI总结提出SnapLog方法，利用图像嵌入和帧间相似矩阵进行时间分割，结合广义少样本分类从视频中提取事件数据，生成可解释的带标签时间戳帧序列。

Comments 18 pages, 6 figures, 1 table, 27 references

详情

AI中文摘要

业务流程管理和流程挖掘等学科通过基于记录的事件数据发现流程见解来帮助组织。然而，流程分析的一个障碍是数据多模态性：例如，视频形式的数据不能直接解释为事件。现有方法依赖于活动标签字典作为输入，无法提供逐帧标签解释，或依赖于过时的计算机视觉技术。在这项工作中，我们提出了SnapLog，一种通过使用图像嵌入将帧转换为特征向量，并通过帧间相似矩阵进行时间分割来从视频中提取事件数据的方法。然后使用广义少样本分类为视频片段分配标签，生成可解释为事件的带标签、带时间戳的帧子序列。传统的流程挖掘技术可用于分析结果数据。我们表明，我们的方法生成的日志准确反映了视频中的流程。

English abstract

Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. Existing approaches rely on a dictionary of activity labels as input, cannot provide frame-by-frame labeling explanations, or rely on superseded computer vision techniques. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
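The temporal segmentation idea in this abstract can be illustrated with a minimal sketch. It assumes frame embeddings are already available as plain vectors and places a boundary wherever the cosine similarity between consecutive frames drops below a threshold. The function names, the threshold value, and the simplification to consecutive frames (rather than a full frame-wise similarity matrix) are assumptions for illustration, not SnapLog's actual procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def segment_frames(embeddings, threshold=0.8):
    """Split frame indices into segments at low-similarity transitions.

    Hypothetical helper: returns (start, end) index pairs, end inclusive.
    """
    if not embeddings:
        return []
    segments, start = [], 0
    for i in range(1, len(embeddings)):
        # A drop in similarity between neighbours marks a segment boundary.
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            segments.append((start, i - 1))
            start = i
    segments.append((start, len(embeddings) - 1))
    return segments

# Toy 2-D "embeddings": two similar frames, then two frames of a new activity.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_frames(frames))
```

Each resulting segment could then be passed to a classifier to obtain a label, which is the step the abstract attributes to generalized few-shot classification.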


2606.06926 2026-06-18 cs.CV cs.MM Version update

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

SVHighlights: 迈向极长体育视频精彩片段检测

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

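Affiliations * Ulsan National Institute of Science and Technology（蔚山科学技术院）

AI summary: To address the inability of existing methods to detect highlights in extremely long videos, introduces SVHighlights, the first such benchmark (320 sports videos averaging 2 hours each), and TF-SELECTOR, a training-free segment-based method that uses a large language model to fuse multimodal information and predict segment-level saliency scores, outperforming existing baselines on multiple metrics.

Comments Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/

Details

DOI: 10.1145/3770855.3817564

AI Chinese abstract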

尽管长视频的精彩片段检测具有重要的实际意义，但现有方法大多局限于短视频内容，这主要是由于缺乏合适的基准。为了填补这一空白，我们引入了SVHighlights，据我们所知，这是首个针对极长体育视频（每段时长超过一小时，涵盖多种体育类别）精彩片段检测的基准。SVHighlights是通过一个数据集生成流水线，从完整体育视频及其对应的官方精彩片段视频对构建而成，无需传统的逐片段显著性标注即可实现可扩展的标签生成。该基准包含320个视频，平均时长2.00小时，总时长640.18小时，显著超过以往的数据集。现有方法在长视频上也面临根本性挑战：在短视频片段上训练的模型无法泛化到小时级内容，并且它们的片段级评分缺乏识别精彩片段所需的更广泛上下文。为了解决这一问题并提供一个强基线，我们提出了TF-SELECTOR，一种无需训练的基于分段的方法，该方法通过合并相邻的具有相同语义内容的镜头，将每个视频划分为上下文感知的分段，并使用多模态输入（包括视觉描述、转录文本和音频音量）的大语言模型预测分段级显著性分数。实验表明，与视频时间定位（VTG）微调的基线相比，TF-SELECTOR在大多数指标上取得了更优的性能，在HIT@1上提升+3.12，在HIT@K上提升+4.06，在IoU上提升+2.95。这些结果确立了SVHighlights作为长视频精彩片段检测的具有挑战性的测试平台，并证明了简单的基于分段的策略可以有效地扩展到小时级视频。

English abstract

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.


2606.15632 2026-06-18 cs.CV Version update

Open-World Video Segmentation

开放世界视频分割

Qing Su, Kaiyang Li, Yuan Zhuang, Fei Miao, Shihao Ji

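Affiliations * University of Connecticut（康涅狄格大学）

AI summary: Proposes Savvy, a system that combines hierarchical mask discovery, deferred admission, and track consolidation for zero-shot open-world long-horizon video segmentation, and designs OGA, a granularity-aware evaluation suite whose n:1 matching protocol addresses the unfair penalty that conventional 1:1 matching imposes on open-world methods.

Details

AI Chinese abstract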

尽管视频分割在短片段和封闭集基准上取得了快速进展，但开放世界视频分割仍然在很大程度上未被探索。挑战有两方面：（1）现有方法不支持在动态自我运动的长视频中进行对象发现和身份维护；（2）现有评估协议依赖于严格的1:1匹配，不公平地惩罚了具有不匹配粒度的语义有效预测。为了解决这两个问题，我们引入了Savvy，一个实用且强大的零样本开放世界长时视频分割系统。Savvy结合了分层掩码发现、延迟接纳和轨迹整合，以支持持久对象发现、安全轨迹提升和稳定的长距离身份维护。我们进一步提出了OGA，一个用于开放世界视频分割的粒度感知评估套件。基于粒度无关（GA）匹配协议，OGA将传统的1:1匹配放宽为n:1映射，但通过断点检测支持不连续性并通过对每个参考对象的优势连贯片段进行评分来强制执行时间严谨性。这防止了碎片化或闪烁的支持被过度奖励，同时实现了GA适应的指标和结构诊断：身份持久性（IP）和身份集中性（IC）。在VIPSeg上，我们展示了标准的1:1评估严重低估了开放世界方法，而GA评估恢复了许多被抑制的性能。在更现实的长时基准ScanNet和HM3D上，Savvy在经典指标和提出的指标（包括STQ、VPQ$_\infty$、IP和IC）上始终优于强基线。这些结果共同为开放世界长时视频分割建立了一个实用的基准和一个强基线。

English abstract

While video segmentation has advanced rapidly on short clips and closed-set benchmarks, open-world video segmentation remains largely unexplored. The challenge is twofold: (1) existing methods are not designed to support object discovery and identity maintenance in long videos with dynamic ego-motion, and (2) existing evaluation protocols rely on a rigid 1:1 matching that unfairly penalizes semantically valid predictions with mismatched granularity. To address both gaps, we introduce Savvy, a practical and strong system for zero-shot open-world long-horizon video segmentation. Savvy combines hierarchical mask discovery, deferred admission, and track consolidation to support persistent object discovery, safe track promotion, and stable long-range identity maintenance. We further propose OGA, a granularity-aware evaluation suite for open-world video segmentation. Built on a Granularity-Agnostic (GA) matching protocol, OGA relaxes conventional 1:1 matching to an n:1 mapping, but still enforces temporal rigor by detecting support discontinuities through sever points and scoring each reference object through its dominant coherent fragment. This prevents fragmented or flickering support from being over-rewarded while enabling GA-adapted metrics and structural diagnostics: identity persistence (IP) and identity concentration (IC). On VIPSeg, we show that standard 1:1 evaluation substantially underestimates open-world methods, whereas GA evaluation recovers much of their suppressed performance. On the more realistic long-horizon benchmarks ScanNet and HM3D, Savvy consistently outperforms strong baselines across both classical and proposed metrics, including STQ, VPQ$_\infty$, IP, and IC. Together, these results establish a practical benchmark and a strong baseline for open-world long-horizon video segmentation.


2606.18478 2026-06-18 cs.CV New submission

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

数据强制蒸馏：恢复少步视频生成中的多样性和保真度

Siyi Chen, Shaowei Liu, Yixuan Jia, Zian Wang, Huan Ling, Qing Qu, Jun Gao

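Affiliations * University of Michigan（密歇根大学）； NVIDIA（英伟达）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI summary: To address the mode collapse and over-saturation that Distribution Matching Distillation (DMD) exhibits in few-step video generation, proposes Data-Forcing Distillation (DFD), which uses the teacher score discrepancy to guide the student toward the real-data distribution and restores diversity and fidelity with a single-line code change.

Details

AI Chinese abstract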

最近的进展表明，将多步视频扩散模型蒸馏为高效的少步学生模型具有前景。其中，分布匹配蒸馏（DMD）及其后继DMD2实现了强大的生成质量和快速收敛。然而，由于反向KL目标的性质，这些方法表现出两个持续的失败模式：样本多样性大幅下降，以及明显过饱和的输出偏离真实视频外观。在这项工作中，我们提出了数据强制蒸馏（DFD），一个简单的训练后框架，通过仅一行代码更改即可恢复DMD中的多样性和保真度。其核心是教师评分差异，用于引导学生朝向真实数据分布，将其拉向缺失的模式（缓解模式坍塌）并远离真实数据中不存在的问题模式（避免过饱和）。我们提供了框架的深入理论分析，并在文本到视频、图像到视频和自回归视频生成上验证了我们的方法。仅需100-300步微调，DFD就能有效恢复Wan2.1-1.3B和Cosmos-Predict2.5-2B模型上的多样性和保真度，解决过饱和伪影，显著改善视频动态和外观，甚至优于教师模型。

English abstract

Recent progress has shown promise in distilling multi-step video diffusion models into efficient few-step students. Among them, Distribution Matching Distillation (DMD) and its successor DMD2 achieved strong generation quality and fast convergence. However, due to the nature of the reverse Kullback-Leibler (KL) objective, these methods exhibit two persistent failure modes: a substantial drop in sample diversity, and visibly over-saturated outputs that deviate from real-video appearance. In this work, we propose Data-Forcing Distillation (DFD), a simple post-training framework that restores diversity and fidelity in DMD with only a single-line code change. At its core is the teacher score discrepancy, which guides the student toward the real-data distribution, pulling it to missing modes (mitigating mode collapse) and away from problematic modes absent in real data (avoiding over-saturation). We provide an in-depth theoretical analysis of our framework and validate our approach on text-to-video, image-to-video, and autoregressive video generation. With only 100-300 steps of finetuning, DFD effectively restores diversity and fidelity on both the Wan2.1-1.3B and Cosmos-Predict2.5-2B models, resolving the over-saturation artifacts with significantly better video dynamics and appearance, and even outperforming the teacher model.
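The diversity loss that the abstract attributes to the reverse KL objective reflects a general, well-known property: reverse KL tends to be mode-seeking, while forward KL tends to be mass-covering. The toy comparison below is a sketch under stated assumptions; the four-bin distributions are invented for illustration and the code is unrelated to the DMD or DFD implementations.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy "real data" distribution with two modes (first and last bin).
p_data = [0.49, 0.01, 0.01, 0.49]
# Candidate student A collapses onto one mode; candidate B spreads its mass.
q_one_mode = [0.97, 0.01, 0.01, 0.01]
q_spread = [0.25, 0.25, 0.25, 0.25]

# Reverse KL, KL(q || p): the direction of the objective named in the abstract.
reverse_one = kl(q_one_mode, p_data)
reverse_spread = kl(q_spread, p_data)

# Forward KL, KL(p || q): shown for contrast.
forward_one = kl(p_data, q_one_mode)
forward_spread = kl(p_data, q_spread)

print("reverse KL prefers single mode:", reverse_one < reverse_spread)
print("forward KL prefers spread mass:", forward_spread < forward_one)
```

In this toy setting the collapsed student scores better under reverse KL even though it ignores half of the data mass, which is the kind of behavior the abstract describes as a drop in sample diversity.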


2606.18591 2026-06-18 cs.CV New submission

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

桥接创意意图与视觉质量：基于创作者驱动的循环视频生成与代理反馈循环

Denis Savytski, Aiden Lei, Heding Liu, Warren Yang, Sihan Liang, Alexander Liu, Zhe Zhao

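Affiliations * University of California, Davis（加州大学戴维斯分校）； The Harker School（哈克学校）； Basis Independent Silicon Valley（硅谷贝斯独立学校）； Saratoga High（萨拉托加高中）

AI summary: Proposes CHIEF, a framework for human-AI collaborative iterative video refinement that combines creator-driven direction with subjective feedback from agents to improve the narrative coherence and creative direction of long videos.

Comments Accepted to the Workshop on Human-AI Co-Creativity at ICML 2026

Details

AI Chinese abstract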

生成式AI使内容创作日益普及，但许多AI生成的视频缺乏叙事连贯性和创意方向，尤其在较长时长时问题更为突出。与编码不同，AI生成受益于可靠的反馈和循环自我改进等技术，而视频生成需要关于情节、场景和叙事的主观反馈，这自然激发了融入人类创意方向的方法。我们提出了CHIEF，一个人类-AI协同创作视频生成框架，将创作者置于人机循环迭代视频精炼的中心，并通过提供自动主观反馈来支持他们。创作者通过驱动每次迭代来融入其创意方向，而他们的修订则由专门的精炼代理整合。反馈循环由基于角色条件的多模态LLM生成，这些LLM观看生成的视频并从观众角度产生主观批评，提供自我评估无法捕捉的反馈。为测试我们提出框架的有效性，我们与没有电影制作经验的高中生和大学生合作，创作从1分钟短视频到具有复杂情节的完整10分钟短片的视频。

English abstract

Generative AI has made content creation increasingly accessible, but many AI-generated videos lack narrative coherence and creative direction, issues that become more substantial at longer durations. Unlike coding, where AI generation benefits from reliable feedback and techniques such as recurrent self-improvement, video generation requires subjective feedback about plot, scenes, and narrative, which naturally motivates approaches that incorporate human creative direction. We introduce CHIEF, a human-AI co-creation video generation framework that places the creator at the center of human-in-the-loop iterative video refinement and supports them by providing automatic subjective feedback. The creator incorporates their creative direction by driving each iteration, while their revisions are incorporated by a specialized refiner agent. The feedback loop is generated by persona-conditioned multimodal LLMs that watch generated videos and produce subjective critique from audience perspectives, providing feedback that self-evaluation alone cannot capture. To test the effectiveness of our proposed framework, we work with high school and college students with no prior filmmaking experience to create videos, from short 1-minute videos to a complete 10-minute short film with a complicated plot.


2606.18702 2026-06-18 cs.CV New submission

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

UniTemp: 通过双向蒸馏实现任意时间顺序的视频生成

Lin Zhang, Sicheng Mo, Zefan Cai, Jinhong Lin, Zihao Lin, Jiuxiang Gu, Krishna Kumar Singh, Yuheng Li, Yin Li

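Affiliations * University of Wisconsin Madison（威斯康星大学麦迪逊分校）； Adobe Research（Adobe 研究院）； University of California Los Angeles（加利福尼亚大学洛杉矶分校）； University of California Davis（加利福尼亚大学戴维斯分校）

AI summary: Proposes UniTemp, a framework that trains a single autoregressive model via bidirectional distillation to generate video in any temporal direction (forward, backward, in-between), resolving the discontinuities that the causal 3D VAE causes in backward generation and improving controllability.

Details

AI Chinese abstract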

自回归视频扩散模型已成为长视频生成的一种有前景的方法，在流式设置中表现出色。然而，现有方法仅限于前向时间生成，而实际视频创作通常需要灵活的生成顺序，例如，基于未来上下文进行后向扩展，或基于过去和未来上下文进行中间插值生成。我们通过训练一个支持任意时间方向生成的自回归模型来弥合这一差距。一个关键的技术挑战来自视频扩散模型中广泛使用的因果3D VAE，它编码的潜变量严格依赖于过去上下文。虽然这种因果结构适合前向生成，但在后向生成时会导致块间不连续性。为了解决这个问题，我们引入了块级锚点潜变量，这是一组辅助潜变量，用于在后向生成过程中恢复块边界处缺失的过去上下文。基于这一设计，我们提出了UniTemp，一个双向蒸馏框架，训练单个自回归学生模型用于任意方向的视频生成。在推理时，UniTemp可以基于任意过去和/或未来帧进行条件生成，提高了双向和中间插值生成的可控性。实验表明，与仅前向方法相比，UniTemp在短和长视频生成上保持了竞争性能，同时支持多种工作流程，如双向视频扩展、中间插值生成、循环视频生成、场景转换和视觉故事生成。项目网站：https://lzhangbj.github.io/projects/unitemp/

English abstract

Autoregressive video diffusion models have emerged as a promising approach for long video generation, achieving strong performance in streaming settings. However, existing methods are restricted to forward temporal generation, whereas practical video creation often requires flexible generation order, e.g., conditioning on future context to extend backward, or on both past and future context for inbetween generation. We bridge this gap by training an autoregressive model that supports generation in arbitrary temporal directions. A key technical challenge arises from the Causal 3D VAE widely used in video diffusion models, which encodes latents strictly conditioned on past context. While suited for forward generation, this causal structure causes inter-block discontinuities when generation proceeds backward. To address this, we introduce blockwise anchor latents, a set of auxiliary latents that restore the missing past context at block boundaries during backward generation. Built on this design, we propose UniTemp, a bidirectional distillation framework that trains a single autoregressive student model for any-direction video generation. At inference time, UniTemp conditions on arbitrary past and/or future frames, improving controllability for both bidirectional and inbetween generation. Experiments show that UniTemp maintains competitive performance on short and long video generation compared to forward-only methods, while enabling diverse workflows such as bidirectional video extension, inbetween generation, looping video generation, scene transition, and visual story generation. Project website: https://lzhangbj.github.io/projects/unitemp/


2606.18765 2026-06-18 cs.CV New submission

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT：流匹配DiT的时间步条件谱残差校正

Jiayu Tian

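Affiliations * Peking University（北京大学）

AI summary: Proposes SpectralDiT, whose timestep-conditioned spectral residual correction module improves the generation quality of flow-matching DiTs on CIFAR-10 and ImageNet-100 with minimal extra compute and parameters, reducing FID by 5.1% and 8.7% (relative), respectively.

Details

AI Chinese abstract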

我们提出SpectralDiT，一种对流匹配扩散变换器（Diffusion Transformers）的轻量级修改，它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量，然后学习一个零初始化的加法门，使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中，SpectralDiT在补丁大小为1时将FID从20.78提升至19.71，并缩小了径向傅里叶谱差距。此外，我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下，SpectralDiT改进了潜在流匹配，在无分类器引导（CFG 2.0）下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

English abstract

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.
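The zero-initialized additive gate described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the SpectralDiT module: a 3x3 mean filter stands in for the low-frequency component (the abstract does not specify the actual decomposition), the gates are plain scalars rather than learned functions of the timestep, and all function names are hypothetical.

```python
def box_lowpass(grid):
    """3x3 mean filter with edge clamping: a simple stand-in low-pass."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [grid[min(max(i + di, 0), h - 1)][min(max(j + dj, 0), w - 1)]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)]
            out[i][j] = sum(vals) / 9.0
    return out

def spectral_correct(residual, gate_low=0.0, gate_high=0.0):
    """Return residual + gate_low * low + gate_high * high.

    `residual` is one channel of a residual update laid out on the
    patch-token grid; high = residual - low, so low + high == residual.
    """
    low = box_lowpass(residual)
    h, w = len(residual), len(residual[0])
    return [[residual[i][j]
             + gate_low * low[i][j]
             + gate_high * (residual[i][j] - low[i][j])
             for j in range(w)] for i in range(h)]

residual = [[1.0, 2.0], [3.0, 4.0]]
# With both gates at their zero initialization, the update is unchanged.
print(spectral_correct(residual) == residual)
```

Starting the gates at zero means the corrected model reproduces the baseline at initialization, which matches the abstract's statement that the model initially matches the baseline DiT.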


2606.18788 2026-06-18 cs.CV cs.CL New submission

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

HandwritingAgent: 语言驱动的可缩放矢量空间手写合成

Jaward Sesay, Yue Yu, Börje F. Karlsson

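Affiliations * Beijing Institute of Technology（北京理工大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）

AI summary: Proposes HandwritingAgent, which uses a large reasoning model to autoregressively generate handwriting stroke sequences in SVG format without style-specific training, controls style through natural language and a reference image, and matches or surpasses state-of-the-art methods on imitation, recognition, multilingual synthesis, and complex mathematical expression synthesis.

Details

AI Chinese abstract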

教会机器模仿自然手写风格仍然是一个开放挑战，因为它需要合成在形状、纹理、压力和字体上动态变化的笔画序列——不仅在不同个体之间，而且在同一个人的手写中也是如此。针对这一挑战的尝试主要探索了在线和离线环境下的深度学习方法。然而，这些方法通常受到风格特定架构选择、对大型数据集的严重依赖、高计算成本以及缺乏通过自然语言灵活控制书写风格的限制。为此，我们引入了HandwritingAgent，一个语言驱动的智能体，它可以直接在可缩放矢量图形（SVG）格式中合成自然手写序列，无需风格特定训练。该智能体利用大型推理模型在离散网格画布环境中对目标手写字形进行几何分析并自回归生成笔画序列。生成过程以对话或非对话模式提供的文本以及参考手写风格图像为条件。在涵盖模仿、识别、多语言手写合成以及复杂手写数学和科学表达式生成等多样化手写任务上的实验表明，性能有显著提升，HandwritingAgent匹配或超越了最先进的生成式手写模型，同时提供了一种更高效、可控且泛化能力更强的合成方法。

English abstract

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.
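To make the idea of stroke sequences on a discrete grid canvas concrete, the sketch below converts grid-coordinate strokes into an SVG document using standard `M` (move-to) and `L` (line-to) path commands. The stroke representation, cell size, and function name are assumptions for illustration; the abstract does not describe HandwritingAgent's actual output format beyond SVG.

```python
def strokes_to_svg(strokes, cell=10, size=100):
    """Render strokes (lists of (col, row) grid points) as one SVG document."""
    paths = []
    for stroke in strokes:
        if len(stroke) < 2:
            continue  # a lone point draws nothing with M/L commands
        (x0, y0), rest = stroke[0], stroke[1:]
        # Scale grid cells to pixel coordinates and chain line segments.
        d = f"M {x0 * cell} {y0 * cell} " + " ".join(
            f"L {x * cell} {y * cell}" for x, y in rest)
        paths.append(f'<path d="{d}" fill="none" stroke="black"/>')
    body = "".join(paths)
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{size}" height="{size}">{body}</svg>')

# Two strokes forming a rough "T" on the grid.
svg = strokes_to_svg([[(1, 1), (5, 1)], [(3, 1), (3, 6)]])
print(svg)
```

Because each stroke is an ordered point list, a model that emits strokes one point at a time maps naturally onto this kind of vector output.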


2606.18906 2026-06-18 cs.CV New submission

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

BindEdit: 驯服注意力泄漏以实现精确的多目标图像编辑

Chaewon Park, Soyoon Lee, Naeun Lee, Minjung Shin, Seogkyu Jeon, Kibeom Hong

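Affiliations * Sookmyung Women’s University（淑明女子大学）； Yonsei University（延世大学）； Samsung Research（三星研究院）

AI summary: To address semantic blending and object duplication in multi-object image editing, proposes BindEdit, which suppresses attention leakage within a single diffusion trajectory through joint regularization of cross- and self-attention, a cross-attention re-balancing mechanism, and a region fidelity term, enabling precise edits.

Comments Preprint

Details

AI Chinese abstract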

真实图像编辑能够精确操作视觉内容，但现有方法在复杂的多目标场景中常常失败，导致语义混合、对象重复或编辑不完整。我们将这些失败归因于注意力泄漏，即在去噪过程中，跨空间区域和文本标记的信号变得纠缠。具体来说，我们识别出两种不同形式的泄漏：编辑-标记泄漏，其中模糊的标记-区域对齐导致对象混合；以及源主导泄漏，其中未改变的源对象的标记压倒了目标实体应有的注意力。为了解决这些泄漏，我们提出了BindEdit，它在单次扩散轨迹内强制执行注意力级别的约束。为了抑制编辑-标记泄漏，BindEdit联合正则化交叉注意力和自注意力，使得每个目标标记组绑定到其对应的空间区域，同时保持实例级别的分离。为了抑制源主导泄漏，一种交叉注意力重平衡机制放大目标标记的影响，并减弱可编辑区域内残留的源语义。此外，区域保真项确保每个目标概念在整个编辑掩码中连贯表达。另外，我们提出了一个全面的多目标基准，涵盖不同的对象数量和类别。大量实验表明，BindEdit在单次扩散轨迹内始终优于现有方法，在单目标和多目标编辑场景中均保持稳健性能。

English abstract

Real image editing enables precise manipulation of visual content, yet existing methods often fail in complex multi-object scenarios, causing semantic blending, object duplication, or incomplete edits. We attribute these failures to attention leakage, where signals across spatial regions and text tokens become entangled during the denoising process. Specifically, we identify two distinct forms of leakage: Edit-Token Leakage, where ambiguous token-region alignment leads to object blending, and Source Dominance Leakage, where tokens of unchanged source objects overwhelm the attention intended for target entities. To resolve these leakages, we propose BindEdit, which enforces attention-level constraints within a single diffusion trajectory. To suppress Edit-Token Leakage, BindEdit jointly regularizes cross- and self-attention so that each target token group is bound to its corresponding spatial region while maintaining instance-level separation. To suppress Source Dominance Leakage, a cross-attention re-balancing mechanism amplifies target token influence and attenuates residual source semantics within editable regions. Moreover, a region fidelity term ensures that each target concept is expressed coherently across the entire editing mask. Additionally, we propose a comprehensive multi-object benchmark encompassing diverse object counts and categories. Extensive experiments demonstrate that BindEdit consistently outperforms existing methods within a single diffusion trajectory, maintaining robust performance across both single- and multi-object editing scenarios.


2606.19073 2026-06-18 cs.CV New submission

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

驯服I2V模型用于图像HOI编辑：认知基准与智能体自校正框架

Jiayi Gao, Qingchao Chen, Yuxin Peng, Yang Liu

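Affiliations * Wangxuan Institute of Computer Technology, Peking University, Beijing, China（王选计算机研究所，北京大学，北京，中国）； National Institute of Health Data Science, Peking University, Beijing, China（国家健康数据科学研究院，北京大学，北京，中国）

AI summary: Proposes the HOI-Edit benchmark and the SCPE framework, which exploits the temporal generation capability of I2V models for dynamic human-object interaction editing and iteratively refines prompts through self-correction, achieving performance competitive with SOTA.

Details

AI Chinese abstract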

当前的图像编辑方法在静态属性上表现出色，但在复杂的人-物交互（HOI）上失败，这是一个关键挑战，现有基准将HOI与静态属性混淆，依赖无法同时评估动态交互有效性和纠缠的人-物对保留的全局指标。因此，我们首先引入HOI-Edit，一个包含三个渐进认知层次的综合基准，其特点是自动化指标HOI-Eval，通过让VLM在思考后对包含基础人-物对的图像进行问答，可靠地评估实例级交互。考虑到任务本质是重塑动态关系，我们对图像到视频（I2V）模型进行基准测试，发现它们由于其时间生成能力而天生适合动态编辑。关键的是，除了优越的性能，这种能力提供了“失败过程的重放”，为错误原因提供了独特的可诊断性。因此，我们提出SCPE（自校正过程编辑），一种新颖的智能体自校正框架，通过迭代优化的提示约束I2V模型的生成，使生成的视频更准确地呈现目标HOI。从这些视频中提取的帧是最终的编辑结果。在HOI-Edit上，SCPE在交互上达到了与最先进（SOTA）编辑模型（如Nano Banana）竞争的性能。代码可在 https://github.com/oceanflowlab/HOI-Edit 获取。

English abstract

Current image editing methods excel at static attributes but fail at complex Human-Object Interactions (HOI), a critical challenge unaddressed by existing benchmarks, which conflate HOI with static attributes and rely on global metrics incapable of simultaneously assessing dynamic interaction validity and entangled human-object pair preservation. Thus, we first introduce HOI-Edit, a comprehensive benchmark with three progressive cognitive levels, which features an automated metric, HOI-Eval, that reliably evaluates instance-level interaction by having a VLM perform Q&A after thinking with images containing grounded human-object pairs. Considering the task's essence of remodeling dynamic relationships, we benchmark Image-to-Video (I2V) models, finding them inherently suited for dynamic editing due to their temporal generation capabilities. Crucially, beyond superior performance, this capability provides a "replay of the failure process," offering unique diagnosability into why errors occur. We thus propose SCPE (Self-Correcting Process Editing), a novel, agentic self-correcting framework that constrains the generation of I2V models through iteratively refined prompts, enabling the generated videos to more accurately present the target HOI. Frames extracted from these videos are the final editing results. On HOI-Edit, SCPE achieves interaction performance competitive with state-of-the-art (SOTA) editing models such as Nano Banana. Code is available at https://github.com/oceanflowlab/HOI-Edit.


2606.19103 2026-06-18 cs.CV cs.AI New submission

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

ProductConsistency：通过SFT和RL改进基于指令的图像编辑中的产品身份保持

Mukund Khanna, Raj Singh Yadav, Kunal Singh

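Affiliations * Fractal Analytics

AI summary: To address insufficient preservation of product features in instruction-based image editing, proposes the ProductConsistency dataset and a Cyclic Consistency reward, combining supervised fine-tuning with reinforcement learning to substantially improve product consistency, text rendering, and visual quality.

Comments CVPR HiGen 2026

Details

AI Chinese abstract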

近期基于指令的图像编辑的进展使模型能够根据自然语言指令执行复杂的视觉编辑。然而，在以产品为中心的场景中，保留产品特征、品牌和文本元素至关重要，当前的开源和闭源模型往往难以维持这种细粒度的对象身份。这一问题因缺乏具有文本保真度约束的基于指令的产品图像编辑数据集而进一步加剧，导致该能力在很大程度上被视为基于指令的图像编辑模型的隐式能力。在这项工作中，我们引入了ProductConsistency数据集，旨在改进以产品为中心的图像编辑。我们的方法包括一个用于产品编辑的包含87k样本的监督微调（SFT）数据集、一个包含869张独特产品图像的强化学习（RL）数据集，以及一个新的基准数据集ProductConsistency Benchmark，以允许对编辑模型进行严格和标准化的评估。为了指导RL训练，我们提出了一种循环一致性奖励，通过使用原始产品描述与从编辑图像生成的描述之间的字幕相似性来强制保持产品身份的语义。我们使用我们的数据集对Qwen-Image-Edit-2511和Flux.1-Kontext-dev进行了微调，并在OCR和感知指标以及基于MLLM的评估中展示了相对于基线模型的一致改进，表明更强的产品一致性、文本渲染和整体视觉质量；其中Qwen-Image-Edit-2511模型实现了字符错误率降低5倍。代码和流程可在 https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md 获取。

English abstract

Recent advances in instruction-based image editing have enabled models to perform complex visual edits from natural language instructions. However, in product-centric scenarios where preserving product features, branding, and textual elements is critical, current open- and closed-source models often struggle to maintain this fine-grained object identity. This issue is further compounded by the lack of datasets for instruction-based product image editing with text fidelity constraints, leaving it largely treated as an implicit capability of instruction-based image editing models. In this work, we introduce the ProductConsistency dataset, which is designed to improve product-centric image editing. Our approach includes a supervised fine-tuning (SFT) dataset of 87k samples for product editing, a reinforcement learning (RL) dataset with 869 unique product images, and a new benchmark dataset, the ProductConsistency Benchmark, to allow rigorous and standardized evaluation of editing models. To guide RL training, we propose a Cyclic Consistency reward that enforces semantic preservation of product identity by using caption similarity between the original product description and captions generated from the edited image. We fine-tune both Qwen-Image-Edit-2511 and Flux.1-Kontext-dev using our dataset and demonstrate consistent improvements over baseline models in OCR and perceptual metrics as well as MLLM-based evaluations, indicating stronger product consistency, text rendering, and overall visual quality, with the Qwen-Image-Edit-2511 model achieving a 5x reduction in the character error rate. The code and pipeline are available at https://anonymous.4open.science/r/ProductConsistency-6FCC/README.md
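The character error rate mentioned in this abstract is commonly defined as the Levenshtein edit distance between the recognized text and the reference text, divided by the reference length. The sketch below follows that common definition; whether the paper normalizes or preprocesses text differently is not stated in the abstract, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical label text before editing vs. OCR output after editing:
# one substituted character over nine reference characters.
print(cer("ACME COLA", "ACNE COLA"))
```

Under this definition, a 5x reduction means the edited images' OCR output needs one fifth as many character edits per reference character to match the original product text.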


2606.19195 2026-06-18 cs.CV New submission

Epipolar Geometry Improves Video Generation Models

极线几何改进视频生成模型

Orest Kupyn, Théo Uscidda, Marta Tintore Gazulla, Fabian Manhardt, Federico Tombari, Christian Rupprecht

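Affiliations * University of Oxford（牛津大学）； Google Research（谷歌研究院）； CREST-ENSAE, Institut Polytechnique de Paris（巴黎理工学院CREST-ENSAE研究中心）； Technical University of Munich（慕尼黑工业大学）

AI summary: To address geometric inconsistency and motion artifacts in video generation models, proposes preference optimization based on epipolar geometry constraints, reducing epipolar error by 31% and raising human-rated consistency from 54% to 72% while preserving visual quality.

Details

AI Chinese abstract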

视频生成模型通过使用整流流技术训练的潜在扩散变换器取得了显著进展。然而，这些模型仍然存在几何不一致、运动不稳定以及破坏逼真3D场景错觉的视觉伪影。3D一致的视频生成可能对生成和重建任务中的众多下游应用产生重大影响。我们探索了极线几何约束如何改进现代视频扩散模型。尽管使用了大量训练数据，这些模型未能捕捉基本的几何原理。我们通过基于偏好的优化，利用成对极线几何约束对齐扩散模型，通过数学上合理的几何约束直接解决不稳定轨迹和几何伪影。我们的方法有效地强制执行几何原理，而不需要端到端的可微性。评估表明，经典的几何约束比现代学习度量提供了更稳定的优化信号。在静态场景和动态相机上的训练确保了度量质量，同时模型泛化到各种动态场景。通过将数据驱动学习与经典计算机视觉相结合，我们将极线误差降低了31%，并将人类评分一致性从54%提高到72%，且不损害视觉质量。

English abstract

Video generation models have advanced significantly through latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite using massive training data, these models fail to capture fundamental geometric principles. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics. Training on static scenes with dynamic cameras ensures metric quality while the model generalizes to various dynamic scenes. By bridging data-driven learning with classical computer vision, we reduce epipolar error by 31% and improve human-rated consistency from 54% to 72% without compromising visual quality.
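The pairwise epipolar constraint referred to in this abstract is the classical relation x2^T E x1 = 0 between corresponding normalized image points in two views of a static scene, where E = [t]_x R is the essential matrix. The sketch below checks that relation on a synthetic point; the camera motion, the 3D point, and the helper names are invented for illustration, and this is not the paper's error metric or training code.

```python
import math

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]

def rot_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def epipolar_residual(E, x1, x2):
    """Algebraic epipolar error x2^T E x1 for homogeneous image points."""
    Ex1 = matvec(E, x1)
    return sum(x2[i] * Ex1[i] for i in range(3))

# Hypothetical camera motion: a point X1 in camera 1 is X2 = R X1 + t in camera 2.
R, t = rot_y(0.1), [0.5, 0.0, 0.1]
E = matmul(skew(t), R)  # essential matrix for normalized coordinates

X1 = [0.3, -0.2, 4.0]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]  # projection into view 1
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]  # projection into view 2

consistent = epipolar_residual(E, x1, x2)                        # near zero
shifted = epipolar_residual(E, x1, [x2[0], x2[1] + 0.05, 1.0])   # violated
print(abs(consistent) < 1e-9, abs(shifted) > 1e-3)
```

A geometrically inconsistent frame pair produces correspondences like the shifted one, with a clearly non-zero residual, which is the kind of signal a preference-based objective could rank on.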


2604.03156 2026-06-18 cs.CV Version update

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

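Affiliations * University at Albany（奥尔巴尼大学）

AI summary: Proposes CAOA, which combines semantically aware point cloud completion with symmetry-aware relative pose estimation to achieve a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.

Comments GitHub: https://github.com/MinhasKamal/CAOA

Details

DOI: 10.1109/3DV69130.2026.00047
Journal ref: Thirteenth International Conference on 3D Vision (3DV), 2026

AI Chinese abstract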

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度（DoF）位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐（CAOA），该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合，实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估，往往难以泛化到真实扫描。为弥合这一差距，我们引入了一种针对室内场景的合成数据生成策略，通过与广泛使用的补全数据集进行定量比较，验证了其显著减小合成到真实领域差距的效果。此外，我们发布了S2C-Completion，一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集，用于真实室内单物体补全，并作为该任务的新基准。对于物体-CAD对齐，我们通过对称感知损失融入对称信息，提高了对对称模糊的鲁棒性。在Scan2CAD基准上，CAOA相比最先进方法实现了17%的精度提升。

English abstract

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
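A 9-DoF pose as described in this abstract has three translation, three rotation, and three per-axis scale parameters. The sketch below applies such a pose to a single CAD-frame point. The order of operations (scale, then rotate, then translate), the Z-Y-X Euler angle convention, and the function names are assumptions chosen for illustration; the abstract does not specify CAOA's parameterization.

```python
import math

def rotation_zyx(yaw, pitch, roll):
    """Rotation matrix R = Rz(yaw) Ry(pitch) Rx(roll), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def apply_9dof(point, translation, angles, scale):
    """Map a CAD-frame point into the scan frame: p' = R (s * p) + t."""
    scaled = [s * p for s, p in zip(scale, point)]          # 3 scale DoF
    R = rotation_zyx(*angles)                               # 3 rotation DoF
    rotated = [sum(R[i][k] * scaled[k] for k in range(3)) for i in range(3)]
    return [r + t for r, t in zip(rotated, translation)]    # 3 translation DoF

# Stretch the x axis by 2, turn 90 degrees about z, then shift along z.
print(apply_9dof([1.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                 (math.pi / 2, 0.0, 0.0), [2.0, 1.0, 1.0]))
```

For objects with rotational symmetry, several rotation values produce the same aligned geometry, which is the ambiguity the abstract's symmetry-aware loss is meant to handle.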


2606.18439 2026-06-18 cs.CV cs.RO New submission

RegimeVGGT: Layer-Wise Spatially Preserving Redundancy Removal for Visual Geometry Grounded Transformer

RegimeVGGT：面向视觉几何基础Transformer的逐层空间保持冗余去除

Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, Zibo Zhao, Jiaxiang Hu, Kai Tang, Yichen Guo

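Affiliations * University of Pennsylvania（宾夕法尼亚大学）； University of California, Irvine（加利福尼亚大学尔湾分校）； Nanyang Technological University（南洋理工大学）

AI summary: Proposes RegimeVGGT, which removes redundancy through layer-wise U-shaped compression (Saliency-Guided Banded Merging and Selectively Protected K/V Downsampling), achieving a 6.7x speedup while maintaining reconstruction quality.

Comments 9 pages, 3 figures, 7 tables. Jinhao You, Shuo Lyu, Zhuohang Lyu, Tanxuan Li, and Zibo Zhao contributed equally. Shuo Lyu is the corresponding author

Details

AI Chinese abstract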

视觉几何基础Transformer（VGGT）通过一次前向传播从多视图图像恢复密集3D场景结构，但二次交叉帧注意力限制了其可扩展性。现有的免训练加速器沿单一轴均匀减少计算，忽略了层间异质性。我们的频谱、探测和因果分析揭示了三个区域：浅层缺乏跨视图结构，中层驱动跨视图对齐，深层对密集几何是冗余的，但其跨帧注意力对姿态仍然至关重要。RegimeVGGT沿两个轴应用逐层U形压缩：显著性引导带状合并保护几何和边缘显著性令牌，而选择性保护K/V下采样通过相移空间网格、参考帧锚点以及未压缩的相机/注册令牌来保持跨帧空间覆盖和姿态关键路径。免训练，RegimeVGGT在匹配重建质量下相比VGGT*实现了6.7倍加速。

English abstract

Visual Geometry Grounded Transformer (VGGT) recovers dense 3D scene structure from multi-view images in one forward pass, but quadratic cross-frame attention limits its scalability. Existing training-free accelerators reduce computation uniformly along one axis, missing layer heterogeneity. Our spectral, probing, and causal analyses reveal three regimes: shallow layers lack cross-view structure, middle layers drive cross-view alignment, and deep layers are redundant for dense geometry yet their cross-frame attention remains essential for pose. RegimeVGGT applies layer-wise U-shaped compression along two axes: Saliency-Guided Banded Merging protects geometry- and edge-salient tokens, while Selectively Protected K/V Downsampling preserves cross-frame spatial coverage and the pose-critical path through a phase-shifted spatial grid, a reference-frame anchor, and uncompressed camera/register tokens. Training-free, RegimeVGGT achieves a 6.7x speedup over VGGT* at matched reconstruction quality.

2606.18623 2026-06-18 cs.CV eess.IV 新提交

Intrinsic 4D Gaussian Segmentation from Scene Cues

内在4D高斯分割：基于场景线索

Hasan Yazar, Mohamed Rayan Barhdadi, Erchin Serpedin, Mehmet Tuncel, Hasan Kurban

发表机构 * Istanbul Technical University（伊斯坦布尔理工大学）； Texas A&M University（德克萨斯农工大学）； Hamad Bin Khalifa University（哈马德·本·哈利法大学）

AI总结提出Intrinsic-GS方法，无需训练和掩码，通过构建高斯原语的亲和图并利用社区检测实现4D场景分割，在Neu3D和HyperNeRF上达到与掩码监督方法相当的精度，且速度提升12.5倍。

Comments 15 pages, 4 figures, 7 tables. Includes supplementary material. Preprint

详情

AI中文摘要

动态4D高斯泼溅以高保真度重建变形场景，并越来越多地被用作动态3D场景的表示。要利用此类场景进行编辑、操作或运动分析，首先需要对其进行分割：将高斯原语分组为连贯的对象。当前流程通过从基础模型（如SAM）导入2D掩码，并将其提升或蒸馏到高斯表示中来获得这种分组。在动态场景中，这些掩码必须在多个帧和视角中生成，成本高昂，并且所得分割可能强烈依赖于这些外部掩码的质量和一致性。我们探究能否从高斯本身恢复更多的对象级结构，并提出Intrinsic-GS，一种无需训练、无需掩码的方法，该方法根据外观、方向、尺度、变形轨迹和非学习渲染边界线索，在高斯原语上构建稀疏亲和图。该图通过Leiden社区检测进行划分，无需基础模型，也无需学习特征场。在标准的4D高斯分割基准Neu3D和HyperNeRF上，Intrinsic-GS在没有掩码监督的情况下恢复了大量的对象结构，在Neu3D上达到0.746 mIoU，在HyperNeRF上达到0.575；在Neu3D上，仅几何变体达到0.902 mIoU，与SAM监督的TRASE相当。在HyperNeRF上，Intrinsic-GS的运行速度比掩码监督流程中使用的掩码生成和特征渲染阶段快12.5倍。这些结果表明，大部分分割信号已经编码在高斯本身中，为3D和4D高斯分割提供了一种快速、无需掩码的方向，也可能指向在外部掩码不可靠或昂贵的情况下更可泛化、更鲁棒的分割。

英文摘要

Dynamic 4D Gaussian Splatting reconstructs deforming scenes with high fidelity and is increasingly adopted as a representation for dynamic 3D scenes. Putting such a scene to use, for editing, manipulation or motion analysis, first requires segmenting it: grouping the Gaussian primitives into coherent objects. Current pipelines obtain this grouping by importing 2D masks from foundation models such as SAM and lifting or distilling them into the Gaussian representation. In dynamic scenes these masks must be generated across many frames and views, which is costly, and the resulting segmentation can depend strongly on the quality and consistency of those external masks. We ask how much object-level structure can instead be recovered from the Gaussians themselves, and propose Intrinsic-GS, a training-free, mask-free method that builds a sparse affinity graph over Gaussian primitives from appearance, orientation, scale, deformation-trajectory and non-learned rendered-boundary cues. The graph is partitioned with Leiden community detection, requiring no foundation model and no learned feature field. On the standard 4D Gaussian segmentation benchmarks, Neu3D and HyperNeRF, Intrinsic-GS recovers substantial object structure without mask supervision, reaching 0.746 mIoU on Neu3D and 0.575 on HyperNeRF; on Neu3D, a geometry-only variant reaches 0.902 mIoU, matching SAM-supervised TRASE. On HyperNeRF, Intrinsic-GS runs 12.5x faster than the mask-generation and feature-rendering stages used by mask-supervised pipelines. These results suggest that much of the segmentation signal is already encoded in the Gaussians themselves, offering a fast, mask-free direction for 3D and 4D Gaussian segmentation that may also point toward more generalizable, robust segmentation in settings where external masks are unreliable or expensive.

2606.18787 2026-06-18 cs.CV 新提交

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

基于UDF的点云重建中的学习半径估计

Eito Ogawa, Hiroshi Watanabe

发表机构 * Graduate School of FSE, Waseda University, Tokyo, Japan（日本东京早稻田大学FSE研究生院）

AI总结提出一种学习型逐查询半径选择器，预测连续支撑半径并插入冻结的LoSF-UDF骨干网络，通过抛物线插值获取离网目标半径进行训练，提高点云表面重建的细粒度精度。

2606.18861 2026-06-18 cs.CV cs.AI 新提交

EDoF-NeRF: 使用编码孔径相机扩展景深的神经辐射场

Yoshiyuki Shirasaki, Ryoichi Horisaki

发表机构 * Department of Information Physics and Computing, Graduate School of Information Science and Technology, The University of Tokyo（信息物理与计算系，信息科学与技术研究生学校，东京大学）

AI总结提出一种通过编码孔径相机扩展景深的方法，构建高保真神经辐射场，实现从不同视角图像渲染新视图，并验证其优于传统孔径相机。

详情

AI中文摘要

我们提出了一种扩展景深（DoF）的方法，用于构建高保真神经辐射场（NeRF）——一种基于隐式神经表示、从不同视角捕获的图像数据集渲染逼真新视图的新兴技术。DoF与光量之间的权衡不仅存在于传统相机中，也存在于NeRF中，因为NeRF使用的数据集是由这些相机捕获的。为了解决这个问题，我们在相机光阑处引入编码孔径，在散焦条件下保留空间频率分量。我们开发了一个将编码孔径纳入NeRF的相机模型，允许直接输入编码图像，并能够生成具有扩展景深的新视图。我们通过仿真和实验验证了所提出的方法，称为扩展景深NeRF（EDoF-NeRF），并证明了其相比传统孔径相机的优越性能。

英文摘要

We propose a method for extending the depth-of-field (DoF) to construct high-fidelity neural radiance fields (NeRF) -- an emerging technique for rendering photorealistic novel views from a dataset of images captured at different viewpoints, based on implicit neural representations. The trade-off between DoF and light quantity is inherent not only in conventional cameras but also in NeRF, since the datasets used by NeRF are captured by these cameras. To address this issue, we introduce a coded aperture placed at the camera pupil, preserving spatial frequency components under defocused conditions. We develop a camera model incorporating coded apertures into NeRF, allowing direct input of coded images and enabling the generation of novel views with an extended DoF. We validate the proposed method, termed extended DoF-NeRF (EDoF-NeRF), through simulations and experiments, demonstrating its superior performance compared to conventional aperture cameras.

2503.09439 2026-06-18 cs.CV 版本更新

SuperCarver: Texture-Consistent 3D Geometry Super-Resolution for High-Fidelity Surface Detail Generation

SuperCarver: 纹理一致的3D几何超分辨率用于高保真表面细节生成

Qijian Zhang, Xiaozheng Jian, Xuan Zhang, Wenping Wang, Junhui Hou

发表机构 * Tencent Games, China（腾讯游戏，中国）； Department of Computer Science & Engineering, Texas A&M University（计算机科学与工程系，德克萨斯农工大学）； Department of Computer Science, City University of Hong Kong（计算机科学系，香港城市大学）

AI总结提出SuperCarver，一种3D几何超分辨率管线，通过先验引导的法线扩散模型和噪声鲁棒的逆渲染，为粗糙网格补充纹理一致的表面细节，实现高保真细节生成。

Comments Accepted in IEEE TVCG

详情

AI中文摘要

传统的高精度网格资产生产流程需要专业3D艺术家/建模师进行繁琐且费力的手动雕刻。近年来，AI赋能的3D内容创作在从图像或文本提示生成合理结构和复杂外观方面取得了显著进展。然而，合成逼真的表面细节仍然面临巨大挑战，并且增强现有低质量3D网格（而非图像/文本到3D生成）的几何保真度仍然是一个开放问题。在本文中，我们介绍了SuperCarver，一种3D几何超分辨率管线，用于为给定的粗糙网格补充纹理一致的表面细节。我们首先从多个视角将原始纹理网格渲染到图像域。为了实现细节增强，我们构建了一个确定性先验引导的法线扩散模型，该模型在精心策划的成对细节缺乏和细节丰富的法线图渲染数据集上进行微调。为了从潜在不完美的法线图预测更新网格表面，我们设计了一种通过可变形距离场的噪声鲁棒逆渲染方案。实验表明，我们的SuperCarver能够生成由实际纹理外观描述的逼真且富有表现力的表面细节，使其成为升级历史低质量3D资产和减少高多边形网格雕刻工作量的强大工具。

英文摘要

The conventional production workflow for high-precision mesh assets requires cumbersome and laborious manual sculpting by specialized 3D artists and modelers. Recent years have witnessed remarkable advances in AI-empowered 3D content creation for generating plausible structures and intricate appearances from images or text prompts. However, synthesizing realistic surface details still poses great challenges, and enhancing the geometric fidelity of existing lower-quality 3D meshes (as opposed to image/text-to-3D generation) remains an open problem. In this paper, we introduce SuperCarver, a 3D geometry super-resolution pipeline for supplementing texture-consistent surface details onto a given coarse mesh. We start by rendering the original textured mesh into the image domain from multiple viewpoints. To achieve detail boosting, we construct a deterministic prior-guided normal diffusion model, which is fine-tuned on a carefully curated dataset of paired detail-lacking and detail-rich normal map renderings. To update mesh surfaces from potentially imperfect normal map predictions, we design a noise-resistant inverse rendering scheme through a deformable distance field. Experiments demonstrate that SuperCarver is capable of generating realistic and expressive surface details depicted by the actual texture appearance, making it a powerful tool both for upgrading historical low-quality 3D assets and for reducing the workload of sculpting high-poly meshes.

2605.17131 2026-06-18 cs.CV cs.AI cs.LG 版本更新

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

针对点云分类和分割的深度学习架构系统性调研

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

发表机构 * State University of New York at Albany（纽约州立大学阿尔巴尼分校）

AI总结本文系统性地探讨了点云分类和分割中的深度学习架构，分析了点云数据的结构特性，分类了不同架构的工作，并评估了其在主流基准上的性能，同时指出了开放挑战和未来方向。

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

详情

DOI: 10.1145/3815180
Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

AI中文摘要

点云因其简洁性和几何保真度而成为表示3D形状和场景最广泛采用的格式。然而，其固有的无序和不规则性质，加剧了传感器噪声和遮挡的影响，给基于机器学习的方法带来了独特的挑战。为应对这些问题，已开发出多种策略，包括转换为有序格式、提取局部几何特征以及基于排列不变或自注意力的处理方法。在本文中，我们的重点是深度学习模型在3D视觉三个基本任务中的应用：点云分类、部分分割和语义分割。我们首先正式定义点云数据，然后深入讨论其结构特性。接着，我们根据其骨干结构对重要工作进行分类，并评估其在流行基准上的性能。除了经验比较外，我们还提供了架构创新和局限性的见解。我们还概述了3D点云理解中的开放挑战和有前途的未来方向。

英文摘要

The point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine-learning-based methodologies. To combat these issues, diverse strategies have been developed, including conversion to an ordered format, extraction of local geometry, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

2606.18609 2026-06-18 cs.CV 新提交

Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification

基于反证据验证的医学视觉语言模型幻觉检测与纠正

Nan Zhou, Ke Zou, Meng Liu, Linchao He, Jiaqi Zhu, Yi Zhang, Hu Chen, Huazhu Fu

发表机构 * College of Computer Science, Sichuan University（四川大学计算机科学学院）； Yong Loo Lin School of Medicine, National University of Singapore（新加坡国立大学杨潞龄医学院）； Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University（四川大学数据保护与智能管理教育部重点实验室）； National Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology（北京理工大学自主智能无人系统国家重点实验室）； Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR)（新加坡科技研究局高性能计算研究所）

AI总结提出CoEV框架，通过文本与视觉证据的双向验证检测并纠正医学VLM幻觉，无需重新训练，在四个数据集上显著提升检测和纠正性能。

Comments Accepted at MICCAI 2026. Submission version

详情

AI中文摘要

视觉语言模型（VLM）在医学诊断中的可靠性受到幻觉的挑战，这削弱了信任。现有的幻觉检测方法主要关注识别生成文本与参考数据之间的事实不一致性。虽然一些研究分析了模型在图像中的注意力区域，但它们很少验证这种注意力是否真正反映了支持生成文本的视觉证据。为了解决这一差距，我们提出了反证据验证（CoEV），一个无需训练的即插即用框架，通过基于证据的事实一致性验证来检测和纠正幻觉。CoEV在文本断言和视觉证据之间执行双向验证，测试每个陈述是否得到其对应证据区域的支持，并将每个陈述分配到一个四象限诊断图中，该图捕获文本事实性和视觉基础性的组合。CoEV检测幻觉内容，并作为事后细化工具，无需重新训练即可纠正幻觉。在四个医学数据集上的大量实验表明，CoEV能够对抗幻觉。在幻觉检测方面，CoEV始终优于现有方法，平均PR-AUC和ROC-AUC分别提高了3.0和3.9个绝对百分点，在特定VQA场景中提升高达18.5%。在幻觉纠正方面，它将Micro-F1提高了高达12.5%，在医学报告生成中将幻觉率降低了超过11.9%，并提高了医学VQA的准确性。这些结果表明，CoEV能够可靠地检测和纠正幻觉，为临床医生提供可靠的、基于证据的诊断线索。代码将在论文录用后发布。

英文摘要

The reliability of vision-language models (VLMs) in medical diagnosis is challenged by trust-undermining hallucinations. Existing hallucination detection approaches mainly focus on identifying factual inconsistencies between generated text and reference data. While some studies analyze where models attend in images, they seldom verify whether such attention truly reflects the visual evidence supporting the generated text. To address this gap, we propose Counter-Evidence Verification (CoEV), a training-free plug-and-play framework that detects and corrects hallucinations through evidence-based factual consistency verification. CoEV performs bidirectional verification between textual assertions and visual evidence, testing whether each statement is supported by its corresponding evidence region, and assigns each statement to a four-quadrant diagnostic map capturing combinations of text factuality and visual grounding. CoEV detects hallucinated content and serves as a post hoc refinement tool, correcting hallucinations without retraining. Extensive experiments on four medical datasets show that CoEV combats hallucinations in VLMs. For hallucination detection, CoEV consistently outperforms existing methods, improving average PR-AUC and ROC-AUC by 3.0 and 3.9 absolute percentage points, respectively, with notable gains of up to 18.5% in specific VQA scenarios. For hallucination correction, it improves Micro-F1 by up to 12.5%, reduces hallucination rates by over 11.9% on medical report generation, and also boosts medical VQA accuracy. These results show that CoEV enables reliable detection and correction of hallucinations, providing clinicians with dependable, evidence-based cues for diagnosis. Code will be released upon acceptance.

2606.18658 2026-06-18 cs.CV eess.IV 新提交

On-Manifold Variational Learning with Heat-Kernel Priors

基于热核先验的流形变分学习

Jiarui Xing, Tal Zeevi, Nian Wu, Jian Wang

发表机构 * Yale School of Medicine（耶鲁大学医学院）； University of Virginia（弗吉尼亚大学）； Harvard Medical School（哈佛医学院）

AI总结提出一种流形锚定变分框架，利用几何感知EM算法选择热核加权潜图上的图中心点作为原型，确保原型在流形上，并通过Dirichlet能量正则化保持潜空间几何平滑，在心脏瘢痕和脑MRI基准上取得最高精度和清晰原型。

详情

AI中文摘要

学习医学影像队列的无监督表示可以揭示临床上有意义的原型，而无需专家标签，这些标签通常带有噪声且无法捕捉真实的病理异质性。然而，现有的深度潜变量模型通过欧几里得平均估计高斯混合先验，产生的原型会偏离弯曲的数据流形，并随着子种群数量的增加而退化。我们提出了一种流形锚定变分框架，基于几何感知的期望最大化（EM）算法，其M步骤选择每个子种群原型作为热核加权潜图上具有最高扩散中心性的图中心点，确保每个原型保持在流形上。Dirichlet能量正则化强制潜空间的几何平滑性，每个子种群的不确定性分数实现了无标签的质量评估。流形锚定EM是一种通用几何工具，扩展了标准EM，并易于应用于其他潜变量模型。在心脏瘢痕和脑MRI基准上，我们的框架在所有比较方法中取得了最高精度，产生了迄今为止最清晰的原型，并且在所有基线退化的较大子种群数量下保持稳定。

英文摘要

Learning unsupervised representations of medical imaging cohorts can reveal clinically meaningful prototypes without expert labels, which are often noisy and fail to capture true pathological heterogeneity. However, existing deep latent-variable models estimate Gaussian mixture priors via Euclidean averaging, producing prototypes that drift off the curved data manifold and degenerate as the number of sub-populations grows. We propose a manifold-anchored variational framework built on a geometry-aware Expectation-Maximization (EM) algorithm, whose M-step selects each sub-population prototype as the graph medoid with the highest diffusion centrality on a heat-kernel-weighted latent graph, ensuring that every prototype remains on-manifold. A Dirichlet energy regularizer enforces geometric smoothness of the latent space, and a per-sub-population uncertainty score enables label-free quality assessment. The manifold-anchored EM is a general-purpose geometric tool that extends standard EM and applies readily to other latent-variable models beyond this setting. On cardiac scar and brain MRI benchmarks, our framework attains the highest accuracy among all compared methods, produces the sharpest prototypes reported to date, and remains stable at large sub-population counts where all baselines degenerate.
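
The contrast between a Euclidean mean and an on-manifold medoid can be illustrated with a toy example. The sketch below is not the authors' algorithm: it assumes a Gaussian heat-kernel affinity exp(-d^2 / (4t)) between latent points and uses the responsibility-weighted row sum of that affinity as a stand-in for "diffusion centrality". The function names, the bandwidth `t`, and the semicircle data are all hypothetical.

```python
import math

def heat_kernel_medoid(points, t=0.5, resp=None):
    """Return the index of the point with the largest weighted heat-kernel affinity.

    Assumption (illustrative proxy, not the paper's definition):
    centrality_i = sum_j resp_j * exp(-||x_i - x_j||^2 / (4 t)).
    """
    if resp is None:
        resp = [1.0] * len(points)  # uniform responsibilities
    best_idx, best_score = -1, -math.inf
    for i, xi in enumerate(points):
        score = 0.0
        for xj, rj in zip(points, resp):
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            score += rj * math.exp(-d2 / (4.0 * t))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx

def euclidean_mean(points):
    """Coordinate-wise average, as used in a standard Gaussian-mixture M-step."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

# Toy "curved manifold": five points on the upper unit semicircle.
angles = [0, 45, 90, 135, 180]
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles]

medoid = points[heat_kernel_medoid(points)]  # an actual data point on the circle
mean = euclidean_mean(points)                # falls inside the circle, off the curve
```

In this toy setting the medoid is one of the observed points, so it stays on the curve by construction, while the mean lands in the interior of the circle where no data lie.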

2606.18675 2026-06-18 cs.CV 新提交

BrainFusionNet: a deep learning and XAI model to understand local, global, and sequential features of MRI images for improved brain tumour detection

BrainFusionNet：一种用于理解MRI图像局部、全局和序列特征以改进脑肿瘤检测的深度学习与XAI模型

Md Taimur Ahad, Bo Song, Yan Li

发表机构 * School of Mathematics, Physics and Computing, University of Southern Queensland（南方昆士兰大学数学、物理与计算学院）； School of Engineering, University of Southern Queensland（南方昆士兰大学工程学院）

AI总结提出BrainFusionNet混合模型，结合CNN、ViT和GRU提取MRI空间、上下文和序列特征，并集成SHAP、LIME和GradCAM进行可解释性分析，在公开数据集上达到98%准确率，优于SOTA CNN。

详情

Journal ref: Brain Inf. 13, 21 (2026)

AI中文摘要

磁共振成像（MRI）的噪声给深度学习（DL）带来挑战，当肿瘤边界模糊、肿瘤位置和外观复杂时尤其如此。因此，我们开发了BrainFusionNet，它结合卷积神经网络（CNN）、视觉变换器（ViT）和门控循环单元（GRU），从MRI图像中提取空间、上下文和序列特征，以改进脑肿瘤分类。此外，集成了可解释AI（如SHAP、LIME和GradCAM），以可视化和突出显示有助于BrainFusionNet决策过程的图像区域。所提出的BrainFusionNet模型在两个公开MRI数据集上进行了评估，K折验证表明在两个数据集上准确率均达到98%。该模型与六种最先进的（SOTA）CNN和迁移学习进行了比较。在SOTA CNN中，DenseNet121和VGG16达到了96%的最高准确率。BrainFusionNet的新颖之处在于，该混合模型能够有效提取MRI图像的局部和全局特征，即使在小尺度肿瘤区域和肿瘤尺寸较小的情况下也是如此。该模型具有平衡的序列CNN架构，以捕获低层和深层特征；以及定制的ViT，可捕获局部特征、稳定梯度流并降低MRI图像训练期间梯度消失的风险。CNN和ViT的输出被馈送到GRU以进行最终分类。此外，我们分析像素强度以确定MRI图像质量是否影响图像分类。我们的发现在图像解释方面非常新颖，因为我们发现MRI图像中像素强度的分布会影响DL性能。

英文摘要

The noise of Magnetic Resonance Imaging (MRI) poses challenges for Deep Learning (DL) when tumor boundaries are obscured and tumor location and appearance are complex. Therefore, we develop BrainFusionNet, which combines Convolutional Neural Networks (CNNs), Vision Transformers (ViT), and Gated Recurrent Units (GRUs) to extract spatial, contextual, and sequential features from MRI images for improved brain tumor classification. Furthermore, explainable AI methods such as SHAP, LIME, and GradCAM are integrated to visualise and highlight image regions that contribute to BrainFusionNet's decision-making process. The proposed BrainFusionNet model is evaluated on two publicly available MRI datasets; K-fold validation suggests 98% accuracy on both datasets. The model was compared with six state-of-the-art (SOTA) CNNs and transfer learning. Among the SOTA CNNs, DenseNet121 and VGG16 achieved the highest accuracy of 96%. The novelty of BrainFusionNet is that the hybrid model effectively extracts local and global features from MRI images, even in small-scale tumor regions and small tumor sizes. The model has a balanced sequential CNN architecture to capture low-level and deeper-layer features, and a customized ViT that captures local features, stabilizes gradient flow, and reduces the risk of vanishing gradients during MRI image training. The CNN and ViT outputs are fed into a GRU for final classification. Furthermore, we analyze pixel intensities to determine whether MRI image quality affects image classification. Our findings are very novel in image interpretation, as we found that the distribution of pixel intensities in MRI images affects DL performance.

2606.18682 2026-06-18 cs.CV 新提交

Multi-Class Brain Tumor Classification Using Advanced Deep Learning Models: A Comparative Study

使用先进深度学习模型的多类脑肿瘤分类：一项比较研究

Asad Channa, Asghar Ali Chandio, Akhtar Hussain Jalbani, Mehwish Leghari, Shahzad Memon

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学计算机科学系）； Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学人工智能系）； The Faculty of Artificial Intelligence and Cyber Security, Universiti Teknikal Malaysia Melaka（马来西亚梅拉卡技术大学人工智能与网络安全学院）； Department of Data Science, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学数据科学系）； Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London（东伦敦大学建筑、计算与工程学院计算机科学与数字技术系）

AI总结本研究比较五种CNN架构（包括定制模型和四种预训练模型）在约10,000张MRI图像上的多类脑肿瘤分类性能，发现EfficientNetB0以95%准确率最优，尤其显著提高了脑膜瘤的召回率（89%）。

详情

AI中文摘要

尽管深度学习最近取得了进展，但从MRI图像中准确分类脑肿瘤仍然面临挑战。在本研究中，我们对五种不同的卷积神经网络（CNN）架构进行了全面评估，包括一个定制的基线模型和四个预训练模型，用于使用临床来源的约10,000张MRI图像数据集对多类脑肿瘤进行分类。这五种架构为定制CNN以及VGG16、VGG19、DenseNet121和EfficientNetB0，它们都在相同的实验框架内进行了训练和测试。性能通过总体准确率和各肿瘤类别的召回率来衡量，以评估每种架构的临床相关性能。我们发现，EfficientNetB0的整体分类准确率最高，为95%，其余依次为VGG16（94.37%）、VGG19（92.29%）、DenseNet121（90.91%）和定制CNN（78.00%）。我们研究的一个特别重要的发现是，在检测脑膜瘤方面有显著改进；具体而言，简单的CNN只能以约20%的召回率检测脑膜瘤，而EfficientNetB0能够以89%的召回率检测脑膜瘤。脑膜瘤通常难以检测，因为它们在MRI图像上可能表现得非常微妙。此外，一个有趣的发现是，更深的VGG19性能不如较浅的VGG16。这表明，在处理医学图像时，CNN模型的架构效率在很多情况下可能比其深度更重要。总体而言，EfficientNetB0似乎在分类准确率、模型参数数量和临床有意义性能之间提供了最佳权衡。

英文摘要

Despite recent advancements in deep learning, accurately classifying brain tumors from MRI images continues to pose challenges. In this research, we present a comprehensive evaluation of five convolutional neural network (CNN) architectures, comprising a customized baseline model and four pre-trained models, for classifying multi-class brain tumors using a clinically sourced dataset of approximately 10,000 MRI images. The five architectures are the customized CNN, VGG16, VGG19, DenseNet121, and EfficientNetB0, all of which were trained and tested within an identical experimental framework. Performance was measured by both overall accuracy and tumor-wise recall in order to assess the clinically relevant performance of each architecture. We found that EfficientNetB0 had the best overall classification accuracy at 95%, compared with VGG16 (94.37%), VGG19 (92.29%), DenseNet121 (90.91%), and the customized CNN (78.00%). An especially important finding of our research was the considerable improvement in detecting meningiomas: while simple CNNs could detect meningiomas with a recall of approximately 20%, EfficientNetB0 detected them with a recall of 89%. Meningiomas are often difficult to detect because they can appear very subtly on MRI images. Additionally, an interesting finding was that the deeper VGG19 performed worse than the shallower VGG16. This indicates that in many cases the architectural efficiency of a CNN model may be more important than its depth when working with medical images. Overall, EfficientNetB0 appears to provide the best trade-off between classification accuracy, number of model parameters, and clinically meaningful performance.
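
The abstract reports both overall accuracy and tumor-wise recall; a small sketch shows why the two can diverge. The labels and predictions below are made up for illustration (the class name `other` is hypothetical and not taken from the paper), and the helper names are assumptions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples of a class / all samples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}

# Made-up toy data: the minority class is missed half the time.
y_true = ["other"] * 8 + ["meningioma"] * 2
y_pred = ["other"] * 8 + ["other", "meningioma"]

overall = accuracy(y_true, y_pred)          # 9 of 10 correct
recalls = per_class_recall(y_true, y_pred)  # one value per class
```

Here the overall accuracy is high even though half of the minority-class cases are missed, which is the kind of gap a per-class recall report is meant to expose.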

2606.18707 2026-06-18 cs.CV 新提交

PEFT-MedSAM: Efficient Fine-Tuning of Medical Foundation Models for Explainable Skin Lesion Segmentation

PEFT-MedSAM：面向可解释皮肤病变分割的医学基础模型高效微调

Asad Channa, Abdullah Khan, Asghar Ali Chandio, Aamir Akbar, Shahzad Memon, Aqib Hussain, Ameer Hamza

发表机构 * Department of Computer Science, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学计算机科学系）； Department of Artificial Intelligence, Quaid-e-Awam University of Engineering, Sciences & Technology（夸迪-艾瓦姆工程、科学与技术大学人工智能系）； Department of Computer Science, Sindh Madressatul Islam University, City Campus, Karachi（信德伊斯兰学院大学卡拉奇城市校区计算机科学系）； Department of Computer Science and Digital Technologies, School of Architecture, Computing and Engineering, University of East London（东伦敦大学建筑、计算与工程学院计算机科学与数字技术系）

AI总结提出参数高效微调方法PEFT-MedSAM，冻结预训练编码器仅训练轻量解码器，在ISIC 2018上达到0.9411 Dice系数，并通过Grad-CAM可解释性增强临床可信度。

详情

AI中文摘要

使用深度学习模型对皮肤镜图像进行皮肤病变自动分割，有助于比常规检测更早发现黑色素瘤。然而，大多数现有的深度学习方法性能不佳。本文旨在提出一种名为PEFT-MedSAM的参数高效微调方法，用于适配医学分割一切模型（MedSAM）以自动分割皮肤镜皮肤病变。PEFT-MedSAM方法仅使用轻量级掩码解码器训练模型，同时保持预训练图像编码器和提示编码器冻结。在ISIC 2018基准数据集上的实验表明，与完全训练的U-Net基线（0.8715 Dice系数）和零样本MedSAM推理（0.8997 Dice系数）相比，PEFT-MedSAM获得了0.9411的Dice系数和0.8918的交并比。使用PH2数据集进行的外部验证显示Dice系数为0.9467，标准差为±0.0310。这些主张的支持证据包括比较两个数据集的Wilcoxon符号秩检验p值小于0.0001，以及bootstrap估计的95%置信区间[0.9364, 0.9447]，该区间表示重复测试获得的平均Dice系数的估计范围。为了增加临床可信度，我们使用Grad-CAM可解释性以及基于指向游戏的评估方法，在验证集上评估CNN基线模型。结果表明，在包含519张图像的验证集上，准确率达到98.27%，并确认模型正确分类了包含皮肤病变的区域。

英文摘要

Automated segmentation of skin lesions in dermoscopic images using deep learning models can help detect melanomas earlier than they would otherwise be found. However, most available deep learning methods do not perform well. This paper presents a parameter-efficient fine-tuning method, PEFT-MedSAM, for adapting the Medical Segment Anything Model (MedSAM) to automatically segment dermoscopic skin lesions. PEFT-MedSAM trains only the lightweight mask decoder while keeping the pre-trained image encoder and prompt encoder frozen. Experiments on the ISIC 2018 benchmark dataset show that PEFT-MedSAM obtains a Dice coefficient of 0.9411 and an intersection over union of 0.8918, compared with a fully trained U-Net baseline (0.8715 Dice coefficient) and zero-shot MedSAM inference (0.8997 Dice coefficient). External validation on the PH2 dataset yields a Dice coefficient of 0.9467 with a standard deviation of +/- 0.0310. Supporting evidence for these claims includes a p-value below 0.0001 for Wilcoxon signed-rank tests comparing the two datasets and a bootstrap-estimated 95% confidence interval of [0.9364, 0.9447], which represents the estimated range of the average Dice coefficient obtained by repeating the test. To increase clinical trustworthiness, we used Grad-CAM explainability along with a pointing-game-based evaluation methodology to evaluate the CNN baseline model on the validation set. The results showed an accuracy of 98.27% on the validation set of 519 images and confirmed that the model classified regions containing skin lesions.
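
For readers unfamiliar with the two overlap metrics quoted above, the following minimal sketch computes the Dice coefficient and intersection over union (IoU) for a single pair of binary masks. It is a generic illustration of the standard definitions, not the paper's evaluation code; the function name and the flat 0/1 mask representation are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice = 2|A∩B| / (|A| + |B|), IoU = |A∩B| / |A∪B| for flat 0/1 masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treat as perfect agreement (a common convention).
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_sum + target_sum)
    iou = inter / union
    return dice, iou

# Toy 4-pixel masks with one overlapping foreground pixel.
dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice). Averaged over a dataset the identity need not hold exactly, which is consistent with the reported mean values of 0.9411 and 0.8918 (applying the identity to the mean Dice would give roughly 0.889).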

2606.18723 2026-06-18 cs.CV cs.LG 新提交

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

发表机构 * AIM For Health Lab, Monash University（莫纳什大学AIM健康实验室）； Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University（莫纳什大学信息技术学院数据科学与人工智能系）； Monash University Victorian Heart Institute（莫纳什大学维多利亚心脏研究所）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）； National Cerebral and Cardiovascular Center（国立循环器病研究中心）； Department of Cardiology, Chonnam National University Hospital and Medical School（全南大学医院和医学院心脏病学系）

AI总结提出GeoCat网络，通过双编码器与可微几何一致性损失，在IVUS分割中降低边界漂移和拓扑错误，提升临床几何测量精度。

Comments Accepted at MICCAI 2026

详情

AI中文摘要

血管内超声（IVUS）管腔和外弹性膜（EEM）分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而，优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误，导致临床测量不准确。我们提出GeoCat，一个几何一致性网络，使用双笛卡尔-极坐标编码器，结合跨域注意力和时间融合，处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符，包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练，这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能，包括Dice/IoU、边界测量（95HD（mm）、ASSD）、拓扑违规率和临床几何误差（dmax/dmin、角度和面积）。在我们的数据集上，GeoCat实现了0.93的Dice，将95HD降低到0.14 mm，并将拓扑违规率降低到1.0%。重要的是，它显著提高了几何保真度，产生0.13-0.16 mm的直径误差和约8度的角度误差，支持可靠的斑块负荷量化。

英文摘要

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden, and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.
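
How delineation errors propagate to the clinical quantities can be seen from the usual IVUS definitions, which the abstract does not restate: plaque area is EEM area minus lumen area, and plaque burden is plaque area divided by EEM area. The sketch below applies those definitions to toy binary masks and adds a simple containment check (lumen pixels outside the EEM) as one possible notion of a topology violation. The function name, the flat mask layout, and the pixel area are hypothetical, and the check is not claimed to match the paper's violation metric.

```python
def plaque_metrics(lumen_mask, eem_mask, pixel_area_mm2):
    """Compute areas (mm^2), plaque burden (%), and a lumen-inside-EEM check.

    lumen_mask, eem_mask: equal-length flat sequences of 0/1 for one frame.
    pixel_area_mm2: physical area of one pixel (hypothetical value in the demo).
    """
    if len(lumen_mask) != len(eem_mask):
        raise ValueError("masks must have the same number of pixels")
    lumen_area = sum(1 for v in lumen_mask if v) * pixel_area_mm2
    eem_area = sum(1 for v in eem_mask if v) * pixel_area_mm2
    if eem_area == 0:
        raise ValueError("EEM mask is empty")
    # Anatomically the lumen lies inside the EEM; any lumen pixel outside it is flagged.
    violation = any(l and not e for l, e in zip(lumen_mask, eem_mask))
    plaque_area = eem_area - lumen_area
    burden_pct = 100.0 * plaque_area / eem_area
    return {
        "lumen_area": lumen_area,
        "eem_area": eem_area,
        "plaque_area": plaque_area,
        "plaque_burden_pct": burden_pct,
        "topology_violation": violation,
    }

# Reference delineation: 2 lumen pixels inside a 4-pixel EEM.
reference = plaque_metrics([0, 1, 1, 0, 0, 0], [1, 1, 1, 1, 0, 0], 0.01)
# Drifted prediction: the lumen grows by one pixel and leaks outside the EEM.
drifted = plaque_metrics([0, 1, 1, 1, 1, 0], [1, 1, 1, 1, 0, 0], 0.01)
```

In the drifted case the lumen gains two pixels, one of them outside the EEM, which lowers the computed plaque burden from 50% to 0% and raises the violation flag, even though most pixels are still labelled correctly.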

2606.18749 2026-06-18 cs.CV 新提交

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的无训练零样本异常检测：基于批次的方法使用2D基础模型

Tai Le-Gia

发表机构 * Chungnam National University（忠南大学）

AI总结提出CS3F框架，利用2D基础模型对3D医学图像进行零样本异常检测，通过沿多轴分解、切片编码和跨主体相似性计算异常分数，并引入粗到细的分词策略减少信号衰减。

详情

AI中文摘要

零样本异常检测（ZSAD）在医学成像中具有吸引力，因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的，它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F，一个无训练的基于批次的框架，用于3D医学图像中的ZSAD，使用2D基础模型。每个体积沿多个解剖轴分解，并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得：在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减，我们引入了一种粗到细的分词策略，无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估，并在肺部CT上验证其泛化能力，超越标准图谱对齐的脑部MRI。结果表明，冻结的2D基础模型可以支持3D医学图像中的异常定位，且细分词化的益处很大程度上取决于病灶对比度和成像模态。

英文摘要

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing ZSAD methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. The slice features are then converted into localized volumetric tokens by pooling neighboring slice features. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, and validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
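
The cross-subject scoring idea can be sketched in a few lines. This is only one plausible reading of the abstract, not the CS3F implementation: each token is compared with every other subject's tokens by cosine similarity, the best match per subject is kept, and the anomaly score is one minus the mean of those best matches. The function names and the 2D toy tokens are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def anomaly_scores(subjects):
    """Score every token by how poorly it is matched in the other subjects.

    subjects: list of per-subject token lists (each token a sequence of floats).
    Assumed scoring rule: 1 - mean over other subjects of the best cosine match.
    """
    if len(subjects) < 2 or any(len(tokens) == 0 for tokens in subjects):
        raise ValueError("need at least two subjects, each with at least one token")
    scores = []
    for s, tokens in enumerate(subjects):
        others = [toks for o, toks in enumerate(subjects) if o != s]
        subject_scores = []
        for tok in tokens:
            best = [max(cosine(tok, other) for other in toks) for toks in others]
            subject_scores.append(1.0 - sum(best) / len(best))
        scores.append(subject_scores)
    return scores

# Toy batch: subject 0 holds one token with no close analogue elsewhere.
batch = [
    [(1.0, 0.0), (0.0, 1.0)],  # second token is the "lesion-like" outlier
    [(1.0, 0.0)],
    [(0.9, 0.1)],
]
scores = anomaly_scores(batch)
```

The scoring is batch-based in the sense that it needs other subjects to compare against; no labels or training are involved, only a frozen feature extractor that would supply the tokens in practice.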

2606.18753 2026-06-18 cs.CV 新提交

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART：一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology（伊利诺伊理工学院）； Amsterdam University Medical Center（阿姆斯特丹大学医学中心）； University College London（伦敦大学学院）

AI总结提出SMART框架，通过解耦全局疾病动态与患者特定解剖表现，学习连续疾病时间图谱，实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情

AI中文摘要

我们介绍了SMART，一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型，缺乏灵活性、限制可解释性，并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战，该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下，SMART通过区域特异性微分方程，沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移，个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集（ADNI-1/GO/2、OASIS-3、AIBL；>1300名受试者）上评估，SMART产生了解剖学上有意义的疾病进展预测，并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18825 2026-06-18 cs.CV 新提交

DreamReg: Belief-Driven World Model for 2D-3D Ultrasound Registration

DreamReg：基于信念驱动的世界模型用于2D-3D超声配准

Luoyao Kang, Yuelin Zhang, Jiwei Shan, Haifan Gong, Qingpeng Ding, Shing Shin Cheng

Affiliations: T Stone Robotics Institute, The Chinese University of Hong Kong; Multi-scale Medical Robotics Center; Perelman School of Medicine, University of Pennsylvania

AI summary: Proposes DreamReg, which frames 2D-3D ultrasound registration as belief updating, uses a world model to simulate probe motions and integrate the imagined outcomes, and achieves robust, accurate real-time registration on the CAMUS and u-RegPro datasets.

详情

AI中文摘要

超声（US）广泛应用于手术导航，但由于部分可观测性、散斑噪声以及依赖于动作的US采集，术中2D切片与术前3D体积之间的实时配准仍然具有挑战性。现有方法是一次性的或短视的，难以随时间收集证据或捕捉外科医生如何根据屏幕反馈调整探头运动。我们提出DreamReg，一个基于信念驱动的世界模型框架，将2D-3D配准形式化为对刚性变换的信念更新。DreamReg维护一个潜在信念状态，总结过去的观测和位姿信息，并在新切片到达时通过学习到的动态不断细化变换。在训练期间，DreamReg暴露于模拟临床扫描行为的探头运动轨迹，并通过将位姿细化条件于当前US观测来学习更新其信念。在推理期间，DreamReg通过内部想象来细化配准：它展开学习到的世界模型以模拟候选探头运动及其预测的观测，并整合这些想象的结果以收敛到准确的刚性变换。在CAMUS和u-RegPro数据集上的实验表明，与最先进方法相比，DreamReg在实时引导中具有改进的鲁棒性和有竞争力的配准精度。

英文摘要

Ultrasound (US) is widely used for surgical navigation, yet real-time registration between intraoperative 2D slices and preoperative 3D volumes remains challenging due to partial observability, speckle noise, and action-dependent US acquisition. Existing methods are one-shot or short-horizon, making it hard for them to gather evidence over time or capture how surgeons adjust probe motion based on on-screen feedback. We propose DreamReg, a belief-driven world-model framework that formulates 2D-3D registration as belief updating over rigid transformations. DreamReg maintains a latent belief state that summarizes past observations and pose information, and continuously refines the transformation through learned dynamics as new slices arrive. During training, DreamReg is exposed to probe-motion trajectories that mimic clinical scanning behavior and learns to update its belief by conditioning pose refinement on the current US observation. During inference, DreamReg refines registration via internal imagination: it rolls out the learned world model to simulate candidate probe motions and their predicted observations, and integrates these imagined outcomes to converge to an accurate rigid transformation. Experiments on the CAMUS and u-RegPro datasets demonstrate improved robustness and competitive registration accuracy for real-time guidance compared with state-of-the-art methods.

2606.18860 2026-06-18 cs.CV cs.LG 新提交

Quantification of Uncertainty with Adversarial Models in Medical Image Segmentation

医学图像分割中对抗模型的不确定性量化

Hana Jebril, Thomas Pinetz, Günter Klambauer, Hrvoje Bogunović

Affiliations: Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria; Comprehensive Center for AI in Medicine, Medical University of Vienna, Austria; ELLIS Unit Linz, LIT AI Lab and Institute for Machine Learning, Johannes Kepler University Linz, Austria; Institute for Machine Learning, Johannes Kepler University Linz, Austria; Clinical Research Center for Medical AI, Johannes Kepler University Linz, Austria

AI summary: Proposes QUAM-SM, a post-hoc framework that uses targeted adversarial search to identify fragile pixels, quantifies uncertainty while separating epistemic from aleatoric uncertainty, and outperforms existing methods on public datasets.

Comments Accepted at MICCAI 2026

详情

AI中文摘要

可靠的像素级不确定性量化具有通过实现高保真纵向监测和区分真实病理变化与伪影来改变临床工作流程的潜力。理想情况下，这些模型提供关键治疗计划和手术干预所需的稳定性。然而，标准深度学习模型常常遭受校准不良，产生过度自信的预测，掩盖了微妙病理边界处的潜在脆弱性。为了解决这个问题，我们提出了QUAM-SM，一种使用针对性对抗搜索来识别“对抗脆弱”像素的后处理框架。通过主动寻找暴露预测不稳定性的扰动，我们的方法突出了决策最容易被翻转的区域。重要的是，该框架将认知不确定性与偶然不确定性分离。在两个具有多个专家标注的公开数据集上的实验表明，QUAM-SM在可靠性和边界敏感性方面优于标准和最新的不确定性估计方法。代码可在以下网址获取：https://this https URL

英文摘要

Reliable pixel-level uncertainty quantification holds the potential to transform clinical workflows by enabling high-fidelity longitudinal monitoring and distinguishing true pathological changes from artifacts. Ideally, these models provide the stability required for critical treatment planning and surgical intervention. However, standard deep learning models often suffer from miscalibration, yielding overconfident predictions that mask underlying vulnerabilities at subtle pathological boundaries. To address this, we propose QUAM-SM, a post-hoc framework using targeted adversarial search to identify "adversarially fragile" pixels. By actively seeking perturbations that expose predictive instability, our method highlights regions where decisions are most vulnerable to being flipped. Importantly, the framework disentangles epistemic uncertainty from aleatoric uncertainty. Experiments on two public datasets with multiple expert annotations demonstrate that QUAM-SM outperforms both standard and recent uncertainty estimation approaches in terms of reliability and boundary sensitivity. Code is available at https://github.com/HanaJebril/quam_sm

2606.18869 2026-06-18 cs.CV 新提交

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

学习扭曲：用于前列腺DWI校正的弱监督图像质量迁移

YuCheng Tang, Wen Yan, Alexander Ng, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, David Atkinson, Shonit Punwani, Daniel Alexander, Shaheer Ullah Saeed, Veeru Kasivisvanathan, Yipeng Hu

Affiliations: UCL Hawkes Institute; Department of Medical Physics and Biomedical Engineering; University College London; Division of Surgery and Interventional Science; Centre for Medical Imaging; British Urology Researchers in Surgical Training (BURST); Department of Radiology; University College London Hospitals NHS Foundation Trust; Centre for Medical Image Computing; Department of Computer Science; Department of Urology

AI summary: Proposes a weakly-supervised image quality transfer framework that uses image quality assessment signals to learn to generate realistic distortions from undistorted images and then trains a correction model, outperforming existing unpaired methods on PI-RADS and Gleason score classification tasks.

详情

AI中文摘要

单次激发平面回波前列腺弥散加权成像（DWI）常因几何失真而复杂化，影响从这些图像中获得可靠诊断的能力。开发自动化校正方法面临缺乏配对的失真和未失真临床扫描的挑战。本文首先提出一种新颖的弱监督图像质量迁移（IQT）框架，从无失真图像到失真图像，利用图像质量评估（IQA）信号监督迁移过程。与传统方法需要昂贵的体素级配对数据或采用无配对算法不同，我们的方法利用图像级质量标签（此处为失真与无失真）在预训练特征空间中建立潜在质量原型。认识到模拟真实失真比直接无配对校正更可靠，我们描述了一种弱监督原型流匹配算法，显式正则化生成轨迹朝向失真原型，产生模拟临床退化的真实磁敏感伪影。通过合成这些真实配对，我们能够训练第二个IQT模型进行正向失真校正。实验结果表明，我们生成的图像成功模拟了真实伪影的诊断干扰，从而产生更强大的失真校正IQT模型。除定性比较外，我们还通过评估临床下游任务性能（PI-RADS和Gleason评分分类），使用分布内和外部数据集，将我们的方法与现有无配对方法（如CycleGAN、UNIT-DDPM和OT-FM）作为正向或反向替代方案进行详尽的定量评估。

英文摘要

Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

2606.18872 2026-06-18 cs.CV 新提交

Bridging Single Distortion Artifacts and Multifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

桥接单一失真伪影与多因素临床质量：基于失真训练的原型网络的少样本双参数MRI质量评估

Yuheng Tang, Alexander Ng, Wen Yan, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel Alexander, Veeru Kasivisvanathan, Yipeng Hu

Affiliations: UCL Hawkes Institute; Department of Medical Physics and Biomedical Engineering; University College London; Division of Surgery and Interventional Science; Centre for Medical Imaging; British Urology Researchers in Surgical Training (BURST); Department of Radiology; University College London Hospitals NHS Foundation Trust; Centre of Medical Imaging, Division of Medicine; Centre for Medical Image Computing; Department of Computer Science; Department of Urology

AI summary: Proposes a few-shot biparametric prototypical network that is meta-trained on distortion labels and, through feature fusion and domain alignment, predicts PI-QUAL clinical quality scores from only five samples, addressing the scarcity of clinical data.

详情

AI中文摘要

临床前列腺多参数MRI高度依赖高质量扩散加权成像（DWI），但DWI读图常因几何失真（通常由直肠气体引起）而受损。通过PI-QUAL评分系统评估质量是新兴的临床标准，但该方法主观、耗时，且存在类别不平衡问题，其中低质量病例多样且相对稀少。以PRIME临床试验为例，6%的图像PI-QUAL评分低于4，87%的DWI问题源于失真，许多其他临床质量问题代表性不足。为解决这种标注临床数据的双重稀缺性，我们提出了一种用于自动图像质量评估（IQA）的少样本双参数原型网络。我们的框架利用双分支3D ResNet融合T2加权和DWI特征，提供解剖背景以区分真实形态与失真。为处理现实异质性，我们引入特征级线性调制（FiLM）和梯度反转层（GRL），以对齐基于不同b值的特征分布，同时抑制采集相关偏差。我们证明，仅基于相对客观、易于获取的失真标签进行元训练的模型，能够仅使用五个代表性样本有效适应预测复杂的多因素临床质量评分（如PI-QUAL）。在两个数据集上的实验结果表明，我们的方法在此具有挑战性的IQA任务中显著优于少样本学习基线，为临床工作流程中标准化前列腺MRI质量控制提供了实际可行且数据高效的解决方案。

英文摘要

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming, and suffers from a class imbalance in which low-quality cases are diverse and relatively scarce. In the PRIME clinical trial, for example, $6\%$ of images have PI-QUAL scores lower than 4 and $87\%$ of DWI issues are due to distortion, while many of the other clinical quality issues are under-represented. To address this common dual scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.
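
As a rough illustration of the prototypical-network idea mentioned in this abstract, the sketch below averages a handful of support embeddings per class into a prototype and assigns a query to the nearest one. The 2-D embeddings, the class names `adequate` and `inadequate`, and the Euclidean metric are all assumptions made for illustration; the paper's actual encoder (a dual-branch 3D ResNet with FiLM and a GRL) is not reproduced here.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Map each class label to the mean embedding of its support samples."""
    return {label: mean_vector(vecs) for label, vecs in support.items()}


def classify(query, prototypes):
    """Return the label whose prototype is closest (Euclidean) to the query."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings: five support samples per quality class.
support = {
    "adequate": [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2], [1.1, 0.1], [0.9, 0.0]],
    "inadequate": [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8], [0.1, 1.1], [0.0, 0.9]],
}
prototypes = build_prototypes(support)
prediction = classify([0.7, 0.3], prototypes)
```

With only five samples per class, the prototype is simply a class mean, which is why this family of methods is a natural fit when labelled low-quality scans are scarce.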

2606.18876 2026-06-18 cs.CV cs.LG 新提交

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

光学相干断层扫描中基于轨迹对齐的时间无关流的测试时自适应

Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth, Hrvoje Bogunović

Affiliations: Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria; Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria; Department of Ophthalmology and Optometry, Medical University of Vienna, Austria; Laboratory for Ophthalmic Image Analysis, Medical University of Vienna, Austria

AI summary: Proposes a flow-matching-based test-time adaptation method that uses histogram matching and removes time conditioning to generate high-quality surrogate images, achieving the best performance in AMD segmentation.

Comments Accepted in MICCAI

2606.18886 2026-06-18 cs.CV 新提交

DINO-Med3D: Bridging Dimension and Domain Gaps in Volumetric Segmentation via Progressive Adaptation

DINO-Med3D：通过渐进式适应弥合体分割中的维度与领域差距

Haoyu Hu, Xiyao Ma, Shiqi Liu, Linsen Zhang, Xiaoliang Xie, Xiaohu Zhou, Zeng-Guang Hou

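Affiliations: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

AI summary: Proposes DINO-Med3D, a two-stage progressive framework that adapts DINOv3 to 3D medical segmentation through a multi-slice embedding module, 3D adapters, and a parallel detail recovery stream, surpassing existing methods on five datasets.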

Comments Accepted at MICCAI 2026. The camera-ready version and link will be made publicly available upon publication

详情

AI中文摘要

尽管DINOv3在自然图像中展现了显著的语义判别能力，但其直接应用于体医学分割受到固有的维度和领域差异的阻碍。为解决这些问题，我们提出DINO-Med3D，一个两阶段渐进框架，将预训练的DINOv3编码器重新用于3D医学任务。在第一阶段，我们通过引入融合伪3D上下文的多切片嵌入模块来弥合维度差距，同时采用分割代理任务将从自然场景学到的表示适应到医学领域。随后，我们通过在冻结的主干中添加轻量级3D适配器来增强体理解，以强制执行全局切片间连续性。最后，为补偿嵌入过程中固有的空间信息损失，我们设计了一个并行细节恢复流，以显式保留高频边界线索。在五个公共数据集上的大量实验表明，我们的方法成功地将DINOv3适应到医学领域，并显著优于最先进的基线方法。

英文摘要

Although DINOv3 has demonstrated remarkable semantic discrimination in natural imagery, its direct application to volumetric medical segmentation is hindered by inherent dimension and domain disparities. To resolve these issues, we propose DINO-Med3D, a two-stage progressive framework that repurposes the pre-trained DINOv3 encoder for 3D medical tasks. In the first stage, we mitigate the dimension gap by introducing a multi-slice embedding module that incorporates pseudo-3D context, while simultaneously employing a segmentation proxy task to adapt representations learned from natural scenes to the medical domain. Subsequently, we further enhance volumetric understanding by adding lightweight 3D adapters into the frozen backbone to enforce global inter-slice continuity. Finally, to compensate for the spatial information loss inherent in the embedding process, we design a parallel detail recovery stream to explicitly preserve high-frequency boundary cues. Extensive experiments on five public datasets demonstrate that our approach successfully adapts DINOv3 to the medical domain and significantly outperforms state-of-the-art baselines.

2606.18894 2026-06-18 cs.CV 新提交

DART: A Design-Aware Microfluidic Chip Paradigm for Real-Time Live-Cell Image Analysis

Johannes Seiffarth, Matthias Pesch, Lukas Scholtes, Dietrich Kohlheyer, Hanno Scharr, Katharina Nöh

Affiliations: Institute for Bio- and Geosciences, IBG-1: Biotechnology; Computational Systems Biotechnology (AVT.CSB), RWTH Aachen University; Institute for Advanced Simulation, IAS-8: Data Analytics and Machine Learning

AI summary: Proposes the DART paradigm, which aligns the CAD blueprint with the physical chip through embedded markers and deep-learning-based detection, enabling rapid localization of all regions of interest and fully automated image processing on high-throughput microfluidic chips, with support for real-time analysis.

详情

AI中文摘要

高通量微流控活细胞成像产生丰富的单细胞数据。然而，用于定位每个包含一个细胞群体的感兴趣区域（RoI）并从记录图像中移除周围微流控结构的半自动化流程随RoI数量扩展，这阻碍了实时图像分析并将洞察时间延迟数小时至数天。我们提出了用于微流控培养芯片的设计感知和实时能力（DART）范式，该范式将CAD蓝图与物理芯片对齐，从而实现了对所有RoI的通量无关定位以及跨不同RoI几何形状和芯片布局的全自动图像处理。DART通过嵌入式基准标记和基于深度学习的标记检测建立这种对齐。我们使用瑞士军刀芯片验证DART，该芯片在1164个RoI位置上组合了八种结构不同的RoI设计。DART在五分钟内定位所有RoI，在40毫秒内从原始显微镜图像中移除微流控结构，并在每张图像1.1秒内执行全自动图像分析，包括细胞分割。这些能力共同使DART成为一个端到端的硬件-软件范式，具有实时分析能力，为闭环和结果驱动的智能显微镜铺平了道路。

英文摘要

High-throughput microfluidic live-cell imaging generates rich single-cell data. Yet semi-automated procedures for locating regions of interest (RoIs), each containing one cell population, and removing surrounding microfluidic structures from recorded images, scale with the number of RoIs. This prevents real-time image analysis and delays time-to-insight by hours to days. We introduce the Design-Aware and Real-Time capable (DART) paradigm for microfluidic cultivation chips, which aligns the CAD blueprint with the physical chip and thereby enables throughput-independent localization of all RoIs and fully automated image processing across diverse RoI geometries and chip layouts. DART establishes this alignment through embedded fiducial markers and deep-learning-based marker detection. We validate DART using the Swiss Army Knife chip, which combines eight structurally distinct RoI designs across 1164 RoI locations. DART localizes all RoIs in five minutes, removes microfluidic structures from raw microscopy images in 40 ms, and performs fully automated image analysis, including cell segmentation, in under 1.1 s per image. Together, these capabilities establish DART as an end-to-end hardware-software paradigm with real-time-capable analysis that paves the way toward closed-loop and outcome-driven smart microscopy.
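
To make the blueprint-to-chip alignment step more concrete, here is a minimal sketch of one way it could be done once marker positions are known: fit a 2-D similarity transform (rotation, uniform scale, translation) from two fiducial correspondences and use it to map any RoI from CAD coordinates to image coordinates. This is an assumption for illustration only; the abstract does not state which transform model DART uses, and the function names and coordinates below are hypothetical.

```python
def fit_similarity(design_pts, image_pts):
    """Fit z' = a*z + b from two marker correspondences.

    Points are treated as complex numbers, so the complex factor `a`
    encodes rotation and uniform scale, and `b` encodes translation.
    """
    p0, p1 = (complex(*p) for p in design_pts)
    q0, q1 = (complex(*q) for q in image_pts)
    a = (q1 - q0) / (p1 - p0)
    b = q0 - a * p0
    return a, b


def to_image(point, a, b):
    """Map a CAD-space point into image space with the fitted transform."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical markers: CAD positions and where they were detected in the image.
a, b = fit_similarity([(0, 0), (100, 0)], [(10, 20), (10, 220)])
roi_in_image = to_image((50, 10), a, b)
```

Because the transform is fitted once per chip, mapping each additional RoI is a single multiply-add, which is consistent with the throughput-independent localization described in the abstract.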

2606.18970 2026-06-18 cs.LG cs.AI cs.CV 交叉投稿

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

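Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Uses a controlled benchmark to compare quantum and classical generators for brain MRI data augmentation, finding that neither significantly outperforms training on real data alone and that the quantum generator offers no additional advantage.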

Comments This work has been submitted to the IEEE for possible publication.

详情

AI中文摘要

医学图像分类常受限于有限的标注数据，因此生成式增强被提出；最近，量子生成模型被用于此目的，并经常报告准确率提升。然而，这些声称通常基于单次训练运行，未匹配量子与经典生成器的参数预算，也未表征任何收益出现的数据范围。我们提出了一个受控基准测试，隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中，在该空间中，使用变分量子生成器或参数数量几乎相同的经典生成器（1648 vs. 1632）训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器，覆盖从5%到100%的标注数据比例，通过八个随机种子进行配对显著性检验（多重比较校正）以及集内多样性和潜在分布分析。在所有比例下，没有增强变体显著优于仅用真实数据训练，且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展：合成样本分布外移，并且在数据稀缺时严重模式崩溃，而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
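
The abstract mentions paired significance testing with multiple-comparison correction but does not name the correction. As one common choice, the sketch below applies the Holm step-down adjustment to a list of raw p-values; both the choice of Holm and the example p-values are assumptions for illustration, not the paper's reported procedure or results.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is scaled by the number of hypotheses
        # not yet processed, then capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values, one per augmentation variant compared to baseline.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

In this made-up example three raw p-values fall below 0.05, but only one survives the adjustment, which shows how results that look like gains in isolated comparisons can disappear under a corrected protocol.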

2510.10779 2026-06-18 cs.CV 版本更新

Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans

结构化谱图表示学习用于3D CT扫描的多标签异常分析

Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel

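Affiliations: INSA Lyon, University of Lyon, CNRS, INSERM, CREATIS UMR 5220, U1294

AI summary: Proposes a 2.5D framework based on spectral graph convolution that represents 3D CT volumes as structured graphs, models inter-slice dependencies with axial slice triplets as nodes, and performs multi-label abnormality classification with strong cross-dataset generalization.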

Comments Accepted at MELBA Journal 2026

详情

DOI: 10.59275/j.melba.2026-87e3

AI中文摘要

随着CT检查数量的增长，对器官分割、异常检测和报告生成等自动化工具的需求日益增加，以支持放射科医生管理临床工作负载。由于三维数据中固有的复杂空间关系和异常的广泛变异性，3D胸部CT扫描的多标签分类仍然是一个关键但具有挑战性的问题。基于3D卷积神经网络的现有方法难以捕捉长距离依赖，而视觉Transformer通常需要在大规模领域特定数据集上进行大量预训练才能获得竞争力。在这项工作中，我们提出了一种2.5D替代方案，引入了一个新的基于图的框架，将3D CT体积表示为结构化图，其中轴向切片三元组作为节点，通过谱图卷积处理，使模型能够推理层间依赖，同时保持与临床部署兼容的复杂度。我们的方法在来自独立机构的3个数据集上进行训练和评估，实现了强大的跨数据集泛化能力，并与最先进的视觉编码器相比表现出竞争性能。我们进一步进行了全面的消融研究，以评估各种聚合策略、边加权方案和图连接模式的影响。此外，我们通过自动放射学报告生成和腹部CT数据的迁移实验展示了我们方法的更广泛适用性。

英文摘要

With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.

2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态：基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

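Affiliations: University of Cambridge; Nanjing First Hospital; Nanjing Medical University; Johns Hopkins University; University of Dundee

AI summary: Proposes the Δ-LFM framework, which uses flow matching to align patient latent trajectories and models monotonic disease progression through patient-specific latent alignment, with interpretability and performance validated on three longitudinal MRI benchmarks.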

Comments ICLR 2026 accepted

详情

AI中文摘要

理解疾病进展是一个直接的临床挑战，对早期诊断和个性化治疗具有重要意义。虽然最近的生成方法试图对进展进行建模，但关键不匹配仍然存在：疾病动态本质上是连续且单调的，然而潜在表示通常是分散的，缺乏语义结构，并且基于扩散的模型通过随机去噪过程破坏了连续性。在这项工作中，我们提出将疾病动态视为速度场，并利用流匹配（FM）来对齐患者数据的时间演变。与先前方法不同，它捕捉了疾病的内在动态，使进展更具可解释性。然而，一个关键挑战仍然存在：在潜在空间中，自动编码器（AE）不能保证跨患者的对齐或与临床严重性指标（例如年龄和疾病状况）的相关性。为了解决这个问题，我们提出学习患者特异性潜在对齐，这迫使患者轨迹沿着特定轴延伸，其幅度随疾病严重程度单调增加。这导致了一个一致且语义上有意义的潜在空间。总之，我们提出了Δ-LFM，一个用于通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上，Δ-LFM展示了强大的实证性能，更重要的是，为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which constrains patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
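
For readers unfamiliar with flow matching, the sketch below shows the basic regression target under the commonly used straight-line path between two latent codes: the point at time t is a linear interpolation, and the velocity to be learned is the constant difference between the endpoints. Treating `x0` and `x1` as latent codes from an earlier and a later visit, and using a linear path, are assumptions for illustration; the exact parameterization used by $Δ$-LFM is not given in the abstract.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    squared = [(p - u) ** 2 for p, u in zip(predicted_velocity, target)]
    return sum(squared) / len(squared)


# Hypothetical 2-D latent codes for an earlier and a later scan.
x0, x1 = [0.0, 1.0], [2.0, 1.0]
x_t = interpolate(x0, x1, 0.25)
loss = flow_matching_loss([1.5, 0.0], x0, x1)
```

In training, a network would receive `x_t` and `t` and be regressed onto the target velocity; here the prediction is a fixed vector so the example stays self-contained.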

2512.10353 2026-06-18 cs.CV 版本更新

Hybrid Transformer-Mamba for Weakly Supervised Volumetric Medical Segmentation

混合Transformer-Mamba用于弱监督体积医学分割

Yiheng Lyu, Lian Xu, Coen Arrow, Mohammed Bennamoun, Farid Boussaid, Girish Dwivedi

Affiliations: University of Western Australia; Harry Perkins Institute of Medical Research; National Imaging Facility; Fiona Stanley Hospital; Victor Chang Cardiac Research Institute

AI summary: Proposes TranSamba, a hybrid architecture that captures 3D context through cross-plane modeling and enables efficient weakly supervised volumetric segmentation, achieving the best performance on three datasets.

详情

AI中文摘要

弱监督分割使得模型能够从平面级标签进行训练。现有方法通常依赖2D编码器，忽略了医学数据的体积特性。我们提出TranSamba，一种混合Transformer-Mamba架构，旨在通过跨平面建模捕获3D上下文。TranSamba在Vision Transformer骨干网络基础上增加跨平面Mamba块，利用线性时间建模实现相邻平面间的高效信息交换。这种交换改善了平面内自注意力以及后续用于目标定位的注意力图。TranSamba在输入体积深度上保持线性时间复杂度和恒定空间复杂度。在涵盖不同模态和病理的三个数据集上的大量实验表明，TranSamba达到了最先进的性能，展示了跨平面建模的泛化有效性。代码可在以下网址获取：this https URL.

英文摘要

Weakly supervised segmentation enables model training from plane-level labels. Existing methods often rely on 2D encoders, neglecting the volumetric nature of medical data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context via cross-plane modeling. TranSamba augments a Vision Transformer backbone with Cross-Plane Mamba blocks, leveraging linear-time modeling for efficient information exchange across neighboring planes. This exchange improves in-plane self-attention and subsequent attention maps for object localization. TranSamba maintains linear time complexity and constant space complexity with respect to the input volume depth. Extensive experiments on three datasets covering diverse modalities and pathologies show that TranSamba achieves state-of-the-art performance, demonstrating the generalizable efficacy of cross-plane modeling. Code is available at: https://github.com/YihengLyu/TranSamba.

2606.00491 2026-06-18 cs.CV cs.AI 版本更新

Enhancing the Cross-Scale Reasoning Ability of Pathology Vision-Language Models

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Affiliations: Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

AI summary: Proposes the first cross-scale training and evaluation paradigm, which strengthens the cross-scale reasoning of pathology vision-language models through multi-magnification visual question answering, and builds the high-quality benchmark Scale-VQA and the model ScaleReasoner-R1, which achieves the best performance.

详情

AI中文摘要

病理图像本质上是多尺度的，要求病理学家整合从低倍放大下的整体组织结构到高倍放大下的细胞形态的证据以进行准确诊断。虽然现有的视觉语言模型（VLM）病理数据集包含多种尺度，但它们通常缺乏明确的跨尺度推理目标。这一限制阻碍了VLM捕获关键的跨尺度表示和学习基于证据的推理。为弥补这一差距，我们引入了首个跨尺度训练和评估范式，将病理解释表述为多倍率推理。然而，创建这样的任务揭示了一个关键挑战：多图像视觉问答（VQA）容易受到仅文本捷径的影响，这使得模型能够利用与放大倍数相关的伪影而非视觉证据来猜测答案。为解决此问题，我们提出了一种泄漏感知的策展流程，结合了对抗性仅文本筛选和约束引导的问题设计。利用该流程，我们构建了Scale-VQA，一个高质量基准，包含4,685个多项选择题，基于2,537张跨多个放大级别的病理图像。最后，我们提出了ScaleReasoner-R1，一个通过强化学习训练的模型，以优化跨尺度VQA任务的性能。ScaleReasoner-R1在我们的跨尺度推理基准上达到了最先进的性能，并在已有的单尺度基准上泛化到最先进的性能。研究结果表明，即使是有限的跨尺度监督也能显著改善病理理解。代码和演示将开源。

英文摘要

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathology datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. These findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.

2508.11211 2026-06-18 eess.IV cs.CV 版本更新

Efficient Image-to-Image Schrödinger Bridge for CT Field of View Extension

面向CT视野扩展的高效图像到图像薛定谔桥

Zhenhao Li, Song Ni, Long Yang, Xiaojie Yin, Haijun Yu, Jiazhou Wang, Hongbin Han, Weigang Hu, Yixing Huang

Affiliations: Institute of Medical Technology, Peking University Health Science Center; Shanghai Cancer Center, Fudan University; Department of Electrical and Computer Engineering, University of Massachusetts Lowell; Beijing Key Laboratory of Intelligent Neuromodulation and Brain Disorder Treatment

AI summary: Proposes a CT field-of-view extension framework based on the image-to-image Schrödinger Bridge (I²SB) diffusion model, which directly learns a stochastic mapping between limited-FOV and extended-FOV images, enables fast one-step inference, and surpasses existing diffusion models in both accuracy and speed.

Comments 12 pages

详情

Journal ref: IEEE Transactions on Radiation and Plasma Medical Sciences 2026

AI中文摘要

计算机断层扫描（CT）是一种用于无创、高分辨率可视化内部解剖结构的基石成像模态。然而，当扫描物体超出扫描仪的视野（FOV）时，投影数据被截断，导致重建不完整并在FOV边界附近出现明显伪影。传统重建算法难以从这类数据中恢复准确的解剖结构，限制了临床可靠性。深度学习方法已被探索用于FOV扩展，其中扩散生成模型代表了图像合成的最新进展。然而，传统扩散模型由于迭代采样过程，计算量大且推理速度慢。为解决这些限制，我们提出了一种基于图像到图像薛定谔桥（I$^2$SB）扩散模型的高效CT FOV扩展框架。与从纯高斯噪声合成图像的传统扩散模型不同，I$^2$SB学习配对的有限FOV和扩展FOV图像之间的直接随机映射。这种直接对应关系产生了更可解释和可追踪的生成过程，增强了重建中的解剖一致性和结构保真度。I$^2$SB实现了优越的定量性能，在模拟噪声数据上的均方根误差（RMSE）值为49.8 HU，在真实数据上为152.0 HU，优于最先进的扩散模型，如条件去噪扩散概率模型（cDDPM）和基于块的扩散方法。此外，其单步推理使得每2D切片的重建仅需0.19秒，相比cDDPM（135秒）实现了超过700倍的加速，并超过了第二快的DiffusionGAN（0.58秒）。这种准确性和效率的结合表明I$^2$SB具有实时或临床部署的潜力。

英文摘要

Computed tomography (CT) is a cornerstone imaging modality for non-invasive, high-resolution visualization of internal anatomical structures. However, when the scanned object exceeds the scanner's field of view (FOV), projection data are truncated, resulting in incomplete reconstructions and pronounced artifacts near FOV boundaries. Conventional reconstruction algorithms struggle to recover accurate anatomy from such data, limiting clinical reliability. Deep learning approaches have been explored for FOV extension, with diffusion generative models representing the latest advances in image synthesis. Yet, conventional diffusion models are computationally demanding and slow at inference due to their iterative sampling process. To address these limitations, we propose an efficient CT FOV extension framework based on the image-to-image Schrödinger Bridge (I$^2$SB) diffusion model. Unlike traditional diffusion models that synthesize images from pure Gaussian noise, I$^2$SB learns a direct stochastic mapping between paired limited-FOV and extended-FOV images. This direct correspondence yields a more interpretable and traceable generative process, enhancing anatomical consistency and structural fidelity in reconstructions. I$^2$SB achieves superior quantitative performance, with root-mean-square error (RMSE) values of 49.8 HU on simulated noisy data and 152.0 HU on real data, outperforming state-of-the-art diffusion models such as conditional denoising diffusion probabilistic models (cDDPM) and patch-based diffusion methods. Moreover, its one-step inference enables reconstruction in just 0.19 s per 2D slice, representing over a 700-fold speedup compared to cDDPM (135 s) and surpassing DiffusionGAN (0.58 s), the second fastest. This combination of accuracy and efficiency indicates that I$^2$SB has potential for real-time or clinical deployment.
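
The speedup figures quoted in this abstract can be checked directly from the per-slice timings it reports. The short calculation below uses only those three numbers; the dictionary keys are just labels chosen here.

```python
# Per-slice inference times in seconds, as reported in the abstract.
times_s = {"I2SB": 0.19, "DiffusionGAN": 0.58, "cDDPM": 135.0}

# Speedup of I2SB relative to each method (ratio of inference times).
speedup = {name: t / times_s["I2SB"] for name, t in times_s.items()}
```

The ratio against cDDPM comes out at roughly 710, consistent with the stated "over a 700-fold speedup", and the ratio against DiffusionGAN is roughly 3.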

2606.18721 2026-06-18 cs.CV 新提交

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

重新思考表格结构识别中的指针损失：面向空间局部性的几何感知指针损失

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

发表机构 * Teamreboott Inc.（Teamreboott公司）

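AI summary: Pointer networks for table structure recognition make 79.6% of their errors between spatially adjacent cells. The paper proposes a Geometry-Aware Pointer loss that reweights the cross-entropy objective with inverse distance weighting, concentrating gradients on nearby cells and improving performance at no extra inference cost.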

AI中文摘要

使用指针网络的表格结构识别（TSR）通过预测HTML序列同时将标签与检测到的文本（或单元格）区域对齐，取得了令人印象深刻的结果。然而，我们的分析揭示，当指针网络失败时，79.6%的错误发生在空间相邻的单元格之间（曼哈顿距离<=2）。尽管如此，标准交叉熵损失对所有负候选样本赋予相同权重。在这项工作中，我们提出了几何感知指针（GAP）损失，它根据与真实值的空间邻近性重新加权交叉熵目标。通过应用反距离加权，GAP将梯度流集中在模型最困难的区域：相邻单元格比远处单元格获得更强的梯度。我们的方法仅需对损失计算进行简单修改，保持相同的模型架构且零额外推理成本。在PubTabNet和SynthTabNet上的大量实验表明，GAP持续减少相邻单元格错误，达到了新的最先进性能。我们的发现表明，在损失层面融入几何归纳偏置为鲁棒TSR提供了一种简单而有效的方法。我们的代码可在以下网址获取：https://github.com/teamreboott/GAP

英文摘要

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
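
To make the reweighting idea concrete, here is a minimal sketch of one possible inverse-distance-weighted cross-entropy over pointer candidates. The exact GAP formulation is not given in the abstract, so the weighting rule (1 / distance^alpha applied to negatives inside the softmax denominator), the `alpha` parameter, and all function names are assumptions made for illustration only.

```python
import math

def manhattan(a, b):
    """Manhattan distance between two (row, col) grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def inverse_distance_pointer_loss(logits, cells, target, alpha=1.0):
    """Cross-entropy whose negatives are weighted by 1 / distance**alpha.

    logits: one score per candidate cell
    cells:  (row, col) position of each candidate
    target: index of the ground-truth cell
    alpha:  hypothetical sharpness parameter; alpha=0 gives plain cross-entropy
    """
    weights = []
    for i, cell in enumerate(cells):
        if i == target:
            weights.append(1.0)
        else:
            d = manhattan(cell, cells[target])
            # Neighbors (small d) keep a weight near 1; distant cells are down-weighted.
            weights.append(1.0 / max(d, 1) ** alpha)
    m = max(logits)  # subtract the max logit for numerical stability
    z = sum(w * math.exp(l - m) for w, l in zip(weights, logits))
    return -(logits[target] - m) + math.log(z)

# Three candidates on one row: the target, an adjacent cell, and a distant cell.
cells = [(0, 0), (0, 1), (0, 5)]
logits = [0.0, 0.0, 0.0]
plain_ce = inverse_distance_pointer_loss(logits, cells, target=0, alpha=0.0)
weighted = inverse_distance_pointer_loss(logits, cells, target=0, alpha=1.0)
```

In this sketch the gradient with respect to a negative logit is proportional to its weight, so the adjacent cell receives five times the gradient of the cell at distance 5, which mirrors the behavior the abstract describes.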

2606.18793 2026-06-18 cs.CV 新提交

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

模糊几何分支点建模用于结构感知的手写汉字增强

Dongbin Jiao, Yibo Lyu, Qiulu Wei, Fuxiang Lu, Shengcai Liu, Shi Yan

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology（南方科技大学计算机科学与工程系广东省类脑智能计算重点实验室）

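AI summary: To address data scarcity and structural distortion in handwritten Chinese character augmentation, the paper proposes a fuzzy-geometry-based structure-aware augmentation framework that models and optimizes branch points as fuzzy sets and generates samples with Bézier reconstruction and multi-strategy perturbations, significantly reducing the word-level error rate.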

AI中文摘要

数据稀缺和结构失真严重限制了高安全性认证中的手写识别。现有的增强方法常导致拓扑和形态损伤，尤其在处理复杂汉字时，笔画交叉、连笔和急转弯使传统分支点检测不可靠。为此，本文提出一种模糊几何驱动的结构感知（FGSA）增强框架。我们将分支点建模为骨架空间中的模糊集，通过整合拓扑邻域证据和方向场散度，构建连续的分支点隶属度场。该隶属度场通过无监督代理目标自适应优化，实现无需人工标注的鲁棒笔画解耦。最后，通过参数化三次贝塞尔重建和多策略扰动合成运动学对齐样本，确保结构保真度与样本多样性之间的平衡。此外，我们建立了LZUSig，一个专门针对中文手写签名细粒度结构退化的大规模高挑战性数据集。在CASIA-HWDB1.1、ChiSig和LZUSig上的大量实验表明，FGSA显著降低了字错误率（ΔWER），在对比基线中取得了最优识别增益。更重要的是，它在任务增益、结构保真度和判别特征保留之间实现了稳健的权衡，为手写增强提供了一种高度可控的解决方案。

英文摘要

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($Δ$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

2606.18884 2026-06-18 cs.CV 新提交

Performance Gap Analysis between Latin and Arabic Scripts HTR

拉丁文与阿拉伯文手写文本识别之间的性能差距分析

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * Luleå University of Technology Department of Computer Science, Electrical

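AI summary: Using a unified CRNN model across multiple datasets, the study compares handwritten text recognition performance on Arabic and Latin scripts. The gap is pronounced in low-resource settings, narrows as data grows but persists, and the study analyzes factors such as annotation quality, visual variability, and character distribution.

Comments This paper was accepted at the TIPS workshop at ICPR 2026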

AI中文摘要

尖峰金字塔小波变换用于高效低能耗图像恢复

Chen Zhao, Xiantao Hu, Song Wu, Qian Wang, Chen Wu, Rui Xie, Jian Yang, Ying Tai

发表机构 * Nanjing University（南京大学）； Nanjing University of Science and Technology（南京理工大学）； University of Science and Technology of China（中国科学技术大学）； China Mobile Institute（中国移动研究院）

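AI summary: Proposes SPWM, a model based on spiking neural networks and a pyramid wavelet transform, whose SDPW block models long-range dependencies and exploits degradation properties in the wavelet domain, significantly reducing computation and energy consumption while maintaining image quality.

Comments Accepted by Pattern Recognition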

AI中文摘要

尖峰神经网络（SNNs）因其高效性和生物启发的潜力在计算机视觉领域引起了广泛兴趣。虽然基于尖峰CNN的方法在图像恢复（IR）任务中显示出前景，但其性能受到CNN操作固有感受野限制的约束。在本文中，我们探索了离散小波变换的优势，并提出了一种基于尖峰金字塔小波模型（SPWM）以实现高效低能耗目标。具体来说，我们开发了一个尖峰双金字塔小波（SDPW）块来建模长程依赖并利用小波域中的退化特性。在多个基准上的实验结果表明，SPWM在保持图像质量的同时显著降低了计算成本和能耗。我们的方法展示了SNNs在IR领域的潜力，为资源受限设备的未来应用提供了新的见解。

英文摘要

Spiking neural networks (SNNs) have garnered significant interest in computer vision due to their potential for efficiency and biological inspiration. While spiking CNN-based methods have shown promise for image restoration (IR) tasks, their performance is constrained by the inherent receptive field limitations of CNN operations. In this paper, we explore the benefits of discrete wavelet transformation and propose a spiking pyramid wavelet-based model (SPWM) for highly efficient, low-energy restoration. Specifically, we develop a spiking dual pyramid wavelet (SDPW) block to model long-range dependency and exploit the properties of the degradation in the wavelet domain. Experimental results on several benchmarks demonstrate that SPWM significantly lowers computational costs and energy consumption while maintaining image quality. Our method showcases the potential of SNNs in the field of IR, offering new insights for future applications on resource-limited devices.

2606.19046 2026-06-18 cs.CV 新提交

Low-Rank Tensor Completion Based on Fractional Regularization with Ky Fan p-k Norm

基于Ky Fan p-k范数分数阶正则化的低秩张量补全

Shan Fan, Feng Zhang, Jianjun Wang, Xi-Le Zhao, Tingwen Huang

发表机构 * School of Mathematics and Statistics, Southwest University（西南大学数学与统计学学院）； School of Mathematical Sciences/Research Center for Image and Vision Computing, University of Electronic Science and Technology of China（电子科技大学数学科学学院/图像与视觉计算研究中心）； Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology（深圳先进技术大学计算机科学与控制工程学院）

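AI summary: Proposes the ratio of the tensor nuclear norm to the Ky Fan p-k norm (TNPK) as a nonconvex surrogate for the tensor tubal rank, builds a low-rank tensor completion model, proves that low-rank tensors are local minimizers, and designs an ADMM algorithm; experiments show it outperforms existing methods.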

AI中文摘要

本文通过提出一种新颖的非凸替代，即张量核范数与张量Ky Fan p-k范数（TNPK）之比，来精确逼近张量管秩，从而解决低秩张量补全（LRTC）问题。TNPK具有吸引人的性质，包括尺度不变性、参数灵活性以及在特定p和k选择下存在闭式解。在特定的p和k参数设置下，它退化为张量核范数与张量Ky Fan k范数（TNK）之比或张量核范数与张量Frobenius范数（TNF）之比。我们构建了一个LRTC模型，并在张量零空间性质（NSP）下，证明了低秩张量是所提模型的局部极小点。此外，我们推导了Ky Fan p-k逆范数的近端算子，并进一步开发了一种高效的交替方向乘子法（ADMM）算法，在温和条件下保证子序列收敛。在合成和真实世界数据集上的大量实验验证了我们的方法相对于最先进竞争者的优越性能。

英文摘要

This paper addresses low-rank tensor completion (LRTC) by proposing a novel nonconvex surrogate, namely the ratio of the tensor nuclear norm to the tensor Ky Fan p-k norm (TNPK), to accurately approximate the tensor tubal rank. The TNPK possesses appealing properties, including scale invariance, parameter flexibility, and the existence of closed-form solutions under specific choices of p and k. With specific parameter settings of p and k, it reduces to the ratio of the tensor nuclear norm to the tensor Ky Fan k norm (TNK) or the ratio of the tensor nuclear norm to the tensor Frobenius norm (TNF). We construct an LRTC model and, under the tensor null space property (NSP), prove that low-rank tensors are local minimizers of the proposed model. Moreover, we derive the proximal operator of the Ky Fan p-k inverse-norm and further develop an efficient alternating direction method of multipliers (ADMM) algorithm with guaranteed subsequential convergence under mild conditions. Extensive experiments on synthetic and real-world datasets validate the superior performance of our method against state-of-the-art competitors.
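
The scale invariance and the two special cases mentioned in this abstract can be illustrated on a plain list of singular values. The sketch below is a matrix-style analogue only: it assumes the usual definition of the Ky Fan p-k norm (the p-norm of the k largest singular values) and takes the singular values as given, whereas the paper works with tensor singular values. The function names are hypothetical.

```python
def ky_fan_pk_norm(sigmas, p, k):
    """p-norm of the k largest singular values: (sum of sigma_i**p) ** (1/p)."""
    top = sorted(sigmas, reverse=True)[:k]
    return sum(s ** p for s in top) ** (1.0 / p)

def nuclear_to_ky_fan_ratio(sigmas, p, k):
    """Nuclear norm (sum of all singular values) divided by the Ky Fan p-k norm."""
    return sum(sigmas) / ky_fan_pk_norm(sigmas, p, k)

sigmas = [3.0, 2.0, 1.0]

# p = 1 gives the Ky Fan k norm in the denominator (the TNK-style ratio).
ratio_k = nuclear_to_ky_fan_ratio(sigmas, p=1, k=2)

# p = 2 with k equal to the number of singular values gives the Frobenius norm
# in the denominator (the TNF-style ratio).
ratio_f = nuclear_to_ky_fan_ratio(sigmas, p=2, k=3)

# Scaling every singular value by the same constant leaves the ratio unchanged.
scaled = [10.0 * s for s in sigmas]
ratio_scaled = nuclear_to_ky_fan_ratio(scaled, p=2, k=2)
ratio_unscaled = nuclear_to_ky_fan_ratio(sigmas, p=2, k=2)
```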

2606.19097 2026-06-18 cs.CV 新提交

DVANet: Degradation-aware Visual-prior Alignment Network for Image Restoration

DVANet: 面向图像复原的退化感知视觉先验对齐网络

Yanjie Tu, Qingsen Yan, Axi Niu, Tao Hu, Haokui Zhang, Jiantao Zhou

发表机构 * School of Computer Science, Northwestern Polytechnical University（西北工业大学计算机学院）； Shenzhen Research Institute of Northwestern Polytechnical University（西北工业大学深圳研究院）； State Key Laboratory of Internet of Things for Smart City, University of Macau（澳门大学智慧城市物联网国家重点实验室）

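AI summary: Proposes DVANet, a deep unfolding network based on half-quadratic splitting that jointly unfolds degradation-aware observation consistency and visual-prior-guided reconstruction for unified image restoration under complex degradations, with strong results across multiple degradation scenarios and cross-domain tasks.

Comments All-in-One Image Restoration; Deep Unfolding; Degradation Representation; Visual Prior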

AI中文摘要

全能图像复原旨在开发一个统一的复原框架来处理多种退化类型。现有的端到端方法通常将复原过程视为黑盒映射，缺乏明确的优化解释。尽管深度展开为图像复原提供了可解释的迭代建模范式，但现有方法大多依赖于固定的退化假设或预定义的退化信息，难以适应复杂退化和局部内容受损下的统一复原需求。这一限制制约了它们在退化抑制和结构细节恢复方面的性能。为解决这些问题，本文提出DVANet，一种受半二次分裂优化算法启发的深度展开网络，将复杂退化下的统一图像复原公式化为退化感知观测一致性与视觉先验引导重建之间的协同展开过程。具体而言，在退化感知观测一致性分支中，采用退化表示模块提取全局退化属性和局部退化线索，并利用退化条件映射增强模型对不同退化类型的适应性。在视觉先验引导重建分支中，引入DINOv3提供结构和语义信息作为层次化视觉先验，从而补充受损区域缺失的结构信息并改善细节恢复。大量实验表明，DVANet在多场景退化和跨域图像复原任务上取得了优越或具有竞争力的性能，展现出良好的退化适应性和泛化能力。

英文摘要

All-in-One image restoration aims to develop a unified restoration framework for handling diverse degradation types. Existing end-to-end methods usually regard the restoration process as a black-box mapping, lacking an explicit optimization interpretation. Although deep unfolding provides an interpretable iterative modeling paradigm for image restoration, existing methods mostly rely on fixed degradation assumptions or predefined degradation information, making them difficult to adapt to unified restoration requirements under complex degradations and locally damaged content. This limitation restricts their performance in degradation suppression and structural detail recovery. To address these issues, this paper proposes DVANet, a deep unfolding network inspired by the half-quadratic splitting optimization algorithm, which formulates unified image restoration under complex degradations as a collaborative unfolding process between degradation-aware observation consistency and visual-prior-guided reconstruction. Specifically, in the degradation-aware observation consistency branch, a degradation representation module is employed to extract global degradation attributes and local degradation cues, and degradation-conditioned mapping is used to enhance the model's adaptability to different degradation types. In the visual-prior-guided reconstruction branch, DINOv3 is introduced to provide structural and semantic information as hierarchical visual priors, thereby complementing the missing structural information in damaged regions and improving detail recovery. Extensive experiments demonstrate that DVANet achieves superior or competitive performance on multi-scenario degradation and cross-domain image restoration tasks, showing favorable degradation adaptability and generalization ability.

2204.14224 2026-06-18 cs.CV cs.LG eess.IV 版本更新

Investigation of Neural Network Methods for Reconstruction and Classification of Texture Images Under Conditions of Incomplete Information

不完全信息条件下纹理图像重建与分类的神经网络方法研究

Galymzhan Abdimanap, Kairat Bostanbekov, Abdelrahman Abdallah, Anel Alimova, Darkhan Kurmangaliyev, Daniyar Nurseitov, Tatyana Dedova, Larissa Balakay, Serik Nurakynov

发表机构 * Satbayev University（萨特巴耶夫大学）； Institute of Ionosphere LLP（电离层研究所）； Information Technology Department（信息技术部门）； Assiut University（阿西乌特大学）

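AI summary: Proposes an end-to-end framework combining object detection, GAN-based inpainting with CRA, and Transformer/CNN classification. Reconstruction quality is high (PSNR 28.7 dB) but classification accuracy reaches only 53%; a confidence-based hybrid ensemble raises MCA from 48% to 58%, and the study shows that generative models can produce semantically ambiguous features.

Comments IEEE ACCESS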

DOI: 10.1109/ACCESS.2026.3705029

AI中文摘要

异质自然纹理的自动化分析常因物理损伤和数据丢失而受阻，这对计算机视觉构成了重大挑战。虽然深度学习在受控环境中已显示出成功，但其在信息不完全条件下对复杂地质材料的应用仍未被充分探索。本研究提出了一个用于高分辨率岩心样本图像修复和分类的集成框架。我们设计了一个端到端流水线，利用目标检测进行样本分割，随后使用具有上下文残差聚合（CRA）的生成对抗网络（GAN）进行图像修复，以重建缺失的高频细节。接着，我们在重建数据上评估了现代基于Transformer（Swin、ViT）和CNN架构的性能。实验揭示了重建质量与下游效用之间的关键分歧：尽管结构保真度高（PSNR 28.7 dB，FID 74.01），分类准确率却停滞在53%。为了改善少数类检测，我们提出了一种基于置信度的混合集成方法，将MCA从48%提升至58%。这些结果凸显了当前最先进生成模型的局限性，它们可能产生视觉上合理但语义模糊的特征（“幻觉”），从而混淆分类器。本工作深入探讨了图像重建质量与分类性能之间的依赖关系，为无损检测和材料科学领域的未来研究提供了可复现的基线。鉴于井间准确率仍处于49-53%范围，我们将所得到的系统定位为岩相解释的决策支持和筛选工具，而非完全自主的分类器。代码可在以下网址获取：https://github.com/GalymzhanAbdimanap/Lithology_recognition

英文摘要

The automated analysis of heterogeneous natural textures is frequently hindered by physical damage and data loss, presenting a significant challenge to computer vision. While deep learning has shown success in controlled environments, its application to complex geological materials under conditions of incomplete information remains underexplored. This study presents an integrated framework for the inpainting and classification of high-resolution core sample images. We propose an end-to-end pipeline that utilizes object detection for sample segmentation, followed by image inpainting using Generative Adversarial Networks (GANs) with Contextual Residual Aggregation (CRA) to reconstruct missing high-frequency details. Subsequently, we evaluate the performance of modern Transformer-based (Swin, ViT) and CNN architectures on the reconstructed data. Our experiments revealed a critical divergence between reconstruction quality and downstream utility: despite high structural fidelity (PSNR 28.7 dB, FID 74.01), classification accuracy plateaued at 53%. To improve minority-class detection, we propose a confidence-based hybrid ensemble that raises MCA from 48% to 58%. These results highlight the limitations of current state-of-the-art generative models, which may produce visually plausible but semantically ambiguous features ("hallucinations") that confound classifiers. This work provides insights into the dependencies between image reconstruction quality and classification performance, offering a reproducible baseline for future research in non-destructive testing and material science. Given that cross-well accuracy remains in the 49-53% range, we position the resulting system as a decision-support and screening tool for lithofacies interpretation rather than as a fully autonomous classifier. The code is available at https://github.com/GalymzhanAbdimanap/Lithology_recognition

2601.01200 2026-06-18 cs.CV eess.IV 版本更新

Objective Quality Assessment of Point Clouds Using Multi-scale Implicit Structural Similarity

点云的多尺度隐式结构相似性客观质量评估

Zhang Chen, Shuai Wan, Yuezhe Zhang, Siyu Ren, Fuzheng Yang, Junhui Hou

发表机构 * School of Electronics and Information, Northwestern Polytechnical University（电子与信息学院，西北工业大学）； Department of Computer Science, City University of Hong Kong（计算机科学系，香港城市大学）； School of Telecommunication Engineering, Xidian University（电信工程学院，西安电子科技大学）

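AI summary: To address the difficulty of matching irregular data in point cloud quality assessment, the paper proposes the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM), which represents local features continuously with radial basis functions and compares implicit function coefficients; combined with a ResGrouped-MLP network, it outperforms existing methods on multiple benchmarks.

Comments Accepted by IEEE TMM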

AI中文摘要

点云的无结构和不规则特性对精确的点云质量评估（PCQA）构成重大挑战，特别是在建立准确的感知特征对应关系方面。为了解决这一问题，我们提出了多尺度隐式结构相似性度量（MS-ISSM）。与传统的点对点匹配不同，MS-ISSM利用径向基函数（RBF）连续表示局部特征，将失真测量转化为隐式函数系数的比较。该方法有效避免了不规则数据中固有的匹配误差。此外，我们提出了ResGrouped-MLP质量评估网络，该网络能够鲁棒地将多尺度特征差异映射到感知分数。该网络架构摒弃了传统的平面多层感知器（MLP），采用分组编码策略，集成了残差块和通道注意力机制。这种分层设计使得模型能够保留亮度、色度和几何的独特物理语义，同时自适应地关注高、中、低尺度上最显著的失真特征。在多个基准上的实验结果表明，MS-ISSM在可靠性和泛化性方面均优于最先进的指标。源代码可在以下网址获取：https://github.com/ZhangChen2022/MS-ISSM。

英文摘要

The unstructured and irregular nature of points poses a significant challenge for accurate point cloud quality assessment (PCQA), particularly in establishing accurate perceptual feature correspondence. To tackle this, we propose the Multi-scale Implicit Structural Similarity Measurement (MS-ISSM). Unlike traditional point-to-point matching, MS-ISSM utilizes radial basis functions (RBFs) to represent local features continuously, transforming distortion measurement into a comparison of implicit function coefficients. This approach effectively circumvents matching errors inherent in irregular data. Additionally, we propose a ResGrouped-MLP quality assessment network, which robustly maps multi-scale feature differences to perceptual scores. The network architecture departs from the traditional flat multi-layer perceptron (MLP) by adopting a grouped encoding strategy integrated with residual blocks and channel-wise attention mechanisms. This hierarchical design allows the model to preserve the distinct physical semantics of luma, chroma, and geometry while adaptively focusing on the most salient distortion features across High, Medium, and Low scales. Experimental results on multiple benchmarks demonstrate that MS-ISSM outperforms state-of-the-art metrics in both reliability and generalization. The source code is available at: https://github.com/ZhangChen2022/MS-ISSM.

2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

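AI summary: Proposes a posterior continuation framework that progressively exposes measurement frequencies according to the diffusion noise level and, combined with a stabilized sampler, achieves competitive to state-of-the-art performance on super-resolution, inpainting, and deblurring.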

AI中文摘要

扩散后验采样通过将预训练的扩散先验与测量一致性指导相结合来解决逆问题。然而，在高噪声水平下，全频带指导可能不可靠，因为干净估计包含分数诱导误差，且高频测量方向弱可识别。我们认为后验指导应根据瞬时扩散噪声水平暴露测量频率。基于这一原则，我们提出一个后验延续框架，构建一系列中间后验，其似然强调当前可靠频带并逐渐恢复全频带一致性。我们通过一个稳定采样器实例化该框架，该采样器结合了扩散预测器、频率受限似然细化以及Haar域承诺规则，该规则提交可靠粗校正同时推迟弱可识别细节。在超分辨率、修复和去模糊任务中，我们的方法实现了具有竞争力乃至最先进的恢复性能，包括在FFHQ和ImageNet评估中，运动去模糊相比强基线PSNR提升高达5 dB。

英文摘要

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2603.05010 2026-06-18 cs.CV 版本更新

How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

生成式图像恢复进展：能力、局限性与评估实践研究

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

发表机构 * Fudan University（复旦大学）； Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences（深圳先进技术研究院，中国科学院）； University of the Chinese Academy of Sciences（中国科学院大学）； Multimedia Laboratory, The Chinese University of Hong Kong（香港中文大学多媒体实验室）； Shenzhen University of Advanced Technology（深圳先进技术大学）

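AI summary: Systematically compares diffusion-based, GAN-based, and other generative models with PSNR-oriented models through a multi-dimensional evaluation pipeline, reveals a paradigm shift from detail scarcity to detail quality and semantic control, and trains an IQA model that better matches human perception.

Comments Accepted by CVPR 2026 Findings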

AI中文摘要

生成式图像恢复（GIR）在感知真实感方面取得了显著进展，但与先前方法相比，其实际能力究竟有多大提升？为回答这一问题，我们基于新的多维度评估管道开展大规模研究，该管道从细节、清晰度、语义正确性和整体质量四个维度评估模型。我们的分析涵盖多种架构，包括基于扩散的、基于GAN的、PSNR导向的以及通用生成模型，揭示了关键的性能差异。此外，我们的分析揭示了失败模式的演变，这标志着以感知为导向的低层视觉领域发生了范式转变。核心挑战正从先前的细节稀缺（欠生成）问题演变为细节质量和语义控制（防止过生成）的新前沿。我们还利用我们的基准训练了一个新的IQA模型，该模型更符合人类感知判断。最终，本工作对现代生成式图像恢复模型进行了系统研究，提供了关键见解，重新定义了对其真实状态的理解，并为未来发展指明了方向。

英文摘要

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

2605.12567 2026-06-18 cs.CV cs.AI 版本更新

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

金字塔自对比学习用于单次测试时超声图像去噪

Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong（香港大学电子与计算机工程系）； Department of Biomedical Engineering, Duke University（杜克大学生物医学工程系）

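AI summary: Proposes a purely test-time training framework for single-shot ultrasound image denoising, applied to synthetic aperture ultrasound, that uses self-contrastive learning to separate anatomical similarity from noise randomness, improving denoising and structural detail.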

AI中文摘要

固有的电子噪声和斑点噪声使超声图像的临床解读变得复杂。传统去噪方法依赖显式噪声假设，其有效性在复合噪声条件下减弱。基于学习的方法通常使用标注数据集在有限的图像域中预训练，这意味着在复杂的体内环境中不可避免地出现域偏移。本研究提出金字塔自对比学习（PSCL）框架，用于无需预训练的测试时超声图像去噪。给定仅来自单次成像的多个噪声样本，PSCL将解剖相似性和噪声随机性解耦到各自独立的金字塔潜在空间中。随后从解剖空间解码出干净图像，同时丢弃噪声空间。我们首先将PSCL应用于合成孔径超声（SAU），其中孔径到孔径（Aperture-to-Aperture）循环作为自监督代理任务以确保去噪保真度。模拟实验涵盖0至30 dB的噪声水平以及从简单到复杂的包含物几何形状，结果显示SNR和CNR分别提升69.3%和34.4%。体内结果表明，仅使用两个孔径数据，在心脏六个超声心动图切面、肝脏和肾脏上，SNR和CNR分别提升84.8%和25.7%。PSCL在多种成像目标和配置下均能生成清晰图像，为在没有域偏移和预训练成本的情况下实现更可靠的解剖可视化铺平了道路。

英文摘要

The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods are usually pretrained in a limited image domain using a labeled dataset, which implies inevitable domain shift in complex in vivo environments. This study proposes a Pyramid Self-Contrastive Learning (PSCL) framework for test-time ultrasound image denoising without pretraining. Given multiple noisy samples from only one-shot imaging, PSCL disentangles anatomical similarity and noise randomness into separate pyramid latent spaces. The clean image is then decoded from the anatomy space while discarding the noise space. We first apply PSCL to synthetic aperture ultrasound (SAU), where an Aperture-to-Aperture loop serves as a self-supervised proxy task to ensure denoising fidelity. Simulation experiments, including noise levels from 0 to 30 dB and inclusion geometries from simple to complex, demonstrated improvements of 69.3% in SNR and 34.4% in CNR. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. PSCL delivers clear images across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization without domain shift and pretraining costs.

2506.11139 2026-06-18 eess.IV cs.AI cs.CV 版本更新

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

网格通常在压缩密集信号方面优于隐式神经表示

Namhoon Kim, Sara Fridovich-Keil

发表机构 * Department of Electrical and Computer Engineering（电气与计算机工程系）； Georgia Institute of Technology（佐治亚理工学院）

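AI summary: For dense-signal tasks, a regularized grid with interpolation trains faster and reaches higher or comparable quality than implicit neural representations with the same number of parameters; INRs do better only when fitting binary signals such as shape contours.

Comments Our analysis is available at https://github.com/voilalab/INR-benchmark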

AI中文摘要

隐式神经表示（INR）最近展示了令人印象深刻的结果，但其基本容量、隐式偏差和缩放行为仍知之甚少。我们研究了不同INR在一系列具有不同有效带宽的2D和3D真实及合成信号上的性能，以及包括断层扫描、超分辨率和去噪在内的过拟合和泛化任务。通过根据模型大小以及信号类型和带宽对性能进行分层，我们的结果揭示了不同INR和网格表示如何分配其容量。我们发现，对于许多涉及密集信号的任务，具有插值的简单正则化网格在训练速度和质量上优于或等同于具有相同参数数量的任何INR。我们还发现有限的情况——即拟合二值信号（如形状轮廓）——其中INR优于网格，以指导INR的未来开发和使用，使其应用于最有利的应用场景。

英文摘要

Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for many tasks involving dense signals, a simple regularized grid with interpolation trains faster and to higher or comparable quality than any INR with the same number of parameters. We also find limited settings -- namely fitting binary signals such as shape contours -- where INRs outperform grids, to guide future development and use of INRs towards the most advantageous applications.

2606.18318 2026-06-18 cs.CV cs.CR 新提交

Budget-Aware Adaptive Adversarial Patches for Black-Box Object Detection

预算感知的自适应对抗补丁用于黑盒目标检测

Pedram MohajerAnsari, Amir Salarpour, David Fernandez, Mert D. Pesé

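AI summary: Proposes a query-efficient, budget-adaptive black-box attack that combines contextual Thompson-sampling placement with NES-style pixel updates; under a strict plain-image suppression test it achieves strong suppression on CNN-based detectors and substantial suppression on a transformer-based detector, and exposes a query-footprint trade-off.

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP 2026)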

AI中文摘要

对抗补丁对现代目标检测器构成实际威胁。先前工作揭示了脆弱性，但三个差距限制了可操作的见解：(i) 很少有基于分数的黑盒攻击在严格查询预算下联合优化补丁的位置、纹理和大小；(ii) 成功很少与补丁的视觉足迹相关联；(iii) 评估常常混淆EOT鲁棒性与纯视图抑制。我们提出一种查询高效、预算自适应的黑盒攻击，它结合了轻量级的上下文汤普森采样放置器与NES风格的像素更新，仅在进展停滞时增大补丁。报告基于严格的纯图像抑制测试；EOT被审计但从不作为成功的替代，可选的外观/可打印性权重揭示了强度-可见性权衡。在YOLOv5、Faster R-CNN和YOLOS上，该方法在基于CNN的检测器上实现了强抑制，在基于Transformer的检测器上实现了显著抑制，使用紧凑的补丁，并相对于固定大小和启发式基线暴露了清晰的查询-足迹权衡。打印-捕获实验进一步展示了跨未见物理对象和视角的迁移。

英文摘要

Adversarial patches pose a practical threat to modern object detectors. Prior work shows vulnerability, but three gaps limit actionable insight: (i) few score-based black-box attacks jointly optimize patch location, texture, and size under tight query budgets; (ii) success is rarely tied to the patch's visual footprint; and (iii) evaluations often conflate EOT robustness with plain-view suppression. We present a query-efficient, budget-adaptive black-box attack that couples a lightweight Contextual Thompson-Sampling placer with NES-style pixel updates, growing the patch only when progress stalls. Reporting is anchored by a strict plain-image suppression test; EOT is audited but never used as a substitute for success, and optional appearance/printability weights expose strength-visibility trade-offs. Across YOLOv5, Faster R-CNN, and YOLOS, the attack achieves strong suppression on CNN-based detectors and substantial suppression on the transformer-based detector, using compact patches and exposing clear query-footprint trade-offs relative to fixed-size and heuristic baselines. A print-capture pilot further shows transfer across unseen physical objects and viewpoints.

2606.18510 2026-06-18 cs.CV cs.CR 新提交

Architectural Bias in Face Presentation Attack Detection: A Comparative Study of Vision Transformers and Convolutional Neural Networks

人脸呈现攻击检测中的架构偏差：视觉Transformer与卷积神经网络的比较研究

Ngela Landon Ntung, Floride Tuyisenge, Jema David Ndibwile

发表机构 * College of Engineering, Carnegie Mellon University（卡内基梅隆大学工程学院）

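AI summary: Comparing ViTs and CNNs for face presentation attack detection, the study finds that a pretrained ViT (DeiT-S) outperforms the CNN baseline in accuracy, fairness, and cross-ethnicity generalization, and reduces the inter-ethnic ACER gap by 83% relative to a prior LBP-based result.

Comments 8 Pages, 4 Figures, 5 Tables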

AI中文摘要

人脸呈现攻击检测（PAD）系统构成生物特征认证中的关键安全层；然而，现有方法在不同人口群体间表现出系统性性能差异，对深肤色个体影响尤为严重。本文通过实证比较研究，探究视觉Transformer架构相对于卷积基线是否能够减少人脸PAD系统中的人口统计偏差。实验在CASIA-SURF跨种族人脸反欺骗（CeFA）数据集上进行。评估了三种架构：从头训练的多模态ViT-Tiny、ResNet18 CNN基线，以及在CeFA上微调的预训练DeiT-S，覆盖非洲、东亚和零样本中亚人口群体。DeiT-S实现了最高总体准确率97.27%和最低等错误率0.86%，优于准确率90.15%的ResNet18。在公平性方面，DeiT-S将非洲与东亚受试者之间的种族间ACER差距降至0.13%，而基于LBP的工作[6]报告为0.75%，降低了83%。最值得注意的是，ResNet18在零样本中亚受试者上的BPCER为10.44%，而DeiT-S在相同未见群体上保持2.89%，展现出3.6倍的泛化优势。这些结果表明，预训练视觉Transformer在PAD中实现了更高的准确率，产生了更小的人口统计性能差距，并在未见人口群体上更公平地泛化，表明PAD中的跨人口公平性可能部分受架构设计影响。

英文摘要

Face Presentation Attack Detection (PAD) systems constitute a critical security layer in biometric authentication; however, existing approaches exhibit systematic performance disparities across demographic groups, disproportionately affecting individuals with darker skin tones. This paper presents a comparative empirical investigation of whether Vision Transformer architectures reduce demographic bias in face PAD systems relative to convolutional baselines. Experiments are conducted on the CASIA-SURF Cross-Ethnicity Face Anti-Spoofing (CeFA) dataset. Three architectures are evaluated: a Multimodal ViT-Tiny trained from scratch, a ResNet18 CNN baseline, and a pretrained DeiT-S fine-tuned on CeFA across African, East Asian, and zero-shot Central Asian demographic groups. DeiT-S achieves the highest overall accuracy of 97.27% and the lowest EER of 0.86%, outperforming ResNet18 at 90.15% accuracy. In terms of fairness, DeiT-S reduces the inter-ethnic ACER gap between African and East Asian subjects to 0.13%, compared to 0.75% reported in an LBP-based work [6], representing an 83% reduction. Most notably, while ResNet18 records a BPCER of 10.44% on zero-shot Central Asian subjects, DeiT-S maintains 2.89% on the same unseen group, demonstrating a 3.6x generalization advantage. These results suggest that pretrained Vision Transformers achieve superior PAD accuracy, produce smaller demographic performance gaps, and generalize more equitably across unseen demographic groups, indicating that cross-demographic fairness in PAD may partly be influenced by architectural design.
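
The two headline comparisons in this abstract follow directly from the reported percentages. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
# Inter-ethnic ACER gap (African vs. East Asian), in percentage points.
acer_gap_lbp = 0.75    # reported for the LBP-based prior work
acer_gap_deit = 0.13   # reported for DeiT-S

# Relative reduction of the gap, as a fraction of the LBP-based gap.
relative_reduction = (acer_gap_lbp - acer_gap_deit) / acer_gap_lbp

# BPCER on zero-shot Central Asian subjects, in percent.
bpcer_resnet18 = 10.44
bpcer_deit = 2.89

# How many times higher the ResNet18 error is than the DeiT-S error.
bpcer_ratio = bpcer_resnet18 / bpcer_deit
```

Rounded, these give the 83% reduction and the 3.6x advantage stated in the abstract.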

2606.19184 2026-06-18 cs.CV cs.LG 新提交

When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift

当AUC误导：域偏移下深度伪造检测器的极化感知评估

Dat Nguyen, Cosmin Radoi, Romain Hermary, Marcella Astrid, Nesryne Mejri, Enjie Ghorbel, Djamila Aouada

发表机构 * Cristal Laboratory, National School of Computer Sciences, University of Manouba（马努巴大学国家计算机科学学院Cristal实验室）

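AI summary: Because per-dataset AUC evaluation fails to reflect real-world mixtures of data sources and artifact types, the paper proposes Cross-dataset AUC (Cross-AUC), which averages per-domain AUCs and adds a prediction-polarization measure (Wasserstein distance) to assess robustness to domain shift; experiments demonstrate its usefulness.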

AI中文摘要

生成式AI的最新进展，如扩散模型和换脸工具，使得创建高度逼真的深度伪造成为可能，导致了包括金融欺诈和非自愿色情内容在内的现实危害。为此，深度伪造检测成为一个活跃的研究领域，近期方法越来越关注提高对未见操作的泛化能力。这通常通过跨多个数据集分别测量的ROC曲线下面积（AUC）来评估。然而，这种评估未能反映检测器面对混合数据源和不同伪影类型的真实场景。为解决这一局限，我们引入一种新指标——跨数据集AUC（Cross-AUC），该指标平均每域AUC并加入预测极化度量，以考虑对域偏移的鲁棒性。极化程度通过类别分数分布之间的Wasserstein距离量化。Cross-AUC不仅更真实地评估深度伪造检测器在域偏移下的泛化能力，而且具有可解释性，因为它能更好地解释性能下降的原因。在七个基准数据集上的实验证明了其实用性。

英文摘要

Recent advances in generative AI, such as diffusion models and face-swapping tools, have enabled the creation of highly realistic deepfakes, leading to real-world harms including financial fraud and non-consensual explicit content. In response, deepfake detection has become an active research area, with recent methods increasingly focusing on improving generalization to unseen manipulations. This is typically evaluated using the Area Under the ROC Curve (AUC) measured separately across multiple datasets. However, such an evaluation fails to reflect real-world scenarios where detectors face a mixture of data sources and varying artifact types. To address this limitation, we introduce a novel metric, Cross-dataset AUC (Cross-AUC) that averages per-domain AUCs with a measure of prediction polarization for taking into account the robustness to domain shift. The polarization extent is quantified by the Wasserstein Distance between class score distributions. Cross-AUC not only assesses the generalization capabilities of deepfake detectors under domain shifts more realistically, but it is also interpretable as it better explains the reason behind a drop in performance. Experiments performed on seven benchmark datasets demonstrate its practical relevance.
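
The two ingredients named in this abstract, per-domain AUC and the Wasserstein distance between class score distributions, can be computed with the standard library alone. The abstract does not say how Cross-AUC combines them, so the sketch below reports them separately and does not reproduce the metric itself; the function names and toy scores are made up for illustration.

```python
from bisect import bisect_right

def auc(real_scores, fake_scores):
    """Probability that a random fake scores above a random real (ties count 0.5)."""
    wins = 0.0
    for f in fake_scores:
        for r in real_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(fake_scores) * len(real_scores))

def wasserstein_1d(u, v):
    """1-Wasserstein distance between two empirical 1D distributions,
    computed as the area between their empirical CDFs."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total

def summarize(domains):
    """domains maps a dataset name to (real_scores, fake_scores)."""
    aucs = {name: auc(r, f) for name, (r, f) in domains.items()}
    gaps = {name: wasserstein_1d(r, f) for name, (r, f) in domains.items()}
    mean_auc = sum(aucs.values()) / len(aucs)
    return mean_auc, gaps

# Toy detector scores: domain A is well separated, domain B overlaps heavily.
domains = {
    "A": ([0.1, 0.2], [0.8, 0.9]),
    "B": ([0.4, 0.6], [0.5, 0.7]),
}
mean_auc, gaps = summarize(domains)
```

In this toy case the mean per-domain AUC is still fairly high, while the score distributions in domain B are far less separated than in domain A, which is the kind of difference a polarization term is meant to surface.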

2606.19259 2026-06-18 cs.CV cs.AI 新提交

A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2

一个用于检测 GPT-Image-2 生成的含丰富文本图像的多领域基准

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

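AI summary: To address the lack of text-rich image detection in existing benchmarks, the paper builds a multi-domain benchmark of 8,602 images across six categories, evaluates five detectors, and finds that performance is highly domain-dependent and sensitive to JPEG compression.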

AI中文摘要

含丰富文本的图像通常包含隐私敏感、交易或决策相关信息。随着最近多模态图像生成模型合成逼真文本内容和结构化视觉设计的能力越来越强，检测AI生成的含丰富文本图像已成为数字信任和内容真实性的重要挑战。然而，现有基准主要关注以物体为中心的图像，对文本语义和布局组织至关重要的场景覆盖有限。在本文中，我们引入了一个用于检测OpenAI的GPT Image 2生成的含丰富文本图像的多领域基准。该基准包含8602张图像，涵盖六个代表性类别：商业海报、信息图表、学术海报、收据、表格和UI截图。利用该基准，我们在零样本设置下评估了五种代表性AI生成图像检测器，并分析了它们的整体性能、类别性能和后处理鲁棒性。我们的结果表明，检测器性能高度依赖于领域：在某些类别上表现良好的方法往往在其他类别上失败，即使最强的传统检测器也对JPEG压缩表现出严重敏感性。我们进一步使用多模态视觉语言模型进行了探索性评估，揭示了其在结构化格式上的潜力和局限性。这些发现突显了针对现代AI生成图像需要文本和布局感知的检测方法。我们的数据集发布于XXX。

英文摘要

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

2606.18839 2026-06-18 cs.LG cs.CV 交叉投稿

Semantic Robustness Certification for Vision-Language Models

视觉语言模型的语义鲁棒性认证

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing & Information Systems, University of Melbourne, Australia

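AI summary: Proposes the first framework that certifies the robustness of vision-language models to semantic-level variation (e.g., shape, size, style) without additional data, using text prompts as semantic proxies and quantifying the decision boundary to certify that the predicted class stays unchanged under semantic transformations.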

Comments Accepted to ICML

AI中文摘要

视觉语言模型（VLM）现在被广泛用于下游任务。然而，现实世界的应用常常使VLM面临由语义变化（例如形状、大小和风格）引起的分布偏移。鲁棒性认证确定当对输入应用变换时模型的预测是否改变。虽然大多数认证框架研究输入的几何或像素级变换，但本文提出了一种新颖的框架，能够在语义级变换下认证VLM的鲁棒性。利用VLM的开放词汇能力，我们使用文本提示作为语义代理来构建由控制语义变化程度的范围参数化的变换。通过以封闭形式表征VLM决策边界，我们的框架定量地认证了在语义变换下预测类别保持不变的范围区间。我们的框架是第一个在语义级变化下认证VLM鲁棒性而无需为每种变化提供额外数据的框架，使其易于应用。在合成数据和真实数据上的实验表明，我们的框架能够在各种场景下认证针对多种语义变化的鲁棒性。

英文摘要

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

2508.03483 2026-06-18 cs.CV cs.AI 版本更新

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

当汽车有刻板印象：审计文本到图像模型中对象的群体偏见

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

发表机构 * AIM Intelligence（AIM智能研究院）； Yonsei University（延世大学）

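AI summary: Proposes the SODA framework, which uses three metrics to systematically measure demographic bias in objects generated by text-to-image models; it finds that neutral prompts implicitly lean toward middle-aged and White people and that demographic cues produce highly skewed, stereotypical outputs.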

AI中文摘要

虽然先前关于文本到图像生成的研究主要集中在人类描绘中的偏见，但生成对象中的群体偏见仍然相对未被充分探索。我们引入了SODA（刻板对象诊断审计），这是一个新颖的框架，通过自动属性发现和三个标准化指标系统地测量这些偏见：基础与群体差异（BDS）、跨群体差异（CDS）和视觉属性集中度（VAC）。将SODA应用于五个最先进模型和八个对象类别（例如汽车）的8000张图像，我们发现“中性”提示产生的输出在视觉上最接近中年和白人，表明这些群体在模型默认设置中被隐含地过度代表。此外，人口统计线索触发了高度偏斜的刻板输出：26.6%的对象-模型-群体组合产生的结果中，所有20张生成图像共享完全相同的属性值（例如，为女性生成玫瑰金笔记本电脑）。最后，提示级别的去偏减少了群体间差异，但矛盾地压缩了群体内多样性，用一种刻板印象取代了另一种。SODA提供了一个实用的流程，使这些隐含关联变得可测量，作为迈向更负责任的人工智能发展的一步。

英文摘要

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.

2511.20002 2026-06-18 cs.CV cs.AI cs.CR 版本更新

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

语义路由器：通过单一对抗扰动劫持多模态大语言模型的可行性研究

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen, China（香港中文大学（深圳））； School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China（数据科学学院、人工智能学院、香港中文大学（深圳））

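AI summary: Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions at once; built on theoretical analysis and the SORT optimization strategy, it reaches a 66% attack success rate over five targets against Qwen.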

Comments Accepted to ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）越来越多地部署在无状态系统中，例如自动驾驶和机器人技术。本文研究了一种新型威胁：语义感知劫持。我们探索了使用单一通用扰动同时劫持多个无状态决策的可行性。我们引入了语义感知通用扰动（SAUP），它充当语义路由器，“主动”感知输入语义并将其路由到不同的、攻击者定义的目标。为了实现这一点，我们对潜在空间中的几何特性进行了理论和实证分析。在这些见解的指导下，我们提出了语义导向（SORT）优化策略，并标注了一个具有细粒度语义的新数据集以评估性能。在三个代表性MLLM上的大量实验证明了这种攻击的基本可行性，在针对Qwen的五个目标上使用单帧实现了66%的攻击成功率。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2606.11615 2026-06-18 cs.CV cs.CR cs.LG 版本更新

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Adv-TGD：面向人脸识别冒充攻击的对抗性文本引导扩散

Omid Ahmadieh, Nima Karimian

发表机构 * University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing（南佛罗里达大学贝利尼人工智能、网络安全与计算学院）

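AI summary: Proposes the Adv-TGD framework, which uses Stable Diffusion with LoRA fine-tuning to generate photorealistic adversarial faces, achieving high-success identity impersonation attacks while preserving visual quality, with an average ASR of 85.90%.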

AI中文摘要

人脸识别（FR）技术的广泛普及引发了严重的隐私担忧，因为面部数据可能在未经同意的情况下被利用。为了解决这一挑战，我们提出了Adv-TGD，一个生成式对抗攻击框架，能够合成逼真的人脸，冒充目标身份并欺骗人脸识别系统。基于Stable Diffusion v2.1，Adv-TGD对每个样本进行LoRA微调，以简洁的文本提示为条件，生成自然但具有对抗性操控的身份。与传统的身份攻击方法不同，我们的方法在固定时间步去噪过程中为每个源-目标对优化轻量级交叉注意力适配器。潜在混合受到面部局部热图掩码的约束，以确保空间精确的身份操控，同时保留非敏感区域。我们引入了一个复合目标，结合了掩码epsilon-MSE重建、FR嵌入空间中的阈值化身份差异、方向特征对齐和源相似性抑制，以平衡对抗攻击和视觉真实性。可选地，LLaVA生成的属性提示增强了细粒度语义细节，而不会重新引入身份线索。在黑盒评估协议下，Adv-TGD在IR152、IRSE50、MobileFace和FaceNet上平均攻击成功率（ASR）达到85.90%，超过语义SOTA基线Adv-CPG +6.25个百分点、基于扩散的化妆方法DiffAIM +3个百分点以及基于噪声的P3-Mask +16个百分点。尽管攻击效果强劲，Adv-TGD仍保持了高视觉保真度（PSNR = 28.18 dB，SSIM = 0.981）。此外，我们通过成功将其扩展到野外数据集（LADN）、通用对象分类（ImageNet）和基于Transformer的扩散模型（FLUX.1），展示了我们框架的灵活性。

英文摘要

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion v2.1, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a fixed-timestep denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by 6.25 points, the diffusion-based makeup method DiffAIM by 3 points, and the noise-based P3-Mask by 16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 28.18 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2504.14798 2026-06-18 cs.LG cs.CV 版本更新

RUB: Evaluating Residual Knowledge in Unlearned Models

RUB: 评估未学习模型中的残留知识

Hao Xuan, Xingyu Li

发表机构 * Electrical and Computer Engineering, University of Alberta（阿尔伯塔大学电气与计算机工程系）

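AI summary: Proposes the principle of Robust Unlearning and a unified benchmark, RUB, which uses the Unlearning Mapping Attack (UMA) to detect residual information and reveals the vulnerability of existing methods under adversarial evaluation.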

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2026, pages 8550-8559

AI中文摘要

机器未学习（MUL）已成为隐私保护和内容监管的关键机制，然而当前技术往往无法保证完全移除敏感信息。虽然现有工作大多关注验证未学习的执行，但它们忽略了模型在面对对抗性恢复遗忘知识尝试时是否保持鲁棒性的关键问题。在这项工作中，我们倡导鲁棒未学习原则，要求模型既与重新训练的模型不可区分，又能抵御多样化的对抗威胁。为实例化这一原则，我们提出了一个统一基准RUB（鲁棒未学习基准），系统评估未学习算法在分类、图像到图像重建和文本到图像合成中的鲁棒性。在此框架内，我们引入未学习映射攻击（UMA）作为检测残留信息的通用方法，并展示现有攻击策略如何适应此框架，只要它们符合通用UMA框架。我们在判别式和生成式任务上的实验表明，最先进的未学习方法在这些评估下仍然脆弱，即使通过了标准验证指标。通过将鲁棒性定位为核心标准并提供对抗评估基准，我们希望RUB能为更可靠和安全的未学习实践铺平道路。RUB中的代码库和模型检查点将公开发布。

英文摘要

Machine Unlearning (MUL) has emerged as a key mechanism for privacy protection and content regulation, yet current techniques often fail to guarantee the complete removal of sensitive information. While most existing works focus on verifying the execution of unlearning, they overlook the critical question of whether models remain robust against adversarial attempts to recover forgotten knowledge. In this work, we advocate for the principle of Robust Unlearning, which requires models to be both indistinguishable from retrained counterparts and resilient against diverse adversarial threats. To instantiate this principle, we propose a unified benchmark, RUB (Robust Unlearning Benchmark), that systematically evaluates the robustness of unlearning algorithms across classification, image-to-image reconstruction, and text-to-image synthesis. Within this framework, we introduce the Unlearning Mapping Attack (UMA) as a generalizable method to detect residual information, and demonstrate how existing attack strategies can be adapted to it as long as they conform to the generic UMA formulation. Our experiments across discriminative and generative tasks reveal that state-of-the-art unlearning methods remain vulnerable under these evaluations, even when passing standard verification metrics. By positioning robustness as the central criterion and providing a benchmark for adversarial evaluation, we hope RUB paves the way toward more reliable and secure unlearning practices. The codebase and model checkpoints in RUB will be published.

2505.03646 2026-06-18 cs.LG cs.AI cs.CV 版本更新

重新思考空地协作：渐进式跨任务基准与社会化学习框架

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

发表机构 * School of Automation, Southeast University（东南大学自动化学院）； School of Computer Science and Engineering, University of New South Wales（新南威尔士大学计算机科学与工程学院）； School of Sports Training, Tianjin University of Sport（天津体育学院运动训练学院）； Faculty of Information Engineering and Automation, Kunming University of Science and Technology（昆明理工大学信息工程与自动化学院）； School of Artificial Intelligence, Tianjin University（天津大学人工智能学院）； School of Artificial Intelligence, Hebei University of Technology（河北工业大学人工智能学院）

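AI summary: Proposes the Air-Ground Progressive Collaboration (AGPC) benchmark and the Socialized Co-Perception (SCP) framework, whose Dual-Layer Router enables selective cross-view and cross-task interaction, improving average downstream performance by 7.86% in heterogeneous air-ground perception.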

AI中文摘要

空地协同感知对于真实世界动态环境中的鲁棒视觉理解至关重要。然而，现有研究通常将协作建模为单任务跨视角融合，忽视了定位、目标关联和细粒度解析之间的功能依赖关系。此外，空中和地面视角的异构性引入了显著的几何、尺度和遮挡差异，使得统一特征共享容易受到负迁移的影响。为解决这些问题，我们将空地感知建模为渐进式跨任务协作任务，并构建了空地渐进协作（AGPC）基准，这是一个包含超过745K原始视频帧的时空对齐基准。基于该基准，我们提出了社会化协同感知（SCP），一个从空中全局定位到地面目标关联和身份感知解析的渐进式协作框架。其核心模块——双层级路由器（DLR），将输入侧的多尺度专家选择与输出侧的任务条件调制解耦，实现了选择性的跨视角和跨任务交互，同时抑制有害干扰。大量实验证明了SCP的有效性。它实现了3.73%的协同进化增益和7.86%的平均下游性能提升。这些结果表明，对于异构空地感知，任务条件协作比统一融合更有效。代码可在该网址获取。

英文摘要

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73% coevolutionary gain and a 7.86% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

2606.18943 2026-06-18 cs.CV 新提交

Physics-IQ Verified

物理智力验证

Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth

发表机构 * Anates Labs（Anates实验室）； Technical University of Munich（慕尼黑技术大学）； University of Technology Nuremberg（纽伦堡技术大学）； Tuebingen AI Center, University of Tuebingen（图宾根大学人工智能中心）； Helmholtz AI, Munich（慕尼黑亥姆霍兹人工智能研究所）； Google DeepMind research（谷歌DeepMind研究）

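AI summary: Presents the Physics-IQ Verified benchmark, which improves prompt and ground-truth quality and introduces a sample-level scoring system to better evaluate how well video generative models understand physical reality; the resulting benchmark refines 57.6% of samples and improves over 34.8% of prompts.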

AI中文摘要

视频生成模型（VGMs）已成为新的前沿，不仅用于视频生成，还用于多种下游任务，包括世界建模。为推进这些任务，一个良好的视频模型必须理解世界的物理现实。评估这种理解成为新兴领域，催生了Physics-IQ基准，通过将模型生成的视频与真实物理实验视频进行比较来量化。本文系统审计了Physics-IQ基准，揭示不足并提出三种解决方案，改进如何衡量VGMs的物理理解。具体而言，我们提高了提示和地面真实质量以减少混淆因素影响，并进一步引入样本级评分系统，使每个样本和指标权重相等。我们的基准Physics-IQ Verified优化了57.6%的所有样本并改进了超过34.8%的提示。在使用六个图像到视频生成模型的比较研究中，我们观察到中等但有意义的排名变化（Kendall's τ=0.46）。我们希望Physics-IQ Verified通过提供更可靠的信号推动社区发展，向物理准确的VGMs迈进。该基准的代码可通过此https URL访问。

英文摘要

Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies this explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose shortcomings, and propose three solutions that sharpen how we can measure physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors and further introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's $\tau = 0.46$). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark
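
Kendall's τ, which this abstract uses to quantify ranking changes, counts how many pairs of items keep or flip their relative order between two rankings. The following is a minimal sketch of the tie-free variant (tau-a); the model names and ranks are made-up toy values, not results from the paper, and the abstract does not say which variant the authors used.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings given as {item: rank} dicts."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        # Positive product: the pair is ordered the same way in both rankings.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of six models under two versions of a benchmark.
original_rank = {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5, "m6": 6}
verified_rank = {"m1": 2, "m2": 1, "m3": 4, "m4": 3, "m5": 6, "m6": 5}

tau = kendall_tau(original_rank, verified_rank)
print(f"Kendall's tau = {tau:.2f}")
```

With six items there are 15 pairs, so each flipped pair lowers τ by 2/15; a value of 1 means identical rankings and −1 means fully reversed rankings.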

2606.18952 2026-06-18 cs.CV 新提交

SP-TransientBench: A Real-Captured Single Photon Perception Benchmark

SP-TransientBench: 一个真实捕获的单光子感知基准

Hongzhou Dong, Zili Zhang, Ziting Wen, Yiheng Qiang, Runrong Deng, Wenle Dong, Ziwen Jiang, Xinyang Li, Rui Lu, Shuoyao Sun, Wenyu Wang, Ziyi Xia, Haitao Zheng, Guodong Shi, Xiaoqiang Ren

发表机构 * Shanghai University（上海大学）； Southern University of Science and Technology（南方科技大学）； The University of Sydney（悉尼大学）

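AI summary: To address the perception challenges that noise and multi-return transient phenomena pose for single-photon LiDAR in real scenes, the paper proposes STB, a real-captured multi-task benchmark with 10 scenes and 10,297 views that supports evaluation of depth estimation, multi-view reconstruction, and 3D semantic understanding.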

AI中文摘要

基于单光子雪崩二极管（SPAD）传感的单光子LiDAR（SPL）能够以极高灵敏度进行时间分辨光子测量，为光子匮乏环境下的主动3D感知提供了独特潜力。然而，由于独特的测量噪声和复杂的多回波瞬态现象，真实世界的单光子感知仍然面临根本性挑战，这些因素共同使几何重建和语义场景理解变得复杂。尽管对基于SPAD的传感兴趣日益增长，现有研究大多局限于模拟数据或小规模受控捕获。因此，在深度估计、多视图重建和3D语义理解方面，对真实世界单光子感知的系统评估仍未得到充分探索。为弥补这一空白，我们引入了SP-TransientBench（STB），一个真实捕获的多任务单光子感知基准。STB包含10个多样化场景和10297个视图，使用固态单光子LiDAR以256×192分辨率捕获。每个视图提供具有多回波行为的完整飞行时间直方图、标准化元数据和用于多视图评估的校准相机位姿。我们还为选定场景提供了13类3D语义标注。通过为每个任务提供专用数据划分和评估协议，STB能够在多个3D视觉问题上实现真实世界单光子感知的一致且可重复的基准测试。数据集和代码将在接收后发布。

英文摘要

Single-photon LiDAR (SPL) based on single-photon avalanche diode (SPAD) sensing enables time-resolved photon measurements with extreme sensitivity, offering unique potential for active 3D perception in photon-starved scenarios. However, real-world single photon perception remains fundamentally challenging due to unique measurement noise and complex multi-return transient phenomena, which jointly complicate geometric reconstruction and semantic scene understanding. Despite growing interest in SPAD-based sensing, existing studies are largely limited to simulated data or small-scale controlled captures. As a result, systematic evaluation of real-world single photon perception across depth estimation, multi-view reconstruction, and 3D semantic understanding remains underexplored. To bridge this gap, we introduce SP-TransientBench (STB), a real-captured multi-task benchmark for single photon perception. SP-TransientBench comprises 10 diverse scenes and 10,297 views captured using a solid-state single-photon LiDAR at $256\times192$ resolution. Each view provides full time-of-flight histograms with multi-return behavior, standardized metadata, and calibrated camera poses for multi-view evaluation. We further provide 13-class 3D semantic annotations for selected scenes. By providing dedicated data splits and evaluation protocols for each task, STB enables consistent and reproducible benchmarking of real-world single photon perception across multiple 3D vision problems. The dataset and code will be released upon acceptance.

2606.19053 2026-06-18 cs.CV 新提交

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis

大规模视觉-语言模型在细粒度图像任务上的基准测试：从评估到诊断

Hong-Tao Yu, Chen-Wei Xie, Yuxin Peng, Serge Belongie, Xiu-Shen Wei

发表机构 * School of Computer Science and Engineering, Southeast University, China（东南大学计算机科学与工程学院，中国）； Alibaba Group（阿里巴巴集团）； School of Computer Science and Engineering, School of Intelligence Science and Engineering, and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, China（东南大学计算机科学与工程学院、智能科学与工程学院以及新一代人工智能技术及其交叉应用关键实验室，中国）； Wangxuan Institute of Computer Technology, National Key Laboratory for Multimedia Information Processing, Peking University, China（北京大学王选计算机技术研究所、多媒体信息处理国家重点实验室，中国）； University of Copenhagen, Denmark（丹麦哥本哈根大学）

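AI summary: Proposes the FG-BMK benchmark, with 1.01 million questions and 0.28 million images, which evaluates the fine-grained semantic recognition and visual discriminability of LVLMs through human-oriented and machine-oriented paradigms, diagnoses the causes of failure, and identifies bottlenecks such as visual representation and semantic grounding.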

AI中文摘要

近期大规模视觉-语言模型（LVLMs）展示了显著的多模态感知和推理能力。尽管众多基准从整体或任务特定角度评估了LVLMs，但它们在细粒度图像任务（计算机视觉的基础）上的能力仍未得到充分理解。为填补这一空白，我们引入FG-BMK，一个全面的细粒度评估基准，包含101万问题和28万图像，覆盖从常见物体中心领域到专业领域的多样化场景。FG-BMK通过面向人类和面向机器的范式，联合评估对话级细粒度语义识别和特征级视觉判别能力，从而诊断分析LVLM的失败是否源于视觉表示不足、视觉-语义对齐薄弱或细粒度知识有限。通过对一系列代表性LVLM/VLM的大量实验，我们发现当前LVLMs仍是不充分的细粒度识别器，失败源于视觉表示、语义对齐、模态对齐和类别级知识中相互交织的瓶颈。我们进一步分析了提升细粒度能力的训练设计因素，并考察了视觉和语言扰动如何影响LVLM预测。这些发现为当前LVLMs的局限性提供了诊断性见解，并为未来数据构建和模型设计提供了指导，以开发更可靠的细粒度视觉任务LVLMs。我们的代码已开源，可从此https URL获取。

英文摘要

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception and reasoning capabilities. While numerous benchmarks have evaluated LVLMs from holistic or task-specific perspectives, their capabilities on fine-grained image tasks, which are fundamental to computer vision, remain insufficiently understood. To address this gap, we introduce FG-BMK, a comprehensive fine-grained evaluation benchmark containing 1.01 million questions and 0.28 million images, covering diverse scenarios from common object-centric domains to specialized domains. FG-BMK jointly evaluates dialogue-level fine-grained semantic recognition and feature-level visual discriminability through human-oriented and machine-oriented paradigms, enabling diagnostic analysis of whether LVLM failures arise from insufficient visual representations, weak visual-to-semantic grounding, or limited fine-grained knowledge. Through extensive experiments on a diverse set of representative LVLMs/VLMs, we find that current LVLMs remain inadequate fine-grained recognizers, with failures arising from intertwined bottlenecks in visual representations, semantic grounding, modality alignment, and category-level knowledge. We further analyze training design factors for improving fine-grained capabilities and examine how visual and linguistic perturbations affect LVLM predictions. These findings provide diagnostic insights into the limitations of current LVLMs and offer guidance for future data construction and model design in developing more reliable LVLMs for fine-grained visual tasks. Our code is open-source and available at https://fg-bmk.github.io/.

2606.18676 2026-06-18 cs.LG cs.CV 交叉投稿

InTrain: Intrinsic Trainability for Zero-Cost Neural Architecture Search

InTrain: 面向零成本神经架构搜索的内在可训练性

Qinqin Zhou, Fuhai Chen, Jipeng Wu, Zhiwei Chen, Zhikai Hu, Weiwei Cai

发表机构 * School of Computer and Data Science, Fuzhou University（福州大学计算机与数据科学学院）； School of Computer and Data Science, Minjiang University（闽江学院计算机与数据科学学院）； School of Artificial Intelligence, Nanchang University（南昌大学人工智能学院）； Department of Computer Science, Hong Kong Baptist University（香港浸会大学计算机科学系）； School of Interdisciplinary Medicine and Engineering, Harbin Medical University（哈尔滨医科大学跨学科医学与工程学院）

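AI summary: Proposes InTrain, a unified theoretical proxy that formalizes architectural trainability through two synergistic components, geometric capacity and optimization resilience, and achieves ranking correlations on NAS benchmarks comparable to ensemble-based methods.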

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

AI中文摘要

免训练神经架构搜索有望在不进行昂贵训练的情况下高效发现高性能网络。然而，现有的零成本代理依赖于碎片化的启发式方法，未能捕捉基本问题：是什么使一个架构具有可训练性？本文引入内在可训练性（InTrain），一个统一的理论代理，将可训练性形式化为由两个协同成分——几何容量和优化韧性——涌现出的架构不变性。我们通过分析神经信息处理来操作化内在可训练性。几何容量通过激活协方差特征谱的参与比量化，捕捉表示流形的有效维度。优化韧性通过累积梯度健康度测量，评估跨网络深度的反向传播鲁棒性。InTrain通过尺度不变的乘法耦合综合这些维度，我们假设这对于捕捉它们协同、非加性的关系至关重要。在标准NAS基准和搜索空间上的大量实验表明，InTrain达到了与最先进的基于集成的代理相当的排序相关性，并优于其他单指标方法。

英文摘要

Training-free neural architecture search promises efficient discovery of high-performance networks without costly training. However, existing zero-cost proxies rely on fragmented heuristics that fail to capture the fundamental question: what makes an architecture trainable? This paper introduces Intrinsic Trainability (InTrain), a unified theoretical proxy that formalizes trainability as an architectural invariant emerging from two synergistic components: geometric capacity and optimization resilience. We operationalize intrinsic trainability through analysis of neural information processing. Geometric capacity is quantified via the participation ratio of activation covariance eigenspectrum, capturing the effective dimensionality of representation manifolds. Optimization resilience is measured through cumulative gradient health, assessing the robustness of backpropagation across network depth. InTrain synthesizes these dimensions through a scale-invariant multiplicative coupling, which we hypothesize is essential for capturing their synergistic, non-additive relationship. Extensive experiments on standard NAS benchmarks and search spaces demonstrate that InTrain achieves ranking correlations on par with state-of-the-art ensemble-based proxies and outperforms other single-metric methods.
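
The participation ratio mentioned for geometric capacity has a standard form, PR = (Σ λ_i)² / Σ λ_i², where λ_i are the eigenvalues of the activation covariance matrix C. The sketch below assumes this standard definition (the abstract does not spell out the exact variant InTrain uses) and relies on the identities Σ λ_i = tr(C) and Σ λ_i² = tr(C²) to avoid an eigendecomposition. Function names and toy activations are hypothetical; the gradient-health term and the multiplicative coupling are not reproduced.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length activation vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def participation_ratio(cov):
    """PR = tr(C)^2 / tr(C^2); for symmetric C, tr(C^2) is the sum of squared entries."""
    trace = sum(cov[i][i] for i in range(len(cov)))
    trace_of_square = sum(v * v for row in cov for v in row)
    return trace * trace / trace_of_square


# Toy activations: variance spread over two directions vs. collapsed onto one.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]

pr_spread = participation_ratio(covariance(spread))
pr_collapsed = participation_ratio(covariance(collapsed))
print(pr_spread, pr_collapsed)
```

The ratio ranges from 1 (all variance in a single direction) to the feature dimension (variance spread evenly), which is why it is read as an effective dimensionality.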

2303.18031 2026-06-18 cs.CV cs.AI cs.LG 版本更新

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

简单域泛化方法是开放域泛化的强基线

Masashi Noguchi, Shinichi Shirakawa

发表机构 * Graduate School of Environment and Information Sciences（环境与信息科学研究生院）； Yokohama National University（横滨国立大学）； Faculty of Environment（环境学系）

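AI summary: Evaluates existing domain generalization methods in open domain generalization and finds that the simple methods CORAL and MMD are competitive with the more complex DAML; with simple extensions using ensemble learning and Dirichlet mixup data augmentation, they perform comparably to DAML at lower computational cost.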

Comments Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

DOI: 10.1109/IJCNN60899.2024.10650639

AI中文摘要

在现实应用中，机器学习模型需要处理开放集识别（OSR），即在推理过程中出现未知类别，同时还要处理域偏移，即训练和推理阶段数据分布不同。域泛化（DG）旨在处理推理阶段目标域在模型训练期间不可访问的域偏移情况。开放域泛化（ODG）同时考虑DG和OSR。域增强元学习（DAML）是一种针对ODG的方法，但其学习过程复杂。相比之下，尽管已提出多种DG方法，但它们尚未在ODG场景下进行评估。在本研究中，我们全面评估了现有DG方法在ODG中的表现，并表明两种简单的DG方法——相关对齐（CORAL）和最大均值差异（MMD）——在多种情况下与DAML具有竞争力。此外，我们通过引入DAML中使用的技术（如集成学习和Dirichlet混合数据增强）提出了CORAL和MMD的简单扩展。实验评估表明，扩展后的CORAL和MMD可以以较低的计算成本达到与DAML相当的性能。这表明简单的DG方法及其简单扩展是ODG的强基线。

英文摘要

In real-world applications, a machine learning model is required to handle open-set recognition (OSR), where unknown classes appear during inference, in addition to domain shift, where the data distribution differs between the training and inference phases. Domain generalization (DG) aims to handle the domain shift situation where the target domain of the inference phase is inaccessible during model training. Open domain generalization (ODG) considers DG and OSR simultaneously. Domain-augmented meta-learning (DAML) is a method targeting ODG; however, it has a complicated learning process. By contrast, although various DG methods have been proposed, they have not been evaluated in ODG situations. In this study, we comprehensively evaluate existing DG methods in ODG and show that two simple DG methods, CORrelation ALignment (CORAL) and maximum mean discrepancy (MMD), are competitive with DAML in several cases. In addition, we propose simple extensions of CORAL and MMD by introducing the techniques used in DAML, such as ensemble learning and Dirichlet mixup data augmentation. The experimental evaluation demonstrates that the extended CORAL and MMD can perform comparably to DAML with lower computational costs. This suggests that the simple DG methods and their simple extensions are strong baselines for ODG.
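
For readers unfamiliar with the two simple baselines named in this abstract, both are penalties on the mismatch between feature distributions of two domains. The sketch below assumes the commonly used forms: the CORAL loss ||C_s − C_t||_F² / (4d²) on feature covariances, and a biased RBF-kernel estimate of MMD². Function names, the `gamma` value, and the toy features are hypothetical; how the paper applies these penalties across source domains, and its extensions, are not reproduced.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    return sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d)) / (4 * d * d)


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD with an RBF kernel."""
    def mean_kernel(xs, ys):
        return sum(rbf_kernel(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

    return mean_kernel(source, source) + mean_kernel(target, target) - 2 * mean_kernel(source, target)


# Toy 2-D features: the same cloud, a shifted copy, and a stretched copy.
source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 3.0, y + 3.0] for x, y in source]
stretched = [[2.0 * x, 2.0 * y] for x, y in source]

print(coral_loss(source, shifted), coral_loss(source, stretched))
print(mmd_squared(source, source), mmd_squared(source, shifted))
```

The toy data shows a difference between the two penalties: CORAL compares only second-order statistics, so a pure shift of the mean leaves it at zero, whereas MMD with an RBF kernel reacts to the shift.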

2406.18215 2026-06-18 cs.CV 版本更新

Optimizing Incomplete, Large-Scale and Sparse Multi-Graph Matching in Bioimaging

优化生物成像中不完整、大规模和稀疏的多图匹配

Max Kahl, Sebastian Stricker, Lisa Hutschenreiter, Florian Bernard, Carsten Rother, Bogdan Savchynskyy

发表机构 * Heidelberg University（海德堡大学）； Max Planck Institute for Informatics（马克斯·普朗克信息研究所）； University of Bonn（波恩大学）

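AI summary: For large-scale sparse multi-graph matching in bioimaging, proposes a sparse permutation synchronization paradigm and a general method, GREEDA, which outperforms existing methods in objective value and runtime.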

AI中文摘要

多图匹配是计算机视觉中的一个基本问题。我们的工作受到生物成像中一个具有挑战性的应用的启发，在该应用中，需要将数十甚至数百张蠕虫的3D显微镜图像进行对应。现有数据集未覆盖这种大规模场景，且几乎所有现有方法都不适用，因为它们假设完整或密集的问题设置。为了支持进一步研究，我们的第一个贡献是基于生物成像中的问题实例构建了一个新的大规模数据集。我们的第二个贡献是对两种主要的多图匹配范式：直接法和排列同步法进行了全面分析。我们通过部分证明论证，实用的大规模方法必须明确处理问题的稀疏性和不完整性。由于标准的排列同步方法在此设置下失败，我们进一步引入了一种稀疏排列同步范式。我们的最终贡献是GREEDA，一种针对稀疏和不完整问题的通用方法，可跨成本阶和范式实例化。虽然本文重点研究最高二次阶的目标函数，但GREEDA本质上可推广到任意阶。在更大、更稀疏的实例上，GREEDA在目标值和运行时间上均优于竞争方法。例如，对于基于30张蠕虫图像的中等规模问题，GREEDA在2分钟内产生高质量解，而竞争方法至少需要半小时且结果差得多。在较小的密集问题上，GREEDA与领先方法性能相当，但速度快一个数量级。

英文摘要

Multi-graph matching is a fundamental problem in computer vision. Our work is motivated by a challenging application in bioimaging, where dozens or even hundreds of 3D microscopy images of worms must be brought into correspondence. Existing datasets do not cover this large-scale regime, and virtually all existing methods are inapplicable because they assume a complete or dense problem setting. To support further research, our first contribution is a new large-scale dataset based on problem instances from bioimaging. Our second contribution is a comprehensive analysis of the two main multi-graph matching paradigms: direct and permutation synchronization-based formulations. We argue, in part by proof, that practical large-scale methods must explicitly address problem sparsity and incompleteness. Since standard permutation synchronization approaches fail in this setting, we further introduce a sparse permutation synchronization paradigm. Our final contribution is GREEDA, a general method for sparse and incomplete problems that can be instantiated across cost orders and paradigms. While our paper focuses on objective functions up to quadratic order, GREEDA is inherently generalizable to arbitrary orders. On larger, sparse instances, GREEDA outperforms competing methods in both objective value and runtime. For example, for moderately sized problems based on 30 worm images, GREEDA produces a high-quality solution within 2 minutes, whereas competitors require at least half an hour and yield far worse results. On smaller dense problems, GREEDA remains on par with leading methods while being an order of magnitude faster.

2407.18245 2026-06-18 cs.CV cs.LG 版本更新

VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset

VGGHeads: 基于大规模合成数据集的3D多头部对齐

Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht

发表机构 * University of Oxford（牛津大学）； Piñata Farms ； Ukrainian Catholic University（乌克兰天主教大学）

AI总结提出VGGHeads，一个由扩散模型生成的大规模合成数据集，用于单步同时进行头部检测和3D网格重建，在真实图像上表现优异。

详情

AI中文摘要

人类头部检测、关键点估计和3D头部模型拟合是许多应用中的基本任务。然而，传统的真实世界数据集常常存在偏差、隐私和伦理问题，并且是在实验室环境中记录的，这使得训练出的模型难以泛化。在这里，我们介绍VGGHeads，一个使用扩散模型生成的大规模合成数据集，用于人类头部检测和3D网格估计。我们的数据集包含超过100万张高分辨率图像，每张图像都标注了详细的3D头部网格、面部标志和边界框。利用这个数据集，我们引入了一种新的模型架构，能够从单张图像中单步同时进行头部检测和头部网格重建。通过广泛的实验评估，我们证明了在我们的合成数据上训练的模型在真实图像上取得了强劲的性能。此外，我们数据集的多样性使其适用于广泛的任务，提供了人类头部的通用和全面表示。

英文摘要

Human head detection, keypoint estimation, and 3D head model fitting are essential tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads, a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads.

2504.01527 2026-06-18 cs.CV eess.IV 版本更新

Beyond Nearest Neighbor Interpolation in Data Augmentation

超越数据增强中的最近邻插值

Olivier Rukundo

发表机构 * Department of Electronic and Computer Engineering, University of Limerick（电子与计算机工程系，利默里克大学）

AI总结本文提出改进的几何变换函数和均值分类过滤机制，以避免最近邻插值带来的标注误差和低通滤波影响，通过离线数据增强管道提升医学图像分割性能。

Comments 10 pages, 11 figures, 14 tables

详情

AI中文摘要

使用最近邻插值来规避未定义类别标签的风险，却忽视了增强训练数据中像素级标注误差加剧的风险。此外，插值算法固有的低通滤波效应会加剧标注区域内高频结构细节退化的风险。为避免这些风险，作者修改了卷积神经网络的数据转换函数：引入改进的几何变换函数，去除对最近邻插值的依赖，并整合基于均值的类别过滤机制，以便在使用其他插值算法时处理未定义的类别标签。作者还实现了离线数据增强管道，生成特定于插值的增强训练数据，从而能够定量评估插值对增强训练数据的低通滤波效应。在三个医学图像分割数据集和XBAT+数据集上的实验评估显示，在多个定量指标上均实现了性能提升。

英文摘要

Using nearest neighbor interpolation to avoid the risk of undefined categorical labels overlooks the risk of exacerbating pixel-level annotation errors in augmented training data. Additionally, the inherent low-pass filtering effects of interpolation algorithms exacerbate the risk of degrading high-frequency structural details within annotated regions of interest. To avoid these risks, the author modified the data transformation functions used for convolutional neural networks by incorporating a modified geometric transformation function, removing reliance on nearest neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation-specific augmented training data, enabling quantitative assessment of interpolation-specific low-pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ dataset demonstrated performance gains across multiple quantitative metrics.

2505.21954 2026-06-18 cs.CV cs.AI 版本更新

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

重新审视主动说话人检测：面向泛化性和鲁棒性的野外基准

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Tuan Khai Nguyen, Soochahn Lee, Yong Jae Lee

发表机构 * University of Wisconsin - Madison（威斯康星大学麦迪逊分校）； Oregon State University（俄勒冈州立大学）； University of Sydney（悉尼大学）； Kookmin University（韩国国民大学）

AI总结提出UniTalk数据集，涵盖多语言、嘈杂背景和拥挤场景等挑战性真实条件，评估显示现有模型在野外环境下性能不足，而UniTalk训练模型泛化性更好，为主动说话人检测建立新基准。

Comments Accepted to Interspeech 2026

2510.21605 2026-06-18 cs.CV 版本更新

S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

S3OD：基于合成数据的通用显著目标检测

Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht

发表机构 * University of Oxford, VGG（牛津大学，视觉几何组）

AI总结提出S3OD方法，通过大规模合成数据生成和歧义感知架构，显著提升显著目标检测的跨数据集泛化能力，仅用合成数据训练即可降低20-50%误差。

详情

AI中文摘要

显著目标检测体现了数据受限任务的特点，昂贵的像素级精确标注迫使相关子任务（如DIS和HR-SOD）进行单独的模型训练。我们提出了一种通过大规模合成数据生成和歧义感知架构来大幅提升泛化能力的方法。我们引入了S3OD，一个包含超过139,000张高分辨率图像的数据集，通过我们的多模态扩散管道从扩散和DINO-v3特征中提取标签。迭代生成框架根据模型性能优先处理具有挑战性的类别。我们提出了一个简化的多掩码解码器，通过预测多个有效解释来处理显著目标检测中固有的歧义。仅使用合成数据训练的模型在跨数据集泛化中实现了20-50%的错误率降低，而微调版本在DIS和HR-SOD基准上达到了最先进的性能。

英文摘要

Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained only on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

2602.08355 2026-06-18 cs.CV 版本更新

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

E-VAds：面向多模态大语言模型的电商短视频理解基准

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

发表机构 * Alimama Tech, Taobao & Tmall Group of Alibaba ； Huazhong University of Science ； Vin University

AI总结提出电商短视频理解基准E-VAds，通过多模态信息密度评估框架量化领域复杂性，并构建多智能体生成的问答数据集，最后开发基于强化学习的推理模型E-VAds-R1，在商业意图推理上实现109.2%的性能提升。

Comments Accepted by ICML2026

详情

AI中文摘要

电商短视频代表了在线视频行业中高收入的细分领域，其特点是目标驱动的格式和密集的多模态信号。当前模型通常难以处理这些视频，因为现有基准主要关注通用任务，忽略了商业意图的推理。在这项工作中，我们首先提出了一个多模态信息密度评估框架，以量化该领域的复杂性。我们的评估显示，与主流数据集相比，电商内容在视觉、音频和文本模态上表现出显著更高的密度，为视频理解建立了更具挑战性的前沿。为了弥补这一差距，我们引入了电商视频广告基准（E-VAds），这是首个专门为电商短视频理解设计的基准。我们从淘宝精选了3,961个高质量视频，涵盖广泛的产品类别，并使用多智能体系统生成了19,785个开放式问答对。这些问题被组织成两个主要维度，即感知与认知和推理，包含五个不同的任务。最后，我们开发了E-VAds-R1，一个基于强化学习的推理模型，具有称为MG-GRPO的多粒度奖励设计。该策略为早期探索提供平滑指导，同时为专家级精度创造非线性激励。实验结果表明，E-VAds-R1在仅使用几百个训练样本的情况下，在商业意图推理上实现了109.2%的性能提升。

英文摘要

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2603.21583 2026-06-18 cs.CV 版本更新

HACMatch: Semi-Supervised Rotation Regression with Hardness-Aware Curriculum Pseudo Labeling

HACMatch: 基于难度感知课程伪标签的半监督旋转回归

Mei Li, Huayi Zhou, Suizhi Huang, Yuxiang Lu, Yue Ding, Hongtao Lu

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结提出一种难度感知课程学习框架，通过动态选择伪标签样本和结构化数据增强，在少量标注数据下提升半监督旋转回归性能。

Comments This is an accepted manuscript of an article published in Computer Vision and Image Understanding

详情

DOI: 10.1016/j.cviu.2026.104742
Journal ref: Computer Vision and Image Understanding (2026)

AI中文摘要

从2D图像回归物体的3D旋转是一项关键且具有挑战性的任务，在自动驾驶、虚拟现实和机器人控制等领域有广泛应用。现有的旋转回归模型通常依赖大量标注数据进行训练，或需要点云、CAD模型等2D图像之外的额外信息。因此，探索仅使用有限数量标注2D图像的半监督旋转回归具有重要价值。尽管最近的工作FisherMatch将半监督学习引入旋转回归，但其基于熵的刚性伪标签过滤方法未能有效区分可靠和不可靠的无标注样本。为解决这一局限，我们提出一种难度感知课程学习框架，根据样本难度动态选择伪标签样本，从简单到复杂逐步推进。我们引入了多阶段和自适应课程策略，用更灵活、难度感知的机制替代固定阈值过滤。此外，我们提出一种专门针对旋转估计的新型结构化数据增强策略，通过从增强补丁中组装复合图像来引入特征多样性，同时保持关键几何完整性。在PASCAL3D+和ObjectNet3D上的综合实验表明，我们的方法在低数据场景下尤其优于现有的监督和半监督基线，验证了课程学习框架和结构化增强方法的有效性。

英文摘要

Regressing 3D rotations of objects from 2D images is a crucial yet challenging task, with broad applications in autonomous driving, virtual reality, and robotic control. Existing rotation regression models often rely on large amounts of labeled data for training or require additional information beyond 2D images, such as point clouds or CAD models. Therefore, exploring semi-supervised rotation regression using only a limited number of labeled 2D images is highly valuable. While recent work FisherMatch introduces semi-supervised learning to rotation regression, it suffers from rigid entropy-based pseudo-label filtering that fails to effectively distinguish between reliable and unreliable unlabeled samples. To address this limitation, we propose a hardness-aware curriculum learning framework that dynamically selects pseudo-labeled samples based on their difficulty, progressing from easy to complex examples. We introduce both multi-stage and adaptive curriculum strategies to replace fixed-threshold filtering with more flexible, hardness-aware mechanisms. Additionally, we present a novel structured data augmentation strategy specifically tailored for rotation estimation, which assembles composite images from augmented patches to introduce feature diversity while preserving critical geometric integrity. Comprehensive experiments on PASCAL3D+ and ObjectNet3D demonstrate that our method outperforms existing supervised and semi-supervised baselines, particularly in low-data regimes, validating the effectiveness of our curriculum learning framework and structured augmentation approach.

2604.20822 2026-06-18 cs.CV cs.LG 版本更新

Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series

全球海上风电基础设施：基于密集Sentinel-1时间序列的部署与运行动态

Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer

发表机构 * Earth Observation Center (EOC), German Aerospace Center (DLR)（地球观测中心（EOC），德国航空航天中心（DLR））； Institute for Geography and Geology, University of Wuerzburg（地理与地质研究所，维尔茨堡大学）

AI总结提出全球Sentinel-1 SAR时间序列数据集，通过目标检测和规则分类器识别海上风电基础设施的部署与运行阶段，支持全球尺度动态分析。

Comments 29 pages, 18 figures

详情

AI中文摘要

海上风电行业正在快速扩张，增加了对全球范围内基础设施部署和运行进行独立、高时间分辨率监测的需求。虽然基于地球观测的海上风电基础设施测绘在空间定位方面已经成熟，但现有的开放数据集缺乏关于建设和运行动态的时间密集且语义精细的信息。我们引入了一个全球Sentinel-1合成孔径雷达（SAR）时间序列数据语料库，该语料库解析了2016年第一季度至2025年第一季度海上风电基础设施的部署和运行阶段。基于更新的目标检测工作流程，我们在检测到的基础设施位置编译了15,606条时间序列，共有14,840,637个事件作为分析就绪的一维SAR后向散射剖面，每个剖面对应一次Sentinel-1采集和一个位置。为了便于直接使用和基准测试，我们发布了（i）分析就绪的一维SAR剖面，（ii）由基于规则的分类器生成的事件级基线语义标签，以及（iii）包含553条时间序列和328,657个事件标签的专家标注基准数据集。基线分类器在事件评估中实现了0.84的宏F1分数，在折叠编辑相似性-质量阈值曲线下面积（AUC）为0.785，表明时间一致性。我们证明，由此产生的语料库支持全球尺度的部署动态分析、区域部署模式差异的识别、船只交互和运行事件，并为开发和比较海上风电基础设施监测的时间序列分类方法提供了参考。

英文摘要

The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
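The macro F1 score quoted above averages per-class F1 values with equal weight, so rare event classes count as much as frequent ones. The following is a minimal sketch of that metric in plain Python. The event labels (`construction`, `operation`, `vessel`) and the toy predictions are hypothetical and are not taken from the released corpus or its label scheme.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical event labels for six acquisitions at one location.
y_true = ["construction", "construction", "operation",
          "operation", "operation", "vessel"]
y_pred = ["construction", "operation", "operation",
          "operation", "operation", "vessel"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class contributes equally, a single misclassified event in a small class moves the score more than the same error in a large class would.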

2605.05547 2026-06-18 cs.CV 版本更新

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

利用地理空间AlphaEarth嵌入表征巴西大西洋森林恢复结果

Alice Heiman

发表机构 * Department of Computer Science（计算机科学系）

AI总结本研究利用AlphaEarth基础模型的卫星嵌入，通过余弦相似度定义参考轨迹嵌入，评估巴西圣保罗1729个恢复点的早期恢复成效，发现不同土地利用类型在嵌入空间中形成聚类，但信号存在噪声。

Comments Presented as a workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)

详情

AI中文摘要

巴西的大西洋森林是一个关键生物多样性热点，但其原始覆盖面积如今仅存不足12-15%。尽管大规模监测森林恢复至关重要，但传统方法受限于实地报告在大尺度上的不可行性以及遥感指数（如NDVI）的饱和效应。此外，与森林砍伐导致的快速光谱变化不同，再造林是一个渐进过程。在本研究中，我们利用AlphaEarth Foundation模型的卫星嵌入，检查了圣保罗的1,729个恢复点，以评估其在表征早期恢复成功方面的有效性。我们引入了“参考轨迹嵌入”的概念，基于与成熟次生林参考点的余弦相似度定义恢复成功的度量。我们观察到不同土地利用和土地覆盖（LULC）类型在嵌入空间中形成不同的聚类，并且能够识别出具有明显变化向量的地点。然而，信号可能存在噪声，嵌入可能需要进一步微调以捕获和预测超出LULC的地点元数据。

英文摘要

The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
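The similarity-based metric described above can be illustrated with a small sketch. It scores a site's yearly embeddings by cosine similarity to a reference embedding. Averaging the reference-site embeddings into one vector is an assumption made here for illustration; the abstract does not say how the reference is aggregated. The function names and the two-dimensional toy vectors are likewise hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def restoration_scores(site_by_year, reference_sites):
    """Similarity of each yearly site embedding to the mean reference embedding."""
    reference = mean_embedding(reference_sites)  # assumed aggregation
    return [cosine_similarity(e, reference) for e in site_by_year]


# Toy 2-D vectors standing in for real high-dimensional embeddings.
mature_forest = [[1.0, 0.0], [1.0, 0.0]]
site_trajectory = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
scores = restoration_scores(site_trajectory, mature_forest)
print(scores)
```

A score that rises over the years suggests the site is moving toward the reference forest in embedding space, subject to the noise caveat the abstract raises.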

2606.05368 2026-06-18 cs.CV 版本更新

Biomazon: A Multimodal Dataset for 3D Forest Structure and Biomass Modeling in the Amazon Basin

Biomazon：亚马逊盆地三维森林结构与生物量建模的多模态数据集

Sayan Mandal, Rocco Sedona, Simon Besnard, Mikhail Urbazaev, Morris Riedel, Ehsan Zandi, Gabriele Cavallaro

发表机构 * Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich（于利希超级计算中心（JSC），于利希研究中心）； School of Engineering and Natural Sciences (SENS), University of Iceland（工程与自然科学学院（SENS），冰岛大学）； Global Land Monitoring Group, GFZ Helmholtz Centre for Geosciences（全球土地监测组，GFZ亥姆霍兹地球科学研究中心）

AI总结针对现有方法未将森林垂直结构作为有序轮廓学习的问题，提出Biomazon多模态基准数据集，结合GEDI RH和AGBD目标与多传感器预测因子，通过共享编码器-解码器框架进行消融研究，为热带森林结构一致RH轮廓预测和结构-生物量建模建立参考基准。

Comments 32 pages, 21 figures, 8 tables

详情

AI中文摘要

准确、空间明确的描述热带森林结构对于碳核算和生态系统监测至关重要，然而大多数机器学习流程预测冠层顶部高度代理（例如RH95/RH98）或AGBD作为单独的标量目标，而不是将森林垂直结构作为有序轮廓学习。社区缺乏一个ML就绪的多模态基准，用于联合预测整个GEDI RH轮廓与AGBD，或评估强制RH百分位数之间物理一致排序的方法。我们通过Biomazon解决了这一问题，这是一个覆盖亚马逊盆地的20米多模态基准数据集，在标准化的空间划分和评估协议下，将GEDI RH和AGBD目标与多传感器预测因子（Sentinel-1/2、ALOS-2 PALSAR-2、Copernicus DEM、Dynamic World LULC和AlphaEarth嵌入）配对。使用共享编码器-解码器与任务特定头作为基线框架，我们对（i）骨干/模型规模、（ii）模态贡献以及（iii）在独立和融合设置下使用辅助嵌入进行了全面的消融研究，并报告了单目标和联合目标结果，以量化统一训练协议下的权衡。最后，我们通过与现有网格化产品（包括GEDI L4D RH10-RH98和AGBD）在匹配时间尺度上的区域对齐比较，将基线性能置于背景中。Biomazon连同随附的协议和基线结果，为未来热带森林中结构一致的RH轮廓预测和结构-生物量建模工作建立了参考基准。

英文摘要

Accurate, spatially explicit characterization of tropical forest structure is essential for carbon accounting and ecosystem monitoring, yet most ML pipelines predict canopy-top height proxies (e.g., RH95/RH98) or AGBD as separate scalar targets, rather than learning the forest vertical structure as an ordered profile. The community lacks a ML-ready multimodal benchmark for predicting the entire GEDI RH profile jointly with AGBD, or for evaluating methods that enforce physically consistent ordering across RH percentiles. We address this with Biomazon, a 20 m multimodal benchmark dataset over the Amazon Basin that pairs GEDI RH and AGBD targets with multi-sensor predictors (Sentinel-1/2, ALOS-2 PALSAR-2, Copernicus DEM, Dynamic World LULC, and AlphaEarth embeddings) under standardized spatial splits and evaluation protocols. Using a shared encoder-decoder with task-specific heads as a baseline framework, we conduct a comprehensive ablation study of (i) backbone/model scale, (ii) modality contributions, and (iii) the use of auxiliary embeddings under standalone and fusion settings, and we report both single-target and joint-target results to quantify tradeoffs under a unified training protocol. Finally, we contextualize baseline performance through regionally aligned comparisons against existing gridded products, including GEDI L4D RH10-RH98 and AGBD, at matching temporal scale. Biomazon, together with the accompanying protocols and baseline results, establishes a reference benchmark for future work on structurally consistent RH-profile prediction and structure-biomass modeling in tropical forests.
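One common way to enforce a physically consistent ordering across RH percentiles is to predict a base height plus non-negative increments and accumulate them. The sketch below shows that idea in plain Python. It is an illustrative assumption, not the Biomazon baseline: the abstract does not describe how ordering is enforced, and `ordered_profile` and its inputs are hypothetical names.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ordered_profile(base_height, raw_increments):
    """Turn unconstrained outputs into a non-decreasing height profile.

    base_height stands for the lowest percentile (for example RH10) and each
    raw increment is mapped through softplus before being accumulated, so a
    later percentile can never fall below an earlier one.
    """
    heights = [base_height]
    for raw in raw_increments:
        heights.append(heights[-1] + softplus(raw))
    return heights


# Arbitrary unconstrained values, as a regression head might emit them.
profile = ordered_profile(1.0, [-3.0, 0.0, 2.5, -10.0])
print(profile)
```

The design choice is that ordering holds by construction, so no penalty term or post-hoc sorting is needed at inference time.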

2606.05883 2026-06-18 cs.CV 版本更新

Geometry-Aware Dataset Condensation for Diffusion Model Training

面向扩散模型训练的几何感知数据集压缩

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

发表机构 * GitHub

AI总结针对扩散模型训练，提出基于几何感知分布对齐的真实子集选择方法，利用单侧部分最优传输保持几何结构，并辅以轻量级特征统计与语义一致性正则化，通过两阶段离散优化实现高效压缩。

Comments ICML 2026

详情

AI中文摘要

数据集压缩旨在通过合成或选择从真实数据中构建紧凑数据集。然而，现有方法不适用于扩散模型训练：合成数据生成通常产生不适合真实建模的低保真样本，而真实子集选择通常无法保留扩散似然目标所需的分布几何结构。为解决此问题，我们提出将真实子集选择重新表述为几何感知分布对齐问题。通过引入单侧部分最优传输，我们的方法选择性地将紧凑子集与完整数据分布对齐，同时允许低密度区域中的未匹配质量，确保保留扩散模型训练所需的有效几何结构。为进一步保证分布保真度，我们用轻量级特征统计和语义一致性正则化补充几何对齐。提出了一种高效的两阶段离散优化策略来实现该对齐目标。在扩散变体、子集大小、图像分辨率和训练轮次上的大量实验表明，我们的方法在扩散模型训练中实现了优越的保真度和分布覆盖。代码可在 https://github.com/2018cx/GADC 获取。

英文摘要

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, ensuring the preserved geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Codes are available at https://github.com/2018cx/GADC.

URL PDF HTML ☆

赞 0 踩 0

2606.14702 2026-06-18 cs.CV 版本更新

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

OmniVideo-100K：通过结构化脚本和证据链进行音视频推理的数据集

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

发表机构 * Nanjing University（南京大学）； CASIA（中国科学院自动化研究所）

AI总结提出OmniVideo-100K数据集，通过实体锚定视频脚本和线索引导的QA生成机制，解决音视频问答中跨段实体不一致和长时推理不足的问题，微调模型在多个基准上取得显著提升。

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

详情

AI中文摘要

当前的音视频问答（QA）自动化流水线通常采用“视频-字幕-QA”范式。然而，这些方法通常将视频分割成短片段，并为音频和视觉模态生成独立的描述。这种解耦处理切断了声音与其视觉来源之间的固有关联，而独立的片段处理常常导致同一实体在不同片段中的描述不一致。此外，将长文本理解和QA合成耦合到单一步骤中，往往将模型限制在局部事件上，生成的问答缺乏长期时间连接和深度跨模态推理。为了解决这些问题，我们提出了一种自动化数据引擎，包含两种机制：（1）**实体锚定视频脚本**将视频转换为结构化脚本，包括摘要、主要实体列表和逐片段的音视频描述。实体列表作为全局先验，确保跨片段引用一致性并重建音视频关联。（2）**线索引导的QA生成**提示模型首先从脚本中挖掘跨片段、多模态线索，然后基于这些高价值线索生成QA对。利用这一流水线，我们构建了指令微调数据集**OmniVideo-100K**和人工验证的测试集**OmniVideo-Test**。在OmniVideo-100K上微调VITA-1.5、Qwen2.5-Omni-7B和Qwen3-Omni-30B，在OmniVideo-Test上获得了高达20.59%的性能提升，并在Daily-Omni和JointAVBench等现有基准上表现出强大的泛化能力（提升高达12.64%）。

英文摘要

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2606.17188 2026-06-18 cs.CV cs.CL 版本更新

ERQA-Plus：具身AI推理的诊断基准

Hong Yang, Basura Fernando

发表机构 * Centre for Frontier AI Research, Agency for Science, Technology and Research（新加坡科技研究局前沿人工智能研究中心）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

AI总结提出ERQA-Plus基准，包含1766个基于机器人中心图像的问答实例，覆盖感知、动作、社交、导航和常识推理，用于诊断具身AI的推理能力。

详情

AI中文摘要

通用具身智能体需要的不仅仅是物体识别：它们必须从情境视觉观察中推理空间关系、动作、程序、人类意图、环境约束和常识后果。然而，现有的视觉和具身问答基准通常对测试的推理依赖关系控制有限，使得难以将基于具身的推理与基于捷径的视觉或语言模式匹配区分开来。我们提出了ERQA-Plus，一个用于具身AI推理的诊断基准。ERQA-Plus包含1766个问答实例，这些实例基于711张以机器人为中心的图像，并根据一个结构化的分类法组织，涵盖感知、动作中心、社交交互、导航环境和上下文常识推理。该数据集使用多阶段生成和验证流程构建，结合了分类法引导的问题生成、自动质量判断、迭代修订和人工评估，以改进视觉基础、答案有效性和推理质量。我们对代表性的通用视觉语言模型和具身模型进行了基准测试，包括LLaVA-NeXT-8B、Prismatic-7B、MiniCPM-V-4.5-8B、Qwen3-VL、RoboRefer-8B和RoboBrain2.5-8B。尽管最强的模型Qwen3-VL-32B达到了83.4%的整体准确率和61.4的SBERT分数，但类别级别的结果揭示了空间推理、程序推理、事件预测和意图推理方面的持续弱点。因此，ERQA-Plus提供了一个细粒度的评估框架，不仅衡量具身智能体是否回答正确，还衡量它们能够可靠地执行哪些形式的具身推理。数据集可在 https://huggingface.co/datasets/huggingdas/erqa-plus 获取，项目页面位于 https://github.com/LUNAProject22/erqa-plus。

英文摘要

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and 61.4 SBERT score, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.

2606.18661 2026-06-18 cs.CV cs.AI 新提交

LandslideAgent with Multimodal LandslideBench: A Domain-Rule-Augmented Agent for Autonomous Landslide Identification and Analysis

LandslideAgent与多模态LandslideBench：一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University（中南大学）

AI总结提出指令驱动智能体框架，包含多模态数据集LandslideBench、滑坡专用视觉语言模型LandslideVLM及领域规则增强智能体LandslideAgent，实现自主滑坡识别与分析。

详情

AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要，然而当前范式难以同时提取视觉特征和高层次地球科学语义，而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战，我们提出一个指令驱动的智能体框架，包含三个组成部分。首先，通过多VLM交叉验证和交互式标注构建LandslideBench，这是一个多模态细粒度数据集，包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后，通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM，以增强地质语义理解。最后，以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent，采用双规则控制器，结合结构化报告元数据约束和交叉验证识别约束，来调控自动化工具调用。实验表明，LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理，实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.19249 2026-06-18 cs.CV cs.LG 新提交

Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory

Transformer几何观测站TGO-I：谱几何观测站

Kaustubh Kapil, Kishor P. Upla

发表机构 * Sardar Vallabhai National Institute of Technology (SVNIT), Surat, India（印度苏拉特萨达尔·瓦拉巴伊国家理工学院（SVNIT））

AI总结提出TGO框架，通过分析ViT表示的谱几何（有效秩、稳定秩、参与比、谱熵、谱平坦度、谱各向异性等），发现训练过程中维度利用增加、各向异性降低、谱熵和参与比上升，最终CLS标记表示具有最高有效维度和最低各向异性。

详情

AI中文摘要

尽管Vision Transformers（ViTs）被广泛采用并在众多计算机视觉应用中取得成功，对其维度和表示几何的基本理解仍然相对未被充分探索。为了弥补这一差距，我们引入了Transformer几何观测站（TGO），这是一个系统的实验和分析流程框架，旨在研究Vision Transformers的表示几何和动态。TGO-I是该框架的第一部分，专注于ViT表示的谱几何。使用在ImageNet-100上训练的ViT-Small/16模型，我们分析了训练过程中的有效秩、稳定秩、参与比、谱熵、谱平坦度、谱各向异性、协方差结构、特征谱和奇异值谱。我们的结果揭示了维度利用的一致增加，伴随着各向异性降低、谱熵增加、参与比增加以及逐渐平坦的特征谱。与常见的直觉（即训练应将信息集中到少数主导方向）相反，我们观察到方差在表示维度上的逐渐重新分布。这一现象在最终的CLS标记表示中尤为明显，该表示在网络中表现出最高的有效维度和最低的各向异性。

英文摘要

Despite the widespread adoption of Vision Transformers (ViTs) and their success across numerous computer vision applications, the fundamental understanding of their dimensional and representational geometry remains relatively underexplored. To address this gap, we introduce Transformer Geometry Observatory (TGO), a systematic framework of experiments and analysis pipelines designed to investigate the representational geometry and dynamics of Vision Transformers. TGO-I, the first installment of the framework, focuses on the spectral geometry of ViT representations. Using a ViT-Small/16 model trained on ImageNet-100, we analyze Effective Rank, Stable Rank, Participation Ratio, Spectral Entropy, Spectral Flatness, Spectral Anisotropy, covariance structure, eigenspectra, and singular value spectra throughout training. Our results reveal a consistent increase in dimensional utilization, accompanied by decreasing anisotropy, increasing spectral entropy, increasing participation ratio, and progressively flatter eigenspectra. Contrary to the common intuition that training should concentrate information into a small number of dominant directions, we observe a progressive redistribution of variance across representational dimensions. This phenomenon is particularly pronounced in the final CLS token representation, which exhibits the highest effective dimensionality and lowest anisotropy within the network.
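Several of the spectral quantities named above have widely used textbook definitions that can be computed directly from an eigenvalue spectrum. The sketch below applies those common definitions to a list of covariance eigenvalues. The paper's exact formulas are not given in the abstract, so these may differ from the ones TGO-I uses. In particular, stable rank is written here in terms of covariance eigenvalues (squared singular values), and strictly positive eigenvalues are assumed for the flatness measure.

```python
import math


def spectral_entropy(eigvals):
    """Shannon entropy of the spectrum normalized to sum to one."""
    total = sum(eigvals)
    probs = [v / total for v in eigvals]
    return -sum(p * math.log(p) for p in probs if p > 0)


def effective_rank(eigvals):
    """Exponential of the spectral entropy (equals n for a flat spectrum)."""
    return math.exp(spectral_entropy(eigvals))


def participation_ratio(eigvals):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    return sum(eigvals) ** 2 / sum(v * v for v in eigvals)


def stable_rank(eigvals):
    """Sum of eigenvalues over the largest one (Frobenius^2 / spectral^2)."""
    return sum(eigvals) / max(eigvals)


def spectral_flatness(eigvals):
    """Geometric mean over arithmetic mean; needs strictly positive values."""
    n = len(eigvals)
    log_geometric_mean = sum(math.log(v) for v in eigvals) / n
    return math.exp(log_geometric_mean) / (sum(eigvals) / n)


flat = [1.0, 1.0, 1.0, 1.0]      # variance spread evenly over 4 directions
peaked = [100.0, 1.0, 1.0, 1.0]  # one dominant direction
for name, spectrum in (("flat", flat), ("peaked", peaked)):
    print(name, effective_rank(spectrum), participation_ratio(spectrum),
          stable_rank(spectrum), spectral_flatness(spectrum))
```

Under these definitions, the trend reported in the abstract (higher entropy and participation ratio, flatter eigenspectra) corresponds to a spectrum moving from the peaked case toward the flat case.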

2606.19151 2026-06-18 cs.CY cs.CV 交叉投稿

The Market in the Model: Latent Diffusion as Neural Economy

模型中的市场：潜在扩散作为神经经济

Eryk Salvaggio

发表机构 * Cambridge Digital Humanities（剑桥数字人文研究中心）； University of Cambridge（剑桥大学）； Machine Visual Culture Research Group（机器视觉文化研究组）； Max Planck Institute（马克斯·普朗克研究所）

AI总结本文从计算机视觉工程问题出发，分析潜在扩散模型的机制，论证其作为神经经济运作，将社会交流抽象为可通约向量，并警示仅关注版权与商品防御的批评可能强化模型产生的拜物教。

详情

AI中文摘要

在视觉文化和人文学科中，对生成图像模型的有价值批评强调了数据集在塑造其生成图像中的作用。然而，对嵌入模型机制的意识形态立场的细致研究一直被忽视，使得它们被想象为“黑箱”。为了扩展而非取代数据集批评，本文从潜在扩散模型被引入以解决计算机视觉工程师问题的角度，以及每个组件被赋予自动化决策的任务，审视了其机制。我通过其各部分的历史以及系统刻入每个生成图像中的视觉理论来解释这个集成。借鉴Impett和Offert的神经交换价值概念，我提出这一分析以论证该模型作为神经经济运作：一个封闭的符号系统，将社会交流抽象为可通约向量，同时将社会领域转化为待售包裹。逐组件追踪训练和生成流程揭示了每个操作取代了什么，以及它如何进一步巩固平台经济和注意力经济对社会交流的逻辑。本文警告，任何只关注版权和商品防御的批评都可能重申模型所产生的拜物教，并主张以社会交换为中心。

英文摘要

Valuable critique of generative image models within visual culture and the humanities has emphasized the role of datasets in shaping the images they produce. Yet, close studies of the ideological positions embedded into the mechanism of the models have been neglected, leaving them imagined as "black boxes." In a bid to expand, rather than replace, dataset critique, this paper examines the mechanisms of the latent diffusion model in terms of the problems they were brought in to solve on behalf of computer vision engineers, and the decisions each component was tasked with automating. I interpret that ensemble through the histories of its parts and the theory of vision the system inscribes into every generated image. Drawing on Impett and Offert's notion of neural exchange value, I offer this analysis to argue that the model operates as a neural economy: a contained symbolic system that abstracts social communication into commensurable vectors as it transfers the social sphere into parcels for sale. Tracing the training and generation pipelines component by component reveals what each operation displaces, and how it further entrenches the logics of platform and attention economies over social communication. The paper warns that any critique fixated exclusively on copyright and commodity defenses risks reaffirming the very fetishism the model produces, and argues instead for centering social exchange.

2506.13506 2026-06-18 cs.CV q-bio.NC Updated version

Stimulus Motion Perception Studies Imply Specific Neural Computations in Human Visual Stabilization

David W Arathorn, Josephine C. D'Angelo, Austin Roorda

Affiliations: Montana State University, Dept of Electrical and Computer Engineering; University of California, Berkeley, Herbert Wertheim School of Optometry and Vision Science

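AI summary: By analyzing the tiny jitter of the human eye during fixation, the work finds that visual stabilization is more complex than camera stabilization or a simple evolutionary solution, and proposes a functional model based on specific operations on retinal signals together with a possible neural circuit implementation.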

Details

Abstract

Even during fixation the human eye is constantly in low amplitude motion, jittering over small angles in random directions at up to 100Hz. This motion results in all features of the image on the retina constantly traversing a number of cones, yet objects which are stable in the world are perceived to be stable, and any object which is moving in the world is perceived to be moving. A series of experiments carried out over a dozen years revealed the psychophysics of visual stabilization to be more nuanced than might be assumed, say, from the mechanics of stabilization of camera images, or what might be assumed to be the simplest solution from an evolutionary perspective. The psychophysics revealed by the experiments strongly implies a specific set of operations on retinal signals resulting in the observed stabilization behavior. The presentation is in two levels. First is a functional description of the action of the mechanism that is very likely responsible for the experimentally observed behavior. Second is a more speculative proposal of circuit-level neural elements that might implement the functional behavior.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL Updated version

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

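AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, reaching SOTA performance on SolidFGeo2k and MathVerse-Solid.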

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Details

Abstract

Geometric problem solving, a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which consists of two steps: parsing, then reasoning. In the parsing step, we use a conditional description language (CDL), a formal language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we use the formal CDL statements and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2509.09631 2026-06-18 cs.SD cs.CL cs.CV Updated version

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

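AI summary: Proposes the DiFlow-TTS framework, which uses discrete flow matching and a factorized discrete flow denoiser to balance high quality and low latency in zero-shot TTS.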

Comments Accepted at Interspeech 2026 (Long Paper Track)

2604.14837 2026-06-18 cs.CV Updated version

Improved Multiscale Structural Mapping with Supervertex Vision Transformer for the Detection of Alzheimer's Disease Neurodegeneration

Geonwoo Baek, David H. Salat, Ikbeom Jang

Affiliations: Department of Computer Science & Engineering, Hankuk University of Foreign Studies, Seoul, Republic of Korea; Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Charlestown, MA, USA; Department of Radiology, Harvard Medical School, Boston, MA, USA; Neuroimaging Research for Veterans (NeRVe) Center, VA Boston Healthcare System, Boston, MA, USA

Comments Submitted to Human Brain Mapping

Details

DOI: 10.1002/hbm.70548
Journal ref: Human Brain Mapping 47(8), e70548 (2026)

Abstract

Alzheimer's disease (AD) confirmation often relies on positron emission tomography (PET) or cerebrospinal fluid (CSF) analysis, which are costly and invasive. Consequently, structural MRI biomarkers such as cortical thickness (CT) are widely used for non-invasive AD screening. Multiscale structural mapping (MSSM) was recently proposed to integrate gray-white matter contrasts (GWCs) with CT from a single T1-weighted MRI (T1w) scan. Building on this framework, we propose MSSM+, together with surface supervertex mapping (SSVM) and a Supervertex Vision Transformer (SV-ViT). 3D T1w images from individuals with AD and cognitively normal (CN) controls were analyzed. MSSM+ extends MSSM by incorporating sulcal depth and cortical curvature at the vertex level. SSVM partitions the cortical surface into supervertices (surface patches) that effectively represent inter- and intra-regional spatial relationships. SV-ViT is a Vision Transformer architecture operating on these supervertices, enabling anatomically informed learning from surface mesh representations. Compared with MSSM, MSSM+ identified more spatially extensive and statistically significant group differences between AD and CN. In AD vs. CN classification, MSSM+ achieved an area under the precision-recall curve 3 percentage points higher than MSSM. Vendor-specific analyses further demonstrated reduced signal variability and consistently improved classification performance across MR manufacturers relative to CT, GWCs, and MSSM. These findings suggest that MSSM+ combined with SV-ViT is a promising MRI-based imaging marker for AD detection prior to CSF/PET confirmation.

2602.02370 2026-06-18 cs.CV Updated version

Uncertainty-Aware Image Classification In Biomedical Imaging Using Spectral-normalized Neural Gaussian Processes

Uma Meleti, Jeffrey J. Nirschl

Affiliations: Department of Pathology; Lab Medicine, University of Wisconsin-Madison

Comments Published at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

2411.16934 2026-06-18 cs.CV Updated version

Online Episodic Memory Visual Query Localization with Egocentric Streaming Object Memory

Zaira Manigrasso, Matteo Dunnhofer, Antonino Furnari, Moritz Nottebaum, Antonio Finocchiaro, Davide Marana, Rosario Forte, Giovanni Maria Farinella, Christian Micheloni

Affiliations: University of Udine; University of Catania; York University

Comments in IEEE/CVF Winter Conference on Application of Computer Vision (WACV) 2026

2510.13562 2026-06-18 physics.med-ph cs.CV cs.NA math.NA Updated version

An efficient approach with theoretical guarantees to simultaneously reconstruct activity and attenuation sinogram for TOF-PET

Liyang Hu, Chong Chen

Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China

Comments 32 pages, 11 figures, 4 tables

Details

DOI: 10.1109/TCI.2026.3697651
Journal ref: IEEE Transactions on Computational Imaging 2026

Abstract

In positron emission tomography (PET), attenuation correction is indispensable for obtaining a quantitatively accurate activity map (tracer distribution) in the body. Generally, this is carried out using an attenuation map estimated from computed tomography or magnetic resonance imaging. However, besides introducing errors in the resulting attenuation correction factors, the additional scan brings new radiation dose and/or increases scanning time, and it also leads to severe misalignment induced by various motions during and between the two sequential scans. To address these issues, based on maximum likelihood estimation, we propose a new mathematical model for simultaneously reconstructing the activity and the attenuation sinogram from time-of-flight (TOF)-PET emission data only. In particular, we make full use of the exclusively exponential form of the attenuation correction factors, and include in the model a constraint on the total amount of activity in some mask region. Furthermore, we prove its well-posedness, including the existence, uniqueness, and stability of the solution. We propose an alternating update algorithm to solve the model and analyze its convergence. Finally, numerical experiments with various TOF-PET emission data demonstrate that the proposed method converges numerically, is robust to noise, outperforms some state-of-the-art methods in accuracy and efficiency, and is capable of autonomous attenuation correction.

2507.05647 2026-06-18 eess.IV cs.CV Updated version

Diffusion-Based Limited-Angle CT Reconstruction under Noisy Conditions

Jiaqi Guo, Santiago López-Tapia

Affiliations: Dept. of Electrical and Computer Engineering, Northwestern University, Evanston, IL, USA

Comments Accepted at the 2025 IEEE International Conference on Image Processing (ICIP), Workshop

2406.16439 2026-06-18 cs.CV Updated version

Continual Test-Time Adaptation for Object Detection with Adaptive Monitoring and Randomized Restoration

Shilei Cao, Juepeng Zheng, Yan Liu, Baoquan Zhao, Ziqi Yuan, Weijia Li, Runmin Dong, Haohuan Fu

Affiliations: School of Artificial Intelligence, Sun Yat-Sen University; School of Information Science and Technology, University of Science and Technology of China; State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University; Tsinghua Shenzhen International Graduate School, Tsinghua University; National Supercomputing Center in Shenzhen; Ministry of Education Key Laboratory for Earth System Modeling and the Department of Earth System Science, Tsinghua University

1. Multimodal and Vision-Language Models (17 papers)

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Moving Beyond Diversity: Visual Token Pruning as Subspace Reconstruction for Efficient VLMs

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Native Active Perception as Reasoning for Omni-Modal Understanding

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference

Cosmos 3: Omnimodal World Models for Physical AI

Would you still call this Dax? Novel Visual References in VLMs and Humans

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

2. Embodied Intelligence, Robotics, and Autonomous Driving (13 papers)

Aerial-ground LiDAR place recognition with patch-level self-supervised learning and expanded reciprocal re-ranking

Spatially Stratified Distillation for Heterogeneous Radar Place Recognition

Where Will They Go? Modelling Multimodal Pedestrian Manoeuvres from Ego-centric Videos

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

CABLE: Cloud-Assisted Bandwidth-efficient LMM-based Encoding for V2X Systems

SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation

Sensor Configuration Matters: A Systematic Evaluation of Multimodal SLAM on Quadruped Robots

Seeing Through Occlusion: Deterministic Arm Kinematic Correction for Robot Teleoperation

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Zero-Shot Cross-City Generalization in End-to-End Autonomous Driving: Self-Supervised versus Supervised Representations

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

3. Image Recognition, Retrieval, and Classification (5 papers)

A Prototypical Signature Approach for Writer-Independent Offline Signature Verification

LARE: Low-Attention Region Encoding for Text-Image Retrieval

ROSA-TFormer: A Radar-Optical Sensor-Aware Temporal Transformer for Pinus sylvestris Plantation Classification in Northern Shaanxi Using GEE-Derived Sentinel-1/2 Time Series

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

Quantile Transfer for Reliable Operating Point Selection in Visual Place Recognition

4. Object Detection, Segmentation, and Localization (8 papers)

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Technical Report for ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge: Leveraging DINOv3 for Robust Outdoor Scene Understanding in Field Robotics

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

MUFASA: A Multi-Layer Framework for Slot Attention

Bidirectional Cross-Attention Fusion of High-Resolution RGB and Low-Resolution Hyperspectral Inputs for Multimodal Semantic Segmentation

Prior-guided Fusion of Multimodal Features for Change Detection from Optical-SAR Images

SegmentAnyTreeV2: Scaling Transformer-Based Tree Instance Segmentation Across Sensors, Platforms, and Forests

5. Video Understanding and Temporal Vision (8 papers)

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

APT: Atomic Physical Transitions for Causal Video-Language Understanding

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

Open-World Video Segmentation

6. Generative Vision and World Models (21 papers)

Data-Forcing Distillation: Restoring Diversity and Fidelity in Few-Step Video Generation

Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops

UniTemp: Unlocking Video Generation in Any Temporal Order via Bidirectional Distillation

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

BindEdit: Taming Attention Leakage for Precise Multi-Object Image Editing

Taming I2V models for Image HOI Editing: A Cognitive Benchmark and Agentic Self-Correcting Framework

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

Epipolar Geometry Improves Video Generation Models

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

7. 3D Vision, Point Clouds, and Spatial Intelligence (13 papers)

CAOA -- Completion-Assisted Object-CAD Alignment