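arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.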

2605.10937 2026-05-12 cs.CV Updated version

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Haoyuan Sun, Jing Wang, Yuxin Song, Yu Lu, Bo Fang, Yifu Luo, Jun Yin, Pengyu Zeng, Miao Zhang, Tiantian Zhang, Xueqian Wang, Shijian Lu

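Affiliations * Nanyang Technological University; Baidu Inc.; Zhejiang University; City University of Hong Kong; Tsinghua University; Jimei University

AI summary: This paper studies how reinforcement-learning post-training can further improve text-to-image generation models and proposes a remedy for the reward hacking seen in existing methods. The authors argue that the standardization step can miscalibrate the policy and thereby degrade training. They propose Super-Linear Advantage Shaping (SLAS), an information-geometry-based method that introduces advantage-dependent weights to reshape the policy space nonlinearly, strengthening effective updates and suppressing spurious gradients. Experiments show that SLAS outperforms existing methods across multiple models and benchmarks, improving training efficiency, generalization, and generation quality.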

2605.10936 2026-05-12 cs.CV Updated version

Personal Visual Context Learning in Large Multimodal Models

Zihui Xue, Ami Baid, Sangho Kim, Mi Luo, Kristen Grauman

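Affiliations * The University of Texas at Austin

AI summary: As wearables such as smart glasses bring large multimodal models (LMMs) into a user's continuous first-person visual stream, visual personalization becomes the key to making these models true personal assistants. The paper introduces Personal Visual Context Learning (Personal VCL), which uses user-specific visual information to answer personalized queries, and builds Personal-VCL-Bench as an evaluation benchmark. It finds a significant gap in how well current LMMs exploit visual context, and proposes an inference-time baseline, Agentic Context Bank, which improves performance across tasks through a structured memory bank and query-adaptive evidence selection.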

Comments Project website: https://vision.cs.utexas.edu/projects/PersonalVCL/

2605.10934 2026-05-12 cs.LG cs.AI cs.CV cs.RO stat.ML Updated version

Variational Inference for Lévy Process-Driven SDEs via Neural Tilting

Yaman Kindap, Manfred Opper, Benjamin Dupuis, Umut Simsekli, Tolga Birdal

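Affiliations * Imperial College London, UK; Technical University of Berlin, Germany; INRIA, CNRS, Département d’Informatique de l’Ecole Normale Supérieure / PSL, France

AI summary: This paper studies variational inference for stochastic differential equations (SDEs) driven by Lévy processes, with the aim of accurately capturing extreme events and heavy tails in domains such as finance and climate. Existing approaches are either computationally expensive or rely on Gaussian assumptions that cannot handle jumps. The authors propose a neural exponential tilting framework in which a neural network exponentially reweights the Lévy measure, yielding a flexible variational family that preserves the jump structure while remaining computationally tractable. Experiments on synthetic and real data show that the method captures jump dynamics and gives reliable posterior inference where Gaussian variational methods fail.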

Comments The associated project page which contains the official implementation can be found in https://circle-group.github.io/research/NeuralTilting/

2605.10922 2026-05-12 cs.CV Updated version

Pixal3D: Pixel-Aligned 3D Generation from Images

Dong-Yang Li, Wang Zhao, Yuxin Chen, Wenbo Hu, Meng-Hao Guo, Fang-Lue Zhang, Ying Shan, Shi-Min Hu

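Affiliations * BNRist, Department of Computer Science and Technology, Tsinghua University; Tencent ARC Lab; Victoria University of Wellington

AI summary: Pixal3D is a high-fidelity image-to-3D generation method that targets the weak pixel-level detail recovery of existing 3D generative models. It introduces a pixel-level back-projection conditioning mechanism that generates pixel-aligned 3D geometry directly in the input view, establishing an explicit pixel-to-3D feature correspondence and markedly improving fidelity. Pixal3D also supports multi-view generation and scene-level synthesis, offering a new route to high-precision 3D objects and scenes from one or more images.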

Comments SIGGRAPH 2026. Project page: https://ldyang694.github.io/projects/pixal3d/

2605.10903 2026-05-12 cs.CV cs.RO Updated version

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang, Jing Lyu, Pengxiang Ding, Yan Wang, Donglin Wang, Haoang Li

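Affiliations * Zhejiang University; Westlake University; Tsinghua University; Beijing Academy of Artificial Intelligence

AI summary: This paper addresses the limited gains and high adaptation cost of standard supervised fine-tuning for pretrained vision-language-action (VLA) models. It decouples, in parameter space, the two goals of fine-tuning with auxiliary objectives: strengthening general capabilities and fitting the task-specific action distribution. Two fine-tuned models are trained on a small task set with two different training strategies, and the capability vector contributed by the auxiliary objective is extracted from them. These capability vectors are combined with the pretrained parameters to form a capability-enhanced meta model, and a lightweight orthogonal regularization loss keeps performance high while substantially reducing compute. Experiments show the approach is effective and generalizes well across models and new environments.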
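
The parameter-space arithmetic in this entry can be illustrated with a minimal sketch. It assumes that a capability vector is the element-wise difference between two fine-tuned checkpoints and that it is merged into the pretrained weights by scaled addition; the function names, the toy parameter dictionaries, and the scale `alpha` are hypothetical and are not taken from the paper.

```python
import numpy as np

def capability_vector(theta_aux, theta_sft):
    """Assumed definition: parameters fine-tuned with the auxiliary objective
    minus parameters fine-tuned without it."""
    return {k: theta_aux[k] - theta_sft[k] for k in theta_aux}

def merge(theta_pre, vec, alpha=1.0):
    """Add the scaled capability vector to the pretrained parameters."""
    return {k: theta_pre[k] + alpha * vec[k] for k in theta_pre}

# Toy checkpoints (hypothetical values)
theta_pre = {"w": np.zeros(3)}
theta_sft = {"w": np.array([1.0, 0.0, 0.0])}   # fits task actions only
theta_aux = {"w": np.array([1.0, 2.0, 0.0])}   # task actions plus auxiliary objective

vec = capability_vector(theta_aux, theta_sft)
theta_meta = merge(theta_pre, vec)
```

In this toy case the task-specific component cancels in the difference, so only the auxiliary component is carried into `theta_meta`.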

2605.10894 2026-05-12 cs.CV Updated version

Counterfactual Stress Testing for Image Classification Models

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker

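Affiliations * Department of Computing, Imperial College London, UK

AI summary: This paper studies how medical image classifiers fail under distribution shift in new clinical environments. It proposes a counterfactual stress-testing framework built on causal generative models: by intervening on attributes such as scanner type and patient sex, it generates clinically realistic "what-if" images, enabling targeted distribution-shift evaluation while keeping anatomy fixed. Experiments show the method reflects performance changes under real out-of-distribution conditions more accurately than conventional perturbation methods, giving a more reliable basis for assessing the robustness of medical AI systems.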

2605.10887 2026-05-12 cs.CV Updated version

Count Anything at Any Granularity

Chang Liu, Haoning Wu, Weidi Xie

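Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, China; CMIC, Shanghai Jiao Tong University, China

AI summary: This paper studies fine-grained counting in open-world object counting and argues that current methods are unreliable because they leave the counting granularity unspecified. The authors propose a multi-granularity counting framework that specifies the counting target explicitly through visual exemplars and fine-grained text descriptions, and build what they describe as the first automated data-augmentation pipeline, producing KubriCount, described as the largest fine-grained counting dataset to date. On this dataset they train HieraCount, which markedly improves fine-grained counting accuracy and generalization to real scenes.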

Comments Project page: https://verg-avesta.github.io/KubriCount/

2605.10885 2026-05-12 cs.CV Updated version

Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

Feifan Song, Yuntian Bo, Haofeng Zhang

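Affiliations * School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary: Cross-domain few-shot medical image segmentation (CD-FSMIS) aims to adapt a model to both novel anatomical classes and unseen imaging domains from only a few labeled samples. Existing prototype-based methods tend to entangle anatomical structure with domain-specific appearance variation, which makes matching unstable under domain shift. The paper proposes GeoProto, a framework with a geometry-aware prototype enhancement mechanism that uses geometric priors of human anatomy to make prototype matching more robust and general, and reports state-of-the-art results on several cross-modality, cross-sequence, and cross-scene datasets.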

2605.10859 2026-05-12 cs.CV cs.LG Updated version

Masked Generative Transformer Is What You Need for Image Editing

Wei Chow, Linfeng Li, Xian Sun, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu

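Affiliations * ByteDance; National University of Singapore; Duke University; Shanghai Jiao Tong University; HKUST(GZ)

AI summary: The paper proposes EditMGT, an image-editing framework based on a masked generative transformer (MGT), to address edits in diffusion models spilling into non-target regions. Through localized token prediction and multi-layer attention consolidation, EditMGT controls the edited region precisely and avoids unintended changes elsewhere. The authors also build CrispEdit-2M, an editing dataset of 2 million high-resolution images, and report state-of-the-art image-similarity results on several benchmarks, with editing 6x faster than existing methods.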

Comments CVPR 2026 HiGen Workshop; Project Page at https://weichow23.github.io/EditMGT/ GitHub at https://github.com/weichow23/EditMGT

2605.10858 2026-05-12 cs.CV cs.RO Updated version

Is Your Driving World Model an All-Around Player?

Lingdong Kong, Ao Liang, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Xian Sun, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

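AI summary: Current driving world models can generate realistic dashcam videos, yet no single model excels on every axis. The paper introduces the WorldLens benchmark, which evaluates world-model realism along multiple dimensions, including pixel quality, 4D geometry, closed-loop driving behavior, and human perception, and shows that existing models are each strong in texture, geometry, or behavioral consistency but rarely in all of them. It also builds WorldLens-26K, a dataset of 26,808 human annotations, and WorldLens-Agent, a vision-language model that evaluates generated worlds automatically, providing a unified evaluation framework closer to human perception.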

Comments CVPR 2026 VideoWorldModel Workshop; Project Page at https://worldbench.github.io/worldlens GitHub at https://github.com/worldbench/WorldLens

2605.10850 2026-05-12 cs.CV Updated version

Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

Ruinan Jin, Beidi Zhao, Myeongkyun Kang, Qiong Zhang, Xiaoxiao Li

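Affiliations * The University of British Columbia; Vector Institute; Renmin University of China

AI summary: This paper maps the reliability boundary of self-verification in medical visual question answering (VQA) and argues that the common practice of re-invoking the same vision-language model (VLM) as its own verifier is fundamentally unreliable. The authors propose a diagnostic framework that decomposes verifier behavior into discriminative ability and agreement bias, and show that capability coupling between verifier and generator produces a "verification mirage": a regime in which wrong answers are wrongly accepted and the verifier's error rate and agreement bias rise together. Experiments show that verification provides no independent safety guarantee and that wrong answers can become entrenched by faulty verification over multi-turn interaction, pointing to serious risks for clinical use.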

Comments 31 pages, 12 figures

2605.10845 2026-05-12 cs.CV cs.CL Updated version

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang

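Affiliations * School of Computer Engineering and Science, Shanghai University, Shanghai, China; Funstory.ai Limited, Hong Kong SAR, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

AI summary: As cross-lingual communication grows, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation methods trade off linguistic processing against layout preservation. BabelDOC introduces an intermediate representation that decouples visual layout from semantic content, enabling document-level operations such as terminology extraction and cross-page context handling, and an adaptive typesetting engine re-anchors the translated content to the original layout. Experiments show BabelDOC outperforms existing methods on layout fidelity, visual aesthetics, and terminology consistency while maintaining high translation accuracy.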

Comments ACL 2026 System Demonstration paper. 2 figures

2605.10835 2026-05-12 cs.CV cs.LG Updated version

Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

Daniel Dratschuk, Paul Swoboda

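Affiliations * Heinrich Heine University Düsseldorf

AI summary: Optical music recognition (OMR) is bottlenecked by the lack of large-scale datasets of real scans, and existing methods mostly rely on few-shot transfer or oversimplified synthetic training. The paper presents Transcoda, which combines improved synthetic data generation, normalization of the **kern encoding, and grammar-based decoding to resolve the non-uniqueness of textual score encodings. A compact 59-million-parameter model trains in only 6 hours on a single GPU and delivers clear gains over existing methods on both a synthetic score dataset and a historical Polish score dataset.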

Comments 13 pages, 7 figures

2605.10833 2026-05-12 cs.CV cs.AI Updated version

MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection

Xiran Zhao, Jing Jin, Yan Bai, Zhongan Wang, Yifeng Sun, Yihang Lou, Xuanyu Zhu, Tao Feng, Yingna Wu

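Affiliations * ShanghaiTech University; Tsinghua University; Meituan Inc.; Peking University

AI summary: The paper presents MMVIAD, described as the first multi-view continuous-video dataset for industrial anomaly detection, covering multiple object categories, environments, and anomaly types and supporting evaluation on several tasks. To improve fine-grained defect recognition and temporal localization, the authors design a two-stage post-training pipeline that substantially improves performance over existing mainstream models. The work offers a new benchmark and method for industrial video understanding and anomaly detection.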

2605.10806 2026-05-12 cs.CV cs.AI cs.LG Updated version

PhyGround: Benchmarking Physical Reasoning in Generative World Models

Juyi Lin, Arash Akbari, Yumei He, Lin Zhao, Haichao Zhang, Arman Akbari, Xingchen Xu, Zoe Y. Lu, Enfu Nan, Hokin Deng, Edmund Yeh, Sarah Ostadabbas, Yun Fu, Jennifer Dy, Pu Zhao, Yanzhi Wang

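Affiliations * Northeastern University; Tulane University; University of Washington; Carnegie Mellon University

AI summary: PhyGround is a new benchmark for evaluating the physical reasoning of generative world models, addressing the difficulty of assessing how well video generation models obey physical laws. It contains 250 carefully designed prompts, each paired with an expected physical outcome, organized under a taxonomy of 13 categories of physical laws. Using large-scale, quality-controlled human annotation and a dedicated physical-reasoning vision-language model, PhyJudge-9B, PhyGround supports fine-grained, reproducible evaluation of the physical plausibility of generated videos, improving the accuracy and reliability of evaluation.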

Comments Preprint. 56 pages, 39 figures, 40 tables. Project page: https://phyground.github.io/

2605.10789 2026-05-12 cs.CV Updated version

Rapid Forest Fuel Load Estimation via Virtual Remote Sensing and Metric-Scale Feed-Forward 3D Reconstruction

Quanyun Wu, Kyle Gao, Wentao Sun, Zhengsen Xu, Hudson Sun, Linlin Xu, Yuhao Chen, David A. Clausi, Jonathan Li

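Affiliations * University of Waterloo; University of Calgary

AI summary: The paper proposes a rapid forest fuel load estimation method based on virtual remote sensing data and metric-scale feed-forward 3D reconstruction, aiming to avoid the cost and time of traditional methods. It uses Google Earth Studio to generate low-altitude orbit imagery and camera poses, applies an improved Pi-Long model for dense 3D reconstruction, and resolves the scale ambiguity of monocular reconstruction with a metric recovery module. The result is a pair of bird's-eye-view height and density maps, from which tree species classification, leaf area index, and fuel load are derived. Experiments show that the method preserves geometric consistency while providing an efficient, low-cost way to estimate forest biomass.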

Comments Accepted for publication at IEEE IGARSS 2026
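
The bird's-eye-view height and density maps mentioned in this entry can be illustrated with a small rasterization sketch. It assumes the reconstruction is available as an (N, 3) array of metric points; the function name `bev_maps`, the cell size, and the toy points are illustrative assumptions and not the authors' pipeline.

```python
import numpy as np

def bev_maps(points, cell=1.0):
    """Rasterize an (N, 3) point cloud (x, y, z in metres) into
    bird's-eye-view max-height and point-count grids."""
    origin = points[:, :2].min(axis=0)
    ij = np.floor((points[:, :2] - origin) / cell).astype(int)
    w, h = ij.max(axis=0) + 1
    height = np.full((w, h), -np.inf)
    density = np.zeros((w, h), dtype=int)
    np.maximum.at(height, (ij[:, 0], ij[:, 1]), points[:, 2])  # tallest point per cell
    np.add.at(density, (ij[:, 0], ij[:, 1]), 1)                # points per cell
    height[density == 0] = np.nan                              # mark empty cells
    return height, density

# Three hypothetical points: two fall in the first cell, one in the second
pts = np.array([[0.2, 0.1, 3.0], [0.7, 0.4, 5.0], [1.5, 0.2, 1.0]])
height, density = bev_maps(pts, cell=1.0)
```

`np.maximum.at` and `np.add.at` are used because plain fancy-index assignment would not accumulate correctly when several points land in the same cell.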

2605.10772 2026-05-12 cs.CV cs.AI eess.IV Updated version

Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias

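Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group

AI summary: The paper applies large language-vision models (LLVMs) to target recognition in synthetic aperture radar (SAR) imagery, in particular automatic target recognition (ATR) of military vehicles. The authors build a training and evaluation benchmark from the MSTAR Public Dataset, add descriptive captions and question-answer pairs, and study LLVM performance on remote sensing image captioning and visual question answering (VQA). With parameter-efficient fine-tuning, the model identifies fine-grained target attributes at 98% accuracy, offering a new technical route for machine-assisted remote sensing ATR in military and intelligence settings.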

Comments Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV

Journal ref Proc. SPIE 13463, Automatic Target Recognition XXXV, 134630D (29 May 2025);


DOI: 10.1117/12.3053859

Abstract

Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.


2605.10769 2026-05-12 cs.CV cs.AI Updated version

MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

Ziyi Wang, Xianping Ma, Ziyao Wang, Hongyang Zhang, Man On Pun

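Affiliations * The Chinese University of Hong Kong (Shenzhen); Southwest Jiaotong University

AI summary: The paper proposes MPerS, a remote sensing scene segmentation method guided by dynamic mixture-of-experts perception from multimodal large language models (MLLMs), to improve semantic segmentation of remote sensing images. It designs several prompts that lead the language model to produce high-quality descriptions of the remote sensing scene, extracts dense visual features of land cover with DINOv3, and uses a dynamic mixture-of-experts module to adaptively fuse the most useful textual semantics for more precise segmentation. Experiments show strong performance on three public remote sensing semantic segmentation datasets.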

Comments Accepted to CVPR 2026 Findings. 11 pages, 6 figures

2605.10765 2026-05-12 cs.CV cs.AI cs.LG Updated version

Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Tao Hu, Da-Wei Zhou

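Affiliations * School of Artificial Intelligence, Nanjing University; State Key Laboratory for Novel Software Technology, Nanjing University

AI summary: Multimodal large language models (MLLMs) perform well after instruction tuning, but in practice they often need to extend their abilities over a sequence of tasks without catastrophic forgetting. Existing methods mostly follow a module-composition paradigm and struggle with variation in image scenes, question intent, and reasoning demands within a single task. The paper proposes DRAPE, a dynamic cross-modal prompt generation framework that derives prompt queries from the text instruction and cross-attends them to visual features, producing a personalized soft prompt for each query-image pair and thus finer, instance-level adaptation. Experiments show state-of-the-art results on multimodal continual instruction tuning benchmarks.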

2605.10762 2026-05-12 cs.CV cs.AI Updated version

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

Mohamed Eltahir, Lama Ayash, Ali Habibullah, Tanveer Hussain, Naeemullah Khan

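Affiliations * King Abdullah University of Science and Technology (KAUST); Department of Computer Science, Edge Hill University

AI summary: In long-video understanding, vision-language models (VLMs) are bottlenecked by the quadratic attention cost of processing thousands of frames. The paper proposes GridProbe, an efficient training-free posterior-probing inference framework that uses the frozen VLM's own reasoning to score evidence in answer space and adaptively selects question-relevant frames, cutting compute with little or no accuracy loss. GridProbe arranges frames on a K×K grid and runs lightweight row and column probes to produce an interpretable importance map, which drives shape-adaptive frame selection and improves the efficiency of long-video understanding.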


Abstract

Long-video understanding in VLMs is bottlenecked by a single monolithic forward pass over thousands of frames at quadratic attention cost. A common mitigation is to first select a small subset of informative frames before the forward pass; common for training-free selectors via auxiliary encoder-space similarities. Such signals are capped by contrastive pretraining, which usually fails on reasoning-heavy queries (negation, cross-frame counting, holistic summarization). We propose GridProbe, an efficient training-free posterior-probing inference paradigm that scores evidence in answer space using a frozen VLM's own reasoning and then selects question-relevant frames adaptively, resulting in sub-quadratic attention cost with little to no accuracy loss. We arrange frames on a $K{\times}K$ grid and run lightweight row R and column C probes, where each probe reads its peak posterior as a query-conditioned confidence. The outer product of R and C yields an interpretable importance map whose skewness and kurtosis drive Shape-Adaptive Selection, a closed-form rule that reliably replaces the fixed frame budget $M$ with a per-question $M_{\mathrm{eff}}$. We show empirically that $M_{\mathrm{eff}}$ tracks intrinsic question difficulty without ever seeing the answer, a sign of test-time adaptive compute. On Video-MME-v2, GridProbe matches the monolithic baseline within $1.6$ pp Avg Acc at $3.36\times$ TFLOPs reduction, while on LongVideoBench it Pareto-dominates the baseline ($+0.9$ pp at $0.35\times$ compute). Because the selector and QA models can be decoupled, pairing a small 2B selector with a stronger 4B or 8B QA is strictly Pareto-dominant over the 2B monolithic baseline (up to $+4.0$ pp at $0.52\times$ compute, on average), with no retraining. Finally, the interpretability of the importance maps opens future avenues for behavioral diagnostics, grounding, and frame-selection distillation.
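
The grid construction in the abstract can be sketched as follows. The sketch builds the importance map as the outer product of row and column probe confidences, computes its skewness and excess kurtosis, and picks the top-scoring cells for a given budget. The probe values are hypothetical, and the paper's closed-form rule that maps these statistics to a per-question budget is not reproduced here; `m` is simply passed in.

```python
import numpy as np

def importance_map(row_conf, col_conf):
    """Outer product of row and column probe confidences -> K x K map."""
    return np.outer(row_conf, col_conf)

def shape_stats(imp):
    """Skewness and excess kurtosis of the flattened importance map."""
    x = imp.ravel()
    z = (x - x.mean()) / x.std()
    return float((z**3).mean()), float((z**4).mean() - 3.0)

def top_frames(imp, m):
    """Row-major flat indices of the m highest-scoring grid cells."""
    return np.sort(np.argsort(imp.ravel())[::-1][:m])

R = np.array([0.1, 0.9, 0.2])   # hypothetical row-probe confidences
C = np.array([0.2, 0.1, 0.8])   # hypothetical column-probe confidences
imp = importance_map(R, C)
skew, kurt = shape_stats(imp)
picked = top_frames(imp, m=2)   # frames at grid cells (1, 0) and (1, 2)
```

A strongly right-skewed map, as in this toy case, means a few cells carry most of the evidence, which is the kind of signal a shape-adaptive rule could use to shrink the frame budget.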


2605.10761 2026-05-12 cs.CV Updated version

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

Wenxuan Li, Pedro R. A. S. Bassi, Xinze Zhou, Jakob Wasserthal, Alan L. Yuille, Zongwei Zhou

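Affiliations * Department of Computer Science, Johns Hopkins University; Clinic of Radiology and Nuclear Medicine, University Hospital Basel; Department of Oncology, Johns Hopkins School of Medicine

AI summary: RadThinking is a visual question answering (VQA) dataset for longitudinal clinical reasoning in radiology, intended to make diagnostic reasoning in cancer screening explicit and trainable. It contains question-answer pairs at several difficulty levels, from basic perception questions to compositional questions that require multi-step reasoning, and provides a reasoning chain for each compositional question that follows clinical reporting standards. RadThinking covers CT scans from a large number of patients and offers a resource for systematically training and evaluating reasoning in AI systems.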

2605.10756 2026-05-12 cs.CV Updated version

TINS: Test-time ID-prototype-separated Negative Semantics Learning for OOD Detection

Yifeng Yang, Jubo Feng, Jing Xu, Xinbing Wang, Qinying Gu, Nanyang Ye

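Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; University of Electronic Science and Technology of China

AI summary: The study proposes TINS, a test-time ID-prototype-separated negative semantics learning method that improves out-of-distribution (OOD) detection with vision-language models. Existing methods rely on static negative labels and adapt poorly to diverse, changing OOD concepts. TINS learns sample-specific negative semantic embeddings through image-to-text modality inversion and adds an ID-prototype-separation regularizer to avoid confusion with in-distribution (ID) semantics. Experiments show TINS outperforms existing methods on several benchmarks; on the Four-OOD benchmark it lowers the average FPR95 from 14.04% to 6.72%.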
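
For readers unfamiliar with the FPR95 figure quoted above, here is a minimal sketch of how the metric is commonly computed. It assumes the usual convention that ID samples are the positive class and that a higher score means "more ID-like"; the scores are made up for illustration and are unrelated to the paper's results.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """False-positive rate on OOD data at the threshold that keeps
    95% of ID samples (higher score = more ID-like)."""
    thresh = np.percentile(id_scores, 5)          # 95% of ID scores lie at or above this
    return float(np.mean(ood_scores >= thresh))   # OOD samples wrongly accepted as ID

id_scores = np.linspace(0.5, 1.0, 101)            # hypothetical ID scores
ood_scores = np.array([0.1, 0.2, 0.3, 0.6])       # hypothetical OOD scores
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

Lower is better: a drop in FPR95 means fewer OOD inputs slip through at the operating point where 95% of ID inputs are still accepted.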

2605.10744 2026-05-12 cs.CV cs.RO Updated version

C-CoT: Counterfactual Chain-of-Thought with Vision-Language Models for Safe Autonomous Driving

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

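Affiliations * College of Transportation, Tongji University; Department of Civil Engineering, Tsinghua University; School of Vehicle and Mobility, Tsinghua University; Department of Civil and Environmental Engineering, National University of Singapore

AI summary: The paper proposes C-CoT, a counterfactual reasoning framework built on vision-language models, to improve autonomous driving decisions in safety-critical scenarios such as complex urban intersections. It decomposes the driving decision into five stages and introduces a structured meta-action evaluation tree that, during the counterfactual reasoning stage, explicitly assesses the potential consequences of different action combinations. This links actions causally to safety outcomes and makes the model more robust in rare and out-of-distribution scenarios. Experiments show the method outperforms existing approaches on metrics such as risk prediction and collision rate, improving both safety and interpretability.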

2605.10739 2026-05-12 eess.IV cs.AI cs.CV Updated version

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

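Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group Inc; Intelligence Advanced Research Projects Activity

AI summary: The paper presents SMART-HC-VQA, a multimodal visual question answering dataset built on Sentinel-2 satellite imagery for analyzing how human activity evolves in space and time. Construction annotations, type labels, temporal phase labels, and related information are converted into natural-language question-answer pairs, forming a temporally extended automatic target recognition and VQA challenge. The study also introduces a multi-image large language model training framework that processes multi-temporal remote sensing imagery and reasons over it semantically, providing a reproducible basis for language-guided understanding of remote sensing activity.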

Comments Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

2605.10732 2026-05-12 cs.CV cs.AI Updated version

iPay: Integrated Payment Action Recognition via Multimodal Networks and Adaptive Spatial Prior Learning

Kaicong Huang, Weiheng Oh, Thomas Guggisberg, Ruimin Ke

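Affiliations * Rensselaer Polytechnic Institute; Capital District Transportation Authority

AI summary: The paper proposes iPay, an integrated payment action recognition framework for on-board public transit surveillance systems. It combines RGB images and skeleton data in a multimodal mixture-of-experts architecture that captures local detail and global motion separately, and adds a dual-attention fusion mechanism and a spatial-difference discriminator to improve recognition of payment actions. On real surveillance data iPay reaches 83.45% recognition accuracy with high computational efficiency, making it suitable for edge deployment.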

2605.10730 2026-05-12 cs.CV Updated version

Qwen-Image-2.0 Technical Report

Bing Zhao, Chenfei Wu, Deqing Li, Hao Meng, Jiahao Li, Jie Zhang, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kuan Cao, Kun Yan, Liang Peng, Lihan Jiang, Niantong Li, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Xihua Wang, Yan Shu, Yanran Zhang, Yi Wang, Yilei Chen, Ying Ba, Yixian Xu, Yujia Wu, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, An Yang, Chen Cheng, Chenxu Lv, Dayiheng Liu, Fan Zhou, Hantian Xiong, Hongzhu Shi, Hu Wei, Huihong Zhao, Ivy Liu, Jianwei Zhang, Jiawei Zhang, Kai Chen, Kang He, Levon Xue, Lin Qu, Linhan Tang, Luwen Feng, Minggang Wu, Minmin Sun, Na Ni, Rui Men, Shuai Bai, Sishou Zheng, Tao Lan, Tianqi Zhang, Tingkun Wen, Wei Wang, Weixu Qiao, Weiyi Lu, Wenmeng Zhou, Xiaodong Deng, Xiaoxiao Xu, Xinlei Fang, Xionghui Chen, Yanan Wang, Yang Fan, Yichang Zhang, Yixuan Xu, Yu Wu, Zhiyuan Ma, Zhizhi Cai

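Affiliations * Qwen Team (Tongyi Lab)

AI summary: The paper introduces Qwen-Image-2.0, an all-round image generation foundation model that unifies high-fidelity image generation and precise image editing. By pairing Qwen3-VL as the condition encoder with a multimodal diffusion transformer, it tackles ultra-long text rendering, multilingual typography, and high-resolution photorealistic generation. Supported by large-scale training data and a customized multi-stage training pipeline, it achieves strong multimodal understanding together with flexible generation and editing. Experiments show Qwen-Image-2.0 clearly outperforms the previous version on generation and editing tasks, a step toward more general, reliable, and practical image generation models.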

2605.10723 2026-05-12 cs.CV cs.AI cs.LG cs.MA Updated version

AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

Huimin Wang, Leilei Ouyang, Chang Xia, Yongqi Kang, Yu Fu, Yuqi Ouyang

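Affiliations * College of Computer Science, Sichuan University

AI summary: AllocMV is a hierarchical framework for music video generation that addresses the high compute cost of long-horizon video generation and the difficulty of keeping shots consistent. It formulates video synthesis as a multiple-choice knapsack problem, allocates resources through a structured persistent state object, and uses a dynamic-programming solver for efficient scheduling. Experiments show that under strict budget and rhythm constraints AllocMV achieves an optimal balance between generation quality and resource consumption.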
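
The multiple-choice knapsack formulation named in this entry can be sketched with a textbook dynamic program. This is a generic solver, not AllocMV's implementation: it assumes each shot offers a list of (integer cost, quality) options, exactly one option is chosen per shot, and total cost must stay within the budget. The shot options below are hypothetical.

```python
def solve_mckp(groups, budget):
    """groups: list of option lists [(cost, quality), ...]; choose exactly one
    option per group with total cost <= budget, maximizing total quality.
    Costs must be non-negative integers. Returns (quality, chosen_indices)
    or (None, None) if no feasible plan exists."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * budget      # best[b]: max quality at total cost exactly b
    choices = []
    for options in groups:
        new = [NEG] * (budget + 1)
        pick = [None] * (budget + 1)   # pick[b]: (option index, previous cost)
        for b in range(budget + 1):
            if best[b] == NEG:
                continue
            for i, (cost, quality) in enumerate(options):
                nb = b + cost
                if nb <= budget and best[b] + quality > new[nb]:
                    new[nb] = best[b] + quality
                    pick[nb] = (i, b)
        best = new
        choices.append(pick)
    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == NEG:
        return None, None
    chosen, b = [], end
    for pick in reversed(choices):     # backtrack from the last group to the first
        i, b = pick[b]
        chosen.append(i)
    return best[end], chosen[::-1]

shots = [
    [(1, 1.0), (3, 4.0)],   # shot 0: cheap vs. expensive rendering (hypothetical)
    [(1, 2.0), (2, 3.0)],   # shot 1
]
quality, plan = solve_mckp(shots, budget=4)
```

With a budget of 4, the solver spends 3 units on the expensive option for shot 0 and 1 unit on the cheap option for shot 1, since taking both expensive options would cost 5. The run time is proportional to the budget times the total number of options.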

2605.10717 2026-05-12 cs.LG cs.CV Updated version

Heteroscedastic Diffusion for Multi-Agent Trajectory Modeling

Guillem Capellera, Antonio Rubio, Luis Ferraz, Antonio Agudo

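Affiliations * Institut de Robòtica i Informàtica Industrial, CSIC-UPC

AI summary: The paper proposes U2Diffine, a heteroscedastic diffusion model for multi-agent trajectory modeling that also provides per-state uncertainty estimates, addressing weaknesses of earlier methods in trajectory completion and uncertainty quantification. It adds the negative log-likelihood of the predicted noise to the denoising loss and uses a first-order Taylor expansion to propagate latent-space uncertainty to the real state space, unifying trajectory completion and uncertainty estimation. The authors also present U2Diff, a more efficient baseline, and combine it with a ranking neural network for post-processing, which markedly improves inference speed and prediction reliability and outperforms existing methods on several sports datasets.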

Comments Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Extended version of arXiv:2503.18589 (CVPR 2025)
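
The first-order Taylor propagation mentioned in this entry corresponds to the standard delta method, sketched below. The sketch assumes independent latent variances and a generic differentiable map `f`; the toy `decode` function and the numerical Jacobian are illustrative stand-ins, not the model's actual decoder.

```python
import numpy as np

def propagate_variance(f, z, var_z, eps=1e-5):
    """Delta method: Cov[f(z)] ~= J diag(var_z) J^T, with the Jacobian J
    of f at z estimated by central finite differences."""
    z = np.asarray(z, dtype=float)
    y0 = np.asarray(f(z))
    J = np.zeros((y0.size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z)
        dz[j] = eps
        J[:, j] = (np.asarray(f(z + dz)) - np.asarray(f(z - dz))) / (2 * eps)
    return J @ np.diag(var_z) @ J.T

def decode(z):
    """Hypothetical map from a 2-D latent to a 2-D state."""
    return np.array([2.0 * z[0] + z[1], z[1] ** 2])

cov = propagate_variance(decode, z=[1.0, 3.0], var_z=[0.1, 0.2])
```

The approximation is exact for linear maps and degrades as the map becomes more curved relative to the size of the latent variance.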

2605.10715 2026-05-12 cs.CV Updated version

UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting

Zhenyu Liang, Jack C. P. Cheng

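AI summary: The paper proposes a UAV-based scan-to-simulation framework to make landslide monitoring and simulation more realistic and accurate. It combines physics-informed 3D Gaussian Splatting (3DGS) with the material point method (MPM), covering the full pipeline from UAV-captured imagery to a physically grounded landslide simulation. Validation at a real landslide site in Hong Kong shows benefits in both visual reconstruction and physical simulation, offering a more effective tool for disaster prevention and public education.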

2605.10705 2026-05-12 cs.CV Updated version

TransmissiveGS: Residual-Guided Disentangled Gaussian Splatting for Transmissive Scene Reconstruction and Rendering

Zhenyu Liang, Xiao Zhang, Tianchao Li, Jack C. P. Cheng, Chi-Keung Tang

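Affiliations * HKUST

AI summary: The paper proposes TransmissiveGS, a framework for the challenging problem of reconstructing and rendering transmissive scenes. It uses a dual-Gaussian representation and a deferred shading function to reconstruct reflection and transmission as disentangled components, exploits multi-view inconsistency and residual information to separate surface geometry from lighting properties, and introduces a reflection light field to improve near-field reflection estimation. Experiments on synthetic and real scenes show it outperforms existing Gaussian Splatting techniques, markedly improving reconstruction and rendering quality for transmissive scenes.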

2605.10676 2026-05-12 cs.CV cs.LG Updated version

Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium

Qingxin Xiao, Peilin Zhao, Yangyang Zhao, Lingwei Dang, Qingyao Wu

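Affiliations * South China University of Technology; Institute for Super Robotics (Huangpu); Shanghai Jiao Tong University; Changsha University of Science and Technology

AI summary: During decoding in multimodal language models, attention often concentrates abnormally on image regions unrelated to the task. Existing methods usually treat these regions as noise and forcibly redirect attention. This paper argues that the regions actually carry important visual and narrative logic, and that forced redirection worsens the imbalance between vision and language. The authors propose Adversarial Counter-Commonsense Equilibrium (ACE), a training-free framework that introduces counter-commonsense image perturbation patches to adjust the attention distribution dynamically during decoding. Without any additional training, it suppresses false content and restores the vision-language balance; experiments show it markedly improves model trustworthiness at almost no extra inference cost.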

2605.10675 2026-05-12 cs.CV Updated version

Neuromorphic Monocular Depth Estimation with Uncertainty Modeling

Viktor Bergkvist, Felix Rydell, Per-Erik Forssén, David Gustafsson, Johan Rideg

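Affiliations * Swedish Defence Research Agency; Linköping University

AI summary: The paper studies monocular depth estimation with event cameras and proposes a neuromorphic depth estimation method with uncertainty modeling. Using Gaussian, log-normal, and evidential learning formulations, the model predicts a depth distribution for each pixel and estimates its uncertainty. The experiments compare six event representations, with a U-Net trained on synthetic data and fine-tuned on real sequences. The results show that uncertainty modeling improves the reliability of depth estimates and performs well across several metrics.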
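
To make the per-pixel likelihoods in this entry concrete, the sketch below writes out the Gaussian and log-normal negative log-likelihoods that such a network could minimize. It assumes the network outputs a location `mu` and a scale `sigma` for every pixel; the function names are illustrative, and the evidential variant is not covered.

```python
import numpy as np

def gaussian_nll(depth, mu, sigma):
    """Per-pixel negative log-likelihood under N(mu, sigma^2)."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + (depth - mu) ** 2 / (2 * sigma**2)

def lognormal_nll(depth, mu, sigma):
    """Per-pixel NLL when log(depth) ~ N(mu, sigma^2); depth must be positive."""
    return np.log(depth) + gaussian_nll(np.log(depth), mu, sigma)

# A confident wrong prediction is penalized more than an uncertain one
confident = gaussian_nll(5.0, 0.0, 1.0)
uncertain = gaussian_nll(5.0, 0.0, 3.0)
```

Because the loss trades the squared error against `log(sigma)`, the network is pushed to report a large `sigma` exactly where its depth estimate tends to be wrong, which is what makes the predicted uncertainty informative.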

2605.10661 2026-05-12 cs.CV cs.AI Updated version

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta

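Affiliations * Samsung AI Center; Institute of Fundamental Technological Research, Polish Academy of Sciences

AI summary: The paper asks whether a Vision Transformer (ViT) can replace the usual stack of independently parameterized layers with single-block recurrence. It proposes bViT, which processes an image by applying one Transformer block repeatedly, keeping the depth of computation while greatly reducing the parameter count. Under the same training setup and compute budget, bViT matches a standard ViT on ImageNet-1K with roughly an order of magnitude fewer parameters, showing that recurrence is effective and promising for vision tasks.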

Comments 31 pages, 16 figures
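
A back-of-the-envelope sketch shows why reusing one block cuts the parameter count by roughly the depth of the network. The width 768 and depth 12 are assumed ViT-Base-like values chosen for illustration, the count ignores biases, norms, and embeddings, and the toy residual layer merely stands in for a real Transformer block.

```python
import numpy as np

def block_params(d, mlp_ratio=4):
    """Approximate weights in one Transformer block (biases and norms ignored):
    4*d*d for the attention projections plus 2*mlp_ratio*d*d for the MLP."""
    return 4 * d * d + 2 * mlp_ratio * d * d

def recurrent_forward(x, w, steps):
    """Reuse one weight matrix at every step; a toy residual layer stands in
    for the real Transformer block."""
    for _ in range(steps):
        x = x + np.tanh(x @ w)
    return x

d, depth = 768, 12                    # assumed ViT-Base-like width and depth
stacked = depth * block_params(d)     # independent weights for every layer
shared = block_params(d)              # one block applied `depth` times
tokens = recurrent_forward(np.ones((4, 8)), np.eye(8) * 0.1, steps=depth)
```

The amount of computation per forward pass is unchanged, since the block still runs `depth` times; only the number of stored weights shrinks.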

2605.10645 2026-05-12 cs.CV Updated version

GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks

Hantao Zhang, Weidong Guo, Yuhe Liu, Jiancheng Yang, Sathvik Bhagavan, Danli Shi, Mingda Xu, Pascal Fua

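Affiliations * CVLab, École Polytechnique Fédérale de Lausanne (EPFL); Fudan University; Beihang University; ELLIS Institute Finland; Aalto University; The Hong Kong Polytechnic University

AI summary: The paper proposes GenMed, a generative framework for medical diagnosis that models the joint distribution $P(X,Y)$ of inputs and outputs and recasts diagnostic tasks as output optimization at inference time. Built on diffusion models, it supports flexible gradient guidance for diverse input conditions without changing the architecture or retraining, handling medical image segmentation in demanding cross-modality, few-shot, and zero-shot settings. Experiments show GenMed performs well across a range of medical imaging tasks, and the authors release a large-scale text-shape dataset to support further research.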

2605.10641 2026-05-12 cs.CV cs.AI Updated version

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

Nikolaos Gkalelis, Vasileios Mezaris

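Affiliations * CERTH-ITI

AI summary: The paper proposes LLaVA-CKD, a bottom-up cascaded knowledge distillation framework that targets the heavy compute and memory demands vision-language models (VLMs) face in deployment. It introduces teacher models of intermediate capacity that guide the student step by step, easing the loss of transfer quality that occurs in conventional distillation when the capacity gap between teacher and student is too large. Experiments show state-of-the-art results on several standard visual question answering benchmarks.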

Comments Under review
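
As background for the distillation setup in this entry, the sketch below implements the classic temperature-softened distillation loss between a teacher and a student. This is the standard formulation, used here as a stand-in: the loss LLaVA-CKD actually uses, and the order in which its intermediate teachers are applied, are not specified in this listing. The logits are made up.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Numerically stable softmax over the last axis at temperature T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 and averaged over the batch."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

teacher = np.array([[4.0, 1.0, 0.0]])   # hypothetical logits
student = np.array([[1.0, 1.0, 1.0]])
gap = kd_loss(student, teacher)
```

In a cascade, the same loss would be applied once per teacher-student pair along the chain, so that each step bridges a smaller capacity gap than a single direct distillation would.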

2605.10629 2026-05-12 cs.CV Updated version

Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction

Laurenz Nagler, Martin Zach, Thomas Pock

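Affiliations * Graz University of Technology; École Polytechnique Fédérale de Lausanne; Biomedical Imaging Group and Center for Biomedical Imaging

AI summary: The paper proposes a joint nonlinear MRI reconstruction method based on product-of-Gaussian-mixture diffusion models, addressing the complex network architectures, opaque time-conditioning mechanisms, and offline coil sensitivity estimation of existing methods. It uses a parameter-efficient Gaussian-mixture diffusion model as the image prior together with a classical smoothness prior on coil sensitivities, reconstructing the image and the coil sensitivities jointly. The method maintains reconstruction quality while being more robust to changes in contrast and anatomy distribution and to different k-space trajectories.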

2605.10628 2026-05-12 cs.CV Updated version

Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection

Guohuan Xie, Xin He, Dingying Fan, Siqi Li, Yun Liu

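Affiliations * Nankai University; Tianjin University of Technology; Tsinghua University

AI summary: The paper proposes HyperFSAD, a few-shot anomaly detection framework that needs no training and no language prompts and is robust across domains, removing the reliance of existing methods on task-specific training, language supervision, and domain adaptation. Built on DINOv3 and hypergraph reasoning, it uses sparse hyper-matching and a dual-branch image scoring strategy to represent normal samples compactly and localize anomalous regions precisely. On six datasets spanning industrial and medical scenarios, HyperFSAD achieves state-of-the-art detection performance under the strict training-free, language-free setting.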

2605.10622 2026-05-12 cs.MM cs.CV Updated version

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du, Wenya Wang, Jing Li

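Affiliations * Harbin Institute of Technology (Shenzhen); Huawei Technologies Co., Ltd.; The Hong Kong Polytechnic University; Nanyang Technological University

AI summary: Large vision-language models (LVLMs) perform well on multimodal tasks, but their reliability is undermined by hallucination, the generation of text that contradicts the visual input. The paper identifies "vocabulary hijacking": certain visual tokens, called inert tokens, attract abnormal amounts of attention and consistently decode in vocabulary space to irrelevant words (hijacking anchors), causing semantic collapse. Building on this, the authors propose HAVAE, a training-free intervention that strengthens the focus of critical attention heads on visual content, mitigating hallucination while preserving overall model performance.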

Comments Accepted by ACL 2026 Main

2605.10616 2026-05-12 cs.LG cs.CL cs.CV Updated version

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, Roi Reichart

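Affiliations * Technion – Israel Institute of Technology; Prior Labs; NVIDIA; SODA Team, INRIA Saclay, Palaiseau; University of Freiburg; Probabl; ELLIS Institute Tübingen

AI summary: The paper presents MulTaBench, a multimodal tabular learning benchmark of 40 datasets covering image-tabular and text-tabular tasks, designed to evaluate models that combine structured data with unstructured modalities such as text and images. It finds that task-specific embedding tuning improves performance significantly, whereas existing benchmarks often overlook task relevance, which leads to large variance in results. By stressing the complementary information across modalities, MulTaBench advances target-aware representation learning and points to new directions for multimodal tabular foundation models.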

2605.10588 2026-05-12 cs.CV Updated version

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

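Affiliations * Joy Future Academy

AI summary: Current large multimodal models (LMMs) perform poorly on spatial reasoning tasks that require viewpoint-dependent understanding, mainly because they observe a single static view. The study proposes a new paradigm, Thinking with Novel Views (TwNV), which brings generated novel-view images into the reasoning process to improve the model's grasp of spatial relations. Experiments show TwNV yields clear gains across several spatial subtasks and LMM architectures, confirming that novel-view generation strengthens spatial intelligence.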

Comments Submitted to NeurIPS 2026

2605.10586 2026-05-12 cs.CV Updated version

CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations

Nengbo Lu, Minghua Pan

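Affiliations * Guilin University of Electronic Technology

AI summary: The paper proposes CausalGS, a framework that learns the physical causality of complex 3D dynamic scenes from multi-view video alone, without explicit priors. Its core is an inverse physical reasoning module that jointly infers the scene's initial velocity field and intrinsic material properties, factoring the dynamics into these two components, and learns under physical regularization from a differentiable physics simulator. Experiments show CausalGS outperforms existing methods on long-horizon future-frame extrapolation and novel-view interpolation, demonstrating that it can learn physical property interactions and causal relations from visual observations.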

Comments ICMR2026 Accepted

2605.10576 2026-05-12 cs.CV cs.AI Updated version

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang, Wei He

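Affiliations * Wuhan University; Shanghai Artificial Intelligence Laboratory

AI summary: The paper presents SenseBench, described as the first benchmark dedicated to evaluating low-level visual perception and description of remote sensing imagery in large vision-language models. Because current image quality assessment methods cannot accurately describe remote sensing degradations, the authors build more than 10,000 carefully annotated samples covering 22 fine-grained degradation types in 6 major categories and design two evaluation protocols, perception and description. The results expose key weaknesses of existing models in remote sensing, including domain bias and confusion under multiple degradations, and support further work on low-level perception models for remote sensing.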

2605.10571 2026-05-12 eess.IV cs.CV Updated version

Set-Based Groupwise Registration for Variable-Length, Variable-Contrast Cardiac MRI

Yi Zhang, Yidong Zhao, Tijmen Toxopeus, Maša Božić-Iven, Sebastian Weingärtner, Qian Tao

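Affiliations * Department of Imaging Physics, Delft University of Technology, The Netherlands

AI summary: For cardiac MRI sequences of variable length and varying contrast, the study proposes AnyTwoReg, a set-based groupwise registration method that addresses the poor cross-protocol generalization of existing deep-learning approaches. It treats an MRI sequence as an unordered set, decoupling network design from sequence length and input order, and uses a shared encoder with correlation-guided feature aggregation to build a permutation-invariant canonical reference, learning a permutation-equivariant mapping from images to deformation fields. Experiments show strong zero-shot generalization to unseen quantitative MRI datasets and improved quality of downstream quantitative mapping.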

Comments MICCAI 2026. Submitted Version


Abstract

Quantitative cardiac magnetic resonance imaging (MRI) enables non-invasive myocardial tissue characterization but relies on robust motion correction within these variable-length, variable-contrast image sequences. Groupwise registration, which simultaneously aligns all images, has shown greater robustness than pairwise registration for motion correction. However, current deep-learning-based groupwise registration methods cannot generalize across MRI sequences: the architecture typically encodes input data as a fixed-length channel stack, which rigidly couples network design to protocol-specific sequence length, input ordering, and contrast dynamics. At inference time, any change in imaging protocols will render the network unusable. In this work, we introduce AnyTwoReg, a new set-based groupwise registration framework that takes a quantitative MRI sequence as an unordered set. This set formulation fundamentally decouples network design from sequence length and input ordering. By utilizing a shared encoder and correlation-guided feature aggregation, AnyTwoReg constructs a permutation-invariant canonical reference for registration, and learns a permutation-equivariant mapping from images to deformation fields. Additionally, we extract contrast-insensitive image features from an existing foundation model to handle extreme contrast variations. Trained exclusively on a single public $T_1$ mapping dataset (STONE, sequence length $L=11$), AnyTwoReg generalizes to two unseen quantitative MRI datasets (MOLLI, ASL) with variable lengths ($L \in [11, 60]$) and different contrast dynamics. It achieves strong cross-protocol generalization in a zero-shot manner, and consistently improves downstream quantitative mapping quality. Notably, while designed for quantitative MRI sequences, our framework is directly applicable to Cine MRI sequences for inter-cardiac-phase registration.
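
The set formulation in the abstract can be illustrated with a toy example. The sketch below applies one shared encoder to every element of a set, pools the features into an order-independent reference, and produces one output per element. Mean pooling is swapped in for the paper's correlation-guided aggregation, and the linear encoder, feature sizes, and random inputs are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))          # shared encoder weights (toy linear encoder)

def encode(images):
    """Apply the same encoder to every set element: (L, 16) -> (L, 8)."""
    return np.tanh(images @ W)

def canonical_reference(feats):
    """Mean pooling over the set axis, which ignores element order."""
    return feats.mean(axis=0)

def per_image_output(images):
    """Each output depends only on its own features and the pooled reference,
    so permuting the inputs permutes the outputs in the same way."""
    feats = encode(images)
    return feats - canonical_reference(feats)

x = rng.normal(size=(11, 16))         # a set of L = 11 toy "images"
perm = rng.permutation(11)
long_out = per_image_output(rng.normal(size=(60, 16)))   # L = 60 works unchanged
```

Because no weight has a dimension tied to the set size, the same functions accept a sequence of length 11 or 60 without modification, which is the property that a fixed-length channel stack lacks.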


2605.10567 2026-05-12 cs.CV Updated version

VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos

Nengbo Lu, Bin Zhao

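Affiliations * Guangxi Key Laboratory of Robot Intelligent Perception and Control; School of Artificial Intelligence, Guilin University of Electronic Technology

AI summary: The paper proposes VeloGauss, a method that jointly models the geometry, appearance, and physics of a 3D scene from dynamic multi-view video alone, without any physical priors. It introduces a physics encoding and a particle dynamics system to learn a motion field for each Gaussian particle, and adds global physical constraints to keep the scene physically consistent. Experiments show VeloGauss outperforms existing methods on novel-view interpolation and future-frame extrapolation.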

Comments ICME2026 Accepted

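2605.10564 2026-05-12 cs.CV cs.RO Version update

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang

Affiliations * Tsinghua University; Amap, Alibaba Group; Nanyang Technological University

AI summary: This paper proposes DeepSight, a world model for end-to-end autonomous driving that models long-horizon future world states by predicting, in parallel, the latent semantic features of consecutive future frames in bird's-eye-view (BEV) space. It also introduces an efficient, adaptive textual reasoning mechanism that draws on additional social knowledge and reasoning ability to improve driving in complex long-tail scenarios. Experiments show state-of-the-art results on the closed-loop Bench2Drive benchmark.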

Comments ICML 2026

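2605.10523 2026-05-12 cs.CV Version update

Improving Human Image Animation via Semantic Representation Alignment

Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju, Shuai Xiao, Jinsong Lan, Yanfeng Wang

Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, China; Alibaba Group, China

AI summary: This paper studies how semantic representation alignment can improve human image animation, targeting the limb distortion and facial artifacts that appear when generating long videos or complex motions. It proposes SemanticREPA, which uses a structure alignment module to align structural information in the video latents with depth features, and an identity alignment module to align identity features of the generated video with face recognition features, improving structural stability and identity consistency. The method performs well on complex motion generation and character consistency, offering higher-quality and more flexible human animation.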

Comments Accepted by CVPR 2026 workshop

2605.10521 2026-05-12 cs.CV cs.AI Version update

DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

Yiqi Tian, Sangjoon Park, Bo Zeng, Pengfei Jin, Yujin Oh, Quanzheng Li

Affiliations * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; Department of Industrial Engineering, University of Pittsburgh; Department of Radiation Oncology, College of Medicine, Yonsei University; Institute for Innovation in Digital Healthcare, Yonsei University; Department of Biomedical Systems Informatics, College of Medicine, Yonsei University

AI summary: Medical image segmentation models can perform unevenly across subgroups, and most existing fairness methods focus on raising average subgroup performance while overlooking hidden failures within a subgroup. This paper proposes the DuetFair mechanism, which jointly considers inter-subgroup adaptation and intra-subgroup robustness, and introduces FairDRO, which combines a distribution-aware mixture-of-experts model with subgroup-conditional distributionally robust optimization to improve both fairness and segmentation performance across subgroups. Experiments show that FairDRO achieves better fairness and performance on multiple medical image segmentation benchmarks.

Comments 16 pages, 2 figures

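2605.10498 2026-05-12 cs.CV cs.AI stat.ML Version update

Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data

Heegeon Yoon, Heeyoung Kim

Affiliations * Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This study targets highly imbalanced multi-modal data and proposes a new framework that handles long-tailed recognition and multi-modal fusion simultaneously. It uses a multi-expert architecture in which modality-specific networks estimate how informative each modality is, and confidence-guided weights dynamically adjust the fusion so that multi-source data are integrated more effectively. Experiments show that the method outperforms existing approaches on multiple benchmarks and real-world datasets, demonstrating robustness and generalization on long-tailed classification.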

2605.10484 2026-05-12 cs.CV cs.RO Version update

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora

Affiliations * Autonomous Multi-Robots Lab, Department of Cognitive Robotics, School of Mechanical Engineering, Delft University of Technology, 2628 CD, Delft, Netherlands; Mobile Robotics Lab, School of Computation, Information and Technology, Technical University of Munich

AI summary: This paper proposes OpenSGA, an efficient 3D scene graph alignment framework for object-level localization and map fusion when a robot revisits a scene in an open environment. By fusing vision-language, text, and geometric features with spatial context, it aligns scene graphs accurately even under large coordinate offsets. The authors also build ScanNet-SG, a large-scale dataset with over 700,000 samples and a rich set of object categories, which substantially strengthens training and evaluation for scene graph alignment. Experiments show the best performance on both frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks.

Comments 13 figures

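2605.10470 2026-05-12 cs.CV Version update

Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

Jinyi Luo, Minghao Liu, Yifan Li, Zejia Fan, Jiaying Liu

Affiliations * Wangxuan Institute of Computer Technology, Peking University

AI summary: Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity. This paper presents the first theoretical model of multi-modal super-resolution, exposes how existing methods under-use the available modalities, and proposes M$^3$ESR, a multi-modal mixture-of-experts SR framework built on dynamic modality fusion. A spatially dynamic modality weighting module and a time-adaptive modality temperature schedule give tighter risk control and better-optimized modality contributions. Experiments show clear gains in generalization and semantic consistency.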

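2605.10464 2026-05-12 cs.CV Version update

Automated Detection of Abnormalities in Zebrafish Development

Sarath Sivaprasad, Hui-Po Wang, Anna-Lisa Jäckel, Jonas Baumann, Carole Baumann, Jennifer Herrmann, Mario Fritz

Affiliations * CISPA Helmholtz Center for Information Security; Helmholtz Institute for Pharmaceutical Research Saarland

AI summary: This paper proposes a method for automatically detecting abnormalities in zebrafish embryo development, addressing the inefficiency of today's manual assessment. It builds a large dataset of high-resolution microscopy image sequences covering both normal development and drug exposure, with fine-grained temporal annotations. It also introduces a Transformer-based model that fuses spatiotemporal features to predict developmental abnormalities early, reaching 98% accuracy on fertilized-egg viability classification and 92% on toxicity assessment, and providing an effective tool for automated zebrafish toxicity analysis.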

2605.10449 2026-05-12 cs.CV Version update

Automated high-frequency quantification of fish communities and biomass using computer vision

Kota Ishikawa, Takuma Masui, Keita Koeda, Rickdane Gomez, Lucas Yutaka Kimura, Michio Kondoh

Affiliations * Graduate School of Life Sciences, Tohoku University; Advanced Institute for Marine Ecosystem Change (WPI-AIMEC), Tohoku University; Graduate School of Science and Engineering, University of the Ryukyus; Faculty of Science, University of the Ryukyus

AI summary: This study proposes an automated computer-vision method for high-frequency quantification of underwater fish community structure and biomass. It combines deep-learning fish identification, multi-object tracking, and 3D reconstruction to estimate species, abundance, and biomass from video captured by a stereo camera system. The authors monitored a coral reef fish community continuously for 20 days, showing that the method captures dynamics in species richness, abundance, and biomass, and validating it for non-invasive, continuous monitoring.

Comments 21 pages, 3 figures, supplementary information under Ancillary files

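2605.10445 2026-05-12 cs.CV Version update

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang, Ming Lu, Renrui Zhang, Wentao Zhang

Affiliations * Peking University; Nanjing University; CUHK; Zhongguancun Academy

AI summary: This paper proposes Sync-R1, an end-to-end reinforcement learning framework that bridges personalized understanding and generation through cooperative optimization. It introduces Sync-GRPO and Dynamic Group Scaling (DGS) to strengthen synergy across tasks and improve training efficiency, and builds UnifyBench++, a dataset closer to real-world scenarios. Experiments show that Sync-R1 performs well on cross-task reasoning and personalized generation without a complex cold-start pipeline.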

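2605.10439 2026-05-12 cs.CV Version update

Filtering Memorization from Parameter-Space in Diffusion Models

Yu Zhe, Yang Jiayan, Wei Junhao, Yu-Lin Tsai, Wang Chen

Affiliations * RIKEN AIP; Science of Tokyo; University of California, Berkeley; Zhejiang University

AI summary: This paper studies how low-rank adaptation (LoRA) modules in diffusion models can memorize training images, causing generated content to leak copyrighted or sensitive information. The authors propose Base-Anchored Filtering (BAF), a training-free and data-free post-processing method that decomposes the LoRA update into spectral channels, measures how well each aligns with the principal subspace of the pretrained backbone, and filters out channels likely to carry memorized content. Experiments show that BAF reduces memorization across multiple datasets and diffusion backbones while maintaining or improving generation quality.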

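2605.10438 2026-05-12 cs.LG cs.CV Version update

Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

Xiang Chen, Alexander Binder

Affiliations * DSC ScaDS.AI, Leipzig University; Institute for Cancer Genetics and Informatics (ICGI), Oslo, Norway; ICT Cluster, Singapore Institute of Technology, Singapore

AI summary: Most current 3D encoders treat representation as spatial compression: they can reconstruct surface geometry but cannot make component membership or connection validity explicit. This paper proposes an interface-centric generative state representation that frames encoding as producing an operable state rather than a passive compressed code, so that local geometry, component membership, and connection validity can be queried, constrained, and repaired during decoding. With component-conditioned local canonical tokens (C2LT-3D), the method improves structural robustness in open-world multi-component scenes and shows that its latent states are useful for assembly-level structural reasoning.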

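2605.10434 2026-05-12 cs.CV Version update

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo, Zhiyuan Feng, Zuhao Yang, Sudong Wang, Sicong Jiang, Haowei Zhu, Zihan Wang, Ping Nie, Wenhu Chen, Bin Wang

Affiliations * Tsinghua University; Nanyang Technological University; University of Waterloo; Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes WorldReasonBench, which evaluates video generation models as predictors of future world states, focusing on their reasoning about physical, social, logical, and informational consistency. The benchmark contains 436 structured test cases and uses a human-aligned two-stage evaluation that checks the reasoning process and the video quality separately. The study reveals a significant gap between visual plausibility and world reasoning in current video generators, and provides WorldRewardBench for evaluating reward models, with the aim of advancing more realistic, world-aware video generation.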

Comments Project Page: https://unix-ai-lab.github.io/WorldReasonBench/

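2605.10409 2026-05-12 cs.CV Version update

Progressive Photorealistic Simplification

Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir

Affiliations * Reichman University and Google; Google Israel; Hebrew University and Google; Google

AI summary: This paper proposes a progressive photorealistic simplification method that reduces visual complexity while keeping the image realistic. It combines semantic understanding with generative editing: a vision-language model identifies and prioritizes elements to remove, and a learned verifier ensures realism and consistency throughout the simplification. The process is further distilled into an image-to-video generation model that produces a coherent simplification sequence directly from a single image, supporting tasks such as content-aware decluttering and semantic layer decomposition.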

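2605.10404 2026-05-12 cs.CV Version update

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

Tianyuan Zou, Liang Yue, Yang Liu, Ya-Qin Zhang, Sijie Cheng

Affiliations * Institute for AI Industry Research, Tsinghua University, Beijing, China; RayNeo.AI, Shenzhen, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China

AI summary: As always-on hardware such as smart glasses and body-worn cameras becomes widespread, life-logging video streams have become a core component of always-on AI systems. These streams greatly improve system utility but carry serious privacy risks, exposing sensitive information such as behavioral patterns, emotional states, and social interactions. Existing privacy protections either target specific attacks or incur a significant utility loss, and they do not consider the full data-processing pipeline. The paper therefore argues that the privacy-utility trade-off in life-logging video streams is a foundational challenge that next-generation AI systems must address.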

Comments 19 pages, 7 figures

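2605.10397 2026-05-12 cs.CV cs.AI Version update

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng

Affiliations * Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China; School of EEE, Nanyang Technological University (NTU), Singapore; CFAR, Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: Visual anomaly detection matters in industrial inspection, medical imaging, and other fields, but differences in data modality and annotation standards make models trained on one domain hard to apply to another. This paper proposes AnomalyClaw, a training-free visual anomaly detection agent that makes its judgments more reliable through multi-round refutation, using 13 tools for visual verification and reference resolution. Experiments show that AnomalyClaw significantly outperforms single-step inference on multiple cross-domain datasets, and a self-evolution mechanism improves detection further.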

Comments We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw

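2605.10394 2026-05-12 cs.CV Version update

Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

Andreas Goulas, Damianos Galanopoulos, Evlampios Apostolidis, Vasileios Mezaris

Affiliations * IDT-ITI

AI summary: This paper proposes a new task, sensational image detection: deciding whether an image contains shocking, provocative, or emotionally charged features meant to attract attention and provoke strong emotional reactions. The authors build Sens-VisualNews, a benchmark dataset of 9,576 news images annotated for the presence of various sensational concepts and events in their visual content. Using this dataset, they examine the prompt sensitivity, performance, and robustness of several state-of-the-art multimodal large language models in zero-shot and fine-tuned settings.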

Comments Authors' Accepted Version; Accepted at IEEE ICIP 2026

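2605.10391 2026-05-12 cs.CL cs.AI cs.CV Version update

Phoenix-VL 1.5 Medium Technical Report

Team Phoenix: Arka Ray, Askar Ali Mohamed Jawad, Biondi Lee, Elijah Seah, Eva Lim, Fiona Teo, Grace Toh, Guang Xiang Teo, Jun En Tan, Jia Hui Bong, Jiale Wang, Jonathan Ng, Justin Tan, Kai Zhe Yew, Matthew Ong, Shun Yi Yeo, Wen Jett Lam, Wen Xiu Tan, Ze Yu Zhang, Gee Wah Ng, Chee Wee Ang; Mistral AI: Adrien Sadé, Guillaume Kunsch, Jia Sin Loh, Nicolas Schuhl, Rupert Menneer, Umar Jamil, Vincent Maladière, Yimu Pan

Affiliations * Mistral AI

AI summary: This paper introduces Phoenix-VL 1.5 Medium, a 123-billion-parameter localized multimodal, multilingual foundation model adapted to the Singapore context and regional languages. The model is continually pretrained on a large localized multimodal corpus and fine-tuned on data covering Singapore's culture, law, and other domains, which markedly improves performance on Singapore-related tasks while keeping strong results on general multimodal, multilingual, and STEM tasks. The report also proposes a safety framework covering localized knowledge evaluation and institution-aligned behavior, offering new ideas for developing regional AI models.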

Comments Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1

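2605.10388 2026-05-12 cs.CV cs.RO Version update

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

Yumao Liu, Tao Liu, Xiangyu Li, Jiaxiang Li, Ke Ma

Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper studies how temporal sampling frequency affects end-to-end driving trajectory prediction, challenging the common assumption that higher-frequency sampling always helps. It builds training sets at different frequencies, trains and evaluates the same models under a fixed experimental protocol, and analyzes how sampling frequency relates to prediction performance. The frequency response varies with model and dataset: small models tend to perform best at medium or lower frequencies, while large models such as AutoVLA do best at the highest frequency. Temporal sampling frequency should therefore be tuned as a hyperparameter rather than fixed at the maximum.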

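2605.10374 2026-05-12 cs.CV Version update

Halo Separation-guided Underwater Multi-scale Image Restoration

Jiaxin Yang, Honglin Liu, Yongli Wang, Shuyi Cao, Chengcheng Jiang, Jiale Wang

Affiliations * College of Information Science and Technology; Dalian Maritime University; College of Marine Electrical Engineering

AI summary: This paper addresses halos caused by artificial light sources in images captured by autonomous underwater vehicles, and proposes an iterative method for correcting a single halo-affected image. Two sub-networks perform halo layer separation and multi-scale image restoration respectively, improving the clarity and quality of underwater images. The method is trained and tested on a synthetic dataset and real halo images, and a radial gradient constraint further improves halo removal, giving a more robust solution for underwater image enhancement.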

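2605.10362 2026-05-12 cs.CV Version update

CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

Alexey Pchelnikov, Aleksei Pchelnikov

Affiliations * HistAI

AI summary: CellDX AI Autopilot is a platform that uses AI agents to train and deploy pathology image classifiers, aiming to reduce the reliance of computational pathology on specialist skills and compute. It provides structured agent skills that guide users through dataset construction, hyperparameter optimization, multi-strategy model comparison, and human-in-the-loop deployment, with training based on a pre-built dataset of more than 32,000 cases and 66,000 H&E-stained whole-slide images. Its core contribution is an agent skill architecture designed for pathology tasks together with a multiple-instance learning framework, which improve training efficiency and ease of use.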

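2605.10349 2026-05-12 cs.CV cs.AI cs.LG Version update

Portable Active Learning for Object Detection

Rashi Sharma, Justin Timothy C. Bersamin, Karthikk Subramanian

Affiliations * Panasonic R&D Center Singapore; Nanyang Technological University

AI summary: This paper proposes PAL, a portable active learning framework for improving annotation efficiency in object detection. It requires no changes to the detector's internals or training pipeline and selects data using only the model's inference outputs, combining class-level instance uncertainty with image-level diversity so that the selected samples are both informative and diverse. Experiments show that PAL outperforms existing active learning methods on multiple datasets, improving label efficiency and detection accuracy and offering a practical route to efficient detector deployment.

Comments CVPR 2026 (highlight)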

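2605.10345 2026-05-12 cs.CV Version update

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li, Pei He, Licheng Jiao

Affiliations * Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University; School of Telecommunications, Xidian University; Department of Automation, Tsinghua University

AI summary: This paper proposes BGG, a parameter-efficient adaptation framework built on a vision foundation model (VFM), to bridge the geometric gap between cross-view images (such as drone and satellite imagery) and improve cross-view geo-localization (CVGL). With a multi-granularity feature enhancement adapter (MFEA) and a frequency-aware structure aggregation (FASA) module, BGG improves the scale adaptability and viewpoint robustness of the features and strengthens local structural features, giving more accurate geo-localization at low training cost. Experiments show state-of-the-art performance on multiple datasets.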

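2605.10343 2026-05-12 cs.CV cs.AI Version update

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

Affiliations * EPIC Lab, Shanghai Jiao Tong University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Fudan University

AI summary: This paper proposes EvoStreaming, a self-evolving framework for adapting offline video language models (VideoLLMs) into streaming video assistants. The study finds that existing VideoLLMs have good visual understanding but lack an interaction policy for deciding when to respond in a streaming setting. EvoStreaming has the model generate its own data, label relevance, and formulate response policies, synthesizing streaming interaction trajectories without external supervision. With very few samples it markedly improves performance on streaming evaluations while largely preserving offline performance, offering an efficient route to streaming video assistants.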

Comments 33 pages, 9 figures

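2605.10334 2026-05-12 cs.CV Version update

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

Andrii Yermakov, Jan Cech, Mario Fritz, Jiri Matas

Affiliations * Czech Technical University in Prague; CISPA Helmholtz Center for Information Security

AI summary: Deepfake detectors have recently improved at cross-dataset generalization, but the mechanism behind this remains unclear. This paper proposes the "Alpha Blending Hypothesis": current state-of-the-art frame-based detectors are actually searching for alpha blending artifacts rather than learning semantic anomalies or generative-model fingerprints. The authors verify the hypothesis experimentally and propose BlenD, a detection method trained on a dataset of real face images augmented with self-blended images. BlenD achieves the best cross-dataset generalization on multiple synthetic forgery datasets without using any explicitly generated deepfakes in training.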
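For readers unfamiliar with the operation the hypothesis refers to, alpha blending composites a foreground onto a background per pixel as out = alpha * fg + (1 - alpha) * bg. The sketch below shows that formula and one simple way a self-blended training sample could be produced from a real image. The `self_blend` function, its brightness `gain`, and the grayscale list-of-lists image format are illustrative assumptions, not the BlenD pipeline.

```python
from typing import List

Image = List[List[float]]  # grayscale image, pixel values in [0, 1]


def alpha_blend(fg: Image, bg: Image, alpha: Image) -> Image:
    """Per-pixel composite: out = alpha * fg + (1 - alpha) * bg."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(fg, bg, alpha)
    ]


def self_blend(real: Image, mask: Image, gain: float = 1.25) -> Image:
    """Blend a brightness-shifted copy of a real image back into itself.

    The result contains a blending boundary but no generated content,
    which is the kind of pseudo-fake the summary calls a self-blended image.
    The brightness shift is a hypothetical choice of perturbation.
    """
    shifted = [[min(1.0, gain * p) for p in row] for row in real]
    return alpha_blend(shifted, real, mask)
```

With a soft mask (values strictly between 0 and 1 near the boundary), every output pixel in the transition zone is a convex combination of two sources, which is the artifact the hypothesis says detectors key on.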

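2605.10319 2026-05-12 cs.CV Version update

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Issey Sukeda, Andreas Dengel

Affiliations * RPTU Kaiserslautern-Landau & DFKI GmbH, Kaiserslautern, Germany; Faculty of Science & Engineering, Hosei University, Tokyo, Japan; EQUES, Tokyo, Japan

AI summary: This paper proposes LimeCross, a training-free, context-conditioned layered image editing framework that edits user-selected RGBA layers according to text instructions while leaving unselected layers unchanged. A dual-stream attention mechanism draws on context from the other layers to keep cross-layer consistency and prevent contamination of the edited layer. The authors also introduce the LayerEditBench dataset and evaluation protocol. Experiments show that LimeCross outperforms existing methods in layer purity and composite realism, offering a new layered-editing paradigm for controllable generative creation.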

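2605.10307 2026-05-12 cs.CV cs.GR cs.RO Version update

PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction

Yinan Deng, Jianyu Dou, Jiahui Wang, Jingyu Zhao, Yi Yang, Yufeng Yue

AI summary: Dynamic scene reconstruction is a fundamental and challenging problem in computer vision and robotics. To achieve high-fidelity rendering and accurate tracking under complex motion, this paper proposes PaMoSplat, a dynamic Gaussian splatting framework that combines part awareness with motion priors. Using 3D reconstruction from multi-view segmentation masks and optical-flow-guided part motion estimation, PaMoSplat achieves higher-quality rendering and more accurate tracking, and it outperforms existing methods in both performance and convergence speed on multiple real-world scenes.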

Comments Accepted by TCSVT. Project Url: https://pamosplat.github.io

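2605.10275 2026-05-12 cs.CV Version update

PolarVSR: A Unified Framework and Benchmark for Continuous Space-Time Polarization Video Reconstruction

Chenggong Li, Yidong Luo, Junchao Zhang, Boxin Shi, Degui Yang

Affiliations * School of Automation, Central South University; Hunan Provincial Key Laboratory of Optic-Electronic Intelligent Measurement and Control; Zhejiang University; School of Engineering, Westlake University; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University

AI summary: This paper proposes PolarVSR, a unified framework for space-time polarization video reconstruction. It targets a challenging inverse problem in mainstream division-of-focal-plane (DoFP) polarization imaging: recovering polarization parameters from the sensor's mosaic array. The method jointly models polarization directions across space and time and combines this with a polarization-aware implicit neural representation to achieve continuous, high-fidelity super-resolution reconstruction. It also introduces an optical-flow-guided polarization variation loss to refine polarization dynamics, and establishes the first large-scale color DoFP polarization video benchmark. Experimental results confirm the method's effectiveness.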

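2605.10269 2026-05-12 cs.CV cs.RO Version update

Increasing the Efficiency of DETR for Maritime High-Resolution Images

Tinsae Yehuala, Hao Cheng, Ville Lehtola

Affiliations * Dept. of Earth Observation Science, ITC Faculty, University of Twente

AI summary: This paper addresses object detection in high-resolution images for safe navigation of unmanned surface vehicles (USVs) and studies how to make DETR more efficient. The authors use Vision Mamba (ViM), a state space model (SSM), as the backbone, combined with serialized image patch processing and a feature pyramid network design, which improves detection of distant objects, small objects, and objects with large scale variation. With optimizations such as token pruning, the method substantially reduces compute and memory cost while maintaining detection accuracy, offering a more efficient and reliable solution for real-time maritime detection.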

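2605.10251 2026-05-12 cs.CV Version update

Efficient Hybrid CNN-GNN Architecture for Monocular Depth Estimation

Ishan Narayan

Affiliations * IMCS Lab, CSIR-CSIO

AI summary: This paper proposes GraphDepth, a monocular depth estimation architecture that adds graph neural networks (GNNs) to a convolutional encoder-decoder to model long-range spatial relationships that local convolutions struggle to capture. It embeds efficient GraphSAGE layers at multiple scales of a ResNet-101 U-Net backbone and adds channel-attention-gated skip connections and a heteroscedastic uncertainty estimation module, improving accuracy and robustness. Experiments show that, compared with Transformer-based hybrid models, GraphDepth is more computationally efficient at a similar global receptive field and achieves strong results on multiple benchmark datasets.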
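The summary does not say how the heteroscedastic uncertainty module is trained. A common formulation, assumed here purely for illustration, has the network predict a depth value and a log-variance per pixel and minimizes a Gaussian negative log-likelihood, so that pixels with large predicted variance contribute less to the squared-error term but pay a penalty for the variance itself. The function name and argument layout below are hypothetical.

```python
import math
from typing import Sequence


def heteroscedastic_nll(
    pred: Sequence[float],
    log_var: Sequence[float],
    target: Sequence[float],
) -> float:
    """Mean Gaussian negative log-likelihood with per-pixel predicted log-variance.

    Per pixel: 0.5 * exp(-s) * (y - mu)^2 + 0.5 * s, where s = log(sigma^2).
    The constant 0.5 * log(2 * pi) is dropped because it does not affect training.
    """
    total = 0.0
    for mu, s, y in zip(pred, log_var, target):
        total += 0.5 * math.exp(-s) * (y - mu) ** 2 + 0.5 * s
    return total / len(pred)
```

Predicting the log-variance rather than the variance keeps the loss numerically stable and removes the need for a positivity constraint on the network output.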

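2605.10229 2026-05-12 cs.CV cs.CY Version update

VPD-100K: Towards Generalizable and Fine-grained Visual Privacy Protection

Xiaobin Hu, Enpu Zuo, Lanping Hu, Kaiwen Yang, Dianshu Liao, Tianyi Zhang, Bo Yin, Yinsi Zhou, Shidong Pan, Xiaoyu Sun

Affiliations * National University of Singapore; Australian National University; New York University; The University of New South Wales

AI summary: As sharing of visual data becomes widespread, privacy protection is increasingly important, yet existing privacy detection algorithms are held back by the lack of a comprehensive dataset. This paper presents VPD-100K, a large-scale, fine-grained visual privacy dataset spanning four domains: human presence, on-screen personally identifiable information, physical identifiers, and location indicators. It contains 100,000 images and 190,000 annotated object instances and is characterized by a long-tailed distribution, small objects, and high visual complexity. The authors also design a lightweight frequency-enhancement module that better captures subtle features of sensitive information. Experiments show that both the dataset and the method perform well across a range of benchmarks.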

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

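2605.10210 2026-05-12 cs.RO cs.CV Version update

Nano-U: Efficient Terrain Segmentation for Tiny Robot Navigation

Federico Pizzolato, Francesco Pasti, Nicola Bellotto

Affiliations * Dept of Information Engineering, University of Padua

AI summary: This paper studies efficient terrain segmentation on tiny robots to support autonomous navigation in unstructured outdoor environments. To address the difficulty of deploying existing models on resource-constrained microcontrollers, the authors propose Nano-U, a lightweight binary segmentation network trained with quantization-aware distillation, which significantly improves model performance. The model performs well on multiple datasets and, with an improved compiler toolchain, was deployed on a low-cost microcontroller for low-power, low-latency real-time terrain perception.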

Comments Code repository: https://github.com/federico-pizz/Nano-U

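2605.10204 2026-05-12 cs.CV Version update

3DReflecNet: A Large-Scale Dataset for 3D Reconstruction of Reflective, Transparent, and Low-Texture Objects

Zhicheng Liang, Haoyi Yu, Boyan Li, Dayou Zhang, Zijian Cao, Tianyi Gong, Junhua Liu, Shuguang Cui, Fangxin Wang

Affiliations * The Chinese University of Hong Kong, Shenzhen; Capital Normal University; University of Southern California

AI summary: This paper introduces 3DReflecNet, a large-scale dataset for 3D reconstruction of objects with reflective, transparent, and low-texture surfaces. It contains more than 120,000 physically based rendered synthetic samples and more than 1,000 real objects captured with consumer-grade devices, totaling over 22 TB and covering diverse materials, complex lighting conditions, and geometries. The authors also design benchmarks for five core tasks, which reveal the limits of existing methods on such challenging materials and encourage the development of more robust 3D vision models.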

Comments This paper has been accepted by CVPR 2026 Oral

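2605.10190 2026-05-12 cs.CV Version update

DetRefiner: Model-Agnostic Detection Refinement with Feature Fusion Transformer

Soichiro Okazaki, Tatsuya Sasaki, Hiroki Ohashi

Affiliations * Hitachi, Ltd. Research and Development Group

AI summary: DetRefiner is a model-agnostic detection refinement framework for open-vocabulary object detection that aims to improve performance on both seen and unseen categories. A lightweight Transformer encoder fuses global image features with local patch features to produce attribute reliability information that calibrates the base detector's confidence scores. DetRefiner does not depend on the base model's internal features and does not retrain it; it only calibrates detections at inference time. It improves several open-vocabulary detectors on multiple datasets, with gains of up to +10.1 AP on unseen categories.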

Comments CVPR 2026 Findings

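2605.10184 2026-05-12 cs.CV cs.AI Version update

Developing a foundation model for high-resolution remote sensing data of the Netherlands

Paul Vermeeren, Heysem Kaya

Affiliations * Utrecht University, Department of Information and Computing Sciences

AI summary: This paper presents a foundation model built on high-resolution (1.2 m) satellite imagery of the Netherlands. It combines convolutional neural networks with vision Transformers to capture fine textures, edges, and small objects as well as large-scale terrain structure, elevation patterns, and land-cover distribution. By incorporating time-series data, the model learns context across time and better models temporal dependencies such as terrain features, land-cover change, and seasonal dynamics, which reduces feature ambiguity, strengthens representation learning, and improves few-shot generalization. Experiments show strong performance on tasks such as vegetation monitoring in the Netherlands and results comparable to state-of-the-art models on several global benchmarks, demonstrating that general representations can be learned with limited data and parameters.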

Comments 9 pages, 4 figures, under review in a journal

2605.10177 2026-05-12 cs.CV cs.AI cs.RO Version update

MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, Ostap Okhrin

Affiliations * Dongguan Key Laboratory of Intelligent Equipment and Smart Industry, School of Advanced Engineering, Great Bay University; Chair of Applied Statistics, Technische Universität Dresden; Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI); College of Automation, Guangdong University of Technology

AI summary: This paper proposes MTA-RL, a framework that improves the robustness of urban autonomous driving through multi-modal Transformer-based 3D affordance representations and reinforcement learning. It fuses RGB images and LiDAR point clouds into structured, geometry-aware affordance representations that serve as input to the reinforcement learning policy, improving decision efficiency and stability. Experiments show that MTA-RL outperforms existing methods across traffic densities and generalizes well zero-shot to unseen urban environments.

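2605.10174 2026-05-12 cs.CV Version update

BathyFacto: Refraction-Aware Two-Media Neural Radiance Fields for Bathymetry

Markus Brezovsky, Anatol Günthner, Frederik Schulte, Lukas Winiwarter, Boris Jutzi, Gottfried Mandlburger

Affiliations * Department of Geodesy and Geoinformation, TU Wien; Institute of Photogrammetry and Remote Sensing (IPF), Karlsruhe Institute of Technology (KIT); Unit of Geometry and Surveying, University of Innsbruck

AI summary: BathyFacto is a refraction-aware two-media neural radiance field method for bathymetry that addresses the depth bias caused by light refraction when conventional bundle-adjustment-based reconstruction is applied to underwater scenes. It introduces a medium-conditioned color head and a hash-grid density field, and uses Snell's law to model the refracted path of rays at the air-water interface, producing more accurate underwater point clouds. Experiments show that, in simulated scenes, BathyFacto substantially improves reconstruction accuracy and completeness over conventional methods and refraction-unaware NeRF baselines.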

Comments 16 pages, 8 figures, 3 tables. Submitted to ISPRS Open Journal of Photogrammetry and Remote Sensing, Special Issue "3D Underwater Mapping from Above and Below"
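The BathyFacto summary says camera rays are bent at the air-water interface according to Snell's law, n1 sin(theta1) = n2 sin(theta2). The sketch below is a generic vector form of that law, not code from the paper. It assumes a flat interface, unit-length inputs, and a refractive index of about 1.33 for water; the `refract` name and the constants are illustrative.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]

N_AIR = 1.0     # assumed refractive index of air
N_WATER = 1.33  # assumed refractive index of water


def refract(direction: Vec3, normal: Vec3, n1: float, n2: float) -> Optional[Vec3]:
    """Refract a unit ray direction at an interface (vector form of Snell's law).

    `normal` is the unit surface normal pointing into the medium the ray
    comes from. Returns None when total internal reflection occurs.
    """
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)
    if sin2_t > 1.0:
        return None  # no transmitted ray
    cos_t = math.sqrt(1.0 - sin2_t)
    k = eta * cos_i - cos_t
    return tuple(eta * d + k * n for d, n in zip(direction, normal))
```

A ray looking straight down passes through unchanged, while an oblique ray bends toward the vertical on entering water. Ignoring that bend is what makes refraction-unaware reconstructions place the bottom too shallow.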

2605.10172 2026-05-12 cs.CV cs.CL Version update

V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu

Affiliations * School of Automation and Intelligent Sensing, Shanghai Jiao Tong University; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University; SenseTime Research; Institute of Medical Robotics, Shanghai Jiao Tong University

AI summary: This paper proposes V-ABS, an action-observer-driven beam search framework for complex multi-step tasks in dynamic visual reasoning. It introduces an iterative thinker-actor-observer mechanism together with an entropy-based adaptive weighting algorithm, which mitigates imagination-action-observer (IAO) bias and makes reasoning more stable and closer to optimal. Experiments show that V-ABS leads on multiple benchmarks and significantly outperforms existing models.

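2605.10162 2026-05-12 cs.CV Version update

Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images

Yu Lin, Jianghang Lin, Kai Ye, Shengchuan Zhang, Liujuan Cao

Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

AI summary: This paper proposes Active-SAOOD, an active-learning method for sparsely annotated oriented object detection in remote sensing images, aimed at reducing annotation cost. A model-state observation module considers, at the instance level, the uncertainty of orientation, classification, and localization together with inter-class and intra-class diversity, and actively selects the sparse samples most valuable to the current model, enabling stable detection even from fully random initial sparse annotations. Experiments show that Active-SAOOD significantly improves the performance and stability of existing sparse-annotation methods on several datasets, with a gain of up to 9% at a 1% annotation ratio.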

2605.10149 2026-05-12 cs.CV Version update

Improving Temporal Action Segmentation via Constraint-Aware Decoding

Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando

Affiliations * Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Indian Institute of Technology Kharagpur, India; College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary: This paper studies how structural prior constraints can improve temporal action segmentation. The authors propose a lightweight constraint-aware decoding framework that integrates statistical structural priors, namely action transition confidence, action boundary sets, and class durations, to refine predictions at inference time without increasing model complexity. The method improves both fully supervised and semi-supervised action segmentation models, especially when labeled data are limited or the domain is new.

Comments accepted to ICPR 2026
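One standard way to apply a transition prior at inference time, without touching the model, is Viterbi decoding over the frame-wise class scores. The sketch below uses plain Viterbi with a transition log-probability matrix only. It is an assumed stand-in for illustration, not the paper's decoder, and it ignores the boundary-set and duration priors that the summary also mentions.

```python
from typing import List


def viterbi_decode(
    frame_logp: List[List[float]],
    trans_logp: List[List[float]],
) -> List[int]:
    """Most likely label sequence given per-frame and transition log-scores.

    frame_logp[t][c]: log-score of class c at frame t (from any segmentation model).
    trans_logp[p][c]: log-score of moving from class p to class c (the prior).
    """
    n_classes = len(frame_logp[0])
    score = list(frame_logp[0])
    back: List[List[int]] = []
    for t in range(1, len(frame_logp)):
        new_score, ptr = [], []
        for c in range(n_classes):
            # Best predecessor for class c at frame t.
            best_prev = max(range(n_classes), key=lambda p: score[p] + trans_logp[p][c])
            new_score.append(score[best_prev] + trans_logp[best_prev][c] + frame_logp[t][c])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    # Backtrack from the best final class.
    label = max(range(n_classes), key=lambda c: score[c])
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    return path[::-1]
```

With a prior that penalizes switching, a single noisy frame that the model mislabels is smoothed away, whereas a uniform prior reproduces the frame-wise argmax. This is the sense in which decoding-time constraints can help without any increase in model complexity.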

2605.10148 2026-05-12 cs.CV Version update

MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

Affiliations * Department of Electro-Optics, National Formosa University; Department of Electrical Engineering, National Taipei University; College of Artificial Intelligence and Green Energy, National Yang Ming Chiao Tung University

AI summary: This paper proposes MicroViTv2, a lightweight vision Transformer designed for energy efficiency on edge devices. It adopts reparameterized designs, including a reparameterized patch embedding (RepEmbed) and a reparameterized depthwise-separable convolution mixer (RepDW), together with a single depthwise-separable transposed attention (SDTA) module, achieving higher accuracy while keeping inference fast. Experiments show a superior energy-efficiency ratio on hardware such as the Jetson AGX Orin, supporting the case for evaluating efficiency beyond FLOPs.

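2605.10142 2026-05-12 cs.CV cs.AI Version update

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

Mateusz Cedro, Marcin Chlebus

Affiliations * University of Warsaw

AI summary: This paper studies whether scaling up vision models improves localization-based explanation quality. It evaluates ResNet, DenseNet, and Vision Transformer models of varying depth and complexity on multiple image datasets with five post-hoc explanation methods, and finds that larger models do not improve explanation quality in most cases: smaller models often perform comparably or better. Pretraining improves predictive performance but does not consistently improve localization accuracy, so explainability should be evaluated explicitly during model selection to ensure safe deployment.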

Comments 28 pages, 8 figures, 8 tables

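2605.10130 2026-05-12 cs.CV Version update

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk, Vishal M. Patel

Affiliations * Johns Hopkins University; Kitware; DEVCOM Army Research Laboratory

AI summary: Existing open-vocabulary detection methods mainly target RGB images and transfer poorly to thermal imaging, where low texture and large emissivity variation challenge RGB-based semantic understanding. This paper proposes Thermal-Det, the first open-vocabulary thermal object detector supervised by a large language model (LLM). It builds a million-scale synthetic dataset of aligned thermal samples and combines cross-modal distillation with a text calibration module, transferring detection knowledge to thermal imagery without human annotation. Experiments show a 2-4% gain in average precision over existing open-vocabulary detectors on public datasets, laying groundwork for language-driven thermal perception systems.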

Comments Accepted at CVPR 26

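2605.10120 2026-05-12 cs.CV cs.AI Version update

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

Affiliations * Shanghai Key Laboratory of Intelligent Information Processing; School of Computer Science, Fudan University

AI summary: This paper proposes MicroWorld, a framework that addresses the weak performance of multimodal large language models in specialized microscopic domains such as microscopy. It builds a multimodal attribute graph (MAPG) to strengthen the model's reasoning, improving performance at inference time without domain-specific fine-tuning. Experiments show that MicroWorld significantly improves Qwen3-VL-8B-Instruct on benchmarks such as MicroVQA, reaching state-of-the-art results and showing an advantage in cross-domain generalization.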

Comments 29 pages, 14 figures

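2605.10117 2026-05-12 cs.CV cs.AI Version update

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

Donghyun Kim, Jaehyoung Park

Affiliations * Stony Brook University

AI summary: This paper studies how to adjust perception compute dynamically according to scene complexity in autonomous driving. It proposes Enhanced HOPE, an adaptive perception architecture that estimates the geometric complexity of each LiDAR frame without supervision and selects a shallow or deep processing path accordingly, improving computational efficiency while preserving accuracy. The method also introduces a linear-time subspace attention network and a persistent temporal memory module, which improve tracking of occluded objects, and it performs well on multiple benchmarks.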

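2605.07846 2026-05-12 cs.CV Version update

BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

Peilin Xiong, Honghui Yuan, Junwen Chen, Keiji Yanai

Affiliations * Department of Informatics, The University of Electro-Communications

AI summary: This paper studies boundary distortion of the edited region caused by mask shape bias in coarse-mask local image editing, and proposes BRIDGE. The method keeps the mask outside the DiT backbone and introduces a learnable discrete geometric gating mechanism, jointly enforcing a stable background and flexible generation in the edited region. Experiments show that BRIDGE significantly improves editing quality on multiple benchmarks while remaining lightweight.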

Comments 11 pages, 6 figures

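2605.07786 2026-05-12 cs.CV cs.AI Version update

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini

Affiliations * University of Siena; AI for Good (AIGO), Istituto Italiano di Tecnologia; University of Verona

AI summary: As generative models make breakthroughs in visual quality, traditional image evaluation metrics based on feature distributions, such as FID, are still treated as the gold standard, yet they are limited by outdated features and parametric assumptions. This paper proposes APEX, an assumption-free embedding evaluation framework based on the sliced Wasserstein distance. It does not rely on a specific parametric form and is compatible with a range of embedding models, such as CLIP and DINOv2. Experiments show that APEX scales well in high-dimensional spaces, is more robust to visual degradations, and is highly stable in cross-dataset evaluation.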
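The sliced Wasserstein distance named in the summary compares two point sets by projecting both onto random directions and comparing the sorted one-dimensional projections, so no Gaussian or other parametric fit is needed. The sketch below is a minimal Monte Carlo version for two equally sized sets of embeddings. The function name, the W1 (absolute difference) variant, the projection count, and the fixed seed are illustrative assumptions and are not taken from APEX.

```python
import math
import random
from typing import Sequence


def sliced_wasserstein(
    x: Sequence[Sequence[float]],
    y: Sequence[Sequence[float]],
    n_projections: int = 64,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    x and y are equally sized sets of embedding vectors of the same dimension.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        # Random unit direction: normalized Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        # Project both sets to 1-D and sort; in 1-D the optimal transport
        # plan simply matches sorted samples.
        px = sorted(sum(a * b for a, b in zip(p, direction)) for p in x)
        py = sorted(sum(a * b for a, b in zip(p, direction)) for p in y)
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_projections
```

In practice `x` and `y` would hold embeddings of real and generated images from a model such as CLIP or DINOv2. The cost per projection is dominated by sorting, which is one reason the approach scales to high-dimensional embeddings.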

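2605.07575 2026-05-12 cs.CV cs.AI Version update

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

Affiliations * Northwestern Polytechnical University; Tsinghua University; The Hong Kong University of Science and Technology; Harbin Engineering University

AI summary: This paper proposes Response-G1, a framework for deciding when to respond proactively in streaming video understanding. Through explicit scene graph modeling, it aligns video content with the query's response conditions in a structured way, making response decisions more accurate and interpretable. The framework has three fine-tuning-free stages: online scene graph generation, memory-based semantic retrieval, and augmented trigger prompting. Experiments show that it outperforms existing methods on both proactive and passive tasks.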

Comments Accepted to ACL 2026

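2605.07574 2026-05-12 cs.CV Version update

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Yuliang Li, Chu Zhou, Heng Guo, Boxin Shi, Imari Sato, Zhanyu Ma

Affiliations * Beijing University of Posts and Telecommunications, China; National Institute of Informatics, Japan; Peking University, China; The University of Tokyo, Japan

AI summary: Mainstream vision-language models (VLMs) rely on standard RGB input and struggle in optically ambiguous scenes involving reflections, transparent objects, and the like. This paper proposes PolarVLM, the first multimodal framework to incorporate polarization physical parameters into a VLM. A dual-stream architecture and a progressive training strategy avoid physical misjudgments while preserving general visual ability. The authors also build PolarVQA, the first polarization-aware visual question answering benchmark. Experiments show that PolarVLM significantly outperforms RGB baselines on multiple tasks, with especially large gains on reflection recognition and glass counting.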

Comments 23 pages, 12 figures, including appendices

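2605.07429 2026-05-12 cs.CV Version update

Towards Photorealistic and Efficient Bokeh Rendering via Diffusion Framework

Linxiao Shi, Siming Zheng, Zerong Wang, Hao Zhang, Jinwei Chen, Bo Li, Shifeng Chen, Peng-Tao Jiang

Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; vivo BlueImage Lab, vivo Mobile Communication Co., Ltd.; Shenzhen University of Advanced Technology

AI summary: Because of constraints on their optical design, mobile devices struggle to produce natural optical bokeh. This paper proposes MagicBokeh, a unified diffusion-based method that efficiently generates high-quality, photorealistic bokeh. It jointly optimizes bokeh rendering and super-resolution through an alternative training strategy and a focus-aware masked attention mechanism, improving control precision and visual realism, and it adds a degradation-aware depth module to improve depth estimation for low-quality inputs. Experiments show that MagicBokeh efficiently produces highly realistic bokeh on real low-resolution images, pointing to a new direction for bokeh rendering research.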

Comments Accepted by CVPR 2026

2605.05775 2026-05-12 cs.CV cs.AI Version update

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization

Jakob Dexl, Katharina Jeblick, Andreas Mittermeier, Balthasar Schachtner, Anna Theresa Stüber, Johanna Topalis, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein, Hamza Kalisch, Jens Kleesiek, Constantin M. Seibold, Hussain Alasmawi, Lap Yan Lennon Chan, Yixuan Yuan, Alexander Jaus, Rainer Stiefelhagen, Pauline Ornela Megne Choudja, Konstantin Nikolaou, Christian La Fougère, Sergios Gatidis, Matthias P. Fabritius, Maurice Heimer, Gizem Abaci, Lalith Kumar Shiyam Sundar, Rudolf A. Werner, Jens Ricke, Clemens C. Cyran, Thomas Küstner, Michael Ingrisch

Affiliations * Department of Radiology, LMU University Hospital, LMU Munich; Munich Center for Machine Learning (MCML); University Hospital Tübingen, Department of Radiology; Department of Radiology, Stanford University; German Cancer Research Center (DKFZ); Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital; Faculty of Mathematics and Computer Science, Heidelberg University; Institute for AI in Medicine (IKIM), University Hospital Essen (AöR); Department of Nuclear Medicine, University Hospital Essen (AöR); Mohamed bin Zayed University of Artificial Intelligence; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Department of Electronic Engineering, The Chinese University of Hong Kong; Karlsruhe Institute of Technology; HIDSS4Health - Helmholtz Information and Data Science School for Health; Department of Nuclear Medicine, LMU University Hospital, LMU Munich; Comprehensive Pneumology Center (CPC-M), Member of the German Center for Lung Research (DZL); relAI – Konrad Zuse School of Excellence in Reliable AI; Cluster of Excellence iFIT (EXC 2180) "Image Guided and Functionally Instructed Tumor Therapies", University of Tübingen

AI summary: This paper presents the design and results of the third autoPET challenge (MICCAI 2024), which evaluates how well algorithms for automated lesion segmentation in whole-body PET/CT generalize across tracers and centers. The study uses a large annotated dataset from two hospitals and evaluates algorithms on a test set that includes unseen tracer-center combinations; the best algorithms outperform the baseline on several metrics. The authors also find that current algorithms do well on in-domain multi-tracer segmentation but still struggle to generalize across centers and tracers, with performance differences driven mainly by data heterogeneity and case difficulty.

Comments Preprint submitted to Medical Image Analysis

2605.04617 2026-05-12 cs.CV cs.HC cs.LG Updated version

Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition

Zishu Zhou, Zaipeng Xie, Xuanyao Jie

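Affiliations * College of Computer Science and Software Engineering, Hohai University

AI summary Wearable human activity recognition (HAR) models often degrade when user distributions shift in the real world. Existing test-time adaptation methods mostly inherit assumptions from vision tasks and underuse the temporal structure of activity streams. This paper revisits temporal structure as a conditioning signal for inference and proposes an adaptation mechanism based on temporal continuity and feature deviation that decides when to keep or release temporal inertia and where to route prediction refinement. Building on this, the authors design the SIGHT framework, which adapts in real time without backpropagation; experiments show that it outperforms existing methods on real-world datasets while reducing compute and memory overhead.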

2605.04541 2026-05-12 cs.CV Updated version

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Muyao Peng, Shun Zou, Pei An, You Yang, Qiong Liu

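Affiliations * School of Electronic Information and Communications, Huazhong University of Science and Technology

AI summary This paper proposes Angle-I2P, an image-to-point-cloud registration method that targets the failure of conventional PnP solvers under low inlier ratios. It introduces an angle-consistency constraint and a hierarchical attention mechanism to improve the robustness and accuracy of registration. Experiments show that Angle-I2P achieves state-of-the-art registration results on several public datasets.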

Comments Accepted by ICRA 2026

2605.03650 2026-05-12 cs.CV cs.AI cs.LG Updated version

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen, Joni Pajarinen

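Affiliations * Department of Electrical Engineering and Automation; Aalto University; Department of Computer Science

AI summary This paper rethinks temporal consistency in video object-centric learning and argues that current methods, which rely on dynamics modules to predict future object representations, are in effect approximating a complex discrete correspondence problem. The authors propose Grounded Correspondence, a framework that initializes object slots from salient regions extracted by a frozen backbone and uses Hungarian matching to establish identity correspondence across frames. Without any learnable temporal-modeling parameters, it achieves competitive performance on several datasets.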
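
The frame-to-frame identity matching mentioned in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: the slot vectors, the cosine-similarity score, and the function names are hypothetical, and the optimal one-to-one assignment that Hungarian matching would compute is found here by brute force over permutations, which is only practical for a handful of slots. A real pipeline would more likely call a dedicated solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def match_slots(prev_slots, curr_slots):
    """Return a list where entry i is the index of the current-frame slot
    assigned to previous-frame slot i, maximizing total similarity.

    Brute force over all permutations: exact, but only viable for small n.
    """
    n = len(prev_slots)
    best_perm, best_score = None, float("-inf")
    for perm in permutations(range(n)):
        score = sum(cosine(prev_slots[i], curr_slots[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)


# Hypothetical 2-D slot features for two consecutive frames.
prev_slots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_slots = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.1]]
assignment = match_slots(prev_slots, curr_slots)
print(assignment)  # [2, 0, 1]
```

Reordering the current-frame slots by `assignment` keeps each slot index tied to the same object over time, which is the role the summary attributes to matching in place of a learned dynamics module.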

2605.03639 2026-05-12 cs.CV Updated version

Diffusion Masked Pretraining for Dynamic Point Cloud

Zhuoyue Zhang, Jihua Zhu, Chaowei Fang, Jian Liu, Ajmal Saeed Mian

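Affiliations * Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper proposes DiMP, a unified self-supervised pretraining framework for dynamic point clouds. By introducing diffusion models, it addresses two problems in existing masked-reconstruction objectives: leakage of spatio-temporal positions and loss of motion uncertainty. DiMP uses diffusion modeling for both position inference and motion learning: it predicts clean point-cloud centers from the visible spatio-temporal context, which improves position representations, and it recasts inter-frame displacement supervision as noise prediction in a conditional diffusion model, which captures the conditional distribution of motion more completely. Experiments show that DiMP significantly improves performance on several downstream tasks.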

2604.17565 2026-05-12 cs.CV Updated version

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang

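Affiliations * ReLER, CCAI, Zhejiang University; DBMI, HMS, Harvard University

AI summary UniGeo is a camera-controllable image editing framework that aims to generate geometrically consistent views of a scene under different camera viewpoints. It injects geometric guidance in a unified way at the representation, architecture, and loss-function levels, addressing the geometric drift and structural degradation that existing methods show under continuous camera motion. Experiments show that UniGeo significantly outperforms existing methods on several public datasets, with higher visual quality and geometric consistency.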

2604.06720 2026-05-12 cs.CV Updated version

Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

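Affiliations * State Key Laboratory of ISN, Xidian University; MagicLeap

AI summary This paper presents DeSOPE, a large-scale dataset for 6-degree-of-freedom (6DoF) pose estimation of deformed objects. Conventional 6D pose estimation methods typically assume that objects are rigid or articulated, but in practice objects deviate from their canonical shape through wear, impact, or deformation, which causes these methods to fail. DeSOPE contains high-precision 3D scans of 26 categories of common objects in a canonical form and three deformed states, together with 133K RGB-D frames and 665K pose annotations, providing an important resource for research on pose estimation of deformed objects.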

Comments Accepted at CVPR 2026

2604.04306 2026-05-12 cs.CV cs.AI Updated version

HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data

Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Charalambos Kontoes

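Affiliations * National Observatory of Athens; National Technical University of Athens; National and Kapodistrian University of Athens; Athena Research Center

AI summary As climate-related disasters become more frequent, the need for real-time monitoring and early warning grows more urgent. This paper presents HighFM, a foundation model for multispectral remote-sensing data with high temporal resolution. Using more than 2 TB of SEVIRI satellite imagery, it adapts the masked autoencoder framework to learn robust spatio-temporal representations, and it outperforms both conventional methods and recent geospatial foundation models on cloud detection and fire detection, demonstrating the potential of geostationary satellite data for real-time remote-sensing applications.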

2603.21901 2026-05-12 cs.CV Updated version

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

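Affiliations * University of Electronic Science and Technology of China; University of Chinese Academy of Sciences; Technical University of Munich; Shanghai Jiao Tong University; National University of Singapore

AI summary CLEAR is a mask-free, end-to-end framework for video subtitle removal that aims to separate subtitles from background content while preserving temporal consistency. It uses a two-stage design: the first stage learns disentangled subtitle representations through self-supervised orthogonality constraints, and the second stage uses LoRA-based parameter fine-tuning and a generative feedback mechanism for dynamic context adjustment, enabling adaptive inference without ground-truth masks. CLEAR is parameter-efficient and generalizes across languages: requiring only 0.77% of the base diffusion model's parameters, it surpasses mask-dependent baselines on several Chinese subtitle datasets and shows strong zero-shot generalization across six languages.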

Comments Accepted by ICML 2026 (Spotlight)

2603.16964 2026-05-12 cs.CV cs.LG Updated version

Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

Niklas Roßberg, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Wolfgang Utschick, Michael Botsch

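Affiliations * Technische Hochschule Ingolstadt; AUDI AG; Technische Universität München

AI summary This study aims to extract scenarios from highway traffic data in a standardized way and to cluster them using domain knowledge, in support of behavior assessment for automated driving systems. It proposes a scenario extraction method based on the "scenario-as-specification" concept and combines it with a CVQ-VAE model to perform domain-knowledge-guided clustering, improving the interpretability and consistency of scenario categorization. Experiments show that the method reliably extracts scenarios from real-world data and integrates domain knowledge effectively, providing a more efficient and standardized scenario categorization framework for validating automated driving systems.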

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

2603.11969 2026-05-12 cs.CV Updated version

AstroSplat: Physics-Based Gaussian Splatting for Rendering and Reconstruction of Small Celestial Bodies

Jennifer Nolan, Travis Driver, John Christian

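Affiliations * Georgia Institute of Technology

AI summary This paper presents AstroSplat, a physics-based Gaussian splatting framework for rendering and reconstructing the surfaces of small celestial bodies such as asteroids. It introduces planetary reflectance models that explicitly model surface material properties and their interaction with illumination, overcoming the limited physical expressiveness of conventional spherical-harmonics appearance parameterizations. Experiments show that AstroSplat delivers better rendering quality and surface reconstruction accuracy on real images from NASA's Dawn mission.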

Comments 10 pages, 6 figures, conference

2603.11566 2026-05-12 cs.CV Updated version

R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection

Zhongyu Xia, Yousen Tang, Yongtao Wang, Zhifeng Wang, Weijun Qin

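Affiliations * Wangxuan Institute of Computer Technology, Peking University; EBTech Co. Ltd

AI summary This paper proposes R4Det, a 4D radar-camera fusion method for improving 3D object detection in autonomous driving. To address the weaknesses of existing methods in depth estimation, temporal fusion, and small-object detection, R4Det introduces a panoramic depth fusion module to improve depth estimation accuracy, designs a deformable gated temporal fusion module that does not depend on the vehicle pose, and builds an instance-guided dynamic refinement module to improve small-object detection. Experiments show that R4Det achieves state-of-the-art 3D detection results on the TJ4DRadSet and VoD datasets.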

Comments Accepted to CVPR 2026

2603.10165 2026-05-12 cs.CL cs.AI cs.CV cs.LG Updated version

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

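AI summary OpenClaw-RL is a reinforcement learning framework that optimizes agents online using "next-state" signals such as user feedback, tool outputs, and interface state changes. On the infrastructure side, it uses a server-client architecture that separates signal extraction from policy optimization to improve training efficiency. On the method side, it proposes a hybrid reinforcement learning objective that combines sparse but fine-grained instructive signals with widely available evaluative signals to stabilize learning. The study demonstrates OpenClaw-RL on both personalized and general agent tasks, with particularly strong results on long-horizon tasks.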

Comments Code: https://github.com/Gen-Verse/OpenClaw-RL

2603.09465 2026-05-12 cs.CV cs.AI Updated version

EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation

Jiajun Cao, Xiaoan Zhang, Xiaobao Wei, Liyuqiu Huang, Zijian Wang, Hanzhen Zhang, Zhengyu Jia, Wei Mao, Hao Wang, Xianming Liu, Shuchang Zhou, Yang Wang, Shanghang Zhang

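Affiliations * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; XPeng Motors

AI summary This paper proposes EvoDriveVLA, a collaborative perception-planning distillation framework that addresses two problems of vision-language-action models in autonomous driving: degraded perception after the visual encoder is unfrozen, and unstable long-horizon planning. The method combines self-anchored perceptual constraints with future-aware trajectory optimization: a self-anchored teacher guides the student to attend to key regions, and a future-aware guiding teacher performs trajectory refinement and uncertainty modeling, improving both perception and planning. Experiments show that EvoDriveVLA achieves strong performance on the nuScenes and NAVSIM datasets.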

Comments 19 pages, 5 figures, 5 tables

2603.03239 2026-05-12 cs.CV Updated version

COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data

Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, Mikolaj Czerkawski

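Affiliations * School of Engineering, University of Edinburgh; European Space Agency (ESA); Asterisk Labs

AI summary This study proposes COP-GEN, a multimodal latent diffusion transformer for generating Copernicus Earth observation data. It models the joint distribution of different sensor modalities (optical, radar, elevation, and land cover) at their native spatial resolutions. By parameterizing cross-modal mappings as conditional distributions, COP-GEN supports flexible any-to-any conditional generation, including zero-shot modality translation without task-specific retraining. Experiments show that the model produces diverse, physically consistent observations while maintaining high peak fidelity, and that it clearly outperforms existing methods on the constructed benchmark dataset.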

2602.05391 2026-05-12 cs.CV Updated version

Efficient Dataset Distillation for Pre-Trained Self-Supervised Models via Statistical Flow Matching

Qianxin Xia, Jiawei Du, Xin Zhang, Yuhan Zhang, Jielei Wang, Guoming Lu

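Affiliations * University of Electronic Science and Technology of China

AI summary This paper studies how to efficiently distill datasets for pre-trained self-supervised models, producing a synthetic dataset that is small but performs close to the original. To avoid the high compute and memory cost of conventional methods, the authors propose a method based on statistical flow matching, which optimizes synthetic images by aligning the statistical flows between target-class and non-target-class centers in the original data, sharply reducing resource requirements. Experiments show that, while matching or improving performance, the method uses 10x less GPU memory and 4x less runtime than existing methods; the authors also propose a classifier inheritance strategy to further improve efficiency and performance.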

2602.04712 2026-05-12 cs.CV cs.AI eess.IV Updated version

SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation

David F. Ramirez, Tim Overman, Kristen Jaskie, Joe Marvin, Andreas Spanias

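Affiliations * SenSIP Center, School of ECEE, Arizona State University; Prime Solutions Group Inc

AI summary This paper proposes SAR-RAG, an AI assistant for automatic target recognition (ATR) in synthetic aperture radar (SAR) imagery based on visual-context image retrieval-augmented generation (ImageRAG). The method combines a multimodal large language model (MLLM) with a vector database of semantic embeddings and retrieves image examples of known target types to improve the recognition accuracy for military vehicles in SAR images. Experiments show that SAR-RAG outperforms a conventional MLLM approach on retrieval, classification, and size-regression metrics, improving ATR performance.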

Comments Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

2601.22143 2026-05-12 cs.GR cs.CV Updated version

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

Anthony Chen, Naomi Ken Korem, Gal Zeevi, Tavi Halperin, Matan Ben Yosef, Urska Jelercic, Ofir Bibi, Or Patashnik, Daniel Cohen-Or

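Affiliations * Tel Aviv University

AI summary This paper proposes JUST-DUB-IT, a video dubbing method based on a joint audio-visual diffusion model. A lightweight LoRA adapter lets the model generate, from an input video, dubbed speech in the target language together with synchronized facial motion. The method uses the generative model itself to synthesize multilingual paired videos as training data: it switches language within a single clip and then inpaints the face and audio. The result is high-quality dubbing that preserves speaker identity and lip synchronization and is more robust under complex motion and in real-world scenes.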

Comments Project webpage available at https://justdubit.github.io

2601.08321 2026-05-12 cs.CV Updated version

UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

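Affiliations * Sun Yat-sen University

AI summary With the rapid progress of image generation, visual text editing driven by natural-language instructions is attracting growing attention. The core challenge is to understand the instruction and the reference image accurately and to generate visual text that matches the style of the image. This paper proposes UM-Text, a unified multimodal model that uses a vision-language model (VLM) and a UM-Encoder to design text content and layout in detail, and that improves generation quality through a regional consistency loss and a three-stage training strategy. The authors also contribute UM-DATA-200K, a large-scale dataset of visual text images.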

Comments Accepted by AAAI 2026

2512.15977 2026-05-12 cs.CV Updated version

Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario, Mason J. Earles

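Affiliations * University of California, Davis

AI summary This study evaluates a range of open-source and closed-source vision-language models (VLMs) on agricultural image classification, covering 27 datasets, 162 classes, and 248,000 images. Zero-shot VLMs fall well behind the supervised YOLO11 baseline on most tasks, and performance drops further under open-ended prompting, where techniques such as semantic judging are needed to improve results. Although some open-source models such as Qwen-VL-72B come close to closed-source models, current VLMs are overall not ready to serve as standalone agricultural diagnostic systems and are better suited as assistive tools with constrained interfaces and domain knowledge.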

2512.06949 2026-05-12 cs.CV Updated version

Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu

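Affiliations * Mohamed bin Zayed University of AI; School of Information and Data Sciences; Vellore Institute of Technology; Loughborough University; MD Anderson Cancer Center, The University of Texas

AI summary In skin cancer diagnosis, segmenting histopathology images is essential for identifying tissue structures, but modeling spatial context and inter-tissue relationships remains difficult, especially where tissues overlap or look morphologically similar. This paper proposes Neural Tissue Relation Modeling (NTRM), a segmentation framework that adds a graph neural network to a convolutional neural network to model spatial and functional relationships between tissue types, improving the structural consistency of the segmentation. Experiments show that NTRM clearly outperforms existing methods on a non-melanoma skin cancer segmentation dataset, raising the Dice similarity coefficient by 4.9% to 31.25%, which demonstrates the potential of relational modeling for improving segmentation accuracy and interpretability.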

Comments CVPR 2026 Workshops
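
For readers unfamiliar with the Dice similarity coefficient reported in the summary above, a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), follows. The masks are made-up toy data, the function name is hypothetical, and returning 1.0 when both masks are empty is one common convention assumed here rather than anything stated in the paper.

```python
def dice(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 6-pixel masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
score = dice(pred, target)  # 2 * 2 / (3 + 3) = 0.666...
```

For multi-class tissue segmentation the coefficient is usually computed per class and then averaged; how the paper aggregates it is not stated in the summary.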

2511.23332 2026-05-12 cs.CV Updated version

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

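Affiliations * Beijing Institute of Technology; Wuhan University; Zhongguancun Academy; Hong Kong Polytechnic University

AI summary This paper proposes UniGeoSeg, a unified open-world segmentation framework for remote-sensing geospatial scenes that addresses the fragmented task definitions and limited instruction data of existing methods. The authors build the GeoSeg-1M dataset, which contains a large number of image-mask-instruction triplets, and design GeoSeg-Bench to evaluate understanding and reasoning in complex geospatial scenes. UniGeoSeg enables multi-task learning through task-aware text enhancement, latent knowledge memory, and a progressive training strategy; it performs well on several benchmarks and shows strong zero-shot generalization.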

Comments Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg ; Accepted by CVPR 2026

2511.07756 2026-05-12 cs.CV Updated version

Determinism of Randomness: Prompt-Residual Seed Shaping for Diffusion Generation

Song Yan, Wei Zhai, Chenfeng Wang, Xinliang Bi, Jian Yang, Yancheng Cai, Yusen Zhang, Yunwei Lan, Tao Zhang, GuanYe Xiong, Min Li, Zheng-Jun Zha

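Affiliations * USTC; Li Auto Inc.; Xi’an High-tech Research Institute; Wechat Vision; Cambridge University; HUST

AI summary Diffusion models start generation from an isotropic Gaussian latent, yet changing only the random seed produces marked differences in semantic faithfulness, composition, and visual quality. By analyzing the semantic mapping from initial noise to generated content, this paper identifies a geometric cause of seed sensitivity: most directions in latent space have little effect on semantics, while semantically sensitive variation is concentrated in a small subspace. Based on this finding, the authors propose a training-free prompt-residual seed shaping method that injects a tangential component associated with semantic change and pulls the seed back onto the shell of the original Gaussian distribution, improving alignment and quality while remaining compatible with the prior.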

2510.25372 2026-05-12 cs.CV cs.LG Updated version

Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers

M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty

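Affiliations * Department of Computational and Data Sciences, Indian Institute of Science; Accenture, Japan; Google, India; Indian Institute of Science

AI summary This paper studies how to prompt-tune vision transformers efficiently and generally in federated learning. To address the poor generalization of global prompt tuning and the overfitting of personalized tuning, the authors propose the PEP-FedPT framework. It introduces Class-Contextualized Mixed Prompts (CCMP), which dynamically combine class-specific prompts using global class prototypes and client class priors, achieving per-sample prompt personalization without storing client parameters. Experiments show that the method outperforms existing approaches on several datasets.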

Comments Accepted to TMLR 2026

2510.10606 2026-05-12 cs.CV Updated version

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

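Affiliations * The Chinese University of Hong Kong; Renmin University of China; The Hong Kong University of Science

AI summary ViSurf is a unified, single-stage fine-tuning method that aims to resolve the tension between knowledge injection and performance improvement in large vision-language models. It combines the strengths of supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) by injecting ground-truth labels directly into the RLVR process, so that external supervision and internal reinforcement are optimized together. ViSurf also introduces three new reward control strategies to keep training stable. Experiments show that it outperforms SFT alone, RLVR alone, and the conventional two-stage approach on several benchmarks.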

2510.04142 2026-05-12 cs.CV cs.AI cs.LG Updated version

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments

Xiaoyu Yang, En Yu, Wei Duan, Jie Lu

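Affiliations * Australian Artificial Intelligence Institute (AAII); Faculty of Engineering and Information Technology; University of Technology Sydney; Australia

AI summary This paper studies robust reasoning alignment from multiple multimodal large language models in non-stationary multi-stream environments. To handle the systematic bias caused by source models whose reasoning distributions drift over time, the authors propose Autonomous Preference Optimization (APO), a constraint-satisfaction framework that treats disagreements between models as dynamic negative constraints and aligns in two stages: supervised bootstrapping first gives the target model the joint capabilities of the source models, and constraint-aware optimization then produces a consistent consensus manifold. Experiments show superior robustness on chest X-ray interpretation, and the authors release CXR-MAX, a benchmark dataset containing reasoning trajectories from seven multimodal large models.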

Comments ICML 2026

2510.03895 2026-05-12 cs.RO cs.CV Updated version

NoTVLA: Semantics-Preserving Robot Adaptation via Narrative Action Interfaces

Zheng Huang, Mingyu Liu, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Ye Lin, Xiaoman Li, Yiduo Jia, Hao Zhong, Hao Chen, Chunhua Shen

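Affiliations * Zhejiang University

AI summary This study proposes NoTVLA, a semantics-preserving robot adaptation framework that targets the catastrophic forgetting that vision-language-action (VLA) models face in real-world deployment. Its core idea is to focus on sparse trajectories rather than dense action sequences, combined with temporal compression and spatial reasoning pruning, to optimize trajectory planning and reduce compute requirements. In multi-task evaluations NoTVLA outperforms existing models while using substantially less compute, does not require a wrist camera, and supports cross-platform deployment and zero-shot generalization.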

2508.20325 2026-05-12 cs.CL cs.AI cs.CV Updated version

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

Haibo Jin, Ruoxi Chen, Peiyan Zhang, Andy Zhou, Zelei Cheng, Haohan Wang

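Affiliations * Hong Kong University of Science and Technology; Lapis Labs; Capital One

AI summary As large language models (LLMs) are deployed ever more widely, their potential to generate harmful content has raised societal and regulatory concern. To verify whether LLMs comply with government-issued ethics guidelines, this paper proposes GUARD, which automatically generates guideline-violating questions and combines them with jailbreak diagnostics to assess how well a model adheres to the guidelines. The method identifies responses that directly violate the guidelines and also uncovers scenarios that may bypass safety mechanisms. It has been validated empirically on several mainstream LLMs.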

Comments 56 pages

2508.06248 2026-05-12 cs.CV Updated version

Deepfake Detection that Generalizes Across Benchmarks

Andrii Yermakov, Jan Cech, Jiri Matas, Mario Fritz

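Affiliations * Czech Technical University in Prague; CISPA Helmholtz Center for Information Security

AI summary This paper studies how to make deepfake detection generalize to unseen forgery techniques. It proposes GenD, which fine-tunes only the layer-normalization parameters of a pre-trained vision encoder (0.03% of all parameters) and combines this with L2 normalization and metric learning. Experiments show that the method achieves state-of-the-art results on 14 benchmark datasets from different years, demonstrating strong cross-dataset detection while keeping the model simple.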

2505.20381 2026-05-12 cs.CV Updated version

ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao

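Affiliations * National Key Laboratory of Science and Technology on Multi-spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

AI summary ReaMOT is a reasoning-based multi-object tracking task in which targets specified by language instructions are tracked through logical reasoning, removing the reliance of existing methods on explicit visual-text matching. The authors introduce the ReaMOT Challenge benchmark, which contains a large number of language instructions and video sequences, and design the ReaTrack framework, which combines a large language model with motion priors for more robust tracking. Experiments show that ReaTrack brings clear gains on high-level reasoning tasks.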

Comments Code: https://github.com/chen-si-jia/ReaMOT

2505.20001 2026-05-12 cs.CV Updated version

NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li, Huaibo Huang, Junxian Duan, Aihua Zheng, Jin Tang, Jixin Ma

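Affiliations * State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation; School of Artificial Intelligence; Anhui University; State Key Laboratory of Multimodal Artificial Intelligence Systems; New Laboratory of Pattern Recognition; CASIA; University of Chinese Academy of Sciences; School of Computing and Mathematical Sciences; University of Greenwich

AI summary This paper studies multi-modal object re-identification, which aims to obtain complete identity features from heterogeneous modalities. Existing methods rely on implicit feature fusion and struggle to model fine-grained recognition patterns; to address this, the authors propose NEXT, a multi-grained mixture-of-experts framework based on text modulation. The method generates high-quality descriptive text using attribute confidence and splits the recognition task into semantic and structural branches that capture fine-grained appearance features and coarse-grained structural features, respectively. A multi-grained feature aggregation strategy then produces a more accurate identity representation. Experiments show that the method clearly outperforms existing state-of-the-art methods on several datasets.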

2505.19519 2026-05-12 cs.CV Updated version

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim, Hyungjin Park, Taesup Kim

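Affiliations * Graduate School of Data Science; Seoul National University

AI summary This paper studies how to personalize text-to-image diffusion models without causing distributional drift. The authors observe that existing methods tend to overfit the reference images and ignore the user prompt during personalization, and they trace this to a failure to ensure image fidelity and text alignment at the same time. They propose an optimization objective based on Lipschitz regularization that constrains parameter updates and keeps the output distribution of the pre-trained model stable, so that the model adapts accurately to new concepts while retaining its original generative ability. Experiments show superior quantitative and qualitative results across several diffusion model architectures.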

Comments Accepted at ICLR 2026

2504.02373 2026-05-12 eess.IV cs.CV Updated version

HPGN: Hybrid Priors-Guided Network for Compressed Low-Light Image Enhancement

Hantang Li, Qiang Zhu, Xiandong Meng, Lei Xiong, Shuyuan Zhu, Xiaopeng Fan

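Affiliations * Harbin Institute of Technology, Shenzhen; Pengcheng Laboratory; University of Electronic Science and Technology of China

AI summary In practice, low-light images are usually compressed for efficient storage and transmission, yet most existing methods either ignore the removal of compression artifacts or fail to provide a unified enhancement framework. This paper proposes the Hybrid Priors-Guided Network (HPGN), which combines compression priors and illumination priors. By using the JPEG quality factor and the DCT quantization matrix to guide module design, it jointly enhances low-light images compressed at different quality levels. Experimental results show clear gains in image quality.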

Comments 5 pages, 3 figures

2411.18111 2026-05-12 cs.CV Updated version

When Large Vision-Language Models Meet Person Re-Identification

Qizao Wang, Bin Li, Xiangyang Xue

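Affiliations * School of Computer Science, Fudan University, Shanghai, China

AI summary This paper studies how to apply large vision-language models (LVLMs) to person re-identification (ReID). Conventional ReID relies on extracting discriminative identity features, whereas LVLMs excel at cross-modal understanding and generation. The authors propose the LVLM-ReID framework, which uses instructions to guide an LVLM to generate semantic tokens that capture the key appearance semantics of a pedestrian, strengthens the link between semantic and visual features with a semantic-guided interaction module, and uses the reinforced semantic tokens as the identity representation. The method achieves competitive performance on several benchmarks without additional image-text annotations, showing the potential of LVLM-generated semantics for improving ReID.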

Comments Accepted by ICASSP 2026

2411.04077 2026-05-12 cs.CV Updated version

H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

Nhi Pham, Michael Schott

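Affiliations * Max Planck Institute for Informatics; Saarland University; Zuse School

AI summary This paper proposes H-POPE, a benchmark based on hierarchical polling-based probing evaluation, for systematically assessing hallucinations in large vision-language models at the levels of object existence and attributes. Its coarse-to-fine evaluation reveals that models hallucinate more readily on fine-grained attributes. The study further examines whether models rely on the visual input when generating text, offering a new perspective on how vision-language models generate output.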

Comments Poster at https://sites.google.com/berkeley.edu/bb-stat/home

2410.10247 2026-05-12 cs.CV cs.AI Updated version

LPT: Less-overfitting Prompt Tuning for Vision-Language Model

Chenhao Ding, Xinyuan Gao, Songlin Dong, Jizhou Han, Qiang Wang, Zhengdong Zhou, Yuhang He, Yihong Gong

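Affiliations * IEEE

AI summary This study targets the overfitting that vision-language models are prone to during transfer and proposes LPT, a lightweight prompt tuning framework. Its core components are the use of CLIP to filter fine-grained foreground information, which guides prompt generation toward basic visual concepts, and the introduction of feature-level structure-preserving constraints and output-level hierarchical logic constraints to strengthen generalization. Experiments show that LPT significantly improves generalization on several benchmark tasks and mitigates overfitting.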

2006.02666 2026-05-12 eess.IV cs.CV Updated version

Deep Sequential Feature Learning in Clinical Image Classification of Infectious Keratitis

Yesheng Xu, Ming Kong, Wenjia Xie, Runping Duan, Zhengqing Fang, Yuxiao Lin, Qiang Zhu, Siliang Tang, Fei Wu, Yu-Feng Yao

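Affiliations * Department of Ophthalmology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine; College of Computer Science and Technology, Zhejiang University

AI summary This paper addresses clinical image classification of infectious keratitis and proposes a sequence-level deep learning model designed to distinguish subtle differences between infectious corneal lesions. By preserving the spatial structure of clinical images and extracting key features, the method substantially improves classification performance. In experiments, the model reaches a diagnostic accuracy of 80.00% on 120 test images, well above the 49.27% average of 421 ophthalmologists, showing its potential for assisted diagnosis.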

Comments Accepted by Engineering

2605.10111 2026-05-12 cs.LG cs.AI cs.CV Updated version

CFSPMNet: Cross-subject Fourier-guided Spatial-Patch Mamba Network for EEG Motor Imagery Decoding in Stroke Patients

Xiangkai Wang, Yun Zhao, Dongyi He, Qingling Xia, Gen Li, Xinlai Xing, Yuchi Pan, Bin Jiang

Affiliations * School of Artificial Intelligence, Chongqing University of Technology; Chongqing Key Laboratory of Embodied Intelligence Perception and Autonomous Learning for Humanoid Robots; Key Laboratory of Advanced Equipment Intelligence of the Chongqing Education Commission; School of Smart Health, Chongqing Polytechnic University of Electronic Technology; Department of Language Science and Technology, The Hong Kong Polytechnic University; School of Pharmacy and Bioengineering, Chongqing University of Technology

AI summary This study addresses cross-subject decoding for brain-computer interfaces (BCIs) in stroke patients and proposes CFSPMNet, a neural network framework. The method combines Fourier-domain state reorganization with a shared-private prototype matching mechanism and models the latent organization of neural states, improving the accuracy and robustness of cross-subject motor imagery EEG (MI-EEG) decoding. Experiments show that CFSPMNet outperforms mainstream methods on two stroke MI-EEG datasets.

2605.10106 2026-05-12 cs.CV cs.AI Updated version

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

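Affiliations * Fudan University; Bosch Center for Artificial Intelligence (BCAI); Tongji University

AI summary This paper proposes ViSRA, a video-based 3D spatial reasoning agent that aims to improve the spatial reasoning of multi-modal large language models (MLLMs). ViSRA requires no additional training: it uses explicit spatial information from expert models to guide the model's spatial reasoning in a modular, extensible way, giving a flexible plug-and-play framework. It performs well on several existing benchmarks and on unseen 3D spatial tasks, with absolute gains over the baselines of 15.6% and 28.9%, respectively, and it offers transferable 3D understanding at low computational cost.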

2605.10087 2026-05-12 cs.CV Updated version

Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction

Guhnoo Yun, Juhan Yoo, Kijung Kim, Dong Hwan Kim

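Affiliations * Korea Institute of Science and Technology; Department of Computer Science, Semyung University

AI summary This paper proposes a framework for detecting the initiation of interaction in human-robot interaction (HRI) from nonverbal cues, based on the fusion of audio and visual sensors, for robots in home environments. The framework combines sound source localization with human tracking to detect interaction initiation when the user gazes at the robot: even if the user is not speaking directly to it, interaction intent is recognized once the gaze duration exceeds a preset threshold. The authors design a state-transition model and validate it experimentally on a mobile robot, with all modules integrated in ROS.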
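
The gaze-duration rule described in this summary can be sketched as a small two-state detector. This is a simplified illustration under assumptions, not the paper's state-transition model: the sample format, the function name, and the threshold value are hypothetical, and the audio (sound source localization) branch is omitted.

```python
def detect_initiation(gaze_samples, threshold_s):
    """Return the timestamp at which a continuous gaze first lasts longer
    than threshold_s seconds, or None if that never happens.

    gaze_samples: time-ordered list of (timestamp_s, is_gazing) pairs.
    """
    gaze_start = None  # None means the user is currently not gazing.
    for t, is_gazing in gaze_samples:
        if is_gazing:
            if gaze_start is None:
                gaze_start = t  # transition: idle -> gazing
            if t - gaze_start > threshold_s:
                return t  # transition: gazing -> interaction initiated
        else:
            gaze_start = None  # gaze broken, reset to idle
    return None


# Hypothetical tracker output sampled every 0.5 s; the gaze breaks at t = 1.0.
samples = [(0.0, True), (0.5, True), (1.0, False), (1.5, True),
           (2.0, True), (2.5, True), (3.0, True), (3.5, True)]
triggered_at = detect_initiation(samples, threshold_s=1.0)
print(triggered_at)  # 3.0
```

Resetting the timer whenever the gaze breaks means brief glances do not accumulate, which matches the summary's requirement that the gaze duration exceed a preset threshold.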

2605.10079 2026-05-12 cs.CV Updated version

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato

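Affiliations * The University of Tokyo; Shanda AI Research Tokyo

AI summary This paper proposes SocialDirector, a training-free interaction controller that improves control over social interactions in multi-person video generation. By modulating cross-attention maps, it controls precisely who performs an action, when, and toward whom, resolving problems of existing models such as mismatches between persons and actions and confused social dynamics. The authors also build an automated evaluation pipeline; experiments show that SocialDirector markedly improves the realism of interactions in generated videos, approaching that of real videos.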

2605.10071 2026-05-12 cs.CV Updated version

MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma, Meng Wang

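Affiliations * Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences); School of Computing, National University of Singapore; Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); Key Laboratory of Computer Vision and System, Ministry of Education, Tianjin University of Technology

AI summary With the rapid progress of highly realistic face generation, generalizable methods for face forgery detection and localization have become especially important. This paper proposes the Multi-domain Fine-grained Vision-Language Reconstruction model (MFVLR), which uses language-guided fine-grained representation learning to capture visual forgery traces across multiple domains, enabling generalizable detection and localization of faces forged by diffusion models. The model introduces a fine-grained language transformer, a multi-domain vision encoder, and a vision decoder, along with a new vision injection module, and it markedly improves performance in cross-generator, cross-forgery, and cross-dataset settings.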

2605.10054 2026-05-12 cs.CV Updated version

Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging

Zubair Faruqui, Rahul Dubey

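Affiliations * Department of Computer Science, Missouri State University

AI summary This study addresses the tendency of deep neural networks in medical image diagnosis to over-rely on clinically irrelevant features. It proposes adding explanation supervision directly during training to steer the model toward clinically meaningful regions. The study systematically analyzes how different explanation loss designs and supervision strengths affect predictive performance and explanation faithfulness, and it introduces two new quantitative metrics for assessing explanation quality. Experiments show that the method markedly improves the clinical relevance of explanations while preserving accuracy, and that it applies to a variety of annotated biomedical imaging tasks.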

Comments Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)

2605.10050 2026-05-12 cs.CV Updated version

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko

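Affiliations * KU Leuven; Shanghai Jiaotong University; Weill Cornell Medicine

AI summary Video large language models (VideoLLMs) struggle with long videos: dense sampling produces a large number of visual tokens, while sparse sampling can miss key temporal information and cause hallucinations. This paper proposes EchoPrune, a lightweight, training-free token pruning method that interprets redundant tokens as temporal echoes and scores tokens by cross-modal relevance and temporal reconstruction error, increasing temporal resolution under a fixed token budget. Experiments show that EchoPrune lets VideoLLMs process 20x as many frames under the same token budget and improves both performance and inference speed on several benchmarks.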

Comments 9 pages

2605.10046 2026-05-12 cs.CV cs.LG cs.MA Updated version

PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows

Yufeng Zhu, Chunlei Shi, Yongchao Feng, Dan Niu

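Affiliations * Department of Automation, Southeast University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

AI summary This paper proposes PixelFlowCast, a precipitation nowcasting method that aims for efficient, accurate short-term radar echo prediction without latent-space compression. It uses a two-stage framework: in the first stage, a deterministic model produces a coarse forecast that captures the overall evolution; in the second stage, KANCondNet extracts deep spatio-temporal features for precise conditioning, and a predictor based on pixel mean flows generates high-quality forecasts in a few steps. Experiments show that PixelFlowCast outperforms mainstream methods in both forecast accuracy and inference efficiency, particularly on long-sequence prediction.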

Comments 26 pages, 7 figures

2605.10045 2026-05-12 cs.CV Updated version

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Feihong Yan, Shaoyu Liu, Haixuan Wang, Shuai Lu, Linfeng Zhang, Huiqi Li, Xiangyang Ji

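Affiliations * Beijing Institute of Technology; Xidian University; Northeastern University at Qinhuangdao; Shanghai Jiao Tong University; Department of Automation, Tsinghua University

AI summary Visual autoregressive (VAR) models are a strong alternative to diffusion models for image generation, but their fixed training resolution prevents direct generation at higher resolutions. This paper proposes ExtraVAR, which introduces a stage-aware RoPE remapping strategy to address the global repetition, local repetition, and detail degradation that appear when VAR models extrapolate in resolution. It further proposes an entropy-driven adaptive attention calibration to account for how attention distributions change at high resolution. Experiments show that the method outperforms existing approaches in both structural consistency and detail fidelity.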

Comments 10 pages, 7 figures

2605.10029 2026-05-12 cs.CV Updated version

Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities

Shuyang Hou, Ziqi Liu, Haoyue Jiao, Zhangyan Xu, Xiaopu Zhang, Lutong Xie, Yaxian Qing, Jianyuan Liang, Xuefeng Guan, Huayi Wua

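Affiliations * State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing

AI summary This study evaluates AlphaEarth Foundations (AEF), a globally consistent, high-resolution land-surface embedding dataset, for slum detection and density estimation in 12 cities worldwide. Across several training strategies and auxiliary feature configurations, same-city cross-year training works best, and the study reveals limitations of AEF in delineating slum boundaries and in modeling sub-pixel density gradients. It also finds that POI features markedly improve density estimation and shows that AEF preserves structure in long-term slum monitoring.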

2605.10026 2026-05-12 cs.CV Updated version

MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving

Xiaohu Lu, Hamed Khatounabadi, Hayder Radha

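Affiliations * Electrical and Computer Engineering; Michigan State University

AI summary As autonomous driving advances, annotated multi-modality datasets are becoming more plentiful, making it possible to adapt 3D object detection to new environments without manual annotation. Conventional domain adaptation methods, however, usually handle only a single source or a single modality and cope poorly with multi-source, multi-modality settings. This paper proposes a multi-source multi-modality unsupervised domain adaptive 3D object detection framework for autonomous driving. It introduces a hierarchical spatially conditioned domain classifier and a prototype-graph weighted fusion strategy to align features across sources and modalities; experiments show that it outperforms existing state-of-the-art methods on several mainstream datasets.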

2605.10009 2026-05-12 cs.CV Updated version

Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan

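Affiliations * School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China; School of Electronic Engineering, Xidian University, Xi’an, Shaanxi, China

AI summary This paper proposes Hystar, a lightweight framework that addresses the distribution shift caused by diverse query styles in query-based image retrieval (QBIR). A hypernetwork dynamically generates singular-value perturbations for the attention layers, adapting to the style of each query, while a static singular-value shift keeps behavior stable across styles. Hystar also introduces StyleNCE, a contrastive loss based on optimal transport, to strengthen cross-style semantic discrimination. Experiments show that the method outperforms existing approaches on multi-style retrieval and cross-style classification while being parameter-efficient and stable across styles.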

Comments Accepted by ICLR2026

2605.10008 2026-05-12 physics.optics cs.CV cs.ET Updated version

Measurement-Adapted Eigentask Representations for Photon-Limited Optical Readout

Tianyang Chen, Mandar M. Sohoni, Saeed A. Khan, Jérémie Laydevant, Shi-Yuan Ma, Tianyu Wang, Peter L. McMahon, Hakan E. Türeci

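Affiliations * Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA; School of Applied and Engineering Physics, Cornell University, Ithaca, NY 14853, USA; USRA Research Institute for Advanced Computer Science, Mountain View, CA 94035, USA; Kavli Institute at Cornell for Nanoscale Science, Cornell University, Ithaca, NY 14853, USA

AI summary Under low-light conditions, optical readout is limited by photon noise, detector noise, and quantization error, which degrade the accuracy of downstream classification and decision-making. This paper proposes an eigentask representation, based on the resolvability of features, that gives a noise-adapted feature representation of optical sensor outputs. Experiments show that the method clearly outperforms conventional approaches such as principal component analysis when the photon budget is limited, samples are scarce, and the task is complex, improving both classification performance and learning efficiency.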

Comments 15+14 pages, 4+9 figures, 55 references

2605.10002 2026-05-12 cs.CV Updated version

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son, Mai Huy Thong, Quang Huy Nguyen, Thanh Trung Nguyen, Reza Farahbakhsh, Noel Crespi, Phi Le Nguyen

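Affiliations * AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Military Central Hospital, Vietnam

AI summary This study presents Med-StepBench, the first large-scale benchmark for evaluating the step-by-step reasoning of medical vision-language models on 3D PET/CT imaging, designed to detect hallucinations in which a model produces a clinically plausible but incorrect diagnosis. The framework decomposes clinical reasoning into four diagnostic stages and, using more than 12,000 images and 1 million image-statement pairs, exposes systematic failures of existing models in multi-step reasoning. The study also shows that current models are highly sensitive to plausible but misleading intermediate explanations, which further amplifies hallucination risk, and it provides a basis for building safer and more reliable medical vision-language models.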

Comments Accepted at IJCAI-ECAI 2026

2605.09996 2026-05-12 cs.CV Updated version

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon

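Affiliations * Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Artificial Intelligence, University of Seoul

AI summary This paper presents Omni-Persona, the first comprehensive benchmark for omnimodal personalization, for systematically evaluating and improving joint personalization over text, images, and audio. The benchmark formalizes the task as a "persona modality graph", covers four task groups and 18 fine-grained tasks, and introduces a calibrated accuracy (Cal) metric that jointly measures correct grounding and appropriate abstention. Experiments reveal a gap between audio and visual grounding in open-source models, show that parameter scale and recall are not reliable diagnostic indicators, and highlight the distinct limitations of supervised fine-tuning and reward-based reinforcement learning for personalization.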

Comments Project Page: https://github.com/oyt9306/Omni-Persona

2605.09984 2026-05-12 cs.CV cs.AI cs.LG Updated version

Geometric 4D Stitching for Grounded 4D Generation

Sunwoo Park, Taesung Kwon, Jong Chul Ye

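Affiliations * KAIST AI

AI summary This paper proposes Geometric 4D Stitching, an efficient framework that addresses the geometric inconsistency and high reconstruction cost of existing 4D scene generation methods. It explicitly identifies missing geometric regions and fills them in with geometrically grounded 4D stitching, improving the efficiency of 4D scene generation while maintaining geometric consistency. The method also supports iterative expansion of the 4D mesh and scene editing.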

2605.09982 2026-05-12 cs.CV Updated version

ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

Yuna Lee, Kyoungho Min, Yulhwa Kim

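Affiliations * Department of Electrical and Computer Engineering, Sungkyunkwan University, Republic of Korea; Department of Semiconductor Systems Engineering, Sungkyunkwan University, Republic of Korea

AI summary This paper proposes ERASE, a two-stage visual token pruning framework that reduces the computational burden caused by the large number of visual tokens that vision-language models produce for high-resolution images. Its adaptive pruning strategy identifies and retains the key visual tokens according to the complexity of the input image, cutting the token count substantially while preserving model performance. In experiments on Qwen2.5-VL-7B, ERASE retains 89.46% of the original accuracy at an 85% pruning rate, outperforming the best existing method.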

Comments 20 pages, 8 figures

2605.09977 2026-05-12 cs.CV Updated version

INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI

Xiaotian Hu, Mingxuan Liu, Hongjia Yang, Juncheng Zhu, Yijin Li, Yifei Chen, Haoxiang Li, Tongxi Song, Zihan Li, Yingqi Hao, Ziyu Li, Yujin Zhang, Gang Ning, Yi Liao, Haibo Qu, Qiyuan Tian

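Affiliations * Beihang University; Tsinghua University; Sichuan University; University of Oxford

AI summary This study proposes INFANiTE, an implicit neural representation framework for efficiently learning a high-resolution spatio-temporal fetal brain atlas from clinical thick-slice MRI scans, removing the time-consuming slice-to-volume reconstruction and iterative registration steps of conventional pipelines. The method substantially speeds up atlas construction, and experiments show that it maintains high accuracy and biological plausibility even with sparse data, offering a practical route to large-scale analysis of fetal brain development.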

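2605.09976 2026-05-12 cs.CV Version update

OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui

Affiliations * School of Cyber Science and Engineering, Southeast University; Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education; Purple Mountain Laboratories, Nanjing; School of Computer Science and Engineering, Southeast University

AI summary This paper introduces a new task, online zero-shot temporal action localization (OZ-TAL), which aims to detect previously unseen action categories and their timing while a video stream is being processed. To address the limited generalization of existing methods on cross-domain video, the authors design a training-free framework that uses off-the-shelf vision-language models and adds mechanisms to strengthen visual representations and reduce their bias. Experiments show that the method substantially outperforms prior state-of-the-art approaches on THUMOS14 and ActivityNet-1.3, establishing a new benchmark and comparison baselines.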

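2605.09972 2026-05-12 cs.RO cs.CV Version update

HiDrive: A Closed-Loop Benchmark for High-Level Autonomous Driving

Zhongyu Xia, Guanyu Zhu, Guo Tang, Wenhao Chen, Yongtao Wang

Affiliations * Wangxuan Institute of Computer Technology, Peking University

AI summary HiDrive is a new closed-loop autonomous driving benchmark designed to address the shortcomings of existing benchmarks in scenario diversity, object variety, and assessment of driving capabilities. It places particular emphasis on long-tail scenarios, introduces a range of rare objects and complex traffic situations, and extends evaluation to high-level driving capabilities such as rule compliance, moral reasoning, and emergency decision-making. HiDrive uses a more advanced physics engine with realistic lighting and high-fidelity visual rendering, providing a more challenging testbed for how autonomous driving systems perform in complex real-world environments.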

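2605.09963 2026-05-12 cs.CV Version update

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

Affiliations * Nanyang Technological University, Singapore; Warsaw University of Technology, Poland

AI summary Existing self-supervised learning methods mainly learn object-invariant representations but tend to overlook the spatial structure and relationships among object parts. To address this, the paper proposes a spatially aware pretext task, spatial prediction (SP), which learns fine-grained spatial dependencies by predicting the relative position and scale between two decoupled local views of the same image. Experiments show significant gains across image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with improved robustness in out-of-distribution settings.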
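
A rough sketch of what a spatial prediction target could look like, assuming each local view is described by a crop box `(x_center, y_center, width, height)` in pixels. The function name and the choice of a normalized offset plus a log scale ratio are illustrative assumptions, not the paper's definition.

```python
import math


def relative_targets(box_a, box_b):
    """Return (dx, dy, log_scale) of box_b relative to box_a.

    Each box is (x_center, y_center, width, height) in pixels.
    The parameterization is a hypothetical choice for illustration.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Offset of view B's center, measured in units of view A's size.
    dx = (bx - ax) / aw
    dy = (by - ay) / ah
    # Log ratio of the two views' side lengths (square root of the area ratio).
    log_scale = math.log(math.sqrt((bw * bh) / (aw * ah)))
    return dx, dy, log_scale
```

A network trained on such a task would receive the two crops and regress these three numbers; only the crop boxes are needed to produce the labels, so no manual annotation is involved.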

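2605.09956 2026-05-12 cs.CV cs.AI Version update

SDTalk: Structured Facial Priors and Dual-Branch Motion Fields for Generalizable Gaussian Talking Head Synthesis

Peng Jia, Zhen Xiao, Jia Li, Xueliang Liu, Zhenzhen Hu, Lingyun Yu

Affiliations * Hefei University of Technology; University of Science and Technology of China

AI summary This paper proposes SDTalk, a one-shot 3D Gaussian Splatting (3DGS) framework for high-quality, real-time talking-head generation that generalizes to unseen identities without per-identity training. It introduces structured facial priors and dual-branch motion fields, which respectively improve the completeness of head reconstruction and the detail of facial dynamics, and it outperforms existing methods in both visual quality and inference efficiency.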

Comments 5 pages, 4 figures, 4 tables

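2605.09954 2026-05-12 cs.RO cs.CV Version update

JODA: Composable Joint Dynamics for Articulated Objects

Tianhong Gao, Cheng Yu, Yinghao Xu, Mengyu Chu

Affiliations * Peking University; Ant Group, Robbyant

AI summary This paper proposes JODA, a composable framework for generating joint-level dynamics that captures fine-grained mechanical behaviors such as friction hold, snapping, and soft close. JODA describes conservative force, dry friction, and damping over a joint's degrees of freedom with a structured three-channel field and combines it with shape-constrained piecewise cubic interpolation, yielding dynamics models that are expressive and compatible with differentiable simulation. The method supports inferring and optimizing joint dynamics from multimodal inputs and offers a unified interface for modeling, editing, and optimizing complex mechanical systems.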

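2605.09948 2026-05-12 cs.AI cs.CV cs.RO Version update

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

Boyang Shen, Kaixiang Yang, Hao Wang, Qiuyu Yu, Qiang Xie, Qiang Li, Zhiwei Wang

Affiliations * Huazhong University of Science and Technology; Wuhan United Imaging Surgical Co.,Ltd. (UIS)

AI summary Current vision-language-action (VLA) models typically treat the deepest representation of the vision-language backbone as the optimal input for action prediction, yet robotic manipulation requires frequent closed-loop spatial adjustments, and excessive abstraction can waste computation and weaken the low-level geometric cues needed for precise control. The paper proposes LoopVLA, a recurrent VLA architecture that jointly learns representation refinement, action prediction, and representation-sufficiency estimation. A shared Transformer block iteratively refines multimodal features and produces a candidate action and a sufficiency score at every step, so the model can decide dynamically whether further refinement is needed. Experiments show that LoopVLA maintains task success rates while markedly improving efficiency, with 45% fewer parameters and up to 1.7x higher inference throughput.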
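
The refine-then-check loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: `refine`, `act_head`, and `suff_head` are hypothetical callables standing in for the shared block, the action head, and the sufficiency estimator, and the fixed threshold is an assumed stopping rule rather than the paper's.

```python
def recurrent_refine(features, refine, act_head, suff_head,
                     threshold=0.9, max_steps=8):
    """Apply one shared block repeatedly and stop once the score is high enough.

    All callables and the threshold rule are hypothetical placeholders.
    Returns the last candidate action and the number of steps used.
    """
    action, steps_used = None, 0
    for _ in range(max_steps):
        features = refine(features)        # same block reused at every step
        action = act_head(features)        # candidate action for this step
        steps_used += 1
        if suff_head(features) >= threshold:
            break                          # representation judged sufficient
    return action, steps_used
```

Because the block is shared across steps, the parameter count stays fixed regardless of how many refinement steps a given input needs, which is consistent with the efficiency argument in the summary.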

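2605.09936 2026-05-12 cs.CV cs.IR cs.LG Version update

Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Yiwei Ou, Chung Ching Cheung, Jun Yang Ang, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini

Affiliations * University of Auckland; University of Pennsylvania; Stanford University; Harbin Institute of Technology, Shenzhen

AI summary This paper presents Urban-ImageNet, a large-scale multimodal dataset and evaluation framework for perceiving urban space from social media imagery. It contains 2 million public images with paired text from Weibo, covering 61 urban sites in 24 Chinese cities, and supports training and evaluation at scales from 1K to 2M. Built on a hierarchical taxonomy grounded in urban theory, Urban-ImageNet supports three tasks (urban scene semantic classification, cross-modal image-text retrieval, and instance segmentation) and is intended to evaluate how well AI models understand the social, functional, and spatial characteristics of urban space.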

Abstract

We present Urban-ImageNet, a large-scale multi-modal dataset and evaluation benchmark for urban space perception from user-generated social media imagery. The corpus contains over 2 Million public social media images and paired textual posts collected from Weibo across 61 urban sites in 24 Chinese cities across 2019-2025, with controlled benchmark subsets at 1K, 10K, and 100K scale and a full 2M corpus for large-scale training and evaluation. Urban-ImageNet is organized by HUSIC, a Hierarchical Urban Space Image Classification framework that defines a 10-class taxonomy grounded in urban theory. The taxonomy is designed to distinguish activated and non-activated public spaces, exterior and interior urban environments, accommodation spaces, consumption content, portraits, and non-spatial social-media content. Rather than treating urban imagery as generic scene data, Urban-ImageNet evaluates whether machine perception models can capture spatial, social, and functional distinctions that are central to urban studies. The benchmark supports three tasks within one standardized library: (T1) urban scene semantic classification, (T2) cross-modal image-text retrieval, and (T3) instance segmentation. Our experiments evaluate representative vision, vision-language, and segmentation models, revealing strong performance on supervised scene classification but more challenging behavior in cross-modal retrieval and instance-level urban object segmentation. A multi-scale study further examines how model performance changes as balanced training data increases from 1K, 10K to 100K images. Urban-ImageNet provides a unified, theory-grounded, multi-city benchmark for evaluating how AI systems perceive and interpret contemporary urban spaces across modalities, scales, and task formulations. Dataset and benchmark are available at: huggingface.co/datasets/Yiwei-Ou/Urban-ImageNet and github.com/yiasun/dataset-2.

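2605.09925 2026-05-12 cs.CV Version update

Frequency Adapter with SAM for Generalized Medical Image Segmentation

Phuoc-Nguyen Bui, Van-Nguyen Pham, Duc-Tai Le, Junghyun Bum, Hyunseung Choo

Affiliations * Sungkyunkwan University, Korea

AI summary Medical image segmentation is important for computer-aided diagnosis and treatment planning, but deep learning models often fail to generalize across datasets because of differences in imaging protocols, scanners, and patient populations. This paper proposes FSAM, a generalizable medical image segmentation method based on frequency-domain adaptation that combines low-rank adaptation (LoRA) with a frequency adapter module to extract domain-invariant high-frequency features, improving generalization from a single source domain. Experiments show that it outperforms conventional domain generalization methods and SAM-based domain generalization methods on retinal and prostate datasets.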

Comments Under review, 10 pages, 1 figure, 2 tables

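2605.09902 2026-05-12 cs.CV Version update

Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment

Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang

Affiliations * Sun Yat-sen University; Nanyang Technological University; Shenzhen MSU-BIT University

AI summary Addressing the security of multimodal large language models (MLLMs), this study proposes PRAF-Attack, a new targeted transfer-based attack that uses adversarial examples to mislead a model's judgment of image content. The method introduces progressive resolution processing and an adaptive feature alignment strategy, uses intermediate-layer features to improve the transferability and robustness of the attack, and selects transferable hierarchical features through gradient consistency. Experiments show that PRAF-Attack transfers better than existing methods across a range of black-box MLLMs.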

Abstract

Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism to identify transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, the proposed PRAF-Attack consistently achieves superior transferability.

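2605.09900 2026-05-12 cs.AI cs.CL cs.CV Version update

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

Hao Liu, Jicheng Liu

Affiliations * Department of Psychology; New York University; Department of Computer Science; University of Southern California

AI summary This paper introduces KnotBench, a new benchmark for evaluating vision-language models on tasks involving knot diagrams. Using a large set of knot images with corresponding canonical signatures, the study designs 14 tasks spanning equivalence judgment, move prediction, recognition, and cross-modal alignment, exposing a gap between perception and manipulation in current models. Experiments show that even state-of-the-art models such as Claude Opus 4.7 and GPT-5 perform near chance without thinking mode; thinking mode helps, but the models still struggle overall to simulate knot operations accurately.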

Comments 41 pages, 18 figures

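2605.09899 2026-05-12 cs.CV cs.AI Version update

Hyperbolic Distillation: Geometry-Guided Cross-Modal Transfer for Robust 3D Object Detection

Kanglin Ning, Wenrui Li, Houde Quan, Qifan Li, Xingtao Wang, Xiaopeng Fan

Affiliations * Faculty of Computing, Harbin Institute of Technology; Suzhou Research Institute of HIT; PengChengLab

AI summary This paper proposes HGC-Det, a cross-modal knowledge distillation method with hyperbolic geometric constraints that improves multimodal 3D object detection. An image branch and a point-cloud branch extract semantic features separately, and three core components (semantics-guided voxel refinement, hyperbolic-geometry-constrained cross-modal feature transfer, and geometric refinement of feature aggregation) mitigate modality heterogeneity, spatial misalignment, and representation crisis. Experiments on indoor and outdoor datasets show a favorable balance between detection accuracy and computational cost.

Comments The current version has been submitted to IEEE Transactions on Multimedia and is under review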

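2605.09874 2026-05-12 cs.CV cs.AI cs.CL Version update

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

Ziyang Wang, Yue Zhang, Shoubin Yu, Ce Zhang, Zengqi Zhao, Jaehong Yoon, Hyunji Lee, Gedas Bertasius, Mohit Bansal

Affiliations * UNC Chapel Hill; NTU Singapore

AI summary EgoMemReason is a memory-driven reasoning benchmark for long-horizon egocentric video understanding, designed to evaluate a model's ability to accumulate, recall, and reason over visual information spanning multiple consecutive days. It introduces three complementary memory types (entity memory, event memory, and behavior memory) to assess recognition of object state changes, activity order, and long-term behavior patterns. Experiments show that the best current models reach an overall accuracy of only 39.6%, indicating that long-horizon memory reasoning remains a major challenge.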

Comments The first two authors contributed equally. Project website: https://egomemreason.github.io/

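2605.09864 2026-05-12 cs.CV cs.LG Version update

DA-SegFormer: Damage-Aware Semantic Segmentation for Fine-Grained Disaster Assessment

Kevin Zhu, William Tang, Raphael Hay Tene, Zesheng Liu, Nhut Le, Maryam Rahnemoonfar

Affiliations * Bina Labs, Lehigh University

AI summary This paper proposes DA-SegFormer, a semantic segmentation method for fine-grained disaster assessment that targets the difficulty of recognizing subtle damage in UAV imagery caused by texture degradation and class imbalance. Built on the SegFormer architecture, it adds a class-aware sampling strategy and online hard example mining combined with a Dice loss to strengthen learning of rare damage features, and it uses a resolution-preserving inference protocol to retain original texture detail. Experiments show that DA-SegFormer achieves 74.61% mIoU on the RescueNet dataset, significantly outperforming the baseline, with notable gains on key damage classes.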

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
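
For readers unfamiliar with the two loss-side ingredients named in the summary, here is a small generic sketch of a soft Dice loss and of hard example selection over per-pixel losses. It uses flat Python lists for a single class; the function names, the `keep_ratio` parameter, and the smoothing constant are assumptions for illustration and are not taken from the paper.

```python
def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss for one class over flattened pixels.

    probs: predicted foreground probabilities in [0, 1].
    targets: ground-truth labels, 0 or 1.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def hardest_indices(per_pixel_losses, keep_ratio):
    """Indices of the highest-loss pixels (a simple form of hard example mining)."""
    k = max(1, int(len(per_pixel_losses) * keep_ratio))
    order = sorted(range(len(per_pixel_losses)),
                   key=lambda i: per_pixel_losses[i], reverse=True)
    return sorted(order[:k])
```

Dice loss depends on overlap rather than pixel counts, which makes it less sensitive to class imbalance, while hard example selection concentrates the gradient on the pixels the model currently gets most wrong.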

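2605.09859 2026-05-12 cs.CV Version update

Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

Shijie Wang, Yadan Luo, Zijian Wang, Xin Yu, Zi Huang

Affiliations * The University of Queensland, Australia; The University of Adelaide, Australia

AI summary This paper studies how to improve retrieval of unseen categories in fine-grained image retrieval and proposes GAPan, a new method based on aligning generative appearance priors. It reformulates the learning objective with an invertible density model, shifting from category prediction to appearance modeling: a normalizing flow maps features into a latent density space optimized under class-conditional Gaussian priors, which preserves richer appearance detail. Appearance-aware anchors generated by reverse sampling then guide retrieval embeddings to align with class-specific appearance distributions, substantially improving generalization to unseen categories.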

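2605.09858 2026-05-12 cs.CV Version update

Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida

Affiliations * NTT, Inc.

AI summary This paper studies how active learning (AL) can improve annotation efficiency for end-to-end multi-object tracking (MOT) in dynamic environments. Existing frame-based AL methods are mismatched in temporal granularity with modern Transformer-based end-to-end trackers, so the authors propose CUTAL, a clip-based active learning method that scores each clip with an uncertainty measure derived from multi-frame predictions and adds a temporal diversity constraint to select clips that are informative and minimally redundant. Experiments show that CUTAL outperforms existing methods under the same annotation budget and approaches fully supervised tracking performance with only 50% of the labeled data.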

Comments Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland
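
One simple way to combine clip-level uncertainty with a temporal diversity constraint is a greedy pass over clips sorted by uncertainty. The sketch below assumes each clip already has an uncertainty score and that diversity is enforced as a minimum frame gap between selected clips of the same video; both the rule and the names `select_clips` and `min_gap` are illustrative assumptions, not the selection procedure from the paper.

```python
def select_clips(clips, budget, min_gap):
    """Greedily pick uncertain clips that are not temporally close to each other.

    clips: list of (video_id, start_frame, uncertainty) tuples.
    budget: maximum number of clips to select for annotation.
    min_gap: minimum start-frame distance between picks from the same video
             (a hypothetical stand-in for a temporal diversity constraint).
    """
    selected = []
    for vid, start, unc in sorted(clips, key=lambda c: c[2], reverse=True):
        if len(selected) == budget:
            break
        too_close = any(v == vid and abs(s - start) < min_gap
                        for v, s, _ in selected)
        if not too_close:
            selected.append((vid, start, unc))
    return selected
```

Without the gap check, the most uncertain clips often come from the same few seconds of one video, so the annotation budget would be spent on near-duplicate content.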

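2605.09856 2026-05-12 cs.CV cs.AI Version update

MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery

Tao Tang, Hong Liu, Xinshun Wang, Wanruo Zhang

Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, China

AI summary Despite notable recent progress, human mesh recovery remains insufficiently robust to occlusion, which often leads to inaccurate pose estimates and motion jitter. This paper proposes MoPO, which incorporates a motion prior to improve occluded human mesh recovery. MoPO comprises a motion de-occlusion module and a motion-aware fusion and refinement module: the former predicts occluded joint positions from historical poses, and the latter combines image features with the predicted poses to estimate body shape and pose, then refines the final pose through inverse kinematics, markedly improving accuracy and temporal consistency under occlusion.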

Comments 35 pages

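2605.09850 2026-05-12 cs.CV cs.AI Version update

Probing Routing-Conditional Calibration in Attention-Residual Transformers

Wenhao Liang, Lin Yue, Wei Emma Zhang, Miao Xu, Mingyu Guo, Olaf Maennel, Weitong Chen

Affiliations * Adelaide University; Australian Institute for Machine Learning (AIML), Adelaide University; The University of Queensland

AI summary This paper examines how routing information affects model calibration in Attention-Residual Transformers. Using confidence-matched diagnostic experiments, the authors find that routing summaries do not provide stable evidence of routing-conditional calibration and that calibrators based on routing depth do not outperform confidence-only models on multiple evaluation metrics. The experiments indicate that apparent gains from routing-aware calibration may stem from other factors, and that a genuine improvement from internal-state calibration can be confirmed only after controlling for matched confidence, bandwidth, model capacity, and permutation.

Comments Under review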

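2605.09830 2026-05-12 cs.IR cs.CV Version update

Loom: Hybrid Retrieval-Scoring Outfit Recommendation with Semantic Material Compatibility and Occasion-Aware Embedding Priors

Anushree Berlia

AI summary Loom is an outfit recommendation system that combines neural embedding retrieval with structured domain scoring to generate complete, coordinated outfits from a fashion catalog. It performs constrained retrieval over FashionCLIP embeddings and scores candidates with a multi-objective function that accounts for embedding similarity, color harmony, formality consistency, occasion fit, and other factors. The study introduces two techniques, semantic material weights and occasion-aware embedding priors, which improve material-compatibility judgments and occasion fit respectively. Experiments show that the system significantly outperforms a random baseline in outfit quality and violation rate and can quickly generate diverse outfits on commodity hardware.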

Comments Code: https://github.com/anushreeberlia/loom

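2605.09827 2026-05-12 cs.CV cs.AI Version update

Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Anushree Berlia

AI summary This paper presents Fashion Florence, a vision-language model based on Florence-2 that is fine-tuned with LoRA to extract structured attributes from clothing images. From a single garment photo, the model produces JSON output containing category, color, material, style tags, and occasion tags, suitable for downstream tasks such as recommendation. Experiments show that Fashion Florence outperforms GPT-4o-mini and Gemini 2.5 Flash on several metrics, has only 0.77B parameters, and runs on a single GPU at near-zero inference cost.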

Comments Model: https://huggingface.co/anushreeberlia/fashion-florence

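2605.09802 2026-05-12 cs.CV cs.AI cs.LG Version update

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Zhipeng Liu, Chunbo Luo

Affiliations * Department of Computer Science, University of Exeter

AI summary This paper studies the drop in object detection performance of vision-language models (VLMs) across views (for example, ground versus aerial) and proposes CrossVL, a framework that combines a complexity-aware feature routing mechanism with a paired curriculum learning strategy to improve adaptation to images from different viewpoints. The method estimates scene complexity and routes visual features dynamically, and it trains progressively using the semantic consistency of synchronized ground-aerial image pairs, improving detection accuracy and stability. Experiments on the MAVREC dataset show significantly better detection performance and a smaller performance gap between views.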

Comments Accepted to CVPR 2026. Code available at https://github.com/1nyourlife/Crossvl_cvpr2026

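2605.09774 2026-05-12 cs.CV Version update

DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving

Shiva Aher

Affiliations * Georgia Institute of Technology

AI summary DRIVE-C is a controlled corruption dataset for evaluating the robustness of visual perception in autonomous driving systems, built from real-world driving videos recorded in varied environments. Using physically inspired synthetic corruptions, it provides 10 clean videos and 600 corrupted videos, together with detailed metadata and sensor health index annotations. DRIVE-C offers a controlled, reproducible testbed for robustness evaluation, corruption-aware modeling, uncertainty estimation, and sensor health monitoring in autonomous driving perception.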

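2605.09750 2026-05-12 cs.CV Version update

Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos

Aleksander Zamojski, Kacper Jarczak, Radoslaw Roszczyk

Affiliations * Warsaw University of Technology

AI summary This paper proposes a new method for keyframe detection in fetal brain ultrasound videos, aiming to improve the efficiency and accuracy of fetal brain image analysis. It uses a composite neural network that combines a convolutional neural network (CNN) with a recurrent neural network (RNN): the CNN extracts local spatial features from video frames, and the RNN captures temporal dependencies between frames in the sequence. The model is intended to support earlier detection and diagnosis of specific fetal brain disorders and thus more timely treatment planning.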

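2605.09719 2026-05-12 cs.CV cs.AI Version update

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang

Affiliations * Department of Computer Science, Toronto Metropolitan University

AI summary This study proposes a knowledge distillation framework that transfers the spatial reasoning ability of a large 3D vision-language model to a lighter model, substantially reducing computational cost. With learnable hidden reasoning tokens (Hidden CoT) and a multi-task distillation strategy, the method retains more than 72% of the teacher's performance while shrinking the model by 3x and cutting inference latency by 8.7x. The work is the first to apply a hidden reasoning mechanism in a distilled 3D vision-language model, enabling efficient 3D scene question answering.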

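2605.09703 2026-05-12 cs.CV Version update

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao, Haoyu Chen

Affiliations * University of Oulu

AI summary This paper presents MOTOR-Bench, a real-world dataset and multi-agent framework for zero-shot understanding of human mental states. The dataset contains 1,440 multimodal video clips of collaborative learning scenarios, each annotated by education experts according to self-regulated learning theory, to support structured analysis of complex interpersonal interaction. To address the weakness of existing methods at inferring deeper mental states from observable behavior, the study proposes MOTOR-MAS, a multi-agent framework whose structured coordination mechanism improves prediction of behavioral, cognitive, and emotional labels; experiments show that it significantly outperforms existing methods on multiple metrics.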

Comments Accepted by CVPR 2026 workshop AI4RWC

2605.09701 2026-05-12 cs.CV Version update

DriveFuture: Future-Aware Latent World Models for Autonomous Driving

Yufeng Hong, Xiaotian Zhou, Yingyan Li, Xiangpo Zhou, Lin Liu, Yadan Luo, Shaoqing Xu, Lei Yang, Ziying Song

Affiliations * Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; Beihang University; Beijing Jiaotong University; The University of Queensland; University of Macau; Nanyang Technological University; School of Artificial Intelligence (School of Software), Yanshan University

AI summary DriveFuture is a future-aware latent world model for autonomous driving. Its core idea is to condition the modeling of the current latent state on future world states, so that the model explicitly learns the foresight needed for path planning. During training, it predicts and refines future latent states and supplies them as explicit conditions to a diffusion-based trajectory planner, achieving leading performance on several public benchmarks. The results indicate that conditioning current decisions on future states does more to improve the driving system than merely predicting future states.

Comments 24 pages, 7 figures

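2605.09699 2026-05-12 eess.IV cs.CV cs.GR cs.LG Version update

A Real-Calibrated Synthetic-First Data Engine

Yukang Shen

Affiliations * Kennesaw State University

AI summary Modern computer vision systems are often limited by data scarcity, and although synthetic data generation is promising, applying it directly tends to give unstable results because of weak data quality control and feedback. This paper proposes a real-calibrated, synthetic-first data engine that unifies controllable diffusion models with multi-stage filtering in one pipeline to make synthetic data augmentation more practical and reliable. Experiments on tasks such as human pose estimation show that combining synthetic and real data improves performance, underscoring the value of data-centric strategies in low-data settings.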

Comments 7 pages, 6 figures

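2605.09693 2026-05-12 cs.CV cs.AI cs.LG Version update

Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

Affiliations * Apple

AI summary This study asks whether multimodal models form mental imagery when solving spatial puzzles. It finds that large multimodal models do develop an imagination-like process when solving tasks such as jigsaw puzzles and building blocks, and even "imagine" the figure of a sheep when solving sheep-related puzzles. The authors fine-tune a Qwen3.5 vision-language model to perform a range of visual reasoning tasks and find that the model spontaneously forms visual representations of intermediate states while carrying out operations. Building on this finding, they propose two methods to strengthen and exploit the model's internal visual representations, which markedly improve task accuracy.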

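2605.09688 2026-05-12 cs.CV Version update

ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

Rui Song, Tianhui Cai, Markus Gross, Xingcheng Zhou, Zewei Zhou, Zhiyu Huang, Olaf Wysocki, Jiaqi Ma

Affiliations * University of California, Los Angeles; University of Cambridge; Technical University of Munich

AI summary This paper proposes ConFixGS, a method for repairing the reconstructions of feed-forward 3D Gaussian Splatting (3DGS) in driving scenes. It uses confidence-aware diffusion priors, generating local pseudo-targets and checking them by reprojection against support views, to make reconstructed detail more reliable and suppress inconsistent information. Experiments on multiple datasets show substantially better novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half, demonstrating robust reconstruction in driving scenes.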

Comments 28 pages, 12 figures

2605.09687 2026-05-12 cs.CV Version update

Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution

Md Aminur Hossain, Parekh Valkesh, Ayush V. Patel, Yogesh Jethani, Sanjay K. Singh, Biplab Banerjee

Affiliations * Space Applications Centre, ISRO, Ahmedabad, India; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, India; New L J Institute of Engineering and Technology, Ahmedabad, India; Pandit Deendayal Energy University, Gandhinagar, India; GLS University, Ahmedabad, India

AI summary This paper studies single-image super-resolution for remote sensing, which aims to reconstruct a high-resolution image from a low-resolution observation while preserving fine spatial structure. To address the weakness of existing Swin Transformer models at reconstructing detail, the authors propose the Spatial-Frequency Gated Swin Transformer (SFG-SwinSR), which adds a spatial-frequency gating module to the feed-forward network to separate low-frequency structural content from high-frequency residual detail. Experiments on multiple remote sensing datasets show improved PSNR and SSIM and sharper detail in the reconstructed images.

Comments 15 pages

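2605.09681 2026-05-12 cs.CV Version update

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li

Affiliations * ZJU; Video Rebirth; HKUST; BJTU

AI summary This paper addresses the high attention complexity and memory overhead caused by redundant key-value (KV) caches in autoregressive video diffusion models and proposes Forcing-KV, a hybrid KV cache compression method. By analyzing the functional roles of attention heads in mainstream models, it divides heads into static heads, which attend to intra-frame detail and inter-chunk transitions, and dynamic heads, which govern inter-frame motion and consistency, and applies structured pruning to the former and segment-similarity-based dynamic pruning to the latter. The method preserves generation quality while substantially increasing speed and reducing memory use, generating more than 29 frames per second on a single NVIDIA H200 GPU.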

Comments 10 pages

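2605.09679 2026-05-12 cs.CV cs.AI Version update

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

Affiliations * Johns Hopkins University; University of Bologna; Istanbul Medipol University; Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia; The First Affiliated Hospital, Sun Yat-Sen University; Tongji University

AI summary DeepTumorVQA is a hierarchical 3D CT benchmark for stage-wise evaluation of medical vision-language models (VLMs) and tool-augmented agents. It decomposes the reasoning involved in tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) so that performance at each level can be evaluated independently. The study also introduces a tool-interaction environment in which models can call external tools for segmentation, measurement, and medical knowledge, bringing the evaluation closer to clinical practice. Experiments show that tool augmentation significantly improves performance on complex medical reasoning tasks.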

2605.09677 2026-05-12 cs.CV Version update

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel

Affiliations * Pervasive Systems Research Group, Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente; Department of Earth Observation Science, Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente; Department of Civil Engineering and Management, Faculty of Engineering Technology, University of Twente

AI summary This paper proposes VFM-SDM, a structural displacement measurement framework based on vision foundation models (VFMs) that performs non-contact, multi-directional measurement without task-specific training, on-site markers, or calibration. It combines VFM-inferred camera parameter estimation with point tracking, reconstructs displacement by triangulation, and adds structural geometric constraints to make the estimates more physically plausible and consistent. Experiments show high accuracy and stability in real-world scenes, pointing toward automated, scalable structural health monitoring.

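2605.09670 2026-05-12 cs.RO cs.CV Version update

Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Aws Khalil, Jaerock Kwon

Affiliations * Department of Electrical and Computer Engineering, University of Michigan - Dearborn

AI summary This paper examines generative predictive display for vision-based teleoperation, which aims to offset communication latency by generating future visual states. The authors propose a zero-shot benchmark, without task-specific fine-tuning, that evaluates several off-the-shelf generative video models on short-horizon predictive display. Experiments show that existing models cannot simultaneously meet the requirements on prediction accuracy, inference latency, and error stability, revealing a gap between general-purpose generative video models and predictive display for teleoperation.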

Abstract

Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
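
The accuracy metric and the per-step error curve mentioned in the abstract can be illustrated with a few lines of code. This sketch assumes frames are flattened lists of pixel intensities and that predicted and ground-truth rollouts are aligned step by step; the function names are hypothetical and the benchmark's actual pipeline may differ.

```python
def mean_abs_diff(pred, target):
    """Mean absolute difference between two frames given as flat pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def per_step_error(pred_frames, target_frames):
    """Error at each step of the prediction horizon, in rollout order."""
    return [mean_abs_diff(p, t) for p, t in zip(pred_frames, target_frames)]
```

Plotting the returned list against the step index gives the temporal error evolution: a curve that keeps climbing indicates that rollout errors compound over the horizon.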

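2605.09667 2026-05-12 cs.CV cs.AI Version update

S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

Albert Heruth

Affiliations * Unaffiliated Researcher

AI summary This paper proposes S2P-Net, a compact deep network architecture for rotation-invariant object recognition in low-data regimes that guarantees rotation invariance mathematically, without data augmentation. The network combines spectral and spatial information and uses a polar coordinate transform to strengthen robustness to rotation. Compared with conventional convolutional neural networks, S2P-Net achieves better recognition performance in few-sample settings.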

Comments 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

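2605.09666 2026-05-12 cs.CV cs.AI Version update

Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models

Abdul Basit, Ashir Rashid, Muhammad Abdullah Hanif, Muhammad Shafique

Affiliations * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi

AI summary This paper examines shortcomings in how multiple sclerosis (MS) lesion segmentation models are evaluated. It points out that most current evaluations rely on the Dice score, which does not adequately capture lesion-level detection and segmentation performance or model behavior on cases that are complex or hard for human annotators to judge. The authors analyze in detail the features neurologists look for in brain MRI scans, propose evaluation metrics that better match practical needs, and analyze existing state-of-the-art models on two open-source datasets to assess their suitability for real clinical use.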

Comments 8 pages, 5 figures, Accepted to IJCNN 2026

2605.09662 2026-05-12 cs.CV Version update

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez

Affiliations * Arquimea Research Center; Institut de Robòtica i Informàtica Industrial (CSIC-UPC); Universidad de las Palmas de Gran Canaria (IUSIANI)

AI summary This paper proposes BEA-GS, a new Gaussian Splatting method that goes beyond radiance supervision to extract objects more precisely. It introduces two new loss functions that respectively optimize the geometry of visible and non-visible Gaussians so that they align more accurately with semantic boundaries. Experiments on multiple datasets show state-of-the-art boundary segmentation and markedly more precise object-level editing and asset extraction.

Comments CVPR 2026 Highlight

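2605.09644 2026-05-12 cs.CV Version update

Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

Zichen Zou, Xiaosong Jia, Zuxuan Wu, Yu-Gang Jiang

AI summary This paper proposes RetrieveVGGT, a training-free framework that addresses the out-of-memory failures and quality degradation of Transformer-based 3D reconstruction on long sequences, which stem from the high complexity of attention. By recasting context construction as a retrieval problem, RetrieveVGGT retrieves only a small number of relevant frames at each step, keeping memory overhead bounded, and it uses the similarity between queries and keys in VGGT as the relevance signal, with no additional training. The method also introduces segment-wise sampling and a camera-pose-based spatial memory mechanism to improve information diversity and localization accuracy, and experiments show that it outperforms several existing methods.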
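
The retrieval step can be pictured as a top-k search over cached frames scored by query-key similarity. The sketch below assumes one pooled key vector per cached frame and one pooled query vector for the incoming frame, with a plain dot product as the score; the pooling, the function name, and the scoring rule are simplifying assumptions rather than the method's actual design.

```python
def retrieve_frames(query, cached_keys, k):
    """Return indices of the k cached frames whose keys best match the query.

    query: pooled query vector of the current frame (list of floats).
    cached_keys: one pooled key vector per previously seen frame.
    Indices are returned in temporal order so context stays chronological.
    """
    scores = [sum(q * x for q, x in zip(query, key)) for key in cached_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Since only `k` frames enter attention at each step, the per-step cost stays bounded even as the number of cached frames grows.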

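2605.09640 2026-05-12 cs.CV cs.LG Version update

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu

Affiliations * The University of Hong Kong; The Hong Kong University of Science and Technology; Hong Kong Generative AI Research and Development Center

AI summary This paper studies how to overcome catastrophic forgetting in visual continual learning and proposes RaPO, a new method based on reinforcement fine-tuning. The authors find that existing methods such as GRPO still forget substantially under class-incremental and domain-incremental learning, and they trace the cause to policy drift at the trajectory level. RaPO introduces a retention reward and cross-task advantage normalization to mitigate the forgetting caused by this drift. Experiments show strong performance across multiple continual learning settings, and the work offers a systematic exploration of reinforcement fine-tuning for visual continual learning.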

Abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
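
To make the two components more concrete, here is a loose sketch of how a drift-based reward term and a persistent moving average of reward statistics might fit together. Everything specific in it is an assumption for illustration: the `exp(-beta * KL)` form of the retention term, the additive weighting, the momentum value, and the names `shaped_reward` and `EmaNormalizer` do not come from the paper.

```python
import math


def shaped_reward(task_reward, kl_to_prev_policy, weight=0.5, beta=1.0):
    """Task reward plus a bonus that shrinks as drift from the old policy grows.

    The exponential form and the weights are hypothetical choices.
    """
    return task_reward + weight * math.exp(-beta * kl_to_prev_policy)


class EmaNormalizer:
    """Keeps running reward statistics that persist across task boundaries."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None
        self.var = None

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            # First batch initializes the statistics directly.
            self.mean, self.var = batch_mean, batch_var
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards, eps=1e-6):
        std = math.sqrt(self.var) + eps
        return [(r - self.mean) / std for r in rewards]
```

Among rollouts with the same task reward, the one with smaller divergence from the previous policy receives the larger shaped reward, and normalizing against statistics that are not reset at task boundaries avoids an abrupt change in advantage scale when a new task begins.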

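2605.09628 2026-05-12 cs.CV Version update

DegBins: Degradation-Driven Binning for Depth Super-Resolution

Zhiqiang Yan, Zhengxue Wang, Jian Yang, Gim Hee Lee

Affiliations * Department of Computer Science, National University of Singapore; Nanjing University of Science and Technology

AI summary Depth super-resolution (DSR) aims to recover a high-resolution depth map from a low-resolution one. Conventional methods usually learn the residual between high and low resolution in a low-dimensional feature space, which makes spatially varying degradation hard to model accurately. This paper proposes DegBins, a new DSR framework whose degradation-driven binning strategy turns the regression problem into a hybrid classification-regression problem: residual depth is represented more flexibly as a weighted combination of discrete depth bins, and the degradation relationship is modeled in a high-dimensional feature space so that bin ranges and probability distributions adapt to the input. Experiments show that DegBins outperforms existing methods on multiple benchmark datasets, with higher accuracy and robustness.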

Comments 9 pages
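
The idea of expressing a residual as a weighted combination of discrete bins can be sketched for a single pixel. The code below assumes the network outputs relative bin widths over a residual range `[lo, hi]` plus one logit per bin; this layout follows common adaptive-binning practice and is an assumption, not a description of DegBins itself.

```python
import math


def bin_centers(widths, lo, hi):
    """Turn relative bin widths into bin centers covering [lo, hi]."""
    total = sum(widths)
    centers, edge = [], lo
    for w in widths:
        span = (hi - lo) * w / total
        centers.append(edge + span / 2)
        edge += span
    return centers


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def residual_from_bins(logits, centers):
    """Expected residual: probability-weighted sum of bin centers."""
    return sum(p * c for p, c in zip(softmax(logits), centers))
```

The classification part is the per-bin probability and the regression part is the resulting continuous value, which would then be added to the upsampled low-resolution depth.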

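2605.09622 2026-05-12 cs.CV cs.AI Version update

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao

Affiliations * UC Santa Cruz; Siemens Healthineers; University of Washington

AI summary Voxel-level dose prediction is a key but challenging task in radiotherapy planning, and existing models often generalize poorly across clinical scenarios. This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that transfers knowledge from pretrained video diffusion models to achieve efficient, clinically meaningful dose prediction. It introduces flexible conditional generation based on modality embeddings and a clinically guided reinforcement learning post-training strategy, markedly improving dose prediction accuracy and image quality over current state-of-the-art models.

Comments Accepted by CVPR 2026 main conference. Compared to the CVPR version, minor updates are included here (e.g., the main text and appendix are combined; the timing scenario in the appendix is clarified)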

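2605.09614 2026-05-12 cs.CV Version update

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

Affiliations * Shanghai Jiao Tong University; Lanzhou University

AI summary This paper studies the decay of visual information in long-chain multimodal reasoning. It presents an information-theoretic analysis that derives a lower bound on the downstream visual gain of an intervention point and, based on this bound, designs RAPO, a policy optimization method built on reflection anchors. RAPO selects high-entropy reflection anchors and optimizes a finite-window KL-divergence surrogate, strengthening the propagation and retention of visual information during generation. Experiments show that RAPO significantly outperforms existing methods on multiple vision-language model benchmarks, and a mechanistic analysis shows that it strengthens vision-dependent contrastive signals along the generation trajectory.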

Comments Under Review

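2605.09613 2026-05-12 cs.RO cs.CV Version update

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi, Satya Sai Reddy, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

Affiliations * DreamVu

AI summary This study presents SABER, a high-fidelity action dataset for adapting robot vision-language-action (VLA) models to real-world retail settings. Built from many hours of natural in-store capture, SABER records fine-grained hand activity, whole-body motion, and scene dynamics of people in retail environments without staging or teleoperation. The dataset includes several action representations, and its effectiveness is validated by post-training a robot foundation model, which markedly raises completion rates on complex retail tasks and shows the importance of high-quality data for robot performance.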

Abstract

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

2605.09606 2026-05-12 cs.CR cs.CV updated version

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

Yule Liu, Yilong Yang, Jiale Teng, Hanze Jia, Zeren Luo, Jingyi Zheng, Zifan Peng, Ke Li, Yifan Liao, Zhen Sun, Jiaheng Wei, Yang Liu, Zhuo Ma, Xinlei He

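Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Xidian University; Zhejiang University; Wuhan University

AI summary: This paper studies the risk that image-to-3D models generate harmful geometry, and how to mitigate it. Current models can reconstruct direct-use physical hazards, risky templates or components, and deceptive replicas from malicious inputs. A systematic evaluation of open-source and commercial models finds that they reconstruct harmful geometries effectively while existing safeguards do little. A stacked defense cuts harmful retention to below 1%, but at an 11% overall false-positive cost, which underlines how weak geometry-aware safeguards currently are.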

English abstract

Recent advances in image-to-3D models have significantly improved the fidelity and accessibility of 3D content creation. The same reconstruction capability that enables creative design can also be misused by an adversary to generate harmful geometries, which can then be fabricated with 3D printers and pose real-world risks. However, such risks are largely underexplored: it remains unclear how well current image-to-3D models can produce these harmful geometries, and whether existing safeguards can reliably prevent such generation. To fill this gap, we conduct a systematic measurement study of harmful geometry generation and mitigation. We first describe this risk through three unsafe categories: direct-use physical hazards, risky templates or components, and deceptive replicas. Each category is instantiated with representative objects. We evaluate both open-source and commercial image-to-3D models under original, degraded, viewpoint-shifted, and semantically camouflaged inputs. We consider different evaluation metrics, including geometric validity, multi-view VLM-based semantic scoring, targeted human validation, and controlled physical fabrication. The results reveal a concerning reality: current image-to-3D models can effectively reconstruct the harmful geometries, while fewer than 0.3% of such geometries trigger commercial moderation flags. As a first step toward mitigation, we evaluate three representative safeguard families: input moderation, model-level benign alignment, and output-level filtering. We find that existing safeguards have distinct weaknesses. We further develop a stacked defense that can reduce harmful retention to <1%, but still at an 11% overall false-positive cost. Taken together, our findings demonstrate the risk in current systems and encourage better geometry-aware safeguards for moderation.

2605.09604 2026-05-12 cs.CV updated version

DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

Jiaying Lin, Shiman Wu, Jinfu Liu, Can Wang, Mengyuan Liu

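Affiliations: Peking University; Huazhong University of Science and Technology; DJI Technology Company Ltd.; Christian-Albrechts-Universität zu Kiel

AI summary: Targeting human action recognition (HAR) with millimeter-wave radar in heterogeneous settings, this work presents UniMM-HAR, described as the first large-scale heterogeneous multi-source mmWave point cloud dataset, and DAP-Net, a network designed to handle the distribution shift caused by different devices and frequency bands. DAP-Net fuses multi-modal information with a Doppler-aware mechanism to improve robustness across heterogeneous radar sources, and experiments report superior performance on cross-source recognition.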

2605.09591 2026-05-12 cs.CV updated version

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang

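Affiliations: Department of Electrical and Computer Engineering, The University of Hong Kong; School of Computer Science and Engineering, Sun Yat-sen University; CASIC, The University of Hong Kong

AI summary: This paper asks whether promptable segmentation models actually understand the concepts they segment, or merely rely on visually salient but semantically misleading cues. The authors propose CAFE, a benchmark that measures concept faithfulness through attribute-level counterfactual modifications. Experiments show that models can produce accurate masks yet still fail on misleading prompts, revealing a systematic gap between localization quality and semantic understanding.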

Comments 30 pages, 8 figures

2605.09581 2026-05-12 cs.CV updated version

FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision

Michal Filipkowski, Marcin Kowalczyk, Tomasz Kryjak

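Affiliations: AGH University of Krakow, Poland; Embedded Vision Systems Group, Computer Vision Laboratory

AI summary: This paper proposes an FPGA-based hardware architecture for the contrast maximization (CM) algorithm in event-based vision. The architecture exploits FPGA parallelism to compute the contrast of images reconstructed from the asynchronous event stream and to run the iterative optimization that estimates motion parameters. The paper details the module design and its optimizations, and reports more than a 200x speedup over CPU and GPU implementations together with better energy efficiency, supporting real-time motion estimation in high-speed, low-power embedded systems.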

Comments Accepted for ARC 2026
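
To make the contrast-maximization idea concrete, here is a minimal software sketch (it is not the FPGA design). It assumes 1-D events given as `(x, t)` pairs, a single constant velocity as the only motion parameter, and a grid search in place of the iterative optimization mentioned in the summary; every function name is illustrative.

```python
def warped_image(events, velocity, width, t_ref=0.0):
    """Accumulate events into a 1-D image after undoing motion at `velocity` (px/s)."""
    image = [0] * width
    for x, t in events:
        xw = round(x - velocity * (t - t_ref))  # warp the event back to time t_ref
        if 0 <= xw < width:
            image[xw] += 1
    return image


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def contrast(events, velocity, width):
    """Contrast is taken here as the variance of the image of warped events."""
    return variance(warped_image(events, velocity, width))


def estimate_velocity(events, candidates, width):
    """Grid search: the candidate that aligns the events best gives the sharpest image."""
    return max(candidates, key=lambda v: contrast(events, v, width))


# Synthetic data: two edges starting at x=10 and x=30, both moving at 20 px/s.
events = [(x0 + 20.0 * (k * 0.01), k * 0.01) for x0 in (10, 30) for k in range(50)]
candidates = [float(v) for v in range(0, 41, 5)]
best = estimate_velocity(events, candidates, width=64)
```

With the correct velocity every event from one edge lands on the same pixel, so the image is sharp and its variance peaks; wrong velocities smear the events across neighbouring pixels.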

2605.09575 2026-05-12 eess.IV cs.CV updated version

Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI

Mingxuan Liu, Yingqi Hao, Yi Liao, Juncheng Zhu, Haoxiang Li, Hongjia Yang, Yifei Chen, Yijin Li, Kasidit Anmahapong, Zihan Li, Jialan Zheng, Min Kang, Yan Song, Hua Lai, Xiaoling Zhou, Nan Sun, Rong Hu, Gang Ning, Haibo Qu, Qiyuan Tian

Affiliations: Department of Radiology, West China Second University Hospital, Sichuan University; School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University; Department of Radiology, Sichuan Provincial Woman’s and Children’s Hospital, The Affiliated Women’s and Children’s Hospital of Chengdu Medical College; Chengdu Women’s and Children’s Central Hospital, School of Medicine, University of Electronic Science and Technology of China; Department of Radiology, The Third Affiliated Hospital of Zhengzhou University; Qujing Maternal and Child Health Hospital, Qujing, China

AI summary: This study proposes FreeHemoSeg, an annotation-free deep learning framework for automatically detecting and segmenting germinal matrix-intraventricular hemorrhage (GMH-IVH) in fetal brain MRI. It is trained on pseudo-lesion images synthesized from normal fetal data under medical priors, sidestepping the difficulty of obtaining annotations. FreeHemoSeg outperforms supervised and unsupervised methods in both internal and external validation, and its assistance improves radiologists' sensitivity and reduces interpretation time.

English abstract

Background: Prenatal germinal matrix-intraventricular hemorrhage (GMH-IVH) is a leading cause of infant mortality and neurodevelopmental impairment. Manual diagnosis and lesion segmentation are labor-intensive and error-prone. Deep learning models offer potential for automation but typically require large annotated datasets, which are challenging to obtain. Purpose: To develop and validate an annotation-free deep learning framework for automated detection and segmentation of GMH-IVH on brain MRI. Materials and Methods: This retrospective study analyzed 2D T2-weighted MRI data from pregnant women collected from October 2015 to October 2023 at one hospital (internal validation) and two hospitals (external validation). Eligible participants included healthy fetuses and those with GMH-IVH. FreeHemoSeg was developed and trained using pseudo GMH-IVH images synthesized from normal fetal data guided by medical priors. Primary outcomes included diagnostic accuracy (area under the ROC curve [AUROC], sensitivity, specificity) and segmentation accuracy (Dice similarity coefficient [DSC]). A reader study evaluated clinical utility. Results: A total of 1674 stacks from 558 pregnant women were analyzed. FreeHemoSeg achieved the highest performance in both internal (sensitivity: 0.914, 95% CI 0.869-0.945; specificity: 0.966, 95% CI 0.946-0.978; DSC: 0.559, 95% CI 0.546-0.571) and external validation (sensitivity: 0.824, 95% CI 0.739-0.885; specificity: 0.943, 95% CI 0.913-0.964; DSC: 0.512, 95% CI 0.497-0.526), outperforming supervised and unsupervised methods. FreeHemoSeg assistance improved radiologists' sensitivity (from 0.882 to 0.941-1.000) and diagnostic confidence while reducing interpretation time by 16.0-52.7%. Conclusion: FreeHemoSeg accurately detects and localizes fetal brain hemorrhages without annotated training data, enabling earlier diagnosis and supporting timely clinical management.
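
For readers unfamiliar with the reported metrics, the sketch below shows how the Dice similarity coefficient, sensitivity, and specificity are conventionally computed. The masks and labels are toy values, not data from the study, and the function names are illustrative.

```python
def dice(pred_mask, true_mask):
    """Dice similarity coefficient for two binary masks (flat lists of 0/1)."""
    overlap = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total


def sensitivity_specificity(pred_labels, true_labels):
    """Case-level sensitivity and specificity from 0/1 labels.

    Assumes both classes occur in true_labels, so neither denominator is zero.
    """
    pairs = list(zip(pred_labels, true_labels))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data for illustration only.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
dsc = dice(pred, truth)
sens, spec = sensitivity_specificity(pred, truth)
```

Dice penalizes both missed and spurious lesion pixels, which is why segmentation scores for small lesions tend to be much lower than case-level sensitivity and specificity.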

2605.09572 2026-05-12 cs.CV cs.AI cs.MM updated version

KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

Guanyi Du, Lintao Wang, Kun Hu, Ziyang Wang

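Affiliations: School of Computer Science, The University of Sydney; School of Science, Edith Cowan University; School of Computer Science and Digital Technologies, Aston University

AI summary: This study explores Kolmogorov-Arnold Networks (KANs) for generating sign language pose animation from notation. It proposes KANMultiSign, a multi-scale sequence generation model that converts HamNoSys notation into 2D human pose sequences. A coarse-to-fine strategy with multi-scale supervision first generates the overall body structure and then refines hand details, and KAN modules are integrated into a Transformer architecture to model the nonlinear mapping from symbols to continuous poses more efficiently. Experiments on several sign language corpora show better performance than existing methods with far fewer parameters, and indicate that multi-scale supervision is key to notation-conditioned pose generation.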

Comments Accepted at Neurocomputing

2605.09566 2026-05-12 cs.CV updated version

Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing

Tianyi Lu, Wenxue Cui, Shaohui Liu

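Affiliations: Harbin Institute of Technology

AI summary: This paper proposes DPH-DUN, a dual-path hyperprior-informed deep unfolding network for image compressive sensing reconstruction. The measurements are split into two subsets, and hyperprior information guides reconstruction to improve quality across regions with different textures. Lightweight neural modules generate multi-domain hyperprior knowledge, from which adaptive step sizes and attention are produced dynamically during reconstruction. Experiments show gains over existing compressive sensing methods on several benchmark datasets.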

2605.09554 2026-05-12 cs.CL cs.CV updated version

Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs

Kuanwei Chen, Mengfeng Tsai

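Affiliations: Computer Science and Information Engineering, National Central University, Zhongli, Taiwan

AI summary: This paper studies the trade-off between frame rate and model size in sign language translation (SLT). The authors propose a lightweight 77M-parameter pipeline that feeds MMPose skeleton poses through a single linear projection into T5-small, and vary the input frame rate to cut computation while preserving translation quality. At 12 fps the BLEU-4 score drops only slightly relative to 24 fps, and the model is about one third the size of a previous T5-base system, showing that a compact architecture can stay competitive without hierarchical encoders or large models.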

Comments 2 pages, 1 figure, 2 tables
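
The frame-rate side of the trade-off can be illustrated with a short sketch. It assumes uniform temporal subsampling of a pose sequence, which may differ from how the authors actually reduce the frame rate; `subsample_frames` is a hypothetical helper.

```python
def subsample_frames(frames, src_fps, dst_fps):
    """Uniformly subsample a frame sequence from src_fps down to dst_fps."""
    if dst_fps <= 0 or dst_fps > src_fps:
        raise ValueError("dst_fps must be in (0, src_fps]")
    step = src_fps / dst_fps
    count = int(len(frames) * dst_fps / src_fps)
    # Pick the source frame closest below each target timestamp.
    return [frames[int(i * step)] for i in range(count)]


# Two seconds of stand-in pose frames at 24 fps (integers stand in for keypoint arrays).
clip_24fps = list(range(48))
clip_12fps = subsample_frames(clip_24fps, src_fps=24, dst_fps=12)
```

Halving the frame rate halves the input sequence length, which matters because self-attention cost in a Transformer encoder grows roughly quadratically with sequence length.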

2605.09538 2026-05-12 cs.CV cs.AI cs.RO updated version

PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions

Jihyun Lee, Changmin Lee, Donghwan Kim, Tae-Kyun Kim

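Affiliations: School of Computing, KAIST, Daejeon, South Korea

AI summary: PhysHanDI is a physics-based framework for jointly reconstructing 3D interactions between hands and non-rigid objects such as cloth and plush toys. It drives object deformation by simulating the forces induced by densely reconstructed hand motion, so the reconstructed object dynamics are physically plausible and consistent with the hand. The simulated deformation in turn refines hand reconstruction through inverse physics. Experiments show PhysHanDI outperforms prior state-of-the-art methods on both reconstruction and future prediction.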

Comments Accepted to ICML 2026

2605.09513 2026-05-12 cs.CV cs.RO updated version

QueST: Persistent Queries as Semantic Monitors for Drift Suppression in Long-Horizon Tracking

Mayank Anand, Mohammad Saqlain, Kyan Mahajan, Priya Shukla, Gora Chand Nandi, Andrew Melnik

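Affiliations: Center for Intelligent Robotics; Indian Institute of Information Technology Allahabad; University of Bremen

AI summary: This paper proposes QueST, a semantic monitoring framework for long-horizon tracking that addresses the semantic drift caused by error accumulation in frame-by-frame matching. QueST treats interaction-relevant entities as persistent semantic queries rather than transient point tracks; at each time step the queries attend globally to spatio-temporal video features, providing stable semantic anchors. With lightweight 3D physical constraints, QueST suppresses drift under occlusion, and experiments on long articulated-motion sequences show clearly higher tracking accuracy than existing methods.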

2605.09507 2026-05-12 cs.CV updated version

Uncertainty-Aware and Decoder-Aligned Learning for Video Summarization

Omer Tariq, Syed Muhammad Raza, Jeongbae Son

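Affiliations: Perception AI Neubility Inc.

AI summary: This paper proposes VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization, addressing the challenges posed by subjective annotations and discrete decoding. It predicts probabilistic frame-level importance scores through a variational formulation, explicitly modeling uncertainty under multi-annotator supervision, and adds a decoder-aligned regularization that stabilizes summary selection. Experiments on SumMe and TVSum show improved robustness under annotation disagreement with efficient single-pass inference, offering an alternative to deterministic and diffusion-based methods.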

Comments Accepted for presentation at the 2026 International Joint Conference on Neural Networks (IJCNN 2026)

English abstract

Video summarization aims to produce a compact representation of a long video by selecting a subset of temporally important segments that best reflect human preferences. This task is inherently difficult due to strong annotation subjectivity and the reliance on discrete decoding procedures, such as temporal segmentation and knapsack-based selection, during evaluation. Most existing approaches either learn deterministic importance scores that overlook these characteristics or adopt complex generative models that increase training and inference cost. In this paper, we propose VASTSum, an uncertainty-aware and decoder-aligned learning framework for video summarization that addresses both challenges within a single-pass model. The proposed method predicts probabilistic frame-level importance scores using a variational formulation, enabling explicit modeling of uncertainty arising from multi-annotator supervision. To account for subjectivity, particularly under binary annotations, we employ a supervision strategy that encourages alignment with plausible human annotation modes rather than enforcing a single consensus target. Furthermore, we introduce a decoder-aligned regularization that promotes stability of knapsack-based summary selection, reducing sensitivity to small perturbations in predicted scores. We evaluate the proposed framework on the SumMe and TVSum benchmarks using standard rank-based metrics. Experimental results show consistent and competitive Kendall and Spearman correlations across multiple data splits, demonstrating improved robustness under annotation disagreement while maintaining efficient single-forward inference. These results indicate that explicitly modeling uncertainty and aligning learning objectives with the decoding stage provide a principled alternative to both deterministic and diffusion-based video summarization methods.
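
The knapsack-based selection step mentioned in the abstract can be sketched as a standard 0/1 knapsack over segments. This is the generic decoding procedure, not the VASTSum model or its regularizer; the scores, lengths, and budget below are made-up values, and the function name is illustrative.

```python
def knapsack_select(scores, lengths, budget):
    """Choose segments maximizing total score with total length <= budget (0/1 knapsack)."""
    n = len(scores)
    # best[i][cap] = best total score using the first i segments within capacity cap
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for cap in range(budget + 1):
            best[i][cap] = best[i - 1][cap]
            if lengths[i - 1] <= cap:
                candidate = best[i - 1][cap - lengths[i - 1]] + scores[i - 1]
                if candidate > best[i][cap]:
                    best[i][cap] = candidate
    # Backtrack to recover which segments were taken.
    chosen, cap = [], budget
    for i in range(n, 0, -1):
        if best[i][cap] != best[i - 1][cap]:
            chosen.append(i - 1)
            cap -= lengths[i - 1]
    return sorted(chosen)


# Hypothetical segment scores (mean predicted importance) and lengths in frames.
scores = [0.9, 0.4, 0.7, 0.2]
lengths = [40, 30, 50, 20]
summary = knapsack_select(scores, lengths, budget=90)
```

Because the selection is discrete, a small change in one score can swap an entire segment in or out of the summary, which is the sensitivity the decoder-aligned regularization is meant to reduce.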

2605.09479 2026-05-12 eess.IV cs.CV cs.MM updated version

ML-CLIPSim: Multi-Layer CLIP Similarity for Machine-Oriented Image Quality

Feng Ding, Haisheng Fu, Jie Liang, Qihan Xu, Siyu Zhu, Jingning Han

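Affiliations: Simon Fraser University; University of British Columbia; Eastern Institute of Technology; Xi’an Jiaotong University; Google Inc

AI summary: This paper approaches full-reference image quality assessment from the machine's perspective, asking how much information an image retains for multiple downstream models. It proposes ML-CLIPSim, a differentiable quality metric built on the CLIP vision encoder that aggregates intermediate feature similarities and the global image embedding to approximate machine-perceived quality. Experiments show it outperforms conventional fidelity and perceptual metrics on machine preference, human quality prediction, and image compression tasks.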

2605.09477 2026-05-12 cs.CV cs.AI updated version

Outlier-Robust Diffusion Solvers for Inverse Problems

Yang Zheng, Jiahua Liu, Tongyao Pang, Wen Li, Zhaoqiang Liu

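Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; Yau Mathematical Sciences Center, Tsinghua University

AI summary: This paper studies how to solve inverse problems with diffusion models when the measurements contain outliers. For robustness, the authors first refine the measurements through explicit noise estimation, then build an iteratively reweighted least squares objective from the Huber loss. They solve it with a gradient-descent-based method, combined with conjugate gradient to avoid learning-rate tuning. Experiments on several image datasets show strong robustness to outliers and gains over existing diffusion-based solvers.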

Comments Accepted by CVPR 2026
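
The Huber-based iteratively reweighted least squares (IRLS) idea can be shown on a toy problem. The sketch below fits a one-parameter linear model `y = a * x` and solves each weighted least squares step in closed form; it involves no diffusion prior and no conjugate gradient, so it illustrates only the reweighting principle, not the paper's solver. All names and data are illustrative.

```python
def huber_weight(residual, delta):
    """IRLS weight for the Huber loss: 1 inside the threshold, delta/|r| outside."""
    size = abs(residual)
    return 1.0 if size <= delta else delta / size


def ols_slope(xs, ys):
    """Ordinary least squares slope for y = a * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def irls_slope(xs, ys, delta=1.0, iters=20):
    """Huber-robust slope via iteratively reweighted least squares."""
    a = ols_slope(xs, ys)  # start from the non-robust fit
    for _ in range(iters):
        weights = [huber_weight(y - a * x, delta) for x, y in zip(xs, ys)]
        num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
        den = sum(w * x * x for w, x in zip(weights, xs))
        a = num / den
    return a


# Clean data follow y = 2x; the last measurement is a gross outlier.
xs = [float(x) for x in range(1, 11)]
ys = [2.0 * x for x in xs]
ys[-1] += 100.0
a_ols = ols_slope(xs, ys)
a_irls = irls_slope(xs, ys)
```

On this data the ordinary fit is dragged far from the true slope of 2 by the single outlier, while the reweighted fit stays close to it because the outlier's weight shrinks in proportion to its residual.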

2605.09460 2026-05-12 cs.CV cs.AI updated version

When Few Steps Are Enough: Training-Free Acceleration of Identity-Preserved Generation

Dongqi Zheng

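Affiliations: FLUX Diffusion Transformer; InfuseNet

AI summary: This paper studies how to speed up identity-preserved image generation by using fewer sampling steps. The authors propose a training-free method that swaps the pretrained diffusion backbone and disables classifier guidance, substantially improving efficiency while keeping identity similarity high. Experiments show that identity features are already well established in the early steps and that later steps mainly refine details, which yields an efficient, practical strategy for identity-preserved generation.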

2605.09455 2026-05-12 cs.CV updated version

Adaptive 3D Convolution for Remote Sensing Image Fusion

Siran Peng, Xiangyu Zhu, Shang-Qi Deng, Liang-Jian Deng, Zhen Lei

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences

AI summary: This paper addresses remote sensing image fusion: producing a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with rich spectral data. To overcome the spectral distortion and computational cost of existing methods, the authors propose Adaptive 3D Convolution (Ada3D), which generates a unique set of 3D kernels for each input voxel by combining spatial and spectral information, and uses group convolution to reduce complexity. Experiments on five datasets report state-of-the-art performance.

Comments Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026

DOI: 10.1109/TIP.2026.3689418

English abstract

Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

2605.09449 2026-05-12 cs.CV updated version

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li, Zhuoyi Song

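Affiliations: Fudan University; Huawei; Shenzhen Loop Area Institute

AI summary: Multimodal large language models (MLLMs) have advanced in visual understanding and language reasoning but lack a persistent, world-centered (allocentric) spatial representation of 3D environments. SpaceMind++ is a video MLLM architecture that builds a voxelized cognitive map from RGB video, preserving object permanence and spatial topology. A coordinate-guided deep iterative fusion mechanism feeds map-level spatial knowledge back into the original 2D visual features, strengthening spatial reasoning without breaking the existing visual interface. Experiments report strong results on several benchmarks, with better generalization to unseen 3D environments.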

Comments 14 pages, 3 figures

2605.09443 2026-05-12 cs.CV cs.CL updated version

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Yihong Tang, Kehai Chen, Xuefeng Bai, Min Zhang

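Affiliations: Harbin Institute of Technology; Shenzhen Loop Area Institute (SLAI)

AI summary: As multimodal large language models develop, role-playing agents (RPAs) are moving into visual environments, but the generic visual features extracted by existing models tend to override character traits, causing modality-role interference (MRI). The authors propose CAVI, a training-free character-aware visual intervention framework that combines character-guided token pruning, orthogonal feature modulation, and modality-adaptive role guidance to mitigate MRI and improve character-consistent multimodal interaction.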

2605.09442 2026-05-12 cs.CV cs.AI updated version

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

Shanwen Tan, Hao Li, Jingtao Zhang, Xiaosong Jia, Xue Yang, Shaofeng Zhang, Yanyong Zhang

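Affiliations: University of Science and Technology of China; Fudan University; Georgia Institute of Technology; Shanghai Jiao Tong University

AI summary: SWIFT is a training-free framework for multi-prompt long-video generation that targets the tension between semantic coherence and computational cost under continuous semantic switching. It introduces a lightweight Semantic Injection Cache and an Adaptive Dynamic Window, switching semantics without rebuilding the cache, and preserves temporal consistency through head-wise semantic injection and segment-level semantic anchors. SWIFT reaches 22.6 FPS on a single H100 GPU while preserving generation quality.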

Comments Code is available at https://github.com/ShanwenTan/SWIFT

English abstract

Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.

2605.09433 2026-05-12 cs.CV updated version

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang

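Affiliations: Zhejiang University; Shanghai Institute for Advanced Study-Zhejiang University; Shanghai Institute for Mathematics and Interdisciplinary Sciences

AI summary: Preference datasets for text-to-image models usually store only the final winning and losing images. That is insufficient for rectified flow (RF) models, whose generation depends on a specific prior noise sample and follows a nearly straight denoising trajectory. This paper proposes Prior-Noise-Aware Preference Optimization (PNAPO), an offline preference optimization framework for RF models. It keeps the prior noise paired with each winning and losing image, extending the standard triplet to a sextuple, and exploits the straight-line property of RF to interpolate between noise and image, which gives more accurate trajectory estimates and a tighter optimization objective. Experiments on mainstream RF text-to-image models show clear gains on preference metrics with less training compute.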

Comments Accepted by ICML 2026
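
The noise-image interpolation that rectified flow makes possible can be sketched in a few lines. The sketch assumes the convention that `t = 0` is the prior noise and `t = 1` is the image (some implementations use the reverse), treats samples as flat lists of floats, and does not reproduce the PNAPO objective; the function names are illustrative.

```python
def rf_interpolate(noise, image, t):
    """Point at time t on the straight path from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]


def rf_velocity(noise, image):
    """Velocity along a straight path is constant: image minus noise."""
    return [x - n for n, x in zip(noise, image)]


# Toy two-dimensional "noise" and "image" samples.
noise = [1.0, -1.0]
image = [0.0, 3.0]
midpoint = rf_interpolate(noise, image, 0.5)
velocity = rf_velocity(noise, image)
```

When the exact noise that produced an image is stored, intermediate states can be placed on that image's own trajectory instead of being approximated with freshly sampled noise, which is the motivation the summary gives for tracking noise-image pairs.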

2605.09429 2026-05-12 cs.CV cs.AI updated version

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun

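Affiliations: Xiamen University

AI summary: This study asks whether low-attention visual tokens in vision-language models are really redundant. It argues that pruning by shallow-layer attention scores can damage reasoning over complex scenes, a failure the authors call "visual aphasia". They propose COAST, a training-free framework for contrastive adaptive semantic token pruning that uses cross-modal attention to identify key tokens and balances semantic evidence against spatial context. Experiments on several benchmarks show large reductions in visual token count and faster inference with high retained accuracy, across different models and compression settings.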

2605.09425 2026-05-12 cs.CV cs.AI updated version

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

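Affiliations: Gunma University

AI summary: This paper studies how conflicts between conditions in multi-condition diffusion models harm the structural fidelity of generated images, and proposes an attention-based conflict suppression method that improves high-level structural consistency. Using semantic segmentation, depth maps, and edges as joint conditions, the model generates high-quality images that keep scene details, for data augmentation in autonomous driving. The work also builds a driving-oriented generation framework and evaluation protocol, aimed at easing data scarcity in advanced autonomous driving.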

Comments 44 pages, 20 figures. Code and project page available at: https://github.com/ShogoNoguchi/AtteConDA

2605.09422 2026-05-12 cs.CL cs.CV updated version

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin

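Affiliations: Harbin Institute of Technology; Pengcheng Laboratory; National University of Singapore; Peking University; Harvard University

AI summary: Large multimodal models (LMMs) perform well on video understanding, yet in causal discovery they tend to fall back on textual priors, a deficit that is poorly understood. This paper proposes ProCauEval, a perturbation-based evaluation that systematically controls the visual and textual inputs to expose failure modes in causal reasoning. Mainstream LMMs perceive video content accurately but do not use it enough when reasoning causally, and stronger post-training makes the reliance on textual priors worse. The authors propose ADPO, an anti-distillation policy optimization framework that uses reinforcement learning to push models toward visual evidence rather than textual shortcuts; experiments show higher visual engagement with basic understanding preserved.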

Comments 17 pages, 5 figures

2605.09418 2026-05-12 cs.CV cs.RO updated version

MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera, Wanzeng Kong

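Affiliations: Hangzhou Dianzi University; TopXGun Robotics; University of Zaragoza

AI summary: Cross-view place recognition is hard because ground observations and aerial references differ greatly in viewpoint, modality, and structure. MAG-VLAQ fuses multi-modal features extracted by pretrained foundation models and aligns ground and aerial data in a shared embedding space. Its core component is an ODE-conditioned VLAQ mechanism that dynamically adjusts query centers to the multi-modal input, improving scene-specific matching while retaining global retrieval prototypes. On KITTI360-AG the method clearly outperforms prior work, reaching a Recall@1 of 61.1.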

Comments 16 pages, 4 figures, 3 tables

2605.09417 2026-05-12 cs.CV updated version

SAMOFT: Robust Multi-Object Tracking via Region and Flow

Yanchao Wang, Dawei Zhang, Chengzhuan Yang, Wei Liu, Minglu Li, Hua Wang, Zhonglong Zheng, Ming-Hsuan Yang

Affiliations: School of Computer Science and Technology, Zhejiang Normal University; Institute for Sustainable Industries and Liveable Cities, College of Engineering and Science, Victoria University; School of Electrical Engineering and Computer Science, University of California at Merced

AI summary: This paper proposes SAMOFT, a robust multi-object tracker for scenes with complex motion, where deformation, nonlinear motion, and occlusion make tracking difficult. A pixel-level motion matching (PMM) module combines the Segment Anything Model (SAM) with dense optical flow to refine Kalman-filter motion prediction. A center-distance matching (CDM) module improves robustness to low-confidence detections, and a distribution correction (DBC) module corrects track states online. Experiments show SAMOFT clearly outperforms existing methods on several benchmarks.

2605.09407 2026-05-12 cs.CV updated version

AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

Woochul Kang, Hyungseop Lee, Jiho Lee

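Affiliations: Incheon Nat’l Univ.

AI summary: This paper proposes AnyDepth-DETR/-YOLO, an any-depth object detection framework in which a single network trades accuracy for efficiency continuously by controlling depth at inference, with no retraining. The backbone and neck are decomposed into a main path that always runs and refinement paths that can be skipped, preserving the multi-scale feature hierarchy at every depth configuration. Self-distillation between the deepest and shallowest networks, with prediction-level and feature-level alignment losses, keeps the outputs of each stage compatible. On RT-DETR and YOLOv12 the method matches or exceeds the best existing models, and in its efficient configuration runs 1.82x faster at a cost of only 2.0 AP.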

Comments 16 pages, 5 figures, 9 tables

2605.09404 2026-05-12 cs.LG cs.CL cs.CV updated version

Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Huitao Yang, Hengzhi He, Guang Cheng

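Affiliations: University of California, Los Angeles

AI summary: For target-oriented data selection, this study proposes a new reference-path method that reduces the bias conventional approaches can show on heterogeneous data pools. A short warm-up on the target validation set produces a validation-induced reference path, and each candidate sample is scored by its endpoint loss drop along that path, so selection needs no gradients or Hessian approximations. Across several experiments the method performs on par with dynamics-based attribution methods while greatly reducing warm-up and storage costs, and the path can be reused across data pools.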

2605.09392 2026-05-12 cs.CV updated version

HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies

Zihan Ma, Tian Xia, Kexin Wang, Xiao Li, Xiaowei He, Yudan Ren

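Affiliations: School of Electronic Information (School of Artificial Intelligence), the Xi’an Key Laboratory of Radiomics and Intelligent Perception, Northwest University

AI summary: This paper proposes HyNeuralMap, a framework that maps visual semantics onto cross-subject neural hierarchies to help explain the complex mapping between visual stimuli and neural responses. It uses the hyperbolic Lorentz model, with the negative curvature of hyperbolic space acting as an inductive bias that better captures both the hierarchy of visual semantics and cross-subject neural similarity. HyNeuralMap outperforms Euclidean methods on multi-label semantic prediction and cross-modal retrieval, supporting hyperbolic geometry for cross-modal semantic alignment and hierarchical modeling.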

Comments 14 pages, 4 figures

2605.09384 2026-05-12 cs.CV cs.AI q-bio.QM updated version

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao

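Affiliations: School of Information Technology; Monash University Malaysia; Faculty of Innovation Engineering; Macau University of Science and Technology; Department of Bioelectronics; Faculty of Biomedical Engineering; Shenzhen University of Advanced Technology

AI summary: This paper proposes LiteMedCoT-VL, a parameter-efficient adaptation method that improves the reasoning of medical visual question answering (VQA) models on resource-constrained devices. Using LoRA-based fine-tuning, it transfers the chain-of-thought reasoning of a large teacher model to a small student model without relying on image captions, which is closer to real clinical use. LiteMedCoT-VL reaches 64.9% accuracy on the PMC-VQA benchmark, clearly above existing baselines, suggesting that small models can match or exceed larger ones through reasoning distillation.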

Comments 17 pages, 5 figures

2605.09378 2026-05-12 cs.CV cs.AI cs.CL updated version

EduStory: A Unified Framework for Pedagogically-Consistent Multi-Shot STEM Instructional Video Generation

Xinyi Wu, Jayant Teotia, Shuai Zhao, Erik Cambria

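Affiliations: Nanyang Technological University; Shanghai Jiao Tong University

AI summary: EduStory is a unified framework for generating multi-shot STEM instructional videos that are pedagogically consistent. It integrates pedagogical state modeling, script-guided structured control, and learning-oriented evaluation metrics to improve knowledge consistency and the coherence of the teaching narrative. The work also introduces EduVideoBench, a benchmark supporting multi-granularity analysis of generated videos; experiments show clear advantages in preserving teaching intent and knowledge accuracy.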

2605.09362 2026-05-12 cs.GR cs.CV updated version

FrameTwin: Curve-Anchored Gaussian Alignment from Sparse Views for Adaptive Wireframe 3D Printing

Wenting Wang, Zhuo Huang, Kun Qian, Neelotpal Dutta, Yuhu Guo, Yingjun Tian, Yeung Yam, Charlie C. L. Wang

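Affiliations: Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong; Centre of Perceptual and Interactive Intelligence, Hong Kong; University of Manchester; Department of Mechanical and Aerospace Engineering, The University of Manchester, United Kingdom

AI summary: This paper proposes FrameTwin, a framework for curve-anchored Gaussian alignment from sparse-view images, aimed at adaptive wireframe 3D printing. By anchoring Gaussian kernels to parametric curves, it captures the deformation of thin strut structures and yields a compact, geometry-aware encoding that makes the topology of the struts explicit. Unlike generic Gaussian splatting, constraining the Gaussians to lie along parametric curves greatly reduces the ambiguity of thin structures under sparse views, gives a globally consistent alignment of the deformation field, and can be used to adapt subsequent printing paths on the fly.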

2605.09339 2026-05-12 cs.CV cs.AI updated version

Perceptual Asymmetry Between Hue Categories: Evidence from Human Color Categorization

Elnara Kadyrgali, Nuray Toganas, Muragul Muratbekova, Pakizar Shamoi

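Affiliations: School of Information Technology and Engineering; Kazakh-British Technical University

AI summary: Human color categories are not uniformly distributed in perceptual space, yet most computational color models still assume fixed, uniform representations. By analyzing large-scale human color categorization data, this paper extends the COLIBRI fuzzy color model with quantitative metrics based on fuzzy membership functions and reveals perceptual asymmetry between hue categories. Yellow occupies a compact, well-defined region of hue space, whereas green spans a wider interval with longer transitions. Human color categories are therefore not only fuzzy but also highly non-uniform in their geometric organization, which informs linguistic color categorization and perception-driven color modeling.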

Comments The paper has been submitted for consideration to ICICS 2026 (International Conference on Informatics and Computer Science)
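
A fuzzy membership function over hue can be sketched with a trapezoid. The breakpoints below are hypothetical values chosen only to mimic the qualitative finding (a compact yellow category, a broad green one with long transitions); they are not taken from the paper or from the COLIBRI model, and the width measures are illustrative stand-ins for the paper's metrics.

```python
def trapezoid(hue, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if hue <= a or hue >= d:
        return 0.0
    if hue < b:
        return (hue - a) / (b - a)
    if hue <= c:
        return 1.0
    return (d - hue) / (d - c)


def support_width(a, b, c, d):
    """Width of the hue interval with non-zero membership."""
    return d - a


def transition_width(a, b, c, d):
    """Total width of the two sloped (partial-membership) regions."""
    return (b - a) + (d - c)


# Hypothetical breakpoints in hue degrees, for illustration only.
YELLOW = (45.0, 52.0, 62.0, 70.0)
GREEN = (65.0, 100.0, 150.0, 190.0)

yellow_at_57 = trapezoid(57.0, *YELLOW)
yellow_at_48_5 = trapezoid(48.5, *YELLOW)
```

Comparing `support_width` and `transition_width` for two categories is one simple way to quantify the kind of asymmetry the summary describes.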

2605.09328 2026-05-12 cs.CV updated version

Noise-Started One-Step Real-World Super-Resolution via LR-Conditioned SplitMeanFlow and GAN Refinement

Wei Zhu, Kai Zhang, Yu Zheng, Lei Luo, Yong Guo, Jian Yang

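Affiliations: Nanjing University of Science and Technology; Nanjing University; Huawei

AI summary: This study proposes SMFSR, a one-step diffusion-based method for real-world image super-resolution that addresses the efficiency-quality trade-off of conventional diffusion models. It keeps a noise-started generation process, maps noise directly to the high-resolution image with an LR-conditioned SplitMeanFlow, and adds a GAN refinement stage to improve detail realism and naturalness. Experiments show SMFSR attains the best perceptual quality among one-step diffusion models for real-world super-resolution while keeping efficient single-step inference.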

2605.09319 2026-05-12 cs.CV cs.LG updated version

PGID: Progressive Guided Inversion and Denoising for Robust Watermark Detection

Minh Quoc Duong, Chun Tong Lei, Chun Pong Lau

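Affiliations: City University of Hong Kong

AI summary: As AI-generated images spread, digital watermarking has become an important tool for protecting intellectual property and preventing malicious use. Existing semantic watermarking methods, however, detect watermarks through diffusion inversion and are vulnerable to watermark removal and forgery attacks. This paper proposes PGID, a training-free progressive guided inversion and denoising framework that defends against these attacks: progressive inversion-denoising cycles project the perturbed latent back to its original region, recovering removed watermarks and identifying forged instances.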

2605.09317 2026-05-12 cs.CL cs.CV cs.LG 版本更新

Mem-W: Latent Memory-Native GUI Agents

Guibin Zhang, Yaohui Ling, Fanci Meng, Kun Wang, Shuicheng Yan

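Affiliations: LV-NUS Lab

AI summary: This paper proposes Mem-W, a GUI agent that treats memory as part of the agent's continuous context rather than as a conventional external add-on. A shared trajectory-to-latent compressor encodes past trajectories and segments of the current session into compact memory tokens, which are fused with the current GUI observation into one continuous embedding sequence for unified perception of task progress and decision making. Across several web and mobile navigation tasks, Mem-W substantially improves multiple base models and memory-augmented methods, by up to 30.0%, showing that latent-native memory is effective and scalable for long-horizon GUI operation.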

2605.09312 2026-05-12 cs.CV updated version

Low-Cost Neural Radiance Fields

Alice Huang, Prathamesh Sonawane, Yashdeep Thorat, Yug Rao

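Affiliations: University of Illinois Urbana Champaign

AI summary: This paper studies how to speed up training and inference of neural radiance fields (NeRF) under limited compute and data. The authors compare three accelerated NeRF models and run extended experiments for low-compute, low-data settings, including a depth-supervised loss, a simplified feature decoding network, and HashNeRF variants with different architectures. Under equal training time none of the modifications clearly beats the existing baselines, but the results indicate which changes suit constrained settings and suggest directions for future work.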

Comments 7 pages

2605.09302 2026-05-12 cs.LG cs.CV updated version

Discrete Langevin-Inspired Posterior Sampling

Chaitanya Amballa, Sattwik Basu, Jorge Vančo Sampedro, Romit Roy Choudhury

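Affiliations: University of Illinois Urbana-Champaign

AI summary: This paper studies posterior sampling for inverse problems in discrete state spaces, using discrete diffusion models as generative priors. Existing methods mostly rely on continuous relaxations, Gibbs updates, or mechanisms tied to a specific corruption process, which limits scalability and generality. The authors propose ΔLPS, a posterior sampler based on discrete Langevin dynamics that uses gradient information to sample efficiently without leaving the discrete state space, updates all dimensions in parallel, and works with discrete diffusion models trained in different ways. On tasks such as image restoration and spatial mapping it outperforms existing discrete diffusion posterior samplers and is competitive with continuous diffusion methods.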

2605.09296 2026-05-12 cs.CV cs.AI cs.LG updated version

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

Boxuan Zhang, Jianing Zhu, Qifan Wang, Jiang Liu, Ruixiang Tang

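Affiliations: Rutgers University; The University of Texas at Austin; Meta AI; Advanced Micro Devices

AI summary: Recent generative models produce highly realistic images, making real and AI-generated images increasingly hard to tell apart. Detectors built on pretrained feature extractors lean too heavily on global semantics and overlook telltale micro-defects. This paper proposes MDMF, a detection framework based on local distributional shifts: by amplifying small statistical irregularities in an image, it exposes the macro-level distributional difference of AI-generated images and improves detection markedly. MDMF outperforms existing methods on several benchmarks.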

Comments 41 pages, 10 figures

2605.09288 2026-05-12 cs.LG cs.AI cs.CE cs.CV cs.NA math.NA updated version

MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving

Ethan Hsu, Hong Meng Yam, Ivan Ge

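Affiliations: Stanford University

AI summary: This paper proposes MC², a hybrid solver that combines the Walk-on-Spheres Monte Carlo method with a neural network to solve elliptic partial differential equations (PDEs) efficiently. A low-budget Monte Carlo solution serves as a structured estimator, and a neural network is trained to correct it in a single forward pass, yielding an accurate solution much faster. The paper also releases PDEZoo, a standardized benchmark of two million elliptic PDEs for studying PDE solving under limited compute.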
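
The Monte Carlo half of this approach, Walk-on-Spheres, can be sketched for the simplest case. The code below estimates the solution of Laplace's equation on the unit disk with Dirichlet boundary data; the domain, the stopping tolerance, and all names are assumptions made for illustration, and the neural correction step is not shown.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, rng, eps=1e-3, max_steps=1000):
    """One Walk-on-Spheres sample for Laplace's equation on the unit disk."""
    for _ in range(max_steps):
        dist = 1.0 - math.hypot(x, y)  # distance from (x, y) to the unit circle
        if dist < eps:
            break
        # Jump to a uniformly random point on the largest circle inside the domain.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += dist * math.cos(theta)
        y += dist * math.sin(theta)
    r = max(math.hypot(x, y), 1e-12)
    return boundary_value(x / r, y / r)  # evaluate at the nearest boundary point


def estimate(x, y, boundary_value, n_walks, seed=0):
    """Average many walks; the mean approximates the harmonic function at (x, y)."""
    rng = random.Random(seed)
    total = sum(walk_on_spheres(x, y, boundary_value, rng) for _ in range(n_walks))
    return total / n_walks


# Boundary data g(x, y) = x has the exact harmonic extension u(x, y) = x.
u_hat = estimate(0.3, 0.2, lambda bx, by: bx, n_walks=4000)
```

With few walks the estimate is cheap but noisy, since the error shrinks only as the inverse square root of the number of walks; that noise is what a learned correction would have to remove.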

2605.09279 2026-05-12 cs.GR cs.CV cs.MM cs.NI eess.IV updated version

CAGS: Color-Adaptive Volumetric Video Streaming with Dynamic 3D Gaussian Splatting

Daheng Yin, Yili Jin, Jianxin Shi, Isaac Ding, Miao Zhang, Fangxin Wang, Zhaowu Huang, Cong Zhang, Jiangchuan Liu, Fang Dong

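Affiliations: Simon Fraser University; Jiangxing Intelligence Inc.; McGill University; Nankai University; The Chinese University of Hong Kong, Shenzhen; Fuzhou University; Southeast University

AI summary: This paper proposes CAGS, a color-adaptive volumetric video streaming system that addresses the bandwidth consumption and quality degradation of dynamic 3D Gaussian Splatting in real-time transmission. It builds levels of detail (LoD) through vector quantization and corrects color distortion using low-resolution reference images. Experiments show CAGS improves PSNR by 5 to 20 dB over existing adaptive streaming systems under fluctuating bandwidth, runs faster than existing scalable Gaussian compression methods, and generalizes across Gaussian representations.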

Comments SIGGRAPH 2026 Conference Paper. Code is available at https://github.com/yindaheng98/ColorAdaptiveGaussianSplatting

Journal ref ACM SIGGRAPH 2026

DOI: 10.1145/3799902.3811058

Abstract

Volumetric video (VV) streaming enables real-time, immersive access to remote 3D environments, powering telepresence, ecological monitoring, and robotic teleoperation. These applications turn VV streaming into a real-time interface to remote physical environments, imposing new system-level demands for photorealistic scene representation, low-latency interaction, and robust performance under heterogeneous networks. 3D Gaussian Splatting (3DGS) has been widely used for real-time photorealistic rendering, offering superior visual quality and rendering performance, but it faces challenges due to bandwidth consumption. Furthermore, as the foundation of adaptive VV streaming, existing Levels of Detail (LoD) methods based on density are not well-suited to Gaussian representations, leading to visible gaps and severe quality degradation. Recent studies have also explored attribute compression techniques to reduce bandwidth consumption. Our preliminary studies reveal that aggressive attribute compression primarily causes color distortion, which can be effectively corrected in the rendered image using a reference image. Motivated by these findings, we propose a novel Color-Adaptive scheme for adaptive VV streaming that uses vector quantization (VQ) to establish LoDs and correct color distortions with low-resolution reference images. We further present CAGS, an adaptive VV streaming system compatible with diverse Gaussian representations, which integrates the Color-Adaptive scheme by rendering reference images on the streaming server and performing color restoration on the client. Extensive experiments on our prototype system demonstrate that CAGS outperforms the existing adaptive streaming systems in PSNR by 5$\sim$20 dB under fluctuating bandwidth, operates significantly faster than existing scalable Gaussian compression methods, and generalizes across different Gaussian representations.
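
To put the reported 5 to 20 dB PSNR gains in context, here is a minimal sketch of the standard PSNR definition. It assumes 8-bit pixel values in a flat list; the helper name `psnr` and the sample pixel values are made up for illustration and do not come from the paper.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the two signals.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical pixel values: every pixel is off by 5, so the MSE is 25.
reference = [52, 55, 61, 59, 79, 61, 76, 61]
distorted = [p + 5 for p in reference]
score = psnr(reference, distorted)
```

Because PSNR is logarithmic in the mean squared error, a 10 dB improvement corresponds to a tenfold reduction in MSE.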

2605.09276 2026-05-12 cs.LG cs.CV version update

Uncertainty-Aware Token Importance Estimation in Spiking Transformers

Wenxuan Liu, Zecheng Hao, Tong Bu, Yuran Wang, Zhaofei Yu

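Affiliations School of Computer Science, Peking University; School of Computer Science and Institute for Artificial Intelligence, Peking University; Peking University

AI summary This paper studies how to estimate token importance more accurately in spiking Transformers in order to cut redundant computation and speed up inference. Existing methods rely mainly on response features such as activation magnitude or firing statistics, which do not reflect how a token's uncertainty changes as it evolves over time. The authors propose Uncert, a training-free, plug-and-play framework that models each token's class evidence and analyzes its temporal uncertainty pattern, giving a new basis for assessing token importance. In experiments on static and neuromorphic benchmarks, it achieves a good accuracy-efficiency trade-off, particularly in token pruning.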

2605.09272 2026-05-12 cs.AI cs.CL cs.CV version update

Towards Conversational Medical AI with Eyes, Ears and a Voice

Meet Shah, Jason Gusdorf, Anil Palepu, Chunjong Park, Jack W. O'Sullivan, Vishnu Ravi, Tim Strother, Pavel Dubov, Aliya Rysbek, Toshiyuki Fukuzawa, Yana Lunts, Jan Freyberg, Michael B. Chang, Aniruddh Raghu, David Stutz, Devora Berlowitz, Eliseo Papa, Taylan Cemgil, JD Velasquez, Jack Chen, Arthur Chen, Doug Fritz, Charlie Taylor, Katya Tregubova, Jing Rong Lim, Richard Green, Sara Mahdavi, Mahvish Nagda, Jihyeon Lee, Craig Schiff, Liviu Panait, Sukhdeep Singh, Valentin Liévin, David G. T. Barrett, Hannah Gladman, Anna Cupani, Francesca Pietra, Uchechi Okereke, Katherine Tong, Clemens Meyer, Erwan Rolland, Mili Sanwalka, Michael D. Howell, Shixiang Shane Gu, Bibo Xu, Euan A. Ashley, S. M. Ali Eslami, Gregory Wayne, Pushmeet Kohli, Vivek Natarajan, Adam Rodman, Alan Karthikesalingam, Ryutaro Tanno

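Affiliations Google DeepMind; Google Research; Beth Israel Deaconess Medical Center, Harvard Medical School; Stanford University

AI summary This study presents AI co-clinician, a conversational medical AI system that processes audio-visual data from doctor-patient conversations in real time to support clinical decisions. Built on Gemini's low-latency audio and video processing, it uses a dual-agent architecture that balances deep clinical reasoning with the low-latency responses natural dialogue requires. In experiments, AI co-clinician approaches primary care physicians on several key evaluation dimensions and significantly outperforms GPT-Realtime on the general criteria, but gaps remain in physical examination and disease-specific reasoning, underscoring the importance of audio-visual information in medical consultation.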

Comments Video examples are available on YouTube: https://youtu.be/y5Vaa_SN1t0, https://youtu.be/dC4icb75vLQ, and https://youtu.be/E7iEvWo-E6c

Abstract

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

2605.09269 2026-05-12 cs.CL cs.CV version update

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

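Affiliations Tencent Hunyuan; University of Maryland, College Park; University of North Carolina, Chapel Hill

AI summary DeltaRubric is a generative approach to reward modeling for multimodal large language models that targets the bias of existing evaluation methods when judging visual details. It splits evaluation into a planning step and a verification step: it dynamically generates an instance-specific checklist and then verifies it against the image and the question, improving the accuracy and reliability of evaluation. Experiments show that DeltaRubric significantly improves reward modeling on multiple benchmarks.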

2605.09262 2026-05-12 cs.CV cs.CL version update

Reinforcing Multimodal Reasoning Against Visual Degradation

Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

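Affiliations Tencent Hunyuan; University of Maryland, College Park; University of Virginia; University of North Carolina, Chapel Hill

AI summary This study addresses the drop in reasoning ability of multimodal large language models under real-world visual degradation (such as blur and compression artifacts) and proposes ROMA, a reinforcement-learning fine-tuning framework. Using a dual forward-pass strategy, a distributional consistency constraint, and correctness-conditioned regularization, it improves robustness to visual degradation without hurting performance on clean inputs. Experiments show that ROMA significantly outperforms existing methods on multiple multimodal reasoning benchmarks, improving reasoning accuracy under both seen and unseen degradations.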

2605.09258 2026-05-12 cs.CV cs.AI version update

Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models

R. James Cotton, Pouyan Firouzabadi, Wendy Murray

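Affiliations Shirley Ryan AbilityLab, Department of PM&R, Northwestern University; Shirley Ryan AbilityLab, Department of Biomedical Engineering, Northwestern University

AI summary This study tackles accurate tracking of finger biomechanics from monocular video. It combines the SAM 3D Body foundation model with inverse kinematics optimization to extract anatomically constrained finger joint angles from single-view video. The authors port the model to JAX and integrate it with MuJoCo-MJX for efficient GPU-accelerated optimization, and establish a new mapping between the Momentum Human Rig output and the markers of the biomechanical model. Across a range of hand movements and object manipulation tasks, the method reaches joint angle errors of about 10 degrees and hand position errors of 6 mm, with good viewpoint consistency and robustness, opening a new route to quantitative video-based hand movement analysis.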

Comments Accepted to EMBC 2026

2605.09242 2026-05-12 eess.IV cs.CV version update

Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading

Yiqun Wang

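Affiliations School of Software Engineering, Beijing Jiaotong University

AI summary This paper proposes CGSD, a cross-modal semantic-enhanced diffusion framework that combines vision-language pretraining with diffusion probabilistic modeling for automated diabetic retinopathy grading. It fine-tunes a domain-specific vision-language model with low-rank adaptation, narrowing the distribution gap between the pretrained model and the target dataset, and builds a cross-modal semantic condition vector from the dot product of image features and text descriptions of the lesion grades. This vector conditions the diffusion denoising network, improving the model's sensitivity to fine-grained lesion features and clinical semantics. On the APTOS 2019 dataset, the method achieves higher accuracy and F1 score than existing methods.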

Comments 6 pages, 3 figures, 2 tables

2605.07910 2026-05-12 cs.CV version update

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Yulong Chen, Xiaoyun Dong, Haoyu Zhang, Zongxian Yang, Lewei Xie, Xinke Li, Yifan Zhang, Kai Wang, Jianping Wang

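Affiliations City University of Hong Kong (Dongguan); City University of Hong Kong; SLAI

AI summary This paper studies dynamic scene reconstruction from vehicle-infrastructure cooperative autonomous driving (VICAD) data. Existing Gaussian scene graph methods assume synchronized observations, so they cannot handle the temporal misalignment between vehicle and infrastructure cameras, which causes severe ghosting on dynamic objects. The authors propose DUST, a decoupled spatio-temporal Gaussian scene graph that keeps an independent pose trajectory for each agent while sharing a unified appearance representation; this removes cross-source interference and yields significant gains on the V2X-Seq dataset.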

2605.07649 2026-05-12 cs.CV cs.AI cs.RO version update

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Berkehan Ünal, Hauke Dierend, Dren Fazlija, Christopher Plachetka

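Affiliations Volkswagen Aktiengesellschaft; L3S Research Center; Faculty of Information Technology; MOIA GmbH; Motor AI GmbH

AI summary This paper studies how to use vision-language models (VLMs) for zero-shot perception of the operational design domain (ODD) in support of safety-critical applications such as automated driving systems. Through experiments on a custom dataset and Mapillary Vistas, the authors evaluate four VLMs on zero-shot classification and detection and analyze the effect of different optimization strategies. They propose a definition-anchored chain-of-thought prompting method which, combined with role decomposition, significantly improves perception performance and offers a practical path to transparent, efficient ODD perception systems.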

Comments 8 pages, 4 figures

2605.07399 2026-05-12 cs.CV version update

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

Yu Pan, Andi Zhang, Yi Wang, Sibei Yang, Wenjie Wang

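Affiliations ShanghaiTech University; University of Warwick; Sun Yat-sen University

AI summary This paper studies the safety of diffusion vision-language models (dVLMs) under jailbreak attacks and shows that their apparent robustness to conventional fixed-prefix optimization (FPO) attacks is illusory. The authors propose a new jailbreak method based on global probability optimization (GPO) that manipulates the diffusion model's denoising trajectory to bypass its safeguards, and further develop GPO-V, the first visual-modality jailbreak framework for dVLMs. Experiments show that GPO-V produces stealthy perturbations that transfer across models, exposing critical vulnerabilities in non-sequential generation architectures and highlighting the urgency of safety alignment for dVLMs.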

2605.07203 2026-05-12 cs.CV version update

From Pixels to Primitives: Scene Change Detection in 3D Gaussian Splatting

Chamuditha Jayanga Galappaththige, Jason Lai, Timothy Patten, Donald Dansereau, Niko Suenderhauf, Dimity Miller

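Affiliations QUT Centre for Robotics; ARIAM; ACFR, University of Sydney

AI summary This paper studies scene change detection with Gaussian Splatting and proposes comparing scenes directly in the space of the native Gaussian parameters instead of comparing rendered images. Analyzing the primitive attributes of the Gaussians (position, anisotropic covariance, and color), the authors show that these attributes already carry enough change information, and they introduce anisotropic models of geometric and photometric drift plus a per-Gaussian observability term to handle the under-constrained nature of the representation. The method has advantages in multi-view consistency and in distinguishing change types, and improves on existing methods by about 17% on real-world datasets.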

Comments Project Page: https://chumsy0725.github.io/GS-DIFF/

2605.06969 2026-05-12 cs.CV version update

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Yuchen Guo, Junli Gong, Yao Lu, Xintong Xu, Yiuming Cheung, Weifeng Su

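Affiliations Northwestern University; Northeastern University; University of Washington; Hong Kong Baptist University; Beijing Normal - Hong Kong Baptist University

AI summary This study aims to make quality assessment of infrared-visible image fusion (IVIF) more accurate. To address existing methods' over-reliance on handcrafted features and full-reference metrics, it proposes FuScore, an assessment method based on multimodal large language models (MLLMs). FuScore has the MLLM produce a continuous quality score instead of predicting discrete levels, so it can separate images of similar quality at a fine granularity; it builds soft labels from multi-dimensional consistency and adds a triple objective function to make the assessment more comprehensive and robust. Experiments show that FuScore reaches state-of-the-art correlation with human visual preference.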

2605.06681 2026-05-12 cs.LG cs.CV version update

A Hierarchical Ensemble Pipeline for Anomaly Detection in ESA Satellite Telemetry

Lorenzo Riccardo Allegrini, Geremia Pompei

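Affiliations ContinualIST, Pisa, Italy; University of Pisa, Department of Computer Science, Pisa, Italy

AI summary This paper proposes a hierarchical ensemble pipeline for anomaly detection in European Space Agency (ESA) satellite telemetry. It combines shapelet extraction, statistical feature analysis, per-channel modeling, within-channel stacking, and cross-channel aggregation, and is trained and validated with time-series cross-validation and a two-level masking strategy that prevents information leakage. On the ESA-ADB benchmark, the method generalizes well and detects subtle anomalies in real satellite telemetry.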

Comments 15 pages, 3 figures, 1 table. Submitted to the ML4ITS workshop at the ECML PKDD 2025 conference. Awarded 2nd place in the final round of the Spacecraft Anomaly Challenge on ESA dataset. (Ranked 1st on the Kaggle public leaderboard and 3rd on the private leaderboard)

Journal ref Communications in Computer and Information Science 2842 (2026) Chapter 7

2605.05831 2026-05-12 cs.CV version update

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

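Affiliations IIIT Hyderabad; Microsoft Research India & IIT Hyderabad

AI summary Scientific communication is increasingly multimodal: papers, slides, and videos jointly convey research results, yet there is currently no structured way to link them. This paper introduces the Multimodal Conference Dataset (MCD), the first dataset to bring together research papers, presentation videos, explanatory videos, and slides, and evaluates a range of embedding-based and vision-language models on fine-grained cross-format correspondence. Vision-language models are robust overall but still fall short on fine-grained alignment, while embedding-based models do well on text-to-visual correspondence but show pronounced clustering differences on formulas and symbolic content, pointing to directions for future work on multimodal scientific understanding.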

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

2605.05072 2026-05-12 cs.CV version update

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

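Affiliations Nanjing University of Science and Technology; National University of Singapore

AI summary This paper studies how to predict 3D scene occupancy accurately from camera and LiDAR data, focusing on the problem that conventional methods sample the projection space at fixed locations and so adapt poorly to the height variation and sparsity of real scenes. The authors propose HiPR, a framework that uses height-guided projection reparameterization to dynamically adjust the sampling range of the LiDAR point cloud, so that projected points fall in geometrically meaningful regions. Experiments show that HiPR significantly outperforms existing state-of-the-art methods while keeping real-time inference.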

2605.05045 2026-05-12 cs.CV cs.CL version update

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

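Affiliations The Pennsylvania State University

AI summary This study analyzes relation hallucination in vision-language models under visual perturbations such as rotation and noise, showing that even mild image perturbations significantly impair a model's reasoning about relations between objects. It evaluates several prompt-based augmentation and preprocessing strategies and finds that they mitigate the problem only partially and cannot eliminate relation hallucination. The results indicate a remaining gap between perceptual robustness and relational understanding in current models and a need for more geometry-aware vision-language models.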

2605.03652 2026-05-12 cs.CV cs.AI version update

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

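Affiliations Tencent HY Team

AI summary This paper presents AniMatrix, an anime video generation model designed around anime art style instead of using physical reality as a prior. With a dual-channel conditioning mechanism and a three-step transition strategy, it redefines what counts as "correct", removes conventional models' dependence on physical laws, and separates artistic expression from generation failure. In evaluations involving professional animators, AniMatrix performs strongly, significantly outperforming existing models in prompt understanding and artistic motion generation in particular.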

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

2605.03438 2026-05-12 cs.CV version update

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

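Affiliations Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary This paper proposes Mantis, a parameter-efficient fine-tuning framework designed for Mamba-based 3D point cloud foundation models. It introduces a State-Aware Adapter (SAA) for fine-grained state-level adaptation with the pretrained backbone frozen, and uses Dual-Serialization Consistency Distillation (DSCD) to reduce the instability caused by serialization. Experiments show that Mantis achieves competitive performance on multiple benchmarks with only about 5% of the parameters trainable.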

2605.01402 2026-05-12 cs.CL cs.CV cs.LG version update

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

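Affiliations The Hong Kong University of Science and Technology

AI summary Multimodal large language models (MLLMs) perform poorly on numerical regression tasks with long-tailed distributions: existing token-level supervised fine-tuning tends to favor high-density regions, causing regression to the mean and degraded performance in the tails. This paper proposes a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization. It introduces a reward based on the concordance correlation coefficient that provides cross-sample comparative supervision at the batch level, aligning predictions with the true distribution in correlation, scale, and mean. The method needs no changes to the model architecture and outperforms conventional fine-tuning on a range of long-tailed regression benchmarks, with the largest gains in medium-shot and few-shot regions.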
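
The concordance correlation coefficient named in this summary has a standard closed form (Lin's CCC). The sketch below shows only that textbook formula; how the paper turns it into a batch-level reward is not stated here, and the function name and sample data are hypothetical.

```python
def concordance_correlation(predictions, targets):
    """Lin's concordance correlation coefficient between two sequences.

    CCC = 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2),
    using population (divide-by-n) moments.
    """
    n = len(predictions)
    if n != len(targets) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions) / n
    var_t = sum((t - mean_t) ** 2 for t in targets) / n
    cov = sum((p - mean_p) * (t - mean_t)
              for p, t in zip(predictions, targets)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        # Both sequences are the same constant; treated here as perfect
        # agreement (an illustrative choice).
        return 1.0
    return 2.0 * cov / denom


targets = [1.0, 2.0, 3.0, 4.0]
perfect = concordance_correlation(targets, targets)
shifted = concordance_correlation([2.0, 3.0, 4.0, 5.0], targets)
collapsed = concordance_correlation([2.5, 2.5, 2.5, 2.5], targets)
```

Predictions that collapse to the mean score zero, and a constant offset is penalized even though the Pearson correlation would still be 1, which matches the summary's point about aligning correlation, scale, and mean together.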

Comments Accepted by ICML 2026

2605.00642 2026-05-12 cs.AI cs.CV version update

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

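Affiliations Institute of Information Engineering, Chinese Academy of Sciences; VCIP & TMCC & DISSec, College of Computer Science, Nankai University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary This paper proposes GUI-SD, the first on-policy self-distillation (OPSD) framework for GUI grounding, which targets the training inefficiency and sample sparsity of existing reinforcement learning methods. By constructing a visually enriched privileged context and introducing an entropy-guided distillation strategy, it provides dense supervision in a single interaction, improving both grounding accuracy and training efficiency. Experiments show that GUI-SD outperforms existing methods on six representative benchmarks.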

Comments under review

2605.00548 2026-05-12 cs.CV cs.GR version update

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

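Affiliations Reichman University

AI summary This paper studies the properties of the input noise in diffusion models and finds that the low-frequency components of the white noise largely determine an image's global structure and color composition, while the high-frequency components control detail. Based on this, the authors propose a training-free low-frequency noise manipulation method that steers generation through simple operations on the low-frequency noise, giving effective control over overall structure and color while preserving output diversity.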

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

2605.00408 2026-05-12 cs.CV version update

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

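Affiliations Pengcheng Laboratory, Shenzhen; Harbin Institute of Technology, Shenzhen

AI summary This paper proposes LeGS, a learnable density control method that removes 3D Gaussian Splatting's (3DGS) reliance on heuristic density control rules. It models density control as a parameterized policy network optimized with reinforcement learning and designs an effective reward function based on sensitivity analysis that precisely quantifies each Gaussian's contribution to reconstruction quality. Experiments show that LeGS significantly outperforms existing methods on multiple datasets and strikes a better balance between reconstruction quality and computational efficiency.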

Comments 9 pages, 5 figures

2604.24954 2026-05-12 cs.LG cs.AI cs.CV version update

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng 
Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

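Affiliations NVIDIA

AI summary This paper introduces Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to support audio input natively alongside text, images, and video. It improves on architecture, training data, and training recipe, and is more accurate across tasks in multiple modalities, standing out in real-world document understanding, long audio-video understanding, and agentic computer use. Built on the efficient Nemotron 3 Nano 30B-A3B architecture, it introduces multimodal token-reduction techniques that significantly lower inference latency and raise throughput, and the release includes model weights in several precision formats plus part of the training data and code to support further research.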

2604.19923 2026-05-12 cs.CV version update

UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu, Michael J. Black, Angela Yao

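Affiliations National University of Singapore; Max Planck Institute for Intelligent Systems

AI summary This paper proposes UniCon3R, a unified feed-forward framework for online 4D human-scene reconstruction from monocular video. It explicitly models contact between the human and the scene and uses the contact information as a corrective cue to improve human mesh reconstruction, recovering scene geometry and an aligned 4D human while keeping inference fast. Experiments show that UniCon3R outperforms existing methods in physical plausibility and human motion estimation, confirming that contact is a strong prior for joint reconstruction.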

Comments Project page: https://surtantheta.github.io/UniCon3R

2604.19748 2026-05-12 cs.CV version update

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng

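Affiliations Alibaba Group; Taobao App

AI summary Tstars-Tryon 1.0 is an efficient, realistic, and robust virtual try-on system that handles many challenges of complex real-world scenes, such as extreme poses, lighting changes, and motion blur. It supports many garment categories and flexible combinations of multiple reference images, producing high-quality images with fine detail and realistic materials while avoiding common AI-generation artifacts. With an end-to-end model architecture and optimized inference speed, the system generates results in near real time and has been deployed at scale in the Taobao App, serving millions of users.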

Comments 24 pages, model evaluation report

2604.14125 2026-05-12 cs.CV cs.AI cs.RO version update

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

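Affiliations The University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary This paper proposes HiVLA, a visual-grounded-centric hierarchical manipulation system, to address the problem that fine-tuning end-to-end vision-language-action models on fine-grained control data weakens the reasoning ability of the underlying vision-language model. HiVLA decouples high-level semantic planning from low-level motor control: a vision-language model decomposes the task and performs visual grounding to produce a structured manipulation plan, and a diffusion transformer with cascaded cross-attention executes precise actions, improving manipulation precision while preserving high-level reasoning. Experiments show that HiVLA significantly outperforms existing end-to-end methods on long-horizon skill composition and on fine-grained manipulation in complex scenes.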

Comments Project Page: https://tianshuoy.github.io/HiVLA-page/

2604.11808 2026-05-12 cs.CV version update

Pair2Scene: Learning Local Object Relations for Procedural Scene Generation

Xingjian Ran, Shujie Zhang, Weipeng Zhong, Li Luo, Bo Dai

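Affiliations The University of Hong Kong; Tsinghua University; Shanghai Jiao Tong University; Shenzhen Loop Area Institute

AI summary Generating high-fidelity indoor 3D scenes remains a major challenge because data is scarce and complex spatial relations are hard to model. This paper proposes Pair2Scene, a procedural scene generation framework based on learning local object relations; it combines local rules, scene hierarchy, and physics algorithms to capture two key kinds of inter-object relations, support relations and functional relations. The model is trained on the authors' own 3D-Pairs dataset; at inference it is applied recursively with collision-aware rejection sampling to generate complex, physically and semantically plausible scenes, significantly outperforming existing methods.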

Comments ICML 2026

2604.07780 2026-05-12 eess.IV cs.CV version update

MonoUNet: A Robust Tiny Neural Network for Automated Knee Cartilage Segmentation on Point-of-Care Ultrasound Devices

Alvin Kimbowa, Arjun Parmar, Ibrahim Mujtaba, Will Wei, Maziar Badii, Matthew Harkey, David Liu, Ilker Hacihaliloglu

Affiliations School of Biomedical Engineering, The University of British Columbia; Department of Kinesiology, Michigan State University; Department of Rheumatology, The University of British Columbia; Department of Radiology, The University of British Columbia; Department of Medicine, The University of British Columbia

AI summary This study proposes MonoUNet, a lightweight deep learning model for automated knee cartilage segmentation on portable ultrasound devices. It uses a trainable monogenic block to extract multi-scale local phase features and a gating mechanism to improve robustness to variations in ultrasound image appearance, with far fewer parameters and much lower computational cost. On a multi-site, multi-device dataset, MonoUNet achieves average Dice scores of 92.62% to 94.82% and shows high agreement and reliability against manual measurements.

Comments 17 pages, 4 figures. Published in Ultrasound in Medicine & Biology (2026)

Journal ref Ultrasound in Medicine & Biology, 2026, ISSN 0301-5629

DOI: 10.1016/j.ultrasmedbio.2026.04.011

Abstract

Objective: To develop a robust and compact deep learning model for automated knee cartilage segmentation on point-of-care ultrasound (POCUS) devices. Methods: We propose MonoUNet, a novel, highly compact segmentation model consisting of (i) an aggressively reduced U-Net backbone, (ii) a trainable monogenic block that extracts multi-scale local phase features from the input, and (iii) a gating mechanism that injects these features into the encoder stages to reduce sensitivity to variations in ultrasound image appearance. MonoUNet segmentation performance was evaluated on a multi-site, multi-device knee cartilage ultrasound dataset using Dice score and mean average surface distance (MASD). Agreement between MonoUNet and manual cartilage outcomes (thickness and echo intensity) was assessed using Bland-Altman analysis with 95% limits of agreement, and reliability was assessed using intraclass correlation coefficient (ICC$_{2,k}$). Results: Overall, MonoUNet outperformed existing lightweight segmentation models, with average Dice scores ranging from 92.62% to 94.82% and MASD values between 0.133 mm and 0.254 mm. MonoUNet reduces the number of parameters by 10x--700x and computational cost by 14x--2000x relative to existing lightweight models. MonoUNet cartilage outcomes showed excellent reliability and agreement with the manual outcomes: intraclass correlation coefficients (ICC$_{2,k}$)=0.96 and bias=2.00% (0.047 mm) for average thickness, and ICC$_{2,k}$=0.99 and bias=0.80% (0.328 a.u.) for echo intensity. Conclusion: Incorporating trainable local phase features improves the robustness of highly compact neural networks for knee cartilage segmentation across varying acquisition settings and could support scalable ultrasound-based assessment and monitoring of knee osteoarthritis using POCUS devices. The code is publicly available at https://github.com/alvinkimbowa/monounet.
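
The two evaluation tools named in this abstract, the Dice score and Bland-Altman limits of agreement, have standard definitions. The sketch below shows those textbook forms; it is not taken from the authors' repository, and the function names and sample numbers are hypothetical.

```python
import statistics


def dice_score(pred_mask, true_mask):
    """Dice = 2|A and B| / (|A| + |B|) for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size_sum == 0:
        return 1.0  # both masks empty; treated as perfect overlap here
    return 2.0 * intersection / size_sum


def bland_altman(automatic, manual):
    """Return (bias, lower, upper): mean difference and 95% limits of agreement."""
    diffs = [a - m for a, m in zip(automatic, manual)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical masks and thickness measurements (mm).
dice = dice_score([1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0])
bias, lower, upper = bland_altman([2.1, 2.4, 1.9, 2.6], [2.0, 2.5, 1.8, 2.5])
```

The bias is the systematic offset between automatic and manual measurements, and the limits bracket where about 95% of individual differences are expected to fall if the differences are roughly normal.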

2604.03928 2026-05-12 cs.LG cs.AI cs.CV stat.ML version update

Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look

Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade

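Affiliations Independent researchers

AI summary This paper revisits supervised dimensionality reduction, linear discriminant analysis (LDA) in particular, applied to frozen pretrained convolutional neural network features. Comparing several dimensionality reduction strategies across multiple vision tasks, it finds that LDA substantially improves classification accuracy and greatly reduces feature dimensionality on coarse-grained classification, but performs worse on fine-grained tasks. LDA works well when between-class structure is pronounced and can backfire on tasks that need subtle distinctions, which gives practical guidance for using dimensionality reduction in frozen-feature classification pipelines.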

Comments 11 pages, 5 figures, 5 tables. Code available at https://github.com/IndarKarhana/lda-image-classification

2604.01824 2026-05-12 cs.CV version update

STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz

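Affiliations University of Bonn; Microsoft; Meta; Lamarr Institute for Machine Learning and Artificial Intelligence

AI summary STRIVE is a structured spatiotemporal reinforcement learning framework for video question answering that addresses the low reward variance and unstable policy updates of existing methods. It constructs multiple spatiotemporal variants of the input video and normalizes rewards jointly across text generations and visual variants, enriching the reward signal and stabilizing policy updates. STRIVE also introduces importance-based sampling so that exploration stays semantically relevant while preserving temporal coverage; experiments show it outperforms existing reinforcement learning methods on multiple video reasoning benchmarks.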
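
One way to see why joint normalization can help with low reward variance is the toy sketch below. It assumes the common group-relative form, advantage = (reward - mean) / std, and compares normalizing within each visual variant against pooling all variants. This is an illustration under that assumption, not STRIVE's exact formulation, and all names and numbers are hypothetical.

```python
import statistics


def per_variant_advantages(rewards_by_variant, eps=1e-6):
    """Normalize rewards separately inside each visual variant's group."""
    result = []
    for group in rewards_by_variant:
        mean = statistics.fmean(group)
        std = statistics.pstdev(group)
        result.append([(r - mean) / (std + eps) for r in group])
    return result


def joint_advantages(rewards_by_variant, eps=1e-6):
    """Normalize rewards over all generations from all variants together."""
    flat = [r for group in rewards_by_variant for r in group]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(r - mean) / (std + eps) for r in group]
            for group in rewards_by_variant]


# Two variants of one video, two sampled answers each (hypothetical rewards):
# every answer is right for variant 0 and wrong for variant 1.
rewards = [[1.0, 1.0], [0.0, 0.0]]
separate = per_variant_advantages(rewards)
pooled = joint_advantages(rewards)
```

With per-variant normalization every advantage is zero because each group has no variance, so there is no learning signal; pooling the groups recovers a positive advantage for the first variant and a negative one for the second.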

2603.25074 2026-05-12 cs.CV version update

Z-Erase: Enabling Concept Erasure in Single-Stream Diffusion Transformers

Nanxiang Jiang, Zhaoxin Fan, Baisen Wang, Daiheng Gao, Junhang Cheng, Jifeng Guo, Yalan Qin, Yeying Jin, Hongwei Zheng, Faguo Wu, Wenjun Wu

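Affiliations Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; University of Science; Shanghai University; Tencent; Beijing Academy of Blockchain

AI summary Z-Erase is a concept erasure method designed for single-stream diffusion transformers (such as Z-Image) that safely removes unwanted concepts from text-to-image models. It proposes a stream-disentangled concept erasure framework and a Lagrangian-guided adaptive erasure modulation algorithm, which resolve the generation collapse caused by applying conventional erasure methods directly to single-stream models, and it achieves state-of-the-art performance on several tasks.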

2603.16869 2026-05-12 cs.CV version update

SegviGen: Repurposing 3D Generative Model for Part Segmentation

Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng

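Affiliations Renmin University of China; Tsinghua University; Beihang University; Beijing Jiaotong University; Bambu Lab

AI summary This paper proposes SegviGen, a framework that repurposes a pretrained 3D generative model for efficient 3D part segmentation. It exploits the structural priors encoded in the generative model and guides segmentation with a distinctive part colorization strategy, avoiding the multi-view inconsistency and blurry boundaries of conventional methods. In experiments, SegviGen outperforms the previous best methods by 40% on interactive segmentation and 15% on full segmentation while needing very little labeled data, showing how well pretrained 3D generative models transfer to part segmentation.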

Comments Project page: https://fenghora.github.io/SegviGen-Page/

2603.12800 2026-05-12 eess.IV cs.CV version update

GLEAM: A Multimodal Imaging Dataset and HAMM for Glaucoma Classification

Jiao Wang, Chi Liu, Yiying Zhang, Hongchen Luo, Zhifen Guo, Ying Hu, Ke Xu, Jing Zhou, Hongyan Xu, Ruiting Zhou, Man Tang

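Affiliations Department of Ophthalmology, Shenyang Fourth People’s Hospital

AI summary This paper presents GLEAM, a public glaucoma dataset with three imaging modalities (scanning laser fundus images, circumpapillary OCT images, and visual field pattern deviation maps) annotated with four disease stages, which supports accurate diagnosis using combined multimodal information. To integrate cross-modal information effectively, the study proposes hierarchical attention masked modeling (HAMM), which uses a hierarchical attention encoder and a lightweight decoder to focus on cross-modal representation learning and improve glaucoma classification accuracy. The work offers a new approach and a practical tool for multimodal medical image analysis.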

2603.07686 2026-05-12 cs.RO cs.CV version update

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao

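Affiliations Bosch Corporate Research

AI summary This paper proposes UniUncer, a unified dynamic-static uncertainty framework for end-to-end autonomous driving that improves a system's ability to perceive and respond to environmental uncertainty. It converts deterministic models into probabilistic regression models and adds an uncertainty fusion module and an uncertainty-aware gating mechanism, jointly modeling and exploiting the uncertainty of static map elements and dynamic traffic participants. Experiments show that UniUncer improves trajectory prediction and driving decisions on multiple benchmark datasets with minimal computational overhead.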

Comments Accepted to ICRA 2026

2603.00918 2026-05-12 cs.CV cs.AI version update

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Seungwook Kim, Minsu Cho

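Affiliations POSTECH; RLWRLD; GenGenAI

AI summary This paper proposes SOLACE, a post-training framework for improving text-to-image generation quality. The model re-noises its own generated image and measures how accurately it recovers the noise, producing an intrinsic self-confidence signal that serves as the reinforcement learning reward, with no external reward model or human annotation. Experiments show that SOLACE gives consistent gains in compositional generation, text rendering, and text-image alignment, and that it can be combined with external rewards for complementary improvements.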

Comments 22 pages, accepted to CVPR 2026. Project page https://wookiekim.github.io/SOLACE/

2603.00166 2026-05-12 cs.CV cs.AI version update

Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang

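Affiliations Harbin Institute of Technology, Shenzhen; City University of Hong Kong; Chinese University of Hong Kong; McGill University

AI summary This paper examines a "paradox of simplicity" in generative AI: models excel at generating complex scenes yet struggle with simple tasks such as producing a pure-color image. It introduces the concept of "AI obedience", builds a hierarchical evaluation framework, and designs Violin, the first systematic benchmark for assessing a model's ability to move from probabilistic approximation to pixel-level determinism. Experiments show that closed-source models outperform open-source models on deterministic tasks and that this performance correlates with natural image generation ability, providing a basic framework and tools for understanding instruction alignment.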

2602.21581 2026-05-12 cs.CV version update

MultiAnimate: Pose-Guided Image Animation Made Extensible

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu

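Affiliations State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; ShanghaiTech University; Shanghai Jiao Tong University

AI summary This paper proposes MultiAnimate, an extensible multi-character image animation framework that addresses identity confusion and implausible occlusion in pose-guided multi-character video generation. Built on a modern diffusion transformer (DiT), it introduces two key components, an identity assigner and an identity adapter, which capture per-character positional information and the spatial relations between characters, improving flexibility and generalization. Experiments show state-of-the-art performance on multi-character image animation, ahead of existing diffusion models.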

Comments CVPR2026 Accepted. Project page at https://hyc001.github.io/MultiAnimate/

2602.09534 2026-05-12 cs.CV version update

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua

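Affiliations University of the Chinese Academy of Sciences; National University of Singapore; Zhejiang University; State Key Laboratory of Communication Content Cognition, People’s Daily Online

AI summary This paper proposes AUHead, a method for generating talking-head videos with realistic emotional expression. It disentangles audio from the control of fine-grained action units (AUs), which allows precise control over emotional expression. The method has two stages: a large language model first generates an AU sequence, and an AU-driven diffusion model then generates high-quality video, improving emotional realism and visual consistency.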

Comments https://openreview.net/forum?id=dmzlAUkulz&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DICLR.cc%2F2026%2FConference%2FAuthors%23your-submissions) Accepted at the 14th International Conference on Learning Representations (ICLR 2026)

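2602.09016 2026-05-12 cs.CV Version update

Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung, Hadar Averbuch-Elor

Affiliations * Cornell University

AI summary This paper proposes Raster2Seq, a method for reconstructing structured vector representations from rasterized floorplan images. It treats floorplan reconstruction as a sequence-to-sequence task, representing rooms, windows, and doors as sequences of labeled polygons that carry both geometry and semantics. An autoregressive decoder with learnable anchors predicts the next vertex from image features and previously generated vertices, which helps with complex floorplans containing diverse polygon structures. Experiments show state-of-the-art results on several standard datasets and good generalization to more challenging ones.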

Comments Accepted to SIGGRAPH 2026. Project page: https://cornell-vailab.github.io/Raster2Seq/

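2602.05243 2026-05-12 cs.LG cs.CV Version update

CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers

Boxiang Zhang, Baijian Yang

Affiliations * Purdue University

AI summary This paper proposes CORP, a closed-form, one-shot structured pruning method that needs no gradients or fine-tuning and removes MLP and attention substructures from Transformer models. It casts structured pruning as a representation recovery problem and derives an analytical solution, via closed-form ridge regression, for compensating the remaining weights, so the model is compressed efficiently while accuracy is preserved. Experiments show that DeiT models keep high ImageNet classification accuracy after heavy pruning with CORP.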

2602.03916 2026-05-12 cs.CV cs.CE cs.CL cs.LG Version update

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

Affiliations * Computational Intelligence and Operations Laboratory; Shahjalal University of Science and Technology; BRAC University; North South University; Monash University; Qatar Computing Research Institute

AI summary SpatiaLab is a comprehensive benchmark for evaluating the spatial reasoning of vision-language models (VLMs) in real-world scenes. The study finds that existing models still fall well short on complex spatial relations, depth perception, navigation, and 3D geometry. SpatiaLab contains 1,400 visual question-answer pairs covering six main categories and 30 task types, and experiments show that current state-of-the-art VLMs perform far below humans on spatial reasoning.

Comments Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables

Journal ref ICLR 2026

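2601.15065 2026-05-12 cs.CV Version update

Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background

Tianyu Li, Zongqian Wu, Songyue Cai, Ping Hu, Xiaofeng Zhu

Affiliations * School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Computer Science and Technology, Hainan University

AI summary Addressing shortcomings of foreground-background decomposition in few-shot out-of-distribution (OOD) detection, this paper proposes a new plug-and-play framework. Its two core modules, adaptive background suppression and confusable foreground rectification, respectively optimize the classification-entropy weights of background regions and correct foreground regions that resemble other classes. Experiments show that the framework improves the OOD detection of existing methods in few-shot settings.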

Comments arXiv preprint arXiv:2601.15065 (2026)

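2512.19219 2026-05-12 cs.CV cs.AI Version update

Selective LoRA for Visual Tokens and Attention Heads

Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

Affiliations * University of Michigan; LG AI Research

AI summary This paper proposes Image-LoRA, a parameter-efficient fine-tuning method for vision tasks. Because vision-language model (VLM) inputs are heterogeneous, it restricts LoRA updates to the value path of visual tokens and of a subset of attention heads, reducing trainable parameters and computation. The method performs well on visual grounding tasks, especially when visual tokens make up a large share of the input, offers a better performance-efficiency trade-off than standard LoRA, and is validated on multiple tasks for generality and for leaving text processing stable.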

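2511.12878 2026-05-12 cs.CV cs.RO Version update

Uni-Hand: Universal Hand Motion Forecasting in Egocentric Views

Junyi Ma, Wentao Bao, Jingyi Xu, Guanzhong Sun, Yu Zheng, Erhang Zhang, Xieyuanli Chen, Hesheng Wang

Affiliations * IRMV Lab, Shanghai Jiao Tong University; Meta Reality Labs; Department of Electronic Engineering, Shanghai Jiao Tong University; China University of Mining and Technology; College of Intelligence Science and Technology, National University of Defense Technology

AI summary This paper proposes Uni-Hand, a universal hand motion forecasting framework that addresses insufficient prediction targets, modality gaps, entangled hand-head motion, and limited downstream validation in egocentric hand motion forecasting. By fusing vision and language, incorporating global context, and injecting task-aware text embeddings, it forecasts multiple hand keypoint targets in both 2D and 3D space, and it additionally predicts hand-object interaction states to support downstream tasks. Experiments show state-of-the-art forecasting performance on multiple public datasets and on newly built benchmarks, with promising results in robot policy transfer and action recognition.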

Comments Accepted by T-PAMI 2026. Code and data: https://github.com/IRMVLab/UniHand

Abstract

Forecasting how human hands move in egocentric views is critical for applications like augmented reality and human-robot policy transfer. Recently, several hand trajectory prediction (HTP) methods have been developed to generate future possible hand waypoints, which still suffer from insufficient prediction targets, inherent modality gaps, entangled hand-head motion, and limited validation in downstream tasks. To address these limitations, we present a universal hand motion forecasting framework considering multi-modal input, multi-dimensional and multi-target prediction patterns, and multi-task affordances for downstream applications. We harmonize multiple modalities by vision-language fusion, global context incorporation, and task-aware text embedding injection, to forecast hand waypoints in both 2D and 3D spaces. A novel dual-branch diffusion is proposed to concurrently predict human head and hand movements, capturing their motion synergy in egocentric vision. By introducing target indicators, the prediction model can forecast the specific joint waypoints of the wrist or the fingers, besides the widely studied hand center points. In addition, we enable Uni-Hand to additionally predict hand-object interaction states (contact/separation) to facilitate downstream tasks better. As the first work to incorporate downstream task evaluation in the literature, we build novel benchmarks to assess the real-world applicability of hand motion forecasting algorithms. The experimental results on multiple publicly available datasets and our newly proposed benchmarks demonstrate that Uni-Hand achieves the state-of-the-art performance in multi-dimensional and multi-target hand motion forecasting. Extensive validation in multiple downstream tasks also presents its impressive human-robot policy transfer to enable robotic manipulation, and effective feature enhancement for action anticipation/recognition.

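2511.00560 2026-05-12 cs.CV Version update

4D Neural Voxel Splatting: Dynamic Scene Rendering with Voxelized Gaussian Splatting

Chun-Tin Wu, Jun-Cheng Chen

Affiliations * National Taiwan University; Academia Sinica

AI summary Although 3D Gaussian Splatting (3D-GS) renders efficiently for novel view synthesis, extending it to dynamic scenes incurs large memory overhead because Gaussians are replicated per frame. This paper proposes 4D Neural Voxel Splatting (4D-NVS), which combines voxel representations with neural Gaussian splatting to model dynamic scenes efficiently. It models temporal dynamics with a compact set of neural voxels and learned deformation fields, substantially reducing memory use and speeding up training while keeping rendering quality high. Experiments show lower memory use and faster training than existing methods, with real-time rendering and better visual quality.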

Comments 10 pages, 7 figures

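2509.13484 2026-05-12 cs.CV cs.CY Version update

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk

Affiliations * Massachusetts Institute of Technology; Hasso Plattner Institute

AI summary This paper proposes MINGLE, a vision-language-model-based method for detecting semantically complex social-group regions in urban scenes. It combines human detection, depth estimation, VLM reasoning, and a spatial aggregation algorithm to identify and localize regions of social interaction in images. The study also builds a new dataset of 100,000 urban street-view images annotated with bounding boxes and labels for individuals and social groups.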

Comments 13 pages, 4 figures Updated with the camera-ready version after acceptance

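2505.23617 2026-05-12 cs.CV cs.AI cs.GR cs.LG Version update

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, Ranjay Krishna

Affiliations * University of Washington; Allen Institute for Artificial Intelligence; Woven by Toyota, Inc

AI summary This paper proposes a video tokenization method based on panoptic sub-object trajectories, addressing the redundant tokens and low computational efficiency that conventional space-time patch tokenization causes on long videos. It organizes video content into object trajectories to produce semantic tokens, reducing the token count while preserving temporal consistency. The proposed TrajViT model significantly outperforms existing methods on multiple video understanding tasks at lower computational cost.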

Comments ICCV 2025

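2505.18184 2026-05-12 eess.SP cs.CV Version update

AI-Enhanced Stethoscope in Remote Diagnostics for Cardiopulmonary Diseases

Hania Ghouse, Juveria Tanveen, Abdul Muqtadir Ahmed, Uma N. Dulhare

Affiliations * Department of Computer Science and Artificial Intelligence, Muffakham Jah College of Engineering and Technology

AI summary Targeting the growing global difficulty of diagnosing cardiovascular and pulmonary diseases, this paper proposes a low-cost AI-enhanced stethoscope system for remote diagnosis of cardiopulmonary disease. It extracts and processes MFCC features from auscultation sounds and uses a hybrid CNN-GRU model to automatically classify six pulmonary and five cardiovascular conditions. The system can run on low-cost embedded devices in resource-limited regions and provide real-time diagnostic support.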

2505.07349 2026-05-12 eess.IV cs.CV Version update

Multi-Plane Vision Transformer for Hemorrhage Classification Using Axial and Sagittal MRI Data

Badhan Kumar Das, Gengyan Zhao, Boris Mailhe, Thomas J. Re, Dorin Comaniciu, Eli Gibson, Andreas Maier

Affiliations * Digital Technology and Innovation, Siemens Healthineers; Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary This paper proposes a Multi-Plane Vision Transformer (MP-ViT) for brain hemorrhage classification, addressing the information loss that occurs when MRI data in different orientations (such as axial and sagittal) are used for hemorrhage detection. It processes each orientation with a separate Transformer encoder, fuses the two through cross-attention, and adds a modality indication vector to supply missing contrast information. On a clinical dataset with 10,084 training samples, MP-ViT improves AUC by 5.5% over a conventional ViT and by 1.8% over CNN models.

Comments 10 pages

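2505.05209 2026-05-12 cs.CV Version update

EAM: Enhancing Anything with Diffusion Transformers for Blind Super-Resolution

Haizhen Xie, Kunpeng Du, Qiangyu Yan, Sen Lu, Jianhong Han, Hanting Chen, Hailin Hu, Jie Hu

Affiliations * Huawei Noah’s Ark Lab

AI summary This paper proposes EAM, a blind super-resolution method based on Diffusion Transformers (DiT). It introduces a new $Ψ$-DiT block whose triple-flow architecture exploits the prior knowledge of a pre-trained DiT, and combines a progressive masked image modeling strategy with a subject-aware prompt generation strategy to improve generalization and training efficiency. Experiments show better quantitative metrics and visual quality than existing methods on multiple datasets.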

Comments Revision of Section 4.1

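2503.09336 2026-05-12 cs.CV Version update

Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness

Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai

Affiliations * The University of Sydney; Northwestern Polytechnical University; University of Maryland College Park

AI summary This paper proposes SPBA, a stealthy patch-wise backdoor attack on 3D point cloud models. It partitions the point cloud into patches using local curvature variation and selects inconspicuous patches as trigger regions, implanting the backdoor efficiently without noticeably altering the point cloud structure. Compared with conventional sample-wise triggers, it greatly reduces computational overhead and improves stealth, achieving strong results on multiple benchmark datasets.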

Comments 12 pages, 6 figures, 11 tables

2407.11906 2026-05-12 cs.CV cs.RO Version update

SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge

Hao Ding, Yuqian Zhang, Tuxun Lu, Ruixing Liang, Hongchao Shu, Lalithkumar Seenivasan, Yonghao Long, Qi Dou, Cong Gao, Yicheng Leng, Seok Bong Yoo, Eung-Joo Lee, Negin Ghamsarian, Klaus Schoeffmann, Raphael Sznitman, Zijian Wu, Yuxin Chen, Septimiu E. Salcudean, Samra Irshad, Shadi Albarqouni, Seong Tae Kim, Yueyi Sun, An Wang, Long Bai, Hongliang Ren, Ihsan Ullah, Ho-Gun Ha, Attaullah Khan, Hyunki Lee, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Sita Tailor, Ricardo Sanchez-Matilla, Imanol Luengo, Tianhao Fu, Jun Ma, Bo Wang, Marcos Fernández-Rodríguez, Estevao Lima, João L. Vilaça, Mathias Unberath

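Affiliations * Johns Hopkins University; The Chinese University of Hong Kong; Intuitive Surgical Inc.; University of Arizona; Chonnam National University; University of British Columbia; Kyung Hee University; University H

AI summary SegSTRONG-C is a challenge aimed at improving the robustness of surgical tool segmentation models under non-adversarial corruptions. It is built on a dataset generated through counterfactual robotic replay, which provides paired clean and corrupted samples for evaluating models. Participants train on uncorrupted data and are evaluated on test sets containing bleeding, smoke, and low-brightness corruptions; the challenge reveals key factors behind model failure and identifies effective ways to improve robustness. The top methods reach high segmentation accuracy across the corruption types, highlighting the role of prior knowledge, customized training strategies, and architectural choices.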

Abstract

Surgical data science has seen rapid advancement with the excellent performance of end-to-end deep neural networks (DNNs). Despite their successes, DNNs have been proven susceptible to minor "corruptions," introducing a major concern for the translation of cutting-edge technology, especially in high-stakes scenarios. We introduce the SegSTRONG-C challenge dedicated to better understanding model deterioration under unforeseen but plausible non-adversarial "corruption" and the capabilities of contemporary methods that seek to improve it. Built on a dataset generated through counterfactual robotic replay, SegSTRONG-C provides paired clean and "corrupted" samples, enabling reproducible evaluation of model robustness. Participants are challenged to train tool segmentation algorithms on "uncorrupted" data and evaluate them on "corrupted" test domains for the binary robot tool segmentation task. Through comprehensive baseline experiments and participating submissions from widespread community engagement, SegSTRONG-C reveals key themes for model failure and identifies promising directions for improving robustness. The challenge winners achieve an average of 0.9394 DSC and 0.9301 NSD across the unreleased test sets with the "corruption" types bleeding, smoke, and low brightness. This highlights how prior knowledge, customized training strategies, and architectural choice can be leveraged to improve robustness. In conclusion, the SegSTRONG-C challenge has identified practical approaches for enhancing model robustness. However, most approaches rely on conventional techniques that have known limitations. Looking ahead, we advocate for expanding intellectual diversity and creativity in non-adversarial robustness beyond data augmentation, calling for new paradigms that enhance universal robustness to unforeseen "corruptions" to facilitate richer applications in surgical data science.
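
The DSC figure quoted in the abstract above is the Dice similarity coefficient. As a reminder of what that number measures, here is a minimal Python sketch for binary masks. The function name `dice_coefficient`, the nested-list mask format, and the empty-mask convention are illustrative assumptions; this is not the challenge's evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient (DSC) between two binary masks.

    Both masks are equally sized nested lists of 0/1 values
    (a hypothetical format chosen to keep the sketch dependency-free).
    """
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    # Total foreground pixels over both masks.
    total = sum(1 for p in pred_flat if p) + sum(1 for t in target_flat if t)

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
print(dice_coefficient(pred, target))  # 2 shared pixels, 3 + 3 foreground pixels
```

DSC is 1.0 for identical masks and 0.0 for disjoint ones, so an average of 0.9394 indicates near-complete overlap with the reference segmentation.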

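2605.09245 2026-05-12 cs.CV Version update

CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

Ruiqi Xian, Deep Patel, Iain Melvin, Sanjoy Kundu, Martin Renqiang Min, Dinesh Manocha

Affiliations * University of Maryland, College Park, MD, USA; NEC Laboratories America, Princeton, NJ, USA; University of North Carolina, Greensboro, NC, USA

AI summary Multi-camera multi-object tracking (MCMOT) struggles to keep object identities consistent across views, and typically requires precise calibration and extensive annotation. This paper proposes CalibFree, a self-supervised representation learning framework that needs neither calibration nor manual labels. It uses single-view distillation and cross-view reconstruction to separate view-agnostic from view-specific features, which lets it adapt to complex dynamic scenes. Experiments show better tracking performance than existing methods on multiple datasets, supporting its effectiveness and adaptability without calibration.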

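2605.09218 2026-05-12 cs.CV cs.AI cs.LG cs.RO Version update

Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models

Sagar Bharadwaj, Ziyong Ma, Anurag Ghosh, Srinivasan Seshan, Anthony Rowe

Affiliations * Carnegie Mellon University

AI summary Flame3D is a training-free 3D scene understanding framework that pairs an editable visual-textual 3D memory with off-the-shelf large language models for zero-shot reasoning about complex spatial relations and unseen objects. At inference time it synthesizes custom spatial programs, supporting open-ended reasoning about scene layout, free space, and new objects, and its memory can be updated from external data without retraining. Experiments show strong results on 3D question answering and compositional spatial reasoning, underlining the value of generating spatial operations on the fly for complex 3D reasoning.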

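2605.09196 2026-05-12 cs.CV cs.AI cs.GR cs.LG cs.RO Version update

RigidFormer: Learning Rigid Dynamics using Transformers

Zhiyang Dou, Minghao Guo, Haixu Wu, Doug Roble, Tuur Stuyck, Wojciech Matusik

Affiliations * MIT; Meta

AI summary This paper proposes RigidFormer, a Transformer-based model for learning multi-object rigid-body dynamics that is particularly suited to mesh-free representations such as point clouds. It models dynamics through object-level anchors and combines anchor-vertex pooling with anchor-based RoPE attention to simulate rigid motion efficiently and with high fidelity. RigidFormer outperforms conventional mesh-based methods on several benchmarks, is more computationally efficient, and handles large numbers of objects and inputs at different point cloud resolutions.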

Comments Project Page: https://people.csail.mit.edu/frankzydou/projects/RigidFormer/index.html

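2605.09190 2026-05-12 cs.CV Version update

AQMP: Image compression through Adaptive Quadtree Refinement and Matching Pursuit with Hyperparameter Optimization

Franco Cerino, Emmanuel Tassone, Manuel Tiglio

Affiliations * CONICET; Facultad de Matemática, Astronomía, Física y Computación, Universidad Nacional de Córdoba

AI summary This paper proposes AQMP, a new image coding method that combines adaptive quadtree partitioning with matching pursuit, adjusting block sizes to local image structure to reach higher compression ratios at a given image quality. It adds hyperparameter optimization, using a Tree-structured Parzen Estimator for multi-objective optimization to balance compression efficiency against visual quality. Experiments show that, at a structural similarity (SSIM) comparable to JPEG, AQMP reaches up to 4 times JPEG's compression ratio and performs well across compression settings.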

Comments 34 pages, 18 figures
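
The AQMP entry above names matching pursuit as one of its building blocks. The following is a minimal sketch of the generic greedy matching pursuit step only, assuming unit-norm dictionary atoms. The quadtree refinement and hyperparameter optimization described in the paper are not shown, and the function name, toy dictionary, and signal are hypothetical.

```python
import math


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit over a dictionary of unit-norm atoms.

    Returns (coeffs, residual), where coeffs maps atom index -> coefficient.
    Assumes every atom has Euclidean norm 1.
    """
    residual = list(signal)
    coeffs = {}
    for _ in range(n_atoms):
        # Pick the atom most correlated with the current residual.
        best_idx, best_ip = None, 0.0
        for idx, atom in enumerate(dictionary):
            ip = sum(r * a for r, a in zip(residual, atom))
            if abs(ip) > abs(best_ip):
                best_idx, best_ip = idx, ip
        if best_idx is None:  # residual is orthogonal to every atom
            break
        coeffs[best_idx] = coeffs.get(best_idx, 0.0) + best_ip
        # Remove the selected atom's contribution from the residual.
        atom = dictionary[best_idx]
        residual = [r - best_ip * a for r, a in zip(residual, atom)]
    return coeffs, residual


s = 1.0 / math.sqrt(2.0)
toy_dictionary = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [s, s, 0.0],  # diagonal atom, normalized to unit length
]
coeffs, residual = matching_pursuit([3.0, 3.0, 1.0], toy_dictionary, n_atoms=2)
print(sorted(coeffs), residual)
```

In a block-based codec, each image block would play the role of `signal`, and the number of atoms kept per block trades reconstruction error against bit rate.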

2605.09181 2026-05-12 cs.CV cs.ET eess.IV Version update

Establishing Robust Retinal Eye Tracking: A Weakly Supervised Algorithmic Framework

Bo Wen, Dillon Lohr, Yatong An, Pushkar Anand, Alexander Fix, Ruobing Qian, Catherine A. Fromm, Yimin Ding, Truong Nguyen, Mohamed El-Haddad, Francesco La Rocca

Affiliations * Meta Reality Labs; University of California, San Diego; Independent Researcher

AI summary This paper proposes a new weakly supervised framework for robust retinal eye tracking. It overcomes the weaknesses of conventional template matching under varying retinal features and real-world imaging conditions. Preliminary experiments on 6 subjects report a 95th-percentile gaze error below 0.45 degrees. The result offers a new technical route for eye tracking in ophthalmic imaging and vision science.

Comments 2026 IEEE International Conference on Image Processing (Accepted for Publication)

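2605.09151 2026-05-12 cs.CV Version update

MultiMedVision: Multi-Modal Medical Vision Framework

Frank Li, Bardia Khosravi, Mohammadreza Chavoshi, Young Seok Jeon, Theo Dapamede, Hari Trivedi, Janice Newsome, Judy Gichoya

Affiliations * Emory University; Yale University

AI summary This paper proposes MultiMedVision, a multi-modal medical vision framework that handles 2D (for example X-ray) and 3D (for example CT) medical images in a unified way. Built on a sparse vision transformer, it uses 3D rotary position embeddings and variable-length sequence packing to process mixed-modality data directly in a shared latent space, without modality-specific adapters and without treating 3D volumes as sequences of 2D slices. Experiments show strong results on multiple medical imaging benchmarks, supporting its effectiveness for unified cross-dimensional representation learning.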

Comments 9 pages, 2 figures

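2605.09146 2026-05-12 cs.CV Version update

Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

Jingdong Zhang, Yizhou Wang, Zhengzhong Tu, Xin Li, Wenping Wang, Xiaohang Zhan

Affiliations * Texas A&M University; Adobe

AI summary This paper studies humanoid visual search (HVS), in which an agent actively explores an immersive 360-degree environment to find a target. Existing methods rely on lengthy multi-turn chain-of-thought (CoT) reasoning, which brings high cognitive load and annotation cost. The authors propose a new framework, "Imagining in 360°", that decouples exploration into two modules, an Imaginator and an Actor. The Imaginator predicts the semantic layout of the environment in a single inference pass and gives the Actor a diverse distribution of spatial information, enabling efficient search under uncertainty. The method greatly reduces data engineering cost and significantly improves search efficiency and success rate in complex real-world environments.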

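2605.09132 2026-05-12 cs.CV Version update

KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Robert Berke, Mauricio Reyes

Affiliations * University of Bern; Shanghai Jiao Tong University; Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern

AI summary This study proposes KEPIL, a knowledge-enhanced prompt-image learning framework that aims to improve prompt-based zero-shot inference in medical image diagnosis. Current vision-language models are sensitive to prompt variation and lack reliable external knowledge. KEPIL incorporates structured medical knowledge through dynamic prompt augmentation, a semantics-aware contrastive loss, and entity-centric report normalization, improving robustness and generalization. Experiments show leading zero-shot performance on multiple benchmarks and markedly better diagnostic accuracy under prompt variation.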

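2605.09090 2026-05-12 cs.CV cs.AI Version update

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Gabriele Lombardo, Luigi Maiorana, Liliana Lo Presti, Marco La Cascia

Affiliations * Department of Engineering

AI summary This study examines anisotropy in visual grounding models under controlled counterfactual perturbations, with the aim of analyzing how models behave when given semantically mismatched descriptions. It introduces a similarity-controlled method for generating counterfactual descriptions that systematically perturbs the object or context components of an image description, so grounding behavior can be studied at different degrees of alignment. Experiments indicate that embedding-space anisotropy is not the main cause of counterfactual errors, and that understanding robustness requires looking at finer geometric properties of the embedding space.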

Comments To be published in the proceedings of the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026

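2605.09089 2026-05-12 cs.CV cs.AI Version update

Field-Localized Forgery Detection for Digital Identity Documents

Abhishek Kumar, Riya Tapwal, Carsten Maple, Mark Hooper

Affiliations * The Alan Turing Institute; IIT Mandi

AI summary This paper proposes FLiD, a lightweight field-localized forgery detection framework for digital identity documents in remote identity verification, targeting local tampering of key fields such as the face photo and text. It localizes key regions with object detection, extracts compact embeddings with a frozen MobileNetV3-Small network, and classifies them with a lightweight network for high-accuracy forgery detection. Experiments show that FLiD clearly outperforms existing general-purpose forgery detectors on several metrics with far fewer parameters and much less computation.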

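2605.09071 2026-05-12 cs.CV Version update

Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

Rohith Ramanan, A. N. Rajagopalan

Affiliations * Department of Physics; Indian Institute of Technology Madras; Department of Electrical Engineering

AI summary This paper proposes Probability-Flow Distillation (PFD), a new method that addresses the mode collapse and loss of detail seen in existing text-to-3D approaches such as Score Distillation Sampling (SDS) and its variants. PFD formulates distillation as an exact Wasserstein gradient flow, giving more accurate distribution matching and producing 3D models with fine, high-fidelity detail.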

2605.09067 2026-05-12 cs.CV Version update

Reducing Annotation Burden for Femoral Cartilage Segmentation in Knee MRI via Cross-Sequence Transfer Learning

Francesco Chiumento, Gianluigi Crimi, Elisa Moretta, Rocco Milieri, Alberto Bazzocchi, Giulio Vara, Giacomo Dal Fabbro, Stefano Zaffagnini, Fulvia Taddei, Serena Bonaretti

Affiliations * School of Electronic Engineering, Dublin City University; Bioengineering and Computing Laboratory, IRCCS Istituto Ortopedico Rizzoli; Dipartimento di Scienze Mediche e Chirurgiche (DIMEC), Alma Mater Studiorum – Università di Bologna; Diagnostic and Interventional Radiology, IRCCS Istituto Ortopedico Rizzoli; Department of Biomedical and Neuromotor Sciences, University of Bologna; 2nd Orthopedics and Trauma Unit, IRCCS Istituto Ortopedico Rizzoli; Independent Researcher

AI summary This study aims to reduce the manual annotation burden of femoral cartilage segmentation in knee MRI through cross-sequence transfer learning, testing bidirectional transfer between Double Echo Steady State (DESS) and sagittal proton-density-weighted 3D fast spin echo (Cube) sequences. A modified 2D U-Net is pre-trained on 507 DESS images from the OAI dataset and then transferred across sequences. Transfer from Cube to DESS comes close to training on the target sequence directly, while transfer from DESS to Cube needs more annotated data, and the effect of pathology on segmentation differs between sequences. The results offer a practical way to reduce annotation effort in medical image segmentation.

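2605.09065 2026-05-12 cs.CV cs.LG Version update

Dependency-Aware Discrete Diffusion for Scene Graph Generation

Rajalaxmi Rajagopalan, Romit Roy Choudhury

Affiliations * University of Illinois, Urbana-Champaign

AI summary This study proposes a dependency-aware discrete diffusion model for generating structured scene graphs from natural language. It decouples structure from semantics in both the forward and reverse processes and captures conditional dependencies among objects, edges, and relations, producing scene graphs that better match the text. Experiments show that it outperforms existing continuous and discrete graph generation methods on standard benchmarks and yields better compositional alignment in downstream image generation, especially in multi-object scenes.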

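2605.09053 2026-05-12 cs.CV Version update

LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

Jiankun Peng, Jianyuan Guo, Yiguang Yang, Yue Liu, Jiashuang Yan, Ying Xu

Affiliations * The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing; The School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing; The Department of Computer Science, City University of Hong Kong, Hong Kong, SAR, China

AI summary In vision-language navigation in continuous environments (VLN-CE), existing online topological planning methods still suffer from redundant local depth information and from weakening attention to the current candidate nodes as the topological graph grows. This paper proposes LCGNav, a modular local geometric enhancement framework that converts candidate depth views into 3D point clouds and applies a physical truncation at the reachable range, giving a more compact local geometric model. LCGNav also introduces a dimension-preserving local fusion strategy that geometrically enhances only the currently relevant "ghost" nodes without changing the planner's interface. Experiments show that LCGNav works as a cross-architecture enhancement module, improving key metrics of several representative online topological methods at low training cost and achieving the best performance on the val-unseen splits of R2R-CE and RxR-CE.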

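2605.09050 2026-05-12 cs.RO cs.CV Version update

Automated Robotic Moisture Monitoring in Agricultural Fields

Senthil Palanisamy, Akila I. S

Affiliations * Coimbatore Institute of Technology

AI summary This paper develops an automated robotic system for soil moisture monitoring in large agricultural fields. The system combines in-field moisture sensors with a robot, plans paths with Dijkstra's algorithm, and computes soil moisture using image processing, aiming for efficient and economical monitoring. The authors built a small experimental field and tested a prototype to check the feasibility of the approach.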

Comments 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)

Journal ref 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)
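
The entry above mentions path planning with Dijkstra's algorithm. A minimal sketch of that algorithm on a small weighted graph is given below. The graph, the node names (`dock`, `s1`, ...), and the edge costs are hypothetical and only illustrate how a robot could pick the cheapest route between sensor locations; they do not come from the paper.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source in a graph with non-negative edge weights.

    graph maps node -> {neighbor: edge_cost}. Returns (dist, prev), where
    prev records each reached node's predecessor on its shortest path.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def shortest_path(graph, source, target):
    """Return (path, cost), or (None, inf) if target is unreachable."""
    dist, prev = dijkstra(graph, source)
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Hypothetical field layout: a docking station and three sensor sites.
field = {
    "dock": {"s1": 4.0, "s2": 1.0},
    "s1": {"dock": 4.0, "s2": 2.0, "s3": 5.0},
    "s2": {"dock": 1.0, "s1": 2.0, "s3": 8.0},
    "s3": {"s1": 5.0, "s2": 8.0},
}
print(shortest_path(field, "dock", "s3"))  # (['dock', 's2', 's1', 's3'], 8.0)
```

The binary heap keeps the frontier ordered by tentative distance, and the stale-entry check stands in for a decrease-key operation, which `heapq` does not provide.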

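2605.09039 2026-05-12 cs.CV Version update

SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams

Timo Kleger, Qi Ma, Deheng Zhang, Luc Van Gool, Danda Pani Paudel

Affiliations * ETH Zurich

AI summary This paper presents SeasonScapes, a framework and a large-scale dataset of 3D landscapes with seasonal variation. The dataset comprises more than 85,000 webcam images from 32 locations at 13 points in time, covering over 50 km × 60 km of Swiss mountain terrain. Images for each time point are projected onto a 3D mesh to build seasonal 3D landscapes that reflect how natural appearance changes over time. To handle occlusion and missing data, the study uses a conditional diffusion model for image-guided completion on the mesh, and the resulting meshes can be relit with a standard physically based renderer.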

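2605.09030 2026-05-12 cs.CV cs.LG Version update

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

Jörg Frochte

Affiliations * Bochum University of Applied Sciences

AI summary This paper examines the limits of using Contrastive Style Descriptor (CSD) cosine similarity as an absolute measure of style fidelity in artist-style evaluation. It proposes a diagnostic called the "discrimination gap" that tests whether the metric can actually separate same-style from different-style pairs within a given artist corpus. The study finds negative point-estimate gaps for raw CSD cosine on several artist corpora, meaning it cannot be used as an absolute score there; a CSLS readout and positional-embedding interpolation substantially improve evaluation accuracy. The author recommends running this diagnostic before using CSD cosine as a style score and suggests the improved CSD+ variant for greater reliability.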

Comments 24 pages, 7 figures, 19 tables

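2605.09025 2026-05-12 cs.CV cs.LG Version update

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Kiran Naseer, Naveed Anwer Butt

Affiliations * University of Gujrat, Pakistan

AI summary This paper proposes MedFL-Stress, a systematic stress-testing framework for evaluating the robustness of federated brain tumor segmentation models under cross-hospital MRI appearance shift. By introducing MRI appearance shifts at different severity levels, it exposes how the performance of existing federated learning methods varies between hospitals, and it compares FedAvg, FedProx, and FedBN. Experiments show that FedBN does better at raising worst-hospital segmentation performance and narrowing the gap between hospitals, underlining the importance of robustness evaluation in federated medical imaging.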
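
For readers unfamiliar with the FedAvg baseline named in the entry above, the following minimal sketch shows its aggregation rule: the server averages client parameters weighted by each client's number of training samples. The flat-list parameter format and the two-hospital example are illustrative assumptions, not the paper's setup.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg).

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples per client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    n_params = len(client_weights[0])
    averaged = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must share the same parameter shape")
        for i, w in enumerate(weights):
            # Each client contributes in proportion to its share of the data.
            averaged[i] += w * size / total
    return averaged


# Two hypothetical hospitals: the second holds three times as much data.
global_weights = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_weights)
```

FedBN, as it is usually described, applies the same averaging but keeps batch-normalization parameters local to each client, which is one plausible reason it copes better with appearance shift between sites.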

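2605.09024 2026-05-12 cs.CV cs.GR cs.MM eess.IV Version update

Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

Adrian Azzarelli, Nantheera Anantrasirichai, James Pollock, David R. Bull

Affiliations * University of Bristol, UK; Lux Aeterna, Bristol, UK

AI summary This study proposes a 3D reconstruction and relighting framework tailored to virtual production (VP) with high-resolution image-based lighting, addressing the coupling of background and lighting and the low resolution of environment maps in conventional approaches. The method uses Gaussian Splatting and conditions relighting on the known background imagery, so it does not depend on environment maps and reduces compositing to a background-image editing task. Using a dataset of real VP scenes, it decomposes a scene into fixed appearance and variable lighting components, giving efficient, controllable, high-quality reconstruction and relighting with support for several output variables.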

Abstract

Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space - implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.

2605.09002 2026-05-12 cs.CV cs.AI Version update

CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

Lavsen Dahal, Joseph Y. Lo

Affiliations * Center for Virtual Imaging Trials, Carl E. Ravin Advanced Imaging Laboratories; Department of Radiology, Duke University School of Medicine; Electrical and Computer Engineering, Pratt School of Engineering, Duke University; Medical Physics Graduate Program, Duke University

AI summary This paper proposes CT-IDP, an interpretable disease classification framework based on abdominal CT segmentation. It produces multi-organ segmentations and extracts more than 900 quantitative phenotype features for disease classification. The method is trained and validated on the MERLIN dataset and evaluated externally on two independent datasets, where CT-IDP outperforms a DINOv3-based vision Transformer baseline on multiple metrics, indicating both effectiveness and an interpretability advantage.

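2605.08985 2026-05-12 cs.CV Version update

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao

Affiliations * Tsinghua University; ModelBest

AI summary This study addresses the computational bottleneck that high-resolution image input creates for visual encoding in multimodal large language models (MLLMs) and proposes LLaVA-UHD v4, an efficient and controllable visual encoding scheme. Comparative experiments find that slice-based encoding preserves local detail and outperforms conventional global encoding. The study also introduces early compression in the shallow layers of the ViT, which substantially cuts computation without hurting downstream performance. Experiments show a 55.8% reduction in visual-encoding FLOPs across multiple benchmarks while matching or exceeding the baseline.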

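2605.08974 2026-05-12 cs.CV cs.AI Version update

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi

Affiliations * National University of Singapore; VinUniversity; Nanyang Technological University

AI summary Despite progress in video understanding, multimodal large language models still hallucinate readily in dynamic scenes. This paper attributes the problem to a lack of continuous spatio-temporal monitoring, that is, an inability to track how objects' identities, states, and relations change over time. The authors propose STEMO-Bench, a benchmark for evaluating intermediate reasoning about object-centric facts, and STEMO-Track, a framework that uses structured trajectory construction and temporal aggregation to significantly improve the accuracy and consistency of spatio-temporal reasoning.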

Comments Code: https://github.com/nguyentthong/video_hallucination

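2605.08971 2026-05-12 cs.CV cs.AI Version update

Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud

Said Harb, Mehdi Maboudi, Markus Gerke

Affiliations * Institute of Geodesy and Photogrammetry

AI summary This paper studies how to reconstruct CAD models from point cloud data and proposes an extrusion-segmentation strategy that decomposes complex shapes into basic extruded parts to improve the reconstruction performance of deep learning models. By increasing data diversity, the approach improves generalization and robustness and offers a simple, effective way to generate structured CAD models from unordered point clouds.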

Comments Conference: ISPRS Toronto 2026

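2605.08965 2026-05-12 cs.CV Version update

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, Yohan Jo

Affiliations * Seoul National University

AI summary Although multimodal large language models (MLLMs) perform well on multimodal tasks, predicting whether an image is persuasive, and why, remains difficult. This paper finds that having MLLMs reason before predicting does not consistently help and can even hurt, which suggests that the generated rationales are unreliable. The study proposes supervised fine-tuning on diverse teacher-generated rationales, which improves visual persuasiveness prediction. It also introduces a three-dimensional faithfulness evaluation framework covering the consistency between reasoning and decision, the relevance of reasoning to the image, and the sensitivity of the decision to the reasoning. The evaluation reveals a gap between predictive performance and reasoning faithfulness and points to directions for training more faithful visual persuasion models.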

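2605.08952 2026-05-12 cs.CV Version update

FugSeg: Fast Uncertainty-aware Ground Segmentation for 3D Point Cloud

Yu Li, Volker Schwieger

Affiliations * Institute of Engineering Geodesy, University of Stuttgart; Daimler Truck AG

AI summary In LiDAR-based environment perception, ground segmentation is a key preprocessing step for applications such as mapping and navigation. To handle challenges such as reflection noise and isolated ground points, this paper proposes FugSeg, a fast, uncertainty-aware ground segmentation method. It represents the point cloud with a polar grid map and adds adaptive slope handling and a mechanism for noisy ground points, improving segmentation reliability on complex terrain. Experiments show that FugSeg outperforms existing non-learning methods on several public datasets and runs efficiently on a single CPU thread, making it suitable for resource-constrained systems.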

Comments Accepted for publication in IEEE Transactions on Intelligent Transportation Systems

Journal ref IEEE Transactions on Intelligent Transportation Systems (Early Access), 2026

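2605.08945 2026-05-12 cs.CV Version update

PIDNet: Progressive Implicit Decouple Network for Multimodal Action Quality Assessment

Qiqi Li, Pengfei Wang, Nenggan Zheng

Affiliations * Qiushi Academy for Advanced Studies (QAAS), Zhejiang University; College of Computer Science and Technology, Zhejiang University; School of Software Technology, Zhejiang University; State Key Lab of Brain-Machine Intelligence; Collaborative Innovation Center for Artificial Intelligence by MOE and Zhejiang Provincial Government (ZJU); Zhejiang Lab

AI summary This paper proposes PIDNet, a progressive implicit decoupling network for multimodal action quality assessment. It progressively fuses modality-specific information, cross-modal complementary cues, and global quality semantics to improve assessment accuracy. Its core module, iMambaWave, pairs a bidirectional Mamba branch with a wavelet-transform branch to capture long-range temporal dependencies and local detail changes respectively, and uses gated aggregation to fuse time-domain and frequency-domain information adaptively. Experiments show that PIDNet outperforms existing unimodal and multimodal methods on multiple datasets and that the design is general and modular.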

Comments 14 pages, 6 figures, 11 tables

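2605.08911 2026-05-12 cs.CV Version update

Unified Modeling of Lane and Lane Topology for Driving Scene Reasoning

Han Li, Yulu Gao, Si Liu, Yuhang Wang, Bo Liu, Beipeng Mu

Affiliations * School of Artificial Intelligence, Beihang University; Zhongguancun Academy, Beijing, China; Hangzhou International Innovation Institute, Beihang University; Meituan, Beijing, China

AI summary Autonomous vehicles must perceive not only physical elements of the driving scene, such as lane lines and traffic lights, but also logical information such as lane centerlines and their topology. This paper proposes UniTopo, a method that models lanes and lane topology in a unified way. By representing the topology between lanes as connection relations, it obtains lane positions and topology within a single perception pipeline, establishing a paradigm in which lane topology is perceived directly from raw image features. Experiments show that it significantly outperforms existing state-of-the-art methods on the OpenLane-V2 benchmark.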

Comments Accepted by IEEE TCSVT

2605.08902 2026-05-12 cs.CV cs.AI Version update

DAPE: Dynamic Non-uniform Alignment and Progressive Detail Enhancement Techniques for Improving the Performance of Efficient Visual Language Models

Mengyuan Tian, Qiyan Zhao, Yanan Wang, Da-Han Wang

Affiliations * The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen 361005, China; Fujian Key Laboratory of Pattern Recognition and Image Understanding, School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China

AI summary This paper proposes DAPE, a new framework for improving the performance of efficient vision-language models. It uses dynamic non-uniform alignment and progressive detail enhancement to address the uneven distribution of information density between text and images, enabling more precise cross-modal interaction. Experiments show significantly higher downstream accuracy on multiple benchmarks together with lower computational overhead.

Comments Accepted in ICIC 2026 Oral

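2605.08874 2026-05-12 cs.CV Version update

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

Hoang M. Truong, Hai Nguyen-Truong, Dang Huynh

Affiliations * Fulbright University Vietnam

AI summary This study addresses the semantic alignment gap between image-level vision-language models and pixel-level prediction in open-vocabulary semantic segmentation and proposes HyRo, a fine-tuning framework in hyperbolic space. Working in the Poincaré ball model, HyRo decouples hierarchy from semantic alignment: it adjusts the hyperbolic radius for hierarchical alignment and applies orthogonal transformations for angular alignment, refining semantic relations among embeddings at the same level. Experiments show state-of-the-art performance on multiple benchmark datasets.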

Comments Accepted to the PVUW Workshop at CVPR 2026. Project page: https://tmhoanggg.github.io/HyRo/
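
The HyRo entry above relies on the Poincaré ball model, where a point's distance from the origin (its hyperbolic radius) can encode hierarchy level. As background, here is a minimal sketch of the standard Poincaré distance for curvature -1. It illustrates the geometry only and is not the paper's method; the function name and sample points are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points in the unit Poincaré ball.

    Uses d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2))),
    which assumes curvature -1 and points strictly inside the unit ball.
    """
    norm_u = sum(c * c for c in u)
    norm_v = sum(c * c for c in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))


origin = [0.0, 0.0]
# The same Euclidean gap of 0.1 is much longer near the boundary than near the origin.
near_center = poincare_distance(origin, [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
print(near_center, near_boundary)
```

Because distances grow rapidly toward the boundary, the ball has room for tree-like structures: general concepts can sit near the origin and specific ones near the edge, while direction on the ball remains free to carry semantic similarity.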

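2605.08854 2026-05-12 cs.CV Version update

Restoration-Aligned Generative Flow Models for Blind Motion Deblurring

Insoo Kim, Jinwoo Shin

Affiliations * NAVER Cloud; KAIST AI; Samsung Electronics

AI summary This paper proposes DeblurFlow, a generative flow framework for blind motion deblurring. It replaces the noise endpoint of the generative flow trajectory with the blurry observation, aligning the training objective with the deblurring task and avoiding the fidelity loss that conventional generative flow models show on restoration tasks. The study also introduces r-space, a latent space dedicated to residual decoding, which greatly reduces computational cost. Results on multiple datasets show strong restoration fidelity and perceptual realism.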

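2605.08841 2026-05-12 cs.CV Version update

Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

Junli Zha, Jiahui Wang, Xinkai Lu, Jinbo Wang

Affiliations * SF Technology Co., Ltd.

AI summary Vision-language models (VLMs) tend to rely on memorization rather than actual visual perception when interpreting classic optical illusions. This study proposes a training-free framework that requires no fine-tuning and combines three complementary strategies: illusion-aware image preprocessing, anti-illusion prompt engineering, and multi-vote ensembling. Experiments report 90.48% accuracy on the official test set and 98.41% on a manually verified subset, earning second place in the challenge.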

Comments Accepted at CVPR 2026 Workshop on 5th DataCV Challenge

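2605.08839 2026-05-12 cs.CV (updated version)

Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning

Zhen-Hao Xie, Yan Wang, Hao Sun, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

Affiliations: School of Artificial Intelligence and the State Key Laboratory of Novel Software Technology, Nanjing University

AI summary: This paper proposes CORF, a framework that handles domain shift and catastrophic forgetting jointly in incremental learning. It selectively refines training samples using spatial contribution maps and adaptively adjusts sample weights according to prediction confidence to strengthen generalization. CORF also introduces a cascaded knowledge distillation mechanism that captures cross-sample relational dependencies and transfers knowledge at multiple granularities, which mitigates forgetting. It can be integrated into existing incremental learning algorithms and performs well in experiments.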

Comments Accepted by IEEE Transactions on Multimedia (TMM 2026). Code is available at https://github.com/LAMDA-CL/TMM26-CORF

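2605.08824 2026-05-12 cs.GR cs.CV (updated version)

HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis

Haimin Luo, Min Ouyang, Lan Xu, Jingyi Yu

Affiliations: ShanghaiTech University; ShanghaiTech University and Deemos Technology Co., Ltd.; Deemos Technology Co., Ltd.

AI summary: HairGPT is an autoregressive generative model that treats hair strands as language units, aiming to resolve the coupling of structure and texture in realistic 3D hairstyle synthesis. It decomposes a hairstyle into a dual-decoupled sequence modeling problem over semantic regions and structural levels, and uses a geometric tokenizer and semantic annotations to guide strand-level generation, enabling synthesis and editing of complex hairstyles. HairGPT turns hairstyle generation from conventional texture synthesis into a structured, semantically controllable authoring process and supports high-fidelity hairstyles in both realistic and stylized settings.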

Comments Accepted to SIGGRAPH 2026 (Journal Track)

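2605.08820 2026-05-12 cs.CV cs.AI cs.CR (updated version)

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

Xinyu Yan, Boyang Chen, Jiaming Zhang, Tiantong Wu, Hong Xi Tae, Yichen He, Tiantong Wang, Yachun Mi, Yurong Hao, Yilei Zhao, Lei Xiao, Longtao Huang, Pengjun Xie, Wei Liu, Wei Yang Bryan Lim

Affiliations: College of Computing and Data Science, Nanyang Technological University; Alibaba-NTU Global e-Sustainability CorpLab (ANGEL); Alibaba Group

AI summary: As AI-generated images become increasingly realistic, detecting AI-generated fraudulent refund evidence has become a new challenge. The authors propose FraudBench, a multimodal benchmark dedicated to this task. It is built from real scenarios in e-commerce, food delivery, and travel services, and contains images, reviews, and product metadata. Model-assisted filtering and human annotation separate genuinely damaged from undamaged evidence, and advanced image generation models synthesize fake damage images. Experiments show that existing models remain clearly inadequate at detecting AI-generated fake damage evidence, revealing a gap between generic AI-image detection and fraud-evidence verification in real scenarios.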

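2605.08819 2026-05-12 cs.CV cs.LG (updated version)

From pre-training to downstream performance: Does domain-specific pre-training make sense?

Felix Krones

Affiliations: Oxford Internet Institute, University of Oxford, Oxford, UK

AI summary: This study examines whether domain-specific pre-training effectively improves downstream performance in medical imaging. It systematically compares convolutional neural networks and Transformer models and analyzes the effect of several pre-training methods (supervised and self-supervised) and data modalities. The finding is that pre-training improves model performance significantly only when the pre-training data closely matches the target modality. The study underlines the importance of pre-training strategy for the reliability of deep learning models in medical imaging and offers guidance for building more accurate and dependable diagnostic tools.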

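2605.08814 2026-05-12 cs.CV (updated version)

Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference

Wei Cao, Hao Xu, Xiaolei Diao

Affiliations: Jilin University, China; University College London, UK

AI summary: This paper studies the challenging problem of recognizing unseen Chinese characters in open scenarios and proposes a zero-shot recognition method based on global-local dual-branch alignment and hierarchical inference. A unified cross-modal alignment framework jointly learns global and local representations of character images and character structure descriptions, a structure-filtering mask suppresses noisy operators in local similarity, and a coarse-to-fine hierarchical inference strategy improves both recognition performance and inference efficiency. Experiments show strong results under various zero-shot splits, with a marked advantage in low-resource conditions.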

Comments 9 pages

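2605.08808 2026-05-12 cs.CV cs.AI cs.LG (updated version)

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

Ziyao He, Yingjie Liu, ZhangYangRui, Mingsong Chen, Xuan Tang, Xian Wei

Affiliations: East China Normal University

AI summary: This paper proposes Curvature-Aware Captioning, a framework for accurately describing sparse point-cloud data in 3D scene understanding. It introduces a non-Euclidean geodesic attention mechanism: self-attention is computed in oblique space and bidirectional geodesic cross-attention is established in Lorentz space, so that local geometric detail and global semantic hierarchy are modeled together. Theoretical analysis indicates that the method alleviates the conflict between Euclidean and hyperbolic spaces, and experiments on the ScanRefer and Nr3D datasets show better localization accuracy and richer descriptions.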

Comments CVPR2026 Highlight!

2605.08805 2026-05-12 cs.CV (updated version)

LightAVSeg: Lightweight Audio-Visual Segmentation

Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu, Angela Yao

Affiliations: College of Informatics, Huazhong Agricultural University, Wuhan, China; School of Computing, National University of Singapore, Singapore; School of Computer Science, Adelaide University, Australia; School of Engineering, University of Warwick, Coventry, UK; Zhejiang Yuexiu University, Shaoxing, China

AI summary: LightAVSeg is a lightweight audio-visual segmentation framework that addresses the high computational complexity and poor deployability of existing models. It replaces conventional dense cross-modal attention with a decoupled design whose interaction cost grows linearly with spatial resolution, and adds an auxiliary alignment loss to improve semantic consistency. Experiments show that, with only 1/7 of the parameters of AVSegFormer, LightAVSeg reaches 50.4 mIoU on the MS3 dataset and enables efficient inference on mobile devices.

Comments 15 pages, 8 figures, 6 tables, Accepted to ICML 2026

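2605.08800 2026-05-12 cs.CV cs.AI (updated version)

PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

Jiahui Guang, Zexun Zhan, Zhenlin Xu, Cuiyun Gao, Haiyan Wang, Jing Li, Zhaoquan Gu, Yanchun Zhang

Affiliations: Harbin Institute of Technology; Pengcheng Laboratory; The Hong Kong Polytechnic University; Sichuan University; Zhejiang Normal University

AI summary: This paper proposes PPU-Bench, a real-world benchmark for personalized partial unlearning in vision-language models, addressing the reliance of existing benchmarks on synthetic data or full deletion. The benchmark contains 24,000 samples across three progressive scenarios and evaluates whether a model can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. The paper also proposes boundary-aware optimization (BAO), which strengthens the model's control over the boundaries of individual facts.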

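2605.08787 2026-05-12 cs.CV (updated version)

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

Mashrafi Monon, Umaima Rahman, Asif Hanif, Numan Saeed, Mohammad Yaqub

Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; New York University Abu Dhabi

AI summary: This paper proposes CT-SpatialVQA, a benchmark for evaluating the semantic-spatial understanding of 3D medical vision-language models. Built from 1,601 radiology reports and CT volumes, it contains 9,077 clinically relevant question-answer pairs that require anatomical localization, left-right identification, structural comparison, and reasoning about 3D structural relationships. Experiments show that existing models perform poorly on these tasks, with an average accuracy of only 34%, highlighting the need for better integration of 3D medical evidence before such models can be trusted clinically.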

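2605.08784 2026-05-12 cs.CV (updated version)

simpleposter: a simple baseline for product poster generation

Benlei Cui, Fangao Zeng, Weitao Jiang, Yuwen Zhai, Haiwen Hong, Longtao Huang, Hui Xue, Wenxiang Shang, Pipei Huang

Affiliations: Alibaba Group

AI summary: This paper proposes SimplePoster, a simple yet effective framework for product poster generation that tackles two challenges: preserving product appearance and precisely controlling dense multi-line text layout. Unlike earlier methods that rely on complex modules such as ControlNet and OCR encoders, SimplePoster uses full-parameter fine-tuning and character-level positional encoding to achieve high-fidelity subject preservation and accurate text rendering without an external controller. Experiments show that it outperforms existing methods in both subject preservation rate and text rendering accuracy.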

Comments CVPR 2026

2605.08781 2026-05-12 cs.CV (updated version)

Contour-Native Bridge Defect Detection and Compact Digital Archiving with Frequency-Supervised Fourier Contours

Jin Liu, Wang Wang, Hongxu Pu, Zhen Cao, Yasong Wang, Hu Wang, Kunming Luo

Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University; Sustainability X-Lab, The University of Hong Kong; Department of Cyber Security, Southeast University; School of Computer Science and Engineering, University of Electronic Science and Technology of China; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology

AI summary: This paper studies how to represent bridge defect detection results as compact, recoverable contour vectors, in place of coarse bounding boxes or storage-heavy raster masks. It proposes a frequency-supervised Fourier series detection method (FS-FSD) that directly regresses Fourier contour descriptors, and evaluates boxes, masks, and contours under a unified polygon-space protocol. On a large set of UAV-acquired bridge images, the method achieves higher polygon-space detection accuracy and better geometric matching quality for true positives, giving engineering review and downstream information workflows a more efficient and precise representation of defect boundaries.

Comments 46 pages, 13 figures
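
To make the idea of a compact, recoverable contour vector concrete, here is a minimal sketch of classical complex Fourier descriptors for a closed contour. It is a generic textbook construction, assumed here for illustration only; the paper's FS-FSD descriptor parameterization and its frequency supervision may differ, and the function names are hypothetical.

```python
import cmath
import math

def fourier_descriptors(points, order):
    """Complex Fourier coefficients c_k, k = -order..order, of a closed contour."""
    n = len(points)
    z = [complex(x, y) for x, y in points]
    return {
        k: sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)) / n
        for k in range(-order, order + 1)
    }

def reconstruct(coeffs, num_points):
    """Sample the contour encoded by the coefficients at num_points positions."""
    out = []
    for t in range(num_points):
        z = sum(c * cmath.exp(2j * math.pi * k * t / num_points)
                for k, c in coeffs.items())
        out.append((z.real, z.imag))
    return out

# An ellipse needs only the k = -1 and k = +1 terms, so order 1 is exact here.
ellipse = [(3 * math.cos(2 * math.pi * t / 32), 2 * math.sin(2 * math.pi * t / 32))
           for t in range(32)]
coeffs = fourier_descriptors(ellipse, order=1)
recovered = reconstruct(coeffs, 32)
```

In this toy case, three complex numbers replace 32 boundary points without loss. Real defect outlines need higher orders, and the order sets the trade-off between storage and boundary detail.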

2605.08764 2026-05-12 cs.LG cs.CV eess.IV (updated version)

Anchoring the Eigengap: Cross-Modal Spectral Stabilization for Sample-Efficient Representation Learning

Nikhil J. Dhinagar, Vidhi Chhatbar, Chirag Jagad, Pavithra Senthilkumar, Sophia I. Thomopoulos, Mahir H. Khan, Sook-Lei Liew, the ENIGMA-Stroke Recovery Working Group, Paul M. Thompson

Affiliations: Imaging Genetics Center, Mark & Mary Stevens Neuroimaging & Informatics Institute, Keck School of Medicine, University of Southern California; Neuroscience Graduate Program, Mark & Mary Stevens Neuroimaging & Informatics Institute, Chan Division of Occupational Science & Occupational Therapy, Biomedical Engineering, University of Southern California

AI summary: This paper investigates the root cause of the performance drop of deep vision models under data scarcity: with limited samples, noise in the embedding covariance matrix compresses the eigengap and limits the number of recoverable signal modes. The authors develop a spectral theory of finite-sample representation learning that quantifies the recoverable dimension $K(N)$ and uses perturbation theory and concentration inequalities to derive a criterion for reliable feature modes. They further show that multimodal learning (for example, vision-language models) suppresses noise directions through a low-rank constraint and preserves the eigengap, improving data efficiency and classification performance, with a clear advantage in small-sample settings such as medical imaging.

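2605.08753 2026-05-12 cs.CV stat.ML (updated version)

Simultaneous Monitoring of Shape and Surface Color via 4D Point Clouds: A Registration-free Approach

Mariafrancesca Patalano, Giovanna Capizzi, Kamran Paynabar

Affiliations: Department of Statistical Sciences, University of Padua; School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary: This paper proposes SMAC, a registration-free framework based on 4D point clouds for simultaneously monitoring changes in an object's shape and surface color. It exploits spectral properties of the Laplace-Beltrami operator to capture the relationship between shape and color, and uses a joint monitoring strategy to detect shape deformations and color anomalies. It also introduces a spatially aware post-signal diagnostic procedure to localize the source of an anomaly. The method is computationally efficient and needs neither registration nor mesh reconstruction, and experiments show strong performance in detecting subtle defects.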

Comments 38 pages, 11 figures

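2605.08739 2026-05-12 cs.CV (updated version)

ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

Luchao Wang, Kaimin Liao, Qian Ren, Hua Wang, Zhi Chen, Yaohua Tang

AI summary: This paper proposes ReorgGS, a method that addresses the degraded parameterization of 3D Gaussian Splatting (3DGS) models after convergence. It treats the existing set of Gaussians as an empirical probability field, resamples centers, and estimates anisotropic covariances, rebuilding a better distribution structure and improving gradient accessibility for subsequent optimization. Unlike simply resetting opacity, ReorgGS restructures the distribution and visibility of the Gaussians. It preserves scene expressiveness while reducing redundant overlap, improving optimization results and rendering efficiency.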

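2605.08735 2026-05-12 cs.CV (updated version)

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Joowon Kim, Seungho Shin, Joonhyung Park, Eunho Yang

Affiliations: KAIST; Kyung Hee University; AITRICS

AI summary: This paper proposes CollabVR, a collaborative video reasoning framework that addresses long-horizon drift and simulation errors in intermediate clips when video generation models (VGMs) perform multi-step tasks. It couples a vision-language model (VLM) tightly with the VGM at the step level: after generating the action for each step, the VLM inspects and corrects the clip produced by the VGM, improving the accuracy and robustness of reasoning. Experiments show that CollabVR significantly outperforms existing methods on multiple benchmarks, especially on complex tasks, and yields further gains when combined with VGMs optimized for reasoning.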

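2605.08729 2026-05-12 cs.CV cs.GR cs.MM cs.SD (updated version)

Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

Shihao Cheng, Jiaxu Zhang, Quanyue Song, Shansong Liu, Zhizhi Guo, Xiaolei Zhang, Chi Zhang, Xuelong Li, Zhigang Tu

Affiliations: DeepMind; OpenAI

AI summary: Unison is a unified framework that addresses the alignment difficulties caused by the asynchronous nature of motion, speech, and sound in human-centric video generation. With a semantics-guided harmonization strategy, it generates speech and sound-effect components separately and uses bidirectional audio cross-attention and semantically conditioned gating to improve sound clarity and reduce speech dominance. Unison also proposes a bidirectional cross-modal forcing strategy that synchronizes motion and audio through decoupled denoising schedules, significantly improving perceived audio quality and cross-modal synchronization in generated videos.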

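2605.08727 2026-05-12 cs.CV cs.AI cs.LG (updated version)

Control Your View: High-Resolution Global Semantic Manipulation in Learned Image Compression

Jiaming Liang, Chi-Man Pun, Weisi Lin, Greta Seng Peng Mok

Affiliations: University of Macau; Nanyang Technological University

AI summary: This paper studies high-resolution global semantic manipulation (GSM) in learned image compression systems and notes that existing methods have limited effect at high resolution. Through theoretical and experimental analysis, the authors show that a high-resolution GSM attack passes through three phases (lazy, oscillation, and refinement) and propose a periodic geometric-decay step-size schedule that achieves high-resolution GSM under an $\ell_{\infty}$ bound. On this basis they improve PGD and propose PGD$^{2}$-GSM, which achieves stable and efficient high-resolution GSM on the Kodak dataset for the first time and reveals a new security threat to learned image compression.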
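
The interplay between a decaying step size and the $\ell_{\infty}$ projection can be sketched on a toy problem. The sketch below assumes one possible reading of "periodic geometric decay", namely that the step shrinks by a constant factor once per period; the paper's actual schedule, loss, and hyperparameters are not given in this summary, so `step_size`, `alpha0`, `gamma`, and `period` are hypothetical.

```python
def step_size(t, alpha0=0.1, gamma=0.5, period=10):
    """Step size that shrinks by a factor gamma once every `period` iterations."""
    return alpha0 * gamma ** (t // period)

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(grad_fn, x0, eps, steps):
    """Projected gradient ascent on an objective, inside an L-infinity ball around x0."""
    x = list(x0)
    for t in range(steps):
        g = grad_fn(x)
        alpha = step_size(t)
        x = [xi + alpha * sign(gi) for xi, gi in zip(x, g)]
        # Projection: clip every coordinate back into [x0 - eps, x0 + eps].
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x

# Toy objective: maximize -(x - target)^2, i.e. move x toward target.
target = [0.35, -2.0]
grad = lambda x: [-2.0 * (xi - ti) for xi, ti in zip(x, target)]
adv = pgd_linf(grad, x0=[0.0, 0.0], eps=0.5, steps=60)
```

With a fixed step of 0.1 the first coordinate would keep bouncing between 0.3 and 0.4. Shrinking the step lets it settle near 0.35, which loosely mirrors the oscillation and refinement phases named in the summary. The second coordinate is stopped by the projection at the edge of the ball.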

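2605.08724 2026-05-12 cs.CV (updated version)

SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment

Weiren Zhao, Yi Dong, Cheng Chen

Affiliations: The University of Hong Kong, Hong Kong, China

AI summary: This paper proposes SynerMedGen, a framework that unifies medical multimodal understanding and generation through task alignment, addressing the separation of understanding and generation objectives in existing models. It introduces three generation-aligned understanding tasks and a two-stage training strategy, so that generation-friendly representations learned during the understanding stage effectively support medical image synthesis. Experiments show strong performance on multiple medical image generation tasks and good generalization. The authors also release the SynerMed dataset, with 1 million paired synthesis samples and 2 million generation-derived understanding instances, to support further research.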

Comments Accepted by ICML 2026

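2605.08723 2026-05-12 cs.CV cs.MM (updated version)

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Huilai Li, Xiaomeng Di, Ying Xing, Yonghao Dang, Yiming Wang, Jianqin Yin

Affiliations: School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications; State Grid Corporation of China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary: This paper studies weakly supervised audio-visual video parsing (AVVP), which aims to identify and localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Existing methods mostly focus on multimodal fusion and neglect guiding and preserving uni-modal semantics, which leads to noisy pseudo-labels and poor parsing performance. The paper proposes a framework that enhances uni-modal representations: a similarity-based label transfer method improves the pseudo-label generator's understanding of uni-modal events, and a soft-constraint scheme jointly optimizes uni-modal and multimodal feature modeling, improving event localization. Experiments show that the method outperforms existing state-of-the-art approaches in both pseudo-label generation and AVVP.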

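2605.08712 2026-05-12 cs.CV (updated version)

From Articulated Kinematics to Routed Visual Control for Action-Conditioned Surgical Video Generation

Bohan Li, Shuojue Yang, Baorui Peng, Xianda Guo, Erli Zhang, Youqi Tao, Junfeng Duan, Daguang Xu, Qi Dou, Xin Jin, Wenjun Zeng, Hao Zhao, Yueming Jin

Affiliations: SJTU; NUS; THU; EIT; WHU; Harvard; NVIDIA; CUHK

AI summary: This paper studies action-conditioned surgical video generation, whose core challenge is controlling complex image-space evolution precisely from low-dimensional control vectors. The authors propose a framework that lifts articulated kinematics to visual control: the kinematics of the robotic arm are converted into five image-aligned control modalities, and a hierarchically routed visual control scheme dynamically selects the most relevant control modality and motion scale, improving generation efficiency and control precision. The authors also build a finely annotated surgical video dataset and show experimentally that the method is superior in action faithfulness, visual fidelity, and cross-domain generalization.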

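2605.08709 2026-05-12 cs.CV (updated version)

UniShield: Unified Face Attack Detection via KG-Informed Multimodal Reasoning

Hongrui Li, Yichen Shi, Hongyang Wang, Yuhao Gao, Hui Ma, Jun Feng, Zitong Yu

Affiliations: Shijiazhuang Tiedao University; Shanghai Jiao Tong University; Ningbo Institute of Digital Twin; Great Bay University

AI summary: This paper proposes UniShield, a knowledge-guided multimodal reasoning framework for unified face attack detection that identifies both physical spoofing and digital forgery attacks. It constructs a face attack knowledge graph (FAKG), generates a large number of training samples through attack-graph instruction tuning (AGIT), and introduces graph-consistency reasoning optimization (GCRO) to improve reasoning consistency. Experiments show strong performance under multiple detection protocols, with significant gains in detection accuracy and reasoning reliability.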

2605.08703 2026-05-12 cs.AI cs.CL cs.CV cs.LG (updated version)

RewardHarness: Self-Evolving Agentic Post-Training

Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, Huaisong Zhang, Songcheng Cai, Yubo Wang, Dongfu Jiang, Yuyu Zhang, Ping Nie, Wenhu Chen, Changqian Yu, Kelsey R. Allen

Affiliations: University of British Columbia; Vector Institute; Kolors Team, Kuaishou Technology; Carnegie Mellon University; University of Waterloo; Etude AI; Tsinghua University; Georgia Institute of Technology

AI summary: This work proposes RewardHarness, a self-evolving agentic reward framework that addresses the heavy dependence on human annotation of reward models used to evaluate instruction-guided image editing. Starting from a small number of examples, it iteratively evolves a library of tools and skills and aligns with human preferences without additional training, greatly improving data efficiency. Experiments show that, using only 0.05% of the annotated data, RewardHarness outperforms GPT-5 on image-editing evaluation benchmarks, demonstrating its efficiency and effectiveness for reward modeling.

Comments Project page: https://rewardharness.com

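2605.08702 2026-05-12 cs.CV cs.AI (updated version)

Gate-and-Merge: Zero-shot Compositional Personalization of Vision Language Models

Guodong Ding, Angela Yao

Affiliations: National University of Singapore

AI summary: This paper studies compositional personalization of vision-language models, that is, recognizing or describing multiple user-defined concepts simultaneously at test time. It proposes Gate-and-Merge, a zero-shot framework that achieves compositional personalization without co-occurrence training. A lightweight LoRA adapter with a concept token is learned independently for each concept. At inference time the relevant updates are merged directly in weight space, and a gating mechanism suppresses irrelevant activations, improving performance in both single-concept and compositional scenarios.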
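
A minimal sketch of merging independently trained low-rank updates in weight space, with a per-concept gate, is shown below. It assumes the standard LoRA form W + B·A and scalar gates chosen by hand. How Gate-and-Merge actually computes its gates is not described in this summary, so `merge_lora` and the gate values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(base, adapters, gates):
    """Return base + sum_i gates[i] * (B_i @ A_i), merged in weight space."""
    merged = [row[:] for row in base]
    for (b, a), g in zip(adapters, gates):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += g * v
    return merged

base = [[1.0, 0.0], [0.0, 1.0]]
# Two rank-1 adapters, each stored as (B, A) with B: 2x1 and A: 1x2.
cat = ([[1.0], [0.0]], [[0.5, 0.5]])
mug = ([[0.0], [2.0]], [[1.0, -1.0]])
# Gate 1.0 keeps the "cat" update; gate 0.0 suppresses the irrelevant "mug" update.
weights = merge_lora(base, [cat, mug], gates=[1.0, 0.0])
```

Because each adapter is trained alone, no image containing both concepts is needed; the combination happens only when the updates are summed at inference time.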

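2605.08695 2026-05-12 cs.CV (updated version)

EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

Van-Loc Nguyen, AprilPyone MaungMaung, Minh-Triet Tran, Isao Echizen

Affiliations: University of Science, Vietnam National University Ho Chi Minh City; National Institute of Informatics

AI summary: EditSleuth is a dataset for image-edit forensics containing 257,725 image-edit triplets. Each sample includes the edited image, the original image, an edit mask, an edit-type label, a difficulty score, and a six-step reasoning chain. The dataset is constructed deterministically, and every step of the reasoning chain is grounded in computable visual evidence, in order to support visually grounded edit localization and semantic identification. Experiments show that the dataset effectively teaches models edit reasoning and lets them produce interpretable forensic explanations.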

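2605.08664 2026-05-12 cs.CV (updated version)

IPAD-CLIP: Teaching CLIP to Detect Image Local Perceptual Artifacts

Juan Wang, Xinyu Sun, Ke Zhang, Jin Wang, Bing Li, Weiming Hu, Liang Wang

Affiliations: Minzu University of China; OPPO Co., Ltd.

AI summary: Current image quality assessment methods focus mainly on global distortions (such as noise and blur) and overlook the detection of local perceptual artifacts (such as ghosting, lens flare, and moiré). To address this, the paper defines the image perceptual artifact detection (IPAD) task and builds a benchmark dataset of 3,520 annotated images. Building on CLIP, the authors design the IPAD-CLIP framework, which learns artifact-related semantic embeddings to strengthen recognition of subtle local artifacts. Experiments show that it outperforms existing state-of-the-art methods in both resource efficiency and detection performance.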

Comments 14 pages, 6 figures

2605.08663 2026-05-12 cs.CV (updated version)

CAST: Channel-Aware Spatial Transfer Learning with Pseudo-Image Radar for Sign Language Recognition

Md. Shakhoyat Rahman Shujon, Sheikh Md. Galib Mahim, Md. Milon Islam, Md Rezwanul Haque, Md Rabiul Islam, Hamdi Altaheri, Fakhri Karray

Affiliations: Department of Computer Science and Engineering, Khulna University of Engineering & Technology; Department of Electrical and Computer Engineering, University of Waterloo; Department of Electrical and Computer Engineering, Texas A&M University; College of Applied Computer Science, King Saud University; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper proposes CAST, a dual-stream architecture for isolated sign language recognition from 60 GHz radar echo magnitudes alone. It combines three physics-based modules with pretrained vision networks and uses channel-aware spatial transfer learning to improve the representation of radar signals. The core components are an inverse transform of the log-compressed signal, a cross-antenna spatial attention mechanism, and cross-attention fusion of heterogeneous networks. The method reaches 80.5% Top-1 accuracy in five-fold cross-validation, outperforming the best existing single-model baseline.

Comments Accepted for the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), MSLR Workshop @ CVPR 2026 in Denver (Colorado, USA)

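2605.08651 2026-05-12 cs.CV cs.AI cs.LG (updated version)

Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

Lei Wang, Wenxiang Diao, Andrew Busch, Jun Zhou, Yongsheng Gao

Affiliations: School of Engineering and Built Environment, Griffith University; School of Computer Science and Engineering, University of New South Wales; School of Information and Communication Technology, Griffith University

AI summary: This paper studies privacy-aware video anomaly detection and proposes a method that protects privacy through orthogonal subspace projection. Its core components, the Orthogonal Projection Layer (OPL) and the Guided Orthogonal Projection Layer (G-OPL), remove task-irrelevant feature variation and suppress facial attribute information while retaining non-identifying features such as action and pose. The work also introduces a privacy-aware evaluation framework, and experiments show that the method improves detection accuracy while effectively filtering sensitive information.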

Comments Accepted as a Spotlight paper at the Forty-Third International Conference on Machine Learning (ICML 2026)
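
The geometric core of orthogonal subspace projection fits in a few lines: given directions that span an unwanted subspace, a feature is replaced by its component in the orthogonal complement. The sketch below is plain linear algebra under the assumption that the sensitive directions are already known. How OPL and G-OPL learn or guide those directions is not covered here, and the example vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(directions):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for d in directions:
        for u in basis:
            coeff = dot(d, u)
            d = [di - coeff * ui for di, ui in zip(d, u)]
        norm = math.sqrt(dot(d, d))
        if norm > 1e-12:  # skip directions already covered by the basis
            basis.append([di / norm for di in d])
    return basis

def project_out(feature, basis):
    """Remove the component of `feature` lying in span(basis)."""
    for u in basis:
        coeff = dot(feature, u)
        feature = [f - coeff * ui for f, ui in zip(feature, u)]
    return feature

# Hypothetical "sensitive" directions in a 3D feature space (they span the x-y plane).
sensitive = orthonormalize([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
cleaned = project_out([2.0, -3.0, 5.0], sensitive)
```

After projection the feature carries no information along the removed directions, while everything orthogonal to them (here the third coordinate) passes through unchanged.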

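2605.08640 2026-05-12 cs.CV (updated version)

FlowADMM: Plug-and-play ADMM with Flow-based Renoise-Denoise Priors

Hendrik Sommerhoff, Michael Moeller

Affiliations: Computer Vision Group, University of Siegen

AI summary: This paper proposes FlowADMM, a plug-and-play ADMM algorithm based on flow models for solving inverse problems. It formalizes the deterministic renoise-denoise operation of flow models and integrates it into the classical ADMM framework, improving convergence and stability. Experiments show that FlowADMM performs well on denoising, deblurring, super-resolution, and inpainting, and requires fewer image-consistency evaluations.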
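
The plug-and-play structure can be sketched on a scalar toy problem: ADMM alternates a data-fidelity step, a prior step, and a dual update, and the prior step is where any denoiser can be plugged in. The sketch below uses soft-thresholding as a stand-in denoiser so that the result can be checked by hand. It is not the flow-based renoise-denoise operator of the paper, and `pnp_admm` is a hypothetical name.

```python
def soft_threshold(v, lam):
    """Stand-in 'denoiser': the proximal operator of lam * |v|."""
    return max(abs(v) - lam, 0.0) * (1.0 if v > 0 else -1.0)

def pnp_admm(y, denoise, rho=1.0, iters=100):
    """ADMM for min_x 0.5 * (x - y)^2 + prior(x), with the prior step
    replaced by a pluggable denoiser."""
    x = z = u = 0.0
    for _ in range(iters):
        x = (y + rho * (z - u)) / (1.0 + rho)   # data-fidelity (proximal) step
        z = denoise(x + u)                       # prior step: plug in any denoiser
        u = u + x - z                            # dual (running residual) update
    return x

estimate = pnp_admm(3.0, lambda v: soft_threshold(v, 1.0))
```

For this choice the iteration converges to 2.0, the minimizer of 0.5·(x − 3)² + |x|. Swapping the lambda for a learned denoiser changes only the prior step, which is the sense in which the scheme is plug-and-play.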

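2605.08635 2026-05-12 cs.CV (updated version)

Kinematics-Driven Gaussian Shape Deformation for Blurry Monocular Dynamic Scenes

Yeon-Ji Song, Kiyoung Kwon, Junoh Lee, Jin-Hwa Kim, Byoung-Tak Zhang

Affiliations: AI Institute, Seoul National University; Interdisciplinary Program in Neuroscience, Seoul National University; Gwangju Institute of Science; NAVER AI Lab

AI summary: This paper studies the reconstruction of dynamic 3D scenes from blurry monocular video. To handle the entangled geometry caused by motion blur, it proposes Kinematics-GS, a kinematics-driven Gaussian shape deformation framework. The method treats blur as a motion-aligned deformation and parameterizes Gaussian shapes with a kinematic prior, which avoids shape degeneration without auxiliary motion supervision. It also decomposes the scene into dynamic and static parts by temporal deformation variance and adopts a coarse-to-fine deformation strategy, improving reconstruction stability and detail. Experiments show that it significantly outperforms existing methods on real scenes.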

Comments 20 pages, 9 figures, 13 tables

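2605.08633 2026-05-12 cs.DC cs.CV (updated version)

Transforming the Use of Earth Observation Data: Exascale Training of a Generative Compression Model with Historical Priors for up to 10,000x Data Reduction

Jinxiao Zhang, Runmin Dong, Xiyong Wu, Xihan Huang, Shenggan Cheng, Yunkai Yang, Zheng Zhou, Yunpu Xu, Zhaoyang Luo, Miao Yang, Fan Wei, Mengxuan Chen, Yang You, Juepeng Zheng, Weijia Li, Yutong Lu, Haohuan Fu

Affiliations: Institute of Data and Information, Tsinghua Shenzhen International Graduate School; Department of Earth System Science, Tsinghua University; Sun Yat-Sen University; National University of Singapore; National Supercomputing Center in Shenzhen

AI summary: This work proposes a generative compression framework based on historical priors, aiming to turn compression of Earth observation data from a conventional storage and transmission tool into a new way of using data, with compression ratios of up to 10,000x. Training at extreme scale on the LineShine Armv9 supercomputer, the team optimized model design, kernels, the memory hierarchy, the runtime, and parallelism, achieving training performance of 1.54 to 2.16 EFLOP/s. The method exploits the fact that Earth observation repeatedly measures the same planet, which makes extreme compression feasible, and demonstrates the potential of historical-prior generative compression for data acquisition, transmission, storage, and scientific applications.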

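2605.08627 2026-05-12 cs.CV (updated version)

DRNet: All-in-One Image Restoration via Prior-Guided Dynamic Reparameterization

Ao Li, Xiaoning Liu, Sheng Li, Yapeng Du, Zhen Long, Lei Luo, Le Zhang, Ce Zhu

Affiliations: School of Information and Communication Engineering, University of Electronic Science and Technology of China; School of Communications and Information Engineering, Chongqing University of Posts and Telecommunication

AI summary: This paper proposes DRNet, an image restoration framework that handles multiple degradations with a single model. It introduces a dynamic reparameterization mechanism, combined with task-specific modulators and a continuous wavelet transform encoder, to address high computational overhead, the difficulty of optimizing heterogeneous tasks, and inefficient encoder design. Experiments show that DRNet achieves state-of-the-art performance on five restoration tasks, is parameter-efficient and flexible, and can serve either as a blind-restoration foundation model or as a user-guided expert model.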

Comments Accepted by IEEE TMM

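2605.08618 2026-05-12 cs.CV cs.LG (updated version)

Beyond Toy Benchmarks: A Systematic Evaluation of OOD Detection Methods For Plant Pathology Classification

Devesh Shah

Affiliations: Independent Researcher

AI summary: This study systematically evaluates six out-of-distribution (OOD) detection methods on plant pathology classification, focusing on distribution shift in realistic settings. Experiments on the Plant Pathology 2021 dataset show that energy-based fine-tuning significantly improves OOD detection while maintaining in-distribution accuracy, and that its advantage comes from restructuring the embedding space and calibrating the scoring function. The study also reveals training instabilities that can arise when constrained optimization methods are applied to medium-sized datasets, providing useful guidance for practical deployment.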
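
For readers unfamiliar with energy-based OOD scoring, the sketch below computes the commonly used energy score, E(x) = −T·logsumexp(logits / T), from a classifier's logits and thresholds it. This is the generic scoring function only. The fine-tuning objective evaluated in the paper is not reproduced, and the threshold value in the usage is an arbitrary assumption.

```python
import math

def energy_score(logits, temperature=1.0):
    """E(x) = -T * logsumexp(logits / T); lower energy suggests in-distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in scaled))
    return -temperature * lse

def is_ood(logits, threshold):
    """Flag an input as OOD when its energy exceeds a chosen threshold."""
    return energy_score(logits) > threshold

confident = [9.0, 0.5, -1.0]   # one dominant class
flat = [0.1, 0.0, -0.1]        # no class stands out
```

Calling `is_ood(flat, -5.0)` flags the flat prediction while `is_ood(confident, -5.0)` does not; in practice the threshold would be chosen on held-out in-distribution data.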

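2605.08606 2026-05-12 cs.CV (updated version)

Egocentric Whole-Body Human Mesh Recovery with Prior-Guided Learning

Soyeon Na, Seung Young Noh, Ju Yong Chang

AI summary: This paper proposes a prior-guided learning framework that improves the accuracy of egocentric whole-body human mesh recovery by constructing more accurate optimization-based pseudo labels and exploiting multiple priors. Experiments show that it outperforms existing methods.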

Comments Accepted to ICIP 2026. This is the author-formatted version of the paper

Abstract

Egocentric human mesh recovery (HMR) from monocular head-mounted cameras is increasingly important for AR/VR applications, but remains challenging due to the lack of reliable ground-truth (GT) annotations based on parametric human body models such as SMPL and SMPL-X for real egocentric images. Existing egocentric HMR methods typically rely on pseudo-GT and focus on body pose estimation, which limits their ability to recover fine-grained whole-body details such as hands and face. We study egocentric whole-body human mesh recovery and propose a prior-guided learning framework that reconstructs whole-body meshes from a single egocentric image. We construct more accurate optimization-based pseudo-GT aligned with 3D joint supervision, and leverage multiple priors by adapting an exocentric HMR foundation model together with a diffusion-based pose prior. A deterministic undistortion module is further adopted to handle fisheye distortions in egocentric images. Experiments across multiple egocentric benchmarks demonstrate improved whole-body reconstruction compared to state-of-the-art methods, and show that our optimization-based pseudo-GT is substantially more accurate than existing regression-based pseudo-GT. To facilitate reproducibility, the code and dataset annotations are publicly available at https://github.com/naso06/EgoSMPLX.

2605.08592 2026-05-12 cs.CV (updated version)

Cross-Modal RGB-D Fusion Transformer for 6D Pose Estimation of Non-Cooperative Spacecraft with Stereo-Derived Depth

Yongliang Zhen, Bo LÜ, Hang Yang, Xiaotian WU

Affiliations: School of Physics, Northeast Normal University; Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Science

AI summary: This paper proposes a 6D pose estimation method based on passive stereo vision. The TSCA-Stereo network handles the weak textures and strong illumination of space imagery, and a cross-modal Transformer fuses RGB and stereo depth information to achieve high-precision pose estimation.

Abstract

On-orbit servicing and active debris removal involving non-cooperative spacecraft require reliable pose estimation to supply accurate position and orientation data for autonomous visual navigation. Learning-based monocular methods have seen widespread adoption in spacecraft pose estimation, yet they suffer from an intrinsic depth ambiguity problem and tend to fail under the harsh illumination conditions routinely encountered in orbit. Active depth sensors could in principle address the geometric ambiguity, but their power and mass requirements make them poorly suited to most spacecraft platforms. This work addresses these issues through a passive stereo vision framework for six-degree-of-freedom (6-DOF) pose estimation of non-cooperative spacecraft. A binocular stereo matching network called TSCA-Stereo is developed to cope with weak-texture surfaces, specular highlights, and severe lighting variations typical of space imagery. A cross-modal fusion Transformer is introduced to combine RGB appearance information with stereo depth features in an adaptive manner, supporting reliable pose recovery. A synthetic binocular multimodal dataset is also built for the experiments, covering stereo disparity maps and 6-DOF pose annotations across a range of illumination scenarios, attitude configurations, and noise levels. Experimental results show that TSCA-Stereo outperforms the baseline across every evaluated metric on this space-specific dataset. The full pose estimation pipeline achieves a mean translation error of 0.0419 m and a mean orientation error of 0.8632° under varied imaging conditions, confirming that the passive stereo approach is both effective and resilient when operating under the demanding visual conditions of the space environment.
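
The two figures reported at the end of this abstract are a translation error in metres and an orientation error in degrees. The sketch below shows one common way such errors are computed per sample, assuming positions as 3-vectors and orientations as unit quaternions in (w, x, y, z) order. The abstract does not state the exact metric definitions used, so these formulas are an assumption.

```python
import math

def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and true positions (same unit)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(t_pred, t_true)))

def orientation_error_deg(q_pred, q_true):
    """Angle of the rotation between two unit quaternions (w, x, y, z), in degrees."""
    d = abs(sum(p * q for p, q in zip(q_pred, q_true)))  # abs: q and -q are the same rotation
    return math.degrees(2.0 * math.acos(min(1.0, d)))

identity = (1.0, 0.0, 0.0, 0.0)
# Rotation of 10 degrees about the z axis.
half = math.radians(10.0) / 2.0
rot_z_10 = (math.cos(half), 0.0, 0.0, math.sin(half))
```

Averaging these per-sample values over a test set gives mean errors of the kind quoted above. The `min(1.0, d)` guard keeps `acos` inside its domain when rounding pushes the dot product slightly above one.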

2605.08589 2026-05-12 cs.CV (updated version)

S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain

Baoquan Zhang, Zhehao Yu, Lisai Zhang, Kenghong Lin, Tianran Chen, Yuxi Sun, Yunming Ye, Yao He

Affiliations: Harbin Institute of Technology, Shenzhen; ShenZhen SiFar Co., Ltd.; Bilibili. Inc; Shenzhen University

AI summary: This paper proposes S2FT, which fine-tunes a small number of spectral coefficients in a sparse spectrum domain, addressing the fact that the spectrum is not sparse in conventional spectral fine-tuning. Experiments show that it achieves strong performance using only 0.08% of the training parameters.

Comments Accepted by CVPR 2026

Abstract

Parameter-Efficient Fine-Tuning (PEFT) is a key technique for adapting a large pretrained model to downstream tasks by fine-tuning only a small number of parameters. Recent methods based on Fourier transforms have further reduced the number of fine-tuned parameters by fine-tuning only a few spectral coefficients. Their basic assumption is that the weight change δW is a spatial-domain matrix with a sparse spectrum. However, in this paper, we observe that the spectrum of the weight change is not sparse, but is instead distributed roughly uniformly in power. This implies that fine-tuning only a few spectral coefficients is insufficient to accurately model a weight change with a uniform spectrum. To address this issue, we propose to seek an invertible transformation that maps a latent spatial-domain matrix with a sparse spectrum to the weight change, and then perform PEFT in this sparse spectrum domain with few spectral coefficients; we call the method S2FT. To find such a transformation, we first pre-estimate a coarse weight change as a prior. Then, motivated by the observation that a sparse spectrum often corresponds to locally smooth spatial structure, we regard the transformation as a row and column rearrangement of the pre-estimated weight change that smooths spatial structure while keeping the structural information of neurons. Finally, we solve the rearrangement search problem with a simple nearest-neighbor search, thereby obtaining the invertible transformation. Extensive results show that S2FT achieves superior performance using only 0.08% of the training parameters.
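
The premise that rearranging entries can change how sparse a spectrum is can be checked on a one-dimensional toy signal. The sketch below is only an illustration of that premise, under the assumption that a plain DFT stands in for the spectral transform. It does not implement the row and column nearest-neighbor search described in the abstract, and the helper names are hypothetical.

```python
import cmath
import math

def dft(signal):
    """Plain O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def num_active_coefficients(signal, tol=1e-9):
    """Count spectral coefficients whose magnitude exceeds tol."""
    return sum(1 for c in dft(signal) if abs(c) > tol)

n = 16
smooth = [math.cos(2 * math.pi * t / n) for t in range(n)]

# Same values with two entries swapped: a rearrangement that breaks smoothness.
scrambled = smooth[:]
scrambled[0], scrambled[8] = scrambled[8], scrambled[0]
```

The smooth signal has 2 active coefficients, while the version with a single swapped pair has 8. Undoing the swap, which is itself an invertible rearrangement, restores the sparse spectrum, so a few coefficients suffice again.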

2605.08585 2026-05-12 cs.CV cs.AI (updated version)

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Lujia Zhong, Yihao Xia, Shuo Huang, Jianwei Zhang, Yonggang Shi

Affiliations: Stevens Neuroimaging and Informatics Institute, Keck School of Medicine, University of Southern California; Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California; Alfred E. Mann Department of Biomedical Engineering, Viterbi School of Engineering, University of Southern California

AI summary: PromptDx uses a differentiable prompt tuning mechanism together with a pretrained TabPFN to perform multimodal in-context diagnosis, improving data efficiency and clinical applicability.

Abstract

Deep learning models in medical imaging typically operate as parametric memory, diagnosing patients by recalling fixed knowledge learned during training. This contrasts sharply with clinical practice, where physicians employ analogical reasoning to diagnose new cases by referencing similar records from past exemplars. While In-Context Learning (ICL) frameworks such as Tabular Prior-Fitted Networks (TabPFN) offer a promising diagnosis-by-reference paradigm, they are designed with tabular-specific inductive priors and rely on non-differentiable preprocessing pipelines, leading to manifold mismatch and gradient fracture when applied to heterogeneous multimodal data. To address these limitations, we propose PromptDx, a novel diagnosis-by-reference framework that leverages a pre-trained TabPFN as an ICL engine while enabling seamless integration with multimodal representations. Our core contribution is a Differentiable Prompt Tuning (DPT) mechanism that aligns a Masked Multimodal Modeling module with the pre-trained ICL engine. By training a lightweight adapter as a differentiable surrogate for the engine's non-differentiable preprocessors, we enable an end-to-end optimization of multimodal prompts within the ICL paradigm. We validate our method on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset using 3D MRI and tabular biomarkers. Experiments demonstrate that our approach outperforms traditional parametric baselines. Notably, our method achieves superior performance using only 1% context samples compared to 30% in standard ICL, demonstrating exceptional manifold condensation ability. We further validate the generalizability of our DPT framework across six tabular datasets with diverse scales. Overall, our method offers a more data-efficient and clinically aligned paradigm for Alzheimer's Disease diagnosis.

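2605.08577 2026-05-12 cs.CV cs.LG (updated version)

Improving Generative Adversarial Networks with Self-Distillation

Antoni Nowinowski, Krzysztof Krawiec

Affiliations: Poznan University of Technology

AI summary: This paper proposes Self-Distilled GAN (SD-GAN), which uses the exponential moving average generator as a teacher to guide the trained generator through a perceptual loss, improving image quality and stabilizing the optimization trajectory.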

Abstract

In modern GANs, maintaining an Exponential Moving Average (EMA) of the generator's weights is a standard practice, as such an averaged model consistently outperforms the actively trained generator. However, the EMA generator is used for final deployment only and does not influence the training process. To address this missed opportunity, we introduce Self-Distilled GAN (SD-GAN) that employs the EMA generator as a teacher to guide the active generator (student) via perceptual loss. We prove the local asymptotic stability of SD-GAN in the Dirac-GAN setting and show that it dampens the parasitic cycling behavior that plagues the conventional GANs. Empirical evaluations across established architectures and datasets demonstrate that SD-GAN improves the final image quality on several metrics (FID and random-FID in particular), stabilizes the optimization trajectory and provides additional learning guidance that is not trivially correlated with the conventional adversarial loss. It also proves effective for fine-tuning pretrained GAN models.
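
The teacher-student loop described in this abstract has two moving parts: the EMA update of the teacher's weights, and an extra loss term that pulls the student's output toward the teacher's output for the same latent input. The sketch below shows both on plain Python lists. A mean squared error stands in for the perceptual loss, and `weight`, `decay`, and the function names are hypothetical, since the abstract gives no values.

```python
def ema_update(teacher, student, decay=0.999):
    """In-place EMA of the student's weights: teacher <- decay*teacher + (1-decay)*student."""
    for i, w in enumerate(student):
        teacher[i] = decay * teacher[i] + (1.0 - decay) * w

def distillation_loss(student_out, teacher_out):
    """Mean squared distance between outputs; a stand-in for a perceptual loss."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)

def generator_loss(adv_loss, student_out, teacher_out, weight=1.0):
    """Adversarial term plus a weighted pull toward the EMA teacher's output."""
    return adv_loss + weight * distillation_loss(student_out, teacher_out)

teacher = [0.0, 0.0]
student = [1.0, -2.0]
for _ in range(3):
    ema_update(teacher, student, decay=0.5)
```

In a real training loop the teacher's output would be treated as a constant (no gradient flows into it), so only the student is updated by the combined loss while the teacher follows through the EMA.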

2605.08574 2026-05-12 cs.CV cs.LG 版本更新

Post-hoc Selective Classification for Reliable Synthetic Image Detection

事后选择性分类用于可靠合成图像检测

Kaixiang Zheng, Jacob H. Seidman

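Affiliations: University of Waterloo; Reality Defender

AI summary: This paper proposes the ReSIDe framework, which improves the selective classification strategy to make synthetic image detection more reliable under covariate shift; experiments show marked gains under common shifts.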

AI中文摘要

随着合成图像日益逼真，可靠检测技术至关重要以防止滥用。尽管深度神经网络基于的合成图像检测器（SIDs）在分布内表现良好，但在存在常见协变量偏移时可靠性不足，导致检测精度下降。为避免潜在错误风险，我们采用选择性分类（SC）策略，允许SIDs在低置信度预测时 abstain。为实用性，我们聚焦于事后方法，即在给定SID上进行置信度估计而无需重新训练。然而，我们发现传统基于logit的置信度评分函数（CSFs）在协变量偏移下表现出病理行为，导致SC性能接近或甚至劣于随机猜测。为此，我们提出一个简单而有效的SC框架用于可靠合成图像检测（ReSIDe）。首先，我们将logits的概念推广到SID的中间层，从质心匹配角度扩展logit-based CSFs的使用范围到SID的任意层。然后，我们引入一种偏好优化算法，通过最小化风险-覆盖曲线（AURC）上界来聚合不同层提取的置信度分数，得到最终的置信度估计。广泛实验结果表明，ReSIDe显著提升了各种logit-based CSFs在常见协变量偏移下的SC性能，实现了高达69.55%的AURC减少。

英文摘要

As synthetic images become increasingly realistic, reliable synthetic image detection techniques are of pressing need to prevent their misuse. Despite satisfactory in-distribution performance, deep neural network-based synthetic image detectors (SIDs) lack reliability in deployment and often fail in the presence of common covariate shifts, resulting in poor detection accuracy. To avoid the risk caused by potential errors, we adopt a selective classification (SC) strategy by allowing SIDs to abstain from making low confidence predictions. For practicality, we focus on post-hoc methods which perform confidence estimation on a given SID without retraining. However, we show that conventional logit-based confidence score functions (CSFs) exhibit pathological behavior under covariate shifts, leading to SC performance close to or even worse than random guessing. To address this, we propose a simple yet effective SC framework for Reliable Synthetic Image Detection (ReSIDe). First, we generalize the notion of logits to an SID's intermediate layers from a centroid matching perspective, extending the use of logit-based CSFs to any layer of an SID. Then, we introduce a preference optimization algorithm that aggregates confidence scores extracted from different layers to a final confidence estimate by minimizing an upper bound of the area under the risk-coverage curve (AURC). Extensive experimental results show that ReSIDe significantly boosts the SC performance of various logit-based CSFs under common covariate shifts, achieving up to 69.55% AURC reduction.
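
The AURC metric mentioned in this abstract can be sketched as follows. This is a plain empirical estimate that assumes each prediction comes with a confidence score and a correctness flag; it does not reproduce the paper's upper bound or its layer-aggregation algorithm, and all names and toy values are hypothetical.

```python
def risk_coverage_curve(confidence, correct):
    """Return (coverage, risk) pairs, accepting samples from most to least confident."""
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    curve, errors = [], 0
    for accepted, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        curve.append((accepted / len(order), errors / accepted))
    return curve


def aurc(confidence, correct):
    """Empirical AURC: selective risk averaged over all coverage levels."""
    curve = risk_coverage_curve(confidence, correct)
    return sum(risk for _, risk in curve) / len(curve)


scores = [0.9, 0.8, 0.6, 0.4]
good = aurc(scores, [1, 1, 1, 0])  # the single error has the lowest confidence
bad = aurc(scores, [0, 1, 1, 1])   # the single error has the highest confidence
```

Both toy detectors have the same accuracy, but the one whose error carries the lowest confidence gets the smaller AURC, which is the behavior a confidence score function is meant to deliver.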

2605.08572 2026-05-12 cs.CV 版本更新

Enhancing Consistency Models for Multi-Agent Trajectory Prediction

增强多智能体轨迹预测的一致性模型

Alen Mrdovic, Qingze Liu, Danrui Li, Mathew Schwartz, Kaidong Hu, Sejong Yoon, Mubbasir Kapadia, Vladimir Pavlovic

Affiliations: Rutgers University - New Brunswick; New Jersey Institute of Technology; The College of New Jersey

AI summary: This paper proposes ECTraj, which improves training and conditional generation to boost consistency models for multi-agent trajectory prediction, giving faster inference and more accurate predictions and setting new benchmarks.

AI中文摘要

多智能体轨迹预测的扩散模型受限于迭代去噪，导致推理延迟，阻碍其在自动驾驶等时间敏感场景中的应用。快速采样变种使用DDIM和知情初始噪声分布部分缓解此问题，但要么无法实现真正的单步生成，要么受所选噪声分布限制。一致性模型（CMs）通过将噪声直接映射到数据实现高质量单步生成，但难以从头训练。我们提出ECTraj，一种增强的CM流程，具有改进的训练和条件生成能力。我们的框架扩展了学生-教师一致性训练方案：学生生成标准输出，而教师显式融合其预测与部分真实数据以提供更强的监督。我们还利用CMs的直接去噪能力进行训练中的Top-K多步生成。结合条件生成与增强的一致性目标，实现了更快的推理和改进的预测准确性，在大规模Argoverse 2数据集上建立了具有竞争力的新基准。

英文摘要

Diffusion models for multi-agent trajectory prediction are limited by iterative denoising, which causes inference latency that hinders their use in time-critical settings like autonomous driving. Fast-sampling variants using DDIM and informed initial noise distributions partially alleviate this issue, but they either fail to achieve true single-step generation or are constrained by the chosen noise distribution. Consistency Models (CMs) offer high-quality one-step generation by mapping noise directly to data, but are difficult to train from scratch. We propose ECTraj, an enhanced CM pipeline with improved training and conditional generation for trajectory prediction. Our framework extends the student-teacher consistency training scheme: the student produces standard outputs, while the teacher explicitly fuses its predictions with parts of the ground truth to give stronger supervision. We also exploit CMs' direct denoising for top-K multi-shot generation during training. Combining conditional generation with this enhanced consistency objective yields faster inference and improved prediction accuracy, establishing competitive new benchmarks on the large-scale Argoverse 2 dataset.
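
One common way to use several one-step samples during training is to score each candidate against the ground truth and keep the best one. The sketch below illustrates that idea with average displacement error; it is an assumption about what "top-K multi-shot generation" could look like, not the paper's objective, and every name and value in it is hypothetical.

```python
import math


def average_displacement_error(pred, target):
    """Mean Euclidean distance between predicted and true waypoints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def best_of_k(samples, target):
    """Return (index, error) of the sampled trajectory closest to the target."""
    errors = [average_displacement_error(s, target) for s in samples]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


def shifted(trajectory, dx, dy):
    """Toy 'sample': the target trajectory offset by a constant vector."""
    return [(x + dx, y + dy) for x, y in trajectory]


target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [shifted(target, 0.0, 1.0), shifted(target, 0.0, 0.1), shifted(target, 3.0, 4.0)]
best, err = best_of_k(samples, target)
```

Because a consistency model produces each sample in a single step, drawing K candidates per training example stays cheap compared with an iterative diffusion sampler.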

2605.08566 2026-05-12 cs.CV cs.LG q-bio.QM 版本更新

MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration

MicroDiffuse3D：一种用于3D显微成像修复的预训练基础模型

Yongkang Li, Brian Wong, King Wai Chiu, Hanwen Xu, Tangqi Fang, Erin Dunnington, Dan Fu, Sheng Wang

Affiliations: Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA; Department of Chemistry, University of Washington, Seattle, WA, USA

AI summary: This paper proposes MicroDiffuse3D, a pretrained model for 3D microscopy image restoration that uses high-throughput data to improve the resolution and signal-to-noise ratio of 3D chemical imaging, enabling high-quality reconstruction of volumetric structures.

基于毫米波的双阶段人体网格恢复框架

Hoang Hai Pham, Shuntian Zheng, Jiaqi Li, Yu Guan

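Affiliations: Department of Computer Science, University of Warwick, Coventry, UK

AI summary: This paper proposes a two-stage framework that uses a radar reflection extraction module and a motion-aware mesh recovery network to improve the accuracy and efficiency of 3D human mesh recovery from mmWave radar.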

AI中文摘要

毫米波雷达因其在恶劣环境下的鲁棒性和隐私保护特性，成为人体感知的有前景的传感模态。然而，从雷达观测中恢复准确的3D人体网格仍然困难，由于严重的信号杂波和雷达测量的固有不完整性。先前的工作通常采用端到端框架，直接从原始雷达数据回归人体参数，而未解耦信号解释与几何推理或利用时间运动线索，限制了学习性能。为此，我们提出了一种针对雷达的人体重建双阶段框架。首先，我们引入了人体反射提取模块，通过粗到细定位和体素级分割，生成带有置信度加权的雷达体积编码体素级人体可能性。其次，我们设计了运动感知的网格恢复网络，通过双分支架构联合建模每帧几何和跨帧动态，重建人体。大量实验表明，所提方法在保持计算效率的同时优于现有方法。

英文摘要

Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.

2605.08521 2026-05-12 cs.CV cs.LG 版本更新

Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models

几何洪水深度估计：融合基于变换器的分割与数字高程模型

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

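Affiliations: Lehigh University

AI summary: This paper proposes a geometric "water surface elevation" method that fuses a Transformer-based segmentation model with a digital elevation model to estimate flood depth from monocular aerial imagery, improving assessment of flood extent and volume.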

Comments Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

2605.08517 2026-05-12 cs.LG cs.CV physics.med-ph 版本更新

A Deep Risk Estimator for Known Operator Learning

已知算子学习的深度风险估计器

Andreas Maier, Md Hasan, Paulina Conrad, Paula Andrea Perez-Toro

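Affiliations: Pattern Recognition Lab; Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: This paper proposes a deep risk estimator for the statistical risk of deep networks that mix learned and known operators. It decomposes the total risk into per-layer contributions, shows that replacing a learned layer with a known operator lowers the bound, and applies the result to CT reconstruction and physics-informed neural networks.

Comments In Review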

AI中文摘要

我们描述了一种估计包含学习和已知算子的深度网络统计风险的方法。基于之前为已知算子学习建立的最大训练误差界，我们推导出一个深度风险估计器，将分层网络的预期误差与训练样本大小联系起来。该估计器将总风险分解为各学习层的和；每个已知算子对此和贡献零，而每个学习层添加一个受Barron经典工作启发的近似项和随训练样本数减少的估计项。我们证明当学习层被已知算子替换时，界会缩小，且对应的样本需求与被替换层的可训练参数数成比例。作为应用，我们以计算机断层扫描为例，比较了具有算子意识的滤波反投影网络与完全连接的替代网络，后者将整个重建流程压缩为单一学习密集矩阵。预测的参数比与分析分解为循环滤波和稀疏反投影所暴露的结构稀疏性一致。我们在小图像规模上验证了预测的缩放性，且在中等图像规模上使用GPU验证，均遵循相同的缩放规律。除了CT重建外，该估计器还适用于将已知物理操作硬编码到其架构中的物理引导神经网络，我们预计该结果对从事算子意识深度学习的广泛社区有参考价值。在每次扫掠中校准每层常数，可使界跟踪经验测试均方误差，误差因子在2以内，因此该估计器可逆用于预测达到目标误差所需的训练样本数。

英文摘要

We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We are able to show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, all on the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in its architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.

2605.08493 2026-05-12 cs.CV 版本更新

CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis

CapCLIP：一种用于无线胶囊内镜分析的视觉-语言表示对齐方法

Haroon Wahab, Irfan Mehmood, Hassan Ugail

Affiliations: School of Computer Science, AI and Electronics Faculty of Engineering and Digital Technologies; School of Management Faculty of Mgmt, Law & Social Sciences; Centre for Visual Computing and Intelligent Systems

AI summary: CapCLIP aligns endoscopy images with text descriptions generated from standardized clinical terminology, improving the generalization and semantic interpretability of wireless capsule endoscopy analysis and outperforming existing methods.

AI中文摘要

无线胶囊内镜（WCE）可非侵入性评估小肠，但其临床应用受限于每检查生成的大量帧和在高度变化的成像条件下识别细微异常的困难。现有基于学习的方法多为视觉单一，局限于狭窄病病理集，且在数据集和中心间转移有限。为此，本研究提出CapCLIP，一种针对WCE的视觉-语言表示学习框架。CapCLIP通过将内镜图像与基于标准化命名法和病理意识的文本描述对齐，从而学习出语义丰富且可转移的嵌入。该框架在严格零样本条件下，使用未见过的WCE数据集评估，与相关开源视觉和视觉-语言基础模型进行比较。评估涵盖三个下游任务：K最近邻分类、CLIP风格图像-文本分类和文本到图像检索。在这些设置中，CapCLIP在零样本图像-文本分类和跨模态检索上表现尤为突出。结果表明，语言引导的表示学习可提升WCE分析的泛化能力和语义可解释性。这些发现将CapCLIP定位为定制化胶囊内镜基础模型的一步，并支持语言引导的WCE分析。

英文摘要

Wireless capsule endoscopy (WCE) enables non-invasive visual assessment of the small bowel, but its clinical utility is constrained by the large volume of frames generated per examination and the difficulty of recognising subtle abnormalities under highly variable imaging conditions. Existing learning-based approaches for WCE are predominantly vision-only, often confined to narrow pathology sets, and show limited transfer across datasets and centres. To address these limitations, this study introduces CapCLIP, a domain-specific vision-language representation learning framework for WCE. CapCLIP aligns capsule endoscopy frames with clinically grounded textual descriptions derived from standardised nomenclature and pathology-aware caption templates, thereby learning embeddings that are both semantically informed and transferable. The proposed framework is evaluated against relevant open-source vision and vision-language foundation models under strict zero-shot conditions using unseen WCE datasets. Evaluation covers three downstream tasks: K-nearest neighbour classification, CLIP-style image-text classification, and text-to-image retrieval. Across these settings, CapCLIP consistently outperforms the compared baselines, with particularly strong gains in zero-shot image-text classification and cross-modal retrieval on out-of-distribution datasets. The results indicate that language-guided representation learning can improve both generalisation and semantic interpretability in WCE analysis. These findings position CapCLIP as a step toward foundation models tailored to capsule endoscopy and support the use of language-grounded WCE analysis.
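
CLIP-style image-text classification, one of the three evaluation tasks named in this abstract, reduces to comparing one image embedding with one text embedding per class. The sketch below assumes the embeddings are already computed; the two-dimensional toy vectors, the temperature of 0.07, and the function names are hypothetical and do not come from CapCLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def zero_shot_classify(image_emb, text_embs, temperature=0.07):
    """Return (predicted class index, softmax probabilities over class prompts)."""
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


# One toy text embedding per class prompt, e.g. one caption per pathology.
text_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
pred, probs = zero_shot_classify([0.2, 0.9], text_embs)
```

No classifier head is trained here: adding a class only requires embedding one more caption, which is why this setup suits strict zero-shot evaluation on unseen datasets.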

2605.08452 2026-05-12 cs.CV 版本更新

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

NICE FACT: 诊断和校准VLMs在运动物理定量推理中的能力

Jian Lan, Zhicheng Liu, Xinpeng Wang, Yuhao Zhou, Haokun Chen, Jiancheng Lv, Barbara Plank, Thomas Seidl

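Affiliations: University of Munich (LMU); Munich Center of Machine Learning; Sichuan University

AI summary: This paper studies how VLMs perform in quantitative reasoning for kinematic physics. It proposes NICE and FACT, a dual-diagnostic framework covering visual fidelity, physical law comprehension, and temporal grounding, and reveals that models fall short in identifying visual preconditions and applying physical laws.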

AI中文摘要

推导精确空间和物理洞察力是视觉语言模型（VLMs）的核心能力，但其在相关空间智能任务如物理推理中的表现较差仍是一个根本障碍。本文旨在深入理解VLMs如何感知物理世界并利用物理定律，同时评估模型置信度的可靠性。我们提出NICE和FACT双诊断范式，明确分解运动物理的定量推理：FACT诊断视觉真实性、物理定律理解和时间定位。NICE研究我们的新型邻域感知校准方法和新指标以评估和校准置信度可靠性。在6种最新最先进的VLMs上评估后，发现模型无法识别视觉前提或利用必要的物理定律以得出答案。本文强调并建立了标准化诊断范式，以指导开发忠实且物理基础的VLMs。

英文摘要

The ability to derive precise spatial and physical insights is a cornerstone of vision-language models (VLMs), yet their poor performances in related spatial intelligence tasks such as physical reasoning remain a fundamental barrier. The community critically lacks a scientific analysis revealing whether VLMs faithfully reach answers or plausibly make guesses. This work aims to provide a fundamental understanding of how VLMs perceive the physical world, and utilize physical laws, while assessing the reliability of model confidence. We propose NICE and FACT, a dual-diagnostic paradigm that explicitly decomposes quantitative reasoning for kinematic physics: FACT diagnoses visual fidelity, physical law comprehension, and temporal grounding. NICE studies our novel neighborhood-informed calibration method and novel metrics to evaluate and calibrate confidence reliability. Evaluated across 6 latest state-of-the-art VLMs, we uncover that models fail to identify visual preconditions or utilize necessary physical laws to reach answers. This work highlights and establishes a standardized diagnostic paradigm to guide the development of faithful, physically-grounded VLMs.
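
Confidence reliability is usually quantified by comparing stated confidence with observed accuracy. The sketch below computes the standard expected calibration error (ECE) as a generic illustration; it is not the neighborhood-informed method or the new metrics proposed in the paper, and the bin count and toy values are assumptions.

```python
def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidence, correct):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, 1.0 if ok else 0.0))
    total = len(confidence)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


# Two answers at 95% confidence (both right), two at 55% (one right).
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
```

A value near zero means the reported confidence matches how often the model is right; a model that guesses plausibly but states high confidence produces a large gap.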

2605.06356 2026-05-12 cs.CV 版本更新

SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

SwiftI2V: 通过条件分段生成实现高效的高分辨率图像到视频生成

YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu, Long Chen

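Affiliations: HKUST; CUHK; Joy Future Academy; HKU; HUAWEI Research

AI summary: This paper proposes SwiftI2V, a framework that uses segment-wise generation and bidirectional contextual interaction for efficient image-to-video generation at 2K resolution, cutting GPU time by 202x compared with end-to-end models.

Comments 27 pages, 17 figures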

AI中文摘要

高分辨率图像到视频（I2V）生成旨在合成逼真的时间动态同时保持输入图像的细粒度外观细节。在2K分辨率下，这变得极其具有挑战性，现有解决方案存在多种缺陷：1）端到端模型通常在内存和延迟上成本过高；2）级联低分辨率生成与通用视频超分辨率相结合，容易产生细节幻觉并偏离输入特定的局部结构，因为超分辨率阶段未明确针对输入图像进行条件处理。为此，我们提出了SwiftI2V，一个专为高分辨率I2V设计的高效框架。遵循广泛使用的两阶段设计，它通过首先生成低分辨率运动参考以减少令牌成本并减轻建模负担，然后执行强图像条件的2K合成，以恢复输入忠实的细节，同时受控开销。具体来说，为了使生成更具可扩展性，SwiftI2V引入了条件分段生成（CSG）来分段合成视频，每个步骤有受限制的令牌预算，并在每个分段内采用双向上下文交互以提高跨分段的一致性和输入保真度。在VBench-I2V上，SwiftI2V在2K分辨率下实现了与端到端基线相当的性能，同时将总GPU时间减少了202倍。特别是，它使在单个数据中心GPU（如H800）或消费级GPU（如RTX 4090）上实现实用的2K I2V生成成为可能。

英文摘要

High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency--fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
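
The idea of a bounded per-step token budget can be illustrated by planning how frames are grouped into segments. This is a minimal sketch that assumes every frame costs the same number of tokens; the function name and the numbers are hypothetical, and the conditioning between segments that the abstract describes is not modeled.

```python
def plan_segments(num_frames, tokens_per_frame, token_budget):
    """Split frames into contiguous [start, end) segments that fit the budget."""
    if tokens_per_frame > token_budget:
        raise ValueError("a single frame already exceeds the token budget")
    frames_per_segment = token_budget // tokens_per_frame
    return [
        (start, min(start + frames_per_segment, num_frames))
        for start in range(0, num_frames, frames_per_segment)
    ]


# 10 frames at 300 tokens each, with at most 1000 tokens processed per step.
segments = plan_segments(num_frames=10, tokens_per_frame=300, token_budget=1000)
```

Each generation step then attends over at most three frames' worth of tokens, so peak cost stays fixed as the video gets longer instead of growing with the full sequence.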

2605.03456 2026-05-12 cs.CV 版本更新

VL-SAM-v3: Memory-Guided Visual Priors for Open-World Object Detection

VL-SAM-v3：基于记忆的开放世界目标检测视觉先验

Chih-Chung Liu, Zhiwei Lin, Yongtao Wang

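Affiliations: Wangxuan Institute of Computer Technology, Peking University, China

AI summary: VL-SAM-v3 introduces a retrieval-grounded external visual memory to improve open-world object detection, with especially strong results on rare categories and complex scenes.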

AI中文摘要

开放世界目标检测旨在定位和识别超出固定封闭集标签空间的对象。它通常分为两类：开放词汇检测，假设测试时有预定义类别列表；以及开放端检测，需在推理过程中生成候选类别。现有方法主要依赖粗粒度文本语义和参数化知识，常无法提供足够的视觉证据以处理细粒度外观变化、稀有类别和杂乱场景。本文提出VL-SAM-v3，一个统一框架，通过检索驱动的外部视觉记忆增强开放世界检测。具体而言，一旦候选类别可用，VL-SAM-v3从非参数化记忆库中检索相关视觉原型，并将其转换为两种互补的视觉先验：即稀疏先验用于实例级空间锚定，密集先验用于类别感知的局部上下文。这些先验通过记忆引导的提示细化与原始检测提示结合，实现共享的检索与细化机制，支持开放词汇和开放端推理。在LVIS上的广泛零样本实验表明，VL-SAM-v3在开放词汇和开放端推理中均能显著提升检测性能，尤其在稀有类别上表现突出。此外，与更强的开放词汇检测器（即SAM3）的实验验证了所提检索与细化机制的通用性。

英文摘要

Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two categories, i.e., open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during the inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors, i.e., sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.

2605.03276 2026-05-12 cs.CV 版本更新

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

VEBench: 多模态大模型在真实世界视频编辑中的基准测试

Andong Deng, Dawei Du, Zhenfang Chen, Wen Zhong, Fan Chen, Guang Chen, Chia-Wen Kuo, Longyin Wen, Chen Chen, Sijie Zhu

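Affiliations: ByteDance Intelligent Creation; CRCV, University of Central Florida

AI summary: VEBench uses high-quality videos and human-verified questions to evaluate large multimodal models on editing knowledge understanding and operational reasoning in video editing, revealing a significant gap between current models and human-level performance.

Comments CVPR Findings 2026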

AI中文摘要

真实世界视频编辑不仅需要电影技巧的专家知识，还需要多模态推理来选择、对齐和组合片段形成连贯的叙事。尽管最近的大型多模态模型（LMMs）在一般视频理解上取得了显著进展，但其在多视频推理和操作编辑流程中的能力仍 largely 未被探索。我们引入 VEBENCH，第一个全面的基准，旨在评估真实视频编辑场景中的编辑知识理解和操作推理能力。VEBENCH 包含 3,900 个高质量编辑视频（超过 257 小时）和 3,080 个人验证的 QA 对，通过三轮人机协作标注流程构建，确保精确的时间标记和语义一致性。它包含两个互补的 QA 任务：1）视频编辑技术识别，评估模型识别 7 种编辑技术的能力；2）视频编辑操作模拟，通过要求从多个候选片段中选择并定位时间片段来建模现实编辑流程。广泛的实验显示，当前模型性能与人类水平之间存在显著差距。这些结果突显了将视频理解与创造性操作推理结合的紧迫需求。我们设想 VEBENCH 作为推进智能视频编辑系统和未来复杂推理研究的基础。

英文摘要

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

2605.02743 2026-05-12 cs.AI cs.CV cs.HC 版本更新

Triple Spectral Fusion for Sensor-based Human Activity Recognition

三频谱融合用于基于传感器的人体活动识别

Ye Zhang, Longguang Wang, Qing Gao, Chaocan Xiang, Mohammed Bennamoun, Yulan Guo

Affiliations: School of Electronics and Communication Engineering, the Shenzhen Campus of Sun Yat-sen University; Aviation University of Air Force; College of Computer Science, Chongqing University; Department of Computer Science and Software Engineering, the University of Western Australia

AI summary: This paper proposes a triple spectral fusion framework that uses adaptive filtering and the graph Fourier transform to improve sensor data fusion and long-term context correlation; experiments show strong performance on multiple benchmark datasets.

DOI: 10.1109/TPAMI.2026.3690949

英文摘要

The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective due to the complexities in fusing heterogeneous sensor data and establishing long-term context correlations. This paper proposes a novel triple spectral fusion framework tailored for HAR. First, we develop an adaptive complementary filtering technique for noise suppression and organize each IMU's sensors into posture and motion modality nodes. Given that IMU nodes form a dynamic heterogeneous graph, we then apply adaptive filtering within the graph Fourier domain to merge both homogeneous and heterogeneous node information. Furthermore, an adaptive wavelet frequency selection approach is implemented to suppress context redundancy and shorten the length of features. This approach enhances both timestamp-based graph aggregation and the correlation of long-term contexts. Our framework uses adaptive filtering in the Fourier, graph Fourier, and wavelet domains, enabling effective multi-sensor fusion and context correlation. Extensive experiments on ten benchmark datasets demonstrate the superior performance of our framework. Project page: https://github.com/crocodilegogogo/TSF-TPAMI2026.
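
The noise-suppression step builds on complementary filtering, which blends a fast but drifting gyroscope estimate with a slow but drift-free accelerometer estimate. The sketch below shows the classic fixed-gain version for a single angle; the paper describes an adaptive variant, so the constant `alpha`, the function name, and the toy signals are assumptions.

```python
def complementary_filter(gyro_rates, accel_angles, dt, alpha=0.98, initial_angle=0.0):
    """Fuse gyroscope rates and accelerometer angles into one angle estimate."""
    angle = initial_angle
    fused = []
    for rate, accel_angle in zip(gyro_rates, accel_angles):
        predicted = angle + rate * dt  # integrate the gyroscope (high-pass path)
        angle = alpha * predicted + (1.0 - alpha) * accel_angle  # low-pass correction
        fused.append(angle)
    return fused


# A stationary sensor tilted at 10 degrees: the gyroscope reads zero rate while
# the accelerometer reports the tilt, so the estimate converges toward 10.
fused = complementary_filter([0.0] * 200, [10.0] * 200, dt=0.01, alpha=0.9)
```

A larger `alpha` trusts the gyroscope more and reacts faster to motion, while a smaller one pulls harder toward the accelerometer; an adaptive scheme would adjust that balance over time.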

2605.02169 2026-05-12 cs.CV cs.DC cs.LG 版本更新

Heterogeneous Model Fusion for Privacy-Aware Multi-Camera Surveillance via Synthetic Domain Adaptation

异构模型融合用于隐私保护多摄像头监控的合成领域适应

Peggy Joy Lu, Wei-Yu Chen, Yao-Tsung Huang, Vincent Shin-Mu Tseng

Affiliations: Department of Computer Science and Information Engineering, National Chung Cheng University; Department of Computer Science, National Yang Ming Chiao Tung University; National Center for High-Performance Computing

AI summary: This paper proposes HeroCrystal, a framework that uses synthetic domain adaptation to address privacy, class imbalance, and heterogeneous architectures in multi-camera object detection, improving localization accuracy and enabling model fusion. Experiments show it outperforms existing methods under multi-class, privacy-preserving settings, improving mAP by 2.1% to reach 33.4%.

Comments 42 pages, 13 figures. Published in Information Fusion (Elsevier). DOI: 10.1016/j.inffus.2026.104413

Journal ref Information Fusion, 2026

DOI: 10.1016/j.inffus.2026.104413

AI中文摘要

我们提出HeroCrystal，一种新型隐私保护多摄像头领域自适应目标检测框架，解决数据隐私、类别不平衡和异构架构等挑战。框架包含三个关键阶段：生成阶段引入单次目标感知扩散生成模块，通过提示控制合成特定目标实例；联邦阶段采用概率Faster R-CNN提升定位精度，动态模型对比策略抑制领域偏见；服务器端在不访问原始数据的情况下融合异构架构模型。最后提出不一致类别整合算法解决标签不一致和架构异质性问题。在多个跨领域检测基准上的实验表明，该方法在多类别隐私保护设置下优于现有多源领域适应和联邦学习基线，mAP比现有隐私保护方法提升2.1%，达到33.4%的新SOTA，证明HeroCrystal在实现实用多摄像头AI监控系统中的有效性。源代码可在https://github.com/ccuvislab/HeroCrystal公开获取。

英文摘要

We propose HeroCrystal, a novel privacy-preserving framework for multi-camera domain-adaptive object detection, addressing challenges such as data privacy, class imbalance, and heterogeneous architectures. Our framework consists of three key stages. In the Generated Stage, we introduce a one-shot, target-aware diffusion-based generation module that learns visual style from a single target-domain image while leveraging prompt-based control to synthesize specific object instances. Unlike conventional style transfer-based methods that require large target datasets and ignore semantic-level discrepancies, our approach enables privacy-preserving augmentation to reduce ethical concerns, and introduces controllable rare object generation to mitigate long-tailed category degradation. In the Federated Stage, we employ probabilistic Faster R-CNN on the client side to improve localization accuracy, and a dynamic model contrastive strategy to suppress domain-specific bias. The server side performs model fusion across heterogeneous architectures without accessing raw data. Finally, in the Distilled Stage, we propose an inconsistent categories integration algorithm to resolve label inconsistency and architecture heterogeneity across clients. Extensive experiments on multiple cross-domain detection benchmarks demonstrate that our method outperforms existing multi-source domain adaptation and federated learning baselines under multi-class, privacy-preserving settings. Our method improves mAP by +2.1% over prior privacy-preserving approaches and achieves a new state-of-the-art mAP of 33.4%, highlighting the effectiveness of HeroCrystal in enabling practical multi-camera AI surveillance systems. The source code is publicly available at https://github.com/ccuvislab/HeroCrystal.

2605.01345 2026-05-12 cs.CV cs.AI cs.LG 版本更新

The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design

视觉语言模型中的感知带宽瓶颈：通过顺序实验设计实现主动视觉推理

Anjie Liu, Ziqin Gong, Yan Song, Yuxiang Chen, Xiaolong Liu, Hengtong Lu, Kaike Zhang, Chen Wei, Jun Wang

Affiliations: The Hong Kong University of Science; University College London, London, United Kingdom; ShanghaiTech University, Shanghai, China; AI Lab, The Yangtze River Delta, China

AI summary: This paper proposes active visual reasoning through sequential Bayesian optimal experimental design to address the perceptual bandwidth bottleneck in vision-language models and improve high-resolution visual reasoning.

Comments 27 pages, 5 figures, accepted at ICML 2026

AI中文摘要

现代视觉语言模型（VLMs）的视觉感知受到感知带宽瓶颈的限制：广视野保留了全局上下文但牺牲了复杂推理所需的细粒度细节。我们主张高分辨率视觉推理不仅是语义推理，也是在有限感知带宽下获取任务相关的证据。受主动视觉和信息觅食启发，我们将这一过程形式化为顺序贝叶斯最优实验设计（S-BOED），其中智能体在回答前决定获取哪些视觉证据。由于在连续十亿像素空间中精确的贝叶斯推断不可行，我们推导出一个可计算的覆盖-分辨率目标作为任务相关信息增益的代理。我们通过FOVEA，一种无需训练的程序，通过证据导向的探测来优化VLM的裁剪提案。在高分辨率基准测试中，实验结果在直接和ReAct风格基线中表现出一致的提升，特别是在以搜索为主的遥感设置中表现尤为突出。

英文摘要

Visual perception in modern Vision-Language Models (VLMs) is constrained by a perceptual bandwidth bottleneck: a broad field of view preserves global context but sacrifices the fine-grained details required for complex reasoning. We argue that high-resolution visual reasoning is therefore not only semantic reasoning but also task-relevant evidence acquisition under limited perceptual bandwidth. Inspired by active vision and information foraging, we formalise this process as sequential Bayesian optimal experimental design (S-BOED), where an agent decides which visual evidence to acquire before answering. Since exact Bayesian inference is intractable in continuous gigapixel spaces, we derive a tractable coverage--resolution objective as a proxy for task-relevant information gain. We instantiate this framework with FOVEA, a training-free procedure that refines VLM crop proposals through evidence-oriented probing. Experiments on high-resolution benchmarks show consistent gains over direct and ReAct-style baselines, with particularly strong improvements in search-dominated remote-sensing settings.

2605.00884 2026-05-12 cs.CV 版本更新

LiteVLA-H: Dual-Rate Vision-Language-Action Inference for Onboard Aerial Guidance and Semantic Perception

LiteVLA-H: 双速率视觉-语言-动作推断用于机载空中引导与语义感知

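Justin Williams, Kishor Datta Gupta, Roy George, Mrinmoy Sarkar

Affiliations: Department of Cyber Physical Systems, Clark Atlanta University

AI summary: LiteVLA-H uses dual-rate operation for efficient vision-language-action inference on edge devices, balancing fast action output with semantic understanding to improve aerial guidance and scene perception.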

AI中文摘要

视觉-语言-动作（VLA）模型在 manipulation 中表现出强大的语义接地和任务泛化能力，但空中部署仍面临挑战，因为无人机需要在严格的机载计算和通信限制下实现低延迟闭环引导。我们提出了LiteVLA-H，一个紧凑的256M参数VLA系统，专为在NVIDIA Jetson AGX Orin上双速率操作而设计：一种快速的外环引导模式用于短动作令牌输出，一种较慢的语义模式用于场景理解、危险描述和操作员面向的叙述。核心经验观察是，在这种紧凑的边缘环境中，端到端延迟主要由多模态预填充而非解码少量额外令牌的边际成本主导。这促使一种调度器，在同一嵌入式平台上，以50.65毫秒（19.74Hz）的速率发出反应性动作令牌，同时仍支持句子级别的语义输出，速度为149.90-164.57毫秒（6.08-6.67Hz）。为了在不降低描述能力的情况下专门化模型，我们使用一种知识保持的微调配方，混合了反应性飞行数据、空中语义数据和通用标题/VQA监督。除了报告当前的延迟测量外，我们还将该系统与最近的最先进架构（AnywhereVLA、FutureVLA和ReMem-VLA）进行对比，显示在我们的部署条件下，所测动作分支在边缘推理速率上更高，同时保留了周期性语义意识。

英文摘要

Vision-language-action (VLA) models have shown strong semantic grounding and task generalization in manipulation, but aerial deployment remains difficult because drones require low-latency closed-loop guidance under strict onboard compute and communication constraints. We present LiteVLA-H, a compact 256M-parameter VLA system designed for dual-rate operation on an NVIDIA Jetson AGX Orin: a fast outer-loop guidance mode for short action-token outputs and a slower semantic mode for scene understanding, hazard description, and operator-facing narration. The central empirical observation is that, in this compact edge regime, end-to-end latency is dominated by multimodal pre-fill rather than by the marginal cost of decoding a few extra tokens. This motivates a scheduler that issues reactive action tokens at 50.65 ms (19.74 Hz) while still supporting sentence-level semantic outputs at 149.90 to 164.57 ms (6.08 to 6.67 Hz) on the same embedded platform. To specialize the model without collapsing its descriptive competence, we use a knowledge-preserving fine-tuning recipe that mixes reactive flight data, aerial semantic data, and generic caption/VQA supervision. Beyond reporting current latency measurements, we position the system against recent state-of-the-art architectures, including AnywhereVLA, FutureVLA, and ReMem-VLA, showing that the measured action branch reaches a higher edge inference rate under our deployment conditions while retaining periodic semantic awareness.
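
The latency and rate figures quoted in this abstract are related by rate = 1000 / latency in milliseconds. The short check below reproduces the reported rates from the reported latencies; the helper name is hypothetical and the script only restates numbers already given in the abstract.

```python
def rate_hz(latency_ms):
    """Convert a per-output latency in milliseconds to an output rate in hertz."""
    return 1000.0 / latency_ms


action_hz = rate_hz(50.65)          # fast guidance mode
semantic_slow_hz = rate_hz(164.57)  # slowest reported semantic output
semantic_fast_hz = rate_hz(149.90)  # fastest reported semantic output

print(f"action: {action_hz:.2f} Hz")
print(f"semantic: {semantic_slow_hz:.2f} to {semantic_fast_hz:.2f} Hz")
```

The action branch therefore runs roughly three times as often as the semantic branch, which is the gap a dual-rate scheduler has to manage on a single device.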

2604.23789 2026-05-12 cs.CV Version update

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Haojie Zhang, Di Wu, Bingyan Liu, Linjie Zhong, Yuancheng Wei, Xingsong Ye, Nanqing Liu, Yaling Liang

Affiliations: South China University of Technology; Fudan University; Yunnan Normal University

AI summary: This paper presents the MuSS dataset and the Cinematic Narrative Benchmark, addressing narrative logic, spatiotemporal alignment, and copy-paste problems in multi-shot video generation, and improves generation quality through an improved captioning pipeline and an anti-copy-paste metric.

Comments 17 pages, 9 figures

Abstract

While video foundation models excel at single-shot generation, real-world cinematic storytelling inherently relies on complex multi-shot sequencing. Further progress is constrained by the absence of datasets that address three core challenges: authentic narrative logic, spatiotemporal text-video alignment conflicts, and the "copy-paste" dilemma prevalent in Subject-to-Video (S2V) generation. To bridge this gap, we introduce MuSS, a large-scale, dual-track dataset tailored for multi-shot video and S2V generation. Sourced from over 3,000 movies, MuSS explicitly supports both complex montage transitions and subject-centric narratives. To construct this dataset, we pioneer a progressive captioning pipeline that eliminates contextual conflicts by ensuring local shot-level accuracy before enforcing global narrative coherence. Crucially, we implement a cross-shot matching mechanism to fundamentally eradicate the S2V copy-paste shortcut. Alongside the dataset, we propose the Cinematic Narrative Benchmark, featuring a visual-logic-driven paradigm and a novel Anti-Copy-Paste Variance (ACP-Var) metric to rigorously assess continuous storytelling and 3D structural consistency. Extensive experiments demonstrate that while current baselines struggle with continuous narrative logic or degenerate into trivial 2D sticker generators, our MuSS-augmented model achieves state-of-the-art narrative effectiveness and cross-shot identity preservation.

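2604.22942 2026-05-12 cs.CV cs.AI cs.LG Version update

ADP-FL-MedSeg: Adaptive Differential Privacy for Federated Medical Segmentation Across Diverse Modalities

Puja Saha, Eranga Ukwatta

Affiliations: College of Engineering, University of Guelph

AI summary: This paper proposes the ADP-FL framework, which dynamically adjusts privacy mechanisms to balance privacy and utility in federated learning, improving the accuracy and stability of medical image segmentation.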

Comments 10 pages, 8 figures. Accepted in SPIE Medical Imaging 2026. Recipient of CAD Best Paper Award: 1st Place, and Robert F. Wagner All-Conference Best Paper Award: Finalist

Journal ref Proceedings Volume 13926, SPIE Medical Imaging 2026: Computer-Aided Diagnosis

DOI: 10.1117/12.3075111

Abstract

Large volumes of medical data remain underutilized because centralizing distributed data is often infeasible due to strict privacy regulations and institutional constraints. In addition, models trained in centralized settings frequently fail to generalize across clinical sites because of heterogeneity in imaging protocols and continuously evolving data distributions arising from differences in scanners, acquisition parameters, and patient populations. Federated learning offers a promising solution by enabling collaborative model training without sharing raw data. However, incorporating differential privacy into federated learning, while essential for privacy guarantees, often leads to degraded accuracy, unstable convergence, and reduced generalization. In this work, we propose an adaptive differentially private federated learning (ADP-FL) framework for medical image segmentation that dynamically adjusts privacy mechanisms to better balance the privacy-utility trade-off. The proposed approach stabilizes training, significantly improves Dice scores and segmentation boundary quality, and maintains rigorous privacy guarantees. We evaluated ADP-FL across diverse imaging modalities and segmentation tasks, including skin lesion segmentation in dermoscopic images, kidney tumor segmentation in 3D CT scans, and brain tumor segmentation in multi-parametric MRI. Compared with conventional federated learning and standard differentially private federated learning, ADP-FL consistently achieves higher accuracy, improved boundary delineation, faster convergence, and greater training stability, with performance approaching that of non-private federated learning under the same privacy budgets. These results demonstrate the practical viability of ADP-FL for high-performance, privacy-preserving medical image segmentation in real-world federated settings.

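2604.03687 2026-05-12 cs.CV Version update

SciLT: Long-tailed Image Classification under Scientific Image Domains

Jiahao Chen, Bing Su

Affiliations: Gaoling School of Artificial Intelligence

AI summary: This paper studies long-tailed recognition in scientific image domains and finds that fine-tuning foundation models yields limited gains. It proposes the SciLT framework, which balances head and tail classes through multi-level representations and dual-supervision learning; experiments show strong performance on scientific long-tailed recognition.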

Abstract

Long-tailed recognition has benefited from foundation models and fine-tuning paradigms, yet existing studies and benchmarks are mainly confined to natural image domains, where pre-training and fine-tuning data share similar distributions. In contrast, scientific images exhibit distinct visual characteristics and supervision signals, raising questions about the effectiveness of fine-tuning foundation models in such settings. In this work, we investigate scientific long-tailed recognition under a purely visual and fine-tuning paradigm. Experiments on three scientific benchmarks show that fine-tuning foundation models yields limited gains, and reveal that penultimate-layer features play an important role, particularly for tail classes. Motivated by these findings, we propose SciLT, a framework that exploits multi-level representations through adaptive feature fusion and dual-supervision learning. By jointly leveraging penultimate- and final-layer features, SciLT achieves balanced performance across head and tail classes. Extensive experiments demonstrate that SciLT consistently outperforms existing methods, establishing a strong and practical baseline for scientific long-tailed recognition and providing valuable guidance for adapting foundation models to scientific data with substantial domain shifts.

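2604.02564 2026-05-12 eess.IV cs.CV Version update

Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

Sebo Diaz, Polina Golland, Elfar Adalsteinsson, Neel Dey

Affiliations: MIT; MGH (Massachusetts General Hospital); HMS (Harvard Medical School)

AI summary: MaskGen proposes a simple and effective domain generalization strategy that combines source-domain image intensities with domain-stable foundation model representations, achieving stronger generalization in biomedical image segmentation.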

Comments Project GitHub https://github.com/sebodiaz/MaskGen

Abstract

Neuronavigation is widely used in biomedical research and interventions to guide the precise placement of instruments around the head to support procedures such as transcranial magnetic stimulation. Traditional systems, however, rely on subject-mounted markers that require manual registration, may shift during procedures, and can cause discomfort. We introduce and evaluate markerless approaches that replace expensive hardware and physical markers with low-cost visible and infrared light cameras incorporating stereo and depth sensing, combined with algorithmic modeling of the facial geometry. Validation with 50 human subjects yielded a median tracking discrepancy of only 2.32 mm and 2.01$^\circ$ for the best markerless algorithm compared to a conventional marker-based system, which indicates sufficient accuracy for transcranial magnetic stimulation and a substantial improvement over prior markerless results. The study also suggests that integration of the data from the various camera sensors can improve the overall accuracy further. The proposed markerless neuronavigation methods can reduce setup cost and complexity, improve patient comfort, and expand access to neuronavigation in clinical and research settings.

2602.04549 2026-05-12 cs.CV Version update

Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models

Cem Eteke, Enzo Tartaglione

Affiliations: Chair of Media Technology, Munich Institute of Robotics and Machine Intelligence, School of Computation, Information and Technology, Technical University of Munich, 80333 Munich, Germany; LTCI, Télécom Paris, Institut Polytechnique de Paris, France

AI summary: This paper proposes NiFi, a method that uses diffusion models to compress 3DGS with high quality at extremely low rates, reaching a 1000x compression ratio.

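2602.04054 2026-05-12 cs.LG cs.CV Version update

One-step Latent-free Image Generation with Pixel Mean Flows

Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, Kaiming He

Affiliations: MIT

AI summary: This paper proposes pixel mean flow (pMF), which separates the network output space from the loss space to achieve one-step image generation without latent variables, with strong results on ImageNet.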

Comments Tech report. Code at https://github.com/Lyy-iiis/pMF

2601.16836 2026-05-12 cs.CV cs.CL Version update

ColorConceptBench: A Benchmark for Probabilistic Color-Concept Understanding in Text-to-Image Models

Chenxi Ruan, Yihan Hou, Yu Xiao, Guosheng Hu, Wei Zeng

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; China Academy of Art

AI summary: This paper introduces ColorConceptBench, which systematically evaluates color-concept associations through probabilistic color distributions and studies how text-to-image models understand implicit concepts such as emotions and visual states, finding that models lack sensitivity to abstract semantics.

Comments 9 pages, 6 figures

Abstract

Text-to-image (T2I) models have advanced considerably in generating high-quality images from textual descriptions. However, their ability to associate colors with concepts remains largely constrained to explicit color names or codes, while their capacity to handle implicit concepts, such as emotions and visual states, remains underexplored. To address this gap, we introduce ColorConceptBench, an expert-annotated benchmark that systematically evaluates color-concept associations through probabilistic color distributions. ColorConceptBench moves beyond explicit color specifications by examining how models interpret 1,281 implicit color concepts, grounded in 6,584 human annotations. Our evaluation of nine leading T2I models reveals that performance varies substantially across semantic categories, and models exhibit a significant lack of sensitivity to abstract semantics. These limitations persist even when applying classifier-free guidance scaling at inference time, suggesting that achieving human-like color understanding demands a shift in how models learn and represent implicit semantic meaning.

2512.24552 2026-05-12 cs.CV math.OC Version update

OCP-GN: A Scalable Second-order Optimizer for Stochastic Optimization

Jindi Zhong, Congyaohui Yin, Zhaorong Zhang, Huanshui Zhang

AI summary: This paper proposes OCP-GN, an algorithm based on optimal control principles for large-scale optimization problems in neural network training, with O(d) computational complexity and strong robustness; experiments confirm significant advantages.

2512.19115 2026-05-12 cs.CV Version update

Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

Affiliations: University of Electronic Science and Technology of China; Peking University; Tencent Inc; Zhongguancun Academy

AI summary: This study explains why multimodal large language models perform poorly at multimodal retrieval. Sparse autoencoder analysis shows that their representation space consists mainly of textual semantics with too little visual semantics, which degrades retrieval performance; the authors propose ReAlign to improve retrieval.

Abstract

Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from being effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; and the visual semantics essential for multimodal retrieval only constitute a small portion. We find that this imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations of MLLMs are actually distractors that greatly reduce retrieval performance. Building on these insights, we propose ReAlign, a test-time adaptation approach that applies a whitening transformation to adjust the geometry of MLLM representation spaces. Empirical results show that this simple intervention consistently improves zero-shot multimodal retrieval performance across diverse MLLMs without fine-tuning efforts. The code is available at https://github.com/Heinz217/mllm-retrieval-analysis.
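
The whitening step mentioned in this abstract can be illustrated with a small, dependency-free sketch. It centers a set of embeddings and applies Cholesky whitening so that their covariance becomes the identity. The function names and the toy `samples` are assumptions made for illustration, and the abstract does not say which whitening variant ReAlign uses, so this should not be read as the authors' implementation.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def mean_and_cov(x: Matrix) -> Tuple[List[float], Matrix]:
    """Per-dimension mean and (population) covariance of row vectors."""
    n, d = len(x), len(x[0])
    mu = [sum(row[j] for row in x) / n for j in range(d)]
    cov = [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in x) / n
            for j in range(d)] for i in range(d)]
    return mu, cov


def cholesky(a: Matrix) -> Matrix:
    """Lower-triangular factor of a symmetric positive-definite matrix."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def whiten(x: Matrix) -> Matrix:
    """Center the rows of x and map them to identity covariance."""
    mu, cov = mean_and_cov(x)
    low = cholesky(cov)
    out = []
    for row in x:
        c = [v - m for v, m in zip(row, mu)]
        z: List[float] = []
        for i in range(len(c)):  # forward substitution solves low @ z = c
            z.append((c[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
        out.append(z)
    return out


# Toy 2-D "embeddings" with correlated dimensions (illustrative values only).
samples = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
whitened = whiten(samples)
```

Cholesky whitening is chosen here only because it avoids an eigendecomposition; it requires the covariance matrix to be positive definite.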

2512.08984 2026-05-12 cs.CV cs.AI Version update

RAG-HAR: Retrieval Augmented Generation-based Human Activity Recognition

Nirhoshan Sivaroopan, Hansi Karunarathna, Chamara Madarasingha, Anura Jayasumana, Kanchana Thilakarathna

Affiliations: University of Sydney, Australia; University of Sri Jayewardenepura, Sri Lanka; Curtin University, Australia; Colorado State University, USA

AI summary: RAG-HAR is a training-free retrieval-augmented framework that uses large language models for human activity recognition, improving accuracy and practicality through lightweight statistical descriptors and retrieval of semantically similar samples.

Comments Accepted to IEEE PerCom 2026 (Pervasive computing and communications)

Abstract

Human Activity Recognition (HAR) underpins applications in healthcare, rehabilitation, fitness tracking, and smart environments, yet existing deep learning approaches demand dataset-specific training, large labeled corpora, and significant computational resources. We introduce RAG-HAR, a training-free retrieval-augmented framework that leverages large language models (LLMs) for HAR. RAG-HAR computes lightweight statistical descriptors, retrieves semantically similar samples from a vector database, and uses this contextual evidence to perform LLM-based activity identification. We further enhance RAG-HAR by first applying prompt optimization and introducing an LLM-based activity descriptor that generates context-enriched vector databases for delivering accurate and highly relevant contextual information. Along with these mechanisms, RAG-HAR achieves state-of-the-art performance across six diverse HAR benchmarks. Most importantly, RAG-HAR attains these improvements without requiring model training or fine-tuning, emphasizing its robustness and practical applicability. RAG-HAR moves beyond known behaviors, enabling the recognition and meaningful labelling of multiple unseen human activities.
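
The describe, retrieve, then prompt flow in this abstract can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the four statistics, the Euclidean distance used as a stand-in for semantic similarity, the in-memory list standing in for a vector database, the toy sensor windows, and the prompt wording. None of it is taken from the RAG-HAR code.

```python
import math
from statistics import mean, pstdev
from typing import List, Tuple

Descriptor = List[float]


def describe(window: List[float]) -> Descriptor:
    """Lightweight statistical descriptor of one sensor window."""
    return [mean(window), pstdev(window), min(window), max(window)]


def retrieve(query: Descriptor,
             store: List[Tuple[Descriptor, str]],
             k: int = 2) -> List[Tuple[Descriptor, str]]:
    """Return the k stored samples closest to the query descriptor."""
    return sorted(store, key=lambda item: math.dist(query, item[0]))[:k]


# Hypothetical labelled windows acting as a tiny vector database.
store = [
    (describe([0.0, 0.1, 0.0, 0.1]), "sitting"),
    (describe([1.0, -1.0, 1.2, -0.9]), "walking"),
    (describe([3.0, -3.1, 2.9, -3.0]), "running"),
]

query = describe([1.1, -0.9, 1.0, -1.1])
neighbours = retrieve(query, store, k=2)
labels = [label for _, label in neighbours]

# The retrieved labels become context for an LLM prompt (no LLM call is made here).
prompt = ("Similar labelled windows: " + ", ".join(labels)
          + ". Which activity does the query window show?")
print(prompt)
```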

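2512.06673 2026-05-12 cs.CV Version update

Control-Augmented Autoregressive Diffusion for Data Assimilation

Prakhar Srivastava, Farrin Marouf Sofian, Francesco Immorlano, Kushagra Pandey, Stephan Mandt

Affiliations: University of California, Irvine

AI summary: This paper proposes a control-augmented autoregressive diffusion model that pairs a pretrained model with an offline-trained controller, improving stability and accuracy in data assimilation for chaotic spatiotemporal partial differential equations.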

Abstract

Despite advances in test-time scaling and diffusion finetuning, guidance for Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments a pretrained ARDM with an offline-trained controller. By previewing future rollouts, the controller learns stepwise corrections that anticipate observations under a terminal-cost objective, yielding a reusable policy for guided generation. Motivated by a stochastic optimal control view of ARDM trajectories, our method injects small controls within each denoising sub-step while staying close to the pretrained dynamics. We study this approach for data assimilation (DA) in chaotic spatiotemporal partial differential equations (PDEs), where existing methods are often computationally expensive and susceptible to forecast drift under sparse observations. At inference, DA becomes a feed-forward rollout with on-the-fly corrections, achieving an order-of-magnitude speedup over strong diffusion-based baselines. Across two canonical PDEs and a compact ECMWF Reanalysis v5 (ERA5) pilot spanning six observation regimes, our method consistently improves stability and accuracy over state-of-the-art alternatives, with similar improvements observed in a larger-scale GenCast study.

2510.05635 2026-05-12 cs.LG cs.CV Version update

NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering

Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh

Affiliations: University of Birmingham; Brave Software Research; University of Cambridge

AI summary: NEO performs hyperparameter-free test-time adaptation through latent re-centering, substantially improving classification accuracy with little extra computation, and applies across multiple datasets and devices.

Comments ICLR 2026

Abstract

Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO -- a hyperparameter-free fully TTA method, that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
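
The central operation named in this abstract, re-centering target embeddings at the origin, amounts to subtracting an estimated mean. The sketch below shows only that operation on toy vectors. The function names and the toy batch are assumptions made for illustration, and the abstract does not specify where in the network NEO applies the shift or how the center is tracked across batches.

```python
from typing import List

Vector = List[float]


def batch_mean(embeddings: List[Vector]) -> Vector:
    """Per-dimension mean of a batch of embeddings."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def recenter(embeddings: List[Vector], center: Vector) -> List[Vector]:
    """Shift embeddings so that `center` moves to the origin."""
    return [[x - c for x, c in zip(e, center)] for e in embeddings]


# Toy distribution-shifted batch: every embedding carries a common offset.
target_batch = [[2.0, 5.0], [4.0, 7.0], [3.0, 6.0]]

center = batch_mean(target_batch)            # estimated from the adaptation batch
recentered = recenter(target_batch, center)  # would then be passed to the classifier head

print(center)      # [3.0, 6.0]
print(recentered)  # [[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]]
```

No gradient step or tunable parameter appears in this sketch, which is consistent with the abstract's description of the method as hyperparameter-free and close to vanilla inference in cost.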

2509.08670 2026-05-12 cs.CV Version update

FractalPINN-Flow: A Fractal-Inspired Network for Unsupervised Optical Flow Estimation with Total Variation Regularization

Sara Behnamian, Rasoul Khaksarinezhad, Andreas Langer

Affiliations: Globe Institute, University of Copenhagen; Centre for Mathematical Sciences, Lund University

AI summary: This paper proposes FractalPINN-Flow, an unsupervised deep learning framework that learns optical flow directly from consecutive grayscale frames without ground truth. Its Fractal Deformation Network, designed around fractal geometry and self-similarity, is combined with total variation regularization to produce accurate, smooth, edge-preserving optical flow estimates.

Journal ref In Proceedings of the 2nd Sorbonne-Heidelberg Workshop on AI in Medicine: Machine Learning for Multi-modal Data, Heidelberg University Library, 2025

DOI: 10.11588/heidok.00037608

Abstract

We present FractalPINN-Flow, an unsupervised deep learning framework for dense optical flow estimation that learns directly from consecutive grayscale frames without requiring ground truth. The architecture centers on the Fractal Deformation Network (FDN) - a recursive encoder-decoder inspired by fractal geometry and self-similarity. Unlike traditional CNNs with sequential downsampling, FDN uses repeated encoder-decoder nesting with skip connections to capture both fine-grained details and long-range motion patterns. The training objective is based on a classical variational formulation using total variation (TV) regularization. Specifically, we minimize an energy functional that combines $L^1$ and $L^2$ data fidelity terms to enforce brightness constancy, along with a TV term that promotes spatial smoothness and coherent flow fields. Experiments on synthetic and benchmark datasets show that FractalPINN-Flow produces accurate, smooth, and edge-preserving optical flow fields. The model is especially effective for high-resolution data and scenarios with limited annotations.
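
An energy of the kind this abstract describes could be written as follows. This is a sketch only: the symbols $I_1, I_2$ (consecutive frames), $u$ (flow field), $\Omega$ (image domain), and the weights $\alpha_1, \alpha_2, \lambda$ are assumptions introduced here, and the abstract does not give the exact form or weighting the authors use.

```latex
\[
E(u) = \alpha_1 \int_\Omega \bigl\lvert I_2(x + u(x)) - I_1(x) \bigr\rvert \, dx
     + \frac{\alpha_2}{2} \int_\Omega \bigl\lvert I_2(x + u(x)) - I_1(x) \bigr\rvert^2 \, dx
     + \lambda \, \mathrm{TV}(u),
\qquad
\mathrm{TV}(u) = \int_\Omega \lvert \nabla u(x) \rvert \, dx .
\]
```

The first two integrals are the $L^1$ and $L^2$ data fidelity terms that penalize violations of brightness constancy, and the TV term favors piecewise-smooth flow while allowing sharp motion edges.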

2507.04465 2026-05-12 cs.CV Version update

Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions

Konstantinos Foteinos, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

Affiliations: Department of Informatics and Telematics, Harokopio University of Athens

AI summary: This paper surveys the methods, datasets, challenges, and future directions of visual hand gesture recognition, systematically analyzing current state-of-the-art methods and evaluation metrics to give researchers a guide for improvement.

Comments Submitted to Neurocomputing. Rewritten abstract, due to limited space

Abstract

The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always-important field of visual hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the current state-of-the-art (SOTA). The current survey aims to fill this gap by presenting a comprehensive overview of this computer vision field. With a systematic research methodology and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to propose improvements. Specifically, this survey focuses on four fundamental questions: what are the main VHGR aspects, what are the current SOTA methods, what comparative insights can be drawn across methods and tasks, and which challenges shape future research. Starting with the methodology used to locate the related literature, the survey identifies and organizes the key VHGR approaches in a taxonomy-based format. The SOTA methods are grouped across three primary VHGR tasks: static, isolated dynamic and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. To support the experimental evaluation of future methods in the field, the study reviews commonly used datasets and presents the standard performance metrics. Our survey concludes by identifying the major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.

2507.04277 2026-05-12 cs.CV Version update

Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices

Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong

Affiliations: Key Laboratory of Precision and Intelligent Chemistry, Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China; School of Information and Communication Engineering, University of Electronic Science and Technology of China

AI summary: This paper proposes the LiteIE framework, an efficient low-light image enhancement solution built on a lightweight network and unsupervised training, which achieves high PSNR and real-time processing on resource-constrained devices.

Comments Accepted by ESWA

Abstract

Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image feature enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. On the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.

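2506.12542 2026-05-12 cs.LG cs.AI cs.CV stat.ML Version update

PLD: A Choice-Theoretic List-Wise Knowledge Distillation

Ejafa Bassam, Dawei Zhu, Kaigui Bian

Affiliations: School of Computer Science, Peking University

AI summary: This paper proposes PLD, a list-wise knowledge distillation method based on the Plackett-Luce model. It treats teacher logits as "worth" scores and directly optimizes the teacher-optimal ranking, yielding a convex, translation-invariant surrogate objective that subsumes weighted cross-entropy.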

Journal ref Advances in Neural Information Processing Systems 38 (NeurIPS 2025), 136090--136112 (2026)
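
The summary of this entry refers to the Plackett-Luce model. As a rough illustration only, the negative log-likelihood of a ranking under Plackett-Luce can be computed as below. The function name `plackett_luce_nll`, the toy logits, and the choice to rank classes by descending teacher logit are assumptions made for this sketch; it omits any weighting PLD applies and should not be read as the paper's loss.

```python
import math
from typing import List


def plackett_luce_nll(scores: List[float], ranking: List[int]) -> float:
    """Negative log-likelihood of `ranking` (best first) under Plackett-Luce.

    At each position the chosen item competes, via a softmax over scores,
    against all items that have not been placed yet.
    """
    nll = 0.0
    for k in range(len(ranking)):
        remaining = [scores[i] for i in ranking[k:]]
        m = max(remaining)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[ranking[k]] - log_z
    return nll


# Toy logits for a 3-class problem (illustrative values only).
teacher_logits = [2.0, 0.5, 1.0]
student_logits = [1.5, 0.2, 0.9]

# Teacher ranking: class indices sorted by descending teacher logit.
ranking = sorted(range(len(teacher_logits)), key=lambda i: -teacher_logits[i])

loss = plackett_luce_nll(student_logits, ranking)

# Adding a constant to every student logit leaves the loss unchanged.
shifted_loss = plackett_luce_nll([s + 10.0 for s in student_logits], ranking)
```

The final comparison illustrates the translation invariance mentioned in the summary: the loss depends only on differences between logits.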

2506.07436 2026-05-12 cs.CV cs.AI cs.ET Version update

Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition

Nishi Chaudhary, S M Jamil Uddin, Sathvik Sharath Chandra, Anto Ovid, Alex Albert

Affiliations: Department of Construction Management, Colorado State University; Department of Civil, Construction, and Environmental Engineering, North Carolina State University

AI summary: This paper compares five advanced multimodal large language models on construction hazard recognition and finds that prompting strategy significantly affects performance. CoT prompting works best, GPT-4.5 and GPT-o3 stand out, and prompt design plays a key role in improving accuracy for construction safety applications.

DOI: 10.1109/ACCESS.2026.3691685

Abstract

The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
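
The precision, recall, and F1 metrics named in this abstract can be computed per image from the set of hazards a model reports and the set of annotated hazards. The sketch below uses made-up hazard labels and a hypothetical helper name; it shows the standard definitions, not the study's evaluation code.

```python
from typing import Set, Tuple


def precision_recall_f1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one image."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations and model output for a single construction image.
actual = {"fall from height", "struck by object", "electrical", "trip"}
predicted = {"fall from height", "electrical", "chemical"}

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Here two of the three predicted hazards are correct (precision 2/3) and two of the four annotated hazards are found (recall 1/2), giving F1 = 4/7.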

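2505.16025 2026-05-12 cs.CV cs.MM eess.IV Version update

Context and Pixel Aware Large Language Model for Video Quality Assessment

Wen Wen, Yaohong Wu, Yue Sheng, Neil Birkbeck, Balu Adsumilli, Yilin Wang

Affiliations: City University of Hong Kong; Google Inc.

AI summary: This paper proposes CP-LLM, which uses dual vision encoders and a language decoder to produce video quality scores and interpretable descriptions at the same time, with greater sensitivity to pixel distortions; experiments show strong performance on video quality assessment benchmarks.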

Comments Accepted to ICIP 2026

Abstract

Video quality assessment (VQA) is a challenging research topic with broad applications. Traditional hand-crafted and discriminative learning-based VQA models mainly focus on pixel-level distortions and lack contextual understanding, while recent multimodal large language models (MLLMs) either lack sensitivity to small distortions or handle quality scoring and description as separate tasks. To address these shortcomings, we introduce CP-LLM: a Context- and Pixel-aware Large Language Model. CP-LLM is a novel multimodal LLM architecture featuring dual vision encoders designed to independently analyze perceptual quality at both high-level (video context) and low-level (pixel distortion) granularity, along with a language decoder that subsequently reasons about the interplay between these aspects. This design enables CP-LLM to simultaneously produce robust quality scores and interpretable quality descriptions, with enhanced sensitivity to pixel distortions (e.g., compression artifacts). Experimental results demonstrate that CP-LLM achieves state-of-the-art cross-dataset performance on VQA benchmarks and superior robustness to pixel distortions.

2505.15879 2026-05-12 cs.CV cs.AI cs.CL (updated version)

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, Xin Eric Wang

Affiliations: UC Santa Cruz; UC Santa Barbara; eBay

AI summary: GRIT proposes a grounded reasoning method over images and text that is trained data-efficiently with reinforcement learning, enabling multimodal large language models to generate visually grounded reasoning chains.

Journal ref: NeurIPS 2025

Abstract

Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thought prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
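
To make the idea of reasoning chains that interleave text and bounding boxes more concrete, here is a minimal sketch that extracts boxes from a reasoning string and computes a toy format reward. The bracketed `[x1, y1, x2, y2]` notation, the function names, and the reward rule are all assumptions for illustration; the abstract does not specify GRIT's output format or how GRPO-GR scores it.

```python
import re

# Assumed textual format for a box: four integers in square brackets.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(reasoning):
    """Return every (x1, y1, x2, y2) tuple found in a reasoning string."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(reasoning)]


def format_reward(reasoning):
    """Toy format reward: 1.0 if at least one well-formed box is present, else 0.0."""
    boxes = extract_boxes(reasoning)
    well_formed = [b for b in boxes if b[0] < b[2] and b[1] < b[3]]
    return 1.0 if well_formed else 0.0


# Hypothetical reasoning chain, written by hand for this example.
chain = "The sign [40, 12, 180, 96] is above the door [200, 50, 320, 400], so the answer is 'exit'."
boxes = extract_boxes(chain)
reward = format_reward(chain)
```

A reward of this kind only checks that the output is well formed, which is why it needs no bounding box labels; answer accuracy would be scored separately.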

2503.09158 2026-05-12 cs.CV (updated version)

Implicit Neural Compression of Point Clouds

Hongning Ruan, Yulin Shao, Qianqian Yang, Liang Zhao, Zhaoyang Zhang, Dusit Niyato

Affiliations: College of Information Science and Electronic Engineering, Zhejiang University; Department of Electrical and Electronic Engineering, The University of Hong Kong; College of Computing and Data Science, Nanyang Technological University

AI summary: The paper proposes NeRC³, a framework that uses implicit neural representations for efficient point cloud compression, encoding geometry and attributes with coordinate-based neural networks and extending to dynamic point clouds. Experiments validate its performance on both static and dynamic point cloud compression.

Journal ref: IEEE Transactions on Image Processing, vol. 35, pp. 260-275, 2026

DOI: 10.1109/TIP.2025.3648141

Abstract

Point clouds have gained prominence across numerous applications due to their ability to accurately represent 3D objects and scenes. However, efficiently compressing unstructured, high-precision point cloud data remains a significant challenge. In this paper, we propose NeRC$^3$, a novel point cloud compression framework that leverages implicit neural representations (INRs) to encode both geometry and attributes of dense point clouds. Our approach employs two coordinate-based neural networks: one maps spatial coordinates to voxel occupancy, while the other maps occupied voxels to their attributes, thereby implicitly representing the geometry and attributes of a voxelized point cloud. The encoder quantizes and compresses network parameters alongside auxiliary information required for reconstruction, while the decoder reconstructs the original point cloud by inputting voxel coordinates into the neural networks. Furthermore, we extend our method to dynamic point cloud compression through techniques that reduce temporal redundancy, including a 4D spatio-temporal representation termed 4D-NeRC$^3$. Experimental results validate the effectiveness of our approach: For static point clouds, NeRC$^3$ outperforms octree-based G-PCC standard and existing INR-based methods. For dynamic point clouds, 4D-NeRC$^3$ achieves superior geometry compression performance compared to the latest G-PCC and V-PCC standards, while matching state-of-the-art learning-based methods. It also demonstrates competitive performance in joint geometry and attribute compression.

2411.08443 2026-05-12 cs.LG cs.CV (updated version)

Machine Unlearning on Pre-trained Models by Residual Feature Alignment Using LoRA

Laiqiao Qin, Tianqing Zhu, Linlin Wang, Wanlei Zhou

Affiliations: City University of Macau

AI summary: The paper proposes an efficient unlearning method for pre-trained models that uses LoRA to decompose intermediate features and adjusts the residual features to meet both the unlearning and retention targets. Experiments validate its effectiveness.

Comments: v2: corrected a sign typo in Algorithm 1 line 13

Journal ref: IEEE Transactions on Dependable and Secure Computing, 2026

DOI: 10.1109/TDSC.2026.3658545

Abstract

Machine unlearning is an emerging technology that removes a subset of the training data from a trained model without significantly affecting the model performance on the remaining data. This topic is becoming increasingly important in protecting user privacy and eliminating harmful or outdated data. The key challenge lies in effectively and efficiently unlearning specific information without compromising the model's utility on the retained data. For pre-trained models, fine-tuning is an important way to achieve the unlearning target. Previous work typically fine-tuned the entire model's parameters, which incurred significant computational costs. In addition, the fine-tuning process may cause shifts in the intermediate layer features, affecting the model's overall utility. In this work, we propose a novel and efficient machine unlearning method for pre-trained models. We term the method Residual Feature Alignment Unlearning. Specifically, we leverage LoRA (Low-Rank Adaptation) to decompose the model's intermediate features into pre-trained features and residual features. By adjusting the residual features, we align the unlearned model with the pre-trained model at the intermediate feature level to achieve both unlearning and remaining targets. The method aims to learn zero residuals on the retained set and shifted residuals on the unlearning set. Extensive experiments on numerous datasets validate the effectiveness of our approach.
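
The decomposition into pre-trained and residual features can be sketched for a single linear layer as below. This is a simplified illustration, assuming the usual LoRA form h = Wx + B(Ax) without the scaling factor; the helper names and the squared-norm penalty are hypothetical and are not the paper's actual objective.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_features(W, A, B, x):
    """Split a LoRA-adapted layer output into pre-trained and residual parts.

    W: frozen pre-trained weight (d_out x d_in)
    A: LoRA down-projection (r x d_in), B: LoRA up-projection (d_out x r)
    """
    pretrained = matvec(W, x)            # feature from the frozen weights
    residual = matvec(B, matvec(A, x))   # low-rank update B(Ax)
    adapted = [p + r for p, r in zip(pretrained, residual)]
    return pretrained, residual, adapted


def residual_norm_sq(residual):
    """Squared L2 norm of the residual; zero means the feature is unchanged."""
    return sum(r * r for r in residual)


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]      # rank r = 1
B = [[0.0], [0.0]]     # zero up-projection, so the residual is zero
pre, res, out = lora_features(W, A, B, [2.0, 3.0])
```

Under this reading, a retained sample should keep `residual_norm_sq` near zero, while a sample to be forgotten should end up with a non-zero, shifted residual.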

2407.12173 2026-05-12 cs.CV cs.AI (updated version)

Beta Sampling is All You Need: Efficient Image Generation Strategy for Diffusion Models using Stepwise Spectral Analysis

Haeil Lee, Hansang Lee, Seoyeon Gye, Junmo Kim

Affiliations: School of Electrical Engineering, KAIST

AI summary: The paper proposes Beta Sampling, a spectral-analysis-based method that optimizes the denoising process of diffusion models by concentrating computation on critical steps, improving generation efficiency and quality. Experiments show better FID and IS scores than conventional uniform sampling.

Comments: 8 pages, 9 figures, WACV 2025

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4215-4224, 2025

DOI: 10.1109/WACV61041.2025.00414

Abstract

Generative diffusion models have emerged as a powerful tool for high-quality image synthesis, yet their iterative nature demands significant computational resources. This paper proposes an efficient time step sampling method based on an image spectral analysis of the diffusion process, aimed at optimizing the denoising process. Instead of the traditional uniform distribution-based time step sampling, we introduce a Beta distribution-like sampling technique that prioritizes critical steps in the early and late stages of the process. Our hypothesis is that certain steps exhibit significant changes in image content, while others contribute minimally. We validated our approach using Fourier transforms to measure frequency response changes at each step, revealing substantial low-frequency changes early on and high-frequency adjustments later. Experiments with ADM and Stable Diffusion demonstrated that our Beta Sampling method consistently outperforms uniform sampling, achieving better FID and IS scores, and offers competitive efficiency relative to state-of-the-art methods like AutoDiffusion. This work provides a practical framework for enhancing diffusion model efficiency by focusing computational resources on the most impactful steps, with potential for further optimization and broader application.
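
One way to picture a Beta distribution-like timestep schedule is to place timesteps at evenly spaced quantiles of a U-shaped Beta distribution, which clusters them near both ends of the process. The sketch below assumes Beta(0.5, 0.5), chosen only because its inverse CDF has the closed form sin²(πu/2); the paper's actual shape parameters and selection procedure are not given in the abstract.

```python
import math


def beta_like_timesteps(num_steps, total_steps=1000):
    """Pick `num_steps` timesteps concentrated near both ends of [0, total_steps - 1].

    Uses evenly spaced quantiles of Beta(0.5, 0.5), whose inverse CDF is
    sin^2(pi * u / 2). The shape parameters are an illustrative assumption.
    """
    steps = []
    for i in range(num_steps):
        u = (i + 0.5) / num_steps               # midpoint quantile in (0, 1)
        x = math.sin(math.pi * u / 2) ** 2      # Beta(0.5, 0.5) quantile
        steps.append(round(x * (total_steps - 1)))
    return sorted(set(steps))


def uniform_timesteps(num_steps, total_steps=1000):
    """Baseline: evenly spaced timesteps."""
    return [round((i + 0.5) / num_steps * (total_steps - 1)) for i in range(num_steps)]


beta_steps = beta_like_timesteps(10)
gaps = [b - a for a, b in zip(beta_steps, beta_steps[1:])]
```

The gaps between consecutive timesteps are smallest at the two ends and largest in the middle, so more of a fixed step budget lands in the early and late stages than under `uniform_timesteps`.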

2605.08440 2026-05-12 cs.LG cs.CV (updated version)

TARO: Temporal Adversarial Rectification Optimization Using Diffusion Models as Purifiers

Daniel Wesego, Pedram Rooshenas

Affiliations: Department of Computer Science, University of Illinois Chicago

AI summary: TARO builds a temporally guided score prior from multiple denoising views to rectify adversarial examples robustly while preserving semantics, improving robustness in a zero-shot setting.

Abstract

Adversarial purification with diffusion models seeks to project adversarial examples back toward the data manifold, but balancing semantic preservation and robustness against adaptive attacks remains challenging. Recent work shows that standard diffusion purification can fail under adaptive evaluation, while test-time score-based optimization is more resilient. Existing optimization defenses, however, typically rely on a single diffusion noise regime or treat timesteps uniformly, overlooking the distinct roles of coarse and fine denoising scales. We propose Temporal Adversarial Rectification Optimization (TARO), an inference-time purification method that builds a temporally guided score prior from multiple denoising views along the diffusion trajectory. TARO forms a coarse-to-fine residual target: high-noise experts provide globally smoothed structure with reduced adversarial sensitivity, while low-noise experts restore image-specific, class-relevant details. A guidance strength controls this temporal correction, allowing TARO to balance robust global rectification with semantic preservation. Empirically, TARO improves robust accuracy across datasets and adaptive threat models in a zero-shot setting, while remaining compatible with complementary adversarial-likelihood objectives for further robustness gains.

2605.08421 2026-05-12 cs.CV (updated version)

Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval

Pascal Tilli, Mohsen Mesgar

Affiliations: University of Stuttgart; Bosch Center for Artificial Intelligence

AI summary: The paper proposes learning a document's global layout through textual supervision to improve late-interaction architectures for visual document retrieval and raise retrieval performance.

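2605.08412 2026-05-12 cs.CV (updated version)

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami

Affiliations: New York University

AI summary: SYNCR uses synthetic data to evaluate multi-video reasoning and finds that current models fall short on physical and spatial reasoning. Parameter scaling and post-training improve temporal alignment but do not reliably address fine-grained physical tracking or global spatial synthesis.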

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.

2605.08404 2026-05-12 cs.CL cs.AI cs.CV cs.ET (updated version)

Built Environment Reasoning from Remote Sensing Imagery Using Large Vision-Language Models

Dongdong Wang, Deepak Balakrishnan, Ravi Srinivasan, Shenhao Wang

Affiliations: Department of Urban and Regional Planning, University of Florida

AI summary: The paper studies the use of large language models in smart cities. It feeds multi-scale remote sensing imagery into multimodal language models to evaluate the effect on built-environment reasoning, and compares the accuracy and reliability of advanced models such as InternVL and Qwen in generating built-environment recommendations.

Comments: Published in the International Conference on Industrialized Construction 2026

2605.08396 2026-05-12 cs.CV (updated version)

Delivering Science as a Service: Sci-Orchestra's Cloud-Native Approach to HPC

Harinarayan Krishnan, Shubhabrata Mukerjee, Jeffrey Donatelli, Daniela Ushizima

Affiliations: Lawrence Berkeley National Laboratory; Applied Math & Computational Research Division; Bakar Comp. Health Sciences Institute; UC San Francisco; Berkeley Institute for Data Science; UC Berkeley

AI summary: Sci-Orchestra takes a cloud-native approach to automating experimental workflows on HPC, providing a secure execution environment, fostering cross-institutional collaboration, and accelerating the transition from laboratory prototypes to industrial applications.

Abstract

The increasing complexity of modern computational environments often burdens researchers with infrastructure management, authentication protocols, and container deployments. We present Sci-Orchestra, a layered orchestration framework designed to fully automate experimental workflows, allowing scientists to prioritize scientific discovery over backend operations. By abstracting execution through an API-driven interface, the system assumes responsibility for secure authentication, resource management, and scalable deployment across diverse high-performance computing environments using Kubernetes architectures. A key innovation of Sci-Orchestra is its autonomous marketplace, which serves as a catalyst for cross-institutional collaboration. Through an intuitive user interface, researchers can rapidly deploy and share specialized services via simple selections, eliminating the need for complex installations and technical setups. This modular infrastructure is specifically designed to facilitate industry partnerships as it provides a secure execution environment and allows external collaborators to test and validate proprietary tools without the need for source-code exchange. This "black-box" interoperability protects intellectual property while enabling seamless integration into broader scientific pipelines, ultimately accelerating the transition from laboratory prototypes to industrial-scale applications.

2605.08376 2026-05-12 cs.CV (updated version)

UIESNN: A Scale-Aware Spiking Network for Underwater Image Enhancement

Shuang Chen, Ruochen Li, Zihan Zhu, Ronald Thenius, Farshad Arvin, Amir Atapour-Abarghouei

Affiliations: Institute of Biology, University of Graz, Graz, Austria; University of Cambridge

AI summary: The paper proposes UIESNN, a scale-aware spiking network for underwater image enhancement that improves colour fidelity and spatial coherence through a Multi-scale Pooling LIF Block and a spiking residual architecture.

Abstract

Underwater image enhancement (UIE) is a practically important yet underexplored application of spiking neural networks (SNNs), where the dominant degradations are large-scale and low-frequency, such as wavelength-dependent colour casts and scattering-induced veiling. Existing SNN restoration designs rely on locally bounded spiking perception, which can limit global correction and lead to saturated or inconsistent representations. To address these challenges, we propose a scale-aware SNN framework for UIE named UIESNN. At its core is a Multi-scale Pooling LIF Block (MPLB) that injects hierarchical multi-scale pooling responses into membrane dynamics, thereby enlarging the effective receptive field while preserving fine-grained details and inducing heterogeneous scale-dependent activations. Building on MPLB, we design a spiking residual architecture that integrates frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Extensive experiments on the EUVP and LSUI benchmarks demonstrate that UIESNN achieves state-of-the-art performance among SNN-based methods, delivering improved colour fidelity and spatial coherence with competitive energy cost.
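
The notion of injecting multi-scale pooling responses into membrane dynamics can be illustrated with a toy one-dimensional leaky integrate-and-fire (LIF) update. This is only a sketch under assumed choices (moving-average pooling, equal weighting of scales, hard reset, the specific decay and threshold values); it is not the MPLB design, whose details the abstract does not give.

```python
def avg_pool_1d(signal, window):
    """Same-length moving average; windows are truncated at the edges."""
    n = len(signal)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def lif_step(membrane, current, decay=0.5, threshold=1.0):
    """One leaky integrate-and-fire update with hard reset after a spike."""
    spikes, new_membrane = [], []
    for v, i in zip(membrane, current):
        v = decay * v + i
        fired = v >= threshold
        spikes.append(1 if fired else 0)
        new_membrane.append(0.0 if fired else v)
    return spikes, new_membrane


def multi_scale_current(signal, windows=(3, 5)):
    """Direct input plus the mean of pooled responses at several scales."""
    pooled = [avg_pool_1d(signal, w) for w in windows]
    return [x + sum(p[i] for p in pooled) / len(pooled) for i, x in enumerate(signal)]


x = [0.0, 0.0, 2.0, 0.0, 0.0]               # hypothetical input row
current = multi_scale_current(x)
spikes, membrane = lif_step([0.0] * len(x), current)
```

Neurons next to the active position receive a non-zero membrane charge even though their direct input is zero, which is the sense in which pooling enlarges the effective receptive field in this toy setting.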

2605.08373 2026-05-12 cs.CV cs.AI (updated version)

NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution

M. Moein Esfahani, Sepehr Salem Ghahfarokhi, Mohammed Alser, Jingyu Liu, Vince Calhoun

Affiliations: Georgia State University; Tri-Institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS); Georgia Institute of Technology; Emory University

AI summary: The paper proposes NeuroGAN-3D, which uses a generative adversarial network to increase the resolution of rs-fMRI spatial maps so that functional units can be localized more precisely and neurobiological alterations detected.

Comments: Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

Abstract

Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.

2605.08371 2026-05-12 cs.CV (updated version)

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

Haotang Li, Zhenyu Qi, Shaohan Henry Wang, Kebin Peng, Zi Wang, Qing Guo, Sen He, Huanrui Yang

Affiliations: University of Arizona; East Carolina University; Augusta University; Nankai University

AI summary: PaceVGGT prunes DINO tokens before the first Alternating-Attention block of a frozen VGGT, reducing inference latency while maintaining reconstruction quality, with substantial speedups on ScanNet-50 and 7-Scenes.

Abstract

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality-latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by $5.1\times$ over unmodified VGGT at $N=300$ and $1.47\times$ over LiteVGGT at $N=1000$. These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
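
The per-frame keep budget can be pictured as a top-k selection over importance scores, as in the minimal sketch below. The scores, function names, and the simple squared-ratio cost model are assumptions for illustration; the learned Token Scorer, the merge/prune assignment, and the restoration module are not reproduced here.

```python
def prune_tokens_per_frame(scores, keep):
    """For each frame, keep the indices of the `keep` highest-scoring tokens.

    scores: list of per-frame lists of importance scores (higher = more important).
    Returns per-frame index lists in original spatial order.
    """
    kept = []
    for frame_scores in scores:
        ranked = sorted(range(len(frame_scores)), key=lambda i: frame_scores[i], reverse=True)
        kept.append(sorted(ranked[:keep]))
    return kept


def attention_cost_ratio(tokens_before, tokens_after):
    """Relative cost of self-attention that scales with the square of token count."""
    return (tokens_after / tokens_before) ** 2


scores = [[0.9, 0.1, 0.5, 0.7], [0.2, 0.8, 0.3, 0.6]]   # hypothetical scorer outputs
kept = prune_tokens_per_frame(scores, keep=2)
ratio = attention_cost_ratio(tokens_before=8, tokens_after=4)
```

Because every frame keeps the same number of tokens, the sequence length seen by the backbone is fixed in advance, and halving the token count cuts the quadratic attention term to a quarter in this idealized model.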

2605.08329 2026-05-12 cs.CV eess.IV (updated version)

An Efficient Token Compression Framework for Visual Object Tracking

Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song

Affiliations: Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Guangxi Key Lab of Multi-Source Information Mining and Security, Guangxi Normal University; University Engineering Research Center of Educational Intelligent Technology, Guangxi Normal University; Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou University

AI summary: The paper proposes ETCTrack, a framework that compresses tokens from historical template frames to improve tracking performance and efficiency. Experiments on seven benchmarks show it outperforms existing methods, cutting template tokens by 60% and MACs by 21.4% with only a 0.4% drop in accuracy.

Comments: Accepted by CVPR2026

Abstract

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To bridge this gap, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. Refined search features ensure subsequent precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code is available at https://github.com/PJD-WJ/ETCTrack.

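2605.08311 2026-05-12 cs.LG cs.CV (updated version)

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

Xi Wang, Cheng Deng

Affiliations: School of Electronic Engineering, Xidian University, Xi'an, Shaanxi, China

AI summary: The paper proposes TRM, a framework that recasts merging as an optimization process to address the storage constraints of model merging in continual learning, improving the merged model's stability and optimization dynamics.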

Abstract

Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement for preserving diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream, and that vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. Together, these effects leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives (task alignment, prediction consistency, and gradient responsiveness) to concurrently preserve the merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

2605.08296 2026-05-12 cs.CV eess.SP (updated version)

BenchHAR: Benchmarking Self-Supervised Learning for Generalizable Sensor-based Activity Recognition

Yize Cai, Rui Feng, Anlan Yu, Baoshen Guo, Zhiqing Hong

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Peking University; Singapore-MIT Alliance for Research and Technology

AI summary: Addressing data heterogeneity and the scarcity of labeled data in sensor-based activity recognition, BenchHAR systematically compares the generalization performance of self-supervised learning methods. It finds that the hybrid paradigm and CNN encoders perform best, and that more pretraining data improves generalization.

Comments: 25 pages

Abstract

Human Activity Recognition (HAR) from wearable sensors supports broad healthcare and behavior science applications. However, data heterogeneity and the scarcity of labeled data limit its real-world generalization. Recent advances in self-supervised learning (SSL) in vision and language domains have shown strong capability for learning generalizable representations from unlabeled data. Yet, few studies have systematically compared the generalization performance of SSL methods or explored how to adapt them for generalizable HAR. To address these gaps, we present BenchHAR, a unified framework for evaluating the generalization capability of SSL methods for sensor-based HAR on unseen target distributions. BenchHAR curates a large-scale dataset (~258K samples) and evaluates eight representative SSL methods across 12 encoder-classifier architectures. Our results reveal that existing SSL methods struggle to achieve satisfactory generalization performance. We find that: (1) For HAR models, the hybrid paradigm (combining reconstruction and contrastive pretraining) achieves the best overall performance. The CNN encoder exhibits the strongest ability to learn generalizable representations, while more expressive classifier architectures further improve generalization. (2) For data scale, increasing the amount of pretraining data from downstream activity classes consistently improves generalization, while adding more labeled data yields limited gains. Interestingly, incorporating unlabeled data from non-downstream activity classes does not improve generalization. (3) Sensor data collected from custom-grade devices generalizes better than that from research-grade devices, and data from limb transfers more effectively to trunk positions. BenchHAR provides a unified benchmark and actionable insights for generalizable sensor-based HAR systems. Our code is available at https://github.com/saiketa/HAR-Bench.

2605.08282 2026-05-12 eess.IV cs.AI cs.CV (version update)

A Paired Point-of-Care Ultrasound Dataset for Image Quality Enhancement and Benchmarking via a cGAN Baseline

一种配对的点即护理超声数据集用于通过cGAN基线的图像质量增强和基准测试

Lennard M. van Karnenbeek, Hilde G. A. van der Pol, Mark Wijkhuizen, Eva Poelman, Caroline A. Drukker, Theo Ruers, Freija Geldof, Behdad Dashtbozorg

Affiliations: Department of Nanobiophysics, Faculty of Science and Technology, University of Twente; Image-Guided Surgery, Department of Surgery, Netherlands Cancer Institute; Netherlands Cancer Institute; Center for Early Cancer Detection, Netherlands Cancer Institute

AI summary: This paper presents a new paired dataset for enhancing point-of-care ultrasound image quality with a cGAN baseline and demonstrates its diagnostic value in low-resource settings.

AI中文摘要

目的：我们旨在利用深度学习和一种新的配对数据集来提升点即护理超声（POCUS）设备的图像质量。方法：我们使用定制的自动化架系统收集了第一个准确配对的数据集，结合低端POCUS和高端超声图像。基于pix2pix架构，使用带有L1和结构相似性指数（SSIM）损失的U-Net生成器的条件生成对抗网络（cGAN）被利用。在模拟数据集上预训练进一步提升了性能。评估是在1064对体外组织和仿真实超声图像集上进行的。结果：我们的方法将SSIM从0.29提升到0.54，PSNR从19.16 dB提升到22.41 dB。无参考指标也表明了显著的提升，自然图像质量评估器（NIQE）和基于感知的图像质量评估器（PIQE）得分分别从7.95降至4.44和31.12降至19.99。结论：本文提出了第一个公开可用的低端POCUS到高端超声图像的准确配对数据集。此外，我们的结果展示了所提框架在克服手持POCUS硬件限制方面的潜力，从而在低资源和点即护理环境中提升其诊断价值。POCUS-IQ数据集可在https://github.com/NKI-MedTech-AI/POCUS-IQ上公开获取。

英文摘要

Purpose: We aim to enhance the image quality of point-of-care ultrasound (POCUS) devices using deep learning and a novel paired dataset of POCUS and high-end ultrasound images. Approach: Using a custom-built automated gantry system, we collected the first accurately paired dataset of low-end POCUS and high-end ultrasound images. We trained a conditional generative adversarial network (cGAN) based on the pix2pix architecture, with a U-Net generator that incorporates both L1 and structural similarity index (SSIM) losses to improve perceptual quality. Pretraining on a simulation dataset further boosts performance. Evaluation was performed on 1064 paired ex vivo tissue and phantom ultrasound image sets. Results: Our approach improves the SSIM from 0.29 to 0.54 and PSNR from 19.16 dB to 22.41 dB. No-reference metrics also indicate substantial enhancement, with the Natural Image Quality Evaluator (NIQE) and Perception-based Image Quality Evaluator (PIQE) scores dropping from 7.95 to 4.44 and 31.12 to 19.99, respectively. Conclusions: This work presents the first publicly available, accurately paired dataset of low-end POCUS and high-end ultrasound images. Additionally, our results demonstrate the potential of the proposed framework to overcome hardware limitations of handheld POCUS, enhancing its diagnostic value in low-resource and point-of-care settings. The POCUS-IQ Dataset is publicly available at https://github.com/NKI-MedTech-AI/POCUS-IQ.
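
For readers who want to sanity-check the PSNR figures quoted in the abstract, the metric follows the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below is a minimal plain-Python illustration, assuming 8-bit images flattened to lists; the function name `psnr` and the toy pixel values are hypothetical and are not taken from the paper or its repository.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 "images" flattened row by row (hypothetical values).
high_end = [52, 55, 61, 59]
pocus = [50, 57, 60, 62]
print(f"PSNR: {psnr(high_end, pocus):.2f} dB")
```

Because PSNR is logarithmic in MSE, the reported gain of 22.41 - 19.16 = 3.25 dB corresponds to a mean squared error that is lower by a factor of about 10^0.325, or roughly 2.1.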

2605.08281 2026-05-12 cs.CV (version update)

Is Class Signal Clustered or Routed in Task-Induced Implicit Neural Representation Weight Spaces?

隐式神经表示中的类别信号是聚类还是路由？

Xinyi Guo, Mingyi He, Haobin Ding, Weiming Chen, Xinrui Chen, Jiawen Li, Di Zhang, Minxi Ouyang, Yizhi Wang, Xitong Ling

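Affiliations: South China Normal University; Beijing University of Chemical Technology; Tsinghua University; Xi'an Jiaotong University

AI summary: This study examines the geometric structure of class signal in implicit neural representations and finds that it is not simply clustered; instead, classifiability is achieved through reader routing.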

AI中文摘要

隐式神经表示（INR）将图像编码为神经网络权重，使图像分类成为权重空间可分类性的问题。一个自然的几何假设是，分类反馈应使图像特定的权重在共享锚坐标下按类别聚类。我们在此SIREN基础的Meta Weight Transformer（MWT）框架中测试这一假设，发现该预测失败。暴露的权重空间几何和监督聚类压力无法可靠追踪训练读者的准确性；聚类甚至可能使局部邻域更类一致，同时使训练读者更差。关键在于，读者构建而非继承类对齐的几何结构：令牌流诊断显示，类对齐的邻域只有在晚期读者交互后才强烈预测训练读者的准确性，而非输入坐标。我们进一步识别了增强权重令牌中的原生SIREN偏置列作为训练读者的低维、样本依赖的因果读出路由；针对性控制排除了通用标量列和边际分布伪影。该诊断促使干预措施，如强化读者路由、添加显式偏置路由或使用更密集的内环拟合；在本文使用的车道特定训练惯例下，路由导向的变体通常优于共享锚基线，但交互非加性。任务诱导的INR权重可分类并非因为形成原始几何聚类，而是因为其类别信号通过读者路由。

英文摘要

Implicit neural representations (INRs) encode images as neural-network weights, making image classification a problem of weight-space classifiability. A natural geometric hypothesis is that classifier feedback should make image-specific weights cluster by class in the shared-anchor coordinate. We test this hypothesis in the SIREN-based Meta Weight Transformer (MWT) regime, where end-to-end training meta-learns a shared initialization and inner-loop update schedule for fitting image-specific SIRENs. We find that this prediction fails. Exposed weight-space geometry and supervised clustering pressure do not reliably track trained-reader accuracy; clustering can even make local neighborhoods more class-consistent while making the trained reader worse. Crucially, the reader constructs rather than inherits class-aligned geometry: token-flow diagnostics show that class-aligned neighborhoods become strongly predictive of trained-reader accuracy only after late reader interactions, not in the input coordinate. We further identify the native SIREN bias column in the augmented weight token as a low-dimensional, sample-dependent causal readout route for the trained reader; targeted controls rule out generic scalar-column and marginal-distribution artifacts. The diagnosis motivates interventions that strengthen reader routing, add an explicit bias route, or use denser inner-loop fitting; under the lane-specific training conventions used here, route-directed variants often outperform the shared-anchor baseline but interact non-additively. Task-induced INR weights are classifiable not because they form raw geometric clusters, but because their class signal is routed through the reader.
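
As background for the abstract above, a SIREN layer computes sin(w0 (W x + b)). The plain-Python sketch below shows one such layer and one plausible way to build an "augmented weight token" by appending each neuron's bias to its weight row, so that the bias forms the last column. The per-row tokenization, the function names, and the toy numbers are assumptions for illustration only; the abstract does not specify the MWT token layout. The value w0 = 30 is the default from the original SIREN work.

```python
import math


def siren_layer(x, weights, biases, w0=30.0):
    """One SIREN layer: sin(w0 * (W x + b)), one output per neuron."""
    outputs = []
    for row, b in zip(weights, biases):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + b
        outputs.append(math.sin(w0 * pre_activation))
    return outputs


def augmented_tokens(weights, biases):
    """Hypothetical tokenization: each neuron's weight row with its bias appended."""
    return [list(row) + [b] for row, b in zip(weights, biases)]


# Toy layer with two neurons and a 2-D coordinate input (hypothetical values).
W = [[0.1, -0.2], [0.05, 0.3]]
b = [0.01, -0.02]
print(siren_layer([0.5, 0.25], W, b))
print(augmented_tokens(W, b))  # the last entry of each token is the bias column
```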

2605.08276 2026-05-12 cs.CV (version update)

Beyond ViT Tokens: Masked-Diffusion Pretrained Convolutional Pathology Foundation Model for Cell-Level Dense Prediction

超越ViT标记：面向细胞级密集预测的掩码扩散预训练卷积病理基础模型

Weiming Chen, Xitong Ling, Zhenyang Cai, Xidong Wang, Jiawen Li, Tian Guan, Benyou Wang, Yonghong He

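Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Research Institute of Tsinghua, Pearl River Delta; The Chinese University of Hong Kong, Shenzhen

AI summary: This paper proposes the ConvNeXt Masked-Diffusion model, which improves cell-level dense prediction on pathology images through a convolutional generative pretraining framework. Experiments show that it performs better under limited annotation and outperforms existing ViT models and end-to-end segmentation methods.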

AI中文摘要

细胞级密集预测是计算病理学的核心，但受限于细粒度组织结构、强域偏移和昂贵的密集标注。现有基于ViT的病理基础模型依赖于补丁标记化，可能破坏空间连续性并削弱局部形态细节。为此，我们提出掩码扩散卷积基础模型，即ConvNeXt Masked-Diffusion（CMD），一种自监督卷积生成预训练框架。CMD采用全卷积ConvNeXt-UNet主干网络，在像素空间进行掩码扩散预训练，并通过自适应归一化融合冻结的病理基础模型特征。实验结果表明，CMD在多个病理密集预测任务中优于现有ViT模型，并在微调少量任务特定参数时超越了最先进的端到端分割方法。在标注有限的情况下，CMD表现出更强的鲁棒性和泛化能力。我们的发现表明，纯卷积架构也可作为竞争性的病理基础模型，实现当前ViT主导范式中的领先性能，并提供一个可扩展、高性能的解决方案，更好地保留组织结构先验知识以实现细粒度病理理解。

英文摘要

Cell-level dense prediction is central to computational pathology, but remains challenging due to fine-grained histological structures, strong domain shifts, and costly dense annotations. Existing ViT-based pathology foundation models rely on patch tokenization, which can disrupt spatial continuity and weaken local morphological details needed for cell-level prediction. To address this, we propose Masked-Diffusion Convolutional Foundation Models, termed ConvNeXt Masked-Diffusion (CMD), a self-supervised convolutional generative pretraining framework for dense pathology representation learning. CMD uses a fully convolutional ConvNeXt-UNet backbone, performs masked-diffusion pretraining in pixel space, and incorporates frozen pathology foundation model features through adaptive normalization. Experimental results demonstrate that CMD consistently outperforms existing ViT-based pathology foundation models and even surpasses state-of-the-art end-to-end segmentation methods while fine-tuning only a small number of task-specific parameters across multiple pathology dense prediction tasks. The advantage is particularly pronounced under limited annotation settings, where CMD exhibits stronger robustness and generalization ability. Our findings suggest that purely convolutional architectures can also serve as competitive pathology foundation models for cell-level dense prediction, achieving leading performance within the current ViT-dominated paradigm and providing a scalable, high-performance solution that better preserves histological structural priors for fine-grained pathology understanding.

2605.08275 2026-05-12 eess.IV cs.CV (version update)

Model-based Dynamic 3D MRI Reconstructions using Neural Fields and Tensor Product Expansions

基于神经场和张量积展开的模型动态3D MRI重建

Ray Sheombarsing, Max van Riel, David Heesterbeek, Nico van den Berg, Alessandro Sbrizzi

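Affiliations: Computational Imaging Group for MRI Therapy & Diagnostics, Department of Radiotherapy, University Medical Center Utrecht

AI summary: This paper proposes an efficient, discretization-free MRI reconstruction framework that exploits tensor-product structure to reconstruct dynamic 2D and 3D images with high accuracy in high-dimensional spatiotemporal settings, preserving structural and motion information even under heavy undersampling.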

2605.08271 2026-05-12 cs.CV cs.AI (version update)

Dimensional Coactivation for Representational Consistency in Frozen Vision Foundation Models

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

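Affiliations: MCRLab, School of Electrical Engineering and Computer Science, University of Ottawa

AI summary: This paper proposes Dimensional Coactivation (DCA), a method for assessing intra-sample representational consistency in frozen vision foundation models. By comparing feature-dimension coactivation across semantic subregions, DCA performs strongly on deepfake detection, validating its effectiveness on frozen models.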

AI中文摘要

冻结视觉基础模型不仅提取特征，还通过学习的坐标系统组织图像。本文探讨该坐标系统在单个输入内部是否保持一致性，提出表征一致性概念。引入维度共激活(DCA)，一种用于测量这种一致性的工具。DCA通过比较语义区域是否在同一特征维度上共激活来评估一致性。与经典相似性度量不同，DCA有意避免中心化、L2归一化和完全Gram耦合。深度伪造检测提供了自然的验证任务。合成面孔可能在眼睛、鼻子和嘴巴区域生成合理外观，但破坏真实面孔中这些区域的表征结构。使用冻结的DINOv3特征，DCA揭示了这种破坏：眼-嘴-鼻指纹在CelebDF-v2和DFD上分别达到0.9106和0.9289的AUC。设计还通过消融实验验证：重新引入中心化使CelebDF-v2 AUC降至0.459，L2归一化降至0.862，跨维度耦合降至0.478。最后，用FaRL替代DINOv3使CelebDF-v2 AUC降至0.582。DCA因此依赖于稳定的每维度坐标系统，而非单纯的区域提取。这些结果将DCA定位为测量冻结基础模型中样本内部表征一致性的工具，深度伪造检测是首个验证任务。

英文摘要

Frozen vision foundation models do not merely extract features; they organize images through a learned coordinate system. We ask whether that coordinate system remains internally coherent within a single input. This leads to Representational Consistency: the study of whether a frozen foundation model represents one sample coherently across its semantic subregions. We introduce Dimensional Coactivation (DCA), a per-dimension instrument for measuring this coherence. DCA compares semantic regions by asking whether the same feature dimensions coactivate across them. Unlike classical similarity measures, it deliberately avoids centering, L2 normalization, and full Gram coupling. These operations are useful when comparing different models or distributions, but they are mismatched to the intra-sample setting, where the coordinate system is fixed and raw magnitude carries signal. Deepfake detection provides a natural validation task. Synthetic faces may reproduce plausible eyes, noses, and mouths while breaking the representational structure that links those regions in real faces. Using frozen DINOv3 features, DCA exposes this break: an eyes-mouth-nose fingerprint achieves 0.9106 AUC on CelebDF-v2 and 0.9289 on DFD under FF++ c23 cross-dataset transfer. The design is also sharply validated by ablation: reintroducing centering collapses CelebDF-v2 AUC to 0.459, L2 normalization reduces it to 0.862, and cross-dimension coupling reduces it to 0.478. Finally, replacing DINOv3 with FaRL collapses CelebDF-v2 AUC to 0.582. DCA therefore depends on a stable per-dimension coordinate system, not on region extraction alone. These results position DCA as an instrument for measuring intra-sample representational coherence in frozen foundation models, with deepfake detection as the first validation task.

2605.08246 2026-05-12 cs.CV cs.CR cs.LG (version update)

Smart Railway Obstruction Detection System using IoT and Computer Vision

基于物联网和计算机视觉的智能铁路障碍检测系统

Pravin Kumar, Mritunjay Shall Peelam, Ramakant Kumar, Sanjay Kumar, Vinay Chamola

Affiliations: School of Computer Science, University of Petroleum and Energy Studies (UPES); Department of Computer Science, Galgotias College of Engineering & Technology; Department of Computer Science, NIT Jamshedpur; Department of Electrical and Electronics Engineering, BITS-Pilani, Pilani Campus; APPCAIR, BITS-Pilani, Pilani Campus

AI summary: This paper proposes NETRA, a low-cost, low-power railway intrusion detection system built on edge computing platforms. Probabilistic sensor fusion and edge-AI classification raise detection accuracy and cut deployment cost.

AI中文摘要

铁路轨道入侵对印度铁路安全构成重大挑战，包括野生动物入侵和故意障碍。2025年12月阿萨姆发生的碰撞事件凸显了实时检测的紧迫性。现有解决方案如基于光纤的Gajraj系统成本高昂（1000美元/公里）且误报率高，限制了在101条大象走廊中的部署仅限于20条。本文提出NETRA，一种成本效益高、独立于互联网的入侵检测系统，部署在Raspberry Pi Zero W和Raspberry Pi 4边缘平台上。NETRA采用概率传感器融合，整合PIR运动传感器和HC-SR04超声波距离传感器，通过可调阈值（tau_c=0.65）实现事件驱动的相机激活，减少不必要的视觉处理52%。确认入侵后，使用MobileNet-SSD（Pi Zero）或YOLOv5 ONNX（Pi 4）的边缘AI分类识别威胁，包括人类、大型动物和轨道障碍。确认的威胁通过LoRa（868 MHz）传输，2.4秒内通知机车司机。113个运动事件的实验评估显示，概率融合方法的检测准确率为95%，无误报，优于二元方法的85%。Raspberry Pi 4与YOLOv5实现83.5%的象类F1分数，比Pi Zero的启发法方法（14.8%）提高5.6倍。LoRa通信在1-2公里范围内实现100%的数据包交付。NETRA将部署成本降低75%（247美元/公里 vs Gajraj的1000美元/公里），同时提供对野生动物和障碍威胁的统一检测。

英文摘要

Railway track intrusions pose a critical safety challenge for Indian Railways, encompassing wildlife incursions and deliberate malicious obstructions. The December 2025 collision in Assam, in which seven elephants were killed by the Rajdhani Express, underscores the urgency of effective real-time detection. Existing solutions such as the optical fiber-based Gajraj system suffer from prohibitive costs ($1000/km) and high false alarm rates, limiting deployment to only 20 of India's 101 elephant corridors. This paper proposes NETRA, a cost-effective, internet-independent intrusion detection system deployed on Raspberry Pi Zero W and Raspberry Pi 4 edge platforms. NETRA employs probabilistic sensor fusion integrating a PIR motion sensor and an HC-SR04 ultrasonic distance sensor with a tunable threshold (tau_c = 0.65), enabling event-driven camera activation that reduces unnecessary visual processing by 52%. Upon confirmed intrusion, edge-AI classification using MobileNet-SSD (Pi Zero) or YOLOv5 ONNX (Pi 4) identifies threats including humans, large animals, and track obstructions. Confirmed threats are transmitted via LoRa (868 MHz) to alert the locomotive driver within 2.4 seconds end-to-end. Experimental evaluation across 113 motion events demonstrated 95% detection accuracy with zero false alarms through probabilistic fusion, compared to 85% for binary methods. Raspberry Pi 4 with YOLOv5 achieved 83.5% elephant F1-score, a 5.6x improvement over Pi Zero's heuristic approach (14.8%). LoRa communication achieved 100% packet delivery across 1-2 km in field trials. NETRA reduces deployment cost by 75% ($247/km vs $1000/km for Gajraj) while providing unified detection of both wildlife and obstruction threats.
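
The abstract does not state NETRA's fusion rule, so the following is only a sketch of how threshold-gated camera activation of this kind could work. It assumes a weighted sum of a binary PIR confidence and a distance-derived ultrasonic confidence. The equal weights, the linear distance mapping, and all function names are hypothetical; only the threshold tau_c = 0.65 comes from the abstract, and the 400 cm limit reflects the nominal HC-SR04 maximum range.

```python
TAU_C = 0.65  # activation threshold quoted in the abstract


def ultrasonic_confidence(distance_cm, max_range_cm=400.0):
    """Map a distance reading to [0, 1]; closer objects score higher (assumed mapping)."""
    if distance_cm is None or distance_cm <= 0 or distance_cm >= max_range_cm:
        return 0.0  # no echo or out-of-range reading
    return 1.0 - distance_cm / max_range_cm


def fused_confidence(pir_triggered, distance_cm, w_pir=0.5, w_ultra=0.5):
    """Weighted sum of the two sensor confidences (weights are hypothetical)."""
    pir_conf = 1.0 if pir_triggered else 0.0
    return w_pir * pir_conf + w_ultra * ultrasonic_confidence(distance_cm)


def should_activate_camera(pir_triggered, distance_cm, tau_c=TAU_C):
    """Wake the camera only when the fused confidence reaches the threshold."""
    return fused_confidence(pir_triggered, distance_cm) >= tau_c


# Motion plus a nearby object passes the gate; motion alone at long range does not.
print(should_activate_camera(True, 100.0))
print(should_activate_camera(True, 350.0))
print(should_activate_camera(False, 50.0))
```

Gating the camera on a fused score rather than on either sensor alone is what lets a design like this skip visual processing for weak or isolated triggers. The headline ratios in the abstract are also internally consistent: (1000 - 247) / 1000 is about 75.3%, and 83.5 / 14.8 is about 5.6.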

2605.08241 2026-05-12 cs.CV cs.AI (version update)

TinySSL: Distilled Self-Supervised Pretraining for Sub-Megabyte MCU Models

TinySSL：用于子兆字节MCU模型的蒸馏自监督预训练

Bibin Wilson

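Affiliations: Bibin Wilson

AI summary: This paper proposes the CA-DSSL framework, which uses distillation and self-supervised learning to learn representations efficiently for sub-megabyte MCU models, outperforming SimCLR-Tiny and matching SEED with fewer projection parameters.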

AI中文摘要

自监督学习（SSL）已改变大模型的表示学习，但在微控制器（MCU）类模型中仍无探索。本文识别了三个障碍：投影头主导、表示瓶颈和增强敏感性，并提出容量感知蒸馏自监督学习（CA-DSSL），一种教师引导框架，无需标签或文本监督即可克服这些障碍。CA-DSSL结合了不对称蒸馏、多尺度特征蒸馏和渐进增强课程。在MobileNetV2-0.35主干上预训练CIFAR-100，CA-DSSL达到62.7 0.5%线性探针准确率（3种子均值），优于SimCLR-Tiny 18个百分点，与SEED（61.7%）匹配，但参数更少。标准SSL方法（BYOL-Tiny、DINO-Tiny）在该规模完全崩溃。在Pascal VOC检测中，CA-DSSL达到随机初始化的2.3倍mAP，并在SEED上高出3个百分点，尽管SimCLR-Tiny在检测mAP上匹配CA-DSSL。部署主干占用378 KB（INT8）且无推理开销。初步ImageNet-100实验表明，CA-DSSL的优势特定于小数据集；扩展到ImageNet-1K作为未来工作。

英文摘要

Self-supervised learning (SSL) has transformed representation learning for large models, yet remains unexplored for microcontroller (MCU)-class models with fewer than 500K parameters. We identify three obstacles at this scale -- projection head dominance, representation bottleneck, and augmentation sensitivity -- and propose Capacity-Aware Distilled Self-Supervised Learning (CA-DSSL), a teacher-guided framework that overcomes them without labels or text supervision. CA-DSSL combines asymmetric distillation from a frozen DINO ViT-S/16 teacher, multi-scale feature distillation for spatial representations, and a progressive augmentation curriculum. On a MobileNetV2-0.35 backbone (396K parameters) pretrained on CIFAR-100, CA-DSSL reaches 62.7 ± 0.5% linear-probe accuracy (3-seed mean) -- surpassing SimCLR-Tiny by 18 pp, matching SEED (61.7%) with 10× fewer projection parameters (426K vs. 3.15M), and reaching 94.0% of a supervised upper bound. Standard SSL methods (BYOL-Tiny, DINO-Tiny) collapse entirely at this scale. On Pascal VOC detection, CA-DSSL achieves 2.3× the mAP of random initialization and +3 pp over SEED, though SimCLR-Tiny matches CA-DSSL on detection mAP. The deployed backbone occupies 378 KB (INT8) with no inference overhead from pretraining. Preliminary ImageNet-100 experiments reveal that CA-DSSL's advantage is specific to small-data regimes; scaling to ImageNet-1K is discussed as future work.

2605.08238 2026-05-12 cs.CV cs.AI cs.ET cs.LG (version update)

Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation

面向资源的进化神经架构搜索用于心脏MRI分割

Farhana Yasmin, Mahade Hasan, Haipeng Liu, Amjad Ali, Ghulam Muhammad, Yu Xue

Affiliations: School of Computer Science, Nanjing University of Information Science and Technology; School of Software, Nanjing University of Information Science and Technology; Eastern University; Research Centre for Intelligent Healthcare, Coventry University; National Medical Research Association; Faculty of Engineering and Technology, Muscat University; College of Computer and Information Sciences, King Saud University

AI summary: This paper proposes CardiacNAS, an evolutionary neural architecture search framework that combines a UNet-like supernet with a cardiac-aware search space. By balancing Dice similarity coefficient and Hausdorff distance against model size and FLOPs, it achieves accurate and efficient cardiac MRI segmentation.

Journal ref F. Yasmin et al., "Resource-Aware Evolutionary Neural Architecture Search for Cardiac MRI Segmentation," 28th International Conference on Computer and Information Technology (ICCIT), 2025, pp. 2819-2824

DOI: 10.1109/ICCIT68739.2025.11491084

AI中文摘要

心脏磁共振（CMR）分割是评估心室结构和功能定量评估的基础，但可靠分割仍因低组织对比度、模糊边界和扫描变异性而困难。本文提出CardiacNAS，一种进化神经架构搜索（NAS）框架，结合UNet-like supernet与覆盖深度、宽度、核大小、滤波器大小、注意力、融合、激活、丢弃和残差缩放的心脏感知搜索空间。搜索过程明确考虑资源，联合优化Dice相似度系数（DSC）和95百分位Hausdorff距离（HD95）与模型大小和浮点运算（FLOPs）在固定计算预算下。候选架构从supernet实例化，通过代理预算训练，并通过交叉、突变和精英选择进化。在ACDC数据集上评估并与六种最先进的方法进行比较，使用定性比较、学习曲线分析和设计因素相关性研究。所得到的模型在3.58M参数和14.56GFLOPs下达到93.22%的平均DSC和4.73mm HD95，展示了良好的精度效率权衡。分析表明，搜索的注意力和融合选择，以及残差缩放，有助于改进边界保真度和稳定性。CardiacNAS提供了一种原则性、资源感知的方法，用于可部署的CMR分割，具有透明报告的架构复杂性和计算预算。

英文摘要

Cardiac magnetic resonance (CMR) segmentation underpins quantitative assessment of ventricular structure and function, yet reliable delineation remains difficult due to low tissue contrast, fuzzy boundaries, and inter-scan variability. We present CardiacNAS, an evolutionary neural architecture search (NAS) framework that couples a UNet-like supernet with a cardiac-aware search space spanning depth, width, kernel size, filter size, attention, fusion, activation, dropout, and residual scaling. The search is explicitly resource-aware, jointly optimizing Dice similarity coefficient (DSC) and 95th-percentile Hausdorff distance (HD95) against model size and floating-point operations (FLOPs) under fixed compute budgets. Candidate architectures are instantiated from the supernet, trained with proxy budgets, and evolved through crossover, mutation, and elitist selection. We evaluate on the ACDC dataset and compare against six state-of-the-art methods, using qualitative comparisons, learning-curve analyses, and design-factor correlation studies. The resulting model attains 93.22% average DSC and 4.73 mm HD95 with 3.58M parameters and 14.56 GFLOPs, demonstrating a favorable accuracy-efficiency trade-off. Analyses indicate that searched attention and fusion choices, together with residual scaling, contribute to improved boundary fidelity and stability. CardiacNAS offers a principled, resource-aware approach to deployable CMR segmentation with transparent reporting of architectural complexity and compute budgets.
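
The DSC reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. A minimal plain-Python sketch follows, assuming binary masks flattened to 0/1 lists; the function name and toy masks are illustrative, and returning 1.0 when both masks are empty is a common convention rather than something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Count pixels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total


# Toy 2x2 masks flattened row by row (hypothetical values).
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_coefficient(predicted, reference))
```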

2605.08226 2026-05-12 cs.CV (version update)

SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

SPECTRA-Net：可扩展的可解释跨域张量表示可解释性AI生成图像检测流水线

Sarra Arab, Anfal Achouri, Seif Eddine Bouziane

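Affiliations: The National School of Artificial Intelligence

AI summary: This paper proposes SPECTRA-Net, which combines global semantic features, spectral analysis, local patch-based anomaly detection, and statistical descriptors in a multi-view image representation to detect AI-generated images across domains with high accuracy and strong generalization.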

Comments 13 pages, 2 figures, submitted to a journal

AI中文摘要

AI生成图像的快速普及对数字信息完整性构成重大挑战。尽管人类观察者和现有检测模型难以跟上生成模型复杂性的提升，对稳健、实时检测系统的需求变得至关重要。本文介绍了SPECTRA-Net，一种可扩展的可解释跨域张量表示流水线，用于AI生成图像检测。我们的方法利用图像的多视角表示，结合来自视觉基础模型（VFM）的全局语义特征、频谱分析、基于局部补丁的异常检测和统计描述符。通过融合这些互补的数据流，SPECTRA-Net在域内和跨域设置中均实现了最先进的性能，展示了在广泛挑战数据集（包括WildFake、Chameleon和RRDataset）中的高准确性和泛化能力。所提出的流水线不仅为AI生成图像检测提供了稳健的解决方案，还通过缺陷定位提供可解释性，为现实应用中的更可信和可靠的内容验证铺平了道路。

英文摘要

The rapid proliferation of AI-generated images (AIGI) presents a significant challenge to digital information integrity. While human observers and existing detection models struggle to keep pace with the increasing sophistication of generative models, the need for robust, real-time detection systems has become critical. This paper introduces SPECTRA-Net, a scalable pipeline for explainable, cross-domain tensor representations for AIGI detection. Our approach leverages a multi-view representation of images, combining global semantic features from a Vision Foundation Model (VFM), spectral analysis, local patch-based anomaly detection, and statistical descriptors. By fusing these complementary data streams, SPECTRA-Net achieves state-of-the-art performance in both in-domain and cross-domain settings, demonstrating high accuracy and generalization capabilities across a wide range of challenging datasets, including WildFake, Chameleon, and RRDataset. The proposed pipeline not only provides a robust solution for AIGI detection but also offers explainability through artifact localization, paving the way for more trustworthy and reliable content verification in real-world applications.

2605.08222 2026-05-12 cs.CV cs.AI cs.IR (version update)

From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline

从历史表格图像到知识图谱：一个具有溯源意识的模块化流水线

Sarah Binta Alam Shoilee, Victor de Boer, Jacco van Ossenbruggen, Susan Legêne

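Affiliations: Vrije Universiteit Amsterdam

AI summary: This paper presents a modular, provenance-aware pipeline that converts handwritten tabular images into knowledge graphs and supports human-AI collaboration. The workflow is decomposed into three stages (table reconstruction, information extraction, and knowledge graph construction), ensuring that extracted entities and literals remain traceable to their visual and textual origins.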

Comments A shorter version of this paper has been accepted at the 5th International Conference on Hybrid Human-Artificial Intelligence (HHAI 2026)

AI中文摘要

手写档案表格包含丰富的历史信息，但将其转换为结构化表示，如知识图谱，需要整合表格结构识别、手写识别和语义解释——一个复杂的多模态过程。端到端的AI实现可能会掩盖这些步骤，导致不透明的算法操作，阻碍人类监督、关键评估和信任。为此，我们提出一个模块化、具有溯源意识的流水线，将手写表格图像转换为知识图谱，支持人机协作。该流水线将工作流程分解为三个阶段——表格重建、信息提取和知识图谱构建——同时暴露中间表示以供检查、评估和修正。我们方法的一个关键贡献是系统地在每个阶段整合数据溯源，确保提取的所有实体和文字都能追溯到其视觉和文本来源。所提出的流水线通过在真实世界档案材料中进行多项实验来展示，涉及军事生涯。在三种不同的表格重建变体上的结果突显了模块化的重要性。通过将模块化与数据溯源相结合，我们的工作推进了透明且可协作控制的图像到知识图谱的流水线，用于复杂的史数据。

英文摘要

Handwritten archival tables contain rich historical information, yet transforming them into structured representations, such as Knowledge Graphs, requires integrating table structure recognition, handwriting recognition, and semantic interpretation - a complex multimodal process. End-to-end AI implementations can obscure these steps, resulting in opaque algorithmic operations that hinder human oversight, critical assessment, and trust. To address this, we present a modular, provenance-aware pipeline to convert handwritten tabular images into KGs supporting human-AI collaboration. The pipeline decomposes the workflow into three stages - table reconstruction, information extraction, and KG construction - while exposing intermediate representations for inspection, evaluation, and correction. A key contribution of our approach is the systematic integration of data provenance at every stage, ensuring that all extracted entities and literals remain traceable to their visual and textual origins. The proposed pipeline is demonstrated through a number of experiments on real-world archival material concerning military careers. The results across three different table reconstruction variants highlight the importance of modularisation. By coupling modularity with data provenance, our work advances transparent and collaboratively controllable image-to-KG pipelines for complex historical data.

2605.08220 2026-05-12 cs.AI cs.CE cs.CL cs.CV cs.SE (version update)

Spatial Priming Outperforms Semantic Prompting: A Grid-Based Approach to Improving LLM Accuracy on Chart Data Extraction

空间提示优于语义提示：一种基于网格的方法以提高LLM在图表数据提取上的准确性

Andrei Lazarev, Dmitrii Sedov, Alexander Galkin

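Affiliations: Russian Federation

AI summary: By comparing spatial priming with semantic priming, this paper finds that a grid-based priming method significantly reduces chart data extraction error and outperforms conventional semantic guidance strategies.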

Comments This is the version of the article accepted for publication in SUMMA 2025 after peer review. The final, published version is available at IEEE Xplore: https://doi.org/10.1109/SUMMA68668.2025.11302248

Journal ref 2025 7th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA), Lipetsk, Russian Federation, 2025, pp. 799-804

DOI: 10.1109/SUMMA68668.2025.11302248

AI中文摘要

自动从科学图表中提取数据是大规模文献分析的关键任务。尽管多模态大语言模型（LLMs）展现出潜力，但其在非标准化图表上的准确性仍面临挑战。本文探讨了两种策略：高阶语义提示和低阶空间提示。我们发现，基于网格的提示方法在合成数据集上显著降低了数据提取误差（SMAPE从25.5%降至19.5%，p < 0.05），优于传统语义方法。结论表明，对于当前多模态模型而言，提供明确的空间上下文比高阶语义指导更有效且可靠。

英文摘要

The automated extraction of data from scientific charts is a critical task for large-scale literature analysis. While multimodal Large Language Models (LLMs) show promise, their accuracy on non-standardized charts remains a challenge. This raises a key research question: which strategy improves model performance more effectively, high-level semantic priming or low-level spatial priming? This paper presents a comparative investigation into these two distinct strategies. We describe our exploratory experiments with semantic methods, such as a two-stage metadata-first framework and Chain-of-Thought, which failed to produce a statistically significant improvement. In contrast, we present a simple but highly effective spatial priming method: overlaying a coordinate grid onto the chart image before analysis. Our quantitative experiment on a synthetic dataset demonstrates that this grid-based approach provides a statistically significant reduction in data extraction error (SMAPE reduced from 25.5% to 19.5%, p < 0.05) compared to a baseline. We conclude that for the current generation of multimodal models, providing explicit spatial context is a more effective and reliable strategy than high-level semantic guidance for this class of tasks.
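
The error metric quoted above, SMAPE (symmetric mean absolute percentage error), has more than one definition in common use. The sketch below assumes the variant that divides each absolute error by the mean of the absolute actual and predicted values, which ranges from 0% to 200%; the abstract does not say which variant the authors used, and the function name and sample values are illustrative.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error in percent (0 to 200)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # A pair of zeros contributes no error instead of dividing by zero.
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical ground-truth chart values versus values read off by a model.
truth = [100.0, 200.0]
extracted = [110.0, 180.0]
print(f"SMAPE: {smape(truth, extracted):.2f}%")
```

In relative terms, the reported drop from 25.5% to 19.5% is a reduction of 6 percentage points, or about 23.5% of the baseline error.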

2605.08218 2026-05-12 cs.LG cs.CV (version update)

Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

深度梦境由此构成：在扩散模型中可视化单义特征

Adam Szokalski, Mateusz Modrzejewski

AI总结本文提出通过优化的潜在可视化（LVO）技术，扩展了用于卷积神经网络的特征可视化方法至潜在扩散模型。通过稀疏自编码器（SAEs）将多义层表示分解为单义特征，展示了在Stable Diffusion 1.5模型上可视化可识别概念的效果，相比基线方法更清晰且具有相关性。

详情

AI中文摘要

本文提出潜在可视化通过优化（LVO），一种机制可解释性技术，将最初为卷积神经网络开发的特征可视化通过优化方法扩展到潜在扩散模型。LVO利用稀疏自编码器（SAEs）将多义层表示分解为单义特征。关键贡献包括潜在空间优化、时间步活动分析、匹配计划的噪声注入、通过特征引导的先验初始化以及合适的正则化策略。我们在Stable Diffusion 1.5模型上进行Style50数据集微调，展示了SAE特征能产生清晰的可识别概念可视化，如对角线构图、人物、玫瑰、电缆和瀑布泡沫，这些与数据集示例相关联，而无解纠缠的基线方法则产生不连贯的结果。我们进一步表明，从像素空间特征可视化转移来的正则化技术也适用于潜在域，尽管需要为原始层和SAE变体进行不同的配置。与数据集示例和引导相比，LVO通过直接揭示激活特征的内容而非其下游效果提供互补见解。

英文摘要

This paper proposes latent visualization by optimization (LVO), a mechanistic interpretability technique that extends feature visualization by optimization - originally developed for convolutional neural networks - to latent diffusion models. LVO employs sparse autoencoders (SAEs) to disentangle polysemantic layer representations into monosemantic features. Key contributions include latent-space optimization, time-step activity analysis, schedule-matched noise injection, prior initialization through feature steering, and suitable regularization strategies. We demonstrate the method on Stable Diffusion 1.5 fine-tuned on the Style50 dataset, showing that SAE features produce clear visualizations of recognizable concepts - including diagonal compositions, human figures, roses, cables, and waterfall foam - that correlate with dataset examples, while the baseline without disentanglement produces less coherent results. We further show that regularization techniques from pixel-space feature visualization transfer to the latent domain, though they require different configurations for the raw-layer and SAE variants. Compared to dataset examples and steering, LVO provides complementary insights by directly revealing what activates a feature rather than its downstream effects.

2605.08213 2026-05-12 cs.CV (version update)

Weakly Supervised Concept Learning for Object-Centric Visual Reasoning

Sparsh Tiwari, Bettina Finzel, Gesina Schwalbe

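Affiliations: University of Lübeck; University of Bamberg; University of Ulm

AI summary: This paper proposes an efficient weak supervision scheme that combines a slot-based architecture with a VAE to achieve symbol grounding in object-centric reasoning tasks, reducing supervision to 1% of labels and improving domain generalization.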

AI中文摘要

神经符号系统旨在结合深度神经网络对原始传感器输入的处理与符号人工智能的少样本性能。两阶段方法显式解耦基于DNN的感知与后续基于规则的推理。这避免了端到端可微方法的优化和可解释性问题，但需要昂贵的感知输出标签。本文介绍了一种高效的弱监督方案，用于感知阶段以将输出符号 grounding 用于逻辑归纳的对象中心推理任务。它结合了用于对象中心性的槽式架构和变分自编码器（VAE）进行自监督，与概念指导在潜在维度上竞争，以实现人类可解释的 grounding。所得到的预测被转换为符号背景知识，供推理框架使用，如归纳逻辑编程（ILP）、决策树和贝叶斯网络。我们在合成和现实世界数据集上的广泛实证评估表明，我们的方法能够发现复杂的抽象规则用于对象中心推理，同时将监督减少到标签的1%，并且即使在显著的领域转移下也表现出鲁棒性。值得注意的是，在1%的监督下，它甚至在领域泛化上超过了最先进的基础模型基线。

英文摘要

Neurosymbolic systems promise to combine deep neural networks' (DNNs') processing of raw sensor inputs with the few-shot performance of symbolic artificial intelligence. Two-stage approaches explicitly decouple DNN-based perception from subsequent rule-based reasoning. This avoids the optimization and interpretability issues of end-to-end differentiable approaches, but requires costly labels for the perception output. This paper introduces an efficient weak supervision scheme for the perception stage to ground its output symbols for logical induction in object-centric reasoning tasks. It combines a slot-based architecture for object-centricity with a Variational Autoencoder (VAE) for self-supervision, competing with concept guidance on latent dimensions for human-interpretable grounding. The resulting predictions are translated into symbolic background knowledge for reasoning frameworks, such as Inductive Logic Programming (ILP), Decision Trees, and Bayesian Networks. Our extensive empirical evaluation on synthetic and real-world datasets shows that our approach can discover complex, abstract rules for object-centric reasoning while reducing supervision to as little as 1% of labels and remaining robust even under substantial domain shift. Notably, at 1% supervision it even outperforms state-of-the-art foundation model baselines in domain generalization.


2605.08200 2026-05-12 cs.AI cs.CV cs.LG version update

Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits

在视觉-语言模型中可靠性在哪里存在：注意力、隐藏状态和因果回路的机制研究

Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang

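Affiliations: UC Santa Barbara; UC Berkeley; NVIDIA; Algoverse AI Research; Brown University

AI summary: Through a mechanistic study, this paper finds that reliability in vision-language models is read mainly from hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits rather than from the sharpness of attention maps.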

Comments 15 pages, 4 figures, 10 tables. Accepted at the ICLR 2026 Workshop on Multimodal Reasoning. Code and probe-training pipelines: https://github.com/itsloganmann/VLM-Reliability-Probe

AI-generated Chinese abstract

一种普遍的直觉认为，视觉-语言模型（VLMs）在注意力图看起来锐利时最为可信：集中在查询区域的注意力应意味着自信且校准的回答。我们直接测试了这一注意力-信心假设。我们通过统一的机制性流程——VLM可靠性探针（VRP）——对三个开源权重VLM家族（LLaVA-1.5、PaliGemma、Qwen2-VL；3-7B参数）进行了仪器化，将注意力结构、生成动态和隐藏状态几何与单一正确性标签进行比较。三个结果出现。（i）注意力结构是正确性几乎零预测因子（R_pb(C_k,y)=0.001，95% CI[-0.034,0.036]；R_pb(H_s,y)=-0.012，[-0.047,0.024]在合并n=3,090分割中），尽管注意力在特征提取中仍然是因果必要的（前30%补丁遮蔽使准确性下降8.2-11.3个百分点，p<0.001）。（ii）可靠性在计算后期变得清晰：一个隐藏状态线性探针在POPE上达到AUROC>0.95，对于两个家族，而自我一致性在K=10时是测量到的最强行为预测因子，10倍推理成本（R_pb=0.43）。（iii）因果神经元层面的删除暴露了具有直接监控设计影响的明显架构分裂：晚期融合LLaVA将可靠性集中在脆弱的晚期瓶颈（在前5探针神经元删除后，物体识别准确性下降8.3个百分点），而早期融合PaliGemma和Qwen2-VL则广泛分布，并能吸收约50%峰值层隐藏维度的破坏，降幅不超过1个百分点。结论是狭窄但重要的：在3-7B VLMs中，可靠性更可靠地从隐藏状态几何、分层边际形成和稀疏晚层回路中读取，而非注意力图的锐度。

English abstract

A pervasive intuition holds that vision-language models (VLMs) are most trustworthy when their attention maps look sharp: concentrated attention on the queried region should imply a confident, calibrated answer. We test this Attention-Confidence Assumption directly. We instrument three open-weight VLM families (LLaVA-1.5, PaliGemma, Qwen2-VL; 3-7B parameters) with a unified mechanistic pipeline -- the VLM Reliability Probe (VRP) -- that compares attention structure, generation dynamics, and hidden-state geometry against a single correctness label. Three results emerge. (i) Attention structure is a near-zero predictor of correctness (R_pb(C_k,y)=0.001, 95% CI [-0.034,0.036]; R_pb(H_s,y)=-0.012, [-0.047,0.024] on a pooled n=3,090 split), even though attention remains causally necessary for feature extraction (top-30% patch masking drops accuracy by 8.2-11.3 pp, p<0.001). (ii) Reliability becomes legible later in the computation: a single hidden-state linear probe reaches AUROC>0.95 on POPE for two of three families, and self-consistency at K=10 is the strongest behavioral predictor we measure at 10x inference cost (R_pb=0.43). (iii) Causal neuron-level ablations expose a sharp architectural split with direct monitor-design implications: late-fusion LLaVA concentrates reliability in a fragile late bottleneck (-8.3 pp object-identification accuracy after top-5 probe-neuron ablation), whereas early-fusion PaliGemma and Qwen2-VL distribute it widely and absorb destruction of ~50% of their peak-layer hidden dimension with <=1 pp degradation. The takeaway is narrow but consequential: in 3-7B VLMs, reliability is read more reliably off hidden-state geometry, layer-wise margin formation, and sparse late-layer circuits than off attention-map sharpness.
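The correlations reported in result (i) are point-biserial coefficients between a continuous attention statistic and the binary correctness label. As a rough illustration of that statistic, and not of the VRP pipeline itself, the sketch below computes it in plain Python. The function name `point_biserial` and the toy inputs are assumptions made for this example.

```python
import math


def point_biserial(x, y):
    """Point-biserial correlation between continuous scores x and 0/1 labels y."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    mean_all = sum(x) / n
    # Population standard deviation of x over all samples.
    std = math.sqrt(sum((xi - mean_all) ** 2 for xi in x) / n)
    if std == 0:
        return 0.0  # a constant score carries no information about y
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    p = len(pos) / n  # fraction of positive (correct) samples
    return (mean_pos - mean_neg) / std * math.sqrt(p * (1 - p))


if __name__ == "__main__":
    # Hypothetical attention-sharpness scores and correctness labels.
    sharpness = [0.2, 0.9, 0.4, 0.7]
    correct = [1, 0, 0, 1]
    print(point_biserial(sharpness, correct))
```

A value near zero, as in the abstract, means the mean score is about the same for correct and incorrect answers, so the score does not separate the two groups.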


2605.08196 2026-05-12 cs.CV version update

Survey on Disaster Management Datasets for Remote Sensing Based Emergency Applications

基于遥感应急应用的灾害管理数据集综述

Alain P. Ndigande, Josiah Wiggins, Sedat Ozer

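Affiliations: Ozer Lab, Dept. of Computer Science, Ozyegin University; Dept. of Electrical and Computer Engineering, California State Polytechnic University, Pomona

AI summary: This survey reviews publicly available image datasets for machine-learning and deep-learning disaster management pipelines, focusing on high-quality datasets that support remote sensing tasks, to help disaster response solutions be developed and deployed quickly.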

Comments This work has been accepted for publication at IEEE Transactions on Geoscience and Remote Sensing

AI-generated Chinese abstract

近期自然灾害凸显了高效数据驱动灾害管理的紧迫需求。机器学习（ML）和深度学习（DL）技术在灾害管理的关键阶段，包括缓解、准备、检测、响应和恢复中显示出巨大潜力。然而，成功的ML或DL应用的关键在于可访问性和质量的注释数据集。随着无人机（UAV）和卫星高分辨率影像的日益可用，计算机视觉和遥感算法已成为灾害场景中快速检测、态势评估和决策的重要工具。本文全面概述了与ML/DL灾害管理流程相关的公开图像数据集，强调支持遥感任务的数据集，涵盖灾害事件的所有阶段，包括灾前、灾中和灾后。本文旨在为寻求高质量数据集以快速开发和部署遥感驱动灾害响应解决方案的研究人员和实践者提供集中参考。

English abstract

Recent natural disasters have highlighted the urgent need for efficient data-driven approaches to disaster management. Machine learning (ML) and deep learning (DL) techniques have shown considerable promise in enhancing the key phases of disaster management including mitigation, preparedness, detection, response, and recovery. A critical enabler of successful ML or DL based applications in remote sensing, however, is the accessibility and quality of annotated datasets. With the growing availability of high-resolution imagery from unmanned aerial vehicles (UAVs) and satellites, computer vision and remote sensing algorithms have become essential tools for rapid detection, situational assessment, and decision-making in disaster scenarios. This survey provides a comprehensive overview of publicly available image-based datasets relevant to ML/DL-based disaster management pipelines. Emphasis is placed on datasets that support computer vision and remote sensing tasks across all phases of disaster events including pre-disaster, during, and post-disaster. The goal of this work is to serve as a centralized reference for researchers and practitioners seeking high-quality datasets for rapid development and deployment of remote sensing-driven disaster response solutions.


2605.08191 2026-05-12 cs.CV cs.AI version update

A Robust Out-of-Distribution Detection Framework via Synergistic Smoothing

通过协同平滑实现稳健的分布外检测框架

Maria Stoica, Abdelrahman Hekal, Alessio Lomuscio

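Affiliations: Imperial College London; Zeroth Research

AI summary: This paper proposes ROSS, a framework that uses the instability of baseline scores to separate in-distribution from out-of-distribution samples and is strongly robust to adversarial attacks; experiments show it performs well on several datasets.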

Comments Accepted to CVPR Findings 2026

AI-generated Chinese abstract

可靠的分布外（OOD）检测是安全部署机器学习系统的关键要求。尽管近期有进展，最先进的OOD检测器对对抗攻击高度敏感，这影响了其在自动化系统中的可信度。为了解决这一漏洞，我们应用中位数平滑处理基线OOD检测分数，平衡干净和对抗准确率。我们的关键见解是，用于中位数平滑生成的噪声样本可以被重新利用来量化基线分数的局部不稳定性。我们观察到，OOD样本在扰动下表现出更高的不稳定性。基于此，我们提出ROSS，一种新颖且稳健的后处理OOD检测器，利用基线分数的不稳定性进一步区分ID和OOD样本。ROSS实现了对称鲁棒性，对分数最小化和最大化攻击均表现强劲，不同于先前工作。这种对称防御导致了最先进的鲁棒性，比先前方法高出多达40 AUROC点。我们在CIFAR-10、CIFAR-100和ImageNet上进行了广泛的实验。代码可在：https://github.com/Abdu-Hekal/ROSS获取。

English abstract

Reliable out-of-distribution (OOD) detection is a critical requirement for the safe deployment of machine learning systems. Despite recent progress, state-of-the-art OOD detectors are highly susceptible to adversarial attacks, which undermines their trustworthiness in automated systems. To address this vulnerability, we apply median smoothing to baseline OOD detection scores, balancing clean and adversarial accuracies. Our key insight is that the noisy samples generated for median smoothing can be repurposed to quantify the local instability of the base score. We observe that OOD samples exhibit higher instability under perturbation. Based on this, we propose ROSS, a novel and robust post-hoc OOD detector that leverages the instability of baseline scores to further distinguish between in-distribution (ID) and OOD samples. ROSS achieves symmetric robustness, performing strongly against both score-minimising and score-maximising attacks, unlike prior work. This symmetric defence leads to state-of-the-art robustness, outperforming prior methods by up to 40 AUROC points. We demonstrate ROSS's effectiveness on extensive experiments across CIFAR-10, CIFAR-100, and ImageNet. Code is available at: https://github.com/Abdu-Hekal/ROSS.
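The abstract's key idea, reusing the noisy samples drawn for median smoothing to measure how unstable the base score is, can be sketched as follows. This is a minimal illustration under stated assumptions, not the ROSS implementation: Gaussian input noise, the interquartile range as the instability measure, and the names `smoothed_score_and_instability`, `sigma`, and `n_samples` are all choices made for this example. How ROSS combines the two quantities into a final score is not given in the abstract and is not reproduced here.

```python
import random
import statistics


def smoothed_score_and_instability(base_score, x, sigma=0.1, n_samples=64, seed=0):
    """Return (median, spread) of base_score over Gaussian-perturbed copies of x.

    base_score: callable mapping a list of floats to a float OOD score.
    x:          input represented as a flat list of floats.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        scores.append(base_score(noisy))
    # Median smoothing: the median of the scores on the noisy copies.
    median = statistics.median(scores)
    # Instability: interquartile range of the same scores (an assumed measure).
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return median, q3 - q1


if __name__ == "__main__":
    # Hypothetical base score: larger means "more in-distribution".
    toy_score = lambda v: -sum(vi * vi for vi in v)
    print(smoothed_score_and_instability(toy_score, [0.5, -0.2, 0.1]))
```

The same noisy samples serve both purposes, so the instability estimate adds no extra forward passes beyond those already needed for smoothing.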


2605.08188 2026-05-12 cs.CV cs.AI version update

KARMA-MV: A Causal Question-Answering Benchmark on Music Videos

Archishman Ghosh, Abhinaba Roy, Dorien Herremans

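Affiliations: AMAAI Lab, Singapore University of Technology and Design

AI summary: KARMA-MV is a large-scale multiple-choice question-answering dataset built from 2,682 YouTube music videos. It tests whether models can integrate temporal audio-visual cues and reason about visual-to-musical influence, and a causal knowledge graph approach improves causal reasoning on music videos.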

AI-generated Chinese abstract

尽管在视频问答和跨模态理解方面取得了显著进展，但关于视觉动态如何驱动音乐结构的因果推理在音乐视频中仍被低估。我们介绍了KARMA-MV，一个从2682个YouTube音乐视频衍生出的大型多选问答数据集，旨在测试模型整合时序音频视觉线索并推理视觉到音乐影响的能力。与需要人工标注的传统数据集不同，KARMA-MV利用LLM推理进行可扩展的生成和验证，产生37737个多选问题。我们提出了一种因果知识图谱（CKG）方法，通过结构化检索跨模态依赖性增强视觉语言模型（VLMs）。在最先进的VLMs和LLMs上的实验表明，CKG基础带来了持续的提升，尤其是对较小的模型，确立了显式因果结构在音乐视频推理中的价值。KARMA-MV为超越相关性的因果音频视觉理解提供了新的基准。

English abstract

While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for smaller models -- establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.


2605.08174 2026-05-12 cs.LG cs.AI cs.CV version update

CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

CERSA：累积能量保留子空间适应用于内存高效的微调

Jingze Ge, Xue Geng, Yun Liu, Wanqi Dong, Wang Zhe Mark, Min Wu, Ngai-Man Cheung, Bharadwaj Veeravalli, Xulei Yang

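Affiliations: National University of Singapore; Nankai University; Singapore University of Technology and Design

AI summary: CERSA uses singular value decomposition to retain only the principal components, which lowers memory consumption; it outperforms existing PEFT methods and applies to models from several domains.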

Comments 10 pages, 7 figures, supplementary material included

AI-generated Chinese abstract

为了解决微调大型预训练模型时的内存限制，现有的参数高效微调（PEFT）方法，如LoRA，依赖于低秩更新。然而，此类更新未能充分捕捉到全参数微调中权重修改的秩特性，导致性能差距。此外，LoRA和其他现有PEFT方法仍然需要大量内存来存储完整的冻结权重，限制了其在资源受限环境中的效率。为解决这些限制，我们引入了累积能量保留子空间适应（CERSA），一种新的微调范式，利用奇异值分解（SVD）仅保留负责90%至95%谱能量的主成分。通过在此主子空间中微调低秩表示，CERSA显著降低了内存消耗。我们对不同规模和领域的模型进行了广泛评估，包括图像识别、文本到图像生成和自然语言理解。实验证果表明，CERSA在性能上始终优于最先进的PEFT方法，同时实现了显著更低的内存需求。代码将公开发布。

English abstract

To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on low-rank updates. However, such updates fail to fully capture the rank characteristics of the weight modifications observed in full-parameter fine-tuning, resulting in a performance gap. Furthermore, LoRA and other existing PEFT methods still require substantial memory to store the full set of frozen weights, limiting their efficiency in resource-constrained settings. To address these limitations, we introduce Cumulative Energy-Retaining Subspace Adaptation (CERSA), a novel fine-tuning paradigm that leverages singular value decomposition (SVD) to retain only the principal components responsible for 90% to 95% of the spectral energy. By fine-tuning low-rank representations derived from this principal subspace, CERSA significantly reduces memory consumption. We conduct extensive evaluations of CERSA across models of varying scales and domains, including image recognition, text-to-image generation, and natural language understanding. Empirical results demonstrate that CERSA consistently outperforms state-of-the-art PEFT methods while achieving substantially lower memory requirements. The code will be publicly released.
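Retaining the components responsible for 90% to 95% of the spectral energy amounts to choosing a rank from the singular values. The sketch below shows one way to do that, assuming "spectral energy" means the sum of squared singular values; the abstract does not define the term, so this definition, the function name `energy_rank`, and the example values are assumptions. In practice the singular values would come from an SVD routine such as `numpy.linalg.svd`.

```python
def energy_rank(singular_values, energy=0.90):
    """Smallest k whose top-k singular values hold at least `energy` of sum(s_i^2)."""
    if not 0.0 < energy <= 1.0:
        raise ValueError("energy must be in (0, 1]")
    s = sorted(singular_values, reverse=True)
    total = sum(v * v for v in s)
    if total == 0:
        return 0  # an all-zero matrix needs no components
    cumulative = 0.0
    for k, v in enumerate(s, start=1):
        cumulative += v * v
        if cumulative >= energy * total:
            return k
    return len(s)


if __name__ == "__main__":
    # Hypothetical singular values of one weight matrix.
    spectrum = [3.0, 1.0, 1.0, 1.0]
    print(energy_rank(spectrum, energy=0.90))
```

Only the top `k` singular directions would then be stored and adapted, which is where the memory saving described in the abstract would come from.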


2605.08173 2026-05-12 cs.CV cs.LG version update

Digital Image Forgery Detection Using Transfer Learning

Fatma Betul Buyuk, Gozde Karatas Baydogmus, Ali Buldu, Ayaulym Tulendiyeva, Zhuldyz Baizhumanova

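Affiliations: Marmara University, Department of Computer Engineering; Loyola University Chicago, Department of Computer Science; Biruni University, Department of Computer Engineering

AI summary: This paper proposes a transfer-learning framework for digital image forgery detection that combines compression-aware feature enhancement with deep convolutional neural networks, and improves detection robustness through a hybrid input representation and adaptive threshold optimization.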

AI-generated Chinese abstract

随着高级图像编辑工具的普及， manipulated digital content 的数量显著增加，给数字取证和信息安全带来严峻挑战。本文提出一种基于迁移学习的数字图像伪造检测框架，整合压缩感知特征增强与深度卷积神经网络（CNN）架构。所提出的方法引入混合输入表示，结合RGB图像与基于压缩差异的特征（FDIFF），显式突出难以检测的细微篡改痕迹。此外，基于Youden指数的模型特定自适应阈值优化策略用于提高分类可靠性，通过在真阳性率和假阳性率之间取得更好平衡。在CASIA v2.0数据集上使用多个预训练CNN架构（包括DenseNet121、VGG16、ResNet50、EfficientNetB0、MobileNet和InceptionV3）进行实验，验证了所提框架的有效性和鲁棒性。模型使用准确率、精确率、召回率、F1分数、Matthews相关系数（MCC）和ROC曲线下的面积（AUC）等综合性能指标进行评估。结果表明，DenseNet121在准确率和AUC上表现最佳，而ResNet50提供了最平衡和可靠的预测，具有最高的MCC。研究强调，仅依赖准确率不足以满足取证应用需求，因为最小化假阴性至关重要。总体而言，所提框架提高了篡改痕迹的可见性并增强了分类鲁棒性，使其适用于实际的数字图像伪造检测场景。

English abstract

The increasing availability of advanced image editing tools has led to a significant rise in manipulated digital content, posing serious challenges for digital forensics and information security. This study presents a transfer learning-based framework for digital image forgery detection that integrates compression-aware feature enhancement with deep convolutional neural network (CNN) architectures. The proposed approach introduces a hybrid input representation that combines RGB images with compression difference-based features (FDIFF), explicitly highlighting subtle manipulation artifacts that are often difficult to detect. In addition, a model-specific adaptive threshold optimization strategy based on the Youden Index is employed to improve classification reliability by achieving a better balance between true positive and false positive rates. Experiments conducted on the CASIA v2.0 dataset using multiple pretrained CNN architectures, including DenseNet121, VGG16, ResNet50, EfficientNetB0, MobileNet, and InceptionV3, demonstrate the effectiveness and robustness of the proposed framework. The models are evaluated using comprehensive performance metrics such as accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). The results show that DenseNet121 achieves the highest accuracy and AUC, while ResNet50 provides the most balanced and reliable predictions with the highest MCC. The findings emphasize that relying solely on accuracy is insufficient for forensic applications, where minimizing false negatives is critical. Overall, the proposed framework improves the visibility of manipulation artifacts and enhances classification robustness, making it suitable for real-world digital image forgery detection scenarios.
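The Youden Index mentioned above is J = TPR - FPR, and the threshold that maximises it is a common way to pick a per-model operating point. The following is a small sketch of that selection, not the paper's code: the function name `youden_threshold`, the "score >= threshold means forged" convention, and the toy data are assumptions made for illustration.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximising J = TPR - FPR.

    A sample is predicted positive when its score is >= threshold.
    labels are 1 for positive (e.g. forged) and 0 for negative.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # ties keep the lowest threshold
            best_t, best_j = t, j
    return best_t, best_j


if __name__ == "__main__":
    # Hypothetical classifier outputs and ground-truth labels.
    print(youden_threshold([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Because ties keep the lowest threshold, this sketch leans towards fewer false negatives, which matches the abstract's point that missed forgeries are the costly error in forensic use.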


2605.08161 2026-05-12 cs.CV version update

Advanced Tumor Segmentation in PET/CT Imaging: A Training Strategy Study with nnU-Net for AutoPET III

PET/CT全身成像中肿瘤分割的进阶研究：基于nnU-Net的AutoPET III自动训练策略研究

Hussain Alasmawi

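Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper presents a whole-body PET/CT tumor segmentation method that uses the nnU-Net framework to study how training strategies affect model performance, improving segmentation robustness and accuracy and placing third in the AutoPET III challenge.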

AI-generated Chinese abstract

全身PET/CT影像中的肿瘤分割对于精确疾病评估和治疗计划至关重要。然而，由于病变大小、对比度和解剖分布的差异，这一任务仍然具有挑战性。依赖手动分割过程耗时且易受观察者间和观察者内变异影响。本文提出了一种针对AutoPET III挑战的全身肿瘤分割方法，目标是构建能够跨示踪剂和多中心数据泛化模型。我们采用基于ResNet的编码器的nnU-Net框架作为基线，并系统研究了训练策略的影响，包括强度归一化、批次Dice优化和使用CraveMix的数据增强。实验表明，这些策略显著影响模型性能，特别是在减少假阳性并提高对病变变异的鲁棒性方面。最佳配置在初步测试阶段达到Dice分数高达0.80，且我们的方法在AutoPET III挑战中排名第三。代码已公开。

English abstract

Tumor segmentation in whole-body PET/CT imaging is crucial for precise disease evaluation and treatment planning. However, it remains challenging due to variability in lesion size, contrast, and anatomical distribution. Relying on manual segmentation makes the process time-consuming and prone to intra- and inter-observer variability. This work presents a whole-body tumor segmentation method developed for the AutoPET III challenge, where the goal is to build models that generalize across tracers and multi-center data. We employ the nnU-Net framework with a ResNet-based encoder as our baseline and systematically investigate the impact of training strategies, including intensity normalization, batch Dice optimization, and data augmentation using CarveMix. Our experiments show that these strategies significantly influence model performance, particularly in reducing false positives and improving robustness to lesion variability. The best-performing configuration achieves a Dice score of up to 0.80 on the preliminary test phase, and our method ranked third in the AutoPET III challenge. The code is publicly available here.
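To make the Dice score and the batch-level Dice variant named above concrete, the sketch below contrasts a per-sample average with a Dice computed over the whole batch pooled together. It is a toy illustration on flat binary masks, not nnU-Net code: the function names, the smoothing constant `eps`, and the example masks are assumptions.

```python
def dice(pred, target, eps=1e-8):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2 * intersection + eps) / (sum(pred) + sum(target) + eps)


def per_sample_dice(preds, targets):
    """Average of the Dice scores computed separately for each sample."""
    return sum(dice(p, t) for p, t in zip(preds, targets)) / len(preds)


def batch_dice(preds, targets):
    """Dice computed once over all voxels of the batch pooled together."""
    flat_pred = [v for p in preds for v in p]
    flat_target = [v for t in targets for v in t]
    return dice(flat_pred, flat_target)


if __name__ == "__main__":
    # Sample 1: lesion segmented perfectly. Sample 2: no lesion, one false-positive voxel.
    preds = [[1, 1, 1, 1], [1, 0, 0, 0]]
    targets = [[1, 1, 1, 1], [0, 0, 0, 0]]
    print(per_sample_dice(preds, targets), batch_dice(preds, targets))
```

In this toy case a single false-positive voxel on a lesion-free sample drives that sample's Dice to nearly zero, while the pooled version penalises it in proportion to its size. That difference is one reason the choice between the two can matter for data with many lesion-free scans.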


2605.08160 2026-05-12 cs.CV cs.AI version update

NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

Fang Wu, Haokai Zhao, Da Xing, Hanqun Cao, Tinson Xu, Yanchao Li, Xiangru Tang, Zehong Wang, Aaron Tu, Kuan Pang, Hanchen Wang, Hongbin Lin, Zeqi Zhou, Yinxi Li, Peng Xia, Li Erran Li, Molei Tao, Jure Leskovec, Aditya Joshi, Yejin Choi

Affiliations: Stanford University; UNSW; UCL; The University of Chicago; CUHK; Nanjing University; Brown University; Yale University; University of Notre Dame; University of Waterloo; UCB; Georgia Institute of Technology; Amazon

AI summary: This paper proposes NoiseRater, which uses meta-learning to value noise at the instance level, improving the training efficiency and generation quality of diffusion models.

AI-generated Chinese abstract

扩散模型在生成任务中取得显著成功，但其训练范式通常将注入噪声视为均匀信息。本文挑战这一假设，引入NoiseRater，一种用于扩散模型训练的元学习框架，提出参数化的噪声评估器，根据数据和时间步对个体噪声进行重要性评分，实现训练目标的自适应重新加权。评估器通过双层优化训练，以提升下游验证性能。为实现高效部署，进一步设计解耦的两阶段流程，从元训练期间的软加权过渡到标准训练期间的硬噪声选择。在FFHQ和ImageNet上的实验表明，并非所有噪声样本贡献相同，优先选择信息量大的噪声可提升训练效率和生成质量。结果表明，噪声估值成为提升扩散模型训练的互补且此前未被充分探索的维度。代码可在https://anonymous.4open.science/r/NoiseRater-DEB116获取。

English abstract

Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative. In this work, we challenge this assumption and introduce NoiseRater, a meta-learning framework for instance-level noise valuation in diffusion model training. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization to improve downstream validation performance after inner-loop diffusion updates. To enable efficient deployment, we further design a decoupled two-stage pipeline that transitions from soft weighting during meta-training to hard noise selection during standard training. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training. Our code is available at: https://anonymous.4open.science/r/NoiseRater-DEB116.


2605.08142 2026-05-12 cs.LG cs.CL cs.CV version update

Alice v1: Video Generation Enhanced by Consistency Distillation Surpasses Closed-Source Models

Wang Xiaoyu, Phong Nguyen, Chen Zhao

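Affiliations: Mirage Team Open Source Research

AI summary: Alice v1 uses consistency distillation with score regularization to improve video generation quality; it surpasses closed-source systems on automated benchmarks and is competitive in human preference studies.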

AI-generated Chinese abstract

我们提出了Alice v1，一个140亿参数的开源视频生成模型，通过一致性蒸馏与分数正则化（rCM）实现最先进的质量。与传统蒸馏方法不同，我们证明rCM蒸馏可以超越教师模型质量。我们归因于三个机制：（1）分数正则化项作为模式寻找目标，将概率质量集中在高质量输出而非覆盖完整教师分布；（2）我们的定向合成数据管道与难例挖掘提供针对失败模式（物理、手部、面孔）的训练信号，这些模式教师处理不一致；（3）一致性强制作为隐式正则化，消除对特定噪声样本的“幸运路径”依赖。Alice v1在4次去噪步骤（约8秒在H100上）生成5秒720p视频，比50步教师模型快7倍，同时将VBench分数从84.0（Wan2.2）提升到91.2。这在自动化基准测试中超越了教师模型和闭源系统，包括Veo3（~90）和Sora2（~88），在人类偏好研究中表现竞争。我们释放所有模型权重、训练代码、合成数据管道和评估脚本，以推动视频生成领域的开放研究。

English abstract

We present Alice v1, a 14-billion-parameter open-source video generation model that achieves state-of-the-art quality through consistency distillation with score regularization (rCM). Contrary to conventional distillation, which trades quality for speed, we demonstrate that rCM-based distillation can exceed teacher model quality. We attribute this to three mechanisms: (1) the score regularization term acts as a mode-seeking objective that concentrates probability mass on high-quality outputs rather than covering the full teacher distribution, (2) our targeted synthetic data pipeline with hard example mining provides training signal specifically for failure modes (physics, hands, faces) that the teacher handles inconsistently, and (3) consistency enforcement acts as implicit regularization, eliminating "lucky path" dependence on specific noise samples. Alice v1 generates 5-second 720p videos at 24fps in 4 denoising steps (~8 seconds on H100), a 7x speedup over the 50-step teacher while improving VBench score from 84.0 (Wan2.2) to 91.2. This surpasses both the teacher and closed-source systems including Veo3 (~90) and Sora2 (~88) on automated benchmarks, with competitive results in human preference studies. We release all model weights, training code, synthetic data pipelines, and evaluation scripts to advance open research in video generation.


2605.08113 2026-05-12 cs.LG cs.CV version update

Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

基础模型嵌入是否能提升跨国家作物产量泛化？一种留一国家法的撒哈拉以南非洲评估

Yaw Osei Adjei

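Affiliations: Department of Computer Science, Kwame Nkrumah University of Science and Technology

AI summary: This paper evaluates geospatial foundation model embeddings under leave-one-country-out cross-validation and finds that cross-country prediction performs poorly, limited mainly by differences in yield distribution between countries.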

Comments 9 pages, 10 figures, appendix, code and processed results released publicly

AI-generated Chinese abstract

准确预测小农户玉米产量跨国家边界对于撒哈拉以南非洲粮食安全规划至关重要，但大多数发布的基准报告国内表现过度夸大真实泛化能力。本文评估了地理空间基础模型嵌入（Prithvi-EO-1.0-100M和ViT-Base）是否在留一国家法交叉验证中优于传统Sentinel-2光谱特征，基于五国6404个玉米田观测数据。结果表明存在明显的泛化差距：国内随机交叉验证获得中等R²值，但所有特征集在跨国家测试中表现糟糕，R²普遍为负。冻结Prithvi-EO嵌入在该设置下对跨国家预测无明显优势。本文认为主要限制是国家间产量分布变化而非表征质量，并释放可复现的负面基准供未来工作使用。

English abstract

Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks report within-country performance that overstates true generalisability. This paper evaluates whether geospatial foundation model embeddings, specifically Prithvi-EO-1.0-100M and ViT-Base, outperform traditional Sentinel-2 spectral features under a Leave-One-Country-Out cross-validation scheme on 6,404 maize field observations from five African countries. The results show a clear generalisability gap: within-country random cross-validation yields moderate R^2 values, but all feature sets perform poorly under cross-country testing, with universally negative R^2. Frozen Prithvi-EO embeddings provide no meaningful advantage over engineered spectral features for cross-country prediction in this setting. The paper argues that the main limitation is a shift in yield distribution between countries rather than representation quality and releases a reproducible negative benchmark for future work.
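A negative R^2 under Leave-One-Country-Out evaluation means the model does worse than predicting the held-out country's own mean yield. The sketch below shows how the splitting works and how a shift in yield distribution alone can produce that outcome. It is illustrative only: the country codes, yields, function names, and the mean-yield "model" are made-up assumptions, not the paper's data or pipeline.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


def leave_one_country_out(countries):
    """Yield (held_out, train_indices, test_indices) for each distinct country."""
    for held_out in sorted(set(countries)):
        train = [i for i, c in enumerate(countries) if c != held_out]
        test = [i for i, c in enumerate(countries) if c == held_out]
        yield held_out, train, test


def loco_mean_baseline(countries, yields):
    """R^2 per held-out country when predicting the training-set mean yield."""
    scores = {}
    for held_out, train, test in leave_one_country_out(countries):
        train_mean = sum(yields[i] for i in train) / len(train)
        y_true = [yields[i] for i in test]
        scores[held_out] = r2_score(y_true, [train_mean] * len(test))
    return scores


if __name__ == "__main__":
    # Hypothetical data: country "B" has systematically higher yields than "A".
    countries = ["A", "A", "A", "B", "B", "B"]
    yields = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
    print(loco_mean_baseline(countries, yields))
```

Both folds score far below zero here even though the predictor is reasonable within each country, which mirrors the abstract's argument that the gap stems from yield-distribution shift rather than from the quality of the features.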


2604.18486 2026-05-12 cs.CV cs.CL cs.RO version update

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

小米OneVL：基于视觉-语言解释的一步潜在推理与规划

Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, Fang Li, Chenxu Dang, Junli Wang, Tao Xu, Jing Wu, Jianhua Wu, Xiaoshuai Hao, Wen Zhang, Tianyi Jiang, Lingfeng Zhang, Lei Zhou, Yingbo Tang, Jie Wang, Yinfeng Gao, Xizhou Bu, Haochen Tian, Yihang Qiu, Feiyang Jia, Lin Liu, Yigu Ge, Hanbing Li, Yuannan Shen, Jianwei Cui, Hongwei Xie, Bing Wang, Haiyang Sun, Jingwei Zhao, Jiahui Huang, Pei Liu, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Hanchao Leng, Kun Ma, Naiyan Wang, Guang Chen, Kuiyuan Yang, Hangjun Ye, Long Chen

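Affiliations: Xiaomi Embodied Intelligence Team

AI summary: OneVL routes reasoning through compact latent tokens supervised by dual auxiliary decoders, achieving one-step latent reasoning and planning with vision-language explanations; it is the first latent CoT method to surpass explicit CoT, and it does so at answer-only latency.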

Comments Technical Report; 49 pages, 22 figures, 10 tables; Project Page at https://xiaomi-embodied-intelligence.github.io/OneVL GitHub at https://github.com/xiaomi-research/onevl

AI-generated Chinese abstract

链式推理（CoT）推理已成为基于视觉-语言增强（VLA）的自动驾驶轨迹预测中的强大驱动力，但其自回归性质带来了延迟成本，这在实时部署中是不可接受的。潜在CoT方法试图通过将推理压缩到连续隐藏状态中来弥合这一差距，但始终无法达到其显式对应物的水平。我们建议，这是因为纯粹的语言潜在表示压缩了世界的一个符号抽象，而不是实际上支配驾驶的因果动态。因此，我们提出了OneVL（基于视觉-语言解释的一步潜在推理与规划），这是一个统一的VLA和世界模型框架，通过双辅助解码器监督的紧凑潜在标记，将推理路由到潜在标记中。除了一个重建文本CoT的语言解码器外，我们还引入了一个视觉世界模型解码器，预测未来帧标记，迫使潜在空间内化道路几何、代理运动和环境变化的因果动态。一个三阶段训练流程逐步将这些潜在与轨迹、语言和视觉目标对齐，确保稳定的联合优化。在推理中，辅助解码器被丢弃，所有潜在标记在单次并行传递中被预填充，与仅回答预测的速度相匹配。在四个基准测试中，OneVL成为首个超越显式CoT的潜在CoT方法，以仅回答延迟的精度提供更优的准确性。这些结果表明，通过世界模型监督，潜在CoT生成的表示比逐个标记的冗长推理更具泛化能力。代码已向社区开源。项目页面：https://xiaomi-embodied-intelligence.github.io/OneVL

English abstract

Chain-of-Thought (CoT) reasoning has become a powerful driver of trajectory prediction in VLA-based autonomous driving, yet its autoregressive nature imposes a latency cost that is prohibitive for real-time deployment. Latent CoT methods attempt to close this gap by compressing reasoning into continuous hidden states, but consistently fall short of their explicit counterparts. We suggest that this is due to purely linguistic latent representations compressing a symbolic abstraction of the world, rather than the causal dynamics that actually govern driving. Thus, we present OneVL (One-step latent reasoning and planning with Vision-Language explanations), a unified VLA and World Model framework that routes reasoning through compact latent tokens supervised by dual auxiliary decoders. Alongside a language decoder that reconstructs text CoT, we introduce a visual world model decoder that predicts future-frame tokens, forcing the latent space to internalize the causal dynamics of road geometry, agent motion, and environmental change. A three-stage training pipeline progressively aligns these latents with trajectory, language, and visual objectives, ensuring stable joint optimization. In inference, the auxiliary decoders are discarded, and all latent tokens are prefilled in a single parallel pass, matching the speed of answer-only prediction. Across four benchmarks, OneVL becomes the first latent CoT method to surpass explicit CoT, delivering superior accuracy at answer-only latency. These results show that with world model supervision, latent CoT produces more generalizable representations than verbose token-by-token reasoning. Code has been open-sourced to the community. Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL


2603.22421 2026-05-12 cs.CV version update

OsteoFlow: Lyapunov-Guided Flow Distillation for Predicting Bone Remodeling after Mandibular Reconstruction

OsteoFlow: 基于Lyapunov引导的流式蒸馏用于预测下颌重建后的骨重塑

Hamidreza Aftabi, Faye Yu, Brooke Switzer, Zachary Fishman, Eitan Prisman, Antony Hodgson, Cari Whyne, Sidney Fels, Michael Hardisty

Affiliations: Department of Electrical and Computer Engineering, University of British Columbia, Canada; Department of Mechanical Engineering, University of British Columbia, Canada; Department of Surgery, University of British Columbia, Canada; Sunnybrook Research Institute, University of Toronto, Canada

AI summary: OsteoFlow uses Lyapunov-guided trajectory distillation to predict bone remodeling at one year from CT scans taken on postoperative day 5, markedly improving prediction accuracy and reducing the mean absolute error in the surgical resection region.

2402.02286 2026-05-12 cs.CV cs.AI cs.LG version update

Attention-Mamba: A Mamba-Enhanced Multi-Scale Parallel Inference Network for Medical Image Segmentation

Attention-Mamba: 一种增强Mamba的多尺度并行推理网络用于医学图像分割

Yanhua Zhang, Ke Zhang, Jingyu Wang, Gabriella Balestra, Samanta Rosati, Yulin Wu, Wuwei Wang, Valentina Giannini

Affiliations: School of Astronautics, Northwestern Polytechnical University; Department of Electronics and Telecommunications, Politecnico di Torino; Beijing Aerospace Automatic Control Research Institute; Xi'an University of Posts and Telecommunications; Department of Oncology, University of Turin; Candiolo Cancer Institute, FPO-IRCCS

AI summary: This paper proposes a Mamba-based multi-scale parallel network that improves medical image segmentation through multi-scale feature extraction and a recursive alignment module, producing efficient and accurate segmentations.

Comments 14 pages, 9 figures, and 8 tables

AI-generated Chinese abstract

U-shaped架构长期以来主导医学图像分割领域，而Transformer被广泛用于建模长距离依赖。前者通常通过聚合多级特征隐式处理尺度变化，而后者效率受限于二次计算和内存复杂度。本文提出一种有效的传统U-shaped架构替代方案，通过在不同层次构建并行分支以获得多尺度特征和相应预测。此外，我们通过整合Mamba，一种捕捉长距离依赖的态空间模型，来增强网络。首先，双路径架构通过横向连接在每个分支中聚合高层语义信息和低层空间细节。然后，我们引入递归对齐模块（RAM），通过逐步对齐恢复低分辨率特征的空间细节，优化后续全局特征学习和多尺度融合。我们进一步在对齐特征上构建并行Mamba分支，建立层次化全局表示。最后，我们提出基于Mamba的注意力机制，用于自适应多尺度预测融合；该机制利用Mamba增强通道和空间维度上的信息交换。在三种成像模态（MRI、CT和皮肤镜）上的实验验证了所提网络的优越泛化能力。与最先进的2D CNN、Transformer和基于Mamba的网络相比，我们的模型在Synapse、ACDC、ISIC-2018和PH2数据集上实现了最高的分割性能，同时保持高效率，参数量为第二小（14.05 M），计算复杂度适中（8.94 GFLOPs）.

English abstract

U-shaped architectures have long dominated the field of medical image segmentation, while Transformers are widely employed for modeling long-range dependencies. The former typically handles scale variations implicitly by aggregating multi-level features, whereas the efficiency of the latter is constrained by its quadratic computational and memory complexity. In this work, we propose an effective alternative to traditional U-shaped architectures by constructing parallel branches at different levels to obtain multi-scale features and corresponding predictions. Furthermore, we enhance our network by integrating Mamba, a state space model that captures long-range dependencies with linear complexity. First, a dual-path architecture with lateral connections aggregates high-level semantic information and low-level spatial details at each branch. Then, we introduce a Recursive Alignment Module (RAM) that restores spatial details in low-resolution features through stepwise alignment, optimizing them for subsequent global feature learning and multi-scale fusion. We further build parallel Mamba branches upon aligned features to establish hierarchical global representations. Finally, we propose a Mamba-based attention mechanism for adaptive multi-scale prediction fusion; this mechanism utilizes Mamba to enhance information exchange across scales along both the channel and spatial dimensions. Experiments across three imaging modalities (MRI, CT, and dermoscopy) underscore the superior generalization of the proposed network. Compared to state-of-the-art 2D CNN, Transformer, and Mamba-based networks, our model achieves the highest segmentation performance on the Synapse, ACDC, ISIC-2018, and PH2 datasets while maintaining high efficiency, featuring the second-smallest parameters (14.05 M) and moderate computational complexity (8.94 GFLOPs).
