2605.26104 2026-05-26 cs.CV (updated version)

DRScaffold: Improving Dense-Scene Reasoning in Lightweight Vision-Language Models

Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li, Yulun Zhang

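Affiliation: Shanghai Jiao Tong University

AI summary: Lightweight vision-language models lack explicit visual grounding in dense-scene reasoning, which makes their reasoning chains unreliable. DRScaffold is a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural changes and substantially improving dense-scene reasoning performance.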

Abstract

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold .

2605.26032 2026-05-26 cs.CV cond-mat.stat-mech cs.AI cs.LG (updated version)

Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution

Zixin Jessie Chen, Zhuo Chen, Archer Wang, Jeff Gore, William T. Freeman, Congyue Deng, Marin Soljačić

Affiliations: Department of Physics, Massachusetts Institute of Technology; Department of EECS, Massachusetts Institute of Technology; NSF AI Institute for Artificial Intelligence and Fundamental Interactions; Institute for Data, Systems and Society, Massachusetts Institute of Technology

AI summary: Proposes SKILD, a model that unifies image generation and continuous super-resolution through scale-invariant diffusion; switching between the two tasks requires only changing the starting timestep.

Comments 29 pages, 17 figures

Abstract

Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce $\textbf{SKILD}$, a $\textbf{S}$cale-invariant $\textbf{K}$-Space $\textbf{I}$mage $\textbf{L}$earning $\textbf{D}$iffusion model that unifies generation and continuous super-resolution within a single unconditional framework. Both natural images and critical physical systems exhibit scale invariance, and we leverage it to design a forward process that attenuates image content from fine to coarse scales while injecting spectrum-matched Gaussian noise, making scale an explicit coordinate of the diffusion dynamics. The same trained reverse process performs generation and continuous super-resolution by varying only the starting timestep: $\textit{no task-specific architecture, no conditioning branch, no classifier-free guidance, no retraining per scale factor}$. Empirically, SKILD reaches FID $2.65$ and Inception Score $9.63$ on unconditional CIFAR-10, performs $2\times$--$8\times$ super-resolution on ImageNet from a single unconditional checkpoint while outperforming conditional models across perceptual metrics, and reconstructs critical Ising models whose connected four-point correlations closely track the ground truth.

2605.26026 2026-05-26 cs.CV cs.AI cs.LG (updated version)

A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring

Adina Scheinfeld, Haotan Zhang, Shang Mu, Rudolf L. M. van Herten, Lucas Stoffl, Ali Erturk, Zhuhao Wu, Johannes C. Paetzold

Affiliations: Tri-Institutional Program in Computational Biology & Medicine, Weill Cornell Medicine, New York, NY, USA; Department of Radiology, Weill Cornell Medicine, New York, NY, USA; Helen & Robert Appel Alzheimer's Disease Research Institute, Feil Family Brain & Mind Research Institute, Weill Cornell Medicine, New York, NY, USA; Graduate Program in Physiology, Biophysics & Systems Biology, Weill Cornell Medicine, New York, NY, USA; Cornell Tech, New York, NY, USA; Institute for Intelligent Biotechnologies (iBIO), Helmholtz Center Munich, Neuherberg, Germany; Institute for Stroke & Dementia Research, Klinikum der Universität München, Ludwig-Maximilians University Munich, Munich, Germany

AI summary: Proposes a 3D foundation model jointly optimized for masked reconstruction and image-text alignment and pretrained on light sheet fluorescence microscopy data; few-shot adaptation substantially reduces annotation cost and improves segmentation, classification, and deblurring performance.

Comments 11 pages, 3 figures

Abstract

Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich volumetric data for studying cellular organization, pathology, and vascular networks. However, the size, dimensionality, and annotation burden of LSM data make supervised deep learning approaches costly and difficult to scale. Additionally, despite the abundance of unannotated LSM volumes, foundation models for this modality remain underexplored due to computational challenges and the complexity of volumetric representation learning. In this work, we introduce a 3D foundation model for LSM data, pretrained on a large curated collection of 3D images spanning multiple organisms, stains, and imaging protocols. We learn transferable volumetric representations by jointly optimizing for masked reconstruction and image-text alignment. The pretrained backbone drastically reduces the annotation burden, enabling efficient, few-shot adaptation for varied downstream tasks. We evaluate this approach on downstream segmentation, classification, and deblurring. Our results demonstrate consistent improvements over baselines, (1) when measured using standard evaluation metrics and (2) when rigorously assessed by domain experts. This highlights the potential of foundation model pretraining to reduce annotation requirements while improving performance across diverse LSM analysis tasks. Pretrained model weights and code for pretraining and finetuning are publicly available: https://github.com/AdinaScheinfeld/lsm_fm_public_repo.git.

2605.26014 2026-05-26 cs.CV cs.CL (updated version)

STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models

Yiming Liang, Yixiao Chen, Yiyang Zhou, Yixuan Wang, Shoubin Yu, Andong Deng, Fuxiao Liu, Qin Zhang, Chen Chen, Mohit Bansal, Huaxiu Yao

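Affiliations: Purdue; Harvard; UNC; UCF; NVIDIA; Physion Labs

AI summary: Proposes STORM, a framework that internalizes the reasoning process through bounded continuous latent trajectories, requiring no explicit textual chain of thought or external tools, which improves video reasoning accuracy while reducing inference overhead.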

A Methodology for Creating a Clinically Verified Dataset of Dermatoscopic Images

Kozachok Elena Sergeevna

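Affiliation: Ivannikov Institute for System Programming of the Russian Academy of Sciences

AI summary: Proposes a methodology that combines a standard operating procedure for mobile dermatoscopic image acquisition, a structured-metadata information model, and multi-stage expert verification to build a clinically verified dermatoscopic image dataset for medical informatics research.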

Comments 22 pages, 5 figures, 5 tables

Abstract

This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research. The relevance of the work is driven by the fact that the performance of automated diagnostic support systems depends not only on the volume of images, but also on the reproducibility of the image acquisition procedure, the completeness of structured metadata, and the reliability of diagnostic labels. International collections were primarily created under conditions that differ substantially from routine Russian outpatient practice and mobile dermatoscopy. The proposed methodology integrates three interconnected components: (1) a standard operating procedure (SOP) for acquiring images via mobile dermatoscopy, (2) an information model comprising 16 structured metadata fields organized into six clinically oriented blocks in ISIC-compatible notation, and (3) a multi-stage expert verification of diagnostic labels (initial clinical annotation, consensus review by three specialists, and histological confirmation of all malignant neoplasms). Using this methodology, a dataset of 1,026 unique dermatoscopic images from 443 patients was collected between June 2025 and May 2026. From 1,044 initial records, 18 duplicates were excluded. The dataset includes nine nosological categories; all 39 malignant lesions (18 melanomas, 15 basal cell carcinomas, and 6 squamous cell carcinomas) were histologically verified. Patient age ranged from 2 to 90 years (median 38), with 279 females (63%) and 164 males (37%). Each image is accompanied by expert-annotated dermatoscopic structures and an explicit verification_stage field indicating the level of diagnostic confirmation. The resulting dataset serves as a pilot clinically verified resource suitable for independent model evaluation, domain shift analysis, interpretability studies, and further expansion.
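The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the variable names are illustrative and are not taken from the dataset's metadata schema.

```python
# Consistency check of the figures quoted in the abstract.
# All numbers come from the abstract; variable names are illustrative.
initial_records = 1044
duplicates = 18
unique_images = initial_records - duplicates

malignant = {
    "melanoma": 18,
    "basal cell carcinoma": 15,
    "squamous cell carcinoma": 6,
}
total_malignant = sum(malignant.values())

females, males = 279, 164
patients = females + males
female_pct = round(100 * females / patients)
male_pct = round(100 * males / patients)

print(unique_images, total_malignant, patients, female_pct, male_pct)
# prints: 1026 39 443 63 37
```

All five derived values agree with the abstract: 1,026 unique images, 39 malignant lesions, 443 patients, and a 63% / 37% split.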

2605.25979 2026-05-26 cs.CV (updated version)

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

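Affiliations: Glint Lab; AIM for Health Lab; MVP Lab

AI summary: Proposes LLaVA-OV-2, which uses codec-stream tokenization, windowed attention, and 3D RoPE to unify video understanding and spatiotemporal grounding, surpassing Qwen3-VL-8B on multiple benchmarks.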

Abstract

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining, a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
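The abstract says that bit-cost dynamics determine adaptive temporal groups but does not give the rule. As a rough illustration only, the sketch below closes a group whenever the accumulated per-frame bit cost would exceed a fixed budget, so that high-cost (presumably high-motion) frames end up in shorter groups. The function name `group_by_bit_cost`, the `budget` parameter, and the toy bit costs are all assumptions, not the paper's algorithm.

```python
def group_by_bit_cost(frame_bits, budget):
    """Split frame indices into consecutive groups whose summed bit cost
    stays within `budget` (a single frame may exceed it on its own).

    Hypothetical stand-in for adaptive temporal grouping; the grouping
    rule used by LLaVA-OV-2 is not specified in the abstract.
    """
    groups, current, spent = [], [], 0
    for idx, bits in enumerate(frame_bits):
        # Close the running group if adding this frame would overshoot.
        if current and spent + bits > budget:
            groups.append(current)
            current, spent = [], 0
        current.append(idx)
        spent += bits
    if current:
        groups.append(current)
    return groups


# Toy per-frame bit costs: frames 3 and 4 are expensive (e.g. fast motion).
frame_bits = [10, 12, 11, 90, 85, 9, 10, 8]
groups = group_by_bit_cost(frame_bits, budget=100)
print(groups)  # [[0, 1, 2], [3], [4, 5], [6, 7]]
```

Compared with a fixed group-of-pictures length, this kind of rule spends more groups, and therefore more tokens, where the stream is expensive to encode.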

2605.25968 2026-05-26 cs.CV (updated version)

Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

Tianling Liu, Lequan Yu, Tong Han, Liang Wan

Affiliations: College of Intelligence and Computing, Tianjin University, Tianjin 300350, China; Department of Statistics and Actuarial Science, School of Computing and Data Science, The University of Hong Kong, Hong Kong; Department of Radiology, Tianjin Huanhu Hospital, Tianjin 300350, China; Tianjin Key Laboratory of Cerebral Vascular and Neurodegenerative Diseases, Tianjin 300350, China; Medical School of Tianjin University, Tianjin 300072, China

AI summary: Proposes the CMML framework, which synthesizes missing modalities with a cascade residual Transformer autoencoder and uses context tokens for semantic alignment, outperforming existing methods on three medical datasets.

Comments 12 pages, 8 figures

Abstract

While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. This reference guides the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.

2605.25952 2026-05-26 cs.CV cs.AI (updated version)

Continuous-Domain Curve Skeletonization for Meshes and Point Clouds

Jai Bardhan, Ramya Hebbalaguppe, Aravind Udupa

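Affiliations: TCS Research; IIT Delhi

AI summary: Proposes the CSCD framework, which generalizes local-separator-based skeletonization to the continuous domain. Its two realizations, CSCD-M (meshes) and CSCD-PC (point clouds), improve the robustness and topology preservation of skeleton extraction.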

Comments 31 pages, 26 figures, 7 tables, 4 algorithms. Published at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

DOI: 10.1109/WACV61042.2026.00493

Abstract

Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG'21) on benchmarks like Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics'24) and EPCS (CAG'23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, identifying handles, tunnels, and constrictions in objects. Project Website: https://cscd-skel.pages.dev

2605.25909 2026-05-26 cs.CV (updated version)

R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

Denis Gridusov, Maxim Popov, Sergey Kolyubin

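Affiliation: Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University

AI summary: Proposes R5DGS, which builds a semantic-aware 4D Gaussian representation from compact identity encodings and a CLIP-based object lookup table, and applies a rigid-body inference constraint that predicts dynamics only for object centroids, yielding an 11 FPS speedup while keeping trajectories plausible.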

Comments Code: https://github.com/be2rlab/r5dgs

Abstract

Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future-frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce $\textbf{R5DGS}$, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
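The rigid-body idea in the abstract, predicting motion only for an object's centroid and carrying its Gaussians along through a relative transformation, can be sketched in a few lines. The example below is an illustration under simplifying assumptions: rotation is restricted to yaw about the z-axis, and the names `rotate_z` and `propagate_rigid` are hypothetical rather than taken from the R5DGS code.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3D vector about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def propagate_rigid(gaussian_centers, old_centroid, new_centroid, yaw):
    """Move Gaussians rigidly with their object.

    Each Gaussian keeps its offset from the centroid; the offset is
    rotated by the object's yaw and re-attached to the new centroid.
    Only the centroid pose has to be predicted by the dynamics model.
    """
    moved = []
    for p in gaussian_centers:
        offset = tuple(pi - ci for pi, ci in zip(p, old_centroid))
        rotated = rotate_z(offset, yaw)
        moved.append(tuple(ci + ri for ci, ri in zip(new_centroid, rotated)))
    return moved


# Two Gaussians on an object that translates by (1, 1, 0) and turns 90 degrees.
centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = propagate_rigid(centers, (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), math.pi / 2)
print([tuple(round(c, 6) for c in p) for p in moved])
```

Because every Gaussian shares the same transformation, pairwise distances inside the object are preserved, and the cost of the dynamics step scales with the number of objects rather than the number of Gaussians.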

2605.25901 2026-05-26 cs.CV cs.RO (updated version)

AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

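Affiliation: Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University

AI summary: Proposes AgentGrounder, a two-stage framework (offline construction of an object lookup table, followed by an online tool-driven agent) for zero-shot 3D visual grounding, improving accuracy by 2.5% on ScanRefer and 6.3% on Nr3D.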

Comments Code: https://github.com/be2rlab/AgentGrounder

Abstract

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present $\textbf{AgentGrounder}$, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
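Acc@0.5 on ScanRefer is conventionally the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union of at least 0.5. A minimal sketch for axis-aligned boxes follows; the `(xmin, ymin, zmin, xmax, ymax, zmax)` layout and the function names are illustrative assumptions, not a format taken from the paper or its code.

```python
def box_volume(box):
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    xmin, ymin, zmin, xmax, ymax, zmax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin) * max(0.0, zmax - zmin)


def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter_box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    inter = box_volume(inter_box)  # 0 when the boxes do not overlap
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
preds = [
    (0.0, 0.0, 0.0, 2.0, 2.0, 1.0),  # lower half of the box: IoU = 4 / 8 = 0.5
    (1.0, 1.0, 1.0, 3.0, 3.0, 3.0),  # shifted box: IoU = 1 / 15
]
accuracy = acc_at_iou(preds, [gt, gt])
print(accuracy)  # 0.5
```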

2605.25892 2026-05-26 cs.CV (updated version)

SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

Wenbin Zou, Yawen Cui, Yi Wang, Lap-Pui Chau, Liang Chen, Jinshan Pan, Huiping Zhuang, Guanbin Li

Affiliations: Shien-Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou, China; Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR; College of Photonic and Electronic Engineering, Fujian Normal University, Fuzhou, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China

AI summary: Proposes SP-MoMamba, which uses superpixels to turn rigid scanning into semantic-level interaction and combines a multi-scale superpixel mixture of state space experts with a local spatial modulation expert, achieving efficient, high-fidelity image super-resolution.

Comments 16 pages, 15 figures

Abstract

State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.

2605.25878 2026-05-26 eess.IV cs.CV (updated version)

A Clinically Validated Foundation Model for Comprehensive Lung Pathology Interpretation

Zhengrui Guo, Zhengyu Zhang, Jiabo Ma, Yihui Wang, Fengtao Zhou, Yingxue Xu, Ling Liang, Chenglong Zhao, Qi Xie, Jinbang Li, Shujing Guo, Fangyi Han, Zhijian Cen, Ziyi Liu, Cheng Jin, Junlin Hou, Zhixuan Chen, Yu Cai, Lijuan Qu, Shifu Chen, Yueping Liu, Zhe Wang, Xiuming Zhang, Muyan Cai, Li Liang, Hao Chen

Affiliations: Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China; Guangdong Provincial Key Laboratory of Molecular Tumor Pathology, Guangzhou, China; Department of Pathology, Shandong Provincial Qianfoshan Hospital, Jinan, Shandong, China

AI summary: Proposes PulmoFoundation, a lung pathology foundation model built on Virchow2 through subspecialty-specific pretraining on roughly 40,000 H&E-stained whole-slide images. Validated on 32 clinical tasks, in a prospective study, and in a randomized controlled trial, it substantially improves diagnostic accuracy, efficiency, and inter-rater agreement.

Abstract

Pathological assessment guides lung cancer diagnosis, treatment selection, and prognostic evaluation, yet current CPath approaches rely on task-specific models for isolated objectives. Although pan-cancer foundation models offer versatility, they lack subspecialty-level depth and have not been evaluated across clinical workflows or prospectively validated in real-world settings. We introduce PulmoFoundation, a multi-center, prospectively validated, randomized controlled trial (RCT)-evaluated foundation model for comprehensive lung pathology assessment across pre-operative, intra-operative, and post-operative care. Built upon Virchow2 via subspecialty-specific pretraining using ~40,000 diagnostic H&E-stained whole-slide images (WSIs), PulmoFoundation was systematically evaluated on ~26,000 WSIs across 32 clinically relevant tasks. In addition to accurately predicting molecular markers and patient survival, our model achieves clinical-grade performance in core diagnostic tasks across biopsy, frozen section, and surgical resection slides. In a registered prospective study of 1,357 patients across 11 diagnostic tasks, our model achieved an average AUC of 92.3%. Using pre-specified triage thresholds, PulmoFoundation could reduce additional second-review burden for 68.8% of biopsies and 83.0% of frozen sections, and defer 44.5% of IHC stain orders, with PPVs of 1.0, 0.991, and 0.966. Beyond prospective validation, we conducted a crossover RCT with eight pathologists, in which AI assistance improved diagnostic accuracy across 4,928 case-reader pairs (91.7% w/ AI vs. 83.8% w/o AI). AI assistance also reduced median diagnostic time by 19.6%, increased diagnostic confidence by 8.7%, and improved inter-rater agreement from moderate (kappa = 0.56) to substantial (kappa = 0.76). Together, these evaluations support PulmoFoundation as a clinically validated decision-support system for lung pathology.
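The inter-rater agreement figures above are kappa statistics. The abstract does not say which variant was used; with eight readers, a multi-rater statistic such as Fleiss' kappa is plausible. To illustrate the underlying idea, the sketch below computes Cohen's kappa for two raters and maps a value to the Landis and Koch descriptive bands, the usual source of labels such as "moderate" and "substantial". The function names and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled at random with their own
    # marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch_band(kappa):
    """Descriptive band for a kappa value (Landis and Koch convention)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Toy example: two readers label eight cases as malignant (M) or benign (B).
reader_1 = ["M", "M", "B", "B", "M", "B", "M", "B"]
reader_2 = ["M", "B", "B", "B", "M", "B", "M", "M"]
kappa = cohen_kappa(reader_1, reader_2)
print(kappa, landis_koch_band(kappa))  # 0.5 moderate
print(landis_koch_band(0.56), landis_koch_band(0.76))  # moderate substantial
```

The two readers agree on 6 of 8 cases (0.75), but half of that agreement is expected by chance, so kappa is 0.5.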

2605.25876 2026-05-26 cs.CV (updated version)

[CLS] Is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation

Akang Wang, Xili Deng, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

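Affiliations: Yunnan Normal University, Kunming, China; Kunming University of Science and Technology, Kunming, China

AI summary: Addresses the inadequacy of the global [CLS] representation when vision-language models such as CLIP perform multi-label recognition. Proposes the PIAA framework, which achieves training-free multi-label recognition through patch-level inference and adaptive aggregation, improving mAP on NUS-WIDE by more than 6%.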

Abstract

Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.
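The "patch-level inference followed by aggregation" idea can be illustrated with a small sketch. This is not PIAA's implementation: the abstract does not specify the adaptive aggregation module, so a plain top-k mean is swapped in as a stand-in, and the embeddings, `patch_label_scores`, and `topk_mean` are hypothetical names for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def patch_label_scores(patch_embeddings, label_embeddings):
    """Score every patch against every label: result[p][c]."""
    return [[cosine(p, c) for c in label_embeddings] for p in patch_embeddings]


def topk_mean(scores, k):
    """Aggregate patch-level scores into one score per label.

    Stand-in aggregation (assumption): mean of the k best-matching patches.
    """
    num_labels = len(scores[0])
    aggregated = []
    for c in range(num_labels):
        best = sorted((row[c] for row in scores), reverse=True)[:k]
        aggregated.append(sum(best) / len(best))
    return aggregated


# Toy 2-D embeddings: two patches resemble label 0, two resemble label 1.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # third label is absent from the image

image_scores = topk_mean(patch_label_scores(patches, labels), k=2)
print(image_scores)
```

Because each label is scored from its own best patches, two co-occurring objects can both receive high scores, which is the situation a single global vector is said to handle poorly.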

2605.25810 2026-05-26 cs.CV Version update

Data-driven Head Motion Generation through Natural Gaze-Head Coordination

Xiaohan Liu, Yilin Wen, Yusuke Sugano

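Affiliations: Institute of Industrial Science, The University of Tokyo

AI summary: Proposes the first data-driven approach to gaze-head coordination: natural gaze and head motions are extracted automatically, a conditional variational autoencoder generates gaze-correlated head motion, and the framework is applied to gaze-controlled video generation.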

DOI: 10.23919/MVA65244.2025.11175115

Abstract

We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generation. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.

2605.25804 2026-05-26 cs.CV Version update

Event-to-Video Reconstruction using Spatio-Temporal and Frequency-Enhanced Deep Neural Networks

Ramna Maqsood, Paulo Nunes, Luís Ducla Soares, Caroline Conti

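Affiliations: Instituto de Telecomunicações, Instituto Universitário de Lisboa (ISCTE-IUL)

AI summary: Proposes MSFET-E2V, which fuses spatio-temporal features with frequency representations from the discrete wavelet transform through a cross-domain attention module and adds a lightweight wavelet-enhanced skip block, achieving high-quality event-to-video reconstruction that outperforms existing methods on multiple datasets.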

Abstract

Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.

2605.25802 2026-05-26 cs.CV Version update

Rethinking VLM Representation for VLA Initialization

Weifeng Lin, Siyuan Huang, Hao Li, Tingwei Chen, Ruichuan An, Xinyu Wei, Jianbo Liu, Hongsheng Li

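Affiliations: CUHK; PolyU; Peking University; ACE Robotics

AI summary: Treats VLA initialization as a controlled representation-design problem along three axes (capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining). It finds that preserving the pretrained VLM representation is critical for action performance, that LoRA gives a more reliable initialization than full fine-tuning, and that staged LoRA-based training yields the strongest variant.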

Comments 9 main-text pages, 5 appendix pages, 4 figures

Abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.

2605.25801 2026-05-26 cs.CV Version update

PixelWizard: Towards Efficient High-Fidelity Video Generation at Ultra-Large Spatial Resolution

Wenxue Li, Jingjing Ren, Peng Zhang, Tian Ye, Daiguo Zhou, Jian Luan, Lei Zhu

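Affiliations: The Hong Kong University of Science and Technology (Guangzhou); MiLM Plus, Xiaomi Inc; The Hong Kong University of Science and Technology

AI summary: Proposes PixelWizard, a framework that hierarchically decouples global structure modeling from detail synthesis and introduces Noise-Span Aligned Shortcut Training, enabling efficient high-fidelity video generation at ultra-large resolution with a speedup of more than 10x.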

Abstract

High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.

2605.25799 2026-05-26 cs.CV Version update

Addressing Exacerbated Attention Sink for Source-Free Cross-Domain Few-Shot Learning

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

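Affiliations: Huazhong University of Science and Technology

AI summary: Standard fine-tuning in cross-domain few-shot learning exacerbates the attention sink and degrades discriminability. The proposed method dynamically re-weights tokens to suppress reliance on simple tokens and strengthen learning of hard tokens, achieving new state-of-the-art performance.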

Comments Accepted by CVPR 2026

Abstract

Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior work: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially farther tokens (i.e., hard tokens). To address this, we propose a novel approach to dynamically re-weight tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.

2605.25784 2026-05-26 cs.CV cs.MM Version update

Concept Unlearning for Diffusion Models via Cross-Attention Activation Projection

Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim

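Affiliations: CSE, POSTECH; GSAI, POSTECH

AI summary: Proposes PURE, which builds forget and retain bases in the cross-attention activation space and edits weights with a linear projection, effectively erasing target concepts while preserving retained concepts.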

Abstract

Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, and paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
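A closed-form edit by linear projection can be sketched generically as follows. This is an illustration of projecting a subspace out of a weight matrix, not PURE's actual procedure: the abstract does not say how the forget and retain bases are combined or on which side of the weights the projector acts, so the sketch assumes a projector applied on the input side and omits the retain basis. All names (`orthonormalize`, `complement_projector`, `W`, `forget_directions`) are hypothetical.

```python
def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = sum(x * y for x, y in zip(w, b))
            w = [x - coef * y for x, y in zip(w, b)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis


def complement_projector(basis, dim):
    """P = I - sum_b b b^T, which zeroes every direction in span(basis)."""
    P = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for b in basis:
        for i in range(dim):
            for j in range(dim):
                P[i][j] -= b[i] * b[j]
    return P


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


# Toy weight matrix (2 outputs, 3 inputs) and one direction to forget.
W = [[2.0, 0.0, 1.0],
     [0.0, 3.0, -1.0]]
forget_directions = [[1.0, 1.0, 0.0]]

P = complement_projector(orthonormalize(forget_directions), dim=3)
W_edited = matmul(W, P)  # single deterministic edit, no retraining

print(matvec(W_edited, [1.0, 1.0, 0.0]))   # forget direction now maps to (numerically) zero
print(matvec(W_edited, [1.0, -1.0, 0.0]))  # an orthogonal direction is left unchanged
```

The edit is computed once and folded into the weights, which is why a method of this kind adds no inference-time cost.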

2605.25764 2026-05-26 cs.CV cs.AI Version update

Benchmarking Pathology Foundation Models for Spatial Domain Understanding

Bokai Zhao, Yiyang Zhang, Yuanchi Zhu, Hanqing Chao, Long Bai, Tai Ma, Minfeng Xu, Ming Song, Tianzi Jiang

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Brainnetome Center, Institute of Automation, Chinese Academy of Sciences; Beijing Key Laboratory of Brainnetome and Brain-Computer Interface, Institute of Automation, Chinese Academy of Sciences; DAMO Academy, Alibaba Group; ShanghaiTech University

AI summary: Proposes the SpaPath-Bench benchmark, which uses a spatial domain identification task to evaluate how well pathology foundation model representations distinguish tissue regions and capture spatial relationships.

Comments MICCAI2026

Abstract

Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole-slide images (WSIs), and they are typically benchmarked through downstream clinical endpoints. While such task-level evaluations are indispensable, they offer limited insight into what the representations themselves encode, particularly whether PFM embeddings can distinguish meaningful tissue regions and capture their spatial relationships. We present SpaPath-Bench, a representation-level benchmark designed to diagnose spatial representation capability in PFMs. SpaPath-Bench formulates spatial domain identification (SDI) on paired whole-slide image and spatial transcriptomics (ST) data as a diagnostic task. It curates 42 public paired WSI and ST slides, enables large-scale evaluation across 19 encoders and seven SDI methods, and measures partition quality using three complementary criteria: unsupervised spatial coherence, transcriptomics-referenced agreement, and expert-referenced agreement. Across 83K runs, SpaPath-Bench reveals that different pretraining paradigms capture distinct aspects of tissue spatial architecture, and it provides practical guidance for building the next generation of spatially aware computational pathology models. Code and data pipelines are publicly available at https://bokai-zhao.github.io/SpaPath-benchboard/.

2605.25759 2026-05-26 cs.CV Version update

Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

Bao Li, Yuliang Xiu, Zhen Liu

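Affiliations: Westlake University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the ASAP framework, which builds controlled preference pairs with a localized degradation mechanism and combines them with a localized, margin-bounded DPO variant to reduce anatomical errors while maintaining overall image quality.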

Abstract

Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.
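For readers unfamiliar with the objective being modified, the sketch below shows the standard scalar DPO loss, plus a hypothetical clamp that illustrates what a finite preference margin could look like. The clamp is an assumption for illustration only: the abstract does not give the form of ASAP's margin bound, and the region-localized weighting it describes is not modeled here. The function name `dpo_loss` and the `max_margin` parameter are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, max_margin=None):
    """Standard DPO loss for one preference pair (w preferred over l).

    logp_*     : log-likelihoods under the policy being trained
    ref_logp_* : log-likelihoods under the frozen reference model
    max_margin : optional cap on the implicit reward margin (illustrative assumption)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if max_margin is not None:
        # Once the margin reaches the cap, the loss is flat and the pair
        # stops pushing the policy further from the reference.
        margin = min(margin, max_margin)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A pair the policy already separates very strongly.
unbounded = dpo_loss(-1.0, -50.0, -2.0, -2.0, beta=1.0)
bounded = dpo_loss(-1.0, -50.0, -2.0, -2.0, beta=1.0, max_margin=2.0)
print(unbounded, bounded)
```

With no cap, the loss keeps rewarding ever larger separation between the preferred and rejected samples; a finite cap is one simple way to stop that, in the spirit of the over-optimization concern the abstract raises.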

2605.25751 2026-05-26 cs.CV Version update

SplitAvatar: One-shot Head Avatar with Autoregressive Gaussian Splitting

Hongzhe Liao, Chuhua Xian, Hongmin Cai, Haiyang Liu, Fa-Ting Hong

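Affiliations: South China University of Technology; University of Tokyo; The Hong Kong University of Science and Technology

AI summary: Proposes a method for reconstructing an animatable head avatar from a single image using autoregressive Gaussian splitting; a graph splitting network progressively generates Gaussians, addressing the mismatch in Gaussian counts and the lack of fine-grained detail.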

Abstract

3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.

2605.25737 2026-05-26 cs.CV Version update

SFR-Net: Learning Scale-Frustum Representations for Ultra-Wide Area Remote Sensing Image Segmentation

Chuyu Zhong, Keyan Chen, Qinzhe Yang, Bowen Chen, Zhengxia Zou, Zhenwei Shi

Affiliations: Department of Aerospace Intelligent Science and Technology, School of Astronautics, Beihang University; Key Laboratory of Spacecraft Design Optimization and Dynamic Simulation Technologies, Ministry of Education, Beihang University; Shen Yuan Honors College, Beihang University; College of Computing and Data Science, Nanyang Technological University

AI summary: To handle the large scale variation of ground objects and long-range contextual semantic continuity in ultra-wide area remote sensing images, proposes the Scale-Frustum Representation Network (SFR-Net), which builds scale-frustum representations and a cascaded cross-scale fusion mechanism, improving mIoU by 1.72% and 4.29% on the GID and FBPS datasets, respectively.

Abstract

Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.
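The headline metric here, mean intersection-over-union (mIoU), can be computed as in the minimal sketch below. The flattened label lists are toy data, and skipping classes that appear in neither prediction nor ground truth is a common convention assumed here, not something the abstract specifies.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for flattened per-pixel label lists."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Toy six-pixel example with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU is 1/3, 2/3 and 1/2, so the mean is 0.5.
print(mean_iou(pred, target, num_classes=3))
```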

2605.25730 2026-05-26 cs.CV Version update

DeCoDrift: Stabilizing Decoder Coupling in Closed-Loop Foundation Segmentation

H. M. Shadman Tabib, Md. Shamsuzzoha Bayzid, M Sohel Rahman

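Affiliations: Department of Computer Science and Engineering

AI summary: To address decoder coupling drift, which causes errors to accumulate in closed-loop iterative segmentation, proposes DeCoDrift, an inference-time stabilization framework that needs no training or ground-truth supervision; it constrains prompt updates and preserves decoder coupling to improve attention stability, temporal consistency, and segmentation quality.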

Comments 18 Pages, 5 Figures

Abstract

Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.

2605.25725 2026-05-26 cs.CV Version update

TriDP-PTM: a three-stage distortion-perception tradeoff guides the pre-training model for radar cardiac sensing

Jinye Li, Aidong Men, Yang Liu, Qingchao Chen

Affiliations: National Institute of Health Data Science, Peking University; Institute of Medical Technology, Peking University; Beijing University of Posts and Telecommunications; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Wangxuan Institute of Computer Technology, Peking University; State Key Laboratory of General Artificial Intelligence, Peking University

AI summary: Proposes the Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), which uses an indirect radar-to-ECG-to-task path and a composite loss function, and attains the best downstream clinical accuracy in the coopetitive stage.

Abstract

Cardiovascular diseases (CVDs) remain a leading cause of death globally, necessitating continuous, accurate non-invasive cardiac monitoring. While non-contact radar-based approaches show great promise, they often employ a single "distortion-driven" or "perception-driven" paradigm, frequently facing a trade-off between "low distortion but weak semantic information" and "high perceptual fidelity but poor interpretability." To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares the "direct radar-to-task" path against an "indirect radar-to-ECG-to-task" path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our approach effectively incorporates medical priors, such as ECG morphology and rhythm, into downstream tasks. Through empirical analysis, we reveal that this trade-off manifests in three distinct phases (Positive-Sum, Coopetitive, and Negative-Sum), showing that optimal downstream clinical accuracy typically emerges in the coopetitive stage. Extensive experiments on a dataset involving 30 subjects across 5 physiological states reveal that the indirect path consistently outperforms the direct path in diverse tasks, achieving 0.80 mean IoU in waveform segmentation, 98.3% average classification accuracy across four tasks, and a 56% MAE reduction in blood pressure regression compared to the strongest baselines. These findings validate our framework and indicate that, within the indirect radar-to-ECG pathway, appropriately weighting distortion and perception losses to operate in the coopetitive regime is critical for achieving both clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.


2605.25708 2026-05-26 cs.CV cs.CL cs.ET 版本更新

DRM: 基于扩散的奖励模型与逐步引导

Jaxon Zhang, Binxin Yang, Hubery Yin, Chen Li, Jing Lyu

发表机构 * Peking University（北京大学）； WeChat Vision, Tencent Inc.（腾讯微信视觉实验室）

AI总结提出基于扩散的奖励模型（DRM），利用预训练扩散模型作为评估骨干，通过逐步评估能力改进强化学习对齐和推理采样，提升图像生成质量。

详情

AI中文摘要

当前主流将扩散模型与人类偏好对齐的方法通常采用基于VLM的奖励模型。然而，这些为语义对齐预训练的奖励模型难以捕捉关键的感知质量，如美学、构图和视觉和谐。在这项工作中，我们认为一个能够高保真生成的模型必须对这些视觉属性有深刻理解。基于这一见解，我们引入了基于扩散的奖励模型（DRM），这是一种新颖的范式，使用预训练的扩散模型作为强大的评估骨干。DRM的一个关键优势是其独特的能力，不仅可以评估最终图像，还可以评估生成过程中任何阶段的噪声中间潜变量。我们以两种方式利用这种逐步评估能力。首先，我们提出了逐步GRPO，一种强化学习算法，提供密集的每步奖励，以解决GRPO算法中不精确的信用分配问题，从而实现更稳定和有效的对齐。其次，我们引入了逐步采样，一种新颖的推理策略，使用DRM作为动态引导，在每一步评估多个生成路径，引导过程朝向更高质量的结果。大量实验证实，我们的方法显著提升了生成图像的最终质量。代码：https://github.com/jjaxonx/DRM。

英文摘要

Current mainstream methods for aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses a pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
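To illustrate what dense, per-step rewards could mean for credit assignment, here is a minimal sketch that normalizes rewards within a group of samples separately at every denoising step. GRPO-style methods use group-relative normalization of rewards; applying it per step as below, together with the function name and the `rewards[i][t]` layout, is an assumption made for illustration and is not taken from the DRM code.

```python
from statistics import mean, pstdev


def stepwise_advantages(rewards, eps=1e-8):
    """Group-relative advantages computed independently at each step.

    rewards[i][t] is the (hypothetical) reward of sample i at denoising step t.
    Each step's rewards are centred and scaled across the group, so every step
    receives its own learning signal instead of sharing one final-image score.
    """
    num_steps = len(rewards[0])
    advantages = [[0.0] * num_steps for _ in rewards]
    for t in range(num_steps):
        column = [sample[t] for sample in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, value in enumerate(column):
            advantages[i][t] = (value - mu) / (sigma + eps)
    return advantages
```

With a single terminal reward, every step of a trajectory would share the same advantage. The per-step form lets a trajectory be rewarded at some steps and penalized at others.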


2605.25659 2026-05-26 cs.CV 版本更新

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

StreamChar: 基于解耦编排的长时程流式角色音频-视频生成

Linrui Tian, Qi Wang, Bang Zhang

发表机构 * Tongyi Lab, Alibaba Group（通义实验室，阿里巴巴集团）

AI总结提出StreamChar流式框架，通过LLM编排器与联合音频-视频DiT解耦长时程编排与短窗去噪，实现实时、稳定、高质量的角色动画生成。

详情

AI中文摘要

实时流式联合音频-视频生成用于角色动画需要生成器说出请求的文本、跨块保持视觉身份并在严格的播放预算内运行。这些要求难以同时满足：逐块自回归生成会累积文本-音频错位和视觉漂移，而低延迟所需的少步蒸馏通常会降低空间多样性和时间质量。我们提出StreamChar，一种将长时程编排与短窗音频-视频去噪分离的流式框架。基于LLM的编排器使用文本和历史上下文生成帧对齐的音频条件，联合音频-视频DiT在参考和运动帧条件下执行局部双向去噪。为高效部署，我们使用两阶段蒸馏流程，首先压缩采样器，然后在在线块展开下微调学生模型。进度感知指针在展开训练期间将部分文本与生成的音频对齐，而汇块记忆提供持久视觉锚点以减少长时程漂移。在短片段和长时程协议上的实验表明，StreamChar在单个H100 GPU上实时运行，与最近的联合和音频驱动基线相比，在文本保真度、音视频同步、视觉质量和流式稳定性方面提供了有利的系统级权衡。

英文摘要

Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.


2605.25657 2026-05-26 cs.CV 版本更新

ARMA-C3: A Contrastive ARMA Convolutional Framework for Unsupervised and Semi-supervised Classification

ARMA-C3: 一种用于无监督和半监督分类的对比ARMA卷积框架

VSS Tejaswi Abburi, Saurabh J. Shigwan, Nitin Kumar

发表机构 * VSS Tejaswi Abburi ； Saurabh J. Shigwan ； Nitin Kumar

AI总结提出ARMA-C3框架，利用对比学习和图割正则化在无监督和半监督场景下学习图节点的判别性表示，在多个医学影像数据集上表现优异。

详情

AI中文摘要

在生物医学和神经退行性疾病中，由于标记数据的稀缺和成像模式的复杂性，准确和早期疾病识别仍然具有挑战性。为了解决这些问题，我们引入了ARMA-C3，一个统一的无监督和半监督图学习框架，用于基于对比学习和图割正则化的节点分类，以学习结构上有意义且具有判别性的表示。通过将样本或图像建模为图节点并利用样本间关系，所提出的框架捕获了传统机器学习方法通常忽略的受试者级别依赖关系。我们在五个临床相关数据集上进行了广泛的二分类实验：阿尔茨海默病神经影像学倡议（ADNI）、额颞叶痴呆神经影像学（NIFD）数据集以及三个医学影像基准（BreastMNIST、PneumoniaMNIST和一个肝脏超声数据集）。实验结果表明，ARMA-C3在多个评估设置中，特别是在有限监督和严重类别不平衡下，与经典聚类技术、最先进的机器学习模型以及现有的基于图的深度学习方法相比，取得了具有竞争力且通常更优越的性能。所提出的框架进一步展示了在多样化生物医学成像模态中的鲁棒表示学习和强跨模态泛化能力。

英文摘要

In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.


2605.25656 2026-05-26 cs.CV 版本更新

Event-based Batting Impact Estimation

基于事件的击球冲击估计

Ryotaro Ishida, Wataru Ikeda, Ryosei Hara, Akemi Kobayashi, Toshitaka Kimura, Mariko Isogawa

发表机构 * Keio University（庆应大学）； NTT Communication Science Laboratories（NTT通信科学实验室）

AI总结提出利用事件相机的高时间分辨率和高动态范围，通过检测球与球棒的加权质心距离来估计击球冲击时刻，并引入掩膜细化网络解决事件帧与RGB图像之间的域差异，在低光和严重遮挡条件下将平均绝对误差降低约63%。

详情

AI中文摘要

精确估计击球冲击时刻对于理解快速感觉运动控制至关重要。然而，由于时间分辨率不足和运动模糊，RGB相机难以完成此任务。同样，惯性测量单元（IMU）由于传感器侵入性和有限的时间精度，在实际比赛中不实用。为克服这些限制，我们提出了一种新颖框架，利用事件相机（具有微秒级分辨率和高动态范围）基于检测到的球与球棒之间的加权质心距离来估计冲击时刻。为解决事件帧与RGB图像之间的域差异（这会降低分割精度），我们生成高密度事件帧。然后，我们引入一个掩膜细化网络，利用这些帧和双向掩膜信息，并通过一种新颖的损失函数进行优化。在真实数据集上的实验表明，我们的方法在具有挑战性的条件下（包括低光环境和严重遮挡）实现了卓越的准确性，将平均绝对误差降低了约63%，优于基线方法。

英文摘要

Estimating the precise timing of batting impact is crucial for understanding rapid sensorimotor control. However, this task is challenging for RGB cameras due to insufficient temporal resolution and motion blur. Similarly, Inertial Measurement Units (IMUs) are impractical for actual matches due to sensor intrusiveness and their limited temporal precision. To overcome these limitations, we propose a novel framework leveraging event-based cameras, which offer microsecond resolution and high dynamic range, to estimate impact timing based on the weighted centroid distance between the detected ball and bat. To address the domain gap between event frames and RGB images that degrades segmentation accuracy, we generate high-density event frames. We then introduce a mask refinement network that leverages these frames and bidirectional mask information, optimized using a novel loss function. Experiments on real-world datasets demonstrate that our method achieves superior accuracy under challenging conditions, including low-light environments and severe occlusions, outperforming baselines by reducing the Mean Absolute Error by approximately 63%.
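The timing cue named in the abstract, a weighted centroid distance between ball and bat, can be sketched as follows. The data layout (a list of `(x, y, weight)` tuples per frame) and the choice of the minimum-distance frame as the impact estimate are assumptions made for illustration. The paper's actual weighting and decision rule are not given in the abstract.

```python
import math


def weighted_centroid(points):
    """Weight-averaged (x, y) of a list of (x, y, weight) tuples."""
    total = sum(w for _, _, w in points)
    cx = sum(x * w for x, _, w in points) / total
    cy = sum(y * w for _, y, w in points) / total
    return cx, cy


def estimate_impact_index(ball_frames, bat_frames):
    """Index of the frame where the ball and bat centroids are closest.

    ball_frames[k] and bat_frames[k] hold the weighted pixels of the detected
    ball and bat masks in frame k (hypothetical layout).
    """
    distances = [
        math.dist(weighted_centroid(ball), weighted_centroid(bat))
        for ball, bat in zip(ball_frames, bat_frames)
    ]
    return min(range(len(distances)), key=distances.__getitem__)
```

Because event frames can be built at very fine time intervals, the resolution of an estimate like this is limited by the frame interval chosen, not by a fixed camera frame rate.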


2605.25621 2026-05-26 cs.CV 版本更新

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

StreamOV: 通过证据引导记忆与响应触发的流式全视频理解

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang, Yanwei Fu

发表机构 * Shanghai Innovation Institute（上海创新研究院）； Fudan University（复旦大学）； Nanjing University（南京大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Huazhong University of Science and Technology（华中科技大学）

AI总结提出StreamOV框架，利用多模态证据引导的长短期记忆和隐状态驱动的触发机制，实现流式全视频理解中的在线推理与主动响应，并在新基准SOVBench上取得最优性能。

详情

AI中文摘要

虽然流式全视频理解需要持续感知和主动的实时交互，但这一关键领域仍未被充分探索。当前的全模态方法本质上是为离线场景设计的，由于两个根本缺陷限制了其在流式场景中的适用性。首先，它们缺乏稳健的机制来管理长时间跨度下持续增长的音视频上下文，并且无法在适当时机自主发起响应。其次，现有基准主要局限于离线、单轮问答，无法捕捉连续的多轮流式交互。为弥补这些差距，我们提出了StreamOV，一种新颖的流式全视频理解框架，用于具有有限记忆和主动响应触发的高效在线音视频推理。具体来说，StreamOV引入了多模态证据引导的长短期记忆，在固定预算下将历史音视频上下文压缩为紧凑的信息性证据。它还采用隐状态驱动的触发器来决定何时响应，避免了显式的静音令牌生成和外部路由器。我们还整理了SOVBench，这是首个用于在线、多轮全模态评估的综合基准。大量实验表明，StreamOV在各种流式和全视频基准上取得了最先进的性能，证明了其在在线和离线视频理解中的有效性。

英文摘要

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.


2605.25615 2026-05-26 cs.CV 版本更新

UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition

UAV-OVO：无人机动作识别中的视点外泛化

Yu Xia, Zhengbo Zhang, Shuaihu Zhang, Zhigang Tu

发表机构 * Wuhan University（武汉大学）； Singapore University of Technology and Design（新加坡科技与设计大学）

AI总结针对无人机动作识别中训练与测试视点不一致导致的性能下降问题，提出UAV-OVO基准和LATER方法，通过视点隔离和LoRA锚定特征重中心化实现视点鲁棒泛化。

详情

AI中文摘要

无人机动作识别面临标准基准测试常掩盖的部署偏移：从低俯视角拍摄的无人机视频训练的模型可能需要识别来自高俯视角的相同动作类别。虽然动作标签保持不变，但这种偏移改变了身体可见性、运动投影和场景上下文，促使模型依赖视点特定的捷径。我们引入UAV-OVO，一个用于无人机动作识别的视点外泛化基准。UAV-OVO从未校准视频中导出视点分数，使用视点隔离带将低俯视角视频分配给训练和分布内测试集，同时保留高俯视角视频用于分布外测试，并构建按类别分布匹配的ID/OOD测试集，使得性能差异反映视点偏移而非标签不平衡。在代表性视频识别器上，UAV-OVO揭示了显著的ID/OOD差距：拟合低俯视角训练分布良好的模型往往无法迁移到保留的高俯视角，暴露了被整体准确性隐藏的视点捷径。我们进一步提出LATER，即LoRA锚定的测试时重中心化，首先通过低秩适配（LoRA）适配识别器，然后利用学习到的LoRA子空间作为在线特征重中心化的语义锚点。具体来说，LATER在重中心化特征之前将目标域位移投影到LoRA子空间的正交补上，减少视点引起的漂移同时保留任务相关语义。UAV-OVO和LATER共同为视点鲁棒的无人机视频理解提供了一个受控测试床和一种实用的适配方法。

英文摘要

UAV action recognition faces a deployment shift that standard benchmarks often obscure: a model trained on UAV footage captured from low-depression viewpoints may be required to recognize the same action classes from high-depression viewpoints. While the action labels remain unchanged, this shift alters body visibility, motion projection, and scene context, encouraging models to rely on viewpoint-specific shortcuts. We introduce UAV-OVO, an Out-of-Viewpoint generalization benchmark for UAV action recognition. UAV-OVO derives view scores from uncalibrated videos, uses a view-isolation band to assign low-depression videos to the training and in-distribution test splits while reserving high-depression videos for out-of-distribution testing, and constructs ID/OOD test sets matched by class distribution so that performance differences reflect viewpoint shift rather than label imbalance. Across representative video recognizers, UAV-OVO reveals a substantial ID/OOD gap: models that fit the low-depression training distribution well often fail to transfer to held-out high-depression views, exposing viewpoint shortcuts hidden by aggregate accuracy. We further propose LATER, LoRA-Anchored Test-time Re-centering, which first adapts the recognizer with Low-Rank Adaptation (LoRA) and then uses the learned LoRA subspace as a semantic anchor for online feature re-centering. Specifically, LATER projects target-domain displacement onto the orthogonal complement of the LoRA subspace before re-centering features, reducing viewpoint-induced drift while preserving task-relevant semantics. Together, UAV-OVO and LATER provide a controlled testbed and a practical adaptation method for viewpoint-robust UAV video understanding.
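The re-centering step described for LATER can be sketched in a few lines. The sketch assumes that an orthonormal basis of the LoRA subspace is already available and that source and target feature means have been estimated elsewhere. Both assumptions, and all function names, are hypothetical, since the abstract does not specify how these quantities are obtained.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(vector, orthonormal_basis):
    """Project `vector` onto the orthogonal complement of span(orthonormal_basis)."""
    residual = list(vector)
    for b in orthonormal_basis:
        coeff = dot(vector, b)  # valid because the basis is orthonormal
        residual = [r - coeff * bi for r, bi in zip(residual, b)]
    return residual


def recenter(feature, source_mean, target_mean, lora_basis):
    """Remove only the part of the domain displacement outside the LoRA subspace."""
    displacement = [t - s for t, s in zip(target_mean, source_mean)]
    drift = project_out(displacement, lora_basis)
    return [f - d for f, d in zip(feature, drift)]
```

The design intent, as the abstract states it, is that directions inside the LoRA subspace carry task-relevant semantics and are left untouched, while the remaining displacement is treated as viewpoint-induced drift and subtracted.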


2605.25599 2026-05-26 cs.LG cs.CV 版本更新

Generalized Evidential Deep Learning: From a Bayesian Perspective

广义证据深度学习：从贝叶斯视角

Yuanye Liu, Yibo Gao, Yuanyang Chen, Xiahai Zhuang

发表机构 * School of Data Science, Fudan University, Shanghai, China（复旦大学数据科学学院，上海，中国）

AI总结本文从广义贝叶斯框架出发，为证据深度学习建立理论基础，并提出统一可扩展的广义证据深度学习框架，在分类、不确定性估计和OOD检测上取得可比结果。

Comments Submitted to ICML2026

详情

AI中文摘要

证据深度学习（EDL）已成为一种高效、无需采样的不确定性估计策略。一系列EDL变体被提出以解决原始框架的特定局限性，并取得了显著成功。然而，EDL的基本理论结构以及这些变体之间的关系尚未得到系统研究。在这项工作中，我们通过在广义贝叶斯框架内解释EDL，包括先验规范、后验更新和训练目标，为其建立了原则性的理论基础。我们进一步从贝叶斯分布不确定性角度刻画了证据不确定性，并通过渐近分析建立。基于这一视角，我们进一步提出了广义证据深度学习（GEDL），这是一个统一且可扩展的框架，明确解耦了各个组件的作用，并将GEDL与现有变体系统地联系起来。大量实验表明，GEDL在分类、不确定性估计和OOD检测上取得了可比的结果，并具有理论依据。

英文摘要

Evidential Deep Learning (EDL) has emerged as an efficient, sampling-free strategy for uncertainty estimation. A series of EDL variants have been proposed to address specific limitations of the original framework, achieving notable success. However, the underlying theoretical structure of EDL and the relationships among these variants have received limited systematic investigation. In this work, we establish a principled theoretical foundation for EDL by interpreting it within a generalized Bayesian framework that includes prior specification, posterior update, and training objective. We further characterize evidential uncertainty from a Bayesian distributional uncertainty viewpoint, established via asymptotic analysis. Building on this perspective, we propose Generalized Evidential Deep Learning (GEDL), a unified and extensible framework that explicitly disentangles the roles of individual components and systematically relates GEDL to existing variants. Extensive experiments demonstrate that GEDL yields comparable results on classification, uncertainty estimation, and OOD detection, with theoretical grounding.
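For readers unfamiliar with the baseline that GEDL generalizes, the sketch below shows the standard Dirichlet-based EDL readout: non-negative evidence is turned into Dirichlet parameters, expected class probabilities, and a scalar uncertainty. This is the commonly used original formulation, not the GEDL framework itself, whose components are not specified in the abstract. The function name is illustrative.

```python
def edl_outputs(evidence):
    """Map per-class non-negative evidence to probabilities and uncertainty.

    alpha_k = evidence_k + 1 are Dirichlet parameters, S = sum(alpha) is the
    Dirichlet strength, the expected probability is alpha_k / S, and the
    uncertainty mass is K / S for K classes.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    uncertainty = len(alpha) / strength
    return probs, uncertainty
```

With zero evidence the prediction is uniform and the uncertainty is 1. As evidence accumulates for any class, the uncertainty shrinks toward 0, which is what makes the approach sampling-free.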


2605.25598 2026-05-26 cs.CV 版本更新

Mosaic: 通过向量场混合的组合式多概念擦除

Junseok Ko, Jungwoo Kim, Jong-Seok Lee

发表机构 * Department of Artificial Intelligence, Yonsei University（延世大学人工智能系）； School of Integrated Technology, Yonsei University（延世大学整合技术学院）

AI总结针对流式文本到图像模型中同时擦除多个目标概念的任务，提出Mosaic框架，通过动态构建概念特定掩码并选择性混合向量场，无需额外优化即可有效移除复杂场景中的多概念。

详情

AI中文摘要

概念擦除已成为确保文本到图像（T2I）模型安全与伦理图像合成的关键研究方向。现有研究虽探索了多概念擦除，但通常假设每张图像仅有一个目标概念，这一限制被现代基于流的T2I模型日益暴露，此类模型可同时生成包含多个概念的复杂场景。为弥补这一空白，我们引入组合式多概念擦除这一新任务，旨在同时移除单个场景中的多个目标概念。我们提出CoME-Bench，一个用于评估组合式多概念擦除的基准，涵盖类别内和跨类别场景。我们进一步提出Mosaic，一个用于基于流的T2I模型中多概念擦除的新框架，该框架通过动态构建概念特定掩码并选择性混合它们，利用向量场中目标概念的空间局部性，无需额外优化。大量实验表明，Mosaic能有效移除复杂组合场景中的多个目标概念，同时保留非目标上下文。

英文摘要

Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.
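One way to picture mask-based blending of vector fields is the per-location sketch below. It assumes that each target concept comes with an alternative "erased" vector field and a soft mask in [0, 1] over flattened spatial locations. How Mosaic actually builds its masks and alternative fields is not described in the abstract, so every name and the blending rule here are hypothetical.

```python
def blend_fields(v_original, v_erased_list, masks):
    """Blend per-concept erased fields into the original field using masks.

    v_original: velocity values over flattened spatial locations.
    v_erased_list[c]: field with concept c suppressed (hypothetical input).
    masks[c]: per-location weights in [0, 1] marking where concept c appears.
    """
    blended = list(v_original)
    for v_erased, mask in zip(v_erased_list, masks):
        blended = [
            (1.0 - m) * b + m * e
            for b, e, m in zip(blended, v_erased, mask)
        ]
    return blended
```

Locations where every mask is zero keep the original field unchanged, which matches the stated goal of preserving non-target context. No extra optimization is involved because the blend is computed at sampling time.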


2605.25571 2026-05-26 cs.CV 版本更新

AnE: Pushing the Reasoning Frontier of Multimodal LLMs via Anchor Evolution

AnE: 通过锚点进化推动多模态大语言模型的推理前沿

Zehao Wang, Yihan Zeng, Zidong Gong, Yuanfan Guo, Feng Zhu, Hongzhi Zhang, Wei Zhang, Wangmeng Zuo

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Huawei Noah's Ark Lab（华为诺亚实验室）； Independent Researcher（独立研究员）

AI总结提出锚点进化（AnE）范式，通过真值锚点数据策展和脚手架剥离机制，解决多模态大模型推理中的认知漂移和幻觉路径问题，显著提升推理性能。

Comments 34 pages, 10 figures

详情

AI中文摘要

通过监督微调（SFT）和强化学习（RL）进行的后训练对于增强多模态大语言模型（MLLMs）的推理能力至关重要，然而现有范式由于静态数据的限制常常达到性能瓶颈。虽然当前方法利用自我反思或自我进化来突破这些界限，但它们仍然受到低质量合成数据导致的认知漂移和幻觉推理路径的影响。为了解决这些挑战，我们提出了锚点进化（AnE），一种整合了真值锚点数据策展和模型进化的新范式，在推理前沿实现了忠实且稳定的性能提升。具体来说，我们提出了真值锚点扩展，通过轨迹展开定位模型失败前沿，并利用真实数据库检索高保真锚点以进行忠实的数据策展。随后，我们引入了脚手架剥离机制来内化推理能力。该机制首先通过脚手架增强监督来锚定推理路径，以减轻直接在原始数据上进行SFT的学习复杂性和分布漂移，然后利用强化学习剥离脚手架模板，从而有效地将推理路径转化为内在模型能力。在多模态推理基准上的实验结果表明，我们的方法显著推进了模型性能前沿，在八个多模态基准上将基础模型提升了10.3%，并达到了最先进的结果。代码将公开提供。

英文摘要

Post-training via Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) is crucial for enhancing reasoning in Multimodal Large Language Models (MLLMs), yet existing paradigms often reach a performance bottleneck due to the limitations of static data. While current methods leverage self-reflection or self-evolution to push these boundaries, they still suffer from cognitive drift and hallucinated reasoning paths caused by low-quality synthetic data. To address these challenges, we propose Anchor Evolution (AnE), a new paradigm that integrates truth-anchored data curation and model evolution, achieving faithful and steady performance gains at the reasoning frontier. Specifically, we propose Truth Anchor Expansion, which pinpoints the model's failure frontier via trajectory rollouts and leverages ground-truth databases to retrieve high-fidelity anchors for faithful data curation. Subsequently, we introduce the Scaffold-Stripping Mechanism to internalize reasoning capabilities. This mechanism first anchors reasoning paths via scaffold-augmented supervision to mitigate the learning complexity and distribution drift of direct SFT on raw data, then leverages RL to strip the scaffold template, thereby effectively transitioning the reasoning paths into intrinsic model capabilities. Experimental results on multimodal reasoning benchmarks show that our method substantially advances the model performance frontier, improving the base model by 10.3% across eight multimodal benchmarks and achieving state-of-the-art results. The code will be made publicly available.


2605.25568 2026-05-26 cs.CV 版本更新

Rethinking Scribble-Guided Image Editing: Generalization, Instruction Adherence, and Multi-Tasking

重新思考涂鸦引导的图像编辑：泛化、指令遵循与多任务

Mingyi Xu, Jinpeng Lin, Min Zhou, Tiezheng Ge, Ming Zeng

发表机构 * Xiamen University（厦门大学）； Taobao & Tmall Group of Alibaba（阿里巴巴淘宝与天猫集团）

AI总结针对涂鸦引导图像编辑在多任务场景下性能不稳定的问题，通过实证研究揭示指令级泛化瓶颈，提出覆盖-真实课程、多任务拼接和编辑聚焦损失三种策略，在VIBE基准上实现单任务和多任务的最优结果。

详情

AI中文摘要

涂鸦引导的图像编辑允许用户将简单的涂鸦注释与文本提示相结合，以指定图像编辑的位置和方式，从而实现灵活交互和精确的空间控制。然而，现有模型在这种范式下仍表现出不稳定的性能，尤其是在多任务场景中。为了提升性能，我们使用开源编辑模型进行实证研究，并揭示了泛化中的不对称性：指令级泛化（包括跨编辑任务以及从单任务到多任务设置）比图像域泛化（例如从合成图像到真实图像，或从马赛克图像到常规图像）更具挑战性。这表明主要瓶颈在于对多样化编辑指令的学习不足，而非图像域差异。受此启发，我们提出了三种策略：(a) 覆盖-真实课程，一个两阶段流程，首先构建大规模合成、指令丰富的数据以提供广泛的任务监督，然后精选少量真实数据以细化生成的真实性；(b) 多任务拼接，通过几乎零成本地拼接单任务样本来构建多任务训练样本，同时使学习到的能力泛化到非马赛克图像；(c) 编辑聚焦损失，利用合成数据中输入和输出图像之间的变化区域，将训练聚焦于编辑区域，提高学习效率和编辑准确性。通过这些策略，我们在VIBE基准上显著提升了单任务和多任务涂鸦引导编辑的性能，取得了最先进的结果。我们将公开发布我们的数据集和模型。

英文摘要

Scribble-guided image editing allows users to combine simple scribble annotations with text prompts to specify both where and how an image should be edited, enabling flexible interaction with precise spatial control. However, existing models still exhibit unstable performance under this paradigm, especially in multi-task scenarios. To improve performance, we conduct empirical studies using an open-source editing model and reveal an asymmetry in generalization: instruction-level generalization, including across editing tasks and from single-task to multi-task settings, is more challenging than image-domain generalization, such as from synthetic to real-world images or from mosaicked to regular images. This suggests that the primary bottleneck lies in insufficient learning for diverse editing instructions rather than in the image domain gap. Motivated by this insight, we propose three strategies: (a) a Coverage-then-Realism Curriculum, a two-stage pipeline that first builds large-scale synthetic, instruction-rich data for broad task supervision, then curates a small set of real-world data to refine generation realism; (b) Multi-Task Mosaicking, which constructs multi-task training samples by concatenating single-task examples at nearly zero cost while enabling the learned capability to generalize to non-mosaicked images; and (c) an Edit-Focused Loss, which leverages the changed regions between input and output images in synthetic data to focus training on edited regions, improving both learning efficiency and editing accuracy. With these strategies, we substantially improve both single-task and multi-task scribble-guided editing on the VIBE benchmark, achieving state-of-the-art results. We will publicly release our dataset and model.
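The Edit-Focused Loss idea, concentrating training on regions that differ between the input and output images, can be sketched as a mask-weighted reconstruction error. The threshold, the up-weighting factor `edit_weight`, and the flattened-pixel representation are assumptions made for illustration. The paper's exact loss is not given in the abstract.

```python
def edit_mask(source, edited, threshold=0.0):
    """1.0 where the edited image differs from the source, else 0.0."""
    return [1.0 if abs(a - b) > threshold else 0.0 for a, b in zip(source, edited)]


def edit_focused_loss(pred, target, mask, edit_weight=5.0):
    """Weighted mean squared error that up-weights edited locations.

    edit_weight is a hypothetical factor: edited pixels count edit_weight
    times as much as unchanged pixels.
    """
    weights = [1.0 + (edit_weight - 1.0) * m for m in mask]
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return total / sum(weights)
```

Such a mask is cheap to obtain for synthetic pairs because the source and edited images are pixel-aligned by construction, which is consistent with the abstract restricting this loss to synthetic data.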


2605.25563 2026-05-26 cs.CV 版本更新

受放射科医生启发的乳腺超声诊断的跨阶段注意力多专家网络

Xinyang Zhai, Chong Yang, Ruizhi Zhang

发表机构 * International Agency for Research on Cancer (IARC)（国际癌症研究机构）； World Health Organization（世界卫生组织）

AI总结提出跨阶段注意力混合专家网络(CSA-MoE-Net)，通过跨阶段注意力模块增强多级特征、三分支MoE块从全肿瘤图像、肿瘤核心和边界学习互补特征，并在平衡数据集上实现96.33%准确率，显著优于基线ResNet-18。

详情

AI中文摘要

乳腺超声成像是一种重要的早期乳腺癌诊断无创方法，但由于肿瘤异质性、边界模糊和数据不平衡，自动良恶性分类仍具挑战。为了提高特征表示和分类准确性，本文提出了跨阶段注意力混合专家网络(CSA-MoE-Net)。它采用跨阶段注意力增强的ResNet-18作为骨干网络，其中跨阶段注意力模块自适应地重新校准多级特征，从而增强关键肿瘤特征并抑制冗余。一个三分支混合专家(MoE)块从全肿瘤图像、肿瘤核心和边界学习互补特征，自适应门控网络融合这些特征以捕获形态、纹理和上下文信息。融合后的特征在架构中称为融合专家特征(FEF)。在包含2,129张乳腺超声图像的平衡数据集上的实验表明，在20次独立运行的平均值下，该模型实现了96.33%的准确率、94.09%的精确率、98.53%的召回率、96.25%的F1分数和99.50%的AUC。与基线ResNet-18相比，这些指标分别提高了3.01、0.70、5.37、2.98和5.42个百分点。所提出的机制无需侵入性修改，可无缝嵌入VGG-16、DenseNet-121等网络，带来稳定的性能提升，从而为计算机辅助诊断提供可靠支持。

英文摘要

Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classification remains challenging due to tumor heterogeneity, blurred boundaries, and data imbalance. To improve feature representation and classification accuracy, this paper proposes the Cross-Stage Attention Mixture-of-Experts Network (CSA-MoE-Net). It adopts a Cross-Stage Attention-enhanced ResNet-18 as the backbone, in which the Cross-Stage Attention module adaptively recalibrates multi-level features, thereby enhancing key tumor features and suppressing redundancy. A three-branch Mixture of Experts (MoE) Block learns complementary features from the Whole Tumor Image, Tumor Core, and Boundary, and an Adaptive Gating Network fuses them to capture morphological, textural, and contextual information. The fused features are denoted as Fused Expert Feature (FEF) in the architecture. Experiments on a balanced dataset of 2,129 breast ultrasound images show that, averaged over 20 independent runs, the model achieves an accuracy of 96.33%, precision of 94.09%, recall of 98.53%, F1-score of 96.25%, and AUC of 99.50%. Compared to the baseline ResNet-18, these metrics improve by 3.01, 0.70, 5.37, 2.98, and 5.42 percentage points, respectively. The proposed mechanism requires no invasive modification and can be seamlessly embedded into other networks such as VGG-16 and DenseNet-121, yielding stable performance gains, thus providing reliable support for computer-aided diagnosis.
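The fusion of three expert branches by a gating network can be illustrated with a softmax-weighted sum. The sketch assumes that the gate outputs one logit per expert and that all expert features share one dimensionality. These details and the function names are hypothetical and not taken from CSA-MoE-Net.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_features, gate_logits):
    """Gate-weighted sum of expert feature vectors.

    expert_features: e.g. [whole_image_feat, core_feat, boundary_feat].
    gate_logits: one score per expert from a (hypothetical) gating network.
    """
    weights = softmax(gate_logits)
    dim = len(expert_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, expert_features))
        for i in range(dim)
    ]
```

Equal logits reduce the fusion to a plain average. A strongly peaked gate lets one branch, for example the boundary expert, dominate for a given image.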


2605.25503 2026-05-26 cs.CV 版本更新

Metric-Phase Fields: Decoupling Distance and Sign for Thin-Structure Reconstruction from Unoriented Point Clouds

度量-相位场：从无定向点云中解耦距离和符号以重建薄结构

Jiayi Kong, Xuhui Chen, Chen Zong, Fei Hou, Junhui Hou, Wenping Wang, Ying He

发表机构 * S-Lab, Nanyang Technological University, Singapore ； Key Laboratory of System Software (CAS), Institute of Software, Chinese Academy of Sciences, China ； University of Chinese Academy of Sciences, China ； School of Mathematics, Nanjing University of Aeronautics ； Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China ； Department of Computer Science and Engineering, Texas A&M University, USA

AI总结提出度量-相位场（MPF），通过解耦度量距离和拓扑相位，结合门控度量公式和残差相位注入，实现从无定向点云中稳定重建薄结构和开放边界。

详情

AI中文摘要

神经有符号距离函数（SDF）在重建水密流形方面表现出色，但由于严格的内外约束，在薄结构和开放边界上失败。相反，无符号距离场（UDF）适应一般几何形状，但在零水平集处存在梯度奇异性，阻碍优化和提取。我们引入度量-相位场（MPF），一种解耦的隐式表示，将度量邻近性与拓扑相位分离。给定无定向点云，MPF学习（i）无符号度量场$r$和（ii）平滑相位场$\theta$，我们推导出一个有界相位指示器$P=\tanh(\beta\theta)$，在有意义的地方提供软内外线索。我们通过门控度量公式和残差相位注入耦合这两个场，以获得具有稳定近表面梯度的有符号隐函数。相位系数$\beta$是可学习的，允许MPF自适应控制相变锐度和软符号指示器的饱和程度。在合成和扫描的薄壳及薄板形状上的实验表明，MPF比最近的基于SDF的方法更忠实地保留薄层结构，同时比基于UDF的方法实现更稳健的训练和更可靠的表面提取。源代码和测试模型见 https://github.com/JIAYI-Scarlett/ICML2026-MPF 。

英文摘要

Neural Signed Distance Functions (SDFs) excel at reconstructing watertight manifolds but fail on thin structures and open boundaries due to strict inside-outside constraints. Conversely, Unsigned Distance Fields (UDFs) accommodate general geometries but suffer from gradient singularities at the zero-level set, hindering optimization and extraction. We introduce Metric-Phase Fields (MPFs), a decoupled implicit representation that separates metric proximity from topological phase. Given an unoriented point cloud, MPFs learn (i) an unsigned metric field $r$ and (ii) a smooth phase field $\theta$, for which we derive a bounded phase indicator $P=\tanh(\beta\theta)$ that provides soft inside-outside cues where they are meaningful. We couple the two fields via a gated-metric formulation with a residual phase injection to obtain a signed implicit function with stable near-surface gradients. The phase coefficient $\beta$ is learnable, allowing MPFs to adaptively control the sharpness of the phase transition and the degree of saturation of the soft sign indicator. Experiments on both synthetic and scanned thin-shell and thin-plate shapes demonstrate that MPFs preserve thin and layered structures more faithfully than recent SDF-based methods, while also enabling more robust training and more reliable surface extraction than UDF-based approaches. Source code and test models are available at https://github.com/JIAYI-Scarlett/ICML2026-MPF.
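The role of the phase coefficient can be seen directly from the bounded indicator P = tanh(βθ) given in the abstract. The sketch below evaluates the indicator and its slope with respect to θ, which is β(1 − tanh²(βθ)) by the chain rule. Only these two quantities are shown. The gated-metric coupling with the metric field r is not specified in the abstract and is deliberately left out. The function names are illustrative.

```python
import math


def phase_indicator(theta, beta):
    """Soft sign indicator P = tanh(beta * theta), bounded in (-1, 1)."""
    return math.tanh(beta * theta)


def phase_slope(theta, beta):
    """dP/dtheta = beta * (1 - tanh(beta * theta) ** 2)."""
    p = math.tanh(beta * theta)
    return beta * (1.0 - p * p)
```

At θ = 0 the slope equals β, so a larger β gives a sharper transition between the two sides. Away from zero the indicator saturates toward ±1 and the slope decays toward 0, which is the saturation behaviour the abstract says β controls.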


2605.25500 2026-05-26 cs.CV 版本更新

MAIL++: 视觉语言模型的多模态双向智能体层

Kaixiang Chen, Pengfei Fang, Hui Xue

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China（新一代人工智能技术及其交叉应用教育部重点实验室（东南大学），中国）

AI总结提出MAIL/MAIL++方法，通过将跨模态耦合嵌入VLM内在计算模块并引入双向桥接，实现参数高效微调，在少样本分类和跨域检索中超越现有方法。

详情

AI中文摘要

将大型视觉语言模型（如CLIP）适应下游任务仍然具有挑战性，因为全微调计算成本高且在小数据场景下容易过拟合。参数高效微调（PEFT）通过轻量级提示或适配器模块缓解了这些问题，而跨模态耦合通过增强视觉和语言之间的交互被证明特别有效。然而，现有的耦合机制主要依赖外部辅助模块，导致间接、粗粒度的交互，这些交互在结构上与原始VLM解耦，从而限制了表示的表达能力。在本文中，我们提出了多模态交互智能体层（MAIL），这是一种PEFT范式，将跨模态耦合直接嵌入VLM的内在计算模块中。MAIL冻结主干网络，并在核心模块（如LayerNorm）之后插入轻量级智能体层，以近似全微调引起的参数更新。为了在这一层面耦合视觉和文本流，我们引入了一个基于瓶颈的文本到图像桥，该桥联合优化跨模态的成对智能体层，协调相应计算模块的适应。我们进一步提出了MAIL++，它通过元智能体层、元文本桥和元图像桥实现了双向跨模态交换。在推理时，所有智能体层被重参数化到冻结的主干网络中，保持原始计算效率。在少样本图像分类和少样本通用跨域检索上的大量实验表明，MAIL和MAIL++始终优于最先进的PEFT方法。

英文摘要

Adapting large vision-language models (VLMs) such as CLIP to downstream tasks remains challenging, as full fine-tuning is computationally prohibitive and prone to overfitting in low-data regimes. Parameter-efficient fine-tuning (PEFT) alleviates these issues with lightweight prompt- or adapter-based modules, and cross-modal coupling has proven especially effective by strengthening interactions between vision and language. However, existing coupling mechanisms predominantly rely on external auxiliary modules, leading to indirect, coarse-grained interactions that are structurally decoupled from the original VLM and thus limit representational expressiveness. In this paper, we propose Multi-Modal Interactive Agent Layer (MAIL), a PEFT paradigm that embeds cross-modal coupling directly into the intrinsic computation modules of VLMs. MAIL freezes the backbone and inserts lightweight agent layers after core modules, such as LayerNorm, to approximate the parameter updates induced by full fine-tuning. To couple visual and textual streams at this level, we introduce a bottleneck-based text-to-image bridge that jointly optimizes paired agent layers across modalities, coordinating the adaptation of corresponding computation modules. We further present MAIL++, which enables bidirectional cross-modal exchange through a meta agent layer, a meta-text bridge, and a meta-image bridge. At inference time, all agent layers are re-parameterized into the frozen backbone, preserving the original computational efficiency. Extensive experiments on few-shot image classification and few-shot universal cross-domain retrieval demonstrate that MAIL and MAIL++ consistently outperform state-of-the-art PEFT methods.
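The claim that agent layers add no inference cost rests on re-parameterization, that is, folding the extra layer into existing weights. The sketch below shows the idea for one simple case: an element-wise affine agent layer placed after LayerNorm. The element-wise form of the agent layer is an assumption chosen to keep the algebra short. The abstract does not state the actual structure of MAIL's agent layers.

```python
import math


def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over one vector with affine parameters gamma, beta."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]


def agent_layer(h, scale, shift):
    """Hypothetical element-wise affine agent layer applied after LayerNorm."""
    return [s * v + c for v, s, c in zip(h, scale, shift)]


def fold_agent_into_layernorm(gamma, beta, scale, shift):
    """Merge the agent layer into LayerNorm's own affine parameters.

    scale * (gamma * x_hat + beta) + shift
        == (scale * gamma) * x_hat + (scale * beta + shift)
    """
    new_gamma = [s * g for s, g in zip(scale, gamma)]
    new_beta = [s * b + c for s, b, c in zip(scale, beta, shift)]
    return new_gamma, new_beta
```

After folding, a single LayerNorm call with the merged parameters reproduces the two-step computation, so the deployed model keeps the original layer count and cost.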

URL PDF HTML ☆

赞 0 踩 0

2605.25461 2026-05-26 cs.CV 版本更新

MetaphorVU: Towards Metaphorical Video Understanding

MetaphorVU：迈向隐喻视频理解

Zhuoqun Li, Boxi Cao, Guiping Jiang, Fangrui Lv, Ruotong Pan, Jianan Wang, Xiangyu Wu, Hongyu Lin, Yaojie Lu, Yong Du, Ruyin Jia, Liyan, Tingting Gao, Han Li, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所信息处理实验室）； University of Chinese Academy of Sciences（中国科学院大学）； Department of Automation, Tsinghua University（清华大学自动化系）

AI总结针对当前多模态大语言模型在隐喻视频理解上的不足，提出首个系统性基准MetaphorVU-Bench，并设计基于隐喻知识图谱的推理增强框架MetaphorBoost，显著提升模型性能。

Comments ICML 2026 spotlight

详情

AI中文摘要

隐喻视频在各种现实场景中广泛存在，用于传达复杂思想，理解它们通常需要高阶认知能力。对隐喻视频理解缺乏系统性研究不仅限制了多模态大语言模型（MLLMs）的现实应用，也阻碍了对其高阶认知能力的全面评估。为填补这一空白，我们提出了MetaphorVU-Bench，这是首个专门用于隐喻视频理解的系统性和综合性基准。通过实验，我们发现当前的MLLMs在准确的隐喻视频理解上存在困难，远落后于人类水平，主要原因是跨域映射存在缺陷。受此发现启发，我们构建了一个隐喻知识图谱作为映射增强，并提出了MetaphorBoost，一个推理时增强框架，实现了持续的性能提升。我们的基准、分析和方法为未来推进MLLMs的研究提供了有用的见解和基础。

英文摘要

Metaphorical videos are prevalent across various real-world scenarios to convey complex ideas, and understanding them typically requires high-order cognitive capabilities. The lack of systematic studies on metaphorical video understanding not only constrains the real-world applicability of MLLMs but also impedes the thorough assessment of their high-order cognitive capabilities. To bridge this gap, we propose MetaphorVU-Bench, the first systematic and comprehensive benchmark dedicated to metaphorical video understanding. Through experiments, we find current MLLMs struggle with accurate metaphorical video understanding, lagging far behind human level, primarily due to defective cross-domain mapping. Motivated by this finding, we construct a metaphor knowledge graph as mapping augmentation and propose MetaphorBoost, an inference-time enhancement framework achieving consistent performance improvement. Our benchmark, analysis, and method provide useful insights and a foundation for future research on advancing MLLMs.

2605.25442 2026-05-26 cs.CV 版本更新

Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models

利用多模态大语言模型增强单图像面部去变形

Nitish Shukla, Arun Ross

发表机构 * IEEE

AI总结提出一种基于多模态大语言模型引导的耦合扩散重建框架，通过提取中间层语义嵌入作为条件，实现无参考的面部去变形，恢复构成图像并保持身份一致性。

详情

AI中文摘要

人脸识别系统越来越容易受到变形攻击，其中合成图像被制作成匹配多个身份，从而实现未经授权的访问和身份欺诈。现有的检测方法可以识别变形图像，但无法恢复构成图像或身份，限制了其取证实用性。本文提出了一种新颖的无参考面部去变形框架，利用多模态大语言模型（MLLMs）引导耦合的扩散重建过程。我们的关键创新在于从MLLM中间层提取语义嵌入以调节去变形过程，提供关于面部属性和身份线索的高级推理，补充低级像素信息。我们将去变形表述为一个耦合的条件生成问题，其中两个构成人脸通过直接在RGB域中操作的去噪扩散模型联合合成，确保身份间一致性，同时保留细粒度的感知细节。与依赖于压缩潜在表示或假设训练集和测试集之间身份重叠的先前方法不同，我们的方法通过直接利用MLLM隐藏状态作为条件信号，绕过了有损的文本生成-重新编码循环，使去噪网络能够关注细微的视觉线索，如头发、背景和面部纹理。消融研究进一步揭示，MLLM中间层编码了更具身份判别性的表示，RGB域去变形在严格操作点上的性能优于潜在空间方法30-40%，并且完整的MLLM嵌入通过多模态预训练的增强语义结构，比原始ViT特征提供了显著优势。

英文摘要

Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation-reencoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30-40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.

2605.25437 2026-05-26 cs.CV 版本更新

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

看见更多意味着知道更多吗？基于单锚优势归一化的多源视觉推理

Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun

发表机构 * Tsinghua University（清华大学）； Northwest Polytechnical University（西北工业大学）； Beijing Jiaotong University（北京交通大学）

AI总结针对多源视觉推理中现有方法无法区分信息增益与干扰的问题，提出MARS框架，通过单源奖励作为动态锚点，将多源融合的信息增益显式纳入优势归一化，在强化学习中自适应增强源间互促并抑制噪声，在GRPO和DAPO上分别提升3.2%和4.9%。

Comments preprint

详情

AI中文摘要

通过可验证奖励的强化学习（RLVR）进行视觉推理已取得显著进展。然而，在处理多源输入时，现有方法倾向于将其视为信息的简单累积，缺乏明确机制来区分整合额外源是否带来信息增益或引入干扰。因此，它们在整合多个源时难以有效建模动态交互，特别是当这些源在物理属性和语义上差异显著时（例如红外和深度），导致当某个源包含主导信号时，性能甚至低于单源推理。为解决此问题，我们提出MARS，一种新颖的基于单锚的多源推理框架，将每个视觉模态建模为独立信息源。具体而言，通过将单源奖励视为动态锚点，我们的方法将多源融合引入的信息增益显式纳入优势归一化，并在RLVR中自适应地强调源间的相互促进，同时抑制潜在噪声或冲突。从理论分析来看，我们的方法有效量化了梯度估计中多源整合引入的信息增益，实现了模态的一致调节。实验结果也表明，在GRPO和DAPO上，跨不同数据集分别取得了3.2%和4.9%的性能提升，证实了方法的有效性。

英文摘要

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information, lacking explicit mechanisms to distinguish whether integrating additional sources yields information gain or introduces interference. Therefore, they struggle to effectively model dynamic interaction when integrating multiple sources, particularly when the sources differ significantly in physical properties and semantics (e.g., infrared and depth), leading to performance inferior to mono-source reasoning when a certain source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization and adaptively emphasizes mutual promotion between sources while suppressing potential noise or conflicts during RLVR. From theoretical analysis, our method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results also show 3.2% and 4.9% performance gains on GRPO and DAPO across diverse datasets, confirming the effectiveness of our method.
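
The abstract does not give the MARS formula, so the following is only a minimal sketch of the general idea, under stated assumptions. `group_advantages` is the usual GRPO-style group normalization. `anchored_advantages` is a hypothetical variant, not the authors' method, in which the baseline is the mean reward of mono-source rollouts for the same prompt, so a multi-source rollout only earns positive advantage when it beats what a single source already achieves.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_advantages(multi_rewards, mono_rewards, eps=1e-6):
    """Hypothetical mono-anchored variant (an assumption, not the paper's formula).

    The baseline is the mean mono-source reward, so the numerator measures the
    gain (or loss) of multi-source fusion relative to single-source reasoning.
    """
    anchor = mean(mono_rewards)
    sigma = pstdev(multi_rewards)
    return [(r - anchor) / (sigma + eps) for r in multi_rewards]


if __name__ == "__main__":
    multi = [1.0, 0.0, 1.0, 0.0]   # verifiable rewards of multi-source rollouts
    mono = [1.0, 1.0, 1.0, 1.0]    # a single source already solves the prompt
    print(group_advantages(multi))
    print(anchored_advantages(multi, mono))
```

In this toy case the group-normalized advantages are roughly +1 and -1, while the anchored variant gives zero credit to correct multi-source rollouts and a larger penalty to incorrect ones, which is one way extra sources that only add interference could be discouraged.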

2605.25427 2026-05-26 cs.CV cs.AI 版本更新

Binding Visual Features Point by Point

逐点绑定视觉特征

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

发表机构 * Princeton University（普林斯顿大学）； Mila – Quebec AI Institute（魁北克AI研究所）； Université de Montréal（蒙特利尔大学）

AI总结研究通过文本引导的“指向”机制解决视觉语言模型在多目标场景中的绑定问题，发现该机制诱导内部视觉搜索程序，消除绑定错误并实现组合泛化。

详情

AI中文摘要

基于SAM模型和掩码引导的弱监督伪装目标检测

Xia Li, Xinran Liu, Lin Qi, Junyu Dong

发表机构 * School of Computer Science and Technology, Ocean University of China, Qingdao 266100, China（中国海洋大学计算机科学与技术学院，青岛266100，中国）

AI总结提出MGNet网络，利用SAM模型生成伪标签，通过级联掩码解码器、上下文增强模块和掩码引导特征聚合模块，实现弱监督伪装目标检测，性能与全监督方法相当。

Comments 18 pages

详情

DOI: 10.1016/j.imavis.2025.105571

AI中文摘要

伪装目标检测（COD）由于目标与背景高度相似，是一项具有挑战性的任务。现有的全监督方法需要耗费大量人力进行像素级标注，因此弱监督方法成为平衡精度与标注效率的可行折中方案。然而，由于使用粗标注，弱监督方法常出现性能下降。本文提出一种新的弱监督伪装目标检测方法以克服这些限制。具体地，我们设计了一个新颖的网络MGNet，通过利用自定义级联掩码解码器（CMD）生成的初始掩码来引导分割过程并增强边缘预测，从而解决边缘模糊和漏检问题。我们引入上下文增强模块（CEM）以减少漏检，以及掩码引导特征聚合模块（MFAM）进行有效的特征聚合。针对弱监督挑战，我们提出BoxSAM，利用带有边界框提示的Segment Anything Model（SAM）生成伪标签。通过采用冗余处理策略，为训练MGNet提供高质量的像素级伪标签。大量实验表明，我们的方法在性能上与当前最先进方法具有竞争力。

英文摘要

Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundings. Existing fully supervised methods require labor-intensive pixel-level annotations, making weakly supervised methods a viable compromise that balances accuracy and annotation efficiency. However, weakly supervised methods often experience performance degradation due to the use of coarse annotations. In this paper, we introduce a new weakly supervised approach for camouflaged object detection to overcome these limitations. Specifically, we propose a novel network, MGNet, which tackles edge ambiguity and missed detections by utilizing initial masks generated by our custom-designed Cascaded Mask Decoder (CMD) to guide the segmentation process and enhance edge predictions. We introduce a Context Enhancement Module (CEM) to reduce missed detections, and a Mask-guided Feature Aggregation Module (MFAM) for effective feature aggregation. For the weak supervision challenge, we propose BoxSAM, which leverages the Segment Anything Model (SAM) with bounding-box prompts to generate pseudo-labels. By employing a redundant processing strategy, high-quality pixel-level pseudo-labels are provided for training MGNet. Extensive experiments demonstrate that our method delivers competitive performance against current state-of-the-art methods.

2605.25377 2026-05-26 cs.CV cs.AI 版本更新

Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation

对抗正交解缠用于LVLM幻觉缓解

Ruoxi Cheng, Haoxuan Ma, Zhengfei Hai, Yiyan Huang, Ranjie Duan, Tianle Zhang, Xu Yang, Ziyi Ye, Xingjun Ma

发表机构 * Fudan University（复旦大学）； Tencent（腾讯）； Nanjing University（南京大学）； Southeast University（东南大学）； Great Bay University（大湾区大学）； TeleAI, China Telecom（TeleAI，中国电信）

AI总结提出对抗正交解缠（AOD）框架，通过最小最大目标学习幻觉相关方向，并利用双前向对比解码策略，在不需额外训练的情况下缓解大型视觉语言模型（LVLM）的幻觉问题。

详情

AI中文摘要

大型视觉语言模型（LVLM）推进了多模态理解，但其可靠性受到幻觉的限制，即生成内容与视觉事实冲突。现有缓解方法要么依赖昂贵的外部干预（如指令调优和检索），要么使用受限于有缺陷的注意力权重和纠缠的隐藏表示的内部机制。我们提出对抗正交解缠（AOD），一种用于缓解LVLM幻觉的潜在几何框架。AOD通过最小最大目标学习幻觉相关方向：分类器将幻觉信号集中到投影分量中，而对抗器通过梯度反转层将其从正交残差空间中移除。学习到的方向使得一种无需训练的双前向对比解码策略能够抑制幻觉同时保持通用能力。在三个LVLM上进行的四个幻觉和四个效用基准实验表明，AOD一致优于强基线。它在POPE上平均提高超过6%的准确率，将AMBER提升6%，并在MMMU等效用任务上保持强劲性能。进一步分析显示跨数据集的鲁棒迁移，表明AOD捕获了通用的幻觉相关偏差而非数据集特定伪影。我们的源代码和数据集可在https://github.com/Hunter-Wrynn/AOD获取。

英文摘要

Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where generated content conflicts with visual facts. Existing mitigation methods either rely on costly external interventions, such as instruction tuning and retrieval, or use internal mechanisms that remain limited by flawed attention weights and entangled hidden representations. We propose Adversarial Orthogonal Disentanglement (AOD), a latent geometric framework for mitigating LVLM hallucinations. AOD learns a hallucination-related direction through a minimax objective: a classifier concentrates hallucination signals into the projected component, while an adversary removes them from the orthogonal residual space via a Gradient Reversal Layer. The learned direction enables a training-free dual-forward-pass contrastive decoding strategy that suppresses hallucinations while preserving general capabilities. Experiments on three LVLMs across four hallucination and four utility benchmarks show that AOD consistently outperforms strong baselines. It improves POPE accuracy by over 6% on average, boosts AMBER by 6%, and maintains strong performance on utility tasks such as MMMU. Further analysis shows robust transfer across datasets, suggesting that AOD captures general hallucination-related biases rather than dataset-specific artifacts. Our source code and datasets are available at https://github.com/Hunter-Wrynn/AOD.
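
The geometric core of the description above, splitting a hidden state into a component along a learned direction and an orthogonal residual, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the direction is supplied by hand rather than learned with a minimax objective, and `contrast_logits` uses the common contrastive-decoding form `(1 + alpha) * base - alpha * contrast`, which is an assumption about how two forward passes might be combined rather than the released AOD code.

```python
import math


def split_along_direction(h, v):
    """Split hidden state h into its projection onto v and the orthogonal residual."""
    norm = math.sqrt(sum(x * x for x in v))
    u = [x / norm for x in v]                      # unit direction
    coeff = sum(a * b for a, b in zip(h, u))       # scalar projection of h on u
    projected = [coeff * x for x in u]
    residual = [a - p for a, p in zip(h, projected)]
    return projected, residual


def contrast_logits(base_logits, contrast_logits_, alpha=0.5):
    """Common contrastive-decoding combination of two forward passes (assumed form)."""
    return [(1 + alpha) * b - alpha * c for b, c in zip(base_logits, contrast_logits_)]


if __name__ == "__main__":
    projected, residual = split_along_direction([3.0, 4.0], [2.0, 0.0])
    print(projected, residual)
    print(contrast_logits([2.0, 1.0], [1.0, 1.0], alpha=0.5))
```

By construction the residual has zero inner product with the direction, which is the property the adversary in the abstract is meant to exploit: whatever the classifier reads from the projected part should be absent from the residual.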

2605.25373 2026-05-26 cs.CV 版本更新

Physics-Aware 3D Gaussian Editing for Driving Scene Generation

物理感知的三维高斯编辑用于驾驶场景生成

Feng Zhou, Jian Zhang, Yuhang Sun, He Wang, Qiong Wen, Debao Kong, Tieru Wu, Rui Ma

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University（吉林大学汽车底盘集成与仿生国家重点实验室）； China FAW Group Co., Ltd.（中国第一汽车集团有限公司）

AI总结提出RoVES系统，通过单图像驱动的道路几何插入和4-DOF半车动力学模型，实现物理感知的驾驶场景编辑与车辆姿态校正。

详情

AI中文摘要

三维高斯泼溅（3DGS）在自动驾驶仿真和数据生成中展现出巨大潜力，能够实现逼真的重建和灵活的场景操作。然而，现有的3DGS场景编辑方法对道路几何编辑（例如插入减速带或凹陷路面）支持有限，并且通常不将此类编辑与合理的车辆-道路交互动力学耦合。这种编辑对于在极端驾驶场景下生成训练数据或评估系统在这些道路不规则情况下的可靠性至关重要。此外，许多基于优化的方法需要每次编辑进行数分钟的细化，而现有的高效替代方案主要关注外观级别或对象级别的操作，而非物理感知的道路不规则编辑。为了解决这些限制，我们提出了RoVES，一个用于驾驶场景中物理感知三维高斯编辑的道路和车辆编辑系统。RoVES实现了单图像驱动的道路几何插入，并将编辑后的道路轮廓与4-DOF半车动力学模型耦合，以实现垂直位移和俯仰方向上的物理感知车辆姿态校正。RoVES以一次性、无优化的流水线（1.84秒）插入道路元素，完整流水线（包括颜色转移和基于车辆动力学的姿态校正）在6.24秒内完成；它通过姿态编辑编辑动态车辆，并逐帧校正姿态以近似动力学一致的垂直位移和俯仰响应。在Waymo数据集上的实验表明，RoVES为物理感知的驾驶场景生成提供了实用的效率和具有竞争力的视觉一致性。

英文摘要

3D Gaussian Splatting (3DGS) has shown great potential in autonomous driving simulation and data generation, enabling photorealistic reconstruction and flexible scene manipulation. However, existing 3DGS scene editing methods have limited support for road geometry editing (e.g., inserting speed humps or sunken roads), and generally do not couple such edits with plausible vehicle-road interaction dynamics. Such editing is essential for generating training data under extreme driving scenarios or evaluating system reliability under these road irregularities. Moreover, many optimization-based methods require minutes of per-edit refinement, while existing efficient alternatives mainly focus on appearance-level or object-level manipulation rather than physics-aware road irregularity editing. To address these limitations, we propose RoVES, a Road-and-Vehicle Editing System for physics-aware 3D Gaussian editing in driving scenes. RoVES enables single-image-driven road geometry insertion and couples the edited road profile with a 4-DOF half-car vehicle dynamics model to achieve physics-aware vehicle pose correction in vertical displacement and pitch. RoVES inserts road elements in a one-shot, optimization-free pipeline (1.84s), and the full pipeline (including color transfer and vehicle-dynamics-based pose correction) completes in 6.24s; it edits dynamic vehicles via pose editing and corrects poses frame-by-frame to approximate dynamics-consistent vertical displacement and pitch responses. Experiments on the Waymo dataset show that RoVES provides practical efficiency and competitive visual consistency for physics-aware driving scene generation.

2605.25364 2026-05-26 cs.CV 版本更新

Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

MLLMs 能否超越语言进行推理？VisReason：一个面向视觉中心推理的综合基准

Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出 VisReason 基准，包含 1505 个日常场景问题，评估多模态大模型在视觉中心推理上的表现，揭示人类与模型间的显著差距。

Comments Accepted by ACL 2026 Findings, resources released at https://github.com/CASIA-IVA-Lab/VisReason

2605.25363 2026-05-26 cs.CV 版本更新

MARVEL: Universal Murray's Law-informed Vessel Tree Segmentation and Topology Estimation

MARVEL：基于Murray定律的通用血管树分割与拓扑估计

Yi Zhou, Thiara Sana Ahmed, Jacqueline Chua, Meng Wang, Qinrong Zhang, Alejandro F. Frangi, Huazhu Fu, Jun Cheng, Leopold Schmetterer, Bingyao Tan

发表机构 * Singapore Eye Research Institute（新加坡眼科学研究院）； Singapore National Eye Centre（新加坡国家眼科中心）； Ophthalmology & Visual Sciences Academic（眼科与视觉科学学术）

AI总结提出一种与骨干网络无关的框架MARVEL，通过可微分的Murray定律约束正则化训练，提升血管分割的生理合理性、拓扑一致性，并在高血压分类任务中显著优于基线模型。

Comments 10 pages, 18 figures

详情

AI中文摘要

血管循环遵循优化质量传输和代谢能量消耗的基本生物物理原理，这些原理可以通过Murray定律有效建模。然而，当代深度学习方法用于血管分割时往往忽略这些生物物理约束，导致生理上不合理的分支和血管树误分类，使得这些自动分割结果对于下游临床任务（如血流模拟或疾病量化）不可靠。在本文中，我们引入MARVEL（基于Murray定律的通用血管分割与拓扑估计），一个与骨干网络无关的框架，将生物物理先验整合到血管树提取中。MARVEL结合逐像素监督与显式半径预测，以强制执行从经验宽度-指数映射导出的局部分叉约束。我们在训练期间将这些约束实现为可微正则化器，以引导模型朝向生理一致的重建。我们在八个公开数据集上评估MARVEL，涵盖多种血管模态和分割骨干网络。结果表明MARVEL在分割准确性、拓扑一致性和生理合理性方面具有优越性能。通过将分割掩膜转换为基于图的血流动力学模拟，我们证明MARVEL保留了区分高血压眼和正常眼所需的细微病理狭窄和拓扑连接。结果显示，MARVEL通过眼内动静脉压力差显著改善了高血压的分类（p < 0.001），在拓扑一致性和临床预测价值方面均优于基线模型。

英文摘要

Vascular circulation follows fundamental biophysical principles that optimize mass transport and metabolic energy expenditure, which can be effectively modeled by Murray's law. However, contemporary deep learning methods for vascular segmentation often neglect these biophysical constraints. This leads to physiologically implausible branching and misclassified vascular trees, rendering these automated segmentation results unreliable for downstream clinical tasks such as blood flow simulation or disease quantification. In this paper, we introduce MARVEL (Universal MurrAy's law-infoRmed Vessel sEgmentation and topoLogy estimation), a backbone-agnostic framework that integrates biophysical priors into vascular tree extraction. MARVEL combines per-pixel supervision with explicit radius predictions to enforce local bifurcation constraints derived from an empirical width-exponent mapping. We implement these constraints as differentiable regularizers during training to guide models toward physiologically consistent reconstructions. We evaluate MARVEL on eight public datasets across multiple vascular modalities and segmentation backbones. Results demonstrate MARVEL's superior performance in segmentation accuracy, topological consistency, and physiological plausibility. By converting segmented masks into graph-based hemodynamic simulations, we demonstrate that MARVEL preserves the subtle pathological narrowing and topological connectivity required to distinguish hypertensive from normotensive eyes. Results show that MARVEL significantly improves the classification of hypertension via arteriovenous pressure differences in the eye (p < 0.001), outperforming baseline models in both topological consistency and clinical predictive value.
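
For readers unfamiliar with Murray's law, the bifurcation constraint it implies is that the parent radius raised to an exponent equals the sum of the child radii raised to the same exponent (the classical exponent is 3). The sketch below shows how a violation of that relation could be scored; it is an assumption-laden illustration, not MARVEL's regularizer. The function names are hypothetical, the exponent is a plain parameter standing in for the paper's empirical width-exponent mapping, and a training implementation would write the same smooth expression with tensors in an autodiff framework.

```python
def murray_residual(parent_r, child_rs, exponent=3.0):
    """Relative violation of r_parent^k = sum(r_child^k); zero means the law holds."""
    lhs = parent_r ** exponent
    rhs = sum(r ** exponent for r in child_rs)
    return (lhs - rhs) / lhs


def murray_penalty(bifurcations, exponent=3.0):
    """Mean squared residual over (parent_radius, [child_radii]) pairs."""
    residuals = [murray_residual(p, cs, exponent) for p, cs in bifurcations]
    return sum(r * r for r in residuals) / len(residuals)


if __name__ == "__main__":
    symmetric_child = 2 ** (-1 / 3)   # two equal children that satisfy the cubic law
    print(murray_residual(1.0, [symmetric_child, symmetric_child]))
    print(murray_penalty([(1.0, [1.0, 1.0])]))
```

A bifurcation whose children are as wide as the parent yields a residual of -1 under the cubic law, so a penalty of this shape pushes predicted radii toward tapering that is physiologically plausible.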

2605.25357 2026-05-26 cs.CV cs.MA 版本更新

Towards Reliable Fetal Ultrasound Interpretation with Multi-Agent Collaboration

面向可靠胎儿超声解读的多智能体协作

Xiaotian Hu, Mingxuan Liu, Junwei Huang, Kasidit Anmahapong, Yifei Chen, Yiming Huang, Xuguang Bai, Zihan Li, Hongjia Yang, Yingqi Hao, Hong Xu, Yu Jiang, Tian Tian, Yi Liao, Haibo Qu, Qiyuan Tian

发表机构 * Tsinghua University（清华大学）； University of California San Diego（加州大学圣地亚哥分校）； West China Second University Hospital, Sichuan University（四川大学华西第二医院）

AI总结提出FetUSAgents多智能体系统，通过协作LLM代理和双路径证据仲裁（DPEA）整合视觉工具与临床推理，在胎儿超声VQA、报告生成等任务上超越最强基线25%以上。

详情

AI中文摘要

自动化胎儿超声解读需要从视觉感知（包括平面识别和解剖分割）到临床理解（包括生物测量和诊断报告）的工作流程。然而，当前“一任务一模型”的范式限制了跨多步骤过程的系统性证据整合。尽管多模态大语言模型（MLLM）展现出有前景的视觉理解能力，但其有限的领域特定基础和幻觉风险限制了在胎儿超声分析中的可靠性。为解决这些限制，我们提出了FetUSAgents，一个工具增强的多智能体系统，用于全面的胎儿超声解读，支持视觉问答（VQA）、报告生成、图像描述和视频总结。FetUSAgents通过协作的LLM代理协调任务特定的视觉工具，并将临床查询分解为从解剖识别到定量测量的子任务。我们进一步引入了双路径证据仲裁（DPEA），它将基于LLM的审慎推理与来自专业视觉工具的结构化计算证据相结合。一个检索增强的证据库整合中间发现，以支持可追溯且临床可靠的结论。此外，我们构建了FetUS-VQA，一个专门用于胎儿超声的VQA基准，包含1,892张图像和3,205个问答对，涵盖10个临床任务。广泛的分布外实验表明，FetUSAgents优于通用和医学MLLM，在VQA准确率上超过最强基线25%以上。这些结果表明了一条通往产前成像的基于证据的临床助手的可扩展路径。代码已公开。

英文摘要

Automated fetal ultrasound interpretation requires a workflow from visual perception, including plane recognition and anatomical segmentation, to clinical understanding, including biometric measurement and diagnostic reporting. However, the prevailing "one-task, one-model" paradigm limits systematic integration of evidence across this multi-step process. Although multimodal large language models (MLLMs) show promising visual understanding, their limited domain-specific grounding and hallucination risks restrict reliability in fetal ultrasound analysis. To address these limitations, we propose FetUSAgents, a tool-augmented multi-agent system for comprehensive fetal ultrasound interpretation, supporting visual question answering (VQA), report generation, image captioning, and video summarization. FetUSAgents coordinates task-specific visual tools through collaborative LLM agents and decomposes clinical queries into subtasks that progress from anatomical recognition to quantitative measurement. We further introduce Dual-Path Evidence Arbitration (DPEA), which integrates LLM-based deliberative reasoning with structured computational evidence from specialized visual tools. A retrieval-enhanced evidence bank consolidates intermediate findings to support traceable and clinically grounded conclusions. In addition, we construct FetUS-VQA, a dedicated VQA benchmark for fetal ultrasound, comprising 1,892 images and 3,205 question-answer pairs across 10 clinical tasks. Extensive out-of-distribution experiments show that FetUSAgents outperforms general and medical MLLMs, exceeding the strongest baseline by more than 25 percent in VQA accuracy. These results suggest a scalable route toward evidence-driven clinical assistants for prenatal imaging. Code is available.

2605.25348 2026-05-26 eess.IV cs.AI cs.CV cs.LG cs.SC 版本更新

Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization

基于深度图拉普拉斯正则化的参数高效CT重建

Veera Varuni Radhakrishnan, Chinthaka Dinesh, Qurat-ul-Ain Azim

发表机构 * Mechanical and Industrial Engineering Department（机械与工业工程系）

AI总结提出深度图拉普拉斯正则化（Deep GLR）方法，通过将二次图正则化集成到近端前向-后向分裂优化框架中，仅用少量参数和数据即可实现低剂量CT重建的噪声抑制，在参数效率和数据效率上显著优于现有方法。

Comments 7 pages, 3 figures, conference

详情

AI中文摘要

低剂量计算机断层扫描（LDCT）重建面临重建质量与资源需求之间的关键权衡。虽然最近的深度学习方法达到了最先进的性能，但它们通常依赖超过50万个参数，并在超过35,000次扫描的大规模数据集上训练。本文研究在严格资源约束下，基于图的正则化是否能提供有意义的噪声抑制。我们提出了深度图拉普拉斯正则化（Deep GLR），将二次图正则化集成到近端前向-后向分裂优化框架中，并包含三个轻量级CNN模块。在LoDoPaB-CT基准上评估，Deep GLR达到了30.70 dB的PSNR，相比滤波反投影提高了6.33 dB，同时仅使用了91,848个参数，在1000个样本上训练（标准训练集的2.8%）。与基准方法相比，这代表了每dB改进5.8倍的参数效率和30倍的数据效率。学习到的图带宽参数（ε=1.25）收敛到可解释的值，表明该方法捕捉了有意义的图像先验而非过拟合。尽管与最先进方法相比仍有13 dB的差距，但结果表明基于图的正则化为资源受限的医学成像场景提供了有利的效率-质量权衡。

英文摘要

Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. While recent deep learning methods achieve state-of-the-art performance, they typically rely on over 500,000 parameters trained on large-scale datasets exceeding 35,000 scans. This work investigates whether graph-based regularization can provide meaningful noise reduction under strict resource constraints. We propose Deep Graph Laplacian Regularization (Deep GLR), integrating quadratic graph regularization into a Proximal Forward-Backward Splitting optimization framework with three lightweight CNN modules. Evaluated on the LoDoPaB-CT benchmark, Deep GLR achieves 30.70 dB PSNR, representing a 6.33 dB improvement over filtered backprojection, while using only 91,848 parameters trained on 1,000 samples (2.8% of the standard training set). Compared to benchmark methods, this represents 5.8 times better parameter efficiency and 30 times better data efficiency per dB improvement. The learned graph bandwidth parameter (ε = 1.25) converges to interpretable values, suggesting the method captures meaningful image priors rather than overfitting. While a 13 dB gap remains versus state-of-the-art methods, results demonstrate that graph-based regularization provides a favorable efficiency-quality tradeoff for resource-constrained medical imaging scenarios.
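
Quadratic graph Laplacian regularization has a simple closed form: minimizing `||x - y||^2 + mu * x^T L x` gives the linear system `(I + mu * L) x = y`. The sketch below solves that system on a 1D chain graph with Jacobi iterations. It is a toy under stated assumptions, not the Deep GLR pipeline: there is no CT forward operator and no CNN, the function names are hypothetical, and using ε as the bandwidth of a Gaussian edge-weight kernel is an assumption about what the "graph bandwidth parameter" controls.

```python
import math


def chain_weights(signal, eps):
    """Gaussian edge weights between neighbouring samples of a 1D signal."""
    return [
        math.exp(-((signal[i] - signal[i + 1]) ** 2) / (2 * eps ** 2))
        for i in range(len(signal) - 1)
    ]


def glr_denoise(y, mu=1.0, eps=1.25, iters=200):
    """Solve (I + mu * L) x = y by Jacobi iteration, L being the chain-graph Laplacian."""
    w = chain_weights(y, eps)
    n = len(y)
    x = list(y)
    for _ in range(iters):
        new_x = []
        for i in range(n):
            numerator = y[i]
            degree = 0.0
            if i > 0:                       # left neighbour
                numerator += mu * w[i - 1] * x[i - 1]
                degree += w[i - 1]
            if i < n - 1:                   # right neighbour
                numerator += mu * w[i] * x[i + 1]
                degree += w[i]
            new_x.append(numerator / (1 + mu * degree))
        x = new_x
    return x


if __name__ == "__main__":
    noisy = [0.0, 1.0, 0.0, 1.0, 0.0]
    print(glr_denoise(noisy))
```

The system matrix is strictly diagonally dominant, so the Jacobi iteration converges. The solution keeps the mean of the input while reducing the variation between neighbouring samples, which is the smoothing behaviour the regularizer is meant to provide.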

2605.25347 2026-05-26 cs.CV cs.LG 版本更新

ERNIE-Image Technical Report

ERNIE-Image 技术报告

Jiaxiang Liu, Zhida Feng, Pengyu Zou, Zhenyu Qian, Tianrui Zhu, Jun Xia, Yuehu Dong, Yanzheng Lin, Honglin Xiong, Anqi Chen, Yunpeng Ding, Jinghui Duan, Lin Gao, Chao Han, Tiechao He, Jiakang Hu, Ranjun Hua, Xueming Jiang, Qingli Kong, Yuting Lei, Tianyu Li, Yunlin Liu, Changling Liu, Yaxin Liu, Yi Liu, Xuguang Liu, Xiaolong Ma, Yan Pan, Yiran Ren, Nan Sheng, Yu Sun, Siyang Sun, Yixiang Tu, Yang Wan, Huanai Wang, Siqi Wang, Yang Wu, Youzhi Yang, Xiaowen Yang, Jianwen Yang, Yehua Yang, Quanwen Zhang, Xinmin Zhang, Haoxin Zhang, Xiang Zhang, Jun Zhang, Qian Zhang, Qiao Zhao, Qi Zhou

发表机构 * ERNIE Team, Baidu（百度ERNIE团队）

AI总结提出基于8B单流DiT架构的开源文本到图像生成模型ERNIE-Image，通过自底向上的预训练数据构建和自顶向下的后训练数据构建，结合稳定DPO策略和MT-DMD蒸馏方法，在指令遵循、文本渲染和美学质量上接近顶级商业模型。

详情

AI中文摘要

我们介绍了ERNIE-Image，一个基于8B单流DiT架构构建的开源文本到图像生成模型。ERNIE-Image旨在通过更有效地挖掘大规模预训练数据并在整个训练过程中提高监督质量，来弥合当前开源模型与领先闭源系统之间的差距。在预训练阶段，我们采用自底向上的数据构建流程，结合细粒度图像分类、丰富的标题注释、美学评估和分层采样。该策略在保留长尾概念和详细真实世界知识的同时减少数据噪声，为复杂生成任务提供了更坚实的基础。在后训练阶段，我们针对高需求场景使用自顶向下的数据构建流程，多样化提示注释以更好地匹配真实用户输入，并应用稳定的DPO策略使模型与人类美学偏好对齐。我们进一步训练ERNIE-Image-Turbo以实现高效的8-NFE生成，并提出MT-DMD以减轻蒸馏过程中的能力漂移。为了使模型在实际场景中更易于使用，我们为其配备了一个轻量级的提示增强器，将简洁的用户意图扩展为结构化的视觉描述。此外，我们开发了工业级美学模型ERNIE-Image-Aes，以及用于真实美学评估的人工标注基准ERNIE-Image-Aes-1K。大量的定性和定量实验表明，ERNIE-Image在开源模型中实现了领先性能，并在指令遵循、文本渲染和美学质量方面接近顶级商业模型。我们发布训练好的模型和美学资源，以促进AIGC社区的进一步学术研究和技术进步。

英文摘要

We introduce ERNIE-Image, an open-source text-to-image generation model built upon an 8B single-stream DiT architecture. ERNIE-Image aims to bridge the gap between current open-source models and leading closed-source systems through more effective mining of large-scale pre-training data and improved supervision quality throughout training. During pre-training, we adopt a bottom-up data construction pipeline that combines fine-grained image categorization, rich caption annotation, aesthetic assessment, and hierarchical sampling. This strategy reduces data noise while preserving long-tail concepts and detailed real-world knowledge, providing a stronger foundation for complex generation tasks. In the post-training stage, we use a top-down data construction pipeline for high-demand scenarios, diversify prompt annotations to better match real user inputs, and apply a stabilized DPO strategy to align the model with human aesthetic preferences. We further train ERNIE-Image-Turbo for efficient 8-NFE generation and propose MT-DMD to mitigate capability drift during distillation. To make the model easier to use in practical scenarios, we equip it with a lightweight Prompt Enhancer that expands concise user intents into structured visual descriptions. In addition, we develop ERNIE-Image-Aes, an industrial-grade aesthetic model, together with ERNIE-Image-Aes-1K, a human-annotated benchmark for realistic aesthetic evaluation. Extensive qualitative and quantitative experiments show that ERNIE-Image achieves leading performance among open-source models and approaches top-tier commercial models in instruction following, text rendering, and aesthetic quality. We release the trained models and aesthetic resources to facilitate further academic research and technical progress in the AIGC community.

2605.25345 2026-05-26 cs.GR cs.CV 版本更新

Depth Peeling for High-Fidelity Gaussian-Enhanced Surfel Rendering

用于高保真高斯增强面元渲染的深度剥离

Keyang Ye, Hongzhi Wu, Kun Zhou

发表机构 * State Key Lab of CAD&CG, Zhejiang University（计算机辅助设计与图形学国家重点实验室，浙江大学）； Hangzhou Research Institute of Holographic and AI Technology（杭州全息与人工智能技术研究院）

AI总结提出DP-GES，通过半透明边界增强不透明面元并利用深度剥离建立逐像素排序，实现无排序高斯溅射和正确透射率调制，消除锯齿和弹出伪影，提升重建质量。

2605.25343 2026-05-26 cs.CV 版本更新

Toward Native Multimodal Modeling: A Roadmap

迈向原生多模态建模：路线图

Siyu An, Junru Lu, Junnan Dong, Qiufeng Wang, Yinghui Li, Weizhi Fei, Zichao Yu, Zheng Yuan, Biao Liu, Haopeng Wang, Renzhao Liang, Yixuan Yang, Yunhang Shen, Bo Ke, Keyu Chen, Linhao Luo, Difan Zou, Xiao Huang, Di Yin, Ruizhi Qiao, Xing Sun

发表机构 * Tencent Youtu Lab（腾讯优图实验室）； Tsinghua University（清华大学）； The University of Hong Kong（香港大学）； University of Warwick（华威大学）； Monash University（蒙纳士大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结本文提出从非原生多模态范式向原生多模态建模（NMM）过渡的正式路线图，通过输入-输出二元性分类现有模型，并系统探讨架构协调、数据整理、训练推理及评估的全栈工业级方案。

Comments 52 pages, 5 figures, 3 tables, ~300 references

详情

AI中文摘要

多模态建模是从模态无关推理迈向世界建模的关键一步。早期方法主要依赖后期融合，即组装编码器、冻结语言骨干网络和输出头；而近期研究已将范式转向原生多模态建模（NMM），通过模态的内在集成实现卓越的多模态性能。尽管潜力巨大，原生架构的设计空间仍缺乏明确定义。本文向社区呈现了这一过渡的正式路线图。具体而言，我们正式定义了架构原生性，将中期融合和早期融合与非原生范式区分开来。我们进一步通过输入-输出二元性的视角将现有原生模型组织为三类：(i) 多到文本，用于仅输出文本的跨模态理解；(ii) 多到目标，用于面向场景的生成，例如图像、音频和视频生成；(iii) 多到多，用于对称输入-输出的统一建模。我们对迈向最终NMM框架的过渡进行了全面且工业级的调查，在该框架中，理解和生成在统一的Transformer范式中无缝共存。我们从工业视角系统地拆解了端到端流水线，包括架构协调、大规模数据整理、全栈训练配方、推理与部署，以及真正原生建模的综合评估。

英文摘要

Multimodal modeling represents a vital step from modality-agnostic reasoning toward world modeling. While early approaches predominantly rely on late fusion that assembles encoders and frozen language backbones with output heads, recent efforts have shifted the paradigm toward native multimodal modeling (NMM), which integrates modalities intrinsically for superior multimodal performance. Despite its potential, the design space of native architectures remains insufficiently defined. In this paper, we present the community with a formalized roadmap for this transition. Specifically, we formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. We further organize the existing native models through the lens of input-output duality into three categories: (i) Multi-to-Text for cross-modal comprehension with text-only output; (ii) Multi-to-Target for scenario-oriented generation, e.g., image, audio, and video generation; and (iii) Multi-to-Multi for unified modeling with symmetric input-output. We deliver a comprehensive and industrial-grade investigation into the transition toward the definitive NMM framework, where understanding and generation seamlessly coexist within a unified transformer paradigm. We systematically unpack the end-to-end pipeline from an industrial perspective, covering architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.

2605.25334 2026-05-26 cs.CV 版本更新

Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence

双路径几何感知多模态大语言模型用于空间智能

Yufei Zheng, Xuhan Zhu, Zide Liu, Chunpeng Zhou, Chenfeng Wang, Yongchao Xu, Yunnan Wang, Jiawei Liu, Pengfei Yu, Wei Zhai, Yang Cao, Zheng-Jun Zha

发表机构 * University of Science and Technology of China（中国科学技术大学）； Li Auto Inc.（理想汽车）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出GAMSI，一种仅以RGB图像为输入、通过双路径查询和专家引导视觉对齐实现3D结构与度量尺度联合感知的多模态大语言模型，在七个空间智能基准上达到最优性能。

详情

AI中文摘要

从2D视觉输入理解物理世界的空间能力依赖于两种互补的几何知识：整体3D结构感知和细粒度度量尺度估计。现有的多模态大语言模型通常只处理其中一个方面，将深度图或点云作为额外模型输入，这带来了大量计算开销并继承了上游预测模型的泛化局限性。我们提出GAMSI，一种双路径几何感知多模态大语言模型用于空间智能，仅以RGB图像为输入，同时在统一的自回归骨干网络内内化两种几何先验。具体地，我们引入度量-结构解耦查询，使用两组可学习查询分别从共享视觉上下文中提取密集度量信号和稀疏结构线索，并通过任务解耦注意力掩码防止两条路径相互污染。在此基础上，专家引导视觉定位模块将聚合的线索投影回帧级视觉特征，并与视觉基础模型对齐，这些模型仅作为训练时的监督，而非模型输入。我们进一步构建了一个多任务空间指令微调数据集，包含152,776个样本，涵盖13种任务类型和三种视觉模态，整合自六个公共数据集。通过两阶段课程训练，GAMSI在七个空间智能基准上达到了最先进的性能。

英文摘要

Spatial understanding of the physical world from 2D visual inputs hinges on two complementary forms of geometric knowledge: holistic 3D structural perception and fine-grained metric scale estimation. Existing multimodal large language models (MLLMs) typically address only one facet, ingesting either depth maps or point clouds as additional model inputs, which incurs substantial computational overhead and inherits the generalization limitations of upstream prediction models. We propose GAMSI, a dual-pathway Geometry-Aware MLLM for Spatial Intelligence that takes only RGB images as input while internalizing both forms of geometric prior within a unified autoregressive backbone. Specifically, we introduce Metric-Structure Decoupled Queries (MSDQ), which employ two groups of learnable queries to respectively extract dense metric signals and sparse structural cues from the shared visual context, with a task-decoupled attention mask further preventing the two pathways from contaminating each other. Building on this, an Expert-Guided Visual Grounding (EVG) module projects the aggregated cues back to frame-level visual features and aligns them with vision foundation models, which serve purely as training-time supervision rather than as model inputs. We further build a multi-task spatial instruction-tuning dataset (MTS) comprising 152,776 samples spanning 13 task types and three visual modalities, consolidated from six public datasets. Trained with a two-stage curriculum, GAMSI achieves state-of-the-art performance on seven spatial intelligence benchmarks.

2605.25333 2026-05-26 cs.CV 版本更新

递归类连接分类（R3C）应用于二值图像分割以改进婴儿指纹增强

Joao Leonardo Harres Dall Agnol, Luiz Fernando Puttow Southier, Jefferson Tales Oliva, Marcelo Teixeira, Rodrigo Mineto, Marcelo Filipa, Dalcimar Casanova, Erick Oliveira Rodrigues

发表机构 * Infant.ID Ltda（Infant.ID公司）； Graduate Program in Production and Systems Engineering (PPGEPS), Federal University of Technology - Paraná (UTFPR)（生产与系统工程研究生项目，联邦技术大学-巴拉那（UTFPR））

AI总结提出递归类连接分类（R3C）框架，通过迭代扩展脊线结构改进现有增强方法的二值分割输出，无需训练数据即可提升婴儿指纹识别率。

详情

DOI: 10.1109/ACCESS.2025.3594912
Journal ref: IEEE Access 2025

AI中文摘要

图像增强在婴儿指纹匹配中至关重要，因为儿童特有的特征（如较小的手指尺寸和较薄的脊线结构）通常会在采集过程中降低图像质量。为解决这些限制，注册通常依赖于专门的高分辨率扫描仪，而大多数现有增强方法并非为此设计。因此，儿童的识别率仍显著低于成人指纹。本研究引入递归类连接分类（R3C），一种通过扩展脊线结构迭代细化现有增强方法二值分割输出的新颖框架。R3C不需要修改底层分类器，且无需训练数据（目前婴儿指纹尚无此类数据）。相反，该方法通过将分类后的图像反复反馈到分类过程中，同时将每个中间分割与原始输入图像结合，从而改进分割。在三个指纹数据集上使用四种不同增强分类器进行的实验表明，与单独使用增强方法相比，R3C可将儿童的真接受率（TAR）提高最多4%，新生儿提高超过40%。定性分析进一步表明，R3C重新连接了断裂的脊线模式，改善了分割的视觉质量。由于独立于所使用的增强方法，R3C为改进二值分割提供了灵活且广泛适用的解决方案。

英文摘要

Image enhancement plays a crucial role in infant fingerprint matching, as child-specific characteristics such as smaller finger dimensions and thinner ridge structures often degrade image quality during acquisition. To address these limitations, enrollment typically depends on specialized high-resolution scanners, which most existing enhancement methods are not designed to support. Consequently, identification rates for children remain significantly lower than those achieved with adult fingerprints. This study introduces Recursive Class Connectivity Classification (R3C), a novel framework that iteratively refines binary segmentation outputs from existing enhancement methods by extending ridge structures. R3C does not require modifications to the underlying classifier and operates without training data, which is not currently available for infant fingerprints. Instead, the method improves segmentation by repeatedly feeding the classified image back into the classification process, while combining each intermediate segmentation with the original input image. Experiments conducted on three fingerprint datasets using four different enhancement classifiers show that R3C can increase the True Acceptance Rate (TAR) by up to 4% for children and over 40% for newborns, compared to using the enhancement methods alone. A qualitative analysis further demonstrates that R3C reconnects fragmented ridge patterns, improving the visual quality of segmentation. Because it functions independently of the enhancement method used, R3C provides a flexible and broadly applicable solution for improving binary segmentation.
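
The feedback loop described above (classify, combine the result with the original image, classify again) can be sketched as a small higher-order function. Everything concrete here is a stand-in chosen for illustration: `threshold_classify` plays the role of an unmodified enhancement classifier, `combine_with_mask` is a hypothetical rule that brightens pixels next to already detected ridge pixels, and the 1D "image" is a toy. The abstract does not specify how R3C combines the intermediate segmentation with the input.

```python
def r3c_refine(image, classify, combine, rounds=3):
    """Repeatedly re-classify the original image combined with the latest mask."""
    mask = classify(image)
    for _ in range(rounds):
        mask = classify(combine(image, mask))
    return mask


def threshold_classify(pixels, level=0.5):
    """Stand-in binary classifier: 1 for ridge, 0 for background."""
    return [1 if p >= level else 0 for p in pixels]


def combine_with_mask(image, mask, boost=0.5):
    """Hypothetical combination rule: raise intensity next to detected ridge pixels."""
    n = len(image)
    combined = []
    for i in range(n):
        neighbours = [mask[j] for j in (i - 1, i + 1) if 0 <= j < n]
        support = sum(neighbours) / len(neighbours)
        combined.append(min(1.0, image[i] + boost * support))
    return combined


if __name__ == "__main__":
    row = [0.9, 0.4, 0.9, 0.1, 0.1]          # a ridge with one faint pixel
    print(threshold_classify(row))           # single pass leaves a gap
    print(r3c_refine(row, threshold_classify, combine_with_mask))
```

In this toy row a single pass leaves a gap at the faint pixel, while the recursive loop fills it and leaves the background untouched. The classifier itself is never modified or trained, which matches the constraint stated in the abstract.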

2605.25304 2026-05-26 cs.LG cs.CR cs.CV 版本更新

When Interpretability Becomes a Liability: Adversarial Attacks on CBM Concept Layers

当可解释性成为负担：针对CBM概念层的对抗攻击

Aditya Sridhar

发表机构 * Independent Researcher（独立研究者）

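AI summary: This paper systematically studies the adversarial vulnerability of the concept layer in Concept Bottleneck Models (CBMs) and proposes SPECTRA, a stability-regularization defense based on semantic perturbations, which raises the minimum perturbation norm required for a successful attack from 0.46 to over 4,200 while keeping classification accuracy within 2.2% of the baseline.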

Comments Accepted to CVPR 2026 (Findings). 9 pages, 6 figures

详情

AI中文摘要

概念瓶颈模型（CBM）已成为可解释机器学习的基础方法，通过显式的概念激活提供人类可理解的中间表示。然而，这种可解释性从根本上引入了一个关键且先前未被探索的攻击面：概念瓶颈层本身。我们提出了对CBM中概念级对抗性脆弱性的全面、系统性研究，揭示了对输入像素进行有针对性的最小扰动可以通过操纵语义表示导致灾难性的错误分类。我们开发了一个严格的理论框架来量化概念空间的鲁棒性，建立了揭示这些架构脆弱性景观的新指标。我们在CUB-200-2011数据集上的广泛分析表明，标准CBM对概念级操纵表现出严重的敏感性。为了解决这一关键弱点，我们引入了SPECTRA（基于语义扰动的概念训练以增强对抗鲁棒性），一种原则性的稳定性正则化防御。SPECTRA有效地强化了语义表示空间，将成功攻击所需的最小扰动范数从0.46提高到超过4,200，使得有针对性的概念操纵在计算上变得不可行。此外，SPECTRA将基线分类精度保持在2.2%以内。通过将概念级攻击确立为一种根本不同的威胁模型，这项工作在可解释机器学习与对抗鲁棒性的交叉领域开辟了一个新的研究前沿。

英文摘要

Concept Bottleneck Models (CBMs) have emerged as a cornerstone approach for interpretable machine learning, providing human-understandable intermediate representations through explicit concept activations. However, this interpretability fundamentally introduces a critical, previously unexplored attack surface: the concept bottleneck layer itself. We present a comprehensive, systematic study of concept-level adversarial vulnerabilities in CBMs, revealing that targeted, minimal perturbations operating on input pixels can induce catastrophic misclassification by manipulating semantic representations. We develop a rigorous theoretical framework to quantify concept-space robustness, establishing novel metrics that expose the vulnerability landscape of these architectures. Our extensive analysis on the CUB-200-2011 dataset demonstrates that standard CBMs exhibit severe susceptibility to concept-level manipulation. To address this critical weakness, we introduce SPECTRA (Semantic Perturbation-based Concept Training for Robustness against Attacks), a principled stability regularization defense. SPECTRA effectively hardens the semantic representation space, increasing the minimal perturbation norm required for a successful attack from 0.46 to over 4,200, rendering targeted concept manipulation computationally prohibitive. Furthermore, SPECTRA preserves baseline classification accuracy to within 2.2%. By establishing concept-level attacks as a fundamentally distinct threat model, this work opens a new research frontier at the intersection of interpretable machine learning and adversarial robustness.

2605.25294 2026-05-26 cs.CV 版本更新

Geometry-Aware Image Flow Matching

几何感知图像流匹配

Junho Lee, Kwanseok Kim, Joonseok Lee

发表机构 * Seoul National University, Seoul, Korea（首尔国立大学）

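AI summary: Observing that the semantic information of natural images is encoded mainly in directional components, this paper proposes two geometry-aware methods, Spherical Optimal Transport Flow Matching (SOT-CFM) and Spherical Flow Matching (SFM), which model images on a hypersphere and outperform Euclidean baselines.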

详情

AI中文摘要

生成模型的最新进展突显了几何感知建模在流形约束环境中的强大能力。然而，对于自然图像，该领域仍局限于欧几里得假设，未能利用数据内在的几何结构。在本文中，我们研究了自然图像的几何结构，观察到语义信息主要编码在方向分量中，而范数分量可以通过全局平均值近似。这一性质在RGB空间和潜在空间中都成立，表明自然图像可以在超球面上有效建模。基于这一发现，我们引入了球面最优传输流匹配（SOT-CFM），它利用角距离，以及球面流匹配（SFM），它直接在流形上约束动力学。我们的实验表明，这些几何感知方法相比欧几里得基线取得了更优的性能。最终，这项工作提供了一种新颖的视角，弥合了基于黎曼流形的建模与自然图像生成之间的差距。

英文摘要

Recent advances in generative models highlight the power of geometry-aware modeling in manifold-constrained settings. Yet, for natural images, the field remains confined to Euclidean assumptions, failing to exploit the potential of intrinsic geometric structures within the data. In this work, we investigate the geometry of natural images and observe that semantic information is predominantly encoded in directional components, while norm components can be approximated by the global average. This property holds across both RGB and latent spaces, suggesting that natural images can be effectively modeled on a hypersphere. Building on this finding, we introduce Spherical Optimal Transport Flow Matching (SOT-CFM), which utilizes angular distance, and Spherical Flow Matching (SFM), which constrains dynamics directly on the manifold. Our experiments demonstrate that these geometry-aware methods achieve superior performance against Euclidean baselines. Ultimately, this work provides a novel perspective that bridges the gap between Riemannian manifold-based modeling and natural image generation.
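
The hypersphere view in this abstract can be illustrated with a small sketch. This is not the authors' SOT-CFM or SFM implementation: it only shows, under our own assumptions, how a vector splits into a unit direction and a norm, how two directions can be interpolated along the great circle that joins them (a natural path for dynamics constrained to the sphere), and how a sample could be rebuilt from a direction and a global average norm. The function names `split_direction_norm`, `slerp`, and `reconstruct` are hypothetical.

```python
import math


def split_direction_norm(x):
    """Split a vector into (unit direction, Euclidean norm)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    return [v / norm for v in x], norm


def slerp(u0, u1, t):
    """Interpolate along the great circle from unit vector u0 to u1, t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    theta = math.acos(dot)  # angular distance between the two directions
    if theta < 1e-8:
        return list(u0)  # directions coincide; nothing to interpolate
    if math.pi - theta < 1e-8:
        raise ValueError("antipodal points have no unique geodesic")
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(u0, u1)]


def reconstruct(direction, mean_norm):
    """Rebuild a vector from a unit direction and a (global average) norm."""
    return [mean_norm * v for v in direction]
```

Every point returned by `slerp` stays on the unit sphere, which is the property a straight-line Euclidean interpolation between the same two points would not have.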

2605.25293 2026-05-26 cs.CV cs.AI cs.RO 版本更新

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

基于神经形态激光雷达的鸟瞰图目标检测：使用节能脉冲神经网络

Sambit Mohapatra, Senthil Yogamani, Heinrich Gotzig, Patrick Mader

发表机构 * Valeo, Germany（法雷奥，德国）； Valeo, Ireland（法雷奥，爱尔兰）； TU Ilmenau, Germany（德国伊尔梅瑙工业大学）

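AI summary: Proposes an end-to-end spiking encoder-decoder network for object detection in bird's-eye-view representations of LiDAR point clouds, trained with surrogate gradient backpropagation. The membrane potential variant reaches 92.05/87.04/86.51 AP (Easy/Moderate/Hard) at IoU = 0.5 on KITTI, and a block-wise analysis shows a 3.33× reduction in synaptic operation energy over an equivalent CNN.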

详情

AI中文摘要

自动驾驶感知需要在严格的功耗约束下对三维传感器数据进行准确高效的处理。传统卷积神经网络实现了强大的检测精度，但计算密集，限制了其在资源受限的神经形态平台上的部署。脉冲神经网络通过事件驱动的稀疏计算提供了一种引人注目的替代方案，但其在复杂真实世界感知任务（如三维目标检测）中的应用仍然有限。在这项工作中，我们提出了一种端到端脉冲编码器-解码器网络，用于激光雷达点云鸟瞰图表示中的目标检测，并使用代理梯度反向传播进行训练。我们训练了两个变体：一个膜电位变体，在输出阶段读取连续神经元状态以获得最大精度，在$\mathrm{IoU}\!=\!0.5$（简单/中等/困难）下达到$92.05$/$87.04$/$86.51$ AP；以及一个全二进制脉冲变体，每一层仅操作脉冲序列，用于直接神经形态部署。我们在KITTI基准上评估了四种输入脉冲编码策略，结果表明，让网络直接从数据中学习脉冲表示优于手工设计的泊松、延迟和z轴编码方案；由于KITTI不提供连续帧，BEV输入在各时间步重复呈现，作为时间流的代理。分块能量分析表明，在保守的基于循环的操作下，与等效CNN相比，突触操作能量降低了$3.33\times$。这些结果共同证明了脉冲神经网络在自动驾驶中实现准确且节能的神经形态感知的可行性。

英文摘要

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Traditional convolutional neural networks achieve strong detection accuracy but are computationally intensive, limiting their suitability for deployment on resource-constrained neuromorphic platforms. Spiking neural networks offer a compelling alternative through event-driven sparse computation, yet their application to complex real-world perception tasks such as three-dimensional object detection remains limited. In this work, we propose an end-to-end spiking encoder-decoder network for object detection in bird's eye view representations of LiDAR point clouds, trained using surrogate gradient backpropagation. We train two variants: a membrane potential variant that reads continuous neuron state at the output stage for maximum accuracy, achieving $92.05$/$87.04$/$86.51$ AP at $\mathrm{IoU}\!=\!0.5$ (Easy/Moderate/Hard), and a fully binary spiking variant that operates exclusively on spike trains at every layer for direct neuromorphic deployment. We evaluate four input spike encoding strategies and demonstrate that allowing the network to learn spike representations directly from data outperforms hand-crafted Poisson, latency, and z-axis encoding schemes on the KITTI benchmark, where sequential frames are unavailable and the BEV input is presented repeatedly across timesteps as a proxy for temporal streaming. A block-wise energy analysis demonstrates a $3.33\times$ reduction in synaptic operation energy over an equivalent CNN under conservative loop-based operation. Together, these results demonstrate the viability of spiking neural networks for accurate and energy-efficient neuromorphic perception in autonomous driving.

2605.25266 2026-05-26 cs.CV 版本更新

DeltaCam: Differential Intrinsic Camera Modeling for Video Generation

DeltaCam: 用于视频生成的差分内参相机建模

Debabrata Mandal, Zhihan Peng, Yujie Wang, Praneeth Chakravarthula

发表机构 * UNC, Chapel Hill USA（北卡罗来纳大学教堂山分校）

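AI summary: Proposes DeltaCam, a video diffusion framework whose differentially parameterized neural camera adaptors learn relative rather than absolute changes in camera parameters, enabling smooth, controllable video generation over intrinsics such as focal length, aperture, and ISO, with an extension to real-world footage.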

详情

AI中文摘要

将相机内参纳入视频生成模型为控制场景动态和影响视觉外观的成像过程提供了原则性方法。先前工作主要关注外参控制（如相机姿态和运动），而将内参视为隐式或固定。关键瓶颈在于缺乏具有准确且多样化的时变相机元数据的大规模视频数据集，这使得学习绝对相机参数化变得困难。因此，当前模型难以以可控且时间一致的方式融入摄影相机行为，包括景深转换、曝光变化、镜头畸变和色彩处理。我们引入DeltaCam，一种视频扩散框架，通过Δ参数化的神经相机适配器对相机行为进行建模，该适配器基于相机运动和内参的相对变化而非绝对状态进行操作。通过从合成视频数据中学习这种差分公式，我们减轻了对精确真实世界相机标签的依赖，并实现了对焦距、光圈、ISO、色温和镜头畸变成像因子的平滑一致控制。我们将此框架扩展到真实世界视频，通过两种机制：在真实图像-元数据对上微调控制以实现精确镜头匹配，以及提取解耦嵌入用于隐式视频到视频风格迁移，无需显式相机参数。通过有效分离场景内容与内生成像行为，DeltaCam实现了现有模型难以实现的相机一致视频生成和编辑操作。最终，我们的结果为连接合成控制与真实世界摄影仿真建立了一种实用且可扩展的方法。

英文摘要

Incorporating camera intrinsics into video generation models offers a principled way to control not only scene dynamics but also the imaging process that governs visual appearance. Prior work has primarily focused on extrinsic control, such as camera pose and motion, while treating intrinsic camera parameters as implicit or fixed. A key bottleneck is the lack of large-scale video datasets with accurate and diverse temporally varying camera metadata, which makes learning absolute camera parameterizations difficult. As a result, current models struggle to incorporate photographic camera behavior, including depth-of-field transitions, exposure variations, lens distortions, and color processing, in a controllable and temporally consistent manner. We introduce DeltaCam, a video diffusion framework that models camera behavior through $Δ$-parameterized neural camera adaptors, operating on relative changes in camera motion and intrinsics instead of absolute states. By learning this differential formulation from synthetic video data, we mitigate reliance on precise real-world camera labels and enable smooth, consistent control over imaging factors such as focal length, aperture, ISO, color temperature, and lens distortion. We extend this framework to real-world footage through two mechanisms: finetuning the controls on real image-metadata pairs for precise shot matching, and extracting disentangled embeddings for implicit video-to-video style transfer without requiring explicit camera parameters. By effectively separating scene content from intrinsic imaging behavior, DeltaCam enables camera-consistent video generation and editing operations that are difficult to achieve with existing models. Ultimately, our results establish a practical and scalable approach for bridging synthetic control and real-world photographic emulation.

2605.25262 2026-05-26 cs.CV 版本更新

Semantics-Guided Multimodal Masked Autoencoder Pretraining for 3D BEV Object Detection

语义引导的多模态掩码自编码器预训练用于3D BEV目标检测

Prabuddhi Wariyapperuma, Rajitha de Silva, Marc Hanheide, Thomas Bohné, Leonardo Guevara

发表机构 * University of Lincoln, Lincoln Centre for Autonomous Systems（林肯大学，林肯自主系统中心）； University of Cambridge, Institute for Manufacturing, Department of Engineering（剑桥大学，制造研究所，工程系）

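AI summary: Proposes a semantics-guided multimodal masked autoencoder framework that injects semantic information during pretraining through semantics-guided LiDAR voxel masking and an auxiliary point-wise semantic decoder branch, improving 3D BEV object detection.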

Comments Accepted at the ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy (SRRA) as a lightning talk and poster

详情

AI中文摘要

准确的3D鸟瞰图（BEV）目标检测对于自动驾驶至关重要，并且强烈依赖于来自互补传感器（如摄像头和LiDAR）的有效多模态表示。多模态掩码自编码器已显示出学习此类表示以用于下游3D BEV目标检测的强大潜力。然而，现有方法通常对摄像头和LiDAR输入应用均匀随机掩码，平等对待所有区域，并且仅通过掩码重建学习表示。我们提出了一种语义引导的多模态掩码自编码器框架，该框架在预训练期间通过两个独立组件引入语义信息：（i）语义引导的LiDAR体素掩码，它更强烈地保留语义重要的LiDAR区域，以及（ii）一个辅助的点级LiDAR语义解码分支，在重建之外注入语义引导。在BEVFusion 3D目标检测上，与标准UniM2AE基线相比，我们的语义引导预训练策略在nuScenes mini验证集上提升了性能：语义引导的LiDAR体素掩码在基线上实现了+1.49%的平均精度（mAP）和+1.66%的nuScenes检测分数（NDS），而解码器侧的点语义监督实现了+1.39%的mAP和+3.22%的NDS。

英文摘要

Accurate 3D bird's-eye view (BEV) object detection is essential for autonomous driving, and depends strongly on effective multimodal representations from complementary sensors such as cameras and LiDAR. Multimodal masked autoencoders have shown strong potential for learning such representations for downstream 3D BEV object detection. However, existing methods typically apply uniform random masking to camera and LiDAR inputs, treating all regions equally, and learn representations only through masked reconstruction. We propose a semantics-guided multimodal masked autoencoder framework that introduces semantic information during pretraining through two separate components: (i) semantics-guided LiDAR voxel masking, which preserves semantically important LiDAR regions more strongly, and (ii) an auxiliary point-wise LiDAR semantic decoder branch that injects semantic guidance in addition to reconstruction. On BEVFusion 3D object detection, our semantics-guided pretraining strategy improves performance on the nuScenes mini validation set compared to the standard UniM2AE baseline: semantics-guided LiDAR voxel masking yields +1.49% mean Average Precision (mAP) and +1.66% nuScenes Detection Score (NDS), while decoder-side point semantic supervision yields +1.39% mAP and +3.22% NDS over the baseline.
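
The contrast between uniform random masking and semantics-guided masking can be sketched as weighted sampling without replacement. The abstract does not specify the sampling rule, so everything below is an assumption made for illustration: the per-voxel `importance` scores in [0, 1], the `keep_bias` parameter, and the function name `semantic_mask` are all hypothetical, and the sampler is the standard Efraimidis-Spirakis key trick rather than anything taken from the paper.

```python
import math
import random


def semantic_mask(importance, mask_ratio, keep_bias=1.0, seed=0):
    """Return the sorted indices of voxels to mask.

    importance: hypothetical per-voxel semantic scores in [0, 1].
    mask_ratio: fraction of voxels to mask.
    keep_bias:  0 gives uniform random masking; 1 never masks a voxel with
                importance 1 while lower-importance voxels remain available.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_mask = int(round(mask_ratio * len(importance)))
    keyed = []
    for idx, imp in enumerate(importance):
        # Masking weight shrinks as semantic importance grows.
        weight = max(0.0, 1.0 - keep_bias * imp)
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        # Efraimidis-Spirakis key in log form: larger key = masked earlier.
        key = math.log(u) / weight if weight > 0.0 else float("-inf")
        keyed.append((key, idx))
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:n_mask])
```

Setting `keep_bias=0.0` recovers the uniform baseline the abstract describes, which makes it easy to compare the two masking policies on the same importance map.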

2605.25254 2026-05-26 cs.CV cs.AI 版本更新

填补GAP：多模态大语言模型中视觉推理的粒度对齐范式

Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart, Lei Lv, Qi Zhao, Li Wang, Hao Li, Xiaoxi Jiang, Guanjun Jiang

发表机构 * Qwen Large Model Application Team, Alibaba（阿里云大模型应用团队）； University of Waterloo（滑铁卢大学）； Vector Institute（向量研究所）； Zhejiang University（浙江大学）

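AI summary: Proposes GAP (Granular Alignment Paradigm), which addresses the feature-space mismatch in visual latent reasoning for multimodal large language models through feature-level, context-level, and capacity-guided alignment, improving perception and reasoning performance.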

详情

AI中文摘要

视觉潜在推理让多模态大语言模型（MLLM）以连续令牌形式创建中间视觉证据，避免外部工具或图像生成器。然而，现有方法通常遵循输出即输入的潜在范式，产生不稳定的收益。我们发现有证据表明，特征空间不匹配可能是导致这种不稳定的原因之一：主流的视觉潜在模型建立在预归一化MLLM上，重用解码器隐藏状态作为预测的潜在输入，尽管这些状态与模型训练时消耗的输入嵌入处于截然不同的范数范围（Xie et al., 2025; Li et al., 2026; Team et al., 2026）。这种不匹配可能使直接潜在反馈不可靠。受此诊断启发，我们提出GAP，一种用于视觉潜在建模的粒度对齐范式。GAP在三个层面对齐视觉潜在推理：特征级对齐通过轻量级PCA对齐潜在头将解码器输出映射为输入兼容的视觉潜在；上下文级对齐通过可检查的辅助视觉监督锚定潜在目标；能力引导对齐选择性地将潜在监督分配给基础MLLM难以处理的示例。在Qwen2.5-VL 7B上，所得模型在我们的监督变体中实现了最佳的平均聚合感知和推理性能。推理时干预探测进一步表明，生成的潜在提供了任务相关的视觉信号，而不仅仅是增加令牌槽位。

英文摘要

Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume (Xie et al., 2025; Li et al., 2026; Team et al., 2026). This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

2605.09223 2026-05-26 cs.CV 版本更新

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

CREST: 曲率调节的事件中心采样用于高效长视频理解

Mehrajul Abadin Miraj, Abdul Mohaimen Al Radi, Shariful Islam Rayhan, Md. Tanvir Alam, Ismat Rahman, Yu Tian, Md Mosaddek Khan

发表机构 * Dept. of CSE, University of Dhaka（达卡大学计算机科学与工程系）； Dept. of CSE, University of Central Florida（中央佛罗里达大学计算机科学与工程系）

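AI summary: Proposes CREST, a training-free frame selection method that uses the temporal geometry (local curvature) of query-frame relevance to guide sampling, enabling efficient long-video understanding under a fixed frame budget.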

详情

AI中文摘要

从长视频中选择信息帧是一个组合问题，现有方法要么通过高效启发式方法处理，但未显式建模查询条件的时间结构，要么通过多阶段检索流水线处理，但预处理成本高。我们提出CREST，一种基于查询-帧相关性的时间几何的无训练帧选择方法。CREST基于观察：相关性随时间表现出结构化的局部变化——显著事件周围曲率陡峭，冗余段区域平坦。通过使用局部曲率指导选择，CREST在短暂决定性事件和缓慢演变的证据之间更有效地分配固定帧预算。在固定主干网络和帧预算下，CREST在LongVideoBench和VideoMME上比轻量级相关性-覆盖基线AKS获得更高准确率，同时保留了更强多阶段检索流水线MIRA的93-95%准确率，而预处理成本仅为后者的3-4%。（代码和实现细节包含在补充材料中，将在论文接收后公开发布。）在时间帧选择的诊断基准TempRel上，CREST比AKS相对提高6.88%。成对LLM-as-a-judge评估进一步表明，CREST选择的帧产生更连贯的帧条件描述，在两个基准上胜率分别为60.58%和54.50%。这些结果表明，局部时间几何为长视频帧选择提供了简单高效的基础。

英文摘要

Selecting informative frames from long videos is a combinatorial problem that existing methods address either through efficient heuristics without explicit modeling of query-conditioned temporal structure, or through multi-stage retrieval pipelines with substantial preprocessing cost. We propose CREST, a training-free frame selection method grounded in the temporal geometry of query-frame relevance. CREST is based on the observation that relevance over time exhibits structured local variation: sharp curvature around salient events and flatter regions in redundant segments. By using local curvature to guide selection, CREST allocates a fixed frame budget more effectively across brief decisive events and slowly evolving evidence. Under a fixed backbone and frame budget, CREST achieves higher accuracy than AKS, a lightweight relevance-coverage baseline, on LongVideoBench and VideoMME, while retaining 93-95% of the accuracy of MIRA, a stronger multi-stage retrieval pipeline, at only 3-4% of its preprocessing cost. (Code and implementation details are included in the supplementary material and will be released publicly upon acceptance.) On TempRel, our diagnostic benchmark for temporal frame selection, CREST achieves a 6.88% relative improvement over AKS. Pairwise LLM-as-a-judge evaluation further shows that CREST-selected frames yield more coherent frame-conditioned descriptions, with win rates of 60.58% and 54.50% on the two benchmarks. These results show that local temporal geometry provides a simple and efficient basis for long-video frame selection.
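
To make the idea of curvature over a relevance curve concrete, here is a minimal sketch. It is not CREST's actual selection rule, which the abstract does not spell out. We assume a per-frame query relevance score is already available, approximate local curvature with the discrete second difference, and rank frames by relevance plus weighted curvature; the additive scoring, the `weight` parameter, and both function names are hypothetical.

```python
def local_curvature(relevance):
    """Magnitude of the discrete second difference of a relevance curve.

    The two endpoints have no two-sided neighbourhood and are set to 0.
    """
    n = len(relevance)
    curvature = [0.0] * n
    for i in range(1, n - 1):
        curvature[i] = abs(relevance[i - 1] - 2.0 * relevance[i] + relevance[i + 1])
    return curvature


def select_frames(relevance, budget, weight=1.0):
    """Pick `budget` frame indices, favouring relevant, high-curvature frames."""
    curvature = local_curvature(relevance)
    # Assumed scoring rule: relevance plus weighted curvature.
    scores = [r + weight * c for r, c in zip(relevance, curvature)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # return indices in temporal order


# A flat, redundant segment with one brief salient event at frame 3.
relevance = [0.1, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1]
print(select_frames(relevance, budget=3))
```

In this toy curve the curvature is zero across the flat segment and peaks around the spike, so the budget concentrates on the frames surrounding the event rather than being spread evenly over time.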

2605.07607 2026-05-26 cs.CV 版本更新

EditCaption: 用于图像编辑指令合成的人工精炼SFT与HAE-DPO

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

发表机构 * Peking University（北京大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tsinghua University（清华大学）； Beihang University（北京航空航天大学）； Xiaohongshu Inc.（小红书公司）

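AI summary: Proposes EditCaption, a two-stage post-training pipeline that improves the quality of image-editing instruction synthesis through human-refined SFT and a difficulty-adaptive, error-aware DPO (HAE-DPO), substantially lowering the critical error rate and outperforming existing models.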

详情

AI中文摘要

可解释且无需反向传播的绿色学习用于高效多任务超声心动图分割与分类

Jyun-Ping Kao, Jiaxin Yang, C. -C. Jay Kuo, Jonghye Woo

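AI summary: Proposes a backpropagation-free multi-task Green Learning framework that combines an unsupervised VoxelHop encoder with a multi-level regression decoder and an XG-Boost classifier to perform left-ventricle segmentation and ejection-fraction classification on the EchoNet-Dynamic dataset, reaching 94.3% classification accuracy and a Dice similarity coefficient of 0.912 with over an order of magnitude fewer parameters than the compared deep learning models.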

Comments Accepted for publication in APSIPA Transactions on Signal and Information Processing. Jyun-Ping Kao and Jiaxing Yang contributed equally to this work. C.-C. Jay Kuo and Jonghye Woo are the senior authors

详情

AI中文摘要

超声心动图是管理心力衰竭（HF）的基石，左心室射血分数（LVEF）是指导治疗的关键指标。然而，手动LVEF评估存在较高的观察者间变异性，而现有的深度学习（DL）模型通常是计算密集且数据饥饿的“黑箱”，阻碍了临床信任和采用。在此，我们提出了一种无需反向传播的多任务绿色学习（MTGL）框架，可同时进行左心室（LV）分割和LVEF分类。我们的框架将用于分层时空特征提取的无监督VoxelHop编码器与多级回归解码器和XG-Boost分类器相结合。在EchoNet-Dynamic数据集上，我们的MTGL模型实现了最先进的分类和分割性能，分类准确率达到94.3%，Dice相似系数（DSC）达到0.912，显著优于多个先进的3D DL模型。关键的是，我们的模型在参数数量少一个数量级的情况下实现了这一性能，展现了卓越的计算效率。这项工作表明，GL范式可以为复杂的医学图像分析提供高度准确、高效且可解释的解决方案，为临床实践中更可持续和可信的人工智能铺平道路。

英文摘要

Echocardiography is a cornerstone for managing heart failure (HF), with Left Ventricular Ejection Fraction (LVEF) being a critical metric for guiding therapy. However, manual LVEF assessment suffers from high inter-observer variability, while existing Deep Learning (DL) models are often computationally intensive and data-hungry "black boxes" that impede clinical trust and adoption. Here, we propose a backpropagation-free multi-task Green Learning (MTGL) framework that performs simultaneous Left Ventricle (LV) segmentation and LVEF classification. Our framework integrates an unsupervised VoxelHop encoder for hierarchical spatio-temporal feature extraction with a multi-level regression decoder and an XG-Boost classifier. On the EchoNet-Dynamic dataset, our MTGL model achieves state-of-the-art classification and segmentation performance, attaining a classification accuracy of 94.3% and a Dice Similarity Coefficient (DSC) of 0.912, significantly outperforming several advanced 3D DL models. Crucially, our model achieves this with over an order of magnitude fewer parameters, demonstrating exceptional computational efficiency. This work demonstrates that the GL paradigm can deliver highly accurate, efficient, and interpretable solutions for complex medical image analysis, paving the way for more sustainable and trustworthy artificial intelligence in clinical practice.
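
For readers unfamiliar with the two quantities reported here, the sketch below computes them from their standard definitions. It is independent of the MTGL framework: the Dice similarity coefficient is 2|A ∩ B| / (|A| + |B|) for a predicted and a reference mask, and the ejection fraction is (EDV − ESV) / EDV × 100, where EDV and ESV are the end-diastolic and end-systolic left-ventricle volumes. The function names and the flat-list mask representation are our own assumptions.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(edv_ml, esv_ml):
    """Left ventricular ejection fraction in percent from volumes in mL."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv_ml - esv_ml) / edv_ml * 100.0
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 0, 0]` share one foreground element out of three in total, giving a Dice score of 2/3, and volumes of 120 mL and 48 mL give an ejection fraction of 60%.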

2601.18597 2026-05-26 cs.CV 版本更新

EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery

EFSI-DETR：面向无人机图像实时小目标检测的高效频率-语义集成

Yu Xia, Chang Liu, Tianqi Xiang, Zhigang Tu

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing（信息工程测绘遥感国家重点实验室）； Wuhan University（武汉大学）； Wuhan University Shenzhen Research Institute（武汉大学深圳研究院）； School of Computer Science（计算机学院）； School of Automation Science and Engineering（自动化科学与工程学院）； South China University of Technology（华南理工大学）

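AI summary: Proposes the EFSI-DETR framework, which uses a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) and an Efficient Semantic Feature Concentrator (ESFC) to achieve state-of-the-art real-time small object detection in UAV imagery.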

详情

AI中文摘要

由于有限的特征表示和无效的多尺度融合，无人机图像中的实时小目标检测仍然具有挑战性。现有方法未充分利用频率信息并依赖静态卷积操作，限制了获取丰富特征表示的能力，并阻碍了深层语义特征的有效利用。为解决这些问题，我们提出EFSI-DETR，一种新颖的检测框架，将高效语义特征增强与动态频率-空间引导相结合。EFSI-DETR包含两个主要组件：(1) 动态频率-空间统一协同网络（DyFusNet），联合利用频率和空间线索进行鲁棒的多尺度特征融合；(2) 高效语义特征集中器（ESFC），以最小计算成本实现深层语义提取。此外，采用细粒度特征保留（FFR）策略，在融合过程中纳入空间丰富的浅层特征，以保留对无人机图像中小目标检测至关重要的细粒度细节。在VisDrone和CODrone基准上的大量实验表明，我们的EFSI-DETR以实时效率实现了最先进的性能，在VisDrone上AP和AP$_{s}$分别提升了1.6%和5.8%，同时在单个RTX 4090 GPU上获得188 FPS的推理速度。

英文摘要

Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which constrain the capacity to obtain rich feature representations and hinder the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy is adopted to incorporate spatially rich shallow features during fusion to preserve fine-grained details, which are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP$_{s}$ on VisDrone, while obtaining an inference speed of 188 FPS on a single RTX 4090 GPU.

2601.18135 2026-05-26 cs.CV 版本更新

Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection

基于门控上下文聚合的前向一致性学习用于视频异常检测

Jiahao Lyu, Minghua Zhao, Xuewen Huang, Yifei Chen, Shuangli Du, Jing Hu, Cheng Shi, Zhiyong Lv

发表机构 * Shaanxi Key Laboratory for Network Computing and Security Technology, School of Computer Science and Engineering, Xi’an University of Technology（陕西网络计算与安全技术重点实验室，西安理工大学计算机科学与工程学院）； School of Cyber Science and Engineering, Xi’an Jiaotong University（网络安全与工程学院，西安交通大学）

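AI summary: Proposes FoGA, a lightweight model (about 2M parameters) that uses forward consistency learning and gated context aggregation for efficient video anomaly detection on resource-constrained devices, outperforming existing methods while running at up to 155 FPS.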

Comments It has been submitted to the KBS journal

详情

DOI: 10.1016/j.knosys.2026.116118
Journal ref: Knowledge-Based Systems 2026

AI中文摘要

作为公共安全的关键要素，视频异常检测（VAD）旨在实时监控系统中衡量各种事件与正常模式的偏差。然而，现有大多数VAD方法依赖大规模模型追求极端精度，限制了其在资源受限边缘设备上的可行性。此外，主流基于预测的VAD仅利用单帧未来预测误差检测异常，忽略了更长时域前向信息的更丰富约束。本文提出FoGA，一种轻量级VAD模型，执行基于门控上下文聚合的前向一致性学习，包含约2M参数，专为潜在边缘设备设计。具体而言，我们提出一种基于Unet的方法，对连续帧进行特征提取以生成即时预测和前向预测。然后，我们在跳跃连接中引入门控上下文聚合模块，动态融合相同空间尺度下的编码器和解码器特征。最后，模型通过新颖的前向一致性损失联合优化，并采用混合异常测量策略整合即时帧和前向帧的误差以实现更准确检测。大量实验证明了所提方法的有效性，其显著优于最先进的竞争方法，运行速度高达155 FPS。因此，我们的FoGA在性能与效率指标之间实现了出色的权衡。

英文摘要

As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods, running up to 155 FPS. Hence, our FoGA achieves an excellent trade-off between performance and the efficiency metric.

2601.00553 2026-05-26 cs.CV cs.AI 版本更新

A Comprehensive Dataset for Human vs. AI Generated Image Detection

人类与AI生成图像检测的综合数据集

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * 1 Kalyani Government Engineering College, India. 2 IIIT Delhi, India. 3 BITS Pilani Hyderabad Campus, India. 4 University of South Carolina, USA. 5 IIIT Guwahati, India. 6 NIT Silchar, India. 7 San José State University, USA. 8 UCLA, USA. 9 Washington State University, USA. 10 Vishwakarma Institute of Information Technology, India. 11 Gandhi Institute for Technological Advancement, India. 12 Meta AI, USA. 13 Amazon AI, USA. 14 BITS Pilani Goa, India.

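AI summary: To support AI-generated image detection, builds MS COCOAI, a dataset of 96,000 real and synthetic datapoints, and defines two tasks: classifying images as real or generated, and identifying the model that generated a given image.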

详情

AI中文摘要

像Stable Diffusion、DALL-E和MidJourney这样的多模态生成式AI系统从根本上改变了合成图像的创建方式。这些工具推动了创新，但也促进了误导性内容、虚假信息和被操纵媒体的传播。随着生成的图像越来越难以与照片区分，检测它们已成为当务之急。为了应对这一挑战，我们发布了MS COCOAI，这是一个用于AI生成图像检测的新数据集，包含96000个真实和合成数据点，基于MS COCO数据集构建。为了生成合成图像，我们使用了五个生成器：Stable Diffusion 3、Stable Diffusion 2.1、SDXL、DALL-E 3和MidJourney v6。基于该数据集，我们提出了两个任务：（1）将图像分类为真实或生成；（2）识别哪个模型生成了给定的合成图像。该数据集可在https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset获取。

英文摘要

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.24331 2026-05-26 cs.CV 版本更新

Spatial-aware Vision Language Model for Autonomous Driving

面向自动驾驶的空间感知视觉语言模型

Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

发表机构 * Motional ； University of Amsterdam（阿姆斯特丹大学）

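AI summary: Proposes the LVLDrive framework, which fuses LiDAR point clouds into a vision-language model using a Gradual Fusion Q-Former and a spatial-aware question-answering (SA-QA) dataset, addressing the 3D metric spatial reasoning bottleneck and improving scene understanding and decision reliability in autonomous driving.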

Comments Accepted to CVPR AutoPilot Workshop 2026

详情

AI中文摘要

尽管视觉语言模型（VLM）通过利用语言模型中的常识在端到端自动驾驶中展现出显著前景，但它们依赖2D图像线索进行复杂场景理解和决策，这成为安全性和可靠性的关键瓶颈。当前基于图像的方法难以进行精确的度量空间推理和几何推断，导致不可靠的驾驶策略。为弥补这一差距，我们提出LVLDrive（LiDAR-视觉-语言），一种新颖框架，通过引入LiDAR点云作为额外输入模态，专门设计用于增强现有VLM的鲁棒3D度量空间理解能力。一个关键挑战在于如何减轻不同3D数据对预训练VLM带来的灾难性干扰。为此，我们引入渐进融合Q-Former，逐步注入LiDAR特征，确保VLM现有知识库的稳定性和保留。此外，我们开发了空间感知问答（SA-QA）数据集，明确教导模型高级3D感知和推理能力。在驾驶基准上的大量实验表明，与仅视觉的对应模型相比，LVLDrive在场景理解、度量空间感知和可靠驾驶决策方面均实现了优越性能。我们的工作强调了显式3D度量数据对于构建可信赖的基于VLM的自主系统的重要性。

英文摘要

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

2512.18735 2026-05-26 cs.CV cs.AI 版本更新

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

$M^3-Verse$: 大型多模态模型的“找不同”挑战

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

发表机构 * Zhejiang University, China（浙江大学）； Shanghai AI Lab, China（上海人工智能实验室）； Hangzhou Normal University, China（杭州师范大学）

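AI summary: Proposes the $M^3-Verse$ benchmark, which uses multi-perspective paired videos to evaluate how well LMMs understand dynamic object changes within a consistent space, and documents the limitations of existing models.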

详情

AI中文摘要

现代大型多模态模型（LMMs）在静态图像和单状态时空理解方面表现出非凡的能力。然而，它们在两个不同视频观测中理解共享空间上下文内物体动态变化的能力仍未被充分探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤为关键。在本文中，我们引入了 $M^3-Verse$，一个多模态、多状态、多维度的基准，以正式评估这一能力。它基于成对视频，这些视频提供了室内场景在状态变化前后的多视角观察。该基准包含总共 270 个场景和 2,932 个问题，分为 50 多个子任务，探究 4 种核心能力。我们评估了 16 个最先进的 LMMs，并观察到它们在跟踪状态转换方面的局限性。为了解决这些挑战，我们进一步提出了一个简单而有效的基线，在多状态感知中实现了显著的性能提升。因此，$M^3-Verse$ 提供了一个具有挑战性的新测试平台，以促进对动态视觉世界有更全面理解的下一代模型的发展。您可以从 https://github.com/Wal-K-aWay/M3-Verse_pipeline 获取构建流程，并从 https://www.modelscope.cn/datasets/WalKaWay/M3-Verse 获取完整的基准数据。

英文摘要

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.


2512.13597 2026-05-26 cs.CV 版本更新

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

运动中的光照：时空高动态范围光照估计

Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

发表机构 * Université Laval（拉瓦尔大学）； Eyeline Labs（Eyeline实验室）

AI总结提出基于扩散的时空光照估计方法LiMo，通过生成不同曝光下的镜面与漫反射球体，结合深度与几何条件，实现高精度高频细节预测与照度估计。

AI中文摘要

我们提出LiMo（运动中的光照），一种基于扩散的时空光照估计方法。LiMo旨在同时实现逼真的高频细节预测和准确的照度估计。为此，我们提出根据输入中3D位置生成一组不同曝光下的镜面与漫反射球体。利用扩散先验，我们在大规模定制的室内外场景数据集上微调强大的现有扩散模型，并配以时空光照探针。为了实现准确的空间条件，我们证明仅靠深度是不够的，并引入一种新的几何条件来提供场景相对于目标3D位置的相对位置。最后，我们利用可微渲染将不同曝光下的漫反射和镜面预测合并为单个HDRI图。我们彻底评估了我们的方法和设计选择，使LiMo在空间控制和预测精度方面均达到最先进水平。

英文摘要

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.


2512.12425 2026-05-26 cs.CV 版本更新

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

通过散景渲染提升单目度量深度估计

Hangwei Zhang, Armando Fortes, Tianyi Wei, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University（南洋理工大学S实验室）； Beihang University（北京航空航天大学）

AI总结提出BokehDepth两阶段框架，利用物理生成模型产生校准散景堆栈作为无监督几何信号，通过散景感知聚合模块提升单目深度估计的度量精度。

Comments Project Page: https://fogradio.github.io/BokehDepth_Project/

Journal ref: ICML 2026

AI中文摘要

散景渲染和深度估计共享基本的光学联系，但现有方法未能充分利用这种互惠性。传统的散景管线严重依赖有噪声的深度图，不可避免地引入视觉伪影。相反，现有的单目深度模型通常遵循两种有缺陷的范式。基于生成扩散的框架往往缺乏一致的度量尺度。同时，前馈度量深度模型在纹理缺失或远处区域经常失败，而散焦模糊可以提供几何信息。我们提出BokehDepth，一个两阶段框架，将合成散焦视为无监督的几何信号。在第一阶段，一个物理基础的生成模型从单个清晰输入产生校准的散景堆栈，无需先验深度输入。随后，一个轻量级的散景感知聚合模块将这些堆栈集成到深度估计框架的编码器中。这种机制允许模型从散焦维度提取一致的几何特征，同时保持解码器架构不变。实验表明，与依赖深度的渲染基线相比，BokehDepth实现了优越的视觉散景保真度，并持续提升了最先进单目深度模型的度量精度。

英文摘要

Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusion-based frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.


2512.08125 2026-05-26 eess.IV cs.CV 版本更新

FlowSteer: Conditioning Flow Field for Consistent Image Restoration

FlowSteer: 条件化流场以实现一致图像恢复

Tharindu Wickremasinghe, Chenyang Qi, Harshana Weligampola, Zhengzhong Tu, Stanley H. Chan

发表机构 * Purdue University（普渡大学）； HKUST（香港科技大学）； Texas A&M University（德克萨斯农工大学）

AI总结提出FlowSteer，一种算子感知的条件化方案，通过在采样路径中注入测量先验，将冻结流的隐式引导与显式测量约束耦合，在零样本设置下实现超分辨率、去模糊、去噪和着色等任务的一致图像恢复。

Comments Accepted by CVPRF 2026. Camera Ready version. Project page: https://tharindu-nirmal.github.io/FlowSteer/

AI中文摘要

基于流的文本到图像（T2I）模型在提示驱动图像生成方面表现出色，但在图像恢复（IR）中常常“偏离”对测量的忠实。先前的工作通过数据特定流或任务特定适配器来缓解这种漂移，但这些方法计算量大且不可跨任务扩展。这引出了一个问题：“难道我们不能高效地操纵流模型现有的生成能力吗？”为此，我们引入了FlowSteer（FS），一种算子感知的条件化方案，它在采样路径中注入测量先验，将冻结流的隐式引导与显式测量约束耦合。在超分辨率、去模糊、去噪和着色任务中，FS在严格的零样本设置下（无需重新训练模型，无需适配器）提高了测量一致性和身份保持。我们展示了流模型的性质及其对噪声的敏感性如何指导这种调度器的设计。FlowSteer虽然简单，但在利用流模型丰富的生成先验的同时，实现了更高保真度的重建图像。所有数据和代码将在 https://tharindu-nirmal.github.io/FlowSteer/ 公开。

英文摘要

Flow-based text-to-image (T2I) models excel at prompt-driven image generation, but falter on Image Restoration (IR), often "drifting away" from being faithful to the measurement. Prior work mitigates this drift with data-specific flows or task-specific adapters that are computationally heavy and not scalable across tasks. This raises the question "Can't we efficiently manipulate the existing generative capabilities of a flow model?" To this end, we introduce FlowSteer (FS), an operator-aware conditioning scheme that injects measurement priors along the sampling path, coupling a frozen flow's implicit guidance with explicit measurement constraints. Across super-resolution, deblurring, denoising, and colorization, FS improves measurement consistency and identity preservation in a strictly zero-shot setting: no retrained models, no adapters. We show how the nature of flow models and their sensitivities to noise inform the design of such a scheduler. FlowSteer, although simple, achieves a higher fidelity of reconstructed images, while leveraging the rich generative priors of flow models. All data and code will be publicly available at https://tharindu-nirmal.github.io/FlowSteer/.


2511.19065 2026-05-26 cs.CV cs.AI cs.LG 版本更新

Understanding, Accelerating, and Improving MeanFlow Training

理解、加速和改进MeanFlow训练

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

发表机构 * Yonsei University（延世大学）； ETH Zurich（苏黎世联邦理工学院）； University of Zurich（苏黎世大学）； Max Planck ETH CLS（马克斯·普朗克ETH CLS）； Google（谷歌）

AI总结通过分析瞬时速度与平均速度的相互作用，提出一种加速瞬时速度形成并逐步转移训练重点的有效训练方案，实现更快的收敛和更优的少步生成性能。

AI中文摘要

MeanFlow通过联合学习瞬时速度场和平均速度场，有望在少步内实现高质量生成建模。然而，其底层训练动态仍不清楚。我们分析两种速度之间的相互作用，发现：(i) 建立良好的瞬时速度是学习平均速度的前提；(ii) 当时间间隔较小时，瞬时速度的学习受益于平均速度，但随着间隔增大而退化；(iii) 任务亲和性分析表明，对于一步生成至关重要的大间隔平均速度的平滑学习，依赖于先形成准确的瞬时速度和小间隔平均速度。在这些观察的指导下，我们设计了一种有效的训练方案，加速瞬时速度的形成，然后将重点从短间隔平均速度转移到长间隔平均速度。我们改进的MeanFlow训练实现了更快的收敛和显著更好的少步生成：使用相同的DiT-XL骨干网络，我们的方法在1-NFE ImageNet 256x256上达到了令人印象深刻的FID 2.87，而传统的MeanFlow基线为3.43。或者，我们的方法以2.5倍更短的训练时间或使用更小的DiT-L骨干网络，匹配MeanFlow基线的性能。

英文摘要

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
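
For readers unfamiliar with the two quantities the abstract contrasts, the block below restates them as defined in the original MeanFlow formulation. The abstract itself does not give these equations, so the notation ($z_t$, $r$, $t$, $v$, $u$, $\epsilon$) is an assumption carried over from that earlier work, not from this paper. The "temporal gap" in the abstract corresponds to $t - r$.

```latex
% Average velocity over the interval [r, t], defined from the instantaneous velocity v
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau ,
\qquad \lim_{r \to t} u(z_t, r, t) = v(z_t, t).

% MeanFlow identity used as the training target (differentiate (t - r)\,u with respect to t)
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t - r)\, \frac{d}{dt}\, u(z_t, r, t).

% One-step (1-NFE) sampling uses the largest gap, r = 0 and t = 1, starting from noise \epsilon
x \;=\; \epsilon \;-\; u(\epsilon, 0, 1).
```

Read this way, the small-gap case $r \to t$ reduces to ordinary instantaneous-velocity learning, while one-step generation depends entirely on the large-gap case, which matches the ordering of training emphasis described in the abstract.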


2511.12046 2026-05-26 cs.CR cs.AI cs.CV cs.LG 版本更新

BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning

BackWeak: 使用弱触发器和微调简单后门知识蒸馏

Shanmin Wang, Dongdong Zhao

发表机构 * School of Computer Science and Artificial Intelligence（计算机科学与人工智能学院）； Wuhan University of Technology（武汉理工大学）

AI总结提出BackWeak方法，通过微调教师模型嵌入弱触发器实现后门攻击，无需替代学生模型或模拟蒸馏，在标准蒸馏过程中可靠转移至不同学生架构。

AI中文摘要

知识蒸馏对于压缩大型模型至关重要，但依赖从第三方仓库下载的预训练“教师”模型引入了严重的安全风险——最显著的是后门攻击。现有的知识蒸馏后门方法通常复杂且计算密集：它们使用替代学生模型和模拟蒸馏来保证可转移性，并构建类似于通用对抗扰动（UAP）的触发器，这些触发器在幅度上不隐蔽，本质上表现出强烈的对抗行为。本文质疑这种复杂性是否必要，并构建了隐蔽的“弱”触发器——具有可忽略对抗效应的不可察觉扰动。我们提出了BackWeak，一种简单、无替代的攻击范式。BackWeak表明，通过使用非常小的学习率对良性教师模型进行微调并嵌入弱触发器，即可植入强大的后门。我们证明，这种精细的微调足以嵌入后门，在受害者的标准蒸馏过程中可靠地转移到不同的学生架构，从而实现高攻击成功率。在多个数据集、模型架构和知识蒸馏方法上的广泛实证评估表明，BackWeak比以往复杂的方法更高效、更简单，且通常更隐蔽。本文呼吁研究知识蒸馏后门攻击的学者特别关注触发器的潜在对抗特性。

英文摘要

Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-party repositories introduces serious security risks, most notably backdoor attacks. Existing KD backdoor methods are typically complex and computationally intensive: they employ surrogate student models and simulated distillation to guarantee transferability, and construct triggers similar to universal adversarial perturbations (UAPs), which are not stealthy in magnitude and inherently exhibit strong adversarial behavior. This work questions whether such complexity is necessary and constructs stealthy "weak" triggers: imperceptible perturbations that have negligible adversarial effect. We propose BackWeak, a simple, surrogate-free attack paradigm. BackWeak shows that a powerful backdoor can be implanted by simply fine-tuning a benign teacher with a weak trigger using a very small learning rate. We demonstrate that this delicate fine-tuning is sufficient to embed a backdoor that reliably transfers to diverse student architectures during a victim's standard distillation process, yielding high attack success rates. Extensive empirical evaluations on multiple datasets, model architectures, and KD methods show that BackWeak is efficient, simpler, and often more stealthy than previous elaborate approaches. This work calls on researchers studying KD backdoor attacks to pay particular attention to the trigger's potential adversarial characteristics.


2510.22827 2026-05-26 cs.CV cs.LG 版本更新

FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

FairJudge: 文本到图像模型中公平性与对齐评估的弃权感知多模态裁判

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver

发表机构 * Queen Mary University of London（伦敦玛丽女王大学）； Institut Jožef Stefan（约瑟夫·斯特凡研究所）； Imperial College London（伦敦帝国理工学院）

AI总结提出FairJudge协议，利用多模态大语言模型作为结构化裁判，通过封闭标签、弃权机制和证据报告，在文本到图像模型中实现社会属性预测、职业定位和提示-图像对齐的公平性评估。

AI中文摘要

评估文本到图像（T2I）系统不仅需要判断图像是否匹配提示，还需要判断社会显著属性是否被忠实表示且没有无根据的推断。现有的自动评估器通常依赖于以面部为中心的识别器或对比图像-文本相似度，这些方法提供的诊断反馈有限，并且通常在视觉证据模糊或缺失时强制进行预测。对于宗教和残疾等公平敏感属性，其中线索可能是上下文相关的、间接的或故意未指定的，这些评估器可能会遗漏细心的人类评审员会注意到的失败模式。我们引入了FairJudge，一种弃权感知的评估协议，该协议使用遵循指令的多模态LLM作为社会属性预测、职业定位和提示-图像对齐的结构化裁判。该协议将输出限制为封闭标签集，要求可见证据的理由，在线索不足时支持明确的unspecified决策，并将基于量规的对齐判断映射到$[-1,1]$。这些约束将MLLM裁判从开放式评估转变为可解析、可审计的评估程序。在四个属性预测基准和三个职业/对齐基准上，FairJudge优于或补充了CLIP、DeepFace、VIEScore和VQAScore。消融实验表明，封闭标签、弃权和证据报告对可靠性至关重要。我们进一步引入了DIVERSIFY和DIVERSIFY-Professions，这两个上下文丰富的资源用于评估超越面部可见或图标线索的社会表示和职业定位。我们发布了代码、提示、数据集、解析器日志和每张图像的裁判输出，以支持可重复的审计。

英文摘要

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image-text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce FairJudge, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt-image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit "unspecified" decision when cues are insufficient, and maps rubric-based alignment judgments to $[-1,1]$. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, FairJudge outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce DIVERSIFY and DIVERSIFY-Professions, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.
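
The protocol constraints listed in the abstract (closed label set, required evidence, explicit abstention, rubric mapped to $[-1,1]$) can be illustrated with a minimal parsing sketch. Everything concrete here is an assumption made for illustration: the JSON response shape, the `GENDER_LABELS` set, the function names, and the 1-to-5 rubric scale are hypothetical and are not taken from the FairJudge release.

```python
import json

# Hypothetical closed label set; the real label sets are not given in the abstract.
GENDER_LABELS = {"female", "male", "unspecified"}


def parse_judgment(raw, allowed):
    """Parse one judge response (assumed to be JSON with 'label' and 'evidence').

    Anything unparseable, outside the closed set, or committed without a
    visible-evidence rationale is turned into an explicit abstention.
    """
    abstain = {"label": "unspecified", "evidence": "", "parsed": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return abstain
    if not isinstance(data, dict):
        return abstain
    label = str(data.get("label", "")).strip().lower()
    evidence = str(data.get("evidence", "")).strip()
    if label not in allowed:
        return abstain
    if label != "unspecified" and not evidence:
        return abstain
    return {"label": label, "evidence": evidence, "parsed": True}


def rubric_to_unit_interval(score, lo=1, hi=5):
    """Linearly map a rubric score in [lo, hi] to [-1, 1] (scale is an assumption)."""
    if not lo <= score <= hi:
        raise ValueError("rubric score out of range")
    return 2.0 * (score - lo) / (hi - lo) - 1.0
```

Falling back to abstention on any parse failure keeps every output inside the closed label set, which is what makes the results auditable; whether the released parser behaves this way is not stated in the abstract.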


2510.15264 2026-05-26 cs.CV 版本更新

DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

DriveGen3D: 通过高效视频扩散提升前馈驾驶场景生成

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, Jiwen Lu

发表机构 * Zhejiang University（浙江大学）； GigaAI ； Tsinghua University（清华大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Peking University（北京大学）； Monash University（莫纳什大学）

AI总结提出DriveGen3D框架，结合快速视频扩散Transformer（FastDrive-DiT）和前馈3D重建模块（FastRecon3D），实现高质量、可控的动态3D驾驶场景生成，在长视频和3D一致性上达到最优。

Comments ICME 2026 Oral, Project Page: https://lhmd.top/drivegen3d

AI中文摘要

我们提出了DriveGen3D，一个用于生成高质量、高可控性动态3D驾驶场景的新框架，解决了现有方法的关键局限性。当前的驾驶场景合成方法要么因扩展时间生成而面临高昂的计算需求，要么专注于没有3D表示的长时间视频合成，或者局限于静态单场景重建。我们的工作通过多模态条件控制，将加速的长期视频生成与大规模动态场景重建相结合，弥合了这一方法论差距。DriveGen3D引入了一个由两个专门组件组成的统一流程：FastDrive-DiT，一个高效的视频扩散Transformer，用于在文本和鸟瞰图（BEV）布局引导下进行高分辨率、时间连贯的视频合成；以及FastRecon3D，一个前馈模块，可快速构建跨时间的3D高斯表示，确保时空一致性。DriveGen3D能够生成长驾驶视频（最高$800\times424$、12 FPS）及相应的3D场景，在保持效率的同时取得了最先进的结果。

英文摘要

We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to $800\times424$ at $12$ FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.


2510.14862 2026-05-26 cs.CV cs.DC 版本更新

接下来会发生什么？通过生成点轨迹预测未来运动

Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi

发表机构 * Visual Geometry Group, University of Oxford（牛津大学视觉几何组）

AI总结提出一种基于单张图像预测未来运动的方法，通过生成密集轨迹网格来捕捉场景动态和不确定性，相比现有方法更准确多样，并验证其在机器人等下游任务中的有效性。

Journal ref: ICLR 2026

AI中文摘要

我们考虑从单张图像预测运动的问题，即预测世界中物体可能如何移动，而无法观察其他参数如物体速度或施加的力。我们将此任务表述为密集轨迹网格的条件生成，模型紧密遵循现代视频生成器的架构，但输出运动轨迹而非像素。这种方法捕捉了场景范围的动态和不确定性，比先前的回归器和生成器产生更准确和多样化的预测。我们在模拟数据上广泛评估了我们的方法，展示了其在机器人等下游应用中的有效性，并在真实世界的直觉物理数据集上显示出有希望的准确性。尽管最近最先进的视频生成器常被视为世界模型，但我们表明它们在从单张图像预测运动方面存在困难，即使在简单的物理场景如落块或机械物体交互中，尽管对这些数据进行了微调。我们表明这一局限性源于生成像素的开销，而非直接建模运动。

英文摘要

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.


2509.09658 2026-05-26 cs.CV 版本更新

Measuring Epistemic Humility in Multimodal Large Language Models

测量多模态大语言模型中的认知谦逊

Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； Hong Kong Baptist University（香港浸会大学）

AI总结提出HumbleBench基准，通过强制选择多项选择中引入“以上皆非”选项，评估多模态大语言模型拒绝错误选项的谦逊行为。

AI中文摘要

多模态大语言模型（MLLMs）中的幻觉——即模型生成与输入图像不一致的内容——在现实应用中带来显著风险，从视觉问答中的错误信息到决策中的不安全错误。现有基准主要测试识别准确性，即评估模型能否在干扰项中选择正确答案。这忽略了可信AI的另一个重要能力：当没有提供的选项得到图像支持时，能够识别并避免做出错误选择，这是一种与谦逊相关的行为。我们提出了HumbleBench，这是一个新的幻觉基准，旨在评估MLLMs在强制选择多项选择设置中拒绝错误选项的能力，其中包含“以上皆非”选项。基于全景场景图数据集，我们利用对象和关系的细粒度场景图注释，使用候选属性线索，并提示GPT-4-Turbo生成多项选择问题，随后进行严格的人工筛选。每个问题都包含一个“以上皆非”选项，要求模型不仅识别正确的视觉信息，还要识别何时没有提供的答案有效。我们在HumbleBench上评估了各种最先进的MLLMs——包括通用型、专门推理型和专有模型——并为社区报告了实证结果。通过纳入明确的错误选项拒绝，HumbleBench填补了当前评估套件中的一个关键空白，评估了一种较窄但重要的、与可信多模态推理相关的弃权行为。我们的代码和数据集已公开发布，可在https://github.com/maifoundations/HumbleBench获取。

英文摘要

Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks another important capability for trustworthy AI: recognizing when none of the provided options is supported by the image and abstaining from committing to a false choice, a humility-related behavior. We present HumbleBench, a new hallucination benchmark designed to evaluate false-option rejection in MLLMs under a forced-choice multiple-choice setting with a "None of the above" option. Building on a panoptic scene graph dataset, we leverage fine-grained scene graph annotations for objects and relations, use candidate attribute cues, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including general-purpose, specialized reasoning, and proprietary models -- on HumbleBench and report empirical findings for the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites by assessing a narrower but important abstention-oriented behavior that is relevant to trustworthy multimodal reasoning. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
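
The forced-choice setting described above can be sketched as follows: every question carries a "None of the above" option, and accuracy is reported both overall and on the subset where that option is the correct answer. The data layout and the function names `build_question` and `score` are hypothetical; the abstract does not describe the released code's interface.

```python
NOTA = "None of the above"


def build_question(stem, distractors, correct=None):
    """Build one multiple-choice item.

    If `correct` is None, no listed option is supported by the image,
    so the expected answer is the NOTA option (always placed last).
    """
    options = list(distractors)
    if correct is not None:
        options.append(correct)
    options.append(NOTA)
    answer = options.index(correct) if correct is not None else len(options) - 1
    return {"stem": stem, "options": options, "answer": answer}


def score(questions, predictions):
    """Return (overall accuracy, accuracy on NOTA-answer questions)."""
    hits = [q["answer"] == p for q, p in zip(questions, predictions)]
    nota_hits = [
        h for q, h in zip(questions, hits) if q["options"][q["answer"]] == NOTA
    ]
    overall = sum(hits) / len(hits)
    nota_acc = sum(nota_hits) / len(nota_hits) if nota_hits else float("nan")
    return overall, nota_acc


# Toy usage: the second question has no valid listed answer.
q1 = build_question("What is on the table?", ["a cat", "a lamp"], correct="a cup")
q2 = build_question("What color is the car?", ["red", "blue", "green"])
overall, nota_acc = score([q1, q2], [2, 0])
```

Reporting the NOTA-only accuracy separately is the point of the sketch: a model that never picks the rejection option can still look reasonable on overall accuracy while scoring zero on the subset the benchmark is built to probe.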


2509.00056 2026-05-26 cs.CV 版本更新

Apex-Centered Spatio-Temporal Rank Pooling and Gradient Attention for Micro-Expression Recognition

基于顶点的时空秩池化和梯度注意力用于微表情识别

Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology（越南国家大学工程技术大学信息技术学院）

AI总结提出微表情时空图像（MESTI）和微表情梯度注意力网络（MEGANet），通过改进输入模态和注意力机制提升微表情识别性能。

AI中文摘要

微表情识别（MER）由于微表情的细微和短暂性是一项具有挑战性的任务。传统的输入模态，如顶点帧、光流和动态图像，往往无法充分捕捉这些短暂的面部运动，导致性能次优。在本研究中，我们引入了微表情时空图像（MESTI），这是一种针对微表情的动态秩池化的重新表述，将视频序列转换为单张图像，同时强调微表情的起始-顶点-结束时间模式。此外，我们提出了微表情梯度注意力网络（MEGANet），该网络包含一个提出的梯度注意力块，以增强从微表情中提取细粒度运动特征。通过结合MESTI和MEGANet，我们旨在建立一种更有效的MER方法。进行了大量实验以评估MESTI的有效性，将其与现有输入模态在常规架构上进行比较。此外，我们证明将先前发表的MER网络的输入替换为MESTI会导致一致的性能提升。还评估了MEGANet的性能，显示我们提出的网络在SMIC-HS、SAMM数据集上达到了最先进的结果，在CASMEII数据集上具有竞争力的性能，并且在报告的跨数据集评估设置中也取得了领先性能。MESTI和MEGANet的组合始终优于比较方法。这些发现强调了MESTI作为优越输入模态和MEGANet作为先进识别网络的潜力，旨在在各种应用中实现更有效的MER系统。

英文摘要

Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a micro-expression-specific reformulation of dynamic rank pooling that transforms a video sequence into a single image while emphasizing the onset-apex-offset temporal pattern of micro-expressions. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a proposed Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the effectiveness of MESTI, comparing it with existing input modalities across regular architectures. Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. The performance of MEGANet is also evaluated, showing that our proposed network achieves state-of-the-art results on the SMIC-HS and SAMM datasets and competitive performance on the CASMEII dataset. It also achieves leading performance in the reported cross-dataset evaluation settings. The combination of MESTI and MEGANet consistently outperforms the compared methods. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, pointing toward more effective MER systems in a variety of applications.


2508.13309 2026-05-26 cs.CV cs.LG 版本更新

DASH: A Meta-Attack Framework for Synthesizing Effective and Stealthy Adversarial Examples

DASH：一种用于合成有效且隐蔽的对抗样本的元攻击框架

Abdullah Al Nomaan Nafi, Habibur Rahaman, Zafaryab Haider, Tanzim Mahfuz, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty

发表机构 * University of Maine（缅因大学）； University of Florida（佛罗里达大学）； University of Tennessee, Knoxville（田纳西大学诺克斯维尔分校）

AI总结提出DASH元攻击框架，通过多阶段自适应组合Lp约束攻击方法，生成有效且感知对齐的对抗样本，在多个数据集上优于现有方法。

Comments Accepted to CVPR 2026

AI中文摘要

在白盒设置下，已有大量技术被提出用于在严格的Lp范数约束下生成对抗样本。然而，这类范数受限的样本往往与人类感知不一致，只有少数方法专门探索感知对齐的对抗样本。此外，尚不清楚能否有效利用Lp约束攻击的见解来提升感知效能。本文介绍DASH，一个完全可微的元攻击框架，通过策略性地组合现有基于Lp的攻击方法，生成有效且感知对齐的对抗样本。DASH以多阶段方式运行：在每个阶段，它使用学习到的自适应权重聚合来自多个基础攻击的候选对抗样本，并将结果传播到下一阶段。一种新颖的元损失函数通过联合最小化误分类损失和感知失真来指导这一过程，使框架能够动态调整每个基础攻击在各阶段的贡献。我们在CIFAR-10、CIFAR-100和ImageNet上对对抗训练模型评估DASH。尽管仅依赖基于Lp约束的方法，DASH显著优于最先进的感知攻击如AdvAD，实现了更高的攻击成功率（例如提升20.63%）和更优的视觉质量（以SSIM、LPIPS和FID衡量，分别提升约11、0.015和5.7）。此外，DASH对未见过的防御具有良好的泛化能力，使其成为评估鲁棒性的实用且强大的基线，无需为每种新防御手工设计自适应攻击。

英文摘要

Numerous techniques have been proposed for generating adversarial examples in white-box settings under strict Lp-norm constraints. However, such norm-bounded examples often fail to align well with human perception, and only a few methods specifically explore perceptually aligned adversarial examples. Moreover, it remains unclear whether insights from Lp-constrained attacks can be effectively leveraged to improve perceptual efficacy. In this paper, we introduce DASH, a fully differentiable meta-attack framework that generates effective and perceptually aligned adversarial examples by strategically composing existing Lp-based attack methods. DASH operates in a multi-stage fashion: at each stage, it aggregates candidate adversarial examples from multiple base attacks using learned, adaptive weights and propagates the result to the next stage. A novel meta-loss function guides this process by jointly minimizing misclassification loss and perceptual distortion, enabling the framework to dynamically modulate the contribution of each base attack throughout the stages. We evaluate DASH on adversarially trained models across CIFAR-10, CIFAR-100, and ImageNet. Despite relying solely on Lp-constrained methods, DASH significantly outperforms state-of-the-art perceptual attacks such as AdvAD, achieving higher attack success rates (e.g., 20.63% improvement) and superior visual quality, as measured by SSIM, LPIPS, and FID (improvements of approximately 11, 0.015, and 5.7, respectively). Furthermore, DASH generalizes well to unseen defenses, making it a practical and strong baseline for evaluating robustness without requiring handcrafted adaptive attacks for each new defense.
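
The multi-stage aggregation described in the abstract can be pictured with a toy sketch: at each stage, every base attack produces a candidate from the current image, and the candidates are blended with per-stage weights before being passed on. This is only a structural illustration under stated assumptions. The softmax parameterization of the weights, the function names, and the toy "attacks" are all hypothetical, and the weights here are fixed numbers, whereas the paper learns them through its meta-loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(candidates, logits):
    """Convex combination of candidate images (flat lists of pixel values)."""
    weights = softmax(logits)
    n_pixels = len(candidates[0])
    return [
        sum(w * cand[i] for w, cand in zip(weights, candidates))
        for i in range(n_pixels)
    ]


def run_stages(x, base_attacks, stage_logits):
    """Apply all base attacks at each stage, blend, and propagate the result."""
    for logits in stage_logits:
        candidates = [attack(x) for attack in base_attacks]
        x = aggregate(candidates, logits)
    return x


# Toy stand-ins for base attacks (assumptions, not real Lp attacks).
def brighten(x):
    return [min(1.0, v + 0.1) for v in x]


def darken(x):
    return [max(0.0, v - 0.1) for v in x]


out = run_stages([0.5, 0.5], [brighten, darken], [[0.0, 0.0], [0.0, 0.0]])
```

Because the blend is a convex combination, the aggregated image stays inside any convex constraint set (such as an Lp ball around the clean image) that all candidates satisfy, which is one plausible reason to combine candidates this way; the abstract does not confirm this design detail.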


2508.12628 2026-05-26 cs.CV 版本更新

Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning

Creative4U: 基于MLLMs的广告创意图像选择器与比较推理

Yukang Lin, Xiang Zhang, Shichang Jia, Bowen Wan, Chenghan Fu, Xudong Ren, Yueran Liu, Wanxian Guan, Pengji Wang, Jian Xu, Bo Zheng, Baolin Liu

发表机构 * Tsinghua University（清华大学）； Alibaba Group（阿里巴巴集团）

AI总结提出基于多模态大语言模型的创意图像评估与选择范式，通过构建比较推理数据集CreativePair和强化学习方法Creative4U，实现可解释的创意选择。

AI中文摘要

广告中的创意图像是电子商务平台的核心和灵魂。引人注目的创意图像可以提升用户的购物体验，增加广告主的收入以及平台的广告收入。随着AIGC技术的出现，广告主能够以极低的成本生产大量创意图像。然而，他们难以评估创意质量以进行选择。现有方法主要关注创意排序，无法满足可解释的创意选择需求。在这项工作中，我们提出了首个可解释的创意评估与选择范式。借助多模态大语言模型（MLLMs），我们的方法将创意图像的评估与选择整合到自然语言生成任务中。为了促进这项研究，我们构建了CreativePair，这是首个比较推理驱动的创意数据集，包含8k个带标注的图像对，每个样本包含一个标签，指示哪张图像更优。此外，我们引入了Creative4U（读作Creative for You），一种基于MLLMs的创意选择器，它考虑了用户的兴趣。通过Reason-to-Select RFT，其中包括基于思维链的监督微调（CoT-SFT）和基于组相对策略优化（GRPO）的强化学习，Creative4U能够准确评估和选择创意图像。离线和在线实验均证明了我们方法的有效性。我们的代码和数据集将公开，以推动研究和工业应用。

英文摘要

Creative images in advertising are the heart and soul of e-commerce platforms. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess creative quality when selecting among them. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), an MLLM-based creative selector that takes into account users' interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications.


2506.23700 2026-05-26 eess.IV cs.CV 版本更新

MedSAM-CA: A CNN-Augmented ViT with Attention-Enhanced Multi-Scale Fusion for Medical Image Segmentation

MedSAM-CA：一种用于医学图像分割的CNN增强型ViT与注意力增强多尺度融合方法

Peiting Tian, Xi Chen, Haixia Bi, Fan Li

AI总结提出MedSAM-CA，通过卷积注意力增强边界细化网络和注意力增强特征融合块，在低资源条件下微调预训练MedSAM模型，实现高精度医学图像分割。

Comments Withdrawn by the authors because the current version requires substantial revision in the description of the experimental settings and data preprocessing procedures. The manuscript should not be cited in its current form

AI中文摘要

医学图像分割在临床诊断和治疗规划中起着关键作用，其中精确的边界勾画对于准确的病灶定位、器官识别和定量评估至关重要。近年来，基于深度学习的方法显著提高了分割精度。然而，仍存在两个主要挑战。首先，这些方法的性能严重依赖于大规模标注数据集，而在医学场景中，由于隐私问题和高昂的标注成本，这些数据集往往难以获得。其次，临床挑战性场景，例如某些成像模态的低对比度以及恶性肿瘤引起的模糊病灶边界，仍然对精确分割构成障碍。为了解决这些挑战，我们提出了MedSAM-CA，一种架构级别的微调方法，通过适应预训练的基础模型Medical Segment Anything (MedSAM)来减轻对大量手动标注的依赖。MedSAM-CA引入了两个关键组件：卷积注意力增强边界细化网络（CBR-Net）和注意力增强特征融合块（Atte-FFB）。CBR-Net与MedSAM编码器并行运行，利用分层卷积处理恢复长距离注意力机制可能忽略的边界信息。嵌入在MedSAM解码器中的Atte-FFB将来自CBR-Net跳跃连接的多级细粒度特征与解码器内上采样的全局表示融合，以增强边界勾画精度。在涵盖皮肤镜、CT和MRI成像模态的公开数据集上的实验验证了MedSAM-CA的有效性。在皮肤镜数据集上，MedSAM-CA仅使用完整训练数据的2%就达到了94.43%的Dice系数，达到了完整数据训练性能的97.25%，展示了在低资源临床场景中的强大有效性。

英文摘要

Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning, where accurate boundary delineation is essential for precise lesion localization, organ identification, and quantitative assessment. In recent years, deep learning-based methods have significantly advanced segmentation accuracy. However, two major challenges remain. First, the performance of these methods heavily relies on large-scale annotated datasets, which are often difficult to obtain in medical scenarios due to privacy concerns and high annotation costs. Second, clinically challenging scenarios, such as low contrast in certain imaging modalities and blurry lesion boundaries caused by malignancy, still pose obstacles to precise segmentation. To address these challenges, we propose MedSAM-CA, an architecture-level fine-tuning approach that mitigates reliance on extensive manual annotations by adapting the pretrained foundation model, Medical Segment Anything (MedSAM). MedSAM-CA introduces two key components: the Convolutional Attention-Enhanced Boundary Refinement Network (CBR-Net) and the Attention-Enhanced Feature Fusion Block (Atte-FFB). CBR-Net operates in parallel with the MedSAM encoder to recover boundary information potentially overlooked by long-range attention mechanisms, leveraging hierarchical convolutional processing. Atte-FFB, embedded in the MedSAM decoder, fuses multi-level fine-grained features from skip connections in CBR-Net with global representations upsampled within the decoder to enhance boundary delineation accuracy. Experiments on publicly available datasets covering dermoscopy, CT, and MRI imaging modalities validate the effectiveness of MedSAM-CA. On the dermoscopy dataset, MedSAM-CA achieves 94.43% Dice with only 2% of the full training data, reaching 97.25% of full-data training performance, demonstrating strong effectiveness in low-resource clinical settings.
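
The two dermoscopy figures in the abstract imply a full-data Dice score that the abstract does not state directly. The sketch below gives the standard Dice definition for binary masks and the back-of-envelope division. The function name `dice` and the variable names are illustrative, and the implied full-data value is derived arithmetic, not a number reported by the authors.

```python
def dice(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks (0/1 values)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Figures quoted in the abstract.
low_data_dice = 94.43      # % Dice with 2% of the training data
fraction_of_full = 0.9725  # stated as 97.25% of full-data performance

# Implied full-data Dice, assuming the ratio is taken on Dice itself.
implied_full_data_dice = low_data_dice / fraction_of_full
```

Under that assumption the implied full-data score is about 97.1% Dice, so training on 2% of the data costs roughly 2.7 Dice points on this dataset.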

URL PDF HTML ☆

赞 0 踩 0

2506.17629 2026-05-26 cs.CV cs.AI cs.CL 版本更新

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning

CLiViS: 通过语言-视觉协同释放认知地图用于具身视觉推理

Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang

发表机构 * School of Computer Science and Technology, East China Normal University（东华大学计算机科学与技术学院）； King Abdullah University of Science and Technology（科廷大学）； Fudan University（复旦大学）

AI总结提出CLiViS框架，通过LLM进行高层任务规划并协调VLM驱动的开放世界视觉感知，构建动态认知地图以迭代更新场景上下文，实现无需训练的具身视觉推理。

Abstract

Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.

2506.10689 2026-05-26 cs.CV version update

Underage Detection through a Multi-Task and MultiAge Approach for Screening Minors in Unconstrained Imagery

Christopher Gaul, Eduardo Fidalgo, Enrique Alegre, Rocío Alaiz Rodríguez, Eri Pérez Corral

Affiliations: Department of Electrical, Systems and Automation Engineering

AI summary: Proposes a multi-task architecture built on a frozen FaRL vision-language backbone and a compact two-layer MLP, combined with an α-reweighted focal loss and age-balanced sampling, to detect minors accurately in unconstrained images; it improves performance on a new benchmark.

DOI: 10.1016/j.patcog.2026.113440

Abstract

Accurate automatic screening of minors in unconstrained images requires models robust to distribution shift and resilient to the under-representation of children in public datasets. To address these issues, we propose a multi-task architecture with dedicated under/over-age discrimination tasks based on a frozen FaRL vision-language backbone joined with a compact two-layer MLP that shares features across one age-regression head and four binary underage heads (12, 15, 18, and 21 years). This design focuses on the legally critical age range while keeping the backbone frozen. Class imbalance is mitigated through an α-reweighted focal loss and age-balanced mini-batch sampling, while an age gap removes ambiguous samples near thresholds. Evaluation is conducted on our new Overall Underage Benchmark (303k cleaned training images, 110k test images), defining both the "ASORES-39k" restricted overall test, which removes the noisiest domains, and the age estimation wild-shifts test "ASWIFT-20k" of 20k images, stressing extreme poses (>45°), expressions, and low image quality to emulate real-world shifts. Trained on the cleaned overall set with resampling and age gap, our multiage model "F" reduces the mean absolute error on ASORES-39k from 4.175 y (age-only baseline) to 4.068 y and improves under-18 detection from an F2 score of 0.801 to 0.857 at a 1% false-adult rate. On ASWIFT-20k, the same configuration nearly sustains 0.99 recall while F2 rises from 0.742 to 0.833, demonstrating robustness to domain shift.
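
To make the two quantities named in this abstract concrete, the following is a minimal sketch of an alpha-weighted binary focal loss and of the F-beta score (F2 when beta is 2). It uses the standard textbook definitions; the default `alpha` and `gamma` values and the function names are illustrative assumptions, and the paper's exact reweighting scheme may differ.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Alpha-weighted binary focal loss for a single prediction.

    p: predicted probability of the positive (e.g. "underage") class.
    y: ground-truth label, 1 or 0.
    alpha, gamma: illustrative defaults, not values taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    # (1 - p_t)^gamma down-weights examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom
```

Raising `alpha` above 0.5 increases the penalty on misclassified positives, which is one common way to counter a rare positive class; F2 is a natural metric when missing a minor is costlier than a false alarm.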

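2505.11758 2026-05-26 cs.CV cs.AI cs.GR cs.RO version update

Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning

Sriram Mandalika

Affiliations: Hasso Plattner Institute, University of Potsdam

AI summary: Proposes SCAN, a framework that addresses the handling of negative-class signals in few-shot adaptation of vision-language models through query-adaptive negative routing, LLM-guided contrastive prompts, and an adaptive fusion weight, improving by 4.61% on average across 11 benchmarks.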

Abstract

Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existing methods apply uniform negative suppression across all queries, ignoring that the most damaging confusions are query-specific and shift with support-set geometry. We introduce SCAN (Selective Confusion-Aware Negatives), a framework that addresses this gap through three targeted contributions. In inference, query-adaptive negative routing restricts suppression to the top-K most confusable classes per query, requiring zero additional parameters. Generic negative text templates are replaced with LLM-bootstrapped contrastive prompts that describe discriminative attributes between confusable class pairs, sharpening the textual decision boundary where it matters most. A parameter-free adaptive fusion weight estimated from support-set Fisher discriminability removes the need for manual tuning of the vision-language trade-off. Evaluated across 11 standard benchmarks, SCAN consistently outperforms prior prompt-based and adapter-based methods by an average of 4.61% at 16-shot, with gains of up to 7.70% on fine-grained datasets where inter-class confusion is most severe. SCAN also generalizes strongly under distribution shift, improving by 2.95% on average across four ImageNet OOD variants, and maintains robust performance under significant label noise, with accuracy under 50% label corruption still exceeding the clean baseline of the strongest competing method.

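2503.01122 2026-05-26 cs.CV version update

ACCORD: Alleviating Concept Coupling through Dependence Regularization for Text-to-Image Diffusion Personalization

Shizhan Liu, Hao Zheng, Hang Yu, Jianguo Li

Affiliations: Ant Group

AI summary: Proposes two plug-and-play loss functions (Denoising Decouple Loss and Prior Decouple Loss) that directly minimize two kinds of dependence discrepancy, alleviating concept coupling and striking a better balance between text control and personalization fidelity.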

Abstract

Image personalization has garnered attention for its ability to customize Text-to-Image generation using only a few reference images. However, a key challenge in image personalization is concept coupling, where the limited number of reference images leads the model to form unwanted associations between the personalization target and other concepts. Current methods attempt to tackle this issue indirectly, leading to a suboptimal balance between text control and personalization fidelity. In this paper, we take a direct approach to the concept coupling problem through statistical analysis, revealing that it stems from two distinct sources of dependence discrepancies. We therefore propose two complementary plug-and-play loss functions: Denoising Decouple Loss and Prior Decouple Loss, each designed to minimize one type of dependence discrepancy. Extensive experiments demonstrate that our approach achieves a superior trade-off between text control and personalization fidelity.

2412.15668 2026-05-26 cs.CV version update

Adaptive Hierarchical Graph Cut for Multi-granularity Out-of-distribution Detection

Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan

Affiliations: Interdisciplinary Graduate Programme, Nanyang Technological University; College of Computing and Data Science, Nanyang Technological University; KINDI Computing Research Center, College of Engineering, Qatar University

AI summary: Proposes the Adaptive Hierarchical Graph Cut network (AHGC), which builds a hierarchical KNN graph and partitions it into subgraphs using linkage and density information to handle OOD detection under differing label granularity, reducing FPR95 by 40.47% on CIFAR-10 and 81.24% on CIFAR-100.

Comments Published in IEEE Transactions on Artificial Intelligence

Abstract

This paper focuses on a significant yet challenging task: out-of-distribution detection (OOD detection), which aims to distinguish and reject test samples with semantic shifts, so as to prevent models trained on in-distribution (ID) data from producing unreliable predictions. Although previous works have achieved some success, they are ineffective for challenging real-world applications, since these methods simply regard all unlabeled data as OOD data and ignore the case that different datasets have different label granularity. For example, "cat" on CIFAR-10 and "tabby cat" on Tiny-ImageNet share the same semantics but have different labels due to differing label granularity. To this end, in this paper, we propose a novel Adaptive Hierarchical Graph Cut network (AHGC) to deeply explore the semantic relationship between different images. Specifically, we construct a hierarchical KNN graph to evaluate the similarities between different images based on the cosine similarity. Based on the linkage and density information of the graph, we cut the graph into multiple subgraphs to integrate these semantically similar samples. If the labeled percentage in a subgraph is larger than a threshold, we assign the label with the highest percentage to the unlabeled images. To further improve model generalization, we augment each image into two augmented versions and maximize the similarity between the two versions. Finally, we leverage the similarity score for OOD detection. Extensive experiments on two challenging benchmarks (CIFAR-10 and CIFAR-100) illustrate that in representative cases, AHGC outperforms state-of-the-art OOD detection methods by 81.24% on CIFAR-100 and by 40.47% on CIFAR-10 in terms of "FPR95", which shows the effectiveness of AHGC.
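
Two steps of this abstract, building a KNN graph from cosine similarity and assigning the majority label inside a subgraph once enough of it is labeled, can be sketched as follows. This is a simplified, single-level illustration: the hierarchical construction and the linkage/density-based graph cut are not reproduced, and the function names, the `threshold` default, and the use of `None` for "unlabeled" are assumptions made here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_graph(features, k):
    """Map each index to the indices of its k most cosine-similar neighbours."""
    graph = {}
    for i, f in enumerate(features):
        sims = [(cosine(f, g), j) for j, g in enumerate(features) if j != i]
        sims.sort(key=lambda t: (-t[0], t[1]))  # most similar first, ties by index
        graph[i] = [j for _, j in sims[:k]]
    return graph


def assign_labels(subgraph, labels, threshold=0.5):
    """Label the unlabeled nodes (label None) of one subgraph.

    If the labeled fraction exceeds `threshold`, unlabeled nodes receive the
    most frequent label in the subgraph; otherwise labels are left unchanged.
    """
    known = [labels[i] for i in subgraph if labels[i] is not None]
    if not known or len(known) / len(subgraph) <= threshold:
        return {i: labels[i] for i in subgraph}
    majority = Counter(known).most_common(1)[0][0]
    return {i: labels[i] if labels[i] is not None else majority for i in subgraph}
```

In this sketch, nodes that stay unlabeled after the assignment step would be the natural OOD candidates.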

2409.17608 2026-05-26 cs.CV version update

Appearance Blur-driven AutoEncoder and Motion-guided Memory Module for Video Anomaly Detection

Jiahao Lyu, Minghua Zhao, Jing Hu, Xuewen Huang, Shuangli Du, Cheng Shi, Zhiyong Lv

Affiliations: School of Computer Science and Engineering, Xi’an University of Technology

AI summary: Proposes a zero-shot cross-dataset video anomaly detection method based on appearance blur and a motion-guided memory module, which constructs global pseudo-anomalies and uses motion memory items to widen the gap between normal and abnormal motion.

Comments 13 pages, 11 figures

DOI: 10.1016/j.knosys.2025.115218
Journal ref: Knowledge-Based Systems 2026

Abstract

Video anomaly detection (VAD) often learns the distribution of normal samples and detects anomalies by measuring significant deviations, but undesired generalization may also reconstruct some anomalies, thus suppressing the deviations. Meanwhile, most VAD methods cannot cope with cross-dataset validation for new target domains, and few-shot methods must laboriously rely on model tuning in the target domain to complete domain adaptation. To address these problems, we propose a novel VAD method with a motion-guided memory module to achieve zero-shot cross-dataset validation. First, we add Gaussian blur to the raw appearance images, thereby constructing the global pseudo-anomaly, which serves as the input to the network. Then, we propose multi-scale residual channel attention to deblur the pseudo-anomaly in normal samples. Next, memory items are obtained by recording the motion features in the training phase, which are used to retrieve the motion features from the raw information in the testing phase. Lastly, our method can ignore the blurred real anomaly through attention and rely on motion memory items to increase the normality gap between normal and abnormal motion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of the proposed method. Compared with cross-domain methods, our method achieves competitive performance without adaptation during testing.

2409.09953 2026-05-26 cs.CV version update

Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection

Xiang Fang, Arvind Easwaran, Blaise Genest

Affiliations: College of Computing and Data Science; Nanyang Technological University

AI summary: For out-of-distribution action detection, proposes the Uncertainty-Guided Appearance-Motion Association Network (UAAN), which fuses appearance and motion features and reasons about spatial-temporal inter-object interaction, significantly outperforming existing methods.

Comments Accepted by MIPR 2024

Abstract

Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts, to prevent models trained on in-distribution (ID) datasets from producing unreliable predictions. Existing works only extract appearance features on image datasets and cannot handle dynamic multimedia scenarios rich in motion information. Therefore, we target a more realistic and challenging OOD detection task: OOD action detection (ODAD). Given an untrimmed video, ODAD first classifies the ID actions and recognizes the OOD actions, and then localizes ID and OOD actions. To this end, in this paper, we propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN), which explores both appearance features and motion contexts to reason spatial-temporal inter-object interaction for ODAD. First, we design separate appearance and motion branches to extract corresponding appearance-oriented and motion-aspect object representations. In each branch, we construct a spatial-temporal graph to reason appearance-guided and motion-driven inter-object interaction. Then, we design an appearance-motion attention module to fuse the appearance and motion features for final action detection. Experimental results on two challenging datasets show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.

2404.10947 2026-05-26 cs.CV version update

Residual Connections Harm Generative Representation Learning

Xiao Zhang, Ruoxi Jiang, William Gao, Rebecca Willett, Michael Maire

Affiliations: University of Chicago; Fudan University; Tencent; Shanghai Academy of AI for Science

AI summary: Reducing the weight of identity shortcuts in residual networks significantly improves semantic feature learning in generative representation learning frameworks such as masked autoencoders and diffusion models.

Comments accepted to CVPR 2026

Abstract

We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification notably improves feature quality, raising ImageNet-1K K-Nearest Neighbor accuracy from 27.4% to 63.9% and linear probing accuracy from 67.8% to 72.7% for MAEs with a ViT-B/16 backbone, while also enhancing generation quality in diffusion models. This significant gap suggests that, while the residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find a correlation between low effective feature rank and downstream task performance.
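
The idea of a depth-dependent shortcut weight can be illustrated with a small sketch. The abstract does not state the paper's formula, so the linear schedule below, the `alpha_min` parameter, and the function names are assumptions chosen purely for illustration; each layer computes `alpha * x + f(x)` instead of the usual `x + f(x)`.

```python
def shortcut_weights(num_layers, alpha_min=0.5):
    """Identity-shortcut weights that decay linearly from 1.0 to alpha_min.

    The linear schedule is a hypothetical stand-in for the paper's fixed
    formula; it only shares the property of decreasing monotonically with depth.
    """
    if num_layers == 1:
        return [1.0]
    step = (1.0 - alpha_min) / (num_layers - 1)
    return [1.0 - step * layer for layer in range(num_layers)]


def forward(x, layers, alphas):
    """Apply x <- alpha * x + f(x) layer by layer; x is a list of floats."""
    for f, alpha in zip(layers, alphas):
        fx = f(x)
        x = [alpha * xi + fi for xi, fi in zip(x, fx)]
    return x
```

With `alpha_min=1.0` every weight is 1.0 and the sketch reduces to an ordinary residual network, which makes the modification easy to ablate.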

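2309.07778 2026-05-26 eess.IV cs.CV cs.LG q-bio.TO version update

V3H: View Variation and View Heredity for Incomplete Multi-view Clustering

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

Affiliations: School of Computer Science and Technology, Huazhong University of Science and Technology; Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; Department of Electrical and Computer Engineering, University of Florida

AI summary: Proposes V3H, a genetics-inspired View Variation and View Heredity approach that decomposes each subspace into a variation matrix and a heredity matrix to learn each view's unique information and the information consistent across all views, and uses an adjustable low-rank representation to recover the underlying data structure; it captures both consistent and unique information in incomplete multi-view clustering and outperforms existing methods on 15 benchmark datasets.

Comments Published in IEEE Transactions on Artificial Intelligence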

DOI: 10.1109/TAI.2021.3052425
Journal ref: IEEE Transactions on Artificial Intelligence 2020

Abstract

Real data often appear in the form of multiple incomplete views. Incomplete multi-view clustering is an effective method to integrate these incomplete views. Previous methods only learn the consistent information between different views and ignore the unique information of each view, which limits their clustering performance and generalization. To overcome this limitation, we propose a novel View Variation and View Heredity approach (V3H). Inspired by variation and heredity in genetics, V3H first decomposes each subspace into a variation matrix for the corresponding view and a heredity matrix for all the views to represent the unique information and the consistent information respectively. Then, by aligning different views based on their cluster indicator matrices, V3H integrates the unique information from different views to improve the clustering performance. Finally, with the help of the adjustable low-rank representation based on the heredity matrix, V3H recovers the underlying true data structure to reduce the influence of large incompleteness. More importantly, V3H is possibly the first work to introduce genetics to clustering algorithms for simultaneously learning the consistent information and the unique information from incomplete multi-view data. Extensive experimental results on fifteen benchmark datasets validate its superiority over other state-of-the-art methods.

2011.10331 2026-05-26 cs.CV cs.LG version update

ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering

Xiang Fang, Yuchong Hu, Pan Zhou, Dapeng Oliver Wu

Affiliations: Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; School of Computer Science and Technology, Huazhong University of Science and Technology; Key Laboratory of Information Storage System Ministry of Education of China, Huazhong University of Science and Technology; Department of Electrical and Computer Engineering, University of Florida

AI summary: Proposes the ANIMC framework, which handles missing instances and noise in multi-view clustering through a soft auto-weighted strategy and a doubly soft regularized regression model.

Comments Published in IEEE Transactions on Artificial Intelligence

Journal ref: IEEE Transactions on Artificial Intelligence 2021

Abstract

Multi-view clustering has wide applications in many image processing scenarios. In these scenarios, original image data often contain missing instances and noise, which most multi-view clustering methods ignore. However, missing instances may make these methods difficult to use directly, and noise leads to unreliable clustering results. In this paper, we propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regularized regression model. Firstly, by designing adaptive semi-regularized nonnegative matrix factorization (adaptive semi-RNMF), the soft auto-weighted strategy assigns a proper weight to each view and adds a soft boundary to balance the influence of noise and incompleteness. Secondly, by proposing the θ-norm, the doubly soft regularized regression model adjusts the sparsity of our model by choosing different values of θ. Compared with existing methods, ANIMC has three unique advantages: 1) it is a soft algorithm that can adjust our framework to different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noise; 3) it performs doubly soft regularized regression that aligns the same instances in different views, thereby decreasing the impact of missing instances. Extensive experimental results demonstrate its advantages over other state-of-the-art methods.

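2605.25220 2026-05-26 cs.CV cs.GR cs.RO version update

Multi-view Consistent 3D Gaussian Head Avatars 'without' Multi-view Generation

Aviral Chharia, Fernando De la Torre

Affiliations: Carnegie Mellon University

AI summary: Proposes MVCHead, which learns 3D Gaussian head models directly from randomly sampled 2D images and achieves multi-view consistency through Hierarchical State Space blocks and an SE(3) Multi-view Critic, without multi-view data or 3D supervision.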

Comments CVPR 2026; Project Website: https://humansensinglab.github.io/MVCHead/

Journal ref: CVPR, Denver, CO, USA, 2026, pp. 40163-40174

Abstract

High-fidelity 3D Gaussian head avatar generation is critical for applications such as AR/VR, telepresence, and digital humans. Existing methods depend on multi-view datasets, 3D captures, or intermediate 2D view synthesis. In contrast, we learn both conditional and unconditional 3D head models from randomly sampled 2D images alone, without using multi-view data, 3D supervision, or intermediate view generation. We introduce MVCHead, a single-shot state space model that enforces multi-view consistency (MVC) directly in the 3D representation while regressing 3D Gaussians under these constraints. At its core, we propose a Hierarchical State Space (HiSS) block that progressively refines Gaussians from coarse to fine, while capturing long-range dependencies. Within each HiSS block, we modify Mamba's standard unidirectional scan with the proposed Hierarchical Bi-directional State Scan (HiBiSS) that aligns recurrence with the axes along which multi-view inconsistencies are strongest. Finally, we design an SE(3) Multi-view Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without observing real multi-view pairs. MVCHead achieves state-of-the-art perceptual quality, surpasses prior methods in both texture and geometric consistency, and maintains comparable shape consistency. To demonstrate scalability, we release FaceGS-10K, the first large-scale dataset of ready-to-use 3D Gaussian head assets for training and evaluation of 3D head models. Project Page and code: https://humansensinglab.github.io/MVCHead/

2605.25191 2026-05-26 cs.CV version update

Injecting Image Guidance into Text-Conditioned Diffusion Models at Inference

Agata Żywot, Iason Skylitsis, Thijmen Nijdam, Zoe Tzifa-Kratira, Derck Prinzhorn, Konrad Szewczyk, Aritra Bhowmik

Affiliations: University of Amsterdam

AI summary: Proposes Visual Concept Fusion (VCF), a method that conditions on both image and text at inference time without retraining, injecting visual concepts by aligning CLIP image features with the text embedding space.

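Unbiased Diffusion Variational Inversion via Principled Posterior Matching

Weimin Bai, Yuxuan Gu, Yifei Wang, Weijian Luo, He Sun

Affiliations: Peking University

AI summary: Proposes the Principled Posterior Matching (PPM) framework, which addresses mode collapse and unreliable uncertainty quantification in inverse problems by exactly optimizing the KL divergence via an integral of the Fisher divergence; it unifies variational and amortized inference and achieves high-fidelity reconstruction and calibrated uncertainty estimates in inpainting, super-resolution fluorescence microscopy, and radio interferometric imaging.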

Abstract

Existing score-based methods for inverse problems often resort to approximate minimization of the KL divergence between the inversion distribution and the Bayesian posterior. Such an approximation leads to severe mode collapse and unreliable uncertainty quantification. In this paper, we propose Principled Posterior Matching (PPM), a framework that returns to the fundamentals of variational inference, rather than using tricky approximations. Instead of relying on heuristic approximations, we rigorously formulate the exact optimization of the KL divergence via the integration of Fisher divergence. We derive a tractable, equivalent gradient form of this integral, enabling precise optimization without the biases introduced by prior approximations. Our analysis clearly reveals that the mode collapse in previous methods stems directly from this approximation gap. Supported by our theoretical solution, PPM unifies two complementary paradigms: (1) In variational inference, PPM adopts mass-covering divergences that significantly improve the inversion diversity and uncertainty quantification; (2) In amortized inference, it enables the training of an efficient reconstruction network for rapid, single-step reconstruction. Furthermore, our formulation naturally extends to a broader family of divergence measures by generalizing the integral of the Fisher divergence. We validate PPM across challenging computational imaging tasks, including inpainting, super-resolution fluorescent microscopy, and radio interferometric black-hole imaging. In all experiments, PPM achieves superior reconstruction fidelity, faithful multimodal posterior recovery, and well-calibrated uncertainty estimates, establishing a robust framework for scientific imaging.

2605.25039 2026-05-26 cs.CV version update

AstroRAG -- A Pagerank-Based Retrieval-Augmented Generation Pipeline for Question Answering in Astronomy

Zhifeng Wang, Jason Jingshi Li, Kaihao Zhang, Ramesh Sankaranarayana

Affiliations: Australian National University; Learning Machines Pty Ltd

AI summary: Proposes AstroRAG, a PageRank-based retrieval-augmented generation pipeline that uses two-stage retrieval (MMR followed by PR re-ranking) to select a compact, mutually supportive context under a strict token budget; it is training-free and privacy-preserving, and on an astronomy QA benchmark it brings Mistral-7B to 79.49% accuracy and F1-score, nearly doubling performance.

Comments Accepted to IEEE CAI 2026

Abstract

Large language models (LLMs) demonstrate strong performance in natural language processing but often generate factual errors when relying solely on parametric knowledge. Retrieval-Augmented Generation (RAG) mitigates these errors by grounding responses in external evidence, yet conventional retrieve-and-dump approaches frequently introduce irrelevant context that degrades answer quality. In this work, we present AstroRAG -- a PageRank-based retrieval-augmented generation (RAG) pipeline adapted for question answering in astronomy. The system performs token-aware chunking and per-instance, ephemeral indexing in Elasticsearch, then executes a two-stage retrieval: (i) Maximal Marginal Relevance (MMR) to obtain a small, diverse candidate set and (ii) a reader-driven PageRank (PR) re-ranking on a similarity graph to identify a compact, mutually supportive context under a strict token budget. Our design is training-free, privacy-preserving, and reproducible, as each instance is processed through transient indexing to prevent cross-task leakage. We evaluate the pipeline on the AstroQA benchmark for astronomy QA, and demonstrate competitive performance across all difficulty levels. In particular, the RAG-enhanced Mistral-7B achieves 79.49% accuracy and 79.49% F1-score, nearly doubling the performance of its non-RAG counterpart. These results highlight the effectiveness of disciplined retrieval and refinement in boosting domain-specific reasoning, establishing a robust foundation for extending RAG to other scientific fields.
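
The two retrieval stages named in this abstract can be sketched in a few lines. The code below shows textbook greedy Maximal Marginal Relevance followed by plain power-iteration PageRank over a similarity matrix. It is not the paper's implementation: the "reader-driven" weighting, the token budget, and the Elasticsearch indexing are omitted, and the function names and the `lam`, `damping`, and `iters` defaults are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr(query, docs, k, lam=0.7):
    """Greedy MMR: choose k document indices, trading relevance to the query
    against redundancy with documents that were already chosen."""
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * cosine(query, docs[i]) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def pagerank(sim, damping=0.85, iters=50):
    """Power-iteration PageRank on a non-negative similarity matrix.

    sim[i][j] is the edge weight from node i to node j; self-loops are ignored.
    """
    n = len(sim)
    rank = [1.0 / n] * n
    out = [sum(row[j] for j in range(n) if j != i) for i, row in enumerate(sim)]
    for _ in range(iters):
        rank = [
            (1 - damping) / n
            + damping * sum(
                rank[i] * sim[i][j] / out[i]
                for i in range(n)
                if i != j and out[i] > 0
            )
            for j in range(n)
        ]
    return rank
```

A typical use would be to run `mmr` first to get a small, diverse candidate set, build `sim` from pairwise `cosine` values over those candidates, and keep the highest-ranked chunks until the context budget is reached.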

2605.25024 2026-05-26 cs.CV (updated version)

DA-UCT: Self-Supervised Domain-Adaptive Ultrasound Computed Tomography for Rapid Musculoskeletal Sound Speed Reconstruction

Tianyu Liu, Heyu Ma, Aiduo Wang, Peiwen Li, Boyi Li, Ying Li, Dan Li, Chengcheng Liu, Dean Ta

Affiliations: College of Biomedical Engineering, Fudan University

AI summary: Proposes the SDA-UCT framework, which combines self-supervised domain adaptation with an attention-enhanced network to achieve fast, high-resolution ultrasound computed tomography reconstruction of musculoskeletal tissue, substantially increasing speed while maintaining high quality.

English abstract

Ultrasound computed tomography (UCT) via full waveform inversion (FWI) enables high-resolution quantitative imaging for tissue characterization and disease diagnosis. However, UCT suffers from large computational burden and severe convergence issues due to highly nonlinear optimization. Deep learning can accelerate UCT reconstruction, but supervised training requires large-scale labeled datasets difficult to obtain in vivo. To address these limitations, we propose SDA-UCT, a two-stage self-supervised domain-adaptive framework for rapid and accurate UCT imaging of musculoskeletal tissues. SDA-UCT employs an attention-enhanced network (AttUCT) pre-trained on simulation datasets and transfers to in-vivo data via physics-informed self-supervised learning, effectively bridging the simulation-to-real domain gap. A Low-Rank Adaptation (LoRA) mechanism is integrated to enable efficient adaptation across diverse clinical scenarios. Results showed that AttUCT achieved high-quality SOS reconstruction for simulated human forearm with a PSNR of 29.23 dB and SSIM of 0.928, outperforming conventional FWI and existing deep learning methods. Validated on in-vivo data, SDA-UCT successfully reconstructed SOS images revealing complex anatomical structures (skin, fat, muscle, tendon, bone and bone marrow) for human forearm, in high concordance with MRI references. The LoRA mechanism adjusting only 3% of parameters achieved comparable performance to full fine-tuning. The rapid reconstruction (5 ms per frame) enables real-time 3D visualization, achieving five-orders-of-magnitude improvement over traditional FWI. This work represents the first self-supervised domain-adaptive deep learning for rapid, high-resolution in-vivo UCT imaging, showing potential for musculoskeletal disease diagnosis.

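2605.25022 2026-05-26 cs.CV cs.AI (updated version)

ClueAegis: Heuristic-to-Reasoning Cognitive Skill Learning for Unified Evidence-based Synthetic Image Detection

Huangsen Cao, Hongkang Chu, Yuxi Li, Ying Zhang, Chen Li, Jing Lyu, Yongwei Wang, Yu Zhao, Fei Wu

Affiliations: Zhejiang University; WeChat Vision, Tencent Inc; University of the Chinese Academy of Sciences

AI summary: To address the lack of structured forensic reasoning in existing synthetic image detection methods, proposes ClueAegis, a heuristic-to-reasoning cognitive skill learning framework whose two-stage agentic pipeline performs skill selection followed by evidence-guided reasoning, reaching state-of-the-art cross-domain generalization and robustness.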

English abstract

The rapid advancement of generative models has made synthetic images increasingly realistic, challenging reliable detection. Existing methods are often limited to end-to-end classification or monolithic reasoning, and thus fail to model structured forensic reasoning and heterogeneous visual evidence. We revisit synthetic image detection from a cognitive perspective and propose a Heuristic-to-Reasoning cognitive skill learning framework for evidence-based forensic analysis. Given an input image, our framework first extracts heuristic perceptual clues, selects the optimal forensic skill, and then performs skill-conditioned reasoning for evidence extraction and decision making. To support this paradigm, we introduce ClueAegis-Bench, which decomposes synthetic image detection into explicitly annotated forensic cognitive skills for structured evaluation beyond binary classification. Based on this benchmark, we propose ClueAegis (Cognitive-skill Learning for Unified Evidence-based Synthetic Image Detection), a two-stage agentic framework that conducts heuristic skill selection followed by evidence-guided reasoning through skill-conditioned toolchains. This design reformulates synthetic image detection as a configurable multi-skill reasoning process that bridges perception, skill selection, and forensic reasoning. Extensive experiments show that ClueAegis achieves state-of-the-art performance while improving cross-domain generalization and robustness. It also provides transparent reasoning trajectories and structured forensic evidence, offering a more explainable alternative to conventional end-to-end detectors.

2605.24993 2026-05-26 cs.AI cs.CV (updated version)

NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

Sijin Yu, Zijiao Chen, Zhenyu Yang, Zihao Tan, Jiakun Xu, Zhongliang Liu, Shengxian Chen, Wenxuan Wu, Xiangmin Xu, Xin Zhang

Affiliations: South China University of Technology; Stanford University; King's College London; Foshan University; Pazhou Lab

AI summary: Proposes the NeurIPS framework, which uses a Selective ROI Spherical Tokenizer and a Structure-Guided Mixture of Experts to turn anatomical variation into an inductive prior, achieving state-of-the-art performance among surface decoders on the Natural Scenes Dataset and substantially improving training efficiency.

Comments: International Conference on Machine Learning (ICML) 2026

English abstract

Current fMRI decoders face a performance-fidelity trade-off where efficient 1D encoders outperform geometrically faithful surface-based models. We argue this is partly driven by inefficient surface tokenization and the failure to use anatomy as a predictive signal. We present NeurIPS, a framework that improves surface-based decoding by reframing anatomical variation from a nuisance to a powerful inductive prior. NeurIPS unites two innovations: a Selective ROI Spherical Tokenizer (SRST) for efficient geometric encoding, and a Structure-Guided Mixture of Experts (SG-MoE) that explicitly models individual anatomy using cortical features. On the Natural Scenes Dataset, NeurIPS establishes a new state-of-the-art for surface decoders and achieves performance comparable to strong 1D baselines. This is achieved with unprecedented efficiency, as the model converges dramatically faster (10 vs. 600 epochs). This efficiency enables rapid adaptation to new subjects using only 20% of data and ensures robust scalability as the training cohort is expanded. Ablations provide causal evidence that these gains are driven by the model's use of cortical features, not by memorizing subject IDs. By leveraging anatomical priors, NeurIPS provides a principled and scalable path toward robust, generalizable brain decoding.

2605.24977 2026-05-26 cs.CV cs.CL (updated version)

Universal Boosts, Specific Suppressors: Sparse Autoencoder Steering of Medical Vision-Language Models

Farhad Nooralahzadeh, Benjamin Gundersen, Nicolas Deperrois, Hidetoshi Matsuom, Mizuho Nishio, Thomas Frauenfelder, Ahmed Allam, Christian Blüthgen, Michael Moor, Michael Krauthammer

Affiliations: University of Zurich and University Hospital of Zurich; Kobe University; ETH AI Center; ETH Zurich; Stanford University; Zurich University of Applied Sciences

AI summary: Proposes a decoding-time residual steering method that requires no weight updates: per-token sparse autoencoders (SAEs) are used to intervene in medical vision-language models, suppressing hallucinations and improving report quality, with significant improvements on several models.

English abstract

Medical vision-language models (VLMs) often hallucinate findings when generating chest X-ray reports: they fabricate findings that are not present in the image, miss important ones, or locate them incorrectly. We mitigate this without weight updates by decoding-time residual steering on a per-token sparse autoencoder (SAE) basis: Top-K SAEs on late layers, causal steering against clinical errors, then combined suppress/boost intervention at inference time. On the MIMIC-CXR test split, our inference-only method improves the quality of generated reports for three radiology VLMs (RadVLM, LLaVA-Rad, and CheXOne), with relative improvements of +5.4%, +7.2%, and +17.0% in the clinical composite metric, and statistically significant GREEN gains on all backbones. A cross-model feature alignment shows that the quality-promoting (boost) directions overlap strongly across architectures, whereas hallucination-linked (suppress) directions are model-specific. Therefore, transferable steering must treat suppression per-backbone, rather than sharing a universal suppress list. The same recipe transfers zero-shot to IU-Xray (GREEN +7.7% relative) without retraining, confirming that the identified features are properties of the model, not of the training corpus. We release causal feature sets and an interactive feature dashboard: https://cxr-sparse-feature-dashboard.netlify.app/.
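
Illustrative sketch (not taken from the paper): one common way to combine a suppress list and a boost list in SAE-based residual steering is to subtract what each suppressed feature currently writes into the residual vector and to add a fixed multiple of each boosted feature's decoder direction. The abstract does not state the exact intervention rule, so the formula, the `alpha` strength, and all names below are assumptions.

```python
def topk_encode(h, W_enc, b_enc, k):
    # Top-K SAE encoder: keep the k largest pre-activations (clamped at zero), zero the rest.
    pre = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]

def steer(h, W_enc, b_enc, W_dec, k, suppress, boost, alpha=1.0):
    # h: residual vector for one token. W_dec[f] is the decoder direction of feature f.
    z = topk_encode(h, W_enc, b_enc, k)
    out = list(h)
    for f in suppress:
        # Remove the contribution the feature currently makes to the residual.
        out = [o - z[f] * d for o, d in zip(out, W_dec[f])]
    for f in boost:
        # Push the residual along the feature's decoder direction.
        out = [o + alpha * d for o, d in zip(out, W_dec[f])]
    return out
```

Because the abstract reports that suppress directions are model-specific while boost directions overlap across architectures, a caller following it would keep one `suppress` list per backbone and could share the `boost` list.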

2605.24973 2026-05-26 cs.CV cs.AI cs.CL (updated version)

MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory, OpenDataLab; University of California, Berkeley

AI summary: Proposes MinerU-Popo, a lightweight, universal post-processing framework that decomposes the task into four subtasks (text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association) and uses dynamic chunking with overlap-based synchronization to rebuild page-level OCR results into document-level logical structure, significantly improving title-hierarchy TEDS and RAG accuracy.

Comments: The code is available at https://github.com/opendatalab/MinerU-Popo

English abstract

VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.
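
Illustrative sketch (not taken from the paper or its repository): dynamic chunking with overlap-based synchronization can be pictured as splitting pages into token-bounded windows that share at least one page, then shifting each window's output so that it agrees with the already merged result on a shared page. The one-heading-level-per-page data model, the function names, and the offset rule are assumptions chosen to keep the example small.

```python
def dynamic_chunks(page_tokens, budget, overlap=1):
    # Return (start, end) page ranges, end exclusive.
    # Each chunk grows until the token budget would be exceeded; consecutive chunks
    # share `overlap` pages where possible. A single oversized page still forms a chunk.
    chunks, start, n = [], 0, len(page_tokens)
    while start < n:
        end, used = start, 0
        while end < n and (end == start or used + page_tokens[end] <= budget):
            used += page_tokens[end]
            end += 1
        chunks.append((start, end))
        if end >= n:
            break
        start = max(end - overlap, start + 1)  # always advance by at least one page
    return chunks

def synchronize(chunks, chunk_levels):
    # chunk_levels[c][i] is the heading level the model assigned to page start+i of chunk c.
    # Levels predicted inside one chunk are only relative, so each chunk is shifted
    # to match the merged result on its first shared page.
    merged = {}
    for (start, end), levels in zip(chunks, chunk_levels):
        shared = [p for p in range(start, end) if p in merged]
        offset = merged[shared[0]] - levels[shared[0] - start] if shared else 0
        for p in range(start, end):
            merged.setdefault(p, levels[p - start] + offset)
    return [merged[p] for p in sorted(merged)]
```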

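2605.24965 2026-05-26 cs.CV cs.AI cs.LG (updated version)

Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection

Ibrahim Delibasoglu

Affiliations: Department of Software Engineering, Faculty of Computer and Information Sciences

AI summary: Systematically evaluates the linear-probing performance of three vision foundation models (RoPE-ViT, DINOv3, NVIDIA C-RADIOv4-H) on the DF40 benchmark, revealing the limits of their cross-domain generalization in facial deepfake detection: the models remain highly discriminative for entire-face synthesis but hit fundamental boundaries on localized editing techniques.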

English abstract

The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in modern digital forensics: the inability of detectors to generalize to unseen manipulation techniques. Traditional networks suffer from representation collapse, overfitting to localized artifact fingerprints of specific training generators. This work investigates whether modern Vision Foundation Models can serve as generalizable, out-of-the-box feature extractors capable of tracking forensic anomalies across entirely unseen generative manifolds. We conduct a systematic cross-domain evaluation comparing three foundational learning paradigms: fully supervised macro-semantic features (RoPE-ViT), pure self-supervised geometric features (DINOv3), and multi-teacher agglomerative representations (NVIDIA C-RADIOv4-H). By deploying frozen backbones subjected to downstream linear probing, we map the performance limitations of these architectures on the challenging DF40 benchmark. Our empirical findings expose the intrinsic trade-offs between pre-training paradigms and parameter scale, proving that while foundation models retain high discriminative capabilities for entire face synthesis, localized face editing techniques expose fundamental boundaries in linear probe evaluation structures. Source code and model weights are available in http://github.com/mribrahim/deepfake
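
Illustrative sketch (not taken from the paper's code): linear probing trains only a linear classifier on features produced by a frozen backbone, so probe accuracy reflects what the frozen representation already separates. The example below fits a logistic-regression probe with batch gradient descent on precomputed feature vectors; the toy features, learning rate, and epoch count are assumptions, and the backbone itself is out of scope.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    # features: frozen-backbone embeddings (lists of floats); labels: 0 = real, 1 = fake.
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for d in range(dim):
                grad_w[d] += (p - y) * x[d] / n  # gradient of the mean cross-entropy loss
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    # Hard decision at probability 0.5, i.e. logit 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

For a cross-domain evaluation of the kind the abstract describes, the probe would be fitted on features from one set of generators and scored on features from generators it never saw.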

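2605.24964 2026-05-26 cs.CV (updated version)

ConFi-GS: Confidence-Guided High-Frequency Injection for 3D Gaussian Splatting Super-Resolution

Jiaxiang Li, Zongtan Zhou, Zhen Tan, Yadong Liu, Dewen Hu

AI summary: Proposes a reliability-aware frequency modeling framework in which a geometry-guided detail-demand prior and a frequency-aware reliability map guide the injection of high-frequency detail into low-resolution 3DGS reconstruction, improving fidelity and perceptual quality.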

English abstract

Reconstructing high-quality 3D scenes from low-resolution multi-view images remains challenging for 3D Gaussian Splatting (3DGS), because insufficient high-frequency observations often lead to blurred textures, weak boundaries, and view-inconsistent details. Existing approaches either apply super-resolution guidance uniformly or localize enhancement regions based mainly on geometric sampling. However, they typically do not distinguish between two fundamentally different questions: where additional detail is needed, and whether the corresponding candidate high-frequency content is reliable enough to be internalized into a multi-view consistent 3D representation. In this paper, we propose a reliability-aware frequency modeling framework for low-resolution 3DGS reconstruction. The framework first estimates a geometry-guided detail-demand prior to locate regions that are likely under-detailed under low-resolution supervision. It then computes a frequency-aware reliability map to determine whether candidate high-frequency details are structurally supported, spectrally unresolved, and cross-view stable. Combining these signals yields a detail-injection map that guides where super-resolved details should be introduced during optimization. Based on this map, we design a unified optimization scheme comprising spatially selective supervision, coarse-to-fine frequency regularization, and reliability-aware Gaussian densification. This scheme controls where reliable details are injected, when high-frequency supervision is activated, and how unresolved yet reliable details are internalized into the Gaussian representation. Experiments on multiple benchmarks show improved fidelity and perceptual quality while suppressing unstable or view-inconsistent details.

2605.24962 2026-05-26 cs.CV (updated version)

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

Manjin Kim, Suha Kwak, Minsu Cho

Affiliations: Pohang University of Science and Technology (POSTECH)

AI summary: Proposes the Tempered Self-Similarity Alignment (TSA) loss, which transfers the spatio-temporal self-similarity relational knowledge of visual foundation models into video generative models to improve the physical plausibility of generated video.

Comments: Accepted to the CVPR 2026 Workshop on Video Generative Models: Benchmarks and Evaluation (VGBE)

English abstract

Despite remarkable advances in video generative models, they still struggle to generate physically realistic videos, frequently exhibiting appearance drift, implausible motion, and temporal inconsistencies. In this work, we address this limitation by transferring relational knowledge encoded in spatio-temporal self-similarity (STSS) from visual foundation models into video generative models. STSS represents pairwise similarities among features across space and time, revealing the relational structure of how objects interact with other entities throughout a video, effectively capturing real-world dynamics, including object motion and semantic transformations. To transfer this relational knowledge, we propose Tempered Self-similarity Alignment (TSA) loss, which transforms STSS into probabilistic correspondence distributions and trains the video generative model to align its correspondence distributions with those of the visual foundation model on dynamically changing regions. Evaluated on VideoPhy and VideoPhy2 benchmarks, our method demonstrates substantial improvements in physical plausibility across diverse interaction scenarios, validating the effectiveness of transferring relational knowledge for physically realistic video generation.
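
Illustrative sketch (not taken from the paper): turning self-similarity into "probabilistic correspondence distributions" can be read as a temperature-scaled softmax over each token's similarities to all other space-time tokens, with the generator's distributions pulled toward the foundation model's. The cosine similarity, the KL direction (foundation model as target), the temperature value, and the boolean mask standing in for "dynamically changing regions" are all assumptions; the abstract does not specify them.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correspondence(features, tau):
    # Row i is a softmax over the similarity of token i to every space-time token,
    # sharpened or flattened by the temperature tau.
    rows = []
    for f in features:
        logits = [cosine(f, g) / tau for g in features]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def tsa_loss(gen_feats, vfm_feats, tau=0.1, dynamic_mask=None):
    # Mean KL(foundation model || generator) over the rows selected by the mask.
    p_gen = correspondence(gen_feats, tau)
    p_vfm = correspondence(vfm_feats, tau)
    rows = [i for i in range(len(p_vfm)) if dynamic_mask is None or dynamic_mask[i]]
    if not rows:
        return 0.0
    loss = 0.0
    for i in rows:
        loss += sum(q * math.log(q / p) for q, p in zip(p_vfm[i], p_gen[i]) if q > 0)
    return loss / len(rows)
```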

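2605.24959 2026-05-26 cs.CV (updated version)

Three-Step Conditional Diffusion 3D Reconstruction for Light-Field Microscopy

Qihong Zhao, Shaokang Yan, Zhimin Qiao, Jinjia Wang, Bo Xiong

Affiliations: Yanshan University; Peking University

AI summary: Traditional algorithms for light-field microscopy suffer from low resolution, heavy artifacts, and high computational cost, and existing learning-based methods fall short in reconstruction accuracy and generalization. This work proposes a high-fidelity 3D reconstruction method based on three-step conditional diffusion, achieving fast and accurate reconstruction through deterministic three-step sampling and a lightweight conditional U-Net, and adds an Inter-Class Detection module to improve stability.

Comments: 10 pages, 6 figures. Accepted to CVPR 2026 Findings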

English abstract

Light-field microscopy (LFM) enables single-shot capture of multi-angular information from biological samples, supporting real-time volumetric imaging. However, traditional physics-based algorithms often suffer from limited spatial resolution, severe artifacts, and high computational costs. Existing learning-based methods improve inference efficiency but still face limitations in reconstruction accuracy and generalization capability. To address these challenges, this paper proposes a high-fidelity Three-Step Conditional Diffusion (TCD) 3D reconstruction method for LFM. Although conventional diffusion models have achieved remarkable success in generative modeling, their slow sampling process and the inherent trade-off between quality and efficiency hinder their application in real-time 3D imaging. We redesign the diffusion process through a deterministic three-step sampling strategy coupled with a lightweight conditional U-Net, establishing a new paradigm for fast and accurate volumetric reconstruction. Furthermore, an Inter-Class Detection (ICD) module is incorporated to identify out-of-distribution or anomalous inputs during inference, thereby enhancing model stability and reliability. Extensive experiments and cross-dataset evaluations demonstrate that TCD significantly outperforms state-of-the-art methods in both reconstruction fidelity and generalization, providing an efficient and practical 3D reconstruction solution for light-field microscopy.
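
Illustrative sketch (not taken from the paper): a deterministic few-step sampler can be written as a DDIM-style update with zero added noise, visiting only three noise levels. The abstract does not say that TCD uses this particular update, so the rule, the noise levels in the usage note, and the `eps_model(x, condition, step)` signature standing in for the conditional U-Net are assumptions.

```python
import math

def deterministic_sample(x_T, condition, eps_model, alpha_bars):
    # alpha_bars: cumulative signal levels of the visited steps, noisiest first.
    # A final level of 1.0 (no noise) is appended, so three entries give three steps.
    x = list(x_T)
    levels = list(alpha_bars) + [1.0]
    for step, (a_t, a_prev) in enumerate(zip(levels[:-1], levels[1:])):
        eps = eps_model(x, condition, step)
        # Estimate the clean sample implied by the predicted noise.
        x0 = [(xi - math.sqrt(1 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Move to the next, less noisy level without injecting fresh noise.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * e for x0i, e in zip(x0, eps)]
    return x
```

With a perfect noise predictor the three updates recover the clean sample exactly, which is a convenient sanity check: for example, start from `x_T = sqrt(0.1) * x0 + sqrt(0.9) * noise`, pass `alpha_bars=[0.1, 0.5, 0.9]`, and supply an `eps_model` that returns the true noise.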

2605.24957 2026-05-26 cs.AI cs.CV cs.LG (updated version)

Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

Yuanzhi Xu, Qian Gao, Jun Fan, Guohui Ding, Zhenyu Yang, Sixue Lin, Yuteng Xiao

Affiliations: Qilu University of Technology (Shandong Academy of Sciences); China Telecom Digital Intelligence Technology Co, Ltd; Shenyang Aerospace University; Qilu Institute of Technology

AI summary: Proposes a training-free, region-aware adaptive weighting mechanism that computes a robust statistical midpoint across attention heads, uses inter-head disagreement to set the intervention budget dynamically, and suppresses hallucination-inducing paths through continuous penalty modulation, correcting visual-semantic misalignment while preserving generative fluency.

English abstract

The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Language Models (LVLMs). Current approaches to address this issue - ranging from expensive data-driven fine-tuning and high-latency contrastive decoding to rigid attention head truncation - frequently compromise either computational efficiency or the continuity of the model's feature space. To overcome these limitations, we introduce a novel, training-free inference strategy that operates as a region-aware adaptive weighting mechanism to dynamically correct semantic drift without relying on abrupt heuristic truncations. By computing an outlier-resistant statistical midpoint across various attention heads, we establish a stable anchor for reliable visual representations. We then utilize the inter-head disagreement mapped across regions to dynamically determine intervention budgets, gently suppressing hallucination-inducing attention paths through a continuous penalty modulation. This recalibration process effectively rectifies visual-semantic misalignments while fully preserving generative fluency and language priors. Comprehensive evaluations on standard multimodal benchmarks, including CHAIR, POPE, and MME, reveal that our strategy substantially curtails both instance- and sentence-level hallucinations. The results demonstrate state-of-the-art performance against contemporary baselines, confirming our method's efficiency and algorithmic robustness. Our code will be public.

2605.24946 2026-05-26 cs.CV (updated version)

Interpretability Transfer from Language to Vision via Sparse Autoencoders

Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri

Affiliations: University of Bath, UK; Lambda, Inc.; Samsung AI Centre Cambridge

AI summary: Proposes the VISTA framework, which constrains the visual projector to map visual tokens into the LLM's textual SAE space, enabling visual interpretability without a dedicated vision SAE and improving object removal and object replacement tasks by 35% and 47%, respectively.

Journal ref: ICML 2026

English abstract

Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.

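2605.24938 2026-05-26 cs.IR cs.AI cs.CV (updated version)

Snapshot Polarimetric Display Inverse Rendering

Seokjun Choi, Yunseong Moon, Kaizhang Kang, Hoon-Gyu Chung, Jin-Nyeong Kim, Giljoo Nam, Seung-Hwan Baek

Affiliations: POSTECH

AI summary: Proposes a snapshot polarimetric display inverse rendering method that uses an LCD to project linearly polarized RGB patterns and a polarization camera to acquire spectro-polarimetric measurements; a feed-forward transformer predicts per-pixel normal, albedo, roughness, and metallicity, outperforming existing methods in real desktop scenes.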

English abstract

Inverse rendering remains a core challenge in graphics and vision, especially in the snapshot configurations required for lightweight desktop workflows, where the per-frame information budget is highly constrained. Previous inverse rendering work explores various available dimensions for enriching the per-shot information, including temporal modulation, spectral encoding, and polarization. In this work, we introduce polarimetric display inverse rendering, using an LCD to project a linearly polarized RGB binary pattern and an RGB polarization camera augmented with a quarter-wave plate to acquire spectro-polarimetric measurements in a single shot. A feed-forward transformer maps these measurements to per-pixel normal, albedo, roughness, and metallicity. To overcome training data scarcity, we expand a limited set of measured polarimetric bidirectional reflectance distribution functions via a generative manifold. Evaluations on a real desktop setup demonstrate accurate inverse rendering across diverse scenes, outperforming existing approaches.

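2605.24894 2026-05-26 cs.CV (updated version)

BFS: Back-to-Front Layered Image Synthesis via Knowledge Transfer

Kyoungkook Kang, Gyujin Sim, Sunghyun Cho

Affiliations: SAMSUNG; POSTECH

AI summary: Proposes the BFS framework, which uses a dual-branch diffusion model and two-stage training to transfer knowledge from unlayered image synthesis, producing high-quality foreground layers that blend harmoniously with the background.

Comments: SIGGRAPH 2026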

English abstract

As generative models expand the possibilities of visual content creation, layered image synthesis has emerged as a promising direction for controllable and creative editing. However, existing methods struggle to fully realize this potential. Decomposition-based methods often struggle with clean separation, while generation-based methods suffer from difficulty in training data acquisition, reducing quality and scene diversity. In this paper, we propose BFS, a novel generation-based framework for layered image synthesis. Specifically, given a background image and user guidance, BFS synthesizes a foreground layer that incorporates not only a foreground object but also its associated visual effects, such as shadows and reflections, while seamlessly harmonizing with the background to produce a coherent composite. To enable diverse and high-quality foreground layer synthesis while overcoming data scarcity, we leverage the comparatively easy-to-learn knowledge of unlayered image synthesis for the foreground synthesis. To this end, we adopt a dual-branch diffusion framework in which two interconnected branches generate a composite image and a foreground layer, respectively, enabling bidirectional knowledge transfer. Based on this framework, we propose a two-stage training scheme that utilizes a high-quality unlayered composite image dataset to effectively enhance foreground quality. Extensive experiments, including a user study, show that BFS produces high-quality layered images, consistently outperforming prior methods.

2605.24893 2026-05-26 cs.CV (updated version)

BED-SAM2: Boundary-Enhanced-Depth SAM2 via Monocular Geometric Priors

Tyler Rust, Dara McNally, Kyle O'Donnell, Colin Kelly, Chandra Kambhamettu

Affiliations: University of Delaware; University of South Florida; DEVCOM Army Research Laboratory

AI summary: Modifies the SAM2 encoder to encode monocular depth information directly, yielding the BED-SAM2 model, which achieves competitive performance on salient and camouflaged object detection within a small number of training epochs.

Comments: 9 pages, 5 figures, 5 tables. Presented as a poster at the CVPR 2026 Workshop on Computer Vision in the Wild (CVinW). Code available at https://github.com/TylerRust-1/BED-SAM2

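2605.24870 2026-05-26 cs.CV (version update)

Trajectory-Consistent Calibration for Cache-Accelerated Diffusion Models

Mingyu Liang, Dingkun Xu, Jingwei Xu

Affiliations: Laboratory for Novel Software Technology, Nanjing University, China

AI summary: To address the loss of generation quality caused by representation deviations in cache-accelerated diffusion models, this work proposes a training-free trajectory-consistent calibration method that calibrates cached representations through an offline iterative procedure and consistently improves FID on PixArt-alpha and DiT-XL/2.

Comments: 23 pages, 8 figures, 8 tables. Code is available at https://github.com/NJUDeepEngine/TCC

Details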

Abstract

Diffusion Transformers require repeated denoiser evaluations during iterative sampling, making inference computationally expensive. Cache-based acceleration reduces this cost by reusing intermediate representations across denoising steps, but can introduce representation deviations and degrade generation quality. In this paper, we analyze these deviations and show that effective calibration should consider both the direct mismatch caused by reuse and the subsequent trajectory shift induced by earlier corrections. To address this challenge, we propose Trajectory-Consistent Calibration (TCC), a training-free method that calibrates cached representations toward their full-computation counterparts. Specifically, rather than estimating all calibration priors from a single uncorrected cache trajectory, TCC uses an offline iterative procedure so that each prior accounts for the trajectory shift induced by preceding calibrations. Experiments on PixArt-alpha and DiT-XL/2 show that TCC consistently improves FID across representative cache-based acceleration methods while preserving their underlying reuse policies. Notably, in a representative PixArt-alpha cache-acceleration setting based on FORA, TCC reduces FID from 29.83 to 27.35, slightly surpassing the full-computation baseline.

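2605.24843 2026-05-26 cs.CV cs.AI (version update)

CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Shayan Jalilian, Abdul Bais

Affiliations: University of Regina, Regina, SK, Canada

AI summary: Proposes the CLIP-Guided SAM framework, which injects CLIP features into the SAM image encoder through lightweight multi-modal semantic adapters to achieve internal semantic conditioning, improving segmentation performance with little labeled data and supporting both Manual and Semi-Automatic modes.

Details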

Abstract

Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.

2605.24805 2026-05-26 cs.CV (version update)

Fishbone: From One 3D Asset to a Million Controllable Edits

Yumeng He, Xiaoying Wang, Peihao Li, Yanjia Huang, Joe Masterjohn, Jiajun Wu, Leonidas Guibas, Yin Yang, Ying Jiang, Chenfanfu Jiang

Affiliations: UCLA; USC; UC Berkeley; TRI; Stanford; Utah

AI summary: Proposes Fishbone, a unified rib-spine representation that supports controllable parametric deformation, reduced-space dynamics, and animation of general meshes; builds the Fishbone-136K dataset; and applies it to tasks such as controllable 3D generation and data augmentation for robot learning.

Comments: 20 pages, 19 figures

Details

Abstract

Large-scale controllable 3D assets are critical for computer graphics, embodied AI, robotics, and interactive content creation, yet creating diverse 3D assets remains challenging due to the high cost of manual modeling and rigging. Shape deformation offers a natural way to generate variations from existing meshes, but existing data-driven methods often rely on sparse user inputs, while parametric editing frameworks require manually designed control structures and category-specific configurations. Inspired by natural creatures, where a central spine governs global shape and cross-sectional ribs control local variation, we introduce Fishbone, a unified rib-spine representation for general shapes that supports controllable parametric mesh deformation, reduced-space dynamics, and animation. Given an input mesh, Fishbone computes a geodesic scalar field with an adaptive heat method, extracts iso-contours as cross-sectional ribs, constructs a smooth geometry-aware spine through rib centers, and associates surface vertices with nearby rib and spine structures using Gaussian-weighted skinning. The resulting representation enables real-time and predictable deformation: ribs control local profiles such as thickness, orientation, and cross-sectional variation, while the spine controls global bending, twisting, and stretching. The same structure also supports reduced-space simulation and keyframe animation. We further construct Fishbone-136K by augmenting Hunyuan3D with rib-spine structures, and demonstrate applications in controllable 3D generation, deformation-based data augmentation for robot learning, interactive mesh editing, and agentic generation. Experiments demonstrate the effectiveness, efficiency, and versatility of the proposed framework.
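
The Gaussian-weighted skinning step named in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so the form below (weights proportional to exp(-d²/(2σ²)), normalized to sum to one, with plain Euclidean distance) is an assumption, as are the function name `gaussian_skinning_weights`, the `sigma` parameter, and the sample coordinates.

```python
import math

def gaussian_skinning_weights(vertex, rib_centers, sigma=1.0):
    """Normalized Gaussian weights linking one vertex to each rib center.

    Assumed form (not taken from the paper): w_i is proportional to
    exp(-d_i^2 / (2 * sigma^2)), with d_i the Euclidean distance to center i.
    """
    raw = []
    for center in rib_centers:
        squared_distance = sum((v - c) ** 2 for v, c in zip(vertex, center))
        raw.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical data: one vertex and three rib centers along the x-axis.
weights = gaussian_skinning_weights(
    (0.0, 0.0, 0.0),
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
)
print(weights)
```

Nearby ribs dominate the weights, so moving one rib would mostly affect the vertices close to it; a smaller `sigma` makes that influence more local.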

2605.24799 2026-05-26 cs.CV cs.AI (version update)

Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models

Zhipeng Ye, Jiaqi Huang, Feng Jiang, Qiufeng Wang, Yikang Duan, Dawei Wang, Xihang Zhou, Qian Qiao

Affiliations: Taizhou Institute of Science and Technology, Nanjing University of Science and Technology; Department of Intelligence Science, Xi’an Jiaotong-Liverpool University; School of Computer Science and Technology, Soochow University; Department of Statistical Sciences, University of Toronto

AI summary: To address the performance collapse of multimodal large language models in long-sequence recognition, proposes the Divide-and-Conquer Inference (DCI) strategy, which improves the signal-to-noise ratio and classification accuracy through recursive task decomposition and dynamic pruning.

Details

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision-language tasks. However, when applied to large-scale image classification, their performance degrades significantly as the label space expands, a phenomenon we define as Performance Collapse in Long Sequence Recognition. Through an information-theoretic analysis, we reveal that this collapse stems from a fundamental conflict between the escalating information entropy and the prominent attention dilution and decay within attention mechanisms, which impairs the model's ability to maintain a sufficient signal-to-noise ratio when processing extremely long prompts. To mitigate this, we propose Divide-and-Conquer Inference (DCI), a novel test-time scaling strategy for visual recognition with MLLMs. DCI recursively decomposes complex global classification tasks into multiple simpler, localized subproblems and employs a dynamic pruning mechanism to compress the search space. This method effectively improves the local signal-to-noise ratio and model accuracy by mitigating the inherent weight dilution issues in long-sequence inference. Moreover, while traditional self-attention incurs a prohibitive quadratic computational complexity, DCI achieves more favorable scaling behavior and substantially accelerates inference in large-scale classification scenarios. Extensive experiments on benchmarks such as ImageNet-1K and ImageNet-21K demonstrate that DCI consistently improves classification accuracy. This enables lightweight open-source models to rival or even surpass frontier closed-source giants without any additional training or fine-tuning. As a model-agnostic, plug-and-play paradigm, DCI offers an efficient approach for scaling the inferential precision of MLLMs in large-scale scenarios.
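
The recursive decomposition and pruning described in the abstract can be sketched as a tournament over small label groups. This is only an illustration of the general idea, not the authors' algorithm: the function name, `group_size`, `keep`, and the per-label `score_fn` (a stand-in for querying an MLLM with a short prompt) are all assumptions.

```python
def divide_and_conquer_classify(labels, score_fn, group_size=10, keep=2):
    """Reduce a large label list to one label via small local comparisons.

    score_fn is a stand-in for querying a model about one candidate label;
    a real system would score each small group with a single short prompt.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    if not 1 <= keep < group_size:
        raise ValueError("need 1 <= keep < group_size so every round shrinks")
    candidates = list(labels)
    while len(candidates) > group_size:
        survivors = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            ranked = sorted(group, key=score_fn, reverse=True)
            survivors.extend(ranked[:keep])  # prune the rest of this group
        candidates = survivors
    # Final round: the remaining candidates fit in one short prompt.
    return max(candidates, key=score_fn)

# Hypothetical demo: 1,000 labels and an oracle scorer that prefers one label.
labels = [f"class_{i}" for i in range(1000)]
target = "class_777"
predicted = divide_and_conquer_classify(
    labels, lambda label: 1.0 if label == target else 0.0
)
print(predicted)
```

Each comparison sees at most `group_size` labels, which is the property the abstract relies on: no single prompt has to list the full label space.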

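2605.24797 2026-05-26 cs.CV (version update)

HCL-FF: Hierarchical and Contrastive Learning for Forward-Forward Algorithm

Jie-En Yao, Hong-En Chen, C.-C. Jay Kuo

Affiliations: University of Southern California

AI summary: To address the lack of hierarchical coordination and the semantic ambiguity of features in the Forward-Forward algorithm, proposes the HCL-FF framework, which uses a coarse-to-fine hierarchical learning strategy and a supervised contrastive objective to achieve the best performance among FF-based methods on CIFAR-10, CIFAR-100, and Tiny-ImageNet.

Comments: Accepted by CVPR 2026. Code: https://github.com/JNNNNYao/HCL-FF

Details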

Abstract

Deep neural networks trained with backpropagation have achieved outstanding performance in vision tasks but remain biologically implausible, computationally demanding, and difficult to interpret. The Forward-Forward (FF) algorithm offers a promising alternative by training each layer independently through local goodness objectives. However, its purely local optimization lacks hierarchical coordination across layers, and the decoupling of goodness from features leaves the representations unconstrained and semantically ambiguous. We propose a Hierarchical and Contrastive Learning FF framework (HCL-FF) to address these limitations. HCL-FF introduces (1) a coarse-to-fine hierarchical learning strategy that guides representations from low-level cues to high-level semantics, and (2) a supervised contrastive objective that enforces class-discriminative alignment after goodness decoupling. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that HCL-FF achieves new state-of-the-art performance among FF-based methods, with notable accuracy gains of +5.46%, +17.00%, and +12.51%, respectively.
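
For readers unfamiliar with the "local goodness objectives" mentioned above, here is a minimal sketch of the baseline Forward-Forward layer loss, assuming the commonly used sum-of-squares goodness with a logistic threshold. It does not implement HCL-FF's hierarchical or contrastive components; the threshold `theta` and the sample activations are hypothetical.

```python
import math

def goodness(activations):
    """Goodness of a layer: the sum of squared activations."""
    return sum(a * a for a in activations)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def ff_layer_loss(pos_activations, neg_activations, theta=2.0):
    """Local loss that is small when positive goodness exceeds theta
    and negative goodness falls below it."""
    return (softplus(theta - goodness(pos_activations))
            + softplus(goodness(neg_activations) - theta))

# Hypothetical activations for one layer.
well_separated = ff_layer_loss([2.0, 2.0], [0.1, 0.1])
badly_separated = ff_layer_loss([0.1, 0.1], [2.0, 2.0])
print(well_separated, badly_separated)
```

Because the loss depends only on one layer's activations, each layer can be trained without gradients from later layers, which is also why coordination across layers has to be added separately.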

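2605.24794 2026-05-26 cs.CV cs.CL (version update)

DUEL: Adversarial Self-Play for Multimodal Reasoning

Lin Qiu, Hanqing Zeng, Yao Liu, Bingjun Sun, Guangdeng Liao, Ji Liu

Affiliations: Meta AI

AI summary: Proposes the DUEL framework, which derives supervision signals from a pretrained VLM through adversarial self-play and, combined with a length-normalized log-likelihood reward, improves visual reasoning and discrimination without human annotation.

Details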

Abstract

Reinforcement learning (RL) has emerged as an effective paradigm for improving the reasoning capability of vision-language models (VLMs). However, RL-based optimization typically depends on costly high-quality annotations that are difficult to scale. Existing unsupervised alternatives may drift toward biased solutions due to weak visual grounding and the lack of reliable verification signals. We propose a self-evolving post-training framework, DUEL, where supervision emerges from adversarial interactions between two policies initialized from the same pretrained VLM. A Challenger generates an image-grounded true claim together with a minimally perturbed hard-negative counterpart, while a Solver verifies both claims against the image, encouraging fine-grained visual discrimination under near-neighbor semantics. To stabilize optimization, we introduce a length-normalized log-likelihood reward that preserves informative optimization signals beyond binary outcome supervision and improves learning stability under sparse feedback. Experiments show that DUEL consistently improves visual reasoning and robust discrimination without additional human annotations, external reward models, or image editing tools.
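
The effect of length normalization on a log-likelihood signal can be shown with a small sketch. The abstract does not spell out how DUEL builds its reward, so this only illustrates the normalization itself; the function name and the per-token log-probabilities are hypothetical.

```python
def length_normalized_log_likelihood(token_logprobs):
    """Mean per-token log-probability of a generated sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical per-token log-probabilities for a short and a long answer
# that are equally confident at every token.
short_answer = [-0.5, -0.5]
long_answer = [-0.5] * 6

raw_short, raw_long = sum(short_answer), sum(long_answer)
norm_short = length_normalized_log_likelihood(short_answer)
norm_long = length_normalized_log_likelihood(long_answer)
print(raw_short, raw_long, norm_short, norm_long)
```

The raw sums penalize the longer answer purely for its length, while the normalized values are equal, so the signal reflects per-token confidence rather than sequence length.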

2605.24792 2026-05-26 cs.CV cs.AI (version update)

Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering

Ojonugwa Oluwafemi Ejiga Peter, Frederick Akor Ejiga, Fahmi Khalifa, Md Mahmudur Rahman

Affiliations: Computer Science Department, Morgan State University; International Organization for Migration (IOM); Electrical & Computer Engineering Department, Morgan State University

AI summary: Proposes a dual-pipeline parameter-efficient fine-tuning model that combines Florence-2 with LoRA-adapted Stable Diffusion to address clinical visual question answering and privacy-preserving synthetic data generation, respectively; on the Kvasir-VQA dataset it reaches ROUGE-1 of 0.92 and raises BLEU from 0.08 to 0.24 while substantially reducing computational cost.

Details

DOI: 10.1109/BHI67747.2025.11269532

Abstract

The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and significant bottlenecks in conventional model fine-tuning. Such limitations impede the successful application of sophisticated AI models in clinical practice, particularly affecting the reliability and scalability of diagnosis. In this paper, we present a dual-pipeline PEFT model that addresses two fundamental problems: medical Visual Question Answering (VQA) and the generation of privacy-preserving synthetic data. For clinical VQA, we adopt the Florence-2 vision-language model. Leveraging PEFT enhances model interpretability while substantially reducing the computational cost of training. Simultaneously, we employ Low-Rank Adaptation (LoRA) with Stable Diffusion 2.1 to generate high-quality GI images that enhance training databases without violating patient privacy. This research utilized the Kvasir-VQA dataset. Our Florence-2 VQA model achieved ROUGE-1 of 0.92, ROUGE-L of 0.91, and BLEU score improvements from 0.08 to 0.24. Fine-tuning on private datasets consistently showed better results than fine-tuning on public datasets. The rank-4 LoRA synthesis achieved optimal performance with a fidelity score of 0.290, an agreement score of 0.730, and a Frechet BiomedCLIP Distance (FBD) of 1450, reducing computational costs by almost 90 percent. This framework improves the clinical potential of AI in GI endoscopy. Compared to FLUX, MSDM, and Kandinsky 2.2, our model demonstrates superior FBD and strong semantic alignment. While other models lead in Fidelity or Agreement, our lower FBD indicates better image-text coherence. These results establish our approach as a robust solution for enhancing VQA and synthetic data generation in clinical AI.
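
To give a sense of why a rank-4 LoRA update is cheap, the sketch below counts trainable parameters for one weight matrix. LoRA replaces a full update of a `d_out x d_in` matrix with two factors, `B` (`d_out x rank`) and `A` (`rank x d_in`). The 1024-wide layer is a hypothetical example, not a dimension taken from the paper, and this count does not reproduce the roughly 90 percent compute reduction reported in the abstract.

```python
def full_param_count(d_in, d_out):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer size; only the rank (4) comes from the abstract.
d_in = d_out = 1024
rank = 4
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
fraction = lora / full
print(full, lora, fraction)
```

For this layer the low-rank factors hold under 1 percent of the parameters of the full matrix, and the fraction shrinks further as the layer gets wider.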

2605.24789 2026-05-26 cs.CV eess.IV (version update)

Self-Supervised Contrastive Learning for Cardiac MR Sequence Classification

Yuli Wang, Hyewon Jung, Dongshen Peng, Yuwei Dai, Jing Wu, Haoyue Guan, Yoko Kato, Zhicheng Jiao, Yu Sun, Ihab Kamel, Joao Lima, Cheng Ting Lin, Harrison Bai

Affiliations: Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine; Department of Electrical and Computer Engineering, Johns Hopkins University; Department of Computer Science, University of North Carolina at Chapel Hill; Department of Radiology, University of Colorado Denver Anschutz Medical Campus; Department of Radiology, Second Xiangya Hospital, Central South University; Department of Cardiology, Johns Hopkins University School of Medicine; Department of Diagnostic Imaging, Brown University Health

AI summary: To address the poor transfer of pretrained ViTs to the cardiac MR domain, proposes an adaptation strategy based on image-based self-supervised contrastive learning that outperforms supervised training on an in-house dataset, generalizes to external MR datasets, and achieves a classification AUC above 0.75 on the four most common cardiac MR sequences.

Details

Abstract

Vision Transformer (ViT) models, utilizing self-attention mechanisms, have demonstrated robust generalization capabilities across various vision tasks, including image classification. However, these models, typically pretrained on general public datasets, often lack the specialized domain knowledge necessary for medical imaging applications. In this study, we investigate the adaptation of ViT models, specifically for cardiac magnetic resonance (MR) images, using an in-house dataset. We found that pretrained ViT features do not effectively transfer to the cardiac MR domain. To overcome this limitation, we introduce an adaptation strategy that utilizes image-based self-supervised contrastive learning, demonstrating superior performance compared to traditional supervised training approaches. Moreover, our adapted ViT model exhibits strong generalization to external MR datasets such as BraTS and ADNI. Through ablation studies, we further investigate the impact of batch size and dataset scale on performance. Ultimately, our adapted model achieves classification AUC exceeding 0.75 across the four most common cardiac MR sequences.
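
The abstract does not say which contrastive objective is used. As one common choice, the sketch below computes an InfoNCE-style loss for a single anchor; treating it as the paper's loss would be an assumption, and the function names, temperature, and 2-D embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Hypothetical embeddings: two augmented views of one image should be close.
anchor = (1.0, 0.0)
aligned = info_nce(anchor, (0.9, 0.1), [(0.0, 1.0), (-1.0, 0.0)])
misaligned = info_nce(anchor, (0.0, 1.0), [(0.9, 0.1), (-1.0, 0.0)])
print(aligned, misaligned)
```

The loss is near zero when the positive view is the most similar embedding and grows when a negative is closer, which is what pulls views of the same image together without labels.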

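2605.24776 2026-05-26 cs.CV (version update)

Leveraging Pretrained RGB Denoisers for Hyperspectral Image Restoration

Daniele Picone, Mohamad Jouni, Mauro Dalla-Mura

Affiliations: Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-Lab

AI summary: Proposes a lightweight adapter that reuses a frozen pretrained RGB denoiser through projection mappings to perform denoising, deblurring, and super-resolution of hyperspectral images; experiments indicate that RGB priors transfer well.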

2605.24762 2026-05-26 cs.CV (version update)

4KLSDB: A Large-Scale Dataset for 4K Image Restoration and Generation

Zihao Zhu, Kuan-Ru Huang, Zhaoming Xu, Renjie Li, Bo Wu, Ruizheng Bai, Mingyang Wu, Sayak Paul, Zhengzhong Tu

Affiliations: Texas A&M University; Hugging Face

AI summary: To address the lack of native 4K resolution and scale in existing datasets, proposes 4KLSDB, a large-scale dataset of 129,484 4K images whose quality is ensured by multi-stage automated filtering and annotation; experiments show that training super-resolution and diffusion models on it significantly improves performance on 4K benchmarks.

Comments: Accepted to the DataCV Workshop at CVPR 2026; 10 pages, 4 figures, 7 tables; project page: https://4klsdb.github.io/

Details

Abstract

High-resolution datasets are essential for advancing super-resolution (SR) and text-to-image (T2I) diffusion research. However, current publicly available datasets lack both the native 4K resolution and the extensive scale necessary for training state-of-the-art models. To address this gap, we introduce a 4K Large Scale Dataset and Benchmark (4KLSDB), a large-scale, diverse dataset consisting of 129,484 carefully curated 4K resolution images spanning multiple categories such as nature, urban scenes, people, food, artwork, and CGI, alongside distinct validation and test sets containing 2,000 and 1,984 images, respectively. Images were sourced from established open datasets including Photo Concept Bucket, Laion2B, and PD12M. 4KLSDB underwent rigorous multi-stage automated filtering and annotation pipelines involving both human annotators and Large Multimodal Models (LMMs) to ensure high aesthetic quality and dataset consistency. We demonstrate 4KLSDB's effectiveness by training representative super-resolution and diffusion models, observing significant improvements in performance on native 4K benchmarks. Comprehensive experiments illustrate a positive correlation between training on true 4K resolution data and improved fidelity in image restoration tasks, especially at 4K resolution. By releasing 4KLSDB, we provide the research community with a valuable resource to drive progress toward genuinely high-fidelity image synthesis and restoration. Our project page is available at: https://4klsdb.github.io/.

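2605.24761 2026-05-26 cs.CV cs.RO (version update)

From Full Boards to Tiny Defects: Scale-Aware Tile Inference and Topology-Aware Merging for High-Resolution PCB Defect Detection

Mohammad Alijanpour Shalmani, Alale Rezvani Boroujeni, Ali Amini, Jiann Shiun Yuan

Affiliations: Dept. of Electrical and Computer Engineering; Dept. of Marketing; Centre of Real Time Computer Systems, Faculty of Informatics, Kaunas University of Technology

AI summary: To address the loss of tiny defects when high-resolution PCB images are downscaled, proposes scale-consistent training on tile crops for tile-based inference, together with a training-free topology-aware tile merging step, which substantially improves defect detection accuracy.

Details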

Abstract

High-resolution printed circuit board (PCB) inspection suffers from resolution collapse when full-board images are resized to standard detector inputs: micro-scale defects shrink to a few pixels and are missed. Tile-based inference preserves local detail but introduces boundary artefacts at tile edges, causing split detections and false negatives. We present a systematic comparison of five inference strategies evaluated on two high-resolution PCB defect datasets, PCB-Defect (230 images, 1,704 annotations) and HRIPCB (693 images, 2,953 annotations), spanning six defect classes. We show that training-inference scale consistency is critical: a detector trained on full images collapses to mAP@50 = 0.01 under tile inference, while the same architecture trained on 640×640 tile crops achieves 0.72 and 0.94 on the two datasets, respectively. We further exploit Topology-Aware Tile Merging (TA-TM), a training-free post-processing method that builds a tile-adjacency graph and adjusts boundary-sensitive detection scores using neighbour-tile agreement before global NMS. Across both datasets, adding 128 px tile overlap raises boundary-zone recall from ~26-63% to ~70-100%, TA-TM achieves the best mAP@50 on both benchmarks, and tile inference recovers 46-100% of small defects missed entirely by full-image methods. Results are consistent across datasets, confirming the generalizability of the proposed strategy. TA-TM requires no retraining and is architecture-agnostic, making it directly applicable to existing PCB inspection pipelines.
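
The tiling geometry in the abstract (640×640 tiles with 128 px overlap) can be sketched as follows. This covers only tile placement and the mapping of tile-local boxes back to full-image coordinates; it does not implement TA-TM's adjacency graph or score adjustment. The function names and the 2000×1500 image size are hypothetical.

```python
def tile_origins(length, tile=640, overlap=128):
    """Start offsets of tiles along one axis so that tiles cover [0, length)."""
    if not 0 <= overlap < tile:
        raise ValueError("need 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile sits flush with the border
    return origins

def tile_boxes(width, height, tile=640, overlap=128):
    """All tiles as (x1, y1, x2, y2), row by row."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]

def to_global(box, tile_box):
    """Shift a tile-local detection box into full-image coordinates."""
    offset_x, offset_y = tile_box[0], tile_box[1]
    x1, y1, x2, y2 = box
    return (x1 + offset_x, y1 + offset_y, x2 + offset_x, y2 + offset_y)

# Hypothetical 2000x1500 board image.
tiles = tile_boxes(2000, 1500)
shifted = to_global((10, 20, 30, 40), tiles[1])
print(len(tiles), tiles[1], shifted)
```

Once every detection is in global coordinates, a defect that straddles a tile border appears in two overlapping tiles, and a merging step such as global NMS can reconcile the duplicates.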

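2605.24722 2026-05-26 cs.CV (version update)

Calibrating Probabilistic Object Detectors with Annotator Disagreement

Zhi Qin Tan, Owen Addison, Yunpeng Li

Affiliations: Faculty of Dentistry, Oral & Craniofacial Sciences, King's College London, London, United Kingdom

AI summary: To address annotator disagreement caused by object ambiguity in object detection, proposes a method for calibrating probabilistic object detectors without ground-truth annotations, designing classification and localization calibration-error metrics together with train-time and post-hoc calibrators so that the model's predictive uncertainty matches the annotation distribution.

Details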

Abstract

High degrees of disagreement among annotators can exist for ambiguous objects, e.g. in medical images, underscoring the challenges of establishing ground truth annotations in object detection tasks. Despite this, all existing object detectors implicitly require access to ground truth annotations for either training or evaluation. The fundamental questions we target are: How can we learn an object detector with multiple annotators' annotations but without objective ground truth annotations due to object ambiguity, and how can we enable the learned detector to express meaningful model predictive uncertainties in detecting ambiguous objects? To answer these questions, we present an interpretable approach to calibrate probabilistic object detectors, where the calibration goal is to align the class confidence and bounding box variance estimates to the annotators' annotation distribution. We introduce an efficient yet effective framework to calibrate probabilistic object detectors by designing four evaluation metrics to measure calibration errors regarding classification and localization, and proposing a train-time calibration and post-hoc calibrator, all without the need to access any ground truth. This framework is generalizable to many existing probabilistic object detectors, such as the YOLO families and two-stage detectors. Empirical results with real-world and synthetic datasets of medical and natural images demonstrate the superior performance of the proposed framework with three popular object detectors.

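2605.24702 2026-05-26 cs.CV (version update)

Do Image-Text Metrics Respect Semantic Invariances?

Amit Agarwal, Hitesh Laxmichand Patel, Meizhu Liu, Jyotika Singh, Karan Dua, Hansa Meghwani, Matthew Rowe, Michael Avendi, Yassi Abbasi, Tao Sheng, Sujith Ravi, Dan Roth

Affiliations: Oracle AI

AI summary: Systematically evaluates the semantic invariance of five popular image-text evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) using semantics-preserving perturbations along three axes (spatial, object, and socio-linguistic framing), finds that they are sensitive to non-semantic changes, and proposes invariance-calibrated scoring as a post-hoc adjustment.

Details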

Abstract

Reference-free image-to-text evaluators are now standard for scoring image-caption alignment, yet it is unclear whether they respect semantic invariances. We present an invariance probe on five popular evaluators (CLIPScore, PAC-S, UMIC, FLEUR, and a deterministic LLM judge) under semantics-preserving perturbations along three axes: spatial (flips, context-preserving repositioning, light rotations), object (scale, category), and socio-linguistic framing (cultural/economic adjectives with neutral and length-matched controls). Across curated slices of three detection datasets and three caption evaluation suites, we find consistent non-semantic sensitivities, where benign spatial edits and simple phrasing changes shift scores by approximately 6-9% on average, and for systems separated by just 0.7%, these shifts can cause ranking flips in up to about 37% of cases, particularly under spatial changes. A small human study also supports this finding and confirms that annotators generally judge perturbed pairs as equally correct, so these shifts reflect metric behavior rather than semantic change. We further propose invariance-calibrated scoring, a post-hoc adjustment that roughly halves median absolute sensitivity while retaining correlation with learned caption evaluators.
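
A short worked example shows why a 6% score shift matters when two systems differ by only 0.7%. The scores below are hypothetical and chosen only to match the magnitudes quoted in the abstract; the helper names are assumptions, not the paper's metric definitions.

```python
def relative_shift(original, perturbed):
    """Absolute score change as a fraction of the original score."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return abs(perturbed - original) / abs(original)

def ranking_flips(score_a, score_b, perturbed_a, perturbed_b):
    """True if the perturbation reverses which system scores higher."""
    return (score_a - score_b) * (perturbed_a - perturbed_b) < 0

# Hypothetical scores: system A leads system B by 0.7%.
score_b = 0.70
score_a = score_b * 1.007
# A benign edit (say, a horizontal flip) lowers A's score by 6%; B is unchanged.
perturbed_a = score_a * 0.94
perturbed_b = score_b

shift = relative_shift(score_a, perturbed_a)
flipped = ranking_flips(score_a, score_b, perturbed_a, perturbed_b)
print(shift, flipped)
```

Any non-semantic shift larger than the gap between two systems can reverse their order, so a leaderboard margin well below the metric's sensitivity is not reliable evidence of a real difference.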

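2605.24691 2026-05-26 cs.CV (version update)

VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation

Bo Li, Ronghao Chen, Ningyuan Deng, Huacan Wang, Shaolin Zhu, Lijie Wen

Affiliations: The Hong Kong University of Science and Technology; Tianjin University; Tsinghua University

AI summary: To address the visual representation gap in Web image translation, proposes the VaaWIT framework, which uses a Dual-Stream Attention Module and a Visual-Aware Adapter to dynamically fuse fine-grained visual features into a large language model, outperforming open-source models on multiple benchmarks and approaching the performance of closed-source models.

Comments: Accepted by KDD 2026

Details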

DOI: 10.1145/3770855.3817631

Abstract

Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularly within social media and e-commerce domains. Although Large Vision-Language Models (LVLMs) have advanced multimodal understanding, applying them to Web image translation remains challenging due to the visual representation gap: standard encoders often prioritize high-level semantics over the fine-grained visual details required for recognizing diverse character morphologies. To address this challenge, we propose VaaWIT, an end-to-end framework that adapts Large Language Models for multilingual Web image translation. The framework introduces two key technical contributions: (1) a Dual-Stream Attention Module (DSAM), which facilitates bidirectional interaction between multilingual semantic features and detailed visual representations, thereby synthesizing unified features robust to textual variations; and (2) a Visual-Aware Adapter (VAA), a parameter-efficient fine-tuning strategy that dynamically injects these fused visual cues into the frozen LLM backbone. This design enables the model to align the visual context with linguistic reasoning effectively while minimizing computational costs. Extensive experiments on eight tasks on three public benchmarks demonstrate that VaaWIT significantly outperforms state-of-the-art (SOTA) open-source baselines and achieves competitive performance against proprietary models. These results validate the efficacy of integrating fine-grained visual perception into LLMs for complex Web content analysis.

2605.24674 2026-05-26 cs.CV 版本更新

Reasoning to Align: Implicit Reasoning in Diffusion Transformers for Video Editing

推理对齐：扩散Transformer在视频编辑中的隐式推理

Yan Li, Lin Liu, Xiaopeng Zhang, Qi Tian

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Huawei Inc.（华为公司）

AI总结针对指令式视频编辑中条件信号未分化及交叉注意力监督不足的问题，提出RVEDiT框架，通过粒度路由令牌条件和参考锚定注意力对齐实现粗到细编辑与内部推理正则化。

详情

AI中文摘要

基于指令的视频编辑需要根据自然语言指令转换源视频，同时保留无关内容并保持时间连贯性。我们认为现有的扩散Transformer（DiT）编辑器由于两个结构原因难以完成此任务。首先，条件信号未分化地输入所有Transformer块，迫使单个令牌流同时编码全局编辑意图和细粒度视觉证据。其次，控制编辑的交叉注意力模式仅通过像素级重建间接监督，使得模型内部推理过程约束不足。为了解决这两个限制，我们提出了RVEDiT，一个隐式推理视频编辑DiT框架，围绕两个互补组件构建。第一个组件，粒度路由令牌条件，引入从多模态大语言模型蒸馏的可学习编辑令牌，并将其路由到浅层块，同时将原生视觉和文本令牌保留给深层块，从而在骨干网络内部诱导出从粗到细的编辑过程。第二个组件，参考锚定注意力对齐，在训练期间采用参数共享的参考分支，并最大化编辑分支和参考分支注意力特征之间的互信息，正则化模型的内部推理而不产生任何额外的推理成本。在标准基于指令的视频编辑基准上的实验表明，RVEDiT始终优于最先进的基线，特别是在局部和组合编辑方面取得了显著提升。

英文摘要

Instruction-based video editing requires transforming a source video according to a natural-language instruction while preserving irrelevant content and remaining temporally coherent. We argue that existing Diffusion Transformer (DiT) editors struggle with this task for two structural reasons. First, conditioning signals are fed undifferentiated into all transformer blocks, forcing a single token stream to encode both global editing intent and fine-grained visual evidence. Second, the cross-attention patterns that govern the edit are supervised only indirectly through pixel-level reconstruction, leaving the model's internal reasoning process under-constrained. To address both limitations, we propose RVEDiT, an implicit Reasoning Video Editing DiT framework built around two complementary components. The first, Granularity-Routed Token Conditioning, introduces learnable editing tokens distilled from a multimodal LLM and routes them to shallow blocks, while reserving native visual and textual tokens for deeper blocks, thereby inducing a coarse-to-fine editing process inside the backbone. The second, Reference-Anchored Attention Alignment, employs a parameter-sharing reference branch during training and maximizes the mutual information between the attention features of the editing and reference branches, regularizing the model's internal reasoning without incurring any additional inference cost. Experiments on standard instruction-based video editing benchmarks show that RVEDiT consistently outperforms state-of-the-art baselines, with particularly strong gains on localized and compositional edits.

2605.24652 2026-05-26 cs.AI cs.CV cs.MM cs.SD 版本更新

AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

AVBench：面向音视频生成模型的人类对齐与自动化评估基准

Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang

发表机构 * Tsinghua University（清华大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结提出AVBench，通过细粒度人类中心指标和偏好学习训练的专业评估器，实现音视频生成的自动化、准确评估。

详情

AI中文摘要

音视频（AV）生成的快速进步使得能够生成具有同步声音的高保真合成内容，特别是涉及语音和交互的人类相关场景。然而，AV生成的评估仍处于早期阶段，只有少数针对人类相关场景的粗粒度基准，并且依赖于有限的预设评估和通用多模态大语言模型，导致对模型能力的不准确评估。为了解决这些问题，我们引入了AVBench，一个专为人类中心AV生成设计的全自动化基准。AVBench基于两个关键设计以实现全面准确的评估：（i）人类中心和细粒度指标。AVBench整合了十个评估维度，专为以人为中心的现实场景设计，涵盖视觉质量、音频质量以及跨模态的多层次一致性。这些实用指标捕捉了现有基准经常忽略的人类相关细节。（ii）通过偏好学习训练的专业评估器。为了解决缺乏专门训练数据的问题，我们通过将真实视频转化为具有受控扰动的多样化训练对来构建大规模监督。在该高质量数据集上微调后，评估器学会可靠地检测细微的跨模态不一致性。关键的是，AVBench不输出离散的文本判断，而是从模型对二元决策的预测置信度中推导出连续评估分数。这种概率评分机制比传统的VQA风格评估更可靠，并且与人类判断高度一致。综合来看，AVBench为AV生成提供了自动化评估，展示了数据过滤的强大潜力，并可作为来自人类反馈的强化学习（RLHF）的可微分奖励信号。

英文摘要

Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
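
The probabilistic scoring idea can be illustrated with a minimal sketch. It assumes the evaluator exposes one logit for each of the two answers to a binary question; the function name `binary_confidence_score` and the logit values are hypothetical, and the abstract does not specify how AVBench computes its scores in detail.

```python
import math


def binary_confidence_score(logit_yes: float, logit_no: float) -> float:
    """Turn two answer logits into a continuous score in [0, 1].

    This is a two-way softmax: the probability mass the evaluator puts on
    "yes" relative to "no". Subtracting the max keeps exp() numerically stable.
    """
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


# Hypothetical logits: an undecided evaluator and a confident one.
score_undecided = binary_confidence_score(0.0, 0.0)
score_confident = binary_confidence_score(4.0, -1.0)
```

Unlike a hard "yes"/"no" string, the score changes smoothly with the logits, so two samples that receive the same textual verdict can still be ranked against each other.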

2605.24642 2026-05-26 cs.CV cs.RO 版本更新

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

理解几何基础模型对视觉-语言-动作模型的影响

Yurou Yang, Muyuan Lin, Roberto Martin-Martin, Martin Labrie, Shreekant Gayaka, Cheng-Hao Kuo, Luca Carlone

发表机构 * Amazon Personal Robotics Group（亚马逊个人机器人小组）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； Massachusetts Institute of Technology（麻省理工学院）

AI总结本文通过线性探测分析量化了视觉-语言-动作模型（VLA）与几何基础模型（GFM）之间的“几何差距”，比较了三种注入几何信息的架构，并研究了非架构因素对几何VLA性能的影响。

详情

AI中文摘要

近期工作探索了视觉-语言-动作模型（VLA）与用于3D重建的几何基础模型（GFM）（如VGGT）交叉领域的新机遇。虽然由此产生的几何VLA通常表现出改进的性能，但仍不清楚：(i) 现代VLA是否已经具备足够的几何理解能力，(ii) 将几何理解注入VLA的最佳架构是什么，以及(iii) 其他影响几何VLA的设计选择的效果。在本文中，我们针对特定的VLA（GR00T-N1.5）和GFM（VGGT）进行了严格的实验分析，以阐明这些问题。我们的第一个贡献是通过基于线性探测的严格分析，形式化了先前工作中关于当前VLA缺乏几何理解的直觉。该分析首次量化了VLA与GFM之间的“几何差距”。我们的第二个贡献是识别并比较了将GFM与VLA桥接的不同策略。我们实现了三种不同的架构，它们在将几何信息注入VLA的方式上有所不同，同时尽可能保持低级实现细节相似，以确保公平比较。最后，我们分析了非架构选择（例如，训练数据、相机数量、重建质量）对几何VLA性能的影响。

英文摘要

Recent work explores new opportunities at the intersection of vision-language-action models (VLAs) and geometric foundation models (GFMs) for 3D reconstruction, such as VGGT. While the resulting geometric VLAs often show improved performance, it remains unclear (i) if modern VLAs already have sufficient geometric understanding to start with, (ii) what is the best architecture to inject geometric understanding into a VLA, and (iii) what is the effect of other design choices that affect geometric VLAs. In this paper we provide a rigorous experimental analysis to shed light on these questions, for a specific choice of VLA (GR00T-N1.5) and GFM (VGGT). Our first contribution is to formalize prior work's intuition that current VLAs lack geometric understanding, by providing a rigorous analysis based on linear probing. The analysis quantifies, for the first time, the "geometric gap" between VLAs and GFMs. Our second contribution is to identify and compare different strategies to bridge GFMs with VLAs. We implement three different architectures, which differ in the way they inject geometry in the VLA, while keeping low-level implementation details as similar as possible, to ensure a fair comparison. Finally, we analyze the impact of non-architectural choices (e.g., training data, number of cameras, reconstruction quality) on the performance of the geometric VLAs.

2605.24639 2026-05-26 cs.CV cs.AI 版本更新

相位感知的基于小波散射的编解码器用于密集预测

Ghassen Marrakchi, Basarab Matei

发表机构 * Northern Paris Computer Science Lab, Sorbonne Paris Nord University, Villetaneuse, France（北巴黎计算机科学实验室，索邦巴黎北大学，法国维勒塔讷斯）

AI总结提出一种相位感知散射编解码器，通过在跳跃连接中显式保留相位信息来恢复空间结构，在图像去噪和皮肤病变分割任务中验证了相位对密集预测的有效性。

Comments 21 pages, 16 figures, 10 tables

2605.24608 2026-05-26 cs.AI cs.CV cs.LG 版本更新

Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

基于数学形态学的深度卷积学习的格论与代数模型

Gustavo, Angulo

发表机构 * Mines Paris, PSL University, CMA-Center for Applied Mathematics, Sophia-Antipolis, France（巴黎 Mines 学院，PSL 大学，应用数学中心，法国索菲亚-安蒂波利斯）

AI总结本文基于格论和数学形态学，为深度卷积架构（CNN、ResNet、UNet）建立了严格的代数框架，揭示了标准CNN流水线是交叉格算子，并识别出三种真正的幂等开运算层设计。

详情

AI中文摘要

我们为深度卷积架构（包括CNN、ResNet和如UNet的编码器-解码器网络）建立了一个严格的代数框架，该框架基于格论和数学形态学。核心工具是Matheron-Maragos-Banon-Barrera (MMBB) 平移不变算子通用表示理论，我们将其系统地应用于标准深度网络的每一层。主要发现是：标准CNN流水线（线性卷积 + ReLU + 平坦最大池化）是一个交叉格算子：卷积是傅里叶下半格中的腐蚀，ReLU是格并闭运算，最大池化是逐点最大加格中的膨胀，它们的组合在这两个格中都不是形态学开运算。第二个发现是：ReLU在逐点格中的上伴随是一个全局（非局部）算子，在全局非负函数上为恒等映射，否则为负无穷，因此没有局部形态学腐蚀能与ReLU构成伴随对。这两个结果共同提供了深度在标准CNN中引入真正表示能力的精确代数原因：组合层不是幂等的。我们识别并完全刻画了三种真正的幂等开运算层设计：纯最大加形态学层（逐点格）、谱维纳层（傅里叶格）和自对偶形态学层。我们建立了完整的不动点和收敛理论。该框架还将最大池化、步长卷积和拉普拉斯金字塔统一在Goutsias-Heijmans伴随金字塔理论下，并给出了激活-池化膨胀（APD）分解及其正确的伴随算子。

英文摘要

We develop a rigorous algebraic framework for deep convolutional architectures (CNNs, ResNets, and encoder-decoder networks such as UNet), grounded in lattice theory and mathematical morphology. The central tool is the Matheron-Maragos-Banon-Barrera (MMBB) universal representation theory for translation-invariant operators, which we apply systematically to every layer of a standard deep network. The principal finding is that the standard CNN pipeline (linear convolution + ReLU + flat max-pooling) is a cross-lattice operator: the convolution is an erosion in the Fourier inf-semilattice while ReLU is a lattice-join closing and max-pooling is a dilation in the pointwise max-plus lattice, and their composition is a morphological opening in neither lattice. A second finding is that the upper adjoint of ReLU in the pointwise lattice is a global (non-local) operator, the identity on globally non-negative functions and $-\infty$ otherwise, so no local morphological erosion can form an adjunction pair with ReLU. These two results together provide the precise algebraic reason why depth in standard CNNs introduces genuine representational power: the composed layer is not idempotent. Three layer designs that are genuine idempotent openings are identified and fully characterised: the pure max-plus morphological layer (pointwise lattice), the spectral Wiener layer (Fourier lattice), and the self-dual morphological layer. We establish a complete fixed-point and convergence theory. The framework also unifies max-pooling, strided convolution, and the Laplacian pyramid under the Goutsias-Heijmans adjoint pyramid theory, and gives the Activation-Pooling Dilation (APD) factorisation with its correct adjoint.
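
The idempotence contrast can be checked numerically on a toy 1D signal. The sketch below is an illustration under stated assumptions, not the paper's construction: the flat structuring element of radius 1, the smoothing kernel, the stride-1 pooling, and the test signal are all hypothetical choices. A flat opening (erosion followed by its adjoint dilation) returns the same result when applied twice, whereas a convolution + ReLU + max-pooling layer keeps changing the signal.

```python
def erode(f, radius=1):
    """Flat erosion: sliding minimum over a symmetric window, clipped at the borders."""
    n = len(f)
    return [min(f[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def dilate(f, radius=1):
    """Flat dilation: sliding maximum over the same symmetric window."""
    n = len(f)
    return [max(f[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]


def opening(f, radius=1):
    """Morphological opening: erosion followed by the adjoint dilation."""
    return dilate(erode(f, radius), radius)


def cnn_layer(f, kernel=(0.25, 0.5, 0.25)):
    """Toy CNN layer: zero-padded linear convolution, ReLU, then stride-1 max-pooling."""
    n = len(f)
    padded = [0.0] + list(f) + [0.0]
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(n)]
    relu = [max(0.0, x) for x in conv]
    return dilate(relu, radius=1)  # flat max-pooling is a dilation


signal = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 0.0]
opening_is_idempotent = opening(opening(signal)) == opening(signal)
cnn_layer_is_idempotent = cnn_layer(cnn_layer(signal)) == cnn_layer(signal)
```

On this signal the opening is a fixed point after one application, while the toy CNN layer is not, which is consistent with the abstract's claim that stacking such layers adds representational power.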

2605.24604 2026-05-26 cs.CV 版本更新

LC-Flow: Learning Local Continuous Optical Flow and Confidence from events

LC-Flow: 从事件中学习局部连续光流与置信度

Gunwoo Jeon, Chaesong Park, Jongwoo Lim

发表机构 * IPAI, Seoul National University（IPAI，首尔国立大学）

AI总结提出LC-Flow，首个基于学习的、从局部事件中估计时间连续光流的方法，通过连续局部循环网络和联合学习的置信度，解决事件稀疏性和孔径问题，在MVSEC和DSEC上达到局部方法最优，且置信度引导的聚合在MVSEC上超越基于帧的方法。

详情

AI中文摘要

事件相机以微秒分辨率异步捕捉亮度变化，但现有光流方法未能充分利用这种时间连续性。基于帧的方法引入人工累积延迟并遭受领域过拟合，而基于模型的局部方法无状态运行，丢弃预测间的时间历史，产生不准确的光流。我们提出LC-Flow，首个时间连续的、基于学习的光流估计器，完全从局部事件操作。其核心是一个连续局部循环网络，为每个空间网格维护持久隐藏状态，随着事件到达逐步累积时间上下文。与受限于固定累积窗口的基于帧的方法不同，也与每一步从头重新计算运动的无状态基于模型的方法不同，LC-Flow在任意时间戳上生成具有完整运动历史的稀疏局部光流估计。为了解决局部观测固有的歧义性，我们联合学习一个置信度分数，量化每个预测的可靠性，明确处理事件稀疏性和孔径问题。该置信度具有双重作用：为下游任务（如视觉里程计）过滤不可靠估计，并为多尺度置信度引导的聚合提供有原则的权重，从稀疏局部输出重建全局一致的光流。LC-Flow在MVSEC和DSEC上均达到局部方法的最优性能，而置信度引导的聚合在MVSEC基准上建立了新的总体最优，超越了依赖全局空间先验的重型基于帧的网络。

英文摘要

Event cameras capture brightness changes asynchronously with microsecond resolution, yet existing optical flow methods fail to fully exploit this temporal continuity. Frame-based approaches impose artificial accumulation latency and suffer from domain overfitting, while model-based local methods operate statelessly, discarding temporal history between predictions and yielding inaccurate flows. We propose LC-Flow, the first temporally continuous, learning-based optical flow estimator that operates purely from local events. At its core, a Continuous Local Recurrent Network maintains persistent hidden states per spatial grid, incrementally accumulating temporal context as events arrive. Unlike frame-based methods constrained to fixed accumulation windows, and unlike stateless model-based methods that recompute motion from scratch at each step, LC-Flow produces sparse local flow estimates at arbitrary timestamps with full motion history. To address the inherent ambiguity of local observations, we jointly learn a confidence score that quantifies the reliability of each prediction, explicitly handling event sparsity and the aperture problem. This confidence serves a dual role: filtering unreliable estimates for downstream tasks such as visual odometry, and providing principled weights for a multi-scale confidence-guided aggregation that reconstructs globally consistent flow from the sparse local outputs. LC-Flow achieves state-of-the-art performance among local methods on both MVSEC and DSEC, while the confidence-guided aggregation establishes a new overall state-of-the-art on the MVSEC benchmark, surpassing heavy frame-based networks that rely on global spatial priors.
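
The dual role of the confidence score (filtering and weighting) can be sketched in a few lines. This is a simplified, single-scale illustration: the function `aggregate_flow`, the threshold `min_conf`, and the sample values are hypothetical, and the abstract does not describe the actual multi-scale aggregation used by LC-Flow.

```python
def aggregate_flow(estimates, min_conf=0.2):
    """Combine local flow estimates into one vector using their confidences.

    estimates: list of ((u, v), confidence) pairs from neighbouring local cells.
    Estimates below min_conf are dropped (filtering role); the remaining ones
    are averaged with confidence weights (weighting role).
    Returns None when no estimate is reliable enough.
    """
    kept = [(uv, c) for uv, c in estimates if c >= min_conf]
    total = sum(c for _, c in kept)
    if total == 0:
        return None
    u = sum(uv[0] * c for uv, c in kept) / total
    v = sum(uv[1] * c for uv, c in kept) / total
    return (u, v)


# Hypothetical local estimates; the last one is an unreliable outlier.
local_estimates = [((1.0, 0.0), 0.75), ((3.0, 0.0), 0.25), ((100.0, 100.0), 0.05)]
fused = aggregate_flow(local_estimates)
```

The low-confidence outlier is discarded before averaging, so it cannot drag the fused vector away from the two reliable estimates.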

2605.24593 2026-05-26 cs.CV 版本更新

PEDESTRIANQA: 面向行人意图与轨迹预测的视觉-语言模型基准

Naman Mishra, Shankar Gangisetty, C. V. Jawahar

发表机构 * CVIT, IIIT-Hyderabad, India（IIIT-海得拉巴计算机视觉与智能技术研究所，印度）

AI总结提出大规模视频数据集PedestrianQA，将行人意图和轨迹预测转化为带结构化理由的问答任务，通过微调视觉-语言模型显著提升预测准确性与可解释性。

详情

AI中文摘要

行人意图和轨迹预测对于自动驾驶系统的安全部署至关重要，直接影响复杂交通环境中的导航决策。近期大型视觉-语言模型的进展通过结合高容量视觉理解与灵活的自然语言推理，为这些任务提供了强大的新范式。本文中，我们引入PedestrianQA，这是一个大规模视频数据集，将行人意图和轨迹预测公式化为带有结构化理由的问答任务。PedestrianQA以自然语言表达丰富标注的行人序列，使视觉-语言模型能够从视觉动态、上下文线索和交通智能体间的交互中学习，同时生成其预测的简洁解释，无需为每个任务定制专门的架构。在PIE、JAAD、TITAN和IDD-PeD上的实证评估表明，在PedestrianQA上微调最先进的视觉-语言模型显著提高了意图分类、轨迹预测准确性以及解释性理由的质量，展示了视觉-语言模型作为安全关键行人行为建模的统一且可解释框架的强大潜力。

英文摘要

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that finetuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

2605.24553 2026-05-26 cs.CV 版本更新

IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring

IQA-Spider：统一多粒度图像质量评估与推理、定位和指代

Xinge Peng, Yiting Lu, Xin Li, Zhibo Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出IQA-Spider框架，通过统一推理、定位和指代任务，实现多粒度图像质量评估，并采用两阶段设计解决现有方法仅支持部分感知维度的问题。

Comments Accepted by ICML 2026

详情

AI中文摘要

我们提出IQA-Spider，这是第一个将推理、定位和指代统一到单个基于LMM的框架中的图像质量评估（IQA）框架，用于多粒度质量理解。现有的基于LMM的IQA方法通常仅支持部分感知维度，例如质量描述和问答（即推理）或像素级定位。这一局限性主要源于缺乏（i）统一的任务和数据形式化，以及（ii）有效的多粒度学习优化范式。为解决这些局限性，我们形式化了一个严格的四任务范式，涵盖全局和局部质量描述、像素级定位以及区域级指代。基于这一形式化，我们通过可扩展的自动标注流水线构建了相应的IQA数据集，从而为统一的多粒度学习提供了坚实基础。为进一步实现统一感知，我们采用无冲突的两阶段设计，逐步将文本级多粒度理解扩展到像素级定位：（i）第一阶段使模型具备跨多个IQA任务的细粒度文本级推理能力；（ii）第二阶段引入无需训练的文本到点定位范式，通过将token logits映射到空间坐标来桥接文本语义和像素级感知。基于这些努力，我们实现了具有统一多粒度可解释图像质量评估的IQA-Spider。在多个基准上的大量实验展示了强大的性能，验证了所提出形式化和框架的有效性与通用性。

英文摘要

We present IQA-Spider, the first image quality assessment (IQA) framework that unifies reasoning, grounding, and referring into a single LMM-based framework for multi-granularity quality understanding. Existing LMM-based IQA methods typically support only partial perception dimensions, such as quality description and question answering (i.e., reasoning) or pixel-level grounding. This limitation largely stems from the absence of (i) a unified task and data formulation and (ii) effective optimization paradigms for multi-granularity learning. To address these limitations, we formulate a rigorous four-task paradigm covering global and local quality description, pixel-level grounding, and region-level referring. Based on this formulation, we construct a corresponding IQA dataset with a scalable and automatic annotation pipeline, thereby providing a solid foundation for unified multi-granularity learning. To further enable unified perception, we adopt a conflict-free two-stage design that progressively extends text-level multi-granularity understanding to pixel-level grounding: (i) the first stage equips the model with fine-grained text-level reasoning across multiple IQA tasks, and (ii) the second stage introduces a training-free text-to-point grounding paradigm, which bridges textual semantics and pixel-level perception by mapping token logits to spatial coordinates. Based on these efforts, we achieve IQA-Spider with unified multi-granularity explainable image quality assessment. Extensive experiments across multiple benchmarks demonstrate strong performance, validating the effectiveness and versatility of the proposed formulation and framework.

2605.24533 2026-05-26 cs.CV 版本更新

Learnable Shape Prototypes with Occlusion-Geometry-Guided Injection for Amodal Instance Segmentation

基于可学习形状原型与遮挡几何引导注入的非模态实例分割

Fufan Zhang, Jingxiang Wang, Xiangjie Ye

发表机构 * School of Mechanical Engineering and Automation, Northeastern University（机械工程与自动化学院，东北大学）； School of Information Science and Engineering, Northeastern University（信息科学与工程学院，东北大学）

AI总结提出一种门控可靠性自适应形状先验框架，通过可学习原型和交叉注意力生成实例自适应形状先验，并利用可见掩码的符号距离场调节注入强度，在多个评估设置下超越现有方法。

Comments 13 pages, 7 figures, 5 tables. Submitted to IEEE Transactions on Circuits and Systems for Video Technology

详情

AI中文摘要

非模态实例分割旨在预测完整的物体掩码，包括被遮挡区域，这些区域缺乏像素级观测，必须借助形状先验进行推断。现有方法通过固定容量编码空间或昂贵的生成模型获取形状先验，并在所有空间位置均匀注入，而不适应可见区域和遮挡区域之间不同的先验需求。本文提出一种门控可靠性自适应形状先验框架，该框架引入一个形状先验记忆模块，通过交叉注意力组合可学习原型，以加权原型组合（而非生成）的方式产生实例自适应形状先验。然后，一个空间自适应可靠性门利用可见掩码的符号距离场，根据每个位置的遮挡深度调节注入强度，在可见区域保留可靠特征，同时将形状补偿引导至遮挡区域。在两个主流非模态实例分割基准上的实验表明，所提方法在多个评估设置下优于现有方法，在标准设置下，其中一个基准上的遮挡区域平均交并比提高了超过11个百分点，同时仅使用约三分之一的总参数量。线性探针分析进一步揭示，可见掩码交叉注意力模块隐式地将遮挡几何编码到视觉标记表示中，解释了所提模块分解的有效性。

英文摘要

Amodal instance segmentation aims to predict the complete object mask including occluded regions that lack pixel-level observations and must be inferred with the aid of shape priors. Existing methods acquire shape priors through fixed-capacity encoding spaces or expensive generative models, and inject them uniformly across all spatial positions without adapting to the varying prior demand between visible and occluded regions. In this paper, we propose a gated reliability-adaptive shape prior framework, which introduces a shape prior memory module that combines learnable prototypes via cross-attention to produce instance-adaptive shape priors through weighted prototype combination rather than generation. A spatial adaptive reliability gate then employs the signed distance field of the visible mask to modulate injection intensity at each position according to its occlusion depth, preserving reliable features in visible regions while directing shape compensation toward occluded areas. Experiments on two mainstream amodal instance segmentation benchmarks demonstrate that the proposed method outperforms existing approaches under multiple evaluation settings, improving the mean intersection-over-union over occluded regions by over 11 percentage points on one of the two benchmarks under the standard setting, while using approximately one-third of the total parameters. Linear probing analysis further reveals that the visible-mask cross-attention module implicitly encodes occlusion geometry into visual token representations, explaining the effectiveness of the proposed module decomposition.
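
A 1D toy version of the reliability gate may help make the mechanism concrete. The sketch assumes a sigmoid of the signed distance as the gate and a convex blend as the injection rule; both choices, along with the names `signed_distance_1d`, `reliability_gate`, `inject`, and the parameter `sharpness`, are hypothetical and are not taken from the paper.

```python
import math


def signed_distance_1d(mask):
    """Signed distance (in cells) to the visible-mask boundary.

    Negative inside the visible mask, positive outside it, so larger values
    mean a position lies deeper in the potentially occluded region.
    """
    n = len(mask)
    out = []
    for i, m in enumerate(mask):
        d = min((abs(i - j) for j in range(n) if mask[j] != m), default=n)
        out.append(-d if m else d)
    return out


def reliability_gate(sdf, sharpness=1.0):
    """Map signed distance to an injection strength in (0, 1) with a sigmoid."""
    return [1.0 / (1.0 + math.exp(-sharpness * d)) for d in sdf]


def inject(features, prior, gate):
    """Blend: keep observed features where the gate is low, use the prior where it is high."""
    return [(1.0 - g) * f + g * p for f, p, g in zip(features, prior, gate)]


visible_mask = [1, 1, 1, 0, 0]          # last two cells are not visible
sdf = signed_distance_1d(visible_mask)
gate = reliability_gate(sdf)
blended = inject([1.0] * 5, [0.0] * 5, gate)
```

Deep inside the visible region the gate stays close to 0 and the observed features pass through almost unchanged; the further a position lies from the visible mask, the more the shape prior dominates.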

2605.24532 2026-05-26 cs.CV 版本更新

Image-Conditioned Instance Prompt Network for Referring Remote Sensing Image Segmentation

图像条件实例提示网络用于遥感图像指代分割

Biaoyu Ren, Qingsheng Wang, Cun Xu, Dingkang Yang, Wenxuan Wang

发表机构 * School of Computer Science, Northwestern Polytechnical University, Xi'an, China（西北工业大学计算机科学学院，西安，中国）； College of Intelligent Robotics and Advanced Manufacturing, Fudan University, Shanghai, China（复旦大学智能机器人与先进制造学院，上海，中国）； Shenzhen Research Institute of Northwestern Polytechnical University, Shenzhen, China（西北工业大学深圳研究院，深圳，中国）

AI总结提出图像条件实例提示网络（ICIPNet），通过自适应视觉语义表示和双边信息融合模块，缓解跨模态特征融合瓶颈，提升遥感图像指代分割性能。

Comments 6 pages, 3 figures. Equal contribution: Biaoyu Ren and Qingsheng Wang. Corresponding authors: Dingkang Yang and Wenxuan Wang

详情

AI中文摘要

遥感图像指代分割（RRSIS）是一项与具身感知范式相关的情境化、任务驱动的跨模态任务，要求模型将视觉空间特征与语言意图对齐以实现精确的目标感知。近期研究聚焦于细化文本特征的粒度并优化图像-文本特征融合，以更好地引导目标特征表示。然而，描述粒度不足和对语义偏移的敏感性可能导致跨模态特征融合的瓶颈。为解决这些问题，我们提出带有双边信息融合的图像条件实例提示网络（ICIPNet），旨在缓解跨模态特征融合的瓶颈。ICIPNet引入图像条件实例提示（ICIP）模块，无需外部知识即可生成自适应的视觉和语义表示。双边信息融合（BIF）模块沿token和通道维度增强特征融合。实验表明，所提出的ICIPNet优于现有RRSIS模型。

英文摘要

Referring Remote Sensing Image Segmentation (RRSIS) is a situated, task-driven cross-modal task related to the embodied perception paradigm, requiring models to align visual-spatial features with linguistic intentions for precise target perception. Recent research has focused on refining the granularity of textual features and optimizing image-text feature fusion to better guide target feature representations. However, insufficient descriptive granularity and sensitivity to semantic shifts can cause bottlenecks in cross-modal feature fusion. To address these issues, we propose the Image-Conditioned Instance Prompt Network (ICIPNet) with Bilateral Information Fusion, which is designed to alleviate bottlenecks in cross-modal feature fusion. ICIPNet introduces an Image-Conditioned Instance Prompt (ICIP) module to generate self-adaptive visual and semantic representations without external knowledge. The Bilateral Information Fusion (BIF) module enhances feature fusion along the token and channel dimensions. Experiments demonstrate that the proposed ICIPNet outperforms existing RRSIS models.

2605.24531 2026-05-26 cs.CV 版本更新

NudgeVAD: Language-Nudged End-to-End Driving via FiLM Residuals

NudgeVAD: 通过FiLM残差的语言引导端到端驾驶

Chieh-Chi Yang, Yu-Hsiang Chen, Yi-Ting Chen

发表机构 * National Yang Ming Chiao Tung University（国立阳明交通大学）

AI总结提出NudgeVAD框架，利用语言作为校准的微调信号，通过恒等初始化的FiLM和零初始化残差头，在命令不可靠时显著提升驾驶轨迹预测性能。

Comments Technical report for the doScenes Instructed Driving Challenge, CVPR 2026 DriveX Workshop. 1st place in the Ablation track

详情

AI中文摘要

自然语言指令有望实现可控的端到端驾驶，但当规划器已经接收到可靠的高级命令时，其优势可能被掩盖。我们提出NudgeVAD，一个冻结规划器残差框架，利用语言作为对VAD轨迹的校准微调。通过恒等初始化的FiLM和零初始化的残差头，NudgeVAD在初始化时等价于冻结规划器，因此学习到的偏差仅来自语言条件残差。我们沿命令可靠性轴评估NudgeVAD。在可靠命令下，语言改进了初始规划器，但与VAD-FT (UNCOND)（一个计算量匹配的、无语言微调的VAD模型）相比几乎冗余。然而，在随机命令下，语言变得至关重要：去除文本使ADE6s恶化至3.166米，而带有文本的NudgeVAD恢复至2.806米，并优于VAD-FT (UNCOND) 0.312米。这些结果表明，语言并非普遍可加；当分类命令通道不可靠时，它最有价值。

英文摘要

Natural-language instructions promise controllable end-to-end driving, but their benefit can be hidden when planners already receive reliable high-level commands. We propose NudgeVAD, a frozen-planner residual framework that uses language as a calibrated nudge to a VAD trajectory. With identity-initialized FiLM and a zero-initialized residual head, NudgeVAD is equivalent to the frozen planner at initialization, so learned deviations arise only from language-conditioned residuals. We evaluate NudgeVAD along a command-reliability axis. With reliable commands, language improves the initial planner but becomes nearly redundant once compared against VAD-FT (UNCOND), a compute-matched VAD model fine-tuned without language. With random commands, however, language becomes essential: detaching text degrades ADE6s to 3.166 m, while NudgeVAD with text recovers 2.806 m and outperforms VAD-FT (UNCOND) by 0.312 m. These results show that language is not universally additive; it is most valuable when the categorical command channel is unreliable.
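
Why identity-initialized FiLM plus a zero-initialized residual head reproduces the frozen planner at initialization can be seen in a small dependency-free sketch. The class `NudgeSketch`, its layer shapes, and the input values are hypothetical; only the initialization scheme follows the abstract.

```python
def linear(weights, bias, x):
    """Plain matrix-vector product plus bias, written with lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


class NudgeSketch:
    """Frozen trajectory plus a language-conditioned residual (illustrative only)."""

    def __init__(self, feat_dim, text_dim, traj_dim):
        # FiLM parameters predict offsets from the identity modulation
        # (gamma = 1, beta = 0), and they start at zero.
        self.w_gamma, self.b_gamma = zeros(feat_dim, text_dim), [0.0] * feat_dim
        self.w_beta, self.b_beta = zeros(feat_dim, text_dim), [0.0] * feat_dim
        # Zero-initialized residual head: it outputs exactly 0 before training.
        self.w_res, self.b_res = zeros(traj_dim, feat_dim), [0.0] * traj_dim

    def forward(self, frozen_traj, feat, text_emb):
        gamma = [1.0 + d for d in linear(self.w_gamma, self.b_gamma, text_emb)]
        beta = linear(self.w_beta, self.b_beta, text_emb)
        modulated = [g * f + b for g, f, b in zip(gamma, feat, beta)]
        residual = linear(self.w_res, self.b_res, modulated)
        return [t + r for t, r in zip(frozen_traj, residual)]


model = NudgeSketch(feat_dim=3, text_dim=2, traj_dim=4)
frozen_trajectory = [0.5, 1.0, 1.5, 2.0]
output = model.forward(frozen_trajectory, [0.1, -0.2, 0.3], [1.0, -1.0])
```

Before any gradient step the output equals the frozen trajectory for every text embedding, so any later deviation has to be learned through the language-conditioned path.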

2605.24530 2026-05-26 cs.CL cs.CV 版本更新

Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval

Unveil: 统一视觉-文本集成与蒸馏的多模态文档检索

Hao Sun, Yingyan Hou, Jiayan Guo, Bo Wang, Chunyu Yang, Jinsong Ni, Yan Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University（北京大学通用人工智能国家重点实验室）； School of Intelligence Science and Technology, Peking University（北京大学智能科学与技术学院）； Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院空天信息创新研究院）； Key Laboratory of Target Cognition and Application Technology（目标认知与应用技术重点实验室）； Beijing Institute of Technology（北京理工大学）； Ucap Cloud（Ucap云）

AI总结提出Unveil框架，通过视觉-文本嵌入和知识蒸馏实现鲁棒的文档检索，兼顾布局与语义信息。

Comments ACL 2025 Main Conference

详情

AI中文摘要

现实场景中的文档检索由于文档格式和模态的多样性面临重大挑战。传统的基于文本的方法依赖于定制的解析技术，忽略布局信息且容易出错，而最近的无解析视觉方法在文本丰富的场景中往往难以捕捉细粒度的文本语义。为了解决这些限制，我们提出了Unveil，一种新颖的视觉-文本嵌入框架，有效整合文本和视觉特征以实现鲁棒的文档表示。通过知识蒸馏，我们将视觉-文本嵌入模型的语义理解能力转移到纯视觉模型，实现高效的无解析检索同时保持语义保真度。实验结果表明，我们的视觉-文本嵌入方法超越了现有方法，而知识蒸馏成功弥合了视觉-文本方法与纯视觉方法之间的性能差距，提高了检索准确性和效率。

英文摘要

Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.

2605.22715 2026-05-26 cs.CV cs.AI cs.CL cs.HC 版本更新

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

AnyMo：野外人体运动的几何感知与设置无关建模

Baiyu Chen, Zechen Li, Wilson Wongso, Lihuan Li, Xiachong Lin, Hao Xue, Benjamin Tag, Flora Salim

发表机构 * The University of New South Wales（新南威尔士大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出AnyMo框架，通过物理模拟生成多样化IMU信号、图编码器预训练和LLM对齐，实现跨设备/数据集的零样本活动识别、跨模态检索和运动描述，性能显著提升。

详情

AI中文摘要

随着可穿戴和移动设备日益融入日常生活，它们为持续感知野外人体运动提供了实用途径。但惯性信号高度依赖于传感设置，包括身体位置、安装位置、传感器朝向、设备硬件和采样协议。这种设置依赖性使得学习跨设备和数据集迁移的运动表示变得困难，并限制了可穿戴IMU在封闭集识别之外的广泛应用。我们提出AnyMo，一个用于设置无关人体运动建模的几何感知框架。AnyMo利用基于物理的IMU模拟在密集体表位置上生成多样且合理的合成信号，从配对的合成放置视图和掩蔽部分观测中预训练图编码器，将多位置IMU标记化为全身运动令牌，并将这些令牌与LLM对齐以进行运动-语言理解。我们在三个互补任务上评估AnyMo：跨14个未见下游数据集的零样本活动识别、跨模态检索和可穿戴IMU运动描述，其中在HAR上平均Accuracy/F1/R@2提升11.7%/11.6%/22.6%，零样本IMU到文本和文本到IMU检索MRR分别提升15.9%和28.6%，零样本描述BERT-F1提升18.8%。这些结果支持AnyMo作为野外可穿戴运动理解的通才模型。项目页面：https://baiyuchen.com/project/AnyMo。

英文摘要

As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and sampling protocol. This setup dependence makes it difficult to learn motion representations that transfer across devices and datasets, and limits the broader use of wearable IMUs beyond closed-set recognition. We introduce AnyMo, a geometry-aware framework for setup-agnostic human motion modeling. AnyMo uses physics-grounded IMU simulation over dense body-surface placements to generate diverse and plausible synthetic signals, pre-trains a graph encoder from paired synthetic placement views and masked partial observations, tokenizes multi-position IMU into full-body motion tokens, and aligns these tokens with an LLM for motion-language understanding. We evaluate AnyMo on three complementary tasks: zero-shot activity recognition across 14 unseen downstream datasets, cross-modal retrieval, and wearable IMU motion captioning, where it improves average Accuracy/F1/R@2 by 11.7%/11.6%/22.6% on HAR, increases zero-shot IMU-to-text and text-to-IMU retrieval MRR by 15.9% and 28.6%, respectively, and improves zero-shot captioning BERT-F1 by 18.8%. These results support AnyMo as a generalist model for wearable motion understanding in the wild. Project page: https://baiyuchen.com/project/AnyMo.

2605.21652 2026-05-26 cs.CV cs.AI 版本更新

ClaimDiff-RL: 通过视觉声明比较进行细粒度描述强化学习

Tianle Li, Xuyang Shen, Yan Ma, Rongxin Guo, Shaoxiang Chen, Jiacheng Chen, Haochen Wang, Hongyang Tang, Yucong Zhou, Yu Cheng

发表机构 * The Chinese University of Hong Kong（香港中文大学）； MiniMax

AI总结提出ClaimDiff-RL框架，利用原子声明差异作为奖励单元，通过多模态判断器枚举视觉差异并分配错误类型和严重程度，以解决长描述强化学习中事实性与覆盖度的权衡问题。

详情

AI中文摘要

长格式图像描述揭示了强化学习中的奖励粒度问题：描述被整体判断，而重要错误发生在单个视觉声明层面。一个好的密集描述应既忠实又信息丰富，避免幻觉而不遗漏显著细节。然而，成对偏好、基于参考的指标和整体标量奖励将这些局部错误压缩为单个序列级信号，模糊了事实性与覆盖度之间的权衡。我们引入ClaimDiff-RL框架，该框架使用以参考为条件的原子声明差异作为描述强化学习的奖励单元。给定一张图像、一个演员描述和一个参考描述，多模态判断器枚举有视觉依据的差异，针对图像验证每个差异，分配开放词汇的错误类型和严重程度，并生成每个差异的统计信息用于奖励组合。这使得幻觉声明和遗漏的显著事实可以分别测量和调整。实验表明，整体标量奖励可能以增加遗漏事实为代价来减少幻觉，而ClaimDiff-RL揭示了这种忠实性与覆盖度的权衡，并实现了更平衡的操作点。在包含160张图像的人工标注诊断基准、公开描述基准和VQA基准上，ClaimDiff-RL改善了幻觉与遗漏事实之间的平衡，保留了通用能力，甚至在多个细粒度能力维度（如物体计数、空间关系和场景识别）上超越了Gemini-3-Pro-Preview。这些结果表明，类型化、可验证的声明差异是细粒度且可诊断的描述强化学习的有效奖励单元。

英文摘要

Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors occur at the level of individual visual claims. A good dense caption should be both faithful and informative, avoiding hallucination without omitting salient details. Yet pairwise preferences, reference-based metrics, and holistic scalar rewards compress these local errors into a single sequence-level signal, obscuring the tradeoff between factuality and coverage. We introduce ClaimDiff-RL, a framework that uses reference-conditioned atomic claim differences as the reward unit for caption RL. Given an image, an actor caption, and a reference caption, a multimodal judge enumerates visually grounded differences, verifies each difference against the image, assigns open-vocabulary error types and severity levels, and produces per-difference statistics for reward composition. This makes hallucinated claims and omitted salient facts separately measurable and tunable. Experiments show that holistic scalar rewards can reduce hallucination at the cost of more missing facts, while ClaimDiff-RL exposes this tradeoff between faithfulness and coverage and enables more balanced operating points. On a 160-image human-labeled diagnostic benchmark, public captioning benchmarks, and VQA benchmarks, ClaimDiff-RL improves the balance between hallucinated and missing facts, preserves general capability, and even surpasses Gemini-3-Pro-Preview on several fine-grained capability dimensions such as object counting, spatial relations, and scene recognition. These results suggest that typed, verifiable claim differences are an effective reward unit for fine-grained and diagnosable caption RL.
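
The step from per-difference statistics to a reward can be sketched as a weighted penalty. Everything specific here is an assumption made for illustration: the `ClaimDiff` record, the two error kinds, the numeric severities, and the linear weighting in `compose_reward` are hypothetical, since the abstract does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class ClaimDiff:
    kind: str        # "hallucination" (claim not in the image) or "omission" (salient fact missing)
    severity: float  # judge-assigned severity, higher is worse
    verified: bool   # whether the judge confirmed the difference against the image


def compose_reward(diffs, w_hallucination=1.0, w_omission=1.0):
    """Return (reward, stats): a negative weighted sum of verified error severities.

    Keeping the two error kinds in separate terms is what makes the
    faithfulness/coverage tradeoff tunable through the two weights.
    """
    stats = {"hallucination": 0.0, "omission": 0.0}
    for d in diffs:
        if d.verified and d.kind in stats:
            stats[d.kind] += d.severity
    reward = -(w_hallucination * stats["hallucination"] + w_omission * stats["omission"])
    return reward, stats


# Hypothetical judge output for one caption; the unverified difference is ignored.
diffs = [
    ClaimDiff("hallucination", 2.0, True),
    ClaimDiff("omission", 1.0, True),
    ClaimDiff("hallucination", 3.0, False),
]
reward, stats = compose_reward(diffs, w_hallucination=1.0, w_omission=0.5)
```

Raising `w_omission` relative to `w_hallucination` penalizes captions that stay safe by saying less, which is the failure mode the abstract attributes to holistic scalar rewards.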

2605.19491 2026-05-26 cs.CV 版本更新

Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning

尺度思考：通过自适应连续推理加速千兆像素病理图像分析

Jiusong Ge, Yingkang Zhan, Wenjie Zhao, Di Zhang, Ke Wang, Jiashuai Liu, Chunze Yang, Chengzu Li, Jian Zhang, Yuxin Dong, Ni Zhang, Qidong Liu, Mireia Crispin-Ortuzar, Huazhu Fu, Chen Li, Zeyu Gao

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China（计算机科学与技术学院，西安交通大学，西安，中国）； Department of Transmedia Art, Xi’an Academy of Fine Arts, Xi’an, China（跨媒体艺术系，西安美术学院，西安，中国）； Department of Oncology, University of Cambridge, Cambridge, U.K.（肿瘤学系，剑桥大学，剑桥，英国）； Language Technology Lab, University of Cambridge, Cambridge, U.K.（语言技术实验室，剑桥大学，剑桥，英国）； Institute of High Performance Computing, Agency for Science, Technology（高性能计算研究所，科技研究局）

AI总结提出PathCTM模型，通过动态尺度切换和注意力引导的区域剪枝实现高效连续推理，大幅减少计算开销并保持诊断性能。

Comments Accepted to ICML 2026

详情

AI中文摘要

传统的全切片图像（WSI）分析方法通常依赖于多实例学习（MIL）范式，该范式在高倍率下提取补丁级特征并进行聚合以进行切片级预测。然而，这种详尽的补丁级处理计算成本高，严重限制了WSI分析的效率和可扩展性。为应对这一挑战，我们提出了PathCTM（面向病理学的连续思维模型），该模型能够对千兆像素WSI进行令牌高效的尺度空间连续推理。PathCTM将诊断推理表述为动态的序列信息追踪。它逐步从低倍率全局检查过渡到高倍率局部检查，并在收集到足够证据以有效限制决策不确定性时自适应终止推理。具体而言，它使用条件计算进行动态尺度切换，并采用注意力引导的区域剪枝，结合置信度感知的早期停止。大量实验表明，与基于标准MIL的方法相比，PathCTM将所需图像补丁数量减少了95.95%，推理时间缩短了约95.62%，同时AUC没有下降。代码可在https://github.com/JSGe-AI/PathCTM获取。

英文摘要

Traditional whole slide image (WSI) analysis methods typically rely on the multiple instance learning (MIL) paradigm, which extracts patch-level features at high magnification and aggregates them for slide-level prediction. However, such exhaustive patch-level processing is computationally expensive, severely limiting the efficiency and scalability of WSI analysis. To address this challenge, we propose PathCTM (a Pathology-oriented Continuous Thought Model) that enables token-efficient scale-space continuous reasoning for gigapixel WSIs. PathCTM formulates diagnostic inference as a dynamic sequential information pursuit. It progressively transitions from low-magnification global to high-magnification local inspection, and adaptively terminates inference when sufficient evidence is gathered to effectively bound decision uncertainty. Specifically, it uses conditional computation for dynamic scale switching with attention-guided region pruning, coupled with confidence-aware early stopping. Extensive experiments demonstrate that, compared with standard MIL-based methods, PathCTM reduces the number of required image patches by 95.95% and shortens inference time by approximately 95.62%, while maintaining AUC without degradation. Code is available at https://github.com/JSGe-AI/PathCTM.
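The confidence-aware early stopping described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `classify_with_early_stop`, the maximum-probability stopping rule, and the 0.9 threshold are all assumptions chosen for illustration, and the per-scale class probabilities are taken as given rather than produced by a model.

```python
def classify_with_early_stop(scale_probs, threshold=0.9):
    """Walk from coarse to fine scales and stop once confident enough.

    scale_probs: list of class-probability lists, ordered from the lowest
                 magnification to the highest (hypothetical input format).
    threshold:   assumed confidence level at which inspection stops.
    Returns (predicted_class_index, number_of_scales_inspected).
    """
    if not scale_probs:
        raise ValueError("need at least one scale")
    used = 0
    probs = scale_probs[0]
    for probs in scale_probs:
        used += 1
        # Stop as soon as the top class is confident enough at this scale.
        if max(probs) >= threshold:
            break
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, used


# Example: the second scale is already confident, so the third is skipped.
label, used = classify_with_early_stop(
    [[0.6, 0.4], [0.95, 0.05], [0.99, 0.01]]
)
```

For scale, the reported reductions translate into ratios as follows: removing 95.95% of patches leaves 4.05%, roughly 1/0.0405 ≈ 24.7 times fewer patches, and a 95.62% shorter inference time corresponds to roughly 1/0.0438 ≈ 22.8 times faster inference.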


2605.17531 2026-05-26 cs.CV 版本更新

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

不要猜测，只需询问：通过多轮澄清解决指代分割中的歧义

Yuting Yang, Haichao Jiang, Tianming Liang, Quan Zhang, Jian-Fang Hu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University（中山大学计算机科学与工程学院）； Guangdong Province Key Laboratory of Information Security Technology（广东省信息安全技术重点实验室）； Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education（教育部机器智能与先进计算重点实验室）

AI总结提出IC-Seg框架，通过多轮对话主动澄清用户意图，并引入Hi-GRPO分层优化策略，有效解决指代分割中用户查询歧义问题。

详情

AI中文摘要

指代分割旨在根据文本查询分割图像或视频中的目标对象。尽管过去几年取得了显著进展，现有工作总是假设用户提供的查询已经精确且清晰。然而，这种假设不切实际。在现实场景中，期望所有用户仔细审查其视觉内容并确保查询唯一且无歧义是不现实的。遇到此类情况时，现有分割模型倾向于任意猜测用户偏好，常常导致不理想的结果。为解决这一限制，我们提出IC-Seg，一种新颖的智能体框架，在分割前通过多轮对话主动澄清用户意图。为有效激励这种能力，我们进一步引入Hi-GRPO，一种新的分层优化策略，在轨迹、轮次和步骤层面注入密集且信息丰富的监督信号。该策略鼓励高效的意图澄清，有效消除冗余交互并提高整体对话质量。为评估，我们建立了Ambi-RVOS，一个带有模糊用户查询的指代视频对象分割基准。大量实验表明，IC-Seg不仅在解决模糊查询方面大幅优于现有方法，而且在标准推理分割基准上保持最先进性能。代码和数据将在https://github.com/iSEE-Laboratory/IC-Seg发布。

英文摘要

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose IC-Seg, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce Hi-GRPO, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish Ambi-RVOS, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at https://github.com/iSEE-Laboratory/IC-Seg.


2605.17268 2026-05-26 cs.AI cs.CV cs.RO 版本更新

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

VLA 推理是否忠实？自动驾驶模型中因果链的安全性探究

Nicanor Mayumu, Xiaoheng Deng, Patrick Mukala

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）； Central South University（中南大学）； School of Computer Science（计算机科学学院）； University of Wollongong in Dubai（迪拜伍伦贡大学）

AI总结通过分析300次VLA推理，发现输出推理与轨迹的忠实度仅42.5%，存在大量漏检行人、轨迹脆弱及推理-动作不一致问题，并提出了信息论忠实度形式化定义与安全架构。

Comments Accept (Poster), CVPR 2026 Workshop DriveX NonArchival Track

2605.16409 2026-05-26 cs.CV cs.CL cs.LG 版本更新

监控视频中罕见交通事件的两次通过零样本时空定位

Jiantang Huang

AI总结提出一种无需微调的管道，通过粗到细的两遍分解和专家角色分配，利用冻结视觉语言模型实现罕见交通事件在时间、空间和碰撞类型上的联合定位。

Comments Accepted at CVPR 2026 AUTOPILOT Workshop (Non-Archival Track). 7 pages (4 main + references + appendix), 3 figures, 5 tables

详情

AI中文摘要

在真实闭路电视画面中定位交通事故是一个罕见事件问题：通常不允许使用带标注的事故视频进行训练，却需要在时间、空间和碰撞类型上进行精确的联合定位。我们提出一种无需微调的管道，通过两个想法从冻结的视觉语言模型中引出这种联合输出。首先，粗到细的两遍分解：第一遍以1 fps处理全视频，产生粗粒度(t, x, y, c)元组；第二遍在±3秒窗口内以5 fps细化时间和位置，并设置两个确定性置信门，当模型在窗口边界处给出含糊结果或坐标被钳制到画面边缘时回退到粗估计。其次，专家角色分配：Qwen3-VL-Plus负责定位，Gemini 3.1 Flash-Lite负责在居中的视频片段上判定碰撞类型。在ACCIDENT@CVPR 2026基准测试（2,027个真实闭路电视视频）上，我们达到ACC^S = 0.539（95%置信区间[0.525, 0.553]）：比基准论文中各基线取最优的oracle结果（0.412）高0.127，比最强单VLM基线（Molmo-7B, 0.396）高0.143，比朴素基线（0.289）高0.250。VLM路径每个视频最多调用三次API（17%的视频因API失败而回退到基于物理的方法）；完整运行成本约20美元。

英文摘要

Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a +/- 3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.
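The two deterministic confidence gates can be sketched roughly as follows. The abstract does not define the gates precisely, so this is one plausible reading: the names `Estimate` and `refine_with_gates`, the tolerances `eps_t` and `eps_xy`, and the use of normalized coordinates are assumptions. As a simplification, the collision type is carried over from the coarse pass instead of coming from a separate typing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    t: float  # event time in seconds
    x: float  # horizontal position, assumed normalized to [0, 1]
    y: float  # vertical position, assumed normalized to [0, 1]
    c: str    # collision type label


def refine_with_gates(coarse, fine, window=3.0, eps_t=0.2, eps_xy=0.01):
    """Accept the fine-pass estimate unless one of two gates fires.

    Gate 1 (assumed meaning of a boundary hedge): the refined time sits
    on the edge of the +/- window around the coarse time.
    Gate 2: a refined coordinate is clamped to the image border.
    """
    lo, hi = coarse.t - window, coarse.t + window
    on_boundary = fine.t <= lo + eps_t or fine.t >= hi - eps_t
    clamped = (
        min(fine.x, fine.y) <= eps_xy or max(fine.x, fine.y) >= 1.0 - eps_xy
    )
    if on_boundary or clamped:
        return coarse  # revert to the coarse (t, x, y, c) tuple
    return Estimate(fine.t, fine.x, fine.y, coarse.c)


coarse = Estimate(10.0, 0.5, 0.5, "rear-end")
accepted = refine_with_gates(coarse, Estimate(11.2, 0.4, 0.6, "unused"))
reverted = refine_with_gates(coarse, Estimate(13.0, 0.4, 0.6, "unused"))
```

The reported margins are consistent with the stated scores: 0.539 − 0.412 = 0.127, 0.539 − 0.396 = 0.143, and 0.539 − 0.289 = 0.250.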


2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR 版本更新

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链：面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University（软件工程国家级工程研究中心，北京大学）； City University of Hong Kong（香港城市大学）； Peking University（北京大学）； Tencent Technology（腾讯科技）

AI总结提出Chain of Evidence (CoE)框架，利用视觉语言模型直接对检索到的文档截图进行推理，输出精确边界框以可视化完整推理链，解决迭代检索增强生成中的粗粒度归因和视觉语义丢失问题。

详情

DOI: 10.1145/3805712.3809540

AI中文摘要

迭代检索增强生成（iRAG）已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而，当前系统主要基于解析文本运行，这造成了两个关键瓶颈：（1）粗粒度归因，用户需要根据模糊的文本级引用在冗长文档中手动定位证据；（2）视觉语义丢失，将视觉丰富的文档（如幻灯片、带有图表的PDF）转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距，我们提出了证据链（CoE），这是一个与检索器无关的视觉归因框架，利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析，输出精确的边界框，可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE：Wiki-CoE，一个源自2WikiMultiHopQA的大规模结构化网页数据集；以及SlideVQA，一个具有挑战性的演示幻灯片数据集，包含复杂图表和自由形式布局。实验表明，微调后的Qwen3-VL-8B-Instruct取得了稳健的性能，在需要视觉布局理解的场景中显著优于基于文本的基线，同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

英文摘要

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.


2603.16100 2026-05-26 cs.CV 版本更新

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

重新评估CLIP中的模态内错位假设

Jonas Herzog, Yue Wang

发表机构 * Zhejiang University（浙江大学）

AI总结本文质疑CLIP的模态内错位假设，通过理论分析和实验证明图像嵌入距离不存在所谓的自由度，且模态内任务性能差异主要源于任务歧义而非错位。

Comments Accepted for CVPR'26. Project Page: https://vision-kek.github.io/Is-CLIP-Really-Misaligned/

详情

AI中文摘要

最近的研究表明，CLIP类对比语言-图像训练产生的嵌入对于纯图像任务并非最优。主要理论是跨模态（语言-图像）对齐损失忽略了模态内（图像-图像）对齐，导致图像间距离校准不良。在本研究中，我们质疑这一模态内错位假设。我们重新审视其基础理论论证、支持该假设的指标以及受影响的性能指标。对于理论论证，我们证明图像嵌入距离不存在所谓的自由度。对于经验度量，我们的发现表明，它们在语言-图像训练模型（CLIP、SigLIP）和图像-图像训练模型（DINO、SigLIP2）上产生相似结果。这表明观察到的现象并非源于前者特有的错位。对常见模态内任务（检索和少样本分类）的实验证实，解决任务歧义（而非所谓的错位）才是获得最佳结果的关键。

英文摘要

Recent research has suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on two commonly studied intra-modal tasks, retrieval and few-shot classification, confirm that addressing task ambiguity, not supposed misalignment, is key for best results.


2603.00777 2026-05-26 cs.CV 版本更新

DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents

DUCX：分解使用工具的胸部X光代理中的不公平性

Zikang Xu, Ruinan Jin, Xiaoxiao Li

发表机构 * Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Anhui, China（人工智能研究所，合肥国家科学中心，安徽，中国）； The University of British Columbia, Vancouver, BC V6Z 1Z4, Canada（不列颠哥伦比亚大学，温哥华，BC V6Z 1Z4，加拿大）； Vector Institute, Toronto, ON M5G 1M1, Canada（向量研究所，多伦多，ON M5G 1M1，加拿大）

AI总结提出DUCK框架，通过阶段式公平性分解方法，系统审计使用工具的胸部X光代理中的工具暴露偏差、工具转换偏差和模型推理偏差，揭示端到端评估无法预测的群体差异。

Comments Early accepted by MICCAI 2026

详情

AI中文摘要

随着使用工具的临床AI系统协调专门的视觉和语言模块执行胸部X光问答等任务，医疗代理中的公平性变得至关重要。虽然这些医疗AI代理可以提高灵活性，但其增加的流水线复杂性也为人口统计偏差创造了新的途径，超出了独立模型。我们提出了DUCK，即分解胸部X光代理中的不公平性，这是一个对使用MedRAX实例化的工具型胸部X光代理的公平性进行系统审计的方法。为了定位差异产生的位置，我们引入了一种阶段式公平性分解，将端到端偏差与三个代理特定来源分开：工具暴露偏差，即基于工具存在的效用差距；工具转换偏差，即工具路由模式中的子组差异；以及模型推理偏差，即合成行为中的子组差异。在五个驱动骨干网络上对使用工具的代理框架进行的大量实验表明，端到端性能中存在人口统计差距，均等几率高达20.79%，最低公平-效用权衡降至28.65%。中间行为，包括工具使用、转换模式和推理轨迹，表现出明显的子组差异，这些差异无法仅从端到端评估中预测。例如，在分割工具可用的情况下，子组效用差距高达50%。我们的研究结果强调了过程级公平性审计和去偏的必要性，以确保临床代理系统的公平部署。代码：https://github.com/Nanboy-Ronan/DUCK。

英文摘要

Fairness in medical agents is becoming critical as tool-using clinical AI systems orchestrate specialized vision and language modules for tasks such as chest X-ray question answering. While these medical AI agents can improve flexibility, their added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCK, Decomposing Unfairness in Chest X-ray agents, a systematic audit of fairness in tool-using chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias, or utility gaps conditioned on tool presence; tool transition bias, or subgroup differences in tool-routing patterns; and model reasoning bias, or subgroup differences in synthesis behaviors. Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that demographic gaps persist in end-to-end performance, with equalized odds up to 20.79% and the lowest fairness-utility tradeoff down to 28.65%. Intermediate behaviors, including tool usage, transition patterns, and reasoning traces, exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone. For example, conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%. Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code: https://github.com/Nanboy-Ronan/DUCK.
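For readers unfamiliar with the equalized-odds figure quoted above, the following is a minimal sketch of one common way to compute an equalized-odds gap for binary predictions. The abstract does not give the exact definition used, so the max-over-rates form here is an assumption, and the function names `rates` and `equalized_odds_gap` are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def equalized_odds_gap(y_true, y_pred, group):
    """Largest between-group spread in TPR or FPR (assumed definition)."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, group):
        truths, preds = by_group.setdefault(g, ([], []))
        truths.append(t)
        preds.append(p)
    tprs, fprs = zip(*(rates(t, p) for t, p in by_group.values()))
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Group "a" is classified perfectly; group "b" has TPR 0.5 and FPR 0.5.
gap = equalized_odds_gap(
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 1, 0],
    ["a", "a", "a", "a", "b", "b", "b", "b"],
)
```

A stage-wise audit in the spirit of the abstract would apply a gap of this kind not only to final answers but also to intermediate behaviors such as which tools were invoked for each subgroup.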


2603.00191 2026-05-26 cs.LG cs.CV 版本更新

Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning

基于LoRA的持续学习中任务驱动的子空间分解用于知识共享与隔离

Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao

发表机构 * State Key Laboratory of Integrated Services Networks, School of Telecommunications Engineering, Xidian University, Xi'an, China（综合业务网理论及关键技术国家重点实验室，通信工程学院，西安电子科技大学，西安，中国）； School of Electronic Engineering, Xidian University, Xi'an, China（电子工程学院，西安电子科技大学，西安，中国）

AI总结提出LoDA方法，通过任务驱动分解构建通用和任务特定LoRA子空间，结合梯度对齐优化和闭式重校准，实现知识共享与隔离，提升持续学习性能。

Comments Accepted by ICML 2026

详情

AI中文摘要

持续学习要求模型在不遗忘旧知识的情况下顺序适应新任务。最近，低秩适应（LoRA）作为一种代表性的参数高效微调方法，在持续学习中受到越来越多的关注。几种基于LoRA的持续学习方法通过分离更新空间来减少任务间的干扰，通常从过去任务的估计零空间中构建新空间。然而，它们（i）忽略了任务共享方向，抑制了知识迁移；（ii）未能捕获真正有效的任务特定方向，因为旧任务的这些“零基”在相关任务下对新任务几乎保持不活跃。为了解决这个问题，我们从投影能量的角度研究LoRA的学习能力，并提出了低秩分解与适应（LoDA）。它通过解决两个基于能量的目标，执行任务驱动分解以构建通用和真正的任务特定LoRA子空间，解耦知识共享和隔离的方向。LoDA固定两个子空间上的LoRA下投影，并通过梯度对齐优化方法学习鲁棒的上投影。在每个任务之后，在将LoRA更新集成到主干之前，LoDA为通用更新推导出一个闭式重校准，沿着这个任务共享方向近似特征级联合最优。实验表明，LoDA优于现有的持续学习方法。我们的代码可在https://github.com/HHHLF/LoDA_ICML2026获取。

英文摘要

Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for new tasks when tasks are correlated. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods. Our code is available at https://github.com/HHHLF/LoDA_ICML2026.
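Why fixing the down-projection confines an update to a chosen subspace can be seen from the standard LoRA form, where the weight update is the product of an up-projection B and a down-projection A. The sketch below illustrates only that generic property, under the assumption of plain nested-list matrices. It does not reproduce LoDA's energy-based decomposition or its gradient-aligned optimization, and the helper names are hypothetical.

```python
def matmul(P, Q):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
        for i in range(len(P))
    ]


def matvec(M, v):
    """Multiply a nested-list matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_delta(B, A):
    """Standard LoRA update: delta_W = B @ A, with A of shape (r, d_in)."""
    return matmul(B, A)


# A is frozen and spans a single input direction (rank r = 1, d_in = 3).
A = [[1.0, 0.0, 0.0]]
# B is the trainable up-projection (d_out = 2, r = 1).
B = [[2.0], [3.0]]
delta_W = lora_delta(B, A)

# Inputs orthogonal to the rows of A are untouched by the update,
# whatever values B takes during training.
unaffected = matvec(delta_W, [0.0, 1.0, 1.0])
affected = matvec(delta_W, [1.0, 0.0, 0.0])
```

Because the update only responds to input components inside the row space of A, choosing A per subspace is what lets shared and task-specific directions be kept apart while B is trained.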


2602.23916 2026-05-26 cs.CV cs.AI 版本更新

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

基于拓扑驱动的医学基础模型分割迁移性估计

Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

发表机构 * Peking University（北京大学）； Hohai University（河海大学）； Beijing Normal University-Hong Kong Baptist University United International College（北京师范大学-香港浸会大学联合国际学院）； National Institute of Health Data Science, Peking University（健康数据科学国家研究院，北京大学）； Institute of Medical Technology, Peking University（北京大学医学技术研究院）； State Key Laboratory of General Artificial Intelligence, Peking University（通用人工智能国家重点实验室，北京大学）

AI总结提出拓扑驱动迁移性估计框架，通过全局表示拓扑散度、局部边界感知拓扑一致性和任务自适应融合，无需微调即可高效选择医学基础模型，在OpenMind基准上加权Kendall指标相对提升约31%。

详情

AI中文摘要

大规模自监督学习（SSL）的出现产生了大量的医学基础模型。然而，为特定分割任务选择最优的医学基础模型仍然是一个计算瓶颈。现有的迁移性估计（TE）指标主要针对分类任务设计，依赖于全局统计假设，无法捕捉密集预测所需的拓扑复杂性。我们提出了一种新颖的拓扑驱动迁移性估计框架，评估流形可处理性而非统计重叠。我们的方法引入了三个组成部分：（1）全局表示拓扑散度（GRTD），利用最小生成树量化特征-标签结构同构性；（2）局部边界感知拓扑一致性（LBTC），专门在关键解剖边界评估流形可分离性；（3）任务自适应融合，根据目标任务的语义基数动态整合全局和局部指标。在跨不同解剖目标和SSL基础模型的大规模OpenMind基准上验证，我们的方法在加权Kendall指标上显著优于最先进的基线，相对提升约31%，提供了一种鲁棒的、无需训练的代理，用于高效模型选择而无需微调成本。代码将在接收后公开。

英文摘要

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall metric, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.


2602.23872 2026-05-26 cs.CV cs.RO 版本更新

通过静态-动态解耦实现高效长程视觉-语言-动作模型

Weikang Qiu, Huashuo Lei, Tinglin Huang, Rex Ying

发表机构 * Yale University（耶鲁大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出DySta框架，通过将视觉输入解耦为多级静态和动态令牌，减少上下文长度并复用KV缓存，实现高效多帧集成和推理，在基准测试和真实任务中显著提升性能。

详情

AI中文摘要

视觉-语言-动作（VLA）模型最近成为通用机器人控制的一种有前景的范式。基于视觉-语言模型（VLM）架构，VLA模型根据视觉观察和语言指令预测动作，在各类任务中实现了强大的性能和泛化能力。然而，VLA模型面临两个主要挑战：输入帧的有限上下文窗口，以及由于二次注意力复杂性和大参数数量导致的低效推理。为此，我们提出了DySta，一个将视觉输入解耦为多级静态和动态令牌的框架，使得（1）在帧间保留静态令牌的单一副本以显著减少上下文长度，以及（2）通过轻量级重缓存门（仅在必要时更新）重用静态令牌的键值（KV）缓存。这种设计实现了高效的多帧集成和高效推理。此外，我们引入了一个新的基准测试，更有效地评估VLA模型的多帧集成能力。实验表明，DySta在我们的基准测试中各项指标上将多帧集成能力提高了24.5%，在真实世界记忆依赖任务中绝对成功率提升23.3%，同时在模拟基准测试中推理速度提升2.0倍（成功率+2.3%），在真实世界通用任务中推理速度提升2.2倍（成功率+10.6%）。

英文摘要

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: a limited context window for input frames and inefficient inference due to the quadratic attention complexity and large parameter counts. To this end, we propose DySta, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the multi-frame integration ability of VLAs. Experiments show that DySta improves multi-frame integration by 24.5% across metrics on our benchmark and 23.3% in absolute success rate on real-world memory-dependent tasks, while accelerating inference by 2.0x (with +2.3% success rate) on simulation benchmarks and 2.2x (with +10.6% success rate) on real-world general tasks.


2601.22709 2026-05-26 cs.CV cs.AI 版本更新

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

基于置信度蒸馏的门控关系对齐用于高效视觉语言模型

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

发表机构 * Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland（信息技术与电气工程系，苏黎世联邦理工学院，苏黎世，瑞士）； Qualcomm AI Research, Amsterdam, the Netherlands（高通人工智能研究，阿姆斯特丹，荷兰）； Department of Electrical, Electronic and Information Engineering（电气、电子与信息工程系）； University of Bologna, Bologna, Italy（博洛尼亚大学，博洛尼亚，意大利）； School of Electrical and Electronic Engineering（电气与电子工程学院）

AI总结提出GRACE框架，通过信息瓶颈原理统一知识蒸馏与量化感知训练，使用置信度门控解耦蒸馏、关系中心核对齐和自适应控制器，在INT4量化下实现性能超越FP16基线并接近教师模型，同时显著降低内存和提升吞吐量。

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

视觉语言模型（VLM）具有强大的多模态性能，但部署成本高，且训练后量化通常会导致显著的精度损失。尽管有潜力，但针对VLM的量化感知训练仍未得到充分探索。我们提出GRACE，一个在信息瓶颈原则下统一知识蒸馏和量化感知训练的框架：量化约束信息容量，而蒸馏指导在此预算内保留什么。将教师视为任务相关信息的代理，我们引入置信度门控解耦蒸馏以过滤不可靠的监督，关系中心核对齐以传递视觉标记结构，以及通过拉格朗日松弛实现的自适应控制器以平衡保真度与容量约束。在LLaVA和Qwen系列的大量基准测试中，我们的INT4模型始终优于FP16基线（例如，LLaVA-1.5-7B：SQA上70.1 vs. 66.8；Qwen2-VL-2B：MMBench上76.9 vs. 72.6），几乎匹配教师性能。使用真实的INT4内核，我们实现了3倍的吞吐量，内存减少54%。这一原则性框架显著优于现有量化方法，使GRACE成为资源受限部署的有力解决方案。代码和数据可在https://github.com/ForeverBlue816/GRACE获取。

英文摘要

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3× throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: https://github.com/ForeverBlue816/GRACE.
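Centered kernel alignment (CKA) compares the pairwise similarity structure of two sets of token features. As background, here is a minimal sketch of the widely used linear CKA score in pure Python. The abstract does not specify GRACE's exact relational variant, so treating it as linear CKA over token features is an assumption, and the helper names are hypothetical.

```python
import math


def _center(rows):
    """Subtract the per-feature mean from each row (token)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _gram(rows):
    """Token-by-token inner-product (Gram) matrix."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def _frob_inner(P, Q):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(a * b for rp, rq in zip(P, Q) for a, b in zip(rp, rq))


def linear_cka(X, Y):
    """Linear CKA between features X (n x d1) and Y (n x d2) of n tokens."""
    K = _gram(_center(X))
    L = _gram(_center(Y))
    return _frob_inner(K, L) / math.sqrt(_frob_inner(K, K) * _frob_inner(L, L))


teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# The same tokens rotated by 90 degrees and scaled by 3: structure is unchanged.
student = [[0.0, 3.0], [-3.0, 0.0], [-3.0, 3.0], [0.0, 6.0]]
score = linear_cka(teacher, student)  # close to 1.0
```

The score is invariant to rotations and uniform scaling of either feature space, which is why it suits comparing a quantized student with a full-precision teacher whose features need not match coordinate by coordinate.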


2601.16763 2026-05-26 cs.CV 版本更新

Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

基于流匹配的概率单目3D人体姿态估计

Cuong Le, Pavlo Melnyk, Bastian Wandt, Mårten Wadenbäck

发表机构 * Department of Electrical Engineering（电气工程系）； Linköping University（林雪平大学）； Independent researcher（独立研究者）

AI总结提出FMPose方法，利用流匹配生成模型从2D关键点学习3D姿态分布，通过图卷积网络建模2D提升条件，在保持精度的同时显著提升推理速度。

Comments 12 pages, 2 figures, 8 tables, accepted to TMLR

详情

AI中文摘要

从单目相机视角恢复3D人体姿态是一个高度病态的问题，因为存在深度模糊。早期从2D提升3D姿态的研究常常包含错误但过度自信的3D估计。为了缓解这一问题，新兴的概率方法将3D估计视为分布，考虑姿态的不确定性度量。属于类似范畴，我们提出了FMPose，一种基于流匹配生成方法的概率3D人体姿态估计方法。以2D线索为条件，流匹配方案通过连续归一化流学习从简单源分布到合理3D人体姿态分布的最优传输。2D提升条件通过图卷积网络建模，利用人体关节之间的可学习连接作为图结构进行特征聚合。尽管处理时间和精度之间存在权衡，但在等精度比较中，FMPose的处理时间显著快于扩散模型，并且还提供了另一种更快且更准确的配置。实验结果表明，我们的FMPose在3D人体姿态估计的两个常见基准（Human3.6M、MPI-INF-3DHP）上相比当前最先进方法有显著改进。此外，FMPose在更具挑战性的3DPW数据集上表现出竞争性能。代码实现见https://github.com/cuongle1206/FMPose。

英文摘要

Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to depth ambiguity. Earlier studies on lifting 3D human poses from 2D often produce incorrect yet overconfident 3D estimates. To mitigate the problem, emerging probabilistic approaches treat the 3D estimates as a distribution, taking into account the uncertainty of the poses. In a similar vein, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. While trade-offs between processing time and precision exist, already in the equal-accuracy comparison, FMPose exhibits significantly faster processing time than the diffusion model, and also offers another faster and more accurate configuration. Experimental results show major improvements of our FMPose over current state-of-the-art methods on two common benchmarks for 3D human pose estimation, Human3.6M and MPI-INF-3DHP. Additionally, FMPose shows competitive performance on the more challenging 3DPW dataset. The code implementation is available at https://github.com/cuongle1206/FMPose.
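The flow matching idea can be sketched with the commonly used straight-line probability path. The abstract does not state which path FMPose uses, so the linear interpolation here is an assumption, the function names are hypothetical, and the 2D conditioning and the graph convolutional network are omitted entirely.

```python
def cfm_training_pair(x0, x1, t):
    """Build one conditional flow matching training example.

    x0: sample from the simple source distribution (e.g. Gaussian noise)
    x1: ground-truth 3D pose, flattened to a list of floats
    t:  time in [0, 1]
    Returns (x_t, target_velocity) for the straight-line path from x0 to x1.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [0.0, 0.0, 0.0]
x1 = [1.0, 2.0, -1.0]
x_t, v = cfm_training_pair(x0, x1, 0.25)
# With the exact (constant) velocity, integration recovers x1 from x0.
recovered = euler_sample(lambda x, t: v, x0)
```

In training, a network would regress the target velocity from `x_t`, `t`, and the 2D condition. At inference, drawing several source samples and integrating each one yields multiple pose hypotheses, and a straight path needs only a few integration steps, which is consistent with the speed advantage over diffusion models that the abstract reports.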


2601.08205 2026-05-26 cs.CV cs.LG 版本更新

FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection

FUME: 用于牲畜瘤胃酸中毒检测的融合统一多气体排放网络

Taminul Islam, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh

发表机构 * Southern Illinois University, Carbondale（南方伊利诺伊大学，卡本达勒分校）； University of California, Davis（加州大学戴维斯分校）

AI总结提出FUME网络，利用双气体（CO2和CH4）光学成像，通过轻量双流架构和通道注意力融合，实现瘤胃酸中毒的高精度分割与分类。

Comments 10 pages, 5 figures

详情

Journal ref: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2026, pp. 510-519

AI中文摘要

瘤胃酸中毒是奶牛中常见的代谢紊乱，导致重大经济损失和动物福利问题。当前的诊断方法依赖于侵入性pH测量，限制了持续监测的可扩展性。我们提出了FUME（融合统一多气体排放网络），这是首个在体外条件下通过双气体光学成像进行瘤胃酸中毒检测的深度学习方法。我们的方法利用红外相机捕获的互补二氧化碳（CO2）和甲烷（CH4）排放模式，将瘤胃健康状态分类为健康、过渡和酸中毒。FUME采用轻量双流架构，包含权重共享编码器、模态特定自注意力和通道注意力融合，联合优化气体羽流分割和奶牛健康分类。我们引入了首个双气体OGI数据集，包含8967个标注帧，覆盖六个pH水平，并带有像素级分割掩码。实验表明，FUME在仅使用1.28M参数和1.97G MACs的情况下，实现了80.99%的mIoU和98.82%的分类准确率——在分割质量上优于最先进方法，且计算成本降低10倍。消融研究揭示，CO2提供主要的判别信号，而双任务学习对于最优性能至关重要。我们的工作确立了基于气体排放的牲畜健康监测的可行性，为实用的体外酸中毒检测系统铺平了道路。代码可在 https://github.com/taminulislam/fume 获取。

英文摘要

Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas optical gas imaging (OGI) dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs, outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Code is available at https://github.com/taminulislam/fume.


2512.16710 2026-05-26 cs.CV 版本更新

A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

基于标志点的全面胎儿生物测量多中心、多设备基准数据集

Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano

发表机构 * Department of Computer Science and UCL Hawkes Institute, University College London, London WC1E 6BT, UK（计算机科学系和UCL Hawkes研究所，伦敦大学学院，伦敦WC1E 6BT，英国）； Sagol Brain Institute, Tel Aviv Sourasky Medical Center, Tel Aviv, Israel（萨戈尔脑研究所，特拉维夫 Sourasky 医疗中心，以色列特拉维夫）； School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel（计算机科学与工程学院，耶路撒冷希伯来大学，耶路撒冷，以色列）； UCLH NHS Foundation Trust and the Elizabeth Garrett Anderson Institute for Women’s Health, UCL, London, UK（UCLH NHS基金会信托和Elizabeth Garrett Anderson妇女健康研究所，UCL，伦敦，英国）； Sagol School of Neuroscience and Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel（萨戈尔神经科学学院和Sackler医学院，特拉维夫大学，以色列特拉维夫）

AI总结为解决胎儿超声生物测量中手动标注耗时且依赖操作者的问题，构建了包含4513张图像、来自3个临床中心7种设备的公开基准数据集，提供标准化评估流程和基线结果，验证了单中心训练会高估性能，为多中心泛化研究提供基准。

Comments 11 pages, 5 figures, 3 tables

详情

DOI: 10.1038/s41598-026-47854-3
Journal ref: Scientific Reports (2026)

AI中文摘要

准确的胎儿生长评估依赖于通过手动识别标准平面中的解剖标志点进行精确生物测量。手动标志点标注耗时、依赖操作者，且易受扫描仪和站点间差异影响，限制了自动化方法的可重复性。需要多源标注数据集来开发人工智能辅助的胎儿生长评估方法。为解决这一瓶颈，我们提出了一个开放的、多中心、多设备的胎儿超声图像基准数据集，包含用于临床胎儿生物测量的专家解剖标志点标注。这些测量包括头双顶径和枕额径、腹横径和前后径以及股骨长度。该数据集包含来自1904名受试者的4513张去标识超声图像，这些图像在三个临床站点使用七种不同的超声设备采集。我们提供标准化的、受试者不重叠的训练/测试划分、评估代码和基线结果，以实现方法的公平和可重复比较。使用自动生物测量模型，我们量化了域偏移，并证明局限于单个中心的训练和评估相对于多中心测试会显著高估性能。据我们所知，这是第一个公开可用的多中心、多设备、标志点标注数据集，覆盖所有主要胎儿生物测量指标，为胎儿生物测量中的域适应和多中心泛化提供了稳健的基准，并有助于跨中心实现更可靠的AI辅助胎儿生长评估。所有数据、标注、训练代码和评估流程均已公开。

英文摘要

Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset comprises 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
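
To make the landmark-to-measurement step concrete, here is a minimal sketch of how a biometric length can be derived from a pair of annotated landmarks. It assumes landmarks are (x, y) pixel coordinates and that the physical pixel spacing of each image is known; the function name `landmark_distance_mm` and the coordinates are hypothetical and do not come from the dataset's evaluation code.

```python
import math

def landmark_distance_mm(p1, p2, spacing_mm):
    """Euclidean distance between two (x, y) pixel landmarks, in millimetres.

    spacing_mm is the (x, y) physical size of one pixel, assumed known per image.
    """
    dx = (p2[0] - p1[0]) * spacing_mm[0]
    dy = (p2[1] - p1[1]) * spacing_mm[1]
    return math.hypot(dx, dy)

# Hypothetical femur endpoints in pixel coordinates, 0.5 mm isotropic pixels
femur_length = landmark_distance_mm((100, 200), (160, 280), (0.5, 0.5))
print(femur_length)  # 50.0
```

Because pixel spacing differs between ultrasound devices, an error measured in millimetres is comparable across sites in a way that an error in pixels is not.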


2512.10548 2026-05-26 cs.CV 版本更新

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Blink: 动态视觉令牌分辨率增强多模态理解

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Baidu Inc

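AI summary: Proposes Blink, a framework that uses attention-guided token super-resolution and a dynamic dropping mechanism to emulate human blink-like scanning within a single forward pass, improving the visual perception of multimodal large language models.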

Comments CVPR 2026

详情

AI中文摘要

多模态大语言模型（MLLMs）在各种视觉-语言任务上取得了显著进展，但其视觉感知仍然有限。相比之下，人类通过动态扫描并顺序地以“眨眼式”过程聚焦于显著区域，高效地感知复杂场景。受此策略启发，我们首先研究MLLMs是否表现出类似行为。我们的初步分析表明，MLLMs自然地关注不同层的视觉区域，并且选择性地将更多计算分配给显著令牌可以增强视觉感知。基于这一见解，我们提出Blink，一种动态视觉令牌分辨率框架，在单次前向传播中模拟人类启发的过程。具体来说，Blink包括两个模块：显著性引导扫描和动态令牌分辨率。它首先基于注意力图估计每层视觉令牌的显著性，并通过即插即用的令牌超分辨率（TokenSR）模块扩展重要令牌。在下一层，当扩展令牌失去焦点时，它会丢弃它们。这种动态机制平衡了广泛探索和细粒度聚焦，从而自适应且高效地增强视觉感知。大量实验验证了Blink在增强视觉感知和多模态理解方面的有效性。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.


2512.08254 2026-05-26 cs.CV 版本更新

Real-World Scene Recovery for Scattering-Degraded Images Using Spatial and Frequency Priors

使用空间和频率先验的散射退化图像真实场景恢复

Yun Liu, Tao Li, Guanghui Yue, Wenqi Ren, Cosmin Ancuti, Weisi Lin

Affiliations: College of Artificial Intelligence, Southwest University; School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University; School of Cyber Science and Technology, Shenzhen Campus, Sun Yat-sen University; ETcTI, University Politehnica Timisoara; College of Computing and Data Science, Nanyang Technological University (NTU)

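AI summary: Proposes Spatial and Frequency Priors (SFP), which combine spatial-domain transmission-map estimation with a frequency-domain adaptive enhancement strategy to recover real-world scenes from scattering-degraded images, outperforming existing methods across a variety of real-world scenarios.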

Comments 18 pages, 22 figures, submitted to IEEE T-PAMI

详情

AI中文摘要

从受散射效应（如雾、沙尘暴、水下和遥感条件）退化的真实图像中恢复场景，仍然是计算机视觉中一个基本但具有挑战性的问题。现有方法要么依赖单一先验（本质上不足以表征多样的散射退化），要么使用在合成数据上训练的深度网络（通常对真实场景的泛化能力有限）。在本文中，我们提出空间和频率先验（SFP）用于散射诱导退化下的真实场景恢复。在空间域，我们观察到散射退化图像的逆在其光谱方向上揭示了一个与底层场景传输相关的投影。基于这一观察，我们制定了一个空间先验来估计传输图，从而能够在散射效应下有效恢复场景辐射。在频率域，我们设计了一种由两个新先验引导的自适应频率增强策略。第一个先验假设退化图像中跨通道的直流（DC）分量的平均强度近似于对应清晰图像的平均强度。第二个先验基于观察：在清晰图像中，窄带内的低径向频率仅占整个频谱的一小部分。这些先验能够针对不同频带的散射诱导衰减进行补偿。最后，对空间域和频率域的结果进行加权融合，得到最终的恢复图像。在多种真实世界散射退化场景上的大量实验验证，与最先进方法相比，我们的SFP实现了优越的性能和强大的泛化能力。

英文摘要

Scene recovery from real-world images degraded by scattering effects, such as haze, sandstorm, underwater, and remote sensing conditions, remains a fundamental yet challenging problem in computer vision. Existing methods either rely on a single prior, which is inherently insufficient to characterize diverse scattering degradations, or employ deep networks trained on synthetic data, which often suffer from limited generalization to real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery under scattering-induced degradations. In the spatial domain, we observe that the inverse of a scattering-degraded image reveals a projection along its spectral direction that correlates with the underlying scene transmission. Based on this observation, a spatial prior is formulated to estimate the transmission map, enabling effective recovery of scene radiance under scattering effects. In the frequency domain, we design an adaptive frequency enhancement strategy guided by two novel priors. The first prior assumes that the mean intensity of the direct current (DC) components across channels in degraded images approximates that of the corresponding clear images. The second prior is based on the observation that, in clear images, low radial frequencies within a narrow band contribute only a small proportion of the overall spectrum. These priors enable targeted compensation for scattering-induced attenuation across different frequency bands. Finally, a weighted fusion of the spatial and frequency domain results is performed to obtain the final recovered image. Extensive experiments on diverse real-world scattering-degraded scenarios verify that our SFP achieves superior performance and strong generalization capability compared to state-of-the-art methods.
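
The first frequency prior refers to the DC component, so a small reminder of what that term encodes may help. The sketch below uses a one-dimensional toy signal and a hand-written DFT (the helper `dft_coefficient` and the sample values are assumptions, not the authors' code) to show that the DC coefficient is the sum of the samples, and that a uniform brightness offset changes only that coefficient.

```python
import cmath

def dft_coefficient(signal, k):
    """k-th coefficient of the (unnormalised) discrete Fourier transform."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(signal))

row = [0.2, 0.4, 0.6, 0.8]            # toy intensities of one image row
shifted = [x + 0.1 for x in row]       # same row with a uniform offset added

dc = dft_coefficient(row, 0)
print(abs(dc) / len(row))              # the mean intensity of the row
print(abs(dft_coefficient(shifted, 1) - dft_coefficient(row, 1)))  # ~0: non-DC term unchanged
```

This only illustrates the link between the DC term and mean intensity; how the paper uses that link to compensate scattering is described in the abstract, not reproduced here.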


2512.05791 2026-05-26 physics.med-ph cs.CV cs.LG math.PR 版本更新

Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm

使用预条件未调整朗之万算法实现快速且鲁棒的MR图像重建扩散后验采样

Moritz Blumenthal, Tina Holliber, Jonathan I. Tamir, Martin Uecker

Affiliations: Institute of Biomedical Imaging, Graz University of Technology, Graz, Austria; Department of Radiology, Boston Children's Hospital, Harvard Medical School, Boston, USA; Chandra Family Department of Electrical Engineering, University of Texas at Austin, USA; Department of Diagnostic Medicine, Dell Medical School, University of Texas at Austin, USA

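AI summary: To address the slow sampling and parameter tuning required by diffusion posterior sampling in MR image reconstruction, proposes an exact-likelihood method based on the preconditioned unadjusted Langevin algorithm that converges quickly and samples robustly without parameter tuning.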

Comments Submitted to Magnetic Resonance in Medicine

详情

DOI: 10.1002/mrm.70416

AI中文摘要

目的：结合未调整朗之万算法（ULA）与扩散模型，可以从高度欠采样的k空间数据生成高质量MRI重建结果并附带不确定性估计。然而，扩散后验采样（DPS）或似然退火等采样方法存在重建时间长和需要参数调优的问题。本文旨在开发一种具有快速收敛性的鲁棒采样算法。理论与方法：在用于后验采样的反向扩散过程中，精确似然与所有噪声尺度下的扩散先验相乘。为克服收敛缓慢的问题，采用了预条件技术。该方法在fastMRI数据上训练，并在健康志愿者的回顾性欠采样脑部数据上测试。结果：对于笛卡尔和非笛卡尔加速MRI的后验采样，新方法在重建速度和样本质量上均优于退火采样和DPS。结论：所提出的预条件精确似然方法能够在各种MRI重建任务中实现快速可靠的后验采样，无需参数调优。

英文摘要

Purpose: The Unadjusted Langevin Algorithm (ULA) in combination with diffusion models can generate high quality MRI reconstructions with uncertainty estimation from highly undersampled k-space data. However, sampling methods such as diffusion posterior sampling (DPS) or likelihood annealing suffer from long reconstruction times and the need for parameter tuning. The purpose of this work is to develop a robust sampling algorithm with fast convergence. Theory and Methods: In the reverse diffusion process used for sampling the posterior, the exact likelihood is multiplied with the diffused prior at all noise scales. To overcome the issue of slow convergence, preconditioning is used. The method is trained on fastMRI data and tested on retrospectively undersampled brain data of a healthy volunteer. Results: For posterior sampling in Cartesian and non-Cartesian accelerated MRI the new approach outperforms annealed sampling and DPS in terms of reconstruction speed and sample quality. Conclusion: The proposed exact likelihood with preconditioning enables rapid and reliable posterior sampling across various MRI reconstruction tasks without the need for parameter tuning.
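
For orientation, a generic preconditioned unadjusted Langevin step for sampling a posterior $p(x \mid y)$ can be written as below. This is the textbook form for a real-valued $x$ with a constant, symmetric positive-definite preconditioner $P$ and step size $\tau$; both symbols are assumptions made here, since the abstract does not give the authors' notation, their choice of preconditioner, or the noise schedule.

```latex
x_{k+1} = x_k + \tau\, P\, \nabla_x \log p(x_k \mid y) + \sqrt{2\tau}\, P^{1/2}\, \xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I),
\qquad \nabla_x \log p(x \mid y) = \nabla_x \log p(y \mid x) + \nabla_x \log p(x).
```

Setting $P = I$ recovers plain ULA. A $P$ that approximates the inverse curvature of the likelihood term lets well-determined and poorly-determined directions advance at a similar rate, which is the usual motivation for preconditioning when convergence is slow.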


2512.01382 2026-05-26 cs.CV 版本更新

Reversible Inversion for Training-Free Exemplar-guided Image Editing

可逆反演用于免训练示例引导图像编辑

Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Computing and Artificial Intelligence, Southwest Jiaotong University, China; School of Computer Science and Technology, Tongji University, China; Department of Computer Science, University of Warwick, United Kingdom

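AI summary: Proposes Reversible Inversion (ReInversion), which uses two-stage denoising and a mask-guided selective denoising strategy for efficient, training-free exemplar-guided image editing, achieving state-of-the-art performance with the lowest computational overhead.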

详情

AI中文摘要

示例引导图像编辑（EIE）旨在根据视觉参考修改源图像。现有方法通常需要大规模预训练来学习源图像和参考图像之间的关系，计算成本高。作为一种免训练的替代方案，反演技术可用于将源图像映射到潜在空间进行操作。然而，我们的实证研究表明，标准反演对于EIE是次优的，导致质量差和效率低。为了解决这一挑战，我们引入了可逆反演（ReInversion），用于有效且高效的EIE。具体来说，ReInversion作为一个两阶段去噪过程运行，首先以源图像为条件，然后以参考图像为条件。此外，我们引入了一种掩码引导选择性去噪（MSD）策略，将编辑限制在目标区域，保持背景的结构一致性。定性和定量比较都表明，我们的ReInversion方法以最低的计算开销实现了最先进的EIE性能。

英文摘要

Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. In addition, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.


2512.00125 2026-05-26 cs.CV cs.LG 版本更新

Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance

混合合成数据生成与域随机化实现极端类别不平衡下基于视觉的零样本零件检测

Ruo-Syuan Mei, Sixian Jia, Guangze Li, Soo Yeon Lee, Brian Musser, William Keller, Sreten Zakula, Jorge Arinez, Chenhui Shao

Affiliations: Department of Mechanical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Materials & Manufacturing Systems Research Lab, General Motors, Warren, MI 48092, USA

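AI summary: Proposes a hybrid synthetic data generation framework that combines simulation-based rendering, domain randomization, and real background compositing. YOLOv8n and MobileNetV3-small models trained only on synthetic data perform zero-shot industrial part inspection under extreme class imbalance, reaching a detection mAP@0.5 of 0.995, 96% classification accuracy, and 90.1% balanced accuracy.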

Comments Submitted to the NAMRC 54

详情

DOI: 10.1016/j.jmapro.2026.04.020

AI中文摘要

机器学习，特别是深度学习，正在改变工业质量检测。然而，训练鲁棒的机器学习模型通常需要大量高质量标注数据，这在制造业中获取成本高昂、耗时且劳动密集。此外，缺陷样本本身稀少，导致严重的类别不平衡，降低模型性能。这些数据约束阻碍了基于机器学习的质量检测方法在实际生产环境中的广泛采用。合成数据生成（SDG）通过高效、经济且可扩展的方式创建大规模、平衡且完全标注的数据集，提供了一种有前景的解决方案。本文提出一种混合SDG框架，集成了基于仿真的渲染、域随机化和真实背景合成，无需人工标注即可实现基于计算机视觉的工业零件检测的零样本学习。该SDG流水线通过改变零件几何、光照和表面属性，并将合成零件合成到真实图像背景上，在一小时内生成12,960张标注图像。利用YOLOv8n骨干网络进行目标检测、MobileNetV3-small进行质量分类的两阶段架构，仅使用合成数据训练，并在300个真实工业零件上评估。所提方法在检测上达到mAP@0.5为0.995，分类准确率96%，平衡准确率90.1%。与基于少量真实数据的基线方法相比，性能显著提升。在严重类别不平衡下，所提基于SDG的方法达到90-91%的平衡准确率，而基线仅达到50%准确率。这些结果表明，所提方法能够为真实制造应用实现免标注、可扩展且鲁棒的质量检测。

英文摘要

Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement. The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.
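
Since the abstract reports both plain and balanced accuracy, a short sketch of the difference may be useful. Balanced accuracy is the mean of per-class recall, so a rare defect class weighs as much as the common good class. The label counts and the degenerate "always good" classifier below are hypothetical and are not the paper's data.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical imbalanced set: 95 good parts (0), 5 defective parts (1)
y_true = [0] * 95 + [1] * 5
always_good = [0] * 100  # degenerate classifier that never flags a defect

accuracy = sum(t == p for t, p in zip(y_true, always_good)) / len(y_true)
print(accuracy)                                # 0.95
print(balanced_accuracy(y_true, always_good))  # 0.5
```

A classifier that never detects a defect still scores 95% plain accuracy in this toy case but only 50% balanced accuracy, which is why the balanced figure is the more informative one under severe class imbalance.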


2511.18794 2026-05-26 cs.GR cs.CV 版本更新

ChronoGS: Disentangling Invariants and Changes in Multi-Period Scenes

ChronoGS：多时期场景中不变性与变化的解耦

Zhongtao Wang, Jiaqi Dai, Qingtian Zhu, Yilong Li, Mai Su, Fei Zhu, Meng Gai, Shaorong Wang, Chengwei Pan, Yisong Chen, Guoping Wang

Affiliations: Peking University; Beijing Forestry University; The University of Tokyo; Beihang University

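AI summary: Proposes ChronoGS, a temporally modulated Gaussian representation that reconstructs multi-period scenes within a unified anchor scaffold and disentangles stable from evolving components for temporally consistent reconstruction; the ChronoScene benchmark dataset is released alongside it.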

Comments CVPR26 Highlight

详情

AI中文摘要

多时期图像集合在现实应用中很常见。城市为测绘而重新扫描，建筑工地为进度跟踪而再次访问，自然区域为环境变化而监测。这些数据形成多时期场景，其中几何和外观会演变。重建此类场景是一个重要但尚未充分探索的问题。现有管线依赖于不兼容的假设：静态和野外方法强制单一几何，而动态方法假设平滑运动，两者在长期、不连续变化下均失败。为解决此问题，我们引入ChronoGS，一种时间调制的高斯表示，它在统一锚点支架内重建所有时期。它还被设计为解耦稳定和演化组件，实现多时期场景的时间一致重建。为促进相关研究，我们发布ChronoScene数据集，一个真实和合成多时期场景的基准，捕捉几何和外观变化。实验表明，ChronoGS在重建质量和时间一致性上始终优于基线。我们的代码和ChronoScene数据集公开于https://github.com/ZhongtaoWang/ChronoGS。

英文摘要

Multi-period image collections are common in real-world applications. Cities are re-scanned for mapping, construction sites are revisited for progress tracking, and natural regions are monitored for environmental change. Such data form multi-period scenes, where geometry and appearance evolve. Reconstructing such scenes is an important yet underexplored problem. Existing pipelines rely on incompatible assumptions: static and in-the-wild methods enforce a single geometry, while dynamic ones assume smooth motion, both failing under long-term, discontinuous changes. To solve this problem, we introduce ChronoGS, a temporally modulated Gaussian representation that reconstructs all periods within a unified anchor scaffold. It is also designed to disentangle stable and evolving components, achieving temporally consistent reconstruction of multi-period scenes. To catalyze relevant research, we release the ChronoScene dataset, a benchmark of real and synthetic multi-period scenes, capturing geometric and appearance variation. Experiments demonstrate that ChronoGS consistently outperforms baselines in reconstruction quality and temporal consistency. Our code and the ChronoScene dataset are publicly available at https://github.com/ZhongtaoWang/ChronoGS.


2511.15407 2026-05-26 cs.AI cs.CV cs.LG 版本更新

IPR-1: Interactive Physical Reasoner

IPR-1：交互式物理推理器

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

Affiliations: Carnegie Mellon University

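AI summary: Proposes IPR, which scores and reinforces a VLM policy with world-model rollouts and adds PhysCode, a physics-centric action code, to achieve robust physical reasoning on a benchmark of 1,000+ heterogeneous games, surpassing GPT-5 overall and transferring zero-shot to unseen games.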

Comments Accepted by CVPR 2026. 13 pages of main text and 20 pages of appendices. Project page: https://mybearyzhang.github.io/ipr-1

详情

AI中文摘要

人类通过观察、与环境交互以及内化物理和因果关系来学习。在这里，我们旨在探究一个智能体是否能够通过交互类似地获得类人推理能力，并随着更多经验不断改进。为此，我们引入了一个包含1000+异构游戏的Game-to-Unseen (G2U)基准，这些游戏展现出显著的视觉领域差异。现有方法（包括VLM和世界模型）难以捕捉底层物理和因果关系，因为它们不关注核心机制且过度拟合视觉细节。VLM/VLA智能体能够推理，但在交互设置中缺乏前瞻性，而世界模型进行想象但模仿视觉模式而非分析物理和因果关系。因此，我们提出IPR（交互式物理推理器），利用世界模型滚动来评分和强化VLM的策略，并引入PhysCode，一种以物理为中心的动作代码，将语义意图与动力学对齐，为预测和推理提供共享动作空间。在1000+游戏上预训练后，我们的IPR在从原始直觉到目标驱动推理的各个层次上表现稳健，甚至在总体上超越了GPT-5。我们发现，性能随着训练游戏和交互步骤的增加而提升，并且模型还能零样本迁移到未见过的游戏。这些结果支持以物理为中心的交互作为稳步提升物理推理的路径。更多演示和项目详情请见https://mybearyzhang.github.io/ipr-1。

英文摘要

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.


2510.02730 2026-05-26 cs.LG cs.CV 版本更新

Dale meets Langevin: A Multiplicative Denoising Diffusion Model

Dale meets Langevin: 乘法去噪扩散模型

Nishanth Shetty, Madhava Prasath, Chandra Sekhar Seelamantula

Affiliations: Department of Electrical Engineering; Indian Institute of Science

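AI summary: Proposes a multiplicative score-based generative model with geometric Brownian motion as the forward noising process, derives the reverse-time SDE and two multiplicative samplers, introduces the Hyvärinen score and a multiplicative denoising score-matching objective, and validates generative capability on image datasets.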

详情

AI中文摘要

指数梯度下降（EGD）是一种受生物学启发的优化算法，遵循Dale定律，在收敛时产生对数正态分布的突触权重，与神经科学的实验观察一致。由于几何布朗运动（GBM）在任何固定时间的边际分布是对数正态的，这种收敛性质揭示了EGD与基于GBM的随机过程之间的自然联系。我们提出了一种基于分数的乘法生成模型，以GBM作为前向噪声过程，并推导了其在环境空间和对数变换空间中的相应反向时间SDE。通过离散化相应的反向时间SDE，我们推导出两种乘法采样器：直接从环境空间反向时间SDE得到的符号无关采样器，以及通过Lamperti变换得到的符号保持采样器，我们称之为Dale-Langevin采样器。我们将该框架与镜像Langevin动力学联系起来，表明优化中驱动EGD的凸函数精确地控制着Dale-Langevin采样器。虽然标准Stein分数（定义为随机向量X在x处的∇log p_X(x)）在基于加性噪声的扩散模型中自然出现，但在乘法设置中，我们遇到了一种用于采样的修改版Stein分数，我们称之为Hyvärinen分数：x∘∇log p_X(x)。为了估计该分数，我们提出了一种新的乘法去噪分数匹配目标（M-DSM），证明了其与乘法显式分数匹配损失的等价性，并表明它包含了非负分数匹配损失。在MNIST、Fashion-MNIST、Kuzushiji-MNIST和CIFAR-10上的实验结果验证了所提框架的生成能力。

英文摘要

Exponentiated gradient descent (EGD), a biologically motivated optimisation algorithm that respects Dale's law, produces log-normally distributed synaptic weights at convergence, in alignment with experimental observations in neuroscience. Since the marginal distribution of geometric Brownian motion (GBM) at any fixed time is log-normal, this convergence property reveals a natural connection between EGD and GBM-based stochastic processes. We propose a multiplicative score-based generative model with GBM as a forward noising process and derive its corresponding reverse-time SDE in both the ambient space and the $\log$-transformed space. We derive two multiplicative samplers by discretising the corresponding reverse-time SDEs: a sign-agnostic sampler obtained directly from the ambient-space reverse-time SDE, and a sign-preserving sampler, which we refer to as the Dale-Langevin sampler, obtained via the Lamperti transform. We connect the framework to Mirrored Langevin Dynamics, showing that the convex function driving EGD in optimisation precisely governs the Dale-Langevin sampler. While the standard Stein score, defined as $\nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$ for a random vector $\boldsymbol{X}$ evaluated at $\boldsymbol{x}$, arises naturally in additive-noise diffusion models, in the multiplicative setting we encounter a modified version of the Stein score for sampling, which we refer to as the Hyvärinen score: $\boldsymbol{x} \circ \nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$. To estimate the score, we propose a new multiplicative denoising score-matching objective (M-DSM), prove its equivalence to the multiplicative explicit score-matching loss, and show that it subsumes the non-negative score matching loss. Experimental results on MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10 validate the generative capability of the proposed framework.
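
The log-normal marginal of GBM that the abstract relies on can be checked numerically with a few lines. The sketch below samples the closed-form solution of a scalar GBM; the drift `mu`, volatility `sigma`, time `t`, and the helper name `gbm_forward` are assumptions chosen for illustration and are not the paper's noise schedule.

```python
import math
import random

def gbm_forward(x0, t, sigma, rng, mu=0.0):
    """Sample X_t of geometric Brownian motion dX = mu*X dt + sigma*X dW in closed form.

    X_t = X_0 * exp((mu - sigma^2 / 2) * t + sigma * sqrt(t) * z), z ~ N(0, 1),
    so log(X_t / X_0) is Gaussian and X_t keeps the sign of X_0.
    """
    z = rng.gauss(0.0, 1.0)
    return x0 * math.exp((mu - 0.5 * sigma**2) * t + sigma * math.sqrt(t) * z)

rng = random.Random(0)
samples = [gbm_forward(x0=1.0, t=1.0, sigma=0.5, rng=rng) for _ in range(10000)]
log_mean = sum(math.log(s) for s in samples) / len(samples)

print(all(s > 0 for s in samples))  # True: multiplicative noise never flips the sign
print(log_mean)                     # close to -sigma^2 * t / 2 = -0.125
```

Because the noise multiplies the state instead of being added to it, a positive value stays positive and a negative one stays negative, which is the sign constraint that Dale's law imposes on synaptic weights.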


2509.25339 2026-05-26 cs.CV cs.AI cs.LG eess.IV 版本更新

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

VisualOverload: 在真正密集场景中探测VLM的视觉理解

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

Affiliations: Independent Researcher; JKU Linz; MIT CSAIL; Tübingen AI Center; Stanford; MIT-IBM Watson AI Lab

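AI summary: Introduces the VisualOverload benchmark, which tests VLMs on simple visual tasks in dense scenes; the best model reaches only 69.5% overall accuracy (19.6% on the hardest split), exposing key weaknesses in counting, OCR, and logical consistency.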

Comments Accepted at CVPR 2026

详情

AI中文摘要

最先进的VLM是否真正解决了基本视觉理解？我们提出VisualOverload，一个略有不同的视觉问答（VQA）基准，包含2,720个问答对，并持有私有真实答案。与以往通常关注近全局图像理解的VQA数据集不同，VisualOverload挑战模型在密集（或过载）场景中执行简单的、无需知识的视觉任务。我们的数据集由公共领域绘画的高分辨率扫描图组成，这些绘画包含多个人物、动作和展开的子情节，背景细节丰富。我们手动为这些图像标注了六个任务类别的问题，以探测对场景的彻底理解。我们假设当前基准高估了VLM的性能，编码和推理细节对它们来说仍然是一项具有挑战性的任务，尤其是当面对密集场景时。实际上，我们观察到在37个测试模型中，即使是最好的模型（o3）在我们最难的测试子集上也仅达到19.6%的准确率，在所有问题上总体准确率为69.5%。除了全面评估外，我们还通过错误分析补充了基准，揭示了多种失败模式，包括缺乏计数能力、OCR失败以及复杂任务下惊人的逻辑不一致。总之，VisualOverload暴露了当前视觉模型中的关键差距，并为社区开发更好的模型提供了重要资源。基准：http://paulgavrikov.github.io/visualoverload

英文摘要

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload


2509.12194 2026-05-26 cs.AI cs.CV 版本更新

Teaching large language models to reason like expert diagnosticians

教会大型语言模型像专家诊断医生一样推理

Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Emily Glanton, Kimberly LeBlanc, Undiagnosed Diseases Network, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai

Affiliations: Department of Biomedical Informatics, Harvard Medical School; Department of Medicine, Beth Israel Deaconess Medical Center; The Mongan Institute, Massachusetts General Hospital; Division of Gastroenterology, Brigham and Women’s Hospital; Department of Medicine, Brigham and Women’s Hospital; Department of Medicine, Massachusetts General Hospital; Department of Pathology, Massachusetts General Hospital; Department of Health Humanities and Bioethics, University of Rochester School of Medicine and Dentistry; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Center for the History of Medicine, Countway Library of Medicine, Harvard Medical School; Department of Global Health and Social Medicine, Harvard Medical School; Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital

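AI summary: Introduces Dr. CaBot, an agentic AI system that emulates expert diagnostic reasoning by generating slide-based presentations from the initial case description alone. It outperforms frontier models on CPC-Bench and identifies the working diagnosis in 50/72 (69%) of cases from the NIH Undiagnosed Diseases Network; CPC-Bench is released to advance clinical AI.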

详情

AI中文摘要

鉴别诊断是一个迭代过程，将患者信息与更广泛的医学知识相结合。自1923年以来持续发表的临床病例系列，如NEJM临床病理会议（CPCs），展示了专家医生向同行演示诊断推理，并已被用于评估AI数十年。然而，先前的AI评估主要关注最终诊断准确性，而非细微的临床推理。在此，我们介绍Dr. CaBot，一个代理AI系统，通过仅从初始病例描述生成带有书面和旁白的幻灯片演示，来模拟专家诊断医生。CaBot最近生成了NEJM CPC 100多年历史上首个发表的AI诊断。在盲评中，医生在46/62（74%）的试验中错误分类了鉴别诊断的来源（CaBot vs. 医生撰写），并在各个质量维度上给予其好评。当被要求解决来自NIH未诊断疾病网络的72名未诊断疾病患者的病例时，CaBot仅从转诊记录中就识别出了50/72（69%）病例的工作诊断。为了促进透明度和研究，我们还开发了CPC-Bench，一个基于7,102个CPC和47,648个问题（涵盖10个任务）的经医生验证的基准。我们证明CaBot在CPC-Bench上优于前沿模型，并公开发布CaBot和CPC-Bench，以促进临床AI的进步。

英文摘要

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.


2509.05614 2026-05-26 cs.CV cs.AI cs.RO 版本更新

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

Affiliations: Shanghai Jiao Tong University

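AI summary: To accelerate inference in vision-language-action models, proposes a training-free two-level pruning method that combines global context with local information, achieving up to 1.57x speedup in LIBERO simulation and 1.70x on real-world tasks with almost no drop in success rate.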

Comments Accepted to ICML 2026

详情

AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近，它被应用于加速视觉-语言-动作（VLA）模型推理。然而，现有的加速方法仅关注当前动作步骤的局部信息，忽略了全局上下文，导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性：连续步骤中的输入图像表现出高度相似性，并提出关键见解：令牌选择应结合局部信息与模型的全局上下文。基于此，我们提出SpecPrune-VLA，一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝：利用全局历史和局部注意力，在每个动作中静态减少视觉令牌。(2) 层级动态剪枝：根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器：根据末端执行器的速度将动作分为粗粒度或细粒度，并相应调整剪枝激进程度。大量实验表明，SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速，在真实世界任务中实现1.70倍加速，且成功率下降可忽略不计。

英文摘要

Pruning is a typical acceleration technique for compute-bound models that removes computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to a success-rate drop of more than 20% and limited speedup in some scenarios. In this paper, we point out the spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity. We propose the key insight that token selection should combine local information with the global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller. We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57$\times$ speedup in LIBERO simulation and 1.70$\times$ on real-world tasks, with negligible success rate degradation.
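
The interplay between the action-aware controller and token pruning can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the speed threshold, the two keep ratios, the mapping of fast motion to "coarse-grained", and the function names `keep_ratio` and `prune_tokens` do not come from the paper or its code.

```python
def keep_ratio(speed, speed_threshold=0.05, coarse_keep=0.4, fine_keep=0.8):
    """Hypothetical controller: fast end-effector motion is treated as coarse-grained
    and pruned more aggressively; slow motion is treated as fine-grained."""
    return coarse_keep if speed > speed_threshold else fine_keep

def prune_tokens(scores, ratio):
    """Return the sorted indices of the top `ratio` fraction of tokens by importance."""
    k = max(1, round(len(scores) * ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.9, 0.1, 0.4, 0.7, 0.2]  # toy per-token importance scores
print(prune_tokens(scores, keep_ratio(speed=0.20)))  # coarse step keeps [0, 3]
print(prune_tokens(scores, keep_ratio(speed=0.01)))  # fine step keeps [0, 2, 3, 4]
```

In the method as described, the importance scores would come from global history and local attention; the toy list above merely stands in for them.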


2509.01557 2026-05-26 cs.CV 版本更新

Real-Time Hardware-Free HIFU Interference Suppression via Teacher-Student Diffusion Framework

基于教师-学生扩散框架的实时无硬件HIFU干扰抑制

Dejia Cai, Ali Abdollahi, Xi Wang, Kun Yang, Zhaohui Guo, Xiaowei Zhou, Hao Chen

Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; State Key Laboratory of Ultrasound Engineering in Medicine, Chongqing Medical University, Chongqing 400016, China; School of Microelectronics, Tianjin University, Tianjin 300072, China; Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology, Futian, Shenzhen, China; State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology, Hong Kong SAR, China

AI总结提出一种无需专用硬件同步的图像域扩散框架mHC-Diff，通过教师-学生蒸馏实现实时高保真HIFU干扰抑制，在临床数据集上达到26.65 dB PSNR和~20 FPS。

详情

AI中文摘要

高强度聚焦超声（HIFU）是一种非侵入性疗法，但其安全性常因连续超声引导期间的严重声学干扰而降低。传统的HIFU干扰抑制方法严重依赖专有的原始射频（RF）数据或复杂的硬件同步，限制了其临床实用性并阻碍了实时实现。为解决这一限制，我们提出了流形约束超连接扩散（mHC-Diff），一种图像域扩散框架，用于无需专用硬件同步的实时干扰抑制，将复杂干扰与解剖结构分离，同时确保高重建保真度。为实现临床实时应用，我们的方法采用两阶段策略：（i）解剖感知先验获取，其中扩散模型使用多步UNet作为高保真教师进行训练；以及（ii）效率蒸馏，其中通过知识蒸馏将该先验蒸馏为单步学生以实现实时吞吐量。在涵盖多种治疗场景的临床代表性数据集上的广泛验证表明，mHC-Diff实现了卓越的恢复（26.65 dB PSNR），同时在单个NVIDIA RTX 4090上实现实时推理（~20 FPS），比迭代扩散基线（例如HIFU-Diff）加速约6.8倍。通过消除对专用硬件同步和专有RF访问的需求，该图像域框架确保了兼容性，并促进了超声引导HIFU干预期间的实时干扰抑制。

英文摘要

High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapy, yet its safety is often degraded by severe acoustic interference during continuous ultrasound guidance. Conventional HIFU interference suppression methods heavily rely on proprietary raw Radio-Frequency (RF) data or complex hardware synchronization, limiting their clinical utility and preventing real-time implementation. To address this limitation, we propose Manifold-Constrained Hyper-Connections Diffusion (mHC-Diff), an image-domain diffusion framework for real-time interference suppression without specialized hardware synchronization, disentangling complex interference from anatomical structures while ensuring high reconstruction fidelity. To achieve clinical real-time application, our approach employs a two-stage strategy: (i) anatomy-aware prior acquisition, where a diffusion model is trained with a multi-step UNet as a high-fidelity Teacher; and (ii) efficiency distillation, where this prior is distilled into a one-step Student via knowledge distillation to achieve real-time throughput. Extensive validation on a clinically representative dataset across diverse therapeutic scenarios shows that mHC-Diff achieves superior restoration (26.65 dB PSNR), while enabling real-time inference (~20 FPS) on a single NVIDIA RTX 4090, providing a ~6.8x speedup over iterative diffusion baselines (e.g., HIFU-Diff). By eliminating the requirement for specialized hardware synchronization and proprietary RF access, this image-domain framework ensures compatibility and facilitates real-time interference suppression during ultrasound-guided HIFU interventions.
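
For readers unfamiliar with the restoration metric quoted above, PSNR follows the standard definition $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below assumes images flattened to lists of floats in $[0, 1]$; the function name and the `max_value` default are choices made for this example, not details from the paper.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

With this definition, a uniform per-pixel error of 0.1 on a unit-range image corresponds to 20 dB, so the reported 26.65 dB implies a noticeably smaller average error.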

2508.07624 2026-05-26 cs.CV Updated version

Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction

Vishakha Lall, Yisi Liu

Affiliations: Centre of Excellence in Maritime Safety; Singapore Polytechnic; Singapore

AI summary: Proposes a graph-neural-network-based post-processing pipeline that models the spatial relationships among objects in static environments to correct detection anomalies in egocentric frames, significantly improving detection performance.

DOI: 10.1109/CW68232.2025.00055

Abstract

In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment's spatial structure to improve reliability in object detection systems.

2507.14760 2026-05-26 eess.IV cs.AI cs.CV cs.LG Updated version

QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems

Cassandra Tong Ye, Shamus Li, Tyler King, Kristina Monakhova

AI summary: Proposes QUTCC, which combines quantile regression with a U-Net to enable spatially adaptive conformal calibration, producing tighter uncertainty intervals across several imaging inverse problems and helping to locate model hallucinations.

Abstract

While deep learning offers tremendous promise for scientific and medical imaging, any failures and hallucinations (predictions that do not coincide with reality) are hard to pinpoint and can have serious downstream consequences. Uncertainty estimation techniques, such as conformal prediction, can help by predicting statistically valid error bars for a model's prediction. However, popular conformal prediction methods were not designed for high-dimensional image-valued problems and do not take into account spatial correlations within an image during conformal calibration, resulting in larger-than-necessary uncertainty intervals. We propose a practical simultaneous quantile regression method that enables non-linear, spatially-adaptive scaling during conformal calibration. Our method, QUTCC, uses a U-Net architecture with a quantile embedding to learn a full conditional quantile distribution during training, and then leverages this non-linear, learned function for spatially-adaptive conformal calibration. At test time, our method can efficiently estimate uncertainty intervals with pixel-marginal coverage guarantees. In addition, QUTCC can also predict pixel-wise conditional probability density estimates without any built-in distributional assumptions. We evaluate our method on several denoising problems, accelerated magnetic resonance imaging, and quantitative phase microscopy. Our method consistently produces tighter uncertainty intervals than prior conformal methods at the same coverage level, can predict plausible conditional distributions for different tasks, and in some cases, high-uncertainty regions can help us locate hallucinations in a model's prediction.
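
The two building blocks named in the abstract, quantile regression and conformal calibration, can be illustrated with a small sketch. It shows the standard pinball loss and a plain split-conformal offset for quantile intervals (in the style of conformalized quantile regression), which is a simpler, scalar stand-in and not QUTCC's spatially adaptive calibration. The function names are assumptions for this example.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one prediction at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)


def conformal_offset(lower, upper, targets, alpha=0.1):
    """Scalar widening so that [lower - c, upper + c] covers about 1 - alpha.

    Uses held-out calibration triples (lower, upper, target). A positive
    score means the target fell outside the predicted interval.
    """
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(lower, upper, targets))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return scores[rank - 1]
```

At test time the offset would be added to both interval ends; the method in the abstract instead adjusts the intervals non-linearly and per pixel.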

2506.19117 2026-05-26 cs.CV Updated version

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

Affiliations: University of Tübingen; Tübingen AI Center; Zhejiang University; Noah’s Ark Lab, Huawei; KE:SAI - Kyutai ELLIS Scalable Autonomous Intelligence

AI summary: Proposes PrITTI, a latent diffusion model that uses a hybrid representation of vectorized object primitives and rasterized ground surfaces to generate high-quality, controllable, and editable 3D semantic urban scenes.

Comments: Accepted to CVPR 2026

Abstract

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. The source code and more results can be found at https://raniatze.github.io/pritti/.

2506.05360 2026-05-26 cs.CV Updated version

CarboFormer: A Lightweight Semantic Segmentation Architecture for Efficient Carbon Dioxide Detection Using Optical Gas Imaging

Taminul Islam, Toqi Tahamid Sarker, Mohamed G Embaby, Khaled R Ahmed, Amer AbuGhazaleh

Affiliations: Southern Illinois University, Carbondale, USA

AI summary: Proposes CarboFormer, a lightweight semantic segmentation framework that uses an optimized encoder-decoder, multi-scale feature fusion, and auxiliary supervision for real-time, high-accuracy CO$_2$ emission detection in resource-constrained settings, and contributes two new datasets.

DOI: 10.1007/978-3-032-14495-9_1
Journal ref: Advances in Visual Computing. ISVC 2025. Lecture Notes in Computer Science, vol 16397, pp. 3-15, Springer, Cham, 2026

Abstract

Carbon dioxide (CO$_2$) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboFormer, a lightweight semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO$_2$ emissions across diverse applications. Our approach integrates an optimized encoder-decoder architecture with specialized multi-scale feature fusion and auxiliary supervision strategies to effectively model both local details and global relationships in gas plume imagery while achieving competitive accuracy with minimal computational overhead for resource-constrained environments. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboFormer achieves competitive performance with 84.88\% mIoU on CCR and 92.98\% mIoU on RTA, while maintaining computational efficiency with only 5.07M parameters and operating at 84.68 FPS. The model shows particular effectiveness in challenging low-flow scenarios and significantly outperforms other lightweight methods like SegFormer-B0 (83.36\% mIoU on CCR) and SegNeXt (82.55\% mIoU on CCR), making it suitable for real-time monitoring on resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust and efficient tools for CO$_2$ emission analysis.
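
The mIoU figures quoted above follow the usual segmentation definition: per-class intersection over union, averaged over classes. A minimal sketch, assuming flattened integer label lists and skipping classes that appear in neither prediction nor ground truth (that skipping rule is a common convention, not something the abstract states):

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # conf[t][p] counts pixels with ground-truth class t predicted as class p.
    conf = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        conf[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        fn = sum(conf[c]) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("no class present in either input")
    return sum(ious) / len(ious)
```

For a two-class plume/background task, a single mislabeled background pixel lowers the IoU of both classes, which is why small gains in mIoU on thin, low-flow plumes can be meaningful.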

2505.24876 2026-05-26 cs.CV cs.CL Updated version

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Central Florida; University of Oxford

AI summary: Proposes the Agent-X benchmark, which uses 828 real-world visual tasks and a fine-grained, step-level evaluation framework to show that current models achieve less than 50% full-chain success in multi-step visual reasoning.

Comments: Accepted at the International Conference on Learning Representations (ICLR 2026)

Abstract

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X

2505.03631 2026-05-26 cs.CV Updated version

Generalizable Video Quality Assessment via Weak-to-Strong Learning

Linhan Cao, Wei Sun, Xiangyang Zhu, Kaiwei Zhang, Jun Jia, Yicong Peng, Dandan Zhu, Guangtao Zhai, Xiongkuo Min

Affiliations: Shanghai Jiao Tong University; East China Normal University; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes a weak-to-strong learning framework that combines homogeneous and heterogeneous supervision signals with iterative training to improve the generalization of video quality assessment without human annotation.

Comments: Accepted by CVPR 2026

Abstract

Video quality assessment (VQA) seeks to predict the perceptual quality of a video in alignment with human visual perception, serving as a fundamental tool for quantifying quality degradation across video processing workflows. The dominant VQA paradigm relies on supervised training with human-labeled datasets, which, despite substantial progress, still suffers from poor generalization to unseen video content. In this work, we explore weak-to-strong (W2S) learning as a new paradigm for advancing VQA without reliance on human-labeled datasets. We first provide empirical evidence that a straightforward W2S strategy allows a strong student model to not only match its weak teacher on in-domain benchmarks but also surpass it on out-of-distribution (OOD) benchmarks, revealing a distinct weak-to-strong effect in VQA. Building on this insight, we propose a novel framework that enhances W2S learning from two aspects: (1) integrating homogeneous and heterogeneous supervision signals from diverse VQA teachers -- including off-the-shelf VQA models and synthetic distortion simulators -- via a learn-to-rank formulation, and (2) iterative W2S training, where each strong student is recycled as the teacher in subsequent cycles, progressively focusing on challenging cases. Extensive experiments show that our method achieves state-of-the-art results across both in-domain and OOD benchmarks, with especially strong gains in OOD scenarios. Our findings highlight W2S learning as a principled route to break annotation barriers and achieve scalable generalization in video quality assessment. Our data and code will be available at https://github.com/clh124/W2S-VQA.
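
The abstract mentions a learn-to-rank formulation for combining teacher signals. One common way to train from relative judgments is a pairwise logistic (RankNet-style) loss, sketched below. This is an illustrative stand-in: the abstract does not specify the paper's exact loss, and the function name and arguments are assumptions.

```python
import math


def pairwise_rank_loss(score_a, score_b, prefer_a):
    """Logistic loss on the score gap for one ordered pair of videos.

    prefer_a is True when the teacher ranks video A above video B.
    The loss shrinks as the student's scores agree with that ordering.
    """
    margin = score_a - score_b if prefer_a else score_b - score_a
    return math.log1p(math.exp(-margin))
```

A pairwise loss of this kind only needs teachers to agree on which of two videos looks better, so heterogeneous teachers with different score scales can still contribute supervision.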

2504.15404 2026-05-26 cs.CV Updated version

Context Aware Grounded Teacher for Source Free Object Detection

Tajamul Ashraf, Rajes Manna, Partha Sarathi Purkayastha, Tavaheed Tariq, Janibul Bashir

Affiliations: Department of Computer Vision; MBZUAI; Microsoft Research India; GAASH Research Lab; NIT Srinagar

AI summary: To address the context bias and noisy pseudo-labels caused by class imbalance in source-free object detection, proposes Grounded Teacher, a bias-aware framework built on a Relational Context Module and Semantic Augmentation that uses relational regularization and semantic augmentation to improve minority-class detection.

Comments: Accepted in International Journal of Computer Vision (IJCV); Project Webpage: https://tajamul21.github.io/Grounded_Teacher/

Abstract

Source-free object detection (SFOD) faces persistent challenges due to class imbalance-driven context bias and instability in teacher-student training under noisy pseudo-labels. Existing techniques tend to ignore context bias and class-imbalance shifts, especially in medical data. To tackle this, we propose Grounded Teacher (GT), a bias-aware source-free framework that grounds the teacher model through relational and semantic regularization. To explicitly model directional confusion between classes, GT introduces a Relational Context Module (RCM) that maintains an exponential moving average (EMA) estimate of cross-domain contextual bias. Building upon this, a Semantic Augmentation (SA) strategy selectively augments minority and confusable classes through adaptive MixUp in both source-similar and source-dissimilar target regions, improving minority recall without overfitting dominant categories. To stabilize learning under biased pseudo-labels, we design a Semantic-Aware Loss (SAL) that applies diagonally normalized weights, preventing gradient explosion while emphasizing minority-majority corrections. Additionally, a frozen Expert branch derived from large vision foundation models (LVFMs) serves as a supervisory reference during training, refining pseudo-label quality without adding inference overhead. GT's behavior-driven bias quantification makes it broadly applicable across domains without relying on dataset priors. Evaluations on Cityscapes-to-Foggy (50.8 mAP) and medical transfers (+5.9 AP50 on DDSM-to-INBreast) show consistent gains and improved minority-class detection, with less than 12\% additional training cost. Code and model are available at https://github.com/Tajamul21/Grounded-Teacher.
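
The exponential moving average (EMA) that the Relational Context Module is said to maintain can be sketched generically as below. How the per-batch confusion observation is computed is not described in the abstract, so `observation` is a hypothetical input here, as are the function name and the momentum default.

```python
def ema_update(estimate, observation, momentum=0.99):
    """Blend a new class-confusion matrix into a running EMA estimate.

    estimate, observation: square matrices (lists of rows) where entry
    [i][j] measures how often class i is confused as class j.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [
        [momentum * e + (1 - momentum) * o for e, o in zip(e_row, o_row)]
        for e_row, o_row in zip(estimate, observation)
    ]
```

Because the matrix is not symmetrized, entry [i][j] can differ from [j][i], which is one way to represent the directional confusion the abstract refers to.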

2504.00816 2026-05-26 cs.CV physics.med-ph Updated version

Two-stage deep learning framework for the restoration of incomplete-ring PET images

Yeqi Fang, Rong Zhou

Affiliations: College of Physics, Sichuan University

AI summary: Proposes a two-stage deep learning framework that, without time-of-flight information, uses a projection-domain Attention U-Net to predict the missing sinogram sections and a cascaded U-Net with a warm-start diffusion model for image refinement, restoring high-quality PET images from incomplete-ring data with about 50% of coincidences missing.

Comments: 17 pages, 5 figures

Abstract

Positron Emission Tomography (PET) is an important molecular imaging tool widely used in medicine. Traditional PET systems rely on complete detector rings for full angular coverage and reliable data collection. However, incomplete-ring PET scanners have emerged due to hardware failures, cost constraints, or specific clinical needs. Standard reconstruction algorithms often suffer from performance degradation with these systems because of reduced data completeness and geometric inconsistencies. We present a two-stage deep-learning framework that, without incorporating any time-of-flight (TOF) information, restores high-quality images from data with about 50% missing coincidences, double the loss levels previously addressed by CNN-based methods. The pipeline operates in two stages: a projection-domain Attention U-Net first predicts the missing sections of the sinogram by leveraging spatial context from neighbouring slices, after which the completed data are reconstructed with the OSEM algorithm and passed to a cascaded U-Net and warm-start diffusion model for image refinement. This module starts the reverse diffusion process from the U-Net coarse prediction rather than pure Gaussian noise. Using 613 simulated brain volumes from real scans (196 healthy brain samples, 217 Alzheimer's disease samples, and 200 Mild Cognitive Impairment samples), the results show that our model successfully preserves most anatomical structures and tracer distribution features, with PSNR of 38.18 to 38.59 dB and SSIM of 0.9904 to 0.9925. Our two-stage deep-learning framework effectively restores high-quality PET images from over 50% incomplete-ring data, achieving near-complete anatomical fidelity and robust performance without requiring TOF information.
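
The warm-start idea, beginning reverse diffusion from a coarse prediction instead of pure noise, is often realized by noising the coarse image to an intermediate step $t$ with the standard forward formula $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$. The sketch below assumes a DDPM-style linear beta schedule and flattened pixel lists; the schedule constants and function names are assumptions, and the abstract does not say which schedule or starting step the authors use.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) over the first t steps (linear schedule)."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def warm_start(coarse, t, rng, **schedule):
    """Noise a coarse prediction to step t; reverse diffusion would start here."""
    a = alpha_bar(t, **schedule)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in coarse]
```

Choosing a small `t` keeps most of the coarse structure and leaves fewer denoising steps to run, while a large `t` approaches the usual pure-noise start; passing a seeded `random.Random` makes the result reproducible.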

2503.23670 2026-05-26 cs.CV Updated version

Learning Bijective Surface Parameterization for Inferring Signed Distance Functions from Sparse Point Clouds with Grid Deformation

Takeshi Noda, Chao Chen, Junsheng Zhou, Weiqi Zhang, Yu-Shen Liu, Zhizhong Han

Affiliations: School of Software, Tsinghua University, Beijing, China; Department of Computer Science, Wayne State University, Detroit, USA

AI summary: Proposes a dynamic deformation network that combines bijective surface parameterization with grid deformation optimization to predict signed distance functions from sparse point clouds end to end, significantly outperforming existing methods.

Comments: Accepted by Conference on Computer Vision and Pattern Recognition (CVPR) 2025. Project page: https://takeshie.github.io/Bijective-SDF

Abstract

Inferring signed distance functions (SDFs) from sparse point clouds remains a challenge in surface reconstruction. The key lies in the lack of detailed geometric information in sparse point clouds, which is essential for learning a continuous field. To resolve this issue, we present a novel approach that learns a dynamic deformation network to predict SDFs in an end-to-end manner. To parameterize a continuous surface from sparse points, we propose a bijective surface parameterization (BSP) that learns the global shape from local patches. Specifically, we construct a bijective mapping for sparse points from the parametric domain to 3D local patches, integrating patches into the global surface. Meanwhile, we introduce grid deformation optimization (GDO) into the surface approximation to optimize the deformation of grid points and further refine the parametric surfaces. Experimental results on synthetic and real scanned datasets demonstrate that our method significantly outperforms the current state-of-the-art methods. Project page: https://takeshie.github.io/Bijective-SDF

2412.07333 2026-05-26 cs.CV cs.AI Updated version

The Wisdom of a Crowd of Brains: A Universal Brain Encoder

Roman Beliy, Navve Wasserman, Amit Zalcher, Michal Irani

Affiliations: Weizmann Institute of Science

AI summary: Proposes a Universal Brain-Encoder built on a voxel-centric architecture that uses cross-attention to train jointly on fMRI data from multiple subjects, datasets, and machines, improving individual encoding performance and enabling fast transfer learning.

Abstract

Image-to-fMRI encoding is important for both neuroscience research and practical applications. However, such "Brain-Encoders" have typically been trained per subject and per fMRI dataset, and are thus restricted to very limited training data. In this paper we propose a Universal Brain-Encoder, which can be trained jointly on data from many different subjects/datasets/machines. What makes this possible is our new voxel-centric Encoder architecture, which learns a unique "voxel-embedding" per brain-voxel. Our Encoder trains to predict the response of each brain-voxel on every image, by directly computing the cross-attention between the brain-voxel embedding and multi-level deep image features. This voxel-centric architecture allows the functional role of each brain-voxel to naturally emerge from the voxel-image cross-attention. We show the power of this approach to (i) combine data from multiple different subjects (a "Crowd of Brains") to improve each individual brain-encoding, (ii) enable quick and effective transfer learning across subjects, datasets, and machines (e.g., 3-Tesla, 7-Tesla) with few training examples, and (iii) use the learned voxel-embeddings as a powerful tool to explore brain functionality (e.g., what is encoded where in the brain).

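2306.02216 2026-05-26 cs.LG cs.CV Updated version

Forgettable Federated Linear Learning with Certified Data Unlearning

Ruinan Jin, Minghui Chen, Qiong Zhang, Xiaoxiao Li

Affiliations: University of British Columbia; Vector Institute; Renmin University of China

AI summary: Proposes a federated unlearning framework based on a linear approximation of a pretrained model, using federated linear training to achieve efficient, secure, and certifiable unlearning of client data.

Comments: IEEE Transactions on Neural Networks and Learning Systems

DOI: 10.1109/TNNLS.2026.3683398
Journal ref: IEEE Transactions on Neural Networks and Learning Systems, Early Access, pp. 1-10, 2026

Abstract (translated from the Chinese summary)

Federated learning (FL) enables collaborative model training across distributed clients while preserving user privacy. Recently, federated unlearning (FU) has emerged to address the "right to be forgotten" and to remove the influence of poisoned or targeted clients without retraining the entire FL system. However, many FU methods require communication with the remaining or target clients, which introduces additional security risks, or store historical models, which limits their efficiency and practicality. Moreover, owing to the complexity of nonlinear models and their training dynamics, most FU methods for deep neural networks (DNNs) lack theoretical certification. In this work, we introduce Forgettable Federated Linear Learning, a training and unlearning framework for DNNs. Our method linearly approximates the DNN using a pretrained model and achieves performance comparable to the original network through federated linear training. We further propose a certified, efficient, and secure unlearning strategy that lets the server remove a target client's influence without additional client communication or storage. Extensive experiments with convolutional neural networks and modern foundation models, on small- to large-scale datasets, show that our method balances model accuracy with effective unlearning of the target client. This work provides a practical pipeline for efficient and trustworthy FU. Code: https://github.com/Nanboy-Ronan/2F2L-Federated-Unlearning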

Abstract

We present Artiverse, a diverse and physically grounded dataset of high-quality articulated 3D objects designed for realistic functional modeling and simulation. Artiverse contains 5.4K human-authored objects across a broad range of 88 categories, aggregated from multiple 3D static repositories. Objects are annotated with functional parts, interior structures, realistic kinematic relationships and articulated joints including multi-DoF joints, and physical attributes such as metric scale, material, and mass. We develop a semi-automated annotation pipeline that combines few-shot segmentation, geometric reasoning, and multi-stage human verification to achieve high-quality and efficient annotation, reducing manual annotation time by over 30%. We demonstrate the value of Artiverse on tasks of part mobility analysis, articulated object generation, and physics-based interaction. Artiverse provides a data resource to advance functional understanding for articulated objects.

2605.24402 2026-05-26 cs.CV Updated version

A Graph Neural Network Image Classification Method Based on Gaussian Rank-based Neighborhood Degree

Rafael Mendonça Duarte, Jean Roberto Ponciano, Lucas Pascotti Valem

Affiliations: Institute of Mathematics and Computer Science (ICMC); University of São Paulo (USP); São Carlos, SP, Brazil

AI summary: Proposes GRaNDe (Gaussian Rank-based Neighborhood Degree), which improves degree normalization in graph neural networks by combining neighborhood ranking with Gaussian distance weighting, yielding consistent accuracy gains on five public image classification datasets.

Abstract

The exponential growth of data has intensified the gap between the availability of unlabeled data and the high cost of manual annotation. Graph Neural Networks (GNNs) have emerged as a promising solution, as they exploit relational structures and learn from both labeled and unlabeled data, performing semi-supervised learning. A crucial component of many of these models is degree-based normalization, which influences message propagation but typically assumes uniform importance among neighboring nodes. In image classification, graphs are usually constructed from feature similarity, where treating all neighbors equally may overlook important variations in relevance. Motivated by this gap, we propose GRaNDe (Gaussian Rank-based Neighborhood Degree). This novel degree measure integrates neighborhood ranking with Gaussian distance weighting to better capture node importance. Experiments on five public image classification datasets show consistent accuracy improvements and competitive or superior results compared to state-of-the-art methods.

2605.24354 2026-05-26 cs.CV Updated version

SparseWorld: Enhancing End-to-End Autonomous Driving via World Models with Sparse Scene Representation

Ruoyu Wang, Jingke Wang, Yukai Ma, Yuehao Huang, Shuangming Lei, Guanglin Xu, Aixue Ye, Yong Liu

Affiliations: Institute of Cyber-Systems and Control, Zhejiang University; 2012 Labs, Huawei; State Key Laboratory of Industrial Control Technology

AI summary: Proposes SparseWorld, a lightweight world model built on sparse scene representations that autoregressively forecasts future map elements and surrounding agents and uses the forecasts to refine downstream motion prediction and trajectory planning, reaching a 0.05% collision rate on nuScenes, the state of the art on open-loop planning metrics.

Abstract

Recently, world models have made significant progress in enhancing end-to-end driving systems through both future situation forecasting and improved scene understanding. However, existing driving world models are typically built upon dense scene representations, causing high computational costs and redundant information. In this paper, we present SparseWorld, a lightweight world model that focuses on predicting only the critical layout of the scene, enabling efficient future forecasting for end-to-end driving systems. SparseWorld first performs autoregressive rollout to forecast future map elements and surrounding agents, enabling the model to learn how driving scenarios evolve over time. It then leverages these predicted futures to refine downstream motion prediction and trajectory planning. Specifically, we propose a Sparse Dreamer that anticipates future instances in the latent space through joint temporal and spatial attention. By interacting with predicted future instances, the motion planner captures more accurate motion patterns and generates more informed and safety-aware trajectories. Extensive experiments demonstrate that SparseWorld significantly reduces collision risk and achieves state-of-the-art performance on the open-loop planning metrics of the nuScenes dataset with a collision rate of 0.05%. Moreover, it substantially outperforms the baseline method in closed-loop planning metrics on the Bench2Drive benchmark. Supplementary material is available at the project page: https://wryzju.github.io/SparseWorld/.


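2605.24353 2026-05-26 cs.CV q-bio.OT Updated version

CoDA: Color-Distribution Probing for Efficient and Generalizable AI-Generated Image Detection

Zexi Jia, Zhiqiang Yuan, Xiaoyue Duan, Jinchao Zhang, Jie Zhou, Anil K. Jain

Affiliations: Tencent WeChat AI; Department of Computer Science and Engineering, Michigan State University

AI summary: Proposes CoDA, a lightweight detector (only 1.48M parameters) based on color-distribution probing. A Noise-Quantization Probe captures the color non-uniformity of synthetic images, and the detector achieves the best performance on cross-model and cross-domain benchmarks.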

Abstract

AI-generated image detection faces a persistent trade-off between generalization and efficiency: lightweight artifact-based methods often degrade on unseen generators or domains, whereas more robust large-scale models are computationally expensive. Meanwhile, existing benchmarks mainly focus on cross-model evaluation in photorealistic settings, leaving cross-domain robustness underexplored. To address this gap, we introduce FakeForm, a large-scale benchmark with approximately 370,000 images across 62 diverse domains for both cross-model and cross-domain evaluation. Motivated by this broader setting, we revisit color-distribution probing as an efficient complementary cue for AI-generated image detection. We observe that, especially for photographic content, real photographs tend to exhibit smoother and more stable color patterns, whereas synthetic images often show characteristic color imbalances introduced by neural generation. Based on this observation, we propose CoDA, a compact 1.48M-parameter detector built on a Noise-Quantization Probe, together with a theoretical analysis linking probe responses to color non-uniformity. Experiments show that CoDA achieves state-of-the-art performance on standard benchmarks and the best results on the challenging cross-domain evaluation of FakeForm, while remaining highly competitive in cross-model photorealistic settings. These results suggest that persistent generative artifacts can provide a practical foundation for efficient and robust AI-generated image detection. The models and FakeForm benchmark will be made publicly available.


2605.24304 2026-05-26 cs.CV cs.AI Updated version

ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views

Inseo Lee, Yoonji Kim, Eugene Sohn, Jiwoong Lee, Jungmin You, Joonseok Lee, Jin-Hwa Kim

Affiliations: Seoul National University; Sogang University; NAVER AI Lab

AI summary: Proposes ArtSplat, the first feed-forward framework that reconstructs geometry and joint parameters in a single pass from sparse multi-view images spanning multiple articulation states. It introduces a per-pixel joint map representation and a cross-state attention mechanism, and runs over 400 times faster than baselines on PartNet-Mobility objects.

Abstract

Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and underlying articulation structure. Existing methods for articulated object reconstruction based on NeRF and 3D Gaussian Splatting (3DGS) typically rely on dense views or strong priors (e.g., depth maps, joint types, predefined number of joints) and require costly per-object optimization. In this paper, we propose ArtSplat, the first feed-forward framework for articulated 3D Gaussian Splatting. It reconstructs both geometry and joint parameters from sparse multi-view images across multiple articulation states in a single forward pass. To address the challenges of single-pass articulated reconstruction, we introduce a per-pixel joint map representation that enables the integration of joint parameter estimation into the feed-forward pipeline. We further propose a Cross-State Attention (CSA) mechanism with state tokens, which effectively captures discrete motion across input states. Experiments on 68 articulated objects from PartNet-Mobility, including both single- and multi-joint configurations, demonstrate that ArtSplat achieves competitive performance in both geometry and joint estimation, while being over 400 times faster than baselines.


2605.24273 2026-05-26 cs.CV physics.ao-ph Updated version

Plume Segmentation from MethaneSAT with Cross-Sensor Transfer Learning and Physics-Informed Postprocessing

Manuel Pérez-Carrasco, Maya Nasr, Zhan Zhang, Apisada Chulakadabba, Javier Roger, Raia Ottenheimer, Sébastien Roche, Maryann Sargent, Chris Chan Miller, Daniel Varon, Jack Warren, Luis Guanter, Kang Sun, Jonathan Franklin, Jia Chen, Cecilia Garraffo, Xiong Liu, Ritesh Gautam, Steven Wofsy

Affiliations: Center for Astrophysics | Harvard & Smithsonian; Environmental Defense Fund; Department of Earth and Planetary Sciences, Harvard University; Institute of Environmental Physics (IUP), University of Bremen; John A. Paulson School of Engineering and Applied Sciences, Harvard University

AI summary: Proposes a machine learning framework combining Mask R-CNN, cross-sensor transfer learning, and physics-informed post-processing to address label scarcity and inference reliability in MethaneSAT methane plume detection, providing two operating modes: high sensitivity and high precision.

Comments: 35 pages, 20 figures, 9 tables

Abstract

Automated detection and masking of individual methane plumes from satellite imagery is important for operational emission attribution and quantification. We present a machine learning framework for plume detection from MethaneSAT retrieved column-averaged dry-air mole fractions of methane. We address two core challenges: the scarcity of labeled MethaneSAT data and the need for inference reliability across diverse atmospheric and surface conditions. We first demonstrate that Mask R-CNN with a ResNet-50 backbone outperforms U-Net semantic segmentation on both MethaneAIR (an airborne version of MethaneSAT) and MethaneSAT data, with pixel-level F1 score gains of 10.49 and 5.48 respectively. To address MethaneSAT data scarcity, we evaluate three cross-sensor transfer strategies leveraging MethaneAIR flights and synthetic plumes. Mask R-CNN with ResNet-50 fine-tuned from MethaneAIR pre-trained weights is the most effective strategy, achieving instance-level precision of 0.60 and a near-perfect recall of 0.98 at the baseline operating point. A physics-informed post-processing pipeline converts detections into two operationally distinct modes. The first is a high-sensitivity mode that applies morphological filtering and proximity-based merging for comprehensive emission screening, achieving precision of 0.71 and recall of 0.94. The second is a high-precision mode that additionally applies a distribution-based classifier for confident source attribution, achieving precision of 0.92 and recall of 0.70. Manual review of detections classified as false positives against our wavelet-based ground truth labels reveals that a meaningful fraction of cases correspond to real methane enhancements excluded by conservative labeling criteria, indicating that precision values reported are lower bounds on true detection performance... Our data and code are available at: https://doi.org/10.7910/DVN/FR959H
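The instance-level precision and recall figures quoted above can be combined into a single F1 score to compare the three operating points. The snippet below is only an arithmetic sketch: the F1 values it produces are derived here from the quoted precision and recall and are not reported in the abstract (whose F1 gains are pixel-level, a different quantity), and the names `f1_score` and `operating_points` are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Instance-level (precision, recall) pairs quoted in the abstract.
operating_points = {
    "baseline": (0.60, 0.98),
    "high_sensitivity": (0.71, 0.94),
    "high_precision": (0.92, 0.70),
}

# Derived values, not figures reported by the authors.
f1_by_mode = {name: f1_score(p, r) for name, (p, r) in operating_points.items()}

for name, f1 in f1_by_mode.items():
    print(f"{name}: F1 = {f1:.2f}")
```

Under these figures the two post-processed modes land close together (about 0.81 and 0.80) and above the baseline (about 0.74), which is consistent with the modes trading precision against recall rather than one dominating the other.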


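2605.24251 2026-05-26 cs.LG cs.CV Updated version

Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions

Chad Weatherly, Sen Lin

Affiliations: University of Houston

AI summary: Addresses shortcomings of existing continual anomaly detection methods in evaluation, comparison, and edge-deployment constraints by proposing a unified benchmark and the training-free method DINOSaur, which outperforms all evaluated methods across multiple protocols and achieves fast inference and adaptation on edge devices.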

Abstract

Continual anomaly detection (CAD) addresses the need for industrial inspection systems to adapt to evolving production conditions, yet existing methods share three critical gaps: unrealistic evaluation, no systematic comparison, and no consideration of edge deployment constraints. We introduce a unified benchmark combining discrete-task evaluation on structural and logical anomalies, a novel continuous drift protocol, the first head-to-head comparison of all published CAD methods, and computational efficiency profiling on edge hardware. Our results reveal that existing CAD methods do not consistently outperform traditional approaches with simple experience replay. Thus motivated, we propose DINOSaur, a training-free method combining a frozen DINOv3 backbone with spatially-indexed coreset memory and neighborhood-restricted anomaly scoring. DINOSaur achieves zero forgetting by construction, outperforms all evaluated methods across all five protocols, and runs at sub-100 ms inference on an NVIDIA Jetson Orin Nano, with on-device adaptation to new tasks in under 30 seconds.
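One way to picture why a memory-based detector can avoid forgetting by construction is a patch memory that is only ever appended to. The sketch below is a toy illustration, not the DINOSaur implementation: the class `SpatialMemory`, the `radius` parameter, and the plain Euclidean distance are assumptions, the features would in practice come from a frozen backbone, and coreset subsampling is omitted.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SpatialMemory:
    """Features of normal patches, indexed by (row, col) grid position."""

    def __init__(self):
        self.bank = {}  # (row, col) -> list of stored feature vectors

    def add(self, position, feature):
        # Append-only: adapting to a new task never modifies earlier entries.
        self.bank.setdefault(position, []).append(list(feature))

    def score(self, position, feature, radius=1):
        """Distance to the nearest stored feature within `radius` grid cells."""
        r0, c0 = position
        candidates = [
            f
            for (r, c), feats in self.bank.items()
            if abs(r - r0) <= radius and abs(c - c0) <= radius
            for f in feats
        ]
        if not candidates:
            return float("inf")  # nothing comparable has been stored nearby
        return min(l2(feature, f) for f in candidates)


memory = SpatialMemory()
memory.add((0, 0), [0.0, 0.0])  # hypothetical normal patch features
memory.add((0, 1), [1.0, 0.0])
normal_score = memory.score((0, 0), [0.0, 0.1])
anomalous_score = memory.score((0, 0), [3.0, 4.0])
```

A query patch that resembles a stored neighbor scores low, while one far from every neighbor in its spatial window scores high. Restricting the search to nearby grid cells also keeps the per-patch lookup small, which matters on edge hardware.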


2605.24243 2026-05-26 cs.CV cs.AI stat.ML Updated version

GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer

Diogo Lavado, Alessandra Micheletti, Clàudia Soares

Affiliations: NOVA School of Science and Technology; Università degli Studi di Milano

AI summary: Proposes GIBLy, a lightweight geometric inductive bias layer that improves 3D segmentation by integrating learnable geometric priors, yielding consistent gains on multiple benchmarks with only a small number of added parameters.

Abstract

In 3D scene understanding, deep learning methods rely on large models and extensive training to capture basic geometric structures that are present in the 3D data. However, existing methods lack explicit mechanisms to incorporate geometric information, such as learnable primitive shapes, often necessitating larger models and more training data, which in turn increases cost and can limit generalization. We introduce GIBLy, a lightweight geometric inductive bias layer that integrates learnable geometric priors into 3D segmentation pipelines. GIBLy enhances existing architectures -- whether MLP-based, convolution-based, or transformer-based -- by providing features aligned with simple geometric shapes (and thus human-interpretable) that improve segmentation performance with minimal computational overhead. We validate our approach across multiple 3D semantic segmentation benchmarks, demonstrating consistent performance gains, including up to +11.5% mIoU on TS40K with PTV3, while adding only 58K extra parameters. Our results highlight the benefit of explicitly encoding geometric structure to support accurate and efficient 3D scene understanding, with a lightweight add-on layer.


2605.24201 2026-05-26 cs.CV physics.med-ph Updated version

Radiuma: A Unified Zero-Code Executable Graphical Workflow Generator for Reproducible and Shareable Medical Image Analysis and Machine Learning

Mohammad Salmanpour, Mehrdad Oveisi, Isaac Shiri, Arman Rahmim

Affiliations: Department of Basic and Translational Research, BC Cancer Research Institute; Department of Radiology, University of British Columbia; Technological Virtual Collaboration (TECVICO Corp.); Department of Computer Science, University of British Columbia; Department of Cardiology, Inselspital, Bern University Hospital, University of Bern; Department of Digital Medicine, University of Bern; Departments of Physics & Biomedical Engineering, University of British Columbia

AI summary: Proposes Radiuma, a modular platform whose visual workflow system integrates image processing, radiomics feature extraction, and machine learning modules, so that reproducible multi-step analysis pipelines can be built without programming.

Abstract

Medical image computing software is essential for identifying imaging biomarkers that can support diagnosis, prognosis, treatment planning, and clinical research. However, the lack of standardized, user-friendly, and reproducible software environments has limited the broader adoption of advanced medical image analysis workflows. We present Radiuma, a freely available modular platform designed to support reliable and reproducible medical image analysis across multiple modalities and file formats. Radiuma integrates image reading, visualization, registration, fusion, processing, segmentation, radiomics feature extraction, and machine learning modules for classification, regression, and clustering. Its modular design allows users to execute each component independently or connect modules through a visual workflow system, where the output of one step can be graphically passed to the next. This enables the creation of custom, executable, and reproducible multi-step pipelines without requiring extensive programming expertise. Results from each module can be inspected directly in the visualization window, providing immediate feedback on processing quality and workflow accuracy. Radiuma also supports saving and sharing customized workflows, promoting transparency, reusability, and consistency across collaborative studies. By combining flexibility, usability, and standardized analysis tools, Radiuma provides a practical environment for radiomics and machine learning research in clinical and translational settings. The platform is designed to be accessible to users with diverse expertise, including radiologists, physicists, clinicians, and data scientists.


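2605.24195 2026-05-26 cs.CV cs.LG Updated version

Single View Seafloor Recovery from Imaging Sonar via Differentiable Rendering

Sevan Brodjian, Michael Hobley, Pietro Perona

Affiliations: California Institute of Technology

AI summary: Proposes a training-free method that recovers seafloor terrain from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt. The authors describe it as the first differentiable rendering approach for single-view height recovery in sonar.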

Abstract

Sonar is often the only modality suitable for high-resolution imaging underwater due to light attenuation and turbidity. Forward-looking imaging sonar provides measurements over range and horizontal angle but collapses vertical structure into a flat image, creating ambiguities that make 3D recovery challenging. A common use case for imaging sonar is underwater terrain mapping (bathymetry), yet current methods require many views, expensive multi-sensor setups, or significant training data, which limits use and adaptability to new environments. We present a training-free method that recovers bathymetry from a single sonar image in under 30 seconds via differentiable rendering, conditioned on a known seafloor tilt. To our knowledge, this is the first differentiable rendering approach for single-view height recovery in sonar. Our method implements differentiable sonar ray tracing and optimizes an explicit height field to reproduce the target image. On synthetic datasets, our approach outperforms a supervised CNN under distribution shift and remains close on rough terrain, while the CNN wins in-distribution. By modeling physically grounded priors of the sonar process, our method adapts across sensor configurations and environments without training data.


2605.24192 2026-05-26 cs.LG cs.AI cs.CV Updated version

Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization

Matthew Niedoba, Berend Zwartsenberg, Frank Wood

Affiliations: University of British Columbia; Inverted AI; Alberta Machine Intelligence Institute

AI summary: Proposes Filtered Posterior Mean Collections (FPMCs), a unified framework that models the generalization behaviour of diffusion-model denoising functions through query precision vectors, response weights, and source distributions, and improves existing methods via soft relaxations and source-distribution augmentations.

Comments: 27 pages, 7 figures

Abstract

The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization behaviour across a wide variety of network architectures and training procedure hyperparameters. A recent line of research has sought to model the outputs of these networks by aggregating posterior weighted averages of training dataset patches. In this work, we consolidate these approaches into a unified model class which we call Filtered Posterior Mean Collections (FPMCs). We define this model class using query precision vectors, response weights, and source distributions, and illustrate that existing methods are recoverable with specific choices of these design axes. Investigating each axis in turn, we find that FPMC performance can be improved with soft relaxations of prior patch-based methods, and through augmentations of source distributions. Applying these findings to an existing FPMC, we demonstrate consistent sample improvement across three natural image datasets.


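2605.24176 2026-05-26 cs.CV Updated version

Loki: Representation over Architecture for Diffusion-Based Portrait Animation

Pouyan Navard, Sernam Lim

Affiliations: The Ohio State University; University of Central Florida

AI summary: Proposes Loki, which uses a parametric face model to decouple identity from expression and pose and preserves identity through lightweight key-value injection, improving driver-following in diffusion-based portrait animation while using fewer parameters and less training data.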

Abstract

Portrait animation transfers a driver clip's facial expression and head pose onto a single reference image while preserving the reference's identity. State-of-the-art diffusion systems address this by stacking trained modules for expression, pose, and identity in turn, paying for it in trainable parameters, proprietary corpora, and residual entanglement between the very axes the system is meant to control independently. This complexity compensates for an upstream choice -- learning facial expression and head pose from RGB, a representation in which identity, pose, and expression are inseparable without being learned apart. Loki steps out of RGB on the conditioning path. Driver expression and head pose are encoded by a face model whose parameter axes are identity-orthogonal by construction, then rasterised into a spatial map that the diffusion backbone consumes natively. Identity is routed separately through the diffusion backbone's own pretrained features via lightweight key-value injection. Because the parametric representation factorises identity from expression and pose, cross-ID reenactment reduces to a coefficient substitution at inference, requiring no cross-ID training data. Loki requires ~43% fewer inference parameters than leading diffusion baselines and is trained on 1496x fewer video samples. We define two metrics that directly measure whether the generated head pose trajectory and facial expression followed the driver's -- the questions portrait animation actually asks; Loki leads or co-leads on both.
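The claim that cross-identity reenactment "reduces to a coefficient substitution" can be pictured with a small data structure. This is a hedged sketch of one reading of that sentence, not the paper's code: the `FaceParams` container, its three fields, and the `reenact` helper are hypothetical names, and the tuples stand in for whatever coefficients the actual parametric face model uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical factorised face-model coefficients."""
    identity: tuple    # shape coefficients tied to the person
    expression: tuple  # expression coefficients
    pose: tuple        # head rotation, e.g. (yaw, pitch, roll)


def reenact(reference: FaceParams, driver: FaceParams) -> FaceParams:
    """Keep the reference identity; take expression and pose from the driver."""
    return replace(reference, expression=driver.expression, pose=driver.pose)


reference = FaceParams(identity=(0.3, -0.1), expression=(0.0, 0.0), pose=(0.0, 0.0, 0.0))
driver_frame = FaceParams(identity=(-0.8, 0.5), expression=(0.6, 0.2), pose=(10.0, -5.0, 0.0))

conditioning = reenact(reference, driver_frame)
```

Because the axes are assumed independent, no step in `reenact` ever reads the driver's identity coefficients. In the abstract's description the resulting expression and pose would then be rasterised into a spatial conditioning map, a step this sketch leaves out.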


2605.24159 2026-05-26 cs.CV Updated version

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

Filippos Bellos, Yutong Li, Jessie N Dong, Zaiyang Guo, Emily Mackay, Yayuan Li, Yannis Avrithis, Alison Pouch, Jason J. Corso

Affiliations: University of Michigan; University of Pennsylvania; Independent Scientist

AI summary: Because acquiring and interpreting point-of-care cardiac ultrasound images depends on expertise, the paper introduces EchoVQA, the first large-scale echocardiography VQA dataset (14,299 images and 74,819 question-answer pairs), and develops a parameter-efficient method based on multimodal learnable prompts that achieves state-of-the-art performance on most benchmarks.

Abstract

Point-of-care transthoracic echocardiography (TTE) enables cardiac assessment in virtually any clinical setting, yet its diagnostic utility remains constrained by the expertise required for image acquisition and interpretation. Visual question answering (VQA) offers a promising paradigm for bridging this expertise gap through interactive clinical assistance, but existing echocardiography VQA datasets are limited in scale, restricted to high-quality images, and only cover a few views. We introduce EchoVQA, the first large-scale VQA dataset for echocardiography, comprising 14,299 images and 74,819 question-answer pairs. The dataset integrates public sources (EchoNet-Dynamic, CAMUS) with our own point-of-care acquisitions from two handheld probes (Lumify, Clarius), spanning diverse views and including both high-quality and suboptimal images. Uniquely, EchoVQA includes acquisition guidance questions to help users optimize transducer positioning toward a diagnostic apical 4-chamber view for left ventricular ejection fraction estimation -- a challenging task for novice operators in point-of-care settings. We further develop a parameter-efficient method based on multimodal learnable prompts, achieving state-of-the-art performance on most benchmarks, including EchoVQA, with significantly fewer trainable parameters than existing state-of-the-art approaches.


2605.24128 2026-05-26 cs.CV Updated version

ImPartial: Multi-channel Whole-Cell Segmentation using Partial Annotations

Gunjan Shrivastava, Saad Nadeem

Affiliations: Memorial Sloan Kettering Cancer Center

AI summary: Proposes the ImPartial framework, which uses self-supervised multi-channel quantized imputation to achieve whole-cell segmentation performance comparable to fully supervised models under sparse annotations and limited supervision.

Comments: MICCAI'26 Early Accept

Abstract

Accurate cell segmentation in pathology images typically requires dense pixel-wise annotations, which are costly and time-consuming to obtain. This challenge is especially important for emerging biological imaging modalities and multiplexed datasets with variable channel configurations, where expert-labeled data are scarce. In this work, we introduce ImPartial, a deep learning framework designed to achieve state-of-the-art segmentation performance in low-annotation regimes using sparse scribbles and limited supervision. ImPartial augments the segmentation objective via self-supervised multi-channel quantized imputation. This approach leverages the observation that perfect pixel-wise reconstruction or denoising of the image is not needed for accurate segmentation, and thus introduces a self-supervised classification objective that better aligns with the overall segmentation goal. We demonstrate that ImPartial achieves performance on par with fully supervised models while requiring substantially fewer annotations. Extensive experiments on benchmark multiplexed cellular imaging and single-plex clinical brightfield immunohistochemistry datasets show consistent improvements over strong baselines with only partial annotations. All benchmark datasets and code are available via our GitHub: https://github.com/nadeemlab/ImPartial.


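2605.24114 2026-05-26 cs.CV Updated version

Distance-Aware Joint Spatio-Temporal Graph Contrastive Learning for Major Depressive Disorder Diagnosis

Muhammad Asif Hasan, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

Affiliations: School of Information Technology and Communication, Griffith University, Australia

AI summary: To address noisy dynamic functional connectivity estimates, under-use of frequency-domain information, and separate modeling of space and time in major depressive disorder diagnosis, the paper proposes HWSTCL, a joint spatio-temporal graph contrastive learning framework with a Hawkes-process-inspired prior. Spectral node descriptors, exponential distance-decay edge weights, and a kernel-weighted contrastive objective yield reliable spatio-temporal representations and improve diagnostic performance.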

Abstract

Major depressive disorder (MDD) is a common neuropsychiatric condition whose accurate diagnosis from resting-state functional magnetic resonance imaging (rs-fMRI) remains difficult. Dynamic functional connectivity (DFC) captures time-varying interactions among brain regions and provides rich spatio-temporal information, yet current DFC-based methods face three limitations: sliding-window Pearson correlation yields noisy estimates sensitive to window length and motion artifacts; correlation-derived node features do not fully exploit frequency-domain properties of blood-oxygen-level-dependent (BOLD) signals; and most spatio-temporal graph models handle spatial structure and temporal dynamics in separate stages, restricting their ability to represent coupled brain network evolution. To overcome these issues, we reformulate DFC learning as joint spatio-temporal graph representation learning under a Hawkes-process-inspired temporal dependency prior and propose HWSTCL, a two-stage framework built on a reliability-refined joint spatio-temporal graph with a kernel-weighted pretraining objective. Within each temporal window, BOLD signals are encoded as spectral node descriptors and functional edges are refined by an exponential distance-decay prior that down-weights less reliable long-range connections. The joint graph is then formed by linking each region to itself across future windows through a Hawkes-inspired exponential kernel, allowing spatial and temporal information to be propagated together during message passing. A kernel-weighted contrastive objective further promotes temporal consistency for each region across windows while reducing redundant similarity between different regions. Experiments on a benchmark rs-fMRI dataset show that HWSTCL outperforms recent baselines and yields coherent spatio-temporal representations for MDD diagnosis.
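The temporal links described above, where each region is connected to itself in later windows with a weight that decays exponentially, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the kernel form `alpha * exp(-beta * lag)`, the default values of `alpha` and `beta`, the `max_lag` cutoff, and the function names are all hypothetical and are not taken from the paper.

```python
import math


def temporal_edge_weight(lag: int, alpha: float = 1.0, beta: float = 0.5) -> float:
    """Exponential kernel alpha * exp(-beta * lag) for a positive window lag."""
    if lag <= 0:
        raise ValueError("only links to future windows (lag >= 1) are modeled")
    return alpha * math.exp(-beta * lag)


def temporal_edges(num_regions, num_windows, max_lag=2, alpha=1.0, beta=0.5):
    """Link each region to itself in later windows.

    Nodes are (window, region) pairs; the result maps (source, target) to a weight.
    """
    edges = {}
    for t in range(num_windows):
        for lag in range(1, max_lag + 1):
            if t + lag >= num_windows:
                break  # no window that far in the future
            weight = temporal_edge_weight(lag, alpha, beta)
            for region in range(num_regions):
                edges[((t, region), (t + lag, region))] = weight
    return edges


edges = temporal_edges(num_regions=2, num_windows=3)
```

Adding these edges to the per-window spatial edges gives a single graph, so one round of message passing can mix information across regions and across time. Nearer windows receive larger weights than distant ones.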


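2605.24065 2026-05-26 cs.CV Updated version

Mode-as-Sequence: Turning Multimodal Motion Forecasting into Unified Sequential Mode Modeling

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan, Yung-Hui Li, Chun Jason Xue, Jianping Wang

Affiliations: City University of Hong Kong; City University of Hong Kong (Dongguan); Hon Hai Research Institute; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the Mode-as-Sequence framework, which turns an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Its two instantiations, ModeSeq and Parallel ModeSeq, address mode collapse and confidence ranking in multimodal motion forecasting and achieve leading results in the Waymo Open Dataset challenges.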

Abstract

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.
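The masked mode-to-mode self-attention mentioned above keeps the sequential dependency (each mode sees only earlier modes) while allowing all modes to be computed at once. The sketch below shows just the masking idea with plain lists; the function names are illustrative, the scores are made-up numbers, and nothing here reproduces the actual Parallel ModeSeq architecture.

```python
import math


def mode_causal_mask(num_modes: int):
    """mask[i][j] is True when mode i may attend to mode j, i.e. j <= i."""
    return [[j <= i for j in range(num_modes)] for i in range(num_modes)]


def masked_softmax(scores, allowed):
    """Softmax over allowed positions only; masked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = mode_causal_mask(3)
# Attention weights for the second mode: it can see modes 0 and 1, not mode 2.
weights_for_mode_1 = masked_softmax([1.0, 2.0, 3.0], mask[1])
```

Since every row of the mask is known ahead of time, all rows can be evaluated in one pass, which is what removes the mode-by-mode loop while preserving the same ordering constraint.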


2605.24025 2026-05-26 cs.CV cs.LG version update

MGVQ: Vector Quantization with Collaborative Multi-Dimensional Sensitivity Perception and Gradient-Hessian Fusion

Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

Affiliation: Bauman Moscow State Technical University

AI summary: Proposes MGVQ, a framework that combines sensitivity-guided structured mixed-precision quantization with gradient-aware second-order error compensation to enable ultra-low-bit vector quantization of vision-language models, improving accuracy by up to 4.9 points under 2-bit quantization.

Abstract

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in ultra-low-bit representation, which maps model weights to discrete codewords in a compact codebook to cut memory consumption and transmission overhead while preserving model capability. Direct VQ application to VLMs still has two core limitations. First, cross-modality weight distribution differences brought by visual and textual inputs cannot be well fitted by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, causing weight deviation from pre-trained optimal states, gradient drift and biased compensation results. This work proposes MGVQ, a novel vector quantization framework integrating multi-dimensional sensitivity perception and gradient-Hessian fusion. It consists of two core modules: sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity via combined global and local sensitivity analysis for refined resource allocation; gradient-aware second-order error compensation embeds first-order gradients into error correction, and adopts Kronecker and Block-LDL decomposition to ensure low computational cost. Extensive experiments on mainstream VLMs including LLaVA-onevision, InternVL2 and Qwen2-VL verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ surpasses existing advanced post-training quantization methods significantly, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method realizes stable and efficient ultra-low-bit VLM quantization, greatly promoting the practical deployment of multimodal large models in resource-limited environments.
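As a reference point for the vector quantization step described in the abstract, the sketch below maps weight sub-vectors to their nearest codewords. It is a plain nearest-neighbour VQ under stated assumptions, not MGVQ itself: the random codebook, the sub-vector length `dim`, and the function name `vector_quantize` are all hypothetical, and the sensitivity-guided bit allocation and gradient-aware compensation are not modelled.

```python
import numpy as np

def vector_quantize(weights: np.ndarray, codebook: np.ndarray, dim: int):
    """Map each length-`dim` sub-vector of `weights` to its nearest codeword."""
    vectors = weights.reshape(-1, dim)                          # (N, dim)
    # Squared Euclidean distance from every sub-vector to every codeword.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)                              # (N,) codeword ids
    dequantized = codebook[indices].reshape(weights.shape)      # reconstructed weights
    return indices, dequantized

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8)).astype(np.float32)                 # toy weight matrix
dim = 4
codebook = rng.normal(size=(256, dim)).astype(np.float32)       # 256 entries: 8-bit index
indices, W_hat = vector_quantize(W, codebook, dim)

# One 8-bit index covers 4 weights, so storage is 2 bits per weight
# (ignoring the cost of storing the codebook itself).
bits_per_weight = np.log2(len(codebook)) / dim
```

In practice the codebook would be fitted to the weights (for example with k-means) rather than drawn at random; the random codebook here only keeps the example short.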


2605.24014 2026-05-26 cs.CV version update

SkySeg: Collaborative Onboard Semantic Segmentation with Heterogeneous UAVs in the Wild

Anqi Lu, Yun Cheng, Youbing Hu, Zhiqiang Cao, Jie Liu, Zhijun Li

Affiliations: Harbin Institute of Technology; ETH Zurich; Huawei Cloud Algorithm Innovation Lab

AI summary: Targets real-time semantic segmentation on resource-constrained UAVs in dynamic environments. Proposes SkySeg, a heterogeneous multi-UAV air-to-air collaboration framework that combines efficient information-fusion inference with a cross-device test-time adaptation strategy, enabling onboard segmentation with low-cost sensors; it speeds up inference by about 3.6x and improves accuracy by 5.91%.

Abstract

The demand for unmanned aerial vehicle (UAV)-based image acquisition and analysis has surged, and UAVs are increasingly used for semantic segmentation tasks. To meet the real-time analysis requirements of UAV remote sensing missions, a natural approach is to compute onboard and make decisions based on the results. However, deploying semantic segmentation on resource-constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real-time semantic segmentation, and 2) environmental variations during flight shift the data distribution away from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi-UAV air-to-air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low-cost sensors. SkySeg employs an efficient information fusion inference method that combines low-definition, wide-area images with high-definition, focused-area images. It also incorporates a cross-device test-time adaptation (TTA) strategy that enhances segmentation performance in dynamic environments by collaboratively addressing distribution shifts in the test data streams across UAVs. Experimental results demonstrate that SkySeg speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.


2605.24012 2026-05-26 cs.CV version update

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

Si Li, Yuanqing He, Chenkai Hu, Xiaogang Guo, Huay-Cheem Tan, Chieh Yang Koo, Xuan Zhang, Lei He, Jingyuan Zeng, Shan Xiao

Affiliations: School of Artificial Intelligence and Digital Economy Industry, Guangzhou Institute of Science and Technology; Department of Cardiology, Second Affiliated Hospital, Jiangxi Medical College, Nanchang University; Department of Cardiology, First Affiliated Hospital of Zhejiang University School of Medicine; Department of Cardiology, National University Heart Centre

AI summary: Proposes the DL-TMPFC framework, which combines a stenosis detection network with a territory-aware segmentation network to compute the TIMI myocardial perfusion frame count automatically from coronary angiography, giving an objective measure of microvascular dysfunction; validation shows close agreement with expert manual measurements.

Comments: 15 pages, 8 figures

Abstract

Aims: Coronary microvascular dysfunction (CMVD) affects approximately 40%-60% of patients with ischemia and non-obstructive coronary arteries, yet diagnosis remains challenging because it relies on invasive functional testing or the subjective Thrombolysis In Myocardial Infarction (TIMI) flow grade. The TIMI Myocardial Perfusion Frame Count (TMPFC) offers an objective, angiography-based quantitative measure of CMVD, but its clinical translation is hindered by cumbersome manual calculation and insufficient validation. This study aims to develop and validate a deep learning-powered TMPFC calculation (DL-TMPFC) that can be integrated into clinical workflows. Methods and results: The DL-TMPFC framework comprised two components. A stenosis detection network first excluded obstructive coronary artery disease (CAD). A territory-aware segmentation network then identified perfusion territories, and a TMPFC calculation module automatically determined the first and last frames from angiographic sequences. The framework was validated in a cohort of 655 patients (445 with obstructive CAD, 100 with confirmed CMVD, and 110 controls) from three independent institutions. DL-TMPFC showed excellent agreement with expert manual measurements (bias: -0.93 frames; 95% limits of agreement (LoA): -5.33 to +3.47; r = 0.98). DL-TMPFC markedly enhanced clinical feasibility by fully automating TMPFC and removing observer dependence. Clinically, DL-TMPFC accurately identified CMVD across a full spectrum of coronary pathologies and captured the continuous severity of CMVD beyond binary classification, enabling quantitative risk stratification. Conclusion: DL-TMPFC enabled automatic, standardized, and accurate quantification of CMVD directly from routine angiography. By providing an automatic and objective measure, the tool provided immediate diagnostic information for timely recognition and management of CMVD in clinical practice.
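The agreement figures in the abstract (bias and 95% limits of agreement) are the quantities of a Bland-Altman analysis. The sketch below assumes the conventional definition, bias ± 1.96 standard deviations of the paired differences, which the abstract does not state explicitly; the function name `bland_altman` and the toy frame counts are illustrative only. Under that assumption, the reported interval implies a standard deviation of roughly 2.24 frames.

```python
import numpy as np

def bland_altman(automatic, manual):
    """Return (bias, lower LoA, upper LoA) for paired measurements.

    Assumes the conventional 95% limits: bias +/- 1.96 * SD of the differences.
    """
    diffs = np.asarray(automatic, dtype=float) - np.asarray(manual, dtype=float)
    bias = diffs.mean()
    sd = diffs.std(ddof=1)                      # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Toy paired frame counts (made up for illustration, not study data).
bias, lower, upper = bland_altman([10, 12, 14], [11, 12, 15])

# Back out the SD implied by the values reported in the abstract.
reported_bias, reported_lower, reported_upper = -0.93, -5.33, 3.47
implied_sd = (reported_upper - reported_lower) / (2 * 1.96)     # about 2.24 frames
midpoint = (reported_upper + reported_lower) / 2                # equals the reported bias
```

The midpoint of the reported limits matches the reported bias, which is consistent with symmetric limits of this form.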


2605.24008 2026-05-26 cs.LG cs.CV cs.SE version update

CAFD: Concept-Aware DNN Fault Detection using VLMs

Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand

Affiliations: School of EECS, University of Ottawa; Research Ireland Lero centre for software, University of Limerick

AI summary: Proposes Concept-Aware Fault Detection (CAFD), which integrates model-based signals, distance-based features, and a VLM-derived Concept Failure Ratio (CFR) feature, substantially improving DNN fault detection while remaining efficient.

Abstract

Fault detection for Deep Neural Networks (DNNs) has received increasing attention in recent years. While more advanced hybrid approaches have been proposed to combine multiple sources of information and outperform earlier techniques, they often incur substantial computational overhead, limiting scalability and practicality in real-world settings. In this paper, we introduce Concept-Aware Fault Detection (CAFD), a learning-based approach that achieves superior fault detection performance by effectively integrating multiple information sources while maintaining practical efficiency. Specifically, CAFD is trained using a carefully selected set of informative features, including model-based signals derived from the DNN's outputs, distance-based features, and a novel concept-based feature, called Concept Failure Ratio (CFR). CFR leverages Vision-Language Models (VLMs) to extract textual concepts from images and quantify the likelihood that their presence is associated with DNN failures. By incorporating this feature, CAFD benefits from complementary semantic information, enabling more effective fault detection. Our results demonstrate that CFR serves as an effective indicator for DNN fault detection. We conduct an extensive empirical evaluation of CAFD, comparing it against five state-of-the-art baselines across three subject DNN models and datasets, including ImageNet. Across a wide range of constrained selection budgets, CAFD consistently outperforms all baselines in Fault Detection Rate (FDR), achieving average FDR improvements of 18.3% across all investigated subjects and budget sizes.
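One plausible reading of the Concept Failure Ratio is sketched below: for each concept, the fraction of images containing it on which the DNN fails. This definition is an assumption; the abstract says only that CFR quantifies how likely a concept's presence is to be associated with failures, and the paper's exact formula may differ. The concept sets would come from a VLM in the described pipeline; here they are hard-coded, and `concept_failure_ratio` and `image_cfr_feature` are hypothetical names.

```python
from collections import defaultdict

def concept_failure_ratio(samples):
    """samples: iterable of (concepts, failed) pairs for labelled images.

    Returns {concept: failures among images with the concept / images with the concept}.
    (Assumed definition, for illustration only.)
    """
    seen = defaultdict(int)
    failures = defaultdict(int)
    for concepts, failed in samples:
        for concept in set(concepts):
            seen[concept] += 1
            if failed:
                failures[concept] += 1
    return {concept: failures[concept] / seen[concept] for concept in seen}

def image_cfr_feature(concepts, cfr, default=0.0):
    """Aggregate per-concept ratios into one feature; the mean is an arbitrary choice."""
    values = [cfr.get(concept, default) for concept in concepts]
    return sum(values) / len(values) if values else default

# Toy labelled set: (concepts extracted from the image, whether the DNN mispredicted).
labelled = [
    ({"snow", "dog"}, True),
    ({"snow", "car"}, True),
    ({"dog", "grass"}, False),
    ({"car", "road"}, False),
]
cfr = concept_failure_ratio(labelled)
feature = image_cfr_feature({"snow", "road"}, cfr)   # mean of 1.0 and 0.0
```

A feature of this kind could then sit alongside the model-based and distance-based features when training the fault detector.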


2605.24004 2026-05-26 cs.AI cs.CV cs.LG cs.RO version update

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Zhengqi Sun, Yiwen Sun, Boxuan Liu, Tailai Chen, Tianxu Guo, Jiabin Liu

Affiliations: Department of Information Management, Peking University, Beijing 100871, China; School of Intelligence Science and Technology, Peking University, Beijing 100871, China; State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing 100080, China; Yuanpei College, Peking University, Beijing 100871, China; China Agricultural University, Beijing, China; CRSC Research & Design Institute Group Co., Ltd., Beijing, China

AI summary: Proposes Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification, achieving 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate under a CARLA point-goal protocol.

Comments: Accepted by the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). 8 pages, 2 figures

Abstract

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
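The propose, imagine, select loop described in the abstract can be sketched with toy stand-ins. Everything concrete here is an assumption: `propose_candidates` replaces the LLM reasoner with a fixed list of accelerations, `rollout` replaces the learned world model with constant-acceleration kinematics in one dimension, and the safety score is simply the smallest gap to a lead vehicle. None of this reflects the actual RIA components or CARLA interfaces.

```python
from dataclasses import dataclass

@dataclass
class State:
    ego_pos: float      # metres
    ego_speed: float    # metres per second
    lead_pos: float
    lead_speed: float

def propose_candidates(state: State) -> list[float]:
    """Stand-in for the LLM reasoner: candidate accelerations in m/s^2."""
    return [-3.0, 0.0, 1.5]

def rollout(state: State, accel: float, steps: int = 10, dt: float = 0.2) -> list[float]:
    """Stand-in for the world model: short-horizon gaps to the lead vehicle."""
    pos, speed, lead = state.ego_pos, state.ego_speed, state.lead_pos
    gaps = []
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)    # no reversing
        pos += speed * dt
        lead += state.lead_speed * dt
        gaps.append(lead - pos)
    return gaps

def safety_score(gaps: list[float]) -> float:
    """Higher is safer: the worst (smallest) gap along the imagined rollout."""
    return min(gaps)

def select_action(state: State) -> tuple[float, float]:
    """Score every candidate by its imagined rollout and keep the safest one."""
    scored = [(safety_score(rollout(state, a)), a) for a in propose_candidates(state)]
    best_score, best_action = max(scored)
    return best_action, best_score

state = State(ego_pos=0.0, ego_speed=10.0, lead_pos=15.0, lead_speed=5.0)
action, score = select_action(state)   # braking keeps the largest minimum gap
```

In a closed loop, `score` would also be passed back as feedback for the next reasoning step, which the toy proposer above ignores.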


2605.23997 2026-05-26 cs.CV cs.AI cs.LG version update

IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning

Chenghao Li, Fusheng Hao, Xikai Zhang, Likang Xiao, Yanwei Ren, Fuxiang Wu, Quan Chen, Liu Liu

Affiliations: Hangzhou International Innovation Institute, Beihang University; School of Artificial Intelligence, Beihang University; Kuaishou Technology; Shenzhen Institute of Advanced Integration Technology, Shenzhen

AI summary: Proposes IVR-R1, a framework that uses a reward-driven screening mechanism and an iterative re-reasoning loop to correct multimodal reasoning trajectories dynamically during reinforcement learning, addressing visual hallucination and logical errors.

Abstract

Multimodal large language models via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning tasks, yet they remain limited in long-horizon multimodal scenarios, often suffering from visual hallucination and logical error. Current methods typically pre-encode high-dimensional visual scenes into discrete textual proxies to facilitate downstream reasoning. As the reasoning chain unfolds, however, the inherent information asymmetry between text and visual scenes tends to erode visual grounding, resulting in misguided reasoning and erroneous outputs. To address this issue, we introduce IVR-R1 (Iterative Visual-grounded Reasoning), a novel RL training framework that facilitates dynamic visual re-alignment that actively rectifies reasoning trajectories to guide policy optimization. Specifically, by leveraging a reward-driven screening mechanism to identify flawed rollouts, IVR-R1 executes a fine-grained, step-level error attribution within the multimodal context. By iteratively cross-referencing intermediate reasoning states against pristine visual priors, a Re-Reasoning Loop enables automated trajectory rectification, effectively synthesizing expert-level demonstrations that serve as high-fidelity reasoning templates for the policy model. Our experiments across diverse multimodal benchmarks demonstrate that IVR-R1 consistently outperforms existing reinforcement learning methods, establishing a superior paradigm for maintaining logical and visual consistency in complex multimodal reasoning.


2605.23996 2026-05-26 cs.CV eess.IV version update

Brain-to-Image Retrieval and Reconstruction via Multimodal EEG Alignment

Chi Kit Wong, Yan Liu, Haowen Yan

AI summary: Presents a brain-to-image system that decodes visual stimuli from EEG recorded during natural image viewing via multimodal EEG alignment, reaching 86.30% Top-1 retrieval accuracy and a CLIP score of 0.903 for reconstruction.

Comments: 16 pages, 5 figures. Code available at: https://github.com/Chikit-WONG/DL_Project/

Abstract

We present a brain-to-image system that decodes visual stimuli from EEG signals recorded during natural image viewing. Our system addresses two tasks: (1) EEG-to-image retrieval, which ranks the correct stimulus image among 200 candidates given an EEG segment, and (2) EEG-to-image reconstruction, which generates an image consistent with the perceived stimulus. For retrieval, we implement a multi-level blurring approach improved with biologically inspired EVNet features and trained with the InfoNCE loss. Evaluated over 10 random seeds for a single subject, the retrieval model achieves a mean final-epoch Top-1 accuracy of 86.30% and Top-5 accuracy of 98.55%. For reconstruction, we implement CognitionCapturerPro, which aligns EEG representations to multi-modal CLIP embeddings, including image, text, depth, and edge embeddings, and synthesizes images with SDXL-Turbo conditioned via IP-Adapter. Averaged over 10 seeds, the reconstruction model achieves a CLIP score of 0.903 using ViT-H-14, a CLIP score of 0.870 using ViT-L/14, and an SSIM of 0.409. These results demonstrate the feasibility of decoding rich visual representations from EEG signals using modern multi-modal alignment and generative modeling techniques.
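The retrieval setup in the abstract, InfoNCE training followed by Top-k ranking among 200 candidates, can be sketched in NumPy. The symmetric form of the loss, the temperature of 0.07, cosine similarity for ranking, and the synthetic embeddings are all assumptions for illustration; the abstract does not specify these details, and no real EEG data is involved.

```python
import numpy as np

def _normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(eeg_emb: np.ndarray, img_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired (EEG, image) embeddings."""
    logits = _normalize(eeg_emb) @ _normalize(img_emb).T / temperature  # diagonal = true pairs

    def diagonal_cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)                 # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def top_k_accuracy(eeg_emb: np.ndarray, img_emb: np.ndarray, k: int) -> float:
    """Fraction of EEG segments whose true image ranks in the top k by cosine similarity."""
    sims = _normalize(eeg_emb) @ _normalize(img_emb).T
    ranked = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(eeg_emb))[:, None]
    return float((ranked == targets).any(axis=1).mean())

rng = np.random.default_rng(0)
img_emb = rng.normal(size=(200, 32))                         # 200 candidate images
eeg_emb = img_emb + 0.1 * rng.normal(size=(200, 32))         # synthetic, well-aligned EEG
loss_aligned = info_nce(eeg_emb, img_emb)
loss_shuffled = info_nce(eeg_emb, np.roll(img_emb, 1, axis=0))
top1 = top_k_accuracy(eeg_emb, img_emb, k=1)
top5 = top_k_accuracy(eeg_emb, img_emb, k=5)
```

The loss is low when each EEG embedding is closest to its own image and high when the pairing is shuffled, which is the signal the retrieval model is trained on.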


2605.23994 2026-05-26 cs.CV cs.AI version update

RAW: Robust Avatar Watermarking -- Benchmarking and Baseline

Jack Parry, Jack Saunders, Vinay Namboodiri

Affiliation: University of Bath

AI summary: Addresses the post-processing attacks that avatar watermarks face. Proposes the RAW benchmark and WALT, a method that embeds watermarks in UV texture space via 3D face reconstruction, reaching 92.4% robustness under zoom attacks and 95.6% under background removal.

DOI: 10.2312/egs.20261006

Abstract

Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and format conversion before deployment. We introduce RAW (Robust Avatar Watermarking), a benchmark comprising 50 synthetic avatar videos from 5 commercial providers and 6 attacks simulating real-world avatar workflows. Evaluating 7 existing methods reveals that avatar-specific attacks such as background removal significantly degrade watermark recovery. We propose WALT (Watermarking Avatars with Learned Textures), which embeds watermarks in UV texture space via 3D face reconstruction. WALT achieves the highest robustness to zoom attacks (92.4%) while maintaining strong performance on background removal (95.6%). We release our benchmark to facilitate research into avatar-specific watermarking.


2605.23992 2026-05-26 cs.CV cs.AI version update

A World Model of Radiologist Reading for Medical Image Representation Learning

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

Affiliations: University of Georgia; University of Texas at Arlington; New Jersey Institute of Technology

AI summary: Proposes GazeWorld, a world model for medical imaging that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. It autoregressively predicts fixated-patch representations and spatially completes unvisited regions, achieving state-of-the-art diagnostic accuracy and zero-shot performance on several benchmarks.

Abstract

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.


2605.23984 2026-05-26 cs.LG cs.AI cs.CV version update

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

Longtao Jiang, Jianmin Bao, Zhendong Wang, Xin Tao, Pengfei Wan, Zhihui Li, Xiaojun Chang

Affiliations: University of Science and Technology of China; Kling Team, Kuaishou Technology

AI summary: Proposes SRC-Flow, which uses a Semantic Representation Compressor to compress high-dimensional RAE features into a low-dimensional semantic space, easing the modeling burden on normalizing flows; it achieves state-of-the-art generation quality among normalizing flow methods on ImageNet while retaining exact likelihood computation and deterministic invertible sampling.

Abstract

Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) to compact high-dimensional RAE features into a low-dimensional semantic space before flow modeling and preserve reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet $256 \times 256$ and $512 \times 512$, SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07 under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Codes and models will be available at https://github.com/longtaojiang/SRC-Flow.
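The two properties the abstract attributes to normalizing flows, exact likelihoods and deterministic invertible sampling, follow from the change-of-variables formula. The sketch below uses the simplest possible flow, an elementwise affine map with a standard normal base, purely as an illustration; it says nothing about the SRC-Flow architecture, and the function names and parameter values are hypothetical.

```python
import numpy as np

def affine_flow_log_likelihood(x, log_scale, shift):
    """Exact log p(x) for z = (x - shift) * exp(-log_scale) with z ~ N(0, I).

    Change of variables: log p(x) = log N(z; 0, I) + log |det dz/dx|.
    """
    z = (x - shift) * np.exp(-log_scale)
    log_base = -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=-1)
    log_det = -np.sum(log_scale)                 # Jacobian is diagonal
    return log_base + log_det

def affine_flow_sample(z, log_scale, shift):
    """Deterministic inverse map from base samples z to data space."""
    return z * np.exp(log_scale) + shift

rng = np.random.default_rng(0)
log_scale = np.array([0.1, -0.3, 0.5])
shift = np.array([1.0, -2.0, 0.0])

z = rng.normal(size=(5, 3))
x = affine_flow_sample(z, log_scale, shift)      # sampling
z_back = (x - shift) * np.exp(-log_scale)        # inversion recovers z exactly
log_px = affine_flow_log_likelihood(x, log_scale, shift)
```

For this affine map the result coincides with the log-density of a diagonal Gaussian with mean `shift` and standard deviation `exp(log_scale)`, which gives a quick correctness check.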


2605.17543 2026-05-26 cs.CV cs.GR version update

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Jeongeun Park, Janghyeok Han, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho

Affiliations: POSTECH; Visual Display Business, Samsung Electronics

AI summary: Proposes HL-OutPaint, a coarse-to-fine two-stage pipeline that builds Global Coarse Guidance through a global-local frame swapping mechanism, enabling large spatial extrapolation and spatio-temporally consistent generation for high-resolution long videos.

Comments: Supplementary material and video included. Project page: https://koyy001.github.io/Publications/hl-outpaint

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.


2605.17260 2026-05-26 cs.CV version update

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong

Affiliation: Seoul National University

AI summary: Addresses the explosion of visual-token context length when Video LLMs process long videos. Proposes LiteFrame, an efficient video encoder trained with Compressed Token Distillation (CTD), in which a compact student model directly predicts the information-dense, spatio-temporally compressed representations of a teacher model; it processes 8x more frames with 35% lower end-to-end latency and higher video understanding accuracy.

Comments: Project Page: https://jjihwan.github.io/projects/LiteFrame

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.


2605.14889 2026-05-26 cs.CV cs.AI version update

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Sukju Oh, Sukkyu Sun

Affiliation: Department of Computer Science and Artificial Intelligence

AI summary: Proposes SurgicalMamba, a model built on Mamba2's structured state-space duality (SSD) that performs online surgical phase recognition using three components: a dual-path SSD block, intensity-modulated stepping, and state regramming; it reaches state-of-the-art performance on several benchmarks.

Comments: 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba

Abstract

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components that jointly address these demands: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
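The state regramming component is described as a Cayley rotation that mixes channels. The Cayley transform itself is standard: any skew-symmetric matrix A yields an orthogonal matrix Q = (I + A)^{-1}(I - A) with determinant 1. The sketch below shows only that construction; how SurgicalMamba parameterizes A per chunk, and where Q enters the SSM recurrence, is not stated in the abstract, so the function name and the free parameter matrix are assumptions.

```python
import numpy as np

def cayley_rotation(params: np.ndarray) -> np.ndarray:
    """Build a rotation from an unconstrained square parameter matrix.

    A = params - params.T is skew-symmetric, so I + A is always invertible
    and Q = (I + A)^{-1} (I - A) is orthogonal with determinant +1.
    """
    A = params - params.T
    identity = np.eye(A.shape[0])
    return np.linalg.solve(identity + A, identity - A)

rng = np.random.default_rng(0)
Q = cayley_rotation(rng.normal(size=(4, 4)))

state = rng.normal(size=4)          # toy per-channel state vector
mixed = Q @ state                   # channels are mixed, norm is preserved
```

Because Q is orthogonal, it mixes channels without scaling the state, which is the property that makes a rotation a natural way to add cross-channel interaction to an otherwise axis-aligned recurrence.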

2605.14255 2026-05-26 cs.LG cs.CV Version update

Architecture-Aware Explanation Auditing for Industrial Visual Inspection

Sibo Jia, Zihang Zhao, Kunrong Li

AI summary: This paper proposes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis. Perturbation experiments show that an explanation method's faithfulness is bounded by its structural distance from the model's native decision mechanism, and that faithfulness rankings are a joint property of (model, explainer, perturbation operator) triples.

Comments Format update

Abstract

Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model's native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen's d) > 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
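To make the Deletion AUC metric concrete, here is a minimal sketch under stated assumptions: pixels are zero-filled one at a time from most to least salient, the model score is re-evaluated after each deletion, and the area under the resulting curve is taken with the trapezoid rule. The flat-list image, the toy `score_fn`, and the one-pixel step size are illustrative choices, not the paper's protocol. A lower value means the explanation found the pixels that drive the score sooner.

```python
def deletion_auc(score_fn, image, saliency, fill=0.0):
    """Area under the score-vs-fraction-deleted curve (lower = more faithful)."""
    # Visit pixels from most to least salient.
    order = sorted(range(len(image)), key=lambda i: saliency[i], reverse=True)
    perturbed = list(image)
    scores = [score_fn(perturbed)]
    for idx in order:
        perturbed[idx] = fill          # zero-fill perturbation
        scores.append(score_fn(perturbed))
    n = len(order)
    # Trapezoid rule with a uniform step of 1/n along the x-axis.
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n)) / n

def toy_score(pixels):
    """Hypothetical classifier score that depends only on the first two pixels."""
    return (pixels[0] + pixels[1]) / 2.0

if __name__ == "__main__":
    image = [1.0, 1.0, 1.0, 1.0]
    faithful = [0.9, 0.8, 0.1, 0.2]    # highlights the pixels the score uses
    unfaithful = [0.1, 0.2, 0.9, 0.8]  # highlights irrelevant pixels
    print(deletion_auc(toy_score, image, faithful))
    print(deletion_auc(toy_score, image, unfaithful))
```

Swapping the `fill` value changes the perturbation operator, which is one way to see why rankings can depend on the (model, explainer, perturbation operator) triple.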

2605.12961 2026-05-26 cs.CV cs.LG Version update

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

Ahmed Cherif

Affiliations: Sofrecom Tunisia; Orange Innovation

AI summary: Proposes the BFORE framework, which combines the Butterfly Optimization Algorithm and the Firefly Algorithm to search automatically for the best Retinex enhancement parameters by maximizing a Gaussian Naturalness Score, markedly improving low-light image quality.

Abstract

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tuned parameters that do not generalize across different lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a framework that automatically finds the best enhancement parameters for each image. BFORE works in two phases: (1) a Butterfly Optimization Algorithm (BOA) searches for optimal Multi-Scale Retinex with Color Restoration (MSRCR) parameters, then (2) a Firefly Algorithm (FA) fine-tunes gamma correction, denoising, and color parameters. Both phases maximize a Gaussian Naturalness Score (GNS), a no-reference metric that measures how natural the enhanced image looks. Standard quality metrics (PSNR, SSIM, NIQE) are computed only after optimization, ensuring zero data leakage. On 30 synthetic image pairs, BFORE achieves GNS = 0.971, outperforming the next-best method MSRCR (0.894) by 8.6%. On 115 real images from the LOL dataset, BFORE achieves GNS = 0.887, outperforming MSRCR (0.808) by 9.8%. A controlled comparison with three deep learning baselines (Zero-DCE, SCI, IAT) trained under identical conditions shows BFORE surpasses the best DL method by 14.7% in GNS. An ablation study confirms that the hybrid BOA+FA strategy significantly outperforms each optimizer in isolation, and a scalability analysis at three evaluation budgets shows that the structured optimizer significantly outperforms uniform random sampling once compute is available (p = 0.009 at 128 evaluations, p = 0.021 at 300 evaluations). All improvements are statistically significant (p < 0.0001, Wilcoxon signed-rank test). Processing time is 3-6 minutes per image on CPU, suitable for offline applications.

2605.00908 2026-05-26 cs.CV Version update

Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

Alcides Toledo Espinosa, Gerardo Antonio Álvarez Hernández, Ángel Eduardo Zamora-Suárez, Miguel Bolaños, Juan Irving Vásquez

Affiliations: Instituto Politécnico Nacional; CIDETEC-IPN; UPIBI-IPN

AI summary: This paper compares CNN-based and Transformer-based object detection architectures for early weed detection in tomato plantations, revealing a trade-off between efficiency and contextual modeling.

Comments 7 pages, 3 figures, and 1 table

Abstract

This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in tomato plantations. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and RT-DETR Large and RF-DETR Medium as transformer-based architectures. The evaluation was conducted on the GROUNDBASED_WEED dataset, considering six weed classes and an additional category corresponding to unidentified plants, which allowed for the assessment of performance in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed, as well as non-parametric statistical tests. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.

2603.29236 2026-05-26 cs.CV Version update

M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction

U. V. B. L. Udugama, George Vosselman, Francesco Nex

Affiliations: Department of Earth Observation Science

AI summary: Proposes M2H-MX, a multi-task perception model whose lightweight decoder uses register-gated global context and controlled cross-task interaction so that depth and semantic predictions reinforce each other under strict latency constraints. Integrated into monocular SLAM, it substantially improves trajectory accuracy and map quality.

Comments 6 pages, 5 figures, 5 tables. Preprint under review

Abstract

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.

2603.17044 2026-05-26 cs.LG cs.AI cs.CV Version update

Diversity through Bounded Agreement: Geometric Regularization for Multimodal Fusion

Zixuan Xia, Hao Wang, Pengcheng Weng, Yanyu Qian, Yangxin Xu, William Dan, Fei Wang

Affiliations: Department of Informatics, University of Bern; College of Computing and Data Science, Nanyang Technological University; School of Software Engineering, Xi'an Jiaotong University

AI summary: Proposes a lightweight, plug-and-play geometric regularization framework that follows a bounded-agreement principle, constraining cross-modal drift while preserving modality-specific diversity, to improve multimodal fusion performance.

Abstract

Multimodal fusion is often treated as an optimization-balancing problem, where training signals are adjusted to prevent one modality from dominating the others. However, balanced optimization does not fully determine the geometry of intermediate representations. Supervised multimodal models may still learn low-diversity modality-specific embeddings or allow paired cross-modal observations to drift excessively apart, weakening both unimodal robustness and multimodal fusion. We introduce \regName, a lightweight plug-and-play geometric regularization framework for multimodal representation learning. Rather than enforcing rigid cross-modal alignment, \regName follows a bounded-agreement principle: preserve modality-specific diversity while softly constraining only the portion of paired cross-modal drift that exceeds an admissible agreement band. Operationally, \regName combines a dispersion term that mitigates spectral concentration with an agreement-band anchoring term that controls excessive paired drift, requiring no architectural modification or inference-time overhead. Experiments across audio-visual, image-text, and RF-based benchmarks show that \regName consistently improves multimodal performance and often strengthens unimodal representations. These results suggest that explicitly regulating representation geometry is an effective complement to optimization balancing, and provide evidence that geometry-aware regularization can improve multimodal learning across diverse architectures and domains.

2601.20273 2026-05-26 cs.DC cs.CV Version update

SwiftFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs

Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang, Gennady Pekhimenko

Affiliations: University of Toronto & Institute; Amazon; University of Toronto & Vector Institute & NVIDIA

AI summary: Targeting the communication and synchronization bottlenecks of sequence parallelism in Diffusion Transformer inference, proposes the topology-aware StreamFusion engine, which uses Torus Attention and one-sided communication to achieve an average 1.35× speedup.

Abstract

Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of 1.35× (up to 1.77×).

2601.03191 2026-05-26 cs.CV cs.AI cs.LG Version update

AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation

Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert

Affiliations: Hasso Plattner Institute; MBZUAI

AI summary: Proposes AnatomiX, a two-stage anatomy-aware multimodal large language model that first identifies anatomical structures and then performs downstream tasks, improving on existing methods by over 25% on anatomy grounding, phrase grounding, grounded diagnosis, and grounded captioning.

Abstract

Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in spatial reasoning and anatomical understanding. Although existing grounding techniques improve overall performance, they often fail to establish a true anatomical correspondence, resulting in incorrect anatomical understanding in the medical domain. To address this gap, we introduce AnatomiX, a multitask multimodal large language model for anatomically grounded chest X-ray interpretation. Inspired by the radiological workflow, AnatomiX adopts a two-stage approach: first, it identifies anatomical structures and extracts their features, and then it leverages a large language model to perform diverse downstream tasks such as phrase grounding, report generation, visual question answering, and image understanding. Extensive experiments across multiple benchmarks demonstrate that AnatomiX achieves superior anatomical reasoning and delivers over 25% improvement in performance on anatomy grounding, phrase grounding, grounded diagnosis and grounded captioning tasks compared to existing approaches. Code and pretrained model are available at https://aneesurhashmi.github.io/anatomix

2512.21815 2026-05-26 cs.CV cs.LG Version update

High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen, Jinhong Ni, Shu Zou, Zhaoyuan Yang, Jing Zhang

Affiliations: The Australian National University; The University of Queensland; GE Research

AI summary: This study reveals that a small fraction (around 20%) of high-entropy tokens concentrates a disproportionate share of adversarial influence in vision-language models, and proposes a sparse Entropy-Guided Attack (EGA) that achieves high attack success and harmful rates.

Comments 19 pages, 11 figures, 8 tables

Abstract

Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, as a measure of model uncertainty, is highly correlated with VLM reliability. While prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token equally contributes to model instability, we reveal that a small fraction (around 20%) of high-entropy tokens, in the evaluated representative open-source VLMs with diverse architectures, concentrates a disproportionate share of adversarial influence during autoregressive generation. We demonstrate that concentrating adversarial perturbations on these high-entropy positions achieves comparable semantic degradation to global methods while optimizing fewer decoding positions. Additionally, across multiple representative VLMs, such attacks induce not only semantic drift but also a substantial unsafe subset (20-31%) under the current pipeline. Remarkably, since such vulnerable high-entropy tokens recur across architecturally diverse VLMs, attacks focused on them exhibit non-trivial transferability. Motivated by these findings, we design a simple Entropy-Guided Attack (EGA) that operationalizes sparse high-entropy targeting and extends it with a reusable token bank, yielding competitive attack success rates (93-95%) with a considerable harmful rate (30.2-38.6%) on the three representative open-source VLMs.
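The selection step described above, ranking decoding positions by entropy and keeping roughly the top 20%, can be sketched as follows. This is only an illustration of the ranking: it assumes each decoding step exposes a next-token probability distribution as a plain list, uses Shannon entropy in nats, and the function names and the `fraction` parameter are hypothetical. It does not implement the perturbation optimization or the token bank.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(step_distributions, fraction=0.2):
    """Indices of the decoding steps with the highest entropy.

    Keeps ceil(fraction * number_of_steps) positions, at least one.
    """
    entropies = [token_entropy(d) for d in step_distributions]
    k = max(1, math.ceil(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

if __name__ == "__main__":
    steps = [
        [1.0, 0.0, 0.0, 0.0],      # fully confident step
        [0.97, 0.01, 0.01, 0.01],
        [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
        [0.9, 0.1, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
    ]
    print(high_entropy_positions(steps))
```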

2512.14180 2026-05-26 cs.CV Version update

Spherical Voronoi: Directional Appearance as a Differentiable Partition of the Sphere

Francesco Di Sario, Daniel Rebain, Dor Verbin, Marco Grangetto, Andrea Tagliasacchi

Affiliations: University of Torino; Simon Fraser University; University of British Columbia; University of Toronto; Google DeepMind

AI summary: Proposes Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. Its learnable partition of the directional domain captures view-dependent effects and reaches state-of-the-art results in reflection modeling.

Abstract

Radiance field methods (e.g. 3D Gaussian Splatting) have emerged as a powerful paradigm for novel view synthesis, yet their appearance modeling often relies on Spherical Harmonics (SH), which impose fundamental limitations. SH struggle with high-frequency signals, exhibit Gibbs ringing artifacts, and fail to capture specular reflections - a key component of realistic rendering. Although alternatives like spherical Gaussians offer improvements, they add significant optimization complexity. We propose Spherical Voronoi (SV) as a unified framework for appearance representation in 3D Gaussian Splatting. SV partitions the directional domain into learnable regions with smooth boundaries, providing an intuitive and stable parameterization for view-dependent effects. For diffuse appearance, SV achieves competitive results while keeping optimization simpler than existing alternatives. For reflections - where SH fail - we leverage SV as learnable reflection probes, taking reflected directions as input following principles from classical graphics. This formulation attains state-of-the-art results on synthetic and real-world datasets, demonstrating that SV offers a principled, efficient, and general solution for appearance modeling in explicit 3D representations. Project page: https://sphericalvoronoi.github.io/

2512.11941 2026-05-26 cs.CV cs.AI Version update

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Qiuhong Ke

Affiliations: Monash University; Lancaster University; University of Western Australia

AI summary: Proposes the DynaPURLS framework, which uses multi-scale visual-semantic correspondences and a dynamic refinement module to address domain shift in skeleton-based zero-shot action recognition, achieving the best results on three benchmark datasets.

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

DOI: 10.1109/TPAMI.2026.3680873

Abstract

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this fine-grained alignment against the train-test domain shift, DynaPURLS incorporates a dynamic refinement module. During inference, this module adapts textual features to the incoming visual stream via a lightweight learnable projection. This refinement process is stabilized by a confidence-aware, class-balanced memory bank, which mitigates error propagation from noisy pseudo-labels. Extensive experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art, setting new state-of-the-art records. The source code is made publicly available at https://github.com/Alchemist0754/DynaPURLS

2512.04883 2026-05-26 cs.CV Version update

SDG-Track: A Heterogeneous Observer-Follower Framework for High-Resolution UAV Tracking on Embedded Platforms

Jiawen Wen, Yu Hu, Suixuan Qiu, Jinshan Huang, Xiaowen Chu

AI summary: Proposes the SDG-Track framework, an Observer-Follower architecture that uses sparse detection-guided tracking and a Dual-Space Recovery mechanism to track UAVs in high-resolution video in real time on embedded platforms, reaching 35.1 FPS while retaining 97.2% of the frame-by-frame detection precision.

Comments Withdrawn by the authors due to unresolved authorship and public-disclosure authorization issues

Abstract

Real-time tracking of small unmanned aerial vehicles (UAVs) on edge devices faces a fundamental resolution-speed conflict. Downsampling high-resolution imagery to standard detector input sizes causes small target features to collapse below detectable thresholds. Yet processing native 1080p frames on resource-constrained platforms yields insufficient throughput for smooth gimbal control. We propose SDG-Track, a Sparse Detection-Guided Tracker that adopts an Observer-Follower architecture to reconcile this conflict. The Observer stream runs a high-capacity detector at low frequency on the GPU to provide accurate position anchors from 1920x1080 frames. The Follower stream performs high-frequency trajectory interpolation via ROI-constrained sparse optical flow on the CPU. To handle tracking failures from occlusion or model drift caused by spectrally similar distractors, we introduce Dual-Space Recovery, a training-free re-acquisition mechanism combining color histogram matching with geometric consistency constraints. Experiments on a ground-to-air tracking station demonstrate that SDG-Track achieves 35.1 FPS system throughput while retaining 97.2% of the frame-by-frame detection precision. The system successfully tracks agile FPV drones under real-world operational conditions on an NVIDIA Jetson Orin Nano. Our paper code is publicly available at https://github.com/Jeffry-wen/SDG-Track

2510.22973 2026-05-26 cs.CV Version update

Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

Bohan Li, Xin Jin, Hu Zhu, Hongsi Liu, Ruikai Li, Jiazhe Guo, Kaiwen Cai, Chao Ma, Yueming Jin, Hao Zhao, Xiaokang Yang, Wenjun Zeng

Affiliations: Shanghai Jiao Tong University; Eastern Institute of Technology; School of Electronic Information and Electrical Engineering; Li Auto; National University of Singapore; Tsinghua University; Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Ningbo Institute of Digital Twin

AI summary: To address the scarcity of occupancy data, builds Nuplan-Occ, the largest semantic occupancy dataset to date, and proposes a unified framework that jointly generates high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. The framework uses a spatio-temporal disentangled architecture together with Gaussian splatting-based sparse point map rendering and a sensor-aware embedding strategy for high-fidelity generation.

Comments IEEE TPAMI

Abstract

Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which still remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2

2509.15814 2026-05-26 eess.IV cs.CV Version update

CSFNet: A Cosine Similarity Fusion Network for Real-Time RGB-X Semantic Segmentation of Driving Scenes

Danial Qashqai, Emad Mousavian, Shahriar Baradaran Shokouhi, Sattar Mirzakuchaki

Affiliations: Department of Electrical Engineering, Iran University of Science and Technology

AI summary: Proposes CSFNet, which uses a Cosine Similarity Attention Fusion Module (CS-AFM) to fuse features from two modalities efficiently, achieving real-time, high-accuracy RGB-X semantic segmentation.

DOI: 10.1016/j.engappai.2026.114362
Journal ref: Engineering Applications of Artificial Intelligence, 174, 114362 (2026)

Abstract

Semantic segmentation, as a crucial component of complex visual interpretation, plays a fundamental role in autonomous vehicle vision systems. Recent studies have significantly improved the accuracy of semantic segmentation by exploiting complementary information and developing multimodal methods. Despite the gains in accuracy, multimodal semantic segmentation methods suffer from high computational complexity and low inference speed. Therefore, it is a challenging task to implement multimodal methods in driving applications. To address this problem, we propose the Cosine Similarity Fusion Network (CSFNet) as a real-time RGB-X semantic segmentation model. Specifically, we design a Cosine Similarity Attention Fusion Module (CS-AFM) that effectively rectifies and fuses features of two modalities. The CS-AFM module leverages cross-modal similarity to achieve high generalization ability. By enhancing the fusion of cross-modal features at lower levels, CS-AFM paves the way for the use of a single-branch network at higher levels. Therefore, we use dual and single-branch architectures in an encoder, along with an efficient context module and a lightweight decoder for fast and accurate predictions. To verify the effectiveness of CSFNet, we use the Cityscapes, MFNet, and ZJU datasets for the RGB-D/T/P semantic segmentation. According to the results, CSFNet has competitive accuracy with state-of-the-art methods while being state-of-the-art in terms of speed among multimodal semantic segmentation models. It also achieves high efficiency due to its low parameter count and computational complexity. The source code for CSFNet will be available at https://github.com/Danial-Qashqai/CSFNet.

2402.10665 2026-05-26 cs.LG cs.CV Version update

Soft Dice Confidence: A Near-Optimal Confidence Estimator for Selective Prediction in Semantic Segmentation

Bruno Laboissiere Camargos Borges, Bruno Machado Pacheco, Danilo Silva

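Affiliations: Department of Automation and Systems Engineering, Federal University of Santa Catarina

AI summary: For selective prediction in semantic segmentation, proposes SDC, a near-optimal confidence estimator based on the Dice coefficient, which outperforms existing methods whether the marginal posterior probabilities are known or only estimated.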

Comments 48 pages, 11 figures

Details

Abstract

In semantic segmentation, even state-of-the-art deep learning models fall short of the performance required in certain high-stakes applications such as medical image analysis. In these cases, performance can be improved by allowing a model to abstain from making predictions when confidence is low, an approach known as selective prediction. While well-known in the classification literature, selective prediction has been underexplored in the context of semantic segmentation. This paper tackles the problem by focusing on image-level abstention, which involves producing a single confidence estimate for the entire image, in contrast to previous approaches that focus on pixel-level uncertainty. Assuming the Dice coefficient as the evaluation metric for segmentation, two main contributions are provided in this paper: (i) In the case of known marginal posterior probabilities, we derive the optimal confidence estimator, which is observed to be intractable for typical image sizes. Then, an approximation computable in linear time, named Soft Dice Confidence (SDC), is proposed and proven to be tightly bounded to the optimal estimator. (ii) When only an estimate of the marginal posterior probabilities is known, we propose a plug-in version of the SDC and show it outperforms all previous methods, including those requiring additional tuning data. These findings are supported by experimental results on both synthetic data and real-world data from six medical imaging tasks, including out-of-distribution scenarios, positioning the SDC as a reliable and efficient tool for selective prediction in semantic segmentation.
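
Illustrative note: the abstract does not state the SDC formula. As a minimal sketch, assuming SDC takes the usual soft-Dice form between the per-pixel foreground probabilities and the binary mask obtained by thresholding them, an image-level confidence and a simple abstention rule could be written as follows. The function names, the 0.5 threshold, and the empty-mask convention are assumptions made for illustration.

```python
def soft_dice_confidence(probs, threshold=0.5):
    """Image-level confidence from per-pixel foreground probabilities.

    Assumed form: 2 * sum(p_i * yhat_i) / (sum(p_i) + sum(yhat_i)),
    where yhat_i = 1 if p_i >= threshold else 0. One pass over the pixels, O(n).
    """
    overlap = 0.0    # sum of p_i over pixels predicted as foreground
    prob_mass = 0.0  # sum of p_i over all pixels
    pred_size = 0    # number of pixels predicted as foreground
    for p in probs:
        prob_mass += p
        if p >= threshold:
            overlap += p
            pred_size += 1
    denom = prob_mass + pred_size
    # Assumed convention: an empty prediction on an all-zero posterior is fully confident.
    return 1.0 if denom == 0 else 2.0 * overlap / denom


def select(images, min_confidence):
    """Return indices of images whose confidence clears the bar; abstain on the rest."""
    return [i for i, probs in enumerate(images)
            if soft_dice_confidence(probs) >= min_confidence]
```

For example, an image with probabilities [0.99, 0.98, 0.01, 0.02] scores 0.985 under this form, while [0.6, 0.55, 0.45, 0.4] scores 0.575, so a threshold of 0.9 keeps the first image and abstains on the second.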

2310.04981 2026-05-26 cs.CV cs.LG Version update

Compositional Semantics for Open Vocabulary Spatio-semantic Representations

Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda

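Affiliations: Graduate School of Informatics, Nagoya University; Ludolab; TIER IV

AI summary: Proposes latent compositional semantic embeddings z* as a knowledge representation for queryable spatio-semantic memories, proves their existence, optimality, and discoverability, and introduces a sufficient similarity inference method that improves inference over overlapping semantics.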

Comments Preprint

Details

Abstract

Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and that the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating separability of related and unrelated semantics. We prove that z* is discoverable from visual appearance and singular descriptions by iterative gradient descent. We experimentally verify our findings on four embedding spaces including CLIP and SBERT. Our results show that z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics for ideal uniformly distributed high-dimensional embeddings. We introduce three new datasets with overlapping semantics to show that common VLMs trained on conventional nonoverlapping annotations discover z*. Our novel sufficient similarity inference method overcomes fundamental limitations of conventional inference, and improves higher-level overlapping semantic inference performance by 19.63 mIoU on average.
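
Illustrative note: the abstract states that the optimal z* is the centroid of a set Z but does not give the objective. A minimal sketch, assuming unit-norm embeddings compared by dot product: among all unit vectors, the normalized centroid maximizes the summed similarity to the members of Z, because the sum of dot products equals the dot product with the sum vector. The helper names below are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes it is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid_embedding(embeddings):
    """Normalized mean of a set of embeddings (assumes the mean is nonzero)."""
    dim = len(embeddings[0])
    mean = [sum(e[k] for e in embeddings) / len(embeddings) for k in range(dim)]
    return normalize(mean)


def total_similarity(z, embeddings):
    """Sum of dot products between z and every member of the set."""
    return sum(sum(a * b for a, b in zip(z, e)) for e in embeddings)
```

Comparing total_similarity for the centroid against any member of Z, or against any other unit vector, gives a quick numerical check of the claim in a small example.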

2605.10543 2026-05-26 cs.CV Version update

TIE: Time Interval Encoding for Video Generation over Events

Zhilei Shu, Shangwen Zhu, Zihang Liang, Xiaofan Li, Qianyu Peng, Xinyu Cui, Bo Ye, Yiming Li, Fan Cheng, Jian Zhao, Yang Cao, Zheng-Jun Zha, Ruili Feng

Affiliations: University of Science and Technology of China; Matrix Team; Shanghai Jiao Tong University; Nanyang Technological University; University of Waterloo; The Pennsylvania State University; Zhongguancun Academy; The University of Hong Kong

AI summary: Proposes Time Interval Encoding (TIE), which generalizes rotary position embeddings to an interval-aware form so that Diffusion Transformers (DiT) can represent time intervals when generating videos with overlapping events, substantially improving temporal controllability.

Details

Abstract

Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events -- a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.
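
Illustrative note: the abstract mentions a closed-form sinc-based solution without writing it out. One way to see where a sinc can arise, assuming a single rotary frequency omega whose point-wise encoding of time t is the complex rotation exp(i*omega*t): averaging that rotation uniformly over an interval [start, end] gives exp(i*omega*center) * sinc(omega*duration/2), with sinc(x) = sin(x)/x. Whether this matches TIE's exact definition is an assumption, and the function names below are hypothetical.

```python
import cmath
import math


def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0 else math.sin(x) / x


def interval_rotation(omega, start, end):
    """Closed form of the uniform average of exp(i*omega*t) over [start, end]."""
    center = 0.5 * (start + end)
    duration = end - start
    return cmath.exp(1j * omega * center) * sinc(0.5 * omega * duration)


def interval_rotation_numeric(omega, start, end, steps=20000):
    """Midpoint-rule average of the same point-wise rotation, used only as a check."""
    width = (end - start) / steps
    total = sum(cmath.exp(1j * omega * (start + (k + 0.5) * width))
                for k in range(steps))
    return total / steps
```

A zero-length interval recovers the ordinary point-wise rotation, and dividing by the duration keeps the magnitude at or below 1 regardless of interval length. The sinc factor also shrinks high-frequency components as the interval grows, which is one plausible reading of the boundary-noise attenuation the abstract describes.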

2401.01160 2026-05-26 eess.IV cs.CG cs.CV cs.LG Version update

Train-Free Segmentation in MRI with Cubical Persistent Homology

Anton François, Raphaël Tinarrage

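Affiliations: Centre G. Borelli; ENS Paris-Saclay; IST Austria; EMAp, Fundação Getulio Vargas

AI summary: Proposes a train-free MRI segmentation framework based on topological data analysis that proceeds in three steps (automatic thresholding, extraction of a subset with known topology, and decomposition into components). It uses approximate representative cycles from persistent homology to link topological features to anatomical components in an interpretable way, and is evaluated on glioblastoma and fetal cortical plate segmentation.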

Comments Similar to the published version. 22 pages, 11 figures, 3 tables. For associated code, see https://github.com/antonfrancois/gliomaSegmentation_TDA

Details

DOI: 10.1007/s10851-026-01300-1
Journal ref: Journal of Mathematical Imaging and Vision 68, 20 (2026)

Abstract

We investigate a framework for train-free MRI segmentation based on Topological Data Analysis. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. A key ingredient is the extraction of approximate representative cycles from persistence diagrams, which provides an interpretable link between persistent features and anatomical components. To clarify the method's scope, we make the underlying topological and intensity assumptions explicit, quantify when they hold on real data, and analyze typical failure modes. We evaluate the approach on glioblastoma and on fetal cortical plate segmentation, with comparisons to unsupervised and deep-learning references. By operating without large annotated datasets, the method is well suited to scarce-data settings and provides an interpretable baseline and practical initialization for expert refinement or learning-based pipelines.

2506.03134 2026-05-26 eess.SP cs.CV Version update

Controllable Radar Simulation with Waveform Parameter Embedding

Weiqing Xiao, Hao Huang, Chonghao Zhong, Yujie Lin, Nan Wang, Xiaoxue Chen, Zhaoxi Chen, Saining Zhang, Shuocheng Yang, Pierre Merriaux, Lei Lei, Hao Zhao

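Affiliations: NJU; BJTU; BIT; AIR, THU; NTU; SVM, THU; Lightwheel AI; LeddarTech

AI summary: Proposes Ctrl-RS, a controllable radar cube simulation framework built on an environment reflection tensor, a waveform parameter abstraction, and WARP-Net. In 2D/3D detection and semantic segmentation, its simulated data approaches or surpasses real radar.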

Comments CVPR 2026 Findings: Code: https://github.com/zhuxing0/SA-Radar Project page: https://zhuxing0.github.io/projects/SA-Radar

Details

Abstract

Autonomous driving simulators still lack high-fidelity radar, even though radar is critical for robust perception in adverse weather. A key obstacle is that raw radar point clouds are extremely sparse and stochastic, making them difficult to model; we argue that simulating the full range-azimuth-Doppler cube is a more principled target. Existing radar cube simulators either rely purely on neural generators, which are opaque and offer little control over sensor attributes, or on detailed electromagnetic pipelines, which are slow, require proprietary hardware specifications, and still struggle to capture real-world complexity. We introduce Ctrl-RS, a controllable radar cube simulation framework that combines the strengths of both worlds. First, we build an environment reflection tensor from diverse sensor sources (including LiDAR, monocular cameras, and existing radar). Second, we abstract radar physics into a compact set of waveform parameters that characterize the 3D point spread function, yielding an intuitive embedding of radar attributes such as range resolution, Doppler broadening, and azimuth beam shape. Third, we train a WARP-Net on a large mixed dataset that fuses real, analytically synthesized, and simulator-generated radar cubes to cover a wide distribution of radar attributes. Ctrl-RS supports viewpoint changes, actor removal, and attribute editing. Experiments on RADDet, Carrada, and nuScenes show that our simulated data can match or surpass real radar in 2D detection and semantic segmentation, and consistently boosts performance in 3D detection when combined with real data. The project is available at https://github.com/zhuxing0/Ctrl-RS.

2412.15678 2026-05-26 cs.CV Version update

Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network

Xiang Fang, Wanlong Fang, Changshuo Wang, Daizong Liu, Keke Tang, Jianfeng Dong, Pan Zhou, Beibei Li

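Affiliations: Sichuan University; Nanyang Technological University, Singapore; Peking University; Guangzhou University; Zhejiang Gongshang University

AI summary: Introduces the new task of multi-pair temporal sentence grounding and designs a Multi-Thread Knowledge Transfer Network that co-trains multiple video-query pairs through cross-modal contrast, prototype alignment, and adaptive negative selection.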

Comments Accepted by AAAI 2025

Details

Abstract

Given some video-query pairs with untrimmed videos and sentence queries, temporal sentence grounding (TSG) aims to locate query-relevant segments in these videos. Although previous TSG methods have achieved remarkable success, they train each video-query pair separately and ignore the relationship between different pairs. We observe that similar video/query content not only helps the TSG model better understand and generalize the cross-modal representation but also assists the model in locating some complex video-query pairs. Previous methods follow a single-thread framework that cannot co-train different pairs and usually spend much time re-obtaining redundant knowledge, limiting their real-world applications. To this end, in this paper, we pose a brand-new setting: Multi-Pair TSG, which aims to co-train these pairs. In particular, we propose a novel video-query co-training approach, Multi-Thread Knowledge Transfer Network, to locate a variety of video-query pairs effectively and efficiently. Firstly, we mine the spatial and temporal semantics across different queries to cooperate with each other. To learn intra- and inter-modal representations simultaneously, we design a cross-modal contrast module to explore the semantic consistency by a self-supervised strategy. To fully align visual and textual representations between different pairs, we design a prototype alignment strategy to 1) match object prototypes and phrase prototypes for spatial alignment, and 2) align activity prototypes and sentence prototypes for temporal alignment. Finally, we develop an adaptive negative selection module to adaptively generate a threshold for cross-modal matching. Extensive experiments show the effectiveness and efficiency of our proposed method.

2412.06284 2026-05-26 cs.CV Version update

Your Data Is Not Perfect: Towards Cross-Domain Out-of-Distribution Detection in Class-Imbalanced Data

Xiang Fang, Arvind Easwaran, Blaise Genest, Ponnuthurai Nagaratnam Suganthan

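Affiliations: College of Computing and Data Science, Nanyang Technological University; KINDI Computing Research Center, College of Engineering, Qatar University, Doha

AI summary: For cross-domain, class-imbalanced out-of-distribution detection, proposes the uncertainty-aware adaptive semantic alignment (UASA) network, which is based on prototype alignment and uses label-driven prototypes, adaptive thresholds, and uncertainty-aware clustering to narrow the domain gap, the semantic gap, and the class-imbalance gap.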

Comments Accepted by Expert Systems with Applications

Details

Abstract

Previous OOD detection systems only focus on the semantic gap between ID and OOD samples. Besides the semantic gap, we are faced with two additional gaps: the domain gap between source and target domains, and the class-imbalance gap between different classes. In fact, similar objects from different domains should belong to the same class. In this paper, we introduce a realistic yet challenging setting: class-imbalanced cross-domain OOD detection (CCOD), which contains a well-labeled (but usually small) source set for training and conducts OOD detection on an unlabeled (but usually larger) target set for testing. We do not assume that the target domain contains only OOD classes or that it is class-balanced: the distribution among classes of the target dataset need not be the same as the source dataset. To tackle this challenging setting with an OOD detection system, we propose a novel uncertainty-aware adaptive semantic alignment (UASA) network based on a prototype-based alignment strategy. Specifically, we first build label-driven prototypes in the source domain and utilize these prototypes for target classification to close the domain gap. Rather than utilizing fixed thresholds for OOD detection, we generate adaptive sample-wise thresholds to handle the semantic gap. Finally, we conduct uncertainty-aware clustering to group semantically similar target samples to relieve the class-imbalance gap. Extensive experiments on three challenging benchmarks demonstrate that our proposed UASA outperforms state-of-the-art methods by a large margin.
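
Illustrative note: the abstract describes label-driven source prototypes used to classify target samples, but gives no formulas. A minimal sketch of the prototype step, assuming each prototype is the mean feature of a source class and target samples are matched by cosine similarity. The single ood_threshold argument is a simplification for illustration: the paper replaces fixed thresholds with adaptive sample-wise ones, and the abstract does not describe that mechanism, so it is not reproduced here. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_prototypes(features, labels):
    """Mean feature vector per labeled source class."""
    sums, counts = {}, {}
    for feature, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feature)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], feature)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(feature, prototypes, ood_threshold):
    """Return (label, similarity); label is None when the best match is too weak."""
    label, best = max(((y, cosine(feature, p)) for y, p in prototypes.items()),
                      key=lambda pair: pair[1])
    return (label if best >= ood_threshold else None), best
```

A target sample close to one prototype receives that class label, while a sample equally far from every prototype falls below the threshold and is flagged as out-of-distribution.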
