2605.21466 2026-05-21 cs.CV (updated version)

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation

Guanlong Jiao, Chenyangguang Zhang, Jia Jun Cheng Xian, Zewei Zhang, Renjie Liao

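Affiliations: The University of British Columbia; ETH Zürich; McMaster University; Vector Institute; Canada CIFAR AI Chair

AI summary: This paper proposes StreamGVE, a video editing method built on a noise-to-data perspective. It introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting, enabling efficient video editing that outperforms existing methods even in few-step settings.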

Comments Project Page: https://dsl-lab.github.io/StreamGVE/

Abstract

Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality yet satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality, and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.

2605.21454 2026-05-21 cs.CV q-bio.QM q-bio.TO (updated version)

ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction

Amaya Gallagher-Syed, Costantino Pitzalis, Myles J. Lewis, Michael R. Barnes, Gregory Slabaugh

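Affiliations: Queen Mary University of London; Imperial College London

AI summary: This paper proposes ProtoPathway, a framework that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations, aiming to improve the biological interpretability and computational efficiency of cancer survival prediction.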

Comments Currently under peer review

Abstract

We introduce ProtoPathway, an interpretable-by-design multimodal framework for cancer survival prediction that unifies whole slide imaging and transcriptomics through encoders producing biologically grounded representations on both sides of the fusion. On the histopathology side, $K$ learnable morphological prototypes, trained end-to-end with the survival objective, serve as the slide representation itself: patches flow into prototype tokens via soft assignment, compressing variable-length patch sets into fixed task-adaptive tokens. On the genomic side, a bipartite graph neural network encodes gene expression within the Reactome pathway hierarchy, producing pathway embeddings that reflect both constituent genes and their broader biological context through bidirectional message passing over a shared gene--pathway graph. Cross-modal attention then operates over a compact prototype $\times$ pathway matrix in which prototypes query pathways, modeling the biological direction in which molecular programs give rise to tissue morphology. Because both axes carry stable task-learned identity, the attention matrix is itself an interpretability output, yielding native inference-time attribution across the full biological hierarchy, from genes through pathways and prototypes to spatial tissue maps. We evaluate on five TCGA cancer cohorts, demonstrating competitive or superior survival prediction with substantially improved biological interpretability and reduced computational cost, with interpretability claims validated through fold-stratified rank-based population-level analysis. Our source code, model weights, and Reactome pathways, together with a unified codebase reimplementing all multimodal survival baselines under identical preprocessing and evaluation, are available at: https://github.com/AmayaGS/ProtoPathway.
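
Illustrative sketch (not from the paper): the soft-assignment step described above can be pictured as each patch spreading its weight over K prototypes, so that any number of patches yields exactly K tokens. The dot-product similarity, the softmax, and the names `pool_patches` and `prototypes` are assumptions made for this example; the authors' actual encoder may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def pool_patches(patches, prototypes):
    """Compress a non-empty, variable-length list of patch vectors into K tokens.

    Each patch is softly assigned to every prototype (softmax over
    dot-product similarities, an assumption of this sketch). Token k is
    the assignment-weighted mean of all patches.
    """
    k, dim = len(prototypes), len(prototypes[0])
    sums = [[0.0] * dim for _ in range(k)]
    mass = [0.0] * k
    for patch in patches:
        weights = softmax([dot(patch, proto) for proto in prototypes])
        for j, w in enumerate(weights):
            mass[j] += w
            for d in range(dim):
                sums[j][d] += w * patch[d]
    return [[value / mass[j] for value in sums[j]] for j in range(k)]


# Hypothetical toy data: three 2-D patches, two prototypes.
tokens = pool_patches([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                      [[5.0, 0.0], [0.0, 5.0]])
```

Passing three patches or three thousand returns the same number of tokens, which is what keeps a prototype-by-pathway attention matrix compact.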

2605.21443 2026-05-21 cs.CV cs.AI (updated version)

TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos

Yakun Yu, Ashley Wiens, Adrián Barahona-Ríos, Benedict Wilkins, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

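Affiliations: University of Alberta; Sony Interactive Entertainment

AI summary: This paper proposes the TempGlitch benchmark for evaluating how well vision-language models detect temporal glitches in gameplay videos. It finds that existing models perform poorly on temporal glitches and that denser frame sampling and larger model size do not effectively resolve the problem.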

Abstract

Vision-language models (VLMs) are increasingly being explored for video game quality assurance, especially gameplay glitch detection. Most existing evaluations, however, treat glitches as static visual anomalies, asking models to detect failures from a single frame. We argue that this framing misses a key distinction: some glitches are spatial and visible in an isolated frame, whereas others are temporal and become evident only through changes across ordered frames. A preliminary study confirms this gap, showing that temporal glitches are substantially harder for VLMs to detect than spatial ones. To enable systematic evaluation of this underexplored setting, we introduce TempGlitch, a controlled gameplay video benchmark for temporal glitch detection. TempGlitch covers five temporal glitch types with balanced per-category samples, together with paired glitch-free videos that enable reliable binary evaluation. We evaluate 12 proprietary and open-weight VLMs across multiple frame-sampling settings. Our results show that current VLMs remain near chance on TempGlitch, often collapsing into either overly conservative behavior that misses most glitches or overly sensitive behavior that flags clean videos as glitchy. Moreover, denser frame sampling and larger model size do not reliably resolve these failures. TempGlitch provides a focused testbed for temporal reasoning, robust gameplay understanding, and automated glitch detection with VLMs. Code and data are available at the project website.
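
Illustrative sketch (not from the paper): on a balanced set of glitchy and glitch-free videos, a model that always answers "clean" and a model that always answers "glitchy" both score 50% accuracy, so recall and specificity are needed to tell the two failure modes apart. The function name and the choice of metrics are assumptions for this example, not the benchmark's official scoring code.

```python
def binary_report(labels, predictions):
    """Summarize binary glitch detection; True means 'glitch present'."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of glitchy videos that were flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of clean videos that were left unflagged.
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical balanced set: five glitchy and five clean videos.
labels = [True, False] * 5
conservative = binary_report(labels, [False] * 10)  # never flags a glitch
sensitive = binary_report(labels, [True] * 10)      # flags every video
```

Both degenerate models land at chance accuracy, but the conservative one has zero recall while the sensitive one has zero specificity.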

2605.21440 2026-05-21 cs.CV (updated version)

Closed-Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training

Hongzhi Ruan, Pei Liu, Weiliang Ma, Zhengning Li, Xueyang Zhang, Jun Ma, Dan Xu, Kun Zhan

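Affiliations: Li Auto; HKUST; HKUST (GZ)

AI summary: This paper proposes a closed-loop dynamic data mixture method that adjusts the training data mixture through a dynamic optimization process to improve model performance, addressing the key problem of optimizing the data mixture under a limited budget.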

Abstract

Data scaling is fundamental to modern deep learning, and grows increasingly critical as autonomous driving shifts to end-to-end learning. Real-world driving data is expensive to annotate and scene-biased, making real-synthetic co-training with near-infinite synthetic data a promising direction. However, naively incorporating all available synthetic data is inefficient and leads to distribution shifts, and optimizing data mixture under practical training budgets remains a critical yet under-explored problem. In this sense, we claim that the mixture of training data requires clear guidance in terms of scene types and quantities. Particularly in this work, we conceptualize the data mixture approximately as a dynamic optimization process that iteratively adjusts the training data mixture to maximize model performance, guided by closed-loop evaluation feedback, and propose AutoScale, a fully automated closed-loop data engine unifying scene representation, data mixture optimization and retrieval, as well as model training and evaluation. Specifically, we propose Graph Regularized AutoEncoder (Graph-RAE) for driving scene representations, introduce Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation and reweighting, and perform cluster-guided vector retrieval to select high-value samples. Experiments on NavSim demonstrate that AutoScale outperforms vanilla co-training and cross-domain baselines, achieving better performance with fewer synthetic samples under constrained budgets.

2605.21371 2026-05-21 cs.CV (updated version)

A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica

Leyue Tang, Jonathan Louis Bamber, Gang Qiao, Yuanhang Kong

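Affiliations: College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China

AI summary: This paper proposes DiffGF, a non-reference diffusion-based framework that restores Landsat 7 SLC-off imagery without external reference data. It is trained and evaluated on SLCANT, a dedicated Antarctic dataset, which confirms its high-fidelity restoration of Antarctic SLC-off imagery, and a downstream crevasse segmentation application demonstrates its practical value.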

Comments Submitted to IEEE JSTARS

Abstract

Acquiring usable optical imagery in Antarctica is inherently challenging due to prolonged polar nights and frequent cloud cover. Landsat provides the longest and most continuous optical observations and constitutes one of the most important remote sensing data sources for Antarctic studies. However, the scan-line corrector (SLC) failure in 2003 resulted in approximately 22% missing pixels in Landsat 7 ETM+ SLC-off imagery, severely limiting its usability. Unlike many non-polar environments, Antarctic surfaces undergo rapid and substantial changes, which makes it difficult to obtain reliable reference imagery and reduces the applicability of conventional reference-based gap-filling methods. To address this challenge, we propose DiffGF, a non-reference diffusion-based framework for restoring Landsat 7 SLC-off imagery without requiring any external reference data. DiffGF adopts a two-stage design consisting of a latent-space diffusion process and a pixel-space refinement. A dedicated Antarctic dataset, SLCANT, is constructed for training and evaluation. Quantitative and qualitative results demonstrate that DiffGF restores Antarctic SLC-off imagery with high fidelity. Its practical value is further examined through a downstream crevasse segmentation application. The results suggest that DiffGF provides a useful approach for exploiting Landsat 7 SLC-off archives in Antarctica, enabling the extraction of valuable information from historical records and supporting related Antarctic studies.

2605.21343 2026-05-21 cs.CV (updated version)

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Ziye Li, Henghui Ding

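Affiliations: Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China

AI summary: This paper proposes OcclusionFormer, a Z-order-aware Diffusion Transformer framework that decouples instances and composites them via volume rendering to resolve inter-object occlusion in layout-to-image models, and uses a queried alignment loss to improve spatial precision and semantic consistency.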

Comments ICML 2026, Project Page: https://henghuiding.com/OcclusionFormer/

Abstract

Recent layout-to-image models have achieved remarkable progress in spatial controllability. However, they still struggle with inter-object occlusion. When bounding boxes overlap, most existing methods lack explicit occlusion information, which makes the generation in intersection regions inherently ambiguous and hinders the determination of complex occlusion relationships. As a result, they often produce entangled textures or physically inconsistent layering in the overlapped areas. To address this issue, we first construct SA-Z, a large-scale dataset enriched with explicit occlusion ordering and pixel-level annotations. Building upon our proposed dataset, we introduce OcclusionFormer, a novel occlusion-aware Diffusion Transformer framework that explicitly models Z-order priority by decoupling instances and compositing them via volume rendering. Furthermore, to ensure fine-grained spatial precision, we introduce a queried alignment loss that explicitly supervises individual instances and enhances semantic consistency. The proposed method effectively reduces ambiguity in overlapping regions, enforces correct occlusion dependencies, and preserves structural integrity, leading to substantial accuracy gains across diverse scenes.
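
Illustrative sketch (not from the paper): compositing by Z-order can be pictured with the standard front-to-back rule from volume rendering, in which each layer contributes its color scaled by its opacity and by the transmittance left over from the layers in front of it. The single-pixel setup, the convention that a smaller `z_order` is nearer the viewer, and the function name are assumptions for this example; the abstract does not specify how OcclusionFormer applies the rule.

```python
def composite_pixel(layers):
    """Front-to-back compositing of one pixel.

    layers: list of (z_order, alpha, color) tuples, where a smaller
    z_order is assumed to be nearer the viewer. Returns the composited
    color and the transmittance remaining behind all layers.
    """
    transmittance = 1.0
    color_out = 0.0
    for _, alpha, color in sorted(layers, key=lambda layer: layer[0]):
        color_out += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return color_out, transmittance


# Two opaque instances overlap: the nearer one (z_order 0) hides the other.
opaque = composite_pixel([(1, 1.0, 0.2), (0, 1.0, 0.9)])
# A half-transparent front layer lets half of the back layer through.
blended = composite_pixel([(0, 0.5, 1.0), (1, 1.0, 0.0)])
```

With an explicit ordering, the overlap region has one well-defined answer instead of an ambiguous mix of both instances.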

2605.21309 2026-05-21 cs.CV cs.RO (updated version)

Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation

Abhishek Dinkar Jagtap, Sanath Tiptur Sadashivaiah, Andreas Festag

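Affiliations: CARISSMA Institute for Electric, COnnected, and Secure Mobility (C-ECOS), Technische Hochschule Ingolstadt; University of Applied Sciences Aschaffenburg

AI summary: This paper proposes Hyper-V2X, a hypernetwork-based framework for estimating epistemic and aleatoric uncertainty in cooperative V2X perception. A partial weight generation scheme and a V2X context embedding module condition a Bayesian hypernetwork to generate weight distributions for stochastic bird's-eye-view segmentation, improving perception reliability.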

Comments Accepted for IEEE Intelligent Vehicle Symposium (IV) 2026

Abstract

Cooperative perception enabled by Vehicle-to-Everything (V2X) communication enhances autonomous driving safety by creating a unified environmental representation through shared sensory data. While recent works have advanced multi-agent fusion for improved perception, uncertainty quantification in such cooperative frameworks remains largely unexplored. This paper introduces Hyper-V2X, a hypernetwork-based framework for estimating both epistemic and aleatoric uncertainties in V2X-based perception. Specifically, we propose a partial weight generation scheme and V2X context embedding module that conditions a Bayesian hypernetwork on fused multi-agent features to generate weight distributions for stochastic Bird's-Eye-View (BEV) segmentation. Unlike existing deterministic BEV models, Hyper-V2X enables efficient uncertainty estimation with little computational overhead. Our approach is architecture-agnostic and can be seamlessly integrated with modern cooperative backbones such as CoBEVT. Experiments on the OPV2V benchmark demonstrate that Hyper-V2X provides accurate, well-calibrated uncertainty estimates and improves overall perception reliability. Our code and benchmark are publicly available under an open-source license: https://github.com/abhishekjagtap1/Hyper-V2X
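
Illustrative sketch (not from the paper): one common way to separate the two uncertainty types from stochastic predictions is an entropy decomposition, where the entropy of the averaged prediction is the total uncertainty, the average per-sample entropy is the aleatoric part, and their difference is the epistemic part. Whether Hyper-V2X uses this estimator is not stated in the abstract; the function names and the per-cell probability vectors below are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(samples):
    """Split predictive uncertainty for one BEV cell.

    samples: class-probability vectors, one per sampled set of weights.
    Returns (total, aleatoric, epistemic), all in nats.
    """
    n = len(samples)
    num_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / n for c in range(num_classes)]
    total = entropy(mean)
    aleatoric = sum(entropy(s) for s in samples) / n
    return total, aleatoric, total - aleatoric


# Samples agree that the cell is ambiguous: uncertainty is aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Samples are individually confident but contradict each other: epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both cases have the same total uncertainty (ln 2), yet the split differs, which is why the two types are worth reporting separately.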

2605.21308 2026-05-21 cs.CV cs.AI (updated version)

Deformba: Vision State Space Model with Adaptive State Fusion

Hongyu Ke, Jack Morris, Yongkang Liu, Satoshi Kitai, Kentaro Oguchi, Yi Ding, Haoxin Wang

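Affiliations: Department of Computer Science, Georgia State University; University of Tennessee Knoxville

AI summary: This paper proposes Deformba, a context-adaptive method that dynamically augments spatial structural information while preserving the linear complexity of state space models. It supports multi-modal fusion in the manner of cross attention to improve performance on vision tasks, and shows broad applicability across 2D and 3D vision tasks.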

Journal ref: Forty-Third International Conference on Machine Learning (ICML 2026)

Abstract

State Space Models (SSMs) have emerged as a powerful and efficient alternative to Transformers, demonstrating linear-time complexity and exceptional sequence modeling capabilities. However, their application to vision tasks remains challenging. First, existing vision SSMs largely depend on manually designed fixed scanning methods to flatten image patches into sequences, which imposes predefined geometric structures and increases the complexity. Second, the broader adoption of vision SSMs is hindered in domains that require query-based interactions between distinct information streams. This is a result of the inherently causal and self-referential nature of SSMs designed for 1D sequence modeling tasks. This fusion mechanism is indispensable for critical perception tasks such as multi-view 3D fusion. To address these limitations, we propose Deformba, a context adaptive method that dynamically augments the spatial structural information while maintaining the linear complexity of SSMs. Deformba also allows multi-modal fusion like cross attention. To demonstrate the effectiveness and general applicability of Deformba, we test its performance on general 2D vision tasks such as image classification, object detection, and segmentation, as well as 3D vision tasks like BEV perception. Extensive experiments show that Deformba achieves strong performance across various visual perception benchmarks.

2605.21301 2026-05-21 cs.LG cs.CV (updated version)

Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls

Robin Louiset, Edouard Duchesnay, Benoit Dufumier, Antoine Grigis, Pietro Gori

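Affiliations: NeuroSpin; Université Paris-Saclay; CEA; LTCI; Institut Polytechnique de Paris

AI summary: This paper proposes a method that discovers interpretable, homogeneous disease subgroups by contrasting patients with healthy controls, and shows improved subgroup estimation quality on medical imaging datasets.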

Comments Accepted to Data Mining and Knowledge Discovery, ECML-PKDD 2026 Journal Track

Abstract

In biomedical Subgroup Discovery, practitioners are interested in discovering interpretable and homogeneous subgroups within a group of patients. In this paper, assuming that healthy subjects (i.e., controls) share common but irrelevant factors of variation with the patients, we motivate and develop a Contrastive Subgroup Discovery method, entitled Deep UCSL. By contrasting patients with controls, Deep UCSL identifies subgroups driven solely by pathological factors, ignoring common variability shared with healthy subjects. Our framework employs a deep feature extractor to learn a discriminative representation space. Mathematically, we derive a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels, optimized via an Expectation-Maximization strategy alternating between subgroup inference and feature encoder updates. A regularization term further encourages representations to capture disease-specific variability while ignoring variability shared with controls. Compared to previous related works, our approach quantitatively improves the quality of the estimated subgroups, as demonstrated on an MNIST example and four distinct real medical imaging datasets. Code and datasets are available at: https://github.com/rlouiset/deep_ucsl.

2605.21300 2026-05-21 cs.CV (updated version)

Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens

Meng Shen, Minghao Wu, Deepu Rajan

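Affiliations: Nanyang Technological University; Monash University

AI summary: This paper reduces object hallucination in LVLMs by emphasizing image-negative tokens, proposing to adjust the training weights of different tokens and to apply a data filtering strategy to control hallucination.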

Comments 20 pages, 10 figures, 10 tables

Abstract

Object hallucination is a significant challenge that hinders the application of large vision-language models (LVLMs) in practice. We hypothesize that one possible origin of hallucination is the model's tendency to prioritize text generation over meaningful interaction with images. To explore this, we examine the generation process and categorize text tokens into three groups: image-positive, invariant, and negative, based on their visual dependence on input image tokens. Our analysis reveals that most generated tokens are minimally influenced by the image information. This suggests that during the model's training stage, more emphasis is placed on learning how to follow textual instructions, rather than extracting information from images. Based on this finding, we propose adjusting the training weights of different tokens depending on their visual dependence to control hallucination. Additionally, we remove a portion of the training data that potentially contains more hallucinations as a data filtering strategy. Both methods achieve a reduction in hallucination without compromising response length or introducing additional computational costs during inference. We validate our methods across three LVLM variants, demonstrating the effectiveness and general applicability.
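
Illustrative sketch (not from the paper): adjusting per-token training weights by visual dependence can be pictured as a weighted negative log-likelihood, where each token's loss is scaled by a weight chosen for its group. The group labels follow the abstract, but the specific weight values, the normalization by the weight sum, and the function name are hypothetical choices for this example.

```python
def weighted_token_loss(token_log_probs, token_groups, group_weights):
    """Weighted mean negative log-likelihood over a sequence.

    token_log_probs: log-probability the model gave each target token.
    token_groups: group label of each token.
    group_weights: mapping from group label to loss weight (hypothetical).
    """
    weights = [group_weights[group] for group in token_groups]
    weighted_nll = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    return weighted_nll / sum(weights)


log_probs = [-1.0, -2.0, -3.0]
groups = ["image-positive", "invariant", "image-negative"]

uniform = weighted_token_loss(
    log_probs, groups,
    {"image-positive": 1.0, "invariant": 1.0, "image-negative": 1.0})
# Following the title, the image-negative group is up-weighted here.
emphasized = weighted_token_loss(
    log_probs, groups,
    {"image-positive": 1.0, "invariant": 1.0, "image-negative": 2.0})
```

Because the weights only rescale the training loss, a scheme of this kind adds no cost at inference time, which is consistent with the abstract's claim.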

2605.21280 2026-05-21 cs.CV (updated version)

Let EEG Models Learn EEG

Yifan Wang, Yijia Ma, Wen Li, Chenyu You

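Affiliations: Stony Brook University; University of Texas Health Center at Houston

AI summary: This paper proposes JET, a generative framework based on conditional flow matching that generates high-quality EEG signals by directly modeling the continuous evolution of neural signals. It addresses the shortcomings of discrete denoising approaches in capturing long-range temporal dependencies and preserving spectral structure, and outperforms existing methods on multiple benchmarks.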

Comments Accepted by ICML 2026

Abstract

High-fidelity EEG generation is critical for alleviating data scarcity and addressing privacy constraints in large-scale neural modeling. Despite recent progress, most existing approaches formulate EEG generation via discrete denoising objectives, which inadequately reflect the inherently continuous temporal dynamics and spectral structure of neural activity. As a result, these methods often struggle to preserve long-range temporal dependencies and exhibit mismatches in the spectral and temporal structure of the generated signals. In this work, we argue that effective EEG generation requires models that operate directly on the continuous evolution of neural signals. We introduce Just EEG Transformer (JET), a generative framework based on conditional flow matching that models EEG as raw sequences evolving along continuous trajectories. By learning a smooth vector field that transports noise to the EEG data distribution, JET captures temporal continuity and transient dynamics without relying on discretized denoising schemes or domain-specific representations. To ensure that the learned dynamics remain consistent with key properties of EEG signals, we introduce principled constraints that preserve spectral structure, temporal stationarity, and signal-level statistics. Across three large-scale benchmarks, JET consistently achieves state-of-the-art performance, reducing TS-FID by over 40% compared to strong baselines. Extensive analyses show that JET captures key structural properties of neural dynamics, providing a scalable and principled approach to EEG generation. Project page: https://y-research-sbu.github.io/JET/
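
Illustrative sketch (not from the paper): in the most common form of conditional flow matching, a noise sample and a data sample are joined by a straight path, and the network is trained to predict the constant velocity along that path. The linear path, the mean-squared-error loss, and the function names are assumptions for this example; the abstract does not say which path or which additional constraints JET uses.

```python
def cfm_training_pair(noise, signal, t):
    """Build one flow-matching training example on a straight path.

    Returns the interpolated point x_t = (1 - t) * noise + t * signal
    and the target velocity signal - noise that a network would regress.
    """
    x_t = [(1.0 - t) * n + t * s for n, s in zip(noise, signal)]
    velocity = [s - n for n, s in zip(noise, signal)]
    return x_t, velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - u) ** 2 for p, u in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


# Hypothetical two-sample "signal"; real EEG windows are far longer.
x_t, target = cfm_training_pair(noise=[0.0, 2.0], signal=[1.0, 0.0], t=0.5)
perfect = cfm_loss(target, target)
```

At sampling time, integrating the learned velocity field from t = 0 to t = 1 carries a noise draw to a generated signal.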

2605.21272 2026-05-21 cs.CV cs.AI (updated version)

MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset

Benjamin Aubin, Gonzalo Iñaki Quintana, Onur Tasar, Sanjeev Sreetharan, Urszula Czerwinska, Damien Henry, Clément Chadebec

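Affiliations: Jasper Research

AI summary: This paper presents the MONET dataset, which provides high-quality text-to-image data through multi-stage filtering and enrichment to lower the barrier to large-scale, reproducible research.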

Abstract

Training large text-to-image models requires high-quality, curated datasets with diverse content and detailed captions. Yet the cost and complexity of collecting, filtering, deduplicating, and re-captioning such corpora at scale hinders open and reproducible research in the field. We introduce MONET, an open Apache 2.0 dataset of approx. 104.9M image--text pairs collected from 2.9B raw pairs across heterogeneous open sources through successive stages of safety filtering, domain-based filtering, exact and near-duplicate removal, and re-captioning with multiple vision-language models covering short to long-form descriptions, and further augmented with synthetically generated samples. Each image is shipped with pre-computed embeddings and annotations to accelerate downstream use. To validate the effectiveness of MONET, we train a 4B-parameter latent diffusion model exclusively on it and reach competitive GenEval and DPG scores, demonstrating that our dataset lowers the barrier to large-scale, reproducible text-to-image research.
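
Illustrative sketch (not from the paper): the exact-duplicate stage of a pipeline like this is often implemented by hashing the raw image bytes and keeping the first pair seen for each digest. The use of SHA-256, the keep-first policy, and the function name are assumptions for this example; near-duplicate removal needs perceptual or embedding-based matching and is not shown.

```python
import hashlib


def drop_exact_duplicates(pairs):
    """Keep the first (image_bytes, caption) pair for each distinct image.

    Two images count as duplicates only when their bytes are identical.
    """
    seen = set()
    kept = []
    for image_bytes, caption in pairs:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((image_bytes, caption))
    return kept


# Hypothetical raw pairs; the third repeats the first image's bytes.
raw_pairs = [(b"img-a", "a cat"), (b"img-b", "a dog"), (b"img-a", "cat photo")]
unique_pairs = drop_exact_duplicates(raw_pairs)
```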

2605.07816 2026-05-21 cs.CV (updated version)

ICDAR 2026 Competition on Writer Identification and Pen Classification from Hand-Drawn Circles

Thomas Gorges, Janne van der Loop, Lukas Hüttner, Linda-Sophie Schneider, Fei Wu, Mathias Seuret, Vincent Christlein

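Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Buchwissenschaft, Johannes Gutenberg-Universität Mainz

AI summary: This paper presents the CircleID competition, which studies how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. Its two tasks, open-set writer identification and cross-writer pen classification, evaluate how well models recognize known writers while rejecting unknown ones and classify pens across both seen and unseen writers.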

Abstract

This paper presents CircleID, a large-scale ICDAR 2026 competition on writer identification and pen classification from scanned hand-drawn circles. The primary objective is to investigate how biometric writer characteristics and physical pen features naturally entangle within minimal, static traces. CircleID comprises two distinct tasks: (1) open-set writer identification, requiring models to recognize known writers while explicitly rejecting unknown ones, and (2) cross-writer pen classification, evaluated across both seen and unseen writers. Participants were provided with a new, controlled dataset of 46,155 tightly cropped circle images, digitized at 400 DPI and annotated for writer identity and pen type. The dataset comprises samples from 44 known and 22 unknown writers using eight different pens. Hosted on Kaggle as two separate tracks with public and private leaderboards, the competition provided participants with a ResNet baseline. In total, 389 teams (436 participants) made 3,185 submissions for the pen classification task, and 113 teams (141 participants) made 1,737 submissions for the writer identification track. The best-performing private leaderboard submissions achieved a Top-1 accuracy of 64.801% for writer identification and 92.726% for pen classification. This paper details the dataset, evaluates the winning methodologies, and analyzes the impact of out-of-distribution writers on model generalization and feature disentanglement. In this large-scale competition, CircleID establishes a new baseline for minimal-trace analysis.
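
Illustrative sketch (not from the paper): open-set identification differs from ordinary classification in that the model must be able to answer "none of the known writers". A simple baseline, assumed here for illustration only, is to reject whenever the best score falls below a threshold. The score dictionary, the threshold value, and the function name are hypothetical and are not the competition's baseline or any participant's method.

```python
def identify_writer(scores, threshold):
    """Return the best-scoring known writer, or None to reject as unknown.

    scores: mapping from known writer ID to a confidence in [0, 1].
    threshold: minimum confidence required to accept the best match.
    """
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# A confident match is accepted; a flat score profile is rejected.
accepted = identify_writer({"w01": 0.90, "w02": 0.05}, threshold=0.5)
rejected = identify_writer({"w01": 0.40, "w02": 0.35}, threshold=0.5)
```

The threshold trades off two errors: set too low, unknown writers are assigned a known identity; set too high, known writers are wrongly rejected.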

2603.24139 2026-05-21 cs.CV cs.LG (updated version)

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye

Affiliations: School of Computer Science, Wuhan University; School of Integrated Circuits, Peking University; School of Information, Huazhong Agricultural University; Cyberspace Institute of Advanced Technology, Guangzhou University

AI summary: This paper proposes a Tutor-Student Reinforcement Learning framework that dynamically optimizes the training curriculum to improve the robustness and generalization of deepfake detection.

Comments Accepted to CVPR 2026

Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

英文摘要

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a ``Tutor'' agent learns to guide a ``Student'' (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
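The re-weighting and reward ideas in this abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `reweighted_batch_loss` and `transition_reward`, the normalization by the sum of weights, and the reward defined as a simple count of incorrect-to-correct flips are all hypothetical, and the PPO Tutor itself is not reproduced.

```python
from typing import Sequence


def reweighted_batch_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-sample losses, with weights clipped to [0, 1].

    Normalizing by the sum of weights is an assumption of this sketch.
    """
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    clipped = [min(1.0, max(0.0, w)) for w in weights]
    total = sum(clipped)
    if total == 0.0:
        return 0.0  # every sample received zero weight
    return sum(w * l for w, l in zip(clipped, losses)) / total


def transition_reward(before: Sequence[bool], after: Sequence[bool]) -> int:
    """Hypothetical reward: number of samples that flipped from wrong to right."""
    return sum(1 for b, a in zip(before, after) if not b and a)
```

In this sketch a weight of 0 removes a sample from the batch loss entirely, while equal weights reduce to the ordinary mean.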

2603.17784 2026-05-21 cs.CV cs.LG (updated version)

ResNet-50 with Class Reweighting and Anatomy-Guided Temporal Decoding for Gastrointestinal Video Analysis

Romil Imtiaz, Dimitris K. Iakovidis

Affiliations: Department of Computer Science and Biomedical Informatics, University of Thessaly

AI summary: This paper proposes a multi-label gastrointestinal video analysis pipeline that combines a ResNet-50 frame classifier with anatomy-guided temporal event decoding. Class reweighting and anatomy-guided decoding improve recognition of rare pathology classes, raising temporal mAP on the challenge test set from 0.3801 to 0.4303.

Comments ICPR 2026 RARE-VISION Competition

2601.15133 2026-05-21 cs.CV cs.LG (updated version)

SMILE-UHURA Challenge -- Small-Vessel Segmentation from Ultra-High-Resolution 7T Magnetic Resonance Angiography

Soumick Chatterjee, Hendrik Mattern, Marc Dörner, Alessandro Sciarra, Florian Dubost, Hannes Schnurre, Rupali Khatun, Chun-Chih Yu, Tsung-Lin Hsieh, Yi-Shan Tsai, Yi-Zeng Fang, Yung-Ching Yang, Juinn-Dar Huang, Marshall Xu, Siyu Liu, Fernanda L. Ribeiro, Saskia Bollmann, Karthikesh Varma Chintalapati, Chethan Mysuru Radhakrishna, Sri Chandana Hudukula Ram Kumara, Raviteja Sutrave, Abdul Qayyum, Moona Mazher, Imran Razzak, Cristobal Rodero, Steven Niederren, Fengming Lin, Yan Xia, Jiacheng Wang, Riyu Qiu, Liansheng Wang, Arya Yazdan Panah, Rosana El Jurdi, Guanghui Fu, Janan Arslan, Ghislain Vaillant, Romain Valabregue, Didier Dormont, Bruno Stankoff, Olivier Colliot, Luisa Vargas, Isai Daniel Chacón, Ioannis Pitsiorlas, Pablo Arbeláez, Maria A. Zuluaga, Stefanie Schreiber, Oliver Speck, Andreas Nürnberger

Affiliations: Faculty of Computer Science, Otto von Guericke University Magdeburg; Data and Knowledge Engineering Group, Otto von Guericke University Magdeburg; Human Technopole; Biomedical Magnetic Resonance, Otto von Guericke University Magdeburg; Department of Neurology, Medical Faculty, University Hospital of Magdeburg; German Centre for Neurodegenerative Diseases; Centre for Behavioural Brain Sciences, Magdeburg; Department of Neurology, University Hospital Zurich; Department of Consultation-Liaison-Psychiatry and Psychosomatic Medicine, University Hospital Zurich; Stanford University; Translational Radiobiology, Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg; National Yang Ming Chiao Tung University; School of Electrical Engineering and Computer Science, University of Queensland; Australian eHealth Research Centre, CSIRO; National Heart and Lung Institute, Faculty of Medicine, Imperial College London; Hawkes Institute, Department of Computer Science, University College London; School of Computer Science and Engineering, University of New South Wales; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; The Alan Turing Institute, London, UK; School of Computing, University of Leeds; Department of Computer Science at School of Informatics, Xiamen University; Manteia Technologies Co., Ltd, Xiamen, China; Leicester International Institute, Dalian University of Technology; Sorbonne Université, Institut du Cerveau - Paris Brain Institute; Centre of Formation and Research in Artificial Intelligence, Universidad de Los Andes, Colombia; Data Science Department, EURECOM, Sophia Antipolis, France

AI summary: This study addresses the shortage of publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI, and evaluates several deep learning methods on small-vessel segmentation.

Abstract

The human brain receives nutrients and oxygen through an intricate network of blood vessels. Pathology affecting small vessels, at the mesoscopic scale, represents a critical vulnerability within the cerebral blood supply and can lead to severe conditions, such as Cerebral Small Vessel Diseases. The advent of 7 Tesla MRI systems has enabled the acquisition of higher spatial resolution images, making it possible to visualise such vessels in the brain. However, the lack of publicly available annotated datasets has impeded the development of robust, machine learning-driven segmentation algorithms. To address this, the SMILE-UHURA challenge was organised. This challenge, held in conjunction with the ISBI 2023, in Cartagena de Indias, Colombia, aimed to provide a platform for researchers working on related topics. The SMILE-UHURA challenge addresses the gap in publicly available annotated datasets by providing an annotated dataset of Time-of-Flight angiography acquired with 7T MRI. This dataset was created through a combination of automated pre-segmentation and extensive manual refinement. In this manuscript, sixteen submitted methods and two baseline methods are compared both quantitatively and qualitatively on two different datasets: held-out test MRAs from the same dataset as the training data (with labels kept secret) and a separate 7T ToF MRA dataset where both input volumes and labels are kept secret. The results demonstrate that most of the submitted deep learning methods, trained on the provided training dataset, achieved reliable segmentation performance. Dice scores reached up to 0.838 $\pm$ 0.066 and 0.716 $\pm$ 0.125 on the respective datasets, with an average performance of up to 0.804 $\pm$ 0.15.
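For readers unfamiliar with the Dice score reported above, the following is a minimal sketch of the standard definition, Dice = 2|P ∩ T| / (|P| + |T|), on flattened binary masks. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions of this sketch; the challenge's own evaluation code may handle such edge cases differently.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / size_sum
```

A prediction that marks two voxels as vessel when only one of them is truly vessel scores 2·1/(2+1) ≈ 0.667.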

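2105.09034 2026-05-21 cs.GR cs.CV (updated version)

Guided Facial Skin Color Correction

Keiichiro Shirai, Tatsuya Baba, Shunsuke Ono, Masahiro Okuda, Yusuke Tatesumi, Paul Perrotin

Affiliations: Shinshu University; The University of Kitakyushu; Tokyo Institute of Technology; Doshisha University; The University Institute of Technology La Rochelle

AI summary: This paper proposes an automatic image correction method for portrait photographs that improves the consistency of facial skin color by suppressing skin color changes caused by background colors. In portrait photography, skin color is often distorted by the lighting environment (for example, light reflected from a colored background wall or over-exposure by a camera strobe); if the photo is artificially composited with another background color, the color change becomes more noticeable and the composite looks unnatural. In the framework, the face region is first roughly extracted and the skin color distribution is rectified in a color space; color and brightness are then corrected around the face in the original image to achieve a proper facial color balance that is unaffected by luminance and background colors. Unlike conventional color correction algorithms, the final result is obtained by a color correction process with a guide image. In particular, the guided image filtering used for color correction does not require the perfectly aligned guide image needed by the original guided image filtering method of He et al. Experiments show that the method produces more natural results than conventional methods on both portrait and natural scene photographs. Automatic yearbook-style photo generation is shown as a further application.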

Comments 12 pages, 16 figures

DOI: 10.3390/signals2030033
Journal ref: Signals, vol. 2, no. 3, pp. 540-558, 2021

Abstract

This paper proposes an automatic image correction method for portrait photographs, which promotes consistency of facial skin color by suppressing skin color changes due to background colors. In portrait photographs, skin color is often distorted due to the lighting environment (e.g., light reflected from a colored background wall and over-exposure by a camera strobe), and if the photo is artificially combined with another background color, this color change is emphasized, resulting in an unnatural synthesized result. In our framework, after roughly extracting the face region and rectifying the skin color distribution in a color space, we perform color and brightness correction around the face in the original image to achieve a proper color balance of the facial image, which is not affected by luminance and background colors. Unlike conventional algorithms for color correction, our final result is attained by a color correction process with a guide image. In particular, our guided image filtering for the color correction does not require the perfectly aligned guide image that is required by the original guided image filtering method proposed by He et al. Experimental results show that our method generates more natural results than conventional methods on not only headshot photographs but also natural scene photographs. We also show automatic yearbook-style photo generation as another application.

2605.21251 2026-05-21 eess.IV cs.CV (updated version)

Local-Sensitive Connectivity Filter (LS-CF): A Post-Processing Unsupervised Improvement of the Frangi, Hessian and Vesselness Filters for Multimodal Vessel Segmentation

Erick O Rodrigues, Lucas O Rodrigues, João HP Machado, Dalcimar Casanova, Marcelo Teixeira, Jeferson T Oliva, Giovani Bernardes, Panos Liatsis

Affiliations: Department of Academic Informatics (DAINF), Universidade Tecnologica Federal do Parana (UTFPR); Graduate Program of Sciences Applied to Health Products, Universidade Federal Fluminense (UFF); Institute of Technological Sciences (ICT), Universidade Federal de Itajuba (UNIFEI); Department of Electrical Engineering and Computer Science, Khalifa University of Science and Technology

AI summary: This paper proposes an unsupervised multimodal method that improves the response of the Frangi filter for automatic vessel segmentation. The local-sensitive connectivity filter (LS-CF) computes pixel-level vessel continuity and introduces a local tolerance heuristic to fill the vessel discontinuities produced by the Frangi response. It achieves competitive results on a variety of multimodal datasets; on the OSIRIX angiographic dataset its accuracy exceeds that of existing state-of-the-art methods.

DOI: 10.3390/jimaging8100291
Journal ref: Journal of Imaging 2022

Abstract

A retinal vessel analysis is a procedure that can be used to assess risks to the eye. This work proposes an unsupervised multimodal approach that improves the response of the Frangi filter, enabling automatic vessel segmentation. We propose a filter that computes pixel-level vessel continuity while introducing a local tolerance heuristic to fill in vessel discontinuities produced by the Frangi response. This proposal, called the local-sensitive connectivity filter (LS-CF), is compared against the baseline thresholded Frangi filter response, a naive connectivity filter, the naive connectivity filter combined with morphological closing, and current approaches in the literature. The proposal achieved competitive results on a variety of multimodal datasets. It was robust enough to outperform all state-of-the-art approaches in the literature on the OSIRIX angiographic dataset in terms of accuracy, and 4 out of 5 works on the IOSTAR dataset, while also outperforming several works on the DRIVE and STARE datasets and 6 out of 10 on the CHASE-DB dataset. On CHASE-DB, it also outperformed all state-of-the-art unsupervised methods.
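To make the comparison baseline concrete, here is a rough sketch of what a naive connectivity filter on a thresholded vesselness response could look like: threshold the response, then keep only 8-connected components of at least a minimum size. This is an assumption-laden illustration, not LS-CF itself; the abstract does not specify the local tolerance heuristic, and the name `connectivity_filter` and its `threshold` and `min_size` parameters are hypothetical.

```python
from collections import deque
from typing import List


def connectivity_filter(
    response: List[List[float]], threshold: float, min_size: int
) -> List[List[int]]:
    """Threshold a response map, then keep 8-connected components
    that contain at least `min_size` pixels."""
    rows = len(response)
    cols = len(response[0]) if rows else 0
    binary = [[1 if v >= threshold else 0 for v in row] for row in response]
    seen = [[False] * cols for _ in range(rows)]
    output = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Small components are treated as noise and dropped.
            if len(component) >= min_size:
                for y, x in component:
                    output[y][x] = 1
    return output
```

A filter of this kind removes isolated false positives but cannot reconnect a vessel that the threshold has split, which is the gap the local tolerance heuristic is described as addressing.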

2605.21244 2026-05-21 cs.CV (updated version)

SR-Ground: Image Quality Grounding for Super-Resolved Content

Artem Borisov, Evgeney Bogatyrev, Khaled Abud, Dmitriy Vatolin

Affiliations: Lomonosov Moscow State University; MSU AI Center, Lomonosov Moscow State University

AI summary: This paper introduces SR-Ground, a dataset for fine-grained artifact segmentation in super-resolved images. Built through a large-scale crowdsourcing study, the dataset improves IQA model performance and helps reduce perceptible artifacts in super-resolution outputs.

Abstract

Super-Resolution (SR) has advanced rapidly in recent years, with diffusion-based models achieving unprecedented fidelity at the cost of introducing new types of visual artifacts. While existing Image Quality Assessment (IQA) methods provide holistic quality scores, they lack interpretability and fail to distinguish between different artifact types arising from modern SR approaches. To address this gap, we introduce SR-Ground, a large-scale dataset specifically designed for fine-grained artifact segmentation in super-resolved images. The dataset comprises images processed by a diverse set of state-of-the-art SR models, with pixel-level annotations for multiple artifact categories. We conduct a large-scale crowdsourcing study involving 1,062 participants to validate and refine automatically generated segmentations, resulting in a high-quality dataset of 63,000 images spanning 6 distinct artifact types. We demonstrate that training IQA models with grounding capabilities on SR-Ground significantly improves performance on downstream tasks. Furthermore, we introduce a fine-tuning pipeline that leverages our grounding model to reduce perceptible artifacts in SR outputs, showcasing the practical utility of our dataset.

2605.21237 2026-05-21 cs.CV cs.AI (updated version)

RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis

Xuan Yang, Xiaohan Yuan, Hao Li, Lingyu Chen, Yanan Liu, Lei Li

Affiliations: School of Biomedical Engineering, National University of Singapore, Singapore; School of Automation, Southeast University, Nanjing, China; School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Information Science and Engineering, Yunnan University, Kunming, China

AI summary: This paper proposes RePCM, a method for bi-ventricular mesh motion completion from a single frame. It uses region-specific and phenotype-adaptive modeling to improve the accuracy of cardiac motion synthesis in the presence of region- and disease-specific differences caused by cardiovascular diseases.

Comments Early Accepted by MICCAI 2026. This is the author's submitted version. 10 pages, 3 figures

Abstract

Cardiac motion over a cardiac cycle is crucial for quantifying regional function and is strongly affected by cardiovascular diseases. Since temporally dense mesh sequences are difficult to obtain in practice, we focus on leveraging the more accessible end-diastolic (ED) frame to infer a full-cycle sequence. Due to strong regional and disease-specific differences, traditional methods often oversmooth the data by relying on generative models that are optimized for global patterns. To address this problem, we propose Region-Aware and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis (RePCM) for single-frame bi-ventricular mesh motion completion. In Stage I, a reconstruction network learns vertex-wise motion descriptors, and clustering yields a data-driven functional partition, providing an explicit motion-derived region structure. In Stage II, a Region-Specific Injection Module enforces masked, synchronized region exchange within a conditional VAE, preserving locally specific dynamics and restricting cross-region mixing. A Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape uses anatomy-guided cues to model latent motion trends and capture inter-disease variability. Experiments on three datasets covering different cardiovascular diseases show consistent gains in geometric and functional metrics and improved preservation of region-specific dynamics.

2605.21207 2026-05-21 cs.CV (updated version)

PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection

Xiaoyu Zhou, Jianwei Fei, Peipeng Yu, Jingchang Xie, Chong Cheng, Zhihua Xia

Affiliations: College of Cyber Security, Jinan University, Guangzhou, China; Department of Information Engineering, University of Florence, Florence, Italy; School of Integrated Circuits, Guangdong University of Technology, Guangzhou, China

AI summary: This paper proposes the PGC framework, which aggregates salient features through a peak-focusing mechanism to calibrate the global decision, improving sensitivity to fine-grained discriminative signals and achieving state-of-the-art performance on the CommGen15 dataset.

Abstract

The rapid evolution of generative AI, from GANs to modern diffusion models, has resulted in increasingly subtle discriminative clues. These fine-grained signals are often overshadowed by dominant, high-fidelity image content (e.g., the main subject), limiting the reliability of existing detectors that predominantly rely on global representations. To address this challenge, we propose the Peak-Guided Calibration (PGC) framework. PGC introduces a novel strategy that aggregates salient features via a peak-focusing mechanism. Specifically, by employing a peak-sensitive aggregation that accentuates the most discriminative local clues, PGC leverages these critical signals to calibrate the global decision. This approach recovers subtle patterns that would otherwise be submerged in the global context. Furthermore, to better simulate real-world threats, we introduce the CommGen15 dataset, a challenging benchmark comprising samples from 15 commercial models. Extensive experiments demonstrate that PGC achieves state-of-the-art performance. Specifically, it improves mean accuracy by +12.3% on our CommGen15 dataset, and sets new records on standard benchmarks, including GenImage (+2.1%), AIGI (+3.5%), and UniversalFakeDetect (+1.7%). Code is available at https://github.com/xiaoyu6868/PGC.
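One way to picture a peak-sensitive aggregation is a smooth maximum over local scores, which lets a single strong local cue dominate where a plain mean would dilute it. The sketch below is purely illustrative and is not taken from the PGC code: the log-mean-exp pooling, the `temperature` and `alpha` parameters, and the linear blend in `calibrated_score` are all assumptions.

```python
import math
from typing import Sequence


def peak_pool(scores: Sequence[float], temperature: float = 0.1) -> float:
    """Smooth maximum (log-mean-exp) over local scores.

    A small temperature moves the result toward max(scores);
    a large temperature moves it toward the mean.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    mean_exp = sum(math.exp((s - m) / temperature) for s in scores) / len(scores)
    return m + temperature * math.log(mean_exp)


def calibrated_score(
    global_score: float, local_scores: Sequence[float], alpha: float = 0.5
) -> float:
    """Hypothetical calibration: blend the global score with the pooled peak."""
    return (1.0 - alpha) * global_score + alpha * peak_pool(local_scores)
```

With local scores of `[0, 0, 0, 1]`, mean pooling gives 0.25, whereas `peak_pool` stays much closer to the single high-scoring patch.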

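2605.21195 2026-05-21 cs.CV (updated version)

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi

Affiliations: Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; University of Strasbourg, France; The Chinese University of Hong Kong, Hong Kong

AI summary: This study proposes SurgOnAir, a streaming vision-language model trained on a hierarchical dataset that understands the surgical workflow at multiple levels in real time and generates commentary, improving immediate responsiveness during surgery.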

Abstract

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

2605.21131 2026-05-21 cs.CV (updated version)

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Affiliations: Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R. China; The National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, P.R. China; Applied Science, Amazon.com, Inc., USA

AI summary: This paper proposes UniT, a model that uses a Group Autoregressive Transformer to unify several geometry perception capabilities, including online perception, offline reconstruction, multi-modal fusion, long-horizon scalability, and metric-scale estimation, and introduces a scale-adaptive geometry loss to improve metric-scale generalization across scenes.

Comments Submitted to IEEE T-PAMI

Abstract

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
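The bounded-memory behaviour of a queue-style KV cache can be illustrated with a fixed-capacity FIFO: once the cache is full, appending a new group evicts the oldest one. This is a minimal sketch under assumptions; the class name `QueueKVCache`, the `max_groups` parameter, and the use of plain lists in place of key and value tensors are hypothetical and say nothing about how UniT actually stores or evicts entries.

```python
from collections import deque
from typing import Deque, List, Tuple

Entry = Tuple[List[float], List[float]]  # (keys, values) for one group


class QueueKVCache:
    """FIFO cache of per-group (key, value) entries with a fixed capacity."""

    def __init__(self, max_groups: int) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._entries: Deque[Entry] = deque(maxlen=max_groups)

    def append(self, keys: List[float], values: List[float]) -> None:
        self._entries.append((keys, values))

    def keys(self) -> List[List[float]]:
        return [k for k, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


# Online mode in this sketch: one single-frame group per autoregressive step.
cache = QueueKVCache(max_groups=3)
for step in range(5):
    cache.append([float(step)], [float(step) * 10.0])
```

After five steps the cache holds only the three most recent groups, so memory stays constant no matter how long the stream runs.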

2605.21130 2026-05-21 cs.CV (updated version)

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

Shibei Meng, Binxin Yang, Yuan Liu, Jiexuan Zhang, Zhengyao Lv, Hubery Yin, Qiang Xu

Affiliations: The Chinese University of Hong Kong; WeChat Vision, Tencent Inc.; Beijing Normal University; Peking University; The University of Hong Kong

AI summary: This paper proposes VersusQ, a pairwise margin reasoning framework that compares videos directly to mitigate absolute-scale calibration bias and enable cross-domain video quality assessment.

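AI-generated abstract (translated from Chinese)

Large multimodal models (LMMs) show promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision typically mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may perform well within a benchmark yet transfer poorly to unseen domains. We argue that relative comparison mitigates absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. We therefore propose VersusQ, a pairwise margin reasoning framework driven entirely by direct comparison. Specifically, VersusQ performs an LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. In addition, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relative reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks show that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.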
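A signed continuous margin carries two pieces of information: its sign says which video is preferred and its magnitude says by how much. The sketch below shows one simple way such margins could be read and turned into a ranking. It is an illustration only; the sign convention (positive means the first video is better), the function names, and the mean-margin aggregation are assumptions and are not described in the listing.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def preferred(margin: float) -> str:
    """Read the preferred side from the sign of a margin (A versus B)."""
    if margin > 0:
        return "A"
    if margin < 0:
        return "B"
    return "tie"


def rank_from_margins(pairs: Iterable[Tuple[str, str, float]]) -> List[str]:
    """Rank videos by their mean signed margin over all comparisons.

    Each item is (first_video, second_video, margin), where a positive
    margin means the first video was judged better.
    """
    total: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    for first, second, margin in pairs:
        total[first] += margin
        total[second] -= margin
        count[first] += 1
        count[second] += 1
    return sorted(total, key=lambda v: total[v] / count[v], reverse=True)
```

Because only differences enter the computation, any constant offset in how a dataset scores videos cancels out, which is the intuition behind comparing rather than scoring.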

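R2AoP: Reliable and Robust Estimation of the Angle of Progression from Intrapartum Ultrasound

Yuanhan Wang, Yifei Chen, Beining Wu, Mingxuan Liu, Xiaotian Hu, Chunbo Jiang, Yijin Li, Changmiao Wang, Feiwei Qin, Qiyuan Tian

Affiliations: Tsinghua University; Hangzhou Dianzi University; Shenzhen Research Institute of Big Data

AI summary: This paper proposes the R2AoP framework, which achieves stable angle-of-progression estimation through structure-guided segmentation and confidence-guided geometric modeling, and introduces a lightweight geometry-reliable test-time adaptation strategy to improve performance under heterogeneous acquisition conditions.

Comments 11 pages, 4 figures, accepted by MICCAI 2026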

Abstract

Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.

2605.21090 2026-05-21 cs.CV (updated version)

TextSculptor: Training and Benchmarking Scene Text Editing

Yiheng Lin, Siyu Jiao, Xiaohan Lan, Wei Zhou, Qi She, Fei Yu, Heyun Chen, Zhengwei Wang, Jinghuan Chen, Moran Li, Yingchen Yu, Zijian Feng, Yao Zhao, Yunchao Wei, Yujie Zhong

Affiliations: Beijing Jiaotong University; Bytedance

AI summary: This paper proposes the TextSculptor framework, which builds a large-scale dataset and a benchmark to address the scarcity of high-quality training data and the lack of standardized evaluation in scene text editing, improving the performance of open-source models.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

2605.21075 2026-05-21 cs.CV cs.LG 版本更新

SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

SpectralEarth-FM: 将高光谱图像引入多模态地球观测预训练

Nassim Ait Ali Braham, Aaron Banze, Conrad M. Albrecht, Julien Mairal, Jocelyn Chanussot, Xiao Xiang Zhu

发表机构 * Chair of Data Science in Earth Observation（地球观测数据科学主任）； Technical University of Munich（慕尼黑技术大学）； Remote Sensing Technology Institute（遥感技术研究所）； German Aerospace Center (DLR)（德国航空航天中心）； Department of Aerospace Engineering（航空航天工程系）； University of the Bundeswehr Munich（联邦国防军慕尼黑大学）； LEAP ； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； Univ. Grenoble Alpes（格勒诺布尔阿尔卑斯大学）； Inria（法国国家信息与自动化技术研究院）； CNRS（法国国家科学研究中心）； Grenoble INP（格勒诺布尔INP）； LJK

AI总结本文提出SpectralEarth-FM，一种用于多传感器地球观测输入的分层Transformer，旨在联合处理高光谱图像与低通道数观测。通过构建SpectralEarth-MM数据集，采用JEPA风格的目标进行预训练，在高光谱下游任务和标准EO基准上均取得了最先进的结果。

AI中文摘要

地球观测（EO）基础模型（FMs）越来越多地使用多传感器数据进行训练，涵盖多光谱图像（MSI）、合成孔径雷达（SAR）和衍生的地理空间图层，但高光谱图像（HSI）的代表性仍然不足。相反，现有的高光谱FM仅在HSI上训练，尚未探索HSI与共定位EO传感器的联合预训练和融合。我们引入SpectralEarth-FM，一种面向光谱维度异构的多传感器EO输入的分层Transformer。该架构结合了高光谱输入的光谱标记化、传感器特定编码器、跨传感器融合模块和共享分层编码器，能够联合处理HSI和低通道数观测。为了预训练SpectralEarth-FM，我们构建了SpectralEarth-MM数据集，该数据集将三颗星载传感器（EnMAP、EMIT、DESIS）的HSI与Sentinel-2、Landsat-8/9光学图像、Landsat地表温度（LST）和Sentinel-1 SAR在共同地理足迹上进行共定位。该数据集包含约200万个全球分布的地点、2500万个地理参考图像块，以及超过40TB的数据。预训练使用联合嵌入预测架构（JEPA）风格的目标，匹配同一地点的全局视图与单传感器局部视图之间的表示。我们遵循PANGAEA协议，在高光谱下游任务和标准EO基准上评估SpectralEarth-FM，在两种评估设置中均取得了最先进的结果。

英文摘要

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

2605.21072 2026-05-21 cs.CV 版本更新

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Q-ARVD: 对自回归视频扩散模型进行量化

Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

发表机构 * National University of Singapore（新加坡国立大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结本文针对自回归视频扩散模型（ARVD）的量化问题，提出了一种新的框架Q-ARVD，解决了帧间量化敏感度不平衡和权重中异质性异常模式的问题，从而提高了模型效率。

Comments Code: https://github.com/tsa18/Q-ARVD

AI中文摘要

自回归视频扩散模型（ARVD）已成为流式视频生成的一种有前景的架构，为实时交互式视频生成和世界建模铺平了道路。尽管潜力巨大，ARVD高昂的推理成本仍然是实际部署的主要障碍，这使模型量化成为提高效率的自然方向。然而，ARVD的量化在很大程度上仍未被探索。我们的实证分析表明，将现有为标准扩散Transformer开发的量化方案直接应用于ARVD会导致性能欠佳，并揭示出与双向扩散模型不同的量化行为。在本文中，我们识别了量化ARVD的两个关键挑战：（C1）高度不平衡的帧级量化敏感度。自回归生成过程中的误差累积会导致各帧的量化敏感度严重偏斜，呈现类指数衰减模式。（C2）权重中显著且异质的异常值模式。权重分布表现出明显的异常通道，其模式随层类型和块深度变化很大。为了解决这些问题，我们提出了Q-ARVD，一种用于精确ARVD量化的新框架。（S1）针对高度不平衡的帧级敏感度，Q-ARVD将感知最终质量的帧加权机制纳入量化目标。（S2）为防止异质异常值降低性能，Q-ARVD引入了异常值感知的自适应双尺度量化，它能自动检测任意层中异常通道是否存在及其数量，并将其隔离以保护正常通道。大量实验证明了Q-ARVD的优越性。

英文摘要

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
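The outlier-aware dual-scale idea in (S2) can be illustrated with a small sketch. This is not the Q-ARVD implementation: the detection rule (a channel counts as an outlier when its peak magnitude exceeds `ratio` times the median channel peak), the function names, and the 4-bit symmetric quantizer are all assumptions chosen for illustration. The sketch only shows why giving outlier channels their own scale protects the resolution of the normal channels.

```python
from statistics import median

def fake_quant(values, scale, bits=4):
    """Symmetric uniform quantize-then-dequantize with a single scale."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def dual_scale_quantize(channels, ratio=4.0, bits=4):
    """Quantize weight channels with one scale for outliers and one for the rest.

    The outlier rule below is a hypothetical heuristic, not the paper's detector.
    """
    qmax = 2 ** (bits - 1) - 1
    peaks = [max(abs(v) for v in ch) for ch in channels]
    threshold = ratio * median(peaks)
    outliers = {i for i, p in enumerate(peaks) if p > threshold}
    normal_peak = max((p for i, p in enumerate(peaks) if i not in outliers), default=0.0)
    outlier_peak = max((peaks[i] for i in outliers), default=0.0)
    # Fall back to a scale of 1.0 when a group is empty or all zeros.
    normal_scale = (normal_peak / qmax) or 1.0
    outlier_scale = (outlier_peak / qmax) or 1.0
    quantized = [
        fake_quant(ch, outlier_scale if i in outliers else normal_scale, bits)
        for i, ch in enumerate(channels)
    ]
    return quantized, sorted(outliers)

def single_scale_quantize(channels, bits=4):
    """Baseline: one scale shared by every channel."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for ch in channels for v in ch) / qmax) or 1.0
    return [fake_quant(ch, scale, bits) for ch in channels]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

With a shared scale, one large channel stretches the quantization grid so far that small-magnitude channels collapse to zero. Isolating that channel lets the remaining channels use a much finer grid.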

2605.21061 2026-05-21 cs.CV cs.AI cs.RO 版本更新

Grounding Driving VLA via Inverse Kinematics

通过逆运动学实现驾驶VLA的视觉定位

Junsung Park, Hyunjung Shim

发表机构 * Korea Advanced Institute of Science and Technology（韩国科学技术院）； KimJaeChul AI Graduate School（金 JaeChul人工智能研究生院）

AI总结本文提出按照逆运动学求解器的方式重新设计驾驶VLA，以解决轨迹预测中忽略视觉token的问题；通过引入下一视觉状态预测和逆运动学网络，提升了视觉定位和轨迹规划性能。

AI中文摘要

现有驾驶VLA在预测轨迹时大多忽略其视觉token，我们认为这一现象并非源于训练不足，而是源于任务表述在结构上的不适定。我们证明，从逆运动学的视角看，轨迹恢复需要当前和未来视觉状态共同作为边界条件；现有VLA仅提供前者，促使模型仅凭自车状态和文本指令走捷径。为解决此问题，我们按照逆运动学求解器的风格重新设计驾驶VLA。首先，引入下一视觉状态预测目标，要求LLM预测未来视觉场景，从而提供密集的视觉监督并抑制捷径路径。其次，设计一个独立的逆运动学网络（基于交叉注意力的条件扩散模型），仅以当前和未来视觉状态为输入，以在轨迹解码过程中抑制对自车状态和文本捷径的依赖。仅凭这一简单方案，我们的0.5B规模模型恢复了视觉定位能力，并在闭环NAVSIM-v2和nuScenes基准上达到了与规模大一个数量级以上的7B-8B VLA相当的轨迹规划性能。大量分析进一步表明，这种改进源于模型重新获得了利用视觉特征的能力，且在转弯等动态驾驶场景中效果最为明显。

英文摘要

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B-8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21059 2026-05-21 cs.CV cs.LG 版本更新

Multimodal LLMs under Pairwise Modalities

基于成对模态的多模态大语言模型

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出了一种基于成对模态训练多模态大语言模型的方法，通过理论分析和表示学习框架，实现了跨模态对齐和重构，提升了模型的跨模态性能。

AI中文摘要

尽管多模态大语言模型（MLLMs）取得了令人印象深刻的结果，但其训练通常依赖于联合编纂的多模态数据，需要大量的人力来构建多向对齐的数据集，从而限制了跨领域的可扩展性。在本工作中，我们探索了仅利用多种成对模态作为完整联合多模态分布的替代方案进行训练。具体来说，我们首先提供了理论分析，探讨在仅观察成对模态的情况下，表示可识别的条件。基于此分析，我们提出了一种表示学习框架，用于仅使用成对数据对齐跨模态的潜在表示。该框架包括两个阶段：潜在表示对齐和跨模态重构。具体而言，在第一阶段，我们通过自模态重建和成对对比学习学习跨模态的共享潜在空间。我们还通过部分对齐和最小潜在规范在对比学习过程中引入归纳偏置。在第二阶段，我们将新引入的模态的编码器与预训练模态的解码器整合起来，以促进跨模态转移和生成。我们通过将3D点云和触觉模态添加到预训练的MLLMs中，并使用三种模态对进行评估，证明通过学习对齐的潜在表示空间，我们的模型在跨模态性能上表现优异。

英文摘要

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by leveraging only multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent space across modalities through both self-modal reconstruction and pairwise contrastive learning. We also incorporate an inductive bias into the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoders of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by adding 3D point cloud and tactile modalities to pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

2605.21042 2026-05-21 cs.CV 版本更新

DAMA: Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars

DAMA：解耦的体锚定高斯用于可控的多层avatar

Daniel Eskandar, Berna Kabadayi, Garvita Tiwari, Gerard Pons-Moll

发表机构 * University of Tübingen（图宾根大学）； Tübingen AI Center（图宾根人工智能中心）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Max Planck Institute for Informatics（马克斯·普朗克信息研究所）； Zuse School ELIZA（祖斯学校ELIZA）

AI总结本文提出DAMA方法，通过专门的表示和重建方法，生成具有物理合理性的穿衣avatar，实现了可控的多层结构、清晰的衣物分离和显式的堆叠控制。

AI中文摘要

现有的3D穿衣avatar重建方法虽然能实现高视觉保真度，但忽略了几何结构和物理合理性。它们要么将穿衣人体建模为单个可变形表面，要么尝试衣物解耦但不施加几何约束，导致衣物边界模糊，且无法控制堆叠或层顺序。为解决这些限制，我们引入DAMA（Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars），一种3D avatar重建方法，通过专门的表示和重建方法生成物理上合理的穿衣avatar。在表示层面，我们使用重心面内坐标和正的法向偏移将高斯绑定到SMPL-X网格面片上。基于此参数化，重建方法将2D分割提升为体锚定高斯，利用拓扑引导的修正细化各层，并联合优化几何和外观。DAMA是首个从多视角图像实现物理上合理的分层、清晰的衣物分离和显式堆叠控制的高斯avatar重建方法。在完整的4D-DRESS数据集（82个扫描）上，DAMA在几何重建、衣物分离、穿透率和穿透深度方面均达到最先进的性能。该表示还支持用户定义的衣物重新排序，以及将贴合身体的衣物快速转换为可直接用于仿真的网格。项目页面：https://danieleskandar.github.io/dama/

英文摘要

Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/
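The body-anchored parameterization described above (barycentric in-plane coordinates plus a positive normal offset) can be sketched for a single triangle. This is one plausible reading of the abstract, not the authors' code: the function name `bind_to_face`, the use of the unit face normal as the offset direction, and the plain-list vector representation are assumptions made for illustration.

```python
import math

def bind_to_face(v0, v1, v2, bary, offset):
    """Place a point on triangle (v0, v1, v2), then lift it along the face normal.

    bary   -- barycentric weights (u, v, w), assumed to sum to 1
    offset -- non-negative distance along the unit normal (assumed convention)
    """
    if offset < 0:
        raise ValueError("offset must be non-negative so the point stays outside the body")
    u, v, w = bary
    # In-plane position as a barycentric combination of the three vertices.
    in_plane = [u * a + v * b + w * c for a, b, c in zip(v0, v1, v2)]
    # Face normal from the cross product of two edges.
    e1 = [b - a for a, b in zip(v0, v1)]
    e2 = [c - a for a, c in zip(v0, v2)]
    normal = [
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    ]
    length = math.sqrt(sum(x * x for x in normal))
    return [p + offset * n / length for p, n in zip(in_plane, normal)]
```

Under this reading, a point moves with its face when the body mesh deforms, and a non-negative offset keeps it on the outer side of the surface. Layer ordering could then be expressed by comparing offsets, though the abstract does not say how DAMA actually encodes it.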

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph 版本更新

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

基于TanDEM-X和Landsat数据的混合机器学习模型用于森林高度估计

Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

发表机构 * German Aerospace Center (DLR)（德国航空航天中心（DLR））； Institute of Environmental Engineering, ETH Zürich（环境工程研究所，苏黎世联邦理工学院）

AI总结本文提出了一种结合机器学习与物理模型的混合方法，利用TanDEM-X干涉相干测量和Landsat光学数据来提高森林高度估计的精度，通过扩展特征空间减少高度和基线地形坡度的模糊性，实验结果表明RMSE和MAE分别降低了13.5%和16.6%。

DOI: 10.1109/LGRS.2026.3693644

AI中文摘要

将机器学习（ML）与物理模型（PM）相结合，已成为从遥感数据中反演地球物理参数的一种有前景的方法。在此背景下，最近有研究提出了一种利用TanDEM-X干涉相干性测量估计森林高度的ML模型，该模型通过物理模型约束学习过程。虽然用于训练和反演的特征经过挑选以确保解的物理一致性，但它们无法消除数据中所有的高度/结构和基线/地形坡度模糊性。为改进这一点，本文提出用Landsat光学数据扩展特征空间，以提供关于森林类型或结构的补充信息。扩展后的模型在加蓬Lopé国家公园站点的多景TanDEM-X数据上进行了应用和验证，并以机载LiDAR测量为参考进行评估。结果表明，与原始混合模型相比，RMSE降低了13.5%，MAE降低了16.6%，证实了多光谱输入的附加价值。

英文摘要

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, an ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed, which constrains the learning process through a PM. While the features used for training and inversion were selected to ensure the physical consistency of the solutions, they could not resolve all height/structure and baseline/terrain-slope ambiguities in the data. To improve this, this work proposes extending the feature space with optical Landsat data, which can provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Lopé National Park site in Gabon and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.
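For readers who want to reproduce this kind of comparison, the reported figures are relative reductions of two standard error metrics computed against the LiDAR reference heights. The sketch below shows the arithmetic only. The abstract does not give absolute error values, so any heights passed to these functions are hypothetical, and the function names are illustrative.

```python
import math

def rmse(pred, ref):
    """Root-mean-square error between predicted and reference heights."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def mae(pred, ref):
    """Mean absolute error between predicted and reference heights."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline
```

Because RMSE squares each residual, it weights large errors more heavily than MAE does, so the two metrics can improve by different amounts.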

2605.20973 2026-05-21 cs.CV 版本更新

Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

迈向地下矿山3D点云中的集成岩石支护可视化

Dibyayan Patra, Simit Raval, Pasindu Ranasinghe, Bikram Banerjee, Ismet Canbulat

发表机构 * School of Minerals and Energy Resources Engineering, University of New South Wales（新南威尔士大学矿物与能源资源工程学院）； School of Surveying and Built Environment, University of Southern Queensland（南方昆士兰大学测绘与环境工程学院）

AI总结本文提出了一种自动化框架，利用地下矿山开挖空间的3D点云进行集成岩石支护可视化。该框架将结构面测绘、锚杆识别、结构面平面拟合和锚杆方向估计整合为统一的工作流，生成结构面与锚杆向量的集成3D可视化，以评估其空间相交和几何关系，并通过补充的赤平投影分析评估整体锚固几何有效性。

详情

AI中文摘要

地下矿山中岩石支护的有效性取决于已安装锚杆与周围岩体结构特征之间的相互作用。然而，结构面表征和锚杆识别通常被视为相互独立的任务，限制了它们在集成支护评估中的价值。本文提出了一种自动化框架，利用地下矿山开挖空间的3D点云进行集成岩石支护可视化。该框架将结构面测绘、锚杆识别、结构面平面拟合和锚杆方向估计整合到一个针对准确性和计算效率优化的统一工作流中。其输出用于生成拟合结构面平面和锚杆向量的集成3D可视化，从而能够直接评估它们的空间相交和几何关系。此外，还对结构面极点和锚杆方向进行了补充的赤平投影分析，以评估锚固几何相对于已测绘结构特征的整体有效性。同时，对锚杆级别的质量指标（包括外露长度和相对于局部顶板法线的偏差）进行了可视化，以支持安装质量评估。所提出的框架在真实的地下金属矿扫描数据上进行了演示，在中等规模的点云中得到了准确的结构面测绘和锚杆识别结果。总体而言，本研究朝着无需人工测量或额外现场数据采集的自动化、集成化岩石支护有效性岩土评估迈出了实用的一步。

英文摘要

The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.
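One of the bolt-level quality metrics mentioned above, deviation from the local roof normal, reduces to the angle between two direction vectors. The sketch below is an assumption about how such a metric could be computed, not the paper's implementation. In particular, it treats both directions as unsigned axes, so a bolt vector pointing into or out of the roof gives the same angle, and the function name is illustrative.

```python
import math

def deviation_deg(bolt_dir, roof_normal):
    """Angle in degrees between a bolt axis and the local roof normal.

    Both inputs are 3D direction vectors of any non-zero length. The absolute
    value of the dot product makes the result independent of vector sign.
    """
    dot = sum(b * n for b, n in zip(bolt_dir, roof_normal))
    len_b = math.sqrt(sum(b * b for b in bolt_dir))
    len_n = math.sqrt(sum(n * n for n in roof_normal))
    cos_angle = min(1.0, abs(dot) / (len_b * len_n))  # clamp against rounding error
    return math.degrees(math.acos(cos_angle))
```

A bolt installed perpendicular to the roof scores 0 degrees under this convention, and larger values indicate a more oblique installation.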

2605.20971 2026-05-21 cs.CV cs.AI cs.CR 版本更新

Comparative Evaluation of Deep Learning Models for Fake Image Detection

深度学习模型在虚假图像检测中的比较评估

Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

发表机构 * University of East London（东伦敦大学）

AI总结本研究通过统一的预处理和训练流程比较了四个预训练的CNN架构在虚假图像检测中的性能，发现VGG16在准确性上表现最佳，但EfficientNetB0在检测虚假图像时的敏感性较高，但对真实图像的可靠性较低，研究指出需要平衡数据集、高级增强和公平性意识训练来开发可靠的虚假图像检测系统。

Comments Accepted at ICCIIoT26 and waiting to be indexed

Journal ref: 6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026

AI中文摘要

随着基于GAN的图像篡改技术日益复杂，数字取证面临重大挑战。本研究比较了四个预训练的CNN架构（VGG16、ResNet50、EfficientNetB0和XceptionNet）在虚假图像检测中的性能，使用统一的预处理和训练流程。通过调整大小、归一化和增强来解决类别不平衡问题并提高泛化能力。模型评估使用了准确性、精确率、召回率、F1分数和ROC-AUC。VGG16在准确性上达到91%，XceptionNet、ResNet50和EfficientNetB0分别达到90%。EfficientNetB0对虚假图像的敏感性更强，但在真实图像上的可靠性较低，反映了由不平衡驱动的偏差。局限性包括数据集不平衡、过拟合和解释性有限，这些因素影响了跨域鲁棒性。本研究提供了一个可重复的基准，并强调了平衡数据集、高级增强和公平性意识训练的必要性，以开发可靠的虚假图像检测系统。

英文摘要

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.
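The remark that a model can be sensitive to fake images yet unreliable on real ones is easier to see from a confusion matrix than from accuracy alone. The sketch below computes the listed metrics for a binary detector with "fake" as the positive class. The function name is illustrative and any counts passed to it are hypothetical, since the abstract reports no confusion matrices.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary metrics with 'fake' as the positive class.

    tp: fakes flagged as fake     fn: fakes missed
    fp: reals flagged as fake     tn: reals passed as real
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # sensitivity to fakes
    specificity = tn / (tn + fp) if tn + fp else 0.0     # reliability on real images
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }
```

On an imbalanced test set with many more fakes than real images, a detector can reach about 90% accuracy while misclassifying most real samples. This is why specificity, or per-class recall, is worth reporting alongside the headline numbers.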

2605.20965 2026-05-21 cs.CV cs.AI 版本更新

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

在不遗忘的情况下寻找正确的视觉证据：通过层间视觉注意力差异减轻LVLMs中的幻觉

Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China（东南大学计算机科学与工程学院）； School of Artificial Intelligence, Shenzhen University, Shenzhen, China（深圳大学人工智能学院）； College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China（深圳大学计算机科学与软件工程学院）； Engineering, South China University of Technology, Guangzhou, China（华南理工大学工程学院）； National Engineering Laboratory for Big Data Systems Computing Technology, Shenzhen University, Shenzhen, China（深圳大学大数据系统计算技术国家工程实验室）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China（新一代人工智能技术及其交叉应用重点实验室（东南大学），中华人民共和国教育部）

AI总结本文提出了一种基于层间视觉注意力差异的幻觉缓解方法，通过增强视觉证据的注意力来减少视觉遗忘，从而在不遗忘的情况下找到正确的视觉证据。

Comments Accepted by ICML 2026

AI中文摘要

大型视觉-语言模型（LVLMs）在广泛的视觉-语言任务上表现出色。尽管有进展，它们仍然容易产生幻觉，生成与视觉内容不一致的响应。在本工作中，我们发现LVLMs在对正确的视觉证据关注不足时容易产生幻觉，并在生成过程中逐渐遗忘它。我们实证发现，尽管LVLMs整体对视觉证据关注不足，但在特定层中表现出对正确视觉证据的敏感性，存在显著的层间差异。受此观察启发，我们提出了一种新的幻觉缓解方法，通过层间视觉注意力差异（ILVAD）增强视觉证据。具体来说，我们从早期生成的token到视觉token在各层中获取注意力权重，并识别被反复激活作为视觉证据的token，形成显著性图。然后通过显著性图在生成过程中增强对视觉证据的注意力，以减少视觉遗忘。此外，我们利用显著性图获得生成文本对视觉证据的注意力分数，以选择并强调强烈基于视觉证据的文本token。我们的方法是无训练的，即插即用。在五个最近发布的模型上进行的多个基准评估表明，我们的方法可以在不同架构的LVLMs上一致地缓解幻觉。代码可在https://github.com/ytx-ML/ILVAD上获得。

英文摘要

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.
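The saliency-map step described above can be sketched in a few lines. This is a simplified illustration rather than the ILVAD algorithm: the top-k rule for deciding which visual tokens a layer "activates", the multiplicative boost with strength `alpha`, and both function names are assumptions. The input is a list of per-layer attention weights from early generated tokens to the visual tokens.

```python
def saliency_from_layers(attn_per_layer, top_k=2):
    """Score each visual token by the fraction of layers in which it is among the top-k."""
    n_tokens = len(attn_per_layer[0])
    counts = [0] * n_tokens
    for weights in attn_per_layer:
        top = sorted(range(n_tokens), key=lambda i: weights[i], reverse=True)[:top_k]
        for i in top:
            counts[i] += 1
    return [c / len(attn_per_layer) for c in counts]

def boost_attention(weights, saliency, alpha=1.0):
    """Up-weight attention on salient visual tokens, then renormalize to sum to 1."""
    boosted = [w * (1.0 + alpha * s) for w, s in zip(weights, saliency)]
    total = sum(boosted)
    return [b / total for b in boosted]
```

Tokens that several layers agree on receive a high saliency score, and attention mass is shifted toward them at generation time, which is the intuition behind reducing visual forgetting.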

2605.20963 2026-05-21 cs.CV 版本更新

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

面向现实世界的无人机检测：一个新的多光谱数据集UAVNet-MS和一个新方法

Yihang Luo, Jun Chen, Chao Xiao, Yingqian Wang, Zhaoxu Li, Qiang Ling, Xu He, Nuo Chen, Gaowei Guo, Hongge Li, Miao Li, Longguang Wang, Yulan Guo, Li Liu, Wei An, Zhijie Chen

发表机构 * College of Electronic Science and Technology, National University of Defense Technology（电子科学与技术学院，国防科技大学）； Aviation University of Air Force（空军航空大学）； Sun Yat-sen University（中山大学）

AI总结本文提出了一种新的多光谱数据集UAVNet-MS和一种新的方法MFDNet，用于细粒度小无人机的检测，解决了传统RGB系统在小尺度下的性能问题。

Comments submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

AI中文摘要

无人机（UAV）的普及催生了对精确UAV监测的迫切需求。现有的基于RGB的系统依赖空间线索，而这些线索在小尺度下会退化，尤其是在类型间相似度高、目标与杂波难以区分以及对比度低的情况下。多光谱成像（MSI）编码了感知材质的光谱特征，但由于缺乏专用数据集，基于MSI的细粒度小型UAV检测仍未得到充分研究。我们引入了UAVNet-MS，这是首个用于细粒度小型UAV检测的多光谱数据集，包含15,618个时间同步的RGB-MSI数据立方体（1440x1080），并带有边界框标注。该数据集包含低对比度下具有挑战性的小目标（93.7% <= 32²像素，平均18²像素，约占图像面积的0.02%）。我们提出MFDNet，一种双流基线方法，用于解决阵列引起的视差和空间-光谱融合问题。在RGB-only、MSI-only和RGB+MSI协议下与20种检测器的广泛对比评估表明，MFDNet相比最佳RGB-only方法实现了+6.2%的AP50提升，证明光谱线索提供了超越空间线索的互补材质证据。本文为多光谱UAV监测研究提供了基础数据集、强基线和基准。

英文摘要

The proliferation of unmanned aerial vehicles (UAVs) has created an urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to a lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding-box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% of the image area) under low contrast. We propose MFDNet, a dual-stream baseline that addresses array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows that MFDNet achieves a +6.2% AP50 improvement over the best RGB-only methods, demonstrating that spectral cues provide complementary material evidence beyond spatial cues. This work provides a foundational dataset, a strong baseline, and a benchmark for multispectral UAV monitoring research.
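The object-size figures in the abstract can be checked with one line of arithmetic: an average object of 18^2 pixels in a 1440x1080 frame covers 324 of 1,555,200 pixels. The helper below is illustrative and its name is an assumption.

```python
def area_fraction_percent(side_px, width=1440, height=1080):
    """Percentage of a width x height frame covered by a side_px x side_px object."""
    return 100.0 * side_px ** 2 / (width * height)
```

For `side_px=18` this gives roughly 0.021%, consistent with the ~0.02% stated above. Even the 32^2-pixel upper bound that covers 93.7% of objects is well under 0.1% of the frame.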

2605.20961 2026-05-21 cs.CV 版本更新

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

保留、揭示、扩展：基于区域感知的4D视频编辑

Zhangchi Hu, Wenzhang Sun, Xiangchen Yin, Jiahui Yuan, Chunfeng Wang, Hao Li, Kun Zhan, Xiaoyan Sun

发表机构 * University of Science and Technology of China（中国科学技术大学）； Li Auto Inc.（理想汽车）

AI总结本文提出PREX框架，通过区域感知分解目标时空体积，解决4D视频编辑中区域保持、揭示和扩展的问题，提升了视频编辑的准确性和稳定性。

Comments 23 pages, 13 figures

AI中文摘要

现有的4D驱动视频扩散模型主要针对合理生成，但忠实的4D编辑需要在合成遮挡或视外内容时保留源观测区域。我们识别出证据角色不匹配问题：可靠的源支持证据、不可靠的渲染提示和不支持的区域在单一条件信号中交织，导致保留漂移、鬼影和不稳定的外推。我们提出PREX（保留、揭示、扩展），一个区域感知框架，根据观测支持和场景范围将目标时空体积分解为保留、揭示和扩展角色。PREX通过校准置信度构建观测支持的外观提示，并通过区域感知适配器注入到冻结的视频扩散骨干网络中，通过代理任务训练而无需配对编辑视频。我们进一步引入PREBench，一个诊断基准，包含精心编辑、区域角色掩码和人类对齐的指标，补充了全局视频质量和4D控制评估。实验表明，PREX在减少区域结构失败的同时，保持了强大的视觉质量和4D编辑控制能力。项目页面：https://ricepastem.github.io/PREX-Open

英文摘要

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open

2605.20955 2026-05-21 cs.CV 版本更新

DrawMotion: Generating 3D Human Motions by Freehand Drawing

DrawMotion: 通过自由手绘生成3D人体动作

Tao Wang, Lei Jin, Zhihua Wu, Qiaozhi He, Jiaming Chu, Yu Cheng, Junliang Xing, Jian Zhao, Shuicheng Yan, Li Wang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； University of Science and Technology of China（中国科学技术大学）； NLP Lab, School of Computer Science and Engineering, Northeastern University（东北大学计算机科学与工程学院自然语言处理实验室）； National University of Singapore（新加坡国立大学）； Tsinghua University（清华大学）； The Institute of AI (TeleAI), China Telecom（中国电信人工智能研究院）； Northwestern Polytechnical University（西北工业大学）

AI总结本研究提出DrawMotion，一种基于扩散模型的框架，通过自由手绘和文本条件生成3D人体动作，减少用户输入时间，提升生成精度。

AI中文摘要

文本到动作生成旨在将文本描述转换为人体动作，但面临用户难以仅凭文本精确表达预期动作的挑战。为了解决这一问题，本文介绍了DrawMotion，一种面向多条件场景的高效的基于扩散的框架。DrawMotion同时基于传统的文本条件和新的手绘条件生成动作，二者分别提供对生成动作的语义控制和空间控制。具体而言，我们从三个方面处理细粒度动作生成任务：1) 自由手绘条件。为了在无需繁琐文本输入的情况下准确捕捉用户的预期动作，我们开发了一种算法，可针对不同数据集格式自动生成手绘火柴人草图；2) 多条件融合。我们提出了一个集成到扩散过程中的多条件模块（MCM），使模型能够利用所有可能的条件组合，同时相比传统方法降低计算复杂度；3) 免训练引导。值得注意的是，DrawMotion中的MCM确保其中间特征位于连续空间中，使分类器引导的梯度可以更新这些特征，从而在保持保真度的同时使生成的动作与用户意图对齐。定量实验和用户研究表明，在生成与用户想象一致的动作时，自由手绘方法可将用户耗时减少约46.7%。代码、演示和相关数据可在https://github.com/InvertedForest/DrawMotion上公开获取。

英文摘要

Text-to-motion generation, which translates textual descriptions into human motions, faces the challenge that users often struggle to precisely convey their intended motions through text alone. To address this issue, this paper introduces DrawMotion, an efficient diffusion-based framework designed for multi-condition scenarios. DrawMotion generates motions based on both a conventional text condition and a novel hand-drawing condition, which provide semantic and spatial control over the generated motions, respectively. Specifically, we tackle the fine-grained motion generation task from three perspectives: 1) freehand drawing condition. To accurately capture users' intended motions without requiring tedious textual input, we develop an algorithm to automatically generate hand-drawn stickman sketches across different dataset formats; 2) multi-condition fusion. We propose a Multi-Condition Module (MCM) that is integrated into the diffusion process, enabling the model to exploit all possible condition combinations while reducing computational complexity compared to conventional approaches; and 3) training-free guidance. Notably, the MCM in DrawMotion ensures that its intermediate features lie in a continuous space, allowing classifier-guidance gradients to update the features and thereby aligning the generated motions with user intentions while preserving fidelity. Quantitative experiments and user studies demonstrate that the freehand drawing approach reduces user time by approximately 46.7% when generating motions aligned with their imagination. The code, demos, and relevant data are publicly available at https://github.com/InvertedForest/DrawMotion.

2605.20942 2026-05-21 cs.CV 版本更新

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

连接结构与语言：基于图的视觉推理用于自动驾驶道路理解

Lena Wild, Katie Z Luo, Marco Pavone

发表机构 * KTH Royal Institute of Technology（皇家理工学院）； TRATON ； Stanford University（斯坦福大学）； NVIDIA（英伟达）

AI总结本文提出结合道路子基质（CRS）框架，通过图结构和开放词汇语义的联合执行，解决自动驾驶中道路结构理解的精度与语义灵活性之间的平衡问题。

AI中文摘要

对车道几何、拓扑和交通元素关系的结构化道路理解是安全自动驾驶的基础。尽管视觉-语言模型（VLMs）提供了有前景的语义灵活性，但它们缺乏精确道路推理所需的几何和关系定位能力。相反，传统模块化系统（如HD地图和拓扑道路图）提供了结构精度，但在语义上仍然僵化。为弥合这一差距，我们引入了组合道路基底（Combined Road Substrate，CRS），一种基于图的框架，使几何道路结构和开放词汇语义能够在单一表示中联合执行。CRS能够通过递归图查询自动生成组合复杂且语言多样的问答对，并辅以一种“免费定位”（grounding for free）机制，确保逻辑可追溯到特定地图元素，同时提供程序化提取的思维链监督轨迹。我们证明了最先进的VLMs（包括大型闭源模型）在结构化道路推理上表现明显不足，而仅用20到80个CRS增强场景训练一个20亿或40亿参数的小模型，即可在不同深度的组合推理任务中获得稳定的提升。通过可验证的推理轨迹分析模型行为，揭示了失败模式的系统性转变：基线模型在关系场景理解上失败，而CRS训练的模型将失败缩小到属性识别，表明道路理解的主要瓶颈不是模型规模，而是缺乏结构化监督。

英文摘要

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

2605.20941 2026-05-21 cs.CV cs.GR cs.HC version update

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concept and Complementary Neural Branches

Tores Julie, Sun Rémy, Sassatelli Lucile, Ancarani Elisa, Wu Hui-Yin, Precioso Frédéric

Affiliations * CNRS; Inria; I3S

AI summary: This study proposes SynCB, a synergy concept-based model in which a dynamic routing module selects between a concept branch and a complementary neural branch, improving task accuracy and responsiveness to human intervention.

Abstract

Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the Synergy Concept-Based Model (SynCB) framework, which combines a CB branch with a complementary neural branch and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and the CB branch through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.

2605.20904 2026-05-21 cs.CV version update

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

Affiliations * Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper proposes JFAA, a JEPA-based future action anticipation method for the EPIC-KITCHENS-100 action anticipation task. A frozen encoder and predictor extract observed-context features and near-future latent tokens, and lightweight attentive probes are then trained to predict verb, noun, and action logits. A field-aware ensemble improves robustness, and JFAA took first place in the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026.

Comments The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

2605.20901 2026-05-21 cs.CV cs.AI version update

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

Affiliations * Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper proposes VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation Challenge at EgoVis 2026. The method combines object-centric spatial detection with short-horizon temporal context, injecting temporal representations into the detection pathway through feature modulation and ROI-level context fusion to make predictions more robust.

Comments The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

Abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20892 2026-05-21 cs.CV version update

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu, Junhui Li, Ruitong Lu, Jialu Li, Youshan Zhang

Affiliations * University of Science and Technology Liaoning; Chuzhou University; Yeshiva University

AI summary: This paper proposes the FruitEnsemble framework, which uses two-stage dynamic inference to address the generalization limits of fine-grained fruit classification and calls on an MLLM for expert arbitration to raise accuracy, reaching a classification accuracy of 70.49%.

Comments 10 pages, 6 figures, submitted to CVPR 2026

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2026

Abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
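
The two-stage routing described in this abstract can be sketched roughly as follows. This is an illustration only, not the authors' code: the function name `ensemble_top3`, the use of a simple weighted average of softmax vectors, and the choice of the top fused probability as the "ensemble confidence" are all assumptions; only the Top-3 pool and the 0.6 threshold come from the abstract.

```python
import numpy as np

def ensemble_top3(model_probs, weights, threshold=0.6):
    """Fuse per-model class probabilities and flag low-confidence samples.

    model_probs: list of 1-D arrays, one softmax vector per backbone.
    weights: one non-negative weight per backbone (assumed here to be
             derived from validation performance; hypothetical).
    """
    probs = np.stack(model_probs)               # (n_models, n_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalise weights to sum to 1
    fused = w @ probs                           # weighted average, (n_classes,)
    top3 = np.argsort(fused)[::-1][:3]          # indices of the 3 best classes
    confidence = float(fused[top3[0]])          # assumed confidence measure
    needs_arbitration = confidence < threshold  # would hand off to the MLLM stage
    return top3.tolist(), confidence, needs_arbitration
```

In this sketch, a sample whose fused top probability is below 0.6 would be passed, together with its three candidates, to a separate verification step; that step is not modeled here.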

2605.20891 2026-05-21 cs.CV version update

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

Huayi Wang, Haochao Ying, Yuyang Xu, Qiyao Zheng, Jun Wang, Cheng Zhang, Ying Sun, Jian Wu

Affiliations * Zhejiang University; Xinjiang University; Hangzhou City University; Sun Yat-sen University Cancer Center

AI summary: This paper proposes the HDMoE framework, a hierarchical decoupling-fusion mixture-of-experts approach that integrates multimodal medical data to improve the accuracy of cancer survival prediction, addressing the weak feature decoupling and fusion of earlier methods.

Comments 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

Abstract

Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (e.g., Whole Slide Images (WSIs) and genomic profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) they fail to reduce the redundant information in modality features before decoupling, which negatively affects feature decoupling and fusion; (2) they lack the ability to model fine-grained relationships among features and to capture local information interactions between intra- and inter-modality features. To address these issues, we propose a Hierarchical Decoupling-Fusion Mixture-of-Experts (HDMoE) framework with two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. In addition, we design two RFR modules, one following each level of MoE, to finely fuse intra- and inter-modality features, which helps the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) dataset and three public TCGA datasets confirm the effectiveness of our proposed method. Code is available at https://github.com/ZJUMAI/HDMoE.

2605.20889 2026-05-21 cs.CV version update

Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

Hiroyuki Deguchi, Ryosuke Hori, Kotaro Amaya, Tsubasa Maruyama, Mitsunori Tada, Hideo Saito

Affiliations * Keio University; National Institute of Advanced Industrial Science and Technology

AI summary: This paper proposes Map-Mono-Ego, a framework that uses a pre-scanned 3D point cloud to obtain globally consistent human pose estimates from a monocular camera, and introduces the AIST-Living dataset, showing that the method is practical for everyday monitoring tasks without specialized hardware.

Comments Accepted at ICIP 2026, Project page: https://deguchihiroyuki.github.io/Map-Mono-Ego-Project/

Abstract

Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on motion relative to an initial position and tend not to account for the wearer's absolute location within an environment. Furthermore, the inherent scale ambiguity of monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework that achieves globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, demonstrating its utility for practical monitoring tasks without specialized hardware.

2605.20867 2026-05-21 cs.MA cs.CV version update

ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection

Yingjia Xu, Jiulong Wu, Bowen Zhang, Baokui Guo, Siyuan Chai, Min Cao

Affiliations * School of Computer Science and Technology, Soochow University; Baidu Inc.; Zhipu AI

AI summary: This paper proposes ProCrit, a framework for multimodal sarcasm detection that performs self-elicited multi-perspective reasoning with critic-guided revision. It addresses existing methods' reliance on fixed perspectives by dynamically generating multi-perspective analyses and optimizing them jointly.

Abstract

Multimodal sarcasm detection requires reasoning over cross-modal incongruities between literal expression and intended meaning, yet the specific analytical perspectives needed vary across samples due to the diversity of sarcastic mechanisms. While recent methods make this analytical process explicit, they still rely on fixed, predefined perspectives that operate independently under hand-crafted routing rules. We argue that multimodal sarcasm detection instead calls for self-elicited multi-perspective reasoning, where a model autonomously generates the perspectives needed for each sample and progressively integrates them into a coherent analysis. To realize this goal, we propose ProCrit, a Proposal-Critic two-agent framework with a proposal agent for multi-perspective reasoning and a critic agent for external evaluation and targeted revision guidance. First, to overcome the lack of process-level supervision in existing sarcasm datasets, ProCrit synthesizes process-level reasoning annotations through a dynamic-role agentic rollout: a strong vision-language model sequentially spawns analytical roles within a shared context, and the resulting multi-role trajectories are flattened into sequences that preserve cross-perspective dependencies while enabling efficient autoregressive generation. Second, to improve reasoning reliability, ProCrit adopts a draft-critique-revise paradigm in which an independent critic identifies reasoning deficiencies and provides targeted natural-language feedback for directed revision. Finally, we develop a mutual-refinement training framework that jointly optimizes proposal drafting and feedback-guided revision via dual-stage reinforcement learning, while refining the critic agent according to the actual effectiveness of its feedback. Experiments on three widely used benchmarks demonstrate the effectiveness of ProCrit.

2605.20839 2026-05-21 cs.CV cs.LG version update

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang, Jonathan Gregory, Grigorios G. Chrysos

Affiliations * University of Wisconsin-Madison

AI summary: This paper proposes activation-free polynomial alternatives for image recognition within MetaFormer-style vision models and shows that the polynomial modules perform strongly across multiple datasets.

Comments Accepted to ICML 2026

2605.20838 2026-05-21 cs.CV cs.AI version update

Yisen Feng, Leigang Qu, Haoyu Zhang, Qiaohui Chu, Meng Liu, Xuemeng Song, Weili Guan, Liqiang Nie

Affiliations * Harbin Institute of Technology (Shenzhen); National University of Singapore; Pengcheng Laboratory; Shandong Jianzhu University; Southern University of Science and Technology

AI summary: This paper proposes a reranking framework based on a multimodal large language model (MLLM) for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge 2026. It combines candidate segments from the existing localization model OSGNet with the MLLM's video-language reasoning to localize temporal segments more accurately.

Comments Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

2605.20808 2026-05-21 cs.CV version update

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

Jinjin Zhang, Xiefan Guo, Di Huang

Affiliations * Beihang University

AI summary: This paper proposes Spatial Gram Alignment (SGA), which leverages the representation priors of vision foundation models while preserving the generative capacity of LDMs, resolving the conflict between generation quality and structural integrity in ultra-high-resolution image synthesis and enabling high-quality text-to-image synthesis.

Comments Technical Report

Abstract

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
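
The idea of aligning self-similarities rather than the features themselves can be illustrated with a small NumPy sketch. The abstract does not give the exact loss, so everything below is an assumption made for illustration: cosine similarity as the Gram entries, mean squared error as the distance, and the names `gram` and `gram_alignment_loss`.

```python
import numpy as np

def gram(features):
    """Cosine self-similarity matrix of N spatial tokens, shape (N, N)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    f = features / (norms + 1e-8)          # unit-normalise each token
    return f @ f.T

def gram_alignment_loss(gen_feats, prior_feats):
    """Mean squared difference between two self-similarity matrices.

    gen_feats:   (N, C_gen) generative features (hypothetical name).
    prior_feats: (N, C_prior) foundation-model features at the same N
                 locations; the channel widths are allowed to differ.
    """
    return float(np.mean((gram(gen_feats) - gram(prior_feats)) ** 2))
```

Because both Gram matrices are N×N, such a loss needs no projection between the two channel spaces, and it is unchanged by any rotation of either feature space. That is one way to read "non-invasive": only the relational structure between locations is constrained, not the feature values.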

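2605.20807 2026-05-21 cs.CV version update

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

Hanzhong Guo, Yizhou Yu

Affiliations * School of Computing and Data Science, The University of Hong Kong

AI summary: This study proposes a two-stage framework that first predicts a Canny map and then generates the final image from the source appearance and the predicted structure, in order to preserve high-frequency identity details such as logos, patterns, and text in subject-driven text-to-image generation. An automated pipeline builds a text-aware dataset of 100,000 pairs, and experiments show that intermediate structural prediction improves high-fidelity subject-driven generation.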

2605.20804 2026-05-21 cs.CV cs.LG version update

OlmoEarth v1.1: A more efficient family of OlmoEarth models

Gabriel Tseng, Yawen Zhang, Favyen Bastani, Henry Herzog, Joseph Redmon, Hadrien Sablon, Piper Wolters, Patrick Alan Johnson, Christopher Wilhelm, Patrick Beukema

Affiliations * Allen Institute for AI

AI summary: This paper presents an improved OlmoEarth model family that substantially reduces compute cost by optimizing training and inference while maintaining overall model performance.

2605.20780 2026-05-21 cs.LG cs.CV version update

Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment

Haozhe Jia, Pengyu Yin, Wenshuo Chen, Shaofeng Liang, Lei Wang, Bowen Tian, Xiucheng Wang, Nanqian Jia, Yutao Yue

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Shandong University; LimX Dynamics Technology Co., Ltd.; Xidian University; Peking University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI); Griffith University

AI summary: This study proposes REPA-P, a teacher-free framework that aligns intermediate features with physical states using first-principles residuals, addressing the tendency of intermediate representations in physics-informed diffusion models toward shortcut learning under shifted boundary conditions. Across four PDE tasks it speeds up convergence, reduces physics residuals, and improves out-of-distribution robustness.

Abstract

Physics-informed diffusion models typically enforce PDE constraints only on final outputs, leaving intermediate representations unconstrained and prone to shortcut learning under shifted boundary conditions. We introduce REPA-P, a teacher-free, architecture-agnostic framework that aligns intermediate features with physical states using first-principles residuals. REPA-P attaches lightweight 1×1 projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training. These heads are discarded at inference, introducing zero overhead. Across four PDE tasks, including Darcy flow, topology optimization, electrostatic potential, and turbulent channel flow, REPA-P accelerates convergence by up to 2×, reduces physics residuals by up to 66.4%, and improves out-of-distribution robustness by up to 49.3%, with consistent gains on both U-Net and Diffusion Transformer backbones. Ablations show that supervising a small set of intermediate layers captures most benefits and complements output-level physics losses. Code is available at https://github.com/Hxxxz0/REPA-P.
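
As a rough illustration of the mechanism in this abstract, the sketch below decodes a hidden activation with a 1×1 projection (a per-pixel linear map over channels) and scores the decoded field with a PDE residual. It is not the REPA-P implementation: the function names are hypothetical, and Laplace's equation with a five-point stencil is assumed here as a stand-in for the electrostatic-potential task in a charge-free region.

```python
import numpy as np

def project_1x1(hidden, weight, bias):
    """1x1 projection head: a per-pixel linear map over channels.

    hidden: (C, H, W) intermediate activation; weight: (C,); bias: scalar.
    Returns one decoded physical field of shape (H, W).
    """
    return np.tensordot(weight, hidden, axes=([0], [0])) + bias

def laplace_residual(u, h=1.0):
    """Five-point finite-difference Laplacian on interior grid points."""
    return (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
            - 4.0 * u[1:-1, 1:-1]) / (h * h)

def alignment_loss(hidden, weight, bias, h=1.0):
    """Mean squared PDE residual of the field decoded from a hidden layer."""
    u = project_1x1(hidden, weight, bias)
    return float(np.mean(laplace_residual(u, h) ** 2))
```

During training, a loss of this kind would be added for each supervised layer; since the projection weights are used only to compute the loss, dropping them at inference leaves the backbone's forward pass unchanged.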

2605.20777 2026-05-21 cs.CV version update

AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models

Manogna Sreenivas, Rohit Kumar, Soma Biswas

Affiliations * Indian Institute of Science

AI summary: This paper introduces the AttriStory benchmark to improve visual storytelling through fine-grained attribute realization. It adds a latent optimization module that operates during early denoising steps and uses the AttriLoss objective to strengthen the alignment of attribute-object pairs, yielding more accurate attribute localization.

Comments Accepted at CVPR AIStory Workshop, 2026

Abstract

Visual storytelling with diffusion models has made impressive strides in maintaining character consistency across narrative scenes. However, a critical gap remains: while these methods ensure a character remains consistent across scenes, they provide no systematic way to ensure that fine-grained attributes, such as the color and texture of clothing and accessories, are faithfully rendered in the generated images. Toward this goal, we introduce AttriStory, a benchmark enabling attribute realization in visual storytelling. We curate 200 multi-scene stories across 10 distinct artistic styles using a large language model. Each scene is constructed with detailed attribute specifications to enable rich visual narratives. Further, to address attribute realization, we propose a plug-and-play latent optimization module that operates during early denoising steps, when the model establishes structural and semantic content. We achieve this through the AttriLoss objective, designed to maximize alignment between the cross-attention maps for desired attribute-object pairs while suppressing spurious associations, guiding models to localize attributes correctly. This approach operates orthogonally to existing consistency mechanisms, integrating seamlessly with current story generation pipelines without requiring architectural modifications. Our experiments demonstrate consistent improvements from incorporating AttriLoss across all baselines. This work positions attribute realization as a distinct, complementary dimension of visual storytelling, alongside character consistency, advancing the field toward fine-grained attribute-controlled story generation. Project page: https://manogna-s.github.io/attristory/

2605.20766 2026-05-21 cs.CV version update

Early High-Frequency Injection for Geometry-Sensitive Out-of-Distribution Detection

Chuanjie Cheng, Ningkang Peng, Chenxi Liu, Yifan He, Peirong Ma, Yanhui Gu

Affiliations * Nanjing Normal University

AI summary: Through a band-wise analysis, this paper shows the importance of high-frequency input for geometry-sensitive out-of-distribution detection. The proposed EIHF method improves detection on CIFAR-100 and ImageNet-100, while also exposing a limitation on the scene-centric Places shift.

Abstract

Post-hoc OOD detectors score logits or features after training, so their success depends on the geometry already encoded in the representation. We revisit this assumption through a band-wise MMD^2 analysis across CE, SimCLR, SupCon, and the OOD-oriented representation method PALM. In our diagnostic, low-frequency input bands induce weaker ID/OOD feature discrepancy, whereas higher-frequency bands tend to provide stronger separability. This observation motivates EIHF, an input-side intervention that exposes high-frequency evidence before the first convolution without changing the training objective. EIHF is strongest for geometry-sensitive OOD detection: under matched training and scoring settings, it reshapes class-conditional feature geometry and reduces ID/OOD Mahalanobis score overlap. Experiments on CIFAR-100 and ImageNet-100 show gains on CIFAR-100 and the best average FPR95 with second-best average AUROC on ImageNet-100, while also revealing a limitation on the scene-centric Places shift. Code is available at https://anonymous.4open.science/r/EIHF.
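
The abstract does not say how EIHF exposes high-frequency evidence, so the following is only one assumed realization for illustration: a radial high-pass filter in the Fourier domain whose output is stacked with the raw image as an extra input channel. The function names and the `cutoff` parameter are hypothetical.

```python
import numpy as np

def high_frequency_component(image, cutoff=0.25):
    """Return the part of a 2-D image above a radial frequency cutoff."""
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]        # cycles per pixel, in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    mask = radius > cutoff                 # keep only high frequencies
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

def inject_high_frequency(image, cutoff=0.25):
    """Stack the raw image with its high-pass residual, shape (2, H, W)."""
    return np.stack([image, high_frequency_component(image, cutoff)])
```

In this sketch the training objective is untouched; only the input seen by the first convolution changes, which matches the abstract's description of an input-side intervention.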

2605.20727 2026-05-21 cs.CV version update

GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels

Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Peirong Ma, Xichen Yang, Weiguang Qu, Yanhui Gu

Affiliations * Nanjing Normal University; Nanjing University of Chinese Medicine

AI summary: This paper proposes a geometry-aware manifold regularization method that actively synthesizes virtual outliers to reshape feature-space geometry, improving learning under noisy labels. Its core contribution is strengthening the model's ability to distinguish hard samples from noisy ones, leading to more robust representation learning.

DarkShake-DVS: Event-Based Human Action Recognition Under Low-Light and Shaking Conditions

Jiaqi Chen, Qinfu Xu, Liyuan Pan

Affiliations * Beijing Institute of Technology

AI summary: This paper proposes EIS-HAR, a method that combines an event camera with an inertial measurement unit, reducing motion blur with a non-linear warping module and extracting spatiotemporal features for recognition. It also introduces the DarkShake-DVS benchmark dataset for human action recognition under low light and 6-DoF motion.

Comments 8 pages, 7 figures

Abstract

Human Action Recognition (HAR) is a fundamental computer vision task with diverse real-world applications. Practical deployments often involve low-light environments and unconstrained 6-DoF camera motion, conditions that degrade visual quality, disrupt temporal coherence, and compromise the reliability of existing methods. Event cameras, with high low-light sensitivity and microsecond-level temporal resolution, paired with an inertial measurement unit (IMU), present a promising solution. However, current research faces two key challenges: the absence of a benchmark integrating low-light conditions, 6-DoF motion, and synchronized IMU data; and the lack of effective motion compensation techniques. To address these, we propose Event-IMU Stabilized HAR (EIS-HAR), with two modules. The first is an EIS module that reduces motion blur via a non-linear warping function to reconstruct a motion-compensated input. The second is a HAR module with a four-stage hybrid architecture to efficiently extract spatiotemporal features for accurate action recognition. To alleviate data scarcity, we introduce DarkShake-DVS, the first large-scale event-based HAR benchmark, which includes 18,041 real-world clips captured under low light and intense 6-DoF motion, supplemented by synchronized IMU data. Extensive experiments on three datasets demonstrate the consistent superiority of EIS-HAR over state-of-the-art methods.

2605.20676 2026-05-21 cs.CV version update

RoPeSLR: 3D RoPE-Guided Sparse Low-Rank Attention for Efficient Diffusion Transformers

Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

Affiliations * Peking University; University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications

AI summary: This study proposes RoPeSLR, a 3D RoPE-guided sparse low-rank attention framework that targets the high attention complexity of long-sequence generation in diffusion transformers. By combining a high-frequency semantic spike set with an extremely low-rank background continuum, it achieves sub-quadratic sparsity and sub-linear rank growth, and performs well in ultra-long video inference.

Abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).

2605.20651 2026-05-21 cs.CV version update

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

Tuopusen Huang, Ding Ma, Xiangqian Wu

Affiliations * Faculty of Computing

AI summary: This paper proposes LSENet, which introduces three new modules to address the vessel discontinuities and loss of detail caused by low local contrast in OCTA vessel segmentation. Experiments show that it achieves the best performance on several public datasets with fewer parameters.

Abstract

Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture. However, most of these methods focus only on holistic representation and struggle with the low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core modules. To address vessel discontinuities, we introduce the Patch Information Enhance (PIE) module, which replaces standard skip connections with patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion (MFF) module feeds the PIE module rich multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) refines features from all levels and uses a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that LSENet achieves state-of-the-art performance while requiring fewer parameters.

2605.20645 2026-05-21 cs.CV (updated version)

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li

Affiliations: Beijing Institute of Technology, Beijing, China; Beijing Institute for General Artificial Intelligence, Beijing, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China

AI summary: This paper presents the FogAct benchmark dataset and the FogNet model to address the challenges of action recognition in foggy conditions. FogNet, an improved two-stream CLIP model, extracts fog-invariant semantic information to raise action recognition performance under fog.

Abstract

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we address action recognition under foggy conditions with two contributions. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. FogAct and FogNet are available on our project page.

2605.20640 2026-05-21 cs.CV cs.AI (updated version)

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics

Yunlong Wang, Jinjin Shi, Wenbin Gao, Xuran Xu, Runyu Shi, Ying Huang

AI summary: This paper proposes a feature supervision method for Multimodal Diffusion Transformers (MM-DiT). A lightweight cross-modal alignment mechanism implicitly extracts multi-granularity vision-aligned text representations, improving text-image alignment, photorealism, and aesthetic quality together and pushing the Pareto frontier.

Abstract

Text-to-image diffusion models often face a severe trilemma in human portrait generation: text-image alignment, photorealism, and human-perceived aesthetics inherently inhibit one another. Supervised Fine-Tuning (SFT) is an effective method for enhancing the photorealism of image generation. However, it often leads to overfitting to the training dataset, corrupts pre-trained image priors, and degrades alignment or aesthetics. To break this bottleneck, we propose a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT). Specifically, we introduce a lightweight cross-modal alignment mechanism that implicitly extracts multi-granularity vision-aligned text representations from SigLIP 2 and applies supervision to the image branch of MM-DiT during the training stage, with zero extra inference overhead. Our method injects vision-aligned text guidance while preserving the base model's original generalization, avoiding degradation caused by SFT. Furthermore, our method directly mines implicit multi-granularity aesthetic signals from pre-trained vision foundation models to optimize human-perceived aesthetics. Extensive experiments on MM-DiTs show that our method pushes the Pareto frontier and achieves synergistic improvements across text-image alignment, photorealism, and human-perceived aesthetics.

2605.20626 2026-05-21 cs.CL cs.AI cs.CV (updated version)

Direct Translation Between Sign Languages

Zetian Wu, Bowen Xie, Wuyang Meng, Milan Gautam, Stefan Lee, Liang Huang

Affiliations: Oregon State University

AI summary: This paper proposes direct sign-to-sign translation, using back-translation to generate synthetic sign-sign pairs. This avoids the error propagation and information loss of conventional cascaded approaches and yields higher translation quality and faster inference on several sign language datasets.

Abstract

The field of sign language translation has witnessed significant progress in translation between sign and spoken languages, but translation between sign languages remains largely unexplored and out of reach. The latter can help 1.5 billion deaf and hard-of-hearing (DHH) people worldwide communicate across language barriers without relying on hearing interpreters or written-language fluency. The cascade approach, which composes separate sign-to-text, text-to-text, and text-to-sign systems, suffers from error propagation and extra latency as well as the loss of information unique to the visual modality. We aim to develop direct sign-to-sign translation. However, no large-scale open-domain parallel corpus between sign languages has been curated. To enable direct translation between sign language utterances, we use back-translation to produce synthetic sign-sign pairs from unaligned individual-language utterance-sign corpora. Using this data, we jointly train a single MBART-based model for both text->sign (T2S) and sign->sign (S2S). On synthetically generated paired sets between American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS), our direct S2S method outperforms the cascaded baseline on geometric sign error metrics (20% lower DTW-aligned MPJPE) and on language matching metrics after predicted sign utterances are translated back to sentences (50% higher BLEU-4), while achieving a roughly 2.3x speedup. On a small set of pre-existing cross-lingual sign data, we find similar improvements for our proposed method.
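
As a rough illustration of the geometric metric named in this abstract, the sketch below computes a DTW-aligned MPJPE between two pose sequences. It assumes each frame is a list of 3D joint coordinates and that the score is the mean per-frame error along the minimum-cost alignment path. The function names and the path-length normalization are assumptions for illustration, not the authors' evaluation code.

```python
import math


def frame_mpjpe(a, b):
    """Mean Euclidean distance between corresponding joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def dtw_mpjpe(pred, ref):
    """Average per-frame MPJPE along the minimum-cost DTW alignment path."""
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] holds (total error, path length) for aligning pred[:i] with ref[:j].
    cost = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_mpjpe(pred[i - 1], ref[j - 1])
            # Standard DTW recurrence: extend the cheapest of the three predecessors.
            best = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (best[0] + d, best[1] + 1)
    total, steps = cost[n][m]
    return total / steps
```

Because DTW lets one frame match several frames of the other sequence, a prediction that signs the same motion at a different speed is not penalized for timing alone.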

2605.20584 2026-05-21 cs.CV (updated version)

QwenSafe: Multimodal Content Rating Description Identification via Preference-Aligned VLMs

Dishanika Denipitiyage, Aruna Seneviratne, Suranga Seneviratne

Affiliations: University of Sydney; University of New South Wales

AI summary: This paper presents QwenSafe, a vision-language model that automatically identifies Apple-defined content rating descriptors (CRDs) by jointly reasoning over app metadata and screenshots. It introduces the metadata2CRD data-construction pipeline and uses Direct Preference Optimization (DPO) to improve prediction accuracy. Experiments show that QwenSafe substantially outperforms existing models in binary CRD classification.

Abstract

Mobile app marketplaces require developers to disclose standardized content rating descriptors (CRDs) to inform users about potentially sensitive or restricted content. Ensuring the accuracy and consistency of these disclosures remains challenging due to the multimodal nature of app content, which spans textual descriptions and visual interfaces. In this paper, we present QwenSafe, a Vision-Language Model (VLM) designed to automatically identify the presence of Apple-defined CRDs by jointly reasoning over app metadata and screenshots. To enable scalable training for this task, we introduce metadata2CRD, a data-construction pipeline that synthesizes descriptor-aligned question-answer pairs by combining app descriptions, screenshots, and formal descriptor definitions. We adapt Qwen3-VL-8B using supervised fine-tuning followed by Direct Preference Optimization (DPO) to align model predictions with descriptor-specific evidence and explanations across visual and textual modalities. We evaluate QwenSafe on 12 Apple-defined content rating descriptors and compare it against state-of-the-art vision-language models, including Qwen3-VL, LLaVA-1.6, and Gemini-2.5-Flash. QwenSafe consistently outperforms all baselines in binary CRD classification, achieving improvements in positive-class recall of 111.8%, 36.1%, and 2.1%, respectively. Our results demonstrate that descriptor-aware multimodal alignment substantially improves automated content classification and highlight the potential of vision-language models to support scalable and consistent content rating in mobile app marketplaces.
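
The abstract mentions Direct Preference Optimization without spelling out the objective. The sketch below shows the standard DPO loss for a single preference pair, assuming the inputs are sequence log-probabilities under the trained policy and a frozen reference model. The function name, argument names, and the default `beta` are assumptions for illustration; the paper's actual training setup is not given here.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form for either sign of z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and reference agree, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen answer relative to the rejected one.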

2605.20576 2026-05-21 cs.CV (updated version)

$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

Chia-Hsiang Kao, Cong Phuoc Huynh, Chien-Yi Wang, Noranart Vesdapunt, Stefan Stojanov, Bharath Hariharan, Oleksandr Obiednikov, Ning Zhou

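Affiliations: Cornell University; Amazon

AI summary: This paper proposes the $Δ$YNAMICS framework, which uses language as a unified representation of rigid-body dynamics and generates scene configurations for physics simulation as structured text. Natural language motion reasoning and optical flow input improve generalization. On CLEVRER it reaches a segmentation IoU 7x that of existing VLMs, and it transfers well to a new dataset.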

Comments Accepted to CVPR 2026. Project page: https://iandrover.github.io/2026_dynamics

Abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $Δ$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $Δ$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $Δ$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.
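
The headline number in this abstract is a segmentation IoU. As a minimal sketch of that metric, the function below computes intersection over union for two binary masks stored as nested lists. The function name is hypothetical, and returning 1.0 for two empty masks is a convention chosen here, not something the abstract specifies.

```python
def mask_iou(pred, target):
    """Intersection over union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

For example, a prediction that covers one of the three pixels touched by either mask scores 1/3.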

2605.20569 2026-05-21 cs.CV (updated version)

End-to-End Unmixing with Material Prompts for Hyperspectral Object Tracking

Xu Han, Mohammad Aminul Islam, Lei Wang, Zekun Long, Guanmanyi Fu, Wangshu Cai, Kuldip K. Paliwal, Jun Zhou

Affiliations: School of Information and Communication Technology, Griffith University, Australia; School of Engineering and Built Environment, Griffith University, Australia; School of Environment and Science, Griffith University, Australia

AI summary: This paper proposes an end-to-end material-aware tracking framework that jointly optimizes material decomposition and target localization. A weighted target-oriented unmixing loss aligns material representations with localization accuracy, improving hyperspectral tracking robustness under appearance ambiguity, illumination change, and background clutter.

Abstract

Hyperspectral imagery encodes rich material properties that can improve tracking robustness under appearance ambiguity, illumination change, and background clutter. However, due to the limited availability of hyperspectral video data, many existing methods adapt pretrained RGB trackers via spatial or channel fusion strategies, largely neglecting the intrinsic material information in hyperspectral imagery. Moreover, the few material-aware approaches typically rely on external spectral unmixing pipelines that are decoupled from the tracking objective, limiting effective optimization of material representations for target localization. To address these limitations, we formulate hyperspectral object tracking as a joint optimization problem of material decomposition and target localization, coupling the two tasks via a weighted target-oriented unmixing loss that explicitly aligns material representations with localization accuracy. Specifically, we propose a material representation decomposition module for deep learning-based spectral unmixing with adaptive frequency decomposition. Building on the decomposed material representations, we further introduce a dual-branch wavelet-enhanced material prompt module that learns low- and high-frequency material prompts through efficient spatial-material interactions in the frequency domain. The framework is model-agnostic and can be seamlessly generalized to different unmixing backbones. Extensive experiments on standard hyperspectral tracking benchmarks demonstrate state-of-the-art performance and validate the effectiveness of the proposed end-to-end material-aware tracking framework. Code is available at https://github.com/han030927/E2EMPT.

2605.20551 2026-05-21 cs.CV cs.AI cs.RO (updated version)

Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning

Zichao Zeng, June Moh Goo, Junwei Zheng, Weijia Fan, Jiaming Zhang, Rainer Stiefelhagen, Jan Boehm

Affiliations: University College London; Karlsruhe Institute of Technology; Hunan University; Shenzhen University

AI summary: This paper proposes the Weighted Aggregated Descriptor (WeiAD) and the token pruning framework WeiToP to improve the accuracy and efficiency of visual place recognition, allowing the accuracy-efficiency balance of feature extraction to be adjusted dynamically.

Abstract

Visual Place Recognition (VPR) aims to match a query image to reference images of the same place in a large-scale database. Recent state-of-the-art methods employ Vision Transformers (ViTs) as backbone foundation models to extract patch-level features that are robust to viewpoint, illumination, and seasonal variations, which are then aggregated into a compact global descriptor for retrieval. Most existing aggregation methods uniformly pool patch tokens into learned clusters, despite the fact that different clusters often encode distinct spatial or semantic patterns and contribute unequally to VPR performance. To address this limitation, we propose Weighted Aggregated Descriptor (WeiAD), which assigns weights to clusters during aggregation, producing more discriminative global representations. Beyond accuracy, retrieval latency is a critical concern for large-scale deployments and resource-constrained edge devices. Prior work mainly reduces latency by compressing global descriptors, while overlooking the cost of feature extraction, an issue exacerbated by ViT-based backbones. We therefore introduce WeiToP, a VPR-oriented token pruning framework that reduces feature extraction cost via self-distillation, where aggregation-induced token importance supervises a lightweight pruning module attached to an early transformer layer, enabling inference-time token pruning. After a single joint training phase, WeiToP enables plug-and-play token pruning at inference time, allowing flexible and on-demand control over the accuracy-efficiency trade-off without additional training. Moreover, WeiToP outperforms existing token pruning methods adapted from general vision tasks.
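
To make the idea of weighting clusters during aggregation concrete, here is a toy sketch under stated assumptions: tokens are softly assigned to cluster centers by dot-product similarity, pooled per cluster, scaled by a per-cluster weight, then flattened and L2-normalized. All names, the softmax assignment, and the way weights enter are hypothetical choices for illustration and should not be read as the WeiAD design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def weighted_aggregate(tokens, centers, cluster_weights):
    """Pool tokens into clusters, scale each cluster, and L2-normalize."""
    dim = len(tokens[0])
    pooled = [[0.0] * dim for _ in centers]
    for tok in tokens:
        # Soft assignment from dot-product similarity to each cluster center.
        sims = [sum(t * c for t, c in zip(tok, ctr)) for ctr in centers]
        for k, a in enumerate(softmax(sims)):
            for d in range(dim):
                pooled[k][d] += a * tok[d]
    # Scale each cluster by its (hypothetical, learned) weight and flatten.
    desc = [w * v for w, vec in zip(cluster_weights, pooled) for v in vec]
    norm = math.sqrt(sum(v * v for v in desc)) or 1.0
    return [v / norm for v in desc]
```

Setting a cluster's weight to zero removes its contribution from the global descriptor, which is the uniform-pooling behavior the abstract argues against when all weights are equal.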

2605.20549 2026-05-21 cs.CV (updated version)

MAPS: A Synthetic Dataset for Probing Vision Models in a Controlled 3D Scene Space

Santiago Galella, Pamela Osuna-Vargas, Maren Wehrheim, Martina G. Vilas, Gemma Roig, Matthias Kaschube

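Affiliations: FIAS & Institute of Computer Science, Goethe University Frankfurt; Mila & Department of Biology, York University; Institute of Computer Science, Goethe University Frankfurt

AI summary: This paper presents the MAPS dataset for studying vision model behavior in a controlled 3D scene space. A regression-based sensitivity analysis of 20 models' reliance on scene factors finds that camera distance and elevation are the main drivers of recognition failure, and that modern CNNs and transformers show similar sensitivity profiles.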

Comments 33 pages, 20 figures

Abstract

Modern vision models achieve strong performance on standard benchmarks, yet their aggregate accuracy reveals little about which scene properties drive their predictions. Existing robustness benchmarks provide important stress tests, but typically manipulate global 2D image properties, rely on entangled real-world variation, or cover only a limited set of 3D objects and scene parameters. We introduce MAPS (Manifolds of Artificial Parametric Scenes), a scalable instrument for controlled attribution of vision model behavior to scene parameters. MAPS comprises 2,618 curated photorealistic 3D meshes validated for recognizability across 560 ImageNet classes and provides a Blender-based rendering pipeline for on-demand image generation under continuous variation of nine independent scene factors spanning background, camera, and lighting, and it is extensible to other factors. To showcase its applicability, we use MAPS to evaluate 20 convolutional and transformer-based models by quantifying their reliance on these scene factors through regression-based sensitivity analysis. We find a near-universal failure axis across all tested architectures: camera distance and elevation consistently dominate recognition failure regardless of ImageNet accuracy. However, the full sensitivity structure reveals that modern CNNs and transformers cluster together, distinct from older architectures, suggesting that fine-grained architectural design choices, rather than the coarse CNN-versus-transformer distinction, are the stronger determinant of sensitivity profiles.

2605.20544 2026-05-21 cs.RO cs.CV (updated version)

The Yes-Man Syndrome: Benchmarking Abstention in Embodied Robotic Agents

Doguhan Yeke, Elif Su Temirel, Ananth Shreekumar, Brandon Lee, Dongyan Xu, Z Berkay Celik

Affiliations: Purdue University; Bilkent University

AI summary: This paper presents RoboAbstention, an abstention benchmarking framework for embodied robotic agents. It generates abstention instructions grounded in images from five robotics datasets, evaluates several frontier VLMs on abstention, and explores ways to improve abstention performance.

Abstract

Vision-language models (VLMs) are used as high-level planners for embodied agents, translating natural language instructions and visual observations into action plans. While prior work has studied abstention in LLMs, existing benchmarks are largely text-only and do not capture the perceptual grounding and physical constraints inherent to embodied robotics environments. In such settings, abstention requires recognizing when instructions are ambiguous, physically infeasible, based on false premises, or otherwise unresolvable given the available sensory modalities and context. To address this gap, we introduce a taxonomy to categorize abstention in the context of embodied robotics and present RoboAbstention, a scalable and auditable framework for generating abstention instructions grounded in images gathered from five robotics datasets. RoboAbstention instantiates the taxonomy through a three-phase pipeline: (1) structured visual grounding, (2) deterministic constraint derivation, and (3) controlled instruction generation via category-specific templates. This enables the construction of a diverse dataset with verifiable abstention conditions. We evaluate several frontier VLMs and find that all models exhibit significant weaknesses in abstention, including those with advanced reasoning capabilities. The best-performing model, Gemini 2.5 Flash, abstains on only 39.0% of our 6,069 benchmark instructions, while the embodied planner Gemini Robotics ER 1.6 Preview abstains on just 16.5%. We further explore methods for improving abstention in VLM planners, such as defensive prompting and in-context learning, and find that these interventions substantially improve performance, reaching 93.6% abstention rate for Gemini Robotics ER 1.6 Preview and 88.6% for GPT 5.4 Mini, yet no approach fully solves the problem. We open-source RoboAbstention at https://purseclab.github.io/RoboAbstention/.

2605.20543 2026-05-21 cs.CV (updated version)

Uncertainty-Guided Conservative Propagation for Structured Inference in Vessel Segmentation

Huan Huang, Michele Esposito, Chen Zhao

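Affiliations: Department of Computer Science, Kennesaw State University; Department of Cardiology, Medical University of South Carolina

AI summary: This paper proposes Uncertainty-Guided Conservative Propagation (UGCP), a module that improves structured inference in vessel segmentation by performing several logit-space update steps through local prediction interactions. It improves the Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance while reducing vessel disconnections and improving structural consistency.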

Comments Pattern Recognition submission. 35 pages, 6 figures

Abstract

Accurate vessel segmentation is essential for medical image analysis, yet remains challenging due to complex vascular patterns and imaging ambiguity. Most deep models rely on single-pass prediction, limiting their ability to refine uncertain or disconnected regions during inference. To address this limitation, we propose Uncertainty-Guided Conservative Propagation (UGCP), a general plug-in module for vessel segmentation. Instead of directly using a one-shot output as the final prediction, UGCP performs a small number of logit-space update steps to refine the segmentation through local prediction interactions. Predictive uncertainty guides reliable regions to support ambiguous regions, while structure-aware modulation and source-based stabilization reduce unreliable propagation and excessive drift. The module is differentiable and can be trained end-to-end with different segmentation networks. We evaluate UGCP on four public vessel segmentation datasets covering 2D and 3D tasks, including retinal vessel, coronary artery, and cerebral vessel segmentation. Experiments with convolutional neural network-based and Transformer-based backbones show consistent improvements in Dice similarity coefficient, centerline Dice, and 95th percentile Hausdorff distance. Further analysis demonstrates that UGCP reduces vessel disconnections and improves structural consistency with limited additional computation. The code will be made available at https://github.com/chenzhao2023/UGC_PR.
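
The update described in this abstract can be pictured with a toy 1D example. The sketch below is a loose illustration only: it assumes confidence is one minus the normalized binary entropy, that each pixel receives a confidence-weighted message from its two neighbors, and that a small anchor term pulls logits back toward the original prediction. The function names, the `rate` and `anchor` parameters, and the update rule itself are hypothetical and are not taken from the UGCP implementation.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def confidence(logit):
    """One minus normalized binary entropy: near 0 at p=0.5, near 1 when certain."""
    p = sigmoid(logit)
    if p <= 0.0 or p >= 1.0:
        return 1.0
    h = -(p * math.log(p) + (1 - p) * math.log(1 - p)) / math.log(2)
    return 1.0 - h


def propagate(logits, steps=3, rate=0.5, anchor=0.1):
    """Toy 1D uncertainty-guided refinement in logit space."""
    source = list(logits)
    cur = list(logits)
    for _ in range(steps):
        conf = [confidence(z) for z in cur]
        nxt = []
        for i, z in enumerate(cur):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(cur)]
            wsum = sum(conf[j] for j in nbrs)
            # Confidence-weighted neighbor message; no change if no neighbor is confident.
            msg = sum(conf[j] * cur[j] for j in nbrs) / wsum if wsum > 0 else z
            # Uncertain pixels move more; the anchor term limits drift from the source.
            nxt.append(z + rate * (1 - conf[i]) * (msg - z) + anchor * (source[i] - z))
        cur = nxt
    return cur
```

With logits `[4.0, 0.0, 4.0]`, the ambiguous middle pixel is pulled toward its two confident neighbors, which is the kind of gap-closing behavior the abstract attributes to propagation from reliable regions.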

2605.20185 2026-05-21 cs.GR cs.CV (updated version)

Component-Aware Structure-Preserving Style Transfer for Satellite Visual Sim2Real Data Construction

Zongwu Xie, Yonglong Zhang, Yifan Yang, Yang Liu, Baoshi Cao

Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology

AI summary: This paper proposes a component-aware structure-preserving style transfer framework for synthetic-to-real data construction in satellite vision. It extracts part-wise style codes from real images and injects them into synthetic images, improving annotation-preserving Sim2Real data generation.

Abstract

For camera-based satellite visual sensing, Sim2Real data construction requires images that approach real-domain sensor appearance while retaining the annotations inherited from simulation. Real sensor images of satellite targets with reliable pose labels and component-level masks are difficult to acquire at scale, whereas synthetic rendering provides exact geometric annotations but suffers from a visible appearance gap. This paper presents a component-aware structure-preserving style transfer framework for satellite visual synthetic-to-real data construction. The method builds weakly paired real--synthetic samples from calibrated real acquisition, ArUco-based camera-pose measurement, CAD rendering, and component masks. It then extracts part-wise real-domain style codes from unlabeled real images and injects them into corresponding synthetic satellite regions through mask-aligned modulation. To keep the generated images usable for downstream sensor-data supervision, adversarial training is combined with local contrastive consistency, self-regularization, and edge-preserving constraints. Experiments are conducted on 5,000 rendered satellite images and 100 real images captured in a calibrated setup. The real images provide target-domain appearance references and final evaluation images, while the downstream GDRNet pose estimator is trained only on synthetic or translated synthetic images. Compared with representative image-translation baselines, the proposed method achieves the lowest image distribution discrepancy, with an FID of 54.32 and a KID of 0.048. When the translated data are used to train GDRNet in this target-domain adaptation setting, the ADD pass rate improves to 0.260 and the AUC improves to 0.611. These results indicate that component-level appearance transfer can improve annotation-preserving satellite visual Sim2Real data generation in the considered calibrated setup.

SVFSearch: A Multimodal Knowledge-Intensive Short-Video Frame Search Benchmark for the Gaming Vertical

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

Affiliations: Kuaishou Technology

AI summary: This paper introduces SVFSearch, the first multimodal knowledge-intensive benchmark for short-video frame search in the Chinese gaming domain. With 5,000 four-choice test examples and 4,198 auxiliary training examples, it evaluates methods ranging from direct QA to Plan-Act-Replan agents.

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17472 2026-05-21 cs.CV (updated version)

Weighted Reverse Convolution for Feature Upsampling

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang

发表机构 * Nanjing University of Aeronautics and Astronautics（南京航空航天大学）； The Hong Kong Polytechnic University（香港理工大学）； Nanjing University（南京大学）

AI总结本文提出加权反卷积（WRC），从逆问题的角度重新审视视觉基础模型中的特征上采样，通过空间自适应的逆操作提升高层视觉描述符的密度，从而在需要细粒度定位、密集预测和点对应的任务中提升性能。

Comments 18 pages, 7 figures, code:https://github.com/PolyU-VCLab/WRC

详情

AI中文摘要

预训练的视觉基础模型（VFMs）提供强大的语义表示，但其补丁级特征本质上是粗略的，限制了在需要细粒度定位、密集预测和点对应的任务中的有效性。在本文中，我们从逆问题的角度重新审视VFMs中的特征上采样，并提出加权反卷积（WRC），一种空间自适应的逆操作，用于密集化高层视觉描述符。具体来说，我们将特征上采样公式化为加权Tikhonov正则化最小二乘问题，其中空间变化的权重在每个空间位置调节数据保真度和先验强度。这使得WRC能够适应空间变化的特征特性，从而在保留关键结构的同时减轻过平滑问题。此外，WRC保留了一个高效、完全可微的闭合形式FFT解，使其成为一种实用的上采样操作符。在轻量级自监督密集化框架中集成后，WRC在各种下游基准测试中一致提高了密集特征质量，包括分割、深度估计、视频对象分割、对象发现和关键点对应，同时保持高计算效率。

英文摘要

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problems and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
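
As a rough illustration of what a "closed-form FFT solution" to a Tikhonov-regularized least-squares problem looks like, the sketch below solves the plain, unweighted deblurring case under circular convolution. It is not the paper's weighted upsampling operator: the spatially varying weights and the upsampling step are omitted, and the function names (`kernel_to_otf`, `circular_blur`, `tikhonov_deconv_fft`) are hypothetical.

```python
import numpy as np

def kernel_to_otf(kernel, shape):
    """Zero-pad a small kernel to `shape` and move its centre to index (0, 0)."""
    k_pad = np.zeros(shape)
    kh, kw = kernel.shape
    k_pad[:kh, :kw] = kernel
    k_pad = np.roll(k_pad, shift=(-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(k_pad)

def circular_blur(x, kernel):
    """Circular convolution of x with kernel, computed in the Fourier domain."""
    K = kernel_to_otf(kernel, x.shape)
    return np.real(np.fft.ifft2(K * np.fft.fft2(x)))

def tikhonov_deconv_fft(y, kernel, lam):
    """Closed-form minimiser of ||k * x - y||^2 + lam * ||x||^2 (circular conv).

    In the Fourier domain the normal equations decouple per frequency:
        X = conj(K) * Y / (|K|^2 + lam)
    """
    K = kernel_to_otf(kernel, y.shape)
    Y = np.fft.fft2(y)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))

# Toy check: blur a random 8x8 map, then invert with a tiny regulariser.
rng = np.random.default_rng(0)
x_true = rng.random((8, 8))
kernel = np.array([[0.0, 0.1, 0.0],
                   [0.1, 0.6, 0.1],
                   [0.0, 0.1, 0.0]])
y_blur = circular_blur(x_true, kernel)
x_hat = tikhonov_deconv_fft(y_blur, kernel, lam=1e-12)
```

The toy kernel is chosen so that its frequency response never reaches zero, which keeps the inversion well conditioned even with a very small `lam`; a larger `lam` trades fidelity for smoothness.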

2605.16962 2026-05-21 cs.CV cs.AI (version update)

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION); Xi'an Jiaotong University

AI summary: This work proposes OmniVL-Guard Pro, a tool-augmented agent for omnibus vision-language forensics. By integrating a multi-tool environment and introducing a new reinforcement learning method, it enables clue-driven reasoning in the open world and achieves state-of-the-art performance across multiple tasks.

Comments 29 pages

Abstract

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clues-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks, and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16530 2026-05-21 cs.CV (version update)

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

Affiliations: Technical University Darmstadt; Carl Zeiss AG; AICM, Medical Faculty of Heidelberg University

AI summary: This paper proposes SWoMo, a neuro-symbolic world model for cataract surgery simulation. It decouples motion generation from visual realism: a rule-based simulator and scene graph representations model motion dynamics and tool-tissue interactions, while a diffusion model generates realistic visuals, improving the realism and clinical applicability of surgical simulation.

Abstract

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

2605.15876 2026-05-21 cs.CV (version update)

Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

Affiliations: Zhejiang University; Tencent Hunyuan LLM; HKUST; Shenzhen Loop Area Institute

AI summary: This paper proposes DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. A lightweight depth head is attached to the LLM backbone and trained under a unified vision-text supervision paradigm, so DepthVLM produces full-resolution depth maps alongside language outputs in a single forward pass. The paper also introduces a unified indoor-outdoor metric depth benchmark; experiments show that DepthVLM outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning.

Comments Project Page: https://depthvlm.github.io/

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.14417 2026-05-21 cs.RO cs.CV (version update)

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); LimX Dynamics Technology Co., Ltd.; Shandong University; Data61/CSIRO; Griffith University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)

AI summary: This work proposes DAJI, a framework that learns an anticipatory joint-intent interface between language generation and closed-loop control, addressing the need for language-conditioned humanoid robots to anticipate future physical transitions. It achieves strong results on HumanML3D-style generation and on BABEL.

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14382 2026-05-21 cs.CV cs.GR cs.MM (version update)

ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability

Piotr Borycki, Magdalena Trędowicz, Jacek Tabor, Łukasz Struski, Przemysław Spurek

Affiliations: Jagiellonian University; IDEAS Research Institute; Centre for Credible AI; Warsaw University of Technology

AI summary: This paper proposes ProDG, a data-free post-hoc explainability framework that uses generative models to synthesize pure, high-fidelity prototypes directly from a frozen model's weights, removing the dependency on any external data and providing robust visual interpretability for privacy-sensitive domains.

Abstract

Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG

2605.05405 2026-05-21 cs.CV (version update)

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

James Walsh, William Fawcett, Grace Colverd, Raúl Ramos-Pollán

Affiliations: Trillium Technologies; University of Cambridge; Universidad de Antioquia

AI summary: This paper presents GeoQuery, a system that enables natural-language queries at global scale through a two-stage semantic and visual search, without paired training data or large-scale compute. It uses natural-language embeddings of a proxy subset of global data and optimises the description-generation prompt so that distances in the text-embedding space correlate with distances in the frozen CLAY visual-embedding space. It is evaluated on disaster-location queries.

DOI: 10.56272/a2c9ee39

Abstract

Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called ECHO, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
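
The two-stage query described in the abstract can be sketched roughly as follows. This is a toy illustration only: the embeddings are made-up vectors rather than CLAY or text-encoder outputs, the function name `two_stage_search` is hypothetical, and using the single best proxy tile's visual embedding as the stage-two query is an assumption (the abstract does not say how the two stages are linked).

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def two_stage_search(query_text_emb, proxy_text_embs, proxy_visual_embs,
                     global_visual_embs, k=2):
    # Stage 1: cosine similarity between the query and proxy-tile descriptions.
    text_sims = normalize(proxy_text_embs) @ normalize(query_text_emb)
    best_proxy = int(np.argmax(text_sims))
    # Stage 2: visual nearest neighbours of that proxy tile over all tiles.
    anchor = normalize(proxy_visual_embs[best_proxy])
    visual_sims = normalize(global_visual_embs) @ anchor
    top_k = np.argsort(-visual_sims)[:k]
    return best_proxy, top_k.tolist()

# Toy data: 2 proxy tiles (text dim 2, visual dim 3) and 4 global tiles.
proxy_text = np.array([[1.0, 0.0], [0.0, 1.0]])
proxy_visual = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
global_visual = np.array([[0.9, 0.1, 0.0],
                          [0.0, 1.0, 0.1],
                          [0.0, 0.0, 1.0],
                          [0.1, 0.9, 0.0]])
query = np.array([0.1, 0.9])

best_proxy, neighbours = two_stage_search(query, proxy_text, proxy_visual, global_visual)
```

The appeal of this layout is that only the proxy subset needs text descriptions; the worldwide index stays purely visual.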

2605.04128 2026-05-21 cs.GR cs.AI cs.CL cs.CV cs.LG (version update)

JoyAI-Image: Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation

Lin Song, Wenbo Li, Guoqing Ma, Wei Tang, Bo Wang, Yuan Zhang, Yijun Yang, Yicheng Xiao, Jianhui Liu, Yanbing Zhang, Guohui Zhang, Wenhu Zhang, Hang Xu, Nan Jiang, Xin Han, Haoze Sun, Maoquan Zhang, Haoyang Huang, Nan Duan

Affiliations: Joy Future Academy, JD

AI summary: This paper presents JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. It couples a spatially enhanced multimodal large language model (MLLM) with a Multimodal Diffusion Transformer (MMDiT) so that perception and generation interact through a shared multimodal interface. A scalable training recipe combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and general and spatial editing signals, giving the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments on understanding, generation, long-text rendering, and editing benchmarks show state-of-the-art or highly competitive performance, and the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning moves the model beyond general visual competence toward stronger spatial intelligence.

Comments Code: https://github.com/jd-opensource/JoyAI-Image

Abstract

We present JoyAI-Image, a unified multimodal foundation model for visual understanding, text-to-image generation, and instruction-guided image editing. JoyAI-Image couples a spatially enhanced Multimodal Large Language Model (MLLM) with a Multimodal Diffusion Transformer (MMDiT), allowing perception and generation to interact through a shared multimodal interface. Around this architecture, we build a scalable training recipe that combines unified instruction tuning, long-text rendering supervision, spatially grounded data, and both general and spatial editing signals. This design gives the model broad multimodal capability while strengthening geometry-aware reasoning and controllable visual synthesis. Experiments across understanding, generation, long-text rendering, and editing benchmarks show that JoyAI-Image achieves state-of-the-art or highly competitive performance. More importantly, the bidirectional loop between enhanced understanding, controllable spatial editing, and novel-view-assisted reasoning enables the model to move beyond general visual competence toward stronger spatial intelligence. These results suggest a promising path for unified visual models in downstream applications such as vision-language-action systems and world models.

2604.27505 2026-05-21 cs.CV (version update)

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang

Affiliations: School of Computing and Data Science, The University of Hong Kong; Center for Embodied AI and Computer Vision, Shenzhen Loop Area Institute

AI summary: This paper proposes the Edit-R1 framework, which addresses the lack of a robust reward model for image editing by building a reasoning-based verifier reward model (RRM). The RRM decomposes each instruction into distinct principles and evaluates the edited image against each one to produce a fine-grained reward; experiments show that it outperforms existing models as an editing-specific reward model.

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a "cold-start" to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
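
One reason a non-differentiable reward model can still drive training is that GRPO-style methods only need scalar rewards for a group of samples drawn for the same prompt. A minimal sketch of the group-relative advantage as GRPO is commonly described (rewards standardised within the group) is given below; the reward values are invented, and how Edit-R1 configures GRPO in detail is not stated in the abstract.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical reward-model scores for three edits of one instruction.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Samples scored above the group mean receive a positive advantage and those below a negative one, so the reward model is only ever queried for scores, never differentiated through.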

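2604.27375 2026-05-21 cs.CV (version update)

Beyond Attention Scores: SVD-Based Visual Token Pruning for Efficient Vision-Language Models

Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

Affiliations: anoncvlab

AI summary: This paper proposes SVD-Prune, which applies SVD to the visual-token feature matrix and selects the top tokens by statistical leverage score, maintaining high performance under extreme visual-token budgets and outperforming existing pruning methods.

Perceptual Biases in Texture Representations in Convolutional Neural Networks

Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

Affiliations: Neuroscience Area, International School for Advanced Studies (SISSA), Trieste, Italy; Department of Mathematics, Informatics and Geosciences, Università degli Studi di Trieste, Trieste, Italy; Department of Data Engineering, Area Science Park, Trieste, Italy

AI summary: This paper studies the alignment between texture representations in convolutional neural networks and human perception. It finds no direct relationship between conventional measures of a CNN vision model's quality and its alignment with human texture perception, suggesting that texture perception may involve mechanisms different from those of conventional CNN object-recognition models.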

2603.27747 2026-05-21 cs.CV cs.AI (version update)

AI-Powered Facial Mask Removal Is Not Suitable For Identification

Emily A Cooper, Hany Farid

Affiliations: Herbert Wertheim School of Optometry & Vision Science, University of California, Berkeley; School of Information, University of California, Berkeley

AI summary: This paper examines the effectiveness and risks of AI-powered facial mask removal, and how reliable it is for matching real identities.

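2603.27309 2026-05-21 cs.GR cs.CV (version update)

MeshTailor: Cutting Seams via Generative Mesh Traversal

Xueqi Ma, Xingguang Yan, Congyue Zhang, Hui Huang

Affiliations: Shenzhen University; Simon Fraser University

AI summary: This paper proposes MeshTailor, the first mesh-native generative framework for synthesizing edge-aligned seams on 3D surfaces. Unlike prior optimization-based or extrinsic learning-based methods, MeshTailor operates directly on the mesh graph, eliminating projection artifacts and fragile snapping heuristics. It introduces ChainingSeams, a hierarchical serialization of the seam graph that orders chains coarse-to-fine, from global structural cuts to local details, and a dual-stream encoder that fuses topological and geometric context. Using this hierarchical representation and the dual-stream vertex embeddings, the MeshTailor Transformer traces seams vertex by vertex within local neighborhoods using an autoregressive pointer layer. Extensive evaluation shows that MeshTailor produces more coherent and structurally regular seam layouts than recent optimization-based and learning-based baselines.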

2603.14184 2026-05-21 cs.CV cs.AI (version update)

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Affiliations: Tsinghua Shenzhen International Graduate School; Huawei Technologies

AI summary: This paper studies the visual perception impairment that multimodal large language models exhibit during reasoning and proposes a training-free Visual Region-Guided Attention framework that selects and reweights visual heads to steer the model toward question-relevant regions, improving visual grounding and reasoning accuracy.

Abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
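
To make "attention dispersion" concrete, one simple proxy is the Shannon entropy of a head's attention over image tokens: a head concentrated on a few tokens has low entropy, while a head spread evenly has high entropy. The sketch below ranks heads by that entropy. It is a hypothetical illustration, not the paper's entropy-focus criterion, whose exact definition the abstract does not give; the function names and toy attention values are assumptions.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy of each head's attention over image tokens.

    attn: array of shape (num_heads, num_image_tokens), non-negative.
    """
    p = attn / attn.sum(axis=-1, keepdims=True)  # renormalise over image tokens
    return -(p * np.log(p + eps)).sum(axis=-1)

def select_focused_heads(attn, num_heads):
    """Return indices of the heads with the lowest entropy (most focused)."""
    return np.argsort(attention_entropy(attn))[:num_heads].tolist()

# Toy attention maps for three heads over four image tokens.
attn = np.array([[0.97, 0.01, 0.01, 0.01],   # sharply focused
                 [0.25, 0.25, 0.25, 0.25],   # fully dispersed
                 [0.70, 0.10, 0.10, 0.10]])  # moderately focused
entropies = attention_entropy(attn)
focused = select_focused_heads(attn, num_heads=2)
```

A selection rule of this kind needs no gradient updates, which is consistent with the training-free framing, though the reweighting step itself is not shown here.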

2603.08235 2026-05-21 cs.CV cs.AI (version update)

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

Affiliations: BiometricsAI, Universidad Autónoma de Madrid, Madrid, Spain; Department of Mathematics, Universidad de Las Palmas de Gran Canaria, Spain; HCTLab Research Group, Universidad Autónoma de Madrid, Madrid, Spain

AI summary: This paper studies deep learning and ultra-widefield imaging for detecting diabetic retinopathy and macular edema, benchmarking several deep learning models on a public dataset and exploring the potential of feature-level fusion and frequency-domain representations.

Comments 6 pages, 4 figures, 2 tables

Abstract

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

2602.24138 2026-05-21 cs.CV cs.AI (version update)

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

Affiliations: Dept. of Robotics, Mohamed bin Zayed University of AI

AI summary: This paper proposes TASOT, an annotation-free framework for surgical temporal segmentation. It combines temporally aligned text descriptions with visual information, fusing visual and semantic cues under a unified unbalanced Gromov-Wasserstein optimal transport objective, and achieves substantial improvements on multiple public surgical datasets.

Abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.
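
For readers unfamiliar with optimal-transport-based segmentation, the sketch below shows a much simpler relative of the objective named in the abstract: balanced entropic optimal transport solved with Sinkhorn iterations, assigning frames to segments under a cost that mixes a visual term and a text term. It is not TASOT's unbalanced Gromov-Wasserstein formulation; the mixing weight `alpha`, the toy costs, and the function names are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between marginals a (rows) and b (columns)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # match row marginals
        v = b / (K.T @ u)            # match column marginals
    return u[:, None] * K * v[None, :]

def fused_cost(visual_cost, text_cost, alpha=0.5):
    """Convex combination of a visual cost and a text-derived cost."""
    return alpha * visual_cost + (1.0 - alpha) * text_cost

# Toy problem: 4 frames, 2 segments; both cues agree on the assignment.
visual_cost = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])
text_cost = np.array([[0.2, 0.8], [0.1, 0.9], [0.9, 0.1], [0.8, 0.2]])
a = np.full(4, 0.25)                 # uniform mass over frames
b = np.full(2, 0.5)                  # uniform mass over segments

plan = sinkhorn(fused_cost(visual_cost, text_cost), a, b)
labels = plan.argmax(axis=1).tolist()
```

Each row of `plan` spreads one frame's mass over the segments, so taking the row-wise argmax yields a hard per-frame label.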

2602.18532 2026-05-21 cs.CV cs.AI cs.RO 版本更新

VLANeXt: Recipes for Building Strong VLA Models

VLANeXt: 构建强大VLA模型的配方

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

发表机构 * S-Lab, Nanyang Technological University（南洋理工大学S实验室）； SenseTime Research（商汤研究）； Sun Yat-sen University（中山大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文通过统一框架和评估设置重新审视VLA设计空间，系统分析了基础组件、感知要素和动作建模视角，总结出12项关键发现，提出了一种简单有效的VLA模型VLANeXt，并在LIBERO和LIBERO-plus基准测试中超越了现有方法，同时提供了易于使用的代码库。

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

详情

AI中文摘要

在大基础模型兴起之后，视觉-语言-动作模型（VLAs）应运而生，利用视觉语言模型的强大视觉和语言理解能力进行通用目的策略学习。然而，当前VLA领域仍处于碎片化和探索阶段。尽管许多团队提出了各自的VLA模型，但训练协议和评估设置的一致性不足，使得难以确定哪些设计选择真正重要。为了使这一发展领域更具结构化，我们重新审视VLA设计空间，基于类似RT-2的简单VLA基线，系统地分析了三个维度：基础组件、感知要素和动作建模视角。从这项研究中，我们提炼出12项关键发现，共同构成了构建强大VLA模型的实用配方。该探索的成果是一种简单而有效的模型VLANeXt，它在LIBERO和LIBERO-plus基准测试中优于现有方法，并在现实世界实验中表现出色。我们还发布了一个统一且易于使用的代码库，以重现我们的发现、探索设计空间并基于共享基础开发新的VLA变体。代码库可在https://github.com/DravenALG/VLANeXt上获得。

英文摘要

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG 版本更新

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

可解释的人工智能：面向Transformer模型的上下文感知分层集成梯度方法

Melkamu Abay Mersha, Jugal Kalita

发表机构 * College of Engineering and Applied Science, University of Colorado Colorado Springs（科罗拉多大学科罗拉多斯普林斯分校工程与应用科学学院）

AI总结本文提出了一种上下文感知分层集成梯度框架（CA-LIG），用于解释Transformer模型的决策过程，通过计算每个Transformer块内的分层集成梯度，并将这些token级属性与类特定的注意力梯度融合，从而生成具有符号和上下文敏感性的属性图，以捕捉支持和反对的证据，并追踪Transformer层中的相关性层次流动。

详情

DOI: 10.1016/j.neucom.2026.133050

AI中文摘要

Transformer模型在多个领域和任务中实现了最先进的性能，然而其深层表示使得预测难以解释。现有的可解释性方法依赖于最终层的属性，只能捕捉局部token级属性或全局注意力模式，缺乏对token间依赖关系和结构组件的上下文感知能力。它们还无法捕捉相关性如何在层之间演变以及结构组件如何影响决策。为了解决这些限制，我们提出了上下文感知分层集成梯度（CA-LIG）框架，一种统一的层次属性框架，该框架在每个Transformer块内计算分层集成梯度，并将这些token级属性与类特定的注意力梯度融合。这种整合产生了带有符号和上下文敏感性的属性图，能够捕捉支持和反对的证据，同时追踪Transformer层中的相关性层次流动。我们评估了CA-LIG框架在多样化的任务、领域和Transformer模型家族中的表现，包括使用BERT进行情感分析和长多类文档分类，使用XLM-R和AfroLM在低资源语言设置中进行仇恨言论检测，以及使用Masked Autoencoder Vision Transformer模型进行图像分类。在所有任务和架构中，CA-LIG提供了更忠实的属性，显示出对上下文依赖的更强敏感性，并产生了更清晰、更语义连贯的可视化结果，优于现有可解释性方法。这些结果表明，CA-LIG提供了更全面、上下文感知和可靠的Transformer决策解释，推动了深度神经网络的实用可解释性和概念理解。

英文摘要

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and Transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

2602.11499 2026-05-21 cs.CV 版本更新

Self-Refining Video Sampling

自精炼视频采样

Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, Sung Ju Hwang

AI总结本文提出了一种自精炼（self-refining）视频采样方法，将预训练的视频生成器用作自身的精炼器，无需外部验证器或额外训练，即可在推理时进行迭代式内循环精炼，从而提高运动连贯性和物理一致性。

Comments ICML 2026. Project page: https://agwmon.github.io/self-refine-video/

详情

AI中文摘要

现代视频生成器仍难以处理复杂的物理动态，往往达不到物理真实感。现有方法通过外部验证器或在增强数据上进行额外训练来解决这一问题，但计算成本高，且在捕捉细粒度运动方面仍然有限。在本工作中，我们提出了自精炼视频采样（self-refining video sampling），这是一种简单的方法，将在大规模数据集上预训练的视频生成器用作其自身的精炼器。通过将生成器解释为去噪自编码器，我们能够在推理时进行迭代式内循环精炼，而无需任何外部验证器或额外训练。我们进一步引入了一种不确定性感知的精炼策略，根据自一致性有选择地精炼区域，从而防止过度精炼引起的伪影。在最先进的视频生成器上进行的实验表明，运动连贯性和物理一致性均有显著提升，相较于默认采样器和基于引导的采样器，获得了超过70%的人类偏好率。

英文摘要

Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

2601.04068 2026-05-21 cs.CV cs.AI 版本更新

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

注意生成细节：面向视频扩散模型的直接局部化细节偏好优化

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Alibaba Group - Taobao & Tmall Group（阿里巴巴集团-淘宝 & 天猫集团）

AI总结本文提出LocalDPO，一种新的后训练框架，通过从真实视频中构建局部偏好对，并在时空区域层面优化对齐，以提高视频生成的质量和人类偏好评分。

Comments Accepted by CVPR 2026

详情

AI中文摘要

将文本到视频的扩散模型与人类偏好对齐对于生成高质量视频至关重要。现有的直接偏好优化（DPO）方法依赖于多样本排序和任务特定的批评模型，这效率低下且常导致模糊的全局监督。为了解决这些限制，我们提出了LocalDPO，一种新的后训练框架，该框架从真实视频中构建局部偏好对，并在时空区域层面进行优化。我们设计了一个自动化流程，高效地收集偏好对数据，通过单次提示推理生成偏好对，消除了对外部批评模型或人工标注的需求。具体来说，我们将高质量的真实视频作为正样本，并通过局部随机时空掩码来生成对应的负样本，仅使用冻结的基模型恢复被掩码的区域。在训练过程中，我们引入了区域感知的DPO损失，将偏好学习限制在被损坏的区域以实现快速收敛。在Wan2.1和CogVideoX上的实验表明，LocalDPO在视频保真度、时间连贯性和人类偏好评分方面优于其他后训练方法，建立了更高效和精细的视频生成器对齐范式。代码可在https://github.com/1170300714/Local-DPO上获得。

英文摘要

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference-pair data, generating each pair with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.

2512.13402 2026-05-21 cs.CV cs.AI 版本更新

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

End2Reg: 面向脊柱手术无标记配准的任务特定分割学习

Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer, Carol C. Hasler, Maria Licci

发表机构 * Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland（巴塞尔大学生物医学工程系，瑞士Allschwil）； Department of Orthopedics, University Children’s Hospital, Basel, Switzerland（巴塞尔大学儿童医院骨科部，瑞士Basel）

AI总结本文提出End2Reg，一种端到端深度学习框架，通过联合优化分割和配准，无需分割标签和手动步骤，从而提高脊柱手术中无标记导航的精度。

Comments Early Accepted MICCAI 2026. Code and interactive visualizations: https://lorenzopettinari.github.io/end-2-reg/

详情

AI中文摘要

脊柱手术中的术中导航需要毫米级的精度。目前，这依靠辐射量大的术中成像和骨锚定标记来实现，而这些标记具有侵入性并会干扰手术流程。无标记RGB-D配准方法提供了一种有前景的替代方案。然而，现有方法依赖弱分割标签来分离相关解剖结构，可能导致误差在配准过程中传播。我们提出了End2Reg，一种端到端深度学习框架，通过联合优化分割和配准，消除了对分割标签和手动步骤的需要。网络学习为配准而优化的任务特定分割掩码，仅由配准目标引导，而无需显式的分割监督。End2Reg在离体和在体基准测试中实现了最先进的性能，将中位目标配准误差（Target Registration Error）降低了32%，平均均方根误差降低了61%，同时在部分遮挡下保持稳健性能。消融结果证实，端到端优化显著提高了配准精度。总体而言，End2Reg向全自动、无标记的术中导航迈进了一步。代码和交互式可视化可在 https://lorenzopettinari.github.io/end-2-reg/ 上找到。

英文摘要

Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

2512.09806 2026-05-21 cs.CV cs.AI 版本更新

FineVision: Open Data Is All You Need

FineVision: 你只需要开放数据

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

发表机构 * Hugging Face ； Technical University of Munich（慕尼黑技术大学）； Stanford University（斯坦福大学）

AI总结本文提出FineVision，一个包含2400万样本的高质量数据集，通过半自动化流程整合了200多个来源，通过严格的数据清洗和人工审核确保数据质量，训练基于该数据集的模型在广泛评估中表现更优，推动数据驱动的视觉语言模型研究。

详情

AI中文摘要

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

令牌去哪了？理解STEP在高分辨率下的剪枝行为

Michal Szczepanski, Martyna Poreba, Karim Haroun

发表机构 * Université Paris-Saclay, CEA, List（巴黎-萨克雷大学，CEA，List）； I3S, Université Côte d’Azur, CNRS（I3S，蔚蓝海岸大学，CNRS）

AI总结本文提出STEP框架，通过动态补丁合并和令牌剪枝提高效率，同时在高分辨率语义分割任务中实现显著的计算成本降低和吞吐量提升，同时保持较高的准确性。

详情

DOI: 10.1007/s42979-025-04707-6
Journal ref: SN Computer Science 2026

AI中文摘要

视觉变换器（ViTs）在语义分割任务中实现了最先进的性能，但受到高计算和内存成本的限制。为了解决这一问题，我们提出了STEP（SuperToken和Early-Pruning），一种混合的令牌减少框架，结合动态补丁合并和令牌剪枝，以提高效率而不显著牺牲准确性。STEP的核心是dCTS，一个轻量级的CNN基政策网络，能够灵活地合并为超补丁。编码器块也集成了早期退出，以移除高置信度的超令牌，从而降低计算负载。我们在高分辨率语义分割基准上评估了我们的方法，包括高达1024x1024像素的图像，并显示当仅应用dCTS时，令牌数量可以比标准的16x16像素补丁方案减少2.5倍。这在使用ViT-Large作为骨干时，导致计算成本减少2.6倍，吞吐量增加3.4倍。应用完整的STEP框架进一步提高效率，达到计算复杂度减少4倍，推理速度提高1.7倍，最大精度下降不超过2.0%。通过提出的STEP配置，可以自信地在到达最终编码器层之前停止多达40%的令牌。

英文摘要

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
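
To make the token-count claim concrete, the short sketch below works through the arithmetic for a 1024 x 1024 image under the standard 16 x 16 patching scheme and then applies the reported 2.5x reduction. The function names are hypothetical and this is only back-of-the-envelope arithmetic, not the authors' dCTS implementation.

```python
import math


def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Tokens produced by non-overlapping patch x patch tiling of an image."""
    return math.ceil(height / patch) * math.ceil(width / patch)


def reduced_token_count(tokens: int, reduction_factor: float) -> int:
    """Approximate tokens remaining after an overall reduction factor."""
    return round(tokens / reduction_factor)


# 1024 / 16 = 64 patches per side, so 64 * 64 tokens in the baseline grid.
baseline = patch_token_count(1024, 1024)
# Applying the 2.5x reduction reported for dCTS alone (illustrative only).
after_dcts = reduced_token_count(baseline, 2.5)
print(baseline, after_dcts)
```

Under these assumptions the baseline grid holds 4096 tokens and roughly 1638 remain after merging; the actual count per image depends on what the policy network decides to merge.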

2509.13482 2026-05-21 cs.CV 版本更新

Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

通过场景自适应晶格向量量化改进3D高斯散射压缩

Hao Xu, Xiaolin Wu, Xi Zhang

发表机构 * Department of Electrical & Computer Engineering, McMaster University（麦克马斯特大学电气与计算机工程系）； School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）

AI总结本文提出了一种场景自适应晶格向量量化（SALVQ）方法，用于改进3D高斯散射（3DGS）的压缩性能，通过优化晶格基矢来提高适应性和R-D效率，同时减少计算开销和训练时间。

Comments Accepted by IEEE TIP. Code available at https://github.com/hxu160/SALVQ

详情

AI中文摘要

3D高斯散射（3DGS）因其逼真渲染质量和实时性能而迅速流行，但会产生大量数据。因此，压缩3DGS数据对于其模型的成本效益至关重要。最近，一些基于锚点的神经压缩方法已被提出，实现了良好的3DGS压缩性能。然而，它们都依赖于统一标量量化（USQ）因其简单性。一个引人注目的问题是，更复杂的量化器是否能在极小的额外开销和系统最小变化的情况下改进当前的3DGS压缩方法。答案是肯定的，通过将USQ替换为晶格向量量化（LVQ）。为了更好地捕捉场景特定特性，我们为每个场景优化晶格基矢，提高LVQ的适应性和R-D效率。这种场景自适应LVQ（SALVQ）在向量量化和USQ的低复杂性之间取得了平衡。SALVQ可以无缝集成到现有的3DGS压缩架构中，通过最小的修改和计算开销提高其R-D性能。此外，通过缩放晶格基矢量，SALVQ可以动态调整晶格密度，使单个模型能够适应多种比特率目标。这种灵活性消除了为不同压缩级别训练单独模型的需要，显著减少了训练时间和内存消耗。

英文摘要

3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
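
As a rough illustration of what replacing scalar rounding with a lattice quantizer looks like, here is a minimal 2-D sketch. It uses Babai rounding (round the coordinates in the lattice basis), which only approximates the nearest lattice point for non-orthogonal bases. The basis values, the function names, and the choice of Babai rounding are assumptions for illustration; the abstract does not specify the paper's quantizer or how its basis is learned.

```python
def quantize_to_lattice(x, basis):
    """Map a 2-D point to a nearby lattice point via Babai rounding.

    `basis` is a 2x2 matrix ((a, b), (c, d)) whose COLUMNS are the lattice
    basis vectors. Returns (integer indices, reconstructed point).
    """
    (a, b), (c, d) = basis
    det = a * d - b * c
    if det == 0:
        raise ValueError("basis vectors must be linearly independent")
    # Coordinates of x in the lattice basis: B^{-1} x.
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    k = (round(u), round(v))
    # Reconstruct the lattice point B k.
    y = (a * k[0] + b * k[1], c * k[0] + d * k[1])
    return k, y


def scale_basis(basis, s):
    """Scale every basis vector by s; a larger s gives a sparser lattice."""
    return tuple(tuple(s * e for e in row) for row in basis)


# Hypothetical non-orthogonal basis with columns (1, 0) and (0.5, 1).
basis = ((1.0, 0.5), (0.0, 1.0))
point = (2.3, 1.9)
fine_idx, fine_pt = quantize_to_lattice(point, basis)
coarse_idx, coarse_pt = quantize_to_lattice(point, scale_basis(basis, 2.0))
print(fine_idx, fine_pt, coarse_idx, coarse_pt)
```

Scaling the same basis changes the lattice density without touching anything else, which is the mechanism the abstract credits for serving multiple bit-rate targets with a single model.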

2509.09946 2026-05-21 cs.CV 版本更新

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

通过鲁棒的2D跟踪和基于深度的后期聚合实现在线3D多摄像机感知

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran

发表机构 * Optoelectronics Center, Viettel Aerospace Institute, Viettel Group（Viettel集团光学电子中心、 Viettel航空航天研究所）； University of Engineering and Technology, Vietnam National University（越南国家大学工程大学）； School of Electrical and Electronic Engineering, Hanoi University of Science and Technology（河内科学技术大学电子与电气工程学院）

AI总结本文提出了一种方法，通过利用深度信息将现有的在线2D多摄像机跟踪系统扩展到3D空间，通过点云空间重建目标并利用聚类和偏转细化恢复其3D框，同时引入了增强的在线数据关联机制，以局部ID一致性来分配跨帧的全局ID，该框架在2025年AI城市挑战赛的3D MTMC数据集上评估，取得了第三名的成绩。

Comments Accepted at ICCVW 2025

详情

DOI: 10.1109/ICCVW69036.2025.00570

AI中文摘要

多目标多摄像机跟踪（MTMC）是自动化大规模监控中的关键计算机视觉任务。通过摄像机标定和深度信息，场景中的目标可以投影到3D空间，提供对3D环境的前所未有的自动感知水平。然而，在3D空间中的跟踪需要替换所有2D跟踪组件，这可能对现有的MTMC系统不可行。本文提出了一种方法，通过利用深度信息将任何在线2D多摄像机跟踪系统扩展到3D空间，通过点云空间重建目标，并通过聚类和偏转细化恢复其3D框。我们还引入了增强的在线数据关联机制，利用目标的局部ID一致性来分配跨帧的全局ID。所提出的框架在2025年AI城市挑战赛的3D MTMC数据集上进行评估，取得了排行榜第三名的成绩。

英文摘要

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.

2508.11354 2026-05-21 cs.CV cs.AI cs.LG 版本更新

FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

FunduSegmenter：利用RETFound基础模型进行视网膜眼底图像中视盘和视杯联合分割

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

发表机构 * University of Dundee（邓迪大学）

AI总结本文提出了一种基于RETFound基础模型的FunduSegmenter模型，通过引入一系列新颖模块实现视盘和视杯的联合分割，实验表明该模型在多个数据集上均优于现有方法。

详情

DOI: 10.1167/tvst.15.5.14
Journal ref: Trans. Vis. Sci. Tech. 2026;15(5):14

AI中文摘要

目的：本研究首次将RETFound模型应用于视盘（OD）和视杯（OC）的联合分割。RETFound是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型，已在疾病诊断中表现出色。方法：我们提出FunduSegmenter，该模型整合了一系列新颖模块与RETFound，包括预适配器、解码器、后适配器、带有卷积块注意模块的跳跃连接以及视觉Transformer块适配器。该模型在自有数据集GoDARTS以及四个公开数据集IDRiD、Drishti-GS、RIM-ONE-r3和REFUGE上进行了评估，通过内部验证、外部验证和领域泛化实验进行验证。结果：在内部验证中，平均Dice相似系数达到90.51%，优于所有基线方法，其中nnU-Net为82.91%，DUNet为89.17%，TransUNet为87.91%。在所有外部验证实验中，平均结果比最佳基线高约3%，且在领域泛化中也具有竞争力。结论：本研究探讨了RETFound通过学习潜在通用表示在眼底相机图像中进行OD和OC分割的潜力。我们的FunduSegmenter在整体上优于现有最先进基线方法。所提出的模块是通用的，可以扩展到其他基础模型的微调。临床相关性：该模型在分布内和分布外数据上均表现出强大的稳定性与泛化能力，提供了稳定的OD和OC分割。这是许多自动化任务的关键步骤，从设置准确的视网膜坐标到生物标志物发现。代码和训练权重可在：https://github.com/JusticeZzy/FunduSegmenter上获得。

英文摘要

Purpose: This study introduces the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with Convolutional Block Attention Module and a Vision Transformer block adapter. The model is evaluated on a proprietary dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which outperformed all baselines, some substantially (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and our model was also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter generally outperformed state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting the accurate retinal coordinate to biomarker discovery. The code and trained weights are available at: https://github.com/JusticeZzy/FunduSegmenter.
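
The results above are reported as Dice similarity coefficients. For readers unfamiliar with the metric, here is a minimal sketch of how Dice is computed for a pair of binary masks. The toy masks and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not details from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks (nested lists of 0/1)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 2x3 masks: 3 foreground pixels each, 2 of them overlapping.
pred_mask = [[0, 1, 1],
             [0, 1, 0]]
true_mask = [[0, 1, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, true_mask)
print(score)
```

For joint optic disc and optic cup segmentation, a score like this would typically be computed per structure and then averaged.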

2508.06206 2026-05-21 cs.RO cs.CV 版本更新

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Affordance-R1: 为多模态大语言模型中的通用化 affordance 推理设计的强化学习

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

发表机构 * The Hong Kong University of Science and Technology (GZ)（香港科技大学（广州））； National University of Singapore（新加坡国立大学）； ShanghaiTech University（上海科技大学）； East China Normal University（华东师范大学）； Nanjing University of Information Science & Technology（南京信息工程大学）； Zhejiang University（浙江大学）； Institute of Automation, Chinese Academy of Science（中国科学院自动化研究所）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结本文提出 Affordance-R1，一种将认知 CoT 引导的 Group Relative Policy Optimization (GRPO) 融入其中的统一 affordance 定位（grounding）框架，通过强化学习实现零样本泛化和测试时推理能力。

详情

AI中文摘要

Affordance grounding 旨在预测物体上与机器人待执行动作相关的特定区域。它在人机交互、人-物交互、具身操作和具身感知领域中起着至关重要的作用。现有模型由于缺乏思维链（CoT）推理能力，往往忽视不同物体间共享的 affordance，限制了其域外（OOD）泛化和显式推理能力。为了解决这些挑战，我们提出了 Affordance-R1，这是首个在强化学习范式中集成认知 CoT 引导的 Group Relative Policy Optimization（GRPO）的统一 affordance 定位（grounding）框架。具体而言，我们设计了一个精细的 affordance 函数，包含格式、感知和认知奖励，以有效引导优化方向。此外，我们构建了一个高质量的以 affordance 为中心的推理数据集 ReasonAff，以支持训练。仅通过基于 GRPO 的强化学习进行训练，而不使用显式推理数据，Affordance-R1 实现了稳健的零样本泛化，并表现出涌现的测试时推理能力。全面的实验表明，我们的模型优于成熟的方法，并展示了开放世界泛化能力。据我们所知，Affordance-R1 是首个将基于 GRPO 的 RL 与推理结合到 affordance 推理中的方法。我们方法的代码和数据集已发布在 https://github.com/hq-King/Affordance-R1。

英文摘要

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we design a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we construct a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset are released at https://github.com/hq-King/Affordance-R1.

2508.03578 2026-05-21 cs.CV 版本更新

RadProPoser: Probabilistic Radar Tensor Human Pose Estimation That Knows Its Limits

RadProPoser: 知晓自身局限的概率雷达张量人体姿态估计

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier

发表机构 * Munich Center for Machine Learning, Germany（慕尼黑机器学习中心，德国）

AI总结本文提出RadProPoser，一种端到端的概率框架，从原始雷达张量数据预测三维身体关节及逐关节不确定性；该方法在新的基准数据集上实现了6.425厘米的平均关节位置误差，并通过保序再校准（isotonic recalibration）得到校准后的总不确定性。

Comments Accepted at IJCNN 2026 (WCCI, Maastricht)

详情

AI中文摘要

基于雷达的人体姿态估计使环境智能中的隐私保护运动跟踪成为可能，但雷达传感的噪声特性使得不确定性量化至关重要。我们提出了RadProPoser，一种端到端的概率框架，能够从原始雷达张量数据中预测三维身体关节及逐关节不确定性。我们使用带有频谱注意力的变分编码器-解码器，跨时间帧融合雷达的实部和虚部分量，并通过可学习的高斯分布和拉普拉斯分布对偶然（aleatoric）不确定性进行建模。在带有光学动作捕捉真值的新基准数据集上训练后，我们的方法实现了6.425厘米的平均关节位置误差（MPJPE）。模型输出逐关节的偶然不确定性，经保序再校准（isotonic recalibration）后得到校准的总不确定性，预期校准误差为0.027。由于频谱注意力作用于单个雷达张量分量，扩展到多雷达配置只需拼接额外的输入流。在采用双正交雷达的HuPR基准上，该方法实现了5.042厘米的MPJPE。该框架在NVIDIA RTX 3090上以每秒89帧（FPS）的速度运行，超过了15 Hz的雷达帧率。

英文摘要

Radar-based human pose estimation enables privacy-preserving motion tracking for ambient intelligence, yet the noisy nature of radar sensing makes uncertainty quantification essential. We present RadProPoser, an end-to-end probabilistic framework that predicts three-dimensional body joints with per-joint uncertainties from raw radar tensor data. Using a variational encoder-decoder with spectral attention that fuses real and imaginary radar components across temporal frames, we model aleatoric uncertainty through learnable Gaussian and Laplace distributions. Trained on a new benchmark dataset with optical motion-capture ground truth, our method achieves 6.425 cm mean per-joint position error. The model outputs per-joint aleatoric uncertainties, and isotonic recalibration yields calibrated total uncertainty with expected calibration error of 0.027. Since spectral attention operates on individual radar tensor components, extending to multi-radar configurations requires only concatenating additional input streams. On the HuPR benchmark with dual orthogonal radars, this achieves 5.042 cm MPJPE. The framework runs at 89 frames per second (FPS) on an NVIDIA RTX 3090, exceeding the 15 Hz radar frame rate.
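
The headline numbers above are mean per-joint position errors (MPJPE). The sketch below shows how that metric is computed for a single pose, using made-up joint coordinates; it does not model the per-joint uncertainty or the calibration step.

```python
import math


def mpjpe(pred_joints, true_joints):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if not pred_joints or len(pred_joints) != len(true_joints):
        raise ValueError("need two non-empty joint lists of equal length")
    distances = [math.dist(p, t) for p, t in zip(pred_joints, true_joints)]
    return sum(distances) / len(distances)


# Two hypothetical 3-D joints (units arbitrary): errors of 3 and 4 units.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
error = mpjpe(predicted, ground_truth)
print(error)
```

In practice the per-pose value would be averaged over all frames in the test set, in the same unit as the ground-truth coordinates (centimetres in the abstract).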

2507.23313 2026-05-21 cs.CV 版本更新

The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

伦勃朗的牛 - 分析文本到图像模型中艺术提示的解释

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

发表机构 * Department of Computer Science, Università degli Studi di Milano, Via Celoria, 18, 20133 Milan, Italy（米兰大学计算机科学系）

AI总结本文研究了文本到图像扩散模型在生成艺术作品时如何解释内容和风格的概念，通过交叉注意力热图分析生成图像中像素与特定提示词的关联，揭示了不同艺术提示和风格下内容与风格分离的程度，为理解大规模生成模型内部如何表示复杂艺术概念提供了新见解。

Comments to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025

详情

DOI: 10.1109/MLSP62443.2025.11204333

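Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition

视觉-语言模型是否准备好进行饮食评估？探索AI驱动的食品图像识别的下一个前沿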

Sergio Romero-Tapiador, Ruben Tolosana, Blanca Lacruz-Pleguezuelos, Laura Judith Marcos Zambrano, Guadalupe X. Bazán, Isabel Espinosa-Salinas, Julian Fierrez, Javier Ortega-Garcia, Enrique Carrillo de Santa Pau, Aythami Morales

发表机构 * Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid（生物度量与数据模式分析实验室，马德里自治大学）； IMDEA Food, CEI UAM+CSIC（IMDEA食品，CEI UAM+CSIC）

AI总结本文评估了六种先进的视觉-语言模型在不同层次上的食品识别能力，提出了一个新的评估指标，并展示了FoodNExTDB数据库在饮食评估中的应用潜力。

Comments Accepted at IEEE/CVF Computer Vision and Pattern Recognition Conference workshops 2025 (CVPRw) 10 pages, 4 figures, 2 tables

详情

DOI: 10.1109/CVPRW67362.2025.00047
Journal ref: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1-10

AI中文摘要

基于食品图像的自动饮食评估仍是一个挑战，需要精确的食品检测、分割和分类。视觉-语言模型（VLMs）通过整合视觉和文本推理提供了新的可能性。在本研究中，我们评估了六种最先进的VLMs（ChatGPT、Gemini、Claude、Moondream、DeepSeek和LLaVA），分析它们在不同层次上的食品识别能力。在实验框架中，我们引入了FoodNExTDB，一个独特的食品图像数据库，包含9,263张由专家标注的图像，涵盖10个类别（例如“蛋白质来源”）、62个子类别（例如“家禽”）和9种烹饪风格（例如“烤制”）。总共，FoodNExTDB包括50,000个由七位专家生成的营养标签，这些标签由手动标注所有数据库中的图像生成。此外，我们提出了一种新的评估指标，专家加权召回率（EWR），该指标考虑了不同标注者之间的差异。结果表明，封闭源模型在识别包含单一产品的图像中的食品产品时，性能优于开源模型，达到了超过90%的EWR。尽管有潜力，当前VLMs在细粒度食品识别方面面临挑战，特别是在区分烹饪风格的细微差异和视觉相似的食品项目时，这限制了它们在自动饮食评估中的可靠性。FoodNExTDB数据库在https://github.com/AI4Food/FoodNExtDB上公开可用。

英文摘要

Automatic dietary assessment based on food images remains a challenge, requiring precise food detection, segmentation, and classification. Vision-Language Models (VLMs) offer new possibilities by integrating visual and textual reasoning. In this study, we evaluate six state-of-the-art VLMs (ChatGPT, Gemini, Claude, Moondream, DeepSeek, and LLaVA), analyzing their capabilities in food recognition at different levels. For the experimental framework, we introduce the FoodNExTDB, a unique food image database that contains 9,263 expert-labeled images across 10 categories (e.g., "protein source"), 62 subcategories (e.g., "poultry"), and 9 cooking styles (e.g., "grilled"). In total, FoodNExTDB includes 50k nutritional labels generated by seven experts who manually annotated all images in the database. Also, we propose a novel evaluation metric, Expert-Weighted Recall (EWR), that accounts for the inter-annotator variability. Results show that closed-source models outperform open-source ones, achieving over 90% EWR in recognizing food products in images containing a single product. Despite their potential, current VLMs face challenges in fine-grained food recognition, particularly in distinguishing subtle differences in cooking styles and visually similar food items, which limits their reliability for automatic dietary assessment. The FoodNExTDB database is publicly available at https://github.com/AI4Food/FoodNExtDB.

2501.15151 2026-05-21 cs.CV 版本更新

SpikeDet: Better Firing Patterns for Accurate and Energy-Efficient Object Detection with Spiking Neural Networks

SpikeDet: 以更优的发放模式实现准确且节能的脉冲神经网络目标检测

Yimeng Fan, Changsong Liu, Mingyang Li, Dongze Liu, Yuting Su, Yanyan Liu, Wei Zhang

发表机构 * School of Microelectronics（微电子学院）； School of Electrical and Information Engineering（电气与信息工程学院）； Optoelectronic Thin Film Device and Technology Research Institute（光电薄膜器件与技术研究所）

AI总结本文提出SpikeDet，一种新型的脉冲神经网络目标检测器，通过优化firing模式实现更准确且节能的目标检测。具体来说，设计了MDSNet脉冲骨干网络，有效调整每个层的膜电位突触输入分布，实现更优的脉冲特征提取；引入Spiking Multi-direction Fusion Module (SMFM)实现多方向融合，增强多尺度检测能力；提出Local Firing Saturation Index (LFSI)定量衡量局部firing饱和度。实验结果验证了方法的有效性，在COCO 2017数据集上达到52.2% AP，比现有SNN方法提升3.3% AP，能耗仅为一半。

详情

AI中文摘要

脉冲神经网络（SNNs）是神经网络的第三代。由于其低能耗和生物可解释性，SNNs在目标检测中获得了广泛关注。然而，现有的基于SNN的目标检测方法受到局部firing饱和的影响，相邻神经元同时达到最大firing率，尤其是在以对象为中心的区域。这种异常的神经元firing模式降低了特征辨别能力和检测准确性，同时增加了firing率，阻碍了SNNs实现其潜在的能源效率。为了解决这个问题，我们提出了SpikeDet，一种新颖的脉冲目标检测器，通过优化firing模式实现更准确且节能的检测。具体来说，我们设计了MDSNet脉冲骨干网络，该网络在每一层有效调整膜电位突触输入分布，从而在脉冲特征提取过程中实现更好的神经元firing模式。对于颈部部分，为了更好地利用和保留这些高质量的骨干特征，我们引入了Spiking Multi-direction Fusion Module (SMFM)，实现了脉冲特征的多方向融合，增强了模型的多尺度检测能力。此外，我们提出了Local Firing Saturation Index (LFSI)，以定量衡量局部firing饱和度。实验结果验证了我们方法的有效性。在COCO 2017数据集上，它达到了52.2%的AP，比先前的SNN方法提高了3.3%的AP，同时仅需一半的能耗。在目标检测子任务中，包括基于事件的GEN1、水下URPC 2019、低光ExDARK和密集场景CrowdHuman数据集上，SpikeDet也取得了最佳性能。

英文摘要

Spiking Neural Networks (SNNs) are the third generation of neural networks. They have gained widespread attention in object detection due to their low energy consumption and biological interpretability. However, existing SNN-based object detection methods suffer from local firing saturation, where adjacent neurons concurrently reach maximum firing rates, especially in object-centric regions. This abnormal neuron firing pattern reduces the feature discrimination capability and detection accuracy, while also increasing the firing rates that prevent SNNs from achieving their potential energy efficiency. To address this problem, we propose SpikeDet, a novel spiking object detector that optimizes firing patterns for accurate and energy-efficient detection. Specifically, we design a spiking backbone network, MDSNet, which effectively adjusts the membrane synaptic input distribution at each layer, achieving better neuron firing patterns during spiking feature extraction. For the neck, to better utilize and preserve these high-quality backbone features, we introduce the Spiking Multi-direction Fusion Module (SMFM), which realizes multi-direction fusion of spiking features, enhancing the multi-scale detection capability of the model. Furthermore, we propose the Local Firing Saturation Index (LFSI) to quantitatively measure local firing saturation. Experimental results validate the effectiveness of our method. On the COCO 2017 dataset, it achieves 52.2% AP, outperforming previous SNN-based methods by 3.3% AP while requiring only half the energy consumption. On object detection sub-tasks, including event-based GEN1, underwater URPC 2019, low-light ExDARK, and dense scene CrowdHuman datasets, SpikeDet also achieves the best performance.

2412.01944 2026-05-21 cs.CV eess.IV 版本更新

A Comparative Study of Transformer and Convolutional Models for Crop Segmentation from Satellite Image Time Series

变换器与卷积模型在卫星图像时间序列作物分割中的比较研究

Mattia Gatti, Ignazio Gallo, Nicola Landro, Christian Loschiavo, Anwar Ur Rehman, Mirco Boschetti, Riccardo La Grassa

发表机构 * University of Insubria（因苏布里亚大学）； IREA CNR（意大利国家研究委员会IREA分部）； INAF-Astronomical Observatory（意大利国家天体物理研究所天文台）

AI总结本文比较了变换器和卷积模型在从卫星图像时间序列中进行作物分割中的应用，发现TSViT在整体表现上最佳，而VistaFormer在效率与性能之间提供了良好的权衡。

Comments This version corrects an error in the evaluation pipeline affecting previously reported metrics. Results have been recomputed, leading to updated values and a revised conclusion: the adapted Swin UNETR model does not outperform CNN baselines. Tables, figures, and comparisons have been updated, and the analysis has been extended to include additional transformer-based models

DOI: 10.1117/12.3120038

AI中文摘要

从卫星图像时间序列（SITS）中进行作物分割是农业监测和土地利用分析中的基本任务。尽管卷积神经网络（CNNs）已被广泛应用，但基于变换器的架构提供了另一种机制，用于在多光谱数据中表示空间和时间依赖性。本文提出了对CNN和基于变换器的分割模型的比较研究，用于Sentinel-2时间序列的作物制图，包括3D U-Net、3D FPN、3D DeepLabv3以及三种变换器架构：Swin UNETR、TSViT和VistaFormer，它们采用不同的策略来捕捉时间依赖性。在Munich和Lombardia数据集上的实验表明，TSViT在整体表现上最佳，略微优于3D U-Net，后者仍然是一个强大的CNN基线。VistaFormer提供了最佳的效率，而Swin UNETR表现竞争，但不如那些显式建模时间动态的变换器。这些结果突显了时间建模在SITS中的重要性：TSViT优于CNNs和将时间视为额外空间维度的方法，而VistaFormer提供了良好的效率-性能权衡。

英文摘要

Crop segmentation from satellite image time series (SITS) is a fundamental task for agricultural monitoring and land-use analysis. While convolutional neural networks (CNNs) have been widely used, transformer-based architectures offer alternative mechanisms for representing spatial and temporal dependencies in multispectral data. This paper presents a comparative study of CNN and transformer-based segmentation models for crop mapping from Sentinel-2 time series, including 3D U-Net, 3D FPN, 3D DeepLabv3, and three transformer architectures: Swin UNETR, TSViT, and VistaFormer, which adopt different strategies for capturing temporal dependencies. Experiments on the Munich and Lombardia datasets show that TSViT achieves the best overall results, slightly surpassing 3D U-Net, which remains a strong CNN baseline. VistaFormer offers the best efficiency, while Swin UNETR performs competitively but is less effective than transformers that explicitly model temporal dynamics. These results highlight that temporal modelling is critical for SITS: TSViT outperforms CNNs and approaches that treat time as an additional spatial dimension, while VistaFormer provides a strong efficiency-performance trade-off.

2406.14978 2026-05-21 cs.CV 版本更新

E2GS: Event Enhanced Gaussian Splatting

E2GS：事件增强的高斯泼溅

Hiroyuki Deguchi, Mana Masuda, Takuya Nakabayashi, Hideo Saito

发表机构 * Keio University（庆应大学）

AI总结本文提出E2GS方法，将事件数据与高斯泼溅相结合，提升图像去模糊和高质量新视角合成效果，实验表明其在合成和真实数据集上均能生成视觉上吸引人的渲染结果，且训练和渲染速度更快（140 FPS）。

Comments 7pages, Accepted at ICIP 2024

AI中文摘要

事件相机以高动态范围、无运动模糊和低能耗著称，并因这些特性在近期得到了广泛应用。在过去的几年中，基于事件的3D重建领域取得了显著进展，其中基于神经辐射场（NeRF）的方法展示了逼真的视角合成结果。然而，NeRF的体渲染范式需要大量的训练和渲染时间。在本文中，我们介绍了事件增强的高斯泼溅（E2GS），这是一种将事件数据融入高斯泼溅的新方法，而高斯泼溅最近在新视角合成领域取得了显著进展。我们的E2GS有效利用了模糊图像和事件数据，显著改善了图像去模糊效果，并产生了高质量的新视角合成。我们在合成和真实世界数据集上的全面实验表明，我们的E2GS能够生成视觉上吸引人的渲染结果，同时提供更快的训练和渲染速度（140 FPS）。我们的代码可在https://github.com/deguchihiroyuki/E2GS上获得。

英文摘要

Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at https://github.com/deguchihiroyuki/E2GS.

2205.13524 2026-05-21 cs.CV cs.GR 版本更新

PREF: Phasorial Embedding Fields for Compact Neural Representations

PREF: 用于紧凑神经表示的相量嵌入场

Binbin Huang, Xinhao Yan, Anpei Chen, Shenghua Gao, Jingyi Yu

发表机构 * ShanghaiTech University（上海科技大学）； ETH Zürich（苏黎世联邦理工学院）

AI总结本文提出了一种高效的基于频率的神经表示PREF，通过引入覆盖更宽频谱的相量体（phasor volume），并结合快速傅里叶变换和局部插值来加速傅里叶映射，从而减少频率表示中开销较大的MLP，提升效率和可解释性。

AI中文摘要

我们提出了一种高效的基于频率的神经表示，称为PREF：一种由相量体（phasor volume）增强的浅层MLP，能够覆盖比之前的傅里叶特征映射或位置编码显著更宽的频谱。其核心是我们的紧凑3D相量体，其中频率在2D平面上均匀分布并沿1D轴扩张。为此，我们开发了一种专门且高效的傅里叶变换，结合快速傅里叶变换和局部插值以加速朴素傅里叶映射。我们还引入了Parseval正则化器以稳定基于频率的学习。通过这些方法，我们的PREF减少了基于频率的表示中开销较大的MLP，从而显著缩小了其与其他混合表示之间的效率差距，并提高了其可解释性。全面的实验表明，我们的PREF能够捕捉高频细节，同时保持紧凑和鲁棒，实验包括2D图像泛化、3D有符号距离函数回归和5D神经辐射场重建。

英文摘要

We present an efficient frequency-based neural representation termed PREF: a shallow MLP augmented with a phasor volume that covers significantly broader spectra than previous Fourier feature mapping or Positional Encoding. At the core is our compact 3D phasor volume where frequencies distribute uniformly along a 2D plane and dilate along a 1D axis. To this end, we develop a tailored and efficient Fourier transform that combines both Fast Fourier transform and local interpolation to accelerate naïve Fourier mapping. We also introduce a Parseval regularizer that stabilizes frequency-based learning. In these ways, our PREF reduces the costly MLP in the frequency-based representation, thereby significantly closing the efficiency gap between it and other hybrid representations, and improving its interpretability. Comprehensive experiments demonstrate that our PREF is able to capture high-frequency details while remaining compact and robust, including 2D image generalization, 3D signed distance function regression and 5D neural radiance field reconstruction.

2605.20538 2026-05-21 cs.CV 版本更新

用于基于模型的图像去噪中超参数预测的Oracle监督转移

Jianmin Liao, Lixin Shen, Yuesheng Xu

发表机构 * Department of Mathematics, Syracuse University（雪城大学数学系）； Department of Mathematics & Statistics, Old Dominion University（欧道明大学数学与统计系）

AI总结该研究提出HyperDn，一种单配置条件预测器，通过聚合源配置的Oracle监督，预测新的去噪器-噪声配置的异质超参数，展示了在跨范式实验中，从相对便宜的TV/TGV变分源转移到更昂贵的扩散模型DiffPIR时，通过少量或无目标Oracle标签实现接近Oracle性能的成果。

AI中文摘要

超参数预测是基于模型的图像去噪器中的关键实际瓶颈，从经典的TV/TGV变分求解器到现代的基于扩散的模型如DiffPIR。尽管现有的学习型预测器可以实现接近Oracle的性能，但这种方法扩展性差：每个新的配置通常需要其自身的Oracle标注训练集，且每个标签都需要对照干净的真值（ground truth）进行分层网格搜索来获得。因此，我们提出这样的问题：在源配置上收集的Oracle监督能否在很少或没有目标Oracle标签的情况下转移到目标配置。我们提出了HyperDn，一种以配置为条件的单一预测器，它汇聚各源配置的Oracle监督，并为新的去噪器-噪声配置预测异质超参数。在跨范式实验中，HyperDn从相对便宜的TV/TGV变分源转移到更昂贵的基于扩散的DiffPIR。仅使用2个目标Oracle标签，它达到了30.23 dB，与Oracle相差不超过0.90 dB，并且优于从头训练、每个配置使用64个标签的预测器，而所用目标标签数仅为该基线的1/32。在没有任何目标Oracle标签的情况下，HyperDn在两种由已见噪声类型组成的未见混合噪声上，以及在从相对便宜的96×96源图像转移到512×768目标图像时，也达到了接近Oracle的PSNR。这些结果表明，超参数预测的昂贵Oracle监督可以从源配置转移到新的目标配置，从而减少为每个新的去噪配置重建Oracle标签的需求。

英文摘要

Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

2605.20476 2026-05-21 cs.CV 版本更新

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

告别漂移：用于长时视频到视频生成的锚定树采样

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

发表机构 * Descript, Inc.（Descript公司）

AI总结本文提出了一种名为锚定树采样的方法，通过减少关键路径步骤来解决长时视频生成中的漂移问题，并在静态相机模式下实现了稳定且高质量的视频生成。

Comments 30 pages, 23 figures

AI中文摘要

长时视频生成面临两个交织的问题。首先，漂移问题，即视频质量随时间下降。其次，连续性问题，表现为物体永久性问题或不当渲染瞬态内容（例如，出现在非连续帧中的物体颜色/风格变化）。最近的工作集中在自回归蒸馏技术上，旨在同时解决这两个问题。我们选择专注于漂移问题，并引入锚定树采样（ATS）：一种无训练的推理时间调度器，用稀疏到密集、锚定范围内的填补方法替代从左到右的滚动。根调用在全时间范围内生成稀疏锚点，递归细化生成中间锚点，最终叶跨度在相邻锚点之间合成。这将关键路径从K个连续滚动步骤减少到L+1个树状步骤，并将时间累积漂移转换为锚定范围内的漂移。我们专注于静态相机模式下的V2V生成，其中稀疏锚点在时间范围内可由密集条件信号近似，且基础模型可在不重新训练的情况下生成它们。我们在Wan 2.1 + VACE上评估了ATS，针对五种条件模式（修复、扩展、边缘、姿态、深度）。我们证明ATS在整体质量和漂移防止方面均优于两个竞争对手。此外，我们还展示了在LTX-2.3上稳定生成至少40分钟的视频。最后，我们提出了一条路径，将ATS扩展到任意长的T2V生成，以及动态相机和多镜头模式。

英文摘要

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable ≥40-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
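
The sparse-to-dense schedule described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Call` record, the `branching` factor, the `leaf_span` threshold and the even spacing of anchors are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows why the number of sequential stages grows with tree depth rather than with the number of leaf spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One hypothetical generation call bounded by two anchor frames."""
    level: int  # depth in the tree, 0 = root
    start: int  # index of the left anchor frame
    end: int    # index of the right anchor frame


def anchored_tree_schedule(num_frames, branching, leaf_span):
    """Return tree levels; calls within one level do not depend on each other.

    Assumption: every span is split at evenly spaced anchors until it is no
    longer than `leaf_span` frames, at which point it would be synthesized densely.
    """
    if num_frames < 2 or branching < 2 or leaf_span < 1:
        raise ValueError("need num_frames >= 2, branching >= 2, leaf_span >= 1")
    levels = []
    spans = [(0, num_frames - 1)]  # the root call covers the full horizon
    while True:
        depth = len(levels)
        levels.append([Call(depth, s, e) for s, e in spans])
        # Stop refining once every span is short enough to be a leaf.
        if all(e - s <= leaf_span for s, e in spans):
            return levels
        refined = []
        for s, e in spans:
            step = (e - s) / branching
            cuts = [s + round(i * step) for i in range(branching + 1)]
            # Neighboring anchors bound the child spans; skip empty ones.
            refined.extend((a, b) for a, b in zip(cuts, cuts[1:]) if b > a)
        spans = refined


levels = anchored_tree_schedule(num_frames=257, branching=4, leaf_span=16)
critical_path = len(levels)       # sequential stages in the tree schedule
leaf_calls = len(levels[-1])      # spans a left-to-right rollout would visit in order
```

Under these assumed settings the tree has 3 levels, while a left-to-right rollout over the same 16 leaf spans would need 16 sequential steps.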

2605.20470 2026-05-21 cs.CV cs.AI physics.med-ph 版本更新

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

EPC-3D-Diff: 用于CBCT到CT合成的等变物理一致条件3D潜在扩散模型

Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

发表机构 * School of Science and Engineering, University of Dundee UK（邓迪大学科学与工程学院）； Faculty of Applied Sciences, Jordan University of Science and Technology（约旦科学技术大学应用科学学院）； Experia Healthcare, Jordan（约旦Experia医疗）； School of Cardiovascular and Metabolic Health, University of Glasgow UK（格拉斯哥大学心血管与代谢健康学院）

AI总结本文提出EPC-3D-Diff，一种新的条件3D潜在扩散框架，用于体积CBCT到CT合成，通过引入从成像物理导出的投影域等变损失，提高了物理一致性。该方法在训练过程中对旋转后的合成CT体积进行正向投影，并将其与相应角度偏移的投影进行匹配，从而在扩散目标中集成物理一致的等变约束。

Comments 10 pages, 4 figures

AI中文摘要

锥束CT（CBCT）在放疗中常规采集用于患者摆位，但其定量可靠性因散射、噪声和重建伪影而下降，限制了Hounsfield单位（HU）的准确性。我们提出了EPC-3D-Diff，一种新的条件3D潜在扩散框架，用于体积CBCT到CT合成，并引入了从成像物理导出的投影域等变损失。与常见的图像域等变性不同，我们利用体积的平面内旋转对应于其投影的角度偏移这一事实。在训练过程中，我们对旋转后的合成CT体积进行正向投影，并将其与配对目标CT经相应角度偏移后的投影进行匹配，从而得到集成到扩散目标中的物理一致等变约束。为了高效捕捉完整的3D上下文，条件扩散在由轻量3D自动编码器学习的紧凑潜在空间中进行，在保持轴向深度的同时对平面内分辨率进行下采样以实现稳定训练。我们在包含重复扫描的配对头部CBCT/CT体模数据集以及按患者划分的配对临床数据上进行了验证，并进行了单域和混合域训练、消融实验以及与扩散模型和CycleGAN的比较。EPC-3D-Diff具有良好的泛化能力，与最先进的方法相比，PSNR在体模和临床数据上分别提高了+7.4 dB和+1.8 dB，同时在组织边界内的SSIM和HU准确性也有所提升。总体而言，EPC-3D-Diff提高了鲁棒性和物理一致性，支持面向下游放疗工作流程的HU感知合成。

英文摘要

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT-to-CT synthesis that introduces a projection-domain equivariance loss derived from acquisition physics. Unlike common image-domain equivariance, we exploit the fact that an in-plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward-projecting rotated synthesized CT volumes and matching them to appropriately angle-shifted projections of the paired target CT, yielding a physics-consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in-plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and on paired clinical data using patient-wise splits, and perform single- and mixed-domain training, ablations, and comparisons with diffusion and CycleGAN baselines. EPC-3D-Diff generalizes well and achieves substantial PSNR improvements over state-of-the-art methods, +7.4 dB (phantom) and +1.8 dB (clinical data), alongside improved SSIM and HU accuracy within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU-aware synthesis for downstream radiotherapy workflows.
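
The geometric fact this abstract relies on, that rotating the volume in-plane is equivalent to shifting the projection angle, can be checked on a toy example. The sketch below is a deliberately simplified assumption: a 2D slice, parallel-beam sums instead of cone-beam projection, and a single 90-degree rotation. The function names are hypothetical and this is not the paper's loss, only an instance of the relationship such a loss would penalize violations of.

```python
def rot90_ccw(img):
    """Rotate a square 2D grid by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[j][n - 1 - i] for j in range(n)] for i in range(n)]


def project_view_a(img):
    """Toy parallel-beam projection for view A: sum each column."""
    return [sum(row[j] for row in img) for j in range(len(img[0]))]


def project_view_b(img):
    """Toy projection for view B, 90 degrees away from view A: sum each row."""
    return [sum(row) for row in img]


def equivariance_residual(img):
    """Squared mismatch between 'rotate then project' and 'project at the shifted angle'."""
    rotated_then_projected = project_view_a(rot90_ccw(img))
    angle_shifted_projection = project_view_b(img)
    return sum((a - b) ** 2
               for a, b in zip(rotated_then_projected, angle_shifted_projection))


slice_2d = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]
residual = equivariance_residual(slice_2d)  # zero for an exact forward model
```

In a training setting, a residual of this kind would be computed between projections of the synthesized volume and of the target CT rather than within a single image.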

2605.20469 2026-05-21 cs.CV 版本更新

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

HalluCXR: 评估和缓解医疗视觉-语言模型在胸部X光解读中的幻觉

Haoyu Wang, Zitong Li

发表机构 * Department of Biostatistics & Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London（生物统计学与健康信息学系，精神病学、心理学与神经科学研究所，伦敦国王学院）

AI总结本文提出HalluCXR基准，评估六种不同架构的视觉-语言模型在856例分层抽样的MIMIC-CXR胸部X光图像上的表现，发现61.9%-82.3%的输出存在幻觉，其中最多80.2%存在临床危险错误；通过引入幻觉分类体系、检测管道和模型集成方法，提出了缓解幻觉的策略。

AI中文摘要

视觉-语言模型（VLMs）在医学影像解读中的使用日益增多，但它们经常产生幻觉，即生成在临床上看似合理但事实错误的发现，这直接对患者安全构成风险。我们介绍了HalluCXR，一个基准，评估了六个架构各异的VLMs在856例分层抽样的MIMIC-CXR胸部X光图像和三种查询类型上的表现，共产生15,408次模型评估。我们提出了带有临床严重程度评分的八类幻觉分类体系和一个双层检测管道，并用250个人工标注进行了验证（自动检测F1=0.959；LLM评判F1=0.907）。我们发现61.9%-82.3%的输出包含幻觉，其中最多80.2%存在临床危险错误。三种关键模式显现：正常X光图像反而引发最严重的幻觉；常见发现被系统性地过度捏造，而罕见发现则检出不足；且响应长度本身即可预测幻觉风险（AUC最高达0.908）。六模型集成最多可减少84.8%的捏造，但代价是遗漏增加；三模型子集以一半的成本保持了相当的性能。这些结果表明，幻觉审计、基于输出冗长度（verbosity）的风险监控和基于集成的安全层是临床部署的先决条件。

英文摘要

Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.
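
As a quick consistency check on the reported evaluation count, assuming every model answers every query type on every radiograph (which the abstract implies but does not state outright):

```latex
\[
6 \ \text{models} \times 856 \ \text{radiographs} \times 3 \ \text{query types}
  = 15{,}408 \ \text{model evaluations}
\]
```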

2605.20461 2026-05-21 cs.CV 版本更新

Understanding Model Behavior in Monocular Polyp Sizing

理解单目肠镜下息肉大小的模型行为

Xinqi Xiong, Andrea Dunn Beltran, Junmyeong Choi, Sarah K. McGill, Marc Niethammer, Roni Sengupta

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； University of California, San Diego（加州大学圣地亚哥分校）

AI总结本文通过多中心数据集和多种模型对二元息肉大小分类（≤5 mm vs. >5 mm）进行诊断审核，发现模型性能在不同架构和输入模态下较为一致，表明其依赖于与检查行为相关的线索而非真实度量尺度，并展示了完美尺度信息的潜在改进以及当前深度估计和全局校准的有限增益。

AI中文摘要

准确的息肉大小分层指导监视决策，通常大于5 mm的病变需要更密切的随访。然而，单目结肠镜缺乏可靠的参考度量标准。我们对多个公共多中心数据集、模型家族和患者分层交叉验证中的二元息肉大小分类（≤5 mm vs. >5 mm）进行了诊断审核。在不同架构和输入模态（包括RGB外观、相对深度和照度）下，模型性能相对一致，表明其依赖于与检查行为相关的线索而非真实度量尺度。通过提供不同粒度的地面真实尺度，我们量化了完美尺度信息的潜在改进，并显示当前深度估计和全局校准提供的增益有限。我们进一步证明，在分布偏移下分割错误消除了大部分潜在增益，具有预测掩码的oracle尺度仅恢复基线性能。这些结果突显了度量尺度和掩码鲁棒性作为两个独立的瓶颈，并提供了可重用的评估工具，如oracle尺度梯子、快捷分组和掩码替换，用于审核未来的息肉大小管道。我们的代码在https://github.com/anaxqx/polyp-sizing-audit上公开可用。

英文摘要

Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.

2605.20459 2026-05-21 cs.CV cs.AI 版本更新

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

基于像素的新冠CT影像病变预测：自动图像分割架构的比较分析

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

发表机构 * Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan； School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI总结本文通过比较四种深度学习架构与六种预训练编码器，评估了在新冠CT影像中预测病变的性能，发现深度学习在分割任务中具有高精度和效率，其中二分类分割达到98%的F1分数，多分类分割在不同数据集上分别达到75%和77%的F1分数。

Comments 7 pages, 6 figures, 4 tables

AI中文摘要

近年来，深度学习算法在医学图像分割领域受到了越来越多的关注。然而，由于缺乏标准化的性能分析方法和先前研究中使用不同数据集，该领域的可靠性受到阻碍。本研究的主要目的是全面评估当前的分割框架与最先进的预训练骨干网络，以准确预测CT影像中的新冠病变。此外，这种评估可以作为其他成像场景图像分割的参考点。为了实现这一目标，我们整合了四个不同的深度学习架构，即Unet、PSPNet、Linknet和FPN，以及六个预训练编码器，包括VGG 19、DenseNet 121、Inception ResNet V2、MobileNet V2、SeresNet 101和EfficientNet B0。这种方法使能够开发出多样化的测试架构。在图像分割的背景下，我们的研究涵盖了二分类和多分类实验。通过分析三个不同的新冠CT分割数据集，我们的分析结果表明深度学习架构能够产生精确且高效的分割结果。显著的是，二分类分割的最高F1分数达到98%，而多分类分割在两个不同的数据集上分别达到了75%和77%的F1分数。人工智能和深度学习的使用在多个维度上增强了对流行病疾病诊断过程的帮助。

英文摘要

In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

2605.20458 2026-05-21 cs.CV 版本更新

ELEMENT: Multi-Modal Retinal Vessel Segmentation Based on a Coupled Region Growing and Machine Learning Approach

ELEMENT：基于耦合区域生长和机器学习方法的多模态视网膜血管分割

Erick O. Rodrigues, Aura Conci, Panos Liatsis

AI总结本文提出了一种基于耦合区域生长和机器学习方法的多模态视网膜血管分割框架ELEMENT，通过区域生长和机器学习提取特征并进行像素分类，提高了分割的准确性和效率，实验表明其在多个数据集上均优于现有方法。

DOI: 10.1109/JBHI.2020.2999257
Journal ref: IEEE Journal of Biomedical and Health Informatics 2020

AI中文摘要

视网膜血管结构包含重要的信息，用于检测和分析眼部疾病，包括年龄相关性黄斑变性、糖尿病视网膜病变和青光眼。常用的诊断模态包括视网膜摄影、扫描激光眼底镜（SLO）和荧光素血管造影（FA）。通常，视网膜血管分割是手动或交互式进行的，这使得过程耗时且容易出错。在本研究中，我们提出了一种新的多模态分割框架，称为ELEMENT（vEsseL sEgmentation using Machine lEarning and coNnecTivity）。该框架由区域生长和机器学习进行的特征提取和像素分类组成。所提出的特征基于灰度级和血管连通性属性捕获互补证据。后者信息在分类阶段无缝传播通过像素。ELEMENT减少了不一致性和加快了分割吞吐量。我们分析并比较了所提出方法与现有血管分割算法在三个主要实验组中的性能，针对每种眼部模态。我们的方法产生了更高的整体性能，整体准确率为97.40%，优于26种现有方法中的25种，包括6种基于深度学习的工作，评估在广泛知名的DRIVE视网膜图像数据集上。在STARE、CHASE-DB、VAMPIRE FA、IOSTAR SLO和RC-SLO数据集中，所提出的框架分别以98.27%、97.78%、98.34%、98.04%和98.35%的准确率超过了所有现有方法。

英文摘要

Vascular structures in the retina contain important information for the detection and analysis of ocular diseases, including age-related macular degeneration, diabetic retinopathy and glaucoma. Commonly used modalities in diagnosis of these diseases are fundus photography, scanning laser ophthalmoscope (SLO) and fluorescein angiography (FA). Typically, retinal vessel segmentation is carried out either manually or interactively, which makes it time consuming and prone to human errors. In this research, we propose a new multi-modal framework for vessel segmentation called ELEMENT (vEsseL sEgmentation using Machine lEarning and coNnecTivity). This framework consists of feature extraction and pixel-based classification using region growing and machine learning. The proposed features capture complementary evidence based on grey level and vessel connectivity properties. The latter information is seamlessly propagated through the pixels at the classification phase. ELEMENT reduces inconsistencies and speeds up the segmentation throughput. We analyze and compare the performance of the proposed approach against state-of-the-art vessel segmentation algorithms in three major groups of experiments, for each of the ocular modalities. Our method produced higher overall performance, with an overall accuracy of 97.40%, compared to 25 of the 26 state-of-the-art approaches, including six works based on deep learning, evaluated on the widely known DRIVE fundus image dataset. In the case of the STARE, CHASE-DB, VAMPIRE FA, IOSTAR SLO and RC-SLO datasets, the proposed framework outperformed all of the state-of-the-art methods with accuracies of 98.27%, 97.78%, 98.34%, 98.04% and 98.35%, respectively.

2605.20445 2026-05-21 cs.CV cs.AI 版本更新

A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery

对用于CT和X光影像中新冠分类的深度学习架构的全面比较

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

发表机构 * Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan； School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI总结本文通过比较多种深度学习架构，提出基于卷积神经网络的计算机辅助诊断系统，以区分新冠和正常肺部影像，并在X光和CT数据集上取得了95至98%的平均准确率。

Comments 6 pages, 2 figures, 5 tables

AI中文摘要

新冠是一种造成大量人员伤亡的重大挑战，不仅涉及某些国家，甚至全球也因冠状病毒而遭受影响。使用计算断层扫描（CT）和X光的肺部影像技术是新冠或其他大流行病筛查过程中最有效的工具。如今，技术已通过人工智能取代手动过程，用自动化机器使系统能够模仿人类大脑，通过经验做出明智决策。受此启发，我们的工作提出使用卷积神经网络（CNN）模型设计一个计算机辅助诊断（CAD）系统，以区分新冠和正常肺部影像。我们使用了两组不同的肺部X光影像和两组不同的CT扫描，并利用预训练的多种网络（如VGG（16, 19）、Densenet（121）、Resnet（50, 50 V2, 101 V2）、MobileNet（V2）、Xception Inception（V3, Resnet V2）、EfficientNet（B0）和Nasnet（Large））进行分类。在X光和CT图像数据集上，Resnet和VGG架构显示出能够正确区分新冠和正常图像的能力，平均准确率分别为95至98%。我们在分类数据集上的结果具有竞争力，并优于文献中已报告的发现。

英文摘要

COVID-19 was a significant challenge that led to the loss of numerous lives daily. The outbreak was not confined to any one country; the whole world suffered because of the coronavirus. Imaging techniques using computed tomography (CT) and X-rays of the lungs are the most useful tools for screening COVID-19 or any other pandemic disease. Technology today has revolutionized the world by using artificial intelligence to replace manual processes with automated machines, which enable the system to imitate the human brain by making wise decisions based on experience. Motivated by this, our work proposes to use convolutional neural network (CNN) based models for designing a computer-aided diagnosis (CAD) system that differentiates between COVID-19 and healthy lung images. We used two different sets of lung X-ray images and two different sets of CT scans, and classification is performed using a variety of pre-trained networks: VGG (16, 19), DenseNet (121), ResNet (50, 50 V2, 101 V2), MobileNet (V2), Xception, Inception (V3, ResNet V2), EfficientNet (B0), and NASNet (Large). On the X-ray and CT image datasets, Resnet and VGG architecture have shown the ability to properly differentiate COVID-19 from normal images, with an average accuracy of 95 to 98 percent respectively. Our acquired results on the classification datasets are competitive and superior to previously reported findings in the literature.

2605.20405 2026-05-21 eess.IV cs.AI cs.CV physics.med-ph 版本更新

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

在CT身体成分分割中解耦采样与训练预算

Iason Skylitsis, Dimitrios Karkalousos, Ivana Išgum

发表机构 * Amsterdam University Medical Center（阿姆斯特丹大学医学中心）； University of Amsterdam（阿姆斯特丹大学）； Informatics Institute, Faculty of Science（信息学院，科学学院）； Department of Radiology（放射科）； Mayo Clinic Rochester（罗切斯特梅奥诊所）

AI总结本文提出了一种基于少样本学习的episodic采样方法，用于解决医学图像分割中的类别不平衡问题，通过解耦采样与训练预算，提高了小数据集下的分割性能。

AI中文摘要

类别不平衡是医学图像分割中的基本挑战，其中频繁类通常在训练中占主导地位，而稀有类被忽视。基于损失的方法通过在批次内重新加权每个像素的损失来缓解不平衡，而采样策略控制哪些图像进入批次。然而，两者均未明确控制批次中出现的类别，导致稀有类的暴露仅部分平衡。在本文中，我们采用少样本学习中的episodic采样，以在完全监督设置中促进类别平衡的批次构造。我们解耦episodic采样与其传统的度量学习上下文，并在CT身体成分分割中评估其效果。我们在九种肌肉和脂肪组织上，从公共SAROS数据集中提取了210次扫描，将episodic采样与随机和加权采样进行比较。训练是在全数据和低数据模式下进行的，此外在匹配训练迭代预算下也进行了额外比较。在全数据训练中，三种策略表现相当（episodic的平均Dice为0.882，随机和加权为0.878）。在低数据训练中，episodic采样优于随机和加权（0.787 vs. 0.758和0.762），这由训练迭代数的12倍差异驱动。在匹配训练预算下，随机和加权过早过拟合，而episodic在达到平台前提高了约三倍的迭代次数。我们的发现识别了训练迭代预算作为采样策略中被低估的混淆因素，推动了小数据集的迭代感知评估协议。此外，episodic采样的残余优势与隐含的类别平衡批次的正则化效应一致，提供了一种低成本、模型无关的解决医学图像分割类别不平衡问题的策略。代码可在https://github.com/iasonsky/episodic-sampling上获得。

英文摘要

Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as an under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.
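
To make the idea of class-balanced batch construction concrete, here is a minimal sketch of episodic sampling in the N-classes-by-K-images style used in few-shot learning. The abstract does not describe how batches are actually built, so the function names, the `n_classes` and `k_per_class` parameters, and the fallback to sampling with replacement for very rare classes are all assumptions; the authors' repository is the reference for the real procedure.

```python
import random
from collections import defaultdict


def build_class_index(labels_per_image):
    """Map each class id to the images whose mask contains that class."""
    index = defaultdict(list)
    for image_id, classes in labels_per_image.items():
        for class_id in classes:
            index[class_id].append(image_id)
    return dict(index)


def sample_episode(class_index, n_classes, k_per_class, rng):
    """Draw one batch: pick classes first, then images that contain each class."""
    chosen = rng.sample(sorted(class_index), n_classes)
    batch = []
    for class_id in chosen:
        pool = class_index[class_id]
        if len(pool) >= k_per_class:
            picks = rng.sample(pool, k_per_class)   # without replacement
        else:
            picks = rng.choices(pool, k=k_per_class)  # rare class: allow repeats
        batch.extend((image_id, class_id) for image_id in picks)
    return batch


# Hypothetical toy dataset: image id -> set of tissue classes present in its mask.
labels = {"scan_a": {0, 1}, "scan_b": {0}, "scan_c": {0, 2}, "scan_d": {1}}
index = build_class_index(labels)
batch = sample_episode(index, n_classes=2, k_per_class=2, rng=random.Random(0))
```

Because classes are drawn before images, every selected class contributes the same number of slots to the batch, however rare it is in the dataset. Random or weighted image sampling does not guarantee this.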

2605.20390 2026-05-21 cs.CV cs.AI cs.LG cs.RO (updated version)

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

Yingwei Li, Xin Huang, Yang Liu, Yang Fu, Alex Zihao Zhu, Chen Song, Junwen Yao, Anant Subramanian, Hao Xiang, Weijing Shi, Yuliang Zou, Tom Hoddes, Zhaoqi Leng, Govind Thattai, Dragomir Anguelov, Mingxing Tan

Affiliations: Waymo; UCSD

AI summary: This paper studies large-scale training for autonomous driving perception systems; by extending the input modalities and training large models, it achieves new state-of-the-art performance on the Waymo Open Dataset.

Abstract

Model scaling has demonstrated remarkable success through large-scale training on diverse datasets. It remains an open question whether the same paradigm applies to autonomous driving perception systems, given unique challenges such as fusing heterogeneous sensor data and the need for sophisticated 3D spatial understanding. To bridge this gap, we present a comprehensive study that systematically analyzes the impact of scale on these systems. We develop our STELLAR model based on the Sparse Window Transformer, extending the input modalities to include LiDAR, radar, camera, and map prior. We train the model on a large-scale dataset of 50 million driving examples with up to 500 million parameters. Our large-scale experiments reveal empirical scaling trends that connect model performance to model size, data, and compute. The resulting model establishes a new state-of-the-art on the Waymo Open Dataset challenge, outperforming prior art by a large margin. Our work demonstrates that large-scale training is a highly promising path for advancing the capabilities of perception models for autonomous driving.

2605.20388 2026-05-21 cs.CV (updated version)

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction

Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani

Affiliations: Khoury College of Computer Sciences, Northeastern University, Boston; Korea Advanced Institute of Science and Technology, Daejeon; Department of Mathematics and Computer Science, University of Catania, Italy

AI summary: This study proposes trajectory-conditioned egocentric prediction and finds that the future camera trajectory expresses intent more precisely than language, so it outperforms language conditioning on procedural planning; most of the gain is retained even when the trajectory is not observed at prediction time.

Comments Project page: https://farsightlab.github.io/TrajPilot

Abstract

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way. Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain. We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

2605.20385 2026-05-21 cs.CV cs.AI (updated version)

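Capability ≠ Interpretability: Human Interpretability of Vision Foundation Models

Julien Colin, Lore Goetschalckx, Nuria Oliver, Thomas Serre

Affiliations: ELLIS Alicante, Spain; Brown University, USA; imec, Leuven, Belgium

AI summary: This paper studies the human interpretability of vision foundation models and proposes an evaluation framework. It finds that foundation models are less interpretable than supervised models, and that interpretability is unrelated to downstream task performance but is associated with the locality of feature activations and coarse-grained semantic alignment.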

Abstract

How interpretable are the features of leading vision models? The question is increasingly pressing as these models move from research benchmarks into high-stakes deployments, yet existing methods cannot answer it reliably. We close this gap with a framework for measuring and comparing the human interpretability of vision models, built around two complementary psychophysics protocols: (1) localizability -- can an observer predict where a feature fires on a novel image? -- and (2) nameability -- can an observer accurately describe what the feature represents? Features are recovered via sparse autoencoders, and a chance-anchored scoring function places every model on a common scale. Applying the framework to six vision transformers -- two supervised ViTs and four foundation models (DINOv2, DINOv3, CLIP, SigLIP) -- we collected more than 15,000 behavioral responses, analyzing the 13,400 responses from the 377 participants who passed our pre-specified quality checks. Foundation models are consistently less interpretable than their supervised counterparts, and the gap is not a capability tradeoff: interpretability does not correlate with downstream task performance on any benchmark we examine. What does correlate is the locality of a feature's activations and coarse-grained semantic alignment with humans -- models with focal activations and representations that reflect the world's broad categorical structure produce more interpretable features, whereas fine-grained perceptual alignment does not. The two protocols yield strongly correlated rankings and share the same predictors, establishing interpretability as an independent, measurable dimension of representation quality -- and, surprisingly, one on which every foundation model we tested falls below the supervised baselines that came before. Capability alone cannot close that gap; locality and coarse-grained alignment can.

2605.20316 2026-05-21 cs.CV cs.AI (updated version)

FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation

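Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann

Affiliations: ETH Zurich; Google

AI summary: This paper proposes FullFlow, which upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator by training only LoRA adapters and lightweight text heads. It keeps images in their continuous flow and adds a discrete insertion process for text, improving both text-to-image and image-to-text generation quality.

Comments project page: https://ericbill21.github.io/fullflow/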

Abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision-language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce FullFlow, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision-language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling text→image, image→text, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves text→image FID from 62.7 to 31.6 and image→text CIDEr from 2.0 to 99.4 over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from ~84 GB to ~38 GB and raising throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision-language capability can be unlocked from pretrained text-to-image flow models without full multimodal pretraining.

2605.20309 2026-05-21 cs.CV cs.AI (updated version)

Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision

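Runyuan Cai, Yiming Wang, Yu Lin, Xiaodong Zeng

Affiliations: AutoArk-AI

AI summary: This paper proposes Tiny-Engram, a compact trigger-indexed concept table that controls concepts in frozen image and video generators by giving visual memories an explicit lexical address and activation boundary. Each concept is parameterized by memory entries indexed through registered n-gram matches that modulate the text encoder's hidden states only within the matched trigger region, binding a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt.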

Abstract

Current personalization methods for generative vision models typically encode new concepts through continuous adapters or weight updates, yet provide limited control over whether and when a concept should be retrieved. In this work, we introduce Tiny-Engram, a compact trigger-indexed concept table that gives visual memories an explicit lexical address and activation boundary inside frozen image and video generators. Tiny-Engram parameterizes each concept as a small set of memory entries indexed by registered n-gram matches, which modulate text-encoder hidden states only within the matched trigger region. Outside this lexical support, the conditioning pathway is identical to that of the frozen base model. Across both single-encoder latent diffusion and multi-encoder diffusion-transformer backbones, this formulation binds a rare trigger phrase to a target identity while preserving compositional control from the surrounding prompt. We further evaluate the same table-based memory in a text-conditioned video generation setting, where the trigger path reliably alters the generated subject but fine-grained identity persistence across held-out video prompts remains limited. Taken together, these results suggest that small, explicitly addressed concept tables are a practical route to modular visual personalization, with strongest evidence in image generation. For video diffusion, the remaining gap points to a broader requirement: temporally stable identity likely depends on tighter coupling between text-side memory and the evolving visual state, motivating future work on memory injection beyond the text-conditioning interface.

2605.20308 2026-05-21 cs.CV cs.AI cs.LG (updated version)

SDM: A Powerful Tool for Evaluating Model Robustness

Xinlei Liu, Tao Hu, Jichao Xie, Peng Yi, Hailong Ma, Baolin Li

Affiliations: Information Engineering University, Zhengzhou, China; Key Laboratory of Cyberspace Endogenous Safety & Security of Henan Province, Zhengzhou, China; Key Laboratory of Cyberspace Security, Ministry of Education of China, Zhengzhou, China; Songshan Laboratory, Zhengzhou, China

AI summary: This paper proposes SDM, a new gradient-based attack that redefines the objective of adversarial example generation to address the performance degradation caused by "high-loss non-adversarial examples" in earlier methods; experiments show advantages in both attack performance and cost-effectiveness.

Comments 16 pages

Journal ref: Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

Gradient-based attacks are important methods for evaluating model robustness. However, since the proposal of APGD, it has been difficult for such methods to achieve significant breakthroughs. To make such a breakthrough, we first analyze the issue of "high-loss non-adversarial examples" that degrades attack performance in previous methods, and prove that this issue arises from inappropriate objectives for adversarial example generation. Subsequently, we reconstruct the objective as "maximizing the difference between the non-ground-truth label probability upper bound and the ground-truth label probability", and propose a novel and powerful gradient-based attack method named Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step". It adopts the negative probability loss function and the Directional Probability Difference Ratio (DPDR) loss function in the initial and subsequent optimization stages, respectively, and approaches the ideal objective of adversarial example generation via stage-wise sequential optimization. Experiments demonstrate that compared with previous state-of-the-art methods, SDM not only achieves stronger attack performance but also exhibits superior cost-effectiveness. The code is available at https://github.com/X-L-Liu/ICML-SDM.

2605.20297 2026-05-21 cs.CV cs.LG (updated version)

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

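Ziyuan Gao

Affiliations: University College London, London, United Kingdom

AI summary: This study proposes MedCRP-CL, a framework that addresses continual learning for medical image segmentation through online task structure discovery and structure-aware continual learning, reaching a 73.3% Dice score with only 4.1% forgetting.

Comments Accepted by ICML 2026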

Abstract

Medical image segmentation faces a fundamental challenge in continual learning: data arrives sequentially from heterogeneous sources, yet effective continual learning requires discovering which tasks share sufficient structure to benefit from joint learning. Existing methods either apply uniform constraints across all tasks, causing catastrophic forgetting when tasks conflict, or require predefined task groupings that cannot anticipate future task diversity. We introduce MedCRP-CL, a framework that performs online task structure discovery and structure-aware continual learning. Leveraging the Chinese Restaurant Process (CRP), our method dynamically infers task groupings from clinical text prompts as tasks arrive, without requiring predefined cluster counts or access to future tasks. We term these discovered groupings semantic modalities, as they capture finer-grained structure than physical imaging modalities by integrating anatomical region and pathological context. Guided by this discovered structure, we maintain semantic modality-specific LoRA adapters regularized by intra-modality EWC, ensuring parameter isolation across dissimilar task groups while facilitating knowledge transfer within similar ones. The framework is also replay-free, storing only aggregate statistics rather than raw patient data. Experiments on 16 medical segmentation tasks across four imaging modalities demonstrate that MedCRP-CL achieves 73.3% Dice score with only 4.1% forgetting, outperforming the best baseline by 8.0% while requiring 6× fewer parameters. Code is available at https://github.com/zygao930/MedCRP-CL.
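For readers unfamiliar with the Chinese Restaurant Process, the sketch below shows only its standard prior: an arriving item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. This is a generic textbook illustration, not the MedCRP-CL implementation; the abstract does not say how text-prompt similarity enters the assignment, so that part is omitted, and the function names and the `alpha` value are assumptions.

```python
import random


def crp_probabilities(cluster_sizes, alpha):
    """Prior probability of joining each existing cluster; last entry is a new cluster."""
    total = sum(cluster_sizes) + alpha
    return [n / total for n in cluster_sizes] + [alpha / total]


def crp_assign(num_items, alpha, rng):
    """Seat items one at a time and return the cluster index chosen for each."""
    sizes, assignments = [], []
    for _ in range(num_items):
        probs = crp_probabilities(sizes, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(sizes):
            sizes.append(1)  # the item opens a new cluster
        else:
            sizes[k] += 1  # the item joins existing cluster k
        assignments.append(k)
    return assignments


# Hypothetical usage: 16 sequentially arriving tasks, concentration alpha = 1.0.
assignments = crp_assign(16, alpha=1.0, rng=random.Random(0))
```

Because the number of clusters grows only when the "new cluster" outcome is drawn, no cluster count has to be fixed in advance, which matches the property the abstract highlights.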

2605.20290 2026-05-21 cs.GR cs.CV (updated version)

TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction

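Xin Zhang, Yabo Chen, Yijie Fang, Wanying Qu, Haibin Huang, Chi Zhang, Feng Xu, Xuelong Li

Affiliations: Fudan University; Institute of Artificial Intelligence, China Telecom (TeleAI)

AI summary: This paper proposes TelePhysics, a training-free framework that converts a single image into a physically consistent, controllable video through holistic scene-level 3D reconstruction. Representing the full scene geometry in a unified spatial coordinate system resolves object penetration and alignment ambiguity, enables accurate multi-object interactions and richer control types, and supports real-time physical interaction previews while preserving photorealistic visual fidelity.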

Abstract

Recent generative video models achieve impressive visual quality but remain constrained by limited physical consistency and controllability. Existing video generation methods provide minimal physical control, and single-image-to-3D conversion approaches often suffer from object interpenetration. Furthermore, physics-based scene-level 3D generation methods exhibit spatial misalignment, stylized artifacts, and inconsistencies with the input data, restricting their use in realistic interactive video synthesis. We propose TelePhysics, a training-free framework that converts a single image into a physically consistent and controllable video through holistic scene-level 3D reconstruction. By representing the full scene geometry in a unified spatial coordinate system, TelePhysics resolves object penetration and alignment ambiguity. Unlike prior methods, this formulation enables accurate scene-level multi-object interactions and introduces richer, complex control types for advanced mechanics-based manipulation. By decoupling simulation from rendering, TelePhysics bypasses latency-heavy priors, achieving real-time physical interaction previews while preserving photorealistic visual fidelity. Experimental results demonstrate that TelePhysics substantially outperforms prior methods in physical fidelity, spatial coherence, and controllability. The open-source code is available at https://github.com/xinzhang007/TelePhysics.

2605.20287 2026-05-21 cs.LG cs.AI cs.CV (updated version)

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

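Haoyi Zhang, Kairong Guo, Bojie Zhang, Yibo Lin, Runsheng Wang

Affiliations: School of Integrated Circuits, Peking University, Beijing, China

AI summary: This paper proposes FusionCell, which fuses layout geometry and netlist topology through cross-attention to improve the accuracy of standard-cell performance prediction, addressing the coupling and layout-dependent effects that predictors miss when they ignore layout geometry.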

Abstract

Standard cells form the building blocks of digital circuits, so their delay and power critically influence chip-level performance; yet characterization still relies on slow simulation sweeps, and many fast predictors ignore layout geometry, missing coupling and layout-dependent effects. The challenge is to jointly represent layout geometry and netlist topology so models capture fine-grained spatial details together with structural connectivity for accurate performance prediction. We introduce FusionCell, a dual-modality predictor that treats routed layout geometry and netlist topology as inputs and fuses them explicitly in a unified model. A DeiT encoder processes three-layer routed layouts, while a graph transformer models heterogeneous device/net graphs. The modalities are integrated through a topology-guided mechanism, where the netlist acts as a structural "map" to actively query relevant physical regions in the layout for joint geometric and topological reasoning. We build a 7nm dataset based on the ASAP7 PDK with over 19.5k cells spanning 149 types using automatic tools, targeting six metrics: signal rise/fall delay, transition, and power. Experimental results demonstrate that FusionCell reduces regression error, with an average MAPE of 0.92 percent, and improves Spearman/Kendall ranking over baselines, while accelerating the characterization process by orders of magnitude compared to circuit simulation.

2605.20284 2026-05-21 cs.CV cs.AI cs.LG (updated version)

JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA

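Hyunju Kang, Woohyun Lee, Jaewon Kim, Hogun Park

Affiliations: Sungkyunkwan University; Seoul National University

AI summary: This paper proposes the JUDO framework, which improves multimodal reasoning by incorporating domain knowledge and context to address the lack of domain knowledge in models for industrial anomaly detection; experiments show it outperforms Qwen2.5-VL-7B and GPT-4o on the MMAD benchmark.

Comments Published at ICLR 2026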

Abstract

Industrial anomaly detection has been significantly advanced by Large Multimodal Models (LMMs), enabling diverse human instructions beyond detection, particularly through visually grounded reasoning for better image understanding. However, LMMs lack domain-specific knowledge, which limits their ability to generate accurate responses in complex industrial scenarios. In this work, we present JUDO, Juxtaposed Domain-Oriented Multimodal Reasoner, a framework that efficiently incorporates domain knowledge and context in visual and textual reasoning. Through visual reasoning, our model segments the defect region by juxtaposing query images with normal images as visual domain context, enabling a fine-grained visual comparative inspection. Furthermore, we inject domain knowledge through supervised fine-tuning (SFT) to enhance context understanding and subsequently guide domain reasoning through reinforcement learning (GRPO) with tailored rewards, opting for a domain-oriented reasoning process. Experimental results demonstrate that JUDO achieves superior performance on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. These results highlight the importance of enhancing domain knowledge and context for effective reasoning in anomaly understanding.

2605.20277 2026-05-21 cs.CV cs.AI (updated version)

Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis

Tianwei Lin, Zhongwei Qiu, Jie Cao, Jiang Liu, Wenjie Yan, Bo Zhang, Yu Zhong, Wenqiao Zhang, Yingda Xia, Ling Zhang

Affiliations: Zhejiang University; DAMO Academy, Alibaba Group; Hupan Lab; University of Electronic Science and Technology of China

AI summary: This paper proposes Trajectory-Integral Feedback GRPO (TIF-GRPO), a framework for improving medical vision-language models on 3D CT analysis, and introduces the Clinical Abnormality Benchmarking Substrate (CABS) to address the mismatch between optimization objectives and clinical rigor, improving abnormality detection and clinical accuracy.

Abstract

Medical vision-language models (VLMs) have rapidly advanced as general-purpose multimodal assistants, yet their deployment in 3D Computed Tomography (CT) analysis remains constrained by a persistent mismatch between optimization objectives and clinical rigor. Current Reinforcement Learning (RL) paradigms still rely on lexical proxy signals that induce "Evaluation Hallucinations", where models optimize linguistic fluency rather than factual clinical correctness, leading to diagnostically critical errors. To bridge this gap, we introduce the Clinical Abnormality Benchmarking Substrate (CABS), a structured system that decomposes radiology reports into verifiable clinical semantic units. Using CABS, we identify a "Mechanistic Divergence" in standard RL, where surface-similarity rewards drive policy gradients to bypass medical facts. We therefore propose Trajectory-Integral Feedback GRPO (TIF-GRPO), a novel framework integrating control-theoretic principles into policy optimization. By formulating clinical reasoning as a pseudo-temporal trajectory for anomaly discovery, TIF-GRPO regulates anatomy-aware rewards via an integral feedback loop that penalizes persistent omissions as cumulative state errors and suppresses hallucinations as excessive control effort. Experiments on 3D CT benchmarks demonstrate that our approach significantly enhances abnormality detection and clinical faithfulness, establishing a new paradigm for fine-grained regulation in medical VLMs. Our project is available at https://github.com/ZJU4HealthCare/TIF-GRPO.

2605.20275 2026-05-21 cs.CV cs.AI (updated version)

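CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning

Yang Liu, Toan Nguyen, Flora D. Salim

Affiliations: School of Computer Science and Engineering, University of New South Wales

AI summary: This paper proposes CP-MoE, a continual learning framework built around a transient expert; its consistency-preserving routing bias and transient expert-guided regularisation reduce parameter interference and forgetting while preserving cross-task knowledge transfer.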

Abstract

Catastrophic forgetting remains a major obstacle to continual learning in large language models (LLMs) and vision-language models (VLMs). Although Mixture-of-Experts (MoE) architectures offer an efficient path to scaling, existing LoRA-based MoE continual learning methods still face a fundamental trade-off: they either isolate experts too aggressively, limiting knowledge transfer across tasks, or allow task-specific updates to overwrite important existing parameters, leading to severe forgetting. To address this, we propose CP-MoE, a continual learning framework built around a transient expert that captures early task-specific updates and guides their integration into stable experts. CP-MoE introduces a consistency-preserving routing bias, which uses the transient expert to estimate representation similarity with stable experts and steer routing towards more compatible expert selection, and a transient expert-guided regularisation mechanism, which selectively protects important historical parameters during merging. Together, these components reduce parameter interference and forgetting while preserving cross-task knowledge transfer. We validate CP-MoE on both unimodal and multimodal continual learning benchmarks with LLM-based and VLM-based MoE models. On the SuperNI benchmark, spanning diverse sequential language tasks, CP-MoE achieves state-of-the-art performance and stronger zero-shot transfer to unseen tasks. On the VQA v2 dataset, it scales effectively to multimodal visual reasoning, consistently reduces forgetting, and outperforms strong MoE baselines.

2605.20237 2026-05-21 cs.CV (updated version)

AnimeAdapter: Fine-grained and Consistent Zero-shot Anime Character Generation

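Yixuan Han

Affiliations: National University of Singapore, Singapore

AI summary: This paper proposes a lightweight appearance adapter for controllable and consistent anime character generation under diverse editing conditions. It injects fine-grained visual features from a single reference image into the diffusion process, builds semantic-selective local attention on CLIP's emergent local spatialization, and further disentangles character appearance from spatial layout.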

Abstract

We present a lightweight appearance adapter for Stable Diffusion that enables controllable and consistent anime character generation under diverse editing conditions. Instead of relying on large-scale vision-language models or per-subject fine-tuning, our method injects fine-grained visual features from a single reference image into the diffusion process. Building on CLIP's emergent local spatialization, we develop semantic-selective local attention. To further disentangle character appearance from spatial layout, we incorporate pose-aware conditioning during adapter training. The resulting pretrained adapter remains compact, modular, and fully compatible with Stable Diffusion community workflows, while requiring no additional fine-tuning at deployment time. Furthermore, we present a high-quality anime character dataset based on curated and restructured Danbooru prompts, and evaluate our method across several practical character editing scenarios. Our code, model weights, and dataset will be publicly released upon acceptance.

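2605.20233 2026-05-21 cs.CV cs.AI (updated version)

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

Hanchen David Wang, Yilin Liu, Madison J. Lee, Surya Chand Rayala, Gautam Biswas, Daniel T. Levin, Meiyi Ma

Affiliations: Vanderbilt University

AI summary: This paper proposes an AI-assisted assessment framework based on egocentric video. By extracting action timelines, sequence-level features, and recognition metrics, it finds a negative correlation between recognition accuracy and competency, suggesting that recognition accuracy can serve as a pedagogically informative signal in automated assessment.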

Comments Accepted at CVPR Workshop

Abstract

Assessing learner competency in clinical simulation requires expert observation that is time-intensive, difficult to scale, and subject to inter-rater variability. Vision-language models have emerged as a promising tool for understanding complex visual behavior. In this work, we investigate whether visual observations can provide educationally meaningful signals for competency assessment through a three-stage framework that (1) extracts action timelines from egocentric nursing simulation video using frozen visual encoders and few-shot learning, (2) derives sequence-level features and per-session recognition metrics, and (3) relates these to instructor-rated competency. Across 22 densely annotated sessions (3.8 hours, 493 actions), a frozen DINOv2 backbone with HMM Viterbi decoding achieves 57.4% MOF in leave-one-out 1-shot recognition. Surprisingly, we observe a negative trend between recognition accuracy and competency (rho = -0.524, p = 0.012 for mIoU) that is robust to six confound controls: more competent students produce diverse, harder-to-classify workflows, while simple sequence features show no such relationship. Per-item analysis identifies patient safety protocols and team communication as the expected behaviors most reflected in this pattern, and process model comparisons reveal that higher-competency students exhibit more protocol-consistent action transitions. These findings suggest that recognition accuracy may complement predicted action timelines as a pedagogically informative signal in automated competency assessment.

2605.20223 2026-05-21 cs.CV (updated version)

Why Latent Actions Fail, and How to Prevent It

Jung Min Lee, Taehyun Cho, Li Zhao, Jungwoo Lee

Affiliations: Seoul National University; Microsoft Research

AI summary: This paper studies how exogenous state interferes with action learning in latent action models, and proposes mitigating this noise interference by focusing on endogenous components.

Abstract

Latent action models (LAMs) aim to learn action-like representations from unlabeled videos by compressing frame-to-frame changes. The frames of in-the-wild videos, however, contain not only the agent's own state but also exogenous state such as background clutter. Since the exogenous state introduces changes unrelated to actions, it hinders reliable latent action learning. This paper investigates this problem analytically by extending a linear LAM framework to explicitly model exogenous state. Our analysis reveals two insights: (1) minimizing the standard reconstruction objective produces latent actions that encode exogenous information from future observations; and (2) learning in a representation space that focuses on endogenous components is key to mitigating the interference of noise. We further show that previously proposed auxiliary objectives, such as action supervision, provably encourage latent actions to be consistent across exogenous states. These findings are validated through experiments on both linear and nonlinear LAMs, providing a unified theoretical analysis of how exogenous state hinders latent action learning and why common remedies work.

2605.20211 2026-05-21 cs.CV cs.AI (updated version)

Leveraging Vision-Language Models to Detect Attention in Educational Videos

Gabriel Becquet, Sébastien Lallé, Vanda Luengo, Ali Abou-Hassan

Affiliations: Sorbonne University, CNRS, LIP6 & PHENIX

AI summary: This paper investigates using vision-language models to analyze educational video content directly, combined with gaze data, in order to improve attention detection, but finds that the approach has limitations for real-time educational diagnostics.

Abstract

Educational videos are a cornerstone of remote and blended learning. However, learners' fluctuating attention remains a significant barrier to effective information retention. Prior research has attempted to mitigate this by detecting and reacting to attention loss at runtime using eye tracking. Such detection has so far been based on classical machine learning classifiers trained on engineered features, such as summary statistics over learners' fixations and saccades. These methods have struggled to capture the complex, temporal nature of learner engagement, and thus exhibit only moderate prediction performance. In this study, we aim to advance attention detection by shifting from standard engineered features to multimodal foundation models. Using an educational eye-tracking dataset (N = 70), we investigate a novel methodology that uses a Vision-Language Model (VLM) to analyze video content directly with superimposed gaze data. This approach aims to leverage the semantic reasoning capabilities of foundation models to contextualize learner focus within the video stream. We evaluate the performance of this VLM-based approach using several prompting strategies with Gemini 3, but ultimately find that none of them outperforms statistical baselines. Our results provide new insights into the limitations of using VLMs for real-time educational diagnostics.

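2605.17630 2026-05-21 cs.CV (updated version)

Efficient Training of Compact Compression Models via Sequential Knowledge Distillation

Caroline Mazini Rodrigues, Nicolas Keriven, Thomas Maugey

Affiliations: Univ. Rennes, Inria, CNRS, IRISA, Rennes, France

AI summary: This paper proposes a method for shrinking autoencoder-based compression networks through sequential knowledge distillation. By simplifying the early optimization objective and introducing complexity gradually, it improves the reconstruction quality and statistical fidelity of lightweight models, making them suitable for resource-constrained environments.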

Abstract

Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex and heavyweight, and they require substantial training data and computational resources. We propose a methodology that significantly shrinks autoencoder-based compression networks through a more stable knowledge distillation process. The intuition is that highly reduced architectures benefit from simplified optimization objectives in early training, with complexity gradually introduced later. Therefore, our approach begins with a sequential encoder--decoder distillation stage that provides a robust initialization for the lightweight model. This is followed by standard training that can be regularized with latent distillation. We evaluate the resulting lightweight autoencoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity in early epochs better than training lightweight autoencoders with the original loss, making it practical for resource-limited environments.

2510.09060 2026-05-21 cs.AI cs.CV (updated version)

Letting Trajectories Spread: Quality-Preserving Control for Diverse Flow Matching

Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Bo An, Ivor Tsang, Yang You

Affiliations: The University of North Carolina at Chapel Hill; Nanyang Technological University; CFAR, Agency for Science, Technology and Research, Singapore; University of California, Santa Barbara; National University of Singapore

AI summary: This paper proposes a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Guidance that is geometrically decoupled from the mode's quality-seeking direction encourages lateral spread among trajectories, while a time-scheduled stochastic perturbation reintroduces uncertainty, improving diversity without degrading image details or prompt fidelity.

Abstract

Flow-based text-to-image models follow deterministic trajectories, making it costly to explore diverse modes under limited sampling budgets. Existing approaches to improving diversity often rely on retraining or degrade image fidelity. To address this limitation, we present a training-free, inference-time control mechanism that makes the flow itself diversity-aware. Our core insight is to encourage diversity through guidance that is geometrically decoupled from the mode's quality-seeking direction. Our method simultaneously encourages lateral spread among trajectories via a feature-space objective and reintroduces uncertainty through a time-scheduled stochastic perturbation. Crucially, this perturbation is projected to be orthogonal to the generation flow, a geometric constraint that allows it to boost variation without degrading image details or prompt fidelity. Theoretically, we show that this design monotonically increases a volume surrogate while approximately preserving the marginal distribution, providing a principled explanation for the robustness of generation quality. Empirically, across multiple text-to-image settings under fixed sampling budgets, our method consistently improves metrics such as the Vendi Score and BRISQUE over strong baselines, while upholding image quality and alignment.
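
The orthogonal-projection idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `project_orthogonal` and `perturbed_step`, the Euler-style update, and the `sigma * sqrt(dt)` noise scale are all assumptions made for illustration. The only part taken from the abstract is that the stochastic perturbation is projected to be orthogonal to the flow direction before it is applied.

```python
import math
import random


def project_orthogonal(noise, velocity, eps=1e-12):
    """Remove from `noise` its component along `velocity` (vector rejection)."""
    vv = sum(v * v for v in velocity)
    if vv < eps:
        # Degenerate flow direction: nothing to project out.
        return list(noise)
    scale = sum(n * v for n, v in zip(noise, velocity)) / vv
    return [n - scale * v for n, v in zip(noise, velocity)]


def perturbed_step(x, velocity, dt, sigma, rng):
    """One hypothetical Euler step with noise restricted to directions
    orthogonal to the flow, so progress along `velocity` is unchanged."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noise_perp = project_orthogonal(noise, velocity)
    step_scale = sigma * math.sqrt(dt)  # assumed noise schedule, not from the paper
    return [xi + dt * vi + step_scale * ni
            for xi, vi, ni in zip(x, velocity, noise_perp)]


if __name__ == "__main__":
    rng = random.Random(0)
    x_next = perturbed_step([0.0, 0.0], [1.0, 0.0], dt=0.1, sigma=0.5, rng=rng)
    # The first coordinate moves only by dt * velocity; noise acts on the second.
    print(x_next)
```

Because the projected noise has zero inner product with the velocity, the displacement along the flow direction in this sketch is exactly `dt * velocity`; only the lateral coordinates are randomized.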

2506.16950 2026-05-21 cs.CV cs.LG (updated version)

LAION-C: An Out-of-Distribution Benchmark for Web-Scale Vision Models

Fanfei Li, Thomas Klein, Wieland Brendel, Robert Geirhos, Roland S. Zimmermann

Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; ELLIS Institute Tübingen; Tübingen AI Center; Google DeepMind

AI summary: This paper proposes LAION-C as an alternative benchmark to ImageNet-C for evaluating out-of-distribution robustness in the era of web-scale datasets. It introduces six new out-of-distribution distortion types and finds that LAION-C poses significant challenges to contemporary models, while the best models now match or outperform the best human observers.

Comments ICML 2025 camera ready version

Abstract

Out-of-distribution (OOD) robustness is a desired property of computer vision models. Improving model robustness requires high-quality signals from robustness benchmarks to quantify progress. While various benchmark datasets such as ImageNet-C were proposed in the ImageNet era, most ImageNet-C corruption types are no longer OOD relative to today's large, web-scraped datasets, which already contain common corruptions such as blur or JPEG compression artifacts. Consequently, these benchmarks are no longer well-suited for evaluating OOD robustness in the era of web-scale datasets. Indeed, recent models show saturating scores on ImageNet-era OOD benchmarks, so it is unclear whether models trained on web-scale datasets truly become better at OOD generalization or whether they have simply been exposed to the test distortions during training. To address this, we introduce LAION-C as an alternative benchmark to ImageNet-C. LAION-C consists of six novel distortion types specifically designed to be OOD, even for web-scale datasets such as LAION. In a comprehensive evaluation of state-of-the-art models, we find that the LAION-C dataset poses significant challenges to contemporary models, including multimodal large language models (MLLMs) such as Gemini and GPT-4o. We additionally conducted a psychophysical experiment to evaluate the difficulty of our corruptions for human observers, enabling a comparison of models to lab-quality human robustness data. We observe a paradigm shift in OOD generalization: from humans outperforming models, to the best models now matching or outperforming the best human observers.

2506.08277 2026-05-21 q-bio.NC cs.AI cs.CL cs.CV cs.LG (updated version)

Task-conditioned probing of instruction-tuned multimodal LLMs: Region-specific brain alignment patterns under naturalistic stimuli

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

Affiliations: Technische Universität Berlin; Rice University; AWS AI Labs, Amazon; IIT Delhi; University of Wisconsin - Madison; Spector Inc; IIIT-Hyderabad; Microsoft

AI summary: This study examines the brain alignment patterns of instruction-tuned multimodal LLMs under naturalistic stimuli. By comparing different models across video and audio tasks, it shows how instruction tuning affects model representations.

Comments 57 pages, 39 figures

Abstract

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment than unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or on non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction tuning is associated with IT-MLLMs organizing their representations around functional task demands or whether these representations simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs show higher brain alignment than in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Evaluating MLLMs across video and audio tasks with language-guided probing yields distinct task-specific representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
