arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.16241 2026-05-18 cs.CV cs.AI Version update

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

Jin Shi, Brady Zhang, Yishun Lu

Affiliations Department of Mechanical Engineering, University College London; Department of Engineering Science, University of Oxford

AI summary This paper proposes the VLA-AD framework, which uses a vision-language model as an offline semantic supervisor to distill a large VLA teacher into a lightweight student policy, with high-level semantic guidance improving efficiency and robustness.

Abstract

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $\pi_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
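
The size and speed figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the quoted numbers; it assumes a nominal 7 billion parameters for OpenVLA-7B, and the per-step latencies are derived here rather than reported in the abstract.

```python
# Figures quoted in the abstract.
teacher_params = 7e9      # assumption: nominal parameter count of OpenVLA-7B
student_params = 158e6    # 158M-parameter student
student_hz = 12.5         # student control rate on an RTX 4090
speedup = 3.28            # reported inference speedup over the teacher

# Derived quantities (not stated in the abstract).
size_ratio = teacher_params / student_params   # close to the reported 44x
teacher_hz = student_hz / speedup              # implied teacher control rate
student_latency_ms = 1000.0 / student_hz       # time per control step
teacher_latency_ms = 1000.0 / teacher_hz

print(f"size ratio: {size_ratio:.1f}x")
print(f"implied teacher rate: {teacher_hz:.2f} Hz")
print(f"latency per step: student {student_latency_ms:.0f} ms, "
      f"teacher {teacher_latency_ms:.0f} ms")
```

Under these assumptions the student needs 80 ms per control step, while the implied teacher rate is a little under 4 Hz.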

2605.16179 2026-05-18 cs.CV Version update

MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models

Piyush Tiwary, Utkarsh Ahuja, Depanshu Sani, Aishwarya Jayagopal, Sagar Gubbi, Subhashini Venugopalan, Alok Talekar, Vaibhav Rajan

Affiliations Google DeepMind; Google; Indian Institute of Science

AI summary This paper proposes MAgSeg, a multimodal large language model segmentation method that needs no vision decoder. It addresses fragmented plots, high intra-class variance, and scarce labeled data in agricultural landscape segmentation in the Global South, enabling efficient mapping of agricultural environments.

Abstract

Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.

2605.16171 2026-05-18 cs.CV Version update

Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment

Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang

Affiliations Beihang University; University of Science and Technology Beijing; Beijing Jiaotong University

AI summary Res$^2$CLIP addresses category generalization in the few-shot setting through residual-to-residual alignment, eliminating fine-grained feature differences and class-specific biases to improve cross-category generalization.

Abstract

Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, we design Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP's residual space. Developed from a residual perspective, the framework has three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.

2605.16165 2026-05-18 cs.CV cs.AI Version update

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu, Wes Armour

Affiliations University of Oxford, Oxford, United Kingdom

AI summary This paper proposes the ML-FOP-SOAP framework, which improves the stability of multimodal alignment through multi-level variance correction. Experiments on Janus and Emu3 show improved sample efficiency and training speed, making the method suitable for large-scale multimodal foundation models.

Abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.

2605.16147 2026-05-18 cs.CV Version update

Registers Matter for Pixel-Space Diffusion Transformers

Nikita Starodubcev, Ilia Sudakov, Ilya Drobyshevskiy, Artem Babenko, Dmitry Baranchuk

Affiliations Yandex Research

AI summary The study finds that Diffusion Transformers, unlike Vision Transformers, show no patch-token outliers, yet register tokens still significantly improve pixel-space generation quality. Analysis of intermediate representations shows that register tokens yield cleaner feature maps at high noise levels, which motivates an efficient dual-stream architecture that further improves generation.

Abstract

Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by register tokens. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
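
For readers unfamiliar with register tokens, the basic mechanism is small: a few extra learnable tokens are appended to the patch sequence before the transformer blocks and discarded afterwards. The sketch below shows only that bookkeeping in plain Python; the function names are hypothetical, the zero-valued registers stand in for learned parameters, and the paper's dual-stream architecture is not modeled.

```python
from typing import List

Token = List[float]


def append_registers(patch_tokens: List[Token], registers: List[Token]) -> List[Token]:
    """Return the patch sequence with register tokens appended at the end."""
    return list(patch_tokens) + list(registers)


def strip_registers(tokens: List[Token], num_registers: int) -> List[Token]:
    """Drop the trailing register tokens so only patch outputs remain."""
    return tokens[: len(tokens) - num_registers]


# Four patch tokens of width 2 and two registers (placeholders for learned values).
patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
registers = [[0.0, 0.0], [0.0, 0.0]]

sequence = append_registers(patches, registers)
# Transformer blocks would attend over all six tokens of `sequence` here.
outputs = strip_registers(sequence, len(registers))
```

Because the registers take part in attention but never reach the output, they give the model scratch space without changing the shape of the generated image.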

2605.16127 2026-05-18 cs.CV Version update

WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction

A. Enes Doruk, Abdelaziz Hussein, Hasan F. Ates

Affiliations Department of Artificial Intelligence and Data Engineering, Ozyegin University, Istanbul, Türkiye

AI summary This paper proposes a framework that uses the pre-trained CLIP latent space to guide multi-sensor fusion with linguistic environmental cues, addressing sensor unreliability in adverse weather and improving the robustness of 3D semantic occupancy prediction.

Abstract

While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio, prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach, as implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.
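
The gating idea can be illustrated with a toy fusion rule. The sketch below is not the paper's learned gate: it assumes visibility and illumination are already available as scores in [0, 1] and uses their product as the camera weight, a hand-written stand-in for a gate that the paper conditions on CLIP text embeddings. All names are hypothetical.

```python
from typing import List


def camera_weight(visibility: float, illumination: float) -> float:
    """Camera trust in [0, 1]; the product rule is an illustrative assumption."""
    v = min(max(visibility, 0.0), 1.0)
    i = min(max(illumination, 0.0), 1.0)
    return v * i


def fuse(camera_feat: List[float], lidar_feat: List[float],
         visibility: float, illumination: float) -> List[float]:
    """Convex combination of camera and LiDAR features under the gate."""
    w = camera_weight(visibility, illumination)
    return [w * c + (1.0 - w) * l for c, l in zip(camera_feat, lidar_feat)]


# Clear daylight: the fused feature equals the camera feature.
clear_day = fuse([1.0, 1.0], [0.0, 0.0], visibility=1.0, illumination=1.0)
# Rainy night: the fused feature stays close to the LiDAR feature.
rainy_night = fuse([1.0, 1.0], [0.0, 0.0], visibility=0.2, illumination=0.1)
```

A static fusion strategy corresponds to fixing the weight once; the point of the gate is that the weight changes with the scene conditions.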

2605.16122 2026-05-18 cs.CV cs.AI Version update

GenShield: Unified Detection and Artifact Correction for AI-Generated Images

Zhipei Xu, Xuanyu Zhang, Youmin Xu, Qing Huang, Shen Chen, Taiping Yao, Shouhong Ding, Jian Zhang

Affiliations School of Electronic and Computer Engineering, Peking University; Tencent Youtu Lab

AI summary This paper proposes the GenShield framework, which performs explainable AI-generated image detection and controllable artifact correction in a closed diagnose-then-repair loop, combined with a Visual Chain-of-Thought curriculum learning strategy to improve correction quality and generalization.

Abstract

Diffusion-based image synthesis has made AI-generated images (AIGI) increasingly photorealistic, raising urgent concerns about authenticity in applications such as misinformation detection, digital forensics, and content moderation. Despite the substantial advances in AIGI detection, how to correct detected AI-generated images with visible artifacts and restore a realistic appearance remains largely underexplored. Moreover, few existing works have established the connection between AIGI detection and artifact correction. To fill this gap, we propose GenShield, a unified autoregressive framework that jointly performs explainable AIGI detection and controllable artifact correction in a closed loop from diagnosis to restoration, revealing a mutually reinforcing relationship between these two tasks. We further introduce a Visual Chain-of-Thought based curriculum learning strategy that enables self-explained, multi-step "diagnose-then-repair" correction with an explicit stopping criterion. A high-quality dataset with large-scale "artifact-restored" pairs is also constructed alongside a unified evaluation pipeline. Extensive experiments on our correction benchmark and mainstream AIGI detection benchmarks demonstrate state-of-the-art performance and strong generalization of our method. The code is available at https://github.com/zhipeixu/GenShield.

2605.16090 2026-05-18 cs.CR cs.CV Version update

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang, JianFeng Ma

Affiliations Xidian University

AI summary This paper proposes the CrossMPI attack, which achieves cross-modal prompt injection through image-only perturbation. It moves the optimization into the model's hidden-state space and uses a layer selection strategy and a distance-decremental perturbation budget strategy to improve attack performance.

Abstract

Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that single input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack, CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (which integrates multimodal information and has $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, contrary to prior experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than at the end. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.

2605.16081 2026-05-18 cs.LG cs.CV Version update

MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement

Dayong Ren

Affiliations State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing 210023, China

AI summary MIND tackles model-induced label noise through latent manifold disentanglement: it dynamically projects samples into latent structural clusters to make the noise identifiable, and it outperforms existing methods on complex benchmarks.

Comments Accepted, to appear in ICML2026

Abstract

The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.

2605.16080 2026-05-18 cs.CV Version update

ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

Affiliations School of Electronic and Computer Engineering, Peking University; School of Future Technology, South China University of Technology; School of Electronic and Information Engineering, South China University of Technology; Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University

AI summary This paper proposes the ReAlign framework, which uses contrastive learning to distill high-quality LLM-generated reasoning texts into a lightweight AIGI detector, improving detection accuracy and generalization.

Comments Accepted by CVPR 2026

Abstract

The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning-text representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates a contrastive loss for image-text alignment and a classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
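
A joint objective of this kind, a contrastive term for image-text alignment plus a classification term, can be sketched in a few lines. The version below is a generic illustration, not ReAlign's actual loss: it assumes an InfoNCE-style image-to-text contrastive term over L2-normalized embeddings, a standard cross-entropy term, and a hypothetical `weight` and `temperature`.

```python
import math
from typing import List


def log_softmax(xs: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(img: List[List[float]], txt: List[List[float]],
                     temperature: float = 0.07) -> float:
    """InfoNCE, image-to-text: the text at the same index is the positive."""
    n = len(img)
    total = 0.0
    for i in range(n):
        logits = [dot(img[i], txt[j]) / temperature for j in range(n)]
        total -= log_softmax(logits)[i]
    return total / n


def classification_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy for real-versus-generated classification."""
    return -sum(log_softmax(l)[y] for l, y in zip(logits, labels)) / len(labels)


def joint_loss(img, txt, logits, labels, weight: float = 1.0) -> float:
    """Classification loss plus weighted contrastive loss (weight is assumed)."""
    return classification_loss(logits, labels) + weight * contrastive_loss(img, txt)


# Toy batch of two image/reasoning-text embedding pairs.
img = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(img, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(img, [[0.0, 1.0], [1.0, 0.0]])
total = joint_loss(img, [[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]], [0, 1])
```

The contrastive term is small when each image embedding is closest to its own reasoning text and large when the pairs are swapped, which is the signal that pulls the detector's image features toward the text representation.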

2605.16079 2026-05-18 cs.CV cs.AI cs.HC Version update

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

Affiliations University of Science and Technology of China; Xiaohongshu Inc.; East China Normal University; Xi’an Jiaotong University

AI summary VideoSeeker integrates agentic reasoning with instance-level video understanding tasks to improve video understanding accuracy. Experiments show an average gain of 13.7% over baselines on instance-level tasks, surpassing GPT-4o and Gemini-2.5-Pro.

Comments Project Page: https://gaotiexinqu.github.io/VideoSeeker/

Abstract

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

2605.16076 2026-05-18 cs.CV cs.AI Version update

AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely

Affiliations RTM Al-Kabir Technical University; North East University Bangladesh

AI summary This paper proposes the AgriMind framework, an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 that uses transfer learning to classify 15 plant disease classes; the ensemble reaches 99.23% accuracy on the test set.

Abstract

Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96-97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, work we have not yet completed.
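
Averaging softmax outputs is simple enough to show directly. The sketch below uses made-up probability vectors for a single image and three classes; the model outputs are illustrative values, not results from the paper, and the function names are hypothetical.

```python
from typing import List


def average_softmax(model_probs: List[List[float]]) -> List[float]:
    """Unweighted mean of per-model class probabilities for one image."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[k] for p in model_probs) / n_models for k in range(n_classes)]


def predict(model_probs: List[List[float]]) -> int:
    """Index of the class with the highest averaged probability."""
    avg = average_softmax(model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Illustrative outputs: the first model alone would pick class 0,
# while the other two pick class 1.
probs = [
    [0.5, 0.4, 0.1],   # stand-in for ResNet50
    [0.2, 0.7, 0.1],   # stand-in for EfficientNet-B0
    [0.3, 0.6, 0.1],   # stand-in for DenseNet121
]
avg = average_softmax(probs)
label = predict(probs)
```

Here the ensemble picks class 1 even though one member prefers class 0, which is how uncorrelated individual errors get averaged away.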

2605.16065 2026-05-18 cs.CV cs.AI Version update

Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting

Raushan Joshi, Jean-Yves Guillemaut

Affiliations University of Surrey

AI summary This paper uses SAM-HQ to generate accurate 2D masks and applies prior-guided label reassignment to achieve robust 3D segmentation, improving the accuracy and efficiency of editing tasks.

Comments Accepted at IEEE International Conference on Image Processing 2026, 6 pages

Abstract

3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.
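
To see why lifting 2D masks to 3D needs a consistency step, consider the simplest possible baseline: each Gaussian takes the label that most views agree on. The sketch below implements only that plain majority vote, which is a stand-in and not the paper's prior-guided reassignment; the data layout (one dict per view, mapping a hypothetical Gaussian id to a 2D mask label) is an assumption.

```python
from collections import Counter
from typing import Dict, List


def vote_labels(per_view_labels: List[Dict[int, int]]) -> Dict[int, int]:
    """Assign each Gaussian the label seen most often across the views
    in which it is visible."""
    votes: Dict[int, Counter] = {}
    for view in per_view_labels:
        for gaussian_id, label in view.items():
            votes.setdefault(gaussian_id, Counter())[label] += 1
    return {gid: counts.most_common(1)[0][0] for gid, counts in votes.items()}


# Three views; label 1 marks the target object, 0 the background.
views = [
    {0: 1, 1: 0},
    {0: 1, 1: 1},
    {0: 0, 1: 1, 2: 0},   # Gaussian 2 is visible in this view only
]
labels = vote_labels(views)
```

A vote like this fixes isolated per-view mistakes but treats every view as equally reliable, which is the gap a learned prior is meant to close.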

2605.16022 2026-05-18 cs.CV Version update

EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting

Changjing Liu, Yiming Huang, Long Bai, Beilei Cui, Hongliang Ren

Affiliations Department of Electronic Engineering, The Chinese University of Hong Kong (CUHK)

AI summary This paper proposes the EndoGSim framework, which performs physics-aware reconstruction and simulation of endoscopic scenes through MLLM-guided Gaussian Splatting, combined with pre-trained segmentation and depth estimation, to improve the realism and accuracy of surgical simulation.

Comments Early Accepted by MICCAI 2026

Abstract

In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction and lack the physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Gaussian Splatting guided by Multi-modal Large Language Models (MLLMs). Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via an MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.

2605.16008 2026-05-18 cs.CV 版本更新

End-to-end plaque counting and virus titration from laboratory plate images with deep learning

基于深度学习的实验室平板图像端到端斑块计数与病毒滴度测定

Eugenia Moris, Alicia Costábile, Sebastián Rey, Irene Ferreiro, Joaquín Hurtado, Lizandra Lissette Luciano, Matías Villagrán, Aisha Espino Vázquez, Jomari Ramos, Isadora Monteiro, María Victoria de Santiago, Pilar Moreno, Gonzalo Moratorio, José Ignacio Orlando

Affiliations * Arionkoder LLC; Laboratory of Experimental Virus Evolution, Pasteur Institute of Montevideo; Laboratory of Molecular Virology, Faculty of Sciences, University of the Republic; Center for Innovation in Epidemiological Surveillance, Pasteur Institute of Montevideo; Biochemistry Section, Faculty of Sciences, University of the Republic

AI summary This paper proposes an end-to-end deep learning method that uses segmentation models to count plaques and compute titers automatically from laboratory plate images, improving the efficiency and accuracy of virus infectivity assays.

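Details
AI abstract (translated from Chinese)

Plaque assays remain the gold standard for measuring virus infectivity, but counting plaques from plate images is tedious and subject to inter-operator variability. This paper presents an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The method combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells under heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and counts plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3 and Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform where users can review results and organize experiments. On test plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized well across two plate formats (6-well and 12-well) and agreed strongly with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were also compared with annotations from four independent experts and showed high concordance. The system will be open sourced and publicly released upon acceptance of the paper to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.

English abstract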

Plaque assays remain the gold standard readout of virus infectivity; however, plaque counting from plate images is labor-intensive and prone to inter-operator variability. We present an end-to-end, computer-aided workflow for cytopathic effect-based virus titration directly from laboratory plaque assay images. The proposed approach combines two models derived from the Segment Anything Model (SAM): a SAM2-based well-segmentation module that localizes assay wells across heterogeneous imaging conditions, and a SAM-based plaque-segmentation model that detects and enumerates plaques within each well. The method was evaluated on a mixed dataset comprising private plaque assay images of Mayaro virus and Coxsackievirus B3, together with public Vaccinia virus images from the VACVPlaque dataset. The pipeline outputs per-well plaque counts, automatically computes plaque-forming units per milliliter (PFU/mL), and is integrated into a web-based platform that allows users to review results and organize experiments. On held-out plates (17 from MAYV/CVB3 and 22 from VACV), the workflow generalized across two plate formats (6-well and 12-well) and showed strong agreement with manual annotations (Pearson correlation coefficients of 0.92 for MAYV/CVB3 and 0.88 for VACV). Automated plaque counts were further compared with annotations from four independent experts, demonstrating high concordance. The proposed system will be open sourced and publicly released upon acceptance of this manuscript to enable reproducible, scalable, and audit-ready plaque assay analysis while substantially reducing manual annotation effort.
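As a rough illustration of the titer step, the following Python sketch applies the conventional PFU/mL formula (plaque count divided by the product of dilution factor and inoculum volume) and computes a Pearson correlation between automated and manual counts. The function names, the example dilution, the inoculum volume, and the counts are illustrative assumptions; the abstract does not describe the authors' implementation.

```python
import math


def pfu_per_ml(plaque_count, dilution_factor, inoculum_volume_ml):
    """Conventional plaque-assay titer: plaques / (dilution * inoculum volume)."""
    if dilution_factor <= 0 or inoculum_volume_ml <= 0:
        raise ValueError("dilution factor and inoculum volume must be positive")
    return plaque_count / (dilution_factor * inoculum_volume_ml)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical well: 42 plaques at a 1e-5 dilution with 0.1 mL of inoculum.
titer = pfu_per_ml(42, 1e-5, 0.1)

# Hypothetical automated vs. manual counts for four wells.
agreement = pearson([10, 25, 40, 61], [12, 24, 43, 58])
```

In practice the dilution factor and inoculum volume come from the assay protocol, so a pipeline of this kind needs them as per-plate metadata alongside the segmented counts.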

2605.16003 2026-05-18 cs.CV version update

Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation

回声驱动:一种用于交互式长视频生成的场景记忆框架

Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li, Guoxin Fan, Xiaokun Liu, Zhulin An, Libo Huang, Yongjun Xu, Chuanguang Yang

Affiliations * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; China University of Mining & Technology, Beijing; ETH Zürich; City College of New York, City University of New York; Xiamen Institute of Data Intelligence, Xiamen, China

AI summary This paper proposes Echo-Forcing, a framework that uses hierarchical temporal memory, scene recall frames, and difference-aware memory decay to address the functional entanglement of historical KV states in long video generation, enabling smooth transitions and long-range scene recall in interactive video generation.

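Details
AI abstract (translated from Chinese)

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods focus mainly on stable extension under a single prompt and struggle with interactive scenarios that involve prompt switching, forgetting old scenes, and recalling historical scenes. We find that the core bottleneck is the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, which causes background contamination, delayed response to new prompts, and loss of long-range memory. We therefore propose Echo-Forcing, a training-free scene memory framework designed for interactive long video generation, with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows via relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the difference between old and new scenes. With these designs, Echo-Forcing supports smooth transitions, hard cuts, and long-range scene recall in a unified way under a limited cache budget. Extensive evaluation on VBench-Long further shows that Echo-Forcing achieves the best overall performance in both long video generation and interactive video generation settings. Our code is available at https://github.com/mingqiangWu/Echo-Forcing

English abstract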

Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, forgetting of old scenes, and recall of historical scenes. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to contamination from outdated backgrounds, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework designed specifically for interactive long video generation, with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing supports smooth transitions, hard cuts, and long-range scene recall in a unified manner under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing

2605.15997 2026-05-18 cs.CV version update

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

分割、检测与解释:一种用于CT外观推理的统一框架

Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang, J. Alison Noble

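Affiliations * Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford

AI summary This paper proposes a unified framework that integrates language-guided visual reasoning to improve the accuracy of CT image segmentation and detection while providing appearance reasoning outputs.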

Comments 8 pages, 4 figures, submitted to IEEE Transactions on Medical Imaging (TMI)

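Details
AI abstract (translated from Chinese)

Recent advances in deep learning have significantly advanced CT image analysis, especially for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, and most methods lack explicit anatomical or contextual reasoning. Large vision-language models bring linguistic context into image analysis, but most approaches focus on a single task, which falls short of clinical workflow analysis that requires several kinds of fine-grained analysis, such as anatomy detection and segmentation. This paper proposes a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads based on the hidden states of a large vision-language model, producing coherent visual outputs (e.g., masks and bounding boxes) and textual reasoning. To further improve localisation accuracy and semantic clarity, we further design

English abstract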

Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflow analysis that requires multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasoning. To progressively enhance localisation accuracy and semantic clarity, we further design a "closer-look" mechanism that allows the model to perform coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects, constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be made available.

2605.15967 2026-05-18 cs.AI cs.CV cs.LO version update

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

确定性事件-图子结构作为世界模型用于反事实推理

Fabio Rovai

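Affiliations * Tesseract Academy

AI summary This paper proposes event-graph substrates as world models that answer counterfactual queries by forking a log under a structured intervention vocabulary, proves a duality between explanatory and counterfactual queries, and evaluates a CLEVRER-DSL interpreter running on a domain-agnostic substrate runtime at CLEVRER validation scale.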

Comments 10 pages, 3 figures, 2 tables

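Details
AI abstract (translated from Chinese)

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactual queries, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter on a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate outperforms the NS-DR symbolic oracle on all four question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points) and outperforms the parametric ALOE baseline on descriptive and explanatory questions, while lagging on predictive and counterfactual ones. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points of joint accuracy.

English abstract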

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
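The idea of an append-only triple log that is forked to pose a counterfactual can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `Triple` and `EventLog` classes, the predicate names, and the "drop matching triples" intervention are hypothetical and are not taken from the paper, whose intervention vocabulary and runtime are not described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One typed event fact, e.g. (ball_1, collides_with, cube_2)."""
    subject: str
    predicate: str
    obj: str


class EventLog:
    """Append-only log: existing entries are never edited or removed in place."""

    def __init__(self, triples=()):
        self._triples = tuple(triples)

    def append(self, triple):
        # Returns a new log so that earlier snapshots stay valid.
        return EventLog(self._triples + (triple,))

    def fork(self, drop):
        """Counterfactual branch: a copy of the log without triples matching `drop`."""
        return EventLog(t for t in self._triples if not drop(t))

    def __contains__(self, triple):
        return triple in self._triples

    def __len__(self):
        return len(self._triples)


hit = Triple("ball_1", "collides_with", "cube_2")
move = Triple("cube_2", "starts_moving", "t=3")
factual = EventLog().append(hit).append(move)

# Hypothetical intervention "remove ball_1": drop every triple that mentions it.
counterfactual = factual.fork(lambda t: "ball_1" in (t.subject, t.obj))
```

The factual log is untouched by the fork, which is what makes the two branches comparable. A real substrate would also have to remove or re-derive the causal descendants of the dropped events (here, the cube starting to move); that traversal is omitted from this sketch.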

2605.15964 2026-05-18 cs.RO cs.CV version update

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

WorldVLN: 用于空域视觉-语言导航的自回归世界动作模型

Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang, Haoyang Wang, Shilong Ji, Ziyou Wang, Jianjie Fang, Zhiheng Zheng, Weichen Zhang, Yu Shang, Wei Wu, Chen Gao, Xinlei Chen, Yong Li

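Affiliations * Tsinghua University; Shandong University; Manifold AI; Beijing Institute of Technology; Northeastern University

AI summary WorldVLN is an autoregressive world action model that predicts latent world evolution and generates executable waypoint actions, improving aerial vision-language navigation performance over existing baselines.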

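Details
AI abstract (translated from Chinese)

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. This paper argues that aerial VLN can be viewed as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN uses a latent autoregressive video backbone to predict short-horizon world-state transitions and decodes them directly into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method for autoregressive WAMs, which optimizes waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN achieves success-rate gains of 12%+ and larger advantages on challenging cases. It also transfers zero-shot to real drone deployment, suggesting that WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

English abstract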

Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12\%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.

2605.15961 2026-05-18 cs.CV version update

Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models

稀疏自编码器使CLIP模型的鲁棒且可解释的微调成为可能

Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh

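Affiliations * University of Tübingen; KAIST

AI summary This paper proposes SAE-FT, which uses a sparse autoencoder to constrain changes to visual representations, enabling robust and interpretable fine-tuning of CLIP models that improves downstream performance while preserving robustness.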

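Details
AI abstract (translated from Chinese)

Large-scale pre-trained vision-language models such as CLIP perform well across a variety of tasks, but fine-tuning usually reduces robustness to distribution shifts. This paper proposes SAE-FT, a method that operates only on the model's visual representations and penalizes changes that add or remove semantically important features identified by a sparse autoencoder, which prevents catastrophic forgetting and makes fine-tuning interpretable. SAE-FT performs strongly on ImageNet and its associated distribution shift benchmarks and is computationally efficient.

English abstract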

Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.

2605.15951 2026-05-18 cs.CV version update

From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

从失败到反馈:群体修订解锁对象级 grounding 的难题

Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan, Can Peng, Jiajun Deng, Fengbei Liu, Junde Wu

Affiliations * Department of Engineering Science, University of Oxford; Australian Institute for Machine Learning, Adelaide University; Technical University of Munich; University of Science and Technology of China; Cornell University

AI summary This paper proposes a group-revision optimisation method that generates revised candidate responses to improve learning on hard cases and refines the reward and advantage to amplify the influence of high-quality revisions, outperforming existing GRPO-based methods.

Comments 8 pages, 5 figures, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

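Details
AI abstract (translated from Chinese)

Fine-tuning large vision-language models with reinforcement learning to strengthen object-level grounding has become a promising approach. However, existing methods, mainly based on GRPO, assign rewards at the response level, so the learning signal is sparse when all candidate responses fail in challenging scenarios. This paper proposes a group-revision optimisation paradigm that generates revised candidate responses to explore better grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method outperforms prior GRPO-based methods on referring and reasoning segmentation, REC, and counting benchmarks. Our code is available at https://github.com/yyliu01/GroupRevision.

English abstract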

Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate's improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
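The claim that response-level rewards give minimal signal when every candidate fails can be seen from a GRPO-style group-normalized advantage. The Python sketch below is an illustration only: the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the paper's own shaping and consolidation steps are not reproduced here.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All four sampled responses fail (reward 0): every advantage is zero.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# One response succeeds: the group now carries a usable signal.
one_success = group_advantages([0.0, 0.0, 0.0, 1.0])
```

With identical rewards the policy gradient weight of every response is zero, so a hard prompt contributes nothing to learning; any scheme that differentiates candidates within such a group, for example by scoring improvement over an initial attempt, restores a non-zero signal.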

2605.15942 2026-05-18 cs.CV cs.AI version update

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

分解式视觉-语言对齐用于细粒度开放词汇分割

Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

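Affiliations * Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhejiang University

AI summary This paper proposes a Decomposed Vision-Language Alignment framework that decomposes textual prompts into a concept token and multiple attribute tokens, improving generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation.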

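Details
AI abstract (translated from Chinese)

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are usually encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly decomposes textual prompts into a concept token and multiple attribute tokens, so that each semantic unit can interact across modalities separately. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information multiplicatively, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, yielding stable and interpretable compositional matching. The method integrates seamlessly into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions on fine-grained open-vocabulary segmentation benchmarks.

English abstract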

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
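Aggregating per-token similarities in log-space amounts to summing logarithms, which equals the log of the product and therefore behaves like a soft logical AND over the concept and its attributes. The Python sketch below illustrates this under assumptions: similarities are taken to lie in (0, 1], the `eps` floor and the example values are hypothetical, and the paper's exact scoring function may differ.

```python
import math


def log_space_score(similarities, eps=1e-6):
    """Sum of log-similarities; equals the log of their product."""
    return sum(math.log(max(s, eps)) for s in similarities)


# Hypothetical similarities for the prompt "red striped mug":
# one concept token ("mug") and two attribute tokens ("red", "striped").
all_match = log_space_score([0.9, 0.8, 0.85])
one_attribute_missing = log_space_score([0.9, 0.8, 0.05])
```

A single poorly matched attribute pulls the total down sharply, whereas an arithmetic mean of the same values would stay comparatively high; summing logs also avoids the numerical underflow that a direct product of many small values can cause.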

2605.15923 2026-05-18 cs.CV version update

Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction

Invaria:通过下一分辨率预测实现点云中的尺度和密度不变性

Chun-Peng Chang, Shaoxiang Wang, Alain Pagani, Dariu Gavrila, Holger Caesar

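Affiliations * TU Delft; DFKI

AI summary This paper proposes Invaria, an encoder that achieves scale and density invariance in point clouds through next-resolution prediction and receptive field calibration, improving generalization across resolutions.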

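Details
AI abstract (translated from Chinese)

Modern image encoders generalize well by decoupling semantic meaning from resolution, which has not yet been fully achieved in the 3D domain. This paper investigates why 3D point cloud encoders fail to generalize similarly and finds that existing models are highly sensitive to changes in sampling resolution and scale, which leads to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests that models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. Although our goal is not to generate high-resolution point clouds explicitly, we find that this training objective encourages the model to learn robust structural invariants. The resulting encoder achieves significant performance gains under resolution changes while remaining efficient through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0% higher mIoU at 3x lower resolution and a 20% improvement when the object scale is reduced by a factor of 3. These gains come with a 45% smaller model and an average 40% reduction in input tokens.

English abstract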

Modern image encoders achieve high generalization by decoupling semantic meaning from resolution, an ability yet to be fully realized in the 3D domain. We investigate the failure of 3D point cloud encoders to achieve similar generalization and find that existing models are highly sensitive to sampling resolution and scale changes, leading to significant performance degradation. This sensitivity is a major bottleneck for real-world deployment in robotics, as it suggests models overfit to specific quantization densities and object scales rather than learning invariant semantic features. To mitigate this dependency, we propose Invaria, a point cloud encoder that achieves scale and density invariance through next-resolution prediction and receptive field calibration. While our objective is not the explicit generation of high-resolution point clouds, we find that this training objective encourages the model to learn robust structural invariants. The resulting encoder achieves significant performance gains under resolution shifts while maintaining high efficiency through a compact model size and reduced token requirements. Specifically, on ScanNet, Invaria achieves a 56.0\% higher mIoU at 3$\times$ lower resolution and a 20\% improvement when the object scale is reduced by a factor of 3. These gains are achieved with a 45\% smaller model size and an average reduction of 40\% in input tokens.

2605.15921 2026-05-18 cs.CV version update

AdaEraser: Training-Free Object Removal via Adaptive Attention Suppression

AdaEraser:通过自适应注意力抑制实现无训练对象去除

Dingming Liu

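Affiliations * School of Computer Science, Peking University

AI summary This paper proposes AdaEraser, a framework that removes objects by dynamically modulating attention, addressing the loss of generation quality caused by indiscriminate attention suppression in training-free methods; experiments show strong performance on object removal.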

Comments Accepted by ICML 2026

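Details
AI abstract (translated from Chinese)

Object removal aims to eliminate specified objects from an image while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions in self-attention layers during image generation and use surrounding background information to restore the image. However, indiscriminately suppressing self-attention in the vacated regions degrades generation quality, because the model must simultaneously reconstruct background content in those regions. To resolve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention according to the estimated presence of the target object concept. By analyzing how self-attention maps evolve across denoising steps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout denoising, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments show that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.

English abstract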

Object removal aims to eliminate specified objects from images while plausibly inpainting the affected regions with background content. Current training-free methods typically block attention to object regions within self-attention layers during the image generation process, leveraging surrounding background information to restore the image. However, indiscriminate suppression of self-attention in the vacated areas can degrade generation quality, as the model must simultaneously reconstruct background content in these regions. To solve this conflict, we propose AdaEraser, an adaptive framework that dynamically modulates attention based on the estimated presence of target object concepts. Through analysis of self-attention map evolution across denoising timesteps before and during removal, we develop a token-wise adaptive attention suppression strategy. This approach enables progressive perception of object removal throughout the denoising process, with the suppression strength in self-attention layers adjusted adaptively. Extensive experiments demonstrate that AdaEraser achieves superior performance in object removal, outperforming even training-based methods.

2605.15916 2026-05-18 cs.LG cs.AI cs.CV version update

LoCO: Low-rank Compositional Rotation Fine-tuning

LoCO:低秩组合旋转微调

An Nguyen, Jaesik Choi, Anh Tong

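Affiliations * Korea University; KAIST; INEEJI

AI summary LoCO is a low-rank compositional rotation fine-tuning method that builds orthogonal transformations from low-rank skew-symmetric matrices for parameter-efficient fine-tuning; it applies to model adaptation across multiple domains and performs better than or comparably to existing orthogonal and non-orthogonal methods.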

Comments IJCAI 2026

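Details
AI abstract (translated from Chinese)

Parameter-efficient fine-tuning (PEFT) has become a key technique for adapting large-scale foundation models and is widely used in natural language processing and computer vision. Existing methods such as low-rank adaptation achieve parameter efficiency through low-rank weight updates, but they are limited in preserving the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations from low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that allows compositional rotations to be computed fully in parallel, making the method practical for high-dimensional feature spaces. Our method keeps computational complexity low while preserving orthogonality with controlled approximation error. We validate LoCO in diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method performs better than or comparably to existing orthogonal and non-orthogonal methods.

English abstract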

Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptation achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while preserving orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.
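For orientation, one textbook way to obtain an orthogonal matrix from a low-rank skew-symmetric one is via the matrix exponential, worked out below. The factors U and V and the choice of the exponential map are assumptions made for illustration; the abstract does not state LoCO's actual parametrization or its parallel approximation scheme.

```latex
% Assumed illustration: U, V \in \mathbb{R}^{d \times r} with r \ll d.
\begin{aligned}
A &= U V^{\top} - V U^{\top}, \qquad \operatorname{rank}(A) \le 2r, \\
A^{\top} &= V U^{\top} - U V^{\top} = -A, \\
Q &= \exp(A), \\
Q^{\top} Q &= \exp(A^{\top}) \exp(A) = \exp(-A) \exp(A) = I .
\end{aligned}
```

Because a product of orthogonal matrices is again orthogonal, a chain Q_1 Q_2 ... Q_k built this way stays orthogonal, which is the property a compositional rotation chain relies on; only 2dr parameters per factor are needed instead of d^2.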

2605.15908 2026-05-18 cs.CV cs.AI version update

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

RaPD:通过语义增强的隐式表示实现分辨率无关的像素扩散

Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

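Affiliations * College of Electronic and Information Engineering, Tongji University; vivo Mobile Communication Co., Ltd.; Nanjing University

AI summary RaPD uses Semantic Representation Guidance and a Coordinate-Queried Attention Renderer to perform resolution-agnostic pixel diffusion in the latent space of a continuous neural image field, closing the gap between reconstruction and generation and improving generation quality and resolution scalability.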

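Details
AI abstract (translated from Chinese)

Natural images are continuous, yet most generative models synthesize images on discrete grids, which limits resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD uses Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. By changing only the query coordinates, a single denoised latent can be rendered at arbitrary resolutions while the diffusion cost stays fixed. Experiments show that generation quality and resolution scalability both surpass existing methods.

English abstract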

Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.

2605.15906 2026-05-18 cs.CV version update

A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation

基于因果的图像退化鲁棒性评估分类学

Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens

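Affiliations * Fraunhofer IOSB Ettlingen

AI summary This paper proposes a causally grounded taxonomy of image degradations that uses a dual-axis abstraction to cover algorithmic corruptions, perceptual distortions, and physically motivated artifacts, and provides a unified severity measurement layer to improve comparability across datasets and tasks.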

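Details
AI abstract (translated from Chinese)

Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several fields, including synthetic corruption benchmarks, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these fields address closely related phenomena, they often use incompatible grouping schemes and backend-specific severity definitions, which makes results hard to compare across datasets, degradation sources, and tasks. This paper proposes a causally grounded framework for organizing and interpreting image degradations across these settings. We provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system) and its perceptual effect. This dual-axis abstraction yields a compact taxonomy covering algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full-reference image quality metrics (PSNR, SSIM, and LPIPS). This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework with COCO Degradation, a taxonomy-aligned benchmark for evaluating object detector robustness under diverse imaging conditions.

English abstract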

Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
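The severity measurement layer can be pictured as a table that maps each native severity level of a backend to a full-reference metric value. The Python sketch below does this for PSNR only, using its standard definition; the `brighten` backend, its level-to-offset mapping, and the four-pixel "image" are hypothetical stand-ins, and SSIM and LPIPS are left out because they need dedicated implementations.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def brighten(pixels, offset):
    """Hypothetical degradation backend: add a brightness offset, clipped to [0, 255]."""
    return [min(255, max(0, p + offset)) for p in pixels]


reference = [100, 120, 140, 160]
# Native severity levels of the hypothetical backend mapped to offsets.
native_levels = {1: 1, 2: 10, 3: 50}
severity_table = {
    level: psnr(reference, brighten(reference, offset))
    for level, offset in native_levels.items()
}
```

Reporting the measured value next to the native level leaves the backend's parameterization untouched while making "level 3" of one backend comparable with "level 3" of another.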

2605.15895 2026-05-18 eess.IV cs.CV version update

Layer Selection in Feature-Based Losses Affects Image Quality and Microstructural Consistency in Deep Learning Super-Resolution of Brain Diffusion MRI

基于特征的损失函数中层选择影响深度学习超分辨率中图像质量及微结构一致性

David Lohr, Rene Werner

Affiliations * Institute for Applied Medical Informatics, University Medical Center Hamburg-Eppendorf; Institute of Computational Neuroscience, University Medical Center Hamburg-Eppendorf; Center for Biomedical Artificial Intelligence (bAIome), University Medical Center Hamburg-Eppendorf

AI summary This study examines how feature-based loss functions affect diffusion signal consistency in deep learning super-resolution, finding that deeper network layers cause grid-like artifacts while the shallowest layer preserves consistency with the ground truth, even at 9-fold super-resolution.

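Details
AI abstract (translated from Chinese)

Clinical use of high-resolution diffusion MRI is hindered by hardware limitations and scan time, which motivates computational super-resolution. This study examines how effectively feature-based loss functions preserve diffusion signal consistency in deep learning super-resolution. Using 7T data from the Human Connectome Project to generate pairs of low- and high-resolution diffusion-weighted images (DWI), we trained UNets for 2D super-resolution. Ablation and isolation studies evaluated different VGG16 layers for feature-based losses against an image-based L1 baseline. Deeper layers and their combinations produced grid-like artifacts in super-resolved DWIs, and these artifacts persisted in diffusion parameters such as quantitative and fractional anisotropy. No such artifacts appeared when the shallowest layer was used. Downstream analysis for this layer showed high consistency with the ground truth, even at 9-fold super-resolution. Image SNR and the depth of the VGG16 layers used modulate the appearance and severity of artifacts, which calls for careful selection of contributing layers in diffusion MRI.

English abstract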

Clinical application of high-resolution diffusion MRI is hindered by hardware limitations and prohibitive scan times, motivating computational super-resolution. This study investigates the efficacy of a feature-based loss function in preserving diffusion signal consistency in deep learning super-resolution. Using 7T data from the Human Connectome Project to generate pairs of low- and high-resolution diffusion-weighted images (DWI), we trained UNets for 2D super-resolution. Ablation and isolation studies evaluated different VGG16 layers for feature-based losses against an image-based L1 baseline. Deeper layers and combinations thereof resulted in grid-like artifacts in super-resolution DWIs, which persisted in diffusion parameters like quantitative and fractional anisotropy. No such artifacts were present when using the shallowest layer. Downstream analysis for this layer showed high consistency with the ground truth, even for 9-fold super-resolution. Image SNR and the depth of the VGG16 layers used modulated artifact appearance and severity, which calls for careful selection of contributing layers in applications within and beyond diffusion MRI.

2605.15894 2026-05-18 cs.CV cs.AI version update

Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

基于CBAM增强的EfficientNet和证据深度学习的不确定性意识卫星图像野火烟密度分类

Ranjith Chodavarapu

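Affiliations * Kent State University

AI summary This paper proposes a probabilistic framework that uses a CBAM-augmented EfficientNet with evidential deep learning to classify smoke density in satellite imagery and provide decomposed epistemic and aleatoric uncertainty. The model reaches 93.8% weighted test accuracy on 16,298 real satellite image patches.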

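Details
AI abstract (translated from Chinese)

Rapid and accurate assessment of wildfire smoke severity is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task and produce point estimates without any measure of prediction confidence. We propose a probabilistic framework that classifies satellite image patches into Light, Moderate, and Heavy severity classes and provides decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses a pre-trained EfficientNet-B3 backbone combined with a CBAM module and an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction that retains the most certain 50% of patches reaches 96.7% accuracy. Uncertainty increases monotonically as image quality degrades, and vacuity serves as a practical measure of scan quality. The Moderate class, which represents transitional smoke conditions, exhibits the highest epistemic uncertainty (mean vacuity = 0.187), confirming that the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE shows clear cluster separation between Light and Heavy smoke.

English abstract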

Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.
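To make the uncertainty terms concrete, the Python sketch below computes vacuity from Dirichlet concentration parameters using the definition common in the evidential deep learning literature (number of classes divided by the sum of the parameters) and applies selective prediction by keeping the lowest-vacuity samples. Whether the paper uses exactly this definition is an assumption, the example values are hypothetical, and dissonance is omitted.

```python
def vacuity(alpha):
    """Vacuity u = K / S for Dirichlet parameters alpha (K classes, S = sum(alpha))."""
    if not alpha or any(a < 1 for a in alpha):
        raise ValueError("expected non-empty alpha with every alpha_k >= 1")
    return len(alpha) / sum(alpha)


def selective_accuracy(records, keep_fraction):
    """Accuracy on the `keep_fraction` of samples with the lowest vacuity.

    `records` is a list of (vacuity, is_correct) pairs.
    """
    ranked = sorted(records, key=lambda rec: rec[0])
    kept = ranked[: max(1, int(len(ranked) * keep_fraction))]
    return sum(1 for _, correct in kept if correct) / len(kept)


no_evidence = vacuity([1.0, 1.0, 1.0])   # no evidence for any class
confident = vacuity([10.0, 1.0, 1.0])    # strong evidence for one class

# Hypothetical predictions: (vacuity, prediction was correct).
records = [(0.10, True), (0.20, True), (0.80, False), (0.90, True)]
top_half = selective_accuracy(records, 0.5)
everything = selective_accuracy(records, 1.0)
```

Vacuity needs only the head's output from one forward pass, which is why no Monte Carlo sampling is required; discarding high-vacuity patches trades coverage for accuracy in the same way as the 50% retention result quoted above.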

2605.15880 2026-05-18 cs.CV cs.AI Updated version

FSCM: Frequency-Enhanced Spatial-Spectral Coupled Mamba for Infrared Hyperspectral Image Colorization

FSCM:频率增强的空间-频谱耦合Mamba用于红外超光谱图像着色

Tingting Liu, Yuan Liu, Guiping Chen, Xiubao Sui, Qian Chen

Affiliations * School of Electronic and Optical Engineering, Nanjing University of Science and Technology; School of Mechanical Engineering, University of Science and Technology Beijing; School of Instrument and Electronics, North University of China

AI summary This paper proposes FSCM, a framework that improves the visual quality and semantic consistency of infrared hyperspectral image colorization through a frequency-enhanced spatial-spectral state-space generator and a dual-stream hybrid gating module.

Details
AI Chinese abstract

热红外成像对光照变化和烟雾干扰具有鲁棒性,使其在全天候感知中具有重要价值。然而,缺乏自然色彩和精细纹理限制了目标识别、人类视觉解释和可见光模型的迁移。现有红外着色方法主要依赖单波段图像,不足的光谱线索可能导致结构失真和语义混淆。尽管红外超光谱图像提供丰富的光谱响应和材料信息,现有单波段框架在建模空间-频谱耦合和弱纹理细节方面仍有限。为了解决这些问题,本文提出了FSCM,一种光谱信息引导的GAN框架。在FSCM中,由级联FSB单元组成的频率增强空间-频谱状态空间生成器被构建。每个FSB集成了三个互补组件:状态空间建模捕捉全局空间-频谱依赖性;频率增强模块(FEM)结合多级小波分解和傅里叶门控以恢复结构轮廓、方向高频细节和全局频率响应;双流混合门控模块(DGM)整合变形感知采样与稀疏注意力以增强有效局部结构并抑制背景干扰。此外,引入了在线语义分割引导损失以约束生成结果,提高复杂道路场景中的语义一致性。实验表明,FSCM在视觉质量和语义保真度上优于现有红外着色方法。

English abstract

Thermal infrared imaging is robust to illumination variations and smoke interference, making it important for all-weather perception. However, the lack of natural color and fine texture limits target recognition, human visual interpretation, and the transfer of visible-light models. Existing infrared colorization methods mainly rely on single-band images, where insufficient spectral cues may lead to structural distortion and semantic confusion. Although infrared hyperspectral images provide rich spectral responses and material information, existing single-band frameworks remain limited in modeling spatial-spectral coupling and weak texture details. To address these issues, this paper presents FSCM, a spectral-information-guided GAN framework. Within FSCM, a frequency-enhanced spatial-spectral state-space generator composed of cascaded FSB units is constructed. Each FSB integrates three complementary components: state-space modeling captures global spatial-spectral dependencies; the frequency enhancement module (FEM) combines multi-level wavelet decomposition and Fourier gating to recover structural contours, directional high-frequency details, and global frequency responses; and the dual-stream hybrid gating module (DGM) integrates deformation-aware sampling with sparse attention to enhance effective local structures and suppress background interference. Additionally, an online semantic segmentation-guided loss is introduced to constrain the generated results, improving semantic consistency in complex road scenes. Experiments show that FSCM outperforms existing infrared colorization methods in visual quality and semantic fidelity.

2605.15868 2026-05-18 cs.CV Updated version

SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval

SOLAR:用于对称多模态检索的自监督联合学习

Wenjie Yang, Hang Yu, Yuyu Guo, Peng Di

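Affiliations * Ant Group

AI summary This paper proposes SOLAR, a self-supervised framework for symmetric multimodal retrieval that uses unlabeled data to improve model performance; experiments show that it outperforms existing methods on benchmarks.

Comments Accepted by ICML 2026

Details
AI Chinese abstract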

本文针对对称多模态到多模态检索问题,提出SOLAR框架,通过两阶段自监督学习利用未标记数据。第一阶段学习图像-文本对的交集掩码以对齐语义差异,第二阶段构造正负样本进行多模态嵌入学习。引入新基准测试评估性能,实验表明SOLAR在参数和嵌入维度上均优于现有方法。

English abstract

In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets they use. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage we learn an intersection mask for each image-text pair, which allows us to align the shared content while preserving the semantics of the differences. In the second stage, the learned mask is used to construct positive and hard-negative samples by masking different parts of the image or text, which enables self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.

2605.15860 2026-05-18 cs.CV Updated version

On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry

在极端分辨率不对称下的RGB-TIR立体校准

Michał Król, Michał Salamonowicz, Władysław Skarbek, Michał Tomaszewski

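Affiliations * Independent Researcher; Warsaw University of Technology; Polish Japanese Academy of Information Technology

AI summary This paper proposes a practical stereo calibration framework for RGB and TIR cameras that addresses calibration with low-resolution thermal sensors, achieves high-precision calibration through a dynamic OLED screen and a dedicated corner detection algorithm, and validates the system for building energy assessment.

Comments 27 pages, 12 figures, 3 tables

Details
AI Chinese abstract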

准确的RGB-热红外(TIR)立体摄像头系统的几何校准对于多模态建筑围护结构分析至关重要,但在使用低成本、极低空间分辨率的热传感器时仍具挑战性。本文提出了一种实用的立体校准框架,用于2028 x 1520像素的RGB摄像头与仅80 x 62像素的TIR摄像头配对,像素比例约为1:625。一个主动OLED屏幕动态切换模态特定的图案(TIR用棋盘格,RGB用ChArUco)于同一物理表面,提供可控且可重复的热对比度。一种结合透视校正、Hessian鞍点分析和Mean Shift局部定位的专用角点检测算法,在无需每帧参数调整的情况下,可靠地检测80 x 62像素的棋盘格。基线约束的束调整在平面校准物体退化下强制物理一致的立体几何,得到32.7毫米(名义30毫米)的立体基线,整体重投影误差为0.382像素。该系统在热活跃建筑模型上进行验证,使用恒定深度和每像素深度估计,证明了TIR到RGB投影的一致性,适用于建筑能耗评估。

English abstract

Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.

2605.15855 2026-05-18 cs.CV Updated version

Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

少即是多:我们是否需要为扩散模型的强化学习微调进行每一步优化?

Renye Yan, Jikang Cheng, Shikun Sun, Yi Sun, You Wu, Wei Peng, Zongwei Wang, Ling Liang, Junliang Xing, Yimao Cai

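Affiliations * Peking University; Tsinghua University; Nanjing University; Stanford University

AI summary This paper studies the effect of optimization steps in RL fine-tuning of diffusion models and proposes AdaScope, which improves generation quality and reduces computational cost by dynamically controlling when training is applied.

Journal ref CVPR2026

Details
AI Chinese abstract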

尽管在图像生成性能方面表现强大,扩散模型的重建目标限制了与人类偏好的对齐。强化学习通过显式奖励实现这种对齐。然而,大多数研究将强化学习应用于完整的去噪轨迹,使其计算成本高且削弱偏好对齐,即做更多却收获更少。我们观察到,在去噪阶段,强化学习微调的影响差异显著。在早期阶段,图像结构不稳定且远离最终奖励信号。在这一阶段应用强化学习会导致延迟奖励和动作-奖励不匹配,导致高方差和低效更新。相反,在后期阶段,奖励收益趋于饱和,持续训练容易导致过拟合局部细节,加剧奖励黑客问题。为解决这些挑战,我们提出了AdaScope,一种增强的强化学习插件,通过感知去噪过程中的结构演变和语义一致性,自适应地确定强化学习的最佳干预时间,并在去噪收敛和奖励收益饱和时动态终止训练。结果,它实现了罕见的'双重收益':计算成本的降低和显著的性能提升。我们为AdaScope的设计提供了理论依据。与最先进方法相比,AdaScope在性能上提高了66%,同时将计算成本降低了59%。

English abstract

Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.

2605.15843 2026-05-18 cs.CV Updated version

WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes

WorldAct:将单体3D世界激活为可交互的以对象为中心的场景

Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li, Wei Shen

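Affiliations * Shanghai Jiao Tong University; Huawei Inc

AI summary WorldAct uses a multimodal agent to decompose static generated 3D worlds into editable, interactive scenes that support object-level editing and task execution while preserving global coherence.

Comments Project page: https://sjtu-deepvisionlab.github.io/WorldAct

Details
AI Chinese abstract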

最近基于生成场景合成的3D世界建模系统,如Marble,能够创建连贯且可探索的3D环境,但其输出通常是静态的单体资产,具有有限的可编辑性和物理交互能力。这限制了其在沉浸式内容创作和具身模拟中的应用,其中生成的世界必须被主动修改和操控。为解决这一挑战,我们提出了WorldAct,一个将静态生成的3D世界转换为可编辑和交互准备的场景的框架。WorldAct使用多模态代理指导场景分解,识别可操作对象,重建几何对齐的对象级网格以供交互,并通过3D修复恢复残留背景。所得到的场景支持对象级编辑、碰撞感知的操作和具身任务执行,同时保持全局场景一致性。实验表明,WorldAct比原始生成场景支持更丰富的交互场景,表明了通往可编辑和交互3D世界模型的实用路径。

English abstract

Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.

2605.15835 2026-05-18 cs.CV Updated version

Community-aware evaluation and threshold calibration for open-set plankton image recognition

面向社区的开放集浮游生物图像识别评估与阈值校准

Xi Chen, Eryuan Huang, Yingjun Xiao, Gang Fang

Affiliations * School of Computer Science and Cyber Engineering, Guangzhou University; School of Environment, South China Normal University; School of Artificial Intelligence, Guangzhou University; Institute of Computing Science and Technology, Guangzhou University

AI summary This paper studies the mismatch between sample-level and community-level evaluation in open-set plankton recognition, proposes the Open-Set Community Distortion (OSCD) metric, experimentally verifies that community-aware threshold calibration reduces OSCD, and argues that open-set recognition should be treated as an ecological measurement problem.

Comments Manuscript. 14 figures/tables in total

Details
AI Chinese abstract

自动化的浮游生物图像识别在水生态系统监测中日益被使用,但部署的分类器不可避免地会遇到未见过的物种和非目标颗粒。开放集识别方法通常使用样本级指标如AUROC、AUPR和FPR@95%未知召回率操作点进行评估,而生态监测依赖于对物种丰度和多样性的社区级估计。本研究通过受控伪社区和三个涵盖海洋浮游动物、海洋浮游植物和淡水浮游生物的数据库,探讨了这些目标之间的不匹配。我们定义了Open-Set Community Distortion(OSCD),即Bray-Curtis风格的已知物种误差加上未知bin的误差,具有方向性成分,区分已知物种的高估和低估。闭集分类器在已知类准确率上表现高,但未知样本常被高信心吸收。样本级OOD指标不足以选择生态操作点:对于MSP,FPR@95%未知召回率阈值在所有三个数据集上产生了较大的测试社区OSCD,主要是因为真实已知物种被过度拒绝到未知bin中。社区感知阈值校准在SYKE-ZooScan 2024和SYKE-IFCB 2022上减少了MSP OSCD;在ZooLake上,固定召回率基线已接近社区感知阈值,最佳社区级方法是原型距离变体而非MSP。因此,社区感知校准的益处取决于验证社区的代表性以及固定召回率与社区最优之间的差距。这些结果表明,开放集浮游生物识别应被视为生态测量问题,而不仅仅是样本级检测任务。

English abstract

Automated plankton image recognition is increasingly used in aquatic ecosystem monitoring, but deployed classifiers inevitably encounter unseen taxa and non-target particles. Open-set recognition methods are usually evaluated with sample-level metrics such as AUROC, AUPR, and FPR@95% unknown-recall operating points, whereas ecological monitoring depends on community-level estimates of taxon abundance and diversity. This study examines the mismatch between these objectives using controlled pseudo-communities and three datasets spanning marine zooplankton imaged by ZooScan, marine phytoplankton imaged by IFCB, and freshwater plankton imaged by an in-situ camera. We define Open-Set Community Distortion (OSCD), a Bray-Curtis-style error over known taxa plus an unknown bin, with directional components distinguishing known-taxon overestimation from underestimation. Closed-set classifiers achieved high known-class accuracy, but unknown samples were often absorbed with high confidence and in structured ways. Sample-level OOD metrics were not sufficient to select ecological operating points: for MSP, FPR@95% unknown-recall thresholds produced large test-community OSCD on all three datasets mainly because true known taxa were over-rejected into the unknown bin. Community-aware threshold calibration reduced MSP OSCD relative to fixed 95% known recall on SYKE-ZooScan 2024 and SYKE-IFCB 2022; on ZooLake the fixed-recall baseline was already close to the community-aware threshold, and the best community-level method was a prototype-distance variant rather than MSP. The benefit of community-aware calibration therefore depends on validation-community representativeness and the gap between fixed recall and the community optimum. These results show that open-set plankton recognition should be evaluated as an ecological measurement problem, not only as a sample-level detection task.
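To make the idea of a Bray-Curtis-style error over known taxa plus an unknown bin concrete, here is a minimal sketch. The abstract does not give the exact OSCD normalization or how the directional components are defined, so the function `community_distortion`, its return keys, and the taxon names are illustrative assumptions rather than the authors' definition.

```python
def community_distortion(true_counts, pred_counts, unknown_key="unknown"):
    """Bray-Curtis-style distortion between a true and a predicted community.

    Both arguments map a label (known taxon or the unknown bin) to a count.
    The total is sum(|pred - true|) / sum(pred + true), split into known-taxon
    overestimation, known-taxon underestimation, and the unknown-bin error.
    """
    labels = set(true_counts) | set(pred_counts)
    total = sum(true_counts.get(k, 0) + pred_counts.get(k, 0) for k in labels)
    if total == 0:
        return {"total": 0.0, "known_over": 0.0, "known_under": 0.0, "unknown_bin": 0.0}

    over = 0
    under = 0
    for k in labels - {unknown_key}:
        diff = pred_counts.get(k, 0) - true_counts.get(k, 0)
        if diff > 0:
            over += diff        # known taxon overestimated
        else:
            under += -diff      # known taxon underestimated
    unknown_err = abs(pred_counts.get(unknown_key, 0) - true_counts.get(unknown_key, 0))

    return {
        "total": (over + under + unknown_err) / total,
        "known_over": over / total,
        "known_under": under / total,
        "unknown_bin": unknown_err / total,
    }
```

Sweeping a rejection threshold on validation communities and keeping the value that minimizes `"total"` would be one simple way to mimic community-aware calibration, under the same assumptions.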

2605.15816 2026-05-18 cs.GR cs.CV cs.LG Updated version

StippleDiffusion: Capacity-Constrained Stippling using Controlled Diffusion

StippleDiffusion:基于受控扩散的容量受限点绘制

Ofir Gilad, Aleksander Plocharski, Przemyslaw Musialski, Andrei Sharf

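Affiliations * Ben Gurion University of the Negev; Warsaw University of Technology; New Jersey Institute of Technology

AI summary This paper proposes a diffusion-based stippling method that combines a learned local point-distribution prior with a continuous capacity constraint to generate point sets efficiently and differentiably for arbitrary target densities.

Comments 12 pages, 10 figures

Details
AI Chinese abstract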

点绘制模式,即局部密度跟踪目标图像的点集,传统上由逐密度迭代优化器生成,速度慢且非可微,每次新目标需重新运行。学习替代方法至今仅能处理无条件点生成;容量受限、图像条件化的点绘制仍无法实现。我们提出了首个基于扩散的采样器,能够在推理时同时满足学习的局部点分布先验和连续的图像定义容量约束。该方法基于最优传输网格点集扩散基础线程,构建在ControlNet分支上,条件于目标密度图和高分辨率图像。两种设计选择使组合可行:训练和推理限制在后期去噪阶段,初始化自密度加权拒绝样本;标准零卷积注入被替换为sigmoid门控1x1投影,以在强密度信号下保持基础模型的蓝噪声结构。单个训练检查点在推理时接受任意目标密度,可泛化至训练时未见过的点预算,并在输出点数几乎无关的时间内生成点集。在Icons-50基准测试中,我们的学习采样器在所有报告的指标上与逐密度优化基线持平,且保持端到端可微。

English abstract

Stipple patterns, point sets whose local density tracks a target image, are traditionally produced by per-density iterative optimizers, which are slow, non-differentiable, and must be re-run from scratch for each new target. Learned alternatives have so far addressed only unconditional point generation; capacity-constrained, image-conditioned stippling has remained out of reach. We present the first diffusion-based sampler that simultaneously satisfies a learned local point-distribution prior and a continuous, image-defined capacity constraint at inference. The method is a ControlNet branch built on top of an optimal-transport-grid point-set diffusion baseline, conditioned on the target density map and a high-resolution image. Two design choices make the combination tractable: training and inference are restricted to the late-stage denoising regime, initialized from a density-weighted rejection sample, and the standard zero-convolution injection is replaced with a sigmoid-gated 1x1 projection that preserves the base model's blue-noise structure under hard density signals. A single trained checkpoint accepts arbitrary target densities at inference, generalizes to point budgets that were not seen during training, and produces stipples in time nearly independent of the output point count. On the Icons-50 benchmark, our learned sampler reaches parity with per-density-optimized baselines on every reported metric while remaining differentiable end-to-end.
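The density-weighted rejection sample used as an initialization can be sketched as follows. This is a generic rejection sampler on the unit square, not the authors' code: the function name, the callable `density(x, y)` interface, and the `max_density` bound are assumptions made for illustration.

```python
import random


def density_weighted_rejection_sample(density, n_points, max_density=1.0, seed=0):
    """Draw n_points in the unit square with probability proportional to density.

    `density(x, y)` must return a value in [0, max_density] and be positive
    somewhere, otherwise the loop never terminates.
    """
    rng = random.Random(seed)
    points = []
    while len(points) < n_points:
        x, y = rng.random(), rng.random()
        # Accept the candidate with probability density(x, y) / max_density.
        if rng.random() * max_density < density(x, y):
            points.append((x, y))
    return points
```

Such a sample already follows the target density but lacks blue-noise spacing, which is consistent with the abstract's description of starting the diffusion sampler in the late-stage denoising regime.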

2605.15803 2026-05-18 cs.CV cs.LG Updated version

Embedding-perturbed Exploration Preference Optimization for Flow Models

嵌入扰动探索偏好优化用于流模型

Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu, Xiu Li

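Affiliations * Tsinghua University; AMAP, Alibaba Group

AI summary This paper proposes the E²PO framework, which maintains optimization stability through embedding-level perturbation and improves the alignment of generative models with human preferences.

Comments Accepted by ICML 2026

Details
AI Chinese abstract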

最近的进展已将强化学习(RL)确立为对齐生成模型与人类意图的关键范式。然而,基于群体的优化框架(如GRPO)面临关键限制:群体内方差的快速衰减。随着群体内样本的差异性降低,方差趋于零。这消除了优化所需的信号,使过程不稳定,迫使策略提前停滞或奖励黑客。现有策略如改变初始噪声或增加群体大小往往无法解决这一根本问题,导致训练不稳定或收益递减。为克服这些挑战,我们提出嵌入扰动探索偏好优化(E²PO),一种通过嵌入层面扰动维持优化的新型框架。我们的方法在样本群体内引入结构化扰动,保证了鲁棒的方差,从而在训练过程中保持判别信号。大量实验表明,我们的方法显著优于现有最佳基线,实现了更忠实的人类偏好对齐。

English abstract

Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E$^2$PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
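The variance-collapse problem described above can be seen directly in the group-normalized advantage commonly used by GRPO-style methods. The sketch below shows only that failure mode under the usual mean/standard-deviation normalization; it does not reproduce the E$^2$PO perturbation itself, and the function name and `eps` value are assumptions.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) within one group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# A diverse group gives a usable signal; a collapsed group gives none.
diverse = group_advantages([0.0, 1.0])
collapsed = group_advantages([0.5, 0.5, 0.5, 0.5])
```

When every sample in a group earns the same reward, all advantages are zero and the policy gradient for that group vanishes, which is the situation the embedding-level perturbations are meant to prevent.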

2605.15792 2026-05-18 cs.CV Updated version

Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models

反向流动:大型多模态模型中的生成到理解协同效应

Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang, Zhanyu Ma

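Affiliations * Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance

AI summary This paper proposes Generation-to-Understanding synergy, which improves multimodal understanding by using generation as an intermediate reasoning step, and reveals the bidirectional relationship between generation and understanding as well as the limits of model self-reflection.

Comments Accepted by CVPR 2026 Findings

Details
AI Chinese abstract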

多模态AI长期目标是构建统一模型,使视觉理解和生成相互促进。尽管BAGEL和BLIP3o取得进展,但统一仍单向:理解指导生成,而生成如何支持理解未被研究。本文提出生成到理解协同效应,使视觉生成成为显式推理步骤,通过生成细节增强、上下文扩展等可控生成行为产生自生成视觉思考,并反馈至模型以优化感知,无需重新训练。在十二个基准测试中,反向信息流一致提升多模态理解。研究显示生成保真度限制感知增益,不同编辑提示家族影响迁移效率。进一步分析模型能否决定想象内容,尽管模型可生成合理编辑,但自生成视觉思考缺乏稳定任务对齐,揭示当前大多模态模型未达到真实自我反思。本文揭示统一认知缺失机制,表明想象并非理解终点,而是其起点。

English abstract

The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.

2605.15120 2026-05-18 cs.RO cs.AI cs.CV Updated version

CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning

CLOVER:端到端自动驾驶规划的闭环价值估计与排序

Sining Ang, Yuguang Yang, Canyu Chen, Yan Wang

Affiliations * Department of Automation, University of Science and Technology of China; Institute for AI Industry Research, Tsinghua University; School of Electronic Information Engineering, Beihang University; National College for Excellent Engineers, Beihang University

AI summary CLOVER is a closed-loop value estimation and ranking framework that addresses the training-evaluation mismatch in end-to-end autonomous driving planning, using a lightweight generator-scorer architecture to improve planner performance and rank candidate trajectories more accurately.

Details
AI Chinese abstract

端到端自动驾驶规划器通常通过模仿单条记录轨迹进行训练,但通过基于规则的规划指标进行评估,这导致了训练与评估之间的不匹配:接近记录路径的轨迹可能违反规划规则,而偏离记录路径的替代方案可能仍有效且得分高。这种不匹配对提案选择规划器尤其限制,因为其性能依赖于候选集覆盖和评分器排序质量。我们提出了CLOVER,一种用于端到端自动驾驶规划的闭环价值估计与排序框架。CLOVER采用轻量级生成器-评分器架构:生成器产生多样化的候选轨迹,评分器预测规划指标子分数以在推理时对它们进行排序。为了扩展提案支持超越单轨迹模仿,CLOVER构建了评估器过滤的伪专家轨迹,并通过集级别覆盖监督训练生成器。然后,它执行保守的闭环自我蒸馏:评分器被拟合到生成的提案上的真实评估子分数,而生成器则通过稳定性正则化向教师选择的前k和向量帕累托目标进行细化。我们分析了当评分器不完美时如何改进生成器,证明了当评分器选择的目标在真实评估下得到丰富且更新保持保守时,评分器介导的细化是可靠的。在NAVSIM上,CLOVER实现了94.5 PDMS和90.4 EPDMS,建立了新的状态。在更具挑战性的NavHard分割上,它获得了48.3 EPDMS,与最强报告结果相匹配。在补充的nuScenes开环评估中,CLOVER在比较方法中实现了最低的L2误差和碰撞率。代码数据将在https://github.com/WilliamXuanYu/CLOVER上发布。

English abstract

End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training-evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator-scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top-$k$ and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at https://github.com/WilliamXuanYu/CLOVER.
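The generator-scorer formulation and the top-$k$ / vector-Pareto targets can be pictured with a toy example. Everything below is hypothetical: the sub-score names, the aggregation rule in `aggregate_score`, and the candidate values are placeholders, since the abstract does not specify how CLOVER combines sub-scores.

```python
def aggregate_score(sub):
    # Hypothetical aggregation: a hard safety gate times the mean of soft terms.
    return sub["safety"] * 0.5 * (sub["progress"] + sub["comfort"])


def rank_top_k(candidates, k):
    """Return the k candidates with the highest aggregated predicted score."""
    return sorted(candidates, key=lambda c: aggregate_score(c["sub"]), reverse=True)[:k]


def dominates(a, b):
    """True if sub-score vector a is at least as good as b everywhere and better somewhere."""
    return all(a[m] >= b[m] for m in a) and any(a[m] > b[m] for m in a)


def pareto_front(candidates):
    """Keep candidates whose sub-score vector is not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o["sub"], c["sub"]) for o in candidates)]


candidates = [
    {"name": "A", "sub": {"safety": 1.0, "progress": 0.6, "comfort": 0.9}},
    {"name": "B", "sub": {"safety": 1.0, "progress": 0.9, "comfort": 0.5}},
    {"name": "C", "sub": {"safety": 0.0, "progress": 1.0, "comfort": 1.0}},
    {"name": "D", "sub": {"safety": 1.0, "progress": 0.5, "comfort": 0.5}},
]
```

The two selections differ on purpose: scalar ranking discards the unsafe candidate C, whereas the Pareto set keeps it because no other candidate beats it on every sub-score, which suggests why a scalar gate and a vector criterion would be used together.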

2605.13169 2026-05-18 cs.CV cs.AI Updated version

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

PanoWorld:迈向360度全景世界的空间超感知

Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen

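Affiliations * Zhejiang University; University of California, San Diego; University of California, Irvine; The University of Hong Kong

AI summary This paper proposes PanoWorld, which builds pano-native understanding to address the spatial perception limits of conventional multimodal large models, improves 3D spatial reasoning through a Spherical Spatial Cross-Attention mechanism, and introduces the PanoSpace-Bench benchmark to verify the effectiveness of pano-native supervision.

Comments Project page: https://wcpcp.github.io/PanoWorld

Details
AI Chinese abstract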

多模态大实验室模型(MLLMs)在主导视角图像范式下仍难以实现空间理解,继承了人类感知的窄视野。为导航、机器人搜索和3D场景理解,360度全景感知通过一次性捕捉整个周围环境提供超感知。然而,现有MLLM流程通常将全景分解为多个视角,使等距投影(ERP)的球形结构隐含。本文研究全景原生理解,要求MLLM在ERP全景上作为连续的观察者中心空间进行推理。为此,我们首先定义了全景原生理解的关键能力,包括语义锚定、球形定位、参考框架转换和深度感知的3D空间推理。然后构建大规模元数据构造流程,将混合源ERP全景转换为几何感知、语言引导和深度感知的监督,并将这些信号作为能力对齐的指令微调数据。在模型方面,我们引入具有球形空间交叉注意力的PanoWorld,将球形几何注入视觉流。我们进一步构建PanoSpace-Bench,一个评估ERP原生空间推理的诊断基准。实验表明,PanoWorld在PanoSpace-Bench、H* Bench和R2R-CE Val-Unseen基准上显著优于专有和开源基线。这些结果表明,稳健的全景推理需要专门的全景原生监督和几何感知的模型适应。所有源代码和提出的数据将公开发布。

English abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

2605.12309 2026-05-18 cs.CV Updated version

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

G$^2$TR: 基于生成的视觉标记减少方法用于分离编码统一多模态模型

Junxian Li, Kai Liu, Zizhong Ding, Zhixin Wang, Zhikai Chen, Renjing Pei, Yulun Zhang

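Affiliations * Shanghai Jiao Tong University; Huawei Technologies Ltd

AI summary This paper proposes G$^2$TR, which uses signals from the generation branch to reduce visual tokens in multimodal models, improving efficiency while maintaining performance; experiments show strong results on image understanding and editing tasks.

Details
AI Chinese abstract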

单独编码统一多模态模型(UMMs)的发展伴随着由于密集视觉标记处理而迅速增长的推理成本。本文聚焦于理解侧的视觉标记减少以提高单独编码UMMs的效率。尽管该主题在MLLMs中已被广泛研究,现有方法通常依赖于注意力分数、文本-图像相似性等,隐含假设最终目标是判别推理。这一假设不适用于UMMs,其中理解侧的视觉标记必须保留模型对图像编辑的能力。我们提出G$^2$TR,一种用于单独编码UMMs的生成引导视觉标记减少框架。我们的关键见解是生成分支提供了一个任务无关的信号,用于识别不仅语义相关但对潜在空间图像重建和生成也重要的理解侧视觉标记。G$^2$TR通过估计与VAE潜在一致性来估计标记重要性,进行平衡的标记选择,并将冗余标记合并到保留的代表中以减少信息损失。该方法是训练无关的,即插即用的,并且仅在理解编码阶段之后应用,使其兼容现有的UMM推理流程。在图像理解和编辑基准上的实验表明,G$^2$TR显著减少了视觉标记和prefill计算,减少了1.94倍,同时保持推理准确性和编辑质量,在几乎所有基准上优于基线。代码地址:https://github.com/lijunxian111/G2TR。

English abstract

The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity, and similar cues, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's image-editing capabilities. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latents, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks. Code is at: https://github.com/lijunxian111/G2TR.

2605.10867 2026-05-18 cs.CR cs.AI cs.CV cs.LG cs.NI Updated version

BEACON: A Multimodal Dataset for Learning Behavioral Fingerprints from Gameplay Data

BEACON:一个用于从游戏数据中学习行为指纹的多模态数据集

Ishpuneet Singh, Gursmeep Kaur, Uday Pratap Singh Atwal, Guramrit Singh, Gurjot Singh, Maninder Singh

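AI summary The BEACON dataset draws on gameplay that demands high-precision motor skills and high cognitive load to provide a rigorous stress test for the robustness of behavioral biometrics, supporting continuous authentication, behavioral modeling, and multimodal learning.

Details
AI Chinese abstract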

在高风险数字环境中,连续认证需要具有细粒度行为信号的高质量数据集,但现有基准往往受限于规模小、单模态传感或缺乏同步环境上下文。为此,本文引入BEACON(行为认证与连续监控行为引擎),一个大规模多模态数据集,捕捉竞技Valorant游戏中的多样化技能层级。BEACON包含约430GB同步多模态数据(461GB总存储量,包括辅助Valorant配置捕获),来自79个会话的28名不同玩家,估计102.51小时的活跃游戏时间,包括高频鼠标动态、按键事件、网络数据包捕获、屏幕录制、硬件元数据和游戏内配置上下文。BEACON利用战术射击游戏固有的高精度运动技能和高认知负荷,使其成为评估行为生物特征鲁棒性的严格压力测试。该数据集允许在高保真的电子竞技环境中研究连续认证、行为建模、用户漂移和多模态表示学习。作者在Hugging Face和GitHub上发布数据集和代码,以创建可重复的基准,用于评估下一代行为指纹和安全模型。

English abstract

Continuous authentication in high-stakes digital environments requires datasets with fine-grained behavioral signals under realistic cognitive and motor demands, but current benchmarks are often limited by small scale, unimodal sensing, or a lack of synchronised environmental context. To address this gap, this paper introduces BEACON (Behavioral Engine for Authentication & Continuous Monitoring), a large-scale multimodal dataset that captures diverse skill tiers in competitive Valorant gameplay. BEACON contains approximately 430 GB of synchronised modality data (461 GB total on-disk including auxiliary Valorant configuration captures) from 79 sessions across 28 distinct players, estimated at 102.51 hours of active gameplay, including high-frequency mouse dynamics, keystroke events, network packet captures, screen recordings, hardware metadata, and in-game configuration context. BEACON leverages the high-precision motor skills and high cognitive load that are inherent to tactical shooters, making it a rigorous stress test for the robustness of behavioral biometrics. The dataset allows for the study of continuous authentication, behavioral profiling, user drift, and multimodal representation learning in a high-fidelity esports setting. The authors release the dataset and code on Hugging Face and GitHub to create a reproducible benchmark for evaluating next-generation behavioral fingerprinting and security models.

2605.10100 2026-05-18 cs.CV cs.AI Updated version

HYPERPOSE: Hyperbolic Kinematic Phase-Space Attention for 3D Human Pose Estimation

HYPERPOSE:超几何运动相空间注意力用于3D人体姿态估计

Vinduja Thekkath, Ashish Musale, Ajay Waghumbare, Upasna Singh

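AI summary HYPERPOSE is a 3D human pose estimation framework that performs spatio-temporal reasoning in hyperbolic space, using a Hyperbolic Kinematic Phase-Space Attention mechanism to preserve the tree structure of the human skeleton and improve geometric fidelity and temporal dynamics modeling.

Details
AI Chinese abstract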

我们引入HYPERPOSE,一种新颖的3D人体姿态估计框架,其通过在洛伦兹模型的双曲空间$\mathbb{H}^d$中进行时空推理,原生保持人体骨骼的层次树状拓扑结构。当前最先进的姿态估计器依赖于transformers和图卷积网络来捕捉复杂的关节动态,但这些架构仅在欧几里得空间中操作,与人体固有的树状结构根本不匹配,导致指数体积扭曲和结构不一致。为此,我们脱离平坦空间,引入超几何运动相空间注意力(HKPSA)机制,原生嵌入复杂关节关系,同时结合多尺度窗口双曲注意力机制,以$O(TW)$复杂度高效建模时间动态。此外,为克服非欧几里得流形训练的已知不稳定性,HYPERPOSE引入新的黎曼损失套件和不确定性加权课程学习,强制物理测地线约束,如骨骼长度和速度一致性。在Human3.6M和MPI-INF-3DHP数据集上的广泛评估表明,HYPERPOSE在结构和时间一致性上达到最先进的水平,显著减少体积扭曲和速度误差,同时在整体位置准确性上建立新的最先进基准。

英文摘要

We introduce HYPERPOSE, a novel 3D human pose estimation framework that performs spatio-temporal reasoning entirely within the Lorentz model of hyperbolic space $\mathbb{H}^d$ to natively preserve the hierarchical tree topology of the human skeleton. Current state-of-the-art pose estimators aim to capture complex joint dynamics by relying on transformers and graph convolutional networks. Since these architectures operate exclusively in Euclidean space which fundamentally mismatches the inherent tree structure of the human body, these methods inevitably suffer from exponential volume distortion and struggle to maintain structural coherence. To this end, we depart from flat spaces and aim to improve geometric fidelity with Hyperbolic Kinematic Phase-Space Attention (HKPSA), natively embedding complex joint relationships without distortion, alongside a multi-scale windowed hyperbolic attention mechanism that efficiently models temporal dynamics in $O(TW)$ complexity. Furthermore, to overcome the well-known instability of training non-Euclidean manifolds, HYPERPOSE introduces a novel Riemannian loss suite and an uncertainty-weighted curriculum, enforcing physical geodesic constraints like bone length and velocity consistency. Extensive evaluations on the Human3.6M and MPI-INF-3DHP datasets demonstrate that HYPERPOSE achieves state-of-the-art structural and temporal coherence, significantly reducing both volume distortion and velocity error, while establishing new state-of-the-art benchmarks in overall positional accuracy.
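
The Lorentz model named in the abstract can be illustrated with a few lines of standard hyperbolic geometry. The sketch below is not taken from the paper: it assumes curvature -1, and the function names (`lorentz_inner`, `exp_map_origin`, `lorentz_distance`) are hypothetical helpers chosen for illustration only.

```python
import math

def lorentz_inner(x, y):
    """Minkowski inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Map a tangent vector v in R^d at the origin onto the hyperboloid (curvature -1 assumed)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance d(x, y) = arcosh(-<x, y>_L)."""
    # The clamp guards against arguments slightly below 1 caused by rounding.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Every point produced by exp_map_origin satisfies <p, p>_L = -1,
# and its distance to the origin equals the norm of the tangent vector.
p = exp_map_origin([0.3, -0.4])
origin = exp_map_origin([0.0, 0.0])
print(lorentz_inner(p, p), lorentz_distance(origin, p))
```

Distances grow exponentially with depth in this geometry, which is the usual argument for embedding tree-like structures such as a skeleton hierarchy there.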

2605.09231 2026-05-18 cs.CV stat.ML updated version

An Elastic Shape Variational Autoencoder for Skeleton Pose Trajectories

Arafat Rahman, Shashwat Kumar, Laura E. Barnes, Anuj Srivastava

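Affiliations * Systems and Information Engineering, University of Virginia; Biomedical Engineering, Johns Hopkins University; Dept. of Applied Mathematics and Statistics, Johns Hopkins University

AI summary This paper proposes ES-VAE, a generative model of skeletal trajectories learned on Kendall's shape manifold via the transported square-root velocity field representation; it effectively isolates shape dynamics and outperforms standard VAEs and sequence modeling baselines in gait analysis and action recognition.

Comments 9 pages

Details
AI abstract

Deep generative models provide flexible frameworks for modeling complex structured data such as images, videos, 3D objects, and text. However, when applied to human skeleton sequences, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors such as camera orientation, subject size, viewpoint, and execution speed rather than to the intrinsic geometry of shape and motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that uses the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translation, rotation, and global scaling of shapes as well as temporal rate variation of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space using the Riemannian logarithm map, while the decoder reconstructs sequences with the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. In both settings, ES-VAE outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, Transformers, and graph convolutional networks. More broadly, ES-VAE provides a systematic framework for learning generative models on pose shape manifolds, offering improved latent representations and downstream performance compared with existing deep learning approaches.

English abstract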

Deep generative models provide flexible frameworks for modeling complex, structured data such as images, videos, 3D objects, and texts. However, when applied to sequences of human skeletons, standard variational autoencoders (VAEs) often allocate substantial capacity to nuisance factors, such as camera orientation, subject scale, viewpoint, and execution speed, rather than the intrinsic geometry of shapes and their motion. We propose the Elastic Shape Variational Autoencoder (ES-VAE), a geometry-aware generative model for skeletal trajectories that leverages the transported square-root velocity field (TSRVF) representation on Kendall's shape manifold. This representation inherently removes rigid translations, rotations, and global scaling of shapes, and temporal rate variability of sequences, isolating the underlying shape dynamics. The ES-VAE encoder maps skeletal sequences to a low-dimensional latent space incorporating the Riemannian logarithm map, while the decoder reconstructs sequences using the corresponding exponential map. We demonstrate the effectiveness of ES-VAE on two datasets. First, we analyze skeletal gait cycles to predict clinical mobility scores and classify subjects into healthy and post-stroke groups. Second, we evaluate action recognition on the NTU RGB+D dataset. Across both settings, ES-VAE consistently outperforms standard VAEs and a range of sequence modeling baselines, including temporal convolutional networks, transformers, and graph convolutional networks. More broadly, ES-VAE provides a principled framework for learning generative models of longitudinal data on pose shape manifolds, offering improved latent representation and downstream performance compared to existing deep learning approaches.
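
The removal of translation and global scale mentioned in the abstract corresponds to the standard projection of a landmark configuration onto Kendall's pre-shape sphere. The following is a minimal sketch of that one step, assuming a skeleton given as a list of joint coordinates; `to_preshape` is a hypothetical name, and rotation removal (Procrustes alignment) and the TSRVF itself are deliberately left out.

```python
import math

def to_preshape(landmarks):
    """Center landmarks at the origin and scale them to unit Frobenius norm."""
    k = len(landmarks)
    dim = len(landmarks[0])
    # Removing the centroid discards translation.
    centroid = [sum(p[d] for p in landmarks) / k for d in range(dim)]
    centered = [[p[d] - centroid[d] for d in range(dim)] for p in landmarks]
    # Dividing by the centroid size discards global scale.
    size = math.sqrt(sum(c * c for p in centered for c in p))
    if size == 0.0:
        raise ValueError("all landmarks coincide; shape is undefined")
    return [[c / size for c in p] for p in centered]

# A skeleton and a shifted, uniformly scaled copy map to the same pre-shape.
skeleton = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
moved = [[2.0 * c + 5.0 for c in p] for p in skeleton]
print(to_preshape(skeleton)[0], to_preshape(moved)[0])
```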

2605.06475 2026-05-18 cs.AI cs.CV updated version

Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features

Ranjith Chodavarapu

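Affiliations * Kent State University

AI summary This paper proposes a deep regression method based on visual features for dating historical manuscripts, improving prediction accuracy through uncertainty decomposition; experiments show strong test-set performance.

Details
AI abstract

We present a probabilistic approach for dating historical manuscript pages from visual features alone. Unlike previous work that aggregates centuries into classes, we treat dating as an evidential deep regression problem over a continuous year axis, so that the neural network outputs a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model achieves a test MAE of 5.4 years, well below the 50-year granularity of century-label supervision, with 93% of patches within 5 years and 97% within 10 years. Our method reaches a calibration of PICP=92.6% in a single forward pass, better than MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models), at 5 times lower inference cost. Uncertainty decomposition shows that aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, that spatial decomposition maps explain which script regions cause aleatoric uncertainty, and that page-level aggregation reduces MAE to 4.5 years, with a correlation of ρ=0.905 between uncertainty and page-level error.

English abstract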

We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93\% of patches within 5 years and 97\% within 10 years. Our approach achieves \textbf{PICP=92.6\%}, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2\%, 50 passes) and Deep Ensembles (PICP=79.7\%, 5 models) at $5\times$ lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman $ρ=0.729$), and a selective prediction about the most certain 20\% of patches can provide \textbf{0.5 years MAE}. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with $ρ=0.905$ between uncertainty and page-level error.
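
For readers unfamiliar with evidential regression, the uncertainty decomposition and the PICP calibration metric can be sketched as follows. This uses the commonly cited closed forms for a Normal-Inverse-Gamma head (prediction γ, aleatoric β/(α−1), epistemic β/(ν(α−1))); whether the paper uses exactly this parameterization is an assumption, and both function names and the numbers in the usage lines are illustrative, not results from the paper.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Return (prediction, aleatoric variance, epistemic variance) for NIG parameters."""
    if alpha <= 1.0 or nu <= 0.0 or beta <= 0.0:
        raise ValueError("requires alpha > 1, nu > 0, beta > 0")
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: noise inherent in the data
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: uncertainty of the model itself
    return gamma, aleatoric, epistemic

def picp(targets, lowers, uppers):
    """Prediction Interval Coverage Probability: share of targets inside their interval."""
    hits = sum(1 for t, lo, hi in zip(targets, lowers, uppers) if lo <= t <= hi)
    return hits / len(targets)

# Illustrative values only.
print(nig_uncertainty(1250.0, 2.0, 3.0, 8.0))
print(picp([1, 2, 3, 4], [0, 0, 0, 0], [2, 2, 2, 5]))
```

A PICP close to the nominal interval level (for example 95%) indicates well-calibrated intervals; lower values indicate overconfidence.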

2605.00934 2026-05-18 cs.LG cs.CV stat.ML updated version

Structured Analytic Coherent Point Drift for Non-Rigid Point Set Registration

Wei Feng, Haiyong Zheng

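Affiliations * College of Electronic Engineering, Ocean University of China

AI summary This paper proposes Analytic-CPD, which improves on conventional CPD with structured analytic mappings to achieve more efficient and controllable non-rigid point set registration; experiments verify its effectiveness and its accuracy-efficiency advantage on different datasets.

Comments Revised version. Supplementary material incorporated as appendices; method, implementation, and experimental details expanded

Details
AI abstract

Coherent Point Drift (CPD) is a probabilistic framework for unsupervised non-rigid point set registration. Its standard non-rigid M-step, however, relies on a point-indexed Gaussian-kernel system whose size grows with the number of moving points, which makes deformation estimation computationally heavy for large point sets and makes complexity hard to control. To address these limitations, we propose Analytic-CPD, a new unsupervised non-rigid registration framework that gives CPD a structured analytic reformulation. Analytic-CPD keeps the CPD posterior correspondence layer but lifts the M-step from point-indexed kernel displacement estimation to structured analytic mapping estimation. By coupling the Gaussian-mixture posterior mechanism of CPD with Structured Analytic Mappings (SAM), the method obtains a deformation model whose coefficient dimension is determined by the ambient dimension and analytic order rather than by the number of moving points. More importantly, deformation estimation is organized over an interpretable hierarchy of analytic function spaces, so the analytic order can be raised progressively as posterior correspondences become more reliable. We implement this idea through an increasing-degree continuation strategy with decreasing stage lengths: low-order analytic maps first stabilize the posterior correspondence structure, and higher-order modes then refine the nonlinear residual deformation. Experiments on controlled model-matched, smooth model-mismatch, and registered human-shape data verify the effectiveness and superior accuracy-efficiency performance of Analytic-CPD.

English abstract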

Coherent Point Drift (CPD) is a representative probabilistic framework for unsupervised non-rigid point set registration. Its standard non-rigid M-step, however, relies on a point-indexed Gaussian-kernel system whose size grows with the number of moving points, making deformation estimation computationally heavy for large point sets and difficult to control in complexity during registration. To address these limitations, we propose Analytic-CPD, a new unsupervised non-rigid registration framework that gives CPD a structured analytic reformulation. Analytic-CPD preserves the CPD posterior correspondence layer, but lifts the M-step from point-indexed kernel displacement estimation to structured analytic mapping estimation. By coupling the Gaussian-mixture posterior mechanism of CPD with Structured Analytic Mappings (SAM), the method obtains a deformation model whose coefficient dimension is governed by the ambient dimension and analytic order rather than by the number of moving points. More importantly, deformation estimation is organized over an interpretable hierarchy of analytic function spaces, so the analytic order can be increased progressively as posterior correspondences become more reliable. We implement this idea through an increasing-degree continuation strategy with decreasing stage lengths: low-order analytic maps first stabilize the posterior correspondence structure, while higher-order modes later refine nonlinear residual deformation. Experiments on controlled model-matched, smooth model-mismatch, and registered human-shape data demonstrate the effectiveness and favorable accuracy--efficiency performance of Analytic-CPD.
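
The "posterior correspondence layer" that Analytic-CPD is said to preserve is the E-step of classical CPD. A minimal sketch of that E-step follows, written from the standard CPD formulation rather than from this paper; `cpd_posterior` is a hypothetical name, `Y` is assumed to hold the already-transformed moving points, and `w` is the usual outlier weight in [0, 1).

```python
import math

def cpd_posterior(X, Y, sigma2, w=0.0):
    """Return P with P[m][n] = posterior that moving point Y[m] generated target point X[n]."""
    if not 0.0 <= w < 1.0:
        raise ValueError("w must lie in [0, 1)")
    N, M, D = len(X), len(Y), len(X[0])
    # Constant contributed by the uniform outlier component of the mixture.
    c = (2.0 * math.pi * sigma2) ** (D / 2.0) * (w / (1.0 - w)) * (M / N)
    P = [[0.0] * N for _ in range(M)]
    for n, x in enumerate(X):
        kern = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma2)) for y in Y]
        denom = sum(kern) + c
        if denom == 0.0:
            continue  # every kernel underflowed; leave this column at zero
        for m in range(M):
            P[m][n] = kern[m] / denom
    return P

X = [[0.0, 0.0], [1.0, 0.0]]
Y = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
P = cpd_posterior(X, Y, sigma2=0.5)
print([round(P[m][0], 4) for m in range(3)])
```

With `w = 0` each column of `P` sums to one; a positive `w` reserves some probability mass for outliers. Only the M-step that consumes `P` differs between classical CPD and the analytic reformulation described above.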

2604.18145 2026-05-18 cs.CV cs.AI updated version

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

Cong Huy Nguyen, Son Dinh Nguyen, Guanlin Li, Tuan Dung Nguyen, Aditya Narayan Sankaran, Mai Huy Thong, Thanh Trung Nguyen, Mai Hong Son, Reza Farahbakhsh, Phi Le Nguyen, Noel Crespi

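Affiliations * AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Military Central Hospital, Vietnam

AI summary This paper proposes the VietPET-RoI dataset and the HiRRA framework, which uses graph-enhanced modules to capture dependencies among RoI attributes and improve the clinical reliability of 3D PET/CT report generation; experiments show it outperforms existing methods on BLEU, ROUGE-L, and clinical metrics.

Comments 16 pages; Accepted to appear in ACL 2026

Details
AI abstract

Automated report generation for 3D PET/CT imaging is challenged by high-dimensional volumetric data and a scarcity of annotated data, especially for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to reach diagnostic conclusions. This paper introduces VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs paired with corresponding clinical reports. In addition, to demonstrate the utility of the dataset, we propose HiRRA, a framework that mimics the diagnostic workflow of professional radiologists through graph-based relational modules, shifting from global pattern matching to localized clinical findings. We also introduce new clinical evaluation metrics, RoI Coverage and RoI Quality Index, which use LLM-based extraction to measure RoI localization accuracy and the fidelity of attribute descriptions. Extensive evaluation shows that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L and achieving a 45.8% improvement in clinical metrics, which indicates enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

English abstract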

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

2604.14692 2026-05-18 cs.CV updated version

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

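Affiliations * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University, China; ARC Lab, Tencent PCG, Shenzhen, China; School of Artificial Intelligence, Beijing University of Technology, Beijing, China

AI summary This paper proposes the Chain-of-Glimpse framework, which uses search-guided progressive reasoning to handle object variation in video and improves the accuracy and interpretability of multi-step decisions.

Details
AI abstract

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic methods struggle to handle the substantial object variation that arises over time. We therefore introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional, multi-step decision-making. Formally, Chain-of-Glimpse treats video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, reducing over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse includes a search-guided controller, optimized via reinforcement learning with a format reward that strongly incentivizes grounding capability, which iteratively grounds visual evidence regions and forms reliable reasoning trajectories, producing accurate and interpretable multi-step decisions. Extensive evaluation on the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks shows consistent performance gains, robustness, and generalization across diverse video reasoning tasks.

English abstract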

Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively ground visual evidence regions and form reliable reasoning trajectories, yielding accurate and interpretable multi-step decisions. Extensive evaluations on both the in-domain NExTQA benchmark and the out-of-domain Video-Holmes, CG-Bench Reasoning, and VRBench benchmarks demonstrate consistent performance gains, robustness, and generalization of Chain-of-Glimpse across diverse video reasoning tasks.

2604.10210 2026-05-18 cs.CV cs.AI cs.LG updated version

A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

Meng'en Qin, Yu Song, Quanling Zhao, Xiaodong Yang, Yingtao Che, Xiaohui Yang

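Affiliations * Henan Engineering Research Center for Artificial Intelligence Theory and Algorithms; Henan University; Faculty of Computer Science and Control Engineering; Shenzhen University of Advanced Technology; Department of Electrical and Electronic Engineering

AI summary This paper proposes A3-FPN, which enhances multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules, improving the recognition of small objects in dense prediction tasks.

Journal ref Pattern Recognition, 2026, 113793

Details
AI abstract

Learning multi-scale representations is a common strategy for handling object scale variation in dense prediction tasks. Although existing feature pyramid networks have made notable progress in visual recognition, inherent design defects limit their ability to capture discriminative features and recognize small objects. This paper proposes the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), which enhances multi-scale feature representation through an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN adopts a horizontally spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from adjacent levels to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on the information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET, and Cityscapes show that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer architectures with significant performance gains. Notably, when combined with OneFormer and a Swin-L backbone, A3-FPN reaches 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.

English abstract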

Learning multi-scale representations is the common strategy to tackle object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose Asymptotic Content-Aware Pyramid Attention Network (A3-FPN), to augment multi-scale feature representation via the asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Codes are available at https://github.com/mason-ching/A3-FPN.

2603.15269 2026-05-18 cs.CV updated version

Self-Supervised ImageNet Representations for In Vivo Confocal Microscopy: Tortuosity Grading without Segmentation Maps

Kim Ouan, Noémie Moreau, Katarzyna Bozek

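Affiliations * Faculty of Mathematics and Natural Sciences, University of Cologne, Germany; Institute for Biomedical Informatics, Faculty of Medicine and University Hospital Cologne, University of Cologne, Germany; Center for Molecular Medicine Cologne (CMMC), Faculty of Medicine and University Hospital Cologne, University of Cologne, Germany; Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), University of Cologne, Germany

AI summary This paper proposes using self-supervised pretrained ImageNet features for tortuosity grading in in vivo confocal microscopy without segmentation maps, improving accuracy and sensitivity.

Comments 7 pages, 4 figures, MIDL 2026 - Short Paper Track

Details
AI abstract

The tortuosity of corneal nerve fibers is used as an indicator of different diseases. Current state-of-the-art tortuosity grading methods rely heavily on expensive segmentation maps of these nerve fibers. This paper shows that self-supervised pretrained ImageNet features transfer to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, even though it has been superseded by two later versions. After careful fine-tuning, DINO outperforms existing methods in accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements for grading without needing segmentation maps.

English abstract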

The tortuosity of corneal nerve fibers is used as an indicator of different diseases. Current state-of-the-art methods for grading tortuosity rely heavily on expensive segmentation maps of these nerve fibers. In this paper, we demonstrate that self-supervised pretrained features from ImageNet are transferable to the domain of in vivo confocal microscopy. We show that DINO should not be disregarded as a deep learning model for medical imaging, although it was superseded by two later versions. After careful fine-tuning, DINO improves upon the state of the art in terms of accuracy (84.25%) and sensitivity (77.97%). Our fine-tuned model focuses on the key morphological elements in grading without the use of segmentation maps.

2603.05377 2026-05-18 cs.RO cs.CV updated version

OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Esteban Padilla-Cerdio, Boyang Sun, Marc Pollefeys, Hermann Blum

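Affiliations * ETH Zurich; Microsoft Spatial AI Lab; University of Bonn

AI summary This paper proposes the OpenFrontier framework, which achieves efficient navigation by casting it as a sparse subgoal identification and reaching problem; it needs no task-specific training or fine-tuning, works with a variety of vision-language prior models, and demonstrates zero-shot performance and real-robot deployment.

Details
AI abstract

Open-world navigation requires robots to make decisions in complex everyday environments and adapt to flexible task requirements. Conventional navigation methods rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent vision-language navigation (VLN) and vision-language-action (VLA) models make end-to-end policies possible, but they typically require interactive training, large-scale data collection, or task-specific fine-tuning. We propose treating navigation as a sparse subgoal identification and reaching problem, and find that providing visual anchoring targets for high-level semantic priors enables efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier achieves efficient navigation with a lightweight system design, without relying on dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier on multiple navigation benchmarks and demonstrate strong zero-shot performance as well as effective real-world deployment on a mobile robot.

English abstract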

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select visual frontiers as semantic anchors and propose OpenFrontier, a navigation framework that requires no task-specific training or fine-tuning and seamlessly integrates diverse vision-language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D semantic mapping, task-specific policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.
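
As background for the term "frontier", the classical occupancy-grid definition (a free cell that borders unexplored space) can be sketched in a few lines. This is only the textbook notion: the abstract speaks of visual frontiers grounded by vision-language priors, which may be defined differently, and the grid encoding and the name `find_frontiers` are assumptions made for illustration.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def find_frontiers(grid):
    """Return (row, col) of every free cell with at least one unknown 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbour is enough
    return frontiers

grid = [
    [0, 0, -1],
    [0, 1, -1],
    [0, 0, 0],
]
print(find_frontiers(grid))
```

Each frontier cell is a candidate subgoal; a system in the spirit of the abstract would then rank such candidates with a semantic prior instead of exploring them exhaustively.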

2602.21536 2026-05-18 cs.CV updated version

IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model

Pengli Zhu, Yitao Zhu, Haowen Pang, Anqi Qiu

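Affiliations * Department of Health Technology and Informatics, The Hong Kong Polytechnic University, Hong Kong; School of Integrated Circuits and Electronics, Beijing Institute of Technology, China; Mental Health Research Center, The Hong Kong Polytechnic University, Hong Kong; Department of Biomedical Engineering, Johns Hopkins University, USA

AI summary This paper proposes IHF-Harmony, which harmonizes multi-modality MRI images with an invertible hierarchy flow model, using unpaired data to improve scalability across modalities while preserving anatomical structure and improving downstream task performance.

Details
AI abstract

Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling-subject datasets. To address these problems, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into invertible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction, preventing anatomical distortion. Specifically, the invertible hierarchy flow (IHF) progressively removes artefact-related features through hierarchical subtractive coupling, while artefact-aware normalization (AAN) uses anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that preserves source anatomy. Experiments on multiple MRI modalities show that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code is available at https://github.com/Idea89560041/IHF-Harmony.

English abstract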

Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code is available at https://github.com/Idea89560041/IHF-Harmony.
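
The claim that subtractive coupling yields a bijective, lossless mapping can be made concrete with a toy coupling layer. The sketch below shows the generic construction only, not the IHF architecture: the names `coupling_forward`, `coupling_inverse`, and the stand-in `toy_shift` function are hypothetical, and the input is assumed to be a flat list of even length.

```python
def coupling_forward(x, shift_fn):
    """Keep the first half, subtract a function of it from the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    shift = shift_fn(x1)  # may be arbitrarily complex; it never needs to be inverted
    return x1 + [b - s for b, s in zip(x2, shift)]

def coupling_inverse(y, shift_fn):
    """Recompute the same shift from the untouched half and add it back."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    shift = shift_fn(y1)
    return y1 + [b + s for b, s in zip(y2, shift)]

def toy_shift(h):
    # Stand-in for a learned network that predicts artefact-related features.
    return [2.0 * v + 1.0 for v in h]

x = [1.0, 2.0, 3.0, 4.0]
y = coupling_forward(x, toy_shift)
print(y, coupling_inverse(y, toy_shift))
```

Because the first half passes through unchanged, the inverse can always regenerate the shift, so the input is recovered up to floating-point rounding regardless of how `shift_fn` is defined.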

2602.21141 2026-05-18 cs.CV updated version

SynthRender and IRIS: Open-Source Framework and Dataset for Bidirectional Sim-Real Transfer in Industrial Object Perception

Jose Moises Araya-Martinez, Thushar Tom, Adrián Sanchis Reig, Pablo Rey Valiente, Jens Lambrecht, Jörg Krüger

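Affiliations * Technical University Berlin, Industrial Automation Technology; Mercedes-Benz AG, Future Manufacturing Technologies; Technical University Braunschweig, Institute for Cognitive Robotics

AI summary This paper proposes SynthRender and IRIS, systematically studying bidirectional sim-real transfer through synthetic data generation and structured evaluation; it provides a 32-class dataset with CAD models, supporting efficient synthetic training and industrial use.

Details
AI abstract

Object perception is essential for tasks such as robotic material handling and quality inspection. However, modern supervised deep learning models need large amounts of annotated data to achieve robust automation under semi-uncontrolled conditions, which is a major barrier to widespread use on proprietary industrial parts. We systematically study bidirectional sim-to-real transfer through a framework that integrates synthetic data generation with structured empirical evaluation. Our method combines 2D-to-3D Reality-to-Simulation techniques, which create 3D assets from physical parts, with programmatic Guided Domain Randomization (GDR) in the open-source SynthRender framework. Structured ablation studies across multiple benchmarks quantify the impact of individual rendering design choices and yield practical guidelines for efficient synthetic training. To support evaluation under realistic industrial conditions, we introduce the Industrial Real-Sim Imagery Set (IRIS), which contains 32 classes with diverse textures, intra-class variation, strong inter-class similarity, and 19,672 annotations, and provides CAD models and reconstructed meshes for bidirectional sim-to-real benchmarking. On three industrial benchmarks, the proposed framework achieves highly competitive performance: 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.

English abstract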

Object perception is fundamental for tasks such as robotic material handling and quality inspection. However, modern supervised deep-learning models require large annotated datasets for robust automation under semi-uncontrolled conditions, a major barrier to widespread deployment with proprietary industrial parts. We address this through an integrated framework combining synthetic data generation and structured empirical evaluation for systematic investigation of bidirectional sim-to-real transfer. Our method integrates 2D-to-3D Reality-to-Simulation techniques for 3D asset creation from physical parts with programmatic Guided Domain Randomization (GDR) via SynthRender, an open-source synthetic image generation framework. Structured ablation studies across multiple benchmarks quantify the impact of individual rendering design choices, yielding practical guidelines for data-efficient synthetic training. To support evaluation under realistic industrial conditions, we introduce the Industrial Real-Sim Imagery Set (IRIS), a 32-class dataset with diverse textures, intra-class variation, strong inter-class similarities, and 19,672 annotations, providing both CAD models and reconstructed meshes for bidirectional sim-to-real benchmarking. Across three industrial benchmarks, the proposed framework achieves highly competitive performance, reaching 99.1% mAP@50 on a public robotics dataset, 98.3% mAP@50 on an automotive benchmark, and 95.3% mAP@50 on IRIS.
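
The "@50" in mAP@50 refers to an intersection-over-union (IoU) threshold of 0.5 when matching a predicted box to a ground-truth box. A minimal sketch of that matching criterion follows, assuming axis-aligned boxes given as `(x1, y1, x2, y2)`; the function names are hypothetical, and the full metric additionally needs per-class matching and precision-recall integration, which are not shown.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A detection counts as correct at mAP@50 when its IoU with the ground truth is >= 0.5."""
    return box_iou(pred, gt) >= threshold

# Two 2x2 boxes overlapping by half: IoU = 2 / 6, below the 0.5 threshold.
print(box_iou((0, 0, 2, 2), (1, 0, 3, 2)), is_true_positive((0, 0, 2, 2), (1, 0, 3, 2)))
```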

2602.19423 2026-05-18 cs.CV updated version

Prefer-DAS: Learning from Local Preferences and Sparse Prompts for Domain Adaptive Segmentation of Electron Microscopy

Jiabao Chen, Shan Xiong, Jialin Peng

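Affiliations * College of Computer Science and Technology, Huaqiao University

AI summary This paper proposes Prefer-DAS, which achieves efficient domain adaptive segmentation by exploiting local preferences and sparse prompts; combining self-training with prompt-guided contrastive learning, it improves segmentation performance and flexibility.

Details
AI abstract

Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures in diverse large-scale electron microscopy (EM) data without requiring extensive annotation in every domain. However, prevailing unsupervised domain adaptation (UDA) strategies often show limited and biased performance, which hinders their practical use. In this study, we explore sparse points and local human preferences as weak labels in the target domain, yielding a more realistic and annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, Prefer-DAS allows full, partial, or even no point prompts during both training and inference, thereby enabling interactive segmentation. Rather than using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO), a plug-and-play solution for alignment with spatially varying human feedback. To handle potentially missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks show that our model outperforms SAM-like methods as well as unsupervised and weakly supervised DAS methods in both automatic and interactive segmentation modes, highlighting its strong generalizability and flexibility. In addition, the performance of our model comes very close to, or even exceeds, that of supervised models.

English abstract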

Domain adaptive segmentation (DAS) is a promising paradigm for delineating intracellular structures in diverse large-scale electron microscopy (EM) data without requiring extensive annotation in each domain. However, the prevalent unsupervised domain adaptation (UDA) strategies often demonstrate limited and biased performance, which hinders their practical applications. In this study, we explore sparse points and local human preferences as weak labels in the target domain, thereby presenting a more realistic yet annotation-efficient setting. Specifically, we develop Prefer-DAS, which pioneers sparse promptable learning and local preference alignment. Prefer-DAS is a promptable multitask model that integrates self-training and prompt-guided contrastive learning. Unlike SAM-like methods, Prefer-DAS allows for the use of full, partial, and even no point prompts during both training and inference stages and thus enables interactive segmentation. Instead of using image-level human preference alignment for segmentation, we introduce Local direct Preference Optimization (LPO), a plug-and-play solution for alignment with spatially varying human feedback. To address potential missing feedback, we also introduce Unsupervised Preference Optimization (UPO), which leverages self-learned preferences. As a result, the Prefer-DAS model can effectively perform both weakly-supervised and unsupervised DAS, depending on the availability of points and human preferences. Comprehensive experiments on four challenging DAS tasks demonstrate that our model outperforms SAM-like methods as well as unsupervised and weakly-supervised DAS methods in both automatic and interactive segmentation modes, highlighting strong generalizability and flexibility. Additionally, the performance of our model is very close to or even exceeds that of supervised models.

2601.21798 2026-05-18 cs.CV updated version

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models

Junming Huang, Chi Wang, Letian Li, Guangkai Xu, Donglin Huang, Hao Chen, Qiang Dai, Weiwei Xu

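Affiliations * Zhejiang University, China

AI summary This paper proposes CG-MLLM, a multi-modal large language model capable of 3D captioning and high-resolution 3D generation; it uses a Mixture-of-Transformer architecture to decouple different modeling needs and combines a pretrained vision-language model with a dedicated 3D VAE latent space, improving 3D generation quality and perception.

Comments ICML 2026

Details
AI abstract

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods can often only produce low-resolution meshes or coarse structural proxies and cannot natively capture fine-grained geometry. This paper proposes CG-MLLM, a novel multi-modal large language model capable of 3D captioning and high-resolution 3D generation within a single framework. Using a Mixture-of-Transformer architecture, CG-MLLM decouples different modeling needs: the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pretrained vision-language backbone with a dedicated 3D VAE latent space, CG-MLLM facilitates long-context interaction between standard tokens and spatial blocks. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. We further find that learning to generate 3D content in turn strengthens the model's image-based 3D understanding.

English abstract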

Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but their capabilities in 3D content generation remain underexplored. Existing methods compromise by producing either low-resolution meshes or coarse structural proxies, failing to capture fine-grained geometry natively. In this paper, we propose CG-MLLM, a novel Multi-modal Large Language Model (MLLM) capable of 3D captioning and high-resolution 3D generation in a single framework. Leveraging the Mixture-of-Transformer architecture, CG-MLLM decouples disparate modeling needs, where the Token-level Autoregressive (TokenAR) Transformer handles token-level content, and the Block-level Autoregressive (BlockAR) Transformer handles block-level content. By integrating a pre-trained vision-language backbone with a specialized 3D VAE latent space, CG-MLLM facilitates long-context interactions between standard tokens and spatial blocks within a single integrated architecture. Experimental results show that CG-MLLM significantly outperforms existing MLLMs in generating high-fidelity 3D objects, effectively bringing high-resolution 3D content creation into the mainstream LLM paradigm. Beyond generation, we further observe that learning to produce 3D content transfers back to perception, strengthening the model's image-based 3D understanding.

2601.12894 2026-05-18 cs.RO cs.CV updated version

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

Kangye Ji, Jianbo Zhou, Yuan Meng, Ye Li, Hanyun Cui, Zhi Wang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Department of Computer Science(计算机科学系)

AI总结 本文提出SAG方法,通过自适应剪枝和重用机制实现稀疏动作生成,提升实时视觉运动控制效率,实验显示生成速度最高提升4倍。

详情
AI中文摘要

扩散策略因其强大的多模态动作分布建模能力在动作生成中占据主导地位,但其多步去噪过程使其难以满足实时视觉运动控制的需求。现有基于缓存的加速方法通常依赖静态调度,无法适应机器人与环境交互的动态特性,导致性能不佳。本文提出稀疏动作生成(SAG),以实现极稀疏的动作生成。为适应迭代交互,SAG定制了随rollout自适应的“先剪枝、后重用”机制:先在全局识别可剪枝的计算,再在动作扩散过程中用缓存的激活值替代这些计算。为捕捉rollout动态,SAG将扩散剪枝器参数化为以观测为条件的模块,以实现环境感知的自适应,并采用参数与推理均高效的设计来实现实时预测。此外,SAG引入了一种通用(one-for-all)的重用策略,以zig-zag方式在时间步和网络块之间重用激活值,最小化全局冗余。在多个机器人基准上的大量实验表明,SAG在不牺牲性能的情况下实现了最高4倍的生成加速。项目页面:https://sparse-actiongen.github.io.

英文摘要

Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on $\textit{static}$ schedules that fail to adapt to the $\textit{dynamics}$ of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose $\underline{\textbf{S}}$parse $\underline{\textbf{A}}$ction$\underline{\textbf{G}}$en ($\textbf{SAG}$) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4$\times$ generation speedup without sacrificing performance. Project Page: https://sparse-actiongen.github.io.

2512.15693 2026-05-18 cs.CV 版本更新

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Skyra:通过有据可依的瑕疵推理实现AI生成视频检测

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 本文提出 Skyra,一种专门用于识别 AI 生成视频中人类可感知的视觉瑕疵的多模态大语言模型,通过这些瑕疵作为基础证据进行检测和解释,同时构建了首个大规模 AI 生成视频瑕疵数据集并提出两阶段训练策略。

Comments Camera Ready Version. Project Page: https://github.com/JoeLeelyf/Skyra

详情
AI中文摘要

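AI驱动的视频生成技术被滥用,已引发严重的社会担忧,凸显出对可靠的AI生成视频检测器的迫切需求。然而,现有方法大多局限于二分类,缺乏可供人类理解的解释。本文提出Skyra,一种专用的多模态大语言模型(MLLM),能够识别AI生成视频中人类可感知的视觉瑕疵,并将其作为有据可依的证据用于检测和解释。为此,我们构建了用于监督微调(SFT)的ViF-CoT-4K,这是首个带有细粒度人工标注的大规模AI生成视频瑕疵数据集。随后,我们设计了两阶段训练策略,系统地提升模型的时空瑕疵感知能力、解释能力和检测准确率。为全面评估Skyra,我们提出ViF-Bench,该基准包含由十余种最先进视频生成器生成的3K个高质量样本。大量实验表明,Skyra在多个基准上超越现有方法,同时我们的评估也为推进可解释的AI生成视频检测提供了有价值的见解。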

英文摘要

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

2511.13108 2026-05-18 cs.CV 版本更新

DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

DGS-Net:基于知识蒸馏的梯度手术用于AI生成图像检测中的CLIP微调

Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Ziwen He, Zhangjie Fu

发表机构 * School of Computer Science, Nanjing University of Information Science(南京信息工程大学计算机学院) University of Macau(澳门大学)

AI总结 本文提出DGS-Net,通过梯度空间分解分离有害和有益的下降方向,提升CLIP在AI生成图像检测中的微调效果,实验表明其在检测性能和泛化能力上优于现有方法。

Comments Accepted by ICML 2026 Spotlight

详情
AI中文摘要

生成模型如GANs和扩散模型的快速发展导致AI生成图像广泛传播,引发虚假信息、隐私侵犯和信任危机。尽管大规模多模态模型如CLIP能提供强可迁移表示以检测合成内容,但微调时常导致灾难性遗忘,削弱预训练先验并限制跨领域泛化。为此,我们提出Distillation-guided Gradient Surgery Network (DGS-Net),通过梯度空间分解分离有害和有益的下降方向,将任务梯度投影到有害方向的正交补上,并与从冻结CLIP编码器蒸馏出的有益方向对齐,实现先验保留与无关成分抑制的统一优化。在50种生成模型上的实验表明,本方法平均领先现有最优方法6.6(原文未注明单位),在检测性能和泛化能力上均更优。

英文摘要

The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
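
The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single known harmful direction and plain Python lists as vectors; the function names `dot` and `project_out` are hypothetical, and the sketch omits the alignment with beneficial directions distilled from the frozen CLIP encoder, so it is not the authors' implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, harmful):
    """Project `grad` onto the orthogonal complement of `harmful`.

    Assumption: one harmful direction; the paper may use a subspace.
    """
    denom = dot(harmful, harmful)
    if denom == 0.0:
        # A zero vector defines no direction, so nothing is removed.
        return list(grad)
    coeff = dot(grad, harmful) / denom
    return [g - coeff * h for g, h in zip(grad, harmful)]


# Toy example: the second coordinate is treated as the harmful direction.
task_grad = [1.0, 2.0, 0.0]
harmful_dir = [0.0, 1.0, 0.0]
projected = project_out(task_grad, harmful_dir)
```

After the projection, the inner product between `projected` and `harmful_dir` is zero, so a gradient step along `projected` no longer moves the parameters along the harmful direction.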

2510.08398 2026-05-18 cs.CV 版本更新

VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

VideoVerse: 你的T2V生成器有世界模型能力来合成视频吗?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

发表机构 * Sun Yat-sen University(中山大学) Hong Kong Polytechnic University(香港理工大学) Tsinghua University(清华大学) OPPO Research Institute(OPPO研究院)

AI总结 VideoVerse通过评估T2V模型对复杂时间因果关系和世界知识的理解能力,揭示现有模型与理想世界建模能力的差距。

Comments 26 Pages, 10 Figures, 14 Tables

详情
AI中文摘要

最近文本到视频(T2V)生成技术的快速发展使训练模型具备了更强的世界模型能力,使现有基准逐渐无法评估最先进的T2V模型。首先,当前评估维度如每帧美学质量和时间一致性已无法区分最先进的T2V模型。其次,事件级时间因果性(区分视频与其他模态的本质属性)仍基本未被探索。第三,现有基准缺乏对世界知识的系统评估,而这是构建世界模型的关键能力。为解决这些问题,我们引入VideoVerse,一个专注于评估当前T2V模型是否能理解复杂时间因果性和世界知识以合成视频的综合基准。我们收集了跨不同领域的代表性视频,并提取其具有固有时间因果性的事件级描述,然后由独立标注者重写为文本到视频提示。对于每个提示,我们设计了十个评估维度,涵盖动态和静态属性,最终得到300个提示、815个事件和793个评估问题。在此基础上,我们利用现代视觉-语言模型开发了一个与人类偏好一致的基于问答的评估流程,系统地评估了领先的开源和闭源T2V系统,揭示了当前T2V模型与理想世界建模能力之间的差距。

英文摘要

The recent rapid advancement of Text-to-Video (T2V) generation technologies is endowing trained models with stronger world-model ability, making existing benchmarks increasingly insufficient for evaluating state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark that evaluates whether current T2V models can understand complex temporal causality and world knowledge to synthesize videos. We collect representative videos across diverse domains and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design ten evaluation dimensions covering dynamic and static properties, resulting in 300 prompts, 815 events, and 793 evaluation questions. Consequently, a human preference-aligned QA-based evaluation pipeline is developed by using modern vision-language models to systematically benchmark leading open- and closed-source T2V systems, revealing the current gap between T2V models and desired world modeling abilities.

2510.06194 2026-05-18 hep-ex astro-ph.IM cs.CV 版本更新

Overlap-aware segmentation for topological reconstruction of obscured objects

关注重叠的分割以重建被遮挡物体的拓扑结构

J. Schueler, H. M. Araújo, S. N. Balashov, J. E. Borg, C. Brew, F. M. Brunbauer, C. Cazzaniga, A. Cottle, D. Edgeman, C. D. Frost, F. Garcia, D. Hunt, M. Kastriotou, P. Knights, H. Kraus, A. Lindote, M. Lisowska, D. Loomba, E. Lopez Asamar, P. A. Majewski, T. Marley, C. McCabe, L. Millins, R. Nandakumar, T. Neep, F. Neves, K. Nikolopoulos, E. Oliveri, A. Roy, T. J. Sumner, E. Tilly, W. Thompson, M. A. Vogiatzi

发表机构 * Department of Physics and Astronomy, University of New Mexico(新墨西哥大学物理与天文学系) Department of Physics, Blackett Laboratory, Imperial College London(伦敦帝国理工学院物理系) Particle Physics Department, STFC Rutherford Appleton Laboratory(英国科学与技术设施委员会卢瑟福·阿普尔顿实验室粒子物理部) Luleå University of Technology(吕勒奥理工大学) CERN(欧洲核子研究中心) ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory(英国科学与技术设施委员会卢瑟福·阿普尔顿实验室ISIS中子与缪子源) University College London (UCL), Department of Physics and Astronomy(伦敦大学学院(UCL)物理与天文学系) Department of Physics, Keble Road, University of Oxford(牛津大学物理系) Helsinki Institute of Physics, University of Helsinki(赫尔辛基大学物理研究所) School of Physics and Astronomy, University of Birmingham(伯明翰大学物理与天文学学院) LIP – Laboratório de Instrumentação e Física Experimental de Partículas, University of Coimbra(科英布拉大学粒子物理实验仪器实验室) Departamento de Fisica Teorica, Universidad Autonoma de Madrid(马德里自治大学理论物理系) Department of Physics, King’s College London(伦敦国王学院物理系) University of Hamburg(汉堡大学)

AI总结 本文提出OASIS框架,通过加权损失函数优先处理重叠区域,提升被遮挡物体的像素强度和拓扑特征重建。在MIGDAL实验中,OASIS显著改善了低能电子轨迹的重建效果。

详情
AI中文摘要

重叠物体的分离是科学成像中的一项重大挑战。尽管深度学习分割-回归算法能够预测逐像素强度,但它们通常平等对待所有区域,而不是优先处理归属最模糊的重叠区域。实例分割的最新进展表明,在训练中对像素重叠区域加权可以改善重叠区域的分割边界预测,但这一思路尚未扩展到分割回归。本文提出OASIS(Overlap-Aware Segmentation of ImageS):一种新的分割-回归框架,其加权损失函数在训练期间优先处理物体重叠区域,从而能够从严重遮挡的物体中提取像素强度和拓扑特征。我们在MIGDAL实验中展示了OASIS,该实验旨在低气压光学时间投影室中直接成像Migdal效应,即由核散射诱发电子发射的罕见过程。这一场景是一个极端测试案例,因为重建目标是微弱的电子反冲轨迹,它通常被亮度高出一个或多个数量级的核反冲轨迹严重掩盖。与未加权的分割回归相比,我们证明OASIS针对重叠区域的损失权重是改善低能电子轨迹强度和拓扑重建的最重要的训练权重,而这类轨迹往往最受像素重叠支配。在八次训练活动上取平均,我们进一步表明,加入针对重叠区域的权重可将这些低能电子的中位强度重建误差从-41.1%改善到-13.3%。这些性能提升表明,OASIS是一种可推广的方法,可用于恢复重叠主导区域中被遮挡的信号。

英文摘要

The separation of overlapping objects presents a significant challenge in scientific imaging. While deep learning segmentation-regression algorithms can predict pixel-wise intensities, they typically treat all regions equally rather than prioritizing overlap regions where attribution is most ambiguous. Recent advances in instance segmentation show that weighting regions of pixel overlap in training can improve segmentation boundary predictions in regions of overlap, but this idea has not yet been extended to segmentation regression. We address this with Overlap-Aware Segmentation of ImageS (OASIS): a new segmentation-regression framework with a weighted loss function designed to prioritize regions of object-overlap during training, enabling extraction of pixel intensities and topological features from heavily obscured objects. We demonstrate OASIS in the context of the MIGDAL experiment, which aims to directly image the Migdal effect--a rare process where electron emission is induced by nuclear scattering--in a low-pressure optical time projection chamber. This setting poses an extreme test case, as the target for reconstruction is a faint electron recoil track which is often heavily-buried within the order(s)-of-magnitude brighter nuclear recoil track. Compared to unweighted segmentation regression, we demonstrate OASIS's novel overlap region-targeted loss function weight to be the single most important training weight for improving intensity and topological reconstructions of the low-energy electron tracks that tend to be most dominated by pixel overlap. Averaging over eight training campaigns, we further show the addition of overlap-targeted weights to improve median intensity reconstruction errors from -41.1% to -13.3% for these low-energy electrons. These performance gains demonstrate OASIS as a generalizable methodology for recovering obscured signals in overlap-dominated regions.
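
The idea of prioritizing overlap regions through a weighted loss can be sketched as follows. This is a minimal illustration only: the squared-error form, the flat per-pixel lists, the function name `overlap_weighted_mse`, and the weight value 5.0 are all assumptions, since the abstract does not specify the OASIS loss.

```python
def overlap_weighted_mse(pred, target, overlap_mask, overlap_weight=5.0):
    """Weighted mean squared error over flattened pixel intensities.

    Pixels flagged in `overlap_mask` get `overlap_weight`; all others get 1.0.
    The weight value is a hypothetical choice, not taken from the paper.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t, in_overlap in zip(pred, target, overlap_mask):
        w = overlap_weight if in_overlap else 1.0
        total += w * (p - t) ** 2
        weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("need at least one pixel with positive weight")
    return total / weight_sum


# Toy example: the only error sits on the single overlap pixel.
pred = [1.0, 2.0, 3.0]
target = [1.0, 2.0, 5.0]
mask = [False, False, True]
unweighted = overlap_weighted_mse(pred, target, mask, overlap_weight=1.0)
weighted = overlap_weighted_mse(pred, target, mask, overlap_weight=5.0)
```

With the larger weight, the same error on the overlap pixel contributes more to the loss (20/7 instead of 4/3 here), which is the mechanism by which training is steered toward the overlap region.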

2510.03161 2026-05-18 cs.CV cs.AI 版本更新

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

UniShield: 一种适应性多智能体框架用于统一的伪造图像检测与定位

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(北京大学电子与计算机工程学院) School of Future Technology, South China University of Technology(华南理工大学未来技术学院) School of Electronic and Information Engineering, South China University of Technology(华南理工大学电子与信息工程学院) Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology, Shenzhen Graduate School, Peking University(北京大学深圳研究生院超高清沉浸媒体技术省重点实验室)

AI总结 UniShield通过多智能体框架实现跨领域伪造图像检测与定位,提升检测的适应性和实用性。

详情
AI中文摘要

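随着图像生成技术的快速发展,合成图像日益逼真,带来了虚假信息和欺诈等重大社会风险。因此,伪造图像检测与定位(FIDL)对于维护信息完整性和社会安全至关重要。尽管现有的领域专用检测方法表现出色,但由于其专业化范围狭窄、跨领域泛化能力差,且缺乏集成的自适应框架,实际适用性仍然有限。为解决这些问题,我们提出UniShield,一种新型的基于多智能体的统一系统,能够检测并定位多个领域的图像伪造,包括图像篡改、文档篡改、DeepFake和AI生成图像。UniShield将感知智能体与检测智能体相结合:感知智能体分析图像特征,动态选择合适的检测模型;检测智能体将多种专家检测器整合到统一框架中,并生成可解释的报告。大量实验表明,UniShield取得了最先进的结果,超越了现有的统一方法和领域专用检测器,体现出其实用性、自适应性和可扩展性。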

英文摘要

With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite impressive performances by existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, a novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.

2509.23352 2026-05-18 cs.CV cs.AI 版本更新

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

动态树RPO:通过结构化采样打破独立轨迹瓶颈

Xiaolong Fu, Lichen Ma, Zipeng Guo, ShiPing Dong, Lan Yang, Tan Lit Sin, Gaojing Zhou, Yu He, Jingling Fu, Shizhe Zhou, Junshi Huang, Jason Li

发表机构 * Sun Yat-sen University(中山大学) Tsinghua University(清华大学) Beijing University of Chemical Technology(北京化工大学)

AI总结 本文提出动态树RPO,通过树状结构采样策略和动态噪声强度,提升文本到图像生成的质量与效率,同时结合层调优强化学习方法,在多个基准测试中表现出色。

Comments Fig.3 updated

详情
AI中文摘要

将强化学习(RL)引入用于文本到图像(T2I)生成的流匹配模型,显著提升了生成质量。然而,由于采样组内差异过小,这些提升往往以穷举式探索和低效的采样策略为代价。基于这一观察,我们提出Dynamic-TreeRPO(动态树RPO),将滑动窗口采样策略实现为树状结构搜索,并沿深度方向采用动态噪声强度。我们在该树结构中执行GRPO引导的优化和受约束的随机微分方程(SDE)采样。通过共享树的前缀路径,我们的设计有效摊销了轨迹搜索的计算开销。借助为每个树层精心设计的噪声强度,Dynamic-TreeRPO可以在不增加额外计算成本的情况下增强探索的多样性。此外,我们在Dynamic-TreeRPO中无缝整合监督微调(SFT)和RL范式,构建了LayerTuning-RL,将SFT的损失函数重新表述为动态加权的进展奖励模型(PRM),而非单独的预训练方法。通过将该加权PRM与动态自适应裁剪边界相关联,避免了对Dynamic-TreeRPO探索过程的干扰。得益于树状结构采样和LayerTuning-RL范式,我们的模型沿有效方向动态探索多样化的搜索空间。与现有基线相比,我们的方法在HPS-v2.1、PickScore和ImageReward等既有基准上,在语义一致性、视觉保真度和人类偏好对齐方面表现出显著优势。特别是,我们的模型在这些基准上分别比SoTA高出4.9%、5.91%和8.66%,同时将训练效率提高了近50%。

英文摘要

The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate Supervised Fine-Tuning (SFT) and RL paradigm within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, the disruption of exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.
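
The abstract attributes inefficiency to slight variation within the sampling group. A minimal sketch of the group-relative advantage commonly used in GRPO shows why that matters; it is not the tree-structured variant proposed here. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    Assumption: population standard deviation; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group whose samples differ yields informative, signed advantages.
diverse = group_relative_advantages([1.0, 2.0, 3.0])

# A group with identical rewards yields all-zero advantages (no learning signal).
collapsed = group_relative_advantages([2.0, 2.0, 2.0])
```

When every sample in a group receives nearly the same reward, the advantages collapse toward zero, which is one way to read the abstract's motivation for increasing variation across tree layers.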

2509.22151 2026-05-18 cs.CV cs.CL 版本更新

MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

MultiMat:利用大型多模态模型进行程序化材质的多模态程序合成

Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

发表机构 * University of Mannheim(曼海姆大学) Adobe Research(Adobe研究)

AI总结 MultiMat利用大规模多模态模型实现多模态程序合成,提升生成过程材料图的效率与视觉质量,优于纯文本基线方法。

Comments Accepted at ICLR 2026 (poster)

详情
AI中文摘要

材料节点图是生成程序化材料的2D通道的程序,包括几何如粗糙度和位移图,以及反射率如albedo和导电性图。它们在计算机图形学中对于以参数化和任意分辨率表示虚拟3D对象的外观至关重要。特别是,它们的有向无环图结构和中间状态使交互式外观建模能够实现模块化和可解释的工作流程。然而,创建此类图仍然具有挑战性,通常需要专业培训。尽管最近的神经程序合成方法试图简化这一过程,但它们仅将图表示为文本程序,无法捕捉到节点图本质上视觉-空间性质,这使得它们对人类易于理解。为了解决这一差距,我们提出了MultiMat,一种多模态程序合成框架,利用大型多模态模型来处理视觉和文本图表示,以提高程序化材料图的生成效果。我们训练我们的模型在一个新的生产质量程序化材料数据集上,并将其与一种受约束的树搜索推理算法结合,该算法确保静态正确性的同时,能够高效地在程序空间中导航。我们的实验结果表明,我们的多模态程序合成方法在无条件和有条件图合成中比纯文本基线更高效,具有更高的视觉质量和保真度,建立了新的最先进性能。

英文摘要

Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

2509.21173 2026-05-18 cs.CV cs.AI cs.LG 版本更新

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

精度降低可能更可靠:对VLMs量化影响的系统评估

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Chokri Mraidha, Fabio Arnez

发表机构 * Computer Vision Center(计算机视觉中心)

AI总结 本文系统评估了量化对VLMs可靠性的影响,发现量化能同时提升准确率、校准、分布外(OOD)检测和抗噪能力,但不能改善对协变量偏移或虚假相关性的鲁棒性。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言模型(VLMs)如CLIP已革新零样本分类以及分布外(OOD)检测等安全关键任务。然而,其高计算成本阻碍了实际部署。尽管量化是提高效率的标准方法,但其对简单Top-1准确率之外的可靠性指标的影响仍缺乏研究。本文通过超过70万次评估运行,对VLM量化进行了大规模评估,发现与“量化噪声会降低性能”的假设相反,量化可以同时提升准确率、校准、OOD检测和对噪声的鲁棒性,但不能改善对协变量偏移或虚假相关性的鲁棒性。我们利用这些反直觉的发现刻画量化在简单正则化之外的作用机制:量化会抑制高秩谱成分,迫使模型更多地依赖稳健的低秩特征。这种谱滤波效应带来了泛化能力和噪声容忍度的提升,为利用量化部署更快速、更可靠的VLMs提供了路径。

英文摘要

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.
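
To make the "quantization noise" in the abstract concrete, here is a minimal sketch of symmetric uniform quantization followed by dequantization. It assumes per-tensor scaling and round-to-nearest; the function name `quantize_dequantize` is hypothetical, and the abstract does not say which quantization schemes the study actually evaluates.

```python
def quantize_dequantize(weights, num_bits=8):
    """Round weights onto a symmetric integer grid, then map back to floats.

    Returns the dequantized weights and the grid step (`scale`).
    Assumption: one scale per tensor, chosen from the largest magnitude.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        # All-zero tensors are represented exactly.
        return list(weights), 0.0
    scale = max_abs / qmax
    deq = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
    return deq, scale


weights = [0.5, -1.0, 0.25, 0.8]
deq, scale = quantize_dequantize(weights, num_bits=4)
# Per-weight quantization error, bounded by half a grid step.
errors = [d - w for d, w in zip(deq, weights)]
```

The `errors` list is the perturbation that quantization adds to the weights. The study's claim concerns how this perturbation affects the model's spectrum and reliability metrics, which this sketch does not attempt to reproduce.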

2509.15267 2026-05-18 cs.CV cs.AI cs.LG 版本更新

Autoguided Online Data Curation for Diffusion Model Training

自引导在线数据精炼用于扩散模型训练

Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa

发表机构 * University of Glasgow(格拉斯哥大学) Dotphoton

AI总结 本文研究自引导和在线数据选择方法对扩散模型训练效率的影响,通过合成数据任务验证了自引导在样本质量和多样性上的优势。

Comments Accepted non-archival paper at ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL)

详情
AI中文摘要

生成模型计算成本重新点燃了高效数据精炼的希望。本文探讨了最近发展的自引导和在线数据选择方法是否能提升扩散模型训练的时间和样本效率。我们整合了联合示例选择(JEST)和自引导到统一代码库中,以实现快速消融分析和基准测试。我们在受控的二维合成数据生成任务以及(3x64x64)-D图像生成上评估了数据精炼的组合。我们的比较是在相等的墙钟时间和样本数量下进行的,明确考虑了选择的开销。在所有实验中,自引导一致地提高了样本质量和多样性。早期AJEST(仅在训练开始时应用选择)在两个任务上都能匹配或略微超过自引导单独的效率。然而,其时间开销和额外的复杂性使自引导或均匀随机数据选择在大多数情况下更优。这些发现表明,尽管目标在线选择在早期训练中能带来效率提升,但稳健的样本质量改进主要由自引导驱动。我们讨论了限制和范围,并概述了数据选择何时可能有益。

英文摘要

The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

2509.10310 2026-05-18 cs.CV math.OC 版本更新

A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments

城市环境中街道设施地理定位的随机生灭方法

Evan Murphy, Marco Viola, Vladimir A. Krylov

发表机构 * School of Mathematical Sciences, Dublin City University(都柏林城市大学数学科学学院)

AI总结 本文提出基于能量地图的随机生灭优化算法,用于精确定位城市街道设施,通过整合地理空间信息提升定位精度,验证了其在大规模设施映射中的可行性。

Comments Accepted for publication in the Proceedings of the 27th Irish Machine Vision and Image Processing Conference (IMVIP 2025)

详情
AI中文摘要

本文针对复杂城市环境中街道设施的精确地理定位问题,提出基于能量地图的概率框架。该框架通过将能量表示为基于地图的地理定位格式,使优化过程能够无缝整合外部地理空间信息,如GIS图层、道路地图或放置约束,从而提升上下文意识和定位准确性。引入随机生灭优化算法以推断资产最可能的配置,并通过基于都柏林市中心街道照明基础设施的现实模拟验证了该方法的可行性,展示了其在大规模和精确城市资产映射中的潜力。该算法的实现将在GitHub仓库https://github.com/EMurphy0108/SBD_Street_Furniture中提供。

英文摘要

In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.

2508.08431 2026-05-18 eess.IV cs.CV eess.SP 版本更新

Preprocessing Algorithm Leveraging Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance

基于几何建模的预处理算法用于超光谱图像的尺度校正以提升解混性能

Praveen Sumanasekara, Athulya Ratnayake, Buddhi Wijenayake, Keshawa Ratnayake, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath

发表机构 * Department of Electrical and Electronic Engineering, University of Peradeniya(佩拉德尼亚大学电气与电子工程系) School of Electrical and Computer Engineering, Purdue University(普渡大学电气与计算机工程学院)

AI总结 本文提出一种预处理算法,通过校正像素签名的尺度变化,提升高光谱解混性能,实验验证其在多种解混方法上的有效性,实现约50%的误差降低。

Comments 20 pages, 14 figures

详情
AI中文摘要

光谱变化显著影响高光谱解混算法的准确性和收敛性。许多方法处理复杂的光谱变化,但由地形、光照和阴影导致的像素签名尺度上的大幅畸变仍是主要挑战。这些变化通常会降低解混性能并使模型拟合复杂化。因此,校正这些变化可为实际GIS应用带来显著优势。本文提出了一种新的预处理算法,在解混前校正由尺度引起的光谱变化。通过估计并校正像素签名的尺度畸变,该算法生成尺度畸变最小的像素签名。由于这些尺度畸变(会阻碍许多解混方法的性能)在所提出方法的输出中被大大减少,解混算法的丰度估计显著改善。我们提供了一个严谨的数学框架来描述和校正尺度变化,并对所提算法进行了广泛的实验验证。此外,我们在两个合成和两个真实高光谱数据集上,针对多种最先进的解混方法评估了该算法的影响。所提出的预处理步骤一致地提高了这些方法的性能,即使对于专门处理光谱变化的算法,也实现了约50%的误差降低。这表明尺度校正可作为一种补充步骤,帮助现有方法实现更准确的解混。该算法的通用性、一致的效果和显著的影响突显了其在实际高光谱解混管道中的潜力。实现代码将在发表时公开。

英文摘要

Spectral variability significantly impacts the accuracy and convergence of hyperspectral unmixing algorithms. Many methods address complex spectral variability; yet large-scale distortions to the scale of the observed pixel signatures due to topography, illumination, and shadowing remain a major challenge. These variations often degrade unmixing performance and complicate model fitting. Because of this, correcting these variations can offer significant advantages in real-world GIS applications. In this paper, we propose a novel preprocessing algorithm that corrects scale-induced spectral variability prior to unmixing. By estimating and correcting these distortions to the scale of the pixel signatures, the algorithm produces pixel signatures with minimal distortions in scale. Since these distortions in scale (which hinder the performance of many unmixing methods) are greatly minimized in the output provided by the proposed method, the abundance estimation of the unmixing algorithms is significantly improved. We present a rigorous mathematical framework to describe and correct for scale variability and provide extensive experimental validation of the proposed algorithm. Furthermore, the algorithm's impact is evaluated across a wide range of state-of-the-art unmixing methods on two synthetic and two real hyperspectral datasets. The proposed preprocessing step consistently improves the performance of these algorithms, achieving error reductions of around 50%, even for algorithms specifically designed to handle spectral variability. This demonstrates that scale correction acts as a complementary step, facilitating more accurate unmixing with existing methods. The algorithm's generality, consistent impact, and significant influence highlight its potential as a key component in practical hyperspectral unmixing pipelines. The implementation code will be made publicly available upon publication.

2507.10236 2026-05-18 cs.CV 版本更新

Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?

应对真实场景中AI生成图像检测的挑战:真正重要的是什么?

Despina Konstantinidou, Dimitrios Karageorgiou, Christos Koutlis, Olga Papadopoulou, Emmanouil Schinas, Symeon Papadopoulos

发表机构 * Information Technologies Institute - Centre for Research and Technology Hellas(信息科技研究所 - 希腊研究中心与技术研究所)

AI总结 研究真实世界中AI生成图像检测的挑战,分析设计选择对检测性能的影响,提出优化方法并提升AUC 26.87%。

Comments ACM International Workshop on Multimedia AI against Disinformation 2026 (MAD 2026)

详情
AI中文摘要

随着生成式人工智能的发展,AI生成图像的逼真度已达到足以欺骗甚至警惕的人类观察者的水平。然而,尽管当前的AI生成图像检测(AID)方法在受控基准数据集上表现优异,但在真实世界案例中却表现不佳。为此,我们引入了ITW-SM数据集,一个经过精心编排的真实和AI生成图像集合,源自主要社交媒体平台。我们利用它分析构建检测器时的关键设计选择,包括其架构、预训练的潜在空间、训练数据以及预处理方法。我们指出,简单地扩大预训练阶段或选择更多训练数据并不总是能提高检测性能。相反,我们的研究揭示了优化每个设计选择以使处理流程能够传播并有效分析低级痕迹和高级图像语义的重要性。基于我们的发现,我们在多种最先进的检测方法上实现了平均AUC提升26.87%,为开发更具鲁棒性的检测器提供了路线图。我们的资源可在https://mever-team.github.io/itw-sm获取。

英文摘要

As generative Artificial Intelligence (AI) advances, the realism of AI-generated imagery has reached a threshold capable of deceiving even vigilant human observers. Yet, while current AI-generated Image Detection (AID) approaches perform exceptionally well on controlled benchmark datasets, they struggle significantly with real-world cases. To study this behavior we introduce the ITW-SM dataset, a curated collection of real and AI-generated images originating from major social media platforms. We employ it to analyze the effects of key design choices typically considered when building a detector, involving its architecture, pre-trained latent spaces, training data as well as pre-processing approaches. We show that naively scaling the pre-training stage or opting for more training data does not always lead to better detection performance. Instead, our work reveals that it is crucial to optimize each design choice to enable the processing pipeline to propagate and effectively analyze both low-level traces as well as high-level image semantics. Building on our findings, we achieve a substantial average improvement of 26.87% in AUC across multiple state-of-the-art detection approaches and under real-world conditions, providing a roadmap for developing more resilient detectors. Our assets are available at https://mever-team.github.io/itw-sm.

2506.16129 2026-05-18 cs.CV 版本更新

Neurosymbolic Object-Centric Learning with Distant Supervision

基于远监督的神经符号对象中心学习

Stefano Colamonaco, David Debot, Giuseppe Marra

发表机构 * Department of Computer Science, KU Leuven(计算机科学系,鲁汶大学)

AI总结 本文提出DeepObjectLog模型,通过概率神经符号方法实现对象中心学习,无需逐对象标签或掩码,提升对组合、对象计数和规则转移的泛化能力。

详情
AI中文摘要

神经符号学习可通过符号规则为潜在概念提供监督,但通常假设规则引用的实体已指定。对象中心模型将图像分解为槽状表示,但这些槽未必与符号推理所需的谓词对齐。本文研究了基于远监督的对象中心神经符号学习,通过逻辑程序的物体级参数直接从图像中学习,引入DeepObjectLog模型,整合槽式感知编码器与概率逻辑层,预测候选物体表示的对象性和类别概率,逻辑层通过潜在的对象性和类别分配计算观测标签的似然,无需逐对象标签、掩码、边界框或启发式集合匹配。在多样化的视觉推理任务中,DeepObjectLog在组合、对象计数和规则转移的分布外泛化方面优于神经对象中心和标准神经符号基线。

英文摘要

Neurosymbolic learning can use symbolic rules to provide supervision for latent concepts from weak labels, but it commonly assumes that the entities referenced by these rules are already specified. Object-centric models decompose images into slot-like representations; however, such slots are not necessarily aligned with the predicates required for symbolic reasoning. We investigate object-centric neurosymbolic learning under distant supervision, where the object-level arguments of a logic program are learned directly from images using only global task labels. We introduce DeepObjectLog, a probabilistic neurosymbolic model that integrates a slot-based perceptual encoder with a probabilistic logic layer. The encoder predicts objectness and class probabilities for candidate object representations, while the logic layer marginalizes over latent objectness and class assignments to compute the likelihood of the observed label. This formulation provides a differentiable task-level learning signal for object-centric perception without requiring per-object labels, masks, bounding boxes, or heuristic set matching. Evaluations across diverse visual reasoning tasks demonstrate that DeepObjectLog achieves superior out-of-distribution generalization to compositional, object-count, and rule shifts compared to neural object-centric and standard neurosymbolic baselines.
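The marginalization step can be illustrated on a toy counting task. The sketch below is not DeepObjectLog itself: it assumes each candidate slot has an independent objectness probability and that the global label is simply the number of objects in the image, and the function names are hypothetical. Under those assumptions, summing over every latent present/absent assignment reduces to a small dynamic program over slots.

```python
def count_distribution(objectness):
    """Return P(number of present objects = k) for k = 0..len(objectness).

    Assumes slots are independent Bernoulli variables (an illustrative
    simplification, not a claim about the paper's model).
    """
    dist = [1.0]  # with zero slots, the count is 0 with probability 1
    for p in objectness:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this slot is empty
            nxt[k + 1] += mass * p       # this slot holds an object
        dist = nxt
    return dist


def label_likelihood(objectness, observed_count):
    """Likelihood of a global count label, marginalized over slot assignments."""
    return count_distribution(objectness)[observed_count]
```

The likelihood is a polynomial in the slot probabilities, so in an autodiff framework a global count label alone could send gradients to every slot.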

2505.23678 2026-05-18 cs.CV 版本更新

Grounded Reinforcement Learning for Visual Reasoning

面向视觉推理的接地强化学习

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出ViGoRL,通过强化学习实现视觉推理,通过空间坐标锚定推理步骤,提升视觉定位和搜索性能,优于传统方法。

Comments Project website: https://visually-grounded-rl.github.io/

详情
AI中文摘要

尽管强化学习在数学和编码任务中显著提升了语言模型,但视觉推理需要模型引导视觉注意力、解读感知输入并用空间证据支撑抽象推理。我们引入ViGoRL,通过强化学习训练视觉语言模型,将每个推理步骤明确锚定到特定视觉坐标。受人类视觉决策启发,ViGoRL学习生成空间接地的推理轨迹,每一步引导视觉注意力到相关区域。当需要精细探索时,我们的新型多轮强化学习框架使模型能动态放大预测坐标。在多样化的视觉推理基准上,ViGoRL在空间推理、视觉搜索和基于网页的接地任务中均优于监督微调和传统强化学习基线。结合多轮强化学习与放大视觉反馈显著提升了ViGoRL在定位小GUI元素和视觉搜索中的性能,达到86.4%的V*Bench成绩。此外,我们发现接地增强了其他视觉行为,如区域探索、接地子目标设定和视觉验证。最终,人类评估显示模型的视觉参考不仅空间准确,而且有助于理解模型推理步骤。我们的结果表明,视觉接地强化学习是赋予模型通用视觉推理能力的强大范式。

英文摘要

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

2504.18361 2026-05-18 cs.CV cs.AI 版本更新

COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

COCO-Inpaint:用于检测和定位基于修补的图像篡改的基准

Haozhen Yan, Yan Hong, Jiahui Zhan, Suning Lang, Yikun Ji, Huijia Zhu, Jun Lan, Jianfu Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团)

AI总结 本文提出COCO-Inpaint基准,用于检测和定位基于修补的图像篡改,通过高质量样本、多样化的生成场景和大规模覆盖,突出修补区域与真实区域之间的内在不一致。

Comments 6 pages, 8 figures

详情
AI中文摘要

近年来,图像篡改技术的进步使高逼真内容生成成为可能,但也降低了随意编辑的门槛,引发了对多媒体真实性和安全性的担忧。现有图像篡改检测与定位(IMDL)方法主要针对拼接或复制移动伪造,而基于修补的篡改基准仍有限。为弥合这一差距,我们提出了COCO-Inpaint,一个专门用于修补检测和定位的综合基准,主要贡献包括:1)由六个最先进的修补模型生成的高质量修补样本;2)通过四种掩码生成策略和可选文本引导实现的多样化生成场景;3)包含238,302张具有丰富语义多样性的修补图像的大规模覆盖。本基准旨在突出修补区域与真实区域之间的内在不一致,而非表面语义特征如物体形状。我们进一步建立了严格的评估协议,通过三个标准指标来评估现有IMDL方法,揭示当前趋势和挑战。

英文摘要

Recent advances in image manipulation have enabled highly photorealistic content generation, but also lowered the barrier to arbitrary editing, raising concerns about multimedia authenticity and security. Existing Image Manipulation Detection and Localization (IMDL) methods mainly target splicing or copy-move forgeries, while benchmarks for inpainting-based manipulations remain limited. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection and localization, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 238,302 inpainted images with rich semantic diversity. Our benchmark is constructed to highlight intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We further establish a rigorous evaluation protocol with three standard metrics to benchmark existing IMDL methods and reveal current trends and challenges.

2405.13901 2026-05-18 cs.CV cs.LG eess.SP 版本更新

Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers

基于离散余弦变换的去相关注意力机制用于视觉Transformer

Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Ahmet Enis Cetin, Ulas Bagci

发表机构 * Machine and Hybrid Intelligence Lab, Northwestern University(机器与混合智能实验室,西北大学) Department of Electrical and Computer Engineering, University of Illinois Chicago(电气与计算机工程系,伊利诺伊大学芝加哥分校)

AI总结 本文提出基于DCT的去相关注意力机制,通过改进初始化策略和压缩技术提升视觉Transformer的效率和性能,实验表明在Swin Transformer上显著降低计算开销且保持性能。

Comments This paper has been accepted to IJCAI-ECAI 2026

详情
AI中文摘要

自注意力机制是Transformer架构成功的关键,但学习查询、键和值投影仍具挑战性且计算成本高。本文提出两种互补方法,利用离散余弦变换(DCT)提升视觉Transformer的效率和性能。首先,引入基于DCT的初始化策略,通过DCT系数初始化投影权重,提升CIFAR-10和ImageNet-1K的分类精度。其次,提出基于DCT的注意力压缩技术,利用频域的去相关特性,通过截断高频成分减少查询、键和值投影的维度,不牺牲精度。实验表明,该压缩方法在Swin Transformer上显著降低计算开销,同时保持性能。

英文摘要

Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.
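The compression idea can be sketched in a few lines: transform a feature vector with an orthonormal DCT-II and keep only the lowest-frequency coefficients, so that any projection applied afterwards sees a shorter input. This is a minimal illustration under stated assumptions, not the paper's implementation; the 1-D setting, the function names, and the `keep` parameter are all hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (row k = frequency k)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n)
                     for i in range(n)])
    return rows


def truncate_dct(vector, keep):
    """Return the `keep` lowest-frequency DCT coefficients of `vector`."""
    basis = dct_matrix(len(vector))
    return [sum(b * x for b, x in zip(row, vector)) for row in basis[:keep]]
```

For a constant input such as `[1.0, 1.0, 1.0, 1.0]`, all energy lands in the first coefficient, so dropping the remaining ones loses nothing; real patches lose only whatever sits in the discarded high-frequency bands.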

2312.05975 2026-05-18 cs.CV cs.AI cs.LG 版本更新

FM-G-CAM: A Holistic Approach for Explainable AI in Computer Vision

FM-G-CAM:计算机视觉中可解释AI的综合方法

Ravidu Suien Rammuni Silva, Jordan J. Bird

发表机构 * Department of Computer Science, Nottingham Trent University(诺丁汉特伦特大学计算机科学系)

AI总结 本文提出FM-G-CAM方法,通过综合考虑多个预测类别,提供CNN模型决策的全面解释,改进传统Grad-CAM的局限性。

详情
AI中文摘要

可解释性是现代AI在现实应用中的关键因素。本文旨在强调理解计算机视觉模型(特别是卷积神经网络)预测的必要性。现有方法主要基于梯度加权类激活图(Grad-CAM),仅关注单一目标类别,忽略了CNN预测过程的大部分内容。本文提出了一种全面的方法,称为融合多类梯度加权类激活图(FM-G-CAM),考虑多个高预测类别,提供预测器CNN的全面解释。我们还提供了详细数学和算法描述。此外,通过现实应用场景的定量和定性比较,展示了FM-G-CAM相较于Grad-CAM的优势。最后,我们提供了一个开源Python库,包含FM-G-CAM实现,方便生成CNN模型预测的显著图。

英文摘要

Explainability is a vital aspect of modern AI for real-world impact and usability. The main objective of this paper is to emphasise the need to understand the predictions of Computer Vision models, specifically Convolutional Neural Network (CNN) models. Existing methods for explaining CNN predictions are largely based on Gradient-weighted Class Activation Maps (Grad-CAM) and focus solely on a single target class; this assumption about the target class selection neglects a large portion of the predictor CNN's prediction process. In this paper, we present an exhaustive methodology, called Fused Multi-class Gradient-weighted Class Activation Map (FM-G-CAM), that considers multiple top-predicted classes and provides a holistic explanation of the predictor CNN's rationale. We also provide a detailed mathematical and algorithmic description of our method. Furthermore, alongside a concise comparison of existing methods, we compare FM-G-CAM with Grad-CAM, quantitatively and qualitatively highlighting its benefits through real-world practical use cases. Finally, we present an open-source Python library with an FM-G-CAM implementation to conveniently generate saliency maps for CNN-based model predictions.

2605.15764 2026-05-18 cs.CV cs.AI 版本更新

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

GRASP:学习让社会推理扎根于多人非语言互动

Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Georgia Institute of Technology(佐治亚理工学院) Amazon AGI(亚马逊AGI) Korea University(高丽大学)

AI总结 GRASP通过连接高层社会问答与细粒度目光和指代手势事件,提升多人非语言互动的社会推理能力,包含29万个问答对,并提出Social Grounding Reward以提升模型性能。

Comments Project page: https://social-reaoning.github.io/grasp/

详情
AI中文摘要

理解社会互动需要对微妙的非语言线索进行推理,但当前的多模态大语言模型(MLLMs)在多人视频中常常无法识别谁在与谁互动。我们引入GRASP,一个大规模社会推理数据集,将高层社会问答与细粒度的目光和指代手势事件连接起来。GRASP包含290K个问答对,覆盖46K段视频(总时长749小时),按16类分类体系组织,涵盖目光、手势及联合目光-手势推理,并配有用于评估的GRASP-Bench。不同于以往仅关注孤立线索或高层社会问答的资源,GRASP基于身份一致的目光轨迹、指代手势及其联合组合成的社会事件来构建问题。此外,我们提出Social Grounding Reward(SGR),一种利用这些社会事件鼓励模型推理每次互动中所涉参与者的学习信号。实验显示,SGR提升了GRASP-Bench上的性能,同时保持了在相关社会视频问答基准上的零样本性能。

英文摘要

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question--answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze--gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

2605.15760 2026-05-18 cs.CV 版本更新

Learn2Splat: Extending the Horizon of Learned 3DGS Optimization

Learn2Splat: 扩展学得3DGS优化的视野

Naama Pearl, Stefano Esposito, Haofei Xu, Amit Peleg, Patricia Gschossmann, Lorenzo Porzi, Peter Kontschieder, Gerard Pons-Moll, Andreas Geiger

发表机构 * University of Tübingen, Tübingen AI Center(图宾根大学,图宾根人工智能中心) ETH Zurich(苏黎世联邦理工学院) Meta Reality Labs(Meta现实实验室)

AI总结 本文提出了一种学得优化器,通过元学习方案扩展优化视野,提升稀疏和密集视角下的重建质量与稳定性,实现零样本泛化。

详情
AI中文摘要

3D高斯散射(3DGS)优化通常使用标准优化器(Adam、SGD)。尽管在多样场景中稳定,但标准优化器通用性强,无法针对问题结构进行优化。特别是,它们产生独立的参数更新,无法捕捉场景中的结构和空间关系,导致优化效率低和收敛慢。近期的工作引入了学得优化器,通过参数间和高斯间依赖预测相关更新。然而,这些方法在固定迭代次数训练,并依赖手动调度学习率以避免退化。本文提出了一种学得优化器,能够在延长的优化视野中避免退化,无需辅助机制。为此,我们提出了一种元学习方案,通过检查点缓冲区和优化器滚动策略扩展优化视野,并结合一种编码梯度尺度信息的架构。结果表明,早期新颖视角合成质量得到提升,同时在长视野中保持稳定,实现零样本泛化。为支持我们的发现,我们引入了第一个统一框架,用于训练和评估学得和传统优化器,适用于稀疏和密集视角设置。代码和模型将公开发布。我们的项目页面可在 https://naamapearl.github.io/learn2splat 上找到。

英文摘要

3D Gaussian Splatting (3DGS) optimization is most commonly performed using standard optimizers (Adam, SGD). While stable across diverse scenes, standard optimizers are general-purpose and not tailored to the structure of the problem. In particular, they produce independent parameter updates that do not capture the structural and spatial relationships within a scene, leading to inefficient optimization and slow convergence. Recent works introduced learned optimizers that predict correlated updates informed by inter-parameter and inter-Gaussian dependencies. However, these methods are trained for a fixed number of optimization iterations and rely on manually scheduled learning rates to avoid degradation. In this paper, we introduce a learned optimizer for 3DGS that avoids degradation over extended optimization horizons without auxiliary mechanisms. To enable this, we propose a meta-learning scheme that extends the optimization horizon via a checkpoint buffer and an optimizer rollout strategy, combined with an architecture that encodes gradient scale information in its latent states. Results show improved early novel view synthesis quality while remaining stable over long horizons, with zero-shot generalization to unseen reconstruction settings. To support our findings, we introduce the first unified framework for training and evaluating both learned and conventional optimizers across sparse and dense view settings. Code and models will be released publicly. Our project page is available at https://naamapearl.github.io/learn2splat .

2605.15755 2026-05-18 cs.CV 版本更新

Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models

用多模态大语言模型进行艺术品情感理解的属性引导选择性推理

Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng

发表机构 * Department of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Department of Computer Science, National Taiwan University(国立台湾大学计算机科学系)

AI总结 本文提出属性引导的选择性推理方法,利用多模态大语言模型实现艺术品情感理解,并通过形式属性瓶颈引导框架提升情感预测精度和解释的简洁性。

详情
AI中文摘要

多模态大语言模型(MLLMs)能够生成流畅的艺术品情感解释,但常出现属性泛滥(attribute flooding)问题:它们罗列许多可见的形式属性,却未能识别哪些线索真正支持情感判断。因此,本文将艺术品情感理解表述为属性引导的选择性推理(AGSR),其中预定义的形式属性作为证据单元,只有对情感起作用的属性才应进入最终解释。为使该问题可度量,我们扩展了EmoArt。EmoArt最初发表于ACM MM 2025,是包含132,664件艺术品的资源,带有内容、形式属性、效价-唤醒和情感标注;我们为其新增了覆盖1,400件艺术品的人类显著性标注,由15名受过艺术训练的标注者完成。该扩展提供了实例级监督,用于区分仅仅存在的属性和在情感上显著的属性。我们进一步提出FAB-G(形式属性瓶颈引导推理),一个有监督的多智能体框架,先预测属性级显著性,再将下游情感分析限制在保留的线索上。实验表明,FAB-G在情感、唤醒和效价预测上取得了一致的提升,在Dice和Tversky指标下与人类标注的显著属性有更强的一致性,并且生成的最终解释比基于提示的基线紧凑得多。跨数据集评估进一步表明,属性引导的显著性选择能够迁移到EmoArt源分布之外,同时也揭示了特定属性的边界案例。数据集和项目页面可在 https://zhiliangzhang.github.io/EmoArt-130k/ 获取。

英文摘要

Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
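The two agreement metrics named in the abstract are easy to state on sets of attributes. The sketch below uses the standard definitions of the Dice coefficient and the Tversky index; treating the predicted and human-marked salient attributes as plain Python sets, and the attribute names in the usage note, are assumptions made for illustration. The weights `alpha` and `beta` used in the paper are not given in the abstract.

```python
def dice(predicted, reference):
    """Dice coefficient between two attribute sets: 2|A & B| / (|A| + |B|)."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty selections agree perfectly
    return 2.0 * len(predicted & reference) / (len(predicted) + len(reference))


def tversky(predicted, reference, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    alpha penalizes extra predicted attributes, beta penalizes missed ones;
    alpha = beta = 0.5 reproduces the Dice coefficient.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom
```

For example, a model that selects `{"color", "line", "texture"}` when annotators marked `{"color", "line"}` floods one extra attribute; raising `alpha` makes the score punish that flooding more heavily than a missed attribute.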

2605.15753 2026-05-18 cs.RO cs.CV 版本更新

Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces

层次化和整体化的开放词汇功能3D场景图用于室内空间

Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys, Francis Engelmann, Xiangyang Ji

发表机构 * Tsinghua University(清华大学) ETH Zürich(苏黎世联邦理工学院) MPI for Informatics(马克斯·普朗克信息学研究所) Dalian University of Technology(大连理工大学) Microsoft(微软) Stanford University(斯坦福大学) University of Lugano(卢加诺大学)

AI总结 本文提出一种开放词汇管道,结合2D视觉定位和3D图优化,解决小规模密集相似实例的场景图推理问题,通过时间图优化和全局层次塑造提升室内空间的功能3D场景图生成能力。

详情
AI中文摘要

功能3D场景图提供了一种灵活的3D场景理解和机器人操作的表示方法,由物体节点、交互元素和功能关系边定义。然而,由于现有基准覆盖有限和先前管道设计过于简单,其潜力尚未被充分挖掘。因此,本文通过引入密集的桌面上物体和显式的多级功能关系扩展基准覆盖。这种扩展引入了关键挑战,包括小规模、密集和相似实例的处理,关系推理中缺乏视觉锚点,跨帧融合中的实例混淆,以及动态视角下的属性不确定性。为了解决这些问题,我们提出了一种基于2D视觉定位和3D图优化的开放词汇管道。具体而言,我们从2D视觉证据中锚定细粒度的功能边,并使用多个线索在3D中跨帧关联节点。此外,边关联被公式化为时间图优化,整合证据积累、熵正则化和时间平滑,以稳健地确定每个节点的功能连接。最后,通过全局层次塑造恢复层次图结构。大量实验表明,所提方法能够在具有挑战性的现实场景中可靠地推断功能3D场景图,从而进一步解锁其在实际应用中的潜力。

英文摘要

Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, along with a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.

2605.15737 2026-05-18 cs.CV 版本更新

BARRIER: Bounded Activation Regions for Robust Information Erasure

BARRIER:用于鲁棒信息擦除的有界激活区域

Jan Miksa, Patryk Krukowski, Przemysław Spurek, Dawid Damian Rymarczyk, Marcin Sendera

发表机构 * Jagiellonian University(雅盖隆大学) IDEAS Research Institute(IDEAS研究所) National Research Institute(国家研究所)

AI总结 BARRIER通过动态隐藏层激活几何结构,利用区间算术保护中性概念,实现稳定的信息擦除,同时保持其他表示的完整性。

详情
AI中文摘要

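机器遗忘(machine unlearning)已遇到关键瓶颈。传统的权重空间干预主要着眼于擦除目标概念,往往无法避免对其他重要表示的意外抑制。由于这些方法缺乏保护中性概念的形式化数学保证,会造成严重的附带损害,使关键知识被遗忘;为避免性能退化,它们常被迫采用保守的更新。我们提出BARRIER(Bounded Activation Regions for Robust Information Erasure),将干预位置从静态模型权重转移到隐藏层激活的动态几何结构。BARRIER在基于SVD的激活空间投影上使用区间算术(IA),将特定目标区域封装在一个包围超立方体中。通过仅在该遗忘区间内驱动遗忘更新,并在其补集上对模型响应给出数学界,我们确保对保留分布的严格保护。这一几何构造将知识保护从经验启发式转化为形式化的优化目标,并对功能漂移给出概率尾界。重要的是,这种稳定性允许在遗忘区域内进行非常激进的遗忘更新。实验评估表明,BARRIER在分类器和扩散模型上达到了与最先进方法相当的权衡,在最大化目标概念擦除的同时保护所有其他表示的完整性。代码见 https://github.com/OneAndZero24/BARRIER 。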

英文摘要

Machine unlearning has reached a critical bottleneck. As traditional weight-space interventions focus primarily on erasing targeted concepts, they often fail to prevent the unintended suppression of other significant representations. This leads to substantial collateral damage, with essential knowledge being forgotten, because these methods lack formal mathematical guarantees for the preservation of neutral concepts. To avoid degradation, they are frequently forced into conservative updates. We propose BARRIER (Bounded Activation Regions for Robust Information Erasure), a paradigm-shifting framework that shifts the locus of intervention from static model weights to the dynamic geometry of hidden-layer activations. Unlike existing methods, BARRIER employs Interval Arithmetic (IA) on SVD-based projections of the activation space to encapsulate the specific target region within a bounding hypercube. By driving unlearning updates exclusively within this forget interval and mathematically bounding the model response on the complement, we ensure rigorous protection of the retain distribution. This geometric construction transforms the preservation of knowledge from an empirical heuristic into a formal optimization target with a probabilistic tail bound on functional drift. Crucially, this stability permits highly aggressive unlearning updates within the forget region. Empirical evaluations demonstrate that BARRIER matches state-of-the-art trade-offs across classifiers and diffusion models, maximizing targeted concept erasure while safeguarding the integrity of all other representations. Our code is available at https://github.com/OneAndZero24/BARRIER.
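The geometric core of the abstract, enclosing a target region in an axis-aligned hypercube, can be sketched as follows. This is only the enclosure and membership test, under the assumption that the activations have already been projected onto a low-dimensional basis (the abstract uses SVD-based projections). It does not reproduce how BARRIER bounds the model response or drives updates, and the function names and `margin` parameter are hypothetical.

```python
def bounding_hypercube(projected, margin=0.0):
    """Per-dimension [low, high] intervals enclosing all projected activations.

    `projected` is a non-empty list of equal-length coordinate lists.
    """
    dims = len(projected[0])
    lows = [min(p[d] for p in projected) - margin for d in range(dims)]
    highs = [max(p[d] for p in projected) + margin for d in range(dims)]
    return lows, highs


def in_forget_region(point, box):
    """True if `point` lies inside the hypercube on every dimension."""
    lows, highs = box
    return all(lo <= x <= hi for x, lo, hi in zip(point, lows, highs))
```

With such a test, an update rule could be gated so that it only acts on activations for which `in_forget_region` is true, leaving everything outside the box untouched.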

2605.15736 2026-05-18 cs.CV cs.AI 版本更新

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

BiomedAP: 一种基于视觉的双锚框架与门控跨模态融合用于鲁棒的医学视觉-语言适应

Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

发表机构 * Wenzhou University(温州大学) Wenzhou Business College(温州商务学院)

AI总结 BiomedAP通过门控跨模态融合和双锚约束机制,提升医学视觉-语言模型在提示变化下的鲁棒性,实验显示其在多个基准上均优于基线方法。

Comments CVPR2026 Workshop

详情
AI中文摘要

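生物医学视觉-语言模型(VLMs)在少样本医学诊断中展现出显著潜力,但面临一个关键瓶颈:对提示变化的脆弱性。现有适配框架通常将视觉提示和文本提示作为相互独立的分支进行优化,并依赖理想的“黄金提示”(Golden Prompts)。在临床实际中,描述往往带有噪声且形式各异,这种模态隔离会导致跨模态对齐不稳定。为此,我们提出BiomedAP,一个视觉引导的双锚框架,带有门控跨模态融合。BiomedAP通过两种机制实现协同对齐:(1)门控跨模态融合,支持模态间的逐层交互,作为动态噪声调节器抑制无关的文本线索;(2)双锚约束,将可学习提示正则化到稳定的语义中心,这些中心分别来自专家模板(高锚点)和少样本视觉原型(低锚点)。在11个基准上的大量实验表明,BiomedAP始终优于基线方法,取得了有竞争力的少样本精度,并在提示扰动下显著提升了鲁棒性。代码见 https://github.com/tongdiedie/BiomedAP 。关键词:视觉-语言模型;提示学习;参数高效微调;少样本学习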

英文摘要

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

2605.15733 2026-05-18 cs.NE cs.AI cs.CV 版本更新

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

海马-内嗅皮层启发的世界模型中的结构抽象与泛化

Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

发表机构 * Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Center of Quantitative Biology, School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University(北京大学-清华大学生命科学中心,先进跨学科研究院,IDG/麦克戈文脑科学研究院,定量生物学中心,心理与认知科学学院,机器感知重点实验室(教育部),北京大学)

AI总结 本文提出了一种脑启发的分层模型,通过逆向模型提取潜在转换并构建预测视觉世界模型,展示了在连续高维动态中同时提取抽象结构的能力,实现了结构泛化。

Comments Project page: https://hpc-mec-worldmodel.github.io/

详情
AI中文摘要

人类将经验抽象为结构化表示,以促进模式推断和知识迁移。尽管已知海马-内嗅皮层(HPC-MEC)回路能够表示空间和概念空间,但从连续、高维动态中同时提取抽象结构的机制仍不清楚。我们提出了一种脑启发的分层模型,在推断潜在转换的同时构建预测性的视觉世界模型。该架构采用逆向模型进行结构提取,并结合HPC-MEC耦合模型,将关系结构(MEC)与整合的情景场景(HPC)分离开来。我们以基本变换动态作为基准,展示了该模型的结构抽象能力。借助速度驱动的路径整合,该框架能够在不同情境中实现稳健的预测和结构复用,从而实现结构泛化。本文为理解脑启发的自监督世界模型学习如何促进可复用抽象知识的获取,提供了一个新的计算框架。

英文摘要

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

2605.15728 2026-05-18 cs.CV cs.AI 版本更新

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

DecomPose:解耦跨类优化冲突以实现类别级6D物体姿态估计

Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

发表机构 * Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, Hubei, China(智能机器人湖北省重点实验室,武汉工程大学,武汉,湖北,中国) University of Science(科学技术大学) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 本文提出DecomPose框架,通过数据驱动的难度代理和不对称分支策略,解耦跨类优化冲突,提升类别级6D姿态估计性能。

详情
AI中文摘要

类别级6D物体姿态估计通常被建模为多类联合学习问题,但类别间的几何异质性导致共享模块中不兼容的优化信号纠缠,产生梯度冲突和负迁移。为此,我们首先引入基于梯度的诊断方法量化模块级跨类冲突。基于诊断结果,我们提出DecomPose框架,通过难度感知的梯度解耦和稳定性驱动的不对称分支策略,缓解优化冲突:(1) 难度感知的梯度解耦通过数据驱动的难度代理将类别分组,并将每个实例路由到组特定的对应分支以隔离不兼容的更新;(2) 稳定性驱动的不对称分支将更高容量的分支分配给结构简单的类别作为稳定的优化锚点,同时通过轻量级分支约束复杂类别以抑制噪声更新并缓解负迁移。在REAL275、CAMERA25和HouseCat6D上的大量实验表明,DecomPose有效减少了跨类优化冲突,并在多个基准上实现了优越的姿态估计性能。

英文摘要

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on results of diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.

2605.15725 2026-05-18 cs.CV cs.AI cs.RO 版本更新

DiLA: Disentangled Latent Action World Models

DiLA:解耦的潜在动作世界模型

Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

发表机构 * Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, IDG/McGovern Institute for Brain Research, Peking University(北京大学-清华生命科学中心,先进跨学科研究院,IDG/麦克戈文脑科学研究院,北京大学) Center of Quantitative Biology, Peking University(北京大学定量生物学中心) School of Psychological and Cognitive Sciences, Key Laboratory of Machine Perception (Ministry of Education), Peking University(心理与认知科学学院,机器感知重点实验室(教育部),北京大学)

AI总结 DiLA通过内容-结构解耦解决动作抽象与生成保真度的平衡问题,实现高质量视频生成和动作迁移。

Comments Project Page: http://disentangled-latent-action-world-models.github.io

详情
AI中文摘要

潜在动作模型(LAMs)通过推断连续帧间的抽象动作来学习世界模型,但面临动作抽象与生成保真度的权衡问题。现有方法通常通过两阶段训练或限制预测到光流来解决。本文提出DiLA,一种解耦的潜在动作世界模型,通过内容-结构解耦解决这一权衡。我们的关键发现是解耦和潜在动作学习是共演进的:潜在动作学习中的预测瓶颈驱动解耦,迫使模型将空间布局压缩到结构路径,同时将视觉细节卸载到单独的内容路径进行生成。这种协同作用产生了一个连续且语义结构化的潜在动作空间,而不牺牲生成质量。DiLA在视频生成质量、动作迁移、视觉规划和流形可解释性方面表现优异。这些发现确立了DiLA作为统一框架,同时实现高层动作抽象和高保真生成,推动了自监督世界模型学习的前沿。

英文摘要

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

2605.15723 2026-05-18 cs.LG cs.CV 版本更新

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

GOMA:从图信号平滑视角迈向结构驱动的多模态对齐

Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang

发表机构 * School of Airspace Science and Engineering, Shandong University(山东大学 airspace 科学与工程学院) Department of Computer Science, Beijing Institute of Technology(北京理工大学计算机学院) School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院)

AI总结 GOMA通过统一设计解决多模态对齐中的拓扑障碍、平滑控制与信息保留问题,在七个多模态图基准上取得最佳检索性能并保持稳定性。

详情
AI中文摘要

多模态对齐通常通过CLIP式双编码器从孤立的图像-文本对中学习,基本没有利用实体间的关系上下文。在多模态属性图(MAGs)中,节点携带多模态属性,边编码语料结构,这为精炼冻结的视觉-语言嵌入提供了自然的设定。这种精炼具有挑战性:视觉、文本和跨模态关系往往诱导出不同的邻域几何结构,而不加限制的图传播可能使检索表示迅速过平滑。因此,有效利用图上下文需要同时打破模态特定的拓扑障碍、控制平滑程度,并在语义边界坍缩之前保留有信息量的平滑。我们提出图优化多模态对齐(GOMA),一种结构驱动的后对齐框架,将冻结的多模态嵌入视为图信号,并通过统一的面向检索的设计满足上述要求。GOMA解耦了三个关键设计选择:消息应流向何处、多模态证据应如何传播,以及应保留何种平滑深度。具体而言,它学习模态感知的传播算子,执行不含对角跨模态捷径的有限步耦合平滑,并自适应地读出节点特定的平滑轨迹,以在坍缩前保留有用的平滑。所有实验遵循直推式(transductive)MAG检索协议,其中图仅作为无标签上下文,且移除了对角自配对边。在七个MAG基准上,GOMA取得最佳或并列最佳的检索性能,并且比最强的图方法竞争对手稳定得多,表明MAG结构可以作为冻结多模态嵌入的有效后编码器。

英文摘要

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
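The over-smoothing risk and the idea of keeping a smoothing trajectory can be seen in a tiny example. The sketch below is a generic finite-step neighbor-averaging scheme on frozen embeddings, not GOMA's learned modality-aware operators; the update rule, the mixing weight `alpha`, and the function names are assumptions chosen for illustration.

```python
def smooth_step(features, neighbors, alpha):
    """One step: x_i <- (1 - alpha) * x_i + alpha * mean of neighbor features.

    `features` is a list of equal-length vectors; `neighbors[i]` lists the
    node indices adjacent to node i. Isolated nodes keep their features.
    """
    smoothed = []
    for i, x in enumerate(features):
        if not neighbors[i]:
            smoothed.append(list(x))
            continue
        mean = [sum(features[j][d] for j in neighbors[i]) / len(neighbors[i])
                for d in range(len(x))]
        smoothed.append([(1.0 - alpha) * x[d] + alpha * mean[d]
                         for d in range(len(x))])
    return smoothed


def smoothing_trajectory(features, neighbors, alpha, steps):
    """Return the embeddings after 0, 1, ..., `steps` smoothing steps."""
    trajectory = [[list(x) for x in features]]
    for _ in range(steps):
        trajectory.append(smooth_step(trajectory[-1], neighbors, alpha))
    return trajectory
```

On a two-node graph with features `[0.0]` and `[2.0]` and `alpha = 0.5`, a single step already makes both nodes identical, which is the collapse the abstract warns about. Keeping the whole trajectory lets a readout pick an earlier, still informative depth for each node.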

2605.15722 2026-05-18 cs.LG cs.AI cs.CV eess.SP 版本更新

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

双向融合引导心脏模式用于半监督ECG分割

Jeonghwa Lim, Minje Park, Sunghoon Joo

发表机构 * VUNO Inc.(VUNO公司)

AI总结 本文提出CardioMix框架,通过心脏模式引导的双向CutMix策略提升ECG分割性能,实验表明其在多种数据集和标注比例下均优于现有方法。

Comments 11 pages, 6 figures, 6 tables

详情
AI中文摘要

准确界定心电图(ECG),即分割有意义的波形特征,对心血管诊断至关重要。然而,标注数据稀缺给深度学习模型训练带来了重大挑战。传统半监督语义分割(SemiSeg)方法主要关注未标注数据的一致性,未能充分利用标注与未标注集之间的信息交换。为此,我们引入CardioMix,一个基于心脏模式引导的双向CutMix策略构建的ECG分割框架。该方法通过从未标注数据中引入真实变化来丰富标注集,同时对未标注集施加更强的监督信号,因为心脏模式引导的混合确保所有增强样本在生理上仍有意义。本框架设计为即插即用模块,与各种SemiSeg算法具有高度兼容性。在公开的多数据集ECG波形界定基准SemiSegECG上的大量实验表明,CardioMix作为即插即用模块可兼容多种SemiSeg算法,并在不同数据集和标注比例下始终优于现有基于CutMix的混合策略。

英文摘要

Accurate delineation of electrocardiogram (ECG), the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios as a plug-and-play module compatible with various SemiSeg algorithms.

2605.15720 2026-05-18 cs.CV cs.LG 版本更新

Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment

Semi-MedRef:基于跨模态对齐的半监督医学指引用图像分割

Yuchen Li, Zhen Zhao, Yi Liu, Luping Zhou

发表机构 * The University of Sydney(悉尼大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Changzhou University(常州大学)

AI总结 本文提出Semi-MedRef框架,通过三个组件维持医学图像与位置语言的一致性,实验显示其在低标签条件下优于其他方法。

详情
AI中文摘要

医学指引用图像分割(MRIS)需要像素级掩码与解剖位置的文本描述对齐,这在低标签环境下使标注成本高昂。半监督学习(SSL)可通过利用未标记数据缓解这一负担,但其成功依赖于在扰动下保持可靠的图像-文本对齐。现有SSL方法多采用独立或简单的多模态扰动(如左右翻转),未能充分解决强增强下的跨模态对齐问题,而CutMix在单模态SSL中效果显著,但在多模态设置中因破坏图像-文本一致性而未被广泛探索。本文提出Semi-MedRef,一种教师-学生SSL框架,通过三个保持对齐的组件:T-PatchMix,一种跨模态CutMix风格增强,通过位置约束和概率驱动规则同步补丁混合与指引用表达;PosAug,一种位置感知文本增强,通过遮蔽或模糊解剖短语;以及ITCL,一种位置引导的图像-文本对比学习模块,利用位置伪标签构建软解剖正例并加强医学基础的跨模态对齐。在QaTa-COV19和MosMedData+上的实验表明,Semi-MedRef在所有标签条件下均优于完全监督和半监督基线。

英文摘要

Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.

2605.15711 2026-05-18 cs.CV 版本更新

EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy

EntropyScan: 向通过视觉注意力熵实现LVLMs的模型级后门检测

Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen

发表机构 * China University of Geosciences(中国地质大学) University of the Chinese Academy of Sciences(中国科学院大学) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所)

AI总结 本文提出EntropyScan,一种轻量且不依赖触发器的模型级后门检测方法,通过量化视觉注意力分布的结构扭曲来检测后门模型,实验显示其在两个LVLM架构和三种高级攻击场景中达到98.5%的F1分数和96.6%的AUC。

Comments 20 pages, 6 figures, 8 tables

详情
AI中文摘要

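大型视觉-语言模型(LVLMs)在各类任务中展现出卓越能力,但仍易受后门攻击。现有防御方法主要集中于样本级防御,依赖对训练数据或触发器的了解;而判断给定模型是否被植入后门仍是一项关键却尚未被探索的任务。为填补这一空白,我们提出EntropyScan,一种轻量且不依赖触发器的LVLM模型级后门检测方法。我们首先观察到,后门注入会破坏跨模态对齐,使模型在良性样本上的视觉注意力分配出现明显的结构异常。基于这一发现,EntropyScan通过量化此类注意力偏差来检测被植入后门的模型:它从大语言模型(LLM)的初始层提取视觉注意力分布,用Tsallis熵刻画这些结构扭曲,并在少量良性样本上采用以参考为锚点的Z分数归一化来识别后门模型。在两种LVLM架构和三种高级攻击场景上的大量实验表明,EntropyScan的平均F1分数达到98.5%,AUC达到96.6%。代码即将公开。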

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on knowledge of the training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an average F1 score of 98.5% and an AUC of 96.6%. Our code will be publicly available soon.
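
The two numerical ingredients named in the abstract, Tsallis entropy and a reference-anchored Z-score, can be sketched in a few lines. The Tsallis formula is the standard one, S_q(p) = (1 - sum_i p_i^q) / (q - 1). Everything else here is an assumption for illustration: the choice q = 2, the function names, and the idea of scoring one entropy value against entropies from reference (clean) models. How the paper aggregates layers, heads, and samples is not stated in the abstract.

```python
import math


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of a non-negative weight vector (renormalized to sum to 1)."""
    total = sum(p)
    probs = [x / total for x in p]
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit is the Shannon entropy.
        return -sum(x * math.log(x) for x in probs if x > 0)
    return (1.0 - sum(x ** q for x in probs)) / (q - 1.0)


def z_score(value, reference, eps=1e-8):
    """Standardize `value` against the mean and std of reference entropies."""
    mean = sum(reference) / len(reference)
    var = sum((r - mean) ** 2 for r in reference) / len(reference)
    return (value - mean) / (math.sqrt(var) + eps)
```

With q = 2, a uniform attention distribution over four visual tokens gives 0.75 and a one-hot distribution gives 0, so attention that collapses onto a few tokens shows up as a drop in entropy; a large absolute Z-score relative to the reference set would then mark the model as suspicious under this sketch.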

2605.15708 2026-05-18 cs.CV 版本更新

3D Segmentation Using Viewpoint-Dependent Spatial Relationships

基于视角依赖空间关系的3D分割

Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam, Asako Kanezaki

AI总结 本文提出一个包含22万样本的3D参照分割数据集,通过密集视角采样扩展至数千万样本,研究视角依赖空间关系对3D大模型的影响,提升分割精度并提高mIoU至0.47。

详情
AI中文摘要

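近期3D数据集和多模态模型的进步显著提升了基于自然语言的3D场景理解。然而,大多数3D参照分割方法未显式表示观察者视角,导致“左”“右”“前”“后”等空间关系含义模糊且难以评估。我们提出一个视角感知的3D参照分割数据集,包含22万个基准样本,并可通过密集视角采样扩展至数千万个视角条件样本。在该数据集中,目标物体只能通过以观察者为中心的空间关系来识别,因此必须进行视角条件下的定位。我们利用相机位姿自动标注以观察者为中心的关系(左/右、前/后)以及与视角无关的关系(上/下)来构建基准。基于该基准,我们在零样本设置下评估了若干现有3D大型多模态模型,发现当前模型难以处理依赖视角的空间指令。我们进一步研究如何将显式视角信息融入3D大型多模态模型:提出一种编码相机位姿的视角表示,使模型以观察视角为条件,从而提升视角依赖关系上的分割精度,与无视角条件的模型相比,mIoU从0.30提高到0.47。数据集、代码和训练好的模型将在论文被接收后公开。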

英文摘要

Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as "left," "right," "front," and "behind" ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.

2605.15707 2026-05-18 eess.IV cs.CV 版本更新

Evaluation of Anatomical Shape Priors in Deep Learning-Based Cardiac Multi-Compartment Segmentation

基于深度学习的心脏多腔分割中解剖形状先验的评估

Michael Hudler, Franz Thaler, Martin Urschler

发表机构 * Institute for Medical Informatics, Statistics and Documentation(医学信息学、统计学与文档研究所)

AI总结 本文评估了轻量级显式形状先验在心脏多腔CT分割中的效果,发现标准3D U-Net仍为强大基线,手工先验效果有限,未来需更具表达力的学习先验。

Comments Published in the Proceedings of the Third Austrian Symposium on AI, Robotics, and Vision (AIRoV 2026), pp. 23-27

详情
AI中文摘要

全心多腔CT分割在临床中具有重要意义,但标准CNN未显式强制解剖合理性。基于训练数据统计,我们评估了轻量级显式形状先验(实现为形状感知损失和由空间标签分布热图引导的U-Net变体)能否改进MM-WHS CT和WHS++上的3D心脏分割。在所有实验中,标准3D U-Net出人意料地仍是非常强的基线,手工先验至多带来微小且不一致的变化,并且常常降低性能。这些结果表明,基线已捕捉了大量隐式解剖规律,未来的改进可能需要更具表达力的学习先验,而非简单的手工解剖形状约束。

英文摘要

Whole-heart multi-compartment CT segmentation is clinically important, but standard CNNs do not explicitly enforce anatomical plausibility. Based on statistics derived from the training data, we evaluate whether lightweight explicit shape priors, implemented as shape-aware losses and spatial label distribution heatmap-guided U-Net variants, improve 3D cardiac segmentation on MM-WHS CT and WHS++. Across all experiments, a standard 3D U-Net surprisingly remained a very strong baseline, with handcrafted priors yielding at best marginal and inconsistent changes and often degrading performance. These results suggest that the baseline already captures substantial implicit anatomical regularities and that future gains will likely require more expressive learned priors rather than simple handcrafted anatomical shape constraints.

2605.15689 2026-05-18 cs.CV 版本更新

How to Choose Your Teacher for Fine Grained Image Recognition

如何为细粒度图像识别选择教师

Oswin Gosal, Edwin Arkel Rios, Augusto Christian Surya, Fernando Mikael, Bo-Cheng Lai, Min-Chun Hu

发表机构 * National Tsing Hua University, Taiwan(台湾国立清华大学) National Yang Ming Chiao Tung University, Taiwan(台湾国立阳明交通大学)

AI总结 本文提出Ratio 1-2指标,通过分析实验数据提升教师选择效果,使小模型在细粒度图像识别中获得17%的准确率提升。

Comments Accepted to The 13th Workshop on Fine-Grained Visual Categorization (FGVC13) @ CVPR 2026. Main: 6 pages, 3 figures, 4 tables

详情
AI中文摘要

细粒度图像识别用于分类如鸟类物种或汽车型号等子类别。尽管最先进的模型准确率高,但往往资源消耗过大,难以部署在受限设备上。知识蒸馏通过将大教师模型的知识转移到小学生模型中解决此问题。选择合适的教师模型是关键挑战,本文引入Ratio 1-2指标,基于教师预测比例进行评估。对超过1000次实验的分析显示,该指标比先前方法提升18%,使小模型在细粒度图像识别中达到17%的准确率提升。实验代码库可在https://github.com/arkel23/FGIR-KD-Teacher获取。

英文摘要

Fine-grained image recognition classifies subcategories such as bird species or car models. While state-of-the-art (SOTA) models are accurate, they are often too resource-intensive for deployment on constrained devices. Knowledge distillation addresses this by transferring knowledge from a large teacher model to a smaller student model. A key challenge is selecting the right teacher, as it heavily impacts student performance. This paper introduces a teacher selection metric, Ratio 1-2, based on teacher prediction ratios. Extensive analysis of over one thousand experiments across 3 students, 8 teachers, and 8 datasets under 4 training strategies demonstrates that our metric improves teacher selection by 18% over previous methods, enabling small student models to achieve up to 17% accuracy gains. Experiment codebase is available at: https://github.com/arkel23/FGIR-KD-Teacher.

2605.15684 2026-05-18 cs.CV 版本更新

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

ElasticDiT:通过弹性架构和稀疏注意力实现高效扩散变换器,用于移动设备上的高分辨率图像生成

Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

发表机构 * Huawei Technologies(华为技术)

AI总结 本文提出ElasticDiT,通过弹性架构和稀疏注意力机制,在移动设备上实现高效扩散变换器,平衡图像质量和计算效率,同时减少内存占用。

详情
AI中文摘要

扩散变换器(DiT)架构是高保真图像生成的最新范式,支撑如Stable Diffusion-3和FLUX.1等模型。然而,将这些模型部署到资源受限的移动设备上会带来极高的计算和内存开销。尽管效率驱动的方法如Linear-DiT和静态剪枝缓解了瓶颈,但通常会带来质量下降。不同于云环境,移动约束要求一种单模型范式,能够动态平衡保真度和延迟。我们引入ElasticDiT,通过调整空间压缩比和DiT块深度实现这种动态权衡。通过整合Shift Sparse Block Attention(SSBA)和Tiny DWT-Distilled VAE(T-DVAE),ElasticDiT在保持图像质量的同时减少了推理延迟和内存占用。实验表明,ElasticDiT能够在一组参数内覆盖广泛的保真度-延迟权衡范围。通过联合调整压缩和深度,单个ElasticDiT模型可以即时重新配置,以超越任务特定的基线。具体而言,我们的flex lite变体实现了32.87的HPS,超过了Flux模型,同时借助SSBA在84.16%的平均稀疏度下保持有竞争力的质量。此外,即插即用的T-DVAE仅需标准VAE的1/8计算成本即可实现SD3级的重建,而Flow-GRPO提升了语义对齐(GenEval:从66.93到73.62)。这些结果表明,ElasticDiT提供了一种多功能、硬件自适应的解决方案,消除了对多个专用模型的需求,为未来移动设备上的高分辨率图像生成提供了有前景的路径。

英文摘要

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches like Linear-DiT and static pruning alleviate bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile constraints require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT effectively covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on-the-fly to outperform task-specific baselines. Specifically, our flex lite variant achieves an HPS of 32.87, surpassing the Flux model, while maintaining competitive quality at 84.16 percent average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction with only 1/8x the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.

2605.15682 2026-05-18 cs.CV 版本更新

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

DreamSR:通过增强感受野的扩散变换器实现超高清图像超分辨率

Qingji Dong, Hang Dong, Mingqin Chen, Rui Zhang, Yitong Wang

发表机构 * ByteDance Inc.(字节跳动公司)

AI总结 DreamSR通过双分支MM-ControlNet和增强感受野策略,解决超分辨率中局部过生成和细节合成问题,实现高质量细节恢复。

详情
AI中文摘要

大规模预训练扩散模型因强大的生成先验通过文本引导被广泛应用于实际图像超分辨率。然而,当使用基于补丁的推理策略超分辨率处理高分辨率图像时,现有扩散基超分辨率方法常因LR图像全局提示与每次推理步骤中局部补丁不完整语义信息之间的不匹配而产生过生成问题。另一方面,现有方法由于网络设计和训练策略过度强调全局生成能力,也难以在局部补丁中生成细节纹理。为了解决这个问题,我们提出了DreamSR,一种新的超分辨率模型,通过抑制局部过生成并提高细节合成,从而实现具有超高质量细节的视觉忠实结果。具体来说,我们提出了一个双分支MM-ControlNet,其中ControlNet使用补丁级提示生成局部文本特征,而预训练的DiT使用全局提示生成全局文本特征,从而缓解过生成并确保补丁间的语义一致性。我们还设计了全面的训练策略,包含阶段特定的数据处理管道和增强感受野策略,增强模型捕捉补丁信息和有效恢复局部纹理的能力。广泛的实验表明,DreamSR优于最先进的方法,提供高质量的超分辨率结果。代码和模型可在https://github.com/jerrydong0219/DreamSR上获得。

英文摘要

Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution (SR) because of the powerful generative priors they provide through textual guidance. However, when super-resolving high-resolution images with a patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from the LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also fail to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address these issues, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual features with patch-level prompts while the pre-trained DiT provides global textual features with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

2605.15681 2026-05-18 cs.GR cs.CV 版本更新

DealMaTe: Multi-Dimensional Material Transfer via Diffusion Transformer

DealMaTe: 通过扩散变换器实现多维材料传输

Nisha Huang, Yizhou Lin, Jie Guo, Xiu Li, Tong-Yee Lee, Zitong Yu

发表机构 * Tsinghua University(清华大学) Pengcheng Laboratory(鹏城实验室) National Cheng-Kung University(国立成功大学) Great Bay University(大湾大学) Dongguan Key Laboratory for Intelligence and Information Technology(东莞智能与信息技术重点实验室)

AI总结 DealMaTe利用深度、法线和光照图像实现材料迁移,采用简化的扩散框架,去除文本引导和参考网络,设计轻量的3D信息注入方法并优化注意力机制,实现高效、高质量的材料迁移。

详情
AI中文摘要

近期基于扩散的材料迁移方法依赖图像微调或带有辅助网络的复杂架构,面临文本依赖、额外计算成本和特征错位等挑战。为了解决这些局限,我们提出了DealMaTe,使用深度、法线和光照图像进行材料迁移。DealMaTe是一种简化的扩散框架,去除了文本引导和参考网络。我们设计了一种轻量的3D信息注入方法Multi-Dim 3D Shader LoRA,它无需修改基础模型权重即可支持相互兼容的控制条件,并获得和谐稳定的结果。此外,我们通过Shader Causal Mutual Attention和键值(KV)缓存优化注意力机制,以减少多条件引起的推理延迟、提高计算效率,并以较低的架构复杂度实现高质量的材料迁移结果。涵盖多种物体和光照条件的大量实验一致表明,DealMaTe能够在任意输入材料下实现出色的高保真材料迁移。代码可在 https://github.com/haha-lisa/DealMaTe 获取。

英文摘要

Recent diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks, but face challenges such as text dependency, additional computational costs, and feature misalignment. To address these limitations, we propose DealMaTe, which uses DEpth, normAl, and Lighting images for MAterial TransfEr. DealMaTe is a simplified diffusion framework that eliminates text guidance and reference networks. We design a lightweight 3D information injection method, Multi-Dim 3D Shader LoRA, which, without modifying the base model weights, enables compatible control conditions and achieves harmonious and stable results. Additionally, we optimize the attention mechanism with Shader Causal Mutual Attention and key-value (KV) caching to reduce inference latency caused by multiple conditions, improve computational efficiency, and achieve high-quality material transfer results with low architectural complexity. Extensive experiments covering a wide variety of objects and lighting conditions consistently demonstrate that DealMaTe achieves remarkable high-fidelity material transfer under arbitrary input materials. The code is available at https://github.com/haha-lisa/DealMaTe.

2605.15677 2026-05-18 cs.CL cs.CV 版本更新

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

VCG-Bench:迈向统一的视觉导向基准,用于结构化生成与编辑

Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Huawei Technologies Co., Ltd(华为技术有限公司) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) South China University of Technology(华南理工大学)

AI总结 本文提出VCG-Bench,一个统一的视觉导向mxGraph任务基准,通过符号逻辑和XML实现精确的图表生成与编辑,解决现有方法在结构化任务中的局限性。

Comments Accepted by ICML 2026, 37 pages, 10 figures

详情
AI中文摘要

尽管视觉语言模型(VLMs)迅速发展,但在处理专业工作流程中至关重要的结构化、可控图表任务方面仍存在关键差距。现有方法主要依赖像素级合成,其在可编辑性和保真度上存在固有限制。本文提出一种新的图表即代码范式,利用mxGraph可扩展标记语言(XML)进行精确的图表生成与编辑。我们提出了VCG-Bench,一个统一的视觉导向mxGraph任务基准。VCG-Bench包括:(1)一个包含1,449种不同图表的分类数据集,涵盖6个领域和15个子领域;(2)一种整合生成(视觉到代码)和可编辑性(代码到代码)的范式定义;(3)一种定制的评估协议,采用多维指标,如mxGraph执行成功率、风格一致性分数(SCS)等。实验结果突显了当前最先进(SOTA)VLMs在结构保真度和指令合规性方面的挑战,反映了其视觉和推理能力。

英文摘要

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate, Style Consistency Score (SCS), etc. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

2605.15673 2026-05-18 eess.IV cs.CV cs.LG 版本更新

Highly Detailed and Generalizable Broadleaf Tree Crown Instance Segmentation from UAV Imagery

基于无人机影像的高精度通用性阔叶林树冠实例分割

Mitsutaka Nakada, Takahiko Ikebata, Kengo Ikebata, Yuji Mizuno, Yusuke Onoda, Ryuichi Takeshige, Kyaw Kyaw Htoo, Kanehiro Kitayama, Robert Ong, Masanori Onishi

发表机构 * DeepForest Technologies Co., Ltd.(深森林技术有限公司) YM Lab.(YM实验室) Graduate School of Agriculture, Kyoto University(京都大学农学研究科) Graduate School of Science, Osaka Metropolitan University(大阪公立大学理学研究科) Faculty of Tropical Forestry, Universiti Malaysia Sabah(马来西亚沙巴大学热带林业学院) Forest Research Centre, Sabah Forestry Department(沙巴林业局森林研究中心)

AI总结 本文提出一种高精度树冠实例分割模型,通过无人机影像实现阔叶林中单个树冠的精确定界,利用大规模高质量标注数据集提升分割性能,适用于复杂结构和不同地理生物的森林环境。

Comments 12 pages, 5 figures, 3 tables

详情
AI中文摘要

我们提出了一种高精度的实例分割模型,利用无人机(UAV)获取的航拍影像勾绘天然阔叶林中的单株树冠。与其他森林类型相比,阔叶林中的树冠勾绘更具挑战性,因为树冠形状多样且缺乏清晰的树顶。为解决这一问题,我们开发了一个基于深度学习的树冠分割模型,并在高质量标注的树冠轮廓上进行训练。熟练的标注员在日本七个森林采集的正射镶嵌影像上手动勾绘了18,507个树冠多边形,我们在此基础上开发了基于Mask2Former、采用多种主干架构的模型。最佳模型仅使用RGB影像即可在结构复杂的阔叶林中实现较高的分割性能。将其应用于日本境内地理位置不同的森林以及生物区系不同的婆罗洲热带雨林时,这一性能依然保持。这些结果表明,使用大量高质量标注数据对于在多样的森林生态系统中实现精细且可泛化的树冠分割至关重要。所开发的模型已集成到支持无人机实际森林监测的软件DF Scanner Pro中,预计将使广大用户能够利用无人机分析阔叶林的单木级信息。

英文摘要

We present a highly detailed instance segmentation model for delineating individual tree crowns in natural broadleaf forests using aerial imagery acquired by unmanned aerial vehicles (UAVs). Tree crown delineation in broadleaf forests is more challenging than in other forest types due to the diversity of crown shapes and the lack of clearly defined treetops. To address this issue, we developed a deep-learning-based crown segmentation model trained on high-quality annotated crown outlines. Skilled annotators manually delineated 18,507 crown polygons from orthomosaic images collected across seven forests in Japan, and we developed a model based on Mask2Former with multiple backbone architectures. The best model achieved high segmentation performance in structurally complex broadleaf forests using only RGB imagery. This performance was maintained when applied to geographically distinct forests within Japan, as well as to biologically distinct tropical rainforests in Borneo. These results demonstrate that using a large number of high-quality annotated datasets is critical for achieving detailed and generalizable crown segmentation across diverse forest ecosystems. The developed model has been integrated into DF Scanner Pro, a software package that supports practical forest monitoring using UAVs, and this implementation is expected to enable a wide range of users to analyze tree-level information in broadleaf forests from UAV imagery.

2605.15672 2026-05-18 cs.CV cs.AI 版本更新

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

VLM只描线而不追踪:诊断视觉路径跟随中的失败

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

发表机构 * Yonsei University(延世大学)

AI总结 研究VLMs在视觉路径跟随任务中的表现,发现其在面对局部相似干扰时易切换路径,揭示局部竞争导致的失败原因。

详情
AI中文摘要

视觉-语言模型(VLMs)在多模态基准测试中表现优异,但可能仍缺乏对基本视觉操作的鲁棒控制。我们研究了路径跟随任务,其中模型必须通过连续的局部延续跟随选定的视觉路径。为隔离这一能力,我们设计了受控的路径跟随任务,引入附近的竞争者并减少语义和拓扑模糊性,如交叉和重叠。在这些任务中,即使是最先进的VLMs也频繁失去目标路径并切换到附近的替代路径,尤其是在这些替代路径在局部上相似时。行为干预和内部分析表明,这些失败源于局部竞争:附近的相似干扰者会将模型拉离真正的延续。标准解决方案无法消除这一瓶颈:模型大小扩展只能提供有限的收益,推理部分通过成本高昂的替代策略补偿,而显式路径指示未能恢复稳定的路径跟随。最后,在复杂的电缆场景和地铁地图上测试表明,相同的路径切换失败在受控设置之外仍然存在。

英文摘要

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

2605.15671 2026-05-18 eess.IV cs.CV 版本更新

Degradation-Aware Blur-Segmentation of Brain Tumor

考虑退化因素的脑肿瘤模糊分割

Yuchun Wang, Xiaosong Li, Gefei Liang, Yang Liu

发表机构 * School of Physics and Optoelectronic Engineering, Foshan University, China(物理与光电工程学院,佛山大学,中国)

AI总结 本文提出DABSeg网络,通过同步去模糊和精确分割,提升多模态3D脑肿瘤分割在退化条件下的鲁棒性与临床实用性。

详情
AI中文摘要

多模态3D MRI脑肿瘤分割是放疗靶区勾画、手术规划和治疗后评估的关键步骤。现有方法通常假设MRI图像无伪影,但扫描过程中不可避免的患者运动会引入伪影和模糊,使边界和纹理特征退化,导致分割性能下降。为此,我们提出退化感知模糊分割网络(DABSeg),一种同步去模糊的3D多模态MRI分割网络,将模糊去除与精确分割统一起来。具体而言,我们提出一个特征域运动去模糊前端模块(stem),用于补偿模糊并重新平衡强度。同时,骨干网络嵌入了模糊感知的跨模态交叉注意力模块和多尺度残差聚合,以实现有效的模态互补。值得注意的是,我们优化了一个联合损失,将加权Dice与清晰参考重建项相结合,其中对小目标施加不均衡的权重,以增强小病灶和边界区域的学习强度与预测稳定性。在BraTS2020数据集上,清晰与退化两种条件下的系统比较和消融实验一致表明,DABSeg在肿瘤Dice分数和边界精度上优于现有最先进方法。这些结果验证了退化感知的跨任务协同学习在提升多模态3D脑肿瘤分割于真实退化条件下的鲁棒性和临床实用性方面的有效性。源代码可在 https://github.com/YuchunWang24/DABSeg_ICPR 获取。

英文摘要

Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a synchronous deblurring 3D multimodal MRI segmentation network that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to yield effective modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degraded conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degradation-aware cross-task collaborative learning in improving the robustness and clinical utility of multimodal 3D brain tumor segmentation under realistic degraded conditions. The source code is available at https://github.com/YuchunWang24/DABSeg_ICPR
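
The weighted Dice part of such a joint loss can be illustrated on flattened masks. This is a minimal sketch of a class-weighted soft Dice loss in general, not DABSeg's implementation: the function names, the smoothing constant, and the example weights (a larger weight on a small-target class) are assumptions, and the reconstruction term is left out.

```python
def dice_score(pred, target, eps=1e-6):
    """Soft Dice between two flattened masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def weighted_dice_loss(preds, targets, weights):
    """Weighted mean of (1 - Dice) over classes.

    preds/targets: one flattened mask per class; weights: one positive
    weight per class (hypothetical values chosen by the caller).
    """
    total = sum(weights)
    loss = sum(
        w * (1.0 - dice_score(p, t)) for p, t, w in zip(preds, targets, weights)
    )
    return loss / total
```

Giving a small structure a weight of 3 against 1 for a large one means a miss on the small structure contributes three times as much to the loss, which is one simple way to keep small lesions from being dominated by large regions.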

2605.15666 2026-05-18 cs.CV 版本更新

ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

ChronoEarth-492K:一个大规模且长时域的时空超光谱地球观测数据集和基准

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Siebel School of Computing and Data Science(计算与数据科学学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出ChronoEarth-492K数据集,通过NASA EO-1 Hyperion任务的超光谱数据,提供大规模、时间校准的时空超光谱数据,支持短时和长时分析,并建立统一的评估平台,推动超光谱时空表示学习的发展。

详情
AI中文摘要

超光谱成像(HSI)为地球表面提供了密集的光谱信息,使土地覆盖和生态系统动态在材料层面得以理解。尽管近年来在超光谱自监督学习(SSL)方面取得了进展,但现有数据集仍然时间较浅,限制了长时间域时空建模的发展。为解决这一差距,我们引入ChronoEarth-492K,这是首个大规模、时间校准的超光谱SSL数据集,基于NASA的EO-1 Hyperion任务,目前是世界上持续时间最长的超光谱档案(2001-2017)。ChronoEarth-492K包含492,354个辐射校准的块,覆盖185,398个全球地点17年,其中28,786个地点包含多时间序列(≥3次观测),可支持短时间域和长时间域的分析。在此基础上,我们建立了ChronoEarth基准,一个涵盖静态、短时间域和长时间域任务的统一评估套件,由六个开源地理空间产品组成,涵盖土地覆盖、作物类型、森林动态和土壤特性。我们进一步提出了一套标准化的评估协议,并在最先进的超光谱基础模型上报告了广泛的基线结果。共同而言,ChronoEarth和基准提供了首个大规模、时间校准的平台,用于系统性的时空超光谱表示学习。

英文摘要

Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences ($\geq 3$ observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.

2605.15661 2026-05-18 cs.CV cs.AI 版本更新

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

VAGS:图像编辑与生成的速率自适应引导尺度

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Harvard University(哈佛大学) School of Computing and Data Science(计算与数据科学学院) The University of Hong Kong(香港大学) Kempner Institute for the Study of Natural and Artificial Intelligence(自然与人工智能研究学院)

AI总结 VAGS通过自适应引导尺度提升图像编辑和生成的结构保真度和生成质量,无需微调或额外计算。

详情
AI中文摘要

无分类器引导(CFG)是控制文本语义对基于流的采样器影响强度的主要手段,但标准做法在整个ODE轨迹上固定其引导尺度。这存在根本性的不匹配:早期步骤以噪声为主,语义信号较弱,而后期步骤要确定图像结构,需要更强的方向性约束;更关键的是,任何引导强度的价值都取决于引导后的速度是与模型当前动态一致还是相悖。本文提出速度自适应引导尺度(VAGS),一种无需训练的替代方案,它将名义尺度乘以一个有界因子,该因子结合了时间信号水平项与任务相关速度场之间的余弦相似度。对于无需反演(inversion-free)的编辑,VAGS测量源引导速度与目标引导速度之间的对齐程度,使每一步的编辑强度反映保留与变换之间的局部兼容性。对于生成,VAGS-Gen使用无条件速度与条件速度之间的对齐程度作为对应的信号。两种变体均无需微调、辅助网络或额外的前向传播,固定CFG是其特例。在PIE-Bench和DIV2K上的编辑任务,以及COCO17、CUB-200和Flickr30K上的生成任务中,VAGS在结构保真度和生成质量上均一致优于固定CFG和近期的无训练引导变体。代码已在 https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale 公开。

英文摘要

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose \textit{Velocity-Adaptive Guidance Scale} (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
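The scaling rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the exact factor, so the form `1 + alpha * t * cos` (with `t` in [0, 1] as a stand-in for the temporal signal-level term and `alpha` as a hypothetical bound) is invented here for clarity, and the function names are not taken from the released code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened velocity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # alignment undefined: treat as neutral
    return dot / (norm_u * norm_v)

def adaptive_scale(nominal_scale, t, v_uncond, v_cond, alpha=0.5):
    """Scale the nominal guidance weight by a bounded factor.

    ASSUMPTION: the factor 1 + alpha * t * cos is a hypothetical form,
    not the paper's formula. With t in [0, 1] it stays inside
    [1 - alpha, 1 + alpha], and alpha = 0 recovers fixed CFG.
    """
    cos = cosine_similarity(v_uncond, v_cond)
    factor = 1.0 + alpha * t * cos
    return nominal_scale * factor

def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination of two velocities."""
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]
```

With this form, aligned velocities late in the trajectory raise the effective scale, opposed velocities lower it, and early steps (small `t`) stay close to the nominal value.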

2605.15660 2026-05-18 cs.CV 版本更新

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

MaTe:仅需图像进行材料迁移的扩散变换器

Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

发表机构 * Tsinghua University(清华大学) PengCheng Laboratory(鹏城实验室) Lenovo Research(联想研究院) National Cheng-Kung University(国立成功大学)

AI总结 MaTe通过多模态注意力机制实现材料迁移,无需文本指导或辅助网络,提升了生成质量和效率。

详情
AI中文摘要

近期基于扩散的材料迁移方法依赖图像微调或带有辅助网络的复杂架构,但面临文本依赖、额外计算成本和特征错位等挑战。为解决这些局限,我们提出MaTe,一种精简的扩散框架,去除了文本引导和参考网络。MaTe在token层面整合输入图像,通过共享潜在空间中的多模态注意力实现统一处理。该设计无需额外适配器、ControlNet、反演采样或模型微调。大量实验表明,MaTe在零样本、免训练范式下实现了高质量的材料生成。它在视觉质量和效率上均优于现有最先进方法,同时保持精确的细节对齐,显著简化了推理所需的前置条件。

英文摘要

Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.

2605.15640 2026-05-18 cs.CV 版本更新

Learning Disentangled Representations for Generalized Multi-view Clustering

学习解耦表示以实现通用多视图聚类

Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, Wanqing Li

发表机构 * AI Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)人工智能方向) School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件工程学院) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) School of Computer, National University of Defense Technology(国防科技大学计算机学院) Medical Big Data Research Center, Medical Engineering Laboratory of Chinese PLA General Hospital(中国人民解放军总医院医学大数据研究中心,医学工程实验室) School of Computing and Information Technology, University of Wollongong(伍伦贡大学计算与信息技术学院)

AI总结 本文提出GMAE框架,通过解耦表示学习保留多视图互补性,提升聚类效果。实验表明其在完整和不完整多视图聚类任务中均优于现有方法。

Comments Accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

详情
AI中文摘要

多视图聚类(MVC)因其能利用互补信息而受到关注。然而,现有深度MVC方法在跨视图融合时常面临视图分布纠缠问题,影响共享潜在空间质量。为此,本文提出通用多视图自编码器(GMAE),通过解耦表示学习保留跨视图互补性。具体而言,GMAE采用双路径自编码器将源特征解耦为视图特定和视图共同嵌入,促进更清晰的聚类结构发现。进一步构建跨视图对抗判别器,引导视图特定编码器捕捉更判别性特征。通过策略性调节互信息,GMAE有效对齐分布并防止表示崩溃,确保生成稳健且非平凡的嵌入。在13个基准数据集上的全面实验表明,GMAE在完整和不完整MVC任务中均优于现有方法。代码实现见:https://github.com/obananas/GMAE。

英文摘要

Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.

2605.15621 2026-05-18 cs.CV 版本更新

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

LRCP: 低秩压缩性引导的视觉标记修剪用于高效的LVLMs

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

发表机构 * Xiaohongshu(小红书) Harbin Institute of Technology(哈尔滨工业大学) Fudan University(复旦大学)

AI总结 本文提出LRCP,通过低秩压缩性引导视觉标记修剪,有效减少视觉语言模型的推理成本,实现94.7%的图像理解性能保留和88.9%的标记减少。

Comments The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

详情
AI中文摘要

大型视觉-语言模型(LVLMs)在多模态理解方面表现出色,但其推理成本随着视觉标记数量的增加而迅速增长,尤其在高分辨率图像和长视频中更为明显。现有基于注意力的方法通过注意力分数估计标记重要性,可能引入位置偏差;而基于表示的方法则通过特征关系或重建误差减少视觉冗余,忽略了视觉标记集的整体结构。本文从低秩压缩性的角度重新审视视觉标记压缩。在多个模型和数据集中,我们发现视觉标记表示表现出显著的低秩结构,存在一个主导子空间,即使随机移除大量标记后仍保持稳定。受此发现启发,我们提出LRCP,一种无需训练的压缩框架,首先通过PCA估计视觉标记的主导低秩子空间,然后通过投影残差对每个标记进行评分,保留那些难以由低秩背景解释的标记。大量实验表明,LRCP在保持94.7%的原始图像理解性能的同时实现88.9%的标记减少,并在保持97.8%的平均视频理解准确性的同时实现87.5%的标记减少。

英文摘要

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
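The two-stage scoring described above (estimate a dominant subspace, then rank tokens by how poorly it explains them) can be illustrated with a small dependency-free sketch. It makes several simplifying assumptions that are not from the paper: the subspace is rank-1 and found by power iteration rather than a full PCA, the features are mean-centered, and all function names are hypothetical.

```python
import math

def mean_center(tokens):
    """Subtract the per-dimension mean from every token feature."""
    n, d = len(tokens), len(tokens[0])
    mean = [sum(tok[j] for tok in tokens) / n for j in range(d)]
    return [[tok[j] - mean[j] for j in range(d)] for tok in tokens]

def dominant_direction(centered, iters=100):
    """Top principal direction via power iteration on X^T X.

    ASSUMPTION: a rank-1 subspace stands in for the paper's dominant
    low-rank subspace. The uniform start vector must not be orthogonal
    to the true top direction, or the iteration will not find it.
    """
    d = len(centered[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(p * x[j] for p, x in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0.0:
            break
        v = [a / norm for a in w]
    return v

def residual_scores(tokens):
    """Norm of each token's component outside the dominant direction."""
    centered = mean_center(tokens)
    v = dominant_direction(centered)
    scores = []
    for x in centered:
        coef = sum(a * b for a, b in zip(x, v))
        resid = [a - coef * b for a, b in zip(x, v)]
        scores.append(math.sqrt(sum(r * r for r in resid)))
    return scores

def prune(tokens, keep):
    """Indices of the `keep` tokens with the largest residuals, in order."""
    scores = residual_scores(tokens)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

For example, with five tokens lying along one axis and a sixth pointing away from it, `prune(tokens, 1)` keeps the sixth: it is the one the low-rank background explains worst.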

2605.15618 2026-05-18 cs.CV cs.AI 版本更新

Latent Video Prediction Learns Better World Models

潜在视频预测学习更好的世界模型

Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

发表机构 * The University of Melbourne(墨尔本大学) Monash University(莫纳什大学)

AI总结 本文系统研究了潜在预测模型在世界模型中的鲁棒性,发现其在特征可区分性、抗污损性、细粒度辨别、遮挡鲁棒性和时间方向敏感性等方面表现优异,优于其他视频基础模型。

详情
AI中文摘要

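自监督视频模型日益被视为世界模型,但其评估仍主要局限于干净基准上的单一top-1准确率,这使人们对其作为世界模型的潜力缺乏充分认识。我们针对这一空白开展了首个系统性研究,沿五个与视频世界模型部署相关的鲁棒性维度(特征可区分性、抗损坏鲁棒性、细粒度辨别、遮挡鲁棒性和时间方向敏感性),分析了四个容量相当的前沿视频基础模型:V-JEPA 2.1、V-JEPA 2、VideoPrism和VideoMAEv2。评估表明,在全部五个维度上,潜在预测模型呈现出独特且一致的特征:它们在像素损坏下退化更平缓;在遮挡下保留可用的类别结构,而不仅仅是几何稳定性;无需重建像素即可捕捉细粒度的物理接触线索;并且唯有它们能够编码时间之箭。这些优势甚至在任务适配后依然存在:冻结的V-JEPA 2骨干网络配合轻量级注意力探针,在抗损坏和遮挡鲁棒性上优于完全微调的VideoMAE和有监督的TimeSformer。我们的大量结果为潜在预测有利于鲁棒世界建模提供了具体的新证据。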

英文摘要

Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.

2605.15615 2026-05-18 cs.CV cs.LG 版本更新

Neutral-Reference Prompting for Vision-Language Models

视觉-语言模型的中性参考提示

Senmao Tian, Xiang Wei, Shunli Zhang

发表机构 * Beijing Jiaotong University(北京交通大学)

AI总结 本文提出NeRP策略,通过中性提示和参考图像提升模型对未知类别的判别能力,同时保持对已知类别的准确性。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言模型(VLMs)的有效迁移学习常面临基类-新类权衡(BNT)问题:提升对未见过类别的识别性能往往会降低对已知类别的准确性。现有工作通常简单归因于过拟合已知类别。我们观察到一种有趣现象:VLMs在某些下游数据上表现出不对称混淆,即类别A的样本系统性被误判为类别B,而反向混淆(B到A)很少发生。对于已知类别,这种偏差可通过交叉熵损失调整来缓解,但对未知类别,这种预训练诱导的偏差仍存在并损害泛化能力。受此启发,我们提出NeRP,一种即插即用的提示修正策略,无需修改模型参数即可提升对未知类别的判别能力。NeRP利用中性文本提示和参考图像,测量类别层面的先验偏好,结合样本似然获得模型的代理分数。如果对于给定样本,先验强烈支持当前预测,而观察到的证据明显不足,则在容易混淆的类别对之间执行局部翻转,从而纠正先验主导的误判。在多个backbone和15个少样本及跨领域基准上的广泛实验表明,NeRP显著提高了对未知类别的准确性,同时保持已知类别的预测性能。

英文摘要

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

2605.15597 2026-05-18 cs.CV cs.GR cs.LG cs.RO 版本更新

CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

CM-EVS:稀疏全景RGB-D-姿态数据用于完整场景覆盖

Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng

发表机构 * Zhejiang University(浙江大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) Vorynel(沃伊内尔) Xinjiang University(新疆大学) Wuhan Polytechnic University(武汉轻工大学) Tianjin University(天津大学)

AI总结 本文提出CM-EVS,通过COVER算法生成稀疏全景RGB-D-姿态数据,实现低冗余且可追溯的完整场景覆盖,提升3D学习的几何一致性。

Comments 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS

详情
AI中文摘要

现代3D视觉学习依赖于从度量3D资产中采样的观测,但现有的扫描、网格、点云、仿真和重建结果并不能直接提供稀疏、可比且几何一致的全景训练接口。密集轨迹会重复相邻视角,特定于数据源的渲染策略导致标注异质,而稀疏启发式方法可能遗漏重要区域或引入深度不一致的观测。本文研究如何将3D资产转换为稀疏的全景RGB-D-姿态数据,在低冗余且来源可审计的前提下保持完整的场景覆盖。我们提出COVER(Coverage-Oriented Viewpoint curation with ERP Range-depth warping,面向覆盖的视角筛选与ERP距离-深度变换),一种无需训练的ERP视角筛选器:它将已选视角观测到的几何投影到候选ERP探针中,对增量覆盖进行评分,并惩罚深度冲突。在代理误差有界的条件下,其贪心覆盖代理在相差一个加性误差项的意义下保持标准的覆盖式近似性质。基于COVER,我们构建了CM-EVS(Coverage-curated Metric ERP View Set),一个全景RGB-D-姿态数据集,包含来自Blender室内、HM3D和ScanNet++共1,275个室内场景的36,373个精选ERP帧,并补充了由TartanGround和OB3D重新编码为同一格式的户外全景。每个帧提供全球面RGB、度量距离深度和标定姿态;由COVER生成的室内帧还包含逐步的来源日志。每个室内场景的帧数中位数仅为25,CM-EVS覆盖全部13种统一房间类型,同时保持紧凑的场景级覆盖。实验表明,COVER改善了覆盖与冲突之间的权衡,使CM-EVS成为一个稀疏、紧凑且可审计的RGB-D-姿态资源,适用于几何一致的全景3D学习。

英文摘要

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and a calibrated pose; COVER-produced indoor frames also include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
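The greedy selection loop described above (score incremental coverage, penalize depth conflicts, log each step) can be sketched as follows. This is a toy version under stated assumptions: scene geometry is reduced to sets of hypothetical surface-cell ids, the per-view conflict counts are supplied as an input rather than computed by ERP warping, and the linear penalty and all names are illustrative, not the paper's definitions.

```python
def greedy_view_selection(candidates, conflicts, budget, penalty=1.0):
    """Greedily pick views by incremental coverage minus a conflict penalty.

    candidates: dict mapping view id -> set of surface-cell ids it observes.
    conflicts:  dict mapping view id -> count of depth-inconsistent cells
                (ASSUMPTION: precomputed; missing views count as 0).
    Returns (selected view ids, covered cells, per-step provenance log).
    """
    covered = set()
    selected = []
    log = []
    remaining = dict(candidates)

    def score(view):
        gain = len(remaining[view] - covered)
        return gain - penalty * conflicts.get(view, 0)

    while remaining and len(selected) < budget:
        # Sorting the ids first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break  # no remaining view adds net value
        gain = len(remaining[best] - covered)
        covered |= remaining.pop(best)
        selected.append(best)
        log.append((best, gain, len(covered)))
    return selected, covered, log
```

In this toy setting a view that sees everything but carries many depth conflicts is skipped in favor of two cleaner views, and the log records which view was added at each step and how much coverage it contributed.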

2605.15592 2026-05-18 cs.CV 版本更新

Efficient Image Synthesis with Sphere Latent Encoder

基于球面潜在编码器的高效图像合成

Tung Do, Thuan Hoang Nguyen, Hao Li

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出分离的固定预训练图像编码器和球形潜在去噪模型,提高效率并独立优化重建与生成。在多个数据集上,方法在生成质量和推理速度上优于Sphere Encoder。

Comments Technical report

详情
AI中文摘要

少数步骤图像生成已取得快速进展,一致性及meanflow-based方法显著减少了采样步骤的数量。尽管其推理成本低,但这些方法常面临训练不稳定和可扩展性有限的问题。Sphere Encoder是一种近期的替代方案,仅需少数步骤即可生成高质量图像;然而,其在推理过程中需要在像素空间和潜在空间之间反复转换,同时在单一架构内联合优化重建与生成。这种设计导致计算效率低下,并在重建与生成之间产生目标冲突。为解决这些限制,我们将框架分离为一个固定的预训练图像编码器和一个单独的潜在去噪模型,后者完全在球形潜在空间中训练。我们的方法在训练和推理过程中消除了反复的像素空间操作,提高了效率,并允许重建与生成各自专业化。在Animal-Faces、Oxford-Flowers和ImageNet-1K数据集上,我们的方法在生成质量和推理速度上显著优于Sphere Encoder,同时在强少数步骤和多步骤基线中也取得了具有竞争力的结果。

英文摘要

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

2605.15585 2026-05-18 cs.AI cs.CV 版本更新

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

先看后写:学习视觉先验以实现空间感知的教育动画生成

Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

发表机构 * Wuhan University(武汉大学) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出OmniManim框架,通过视觉规划和反馈机制提升教育动画生成质量,改进渲染效果和教学效果。

Comments 21 pages, 4 figures

详情
AI中文摘要

大型语言模型可以为教育动画生成可执行代码,但生成的渲染结果常出现元素重叠、对齐错误和动画连续性断裂等问题。这些缺陷无法仅从代码中可靠检测,需在执行后才能显现。本文将该问题形式化为渲染反馈感知的约束代码生成:给定自然语言规范,模型必须生成可执行代码,其渲染输出需满足可在渲染后评估的结构化质量标准。为解决此问题,我们引入OmniManim框架,围绕共享场景状态、显式视觉规划、结构化后渲染诊断和局部修复构建。其中,Vision Agent是任务特定的视觉规划模块:它通过粗到细的边界框去噪预测稀疏关键帧布局,并优化插值感知的目标以减少下游动画插值引起的中间帧失败。我们进一步构建了ManimLayout-1K和EduRequire-500两个数据集,并提供可复现的评估协议,涵盖可执行性、教学质量、视觉质量和效率。在EduRequire-500上,OmniManim在单模型基线和现有多智能体框架上均提升了测量渲染质量。系统性消融研究进一步验证,显式视觉规划,特别是其粗略空间先验、边界框细化和插值感知优化是这些提升的关键。

英文摘要

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.

2605.15584 2026-05-18 cs.CV 版本更新

AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

AGC:面向视觉-语言模型对抗鲁棒性的自适应测地修正

Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

发表机构 * NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别国家重点实验室与多模态人工智能系统全国重点实验室) School of Computer Science and Engineering, Central South University(中南大学计算机科学与工程学院) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出AGC,一种无需训练的防御机制,通过自适应步长修正输入特征,提升视觉-语言模型的对抗鲁棒性,实测在八个细粒度数据集上提升44.4%的鲁棒准确率,同时降低10倍推理延迟。

详情
AI中文摘要

像CLIP这样的视觉-语言模型已展现出卓越的零样本迁移能力。然而,它们易受不可察觉的对抗扰动影响,这仍是一个关键的安全问题。测试时防御为已部署模型提供了务实的解决方案,但现有方法通常在推理过程中依赖基于梯度的优化,带来显著的计算开销。本文重新审视数据增强在CLIP鲁棒性中的作用,并观察到各种增强的效果并不相同:特定增强能够稳定地提供鲁棒的几何线索,这些线索在超球面特征空间中与正确的类别语义对齐。基于此,我们提出自适应测地修正(AGC),一种无需训练、无需参数更新的防御机制。AGC将可靠的增强识别为几何锚点,并将输入特征向该锚点修正,同时利用自适应步长在鲁棒性与干净样本准确率的保持之间取得平衡。AGC在八个细粒度数据集和三个CLIP骨干网络上取得了优越性能,平均鲁棒准确率比最先进的基线提高44.4%,同时实现10倍的推理延迟降低。我们的发现揭示了CLIP特征的一个基本几何性质,为鲁棒的多模态部署提供了高效且有效的范式。

英文摘要

Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10$\times$ reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
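Correcting a feature toward an anchor on the unit hypersphere amounts to a partial spherical interpolation. The sketch below shows that geometric step only, under stated assumptions: the step rule (proportional to the angular distance, capped by a hypothetical `max_step`) is invented for illustration, anchor selection is out of scope, both inputs are assumed to be nonzero vectors, and none of the names come from the paper.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def angle_between(x, anchor):
    """Geodesic (angular) distance between two unit vectors."""
    cos_theta = sum(a * b for a, b in zip(x, anchor))
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def adaptive_step(theta, max_step=0.5):
    """ASSUMPTION: a hypothetical rule. Features already close to the
    anchor are barely moved; distant ones move up to max_step of the way."""
    return max_step * theta / math.pi

def geodesic_correct(feature, anchor, max_step=0.5):
    """Move `feature` along the great circle toward `anchor`."""
    x, a = normalize(feature), normalize(anchor)
    theta = angle_between(x, a)
    sin_theta = math.sin(theta)
    if sin_theta < 1e-8:
        return x  # identical or antipodal: the geodesic is degenerate
    step = adaptive_step(theta, max_step)
    w_x = math.sin((1.0 - step) * theta) / sin_theta
    w_a = math.sin(step * theta) / sin_theta
    return [w_x * p + w_a * q for p, q in zip(x, a)]
```

Because the update is a closed-form interpolation, it needs no gradients or parameter updates, which is consistent with the training-free framing in the abstract; the corrected feature stays on the unit sphere by construction.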

2605.15583 2026-05-18 cs.CV 版本更新

Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

通过条件多视图祖先采样进行无监督3D人体姿态估计

Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura

发表机构 * The University of Osaka(大阪大学)

AI总结 本文提出一种无需3D监督的单视角3D人体姿态估计方法,利用预训练的2D运动扩散模型的2D扩散先验,通过条件多视图祖先采样优化3D姿态,使其多视图投影符合2D MDM噪声空间的流形,同时匹配给定的2D姿态和人体解剖约束。

Comments International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral

详情
AI中文摘要

我们提出了一种从单视角估计3D人体姿态的方法,无需3D监督。该方法的关键在于利用在大规模2D人体姿态数据集上预训练的运动扩散模型(MDMs)的2D扩散先验。具体来说,我们将扩散模型的多视图祖先采样扩展到人体姿态的2D-3D提升任务。为此,我们提出了一种条件多视图祖先采样(cMAS),以优化3D姿态,使其多视图投影遵循2D MDM噪声空间中的流形,同时将3D姿态条件化以匹配给定的2D姿态和人体解剖约束。在Yoga数据集上的实验表明,我们的方法在跨域性能上优于最先进的监督和无监督3D姿态估计方法,包括在3D监督不可用的极端人体姿态情况下。代码可在:https://github.com/asaa0001/c-MAS获取。

英文摘要

We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.

2605.15582 2026-05-18 cs.CV 版本更新

LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance

LDGuid: 一种通过潜在差异引导实现鲁棒变化检测的框架

Jiaxuan Zhao, Ali Bereyhi

发表机构 * University of Toronto(多伦多大学)

AI总结 本文提出LDGuid框架,通过学习并注入语义差异提升变化检测性能,实验显示其在多个数据集上显著提升分割效果,尤其在受光谱噪声影响的挑战性场景中表现突出。

Comments Accepted to IGARSS 2026. Code is available at: https://github.com/zjxyoyo/LDGuid

详情
AI中文摘要

现代深度学习模型在变化检测(CD)中常难以显式表示任务相关的语义差异。本文提出Latent Difference Guidance(LDGuid)框架,通过对抗自编码实现差异嵌入(DE)模块。DE模块通过信息瓶颈方法预训练,限制其仅学习前后事件样本间的任务相关差异。学习到的潜在差异随后作为CD模型的显式引导信号。通过将LDGuid整合到U-Net、BIT和AERNet基线模型中,并在LEVIR-CD、WHU-CD、SVCD和CaBuAr数据集上评估,实验结果表明LDGuid在所有基准上均提升了分割性能,特别是在受光谱噪声影响的挑战性场景中表现显著。结果进一步突显了LDGuid在整合领域知识(如任务特定的光谱指数)方面的能力。我们的发现表明,语义差异学习可以显著增强遥感中变化检测的鲁棒性。

英文摘要

Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid in incorporating domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.

2605.15579 2026-05-18 eess.IV cs.CV 版本更新

TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling

TVRN:用于压缩感知的可逆神经网络时间视频重采样

Xinmin Feng, Li Li, Dong Liu, Feng Wu

发表机构 * MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition(类脑智能感知与认知教育部重点实验室) University of Science and Technology of China(中国科学技术大学) Information Science and Technology Institution(信息科学与技术研究所) MCC Lab(MCC实验室)

AI总结 本文提出TVRN框架,通过可逆架构和学习到的排名策略,解决压缩感知下的时间视频重采样问题,提升重建质量。

Comments Accepted by IEEE Transactions on Image Processing

详情
AI中文摘要

为适应多样显示和带宽约束,高帧率视频需先时间下采样到低帧率(LFR)再上采样,需联合优化以实现有效帧率重采样。然而现有方法通常通过训练目标连接两个操作,未充分利用其互为逆过程的性质,可能导致高频信息丢失。此外,它们忽略了有损编码器对LFR视频的影响,限制了实际应用。本文提出一种端到端的压缩感知帧率重采样框架TVRN。为正则化帧率下采样过程中丢失的高频信息,TVRN采用结合多输入多输出时间小波变换的可逆架构,并加入高频重建模块。为通过非可微的有损编码器实现端到端训练,设计了一个近似其梯度的替代网络。最后,为提高不同压缩级别下的鲁棒性,通过学习到的排名策略扩展TVRN为非对称架构。大量实验表明,TVRN在工业视频压缩设置下优于现有方法。源代码可在https://github.com/fengxinmin/TVRN_public公开获取。

英文摘要

To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public.
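The reciprocal relationship between temporal downscaling and upscaling is easiest to see in a plain temporal wavelet split. The sketch below uses a two-frame Haar transform as a stand-in; this is an assumption, since the abstract does not specify the wavelet, and frames are flattened to lists of floats for brevity. Function names are hypothetical.

```python
def temporal_haar_forward(frames):
    """Split a clip into a half-rate low band and a residual high band.

    ASSUMPTION: a two-frame Haar split stands in for the paper's
    multi-input multi-output temporal wavelet transform.
    """
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    low, high = [], []
    for first, second in zip(frames[0::2], frames[1::2]):
        low.append([(a + b) / 2.0 for a, b in zip(first, second)])  # average
        high.append([a - b for a, b in zip(first, second)])         # difference
    return low, high

def temporal_haar_inverse(low, high):
    """Recover the full-rate clip from both bands."""
    frames = []
    for avg, diff in zip(low, high):
        frames.append([m + d / 2.0 for m, d in zip(avg, diff)])
        frames.append([m - d / 2.0 for m, d in zip(avg, diff)])
    return frames
```

The low band plays the role of the low-frame-rate video, and the inverse is exact only when the high band is available. In a rescaling setting the high band is not transmitted, which is why the abstract pairs the invertible transform with a high-frequency reconstruction module.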

2605.15574 2026-05-18 cs.CV 版本更新

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

MI-CXR:多区间胸部X光片纵向推理基准

Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do

发表机构 * AIDAS Laboratory(AIDAS实验室) Seoul National University(首尔国立大学)

AI总结 MI-CXR基准旨在评估多次就诊胸部X光片的纵向推理能力,通过五选一问题和三个互补任务族,揭示现有视觉语言模型在时间维度上的局限性。

Comments 33 pages

详情
AI中文摘要

纵向胸部X光片(CXR)解读需要跨患者多次就诊对疾病演变进行推理,但现有的大多数医疗VQA基准只关注单张图像或短时间跨度的图像对。我们引入MI-CXR,一个用于标准化评估多次就诊CXR序列上多区间(Multi-Interval)纵向推理的基准,无需自由形式的报告生成或额外的临床上下文。MI-CXR由基于五次就诊患者时间线的五选一选择题组成,并实例化了三个互补的任务族:时间事件定位、逐区间变化推理和全局轨迹总结,用于评估随时间变化的、以临床为依据的视觉推理。对14个最先进的视觉语言模型(VLMs)的评估显示整体表现较低,平均准确率为29.3%,仅略高于随机猜测。通过分阶段诊断探测,我们发现模型常能给出局部合理的区间描述,却无法施加时间约束,也无法将证据整合为在整条时间线上全局一致的决策。这些发现揭示了当前VLMs的关键局限,并确立MI-CXR作为纵向医疗推理的规范基准。该基准可在 https://github.com/AIDASLab/MI-CXR 获取。

英文摘要

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR

2605.15561 2026-05-18 cs.CV 版本更新

RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding

RoiMAM:面向高效视觉-语言理解的感兴趣区域医学注意模型

Jiayan Yang, Zhuoyu Wu, Wenqi Fang

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) CyPhi AI Research Lab, School of IT, Monash University, Malaysia Campus(莫纳什大学马来西亚校区信息技术学院CyPhi人工智能研究实验室)

AI总结 本文提出RoiMAM,通过整合无训练ROI生成模块和语义选择性抑制,专注于病变相关区域,提升医疗视觉问答的效率与准确性。

Comments under revision

详情
AI中文摘要

视觉-语言模型(VLMs)通过联合解释图像和文本来促进医疗视觉问答(MedVQA)。然而,现有模型通常依赖大型架构和封闭答案集,限制了其效率和临床应用潜力。为克服这些不足,我们引入RoiMAM,一种高效的VLM。它集成了无需训练的ROI生成模块与语义选择性抑制,以聚焦于病变相关区域,同时结合文本提示增强模块,提供模态特定的上下文而不引入训练参数。与广泛使用的MedVInT-TD模型相比,我们的设计在模型大小不到20%的情况下实现了高效且准确的诊断,在SLAKE上提高了约2%的准确性,在PMC-VQA上提高了约4.6%的准确性。

英文摘要

Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.

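2605.15558 2026-05-18 eess.IV cs.CV version update

Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction

Hao Yang, Xianping Ma, Peifeng Ma, Man-On Pun

Affiliations * School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; Geosciences and Engineering, Southwest Jiaotong University; Institute of Space and Earth Information Science and the Department of Geography and Resource Management, The Chinese University of Hong Kong

AI summary This paper proposes a text-guided remote sensing image transmission system that replaces high-resolution data with low-resolution images and compact text descriptions to improve transmission efficiency. A text-conditioned image restoration model recovers fine-grained detail while preserving semantic consistency.

Comments 15 pages, 8 figures, submitted to ISPRS JPRS

Details
AI Chinese abstract (translated)

High-resolution remote sensing imagery is essential for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, which is redundant and inefficient. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images paired with compact text descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to about 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model uses cross-modal learning to recover fine spatial detail and maintain semantic consistency. Experiments on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets show reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient, information-preserving remote sensing image transmission. The implementation will be released publicly on GitHub.

English abstract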

High-resolution remote sensing imagery is critical for environmental monitoring, urban mapping, and land cover analysis, but its transmission is often hindered by limited bandwidth and high communication costs. Conventional pipelines transmit full-resolution pixel data, resulting in redundant and inefficient delivery. This paper proposes a text-guided remote sensing image transmission system that replaces complete high-resolution data with low-resolution images accompanied by compact textual descriptions. An onboard text generator produces spatial and semantic summaries, reducing the transmitted data volume to approximately 2% of the original size. For ground-based reconstruction, a text-conditioned image restoration model is introduced, which leverages cross-modal learning to recover fine spatial details and maintain semantic coherence. Experimental results on the Alsat-2B, UC Merced Land Use, and Aerial Image datasets demonstrate that the proposed framework achieves reconstruction PSNRs of 16.36 dB, 26.87 dB, and 27.41 dB, respectively, enabling efficient and information-preserving image transfer for remote sensing applications. The implementation will be made publicly available on GitHub at https://github.com/haoyangofficial/textrssr.
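The reconstruction results above are reported as PSNR in dB. As a reminder of what those figures measure, here is a minimal sketch of the standard PSNR formula. The function name, the flat pixel lists, and the 8-bit peak value of 255 are assumptions made for illustration and are not taken from the paper.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equal-length pixel sequences.

    max_value is the peak pixel value; 255 assumes 8-bit imagery (an assumption).
    """
    if len(reference) != len(reconstructed) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean the reconstruction is closer to the reference, so on this scale the reported 16.36 dB on Alsat-2B indicates a much larger pixel-level error than the roughly 27 dB on the other two datasets.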

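2605.15546 2026-05-18 cs.CV version update

3DTMDet: A Dual-Path Synergy Network of Transformer and SSM for 3D Object Detection in Point Clouds

Bingwen Qiu, Yuan Liu, Junqi Bai, Tong Jiang, Ben Liang, Fangzhou Chen, Xiubao Sui, Qian Chen

Affiliations * School of Electronic and Optical Engineering; The 28th Research Institute of China Electronics Technology Group Corporation; College of Astronautics; School of Information and Communication Engineering; State Key Laboratory of Extreme Environment Optoelectronic Dynamic Measurement Technology and Instrument

AI summary This paper proposes 3DTMDet, a network that combines SSMs and Transformers to resolve the conflict between sparse points and long-range context understanding in point cloud detection, using a 3D Hybrid Mamba Transformer block and a voxel generation block to improve detection performance.

Details
AI Chinese abstract (translated)

Point cloud object detection faces a conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods expand the receptive field through 1D serialization, which inevitably discards already scarce local geometric detail and degrades detection of distant and small objects. To address this, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to use the linear complexity and long-sequence modeling strengths of SSMs to capture global interactions among sparse, distant points, while Transformer modules with local attention encode fine-grained geometric structure in local point sets and preserve accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, easing the tension between receptive field enlargement and geometric preservation in long-range detection. We also introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct complete object structure in occluded and distant regions. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. Code is available at https://github.com/QiuBingwen/3DTMDet.

English abstract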

A fundamental challenge in point cloud object detection lies in the conflict between the extreme sparsity of distant points and the need for long-range context understanding. Existing methods typically use 1D serialization to expand the receptive field, which inevitably discards already scarce local geometric details and degrades detection of distant and small objects. To address this issue, we propose 3DTMDet, a novel detection network that synergistically combines state space models (Mamba) with Transformers. The core idea is to use the linear complexity of SSMs and their advantages in long-sequence modeling to capture global interactions between sparse and distant points, while using Transformer modules with local attention to encode fine-grained geometric structures in local point sets, preserving accurate shape information. We propose the 3D Hybrid Mamba Transformer (3DHMT) block, which uses an SSM-Attention-SSM pipeline to balance global context understanding and local detail preservation, alleviating the tension between receptive field enlargement and geometric preservation in long-range detection. In addition, we introduce a voxel generation block inspired by LiDAR physics, which diffuses features along the sensor observation direction to reconstruct the complete object structure in occluded and distant areas. Extensive experiments on the KITTI and ONCE datasets show that 3DTMDet outperforms state-of-the-art detectors. The code is available at https://github.com/QiuBingwen/3DTMDet.

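2605.15536 2026-05-18 cs.RO cs.AI cs.CV version update

SkiP: When to Skip and When to Refine for Efficient Robot Manipulation

Mingtong Dai, Guanqi Peng, Yongjie Bai, Feng Yan, Chunjie Chen, Lingbo Liu, Liang Lin, Xinyu Wu

Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; Southern University of Science and Technology; Sun Yat-sen University; UNT University of Chinese Academy of Sciences

AI summary SkiP improves robot manipulation efficiency by dynamically skipping redundant steps and refining key steps, without requiring additional structures or planners.

Details
AI Chinese abstract (translated)

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or in precise, contact-rich manipulation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory move through free space and carry little task-relevant information, while a small set of key steps around contact, grasping, and alignment demand dense, high-resolution prediction. We propose a new action relabeling mechanism: at every timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, so the policy can leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and densely refines key segments within a single unified network, with no learned skip planner or hierarchical structure. To partition demonstrations into key and skip segments automatically, without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments on 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across policy backbones. Project page: https://pgq18.github.io/SkiP-page/.

English abstract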

Previous imitation learning policies predict future actions at every control step, whether in smooth motion phases or precise, contact-rich operation phases. This uniform treatment is wasteful: most steps in a manipulation trajectory traverse free space and carry little task-relevant information, while a small fraction of key steps around contacts, grasps, and alignment demand dense, high-resolution prediction. We propose a novel action relabeling mechanism: at each timestep in a skip segment, we replace the behavior cloning target with the action at the entrance of the next key segment, enabling the policy to leap over redundant steps in a single decision. The resulting Skip Policy (SkiP) dynamically leaps over skip segments and intensively refines actions in key segments, within a single unified network requiring no learned skip planner or hierarchical structure. To automatically partition demonstrations into key and skip segments without manual annotation, we introduce Motion Spectrum Keying (MSK), a fast, task-agnostic procedure that detects local motion complexity from action signals. Extensive experiments across 72 simulated manipulation tasks and three real-robot tasks show that SkiP reduces executed steps by 15-40% while matching or improving success rates across various policy backbones. Project page: https://pgq18.github.io/SkiP-page/.
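The action relabeling rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the per-step boolean `is_key` labels (which the paper obtains from MSK), and the choice to leave trailing skip steps unchanged when no key segment follows are all hypothetical and not taken from the paper.

```python
def relabel_actions(actions, is_key):
    """Relabel behavior cloning targets for one demonstration.

    actions: list of per-step actions (any type).
    is_key:  list of booleans, True where the step belongs to a key segment.
    Skip steps receive the action at the entrance (first step) of the next
    key segment; key steps keep their own action. Skip steps with no later
    key segment are left unchanged (an assumption of this sketch).
    """
    targets = list(actions)
    next_key_entry = None  # index of the entrance of the nearest later key segment
    for t in range(len(actions) - 1, -1, -1):
        if is_key[t]:
            # A key step is an entrance if it starts the trajectory
            # or follows a skip step.
            if t == 0 or not is_key[t - 1]:
                next_key_entry = t
        elif next_key_entry is not None:
            targets[t] = actions[next_key_entry]
    return targets
```

For example, with six steps where only steps 2 and 3 are key, steps 0 and 1 are both trained toward the action at step 2, which is what lets a policy jump over the free-space portion in one decision.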

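2605.15535 2026-05-18 cs.CV version update

Learning Dynamic Structural Specialization for Underwater Salient Object Detection

Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang, Xin Wang, Bojian Zhang, Wenqi Ren, Xingchen Yang, Fumin Zhang

Affiliations * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology; School of Robotics and Advanced Manufacture, Harbin Institute of Technology; School of Computation, Information and Technology, Technical University of Munich; College of Computer Science & Visual Computing and Intelligent Perception Lab, Nankai University; School of Cyber Science and Technology, Sun Yat-sen University; School of Automation, Southeast University

AI summary This paper proposes DSS-USOD, which uses dynamic structural specialization to address the inaccurate localization, fragmented regions, and coarse boundary prediction caused by underwater image degradation, improving boundary precision and region coherence.

Comments 15 pages

Details
AI Chinese abstract (translated)

Underwater salient object detection (USOD) is attracting growing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradation, which often leads to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, an RGB-based USOD method built on dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the shared base representation is decomposed into a boundary-sensitive branch that models fine-grained boundary detail and a region-coherent branch that captures region-level structural consistency. A spatial coordination module then adaptively regulates the relative contributions of the two branches according to local structural context. In addition, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, so that DSS-USOD better balances boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.

English abstract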

Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.

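2605.15533 2026-05-18 cs.CV cs.AI version update

Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance

Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi, Junlan Feng

Affiliations * JIUTIAN Research, China Mobile; School of Intelligence Science and Technology, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University

AI summary This paper proposes a tuning-free, instruction-based video editing framework that uses a structural noise initialization strategy and a noise guidance mechanism to improve the visual quality and performance of video editing.

Comments Accepted by ICIP 2026

Details
AI Chinese abstract (translated)

Video editing poses a significant challenge. Although a series of tuning-free methods avoid the need for extensive data collection and model training, they often underuse the rich information embedded in the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach the problem from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) that obtains a better editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM) that uses the video prior in the generative model and integrates the rich information in the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that the proposed method outperforms existing methods in both visual quality and performance.

English abstract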

Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within the noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of the noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
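The idea behind SNIS, more noise where content should change and less where it should be preserved, can be illustrated with a toy region-dependent noising step. This is a sketch under assumptions, not the paper's procedure: the variance-preserving mix, the `high` and `low` noise levels, the flat list standing in for a latent, and all names are hypothetical.

```python
import math
import random


def structural_noise_init(latent, edit_mask, high=0.9, low=0.3, rng=None):
    """Mix a clean latent with Gaussian noise using a per-element noise level.

    latent:    list of floats standing in for latent values.
    edit_mask: list of booleans, True where the instruction edits content.
    high/low:  noise levels in [0, 1] for edited / unedited regions (assumed).
    """
    rng = rng or random.Random(0)
    noisy = []
    for value, edited in zip(latent, edit_mask):
        sigma = high if edited else low
        eps = rng.gauss(0.0, 1.0)
        # Variance-preserving interpolation between signal and noise.
        noisy.append(math.sqrt(1.0 - sigma ** 2) * value + sigma * eps)
    return noisy
```

With `low=0.0` the unedited elements pass through untouched, and with `high=1.0` the edited elements are pure noise, which are the two extremes the strategy interpolates between.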

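2605.15519 2026-05-18 cs.CV cs.AI version update

DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik

Affiliations * Washington University in St. Louis; RISE Research Institutes of Sweden; Climate AI Nordics

AI summary DiffVAS is a target-conditioned policy that can search for diverse targets simultaneously in partially observable environments, improving the deployability of visual active search in real-world applications.

Comments 26 Pages, 12 figures, Accepted to AAMAS 2026

Details
AI Chinese abstract (translated)

Visual active search (VAS) was introduced as a modeling framework that uses visual cues to guide aerial (for example, UAV-based) exploration and locate areas of interest within large geospatial regions. Potential applications include detecting hotspots of rare wildlife poaching, assisting search-and-rescue missions, and uncovering illegal weapons trafficking. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic under a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, advancing the deployment of visual active search policies in real-world applications. DiffVAS uses a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned, reinforcement-learning-based planning module to reason effectively and guide subsequent search steps. Extensive experiments show that DiffVAS excels at searching for diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

English abstract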

Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.

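2605.15496 2026-05-18 cs.RO cs.CV version update

LAPS: Improving Incremental LiDAR Mapping using Active Pooling and Sampling for Neural Distance Fields

Dongjae Lee, Wooseong Yang, Yifu Tao, Maurice Fallon, Ayoung Kim

Affiliations * Department of Mechanical Engineering, Seoul National University; Oxford Robotics Institute at the University of Oxford

AI summary LAPS improves replay management for incremental neural mapping through active pooling and sampling, improving replay retention and allocation and increasing reconstruction completeness and geometric accuracy.

Comments accepted at RA-L 2026

Details
AI Chinese abstract (translated)

Neural distance fields provide a compact, continuous representation of 3D geometry that suits incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this, but existing methods rely on passive replay buffers and uniform sampling, which wastes memory and under-trains poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves replay retention and allocation during online updates. LAPS combines reliability-based active pooling, which retains reliable historical samples under limited memory, with uncertainty-guided active sampling, which focuses on under-constrained regions. Experiments show that LAPS consistently improves reconstruction completeness on synthetic and real-world benchmarks while maintaining competitive geometric accuracy. On the Oxford Spires dataset, it improves recall by 4.66 percentage points and F1-score by 3.79 percentage points over PIN-SLAM on the Blenheim Palace 05 sequence. Our open source implementation is available at https://github.com/dongjae0107/LAPS.

English abstract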

Neural distance fields offer a compact and continuous representation of 3D geometry, making them attractive for incremental LiDAR mapping. However, their online optimization is vulnerable to catastrophic forgetting, where new observations can degrade previously reconstructed geometry. Replay-based training is commonly used to address this issue, but existing methods typically rely on passive replay buffers and uniform sampling, which can waste memory on redundant observations and under-train poorly constrained regions. We propose LAPS, a replay management framework for incremental neural mapping that improves both replay retention and replay allocation during online updates. LAPS combines reliability-based active pooling to retain reliable historical samples under limited memory with uncertainty-guided active sampling to focus optimization on under-constrained regions. Experiments on synthetic and real-world benchmarks show that LAPS consistently improves reconstruction completeness while maintaining competitive geometric accuracy. On Oxford Spires, it improves recall by 4.66 pp and F1-score by 3.79 pp over PIN-SLAM on the Blenheim Palace 05 sequence. We release our open source implementation at: https://github.com/dongjae0107/LAPS.
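The two replay decisions described above, which samples to retain and which to train on, can be sketched as follows. This is a minimal illustration under assumptions: the dictionary fields `reliability` and `uncertainty`, the top-k retention rule, and proportional sampling are hypothetical stand-ins, since the abstract does not specify how either score is computed or used.

```python
import random


def active_pool(samples, capacity):
    """Retention: keep the `capacity` samples with the highest reliability."""
    ranked = sorted(samples, key=lambda s: s["reliability"], reverse=True)
    return ranked[:capacity]


def active_sample(pool, batch_size, rng=None):
    """Allocation: draw a replay batch with probability proportional to uncertainty."""
    rng = rng or random.Random(0)
    weights = [s["uncertainty"] for s in pool]
    if sum(weights) <= 0:
        # No uncertainty signal: fall back to uniform sampling.
        return [rng.choice(pool) for _ in range(batch_size)]
    return rng.choices(pool, weights=weights, k=batch_size)
```

In contrast to a passive buffer with uniform sampling, the first function decides what survives under a memory limit and the second decides where optimization effort goes.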

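2605.15492 2026-05-18 cs.RO cs.CV version update

FLASH: Efficient Visuomotor Policy via Sparse Sampling

Jiaqi Bai, Jindou Jia, Yuxuan Hu, Gen Li, Xiangyu Chen, Tuo An, Kuangji Zuo, Jianfei Yang

Affiliations * MARS Lab, Nanyang Technological University, Singapore

AI summary FLASH uses sparse sampling and a Legendre polynomial trajectory representation to make visuomotor policy learning more efficient, covering a longer action horizon with faster inference. Experiments report state-of-the-art performance on multiple tasks.

Comments 19 pages, 10 figures

Details
AI Chinese abstract (translated)

Generative models such as diffusion and flow matching dominate visuomotor policy learning, but their reliance on iterative denoising causes high inference latency that is incompatible with real-time robot control. This paper presents Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with a continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, a single inference covers a significantly longer action horizon. To accelerate generation further, FLASH starts the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. In addition, analytic polynomial differentiation directly provides the desired velocity feed-forward signal to the torque controller, with no numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks show that FLASH achieves success rates of at least 92% on all tasks, a per-episode inference time of 31.40 ms (up to 175× faster than diffusion policies and 18× faster than prior flow matching policies), up to 4× faster training convergence than ACT, and a 5× to 7× reduction in controller tracking error compared with discrete-action baselines.

English abstract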

Generative models such as diffusion and flow matching have become dominant paradigms for visuomotor policy learning, yet their reliance on iterative denoising incurs high inference latency incompatible with real-time robotic control. We present Fast Legendre-polynomial Action policy via Sparse History-anchored flow (FLASH Policy), which replaces discrete action-chunk generation with continuous Legendre polynomial trajectory representation. Specifically, by fitting expert demonstrations under sparse temporal sampling, FLASH enables a single inference to cover a significantly extended action horizon. To further accelerate generation, FLASH initiates the flow matching process from history polynomial coefficients rather than uninformative Gaussian noise, shortening the transport distance and enabling accurate single-step inference. Moreover, analytic polynomial differentiation directly provides desired velocity feed-forward signals to the torque controller without numerical approximation. Extensive experiments on five simulated and two real-world manipulation tasks demonstrate that FLASH achieves state-of-the-art success rates (≥ 92% across all tasks), a per-episode inference time of 31.40 ms (up to 175× faster than diffusion policies and 18× faster than prior flow matching policies), up to 4× faster training convergence than ACT, and 5× to 7× reduction in controller tracking error compared to discrete-action baselines.
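To make the point about analytic differentiation concrete, the sketch below evaluates a one-dimensional Legendre-series trajectory and its exact time derivative using the standard Legendre recurrences. It illustrates the representation only; the function names, the mapping of time onto [-1, 1], and the single-joint setup are assumptions and are not taken from the paper.

```python
def legendre_basis(x, degree):
    """Return ([P_0..P_n](x), [P_0'..P_n'](x)) via the standard recurrences."""
    p, dp = [1.0], [0.0]
    if degree >= 1:
        p.append(x)
        dp.append(1.0)
    for n in range(1, degree):
        # (n + 1) P_{n+1} = (2n + 1) x P_n - n P_{n-1}
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
        # P_{n+1}' = P_{n-1}' + (2n + 1) P_n
        dp.append(dp[n - 1] + (2 * n + 1) * p[n])
    return p, dp


def eval_trajectory(coeffs, t, horizon):
    """Position and velocity at time t in [0, horizon] for a Legendre series."""
    x = 2.0 * t / horizon - 1.0  # map time onto the Legendre domain [-1, 1]
    p, dp = legendre_basis(x, len(coeffs) - 1)
    position = sum(c * b for c, b in zip(coeffs, p))
    # Chain rule: dx/dt = 2 / horizon.
    velocity = sum(c * b for c, b in zip(coeffs, dp)) * (2.0 / horizon)
    return position, velocity
```

Because the velocity comes from the same coefficients as the position, a controller can query both at any time inside the horizon without finite differencing.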

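2605.15484 2026-05-18 cs.CV cs.LG version update

When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

Affiliations * Department of Computer Science and Software Engineering; Auburn University; Department of Information Management; National Central University

AI summary This paper studies when sparse top-k routing helps in vision classification. It identifies a compute-leverage pattern, shows how backbone architecture and multi-expert routing affect performance, and isolates the key factors with controlled experiments.

Comments 24 pages (main + appendix), 8 figures, 18 tables. Under review at TMLR. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho

Details
AI Chinese abstract (translated)

Mixture-of-Experts (MoE) networks offer favorable accuracy-compute trade-offs, but practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. This paper studies the conditions under which sparse top-k routing helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: a positive accuracy gap requires a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, and multi-expert routing (k ≥ 2) is also required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 produces the predicted sign reversals on both standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that changes only top-k, with architecture, initialization, and ρ held fixed, turns the gap from positive to negative. A per-sample variant of Soft MoE, which applies softmax over experts rather than over the batch, lifts CIFAR-100 above the dense baseline and identifies batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.

English abstract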

Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-k routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction ρ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing (k ≥ 2) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-k (holding architecture, initialization, and ρ fixed) reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
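For readers less familiar with the two quantities the abstract turns on, the sketch below shows per-sample top-k routing and one plausible reading of the routed-compute fraction ρ. Both are illustrative assumptions: renormalizing the softmax over the selected experts is one common variant, the definition of ρ as routed FLOPs over total FLOPs is inferred from the abstract's wording, capacity constraints are omitted, and all names are hypothetical.

```python
import math


def top_k_route(logits, k):
    """Select the k highest-scoring experts for one sample.

    Returns {expert_index: weight}, with weights from a softmax taken over
    the selected experts only (an assumed variant).
    """
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = max(logits[e] for e in chosen)  # subtract max for numerical stability
    exps = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


def routed_fraction(dense_flops, expert_flops, k):
    """Assumed reading of rho: share of per-sample FLOPs spent in routed experts."""
    routed = k * expert_flops
    return routed / (dense_flops + routed)
```

Under this reading, a model whose experts are small relative to its dense backbone has a low ρ, which is the regime where the abstract reports no accuracy benefit from sparse routing.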

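2605.15475 2026-05-18 cs.CV cs.MM version update

A Unified Non-Parametric and Interpretable Point Cloud Analysis via t-FCW Graph Representation

Haijian Lai, Bowen Liu, Man Xu, Chan-Tong Lam, João Macedo, Benjamin Ng, Sio-Kei Im

Affiliations * Faculty of Applied Sciences, Macao Polytechnic University; University of Coimbra, CISUC/LASI, DEI; Macao Polytechnic University

AI summary This paper proposes an empowered t-FCW graph representation for point cloud embedding, analyzes why it is effective, and designs a network around it for efficient, interpretable point cloud processing in classification and segmentation tasks.

Comments Accepted for publication in IEEE Transactions on Multimedia

Details
AI Chinese abstract (translated)

We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation that embeds point clouds into a metric space. Although the original t-FCW performs well in point cloud classification, the reasons for its effectiveness and its broader applicability have been unclear. This paper analyzes the properties that make the empowered and original t-FCW effective and designs a network that uses only the empowered t-FCW as its feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation. Our analysis shows that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties make the network efficient and interpretable, processing the ModelNet40 classification problem in about 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can serve both as a lightweight standalone baseline and as a complementary plug-in for existing deep models.

English abstract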

We introduce an empowered transposed Fully Connected Weighted (t-FCW) graph representation to embed point clouds into a metric space. While the original t-FCW has shown promising results for point cloud classification, the reasons behind its effectiveness and its broader applicability remained unclear. In this work, we analyze the properties that make the empowered and original t-FCW effective and design a network that uses the empowered t-FCW exclusively as its feature extractor. From an interpretability perspective, we build memory banks for classification, part segmentation, and semantic segmentation using the empowered t-FCW. Our analysis reveals that the empowered t-FCW inherits robustness from surface descriptors and provides interpretability through dimension-wise relations. These properties enable a highly efficient and interpretable network, which processes the ModelNet40 classification problem in approximately 7 seconds on an NVIDIA RTX A5000 GPU. Importantly, the empowered t-FCW can function both as a lightweight standalone baseline and as a complementary plug-in to existing deep models.

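2605.15458 2026-05-18 cs.CV version update

Video Models Can Reason with Verifiable Rewards

Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

Affiliations * University of California, Davis; Microsoft Research; University of Southern California; University of California, Santa Cruz

AI summary This paper proposes VideoRLVR, which optimizes video diffusion models with rule-based feedback to improve verifiable reasoning. It outperforms supervised fine-tuning baselines on Maze, FlowFree, and Sokoban, showing that verifiable RL can move video models beyond perceptual imitation.

Comments Website: https://darthzhu.github.io/VideoRLVR-page/

Details
AI Chinese abstract (translated)

Video diffusion models have progressed rapidly in perceptual realism and temporal coherence, but they are optimized mainly for plausible generation rather than verifiable reasoning. This limitation is most visible in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR treats video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy that improves training efficiency. Early-Step Focus restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, and dense decomposed rewards are especially important in low-success-rate settings. Our RL-optimized model outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and on out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable, rule-consistent visual reasoning.

English abstract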

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
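To show what rule-based, decomposed feedback might look like on a maze, here is a small sketch together with the group-relative normalization used in GRPO-style training. It is illustrative only: it assumes the agent's path has already been extracted from the generated frames, and the three reward components, their names, and the grid encoding are hypothetical rather than the paper's actual reward design.

```python
import math


def maze_rewards(grid, path, start, goal):
    """Decomposed rule-based rewards for a path of (row, col) cells.

    grid[r][c] == 1 marks a wall. The three components are assumptions.
    """
    def is_free(cell):
        r, c = cell
        return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0

    steps = list(zip(path, path[1:]))
    # A step is legal if it moves one cell horizontally or vertically into free space.
    legal = sum(
        1 for a, b in steps
        if is_free(b) and abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
    )
    return {
        "starts_at_start": float(bool(path) and path[0] == start),
        "legal_step_ratio": legal / len(steps) if steps else 0.0,
        "reaches_goal": float(bool(path) and path[-1] == goal),
    }


def group_advantages(rewards, eps=1e-8):
    """Normalize scalar rewards within a group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

A dense component such as the legal-step ratio still separates better from worse samples when no sample reaches the goal, which is one way to read the abstract's remark about low-success-rate settings.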

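2605.15456 2026-05-18 eess.IV cs.CV math.OC version update

DIPA: Distilled Preconditioned Algorithms for Solving Imaging Inverse Problems

Romario Gualdrón-Hurtado, Roman Jacome, Leon Suarez, Henry Arguello

Affiliations * Department of Computer Science, Universidad Industrial de Santander; Department of Electrical Engineering, Universidad Industrial de Santander

AI summary This paper proposes DIPA, which uses teacher-guided distillation to improve reconstruction quality with linear and non-linear preconditioning operators, and validates it on magnetic resonance imaging, compressed sensing, and super-resolution imaging.

Comments 17 pages, 8 figures, 8 tables

Details
AI Chinese abstract (translated)

Solving imaging inverse problems usually requires designing a suitable prior model, but minimizing the data fidelity term is challenging because physical constraints lead to an ill-conditioned sensing matrix. Classical optimization theory therefore uses preconditioning, which transforms the algorithm's gradient step to accelerate convergence and improve numerical stability. This paper extends the preconditioning concept to improving reconstruction quality and introduces DIPA: Distilled Preconditioned Algorithms, in which a preconditioning operator (PO) is optimized using teacher-guided distillation criteria. The teacher and student differ in the sensing operator used during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, whereas the student uses the physically feasible sensing matrix. Different distillation loss functions are designed to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), which allows interpretability, or non-linear (N-DIPA), parametrized by a neural network, which offers better scalability. The proposed PO design is validated across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.

English abstract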

Solving imaging inverse problems has usually been addressed by designing proper prior models of the underlying signal. However, minimizing the data fidelity term poses significant challenges due to the ill-conditioned sensing matrix caused by physical constraints in the acquisition system. Thus, preconditioning techniques have been adopted in classical optimization theory to address ill-conditioned data-fidelity minimization by transforming the algorithm gradient step to achieve faster convergence and better numerical stability. We extend the preconditioning concept beyond convergence acceleration and use it to improve reconstruction quality. We introduce DIPA: Distilled Preconditioned Algorithms, where a preconditioning operator (PO) is optimized using teacher-guided distillation criteria. Unlike standard model-compression KD, the teacher and student differ by the sensing operators available during reconstruction: the teacher uses a simulated, better-conditioned, and more informative sensing matrix, whereas the student uses the physically feasible sensing matrix. We design different distillation loss functions to transfer different properties of the teacher algorithm to the preconditioned student. The PO can be linear (L-DIPA), allowing interpretability, or non-linear (N-DIPA), parametrized by a neural network, offering better scalability. We validate the proposed PO design across several imaging modalities, including magnetic resonance imaging, compressed sensing, and super-resolution imaging.
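The classical role of a preconditioner in the data-fidelity gradient step can be shown on a tiny ill-conditioned least-squares problem. The sketch below is a textbook preconditioned gradient iteration, not the DIPA method: here the operator `P` is fixed by hand, whereas the paper learns it through distillation, and all function names are hypothetical.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def preconditioned_gradient_descent(A, y, P, step=1.0, iters=1):
    """Minimize 0.5 * ||A x - y||^2 with updates x <- x - step * P A^T (A x - y)."""
    x = [0.0] * len(A[0])
    At = transpose(A)
    for _ in range(iters):
        residual = [ax - yi for ax, yi in zip(matvec(A, x), y)]
        gradient = matvec(At, residual)
        direction = matvec(P, gradient)  # preconditioner reshapes the step
        x = [xi - step * di for xi, di in zip(x, direction)]
    return x
```

With `A = diag(1, 0.1)` and `P = diag(1, 100)`, which equals the inverse of `A^T A`, a single step reaches the solution, while the same step with an identity `P` leaves the poorly sensed coordinate far from its target.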

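2605.15450 2026-05-18 cs.CV cs.AI cs.LG version update

RIDE: Retinex-Informed Decoupling for Exposing Concealed Objects

Chunming He, Rihan Zhang, Dingming Zhang, Chengyu Fang, Longxiang Tang, Jingjia Feng, Fengyang Xiao, Sina Farsiu

Affiliations * Duke University; Tsinghua University; Harvard University

AI summary RIDE uses Retinex theory to perform same-domain image decomposition for concealed object segmentation, and uses the Discriminability Gap Theorem to improve the separation of foreground from background.

Details
AI Chinese abstract (translated)

Concealed Object Segmentation (COS) covers a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, in which targets blend visually into their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or use heterogeneous decompositions (such as Fourier or wavelet) that spread spatial evidence across scale and frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components. Our key insight is that visual blending forces appearance matching in the composite space but does not require simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks the underlying physical processes systematically anti-correlate illumination and reflectance differences, giving theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE, comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits the regions where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss that operates in reflectance feature space.

English abstract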

Concealed Object Segmentation (COS) encompasses a family of dense-prediction tasks, including camouflaged object detection, polyp segmentation, transparent object detection, and industrial defect inspection, where targets are visually entangled with their surroundings through different physical mechanisms. Existing methods either operate directly on RGB images or employ heterogeneous decompositions (e.g., Fourier, wavelet) that redistribute spatial evidence across scale/frequency coefficients, making pixel-aligned cues less direct. We introduce a fundamentally different perspective: homogeneous image decomposition via Retinex theory, which factorizes an image into illumination and reflectance components within the same spatial domain. Our key insight is that visual entanglement enforces appearance matching in the composite space, but this does not necessitate simultaneous matching in both component spaces, a phenomenon we formalize as the Discriminability Gap Theorem. Crucially, we show that across diverse COS sub-tasks, the underlying physical processes systematically anti-correlate illumination and reflectance differences, yielding theoretical guarantees that Retinex decomposition preserves or strictly improves total foreground-background discriminability across the full physical regime, with anti-correlation maximizing the gain. Building on this, we propose RIDE comprising: (i) a Task-Driven Retinex Decomposition module that learns segmentation-optimal factorizations end-to-end; (ii) a Discriminability Gap Attention mechanism that adaptively exploits where decomposition helps; and (iii) a Camouflage-Breaking Contrastive loss operating in reflectance feature space.
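The intuition that a target can match its background in the composite image while differing in both Retinex components can be checked with toy numbers. The sketch below assumes the basic Retinex model, intensity equals illumination times reflectance, on single scalar values; it is an arithmetic illustration with hypothetical names and is not a statement or proof of the paper's theorem.

```python
def discriminability(foreground, background):
    """Compare foreground and background under the Retinex model I = L * R.

    Each argument is an (illumination, reflectance) pair of scalars.
    Returns (composite_gap, illumination_gap, reflectance_gap).
    """
    fg_l, fg_r = foreground
    bg_l, bg_r = background
    composite_gap = abs(fg_l * fg_r - bg_l * bg_r)
    illumination_gap = abs(fg_l - bg_l)
    reflectance_gap = abs(fg_r - bg_r)
    return composite_gap, illumination_gap, reflectance_gap
```

For a foreground of (0.8, 0.5) and a background of (0.5, 0.8), the composite intensities are both 0.4, so the regions are indistinguishable in the image, yet each component differs by 0.3 and the two differences have opposite signs, which is the anti-correlated case the abstract highlights.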

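2605.15430 2026-05-18 cs.RO cs.CV version update

Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones

Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer

Affiliations * Department of Civil, Aerospace and Design Engineering, University of Bristol; Laboratory of Sustainability Robotics at Swiss Federal Laboratories for Materials Science and Technology (EMPA); Ecole Polytechnique Fédérale de Lausanne (EPFL)

AI summary This paper proposes a vision-guided method for finding an ideal perch location on a tree. Image processing algorithms assess the tree's shape and structure, and branches suitable for perching are selected based on branch width, slope, and curvature.

Comments Work in progress version accepted to the Recent Advances in Robotic Perception for Forestry

Details
AI Chinese abstract (translated)

This study demonstrates a method for locating an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation, and binary image morphology, are used to assess the shape and structure of a tree. Rather than simply finding the closest available branch, the study evaluates the potential of each branch and determines its suitability for perching from factors such as branch width, slope (angle to the horizontal), and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as one where the branch diameter is sufficiently thick and the available perching space is at least equal to the width of a tendon-driven grasping claw. These preliminary results lay a foundation for a set of improvements and additional features aimed at a generalised method, which will incorporate supplementary data from depth perception and attitude sensors to enhance branch assessment.

English abstract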

This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.
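The branch-suitability idea in this abstract (a feasibility check on thickness and free space, followed by ranking on slope and curvature) can be sketched as follows. This is only an illustrative sketch: the `Branch` fields, the thresholds, and the multiplicative score are hypothetical choices, not the authors' published scoring rule.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    """Hypothetical per-branch measurements extracted from a segmented image."""
    width_px: float        # apparent branch width, in pixels
    slope_deg: float       # angle to the horizontal, in degrees
    curvature: float       # mean centreline curvature, in 1/pixels
    free_length_px: float  # unobstructed length available for perching


def perch_score(branch, min_width_px=20.0, claw_width_px=60.0,
                max_slope_deg=30.0, max_curvature=0.01):
    """Return a score in [0, 1]; 0 means the branch is not a feasible target.

    All thresholds are assumed values chosen for illustration only.
    """
    # Feasibility: thick enough, and at least one claw width of free space.
    if branch.width_px < min_width_px or branch.free_length_px < claw_width_px:
        return 0.0
    # Flatter and straighter branches score higher.
    slope_term = max(0.0, 1.0 - abs(branch.slope_deg) / max_slope_deg)
    curve_term = max(0.0, 1.0 - abs(branch.curvature) / max_curvature)
    return slope_term * curve_term


def best_branch(branches):
    """Return the highest-scoring feasible branch, or None if there is none."""
    scored = [(perch_score(b), b) for b in branches]
    feasible = [(s, b) for s, b in scored if s > 0.0]
    if not feasible:
        return None
    return max(feasible, key=lambda pair: pair[0])[1]
```

Ranking every branch rather than taking the nearest one mirrors the distinction the abstract draws; the depth and attitude data mentioned as future work would presumably replace the pixel-based measurements assumed here.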

2605.15424 2026-05-18 cs.CV 版本更新

Social-Mamba: Socially-Aware Trajectory Forecasting with State-Space Models

Social-Mamba:基于状态空间模型的社会感知轨迹预测

Po-Chien Luan, Wuyang Li, Yang Gao, Alexandre Alahi

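Affiliations: EPFL, Switzerland

AI summary: This paper proposes Social-Mamba, which treats social interactions as structured sequential processes and combines the Cycle Mamba block with social triplet factorization for efficient, accurate trajectory forecasting. Experiments show strong results on multiple benchmarks.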

详情
AI中文摘要

人类轨迹预测对于拥挤环境中安全导航至关重要,需要在准确性和计算效率之间取得平衡。高效建模社会互动是密集人群中的关键。然而,大多数最新方法依赖于注意力机制,虽然能捕捉复杂依赖关系,但会带来二次计算成本,随着邻居数量的增加而表现不佳。最近,选择性状态空间模型提供了线性时间的替代方案;然而,其本质上是顺序的,与社会互动的无结构和动态性质不匹配。为此,我们提出了Social-Mamba,一种预测架构,将社会互动重新表述为结构化序列过程。其核心是循环Mamba模块,一个新型模块,能够实现连续的双向信息流。Social-Mamba在以自我为中心的网格上组织代理,并引入社交三元组分解,将互动分解为时间、以自我为中心和目标为中心的扫描。这些通过可学习的社会门和全局扫描动态整合,以生成准确且高效的轨迹预测。在五个轨迹预测基准上的广泛实验表明,Social-Mamba在准确率方面达到最先进的水平,同时提供优越的参数效率和计算可扩展性。此外,将Social-Mamba嵌入到流匹配框架中进一步增强了准确性和效率,使其成为未来轨迹预测研究的灵活且稳健的基础。代码已公开:https://github.com/vita-epfl/Social-Mamba

英文摘要

Human trajectory forecasting is crucial for safe navigation in crowded environments, requiring models that balance accuracy with computational efficiency. Efficiently modeling social interactions is key to performance in dense crowds. Yet, most recent methods rely on attention mechanisms, which are effective at capturing complex dependencies, but incur quadratic computational costs that scale poorly with the growing number of neighbors. Recently, Selective State-Space Models have provided a linear-time alternative; however, their inherently sequential design is misaligned with the unstructured and dynamic nature of social interactions. To address this challenge, we propose Social-Mamba, a forecasting architecture that reformulates social interactions as structured sequential processes. At its core is the Cycle Mamba block, a novel module that enables continuous bidirectional information flow. Social-Mamba organizes agents on an egocentric grid and introduces social triplet factorization, which decomposes interactions into temporal, egocentric, and goal-centric scans. These are dynamically integrated through a learnable social gate and global scan to generate accurate and efficient trajectory predictions. Extensive experiments on five trajectory forecasting benchmarks show that Social-Mamba achieves state-of-the-art accuracy while offering superior parameter efficiency and computational scalability. Furthermore, embedding Social-Mamba into a flow-matching framework further enhances both accuracy and efficiency, establishing it as a flexible and robust foundation for future trajectory forecasting research. The code is publicly available: https://github.com/vita-epfl/Social-Mamba

2605.15423 2026-05-18 cs.CV cs.AI eess.IV 版本更新

MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes

MR2-ByteTrack:基于CNN和Transformer的视频目标检测用于AI增强的嵌入式视觉传感器节点

Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti

Affiliations: Electrical, Electronic and Information Engineering (DEI), University of Bologna, Italy; Department of Electrical Engineering (ESAT), KU Leuven, Belgium; Dalle Molle Institute for Artificial Intelligence (IDSIA), USI-SUPSI, Switzerland

AI summary: This paper proposes MR2-ByteTrack, a video object detection method for embedded vision nodes. It alternates between full- and low-resolution inference and combines ByteTrack with the Rescore algorithm to improve efficiency, enabling accurate real-time detection on embedded devices.

详情
AI中文摘要

现代智能视觉传感器需要设备端智能来处理视频流,因为云计算在带宽、延迟和隐私限制下往往不可行。然而,这些传感系统通常依赖超低功耗微控制器(MCUs),其内存和计算能力有限,使得需要特征存储或多帧缓冲的传统视频目标检测方法不可行。为了解决这一挑战,我们引入了多分辨率重评分ByteTrack(MR2-ByteTrack),一种专为基于MCU的嵌入式视觉节点设计的视频目标检测(VOD)方法。MR2-ByteTrack通过交替使用全分辨率和低分辨率推理来降低计算成本,同时通过ByteTrack在帧间链接检测,并通过Rescore算法通过概率联合规则聚合跨帧的检测置信度分数以纠正误分类。我们将其应用于基于CNN的检测器和基于Transformer的模型,证明了其在具有根本不同空间处理的架构中的通用性。在ImageNetVID上的实验表明,MR2-ByteTrack保持了准确性,实现了CNN模型的mAP最高达49.0,Transformer模型的mAP为48.7,同时将CNN的乘加操作减少了高达53%,Transformer的减少了32%。当部署在GAP9上,一个超低功耗RISC-V多核MCU上时,我们的方法相比仅处理全分辨率图像,实现了高达55%的能耗节省,实现了在MCU类嵌入式视觉节点上的首个实时Transformer-based VOD。代码可在https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access获取。

英文摘要

Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, infeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
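To make the "probability union" aggregation concrete, here is a minimal sketch of one way to fuse per-frame confidences along a track, together with a simple alternating resolution schedule. It assumes the union of independent detections, 1 - prod(1 - p_i), and a fixed full/low alternation; the function names are hypothetical, and the released Rescore implementation may differ in both respects.

```python
def union_confidence(scores):
    """Probability-union (noisy-OR) of per-frame confidences: 1 - prod(1 - p)."""
    miss = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidence scores must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss


def rescore_track(detections):
    """Fuse one track's detections and return (label, fused_confidence).

    `detections` is a list of (class_label, confidence) pairs, one per frame
    in which the tracker linked a detection to this track.
    """
    per_class = {}
    for label, conf in detections:
        per_class.setdefault(label, []).append(conf)
    fused = {label: union_confidence(confs) for label, confs in per_class.items()}
    best = max(fused, key=fused.get)
    return best, fused[best]


def resolution_schedule(num_frames, period=2):
    """Assumed schedule: one full-resolution frame every `period` frames."""
    return ["full" if i % period == 0 else "low" for i in range(num_frames)]
```

In this sketch, a class that is seen repeatedly at moderate confidence can overtake a class seen once at higher confidence, which is the kind of misclassification correction the abstract attributes to rescoring.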

2605.15421 2026-05-18 cs.CV 版本更新

U-SEG: Uncertainty in SEGmentation -- A systematic multi-variable exploration

U-SEG:不确定性在分割中的探索——系统多变量研究

Michael Smith, Frank P. Ferrie

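Affiliations: Centre for Intelligent Machines, McGill University

AI summary: This paper systematically examines key questions at the intersection of uncertainty estimation and segmentation and analyses how different variables affect performance. It finds that the more challenging panoptic task usually degrades performance and that sample diversity is most promising for calibration.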

Comments Accepted to CVPR Findings Track 2026

详情
AI中文摘要

本文深入探讨了不确定性估计与分割交叉领域中的一些未被充分研究的课题。先前研究表明,不确定性估计的质量对多种变量非常敏感。作为不确定性估计的主要应用之一,帮助识别和解决实际场景中的预测错误,任何影响这一应用的因素都必须明确识别。例如,更具挑战性的领域或不同的数据集和架构是否会导致使用不确定性估计时性能下降?视频序列中的先前帧是否能提供与其它方法相当的不确定性估计?能否利用样本多样性结合不确定性估计方法以获得更好的估计?最后,何时使用基于集成的不确定性估计比确定性网络更合理?我们通过创建框架并执行大规模研究,跨多个变量(如数据集、主干网络和下游任务)对语义和全景分割进行研究。我们发现,a) 具有挑战性的全景分割任务通常导致性能下降,而数据集和主干网络之间的高性能方差表明泛化并不保证;b) 时间序列样本对特定配置有用,但在许多情况下不值得付出代价;c) 样本多样性在校准下游任务中最具潜力,但其他情况下无法超越更简单的替代方案;d) 确定性方法在某些下游任务中足够,但若在部署中能实现正确条件,集成方法可带来显著改进。

英文摘要

In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.
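For readers unfamiliar with ensemble-based uncertainty, the sketch below shows one standard way to turn the per-pixel class probabilities of several ensemble members into uncertainty scores: predictive entropy of the averaged prediction, and mutual information as the part of that entropy explained by disagreement between members. These are textbook quantities used here for illustration; the abstract does not state which estimators the study actually evaluates.

```python
import math


def mean_probs(member_probs):
    """Average the class-probability vectors of all ensemble members."""
    n = len(member_probs)
    num_classes = len(member_probs[0])
    return [sum(m[c] for m in member_probs) / n for c in range(num_classes)]


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ensemble_uncertainty(member_probs):
    """Return (predictive_entropy, mutual_information) for one pixel.

    `member_probs` holds one probability vector per ensemble member.
    """
    total = entropy(mean_probs(member_probs))
    expected = sum(entropy(m) for m in member_probs) / len(member_probs)
    return total, total - expected
```

A deterministic network corresponds to a single member, for which the mutual information term is always zero; that is one way to see what an ensemble adds.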

2605.15398 2026-05-18 cs.GR cs.CV 版本更新

3DEditSafe: Defending 3D Editing Pipelines from Unsafe Generation

3DEditSafe: 防御3D编辑流程中的不安全生成

Nicole Meng, Zheyuan Liu, Meng Jiang, Yingjie Lao

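Affiliations: Tufts University; University of Notre Dame

AI summary: This paper proposes 3DEditSafe, a framework that uses safety regularization to constrain unsafe semantic propagation, reducing unsafe content in 3D editing and revealing a safety-quality tradeoff.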

详情
AI中文摘要

近期3D生成编辑的进步,特别是基于3D高斯点散布(3DGS)的流程,实现了从文本提示中高保真的多视角一致场景操控。然而,我们发现这些流程在处理不安全提示时会产生传播和优化的不安全编辑。本文研究了3D编辑流程中的不安全生成,证明这种行为可能导致最终3D表示中一致但不适宜工作(NSFW)的内容。为解决此问题,我们提出了3DEditSafe,一个安全正则化的3D编辑框架,通过生成阶段的安全指导和渲染视图的3D安全正则化、安全语义投影、残差抑制和掩码感知保留,引导优化远离不安全的编辑方向。我们在EditSplat场景上使用对象兼容的不安全提示基准评估了我们的方法,并证明2D安全指导单独不足以防止不安全的3D编辑。3DEditSafe减少了不安全语义对齐和视图级攻击成功率,同时揭示了安全与质量之间的权衡,更强的不安全抑制可能引入伪影或降低不安全提示的保真度。到目前为止,这项工作是首次尝试研究并防御文本驱动的3D编辑流程中的不安全生成,强调了需要直接在优化的3D表示上操作的安全机制。

英文摘要

Recent advances in 3D generative editing, particularly pipelines based on 3D Gaussian Splatting (3DGS), have achieved high-fidelity, multi-view-consistent scene manipulation from text prompts. However, we find that these pipelines also introduce new safety risks when unsafe prompts produce edits that are propagated and optimized across views. In this work, we study unsafe generation in 3D editing pipelines and show that such behavior can lead to coherent, undesirable Not-Safe-For-Work (NSFW) content in the final 3D representation. To address this, we propose 3DEditSafe, a safety-regularized 3D editing framework that constrains unsafe semantic propagation during optimization. 3DEditSafe combines generation-stage safety guidance with rendered-view 3D safety regularization, safe semantic projection, residue suppression, and mask-aware preservation to steer optimization away from unsafe editing directions. We evaluate our approach on EditSplat scenes using an object-compatible unsafe prompt benchmark and show that 2D safety guidance alone is not consistently sufficient to prevent unsafe 3D edits. 3DEditSafe reduces unsafe semantic alignment and view-level attack success rates, while revealing a safety-quality tradeoff in which stronger unsafe suppression can introduce artifacts or reduce unsafe-prompt fidelity. To our knowledge, this work is the first attempt to study and defend against unsafe generation in text-driven 3D editing pipelines, highlighting the need for safety mechanisms that operate directly on optimized 3D representations.

2605.15397 2026-05-18 cs.CV 版本更新

ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest

ELDOR:亚马逊雨林非法金矿开采的数据集和基准

Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang, Martin Pillaca, Edwin Flores, Zhen Yang, Gregory Larsen, Evan Dethier, David Lutz, Jean-Michel Morel, Miles Silman, Victor Pauca, Fan Yang

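Affiliations: Wake Forest University; City University of Hong Kong (Dongguan); Duke University; City University of Hong Kong; Yale University; Alaska Spatial Science; Colby College; Colby-Sawyer College; Lingnan University; Centro de Innovación Cientifica Amazónica

AI summary: ELDOR is a large-scale UAV benchmark for monitoring the environmental and landscape impact of illegal gold mining in the Amazon rainforest. It contains manually annotated orthomosaic imagery covering more than 2,500 hectares, with pixel-level semantic labels for mining activity and ecological structures, and evaluates a range of models on fine-grained classes and small-scale structures.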

Comments 70 pages, 35 figures, 28 tables

详情
AI中文摘要

亚马逊雨林非法金矿开采导致森林砍伐、水污染和长期生态系统破坏,但难以在精细空间尺度上监测。卫星影像支持大范围观测,但常遗漏小型采矿相关结构和微妙的土地覆盖转变,尤其是频繁的云层覆盖。我们引入ELDOR,一个大规模无人机基准,用于监测非法金矿开采对雨林环境和景观的破坏。ELDOR包含覆盖超过2500公顷的手动标注正射影像,具有像素级语义标签,涵盖采矿相关活动和周围生态结构。借助这一统一的标注源,我们建立了四个基准任务:语义分割、分割衍生识别、直接多标签分类以及基于视觉-语言模型的类别存在识别。在这些任务中,我们比较了通用和遥感专用的分割模型、视觉基础模型相关的分割方法、直接多标签分类方法以及视觉-语言模型,在受控的闭集协议下。结果表明,当前方法在罕见的小规模采矿结构和细粒度恢复类别上仍存在困难,表明需要上下文感知和多模态建模。为了支持领域分析和实际应用,我们进一步构建了一个交互式探索器,为领域专家提供统一的数据探索和模型推理界面。

英文摘要

Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.
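The "segmentation-derived recognition" task suggests turning a predicted segmentation mask into image-level class-presence labels. A minimal sketch of one such derivation follows; the pixel-fraction threshold and the function name are assumptions made for illustration, since the abstract does not describe the benchmark's exact rule.

```python
from collections import Counter


def class_presence(mask, num_classes, min_fraction=0.01):
    """Derive a multi-hot presence vector from a 2D mask of integer class IDs.

    A class counts as present when it covers at least `min_fraction` of the
    pixels. The threshold is a hypothetical choice, not ELDOR's protocol.
    """
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    if total == 0:
        return [0] * num_classes
    return [
        1 if counts.get(c, 0) / total >= min_fraction else 0
        for c in range(num_classes)
    ]
```

The threshold matters most for the rare, small-scale mining structures the abstract highlights: a high value silently drops them, while a very low value lets stray mislabeled pixels count as detections.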

2605.15392 2026-05-18 physics.optics cs.CV 版本更新

Frequency-domain Event-based Imaging for Selective Surveillance

频域事件成像用于选择性监控

Megan Birch, James Rick, Adrish Kar, Jason Zutty, Joseph L. Greene

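Affiliations: Georgia Tech Research Institute

AI summary: This paper proposes the FRIES framework, which analyses event data in the frequency domain to identify mechanical vibration and rotating objects. Combined with Resonant Time Surface (RTS) visualization, it is validated against dynamic backgrounds in indoor and outdoor experiments.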

Comments 14 pages, 11 figures

详情
AI中文摘要

事件相机(EBCs)因其微秒级像素级辐射变化报告和高动态范围,成为监控中的有吸引力的传感模式。然而,其异步、稀疏输出需要在事件空间中识别目标的算法。本文引入了频率率信息事件空间(FRIES),一种神经形态处理框架,用于检测事件中的周期性,如旋转器旋转和机械振动,以区分和监控人造物体。FRIES首先应用时间门来抑制背景和噪声,然后将事件聚合为像素级活动(如密度)图,并将像素聚类为感兴趣区域(ROIs)。对每个ROI应用局部频谱分析,以提取主导频率,用于区分结构化物体特征与无结构背景和噪声。被区分的目标通过共振时间表面(RTS)可视化,这是一种频率选择性方法,通过事件与其提取频率的相位相干性加权,奖励同步内容并抑制异步杂波。我们在受控室内实验中演示了FRIES和RTS,以恢复机械切碎机和无人机旋转器的旋转频率,对抗移动背景。我们进一步在户外数据上测试这些方法,以检测悬停无人机,对抗现实的树线。这些初步结果确立了频域事件处理作为神经形态管道中选择性监控的有前景的前端,以及利用高时间分辨率实现频谱区分的互补监控模式。

英文摘要

Event-based cameras (EBCs) are an attractive sensing modality for surveillance due to their reporting of pixel-level radiance changes with microsecond resolution and high dynamic range, enabling motion extraction while suppressing background. Their asynchronous, sparse output, however, necessitates algorithms that identify targets in event-space without processing full frames. We introduce Frequency Rate Information for Event Space (FRIES), a neuromorphic processing framework that detects periodicity in events, such as rotor rotation and mechanical vibrations, to discriminate and monitor man-made objects. FRIES first applies a time gate to suppress background and noise, then aggregates events into a pixel-wise activity (e.g., density) map and clusters pixels into regions-of-interest (ROIs). A localized spectral analysis is applied to each ROI to extract dominant frequencies used to distinguish structured object signatures from unstructured background and noise. Discriminated targets are visualized using a Resonant Time Surface (RTS), a frequency-selective method that weights events by their phase coherence with the extracted frequencies, rewarding in-sync content and suppressing out-of-sync clutter. We demonstrate FRIES and RTS in a controlled indoor experiment to recover the rotational frequency of a mechanical chopper and drone rotors against a moving background. We further test these methods on outdoor data to detect a hovering drone against a realistic treeline. These preliminary results establish frequency-domain event processing as a promising front-end for selective surveillance in neuromorphic pipelines and a complementary surveillance modality, leveraging the high temporal resolution to enable spectral discrimination.
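The two core steps described here, extracting a dominant frequency from an ROI's events and weighting events by phase coherence, can be illustrated with a small sketch that works directly on event timestamps. It is an assumption-laden simplification: the coherence measure is the magnitude of the mean unit phasor, the frequency search is a scan over a hypothetical candidate list, and the weighting function is one plausible choice rather than the paper's RTS definition.

```python
import cmath
import math


def phase_coherence(timestamps, freq_hz):
    """Magnitude of the mean phasor exp(-2*pi*i*f*t) over all events.

    Close to 1 when events are phase-locked to `freq_hz`, close to 0 otherwise.
    """
    if not timestamps:
        return 0.0
    total = sum(cmath.exp(-2j * math.pi * freq_hz * t) for t in timestamps)
    return abs(total) / len(timestamps)


def dominant_frequency(timestamps, candidates_hz):
    """Pick the candidate frequency with the highest phase coherence.

    Harmonics of the true frequency also score highly, so the candidate
    list should be restricted to the band of interest.
    """
    return max(candidates_hz, key=lambda f: phase_coherence(timestamps, f))


def event_weights(timestamps, freq_hz):
    """Weight each event in [0, 1] by its alignment with the mean phase."""
    phasors = [cmath.exp(-2j * math.pi * freq_hz * t) for t in timestamps]
    mean = sum(phasors)
    if abs(mean) == 0.0:
        return [0.0] * len(phasors)
    reference = mean / abs(mean)
    # cos(phase difference) mapped from [-1, 1] to [0, 1].
    return [0.5 * (1.0 + (p * reference.conjugate()).real) for p in phasors]
```

For example, events generated every 20 ms are fully coherent at 50 Hz, and an extra event arriving half a period late receives a weight near zero, which is the "reward in-sync, suppress out-of-sync" behaviour the abstract describes.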

2605.15391 2026-05-18 cs.CV cs.AI 版本更新

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

PanoWorld:几何一致的全景视频世界建模

Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee, Tooba Imtiaz, Edmund Yeh, Jennifer Dy, Yanzhi Wang, Sarah Ostadabbas

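Affiliations: Northeastern University

AI summary: PanoWorld generates consistent 360-degree video by modeling geometric and dynamic consistency, improving spatial understanding for embodied AI applications.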

详情

英文摘要

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360$\degree$ video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

2605.15383 2026-05-18 cs.CV 版本更新

MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays

MorphoHELM:用于评估基于显微镜的形态学检测方法的综合基准

Emre Hayir, Lorin Crawford, Alex X. Lu

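Affiliations: Microsoft Research New England

AI summary: MorphoHELM is a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting. Each task is evaluated under different degrees of batch effects, which exposes trade-offs between methods and shows that classic computer vision approaches remain strong across many settings.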

详情
AI中文摘要

显微镜图像包含关于细胞对扰动响应的丰富信息,对药物筛选等应用至关重要。研究人员常使用表示提取方法来量化图像,近年来深度学习方法层出不穷。然而,评估这些表示的质量仍存在碎片化问题,各模型在不同任务和数据集上使用定制的流程和指标,难以公平比较。本文介绍MorphoHELM,一个全面的开放基准,用于评估细胞染色法中的特征提取方法。MorphoHELM整合了领域内的评估标准,扩展并修正使其更稳健,并在迄今为止最广泛的方法上进行评估。该基准的一个显著特点是每个任务在不同批次效应(或技术噪声)程度下进行评估,直接量化方法检测生物信号能力随噪声增加而下降的程度。这些特性使MorphoHELM能够检测方法间的权衡关系,我们证明某些类型的生物信号检测能力强的模型在其他方面表现较弱。我们展示现有模型在所有设置中均无法超越经典计算机视觉分析策略,这些策略仍是最强的通用场景表示。所有数据集、代码和评估工具均在https://github.com/microsoft/MorphoHELM公开。

英文摘要

Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.

2605.15375 2026-05-18 cs.CV cs.AI 版本更新

ChangeFlow -- Latent Rectified Flow for Change Detection in Remote Sensing

ChangeFlow -- 潜在修正流用于遥感中的变化检测

Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc

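Affiliations: University of Ljubljana, Faculty of Computer and Information Science

AI summary: This paper proposes ChangeFlow, a framework that synthesizes change masks via rectified flow in latent space, modeling a distribution of plausible masks to improve global consistency and robustness. It achieves an average F1 of 80.4% across four benchmarks.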

详情
AI中文摘要

遥感变化检测(RSCD)旨在定位同一地理区域两幅图像之间的变化。在实践中,变化掩码通常遵循区域级注释惯例而非纯粹的局部外观差异,使其具有上下文依赖性和偶尔的模糊性。大多数最先进的方法使用逐像素判别分类,产生单个预测,无法显式建模变化区域作为整体。生成式方法是自然替代方案,可建模可能掩码的分布,使采样能捕捉模糊性并鼓励全局一致性。然而,现有生成式RSCD方法通常落后于强大判别基线,由于像素空间生成的高计算成本和其条件机制的复杂性。为了解决判别和生成方法的局限性,我们提出ChangeFlow,一种生成框架,通过潜在空间中的修正流重新表述变化检测为变化掩码的合成。ChangeFlow由结构化但轻量级的条件信号引导,其随机设计自然支持基于采样的预测融合。即,聚合多个预测的变化掩码提高鲁棒性,而样本一致性提供实用的置信度估计,突出模糊区域。在四个基准上,ChangeFlow实现80.4%的平均F1分数,比先前最佳方法平均提高1.3个百分点,同时保持与最近强大基线相当的推理速度。项目页面:https://blaz-r.github.io/changeflow_cd

英文摘要

Remote sensing change detection (RSCD) aims to localise changes between two images of the same geographic region. In practice, change masks often follow region-level annotation conventions rather than purely local appearance differences, making them context-dependent and occasionally ambiguous. Most state-of-the-art methods utilise per-pixel discriminative classification, which produces a single prediction per input and fails to explicitly model the changed region as a coherent whole. A natural alternative is a generative formulation, which can model a distribution of plausible masks, enabling sampling to capture ambiguity and encourage global consistency. However, existing generative RSCD approaches typically lag behind strong discriminative baselines due to the high computational cost of pixel-space generation and the complexity of their conditioning mechanisms. To address the limitations of prior discriminative and generative methods, we propose ChangeFlow, a generative framework that reformulates change detection as the synthesis of a change mask in latent space via rectified flow. ChangeFlow is guided by a structured yet lightweight conditioning signal, and its stochastic design naturally supports sampling-based prediction ensembling. Namely, aggregating multiple predicted change masks improves robustness, while sample agreement provides a practical confidence estimation that highlights ambiguous regions. Across four benchmarks, ChangeFlow achieves an average F1 of 80.4%, improving by 1.3 points on average over the previous best method, while maintaining inference speed comparable to recent strong baselines. Project page: https://blaz-r.github.io/changeflow_cd
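The sampling-based ensembling described above can be sketched without any model code: given several sampled binary change masks for the same image pair, average them per pixel, threshold the average for the fused mask, and use how far the average sits from 0.5 as an agreement-based confidence. The voting threshold and the confidence formula are illustrative assumptions; drawing the samples from the generative model is outside this sketch.

```python
def ensemble_masks(samples, threshold=0.5):
    """Fuse sampled binary masks and estimate per-pixel confidence.

    `samples` is a non-empty list of equally shaped 2D lists of 0/1 values.
    Returns (fused_mask, confidence), where confidence is |2p - 1| for the
    per-pixel change frequency p: 1 means all samples agree, 0 means an
    even split.
    """
    n = len(samples)
    rows, cols = len(samples[0]), len(samples[0][0])
    fused, confidence = [], []
    for r in range(rows):
        fused_row, conf_row = [], []
        for c in range(cols):
            p = sum(sample[r][c] for sample in samples) / n
            fused_row.append(1 if p >= threshold else 0)
            conf_row.append(abs(2.0 * p - 1.0))
        fused.append(fused_row)
        confidence.append(conf_row)
    return fused, confidence
```

Pixels with low confidence are the ambiguous regions the abstract refers to; they could be flagged for review rather than trusted outright.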

2605.15368 2026-05-18 cs.CV cs.GR cs.LG 版本更新

Discretizing Group-Convolutional Neural Networks for 3D Geometry in Feature Space

对特征空间中的群卷积神经网络进行离散化以处理3D几何

Daniel Franzen, Jean Philip Filling, Michael Wand

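Affiliations: Johannes Gutenberg University Mainz

AI summary: This paper proposes sampling in feature space: representative samples are selected by feature similarity, which decouples geometric resolution from memory and processing costs and offers a way to trade computational effort against accuracy. Experiments show that coarse feature-space sampling preserves classification accuracy well and accelerates the training of equivariant 3D classifiers.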

Comments 11 pages, 7 figures, 2 tables

详情
AI中文摘要

群卷积神经网络(GCNNs)是深度学习中引入对称性作为归纳偏置的重要方法:在每个线性层中,GCNNs密集采样变换群G,并在不同姿态下相关数据和滤波器(适用于可旋转GCNNs的适当反混叠)以保持对G的等变性。不幸的是,对这种采样产生的许多数据项应用滤波器成本很高(即使仅限于平移,即普通CNNs),随着自由度(如3D中的平移和旋转)的增加,成本呈指数增长,这往往阻碍了实际应用。在本文中,我们提出在特征空间中进行采样,即用特征相似性选择的代表性样本替代几何密集采样。这在训练和推理过程中解耦了几何分辨率与内存和处理成本,提供了一种新的方法来权衡计算努力和准确性。我们的主要经验发现是,粗粒度的特征空间采样在保持分类精度方面表现得非常出色,这允许基于几何相似性进行预计算,从而显著加速等变3D分类器的训练。

英文摘要

Group-convolutional neural networks (GCNNs) are among the most important methods for introducing symmetry as an inductive bias in deep learning: In each linear layer, GCNNs sample a transformation group $G$ densely and correlate data and filters in different poses (with suitable anti-aliasing for steerable GCNNs) to maintain equivariance with respect to $G$. Unfortunately, applying filters to many data items resulting from this sampling is expensive (even for translations alone, i.e., in ordinary CNNs), and costs grow exponentially with increasing degrees of freedom (such as translations and rotations in 3D), which often hinders practical applications. In this paper, we propose sampling in feature space, i.e., replacing geometrically dense samples with representative samples selected by feature similarity. This decouples geometric resolution from memory and processing costs during training and inference, providing a novel way to trade off computational effort and accuracy. Our main empirical finding is that a coarse feature-space sampling already preserves classification accuracy remarkably well, which permits precomputation based on geometric similarity, accelerating the training of equivariant 3D classifiers substantially.

2605.15342 2026-05-18 cs.CV cs.LG 版本更新

Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

Minerva-Ego:眼动视频理解的空间时间提示

Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan, Bo Hu, Ramin Mehran, David A Ross, Cordelia Schmid

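Affiliations: Google; DeepMind

AI summary: This paper proposes the Minerva-Ego benchmark, which evaluates egocentric video understanding models with multi-step multimodal questions and densely annotated spatiotemporal reasoning traces. It finds that hinting 'when' and 'where' to look substantially improves performance.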

详情
AI中文摘要

视频推理模型是眼动和具身智能体的核心组成部分。然而,标准评估模型的基准仅提供输出评估(例如回答问题),而不评估中间推理步骤,且大多数仅提供文本领域的答案。我们引入了Minerva-Ego,一个用于评估复杂眼动视觉推理的基准。我们扩展了最近高质量的视频数据源,这些数据源来自眼动/具身设置,并添加了一组具有挑战性的多步骤多模态问题和密集标注的时空推理轨迹。基准测试实验表明,最先进的模型与人类表现之间仍有较大差距。为了深入研究这一差距,我们对数据集中的每个推理轨迹标注了所需解决问题的对象,作为时空掩码标注。通过广泛的评估,我们发现提示前沿模型以'哪里'和'何时'的提示来查看,能显著提高性能。Minerva-Ego可在https://github.com/google-deepmind/neptune下载。

英文摘要

Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of 'where' and 'when' to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.

2605.15326 2026-05-18 cs.CV 版本更新

Multimodal Object Detection Under Sparse Forest-Canopy Occlusion

多模态目标检测在稀疏森林冠层遮挡下的应用

Nitik Jain, Mangal Kothari

Affiliations: Robotics & AI, Johns Hopkins University, USA; Department of Aerospace Engineering, IIT Kanpur; Senior Principal Flight Control Engineer, ADASI, EDGE Group, Abu Dhabi, UAE

AI summary: This paper proposes a multimodal pipeline that combines LiDAR, visible-thermal image fusion, and synthetic-aperture imaging to make human detection beneath forest canopy more reliable, and reports the performance of a fine-tuned YOLOv5 detector on thermal and fused imagery.

详情
AI中文摘要

可靠检测森林冠层下的人类仍是一个远程传感难题,由于遮挡稀疏、结构化且视点依赖。本文提出一个多模态的证明概念管道,整合三种互补方法:(i) 通过植被评估激光雷达回波的实验评估以评估主动传感的可行性;(ii) 使用多尺度变换和稀疏表示框架进行可见-热图像融合以增强人类显著性;(iii) 通过空中光学切片(AOS)合成孔径成像以抑制冠层杂波。在Teledyne FLIR热数据集上微调YOLOv5检测器,并在热图像和融合图像上进行评估。结果表明,测试的地面激光雷达配置对目标级检测的穿透有限,而可见-热融合在低对比度场景中提高了目标可见性,AOS在合成森林图像中增强了地面平面检测。微调的YOLOv5在FLIR前三个类别上实现了平均平均精度约为0.83。这些发现为在森林环境中部署的无人机搜索和救援及监视系统建立了初始基准,并推动了未来专门针对森林数据集和实时多模态整合的工作。

英文摘要

Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible-thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible-thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.

2605.15325 2026-05-18 cs.CV 版本更新

COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

COPRA:基于强化学习的条件参数适应用于视频异常检测

Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He

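Affiliations: Auburn University; Tencent Hunyuan

AI summary: COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM, improving the adaptability and generalization of video anomaly detection, and extends to tasks such as multiple-choice video question answering and dense captioning.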

Comments Manuscript currently under review for publication

详情

英文摘要

Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at https://github.com/THE-MALT-LAB/COPRA

2605.15320 2026-05-18 cs.GR cs.CV cs.LG 版本更新

FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction

FFAvatar: 少样本、前馈和可泛化的头像重建

Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian, Jian Wang

发表机构 * Snap Inc. University of California, Santa Cruz(加州大学圣克鲁兹分校) MBZUAI

AI总结 FFAvatar通过Multi-View Query-Former融合多张源图像的信息,实现高保真3D高斯头部化身重建,支持实时部署与高质量动画。

Comments Project Page: https://ffavatar.github.io

详情
AI中文摘要

FFAvatar通过Multi-View Query-Former融合多张源图像的信息,实现高保真3D高斯头部化身重建,支持实时部署与高质量动画。

英文摘要

Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.

2605.15312 2026-05-18 cs.CY cs.CV 版本更新

Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA

超越性能差异:对CelebA中表征性伤害的三级审计

Sieun Park, Yuanmo He

AI总结 本文通过三级审计揭示CelebA数据集中性别化的衰老与美貌双重标准如何在数据和模型中再现,指出表征性伤害表现为女性被过度审视、老年男性被排除在外。

Comments 15 pages, 8 figures

详情
AI中文摘要

CelebA等大规模面部数据集在计算机视觉中广泛应用,但其标签中蕴含的文化偏见仍未得到充分研究。公平性研究区分了表征性伤害与分配性伤害,但对计算机视觉数据集的审计多关注分类标签,未探讨此类伤害如何在学习到的特征和模型注意力中体现。本文从数据集结构、学习到的特征权重和空间注意力三个层面分析CelebA,聚焦性别化的衰老与美貌双重标准如何在数据中编码并在模型行为中再现。首先,对202,599张图像的层次聚类显示,39个属性组织成与文化原型一致的潜在特质束:表演性女性气质(年轻、化妆、装饰)和职业化男性气质(衰老、面部毛发、正式着装)。女性面孔虽然整体上更常被标注为有吸引力,但一旦被归入衰老或男性化的簇,就会受到严重惩罚。其次,XGBoost结合SHAP的分析揭示了性别特定效应,例如肥胖仅降低女性的吸引力评分。第三,Grad-CAM发现女性和年轻男性子群的预测集中在中面部线索,而老年男性的预测则偏向头发和服装等外围线索。老年男性获得最高准确率但最低平均精度,表明处于数据集评价模板之外的群体被整体排除。文化双重标准由此从媒体再现进入数据标签、特征权重和模型注意力,产生两种表征性伤害:在狭窄评价模板下对女性的过度审视,以及将老年男性完全排除在评价体系之外。聚焦性能差异的公平性指标掩盖了这两种伤害,凸显了公平性研究需要正视表征性伤害。

英文摘要

Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset's evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.

2605.15309 2026-05-18 cs.CV 版本更新

One Pass Is Not Enough: Recursive Latent Refinement for Generative Models

一次不够:生成模型的递归潜在细化

Mehdi Esmaeilzadeh, Alexia Jolicoeur-Martineau, Chirag Vashist, Ke Li

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

AI总结 本文提出RTM方法,通过递归细化提升生成模型的多样性和覆盖范围,改进了FID指标下的精度和召回率,适用于多个数据集和基准测试。

详情
AI中文摘要

尽管在图像生成领域取得了显著进展,但问题仍未解决。主导指标FID将样本保真度与模式覆盖混淆,并接近饱和。然而,模型仍可能在低FID下出现模式崩溃,因为少量锐利的近似图像可能优于覆盖完整数据分布的模型。我们主张精度和召回率是FID的必要补充,由于FID已饱和,更重要的目标是提高多样性和覆盖范围。高召回率需要优先考虑模式覆盖的模型,而非大多数生成模型优化样本保真度。我们引入RTM,将基于风格的生成器中的单次潜在映射替换为迭代细化过程,证明这能一致提高质量和多样性。与隐式最大似然估计(IMLE)结合,IMLE通过设计优化模式覆盖,RTM在当前最先进的方法中实现了最高精度和召回率,同时保持竞争性的FID,改进了CIFAR-10、CelebA-HQ(256x256)和九个少样本基准。RTM还改进了StyleGAN2和StyleGAN2-ADA在CIFAR-10和AFHQ-v1(512x512)上的表现,证明其益处不限于IMLE。不同于通过牺牲覆盖范围获得竞争性FID的流匹配基线,递归细化同时提高了质量和多样性。

英文摘要

Despite remarkable progress, image generation is far from solved. The dominant metric, FID, conflates sample fidelity with mode coverage and is close to being saturated. Yet a model can still exhibit mode collapse while achieving a low FID, since a handful of sharp, near-duplicate images can outscore a model that faithfully covers the full data distribution. We argue that precision and recall are essential complements to FID, and that because FID is already saturated, the more meaningful goal is to improve diversity and coverage. Achieving high recall requires a model that explicitly prioritizes mode coverage, unlike most generative models, which optimize sample fidelity. We introduce RTM, which replaces the single-pass latent mapping in style-based generators with an iterative refinement process, and show that this consistently improves both quality and diversity. Integrated with Implicit Maximum Likelihood Estimation (IMLE), which optimizes mode coverage by design, RTM achieves the highest precision and recall among current state-of-the-art approaches while maintaining competitive FID, with improvements across CIFAR-10, CelebA-HQ at 256x256, and nine few-shot benchmarks. RTM also improves StyleGAN2 and StyleGAN2-ADA on CIFAR-10 and AFHQ-v1 at 512x512, demonstrating that the benefit is not specific to IMLE. Unlike flow-matching baselines that achieve competitive FID at the expense of coverage, recursive refinement improves both quality and diversity simultaneously.
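
The distinction the abstract draws between fidelity (precision) and coverage (recall) can be illustrated with a toy example. The sketch below is not the paper's evaluation code: it assumes a k-nearest-neighbour ball estimate of precision and recall on 1-D stand-in "features", and the helper names `knn_radius` and `coverage` are hypothetical. It only shows how a generator that collapses onto one mode can keep high precision while losing recall.

```python
def knn_radius(points, k):
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k=1):
    """Fraction of queries lying inside at least one k-NN ball around a support point."""
    radii = knn_radius(support, k)
    hits = sum(
        any(abs(q - s) <= r for s, r in zip(support, radii))
        for q in queries
    )
    return hits / len(queries)


# Toy 1-D features (assumed data): the real set has two modes,
# the generated set only covers the first one.
real = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
generated = [0.05, 0.08, 0.10, 0.12, 0.15, 0.18]

precision = coverage(generated, real)  # do generated samples look like real ones?
recall = coverage(real, generated)     # are the real modes covered by the generator?
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting every generated sample lands near real data, so precision is 1, while the second real mode is never reached, so recall falls below one half.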

2605.15307 2026-05-18 cs.GR cs.CV cs.MM cs.SD 版本更新

Sound Sparks Motion: Audio and Text Tuning for Video Editing

声音激发动作:用于视频编辑的音频和文本微调

AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou

发表机构 * University of Cyprus(塞浦路斯大学) Simon Fraser University(西蒙弗雷泽大学) Tel Aviv University(特拉维夫大学) CYENS Center Of Excellence(CYENS卓越中心)

AI总结 本文提出Sound Sparks Motion框架,通过测试时调整音频视觉生成模型的多模态条件信号,实现视频动作编辑,无需训练,通过音频潜在和文本条件残差扰动促进动作修改,同时利用视觉语言模型反馈提升编辑效果。

Comments Project Page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion

详情
AI中文摘要

以动作为中心的视频编辑仍然对大生成视频模型来说具有挑战性,这些模型通常对外观变化反应良好,但难以在现有片段中生成特定的局部动作或状态转换。我们介绍了Sound Sparks Motion,一种无需训练的框架,通过在测试时调整音频视觉视频生成模型的内部多模态条件信号,实现动作编辑。与修改模型权重不同,我们的方法仅调整两个轻量级变量:从源视频导出的音频潜在和文本条件的残差扰动。我们发现这种组合可以鼓励动作编辑,这些动作在仅通过提示控制时,底层模型往往难以实现。由于没有直接方法评估文本和动作之间的时间对齐,我们利用视觉语言模型提供反馈,指示生成视频中是否出现了预期的动作。这种简单的监督产生了一个有效的语义目标用于动作编辑,而正则化和感知-时间约束有助于保持内容和视觉质量。除了单视频调整外,我们还表明学习到的潜在控制可以跨视频转移,表明它们捕捉了可重用的动作编辑方向,而不是过拟合到单个示例。我们的结果强调了多模态条件调整,特别是通过音频路径,作为动作感知视频编辑的有前途的方向,并表明测试时调整可以作为轻量级的探测机制,帮助揭示模型多模态条件中嵌入的动作控制。代码和数据可通过我们的项目页面获取:https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

英文摘要

Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model's multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/

2605.15300 2026-05-18 cs.CV 版本更新

Deep Pre-Alignment for VLMs

视觉语言模型的深度预对齐

Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang, Jun Song, Bo Zheng, Yuan Yao

发表机构 * Tsinghua University Shanghai Qi Zhi Institute Taobao & Tmall Group of Alibaba

AI总结 本文提出深度预对齐(DPA),通过替换传统ViT编码器为小型VLM作为感知器,实现视觉特征与目标大语言模型文本空间的深度对齐,提升了多模态基准性能,并降低了语言能力遗忘。

Comments Accepted by ICML 2026. Project Website: https://github.com/THUMAI-Lab/Deep-Pre-Alignment

详情
AI中文摘要

大多数视觉语言模型(VLMs)通过轻量级投影器将ViT编码器的输出直接映射到LLM。尽管有效,最近的分析表明这种架构存在对齐挑战:视觉特征在LLM的初始层仍远离文本空间,迫使模型在表面模态对齐上浪费关键深度~\cite{zhang-etal-2024-investigating,artzy-schwartz-2024-attend},而非深入理解和复杂推理。在本工作中,我们提出深度预对齐(DPA),一种新颖的架构,用小型VLM作为感知器替换标准ViT编码器,确保视觉特征与目标大语言模型的文本空间深度对齐。全面实验展示了DPA的有效性。在4B参数规模上,DPA在8个多模态基准上比基线高出1.9分,随着规模扩大到32B,增益扩大至3.0分。此外,通过将对齐任务委托给感知器,DPA在3个文本基准上实现了32.9%的语言能力遗忘减少。我们进一步证明这些增益在不同LLM家族中保持一致,包括Qwen3和LLaMA 3.2,突显了我们方法的通用性。除了性能,DPA还为当前VLM开发提供了无缝升级路径,只需对视觉编码器进行模块化替换,计算开销微小。

英文摘要

Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth~\cite{zhang-etal-2024-investigating,artzy-schwartz-2024-attend} on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as a perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. At the 4B-parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.

2605.15298 2026-05-18 cs.RO cs.AI cs.CL cs.CV 版本更新

PhysBrain 1.0 Technical Report

PhysBrain 1.0 技术报告

Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan, Xiaolin Hu, Zhaolong Shen, Yuzhuo Miao, Haishan Liu, Yuxuan Tian, Yukun Shi, Cong Huang, Kai Chen

发表机构 * PhysBrain Team(PhysBrain团队)

AI总结 PhysBrain 1.0 先将大规模人类第一人称视角视频转化为结构化的物理常识监督,再进行机器人适应,在多模态问答和具身控制基准测试中取得SOTA结果,尤其在SimplerEnv上展现出突出的域外性能。

Comments Project Page: https://phys-brain.github.io

详情
AI中文摘要

视觉-语言-动作模型快速发展,但仅靠机器人轨迹难以覆盖广泛的物理理解。PhysBrain 1.0研究了一条互补路线:在机器人适应之前,先将大规模人类第一人称视角视频转换为结构化的物理常识监督。我们的数据引擎提取场景元素、空间动态、动作执行和深度感知关系,并将其转化为问答监督,用于训练PhysBrain VLMs。所得物理先验再通过保留能力且对语言敏感的适应设计迁移至VLA策略。在多模态问答基准和具身控制基准(包括ERQA、PhysBench、SimplerEnv-WidowX、LIBERO和RoboCasa)上,PhysBrain 1.0取得SOTA结果,尤其在SimplerEnv上展现出突出的域外性能。这些结果表明,基于人类交互视频扩展物理常识,可以有效连接多模态理解与机器人动作。

英文摘要

Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.

2605.15256 2026-05-18 cs.CV 版本更新

ReactiveGWM: Steering NPC in Reactive Game World Models

Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang, Xingyi Yang, Yeying Jin

发表机构 * Tencent(腾讯) National University of Singapore(新加坡国立大学) The Hong Kong Polytechnic University(香港理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 当前游戏世界模型多从玩家视角出发,将非玩家角色(NPC)仅视为背景像素,难以捕捉玩家与NPC之间的互动。为此,本文提出ReactiveGWM,一种能够模拟玩家与NPC动态交互的反应型游戏世界模型。该模型通过解耦玩家控制与NPC行为,并引入轻量级偏差注入和跨注意力模块,实现了对NPC高层策略(如进攻、防守)的灵活响应,且无需针对具体游戏进行再训练,具备跨游戏的零样本策略迁移能力。

Comments The code is available at https://inv-wzq.github.io/ReactiveGWM/

详情
英文摘要

Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactions. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.

2605.15241 2026-05-18 eess.IV cs.CV cs.LG 版本更新

From Full and Partial Intraoral Scans to Crown Proposal: A Classification-Guided Restoration Assistance Pipeline

Rabin Kunwar, Dikshya Parajuli, Rujal Acharya, Romik Gosai, Prince Panta, Kundan Siwakoti, Shuvangi Adhikari, Saugat Kafley, Louis Digiorgio, Amit Regmi, Akio Tanaka, Masahiko Inada, Yuriko Komagamine, Kennta Kashiwazaki, Manabu Kanazawa

发表机构 * Accelerated Komputing Pvt. Ltd.(加速计算私人有限公司) University of Pittsburgh(匹兹堡大学) Institute of Science Tokyo(东京科学大学) Emium Co. Ltd.(Emium公司) GodelBlock Inc.(GodelBlock公司) Carnegie Mellon University(卡内基梅隆大学)

AI总结 该研究提出了一种端到端的牙冠提案生成流程,旨在从全牙弓或部分牙弓的口腔扫描数据中生成个性化的牙冠初始方案,以辅助临床医生进行后续调整。方法结合了分类引导的分割策略和基于上下文的检索与拟合技术,有效解决了部分扫描数据分割精度低和生成牙冠细节丢失的问题。实验表明,该方法在多个评估指标上表现优异,具备较高的分割精度和实际应用价值。

详情
英文摘要

Single-unit crown restoration is among the most common procedures in clinical dentistry, with CAD/CAM workflows now designing crowns directly from intraoral scans. Partial scans are often preferred over full-arch scans for single-unit cases due to fewer stitching errors, yet most segmentation networks trained on full arches fail on partial scans, while end-to-end generative crown methods often produce over-smoothed surfaces that lose occlusal detail. We propose an end-to-end pipeline that takes a raw intraoral scan and target FDI tooth number as input and outputs an initial, patient-specific crown proposal for clinician refinement. The pipeline has three phases: (I) data preparation and pose standardization; (II) segmentation routed by scan type; and (III) crown proposal generation via context-aware retrieval and Blender-based fitting. We address partial-scan segmentation through a classify-then-align strategy: a DGCNN classifier categorizes the scan into one of five anatomical types, then coarse-to-fine RANSAC+ICP registration standardizes the jaw coordinate frame, followed by graph-cut optimization to refine tooth-gingival boundaries. Trained on 1,958 partial scans, the pipeline achieves macro-average DSC 0.9249, Recall 0.8919, and Precision 0.9615 across 17 semantic classes; a fine-tuned full-arch model reaches DSC 0.9347. The prepared tooth and its mesial and distal neighbors achieve DSC 0.9468-0.9569 with sub-millimeter Centroid Errors (0.2666-0.2774 mm). These centroids anchor a retrieval module using DGCNN embeddings and cosine similarity over neighboring and opposing teeth, followed by spline-guided alignment and Blender Python API refinement. The pipeline produces a preliminary crown shell in 2.5-3.5 minutes, offering a practical alternative to end-to-end generative approaches.
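
For readers unfamiliar with the segmentation metrics quoted above (DSC, Recall, Precision, macro-averaged over classes), the following minimal sketch shows how they are commonly computed from per-point labels. It is an illustration only, not the authors' evaluation code; the function names and the toy label arrays are assumptions.

```python
def segmentation_scores(pred, target, label):
    """Dice (DSC), precision, and recall for one class from per-point labels."""
    tp = sum(p == label and t == label for p, t in zip(pred, target))
    fp = sum(p == label and t != label for p, t in zip(pred, target))
    fn = sum(p != label and t == label for p, t in zip(pred, target))
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return dice, precision, recall


def macro_average(pred, target, labels):
    """Unweighted mean of each metric over the given classes."""
    per_class = [segmentation_scores(pred, target, c) for c in labels]
    return tuple(sum(vals) / len(labels) for vals in zip(*per_class))


# Toy example (assumed data): 8 mesh points, 3 semantic classes.
target = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [0, 1, 1, 1, 1, 2, 2, 0]

for c in (0, 1, 2):
    print(c, segmentation_scores(pred, target, c))
print("macro:", macro_average(pred, target, labels=(0, 1, 2)))
```

Macro-averaging weights every class equally, so small structures count as much as large ones.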

2605.15093 2026-05-18 cs.CV 版本更新

CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

Jess Jones, Leonardo Bertini, Kenneth Johnson, Erica Hendy, Tilo Burghardt

发表机构 * University of Bristol(布里斯托大学) University of Liverpool(利物浦大学) Natural History Museum(自然历史博物馆)

AI总结 该研究提出了一种名为CoralLite的方法,用于从珊瑚骨骼的微CT扫描数据中重建单个珊瑚虫的骨骼结构。研究通过结合弱标注数据预训练与全标注切片微调的混合V-Trans-UNet网络,实现了对整个珊瑚群体骨骼的高精度分割与三维建模。该方法在相同珊瑚群体和不同生物样本上均表现出良好的分割性能,为基于微CT的珊瑚个体骨骼建模提供了首个深度学习基准与完整数据集。

Comments 15 pages, 10 figures, 2 tables

详情
英文摘要

The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive Porites sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated μCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled μCT virtual slabs of Porites sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 μCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.

2605.15010 2026-05-18 cs.CV 版本更新

3D Skew-Normal Splatting

Xiangru Wu, Ke Fan, Yanwei Fu

发表机构 * Fudan University(复旦大学)

AI总结 本文提出了一种名为Skew-Normal Splatting(SNS)的新方法,用于改进3D高斯溅射(3DGS)在实时新视角合成中的表示能力。通过引入Azzalini偏正态分布作为基本单元,SNS能够灵活建模对称和非对称结构,尤其在处理物体边界和单侧表面时表现出更强的表示能力。此外,SNS保持了数学上的可解析性,并通过解耦参数化和分块优化策略提升了训练稳定性,实验表明其在多个基准测试中优于传统高斯及其他非高斯核方法。

详情
英文摘要

3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and has been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions; however, they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines. Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
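
As background for the primitive named in the abstract, the univariate Azzalini skew-normal density is f(x) = 2 φ(x) Φ(αx), where φ and Φ are the standard normal density and CDF and α is the skewness. The sketch below is a 1-D illustration only, not the paper's 3-D formulation; the `tanh` squashing in `bounded_skewness` and the bound `alpha_max` are assumptions chosen to show one way a skewness parameter could be kept bounded.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def skew_normal_pdf(x, alpha, loc=0.0, scale=1.0):
    """Azzalini skew-normal density: (2 / scale) * phi(z) * Phi(alpha * z)."""
    z = (x - loc) / scale
    return 2.0 / scale * normal_pdf(z) * normal_cdf(alpha * z)


def bounded_skewness(raw, alpha_max=10.0):
    """Hypothetical squashing of an unconstrained value into [-alpha_max, alpha_max]."""
    return alpha_max * math.tanh(raw)


# alpha = 0 recovers the symmetric Gaussian; a large alpha approaches a half-Gaussian
# (almost no mass on the negative side, about twice the Gaussian density on the positive side).
for alpha in (0.0, 2.0, 50.0):
    print(alpha, skew_normal_pdf(-1.0, alpha), skew_normal_pdf(1.0, alpha))
```

The two limits printed here correspond to the interpolation the abstract describes, between symmetric Gaussians and half-Gaussian-like shapes.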

2605.14876 2026-05-18 cs.CV cs.AI 版本更新

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 尽管当前文本到图像生成模型在技术上取得了快速进展,但它们大多依赖单步生成范式,难以处理复杂的语义内容,且参数扩展带来的性能提升有限。为了解决多步推理方法中存在的幻觉、优化不稳定和推理延迟等问题,本文提出了一种闭环视觉推理框架CLVR,该框架将视觉语言逻辑规划与像素级扩散生成深度融合,并引入了基于代理提示的强化学习和Δ-空间权重合并等方法,有效提升了生成质量与推理效率,实验表明其在多个基准测试中优于现有开源模型,接近商业模型的性能。

详情
英文摘要

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

2605.14716 2026-05-18 cs.GR cs.CV cs.LG 版本更新

AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Control

Pengcheng Fang, Tengjiao Sun, Dongjie Fu, Xiaoyu Zhan, Yanwen Guo, Hansung Kim, Xiaohao Cai

发表机构 * University of Southampton(南安普顿大学) Mogo AI Ltd.(Mogo AI有限公司) Nanjing University(南京大学)

AI总结 AnchorRoute 是一种基于稀疏锚点的人体运动合成框架,通过用户指定的少量根位置、平面轨迹或身体点目标,生成完整的人体动作。该方法在生成阶段利用锚点生成条件特征,并注入到预训练的扩散模型中以保持生成质量,同时学习稀疏空间控制;在生成后阶段,通过锚点残差定义修正区间,结合软 token 更新进行精细化调整,从而在统一的锚点框架下实现生成与优化的结合。实验表明,AnchorRoute 在多种控制方式下均优于现有方法,生成动作更贴合锚点约束。

详情
英文摘要

Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.

2605.14309 2026-05-18 cs.CV cs.AI cs.LG 版本更新

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

发表机构 * Fujian Normal University(福建师范大学) Nanyang Technological University(南洋理工大学) University of New South Wales(新南威尔士大学) Data61 CSIRO(Data61澳大利亚联邦科学与工业研究组织)

AI总结 本文提出了一种基于可解释概念分解的视觉-语言模型(VLM)概念级机器遗忘方法ICED,旨在解决传统图像或实例级遗忘难以精确移除目标知识而不影响无关语义的问题。该方法通过多模态大语言模型构建任务相关的概念词汇表,并将视觉表征分解为稀疏、非负的语义概念组合,从而实现对图像中目标概念的精确抑制,同时保留非目标语义和跨模态知识。实验表明,该方法在保持模型性能的同时,能够更全面地遗忘目标知识并更好保留图像中的非目标信息。

详情
英文摘要

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

2605.13073 2026-05-18 cs.CV 版本更新

HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

Yulei Kang, Tianze Zhu, Jian-Fang Hu, Jianhuang Lai, Wei-Shi Zheng

发表机构 * Sun Yat-sen University(中山大学) Northeastern University(东北大学) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China(教育部机器智能与先进计算重点实验室)

AI总结 本文针对真实场景中3D高斯泼溅(3DGS)重建面临的动态干扰和光照引起的视图间外观不一致问题,提出了一种基于冲突感知的优化框架。该方法通过语义一致性引导的掩膜生成和双视角梯度调和策略,有效抑制了不可靠的监督信息并缓解视图间梯度冲突,从而提升了重建质量与稳定性。实验表明,该方法在复杂真实场景下取得了当前最优的渲染效果。

详情
英文摘要

In-the-wild 3D Gaussian Splatting remains challenging due to transient distractors and illumination-induced cross-view appearance inconsistencies. Existing methods mainly rely on image-level masking to suppress unreliable supervision, but masking alone cannot fully eliminate residual occlusions or resolve illumination-induced inconsistencies, both of which can introduce conflicting cross-view gradients. These unresolved conflicts may destabilize Gaussian optimization and lead to visible reconstruction artifacts. We propose a conflict-aware 3DGS framework that addresses this problem from both image-space supervision and gradient-level optimization. Semantic Consistency-Guided Masking learns pixel-wise consistency scores to adaptively refine prior masks and suppress unreliable supervision before gradient formation. A dual-view Conflict-Aware Gradient Harmonization strategy further reconciles view-specific gradients by mutually rotating them into an orthogonal configuration, reducing negative directional interference across views. We also introduce conflict-aware densification and pruning to stabilize Gaussian growth and remove persistently conflicting primitives. Extensive experiments on standard in-the-wild benchmarks demonstrate that our method achieves state-of-the-art rendering quality under complex transient distractors and cross-view inconsistencies.
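
The abstract describes reconciling view-specific gradients by mutually rotating them into an orthogonal configuration. One possible reading of that idea is sketched below; it is an assumption, not the authors' algorithm. When two gradients have a negative dot product, both are rotated by equal angles toward their bisector, inside the plane they span, until they are orthogonal, and each keeps its norm. The function name `harmonize` and the handling of degenerate cases are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def harmonize(g1, g2, eps=1e-12):
    """Rotate two conflicting gradients symmetrically until they are orthogonal.

    Non-conflicting pairs (dot >= 0) and degenerate pairs are returned unchanged.
    """
    n1, n2 = norm(g1), norm(g2)
    if n1 < eps or n2 < eps or dot(g1, g2) >= 0.0:
        return list(g1), list(g2)
    u1 = [x / n1 for x in g1]
    u2 = [x / n2 for x in g2]
    bis = [x + y for x, y in zip(u1, u2)]   # bisector direction of the pair
    dif = [x - y for x, y in zip(u1, u2)]   # in-plane direction orthogonal to the bisector
    nb, nd = norm(bis), norm(dif)
    if nb < eps:
        # Exactly opposite gradients: the shared plane is not unique, so leave them as-is.
        return list(g1), list(g2)
    s = 1.0 / math.sqrt(2.0)
    r1 = [n1 * s * (b / nb + d / nd) for b, d in zip(bis, dif)]
    r2 = [n2 * s * (b / nb - d / nd) for b, d in zip(bis, dif)]
    return r1, r2


g_view_a = [1.0, 0.0, 0.0]
g_view_b = [-1.0, 1.0, 0.0]   # conflicts with g_view_a (negative dot product)
h_a, h_b = harmonize(g_view_a, g_view_b)
print(h_a, h_b, dot(h_a, h_b))
```

Because the bisector and the difference of two unit vectors are always orthogonal, the two outputs are orthogonal by construction, which removes the negative directional interference while preserving each gradient's magnitude.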

2605.09869 2026-05-18 cs.RO cs.CV 版本更新

ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

Haosen Wang, Zhenyang Li, Yinqiang Zhang, Zongqi He, Lutao Jiang, Kai Li, Yizhou Zhao, Liaoyuan Fan, Wenjian Hou, Tingbang Liang, Yibin Wen, Defeng Gu

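Affiliations * Sun Yat-sen University; The University of Hong Kong; Hong Kong University of Science and Technology (Guangzhou); City University of Hong Kong; Carnegie Mellon University

AI summary This paper studies the action consistency problem in zero-shot object navigation: because semantic evidence is reinterpreted at every step, the agent often fails to keep pursuing a target it has already found. The authors propose ConsistNav, a training-free zero-shot object navigation framework whose three modules (a finite-state executive controller, persistent candidate memory, and stability-aware action control) improve sustained target pursuit and action consistency. Experiments on HM3D and MP3D show better results than the compared zero-shot methods, with gains in both success rate (SR) and success weighted by path length (SPL).

Comments 13 pages, 5 figures

Details
English abstract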

Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.
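
The notion of staging pursuit "through guarded semantic phases" can be illustrated with a toy finite-state controller. Everything below is hypothetical: the phase names, the two confidence thresholds, and the stopping distance are assumptions chosen to show how hysteresis keeps an agent committed to a target, and they are not taken from ConsistNav.

```python
from enum import Enum, auto


class Phase(Enum):
    EXPLORE = auto()
    VERIFY = auto()
    PURSUE = auto()
    STOP = auto()


def step(phase, confidence, distance, hi=0.6, lo=0.3, stop_dist=1.0):
    """Return the next phase given target confidence and distance (metres).

    Using two thresholds (hi to commit, lo to abandon) means a brief dip in
    detector confidence does not throw the agent back into exploration.
    """
    if phase is Phase.EXPLORE:
        return Phase.VERIFY if confidence >= hi else Phase.EXPLORE
    if phase is Phase.VERIFY:
        if confidence >= hi:
            return Phase.PURSUE
        if confidence < lo:
            return Phase.EXPLORE
        return Phase.VERIFY
    if phase is Phase.PURSUE:
        if confidence < lo:
            return Phase.EXPLORE
        if distance <= stop_dist:
            return Phase.STOP
        return Phase.PURSUE
    return Phase.STOP  # STOP is terminal in this sketch
```

A per-step policy with a single threshold would flip between exploring and pursuing whenever confidence crosses it; the guarded transitions above are one simple way to avoid that oscillation.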

2605.08245 2026-05-18 cs.CV cs.AI Version update

When Language Overwrites Vision: Over-Alignment and Geometric Debiasing in Vision-Language Models

Harshvardhan Saini, Samyak Jha, Yiming Tang, Dianbo Liu

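Affiliations * Indian Institute of Technology Dhanbad; National University of Singapore

AI summary This paper studies hallucination in vision-language models (VLMs) caused by over-alignment between the visual and language modalities. It traces the root cause to the decoder architecture, which pushes visual embeddings onto the text manifold and thereby injects a statistical linguistic bias that overshadows fine-grained visual information. The authors give the first quantitative analysis of this phenomenon and propose two complementary remedies, a training-free inference strategy and a bias-aware fine-tuning method, both of which remove the linguistic bias from visual representations. Experiments show that the methods significantly reduce hallucinations on several benchmarks and improve long-form generation quality.

Details
English abstract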

Vision-Language Models (VLMs) increasingly power high-stakes applications, from medical imaging to autonomous systems, yet they routinely hallucinate, confidently describing content not present in the input. We investigate the root causes of these failure modes with a mechanistic analysis focusing on the decoder-based VLMs. We trace these failure modes to a geometric over-alignment: to bridge the modality gap required by attention mechanisms, decoder-based VLMs over-align visual embeddings with the text manifold, injecting a statistical linguistic bias that systematically overshadows fine-grained visual evidence. While prior work either aggressively closes this gap or suppresses hallucinations through expensive black-box decoding strategies, none addresses the underlying geometric cause. We provide the first quantitative characterization of this over-alignment, demonstrating that linguistic bias concentrates in the top principal components of a universal, dataset-agnostic text subspace. Building on this insight, we propose two complementary remedies: a training-free inference strategy and a bias-aware fine-tuning paradigm, both of which explicitly project out this subspace from visual representations. Our methods significantly reduce hallucinations across POPE, CHAIR, and AMBER benchmarks, and improve CLAIR scores on long-form captioning tasks, with the training-free variant adding no computational overhead over the base model.
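
The core operation named in the abstract, projecting a subspace out of a visual representation, is standard linear algebra and can be sketched in a few lines. The sketch assumes the text subspace is already given as a list of orthonormal basis vectors (for example, top principal components computed elsewhere); the function name `project_out` and that input format are assumptions, not the paper's interface.

```python
def project_out(v, basis):
    """Remove from v its component inside span(basis).

    basis is a list of orthonormal vectors (lists of floats). The result is
    v minus the sum of (v . u) u over all basis vectors u.
    """
    out = list(v)
    for u in basis:
        coeff = sum(a * b for a, b in zip(v, u))  # coordinate of v along u
        out = [o - coeff * b for o, b in zip(out, u)]
    return out
```

For instance, with the first coordinate axis as the only basis vector, `project_out([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0]])` zeroes the first coordinate and leaves the rest untouched. Because the operation is a fixed linear map, it could be applied at inference time without retraining, which matches the training-free variant described above.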

2605.07074 2026-05-18 cs.CV Version update

Decoupling Semantics and Fingerprints: A Universal Representation for AI-Generated Image Detection

Zhiyuan Wang, Yanxiang Chen, Pengcheng Zhao, Yunfeng Diao, Xin Liao

Affiliations * Hefei University of Technology; Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education; School of Computer Science and Information Engineering, Hefei University of Technology; Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology); School of Computer Science, Nanjing Audit University; College of Cyber Science and Technology, Hunan University

AI summary This paper studies the detection of AI-generated images produced by unseen architectures and notes that existing methods over-rely on generator-specific fingerprints and semantic content, which limits generalization. It identifies feature entanglement as the main cause and proposes the Orthogonal Decomposition and Purification Network (ODP-Net), which structurally separates universal forgery traces, generator fingerprints, and semantic content, improving detection on unseen generative models.

Comments ~10 pages (IEEEtran two-column), 6 figures, 6 tables, 1 algorithm

Details
English abstract

Detecting AI-generated images across unseen architectures remains challenging, as existing models often overfit to generator-specific fingerprints and semantic content rather than learning universal forgery traces. We attribute this failure to feature entanglement: detectors learn these factors as a single entangled representation, where universal forgery traces are inextricably confounded with both generator-specific fingerprints and semantic content. Crucially, our spectral analysis reveals that this entanglement is avoidable: distinct generator-specific fingerprints (e.g., GAN stripes vs. Diffusion Model spots) occupy disjoint frequency subspaces and coexist as independent superpositions. Leveraging this physical orthogonality, we propose the Orthogonal Decomposition and Purification Network (ODP-Net) to structurally disentangle these factors. Specifically, ODP-Net employs (1) Instance-aware Orthogonal Decomposition to project features into mutually exclusive subspaces: universal forgery traces, generator-specific fingerprints, and semantic content; (2) Perturbation-based Purification to enforce semantic invariance via cross-sample feature injection; and (3) Manifold Alignment to bridge domain gaps. By explicitly decoupling universal forgery traces from generator-specific fingerprints and semantic content, ODP-Net achieves state-of-the-art performance on unseen architectures (e.g., Stable Diffusion 3), validating that structural disentanglement is key to generalization.

2605.01852 2026-05-18 cs.CV Version update

DP-SfM: Dual-Pixel Structure-from-Motion without Scale Ambiguity

Lilika Makabe, Kohei Ashida, Hiroaki Santo, Fumio Okura, Yasuyuki Matsushita

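Affiliations * Graduate School of Information Science and Technology, The University of Osaka

AI summary This paper proposes DP-SfM, a method for multi-view 3D reconstruction from images captured with a dual-pixel (DP) sensor that resolves scale ambiguity automatically, without a reference object or prior calibration. Combining up-to-scale depth maps with the defocus blur in DP images, it estimates absolute scale with a simple yet effective linear method and then aligns the left and right images through intensity-based optimization. Experiments show that the method works well on diverse scenes captured with different cameras and lenses.

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

Details
English abstract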

Multi-view 3D reconstruction, namely, structure-from-motion followed by multi-view stereo, is a fundamental component of 3D computer vision. In general, multi-view 3D reconstruction suffers from an unknown scale ambiguity unless a reference object of known size is present in the scene. In this article, we show that multi-view images captured using a dual-pixel (DP) sensor can automatically resolve the scale ambiguity, without requiring a reference object or prior calibration. Specifically, the defocus blur observed in DP images provides sufficient information to determine the absolute scale when paired with depth maps (up to scale) recovered from multi-view 3D reconstruction. Based on this observation, we develop a simple yet effective linear method to estimate the absolute scale, followed by the intensity-based optimization stage that aligns the left and right DP images by shifting them back toward each other using cross-view blur kernels. Experiments demonstrate the effectiveness of the proposed approach across diverse scenes captured with different cameras and lenses. Code and data are available at https://github.com/lilika-makabe/dp-sfm-tpami.git

2604.17669 2026-05-18 cs.CV Version update

Low Light Image Enhancement Challenge at NTIRE 2026

George Ciubotariu, Sharif S M A, Abdur Rehman, Fayaz Ali Dharejo, Rizwan Ali Naqvi, Marcos V. Conde, Radu Timofte, Zhi Jin, Hongjun Wu, Wenjian Zhang, Chang Ye, Xunpeng Yi, Qinglong Yan, Yibing Zhang, Zaynab Ali, Saiprasad Meesiyawar, Varda I Pattanshetty, Varsha I Pattanshetty, Nikhil Akalwadi, Padmashree Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Hao Yang, Ruikun Zhang, Liyuan Pan, Furkan Kınlı, Donghun Ryou, Inju Ha, Junoh Kang, Bohyung Han, Wei Zhou, Yuval Haitman, Ariel Lapid, Reuven Peretz, Idit Diamant, Leilei Cao, Shuo Zhang, Praful Hambarde, Prateek Shaily, Jayant Kumar, Hardik Sharma, Aashish Negi, Sachin Chaudhary, Akshay Dudhane, Amit Shukla, MoHao Wu, Lin Wang, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi, Raul Balmez, Alexandru Brateanu, Ciprian Orhei, Cosmin Ancuti, Codruta O. Ancuti, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Kaifan Qiao, Bofei Chen, Jingyi Xu, Duo Zhang, Xin Deng, Mai Xu, Shengxi Li, Lai Jiang, Harini A, Ananya N, Lakshanya K, Ying Xu, Xinyi Zhu, Shijun Shi, Jiangning Zhang, Yong Liu, Kai Hu, Jing Xu, Xianfang Zeng, Jinao Song, Guangsheng Tang, Cheng Li, Yuqiang Yang, Ziyi Wang, Yan Chen, Long Bao, Heng Sun, Mohab Kishawy, Jun Chen, Wan-Chi Siu, Yihao Cheng, Hon Man Hammond Lee, Chun-Chuen Hui

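Affiliations * NTIRE 2026

AI summary This paper reviews the NTIRE 2026 Low Light Image Enhancement Challenge, covering the solutions proposed by participants and the final results. The challenge seeks networks that make low-contrast, noisy images clearer and more visually appealing. A total of 22 teams submitted valid entries. The paper evaluates current state-of-the-art methods for (joint denoising and) low-light image enhancement, shows the progress made in the field, and bases its analysis on a new dataset.

Details
English abstract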

This paper presents a comprehensive review of the NTIRE 2026 Low Light Image Enhancement Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions by learning representative visual cues with the purpose of restoring information loss due to low-contrast and noisy images. A total of 195 participants registered for the first track and 153 for the second track of the competition, and 22 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in (joint denoising and) low-light image enhancement, showcasing the significant progress in the field, while leveraging samples of our novel dataset.

2604.16925 2026-05-18 cs.CV Version update

Rethinking Cross-Dose PET Denoising: Mitigating Averaging Effects via Residual Noise Learning

Yichao Liu, Zongru Shao, Yueyang Teng, Junwen Guo

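Affiliations * IWR, Heidelberg University; Silicon Austria Labs; College of Medicine and Biological Information Engineering, Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University; Department of Epidemiology & Global Health, Umeå University

AI summary This paper studies cross-dose denoising of low-dose positron emission tomography (LDPET) images and notes that conventional models generalize poorly across dose conditions, mainly because noise magnitude and statistical properties differ. The authors show that existing training formulations implicitly optimize an expectation over heterogeneous noise distributions, so the network learns an averaged cross-dose denoising mapping that cannot model dose-specific noise accurately. They propose a unified residual noise learning framework that estimates noise directly from the low-dose image instead of predicting the full-dose image. Experiments on large-scale datasets from two medical centers show that the method outperforms existing approaches and improves cross-dose denoising performance.

Details
English abstract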

Cross-dose denoising for low-dose positron emission tomography (LDPET) has been proposed to address the limited generalization of models trained at a single noise level. However, neural networks trained on a specific dose level often fail to generalize to other dose conditions due to variations in noise magnitude and statistical properties. Conventional "one-size-for-all" models attempt to mitigate this variability but tend to learn averaged representations across noise levels, resulting in degraded performance. In this work, we analyze this limitation and show that standard training formulations implicitly optimize an expectation over heterogeneous noise distributions, causing the network to learn an averaged denoising mapping that cannot accurately model dose-specific noise characteristics. We propose a unified residual noise learning framework that estimates noise directly from low-dose PET images rather than predicting full-dose images. Experiments on large-scale multi-dose PET datasets from two medical centers demonstrate that the proposed method outperforms the "one-size-for-all" model, individual dose-specific U-Net models, and dose-conditioned approaches, achieving improved denoising performance. These results indicate that residual noise learning effectively mitigates the averaging effect and enhances generalization for cross-dose PET denoising.

2604.15221 2026-05-18 cs.RO cs.CV Version update

Vision-Based Safe Human-Robot Collaboration with Uncertainty Guarantees

Jakob Thumm, Marian Frei, Tianle Ni, Matthias Althoff, Marco Pavone

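Affiliations * Department of Aeronautics and Astronautics, Stanford University; Chair of Imaging and Computer Vision, RWTH Aachen University; School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Computer Engineering, Technical University of Munich

AI summary This paper proposes a vision-based framework for human pose estimation and motion prediction that provides verifiable uncertainty guarantees for safe collaboration. The method combines aleatoric uncertainty estimation with out-of-distribution detection to raise prediction confidence, and introduces conformal prediction sets so that predictions remain highly reliable in real human-robot collaboration. It is validated on recorded human motion data and in a real-world human-robot collaboration setting.

Details
English abstract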

We propose a framework for vision-based human pose estimation and motion prediction that gives conformal prediction guarantees for certifiably safe human-robot collaboration. Our framework combines aleatoric uncertainty estimation with OOD detection for high probabilistic confidence. To integrate our pipeline in certifiable safety frameworks, we propose conformal prediction sets for human motion predictions with high, valid confidence. We evaluate our pipeline on recorded human motion data and a real-world human-robot collaboration setting.
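
For readers unfamiliar with conformal prediction sets, the standard split-conformal recipe can be sketched briefly. The sketch assumes a scalar nonconformity score per calibration sample, here imagined as the Euclidean error between a predicted and an observed joint position; that choice of score and the function name `conformal_radius` are assumptions, since the abstract does not specify the paper's construction.

```python
import math


def conformal_radius(calib_errors, alpha=0.1):
    """Split-conformal quantile of calibration errors.

    Returns a radius r such that, under exchangeability of calibration and
    test samples, a new error is at most r with probability >= 1 - alpha.
    """
    n = len(calib_errors)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration samples for the requested coverage.
        return math.inf
    return sorted(calib_errors)[k - 1]
```

The prediction set for a new frame would then be the ball of radius `conformal_radius(...)` around the predicted position. With only a handful of calibration samples the function returns infinity, which reflects that no finite set can be certified at that confidence level.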

2603.14764 2026-05-18 cs.CV cs.AI cs.LG Version update

Topology-Preserving Polygon Augmentation for Segmentation in Structured Visual Domains

Sudip Laudari, Sang Hun Baek

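Affiliations * Independent Researcher

AI summary This paper studies augmentation that preserves the topology of polygon annotations in structured visual domains such as architectural floorplan analysis. Conventional geometric augmentation can split a polygon region and break its semantic connectivity; the proposed lightweight topology-preserving strategy repairs adjacency relations in index space without changing vertex order. Experiments show near-perfect Cyclic Adjacency Preservation (CAP) under common geometric transformations and improved consistency of polygon-based segmentation annotations.

Comments 10 pages, 6 figures

Details
English abstract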

Geometric data augmentation is widely used in segmentation workflows, but polygon annotations are often assumed to remain valid after transformation. This assumption can fail in structured domains such as architectural floorplan analysis, where a region may contain an interior void encoded as part of a single ordered polygon chain. Cropping or clipping can remove bridge vertices in this chain, causing one semantic region to split into disconnected components. We propose a lightweight topology-preserving augmentation strategy that repairs missing adjacency relations in index space while preserving the original vertex order. The method adds minimal overhead and can be integrated into existing preprocessing workflows. Experiments show that the proposed approach achieves near-perfect Cyclic Adjacency Preservation (CAP) across common geometric transformations and improves annotation consistency in polygon-based segmentation.

2603.13864 2026-05-18 cs.CR cs.CV Version update

Inevitable Encounters: Backdoor Attacks Involving Lossy Compression

Qian Li, Yunuo Chen, Yuntian Chen

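Affiliations * Shanghai Jiao Tong University; Eastern Institute of Technology

AI summary This paper studies how the lossy compression that is unavoidable during data storage and transmission weakens backdoor attacks in real-world settings. Because trigger information embedded in images can be lost during compression, the authors propose two poisoning strategies designed for lossy compression that keep the trigger recoverable after decompression. Experiments show that both strategies remain effective under a range of compression schemes, offering a new perspective on how backdoor attacks can be carried out in practice.

Details
English abstract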

Real-world backdoor attacks often require poisoned datasets to be stored and transmitted before being used to compromise deep learning systems. However, in the era of big data, the inevitable use of lossy compression poses a fundamental challenge to invisible backdoor attacks. We find that triggers embedded in RGB images often become ineffective after the images are lossily compressed into binary bitstreams (e.g., JPEG files) for storage and transmission. As a result, the poisoned data lose their malicious effect after compression, causing backdoor injection to fail. In this paper, we highlight the necessity of explicitly accounting for the lossy compression process in backdoor attacks. This requires attackers to ensure that the transmitted binary bitstreams preserve malicious trigger information, so that effective triggers can be recovered in the decompressed data. Building on the region-of-interest (ROI) coding mechanism in image compression, we propose two poisoning strategies tailored to inevitable lossy compression. First, we introduce Universal Attack Activation, a universal method that uses sample-specific ROI masks to reactivate trigger information in binary bitstreams for learned image compression (LIC). Second, we present Compression-Adapted Attack, a new attack strategy that employs customized ROI masks to encode trigger information into binary bitstreams and is applicable to both traditional codecs and LIC. Extensive experiments demonstrate the effectiveness of both strategies.

2603.08063 2026-05-18 cs.CV Version update

SkyLink: A Large Vision-Language Model Driven Re-ranking Framework for Cross-View UAV geolocalization

Bowen Liu, Pengyue Jia, Wanyu Wang, Derong Xu, Jiawei Cheng, Jiancheng Dong, Xiao Han, Zimo Zhao, Chao Zhang, Bowen Yu, Fangyu Hong, Xiangyu Zhao

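Affiliations * Department of Data Science, City University of Hong Kong, Hong Kong; Information Systems, City University of Hong Kong, Hong Kong; College of Computer Science and Technology, Zhejiang University of Technology, Zhejiang

AI summary SkyLink is a re-ranking framework for cross-view UAV geolocalization built on a Large Vision-Language Model (LVLM), aimed at improving matching accuracy between UAV and satellite images. It models the visual-semantic relationships between views for more effective cross-view matching and introduces a relation-aware loss that improves the model's discriminative capacity and training stability. Experiments show that SkyLink significantly improves the re-ranking performance of existing models on several benchmark datasets, especially in complex scenes.

Details
English abstract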

Cross-view UAV geolocalization is fundamentally a challenging large-scale image retrieval task, aiming to determine the geographic coordinates of Unmanned Aerial Vehicle (UAV) queries by matching them against an extensive geo-tagged satellite image database. Most existing methods learn separate feature representations for each view and determine the final prediction using naive heuristics to assess feature similarity, thereby neglecting to model the crucial cross-view relationships. In this paper, we propose SkyLink, a novel plug-and-play ranking framework that pioneers joint relational modeling of inter-view relationships to enhance cross-view UAV geolocalization. SkyLink leverages a Large Vision-Language Model (LVLM) to model the intricate visual-semantic relationships between UAV and satellite views, facilitating effective cross-view matching. To further refine the learning process, we introduce a relational-aware loss. It leverages soft labels to provide a more nuanced supervision signal, mitigating the harsh penalty on near-positive pairs. This approach enhances both training stability and the model's discriminative capacity. Extensive experiments conducted across multiple base retrieval architectures and benchmark datasets demonstrate that SkyLink significantly boosts the ranking effectiveness of existing models, consistently achieving superior performance in various challenging scenarios.

2603.07514 2026-05-18 cs.LG cs.AI cs.CV Version update

A Unified View of Score-Based and Drifting Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

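Affiliations * Sony AI; Sony Group Corporation; Stanford University; Georgia Tech

AI summary This paper examines the connection between drifting models and score-based generative models and shows that drifting is, in essence, a score-matching objective on smoothed distributions. With a Gaussian kernel, the mean-shift field corresponds exactly to the difference between the scores of the data and model distributions, a result that follows from Tweedie's formula. For the Laplace kernel commonly used in practice, theory and experiments both indicate that the residual term is negligible in high dimension, so practical drifting is approximately score-based. The work offers a unified view of these generative models and identifies the structural similarities and differences in transport direction between drifting and diffusion models.

Details
English abstract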

Drifting models train one-step generators by optimizing a kernel-induced mean-shift discrepancy between the data and model distributions, with Laplace kernels used by default in practice. At each point, this discrepancy compares the kernel-weighted displacement toward nearby data samples with the corresponding displacement toward nearby model samples, thereby defining a transport direction for generated samples. In this paper, we show that drifting is more closely connected to score-based generative modeling than it may first appear, establishing a precise link to the score-matching principle underlying diffusion models. For Gaussian kernels, the population mean-shift field exactly equals the difference between the scores (i.e., the gradient-log-densities) of the Gaussian-smoothed data and model distributions. This identity follows from Tweedie's formula, which links the score of a Gaussian-smoothed density to its conditional mean, and implies that Gaussian-kernel drifting is exactly a score-matching objective on smoothed distributions. More generally, we derive an exact decomposition for radial kernels in which mean shift equals a score-based field plus a residual term. For the practical Laplace kernel, we further show theoretically and empirically that this residual is negligible in high dimension, implying that the transport field used in practice is nearly score-based. Our results reveal a structural connection to diffusion models: both methods use score-mismatch transport directions, but drifting realizes the score nonparametrically through kernel-based estimates, whereas diffusion models learn it parametrically with neural networks.
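
The Gaussian-kernel identity stated in the abstract can be checked with a short derivation. The notation below (bandwidth $\sigma$, mean-shift field $V_p$, smoothed density $p_\sigma$) is assumed for this sketch; the paper's own normalization may absorb the factor $\sigma^2$ differently.

```latex
% Assumed notation:
%   k_\sigma(x, y) = \exp\!\big(-\|x - y\|^2 / (2\sigma^2)\big),
%   p_\sigma = p * \mathcal{N}(0, \sigma^2 I)  (Gaussian-smoothed density).
\begin{align}
  V_p(x) &:= \frac{\mathbb{E}_{y \sim p}\big[k_\sigma(x, y)\, y\big]}
                  {\mathbb{E}_{y \sim p}\big[k_\sigma(x, y)\big]} - x
  && \text{(kernel mean shift toward } p\text{)} \\
  \mathbb{E}[\,y \mid x\,]
         &= \frac{\int y\, p(y)\, k_\sigma(x, y)\, dy}{\int p(y)\, k_\sigma(x, y)\, dy}
  && \text{(posterior mean for } x = y + \sigma\varepsilon\text{)} \\
  \nabla_x \log p_\sigma(x)
         &= \frac{\mathbb{E}[\,y \mid x\,] - x}{\sigma^2}
  && \text{(Tweedie's formula)} \\
  V_p(x) - V_q(x)
         &= \sigma^2 \big( \nabla_x \log p_\sigma(x) - \nabla_x \log q_\sigma(x) \big)
  && \text{(difference of smoothed scores)}
\end{align}
```

The second line shows that the kernel-weighted mean in $V_p$ is exactly the posterior mean appearing in Tweedie's formula, so $V_p(x) = \sigma^2 \nabla_x \log p_\sigma(x)$ and the last line follows by applying the same step to the model distribution $q$.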

2602.20630 2026-05-18 cs.CV Version update

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

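Affiliations * School of Computer Science, Wuhan University; Xiaomi EV

AI summary This paper recasts keypoint detection as a sequential decision-making process and proposes TraqPoint, an end-to-end reinforcement learning framework that directly optimizes the long-term trackability of keypoints across image sequences. Its core contribution is a reward mechanism based on track quality that, through a policy gradient method, jointly improves the consistency and distinctiveness of keypoints across views. Experiments show that TraqPoint significantly outperforms current state-of-the-art keypoint detection and description methods on sparse matching tasks.

Comments Accepted by CVPR 2026 (Oral)

Details
English abstract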

Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods. The code will be available at https://github.com/xiaomi-research/traqpoint.

2602.10687 2026-05-18 cs.CV cs.AI Version update

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

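Affiliations * School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION)

AI summary Existing forgery detection methods are mostly limited to uni-modal or bi-modal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which targets the bias that arises from multimodal interaction and multi-task optimization. Its two core designs, Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization, improve combined detection and grounding performance and yield strong zero-shot generalization across several datasets.

Comments Accepted by ICML 2026

Details
English abstract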

Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios. The dataset and code are publicly available at https://github.com/shen8424/OmniVL-Guard.

2602.05414 2026-05-18 cs.CV Version update

TSBOW -- Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions

Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Son Hong Phan, Quoc Pham-Nam Ho, Chi Dai Tran, Trinh Le Ba Khanh, Jae Wook Jeon

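Affiliations * Automation Lab, Department of Electrical and Computer Engineering

AI summary As global warming increases the frequency and severity of extreme weather events, existing traffic surveillance datasets fall short for detecting occluded vehicles under complex weather. This work introduces the TSBOW dataset, which contains over 32 hours of real-world urban traffic video covering varied weather conditions and occlusion scenarios, with more than 48,000 manually annotated frames, to improve the detection of traffic participants in adverse weather. TSBOW is a resource for research on Intelligent Transportation Systems and supports the development of CCTV-based traffic monitoring.

Comments This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence. 40(2026). 5239-5247

Details
English abstract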

Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes, from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.

2602.00841 2026-05-18 cs.CV Version update

Beyond First-Order: Learning Riemannian Geometries for Invariant Visual Place Recognition

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

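Affiliations * The Hong Kong University of Science and Technology, Hong Kong, China; University of Macau, Macau, China; University of Science and Technology Beijing, Beijing, China

AI summary This paper studies how to build feature representations for visual place recognition (VPR) that are robust to drastic environmental and viewpoint changes. To address the loss of structural correlations under extreme changes and the high adaptation cost of existing methods, it proposes Riemannian Invariant Aggregation (RIA), which models second-order scene structure on the symmetric positive definite (SPD) manifold, preserving invariant structural information while suppressing noise. Experiments show that RIA reaches zero-shot performance comparable to supervised methods and, with simple fine-tuning, state-of-the-art accuracy, particularly in unstructured environments.

Comments 14 pages, 5 figures

Details
English abstract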

Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Existing aggregation paradigms either depend on extensive supervised training or rely on first-order pooling, often struggling to preserve structural correlations under extreme shifts or incurring high adaptation costs. In this work, we propose Riemannian Invariant Aggregation (RIA), a unified geometric framework that explicitly models second-order scene structure on the Symmetric Positive Definite (SPD) manifold. By treating perturbations as tractable congruence transformations, RIA leverages geometry-aware Riemannian mappings to project covariance descriptors into a linearized Euclidean space, effectively preserving invariant structural components while suppressing noise. Extensive evaluations demonstrate that RIA achieves zero-shot performance comparable to supervised methods, and establishes state-of-the-art accuracy with simple fine-tuning, particularly in unstructured environments. The source code will be released.

2601.00678 2026-05-18 cs.CV Version update

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

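Affiliations * University of Glasgow

AI summary This paper proposes a method for generating dynamic video from a single image that follows a given camera trajectory with high quality and temporal consistency. It builds a dynamic 3D Gaussian scene representation and samples plausible object motion in a single forward pass, enabling fast camera-controlled video generation. The method performs well on several datasets, achieving leading video quality and inference efficiency.

Details
English abstract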

Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.

2512.14671 2026-05-18 cs.CV Version update

ART: Articulated Reconstruction Transformer

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong

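Affiliations * Reality Labs Research, Meta; Stanford University

AI summary This paper proposes ART, a model that reconstructs complete 3D articulated objects from sparse multi-state RGB images without relying on specific object categories or slow optimization. ART treats an articulated object as an assembly of rigid parts; its transformer architecture maps images to learnable part slots and jointly decodes each part's 3D geometry, texture, and articulation parameters, producing physically interpretable reconstructions that can be used directly in simulation. Experiments show that ART clearly outperforms existing methods on several benchmarks and sets a new state of the art.

Comments Project Page: https://kyleleey.github.io/ART/

Details
English abstract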

We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

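2511.18719 2026-05-18 cs.CV updated version

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation

Ziqi Ni, Yuanzhi Liang, Rui Li, Yi Zhou, Haibin Huang, Chi Zhang, Xuelong Li

Affiliations * Southeast University; Institute of Artificial Intelligence (TeleAI), China Telecom; University of Science and Technology of China

AI summary This paper proposes ViPO, a visual preference policy optimization method that improves the alignment of visual generative models with human preferences. Unlike existing methods that rely on a single scalar reward, ViPO introduces a Perceptual Structuring Module that converts feedback into structured pixel-level advantage maps, steering optimization toward the key regions of the visual content at a finer granularity. The method performs well on both image and video generation, improving alignment with in-domain human-preference rewards and generalization to out-of-domain tasks, and it is lightweight, general, and easy to integrate into existing training pipelines.

Details
English abstract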

Reinforcement learning (RL) has become a powerful tool for post-training visual generative models, with Group Relative Policy Optimization (GRPO) increasingly used to align generators with human preferences. However, existing GRPO pipelines rely on a single scalar reward per sample, treating each image or video as a holistic entity and ignoring the rich spatial and temporal structure of visual content. This coarse supervision hinders the correction of localized artifacts and the modeling of fine-grained perceptual cues. We introduce Visual Preference Policy Optimization (ViPO), a GRPO variant that lifts scalar feedback into structured, pixel-level advantages. ViPO employs a Perceptual Structuring Module that uses pretrained vision backbones to construct spatially and temporally aware advantage maps, redistributing optimization pressure toward perceptually important regions while preserving the stability of standard GRPO. Across both image and video benchmarks, ViPO consistently outperforms vanilla GRPO, improving in-domain alignment with human-preference rewards and enhancing generalization on out-of-domain evaluations. The method is architecture-agnostic, lightweight, and fully compatible with existing GRPO training pipelines, providing a more expressive and informative learning signal for visual generation.
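
To make the contrast between a single scalar advantage and a pixel-level one concrete, here is a minimal sketch in plain Python. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO form. The `spread_advantage` helper, the toy saliency map, and the choice to rescale weights to mean 1 are illustrative assumptions only; the abstract does not specify how the Perceptual Structuring Module builds its advantage maps.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def spread_advantage(advantage, saliency):
    """Hypothetical helper: turn one scalar advantage into a per-pixel map.

    `saliency` is a 2D list of non-negative weights (assumed to come from
    some pretrained vision backbone). Weights are rescaled to mean 1 so the
    average advantage over pixels still equals the scalar advantage.
    """
    flat = [w for row in saliency for w in row]
    avg = mean(flat)
    if avg == 0:
        # No salient region: fall back to a uniform map.
        return [[advantage for _ in row] for row in saliency]
    return [[advantage * w / avg for w in row] for row in saliency]

rewards = [0.2, 0.5, 0.9, 0.4]        # one scalar reward per generated sample
advantages = group_relative_advantages(rewards)
saliency = [[0.0, 1.0], [1.0, 2.0]]   # toy 2x2 importance map (assumption)
pixel_adv = spread_advantage(advantages[2], saliency)
```

Keeping the map's mean equal to the scalar advantage is one simple way to redistribute the learning signal without changing its overall scale; whether ViPO does this is not stated in the abstract.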

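2511.18127 2026-05-18 cs.CV updated version

SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting

Ruicong Liu, Yifei Huang, Liangyang Ouyang, Caixin Kang, Yoichi Sato

Affiliations * The University of Tokyo

AI summary SFHand is a language-guided framework for real-time 3D hand state forecasting, aimed at improving human-computer interaction in settings such as augmented reality and assistive robotics. From a continuous video stream and language instructions, it autoregressively predicts future hand states, including hand type, 2D bounding box, 3D pose, and trajectory, and it uses an ROI-enhanced memory layer to capture temporal context and salient hand regions. The work also introduces the EgoHaFL dataset; experiments show that SFHand substantially outperforms existing methods on 3D hand forecasting and raises task success rates on downstream manipulation tasks.

Details
English abstract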

Real-time 3D hand forecasting is a critical component for fluid human-computer interaction in applications like AR and assistive robotics. However, existing methods are ill-suited for these scenarios, as they typically require offline access to accumulated video sequences and cannot incorporate language guidance that conveys task intent. To overcome these limitations, we introduce SFHand, the first streaming framework for language-guided 3D hand forecasting. SFHand autoregressively predicts a comprehensive set of future 3D hand states, including hand type, 2D bounding box, 3D pose, and trajectory, from a continuous stream of video and language instructions. Our framework combines a streaming autoregressive architecture with an ROI-enhanced memory layer, capturing temporal context while focusing on salient hand-centric regions. To enable this research, we also introduce EgoHaFL, the first large-scale dataset featuring synchronized 3D hand poses and language instructions. We demonstrate that SFHand achieves new state-of-the-art results in 3D hand forecasting, outperforming prior work by a significant margin of up to 35.8%. Furthermore, we show the practical utility of our learned representations by transferring them to downstream embodied manipulation tasks, improving task success rates by up to 13.4% on multiple benchmarks. Dataset page: https://huggingface.co/datasets/ut-vision/EgoHaFL, project page: https://github.com/ut-vision/SFHand.

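2511.17426 2026-05-18 cs.LG cs.CV stat.ML updated version

Self-Supervised Learning by Curvature Alignment

Benyamin Ghojogh, M. Hadi Sepanj, Paul Fieguth

Affiliations * Vision and Image Processing Group, Systems Design Engineering, University of Waterloo, Ontario, Canada

AI summary This paper proposes CurvSSL, a self-supervised learning method based on curvature alignment, and its kernel-space extension, kernel CurvSSL, which aim to improve representation learning by explicitly modeling the local geometry of the data manifold. The method adds a curvature regularization term to a conventional non-contrastive framework: it computes the local curvature of the embedded features, then aligns and decorrelates it across augmented views, strengthening the invariance and geometric consistency of the representation. Experiments on MNIST and CIFAR-10 show competitive or improved linear evaluation performance compared with Barlow Twins and VICReg.

Comments A shorter version of this paper has been published in: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

Journal ref Shorter version of this paper is published in Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, Special Issue: Proceedings of CVIS 2025

Details
English abstract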

Self-supervised learning (SSL) has recently advanced through non-contrastive methods that couple an invariance term with variance, covariance, or redundancy-reduction penalties. While such objectives shape first- and second-order statistics of the representation, they largely ignore the local geometry of the underlying data manifold. In this paper, we introduce CurvSSL, a curvature-regularized self-supervised learning framework, and its RKHS extension, kernel CurvSSL. Our approach retains a standard two-view encoder-projector architecture with a Barlow Twins-style redundancy-reduction loss on projected features, but augments it with a curvature-based regularizer. Each embedding is treated as a vertex whose $k$ nearest neighbors define a discrete curvature score via cosine interactions on the unit hypersphere; in the kernel variant, curvature is computed from a normalized local Gram matrix in an RKHS. These scores are aligned and decorrelated across augmentations by a Barlow-style loss on a curvature-derived matrix, encouraging both view invariance and consistency of local manifold bending. Experiments on MNIST and CIFAR-10 datasets with a ResNet-18 backbone show that curvature-regularized SSL yields competitive or improved linear evaluation performance compared to Barlow Twins and VICReg. Our results indicate that explicitly shaping local geometry is a simple and effective complement to purely statistical SSL regularizers.
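
The Barlow Twins-style redundancy-reduction loss that the abstract builds on can be sketched in a few lines of plain Python. This shows only the standard loss: each feature is standardized over the batch, a cross-correlation matrix C between the two views is formed, and the loss is the sum of (1 - C_ii)^2 on the diagonal plus a weight times the sum of C_ij^2 off the diagonal. The weight `lam` and the toy inputs are assumptions, and the curvature regularizer itself is not reproduced here because the abstract does not give its formula.

```python
from statistics import mean, pstdev

def standardize(column, eps=1e-8):
    """Zero-mean, unit-variance version of one feature across the batch."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / (sigma + eps) for v in column]

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """z_a, z_b: lists of embeddings (batch x dim) from two augmented views."""
    n, d = len(z_a), len(z_a[0])
    cols_a = [standardize([row[j] for row in z_a]) for j in range(d)]
    cols_b = [standardize([row[j] for row in z_b]) for j in range(d)]
    on_diag = off_diag = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the views.
            c_ij = sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2   # invariance term
            else:
                off_diag += c_ij ** 2          # redundancy-reduction term
    return on_diag + lam * off_diag

view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
view_b = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]  # dimensions swapped
loss_same = barlow_twins_loss(view_a, view_a)
loss_swapped = barlow_twins_loss(view_a, view_b)
```

Identical views give a loss near zero, while swapping the two feature dimensions moves all correlation off the diagonal and is penalized by both terms.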

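2511.03260 2026-05-18 cs.CV updated version

Enhancing Medical Image Segmentation via Heat Conduction Equation

Rong Wu, Yim-Sang Yu

Affiliations * Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA

AI summary This paper addresses the difficulty of efficient global context modeling and long-range dependency reasoning under limited compute in medical image segmentation, and proposes a hybrid architecture that combines U-Mamba with the heat conduction equation. The method places heat conduction operators in the bottleneck layers, simulating frequency-domain thermal diffusion to enhance semantic abstraction. The model reaches a Dice coefficient of 0.8719 on an abdominal CT dataset.

Details
English abstract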

Medical image segmentation models struggle to achieve efficient global context modeling and long-range dependency reasoning under practical computational budgets. In this work, we propose a hybrid architecture that augments U-Mamba with the heat conduction equation: it combines state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, which simulate frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results show that our model attains the highest DSC (0.8719) on the Abdomen CT dataset. This suggests that blending state-space dynamics with heat-based global diffusion offers a scalable solution for medical segmentation tasks.
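
As a rough illustration of what "frequency-domain thermal diffusion" means, the sketch below applies the textbook solution of the 1D heat equation on a periodic grid: each Fourier mode with angular frequency ω is damped by exp(-α ω² t). It uses a naive DFT in plain Python for self-containment. The function names, the 1D setting, and the values of `alpha` and `time` are assumptions; the abstract does not describe how the paper's HCOs are parameterized or applied to feature maps.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a toy example)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def heat_diffuse(signal, alpha=1.0, time=1.0):
    """Diffuse a periodic 1D signal by damping each Fourier mode."""
    n = len(signal)
    damped = []
    for k, coeff in enumerate(dft(signal)):
        # Map index k to a signed frequency so +f and -f decay identically.
        freq = k if k <= n // 2 else k - n
        omega = 2 * math.pi * freq / n
        damped.append(coeff * math.exp(-alpha * omega ** 2 * time))
    return [value.real for value in idft(damped)]

u0 = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]   # a single hot spot
u1 = heat_diffuse(u0, alpha=1.0, time=1.0)
```

The zero-frequency mode is left untouched, so the total "heat" is conserved while the peak spreads to its neighbors. That global, smooth mixing is the property a diffusion-based operator would contribute.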

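2510.22665 2026-05-18 cs.CV cs.AI updated version

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

Affiliations * School of Artificial Intelligence and Robotics, Hunan University; Yuelushan Center for Industrial Innovation; School of Medical Information Engineering, Jining Medical University

AI summary This paper proposes SARVLM, the first vision-language foundation model designed specifically for synthetic aperture radar (SAR) imagery, with the goal of improving semantic understanding of SAR images. To address the scarcity of multimodal SAR data and weak cross-modal representations, the authors build SARVLM-1M, a large-scale dataset of over one million image-text pairs, and design a two-stage domain-transfer training strategy that uses optical remote sensing data as a bridge to improve performance in the SAR domain. Experiments show that SARVLM outperforms existing models across multiple benchmark tasks.

Comments 13 pages, 13 figures

Details
English abstract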

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.

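2510.02307 2026-05-18 cs.CV cs.AI updated version

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

Affiliations * Rice University

AI summary Text-to-image diffusion models often degrade when generating images at resolutions outside their training settings. Targeting low-resolution generation, this paper proposes NoiseShift, a noise recalibration method that needs no additional training: it adjusts the denoiser's noise-conditioning index to restore consistency between the forward and reverse processes, reducing the train-test mismatch. Experiments show that NoiseShift markedly improves low-resolution generation quality on several mainstream diffusion models, while being simple to implement and adding no inference overhead.

Details
English abstract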

Text-to-image diffusion models often degrade when sampled at resolutions outside the final training resolution set. Prior work has largely emphasized higher resolution generation, enabling pretrained diffusion models to extrapolate beyond the resolutions seen during training. In this work, we instead target lower-resolution generation, performing inference at reduced resolution to significantly cut computational cost. We show that network conditioning of the noise level induces a train-test mismatch that directly degrades low-resolution generation: the same scheduled noise level can correspond to a different perceptual corruption level at lower resolutions, mis-calibrating the denoiser timestep and noise embedding. To this end, we propose NoiseShift, a training-free recalibration method that keeps the original noise sampling schedule unchanged and instead re-indexes the noise conditioning of the denoiser to restore local forward-reverse consistency. Using a lightweight coarse-to-fine calibration on a small set of image-text pairs, NoiseShift learns a resolution-specific mapping from scheduler noise to conditioning noise, reducing train-test mismatch and improving lower-resolution generation quality. When NoiseShift is applied to Stable Diffusion 3 (SD3), Stable Diffusion 3.5 (SD3.5), and Flux-Dev, generation quality at low resolutions improves consistently. Particularly, SD3 generation at 128x128 resolution gets an improved FID score from 203 to 171, and SD3.5 gets an improved FID score from 310 to 277 on LAION-COCO. Even Flux-Dev which already implements a complementary time-shifting strategy gets a modest boost from NoiseShift with an improved FID score from 120 to 113 at 64x64 resolution. More importantly, NoiseShift achieves such improvements with minimal implementation changes and no additional inference overhead.
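
The coarse-to-fine calibration idea can be sketched generically: for each scheduler noise level, search for the conditioning noise level that minimizes some error, first on a coarse grid and then on finer grids around the best candidate. Everything concrete below is an assumption made for illustration: the function names, the search range [0, 1], the grid sizes, and especially `toy_error`, which stands in for the real objective (a forward-reverse consistency error measured with the diffusion model on image-text pairs, which cannot be reproduced here).

```python
def coarse_to_fine_search(error_fn, low, high, steps=11, rounds=3):
    """Minimize error_fn on [low, high] by repeatedly gridding around the best point."""
    best = low
    for _ in range(rounds):
        width = (high - low) / (steps - 1)
        grid = [low + i * width for i in range(steps)]
        best = min(grid, key=error_fn)
        # Shrink the interval to one grid cell on each side of the best point.
        low, high = max(low, best - width), min(high, best + width)
    return best

def calibrate(scheduler_sigmas, error_for):
    """Build a mapping: scheduler noise level -> calibrated conditioning noise level."""
    return {s: coarse_to_fine_search(lambda c, s=s: error_for(s, c), 0.0, 1.0)
            for s in scheduler_sigmas}

def toy_error(scheduler_sigma, conditioning_sigma):
    """Hypothetical stand-in: pretend the best conditioning level is 0.8x the schedule."""
    return (conditioning_sigma - 0.8 * scheduler_sigma) ** 2

mapping = calibrate([0.25, 0.5, 1.0], toy_error)
```

At inference time such a table would be looked up per step, leaving the sampling schedule itself unchanged, which matches the abstract's description of re-indexing only the denoiser's noise conditioning.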

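2509.24798 2026-05-18 cs.CV cs.AI updated version

Causal-Adapter: Taming Text-to-Image Diffusion for Faithful Counterfactual Generation

Lei Tong, Zhihua Liu, Chaochao Lu, Dino Oglic, Tom Diethe, Philip Teare, Sotirios A. Tsaftaris, Chen Jin

Affiliations * Centre for AI, DS&AI, AstraZeneca, UK; Institute for Imaging, Data and Communications (IDCOM), School of Engineering, University of Edinburgh, Edinburgh, UK; Shanghai Artificial Intelligence Laboratory

AI summary This paper proposes Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion models for counterfactual image generation. The method applies causal interventions to target attributes and propagates their effects consistently to causally dependent attributes while preserving the core identity of the image. Unlike approaches that rely on prompt engineering, Causal-Adapter incorporates a structural causal model together with attribute-regularization strategies, achieving more accurate semantic control and high-fidelity image generation, with superior performance on multiple datasets.

Comments Project Page: https://leitong02.github.io/causaladapter/

Journal ref ICML 2026

Details
English abstract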

We present Causal-Adapter, a modular framework that adapts frozen text-to-image diffusion backbones for counterfactual image generation. Our method supports causal interventions on target attributes and consistently propagates their effects to causal dependents while preserving the core identity of the image. Unlike prior approaches that rely on prompt engineering without explicit causal structure, Causal-Adapter leverages structural causal modeling with two attribute-regularization strategies: (i) prompt-aligned injection, which aligns causal attributes with textual embeddings for precise semantic control, and (ii) a conditioned token contrastive loss that disentangles attribute factors and reduces spurious correlations. Causal-Adapter achieves state-of-the-art performance on both synthetic and real-world datasets, including up to a 91% reduction in MAE on Pendulum for accurate attribute control and up to an 87% reduction in FID on ADNI for high-fidelity MRI generation. These results demonstrate robust, generalizable counterfactual editing with faithful attribute modification and strong identity preservation. Code and models will be released at: https://leitong02.github.io/causaladapter/.

2509.16223 2026-05-18 eess.SP cs.CV updated version

mRadNet: A Compact Radar Object Detector with MetaFormer

Huaiyu Chen, Fahed Hassanat, Robert Laganiere, Martin Bouchard

Affiliations * School of Electrical Engineering and Computer Science, University of Ottawa, Canada; Sensor Cortek Inc., Canada

AI summary This paper proposes mRadNet, a compact radar object detection model aimed at the lightweight, efficient models that automotive embedded systems require. Built on a U-Net-style architecture with MetaFormer blocks, it uses separable convolution and attention to extract local and global features, and introduces more efficient token embedding and merging strategies to further reduce computational complexity. On the CRUW dataset, mRadNet outperforms existing methods with the fewest parameters and the lowest computational cost.

Comments 5 pages, 2 figures, to appear in Proc. of 34th European Signal Processing Conference (EUSIPCO 2026), Bruges, Belgium, Aug. 31 - Sept. 4, 2026. Code available at https://github.com/huaiyu-chen/mRadNet

Details
English abstract

Frequency-modulated continuous wave radars have gained increasing popularity in the automotive industry. Their robustness against adverse weather conditions makes them a suitable choice for radar object detection in advanced driver assistance systems. These real-time embedded systems have requirements for the compactness and efficiency of the model, which have been largely overlooked in previous work. In this work, we propose mRadNet, a novel radar object detection model with compactness in mind. mRadNet employs a U-Net-style architecture with MetaFormer blocks, in which separable convolution and attention token mixers are used to capture both local and global features effectively. More efficient token embedding and merging strategies are introduced to further facilitate the lightweight design. The performance of mRadNet is validated on the CRUW dataset, improving state-of-the-art performance with the fewest parameters and the lowest FLOPs.
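
The parameter saving that separable convolution brings is simple arithmetic, sketched below. A standard convolution needs k·k·C_in·C_out weights, whereas a depthwise separable one needs k·k·C_in (depthwise) plus C_in·C_out (pointwise), ignoring biases. The kernel size and channel counts are hypothetical values chosen for illustration, not mRadNet's actual configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise

full = standard_conv_params(3, 64, 128)     # 73728
light = separable_conv_params(3, 64, 128)   # 576 + 8192 = 8768
ratio = full / light                        # roughly 8.4x fewer weights
```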

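2509.05030 2026-05-18 cs.CV updated version

LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao, Xianhang Cheng, Jingyuan Liu, Yujian Zheng, Zhenhui Lin, Ren Li, Meriem Chkir, Hao Li

Affiliations * The University of Tokyo

AI summary This paper proposes LUIVITON, a fully automated virtual try-on system that tackles the mismatch in skeletons, templates, and dense correspondences between real-world garments and body models, automatically dressing complex multi-layer garments onto humanoid characters of varied pose and shape. Using SMPL as an intermediate proxy, it decomposes clothing-to-body mapping into two correspondence tasks, handled by a geometry-driven model and a diffusion-based approach that matches multi-view appearance features, and produces physically plausible draping on the target character. The system handles complex garment topology, applies to many kinds of humanoid characters, is computationally efficient, and requires no manual intervention.

Details
English abstract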

To enable large-scale reuse of real-world 3D assets, where garments and characters rarely share skeletons, templates, or dense correspondences, we present a fully automated virtual try-on system that dresses complex, multi-layer garments onto diverse, arbitrarily posed humanoids. Our key idea is to use SMPL as an intermediate proxy and decompose clothing-to-body transfer into two correspondence tasks with distinct challenges: (1) clothing-to-SMPL (partial-to-complete alignment) and (2) body-to-SMPL (large pose/shape variation and stylization). We address clothing-to-SMPL using a geometry-driven correspondence model, and introduce a diffusion-based body-to-SMPL correspondence approach that leverages multi-view consistent appearance features together with a pretrained 2D foundation model. Using these correspondences, we register SMPL/SMPL+D (Displacement) to the garment and target body and then perform simulator-driven fitting by transferring the garment along a smooth SMPL-to-SMPL+D transition, producing physically plausible draping on the target. Our system handles complex garment topology (including non-manifold meshes) and generalizes to a wide range of humanoid characters (e.g., humans, robots, cartoons, and creatures) while remaining computationally practical. Upon draping, our system also supports fast customization of clothing size. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available. Our project page is: https://cao-cong0.github.io/LUIVITON-Learned-Universal-Interoperable-VIrtual-Try-ON/.

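2508.17034 2026-05-18 cs.RO cs.CV updated version

DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration

Jiayi Li, Yuxin Yao, Qiuhang Lu, Juyong Zhang

Affiliations * University of Science and Technology of China; City University of Hong Kong; University of Chinese Academy of Sciences

AI summary To address noisy data, partial overlap, and real-time requirements in rigid registration, this paper proposes DualReg, a method built on dual-space filtering and reinforcement. It combines the strengths of feature-based matching and local geometry-based matching: an efficient filtering mechanism removes unreliable feature correspondences, and geometric proxies are used to build an objective function for estimating the transformation. Experiments show a 32x CPU-time speedup over MAC on the KITTI dataset with comparable accuracy.

Comments Accepted to CVPR 2026, Project page: https://ustc3dv.github.io/DualReg/

Details
English abstract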

Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method's effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: https://ustc3dv.github.io/DualReg/.

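2508.01014 2026-05-18 cs.RO cs.CV updated version

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu, Zhuoli Zhuang, Nguyen Thanh Trung Le, Da Xiao, Yu-Cheng Chang, Thomas Do, Srinath Sridhar, Chin-teng Lin

Affiliations * University of Technology Sydney; Brown University

AI summary Hestia is a viewpoint-planning method for efficient 3D reconstruction that addresses the reliance of conventional image acquisition on manual capture or fixed trajectories. It introduces a voxel-face-aware hierarchical structure and combines a more diverse dataset, a close-greedy strategy, and a geometry-aware design to improve the robustness of view planning and the quality of reconstruction. Experiments show that Hestia outperforms existing methods in coverage, reconstruction accuracy, and real-time performance, with good prospects for practical use.

Comments Accepted to the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026

Details
English abstract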

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, image acquisition for reconstruction is still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for real-world application. Our project page is https://johnnylu305.github.io/hestia web.
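
For readers unfamiliar with the Chamfer Distance metric reported above, here is a minimal sketch of one common variant: the mean nearest-neighbor Euclidean distance from each set to the other, summed over both directions. Other variants use squared distances or different normalization, and the abstract does not say which one the paper uses, so treat both the formula choice and the toy point sets as assumptions.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (unsquared, mean-reduced)."""
    def one_way(src, dst):
        # Average distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
recon = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]   # slightly off, and missing one point
cd = chamfer_distance(truth, recon)
```

The missing third point dominates the result here, which is why the metric rewards view planners that do not overlook parts of the geometry.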

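2507.01201 2026-05-18 cs.LG cs.CV updated version

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

Lauren Hyoseo Yoon, Yisong Yue, Been Kim

Affiliations * Computation and Neural Systems; California Institute of Technology; Computation and Mathematical Sciences; Google DeepMind

AI summary This paper studies how to align independently trained vision and language models and proposes JAM, which achieves cross-modal alignment by jointly training modality-specific autoencoders. JAM introduces a multimodal Spread Loss that improves alignment, and the paper systematically analyzes how the alignment objective, layer depth, and foundation model scale affect representational convergence. The study offers theoretical insight into shared semantic structure and practical guidance for building specialist multimodal models.

Details
English abstract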

Independently trained vision and language models inhabit disjoint representational spaces, shaped by their respective modalities, objectives, and architectures. The Platonic Representation Hypothesis (PRH) suggests these models may nonetheless converge toward a shared statistical model of reality. This raises a fundamental question: can we move beyond post-hoc detection of such alignment and explicitly optimize for it? We argue this challenge is most critical in fine-grained contextual distinctions, where multiple descriptions share global semantics but differ in subtle compositional details. We address this with the Joint Autoencoder Modulator (JAM), which aligns frozen unimodal models by jointly training modality-specific autoencoders with coordinated reconstruction and cross-modal alignment objectives. We systematically evaluate JAM across three design axes: (i) alignment objectives, introducing our multimodal Spread Loss that outperforms classic contrastive methods; (ii) the layer depth at which alignment is most effective; and (iii) the role of foundation model scale in representational convergence. Our findings show that JAM reliably induces alignment even across independently trained representations, offering both theoretical insight into the structure of shared semantics and practical guidance for transforming generalist unimodal foundations into specialist multimodal models.

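2506.23552 2026-05-18 cs.CV cs.SD eess.AS updated version

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh

Affiliations * Yonsei University; CineLingo; Seoul National University

AI summary This paper proposes JAM-Flow, a unified framework that generates facial motion and speech together, rather than treating talking head synthesis and speech synthesis as separate tasks. It combines flow matching with a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, in which selective joint attention layers enable cross-modal interaction while preserving each modality's strengths. A single model supports multiple conditioning inputs, such as text, reference audio, and reference motion, enabling tasks including synchronized talking head generation from text and audio-driven animation.

Comments Project page: https://joonghyuk.com/jamflow-web (under review; preprint published on arXiv)

Details
English abstract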

The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs (including text, reference audio, and reference motion), facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. Project page: https://joonghyuk.com/jamflow-web

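2505.21698 2026-05-18 cs.CV updated version

Adapting Foundation Vision-Language Models to Medical Diagnosis via Query-Driven Expert Bridging

Yitong Li, Morteza Ghahremani, Christian Wachinger

Affiliations * Lab for AI in Medical Imaging, Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)

AI summary To address the difficulty of applying foundation vision-language models to medical image diagnosis, this work proposes MedBridge, a lightweight adaptation framework that combines domain alignment, resolution preservation, and multi-label reasoning to mitigate the domain gap between medical and general images. MedBridge uses pretrained vision-language models as multi-view query encoders, injects learnable query tokens for non-destructive domain adaptation, and dynamically integrates heterogeneous models through a mixture-of-experts for multi-label diagnosis, improving both cross-domain and in-domain performance. Experiments show that it outperforms existing methods on multiple chest radiograph benchmarks and is model-agnostic and extensible.

Details
English abstract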

Vision-language foundation models achieve promising performance in natural image classification, yet their direct application to medical imaging is limited by severe domain shifts, resolution mismatches, and the multi-label nature of clinical diagnosis. Training dedicated medical foundation models from scratch, however, is costly and data-intensive. Here, we propose MedBridge, a lightweight adaptation framework that opens a new direction in domain-gap mitigation by jointly combining domain alignment, resolution preservation, and multi-label reasoning via complementary VLM experts for medical image diagnosis. Specifically, MedBridge transforms pretrained VLMs into multi-view query encoders that inject a compact set of learnable query tokens into intermediate layers, enabling non-destructive domain alignment while preserving fine-grained pathological cues via multi-view high-resolution sampling. These query tokens further act as routing signals for a mixture-of-experts, dynamically integrating heterogeneous foundation models for multi-label reasoning without requiring a shared representation space. We evaluated MedBridge on five chest radiograph benchmarks in three key adaptation tasks. MedBridge demonstrates superior performance in both cross-domain generalization (out-of-distribution transfer) and in-domain specialization (same-distribution tuning) settings, yielding a significant 6-15% AUC improvement over state-of-the-art adaptation methods for multi-label thoracic disease diagnosis. Furthermore, MedBridge is model-agnostic and demonstrates broad extensibility across eight diverse VLMs (e.g., CLIP, LLaVA, Qwen-VL, MedGemma), highlighting its ability to flexibly adapt arbitrary foundation models into a powerful medical diagnostic tool. Our code will be released upon acceptance.

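2505.21535 2026-05-18 cs.CV cs.AI cs.LG updated version

FAR: Function-preserving Attention Replacement for IMC-friendly Inference

Yuxin Ren, Maxwell D Collins, Miao Hu, Huanrui Yang

Affiliations * University of Arizona; TetraMem, Inc.

AI summary This paper proposes FAR, a Function-preserving Attention Replacement framework that addresses the inefficiency of transformer inference on ReRAM-based in-memory computing (IMC) devices. FAR replaces the attention mechanism in pretrained DeiT models with a multi-head bidirectional LSTM structure compatible with IMC dataflows and, combined with block-wise knowledge distillation and structured pruning, retains functional equivalence while reducing latency and parameter count. Experiments show that FAR maintains accuracy comparable to the original models on ImageNet and several downstream tasks, indicating its potential for efficient transformer deployment on edge devices.

Comments 7 pages main paper, 6 figures; accepted by GLSVLSI 2026

Details
English abstract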

While transformers dominate modern vision and language models, their attention mechanism remains poorly suited for in-memory computing (IMC) devices due to intensive activation-to-activation multiplications and non-local memory access, leading to substantial latency and bandwidth overhead on ReRAM-based accelerators. To address this mismatch, we propose FAR, a Function-preserving Attention Replacement framework that substitutes all attention in pretrained DeiTs with sequential modules inherently compatible with IMC dataflows. Specifically, FAR replaces self-attention with a multi-head bidirectional LSTM architecture via block-wise distillation to retain functional equivalence while enabling linear-time computation and localized weight reuse. We further incorporate structured pruning on FAR models, enabling flexible adaptation to resource-constrained IMC arrays while maintaining functional fidelity. Evaluations on the DeiT family demonstrate that FAR maintains comparable accuracy to the original attention-based models on ImageNet and multiple downstream tasks with reduced parameters and latency. Further analysis shows that FAR preserves the semantic token relationships learned by attention while improving computational efficiency, highlighting its potential for energy-efficient transformer inference on IMC-based edge accelerators.

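2505.18134 2026-05-18 cs.AI cs.CL cs.CV updated version

VideoGameBench: Can Vision-Language Models complete popular video games?

Alex L. Zhang, Thomas L. Griffiths, Karthik R. Narasimhan, Ofir Press

Affiliations * Princeton University

AI summary VideoGameBench is a benchmark for evaluating the ability of vision-language models (VLMs) to complete popular video games. It comprises 10 classic games from the 1990s, with which models interact in real time using only raw visual input and a description of objectives and controls. The study finds that current frontier VLMs perform poorly on real-time game tasks and cannot complete full games, with inference latency a major limitation. The authors therefore also introduce VideoGameBench Lite, which eases the real-time constraint, and report that even the best models achieve very low completion rates on the benchmark.

Comments 10 pages, 38 pages including supplementary

Details
English abstract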

Vision-language models (VLMs) have achieved strong results on coding and math benchmarks that are challenging for humans, yet their ability to perform tasks that come naturally to humans--such as perception, spatial navigation, and memory management--remains understudied. Real video games are crafted to be intuitive for humans to learn and master by leveraging innate inductive biases, making them an ideal testbed for evaluating such capabilities in VLMs. To this end, we introduce VideoGameBench, a benchmark consisting of 10 popular video games from the 1990s that VLMs directly interact with in real-time. VideoGameBench challenges models to complete entire games with access to only raw visual inputs and a high-level description of objectives and controls, a significant departure from existing setups that rely on game-specific scaffolding and auxiliary information. We keep three of the games secret to encourage solutions that generalize to unseen environments. Our experiments show that frontier vision-language models struggle to progress beyond the beginning of each game. We find inference latency to be a major limitation of frontier models in the real-time setting; therefore, we introduce VideoGameBench Lite, a setting where the game pauses while waiting for the LM's next action. The best performing models, Gemini 2.5 Pro and Claude 3.7 Sonnet, complete only 0.48% of VideoGameBench and 1.6% of VideoGameBench Lite. We hope that the formalization of the human skills mentioned above into this benchmark motivates progress in these research directions.

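2505.07322 2026-05-18 cs.CV updated version

RealRep: Generalized SDR-to-HDR Conversion via Attribute-Disentangled Representation Learning

Li Xu, Siqi Wang, Kepeng Xu, Gang He, Lin Zhang, Weiran Wang, Yu-Wing Tai

Affiliations * Xidian University; Dartmouth College

AI summary This paper proposes RealRep, a generalized SDR-to-HDR conversion framework that improves robustness to diverse real-world SDR content by learning disentangled luminance and chrominance attributes. Its core components are disentangled representation learning, a degradation-aware negative exemplar generation strategy, and DDACMNet, a lightweight two-stage mapping network that dynamically adjusts the mapping according to degradation conditions. Experiments show that RealRep outperforms existing methods in both generalization and perceptually faithful HDR color gamut reconstruction.

Comments Published at AAAI'26 (Oral): The Annual AAAI Conference on Artificial Intelligence

Details
English abstract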

High-Dynamic-Range Wide-Color-Gamut (HDR-WCG) technology is becoming increasingly widespread, driving a growing need for converting Standard Dynamic Range (SDR) content to HDR. Existing methods primarily rely on fixed tone mapping operators, which struggle to handle the diverse appearances and degradations commonly present in real-world SDR content. To address this limitation, we propose a generalized SDR-to-HDR framework that enhances robustness by learning attribute-disentangled representations. Central to our approach is Realistic Attribute-Disentangled Representation Learning (RealRep), which explicitly disentangles luminance and chrominance components to capture intrinsic content variations across different SDR distributions. Furthermore, we design a Luma-/Chroma-aware negative exemplar generation strategy that constructs degradation-sensitive contrastive pairs, effectively modeling tone discrepancies across SDR styles. Building on these attribute-level priors, we introduce the Degradation-Domain Aware Controlled Mapping Network (DDACMNet), a lightweight, two-stage framework that performs adaptive hierarchical mapping guided by a control-aware normalization mechanism. DDACMNet dynamically modulates the mapping process via degradation-conditioned features, enabling robust adaptation across diverse degradation domains. Extensive experiments demonstrate that RealRep consistently outperforms state-of-the-art methods in both generalization and perceptually faithful HDR color gamut reconstruction.

2505.06982 2026-05-18 cs.CV Version update

Decentralized LoRA augmented transformer with multi-scale feature learning for secured eye diagnosis

Md. Naimur Asif Borno, Md Sakib Hossain Shovon, MD Hanif Sikder, Iffat Firozy Rimi, Tahani Jaser Alahmadi, Mohammad Ali Moni

Affiliations * Research Assistant, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Mechatronics Engineering, Rajshahi University of Engineering & Technology, Rajshahi 6204, Bangladesh; Researcher, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Department of Computer Science, American International University Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science, University of South Asia-Bangladesh, Dhaka 1216, Bangladesh; Department of Computer Science & Engineering, Daffodil International University, Dhaka, Bangladesh; Department of Information Systems, College of Computer & Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, Saudi Arabia; Faculty of Health, Medicine & Behavioural Sciences, The University of Queensland, 308 Queen St, Brisbane City, QLD 4000, Queensland, Australia; Cyber Futures Institute, Charles Sturt University, 308 Queen St, Bathurst NSW, Australia

AI summary This paper proposes a decentralized eye-disease diagnosis framework based on an improved image Transformer (DeiT), aiming to address data imbalance, privacy protection, spatial feature diversity, and clinical interpretability in ophthalmic disease diagnosis from medical images. The method combines multi-scale feature learning, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning, improving computational efficiency, data privacy protection, and diagnostic performance. Experiments show that the framework outperforms traditional convolutional neural networks and existing Transformer models on multiple benchmark datasets and provides interpretable diagnostic evidence via Grad-CAM++, laying a foundation for secure, scalable AI systems for ophthalmic diagnosis.

Comments Published at Knowledge-Based Systems

Details
English abstract

Accurate and privacy-preserving diagnosis of ophthalmic diseases remains a critical challenge in medical imaging, particularly given the limitations of existing deep learning models in handling data imbalance, data privacy concerns, spatial feature diversity, and clinical interpretability. This paper proposes a novel Data-efficient Image Transformer (DeiT) based framework that integrates context-aware multi-scale patch embedding, Low-Rank Adaptation (LoRA), knowledge distillation, and federated learning to address these challenges in a unified manner. The proposed model captures both local and global retinal features by leveraging multi-scale patch representations with local and global attention mechanisms. LoRA integration enhances computational efficiency by reducing the number of trainable parameters, while federated learning ensures secure, decentralized training without compromising data privacy. A knowledge distillation strategy further improves generalization in data-scarce settings. Comprehensive evaluations on two benchmark datasets, OCTDL and the Eye Disease Image Dataset, demonstrate that the proposed framework consistently outperforms both traditional CNNs and state-of-the-art transformer architectures across key metrics including AUC, F1 score, and precision. Furthermore, Grad-CAM++ visualizations provide interpretable insights into model predictions, supporting clinical trust. This work establishes a strong foundation for scalable, secure, and explainable AI applications in ophthalmic diagnostics.
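The abstract's claim that LoRA reduces the number of trainable parameters can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer width (768) and rank (8) are assumed values, not figures reported by the paper, and the function names are hypothetical. It compares fully fine-tuning one dense weight matrix against training a rank-r pair of LoRA factors for the same layer.

```python
def full_finetune_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters for LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


# Assumed, illustrative sizes (not taken from the paper).
d_in = d_out = 768
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4f}")
```

Under these assumed sizes, the LoRA factors hold 1/48 of the parameters of the full matrix; the saving grows as the layer gets wider relative to the rank.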

2504.21850 2026-05-18 cs.CV Version update

Visual Compositional Tuning

Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Esin Tureci, Olga Russakovsky

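Affiliations * Princeton University; Meta AI

AI summary This paper studies how sample complexity affects informativeness in visual instruction tuning (VIT) datasets and proposes COMPACT, a synthetic data generation method that substantially improves data efficiency by combining multiple atomic visual capabilities in a single training example. Experiments show that with 90% less training data, COMPACT matches or exceeds the performance of full-data training and performs well across multiple vision-language benchmarks. The method offers a scalable way to improve training efficiency for vision-language tasks.

Comments See the project website at this [URL](https://princetonvisualai.github.io/compact/)

Details
English abstract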

Visual instruction tuning (VIT) datasets have grown rapidly in scale, yet the informativeness of individual training samples has largely been overlooked. Recent dataset selection methods have shown that a small fraction of such datasets enriched with informative samples can lead to efficient finetuning of Multimodal Large Language Models. In this work, we explore the impact of sample complexity on informative data curation and introduce COMPACT (COMPositional Atomic-to-complex Visual Compositional Tuning), a compositional VIT data recipe that scales training sample complexity by combining multiple atomic visual capabilities in a single training example. Concretely, we synthesize rich and informative text questions for each image, allowing us to significantly reduce the number of training examples required for effective VIT. COMPACT demonstrates superior data efficiency compared to existing data reduction methods. When applied to the LLaVA-665K VIT dataset, COMPACT reduces the data budget by 90% while still achieving 100.2% of the full VIT performance (compared to only 97.5% by the state-of-the-art method) across eight multimodal benchmarks. Furthermore, training on the COMPACT data outperforms training on the full-scale VIT data on particularly complex benchmarks such as MM-Vet (+8.6%) and MMStar (+2.9%). COMPACT offers a scalable and efficient synthetic data generation recipe for improving performance on vision-language tasks.

2504.09544 2026-05-18 cs.LG cs.CE cs.CV Version update

Integrating chemical structures as treatments improves representations of microscopy images for morphological profiling

Yemin Yu, Emre Hayir, Neil Tenenholtz, Lester Mackey, Ying Wei, David Alvarez-Melis, Ava P. Amini, Alex X. Lu

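Affiliations * Department of Computer Science, City University of Hong Kong; Microsoft Research; Department of Computer Science, Zhejiang University

AI summary This study proposes MICON, a framework that incorporates chemical structure information into self-supervised pre-training to improve representations of high-throughput microscopy images for more accurate morphological profiling. It argues that modeling compound structures as "treatments" that induce changes in cell phenotype significantly outperforms classical hand-crafted features and existing deep learning methods. Experiments show that representation learning with chemical information performs better at identifying drug effects across experimental replicates and data sources, pointing to a new direction for representation learning on multimodal microscopy screening data.

Comments 24 pages

Details
English abstract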

Recent advances in self-supervised deep learning have improved our ability to quantify cellular morphological changes in high-throughput microscopy screens, a process known as morphological profiling. However, most current methods learn only from images, even though many screens are inherently multimodal: they involve a chemical or genetic perturbation as well as an image-based readout. We hypothesized that incorporating chemical compound structures during self-supervised pre-training could improve learned representations of images from high-throughput microscopy screens. We introduce a representation learning framework, MICON (Molecular-Image Contrastive Learning), that models chemical compounds as treatments that induce transformations of cell phenotypes. MICON significantly outperforms classical hand-crafted features such as CellProfiler and existing deep-learning-based representation learning methods in challenging evaluation settings where models must identify reproducible effects of drugs across independent replicates and data-generating centers. We demonstrate that incorporating chemical compound information into the learning process provides small but consistent improvements in performance, and that modeling compounds specifically as treatments outperforms approaches that directly align images and compounds in a single representation space. Our findings point to a new direction for representation learning in morphological profiling, suggesting that methods should explicitly account for the multimodal nature of microscopy screening data.

2504.05451 2026-05-18 cs.CV Version update

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes

Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman

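Affiliations * UT Austin; Meta AI; Stanford University; Northeastern University

AI summary ViewBridge is a framework for learning view-invariant activity representations, designed to handle the extreme viewpoint changes found in in-the-wild video. It uses knowledge distillation to preserve action semantics, combined with a curriculum learning strategy that gradually increases viewpoint difficulty for smooth adaptation. Experiments show that ViewBridge outperforms existing methods on two tasks across multiple datasets.

Details
English abstract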

Traditional methods for view-invariant learning rely on controlled multi-view training data with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce ViewBridge, a framework for learning rich video representations in the presence of severe view occlusions. We introduce a knowledge distillation objective that preserves action-centric semantics, together with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. To sort training video segments for the proposed curriculum, we define a geometry-based metric that reflects their likely occlusion level. While training leverages multi-view data, at inference time the input is an uncalibrated, single-viewpoint video. Evaluating our approach on two tasks, temporal keystep grounding and fine-grained keystep recognition, we outperform SOTA approaches across three datasets (Ego-Exo4D, LEMMA, EPFL-Smart-Kitchen-30). Project page: https://vision.cs.utexas.edu/projects/learning_view_distill/

2503.02597 2026-05-18 cs.CV cs.AI Version update

Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs

Wei-Yao Wang, Zhao Wang, Helen Suzuki, Yoshiyuki Kobayashi

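Affiliations * Sony Group Corporation, Tokyo, Japan

AI summary Multimodal large language models (MLLMs) have recently made significant progress in understanding and reasoning over multimodal information, but alignment between the vision and language modalities remains a key challenge. Working at the level of model architecture, this paper proposes modality-mutual attention (MMA), which extends causal attention to cross-modal mutual attention so that the image modality can attend to the text modality, improving the model's understanding of its inputs. The method achieves superior performance on multiple multimodal understanding benchmarks without adding parameters, and is generic and scalable.

Comments ICML 2026. Code is available at https://github.com/sony/aki

Details
English abstract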

Recent Multimodal Large Language Models (MLLMs) have demonstrated significant progress in perceiving and reasoning over multimodal inquiries, ushering in a new research era for foundation models. However, vision-language misalignment in MLLMs has emerged as a critical challenge, where the textual responses generated by these models are not factually aligned with the given text-image inputs. Existing efforts to address vision-language misalignment have focused on developing specialized vision-language connectors or leveraging visual instruction tuning from diverse domains. In this paper, we tackle this issue from a fundamental yet unexplored perspective by revisiting the core architecture of MLLMs. MLLMs are typically built on decoder-only LLMs with a causal attention mechanism, which limits the ability of earlier modalities (e.g., images) to incorporate information from later modalities (e.g., text). To address this problem, we propose an MLLM that unlocks causal attention into our proposed modality-mutual attention (MMA), enabling image tokens to attend to text tokens. This simple yet effective design allows MMA to achieve state-of-the-art performance on 12 multimodal understanding benchmarks (+6.2% on average across 3 LLM backbones) without introducing additional parameters. Our MMA design is intended to be generic, allowing for applications across various modalities, and scalable to accommodate diverse multimodal scenarios.
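The difference between a causal mask and a mask that lets image tokens see later text tokens can be sketched in a few lines. This is a minimal illustration based only on the abstract's description, not the authors' implementation: the function names and the `"image"`/`"text"` labels are hypothetical, and the paper's exact masking rule may differ. Here `mask[i][j]` is `True` when query token `i` may attend to key token `j`.

```python
def causal_mask(n):
    """Standard causal mask: token i attends only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def modality_mutual_mask(modalities):
    """Causal mask, plus image queries unlocked to attend to every text key.

    `modalities` is a list of "image"/"text" labels, one per token
    (hypothetical labels chosen for this sketch).
    """
    n = len(modalities)
    mask = causal_mask(n)
    for i, query_modality in enumerate(modalities):
        if query_modality != "image":
            continue  # text queries keep plain causal attention
        for j, key_modality in enumerate(modalities):
            if key_modality == "text":
                mask[i][j] = True  # image token may look at (later) text
    return mask


tokens = ["image", "image", "text", "text"]
mma = modality_mutual_mask(tokens)
for row in mma:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this sketch only the image rows change: image tokens still cannot see later image tokens, and text tokens remain strictly causal, so the change adds no parameters.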

2406.18944 2026-05-18 cs.CV cs.AI cs.CR Version update

Rethinking and Red-Teaming Protective Perturbation in Personalized Diffusion Models

Yixin Liu, Ruoxi Chen, Xun Chen, Lichao Sun

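Affiliations * Lehigh University; Lehigh University, Computer Science & Engineering, Bethlehem, PA, USA; Independent Researcher; Independent Researcher, Fremont, California, USA

AI summary Personalized diffusion models (PDMs) excel at generating images of specific people from small amounts of data, but they are highly sensitive to small adversarial perturbations, so performance degrades significantly when they are fine-tuned on corrupted data. This paper analyzes PDM fine-tuning through the lens of shortcut learning, shows that adversarial perturbations cause a latent semantic misalignment in the CLIP embedding space, and proposes a systematic countermeasure framework comprising image purification and contrastive decoupling learning, which effectively improves robustness and generalization.

Comments Code is available at https://github.com/liuyixin-louis/DiffShortcut

Details
English abstract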

Personalized diffusion models (PDMs) have become prominent for adapting pre-trained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to red-team the protective perturbation to break the protection but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic red-teaming framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic content in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers shortcut learning vulnerabilities in PDMs but also provides a thorough evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its advantages over existing purification methods and its robustness against adaptive perturbations.

2403.13805 2026-05-18 cs.CV cs.AI cs.LG Version update

RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition

Ziyu Liu, Zeyi Sun, Yuhang Zang, Wei Li, Pan Zhang, Xiaoyi Dong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang

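Affiliations * Shanghai Jiao Tong University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.; Nanyang Technological University

AI summary This paper proposes RAR, a method for improving multimodal large language models (MLLMs) on fine-grained and few-shot visual recognition. RAR combines CLIP's multimodal retrieval ability with the rich knowledge of MLLMs: it builds a multimodal retriever to extend the model's context window and, at inference time, retrieves relevant category information for the MLLM to rank and predict from. The method addresses the performance drop MLLMs suffer when facing large numbers of categories and achieves significant gains on multiple fine-grained and zero-shot recognition benchmarks.

Comments Project: https://github.com/Liuziyu77/RAR

Details
English abstract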

CLIP (Contrastive Language-Image Pre-training) uses contrastive learning from noisy image-text pairs to excel at recognizing a wide array of candidates, yet its focus on broad associations hinders its precision in distinguishing subtle differences among fine-grained items. Conversely, Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories, thanks to their substantial knowledge from pre-training on web-level corpora. However, the performance of MLLMs declines as the number of categories increases, primarily due to growing complexity and the constraints of a limited context window. To synergize the strengths of both approaches and enhance few-shot/zero-shot recognition on datasets characterized by extensive and fine-grained vocabularies, this paper introduces RAR, a Retrieving And Ranking augmented method for MLLMs. We first establish a multi-modal retriever based on CLIP to create and store explicit memory for different categories beyond the immediate context window. During inference, RAR retrieves the top-k similar results from the memory and uses MLLMs to rank them and make the final predictions. Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base, significantly boosting accuracy across a range of vision-language recognition tasks. Notably, our approach demonstrates a significant improvement in performance on 5 fine-grained visual recognition benchmarks, 11 few-shot image recognition datasets, and 2 object detection datasets under the zero-shot recognition setting.
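The retrieve-then-rank flow described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the RAR codebase: the two-dimensional vectors stand in for CLIP embeddings, the category names are made up, and `rerank_fn` is a hypothetical placeholder for the MLLM ranking call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, memory, k):
    """Return the k category names whose stored embedding is closest to the query."""
    scored = sorted(memory.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in scored[:k]]


def rank_candidates(candidates, rerank_fn):
    """Hand the shortlist to a ranking model; rerank_fn is a stand-in for the MLLM."""
    return rerank_fn(candidates)


# Toy memory: category name -> embedding (made-up values, not real CLIP features).
memory = {"sparrow": [1.0, 0.0], "finch": [0.9, 0.3], "truck": [0.0, 1.0]}
query = [1.0, 0.2]

shortlist = retrieve_top_k(query, memory, k=2)
print(shortlist)
```

The point of the two stages is that the ranking model only ever sees the k shortlisted categories, so the size of the full vocabulary no longer has to fit in its context window.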

2212.12130 2026-05-18 cs.CV Version update

Learning to Detect and Segment for Open Vocabulary Object Detection

Tao Wang, Nan Li

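Affiliations * Sichuan University; University of California San Diego

AI summary This study addresses detection and segmentation in open-vocabulary object detection and proposes CondHead, a dynamic network design that improves generalization to novel object categories. The core idea is to conditionally parameterize the network heads, using semantic embeddings to guide the model toward class-specific knowledge and thereby more accurate box regression and segmentation prediction. The method significantly improves existing open-vocabulary detection methods at very small computational overhead.

Comments We apologize that author Nan Li was not on the published version due to the CVPR'23 policy that authors cannot be added after the abstract deadline

Details
English abstract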

Open vocabulary object detection has been greatly advanced by the recent development of vision-language pretrained models, which help recognize novel objects with only semantic categories. Prior works mainly focus on transferring knowledge to object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize box regression and mask segmentation to the open vocabulary setting. The core idea is to conditionally parameterize the network heads on semantic embeddings, so that the model is guided by class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads, the dynamically aggregated head and the dynamically generated head. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such a conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to state-of-the-art open vocabulary object detection methods with very minor overhead; e.g., it surpasses a RegionCLIP model by 3.0 detection AP on novel categories with only 1.1% more computation.
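The "dynamically aggregated head" can be pictured as a weighted mix of static expert heads, with the mixing weights computed from the class's semantic embedding. The sketch below is one plausible reading of the abstract, not the paper's architecture: the softmax gating, the `gate` matrix, and all sizes and values are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_head(semantic_embedding, gate, expert_weights):
    """Mix static expert heads with coefficients conditioned on the embedding.

    gate: one row per expert; its dot product with the embedding gives that
          expert's logit (assumed gating form, not confirmed by the source).
    expert_weights: one flat parameter vector per static expert head.
    """
    logits = [sum(g * e for g, e in zip(row, semantic_embedding)) for row in gate]
    coeffs = softmax(logits)
    dim = len(expert_weights[0])
    head = [sum(c * w[d] for c, w in zip(coeffs, expert_weights)) for d in range(dim)]
    return coeffs, head


# Illustrative values only: two experts, two-dimensional embedding and head.
experts = [[1.0, 0.0], [3.0, 2.0]]
neutral_gate = [[0.0, 0.0], [0.0, 0.0]]
embedding = [0.5, -0.5]

coeffs, head = aggregate_head(embedding, neutral_gate, experts)
print(coeffs, head)
```

With the all-zero gate used here every expert gets equal weight, so the aggregated head is simply the average of the experts; a learned gate would shift that mix per category.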