2606.02639 2026-06-03 eess.IV cs.AI cs.CV (updated version)

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Spoorthi M, Suja Palaniswamy

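Institution: Amrita University

AI summary: The paper identifies and resolves a problem with TensoRF's default density shift when applied to X-ray attenuation fields, and proposes AReT, an anatomy-regularized tensorial radiance field framework that achieves stable volumetric reconstruction of lung nodules from only three orthogonal X-ray projections, with high accuracy on the LIDC-IDRI dataset.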

Abstract

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
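The density-shift failure mode can be illustrated with a minimal sketch. It assumes the density activation is softplus(raw + shift), which is how TensoRF's softplus option is commonly implemented; the function names and the probe value raw = 0 are illustrative choices, not taken from the paper.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_grad(x: float) -> float:
    """Derivative of softplus, which is the logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def density(raw: float, shift: float) -> float:
    """Assumed density activation: softplus applied to the shifted raw feature."""
    return softplus(raw + shift)


def density_grad(raw: float, shift: float) -> float:
    """Gradient of the density with respect to the raw feature."""
    return softplus_grad(raw + shift)


if __name__ == "__main__":
    for shift in (-10.0, 0.0):
        print(f"shift={shift:6.1f}  density={density(0.0, shift):.3e}  "
              f"grad={density_grad(0.0, shift):.3e}")
```

Under this assumption, a shift of -10 scales the gradient at raw = 0 by the sigmoid of -10 (about 4.5e-5), roughly four orders of magnitude below the value of 0.5 obtained with a shift of 0. That is consistent with the abstract's claim that the default shift suppresses density gradients.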

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD (updated version)

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

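Institution: Yet Another AI

AI summary: The paper studies whether audio, images, and video can share a unified wavelet token schema. Using a continuous-token model built on Haar DWT/IDWT, it validates the feasibility of a unified token schema on several datasets and analyzes the effects of latent capacity and metadata.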

Comments 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
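As a rough illustration of the frontend and the non-parametric baseline mentioned above, the sketch below implements a one-level orthonormal Haar DWT/IDWT on a 1D signal, keeps the highest-energy coefficients at a fixed keep ratio, and measures PSNR. The toy signal, the helper names, and the choice of "approximation coefficients only" as the uniform baseline are assumptions made for illustration; the paper's actual token layout and selection rules may differ.

```python
import math


def haar_dwt(x):
    """One-level orthonormal Haar DWT of an even-length sequence."""
    if len(x) % 2:
        raise ValueError("length must be even")
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def keep_top_energy(coeffs, keep_ratio):
    """Zero all but the largest-energy coefficients (ties broken by index)."""
    k = max(1, round(len(coeffs) * keep_ratio))
    order = sorted(range(len(coeffs)), key=lambda i: -coeffs[i] ** 2)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


SIGNAL = [0.1, 0.1, 0.9, 0.8, 0.2, 0.2, 0.5, 0.1]  # toy data, not from the paper
APPROX, DETAIL = haar_dwt(SIGNAL)
HALF = len(APPROX)

# Energy-based selection over all coefficient tokens at a 50% keep ratio.
SPARSE = keep_top_energy(APPROX + DETAIL, keep_ratio=0.5)
RECON_ENERGY = haar_idwt(SPARSE[:HALF], SPARSE[HALF:])

# Baseline at the same rate: keep only the approximation band.
RECON_UNIFORM = haar_idwt(APPROX, [0.0] * HALF)

if __name__ == "__main__":
    print("energy selection PSNR :", psnr(SIGNAL, RECON_ENERGY))
    print("approx-only PSNR      :", psnr(SIGNAL, RECON_UNIFORM))
```

Because the transform is orthonormal, the reconstruction error equals the energy of the discarded coefficients, which is why keeping the largest-energy coefficients cannot do worse than any other selection at the same rate.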

2606.02937 2026-06-03 q-bio.NC cs.CV (updated version)

BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting

Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

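Institutions: Columbia University; Cold Spring Harbor; Stanford University

AI summary: Proposes BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled multi-view video through 3D Gaussian splatting reconstruction and animal segmentation, and transfers effectively to novel view synthesis, multi-view pose estimation, and neural encoding.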

Abstract

Multi-view video recordings are increasingly used to capture the 3D movements of animals in experimental settings, yet extracting rich 3D representations from these recordings remains challenging. Supervised pose estimation requires extensive manual annotation, while general-purpose 3D reconstruction models trained on generic scene datasets fail on the specialized imagery and sparse-view setting of laboratory experiments. We address these limitations with BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled, calibrated multi-view video. BEAST3D uses a vision transformer to predict 3D Gaussian splats that reconstruct held-out views through differentiable rendering, while simultaneously segmenting the animal from the background. BEAST3D reconstructs 3D structure with as few as four views by conditioning directly on known camera parameters--unlike general-purpose models, which must estimate camera geometry from dense overlapping viewpoints that are seldom available in lab settings. Through comprehensive evaluation across four species, we demonstrate that BEAST3D produces rich, viewpoint-invariant features that transfer effectively to three downstream tasks: novel view synthesis, which validates the quality of the learned 3D representations; multi-view pose estimation, which provides the sparse keypoint trajectories widely used in behavioral analysis; and neural encoding, which relates 3D behavioral features to simultaneously recorded neural activity. BEAST3D thus establishes a versatile framework for behavioral analysis that leverages 3D structure in modern multi-view laboratory recordings.

2606.03994 2026-06-03 cs.CV cs.RO (updated version)

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

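Institution: Seoul National University

AI summary: Proposes SimuScene, a compositional 3D reconstruction pipeline that integrates physical simulation into shape and layout estimation, using a physics engine to diagnose reconstruction errors and drive corrections, producing stable, simulation-ready scenes.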

Comments Project Page: https://snuvclab.github.io/SimuScene/

Abstract

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

2606.03992 2026-06-03 cs.CV cs.RO (updated version)

Exploring Easy Boosts for Lidar Semantic Scene Completion

Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy, Renaud Marlet, Raoul de Charette

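Institutions: Inria, France; valeo.ai, France; ETH Zurich, Switzerland; LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris, France

AI summary: The paper studies "free lunch" strategies that require no complex architectural redesign: adding semantic pseudo-labels and visibility information to the input point cloud significantly improves lidar semantic scene completion, allowing older models to compete with or even surpass state-of-the-art systems.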

Comments Accepted to ICIP 2026

2606.03990 2026-06-03 cs.LG cs.CL cs.CV (updated version)

Neuron Populations Exhibit Divergent Selectivity with Scale

Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

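Institutions: UC Berkeley; TTIC

AI summary: By analyzing the distribution and properties of Rosetta Neurons across models of different scales, the paper finds that their number grows according to a sublinear power law and that their selectivity increases with scale, while non-Rosetta neurons remain less selective; an analytical model balancing feature utility against neuron capacity explains this polarization.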

Comments Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
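The statement that a sublinear power law grows in absolute number while shrinking as a fraction follows from simple arithmetic: if the count is c * N^alpha with 0 < alpha < 1, the fraction is c * N^(alpha - 1), which decreases in N. The sketch below checks this on synthetic values and recovers the exponent with a log-log least-squares fit. The coefficient, the exponent 0.6, and the neuron counts are made-up placeholders, and treating the total neuron count as the size variable is an assumption.

```python
import math


def rosetta_count(total_neurons: float, coeff: float, alpha: float) -> float:
    """Hypothetical power-law count: coeff * total_neurons ** alpha."""
    return coeff * total_neurons ** alpha


def fit_exponent(sizes, counts):
    """Least-squares slope of log(count) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic placeholder values, not measurements from the paper.
SIZES = [1e4, 1e5, 1e6, 1e7]
COUNTS = [rosetta_count(n, coeff=2.0, alpha=0.6) for n in SIZES]
FRACTIONS = [c / n for c, n in zip(COUNTS, SIZES)]

if __name__ == "__main__":
    for n, c, f in zip(SIZES, COUNTS, FRACTIONS):
        print(f"neurons={n:.0e}  count={c:10.1f}  fraction={f:.5f}")
    print("fitted exponent:", fit_exponent(SIZES, COUNTS))
```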

2606.03989 2026-06-03 cs.CV (updated version)

PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

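Institution: Department of Computing, Imperial College London

AI summary: Proposes a pixel-level distributed visual odometry and depth estimation method based on Gaussian Belief Propagation, with a keyframe anchoring mechanism that enables parallel on-sensor computation.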

Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: https://www.shinjeongkim.com/pixvod/
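The consensus step at the heart of Gaussian Belief Propagation can be sketched in one dimension: a variable's belief is the product of its incoming Gaussian messages, which in information form means precisions add and precision-weighted means add. The sketch below shows only this fusion rule for a scalar variable; the function name and the example messages are illustrative, and the paper's actual factor graph over camera motion and per-pixel depth is not reproduced here.

```python
def fuse_gaussian_messages(messages):
    """Fuse scalar Gaussian messages given as (mean, variance) pairs.

    In information form each message contributes precision 1/variance and
    information mean/variance; the fused belief sums both.
    """
    if not messages:
        raise ValueError("need at least one message")
    precision = sum(1.0 / var for _, var in messages)
    information = sum(mean / var for mean, var in messages)
    return information / precision, 1.0 / precision


if __name__ == "__main__":
    # Two equally confident estimates of a shared quantity meet halfway.
    print(fuse_gaussian_messages([(0.0, 1.0), (2.0, 1.0)]))
    # A more confident message (smaller variance) pulls the belief toward it.
    print(fuse_gaussian_messages([(0.0, 1.0), (3.0, 0.5)]))
```

Because the rule is a sum over messages, each pixel can accumulate its neighbours' contributions independently, which is what makes this style of inference attractive for parallel hardware.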

2606.03986 2026-06-03 cs.CV (updated version)

NewtPhys: Do Foundation Models Understand Newtonian Physics?

Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

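Institutions: Inria; Valeo.ai; MBZUAI

AI summary: The paper presents NewtPhys, a 4D physics-annotated dataset built from multi-view images of real scenes and physics-based simulation, used to systematically evaluate foundation models on low-level Newtonian physics reasoning and to reveal the limitations of existing models.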

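AI abstract (translated from Chinese)

Prior work evaluates physical reasoning in foundation models using synthetic or semi-synthetic scenes and visual question answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity needed to assess genuine low-level Newtonian understanding. We introduce NewtPhys, a 4D physics-annotated dataset built from multi-view images of real scenes, with physics-based simulation. The dataset provides dense, fine-grained annotations across time steps, including 3D forces and per-pixel amodal quantities covering physics, tracking, semantics, and geometry, bridging the gap between simple synthetic settings and real visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, as well as 10 VFMs, revealing limitations in low-level physical reasoning. Beyond benchmarking, our dataset supports future research on physics-grounded vision and the development of next-generation physics-aware evaluation. Code and dataset are available at this URL.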

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

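Institutions: Carnegie Mellon University; Mitsubishi Electric Research Laboratories; Harvard University

AI summary: Proposes the VLESA framework, which monitors human activity from egocentric video and uses a GRPO-trained goal-conditioned safety Q-filter for real-time safety intervention, achieving higher intervention accuracy on the ASIMOV-2.0 benchmark.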

Comments 18 pages, 5 tables, 5 figures

Abstract

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.
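The idea of goal-conditioned constrained decoding can be sketched as a filter over candidate actions: each candidate is scored for safety with respect to the inferred goal, unsafe ones are dropped, and the best remaining candidate is chosen. Everything below is hypothetical: the lookup table stands in for the learned Q-filter, and the function names, threshold, goals, and actions are invented for illustration and are not taken from the VLESA code.

```python
# Hypothetical safety scores keyed by (goal, action); a learned Q-filter
# would produce these values instead of a lookup table.
TOY_Q_TABLE = {
    ("prepare dinner", "pick up knife"): 0.9,
    ("prepare dinner", "pick up spoon"): 0.95,
    ("play with child", "pick up knife"): 0.1,
    ("play with child", "pick up spoon"): 0.9,
}


def toy_q_safety(goal: str, action: str) -> float:
    """Stand-in for a goal-conditioned safety Q-filter (unknown pairs score 0)."""
    return TOY_Q_TABLE.get((goal, action), 0.0)


def constrained_choice(candidates, goal, q_safety, threshold=0.5):
    """Pick the highest-scoring candidate whose safety score passes the threshold.

    candidates maps each action to the predictor's score for it.
    Returns None when no candidate is considered safe for this goal.
    """
    safe = {a: s for a, s in candidates.items() if q_safety(goal, a) >= threshold}
    if not safe:
        return None
    return max(safe, key=safe.get)


CANDIDATES = {"pick up knife": 0.7, "pick up spoon": 0.3}

if __name__ == "__main__":
    for goal in ("prepare dinner", "play with child"):
        print(goal, "->", constrained_choice(CANDIDATES, goal, toy_q_safety))
```

The same action receives different safety scores under different goals, which mirrors the intent-dependent safety setting described in the abstract, and the filter is applied at decoding time without changing the predictor.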

2606.03951 2026-06-03 cs.CV (updated version)

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

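Institution: Show Lab, National University of Singapore

AI summary: Proposes the Demo2Tutorial framework, which parses human experience captured in screen recordings and interaction logs into structured multimodal tutorials for human learning and GUI-agent training; experiments show that the generated tutorials surpass human-authored ones in quality and improve task efficiency.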

Comments Accepted by CVPR 2026

Abstract

Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.

2606.03925 2026-06-03 cs.CV (updated version)

Adaptive Causal Alignment for High-Confidence Adversarial Training

Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

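Institutions: Department of Artificial Intelligence, Xiamen University; Department of Computer Science, Emory University; College of Information Science and Technology, Jinan University

AI summary: To address models' over-reliance on non-causal background correlations in high-confidence adversarial training, the paper proposes the HICAT framework, which achieves causal alignment through a learnable background-bias estimator and an adaptive debiasing mechanism, improving robust generalization.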

Abstract

Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

2606.03921 2026-06-03 cs.CV (updated version)

SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

Qingpo Wuwu, Xiaobao Wei, Peng Chen, Nan Huang, Zhongyu Zhao, Hao Wang, Ming Lu, Ningning Ma, Shanghang Zhang

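Institutions: Peking University; Chinese Academy of Sciences; University of Illinois Urbana-Champaign; Autonomous Driving Development, NIO

AI summary: To address the redundancy of Gaussian primitives in street scene reconstruction, the paper proposes a framework combining node-based learnable pruning with background compression, achieving a compression ratio of up to 80% with minimal quality loss.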

Abstract

While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on Waymo and nuScenes demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: https://sparsestreet.github.io/.

2606.03904 2026-06-03 cs.LG cs.CV (updated version)

MAdam: Metric-Aware Multi-Objective Adam

Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang, Ruining Deng, Heejong Kim, Johannes C. Paetzold, Mert R. Sabuncu

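Institutions: Cornell Tech; Weill Cornell Medicine; Delft University of Technology

AI summary: Proposes MAdam, which preconditions the reconciled direction in multi-objective optimization by preference-conditioned curvature, resolving the weighting mismatch and geometric mismatch between Adam and the solver, and consistently improving performance in multi-task learning, Pareto-front recovery, and other tasks.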

Abstract

Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma & Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.
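A simpler, well-known property of Adam gives some intuition for the weighting mismatch: because the update divides the first moment by the square root of the second moment, multiplying a coordinate's gradients by a constant weight leaves the update almost unchanged. The sketch below demonstrates this for a scalar parameter using the standard Adam recursion with bias correction. It covers only the constant-weight case, not the time-varying preference analysis or the MAdam wrapper from the paper, and the gradient values are arbitrary.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step Adam parameter updates for a scalar gradient sequence."""
    m = 0.0
    v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


def sgd_updates(grads, lr=1e-3):
    """Plain gradient descent updates, for comparison."""
    return [-lr * g for g in grads]


BASE_GRADS = [0.5, -0.2, 0.8, 0.3]                 # arbitrary example values
WEIGHTED_GRADS = [10.0 * g for g in BASE_GRADS]    # same objective, weight 10

if __name__ == "__main__":
    print("Adam, weight 1 :", adam_updates(BASE_GRADS))
    print("Adam, weight 10:", adam_updates(WEIGHTED_GRADS))
    print("SGD,  weight 1 :", sgd_updates(BASE_GRADS))
    print("SGD,  weight 10:", sgd_updates(WEIGHTED_GRADS))
```

Under plain gradient descent the weight of 10 scales every step by 10, whereas the two Adam sequences agree up to the small eps term, so a preference weight applied to the loss does not translate directly into a proportionally larger step.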

2606.03903 2026-06-03 cs.CV (updated version)

CoralBay: A Self-Supervised CT Foundation Model

Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang

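Institution: kaiko.ai

AI summary: Proposes the CoralBay framework, which learns multi-scale features with a hierarchical 3D Swin backbone and self-distillation, enabling self-supervised pretraining on volumetric CT data and improving performance on downstream radiological tasks.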

Abstract

Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

2606.03879 2026-06-03 cs.CV cs.AI (updated version)

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

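Institutions: Tsinghua University; Tencent; University of Macau; University of Science and Technology Beijing

AI summary: By retraining all 31 non-empty subsets, the paper proposes a Capacity-Necessity decomposition and a pre-projector rank analysis, showing that encoder roles in multi-encoder vision-language models are not simply additive, and gives principles for optimal pairing.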

Abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
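The two axes defined above reduce to simple lookups once every subset has a score: five encoders give 2^5 - 1 = 31 non-empty subsets, Capacity is the score of the singleton subset, and Necessity is the full-pool score minus the score with that encoder removed. The sketch below enumerates the subsets and computes both quantities. The encoder names and the toy scoring rule (best member plus a small bonus from the others) are invented placeholders, not results from the paper.

```python
from itertools import combinations

ENCODERS = ["enc_a", "enc_b", "enc_c", "enc_d", "enc_e"]  # hypothetical names

# Made-up standalone strengths used only to generate a toy score table.
TOY_STRENGTH = {"enc_a": 60.0, "enc_b": 58.0, "enc_c": 50.0,
                "enc_d": 45.0, "enc_e": 40.0}


def non_empty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def toy_score(subset):
    """Placeholder benchmark score: best member plus 10% of the rest."""
    values = [TOY_STRENGTH[e] for e in subset]
    best = max(values)
    return best + 0.1 * (sum(values) - best)


SCORES = {s: toy_score(s) for s in non_empty_subsets(ENCODERS)}


def capacity(scores, encoder):
    """Score the encoder reaches on its own."""
    return scores[frozenset([encoder])]


def necessity(scores, encoder, pool):
    """Drop in score when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {encoder}]


if __name__ == "__main__":
    print("subsets evaluated:", len(SCORES))
    for enc in ENCODERS:
        print(enc, capacity(SCORES, enc), round(necessity(SCORES, enc, ENCODERS), 3))
```

In a real study each entry of the score table would come from retraining the model on that subset, which is what makes the full enumeration expensive.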

2606.03877 2026-06-03 cs.CV (updated version)

MLP Splatting: Object-Centric Neural Fields

Shinjeong Kim, Yuzhou Cheng, Xin Kong, Paul H. J. Kelly, Andrew J. Davison

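Institution: Department of Computing, Imperial College London

AI summary: Proposes MLP-Splatting, which achieves scene decomposition and novel-view synthesis with a small number of compact MLP primitives, supports object-level editing, and is more efficient in memory and rendering than existing methods.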

Abstract

3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but lack the ability to easily decompose scene elements into a few primitives, requiring additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are trained with RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instance segmentation. In our experiments, we achieve substantially lower memory usage (1/15×) and faster rendering (3×) than state-of-the-art semantic 3DGS methods. Project Page: https://shinjeongkim.com/mlp-splatting
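Volumetric compositing over the primitives a ray hits can be sketched with the standard front-to-back alpha compositing rule: each sample contributes its colour weighted by its opacity and by the transmittance remaining after the samples in front of it. The sketch below assumes each ray-primitive interaction has already been reduced to a (depth, alpha, rgb) triple, which in the method described above would come from the per-primitive MLPs; that reduction and the function name are assumptions, not the authors' renderer.

```python
def composite_ray(samples):
    """Front-to-back alpha compositing of (depth, alpha, rgb) samples.

    Returns the composited rgb colour and the accumulated opacity.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    for _, alpha, rgb in sorted(samples, key=lambda s: s[0]):
        weight = transmittance * alpha
        color = [c + weight * channel for c, channel in zip(color, rgb)]
        transmittance *= 1.0 - alpha
    return color, 1.0 - transmittance


if __name__ == "__main__":
    # An opaque red sample in front hides the green one behind it,
    # regardless of the order in which the samples are listed.
    print(composite_ray([(2.0, 1.0, (0.0, 1.0, 0.0)),
                         (1.0, 1.0, (1.0, 0.0, 0.0))]))
    # A half-transparent red sample blends with the opaque blue one behind it.
    print(composite_ray([(1.0, 0.5, (1.0, 0.0, 0.0)),
                         (2.0, 1.0, (0.0, 0.0, 1.0))]))
```

Because only the primitives intersected by a ray contribute samples, the loop stays short when the scene is represented by a small number of localized primitives.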

2606.03875 2026-06-03 cs.CV (version update)

Seg2Track++: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes

Affiliations: University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering

AI summary: Proposes Seg2Track++, a framework that combines SAM2 instance segmentation with probabilistic track validation to perform zero-shot multi-object tracking and segmentation, improving identity preservation and suppressing false-positive propagation.

Abstract

Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
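
To make the idea of validating track existence with a Bernoulli filter more concrete, here is a minimal sketch of the textbook existence-probability recursion for a single track. It is not the authors' PTV module: the function names, the default probabilities, and the use of a single scalar `likelihood_ratio` (target likelihood divided by clutter intensity for the associated detection) are assumptions, and the abstract does not say how mask confidences enter the update.

```python
def predict_existence(r, p_survive=0.99, p_birth=0.0):
    """Time update: the track survives with p_survive or is newly born."""
    return p_birth * (1.0 - r) + p_survive * r


def update_existence(r, p_detect, likelihood_ratio=None):
    """Measurement update of the existence probability r.

    likelihood_ratio is None when no detection was associated this frame;
    otherwise it is the (assumed) ratio of target likelihood to clutter
    intensity for the associated detection.
    """
    if likelihood_ratio is None:
        factor = 1.0 - p_detect                      # missed detection
    else:
        factor = 1.0 - p_detect + p_detect * likelihood_ratio
    return factor * r / (1.0 - r + factor * r)       # Bayes normalisation


# A track that is missed repeatedly decays towards zero and can be pruned
# as a ghost track once r falls below a chosen threshold.
r = 0.5
for _ in range(3):
    r = update_existence(predict_existence(r), p_detect=0.9)
```

A strong associated detection (ratio above 1) pushes the existence probability up, while each miss pulls it down, so a threshold on `r` gives a principled rule for confirming or deleting tracks.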

2606.03874 2026-06-03 cs.CV cs.RO (version update)

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

Koki Nagano, Hongyu Liu, Seonwook Park, Tianye Li, Amrita Mazumdar, Christian Jacobsen, Shengze Wang, Michael Stengel, Rajarshi Roy, Ka Chun Cheung, Simon See, Shalini De Mello

Affiliations: NVIDIA; HKUST

AI summary: Proposes DyaPlex, a streaming full-duplex speech-motion model that uses a dual-tower Transformer architecture and a unified dyadic token interleaving mechanism to achieve synchronized multimodal interaction, reaching state-of-the-art performance on monadic and dyadic interaction benchmarks.

Comments: Project page: https://research.nvidia.com/labs/amri/projects/DyaPlex

Abstract

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

2606.03871 2026-06-03 cs.CV cs.CL cs.LG (version update)

Visual Instruction Tuning Aligns Modalities through Abstraction

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

Affiliations: Area Science Park, Trieste, Italy

AI summary: Using probing analyses and causal interventions, finds that visual instruction tuning embeds visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing, and aligns visual with textual representations by extending and strengthening the existing abstraction phase.

Abstract

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

2606.03868 2026-06-03 cs.CV (version update)

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

Dingrui Wang, YuAn Wang, Jinkun Liu, Yue Zhang, Mattia Piccinini, Yu Sun, Johannes Betz

Affiliations: Technical University of Munich; ByteDance; Tsinghua University

AI summary: Proposes Donk, a model that jointly models the distribution of interaction videos and hand trajectories to enable action generation and data augmentation for dexterous hands.

Comments: 9 pages, 5 figures

Abstract

Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

2606.03837 2026-06-03 cs.CV (version update)

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

Luc P. J. Sträter, Hazel Doughty

Affiliations: Leiden University

AI summary: A systematic study of how temporal context should be allocated in model adaptation strategies for video understanding; by evaluating parameter-efficient fine-tuning and probing methods across settings, it reveals the best distribution of temporal context across backbone, PEFT, and probe.

Abstract

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT, and probe. In this work we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused, and spatially dense settings, with a particular focus on scenarios with limited data where parameter efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

2606.03806 2026-06-03 cs.CV (version update)

TeX-1500: A Paired Real-World LWIR Hyperspectral Dataset and Benchmark for Temperature-Emissivity-Texture Decomposition

Cheng Dai, Jiale Lin, Hongyi Xu, Bingxuan Song, Ziyang Xie, Fanglin Bao

Affiliations: School of Science, Westlake University; School of Engineering, Westlake University

AI summary: To address the lack of paired supervision for temperature-emissivity-texture decomposition in long-wave infrared hyperspectral imaging, builds TeX-1500, a dataset of 1,522 real-scene pairs, and proposes TeX-UNet, a wavelength-aware baseline, providing a measurable benchmark for data-driven thermal perception.

Abstract

Temperature-emissivity-texture (TeX) decomposition seeks to recover object heat state, material spectral response, and visible-like geometric texture from long-wave infrared hyperspectral imaging (LWIR HSI). Existing TeX pipelines are mainly scene-specific inverse solvers, and the lack of paired LWIR HSI-TeX supervision has limited learning-based decomposition. To address this gap, we introduce TeX-1500, a large-scale paired LWIR HSI-TeX dataset and benchmark for supervised HSI-to-TeX decomposition. TeX-1500 contains 1,522 calibrated real-scene pairs from DARPA Invisible Headlights (DARPA IH) pushbroom imagery and our FTIR acquisitions, covering five locations, four seasons, diverse acquisition times, heterogeneous wavelength layouts, and two sensor families. Each sample stores a calibrated valid-band radiance cube, calibrated wavelength positions, and aligned temperature, emissivity, and texture supervision constructed through a consistent restoration and TeX-construction protocol. We further provide TeX-UNet, a simple wavelength-aware baseline that maps calibrated HSI bands and wavelength positions to TeX fields. Experiments on the held-out DARPA IH pushbroom scenes and zero-/few-shot transfer to FTIR scenes show that TeX-1500 provides usable paired supervision and a measurable benchmark for data-driven physical-property-centered thermal perception.

2606.03802 2026-06-03 cs.CV (version update)

Template Collapse and Information-Theoretic Limits in Camera rPPG Pulse Morphology Restoration

Achraf Ben Ahmed

Affiliations: PlesmoSense SARL

AI summary: Evaluates 16 architectures on 153 subjects and introduces cross-subject Pearson r to distinguish subject-specific recovery from template collapse, finding that consumer cameras cannot encode individual arterial morphology and that no architecture recovers subject-specific pulse morphology.

Abstract

Objective: Consumer face camera remote photoplethysmography (rPPG) enables passive cardiovascular monitoring, but whether single-cycle waveform morphology encoding arterial stiffness biomarkers is recoverable from this measurement has not been characterised. Methods: We evaluated 16 architectures spanning six families on 153 subjects across three datasets, introducing cross-subject Pearson r to distinguish subject-specific recovery from template collapse. Results: No architecture recovered subject-specific morphology (cross-subject r range 0.773--0.9999; ground-truth ceiling 0.601). Supervised Contrastive (SupCon) converged to log N = 4.844, constituting the strongest available empirical evidence that no discriminative morphological structure is extractable from single-cycle rPPG by the encoder families tested. The VAE decoder restores population-level harmonic content absent from the rPPG input (H2/H1: 0.310 output vs. 0.275 input), generalising zero-shot to UBFC (r = +0.708); a directional hallucination gap (p = 0.150) suggests partial signal reading. Anti-collapse objectives fail when input carries no discriminative structure. Significance: Consumer cameras cannot encode individual arterial morphology; cross-subject r is a necessary collapse diagnostic for waveform reconstruction benchmarks.
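
One plausible reading of the cross-subject Pearson r diagnostic is the average correlation between reconstructed pulse cycles belonging to different subjects. The sketch below follows that reading; the function name, the one-cycle-per-subject input, and the averaging over all subject pairs are assumptions, since the abstract does not define the metric in detail.

```python
import numpy as np


def cross_subject_r(waveforms):
    """Mean Pearson r between reconstructed cycles of different subjects.

    waveforms: array-like of shape (n_subjects, n_samples), one representative
    reconstructed pulse cycle per subject, resampled to a common length
    (an assumed preprocessing step).
    """
    w = np.asarray(waveforms, dtype=float)
    corr = np.corrcoef(w)                      # pairwise r between subjects
    off_diag = ~np.eye(len(w), dtype=bool)     # ignore each subject vs itself
    return float(corr[off_diag].mean())


t = np.linspace(0.0, 1.0, 50)
template = np.sin(2 * np.pi * t)
collapsed = np.tile(template, (4, 1))          # every subject gets one template
distinct = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
```

Under this reading, `cross_subject_r(collapsed)` is essentially 1 while `cross_subject_r(distinct)` is near 0. A model whose outputs correlate across subjects far more strongly than the ground-truth waveforms do (the abstract quotes a ground-truth ceiling of 0.601) would be emitting a shared template rather than subject-specific morphology.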

2606.03795 2026-06-03 cs.CV (version update)

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

Akayou A. Kitessa, Yijun Zhao

Affiliations: Fordham University

AI summary: Uses Residual Spectral Loss (RSL) and linearly recoverable band-limited Fourier energy to study how projection layers in vision-language models affect the spectral structure of representations; finds that in CLIP and DINOv2 spectral accessibility varies non-monotonically with depth, peaking at intermediate layers before declining, and that CLIP's projection is spectrally neutral whereas DINOv2's [CLS] pooling causes a loss of spectral structure.

Abstract

Vision-language models map visual features into a shared embedding space through learned projection layers, yet it remains unclear how these transformations alter the structure of visual information. This study examines changes in representation through spatial-frequency accessibility, measured by the linear recoverability of band-limited Fourier energy from model representations. To isolate effects beyond dimensionality reduction, we introduce Residual Spectral Loss (RSL), which evaluates changes relative to a dimension-matched random projection baseline. To reduce confounding effects from optimization, the analysis uses pretrained models with all parameters frozen. The experimental results show consistent frequency-dependent changes in accessibility across CLIP and DINOv2 on ImageNet and MS-COCO datasets. Spectral accessibility follows a non-monotonic trajectory across depth, peaking at intermediate layers before decreasing toward the output representation. The final transformation differs across architectures: CLIP's learned projection is spectrally neutral, with changes explained by compression, whereas DINOv2's [CLS] pooling induces a structured loss across the spectrum. These findings identify intermediate layers and pooling mechanisms as primary drivers of spectral transformation in modern vision encoders.
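
The probing targets described above are band-limited Fourier energies of the input image. A minimal sketch of how such targets could be computed is given below; the number of bands, the equal-width radial bands, and the function name are assumptions, and the linear probe and the dimension-matched random-projection baseline that define RSL are not shown.

```python
import numpy as np


def band_energies(image, n_bands=4):
    """Energy of a 2D image in n_bands concentric spatial-frequency bands."""
    img = np.asarray(image, dtype=float)
    # Dividing by the pixel count makes the band energies sum to the
    # spatial-domain energy (Parseval's theorem).
    power = np.abs(np.fft.fft2(img)) ** 2 / img.size
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)            # radial frequency per bin
    edges = np.linspace(0.0, radius.max(), n_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, n_bands - 1)
    return np.array([power[band == b].sum() for b in range(n_bands)])


rng = np.random.default_rng(0)
targets = band_energies(rng.normal(size=(16, 16)))  # one value per band
```

A linear regressor trained to predict these values from frozen features would then give a per-band recoverability score, which could be compared against the same probe applied to a random projection of matching dimension.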

2606.03793 2026-06-03 cs.CL cs.CV (version update)

Investigating the Adversarial Robustness of Multimodal Large Language Models

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Affiliations: Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

AI summary: A systematic study of adversarial robustness in multimodal large language models; proposes a diagnostic CLIP-alignment protocol that predicts how well robust vision encoders transfer, and shows that end-to-end multimodal adversarial training substantially improves performance under strong adversarial attacks.

Abstract

Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.

2606.03694 2026-06-03 cs.RO cs.CV cs.HC (version update)

Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Jessica Wenninger, Gabriel Skantze

Affiliations: Furhat Robotics; University of Naples Federico II; Division of Speech, Music and Hearing, KTH Royal Institute of Technology

AI summary: To address frequent identity switches in the egocentric view of social robots, presents a custom-annotated egocentric dataset, systematically evaluates detection errors, compares face and body tracking, and analyzes the effect of extended spatial memory and appearance re-identification; the optimized pipeline reduces identity switches by 49%.

Comments: 8 pages, 5 figures, 3 tables. Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.

2606.03693 2026-06-03 cs.CL cs.CV (version update)

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

Affiliations: Intelligent System Laboratory, Faculty of Computer Science, Brawijaya University

AI summary: Builds IndoRad-VQA, an Indonesian radiology VQA dataset, to evaluate the robustness of medical vision-language models under non-English clinical language; finds a performance gap of 8 to 25 percent between English and Indonesian settings, indicating a need for more inclusive multilingual evaluation.

Comments: accepted to MMFM-BIOMED Workshop @ CVPR 2026

Abstract

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.

2606.03675 2026-06-03 cs.CV (version update)

A Fast Methane Detection Pipeline on Board Satellites Based on Mag1c-SAS and LinkNet

Jonáš Herec, Vít Růžička, Rado Pitoňák, Jan Sedmidubsky

Affiliations: Zaitra s.r.o.; NASA JPL; Faculty of Informatics, Masaryk University

AI summary: Proposes the Mag1c-SAS algorithm to speed up methane detection and pairs it with a lightweight LinkNet model for denoising, achieving efficient, low-power methane leak detection on onboard satellite hardware.

Comments: arXiv admin note: substantial text overlap with arXiv:2507.01472

Abstract

Methane is a potent greenhouse gas, and detecting leaks early via hyperspectral satellite imagery can help climate change mitigation efforts. Meanwhile, many existing hyperspectral missions only capture areas manually targeted by operators, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane detection methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. In particular, we test fast target detection ACE and CEM methods that have not been previously used for methane detection and propose Mag1c-SAS -- a significantly faster variant of the current state-of-the-art Mag1c algorithm. To explore their detection potential, we integrate them with a machine learning model based on U-Net and LinkNet. We evaluate our methods on the STARCOP dataset and a novel EMIT-MSeg dataset, which we introduce and open-source alongside a high-quality annotation strategy. The proposed Mag1c-SAS approach proves highly effective by operating ~80x faster than the original Mag1c approach, providing a visually similar, but noisier result. When additionally paired with the lightweight LinkNet approach, it effectively reduces noise, achieving AUPRC score improvements of over 30 pp on EMIT-MSeg compared to the baseline Mag1c approach, and an F1 score on STARCOP ~4 pp higher. We evaluate two novel band selection strategies and confirm the system's onboard viability through hardware profiling, demonstrating marginal power consumption and efficient CPU/RAM utilization. We release the final system in a user-friendly and lightweight PyPI library at: https://pypi.org/project/onboard-methane-detection/, alongside all experimental code, models, and data at: https://github.com/zaitra/methane-filters-benchmark.
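
Of the fast detectors named in the abstract, CEM (constrained energy minimization) is the simplest to write down: it finds the linear filter with minimum average output energy subject to a unit response on the target signature. The sketch below is the classical formulation only; the function name, the ridge term, and the synthetic data are assumptions, and neither the paper's band selection nor Mag1c-SAS itself is shown.

```python
import numpy as np


def cem_filter(pixels, target, ridge=1e-6):
    """Classical CEM filter w = R^-1 d / (d^T R^-1 d).

    pixels: (n_pixels, n_bands) spectra used to estimate the background.
    target: (n_bands,) signature to detect (for methane this would be an
            absorption signature, assumed to be supplied by the caller).
    """
    x = np.asarray(pixels, dtype=float)
    d = np.asarray(target, dtype=float)
    R = x.T @ x / x.shape[0]                       # sample correlation matrix
    # Small ridge (an added assumption) keeps the solve well conditioned.
    R = R + ridge * np.trace(R) / len(d) * np.eye(len(d))
    r_inv_d = np.linalg.solve(R, d)
    return r_inv_d / (d @ r_inv_d)                 # unit response to target


rng = np.random.default_rng(0)
background = rng.normal(size=(500, 8))             # synthetic stand-in spectra
signature = rng.normal(size=8)
w = cem_filter(background, signature)
scores = background @ w                            # per-pixel detection score
```

The per-pixel cost at inference is a single dot product, which is part of why detectors of this family are attractive for resource-limited onboard hardware; a learned model can then be applied to the resulting score map to suppress noise.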

2606.03666 2026-06-03 cs.CV (version update)

Beyond Single Solution: Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing

Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao

Affiliations: Harbin Institute of Technology, Harbin, China; Harbin Institute of Technology Suzhou Research Institute, Suzhou, China

AI summary: To address the ill-posedness of compressive sensing, proposes a multi-hypothesis collaborative deep unfolding network (MHC-DUN) that jointly optimizes over multiple solution spaces, uses AlphaNet to dynamically predict spatially varying step sizes for gradient descent, and designs a multi-hypothesis collaborative proximal mapping module to improve reconstruction quality.

Comments: Accepted by CVPR 2026

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

Affiliations: Nanyang Technological University

AI summary: Proposes a controlled concrete reasoning framework and the PF-OPSD method, which combine the visual simulation of world models with the abstract reasoning of multimodal large language models to improve performance and robustness on spatial lookahead and open-domain physical prediction tasks.

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

2606.03581 2026-06-03 cs.CV cs.RO (version update)

UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion

Ye Wu, Ruiqi Song, Baiyong Ding, Nanxin Zeng, Junjie Cheng, Yunfeng Ai

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Waytous Inc.

AI summary: Proposes UnsOcc, a multi-modal framework that uses a rendering-based fusion module and Gaussian-Splatting-based detail-aware auxiliary supervision to address difficult cross-modal fusion and long-tail distributions in unstructured scenes, outperforming existing methods on an open-pit mine dataset and nuScenes.

Comments: 8 pages

Abstract

Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

2606.03578 2026-06-03 cs.CV (version update)

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Xin Tao, Pengfei Wan

Affiliations: University of Science and Technology of China

AI summary: This paper systematically studies the diffusability of latent representations in latent diffusion models and proposes Velocity Irreducible Variance (VIV) as a stable predictor of generation quality.

Abstract

Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.

2606.03577 2026-06-03 cs.CV (version update)

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

Affiliations: State Key Laboratory of CAD & CG, Zhejiang University; Ant Group; Westlake University

AI summary: This paper proposes the ReasonMatch-Bench benchmark and Dynamic Correspondence Reinforcement Learning (DCRL) to systematically evaluate and improve the spatial reasoning of multimodal large language models on wide-baseline matching.

Comments CVPR 2026. Project page: https://aim-uofa.github.io/reasonmatch/ Code: https://github.com/aim-uofa/ReasonMatch

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

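2606.03569 2026-06-03 cs.CV cs.AI (version update)

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

Affiliations: Shandong University; National University of Singapore (Suzhou) Research Institute

AI summary: To address the loss of feature diversity caused by visual token pruning that relies on a single attention score during vision-language model inference, this work proposes STS, a two-stage pruning framework that first maximizes structural diversity through repulsion-based sampling and then filters out semantically irrelevant tokens with instruction-aware cross-attention, improving the structural diversity and fine-grained task alignment of the retained tokens.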

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.
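
The two-stage idea described in this abstract (diversity first, relevance second) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the repulsion mechanism, so greedy farthest-point sampling stands in for it, and the relevance score is a plain softmax cross-attention averaged over text tokens. The function names and the parameters `k_diverse` and `k_final` are hypothetical.

```python
import numpy as np

def farthest_point_sampling(positions, k, start=0):
    """Greedy diversity sampling: repeatedly pick the token farthest
    from every token chosen so far (a stand-in for repulsion-based sampling)."""
    chosen = [start]
    min_dist = np.linalg.norm(positions - positions[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(positions - positions[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(chosen)

def two_stage_prune(positions, visual_feats, text_feats, k_diverse, k_final):
    """Stage 1 keeps k_diverse spatially spread tokens; stage 2 keeps the
    k_final of those with the highest mean cross-attention from the text."""
    stage1 = farthest_point_sampling(positions, k_diverse)
    d = visual_feats.shape[1]
    logits = text_feats @ visual_feats[stage1].T / np.sqrt(d)  # (T, k_diverse)
    logits = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    relevance = attn.mean(axis=0)                              # one score per token
    top = np.argsort(-relevance)[:k_final]
    return np.sort(stage1[top])

# Toy example: a 4x4 grid of visual tokens with random features (assumed data).
rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), axis=-1)
grid = grid.reshape(-1, 2).astype(float)
feats = rng.normal(size=(16, 8))
text = rng.normal(size=(3, 8))
kept = two_stage_prune(grid, feats, text, k_diverse=8, k_final=4)
```

Running the diversity stage before the relevance stage means the second stage can only choose among tokens that already cover the image, which is the decoupling the abstract argues for.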

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO (version update)

Learned Non-Maximum Suppression for 3D Object Detection

Timo Osterburg, Stefan Schütte, Torsten Bertram

Affiliations: Institute of Control Theory and Systems Engineering, TU Dortmund University

AI summary: Proposes two learned filtering modules (D2D-Rescore and GossipNet3D) that replace heuristic NMS and exploit relations among detections to improve 3D detection performance, particularly for small objects and rare classes.

Comments 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

Abstract

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms.
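
For context on what the learned modules replace, the heuristic baseline can be sketched as distance-based suppression in bird's-eye view, in the spirit of CircleNMS. This is a simplified illustration, not the code from the paper or from any detection library: the function name, the Euclidean center-distance criterion, and the `min_dist` threshold are assumptions.

```python
import numpy as np

def circle_nms(centers, scores, min_dist):
    """Greedy suppression: visit detections by descending score and keep one
    only if its BEV center is at least min_dist from every kept detection."""
    order = np.argsort(-scores)
    keep = []
    for i in order:
        far_enough = all(
            np.linalg.norm(centers[i] - centers[j]) >= min_dist for j in keep
        )
        if far_enough:
            keep.append(int(i))
    return keep

# Toy example: two near-duplicate boxes and one distant box (assumed data).
centers = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = circle_nms(centers, scores, min_dist=1.0)  # → [0, 2]
```

The rule uses only score order and pairwise distance, so it cannot rescore a detection based on its neighbors; that gap is what relation-aware filtering of the kind described above aims to close.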

2606.03566 2026-06-03 cs.CV cs.AI (version update)

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

AI summary: Proposes a method based on SwinUNETR and localized patch sampling for automatic segmentation of the lateral ventricle choroid plexus in multiple sclerosis, achieving a higher Dice coefficient than existing models while cutting computation by 99%.

Abstract

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
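
The primary metric and the compute comparison quoted in this abstract can be reproduced with a few lines of NumPy. The sketch below uses the standard definition of the Dice Similarity Coefficient on binary masks. The toy masks, the empty-mask convention, and the variable names are illustrative assumptions, and the two GFLOPs figures are read as proposed pipeline (91.8) versus comparison model (22,080).

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice Similarity Coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy masks: one overlapping voxel, |A| = 2, |B| = 1, so DSC = 2*1 / 3.
pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
dsc = dice_coefficient(pred, target)

# Relative compute reduction implied by the GFLOPs figures in the abstract.
gflops_patch, gflops_baseline = 91.8, 22080.0
reduction = 1.0 - gflops_patch / gflops_baseline  # about 0.996
```

The ratio works out to roughly 99.6%, consistent with the 99% reduction stated in the abstract.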

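2606.03540 2026-06-03 cs.CV (version update)

Attend to Anything: Foundation Model for Unified Human Attention Modeling

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

AI summary: Proposes the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across image, video, and audio-visual tasks through hierarchical language prompts and hyperbolic-space embeddings, with an average gain of 6% on 16 benchmarks and roughly 4x faster video inference.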

Comments Accepted to ICML 2026

Abstract

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4x speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
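
For reference, the generic textbook form of the Fokker-Planck equation for a density $p(\mathbf{x}, t)$ with drift $\boldsymbol{\mu}$ and diffusion tensor $\mathbf{D}$ is shown below. The abstract does not say how AAM maps attention maps and video frames onto these quantities, so the symbols are generic and should not be read as the authors' notation.

```latex
\frac{\partial p(\mathbf{x}, t)}{\partial t}
  = -\nabla \cdot \bigl[ \boldsymbol{\mu}(\mathbf{x}, t)\, p(\mathbf{x}, t) \bigr]
  + \sum_{i,j} \frac{\partial^2}{\partial x_i \, \partial x_j}
    \bigl[ D_{ij}(\mathbf{x}, t)\, p(\mathbf{x}, t) \bigr]
```

The first term transports probability mass along the drift and the second spreads it out, which is one way to read "diffusive temporal evolution" of an attention map from frame to frame.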

2606.03539 2026-06-03 cs.CV (version update)

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

Haoxuan Chen, Xianqin Liu, Jian-Fang Hu

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University, China; National Information Center of GACC (Guangdong), Guangzhou, China; Guangdong Province Key Laboratory of Information Security Technology, China; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China

AI summary: To address the disruption of pre-trained knowledge caused by low-quality videos, this work proposes Null-Space Tuning (NST), which preserves pre-trained knowledge by confining learnable residuals to the null-space of frozen weights and synthesizes those residuals with a Quality-Adaptive Unit and Dual-Space Reparameterization, achieving the best performance on a mixed-quality benchmark.

Comments Accepted by ICME 2026

Abstract

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
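
The geometric property this abstract relies on is easy to check numerically. The sketch below is a minimal illustration, not the NST architecture: it builds the orthogonal projector onto the null-space of a random "frozen" weight matrix with the pseudoinverse, splits an arbitrary residual into a null-space part and a row-space part, and confirms that only the latter changes the layer output. The shapes and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # frozen weight: 8-dim input -> 4-dim output
x = rng.normal(size=8)        # layer input
r = rng.normal(size=8)        # arbitrary learnable residual

# pinv(W) @ W projects onto the row space of W, so I - pinv(W) @ W
# is the orthogonal projector onto null(W).
P_null = np.eye(8) - np.linalg.pinv(W) @ W
r_null = P_null @ r           # component the frozen layer cannot see
r_row = r - r_null            # component that does change the output

out_clean = W @ x
out_null = W @ (x + r_null)   # identical to out_clean up to round-off
out_full = W @ (x + r)        # differs, driven only by r_row
```

In these terms, a residual meant to leave behavior on high-quality inputs untouched would live in `r_null`, while a restoration signal for degraded inputs would need a non-zero `r_row`.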

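2606.03509 2026-06-03 cs.CV (version update)

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

AI summary: Proposes EvoMemNav, a framework that builds a Visual-Semantic Memory Graph, applies a budgeted coarse-to-fine policy, and adds reflection-driven write-back to provide efficient, self-evolving, fine-grained memory for zero-shot embodied navigation, improving multi-instance disambiguation and Stop verification.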

Comments Preprint

Abstract

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

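2606.03508 2026-06-03 cs.CV (version update)

Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

Peitong Wang, Nuo Wang, Enxin Qin, Chengjin Yu, Hanyu Xuan, Yuanting Yan

Affiliations: Anhui University (Ahu.edu.cn)

AI summary: Proposes a two-stage PCB defect detection framework that learns PCB structural priors through structure-guided mixed masked pretraining and adds spatial continuity regularization during fine-tuning to make localization of elongated defects more compact, reaching 85.5% mAP0.5 on the DsPCBSD+ dataset.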

Comments Preprint. 38 pages, 12 figures, 6 tables

Abstract

Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse convolutional masked pretraining scheme to exploit unlabeled PCB images, where structure-guided mixed masking is used to construct informative masked inputs. The sparse convolutional reconstruction pipeline suppresses invalid responses from masked regions and enables the detector backbone to infer missing PCB structures from visible conductive patterns, thereby learning PCB structural priors. In the fine-tuning stage, the pretrained backbone is transferred to the downstream defect detection task. For the task, a spatial continuity regularization term is introduced during fine-tuning. This term constrains dispersed positive predictions assigned to the same defect instance and promotes more compact localization on elongated defect regions. Experiments on the DsPCBSD+ dataset show that the proposed method achieves 85.5% mAP0.5 and 52.3% mAP0.5:0.95, outperforming several strong baseline detectors. Ablation studies and qualitative results further confirm the effectiveness of the proposed framework for robust PCB defect detection in industrial AOI scenarios.

2606.03506 2026-06-03 cs.CV cs.GR (version update)

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

Affiliations: University of Tsukuba

AI summary: Proposes AvatarMix, which performs outfit transfer by directly composing two high-fidelity Gaussian avatars and uses a two-tier refinement strategy, SeamFix and FullbodyFix, to address seam artifacts and loss of appearance fidelity after body reshaping.

Comments CVPR 2026 Findings. 16 pages, including supplementary material

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 425-435

Abstract

Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/

2606.03499 2026-06-03 cs.CV (version update)

Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

Quoc-Anh Bui-Huynh, Thanh Duc Ngo, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang, Ngai-Man Cheung

Affiliations: Temasek Laboratories, Singapore University of Technology and Design; Vietnam National University, Ho Chi Minh City; University of Information Technology, VNU-HCM; Agency for Science, Technology, and Research (A*STAR)

AI summary: To address the vulnerability of 3DGS to diverse poisoning attacks, this work proposes the stage-wise benchmark Poison-3DGS and systematically studies how detectability differs across pipeline stages, finding that different attacks leave distinct forensic signals at different stages and that later-stage signals (such as training dynamics and Gaussian parameter statistics) provide strong cues not observable at earlier stages.

Abstract

3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection perspective, a key challenge and opportunity arise from the multi-stage nature of the 3DGS reconstruction pipeline, which produces heterogeneous intermediate representations. Forensic signals for detecting poisoning are inherently stage dependent: an attack introduced at one stage may produce signals that emerge only at later stages. This motivates a stage-wise view of detectability that goes beyond single-stage evaluation. We introduce Poison-3DGS, a benchmark for stage-wise characterization of poisoning detection in 3DGS. It exposes stage-specific artifacts, including multi-view images, geometry, training dynamics, and Gaussian parameters, across a diverse set of scenes and attacks. Using it, we conduct a systematic study of detectability across pipeline stages. Our analysis reveals several insights. First, detectability varies significantly across stages, and no single stage consistently dominates across attack types. Second, different attacks exhibit distinct stage-specific forensic signals, so detection effectiveness depends critically on where signals are observed. Third, later-stage signals such as training dynamics and Gaussian parameter statistics provide strong cues not observable at earlier stages. Overall, our work provides a principled benchmark and the first systematic characterization of stage-dependent detectability in 3DGS, offering a foundation for future research on robust and reliable 3DGS systems.

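2606.03493 2026-06-03 cs.CV cs.LG (version update)

Low-Frequency Shortcuts in Texture-Driven Visual Learning

Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos

Affiliations: Harvard University; Kempner Institute

AI summary: This paper analyzes how neural networks in texture-driven domains rely on low-frequency components as shortcuts and proposes pruning those components to remove the shortcut, improving in-distribution accuracy and robustness.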

Abstract

Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however, are texture-driven. In this work, we present a shortcut learning analysis for texture-driven domains and compare it with that of a standard benchmark. We show that texture-driven domains suffer from low-frequency shortcuts. Models make the majority of their decisions based on a few low-frequency components (LFCs) with a skewed spectral behavior, even though the classification information lies in higher-frequency, fine-grained details. Pruning LFCs from training and test sets eliminates the shortcut and provides a more balanced spectral behavior, improving the ID accuracy by up to 8%. We show that low-frequency shortcuts make the models highly vulnerable to OOD corruptions, leading to an accuracy drop of up to 70% compared to the ID accuracy. Pruning LFCs significantly improves robustness to low-frequency corruptions, by up to 40%, and introduces a trade-off for high-frequency corruptions: the balanced spectral behavior provides better generalization performance, whereas the increased dependence on high-frequency features reduces it. OOD accuracy depends on the interaction between these two factors.
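
One plausible way to prune low-frequency components from an image is to zero a small disc around the DC term in the 2-D Fourier spectrum. The abstract does not describe the exact procedure, so the sketch below is an assumption-laden illustration rather than the authors' preprocessing: the function name, the circular mask, and the `radius` parameter are all hypothetical.

```python
import numpy as np

def prune_low_frequencies(image, radius):
    """Zero all Fourier components within `radius` of the DC term of a
    2-D grayscale image and return the real-valued reconstruction."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # DC moves to (h//2, w//2)
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= radius] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Toy example: a flat brightness offset (lowest frequency) plus a
# checkerboard texture (highest frequency); assumed data.
ii, jj = np.indices((8, 8))
checker = np.where((ii + jj) % 2 == 0, 1.0, -1.0)
image = 5.0 + checker
pruned = prune_low_frequencies(image, radius=2)
```

On this toy input the flat offset is removed while the checkerboard survives unchanged, which mirrors the intent of forcing a model to rely on fine-grained texture instead of coarse intensity cues.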

2606.03490 2026-06-03 cs.CV (version update)

TrAction: Action Recognition with Sparse Trajectories

Jan F. Meier, Felix B. Mueller, Alexander Ecker, Timo Lüddecke

Affiliations: Institute of Computer Science and Campus Institute Data Science, University Göttingen; Max Planck Institute for Dynamics and Self-Organization

AI summary: Proposes sparse point trajectories as an input modality, paired with a Transformer architecture and masked-trajectory pretraining, to achieve efficient action recognition at lower computational cost, and shows that trajectory features complement appearance features.

Abstract

Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction

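2606.03479 2026-06-03 cs.CV cs.GR (version update)

PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting

Adrian Ramlal, John S. Zelek

Affiliations: University of Waterloo

AI summary: Proposes PersistGS, which couples differentiable rigid body simulation with 3D Gaussian Splatting to predict an object's SE(3) trajectory from physics while it is occluded, restoring object permanence, and introduces a centroid silhouette loss that reduces trajectory error.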

Comments Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696

英文摘要

Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose $\textbf{PersistGS}$, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by +2.46dB PSNR and comes within 0.19dB of a ground-truth trajectory upper bound.
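
The contrast the abstract draws between kinematic extrapolation and a dynamics-aware prediction can be illustrated with a toy 1D example. This is only a sketch under stated assumptions, not the PersistGS simulator: the function names, gravity constant, restitution coefficient, and time step are all hypothetical choices made for illustration.

```python
# Toy 1D comparison: constant-velocity extrapolation vs. a simulated bounce.
# All constants (gravity, restitution, time step) are illustrative assumptions.

def constant_velocity(z0, v0, dt, steps):
    """Kinematic baseline: keep the last observed velocity forever."""
    return [z0 + v0 * dt * (k + 1) for k in range(steps)]


def simulate_bounce(z0, v0, dt, steps, g=9.81, restitution=0.8, floor=0.0):
    """Semi-implicit Euler integration with a floor contact that reflects velocity."""
    z, v, heights = z0, v0, []
    for _ in range(steps):
        v -= g * dt                      # gravity updates velocity first
        z += v * dt                      # then position uses the new velocity
        if z < floor:                    # contact event: reflect and damp
            z = floor + (floor - z) * restitution
            v = -v * restitution
        heights.append(z)
    return heights


# Object last seen 1 m above the floor, moving down at 2 m/s, then occluded for 1 s.
kinematic = constant_velocity(z0=1.0, v0=-2.0, dt=0.02, steps=50)
simulated = simulate_bounce(z0=1.0, v0=-2.0, dt=0.02, steps=50)
```

In this toy setting the kinematic baseline ends up below the floor, while the simulated trajectory stays above it because the contact is modelled explicitly.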

2606.03470 2026-06-03 cs.CV (version update)

Mixed-Modality Dual Face-Hair Retrieval

Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo

Affiliations: Vietnam National University, Ho Chi Minh City, Vietnam; University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam

AI summary: Proposes DFHR, a mixed-modality dual-reference retrieval task, and achieves identity-aware, attribute-controllable cross-modal retrieval by disentangling identity and hairstyle features and fusing multimodal embeddings.

Abstract

We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes -- identity and hairstyle -- originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.

2606.03460 2026-06-03 cs.CV (version update)

From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring

Pasindu Ranasinghe, Simit Raval, Dibyayan Patra, Bikram Banerjee, Ismet Canbulat

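AI summary: Proposes a continuous monitoring framework that combines 3D semantic perception, uncertainty-based anomaly detection, rule checking, on-device LLM reasoning, and GraphRAG-based memory analysis, using scene graphs and temporal graphs for structured safety reasoning; it reaches 93% coverage across 115 hazard scenarios and 92.7% perception accuracy.

Abstract (translated from Chinese)

Underground coal mining requires personnel and heavy equipment to operate in shared, confined, and poorly lit spaces, where hazards such as equipment proximity violations, structural instability, and occluded blind spots are difficult to anticipate. Conventional monitoring systems, including fixed cameras and rule-based proximity alarms, can detect predefined events but lack the 3D scene understanding and contextual memory needed to identify complex or evolving hazards. This paper presents a continuous monitoring framework that converts colored 3D point clouds into structured and traceable safety reasoning outputs. The framework combines 3D semantic perception, uncertainty-based anomaly detection, rule-based hazard checks, on-device LLM reasoning, and GraphRAG-based memory analysis to identify immediate hazards and explain long-term safety patterns. Scene graphs and temporal graphs serve as explicit knowledge structures that link perception outputs with the reasoning stages. To overcome the scarcity of labeled underground data, real roadway scans, controlled object placement, and high-fidelity longwall simulation are combined to generate diverse hazard scenarios, while self-supervised pretraining improves segmentation from limited annotations. The perception model achieves 92.7% accuracy at 30 FPS with low memory usage. Across 115 hazard scenarios, rule-based checks achieve 57% coverage, which rises to 76% with contextual LLM reasoning and to 93% with memory reasoning based on historical records. Qualitative results show that uncertainty-derived anomaly signals support the interpretation of out-of-distribution hazards beyond predefined categories. Overall, graph-based knowledge representation combined with 3D perception and layered safety reasoning provides a practical foundation for intelligent decision support in underground mine monitoring.

Abstract (P-Topics / PercepT)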

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual aspect and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97** compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94** compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.
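
For readers unfamiliar with the silhouette score quoted above, the following is a minimal pure-Python sketch of the standard definition: for each point, `a` is the mean distance to its own cluster and `b` is the lowest mean distance to any other cluster. The function name and the toy data are assumptions for illustration and have nothing to do with the paper's features or evaluation code.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)           # usual convention for singleton clusters
            continue
        # Mean intra-cluster distance (the point itself contributes 0).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


toy_points = [(0.0,), (0.1,), (10.0,), (10.1,)]
good = silhouette_score(toy_points, [0, 0, 1, 1])   # clusters match the data
bad = silhouette_score(toy_points, [0, 1, 0, 1])    # clusters cut across the data
```

Scores near 1 indicate tight, well-separated clusters; scores near or below 0 indicate overlapping or mis-assigned clusters.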

2606.03341 2026-06-03 cs.CV (version update)

Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

Zhikang Li, Yan Wu, Xin Hu, Yi Dai, Ming Li

Affiliations: Remote Sensing Image Processing and Fusion Group, School of Electronic Engineering, Xidian University; National Key Laboratory of Radar Signal Processing, Xidian University

AI summary: Proposes RegNetMamba-2, which uses Structured State Space Duality (SSD) to extract local and global structural features in a coarse-to-fine matching process and performs multimodal image registration through cross-modality interaction and multi-scale fusion modules, achieving efficient performance on several datasets.

Abstract

In multi-modal image registration, the primary challenge lies in extracting shared structural information. Compared with Transformers, Structured State Space Duality (SSD) extracts global structural features more comprehensively and with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into a coarse-to-fine matching process to extract local and global structural features effectively. First, SSD is applied at three different scales for multi-modal feature extraction in our network. To strengthen local representation, we pay more attention to foreground edges and structural information through the feature scaling function of SSD. Second, for shared feature extraction from the input images and multi-modal feature fusion across all scales, we propose a cross-modality feature fusion model based on SSD, consisting of a Cross-Modality feature Interaction (CMI) module and a Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale using SSD in cross form. The MSF module employs progressive upward fusion at the feature level to obtain fine features that contain multi-modal features from all scales. Following the coarse-to-fine strategy, the 1/8-scale features from CMI and the 1/2-scale features from MSF are collected to calculate matching probability scores. We then establish the matching at each of the two levels through pixel-wise correspondences. Extensive experiments demonstrate that, compared with state-of-the-art deep-learning-based algorithms, RegNetMamba-2 achieves good results in both performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadScene) and VIS-NIR (RGB-NIR Scene).

2606.03338 2026-06-03 cs.LG cs.CV (version update)

IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension

Julie Mordacq, Vicky Kalogeiton, Steve Oudot

Affiliations: University of California, Berkeley

AI summary: Proposes IdEst, which uses the minimum spanning tree dimension estimator to assess the intrinsic dimension of self-supervised representations, finds it strongly correlated with downstream linear-probe performance, and shows that it enables efficient hyperparameter selection.

Comments: ICML 2026

Abstract

Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight into the geometric structure of the representation space. In this work, motivated by connections between neural network generalization and intrinsic dimension (ID), we propose IdEst, a method for estimating the ID of SSL representations via the Minimum Spanning Tree dimension estimator ($\mathrm{dim}_\mathrm{MST}$). Across diverse datasets, architectures, and SSL pretraining objectives, we show that IdEst strongly correlates with downstream linear probe performance. Furthermore, we demonstrate that IdEst enables efficient hyperparameter selection, significantly reducing the computational cost compared to supervised alternatives. Our results highlight intrinsic dimensionality as a principled geometric proxy for assessing SSL representations, complementing standard supervised probing protocols.
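
One classical way to read a dimension off minimum spanning trees relies on the fact that, for points sampled from a d-dimensional set, the total Euclidean MST length grows roughly like n^((d-1)/d). The sketch below fits that growth rate on nested samples and inverts it. It is an assumption-laden illustration of the general idea: the exact $\mathrm{dim}_\mathrm{MST}$ estimator used by IdEst may differ, and the function names and toy point sets are hypothetical.

```python
import math


def mst_length(points):
    """Total Euclidean edge length of the minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    best = [math.dist(points[0], p) for p in points]   # cheapest link to the tree
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        j = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        total += best[j]
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[j], points[i])
                if d < best[i]:
                    best[i] = d
    return total


def mst_dimension(samples):
    """Fit log L(n) = a * log n + b over samples of growing size; return 1 / (1 - a)."""
    xs = [math.log(len(s)) for s in samples]
    ys = [math.log(mst_length(s)) for s in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return 1.0 / (1.0 - slope)


def line(n):
    """n evenly spaced points on a straight segment embedded in 3D."""
    return [(i / (n - 1), 2 * i / (n - 1), 0.0) for i in range(n)]


def grid(k):
    """k x k evenly spaced points in the unit square."""
    return [(i / (k - 1), j / (k - 1)) for i in range(k) for j in range(k)]


d_line = mst_dimension([line(10), line(20), line(40)])
d_grid = mst_dimension([grid(6), grid(12), grid(24)])
```

On these toy sets the estimate is close to 1 for the segment, regardless of the 3D embedding, and close to 2 for the planar grid, with some finite-size bias. Real representations would need random subsampling and much larger sample sizes.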

2606.03314 2026-06-03 cs.CV (version update)

TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing

Tim-Felix Faasch, Jochen Kall, Lucas Nunes, Jens Behley, Cyrill Stachniss

Affiliations: Bosch Research; Rheinisch-Westfälische Technische Hochschule Aachen; University of Bonn

AI summary: Proposes TASE, which projects pretrained 2D semantic features into a truncation-aware embedding space and adds a scale- and translation-equivariance loss, enabling controllable text-driven editing of 3D scenes and substantially outperforming existing methods on edits with large geometric modifications.

Abstract

High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approaches provide limited support for controllable editing. We introduce TASE, a method that projects pretrained 2D semantic features into a truncation-aware embedding space to enable flexible 3D scene editing. Our method explicitly optimizes a feature space in which progressively reducing feature channels yields increasingly abstract semantic representations, while retaining more channels preserves fine-grained detail. Additionally, we improve multi-view consistency of the features using a scale- and translation-equivariance loss. The resulting truncation-aware embedding space enables text-driven edits to 3D scenes, providing explicit control over how strongly edits adhere to the original scene content and allowing more substantial modifications than prior methods. Moreover, we propose a finetuning stage for the editing diffusion model to mitigate artifacts caused by geometric changes. Experimental results demonstrate competitive performance in 3D scene editing, substantially outperforming prior methods on edits involving large geometric modifications.

2606.03301 2026-06-03 cs.CL cs.CV (version update)

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

Affiliations: IRIT, University of Toulouse, France; Agency for Science, Technology and Research (A*STAR), Singapore; CNRS, IRIT, France

AI summary: Proposes the SagaQA benchmark, which evaluates models' high-level understanding of multimodal narratives across full TV series through cross-episode multi-hop reasoning tasks, and compares the performance of parallel, sequential, and hybrid planning strategies.

Abstract

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narration and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (Parallel, Sequential, and Hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

2606.03287 2026-06-03 cs.CV (version update)

FreeStreamGS: Online Feed-Forward 3D Gaussian Splatting from Unposed Streaming Inputs

Ruiyang Chen, Feiran Li, Chu Zhou, Zonglin Li, Zhanyu Ma, Heng Guo

AI summary: Proposes FreeStreamGS, an online feed-forward framework that achieves efficient, high-quality novel view synthesis from unposed streaming inputs through a decoupled intrinsic recovery head and a dynamic point refinement offset strategy.

Abstract

Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have been proposed for streaming depth and point cloud recovery, they cannot be adapted to NVS due to severe rendering artifacts. This is because NVS demands stricter multi-view consistency in Gaussian scales and pose-geometry alignment; even minor deviations would accumulate over time and visibly degrade rendering quality. To this end, we propose FreeStreamGS, a robust online feed-forward framework for efficient and high-quality NVS. We introduce two key mechanisms: a Decoupled Intrinsic Recovery Head that removes cumulative camera intrinsic bias and prevents scene scale jitter during long-term streaming, and a Dynamic Point Refinement Offset strategy that relaxes rigid unprojection to correct coupled pose-depth drift. Extensive experiments show that FreeStreamGS achieves rendering quality competitive with state-of-the-art offline feed-forward 3DGS methods, despite operating without access to future frames.

2606.03251 2026-06-03 cs.AI cs.CV cs.LG eess.IV stat.ML (version update)

Follow-Your-Preference++: Rethinking Preference Alignment in Image Inpainting

Junkun Yuan, Yutao Shen, Toru Aonishi, Hideki Nakayama, Yue Ma

Affiliations: Zhejiang University; The University of Tokyo; Tsinghua University

AI summary: Starting from first principles, the paper systematically studies preference alignment for image inpainting using the direct preference optimization framework and preference data built with public reward models; it finds that reward models are biased but that ensembling mitigates the bias, and the resulting models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments.

Comments: 23 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2509.23082

Abstract

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases, particularly in brightness, composition, and color scheme, that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.
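
The two ingredients named in the abstract, building preference pairs from several reward models and optimizing with direct preference optimization, can be sketched as follows. This is a minimal illustration under assumptions: the z-score averaging used for the ensemble, the function names, and the value of `beta` are hypothetical, and the loss shown is the generic log-probability form of DPO. Diffusion-based inpainting models typically substitute denoising-error surrogates for the log-probabilities.

```python
import math


def zscore(values):
    """Standardize one reward model's scores so models with different ranges are comparable."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def ensemble_pair(reward_table):
    """reward_table[m][c] is the score of candidate c under reward model m.

    Returns (preferred_index, rejected_index) by mean standardized reward.
    """
    normalized = [zscore(row) for row in reward_table]
    n = len(reward_table[0])
    mean_scores = [sum(row[c] for row in normalized) / len(normalized) for c in range(n)]
    preferred = max(range(n), key=mean_scores.__getitem__)
    rejected = min(range(n), key=mean_scores.__getitem__)
    return preferred, rejected


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))


# Two hypothetical reward models scoring three inpainting candidates.
pair = ensemble_pair([[0.1, 0.9, 0.5], [10.0, 30.0, 20.0]])
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy equals reference
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls as the policy raises the preferred sample's likelihood relative to the rejected one.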

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG (version update)

Effect of Demographic Bias on Skin Lesion Classification

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

Affiliations: Fontys University of Applied Science, Venlo, The Netherlands; Fontys University of Applied Science, Eindhoven, The Netherlands; Eindhoven University of Technology, Eindhoven, The Netherlands; IT University of Copenhagen, Denmark

AI summary: Evaluates skin lesion classification with ResNet-based convolutional models, uses linear programming to control demographic characteristics, studies the effect of patient sex and age bias, and compares three learning strategies; sex bias is found to stem mainly from data imbalance, whereas age bias consistently favours younger groups.

Comments: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

DOI: 10.59275/j.melba.2026-4156
Journal ref: https://melba-journal.org/2026:011

Abstract

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03183 2026-06-03 cs.MM cs.CV cs.SD eess.AS (version update)

Inference-Time Scaling for Joint Audio-Video Generation

Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

Affiliations: Korea Advanced Institute of Science and Technology; Luma AI

AI summary: To address the multi-objective optimization challenge in joint audio-video generation, proposes a multi-verifier framework and an adaptive reward weighting algorithm that significantly improve semantic alignment, perceptual quality, and audio-video synchronization without additional training.

Comments: Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/

Abstract

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
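
To make the aggregation problem concrete, the sketch below shows a simple best-of-N selection in which each verifier's rewards are standardized with running statistics before being summed, so that a verifier with a large numeric range does not dominate. This is a stand-in written under assumptions, not the ARW algorithm itself, whose learnable parameters and update rule are not specified in the abstract; the class and function names are hypothetical.

```python
import math


class RunningScale:
    """Welford running mean and variance for one verifier's rewards."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        std = math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0
        return (x - self.mean) / std if std > 0 else 0.0


def select_best(candidate_rewards, scales):
    """candidate_rewards[c][v] is the reward of candidate c from verifier v."""
    for rewards in candidate_rewards:            # first pass: update statistics online
        for scale, r in zip(scales, rewards):
            scale.update(r)
    totals = [
        sum(scale.normalize(r) for scale, r in zip(scales, rewards))
        for rewards in candidate_rewards
    ]
    return max(range(len(totals)), key=totals.__getitem__)


# Verifier 0 scores in [0, 1]; verifier 1 scores in [0, 100].
rewards = [[0.9, 10.0], [0.8, 90.0], [0.1, 95.0]]
raw_choice = max(range(3), key=lambda c: sum(rewards[c]))
balanced_choice = select_best(rewards, [RunningScale(), RunningScale()])
```

Here the raw sum picks the candidate that only the large-range verifier likes, while the standardized sum picks the candidate that scores well under both verifiers.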

2606.03180 2026-06-03 cs.CV cs.CL cs.LG (version update)

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

AI summary: To address the scale mismatch between global image-report alignment and local findings in radiology, proposes the GLINT framework, which enables zero-shot classification, grounding, and segmentation through sparsely gated alignment and dense feature regularization.

Abstract

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
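
The gating idea described above, a sigmoid over a separate gate embedding space that switches on only the patches relevant to a query, might look roughly like the following. This is a hedged sketch: the `scale` and `bias` values, the gate-weighted pooling, and the function names are assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_similarity(patch_feats, patch_gate_embs, text_feat, text_gate_emb,
                     scale=10.0, bias=-5.0):
    """Return (pooled image-text similarity, per-patch gates).

    Gates are computed in a separate gate embedding space; the similarity is
    computed in the feature space and pooled with the gates as weights.
    """
    gates = [sigmoid(scale * dot(g, text_gate_emb) + bias) for g in patch_gate_embs]
    sims = [dot(p, text_feat) for p in patch_feats]
    pooled = sum(g * s for g, s in zip(gates, sims)) / sum(gates)
    return pooled, gates


# Three patches; only the first is relevant to the query in both spaces.
patch_feats = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
patch_gate_embs = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
pooled, gates = gated_similarity(patch_feats, patch_gate_embs, (1.0, 0.0), (1.0, 0.0))
mean_pooled = sum(dot(p, (1.0, 0.0)) for p in patch_feats) / len(patch_feats)
```

With dense mean pooling the single relevant patch is diluted by the irrelevant ones, whereas the gated pooling keeps the similarity close to that of the relevant patch. The per-patch gates can also be read as a coarse localization map.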

2606.03168 2026-06-03 cs.CV (version update)

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan

Affiliations: Zhejiang University; Tencent Youtu Lab; Nanjing University; University of Auckland; Fudan University; National University of Singapore

AI summary: To address the lack of datasets and benchmarks for joint audio-visual editing, proposes JAVEdit-100k, the first large-scale high-quality dataset, the JAVEditBench benchmark, and the JAVEdit baseline model, which outperforms all baselines on five of six metrics.

Comments: Equal contributions from first two authors. Project page: https://ryanchenyn.github.io/projects/JAVEdit Code: https://github.com/RyanChenYN/JAVEdit Dataset: https://huggingface.co/datasets/Coraxor/JAVEdit-100k

Abstract

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

2606.03160 2026-06-03 cs.CV (version update)

SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

Qiuxia Wu, Jiarui Lan, Wenxiong Kang, Zhiyong Wang, Kun Hu

Affiliations: School of Software Engineering, South China University of Technology; School of Automation Science and Engineering, South China University of Technology; School of Computer Science, University of Sydney; School of Science, Edith Cowan University

AI summary: Proposes SRENet, which uses spectral decomposition and re-entry modules to learn global context and fine-grained temporal dynamics from a frequency perspective for action recognition on point cloud sequences.

Comments: 13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology

DOI: 10.1109/TCSVT.2026.3695515

Abstract

Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.
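
The split into low- and high-frequency components can be illustrated with a one-level Haar transform along the temporal axis. The abstract only says "wavelet-based", so the choice of the Haar wavelet, the single decomposition level, and the function names here are assumptions; this is a sketch of the general technique, not the SDeBlock.

```python
import math


def haar_step(signal):
    """One-level Haar transform of an even-length sequence: returns (low, high)."""
    if len(signal) % 2:
        raise ValueError("expected an even number of frames")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_step."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out


# One feature channel tracked over eight frames: slow drift plus a sudden jerk.
frames = [1.0, 1.0, 1.1, 1.1, 1.2, 3.0, 1.3, 1.3]
low, high = haar_step(frames)
restored = haar_inverse(low, high)
```

The low band carries the slowly varying trend (coarse, global motion), while the high band is zero wherever adjacent frames agree and spikes at the abrupt change, which is the kind of fine-grained temporal detail a frequency-specific branch could attend to.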

2606.03159 2026-06-03 cs.CV cs.AI cs.RO 版本更新

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA OmniDreams：用于闭环自动驾驶仿真的实时生成式世界模型

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

AI总结提出OmniDreams，一个基于Cosmos扩散模型训练的基础生成式世界模型，通过自回归生成动作条件视频，实现闭环仿真中复杂长尾场景的实时合成，并验证其在策略模型训练中的有效性。

AI中文摘要

随着自动驾驶能力的提升，在长尾场景中安全评估驾驶策略仍是一个关键瓶颈。在闭环仿真中，驾驶策略模型与环境主动交互，其动作动态更新模拟器状态并直接影响下一组生成的传感器观测。尽管近期基于重建的神经模拟器提供了逼真效果，但它们从根本上受限于初始捕获数据，难以泛化到高度动态或新颖场景。为克服这些限制，我们引入了OmniDreams，一个从Cosmos扩散模型进行中期和后训练的基础生成式世界模型，能够自回归地实时生成动作条件视频。通过利用Cosmos丰富的视觉先验以及在21k小时驾驶场景上的中期和后训练，OmniDreams合成了传统模拟器难以捕获的复杂未观测现象，例如极端天气和不可预测的动态智能体行为。关键在于，它自回归地根据过去帧、当前模拟器状态和即时驾驶动作来调节其逼真的传感器生成。在结合Alpamayo 1策略模型和AlpaSim编排器的闭环系统中部署时，OmniDreams充当一个高度响应、反应灵敏的环境，为训练和评估下一代自动驾驶策略提供了可扩展且全面的解决方案。我们还展示了初步结果，表明从OmniDreams后训练的世界-动作模型（WAM）在Physical AI自动驾驶NuRec数据集上取得了强劲性能，超越了基于VLA的Alpamayo 1.5研究策略模型，同时仅使用其1/5的总参数量。这些结果凸显了像OmniDreams这样的实时世界模型也有潜力作为策略架构的骨干网络。

英文摘要

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

2606.03148 2026-06-03 cs.CV 版本更新

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

$A^2$: 较小的自监督ViT比更大的ViT定位更优

Sreehari Rammohan, Huy Ha, Carl Vondrick

发表机构 * Columbia University（哥伦比亚大学）； Stanford University（斯坦福大学）

AI总结针对视觉分类中前景定位与丰富表征的矛盾，提出$A^2$方法，通过解耦小模型定位与大模型嵌入，利用预训练特征实现无需额外训练的竞争性能。

AI中文摘要

鲁棒的视觉分类通常依赖于定位图像中的主要前景对象，同时忽略上下文干扰。令人惊讶的是，我们发现较小的自监督ViT的注意力图比更大的ViT能更好地定位前景对象。然而，我们仍然需要大型ViT，因为它们从每个补丁中提取更丰富的表示。为了兼顾良好的定位和丰富的表示，我们提出了$A^2$，一种简单的方法，通过将看哪里（小注意力模型）与提取什么（大嵌入模型）解耦，利用这种逆缩放发现：我们围绕小模型的注意力峰值裁剪图像，并用大模型嵌入这些裁剪块。$A^2$完全使用预训练特征，不需要组标签，也不需要针对每个数据集进行注意力或骨干网络训练。在5个基准测试中，$A^2$与基于骨干匹配的损失级方法（如DFR）具有竞争力，并且在更强的分布偏移下优于端到端注意力训练。

英文摘要

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
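The "where to look" half of the idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `crop_around_peak`, the single-peak simplification (the abstract mentions several peaks), and the fixed square crop size are all assumptions made for illustration. The attention map is assumed to be a patch-level grid produced by some small ViT, and the resulting crop would then be passed to a larger embedding model.

```python
import numpy as np

def crop_around_peak(image, attn, crop=64):
    """Crop a square window centred on the strongest attention patch.

    image: (H, W, C) array; attn: (h, w) patch-level attention map.
    Hypothetical helper; assumes crop <= H and crop <= W.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    # Location of the attention peak on the patch grid.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    # Map the patch centre back to pixel coordinates.
    cy = int((py + 0.5) * H / h)
    cx = int((px + 0.5) * W / w)
    # Clamp the window so it stays inside the image.
    top = min(max(cy - crop // 2, 0), H - crop)
    left = min(max(cx - crop // 2, 0), W - crop)
    return image[top:top + crop, left:left + crop]
```

Clamping rather than padding keeps every crop the same size, which is convenient when the crops are batched for the larger model.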

2606.03142 2026-06-03 cs.CV 版本更新

Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy

解构LVLMs可视化素养中的视觉与事实正确性

Soohyun Lee, Jaeyoung Kim, Seokhyeon Park, Sihyeon Lee, Jiwon Song, Bohyoung Kim, Hyunjoo Song, Jinwook Seo

发表机构 * Seoul National University（首尔国立大学）； MADI Co., Ltd.（MADI公司）； Hankuk University of Foreign Studies（韩国外国语大学）； Soongsil University（崇实大学）

AI总结提出框架分离视觉正确性与事实正确性，通过反事实测试和仲裁指标揭示LVLMs在可视化素养评估中依赖事实记忆而非视觉推理的问题。

Comments Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures

AI中文摘要

大型视觉语言模型（LVLMs）展现出强大的可视化解释能力，但尚不清楚其响应是否反映对视觉证据的真实推理，还是训练中习得的事实先验。当前评估混合了这两种来源，掩盖了正确视觉解释被记忆事实覆盖的情况。我们提出了一个将视觉正确性与事实正确性分离的框架，揭示了现有可视化素养评估的有效性局限。通过15个最先进LVLMs的三个实验：（1）多个模型在标准测试（VLAT）上达到人类水平，但这可能反映事实回忆而非视觉理解，而随机数据测试（reVLAT）在正确视觉解释被事实先验取代时低估了素养。（2）使用我们的反事实可视化素养评估测试（CVLAT）和能力归一化仲裁指标，我们根据视觉-事实依赖指数（VFRI）的符号对模型进行分类，揭示了以视觉为导向的多数和以事实知识为导向的少数，尽管几个接近零的情况需要谨慎。在相同反事实项目上的人类基线（N=30）证实，人们在冲突时绝大多数遵循图表，提供了人类参考点。（3）基于提示的干预可以改变优先级，但其有效性高度依赖模型且方向不对称，高图表阅读能力不能预测提示可控性。总体而言，高可视化准确性不足以证明忠实的视觉推理：可靠地集成到视觉分析中不仅需要评估可视化素养，还需要评估模型在视觉证据和事实先验分歧时如何仲裁。基准和代码：https://github.com/JaeyoungKim-HCIL/CVLAT

英文摘要

Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: https://github.com/JaeyoungKim-HCIL/CVLAT

2606.03120 2026-06-03 cs.CV 版本更新

KC-3DGS: Kurtosis-Constrained Gaussian Splatting for High-Fidelity View Synthesis

KC-3DGS: 基于峰度约束的高斯泼溅用于高保真视图合成

Vivekjyoti Banerjee, Abhay Yadav, Rama Chellappa, Aniket Roy

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； NEC Labs America（NEC美国实验室）

AI总结提出KC-3DGS，通过在小波域添加多尺度对齐损失、峰度集中损失和跨频带协方差惩罚，增强3DGS的感知质量，尤其改善稀疏视图下的高频细节和结构伪影。

AI中文摘要

3D高斯泼溅（3DGS）通过将场景表示为各向异性高斯集合，并通过可微分光栅化优化，实现了实时新视图合成。然而，标准像素空间损失（L1、SSIM）仅约束整体重建误差，允许优化在频率尺度上重新分配误差。这导致过度平滑和结构伪影，尤其在监督有限的稀疏视图设置中。我们提出KC-3DGS，通过基于自然图像统计的小波域监督来增强3DGS训练。我们的方法结合了三个组件：（1）多尺度小波系数对齐损失，显式惩罚缺失的高频细节；（2）有监督的峰度集中损失，鼓励渲染图像匹配真实图像的重尾频率统计；（3）跨频带协方差惩罚，促进频率专门化。我们提供理论分析，表明像素空间损失允许在小波重分布下的一族不可区分扰动，而我们的联合目标排除了退化解。在MipNeRF360、Tanks&Temples、MVImgNet、DeepBlending和WRIVA-ULTRRA上的实验表明，感知质量持续提升。在具有挑战性的WRIVA-ULTRRA室外数据集上，KC-3DGS在DreamSim上提高了9.48%，同时改善了PSNR、SSIM和LPIPS。在仅有12张训练图像的稀疏视图设置中，我们的方法在MipNeRF360上将PSNR提高了高达0.5 dB，同时保持了感知质量。该方法作为即插即用的正则化策略，可无缝集成到现有的3DGS流程中。

英文摘要

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis by representing scenes as collections of anisotropic Gaussians optimized via differentiable rasterization. However, standard pixel-space losses (L1, SSIM) constrain only aggregate reconstruction error, permitting the optimization to redistribute error across frequency scales. This leads to oversmoothing and structural artifacts, particularly in sparse-view settings where supervision is limited. We propose KC-3DGS, which augments 3DGS training with wavelet-domain supervision based on natural image statistics. Our method combines three components: (1) a multi-scale wavelet coefficient alignment loss that explicitly penalizes missing high-frequency detail, (2) a supervised kurtosis concentration loss that encourages rendered images to match the heavy-tailed frequency statistics of ground-truth images, and (3) a cross-band covariance penalty that promotes frequency specialization. We provide theoretical analysis showing that pixel-space losses admit a family of indistinguishable perturbations under wavelet redistribution, and that our joint objective excludes degenerate solutions. Experiments across MipNeRF360, Tanks&Temples, MVImgNet, DeepBlending, and WRIVA-ULTRRA demonstrate consistent improvements in perceptual quality. On the challenging WRIVA-ULTRRA outdoor dataset, KC-3DGS achieves a 9.48% improvement in DreamSim while also improving PSNR, SSIM, and LPIPS. In sparse-view settings with only 12 training images, our method improves PSNR by up to 0.5 dB on MipNeRF360 while maintaining perceptual quality. The approach integrates seamlessly into existing 3DGS pipelines as a plug-and-play regularization strategy.
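To make the kurtosis idea concrete, here is a minimal sketch of how heavy-tailed statistics of wavelet detail bands could be compared between a rendered image and a reference. It is an illustration only, not the KC-3DGS loss: the single-level Haar transform, the function names `haar_level`, `kurtosis`, and `kurtosis_gap`, and the absolute-difference form of the gap are assumptions, and the inputs are taken to be 2D grayscale arrays with even height and width.

```python
import numpy as np

def haar_level(img):
    """One level of an orthonormal 2D Haar transform.

    Returns (LL, LH, HL, HH); img must have even height and width.
    """
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # low-pass approximation
    lh = (a - b + c - d) / 2.0  # differences between columns
    hl = (a + b - c - d) / 2.0  # differences between rows
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh

def kurtosis(x, eps=1e-12):
    """Non-excess kurtosis (about 3 for Gaussian data)."""
    x = np.asarray(x, dtype=float).ravel()
    mu = x.mean()
    var = x.var()
    return float(((x - mu) ** 4).mean() / (var ** 2 + eps))

def kurtosis_gap(rendered, target):
    """Mean absolute kurtosis difference over the three detail bands."""
    r_bands = haar_level(rendered)[1:]
    t_bands = haar_level(target)[1:]
    gaps = [abs(kurtosis(r) - kurtosis(t)) for r, t in zip(r_bands, t_bands)]
    return float(np.mean(gaps))
```

Natural images tend to have detail-band kurtosis well above the Gaussian value of 3, so an oversmoothed rendering would typically show up as a nonzero gap. A training loss would need a differentiable framework rather than NumPy.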

2606.03119 2026-06-03 cs.CV cs.AI cs.LG 版本更新

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

GuidedBridge: 无需训练地利用先验引导改进桥接模型

Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng, Jun S. Liu, Jun Zhu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出无需训练的先验引导方法（PG）和频率调制先验引导（FMPG），通过对比弱先验与已见先验增强桥接模型的先验利用，并设计级联框架CFG-FMPG用于图像修复，实验证明该方法能一致提升预训练桥接模型在多种图像翻译任务中的性能。

Comments ICML 2026

AI中文摘要

引导方法，如无分类器引导（CFG）和自动引导（AG），推动了扩散模型中噪声到数据生成的发展。最近，桥接模型引入了一种数据到数据的生成过程，可以利用有指导性的干净先验。在这项工作中，受先前通过去噪结果质量差异作为引导的方法启发，我们提出了一种无需训练的桥接引导方法，称为先验引导（PG）。具体来说，我们引入一个弱先验，该先验在桥接预训练期间未见，阻碍先验利用从而降低去噪结果。然后，我们将其与已见先验对比，通过缩放因子突出并增强先验利用。此外，我们分析了桥接过程中先验利用的潜在机制，并设计了频率调制先验引导（FMPG），该引导将引导尺度调整到与桥接生成动力学一致的低频和高频带。为了解决图像修复中的先验利用问题，我们开发了一个级联框架CFG-FMPG，该框架首先通过CFG生成噪声隐藏表示，然后将其作为生成先验与FMPG一起利用，在不影响推理效率的情况下发挥它们的互补优势。实验表明，我们的PG方法在多种图像翻译任务中一致地改进了预训练桥接模型。

英文摘要

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods creating quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, fulfilling their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.

2606.03118 2026-06-03 cs.LG cs.CV q-bio.NC 版本更新

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

通过基于模型的深度强化学习在计算机模拟中学习经由视网膜上植入物刺激实现视觉

Jacob Lavoie, Marwan Besrour, William Lemaire, Jean Rouat, Réjean Fontaine, Eric Plourde

发表机构 * Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke（电气与计算机工程系，舍布鲁克大学）

AI总结本研究提出使用各向同性和各向异性形状，通过深度强化学习在虚拟患者的视网膜上渲染可理解的图像，以提高人工恢复视觉的清晰度。

Comments 18 pages, 6 figures. Published version: Biomed. Phys. Eng. Express 10, 025006 (2024)

DOI: 10.1088/2057-1976/acf1a5
Journal ref: Biomed. Phys. Eng. Express 10 (2024) 025006

AI中文摘要

目标：年龄相关性黄斑变性和视网膜色素变性等疾病会导致感光层退化。恢复视力的一种方法是通过微电极阵列（如视网膜上植入物）电刺激存活的视网膜神经节细胞。已知视网膜上植入物会产生沿邻近视网膜神经节细胞轴突束延伸的可见各向异性形状。最近的研究表明，为了获得各向同性的像素状形状，可以通过失活电极或降低刺激电流水平来映射轴突束并避免刺激它们。避免轴突束刺激旨在去除类似笔触的形状，转而采用更简化的像素状形状集合。方法：在本研究中，我们提出使用各向同性和各向异性形状，在名为rlretina的强化学习环境中为虚拟患者的视网膜渲染可理解的图像。该环境将任务形式化为在基于笔触的渲染任务中使用笔触。主要结果：我们训练了一个深度强化学习智能体，它学会组合各向同性和各向异性形状以形成图像。我们研究了哪种基于误差或基于感知的指标适合奖励智能体。该智能体以基于模型的数据生成方式训练，使用经过心理物理学验证的轴突映射模型来渲染不同虚拟患者感知到的图像。我们表明，与不同虚拟患者中的朴素方法相比，该智能体可以生成更可理解的图像。意义：这项工作提供了一种解决视网膜上刺激的新方法，这是朝着使用各向异性光幻视改善人工恢复视力中视觉敏锐度的第一步。

英文摘要

Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion cells with a microelectrode array such as epiretinal implants. Epiretinal implants are known to generate visible anisotropic shapes elongated along the axon fascicles of neighboring retinal ganglion cells. Recent work has demonstrated that to obtain isotropic pixel-like shapes, it is possible to map axon fascicles and avoid stimulating them by inactivating electrodes or lowering stimulation current levels. Avoiding axon fascicle stimulation aims to remove brushstroke-like shapes in favor of a more reduced set of pixel-like shapes. Approach: In this study, we propose the use of isotropic and anisotropic shapes to render intelligible images on the retina of a virtual patient in a reinforcement learning environment named rlretina. The environment formalizes the task as using brushstrokes in a stroke-based rendering task. Main Results: We train a deep reinforcement learning agent that learns to assemble isotropic and anisotropic shapes to form an image. We investigate which error-based or perception-based metrics is adequate to reward the agent. The agent is trained in a model-based data generation fashion using the psychophysically validated axon map model to render images as perceived by different virtual patients. We show that the agent can generate more intelligible images compared to the naive method in different virtual patients. Significance: This work shares a new way to address epiretinal stimulation that constitutes a first step towards improving visual acuity in artificially-restored vision using anisotropic phosphenes.

2606.03114 2026-06-03 cs.CV 版本更新

FAF-CD: Frequency-Aware Fusion for Change Detection under Imperfect Multimodal Remote Sensing

FAF-CD: 面向不完美多模态遥感的频率感知融合变化检测

Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

发表机构 * University of South Florida（佛罗里达州立大学）； Delaware State University（特拉华州立大学）

AI总结提出频率感知混合框架FAF-CD，通过DINOv3预训练ConvNeXt编码器、VMamba解码器及修正感知三支融合模块（可变形空间对齐+傅里叶/哈尔小波比较+自适应门控），在不完美异质遥感（如EO-SAR）和二元光学变化检测中提升精度并降低计算成本。

Comments Code will be released at https://github.com/VimsLab/FAF-CD

AI中文摘要

面向真实世界监测的遥感变化检测通常依赖于不完美的异质观测，其中事件前后图像可能异步、跨传感器，或受光照、季节和模态偏移影响。这一设置对EO-SAR灾害制图尤其具有挑战性，因为干扰变化可能类似于结构损伤。我们提出FAF-CD，一种频率感知混合框架，采用DINOv3预训练的ConvNeXt编码器和线性复杂度的基于VMamba的解码器。其修正感知三支融合模块将可变形空间对齐与傅里叶和哈尔小波比较相结合，使用自适应门控跨尺度聚合互补线索。在BRIGHT验证集上，匹配的异质EO-SAR适应在干净和扰动tc-mIoU/tc-mAP上优于NeXt2Former-CD。FAF-CD还泛化到二元光学变化检测，在LEVIR-CD上达到0.924 cF1，在WHU-CD上达到0.955 cF1，并在伪变化对齐压力测试下，在M-CD和NeXt2Former-CD中，在两个二元数据集上获得最佳平均扰动cIoU/cF1。相对于NeXt2Former-CD，它进一步降低了约24 GFLOPs的计算成本，同时保持或提高了精度。

英文摘要

Remote sensing change detection for real-world monitoring often relies on imperfect heterogeneous observations, where pre- and post-event images may be asynchronous, cross-sensor, or affected by illumination, seasonal, and modality shifts. This setting is especially challenging for EO-SAR disaster mapping, where nuisance variation can resemble structural damage. We propose FAF-CD, a frequency-aware hybrid framework with a DINOv3-pretrained ConvNeXt encoder and a linear-complexity VMamba-based decoder. Its rectification-aware tri-branch fusion module combines deformable spatial alignment with Fourier and Haar-wavelet comparisons, using adaptive gating to aggregate complementary cues across scales. On BRIGHT validation, a matched heterogeneous EO-SAR adaptation improves clean and perturbed tc-mIoU/tc-mAP over NeXt2Former-CD. FAF-CD also generalizes to binary optical CD, achieving 0.924 cF1 on LEVIR-CD and 0.955 cF1 on WHU-CD, and obtains the best average perturbed cIoU/cF1 on both binary datasets among M-CD and NeXt2Former-CD under pseudo-change-aligned stress tests. It further reduces cost by approximately 24 GFLOPs relative to NeXt2Former-CD while maintaining or improving accuracy.

2606.03111 2026-06-03 cs.CV 版本更新

Inverting the Generation Process of Denoising Diffusion Implicit Models: Empirical Evaluation and a Novel Method

反转去噪扩散隐式模型的生成过程：实证评估与新方法

Yan Zeng, Masanori Suganuma, Takayuki Okatani

发表机构 * Graduate School of Information Sciences, Tohoku University（东北大学信息科学研究生院）； RIKEN Center for AIP（理化学研究所AIP中心）

AI总结提出一种结合梯度下降和不动点方法的混合方法，用于从生成图像中恢复DDIM的初始噪声图，显著提高了预测精度和重建质量。

AI中文摘要

本文研究了反转DDIM图像生成过程以从生成图像中恢复潜在变量（特别是初始噪声图）的问题。现有方法在此任务中常面临精度不足的挑战。我们提出了一种新颖的混合方法，该方法在第一步结合了通过梯度下降的直接反转，随后在后续步骤中采用不动点方法。在三个数据集上的实证评估表明，我们的方法显著提高了初始潜在变量的预测精度，同时实现了更优的重建准确性。此外，我们引入了一项新的评估指标，称为自插值测试，该测试评估从真实与预测潜在图之间的插值点生成的图像质量，从而提供对性能更深入的洞察。我们的结果表明，尽管现有方法在重建方面表现尚可，但它们始终无法准确预测初始潜在变量，导致在自插值测试中表现不佳。相比之下，我们的方法在所有指标上均优于其他方法，为扩散模型提供了宝贵的见解，并增强了其在图像生成和编辑中的应用。

英文摘要

This paper studies the problem of inverting the DDIM image generation process to recover latent variables, particularly the initial noise map, from a generated image. Existing methods often struggle with accuracy in this task. We propose a novel hybrid approach that combines direct inversion via gradient descent for the first step, followed by a fixed-point method for subsequent steps. Empirical evaluations across three datasets demonstrate that our method significantly improves the prediction of initial latent variables while achieving superior reconstruction accuracy. Additionally, we introduce a new evaluation, called the self-interpolation test, which assesses the quality of images generated from interpolated points between the true and predicted latent maps, offering deeper insights into performance. Our results reveal that while existing methods perform reasonably well in reconstruction, they consistently fail to accurately predict the initial latent variables, resulting in poor performance on the self-interpolation test. In contrast, our method outperforms all others across all metrics, providing valuable insights into diffusion models and enhancing their applications in image generation and editing.
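The fixed-point part of such an inversion can be sketched for a single deterministic DDIM step. This is a toy illustration under stated assumptions, not the paper's method: `toy_eps` is a hypothetical stand-in for a trained noise predictor (chosen so that the iteration is a contraction), the cumulative alpha values are arbitrary, and the gradient-descent treatment of the first step described in the abstract is not shown.

```python
import numpy as np

def toy_eps(x, t):
    # Hypothetical stand-in for a trained noise-prediction network.
    return 0.1 * np.tanh(x)

def ddim_step(x_t, a_t, a_prev, eps_fn, t=0):
    """Deterministic DDIM update from x_t to x_{t-1}.

    a_t and a_prev are the cumulative alpha products at t and t-1.
    """
    e = eps_fn(x_t, t)
    x0 = (x_t - np.sqrt(1.0 - a_t) * e) / np.sqrt(a_t)  # predicted clean sample
    return np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * e

def invert_step(x_prev, a_t, a_prev, eps_fn, t=0, iters=50):
    """Recover x_t from x_{t-1} by fixed-point iteration.

    Rearranging the DDIM update gives x_t = g(x_t); iterate g from x_prev.
    """
    x_t = x_prev.copy()
    for _ in range(iters):
        e = eps_fn(x_t, t)
        x_t = (np.sqrt(a_t / a_prev) * (x_prev - np.sqrt(1.0 - a_prev) * e)
               + np.sqrt(1.0 - a_t) * e)
    return x_t
```

With a real network the map is not guaranteed to be a contraction, so convergence of this iteration is an assumption of the sketch rather than a general property.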

2606.03084 2026-06-03 cs.CV 版本更新

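FCUS-rPPG: A Fast-Converging Unsupervised Remote Photoplethysmography Framework via Gradient Oscillation Suppression

FCUS-rPPG：一种通过梯度振荡抑制实现快速收敛的无监督远程光电容积描记框架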

Jiajie Li, Yu Liu, Rencheng Song, Xun Chen, Juan Cheng

发表机构 * Department of Biomedical Engineering（生物医学工程系）； Anhui Province Key Laboratory of Measuring Theory and Precision Instrument（安徽省测量理论与精密仪器重点实验室）； Hefei University of Technology（合肥工业大学）； Department of Electronic Engineering and Information Science（电子工程与信息科学系）； University of Science and Technology of China（中国科学技术大学）

AI总结提出FCUS-rPPG框架，通过光谱共享骨干网络和梯度、损失景观、特征表示层面的统一优化，实现单轮训练收敛并在跨数据集评估中达到最优性能。

AI中文摘要

远程光电容积描记术（rPPG）利用消费级摄像头实现非接触式血容量脉搏（BVP）信号提取。现有的无监督rPPG方法无需真实生理标注即可学习BVP表示，但其优化常受噪声和不稳定梯度影响，导致收敛缓慢且跨域泛化能力有限。本文提出FCUS-rPPG，一种快速收敛且具有强泛化能力的无监督rPPG框架。受BVP表示同时具有多光谱共变和低维流形结构的观察启发，我们设计了光谱共享骨干网络，促进BVP特征解耦并提高优化效率。为了联合增强收敛稳定性和泛化性能，我们进一步开发了一个在梯度、损失景观和特征表示层面运作的统一优化框架。具体而言，后验证掩蔽机制根据BVP信号的弱幅度生理先验过滤误导性梯度；基于扰动的损失景观平滑策略将优化导向更可泛化的平坦最小值；噪声感知零空间正则化将特征更新约束在噪声子空间的正交补空间内，从而减轻噪声引起的表示漂移。在五个数据集上的大量实验表明，FCUS-rPPG仅需一个训练周期，而现有方法通常需要数十到数百个周期。值得注意的是，FCUS-rPPG在跨数据集评估中持续达到最先进（SOTA）性能。本研究为无监督rPPG的实际部署提供了高效且鲁棒的解决方案。源代码将在 https://github.com/JiaJieLee/FCUS-rPPG 公开。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact extraction of blood volume pulse (BVP) signals using consumer-grade cameras. Recent unsupervised rPPG methods learn BVP representations without requiring ground-truth physiological annotations, yet their optimization is often hindered by noisy and unstable gradients, resulting in slow convergence and limited cross-domain generalization. In this paper, we propose FCUS-rPPG, a fast-converging unsupervised rPPG framework with strong generalization capability. Motivated by the observation that BVP representations exhibit both multi-spectral covariation and low-dimensional manifold structure, we design a spectrally shared backbone that facilitates BVP feature disentanglement while improving optimization efficiency. To jointly enhance convergence stability and generalization performance, we further develop a unified optimization framework operating at the gradient, loss-landscape, and feature-representation levels. Specifically, a post-verification masking mechanism filters out misleading gradients according to the weak-amplitude physiological prior of BVP signals; a perturbation-based loss landscape smoothing strategy steers optimization toward more generalizable flat minima; and a noise-aware null-space regularization constrains feature updates to the orthogonal complement of the noise subspace, thereby mitigating noise-induced representation drift. Extensive experiments on five datasets demonstrate that FCUS-rPPG requires only one training epoch, whereas existing methods typically require tens to hundreds of epochs. Notably, FCUS-rPPG consistently achieves state-of-the-art (SOTA) performance in cross-dataset evaluations. This study provides an efficient and robust solution to the real-world deployment of unsupervised rPPG. The source code will be publicly available at https://github.com/JiaJieLee/FCUS-rPPG.

2606.03005 2026-06-03 cs.CV cs.AI 版本更新

MUSE: A Unified Agentic Harness for MLLMs

MUSE: 多模态大语言模型的统一智能体框架

Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

发表机构 * Northeastern University（东北大学）

AI总结提出MUSE框架，通过可组合模块（任务表示、视觉处理、感知工具、结构化解析、确定性验证和验证器引导修复）提升冻结多模态大语言模型性能，无需重新训练。

AI中文摘要

尽管进展迅速，多模态大语言模型（MLLMs）在人类轻松解决的任务上仍然失败，例如从屏幕截图导航网格迷宫或选择正确的拼图块。我们不重新训练模型，而是提出一个补充性问题：仅通过改进执行脚手架，能从冻结的MLLM中引出多少能力？我们引入MUSE，一个多模态统一结构化执行框架，它用可组合的模块（任务表示、视觉处理、感知工具使用、结构化解析、确定性验证和验证器引导修复）包装任何现成的MLLM，无需任何模型重新训练。我们使用多个最先进的MLLM，在涵盖视觉空间规划、视觉感知、多模态推理和细粒度视觉辨别的多样化基准上评估MUSE。MUSE在所有设置中都比裸模型带来一致的提升，在困难实例上提升最大。进一步分析揭示，许多MLLM失败源于框架层面的缺陷而非根本的模型缺陷，并且可以通过验证器引导修复来解决，无需触及模型。这些发现突显了智能体多模态框架作为一个关键但尚未充分探索的设计维度，提供了超越以模型为中心的优化的正交改进途径。

英文摘要

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.

2606.02996 2026-06-03 cs.RO cs.CV cs.HC 版本更新

MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

MARIO: 运动增强的实时多传感器惯性里程计

Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja

发表机构 * Northwestern University（西北大学）； University of Chicago（芝加哥大学）

AI总结提出MARIO框架，通过学习IMU推断的人体姿态先验约束运动动力学，并结合多传感器融合（磁力计、气压计、辅助IMU），在Nymeria数据集上将位置漂移降低36%-42%，实现无相机人体跟踪的准确鲁棒惯性里程计。

Comments CVPR 2026 Findings

AI中文摘要

仅使用惯性测量单元（IMU）的惯性里程计（IO）为增强现实（AR）和可穿戴设备中的人体运动跟踪提供了轻量级解决方案。最近的基于学习的IO方法通过在大规模人体运动数据集上进行预训练，提高了惯性定位的泛化能力。然而，这些方法仍然容易受到漂移和噪声的影响，因为它们没有显式捕捉人体运动动力学，尤其是在日常活动数据集（如Nymeria）上。在这项工作中，我们提出通过学习的IMU推断姿态先验将惯性里程计建立在人体运动学基础上，该先验促进物理一致的运动约束。我们将此姿态先验集成到现有IO架构中，并在具有挑战性的Nymeria数据集上将位置漂移减少高达36%，该数据集比先前工作中使用的数据集大5倍。我们进一步通过传感器融合框架改进了长期性能，该框架整合了商用AR眼镜上已有的轻量级传感器的辅助信号，包括磁力计、气压计和辅助IMU。通过这种融合策略，位置漂移减少了高达42%，提高了在不同运动条件下的鲁棒性和泛化能力。总之，我们的结果通过将人体运动学与多模态传感统一起来，为惯性轻量级里程计引入了新范式，为准确鲁棒的无相机人体跟踪设立了新基准。我们的网站：https://spice-lab.org/projects/MARIO/

英文摘要

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.

2606.02979 2026-06-03 cs.CV cs.AI cs.RO 版本更新

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

面向紧凑型自动驾驶感知的平衡学习与多传感器融合

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Engineering, Toyohashi University of Technology（计算机科学与工程系，丰桥技术科学大学）； Department of Computer Science and Electronics, Gadjah Mada University（计算机科学与电子系，加查马达大学）

AI总结提出一种紧凑的深度多任务学习模型，通过自适应损失加权和中间传感器融合技术，在单次前向传播中同时处理语义分割、深度估计、激光雷达分割和鸟瞰投影，实现高效自动驾驶感知。

Comments This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

DOI: 10.1109/TITS.2022.3149370

AI中文摘要

我们提出了一种新颖的紧凑型深度多任务学习模型，能够在一次前向传播中处理多种自动驾驶感知任务。该模型同时执行多视角语义分割、深度估计、激光雷达分割和鸟瞰投影，无需其他模型支持。我们还提供了一种自适应损失加权算法，以解决因任务众多而出现的学习不平衡问题。通过数据预处理和中间传感器融合技术，该模型可以处理并组合来自RGB摄像头、动态视觉传感器（DVS）和安装在自车多个位置的激光雷达的多种输入模态。因此，可以更好地理解动态变化的环境。基于消融研究，使用我们提出的方法训练的模型变体取得了更好的性能。此外，还进行了比较研究，以阐明其与一些近期模型组合相比的性能和有效性。结果表明，即使参数少得多，我们的模型仍能保持更好的性能。因此，该模型可以更快地推理，并减少GPU内存使用。此外，结果在3个不同的CARLA仿真数据集和1个真实世界的nuScenes-lidarseg数据集上保持一致。为了支持未来的研究，我们在以下网址公开共享代码和其他文件：https://github.com/oskarnatan/compact-perception

英文摘要

We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that occurred due to plenty of given tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. Based on the ablation study, the model variant trained with our proposed method achieves a better performance. Furthermore, a comparative study is also conducted to clarify its performance and effectiveness against the combination of some recent models. As a result, our model maintains better performance even with much fewer parameters. Hence, the model can inference faster with less GPU memory utilization. Moreover, the result tends to be consistent in 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share codes and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV 版本更新

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

发表机构 * Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain（图像处理小组（GTI）、信息处理与电信中心、电信工程学院、马德里理工大学、西班牙）

AI总结针对自我中心视频中的自然语言查询定位任务，提出手部轨迹编码器与自适应门控交叉注意力融合方法，利用手部运动信息提升查询定位性能。

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

2606.02956 2026-06-03 cs.CV cs.LG cs.RO 版本更新

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

自动驾驶的未来之路：KITScenes多模态数据集

Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

发表机构 * FZI Research Center for Information Technology（FZI信息技术研究中心）； Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； University Charles III of Madrid（马德里卡洛斯三世大学）； Delft University of Technology（代尔夫特理工大学）

AI总结本文提出KITScenes多模态数据集，通过高保真传感器和完整HD地图，解决现有数据集在传感器精度、地图完整性和地理多样性上的不足，并引入四个基准推动空间学习。

Comments 28 pages, 21 figures

AI中文摘要

现有的自动驾驶数据集取得了重大进展，但在传感器保真度、地图完整性或地理多样性方面仍存在不足。我们提出了KITScenes多模态数据集，这是一个基于高保真传感器和地图构建的欧洲数据集。我们完全同步的传感器套件结合了高分辨率全局快门相机、超过400米的长距离激光雷达、4D成像雷达以及冗余的GNSS/INS定位。据我们所知，我们的HD地图是任何传感器数据集中最完整的，并通过开源软件上的自动驾驶试验进行了验证。首次在公共数据集中，所有与驾驶相关的交通元素（如交通灯）都以3D方式映射到重投影精确的水平，并具有完整的拓扑连接。我们的数据集记录在街道布局不规则且交通模式混合的城市中，通过拓宽可用的地理多样性来补充现有数据集。我们还引入了四个基准，每个基准都推动了具身AI的空间学习：在线HD地图构建、长距离深度估计、新颖视图合成和端到端驾驶。项目页面：https://kitscenes.com/

Abstract

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: https://kitscenes.com/


2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC version update

SCOPE: Real-Time Natural Language Camera Agent at the Edge

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

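Affiliation: Armada AI

AI summary: Proposes SCOPE, a modular agent for natural-language-controlled PTZ cameras that performs real-time perception, planning, and control when deployed at the edge, and evaluates latency, accuracy, and error modes through simulation and physical experiments.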

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16--19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

DOI: 10.1145/3757279.3785641
Journal ref: Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026

Abstract

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.


2606.02947 2026-06-03 cs.LG cs.CV version update

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić

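Affiliation: University of Rijeka

AI summary: Proposes the BYORn framework, which identifies and replaces semantically implausible backdoor target responses to break the association between triggers and target outputs, improving robustness to backdoor attacks while preserving clean-task performance.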

Comments Accepted to ICML 2026


Abstract

Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.


2606.02935 2026-06-03 cs.CV cs.CE version update

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

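Affiliations: Department of Complex Systems, National Centre for Nuclear Research; ImagineRT sp. z o.o.; National Centre for Nuclear Research

AI summary: Proposes a two-stage geometric registration method for cylindrical objects (ionization chambers) that estimates the rotation axis by detecting elliptical cross-sections in CT slices, then voxelizes the CAD model and maximizes its volumetric overlap with the CT scan. It needs no intensity calibration or feature matching and keeps tilt and orientation errors below 0.1°.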


Abstract

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
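The axis-estimation step described above can be illustrated with a minimal sketch. It assumes the ellipse centers have already been fitted and filtered (edge detection, ellipse fitting, and RANSAC outlier removal are omitted), and it replaces a library PCA call with a power iteration on the 3×3 covariance matrix, which yields the same first principal direction. The names `principal_axis` and `tilt_deg` are hypothetical and do not come from the paper.

```python
import math


def principal_axis(centers, iters=100):
    """Return (centroid, unit direction) of the best-fit line through 3D points.

    The direction is the dominant eigenvector of the covariance matrix,
    found by power iteration. Its sign is arbitrary.
    """
    n = len(centers)
    if n < 2:
        raise ValueError("need at least two ellipse centers")
    centroid = [sum(p[i] for p in centers) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in centers]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]

    # Start from the center farthest from the centroid (illustrative choice),
    # so the start vector is non-zero whenever the points have any spread.
    v = max(centered, key=lambda c: sum(x * x for x in c))
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("all centers coincide; the axis is undefined")
    v = [x / norm for x in v]

    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return centroid, v


def tilt_deg(axis, reference=(0.0, 0.0, 1.0)):
    """Angle in degrees between two undirected axes."""
    dot = abs(sum(a * r for a, r in zip(axis, reference)))
    na = math.sqrt(sum(a * a for a in axis))
    nr = math.sqrt(sum(r * r for r in reference))
    return math.degrees(math.acos(min(1.0, dot / (na * nr))))


if __name__ == "__main__":
    # Synthetic centers of a cylinder slightly tilted away from the z axis.
    centers = [(0.1 * z, 0.0, float(z)) for z in range(10)]
    centroid, axis = principal_axis(centers)
    print(centroid, axis, tilt_deg(axis))
```

The centroid and direction together define the estimated axis line; the second stage (voxelizing the CAD model and searching over translations for maximum overlap) is not covered by this sketch.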


2606.02927 2026-06-03 cs.CV version update

SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

Mourad Zaied

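Affiliations: Department of Electrical Engineering; National Engineering School of Gabes (ENIG); University of Gabes

AI summary: Proposes the SALU activation function as a replacement for normalization layers and builds the SaluNet network, which trains deep networks stably without normalization and achieves strong performance on several datasets.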

Comments 34 pages


Abstract

Normalization layers such as BatchNorm and LayerNorm have long been considered essential for stable training in deep networks. This work demonstrates that they can be fully replaced by a single learnable activation mechanism. We identify a plasticity suppression effect induced by standard normalization: learnable activation parameters rapidly lose adaptability when paired with normalization layers. Motivated by this observation, we introduce SALU (Saturated Adaptive Linear Unit), \[ \operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}},\quad a>0,\; b>0 \] a bounded, learnable activation that provides intrinsic signal stabilization without relying on batch statistics or external affine parameters. Building on SALU, we propose SaluNet, a paradigm grounded in total plasticity: SALU replaces normalization layers, while SWALU and GALU replace standard activations. With ResNet-18, SaluNet-C-18 achieves 97.35% on CIFAR-10 and 83.25% on CIFAR-100 without normalization, maintaining 93.44% and 76.23% at batch size 1 where normalized architectures fail. For transformers, SaluNet-T improves over LayerNorm-GELU from 90.92% to 91.01% on CIFAR-10 and from 66.54% to 68.10% on CIFAR-100. SaluNet-C-50 reaches 78.67% Top-1 on ImageNet-1K at $224\times224$, and 79.23% at $288\times288$. These results suggest normalization layers suppress total plasticity, a property biological neurons inherently possess, enabling deep networks to learn effectively.
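The SALU formula in the abstract can be checked numerically with a short scalar sketch. This is only an evaluation of the stated expression under the stated constraints a > 0 and b > 0; how the paper parameterizes or learns a and b is not given in the abstract, and the helper name `saturation_level` is hypothetical.

```python
import math


def salu(x: float, a: float, b: float) -> float:
    """SALU(x; a, b) = a*x / sqrt(1 + a*b*x^2), defined for a > 0 and b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("SALU requires a > 0 and b > 0")
    return a * x / math.sqrt(1.0 + a * b * x * x)


def saturation_level(a: float, b: float) -> float:
    """Limit of |SALU(x; a, b)| as |x| grows: a / sqrt(a*b) = sqrt(a / b)."""
    if a <= 0 or b <= 0:
        raise ValueError("SALU requires a > 0 and b > 0")
    return math.sqrt(a / b)


if __name__ == "__main__":
    a, b = 2.0, 0.5
    for x in (-100.0, -1.0, 0.0, 1.0, 100.0):
        print(x, salu(x, a, b))
    print("saturation level:", saturation_level(a, b))
```

Two properties follow directly from the formula: near the origin the function behaves like the linear map a·x, and for large |x| it saturates at ±sqrt(a/b), which is what makes the activation bounded.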


2606.02924 2026-06-03 cs.CV version update

ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception

Mellon M. Zhang, Siddhant Panse, Zimo Fan, Akshal Dhal, Rishit Sarkar, Glen Chou

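AI summary: To fill the gap in evaluating the robustness of LiDAR perception models under black-box sensor attacks, proposes ATLAS, the first large-scale, physically grounded benchmark. Using two attack modes, point injection and point removal, it reveals an asymmetry between model performance and robustness and traces it to standard data augmentation.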

Comments preprint


Abstract

Autonomous driving perception is typically evaluated on clean benchmark data, yet real-world deployment requires robustness to rare, structured, and potentially adversarial sensor anomalies. This gap is especially critical for LiDAR, where external actors can physically manipulate the sensing process to induce black-box perception failures without accessing the model. Existing LiDAR benchmarks provide little visibility into this failure mode. Prior adversarial LiDAR studies have largely centered on attack hardware, geometric and algorithmic defenses, and early-generation detectors, leaving the robustness of modern perception systems unexplored. To address this evaluation gap, we introduce ATLAS (Adversarial Temporal LiDAR Attack Suite), the first large-scale, physically grounded evaluation benchmark for LiDAR perception models under black-box sensor attacks, simulating the two primary attack modes -- point injection and point removal -- across real driving sequences. Evaluating a broad cross-section of current state-of-the-art LiDAR perception models, ATLAS reveals a surprising robustness asymmetry: models with stronger performance on standard benchmarks tend to better withstand removal attacks, yet are actually more vulnerable to injection attacks than weaker models. We trace this vulnerability to standard object database sampling augmentations, revealing how current training practices can induce architecture-agnostic robustness failures, and study initial directions for mitigating both attack modes. We release the ATLAS generation code to support extensible, reproducible evaluations as attack capabilities evolve, helping make black-box sensor robustness an explicit consideration in future LiDAR perception development.


2606.02915 2026-06-03 cs.CV version update

Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Amogh Vinaykumar, Aiden Li, Suozhi Huang, Shilong Liu

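Affiliations: Flower Mound High School; University College London; Princeton University

AI summary: Proposes the Any2Poster Bench benchmark and the Any2Poster Agent, which generate posters from multiple input modalities and domains, and validates information fidelity and visual communication through quiz-based and visual evaluation.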

Comments Project Page: https://github.com/Any2Poster/Any2Poster


Abstract

Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities--PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos--and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.


2606.02831 2026-06-03 cs.CV version update

Principled Reflection Separation via Nonlinear Superposition and Feature Interaction

Qiming Hu, Mingjia Li, Yuntong Li, Xiaojie Guo

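AI summary: To address the nonlinear coupling of transmission and reflection layers in single-image reflection separation, proposes a learnable nonlinear superposition model and a generalized dual-stream interactive framework that achieve better decomposition performance and generalization.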

Comments 23 pages


Abstract

Single-image reflection separation is fundamentally challenged by the entanglement of transmission and reflection layers under complex image formation processes. Existing approaches largely rely on simplified assumptions or independent modeling, limiting their ability to handle real-world scenarios. In this work, we revisit the problem from a unified perspective and identify a key issue of existing approaches, i.e., the widely adopted linear composition model in the sRGB domain fails to capture the nonlinear coupling introduced by real-world image signal processing pipelines. To address this, we introduce a learnable nonlinear superposition model that more faithfully characterizes layer interactions and improves decomposition fidelity. Building upon this formulation, we propose a generalized dual-stream interactive framework that explicitly models bidirectional dependencies between transmission and reflection through feature exchange. This framework unifies activation-, gating-, and attention-based interaction mechanisms, and is compatible with both CNN and Transformer backbones. Extensive experiments on diverse real-world benchmarks demonstrate that the proposed approach achieves superior performance with strong generalization capability. More importantly, our study reveals that reflection separation is not about undoing a linear mixture, but about learning nonlinear formation and interaction, offering new insights into the design of principled image decomposition models. Code and models are publicly available at https://mingcv.github.io/DIRS-Page.


2606.02809 2026-06-03 cs.CV version update

Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

Bo Liu, Hanxue Gu, Xiangru Li, Zheren Zhu, Jacob Ellison, Kang Wang, Janine M. Lupo, Yang Yang, Hui Lin

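Affiliations: UCSF–UC Berkeley Joint Graduate Program in Bioengineering; Department of Radiology, UCSF; Department of Radiation Oncology, UCSF

AI summary: Proposes an automated pipeline that generates multiple-choice VQA datasets from private radiology reports and 3D oncology imaging, builds a contamination-controlled benchmark, evaluates six vision-language models, and finds that visual reliance varies by dataset.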


Abstract

Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver report-derived questions genuinely require the image, while Lung CT is essentially solvable without it (the leading closed model exceeds its sighted accuracy on Lung CT when blinded), indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.


2606.02789 2026-06-03 cs.CV version update

Diagnosis of Human Object Interaction Detectors for Real World Educational Applications

Divya Mereddy, Ashwin Tudur Sadashiva, Marcos Quinones-Grueiro, Gautam Biswas

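AI summary: Proposes a diagnosis-driven framework that combines a triplet-level HOI error taxonomy with error-factor attribution analysis; targeted refinement raises the macro-F1 score of a pretrained CDN model on the CCATT dataset from 48.6 to 90.2.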


Abstract

Human-object interaction (HOI) recognition is critical for automatically analyzing student behavior in complex educational environments. Although state-of-the-art (SOTA) HOI detectors perform well on benchmark datasets, their performance often degrades when deployed in real-world training environments due to domain-specific objects, occlusions, and complex visual conditions. In this paper, we introduce a diagnosis-driven framework that integrates a triplet-level HOI error taxonomy with error-factor attribution analysis for real-world educational video data. We study this problem in the context of Critical Care Air Transport Team (CCATT) mixed-reality medical training. Based on an analysis of HOI failure modes and their causes, we develop a diagnosis-informed refinement strategy for adapting pretrained HOI models to the target domain. Experiments on the CCATT dataset show that this approach improves the macro-F1 score of a pretrained CDN model from 48.6 to 90.2 through targeted refinement guided by diagnosed error factors. These results highlight the value of detailed diagnostic analysis for informing targeted adaptation of HOI models in real-world educational environments.


2606.02774 2026-06-03 cs.CV version update

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

Yingzi Ma, Chaowei Xiao, Ming Jiang

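Affiliations: University of Wisconsin-Madison; Johns Hopkins University

AI summary: Proposes the GeoDrive-Bench benchmark, which uses 5,053 human-validated multiple-choice questions across six countries to evaluate how well vision-language models reason about region-specific traffic rules in four driving tasks (perception, prediction, planning, and region reasoning), and designs a distillation algorithm that injects regional knowledge to improve model performance.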


Abstract

Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.


2606.02764 2026-06-03 cs.CV physics.comp-ph version update

From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry

Hsiao-Jou Hsu, Joachim Moortgat

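Affiliation: School of Earth Sciences, The Ohio State University

AI summary: Evaluates Random Forest and four CNNs for transferable satellite-derived bathymetry from Sentinel-2 imagery over the 0-20 m depth range; a training strategy that preserves spatial continuity and a loss weighted by a smooth weight function yield robust cross-regional depth estimation.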

Comments 42 pages, 13 figures, 15 tables. Supplementary Information provided as ancillary file (anc/SI.pdf). Code and pretrained weights at https://github.com/buckai-observatory/DL_bathy

DOI: 10.3390/rs18111768
Journal ref: Remote Sens. 18 (2026) 1768

Abstract

Satellite-derived bathymetry (SDB) from multispectral imagery is cost-effective but scales poorly across regions, especially in optically complex coastal environments. We evaluate machine learning and deep learning for transferable SDB over the 0-20 m depth range using Sentinel-2 imagery. A Random Forest baseline and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) are trained on Pratas Island and selected Great Barrier Reef regions, then evaluated on spatially independent intra- and cross-regional test areas. Preserving spatial continuity during training, by keeping contiguous reef blocks rather than random patches, is the single most impactful design choice; we further introduce a Smooth Weight Function (SWF)-weighted RMSE loss that emphasizes near-surface depths. With these choices, intra-regional RMSE ranges from 1.15 to 1.92 m over 0-20 m and is as low as 0.26 m for depths <= 3 m. Random Forest degrades sharply under cross-regional transfer (RMSE 1.53 m -> 2.99-3.78 m), while the deep models stay more robust (2.46-2.98 m). On the public MagicBathyNet aerial-RGB benchmark (0-16 m) the proposed networks reach 0.19-0.22 m RMSE, outperforming a U-Net baseline and a task-specific transformer architecture with substantially fewer parameters. We further exploit multi-temporal repeat imagery: training on it broadens diversity, and median-aggregating predictions across passes at inference reduces noise from changing sun angles, atmospheric conditions, water properties, and tides. We release optimized architectures and pretrained weights to enable scalable transfer to new sites.
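The inference-time fusion of repeat passes mentioned in the abstract can be sketched as a per-pixel median. The data layout here is an assumption made for illustration: each pass is a flat list of predicted depths in metres over the same pixels, with `None` marking pixels that have no valid prediction (for example, masked cloud). The function name `median_aggregate` is hypothetical and not taken from the released code.

```python
from statistics import median


def median_aggregate(passes):
    """Per-pixel median depth across repeat passes.

    `passes` is a list of equally long lists, one per satellite pass.
    Pixels with no valid prediction in any pass stay None.
    """
    if not passes:
        raise ValueError("need at least one pass")
    width = len(passes[0])
    if any(len(p) != width for p in passes):
        raise ValueError("all passes must cover the same pixels")

    fused = []
    for pixel_values in zip(*passes):
        valid = [d for d in pixel_values if d is not None]
        fused.append(median(valid) if valid else None)
    return fused


if __name__ == "__main__":
    passes = [
        [2.0, 5.0, None],
        [2.2, 9.0, None],  # 9.0 stands in for a glint- or tide-affected outlier
        [1.9, 5.2, None],
    ]
    print(median_aggregate(passes))
```

The median is used rather than the mean because a single pass affected by sun glint, haze, or tide shifts the mean but leaves the median close to the consensus of the remaining passes.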


2606.02753 2026-06-03 cs.CV cs.AI version update

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

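Affiliations: Shanghai Jiao Tong University; Zhejiang University; Nanyang Technological University

AI summary: Proposes the MetaWorld framework, which builds a multi-agent video world model from single-view videos through Monocular World-State Unrolling, a Subject-Aware World Generator, and a World-State Alignment mechanism, addressing data scarcity and world state alignment.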


Abstract

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment (WSA), a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, so that the shared 3D environment and physical events remain well aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.


2606.02747 2026-06-03 cs.CV cs.AI version update

Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu

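AI summary: Proposes the Plan2Map benchmark and the GeoPlanAgent system, which reconstructs geospatial boundaries from UK planning records through document evidence extraction, localisation, map registration, boundary segmentation, and further steps, substantially outperforming direct VLM approaches.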

Comments Project page: https://odeb1.github.io/Plan2Map_Project_Page/. Fabian Degen and Oishi Deb contributed equally


AI中文摘要

规划记录定义了地理区域上的限制，但其源文档通常仅提供间接的空间证据而非机器可读的边界。我们介绍了Plan2Map，一个包含208个案例的多模态基准，用于从英国规划记录中重建文档驱动的地理空间边界。仅给定源规划文档，系统必须从通知文本、时间表、地图图版、地图标签和边界注释中重建有效的地理空间边界；参考GeoJSON被保留用于评分。我们提出了GeoPlanAgent，一个文档驱动、地理空间工具在环的系统，将任务分解为证据提取、定位、地图配准、边界分割、投影和验证。在Plan2Map上，GeoPlanAgent实现了0.736的平均IoU和0.904的中位IoU，其中67.8%的预测IoU达到或超过0.8，显著优于直接VLM到GeoJSON的基线。诊断分析表明，直接VLM预测仍然不可靠，而剩余错误集中在定位和地图配准上，监督边界分割显著提高了像素级掩码质量。Plan2Map为从公共规划记录中进行多模态地理空间重建提供了一个具体的测试平台。项目页面：此https URL。

英文摘要

Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: https://odeb1.github.io/Plan2Map_Project_Page/.
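
The three figures reported above (mean IoU, median IoU, and the share of predictions at or above 0.8 IoU) are all aggregates over per-case IoU values. The sketch below is illustrative only: it assumes regions have been rasterised into sets of grid cells, whereas the benchmark scores against a held-out GeoJSON and may compute IoU on polygons. The function names `mask_iou` and `summarize` are hypothetical and do not come from the paper.

```python
from statistics import mean, median


def mask_iou(pred, ref):
    """IoU of two regions given as collections of (row, col) grid cells."""
    pred, ref = set(pred), set(ref)
    union = pred | ref
    if not union:
        # Assumed convention: two empty regions score 0 rather than 1.
        return 0.0
    return len(pred & ref) / len(union)


def summarize(ious, threshold=0.8):
    """Aggregate per-case IoU values into the three statistics quoted above."""
    return {
        "mean_iou": mean(ious),
        "median_iou": median(ious),
        "share_at_or_above": sum(i >= threshold for i in ious) / len(ious),
    }
```

A mean well below the median, as in the reported 0.736 versus 0.904, is what one would expect when most cases score high and a minority fail badly.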


2606.02742 2026-06-03 cs.CV 版本更新

Graph Mamba Survival Analysis Based on Topology-Aware Ordering

Yuanfang Chen, Peiqiang Yan, Yuntao Shou, Qian Zhao, Xiangyong Cao

Affiliations: School of Mathematics and Statistics; West China Science and Technology Innovation Harbor; School of Computer Science and Technology

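AI summary: To address Mamba's sensitivity to input order and the limits its unidirectional architecture places on exploiting spatial structure in WSI survival analysis, proposes TopoMamSurv, a Graph Mamba framework based on topology-aware ordering that combines the TAO strategy, a bidirectional Mamba module, and GCN integration for efficient long-range dependency modeling and bidirectional spatial context modeling.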


AI中文摘要

在计算病理学中，全切片图像（WSI）生存分析对于患者预后评估至关重要，但面临多项技术挑战。尽管Transformer通过其自注意力机制捕获长程依赖，但其$O(N^2)$时间复杂度在大规模WSI图结构中造成严重计算瓶颈。Mamba模型以线性复杂度突破了Transformer的计算瓶颈。然而，由于Mamba对输入数据顺序的高度敏感性，图Mamba中传统的节点排序方法（如基于节点度或子图大小的方法）未能充分考虑图数据的拓扑连通性，从而限制了Mamba序列建模的性能。此外，其单向架构无法利用图像的双向空间结构。为解决这些挑战，本文提出一种基于拓扑感知排序的新型图Mamba生存分析框架（TopoMamSurv），以适应Mamba的序列敏感性。我们的可视化实验进一步证实，通过拓扑感知排序（TAO）策略提取的节点确实表现出更高的相似性。此外，我们设计了双向Mamba模块并集成图卷积网络（GCN），以实现图像的双向空间上下文建模，形成“局部聚合-全局捕获”的分层特征学习架构。该框架通过TAO、双向语义建模和分层特征融合的系统设计，有效调和了WSI分析中长程依赖建模、计算效率和空间结构利用之间的矛盾。该框架在五个TCGA数据集上验证了其全面的性能优势。

英文摘要

In computational pathology, survival analysis on Whole Slide Images (WSIs) is crucial for patient prognosis assessment, but it faces multiple technical challenges. Although the Transformer captures long-range dependencies through its self-attention mechanism, its $O(N^2)$ time complexity causes a severe computational bottleneck on large-scale WSI graph structures. The Mamba model overcomes this bottleneck with linear complexity. However, because Mamba is highly sensitive to the order of its input, traditional node sorting methods in Graph Mamba, such as those based on node degree or subgraph size, fail to adequately account for the topological connectivity of graph data, which restricts the performance of Mamba's sequential modeling. Moreover, its unidirectional architecture cannot leverage the bidirectional spatial structure of images. To address these challenges, this paper proposes a novel Graph Mamba survival analysis framework based on topology-aware ordering (TopoMamSurv) that adapts to the sequential sensitivity of Mamba. Our visualization experiments further confirm that the nodes extracted through the topology-aware ordering (TAO) strategy exhibit higher similarity. Furthermore, we design a bidirectional Mamba module and integrate a Graph Convolutional Network (GCN) to achieve bidirectional spatial context modeling of images, forming a hierarchical "local aggregation, global capture" feature learning architecture. Through its systematic design of TAO, bidirectional semantic modeling, and hierarchical feature fusion, the framework reconciles the tension among long-range dependency modeling, computational efficiency, and spatial structure utilization in WSI analysis. Its performance advantage is validated on five TCGA datasets.
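
The abstract does not specify how TAO orders nodes, so the following is only a sketch of why ordering matters for a sequence model. It uses a plain breadth-first traversal as a stand-in for a connectivity-respecting ordering and contrasts it with the degree-based sort the abstract criticises. The function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_order(adjacency, start):
    """Order nodes breadth-first so that graph neighbours stay close in the sequence.

    Nodes unreachable from `start` are appended through additional BFS runs.
    """
    order, seen = [], set()
    for root in [start] + sorted(adjacency):
        if root in seen:
            continue
        seen.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return order


def degree_order(adjacency):
    """Baseline: sort by descending degree, ignoring which nodes are connected."""
    return sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))


# Toy path graph a - c - b - d (hypothetical patch graph).
graph = {"a": ["c"], "c": ["a", "b"], "b": ["c", "d"], "d": ["b"]}
```

On this toy path, the breadth-first order follows the path itself, while the degree sort places `a` directly before `d` even though they sit at opposite ends of the graph.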


2606.02482 2026-06-03 cs.CV 版本更新

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

X-Stream: 探索多模态大语言模型作为多流理解的多路复用器

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

Affiliations: MMLab, Chinese University of Hong Kong; Huawei Inc.

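AI summary: To address the lack of evaluation for multi-stream video understanding, proposes X-Stream, the first such benchmark, with 4,220 QA pairs over 932 videos covering multi-window, multi-view, and multi-device scenarios; evaluates MLLMs as multiplexers through signal multiplexing theory and finds that existing models score only about 50% on concurrent streams.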

Comments Project Page: https://peiwensun2000.github.io/xstream/


AI中文摘要

尽管视频流理解取得了显著进展，但实际应用（如体育直播、自动驾驶和多屏协作）本质上需要连续的多流交互。然而，现有基准局限于单流范式，在评估在线跨流推理方面存在关键空白。为填补这一空白，我们引入了X-Stream，这是首个专门用于多流流式理解的基准。X-Stream包含932个视频中精心整理的4220个QA对，评估了跨多窗口、多视角和多设备场景的11个子任务。关键的是，我们的数据集使用一种新颖的双重验证流水线构建，防止对单一流的过度依赖。此外，我们开创性地将多模态大语言模型（MLLM）概念化为朴素多路复用器，通过信号多路复用理论的视角系统评估其性能。我们广泛的在线推理实验揭示了一个严峻的现实：最先进的MLLM在并发流上表现困难，仅达到约50%的分数，且主动能力差。最终，X-Stream暴露了当前多路复用方案的权衡，为下一代多流智能体提供了实用的评估协议和经验指导。

英文摘要

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, scoring only about 50% and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-offs of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.


2606.02090 2026-06-03 cs.CV 版本更新

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

FocusDiT: 扩散Transformer中的查询掩码用于细粒度图像生成

Xueji Fang, Liyuan Ma, Jianhao Zeng, Jinjin Cao, Mingyuan Zhou, Guo-Jun Qi

Affiliations: Zhejiang University; Westlake University

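AI summary: Proposes FocusDiT, which masks key query tokens so that they are fed only to the FFN layers, strengthening fine-grained visual generation; experiments validate its effectiveness.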

2606.01962 2026-06-03 cs.CV 版本更新

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

基于领域增强的对比增强Transformer用于鲁棒的多场景金属表面缺陷检测

Yiyao Liu, Wenxiao He, Liyuan Ren, Huan Wang

Affiliations: Glasgow College, University of Electronic Science and Technology of China

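AI summary: Proposes the Contrastive Augmented Transformer (CAT) framework, combining a Swin Transformer backbone, a feature pyramid network, a domain-specific droplet augmentation algorithm, and a hard negative mining strategy to address limited annotated data, difficult multi-scale defect recognition, and poor cross-scenario generalization in metal surface defect detection; reaches 99.54% pixel-level AUROC on the KolektorSDD2 dataset.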


AI中文摘要

金属表面缺陷检测对于维持工业制造中的产品质量至关重要。然而，它面临着重大挑战，包括有限的标注数据、难以识别细微的多尺度缺陷以及跨不同场景的泛化能力差。为了解决这些问题，本文提出了一种新颖的对比增强Transformer（CAT）框架，用于鲁棒的缺陷检测。CAT采用分层Swin Transformer骨干，并重新设计了特征金字塔网络，以有效融合低级纹理与高级语义，从而实现对细微和多尺度缺陷模式的精确建模。为了增强在真实噪声条件下的鲁棒性，我们提出了一种领域特定的液滴增强算法。此外，我们将难负样本挖掘策略纳入对比损失中，以增强模型在模糊缺陷区域的判别能力。在KolektorSDD2数据集上的实验结果表明，CAT实现了99.54%的像素级AUROC，优于现有方法。此外，CAT在三个未见过的数据集（包括KSDD1、用于瓷砖缺陷的MTD和用于轨道表面缺陷的MSDD）上表现出优越的泛化能力和鲁棒性，展示了其在大规模工业部署中的潜力。

英文摘要

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.


2606.01624 2026-06-03 cs.CV cs.SE 版本更新

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

下一步测试什么：驾驶视觉语言模型中可解释的覆盖缺口发现

Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker

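AI summary: Proposes SliceScorer and SliceNav, which use a deterministic scoring rule combining an exposure prior and a neighbor-failure prior to surface high-risk coverage gaps in driving vision-language models while supporting an interpretable, auditable verification workflow.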


AI中文摘要

驾驶视觉语言模型必须准确理解由操作设计域定义的各种条件下的场景，然而验证仍然稀疏：许多切片缺失，使得经验故障率不可靠。我们提出 SliceScorer，一种用于缺失切片推荐的确定性评分规则，它结合了 (i) 基于暴露的覆盖先验，优先考虑罕见、测试不足的区域，以及 (ii) 邻居失败先验，从类似测试条件传播风险。SliceScorer 刻意简单——可解释、可审计且保守——这些属性对于安全关键验证至关重要。为了在声明的 ODD 之外进行压力测试，我们将 SliceScorer 嵌入 SliceNav，一个由 LLM 编排的验证流程，其中模型解释开发者查询以选择相关操作（分诊、评分、获取、评估）和词汇扩展，组合验证工作流，同时保持所有评分确定性和可审计性。在三个驾驶 VLM（WiseAD、DriveMM、Cosmos-Reason2-2B）上的实验表明，SliceNav 比先前的切片发现方法更有效地发现高风险覆盖缺口，同时在条件空间中保持多样化的推荐。消融实验证实了两个评分组件的贡献，定性分析展示了从开发者查询到目标评估的端到端工作流。

英文摘要

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior that prioritizes rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple: it is interpretable, auditable, and conservative, properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.
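
The abstract names the two ingredients of the scoring rule but not its formula. The sketch below is therefore one possible deterministic combination, not SliceScorer itself. It assumes slices are tuples of condition attributes, that an `exposure` value in [0, 1] is available for each missing slice, and that the two priors are mixed with a weight `alpha`; all of these names and choices are hypothetical.

```python
def similarity(a, b):
    """Fraction of attribute positions on which two slices agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def neighbor_failure_prior(slice_, tested):
    """Similarity-weighted mean failure rate over already-tested slices."""
    pairs = [(similarity(slice_, s), rate) for s, rate in tested.items()]
    total = sum(w for w, _ in pairs)
    if total == 0:
        return 0.0  # no similar tested slice: no propagated risk
    return sum(w * r for w, r in pairs) / total


def score(slice_, exposure, tested, alpha=0.5):
    """Assumed rule: weighted sum of a rarity prior and the neighbor-failure prior."""
    coverage_prior = 1.0 - exposure  # rarer slices receive a larger prior
    return alpha * coverage_prior + (1 - alpha) * neighbor_failure_prior(slice_, tested)


def rank_missing(missing, tested, alpha=0.5):
    """Rank missing slices ({slice: exposure}) by descending score, ties by slice."""
    return sorted(missing, key=lambda s: (-score(s, missing[s], tested, alpha), s))


# Hypothetical toy data: failure rates for tested slices, exposure for missing ones.
tested = {("night", "rain"): 0.4, ("day", "clear"): 0.1}
missing = {("night", "clear"): 0.2, ("day", "rain"): 0.6}
```

Because the rule contains no randomness or learned component, the same inputs always yield the same ranking, which is the auditability property the abstract emphasises.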


2606.01348 2026-06-03 cs.CV 版本更新

ST-ColoNet: Spatiotemporal Colo-Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning

Crystal Cai, Ziyi Wang, Zhengjie Zhang, Jingsheng Gao, Dahong Qian, Suncheng Xiang

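AI summary: Proposes the ST-ColoNet framework, combining the Colorlaus module (metric learning to optimize edge-based spatial features) and the Full-Temp module (three self-attention patterns approximating full self-attention), reaching 81.0% accuracy and 70.7% F1-score for colo-segment recognition on a self-built dataset.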

Comments Some experiments need to be updated


AI中文摘要

结肠镜视频中的结肠段识别是许多下游任务的关键需求，但现有自动识别方法仅使用结肠镜图像，未充分利用时间信息，导致性能不佳。此外，相关的公开视频数据集稀缺。为解决此问题，我们整理并发布了一个专门用于结肠段识别任务的标注数据集。此外，我们提出了一种基于两阶段深度学习的框架——时空网络结肠段识别（ST-ColoNet），用于从结肠镜视频中识别结肠段，该框架包括Colorlaus模块（使用度量学习优化边缘介导的空间特征提取）和Full-Temp模块（结合三种自注意力模式，以更好地近似长结肠镜序列上的全自注意力并优化时间特征聚合）。通过大量消融实验，我们证明该框架能够在结肠段识别任务上达到最先进的性能，准确率为81.0%，F1分数为70.7%，相比现有最先进方法有巨大提升。

英文摘要

Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only individual colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, a substantial improvement over prior state-of-the-art methods.


2605.27454 2026-06-03 eess.IV cs.CV 版本更新

NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification

NL-MambaXCT：用于Nomex蜂窝X射线CT缺陷分类的自监督嵌套学习Mamba

Ghaleb Aldoboni, Lobna Nassar, Fakhri Karray, Reem Alshamsi

Affiliations: Aurak Academy of Arts and Sciences; Machine Intelligence Institute; University of Waterloo

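AI summary: Proposes the NL-MambaXCT framework, combining self-supervised masked image modelling with Nested Learning for efficient classification of Nomex honeycomb XCT defects, reaching 96.91% accuracy on the test set.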


AI中文摘要

X射线计算机断层扫描（XCT）广泛应用于航空航天制造中Nomex蜂窝结构的无损检测，但工业检测仍严重依赖人工解读和基于有限标注数据训练的监督模型。本文提出NL-MambaXCT，一个基于Mamba的框架，结合自监督掩码图像建模和嵌套学习（NL）公式，用于从生产XCT切片中进行自动化、标签高效的缺陷分类。骨干网络是一个四阶段2D编码器，早期阶段使用RegNet卷积块，深层阶段使用基于Mamba的序列混合与注意力。该网络在19,961张未标注的工业XCT切片上通过掩码图像建模进行预训练，并在按生产顺序划分的2,000张重新标注的Nomex XCT切片上进行微调。NL通过双时间尺度参数动态实现：选定投影保持慢速指数移动平均轨迹与快速权重并行，而深度动量优化器引入额外的慢速参数更新轨迹。在保留测试集上，MIM预训练的NL-MambaXCT模型达到96.91%的准确率和96.8%的宏F1分数，在准确率上比CNN、注意力和单时间尺度Mamba基线高出3.11-10.31个百分点。结果表明，将掩码自监督与NL风格的快/慢学习动态相结合，是Nomex蜂窝XCT检测中鲁棒缺陷分类的一种有前景的策略。

英文摘要

X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11 to 10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
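
The two-timescale idea described above can be illustrated with a generic fast-weight update paired with a slow exponential-moving-average (EMA) trace. This is a minimal sketch on plain Python lists; the learning rate, the decay value, and how NL-MambaXCT actually uses the slow trace in its forward pass are not given in the abstract and are assumptions here.

```python
def sgd_step(fast, grads, lr):
    """Fast timescale: an ordinary gradient step on the trainable weights."""
    return [w - lr * g for w, g in zip(fast, grads)]


def ema_update(slow, fast, decay):
    """Slow timescale: the trace moves only a fraction (1 - decay) toward the fast weights."""
    return [decay * s + (1.0 - decay) * f for s, f in zip(slow, fast)]


# One training step with hypothetical values.
fast = [1.0]
slow = [1.0]
fast = sgd_step(fast, grads=[2.0], lr=0.1)   # fast weight moves from 1.0 to about 0.8
slow = ema_update(slow, fast, decay=0.9)     # slow trace moves from 1.0 to about 0.98
```

With a decay close to 1, the slow trace averages over many recent steps and changes far less per update than the fast weights do.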


2605.26914 2026-06-03 cs.CV 版本更新

I2PRef: Image-Driven Point Completion with Iterative Refinement

I2PRef: 图像驱动的点云补全与迭代细化

Azhar Hussian, Marina Ritthaler, André Kaup, Vasileios Belagiannis

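AI summary: Proposes a point cloud completion method that treats the image as the primary geometric source: an Image-to-Point (I2P) module reconstructs a complete point cloud directly from a single RGB image, and a transformer-based Point-to-Point (P2P) module refines it iteratively, achieving state-of-the-art results on ShapeNet-ViPC with a 12.3% relative Chamfer Distance improvement.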

Comments Accepted at European Signal Processing Conference (EUSIPCO 2026)


AI中文摘要

我们提出了一种图像条件化的点云补全方法，将图像视为主要的几何来源而非次要的引导。为此，我们引入了一个图像到点（I2P）模块，该模块可以直接从单张RGB图像重建完整的点云，无需3D输入。此外，我们引入了一个基于Transformer的点到点（P2P）细化模块，该模块利用点令牌和图像特征之间的自注意力和交叉注意力，迭代地细化粗I2P输出。I2P模块使图像编码器能够学习丰富的几何表示，而P2P模块逐步恢复细粒度细节。与依赖辅助损失或融合模块的现有多模态方法不同，我们的显式I2P任务仅基于图像提供了强大的几何感知先验。在ShapeNet-ViPC上的大量实验表明，我们的方法取得了最先进的补全性能，Chamfer距离相对先前方法提升了12.3%。代码可在 https://github.com/AzharSindhi/I2PRef.git 获取。

英文摘要

We present an image-conditioned point cloud completion approach that treats images as the primary geometric source rather than a secondary guide. To this end, we introduce an Image-to-Point (I2P) module that can reconstruct complete point clouds directly from a single RGB image, with no need for 3D inputs. Additionally, we introduce a transformer-based Point-to-Point (P2P) refinement module that uses self- and cross-attention between point tokens and image features to iteratively refine the coarse I2P output. The I2P module enables the image encoder to learn rich geometric representations, while the P2P module progressively recovers fine-grained details. Unlike existing multimodal methods that rely on auxiliary losses or fusion modules, our explicit I2P task provides a strong, geometry-aware prior based on images alone. Extensive experiments on ShapeNet-ViPC demonstrate state-of-the-art completion performance with a 12.3% relative Chamfer Distance improvement over prior methods. Code is available at: https://github.com/AzharSindhi/I2PRef.git
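
The headline metric above, Chamfer Distance, measures how far each point in one cloud is from its nearest neighbour in the other. Conventions differ between papers (squared versus unsquared distances, sum versus average of the two directions), and the abstract does not say which one is used, so the sketch below assumes unsquared Euclidean distance and sums the two directional means. It is a brute-force reference, not the evaluation code from the linked repository.

```python
import math


def _mean_nearest(src, dst):
    """Mean distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point clouds (lists of 3D tuples)."""
    return _mean_nearest(a, b) + _mean_nearest(b, a)


def relative_improvement(baseline, ours):
    """Relative reduction of a lower-is-better metric, as a fraction of the baseline."""
    return (baseline - ours) / baseline
```

Under this reading, a 12.3% relative improvement means the new Chamfer Distance is 87.7% of the previous best value.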


2605.26774 2026-06-03 cs.CV 版本更新

Cesarean Scar Defect Segmentation in Transvaginal Ultrasound Images: a Dataset and Benchmark

经阴道超声图像中的剖宫产瘢痕缺损分割：数据集与基准

Yuan Tian, Yue Li, Wei Xia, Tianyu Xu, Jian Zhang, Liye Shi, Jing Liu, Yang Wang, Ming Liu, Qing Xu, Yixuan Zhang, Maggie M. He, Xiangjian He

Affiliations: Department of Obstetrics and Gynecology, International Peace Maternity and Child Health Hospital affiliated to Shanghai Jiao Tong University School of Medicine; School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham; Department of Computer Science and Engineering, University of California, San Diego; Department of Ultrasound, International Peace Maternity and Child Health Hospital affiliated to Shanghai Jiao Tong University School of Medicine; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Department of Cardiology, Gold Coast University Hospital

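AI summary: To address the lack of a public dataset for Cesarean Scar Defect (CSD) segmentation in transvaginal ultrasound images, builds a CSD dataset of 1,111 images and 16 videos with pixel-level annotations and establishes a benchmark to advance medical image segmentation algorithms and clinical innovation.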


AI中文摘要

剖宫产瘢痕缺损（CSD）是剖宫产后最常见的并发症之一。经阴道超声检查广泛用于CSD的初步筛查。准确确定CSD的轮廓和尺寸对治疗至关重要。然而，由于CSD尺寸小、形态不规则、图像质量欠佳以及资源有限环境中临床意识不足，超声医师常常忽略CSD。尽管人工智能在医学影像领域取得了进展，但目前尚无公开的经阴道超声CSD分割数据集。为填补这一空白，我们提出了一个全面的CSD数据集，包含1111张图像和16个视频，共501个阳性样本，带有确证的CSD和精确的像素级手动标注。标注遵循标准化临床指南，由经验丰富的超声医师和受过培训的博士生合作完成。这项工作为推进医学图像分割算法和促进临床创新提供了高质量的基准资源。最终，改善CSD诊断及后续治疗策略可提高育龄女性的生活质量，对医学研究和临床实践均具有重要价值。

英文摘要

Cesarean Scar Defect (CSD) is one of the most prevalent complications following cesarean delivery. Transvaginal ultrasonography is widely used for primary CSD screening. Accurate determination of CSD outline and dimensions is crucial for treatment. However, CSDs are frequently overlooked by sonographers due to small size and irregular morphology, suboptimal image quality, and limited clinical awareness in resource-constrained settings. Despite artificial intelligence advances in medical imaging, no public dataset exists for transvaginal ultrasound CSD segmentation. To address this gap, we present a comprehensive CSD dataset comprising 1,111 images and 16 videos, yielding 501 positive samples with confirmed CSD and precise pixel-level manual annotations. Annotations are performed following standardized clinical guidelines through collaboration between experienced sonographers and trained PhD students. This work provides high-quality benchmark resources for advancing medical image segmentation algorithms and promoting clinical innovation. Ultimately, improved CSD diagnosis and subsequent treatment strategies can enhance the quality of life in women of reproductive age, representing significant value for both medical research and clinical practice.


2605.26006 2026-06-03 cs.CV cs.GR cs.RO 版本更新

MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control

MIND: 多尺度意图扩散用于文本驱动的基于物理的人形控制

Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Jingya Wang

Affiliations: ShanghaiTech University; University of Pennsylvania; Bytedance Seed; Stanford University; InstAdapt

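AI summary: Proposes the MIND framework, which uses a multi-scale intent diffusion mechanism to semantically align text commands with low-level actions for generating physics-based humanoid behaviors.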


AI中文摘要

使基于物理的人形机器人能够根据高级文本命令执行多样化的行为仍然是一个重大挑战。现有方法通常遵循两阶段范式（结合运动学动作生成与基于物理的跟踪）或端到端模仿学习范式（直接从文本生成动作）。然而，前者受限于运动学生成与基于物理跟踪之间的固有域偏移，而后者则难以弥合文本命令与低级动作之间的巨大模态差距，限制了有效的语义对齐。值得注意的是，人形状态编码了丰富的运动动态，与低级动作相比，这些动态在语义上与文本描述更对齐，因此成为推导行为意图的自然基础。基于这一见解，我们提出了MIND，一种新颖的端到端扩散框架，用于文本驱动的基于物理的人形控制，该框架利用行为意图作为文本命令与低级动作之间的语义桥梁。其核心是，MIND引入了多尺度意图扩散机制，其中整体意图预测器捕获全局行为动态以指导整体行为合成，而即时意图预测器在每一步扩散中提供逐步的细粒度信号以进行局部行为细化。这种分层意图公式化为人形控制施加了结构化的归纳偏置，改善了语义对齐和行为自然性。此外，MIND将人形状态编码到潜在空间中，以实现更有效的语义意图建模。大量实验表明，MIND优于现有方法，并能从文本命令中合成连贯、物理合理且语义对齐的人形行为。我们的代码将发布以促进未来研究。

英文摘要

Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.


2605.29661 2026-06-03 cs.CV 版本更新

Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning

几何引导的基础特征建模实现可泛化的物体形状变形学习

Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou

Affiliations: Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Zhejiang Innovation Center for Humanoid Robotics, Ningbo, China; State Key Laboratory of Industrial Control and Technology, Zhejiang University, Hangzhou, China

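AI summary: Proposes a geometry-guided feature modeling mechanism and a view-adaptive feature aggregation module that recover monocular 3D shape by deforming a category-level shape template, substantially outperforming existing methods under shape variation and viewpoint diversity.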

Comments 20 pages, 12 figures, accepted by ICML 2026

详情

AI中文摘要

单目3D形状恢复是几何理解的基础，但在任意视角和未见物体类别上实现鲁棒泛化仍然是一个重大挑战。本文提出一个可泛化的变形学习框架，通过显式变形类别级形状模板以匹配目标观测来重建3D物体。为了解决模板与目标之间的复杂形状变化，我们引入了几何引导的特征建模机制。该过程首先用模板拓扑丰富基础特征以生成几何感知表示，然后将其与目标观测显式关联以指导精确变形。此外，为了弥合固定模板与任意目标视图之间的差异，我们提出一个视图自适应特征聚合模块。该模块利用多视图模板特征及其对应的相机姿态来丰富规范模板表示，确保无论目标视角如何都能实现鲁棒的特征对齐。大量实验表明，我们的方法在处理大形状变化和多样化视角方面显著优于最先进的方法，展现出对新颖类别的强泛化能力，并有效支持下游真实世界的灵巧机器人操作任务。项目主页：https://GODeform.github.io/

英文摘要

Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: https://GODeform.github.io/


2603.18639 2026-06-03 cs.CV 版本更新

OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance

OrthoPhys：基于正交视角几何引导的物理合理视频生成

Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Zhibo Chen

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy; School of Information Science and Technology, University of Science and Technology of China; College of Automotive and Energy Engineering, Tongji University; SKLCCSE, School of Computer Science and Engineering, Beihang University

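AI summary: Proposes OrthoPhys, a two-stage framework that uses orthogonal-view geometry guidance to generate physically consistent foreground motion and then synthesizes the complete video, substantially improving physical realism and spatial-temporal coherence.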


AI中文摘要

近期视频生成的进展在视觉保真度上取得了显著提升，但确保物理一致的运动仍是一个基本挑战。直观上，这一限制可归因于现实世界中的物体运动在三维空间中展开，而视频观测仅提供此类动力学的部分、视角依赖的投影。为解决这些问题，我们提出 OrthoPhys，一个两阶段框架，利用正交视角几何引导来强制物理合理性。我们的第一阶段不直接生成非结构化的二维视频，而是生成前景动力学的同步四视角正交视频。通过在这些正交视角中引入几何增强的注意力机制，该阶段有效地强制了三维空间一致性，并隐式地将运动基于物理属性。在第二阶段，这些物理一致的正交前景作为刚性引导，合成最终的完整视频，无缝学习前景动力学与背景上下文之间的交互。为支持这种正交视角训练范式，我们构建了 PhysMV 数据集，包含 40K 个场景，每个场景由四个正交视角组成，总共 160K 个视频序列。大量实验表明，OrthoPhys 在物理真实感和时空一致性上显著优于现有视频生成方法。项目页面：https://anonymous.4open.science/w/Phys4D/。

英文摘要

Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose OrthoPhys, a two-stage framework that leverages orthogonal-view geometry guidance to enforce physical plausibility. Instead of directly generating unstructured 2D videos, our first stage generates synchronized, four-view orthogonal videos of the foreground dynamics. By incorporating a geometry-enhanced attention mechanism across these orthogonal views, this stage effectively enforces 3D spatial coherence and implicitly grounds the motion in physical attributes. In the second stage, these physically consistent orthogonal foregrounds serve as rigid guidance to synthesize the final complete video, seamlessly learning the interaction between foreground dynamics and the background context. To support this orthogonal-view training paradigm, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that OrthoPhys significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Project page: https://anonymous.4open.science/w/Phys4D/.


2605.03358 2026-06-03 cs.CV 版本更新

Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection

像临床医生一样追踪：解剖引导的空间先验用于头影测量标志点检测

Sidhartha Mohapatra, Pallavi Mohanty

Affiliations: Founder & CTO, CephTrace; Clinical Advisor, CephTrace

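AI summary: Proposes a five-phase anatomy-guided pipeline that generates confidence-weighted spatial priors for training HRNet-W32, achieving 1.04 mm mean radial error on 25 landmarks across 1,502 radiographs, with ablations and clinical validation supporting its effectiveness.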

Comments v3: 21 pages, 15 tables, 12 figures + supplementary materials (8 tables, 3 figures). v4: quantified Grad-CAM analysis (Table 13), corrected clinical measurements (Table 6: bias, MAE, ICC; vertical kappa 1.00->0.78), reviewer wording fixes. Code and weights: https://github.com/sidwiz/cephtrace-research, https://huggingface.co/CephTrace/cephtrace-v4


AI中文摘要

临床医生通过遵循结构化的解剖工作流程来追踪头影测量X光片——然而，先前没有系统明确地将此编码到计算中。我们提出了一个五阶段解剖引导管道，生成置信度加权的空间先验，用于塑造HRNet-W32的训练。该系统在来自7+个成像设备的1502张X光片上的25个标志点实现了1.04 mm的平均径向误差——通过显式解剖先验而非学习注意力，与HYATT-Net（在CEPHA29上1.05 mm）相当。三路消融实验隔离了机制：解剖先验保持1%的验证-测试差距，而去除先验则产生88%的差距（1.94 mm）——尽管验证收敛相同。训练×推理先验矩阵确认：（1）所有模型与推理无关，（2）仅28通道架构无益处，（3）随机先验部分且不稳定（1.72 mm），（4）只有解剖正确、图像特定的先验产生1.04 mm——作为训练时的正则化器。部署时无需生成先验。五折交叉验证（p=0.0015）、患者级置换检验（p<0.0001，n=151）、复现基线、Grad-CAM分析和临床验证（151名患者包括72例边界病例的100%骨骼分类，kappa=1.00）提供了汇聚证据。跨领域实验支持假设：先验有效性取决于标志点空间熵——在四个领域前瞻性确认。补充材料包含在内。

英文摘要

Clinicians trace cephalometric radiographs following a structured anatomical workflow, yet no prior system encodes this into computation. We present a five-phase anatomy-guided pipeline producing confidence-weighted spatial priors that shape HRNet-W32 training, achieving 1.04 mm mean radial error on 25 landmarks across 1,502 radiographs from 7+ imaging devices. A training × inference prior matrix isolates the mechanism: anatomical priors maintain a 1% validation-to-test gap versus 88% without priors (1.94 mm), despite identical validation convergence. The matrix establishes that all trained models are inference-independent, the expanded architecture alone provides no benefit, random priors yield partial but unstable improvement (1.72 mm), and only image-specific anatomically correct priors produce the 1.04 mm result, functioning as a training-time regularizer requiring no automated prior generation at deployment. Five-fold cross-validation (p=0.0015), patient-level permutation testing (p<0.0001, n=151), quantified Grad-CAM analysis (88% vs. 74% in-zone activation, p<0.001), and clinical measurement validation (skeletal classification kappa=0.79-0.84, zero Class II<->III reversals, ICC>0.95) provide converging evidence. Cross-domain experiments on echocardiography, cervical spine, and hand radiography support the hypothesis that prior effectiveness scales with the spatial entropy of the landmark distribution.

2605.24253 2026-06-03 cs.CV cs.AI cs.IR (version update)

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Wenchao Han, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

Affiliations: Kimia Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA; DICE Lab, Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA; Division of Computational Pathology and Informatics, Mayo Clinic, Rochester, MN, USA; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA; Department of Breast and Melanoma Surgical Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA; Department of Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA

AI summary: Proposes CRISP, an unsupervised framework that integrates the multiple whole-slide images within a case through clustering and redundancy-reduced sampling, building a compact, representative patch set for case-level retrieval; on breast cancer datasets it matches or surpasses the existing standard practice.

Abstract

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.
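The two-stage idea described above (reduce redundancy within each slide, then cluster across the case and keep representatives) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the redundancy criterion or the clustering algorithm, so the cosine-similarity threshold, the plain k-means loop, and every function name here are hypothetical. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def dedupe_slide(patches, max_sim=0.95):
    """Stage 1 (assumed form): greedily drop near-duplicate patches within one WSI."""
    kept = []
    for p in patches:
        if all(cosine(p, q) < max_sim for q in kept):
            kept.append(p)
    return kept

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_representatives(points, k, iters=20):
    """Stage 2 (assumed form): k-means, then return the real patch nearest each centroid."""
    centroids = [list(p) for p in points[:k]]  # deterministic init for the sketch
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return [min(points, key=lambda p: sq_dist(p, c)) for c in centroids]

def case_index(slides, k, max_sim=0.95):
    """Pool the de-duplicated patches of all slides and pick k representatives."""
    pooled = [p for slide in slides for p in dedupe_slide(slide, max_sim)]
    return kmeans_representatives(pooled, min(k, len(pooled)))

# Toy 2-D "embeddings" for a case with two slides.
slides = [
    [(1.0, 0.0), (0.99, 0.01), (0.0, 1.0)],
    [(0.0, 0.98), (-1.0, 0.0)],
]
index = case_index(slides, k=3)
```

Returning actual patches rather than centroids keeps the index made of real image regions, which is what a retrieval index over patches would need.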

2605.23995 2026-06-03 cs.CV cs.AI (version update)

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

Chathura Wimalasiri, Kishor Nandakishor, Marimuthu Palaniswami

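Affiliations: Department of Electrical and Electronic Engineering, University of Melbourne

AI summary: Systematically reviews four self-supervised learning (SSL) paradigms in medical imaging (contrastive; non-contrastive and predictive; generative and reconstruction-based; hybrid), analyzes how alignment between pretext and downstream tasks affects performance, and proposes practical design guidelines.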

Comments This manuscript is 31 pages with 4 tables and 3 figures

Abstract

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2605.22018 2026-06-03 cs.CV cs.AI cs.RO (version update)

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

Connor Malone, Sebastien Demmel, Sebastien Glaser

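Affiliations: Queensland University of Technology; ARC Training Centre for Automated Vehicles in Rural and Remote Regions (AVR3)

AI summary: Presents FRED, the first multi-modal autonomous driving dataset targeting water hazards on roads, with camera, LiDAR, and IMU data plus semantic labels to support training and evaluation of water hazard detection methods.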

Abstract

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360 degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

2601.00990 2026-06-03 eess.IV cs.CV (version update)

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Ozkan Gunalp

Affiliations: Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg; Department of Biostatistics and Medical Informatics, Institute of Health Sciences, Ege University

AI summary: Through a systematic review of 78 studies, proposes the CALIB-XFUS framework, which emphasizes calibration, explanation faithfulness, and fairness to meet regulatory requirements.

Comments 12 pages, 5 figures, 1 table, 75 references; systematic review (PRISMA 2020); manuscript prepared for submission to The Lancet Digital Health (Reviews section)

Abstract

Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.
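To make "calibration" and "selective prediction" concrete, the sketch below computes an expected calibration error (ECE) with equal-width bins and the accuracy/coverage trade-off when low-confidence predictions are deferred. The abstract does not say which calibration metric the reviewed studies report, so ECE is used here as one common choice; the function names and sample values are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += (len(b) / n) * abs(acc - avg_conf)
    return ece

def selective_accuracy(confidences, correct, threshold):
    """Accuracy and coverage when predictions below the threshold are deferred."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, coverage

# Hypothetical plane-classifier outputs: top-class confidence and correctness.
conf = [0.95, 0.95, 0.65, 0.65]
hit = [True, True, True, False]
ece = expected_calibration_error(conf, hit)
acc, cov = selective_accuracy(conf, hit, threshold=0.9)
```

In this toy case the classifier is underconfident in the high bin and overconfident in the lower one; deferring everything below 0.9 raises accuracy on the retained cases at the cost of halving coverage.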

2605.20731 2026-06-03 cs.CV cs.AI stat.AP (version update)

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta

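Affiliations: Lica World; Contra.Work Inc.

AI summary: To address the limitation that existing preference datasets provide only a single overall verdict, builds TASTE, a multi-dimensional preference dataset in which two cohorts of professional designers rank outputs from four text-to-image models across nine criteria, and presents a criterion-agnostic signal-validation framework and a preference-model benchmark.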

Abstract

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release TASTE (Typography, Aesthetics, Spatial, Tone, Etc.), a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's τ, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
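Two of the agreement statistics named in this abstract, Kendall's τ and Condorcet cycles, can be illustrated with a short sketch. It assumes rankings without ties given as item-to-position mappings, uses the tau-a variant, and checks only for 3-cycles in the strict-majority relation; the paper's exact estimators and null-distribution tests are not reproduced here, and all names are hypothetical.

```python
from itertools import combinations, permutations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings given as {item: position} (no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / n_pairs

def majority_prefers(rankings, x, y):
    """True if a strict majority of raters place x ahead of y."""
    wins = sum(1 for r in rankings if r[x] < r[y])
    return wins * 2 > len(rankings)

def has_condorcet_cycle(rankings):
    """Check for a 3-cycle x > y > z > x in the majority preference relation."""
    items = list(rankings[0])
    return any(
        majority_prefers(rankings, x, y)
        and majority_prefers(rankings, y, z)
        and majority_prefers(rankings, z, x)
        for x, y, z in permutations(items, 3)
    )

# Three hypothetical raters ranking outputs A-C (1 = best); the classic cyclic profile.
raters = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 3, "C": 1},
    {"A": 3, "B": 1, "C": 2},
]
```

A cycle means the panel's pairwise majorities cannot be summarized by any single ranking, which is why it serves as a signal of inconsistent rather than merely noisy preferences.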

2605.20306 2026-06-03 cs.CV cs.LG (version update)

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

Bingnan Liu, Chenhang Cui, Rui Huang, Jiani Luo, Zhirong Shen, Tinghao Wang, Xiande Huang, Lingbei Meng, Fei Shen, An Zhang

Affiliations: University of Electronic Science and Technology of China; National University of Singapore; De Artificial Intelligence Lab; The Chinese University of Hong Kong, Shenzhen; University of Science and Technology of China

AI summary: Proposes the WildRoadBench benchmark, which evaluates aerial road-damage grounding under two protocols, direct grounding by VLMs and autonomous research by LLM-driven agents, and finds that existing approaches remain unreliable in wild settings.

Comments Preprint. Under review. 4 figures, 6 tables

Abstract

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.
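The per-class AP_50 metric mentioned above counts a predicted box as correct when its intersection-over-union (IoU) with an unmatched ground-truth box is at least 0.5. The sketch below shows that matching rule and a non-interpolated average precision for one class on one image. It is an illustrative simplification: the benchmark's actual evaluation code, interpolation scheme, and aggregation across images are not described in the abstract, and the function names are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def ap50(preds, gts, thr=0.5):
    """Non-interpolated AP for one class; preds are (score, box), gts are boxes."""
    if not gts:
        return 0.0
    matched = set()
    hits = []
    # Visit predictions from most to least confident; each GT box matches once.
    for score, box in sorted(preds, key=lambda p: -p[0]):
        best, best_j = 0.0, None
        for j, g in enumerate(gts):
            v = iou(box, g)
            if j not in matched and v > best:
                best, best_j = v, j
        if best_j is not None and best >= thr:
            matched.add(best_j)
            hits.append(True)
        else:
            hits.append(False)
    # Precision at each true positive, averaged over all ground-truth boxes.
    ap, tp = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank
    return ap / len(gts)

# Hypothetical ground truth and predictions for one damage class.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 32))]
score = ap50(preds, gts)
```

The false positive ranked second lowers the precision at the second true positive, which is how confident but wrong boxes are penalized.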

2605.20183 2026-06-03 cs.CV (version update)

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

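Affiliations: Fudan University; The University of Hong Kong; Tongyi Lab, Alibaba Group; Zhejiang University; Peking University

AI summary: Proposes MSAVBench, the first benchmark for multi-shot audio-video generation, which uses an adaptive hybrid evaluation framework to assess 19 models systematically along four dimensions and finds that current systems still struggle with director-level control and fine-grained audio-visual synchronization.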

Abstract

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. The benchmark data and evaluation code are publicly available at https://github.com/ali-vilab/MSAVBench.
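Alignment with human judgments is reported here as a Spearman rank correlation. As a minimal sketch of how such a figure is computed, the code below rank-transforms two score lists (ties share their mean rank) and takes the Pearson correlation of the ranks. The score lists are made up for illustration and the function names are assumptions; inputs are assumed not to be constant.

```python
def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))

metric_scores = [0.2, 0.5, 0.9, 0.7]  # hypothetical automatic scores
human_scores = [1, 2, 4, 5]           # hypothetical human ratings
rho = spearman(metric_scores, human_scores)
```

Because only ranks matter, the statistic is insensitive to the scale of either score, which suits comparisons between automatic metrics and human ratings.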

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG (version update)

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

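Affiliations: Tsinghua University

AI summary: Proposes Vision-OPD, a framework that uses on-policy self-distillation to transfer the model's own regional perception to its full-image policy, improving the accuracy of fine-grained visual understanding in multimodal large language models.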

Comments Project page: https://github.com/VisionOPD/Vision-OPD

Abstract

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD.

2605.19320 2026-06-03 cs.CV cs.DB (version update)

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

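Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Chinese Academy of Sciences Institute of Automation; Northeastern University; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes TextAlign, a framework whose hierarchical vision-language model reward decomposes text-rendering errors into global, word, and glyph levels and converts them into a scalar preference signal for post-training alignment with GRPO or DPO, improving text-rendering accuracy without changing the generator architecture.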

Abstract

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.
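The step from binary defect judgments at three levels to a scalar signal usable by both GRPO and DPO can be sketched as follows. The abstract does not state how the levels are combined, so the level weights, the "one minus weighted defect rate" formula, and all function names are assumptions made purely for illustration; only the GRPO-style within-group normalisation and the chosen/rejected pairing for DPO follow the standard forms of those methods.

```python
from statistics import mean, pstdev

# Assumed level weights; the abstract does not state how levels are combined.
LEVEL_WEIGHTS = {"global": 0.2, "word": 0.3, "glyph": 0.5}

def scalar_reward(defects):
    """Map per-level binary defect flags (True = defect found) to a score in [0, 1]."""
    penalty = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        flags = defects.get(level, [])
        if flags:
            penalty += weight * sum(flags) / len(flags)
    return 1.0 - penalty

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalisation: centre and scale rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def preference_pair(samples, rewards):
    """DPO-style pair: (chosen, rejected) = highest- and lowest-reward samples."""
    ranked = sorted(zip(rewards, samples), key=lambda t: t[0])
    return ranked[-1][1], ranked[0][1]

# Hypothetical judgments for two images generated from the same prompt.
clean = {"global": [False], "word": [False, False], "glyph": [False] * 4}
typo = {"global": [False], "word": [True, False], "glyph": [True, False, False, False]}
rewards = [scalar_reward(clean), scalar_reward(typo)]
advantages = group_relative_advantages(rewards)
chosen, rejected = preference_pair(["img_a", "img_b"], rewards)
```

A single scalar per sample is what lets the same reward feed either optimizer: GRPO consumes the normalised advantages, while DPO only needs the ordering to form pairs.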

2605.18160 2026-06-03 cs.CV cs.AI (version update)

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang

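Affiliations: Zhejiang University; East China Normal University; Zhejiang University of Science and Technology

AI summary: To address the weakening of visual information in multimodal large language models, proposes the Vision Inference Former (VIF), a lightweight module that continuously injects visual semantics during the decoding phase of inference, improving the consistency of generated content with the visual input.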

Abstract

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) as generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

2605.16813 2026-06-03 cs.GR cs.CV (version update)

QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning

Yiheng Zhang, Zhe Zhu, Tingrui Shen, Zhuojiang Cai, Tianxiao Li, Zixing Zhao, Qiujie Dong, Zhiyang Dou, Jiepeng Wang, Le Wan, Yuwang Wang, Wenping Wang, Yuan Liu, Cheng Lin

Affiliations: Hong Kong University of Science and Technology; Tencent VISVISE; Peking University; Technical University of Munich; Tsinghua University; The University of Hong Kong; Massachusetts Institute of Technology; Texas A&M University; Macau University of Science and Technology

AI summary: Proposes the QuadLink framework, which links points of a point cloud into structured faces to generate anisotropic quad-dominant meshes autoregressively, achieving high geometric fidelity and topological quality.

Abstract

The generation of production-ready quad-dominant meshes is a cornerstone of modern 3D content creation. Generating anisotropic quad-dominant meshes from point clouds is challenging, as existing methods are typically limited to producing either pure triangular meshes or pure quadrilateral meshes with isotropic densities. In this paper, we present QuadLink, a unified three-stage framework for quad-dominant mesh generation that links points into structured faces. QuadLink formulates polygonal mesh generation as a hybrid centroid-conditioned vertex linking model: it first predicts a unified set of anchors (vertices and face centroids), then learns centroid-conditioned links that associate vertices with face centroids, and finally assembles polygonal faces with a quad-first strategy guided by robust geometric verification. This link-based formulation enables efficient generation of sparse and anisotropic quad-dominant meshes with coherent edge flow while also supporting hybrid polygonal topology. To construct training data for this model, we further introduce a Tri-to-Quad Operator that converts artistic triangle meshes into quad-dominant training data via global merge selection. Extensive experiments show that QuadLink produces production-ready quad-dominant meshes from point clouds and achieves improved geometric fidelity and topological quality compared to prior baselines. Our method natively supports hybrid polygonal topology, generalizing to arbitrary n-gon meshes without architectural changes.

2602.02994 2026-06-03 cs.CV (version update)

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, Jian Luan

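Affiliations: Nanyang Technological University

AI summary: Proposes the Video-OPD framework, which uses on-policy distillation and a Teacher-Validated Disagreement Focusing curriculum to post-train multimodal large language models efficiently for temporal video grounding, overcoming sparse rewards and high computational overhead.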

Abstract

Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
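The dense, token-level supervision described above can be illustrated with a reverse KL divergence, KL(student || teacher), averaged over the steps of a trajectory that the student itself sampled. The sketch works on explicit probability lists for readability; a real implementation would operate on logits in a tensor library. The function names, the small epsilon, and the toy distributions are assumptions, not details from the paper.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def rollout_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled trajectory."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Hypothetical 3-token vocabulary, two decoding steps of a student rollout.
student = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
teacher = [[0.7, 0.2, 0.1], [0.9, 0.05, 0.05]]
loss = rollout_loss(student, teacher)
```

Every decoding step contributes a term, so the student receives a learning signal at each token rather than one reward per episode; the first step above contributes nothing because the two distributions already agree.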

2605.13258 2026-06-03 cs.CV cs.AI (version update)

Concept-wise Attention for Fine-grained Concept Bottleneck Models

Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang

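AI summary: Proposes a concept-wise attention mechanism (CoAt-CBM) that uses learnable concept-wise visual queries and concept contrastive optimization to achieve adaptive fine-grained image-concept alignment, addressing pre-training bias and mutual exclusivity among concepts and substantially improving performance.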

Comments Withdrawn by authors for revision and improvement

Abstract

Concept Bottleneck Models (CBMs) have recently achieved impressive performance by exploiting the image-text alignment learned by large pre-trained vision-language models (e.g., CLIP). However, concept modeling still has two key limitations. First, existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Second, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. A novel concept contrastive optimization then guides the model to account for the relative importance of the concept scores, so that concept predictions faithfully reflect the image content and alignment improves. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. Code will be made available upon acceptance.

2604.16808 2026-06-03 cs.CV 版本更新

BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling

BioLip: 通过生物力学约束违反建模实现语言泛化的唇同步深度伪造检测

Hao Chen, Junnan Xu

发表机构 * Independent Researcher（独立研究者）

AI总结针对现有检测方法在生成器或语言迁移下失效的问题，提出基于唇部运动生物力学约束的轻量级三分支网络，仅利用地标坐标检测唇同步伪造，在零样本设置下对未见生成器和多种语言表现鲁棒。

Comments 13 pages, 5 figures. Keywords: Deepfake detection, lip-sync forgery, biomechanical constraints, landmark kinematics, cross-lingual generalization, video forensics, privacy-preserving inference, compression robustness

2604.07048 2026-06-03 cs.CV 版本更新

PRISM: Rethinking Atmospheric Scattering Reconstruction as a Unified Understanding and Restoration Model for Real-world Dehazing

PRISM: 重新思考大气散射重建作为真实世界去雾的统一理解与恢复模型

Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Hongqiu Wang, Longxiang Tang, Xiu Li, Sina Farsiu

发表机构 * Tsinghua University（清华大学）； Duke University（杜克大学）； CUHK（香港中文大学）； HKUST（香港科技大学）； HKUST(GZ)（香港科技大学（广州））

AI总结提出基于近端散射大气重建（PSAR）的物理结构化框架，结合在线非均匀雾合成和选择性自蒸馏适应（SSDA）方案，实现真实世界图像去雾的统一理解与恢复。

Comments 21 Pages, 8 Figures, 7 Tables

详情

AI中文摘要

真实世界图像去雾（RID）旨在去除真实场景中由雾引起的退化。由于非均匀雾分布、空间变化的颜色偏移以及配对真实雾-干净数据的稀缺，该任务仍然具有挑战性。在PRISM中，我们提出了近端散射大气重建（PSAR），这是一个物理结构化框架，在大气散射模型下联合重建清晰场景和散射变量，使恢复过程在复杂真实世界条件下更具可解释性。为了弥合合成到真实的差距，我们设计了一个在线非均匀雾合成流程和一个用于非配对真实世界场景的选择性自蒸馏适应（SSDA）方案，该方案使模型能够选择性地从高质量感知目标中学习，同时利用其内在的散射理解来审计残留雾并指导自我优化。在真实世界基准上的实验表明，PRISM在RID任务上取得了具有竞争力的性能。

英文摘要

Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying color shifts, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattering Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, making the restoration process more interpretable in complex real-world conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-Distillation Adaptation (SSDA) scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Experiments on real-world benchmarks demonstrate that PRISM achieves competitive performance on RID tasks.

2604.05718 2026-06-03 cs.CV 版本更新

MPM: Mutual Pair Merging for Efficient Vision Transformers

MPM：用于高效视觉Transformer的互结对合并

Simon Ravé, Pejman Rasti, David Rousseau

发表机构 * LARIS University of Angers（昂热大学LARIS实验室）； UMR INRAe-IRHS Angers, France（法国昂热INRAe-IRHS联合研究单元）

AI总结提出无训练、无参数的互结对合并（MPM）模块，通过余弦空间互近邻配对与平均，记录合并图用于解码器前基于收集的重建，在语义分割中实现端到端加速，且精度损失小。

Comments Accepted to CVPR 2026 (Findings)

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 2998-3008

AI中文摘要

减少序列长度是加速Transformer的常用方法，但先前的token缩减工作通常针对分类任务，报告的是代理指标而非端到端延迟。对于语义分割，token缩减进一步受到重建密集、像素对齐特征的需求限制，并且在现代加速器上，计算合并图的开销可能抵消预期收益。我们提出互结对合并（MPM），一种无需训练的token聚合模块，它在余弦空间中形成互最近邻对，对每对进行平均，并记录一个合并图，使得在解码器之前能够进行基于收集的重建，从而现有分割头可以保持不变。MPM不引入任何学习参数，也没有连续的压缩旋钮（无保留率或阈值）。速度-精度权衡由离散的插入调度设置。我们在NVIDIA H100 GPU（带和不带FlashAttention-2）和Raspberry Pi 5上，针对标准分割数据集基准测试了端到端延迟。在ADE20K上，MPM在Raspberry Pi 5上为ViT-Tiny减少了高达60%的每张图像延迟，在H100上使用FlashAttention-2时吞吐量提升高达20%，同时mIoU下降保持在3%以内。这些结果表明，当显式考虑开销时，简单、重建感知、无需训练的token合并可以转化为分割中实际的时钟时间增益。

英文摘要

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
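The merge-and-reconstruct idea described above (mutual nearest-neighbor pairing in cosine space, pair averaging, and a recorded merge map for gather-based reconstruction) can be sketched in plain Python. This is not the authors' implementation. The function names `mutual_pair_merge` and `unmerge`, the tie-breaking, and the list-based token format are assumptions for illustration; a real module would operate on batched tensors inside the transformer blocks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mutual_pair_merge(tokens):
    """Average mutual nearest-neighbor pairs; return (merged, merge_map).

    merge_map[i] is the index in `merged` that original token i maps to.
    Requires at least two tokens.
    """
    n = len(tokens)
    # Nearest neighbor of every token by cosine similarity, excluding itself.
    nearest = [
        max((j for j in range(n) if j != i),
            key=lambda j: cosine(tokens[i], tokens[j]))
        for i in range(n)
    ]
    merged, merge_map = [], [None] * n
    for i in range(n):
        if merge_map[i] is not None:
            continue  # already absorbed as the partner of an earlier token
        j = nearest[i]
        if nearest[j] == i:
            # Mutual pair: both tokens agree, so replace them by their mean.
            merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
            merge_map[i] = merge_map[j] = len(merged) - 1
        else:
            # No mutual partner: the token passes through unchanged.
            merged.append(list(tokens[i]))
            merge_map[i] = len(merged) - 1
    return merged, merge_map

def unmerge(merged, merge_map):
    """Gather-based reconstruction back to the original sequence length."""
    return [merged[k] for k in merge_map]
```

Because pairing requires mutual agreement, the number of merged tokens depends on the input rather than on a keep-rate, which matches the abstract's point that there is no continuous compression knob. `unmerge` restores the original token count so a dense prediction head can consume the features unchanged.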

2604.04439 2026-06-03 cs.LG cs.CV 版本更新

Estimating Central, Peripheral, and Temporal Visual Contributions to Human Decision Making in Atari Games

估计Atari游戏中中央、周边和时间视觉对人类决策的贡献

Henrik Krauss, Takehisa Yairi

发表机构 * Department of Advanced Interdisciplinary Studies, The University of Tokyo（东京大学先进跨学科研究系）； Research Center for Advanced Science and Technology, The University of Tokyo（东京大学先进科学与技术研究中心）

AI总结通过控制消融框架分析Atari游戏中的眼动数据，发现周边视觉信息对人类决策贡献最大，而注视信息和过去状态信息贡献较小。

详情

AI中文摘要

我们研究了不同视觉信息源在动态视觉环境中对人类决策的贡献。利用Atari-HEAD（一个带有同步眼动追踪的大规模Atari游戏数据集），我们引入了一个受控消融框架，作为逆向工程周边视觉信息、显式注视信息（以注视图形式）以及人类行为中过去状态信息贡献的手段。我们在六种设置下训练动作预测网络，这些设置选择性地包含或排除这些信息源。在20个游戏中，周边信息的贡献最为显著，移除后预测准确率的中位数下降范围为35.27-43.90%。注视信息导致的下降较小，为2.11-2.76%，而过去状态信息的下降范围较广，为1.52-15.51%，其中上限可能因减少了周边信息泄露而更具信息量。为了补充总体准确率，我们根据不同模型配置分配的真实动作概率对状态进行聚类。该分析识别出粗略的行为模式，包括焦点主导、周边主导以及更多情境决策情境。这些结果表明，Atari游戏中的人类决策强烈依赖于当前注视焦点之外的信息，而所提出的框架提供了一种从行为中估计此类信息源贡献的方法。

英文摘要

We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in the form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.

2512.18954 2026-06-03 cs.CV 版本更新

VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion

VOIC：可见-遮挡联合引导的3D语义场景补全

Zaidao Han, Risa Higashita, Jiang Liu

发表机构 * Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology（可信自主系统研究院，南方科技大学）； Department of Computer Science and Engineering, Southern University of Science and Technology（计算机科学与工程系，南方科技大学）； School of Computer Science, University of Nottingham Ningbo China（计算机科学学院，宁波诺丁汉大学）； Department of Electronic and Information Engineering, Changchun University（电子信息工程系，长春大学）

AI总结提出VOIC网络，通过解耦可见区域感知与遮挡区域推理，利用离线可见区域标签提取策略和双解码器框架，在SemanticKITTI和SSCBench-KITTI360上实现最先进的3D语义场景补全性能。

详情

AI中文摘要

基于相机的3D语义场景补全（SSC）是自动驾驶和机器人场景理解的关键任务。它旨在从单张图像推断完整的3D体素表示，包括语义和几何信息。现有方法通常关注端到端的2D到3D特征提升和体素补全。然而，它们常常忽视由单图像输入引起的高置信度可见区域感知与低置信度遮挡区域推理之间的干扰，这可能导致特征稀释和错误传播。为了解决这些挑战，我们引入了一种离线可见区域标签提取（VRLE）策略，该策略从密集的3D地面真值中显式分离并提取可见区域的体素级监督。该策略为两个互补的子任务（可见区域感知和遮挡区域推理）净化了监督空间。基于这一思想，我们提出了可见-遮挡交互补全网络（VOIC），一种新颖的双解码器框架，将SSC显式解耦为可见区域语义感知和遮挡区域场景补全。VOIC首先通过融合图像特征与深度导出的占据信息构建基础3D体素表示。可见解码器专注于生成高保真的几何和语义先验，而遮挡解码器则利用这些先验以及跨模态交互进行连贯的全局场景推理。在SemanticKITTI和SSCBench-KITTI360基准上的大量实验表明，VOIC在几何补全和语义分割精度上均优于现有的单目SSC方法，实现了最先进的性能。

英文摘要

Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

2603.01576 2026-06-03 cs.CV 版本更新

Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Cryo-Bench：面向冰冻圈应用的基础模型基准测试

Saurabh Kaushik, Lalit Maurya, Beth Tellman, Valerio Marsocci

发表机构 * Center for Sustainability and the Global Environment (SAGE), University of Wisconsin–Madison（可持续性与全球环境中心（SAGE），威斯康星大学麦迪逊分校）； Portsmouth AI and Data Science Centre (PAIDS), School of Computing, University of Portsmouth（朴茨茅斯人工智能与数据科学中心（PAIDS），计算学院，朴茨茅斯大学）； ESA, ESRIN, Φ-lab, Frascati（欧洲航天局（ESA），欧洲空间研究所（ESRIN），Φ实验室，弗拉斯卡蒂）

AI总结提出Cryo-Bench基准，评估14个地理基础模型在冰冻圈关键组件（如冰川、冰湖、海冰等）上的性能，发现UNet在冻结编码器下平均mIoU最高（66.38），而全微调结合学习率调整可提升性能12.77%。

详情

AI中文摘要

地理基础模型（GFMs）已在涵盖多个领域的地球观测任务中得到评估，并展现出即使在标签稀疏的情况下也能生成可靠地图的强大潜力。然而，针对冰冻圈应用的GFMs基准测试仍然有限，主要原因是缺乏合适的评估数据集。为填补这一空白，我们引入了Cryo-Bench，这是一个用于评估GFMs在关键冰冻圈组件上性能的基准。Cryo-Bench包括表碛覆盖冰川、冰湖、海冰和崩解前沿，涉及多种传感器和广泛的地理区域。我们评估了14个GFMs以及UNet和ViT基线，以分析它们的优势、局限性和最佳使用策略。在冻结编码器的情况下，UNet在Cryo-Bench包含的五个评估数据集上取得了最高的平均mIoU 66.38，其次是TerraMind的64.02。在少样本设置（10%输入数据）下，DOFA和TerraMind等GFMs优于UNet，分别达到59.53、56.62和56.60的mIoU分数，而UNet为56.60。当完全微调GFMs时，我们观察到不同数据集和模型之间的性能不一致。然而，调整学习率并配合微调显著提升了GFM性能。例如，在两个代表性数据集（GLID和CaFFe）上的评估显示平均相对提升为12.77%。尽管预训练数据中冰冻圈表示极少，GFMs仍展现出显著的领域适应能力，并在各项任务中产生有意义的结果。基于我们的发现，我们建议通过超参数优化进行编码器微调以获得最佳性能，而在用户需要快速结果且无需大量实验时使用冻结编码器。（GitHub：https://github.com/Sk-2103/Cryo-Bench）

英文摘要

Geo-Foundation Models (GFMs) have been evaluated on diverse Earth observation tasks across multiple domains and have demonstrated strong potential for producing reliable maps even with sparse labels. However, benchmarking GFMs for cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation. Code: https://github.com/Sk-2103/Cryo-Bench

2511.19945 2026-06-03 cs.CV 版本更新

Low-Resolution Editing is All You Need for High-Resolution Editing

低分辨率编辑足以实现高分辨率编辑

Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

发表机构 * ECE & IPAI, Seoul National University（首尔大学电气与计算机工程系及IPAI）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

AI总结本文提出一种测试时优化框架，通过分块优化、细节迁移和同步策略，实现高分辨率图像编辑。

Comments CVPR 2026. Project website: https://hleephilip.github.io/ScaleEdit

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM：基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

AI总结提出SleepVLM，一种基于规则驱动的视觉语言模型，通过多通道PSG波形图像进行睡眠分期，并生成符合AASM评分标准的临床可读解释，在保持高准确率的同时提升可解释性。

Comments Under review

详情

AI中文摘要

尽管自动睡眠分期已达到专家级准确率，但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM，一种基于规则驱动的视觉语言模型（VLM），它通过多通道多导睡眠图（PSG）波形图像进行睡眠分期，并基于美国睡眠医学学会（AASM）评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调，SleepVLM在保留测试集（MASS-SS1）上实现了0.767的Cohen's kappa，在外部队列（ZUAMHCS）上实现了0.743，达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量，在两个数据集上，事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间（满分5分）。通过将竞争性性能与透明、基于规则的解释相结合，SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究，我们发布了MASS-EX，一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
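The agreement metric reported above, Cohen's kappa, corrects raw accuracy for the agreement two raters would reach by chance. A minimal sketch follows; the function name `cohens_kappa` and the toy stage labels are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from the label marginals.
    Undefined (division by zero) when p_e equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: eight epochs scored by a reference rater and by a model.
reference = ["W", "N1", "N2", "N2", "N3", "REM", "W", "N2"]
predicted = ["W", "N2", "N2", "N2", "N3", "REM", "W", "N1"]
kappa = cohens_kappa(reference, predicted)
```

In this toy case the raters agree on 6 of 8 epochs (p_o = 0.75) and chance agreement is p_e = 0.25, so kappa is 2/3. A library routine such as scikit-learn's `cohen_kappa_score` would normally be used in practice.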

PAND：面向提示的邻域蒸馏用于轻量级细粒度视觉分类

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

AI总结提出PAND框架，通过提示感知语义校准和邻域感知结构蒸馏，将大型视觉语言模型知识迁移至轻量网络，在细粒度分类任务上超越现有方法。

Comments Accepted by ICIP2026

2512.10888 2026-06-03 cs.CV 版本更新

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

PubTables-v2: 一个新的用于全页和多页表格提取的大规模数据集

Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Amrit Ramesh, Maury Courtland

发表机构 * Kensho Technologies（Kensho技术公司）

AI总结针对全页和多页表格提取任务缺乏标注数据的问题，本文创建了大规模数据集PubTables-v2，并评估了当前前沿模型与小模型在不同上下文级别任务上的性能差异。

Comments 28 pages, separated POTATR to its own paper, added frontier model results

详情

AI中文摘要

表格提取（TE）是文档理解中的一个关键挑战。传统方法先检测表格，然后识别其结构。最近，人们对开发直接在全页或文档上下文中提取表格的方法（如视觉语言模型（VLM））的兴趣激增。然而，缺乏标注数据使得进展难以展示。为了解决这个问题，我们创建了一个新的大规模数据集PubTables-v2。PubTables-v2统一了各种周围上下文级别的TE，并且值得注意的是，它是第一个用于多页TE的基准。我们的评估显示，虽然当前前沿模型在最复杂的任务（全文档多页TE）上显著优于小模型（+0.354 GriTS_Con），但在较窄的任务（裁剪表格提取）上，通过针对性训练，这种差距可以被缩小甚至逆转（-0.056 GriTS_Con）。数据可在 https://huggingface.co/datasets/kensho/PubTables-v2 获取。代码和模型将发布。

英文摘要

Table extraction (TE) is a key challenge in document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), to extract tables directly in their full page or document context. However, a lack of annotated data has made progress difficult to demonstrate. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 unifies TE across various levels of surrounding context and, notably, is the first benchmark for multi-page TE. Our evaluations reveal that while current frontier models strongly outperform ($+0.354\ \textrm{GriTS}_\textrm{Con}$) small models on the most complex task (full-document multi-page TE), this gap can be closed or even reversed ($-0.056\ \textrm{GriTS}_\textrm{Con}$) on narrower tasks (cropped table extraction) with targeted training. Data is available at https://huggingface.co/datasets/kensho/PubTables-v2. Code and models will be released.

2603.14377 2026-06-03 cs.CV 版本更新

两全其美：通过统一离散流匹配实现多模态推理与生成

Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou

AI总结提出UniDFlow框架，通过任务特定低秩适配器解耦理解与生成，并利用基于参考的多模态偏好对齐优化忠实性与可控性，在多个基准上达到最先进性能。

2602.11804 2026-06-03 cs.CV eess.IV 版本更新

Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

基于深度感知融合与有限训练数据的高效分割一切

Yiming Zhou, Xuenjie Xie, Panfeng Li, Albrecht Kunz, Ahmad Osman, Xavier Maldague

发表机构 * University of Cambridge（剑桥大学）

AI总结提出一种轻量级RGB-D融合框架，通过单目深度先验增强EfficientViT-SAM，在仅使用11.2k训练样本（不到SA-1B的0.1%）的情况下，实现比EfficientViT-SAM更高的分割精度。

详情

DOI: 10.1109/ICASSP55912.2026.11464597
Journal ref: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1731-1735

AI中文摘要

分割一切模型（SAM）实现了令人印象深刻的通用分割性能，但需要大规模数据集（例如1100万张图像）且仅依赖RGB输入。最近的高效变体减少了计算量，但仍依赖于大规模训练。我们提出了一种轻量级RGB-D融合框架，用单目深度先验增强EfficientViT-SAM。深度图通过预训练的估计器生成，并通过专门的深度编码器与RGB特征进行中层融合。仅使用11.2k样本（不到SA-1B的0.1%）训练，我们的方法比EfficientViT-SAM取得了更高的准确率，表明深度线索为分割提供了强大的几何先验。

英文摘要

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.

2602.09708 2026-06-03 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Physics-informed diffusion models in spectral space

谱空间中的物理信息扩散模型

Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen

发表机构 * ETH Zürich（苏黎世联邦理工学院）

AI总结提出物理信息谱扩散（PISD）方法，结合生成式潜扩散模型与物理信息机器学习，在谱表示潜空间中对偏微分方程参数和解进行扩散建模，通过扩散后验采样施加物理约束和测量条件，在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上展现出比现有扩散求解器更高的精度和计算效率。

Comments 18 pages, 10 figures

详情

AI中文摘要

我们提出物理信息谱扩散（PISD），一种将生成式潜扩散模型与物理信息机器学习相结合的方法，用于生成基于部分观测的偏微分方程（PDE）的解，特别包括正向和逆向PDE问题。我们在缩放谱表示的潜空间中通过扩散过程学习PDE参数和解的联合分布，其中高斯噪声对应于具有受控正则性的函数。与基于网格的扩散模型相比，这种谱公式能够实现显著的降维，并确保函数空间中的诱导过程保持在PDE算子定义良好的函数类内。基于扩散后验采样，我们在推理过程中施加物理信息约束和测量条件，在每个扩散步骤应用基于Adam的更新。我们在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上评估了所提出的方法，与现有的基于扩散的PDE求解器（在稀疏观测下达到最先进水平）相比，展示了更高的精度和计算效率。代码可在 https://github.com/deeplearningmethods/PISD 获取。

英文摘要

We propose physics-informed spectral diffusion (PISD), a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier-Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.

2601.22841 2026-06-03 cs.CV 版本更新

How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models

我们需要多少模型？遥感基础模型中的冗余与可瘦身性

Leonard Hackel, Tom Burgert, Begüm Demir

AI总结通过后验瘦身（均匀减少编码器Transformer块宽度）评估8个遥感基础模型的表示冗余，发现遥感模型在激进宽度缩减下仍保持69%-109%相对精度，而自然图像预训练模型性能急剧下降，表明遥感模型存在冗余编码且可有效瘦身。

详情

AI中文摘要

遥感中的大规模基础模型（RS FMs）遵循计算机视觉（CV）中建立的范式开发，但将CV缩放定律迁移至RS的有效性尚未系统检验。我们假设RS FMs在比CV对应模型小得多的规模下进入过参数化区域，任务相关信息在模型维度间冗余编码。为验证这一假设，我们应用后验瘦身（即均匀减少预训练编码器Transformer块的宽度）作为衡量8个最先进RS FMs在分类、分割和变化检测任务中表示冗余的工具。在激进宽度缩减下，RS FMs在RS数据集上保持69%至109%的相对精度，而基于自然图像预训练的掩码自编码器（MAE）和DINOv2（记为CV MAE和CV DINOv2）在相同计算需求范围内，在匹配类别数的ImageNet子集上性能急剧下降。直接在相同RS数据集上评估的CV MAE缩小了差距但未消除，表明数据集特性和领域特定预训练共同导致了模型间的差异。特征相关性、解释方差和有效维度等机制分析表明，任务相关方差集中在少数主成分中，并在模型维度间冗余编码。我们进一步证明，对于对比目标，学习型可瘦身训练优于后验瘦身，而基于重建的目标无法从当前可瘦身训练协议中受益。我们的发现确立了后验瘦身作为资源受限RS应用的实际部署策略，以及作为RS FMs表示冗余的诊断工具。论文接收后，我们将发布所有代码。

英文摘要

Large-scale foundation models (FMs) in remote sensing (RS) (denoted as RS FMs) are developed following paradigms established in computer vision (CV), yet the validity of transferring CV scaling laws to RS has not been systematically examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, with task-relevant information encoded redundantly across model dimensions. To test this hypothesis, we apply post-hoc slimmability, uniform width reduction of pretrained encoder transformer blocks, as a tool to measure representational redundancy across eight state-of-the-art RS FMs on classification, segmentation, and change detection tasks. RS FMs retain 69% to 109% relative accuracy on RS datasets under aggressive width reduction, while masked autoencoder (MAE) and DINOv2 pretrained on natural images (denoted as CV MAE and CV DINOv2) degrade sharply on ImageNet subsets of matched class count over the same range of computational requirements. A CV MAE evaluated directly on the same RS datasets narrows but does not close the gap, indicating that both dataset characteristics and domain-specific pretraining contribute to the differences between the models. Mechanistic analyses such as feature correlation, explained variance, and effective dimensionality indicate that task-relevant variance concentrates in few principal components and is redundantly encoded across model dimensions. We further show that learned slimmable training improves over post-hoc slimmability for contrastive objectives, while reconstruction-based objectives do not benefit from current slimmable training protocols. Our findings establish post-hoc slimming as a practical deployment strategy for resource-constrained RS applications and as a diagnostic tool for representational redundancy in RS FMs. Upon acceptance, we will publish all code.

2601.22443 2026-06-03 cs.LG cs.CV stat.CO stat.ML 版本更新

Weak Diffusion Priors Can Still Achieve Strong Inverse-Problem Performance

弱扩散先验仍能实现强逆问题性能

Jing Jia, Wei Yuan, Sifan Liu, Liyue Shen, Guanyang Wang

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结研究弱扩散先验在逆问题中的鲁棒性，通过贝叶斯一致性和局部相关性分析揭示其在信息丰富测量下仍有效的原因。

Comments 37 pages, ICML 2026 spotlight. Code: https://github.com/jjia131/weak-diffusion-priors-inverse-problem, Project Page: https://jjia131.github.io/weak-diffusion-priors-inverse-problem/

详情

AI中文摘要

在卧室图像上训练的扩散模型能否恢复人脸图像？扩散模型被广泛用作逆问题的先验，但标准方法通常假设一个高保真模型，该模型在与未知信号高度匹配的数据上训练。实践中，常常必须使用不匹配或低保真的扩散先验。令人惊讶的是，这些弱先验的表现往往几乎与全强度的域内基线相当。我们研究了逆求解器何时以及为何对弱扩散先验具有鲁棒性。通过大量实验，我们发现当测量信息高度丰富（例如，大量观测像素）时，弱先验能够成功，并识别了它们失败的场景。为了解释这一行为，我们将贝叶斯一致性理论与局部相关性分析相结合：理论给出了高维测量使后验集中于真实信号附近的条件，而相关性分析表明弱先验和更强的自然图像先验可以共享相似的局部空间结构。这些结果为何时可以可靠地使用弱扩散先验提供了原则性依据。代码可在 https://github.com/jjia131/weak-diffusion-priors-inverse-problem 获取。

英文摘要

Can a diffusion model trained on bedrooms recover human faces? Diffusion models are widely used as priors for inverse problems, but standard approaches usually assume a high-fidelity model trained on data that closely match the unknown signal. In practice, one often must use a mismatched or low-fidelity diffusion prior. Surprisingly, these weak priors often perform nearly as well as full-strength, in-domain baselines. We study when and why inverse solvers are robust to weak diffusion priors. Through extensive experiments, we find that weak priors succeed when measurements are highly informative (e.g., many observed pixels), and we identify regimes where they fail. To explain this behavior, we combine Bayesian-consistency theory with local-correlation analysis: the theory gives conditions under which high-dimensional measurements make the posterior concentrate near the true signal, while the correlation analysis shows that weak and stronger natural-image priors can share similar local spatial structure. These results provide a principled justification for when weak diffusion priors can be used reliably. Code is available at https://github.com/jjia131/weak-diffusion-priors-inverse-problem.

2510.22491 2026-06-03 cs.LG cs.CE cs.CV 版本更新

LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation

LAMP: 数据高效的线性仿射权重空间模型用于参数控制的3D形状生成与外推

Ghadi Nehme, Yanxia Zhang, Dule Shu, Matt Klenk, Faez Ahmed

AI总结提出LAMP框架，通过过拟合共享初始化的符号距离函数解码器并对齐权重空间，以少量样本实现参数约束下的可控3D生成与外推，并引入线性失配安全度量确保可靠性。

详情

AI中文摘要

在显式参数约束下生成高保真3D几何体是工程设计的核心，但当前方法通常需要大型数据集，且无法在训练分布之外提供可靠控制。我们提出LAMP，一个数据高效的框架，用于可控和可解释的3D生成，该框架通过从共享初始化过拟合每个样本并对齐符号距离函数（SDF）解码器，然后在对齐的权重空间中通过求解参数约束的仿射混合问题来生成新设计。为了提高可靠性，我们提出一种线性失配安全度量，用于检测混合解码器何时离开有效的局部区域。我们在DrivAerNet++、BlendedNet以及额外的工业级车辆系列（包括跑车、SUV和敞篷车）上评估LAMP。LAMP能够以少至50个样本实现受控插值，在训练范围外安全外推高达100%，并在固定参数下进行性能引导优化，在外推、数据效率和参数保真度方面优于条件自编码器和深度网络插值（DNI）基线。我们的结果表明，LAMP推进了用于设计探索、数据集生成和性能驱动优化的可控、数据高效且安全的3D生成。

英文摘要

Generating high-fidelity 3D geometries under explicit parameter constraints is central to engineering design, yet current methods often require large datasets and fail to provide reliable control beyond the training distribution. We introduce LAMP, a data-efficient framework for controllable and interpretable 3D generation that aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then generates new designs by solving a parameter-constrained affine mixing problem in the aligned weight space. To improve reliability, we propose a linearity-mismatch safety metric that detects when mixed decoders leave the valid local regime. We evaluate LAMP on DrivAerNet++, BlendedNet, and additional industry-level vehicle families, including sports cars, SUVs, and convertibles. LAMP enables controlled interpolation with as few as 50 samples, safe extrapolation up to 100% beyond training ranges, and performance-guided optimization under fixed parameters, outperforming conditional autoencoder and Deep Network Interpolation (DNI) baselines in extrapolation, data efficiency, and parameter fidelity. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.

URL PDF HTML ☆

赞 0 踩 0

2512.23234 2026-06-03 cs.CV cs.AI 版本更新

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

边缘感知与内容自适应的工业安全监控红外气体泄漏检测

Dongsheng Li, Tianli Ma, Siling Wang, Beibei Duan, Song Gao

发表机构 * School of Mechatronic Engineering, Xi’an Technological University（机电工程学院，西安工业大学）； School of Electronic Information Engineering, Xi’an Technological University（电子信息工程学院，西安工业大学）； Shaanxi Shanhua Coal Chemical Co., Ltd.（陕西陕化煤化工有限公司）

AI总结针对红外气体羽流微弱、半透明且边界模糊的检测难题，提出一种边缘感知与内容自适应特征融合检测器（ECAF-Det），通过羽流导向的局部-全局特征增强、多尺度边缘感知模块和内容自适应稀疏路由路径聚合网络，在IIG和LangGas数据集上显著提升了检测精度。

详情

AI中文摘要

红外气体泄漏检测对于工业安全和环境监测至关重要，但由于气体羽流通常微弱、细小、半透明且边界模糊，自动检测仍然具有挑战性。本文提出了一种边缘感知与内容自适应特征融合检测器（ECAF-Det），用于杂乱热场景中的弱羽流检测。ECAF-Det集成了三个面向任务的设计：羽流导向的局部-全局特征增强块，用于保留精细边界线索并捕获长程上下文连续性；多尺度边缘感知模块，将方向梯度和相位一致性线索转化为分层边缘先验，用于边界敏感的羽流表示；以及内容自适应稀疏路由路径聚合网络，动态调节多尺度特征传播，以强调信息丰富的羽流特征并抑制冗余背景响应。在IIG数据集上的实验表明，ECAF-Det实现了29.8%的AP、84.3%的AP50和25.3%的小目标AP，分别比RT-DETR-R18基线提高了3.0、6.5和5.4个百分点，计算量为43.7 GFLOPs，参数量为14.9 M。在LangGas数据集上，ECAF-Det实现了36.3%的AP和68.5%的AP50，展示了其对不同红外气体羽流外观的泛化能力。主要的人工智能贡献在于边缘感知表示学习与内容自适应稀疏特征路由，用于弱红外羽流感知。所提出的检测器可作为工业气体泄漏监测中早期预警和远程巡检的视觉感知组件。

英文摘要

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement block to preserve fine boundary cues and capture long-range contextual continuity; a multi-scale edge perception module that transforms directional gradient and phase-consistency cues into hierarchical edge priors for boundary-sensitive plume representation; and a content-adaptive sparse routing path aggregation network that dynamically regulates multi-scale feature propagation to emphasize informative plume features and suppress redundant background responses. Experiments on the IIG dataset show that ECAF-Det achieves 29.8% AP, 84.3% AP50, and 25.3% small-object AP, improving the RT-DETR-R18 baseline by 3.0, 6.5, and 5.4 percentage points, respectively, with 43.7 GFLOPs and 14.9 M parameters. On the LangGas dataset, ECAF-Det achieves 36.3% AP and 68.5% AP50, demonstrating its generalization to different infrared gas plume appearances. The main AI contribution is edge-aware representation learning with content-adaptive sparse feature routing for weak infrared plume perception. The proposed detector can serve as a visual perception component for early warning and remote inspection in industrial gas leak monitoring.

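2512.22539 2026-06-03 cs.RO cs.CV (updated version)

Physical Plausibility Reasoning via HCM-GRPO: Empowering Compact Models for Superior Performance

Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

Affiliations: Tsinghua University; Alibaba Health Information Technology Limited

AI summary: To address the lack of data and the weak physical plausibility reasoning of multimodal large language models, the authors propose a complete solution comprising a large-scale dataset and the HCM-GRPO method, enabling a compact model to surpass large-scale open-source and closed-source models.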

Abstract

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare, and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak physical plausibility reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, comprising about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the physical plausibility reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior physical plausibility reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT5.2 and Gemini3-Pro, exhibit unsatisfactory performance in physical plausibility reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.

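2511.02417 2026-06-03 cs.CV cs.RO (updated version)

CropCraft: A Procedural World Generator for Robotic Simulation of Agricultural Tasks

Riccardo Bertoglio, Cyrille Pierre, Johann Laconte, Roland Lenain

Affiliations: Institut National de la Recherche Agronomique

AI summary: Presents CropCraft, an open-source procedural world generator built on Blender and Python that generates diverse crop-field scenes from a YAML configuration, supports intercropping, vineyards, and weed-infested fields, and produces annotated 3D simulation environments for developing perception and navigation algorithms for agricultural robots.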

Abstract

The adoption of agroecological practices in modern agriculture requires robotic systems capable of operating in highly diverse and complex field environments. Developing and evaluating such systems relies heavily on simulation, yet generating realistic and configurable 3D environments representative of agroecological diversity remains a major challenge. This paper presents CropCraft, an open-source procedural world generator built on Blender and Python, designed to produce 3D simulation environments tailored to agricultural robotics. CropCraft generates crop fields from a simple YAML configuration file, supporting a wide range of scenarios including intercropping, vineyards, and weed-infested fields. The tool includes a library of 3D plant models (crops, grasses, and weeds) at multiple growth stages, and uses stochastic placement algorithms to realistically reproduce the spatial variability observed in real fields. Generated worlds are directly importable into the Gazebo simulator and include ground-truth annotations for all placed elements, supporting both perception and navigation algorithm development. To demonstrate the practical utility of CropCraft, we apply it to the task of crop-weed semantic segmentation using deep learning. A dataset of 10,000 synthetic images of maize fields with varying weed densities, growth stages, and lighting conditions was generated and used to train several segmentation architectures. Models trained exclusively on synthetic data achieve a sim-to-real gap of approximately 10% mean Intersection over Union (mIoU) on real field images, outperforming previous state-of-the-art synthetic generation approaches. We further show that combining even a few real images with synthetic data improves generalization across domains, providing new insights into the effective use of synthetic data for agricultural perception tasks.
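The sim-to-real gap above is expressed in mean Intersection over Union (mIoU). As a minimal sketch of how that metric can be computed from predicted and ground-truth label maps (this is not CropCraft's evaluation code, and the three classes and the toy label maps are assumptions made for illustration):

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in either label map."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Flattened 2x3 label maps; 0 = soil, 1 = crop, 2 = weed (hypothetical classes).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.3f}")
```

Under this definition, a sim-to-real gap would be the difference between the mIoU of a model trained on real images and the mIoU of a model trained only on synthetic images, both measured on the same real test set.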

2510.13565 2026-06-03 cs.CV (updated version)

XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille

Affiliations: Technical University of Munich; Infineon Technologies AG

AI summary: Proposes XD-RCDepth, a lightweight radar-camera depth estimation architecture that uses explainability-aligned distillation and depth-distribution distillation to cut parameters by 29.7% while preserving accuracy, achieving real-time performance on the nuScenes and ZJU-4DRadarCam datasets.

2510.09845 2026-06-03 cs.LG cs.AI cs.CV (updated version)

Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

AI summary: Using hourly data from NASA's TEMPO satellite mission and self-supervised deep learning, this study proposes an innovative system that uses GOES-18 and TEMPO data to distinguish smoke from clouds and map wildfire fronts and smoke plumes in real time, significantly outperforming existing operational products.

Comments https://2025.ieeeigarss.org/view_paper.php?PaperNum=6389&SessionID=1611

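2510.03316 2026-06-03 cs.CV cs.AI cs.LG (updated version)

The View From Space: Navigating Instrumentation Differences with EOFMs

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

Affiliations: Spatial Informatics Group

AI summary: By analysing how sensitive Earth Observation Foundation Models (EOFMs) are to sensor architecture, this study exposes shortcomings in current model design and points a way forward for model developers, users, and the remote-sensing science community.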

Journal ref: https://neurips.cc/virtual/2025/loc/san-diego/122891

Abstract

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

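2509.25859 2026-06-03 cs.CV cs.SY eess.SY (updated version)

LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement

Pasindu Ranasinghe, Dibyayan Patra, Bikram Banerjee, Simit Raval

AI summary: Proposes a hardware-agnostic method that combines multi-camera fusion with a low-light enhancement module to colourise mechanical LiDAR point clouds over 360 degrees, recovering scene detail even under low illumination.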

DOI: 10.3390/s25216582
Journal ref: Sensors 25(21), 6582 (2025)

Abstract

In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
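The colourisation step described above rests on mapping each LiDAR point into a camera image once the intrinsics and the LiDAR-to-camera transform are known. A minimal sketch of that projection follows, assuming a distortion-free pinhole camera; the function name, the parameter names, and the toy image are illustrative, and the paper's automatic extrinsic estimation, colour correction, and low-light enhancement are not reproduced here.

```python
def colourise_point(point, R, t, fx, fy, cx, cy, image):
    """Return the RGB colour for one LiDAR point, or None if the camera cannot see it."""
    # LiDAR frame -> camera frame: p_cam = R @ p + t
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 0:
        return None  # point lies behind the camera
    # Pinhole projection onto the pixel grid
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    height, width = len(image), len(image[0])
    if 0 <= u < width and 0 <= v < height:
        return image[v][u]
    return None  # projects outside the image


# Tiny 3x4 test image whose pixel colour encodes its (row, column) position.
image = [[(row * 10, col * 10, 0) for col in range(4)] for row in range(3)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera = dict(R=identity, t=[0.0, 0.0, 0.0], fx=100.0, fy=100.0, cx=2.0, cy=1.0, image=image)

visible = colourise_point((0.01, 0.0, 1.0), **camera)
behind = colourise_point((0.0, 0.0, -1.0), **camera)
```

With several cameras, one simple policy (an assumption, not necessarily the paper's) is to try each camera in turn and keep the first colour that is not `None`, which is how full 360-degree coverage could be assembled.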

2505.17659 2026-06-03 cs.RO cs.CV (updated version)

Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning and combines rule-based rewards with Variance-Decoupled GRPO, significantly improving the safety and feasibility of autonomous driving planning.

Comments Accepted by ICLR2026

Abstract

Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to have similar advantages as abundant low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings. Our code is available at https://github.com/XiaolongTang23/Plan-R1.
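The limitation described above can be seen on a toy example. The sketch below compares group-wise normalization (subtract the group mean, divide by the group standard deviation) with centering plus a fixed scale. The constant `scale`, the function names, and the reward values are assumptions for illustration; the abstract does not give the exact VD-GRPO formula.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def centered_fixed_scale_advantages(rewards, scale=1.0):
    """Centered advantages with a fixed scale (hypothetical constant)."""
    mu = mean(rewards)
    return [(r - mu) / scale for r in rewards]


safe_group = [0.98, 1.00, 1.02, 1.00]    # low variance, every rollout is safe
violation_group = [1.0, 1.0, 1.0, -9.0]  # one rare, heavily penalized violation

for name, group in [("safe", safe_group), ("violation", violation_group)]:
    normalized = max(abs(a) for a in grpo_advantages(group))
    centered = max(abs(a) for a in centered_fixed_scale_advantages(group))
    print(f"{name:9s} max |A| normalized={normalized:.2f} centered={centered:.2f}")
```

After normalization, both groups produce advantages of similar magnitude, so the safety violation carries no extra weight. With centering and a fixed scale, the violation group keeps its much larger absolute reward gap.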

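2507.09105 2026-06-03 cs.CV (updated version)

Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production

Maoxiao Ye, Xinfeng Ye, Mano Manoharan

Affiliations: University of Auckland

AI summary: Proposes HybridSign, a hybrid autoregressive-diffusion model that combines causal frame generation with flow-based diffusion refinement for low-latency, high-quality sign language production, achieving the best quality-efficiency trade-off on PHOENIX14T and How2Sign.

Comments Accepted at ACL 2026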

Abstract

Earlier Sign Language Production (SLP) models typically relied on autoregressive decoding, which naturally preserves temporal causality but suffers from error accumulation at inference time. More recent diffusion-based approaches improve generation quality through iterative denoising, yet their sequence-level refinement process introduces substantial latency. To address this trade-off, we propose HybridSign, a hybrid autoregressive-diffusion model for low-latency sign language production that combines causal frame generation with flow-based diffusion refinement. A Multi-Scale Pose Representation module captures fine-grained articulator features, while a Confidence-Aware Causal Attention mechanism leverages joint-level confidence scores to improve robustness under noisy 2D pose observations. Experiments on PHOENIX14T and How2Sign show that HybridSign consistently achieves the best quality-efficiency trade-off among the compared baselines. On the How2Sign test split, it reaches BLEU-1/4 scores of 30.12/6.48 and DTW of 3.89, while reducing time-to-first-frame to 5.90s and increasing throughput to 10.17 FPS under a 60-frame evaluation protocol.
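The DTW figure reported above is a dynamic time warping score between generated and reference pose sequences. A minimal sketch of the classic DTW recurrence is given below, assuming a Euclidean distance per frame and no length normalization. The paper's exact cost function and normalization are not stated in the abstract, so the toy values here are not comparable to the reported 3.89.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Classic dynamic time warping cost between two sequences of pose vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slower = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same motion, one frame held
drifted = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]             # last pose is off by 1

print(dtw_distance(reference, slower), dtw_distance(reference, drifted))
```

Because DTW allows frames to be repeated, the slower sequence matches the reference at zero cost, while the drifted one is penalized by the pose error in its final frame.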

2509.11323 2026-06-03 cs.CV cs.AI (updated version)

Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long

AI summary: Proposes Semantic-Independent KalmanNet (SIKNet), which improves motion estimation with a Semantic-Independent Encoder (SIE) and is more robust and accurate in MOT than the traditional Kalman filter and existing learning-aided filters.

DOI: 10.1016/j.inffus.2026.104513

Abstract

Motion estimation is a crucial component of multi-object tracking (MOT). It predicts object trajectories by analyzing changes in object positions across consecutive frames, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move in a non-stationary manner. In this work, we use a learning-aided filter to handle motion estimation in MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
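One way to read the two SIE steps is sketched below in plain Python. It assumes the input is a small stack of state vectors that share the same element order, so a kernel-size-1 convolution over that stack mixes only elements with the same meaning, and the dense layer afterwards mixes across meanings. The layout, layer sizes, weights, and function names are all illustrative assumptions rather than the authors' implementation, which is available in the linked repositories.

```python
def conv1d_kernel1(x, weight, bias):
    """Kernel-size-1 1D convolution: x is [in_ch][length], weight is [out_ch][in_ch]."""
    length = len(x[0])
    return [
        [sum(w * x[k][d] for k, w in enumerate(row)) + b for d in range(length)]
        for row, b in zip(weight, bias)
    ]


def dense_relu(v, weight, bias):
    """Fully connected layer followed by a ReLU activation."""
    return [max(0.0, sum(w * a for w, a in zip(row, v)) + b) for row, b in zip(weight, bias)]


def sie_forward(states, conv_w, conv_b, fc_w, fc_b):
    """Step 1 mixes same-index elements across state vectors; step 2 mixes across indices."""
    per_semantic = conv1d_kernel1(states, conv_w, conv_b)
    flat = [a for channel in per_semantic for a in channel]
    return dense_relu(flat, fc_w, fc_b)


# Two state vectors with three elements each, in the same semantic order (toy values).
states = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
conv_w, conv_b = [[1.0, 0.5]], [0.0]                           # one output channel
fc_w, fc_b = [[1.0, 1.0, 1.0], [-1.0, 0.0, 0.0]], [0.0, 0.0]   # two output features

encoded = sie_forward(states, conv_w, conv_b, fc_w, fc_b)
```

In this toy run the convolution output is `[6.0, 12.0, 18.0]`: each position combines only the matching elements of the two state vectors. Only the dense layer sums across positions.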

2509.03376 2026-06-03 cs.CV (updated version)

Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu

Affiliations: School of Automation Engineering, Shanghai University of Electric Power; School of Mechatronic Engineering and Automation, Shanghai University; School of Intelligent Systems Engineering, Sun Yat-sen University

AI summary: Proposes the T-CAGU framework, which pairs a transformer that captures global dependencies with a content-adaptive graph neural network that strengthens local relationships, learns the graph structure dynamically through multi-order propagation, and adds a graph residual mechanism for efficient hyperspectral image unmixing.

Abstract

Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.

2508.15130 2026-06-03 cs.CV (updated version)

HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

Vaishnav Ramesh, Haining Wang, Md Jahidul Islam

AI summary: Proposes HiRQA, a framework that uses hierarchical ranking and contrastive learning for self-supervised no-reference image quality assessment, generalizing to authentic distortions without subjective labels.

Comments Accepted for publication in Machine Vision and Applications

Abstract

Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder their generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic image distortions, HiRQA generalizes to authentic degradations, as demonstrated through comprehensive evaluations on various unseen distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's competitive performance, strong generalization ability, and scalability. The HiRQA model and inference pipeline are available at: https://github.com/uf-robopi/HiRQA.

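2508.05852 2026-06-03 cs.CV (updated version)

Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

Kaiser Hamid, Khandakar Ashrafi Akbar, Peihang Li, Nade Liang

Affiliations: Texas Tech University; Towson University

AI summary: This study fine-tunes a vision-language model with a small amount of human supervision to generate interpretable descriptions of driver attention shifts, complementing conventional gaze heatmaps and supporting human-factors analysis, driver monitoring, and situation awareness.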

Abstract

Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision-language model toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention dataset, we compare zero-shot, one-shot, and LoRA fine-tuned VLM conditions against human-refined reference descriptions and expert ratings. Results show that fine-tuning with 80 expert-refined attention examples improves ROUGE-L, METEOR, Entity Alignment F1, and Human Alignment Score relative to unsteered VLM outputs. The findings suggest that language-based descriptions can complement gaze heatmaps by making driver attention more accessible for human-factors analysis, driver-monitoring review, and situation-awareness support.

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV (updated version)

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

Affiliations: School of Computing Science, Simon Fraser University

AI summary: Presents the CoMPAS3D dataset and evaluation framework, which use objective metrics such as move legibility and proficiency appropriateness to address the lack of social-context evaluation in interactive motion generation.

Comments https://rosielab.github.io/compas3d

Abstract

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2506.04367 2026-06-03 cs.CV (updated version)

Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan

Affiliations: Systems and Software Lab (SSL), Department of CSE, Islamic University of Technology (IUT)

AI summary: This study fine-tunes three video transformer models (VideoMAE, ViViT, and TimeSformer) and achieves high-accuracy Bangla Sign Language recognition on the BdSLW60 and BdSLW401 datasets, with VideoMAE reaching 95.5% accuracy on the frame-rate-corrected BdSLW60.

Comments 16 pages, 8 figures, 6 tables

DOI: 10.1371/journal.pone.0341909
Journal ref: PLOS ONE, Vol. 21, No. 5, e0341909, 2026

Abstract

Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.
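The evaluation protocol above uses 10-fold stratified cross-validation, which keeps the class mix roughly equal in every fold. The sketch below shows one simple way to build such folds by dealing the samples of each class round-robin. The function name and the toy labels are assumptions, and unlike typical library implementations it does not shuffle, so it is not the authors' exact split.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class proportions balanced."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal the samples of one class round-robin
    return folds


# Toy training set: 20 clips of one sign and 10 clips of another (hypothetical labels).
labels = ["hello"] * 20 + ["thanks"] * 10
folds = stratified_folds(labels, k=10)

for fold_id, val_idx in enumerate(folds):
    train_idx = [i for f in folds if f is not val_idx for i in f]
    print(fold_id, len(train_idx), len(val_idx))
```

For a signer-independent test such as the one described with users U4 and U8, the clips from the held-out signers would be removed before the folds are built, so they never appear in any training or validation fold.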

2505.08886 2026-06-03 cs.CV cs.LG (updated version)

Optimizing Neuro-Fuzzy and Colonial Competition Algorithms for Skin Cancer Diagnosis in Dermatoscopic Images

Hamideh Khaleghpour, Brett McKinney

AI summary: This study combines image processing with neuro-fuzzy and colonial competition algorithms, achieving 94% accuracy on 560 dermoscopic images from the ISIC database, with the aim of supporting early clinical detection of melanoma.

Comments 7 pages, 10 figures. Accepted at the 2nd Asia Pacific Computer Systems Conference (APCS 2024), March 15-17, 2024

Journal ref: Proceedings of the 2024 7th International Conference on Information and Computer Technologies, pages 166-172, IEEE, March 2024

Abstract

The rising incidence of skin cancer, coupled with limited public awareness and a shortfall in clinical expertise, underscores an urgent need for advanced diagnostic aids. Artificial Intelligence (AI) has emerged as a promising tool in this domain, particularly for distinguishing malignant from benign skin lesions. Leveraging publicly available datasets of skin lesions, researchers have been developing AI-based diagnostic solutions. However, the integration of such computer systems in clinical settings is still nascent. This study aims to bridge this gap by employing a fusion of image processing techniques and machine learning algorithms, specifically neuro-fuzzy and colonial competition approaches. Applied to dermoscopic images from the ISIC database, our method achieved a notable accuracy of 94% on a dataset of 560 images. These results underscore the potential of our approach in aiding clinicians in the early detection of melanoma, thereby contributing significantly to skin cancer diagnostics.

2406.18544 2026-06-03 cs.CV cs.GR 版本更新

GS-ROR$^2$: Bidirectional-guided 3DGS and SDF for Reflective Object Relighting and Reconstruction

GS-ROR$^2$: 双向引导的3DGS和SDF用于反射物体重光照与重建

Zuo-Liang Zhu, Beibei Wang, Jian Yang

发表机构 * VCIP, College of Computer Science, Nankai University（VCIP，计算机科学学院，南开大学）； School of Intelligence Science and Technology, Nanjing University（智能科学与技术学校，南京大学）

AI总结提出一种双向引导框架，通过SDF辅助的高斯溅射优化重光照模型，并利用GS引导的SDF增强实现高质量几何重建，解决反射物体重光照与重建中的几何约束和细节捕捉问题。

Comments Accepted by ACM TOG

DOI: 10.1145/3759248

AI中文摘要

3D高斯溅射(3DGS)因其细致的表达能力和高效的渲染速度，在新视角合成方面展现出强大能力。然而，使用3DGS创建可重光照的3D资产并重建忠实几何仍然存在问题，特别是对于反射物体，其不连续表示给几何约束带来困难。体积符号距离场(SDF)方法提供了鲁棒的几何重建，但昂贵的射线步进阻碍了其实时应用并减慢了训练速度。此外，这些方法难以捕捉尖锐的几何细节。为此，我们提出以互补方式双向引导3DGS和SDF，包括SDF辅助的高斯溅射用于重光照模型的高效优化，以及GS引导的SDF增强用于高质量几何重建。SDF辅助高斯溅射的核心是混合高斯与SDF之间的深度和法线相互监督，避免了SDF昂贵的体积渲染。得益于这种相互监督，学习到的混合高斯以最小的时间成本得到良好约束。由于高斯以延迟着色模式渲染，alpha混合的高斯是平滑的，但单个高斯可能仍然是异常值，产生漂浮伪影。因此，我们引入SDF感知的剪枝策略，移除位于SDF定义表面远处的高斯异常值，避免漂浮问题。这样，我们的GS框架提供了合理的法线并实现了逼真的重光照，但来自深度的网格仍然存在问题。因此，我们设计了GS引导的SDF细化，利用来自高斯的混合法线微调SDF。通过这种增强，我们的方法可以以额外17%的训练时间为代价，为反射物体提供高质量的网格。

英文摘要

3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets and reconstructing faithful geometry with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Volumetric signed distance field (SDF) methods provide robust geometry reconstruction, but their expensive ray marching hinders real-time application and slows training. Besides, these methods struggle to capture sharp geometric details. To this end, we propose to guide 3DGS and SDF bidirectionally in a complementary manner, including an SDF-aided Gaussian splatting for efficient optimization of the relighting model and a GS-guided SDF enhancement for high-quality geometry reconstruction. At the core of our SDF-aided Gaussian splatting is the mutual supervision of the depth and normal between blended Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned blended Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, the alpha-blended Gaussians are smooth, while individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we introduce an SDF-aware pruning strategy to remove Gaussian outliers located far from the surface defined by the SDF, avoiding the floater issue. This way, our GS framework provides reasonable normals and achieves realistic relighting, while the mesh from depth is still problematic. Therefore, we design a GS-guided SDF refinement, which utilizes the blended normals from the Gaussians to finetune the SDF. With this enhancement, our method can further provide high-quality meshes for reflective objects at the cost of 17% extra training time.

2412.01282 2026-06-03 cs.CV cs.AI 版本更新

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Align-KD：为移动视觉语言模型增强提取跨模态对齐知识

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Affiliations: State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China; Huawei Noah’s Ark Lab, China

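AI summary: Proposes Align-KD, which distills the teacher model's shallow-layer cross-modal alignment knowledge to guide a 1.7B student model in learning vision-text matching, improving the average score across 6 benchmarks by 2.0.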

Comments CVPR 2025 Paper

AI中文摘要

视觉语言模型（VLM）为多模态任务带来了强大的理解和推理能力。同时，移动设备对强大人工智能的需求也日益增长，例如AI助手软件。一些工作试图将VLM迁移到边缘设备以扩展其应用范围。简化模型结构是一种常见方法，但随着模型缩小，性能与大小之间的权衡变得越来越困难。知识蒸馏（KD）可以帮助模型在不增加大小或数据量的情况下提升综合能力。然而，现有的大模型蒸馏技术大多只考虑单模态LLM的应用，或者仅使用教师为学生创建新的数据环境。这些方法都没有考虑VLM中最重要的跨模态对齐知识的蒸馏。我们提出了一种名为Align-KD的方法，引导学生模型学习发生在浅层的跨模态匹配。教师还帮助学生基于文本的关注点学习将视觉标记投影到文本嵌入空间。在Align-KD的指导下，1.7B的MobileVLM V2模型能够从7B教师模型中学习丰富的知识，且训练损失设计轻量，在两个训练子集上分别在6个基准上平均得分提升2.0。代码地址：https://github.com/fqhank/Align-KD。

英文摘要

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, there is a growing need for capable artificial intelligence on mobile devices, for example in AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most of the existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods takes into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training-loss design, and achieves an average score improvement of 2.0 across 6 benchmarks under each of two training subsets. Code is available at: https://github.com/fqhank/Align-KD.

2411.15851 2026-06-03 cs.CV 版本更新

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

ResCLIP: 用于无训练密集视觉-语言推理的残差注意力

Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

Affiliations: University of Electronic Science and Technology of China

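AI summary: Proposes a Residual Cross-correlation Self-attention module and a Semantic Feedback Refinement module that use cross-correlation attention from intermediate layers to reorganize spatial information, improving CLIP's performance on dense prediction tasks.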

DOI: 10.1109/CVPR52734.2025.02789
Journal ref: Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 29968-29978

AI中文摘要

尽管像CLIP这样的视觉-语言模型在开放词汇任务中取得了显著成功，但其应用目前局限于图像级任务，在密集预测方面仍存在困难。最近的研究通常将这种密集预测的不足归因于最终块中的自注意力层，并通过将原始的查询-键注意力修改为自相关注意力（例如查询-查询和键-键注意力）取得了可观的成果。然而，这些方法忽略了捕捉丰富空间对应关系的交叉相关注意力（查询-键）特性。在本文中，我们揭示了CLIP非最终层中自注意力的交叉相关性也表现出定位特性。因此，我们提出了残差交叉相关自注意力（RCS）模块，该模块利用中间层的交叉相关自注意力来重塑最终块中的注意力。RCS模块有效重组了空间信息，释放了CLIP在密集视觉-语言推理中的定位潜力。此外，为了增强对相同类别区域的关注和局部一致性，我们提出了语义反馈精炼（SFR）模块，该模块利用语义分割图进一步调整注意力分数。通过整合这两种策略，我们的方法（称为ResCLIP）可以轻松作为即插即用模块集成到现有方法中，显著提升其在密集视觉-语言推理中的性能。在多个标准基准上的大量实验表明，我们的方法超越了最先进的无训练方法，验证了所提方法的有效性。代码可在 https://github.com/yvhangyang/ResCLIP 获取。

英文摘要

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

加权风险不变性：不变特征偏移下的领域泛化

Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

Affiliations: Johns Hopkins University; University of Maryland, College Park; New York University; Center for Data Science

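AI summary: To address the poor performance of existing invariant learning methods under invariant covariate shift, proposes the weighted risk invariance (WRI) framework, which imposes invariance of the loss across environments with reweighted training examples, comes with a theoretical guarantee of learning invariant models, and outperforms previous methods in experiments.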

Journal ref: TMLR 2024

AI中文摘要

学习预测在多个环境下不变的模型是一种有前景的分布外泛化方法。这类模型被训练来提取特征 $X_{\text{inv}}$，其中给定提取特征的条件分布 $Y \mid X_{\text{inv}}$ 在不同环境下不发生变化。不变模型还应能泛化到提取特征 $X_{\text{inv}}$ 的边缘分布 $p(X_{\text{inv}})$ 的偏移，这种偏移称为 $\textit{不变协变量偏移}$。然而，我们表明，现有学习不变模型的方法在不变协变量偏移下表现不佳，要么无法学习到不变模型——即使对于从简单且经过充分研究的线性-高斯模型生成的数据也是如此——要么有限样本性能较差。为了解决这些问题，我们提出 $\textit{加权风险不变性}$（WRI）。我们的框架基于对训练样本进行适当加权，强制要求损失在不同环境下保持不变。我们证明，在线性-高斯设置下，WRI 可证明地学习到不变模型，即丢弃虚假相关性。我们提出了一种实用算法，通过同时学习密度 $p(X_{\text{inv}})$ 和模型参数来实现 WRI，并且实验表明，在不变协变量偏移下，WRI 优于先前的不变学习方法。

英文摘要

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an $\textit{invariant covariate shift}$. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models$\unicode{x2014}$even for data generated from simple and well-studied linear-Gaussian models$\unicode{x2014}$or having poor finite-sample performance. To alleviate these problems, we propose $\textit{weighted risk invariance}$ (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.

2407.05312 2026-06-03 cs.CV 版本更新

An Improved Method for Personalizing Diffusion Models

一种改进的扩散模型个性化方法

Yan Zeng, Masanori Suganuma, Takayuki Okatani

Affiliations: Graduate School of Information Sciences, Tohoku University; RIKEN Center for AIP

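AI summary: Proposes a diffusion-model personalization method that preserves the model's original knowledge while integrating new information, with shorter training time and better results than DreamBooth and Textual Inversion.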

1007.3881 2026-06-03 cs.CV cs.NA math.NA 版本更新

Orthogonal multifilters image processing of astronomical images from scanned photographic plates

扫描照相底片天文图像的正交多滤波器处理

Vasil Kolev

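AI summary: Constructs new orthogonal multifilters based on the Haar and Daubechies orthogonal wavelets for multiscale analysis of astronomical images, and applies them to decompose astronomical images from scanned photographic plates.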

Comments 6 pages, The ACM proceedings of CompSysTech 2010

1105.1302 2026-06-03 q-bio.QM cs.CV cs.NA math.NA 版本更新

A Modified Cross Correlation Algorithm for Reference-free Image Alignment of Non-Circular Projections in Single-Particle Electron Microscopy

一种改进的互相关算法用于单颗粒电子显微镜中非圆形投影的无参考图像对齐

Wooram Park, Gregory S. Chirikjian

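AI summary: For aligning images of highly non-spherical structures in single-particle electron microscopy, proposes a modified cross-correlation method that combines coarse alignment and a noise-statistics-based reduction of the search space with artificially blurred images and segmentation of intermediate class averages, outperforming classical cross-correlation and maximum-likelihood methods at low signal-to-noise ratios.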

Comments 29pages

DOI: 10.1016/j.bpj.2010.12.1961

AI中文摘要

本文提出了一种改进的互相关方法，用于对齐单颗粒电子显微镜中高度非球形结构的同一类图像。在该新方法中，首先对投影图像进行粗对齐，然后使用互相关（CC）方法重新对齐所得图像。粗对齐通过匹配图像的质心和主轴实现。基于加性背景噪声的统计特性，可以量化粗对齐中的未对准分布。因此，互相关方法中重新对齐的搜索空间可以缩小以实现更好的对齐。为了克服互相关函数中虚假峰值相关的问题，我们在迭代互相关方法的早期阶段使用人工模糊图像，并从每次迭代步骤中分割中间类平均。这两种额外的操作与互相关方法中缩小的搜索空间相结合，对于低信噪比图像，比经典互相关和最大似然（ML）方法产生更好的对齐效果。

英文摘要

In this paper we propose a modified cross correlation method to align images from the same class in single-particle electron microscopy of highly non-spherical structures. In this new method, we first coarsely align projection images, and then re-align the resulting images using the cross correlation (CC) method. The coarse alignment is obtained by matching the centers of mass and the principal axes of the images. The distribution of misalignment in this coarse alignment can be quantified based on the statistical properties of the additive background noise. As a consequence, the search space for re-alignment in the cross correlation method can be reduced to achieve better alignment. In order to overcome problems associated with false peaks in the cross correlation function, we use artificially blurred images for the early stage of the iterative cross correlation method and segment the intermediate class average from every iteration step. These two additional manipulations combined with the reduced search space size in the cross correlation method yield better alignments for low signal-to-noise ratio images than both classical cross correlation and maximum likelihood (ML) methods.
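
The coarse-alignment step described in this abstract (matching centers of mass and principal axes) can be illustrated with image moments. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `centroid_and_principal_angle` is hypothetical, the image is a plain list of rows, and the orientation is taken from second-order central moments, which is a standard choice but is not confirmed by the abstract.

```python
import math


def centroid_and_principal_angle(image):
    """Return ((cx, cy), angle) from intensity-weighted image moments.

    `image` is a list of rows of non-negative pixel values; x indexes
    columns and y indexes rows. The function name is hypothetical.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            total += value
            sum_x += x * value
            sum_y += y * value
    if total == 0:
        raise ValueError("image has no mass")
    cx, cy = sum_x / total, sum_y / total

    # Second-order central moments describe the spread around the centroid.
    mu20 = mu02 = mu11 = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            dx, dy = x - cx, y - cy
            mu20 += value * dx * dx
            mu02 += value * dy * dy
            mu11 += value * dx * dy

    # Orientation of the principal axis (defined only up to 180 degrees).
    angle = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return (cx, cy), angle


# A 5x5 image whose mass lies on the main diagonal.
diagonal = [[1.0 if x == y else 0.0 for x in range(5)] for y in range(5)]
center, angle = centroid_and_principal_angle(diagonal)
```

Translating each image so its centroid sits at the origin and rotating it by the negative of `angle` would give a coarse alignment. The 180-degree ambiguity of the principal axis is one reason a refinement step such as cross correlation is still needed afterwards.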

0710.0736 2026-06-03 cs.CV cs.NA math.NA 版本更新

Colour image segmentation by the vector-valued Allen-Cahn phase-field model: a multigrid solution

基于向量值Allen-Cahn相场模型的彩色图像分割：多重网格解法

David A Kay, Alessandro Tomasi

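AI summary: Proposes a PDE model for colour image segmentation that combines the vector-valued Allen-Cahn phase-field equation with an initial-data fitting term, and uses a multigrid finite element method for efficient, robust segmentation.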

Comments 17 pages, 9 figures

1003.2022 2026-06-03 cs.CV cs.CE cs.IT cs.NA math.IT math.NA 版本更新

Fast space-variant elliptical filtering using box splines

使用盒样条进行快速空间变椭圆滤波

Kunal Narayan Chaudhury, Arrate Munoz-Barrutia, Michael Unser

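AI summary: Proposes a method based on radially-uniform box splines that uses pre-integration and local finite differences to perform space-variant Gaussian-like elliptical filtering with a fixed amount of computation per pixel, with continuous control of size, elongation, and orientation.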

Comments 12 figures; IEEE Transactions on Image Processing, vol. 19, 2010

DOI: 10.1109/TIP.2010.2046953
Journal ref: IEEE Transactions on Image Processing, vol. 19(9), pp. 2290 - 2306, 2010

AI中文摘要

线性空间变（非卷积）滤波器的高效实现是图像处理中一个具有挑战性的计算问题。在本文中，我们证明可以使用每像素固定数量的计算来对图像进行具有变化大小、伸长和方向的高斯型椭圆窗口滤波。相关算法基于一族光滑紧支撑分段多项式——径向均匀盒样条，通过预积分和局部有限差分实现。径向均匀盒样条是通过重复卷积固定数量的盒分布构造的，这些盒分布经过适当缩放并以均匀方式径向分布。这些盒样条的吸引人特性包括其渐近行为、简单的协方差结构以及准可分离性。随着阶数的增加，它们收敛到高斯函数，并可通过控制组成盒分布的尺度来近似具有不同协方差的各向异性高斯函数。基于第二个特性，我们开发了一种连续控制这些高斯型函数大小、伸长和方向的技术。最后，利用准可分离结构以及盒分布的某种缩放性质，高效实现了相关的空间变椭圆滤波，该滤波每像素需要O(1)次计算，与滤波器的形状和大小无关。

英文摘要

The efficient realization of linear space-variant (non-convolution) filters is a challenging computational problem in image processing. In this paper, we demonstrate that it is possible to filter an image with a Gaussian-like elliptic window of varying size, elongation and orientation using a fixed number of computations per pixel. The associated algorithm, which is based on a family of smooth compactly supported piecewise polynomials, the radially-uniform box splines, is realized using pre-integration and local finite-differences. The radially-uniform box splines are constructed through the repeated convolution of a fixed number of box distributions, which have been suitably scaled and distributed radially in a uniform fashion. The attractive features of these box splines are their asymptotic behavior, their simple covariance structure, and their quasi-separability. They converge to Gaussians with the increase of their order, and are used to approximate anisotropic Gaussians of varying covariance simply by controlling the scales of the constituent box distributions. Based on the second feature, we develop a technique for continuously controlling the size, elongation and orientation of these Gaussian-like functions. Finally, the quasi-separable structure, along with a certain scaling property of box distributions, is used to efficiently realize the associated space-variant elliptical filtering, which requires O(1) computations per pixel irrespective of the shape and size of the filter.
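
The idea of pre-integration followed by local finite differences can be seen in its simplest form with a one-dimensional box filter whose width changes from sample to sample. The sketch below is a simplified illustration and not the radially-uniform box-spline algorithm of the paper; the names `prefix_sums` and `space_variant_box_filter` are hypothetical.

```python
def prefix_sums(signal):
    """Pre-integration: sums[k] holds the sum of signal[0:k]."""
    sums = [0.0]
    for value in signal:
        sums.append(sums[-1] + value)
    return sums


def space_variant_box_filter(signal, radii):
    """Average each sample over its own window of radius radii[i].

    After one pass of pre-integration, every output needs a single
    finite difference of the running sum, so the cost per sample does
    not depend on the window size.
    """
    sums = prefix_sums(signal)
    n = len(signal)
    output = []
    for i, radius in enumerate(radii):
        lo = max(0, i - radius)          # clamp the window at the borders
        hi = min(n, i + radius + 1)
        output.append((sums[hi] - sums[lo]) / (hi - lo))
    return output


smoothed = space_variant_box_filter([0.0, 0.0, 6.0, 0.0, 0.0], [1, 1, 1, 2, 0])
```

Repeating such box filters along several directions is, roughly, how box splines arise. The paper's contribution is to do this so that the result approximates an anisotropic Gaussian with controllable size, elongation and orientation.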

1203.2995 2026-06-03 eess.SY cs.CV cs.SY 版本更新

Marginal multi-Bernoulli filters: RFS derivation of MHT, JIPDA and association-based MeMBer

边缘多伯努利滤波器：MHT、JIPDA和基于关联的MeMBer的RFS推导

Jason L. Williams

AI总结本文通过随机有限集推导全贝叶斯RFS滤波器，揭示数据关联隐式存在，并通过近似关联分布得到与JIPDA和MeMBer相关的两种算法，在复杂环境下提升性能。

Comments Journal version at http://ieeexplore.ieee.org/document/7272821. Matlab code of simple implementation included with ancillary files

1302.6105 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Image restoration using sparse approximations of spatially varying blur operators in the wavelet domain

利用小波域中空间变化模糊算子的稀疏近似进行图像恢复

Paul Escande, Pierre Weiss, Francois Malgouyres

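AI summary: For restoring images degraded by spatially varying blur, proposes approximating the blur operator by sparse matrices in the wavelet domain, justifies this mathematically, validates the approximation quality numerically, and notes that the sparsity pattern can be predefined, which suits tasks such as blind deconvolution.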

Comments 6 pages

1210.6649 2026-06-03 astro-ph.IM cs.CV cs.NA math.NA 版本更新

Extended object reconstruction in adaptive-optics imaging: the multiresolution approach

自适应光学成像中的扩展目标重建：多分辨率方法

Roberto Baena Gallé, Jorge Núñez, Szymon Gladysz

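AI summary: Proposes using multiresolution transforms such as wavelets and curvelets to reconstruct images of extended objects acquired with adaptive optics systems; on the images tested, multichannel deconvolution with a static PSF generally outperforms conventional blind/myopic deconvolution.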

Comments In revision in Astronomy & Astrophysics. 19 pages, 13 figures

DOI: 10.1051/0004-6361/201219489

AI中文摘要

我们提出将多分辨率变换（如小波变换（WT）和曲波变换（CT））应用于自适应光学（AO）系统获取的扩展目标图像重建。这种多通道方法通常利用概率工具来区分显著结构与噪声和重建残差。此外，我们旨在检验历史假设：使用静态PSF的图像重建算法不适用于AO成像。我们将哈勃太空望远镜（HST）拍摄的土星图像与帕洛马天文台5米海尔望远镜的AO PSF进行卷积，并添加散粒噪声和读出噪声。随后，我们对模糊和噪声数据应用不同方法以恢复原始目标。这些方法包括多帧盲反卷积（使用IDAC算法）、带正则化的近视反卷积（使用MISTRAL）以及基于小波或曲波的静态PSF反卷积（AWMLE和ACMLE算法）。我们使用均方误差（MSE）和结构相似性指数（SSIM）来比较结果。我们讨论了这两种指标的优缺点。我们发现，根据MSE和SSIM的测量，CT比WT产生更好的结果。使用静态PSF的多通道反卷积产生的结果通常优于近视/盲方法（对于我们测试的图像），这表明方法抑制噪声和跟踪底层迭代过程的能力与近视/盲方法更新PSF的能力同样关键。

英文摘要

We propose the application of multiresolution transforms, such as wavelets (WT) and curvelets (CT), to the reconstruction of images of extended objects that have been acquired with adaptive optics (AO) systems. Such multichannel approaches normally make use of probabilistic tools in order to distinguish significant structures from noise and reconstruction residuals. Furthermore, we aim to check the historical assumption that image-reconstruction algorithms using static PSFs are not suitable for AO imaging. We convolve an image of Saturn taken with the Hubble Space Telescope (HST) with AO PSFs from the 5-m Hale telescope at the Palomar Observatory and add both shot and readout noise. Subsequently, we apply different approaches to the blurred and noisy data in order to recover the original object. The approaches include multi-frame blind deconvolution (with the algorithm IDAC), myopic deconvolution with regularization (with MISTRAL) and wavelets- or curvelets-based static PSF deconvolution (AWMLE and ACMLE algorithms). We used the mean squared error (MSE) and the structural similarity index (SSIM) to compare the results. We discuss the strengths and weaknesses of the two metrics. We found that CT produces better results than WT, as measured in terms of MSE and SSIM. Multichannel deconvolution with a static PSF produces results which are generally better than the results obtained with the myopic/blind approaches (for the images we tested) thus showing that the ability of a method to suppress the noise and to track the underlying iterative process is just as critical as the capability of the myopic/blind approaches to update the PSF.

1210.3098 2026-06-03 math.NA cs.CV cs.IT cs.NA math.IT 版本更新

Near-optimal compressed sensing guarantees for total variation minimization

全变差最小化的近最优压缩感知保证

Deanna Needell, Rachel Ward

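AI summary: For compressed-sensing reconstruction of multidimensional signals, proves that a signal can be reconstructed from O(sd*log(N^d)) linear measurements via total variation minimization with error proportional to the best s-term approximation of its gradient, and that this guarantee is optimal up to polynomial factors in the spatial dimension d.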

DOI: 10.1109/TIP.2013.2264681

AI中文摘要

考虑压缩感知设置中从欠定测量集重建多维信号的问题。没有任何额外假设，该问题是不适定的。然而，对于自然图像或电影等信号，与测量一致的最小全变差估计通常能产生对潜在信号的良好近似，即使测量数量远小于环境维度。本文将二维图像的最新重建保证推广到任意维度 d>1 的信号和各向同性全变差问题。具体来说，我们证明多维信号 x 可以从 O(sd*log(N^d)) 个线性测量中通过全变差最小化重建，重建误差在其梯度最佳 s 项近似的因子内。我们提供的重建保证在空间维度 d 的多项式因子内必然是最优的。

英文摘要

Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing. Without any additional assumptions, this problem is ill-posed. However, for signals such as natural images or movies, the minimal total variation estimate consistent with the measurements often produces a good approximation to the underlying signal, even if the number of measurements is far smaller than the ambient dimensionality. This paper extends recent reconstruction guarantees for two-dimensional images to signals of arbitrary dimension d>1 and to isotropic total variation problems. To be precise, we show that a multidimensional signal x can be reconstructed from O(sd*log(N^d)) linear measurements using total variation minimization to within a factor of the best s-term approximation of its gradient. The reconstruction guarantees we provide are necessarily optimal up to polynomial factors in the spatial dimension d.
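
To make the distinction between isotropic and anisotropic total variation concrete, here is a small sketch that evaluates both for a two-dimensional image. It assumes forward differences with a zero difference at the far border, which is one common discretization and not necessarily the one used in the paper; the function name `total_variation` is hypothetical.

```python
import math


def total_variation(image, isotropic=True):
    """Discrete total variation of a 2D image given as a list of rows.

    Uses forward differences; differences that would reach past the
    last row or column are taken to be zero (an assumed convention).
    """
    rows, cols = len(image), len(image[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            dx = image[i][j + 1] - image[i][j] if j + 1 < cols else 0.0
            dy = image[i + 1][j] - image[i][j] if i + 1 < rows else 0.0
            if isotropic:
                tv += math.hypot(dx, dy)      # Euclidean norm of the gradient
            else:
                tv += abs(dx) + abs(dy)       # l1 norm of the gradient
    return tv


corner = [[0.0, 1.0],
          [1.0, 1.0]]
tv_iso = total_variation(corner, isotropic=True)
tv_aniso = total_variation(corner, isotropic=False)
```

The two values differ only at pixels where both differences are non-zero, as at the top-left pixel of `corner`. Total variation minimization then looks for the image with the smallest such value among those consistent with the measurements.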

1110.3649 2026-06-03 math.NA cs.CV cs.GR cs.NA 版本更新

Algorithms to automatically quantify the geometric similarity of anatomical surfaces

自动量化解剖表面几何相似性的算法

D. Boyer, Y. Lipman, E. St. Clair, J. Puente, T. Funkhouser, B. Patel, J. Jernvall, I. Daubechies

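AI summary: Proposes polynomial algorithms that use local structures and global geometric relationships to automatically compute distances and correspondences between two-dimensional surfaces without manual landmarks, enabling efficient comparison of large collections of digitized surfaces.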

Comments Changes with respect to v1, v2: an Erratum was added, correcting the references for one of the three datasets. Note that the datasets and code for this paper can be obtained from the Data Conservancy (see Download column on v1, v2)

DOI: 10.1073/pnas.1112822108
Journal ref: PNAS 2011 108 (45) 18221-18226

AI中文摘要

我们描述了用于计算（嵌入三维空间的）二维表面对之间距离的新方法，这些方法利用局部结构以及结构间几何关系所包含的全局信息。我们提出了自动确定这些距离以及几何对应关系的算法。这一研究源于自然科学学生对理解统一生命多样性的形态连续性的追求。目前，科学家利用物理特征研究现存和灭绝动物之间的进化关系时，分析的是从精心定义的解剖对应点（地标）中提取的数据。识别和记录这些地标耗时且只能由训练有素的形态学家准确完成。这使得非形态学家无法进行这些研究，并导致表型组学在阐明进化模式方面落后于基因组学。与已提出的其他形态对应算法不同，我们的方法不需要用户预先标记任何特殊特征或地标。它也与计算几何中的其他开创性工作不同，因为我们的算法本质上是多项式的，因此更快，使得对大量数字化表面进行成对比较成为可能。我们使用代表灵长类和人类牙齿及不同骨骼的三个数据集展示了我们的方法，并表明它能产生高度准确的结果。

英文摘要

We describe new approaches for distances between pairs of 2-dimensional surfaces (embedded in 3-dimensional space) that use local structures and global information contained in inter-structure geometric relationships. We present algorithms to automatically determine these distances as well as geometric correspondences. This is motivated by the aspiration of students of natural science to understand the continuity of form that unites the diversity of life. At present, scientists using physical traits to study evolutionary relationships among living and extinct animals analyze data extracted from carefully defined anatomical correspondence points (landmarks). Identifying and recording these landmarks is time consuming and can be done accurately only by trained morphologists. This renders these studies inaccessible to non-morphologists, and causes phenomics to lag behind genomics in elucidating evolutionary patterns. Unlike other algorithms presented for morphological correspondences our approach does not require any preliminary marking of special features or landmarks by the user. It also differs from other seminal work in computational geometry in that our algorithms are polynomial in nature and thus faster, making pairwise comparisons feasible for significantly larger numbers of digitized surfaces. We illustrate our approach using three datasets representing teeth and different bones of primates and humans, and show that it leads to highly accurate results.

1203.2992 2026-06-03 eess.SY cs.CV cs.SY 版本更新

Hybrid Poisson and multi-Bernoulli filters

混合泊松和多伯努利滤波器

Jason L. Williams

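AI summary: Proposes a hybrid of the probability hypothesis density and multi-target multi-Bernoulli filters that maintains a Poisson component for undetected targets and recycles Bernoulli components with low probability of existence, achieving fast track initiation while reducing the number of Bernoulli components.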

Comments Submitted to 15th International Conference on Information Fusion (2012)

1202.6429 2026-06-03 cs.CV cs.IT cs.NA math.IT math.NA 版本更新

Stable image reconstruction using total variation minimization

利用全变差最小化的稳定图像重建

Deanna Needell, Rachel Ward

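AI summary: Uses total variation minimization to achieve accurate, robust image reconstruction from undersampled noisy measurements, with near-optimal guarantees.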

Comments 25 pages

1109.0217 2026-06-03 math.NA cs.CV cs.NA 版本更新

Vessel Segmentation in Medical Imaging Using a Tight-Frame Based Algorithm

基于紧框架算法的医学图像血管分割

Xiaohao Cai, Raymond Chan, Serena Morigi, Fiorella Sgallari

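AI summary: Proposes a tight-frame-based iterative algorithm for automatic segmentation of tube-like structures such as blood vessels in magnetic resonance angiography images; it denoises, smooths, and sharpens the boundary region, converges within a few iterations, and outperforms existing PDE and variational methods.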

AI中文摘要

紧框架作为正交小波的推广，已成功应用于图像处理中的多种问题，包括修复、脉冲噪声去除、超分辨率图像恢复等。分割是识别图像中物体轮廓的过程。目前存在多种基于变分方法和偏微分方程（PDE）建模的高效分割算法。本文提出应用紧框架方法自动识别磁共振血管造影（MRA）图像中的管状结构（如血管）。我们的方法迭代地细化一个包围血管可能边界或表面的区域。在每次迭代中，我们应用紧框架算法对可能边界进行去噪和平滑，并锐化该区域。我们证明了算法的收敛性。在真实2D/3D MRA图像上的数值实验表明，我们的方法非常高效，通常在几次迭代内收敛，并且由于能够提取图像中更多的管状目标和精细细节，优于现有的PDE和变分方法。

英文摘要

Tight-frame, a generalization of orthogonal wavelets, has been used successfully in various problems in image processing, including inpainting, impulse noise removal, super-resolution image restoration, etc. Segmentation is the process of identifying object outlines within images. There are quite a few efficient algorithms for segmentation that depend on the variational approach and the partial differential equation (PDE) modeling. In this paper, we propose to apply the tight-frame approach to automatically identify tube-like structures such as blood vessels in Magnetic Resonance Angiography (MRA) images. Our method iteratively refines a region that encloses the possible boundary or surface of the vessels. In each iteration, we apply the tight-frame algorithm to denoise and smooth the possible boundary and sharpen the region. We prove the convergence of our algorithm. Numerical experiments on real 2D/3D MRA images demonstrate that our method is very efficient with convergence usually within a few iterations, and it outperforms existing PDE and variational methods as it can extract more tubular objects and fine details in the images.

1101.4373 2026-06-03 stat.AP cs.CV cs.SY eess.SY math.OC stat.CO 版本更新

Statistical Multiresolution Dantzig Estimation in Imaging: Fundamental Concepts and Algorithmic Framework

成像中的统计多分辨率Dantzig估计：基本概念与算法框架

Klaus Frick, Philipp Marnitz, Axel Munk

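AI summary: For function estimation in the "signal + noise" model, proposes a class of statistical multiresolution estimators and develops a computational framework based on the alternating direction method of multipliers and Dykstra's algorithm, demonstrated on imaging and signal detection examples.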

DOI: 10.1214/12-EJS671
Journal ref: Electron. J. Stat. 6 (2012) 231-268

AI中文摘要

本文关注于“信号+噪声”模型中函数的全自动和局部自适应估计，其中回归函数可能进一步被线性算子（例如卷积）模糊。为此，我们引入了一类通用的统计多分辨率估计器，并开发了用于计算这些估计器的算法框架。这意味着估计器被定义为具有上确界型约束的凸优化问题的解。我们结合了交替方向乘子法和Dykstra算法来计算凸集交集上的正交投影，并证明了数值收敛性。通过成像和信号检测的各种示例，展示了所提出方法的能力。

英文摘要

In this paper we are concerned with fully automatic and locally adaptive estimation of functions in a "signal + noise"-model where the regression function may additionally be blurred by a linear operator, e.g. by a convolution. To this end, we introduce a general class of statistical multiresolution estimators and develop an algorithmic framework for computing those. By this we mean estimators that are defined as solutions of convex optimization problems with supremum-type constraints. We employ a combination of the alternating direction method of multipliers with Dykstra's algorithm for computing orthogonal projections onto intersections of convex sets and prove numerical convergence. The capability of the proposed method is illustrated by various examples from imaging and signal detection.
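
Dykstra's algorithm, mentioned above as the tool for projecting onto intersections of convex sets, can be sketched for two simple sets. This is only an illustration of the projection step: the sets (a unit box and a half-space), the function names, and the fixed iteration count are assumptions chosen for the example, and the combination with the alternating direction method of multipliers used in the paper is not shown.

```python
def project_box(p):
    """Orthogonal projection onto the unit box [0, 1]^n."""
    return [min(1.0, max(0.0, c)) for c in p]


def project_halfspace(p):
    """Orthogonal projection onto the half-space {x : sum(x) <= 1}."""
    excess = sum(p) - 1.0
    if excess <= 0.0:
        return list(p)
    shift = excess / len(p)
    return [c - shift for c in p]


def dykstra(point, project_a, project_b, iterations=100):
    """Approximate the projection of `point` onto the intersection of A and B.

    Unlike plain alternating projections, each step adds back a
    correction term (p for set A, q for set B) before projecting.
    """
    x = list(point)
    p = [0.0] * len(x)
    q = [0.0] * len(x)
    for _ in range(iterations):
        shifted = [xi + pi for xi, pi in zip(x, p)]
        y = project_a(shifted)
        p = [si - yi for si, yi in zip(shifted, y)]
        shifted = [yi + qi for yi, qi in zip(y, q)]
        x = project_b(shifted)
        q = [si - xi for si, xi in zip(shifted, x)]
    return x


start = [2.0, 0.5]
nearest = dykstra(start, project_box, project_halfspace)
naive = project_halfspace(project_box(start))
```

For this starting point the closest point of the intersection is the vertex (1, 0), which `nearest` approaches. Plain alternating projections stop at `naive`, which is feasible but is not the closest feasible point, and that is why the correction terms matter.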

1112.3010 2026-06-03 cs.CV cs.NA math.NA 版本更新

A new variational principle for the Euclidean distance function: Linear approach to the non-linear eikonal problem

欧几里得距离函数的新变分原理：非线性程函问题的线性方法

Karthik S. Gurumoorthy, Anand Rangarajan

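AI summary: Proposes a fast convolution-based algorithm that approximates the Euclidean distance function by solving a linear differential equation and taking the negative logarithm, implemented efficiently with the fast Fourier transform and avoiding direct solution of the non-linear Hamilton-Jacobi equation as in conventional methods.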

AI中文摘要

我们提出了一种基于卷积的快速技术，用于在二维和三维网格位置上计算近似的有符号欧几里得距离函数 $S$。我们的方法不是求解非线性的静态Hamilton-Jacobi方程（$\|\nabla S\|=1$），而是首先求解线性微分方程中的标量场 $\phi$，然后通过取负对数推导出 $S$ 的解。换句话说，当 $S$ 和 $\phi$ 通过 $\phi = \exp\left(-\frac{S}{\tau}\right)$ 关联，且 $\phi$ 满足对应于变分问题极值的特定线性微分方程时，我们得到近似的欧几里得距离函数 $S = -\tau\log(\phi)$，该函数在 $\tau\rightarrow 0$ 的极限下收敛于真实解。这与快速行进法和快速扫描法等直接通过Godunov迎风离散格式求解Hamilton-Jacobi方程的技术形成鲜明对比。我们的线性公式导致近似欧几里得距离函数的闭式解可表示为离散卷积，因此可通过快速傅里叶变换（FFT）高效计算。我们的解还避免了对导数算子进行空间离散化的需要。当 $\tau\rightarrow 0$ 时，我们展示了结果收敛于真实解，并针对给定的 $\tau$ 值限定了误差。我们解的可微性允许我们通过一组卷积计算近似距离函数的一阶和二阶导数。为了确定距离函数的符号（定义为在封闭区域内为正，区域外为负），我们计算二维中的缠绕数和三维中的拓扑度，这些计算也可以通过快速卷积进行。我们通过一组实验结果证明了我们方法的有效性。

英文摘要

We present a fast convolution-based technique for computing an approximate, signed Euclidean distance function $S$ on a set of 2D and 3D grid locations. Instead of solving the non-linear, static Hamilton-Jacobi equation ($\|\nabla S\|=1$), our solution stems from first solving for a scalar field $\phi$ in a linear differential equation and then deriving the solution for $S$ by taking the negative logarithm. In other words, when $S$ and $\phi$ are related by $\phi = \exp\left(-\frac{S}{\tau}\right)$ and $\phi$ satisfies a specific linear differential equation corresponding to the extremum of a variational problem, we obtain the approximate Euclidean distance function $S = -\tau\log(\phi)$ which converges to the true solution in the limit as $\tau\rightarrow 0$. This is in sharp contrast to techniques like the fast marching and fast sweeping methods which directly solve the Hamilton-Jacobi equation by the Godunov upwind discretization scheme. Our linear formulation results in a closed-form solution to the approximate Euclidean distance function expressible as a discrete convolution, and hence efficiently computable using the fast Fourier transform (FFT). Our solution also circumvents the need for spatial discretization of the derivative operator. As $\tau\rightarrow 0$ we show the convergence of our results to the true solution and also bound the error for a given value of $\tau$. The differentiability of our solution allows us to compute, using a set of convolutions, the first and second derivatives of the approximate distance function. In order to determine the sign of the distance function (defined to be positive inside a closed region and negative outside), we compute the winding number in 2D and the topological degree in 3D, whose computations can also be performed via fast convolutions. We demonstrate the efficacy of our method through a set of experimental results.
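
The relation $S = -\tau\log(\phi)$ can be tried out in one dimension, where the kernel $\exp(-|x|/\tau)$ is, up to a constant factor, the Green's function of $-\tau^2\phi'' + \phi = \delta$. The sketch below rests on several simplifying assumptions that are not taken from the paper: it is 1D only, it drops the normalization constant, it evaluates the convolution as a direct sum instead of using the FFT, and it returns an unsigned distance. The function name `approx_distance` is hypothetical.

```python
import math


def approx_distance(grid, sources, tau):
    """Approximate unsigned distance from each grid point to the nearest source.

    phi(x) is the sum over sources y of exp(-|x - y| / tau), i.e. a
    discrete convolution of the source indicator with an exponential
    kernel, and S(x) = -tau * log(phi(x)). The logarithm of the sum is
    computed in a shifted form so that small tau does not underflow.
    """
    distances = []
    for x in grid:
        exponents = [-abs(x - y) / tau for y in sources]
        largest = max(exponents)
        log_phi = largest + math.log(sum(math.exp(e - largest) for e in exponents))
        distances.append(-tau * log_phi)
    return distances


grid = [0.5 * i for i in range(21)]   # points 0.0, 0.5, ..., 10.0
sources = [2.0, 7.0]
tau = 0.05
approx = approx_distance(grid, sources, tau)
exact = [min(abs(x - y) for y in sources) for x in grid]
```

With K sources, the approximation never exceeds the true distance and falls short of it by at most $\tau\log K$, so the gap shrinks as $\tau\rightarrow 0$. On a regular grid the same sum is a convolution, which is where an FFT would replace the inner loop.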

1212.3385 2026-06-03 math.NA cs.CV cs.NA 版本更新

Approximating rational Bezier curves by constrained Bezier curves of arbitrary degree

用任意次数的约束贝塞尔曲线逼近有理贝塞尔曲线

Mao Shi, Jiansong Deng

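AI summary: Proposes a weighted least-squares method for the constrained approximation of rational Bezier curves by polynomial Bezier curves, and studies the weight functions ρ(t)=ω(t) and ρ(t)=ω(t)^2 separately.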

1304.1408 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Restoration of Images Corrupted by Impulse Noise and Mixed Gaussian Impulse Noise using Blind Inpainting

使用盲修复恢复被脉冲噪声和混合高斯脉冲噪声污染的图像

Ming Yan

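AI summary: Proposes two methods based on blind inpainting and ℓ0 minimization that simultaneously detect damaged pixels and restore the image; experiments show better performance than other methods, and a convergence analysis is provided.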

Comments 18 pages, 4 figures

DOI: 10.1137/12087178X
Journal ref: SIAM J. Imaging Sci., 6(2013), 1227-1245

AI中文摘要

本文研究了被脉冲噪声和混合高斯脉冲噪声污染的观测图像的恢复问题。由于被脉冲噪声损坏的像素不包含真实图像的任何信息，如何正确找到这个集合是一个非常重要的问题。我们提出了两种基于盲修复和ℓ0最小化的方法，可以同时找到受损像素并恢复图像。通过迭代恢复图像和更新受损像素集合，这些方法在实验中表现出比其他方法更好的性能。此外，我们提供了这些方法的收敛性分析，这些算法将收敛到坐标极小点。另外，通过对算法进行一些修改，它们将收敛到局部极小点（或以概率1收敛）。

英文摘要

This article studies the problem of image restoration of observed images corrupted by impulse noise and mixed Gaussian impulse noise. Since the pixels damaged by impulse noise contain no information about the true image, how to find this set correctly is a very important problem. We propose two methods based on blind inpainting and $\ell_0$ minimization that can simultaneously find the damaged pixels and restore the image. By iteratively restoring the image and updating the set of damaged pixels, these methods have better performance than other methods, as shown in the experiments. In addition, we provide a convergence analysis for these methods: the algorithms converge to coordinatewise minimum points and, with some modifications, converge to local minimum points (or do so with probability one).

1302.5554 2026-06-03 stat.AP cs.CV cs.NA math.NA physics.flu-dyn 版本更新

Self-similar prior and wavelet bases for hidden incompressible turbulent motion

用于隐藏不可压缩湍流运动的自相似先验和小波基

Patrick Héas, Frédéric Lavancier, Souleymane Kadri-Harouna

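AI summary: For the ill-posed inverse problem of estimating turbulent flows from image sequences, proposes a self-similar prior based on divergence-free isotropic fractional Brownian motion and uses wavelet bases to solve it effectively.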

Comments SIAM Journal on Imaging Sciences, 2014

AI中文摘要

本文关注从图像序列观测估计湍流这一病态逆问题。从贝叶斯角度，选择散度自由各向同性分数布朗运动作为瞬时湍流速度场的先验模型。该自相似先验准确刻画了不可压缩各向同性湍流中速度场的二阶统计特性。然而，相关的最大后验估计涉及分数阶拉普拉斯算子，实际实现较为困难。为解决此问题，我们提出将散度自由分数布朗运动分解到精心选择的小波基上。作为第一种方案，我们设计小波作为白化滤波器，并证明这些滤波器是由Leray投影算子组成的分数阶拉普拉斯小波。作为第二种方案，我们使用散度自由小波基，该基隐式考虑了物理中的不可压缩约束。尽管后一种分解涉及相关小波系数，我们仍能在实践中处理这种依赖性。基于这两种小波分解，我们最终提供了有效且高效的算法来逼近最大后验估计。大量数值评估证明了所提出的小波基自相似先验的相关性。

英文摘要

This work is concerned with the ill-posed inverse problem of estimating turbulent flows from the observation of an image sequence. From a Bayesian perspective, a divergence-free isotropic fractional Brownian motion (fBm) is chosen as a prior model for instantaneous turbulent velocity fields. This self-similar prior accurately characterizes the second-order statistics of velocity fields in incompressible isotropic turbulence. Nevertheless, the associated maximum a posteriori estimation involves a fractional Laplacian operator, which is delicate to implement in practice. To deal with this issue, we propose to decompose the divergence-free fBm on well-chosen wavelet bases. As a first alternative, we propose to design wavelets as whitening filters. We show that these filters are fractional Laplacian wavelets composed with the Leray projector. As a second alternative, we use a divergence-free wavelet basis, which implicitly takes into account the incompressibility constraint arising from physics. Although the latter decomposition involves correlated wavelet coefficients, we are able to handle this dependence in practice. Based on these two wavelet decompositions, we finally provide effective and efficient algorithms to approximate the maximum a posteriori estimate. An intensive numerical evaluation demonstrates the relevance of the proposed wavelet-based self-similar priors.

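1209.3318 2026-06-03 math.OC cs.CV cs.NA math.NA (version update)

Hessian Schatten-Norm Regularization for Linear Inverse Problems

Stamatios Lefkimmiatis, John Paul Ward, Michael Unser

AI summary: Proposes a family of convex, non-quadratic regularization functionals based on Schatten norms of the Hessian matrix for linear inverse imaging problems; they avoid the staircase effect and suit a wide range of applications.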

Comments 15 pages double-column format. This manuscript will appear in IEEE Transactions on Image Processing

DOI: 10.1109/TIP.2013.2237919
Journal ref: IEEE Trans. Image Process. 22 (2013), no. 5, 1873--1888

Abstract

We introduce a novel family of invariant, convex, and non-quadratic functionals that we employ to derive regularized solutions of ill-posed linear inverse imaging problems. The proposed regularizers involve the Schatten norms of the Hessian matrix, computed at every pixel of the image. They can be viewed as second-order extensions of the popular total-variation (TV) semi-norm since they satisfy the same invariance properties. Meanwhile, by taking advantage of second-order derivatives, they avoid the staircase effect, a common artifact of TV-based reconstructions, and perform well for a wide range of applications. To solve the corresponding optimization problems, we propose an algorithm that is based on a primal-dual formulation. A fundamental ingredient of this algorithm is the projection of matrices onto Schatten norm balls of arbitrary radius. This operation is performed efficiently based on a direct link we provide between vector projections onto $\ell_q$ norm balls and matrix projections onto Schatten norm balls. Finally, we demonstrate the effectiveness of the proposed methods through experimental results on several inverse imaging problems with real and simulated data.
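
As an informal illustration of the regularizer described in this abstract, the following Python sketch evaluates a discrete Hessian Schatten-norm penalty on a small image. The finite-difference stencils, the restriction to interior pixels, and the name `hessian_schatten_penalty` are assumptions made for this sketch; the abstract does not specify the authors' discretization.

```python
import math

def hessian_schatten_penalty(u, p=1.0):
    """Sum over interior pixels of the Schatten p-norm of the 2x2 Hessian.

    u is a list of rows (list of lists of floats). Central differences and
    the interior-only domain are assumptions of this sketch.
    """
    rows, cols = len(u), len(u[0])
    total = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            hxx = u[i + 1][j] - 2.0 * u[i][j] + u[i - 1][j]
            hyy = u[i][j + 1] - 2.0 * u[i][j] + u[i][j - 1]
            hxy = (u[i + 1][j + 1] - u[i + 1][j - 1]
                   - u[i - 1][j + 1] + u[i - 1][j - 1]) / 4.0
            # Eigenvalues of the symmetric matrix [[hxx, hxy], [hxy, hyy]];
            # its singular values are their absolute values.
            mean = 0.5 * (hxx + hyy)
            radius = math.hypot(0.5 * (hxx - hyy), hxy)
            s1, s2 = abs(mean + radius), abs(mean - radius)
            if math.isinf(p):
                total += max(s1, s2)
            else:
                total += (s1 ** p + s2 ** p) ** (1.0 / p)
    return total

ramp = [[2.0 * i + 3.0 * j for j in range(6)] for i in range(6)]
bowl = [[float(i * i) for j in range(6)] for i in range(6)]
print(hessian_schatten_penalty(ramp))  # a linear ramp has zero Hessian
print(hessian_schatten_penalty(bowl))  # curvature is penalized
```

The ramp is the interesting case: a first-order TV penalty would charge it a positive cost, whereas a second-order penalty leaves it free, which is one way to read the abstract's remark about avoiding the staircase effect.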

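1208.4391 2026-06-03 cs.CV cs.SY eess.SY (version update)

Shape Tracking With Occlusions via Coarse-To-Fine Region-Based Sobolev Descent

Yanchao Yang, Ganesh Sundaramoorthi

AI summary: Proposes a joint shape and appearance tracking method that handles self-occlusions and dis-occlusions through coarse-to-fine optimization on a Riemannian manifold of parameterized regions, yielding accurate shape detection.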

Comments Extension of ICCV paper, added coarse-to-fine optimization based on new Riemannian manifold of parameterized regions

Abstract

We present a method to track the precise shape of an object in video based on new modeling and optimization on a new Riemannian manifold of parameterized regions. Joint dynamic shape and appearance models, in which a template of the object is propagated to match the object shape and radiance in the next frame, are advantageous over methods employing global image statistics in cases of complex object radiance and cluttered background. In cases of 3D object motion and viewpoint change, self-occlusions and dis-occlusions of the object are prominent, and current methods employing joint shape and appearance models are unable to adapt to new shape and appearance information, leading to inaccurate shape detection. In this work, we model self-occlusions and dis-occlusions in a joint shape and appearance tracking framework. Self-occlusions and the warp to propagate the template are coupled, thus a joint problem is formulated. We derive a coarse-to-fine optimization scheme, advantageous in object tracking, that initially perturbs the template by coarse perturbations before transitioning to finer-scale perturbations, traversing all scales, seamlessly and automatically. The scheme is a gradient descent on a novel infinite-dimensional Riemannian manifold that we introduce. The manifold consists of planar parameterized regions, and the metric that we introduce is a novel Sobolev-type metric defined on infinitesimal vector fields on regions. The metric has the property of resulting in a gradient descent that automatically favors coarse-scale deformations (when they reduce the energy) before moving to finer-scale deformations. Experiments on video exhibiting occlusion/dis-occlusion, complex radiance and background show that occlusion/dis-occlusion modeling leads to superior shape accuracy compared to recent methods employing joint shape/appearance models or employing global statistics.

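1210.2380 2026-06-03 cs.CV cs.IT cs.NA math.IT math.NA (version update)

Stable and robust sampling strategies for compressive imaging

Felix Krahmer, Rachel Ward

AI summary: For compressive imaging with Fourier measurements and Haar wavelet sparsity, proposes a variable-density sampling strategy based on local coherence and proves the restricted isometry property with near-optimal embedding dimensions, giving stable and robust reconstruction.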

Comments 17 pages, 4 figures

Abstract

In many signal processing applications, one wishes to acquire images that are sparse in transform domains such as spatial finite differences or wavelets using frequency domain samples. For such applications, overwhelming empirical evidence suggests that superior image reconstruction can be obtained through variable density sampling strategies that concentrate on lower frequencies. The wavelet and Fourier transform domains are not incoherent because low-order wavelets and low-order frequencies are correlated, so compressive sensing theory does not immediately imply sampling strategies and reconstruction guarantees. In this paper we turn to a more refined notion of coherence -- the so-called local coherence -- measuring for each sensing vector separately how correlated it is to the sparsity basis. For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions. Consequently, the variable-density sampling strategy we provide allows for image reconstructions that are stable to sparsity defects and robust to measurement noise. Our results cover both reconstruction by $\ell_1$-minimization and by total variation minimization. The local coherence framework developed in this paper should be of independent interest in sparse recovery problems more generally, as it implies that for optimal sparse recovery results, it suffices to have bounded \emph{average} coherence from sensing basis to sparsity basis -- as opposed to bounded maximal coherence -- as long as the sampling strategy is adapted accordingly.
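
To make the "inverse square power-law density" concrete, here is a minimal Python sketch that draws 2D frequencies with probability proportional to 1/max(1, k1² + k2²). The cap near the origin, the sampling with replacement, and the function names are assumptions of this sketch; the abstract does not give the exact density or sampling procedure.

```python
import random

def inverse_square_density(n):
    """Probability for each 2D frequency (k1, k2), -n/2 <= k < n/2.

    Weights follow 1 / max(1, k1^2 + k2^2); capping at 1 near the origin
    and the normalization are assumptions of this sketch.
    """
    freqs = [(k1, k2) for k1 in range(-n // 2, n // 2)
             for k2 in range(-n // 2, n // 2)]
    weights = [1.0 / max(1, k1 * k1 + k2 * k2) for k1, k2 in freqs]
    total = sum(weights)
    return freqs, [w / total for w in weights]

def sample_frequencies(n, m, seed=0):
    """Draw m frequencies (with replacement) from the density above."""
    freqs, probs = inverse_square_density(n)
    rng = random.Random(seed)
    return rng.choices(freqs, weights=probs, k=m)

samples = sample_frequencies(n=64, m=500)
low = sum(1 for k1, k2 in samples if max(abs(k1), abs(k2)) <= 8)
print(f"{low} of {len(samples)} samples fall in the low-frequency block")
```

Low frequencies are drawn far more often than a uniform scheme would draw them, which is the qualitative behavior the abstract attributes to variable density sampling.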

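1304.2367 2026-06-03 cs.CV cs.AI cs.SY eess.SY (version update)

Utility-Based Control for Computer Vision

Tod S. Levitt, Thomas O. Binford, Gil J. Ettinger, Patrice Gelband

AI summary: To address computational efficiency in Bayesian-network-based computer vision, proposes controlling vision tasks by maximizing utility rather than probability, so as to optimize sensor information gathering and data analysis.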

Comments Appears in Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence (UAI1988)

Abstract

Several key issues arise in implementing computer vision recognition of world objects in terms of Bayesian networks. Computational efficiency is a driving force. Perceptual networks are very deep, typically fifteen levels of structure. Images are wide, e.g., an unspecified-number of edges may appear anywhere in an image 512 x 512 pixels or larger. For efficiency, we dynamically instantiate hypotheses of observed objects. The network is not fixed, but is created incrementally at runtime. Generation of hypotheses of world objects and indexing of models for recognition are important, but they are not considered here [4,11]. This work is aimed at near-term implementation with parallel computation in a radar surveillance system, ADRIES [5, 15], and a system for industrial part recognition, SUCCESSOR [2]. For many applications, vision must be faster to be practical and so efficiently controlling the machine vision process is critical. Perceptual operators may scan megapixels and may require minutes of computation time. It is necessary to avoid unnecessary sensor actions and computation. Parallel computation is available at several levels of processor capability. The potential for parallel, distributed computation for high-level vision means distributing non-homogeneous computations. This paper addresses the problem of task control in machine vision systems based on Bayesian probability models. We separate control and inference to extend the previous work [3] to maximize utility instead of probability. Maximizing utility allows adopting perceptual strategies for efficient information gathering with sensors and analysis of sensor data. Results of controlling machine vision via utility to recognize military situations are presented in this paper. Future work extends this to industrial part recognition for SUCCESSOR.

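1112.3166 2026-06-03 cs.CV cs.NA math.NA (version update)

Higher-Order Momentum Distributions and Locally Affine LDDMM Registration

Stefan Sommer, Mads Nielsen, Sune Darkner, Xavier Pennec

AI summary: Introduces higher-order momentum distributions into the LDDMM framework; first-order momenta give a compact representation of locally affine transformations, so non-rigid registration can be achieved with very few parameters while providing directly interpretable mathematical and modeling information.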

Abstract

To achieve sparse parametrizations that allow intuitive analysis, we aim to represent deformation with a basis containing interpretable elements, and we wish to use elements that have the description capacity to represent the deformation compactly. To accomplish this, we introduce in this paper higher-order momentum distributions in the LDDMM registration framework. While the zeroth-order momenta previously used in LDDMM only describe local displacement, the first-order momenta that are proposed here represent a basis that allows local description of affine transformations and subsequent compact description of non-translational movement in a globally non-rigid deformation. The resulting representation contains directly interpretable information from both mathematical and modeling perspectives. We develop the mathematical construction of the registration framework with higher-order momenta, we show the implications for sparse image registration and deformation description, and we provide examples of how the parametrization enables registration with a very low number of parameters. The capacity and interpretability of the parametrization using higher-order momenta lead to natural modeling of articulated movement, and the method promises to be useful for quantifying ventricle expansion and progressing atrophy during Alzheimer's disease.

1211.1690 2026-06-03 cs.RO cs.CV cs.LG cs.SY eess.SY (version update)

Learning Monocular Reactive UAV Control in Cluttered Natural Environments

Stephane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J. Andrew Bagnell, Martial Hebert

AI summary: Uses a monocular camera and imitation learning to train a controller that lets a small quadrotor navigate autonomously and avoid obstacles in natural forest environments at speeds of up to 1.5 m/s.

Comments 8 pages, 10 figures

Abstract

Autonomous navigation for large Unmanned Aerial Vehicles (UAVs) is fairly straightforward, as expensive sensors and monitoring devices can be employed. In contrast, obstacle avoidance remains a challenging task for Micro Aerial Vehicles (MAVs) which operate at low altitude in cluttered environments. Unlike large vehicles, MAVs can only carry very light sensors, such as cameras, making autonomous navigation through obstacles much more challenging. In this paper, we describe a system that navigates a small quadrotor helicopter autonomously at low altitude through natural forest environments. Using only a single cheap camera to perceive the environment, we are able to maintain a constant velocity of up to 1.5 m/s. Given a small set of human pilot demonstrations, we use recent state-of-the-art imitation learning techniques to train a controller that can avoid trees by adapting the MAV's heading. We demonstrate the performance of our system in a more controlled environment indoors, and in real natural forest environments outdoors.

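1210.5034 2026-06-03 cs.LG cs.CV cs.NA math.NA (version update)

Optimal Computational Trade-Off of Inexact Proximal Methods

Pierre Machart, Sandrine Anthoine, Luca Baldassarre

AI summary: Studies the trade-off between computational cost and convergence rate for proximal-gradient methods and proposes SIP (Speedy Inexact Proximal-gradient algorithm), which is computationally efficient and easy to implement.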

Abstract

In this paper, we investigate the trade-off between convergence rate and computational cost when minimizing a composite functional with proximal-gradient methods, which are popular optimisation tools in machine learning. We consider the case when the proximity operator is computed via an iterative procedure, which provides an approximation of the exact proximity operator. In that case, we obtain algorithms with two nested loops. We show that the strategy that minimizes the computational cost to reach a solution with a desired accuracy in finite time is to set the number of inner iterations to a constant, which differs from the strategy indicated by a convergence rate analysis. In the process, we also present a new procedure called SIP (that is Speedy Inexact Proximal-gradient algorithm) that is both computationally efficient and easy to implement. Our numerical experiments confirm the theoretical findings and suggest that SIP can be a very competitive alternative to the standard procedure.

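1210.4081 2026-06-03 math.NA cs.CV cs.DS cs.LG cs.NA math.OC (version update)

Getting Feasible Variable Estimates From Infeasible Ones: MRF Local Polytope Study

Bogdan Savchynskyy, Stefan Schmidt

AI summary: For large-scale optimization problems with separable structure, proposes a method for constructing approximately feasible primal solutions from dual solutions, applies it to the local polytope relaxation of Markov random field inference, and shows that it outperforms existing methods.

Comments 20 pages, 4 figures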

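1207.3554 2026-06-03 cs.CV cs.NA math.NA stat.ME stat.ML (version update)

Designing various component analysis at will

Akisato Kimura, Masashi Sugiyama, Sakano Hitoshi, Hirokazu Kameoka

AI summary: Proposes a general component analysis framework based on generalized pairwise expressions (GPE) that covers standard methods together with regularized, weighted, clustering, and semi-supervised extensions, and gives a simple strategy for designing new methods by combining templates.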

Comments Accepted to IAPR International Conference on Pattern Recognition, submitted to IPSJ Transactions on Mathematical Modeling and its Applications (TOM). Just only one-page abstract for new due to novelty violation for journal submission. The details will be disclosed in late September

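1210.0822 2026-06-03 math.NA cs.CV cs.NA (version update)

Discrete geodesic calculus in the space of viscous fluidic objects

Martin Rumpf, Benedikt Wirth

AI summary: Based on a local approximation of the Riemannian distance, proposes a time-discrete geodesic calculus and applies it to shape morphing, extrapolation, and feature transfer in shape space.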

Abstract

Based on a local approximation of the Riemannian distance on a manifold by a computationally cheap dissimilarity measure, a time discrete geodesic calculus is developed, and applications to shape space are explored. The dissimilarity measure is derived from a deformation energy whose Hessian reproduces the underlying Riemannian metric, and it is used to define length and energy of discrete paths in shape space. The notion of discrete geodesics defined as energy minimizing paths gives rise to a discrete logarithmic map, a variational definition of a discrete exponential map, and a time discrete parallel transport. This new concept is applied to a shape space in which shapes are considered as boundary contours of physical objects consisting of viscous material. The flexibility and computational efficiency of the approach is demonstrated for topology preserving shape morphing, the representation of paths in shape space via local shape variations as path generators, shape extrapolation via discrete geodesic flow, and the transfer of geometric features.

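1209.5826 2026-06-03 math.NA cs.CV cs.NA (version update)

Refinability of splines from lattice Voronoi cells

Jorg Peters

AI summary: Proposes simple criteria showing that only a few spline families (such as box splines and tensor-product splines) are refinable, whereas for non-refinable splines such as hex-splines the approximation error may increase when the lattice is refined.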

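1007.3753 2026-06-03 cs.CV cs.NA math.NA (version update)

Fast L1-Minimization Algorithms For Robust Face Recognition

Allen Y. Yang, Zihan Zhou, Arvind Ganesh, S. Shankar Sastry, Yi Ma

AI summary: For the sparse-representation-based classification framework in robust face recognition, proposes fast L1-minimization solvers based on augmented Lagrangian methods, addressing the poor scalability of traditional algorithms in large-scale applications.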

Abstract

L1-minimization refers to finding the minimum L1-norm solution to an underdetermined linear system b=Ax. Under certain conditions as described in compressive sensing theory, the minimum L1-norm solution is also the sparsest solution. In this paper, our study addresses the speed and scalability of its algorithms. In particular, we focus on the numerical implementation of a sparsity-based classification framework in robust face recognition, where sparse representation is sought to recover human identities from very high-dimensional facial images that may be corrupted by illumination, facial disguise, and pose variation. Although the underlying numerical problem is a linear program, traditional algorithms are known to suffer poor scalability for large-scale applications. We investigate a new solution based on a classical convex optimization framework, known as Augmented Lagrangian Methods (ALM). The new convex solvers provide a viable solution to real-world, time-critical applications such as face recognition. We conduct extensive experiments to validate and compare the performance of the ALM algorithms against several popular L1-minimization solvers, including interior-point method, Homotopy, FISTA, SESOP-PCD, approximate message passing (AMP) and TFOCS. To aid peer evaluation, the code for all the algorithms has been made publicly available.
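
For readers unfamiliar with this family of solvers, the sketch below shows plain iterative soft-thresholding (ISTA) on the penalized form 0.5*||Ax - b||² + λ||x||₁. This is not the ALM solver studied in the paper; it is a basic relative of FISTA, which the abstract lists among the baselines. The penalized formulation, the step-size rule, and all function names are assumptions made for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied entrywise."""
    return [(1.0 if x > 0 else -1.0) * max(abs(x) - tau, 0.0) for x in v]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def objective(A, b, x, lam):
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xi) for xi in x)

def ista(A, b, lam, iters=500):
    """Iterative soft-thresholding for 0.5*||Ax-b||^2 + lam*||x||_1."""
    n = len(A[0])
    # The squared Frobenius norm upper-bounds the Lipschitz constant
    # ||A||_2^2, so 1/L is a safe (if conservative) step size.
    L = sum(a * a for row in A for a in row)
    x = [0.0] * n
    for _ in range(iters):
        r = [ax - bi for ax, bi in zip(matvec(A, x), b)]
        grad = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n)]
        x = soft_threshold([xj - gj / L for xj, gj in zip(x, grad)], lam / L)
    return x

A = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b = [1.0, 1.0]
x = ista(A, b, lam=0.1)
print(x, objective(A, b, x, 0.1))
```

Each iteration costs two matrix-vector products, which is why first-order schemes of this kind scale better than the interior-point methods the abstract mentions.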

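1206.4676 2026-06-03 cs.LG cs.CV cs.NA math.NA stat.ML (version update)

Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

Zhirong Yang, Erkki Oja

AI summary: Proposes a low-rank learning method that goes beyond matrix factorization: the approximation is built from cluster assignment probabilities via a two-step bipartite random walk, minimizing the KL divergence amounts to maximum likelihood estimation of a discriminative model, and a relaxed MM algorithm performs the optimization, markedly improving clustering purity on large-scale manifold data.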

Comments ICML2012

Abstract

Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matrix factorization. We propose a new low-rank learning method to improve the clustering performance, which is beyond matrix factorization. The approximation is based on a two-step bipartite random walk through virtual cluster nodes, where the approximation is formed by only cluster assigning probabilities. Minimizing the approximation error measured by Kullback-Leibler divergence is equivalent to maximizing the likelihood of a discriminative model, which endows our method with a solid probabilistic interpretation. The optimization is implemented by a relaxed Majorization-Minimization algorithm that is advantageous in finding good local minima. Furthermore, we point out that the regularized algorithm with Dirichlet prior only serves as initialization. Experimental results show that the new method has strong performance in clustering purity for various datasets, especially for large-scale manifold data.
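
One way to read the "two-step bipartite random walk through virtual cluster nodes" is sketched below: step from sample i to cluster k with probability W[i][k], then back to sample j with probability proportional to W[j][k]. The resulting formula, the uniform prior over samples, and the name `dcd_approximation` are our assumptions; the abstract itself does not state the formula.

```python
def dcd_approximation(W):
    """Similarity approximation from cluster assignment probabilities.

    W[i][k] plays the role of P(cluster k | sample i); each row sums to 1.
    Entry (i, j) is sum_k W[i][k] * W[j][k] / sum_v W[v][k], i.e. a walk
    from sample i to cluster k and back to sample j (uniform prior over
    samples assumed).
    """
    n, r = len(W), len(W[0])
    col = [sum(W[v][k] for v in range(n)) for k in range(r)]
    return [[sum(W[i][k] * W[j][k] / col[k] for k in range(r))
             for j in range(n)] for i in range(n)]

W = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
A_hat = dcd_approximation(W)
for row in A_hat:
    print([round(a, 3) for a in row], "row sum:", round(sum(row), 6))
```

Under this construction the approximation is symmetric and every row sums to one, so it is doubly stochastic by design and is parameterized only by the assignment probabilities.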

1206.2061 2026-06-03 math.NA cs.CV cs.NA (version update)

Comments on "On Approximating Euclidean Metrics by Weighted t-Cost Distances in Arbitrary Dimension"

M. Emre Celebi, Hassan A. Kingravi, Fatih Celiker

AI summary: Comments on Mukherjee's approximation of the Euclidean norm by weighted t-cost distances, shows that the reported average errors are too optimistic in $\mathbb{R}^n$, and proposes a normalization scheme that improves accuracy.

Comments 7 pages, 1 figure, 3 tables. arXiv admin note: substantial text overlap with arXiv:1008.4870

DOI: 10.1016/j.patrec.2012.03.002
Journal ref: Pattern Recognition Letters 33 (2012) 1422--1425

Abstract

Mukherjee (Pattern Recognition Letters, vol. 32, pp. 824-831, 2011) recently introduced a class of distance functions called weighted t-cost distances that generalize m-neighbor, octagonal, and t-cost distances. He proved that weighted t-cost distances form a family of metrics and derived an approximation for the Euclidean norm in $\mathbb{Z}^n$. In this note we compare this approximation to two previously proposed Euclidean norm approximations and demonstrate that the empirical average errors given by Mukherjee are significantly optimistic in $\mathbb{R}^n$. We also propose a simple normalization scheme that improves the accuracy of his approximation substantially with respect to both average and maximum relative errors.
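
The following Python sketch illustrates the quantities under discussion, under two assumptions not stated in the abstract: that the t-cost norm is the sum of the t largest absolute components, and that the weighted t-cost norm is the maximum over t of a weight times the t-cost norm. The weights used here are purely illustrative and are not the ones analyzed by Mukherjee or by the authors.

```python
import math

def t_cost(u, t):
    """Sum of the t largest absolute components of u (assumed definition)."""
    return sum(sorted((abs(x) for x in u), reverse=True)[:t])

def weighted_t_cost(u, w):
    """max over t of w[t-1] * t_cost(u, t); the weights w are hypothetical."""
    return max(w[t - 1] * t_cost(u, t) for t in range(1, len(u) + 1))

u = [3.0, -4.0, 1.0]
print(t_cost(u, 1), t_cost(u, 3))       # chessboard and city-block norms
w = [1.0, 1.0 / math.sqrt(2), 1.0 / math.sqrt(3)]  # illustrative weights only
approx, exact = weighted_t_cost(u, w), math.sqrt(sum(x * x for x in u))
print(approx, exact, abs(approx - exact) / exact)  # relative error
```

Averaging or maximizing the relative error in the last line over many vectors gives the kind of error statistics the note compares.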

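1008.5372 2026-06-03 math.OC cs.CV cs.IT cs.LG cs.NA math.IT math.NA stat.ME (version update)

Penalty Decomposition Methods for $L0$-Norm Minimization

Zhaosong Lu, Yong Zhang

AI summary: Proposes penalty decomposition methods for optimization problems involving the L0-norm; by reformulating the problem as rank minimization and reducing all operations to vector operations, the methods outperform existing approaches in applications such as compressed sensing.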

Comments This paper has been withdrawn by the author because an updated version has been resubmitted

Abstract

In this paper we consider general l0-norm minimization problems, that is, the problems with l0-norm appearing in either objective function or constraint. In particular, we first reformulate the l0-norm constrained problem as an equivalent rank minimization problem and then apply the penalty decomposition (PD) method proposed in [33] to solve the latter problem. By utilizing the special structures, we then transform all matrix operations of this method to vector operations and obtain a PD method that only involves vector operations. Under some suitable assumptions, we establish that any accumulation point of the sequence generated by the PD method satisfies a first-order optimality condition that is generally stronger than one natural optimality condition. We further extend the PD method to solve the problem with the l0-norm appearing in objective function. Finally, we test the performance of our PD methods by applying them to compressed sensing, sparse logistic regression and sparse inverse covariance selection. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed.
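
The abstract does not spell out the subproblems of the PD method. As a hedged illustration of why l0-norm constraints can still admit cheap vector operations, the sketch below shows the Euclidean projection onto the set of vectors with at most r nonzeros, which simply keeps the r largest-magnitude entries. Whether the paper's method uses exactly this step is an assumption, and the name `project_l0_ball` is hypothetical.

```python
def project_l0_ball(x, r):
    """Closest point to x (in Euclidean norm) with at most r nonzeros."""
    keep = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:r]
    keep = set(keep)
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]

print(project_l0_ball([3.0, -1.0, 0.5, -4.0], 2))  # [3.0, 0.0, 0.0, -4.0]
```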

1108.5359 2026-06-03 math.NA cs.CV cs.NA (version update)

Solving Principal Component Pursuit in Linear Time via $l_1$ Filtering

Risheng Liu, Zhouchen Lin, Siming Wei, Zhixun Su

AI summary: Proposes an algorithm called $l_1$ filtering that exactly solves principal component pursuit with $O(r^2(m+n))$ complexity, achieving nuclear norm minimization in linear time, and that is highly parallelizable.

Abstract

In the past decades, exactly recovering the intrinsic data structure from corrupted observations, which is known as robust principal component analysis (RPCA), has attracted tremendous interests and found many applications in computer vision. Recently, this problem has been formulated as recovering a low-rank component and a sparse component from the observed data matrix. It is proved that under some suitable conditions, this problem can be exactly solved by principal component pursuit (PCP), i.e., minimizing a combination of nuclear norm and $l_1$ norm. Most of the existing methods for solving PCP require singular value decompositions (SVD) of the data matrix, resulting in a high computational complexity, hence preventing the applications of RPCA to very large scale computer vision problems. In this paper, we propose a novel algorithm, called $l_1$ filtering, for \emph{exactly} solving PCP with an $O(r^2(m+n))$ complexity, where $m\times n$ is the size of data matrix and $r$ is the rank of the matrix to recover, which is supposed to be much smaller than $m$ and $n$. Moreover, $l_1$ filtering is \emph{highly parallelizable}. It is the first algorithm that can \emph{exactly} solve a nuclear norm minimization problem in \emph{linear time} (with respect to the data size). Experiments on both synthetic data and real applications testify to the great advantage of $l_1$ filtering in speed over state-of-the-art algorithms.
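
For reference, the PCP problem that the abstract describes in words ("minimizing a combination of nuclear norm and $l_1$ norm") is usually written as follows. The symbols $D$, $L$, $S$, and $\lambda$ are notation chosen here, not taken from the abstract.

```latex
\min_{L,\,S} \; \|L\|_* + \lambda \|S\|_1
\quad \text{subject to} \quad D = L + S,
\qquad D \in \mathbb{R}^{m \times n}.
```

Here $\|L\|_*$ is the sum of the singular values of the low-rank component $L$, $\|S\|_1$ is the sum of the absolute values of the entries of the sparse component $S$, and $\lambda > 0$ balances the two terms. A choice often seen in the RPCA literature is $\lambda = 1/\sqrt{\max(m,n)}$; whether this paper uses it is not stated in the abstract.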

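1202.5844 2026-06-03 math.NA cs.CV cs.NA (version update)

Divide-and-Conquer Method for L1 Norm Matrix Factorization in the Presence of Outliers and Missing Data

Deyu Meng, Zongben Xu

AI summary: For L1-norm matrix factorization, proposes a divide-and-conquer method that breaks the original problem into a series of smallest possible subproblems, each with a closed-form solution; recursive optimization yields an efficient algorithm whose complexity is approximately linear in data size and dimensionality and which outperforms existing methods in computation time and accuracy.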

Comments 19 pages, 2 figures, 2 tables

Abstract

Low-rank matrix factorization posed as an L1 norm minimization problem has recently attracted much attention due to its intrinsic robustness to outliers and missing data. In this paper, we propose a new method, called the divide-and-conquer method, for solving this problem. The main idea is to break the original problem into a series of smallest possible subproblems, each involving only a single scalar parameter. Each of these subproblems is proved to be convex and to have a closed-form solution. By recursively optimizing these small problems analytically, an efficient algorithm for solving the original problem can be constructed naturally, one that entirely avoids time-consuming numerical optimization as an inner loop. The computational complexity of the proposed algorithm is approximately linear in both data size and dimensionality, making it possible to handle large-scale L1 norm matrix factorization problems. The algorithm is also theoretically proved to be convergent. A series of experimental results substantiates that our method always achieves better results than the current state-of-the-art methods on $L1$ matrix factorization calculation in both computational time and accuracy, especially on large-scale applications such as face recognition and structure from motion.
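
To illustrate how a scalar L1 subproblem can have a closed-form solution, the sketch below minimizes sum_i |e[i] - u[i] * v| over a single scalar v using a weighted median. That the paper's subproblems take exactly this form is an assumption on our part, and the function names are hypothetical.

```python
def l1_scalar_fit(e, u):
    """Minimize sum_i |e[i] - u[i] * v| over the scalar v.

    The objective equals sum_i |u[i]| * |e[i]/u[i] - v| over entries with
    u[i] != 0, so a weighted median of the ratios e[i]/u[i] with weights
    |u[i]| is a minimizer.
    """
    pairs = sorted((ei / ui, abs(ui)) for ei, ui in zip(e, u) if ui != 0)
    if not pairs:
        return 0.0  # every u[i] is zero, so any v is optimal
    half = 0.5 * sum(w for _, w in pairs)
    running = 0.0
    for ratio, w in pairs:
        running += w
        if running >= half:
            return ratio

def l1_cost(e, u, v):
    return sum(abs(ei - ui * v) for ei, ui in zip(e, u))

e, u = [1.0, 2.0, 10.0], [1.0, 1.0, 1.0]
v = l1_scalar_fit(e, u)
print(v, l1_cost(e, u, v))
```

In this toy example the outlying value 10.0 does not drag the estimate away from the bulk of the data, which is the robustness property the abstract attributes to the L1 formulation.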

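1204.4476 2026-06-03 cs.CV cs.SY eess.SY (version update)

Dynamic Template Tracking and Recognition

Rizwan Chaudhry, Gregory Hager, Rene Vidal

AI summary: Proposes modeling the temporal evolution of a non-rigid object's appearance/motion with a linear dynamical system, used as a dynamic template for tracking, which enables simultaneous tracking and recognition.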

Abstract

In this paper we address the problem of tracking non-rigid objects whose local appearance and motion change as a function of time. This class of objects includes dynamic textures such as steam, fire, smoke, water, etc., as well as articulated objects such as humans performing various actions. We model the temporal evolution of the object's appearance/motion using a Linear Dynamical System (LDS). We learn such models from sample videos and use them as dynamic templates for tracking objects in novel videos. We pose the problem of tracking a dynamic non-rigid object in the current frame as a maximum a-posteriori estimate of the location of the object and the latent state of the dynamical system, given the current image features and the best estimate of the state in the previous frame. The advantage of our approach is that we can specify a-priori the type of texture to be tracked in the scene by using previously trained models for the dynamics of these textures. Our framework naturally generalizes common tracking methods such as SSD and kernel-based tracking from static templates to dynamic templates. We test our algorithm on synthetic as well as real examples of dynamic textures and show that our simple dynamics-based trackers perform on par with, if not better than, the state-of-the-art. Since our approach is general and applicable to any image feature, we also apply it to the problem of human action tracking and build action-specific optical flow trackers that perform better than the state-of-the-art when tracking a human performing a particular action. Finally, since our approach is generative, we can use a-priori trained trackers for different texture or action classes to simultaneously track and recognize the texture or action in the video.

1203.2210 2026-06-03 cs.CV cs.NA math.NA version update

Fixed-Rank Representation for Unsupervised Visual Learning

Risheng Liu, Zhouchen Lin, Fernando De la Torre, Zhixun Su

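AI summary: This paper proposes fixed-rank representation (FRR) as a unified framework for unsupervised visual learning. FRR reveals multi-subspace structure through a closed-form solution, adds a sparse regularizer for robustness, and comes with a fast numerical solver.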

Comments accepted by CVPR 2012

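AI abstract (translated from Chinese)

Subspace clustering and feature extraction are two of the most commonly used unsupervised learning techniques in computer vision and pattern recognition. State-of-the-art subspace clustering techniques exploit recent advances in sparsity and rank minimization. However, existing techniques are computationally expensive and, when the data are insufficiently sampled, may produce degenerate solutions that degrade clustering performance. To partially address these problems, and inspired by existing work on matrix factorization, this paper proposes fixed-rank representation (FRR) as a unified framework for unsupervised visual learning. When the data are noise-free, FRR reveals the structure of multiple subspaces in closed form. Moreover, we prove that under suitable conditions FRR still reveals the true subspace memberships even when observations are insufficient. For robustness to outliers and noise, we introduce a sparse regularizer into the FRR framework. Beyond subspace clustering, FRR can also be used for unsupervised feature extraction. As a non-trivial byproduct, we develop a fast numerical solver for FRR. Experiments on synthetic data and real applications validate our theoretical analysis and demonstrate the advantages of FRR for unsupervised visual learning.

English abstract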

Subspace clustering and feature extraction are two of the most commonly used unsupervised learning techniques in computer vision and pattern recognition. State-of-the-art techniques for subspace clustering make use of recent advances in sparsity and rank minimization. However, existing techniques are computationally expensive and may result in degenerate solutions that degrade clustering performance in the case of insufficient data sampling. To partially solve these problems, and inspired by existing work on matrix factorization, this paper proposes fixed-rank representation (FRR) as a unified framework for unsupervised visual learning. FRR is able to reveal the structure of multiple subspaces in closed form when the data are noiseless. Furthermore, we prove that under some suitable conditions, even with insufficient observations, FRR can still reveal the true subspace memberships. To achieve robustness to outliers and noise, a sparse regularizer is introduced into the FRR framework. Beyond subspace clustering, FRR can be used for unsupervised feature extraction. As a non-trivial byproduct, a fast numerical solver is developed for FRR. Experimental results on both synthetic data and real applications validate our theoretical analysis and demonstrate the benefits of FRR for unsupervised visual learning.

1202.5414 2026-06-03 math.AP cs.CV cs.NA math.NA math.RT version update

Left-Invariant Diffusion on the Motion Group in terms of the Irreducible Representations of SO(3)

Marco Reisert, Henrik Skibbe

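AI summary: Using the irreducible representations of SO(3), left-invariant vector fields on SE(3) are expressed in differential form in the translation coordinates and in algebraic form for the rotations. This avoids explicit discretization of SO(3) or S2, and the approach is applied to diffusion-weighted magnetic resonance imaging and object detection.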

1109.3827 2026-06-03 cs.IT cs.CV cs.SY eess.SY math.IT math.OC stat.ML version update

Online Robust Subspace Tracking from Partial Information

Jun He, Laura Balzano, John C. S. Lui

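AI summary: Proposes GRASTA, an algorithm that uses the robust l1 norm to track subspaces online from highly incomplete data, with applications to robust matrix completion and real-time separation of video background and foreground, reaching 57 frames per second on benchmark video.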

Comments 28 pages, 12 figures

0906.0434 2026-06-03 cs.CV cs.NA math.NA stat.ME version update

Total Variation, Adaptive Total Variation and Nonconvex Smoothly Clipped Absolute Deviation Penalty for Denoising Blocky Images

Aditya Chopra, Heng Lian

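AI summary: To address the bias of the total variation model, the paper proposes a nonconvex penalty inspired by high-dimensional variable selection and solves it efficiently with an MM algorithm. Experiments show better performance than traditional methods for denoising blocky images.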
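
To make the MM idea concrete, here is a minimal sketch for the plain (convex) total variation penalty in one dimension. It is not the paper's nonconvex SCAD method; it only illustrates how majorization-minimization replaces the absolute value with a quadratic upper bound at each iteration. The function names, the test signal, and the value of `lam` are assumptions chosen for illustration.

```python
import numpy as np


def tv_denoise_mm(y, lam, n_iter=50):
    """1D total variation denoising by majorization-minimization.

    Minimizes 0.5 * ||y - x||^2 + lam * sum(|x[i+1] - x[i]|).
    Each step majorizes |t| by a quadratic at the current iterate and
    solves the resulting linear system (written via the matrix
    inversion lemma, so no division by zero differences occurs).
    """
    n = y.size
    D = np.diff(np.eye(n), axis=0)  # (n-1) x n first-difference matrix
    DDt = D @ D.T
    Dy = D @ y
    x = y.copy()
    for _ in range(n_iter):
        weights = np.abs(D @ x)  # current absolute differences
        z = np.linalg.solve(np.diag(weights) / lam + DDt, Dy)
        x = y - D.T @ z
    return x


def tv_objective(x, y, lam):
    """Value of the TV denoising cost at x."""
    return 0.5 * np.sum((y - x) ** 2) + lam * np.sum(np.abs(np.diff(x)))


# Hypothetical blocky test signal: a single step plus Gaussian noise.
rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), np.ones(50)])
noisy = clean + 0.2 * rng.standard_normal(clean.size)
denoised = tv_denoise_mm(noisy, lam=1.0)
```

A nonconvex penalty would change how `weights` is computed at each iteration; that part is deliberately left out here.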

1106.2124 2026-06-03 physics.med-ph cs.CV cs.NA math.NA stat.AP version update

Omni-tomography/Multi-tomography -- Integrating Multiple Modalities for Simultaneous Imaging

Ge Wang, Jie Zhang, Hao Gao, Victor Weir, Hengyong Yu, Wenxiang Cong, Xiaochen Xu, Haiou Shen, James Bennett, Yue Wang, Michael Vannier

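AI summary: This paper proposes the concept of omni-tomography, which integrates multiple imaging mechanisms such as CT, MRI, PET, SPECT, ultrasound, and optical imaging to achieve truly simultaneous local reconstruction, overcoming the inherent limitations of existing modality-fusion approaches in registration error and physical constraints.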

Comments 43 pages, 15 figures, 99 references, provisional patent applications filed by Virginia Tech

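AI abstract (translated from Chinese)

Current tomographic imaging systems need major improvements, especially for studying multi-dimensional, multi-scale, multi-temporal, and multi-parametric phenomena. Both preclinical and clinical imaging now rely on in vivo tomography, which often requires separate evaluations by different imaging modalities to define morphological detail, delineate changes caused by disease or intervention, and study physiological functions with interrelated aspects. Over the past decade, two different approaches to multimodality image fusion have emerged: post-hoc image registration, and combined acquisition on PET-CT, PET-MRI, and other hybrid scanners. Both post-hoc image analysis and dual/triple-modality approaches have inherent limitations, determined by registration errors and physical constraints in the acquisition chain. We foresee tomography moving beyond current modality fusion towards grand fusion, a large-scale fusion of all or many imaging modalities, which may be called omni-tomography or multi-tomography. Unlike modality fusion, the grand fusion proposed here aims at truly simultaneous but usually local reconstruction involving all or many relevant imaging mechanisms, such as CT, MRI, PET, SPECT, ultrasound, optical imaging, and possibly more. This paper introduces the technical basis of omni-tomography and illustrates it with a top-level design of a next-generation scanner, interior tomographic reconstructions for representative modalities, and anticipated applications of omni-tomography.

English abstract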

Current tomographic imaging systems need major improvements, especially when multi-dimensional, multi-scale, multi-temporal and multi-parametric phenomena are under investigation. Both preclinical and clinical imaging now depend on in vivo tomography, often requiring separate evaluations by different imaging modalities to define morphologic details, delineate interval changes due to disease or interventions, and study physiological functions that have interconnected aspects. Over the past decade, fusion of multimodality images has emerged with two different approaches: post-hoc image registration and combined acquisition on PET-CT, PET-MRI and other hybrid scanners. There are intrinsic limitations for both the post-hoc image analysis and dual/triple modality approaches defined by registration errors and physical constraints in the acquisition chain. We envision that tomography will evolve beyond current modality fusion and towards grand fusion, a large scale fusion of all or many imaging modalities, which may be referred to as omni-tomography or multi-tomography. Unlike modality fusion, grand fusion is here proposed for truly simultaneous but often localized reconstruction in terms of all or many relevant imaging mechanisms such as CT, MRI, PET, SPECT, US, optical, and possibly more. In this paper, the technical basis for omni-tomography is introduced and illustrated with a top-level design of a next generation scanner, interior tomographic reconstructions of representative modalities, and anticipated applications of omni-tomography.

1011.2292 2026-06-03 math.NA cs.CV cs.NA version update

Image Segmentation with Multidimensional Refinement Indicators

Hend Ben Ameur, Guy Chavent, Francois Clément, Pierre Weis

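AI summary: Proposes transferring optimal control techniques to image segmentation. An optimal parameter representation is built iteratively through adaptive parameterization, with the gradient of the error driving the region partition, which yields a robust and flexible segmentation algorithm.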

1102.0899 2026-06-03 cs.AI cs.CV cs.LG cs.NA math.NA math.PR version update

Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model

Michael DelRose, Christian Wagner, Philip Frederick

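AI summary: To address the inability of hidden Markov models to model links between observations, the paper proposes the Evidence Feed Forward Hidden Markov Model, which introduces probabilistic observation-to-observation links to improve classification performance, and validates it on visual action data and measurement data.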

Comments 19 pages, International Journal of Artificial Intelligence and Applications

DOI: 10.5121/ijaia.2011.2101
Journal ref: International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 2, No. 1, Jan 2011

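AI abstract (translated from Chinese)

The ability to predict the intentions of others from their visual actions alone is a skill unique to humans and animals. The intelligence of current computer algorithms has not reached this level of sophistication, but several research efforts are working in this direction. Because so many classification algorithms are available, it is hard to determine which one best suits a particular situation. For classifying visual human intent data, hidden Markov models (HMMs) and their variants are leading candidates. HMMs cannot provide probabilities for the links between observations, which is a major shortcoming of this classification technique. When a person visually identifies the action of another person, they monitor patterns in the observations. By estimating the next observation, people can summarize the action and so judge the intention of the person performing it with fair accuracy. These visual cues and links are essential for building intelligent algorithms that determine human actions from visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm that provides observation-to-observation links. This study presents the theory behind Evidence Feed Forward HMMs, gives mathematical proofs that they learn these parameters so as to optimize the likelihood of the observations (which is essential for all computational intelligence algorithms), and provides comparative examples against standard HMMs in classifying visual action data and measurement data, laying a solid foundation for applying Evidence Feed Forward HMMs to many kinds of classification problems.

English abstract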

The ability to predict the intentions of people based solely on their visual actions is a skill performed only by humans and animals. The intelligence of current computer algorithms has not reached this level of complexity, but several research efforts are working towards it. With the number of classification algorithms available, it is hard to determine which algorithm works best for a particular situation. In classification of visual human intent data, Hidden Markov Models (HMMs) and their variants are leading candidates. The inability of HMMs to provide a probability for the observation-to-observation linkages is a major shortcoming of this classification technique. If a person is visually identifying an action of another person, they monitor patterns in the observations. By estimating the next observation, people can summarize the actions and thus determine, with fairly good accuracy, the intention of the person performing the action. These visual cues and linkages are important in creating intelligent algorithms for determining human actions based on visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm that provides observation-to-observation linkages. The following research addresses the theory behind Evidence Feed Forward HMMs, provides mathematical proofs that they learn these parameters so as to optimize the likelihood of the observations, which is important in all computational intelligence algorithms, and gives comparative examples with standard HMMs in classification of both visual action data and measurement data, thus providing a strong base for Evidence Feed Forward HMMs in classification of many types of problems.
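
As a point of reference for the standard-HMM baseline mentioned above, the following is a minimal sketch of likelihood-based classification with the forward algorithm. It covers only the standard HMM, where each observation depends on the hidden state alone; the observation-to-observation links of the Evidence Feed Forward model are not implemented. All model names and probability values are hypothetical.

```python
import numpy as np


def forward_likelihood(pi, A, B, obs):
    """Likelihood of a discrete observation sequence under a standard HMM.

    pi: initial state probabilities, shape (S,)
    A:  state transition matrix, A[i, j] = P(next state j | state i)
    B:  emission matrix, B[i, o] = P(observation o | state i)
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate, then weight by emission
    return float(alpha.sum())


# Two hypothetical action models sharing dynamics but not emissions.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
models = {
    "steady": (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]])),
    "erratic": (pi, A, np.array([[0.1, 0.9], [0.2, 0.8]])),
}


def classify(obs):
    """Return the name of the model with the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(*models[name], obs))
```

Classification picks the model under which the sequence is most likely, which is the usual way a bank of per-class HMMs is used.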

1011.0997 2026-06-03 math.NA cs.CV cs.NA math.FA stat.ML version update

Performance Analysis of Spectral Clustering on Compressed, Incomplete and Inaccurate Measurements

Blake Hunter, Thomas Strohmer

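AI summary: This paper combines the distance-preserving measurements of compressed sensing and matrix completion with robust spectral clustering, analyzes how small errors in the affinity matrix affect the spectral coordinates and clusterability, and generalizes perturbation results for two-class spectral clustering to multi-class clustering.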

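AI abstract (translated from Chinese)

Spectral clustering is one of the most widely used techniques for extracting the underlying global structure of a data set. Compressed sensing and matrix completion have become the mainstream methods for efficiently recovering sparse signals and partially observed signals, respectively. We combine the distance-preserving measurements of compressed sensing and matrix completion with the power of robust spectral clustering. Our analysis gives rigorous bounds on how small errors in the affinity matrix affect the spectral coordinates and clusterability. This work generalizes current perturbation results for two-class spectral clustering to multi-class clustering with k eigenvectors. We carefully track how the small perturbations introduced by compressed sensing and matrix completion affect the affinity matrix and, in turn, the spectral coordinates. These perturbation results for multi-class clustering require an eigengap between the kth and (k+1)th eigenvalues of the affinity matrix, which arises naturally in data with k well-defined clusters. Our theoretical guarantees are complemented by numerical results and several examples of unsupervised organization and clustering of image data.

English abstract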

Spectral clustering is one of the most widely used techniques for extracting the underlying global structure of a data set. Compressed sensing and matrix completion have emerged as prevailing methods for efficiently recovering sparse and partially observed signals, respectively. We combine the distance-preserving measurements of compressed sensing and matrix completion with the power of robust spectral clustering. Our analysis provides rigorous bounds on how small errors in the affinity matrix can affect the spectral coordinates and clusterability. This work generalizes the current perturbation results of two-class spectral clustering to incorporate multi-class clustering with k eigenvectors. We thoroughly track how small perturbations from using compressed sensing and matrix completion affect the affinity matrix and, in succession, the spectral coordinates. These perturbation results for multi-class clustering require an eigengap between the kth and (k+1)th eigenvalues of the affinity matrix, which naturally occurs in data with k well-defined clusters. Our theoretical guarantees are complemented with numerical results along with a number of examples of the unsupervised organization and clustering of image data.
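
The eigengap condition can be seen on a toy example. The sketch below builds a Gaussian affinity matrix for three well-separated synthetic clusters, normalizes it by the degrees, and measures the gap between the kth and (k+1)th eigenvalues. The Gaussian kernel, the symmetric normalization, and all parameter values are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def spectral_coordinates(points, k, sigma=1.0):
    """Top-k spectral coordinates and the eigengap of a normalized affinity.

    Returns the k leading eigenvectors, all eigenvalues in descending
    order, and the gap between the kth and (k+1)th eigenvalues.
    """
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    affinity = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(affinity, 0.0)
    degree = affinity.sum(axis=1)
    normalized = affinity / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(normalized)  # ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigengap = eigvals[k - 1] - eigvals[k]
    return eigvecs[:, :k], eigvals, eigengap


# Three hypothetical, well-separated clusters of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
coords, eigvals, eigengap = spectral_coordinates(points, k=3)
```

With clusters this far apart the three leading eigenvalues sit essentially at 1 and the rest are much smaller, so the gap is large; moving the clusters closer or adding measurement error to the affinity shrinks it.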

0912.4571 2026-06-03 math.OC cs.CV cs.NA math.NA version update

Fast Alternating Linearization Methods for Minimizing the Sum of Two Convex Functions

Donald Goldfarb, Shiqian Ma, Katya Scheinberg

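AI summary: Proposes first-order alternating linearization algorithms, based on the alternating direction augmented Lagrangian method, for minimizing the sum of two convex functions. The basic method needs O(1/ε) iterations to reach an ε-optimal solution, the accelerated version needs O(1/√ε) iterations, and numerical results are reported.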

1010.0301 2026-06-03 cs.CV cs.NA math.NA version update

A Microwave Imaging and Enhancement Technique from Noisy Synthetic Data

Anjan Kumar Kundu, Bijoy Bandopadhyay, Sugata Sanyal

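AI summary: Proposes an inverse iterative algorithm for microwave imaging based on a method-of-moments solution, uses constrained optimization to ensure convergence and the Levenberg-Marquardt method to handle ill-posedness, and finally enhances the images reconstructed from noisy synthetic data.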

Comments 8 Pages, 10 Figures, International Symposium on Advanced Engineering and Applied Management-40th Anniversary in Higher Education-Image Processing-University Politegnica, Timisoara, 4-5 November, 2010, Hunedoara, ROMANIA

1009.0051 2026-06-03 math.NA cs.CV cs.NA version update

Variational Iteration Method for Image Restoration

Keyvan Yahya, Jafar Biazar, Hossein Azari, Pouyan Rafiei Fard

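AI summary: This paper applies the variational iteration method to the Perona-Malik equation for the first time, obtains an approximate solution with an error analysis, and verifies the effectiveness of the method.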

1008.4870 2026-06-03 math.NA cs.CV cs.NA version update

On Euclidean Norm Approximations

M. Emre Celebi, Fatih Celiker, Hassan A. Kingravi

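AI summary: This paper studies several approximations to the Euclidean norm, shows that they share a unified mathematical form, and corrects the optimistic estimate of the maximum error in the method of Seol and Cheun.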

Comments 9 pages, 1 figure, Pattern Recognition
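
For readers unfamiliar with this kind of approximation, the sketch below shows the classical two-dimensional "alpha max plus beta min" rule, which replaces the square root with a weighted sum of the larger and smaller absolute components. It is given only as an illustration of the general idea; it is not claimed to be one of the specific methods analyzed in the paper, and the function and constant names are assumptions.

```python
import numpy as np

# Coefficients that minimize the worst-case relative error of the rule
# alpha * max(|a|, |b|) + beta * min(|a|, |b|) over all directions.
ALPHA = 2 * np.cos(np.pi / 8) / (1 + np.cos(np.pi / 8))  # about 0.9604
BETA = 2 * np.sin(np.pi / 8) / (1 + np.cos(np.pi / 8))   # about 0.3978


def approx_norm2(a, b):
    """Square-root-free approximation of sqrt(a**2 + b**2)."""
    a, b = np.abs(a), np.abs(b)
    return ALPHA * np.maximum(a, b) + BETA * np.minimum(a, b)


# Measure the relative error on unit vectors in every direction.
theta = np.linspace(0.0, 2.0 * np.pi, 10001)
rel_err = np.abs(approx_norm2(np.cos(theta), np.sin(theta)) - 1.0)
max_rel_err = rel_err.max()
```

Checking the error on a dense sweep of directions, as done here, is a simple way to test a claimed maximum-error figure for any approximation of this type.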

1006.5739 2026-06-03 math.NA cs.CV cs.NA version update

Polyharmonic Daubechies type wavelets in Image Processing and Astronomy, II

Ognyan Kounchev, Damyan Kalaglarsky, Milcho Tsvetkov

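AI summary: This paper studies the application of polyharmonic subdivision wavelets (of Daubechies type) to image processing, in particular astronomical images. The results show clear advantages over some standard multivariate wavelets and better compression potential.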

Comments 9 pages

0909.1310 2026-06-03 math.NA cs.CV cs.NA version update

Sparse image representation by discrete cosine/spline based dictionaries

James Bowley, Laura Rebollo-Neira

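AI summary: This paper considers mixed dictionaries generated by cosine and B-spline functions and shows, using highly nonlinear methods such as orthogonal matching pursuit, that discrete versions of the proposed dictionaries significantly improve the sparsity of image representations.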
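
A minimal sketch of orthogonal matching pursuit on a mixed dictionary may help here. The dictionary below concatenates discrete cosine atoms with unit spikes; the spikes are a simplified stand-in for the spline part and are an assumption, not the paper's B-spline construction. The signal and all names are hypothetical.

```python
import numpy as np


def cosine_atoms(n):
    """Unit-norm discrete cosine atoms as the columns of an n x n matrix."""
    t = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    atoms = np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    return atoms / np.linalg.norm(atoms, axis=0)


def omp(dictionary, signal, n_atoms):
    """Greedy orthogonal matching pursuit.

    At each step, pick the atom most correlated with the residual, then
    refit all selected coefficients by least squares.
    """
    residual = signal.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(n_atoms):
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0  # never reselect an atom
        support.append(int(np.argmax(correlations)))
        sub = dictionary[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, signal, rcond=None)
        residual = signal - sub @ coeffs
    return support, coeffs, residual


n = 64
dictionary = np.hstack([cosine_atoms(n), np.eye(n)])  # 64 x 128, redundant
# Hypothetical signal: two cosine atoms plus one spike.
signal = (2.0 * dictionary[:, 5] - 1.5 * dictionary[:, 20]
          + 3.0 * dictionary[:, n + 40])
support, coeffs, residual = omp(dictionary, signal, n_atoms=3)
```

A signal that mixes smooth oscillations with a localized feature needs many atoms in either basis alone but only three in the combined dictionary, which is the kind of sparsity gain the summary refers to.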

0804.1046 2026-06-03 cs.CV cs.CG cs.GR cs.NA math.NA version update

Discrete schemes for Gaussian curvature and their convergence

Zhiqiang Xu, Guoliang Xu

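AI summary: This paper reviews several discrete schemes for Gaussian curvature, proposes a new discrete scheme and proves its convergence at regular vertices of valence at least 5, shows by counterexample that no convergent discrete scheme can be constructed for valence 4, and compares the asymptotic errors of several discrete schemes.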