arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03940 2026-06-03 eess.IV cs.CV cs.LG cs.RO (version update)

SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction

Dan Jacobellis, Neeraja J. Yadwadkar

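Affiliation: Department of Electrical and Computer Engineering, The University of Texas at Austin

AI summary: Proposes the SEAOTTER framework, which combines a sensor-embedded autoencoder with a learnable JPEG transcode; at a 200:1 compression ratio it encodes 7x faster and decodes 3.5x faster than AVIF and improves ImageNet top-1 accuracy by 8%, while remaining JPEG-compatible.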

Abstract

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter .
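
To give a feel for what a 200:1 compression ratio means in practice, the following minimal sketch converts the ratio into a byte budget and a bits-per-pixel figure. It assumes the ratio is measured against uncompressed 8-bit RGB, which the abstract does not state, and the 1920x1080 frame size is a made-up example.

```python
def compressed_size_bytes(width, height, ratio, channels=3, bits_per_channel=8):
    """Bytes left after compressing an uncompressed frame by `ratio`:1."""
    raw_bytes = width * height * channels * bits_per_channel / 8
    return raw_bytes / ratio


def bits_per_pixel(ratio, channels=3, bits_per_channel=8):
    """Average bits spent per pixel at the given compression ratio."""
    return channels * bits_per_channel / ratio


# Hypothetical 1080p RGB frame (assumption, not from the abstract).
size = compressed_size_bytes(1920, 1080, ratio=200)
bpp = bits_per_pixel(200)
print(f"{size:.0f} bytes per frame, {bpp:.2f} bits per pixel")
```

Under these assumptions a frame of roughly 6.2 MB shrinks to about 31 KB, which is the regime where codec overhead and encoder cost start to dominate.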

2606.02906 2026-06-03 eess.IV cs.CV (version update)

Depth from Dual Differential Defocus and Stereo Consensus

Junjie Luo, Wei Xu, Dylan Chu, Emma Alexander, Qi Guo

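Affiliations: Purdue University; Northwestern University

AI summary: Proposes the D^3S Consensus algorithm, which fuses depth-from-defocus with stereo to achieve highly accurate depth estimation beyond the depth of field; it selects reliable predictions through consensus between physically independent cues and reaches a comparable working range with a smaller baseline.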

Abstract

We introduce D^3S Consensus, a physics-based, closed-form algorithm that unifies depth-from-defocus (DfD) and stereo to achieve highly accurate depth estimation throughout an extended working range beyond the depth-of-field (DoF) of cameras. Given a pair of dual-defocus stereo images, the method estimates an overdetermined set of depth using a novel DfD theory, Dual Differential Defocus (D^3), and (S)tereo in a coupled fashion. It then picks the most confident depth prediction from the set by enforcing consensus between these physically independent cues to reject unreliable estimates. Analysis shows that D^3S achieves a comparable working range under the same error tolerance with 10x smaller baseline than previous triangulation-based depth estimation systems. This enables compact passive binocular rangefinders with substantially smaller form factors than conventional stereo and DfD designs. We demonstrate the first D^3S prototype with only 4 mm baseline and 12 mm EFL. It generates up to 900 x 1800-pixel depth maps with 1-cm mean absolute error over 0.3-1.64 m from a snapshot acquisition. This has surpassed the reported accuracy of certain commercially available stereo cameras with much larger form factors.
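
The baseline claim is easier to appreciate next to the textbook stereo triangulation model, sketched below. This is the standard pinhole relation Z = f B / d and its first-order error propagation, not the D^3 theory from the paper; the disparity error value is an illustrative assumption.

```python
def disparity_mm(focal_mm, baseline_mm, depth_mm):
    """Disparity on the sensor (in mm) for a point at the given depth."""
    return focal_mm * baseline_mm / depth_mm


def depth_error_mm(focal_mm, baseline_mm, depth_mm, disparity_error_mm):
    """First-order depth error: |dZ| = Z^2 / (f * B) * |dd|."""
    return depth_mm ** 2 / (focal_mm * baseline_mm) * disparity_error_mm


# Prototype numbers from the abstract: 12 mm EFL, 4 mm baseline.
near = disparity_mm(12.0, 4.0, 300.0)    # disparity at 0.3 m
far = disparity_mm(12.0, 4.0, 1640.0)    # disparity at 1.64 m

# Assumed disparity error of 1 micrometre, for illustration only.
err_small = depth_error_mm(12.0, 4.0, 1000.0, 0.001)
err_large = depth_error_mm(12.0, 40.0, 1000.0, 0.001)
print(near, far, err_small / err_large)
```

For stereo alone, shrinking the baseline by 10x inflates the depth error by 10x at the same range, which is the gap the additional defocus cue has to close.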

2606.02642 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.MM cs.SD (version update)

SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models

Chenshuang Zhang, Kyeong Seon Kim, Chengxin Liu, Tae-Hyun Oh

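Affiliation: KAIST

AI summary: Targets speech-vision hallucination in audio-visual large language models with the SVHalluc benchmark, which evaluates along semantic and temporal dimensions how well models align speech content with visual signals, and finds that existing models are limited in cross-modal understanding.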

Comments Accepted at CVPR 2026

Abstract

Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination. Existing benchmarks focus on environmental sounds (e.g., dog barking) to indicate event occurrence. In contrast, human speech carries fundamentally different, rich semantics and temporal structures, yet it remains unexplored whether current models can accurately align speech content with corresponding visual signals. In this work, we show that speech content can induce hallucinations in audio-visual LLMs. To systematically study this, we introduce SVHalluc, the first comprehensive benchmark for evaluating speech-vision hallucination in audio-visual LLMs. Our benchmark diagnoses speech-vision hallucinations from two critical and complementary aspects: semantic and temporal. Experimental results demonstrate that state-of-the-art open-source audio-visual LLMs struggle with aligning speech content with corresponding visual signals, with a near-random accuracy on multiple tasks. In contrast, Gemini 2.5 Pro significantly outperforms the open-source models. Our analysis suggests that their failures stem from limited ability in cross-modality understanding, despite strong performance in single-modality perception. Our work uncovers a new and fundamental limitation of current audio-visual LLMs and highlights the need for speech-grounded video comprehension. Project page: https://chenshuang-zhang.github.io/projects/svhalluc/.

2606.02639 2026-06-03 eess.IV cs.AI cs.CV (version update)

Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF

Spoorthi M, Suja Palaniswamy

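Affiliation: Amrita University

AI summary: Identifies and resolves the default density-shift problem of TensoRF on X-ray attenuation fields and proposes AReT, an anatomy-regularized tensorial radiance field framework that reconstructs lung nodule volumes stably from only three orthogonal X-ray projections, reaching high accuracy on the LIDC-IDRI dataset.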

Abstract

We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift of -10, originally introduced for RGB scene reconstruction, suppresses density gradients and prevents sparse-view medical reconstruction regardless of learning rate or regularization strategy. Setting the density shift to zero restores gradient flow and enables stable volumetric reconstruction of pulmonary nodules from only three orthogonal X-ray projections. Building on this, we propose AReT, an anatomy-regularized tensorial radiance field framework for lung nodule reconstruction using coronal, sagittal, and axial projections from the LIDC-IDRI dataset (19 patients, radiologist-annotated nodules). Unlike existing NeRF approaches requiring dense multi-view acquisition, AReT is designed for sparse-view thoracic imaging and incorporates chest-anatomy-aware regularization combining L1 sparsity and total variation smoothness. A systematic comparison across 11 reconstruction strategies shows anatomy-aware regularization consistently outperforms generative-prior-guided approaches. Evaluated against radiologist consensus segmentations, AReT achieves Pearson r=0.983 (p<0.0001) for clinically actionable nodules >=10 mm (n=14), median absolute volumetric error of 11.4%, near-zero systematic bias of -77.3 mm^3, and 8.4x improvement over spherical volume approximation.
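
A small numerical sketch can illustrate why a density shift of -10 might starve the field of gradient. It assumes the raw density feature passes through a softplus after the shift is added; the abstract does not name the activation, so treat that as an assumption. The second helper is the plain sphere-volume formula that a spherical approximation from a nodule diameter would use.

```python
import math


def softplus(x):
    """softplus(x) = log(1 + exp(x)); fine for the moderate x used here."""
    return math.log1p(math.exp(x))


def softplus_grad(x):
    """Derivative of softplus, i.e. the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def density(raw_feature, density_shift):
    """Assumed density activation: softplus(raw_feature + shift)."""
    return softplus(raw_feature + density_shift)


def sphere_volume_mm3(diameter_mm):
    """Volume of a sphere from its diameter: pi * d^3 / 6."""
    return math.pi * diameter_mm ** 3 / 6.0


# Gradient reaching a near-zero raw feature under each shift.
print(softplus_grad(0.0 - 10.0))  # shift = -10: on the order of 1e-5
print(softplus_grad(0.0 + 0.0))   # shift = 0: exactly 0.5
print(sphere_volume_mm3(10.0))    # a 10 mm sphere, in mm^3
```

Under this assumption, a feature initialized near zero receives a gradient about four orders of magnitude smaller with the shift of -10 than with a shift of zero.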

2606.02631 2026-06-03 eess.AS cs.AI cs.CV cs.LG cs.SD (version update)

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

Shenghao Ding

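Affiliation: Yet Another AI

AI summary: Studies whether audio, images, and video can share a unified wavelet token schema; using a continuous-token model built on a Haar DWT/IDWT frontend, it verifies the feasibility of a unified token schema on several datasets and analyzes the effects of latent capacity and metadata.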

Comments 12 pages, 3 figures

Abstract

This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and DAVIS 2017 data, a dense shared model reaches 39.92 dB audio, 29.37 dB image, and 23.93 dB video PSNR. A matched-rate sweep under continuous latent scalar budgets indicates that the visual gains are not explained solely by latent capacity, while also showing that additive metadata embeddings are not a universal source of improvement. Finally, fixed-rate energy selection provides a strong non-parametric baseline: energy_global improves average PSNR over uniform selection by 16.73 dB for audio, 16.90 dB for images, and 15.86 dB for video under compressed keep ratios. Masked sparse training reaches 34.45 dB video PSNR with 50% of dense tokens. The results support a unified wavelet token schema and sparse token interface, while stopping short of establishing a universal discrete vocabulary.
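
The energy-selection baseline can be sketched on a 1D signal with a one-level orthonormal Haar transform. This is an illustrative reading, not the paper's code: `energy_global` is interpreted here as a global top-k by coefficient magnitude, the uniform selector keeps evenly spaced coefficients, and the test signal and keep ratio are made up.

```python
import math


def haar_dwt(x):
    """One-level orthonormal Haar DWT of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def psnr(ref, rec, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = sum((r - q) ** 2 for r, q in zip(ref, rec)) / len(ref)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def top_energy_idx(coeffs, k):
    """Indices of the k largest-magnitude coefficients."""
    return sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))[-k:]


def uniform_idx(coeffs, k):
    """Indices of k evenly spaced coefficients."""
    step = len(coeffs) / k
    return [int(i * step) for i in range(k)]


def reconstruct(x, chooser, keep_ratio):
    """Zero all but the chosen coefficients, then invert the transform."""
    approx, detail = haar_dwt(x)
    coeffs = approx + detail
    k = max(1, round(keep_ratio * len(coeffs)))
    keep = set(chooser(coeffs, k))
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    half = len(approx)
    return haar_idwt(kept[:half], kept[half:])


signal = [math.sin(2 * math.pi * i / 32) for i in range(64)]
print(psnr(signal, reconstruct(signal, top_energy_idx, 0.25)))
print(psnr(signal, reconstruct(signal, uniform_idx, 0.25)))
```

Because the transform is orthonormal, the squared reconstruction error equals the energy of the dropped coefficients, so keeping the largest magnitudes can never do worse than any other choice of the same size.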

2606.02937 2026-06-03 q-bio.NC cs.CV (version update)

BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting

Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

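Affiliations: Columbia University; Cold Spring Harbor; Stanford University

AI summary: Proposes BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled multi-view video through 3D Gaussian splatting reconstruction and animal segmentation, and applies them effectively to novel view synthesis, multi-view pose estimation, and neural encoding.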

Abstract

Multi-view video recordings are increasingly used to capture the 3D movements of animals in experimental settings, yet extracting rich 3D representations from these recordings remains challenging. Supervised pose estimation requires extensive manual annotation, while general-purpose 3D reconstruction models trained on generic scene datasets fail on the specialized imagery and sparse-view setting of laboratory experiments. We address these limitations with BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled, calibrated multi-view video. BEAST3D uses a vision transformer to predict 3D Gaussian splats that reconstruct held-out views through differentiable rendering, while simultaneously segmenting the animal from the background. BEAST3D reconstructs 3D structure with as few as four views by conditioning directly on known camera parameters--unlike general-purpose models, which must estimate camera geometry from dense overlapping viewpoints that are seldom available in lab settings. Through comprehensive evaluation across four species, we demonstrate that BEAST3D produces rich, viewpoint-invariant features that transfer effectively to three downstream tasks: novel view synthesis, which validates the quality of the learned 3D representations; multi-view pose estimation, which provides the sparse keypoint trajectories widely used in behavioral analysis; and neural encoding, which relates 3D behavioral features to simultaneously recorded neural activity. BEAST3D thus establishes a versatile framework for behavioral analysis that leverages 3D structure in modern multi-view laboratory recordings.

2606.03994 2026-06-03 cs.CV cs.RO (version update)

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

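Affiliation: Seoul National University

AI summary: Proposes SimuScene, a compositional 3D reconstruction pipeline that brings physical simulation into shape and layout estimation, using a physics engine to diagnose reconstruction errors and drive corrections, and producing stable, simulation-ready scenes.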

Comments Project Page: https://snuvclab.github.io/SimuScene/

Abstract

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.
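
The idea of turning support failures into a signed correction signal can be illustrated with a toy check along the gravity axis. This is a static geometric stand-in, not a physics simulation and not the paper's pipeline: the `Box` type, the tolerance, and the stretching rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical object extent along the gravity (z) axis, in metres."""
    name: str
    z_min: float
    z_max: float


def support_gap(obj, support):
    """Signed gap: positive means hovering, negative means penetrating."""
    return obj.z_min - support.z_max


def diagnose(obj, support, tol=1e-3):
    """Classify the contact state and return the signed gap as the signal."""
    gap = support_gap(obj, support)
    if gap > tol:
        return "hovering", gap
    if gap < -tol:
        return "penetrating", gap
    return "supported", gap


def stretch_to_contact(obj, support):
    """Move the object's bottom face onto the support, keeping its top fixed."""
    return Box(obj.name, support.z_max, obj.z_max)


table = Box("table", 0.0, 0.5)
cup = Box("cup", 0.52, 0.62)
state, gap = diagnose(cup, table)
fixed = stretch_to_contact(cup, table)
print(state, round(gap, 3), diagnose(fixed, table)[0])
```

A real engine would also report lateral instability and multi-object contacts; the sketch only shows how a diagnosis becomes a quantity that a correction step can act on.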

2606.03992 2026-06-03 cs.CV cs.RO (version update)

Exploring Easy Boosts for Lidar Semantic Scene Completion

Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy, Renaud Marlet, Raoul de Charette

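Affiliations: Inria, France; valeo.ai, France; ETH Zurich, Switzerland; LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris, France

AI summary: Studies "free lunch" strategies that need no complex architectural redesign: adding semantic pseudo-labels and visibility information to the input point cloud significantly improves lidar semantic scene completion, letting older models compete with and even surpass state-of-the-art systems.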

Comments Accepted to ICIP 2026

Abstract

This paper investigates "free lunch" strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesigns. We first demonstrate that endowing input point clouds with semantic pseudo-labels from off-the-shelf segmentors significantly improves the performance of existing architectures. By evaluating these models against an oracle, we establish that high-quality semantic priors are a primary driver of mIoU gains. Furthermore, we equip the input lidar scan with visibility information that distinguishes between empty and unknown spaces, which provides a secondary performance boost across the tested architectures. Using these simple enhancements, we observe that older models remain competitive with state-of-the-art systems, and can even outperform them. Our code is available at https://github.com/astra-vision/SSC-Priors.
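
Since the gains are reported in mIoU, a minimal sketch of the metric over flattened voxel labels may help. The ignore label of 255 and the choice to skip classes absent from both prediction and ground truth are common conventions assumed here, not details taken from the paper.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean intersection-over-union over classes that occur at least once.

    `pred` and `target` are flat sequences of integer class ids.
    Voxels whose target equals `ignore_label` are skipped, and
    predictions are assumed to lie in [0, num_classes).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 255]
print(mean_iou(pred, target, num_classes=3))
```

In the example, class 0 scores 1/2 and class 1 scores 2/3, while class 2 is excluded because its only voxel carries the ignore label.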

2606.03990 2026-06-03 cs.LG cs.CL cs.CV (version update)

Neuron Populations Exhibit Divergent Selectivity with Scale

Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

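Affiliations: UC Berkeley; TTIC

AI summary: Analyzes the distribution and properties of Rosetta Neurons across model scales, finding that their number follows a sublinear power law and their selectivity increases with scale while non-Rosetta neurons remain less selective; proposes an analytical model balancing feature utility against neuron capacity to explain this polarization.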

Comments Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
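
A sublinear power law, count = c * size^alpha with alpha < 1, implies exactly the pattern described: the absolute count grows while the fraction falls as size^(alpha - 1). The sketch below fits such a law by least squares in log-log space. The data points are synthetic, and the exponent 0.6 and prefactor 2 are invented for illustration, not values from the paper.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = log(c) + alpha * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - alpha * mx)
    return c, alpha


# Synthetic example: counts generated from an assumed law 2 * N^0.6.
sizes = [1e8, 1e9, 1e10]
counts = [2.0 * s ** 0.6 for s in sizes]
c, alpha = fit_power_law(sizes, counts)
fractions = [k / s for k, s in zip(counts, sizes)]
print(alpha, c, fractions)
```

With any fitted alpha below 1, each 10x increase in size multiplies the count by 10^alpha but multiplies the fraction by 10^(alpha - 1), which is less than one.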

2606.03989 2026-06-03 cs.CV (version update)

PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

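Affiliation: Department of Computing, Imperial College London

AI summary: Proposes a pixel-level distributed visual odometry and depth estimation method based on Gaussian belief propagation, using a keyframe anchoring mechanism to enable parallel on-sensor computation.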

Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: https://www.shinjeongkim.com/pixvod/
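
The consensus step at the heart of Gaussian belief propagation is a product of Gaussians, which is simplest in information form: precisions add, and so do precision-weighted means. The sketch below shows only that scalar fusion step at a single variable; it omits factor-to-variable message computation and is not the paper's factor graph.

```python
def fuse(beliefs):
    """Fuse independent 1D Gaussian beliefs given as (mean, variance) pairs.

    Returns the (mean, variance) of their normalized product.
    """
    precision = sum(1.0 / v for _, v in beliefs)
    information = sum(m / v for m, v in beliefs)
    return information / precision, 1.0 / precision


# Two equally confident estimates meet in the middle.
print(fuse([(1.0, 1.0), (3.0, 1.0)]))
# A confident estimate dominates a vague one.
print(fuse([(0.0, 4.0), (10.0, 1.0)]))
```

Every fused belief is at least as certain as its most certain input, which is why repeated local exchanges can settle on a shared estimate such as camera motion.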

2606.03986 2026-06-03 cs.CV (version update)

NewtPhys: Do Foundation Models Understand Newtonian Physics?

Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

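Affiliations: Inria; Valeo.ai; MBZUAI

AI summary: Presents NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes and physics simulation, for systematically evaluating foundation models on low-level Newtonian physics reasoning; it reveals the limitations of existing models.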

Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.

2606.03985 2026-06-03 cs.RO cs.AI cs.CV (version update)

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Affiliations: Tsinghua University; Galbot Inc.; Shanghai Jiao Tong University; Peking University; Shanghai Qi Zhi Institute

AI summary: Proposes Humanoid-GPT, a GPT-style causal Transformer pre-trained on a billion-scale motion corpus for whole-body control; scaling data and model capacity yields zero-shot generalization to unseen motions and tasks.

Comments Accepted at CVPR 2026

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.
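
Causal attention means each time step may attend only to itself and earlier steps. The sketch below applies that mask to a matrix of raw attention scores and normalizes each row with a softmax. It is a generic illustration in plain Python, with no claim about the model's actual layers, heads, or tokenization of motion frames.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a T x T score matrix with future steps masked.

    Row t receives nonzero weight only on columns 0..t.
    """
    steps = len(scores)
    weights = []
    for t in range(steps):
        visible = scores[t][: t + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (steps - t - 1))
    return weights


# Uniform scores: each step spreads its weight evenly over the past.
for row in causal_attention_weights([[0.0] * 3 for _ in range(3)]):
    print(row)
```

Because no row depends on later columns, outputs computed for earlier frames stay valid as new frames arrive, which is what makes this style of model usable for online tracking.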

2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC (version update)

Formalizing the Binding Problem

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

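AI summary: Formalizes the binding problem with an information-theoretic approach, proposes a probing method that measures binding information in model representations, and runs experiments on Vision Transformers, showing that binding is a key ingredient of strong visual recognition and reasoning.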

Comments Accepted to ICML 2026

Abstract

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03971 2026-06-03 cs.CV (version update)

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

Yonghao Yu, Lang Huang, Runyi Li, Zerun Wang, Toshihiko Yamasaki

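Affiliations: The University of Tokyo; National Institute of Informatics; Peking University

AI summary: Proposes the Video-Mirai training method, in which a frozen foresight encoder extracts future information from the completed generated sequence and distills it into causal states, closing the representation-level planning gap without changing inference and improving consistency in long video generation.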

Abstract

Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: https://y0uroy.github.io/Video-Mirai.

2606.03954 2026-06-03 cs.CV cs.LG cs.RO (version update)

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

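Affiliations: Carnegie Mellon University; Mitsubishi Electric Research Laboratories; Harvard University

AI summary: Proposes the VLESA framework, which monitors human activity from egocentric video and uses a GRPO-trained goal-conditioned safety Q-filter for real-time safety intervention, achieving higher intervention accuracy on the ASIMOV-2.0 benchmark.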

Comments 18 pages, 5 tables, 5 figures

Abstract

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.
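
Goal-conditioned filtering can be pictured as scoring each candidate action against the inferred goal and discarding those below a safety threshold before picking the most likely one. Everything in the sketch is hypothetical: the score table stands in for a learned Q-filter, and the goals, actions, likelihoods, and threshold are invented to show how the same action can pass under one intent and fail under another.

```python
# Hypothetical safety scores keyed by (goal, action); a learned
# Q-filter would produce these values instead of a lookup table.
TOY_SAFETY = {
    ("cook dinner", "pick up knife"): 0.9,
    ("tidy toys", "pick up knife"): 0.1,
    ("cook dinner", "open drawer"): 0.95,
    ("tidy toys", "open drawer"): 0.95,
}


def toy_q_safe(goal, action):
    """Look up the toy safety score; unknown pairs count as unsafe."""
    return TOY_SAFETY.get((goal, action), 0.0)


def constrained_choice(scored_candidates, goal, q_safe, threshold=0.5):
    """Return the most likely action that passes the safety filter.

    `scored_candidates` is a list of (action, likelihood) pairs.
    Returns None when nothing passes, which a caller could treat
    as the cue to intervene.
    """
    safe = [(a, p) for a, p in scored_candidates if q_safe(goal, a) >= threshold]
    return max(safe, key=lambda pair: pair[1])[0] if safe else None


candidates = [("pick up knife", 0.7), ("open drawer", 0.3)]
print(constrained_choice(candidates, "cook dinner", toy_q_safe))
print(constrained_choice(candidates, "tidy toys", toy_q_safe))
```

Because the goal is an input to the score rather than baked into it, a change of inferred intent alters which actions survive without any retraining of the filter.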

2606.03951 2026-06-03 cs.CV 版本更新

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Demo2Tutorial:从人类经验到多模态软件教程

Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室)

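AI summary: Proposes Demo2Tutorial, a framework that parses human experience captured in screen recordings and interaction logs into structured multimodal tutorials for human learning and GUI-agent training; experiments show the generated tutorials surpass human-authored ones and improve task efficiency.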

Comments Accepted by CVPR 2026

详情
AI中文摘要

数字环境中的人类经验提供了大量未被充分探索的真实、未修剪的交互资源,其中包含丰富的程序性知识。我们提出了Demo2Tutorial,一个将屏幕录制和交互日志捕获的人类经验转化为结构化多模态软件教程的框架,用于同时教授人类和智能体。Demo2Tutorial首先通过专用记录器收集人类经验,然后使用多模态动作解析器解析原始经验,以重建感知、动作和意图。接着,步骤规划器将这些步骤抽象为表示目标和步骤的分层任务图。最后,教程合成器将解析后的经验转化为结构化的、可复用的图文指令。我们在一个基于官方软件文档的新基准上评估了教程生成质量。我们进一步证明,这种蒸馏表示有利于(i)人类学习,通过自动生成多模态教程,以及(ii)智能体学习,通过改进下游GUI智能体规划和泛化。实验表明,Demo2Tutorial生成的高质量教程超越了人工编写的教程,并显著优于基线方法,同时实现了更快的人类任务完成和更好的GUI智能体规划,证明从人类经验中蒸馏的结构化教程可以作为有效知识表示,促进人类学习和智能体能力。代码和数据将在 https://github.com/showlab/Demo2Tutorial 提供。

英文摘要

Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.

2606.03925 2026-06-03 cs.CV 版本更新

Adaptive Causal Alignment for High-Confidence Adversarial Training

自适应因果对齐用于高置信度对抗训练

Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

发表机构 * Department of Artificial Intelligence, Xiamen University(厦门大学人工智能系) Department of Computer Science, Emory University(埃默里大学计算机科学系) College of Information Science and Technology, Jinan University(暨南大学信息科学技术学院)

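AI summary: To address over-reliance on non-causal background correlations in high-confidence adversarial training, proposes the HICAT framework, which achieves causal alignment through a Learnable Background-Bias Estimator and an adaptive debiasing mechanism, improving robust generalization.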

详情
AI中文摘要

逆对抗训练利用高置信度预测来稳定鲁棒学习,然而我们发现了一个关键悖论:高置信度往往源于对非因果背景相关性的过拟合,而非内在对象语义。我们的研究表明,视觉上下文作为双重信号,既可以是必要的支持先验,也可以是混杂的虚假相关。这一洞察使得现有的盲目抑制策略存在缺陷,因为它们不可避免地导致严重的特征损失。为解决此问题,我们提出高置信度因果对齐训练(HICAT),一个建立语义均衡的统一框架。HICAT遵循“测量-去偏-对齐”流程,集成了可学习背景偏差估计器(LBBE)以自适应诊断上下文效用。在该诊断指导下,自适应去偏机制执行精准的logit校正,并辅以基于几何的前景logit正交增强(FLOE)损失以强制执行特征解耦。在CIFAR-10、CIFAR-100和ImageNet-1K上的大量实验表明,HICAT在不同架构(CNN和ViT)上均持续优于匹配基线,同时显著缩小了鲁棒泛化差距。

英文摘要

Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

2606.03921 2026-06-03 cs.CV 版本更新

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

GARDEN: 从RGB图像中重力对齐的解耦环境重建

Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

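AI summary: Proposes GARDEN, a framework that uses a gravity prior to reconstruct multi-view RGB images into a structured hybrid scene representation with explicit rigid bodies and a decoupled background, supporting direct physics simulation.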

详情
AI中文摘要

将多视图RGB观测转换为可用于模拟的3D环境仍然具有挑战性,因为当前的重建流程会产生没有显式物理结构的整体场景表示。它们通常定义到任意全局旋转,并将刚性前景物体与背景几何纠缠在一起,这阻碍了稳定的物理交互。现有的解决方案通常通过用检索到的CAD资产替换重建的物体来恢复交互性,但这引入了缓慢的检索和替换阶段,并削弱了场景特定的几何保真度。我们提出GARDEN,一个仅使用RGB的框架,将重建重新表述为基于物理的场景分解,并输出结构化的混合场景表示。关键思想是使用重力作为通用物理先验:我们首先将重建对齐到统一的重力视角坐标系以解决规范模糊性,然后恢复具有准确6自由度放置的物体中心刚性网格,最后通过条件3D点分类从背景中移除重复的物体几何。得到的表示结合了显式刚体和解耦背景,能够在保持视觉真实感的同时实现直接物理模拟。在模拟和真实多视图场景上的实验表明,与基于检索的基线相比,GARDEN提高了物体放置可靠性、解耦质量和渲染模拟效率。

英文摘要

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.
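The gravity-alignment step can be pictured as rotating the reconstruction so that an estimated gravity direction maps onto a fixed "down" axis. The sketch below uses the standard Rodrigues construction for the rotation between two unit vectors; the gravity estimate, the choice of -z as "down", and the helper names are assumptions for illustration and do not come from the GARDEN paper.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def _unit(v: Sequence[float]) -> List[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rotation_aligning(a: Sequence[float], b: Sequence[float]) -> Matrix:
    """Return the 3x3 rotation R such that R @ unit(a) == unit(b)."""
    a, b = _unit(a), _unit(b)
    # Rotation axis (unnormalised) and cosine of the rotation angle.
    v = [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]
    c = sum(x * y for x, y in zip(a, b))
    if c < -1.0 + 1e-9:
        raise ValueError("vectors are opposite; the rotation axis is ambiguous")
    k = 1.0 / (1.0 + c)
    vx = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    vx2 = [[sum(vx[i][m] * vx[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    # Rodrigues formula: R = I + [v]x + [v]x^2 / (1 + c).
    return [[(1.0 if i == j else 0.0) + vx[i][j] + k * vx2[i][j] for j in range(3)]
            for i in range(3)]


def apply(R: Matrix, p: Sequence[float]) -> List[float]:
    return [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]


gravity_estimate = [0.1, -0.95, 0.2]  # hypothetical estimate in the reconstruction frame
R = rotation_aligning(gravity_estimate, [0.0, 0.0, -1.0])
aligned = apply(R, _unit(gravity_estimate))
```

Aligning gravity fixes only two of the three rotational degrees of freedom; the rotation about the gravity axis remains free and has to be pinned down by some other convention.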

2606.03920 2026-06-03 cs.CV 版本更新

Benchmarking Visual State Tracking in Multimodal Video Understanding

多模态视频理解中的视觉状态追踪基准测试

Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

发表机构 * New York University(纽约大学) KAIST(韩国科学技术院)

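AI summary: Proposes the VSTAT benchmark, which evaluates visual state tracking in multimodal large language models with questions that require continuous perception and integration across the entire video stream; current models fall far below human performance, and failures stem mainly from visual perception rather than textual reasoning.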

Comments Website: https://vision-x-nyu.github.io/vstat-site/

详情
AI中文摘要

理解视频需要超越识别孤立时刻,因为人类会持续追踪实体、状态和事件。这种视觉状态追踪能力是视频理解的基础,但在当前多模态大语言模型(MLLMs)的评估中仍未得到充分探索。我们引入了视觉状态追踪基准(VSTAT),这是一个基于视频的基准,旨在诊断MLLMs的视觉状态追踪能力。VSTAT包含来自合成和真实世界视频的834个片段,配以1500个问题,这些问题无法从任何单帧或短片段中回答,需要持续感知和整合整个视频流中的事件。尽管在现有视频基准上表现强劲,我们发现最先进的MLLMs远低于人类水平,仅略高于基于答案先验的基线。为了分析这一差距,我们将MLLMs的思维轨迹与底层视频流进行比较,以理解MLLMs在VSTAT上失败的原因和时机。我们发现MLLMs在文本中正确推理和追踪,但在视觉上感知它们需要追踪的事件时失败。最后,我们的初步评估表明,最近的基于代理的方法,包括基于MLLM的视频代理和编码代理,并不能轻易解决这些失败,在VSTAT上仍然表现不佳。

英文摘要

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

2606.03915 2026-06-03 cs.CV 版本更新

PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion

PatchScene:基于体素块扩散的大规模场景补全

Qingdong Xu, Jiajun Zhu, Shilin Zhu, Xinjing He, Chao Lu, Huanran Wang, Jiyao Zhang

发表机构 * MEGVII Technology(MEGVII技术有限公司) Qianli Technology(千利技术) Peking University(北京大学) Northeastern University, China(中国东北大学) Northwest Polytechnical University, Xi’an(西北工业大学西安校区)

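AI summary: Proposes PatchScene, a patch-based voxel diffusion framework for large-scale LiDAR scene completion that combines fine-grained generation within local 3D regions, confidence-guided spatio-temporal fusion, and an Annular-Flow diffusion strategy; it achieves state-of-the-art performance on SemanticKITTI and generalizes strongly.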

Comments 10 pages, 5 figures, 5 tables

详情
AI中文摘要

我们提出了PatchScene,一种新颖的基于扩散的大规模LiDAR场景补全框架。与依赖全局潜在表示或密集体素网格的现有方法不同,PatchScene采用基于体素块的扩散范式,在局部3D区域内显式生成细粒度几何结构。为了确保在空间和时间尺度上的连贯重建,我们引入了一种置信度引导的时空融合机制,在统一的生成过程中整合重叠块和相邻帧。此外,我们设计了一种环形流扩散策略,利用LiDAR扫描的径向密度模式,将近距离区域的高保真信息逐步传播到远距离区域,从而实现空间无界的场景补全。在SemanticKITTI基准上的大量实验表明,PatchScene在所有标准指标上均达到了最先进的性能,在几何精度和时间一致性上超越了先前的方法。值得注意的是,在20米LiDAR范围上训练的模型无需重新训练即可有效推广到50米场景,突显了其在真实世界自动驾驶应用中的强大可扩展性和泛化能力。

英文摘要

We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.
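The near-to-far ordering behind the Annular-Flow idea can be sketched by grouping LiDAR points into concentric rings around the sensor and visiting the rings from the inside out. This is only an illustration of the radial partition; the ring width, the function names, and the use of horizontal range are assumptions, and the diffusion step itself is not modelled.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def annular_bins(points: Sequence[Point], ring_width: float) -> Dict[int, List[Point]]:
    """Group points by horizontal range into rings of equal width around the sensor."""
    rings: Dict[int, List[Point]] = {}
    for p in points:
        r = math.hypot(p[0], p[1])   # horizontal distance from the sensor origin
        idx = int(r // ring_width)   # ring 0 is the innermost ring
        rings.setdefault(idx, []).append(p)
    return rings


def near_to_far(rings: Dict[int, List[Point]]) -> List[List[Point]]:
    """Return the rings ordered from the innermost to the outermost."""
    return [rings[i] for i in sorted(rings)]


scan = [(1.0, 0.0, 0.0), (12.0, 5.0, 0.0), (3.0, 4.0, 0.0)]  # toy points in metres
ordered = near_to_far(annular_bins(scan, ring_width=5.0))
```

Processing the rings in this order lets a generator condition each outer ring on already completed inner rings, where LiDAR returns are densest.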

2606.03911 2026-06-03 cs.CV 版本更新

Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Bootstrap Your Generator: 基于流匹配的无配对视觉编辑

Yoad Tewel, Yuval Atzmon, Gal Chechik, Lior Wolf

发表机构 * Weizmann Institute of Science(魏茨曼科学研究院)

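AI summary: Proposes Bootstrap Your Generator (ByG), a framework for unpaired training of flow-matching image and video editing models that draws on the base model's knowledge without any external signal, achieving state-of-the-art results in data-scarce scenarios.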

Comments Accepted at ICML 2026. Project page is at https://research.nvidia.com/labs/par/byg/

详情
AI中文摘要

现代生成模型对视觉内容有深刻理解,但训练它们进行图像编辑通常需要大量配对示例数据集。这限制了可扩展性,尤其是对于视频编辑,收集配对数据成本过高。我们提出了Bootstrap Your Generator (ByG),一个用于流匹配编辑模型无配对训练的通用框架。它利用基础模型的知识,无需任何外部信号。我们的方法将从冻结模型中提取的指令遵循线索与用于结构保持的循环一致性相结合。为了使这可行,我们提出将来自干净预测的下游损失的梯度路由到噪声训练状态。我们在具有挑战性的数据稀缺图像和视频编辑场景中展示了最先进的结果。大量评估和用户研究表明,我们的方法有效泛化到未见过的领域,并优于在数百万样本上训练的监督基线。分析表明,我们的梯度路由弥合了训练-推理差距,从基础模型中提取语义线索提供了强大的训练信号,消除了对外部奖励模型的需求。

英文摘要

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.
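The relation between a noisy training state and a clean prediction in flow matching can be written out in a few lines. The sketch assumes the common linear path x_t = (1 - t) * noise + t * data with velocity target data - noise; ByG's actual parameterisation and its gradient-routing mechanism are not shown, and all names here are illustrative.

```python
import random
from typing import List, Sequence


def interpolate(noise: Sequence[float], data: Sequence[float], t: float) -> List[float]:
    """Noisy state on the straight path from noise (t = 0) to data (t = 1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise: Sequence[float], data: Sequence[float]) -> List[float]:
    """Regression target for the velocity field on the linear path."""
    return [d - n for n, d in zip(noise, data)]


def clean_estimate(x_t: Sequence[float], v_pred: Sequence[float], t: float) -> List[float]:
    """One-step estimate of the clean sample from a noisy state and a predicted velocity."""
    return [x + (1.0 - t) * v for x, v in zip(x_t, v_pred)]


rng = random.Random(0)
data = [0.2, -1.0, 0.5]
noise = [rng.gauss(0.0, 1.0) for _ in data]
t = 0.3
x_t = interpolate(noise, data, t)
# With the exact velocity, the one-step estimate recovers the data sample.
recovered = clean_estimate(x_t, velocity_target(noise, data), t)
```

A loss defined on such a clean estimate is a differentiable function of the velocity predicted at the noisy state, which is what makes it possible for downstream losses on clean predictions to train the model at noisy states.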

2606.03909 2026-06-03 cs.CV 版本更新

SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

SparseStreet: 用于实时街景模拟的稀疏高斯泼溅

Qingpo Wuwu, Xiaobao Wei, Peng Chen, Nan Huang, Zhongyu Zhao, Hao Wang, Ming Lu, Ningning Ma, Shanghang Zhang

发表机构 * Peking University(北京大学) Chinese Academy of Sciences(中国科学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Autonomous Driving Development, NIO(蔚来自动驾驶开发)

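AI summary: To address redundant Gaussian primitives in street scene reconstruction, proposes a framework combining node-based learnable pruning with background compression, achieving a compression ratio of up to 80% with minimal quality loss.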

详情
AI中文摘要

尽管3D高斯泼溅在街景重建中显示出有希望的结果,现有方法需要大量高斯原语来捕捉细节,导致存储成本过高和渲染速度缓慢。我们观察到动态对象(如车辆和行人)需要高保真表示以保持时间一致性,而静态背景区域通常包含大量冗余。受此启发,我们提出SparseStreet,一种专为街景设计的通用压缩框架。首先,我们引入基于节点的可学习剪枝策略,系统性地移除低贡献高斯原语,同时保留视觉关键区域。其次,在场景表示稳定后,我们应用背景压缩,进一步减少静态区域中的冗余。我们的方法有效保留了动态对象的几何和外观,同时显著减少了高斯原语的总数。在Waymo和nuScenes上的大量实验表明,SparseStreet实现了高达80%的压缩比,且质量退化极小,实现了资源高效、高保真的动态场景重建。项目网站:https://sparsestreet.github.io/。

英文摘要

While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on the Waymo and nuScenes datasets demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: https://sparsestreet.github.io/.

2606.03904 2026-06-03 cs.LG cs.CV 版本更新

MAdam: Metric-Aware Multi-Objective Adam

MAdam: 度量感知的多目标Adam

Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang, Ruining Deng, Heejong Kim, Johannes C. Paetzold, Mert R. Sabuncu

发表机构 * Cornell Tech(康奈尔科技) Weill Cornell Medicine(韦尔医学院) Delft University of Technology(代尔夫特理工大学)

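AI summary: Proposes MAdam, which preconditions the reconciled direction in multi-objective optimization by preference-conditioned curvature, addressing the weighting and geometric mismatches between Adam and the solver, and consistently improves performance in multi-task learning, Pareto-front recovery, and other settings.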

详情
AI中文摘要

多目标优化是许多机器学习问题的基础,然而跨损失平衡、梯度平衡和基于帕累托的求解器家族几乎都将它们协调后的方向交给Adam处理。我们表明这种耦合在求解器的意图和优化器的执行之间引入了两个系统性差距。第一个是权重失配:Adam的二阶矩分母将时变偏好向量与梯度统计量纠缠在一起,将偏好边缘化为历史平均值,并将不同的帕累托权衡压缩为近乎均匀的混合。第二个是几何失配:Adam的自适应度量扭曲了多目标优化求解器假设的欧几里得几何,将对齐的目标转化为明显的冲突。为了共同解决这两个问题,我们引入了MAdam(度量感知的多目标Adam),这是一个即插即用的包装器,不改变求解器和优化器。MAdam通过标量化目标的偏好条件曲率对协调方向进行预处理;在此白化输入上,Adam的二阶矩退化为单位矩阵,因此实际更新由偏好条件度量主导。在多任务学习、帕累托前沿恢复、物理信息神经网络和医学成像中,MAdam在每个求解器家族上都一致优于Adam。

英文摘要

Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma and Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.
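One simple way to see how Adam's denominator can wash out preference weights is its scale invariance: dividing by the root of the second moment cancels any constant factor on the gradient. The toy below is only the simplest facet of the weighting mismatch, under the assumption of two objectives acting on separate scalar parameters with constant gradients; it does not reproduce the paper's analysis of time-varying preferences, and the weights and step count are arbitrary.

```python
import math
from typing import List, Sequence


def adam_updates(grads: Sequence[float], lr: float = 1e-3, beta1: float = 0.9,
                 beta2: float = 0.999, eps: float = 1e-12) -> List[float]:
    """Run bias-corrected Adam on one scalar parameter and return each step's update."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# Hypothetical preference weights applied to two objectives with equal raw gradients.
preference = {"task_a": 0.9, "task_b": 0.1}
raw_gradient = 1.0
steps = 50
last_update = {
    name: adam_updates([w * raw_gradient] * steps)[-1]
    for name, w in preference.items()
}
```

Both parameters end up moving by roughly the learning rate per step, so the 9:1 preference set by the solver is not visible in the realized updates.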

2606.03903 2026-06-03 cs.CV 版本更新

An Attention-Based Denoising Model for Diffusion Weighted Imaging

一种基于注意力的扩散加权成像去噪模型

Prithviraj Verma, Pawan Kumar, Chandan Deshani, Prasun Chandra Tripathi

发表机构 * Institute of Infrastructure Technology Research and Management (IITRAM)(基础设施技术研究与管理研究所) University of Sheffield(谢菲尔德大学)

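AI summary: Proposes a noise-aware, attention-driven denoising framework that combines Swin Transformer window attention with multi-dimensional gated refinement to handle signal-dependent Rician noise in DWI, achieving a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%.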

详情
AI中文摘要

扩散加权成像(DWI)用于全身癌症筛查,但通常需要较长的采集时间。当扫描时间减少时,图像质量往往会下降,导致扫描中的噪声增加。DWI中的幅度重建引入了信号依赖的Rician噪声,这使得传统的基于卷积的方法去噪更具挑战性。为了解决这一限制,我们提出了一种噪声感知的注意力驱动去噪框架,该框架将分层Swin Transformer窗口注意力与基于transformer的多维门控精化相结合,用于DWI恢复。该模型结合了显式噪声水平调节和残差重建,以实现对广泛损坏水平下异方差噪声的自适应抑制。在损坏的DWI扫描上的实验评估显示了强大的恢复性能。我们的模型在1%至15%的噪声水平下实现了平均PSNR 33.69 dB和SSIM 0.8539,同时在严重噪声条件下保持稳定行为。这些结果表明,注意力引导的上下文建模与通道自适应精化相结合,为DWI去噪提供了稳健且可推广的解决方案。

英文摘要

Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
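Rician noise arises when the magnitude is taken of a complex signal whose real and imaginary parts each carry Gaussian noise. The sketch below simulates that corruption and computes PSNR in pure Python; interpreting a "5% noise level" as a standard deviation of 0.05 relative to a peak of 1.0 is an assumption, as are the function names, and nothing here reflects the paper's data pipeline.

```python
import math
import random
from typing import List, Sequence


def add_rician_noise(signal: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Magnitude of the signal after adding Gaussian noise to real and imaginary channels."""
    return [math.hypot(s + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for s in signal]


def psnr(reference: Sequence[float], estimate: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)
sigma = 0.05                      # assumed meaning of a 5% noise level (peak = 1.0)
background = [0.0] * 10000        # zero-signal region
noisy_background = add_rician_noise(background, sigma, rng)
# In a zero-signal region the noise does not average to zero: the magnitude is
# Rayleigh distributed with mean sigma * sqrt(pi / 2).
mean_background = sum(noisy_background) / len(noisy_background)
```

The positive bias in dark regions, which shrinks as the underlying signal grows, is what makes the noise signal-dependent and distinguishes it from the additive zero-mean Gaussian case.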

2606.03893 2026-06-03 cs.CV 版本更新

Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration

基于高精度X光到CT配准的股骨截骨电磁导航

Roman Flepp, Arend Nieuwland, Bastian Sigrist, Philipp Fürnstahl, Lilian Calvet, Thomas Dreher

发表机构 * Department of Pediatric Orthopedics and Traumatology, University Children’s Hospital Zürich(苏黎世大学儿童医院小儿骨科与创伤外科部门) Research in Orthopedic Computer Science, University Hospital Balgrist, University of Zurich(骨科计算机科学研究所,巴尔格里斯大学医院,苏黎世大学) Department of Orthopedic Surgery, University Hospital Balgrist, University of Zurich(骨科外科部门,巴尔格里斯大学医院,苏黎世大学)

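AI summary: Proposes an electromagnetic-tracking-based navigation system for femoral osteotomy that achieves real-time, fluoroscopy-free navigation from a one-time intraoperative C-arm calibration and registration of two X-ray images; in experiments on synthetic femora, its total angular error is significantly lower than free-hand execution and its accuracy is equivalent to patient-specific instrumentation.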

Comments Will be published in the International Journal of Computer Assisted Radiology and Surgery

详情
AI中文摘要

矫正性股骨截骨术中准确执行术前计划仍具挑战。当前技术受限于精度不一、侵入性和辐射暴露,徒手方法和患者特异性器械(PSI)分别通常需要>30和>6次荧光透视图像。我们提出一种集成的、基于电磁跟踪(EMT)的股骨截骨导航系统,可最小化剥离和术中荧光透视。该系统将基于CT的术前计划与一次性术中C臂标定以及从初始化时获取的两幅X光图像进行精确的X光到CT配准相结合。这使得锯片和骨碎片相对于术前计划的实时、无荧光EMT导航成为可能,并兼容单平面和双平面截骨。在使用18个合成股骨的可行性研究中,EMT引导在总角度误差上显著优于徒手执行($(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$,$p=0.031$),假设两者具有相同的最小手术暴露。EMT引导试验均未超过5°的临床阈值,而徒手6次试验中有4次异常值。该系统在总角度误差($p \le 0.02$)和总平移误差($p=0.048$)上与PSI达到统计等效($\pm 2^\circ$,$\pm 2\,\text{mm}$),用户问卷评分无显著差异。通过仅使用两幅X光图像转移术前计划,同时匹配PSI精度且无需额外手术暴露,所提出的系统为后续尸体和临床验证提供了动力。

英文摘要

Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring >30 and >6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ($(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$, $p=0.031$), assuming the same minimal surgical exposure for both. No EMT-guided trials exceeded the 5° clinical threshold, whereas free-hand execution produced 4 outliers in 6 trials. The system achieved statistical equivalence ($\pm 2^\circ$, $\pm 2\,\text{mm}$) to PSI for total angular ($p \le 0.02$) and total translational ($p=0.048$) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

2606.03890 2026-06-03 cs.CV 版本更新

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

OVO-S-Bench:多模态大语言模型中流式空间智能的分层基准

Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

发表机构 * Tsinghua University(清华大学) Shanghai AI Laboratory(上海人工智能实验室) Beihang University(北京航空航天大学)

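AI summary: Proposes OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence with 1,680 questions across four levels of abstraction; an evaluation of 38 MLLMs finds that Gemini-3.1-Pro trails human experts by 27 points and that streaming and spatially fine-tuned MLLMs underperform their own backbones.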

Comments 48 pages, 12 figures, 15 tables. Project page: https://internlm.github.io/OVO-S-Bench/

详情
AI中文摘要

机器人、增强现实和自动驾驶中的多模态智能体必须从连续的自我中心流中推理地点和布局,通常使用当前视野之外的证据。现有基准要么在完整视频上进行离线评估,要么针对事件而非空间结构。我们引入了OVO-S-Bench,一个完全人工标注的流式空间智能基准,包含来自348个源视频的1680个问题。标注涉及12名训练有素的标注员,每人还担任盲审交叉评审,耗时约804人小时的多轮质量保证。每个问题带有一个查询时间戳和一个证据区间,评估时模型仅看到查询之前的前缀。问题涵盖四个抽象层次:瞬时自我中心感知、时空上下文跟踪、空间模拟与推理、以及异中心映射。在38个专有和开源MLLM中,Gemini-3.1-Pro落后人类专家27分(59.2 vs. 86.6),异中心映射是主要瓶颈。值得注意的是,流式和空间微调的MLLM表现不如其骨干模型。我们进一步发现,当链式思维推理未基于流时,会放大空间错误。通过暴露这些局限性,OVO-S-Bench为下一代流式空间MLLM建立了一个高要求的测试平台。

英文摘要

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.
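The streaming protocol, in which the model sees only the prefix preceding the query, amounts to truncating the frame sequence at the query timestamp. The sketch below shows that truncation; the frame representation, the strict "before" comparison, and the function name are assumptions and are not taken from the benchmark's evaluation code.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, str]  # (timestamp in seconds, frame identifier)


def visible_prefix(frames: Sequence[Frame], query_time: float) -> List[Frame]:
    """Return only the frames captured strictly before the query timestamp."""
    return [f for f in frames if f[0] < query_time]


video = [(0.0, "f0"), (1.0, "f1"), (2.0, "f2"), (3.0, "f3")]
context = visible_prefix(video, query_time=2.0)
```

Anything the question depends on must therefore lie in the returned prefix, even if it is no longer in the current view at query time.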

2606.03888 2026-06-03 cs.CV cs.LG 版本更新

CoralBay: A Self-Supervised CT Foundation Model

CoralBay: 一种自监督CT基础模型

Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang

发表机构 * kaiko.ai(Kaiko AI)

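AI summary: Proposes CoralBay, a framework that learns multi-scale features with a hierarchical 3D Swin backbone and self-distillation, enabling self-supervised pre-training on volumetric CT data and improving performance on downstream radiology tasks.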

详情
AI中文摘要

自监督学习已在2D自然图像上实现了大规模预训练,产生了跨任务有效迁移的通用视觉表示。然而,许多医学成像模态(如CT扫描)本质上是三维的,在结构和语义上与自然图像根本不同。体积模态捕捉空间连续性、器官解剖和基于强度的组织特性(如亨氏单位),这些无法通过2D预训练充分建模。为弥补这一差距,我们引入了CoralBay,一种自蒸馏框架,通过使用分层3D Swin骨干网络并将自蒸馏应用于拼接的多尺度特征,扩展了DINO,实现了数据高效的自监督学习,编码了全局语义和细粒度局部结构的丰富空间表示。因此,CoralBay有效迁移到广泛的下游放射学任务,在多样化的解剖目标上展现出强大且一致的性能。此外,我们通过引入一个公开、可复现的3D放射学排行榜,为开源eva框架做出贡献,该排行榜统一了多个数据集,并建立了评估体积表示学习方法的标准化基准。

英文摘要

Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.
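Hounsfield Units give CT intensities a physical meaning (air is near -1000 HU and water is 0 HU), so CT volumes are commonly clipped to a fixed HU window and rescaled before being fed to a network. The sketch below shows that windowing step; the window bounds and the function name are assumptions and are not CoralBay's actual preprocessing.

```python
from typing import List, Sequence


def window_hu(hu_values: Sequence[float], low: float = -1000.0, high: float = 1000.0) -> List[float]:
    """Clip Hounsfield Units to [low, high] and rescale linearly to [0, 1]."""
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in hu_values]


# Toy voxel intensities: below the window, water, top of the window, above the window.
normalised = window_hu([-2000.0, 0.0, 1000.0, 3000.0])
```

Unlike per-image min-max normalisation of natural images, a fixed window keeps the same tissue at the same input value across scans.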

2606.03879 2026-06-03 cs.CV cs.AI 版本更新

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

超越编码器累加:衡量多编码器视觉语言模型中编码器的作用

Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

发表机构 * Tsinghua University(清华大学) Tencent(腾讯) University of Macau(澳门大学) University of Science and Technology Beijing(北京科技大学)

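AI summary: By retraining all 31 non-empty encoder subsets, proposes a Capacity-Necessity decomposition and a pre-projector rank analysis, showing that encoder roles in multi-encoder vision-language models are not simply additive and offering principles for choosing the best pair.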

详情
AI中文摘要

随着基础模型向融合更多异构视觉流扩展,理解不同编码器在联合训练下的交互成为原则性设计的前提。然而,大型视觉语言模型目前缺乏相应的工具,且参数高效的编码器配置在训练前难以识别。为了重新审视联合训练下的编码器角色,我们在16基准的Cambrian-1套件上,在统一流程下重新训练并评估了五个常见视觉编码器的所有31个非空子集(总计约2万GPU小时),并报告了三个发现。首先,从头重新训练每个子集揭示了与在固定检查点上掩码编码器所得不同的编码器排名,包括哪个编码器整体排名第一。其次,我们将每个编码器的贡献分解为两个维度:容量(编码器自身达到的分数)和必要性(从完整池中移除时的下降)。这两个维度不可互换。配对两个最高容量的编码器是次优的,而将一个高容量锚点与一个自适应补充配对则匹配完整的五编码器模型。在此配对之外添加更多编码器仅带来边际收益。第三,在固定参数数量下,每个编码器的预投影器有效秩解释了残差分数变化。最强的配对结合了一个秩在联合训练中存活的锚点和一个秩在联合训练下扩展的补充,这表明更高秩、更少坍缩的投影器输入对应着编码器-投影器接口处更有利的优化机制。总之,容量-必要性分解和预投影器秩分析,连同通过重新训练进行的全面评估,揭示了多编码器视觉语言模型设计中的方法论差距,并提供了弥补这一差距的具体原语。

英文摘要

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
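The count of 31 follows from five encoders having 2^5 - 1 non-empty subsets, and the two axes can be read off a table of per-subset scores. The sketch below enumerates the subsets and computes Capacity and Necessity as defined in the abstract; the encoder names and the toy scoring rule are placeholders with no relation to the paper's measurements.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence

ENCODERS = ["enc_a", "enc_b", "enc_c", "enc_d", "enc_e"]  # hypothetical names


def non_empty_subsets(items: Sequence[str]) -> List[FrozenSet[str]]:
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]


def capacity(scores: Dict[FrozenSet[str], float], enc: str) -> float:
    """Score the encoder reaches on its own."""
    return scores[frozenset([enc])]


def necessity(scores: Dict[FrozenSet[str], float], enc: str, pool: Sequence[str]) -> float:
    """Drop in score when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {enc}]


# Toy scoring rule, purely a placeholder for retraining and evaluating each subset.
TOY_BASE = {"enc_a": 5.0, "enc_b": 4.0, "enc_c": 3.0, "enc_d": 2.0, "enc_e": 1.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return max(TOY_BASE[e] for e in subset) + 0.5 * (len(subset) - 1)


subsets = non_empty_subsets(ENCODERS)
scores = {s: toy_score(s) for s in subsets}
```

Because Necessity is measured against the full pool while Capacity is measured in isolation, an encoder can rank high on one axis and low on the other, which is why the two are reported separately.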

2606.03877 2026-06-03 cs.CV 版本更新

MLP Splatting: Object-Centric Neural Fields

MLP Splatting: 以对象为中心的神经场

Shinjeong Kim, Yuzhou Cheng, Xin Kong, Paul H. J. Kelly, Andrew J. Davison

发表机构 * Department of Computing, Imperial College London(帝国理工学院伦敦分校计算机系)

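AI summary: Proposes MLP-Splatting, which achieves scene decomposition and novel-view synthesis with a small number of compact MLP primitives, supports object-level editing, and is more memory- and rendering-efficient than existing methods.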

详情
AI中文摘要

3D表示对于场景渲染、理解和交互至关重要。最近的方法,如3D高斯泼溅和神经辐射场,实现了令人印象深刻的光照真实感新视角合成,但缺乏将场景元素轻松分解为少数原语的能力,需要额外的分割或分组才能进行对象级操作。我们提出了MLP-Splatting,一种通过少量富有表现力的光场原语实现场景分解,同时提供光照真实感新视角合成的方法。MLP-Splatting将每个原语建模为一个独立的紧凑MLP,具有局部空间支持,预测辐射度和不透明度。与低级高斯原语或单个全局辐射场相比,我们的神经原语提供了更大的表达能力,同时保持空间局部性。通过高效的光线-原语交互稀疏体积合成进行渲染。我们的原语仅使用RGB监督进行训练,这产生了代表局部场景区域(通常对应于对象或对象部分)的原语,通过选择少量原语即可实现无需分割掩码的交互式对象级编辑。我们的方法辅以可选的语义特征蒸馏,支持开放词汇场景交互和开放集实例分割。与最先进的方法相比,我们在实验中表明,与语义3DGS方法相比,我们实现了显著更低的内存使用(1/15倍)和更快的渲染(3倍)。项目页面:https://shinjeongkim.com/mlp-splatting

英文摘要

3D representations are fundamental to scene rendering, understanding, and interaction. Recent approaches, such as 3D Gaussian Splatting and Neural Radiance Fields, achieve impressive photorealistic novel-view synthesis, but lack the ability to easily decompose scene elements into a few primitives, requiring additional segmentation or grouping for object-level manipulation. We present MLP-Splatting, a method that enables scene decomposition via a few expressive light-field primitives while providing photorealistic novel-view synthesis. MLP-Splatting models each primitive as an independent compact MLP with localized spatial support that predicts radiance and opacity. In contrast to low-level Gaussian primitives or a single global radiance field, our neural primitives provide greater expressive capacity while remaining spatially localized. Rendering is performed through efficient sparse volumetric compositing over ray-primitive interactions. Our primitives are trained with RGB supervision alone, which yields primitives that represent local scene regions often corresponding to objects or object parts, enabling interactive object-level editing without segmentation masks by selecting a handful of primitives. Our method, augmented with optional semantic feature distillation, enables open-vocabulary scene interaction and open-set instance segmentation. Compared to state-of-the-art semantic 3DGS methods, our experiments show substantially lower memory usage (1/15×) and faster rendering (3×). Project Page: https://shinjeongkim.com/mlp-splatting
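Volumetric compositing along a ray can be sketched with the standard front-to-back alpha blending rule, where each sample contributes its colour weighted by its opacity and by the transmittance left over from the samples in front of it. The sketch assumes each primitive hit by the ray has already produced an (opacity, colour) pair; how MLP-Splatting evaluates its MLP primitives or selects the sparse set of hits is not shown.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, Tuple[float, float, float]]  # (opacity, RGB colour)


def composite(samples: Sequence[Sample]) -> Tuple[Tuple[float, float, float], float]:
    """Front-to-back compositing; returns the ray colour and the remaining transmittance."""
    color: List[float] = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for alpha, rgb in samples:           # samples must be sorted near to far
        weight = transmittance * alpha   # light that reaches this sample and is absorbed by it
        for i in range(3):
            color[i] += weight * rgb[i]
        transmittance *= 1.0 - alpha
    return (color[0], color[1], color[2]), transmittance


# A half-transparent red sample in front of an opaque blue one.
ray_color, leftover = composite([(0.5, (1.0, 0.0, 0.0)), (1.0, (0.0, 0.0, 1.0))])
```

Since each primitive enters the sum as its own term, removing or moving one primitive changes only its contribution, which is what makes primitive-level editing straightforward.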

2606.03875 2026-06-03 cs.CV 版本更新

Seg2Track++: Probabilistic Track Validation and Data Association for Multi-Object Tracking and Segmentation

Seg2Track++: 用于多目标跟踪与分割的概率轨迹验证与数据关联

Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes

发表机构 * University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering(科英布拉大学,系统与机器人研究所,电气与计算机工程系)

AI总结 提出Seg2Track++框架,结合SAM2实例分割与概率轨迹验证,实现零样本多目标跟踪与分割,提升身份保持并抑制假阳性传播。

详情
AI中文摘要

自主系统需要鲁棒的多目标跟踪与分割(MOTS)以在动态环境中可靠运行,确保一致的目标身份和精确的掩码级描绘。SAM2等基础模型在分割方面表现出强大的零样本泛化能力,但其直接应用于MOTS受到不可靠的轨迹关联和假阳性传播的限制。本文介绍Seg2Track++,一个将实例分割与SAM2及新颖的轨迹管理模块相结合的框架,以执行具有增强时间一致性的零样本MOTS。轨迹通过掩码质心距离(MCD)和置信度感知成本调制(CCM)进行关联,而概率轨迹验证(PTV)采用伯努利滤波器验证轨迹存在并抑制鬼影轨迹。在KITTI MOTS上的实验结果表明,无需微调即可改善身份保持、减少假阳性传播并实现鲁棒的轨迹管理。

英文摘要

Autonomous systems require robust Multi-Object Tracking and Segmentation (MOTS) to operate reliably in dynamic environments, ensuring consistent object identities and precise mask-level delineation. Foundation models such as SAM2 have shown strong zero-shot generalization for segmentation, but their direct application to MOTS is limited by unreliable track association and false-positive propagation. This work introduces Seg2Track++, a framework that integrates instance segmentation with SAM2 and a novel track management module to perform zero-shot MOTS with enhanced temporal consistency. Tracks are associated using Mask Centroid Distance (MCD) and Confidence-Aware Cost Modulation (CCM), while Probabilistic Track Validation (PTV) employs a Bernoulli filter to validate track existence and suppress ghost tracks. Experimental results on KITTI MOTS demonstrate improved identity preservation, reduced false-positive propagation, and robust track management without fine-tuning.
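
The association and validation steps described above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the exact formulas, so the centroid distance below is a plain Euclidean distance between mask centroids, and the existence update is a generic binary Bayes update in the spirit of a Bernoulli filter. The function names and the probabilities `p_detect`, `p_false`, and `p_survive` are assumptions chosen for illustration.

```python
import math


def mask_centroid(mask):
    """Centroid (row, col) of a binary mask given as a list of lists of 0/1."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("empty mask has no centroid")
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def centroid_distance(mask_a, mask_b):
    """Euclidean distance between the centroids of two masks (assumed cost term)."""
    (ra, ca), (rb, cb) = mask_centroid(mask_a), mask_centroid(mask_b)
    return math.hypot(ra - rb, ca - cb)


def update_existence(r, detected, p_detect=0.9, p_false=0.1, p_survive=0.99):
    """One predict/update step for a track's existence probability r.

    Hypothetical parameters:
      p_detect  - probability that an existing object yields a matched mask
      p_false   - probability that a non-existing track still gets a match
      p_survive - probability that an existing object persists to the next frame
    """
    prior = p_survive * r  # prediction step (no birth term in this sketch)
    if detected:
        num = p_detect * prior
        den = num + p_false * (1.0 - prior)
    else:
        num = (1.0 - p_detect) * prior
        den = num + (1.0 - p_false) * (1.0 - prior)
    return num / den if den > 0 else 0.0
```

Under these assumptions, a track that keeps missing detections sees its existence probability shrink at every frame, so a ghost track can be dropped once the value falls below a chosen threshold.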

2606.03874 2026-06-03 cs.CV cs.RO 版本更新

DyaPlex: Full-Duplex Speech-Motion Model for Dyadic Interaction

DyaPlex: 用于二元交互的全双工语音-运动模型

Koki Nagano, Hongyu Liu, Seonwook Park, Tianye Li, Amrita Mazumdar, Christian Jacobsen, Shengze Wang, Michael Stengel, Rajarshi Roy, Ka Chun Cheung, Simon See, Shalini De Mello

发表机构 * NVIDIA HKUST(香港科技大学)

AI总结 提出DyaPlex,一种流式全双工语音-运动模型,通过双塔Transformer架构和统一二元令牌交织机制,实现同步多模态交互,在单体和二元交互基准上达到最优性能。

Comments Project page: https://research.nvidia.com/labs/amri/projects/DyaPlex

详情
AI中文摘要

我们提出了DyaPlex,一种用于二元交互的流式全双工语音-运动模型。为了捕捉人类交流的连续性和互惠性,这种全双工能力使智能体能够以流式方式同时感知和生成语音及物理运动。其核心在于,我们的方法利用了基础全双工语音模型的强先验,并集成了新颖的运动通路,从而实现完全同步的多模态交互。具体来说,我们设计了一种双塔Transformer架构,在保持冻结基础语音模型的零样本对话推理能力的同时,构建了深度耦合的流式运动通路。通过引入统一的二元令牌交织机制,并借助时间对齐的语音-运动RoPE引导交叉注意力,我们的模型有效地将自回归运动与丰富的潜在语音特征对齐。在4000小时的Seamless Interaction数据集上训练,我们的模型有效捕捉了跨说话者依赖关系,并在单体和二元人类交互基准上建立了新的最优性能。

英文摘要

We present DyaPlex, a streaming, full-duplex speech-and-motion model designed for dyadic interaction. To capture the continuous and reciprocal nature of human communication, this full-duplex capability empowers the agent to simultaneously perceive and generate both speech and physical motion in a streaming fashion. At its core, our method leverages the strong priors of a foundational full-duplex speech model and integrates a novel motion pathway, thereby achieving fully synchronized multi-modal interaction. Specifically, we design a dual-tower Transformer architecture that preserves the zero-shot conversational reasoning of a frozen base speech model while constructing a deeply coupled, streaming motion pathway. By introducing a unified dyadic token interleaving mechanism and guiding cross-attention via a time-aligned speech-motion RoPE, our model effectively aligns autoregressive motions with rich latent speech features. Trained on the 4,000-hour Seamless Interaction dataset, our model effectively captures cross-speaker dependencies and establishes new state-of-the-art performance across both monadic and dyadic human interaction benchmarks.

2606.03871 2026-06-03 cs.CV cs.CL cs.LG 版本更新

Visual Instruction Tuning Aligns Modalities through Abstraction

视觉指令调优通过抽象对齐模态

Luis Palacios, Lorenzo Basile, Diego Doimo, Alberto Cazzaniga

发表机构 * Area Science Park, Trieste, Italy(意大利的里雅斯特Area Science Park)

AI总结 通过探针分析和因果干预,发现视觉指令调优将视觉特征直接嵌入LLM的中间语义层,绕过早期单模态处理层,并通过扩展和强化现有抽象阶段对齐视觉与文本表示。

详情
AI中文摘要

视觉指令调优有效地使预训练的大语言模型(LLM)能够同时处理图像信息和文本。然而,视觉特征如何嵌入到LLM骨干网络的逐层抽象层次中仍不清楚。通过一系列不同的视觉-语言架构,我们表明指令调优主要充当桥梁,将视觉特征直接嵌入到LLM的中间语义层,绕过了用于单模态处理的早期层。通过探针分析和因果干预,我们表明这些中间层是视觉-语言处理的语义核心,并在广泛的多模态基准测试中发挥关键作用。此外,通过比较语义等价的视觉和文本表示的几何结构,我们发现微调扩展并强化了现有的抽象阶段,使视觉特征与已有的文本特征对齐。最后,我们通过将微调限制在中间层来确认这种局部对齐的功能作用:该策略在视觉中心基准测试中保持了全微调的性能,同时减少了训练时间。我们的结果表明,多模态集成是一种局部现象,由LLM内部抽象引擎的重新利用驱动。

英文摘要

Visual instruction tuning effectively adapts a pre-trained Large Language Model (LLM) to process image information alongside text. Yet, it remains unclear how visual features are embedded into the layer-wise hierarchy of abstractions of the LLM backbone. Across a diverse set of vision-language architectures, we show that instruction tuning primarily serves as a bridge, embedding visual features directly into the intermediate semantic layers of the LLM, bypassing the early layers devoted to unimodal processing. With probing analyses and causal interventions, we show that these intermediate layers are the semantic core of vision-language processing and play a critical role in the performance on a broad set of multimodal benchmarks. In addition, by comparing the geometry of semantically equivalent visual and textual representations, we find that fine-tuning extends and strengthens the existing abstraction phase, aligning visual features with pre-existing textual ones. Finally, we confirm the functional role of this localized alignment by restricting fine-tuning to intermediate layers alone: this strategy preserves the performance of full fine-tuning on vision-centric benchmarks while reducing training time. Our results suggest that multimodal integration is a localized phenomenon driven by the repurposing of the internal abstraction engine of the LLM.

2606.03868 2026-06-03 cs.CV 版本更新

Unified Video-Action Joint Denoising for Dexterous Action and Data Generation

统一视频-动作联合去噪用于灵巧动作与数据生成

Dingrui Wang, YuAn Wang, Jinkun Liu, Yue Zhang, Mattia Piccinini, Yu Sun, Johannes Betz

发表机构 * Technical University of Munich(慕尼黑工业大学) ByteDance(字节跳动) Tsinghua University(清华大学)

AI总结 提出Donk模型,通过联合建模交互视频与手部轨迹的分布,实现灵巧手的动作生成与数据增强。

Comments 9 pages, 5 figures

详情
AI中文摘要

最近的世界动作模型通过将广泛的视觉动态先验与可执行的机器人动作对齐来利用视频基础模型。我们从分布的角度重新审视这种对齐。现有的公式通常将对齐的先验缩小为基于观测的未来动作策略分布。相比之下,我们通过在多条件机制下对交互视频和可执行手部轨迹的联合空间进行建模,保持更广泛的分布。我们提出了Donk,一个用于灵巧手的统一视频-动作去噪模型。通过语言、初始图像和初始手部状态,Donk采样未来视频和双手MANO轨迹作为动作策略。在没有图像条件的情况下,相同的去噪架构从文本条件分布中采样配对的视频-动作展开,将对齐的视频先验转化为数据引擎。在动作、视频和仅文本生成评估中,Donk在相同的统一训练方案下提高了灵巧轨迹的准确性,保持了强大的视频保真度,并产生了平滑的文本条件动作展开。

英文摘要

Recent world action models leverage video foundation models by aligning broad visual-dynamics priors with executable robot actions. We revisit this alignment from a distributional perspective. Existing formulations typically narrow the aligned prior into an observation-conditioned policy distribution over future actions. In contrast, we keep the distribution broader by modeling the joint space of interaction videos and executable hand trajectories under multiple conditioning regimes. We propose Donk, a unified video-action denoising model for dexterous hands. With language, an initial image, and the initial hand state, Donk samples future videos and bimanual MANO trajectories as an action policy. Without the image condition, the same denoising architecture samples paired video-action rollouts from a text-conditioned distribution, turning the aligned video prior into a data engine. Across action, video, and text-only generation evaluations, Donk improves dexterous trajectory accuracy, preserves strong video fidelity, and produces smooth text-conditioned action rollouts under the same unified training recipe.

2606.03837 2026-06-03 cs.CV 版本更新

Where Do We (Not) Need Temporal Context in Low-Resource Video Task Adaptation?

在低资源视频任务适应中,我们在何处(不)需要时间上下文?

Luc P. J. Sträter, Hazel Doughty

发表机构 * Leiden University(莱顿大学)

AI总结 本文系统研究了视频理解中模型适应策略的时间上下文分配问题,通过评估不同设置下的参数高效微调和探测方法,揭示了时间上下文在骨干网络、PEFT和探测之间的最优分布。

详情
AI中文摘要

参数高效微调(PEFT)和探测使得仅使用少量可训练参数就能适应基础模型,这对于标注和计算成本高昂的视频理解具有吸引力。然而,视频PEFT主要集中于适应图像预训练模型,而标准PEFT方法也可应用于视频表示。这些设置很少被比较,并且都将时间推理限制在模型的单个组件中,从而留下了时间上下文应如何在骨干网络、PEFT和探测之间分布的问题。在这项工作中,我们提供了视频理解中模型适应策略的系统研究。我们在外观聚焦、运动聚焦和空间密集设置中评估了方法,特别关注数据有限且参数效率最有利的场景。我们的结果为跨设置的PEFT和探测提供了新的见解,并证明了时间上下文分配对于有效视频适应的重要性。

英文摘要

Parameter-efficient fine-tuning (PEFT) and probing enable adaptation of foundation models using only a small number of trainable parameters, making them attractive for video understanding, where annotation and computation are expensive. However, video PEFT has focused on adapting image-pretrained models, while standard PEFT methods can also be applied to video representations. These settings are rarely compared, and both confine temporal reasoning to a single component of the model, leaving open how temporal context should be distributed across backbone, PEFT, and probe. In this work, we provide a systematic study of model adaptation strategies for video understanding. We evaluate methods across appearance-focused, motion-focused, and spatially dense settings, with a particular focus on scenarios with limited data, where parameter efficiency is most beneficial. Our results provide new insights into PEFT and probing across settings and demonstrate the importance of temporal context allocation for effective video adaptation.

2606.03806 2026-06-03 cs.CV 版本更新

TeX-1500: A Paired Real-World LWIR Hyperspectral Dataset and Benchmark for Temperature-Emissivity-Texture Decomposition

TeX-1500:用于温度-发射率-纹理分解的配对真实世界长波红外高光谱数据集与基准

Cheng Dai, Jiale Lin, Hongyi Xu, Bingxuan Song, Ziyang Xie, Fanglin Bao

发表机构 * School of Science, Westlake University(西湖大学理学院) School of Engineering, Westlake University(西湖大学工学院)

AI总结 针对长波红外高光谱成像中温度-发射率-纹理分解缺乏配对监督数据的问题,构建了包含1522对真实场景的TeX-1500数据集,并提出波长感知基线模型TeX-UNet,实现了可量化的数据驱动热感知基准。

详情
AI中文摘要

温度-发射率-纹理(TeX)分解旨在从长波红外高光谱成像(LWIR HSI)中恢复物体热状态、材料光谱响应和可见光般的几何纹理。现有的TeX流程主要是场景特定的逆求解器,缺乏配对的LWIR HSI-TeX监督限制了基于学习的分解。为解决这一空白,我们引入了TeX-1500,一个大规模配对LWIR HSI-TeX数据集和基准,用于监督式HSI到TeX分解。TeX-1500包含来自DARPA隐形前照灯(DARPA IH)推扫式成像和我们FTIR采集的1,522个校准真实场景对,覆盖五个地点、四个季节、不同的采集时间、异构波长布局和两个传感器系列。每个样本存储一个校准的有效波段辐射立方体、校准的波长位置,以及通过一致的恢复和TeX构建协议构建的对齐温度、发射率和纹理监督。我们进一步提供了TeX-UNet,一个简单的波长感知基线,将校准的HSI波段和波长位置映射到TeX场。在保留的DARPA IH推扫场景和零样本/少样本迁移到FTIR场景上的实验表明,TeX-1500为数据驱动的以物理属性为中心的热感知提供了可用的配对监督和可测量的基准。

英文摘要

Temperature-emissivity-texture (TeX) decomposition seeks to recover object heat state, material spectral response, and visible-like geometric texture from long-wave infrared hyperspectral imaging (LWIR HSI). Existing TeX pipelines are mainly scene-specific inverse solvers, and the lack of paired LWIR HSI-TeX supervision has limited learning-based decomposition. To address this gap, we introduce TeX-1500, a large-scale paired LWIR HSI-TeX dataset and benchmark for supervised HSI-to-TeX decomposition. TeX-1500 contains 1,522 calibrated real-scene pairs from DARPA Invisible Headlights (DARPA IH) pushbroom imagery and our FTIR acquisitions, covering five locations, four seasons, diverse acquisition times, heterogeneous wavelength layouts, and two sensor families. Each sample stores a calibrated valid-band radiance cube, calibrated wavelength positions, and aligned temperature, emissivity, and texture supervision constructed through a consistent restoration and TeX-construction protocol. We further provide TeX-UNet, a simple wavelength-aware baseline that maps calibrated HSI bands and wavelength positions to TeX fields. Experiments on the held-out DARPA IH pushbroom scenes and zero-/few-shot transfer to FTIR scenes show that TeX-1500 provides usable paired supervision and a measurable benchmark for data-driven physical-property-centered thermal perception.

2606.03802 2026-06-03 cs.CV 版本更新

Template Collapse and Information-Theoretic Limits in Camera rPPG Pulse Morphology Restoration

模板坍塌与相机rPPG脉搏形态恢复中的信息论极限

Achraf Ben Ahmed

发表机构 * PlesmoSense SARL(PlesmoSense公司)

AI总结 本研究通过评估16种架构在153名受试者上的表现,引入跨受试者Pearson r来区分个体特异性恢复与模板坍塌,发现消费者摄像头无法编码个体动脉形态,且无架构能恢复个体特异性脉搏形态。

详情
AI中文摘要

目的:消费者面部相机远程光电容积描记法(rPPG)可实现被动心血管监测,但单周期波形形态(编码动脉硬化生物标志物)是否可从该测量中恢复尚未明确。方法:我们在三个数据集的153名受试者上评估了涵盖六个家族的16种架构,引入跨受试者Pearson r以区分个体特异性恢复与模板坍塌。结果:无架构恢复个体特异性形态(跨受试者r范围0.773–0.9999;真实上限0.601)。监督对比学习(SupCon)收敛至log N = 4.844,构成现有最强经验证据,表明测试的编码器家族无法从单周期rPPG中提取可判别形态结构。VAE解码器恢复了rPPG输入中缺失的群体级谐波内容(H2/H1:输出0.310 vs. 输入0.275),零样本泛化至UBFC(r = +0.708);方向性幻觉差距(p = 0.150)提示部分信号读取。当输入不携带可判别结构时,抗坍塌目标失效。意义:消费者摄像头无法编码个体动脉形态;跨受试者r是波形重建基准中必要的坍塌诊断指标。

英文摘要

Objective: Consumer face camera remote photoplethysmography (rPPG) enables passive cardiovascular monitoring, but whether single-cycle waveform morphology encoding arterial stiffness biomarkers is recoverable from this measurement has not been characterised. Methods: We evaluated 16 architectures spanning six families on 153 subjects across three datasets, introducing cross-subject Pearson r to distinguish subject-specific recovery from template collapse. Results: No architecture recovered subject-specific morphology (cross-subject r range 0.773–0.9999; ground-truth ceiling 0.601). Supervised contrastive learning (SupCon) converged to log N = 4.844, constituting the strongest available empirical evidence that no discriminative morphological structure is extractable from single-cycle rPPG by the encoder families tested. The VAE decoder restores population-level harmonic content absent from the rPPG input (H2/H1: 0.310 output vs. 0.275 input), generalising zero-shot to UBFC (r = +0.708); a directional hallucination gap (p = 0.150) suggests partial signal reading. Anti-collapse objectives fail when input carries no discriminative structure. Significance: Consumer cameras cannot encode individual arterial morphology; cross-subject r is a necessary collapse diagnostic for waveform reconstruction benchmarks.
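
The collapse diagnostic named above can be sketched as follows. The abstract does not spell out how cross-subject Pearson r is aggregated, so this sketch assumes the mean of pairwise Pearson correlations between single-cycle waveforms of different subjects; the function names and the dictionary layout are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cross_subject_r(waveforms):
    """Mean pairwise Pearson r across subjects.

    waveforms: dict mapping subject id -> one resampled pulse cycle
    (all cycles must have the same length). Averaging over pairs is
    an assumption of this sketch.
    """
    pairs = list(combinations(waveforms.values(), 2))
    if not pairs:
        raise ValueError("need at least two subjects")
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)
```

Read this way, a model whose reconstructions correlate across subjects far more strongly than the ground-truth waveforms do (0.773 to 0.9999 against a ceiling of 0.601 in the abstract) is emitting a shared template rather than subject-specific morphology.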

2606.03795 2026-06-03 cs.CV 版本更新

Beyond Compression: Quantifying Spectral Accessibility in Vision Representations

超越压缩:量化视觉表示中的频谱可访问性

Akayou A. Kitessa, Yijun Zhao

发表机构 * Fordham University(福特汉姆大学)

AI总结 通过残差频谱损失(RSL)测量线性可恢复的带限傅里叶能量,研究视觉语言模型中投影层对表示频谱结构的影响,发现CLIP和DINOv2中频谱可访问性随深度非单调变化,中间层峰值后下降,且CLIP的投影是频谱中性的,而DINOv2的[CLS]池化导致频谱结构损失。

详情
AI中文摘要

视觉语言模型通过学习的投影层将视觉特征映射到共享嵌入空间,但目前尚不清楚这些变换如何改变视觉信息的结构。本研究通过空间频率可访问性(以从模型表示中线性恢复带限傅里叶能量的能力衡量)来考察表示的变化。为隔离降维之外的影响,我们引入了残差频谱损失(RSL),该损失相对于维度匹配的随机投影基线评估变化。为减少优化带来的混杂效应,分析使用所有参数冻结的预训练模型。实验结果显示,在ImageNet和MS-COCO数据集上,CLIP和DINOv2中可访问性随频率一致变化。频谱可访问性随深度呈非单调轨迹,在中间层达到峰值,然后向输出表示下降。最终变换因架构而异:CLIP的学习投影是频谱中性的,变化可由压缩解释,而DINOv2的[CLS]池化导致整个频谱的结构性损失。这些发现表明中间层和池化机制是现代视觉编码器中频谱变换的主要驱动因素。

英文摘要

Vision-language models map visual features into a shared embedding space through learned projection layers, yet it remains unclear how these transformations alter the structure of visual information. This study examines changes in representation through spatial-frequency accessibility, measured by the linear recoverability of band-limited Fourier energy from model representations. To isolate effects beyond dimensionality reduction, we introduce Residual Spectral Loss (RSL), which evaluates changes relative to a dimension-matched random projection baseline. To reduce confounding effects from optimization, the analysis uses pretrained models with all parameters frozen. The experimental results show consistent frequency-dependent changes in accessibility across CLIP and DINOv2 on ImageNet and MS-COCO datasets. Spectral accessibility follows a non-monotonic trajectory across depth, peaking at intermediate layers before decreasing toward the output representation. The final transformation differs across architectures: CLIP's learned projection is spectrally neutral, with changes explained by compression, whereas DINOv2's [CLS] pooling induces a structured loss across the spectrum. These findings identify intermediate layers and pooling mechanisms as primary drivers of spectral transformation in modern vision encoders.

2606.03793 2026-06-03 cs.CL cs.CV 版本更新

Exploring Adversarial Robustness and Safety Alignment in Multilingual Multi-Modal Large Language Models

探索多语言多模态大语言模型的对抗鲁棒性与安全对齐

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋) Khalifa University, UAE(哈利法大学,阿联酋) Australian National University, Australia(澳大利亚国立大学,澳大利亚)

AI总结 本研究通过梯度攻击和跨语言评估,发现多语言多模态大语言模型存在可迁移的对抗脆弱性,并揭示低资源语言因理解失败而呈现的虚假安全现象,提出深层训练整合才能实现真正的多语言安全对齐。

详情
AI中文摘要

多模态大语言模型将视觉感知整合到语言推理中,引入了一个连续的攻击面,容易受到对抗攻击。先前关于MLLM鲁棒性的工作主要关注以英语为中心的任务,多语言行为尚未被探索。我们通过对12种不同语言的对抗鲁棒性和多模态安全性进行系统研究来填补这一空白,评估通过指令调优获得多语言能力的开源MLLM。基于梯度的攻击揭示了一种可迁移的多语言脆弱性:在一种语言中优化的对抗图像会继续在其他语言中引发失败,表现出强大的跨语言可迁移性。多语言安全性进一步取决于模型检索或解释有害指令的有效性。当有害意图通过文本发出时,语言基础更强的语言更常引发允许滥用的响应,而较弱的语言产生较少的不安全输出。当嵌入图像作为排版内容时,英文脚本被可靠地识别和遵循,而非英文脚本很少被视觉编码器解析。因此,低资源语言可能看起来更安全,但这是理解和视觉基础失败的人为产物,而非真正的对齐,我们将这种现象称为“失败导致的安全”。相比之下,在整个训练阶段(而不仅仅在指令调优阶段)构建多语言能力的MLLM,如Qwen3-VL,表现出真正的跨语言安全性,跨语言保持主动拒绝,而不是掩盖理解失败。浅层多语言适应(例如在翻译的指令数据上进行微调)可能产生表面理解,在低资源语言中造成虚幻的安全感;跨训练阶段的更深层整合才能实现真正的多语言安全对齐。

英文摘要

Multimodal Large Language Models integrate visual perception into language reasoning, introducing a continuous attack surface susceptible to adversarial attacks. Prior work on MLLM robustness has focused largely on English-centric tasks, leaving multilingual behaviour unexplored. We address this gap through a systematic study of adversarial robustness and multimodal safety across 12 diverse languages, evaluating open-source MLLMs that acquire multilingual capability through instruction tuning. Gradient-based attacks reveal a transferable multilingual vulnerability: adversarial images optimized in one language continue to induce failure in others, demonstrating strong cross-lingual transferability. Multilingual safety further varies with how effectively a model retrieves or interprets harmful instructions. When harmful intent is issued through text, languages with stronger linguistic grounding more often elicit misuse-enabling responses, while weaker languages produce fewer unsafe outputs. When embedded in the image as typographic content, English scripts are reliably recognised and followed, whereas non-English scripts are rarely parsed by the vision encoder. Lower-resource languages may therefore appear safer, but this is an artefact of comprehension and visual-grounding failures rather than genuine alignment, a phenomenon we term safety-by-failure. In contrast, MLLMs that build multilingual capability throughout their training stages rather than only at instruction tuning, such as Qwen3-VL, exhibit genuine cross-lingual safety, maintaining active refusal across languages rather than masking comprehension failure. Shallow multilingual adaptation, such as fine-tuning on translated instruction data, may produce surface-level understanding that creates illusory safety in low-resource languages; deeper integration across training stages leads to genuine multilingual safety alignment.

2606.03792 2026-06-03 cs.CV cs.LG 版本更新

Training-Free Multi-Concept LoRA Composition with Prompt-Aware Weighting

免训练的多概念LoRA组合与提示感知加权

Georgios Tsoumplekas, Stella Bounareli, Vasileios Argyriou

发表机构 * Department of Networks and Digital Media, Kingston University London, UK(网络与数字媒体系,伦敦金斯顿大学,英国)

AI总结 提出一种免训练的提示感知加权策略,通过优化组合多个LoRA模块的输出实现多概念定制,提升图像质量和概念保真度。

Comments Accepted at IEEE FG 2026

详情
AI中文摘要

低秩适应(LoRA)通过将预训练扩散模型适应到特定视觉概念和风格,成功实现了文本到图像生成中的个性化。然而,将此类模型扩展到多概念定制仍然具有挑战性。简单组合多个LoRA权重或其输出通常会导致概念间的干扰,从而降低视觉质量并减少对单个概念参考图像的保真度。本文提出了一种简单而有效的多概念定制方法,通过最优组合多个LoRA模块的输出。我们利用生成过程中每个概念的相对重要性(从其对应的提示标记推断),并引入了两种方法:W-Switch和W-Composite,它们采用提示感知的重要性加权策略,其中每个LoRA根据其触发词在目标提示中的语义影响进行加权。此外,我们通过提出一种新的基于图像的相似性评估框架来扩展现有的定量评估指标,该框架通过比较真实世界参考图像和从生成图像中自动分割的概念区域来评估图像保真度和身份保持。我们在ComposLoRA测试平台上评估了我们的方法,并在视觉质量、身份保持和组合性方面展示了相对于现有最先进方法的一致改进。定性评估,包括基于大语言模型(LLM)的评估和用户研究,进一步验证了所提出方法的有效性,并与新引入的基于图像的定量指标一致。我们的代码可在 https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition 获取。

英文摘要

Low-Rank Adaptation (LoRA) successfully enables personalization in text-to-image generation by adapting pre-trained diffusion models to specific visual concepts and styles. However, extending such models to multi-concept customization remains challenging. Naively combining multiple LoRA weights or their outputs often leads to interference among concepts, resulting in degraded visual quality and reduced fidelity to the reference images of individual concepts. This paper proposes a simple yet effective approach for multi-concept customization by optimally combining the outputs of multiple LoRA modules. We leverage the relative importance of each concept during generation, as inferred from its corresponding prompt tokens, and introduce two methods, W-Switch and W-Composite, that employ a prompt-aware importance weighting strategy in which each LoRA is weighted according to the semantic influence of its trigger words in the target prompt. In addition, we extend existing quantitative evaluation metrics by proposing a new image-based similarity evaluation framework that assesses image fidelity and identity preservation through comparisons between real-world reference images and automatically segmented concept regions from generated images. We evaluate our approach on the ComposLoRA testbed and demonstrate consistent improvements over existing state-of-the-art methods in terms of visual quality, identity preservation, and compositionality. Qualitative evaluations, including a Large Language Model (LLM) based assessment and a user study, further validate the effectiveness of the proposed methods and align with the newly introduced quantitative image-based metrics. Our code is available at https://github.com/GeorgeTsoumplekas/Prompt-Aware-Multi-LoRA-Composition.
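
The idea of weighting each LoRA by the importance of its concept can be sketched in a few lines. This is only an illustration: the abstract does not say how importance scores are computed from prompt tokens or how W-Switch and W-Composite apply them, so the scores are taken as given inputs and the combination shown is a plain weighted sum of per-LoRA outputs. All names here are hypothetical.

```python
def prompt_aware_weights(importance):
    """Normalise non-negative per-concept importance scores so they sum to 1."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: score / total for name, score in importance.items()}


def composite_output(lora_outputs, weights):
    """Weighted sum of per-LoRA output vectors (lists of floats of equal length)."""
    length = len(next(iter(lora_outputs.values())))
    combined = [0.0] * length
    for name, vector in lora_outputs.items():
        w = weights[name]
        for i, value in enumerate(vector):
            combined[i] += w * value
    return combined
```

For example, with assumed scores of 3.0 for a character LoRA and 1.0 for a style LoRA, the character output would contribute three quarters of the combined signal, in place of the uniform averaging that naive composition would use.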

2606.03774 2026-06-03 cs.CV 版本更新

AmbientEye: A Dataset for Pupil Segmentation under Natural Ambient Infrared Illumination

AmbientEye: 自然环境红外光照下的瞳孔分割数据集

Mingyu Han, Hyunyoung Han, Nitheekulawatn Thommakoon, Gangtae Park, Jieun Han, Xucong Zhang, Ian Oakley

发表机构 * Electrical Engineering, Korea Advanced Institute of Science and Technology(电子工程系,韩国科学技术院) Intelligent Systems Department, Delft University of Technology(智能系统系,代尔夫特理工大学)

AI总结 本文提出AmbientEye数据集,探索在无主动红外光源、仅依靠环境阳光的户外场景中,利用被动红外相机实现可靠瞳孔检测,并评估现有算法性能下降。

Comments 12 pages, 7 figures

详情
AI中文摘要

眼动追踪对于智能眼镜至关重要,因为它能为环境智能应用提供用户注意力的洞察。然而,大多数现有的眼动追踪系统依赖主动红外(IR)照明,由于功耗问题,在全天户外使用中造成了实际障碍。本文研究了在无任何主动红外光源、仅依靠环境阳光作为唯一照明源的户外环境中,单独使用被动红外相机能否实现可靠的瞳孔检测。为支持这一研究,我们引入了AmbientEye,这是一个包含来自19个国家35名参与者的2,606,225张眼部图像的大规模数据集。该数据集在户外自然阳光下,使用两种离轴相机配置和两种太阳方向条件采集。我们通过SAM2自动分割提供高质量的瞳孔标注,随后由人工标注员进行细化。我们在数据集上评估了一种最先进的瞳孔分割算法,并将其性能与在受控红外照明下的现有数据集上的性能进行了比较。结果显示,瞳孔分割性能从受控红外数据集上的0.928大幅下降到AmbientEye上的0.767。这一性能差距凸显了环境光设置的挑战。这使得AmbientEye成为未探索且高度实用的眼动追踪场景的第一个基准。

英文摘要

Eye tracking is essential for smart glasses, as it provides insight into user attention for ambient intelligence applications. However, most existing eye-tracking systems rely on active infrared (IR) illumination, creating practical barriers to all-day outdoor use due to power consumption. In this paper, we investigate whether passive IR cameras alone, without any active IR light source, can enable reliable pupil detection in unconstrained outdoor environments, where ambient sunlight serves as the sole illumination source. To support this investigation, we introduce AmbientEye, a large-scale dataset of 2,606,225 eye images collected from 35 participants from 19 countries. It is captured outdoors under natural sunlight with two off-axis camera configurations and two sun-orientation conditions. We provide high-quality pupil annotation through SAM2 automatic segmentation, followed by refinement by human annotators. We benchmark a state-of-the-art pupil segmentation algorithm on our dataset and compare its performance with that on existing datasets under controlled IR illumination. Results reveal a substantial drop in pupil segmentation performance from 0.928 on controlled IR datasets to 0.767 on AmbientEye. This performance gap highlights the challenge of the ambient-light setting. This positions AmbientEye as a first benchmark for an unexplored and highly practical eye-tracking scenario.

2606.03748 2026-06-03 cs.CV cs.AI 版本更新

Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models

Ultralytics YOLO26: 统一的实时端到端视觉模型

Glenn Jocher, Jing Qiu, Mengyu Liu, Shuai Lyu, Fatih Cagatay Akyon, Muhammet Esat Kalfaoglu

发表机构 * Ultralytics

AI总结 本文提出YOLO26,通过双头设计、MuSGD优化器、渐进损失和STAL标签分配策略,实现无NMS的端到端实时检测,并在实例分割、姿态估计等任务上取得一致提升。

Comments 31 pages, 8 figures

详情
AI中文摘要

实时视觉需要准确、高效且易于在不同硬件上部署的模型。YOLO系列因此被广泛部署,但大多数YOLO检测器在推理时仍依赖非极大值抑制,由于分布聚焦损失而携带沉重的检测头,需要长时间的训练计划,并且可能使最小的物体没有正标签分配。我们提出Ultralytics YOLO26,一个统一的实时视觉模型系列,通过协调的架构和训练进展解决了这些限制。YOLO26采用双头设计实现原生无NMS的端到端推理,并完全移除DFL,产生具有无约束回归范围的更轻量头。其训练流程结合了MuSGD(一种从大语言模型训练改编的混合Muon-SGD优化器)、渐进损失(将监督转向推理时头)和STAL(一种保证小物体正覆盖的标签分配策略)。除了检测,YOLO26还为实例分割、姿态估计和旋转检测引入了特定任务的头和损失设计,在任务和尺度上产生一致的增益。该系列涵盖五个尺度(n/s/m/l/x),并在单一流程中支持检测、实例分割、姿态估计、分类和旋转检测,还有一个开放词汇扩展YOLOE-26,用于文本提示、视觉提示和无提示推理。在所有尺度上,YOLO26在COCO上以1.7-11.8 ms T4 TensorRT延迟实现40.9-57.5 mAP,在精度-延迟帕累托前沿上超越了先前的实时检测器,而YOLOE-26x在文本提示下于LVIS minival上达到40.6 AP。代码和模型可在 https://github.com/ultralytics/ultralytics 获取。

英文摘要

Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.

2606.03715 2026-06-03 cs.CV 版本更新

Text-to-Image Models Need Less from Text Encoders Than You Think

文生图模型对文本编码器的依赖比你想象的要少

Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli

发表机构 * Technion – Israel Institute of Technology(以色列理工学院) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 本文发现基于扩散Transformer的文生图模型主要依赖文本编码器提供的单词含义和词序信息,而非完整的上下文信息,并通过构建仅含位置标记词袋的嵌入验证了这一观点。

Comments Project webpage: https://nsping13.github.io/contextless-TTI/

详情
AI中文摘要

文生图模型依赖文本提示作为与人类意图交互的主要接口。提示由文本编码器编码为嵌入,以条件化图像生成过程。除了单个标记的含义外,文本嵌入还编码了整个提示中的上下文信息,如组合性和属性绑定。然而,图像模型是否实际利用了这些更丰富的信息仍未被充分探索。在此,我们探讨问题:文本表示的哪些方面对图像生成至关重要?我们表明,基于扩散Transformer的文生图模型通常仅依赖文本表示的两个相对简单的方面:(i)相邻标记合并为单词表示(对于跨多个标记的单词),以及(ii)词序,该词序由文本编码器的位置嵌入印刻。为了证明这一点,我们构建了一种新的文本嵌入,它仅编码单个单词的含义和顺序,但缺乏关于整个提示的任何上下文信息。我们发现,这种带位置标记的词袋表示足以成功引导图像生成,实现了与完整文本嵌入引导生成相当的视觉质量和文本保真度。这表明,与普遍看法相反,文生图模型通常不使用文本嵌入中除单词含义和词序之外的丰富信息。相反,复杂语言结构的解码由图像模型本身执行。项目网页:https://nsping13.github.io/contextless-TTI/

英文摘要

Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/
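
A toy version of a bag of position-tagged words may make the construction above more concrete. It is a sketch under loud assumptions, not the paper's procedure: the tokenizer, the token vectors, and the position vectors are all invented stand-ins, token merging is done by simple averaging, and the position tag is added to the word vector. The point is only that each word's vector depends on the word and its index, never on the other words in the prompt.

```python
def toy_tokenize(word):
    """Hypothetical sub-word tokenizer: split a word into chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def toy_token_vector(token):
    """Hypothetical context-free token embedding (2-dimensional)."""
    return [float(len(token)), float(ord(token[0]) % 5)]


def toy_position_vector(index):
    """Hypothetical positional embedding for the word at a given index."""
    return [0.0, 10.0 * index]


def word_vector(word):
    """Merge the sub-word token vectors of one word by averaging them."""
    vectors = [toy_token_vector(t) for t in toy_tokenize(word)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def position_tagged_bag(prompt):
    """One vector per word: context-free word meaning plus a position tag."""
    return [
        [a + b for a, b in zip(word_vector(word), toy_position_vector(index))]
        for index, word in enumerate(prompt.split())
    ]
```

In this toy, "red" receives the same vector in "red cube" and "red sphere", while swapping the word order changes only the position tag.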

2606.03713 2026-06-03 cs.CV 版本更新

Investigating Adversarial Robustness of Multi-modal Large Language Models

探究多模态大语言模型的对抗鲁棒性

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI, UAE(穆罕默德·本·扎耶德人工智能大学,阿联酋) Khalifa University, UAE(哈利法大学,阿联酋) Australian National University, Australia(澳大利亚国立大学,澳大利亚)

AI总结 通过系统研究多模态大语言模型的对抗鲁棒性,提出诊断性CLIP对齐协议预测鲁棒视觉编码器的迁移效果,并证明端到端多模态对抗训练能显著提升模型在强对抗攻击下的性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上表现出色,但通过视觉编码器(如CLIP)引入视觉输入显著扩大了攻击面,使这些模型容易受到视觉对抗扰动的影响。先前的防御方法通常通过在对抗微调期间强制与CLIP原始嵌入空间严格对齐来保持与预训练MLLMs的兼容性;虽然实用,但这种约束从根本上限制了可实现的鲁棒性。我们对MLLMs的对抗鲁棒性进行了系统研究。我们首先引入了一个诊断性CLIP对齐协议,该协议在完整的MLLM训练之前预测哪些鲁棒视觉编码器能有效迁移到多模态设置中,揭示出大规模多模态对抗预训练(而非仅单模态规模)是强鲁棒性迁移的关键因素。通过端到端多模态训练将这些编码器集成到MLLMs中,与受约束的即插即用基线相比,在强对抗攻击下,字幕生成平均提升28个CIDEr点,VQA准确率提升11.7%。我们进一步表明,直接对标准非鲁棒MLLM应用对抗训练会降低干净和对抗性能,从而确立了鲁棒视觉表示作为严格先决条件,而从鲁棒骨干网络进行端到端对抗训练则额外带来1.9个CIDEr点和4.3% VQA准确率的提升。除了训练时防御外,轻量级的测试时视觉随机变换可作为非鲁棒MLLM的有效黑盒防御,将对抗性能从接近零提升到与鲁棒模型相当的水平。最后,我们展示了鲁棒模型在白盒视觉越狱攻击下显著减少了有毒生成。代码和预训练权重将公开发布。

英文摘要

Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.

2606.03694 2026-06-03 cs.RO cs.CV cs.HC version update

Face versus Body Tracking for Human-Robot Interaction: An Egocentric Dataset

Jessica Wenninger, Gabriel Skantze

Affiliations Furhat Robotics; University of Naples Federico II; Division of Speech, Music and Hearing, KTH Royal Institute of Technology

AI summary To address frequent identity switches in the egocentric view of social robots, the paper introduces a custom-annotated egocentric dataset, systematically evaluates detection errors, compares face and body tracking, and analyzes the effect of extended spatial memory and appearance re-identification; the optimized pipeline reduces identity switches by 49%.

Comments 8 pages, 5 figures, 3 tables. Accepted to the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
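
As a rough illustration of the identity-switch (IDSW) count mentioned in the abstract above, the sketch below registers a switch whenever a ground-truth person is matched to a different tracker ID than in that person's previous matched frame. It assumes the per-frame matching between ground truth and tracker output has already been done; the function name and input layout are hypothetical and are not taken from the paper.

```python
def count_identity_switches(frames):
    """Count identity switches (IDSW) over a sequence of frames.

    `frames` is a list of dicts mapping ground-truth person ID -> tracker ID
    for the people matched in that frame (hypothetical input layout).
    """
    last_seen = {}  # ground-truth ID -> tracker ID from the last matched frame
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            previous = last_seen.get(gt_id)
            if previous is not None and previous != track_id:
                switches += 1
            last_seen[gt_id] = track_id
    return switches


frames = [
    {"alice": 1, "bob": 2},
    {"alice": 1},            # bob is occluded in this frame
    {"alice": 1, "bob": 3},  # bob reappears under a new tracker ID
]
print(count_identity_switches(frames))  # prints 1
```

The occlusion case in the example is the kind of event the abstract attributes to close-quarter social scenes: the person is the same, but the tracker hands out a fresh ID on reappearance.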

2606.03693 2026-06-03 cs.CL cs.CV version update

Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study

Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira

Affiliations Intelligent System Laboratory, Faculty of Computer Science, Brawijaya University

AI summary Builds IndoRad-VQA, an Indonesian radiology VQA dataset, to evaluate the robustness of medical vision-language models under non-English clinical language; finds a performance gap of 8 to 25 percent between the English and Indonesian settings, indicating the need for more inclusive multilingual evaluation.

Comments accepted to MMFM-BIOMED Workshop @ CVPR 2026

Abstract

Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.

2606.03675 2026-06-03 cs.CV version update

A Fast Methane Detection Pipeline on Board Satellites Based on Mag1c-SAS and LinkNet

Jonáš Herec, Vít Růžička, Rado Pitoňák, Jan Sedmidubsky

Affiliations Zaitra s.r.o.; NASA JPL; Faculty of Informatics, Masaryk University

AI summary Proposes the Mag1c-SAS algorithm to accelerate methane detection and pairs it with a lightweight LinkNet model for noise reduction, enabling efficient, low-power methane leak detection on onboard satellite hardware.

Comments arXiv admin note: substantial text overlap with arXiv:2507.01472

Abstract

Methane is a potent greenhouse gas, and detecting leaks early via hyperspectral satellite imagery can help climate change mitigation efforts. Meanwhile, many existing hyperspectral missions only capture areas manually targeted by operators, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane detection methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. In particular, we test fast target detection ACE and CEM methods that have not been previously used for methane detection and propose Mag1c-SAS, a significantly faster variant of the current state-of-the-art Mag1c algorithm. To explore their detection potential, we integrate them with a machine learning model based on U-Net and LinkNet. We evaluate our methods on the STARCOP dataset and a novel EMIT-MSeg dataset, which we introduce and open-source alongside a high-quality annotation strategy. The proposed Mag1c-SAS approach proves highly effective by operating ~80x faster than the original Mag1c approach, providing a visually similar, but noisier result. When additionally paired with the lightweight LinkNet approach, it effectively reduces noise, achieving AUPRC score improvements of over 30 pp on EMIT-MSeg compared to the baseline Mag1c approach, and an F1 score on STARCOP ~4 pp higher. We evaluate two novel band selection strategies and confirm the system's onboard viability through hardware profiling, demonstrating marginal power consumption and efficient CPU/RAM utilization. We release the final system in a user-friendly and lightweight PyPI library at: https://pypi.org/project/onboard-methane-detection/, alongside all experimental code, models, and data at: https://github.com/zaitra/methane-filters-benchmark.

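2606.03666 2026-06-03 cs.CV version update

Beyond Single Solution: Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing

Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao

Affiliations Harbin Institute of Technology, Harbin, China; Harbin Institute of Technology Suzhou Research Institute, Suzhou, China

AI summary To address the ill-posedness of compressive sensing, proposes a multi-hypothesis collaborative deep unfolding network (MHC-DUN) that jointly optimizes across multiple solution spaces, uses AlphaNet to dynamically predict spatially varying step sizes for gradient descent, and designs a multi-hypothesis collaborative proximal mapping module to improve reconstruction quality.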

Comments Accepted by CVPR 2026

Abstract

Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches predominantly confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems that intrinsically permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For the proximal operator, a sophisticated multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.
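
For readers unfamiliar with the Proximal Gradient Descent iteration that deep unfolding networks build on, here is a minimal single-hypothesis sketch in plain Python. It is the classical version with a fixed scalar step size and an l1 prior (soft-thresholding); both are assumptions made for illustration. According to the abstract, MHC-DUN instead predicts spatially varying step sizes with AlphaNet and replaces the proximal operator with a learned multi-hypothesis module, neither of which is reproduced here.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_step(A, y, x, step, lam):
    """One iteration for minimizing 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    m, n = len(A), len(x)
    # Residual of the measurement model: r = A x - y
    r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
    # Gradient of the data term: g = A^T r
    g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
    # Gradient descent followed by the proximal mapping
    return [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]


# Two measurements of a three-dimensional signal (an underdetermined system)
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
y = [1.0, 1.0]
x = [0.0, 0.0, 0.0]
for _ in range(200):
    x = proximal_gradient_step(A, y, x, step=0.3, lam=0.05)
print([round(v, 3) for v in x])
```

With these values the l1 prior selects a sparse estimate near (0, 0, 0.975) out of the infinitely many signals that fit the two measurements, which is a small-scale picture of the ill-posedness the abstract refers to.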

2606.03646 2026-06-03 cs.CV version update

A Benchmark for Semi-supervised Multi-modal Crowd Counting

Haoliang Meng, Xiaopeng Hong, Yabin Wang, Wangmeng Zuo

Affiliations Harbin Institute of Technology; Pengcheng Laboratory

AI summary Constructs the first benchmark for semi-supervised multi-modal crowd counting, laying the foundation for the task by formulating a standardized protocol and evaluating a range of baseline methods.

Abstract

This paper constructs the first benchmark on semi-supervised multi-modal crowd counting. To lay the foundation for this unexplored task, we first formulate the semi-supervised multi-modal setting and a standardized protocol that specifies the labeled-unlabeled data partition across different labeled ratios. Next, to establish solid reference points, we carefully tailor a diverse set of representative baselines, including existing fully supervised multi-modal methods and semi-supervised single-modal methods. Then, we carefully evaluate their performance under our proposed benchmark. Code and the data partition will be released at https://github.com/HenryCilence/Semi-supervised-Multimodal-Crowd-Counting.

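2606.03635 2026-06-03 cs.CV cs.AI version update

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

Issar Tzachor, Michael Green, Rami Ben-Ari

Affiliations OriginAI, Israel

AI summary Proposes the VidMsg benchmark, which uses a message-first construction pipeline and a bidirectional retrieval task to evaluate how well video understanding models infer implicit messages in short videos.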

Comments Project page: https://iyttor.github.io/VidMsg

Abstract

Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.

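2606.03626 2026-06-03 cs.CV cs.AI cs.CY version update

TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics

Chao Wen, Jacqueline Staub, Adish Singla

Affiliations MPI-SWS (Max Planck Institute for Software Systems)

AI summary Proposes the TurtleAI benchmark of 823 visual programming tasks based on real-world Turtle Graphics tasks; an evaluation of more than 20 multimodal models finds that most achieve success rates below 30%, and fine-tuning Qwen2-VL-72B on synthetic data generated from a small set of seed samples improves performance by about 20%.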

Comments ACL Findings 2026 paper

Abstract

Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.

2606.03610 2026-06-03 cs.CV version update

SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition

Yanan Liu, Anqi Zhu, Jingmin Zhu, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, Dan Xu, Qiuhong Ke

Affiliations University of Science and Technology of China

AI summary Proposes the SkelHCC framework, which uses hyperbolic geometry to encode the skeleton hierarchy and combines CLIP with a training-free cache for one-shot action recognition, outperforming state-of-the-art methods on three datasets.

Comments Accepted by ICML 2026

Abstract

Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and compositional structure of human motion while aligning effectively with high-level action semantics under extreme data scarcity. Existing approaches, largely based on Euclidean embeddings and low-level motion cues, struggle to model the tree-like organization of skeleton data, limiting cross-modal alignment and generalization to unseen action categories. We propose SkelHCC, a unified skeleton hyperbolic CLIP-driven cache adaptation framework for one-shot skeleton-based action recognition. SkelHCC introduces an Explicitly Hierarchical Hyperbolic CLIP (EH-HCLIP) module that embeds skeleton sequences and action language into a shared hyperbolic space. By leveraging the negative curvature and exponential volume growth of hyperbolic geometry, EH-HCLIP naturally encodes the joint-part-body hierarchy of human anatomy and yields structurally consistent cross-modal representations. To support efficient one-shot adaptation, SkelHCC further integrates a training-free LLM-guided Multi-granularity Voting Cache (LMV-Cache) for context-aware inference. Experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD demonstrate that SkelHCC consistently outperforms state-of-the-art methods.

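2606.03603 2026-06-03 cs.CV cs.CL version update

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Yucheng Zhou, Wei Tao, Yiwen Guo, Jianbing Shen

Affiliations Nanyang Technological University

AI summary Proposes the controlled concrete reasoning formulation and the PF-OPSD method, which combine the visual simulation of world models with the abstract reasoning of multimodal large language models to improve performance and robustness on spatial lookahead and open-domain physical prediction tasks.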

Abstract

World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. However, generated rollouts are stochastic and may be visually plausible but task-incorrect, making it necessary to determine when visual simulation is useful, whether a rollout is credible, and how it should influence the final answer. We formulate this problem as controlled concrete reasoning, where a model learns to invoke, verify, and integrate visual future simulation alongside abstract reasoning. To study this setting, we construct two human-verified benchmarks, VRQABench for controllable spatial lookahead and OpenWorldQA for open-domain physical prediction, and propose Privileged-Future On-Policy Self-Distillation (PF-OPSD). During training, PF-OPSD uses ground-truth future videos and answers only as teacher-side privileged context to evaluate on-policy concrete-reasoning trajectories, while the deployable student never observes true futures at test time. Experimental results show that PF-OPSD outperforms the baseline by 10.6% and 10.9% on VRQABench and OpenWorldQA, respectively, while increasing robustness to noisy or conflicting rollouts. Our code and dataset are available at https://github.com/yczhou001/PF-OPSD.

2606.03581 2026-06-03 cs.CV cs.RO version update

UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion

Ye Wu, Ruiqi Song, Baiyong Ding, Nanxin Zeng, Junjie Cheng, Yunfeng Ai

Affiliations School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Waytous Inc.

AI summary Proposes UnsOcc, a multi-modal framework that uses a rendering-based fusion module and detail-aware auxiliary supervision based on Gaussian Splatting to address difficult cross-modal fusion and long-tail distributions in unstructured scenes, outperforming existing methods on an open-pit mine dataset and nuScenes.

Comments 8 pages

Abstract

Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to provide dense spatial representations by assigning semantic labels to individual voxels in 3D space. However, directly applying 3D semantic occupancy prediction to unstructured scenes remains challenging because scene sparsity hinders effective cross-modal fusion and the more severe long-tail distribution in these scenarios further degrades prediction performance. To validate the effectiveness of our approach, we construct a dedicated dataset of unstructured scenes collected from open-pit mines. Based on this, we propose UnsOcc, a multi-modal 3D semantic occupancy prediction framework that improves robustness in unstructured environments. At its core, we introduce a rendering-based fusion module, RenderFusion, which enhances cross-modal feature alignment through bidirectional rendering supervision. Furthermore, we propose GSRefinement, a detail-aware auxiliary supervision method based on Gaussian Splatting that projects sparse 3D occupancy predictions into dense 2D semantic segmentation maps, enabling effective supervision for long-tail categories. Extensive experiments on both the open-pit mine dataset and the nuScenes dataset demonstrate that our method significantly outperforms existing state-of-the-art approaches.

2606.03578 2026-06-03 cs.CV version update

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

Tianxiong Zhong, Xingye Tian, Xuebo Wang, Xin Tao, Pengfei Wan

Affiliations University of Science and Technology of China

AI summary Systematically studies the diffusability of latent representations in latent diffusion models and proposes Velocity Irreducible Variance (VIV) as a stable predictor of generation quality.

Abstract

Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated not only by fidelity but also by their diffusability. Recent studies have proposed diverse explanations for diffusion-friendly latent spaces, including semantic separability, affine equivariance, distribution uniformity, spatial structure, spectral smoothness, and manifold continuity. Yet these properties are often validated on a limited set of tokenizers, leaving it unclear which factors are most predictive of downstream generation quality and whether such conclusions hold beyond the specific settings in which they are introduced. In this work, we conduct a systematic study of latent diffusability by training a large collection of tokenizers with diverse regularization strategies, architectures, and latent configurations, and evaluating them with multiple downstream diffusion backbones. Our analysis identifies several latent properties that consistently correlate with generation quality and exhibit strong generalization across experimental settings. Beyond existing metrics, we introduce Velocity Irreducible Variance (VIV), a measure of velocity ambiguity induced by trajectory crossings. Extensive experiments show that VIV is one of the most stable predictors of generation quality.

2606.03577 2026-06-03 cs.CV version update

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen

Affiliations State Key Laboratory of CAD & CG, Zhejiang University; Ant Group; Westlake University

AI summary Proposes the ReasonMatch-Bench benchmark and the Dynamic Correspondence Reinforcement Learning (DCRL) method to systematically evaluate and improve the spatial reasoning of multimodal large language models on wide-baseline matching.

Comments CVPR 2026. Project page: https://aim-uofa.github.io/reasonmatch/ Code: https://github.com/aim-uofa/ReasonMatch

Abstract

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.

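2606.03569 2026-06-03 cs.CV cs.AI version update

When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics

Jiahui Wang, Kai Zhang, Mai Han, Huanghe Zhang

Affiliations Shandong University; National University of Singapore (Suzhou) Research Institute

AI summary To address the loss of feature diversity caused by visual token pruning that relies on a single attention score during vision-language model inference, proposes STS, a two-stage pruning framework that first maximizes structural diversity through repulsion-based sampling and then filters out semantically irrelevant tokens with instruction-aware cross-attention, improving both the structural diversity and the fine-grained task alignment of the retained tokens.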

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.
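
The two-stage idea described in the abstract above (diversity first, relevance second) can be sketched in a few lines. The abstract does not specify the repulsion-based sampler or how the cross-attention scores are computed, so this sketch substitutes farthest-point sampling over token grid positions for stage one and takes precomputed relevance scores as given for stage two. Both substitutions, and all function names, are assumptions for illustration only.

```python
def farthest_point_sampling(positions, k):
    """Pick k indices so that each new pick is as far as possible from those already chosen."""
    chosen = [0]  # start from the first token (arbitrary choice)
    dist = [float("inf")] * len(positions)
    while len(chosen) < min(k, len(positions)):
        lx, ly = positions[chosen[-1]]
        # Track each token's squared distance to its nearest chosen token
        for i, (x, y) in enumerate(positions):
            dist[i] = min(dist[i], (x - lx) ** 2 + (y - ly) ** 2)
        chosen.append(max(range(len(positions)), key=lambda i: dist[i]))
    return chosen


def two_stage_prune(positions, relevance, k_structure, k_final):
    """Stage 1: spatially diverse subset. Stage 2: keep the most relevant of those."""
    diverse = farthest_point_sampling(positions, k_structure)
    ranked = sorted(diverse, key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k_final])


# A 4x4 grid of visual tokens, indexed row by row
positions = [(x, y) for y in range(4) for x in range(4)]
relevance = [0.1] * 16
relevance[3], relevance[12] = 0.9, 0.8  # stand-in prompt-relevance scores
print(farthest_point_sampling(positions, 4))       # prints [0, 15, 3, 12]
print(two_stage_prune(positions, relevance, 4, 2))  # prints [3, 12]
```

In this toy grid, stage one spreads the budget over the four corners before any relevance score is consulted, which is the ordering the abstract describes: geometric coverage first, then refinement by semantic relevance.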

2606.03568 2026-06-03 cs.CV cs.AI cs.LG cs.RO version update

Learned Non-Maximum Suppression for 3D Object Detection

Timo Osterburg, Stefan Schütte, Torsten Bertram

Affiliations Institute of Control Theory and Systems Engineering, TU Dortmund University

AI summary Proposes two learned filtering modules (D2D-Rescore and GossipNet3D) that replace heuristic NMS and exploit relations among detections to improve 3D detection performance, particularly for small and infrequent classes.

Comments 6 pages, accepted at IEEE Intelligent Vehicles Symposium (IV) 2026

Abstract

Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .
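
To make concrete the kind of heuristic that the learned modules are meant to replace, the following is a simplified sketch of greedy center-distance suppression in bird's-eye view, in the spirit of the CircleNMS baseline named in the abstract. The exact baseline configuration is not given there, so the tuple layout, the single shared radius, and the function name are assumptions.

```python
def circle_nms(detections, radius):
    """Greedy center-distance suppression in bird's-eye view.

    `detections` is a list of (x, y, score) tuples. A detection is dropped when
    its center lies within `radius` of an already kept, higher-scoring detection.
    Returns the indices of kept detections in descending score order.
    """
    order = sorted(range(len(detections)), key=lambda i: detections[i][2], reverse=True)
    kept = []
    for i in order:
        xi, yi, _ = detections[i]
        if all((xi - detections[k][0]) ** 2 + (yi - detections[k][1]) ** 2 > radius ** 2
               for k in kept):
            kept.append(i)
    return kept


detections = [(0.0, 0.0, 0.9), (0.5, 0.0, 0.8), (5.0, 5.0, 0.7)]
print(circle_nms(detections, radius=1.0))  # prints [0, 2]
```

The rule looks only at pairwise distances and scores, with one hand-set radius; the abstract's argument is that a learned filter can instead use relations among many detections at once.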

2606.03566 2026-06-03 cs.CV cs.AI 版本更新

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

基于高效Transformer的局部块采样用于多发性硬化脉络丛分割

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

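AI summary Proposes a method based on SwinUNETR and localized patch sampling that automatically segments the lateral ventricle choroid plexus in multiple sclerosis, cutting computation by 99% while achieving a higher Dice coefficient than existing models.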

详情
AI中文摘要

背景:侧脑室脉络丛(LVCP)正逐渐被认为是与多发性硬化(MS)身体残疾和神经炎症相关的关键影像生物标志物。然而,LVCP的手动分割非常繁琐,限制了其在广泛临床试验和纵向评估中的应用。本研究旨在开发一种基于SwinUNETR的流程,利用靶向的脑室内和脑室周围小块采样,从独立和多模态MRI输入中自动分割MS中的LVCP。方法:我们回顾性评估了来自两个独立MS主导队列的三组数据的3T MRI扫描(数据集1:n=177;数据集2:n=177;扩展测试集:n=388)。我们的方法采用在32x32x32体素块上训练的SwinUNETR架构,并与3D UXNET模型进行基准比较。主要评估指标是Dice相似系数(DSC),辅以计算需求(GFLOPs)和95百分位豪斯多夫距离(HD95)。结果:在扩展测试集上,SwinUNETR模型在结合MPRAGE和FLAIR时获得了平均DSC为0.868(95% CI: 0.863-0.872),显著优于UXNET(DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001)。当仅限于独立FLAIR输入时,基于Transformer的方法保持了0.863的高DSC,而UXNET的空间定位显著恶化(HD95: 1.86 vs. 3.00 mm)。重要的是,所提出的框架将计算负载降低了99%(91.8 vs. 22,080 GFLOPs)。通过将局部块采样与SwinUNETR架构相结合,该方法为LVCP分割提供了一种准确、稳健且统计上优于当前领先模型的替代方案。其巨大的计算成本降低使其非常适合在临床和研究环境中广泛实施。

英文摘要

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
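
The two headline quantities in this abstract, the Dice Similarity Coefficient and the relative drop in GFLOPs, are simple to compute. The sketch below assumes NumPy and binary masks; the function names are illustrative, and only the two GFLOPs figures are taken from the abstract.

```python
import numpy as np


def dice(pred, target):
    """Dice Similarity Coefficient 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom


def relative_reduction(new, old):
    """Fraction by which `new` is smaller than `old`."""
    return 1.0 - new / old


# GFLOPs reported in the abstract: 91.8 (proposed) vs. 22,080 (baseline).
gflops_saving = relative_reduction(91.8, 22080.0)
```

With the reported figures, `gflops_saving` is a little above 0.995, consistent with the rounded "99%" in the abstract.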

2606.03540 2026-06-03 cs.CV 版本更新

Attend to Anything: Foundation Model for Unified Human Attention Modeling

关注一切:统一人类注意力建模的基础模型

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

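AI summary Proposes the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across image, video, and audio-visual tasks through hierarchical language prompts embedded in hyperbolic space, with an average gain of 6% on 16 benchmarks and roughly 4× faster video inference.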

Comments Accepted to ICML 2026

详情
AI中文摘要

现有人类注意力(显著性)建模方法在模态、场景和任务公式上高度碎片化。因此,即使模型容量和数据规模增加,当前模型仍主要依赖于场景且针对特定任务,无法在实际应用中泛化。为解决这些根本限制,我们提出了关注一切模型(AAM),一种多模态基础模型,统一了各种图像、视频和视听任务及场景中的注意力建模。AAM将注意力重新表述为一种认知蕴含关系,按通用到特定的层次组织,通过双曲空间中的层次嵌入语言提示实现。此外,为统一静态图像和动态视频注意力,我们采用流体动力学视角,将视频帧注意力建模为由Fokker-Planck方程控制的扩散时间演化。在16个基准上的大量实验表明,AAM在各种场景下平均比最先进方法高出6%,同时视频推理速度提升约4倍。总体而言,这些结果表明AAM为未来注意力和显著性相关任务的研究提供了原则性基础。数据集和代码将在 https://github.com/wz-zhao/Attend-to-Anything 提供。

英文摘要

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.

2606.03539 2026-06-03 cs.CV 版本更新

Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding

零空间中知识保留的模型调优用于鲁棒的时空视频定位

Haoxuan Chen, Xianqin Liu, Jian-Fang Hu

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, China(中山大学计算机科学与工程学院) National Information Center of GACC (Guangdong), GuangZhou, China(广东省GACC国家信息中心) Guangdong Province Key Laboratory of Information Security Technology, China(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China(教育部机器智能与高级计算重点实验室)

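AI summary To address the disruption of pre-trained knowledge when adapting to low-quality videos, proposes the Null-Space Tuning (NST) framework, which preserves pre-trained knowledge by confining learnable residuals to the null-space of frozen weights, synthesizes the residuals with a Quality-Adaptive Unit and Dual-Space Reparameterization, and achieves the best performance on a mixed-quality benchmark.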

Comments Accepted by ICME 2026

详情
AI中文摘要

时空视频定位旨在基于文本查询定位目标管。尽管近期方法取得了显著成功,但它们主要关注高质量输入,忽略了现实场景中广泛存在的低质量视频。虽然像LoRA这样的调优方法可以适应降质输入,但它们不可避免地破坏了预训练知识。为解决这一问题,我们提出了零空间调优(NST)。该框架利用了将冻结权重的零空间内的向量添加到层输入不会影响输出的几何性质。利用这一点,NST将可学习残差注入输入特征,这些残差可以选择性地对预训练骨干网络不可见。具体地,NST结合了质量自适应单元和双空间重参数化来合成这些残差,通过将高质量输入的组件限制在零空间内,同时将低质量输入的恢复组件引导至非零空间。由于冻结权重消除了零空间组件,我们有效地纠正了降质输入,同时保留了高质量输入的预训练知识。大量实验表明,NST在我们的混合质量基准上优于最先进的方法。

英文摘要

Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods like LoRA can adapt to degraded inputs, they inevitably disrupt pre-trained knowledge. To address this, we propose Null-Space Tuning (NST). This framework exploits the geometric property that adding vectors within the null-space of frozen weights to the layer input does not affect the output. Leveraging this, NST injects learnable residuals into input features that can be selectively invisible to the pre-trained backbone. Specifically, NST combines the Quality-Adaptive Unit and Dual-Space Reparameterization to synthesize these residuals by confining components for HQ inputs to the null-space, while directing restoration components for LQ inputs to the non-null space. As the frozen weights eliminate null-space components, we effectively rectify degraded inputs while preserving pre-trained knowledge for HQ inputs. Extensive experiments show that NST outperforms state-of-the-art methods on our Mixed-Quality benchmark.
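
The geometric property the abstract relies on can be checked numerically. The sketch below assumes NumPy and a single bias-free linear layer y = W x with more inputs than outputs, so that the null-space is non-trivial; the helper `null_space_projector` and all shapes are illustrative, and the paper's Quality-Adaptive Unit and Dual-Space Reparameterization are not reproduced.

```python
import numpy as np


def null_space_projector(W, tol=1e-10):
    """Return P, the orthogonal projector onto null(W), so W @ (P @ v) ≈ 0."""
    _, s, vt = np.linalg.svd(W, full_matrices=True)
    rank = int((s > tol * s.max()).sum())
    basis = vt[rank:].T              # columns form an orthonormal basis of null(W)
    return basis @ basis.T


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))      # frozen weights: 8 inputs -> 4 outputs
x = rng.standard_normal(8)           # input feature
r = rng.standard_normal(8)           # arbitrary residual (learnable in practice)

P = null_space_projector(W)
r_null = P @ r                       # component the frozen layer cannot see
r_row = r - r_null                   # component that does change the output

same_output = np.allclose(W @ (x + r_null), W @ x)
changed_output = not np.allclose(W @ (x + r_row), W @ x)
```

Here `same_output` holds because W annihilates anything in its null-space, while the remaining component `r_row` shifts the output. This matches the split described in the abstract: residuals for HQ inputs stay invisible to the backbone, and restoration residuals for LQ inputs do not.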

2606.03509 2026-06-03 cs.CV 版本更新

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

EvoMemNav: 用于零样本具身导航的高效自进化细粒度记忆

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

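AI summary Proposes EvoMemNav, which builds a Visual-Semantic Memory Graph, applies a budgeted coarse-to-fine policy, and adds reflection-driven write-back to provide efficient, self-evolving, fine-grained memory for zero-shot embodied navigation, improving multi-instance disambiguation and Stop verification.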

Comments Preprint

详情
AI中文摘要

构建记忆对于零样本具身导航中的长时程规划至关重要。以检测器为中心的场景图通常将观测压缩为稀疏节点,丢弃细粒度视觉证据并积累噪声,而基于3D重建的方法计算成本高昂。我们提出EvoMemNav,一种用于零样本具身导航的高效、自进化、细粒度记忆框架。EvoMemNav构建视觉-语义记忆图(VSMGraph),将原始视图作为一等记忆,并通过轻量级语义线索和拓扑关系将其组织成房间-视图-对象层次结构,保留用于消歧和停止验证的细粒度细节。为了扩展到不断增长的记忆,我们引入预算驱动的粗到细策略:粗阶段将搜索空间压缩到有希望的区域,细阶段仅调用VLM进行目标验证和决策。除了静态记忆,EvoMemNav在每个子任务后执行反射驱动的写回,更新附加到图上的先验知识,编码累积的环境知识以优化未来决策而无需重新训练。在GOAT-Bench和HM3D上,针对物体、文本描述和图像目标模态的实验显示,SR/SPL持续提升,具有更好的多实例区分能力、更少的过早停止和更强的零样本泛化能力。

英文摘要

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

2606.03508 2026-06-03 cs.CV 版本更新

Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection

结构引导混合掩码预训练与空间连续性正则化用于印刷电路板缺陷检测

Peitong Wang, Nuo Wang, Enxin Qin, Chengjin Yu, Hanyu Xuan, Yuanting Yan

发表机构 * Ahu.edu.cn(安徽大学)

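AI summary Proposes a two-stage PCB defect detection framework that learns PCB structural priors through structure-guided mixed masked pretraining and adds spatial continuity regularization during fine-tuning to localize elongated defects more compactly, reaching 85.5% mAP@0.5 on the DsPCBSD+ dataset.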

Comments Preprint. 38 pages, 12 figures, 6 tables

详情
AI中文摘要

印刷电路板(PCB)缺陷检测是自动光学检测(AOI)的关键环节,但在实际应用中仍具挑战性,因为许多缺陷微小、低对比度且嵌入密集电路背景中。为解决这些问题,本文提出一种两阶段PCB缺陷检测框架,结合结构引导混合掩码预训练与空间连续性正则化。在预训练阶段,我们设计了一种稀疏卷积掩码预训练方案,利用无标签PCB图像,其中结构引导混合掩码用于构建信息丰富的掩码输入。稀疏卷积重建管道抑制掩码区域的无效响应,使检测器主干能够从可见导电模式推断缺失的PCB结构,从而学习PCB结构先验。在微调阶段,预训练主干被迁移到下游缺陷检测任务。针对该任务,在微调过程中引入空间连续性正则化项,该项约束分配给同一缺陷实例的分散正预测,并促进细长缺陷区域上更紧凑的定位。在DsPCBSD+数据集上的实验表明,所提方法达到85.5% mAP0.5和52.3% mAP0.5:0.95,优于多个强基线检测器。消融研究和定性结果进一步证实了所提框架在工业AOI场景中稳健PCB缺陷检测的有效性。

英文摘要

Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI), yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-stage PCB defect detection framework that combines structure-guided mixed masked pretraining with spatial continuity regularization. In the pretraining stage, we design a sparse convolutional masked pretraining scheme to exploit unlabeled PCB images, where structure-guided mixed masking is used to construct informative masked inputs. The sparse convolutional reconstruction pipeline suppresses invalid responses from masked regions and enables the detector backbone to infer missing PCB structures from visible conductive patterns, thereby learning PCB structural priors. In the fine-tuning stage, the pretrained backbone is transferred to the downstream defect detection task, and a spatial continuity regularization term is introduced. This term constrains dispersed positive predictions assigned to the same defect instance and promotes more compact localization on elongated defect regions. Experiments on the DsPCBSD+ dataset show that the proposed method achieves 85.5% mAP@0.5 and 52.3% mAP@0.5:0.95, outperforming several strong baseline detectors. Ablation studies and qualitative results further confirm the effectiveness of the proposed framework for robust PCB defect detection in industrial AOI scenarios.

2606.03506 2026-06-03 cs.CV cs.GR 版本更新

AvatarMix: Identity-Preserving Cross-Avatar Composition for Outfit Personalization

AvatarMix: 保持身份特征的跨化身组合用于服装个性化

Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

发表机构 * University of Tsukuba(筑波大学)

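AI summary Proposes AvatarMix, which transfers outfits by directly composing two high-fidelity Gaussian avatars and uses a two-tier refinement strategy (SeamFix and FullbodyFix) to address seam artifacts and loss of appearance fidelity after body reshaping.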

Comments CVPR 2026 Findings. 16 pages, including supplementary material

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 425-435
AI中文摘要

现有的3D化身服装迁移方法面临不同挑战:将2D编辑提升到3D的方法通常会导致服装或身份质量下降,而分别建模身体和服装层的方法则容易出现交叉伪影。我们提出AvatarMix,一种组合范式,通过直接组合两个高保真高斯化身的头部和身体来绕过这些问题。虽然这种范式固有地保留了服装质量并避免了交叉,但在创建无缝连接和保持身体重塑后的外观保真度方面带来了挑战。为此,我们提出两级细化策略:SeamFix,一个局部扩散模块,用于细化头发和颈部以确保无伪影连接;以及一个可选的全身细化模块FullbodyFix,当重定向导致穿衣身体退化时恢复服装外观。两者都在已经3D一致的高斯化身渲染上操作,与2D到3D提升相比,这限制了多视图伪影。为了保留用户的身体身份,我们的基于网格的高斯表示能够适应鲁棒的网格重定向技术,精确地将穿衣身体重塑为用户体型,并鲁棒地处理多样化的身体形状。大量实验表明,我们的方法在服装保真度和身份保持方面达到了最先进的结果,为逼真的3D服装个性化提供了新视角。项目页面:https://larsph.github.io/avatarmix/

英文摘要

Existing 3D avatar outfit transfer methods face distinct challenges: approaches that lift 2D edits to 3D often suffer from outfit or identity quality degradation, while those that separately model body and clothing layers are prone to intersection artifacts. We introduce AvatarMix, a compositional paradigm that bypasses these issues by directly composing the head and body from two high-fidelity Gaussian avatars. While this paradigm inherently preserves outfit quality and avoids intersections, it introduces challenges in creating a seamless join and maintaining appearance fidelity after body reshaping. To this end, we propose a two-tier refinement strategy: SeamFix, a localized diffusion module that refines hair and neck to ensure an artifact-free join, and an optional full-body refinement, FullbodyFix, that restores garment appearance when retargeting degrades the clothed body. Both operate on renders from an already 3D-consistent Gaussian avatar, which limits multi-view artifacts compared to 2D-to-3D lifting. To preserve the user's body identity, our mesh-based Gaussian representation enables the adaptation of a robust mesh retargeting technique, precisely reshaping the clothed body to the user's physique and robustly handling diverse body shapes. Extensive experiments demonstrate that our method achieves state-of-the-art results in outfit fidelity and identity preservation, providing a new perspective for realistic 3D outfit personalization. Project page: https://larsph.github.io/avatarmix/

2606.03499 2026-06-03 cs.CV 版本更新

Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark

表征3DGS投毒中的可检测性:分阶段基准测试

Quoc-Anh Bui-Huynh, Thanh Duc Ngo, Xue Geng, Kaixin Xu, Wang Zhe, Xulei Yang, Ngai-Man Cheung

发表机构 * Temasek Laboratories, Singapore University of Technology and Design(新加坡科技与设计大学Temasek实验室) Vietnam National University, Ho Chi Minh City(越南国家大学胡志明市分校) University of Information Technology, VNU-HCM(越南国家大学胡志明市信息技术大学) Agency for Science, Technology, and Research (A*STAR)(科技研究局(A*STAR))

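AI summary Motivated by the vulnerability of 3DGS to diverse poisoning attacks, proposes the stage-wise benchmark Poison-3DGS and systematically studies how detectability differs across pipeline stages, finding that different attacks leave distinct forensic signals at different stages and that later-stage signals (such as training dynamics and Gaussian parameter statistics) provide strong cues not observable earlier.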

详情
AI中文摘要

3D高斯泼溅(3DGS)已迅速成为实时新视角合成的主要表示方法,但近期研究表明它易受多种投毒攻击,包括虚幻物体注入、计算成本放大和事后模型水印。尽管威胁面不断扩大,现有研究主要关注攻击成功,而防御和检测仍探索不足。从检测角度看,3DGS重建流程的多阶段特性产生了异构的中间表示,这既是关键挑战也是机遇。检测投毒的取证信号本质上是阶段依赖的:在一个阶段引入的攻击可能仅在后续阶段产生信号。这促使我们采用超越单阶段评估的分阶段可检测性视角。我们引入Poison-3DGS,一个用于分阶段表征3DGS投毒检测的基准。它暴露了跨多种场景和攻击的阶段特定伪影,包括多视图图像、几何、训练动态和高斯参数。利用该基准,我们对流水线各阶段的可检测性进行了系统研究。分析揭示了若干见解。首先,可检测性在不同阶段间差异显著,且没有任何单一阶段在所有攻击类型中持续占优。其次,不同攻击表现出不同的阶段特定取证信号,因此检测有效性关键取决于信号在何处被观测到。第三,后期阶段的信号(如训练动态和高斯参数统计)提供了早期阶段不可观测的强线索。总体而言,我们的工作提供了一个原则性基准,并首次系统表征了3DGS中阶段依赖的可检测性,为未来研究鲁棒可靠的3DGS系统奠定了基础。

英文摘要

3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this expanding threat surface, existing studies focus mainly on attack success, while defense and detection remain underexplored. From a detection perspective, a key challenge and opportunity arise from the multi-stage nature of the 3DGS reconstruction pipeline, which produces heterogeneous intermediate representations. Forensic signals for detecting poisoning are inherently stage dependent: an attack introduced at one stage may produce signals that emerge only at later stages. This motivates a stage-wise view of detectability that goes beyond single-stage evaluation. We introduce Poison-3DGS, a benchmark for stage-wise characterization of poisoning detection in 3DGS. It exposes stage-specific artifacts, including multi-view images, geometry, training dynamics, and Gaussian parameters, across a diverse set of scenes and attacks. Using it, we conduct a systematic study of detectability across pipeline stages. Our analysis reveals several insights. First, detectability varies significantly across stages, and no single stage consistently dominates across attack types. Second, different attacks exhibit distinct stage-specific forensic signals, so detection effectiveness depends critically on where signals are observed. Third, later-stage signals such as training dynamics and Gaussian parameter statistics provide strong cues not observable at earlier stages. Overall, our work provides a principled benchmark and the first systematic characterization of stage-dependent detectability in 3DGS, offering a foundation for future research on robust and reliable 3DGS systems.

2606.03493 2026-06-03 cs.CV cs.LG 版本更新

Low-Frequency Shortcuts in Texture-Driven Visual Learning

纹理驱动视觉学习中的低频捷径

Utku Şirin, Cathy Hou, David Alvarez-Melis, Stratos Idreos

发表机构 * Harvard University(哈佛大学) Kempner Institute(凯姆纳研究所)

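AI summary Analyzes how neural networks in texture-driven domains rely on low-frequency components as shortcuts, and proposes pruning those components to eliminate the shortcut, improving in-distribution accuracy and robustness.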

详情
AI中文摘要

神经网络存在捷径学习问题,即学习到的特征在训练集上泛化良好,但在分布内(ID)或分布外(OOD)测试集上表现不佳。现有研究均基于少数几个标准基准,这些基准是形状驱动的。然而,许多应用领域是纹理驱动的。在这项工作中,我们针对纹理驱动领域进行了捷径学习分析,并将其与标准基准进行了比较。我们表明,纹理驱动领域存在低频捷径。它们主要基于少数具有偏斜频谱行为的低频成分(LFC)做出决策,尽管其分类信息存在于更高频率的细粒度细节中。从训练集和测试集中裁剪LFC可以消除捷径,并提供更平衡的频谱行为,将ID准确率提升高达8%。我们表明,低频捷径使模型极易受到OOD干扰的影响,导致与ID准确率相比下降高达70%。裁剪LFC显著提高了对低频干扰的鲁棒性,提升高达40%,并引入了对高频干扰的权衡;平衡的频谱行为提供了更好的泛化性能,而对高频特征的依赖增加则降低了泛化性能。OOD准确率取决于这两个因素之间的相互作用。

英文摘要

Neural networks suffer from shortcut learning, where learned features perform well on the training set but do not generalize to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however, are texture-driven. In this work, we present a shortcut learning analysis for texture-driven domains and compare it with that of a standard benchmark. We show that models in texture-driven domains suffer from low-frequency shortcuts: they make the majority of their decisions based on a few low-frequency components (LFCs) with a skewed spectral behavior, even though the classification information lies in higher-frequency, fine-grained details. Pruning LFCs from training and test sets eliminates the shortcut and provides a more balanced spectral behavior, improving ID accuracy by up to 8%. We show that low-frequency shortcuts make the models highly vulnerable to OOD corruptions, leading to an accuracy drop of up to 70% relative to the ID accuracy. Pruning LFCs significantly improves robustness to low-frequency corruptions, by up to 40%, and introduces a trade-off for high-frequency corruptions: the balanced spectral behavior improves generalization, whereas the increased dependence on high-frequency features reduces it. OOD accuracy depends on the interaction between these two factors.
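
One way to picture "pruning LFCs" is a high-pass mask in the 2D Fourier domain. The sketch below assumes NumPy, a single-channel image, and a circular cutoff; the abstract does not say which transform, cutoff, or component selection the authors use, so `prune_low_frequencies` and its `radius` parameter are hypothetical.

```python
import numpy as np


def prune_low_frequencies(image, radius):
    """Zero out Fourier components within `radius` of the zero frequency.

    image:  2D array (single channel).
    radius: cutoff in frequency-index units; 0 removes only the DC term.
    """
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    # fftshift moves the zero-frequency term to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    spectrum[dist <= radius] = 0
    # The mask is symmetric about the zero frequency, so the result is real
    # up to rounding; .real discards the negligible imaginary part.
    return np.fft.ifft2(np.fft.ifftshift(spectrum)).real


rng = np.random.default_rng(0)
texture = rng.standard_normal((16, 16))
pruned = prune_low_frequencies(texture, radius=2)
```

Applying the same pruning to both training and test images, as the abstract describes, keeps the two distributions aligned while removing the components the model would otherwise lean on.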

2606.03490 2026-06-03 cs.CV 版本更新

TrAction: Action Recognition with Sparse Trajectories

TrAction: 基于稀疏轨迹的动作识别

Jan F. Meier, Felix B. Mueller, Alexander Ecker, Timo Lüddecke

发表机构 * Institute of Computer Science and Campus Institute Data Science, University Göttingen(计算机科学研究所和校园数据科学学院,哥廷根大学) Max Planck Institute for Dynamics and Self-Organization(动态与自组织Max Planck研究所)

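AI summary Proposes sparse point trajectories as an input modality, paired with a transformer architecture and masked-trajectory pretraining, to recognize actions efficiently at lower computational cost, and shows that trajectory features complement appearance features.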

详情
AI中文摘要

现代动作识别模型运行在内存和计算密集的密集RGB视频体积上,并且经常利用外观和背景捷径,例如从物体或场景而不是特征运动来预测动作。我们研究了一种高效的替代输入模态,它通过构造在很大程度上避免了这种偏差:稀疏点轨迹。为此,我们开发了一个简单的Transformer架构用于基于2.5D轨迹的识别,并配合掩码轨迹预训练,我们证明这能显著提高下游动作识别准确率。尽管仅使用密集RGB输入的一小部分,我们的方法在Something-Something V2上达到45%的top-1准确率,在EPIC-Kitchens-100上达到54%,并在时间反转敏感性上超过了V-JEPA。更重要的是,我们发现轨迹特征与最先进的基于外观的特征互补。将我们的预训练模型与DINOv2和V-JEPA 2融合,在Something-Something V2上top-1准确率分别提高了8.7和1.6个百分点。代码:https://github.com/ecker-lab/TrAction

英文摘要

Modern action recognition models operate on memory- and compute-intensive dense RGB video volumes and frequently exploit appearance and background shortcuts, for example, predicting actions from objects or scenes instead of characteristic motion. We investigate an efficient alternative input modality that is largely free of such biases by construction: sparse point trajectories. To this end, we develop a simple transformer architecture for 2.5D trajectory-based recognition together with a masked-trajectory pretraining, which we show to substantially improve downstream action recognition accuracy. Despite using only a fraction of the dense RGB input, our method reaches 45% top-1 on Something-Something V2 and 54% on EPIC-Kitchens-100, and surpasses V-JEPA on time-reversal sensitivity. More importantly, we find trajectory features to be complementary to state-of-the-art appearance-based features. Fusing our pretrained model with DINOv2 and V-JEPA 2 improves top-1 accuracy on Something-Something V2 by 8.7 and 1.6 points, respectively. Code: https://github.com/ecker-lab/TrAction

2606.03479 2026-06-03 cs.CV cs.GR 版本更新

PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting

PersistGS: 4D高斯溅射中物体持久性的可微物理

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

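AI summary Proposes PersistGS, which couples differentiable rigid body simulation with 3D Gaussian Splatting to predict an object's SE(3) trajectory from physics while it is occluded, restoring object permanence, and introduces a centroid silhouette loss that lowers trajectory error.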

Comments Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Workshop on Generative 3D Reconstruction

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4687-4696
AI中文摘要

动态3D高斯溅射(3DGS)方法通过光度监督从同步多相机视频重建时变场景。当一个运动物体被所有训练相机完全遮挡时,光度监督消失:表示该物体的高斯体无法接收梯度信号而退化。现有处理神经重建中不完整观测的方法依赖于学习到的生成先验,这些先验优先考虑视觉合理性而非物理正确性。我们提出PersistGS,一种通过将可微刚体模拟与3D高斯溅射耦合来在遮挡期间恢复物体持久性的方法。我们的方法将场景分解为每个物体的高斯体和碰撞网格,通过可微模拟从观测到的遮挡前轨迹估计摩擦和速度,并利用得到的SE(3)轨迹在整个遮挡期间定位物体高斯体。由于预测轨迹满足刚体动力学的控制方程,它能够忠实捕捉接触事件(弹跳、基于摩擦的减速、方向变化),而运动学外推无法建模这些事件。我们引入质心轮廓损失,将位置梯度与外观噪声分离,使轨迹误差比光度监督降低40%。我们使用在训练中保留的相机进行评估,这些相机在遮挡期间观察物体。在合成场景上的实验表明,PersistGS在PSNR上比恒定速度外推高出2.46dB,并且与真实轨迹上限仅差0.19dB。

英文摘要

Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal and degrade. Existing approaches to incomplete observations in neural reconstruction rely on learned generative priors that prioritize visual plausibility over physical correctness. We propose PersistGS, a method that restores object permanence during occlusion by coupling differentiable rigid body simulation with 3D Gaussian Splatting. Our approach decomposes the scene into per-object Gaussians and collision meshes, estimates friction and velocity from the observed pre-occlusion trajectory via differentiable simulation, and uses the resulting SE(3) trajectory to position object Gaussians throughout the occlusion period. Because the predicted trajectory satisfies the governing equations of rigid body dynamics, it faithfully captures contact events (bounces, friction-based deceleration, direction changes) that kinematic extrapolation cannot model. We introduce a centroid silhouette loss that isolates positional gradients from appearance noise, yielding 40% lower trajectory error than photometric supervision. We evaluate using cameras withheld from training that observe the object during its occlusion. Experiments on synthetic scenes show that PersistGS outperforms constant velocity extrapolation by 2.46 dB PSNR and comes within 0.19 dB of a ground-truth trajectory upper bound.
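
The comparison baseline and the metric in the last sentence can be written down directly. The sketch below assumes NumPy, estimates velocity from the last two observed positions, and uses the usual PSNR definition 10·log10(MAX² / MSE); the function names and the two-point velocity estimate are illustrative and not taken from the paper.

```python
import numpy as np


def constant_velocity_extrapolate(positions, times, query_times):
    """Extend a trajectory past its last observation at constant velocity.

    positions:   (T, 3) observed object centers before occlusion.
    times:       (T,) timestamps of those observations.
    query_times: (Q,) timestamps inside the occlusion period.
    """
    positions = np.asarray(positions, dtype=float)
    times = np.asarray(times, dtype=float)
    velocity = (positions[-1] - positions[-2]) / (times[-1] - times[-2])
    dt = np.asarray(query_times, dtype=float)[:, None] - times[-1]
    return positions[-1] + dt * velocity


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images."""
    rendered = np.asarray(rendered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)


predicted = constant_velocity_extrapolate(
    positions=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    times=[0.0, 1.0],
    query_times=[2.0, 3.0],
)
```

A straight-line extrapolation like this has no notion of contact, so it passes through the floor where a simulated rigid body would bounce or slow down, which is the gap the abstract attributes to kinematic extrapolation.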

2606.03470 2026-06-03 cs.CV 版本更新

Mixed-Modality Dual Face-Hair Retrieval

混合模态双人脸-发型检索

Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo

发表机构 * Vietnam National University, Ho Chi Minh City, Vietnam(越南国家大学,胡志明市,越南) University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam(信息技术大学,VNU-HCM,胡志明市,越南)

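AI summary Proposes DFHR, a mixed-modality dual-reference retrieval task, which disentangles identity and hairstyle features and fuses multimodal embeddings to enable identity-aware, attribute-controllable retrieval across modalities.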

详情
AI中文摘要

我们提出了双人脸-发型检索(DFHR),这是一种图像检索中新的混合模态双参考任务,其中查询由指定身份的人脸图像和以图像或文本形式表达的发型参考组成。与先前的检索设置不同,DFHR需要对来自异质模态的两个语义独立属性——身份和发型——进行跨组件推理。这种表述要求在统一的嵌入空间内实现局部特征解耦、跨模态语义对齐和混合模态组合。我们构建了DFHR-Bench,这是首个用于混合模态人脸-发型检索的基准,包含超过18万个标注三元组,涵盖双图像和图像-文本设置,通过多阶段标注协议构建,确保语义和身份完整性。我们进一步提出了MFHC(多模态人脸-发型组合器),一个统一的框架,通过令牌注入和多视角监督融合解耦的身份和发型嵌入。DFHR和DFHR-Bench共同为跨模态的身份感知、属性可控视觉检索建立了新的范式。

英文摘要

We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes, identity and hairstyle, originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.

2606.03460 2026-06-03 cs.CV 版本更新

From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring

从3D感知到安全推理:基于图的实时地下矿井监控框架

Pasindu Ranasinghe, Simit Raval, Dibyayan Patra, Bikram Banerjee, Ismet Canbulat

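AI summary Proposes a continuous monitoring framework that combines 3D semantic perception, uncertainty-based anomaly detection, rule-based checks, on-device LLM reasoning, and GraphRAG-based memory analysis, using scene and temporal graphs for structured safety reasoning, and reaches 93% coverage across 115 hazard scenarios with 92.7% perception accuracy.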

详情
AI中文摘要

地下煤矿开采要求人员和重型设备在共享、受限且照明不良的空间中作业,其中设备接近违规、结构不稳定和遮挡盲区等危险难以预测。传统监控系统(包括固定摄像头和基于规则的接近警报)可以检测预定义事件,但缺乏识别复杂或演变危险所需的3D场景理解和上下文记忆。本文提出一个连续监控框架,将彩色3D点云转换为结构化和可追溯的安全推理输出。该框架结合了3D语义感知、基于不确定性的异常检测、基于规则的危险检查、设备端LLM推理和基于GraphRAG的记忆分析,以识别即时危险并解释长期安全模式。场景图和时序图作为显式知识结构,连接推理阶段的感知输出。为克服标记地下数据的稀缺性,结合真实巷道扫描、受控物体放置和高保真长壁模拟生成多样化的危险场景,同时自监督预训练从有限标注中改进分割。感知模型在30 FPS下达到92.7%的准确率,内存使用低。在115个危险场景中,基于规则的检查覆盖率为57%,结合上下文LLM推理提高到76%,使用基于历史记录的记忆推理达到93%。定性结果表明,不确定性衍生的异常信号支持对超出预定义类别的分布外危险进行解释。总体而言,基于图的知识表示结合3D感知和分层安全推理,为地下矿井监控中的智能决策支持提供了实用基础。

英文摘要

Underground coal mining requires personnel and heavy equipment to operate within shared, confined, and poorly illuminated spaces where hazards such as equipment proximity violations, structural instabilities, and occluded blind spots are difficult to anticipate. Conventional monitoring systems, including fixed cameras and rule-based proximity alerts, can detect predefined events but lack the 3D scene understanding and contextual memory needed to identify complex or evolving hazards. This paper presents a continuous monitoring framework that converts colourised 3D point clouds into structured and traceable safety reasoning outputs. The framework combines 3D semantic perception, uncertainty-based anomaly detection, rule-based hazard checks, on-device LLM reasoning, and GraphRAG-based memory analysis to identify immediate hazards and interpret longer-term safety patterns. Scene and temporal graphs serve as the explicit knowledge structure, linking perception outputs across reasoning stages. To overcome the scarcity of labeled underground data, real roadway scans, controlled object placement, and high-fidelity longwall simulation were combined to generate diverse hazard scenarios, while self-supervised pretraining improved segmentation from limited annotations. The perception model achieved 92.7% accuracy at 30 FPS with low memory usage. Across 115 hazard scenarios, rule-based checks achieved 57% coverage, increasing to 76% with contextual LLM reasoning and 93% with memory-based reasoning using historical records. Qualitative results show that uncertainty-derived anomaly signals support the interpretation of out-of-distribution hazards beyond predefined classes. Overall, graph-based knowledge representation combined with 3D perception and layered safety reasoning provides a practical foundation for intelligent decision support in underground mine monitoring.

2606.03444 2026-06-03 cs.CV cs.AI 版本更新

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

PRISM: 通过自组织专家专业化协同视觉基础模型

Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

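AI summary Proposes PRISM, a dual-stream Mixture-of-Experts (MoE) framework whose two-stage paradigm (first deconstructing expertise so that experts specialize, then dynamically recomposing them into task-specific pathways) addresses negative transfer when integrating vision foundation models, setting a new state of the art on PASCAL-Context and NYUD-v2.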

Comments Accepted to ICML 2026

详情
AI中文摘要

将多种视觉基础模型(VFM)的互补优势统一到单个高效模型中是非常理想的,但受到整体蒸馏中固有的负迁移的挑战。为了解决这些特征冲突,我们引入了PRISM,一种新颖的双流混合专家(MoE)框架,通过模块化专业化协同VFM。我们提出了一个两阶段范式:(1)专业知识解构,其中教师条件路由器引导专家在不同的表示子空间中专业化以减轻干扰,然后(2)动态重组,其中路由器学习将这些专家组装成针对下游任务定制的计算路径。在PASCAL-Context和NYUD-v2上的实验表明,PRISM建立了新的最先进水平,验证了稀疏、涌现的专业化是集成多样化视觉知识的可扩展方法。

英文摘要

Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.
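
For readers unfamiliar with sparse expert routing, the generic mechanism can be sketched in a few lines. This is a plain top-k softmax gate assuming NumPy, not PRISM's teacher-conditional router or its dual-stream design, neither of which is specified in the abstract; `top_k_route`, the toy experts, and the router weights are all hypothetical.

```python
import numpy as np


def top_k_route(x, router_weights, experts, k=2):
    """Send input x to the k experts with the highest router logits.

    router_weights: (num_experts, dim) matrix producing one logit per expert.
    experts:        list of callables, each mapping x to an output vector.
    Returns the gate-weighted sum of the selected experts' outputs and
    the indices of the selected experts.
    """
    logits = router_weights @ x
    top = np.argsort(-logits)[:k]                 # indices of the k largest logits
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                            # softmax over the selected experts only
    output = sum(g * experts[i](x) for g, i in zip(gate, top))
    return output, top


# Three toy experts that scale their input by different factors.
toy_experts = [lambda v: 1.0 * v, lambda v: 2.0 * v, lambda v: 3.0 * v]
toy_router = np.array([[5.0, 0.0], [5.0, 0.0], [-5.0, 0.0]])
routed, chosen = top_k_route(np.array([1.0, 0.0]), toy_router, toy_experts, k=2)
```

Only the selected experts are evaluated for a given input, which is what makes the computation sparse; how the router is trained so that experts specialize is the part the paper contributes.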

2606.03420 2026-06-03 cs.CV 版本更新

PHAF-Personalized Hand Avatars in a Flash

PHAF-瞬间个性化手部化身

Meghana Shankar, Akanxit Upadhyay, Anmol Namdev, Green Rosh KS, Pawan Prasad BH

发表机构 * Samsung R&D Institute(三星研发机构)

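AI summary Proposes PHAF, which quickly generates a personalized photo-realistic hand avatar from two images (dorsal and palmar) using semantic-guided mesh alignment, densified texture extraction, and a view-based inpainting network, delivering high-quality multi-view renders with texture generation 30 times faster than existing methods.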

详情
AI中文摘要

我们提出PHAF-瞬间个性化手部化身,一种个性化的逼真手部化身,仅需两张图像(手背和手掌视图)即可提供高质量的多视角渲染。与基于优化的慢速技术不同,PHAF快速生成个性化纹理,适用于边缘设备上的实时部署。我们的方法结合语义引导的网格对齐和密集纹理提取,高效传递高频细节。基于视图的修复网络细化纹理,确保平滑连续的外观。PHAF可泛化到新视角,并利用参数化手部模型实现精确关节运动,与标准图形引擎兼容。实验表明,其在视觉保真度上与现有方法相当,同时将纹理生成时间大幅减少30倍,支持实用的AR/VR应用。

英文摘要

We present PHAF (Personalized Hand Avatars in a Flash), a personalized photo-realistic hand avatar that provides high-quality multi-view renders from just two images (dorsal and palmar views). Unlike slow optimization-based techniques, PHAF generates personalized textures quickly, for real-time deployment on edge devices. Our approach combines semantic-guided mesh alignment and densified texture extraction to transfer high-frequency details efficiently. A view-based inpainting network refines the textures, ensuring a smooth, continuous appearance. PHAF generalizes to novel viewpoints and leverages a parametric hand model for accurate articulations, making it compatible with standard graphics engines. Experiments show that it is comparable to existing methods in visual fidelity while reducing texture generation time by a factor of 30, enabling practical AR/VR applications.

2606.03418 2026-06-03 cs.CV 版本更新

IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

IDO: 面向多模态假新闻检测的不一致性感知分布优化

Hengyang Zhou, Rongman Hong, Yuxuan Zhou, Jing Wang, Zhaoyan Pan

AI总结 提出不一致性感知分布优化(IDO)方法,通过事实不一致性和模态不一致性建模,提升多模态假新闻检测性能。

Comments Accepted by GlobalSouthML@ICML 2026

详情
AI中文摘要

多模态假新闻检测旨在识别新闻的真实性。现有的多模态假新闻检测方法主要关注跨模态一致性,但往往未能明确建模欺骗性多模态内容中存在的语义不一致性。然而,虚假信息通常包含与事实不符的语义信息。为了解决这些挑战,我们提出了不一致性感知分布优化(IDO),从事实不一致性和模态不一致性的角度提高假新闻检测的性能。对于事实不一致性,我们引入通道级重加权策略以获得语义判别性嵌入,并利用高斯分布建模由事实不一致性引起的不确定性相关性。对于模态不一致性,我们利用不一致性对比学习来学习跨模态语义信息。实验表明,IDO达到了最先进的性能。

英文摘要

Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. However, misinformation often contains semantic information that is incongruous with the facts. To address these challenges, we propose Incongruity-aware Distribution Optimization (IDO) to improve the performance of fake news detection from the perspectives of factual incongruity and modality incongruity. For factual incongruity, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and use a Gaussian distribution to model the uncertain correlation caused by factual incongruity. For modality incongruity, we use incongruity contrastive learning to learn cross-modal semantic information. Experiments demonstrate that IDO achieves state-of-the-art performance.

2606.03417 2026-06-03 cs.CV 版本更新

A unified multi-task framework enables interpretable chest radiograph analysis

统一多任务框架实现可解释的胸部X光片分析

Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出IMT-CXR框架,通过统一Transformer架构模拟放射科医生诊断流程,实现疾病识别、属性表征和可追溯报告生成,在十个基准上表现优异,且临床评估中66%的AI报告达到或超越原始报告。

详情
AI中文摘要

虽然多模态深度学习推动了医学影像分析,但现有的黑箱系统可能局限于孤立任务,常常忽视临床诊断作为多任务过程对信任敏感的本质。我们提出IMT-CXR(可解释多任务Transformer用于胸部X光分析),该框架通过三个基于证据的阶段模拟放射科医生的诊断工作流:1)疾病识别;2)属性表征(如大小、位置、严重程度量化);3)具有可追溯决策路径的证据整合报告生成。该框架采用统一Transformer架构,通过医学领域指令调优优化,顺序执行四个临床任务:多标签疾病分类、病灶定位、解剖分割和放射学报告生成。实验验证表明,在直接推理和微调设置下,该框架在十个CXR基准上表现出竞争性性能。在一项对来自四个医疗中心的160份历史报告的盲评中,三位放射科医生认为66%的AI生成报告在诊断清晰度上达到或超越原始临床报告,凸显了该框架的转化潜力。通过建立从解剖发现到结论的可追溯诊断路径,这项工作弥合了AI技术指标与临床实用性之间的差距,推动了医学影像中可信赖AI系统的发展。

英文摘要

While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease recognition; 2) Attribute characterization (e.g., size, location, severity quantification); 3) Evidence-integrated report generation with traceable decision pathways. The framework employs a unified transformer architecture optimized via medical-domain instruction tuning, sequentially executing four clinical tasks: multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation. Experimental validation demonstrates competitive performance on ten CXR benchmarks under direct inference and fine-tuning settings. In a blinded evaluation of 160 historical reports from four medical centers, three radiologists rated 66% of AI-generated reports as comparable to or surpassing original clinical reports in diagnostic clarity, highlighting the framework's translational potential. By establishing traceable diagnostic pathways from anatomical findings to conclusions, this work bridges the gap between AI technical metrics and clinical utility, advancing trustworthy AI systems in medical imaging.

2606.03410 2026-06-03 cs.CV 版本更新

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

Enginuity:工程图纸视觉语言理解的数据集与基准

Abhishek Kumar, Isha Motiyani, Tilak Kasturi, Ethan Seefried, Prahitha Movva, Tirthankar Ghosal

发表机构 * Predii Oak Ridge National Laboratory(橡树岭国家实验室) Independent Researcher(独立研究员)

AI总结 针对工程图纸领域缺乏公开基准的问题,提出首个开放数据集Enginuity,通过结构化零件表提取和自由形式视觉问答两项任务评估前沿VLM,揭示零件识别与描述保真度之间的系统性差距。

详情
AI中文摘要

工程图纸对视觉语言模型提出了独特的挑战:与自然图像或通用文档不同,它们通过密集的空间布局、领域特定符号以及视觉标注与结构化零件表之间的交叉引用来编码信息。尽管工程图纸在服务、维修和设计工作流中至关重要,但目前尚无公开基准来衡量该领域VLM的能力;现有数据集主要关注流程图、科学图表或商业文档。为填补这一空白,我们引入了Enginuity,这是首个用于评估复杂工程图纸上VLM的开放数据集和基准。我们在美国军用服务和维修手册语料库上定义了两项任务:结构化零件表提取(任务1)和自由形式视觉图问答(VQA)(任务2)用于基准测试。我们在零样本和思维链提示下评估了四种前沿VLM(GPT-5.2 Chat、Claude Opus 4.7、Gemma 4、Qwen3-VL-32B-Instruct)。在任务1上,模型达到了0.61-0.87的Recall@all,但Token F1pen仅为0.03-0.18,暴露了零件识别与描述保真度之间的系统性差距。任务2揭示了所有模型在事实推理上的一致差距。一项支持性分析表明,相对于语义相似性,token重叠指标将技术描述上的模型能力低估了2-6倍,这促使在领域特定评估中进行LLM作为评判者的校准。我们发布了数据集、注释、评估框架以及每个样本的模型输出,以支持对工程内容上VLM能力的可重复研究。

英文摘要

Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to service, repair, and design workflows, there is no public benchmark for measuring VLM capabilities in this domain; existing datasets primarily focus on flowcharts, scientific figures, or business documents. To address this gap, we introduce Enginuity, the first open dataset and benchmark for evaluating VLMs on complex engineering diagrams. We define two benchmark tasks over a corpus of U.S. military service and repair manuals: structured parts-table extraction (Task 1) and free-form visual diagram question answering (VQA) (Task 2). We evaluate four frontier VLMs (GPT-5.2 Chat, Claude Opus 4.7, Gemma 4, Qwen3-VL-32B-Instruct) under zero-shot and chain-of-thought prompting. On Task 1, models reach Recall@all of 0.61-0.87 but Token F1pen of only 0.03-0.18, exposing a systematic gap between part identification and description fidelity. Task 2 reveals a consistent factual-reasoning gap across all models. A supporting analysis shows that token-overlap metrics under-report model capability on technical descriptions by 2-6x relative to semantic similarity, motivating LLM-as-judge calibration for domain-specific evaluation. We release the dataset, annotations, evaluation harness, and per-sample model outputs to support a reproducible study of VLM capability on engineering content.
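To make the gap between part identification and description fidelity concrete, here is a minimal sketch of a plain token-overlap F1 between a predicted and a reference part description. The abstract does not define "Token F1pen", so this is the ordinary bag-of-tokens F1 only; the function name `token_f1`, the whitespace tokenization, and the lowercasing are assumptions made for illustration, not the benchmark's evaluation harness.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between two strings (hypothetical helper)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a perfect match, one empty as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A paraphrase that names the same part still loses half its score.
print(token_f1("retaining ring", "snap ring"))  # 0.5
```

A synonym such as "retaining ring" for "snap ring" is penalized even though it identifies the same part, which is the kind of under-reporting that motivates comparing token overlap against semantic similarity.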

2606.03406 2026-06-03 cs.CV 版本更新

SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

SAMatcher: 基于Segment Anything的共视性建模用于鲁棒特征匹配

Xu Pan, Qiyuan Ma, Mingyue Dong, He Chen, Wei Ji, Xianwei Zheng

AI总结 提出SAMatcher框架,通过共视性建模预测共视区域掩码和边界框作为结构先验,利用Segment Anything Model的对称交叉视图交互机制和统一监督方案,显著提升大视角和尺度变化下的特征匹配性能。

Comments 14 pages

详情
AI中文摘要

可靠的对应估计是图像处理中的一个基本问题,支撑着运动恢复结构、视觉定位和图像配准等应用。现有的基于学习的方法显著改进了局部特征表示,但大多数仍在像素或块级别操作,缺乏对跨视图共同可见区域的显式建模。我们提出了SAMatcher,一个通过共视性建模进行对应估计的特征匹配框架。SAMatcher不直接匹配局部特征,而是首先预测共视区域掩码和边界框作为对应估计的结构先验。基于Segment Anything Model (SAM),它引入了一种对称的交叉视图交互机制,实现了双向特征交换和跨视图语义对齐。我们进一步开发了一个统一的监督方案,通过掩码学习、边界框回归和掩码-边界框一致性约束联合优化掩码预测和边界框定位。在具有挑战性的基准上的大量实验表明,与现有的匹配流程相比,特别是在大视角和尺度变化下,性能有显著提升。我们的结果表明,最初为单目分割设计的基础模型可以通过显式的共视性建模有效地扩展到多视图对应推理,为图像匹配的结构化表示学习提供了新的视角。代码和项目页面:https://xupan.top/Projects/samatcher

英文摘要

Reliable correspondence estimation is a fundamental problem in image processing, underpinning applications such as Structure from Motion, visual localization, and image registration. Existing learning-based methods have significantly improved local feature representations, yet most still operate at the pixel or patch level and lack explicit modeling of regions that are jointly visible across views. We propose SAMatcher, a feature matching framework that formulates correspondence estimation through co-visibility modeling. Instead of directly matching local features, SAMatcher first predicts co-visible region masks and bounding boxes as structured priors for correspondence estimation. Built upon the Segment Anything Model (SAM), it introduces a symmetric cross-view interaction mechanism that enables bidirectional feature exchange and cross-view semantic alignment. We further develop a unified supervision scheme that jointly optimizes mask prediction and box localization through mask learning, box regression, and mask-box consistency constraints. Extensive experiments on challenging benchmarks demonstrate substantial improvements over existing matching pipelines, particularly under large viewpoint and scale variations. Our results show that foundation models originally designed for monocular segmentation can be effectively extended to multi-view correspondence reasoning through explicit co-visibility modeling, offering a new perspective on structured representation learning for image matching. Code and project page: https://xupan.top/Projects/samatcher
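The abstract mentions a mask-box consistency constraint without giving its form. One plausible reading, sketched below under that assumption, is to compare the predicted box with the tight box around the predicted co-visibility mask and penalize low overlap. The helper names `mask_to_box` and `box_iou`, the `(x_min, y_min, x_max, y_max)` convention with exclusive maxima, and the `1 - IoU` penalty are all hypothetical choices, not the paper's loss.

```python
def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around nonzero pixels.

    Maxima are exclusive. Returns None for an empty mask.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
predicted_box = (0, 0, 4, 4)          # a box that is too loose
mask_box = mask_to_box(mask)          # (1, 1, 3, 3)
consistency_penalty = 1.0 - box_iou(mask_box, predicted_box)
print(mask_box, consistency_penalty)  # (1, 1, 3, 3) 0.75
```

In a trainable model the mask would be soft and the box extraction would need a differentiable surrogate; the discrete version here only shows what agreement between the two outputs means.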

2606.03401 2026-06-03 cs.CV 版本更新

Towards Characterizing Scientific Image Utility and Upgradability

面向科学图像效用与可升级性的表征

WenZhe Li, Qihang Yan, Liang Chen, Junying Wang, Farong Wen, Yijin Guo, Chunyi Li, Zicheng Zhang, Guangtao Zhai

发表机构 * TongJi University(同济大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 针对AI生成内容对科学图像完整性的威胁,提出SIU²A框架,通过效用(错误检测与修正可行性)和可升级性(修正质量)两个维度评估科学图像,并构建基准数据集揭示当前多模态系统在科学错误评估与忠实修正方面的显著局限。

详情
AI中文摘要

科学图像在研究交流中作为关键证据,但其完整性面临来自AI生成内容的前所未有的威胁,这些内容引入了微妙但严重的错误。现有的评估范式被证明是不充分的:感知质量指标与科学有效性相关性差,而语言模型缺乏特定领域的验证能力。为了解决这一差距,我们提出了科学图像效用与可升级性评估(SIU$^2$A)框架,该框架引入了两个互补的科学图像评估维度。效用包括错误检测(识别科学不准确性)和修正可行性(评估错误是否可以被可靠修复)。可升级性衡量修正的质量。我们将科学图像损坏分为四种基本类型:细节失真、不完整性、虚假内容和实体混淆。基于这一分类,我们构建了SIU$^2$A-Benchmark,这是一个包含专家标注用于错误识别和修复的数据集。该框架实现了一个两阶段评估协议:效用阶段评估错误检测能力和修复指令生成,而可升级性阶段评估修正是否在不损害现有准确信息的情况下忠实恢复科学有效性。实验表明,当前的多模态系统在科学错误评估和忠实修正方面表现出显著局限性,揭示了视觉感知与科学可用性之间的根本差距。

英文摘要

Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the Scientific Image Utility and Upgradability Assessment (SIU$^2$A) framework, which introduces two complementary dimensions for scientific image evaluation. Utility encompasses error detection (identifying scientific inaccuracies) and correction feasibility (assessing whether errors can be reliably repaired). Upgradability measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU$^2$A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the Utility stage evaluates error detection capability and repair instruction generation, while the Upgradability stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

2606.03348 2026-06-03 cs.CV cs.AI 版本更新

SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation

SynCred-Bench: 评估AI生成视觉虚假信息中的合成可信度

Junxiao Yang, Minghao Zhang, Xiaoce Wang, Haoran Liu, Shiyao Cui, Hongning Wang, Minlie Huang

发表机构 * The Conversational AI (CoAI) group, DCST, Tsinghua University(清华大学人工智能对话组,数据科学与技术研究院,清华大学)

AI总结 提出SynCred-Bench基准,包含600个AI生成的虚假信息图像,涵盖六种可信形式和七种传播风格,并引入FP450真实图像负集,评估显示现有系统在5%假阳性率下真阳性率极低,表明合成可信度是一个严重且未被充分探索的视觉虚假信息挑战。

详情
AI中文摘要

最近的生成模型能够生成带有逼真嵌入文本和布局的视觉制品,创造了一种新的虚假信息威胁:合成可信度。我们引入了SYNCRED-Bench,一个包含600个AI生成的虚假信息图像的基准,这些图像在六种可信形式类别和七种细粒度传播风格上平衡分布,同时还有FP450,一个用于测量假阳性的真实图像负集。广泛评估表明,现有系统仍然不可靠:在5%假阳性率约束下,15个多模态大语言模型仅达到10.5%的真阳性率,开源AIGC检测器达到不到5%,商业API达到57.6%。人类标注者也难以识别合成可信度,仅达到63%的真阳性率。这些发现将合成可信度确立为一个严重且未被充分探索的视觉虚假信息挑战,并提供了一个基准,用于开发超越表面可信度线索进行推理的检测器。

英文摘要

Recent generative models can produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SynCred-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggle to identify synthetic credibility, reaching only 63% TPR. These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.
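The headline numbers are true positive rates under a 5% false-positive-rate constraint. As a rough sketch of how such a figure can be computed from detector scores, the snippet below picks the most permissive threshold whose false positive rate on a real-image negative set stays within the budget and reports the TPR on the fake images at that threshold. The function name `tpr_at_fpr`, the convention that a higher score means "more likely fake", and the toy scores are assumptions; the benchmark's exact protocol is not described in the abstract.

```python
def tpr_at_fpr(pos_scores, neg_scores, max_fpr=0.05):
    """Best TPR over score thresholds whose FPR does not exceed max_fpr.

    pos_scores: detector scores on fake images (should be flagged).
    neg_scores: detector scores on real images (should not be flagged).
    An image is flagged when its score is >= the threshold.
    """
    # Every observed score is a candidate threshold; +inf flags nothing.
    candidates = sorted(set(pos_scores) | set(neg_scores)) + [float("inf")]
    best = 0.0
    for threshold in candidates:
        fpr = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in pos_scores) / len(pos_scores)
            best = max(best, tpr)
    return best


# Toy data: 40 real images, one of which the detector scores highly.
real = [0.1] * 39 + [0.9]
fake = [0.95, 0.8, 0.5, 0.05]
print(tpr_at_fpr(fake, real, max_fpr=0.05))  # 0.75
print(tpr_at_fpr(fake, real, max_fpr=0.0))   # 0.25
```

Tightening the budget from 5% to 0% forces the threshold above the one confidently misjudged real image, which drops the detected share of fakes from three in four to one in four.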

2606.03345 2026-06-03 cs.CV cs.CL cs.CY 版本更新

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

超越语义:从视觉语言数据建模事实与情感感知体验

Youssef Mohamed, Kenneth Ward Church, Mohamed Elhoseiny

发表机构 * KAUST(卡斯土尼亚-沙特大学)

AI总结 提出P-Topics建模问题,通过PercepT两阶段架构从图像和标题中无监督发现并映射事实与情感感知体验,在ArtELingo数据集上显著优于基线。

Comments 8 pages

详情
AI中文摘要

我们提出了P-Topics(感知主题)建模,这是一个理解图像如何被情感和文化感知的新问题。目标是(1)在图像和标题数据集中发现并建模不同的感知体验,每个体验由客观事实和主观情感两方面定义;(2)将图像关联到其相关的感知体验。我们引入了**PercepT**(**感知**主题**T**ransformer),一个两阶段架构来处理P-Topics建模。在形成阶段,percepT通过无监督训练目标发现作为视觉-文本聚类的*P-Topics*,并动态选择聚类数量以匹配数据集的感知丰富度。在映射阶段,它通过注意力池化学习*P-Topic映射函数*,将图像关联到各自的聚类。在ArtELingo上,PercepT的轮廓系数达到**0.97**,而最接近的基线为**0.37**,反映了更好的感知聚类。PercepT的AUC分数达到**0.94**,而基线为**0.77**,显示了更好的感知聚类映射。人工评估证实PercepT捕获了语义上有意义的感知体验,并显著优于现有方法。我们的实现将公开。

英文摘要

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual aspect and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97**, compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94**, compared to **0.77** for the baseline, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.
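For readers unfamiliar with the clustering metric quoted above, the following sketch computes the standard silhouette score: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the per-point score is `(b - a) / max(a, b)`. The use of Euclidean distance on toy 2D points is an assumption for illustration; the abstract does not state which embedding space or distance the paper uses.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; expects at least two clusters."""
    clusters = set(labels)
    total = 0.0
    for i, p in enumerate(points):
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if i != j:
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        own = labels[i]
        if counts[own] == 0:
            continue  # a singleton cluster contributes a score of 0
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in clusters if c != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # about 0.90: tight, well separated
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: clusters cut across the gap
```

Scores near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to another cluster than to their own.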

2606.03341 2026-06-03 cs.CV 版本更新

Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network

基于结构化状态空间对偶性的跨模态特征融合多模态图像配准网络

Zhikang Li, Yan Wu, Xin Hu, Yi Dai, Ming Li

发表机构 * Remote Sensing Image Processing and Fusion Group, School of Electronic Engineering, Xidian University(遥感图像处理与融合组,电子工程学院,西安电子科技大学) National Key Laboratory of Radar Signal Processing, Xidian University(雷达信号处理国家级实验室,西安电子科技大学)

AI总结 提出RegNetMamba-2算法,利用结构化状态空间对偶性(SSD)在粗到细匹配过程中提取局部和全局结构特征,通过跨模态交互和多尺度融合模块实现多模态图像配准,在多个数据集上取得高效性能。

详情
AI中文摘要

在多模态图像配准中,主要挑战在于共享结构信息的提取。与Transformer相比,结构化状态空间对偶性(SSD)在训练和推理过程中能以更高效率提取更全面的全局结构特征。受这些优势启发,我们提出了一种新的多模态图像配准算法,命名为RegNetMamba-2。我们的算法将SSD融入粗到细的匹配过程中,以有效提取局部和全局结构特征。首先,在网络中应用SSD于三种不同尺度进行多模态特征提取。为了增强局部表示,我们通过SSD的特征缩放函数更加关注前景边缘和结构信息。其次,针对输入图像的共享特征提取和所有尺度的多模态特征融合,我们提出了基于SSD的跨模态特征融合模型,包括跨模态特征交互(CMI)模块和多尺度特征融合(MSF)模块。CMI模块通过交叉形式的SSD用于每个尺度的跨模态特征提取。MSF模块旨在采用渐进式向上融合方式在特征层面获取精细特征,包含所有尺度的多模态特征。遵循粗到细策略,收集来自CMI的1/8尺度特征和来自MSF的1/2尺度特征以计算匹配概率分数。然后我们通过像素级对应关系分别建立匹配过程。大量实验表明,与最先进的基于深度学习的算法相比,RegNetMamba-2在以下数据集上的多模态图像配准性能和效率均取得了良好效果:VIS-SAR(OSDataset)、VIS-IR(LGHD/RoadSence)和VIS-NIR(RGB-NIR sense)。

英文摘要

In multi-modal image registration, the primary challenge lies in extracting shared structural information. Compared to Transformers, Structured State Space Duality (SSD) offers more comprehensive global structural feature extraction with higher efficiency during training and inference. Inspired by these advantages, we propose a novel algorithm for multi-modal image registration, named RegNetMamba-2. Our algorithm incorporates SSD into a coarse-to-fine matching process to extract local and global structural features effectively. First, SSD is applied at three different scales for multi-modal feature extraction in our network. To strengthen local representation, we pay more attention to foreground edges and structural information through the feature scaling function of SSD. Second, for shared feature extraction from the input images and multi-modal feature fusion across all scales, we propose a cross-modality feature fusion model based on SSD, consisting of a Cross-Modality feature Interaction (CMI) module and a Multi-Scale feature Fusion (MSF) module. The CMI module performs cross-modality feature extraction at each scale by applying SSD in a cross form. The MSF module employs progressive upward fusion at the feature level to obtain fine features that combine the multi-modal features from all scales. Following the coarse-to-fine strategy, the 1/8-scale features from CMI and the 1/2-scale features from MSF are collected to calculate matching probability scores. We then establish the matches at each level through pixel-wise correspondences. Extensive experiments demonstrate that, compared with state-of-the-art deep-learning-based algorithms, RegNetMamba-2 achieves strong performance and efficiency for multi-modal image registration on the following datasets: VIS-SAR (OSDataset), VIS-IR (LGHD/RoadSence) and VIS-NIR (RGB-NIR sense).

2606.03338 2026-06-03 cs.LG cs.CV 版本更新

IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension

IdEst: 通过内在维度评估自监督学习表示

Julie Mordacq, Vicky Kalogeiton, Steve Oudot

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出IdEst方法,利用最小生成树维度估计器评估自监督学习表示的内在维度,发现其与下游线性探测性能强相关,并能高效选择超参数。

Comments ICML 2026

详情
AI中文摘要

自监督学习(SSL)已成为从无标签数据中学习有意义表示的有效范式。然而,评估这些表示的标准协议——线性探测——计算成本高、对超参数敏感,并且对表示空间的几何结构提供的洞察有限。在这项工作中,受神经网络泛化与内在维度(ID)之间联系的启发,我们提出了IdEst,一种通过最小生成树维度估计器($\mathrm{dim}_\mathrm{MST}$)估计SSL表示ID的方法。在多种数据集、架构和SSL预训练目标上,我们表明IdEst与下游线性探测性能强相关。此外,我们证明IdEst能够实现高效的超参数选择,与监督替代方案相比显著降低计算成本。我们的结果突出了内在维度作为评估SSL表示的原则性几何代理,补充了标准的监督探测协议。

英文摘要

Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight into the geometric structure of the representation space. In this work, motivated by connections between neural network generalization and intrinsic dimension (ID), we propose IdEst, a method for estimating the ID of SSL representations via the Minimum Spanning Tree dimension estimator ($\mathrm{dim}_\mathrm{MST}$). Across diverse datasets, architectures, and SSL pretraining objectives, we show that IdEst strongly correlates with downstream linear probe performance. Furthermore, we demonstrate that IdEst enables efficient hyperparameter selection, significantly reducing the computational cost compared to supervised alternatives. Our results highlight intrinsic dimensionality as a principled geometric proxy for assessing SSL representations, complementing standard supervised probing protocols.
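The abstract does not spell out how $\mathrm{dim}_\mathrm{MST}$ is computed. One common scaling-based formulation, shown here as a rough sketch rather than the authors' implementation, relies on the fact that the total edge length of a Euclidean minimum spanning tree over n points sampled from a d-dimensional set grows roughly like n^((d-1)/d). Fitting the slope of log-length against log n then gives d = 1 / (1 - slope). The function names `mst_length` and `mst_dimension`, the plain Euclidean metric, and the least-squares fit are assumptions made for illustration.

```python
import math


def mst_length(points):
    """Total edge length of the Euclidean MST (Prim's algorithm, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Add the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def mst_dimension(samples):
    """Estimate dimension from point sets of increasing size.

    samples: list of point lists drawn from the same underlying data.
    """
    xs = [math.log(len(p)) for p in samples]
    ys = [math.log(mst_length(p)) for p in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 / (1.0 - slope)


# Points on a line segment: MST length stays constant, so the estimate is ~1.
line = [[(i / (n - 1), 0.0) for i in range(n)] for n in (10, 20, 40)]
print(mst_dimension(line))

# Points on a square grid: the estimate approaches 2 as the grids grow
# (small grids give a somewhat lower value).
grid = [[(i / (k - 1), j / (k - 1)) for i in range(k) for j in range(k)]
        for k in (5, 10, 20)]
print(mst_dimension(grid))
```

For real representations one would subsample the embedding set at several sizes and likely average over repeated draws; the quadratic-time Prim loop here is only practical for a few thousand points.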

2606.03314 2026-06-03 cs.CV 版本更新

TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing

TASE: 用于3D场景理解与编辑的截断感知语义嵌入

Tim-Felix Faasch, Jochen Kall, Lucas Nunes, Jens Behley, Cyrill Stachniss

发表机构 * Bosch Research(博世研究院) Rheinisch-Westfälische Technische Hochschule Aachen(亚琛工业大学) University of Bonn(波恩大学)

AI总结 提出TASE方法,通过将预训练的2D语义特征投影到截断感知嵌入空间,结合尺度和平移等变损失,实现可控的3D场景文本驱动编辑,在大几何修改任务上显著优于现有方法。

详情
AI中文摘要

高保真语义3D场景表示对于众多应用(包括机器人、自动驾驶和仿真)至关重要。除此之外,编辑此类表示的能力使开发人员能够更轻松地将这些应用适应特定的目标场景。当前方法对可控编辑的支持有限。我们引入TASE,一种将预训练的2D语义特征投影到截断感知嵌入空间以实现灵活3D场景编辑的方法。我们的方法显式优化了一个特征空间,在该空间中,逐步减少特征通道会产生越来越抽象的语义表示,而保留更多通道则保留细粒度细节。此外,我们使用尺度和平移等变损失来改进特征的多视图一致性。由此产生的截断感知嵌入空间支持对3D场景进行文本驱动的编辑,提供了对编辑与原始场景内容一致程度的显式控制,并允许比先前方法更实质性的修改。此外,我们提出了编辑扩散模型的微调阶段,以减轻几何变化引起的伪影。实验结果表明,在3D场景编辑中具有竞争力的性能,在涉及大几何修改的编辑上显著优于先前方法。

英文摘要

High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approaches provide limited support for controllable editing. We introduce TASE, a method that projects pretrained 2D semantic features into a truncation-aware embedding space to enable flexible 3D scene editing. Our method explicitly optimizes a feature space in which progressively reducing feature channels yields increasingly abstract semantic representations, while retaining more channels preserves fine-grained detail. Additionally, we improve multi-view consistency of the features using a scale- and translation-equivariance loss. The resulting truncation-aware embedding space enables text-driven edits to 3D scenes, providing explicit control over how strongly edits adhere to the original scene content and allowing more substantial modifications than prior methods. Moreover, we propose a finetuning stage for the editing diffusion model to mitigate artifacts caused by geometric changes. Experimental results demonstrate competitive performance in 3D scene editing, substantially outperforming prior methods on edits involving large geometric modifications.

2606.03301 2026-06-03 cs.CL cs.CV 版本更新

SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series

SagaQA:面向电视剧长篇叙事理解的多跳推理基准

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

发表机构 * IRIT, University of Toulouse, France(法国图卢兹大学IRIT中心) Agency for Science, Technology and Research (A*STAR), Singapore(新加坡科技研究局) CNRS, IRIT, France(法国CNRS与IRIT)

AI总结 提出SagaQA基准,通过跨剧集的多跳推理任务评估模型对完整电视剧多模态叙事的高层次理解,并比较并行、顺序和混合三种规划策略的性能。

详情
AI中文摘要

我们介绍了SagaQA,一个用于对完整电视剧进行多跳推理的长视频基准。现有的视频推理基准通常强调对相邻帧或片段的局部理解。SagaQA通过要求对整个电视剧中扩展的多模态叙事进行高层次理解来弥补这一空白。SagaQA的一个显著特征是其推理步骤的粒度。我们的数据集需要长距离推理跳跃来连接完全不同的剧集之间的信息。这要求模型对整个事件和动作进行推理,需要在多模态层面上深入理解剧集的叙事和进展。受近期智能体方法进展的启发,我们进一步研究了不同的规划策略如何处理这种复杂推理。我们将这些方法分为三类——并行规划器、顺序规划器和混合规划器——并评估它们生成连贯且完整推理计划的能力。我们在SagaQA上的结果表明,混合规划器始终能产生更高质量的计划,并在电视剧复杂、高层次叙事理解方面表现出更强的能力。

英文摘要

We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives in entire TV shows. A distinguishing feature of SagaQA is the granularity of its reasoning steps. Our dataset necessitates long-range reasoning hops to connect information across completely different episodes. This requires models to reason over entire events and actions, demanding a deep understanding of the show's narrative and progression at a multimodal level. Motivated by recent progress in agentic methods, we further study how different planning strategies handle such complex reasoning. We categorize these approaches into three classes (parallel, sequential, and hybrid planners) and evaluate their ability to generate coherent and complete reasoning plans. Our results on SagaQA suggest that hybrid planners consistently produce higher-quality plans and exhibit stronger capabilities for complex, high-level narrative understanding in TV shows.

2606.03287 2026-06-03 cs.CV 版本更新

BA-T: An Iterative Transformer for Two-View Bundle Adjustment

BA-T: 一种用于双视图束调整的迭代Transformer

Ganlin Zhang, Weirong Chen, Daniel Cremers, Xi Wang

AI总结 受经典束调整启发,提出BA-T,一种通过迭代Transformer在隐式token空间中实现结构化更新的轻量级方法,用于改进双视图三维重建的精度和多视图一致性。

详情
AI中文摘要

前馈三维重建模型通过深度跨视图注意力在图像间交换信息取得了强性能。然而,这些方法通常依赖沉重的解码器堆栈,缺乏几何精化的结构化机制,导致多视图一致性差。我们通过借鉴经典束调整(BA)来解决这个问题,BA可被视为位姿与局部几何之间的迭代信息传播过程。受BA启发,我们提出BA-T,一种迭代Transformer,将BA风格的结构化更新作为可重复层在隐式token空间中实现。BA-T不依赖深度注意力堆栈,而是通过单个轻量层基于潜在残差精化预测。实验表明,BA-T在迭代中逐步提升位姿和重建精度,比传统解码器实现更强的跨视图一致性,在使用仅16%解码器参数的情况下匹配或超越更大的模型。BA-T为深度注意力提供了一种紧凑、高效且结构化的替代方案,在轻量架构内实现精确的三维重建。代码将在以下网址公开:https://github.com/zhangganlin/BA-T。

英文摘要

Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consistency. We address this by drawing inspiration from classical bundle adjustment (BA), which can be viewed as an iterative information propagation process between poses and local geometry. Inspired by BA, we propose BA-T, an iterative Transformer that implements BA-style structured updates as a repeatable layer in implicit token space. Instead of relying on deep attention stacks, BA-T refines predictions from latent residuals using a single lightweight layer. Experiments demonstrate that BA-T progressively improves pose and reconstruction accuracy across iterations, achieves stronger cross-view consistency than conventional decoders, and matches or surpasses substantially larger models while using only 16% of their decoder parameters. BA-T provides a compact, efficient, and structured alternative to depth-heavy attention, enabling accurate 3D reconstruction within a lightweight architecture. The code will be made publicly available at https://github.com/zhangganlin/BA-T.

2606.03273 2026-06-03 cs.CV cs.AI cs.CL 版本更新

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

VistaHop: 视觉深度搜索的多跳视觉推理基准

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

发表机构 * East China Normal University(华东师范大学) Meituan(美团) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出VistaHop基准,通过多跳问答任务评估多模态大推理模型在视觉深度搜索中的迭代图像检查、视觉锚点定位和跨证据链推理能力,实验表明现有模型表现有限。

详情
AI中文摘要

视觉深度搜索要求多模态大推理模型(MLRM)智能体通过反复检查图像区域、将中间推理锚定在视觉证据上,并跨长推理链连接细粒度线索来回答复杂的视觉查询。然而,现有基准主要关注单步视觉理解或静态图像问答,对迭代图像检查、视觉锚点定位和多跳证据整合的评估有限。在这项工作中,我们引入了VistaHop,一个用于评估视觉深度搜索中以视觉为中心的搜索和多跳视觉推理的基准。VistaHop包含300张高分辨率图像、25个视觉搜索场景和350个多跳QA任务,这些任务要求模型跟随从视觉锚点出发的证据链,或融合跨多个基于图像的推理路径的信息。我们进一步开发了VistaArena,一个统一的评估环境,支持带有文本搜索、图像搜索、图像裁剪和基于证据的答案验证的工具增强推理。在七个代表性MLRM上的实验表明,当前模型远未解决VistaHop:最佳模型SenseNova-MARS-32B仅达到24.31%的Pass@1。这些结果揭示了在视觉定位、证据重访、长链推理和多锚点信息融合方面的持续局限性,凸显了对更强基准和训练方法的需求,以推动视觉深度搜索的发展。

英文摘要

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.

2606.03264 2026-06-03 cs.CV 版本更新

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

PaddleOCR-VL-1.6:通过欠优化区域精炼和渐进式后训练扩展文档解析前沿

Zelun Zhang, Hongen Liu, Suyin Liang, Yubo Zhang, Yiqing Xiang, Jiaxuan Liu, Ting Sun, Manhui Lin, Yue Zhang, Changda Zhou, Tingquan Gao, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

发表机构 * PaddlePaddle Team, Baidu Inc.(百度公司PaddlePaddle团队)

AI总结 提出PaddleOCR-VL-1.6,通过区域感知数据优化框架识别并增强前代模型的薄弱区域,结合渐进式后训练策略,在OmniDocBench v1.6上达到96.33%的新SOTA。

详情
AI中文摘要

我们介绍了PaddleOCR-VL-1.6,这是一个基于PaddleOCR-VL-1.5升级的紧凑型文档解析模型。尽管PaddleOCR-VL-1.5建立了强大的0.9B基线,但其剩余错误集中在欠优化区域,这些区域模型行为不稳定、数据覆盖稀疏或监督不可靠。PaddleOCR-VL-1.6没有不加区分地扩大训练语料,而是引入了一个区域感知数据优化框架,从先前模型中识别薄弱区域,对这些区域进行针对性增强,并提高监督信号的可靠性。它进一步采用基于精选数据选择和强化学习的渐进式后训练方案,通过分阶段优化将模型性能提升到更高水平。PaddleOCR-VL-1.6在OmniDocBench v1.6上达到了96.33%的新SOTA分数,展现出与顶级VLM的强劲竞争力,并为PaddleOCR-VL系列提供了实用的后训练方案。

英文摘要

We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.

2606.03254 2026-06-03 cs.CV 版本更新

FreeStreamGS: Online Feed-forward 3D Gaussian Splatting from Unposed Streaming Inputs

FreeStreamGS: 来自无位姿流式输入的在线前馈3D高斯泼溅

Ruiyang Chen, Feiran Li, Chu Zhou, Zonglin Li, Zhanyu Ma, Heng Guo

AI总结 提出FreeStreamGS,一种在线前馈框架,通过解耦内参恢复头和动态点精炼偏移策略,实现从无位姿流式输入的高效高质量新视角合成。

详情
AI中文摘要

前馈3D高斯泼溅(3DGS)允许从离线录制的图像序列进行高效高保真的新视角合成(NVS)。然而,从流式和无位姿图像输入实现在线NVS仍然具有挑战性。尽管已经提出了用于流式深度和点云恢复的在线前馈几何估计方法,但由于严重的渲染伪影,它们无法适应NVS。这是因为NVS对高斯尺度和位姿-几何对齐要求更严格的多视图一致性;即使微小的偏差也会随时间累积并明显降低渲染质量。为此,我们提出了FreeStreamGS,一个鲁棒的在线前馈框架,用于高效高质量的NVS。我们引入了两个关键机制:解耦内参恢复头,消除累积的相机内参偏差并防止长期流式中的场景尺度抖动;以及动态点精炼偏移策略,放松刚性反投影以校正耦合的位姿-深度漂移。大量实验表明,尽管FreeStreamGS无法访问未来帧,但其渲染质量与最先进的离线前馈3DGS方法相当。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have been proposed for streaming depth and point cloud recovery, they cannot be adapted to NVS due to severe rendering artifacts. This is because NVS demands stricter multi-view consistency in Gaussian scales and pose-geometry alignment; even minor deviations would accumulate over time and visibly degrade rendering quality. To this end, we propose FreeStreamGS, a robust online feed-forward framework for efficient and high-quality NVS. We introduce two key mechanisms: a Decoupled Intrinsic Recovery Head that removes cumulative camera intrinsic bias and prevents scene scale jitter during long-term streaming, and a Dynamic Point Refinement Offset strategy that relaxes rigid unprojection to correct coupled pose-depth drift. Extensive experiments show that FreeStreamGS achieves rendering quality competitive with state-of-the-art offline feed-forward 3DGS methods, despite operating without access to future frames.

2606.03251 2026-06-03 cs.AI cs.CV cs.LG eess.IV stat.ML 版本更新

Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

现实世界数据集是否包含自然实验?基于因果特征选择的实证研究

Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke

AI总结 本文利用因果发现和特征选择检测现实世界数据集中的自然实验,并通过干预性处理提升模型性能。

详情
AI中文摘要

在自然界中,影响某些个体或群体但不影响其他个体或群体的事件构成隐式干预,被称为自然实验。例如,COVID-19大流行是冠状病毒对感染COVID的亚群的一次干预。我们问:现有的现实世界数据集中是否存在自然实验?如果存在,我们应该如何处理它们?为了检测数据中的自然实验,我们使用因果发现恢复潜在因果图,并基于因果链接进行特征选择。如果通过将数据视为干预性而非观测性来提升下游性能,我们认为这表明数据集包含自然实验。我们首先通过使用合成图模拟包含和不包含自然实验的数据集来验证这一假设。然后,我们在大量现实世界数据集上进行系统的实证评估。我们的结果表明,现实世界数据集确实包含自然实验,我们可以利用这些自然实验通过因果推断来提升模型性能。我们的工作代表了该领域的初步探索,在有限范围内进行了初步研究。

英文摘要

In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

2606.03246 2026-06-03 cs.CV 版本更新

MariData: One-Step Unpaired Image Translation for Maritime Environments

MariData: 海洋环境下的单步非配对图像翻译

Santeri Henriksson, Mehdi Asadi, Amin Majd, Juha Kalliovaara

发表机构 * AIS Lab, Turku University of Applied Sciences(图尔库应用科学大学AIS实验室)

AI总结 针对海洋自主水面船舶训练数据稀缺问题,提出基于CycleGAN-turbo的单步非配对图像翻译框架,通过零卷积跳跃连接保留小目标细节,生成逼真的天气与光照条件合成数据。

详情
AI中文摘要

海洋自主水面船舶(MASS)鲁棒感知系统的发展受到多样化训练数据稀缺的严重制约,尤其是恶劣天气和低光照条件。由于在动态海洋环境中收集配对图像在物理上不可行,通过非配对图像到图像翻译生成合成数据提供了一种关键解决方案。然而,现有生成模型因潜在压缩瓶颈而无法保留小型导航目标的精细结构细节。在本文中,我们介绍了一个使用CycleGAN-turbo(一种单步非配对翻译架构)生成合成海洋数据的框架。通过引入零卷积跳跃连接以绕过变分自编码器(VAE)瓶颈,我们的方法在翻译过程中明确保留了小目标细节(例如远处的船只和海上标志)。我们收集了一个包含7000张海洋图像的数据集,用于训练和评估白天到雾天、白天到日落以及白天到夜晚的域翻译模型。定性评估和变强度推理研究表明,我们的方法有效地合成了逼真的大气条件,同时保持了场景的底层语义结构。白天到雾天和白天到日落模型表现出良好的结构保留,而白天到夜晚模型则突显了语义幻觉的挑战,例如由不平衡训练分布引起的人工海岸灯光生成。最终,这项工作建立了一个高效、结构感知的数据合成管道,直接解决了自主海洋导航中的数据稀缺瓶颈。

英文摘要

The development of robust perception systems for Maritime Autonomous Surface Ships (MASS) is heavily constrained by the scarcity of diverse training data, particularly for adverse weather and low-light conditions. Because collecting paired images in dynamic maritime environments is physically impossible, synthetic data generation via unpaired image-to-image translation offers a critical solution. However, existing generative models fail to preserve the fine structural details of small navigational objects due to latent compression bottlenecks. In this paper, we introduce a framework for generating synthetic maritime data using CycleGAN-turbo, a one-step unpaired translation architecture. By incorporating zero-convolution skip connections to bypass the Variational Autoencoder (VAE) bottleneck, our approach explicitly preserves small object details (e.g., distant vessels and sea marks) during translation. We compiled a dataset of 7,000 maritime images to train and evaluate models for Day-to-Foggy, Day-to-Sunset, and Day-to-Night domain translations. Qualitative evaluations and variable-strength inference studies demonstrate that our method effectively synthesizes realistic atmospheric conditions while maintaining the underlying semantic structure of the scene. The Day-to-Foggy and Day-to-Sunset models exhibit strong structural retention, whereas the Day-to-Night model highlights the challenge of semantic hallucination, such as generating artificial coastal lights, induced by unbalanced training distributions. Ultimately, this work establishes an efficient, structure-aware data synthesis pipeline that directly addresses the data scarcity bottleneck in autonomous maritime navigation.

2606.03243 2026-06-03 cs.CV 版本更新

MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

MemoGen:过去的经验能否改善未来的文本到图像生成?

Wenshuo Chen, Kuimou Yu, Bowen Tian, Jianfei Song, Shaofeng Liang, Haozhe Jia, Kan Cheng, Haosen Li, Kaishen Yuan, Lei Wang, Jiemin Wu, Songning Lai, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Scholarly, Guangzhou Ziyan Technology Co., Ltd.(学者,广州智源科技有限公司) LimX Dynamics Technology Co., Ltd.(逐际动力科技有限公司) Shandong University(山东大学) Data61/CSIRO Griffith University(格里菲斯大学) Jiangsu Industrial Technology Research Institute (JITRI)(江苏省产业技术研究院(JITRI))

AI总结 提出MemoGen框架,通过代理进化层和可重用经验记忆,在不更新生成器的情况下,利用过去经验改进文本到图像生成,在知识密集和推理基准上超越专有系统。

详情
AI中文摘要

现代文本到图像模型已实现强大的视觉合成,但在提示需要隐式视觉约束、关系推理或外部知识时仍不可靠。现有的检索增强和代理生成方法通过获取外部知识、参考或当前请求的优化提示来缓解此问题,但它们通常将每次生成视为孤立事件,并未系统性地保留过去的成功或失败以供将来使用。在这项工作中,我们探究文本到图像系统能否在不更新底层生成器的情况下,从自身的生成经验中持续改进。我们提出MemoGen,一种无需训练的框架,通过代理进化层增强现有图像生成器。对于每个任务,MemoGen显式推断视觉需求,必要时检索外部证据和参考,将其转化为可执行的生成约束,评估生成结果,并将任务理解、参考选择、视觉反馈、成功策略和失败教训存储为可重用的经验记忆。在进化轮次中,代理检索相关经验以改进类似的未来生成,选择性修复先前失败的案例同时保留成功的案例,从而实现在无需参数更新的情况下进行测试时自我进化。在知识密集和推理导向基准上的广泛实验证明了该范式的有效性:仅经过两轮进化,基于开源Qwen-Image骨干的MemoGen在WISE和Mind-Bench上超越了强大的专有系统,如Nano Banana Pro和GPT-Image-1,表明显式经验记忆可以作为可靠文本到图像生成的强大持续学习信号。

英文摘要

Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.

2606.03216 2026-06-03 cs.CV 版本更新

Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

Follow-Your-Preference++:重新思考图像修复中的偏好对齐

Junkun Yuan, Yutao Shen, Toru Aonishi, Hideki Nakayama, Yue Ma

发表机构 * Zhejiang University(浙江大学) The University of Tokyo(东京大学) Tsinghua University(清华大学)

AI总结 本文从基本原理出发,通过直接偏好优化框架和公开奖励模型构建偏好数据,系统研究了图像修复中的偏好对齐问题,发现奖励模型存在偏差但可通过集成缓解,并在标准指标、大视觉语言模型评估和人类评估上显著超越先前最先进模型。

Comments 23 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2509.23082

详情
AI中文摘要

我们研究图像修复中的偏好对齐。我们没有提出又一种新方法,而是从基本原理出发重新审视该问题并重新评估其核心挑战。我们采用广泛使用的直接偏好优化框架,并利用公开的奖励模型构建偏好训练数据。我们的实证研究涵盖九个奖励模型、两个基准以及两个在架构和生成机制上不同的基线修复模型。我们的主要发现是:(1) 大多数奖励模型为偏好数据构建提供了有效信号,尽管有些作为评估者不可靠。(2) 跨模型和基准,偏好数据在候选和样本缩放下表现出一致的趋势。(3) 奖励模型显示出明显的偏差,特别是在亮度、构图和配色方案方面,这使其容易引发奖励黑客行为。(4) 简单的奖励模型集成减轻了此类偏差,并产生了稳健且可泛化的性能。(5) 偏好对齐可迁移到对象移除任务,其中目标从开放式创意生成转变为连贯的背景补全。(6) 进一步分析表明,校准的集成方法进一步减轻了黑客行为并提高了鲁棒性。在不修改模型架构或引入额外数据集的情况下,我们的模型在标准指标、大型视觉语言模型评估和人类评估上显著优于先前最先进的模型。我们的代码可在以下网址获取:https://github.com/shenytzzz/Follow-Your-Preference。

英文摘要

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases--particularly in brightness, composition, and color scheme--that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.
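
The abstract names two ingredients: preference pairs built from an ensemble of reward models, and a direct preference optimization (DPO) objective. The Python sketch below illustrates both under loud assumptions. Rank-averaging is only one simple way to combine reward models with different scales, the loss is the textbook DPO form on log-probabilities, and the function names `rank_normalize`, `build_preference_pair`, and `dpo_loss` are hypothetical. It is not the authors' pipeline, and diffusion-based inpainting models would typically need a surrogate for the log-likelihood terms.

```python
import numpy as np

def rank_normalize(scores):
    """Map raw scores to ranks in [0, 1] so reward models with different scales are comparable."""
    scores = np.asarray(scores, dtype=float)
    ranks = scores.argsort().argsort()  # rank of each candidate, 0 = lowest score
    return ranks / max(len(scores) - 1, 1)

def build_preference_pair(reward_matrix):
    """Pick (winner_index, loser_index) among candidates for one prompt.

    reward_matrix has shape (num_reward_models, num_candidates).
    Averaging normalized ranks is an illustrative ensemble rule (assumption).
    """
    reward_matrix = np.asarray(reward_matrix, dtype=float)
    ensemble = np.mean([rank_normalize(row) for row in reward_matrix], axis=0)
    return int(ensemble.argmax()), int(ensemble.argmin())

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Textbook DPO loss for one pair: -log(sigmoid(beta * margin)).

    logp_* come from the trained policy, ref_logp_* from the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(np.log1p(np.exp(-margin)))
```

Rank normalization keeps a single reward model with a large numeric range from dominating the ensemble, which is one plausible reason an ensemble can dampen the per-model biases the abstract describes.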

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG 版本更新

Effect of Demographic Bias on Skin Lesion Classification

人口统计偏差对皮肤病变分类的影响

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

发表机构 * Fontys University of Applied Science, Venlo, The Netherlands(方提斯应用科学大学,荷兰芬洛) Fontys University of Applied Science, Eindhoven, The Netherlands(方提斯应用科学大学,荷兰埃因霍温) Eindhoven University of Technology, Eindhoven, The Netherlands(埃因霍温理工大学,荷兰埃因霍温) IT University of Copenhagen, Denmark(哥本哈根信息技术大学,丹麦)

AI总结 本研究使用基于ResNet的卷积模型评估皮肤病变分类性能,通过线性规划控制人口统计特征,研究患者性别和年龄偏差的影响,并比较三种学习策略,发现性别偏差主要源于数据不平衡,而年龄偏差始终偏向年轻群体。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

详情
Journal ref
https://melba-journal.org/2026:011
AI中文摘要

在这项研究中,我们评估了使用基于ResNet的卷积模型进行皮肤病变分类的性能,重点关注训练数据中人口统计偏差的影响,特别是患者性别和年龄的变化。我们使用线性规划生成具有受控人口统计特征的数据集,从而系统性地研究偏差效应。评估了三种学习策略:单任务模型、强化多任务模型和对抗学习方案。我们的性别分析表明,性别特定的训练数据集优化了模型性能。值得注意的是,在训练数据中包含男性患者提高了男性亚组的性能,即使在女性占多数的情况下也是如此。强化多任务方案和对抗学习方案缩小或消除了平衡和女性占多数数据集中的偏差差距。然而,这些策略在男性占多数的环境中效果较差,模型在男性上的表现仍然优于女性。在以男性为主的患者群体中,与基线模型相比,这两种学习方案仅略微减少了偏差。基于年龄的分析表明,三种模型方法的基线性能相当,性能随年龄类别下降。无论训练数据分布如何,年轻组始终达到最高性能。尽管平衡训练对最年轻年龄组产生最佳结果,但较老年组的性能下降。我们发现性别偏差主要源于数据不平衡,而年龄偏差无论分布如何始终偏向年轻群体。这些不同的机制需要有针对性的缓解策略。此外,在两个外部数据集上的跨数据集验证表明,域转移显著影响性能和人口统计偏差模式。

英文摘要

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03183 2026-06-03 cs.MM cs.CV cs.SD eess.AS 版本更新

Inference-Time Scaling for Joint Audio-Video Generation

联合音视频生成的推理时缩放

Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院) Luma AI

AI总结 针对联合音视频生成中多目标优化的挑战,提出多验证器框架与自适应奖励加权算法,在无需额外训练的情况下显著提升语义对齐、感知质量和音视频同步。

Comments Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/

详情
AI中文摘要

联合音视频生成旨在合成与文本提示语义对齐且精确同步的逼真音视频对。现有联合音视频生成模型通常需要大量训练资源来提高保真度,而推理时缩放(ITS)最近在单模态领域成为一种有前景的无训练替代方案。然而,将ITS从单模态扩展到多模态领域并非易事,因为它需要平衡多个异构目标。在本文中,我们首次对联合音视频生成的ITS进行了全面研究。我们首先证明多验证器框架对于解决单目标指导的局限性(包括非对称性能权衡和验证器欺骗)至关重要。通过系统分析,我们随后确定了一个最优的多验证器组合,该组合在所有质量维度上产生均衡的改进。最后,为了有效聚合多样化的奖励信号,我们提出了自适应奖励加权(ARW),一种新颖的测试时优化算法。ARW将奖励聚合视为在线优化问题,利用可学习参数校准奖励方差,无需奖励分布的先验知识,从而确保鲁棒的多目标选择。在VGGSound和JavisBench-mini基准上的实验结果表明,我们的框架显著增强了生成输出的语义对齐、感知质量和音视频同步。合成样本和代码可在项目页面获取:https://jung-jaemin.github.io/ITS-AVGen-Proj/。

英文摘要

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.

2606.03180 2026-06-03 cs.CV cs.CL cs.LG 版本更新

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

GLINT:面向细粒度放射学表征的稀疏门控视觉-语言对齐

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

AI总结 针对放射学图像-报告全局对齐与局部病灶尺度不匹配的问题,提出GLINT框架,通过稀疏门控对齐和密集特征正则化实现零样本分类、定位和分割。

详情
AI中文摘要

放射学中的视觉-语言模型(VLM)通过利用临床工作流程中自然产生的图像-报告对,已成为一种可扩展的范式。然而,这种配对揭示了尺度上的不匹配:每个病灶仅占据图像的一小部分区域,但监督仅在全局图像-报告级别提供。这带来了一个核心挑战:先前的方法将权重密集地分布到所有补丁上,而不是集中在与给定查询相关的稀疏子集上。为了解决这个问题,我们提出了GLINT(门控语言-图像对齐)框架,该框架显式建模这种稀疏对应关系。在对齐方面,我们引入了稀疏门控对齐,这是一种新颖的架构,其中在单独的门控嵌入空间上的sigmoid门仅激活与每个文本查询相关的补丁,强制执行显式稀疏性。在表征方面,我们添加了密集特征正则化,将可训练编码器的中间特征锚定到冻结的自监督学习(SSL)教师模型上,从而保留门控所依赖的细粒度补丁特征。相同的方案适用于2D胸部X光片(CXR)和3D胸部计算机断层扫描(CT),分别基于DINOv3和V-JEPA 2.1构建。GLINT支持从自由文本查询进行零样本分类、定位和分割,据我们所知,这是首次在没有掩码监督的情况下在3D CT体积上展示零样本分割。值得注意的是,最显著的增益出现在零样本定位和分割上,这些任务需要稀疏的、特定于查询的定位,这与我们的设计意图一致。在下游评估中,GLINT在分类、报告生成和分割方面均优于SSL编码器和医学VLM。

英文摘要

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
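
The contrast the abstract draws between dense weighting and a sparse, query-specific gate can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not GLINT's architecture: the shapes, the gate-weighted pooling rule, and the names `gated_alignment_score`, `patch_gate_feats`, and `text_gate_feat` are hypothetical. It shows only the general idea that a sigmoid gate computed in a separate embedding space scores each patch independently, so a single relevant patch can dominate the image-text score.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_alignment_score(patch_feats, patch_gate_feats, text_feat, text_gate_feat, bias=0.0):
    """Gate-weighted image-text similarity for one text query.

    patch_feats:      (N, D) patch embeddings in the alignment space
    patch_gate_feats: (N, G) patch embeddings in a separate gate space (assumption)
    text_feat:        (D,)   query embedding in the alignment space
    text_gate_feat:   (G,)   query embedding in the gate space
    """
    # Each patch gets an independent gate in (0, 1); unlike softmax, gates do not compete.
    gate = sigmoid(patch_gate_feats @ text_gate_feat + bias)
    similarity = patch_feats @ text_feat
    score = float((gate * similarity).sum() / (gate.sum() + 1e-8))
    return score, gate
```

Because the gates are not normalized across patches, a query can in principle activate zero, one, or many patches, which matches the abstract's point that each finding occupies only a small region.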

2606.03168 2026-06-03 cs.CV 版本更新

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

JAVEDIT: 联合音频-视觉指令引导视频编辑与智能体数据策展

Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan

发表机构 * Zhejiang University(浙江大学) Tencent Youtu Lab(腾讯优图实验室) Nanjing University(南京大学) University of Auckland(奥克兰大学) Fudan University(复旦大学) National University of Singapore(新加坡国立大学)

AI总结 针对联合音频-视觉编辑缺乏数据集和基准的问题,提出首个大规模高质量数据集JAVEdit-100k、基准JAVEditBench以及基线模型JAVEdit,在六项指标中五项超越所有基线。

Comments Equal contributions from first two authors. Project page: https://ryanchenyn.github.io/projects/JAVEdit Code: https://github.com/RyanChenYN/JAVEdit Dataset: https://huggingface.co/datasets/Coraxor/JAVEdit-100k

详情
AI中文摘要

虽然基于指令的视频编辑已取得显著进展,但联合音频-视觉编辑仍受限于缺乏专用数据集和基准。为填补这一空白,我们提出了JAVEdit-100k,这是首个为指令引导的联合音频-视觉编辑定制的大规模高质量数据集。该数据集专注于以人为中心的视频,包含约10万个编辑三元组,涵盖五个不同类别,包括主体编辑和语音编辑。该数据集通过四个精心设计的生成流程严格构建,并无缝配对智能体在环质量控制机制。此外,为解决该领域缺乏标准化评估的问题,我们引入了JAVEditBench,这是一个全面的基准,包含精选源视频和跨所有编辑类别的人类对齐指令。最后,我们提出了JAVEdit,一个用于指令引导的联合音频-视觉编辑的开创性基线模型。实验表明,JAVEdit在六项评估指标中的五项上优于所有基线。

英文摘要

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

2606.03160 2026-06-03 cs.CV 版本更新

SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

SRENet:用于点云动作识别的频谱重入网络

Qiuxia Wu, Jiarui Lan, Wenxiong Kang, Zhiyong Wang, Kun Hu

发表机构 * School of Software Engineering, South China University of Technology(华南理工大学软件工程学院) School of Automation Science and Engineering, South China University of Technology(华南理工大学自动化科学与工程学院) School of Computer Science, University of Sydney(悉尼大学计算机科学学院) School of Science, Edith Cowan University(伊迪斯科文大学理学院)

AI总结 提出SRENet,通过频谱分解与重入模块从频率角度学习全局上下文和细粒度时间动态,实现点云序列动作识别。

Comments 13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology

详情
AI中文摘要

从点云序列中识别人体动作对于自动驾驶和人机交互等3D感知驱动应用至关重要。然而,点云的不规则结构和时间不一致性给时空表示学习带来了独特挑战,特别是在捕捉全局运动上下文和细粒度时间动态方面。我们提出SRENet,一个频谱感知框架,旨在从频率角度显式学习动作识别的全局上下文和细粒度时间动态。SRENet引入频谱分解块(SDeBlock),沿时间和空间轴进行基于小波的分析,通过频率特定注意力将特征分解为低频和高频分量。为了恢复残差动态并重新对齐在语义融合过程中扭曲的时间频率结构,频谱重入块(SReBlock)执行二次时间分解。此外,设计了一种频谱感知学习策略,通过对比损失和课程调度增强两个频率子空间的可区分性,该调度逐渐将焦点从低频空间转移到高频空间,与从粗到细的运动模式一致。在MSR-Action3D、NTU-RGBD和NTU-RGBD120上的大量实验表明,SRENet实现了最先进的性能,验证了频率建模在基于点云的动作理解中的有效性。

英文摘要

Recognizing human actions from point cloud sequences is critical for 3D perception driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces in line with coarse to detailed motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.

2606.03159 2026-06-03 cs.CV cs.AI cs.RO 版本更新

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA OmniDreams:用于闭环自动驾驶仿真的实时生成式世界模型

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

AI总结 提出OmniDreams,一个基于Cosmos扩散模型训练的基础生成式世界模型,通过自回归生成动作条件视频,实现闭环仿真中复杂长尾场景的实时合成,并验证其在策略模型训练中的有效性。

详情
AI中文摘要

随着自动驾驶能力的提升,在长尾场景中安全评估驾驶策略仍是一个关键瓶颈。在闭环仿真中,驾驶策略模型与环境主动交互,其动作动态更新模拟器状态并直接影响下一组生成的传感器观测。尽管近期基于重建的神经模拟器提供了逼真效果,但它们从根本上受限于初始捕获数据,难以泛化到高度动态或新颖场景。为克服这些限制,我们引入了OmniDreams,一个从Cosmos扩散模型进行中期和后训练的基础生成式世界模型,能够自回归地实时生成动作条件视频。通过利用Cosmos丰富的视觉先验以及在21k小时驾驶场景上的中期和后训练,OmniDreams合成了传统模拟器难以捕获的复杂未观测现象,例如极端天气和不可预测的动态智能体行为。关键在于,它自回归地根据过去帧、当前模拟器状态和即时驾驶动作来调节其逼真的传感器生成。在结合Alpamayo 1策略模型和AlpaSim编排器的闭环系统中部署时,OmniDreams充当一个高度响应、反应灵敏的环境,为训练和评估下一代自动驾驶策略提供了可扩展且全面的解决方案。我们还展示了初步结果,表明从OmniDreams后训练的世界-动作模型(WAM)在Physical AI自动驾驶NuRec数据集上取得了强劲性能,超越了基于VLA的Alpamayo 1.5研究策略模型,同时仅使用其1/5的总参数量。这些结果凸显了像OmniDreams这样的实时世界模型也有潜力作为策略架构的骨干网络。

英文摘要

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

2606.03148 2026-06-03 cs.CV 版本更新

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

$A^2$: 较小的自监督ViT比更大的ViT定位更优

Sreehari Rammohan, Huy Ha, Carl Vondrick

发表机构 * Columbia University(哥伦比亚大学) Stanford University(斯坦福大学)

AI总结 针对视觉分类中前景定位与丰富表征的矛盾,提出$A^2$方法,通过解耦小模型定位与大模型嵌入,利用预训练特征实现无需额外训练的竞争性能。

详情
AI中文摘要

鲁棒的视觉分类通常依赖于定位图像中的主要前景对象,同时忽略上下文干扰。令人惊讶的是,我们发现较小的自监督ViT的注意力图比更大的ViT能更好地定位前景对象。然而,我们仍然需要大型ViT,因为它们从每个补丁中提取更丰富的表示。为了兼顾良好的定位和丰富的表示,我们提出了$A^2$,一种简单的方法,通过将看哪里(小注意力模型)与提取什么(大嵌入模型)解耦,利用这种逆缩放发现:我们围绕小模型的注意力峰值裁剪图像,并用大模型嵌入这些裁剪块。$A^2$完全使用预训练特征,不需要组标签,也不需要针对每个数据集进行注意力或骨干网络训练。在5个基准测试中,$A^2$与基于骨干匹配的损失级方法(如DFR)具有竞争力,并且在更强的分布偏移下优于端到端注意力训练。

英文摘要

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
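
The "where to look" half of the decoupling described above can be sketched in a few lines. The Python code below is a minimal illustration, assuming a single attention peak, a square crop, and a patch-level attention map that has already been extracted from a small ViT. The function name `crop_around_attention_peak` and the clamping rule at image borders are hypothetical choices, not details taken from the paper, which crops around peaks (plural) and then embeds the crops with a larger model.

```python
import numpy as np

def crop_around_attention_peak(image, attn_map, crop_size):
    """Crop a square window centered on the strongest attention patch.

    image:     (H, W, C) array
    attn_map:  (h, w) patch-level attention from a small model (assumed given)
    crop_size: side length of the crop in pixels
    Returns the crop and its (top, left) pixel offset.
    """
    H, W = image.shape[:2]
    h, w = attn_map.shape
    crop_size = min(crop_size, H, W)

    # Locate the peak patch and convert its center to pixel coordinates.
    py, px = np.unravel_index(np.argmax(attn_map), attn_map.shape)
    cy = int((py + 0.5) * H / h)
    cx = int((px + 0.5) * W / w)

    # Clamp so the window stays inside the image.
    top = int(np.clip(cy - crop_size // 2, 0, H - crop_size))
    left = int(np.clip(cx - crop_size // 2, 0, W - crop_size))
    return image[top:top + crop_size, left:left + crop_size], (top, left)
```

In a full pipeline the returned crop would be resized and passed to the larger embedding model; that step is omitted here because it depends on the specific backbone.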

2606.03142 2026-06-03 cs.CV 版本更新

Disentangling Visual and Factual Correctness in LVLMs' Visualization Literacy

解构LVLMs可视化素养中的视觉与事实正确性

Soohyun Lee, Jaeyoung Kim, Seokhyeon Park, Sihyeon Lee, Jiwon Song, Bohyoung Kim, Hyunjoo Song, Jinwook Seo

发表机构 * Seoul National University(首尔国立大学) MADI Co., Ltd.(MADI公司) Hankuk University of Foreign Studies(韩国外国语大学) Soongsil University(崇实大学)

AI总结 提出框架分离视觉正确性与事实正确性,通过反事实测试和仲裁指标揭示LVLMs在可视化素养评估中依赖事实记忆而非视觉推理的问题。

Comments Under review at IEEE Transactions on Visualization and Computer Graphics (TVCG). 23 pages, 9 figures

详情
AI中文摘要

大型视觉语言模型(LVLMs)展现出强大的可视化解释能力,但尚不清楚其响应是否反映对视觉证据的真实推理,还是训练中习得的事实先验。当前评估混合了这两种来源,掩盖了正确视觉解释被记忆事实覆盖的情况。我们提出了一个将视觉正确性与事实正确性分离的框架,揭示了现有可视化素养评估的有效性局限。通过15个最先进LVLMs的三个实验:(1)多个模型在标准测试(VLAT)上达到人类水平,但这可能反映事实回忆而非视觉理解,而随机数据测试(reVLAT)在正确视觉解释被事实先验取代时低估了素养。(2)使用我们的反事实可视化素养评估测试(CVLAT)和能力归一化仲裁指标,我们根据视觉-事实依赖指数(VFRI)的符号对模型进行分类,揭示了以视觉为导向的多数和以事实知识为导向的少数,尽管几个接近零的情况需要谨慎。在相同反事实项目上的人类基线(N=30)证实,人们在冲突时绝大多数遵循图表,提供了人类参考点。(3)基于提示的干预可以改变优先级,但其有效性高度依赖模型且方向不对称,高图表阅读能力不能预测提示可控性。总体而言,高可视化准确性不足以证明忠实的视觉推理:可靠地集成到视觉分析中不仅需要评估可视化素养,还需要评估模型在视觉证据和事实先验分歧时如何仲裁。基准和代码:https://github.com/JaeyoungKim-HCIL/CVLAT

英文摘要

Large Vision-Language Models (LVLMs) show strong visualization interpretation, yet it is unclear whether their responses reflect genuine reasoning over visual evidence or factual priors learned during training. Current evaluations mix these two sources, obscuring when correct visual interpretation is overridden by memorized facts. We present a framework that isolates visual correctness from factual correctness, revealing validity limitations in existing visualization literacy assessments. Across three experiments with 15 state-of-the-art LVLMs: (1) several models reach human-level performance on standard tests (VLAT), but this may reflect factual recall rather than visual understanding, while randomized-data tests (reVLAT) underestimate literacy when correct visual interpretation is superseded by factual priors. (2) Using our Counterfactual Visualization Literacy Assessment Test (CVLAT) with capability-normalized arbitration metrics, we classify models by the sign of their visual-factual reliance index (VFRI), revealing a visualization-oriented majority and a factual knowledge-oriented minority, though several near-zero cases warrant caution. A human baseline (N=30) on the same counterfactual items confirms that people overwhelmingly follow the chart under conflict, providing a human reference point. (3) Prompt-based intervention can shift prioritization, but its effectiveness is highly model-dependent and direction-asymmetric, and high chart-reading capability does not predict prompt-controllability. Overall, high visualization accuracy is not sufficient evidence of faithful visual reasoning: reliable integration into visual analytics requires evaluating not only visualization literacy but also how models arbitrate between visual evidence and factual priors when the two diverge. Benchmark and code: https://github.com/JaeyoungKim-HCIL/CVLAT

2606.03120 2026-06-03 cs.CV 版本更新

KC-3DGS: Kurtosis-Constrained Gaussian Splatting for High-Fidelity View Synthesis

KC-3DGS: 基于峰度约束的高斯泼溅用于高保真视图合成

Vivekjyoti Banerjee, Abhay Yadav, Rama Chellappa, Aniket Roy

发表机构 * Johns Hopkins University(约翰霍普金斯大学) NEC Labs America(NEC美国实验室)

AI总结 提出KC-3DGS,通过在小波域添加多尺度对齐损失、峰度集中损失和跨频带协方差惩罚,增强3DGS的感知质量,尤其改善稀疏视图下的高频细节和结构伪影。

详情
AI中文摘要

3D高斯泼溅(3DGS)通过将场景表示为各向异性高斯集合,并通过可微分光栅化优化,实现了实时新视图合成。然而,标准像素空间损失(L1、SSIM)仅约束整体重建误差,允许优化在频率尺度上重新分配误差。这导致过度平滑和结构伪影,尤其在监督有限的稀疏视图设置中。我们提出KC-3DGS,通过基于自然图像统计的小波域监督来增强3DGS训练。我们的方法结合了三个组件:(1)多尺度小波系数对齐损失,显式惩罚缺失的高频细节;(2)有监督的峰度集中损失,鼓励渲染图像匹配真实图像的重尾频率统计;(3)跨频带协方差惩罚,促进频率专门化。我们提供理论分析,表明像素空间损失允许在小波重分布下的一族不可区分扰动,而我们的联合目标排除了退化解。在MipNeRF360、Tanks&Temples、MVImgNet、DeepBlending和WRIVA-ULTRRA上的实验表明,感知质量持续提升。在具有挑战性的WRIVA-ULTRRA室外数据集上,KC-3DGS在DreamSim上提高了9.48%,同时改善了PSNR、SSIM和LPIPS。在仅有12张训练图像的稀疏视图设置中,我们的方法在MipNeRF360上将PSNR提高了高达0.5 dB,同时保持了感知质量。该方法作为即插即用的正则化策略,可无缝集成到现有的3DGS流程中。

英文摘要

3D Gaussian Splatting (3DGS) enables real-time novel view synthesis by representing scenes as collections of anisotropic Gaussians optimized via differentiable rasterization. However, standard pixel-space losses (L1, SSIM) constrain only aggregate reconstruction error, permitting the optimization to redistribute error across frequency scales. This leads to oversmoothing and structural artifacts, particularly in sparse-view settings where supervision is limited. We propose KC-3DGS, which augments 3DGS training with wavelet-domain supervision based on natural image statistics. Our method combines three components: (1) a multi-scale wavelet coefficient alignment loss that explicitly penalizes missing high-frequency detail, (2) a supervised kurtosis concentration loss that encourages rendered images to match the heavy-tailed frequency statistics of ground-truth images, and (3) a cross-band covariance penalty that promotes frequency specialization. We provide theoretical analysis showing that pixel-space losses admit a family of indistinguishable perturbations under wavelet redistribution, and that our joint objective excludes degenerate solutions. Experiments across MipNeRF360, Tanks&Temples, MVImgNet, DeepBlending, and WRIVA-ULTRRA demonstrate consistent improvements in perceptual quality. On the challenging WRIVA-ULTRRA outdoor dataset, KC-3DGS achieves a 9.48% improvement in DreamSim while also improving PSNR, SSIM, and LPIPS. In sparse-view settings with only 12 training images, our method improves PSNR by up to 0.5 dB on MipNeRF360 while maintaining perceptual quality. The approach integrates seamlessly into existing 3DGS pipelines as a plug-and-play regularization strategy.
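
To make the kurtosis term more concrete, here is a minimal sketch of how a kurtosis-matching penalty on wavelet detail coefficients could look. It assumes a one-level 1-D Haar transform and an absolute-difference penalty; the abstract does not give the exact loss, so `haar_1d`, `kurtosis`, and `kurtosis_loss` are hypothetical names and the paper's formulation may differ.

```python
def haar_1d(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 2 ** 0.5
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail


def kurtosis(values):
    """Non-excess kurtosis: fourth central moment over squared variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2 + 1e-12)  # small constant avoids division by zero


def kurtosis_loss(rendered, target):
    """Assumed penalty: gap between the kurtosis of the two detail bands."""
    _, detail_r = haar_1d(rendered)
    _, detail_t = haar_1d(target)
    return abs(kurtosis(detail_r) - kurtosis(detail_t))


sharp = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0]   # isolated edge
smooth = [0.0, 0.0, 2.0, 4.0, 2.0, 0.0, 0.0, 0.0]  # blurred version
print(kurtosis_loss(smooth, sharp))  # positive: the detail statistics differ
```

A real 3DGS pipeline would apply a 2-D multi-level transform to rendered and ground-truth images and differentiate through it; the 1-D version only illustrates the statistic being matched.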

2606.03119 2026-06-03 cs.CV cs.AI cs.LG 版本更新

GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance

GuidedBridge: 无需训练地利用先验引导改进桥接模型

Zehua Chen, Yucheng Yang, Binjie Yuan, Kaiwen Zheng, Jun S. Liu, Jun Zhu

发表机构 * University of Science and Technology of China(中国科学技术大学)

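AI summary: Proposes training-free Prior Guidance (PG) and frequency-modulated prior guidance (FMPG), which strengthen prior exploitation in bridge models by contrasting a weak prior with the seen prior, and designs the cascaded CFG-FMPG framework for image in-painting; experiments show consistent gains for pre-trained bridge models across diverse image translation tasks.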

Comments ICML 2026

AI中文摘要

引导方法,如无分类器引导(CFG)和自动引导(AG),推动了扩散模型中噪声到数据生成的发展。最近,桥接模型引入了一种数据到数据的生成过程,可以利用有指导性的干净先验。在这项工作中,受先前通过去噪结果质量差异作为引导的方法启发,我们提出了一种无需训练的桥接引导方法,称为先验引导(PG)。具体来说,我们引入一个弱先验,该先验在桥接预训练期间未见,阻碍先验利用从而降低去噪结果。然后,我们将其与已见先验对比,通过缩放因子突出并增强先验利用。此外,我们分析了桥接过程中先验利用的潜在机制,并设计了频率调制先验引导(FMPG),该引导将引导尺度调整到与桥接生成动力学一致的低频和高频带。为了解决图像修复中的先验利用问题,我们开发了一个级联框架CFG-FMPG,该框架首先通过CFG生成噪声隐藏表示,然后将其作为生成先验与FMPG一起利用,在不影响推理效率的情况下发挥它们的互补优势。实验表明,我们的PG方法在多种图像翻译任务中一致地改进了预训练桥接模型。

英文摘要

Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods that use the quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading the denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, combining their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.
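
The contrast between seen and weak priors can be read as a CFG-style extrapolation. The sketch below assumes exactly that form, pushing the seen-prior prediction away from the weak-prior prediction by a scaling factor; the abstract does not state the precise update rule, and `prior_guidance` is a hypothetical name.

```python
def prior_guidance(pred_seen, pred_weak, scale):
    """Extrapolate away from the weak-prior prediction (assumed CFG-style rule).

    pred_seen: denoising prediction conditioned on the prior seen in pre-training
    pred_weak: prediction conditioned on the degraded, unseen weak prior
    scale:     guidance strength; 0 returns pred_seen unchanged
    """
    return [s + scale * (s - w) for s, w in zip(pred_seen, pred_weak)]


guided = prior_guidance([1.0, 2.0], [0.5, 2.0], scale=1.0)
print(guided)  # the first component moves further from the weak prediction
```

A frequency-modulated variant would presumably apply different `scale` values to the low- and high-frequency parts of the difference, but that split is not specified in the abstract.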

2606.03118 2026-06-03 cs.LG cs.CV q-bio.NC 版本更新

Learning to See via Epiretinal Implant Stimulation in silico with Model-Based Deep Reinforcement Learning

通过基于模型的深度强化学习在计算机模拟(in silico)中学习经由视网膜上植入物刺激的视觉

Jacob Lavoie, Marwan Besrour, William Lemaire, Jean Rouat, Réjean Fontaine, Eric Plourde

发表机构 * Department of Electrical Engineering and Computer Engineering, Université de Sherbrooke(电气与计算机工程系, Sherbrooke 大学)

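AI summary: Proposes using both isotropic and anisotropic shapes, assembled by a deep reinforcement learning agent, to render intelligible images on a virtual patient's retina, with the aim of improving acuity in artificially restored vision.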

Comments 18 pages, 6 figures. Published version: Biomed. Phys. Eng. Express 10, 025006 (2024)

Journal ref
Biomed. Phys. Eng. Express 10 (2024) 025006
AI中文摘要

目标:年龄相关性黄斑变性和视网膜色素变性等疾病会导致感光层退化。恢复视力的一种方法是通过微电极阵列(如视网膜上植入物)电刺激存活的视网膜神经节细胞。已知视网膜上植入物会产生沿邻近视网膜神经节细胞轴突束延伸的可见各向异性形状。最近的研究表明,为了获得各向同性的像素状形状,可以通过失活电极或降低刺激电流水平来映射轴突束并避免刺激它们。避免轴突束刺激旨在去除类似笔触的形状,转而采用更简化的像素状形状集合。方法:在本研究中,我们提出使用各向同性和各向异性形状,在名为rlretina的强化学习环境中为虚拟患者的视网膜渲染可理解的图像。该环境将任务形式化为在基于笔触的渲染任务中使用笔触。主要结果:我们训练了一个深度强化学习智能体,它学会组合各向同性和各向异性形状以形成图像。我们研究了哪种基于误差或基于感知的指标适合奖励智能体。该智能体以基于模型的数据生成方式训练,使用经过心理物理学验证的轴突映射模型来渲染不同虚拟患者感知到的图像。我们表明,与不同虚拟患者中的朴素方法相比,该智能体可以生成更可理解的图像。意义:这项工作提供了一种解决视网膜上刺激的新方法,这是朝着使用各向异性光幻视改善人工恢复视力中视觉敏锐度的第一步。

英文摘要

Objective: Diseases such as age-related macular degeneration and retinitis pigmentosa cause the degradation of the photoreceptor layer. One approach to restore vision is to electrically stimulate the surviving retinal ganglion cells with a microelectrode array such as an epiretinal implant. Epiretinal implants are known to generate visible anisotropic shapes elongated along the axon fascicles of neighboring retinal ganglion cells. Recent work has demonstrated that to obtain isotropic pixel-like shapes, it is possible to map axon fascicles and avoid stimulating them by inactivating electrodes or lowering stimulation current levels. Avoiding axon fascicle stimulation aims to remove brushstroke-like shapes in favor of a more limited set of pixel-like shapes. Approach: In this study, we propose the use of isotropic and anisotropic shapes to render intelligible images on the retina of a virtual patient in a reinforcement learning environment named rlretina. The environment formalizes the task as using brushstrokes in a stroke-based rendering task. Main Results: We train a deep reinforcement learning agent that learns to assemble isotropic and anisotropic shapes to form an image. We investigate which error-based or perception-based metric is adequate to reward the agent. The agent is trained in a model-based data generation fashion using the psychophysically validated axon map model to render images as perceived by different virtual patients. We show that the agent can generate more intelligible images compared to the naive method in different virtual patients. Significance: This work shares a new way to address epiretinal stimulation that constitutes a first step towards improving visual acuity in artificially-restored vision using anisotropic phosphenes.

2606.03114 2026-06-03 cs.CV 版本更新

FAF-CD: Frequency-Aware Fusion for Change Detection under Imperfect Multimodal Remote Sensing

FAF-CD: 面向不完美多模态遥感的频率感知融合变化检测

Yufan Wang, Sokratis Makrogiannis, Chandra Kambhamettu

发表机构 * University of South Florida(南佛罗里达大学) Delaware State University(特拉华州立大学)

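AI summary: Proposes FAF-CD, a frequency-aware hybrid framework that pairs a DINOv3-pretrained ConvNeXt encoder and a VMamba decoder with a rectification-aware tri-branch fusion module (deformable spatial alignment, Fourier and Haar-wavelet comparisons, adaptive gating), improving accuracy and reducing compute for imperfect heterogeneous remote sensing (such as EO-SAR) and binary optical change detection.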

Comments Code will be released at https://github.com/VimsLab/FAF-CD

详情
AI中文摘要

面向真实世界监测的遥感变化检测通常依赖于不完美的异质观测,其中事件前后图像可能异步、跨传感器,或受光照、季节和模态偏移影响。这一设置对EO-SAR灾害制图尤其具有挑战性,因为干扰变化可能类似于结构损伤。我们提出FAF-CD,一种频率感知混合框架,采用DINOv3预训练的ConvNeXt编码器和线性复杂度的基于VMamba的解码器。其修正感知三支融合模块将可变形空间对齐与傅里叶和哈尔小波比较相结合,使用自适应门控跨尺度聚合互补线索。在BRIGHT验证集上,匹配的异质EO-SAR适应在干净和扰动tc-mIoU/tc-mAP上优于NeXt2Former-CD。FAF-CD还泛化到二元光学变化检测,在LEVIR-CD上达到0.924 cF1,在WHU-CD上达到0.955 cF1,并在伪变化对齐压力测试下,在M-CD和NeXt2Former-CD中,在两个二元数据集上获得最佳平均扰动cIoU/cF1。相对于NeXt2Former-CD,它进一步降低了约24 GFLOPs的计算成本,同时保持或提高了精度。

英文摘要

Remote sensing change detection for real-world monitoring often relies on imperfect heterogeneous observations, where pre- and post-event images may be asynchronous, cross-sensor, or affected by illumination, seasonal, and modality shifts. This setting is especially challenging for EO-SAR disaster mapping, where nuisance variation can resemble structural damage. We propose FAF-CD, a frequency-aware hybrid framework with a DINOv3-pretrained ConvNeXt encoder and a linear-complexity VMamba-based decoder. Its rectification-aware tri-branch fusion module combines deformable spatial alignment with Fourier and Haar-wavelet comparisons, using adaptive gating to aggregate complementary cues across scales. On BRIGHT validation, a matched heterogeneous EO-SAR adaptation improves clean and perturbed tc-mIoU/tc-mAP over NeXt2Former-CD. FAF-CD also generalizes to binary optical CD, achieving 0.924 cF1 on LEVIR-CD and 0.955 cF1 on WHU-CD, and obtains the best average perturbed cIoU/cF1 on both binary datasets among M-CD and NeXt2Former-CD under pseudo-change-aligned stress tests. It further reduces cost by approximately 24 GFLOPs relative to NeXt2Former-CD while maintaining or improving accuracy.

2606.03111 2026-06-03 cs.CV 版本更新

Inverting the Generation Process of Denoising Diffusion Implicit Models: Empirical Evaluation and a Novel Method

反转去噪扩散隐式模型的生成过程:实证评估与新方法

Yan Zeng, Masanori Suganuma, Takayuki Okatani

发表机构 * Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究生院) RIKEN Center for AIP(理化学研究所AIP中心)

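AI summary: Proposes a hybrid method combining gradient descent with a fixed-point method to recover the DDIM initial noise map from a generated image, significantly improving prediction accuracy and reconstruction quality.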

详情
AI中文摘要

本文研究了反转DDIM图像生成过程以从生成图像中恢复潜在变量(特别是初始噪声图)的问题。现有方法在此任务中常面临精度不足的挑战。我们提出了一种新颖的混合方法,该方法在第一步结合了通过梯度下降的直接反转,随后在后续步骤中采用不动点方法。在三个数据集上的实证评估表明,我们的方法显著提高了初始潜在变量的预测精度,同时实现了更优的重建准确性。此外,我们引入了一项新的评估指标,称为自插值测试,该测试评估从真实与预测潜在图之间的插值点生成的图像质量,从而提供对性能更深入的洞察。我们的结果表明,尽管现有方法在重建方面表现尚可,但它们始终无法准确预测初始潜在变量,导致在自插值测试中表现不佳。相比之下,我们的方法在所有指标上均优于其他方法,为扩散模型提供了宝贵的见解,并增强了其在图像生成和编辑中的应用。

英文摘要

This paper studies the problem of inverting the DDIM image generation process to recover latent variables, particularly the initial noise map, from a generated image. Existing methods often struggle with accuracy in this task. We propose a novel hybrid approach that combines direct inversion via gradient descent for the first step, followed by a fixed-point method for subsequent steps. Empirical evaluations across three datasets demonstrate that our method significantly improves the prediction of initial latent variables while achieving superior reconstruction accuracy. Additionally, we introduce a new evaluation, called the self-interpolation test, which assesses the quality of images generated from interpolated points between the true and predicted latent maps, offering deeper insights into performance. Our results reveal that while existing methods perform reasonably well in reconstruction, they consistently fail to accurately predict the initial latent variables, resulting in poor performance on the self-interpolation test. In contrast, our method outperforms all others across all metrics, providing valuable insights into diffusion models and enhancing their applications in image generation and editing.
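
The fixed-point part of such an inversion can be sketched on a scalar toy problem. The code below assumes the standard deterministic DDIM update and a stand-in noise predictor `eps_model` (hypothetical, chosen so the iteration contracts); it does not reproduce the paper's gradient-descent first step or its exact procedure.

```python
import math


def eps_model(x, t):
    """Toy stand-in for a trained noise predictor (hypothetical)."""
    return 0.1 * math.tanh(x)


def ddim_step(x_t, t, a_t, a_prev):
    """Deterministic DDIM update from step t to the previous, less noisy step.

    a_t and a_prev are the cumulative alpha products at the two steps.
    """
    e = eps_model(x_t, t)
    x0_hat = (x_t - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
    return math.sqrt(a_prev) * x0_hat + math.sqrt(1.0 - a_prev) * e


def invert_step(x_prev, t, a_t, a_prev, iters=50):
    """Fixed-point iteration for x_t such that ddim_step(x_t) equals x_prev."""
    x_t = x_prev  # initial guess
    for _ in range(iters):
        e = eps_model(x_t, t)  # noise estimate at the current guess
        x0_hat = (x_prev - math.sqrt(1.0 - a_prev) * e) / math.sqrt(a_prev)
        x_t = math.sqrt(a_t) * x0_hat + math.sqrt(1.0 - a_t) * e
    return x_t


x_t = invert_step(0.7, t=10, a_t=0.5, a_prev=0.8)
print(abs(ddim_step(x_t, 10, 0.5, 0.8) - 0.7))  # residual of the round trip
```

Convergence here relies on the toy predictor being a weak function of its input; with a real network the iteration is not guaranteed to contract, which is one reason a hybrid scheme may be preferable.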

2606.03084 2026-06-03 cs.CV 版本更新

Hierarchical Federated Learning with Dynamic Clustering and Adaptive Regularization for Robust Infrastructure Inspection

面向鲁棒基础设施检测的动态聚类与自适应正则化分层联邦学习

Yuhu Feng, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

发表机构 * Graduate School of Information Science and Technology, Hokkaido University(北海道大学信息科学技术研究生院) Faculty of Information Science and Technology, Hokkaido University(北海道大学信息科学技术学部)

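AI summary: Proposes a hierarchical federated learning framework that addresses data heterogeneity in infrastructure inspection through macro-level dynamic gradient clustering and micro-level adaptive regularization, yielding robust and specialized diagnostic models.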

详情
AI中文摘要

由于严格的隐私和安全法规,数据驱动计算机视觉模型在结构健康监测(SHM)中的应用受到数据孤岛困境的严重制约。虽然联邦学习(FL)提供了一种保护隐私的协作替代方案,但其在全国性基础设施网络中的应用受到“双重异构性”挑战的严重阻碍:不同结构类型之间的宏观物理差异以及本地数据集内的微观统计不平衡。为了克服这一挑战,本文提出了一种新颖的分层联邦学习框架。该框架协调了一种协同的两层优化策略。在宏观层面,一种基于动态梯度的聚类机制根据客户的结构退化轨迹自动将分布式客户聚合成专门的专家组,无需先验地理元数据。同时,在微观层面,一种簇内动态区域自适应近端正则化(DRAPR)模块为每个客户端计算实时统计的非独立同分布强度分数。通过基于局部标签偏斜和梯度发散自适应调整近端惩罚,DRAPR有效校准局部更新,减轻客户端漂移,并防止少数损伤类别的灾难性遗忘。在大型真实世界结构检测数据集上的综合评估表明,宏观聚类与微观正则化的分层集成成功中和了双层异构性,为复杂基础设施检测生成了高度鲁棒且特化的诊断模型。

英文摘要

The deployment of data-driven computer vision models for structural health monitoring (SHM) is heavily constrained by the data silo dilemma due to stringent privacy and security regulations. While federated learning (FL) offers a privacy-preserving collaborative alternative, its application to nationwide infrastructure networks is severely hindered by the challenge of "double heterogeneity": macro-level physical divergence across disparate structural types and micro-level statistical imbalances within local datasets. To overcome this challenge, this paper proposes a novel hierarchical federated learning framework. The framework orchestrates a synergistic two-tier optimization strategy. At the macro-level, a dynamic gradient-based clustering mechanism autonomously aggregates distributed clients into specialized expert groups based on their structural degradation trajectories, circumventing the need for prior geographical metadata. Concurrently, at the micro-level, an intra-cluster Dynamic Region-Adaptive Proximal Regularization (DRAPR) module computes a real-time statistical Non-IID Intensity Score for each client. By adaptively modulating a proximal penalty based on local label skewness and gradient divergence, DRAPR effectively calibrates local updates, mitigates client drift, and prevents the catastrophic forgetting of minority damage classes. Comprehensive evaluations on a large-scale, real-world structural inspection dataset demonstrate that the hierarchical integration of macro-clustering and micro-regularization successfully neutralizes dual-level heterogeneity, yielding highly robust and specialized diagnostic models for complex infrastructure inspection.
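
One way to picture an adaptively modulated proximal penalty is a FedProx-style term whose strength scales with a per-client heterogeneity score. The sketch below is an assumption throughout: the score combines total-variation label skew with a cosine-based gradient divergence, and the names `label_skew`, `gradient_divergence`, `non_iid_intensity`, and `proximal_penalty` are hypothetical. The abstract does not define the actual Non-IID Intensity Score.

```python
import math


def label_skew(class_counts):
    """Total-variation distance between the label histogram and uniform."""
    total = sum(class_counts)
    k = len(class_counts)
    return 0.5 * sum(abs(c / total - 1.0 / k) for c in class_counts)


def gradient_divergence(g_local, g_cluster):
    """Map cosine similarity to [0, 1]: 0 means aligned, 1 means opposite."""
    dot = sum(a * b for a, b in zip(g_local, g_cluster))
    norm = math.sqrt(sum(a * a for a in g_local)) * math.sqrt(sum(b * b for b in g_cluster))
    return 0.5 * (1.0 - dot / (norm + 1e-12))


def non_iid_intensity(class_counts, g_local, g_cluster, alpha=0.5):
    """Assumed score: convex mix of label skew and gradient divergence."""
    skew = label_skew(class_counts)
    div = gradient_divergence(g_local, g_cluster)
    return alpha * skew + (1.0 - alpha) * div


def proximal_penalty(w_local, w_cluster, base_mu, intensity):
    """FedProx-style term whose strength grows with the client's intensity."""
    mu = base_mu * intensity
    return 0.5 * mu * sum((a - b) ** 2 for a, b in zip(w_local, w_cluster))


score = non_iid_intensity([20, 0], [1.0, 0.0], [-1.0, 0.0])
print(score, proximal_penalty([1.0, 2.0], [0.0, 0.0], base_mu=0.1, intensity=score))
```

Under this reading, a client with balanced labels and gradients aligned with its cluster receives almost no penalty, while a skewed, divergent client is pulled more strongly toward the cluster model.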

2606.03075 2026-06-03 cs.CV 版本更新

TGV-KV: Text-Grounded KV Eviction for Vision-Language Models

TGV-KV:面向视觉语言模型的文本引导KV驱逐方法

Jizhihui Liu, Ruizi Han, Miao Zhang, Rui Shao, Xuebo Liu, Weili Guan, Yaowei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

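AI summary: To address the KV-cache memory cost caused by redundant visual information in vision-language models, proposes TGV-KV, a text-grounded KV eviction method that uses text-vision budgeting, text-weighted ranking, and text-prioritised retention to raise inference throughput substantially while preserving accuracy.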

Comments Accepted by ICML-2026

详情
AI中文摘要

视觉语言模型(VLM)继承了自回归生成范式,并缓存所有先前token的键和值(KV)以加速推理,导致内存消耗随上下文长度线性增长。由于视觉模态中存在大量冗余,这一问题在VLM中尤为突出。尽管KV缓存驱逐方法可以有效减少推理内存,但它们通常会导致VLM性能显著下降,因为大多数方法是为语言模型设计的,忽视了文本与视觉之间的固有差距。通过系统分析VLM中的模态差距,我们认为视觉信息的重要性应以文本引导为基础,并据此提出了一种面向VLM的文本引导KV驱逐方法(TGV-KV)。TGV-KV包含三个子模块:(1)文本-视觉预算分配(TVB)基于互信息交互为每层分配预算。(2)文本加权排序(TWR)评估文本的优先级,并根据加权文本-图像注意力对视觉重要性进行排序。(3)文本优先保留(TPR)策略有选择地保留文本KV以避免严重的信息损失。我们在五种不同规模和架构的模型上评估了TGV-KV,结果显示,在LLaVA-NeXT的VizWiz-VQA任务中,TGV-KV保留了全KV准确率的99.2%,并在极端保留预算5%下将端到端吞吐量提升了52.6%。代码可在该https URL获取。

英文摘要

Vision-Language Models (VLMs) inherit the auto-regressive generation paradigm and cache the keys and values (KV) of all previous tokens to accelerate inference, resulting in memory consumption that scales linearly with context length. This issue is particularly pronounced in VLMs due to substantial redundancy in the visual modality. Although KV cache eviction approaches can effectively reduce inference memory, they often incur significant performance degradation in VLMs, as most are designed for language models and overlook the inherent gap between text and vision. By systematically analyzing the modality gap in VLMs in this work, we argue that the importance of visual information should be grounded in textual guidance and accordingly propose a Text-Grounded KV Eviction method for VLMs (TGV-KV). TGV-KV comprises three submodules: (1) Text-Vision Budgeting (TVB) assigns budget to each layer based on the mutual information interaction. (2) Text-Weighted Ranking (TWR) assesses the priority of text and ranks vision importance based on weighted text-image attention. (3) Text-Prioritised Retention (TPR) policy strategically preserves text KV to avoid acute information loss. We evaluate TGV-KV across five models with different sizes and architectures, showing that TGV-KV preserves 99.2% full-KV accuracy on the VizWiz-VQA task with LLaVA-NeXT and boosts end-to-end throughput by 52.6% with an extreme retention budget of 5%. Code is available at https://github.com/Danielement321/TGV-KV.
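
The ranking and retention steps can be illustrated with a small sketch. It assumes that each visual token is scored by a weighted sum of the attention it receives from text tokens, that the top fraction is kept, and that all text KV entries are retained; `select_visual_tokens` and the weighting scheme are hypothetical and do not reproduce the released TGV-KV code.

```python
def select_visual_tokens(attn, text_weights, keep_ratio):
    """Return indices of the visual tokens to keep in the KV cache.

    attn[i][j]:   attention from text token i to visual token j
    text_weights: assumed per-text-token priority weights
    keep_ratio:   fraction of visual tokens to retain
    """
    n_vis = len(attn[0])
    scores = [
        sum(w * row[j] for w, row in zip(text_weights, attn))
        for j in range(n_vis)
    ]
    k = max(1, round(keep_ratio * n_vis))
    ranked = sorted(range(n_vis), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])  # restore positional order for the cache


attn = [
    [0.1, 0.6, 0.3, 0.0],  # a high-priority text token
    [0.4, 0.1, 0.1, 0.4],  # a low-priority text token
]
print(select_visual_tokens(attn, text_weights=[0.75, 0.25], keep_ratio=0.5))
```

Text KV entries are never candidates for eviction in this sketch, which mirrors the text-prioritised retention idea; per-layer budgets would replace the single `keep_ratio`.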

2606.03069 2026-06-03 cs.CV cs.AI cs.LG 版本更新

ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements

ROBUST-WT: 通过白化和训练增强的鲁棒不确定性感知分割变换

Aqsa Naseer, Maryam Bibi, Syeda Samiya Urooj, Muhammad Khurram Shahzad

发表机构 * SEECs, University of Engineering and Technology, Lahore, Pakistan(工程与技术大学,拉合尔,巴基斯坦)

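AI summary: Targets four limitations of the WT-PSE framework with four improvements (domain-adaptive augmentation, a hybrid loss function, curriculum-style weight scheduling, and ablation control flags), reaching a Dice score of 0.956 on fundus optic disc segmentation.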

Comments 8 pages, 6 figures; code available at https://github.com/213269/WT-PSE-code-main

AI中文摘要

医学图像的广义分割可防止跨多个领域使用不同成像设备和临床协议时的性能下降。基于白化变换的概率形状正则化提取器(WT-PSE)发表于2024年IEEE Transactions on Medical Imaging,通过特征去相关和基于Wasserstein距离的知识蒸馏实现鲁棒的跨域分割。本研究系统性地检查了对WT-PSE学习框架的改进。识别出原始实现中的四个局限性:有限的训练增强无法模拟真实的扫描仪变化;依赖逐像素二元交叉熵损失对边缘噪声敏感;缺乏调度损失加权策略可能导致早期训练不稳定;以及缺乏用于受控科学比较的消融开关。为解决这些问题,我们提出四项增强:(1) 域自适应增强,包括随机擦除、伽马校正和椒盐噪声;(2) 混合BCE和Dice损失函数,用于在噪声条件下改进边缘感知分割;(3) 基于课程的Dice权重调度策略;(4) 命令行控制标志用于系统消融研究。在眼底视盘分割基准上的实验表明,改进后的流程在最终epoch的视盘Dice得分为0.956,ASD得分为13.31,优于基线epoch-5的Dice得分0.939。这些结果表明,在不修改底层WT-PSE架构的情况下,训练层面的改进可以提供一致的性能提升。

英文摘要

Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.
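
Enhancements (2) and (3) can be sketched together as a BCE plus Dice loss whose Dice weight ramps up over training. The linear ramp, the convex combination, and the default values below are assumptions; the abstract does not give the schedule, and the function names are hypothetical.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy on predicted probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """Soft Dice loss: 1 minus the overlap coefficient."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)


def dice_weight(epoch, ramp_epochs, max_weight):
    """Assumed curriculum: linear ramp from 0 to max_weight, then constant."""
    return max_weight * min(1.0, epoch / ramp_epochs)


def hybrid_loss(probs, targets, epoch, ramp_epochs=10, max_weight=0.5):
    w = dice_weight(epoch, ramp_epochs, max_weight)
    return (1.0 - w) * bce_loss(probs, targets) + w * dice_loss(probs, targets)


probs, targets = [0.9, 0.2, 0.8], [1, 0, 1]
for epoch in (0, 5, 20):
    print(epoch, dice_weight(epoch, 10, 0.5), hybrid_loss(probs, targets, epoch))
```

Starting from pure BCE keeps early gradients well behaved, while the growing Dice term shifts emphasis toward region overlap later in training.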

2606.03050 2026-06-03 cs.CV 版本更新

FCUS-rPPG: A Fast-Converging Unsupervised Framework for Remote Photoplethysmography via Gradient Oscillation Suppression

FCUS-rPPG:一种通过梯度振荡抑制实现快速收敛的无监督远程光电容积描记框架

Jiajie Li, Yu Liu, Rencheng Song, Xun Chen, Juan Cheng

发表机构 * Department of Biomedical Engineering(生物医学工程系) Anhui Province Key Laboratory of Measuring Theory and Precision Instrument(安徽省测量理论与精密仪器重点实验室) Hefei University of Technology(合肥工业大学) Department of Electronic Engineering and Information Science(电子工程与信息科学系) University of Science and Technology of China(中国科学技术大学)

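AI summary: Proposes FCUS-rPPG, which combines a spectrally shared backbone with unified optimization at the gradient, loss-landscape, and feature-representation levels, converging in a single training epoch and achieving state-of-the-art results in cross-dataset evaluation.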

AI中文摘要

远程光电容积描记术(rPPG)利用消费级摄像头实现非接触式血容量脉搏(BVP)信号提取。现有的无监督rPPG方法无需真实生理标注即可学习BVP表示,但其优化常受噪声和不稳定梯度影响,导致收敛缓慢且跨域泛化能力有限。本文提出FCUS-rPPG,一种快速收敛且具有强泛化能力的无监督rPPG框架。受BVP表示同时具有多光谱共变和低维流形结构的观察启发,我们设计了光谱共享骨干网络,促进BVP特征解耦并提高优化效率。为了联合增强收敛稳定性和泛化性能,我们进一步开发了一个在梯度、损失景观和特征表示层面运作的统一优化框架。具体而言,后验证掩蔽机制根据BVP信号的弱幅度生理先验过滤误导性梯度;基于扰动的损失景观平滑策略将优化导向更可泛化的平坦最小值;噪声感知零空间正则化将特征更新约束在噪声子空间的正交补空间内,从而减轻噪声引起的表示漂移。在五个数据集上的大量实验表明,FCUS-rPPG仅需一个训练周期,而现有方法通常需要数十到数百个周期。值得注意的是,FCUS-rPPG在跨数据集评估中持续达到最先进(SOTA)性能。本研究为无监督rPPG的实际部署提供了高效且鲁棒的解决方案。源代码将在该URL公开。

英文摘要

Remote photoplethysmography (rPPG) enables non-contact extraction of blood volume pulse (BVP) signals using consumer-grade cameras. Recent unsupervised rPPG methods learn BVP representations without requiring ground-truth physiological annotations, yet their optimization is often hindered by noisy and unstable gradients, resulting in slow convergence and limited cross-domain generalization. In this paper, we propose FCUS-rPPG, a fast-converging unsupervised rPPG framework with strong generalization capability. Motivated by the observation that BVP representations exhibit both multi-spectral covariation and low-dimensional manifold structure, we design a spectrally shared backbone that facilitates BVP feature disentanglement while improving optimization efficiency. To jointly enhance convergence stability and generalization performance, we further develop a unified optimization framework operating at the gradient, loss-landscape, and feature-representation levels. Specifically, a post-verification masking mechanism filters out misleading gradients according to the weak-amplitude physiological prior of BVP signals; a perturbation-based loss landscape smoothing strategy steers optimization toward more generalizable flat minima; and a noise-aware null-space regularization constrains feature updates to the orthogonal complement of the noise subspace, thereby mitigating noise-induced representation drift. Extensive experiments on five datasets demonstrate that FCUS-rPPG requires only one training epoch, whereas existing methods typically require tens to hundreds of epochs. Notably, FCUS-rPPG consistently achieves state-of-the-art (SOTA) performance in cross-dataset evaluations. This study provides an efficient and robust solution to the real-world deployment of unsupervised rPPG. The source code will be publicly available at https://github.com/JiaJieLee/FCUS-rPPG.
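
Constraining updates to the orthogonal complement of a noise subspace amounts to projecting out the noise directions. The sketch below assumes the noise subspace is already available as an orthonormal basis and applies a hard projection; the paper describes a regularizer, so this is an illustration of the geometry rather than its implementation, and `project_out` is a hypothetical name.

```python
def project_out(update, noise_basis):
    """Remove the components of `update` along orthonormal noise directions."""
    result = list(update)
    for u in noise_basis:
        coeff = sum(r * ui for r, ui in zip(result, u))  # component along u
        result = [r - coeff * ui for r, ui in zip(result, u)]
    return result


noise_basis = [[1.0, 0.0, 0.0]]  # assumed: one estimated noise direction
update = [3.0, 4.0, 5.0]
print(project_out(update, noise_basis))  # first component is removed
```

Because the basis is orthonormal, removing the components one at a time is equivalent to a single projection onto the orthogonal complement.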

2606.03005 2026-06-03 cs.CV cs.AI 版本更新

MUSE: A Unified Agentic Harness for MLLMs

MUSE: 多模态大语言模型的统一智能体框架

Jianglin Lu, Hailing Wang, Xu Ma, Qihua Dong, Mingyuan Zhang, Yizhou Wang, Yun Fu

发表机构 * Northeastern University(东北大学)

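AI summary: Proposes MUSE, a harness that improves frozen multimodal large language models without retraining through composable modules (task representation, visual processing, perception tools, structured parsing, deterministic verification, and verifier-guided repair).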

AI中文摘要

尽管进展迅速,多模态大语言模型(MLLMs)在人类轻松解决的任务上仍然失败,例如从屏幕截图导航网格迷宫或选择正确的拼图块。我们不重新训练模型,而是提出一个补充性问题:仅通过改进执行脚手架,能从冻结的MLLM中引出多少能力?我们引入MUSE,一个多模态统一结构化执行框架,它用可组合的模块(任务表示、视觉处理、感知工具使用、结构化解析、确定性验证和验证器引导修复)包装任何现成的MLLM,无需任何模型重新训练。我们使用多个最先进的MLLM,在涵盖视觉空间规划、视觉感知、多模态推理和细粒度视觉辨别的多样化基准上评估MUSE。MUSE在所有设置中都比裸模型带来一致的提升,在困难实例上提升最大。进一步分析揭示,许多MLLM失败源于框架层面的缺陷而非根本的模型缺陷,并且可以通过验证器引导修复来解决,无需触及模型。这些发现突显了智能体多模态框架作为一个关键但尚未充分探索的设计维度,提供了超越以模型为中心的优化的正交改进途径。

英文摘要

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.
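
A verifier-guided repair loop of the kind described can be sketched as follows. Everything here is illustrative: `run_with_repair`, `verify_path`, and `toy_model` are hypothetical, the "model" is a stub that corrects itself once it sees feedback, and the prompt format is an assumption rather than MUSE's own.

```python
def verify_path(moves, size=3, goal=(2, 2)):
    """Deterministic check: simulate moves on a grid starting at (0, 0)."""
    step = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    r, c = 0, 0
    for m in moves:
        if m not in step:
            return False, f"unknown move {m!r}"
        r, c = r + step[m][0], c + step[m][1]
        if not (0 <= r < size and 0 <= c < size):
            return False, f"left the grid at ({r}, {c})"
    if (r, c) != goal:
        return False, f"ended at ({r}, {c}), expected {goal}"
    return True, ""


def run_with_repair(model, verifier, task, max_rounds=3):
    """Query the frozen model, verify, and re-prompt with feedback on failure."""
    prompt = task
    for attempt in range(1, max_rounds + 1):
        answer = model(prompt)
        ok, feedback = verifier(answer)
        if ok:
            return answer, attempt
        prompt = f"{task}\nPrevious answer: {answer}\nVerifier feedback: {feedback}"
    return None, max_rounds


def toy_model(prompt):
    """Stub standing in for an MLLM call; fixes its answer after feedback."""
    return "RRDD" if "Verifier feedback" in prompt else "RRD"


print(run_with_repair(toy_model, verify_path, "Reach the bottom-right cell."))
```

The model itself is never modified; only the prompt changes between rounds, which is the sense in which gains come from the scaffold rather than from retraining.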

2606.02996 2026-06-03 cs.RO cs.CV cs.HC 版本更新

MARIO: Motion-Augmented Real-Time Multi-Sensor Inertial Odometry

MARIO: 运动增强的实时多传感器惯性里程计

Yiquan Li, Taeyoung Yeon, Chenfeng Gao, Vasco Xu, Xuanyou Liu, Karan Ahuja

发表机构 * Northwestern University(西北大学) University of Chicago(芝加哥大学)

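AI summary: Proposes MARIO, which constrains motion dynamics with a learned IMU-inferred human pose prior and adds multi-sensor fusion (magnetometer, barometer, secondary IMUs), reducing positional drift on the Nymeria dataset by up to 36% with the pose prior and up to 42% with fusion for accurate, robust camera-less human tracking.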

Comments CVPR 2026 Findings

AI中文摘要

仅使用惯性测量单元(IMU)的惯性里程计(IO)为增强现实(AR)和可穿戴设备中的人体运动跟踪提供了轻量级解决方案。最近的基于学习的IO方法通过在大规模人体运动数据集上进行预训练,提高了惯性定位的泛化能力。然而,这些方法仍然容易受到漂移和噪声的影响,因为它们没有显式捕捉人体运动动力学,尤其是在日常活动数据集(如Nymeria)上。在这项工作中,我们提出通过学习的IMU推断姿态先验将惯性里程计建立在人体运动学基础上,该先验促进物理一致的运动约束。我们将此姿态先验集成到现有IO架构中,并在具有挑战性的Nymeria数据集上将位置漂移减少高达36%,该数据集比先前工作中使用的数据集大5倍。我们进一步通过传感器融合框架改进了长期性能,该框架整合了商用AR眼镜上已有的轻量级传感器的辅助信号,包括磁力计、气压计和辅助IMU。通过这种融合策略,位置漂移减少了高达42%,提高了在不同运动条件下的鲁棒性和泛化能力。总之,我们的结果通过将人体运动学与多模态传感统一起来,为惯性轻量级里程计引入了新范式,为准确鲁棒的无相机人体跟踪设立了新基准。我们的网站位于此https URL。

英文摘要

Inertial odometry (IO) using only Inertial Measurement Units (IMUs) provides a lightweight solution for human motion tracking in augmented reality (AR) and wearable devices. Recent learning-based IO methods have improved the generalizability of inertial localization through large-scale pretraining on human motion datasets. However, these approaches remain prone to drift and noise because they do not explicitly capture human motion dynamics, especially on daily activity datasets such as Nymeria. In this work, we propose to ground inertial odometry in human kinematics through a learned IMU-inferred pose prior, which promotes physically consistent motion constraints. We integrate this pose prior into existing IO architectures and reduce positional drift by up to 36% on the challenging Nymeria dataset, which is 5x larger than datasets used in prior work. We further improve long-term performance with a sensor-fusion framework that incorporates auxiliary signals from lightweight sensors already available on commercial AR glasses, including magnetometers, barometers, and secondary IMUs. With this fusion strategy, positional drift is reduced by up to 42%, improving robustness and generalization across diverse motion conditions. Together, our results introduce a new paradigm for inertial and lightweight odometry by unifying human motion kinematics with multimodal sensing, setting a new benchmark for accurate and robust camera-less human tracking. Our website is available at https://spice-lab.org/projects/MARIO/.

2606.02979 2026-06-03 cs.CV cs.AI cs.RO 版本更新

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

面向紧凑型自动驾驶感知的平衡学习与多传感器融合

Oskar Natan, Jun Miura

发表机构 * Department of Computer Science and Engineering, Toyohashi University of Technology(计算机科学与工程系,丰田寺大学) Department of Computer Science and Electronics, Gadjah Mada University(计算机科学与电子系,加查马达大学)

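AI summary: Proposes a compact deep multi-task learning model that uses adaptive loss weighting and intermediate sensor fusion to perform semantic segmentation, depth estimation, LiDAR segmentation, and bird's eye view projection in a single forward pass for efficient autonomous driving perception.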

Comments This work has been accepted for publication in IEEE Transactions on Intelligent Transportation Systems. https://ieeexplore.ieee.org/document/9712213

AI中文摘要

我们提出了一种新颖的紧凑型深度多任务学习模型,能够在一次前向传播中处理多种自动驾驶感知任务。该模型同时执行多视角语义分割、深度估计、激光雷达分割和鸟瞰投影,无需其他模型支持。我们还提供了一种自适应损失加权算法,以解决因任务众多而出现的学习不平衡问题。通过数据预处理和中间传感器融合技术,该模型可以处理并组合来自RGB摄像头、动态视觉传感器(DVS)和安装在自车多个位置的激光雷达的多种输入模态。因此,可以更好地理解动态变化的环境。基于消融研究,使用我们提出的方法训练的模型变体取得了更好的性能。此外,还进行了比较研究,以阐明其与一些近期模型组合相比的性能和有效性。结果表明,即使参数少得多,我们的模型仍能保持更好的性能。因此,该模型可以更快地推理,并减少GPU内存使用。此外,结果在3个不同的CARLA仿真数据集和1个真实世界的nuScenes-lidarseg数据集上保持一致。为了支持未来的研究,我们在以下网址公开共享代码和其他文件:https://this URL。

英文摘要

We present a novel compact deep multi-task learning model that handles various autonomous driving perception tasks in one forward pass. The model simultaneously performs multi-view semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection without support from other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. In the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study clarifies its performance and effectiveness against combinations of some recent models. Our model maintains better performance even with far fewer parameters. Hence, the model runs inference faster with lower GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.

2606.02962 2026-06-03 cs.CV cs.AI cs.HC eess.IV 版本更新

Hand Trajectory Fusion for Egocentric Natural Language Query Grounding

面向自我中心自然语言查询定位的手部轨迹融合

Enmin Zhong, Carlos R. del-Blanco, Fernando Jaureguizar, Narciso García

发表机构 * Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center , ETSI Telecomunicación, Universidad Politécnica de Madrid, Spain(图像处理小组(GTI)、信息处理与电信中心、电信工程学院、马德里理工大学、西班牙)

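AI summary: For natural language query grounding in egocentric video, proposes a hand-trajectory encoder with adaptively gated cross-attention fusion that uses hand motion to improve grounding performance.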

Comments Accepted for the poster session at the Egocentric Vision (EgoVis) Workshop in Conjunction with CVPR 2026

AI中文摘要

自我中心自然语言查询(NLQ)定位要求模型在长第一人称视频中定位回答自由形式文本查询的时间区间。现有方法融合视频外观与查询,但忽略了手部运动,尽管大约41%的Ego4D NLQ查询是在手-物交互或其后立即发生的时刻回答的。我们提出了一种手部轨迹编码器,用于将手部骨骼序列转换为高语义的手部运动学特征,然后通过具有自适应门控的交叉注意力融合策略,将这些特征与预训练的视频-文本特征对齐并组合。在Ego4D NLQ v2验证集上,手-物交互查询(R1@IoU=0.3提升2.54)和数量/状态查询(R1@IoU=0.3提升4.32)的增益最为明显,表明手部轨迹提供了超越外观的定位线索。

英文摘要

Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand-object manipulation or its immediate outcome. We propose a hand-trajectory encoder that converts a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video-text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.
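
Cross-attention fusion with adaptive gating can be sketched for a single video-text feature attending over hand features. The code assumes single-head attention without learned projections and a sigmoid gate computed from the concatenated feature and attended vector; `gated_cross_attention`, `gate_w`, and `gate_b` are hypothetical and the paper's module is likely richer.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(video_feat, hand_feats, gate_w, gate_b):
    """Fuse one video-text feature with hand features via gated attention.

    video_feat: query vector of length d
    hand_feats: list of hand kinematic vectors (keys and values), each length d
    gate_w:     assumed gate weights of length 2 * d
    gate_b:     assumed gate bias
    """
    d = len(video_feat)
    weights = softmax([dot(video_feat, h) / math.sqrt(d) for h in hand_feats])
    attended = [sum(w * h[i] for w, h in zip(weights, hand_feats)) for i in range(d)]
    # Sigmoid gate on the concatenation [video_feat; attended].
    gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, video_feat + attended) + gate_b)))
    fused = [v + gate * a for v, a in zip(video_feat, attended)]
    return fused, gate


fused, gate = gated_cross_attention(
    [0.0, 2.0], [[1.0, 0.0], [1.0, 0.0]], gate_w=[0.0, 0.0, 0.0, 0.0], gate_b=0.0
)
print(fused, gate)
```

A gate near zero leaves the pretrained video-text feature untouched, so the fusion can fall back to appearance alone when hand motion is uninformative for a query.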

2606.02956 2026-06-03 cs.CV cs.LG cs.RO 版本更新

The Road Ahead in Autonomous Driving: The KITScenes Multimodal Dataset

自动驾驶的未来之路:KITScenes多模态数据集

Richard Schwarzkopf, Fabian Immel, Alexander Blumberg, Jonas Merkert, Nils Rack, Kaiwen Wang, Fabian Konstantinidis, Julian Truetsch, Carlos Fernandez, Annika Bätz, Kevin Rösch, Marlon Steiner, Willi Poh, Yinzhe Shen, Royden Wagner, Felix Hauser, Dominik Strutz, Jaime Villa, Gleb Stepanov, Holger Caesar, Ömer Şahin Taş, Frank Bieder, Jan-Hendrik Pauls, Christoph Stiller

发表机构 * FZI Research Center for Information Technology(FZI信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) University Charles III of Madrid(马德里卡洛斯三世大学) Delft University of Technology(代尔夫特理工大学)

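AI summary: Presents the KITScenes Multimodal dataset, which uses high-fidelity sensors and complete HD maps to address gaps in sensor fidelity, map completeness, and geographic diversity in existing datasets, and introduces four benchmarks to advance spatial learning.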

Comments 28 pages, 21 figures

AI中文摘要

现有的自动驾驶数据集取得了重大进展,但在传感器保真度、地图完整性或地理多样性方面仍存在不足。我们提出了KITScenes多模态数据集,这是一个基于高保真传感器和地图构建的欧洲数据集。我们完全同步的传感器套件结合了高分辨率全局快门相机、超过400米的长距离激光雷达、4D成像雷达以及冗余的GNSS/INS定位。据我们所知,我们的HD地图是任何传感器数据集中最完整的,并通过开源软件上的自动驾驶试验进行了验证。首次在公共数据集中,所有与驾驶相关的交通元素(如交通灯)都以3D方式映射到重投影精确的水平,并具有完整的拓扑连接。我们的数据集记录在街道布局不规则且交通模式混合的城市中,通过拓宽可用的地理多样性来补充现有数据集。我们还引入了四个基准,每个基准都推动了具身AI的空间学习:在线HD地图构建、长距离深度估计、新颖视图合成和端到端驾驶。项目页面:此https URL

英文摘要

Existing autonomous driving datasets have enabled major progress, but fall short in sensor fidelity, map completeness, or geographic diversity. We present KITScenes Multimodal, a European dataset built around high-fidelity sensors and maps. Our fully synchronized sensor suite combines high-resolution global-shutter cameras, long-range lidar beyond 400m, 4D imaging radar, and redundant GNSS/INS localization. Our HD maps are, to our knowledge, the most complete of any sensor dataset, validated through autonomous driving trials on open-source software. For the first time in a public dataset, all driving-relevant traffic elements, such as traffic lights, are mapped in 3D to a reprojection-accurate level with full topological connectivity. Recorded in cities with irregular street layouts and mixed traffic modes, our dataset complements existing datasets by broadening the available geographic diversity. We also introduce four benchmarks, each advancing spatial learning for embodied AI: online HD map construction, long-range depth estimation, novel view synthesis, and end-to-end driving. Project page: https://kitscenes.com/

2606.02951 2026-06-03 cs.RO cs.AI cs.CL cs.CV cs.HC 版本更新

SCOPE: Real-Time Natural Language Camera Agent at the Edge

SCOPE:边缘实时自然语言相机代理

Nikolaj Hindsbo, Sina Ehsani, Pragyana Mishra

发表机构 * Armada AI

AI总结 提出SCOPE模块化代理,用于自然语言控制的PTZ相机,在边缘部署实现实时感知、规划与控制,并通过仿真和物理实验评估延迟、准确性和错误模式。

Comments 9 pages, 4 figures, 6 tables. Accepted at HRI '26 (21st ACM/IEEE International Conference on Human-Robot Interaction), Edinburgh, Scotland, March 16-19, 2026. Code: https://github.com/HindsboNikolaj/SCOPE

详情
Journal ref
Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction (HRI '26), ACM, 2026
AI中文摘要

在机器人领域部署语言驱动的代理需要能够反映现实任务需求的评估:自然语言指令与可重复的结果。此类代理必须将语言模型连接到可调用的感知和控制工具,并使用部署关键指标(包括延迟、准确性和错误模式)进行评估。我们提出了SCOPE(用于感知和评估的仿真与相机操作),这是一个模块化代理,用于自然语言、开放词汇的云台变焦(PTZ)相机控制和视觉场景理解,专门为边缘部署设计。SCOPE既可在基于Blender的仿真环境中运行,也可在物理PTZ相机上运行,所有感知、规划和控制均在部署现场使用边缘可访问的计算资源本地执行。我们发布了一个包含536个任务的基准测试,涵盖问答、单步和多步命令、计数、空间推理、描述以及光学字符识别,在基于Blender的仿真环境中提供逼真的PTZ控制功能。执行轨迹与LM作为评判器结合,以评估延迟、准确性和错误模式。我们评估了19种规划器-感知模型组合,将Qwen3小语言模型(SLM)与Moondream和Qwen视觉语言模型(VLM)配对。更强的SLM显著减少了幻觉并改善了工具路由,从而实现了更可靠的闭环行为。一旦使用了足够强大的SLM,感知就成为主要的性能瓶颈。在规划和感知方面,混合专家模型在延迟和内存占用与更小网络相当的情况下,始终匹配或超过密集替代方案。量化在精度损失最小的情况下提供了额外的效率提升,为实时、边缘可行的语言驱动PTZ控制确定了一个实用的、从仿真到现实验证的设计点。

英文摘要

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with reproducible outcomes. Such agents must connect language models to callable perception and control tools, and be assessed using deployment-critical metrics including latency, accuracy, and error modes. We present SCOPE (Simulation and Camera Operations for Perception and Evaluation), a modular agent for natural-language, open-vocabulary pan-tilt-zoom (PTZ) camera control and visual scene understanding, designed explicitly for edge deployment. SCOPE operates both in a Blender-based simulation environment and on a physical PTZ camera, executing all perception, planning, and control locally at the deployment site using edge-accessible compute. We release a 536-task benchmark spanning QA, single- and multi-step commands, counting, spatial reasoning, descriptions, and optical character recognition in a Blender-based simulation environment that exposes realistic PTZ control affordances. Execution traces are combined with an LM-as-Judge to evaluate latency, accuracy, and error modes. We evaluate 19 planner-perception model combinations pairing Qwen3 small language models (SLMs) with Moondream and Qwen vision-language models (VLMs). Stronger SLMs substantially reduce hallucinations and improve tool routing, leading to more reliable closed-loop behavior. Once a sufficiently capable SLM is used, perception becomes the dominant performance bottleneck. Mixture-of-Experts models on both the planning and perception side consistently match or exceed dense alternatives at latencies and memory footprints comparable to much smaller networks. Quantization provides additional efficiency gains with minimal accuracy degradation, identifying a practical, sim-to-real validated design point for real-time, edge-feasible language-driven PTZ control.

2606.02947 2026-06-03 cs.LG cs.CV 版本更新

BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks

BYORn:自举你的响应以防御大型视觉-语言模型的后门攻击

Ivan Sabolić, Marin Oršić, Josip Šarić, Sven Lončarić

发表机构 * University of Rijeka(里耶卡大学)

AI总结 提出BYORn框架,通过识别并替换语义不合理的后门目标响应,打破触发器与目标输出的关联,从而在保持干净任务性能的同时提升对后门攻击的鲁棒性。

Comments Accepted to ICML 2026

详情
AI中文摘要

监督微调是将自回归视觉-语言模型适应下游任务的主要方法。最近的研究表明,这种范式极易受到后门攻击,并且现有的防御在开放生成设置中无效。为此,我们提出了BYORn,一个鲁棒的后门防御微调框架,其动机是观察到,在给定相应图像-文本输入和预训练模型的情况下,被毒化的目标响应通常在语义上不合理。BYORn识别这种不对齐的响应,并动态地用模型生成的替代响应替换它们,从而打破触发器与目标输出之间的相关性。由此产生的目标梯度对应于干净数据分布上总体风险上界的经验估计的梯度。实验上,BYORn在保持干净任务性能的同时,持续提高了对后门攻击的鲁棒性,建立了泛化与攻击成功率之间的新权衡边界。最后,我们证明了BYORn对专门设计用于规避所提防御的自适应攻击仍然有效。

英文摘要

Supervised fine-tuning is the predominant approach for adapting autoregressive vision-language models to downstream tasks. Recent work has shown that this paradigm is highly vulnerable to backdoor attacks, and that existing defenses are ineffective in open-ended generation settings. In response, we propose BYORn, a backdoor-robust fine-tuning framework motivated by the observation that poisoned target responses are often semantically implausible given the corresponding image-text inputs and a pretrained model. BYORn identifies such misaligned responses and dynamically replaces them with alternative responses generated by the model, thereby breaking the correlation between triggers and target outputs. The resulting objective gradient corresponds to the gradient of the empirical estimate of the population risk upper bound over the clean data distribution. Empirically, BYORn consistently improves robustness to backdoor attacks while preserving clean-task performance, establishing a new trade-off frontier between generalization and attack success rate. Finally, we demonstrate that BYORn remains effective against adaptive attacks specifically designed to circumvent the proposed defense.

2606.02935 2026-06-03 cs.CV cs.CE 版本更新

CAD-to-CT Registration of Cylindrical Objects via Ellipse-Based Axis Estimation

基于椭圆轴估计的圆柱体CAD到CT配准

Aleksander Ogonowski, Mikołaj Mrozowski, Daniel Więcek, Arkadiusz Ćwiek, Konrad Klimaszewski, Rafał Możdżonek, Adam Padee, Lech Raczyński, Piotr Wasiuk, Wojciech Wiślicki, Michał Matusiak, Sławomir Wronka

发表机构 * Department of Complex Systems, National Centre for Nuclear Research(复杂系统系,国家核研究中心) ImagineRT sp. z o.o.(ImagineRT公司) National Centre for Nuclear Research(国家核研究中心)

AI总结 提出一种两阶段几何配准方法,通过检测CT切片中的椭圆截面估计旋转轴,再通过体素化CAD模型并最大化与CT扫描的体积重叠实现圆柱体(电离室)的精确配准,无需强度校准或特征匹配,倾斜和方向误差低于0.1°。

详情
AI中文摘要

CAD模型与CT扫描的精确配准对于在体积成像中建立真实几何基准至关重要。获取可靠的对象掩膜在机器学习环境中日益重要;随着最新架构能力增强,需要大规模数据集以充分利用其能力。当CT灰度值缺乏校准参考时,传统的基于强度的方法失效,而基于点的算法(如ICP、RANSAC)需要理想化CAD几何与噪声体积CT数据之间不可用的特征对应。我们提出了一种针对圆柱体(电离室)的两阶段几何配准方法,利用对象的独特几何特征。首先,通过检测CT切片中的椭圆截面、对边缘检测轮廓拟合椭圆,并在RANSAC异常值去除后对拟合椭圆中心进行PCA,来估计3D旋转轴。其次,将CAD模型体素化,沿检测轴定向,并通过平移调整最大化与CT扫描的体积重叠。该方法无需强度校准或特征匹配,即可实现倾斜和方向误差低于0.1°的鲁棒配准。配准后,对齐的CAD模型为机器学习目标定位和工业CT工作流中的自动分析等应用提供真实几何基准。

英文摘要

Accurate registration of CAD models to CT scans is essential for establishing ground truth geometry in volumetric imaging. Obtaining reliable object masks is of growing importance in machine learning settings; as recent architectures grow more capable, huge datasets are required to fully utilise their capabilities. Traditional intensity-based methods fail when CT grayscale values lack calibration references, while point-based algorithms (e.g., ICP, RANSAC) require feature correspondence unavailable between idealized CAD geometry and noisy volumetric CT data. We propose a two-stage geometric registration method for cylindrical objects (ionization chambers) that takes advantage of the distinctive geometric features of the objects. First, we estimate the 3D rotation axis by detecting elliptical cross-sections across CT slices, fitting ellipses to edge-detected contours, and performing PCA on the fitted ellipse centers after RANSAC outlier removal. Second, we voxelize the CAD model, orient it along the detected axis, and maximize volumetric overlap with the CT scan through translational adjustment. This approach achieves robust registration with tilt and orientation errors below $0.1^\circ$ without intensity calibration or feature matching. Once registered, the aligned CAD model provides ground truth geometry for applications including machine learning-based object localization and automated analysis in industrial CT workflows.
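The first stage described above (PCA on fitted ellipse centers) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the ellipse centers are taken as already fitted and already cleaned of outliers (the RANSAC step is omitted), the dominant principal direction is found by power iteration, and `principal_axis`, `tilt_degrees` and the synthetic centers are hypothetical.

```python
import math


def principal_axis(points, iters=200):
    """Dominant PCA direction of 3D points, found by power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    # 3x3 covariance matrix of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    v = [1.0, 1.0, 1.0]  # arbitrary start vector (assumed not orthogonal to the axis)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Fix the sign so the axis points towards +z.
    return v if v[2] >= 0 else [-x for x in v]


def tilt_degrees(axis):
    """Angle between the estimated axis and the scanner z axis."""
    return math.degrees(math.atan2(math.hypot(axis[0], axis[1]), axis[2]))


# Synthetic ellipse centers, one per CT slice, on a slightly tilted line.
centers = [(5.0 + 0.02 * z, 3.0 - 0.01 * z, float(z)) for z in range(50)]
axis = principal_axis(centers)
print(axis, tilt_degrees(axis))
```

With noisy centers, the same routine returns the least-squares line direction, which is why removing outliers before the PCA matters for sub-degree accuracy.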

2606.02927 2026-06-03 cs.CV 版本更新

SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

SaluNet: 在无归一化深度网络中实现完全可塑性

Mourad Zaied

发表机构 * Department of Electrical Engineering(电气工程系) National Engineering School of Gabes (ENIG)(盖斯国家工程学院) University of Gabes(盖斯大学)

AI总结 提出SALU激活函数替代归一化层,构建SaluNet网络,在无归一化条件下实现深度网络的稳定训练,并在多个数据集上取得优异性能。

Comments 34 pages

详情
AI中文摘要

归一化层如BatchNorm和LayerNorm长期以来被认为是深度网络稳定训练所必需的。本文证明它们可以被单一的可学习激活机制完全替代。我们发现标准归一化会引发可塑性抑制效应:当与归一化层配对时,可学习激活参数会迅速失去适应性。受此观察启发,我们引入SALU(饱和自适应线性单元),\[ \operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}},\quad a>0,\; b>0 \] 一种有界的、可学习的激活函数,无需依赖批次统计或外部仿射参数即可提供内在的信号稳定。基于SALU,我们提出SaluNet,一种基于完全可塑性的范式:SALU替代归一化层,而SWALU和GALU替代标准激活函数。使用ResNet-18,SaluNet-C-18在CIFAR-10上达到97.35%,在CIFAR-100上达到83.25%,且无归一化;在批次大小为1时(归一化架构失败)仍保持93.44%和76.23%。对于Transformer,SaluNet-T在CIFAR-10上将LayerNorm-GELU从90.92%提升至91.01%,在CIFAR-100上从66.54%提升至68.10%。SaluNet-C-50在ImageNet-1K上达到78.67%的Top-1准确率(224×224),在288×288下为79.23%。这些结果表明归一化层抑制了完全可塑性——这是生物神经元固有的特性,使深度网络能够有效学习。

英文摘要

Normalization layers such as BatchNorm and LayerNorm have long been considered essential for stable training in deep networks. This work demonstrates that they can be fully replaced by a single learnable activation mechanism. We identify a plasticity suppression effect induced by standard normalization: learnable activation parameters rapidly lose adaptability when paired with normalization layers. Motivated by this observation, we introduce SALU (Saturated Adaptive Linear Unit), \[ \operatorname{SALU}(x;a,b) = \frac{a x}{\sqrt{1 + a b x^2}},\quad a>0,\; b>0 \] a bounded, learnable activation that provides intrinsic signal stabilization without relying on batch statistics or external affine parameters. Building on SALU, we propose SaluNet, a paradigm grounded in total plasticity: SALU replaces normalization layers, while SWALU and GALU replace standard activations. With ResNet-18, SaluNet-C-18 achieves 97.35% on CIFAR-10 and 83.25% on CIFAR-100 without normalization, maintaining 93.44% and 76.23% at batch size 1 where normalized architectures fail. For transformers, SaluNet-T improves over LayerNorm-GELU from 90.92% to 91.01% on CIFAR-10 and from 66.54% to 68.10% on CIFAR-100. SaluNet-C-50 reaches 78.67% Top-1 on ImageNet-1K at $224\times224$, and 79.23% at $288\times288$. These results suggest normalization layers suppress total plasticity, a property biological neurons inherently possess, enabling deep networks to learn effectively.
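The SALU formula above is simple enough to evaluate directly. The scalar sketch below uses fixed values of a and b, whereas the abstract describes them as learnable; the function name and sample inputs are illustrative. It also checks two properties that follow from the formula: the slope at the origin is a, and the output saturates at ±sqrt(a/b).

```python
import math


def salu(x, a=1.0, b=1.0):
    """SALU(x; a, b) = a*x / sqrt(1 + a*b*x^2), with a > 0 and b > 0."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    return a * x / math.sqrt(1.0 + a * b * x * x)


a, b = 4.0, 1.0
# Near zero the unit is approximately linear with slope a.
slope_at_zero = (salu(1e-6, a, b) - salu(-1e-6, a, b)) / 2e-6
# For large |x| the output approaches sqrt(a / b), so the activation is bounded.
upper_bound = math.sqrt(a / b)
print(slope_at_zero, salu(1e9, a, b), upper_bound)
```

The bound follows because a·x / sqrt(a·b·x²) = sqrt(a/b) for x > 0, which is consistent with the abstract's description of SALU as a bounded activation.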

2606.02924 2026-06-03 cs.CV 版本更新

ATLAS: A Large-Scale Evaluation Benchmark for Adversarial LiDAR Perception

ATLAS:面向对抗性激光雷达感知的大规模评估基准

Mellon M. Zhang, Siddhant Panse, Zimo Fan, Akshal Dhal, Rishit Sarkar, Glen Chou

AI总结 针对黑盒传感器攻击下激光雷达感知模型的鲁棒性评估空白,提出首个大规模物理驱动基准ATLAS,通过点注入和点移除两种攻击模式,揭示模型性能与鲁棒性的非对称性,并溯源至标准数据增强方法。

Comments preprint

详情
AI中文摘要

自动驾驶感知通常在干净的基准数据上进行评估,然而实际部署需要对罕见、结构化且可能具有对抗性的传感器异常具有鲁棒性。这一差距对于激光雷达尤为关键,因为外部行为者可以在不访问模型的情况下物理操纵传感过程,引发黑盒感知故障。现有的激光雷达基准对此类故障模式几乎不提供可见性。先前的对抗性激光雷达研究主要集中于攻击硬件、几何和算法防御以及早期检测器,而现代感知系统的鲁棒性尚未被探索。为弥补这一评估空白,我们提出了ATLAS(对抗性时间激光雷达攻击套件),这是首个大规模、物理驱动的激光雷达感知模型评估基准,在黑盒传感器攻击下模拟两种主要攻击模式——点注入和点移除,覆盖真实驾驶序列。通过评估当前最先进的激光雷达感知模型的广泛截面,ATLAS揭示了一个令人惊讶的鲁棒性非对称性:在标准基准上表现更强的模型往往更能抵御移除攻击,但实际上比弱模型更容易受到注入攻击。我们将这一脆弱性追溯到标准对象数据库采样增强,揭示了当前训练实践如何引发与架构无关的鲁棒性故障,并研究了缓解两种攻击模式的初步方向。我们发布了ATLAS生成代码,以支持随着攻击能力演进而进行的可扩展、可重复的评估,帮助使黑盒传感器鲁棒性成为未来激光雷达感知发展中的明确考虑因素。

英文摘要

Autonomous driving perception is typically evaluated on clean benchmark data, yet real-world deployment requires robustness to rare, structured, and potentially adversarial sensor anomalies. This gap is especially critical for LiDAR, where external actors can physically manipulate the sensing process to induce black-box perception failures without accessing the model. Existing LiDAR benchmarks provide little visibility into this failure mode. Prior adversarial LiDAR studies have largely centered on attack hardware, geometric and algorithmic defenses, and early-generation detectors, leaving the robustness of modern perception systems unexplored. To address this evaluation gap, we introduce ATLAS (Adversarial Temporal LiDAR Attack Suite), the first large-scale, physically grounded evaluation benchmark for LiDAR perception models under black-box sensor attacks, simulating the two primary attack modes (point injection and point removal) across real driving sequences. Evaluating a broad cross-section of current state-of-the-art LiDAR perception models, ATLAS reveals a surprising robustness asymmetry: models with stronger performance on standard benchmarks tend to better withstand removal attacks, yet are actually more vulnerable to injection attacks than weaker models. We trace this vulnerability to standard object database sampling augmentations, revealing how current training practices can induce architecture-agnostic robustness failures, and study initial directions for mitigating both attack modes. We release the ATLAS generation code to support extensible, reproducible evaluations as attack capabilities evolve, helping make black-box sensor robustness an explicit consideration in future LiDAR perception development.

2606.02915 2026-06-03 cs.CV 版本更新

Any2Poster: Any-Source Poster Generation Across Modalities and Domains

Any2Poster: 跨模态和领域的任意源海报生成

Amogh Vinaykumar, Aiden Li, Suozhi Huang, Shilong Liu

发表机构 * Flower Mound High School(弗洛拉穆恩高中) University College London(伦敦大学学院) Princeton University(普林斯顿大学)

AI总结 提出Any2Poster Bench基准和Any2Poster Agent智能体,实现从多种输入模态和领域生成海报,并通过基于测验和视觉评估的方法验证信息保真度和视觉传达效果。

Comments Project Page: https://github.com/Any2Poster/Any2Poster

详情
AI中文摘要

视觉海报是传达密集信息的紧凑媒介,然而自动海报生成的进展难以衡量,因为现有评估通常局限于仅论文输入、狭窄领域或表面视觉相似性。我们引入了Any2Poster Bench,一个用于任意源海报生成的基准,它评估系统在八种输入模态(PDF、URL、PPTX、DOCX、Markdown、LaTeX、笔记本和视频)和五个内容领域上的表现。Any2Poster Bench将每个源与基于测验的逐字事实保留和解释性理解探测,以及基于VLM的视觉质量、布局、可读性、内容完整性和逻辑流程判断相结合,从而实现对信息保真度和视觉传达的可重复评估。为了实例化和验证这一基准,我们进一步提出了Any2Poster Agent,一个端到端的参考智能体,它解析异构源、组织显著内容、规划海报布局、渲染海报,并使用视觉反馈迭代优化。在Any2Poster Bench上,Any2Poster Agent在输入模态上平均准确率达到87.25%,在内容领域上达到87.28%。在PaperQuiz风格评估中(其中先前的论文到海报智能体可直接比较),Any2Poster Agent将总体准确率从PosterAgent-4o的51.06-51.33%提高到72.58%,并将密度增强分数从116-121提高到145.16。总之,Any2Poster Bench和Any2Poster Agent为研究多模态、通用领域的海报生成提供了可复用的评估资源和有竞争力的基线。

英文摘要

Visual posters are a compact medium for communicating dense information, yet progress on automatic poster generation remains difficult to measure because existing evaluations are often restricted to paper-only inputs, narrow domains, or surface-level visual similarity. We introduce Any2Poster Bench, a benchmark for any-source poster generation that evaluates systems across eight input modalities (PDFs, URLs, PPTX, DOCX, Markdown, LaTeX, notebooks, and videos) and five content domains. Any2Poster Bench pairs each source with quiz-based probes of verbatim factual retention and interpretive understanding, together with VLM-based judgments of visual quality, layout, readability, content completeness, and logical flow, enabling reproducible assessment of both information fidelity and visual communication. To instantiate and validate this benchmark, we further present Any2Poster Agent, an end-to-end reference agent that parses heterogeneous sources, organizes salient content, plans poster layouts, renders posters, and iteratively refines them using visual feedback. On Any2Poster Bench, Any2Poster Agent achieves 87.25% average accuracy across input modalities and 87.28% across content domains. On PaperQuiz-style evaluation, where prior paper-to-poster agents are directly comparable, Any2Poster Agent improves over PosterAgent-4o from 51.06-51.33% to 72.58% overall accuracy and from 116-121 to 145.16 in density-augmented score. Together, Any2Poster Bench and Any2Poster Agent provide a reusable evaluation resource and a competitive baseline for studying multimodal, domain-general poster generation.

2606.02831 2026-06-03 cs.CV 版本更新

Principled Reflection Separation via Nonlinear Superposition and Feature Interaction

基于非线性叠加与特征交互的原理性反射分离

Qiming Hu, Mingjia Li, Yuntong Li, Xiaojie Guo

AI总结 针对单图像反射分离中传输层与反射层非线性耦合问题,提出可学习非线性叠加模型和广义双流交互框架,实现更优的分解性能与泛化能力。

Comments 23 pages

详情
AI中文摘要

单图像反射分离从根本上受到复杂图像形成过程中传输层和反射层纠缠的挑战。现有方法大多依赖简化假设或独立建模,限制了其处理真实场景的能力。在这项工作中,我们从统一视角重新审视该问题,并指出现有方法的一个关键问题,即广泛采用的sRGB域线性合成模型无法捕捉真实图像信号处理流水线引入的非线性耦合。为解决此问题,我们引入了一个可学习的非线性叠加模型,该模型更真实地刻画层间相互作用并提高分解保真度。基于此公式,我们提出了一个广义双流交互框架,通过特征交换显式建模传输层和反射层之间的双向依赖关系。该框架统一了基于激活、门控和注意力的交互机制,并兼容CNN和Transformer骨干网络。在多种真实世界基准上的大量实验表明,所提方法实现了优越的性能和强泛化能力。更重要的是,我们的研究揭示反射分离并非撤销线性混合,而是学习非线性形成与交互,为原理性图像分解模型的设计提供了新见解。代码和模型已公开于该链接。

英文摘要

Single-image reflection separation is fundamentally challenged by the entanglement of transmission and reflection layers under complex image formation processes. Existing approaches largely rely on simplified assumptions or independent modeling, limiting their ability to handle real-world scenarios. In this work, we revisit the problem from a unified perspective and identify a key issue with existing approaches: the widely adopted linear composition model in the sRGB domain fails to capture the nonlinear coupling introduced by real-world image signal processing pipelines. To address this, we introduce a learnable nonlinear superposition model that more faithfully characterizes layer interactions and improves decomposition fidelity. Building upon this formulation, we propose a generalized dual-stream interactive framework that explicitly models bidirectional dependencies between transmission and reflection through feature exchange. This framework unifies activation-, gating-, and attention-based interaction mechanisms, and is compatible with both CNN and Transformer backbones. Extensive experiments on diverse real-world benchmarks demonstrate that the proposed approach achieves superior performance with strong generalization capability. More importantly, our study reveals that reflection separation is not about undoing a linear mixture, but about learning nonlinear formation and interaction, offering new insights into the design of principled image decomposition models. Code and models are publicly available at https://mingcv.github.io/DIRS-Page.
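A small numerical example may help show why a linear mixture in the sRGB domain is only an approximation. The sketch assumes that light from the transmission and reflection layers adds in linear intensity and that the only nonlinearity is the standard sRGB transfer curve; real camera pipelines apply further processing, and this is not the learnable superposition model proposed in the paper. The sample intensities are made up.

```python
def linear_to_srgb(c):
    """Standard sRGB transfer function for a linear-light value in [0, 1]."""
    c = min(max(c, 0.0), 1.0)
    if c <= 0.0031308:
        return 12.92 * c
    return 1.055 * c ** (1.0 / 2.4) - 0.055


# Hypothetical linear-light intensities of the two layers at one pixel.
t_lin, r_lin = 0.30, 0.10

# What the camera records: the layers add in linear light, then get encoded.
observed = linear_to_srgb(t_lin + r_lin)

# What a linear sRGB-domain model predicts: encode each layer, then add.
linear_model = linear_to_srgb(t_lin) + linear_to_srgb(r_lin)

# The two values differ, so I = T + R does not hold exactly in sRGB.
print(observed, linear_model, linear_model - observed)
```

Because the encoding curve is concave, the sum of encoded layers overshoots the encoded sum, and the size of the overshoot depends on both layer intensities, which is one way to read the "nonlinear coupling" the abstract refers to.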

2606.02809 2026-06-03 cs.CV 版本更新

Automated Report-Derived Oncology VQA Benchmark for Evaluating Vision-Language Models on 3D Medical Imaging

自动化报告驱动的肿瘤学VQA基准:用于评估3D医学影像上的视觉-语言模型

Bo Liu, Hanxue Gu, Xiangru Li, Zheren Zhu, Jacob Ellison, Kang Wang, Janine M. Lupo, Yang Yang, Hui Lin

发表机构 * UCSF–UC Berkeley Joint Graduate Program in Bioengineering(UCSF-伯克利生物工程联合研究生项目) Department of Radiology, UCSF(UCSF放射科) Department of Radiation Oncology, UCSF(UCSF放射肿瘤科)

AI总结 提出一个自动化管道,从私有放射学报告和3D肿瘤影像生成多选VQA数据集,构建无污染基准,评估六种视觉-语言模型,发现视觉依赖因数据集而异。

详情
AI中文摘要

评估医学影像上的视觉-语言模型(VLM)需要临床基础、可扩展且控制评估混淆的基准。现有的公共基准在规模上有限、需要手动标注,或可能泄露到VLM预训练语料中。我们提出一个自动化智能体驱动的管道,直接从配对的私有放射学报告和3D肿瘤影像生成多选VQA数据集,产生两种互补的问题类型:从临床医生定义的报告模式确定性导出的RADS风格问题,以及由LLM根据放射科医生发现生成并对照源报告验证的放射学报告衍生问题。应用于四个内部癌症队列,该管道产生一个实例污染控制的基准,无需每个问题的人工标注。对六个VLM的零样本评估显示没有主导模型,且所有单元均有显著提升空间。一项盲消融实验显示,视觉依赖高度特定于数据集:肝脏报告衍生问题确实需要图像,而肺CT基本上可以在没有图像的情况下解决——领先的闭源模型在盲测时在肺CT上的准确率超过其有视觉的准确率,这表明即使是私有临床数据也不能保证对视觉能力的污染控制读取。该管道作为开放智能体技能发布,用于内部重新部署。

英文摘要

Evaluating vision-language models (VLMs) on medical images requires benchmarks that are clinically grounded, scalable, and controlled for evaluation confounds. Existing public benchmarks are limited in scale, manually annotated, or potentially leaked into VLM pretraining corpora. We present an automated agent-driven pipeline that generates multiple-choice VQA datasets directly from paired private radiology reports and 3D oncology imaging, producing two complementary question types: RADS-style questions deterministically derived from clinician-defined reporting schemas, and radiology report-derived questions generated by an LLM from radiologist findings and verified against the source report. Applied to four in-house cancer cohorts, the pipeline yields an instance-contamination-controlled benchmark without per-question human annotation. Zero-shot evaluation of six VLMs reveals no dominant model and substantial headroom across all cells. A blind ablation reveals that visual reliance is highly dataset-specific: liver Report-derived questions genuinely require the image, while Lung CT is essentially solvable without it (the leading closed model exceeds its sighted accuracy on Lung CT when blinded), indicating that even private clinical data does not guarantee a contamination-controlled read of visual capability. The pipeline is released as an open agent skill for in-house redeployment.

2606.02789 2026-06-03 cs.CV 版本更新

Diagnosis of Human Object Interaction Detectors for Real World Educational Applications

面向真实世界教育应用的人-物交互检测器诊断

Divya Mereddy, Ashwin Tudur Sadashiva, Marcos Quinones-Grueiro, Gautam Biswas

AI总结 提出一种诊断驱动框架,结合三元组级HOI错误分类与错误因素归因分析,通过针对性改进将预训练CDN模型在CCATT数据集上的宏F1分数从48.6提升至90.2。

详情
AI中文摘要

人-物交互(HOI)识别对于在复杂教育环境中自动分析学生行为至关重要。尽管最先进的HOI检测器在基准数据集上表现良好,但在实际训练环境中部署时,由于领域特定物体、遮挡和复杂视觉条件,其性能往往会下降。本文针对真实世界的教育视频数据,引入了一种诊断驱动框架,该框架将三元组级HOI错误分类与错误因素归因分析相结合。我们在重症监护空运队(CCATT)混合现实医疗训练的背景下研究这一问题。基于对HOI失败模式及其原因的分析,我们开发了一种诊断信息驱动的改进策略,用于将预训练的HOI模型适应到目标领域。在CCATT数据集上的实验表明,通过由诊断出的错误因素引导的针对性改进,该方法将预训练CDN模型的宏F1分数从48.6提升至90.2。这些结果突显了详细诊断分析对于指导HOI模型在真实教育环境中进行针对性适应的价值。

英文摘要

Human-object interaction (HOI) recognition is critical for automatically analyzing student behavior in complex educational environments. Although state-of-the-art (SOTA) HOI detectors perform well on benchmark datasets, their performance often degrades when deployed in real-world training environments due to domain-specific objects, occlusions, and complex visual conditions. In this paper, we introduce a diagnosis-driven framework that integrates a triplet-level HOI error taxonomy with error-factor attribution analysis for real-world educational video data. We study this problem in the context of Critical Care Air Transport Team (CCATT) mixed-reality medical training. Based on an analysis of HOI failure modes and their causes, we develop a diagnosis-informed refinement strategy for adapting pretrained HOI models to the target domain. Experiments on the CCATT dataset show that this approach improves the macro-F1 score of a pretrained CDN model from 48.6 to 90.2 through targeted refinement guided by diagnosed error factors. These results highlight the value of detailed diagnostic analysis for informing targeted adaptation of HOI models in real-world educational environments.
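For readers unfamiliar with the macro-F1 score reported above, the following sketch computes it for a plain classification setting: F1 is computed per class and then averaged with equal weight, so rare interaction classes count as much as frequent ones. Treating HOI recognition as one label per sample is a simplifying assumption (the paper's evaluation may also involve matching detected boxes), and the labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical interaction labels for four samples.
y_true = ["hold", "hold", "press", "idle"]
y_pred = ["hold", "press", "press", "idle"]
print(macro_f1(y_true, y_pred))
```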

2606.02774 2026-06-03 cs.CV 版本更新

GeoDrive-Bench: Benchmarking Region-Specific Multimodal Reasoning in Autonomous Driving

GeoDrive-Bench:自动驾驶中区域特定多模态推理的基准测试

Yingzi Ma, Chaowei Xiao, Ming Jiang

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出GeoDrive-Bench基准,通过5053个跨六国人工验证的多选题,评估视觉语言模型在感知、预测、规划和区域推理四个驾驶任务中基于区域特定交通规则的推理能力,并设计蒸馏算法注入区域知识以提升模型性能。

详情
AI中文摘要

用于自动驾驶的视觉语言模型(VLM)已展现出有前景的性能,但它们处理区域特定交通规则的能力仍未得到充分探索,这引发了对其在全球不同环境中部署的不确定性。因此,我们引入了GeoDrive-Bench,这是一个新颖的基准,能够系统性地研究VLM的地理文化驾驶推理。我们整理了5053个人工验证的多选题,涵盖六个国家,涉及多样的驾驶文化。具体而言,我们强调四个驾驶任务:感知、预测、规划和区域推理。每个问题要求模型从视觉证据和当地交通惯例中推断出正确的驾驶行为,而不给出明确的国家标签。除了评估,我们还设计了一种蒸馏算法,将区域特定的交通规则知识注入VLM的内部表示,使模型能够更好地将视觉场景理解与当地驾驶策略对齐。在九个最先进的VLM上的实验表明,每个任务在不同地理驾驶文化下存在显著的性能差异,而我们提出的基线模型在跨区域的地理文化推理上有所改进。这些结果表明,当前的VLM仍然缺乏鲁棒的区域感知驾驶智能,并突显了GeoDrive-Bench作为可部署自动驾驶基础模型的诊断和训练导向测试床的价值。

英文摘要

Vision-language models (VLMs) for autonomous driving have shown promising performance, but their ability to handle region-specific traffic rules remains underexplored, raising uncertainties about their deployment across diverse global settings. We therefore introduce GeoDrive-Bench, a novel benchmark that enables the systematic investigation of VLMs' geo-culturally grounded driving reasoning. We curated 5,053 human-validated multiple-choice QA pairs across six countries covering diverse driving cultures. Specifically, we emphasize four driving tasks: perception, prediction, planning, and region reasoning. Each question requires models to infer the correct driving behavior from visual evidence and local traffic conventions without explicit country labels. Beyond evaluation, we further design a distillation algorithm that injects region-specific traffic-rule knowledge into the internal representations of VLMs, enabling models to better align visual scene understanding with local driving policies. Experiments on nine state-of-the-art VLMs show substantial performance variations across geo-driving cultures for each task, while our proposed baseline models exhibit improved geo-cultural reasoning across regions. These results suggest that current VLMs still lack robust region-aware driving intelligence and highlight GeoDrive-Bench as a diagnostic and training-oriented testbed for deployable autonomous driving foundation models.

2606.02764 2026-06-03 cs.CV physics.comp-ph 版本更新

From Local Training to Large-Scale Mapping: A Comparative Assessment of Machine Learning and Deep Learning for Transferable Satellite-Derived Bathymetry

从局部训练到大规模制图:机器学习与深度学习在可迁移卫星测深中的比较评估

Hsiao-Jou Hsu, Joachim Moortgat

发表机构 * School of Earth Sciences, The Ohio State University(地球科学学院,俄亥俄州立大学)

AI总结 本研究评估了随机森林与四种CNN在0-20米深度范围内基于Sentinel-2影像的可迁移卫星测深性能,通过保持空间连续性的训练策略和引入平滑权重函数损失,实现了跨区域稳健的深度估计。

Comments 42 pages, 13 figures, 15 tables. Supplementary Information provided as ancillary file (anc/SI.pdf). Code and pretrained weights at https://github.com/buckai-observatory/DL_bathy

详情
Journal ref
Remote Sens. 18 (2026) 1768
AI中文摘要

多光谱影像的卫星测深(SDB)成本效益高,但在不同区域间的扩展性较差,尤其是在光学复杂的沿海环境中。我们利用Sentinel-2影像评估了机器学习与深度学习在0-20米深度范围内的可迁移SDB性能。在普拉塔斯岛和大堡礁选定区域训练了随机森林基线模型和四种CNN(ResNet-50、ResNet-101、EfficientNet-B4、ConvNeXt-Large),然后在空间独立的区域内和跨区域测试区域进行评估。训练过程中保持空间连续性(即保留连续的礁块而非随机斑块)是影响最大的设计选择;我们进一步引入了平滑权重函数(SWF)加权的RMSE损失,以强调近地表深度。采用这些选择后,区域内RMSE在0-20米范围内为1.15至1.92米,在深度≤3米时低至0.26米。随机森林在跨区域迁移下性能急剧下降(RMSE从1.53米升至2.99-3.78米),而深度模型保持更稳健(2.46-2.98米)。在公开的MagicBathyNet航空RGB基准(0-16米)上,所提出的网络达到了0.19-0.22米的RMSE,优于U-Net基线和一种任务特定的Transformer架构,且参数显著更少。我们进一步利用了多时相重复影像:在其上训练增加了多样性,并且在推理时对各次通过的中位数聚合预测减少了来自太阳角度、大气条件、水性质和潮汐变化的噪声。我们发布了优化的架构和预训练权重,以实现对新地点的可扩展迁移。

英文摘要

Satellite-derived bathymetry (SDB) from multispectral imagery is cost-effective but scales poorly across regions, especially in optically complex coastal environments. We evaluate machine learning and deep learning for transferable SDB over the 0-20 m depth range using Sentinel-2 imagery. A Random Forest baseline and four CNNs (ResNet-50, ResNet-101, EfficientNet-B4, ConvNeXt-Large) are trained on Pratas Island and selected Great Barrier Reef regions, then evaluated on spatially independent intra- and cross-regional test areas. Preserving spatial continuity during training, by keeping contiguous reef blocks rather than random patches, is the single most impactful design choice; we further introduce a Smooth Weight Function (SWF)-weighted RMSE loss that emphasizes near-surface depths. With these choices, intra-regional RMSE ranges from 1.15 to 1.92 m over 0-20 m and is as low as 0.26 m for depths <= 3 m. Random Forest degrades sharply under cross-regional transfer (RMSE 1.53 m -> 2.99-3.78 m), while the deep models stay more robust (2.46-2.98 m). On the public MagicBathyNet aerial-RGB benchmark (0-16 m) the proposed networks reach 0.19-0.22 m RMSE, outperforming a U-Net baseline and a task-specific transformer architecture with substantially fewer parameters. We further exploit multi-temporal repeat imagery: training on it broadens diversity, and median-aggregating predictions across passes at inference reduces noise from changing sun angles, atmospheric conditions, water properties, and tides. We release optimized architectures and pretrained weights to enable scalable transfer to new sites.
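Two ideas from the abstract above lend themselves to a short sketch: median aggregation of per-pass predictions and a depth-weighted RMSE. The weight function below is a hypothetical smooth curve that favors shallow depths; the abstract does not give the form of the paper's Smooth Weight Function, so `shallow_weight` and its `scale` parameter are stand-ins, and the depth values are made up.

```python
import math
from statistics import median


def median_aggregate(passes):
    """Per-pixel median over several satellite passes (list of equal-length lists)."""
    return [median(values) for values in zip(*passes)]


def shallow_weight(depth, scale=5.0):
    """Hypothetical smooth weight that decays with depth (not the paper's SWF)."""
    return 1.0 + math.exp(-depth / scale)


def weighted_rmse(pred, target, weight_fn=shallow_weight):
    """RMSE in which each squared error is weighted by the target depth."""
    num = sum(weight_fn(t) * (p - t) ** 2 for p, t in zip(pred, target))
    den = sum(weight_fn(t) for t in target)
    return math.sqrt(num / den)


# Predicted depths (m) for three pixels over three passes; the third pass
# has an outlier at the first pixel (e.g. sun glint or a tide offset).
passes = [[2.0, 10.0, 18.0], [2.2, 10.5, 17.0], [5.0, 9.8, 18.4]]
fused = median_aggregate(passes)
print(fused, weighted_rmse(fused, [2.1, 10.2, 18.1]))
```

The median discards the single-pass outlier at the first pixel, which a mean would not, and the weighting makes errors at shallow targets contribute more to the loss than equally large errors at depth.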

2606.02753 2026-06-03 cs.CV cs.AI 版本更新

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

MetaWorld: 从单视角视频数据扩展多智能体视频世界模型

Teng Hu, Mingchun Lu, Yating Wang, Jiangning Zhang, Jinkun Hao, Ye Pan, Ran Yi, Lizhuang Ma, Dacheng Tao

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学)

AI总结 提出MetaWorld框架,通过单目世界状态展开、主体感知世界生成器和世界状态对齐机制,从单视角视频构建多智能体视频世界模型,解决数据稀缺和世界状态对齐问题。

详情
AI中文摘要

视频世界模型是具身AI和元宇宙的基础生成技术,但现有方法固有限制于单智能体从单一视角观察。将这些模型扩展到多智能体设置引入了两个关键挑战:数据稀缺(协调的多视角记录对于通用开放域场景来说成本过高)和世界状态对齐(独立生成的视频流无法确保共享物理环境和事件在不同视角下一致演化)。为应对这些挑战,我们提出MetaWorld,一种新颖框架,可直接从单视角视频将多智能体视频世界模型扩展到开放域环境。首先,我们引入单目世界状态展开(MWSU),将单目视频显式分解为相机操作者的自我运动与可见主体的空间轨迹。这种相机-轨迹分解自然提取了共享3D空间内同步的多智能体运动数据,完全绕开了多相机设置的需求。其次,为精确视觉控制,我们开发了主体感知世界生成器,实现基于每个智能体身份图像的外观驱动模拟。最后,为确保两个视角基于相同的物理现实,我们提出世界状态对齐(WSA),一种在视频DiT的每个Transformer层插入的逐帧跨分支交叉注意力机制。通过联合同步去噪过程,WSA强制实现静态几何一致性和动态运动一致性,促使共享3D环境和物理事件在两个自我中心视角间保持良好对齐。大量实验表明,MetaWorld实现了优越的跨视角一致性和身份保真度,为多智能体视频世界建模建立了一个高度可扩展、物理驱动的范式。

英文摘要

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.

2606.02747 2026-06-03 cs.CV cs.AI 版本更新

Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records

Plan2Map: 基于规划记录的文档驱动地理空间边界重建的多模态基准

Fabian Degen, Oishi Deb, Jindong Gu, Junchi Yu, Samuele Marro, Philip Torr, Jialin Yu

AI总结 提出Plan2Map基准和GeoPlanAgent系统,通过文档证据提取、定位、地图配准、边界分割等步骤,从英国规划记录中重建地理空间边界,显著优于直接VLM方法。

Comments Project page: https://odeb1.github.io/Plan2Map_Project_Page/. Fabian Degen and Oishi Deb Contributed Equally

详情
AI中文摘要

规划记录定义了地理区域上的限制,但其源文档通常仅提供间接的空间证据而非机器可读的边界。我们介绍了Plan2Map,一个包含208个案例的多模态基准,用于从英国规划记录中重建文档驱动的地理空间边界。仅给定源规划文档,系统必须从通知文本、时间表、地图图版、地图标签和边界注释中重建有效的地理空间边界;参考GeoJSON被保留用于评分。我们提出了GeoPlanAgent,一个文档驱动、地理空间工具在环的系统,将任务分解为证据提取、定位、地图配准、边界分割、投影和验证。在Plan2Map上,GeoPlanAgent实现了0.736的平均IoU和0.904的中位IoU,其中67.8%的预测IoU达到或超过0.8,显著优于直接VLM到GeoJSON的基线。诊断分析表明,直接VLM预测仍然不可靠,而剩余错误集中在定位和地图配准上,监督边界分割显著提高了像素级掩码质量。Plan2Map为从公共规划记录中进行多模态地理空间重建提供了一个具体的测试平台。项目页面:此https URL。

英文摘要

Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. Project page: https://odeb1.github.io/Plan2Map_Project_Page/.
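
The three headline numbers in the abstract above (mean IoU, median IoU, and the share of predictions at or above 0.8 IoU) are summary statistics over per-case IoU values. The following is a minimal sketch of how such a summary could be computed. It is not the benchmark's scoring code: it assumes rasterised binary masks rather than GeoJSON polygons, and the function names `mask_iou` and `summarize` as well as the toy values are hypothetical.

```python
from statistics import mean, median


def mask_iou(pred, ref):
    """IoU of two equally sized binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def summarize(ious, threshold=0.8):
    """Mean IoU, median IoU, and share of cases at or above the threshold."""
    return {
        "mean_iou": mean(ious),
        "median_iou": median(ious),
        "share_at_or_above": sum(i >= threshold for i in ious) / len(ious),
    }


# Toy masks (made up): 3 overlapping cells, 4 cells in the union.
ref_mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
pred_mask = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
toy_iou = mask_iou(pred_mask, ref_mask)

# Toy per-case IoUs (made up): three good cases and one failure.
toy_summary = summarize([0.95, 0.90, 0.85, 0.20])
```

In the toy summary a single failed case pulls the mean below the median, which is one plausible reading of the gap between the reported mean (0.736) and median (0.904).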

2606.02742 2026-06-03 cs.CV 版本更新

Consistent Yet Wrong: Evidence Insensitivity in Spatial Vision-Language Models

一致但错误:空间视觉-语言模型中的证据不敏感性

S Divakar Bhat, Toshihiko Yamasaki

发表机构 * The University of Tokyo, Japan(东京大学)

AI总结 通过引入ViewDiag多视图评估协议,发现现代视觉-语言模型在空间推理中表现出跨视角一致但错误的现象,表明其预测主要源于先验驱动而非证据敏感推理。

详情
AI中文摘要

空间推理是机器人、自主系统和具身AI的基础,然而现代视觉-语言模型(VLM)在度量距离查询上仍然不可靠。一个常见的假设是,跨视角的一致预测反映了几何基础。我们测试了这一假设,并发现了相反的情况:领先的VLM经常产生视角不变且一致的答案,即使这些答案是不正确的,这表明预测与视角特定的视觉证据之间的耦合较弱。我们引入了ViewDiag,一个基于Hypersim、ScanNet和KITTI360构建的受控多视图评估协议,包含80个场景中的176个对象对轨迹,每个轨迹有2-10个视图。该协议从三个维度评估模型:度量准确性、分布集中度以及用于区分决策崩溃与表示崩溃的内部崩溃的潜在特征探针。在不同的模型中,我们观察到高预测稳定性与显著误差的一致模式,聚集在强一致性但低准确性的区域。这些结果挑战了将跨视角一致性作为几何理解代理的常见做法。相反,我们表明稳定的预测可能反映了先验驱动的崩溃,而不是证据敏感的推理。ViewDiag提供了一个受控基准和诊断框架,用于评估超越准确性的空间VLM。代码和数据可在 https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git 获取。

英文摘要

Spatial reasoning is fundamental to robotics, autonomy, and embodied AI, yet modern vision-language models (VLMs) remain unreliable on metric distance queries. A common assumption is that consistent predictions across viewpoints reflect geometric grounding. We test this assumption and find the opposite: leading VLMs often produce view-invariant and consistent answers even when those answers are incorrect, indicating weak coupling between predictions and viewpoint-specific visual evidence. We introduce ViewDiag, a controlled multi-view evaluation protocol built from Hypersim, ScanNet, and KITTI360, comprising 176 object-pair tracks across 80 scenes with 2 to 10 views per track. The protocol evaluates models along three axes: metric accuracy, distributional concentration, and a latent feature probe for internal collapse that distinguishes decision collapse from representation collapse. Across diverse models, we observe a consistent pattern of high prediction stability paired with substantial error, clustering in a regime characterized by strong consistency but low accuracy. These results challenge the common use of cross-view consistency as a proxy for geometric understanding. Instead, we show that stable predictions may reflect prior-driven collapse rather than evidence-sensitive reasoning. ViewDiag provides a controlled benchmark and diagnostic framework for evaluating spatial VLMs beyond accuracy alone. The code and data are available at https://github.com/SDivakarBhat/Consistent_Yet_Wrong.git.

2606.02724 2026-06-03 cs.CV cs.AI 版本更新

AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes

AVTrack: 以人为中心的复杂场景中的视听跟踪

Yaoting Wang, Yun Zhou, Zipei Zhang, Henghui Ding

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有视听跟踪数据集局限于简单场景的问题,提出AVTrack数据集,通过包含相机运动、视觉遮挡和位置变化等复杂动态条件,评估并提升鲁棒的人为中心视听场景理解。

Comments 19 pages, 10 figures, ICML 2026

详情
AI中文摘要

视听说话人跟踪旨在通过利用听觉和视觉线索来定位和跟踪活跃的说话人,实现细粒度、以人为中心的场景理解。这一能力对于智能视频编辑、监控和人机交互等实际应用至关重要。然而,现有数据集大多局限于具有粗略标注的简单或同质视听场景。这种过度简化的设置使评估偏向于静态视听共现,而非严格评估复杂动态场景中的鲁棒时空建模和跨模态推理。为了解决这些限制,我们引入了AVTrack,一个以人为中心的视听实例分割(AVIS)数据集,专为动态真实世界场景设计。AVTrack具有多样且具有挑战性的条件,包括相机运动、视觉遮挡和位置变化。在AVTrack上对代表性AVIS方法的评估揭示了显著的性能下降,使AVTrack成为复杂环境中鲁棒的以人为中心的视听场景理解的挑战性基准。我们进一步提供了一个简单而有效的基线,以促进未来的研究。项目网站:此https URL

英文摘要

Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/

2606.02603 2026-06-03 cs.CV cs.LG 版本更新

COD10K-C: Benchmarking Robustness of Camouflaged Object Detection Under Natural Image Corruptions

COD10K-C:自然图像损坏下伪装目标检测的鲁棒性基准测试

Arafat Hossain Sayem

发表机构 * CSE, Bangladesh University of Engineering and Technology(孟加拉国工程与技术大学计算机科学与工程系)

AI总结 提出COD10K-C基准,包含8种损坏类型和5个严重级别,评估伪装目标检测模型在损坏图像上的性能,并引入轻量级模型RobustCODLite,通过损坏增强、频率先验分支和不确定性一致性损失,在损坏条件下保持较高Dice分数。

Comments 7 pages, 1 figure

详情
AI中文摘要

伪装目标检测已取得显著进步,但大多数标准基准仅评估模型在干净图像上的性能。这并不现实,因为真实相机经常捕捉到模糊、传感器噪声、天气效应和压缩伪影。我们提出了COD10K-C,一个基于COD10K的损坏鲁棒性基准。它包含8种损坏类型和5个严重级别,总共40种条件和81,040个评估对。我们评估了三种流行的伪装目标检测模型:SINet-v2、PFNet和ZoomNet,以及一个轻量级模型RobustCODLite。所有模型在损坏图像上均表现出明显的性能下降。运动模糊和高斯模糊导致最大的下降,其中SINet-v2在运动模糊下损失了18.5个Dice点。亮度和雾的影响较小。RobustCODLite使用了损坏增强、频率先验分支和不确定性一致性损失。它在损坏条件下保留了其干净Dice分数的92.3%,而SINet-v2为87.7%,ZoomNet为84.8%,PFNet为84.1%。在最严重的损坏情况下,RobustCODLite达到或超过了在干净数据上表现更好的模型。我们将发布COD10K-C的GitHub仓库,以支持未来在鲁棒伪装目标检测方面的研究。

英文摘要

Camouflaged object detection has improved substantially, but most standard benchmarks evaluate models only on clean images. This is not realistic because real cameras often capture blur, sensor noise, weather effects, and compression artifacts. We present COD10K-C, a corruption robustness benchmark based on COD10K. It includes 8 corruption types and 5 severity levels, giving 40 conditions and 81,040 evaluation pairs in total. We evaluate three popular camouflaged object detection models, SINet-v2, PFNet, and ZoomNet, as well as a lightweight model called RobustCODLite. All models show clear performance drops on corrupted images. Motion blur and Gaussian blur cause the largest drops, with SINet-v2 losing 18.5 Dice points under motion blur. Brightness and fog are less harmful. RobustCODLite uses corruption augmentation, a frequency-prior branch, and an uncertainty-consistency loss. It retains 92.3% of its clean Dice score under corruption, compared with 87.7% for SINet-v2, 84.8% for ZoomNet, and 84.1% for PFNet. On the hardest corruptions, RobustCODLite matches or outperforms models that perform better on clean data. We will release the COD10K-C GitHub repository to support future research in robust camouflaged object detection.
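
The figures in the abstract above fit together arithmetically: 8 corruption types times 5 severity levels give 40 conditions, and 81,040 evaluation pairs spread over 40 conditions is 2,026 pairs per condition. The sketch below illustrates that arithmetic together with a Dice score and a retention ratio. The abstract does not define how retention is aggregated, so the definition used here (mean corrupted Dice divided by clean Dice) is an assumption, and the function names and toy values are hypothetical.

```python
from statistics import mean

# Benchmark size arithmetic, using only figures stated in the abstract.
CORRUPTION_TYPES = 8
SEVERITY_LEVELS = 5
conditions = CORRUPTION_TYPES * SEVERITY_LEVELS
pairs_per_condition = 81_040 // conditions


def dice(pred, ref):
    """Dice coefficient of two flat binary masks (lists of 0/1)."""
    inter = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(pred) + sum(ref)
    # Assumption: two empty masks count as a perfect match.
    return 2 * inter / total if total else 1.0


def retention(clean_dice, corrupted_dices):
    """Assumed definition: mean corrupted Dice as a fraction of clean Dice."""
    return mean(corrupted_dices) / clean_dice


# Toy values (made up, not results from the paper).
toy_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
toy_retention = retention(0.80, [0.76, 0.72, 0.74])
```

Under this definition, a retention of 92.3% would mean that the average Dice across corrupted conditions is 0.923 times the clean-image Dice.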

2606.02602 2026-06-03 cs.LG cs.CV 版本更新

Graph Mamba Survival Analysis Based on Topology-Aware ordering

基于拓扑感知排序的图Mamba生存分析

Yuanfang Chen, Peiqiang Yan, Yuntao Shou, Qian Zhao, Xiangyong Cao

发表机构 * School of Mathematics and Statistics(数学与统计学学院) West China Science and Technology Innovation Harbor(西部科学与技术创新港) School of Computer Science and Technology(计算机科学与技术学院)

AI总结 针对WSI生存分析中Mamba模型对输入顺序敏感及单向架构限制空间结构利用的问题,提出基于拓扑感知排序的图Mamba框架TopoMamSurv,通过TAO策略、双向Mamba模块和GCN集成实现高效长程依赖建模与双向空间上下文建模。

详情
AI中文摘要

在计算病理学中,全切片图像(WSI)生存分析对于患者预后评估至关重要,但面临多项技术挑战。尽管Transformer通过其自注意力机制捕获长程依赖,但其$O(N^2)$时间复杂度在大规模WSI图结构中造成严重计算瓶颈。Mamba模型以线性复杂度突破了Transformer的计算瓶颈。然而,由于Mamba对输入数据顺序的高度敏感性,图Mamba中传统的节点排序方法(如基于节点度或子图大小的方法)未能充分考虑图数据的拓扑连通性,从而限制了Mamba序列建模的性能。此外,其单向架构无法利用图像的双向空间结构。为解决这些挑战,本文提出一种基于拓扑感知排序的新型图Mamba生存分析框架(TopoMamSurv),以适应Mamba的序列敏感性。我们的可视化实验进一步证实,通过拓扑感知排序(TAO)策略提取的节点确实表现出更高的相似性。此外,我们设计了双向Mamba模块并集成图卷积网络(GCN),以实现图像的双向空间上下文建模,形成“局部聚合-全局捕获”的分层特征学习架构。该框架通过TAO、双向语义建模和分层特征融合的系统设计,有效调和了WSI分析中长程依赖建模、计算效率和空间结构利用之间的矛盾。该框架在五个TCGA数据集上验证了其全面的性能优势。

英文摘要

In computational pathology, survival analysis on Whole Slide Images (WSIs) is crucial for patient prognosis assessment, but it faces multiple technical challenges. Although the Transformer captures long-range dependencies through its self-attention mechanism, its $O(N^2)$ time complexity causes a severe computational bottleneck on large-scale WSI graph structures. The Mamba model overcomes this bottleneck with linear complexity. However, because Mamba is highly sensitive to the order of its input, traditional node sorting methods in Graph Mamba, such as those based on node degree or subgraph size, fail to adequately account for the topological connectivity of graph data, which restricts the performance of Mamba's sequential modeling. Moreover, its unidirectional architecture cannot leverage the bidirectional spatial structure of images. To address these challenges, this paper proposes a novel Graph Mamba survival analysis framework based on topology-aware ordering (TopoMamSurv) to adapt to the sequential sensitivity of Mamba. Our visualization experiments further confirm that the nodes extracted through the topology-aware ordering (TAO) strategy indeed exhibit higher similarity. Furthermore, we design a bidirectional Mamba module and integrate a Graph Convolutional Network (GCN) to achieve bidirectional spatial context modeling of images, forming a hierarchical "local aggregation, global capture" feature learning architecture. Through its systematic design of TAO, bidirectional semantic modeling, and hierarchical feature fusion, the framework reconciles the tension among long-range dependency modeling, computational efficiency, and spatial structure utilization in WSI analysis. Experiments on five TCGA datasets validate its performance advantages.

2606.02482 2026-06-03 cs.CV 版本更新

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

X-Stream: 探索多模态大语言模型作为多流理解的多路复用器

Peiwen Sun, Xudong Lu, Huadai Liu, Yang Bo, Dongming Wu, Huankang Guan, Minghong Cai, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Rui Liu, Xiangyu Yue

发表机构 * MMLab, Chinese University of Hong Kong(香港中文大学多媒体实验室) Huawei Inc.(华为公司)

AI总结 为解决多流视频理解评估缺失的问题,提出首个基准X-Stream,包含4220个QA对和932个视频,覆盖多窗口、多视角和多设备场景,并基于信号多路复用理论评估MLLM作为多路复用器的性能,发现现有模型在并发流上仅达约50%分数。

Comments Project Page: https://peiwensun2000.github.io/xstream/

详情
AI中文摘要

尽管视频流理解取得了显著进展,但实际应用(如体育直播、自动驾驶和多屏协作)本质上需要连续的多流交互。然而,现有基准局限于单流范式,在评估在线跨流推理方面存在关键空白。为填补这一空白,我们引入了X-Stream,这是首个专门用于多流流式理解的基准。X-Stream包含932个视频中精心整理的4220个QA对,评估了跨多窗口、多视角和多设备场景的11个子任务。关键的是,我们的数据集使用一种新颖的双重验证流水线构建,防止对单一流的过度依赖。此外,我们开创性地将多模态大语言模型(MLLM)概念化为朴素多路复用器,通过信号多路复用理论的视角系统评估其性能。我们广泛的在线推理实验揭示了一个严峻的现实:最先进的MLLM在并发流上表现困难,仅达到约50%的分数,且主动能力差。最终,X-Stream暴露了当前多路复用方案的权衡,为下一代多流智能体提供了实用的评估协议和经验指导。

英文摘要

While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this gap, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, scoring only about 50% and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-offs of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.

2606.02090 2026-06-03 cs.CV 版本更新

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

FocusDiT: 扩散Transformer中的查询掩码用于细粒度图像生成

Xueji Fang, Liyuan Ma, Jianhao Zeng, Jinjin Cao, Mingyuan Zhou, Guo-Jun Qi

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学)

AI总结 提出FocusDiT方法,通过掩码关键查询令牌仅输入FFN层,增强细粒度视觉生成,实验验证其有效性。

详情
AI中文摘要

扩散Transformer(DiT)已被广泛应用于生成扩散领域,通过注意力和前馈(FFN)层推进查询令牌的去噪。FFN实际上充当解码视觉内容的键值词汇表,其中值嵌入视觉语义知识。我们提出,关注对应于更复杂细节的关键查询令牌,并鼓励模型改进这些令牌,对于细粒度视觉生成至关重要。为此,我们提出FocusDiT,它应用掩码方案来关注仅输入FFN的关键查询令牌。掩码查询可以从FFN词汇表中检索视觉令牌,并使用它们解码其视觉细节。大量的文本到图像实验验证了令牌掩码在增强生成性能方面的有效性。

英文摘要

The diffusion transformer (DiT) has been widely adopted in generative diffusion modeling, advancing the denoising of query tokens through attention and feed-forward (FFN) layers. The FFN effectively acts as a key-value vocabulary for decoding visual content, where the values embed visual semantic knowledge. We argue that focusing on critical query tokens corresponding to more complex details, and encouraging the model to improve these tokens, is essential for fine-grained visual generation. To this end, we propose FocusDiT, which applies a masking scheme to focus on critical query tokens that are exclusively fed into the FFN. The masked queries can retrieve visual tokens from the FFN vocabularies and use them to decode their visual details. Extensive text-to-image experiments validate the effectiveness of token masking in enhancing generative performance.

2606.01962 2026-06-03 cs.CV 版本更新

Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection

基于领域增强的对比增强Transformer用于鲁棒的多场景金属表面缺陷检测

Yiyao Liu, Wenxiao He, Liyuan Ren, Huan Wang

发表机构 * Glasgow College, University of Electronic Science and Technology of China(电子科技大学格拉斯哥学院)

AI总结 提出对比增强Transformer(CAT)框架,结合Swin Transformer骨干、特征金字塔网络、领域特定液滴增强算法和难负样本挖掘策略,解决金属表面缺陷检测中标注数据有限、多尺度缺陷识别难和跨场景泛化差的问题,在KolektorSDD2数据集上达到99.54%像素级AUROC。

详情
AI中文摘要

金属表面缺陷检测对于维持工业制造中的产品质量至关重要。然而,它面临着重大挑战,包括有限的标注数据、难以识别细微的多尺度缺陷以及跨不同场景的泛化能力差。为了解决这些问题,本文提出了一种新颖的对比增强Transformer(CAT)框架,用于鲁棒的缺陷检测。CAT采用分层Swin Transformer骨干,并重新设计了特征金字塔网络,以有效融合低级纹理与高级语义,从而实现对细微和多尺度缺陷模式的精确建模。为了增强在真实噪声条件下的鲁棒性,我们提出了一种领域特定的液滴增强算法。此外,我们将难负样本挖掘策略纳入对比损失中,以增强模型在模糊缺陷区域的判别能力。在KolektorSDD2数据集上的实验结果表明,CAT实现了99.54%的像素级AUROC,优于现有方法。此外,CAT在三个未见过的数据集(包括KSDD1、用于瓷砖缺陷的MTD和用于轨道表面缺陷的MSDD)上表现出优越的泛化能力和鲁棒性,展示了其在大规模工业部署中的潜力。

英文摘要

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.

2606.01624 2026-06-03 cs.CV cs.SE 版本更新

What to Test Next: Interpretable Coverage Gap Discovery in Driving VLMs

下一步测试什么:驾驶视觉语言模型中可解释的覆盖缺口发现

Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker

AI总结 提出 SliceScorer 和 SliceNav 方法,通过结合暴露先验和邻居失败先验的确定性评分规则,在驾驶视觉语言模型中有效发现高风险覆盖缺口,并支持可解释和可审计的验证流程。

详情
AI中文摘要

驾驶视觉语言模型必须准确理解由操作设计域定义的各种条件下的场景,然而验证仍然稀疏:许多切片缺失,使得经验故障率不可靠。我们提出 SliceScorer,一种用于缺失切片推荐的确定性评分规则,它结合了 (i) 基于暴露的覆盖先验,优先考虑罕见、测试不足的区域,以及 (ii) 邻居失败先验,从类似测试条件传播风险。SliceScorer 刻意简单——可解释、可审计且保守——这些属性对于安全关键验证至关重要。为了在声明的 ODD 之外进行压力测试,我们将 SliceScorer 嵌入 SliceNav,一个由 LLM 编排的验证流程,其中模型解释开发者查询以选择相关操作(分诊、评分、获取、评估)和词汇扩展,组合验证工作流,同时保持所有评分确定性和可审计性。在三个驾驶 VLM(WiseAD、DriveMM、Cosmos-Reason2-2B)上的实验表明,SliceNav 比先前的切片发现方法更有效地发现高风险覆盖缺口,同时在条件空间中保持多样化的推荐。消融实验证实了两个评分组件的贡献,定性分析展示了从开发者查询到目标评估的端到端工作流。

英文摘要

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple: it is interpretable, auditable, and conservative, properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.
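
The abstract above describes a deterministic score that combines an exposure-based term with a neighbor-failure term, but it gives no formula. The sketch below shows one way such a rule could look, purely as an illustration. Everything in it is an assumption rather than the authors' method: the attribute-overlap similarity, the rarity term `1 - exposure`, the convex weighting `alpha`, and all names and toy values.

```python
def similarity(a, b):
    """Hypothetical similarity: share of condition attributes that agree."""
    keys = a.keys() | b.keys()
    if not keys:
        return 1.0
    return sum(a.get(k) == b.get(k) for k in keys) / len(keys)


def neighbor_failure(candidate, tested):
    """Similarity-weighted mean failure rate over already tested slices."""
    weights = [similarity(candidate, t["conditions"]) for t in tested]
    total = sum(weights)
    if total == 0:
        return 0.0  # no similar tested slice: no risk is propagated
    return sum(w * t["failure_rate"] for w, t in zip(weights, tested)) / total


def slice_score(candidate, exposure, tested, alpha=0.5):
    """Assumed rule: weighted sum of rarity and propagated failure risk.

    `exposure` is taken to be a frequency in [0, 1]; rarer slices score higher.
    """
    rarity = 1.0 - exposure
    return alpha * rarity + (1.0 - alpha) * neighbor_failure(candidate, tested)


# Toy data (made up): two tested slices and two untested candidates.
tested_slices = [
    {"conditions": {"weather": "rain", "time": "night"}, "failure_rate": 0.4},
    {"conditions": {"weather": "clear", "time": "day"}, "failure_rate": 0.1},
]
score_fog_night = slice_score({"weather": "fog", "time": "night"}, 0.1, tested_slices)
score_clear_dusk = slice_score({"weather": "clear", "time": "dusk"}, 0.6, tested_slices)
```

Because the rule is a closed-form function of its inputs, the same inputs always give the same ranking, which is the property the abstract highlights as important for auditability.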

2606.01348 2026-06-03 cs.CV 版本更新

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

ChartArena: 跨语言、场景和格式的图表解析基准测试

Shangpin Peng, Gengluo Li, Xingyu Wan, Chengquan Zhang, Hao Feng, Binghong Wu, Huawen Shen, Weinong Wang, Ziyi Cai, Zhuotao Tian, Han Hu, Can Ma, Yu Zhou

发表机构 * Large Language Model Department, Tencent(腾讯大语言模型部门) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) Shenzhen Loop Area Institute(深圳环湖区研究所) Nankai University(南开大学)

AI总结 提出ChartArena,一个覆盖8种图表族、3种视觉场景的双语基准,通过人机协作标注和格式无关评估协议,系统评估26个多模态大模型的图表解析能力,揭示前沿模型差距与挑战。

详情
AI中文摘要

图表是传达定量和关系信息的主要媒介,但系统地评估图表解析模型仍然困难。现有基准专注于狭窄的图表类型,而流程图和思维导图等图表结构在很大程度上未被涉及,同时模型输出格式不兼容,数据集也极少包含实践中遇到的打印或手绘图像。为解决这些问题,我们引入了ChartArena,一个全面的双语基准,涵盖8种图表族,包括数值图表和图表结构,每种图表在三种视觉场景中评估:数字渲染、打印照片和手绘照片。该数据集通过人机协作标注流程构建,并经过多阶段人工验证以确保标注可靠性。为实现公平的跨模型比较,我们进一步设计了一种格式无关的评估协议,将异构输出映射到两个规范语义空间:归一化三元组视图和有向图视图,并使用结构感知指标进行评分。通过对26个领先的多模态大语言模型的广泛评估,我们观察到三个一致的发现:(i) 前沿专有模型(如Gemini 3.1 Pro)总体领先,但最强的开源系统正在迅速缩小差距;(ii) 文档解析模型能较好地处理数值图表,但在图表结构上表现大幅落后;(iii) 专家图表解析器仍局限于狭窄的图表族。在所有模型中,雷达图和手绘场景尤其具有挑战性。这些发现表明,ChartArena揭示了清晰的能力差距,并为未来的进展提供了统一基础。ChartArena公开在 https://github.com/pspdada/ChartArena。

英文摘要

Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at https://github.com/pspdada/ChartArena.

2606.00489 2026-06-03 cs.CV 版本更新

3D Segment Anything Model with Visual Mamba for Diagnosing Placenta Accreta Spectrum

基于视觉Mamba的3D分割一切模型用于诊断胎盘植入谱

Yuliang Zhang, Fang He, Lulu Peng, Tianyu Yan, Pingping Zhang, Ting Song, Lili Du, Dunjin Chen

发表机构 * Department of Obstetrics and Gynecology, The Third Affiliated Hospital, Guangzhou Medical University(妇产科系,广州医科大学第三附属医院) Department of Obstetrics, Guangzhou Women and Children’s Medical Center, Guangzhou Medical University(妇产科,广州妇女儿童医疗中心,广州医科大学) Department of Radiology, The Third Affiliated Hospital, Guangzhou Medical University(放射科,广州医科大学第三附属医院) School of Future Technology, Dalian University of Technology(未来技术学院,大连理工大学)

AI总结 提出3DSAMba框架,结合3D SAM、适配器、多级聚合Mamba和融合状态空间模型,通过MRI图像分割病灶区域实现胎盘植入谱的自动诊断。

Comments Accepted by IEEE Transactions on Image Processing (TIP2026). More modifications may be performed

详情
AI中文摘要

胎盘植入谱(PAS)是一种罕见但高度危险的产科疾病。早期准确的PAS诊断对孕产妇健康至关重要。传统的PAS诊断依赖于经验丰富的医生分析剖宫产史和磁共振成像(MRI)数据。然而,地市级医院往往缺乏准确诊断PAS的专业知识和资源。为应对这些挑战,我们建立了首个基于MRI的PAS数据集,包含细粒度分割和分类标注。同时,通过从子宫MRI图像中分割病灶区域,可以显著增强PAS诊断。为了实现自动PAS诊断,我们提出了3DSAMba,一种新颖的特征学习框架,用于有效的病灶分割。具体来说,我们首先设计了3D分割一切模型(SAM),并通过高效的适配器机制将医学领域信息融入模型。此外,我们引入了多级聚合Mamba(MLAM)来聚合不同层次的特征图,以及融合状态空间模型(FSSM)来融合来自编码器和解码器的多尺度特征。最后,我们通过逐元素乘法将分割掩码应用于原始MRI图像,有效隔离病灶区域,以实现更准确的PAS诊断。大量实验验证了我们的框架显著提升了PAS诊断性能。为促进PAS诊断的进一步研究,我们在https://github.com/Drchip61/PASD上发布了数据集和源代码。

英文摘要

Placenta Accreta Spectrum (PAS) is a rare but highly dangerous obstetric disease. Early and accurate PAS diagnosis is critical for maternal health. Traditional PAS diagnosis relies on experienced doctors analyzing the cesarean history and Magnetic Resonance Imaging (MRI) data. However, district-level hospitals often lack the expertise and resources for accurate PAS diagnosis. To address these challenges, we establish the first MRI-based PAS dataset, which includes both fine-grained segmentation and classification annotations. Meanwhile, PAS diagnosis can be significantly enhanced by segmenting lesion areas from MRI images of the uterus. To achieve automatic PAS diagnosis, we propose 3DSAMba, a novel feature learning framework for effective lesion segmentation. More specifically, we first design a 3D Segment Anything Model (SAM) and incorporate medical domain information into the model through an efficient adapter mechanism. In addition, we introduce a Multi-Level Aggregation Mamba (MLAM) to aggregate feature maps across different levels and a Fusion State Space Model (FSSM) to fuse multi-scale features from both the encoder and decoder. Finally, we apply segmentation masks to the original MRI images through element-wise multiplication, effectively isolating lesion areas for more accurate PAS diagnosis. Extensive experiments validate that our framework significantly improves PAS diagnostic performance. To facilitate further research in PAS diagnosis, we have released the dataset and source code at https://github.com/Drchip61/PASD.

2606.00351 2026-06-03 cs.CV 版本更新

UniVerse: A Unified Modulation Framework for Segmentation-Free, Disentangled Multi-Concept Personalization

UniVerse:一种用于无分割、解耦多概念个性化的统一调制框架

Quynh Phung, Sandesh Ghimire, Minsi Hu, Chung-Chi Tsai, Jia-Bin Huang

发表机构 * University of Maryland, College Park(马里兰大学College Park分校) Qualcomm Technologies, Inc.(高通技术公司)

AI总结 提出UniVerse框架,通过扩散变换器中的统一调制实现无分割的多概念解耦与个性化,显著提升定位精度和视觉保真度。

Comments https://universe-personalization.github.io/

详情
AI中文摘要

个性化视觉理解已取得显著进展,但当输入图像包含多个对象时,现有方法难以定位和提取特定概念。许多先前方法严重依赖基于分割的监督或表现出较差的组合泛化能力,限制了它们准确解耦和操作单个概念的能力。在这项工作中,我们提出了UniVerse,一种用于扩散变换器中无分割、解耦多概念个性化的统一调制框架。我们的方法允许可组合和可分解的概念提取,无需显式分割掩码即可实现目标对象的细粒度定位和表示。UniVerse学习将复杂场景分解为特定于概念的表示,然后以统一的方式组合它们,从而在多样化的视觉上下文中实现鲁棒的个性化。通过在多个基准上的大量实验,我们证明UniVerse在定位精度和视觉保真度方面均显著优于最先进的基线。定性和定量结果表明,我们的方法可以在杂乱场景中精确提取目标概念,为更灵活、可解释和个性化的视觉生成与理解铺平道路。

英文摘要

Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.

2606.00321 2026-06-03 cs.CV 版本更新

Training-Free Object-Agnostic Jam Detection in Fulfillment Centers

无训练、对象无关的配送中心堵塞检测

Ruiliang Liu, Tina Dongxu Li, Joshua Migdal, Fernando Ruch, Kenneth Meszaros, Moses Trevor Dardik

发表机构 * Amazon, USA(亚马逊公司)

AI总结 提出一种无需训练和标注数据的对象无关堵塞检测方法,通过监控参考点持续遮挡来识别堵塞事件,在1069个视频上达到100%精度和93.33% F1分数。

Comments 4 pages, 6 figures. Accepted at the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026) as a presentation-only paper

详情
AI中文摘要

在配送中心,各种物体从入库到出库连续移动,可能因传送带摩擦过大、方向错误或机械故障而堵塞。传统的堵塞检测方法依赖目标检测模型识别物体,然后使用跟踪算法(如IoU重叠和卡尔曼滤波)监控运动。这种流程需要数千个手动标注,耗时约两周,且仅限于已标注的物体类别。我们提出一种无需训练、对象无关的堵塞检测方法,消除了对标注数据的需求。我们的方法在没有物体时在监控区域内均匀采样参考点。当物体遮挡这些点时,我们检测到运动。当足够多的点被遮挡超过时间阈值时,我们将事件分类为堵塞。与传统的点跟踪(将遮挡视为失败情况)不同,我们的方法将遮挡重新用作检测信号,监控参考点是否持续被遮挡,而不是跟踪它们移动到哪里。我们在1069个视频上的实验评估表明,AllTracker实现了100.00%的精度和93.33%的F1分数,显著优于经典的稀疏跟踪方法,同时保持无需训练的部署。该方法具有三个关键优势:(1)无需训练数据或手动标注,(2)对象无关地泛化到任意物体类型,(3)显著减少开发时间。

英文摘要

In fulfillment centers, diverse objects move continuously from inbound to outbound operations and can become jammed due to excessive conveyor friction, incorrect orientation, or mechanical failures. Traditional jam detection approaches rely on object detection models to identify objects, followed by tracking algorithms (such as IoU overlap and Kalman filtering) to monitor motion over time. This pipeline requires thousands of manual annotations, consuming approximately two weeks of effort, and is limited to annotated object classes. We present a training-free, object-agnostic jam detection method that eliminates the need for labeled data. Our approach uniformly samples reference points within the monitoring region when no objects are present. As objects occlude these points, we detect motion. When a sufficient fraction remains occluded beyond a temporal threshold, we classify the event as a jam. Unlike conventional point tracking, which treats occlusion as a failure case, our approach repurposes occlusion as a detection signal, monitoring whether reference points remain persistently occluded rather than tracking where they move. Our experimental evaluation on 1,069 videos demonstrates that AllTracker achieves 100.00% precision and 93.33% F1 score, significantly outperforming classical sparse tracking methods while maintaining training-free deployment. This approach offers three key advantages: (1) no training data or manual annotations, (2) object-agnostic generalization to arbitrary object types, and (3) significantly reduced development time.
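
The decision rule described above (flag a jam when a sufficient fraction of reference points stays occluded beyond a temporal threshold) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-frame occlusion flags are already available from some point tracker, it interprets persistence per point as a run of consecutive occluded frames, and the name `detect_jam`, its parameters, and the toy sequences are hypothetical.

```python
def detect_jam(occlusion_frames, min_frames, min_fraction):
    """Return the index of the first frame flagged as a jam, or None.

    occlusion_frames: list of frames; each frame is a list of booleans,
        one per reference point (True means the point is occluded).
    min_frames: consecutive occluded frames needed for a point to count
        as persistently occluded (assumed temporal threshold).
    min_fraction: share of points that must be persistently occluded.
    """
    streaks = None
    for idx, frame in enumerate(occlusion_frames):
        if streaks is None:
            streaks = [0] * len(frame)
        # Extend a point's streak while it stays occluded, else reset it.
        streaks = [s + 1 if occluded else 0 for s, occluded in zip(streaks, frame)]
        persistent = sum(s >= min_frames for s in streaks)
        if streaks and persistent / len(streaks) >= min_fraction:
            return idx
    return None


# Toy sequences (made up) over four reference points.
# A passing object occludes each point for only one frame.
passing = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, True],
]
# A stuck object keeps the same two points occluded in every frame.
stuck = [[True, True, False, False] for _ in range(5)]

passing_result = detect_jam(passing, min_frames=3, min_fraction=0.5)
stuck_result = detect_jam(stuck, min_frames=3, min_fraction=0.5)
```

In the toy data the passing object never triggers the rule, while the stuck object is flagged at the third frame (index 2), once half of the points have been occluded for three consecutive frames.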

2606.00188 2026-06-03 cs.GR cs.CV cs.LG 版本更新

PaintBench: Deterministic Evaluation of Precise Visual Editing

PaintBench: 精确视觉编辑的确定性评估

Kai Xu, Ellis Brown, Shrikar Madhu, Rob Fergus, He He, Saining Xie

发表机构 * New York University(纽约大学)

AI总结 提出PaintBench基准,通过程序化生成20种基本视觉编辑操作,实现确定性像素级评估,发现当前模型性能低(最高mIoU 17.1%),并揭示任务分解和场景变化的影响。

Comments Project Page: https://paintbench.github.io/

详情
AI中文摘要

虽然当前的多模态模型在开放式视觉编辑方面表现熟练,但执行精确的单答案编辑仍然是一个重要障碍。为了探究这一挑战,我们引入了PaintBench,一个动态可扩展的基准测试,针对四个类别的20种基本精确视觉编辑操作:几何变换、结构操作、颜色变化和符号推理。具有可配置复杂性的程序化生成实现了有效无限、抗污染的评估套件,而确定性像素级评估消除了对易偏见的评判模型的依赖。在11个图像编辑模型中,我们发现整体性能较低,当前表现最佳的行业领先者仅得17.1%(mIoU)。任务分解揭示了特别具有挑战性的操作类型(几何变换、大多数结构操作、基于公式的颜色变化)和模型特定的专长。细粒度的基准诊断进一步显示了由对象数量、背景复杂性、配色方案和编辑区域大小等场景变化引起的性能下降。为了测试PaintBench分数对应用任务性能的泛化能力,我们创建了一个用于数据可视化编辑的程序化确定性评估(TinyGrafixBench),并发现其与PaintBench分数之间存在强线性相关性($R^2 = 0.91$, $p < 0.001$)。总之,PaintBench为衡量和推动精确多模态视觉编辑的进展提供了严格的基础。

英文摘要

While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ($R^2 = 0.91$, $p < 0.001$). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
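
The reported $R^2 = 0.91$ summarises how well a straight line relates PaintBench scores to TinyGrafixBench scores. As a hedged illustration of that statistic only, the sketch below computes $R^2$ for a simple least-squares line, which equals the squared Pearson correlation. The score lists are made-up numbers, not values from the paper, and the p-value is not computed.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line, i.e. the squared Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-model scores on the two benchmarks (made up).
bench_scores = [5.0, 10.0, 15.0, 20.0]
applied_scores = [10.0, 30.0, 20.0, 40.0]
toy_r2 = r_squared(bench_scores, applied_scores)

# A perfectly linear relationship gives R^2 = 1.
perfect_r2 = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```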

2606.00096 2026-06-03 cs.CV cs.AI 版本更新

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

多样性优于频率:重新思考视觉思维链智能体中的工具使用

Dong-Hee Kim, Reuben Tan, Donghyun Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文研究视觉思维链智能体在复杂推理任务中的工具使用,发现工具使用崩溃现象,并提出熵正则化方法通过鼓励多样化探索提升推理性能。

Comments Presented in ICML 2026

详情
AI中文摘要

视觉智能体在视觉思维链中使用外部视觉工具来整合细粒度证据。虽然先前的工作主要研究这些工具在视觉搜索任务中的应用,但它们在更复杂的视觉推理中的作用仍未充分探索。在本文中,我们超越简单的视觉搜索任务,研究更具挑战性的任务,包括3D空间推理和医学视觉问答,其中智能体必须将工具获取的局部证据与全局上下文整合。我们识别出一种工具使用崩溃现象:模型逐渐停止使用工具,同时仍能获得更高的任务准确率。此外,我们观察到明显的不对称性:(i) 完全消除工具使用会降低性能,而(ii) 激励工具使用仅带来边际收益,尽管使用量大幅增加。我们发现,普通训练和工具使用鼓励都降低了展开多样性,这解释了为什么更高的工具使用不会带来更强的推理性能。受这些发现的启发,我们添加了一个熵正则化项来鼓励多样化的展开探索,尽管工具使用逐渐下降,但实现了最佳性能。总体而言,我们的发现表明了一种训练时工具作为支架的观点,其中对语言生成和视觉工具调用的更广泛探索改善了推理,尽管存在工具使用崩溃。项目页面:https://scaffolded-exploration.github.io

英文摘要

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: https://scaffolded-exploration.github.io
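The entropy regularization mentioned in the abstract can be pictured with a minimal sketch: an entropy bonus is subtracted from the task loss so that flatter (more exploratory) action distributions are rewarded. The names `entropy` and `regularized_loss`, the coefficient `beta`, and the per-step averaging are all assumptions for illustration; the abstract does not give the authors' exact formulation.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be a valid probability distribution")
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_loss(task_loss, step_probs, beta=0.01):
    """Subtract a mean-entropy bonus from the task loss.

    step_probs holds one action distribution per decision step of a rollout;
    beta is a hypothetical weighting coefficient.
    """
    if not step_probs:
        raise ValueError("need at least one decision step")
    mean_entropy = sum(entropy(p) for p in step_probs) / len(step_probs)
    return task_loss - beta * mean_entropy
```

Under this sketch, a rollout whose steps spread probability across text tokens and tool calls receives a lower loss than an otherwise identical rollout with peaked distributions, which is one way to encourage diverse exploration.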

2605.30915 2026-06-03 cs.CV 版本更新

DiTTo: Scalable Order-aware All-in-One Image Restoration Agent

DiTTo: 可扩展的排序感知全能图像修复智能体

Seungho Choi, Jihyong Oh

发表机构 * CMLab, Chung-Ang University(Chung-Ang 大学 CMLab)

AI总结 提出DiTTo框架,通过模拟器高效构建最优修复轨迹数据集,并采用排序感知对齐实现修复专家的即插即用扩展,在多退化图像修复中达到最优性能。

Comments Please visit our project page at https://cmlab-korea.github.io/DiTTo/

详情
AI中文摘要

真实世界的图像很少只遭受单一退化,且退化去除的顺序显著影响最终修复质量,这推动了基于智能体的图像修复(IR),其中视觉语言模型调度一组预构建的修复专家。然而,现有的基于训练的智能体每张图像需要 $\mathcal{O}((N^{\mathbf{D}})^{2})$ 次修复专家调用来构建最优修复动作轨迹数据集(ORTD),其中 $N^{\mathbf{D}}$ 表示全集 $\mathbf{D}$ 中的退化类型数量,并且将智能体训练与固定的修复专家池耦合,阻止了在没有完全重新训练的情况下扩展到新引入的修复专家。为了克服这些效率和可扩展性瓶颈,我们提出了 DiTTo,一种新颖的排序感知图像修复智能体框架,由 DiTTo 模拟器和 DiTTo 智能体组成。DiTTo 模拟器结合了用于单步修复动作模拟的 $\cup$S-IR 和用于每个动作质量预测的 AiO-IQA,将 ORTD 构建减少到每张图像 $\mathcal{O}(N^{\mathbf{D}})$ 次模拟器调用;DiTTo 智能体通过在模拟器生成的 ORTD 上进行 SFT 训练,随后进行排序感知修复对齐(ORA),该对齐沿着独立轴对齐退化识别、修复动作排序和输出格式。这实现了即插即用的可扩展性:添加一个新的修复专家只需要更新轻量级的 ORA 阶段。在最多包含五种并发退化的 MiO-100 评估集上,我们的 DiTTo 智能体在先前基于智能体的 IR 方法中实现了最先进的多退化修复质量。

英文摘要

Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal Restoration-action Trajectory Dataset (ORTD), where $N^{\mathbf{D}}$ denotes the number of degradation types in the universe $\mathbf{D}$, and couple agent training to a fixed restoration-expert pool, preventing extension to newly introduced restoration-experts without full retraining. To overcome these efficiency and extensibility bottlenecks, we propose DiTTo, a novel order-aware image restoration agent framework consisting of the DiTTo Simulator and the DiTTo Agent. The DiTTo Simulator combines $\cup$S-IR for single-step restoration-action simulation and AiO-IQA for per-action quality prediction, reducing ORTD construction to $\mathcal{O}(N^{\mathbf{D}})$ simulator calls per image; the DiTTo Agent is trained by SFT on the simulator-generated ORTD, followed by Order-aware Restoration Alignment (ORA) that aligns degradation identification, restoration-action-ordering, and output format along independent axes. This enables plug-and-play scalable extensibility: adding a new restoration-expert requires updating only the lightweight ORA stage. On the MiO-100 evaluation set with up to five concurrent degradations, our DiTTo Agent achieves state-of-the-art multi-degradation restoration quality among previous agent-based IR methods.

2605.28119 2026-06-03 cs.CV 版本更新

ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning

ST-ColoNet: 通过混合注意力与边缘引导特征学习的时空结肠段识别

Crystal Cai, Ziyi Wang, Zhengjie Zhang, Jingsheng Gao, Dahong Qian, Suncheng Xiang

AI总结 提出ST-ColoNet框架,结合Colorlaus模块(度量学习优化边缘空间特征)和Full-Temp模块(三种自注意力模式近似全自注意力),在自建数据集上实现结肠段识别准确率81.0%、F1分数70.7%。

Comments Some experiments need to be updated

详情
AI中文摘要

结肠镜视频中的结肠段识别是许多下游任务的关键需求,但现有自动识别方法仅使用结肠镜图像,未充分利用时间信息,导致性能不佳。此外,相关的公开视频数据集稀缺。为解决此问题,我们整理并发布了一个专门用于结肠段识别任务的标注数据集。此外,我们提出了一种基于两阶段深度学习的框架——时空网络结肠段识别(ST-ColoNet),用于从结肠镜视频中识别结肠段,该框架包括Colorlaus模块(使用度量学习优化边缘介导的空间特征提取)和Full-Temp模块(结合三种自注意力模式,以更好地近似长结肠镜序列上的全自注意力并优化时间特征聚合)。通过大量消融实验,我们证明该框架能够在结肠段识别任务上达到最先进的性能,准确率为81.0%,F1分数为70.7%,相比现有最先进方法有巨大提升。

英文摘要

Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, which is a tremendous improvement over state-of-the-art methods.

2605.27454 2026-06-03 eess.IV cs.CV 版本更新

NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification

NL-MambaXCT:用于Nomex蜂窝X射线CT缺陷分类的自监督嵌套学习Mamba

Ghaleb Aldoboni, Lobna Nassar, Fakhri Karray, Reem Alshamsi

发表机构 * Aurak Academy of Arts and Sciences(阿劳克艺术与科学学院) Machine Intelligence Institute(人工智能研究所) University of Waterloo(滑铁卢大学)

AI总结 提出NL-MambaXCT框架,结合自监督掩码图像建模和嵌套学习,实现Nomex蜂窝XCT缺陷的高效分类,在测试集上达到96.91%准确率。

详情
AI中文摘要

X射线计算机断层扫描(XCT)广泛应用于航空航天制造中Nomex蜂窝结构的无损检测,但工业检测仍严重依赖人工解读和基于有限标注数据训练的监督模型。本文提出NL-MambaXCT,一个基于Mamba的框架,结合自监督掩码图像建模和嵌套学习(NL)公式,用于从生产XCT切片中进行自动化、标签高效的缺陷分类。骨干网络是一个四阶段2D编码器,早期阶段使用RegNet卷积块,深层阶段使用基于Mamba的序列混合与注意力。该网络在19,961张未标注的工业XCT切片上通过掩码图像建模进行预训练,并在按生产顺序划分的2,000张重新标注的Nomex XCT切片上进行微调。NL通过双时间尺度参数动态实现:选定投影保持慢速指数移动平均轨迹与快速权重并行,而深度动量优化器引入额外的慢速参数更新轨迹。在保留测试集上,MIM预训练的NL-MambaXCT模型达到96.91%的准确率和96.8%的宏F1分数,在准确率上比CNN、注意力和单时间尺度Mamba基线高出3.11-10.31个百分点。结果表明,将掩码自监督与NL风格的快/慢学习动态相结合,是Nomex蜂窝XCT检测中鲁棒缺陷分类的一种有前景的策略。

英文摘要

X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11-10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
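The two-timescale idea of fast weights paired with slow exponential-moving-average (EMA) traces can be sketched as follows. This is a generic EMA pattern under stated assumptions, not the authors' implementation: the function names, the `decay` and `lr` values, and the plain gradient step are hypothetical, and the deep-momentum optimizer from the abstract is not modeled.

```python
def sgd_step(fast, grads, lr=0.1):
    """Plain gradient step on the fast weights (hypothetical optimizer)."""
    return [w - lr * g for w, g in zip(fast, grads)]


def ema_update(slow, fast, decay=0.99):
    """Move each slow trace a small step toward its fast counterpart."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * s + (1.0 - decay) * f for s, f in zip(slow, fast)]


def train(fast, grad_fn, steps, lr=0.1, decay=0.99):
    """Run fast updates while a slow EMA trace follows along."""
    slow = list(fast)  # the slow trace starts at the initial weights
    for _ in range(steps):
        fast = sgd_step(fast, grad_fn(fast), lr)
        slow = ema_update(slow, fast, decay)
    return fast, slow
```

With a `decay` close to 1 the slow trace changes far less per step than the fast weights, so it lags behind and smooths out short-term fluctuations.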

2605.26914 2026-06-03 cs.CV 版本更新

I2PRef: Image-Driven Point Completion with Iterative Refinement

I2PRef: 图像驱动的点云补全与迭代细化

Azhar Hussian, Marina Ritthaler, André Kaup, Vasileios Belagiannis

AI总结 提出一种以图像为主要几何来源的点云补全方法,通过图像到点(I2P)模块直接从单张RGB图像重建完整点云,并利用基于Transformer的点到点(P2P)细化模块迭代优化,在ShapeNet-ViPC上取得最先进性能,Chamfer距离相对提升12.3%。

Comments Accepted at European Signal Processing Conference (EUSIPCO 2026)

详情
AI中文摘要

我们提出了一种图像条件化的点云补全方法,将图像视为主要的几何来源而非次要的引导。为此,我们引入了一个图像到点(I2P)模块,该模块可以直接从单张RGB图像重建完整的点云,无需3D输入。此外,我们引入了一个基于Transformer的点到点(P2P)细化模块,该模块利用点令牌和图像特征之间的自注意力和交叉注意力,迭代地细化粗I2P输出。I2P模块使图像编码器能够学习丰富的几何表示,而P2P模块逐步恢复细粒度细节。与依赖辅助损失或融合模块的现有多模态方法不同,我们的显式I2P任务仅基于图像提供了强大的几何感知先验。在ShapeNet-ViPC上的大量实验表明,我们的方法取得了最先进的补全性能,Chamfer距离相对先前方法提升了12.3%。代码可在 https://github.com/AzharSindhi/I2PRef.git 获取。

英文摘要

We present an image-conditioned point cloud completion approach that treats images as the primary geometric source rather than a secondary guide. To this end, we introduce an Image-to-Point (I2P) module that can reconstruct complete point clouds directly from a single RGB image, with no need for 3D inputs. Additionally, we introduce a transformer-based Point-to-Point (P2P) refinement module that uses self- and cross-attention between point tokens and image features to iteratively refine the coarse I2P output. The I2P module enables the image encoder to learn rich geometric representations, while the P2P module progressively recovers fine-grained details. Unlike existing multimodal methods that rely on auxiliary losses or fusion modules, our explicit I2P task provides a strong, geometry-aware prior based on images alone. Extensive experiments on ShapeNet-ViPC demonstrate state-of-the-art completion performance with a 12.3% relative Chamfer Distance improvement over prior methods. Code is available at: https://github.com/AzharSindhi/I2PRef.git
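For readers unfamiliar with the reported metric, the sketch below shows one common form of the Chamfer distance between two point sets: the mean squared nearest-neighbor distance in each direction, summed. Conventions vary across papers (L1 versus squared L2, sums versus means, extra scaling), and the abstract does not say which variant is used, so the function name and this particular definition are assumptions.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbor terms."""
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Average, over src, of the squared distance to the closest dst point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)
```

This brute-force version costs a number of distance evaluations proportional to the product of the two set sizes, which is fine for a small check but not for large point clouds.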

2605.26774 2026-06-03 cs.CV 版本更新

Cesarean Scar Defect Segmentation in Transvaginal Ultrasound Images: a Dataset and Benchmark

经阴道超声图像中的剖宫产瘢痕缺损分割:数据集与基准

Yuan Tian, Yue Li, Wei Xia, Tianyu Xu, Jian Zhang, Liye Shi, Jing Liu, Yang Wang, Ming Liu, Qing Xu, Yixuan Zhang, Maggie M. He, Xiangjian He

发表机构 * Department of Obstetrics and Gynecology, International Peace Maternity and Child Health Hospital affiliated to Shanghai Jiao Tong University School of Medicine(上海交通大学医学院附属国际和平妇幼保健院妇产科) School of Computer Science, University of Nottingham Ningbo China(宁波诺丁汉大学计算机学院) School of Computer Science, University of Nottingham(诺丁汉大学计算机学院) Department of Computer Science and Engineering, University of California, San Diego(加州大学圣地亚哥分校计算机科学与工程系) Department of Ultrasound, International Peace Maternity and Child Health Hospital affiliated to Shanghai Jiao Tong University School of Medicine(上海交通大学医学院附属国际和平妇幼保健院超声科) School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电子信息与电气工程学院) Department of Cardiology, Gold Coast University Hospital(黄金海岸大学医院心内科)

AI总结 针对经阴道超声图像中剖宫产瘢痕缺损(CSD)分割缺乏公开数据集的问题,构建了包含1111张图像和16个视频的CSD数据集,提供像素级标注,并建立了基准以推动医学图像分割算法和临床创新。

详情
AI中文摘要

剖宫产瘢痕缺损(CSD)是剖宫产后最常见的并发症之一。经阴道超声检查广泛用于CSD的初步筛查。准确确定CSD的轮廓和尺寸对治疗至关重要。然而,由于CSD尺寸小、形态不规则、图像质量欠佳以及资源有限环境中临床意识不足,超声医师常常忽略CSD。尽管人工智能在医学影像领域取得了进展,但目前尚无公开的经阴道超声CSD分割数据集。为填补这一空白,我们提出了一个全面的CSD数据集,包含1111张图像和16个视频,共501个阳性样本,带有确证的CSD和精确的像素级手动标注。标注遵循标准化临床指南,由经验丰富的超声医师和受过培训的博士生合作完成。这项工作为推进医学图像分割算法和促进临床创新提供了高质量的基准资源。最终,改善CSD诊断及后续治疗策略可提高育龄女性的生活质量,对医学研究和临床实践均具有重要价值。

英文摘要

Cesarean Scar Defect (CSD) is one of the most prevalent complications following cesarean delivery. Transvaginal ultrasonography is widely used for primary CSD screening. Accurate determination of CSD outline and dimensions is crucial for treatment. However, CSDs are frequently overlooked by sonographers due to small size and irregular morphology, suboptimal image quality, and limited clinical awareness in resource-constrained settings. Despite artificial intelligence advances in medical imaging, no public dataset exists for transvaginal ultrasound CSD segmentation. To address this gap, we present a comprehensive CSD dataset comprising 1,111 images and 16 videos, yielding 501 positive samples with confirmed CSD and precise pixel-level manual annotations. Annotations are performed following standardized clinical guidelines through collaboration between experienced sonographers and trained PhD students. This work provides high-quality benchmark resources for advancing medical image segmentation algorithms and promoting clinical innovation. Ultimately, improved CSD diagnosis and subsequent treatment strategies can enhance the quality of life in women of reproductive age, representing significant value for both medical research and clinical practice.

2605.26006 2026-06-03 cs.CV cs.GR cs.RO 版本更新

MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control

MIND: 多尺度意图扩散用于文本驱动的基于物理的人形控制

Bin Li, Ruichi Zhang, Han Liang, Jingyan Zhang, Juze Zhang, Xin Chen, Jingya Wang

发表机构 * ShanghaiTech University(上海科技大学) University of Pennsylvania(宾夕法尼亚大学) Bytedance Seed(字节跳动 Seed) Stanford University(斯坦福大学) InstAdapt

AI总结 提出MIND框架,通过多尺度意图扩散机制将文本命令与低级动作语义对齐,实现基于物理的人形机器人行为生成。

详情
AI中文摘要

使基于物理的人形机器人能够根据高级文本命令执行多样化的行为仍然是一个重大挑战。现有方法通常遵循两阶段范式(结合运动学动作生成与基于物理的跟踪)或端到端模仿学习范式(直接从文本生成动作)。然而,前者受限于运动学生成与基于物理跟踪之间的固有域偏移,而后者则难以弥合文本命令与低级动作之间的巨大模态差距,限制了有效的语义对齐。值得注意的是,人形状态编码了丰富的运动动态,与低级动作相比,这些动态在语义上与文本描述更对齐,因此成为推导行为意图的自然基础。基于这一见解,我们提出了MIND,一种新颖的端到端扩散框架,用于文本驱动的基于物理的人形控制,该框架利用行为意图作为文本命令与低级动作之间的语义桥梁。其核心是,MIND引入了多尺度意图扩散机制,其中整体意图预测器捕获全局行为动态以指导整体行为合成,而即时意图预测器在每一步扩散中提供逐步的细粒度信号以进行局部行为细化。这种分层意图公式化为人形控制施加了结构化的归纳偏置,改善了语义对齐和行为自然性。此外,MIND将人形状态编码到潜在空间中,以实现更有效的语义意图建模。大量实验表明,MIND优于现有方法,并能从文本命令中合成连贯、物理合理且语义对齐的人形行为。我们的代码将发布以促进未来研究。

英文摘要

Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.

2605.29661 2026-06-03 cs.CV 版本更新

Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning

几何引导的基础特征建模实现可泛化的物体形状变形学习

Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China(香港中文大学计算机科学与工程系) Zhejiang Innovation Center for Humanoid Robotics, Ningbo, China(浙江省人形机器人创新中心) State Key Laboratory of Industrial Control and Technology, Zhejiang University, Hangzhou, China(浙江大学工业控制技术国家重点实验室)

AI总结 提出一种几何引导的特征建模机制和视图自适应特征聚合模块,通过变形类别级形状模板实现单目3D形状恢复,在形状变化和视角多样性上显著优于现有方法。

Comments 20 pages, 12 figures, accepted by ICML 2026

详情
AI中文摘要

单目3D形状恢复是几何理解的基础,但在任意视角和未见物体类别上实现鲁棒泛化仍然是一个重大挑战。本文提出一个可泛化的变形学习框架,通过显式变形类别级形状模板以匹配目标观测来重建3D物体。为了解决模板与目标之间的复杂形状变化,我们引入了几何引导的特征建模机制。该过程首先用模板拓扑丰富基础特征以生成几何感知表示,然后将其与目标观测显式关联以指导精确变形。此外,为了弥合固定模板与任意目标视图之间的差异,我们提出一个视图自适应特征聚合模块。该模块利用多视图模板特征及其对应的相机姿态来丰富规范模板表示,确保无论目标视角如何都能实现鲁棒的特征对齐。大量实验表明,我们的方法在处理大形状变化和多样化视角方面显著优于最先进的方法,展现出对新颖类别的强泛化能力,并有效支持下游真实世界的灵巧机器人操作任务。项目主页:https://GODeform.github.io/

英文摘要

Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: https://GODeform.github.io/

2603.18639 2026-06-03 cs.CV 版本更新

OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance

OrthoPhys:基于正交视角几何引导的物理合理视频生成

Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Zhibo Chen

发表机构 * the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,中国科学院自动化研究所) the School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Zhongguancun Academy(中关村学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学与技术学院) College of Automotive and Energy Engineering, Tongji University(同济大学汽车与能源工程学院) SKLCCSE, School of Computer Science and Engineering, Beihang University(SKLCCSE,北京航空航天大学计算机科学与工程学院)

AI总结 提出两阶段框架 OrthoPhys,通过正交视角几何引导生成物理一致的前景运动,再合成完整视频,显著提升物理真实感和时空一致性。

详情
AI中文摘要

近期视频生成的进展在视觉保真度上取得了显著提升,但确保物理一致的运动仍是一个基本挑战。直观上,这一限制可归因于现实世界中的物体运动在三维空间中展开,而视频观测仅提供此类动力学的部分、视角依赖的投影。为解决这些问题,我们提出 OrthoPhys,一个两阶段框架,利用正交视角几何引导来强制物理合理性。我们的第一阶段不直接生成非结构化的二维视频,而是生成前景动力学的同步四视角正交视频。通过在这些正交视角中引入几何增强的注意力机制,该阶段有效地强制了三维空间一致性,并隐式地将运动基于物理属性。在第二阶段,这些物理一致的正交前景作为刚性引导,合成最终的完整视频,无缝学习前景动力学与背景上下文之间的交互。为支持这种正交视角训练范式,我们构建了 PhysMV 数据集,包含 40K 个场景,每个场景由四个正交视角组成,总共 160K 个视频序列。大量实验表明,OrthoPhys 在物理真实感和时空一致性上显著优于现有视频生成方法。项目页面:https://anonymous.4open.science/w/Phys4D/。

英文摘要

Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose OrthoPhys, a two-stage framework that leverages orthogonal-view geometry guidance to enforce physical plausibility. Instead of directly generating unstructured 2D videos, our first stage generates synchronized, four-view orthogonal videos of the foreground dynamics. By incorporating a geometry-enhanced attention mechanism across these orthogonal views, this stage effectively enforces 3D spatial coherence and implicitly grounds the motion in physical attributes. In the second stage, these physically consistent orthogonal foregrounds serve as rigid guidance to synthesize the final complete video, seamlessly learning the interaction between foreground dynamics and the background context. To support this orthogonal-view training paradigm, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that OrthoPhys significantly improves physical realism and spatial-temporal coherence over existing video generation methods. Project page: https://anonymous.4open.science/w/Phys4D/.

2605.03358 2026-06-03 cs.CV 版本更新

Tracing Like a Clinician: Anatomy-Guided Spatial Priors for Cephalometric Landmark Detection

像临床医生一样追踪:解剖引导的空间先验用于头影测量标志点检测

Sidhartha Mohapatra, Pallavi Mohanty

发表机构 * Founder & CTO, CephTrace(CephTrace创始人及CTO) Clinical Advisor, CephTrace(CephTrace临床顾问)

AI总结 提出一种五阶段解剖引导管道,生成置信度加权的空间先验来训练HRNet-W32,在1502张X光片上实现25个标志点平均径向误差1.04 mm,并通过消融实验和临床验证证明其有效性。

Comments v3: 21 pages, 15 tables, 12 figures + supplementary materials (8 tables, 3 figures). v4: quantified Grad-CAM analysis (Table 13), corrected clinical measurements (Table 6: bias, MAE, ICC; vertical kappa 1.00->0.78), reviewer wording fixes. Code and weights: https://github.com/sidwiz/cephtrace-research, https://huggingface.co/CephTrace/cephtrace-v4

详情
AI中文摘要

临床医生按照结构化的解剖工作流程描记头影测量X光片,但此前尚无系统将这一流程编码到计算中。我们提出一个五阶段解剖引导流水线,生成置信度加权的空间先验来塑造HRNet-W32的训练,在来自7种以上成像设备的1,502张X光片上,25个标志点的平均径向误差达到1.04 mm。训练×推理先验矩阵分离出了其作用机制:尽管验证收敛情况相同,解剖先验使验证集与测试集之间的差距保持在1%,而不使用先验时该差距为88%(1.94 mm)。该矩阵表明:所有训练后的模型均不依赖推理时的先验;仅扩展架构本身并无益处;随机先验只带来部分且不稳定的改进(1.72 mm);只有图像特定且解剖正确的先验才能得到1.04 mm的结果,它起到训练时正则化器的作用,部署时无需自动生成先验。五折交叉验证(p=0.0015)、患者级置换检验(p<0.0001,n=151)、量化的Grad-CAM分析(区域内激活88%对74%,p<0.001)以及临床测量验证(骨骼分类kappa=0.79-0.84,无II类<->III类反转,ICC>0.95)提供了相互印证的证据。在超声心动图、颈椎和手部X光片上的跨领域实验支持如下假设:先验的有效性随标志点分布的空间熵而变化。

英文摘要

Clinicians trace cephalometric radiographs following a structured anatomical workflow, yet no prior system encodes this into computation. We present a five-phase anatomy-guided pipeline producing confidence-weighted spatial priors that shape HRNet-W32 training, achieving 1.04 mm mean radial error on 25 landmarks across 1,502 radiographs from 7+ imaging devices. A training x inference prior matrix isolates the mechanism: anatomical priors maintain a 1% validation-to-test gap versus 88% without priors (1.94 mm), despite identical validation convergence. The matrix establishes that all trained models are inference-independent, the expanded architecture alone provides no benefit, random priors yield partial but unstable improvement (1.72 mm), and only image-specific anatomically correct priors produce the 1.04 mm result -- functioning as a training-time regularizer requiring no automated prior generation at deployment. Five-fold cross-validation (p=0.0015), patient-level permutation testing (p<0.0001, n=151), quantified Grad-CAM analysis (88% vs. 74% in-zone activation, p<0.001), and clinical measurement validation (skeletal classification kappa=0.79-0.84, zero Class II<->III reversals, ICC>0.95) provide converging evidence. Cross-domain experiments on echocardiography, cervical spine, and hand radiography support the hypothesis that prior effectiveness scales with the spatial entropy of the landmark distribution.
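The headline metric, mean radial error (MRE), is the average Euclidean distance between predicted and reference landmark positions. A minimal sketch follows; the function name, the 2D tuple representation, and the `mm_per_pixel` scale factor are assumptions for illustration and do not come from the paper's code.

```python
import math


def mean_radial_error(pred, truth, mm_per_pixel=1.0):
    """Average Euclidean distance between predicted and reference landmarks.

    pred and truth are equally long lists of (x, y) pixel coordinates;
    mm_per_pixel is a hypothetical scale converting pixels to millimetres.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("need the same non-zero number of landmarks")
    total = sum(
        math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)
    )
    return mm_per_pixel * total / len(pred)
```

For a full dataset, the per-image values would typically be averaged again across images, though the exact aggregation used in the paper is not stated in the abstract.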

2605.24253 2026-06-03 cs.CV cs.AI cs.IR 版本更新

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval

CRISP -- 基于聚类的冗余减少实例采样用于病理病例表示与检索

Zahra Rahimi Afzal, Wataru Uegami, Saghir Alfasly, Wenchao Han, Saba Yasir, Judy C. Boughey, Matthew P. Goetz, Krishna R. Kalari, H. R. Tizhoosh

发表机构 * Kimia Lab, Department of Artificial Intelligence & Informatics, Mayo Clinic, Rochester, MN, USA(Kimia实验室,人工智能与信息学系,梅奥诊所,罗切斯特,明尼苏达州,美国) DICE Lab, Department of Electrical and Computer Engineering, University of Illinois Chicago, IL, USA(DICE实验室,电气与计算机工程系,伊利诺伊大学芝加哥分校,伊利诺伊州,美国) Division of Computational Pathology and Informatics, Mayo Clinic, Rochester, MN, USA(计算病理学与信息学部,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, USA(实验室医学与病理学系,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Breast and Melanoma Surgical Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA(乳腺和黑色素瘤外科肿瘤学系,综合癌症中心,梅奥诊所,罗切斯特,明尼苏达州,美国) Department of Oncology, Comprehensive Cancer Center, Mayo Clinic, Rochester, MN, USA(肿瘤学系,综合癌症中心,梅奥诊所,罗切斯特,明尼苏达州,美国)

AI总结 提出CRISP无监督框架,通过聚类和冗余减少采样整合病例内多张全切片图像,构建紧凑代表性补丁集用于病例级检索,在乳腺癌数据集上匹配或超越现有标准。

详情
AI中文摘要

数字病理档案中每个病例通常包含多张全切片图像(WSI),这些图像捕获空间上不同的肿瘤区域并反映内在的形态异质性。然而,现有方法大多依赖单一病理学家选择的切片,从而丢弃了分布在其余WSI中的潜在信息性证据。迄今为止,尚无自主框架用于全面的多WSI病例处理。在此,我们提出一个用于病例级分析的无监督框架,该框架整合病例内所有可用切片的信息。所提方法不依赖单一指定切片,而是通过选择性提炼跨WSI的信息性补丁来构建病例级表示。我们引入基于聚类的冗余减少实例采样用于病理学(CRISP),这是一个两阶段框架,首先减少单个WSI内的冗余,随后应用基于聚类的采样为整个病例选择紧凑但具有代表性的补丁集。所得补丁集捕获病例级异质性,同时避免对千兆像素图像的穷举处理,并直接作为检索索引。使用两个梅奥诊所乳腺癌数据集进行诊断和治疗规划,我们证明CRISP在患者/病例搜索和检索中一致匹配或超越当前结合模型和病理学家切片选择的标准实践。通过自动化病例级处理并消除主观WSI选择,CRISP可能能够利用当前被忽视的分布在多个WSI中的临床相关信息。

英文摘要

Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumor regions and reflecting intrinsic morphological heterogeneity. However, most existing approaches rely on a single pathologist-selected slide, thereby discarding potentially informative evidence distributed across the remaining WSIs. To date, no autonomous framework has been proposed for comprehensive multi-WSI case processing. Here, we present an unsupervised framework for case-level analysis that integrates information from all available slides within a case. Rather than relying on a single designated slide, the proposed approach constructs case-level representations by selectively distilling informative patches across WSIs. We introduce Clustering-Based Redundancy-Reduced Instance Sampling for Pathology (CRISP), a two-stage framework that first reduces redundancy within individual WSIs and subsequently applies clustering-based sampling to select a compact yet representative set of patches for the entire case. The resulting patch set captures case-level heterogeneity while avoiding exhaustive processing of gigapixel images, and directly serves as a retrieval index. Using two Mayo Clinic breast cancer datasets for diagnosis and treatment planning, we demonstrate that CRISP consistently matches or surpasses the current standard practice of combined model and pathologist slide selection for patient/case search and retrieval. By automating case-level processing and eliminating subjective WSI selection, CRISP potentially enables the exploitation of clinically relevant information distributed across multiple WSIs that is currently overlooked.

2605.23995 2026-06-03 cs.CV cs.AI 版本更新

Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines

任务对齐的自监督学习在医学图像分析中的应用:系统综述与实践设计指南

Chathura Wimalasiri, Kishor Nandakishor, Marimuthu Palaniswami

发表机构 * Department of Electrical and Electronic Engineering, University of Melbourne(墨尔本大学电子与电气工程系)

AI总结 本文系统综述了医学图像中自监督学习(SSL)的四种范式(对比、非对比与预测、生成与重建、混合),分析了前置任务与下游任务的对齐对性能的影响,并提出了实践设计指南。

Comments This manuscript is 31 pages with 4 tables and 3 figures

详情
AI中文摘要

自监督学习(SSL)已成为通过从无标签数据中学习表示来解决医学影像中标注瓶颈的有前景范式。然而,其有效性在很大程度上取决于前置任务的设计及其与下游临床目标的对齐。我们对医学影像中的SSL进行了系统的、任务导向的综述,考察了不同前置任务公式如何影响分类、分割、检测等任务的性能。遵循PRISMA指南,我们分析了2017年至2025年间发表的75项研究,并将其组织为四种范式:对比学习、非对比与预测学习、生成与重建学习、以及混合学习。我们不是按架构对方法进行分类,而是将每种范式映射到其最佳支持的下游目标。我们的分析表明,不存在普遍最优的SSL策略;相反,性能由前置任务、成像模态和目标任务之间的对齐决定。对比方法学习全局判别特征,与分类任务对齐良好,但可能忽略细微的病理模式。生成和空间预测方法更好地保留局部解剖结构,使其更适合分割和其他密集预测任务,而混合方法提供了最平衡的性能。我们进一步表明,模态特定设计至关重要,并且SSL在低标签和少样本场景中提供最大益处。最后,我们将这些发现提炼为实践设计指南,并概述了开放挑战,包括病理感知前置任务设计、高维数据的资源高效训练以及标准化评估协议。这项工作为在医学影像中设计更有效且临床相关的SSL框架提供了实用指导。

英文摘要

Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning representations from unlabeled data. However, its effectiveness depends heavily on the design of the pretext task and its alignment with the downstream clinical objectives. We present a systematic, task-oriented review of SSL in medical imaging, examining how different pretext-task formulations influence performance across classification, segmentation, detection, and other tasks. Following PRISMA guidelines, we analyze 75 studies published between 2017 and 2025 and organize them into four paradigms: contrastive, non-contrastive and predictive, generative and reconstruction-based, and hybrid learning. Rather than cataloguing methods by architecture, we map each paradigm to the downstream objectives it best supports. Our analysis shows there is no universally optimal SSL strategy; instead, performance is governed by the alignment between the pretext task, the imaging modality, and the target task. Contrastive methods learn global discriminative features and align well with classification, but may overlook subtle pathological patterns. Generative and spatial prediction-based approaches better preserve local anatomical structure, making them more suitable for segmentation and other dense prediction tasks, while hybrid methods offer the most balanced performance. We further show that modality-specific design is critical and that SSL provides its greatest benefit in low-label and few-shot regimes. Finally, we distill these findings into practical design guidelines and outline open challenges, including pathology-aware pretext task design, resource-efficient training for high-dimensional data, and standardized evaluation protocols. This work offers practical guidance for designing more effective and clinically relevant SSL frameworks in medical imaging.

2605.22018 2026-06-03 cs.CV cs.AI cs.RO 版本更新

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

FRED:面向洪水道路环境的多模态自动驾驶数据集

Connor Malone, Sebastien Demmel, Sebastien Glaser

发表机构 * Queensland University of Technology(昆士兰理工大学) ARC Training Centre for Automated Vehicles in Rural and Remote Regions (AVR3)(农村和偏远地区自动化车辆培训中心(AVR3))

AI总结 提出首个针对道路水险场景的多模态自动驾驶数据集FRED,包含相机、LiDAR和IMU数据,并提供语义标签以支持水险检测方法训练与评估。

详情
AI中文摘要

据我们所知,洪水道路环境数据集(FRED)是首个专门针对道路水险场景数据收集的多模态自动驾驶数据集。该数据集包含来自2.3 MP FLIR Blackfly USB3相机的图像、来自Ouster OS1-64 LiDAR的64线360度点云,以及由Geoflex RTK GNSS校正的iXblue ATLANS-C IMU数据,数据采集自五个不同地点,涵盖洪水期间和洪水之后。数据以两种格式发布:KITTI风格格式,便于与现有数据工具集成;以及RTMaps格式,用于直接回放车辆的数据捕获。我们提供语义标签,以支持用于水险检测的单传感器和传感器融合方法的训练与评估。提供位置和速度数据,以及干燥条件下捕获的数据,以支持可能包含地图的基于位置的检测方法开发,并评估其他任务,如定位和SLAM。

英文摘要

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360-degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations captured both during and after flooding events. The data has been released in two formats: a KITTI-style format for easy integration with existing data tools, and the RTMaps format for direct replay of the vehicle's data capture. We provide semantic labels to enable the training and evaluation of both single-sensor and sensor-fusion methods for water hazard detection. Position and velocity, as well as data captured under dry conditions, are provided to enable the development of location-based detection methods that may incorporate maps, and to evaluate other tasks such as localisation and SLAM.

2601.00990 2026-06-03 eess.IV cs.CV 版本更新

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review

不确定性校准的可解释人工智能用于胎儿超声平面分类:系统综述

Gustav Olaf Yunus Laitinen-Fredriksson Lundström-Imanov, Ozkan Gunalp

发表机构 * Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg(卢森堡大学生命科学与医学系,科学、技术与医学学院) Department of Biostatistics and Medical Informatics, Institute of Health Sciences, Ege University(爱琴海大学健康科学研究所生物统计学与医学信息学系)

AI总结 通过系统综述78项研究,提出CALIB-XFUS框架,强调校准、解释忠实性和公平性,以满足监管要求。

Comments 12 pages, 5 figures, 1 table, 75 references; systematic review (PRISMA 2020); manuscript prepared for submission to The Lancet Digital Health (Reviews section)

详情
AI中文摘要

胎儿超声是产前护理的基石,准确识别一小组标准解剖平面支撑着生物测量、生长监测和结构异常检测。深度学习分类器现在在精心策划的基准上达到或超过专家准确性,但大多数仍然不透明且校准不良,使临床医生缺乏安全决策支持所需的校准置信度或忠实解释。我们按照PRISMA 2020系统综述了2015年1月1日至2026年4月30日期间发表的78项研究,这些研究将自动胎儿平面分类与可解释性或预测不确定性量化相结合。六个标准平面的合并平衡准确率为0.93(95% CI 0.91至0.95),但只有19项研究(24%)报告了校准,14项(18%)报告了选择性预测。我们提出了CALIB-XFUS,一个22项报告框架,将校准、解释忠实性和公平性操作化,用于受监管的胎儿超声人工智能。该框架涵盖六个领域:临床任务和使用指征;数据集来源和代表性;模型和训练流程;校准和选择性预测;解释忠实性和临床医生验证;以及上市后监测。我们认为,根据FDA良好机器学习实践原则和欧盟AI法案高风险义务,不确定性校准、忠实解释和公平审计的胎儿超声人工智能现在在技术上可行且在监管上被期望。

英文摘要

Fetal ultrasound is the cornerstone of antenatal care, and accurate recognition of a small set of standard anatomical planes underpins biometry, growth surveillance, and detection of structural anomalies. Deep learning classifiers now match or exceed expert accuracy on curated benchmarks, but most remain opaque and miscalibrated, leaving clinicians without the calibrated confidence or faithful explanations needed for safe decision support. We systematically reviewed 78 studies published between January 1, 2015 and April 30, 2026 that paired automated fetal plane classification with explainability or predictive uncertainty quantification, following PRISMA 2020. Pooled balanced accuracy across six standard planes was 0.93 (95% CI 0.91 to 0.95), but only 19 studies (24%) reported calibration and 14 (18%) reported selective prediction. We propose CALIB-XFUS, a 22-item reporting framework that operationalises calibration, explanation faithfulness, and fairness for regulated fetal ultrasound artificial intelligence. The framework spans six domains: clinical task and indication for use; dataset provenance and representativeness; model and training pipeline; calibration and selective prediction; explanation faithfulness and clinician validation; and post-market surveillance. We argue that uncertainty-calibrated, faithfully explained, and fairness-audited fetal ultrasound AI is now both technically feasible and regulatorily expected under the FDA Good Machine Learning Practice principles and the EU AI Act high-risk obligations.
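The abstract stresses that few studies report calibration. As a minimal sketch of what such a report could contain, the following computes an equal-width-bin expected calibration error (ECE). The choice of ECE, the bin count, and the function name `expected_calibration_error` are assumptions made for illustration; the review does not state which calibration metric CALIB-XFUS requires.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct:     booleans, True where the prediction matched the label.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9] * 4, [True, True, True, False]))
```

A well-calibrated classifier yields a value near zero; a model that is right 75% of the time while claiming 90% confidence yields a gap of roughly 0.15.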

2605.20731 2026-06-03 cs.CV cs.AI stat.AP 版本更新

TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design

TASTE:一个由设计师标注的AI生成图形设计多维偏好数据集

Haonan Zhu, Elad Hirsch, Alexandria Minetti, Allison Nulty, Purvanshi Mehta

发表机构 * Lica World(Lica世界) Contra.Work Inc.(Contra.Work公司)

AI总结 针对现有偏好数据集仅提供单一整体评价的不足,本文构建了TASTE多维偏好数据集,由两组专业设计师对四个文本到图像模型的输出按九项标准排序,并提出了无准则信号验证框架和偏好模型基准测试。

详情
AI中文摘要

文本到图像模型现在能够以生产规模生成图形设计,但其监督仍然主要来自照片风格的偏好数据集,每次比较只有一个整体判断。设计师沿着几个不同的轴(例如,排版、布局、色彩和谐)评估设计,而单个偏好标签会将这些轴合并。我们发布了\emph{TASTE} \textit{(排版、美学、空间、色调等)},这是一个多维偏好数据集,其中两个不相交的五名专业设计师队列分别对来自四个当前文本到图像模型的输出按九项标准进行排序,并附带每张图像的幻觉标记。我们将该数据集与两个贡献配对。首先,一个基于Kendall的$τ$、多数投票概率和Condorcet循环的无准则信号验证框架,针对精确的iid均匀零假设;分析揭示了显著但中等程度的设计师一致性,每个TASTE标准都拒绝了随机评分者的零假设。其次,我们在TASTE上对偏好模型进行基准测试,发现现成的VLM评判器和专用的T2I评分器未能达到与设计师小组的多数一致,而直接在TASTE上训练的小型MLP头显著缩小了与单个评分者上限的差距,为未来基于TASTE训练的偏好模型设定了基线。

英文摘要

Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preference datasets with a single overall verdict per comparison. Designers evaluate designs along several distinct axes (e.g., typography, layout, color harmony) that a single preference label collapses. We release \emph{TASTE} \textit{(Typography, Aesthetics, Spatial, Tone, Etc.)}, a multi-dimensional preference dataset in which two disjoint cohorts of five professional designers each ranked outputs from four current text-to-image models across nine criteria along with per-image hallucination flags. We pair the dataset with two contributions. First, a criterion-agnostic signal-validation framework based on Kendall's $τ$, majority-vote probability, and Condorcet cycles against exact iid-uniform nulls; the analysis reveals significant but moderate designer agreement, with every TASTE criterion rejecting the random-rater null. Second, we benchmark preference models on TASTE and find that off-the-shelf VLM judges and dedicated T2I scorers fail to reach majority agreement with the designer panel, while a small MLP head trained directly on TASTE substantially narrows the gap to the single-rater ceiling, setting a baseline for future TASTE-trained preference models.
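The agreement analysis relies on Kendall's $τ$ between designers' rankings. As a minimal sketch (not the authors' code), assuming each designer gives a strict ranking with no ties over the four model outputs, $τ$ can be computed by counting concordant and discordant pairs. The function name `kendall_tau` and the example ranks are illustrative.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (no ties assumed).

    rank_a[i] and rank_b[i] are the ranks two raters gave to item i.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items to compare")

    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both raters order the pair the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    designer_1 = [1, 2, 3, 4]  # ranks given to four model outputs
    designer_2 = [1, 3, 2, 4]  # same items, one adjacent swap
    print(kendall_tau(designer_1, designer_2))
```

With four items there are six pairs, so a single adjacent swap leaves five concordant pairs and one discordant pair.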

2605.20306 2026-06-03 cs.CV cs.LG 版本更新

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

WildRoadBench: 面向视觉语言模型与自主智能体的野外航拍道路损伤定位基准

Bingnan Liu, Chenhang Cui, Rui Huang, Jiani Luo, Zhirong Shen, Tinghao Wang, Xiande Huang, Lingbei Meng, Fei Shen, An Zhang

发表机构 * University of Electronic Science and Technology of China(电子科技大学) National University of Singapore(新加坡国立大学) De Artificial Intelligence Lab(德人工智能实验室) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) University of Science and Technology of China(中国科学技术大学)

AI总结 提出WildRoadBench基准,通过VLM直接定位和LLM驱动智能体自主研究两种协议,评估模型在航拍道路损伤定位上的性能,发现现有方法在野外场景下仍不可靠。

Comments Preprint. Under review. 4 figures, 6 tables

详情
AI中文摘要

我们介绍了WildRoadBench,一个野外航拍道路损伤定位基准,它在一个专业标注的无人机语料库上,将视觉语言模型的直接视觉定位与LLM驱动的智能体的自主研究与工程相结合。在两种协议下评估相同的图像集和相同的每类AP_50指标。VLM轨道衡量固定VLM是否能在统一的提示、解码和解析流程下,从一张图像和一个简短提示中定位特定领域的损伤。智能体轨道衡量一个自主智能体,在仅给定书面任务简介、少量探索切片和固定交互预算的情况下,能否搜索公共网络、调整预训练组件、编写训练和推理代码,并通过隐藏保留集上的标量反馈预言机提交预测。我们对广泛的闭源前沿模型和开源VLM以及几个前沿LLM驱动的智能体进行了基准测试。在野外环境中,两种途径都远未达到可靠性能:闭源前沿模型在VLM排行榜上领先,但仍留下超过一半的指标未达到;开源定位器远低于它们,且新一代或推理型变体并未持续改进定位;每个开源模型的小目标均崩溃;尽管智能体拥有更丰富的功能,但仍落后于最强的VLM,且有几个未能在预算内提交有效结果。我们在https://anonymous.4open.science/r/wildroadbench-0607发布代码和数据,以支持可重复的后续研究。

英文摘要

We introduce WildRoadBench, a wild aerial road-damage grounding benchmark that couples direct visual grounding by vision-language models with autonomous research-and-engineering by LLM-driven agents on a single professionally annotated UAV corpus. The same image set and the same per-class AP_50 metric are evaluated under two protocols. The VLM Track measures whether a fixed VLM can localise domain-specific damage from one image and one short prompt under a unified prompting, decoding and parsing pipeline. The Agent Track measures whether an autonomous agent, given only a written task brief, a small exploratory slice and a fixed interaction budget, can search the public web, adapt pretrained components, write training and inference code, and submit predictions through a scalar-feedback oracle on a hidden holdout. We benchmark a broad pool of closed-source frontier models and open-source VLMs together with several frontier LLM-driven agents. Both routes remain far from reliable performance in this wild setting: closed-source frontier models lead the VLM leaderboard but still leave more than half of the metric on the table; open-source grounders plateau well below them, and newer generations or reasoning-style variants do not consistently improve grounding; small targets collapse for every open-source model; agents lag the strongest VLM despite richer affordances, and several fail to land a valid submission within the budget. We release the code and data at https://anonymous.4open.science/r/wildroadbench-0607 to support reproducible follow-up research.
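Both tracks are scored with per-class AP_50, which counts a predicted box as a match only when its intersection-over-union (IoU) with a ground-truth box reaches 0.5. The sketch below shows that matching criterion only, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form; the function names are hypothetical and this is not the benchmark's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box, threshold=0.5):
    """True when a prediction overlaps the ground truth enough to count under AP_50."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    shifted = (5, 0, 15, 10)  # half of the box overlaps the ground truth
    print(iou(gt, shifted), matches_at_50(gt, shifted))
```

A prediction shifted by half its width has an IoU of one third, so it would not count as a hit. This illustrates why small damage targets, where a few pixels of offset dominate the box area, are hard under this metric.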

2605.20183 2026-06-03 cs.CV 版本更新

MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

MSAVBench:面向多镜头音频-视频生成的全面可靠评估

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

发表机构 * Fudan University(复旦大学) The University of Hong Kong(香港大学) Tongyi Lab, Alibaba Group(阿里集团通义实验室) Zhejiang University(浙江大学) Peking University(北京大学)

AI总结 提出首个多镜头音频-视频生成基准MSAVBench,通过自适应混合评估框架在四个维度上系统评估19个模型,发现当前系统在导演级控制和细粒度音视频同步上仍存在挑战。

详情
AI中文摘要

视频生成正从单镜头合成快速演变为复杂的多镜头音频-视频(MSAV)叙事以满足现实需求。然而,评估此类前沿模型仍是一个基本挑战。现有基准在范围和数据多样性上有限,并依赖僵化的评估流程,阻碍了对现代MSAV模型的系统可靠评估。为弥补这些差距,我们引入MSAVBench,这是首个针对多镜头音频-视频生成的综合基准和自适应混合评估框架。我们的基准涵盖四个关键维度:视频、音频、镜头和参考,覆盖多样化的任务设置、多达15个镜头的可变数量以及具有挑战性的非真实场景。我们的评估框架通过镜头分割的自适应自校正机制、主观指标的实例化评分规则以及复杂判断的基于工具的证据提取,提高了鲁棒性。此外,MSAVBench与人类判断高度一致,达到91.5%的斯皮尔曼等级相关系数。我们对19个最先进的闭源和开源模型的系统评估表明,当前系统在导演级控制和细粒度音视频同步上仍存在困难,而模块化或代理式生成管道为缩小开源与闭源模型之间的差距提供了一条有希望的路径。基准数据和评估代码已在https://github.com/ali-vilab/MSAVBench公开。

英文摘要

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. The benchmark data and evaluation code are publicly available at https://github.com/ali-vilab/MSAVBench.
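The reported alignment with human judgments is a Spearman rank correlation. As a rough sketch of how such a figure is obtained, the code below ranks two score lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. The function names and the sample scores are illustrative assumptions, not part of MSAVBench.

```python
import math


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    metric_scores = [0.2, 0.5, 0.7, 0.9]  # hypothetical automatic scores
    human_scores = [1, 3, 2, 4]           # hypothetical human ratings
    print(spearman(metric_scores, human_scores))
```

Because only ranks matter, any monotone rescaling of the automatic scores leaves the correlation unchanged.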

2605.18740 2026-06-03 cs.CV cs.AI cs.CL cs.LG 版本更新

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Vision-OPD:通过在线策略自蒸馏让多模态大语言模型学会看清精细细节

Qianhao Yuan, Jie Lou, Xing Yu, Hongyu Lin, Le Sun, Xianpei Han, Yaojie Lu

发表机构 * Tsinghua University(清华大学)

AI总结 提出Vision-OPD框架,通过在线策略自蒸馏将模型自身的局部区域感知能力迁移到全局图像策略,提升多模态大语言模型对细粒度视觉理解的准确性。

Comments Project page: https://github.com/VisionOPD/Vision-OPD

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度视觉理解方面仍然存在困难,答案往往依赖于全图中微小但决定性的证据。我们观察到一种区域到全局的感知差距:当以证据为中心的裁剪图像为条件时,同一MLLM回答细粒度问题的准确率高于以对应全图为条件,这表明许多失败源于难以聚焦于相关证据,而非局部识别能力不足。受此观察启发,我们提出Vision-OPD(视觉在线策略蒸馏),一种区域到全局的自蒸馏框架,将模型自身特权的区域感知迁移到其全图策略。Vision-OPD从同一MLLM实例化两个条件策略:一个以裁剪图像为条件的教师和一个以全图为条件的学生。学生生成在线策略轨迹,Vision-OPD沿这些轨迹最小化教师和学生下一个词元分布之间的词元级差异。这使得模型能够内化视觉放大的好处,而无需外部教师模型、真实标签、奖励验证器或推理时工具使用。在多个细粒度视觉理解基准上的实验表明,Vision-OPD模型在性能上可与更大的开源、闭源以及“思考图像”智能体模型相媲美或更优。

英文摘要

Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but decisive evidence in the full image. We observe a regional-to-global perception gap: the same MLLM answers fine-grained questions more accurately when conditioned on evidence-centered crops than on the corresponding full images, suggesting that many failures stem from difficulty focusing on relevant evidence rather than insufficient local recognition ability. Motivated by this observation, we propose Vision-OPD (Vision On-Policy Distillation), a regional-to-global self-distillation framework that transfers the model's own privileged regional perception to its full-image policy. Vision-OPD instantiates two conditional policies from the same MLLM: a crop-conditioned teacher and a full-image-conditioned student. The student generates on-policy rollouts, and Vision-OPD minimizes token-level divergence between the teacher and student next-token distributions along these rollouts. This enables the model to internalize the benefit of visual zooming without external teacher models, ground-truth labels, reward verifiers, or inference-time tool use. Experiments on multiple fine-grained visual understanding benchmarks show that Vision-OPD models achieve competitive or superior performance against much larger open-source, closed-source, and "Thinking-with-Images" agentic models. The code is available at https://github.com/VisionOPD/Vision-OPD

2605.19320 2026-06-03 cs.CV cs.DB 版本更新

TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards

TextAlign: 基于层次化奖励的文本渲染偏好对齐

Mingxuan Cui, Jingpu Yang, Fengxian Ji, Qian Jiang, Zhecheng Shi, Jiaming Wang, Zirui Song, Fajri Koto, Xiuying Chen

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Chinese Academy of Sciences Institute of Automation(中国科学院自动化研究所) Northeastern University(东北大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出TextAlign框架,通过层次化视觉语言模型奖励将文本渲染错误分解为全局、单词和字形级别,并转化为标量偏好信号,利用GRPO或DPO进行后训练对齐,在不改变生成器架构下提升文本渲染准确性。

详情
AI中文摘要

忠实的文本渲染仍然是大型文本到图像生成模型的一个持续弱点,因为它需要语义指令遵循和细粒度的字形级结构。先前的方法通常通过特定于架构的模块或编码器修改来提高这种能力,这使跨基础模型的部署复杂化。我们将文本渲染作为后训练偏好对齐问题进行研究,并提出了TextAlign,一种非侵入式框架,保持生成器架构不变。关键组件是一个基于层次化视觉语言模型(VLM)的奖励,它将渲染错误分解为全局、单词和字形级别,然后将二元缺陷判断转换为标量偏好信号。得到的信号支持组相对策略优化(GRPO)和直接偏好优化(DPO)。在FLUX.1-dev和Z-Image-Turbo上的实验表明,基于OCR的文本准确性持续提升,且不降低一般生成质量。与强大的基础和文本渲染基线(包括SD3.5、Qwen-Image、AnyText和TextDiffuser)相比,这些结果表明奖励设计为改进文本渲染提供了一种可扩展的替代模型重新设计的方法。

英文摘要

Faithful text rendering remains a persistent weakness of large text-to-image generative models, as it requires both semantic instruction following and fine-grained glyph-level structure. Prior methods often improve this ability through architecture-specific modules or encoder modifications, which complicate deployment across foundation models. We study text rendering as a post-training preference-alignment problem and propose TextAlign, a non-invasive framework that keeps the generator architecture unchanged. The key component is a hierarchical vision-language model (VLM)-based reward that decomposes rendering errors into global, word, and glyph levels, then converts binary defect judgments into a scalar preference signal. The resulting signal supports both Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO). Experiments on FLUX.1-dev and Z-Image-Turbo show consistent gains in OCR-based text accuracy without degrading general generation quality. Compared with strong foundation and text-rendering baselines, including SD3.5, Qwen-Image, AnyText, and TextDiffuser, these results indicate that reward design offers a scalable alternative to model redesign for improving text rendering.

2605.18160 2026-06-03 cs.CV cs.AI 版本更新

Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models

Vision Inference Former:在多模态大语言模型中维持视觉一致性

Xinpeng Dong, Min Zhang, Kairong Han, Xu Tan, Fei Wu, Kun Kuang

发表机构 * Zhejiang University(浙江大学) East China Normal University(华东师范大学) Zhejiang University of Science and Technology(浙江科技大学)

AI总结 针对多模态大语言模型中视觉信息被弱化的问题,提出Vision Inference Former(VIF)轻量模块,在推理解码阶段持续注入视觉语义,提升生成内容与视觉的一致性。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)取得了显著进展,主要归功于整合视觉和文本信息的有效范式。主流的基于连接器的范式将视觉特征投影到文本序列中,从而在生成式架构内实现统一的多模态对齐和推理。然而,我们的实验揭示了两个关键限制:(1)尽管视觉信息是MLLMs中的核心证据模态,但它被与文本标记同等对待,削弱了视觉模态的独特贡献;(2)随着生成长度的增加,特别是在有限的上下文窗口内,模型对视觉信息的依赖逐渐减弱,导致视觉-语言对齐恶化,生成内容与视觉语义之间的一致性降低。为了解决这些挑战,我们提出了Vision Inference Former(VIF),一种轻量级架构模块,它在纯视觉表示和模型输出空间之间建立直接桥梁。具体而言,VIF在推理过程的解码阶段持续注入视觉语义,确保模型在生成过程中牢固地基于视觉内容。我们在涵盖通用推理、OCR、表格理解、以视觉为中心的评估和幻觉的14个基准任务上进行了实验。实验结果表明,VIF在不同架构上持续提升模型性能,同时引入最小的额外开销。本工作的代码可在https://github.com/Dong-Xinpeng/VIF获取。

英文摘要

In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms for integrating visual and textual information. The dominant connector-based paradigm projects visual features into the textual sequence, enabling unified multimodal alignment and reasoning within a generative architecture. However, our experiments reveal two key limitations: (1) Although visual information serves as the core evidential modality in MLLMs, it is treated on par with textual tokens, diminishing the unique contribution of the visual modality; (2) As generation length increases, particularly within a limited context window, the model's dependence on visual information progressively weakens, resulting in deteriorated vision-language alignment and reduced consistency between generated content and visual semantics. To address these challenges, we propose the Vision Inference Former (VIF), a lightweight architectural module that establishes a direct bridge between pure visual representations and the model's output space. Specifically, VIF continuously injects visual semantics throughout the decoding phase of the inference process, ensuring that the model remains firmly grounded in visual content during generation. We conduct experiments on 14 benchmark tasks covering general reasoning, OCR, table understanding, vision-centric evaluation, and hallucination. Experimental results show that VIF consistently improves model performance across diverse architectures while introducing minimal additional overhead. The code for this work is available at https://github.com/Dong-Xinpeng/VIF.

2605.16813 2026-06-03 cs.GR cs.CV 版本更新

QuadLink: Autoregressive Quad-Dominant Mesh Generation via Point-Relation Learning

QuadLink: 通过点关系学习的自回归四边形主导网格生成

Yiheng Zhang, Zhe Zhu, Tingrui Shen, Zhuojiang Cai, Tianxiao Li, Zixing Zhao, Qiujie Dong, Zhiyang Dou, Jiepeng Wang, Le Wan, Yuwang Wang, Wenping Wang, Yuan Liu, Cheng Lin

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tencent VISVISE(腾讯VISVISE) Peking University(北京大学) Technical University of Munich(慕尼黑工业大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学) Massachusetts Institute of Technology(麻省理工学院) Texas A&M University(德克萨斯农工大学) Macau University of Science and Technology(澳门科技大学)

AI总结 提出QuadLink框架,通过将点云链接成结构化面片,以自回归方式生成各向异性的四边形主导网格,实现高几何保真度和拓扑质量。

详情
AI中文摘要

生成可用于生产的四边形主导网格是现代3D内容创作的基石。从点云生成各向异性的四边形主导网格具有挑战性,因为现有方法通常局限于生成纯三角形网格或具有各向同性密度的纯四边形网格。在本文中,我们提出QuadLink,一个由三个阶段组成的统一框架,通过将点链接成结构化面片来生成四边形主导网格。QuadLink将多边形网格生成公式化为混合质心条件顶点链接模型:它首先预测一组统一的锚点(顶点和面质心),然后学习将顶点与面质心关联的质心条件链接,最后通过鲁棒的几何验证策略引导的四边形优先策略组装多边形面。这种基于链接的公式能够高效生成具有连贯边流的稀疏各向异性四边形主导网格,同时支持混合多边形拓扑。为了构建该模型的训练数据,我们进一步引入三角到四边形算子,通过全局合并选择将艺术三角形网格转换为四边形主导训练数据。大量实验表明,QuadLink从点云生成可用于生产的四边形主导网格,与先前基线相比,实现了更高的几何保真度和拓扑质量。我们的方法原生支持混合多边形拓扑,无需架构更改即可推广到任意n边形网格。

英文摘要

The generation of production-ready quad-dominant meshes is a cornerstone of modern 3D content creation. Generating anisotropic quad-dominant meshes from point clouds is challenging, as existing methods are typically limited to producing either pure triangular meshes or pure quadrilateral meshes with isotropic densities. In this paper, we present QuadLink, a unified framework consisting of three stages for quad-dominant mesh generation by linking points into structured faces. QuadLink formulates polygonal mesh generation as a hybrid centroid-conditioned vertex linking model: it first predicts a unified set of anchors (vertices and face centroids), then learns centroid-conditioned links that associate vertices with face centroids, and finally assembles polygonal faces with a quad-first strategy guided by robust geometric verification strategies. This link-based formulation enables efficient generation of sparse and anisotropic quad-dominant meshes with coherent edge flow while supporting hybrid polygonal topology. To construct training data for this model, we further introduce a Tri-to-Quad Operator that converts artistic triangle meshes into quad-dominant training data via global merge selection. Extensive experiments show that QuadLink produces production-ready quad-dominant meshes from point clouds and achieves improved geometric fidelity and topological quality compared to prior baselines. Our method natively supports hybrid polygonal topology, generalizing to arbitrary n-gon meshes without architectural changes.

2602.02994 2026-06-03 cs.CV 版本更新

Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Video-OPD:通过在线策略蒸馏实现多模态大语言模型在时序视频定位中的高效后训练

Jiaze Li, Hao Yin, Haoran Xu, Boshen Xu, Wenhui Tan, Zewen He, Jianzhong Ju, Zhenbo Luo, Jian Luan

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出Video-OPD框架,利用在线策略蒸馏和教师验证分歧聚焦课程,以高效后训练多模态大语言模型进行时序视频定位,克服稀疏奖励和高计算开销问题。

详情
AI中文摘要

强化学习因其在线策略优化而成为时序视频定位(TVG)后训练的一种有原则的范式,但现有的基于GRPO的方法仍然受到稀疏奖励信号和大量计算开销的根本限制。我们提出了Video-OPD,一个受近期在线策略蒸馏进展启发的TVG高效后训练框架。Video-OPD优化直接从当前策略采样的轨迹,从而保持训练和推理分布之间的一致性,同时前沿教师通过反向KL散度目标提供密集的令牌级监督。这种公式保留了缓解分布偏移至关重要的在线策略属性,同时将稀疏的回合级反馈转化为细粒度的逐步学习信号。基于Video-OPD,我们引入了教师验证分歧聚焦(TVDF),一种轻量级训练课程,迭代地优先考虑既教师可靠又对学生信息量最大的轨迹,从而提高训练效率。实验结果表明,Video-OPD在实现显著更快的收敛和更低计算成本的同时,始终优于GRPO,确立了在线策略蒸馏作为TVG传统强化学习的有效替代方案。

英文摘要

Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
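The dense supervision described above is a token-level reverse KL divergence evaluated on trajectories sampled from the student. The sketch below works over plain probability lists and assumes that "reverse" means KL(student || teacher), averaged over the tokens of one trajectory. The function names, the averaging, and the `eps` floor are illustrative choices, not details confirmed by the abstract.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single next-token distribution."""
    if len(student_probs) != len(teacher_probs):
        raise ValueError("distributions must share one vocabulary")
    total = 0.0
    for p_s, p_t in zip(student_probs, teacher_probs):
        if p_s > 0.0:  # terms with zero student mass contribute nothing
            total += p_s * (math.log(p_s) - math.log(max(p_t, eps)))
    return total


def trajectory_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled trajectory.

    Each argument is a list of next-token distributions, one per generated token,
    where the teacher is scored on the same student-generated prefix.
    """
    if len(student_steps) != len(teacher_steps) or not student_steps:
        raise ValueError("need matching, non-empty step lists")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    student = [[0.5, 0.5], [1.0, 0.0]]
    teacher = [[0.5, 0.5], [0.5, 0.5]]
    print(trajectory_loss(student, teacher))
```

Unlike a single reward at the end of an episode, every token position contributes its own term, which is the sense in which the signal is dense.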

2605.13258 2026-06-03 cs.CV cs.AI 版本更新

X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge

X-Restormer++:UG2+ CVPR 2026全天气恢复挑战赛第一名解决方案

Youwei Pan, Leilei Cao, Yingfang Zhu, Fengjie Zhu

发表机构 * TEX AI, Transsion Holdings(TEX AI,Transsion控股)

AI总结 提出基于X-Restormer的双阶段训练与双模型集成推理方法,结合梯度引导边缘感知损失,在全天气图像恢复挑战赛中取得第一名。

详情
AI中文摘要

在这项工作中,我们展示了在第八届UG2+挑战赛(CVPR 2026)赛道1:全天气条件下的图像恢复中的获胜解决方案。我们的方法基于X-Restormer基线,该基线通过其双注意力设计(多头深度卷积转置注意力和重叠交叉注意力)捕获通道级全局依赖和空间局部结构信息,并辅以Restormer-Plus的空间自适应输入缩放机制。我们采用两阶段训练策略与双模型集成推理。在第一阶段,模型B从零开始在从FoundIR训练集中随机采样的大规模多样化数据集(约4.84 TB中的800 GB)上进行训练,涵盖五种退化类型:模糊、雾霾、雨、雪以及复合条件(如雨和雾同时出现)。在第二阶段,模型A使用模型B的最终检查点作为预训练初始化,在WeatherStream数据集(雨和雪子集)上进行微调,从而以更小的数据集实现高效的域适应。为了更好地在训练过程中保留结构细节,我们提出了一种新颖的梯度引导边缘感知损失,该损失对真实图像应用Sobel算子以构建空间自适应权重图,为边缘和高频区域分配更高的监督。这与L1和多尺度SSIM损失一起纳入统一的训练目标中。在推理时,两个模型的预测通过加权平均融合:out = 0.4 × outA + 0.6 × outB,其中分配给模型B的更高权重反映了其从大规模预训练中获得的更强泛化能力。通过这些策略,我们提出的方法成功在挑战赛中排名第一。

英文摘要

In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Conditions. Our method is built upon the X-Restormer baseline, which captures both channel-wise global dependencies and spatially-local structural information through its dual-attention design (Multi-DConv Head Transposed Attention and Overlapping Cross-Attention), augmented with the spatially-adaptive input scaling mechanism from Restormer-Plus. We adopt a two-stage training strategy with dual-model ensemble inference. In the first stage, Model B is trained from scratch on a large-scale diverse dataset randomly sampled from the FoundIR training set (approximately 800 GB out of 4.84 TB), covering five degradation types: blur, haze, rain, snow, and composite conditions such as co-occurring rain and haze. In the second stage, Model A is fine-tuned on the WeatherStream dataset (rain and snow splits) using Model B's final checkpoint as pretrained initialization, enabling efficient domain adaptation with a substantially smaller dataset. To better preserve structural details during training, we propose a novel Gradient-Guided Edge-Aware (GGEA) Loss, which applies Sobel operators to the ground-truth image to construct a spatially adaptive weight map that assigns higher supervision to edge and high-frequency regions. This is incorporated alongside L1 and Multi-Scale SSIM losses in a unified training objective. At inference time, predictions from the two models are fused via a weighted average, out = 0.4 × outA + 0.6 × outB, where the higher weight assigned to Model B reflects its stronger generalization ability from large-scale pretraining. With these strategies, our proposed method successfully ranks 1st in the challenge.
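Two pieces of the pipeline are concrete enough to sketch: the weighted ensemble (weights 0.4 and 0.6, as stated in the abstract) and a Sobel-based edge weight map. The code below uses single-channel images stored as nested lists. The weight formula `1 + alpha * |∇| / max|∇|`, the replicate-border handling, and all function names are assumptions for illustration; the abstract does not give the exact form of the GGEA loss.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def fuse(out_a, out_b, w_a=0.4, w_b=0.6):
    """Pixel-wise weighted average of two model outputs (weights from the abstract)."""
    return [[w_a * a + w_b * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(out_a, out_b)]


def sobel_magnitude(img):
    """Gradient magnitude of a 2D image using 3x3 Sobel kernels."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Replicate border pixels by clamping coordinates (an assumption).
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    gx += SOBEL_X[dy + 1][dx + 1] * img[yy][xx]
                    gy += SOBEL_Y[dy + 1][dx + 1] * img[yy][xx]
            mag[y][x] = math.hypot(gx, gy)
    return mag


def edge_weight_map(gt, alpha=1.0):
    """Hypothetical weight map: 1 on flat regions, up to 1 + alpha on the strongest edge."""
    mag = sobel_magnitude(gt)
    peak = max(max(row) for row in mag)
    if peak == 0.0:
        return [[1.0] * len(row) for row in mag]
    return [[1.0 + alpha * m / peak for m in row] for row in mag]


def edge_weighted_l1(pred, gt, alpha=1.0):
    """Mean absolute error with each pixel scaled by the ground-truth edge weight."""
    weights = edge_weight_map(gt, alpha)
    total, count = 0.0, 0
    for row_p, row_g, row_w in zip(pred, gt, weights):
        for p, g, w in zip(row_p, row_g, row_w):
            total += w * abs(p - g)
            count += 1
    return total / count


if __name__ == "__main__":
    step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]  # vertical edge in the middle
    print(edge_weight_map(step)[0])
    print(fuse([[1.0]], [[2.0]]))
```

On the step image the two columns adjacent to the edge receive twice the weight of the flat columns, so errors there cost more during training.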

2605.09233 2026-06-03 cs.CV cs.AI 版本更新

Towards Robust Sequential Decomposition for Complex Image Editing

面向复杂图像编辑的鲁棒顺序分解

Zilai Zeng, Mingdeng Cao, Zijie Li, Xiaochen Lian, Yichun Shi, Peihao Zhu, Chen Sun, Peng Wang

发表机构 * Brown University(布朗大学) ByteDance Seed(字节跳动Seed) The University of Tokyo(东京大学)

AI总结 提出通过顺序分解将复杂编辑任务拆解为简单步骤,并利用合成数据训练模型,在统一上下文编辑框架下平衡分解优势与误差累积,实现鲁棒改进和从模拟到真实的泛化。

Comments CVPR 2026

详情
AI中文摘要

视觉生成模型的最新进展使得由人类指令引导的高保真图像编辑成为可能。然而,这些模型在处理涉及组合编辑操作或跨步骤依赖的复杂指令时常常遇到困难。这种困难源于两种典型范式的局限性:(1)单轮编辑,试图一次性应用所有指示的编辑,通常无法准确解析复杂指令并导致不期望的编辑;(2)顺序编辑可以将任务分解为更简单的步骤,但受到顺序执行引入的复合误差的影响,导致低保真结果。为了获得复杂图像编辑的鲁棒解决方案,我们在统一的上下文编辑框架下检查了不同范式的编辑行为,并研究了如何平衡顺序分解的优势与其误差累积的缺点。我们进一步开发了一个合成数据流水线,构建了不同指令复杂度的编辑任务,使我们能够整理一个具有高质量分解序列的大规模编辑数据集。通过在合成数据上进行微调,我们发现,通过适当设计的编辑范式,即使任务复杂度增加,顺序分解也能产生鲁棒的改进。此外,从合成任务中学到的分解技能可以通过与真实世界编辑数据共同训练迁移到真实图像,展示了模拟到真实泛化在更广泛领域中处理复杂图像编辑的前景。

英文摘要

Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models often struggle with complex instructions involving combinatorial editing operations or inter-step dependencies. This difficulty stems from the limitations of two canonical paradigms: (1) single-turn editing, which attempts to apply all instructed edits in one pass, often fails to parse the complex instruction accurately and causes undesired edits; and (2) sequential editing, which decomposes the task into simpler steps, suffers from compounding errors introduced by the sequential execution, leading to low-fidelity results. To derive a robust solution for complex image editing, we examine editing behaviors of different paradigms under a unified in-context editing framework, and study how the benefits of sequential decomposition can be balanced against its error-accumulation drawbacks. We further develop a synthetic data pipeline that constructs editing tasks of varying instruction complexity, allowing us to curate a large-scale editing dataset with high-quality decomposed sequences. By finetuning on synthetic data, we find that with properly designed editing paradigms, sequential decomposition yields robust improvements even as task complexity increases. Furthermore, the decomposition skills learned from synthetic tasks can transfer to real images by co-training with real-world editing data, demonstrating the promise of sim-to-real generalization for tackling complex image editing across broader domains.

2604.18572 2026-06-03 cs.CV cs.AI cs.LG 版本更新

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

回到柏拉图的洞穴:大规模检验跨模态表示收敛性

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

发表机构 * UC Berkeley(加州大学伯克利分校) Technical University Munich, MCML(慕尼黑工业大学,MCML) University of Tübingen, Tübingen AI Center(图宾根大学图宾根人工智能中心) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 本文通过大规模数据集实验,质疑了柏拉图表示假说中跨模态表示收敛的证据,发现对齐度随数据规模增大而显著下降,且仅反映粗粒度语义重叠。

Comments Project page: http://akoepke.github.io/cave_umwelten/

详情
AI中文摘要

柏拉图表示假说认为,在不同模态(例如文本和图像)上训练的神经网络会趋向于对齐并最终收敛到相同的现实表示。如果该假说成立,将对模态选择是否重要产生重大影响。我们表明,该假说的实验证据是脆弱的,且关键依赖于评估方式。对齐度通过小数据集(约1000个样本)上的互最近邻测量,当数据集扩展到数百万样本时,对齐度显著下降。在文本-音频和文本-视频对齐中也观察到相同行为。模型表示之间剩余的对齐反映的是粗粒度语义重叠,而非一致的细粒度结构。此外,Huh等人的评估是在一对一图像-标题设置中进行的,这种约束在现实的多对多设置中失效,进一步降低了测量的对齐度。我们还发现,更强的语言模型与视觉对齐度增加的趋势似乎不适用于较新的模型。总体而言,我们的发现表明,当前跨模态表示收敛的证据比后续工作所认为的要弱得多。在不同模态上训练的模型可能学习到同样丰富的世界表示,但并非相同的表示。

英文摘要

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets ($\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The same behavior is observed beyond text-image, for text-audio and text-video alignment. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces measured alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.
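The alignment score under discussion is based on mutual nearest neighbors: for each sample, compare its k nearest neighbors in one model's feature space with those in the other's and measure the overlap. The sketch below is one simple reading of that idea, assuming paired samples, squared Euclidean distance, and a brute-force search. The function names and the distance choice are illustrative; the evaluation in the paper may differ in similarity measure and normalization.

```python
def knn_sets(features, k):
    """For each row, the index set of its k nearest other rows (squared Euclidean)."""
    result = []
    for i, f in enumerate(features):
        dists = []
        for j, g in enumerate(features):
            if j != i:  # a sample is never its own neighbour
                dists.append((sum((a - b) ** 2 for a, b in zip(f, g)), j))
        dists.sort()  # ties are broken by the smaller index
        result.append({j for _, j in dists[:k]})
    return result


def mutual_knn_alignment(feats_a, feats_b, k):
    """Average fraction of shared k-nearest neighbours across two feature spaces.

    feats_a[i] and feats_b[i] must describe the same underlying sample,
    for example an image embedding and the embedding of its caption.
    """
    if len(feats_a) != len(feats_b):
        raise ValueError("both feature sets must cover the same samples")
    n = len(feats_a)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of samples")
    nn_a, nn_b = knn_sets(feats_a, k), knn_sets(feats_b, k)
    return sum(len(a & b) for a, b in zip(nn_a, nn_b)) / (n * k)


if __name__ == "__main__":
    image_feats = [[0.0], [1.0], [10.0], [11.0]]
    text_feats = [[0.0], [10.0], [1.0], [11.0]]  # neighbourhoods rearranged
    print(mutual_knn_alignment(image_feats, image_feats, k=1))
    print(mutual_knn_alignment(image_feats, text_feats, k=1))
```

With a fixed k, adding more samples gives every point many more candidate neighbors, so agreeing on the same few becomes harder. This is consistent with the abstract's observation that measured alignment degrades as the dataset grows.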

2604.15748 2026-06-03 cs.CV 版本更新

Concept-wise Attention for Fine-grained Concept Bottleneck Models

面向细粒度概念瓶颈模型的概念级注意力机制

Minghong Zhong, Guoshuai Zou, Kanghao Chen, Dexia Chen, Ruixuan Wang

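AI summary: Proposes concept-wise attention (CoAt-CBM), which uses learnable concept-wise visual queries and concept contrastive optimization to achieve adaptive fine-grained image-concept alignment, addressing pre-training bias and the neglect of mutual exclusivity among concepts.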

Comments Withdrawn by authors for revision and improvement

详情
AI中文摘要

最近,通过利用大型预训练视觉-语言模型(如CLIP)学习的图像-文本对齐,概念瓶颈模型(CBM)取得了令人印象深刻的性能。然而,概念建模存在两个关键限制。现有方法常受预训练偏差影响,表现为粒度错位或依赖结构先验。此外,使用二元交叉熵(BCE)损失进行微调将每个概念独立处理,忽略了概念间的互斥性,导致对齐次优。为解决这些限制,我们提出了面向细粒度概念瓶颈模型的概念级注意力机制(CoAt-CBM),一种实现自适应细粒度图像-概念对齐和高可解释性的新颖框架。具体地,CoAt-CBM采用可学习的概念级视觉查询,自适应地获取细粒度的概念级视觉嵌入,然后用于生成概念得分向量。接着,一种新颖的概念对比优化指导模型处理概念得分的相对重要性,使概念预测忠实反映图像内容并改善对齐。大量实验表明,CoAt-CBM持续优于最先进方法。代码将在接收后公开。

英文摘要

Concept Bottleneck Models (CBMs) have recently achieved impressive performance by exploiting the image-text alignment learned by large pre-trained vision-language models such as CLIP. However, two key limitations remain in concept modeling. Existing methods often suffer from pre-training biases, manifested as granularity misalignment or reliance on structural priors. Moreover, fine-tuning with Binary Cross-Entropy (BCE) loss treats each concept independently, which ignores mutual exclusivity among concepts and leads to suboptimal alignment. To address these limitations, we propose Concept-wise Attention for Fine-grained Concept Bottleneck Models (CoAt-CBM), a novel framework that achieves adaptive fine-grained image-concept alignment and high interpretability. Specifically, CoAt-CBM employs learnable concept-wise visual queries to adaptively obtain fine-grained concept-wise visual embeddings, which are then used to produce a concept score vector. A novel concept contrastive optimization then guides the model to account for the relative importance of the concept scores, so that concept predictions faithfully reflect the image content and alignment improves. Extensive experiments demonstrate that CoAt-CBM consistently outperforms state-of-the-art methods. The code will be made available upon acceptance.

2604.16808 2026-06-03 cs.CV 版本更新

BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling

BioLip: 通过生物力学约束违反建模实现语言泛化的唇同步深度伪造检测

Hao Chen, Junnan Xu

Affiliations: Independent Researcher

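AI summary: To address detectors that fail under generator or language shift, proposes a lightweight three-branch network based on the biomechanical constraints of lip motion that detects lip-sync forgeries from landmark coordinates alone; it is trained on English data only and evaluated zero-shot on five unseen generators and seven languages.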

Comments 13 pages, 5 figures. Keywords: Deepfake detection, lip-sync forgery, biomechanical constraints, landmark kinematics, cross-lingual generalization, video forensics, privacy-preserving inference, compression robustness

详情
AI中文摘要

现有的唇同步深度伪造检测器依赖于像素伪影或视听对应关系,两者在生成器或语言迁移下均会失效,因为它们学习的特征与训练分布绑定。我们采用不同的方法。真实的唇部运动受到组织力学和神经肌肉带宽的约束;当前的生成器通常不施加这些约束,产生的轨迹在速度、加速度和加加速度上具有升高的方差,而真实语音不会表现出这些特征。我们利用这一信号,称为时间唇部抖动,通过从64个口周地标在短滑动窗口上计算运动学统计量,并将其输入一个轻量级三分支网络。该模型仅使用地标坐标:无像素、无音频、无声纹数据。我们仅在英语数据上训练,并在零样本设置下对五个未见生成器和七种语言进行测试。

英文摘要

Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Authentic lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this signal, which we term temporal lip jitter, by computing kinematic statistics from 64 perioral landmarks over short sliding windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and test in a zero-shot setting on five unseen generators and seven languages.
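The kinematic statistics mentioned above can be sketched with finite differences over a landmark trajectory. This is only an illustration under stated assumptions: the function names, the window size and step, the frame rate, and the toy trajectories are hypothetical, and the paper's actual features and three-branch network are not reproduced here.

```python
from statistics import pvariance

def finite_diff(values, dt):
    """First-order finite difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

def kinematic_variances(track, fps):
    """Variance of velocity, acceleration, and jerk for one landmark coordinate."""
    dt = 1.0 / fps
    velocity = finite_diff(track, dt)
    acceleration = finite_diff(velocity, dt)
    jerk = finite_diff(acceleration, dt)
    return {
        "velocity": pvariance(velocity),
        "acceleration": pvariance(acceleration),
        "jerk": pvariance(jerk),
    }

def windowed_stats(track, fps, size, step):
    """Kinematic variances over short sliding windows (size must be >= 4 frames)."""
    return [kinematic_variances(track[s:s + size], fps)
            for s in range(0, len(track) - size + 1, step)]

# Hypothetical vertical position of one lip landmark over 8 frames.
smooth = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # constant velocity
jittery = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # frame-to-frame oscillation

smooth_stats = windowed_stats(smooth, fps=25, size=6, step=2)
jittery_stats = windowed_stats(jittery, fps=25, size=6, step=2)
```

In this toy case the smooth trajectory has zero variance in all three quantities while the oscillating one does not, which is the kind of separation the "temporal lip jitter" signal relies on.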

2604.07048 2026-06-03 cs.CV 版本更新

PRISM: Rethinking Atmospheric Scattering Reconstruction as a Unified Understanding and Restoration Model for Real-world Dehazing

PRISM: 重新思考大气散射重建作为真实世界去雾的统一理解与恢复模型

Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Hongqiu Wang, Longxiang Tang, Xiu Li, Sina Farsiu

Affiliations: Tsinghua University; Duke University; CUHK; HKUST; HKUST(GZ)

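AI summary: Proposes a physically structured framework based on Proximal Scattering Atmosphere Reconstruction (PSAR), combined with online non-uniform haze synthesis and a Selective Self-Distillation Adaptation (SSDA) scheme, to unify scattering understanding and restoration for real-world image dehazing.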

Comments 21 Pages, 8 Figures, 7 Tables

详情
AI中文摘要

真实世界图像去雾(RID)旨在去除真实场景中由雾引起的退化。由于非均匀雾分布、空间变化的颜色偏移以及配对真实雾-干净数据的稀缺,该任务仍然具有挑战性。在PRISM中,我们提出了近端散射大气重建(PSAR),这是一个物理结构化框架,在大气散射模型下联合重建清晰场景和散射变量,使恢复过程在复杂真实世界条件下更具可解释性。为了弥合合成到真实的差距,我们设计了一个在线非均匀雾合成流程和一个用于非配对真实世界场景的选择性自蒸馏适应(SSDA)方案,该方案使模型能够选择性地从高质量感知目标中学习,同时利用其内在的散射理解来审计残留雾并指导自我优化。在真实世界基准上的实验表明,PRISM在RID任务上取得了具有竞争力的性能。

英文摘要

Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying color shifts, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattering Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, making the restoration process more interpretable in complex real-world conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-Distillation Adaptation (SSDA) scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Experiments on real-world benchmarks demonstrate that PRISM achieves competitive performance on RID tasks.
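For reference, the atmospheric scattering model named above is usually written as follows in the dehazing literature. Whether PSAR uses exactly this parameterization (for example, a single global atmospheric light versus a spatially varying one) is not stated in the abstract, so the symbols below are the conventional ones and should be read as assumptions: I is the observed hazy image, J the clear scene, A the atmospheric light, t the transmission, β the scattering coefficient, and d the scene depth.

```latex
I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr), \qquad t(x) = e^{-\beta d(x)}
\quad\Longrightarrow\quad
J(x) = \frac{I(x) - A}{t(x)} + A
```

Jointly estimating J, t, and A, as the abstract describes, means the recovered variables can be recombined through the first equation and checked against the input, which is one reason such formulations are considered more interpretable.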

2604.05718 2026-06-03 cs.CV 版本更新

MPM: Mutual Pair Merging for Efficient Vision Transformers

MPM:用于高效视觉Transformer的互结对合并

Simon Ravé, Pejman Rasti, David Rousseau

Affiliations: LARIS, University of Angers; UMR INRAe-IRHS, Angers, France

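AI summary: Proposes Mutual Pair Merging (MPM), a training-free, parameter-free module that pairs mutual nearest neighbors in cosine space, averages each pair, and records a merge map for gather-based reconstruction before the decoder, yielding end-to-end speedups in semantic segmentation with a small accuracy loss.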

Comments Accepted to CVPR 2026 (Findings)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 2998-3008
AI中文摘要

减少序列长度是加速Transformer的常用方法,但先前的token缩减工作通常针对分类任务,报告的是代理指标而非端到端延迟。对于语义分割,token缩减进一步受到重建密集、像素对齐特征的需求限制,并且在现代加速器上,计算合并图的开销可能抵消预期收益。我们提出互结对合并(MPM),一种无需训练的token聚合模块,它在余弦空间中形成互最近邻对,对每对进行平均,并记录一个合并图,使得在解码器之前能够进行基于收集的重建,从而现有分割头可以保持不变。MPM不引入任何学习参数,也没有连续的压缩旋钮(无保留率或阈值)。速度-精度权衡由离散的插入调度设置。我们在NVIDIA H100 GPU(带和不带FlashAttention-2)和Raspberry Pi 5上,针对标准分割数据集基准测试了端到端延迟。在ADE20K上,MPM在Raspberry Pi 5上为ViT-Tiny减少了高达60%的每张图像延迟,在H100上使用FlashAttention-2时吞吐量提升高达20%,同时mIoU下降保持在3%以内。这些结果表明,当显式考虑开销时,简单、重建感知、无需训练的token合并可以转化为分割中实际的时钟时间增益。

英文摘要

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is further constrained by the need to reconstruct dense, pixel-aligned features, and on modern accelerators the overhead of computing merge maps can erase expected gains. We propose Mutual Pair Merging (MPM), a training-free token aggregation module that forms mutual nearest-neighbor pairs in cosine space, averages each pair, and records a merge map enabling a gather-based reconstruction before the decoder so that existing segmentation heads can be used unchanged. MPM introduces no learned parameters and no continuous compression knob (no keep-rate or threshold). The speed-accuracy trade-off is set by a discrete insertion schedule. We benchmark end-to-end latency on an NVIDIA H100 GPU (with and without FlashAttention-2) and a Raspberry Pi 5 across standard segmentation datasets. On ADE20K, MPM reduces per-image latency by up to 60% for ViT-Tiny on Raspberry Pi 5, and increases throughput by up to 20% on H100 with FlashAttention-2 while keeping the mIoU drop below 3%. These results suggest that simple, reconstruction-aware, training-free token merging can translate into practical wall-clock gains for segmentation when overhead is explicitly accounted for.
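The merging step described above can be sketched in a few lines. This is a minimal single-sequence illustration, assuming plain Python lists instead of batched tensors; the function names and the toy tokens are hypothetical, and the real module's matching and scheduling details may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mutual_pair_merge(tokens):
    """Average mutual nearest-neighbor pairs; return merged tokens and a merge map."""
    n = len(tokens)
    # Nearest neighbor of every token in cosine space.
    nearest = [max((j for j in range(n) if j != i),
                   key=lambda j: cosine(tokens[i], tokens[j]))
               for i in range(n)]
    merged, merge_map = [], [None] * n
    for i in range(n):
        j = nearest[i]
        if nearest[j] == i:
            if i < j:  # handle each mutual pair once, at its smaller index
                merge_map[i] = merge_map[j] = len(merged)
                merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[j])])
        else:          # not part of a mutual pair: keep the token unchanged
            merge_map[i] = len(merged)
            merged.append(list(tokens[i]))
    return merged, merge_map

def unmerge(merged, merge_map):
    """Gather-based reconstruction: copy each merged token back to its sources."""
    return [merged[m] for m in merge_map]

# Toy token sequence (hypothetical): tokens 0 and 1 are mutual nearest neighbors.
tokens = [[1.0, 0.0], [0.75, 0.25], [0.0, 1.0], [-1.0, 0.0]]
merged, merge_map = mutual_pair_merge(tokens)
restored = unmerge(merged, merge_map)
```

Since only mutual pairs are merged, there is no keep-rate or threshold to tune, which matches the abstract's point that the trade-off is set by where the module is inserted rather than by a continuous knob.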

2604.04439 2026-06-03 cs.LG cs.CV 版本更新

Estimating Central, Peripheral, and Temporal Visual Contributions to Human Decision Making in Atari Games

估计Atari游戏中中央、周边和时间视觉对人类决策的贡献

Henrik Krauss, Takehisa Yairi

Affiliations: Department of Advanced Interdisciplinary Studies, The University of Tokyo; Research Center for Advanced Science and Technology, The University of Tokyo

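AI summary: Using a controlled ablation framework on Atari gameplay data with synchronized eye tracking, finds that peripheral visual information contributes most to predicting human decisions, while gaze information and past-state information contribute less.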

详情
AI中文摘要

我们研究了不同视觉信息源在动态视觉环境中对人类决策的贡献。利用Atari-HEAD(一个带有同步眼动追踪的大规模Atari游戏数据集),我们引入了一个受控消融框架,作为逆向工程周边视觉信息、显式注视信息(以注视图形式)以及人类行为中过去状态信息贡献的手段。我们在六种设置下训练动作预测网络,这些设置选择性地包含或排除这些信息源。在20个游戏中,周边信息的贡献最为显著,移除后预测准确率的中位数下降范围为35.27-43.90%。注视信息导致的下降较小,为2.11-2.76%,而过去状态信息的下降范围较广,为1.52-15.51%,其中上限可能因减少了周边信息泄露而更具信息量。为了补充总体准确率,我们根据不同模型配置分配的真实动作概率对状态进行聚类。该分析识别出粗略的行为模式,包括焦点主导、周边主导以及更多情境决策情境。这些结果表明,Atari游戏中的人类决策强烈依赖于当前注视焦点之外的信息,而所提出的框架提供了一种从行为中估计此类信息源贡献的方法。

英文摘要

We study how different visual information sources contribute to human decision making in dynamic visual environments. Using Atari-HEAD, a large-scale Atari gameplay dataset with synchronized eye-tracking, we introduce a controlled ablation framework as a means to reverse-engineer the contribution of peripheral visual information, explicit gaze information in the form of gaze maps, and past-state information from human behavior. We train action-prediction networks under six settings that selectively include or exclude these information sources. Across 20 games, peripheral information shows by far the strongest contribution, with median prediction-accuracy drops in the range of 35.27-43.90% when removed. Gaze information yields smaller drops of 2.11-2.76%, while past-state information shows a broader range of 1.52-15.51%, with the upper end likely more informative due to reduced peripheral-information leakage. To complement aggregate accuracies, we cluster states by true-action probabilities assigned by the different model configurations. This analysis identifies coarse behavioral regimes, including focus-dominated, periphery-dominated, and more contextual decision situations. These results suggest that human decision making in Atari depends strongly on information beyond the current focus of gaze, while the proposed framework provides a way to estimate such information-source contributions from behavior.

2512.18954 2026-06-03 cs.CV 版本更新

VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion

VOIC:可见-遮挡联合引导的3D语义场景补全

Zaidao Han, Risa Higashita, Jiang Liu

Affiliations: Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology; Department of Computer Science and Engineering, Southern University of Science and Technology; School of Computer Science, University of Nottingham Ningbo China; Department of Electronic and Information Engineering, Changchun University

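AI summary: Proposes VOIC, which decouples visible-region perception from occluded-region reasoning using an offline visible-region label extraction strategy and a dual-decoder framework, achieving state-of-the-art 3D semantic scene completion on SemanticKITTI and SSCBench-KITTI360.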

详情
AI中文摘要

基于相机的3D语义场景补全(SSC)是自动驾驶和机器人场景理解的关键任务。它旨在从单张图像推断完整的3D体素表示,包括语义和几何信息。现有方法通常关注端到端的2D到3D特征提升和体素补全。然而,它们常常忽视由单图像输入引起的高置信度可见区域感知与低置信度遮挡区域推理之间的干扰,这可能导致特征稀释和错误传播。为了解决这些挑战,我们引入了一种离线可见区域标签提取(VRLE)策略,该策略从密集的3D地面真值中显式分离并提取可见区域的体素级监督。该策略为两个互补的子任务(可见区域感知和遮挡区域推理)净化了监督空间。基于这一思想,我们提出了可见-遮挡交互补全网络(VOIC),一种新颖的双解码器框架,将SSC显式解耦为可见区域语义感知和遮挡区域场景补全。VOIC首先通过融合图像特征与深度导出的占据信息构建基础3D体素表示。可见解码器专注于生成高保真的几何和语义先验,而遮挡解码器则利用这些先验以及跨模态交互进行连贯的全局场景推理。在SemanticKITTI和SSCBench-KITTI360基准上的大量实验表明,VOIC在几何补全和语义分割精度上均优于现有的单目SSC方法,实现了最先进的性能。

英文摘要

Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

2603.01576 2026-06-03 cs.CV 版本更新

Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications

Cryo-Bench:面向冰冻圈应用的基础模型基准测试

Saurabh Kaushik, Lalit Maurya, Beth Tellman, Valerio Marsocci

Affiliations: Center for Sustainability and the Global Environment (SAGE), University of Wisconsin–Madison; Portsmouth AI and Data Science Centre (PAIDS), School of Computing, University of Portsmouth; ESA, ESRIN, φ-lab, Frascati

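AI summary: Introduces the Cryo-Bench benchmark, which evaluates 14 geo-foundation models on key cryosphere components (debris-covered glaciers, glacial lakes, sea ice, and calving fronts); with frozen encoders UNet achieves the highest average mIoU (66.38), while full fine-tuning with a tuned learning rate yields an average relative improvement of 12.77% on two representative datasets.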

详情
AI中文摘要

地理基础模型(GFMs)已在涵盖多个领域的地球观测任务中得到评估，并展现出即使在标签稀疏的情况下也能生成可靠地图的强大潜力。然而，针对冰冻圈应用的GFMs基准测试仍然有限，主要原因是缺乏合适的评估数据集。为填补这一空白，我们引入了Cryo-Bench，这是一个用于评估GFMs在关键冰冻圈组件上性能的基准。Cryo-Bench包括碎屑覆盖冰川、冰湖、海冰和崩解前沿，涉及多种传感器和广泛的地理区域。我们评估了14个GFMs以及UNet和ViT基线，以分析它们的优势、局限性和最佳使用策略。在冻结编码器的情况下，UNet在Cryo-Bench包含的五个评估数据集上取得了最高的平均mIoU 66.38，其次是TerraMind的64.02。在少样本设置(10%输入数据)下，DOFA和TerraMind等GFMs优于UNet，分别达到59.53、56.62和56.60的mIoU分数，而UNet为56.60。当完全微调GFMs时，我们观察到不同数据集和模型之间的性能不一致。然而，调整学习率并配合微调显著提升了GFM性能。例如，在两个代表性数据集(GLID和CaFFe)上的评估显示平均相对提升为12.77%。尽管预训练数据中冰冻圈表示极少，GFMs仍展现出显著的领域适应能力，并在各项任务中产生有意义的结果。基于我们的发现，我们建议通过超参数优化进行编码器微调以获得最佳性能，而在用户需要快速结果且无需大量实验时使用冻结编码器。(GitHub: https://github.com/Sk-2103/Cryo-Bench)

英文摘要

Geo-Foundation Models (GFMs) have been evaluated on diverse Earth observation tasks spanning multiple domains and have demonstrated strong potential to produce reliable maps even with sparse labels. However, benchmarking of GFMs for Cryosphere applications has remained limited, primarily due to the lack of suitable evaluation datasets. To address this gap, we introduce Cryo-Bench, a benchmark compiled to evaluate GFM performance across key Cryospheric components. Cryo-Bench includes debris-covered glaciers, glacial lakes, sea ice, and calving fronts, spanning multiple sensors and broad geographic regions. We evaluate 14 GFMs alongside UNet and ViT baselines to assess their advantages, limitations, and optimal usage strategies. With a frozen encoder, UNet achieves the highest average mIoU of 66.38, followed by TerraMind at 64.02, across the five evaluation datasets included in Cryo-Bench. In the few-shot setting (10% of the input data), GFMs such as DOFA and TerraMind outperform UNet, achieving mIoU scores of 59.53, 56.62, and 56.60, respectively, compared to UNet's 56.60. When fully fine-tuning GFMs, we observe inconsistent performance across datasets and models. However, tuning the learning rate along with fine-tuning substantially improves GFM performance. For example, evaluation on two representative datasets (GLID and CaFFe) shows an average relative improvement of 12.77%. Despite having minimal Cryosphere representation in their pretraining data, GFMs exhibit notable domain adaptation capabilities and produce meaningful results across tasks. Based on our findings, we recommend encoder fine-tuning with hyperparameter optimization to achieve the best possible performance, and frozen encoders when users need quick results without extensive experimentation. Code: https://github.com/Sk-2103/Cryo-Bench
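Since the comparison above is reported in mIoU, a short sketch of how that metric is commonly computed for segmentation may help. It assumes flattened integer label maps and skips classes absent from both prediction and target; the function name and the toy labels are hypothetical, and Cryo-Bench's exact averaging protocol is not specified in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flattened label maps."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # ignore classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy 6-pixel example (hypothetical): 0 = background, 1 = glacier.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

Each class here has 2 pixels of intersection and 4 of union, so both per-class IoUs and their mean are 0.5; benchmark tables usually report the same quantity scaled to 0-100.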

2511.19945 2026-06-03 cs.CV 版本更新

Low-Resolution Editing is All You Need for High-Resolution Editing

低分辨率编辑足以实现高分辨率编辑

Junsung Lee, Hyunsoo Lee, Yong Jae Lee, Bohyung Han

Affiliations: ECE & IPAI, Seoul National University; University of Wisconsin-Madison

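AI summary: Proposes a test-time optimization framework for high-resolution image editing that combines patch-wise optimization, fine-grained detail transfer, and a synchronization strategy for cross-patch consistency.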

Comments CVPR 2026. Project website: https://hleephilip.github.io/ScaleEdit

详情
AI中文摘要

高分辨率内容创作正迅速成为视觉和图形社区的核心挑战。图像是视觉表达最基本的形式,符合用户意图的内容生成需要有效、可控的高分辨率图像编辑机制。然而,现有方法仍局限于低分辨率设置,通常仅支持最高1K分辨率。本文提出高分辨率图像编辑任务,并引入一种测试时优化框架来解决该问题。我们的方法对高分辨率源图像进行分块优化,随后采用细粒度细节迁移模块和一种新颖的同步策略来保持块间一致性。大量实验表明,我们的方法能够产生高质量编辑结果,促进高分辨率内容创作。

英文摘要

High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. Images serve as the most fundamental modality for visual expression, and content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, facilitating high-resolution content creation.

2603.26738 2026-06-03 cs.CV cs.AI cs.CL 版本更新

SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model

SleepVLM:基于视觉语言模型的可解释且规则驱动的睡眠分期

Guifeng Deng, Pan Wang, Mengfan Niu, Jiquan Wang, Shuying Rao, Junyi Xie, Xi'ang Chen, Sha Zhao, Gang Pan, Wanjun Guo, Tao Li, Haiteng Jiang

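AI summary: Proposes SleepVLM, a rule-grounded vision-language model that stages sleep from multi-channel PSG waveform images and generates clinician-readable rationales based on AASM scoring criteria, improving interpretability while matching state-of-the-art accuracy.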

Comments Under review

详情
AI中文摘要

尽管自动睡眠分期已达到专家级准确率,但其临床采用因缺乏可审计的推理而受阻。我们提出了SleepVLM,一种基于规则驱动的视觉语言模型(VLM),它通过多通道多导睡眠图(PSG)波形图像进行睡眠分期,并基于美国睡眠医学学会(AASM)评分标准生成临床可读的理由。利用波形感知预训练和规则驱动的监督微调,SleepVLM在保留测试集(MASS-SS1)上实现了0.767的Cohen's kappa,在外部队列(ZUAMHCS)上实现了0.743,达到了最先进的性能。两位经过训练的睡眠技术专家的独立评估进一步验证了模型的推理质量,在两个数据集上,事实准确性、证据全面性和逻辑连贯性的平均得分在3.75-3.96之间(满分5分)。通过将竞争性性能与透明、基于规则的解释相结合,SleepVLM可以提高临床工作流程中自动睡眠分期的可信度和可审计性。为了促进可解释睡眠医学的进一步研究,我们发布了MASS-EX,一个新颖的专家注释数据集。

英文摘要

While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We introduce SleepVLM, a rule-grounded vision-language model (VLM) that stages sleep from multi-channel polysomnography (PSG) waveform images and generates clinician-readable rationales based on American Academy of Sleep Medicine (AASM) scoring criteria. Utilizing waveform-perceptual pre-training and rule-grounded supervised fine-tuning, SleepVLM achieved Cohen's kappa of 0.767 on a held-out test set (MASS-SS1) and 0.743 on an external cohort (ZUAMHCS), matching state-of-the-art performance. Independent expert evaluation by two trained sleep technologists further validated the model's reasoning quality, with mean scores of 3.75-3.96 out of 5 across factual accuracy, evidence comprehensiveness, and logical coherence on both datasets. By coupling competitive performance with transparent, rule-based explanations, SleepVLM may improve the trustworthiness and auditability of automated sleep staging in clinical workflows. To facilitate further research in interpretable sleep medicine, we release MASS-EX, a novel expert-annotated dataset.
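The agreement figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below shows the computation on made-up epoch labels; the function name and the toy hypnograms are assumptions, not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(labels_a)
    # Observed agreement: fraction of epochs with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-epoch stages from an expert scorer and from a model.
expert = ["W", "N1", "N2", "N2", "N3", "REM", "N2", "W"]
model = ["W", "N2", "N2", "N2", "N3", "REM", "N1", "W"]
kappa = cohens_kappa(expert, model)
```

Here 6 of 8 epochs agree (p_o = 0.75) and chance agreement is 0.25, giving kappa of about 0.667; the reported 0.767 and 0.743 indicate somewhat stronger agreement than this toy case.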

2603.27455 2026-06-03 cs.CV 版本更新

From None to All: Self-Supervised 3D Reconstruction via Novel View Synthesis

从无到有:通过新视角合成的自监督3D重建

Ranran Huang, Weixun Luo, Ye Mao, Krystian Mikolajczyk

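AI summary: Proposes NAS3R, a self-supervised framework that jointly estimates 3D geometry and camera parameters from unannotated images, trained via novel view synthesis without ground-truth annotations or pretrained priors.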

详情
AI中文摘要

本文提出NAS3R,一个自监督前馈框架,无需真实标注和预训练先验,联合学习显式3D几何和相机参数。训练时,NAS3R从无标定和无位姿的上下文视图重建3D高斯,并使用自预测的相机参数渲染目标视图,从而通过2D光度监督实现自监督训练。为确保稳定收敛,NAS3R在共享的Transformer骨干中集成重建和相机预测,并由掩码注意力调控,同时采用基于深度的高斯公式以促进良态优化。该框架与最先进的监督3D重建架构兼容,并可在可用时融入预训练先验或内参信息。大量实验表明,NAS3R优于其他自监督方法,为从无约束数据中进行3D重建建立了一个可扩展且几何感知的范式。代码和模型已在https://ranrhuang.github.io/nas3r/公开。

英文摘要

In this paper, we introduce NAS3R, a self-supervised feed-forward framework that jointly learns explicit 3D geometry and camera parameters with no ground-truth annotations and no pretrained priors. During training, NAS3R reconstructs 3D Gaussians from uncalibrated and unposed context views and renders target views using its self-predicted camera parameters, enabling self-supervised training from 2D photometric supervision. To ensure stable convergence, NAS3R integrates reconstruction and camera prediction within a shared transformer backbone regulated by masked attention, and adopts a depth-based Gaussian formulation that facilitates well-conditioned optimization. The framework is compatible with state-of-the-art supervised 3D reconstruction architectures and can incorporate pretrained priors or intrinsic information when available. Extensive experiments show that NAS3R achieves superior results to other self-supervised methods, establishing a scalable and geometry-aware paradigm for 3D reconstruction from unconstrained data. Code and models are publicly available at https://ranrhuang.github.io/nas3r/.

2603.00667 2026-06-03 cs.CV 版本更新

Act Like a Pathologist: Tissue-Aware Whole Slide Image Reasoning

像病理学家一样:组织感知的全切片图像推理

Wentao Huang, Weimin Lyu, Peiliang Lou, Qingqiao Hu, Xiaoling Hu, Shahira Abousamra, Wenchao Han, Ruifeng Guo, Jiawei Zhou, Chao Chen, Chen Wang

Affiliations: Stony Brook University; Mayo Clinic; Harvard Medical School; Stanford University

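AI summary: Proposes HistoSelect, a question-guided, tissue-aware, coarse-to-fine retrieval framework that identifies relevant tissue regions and selects the most informative patches, cutting visual token usage by 70% on average while improving pathology question-answering accuracy.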

Comments 14 pages, 8 figures. Accepted by CVPR'26

详情
AI中文摘要

近年来,计算病理学在领域特定图像编码器以及使用视觉-语言模型回答疾病自然语言问题的兴趣推动下迅速发展。然而,病理问答背后的核心问题仍未解决,因为一张千兆像素的切片包含的信息远多于给定问题所需。病理学家通过广泛扫描并根据临床问题选择性放大,自然地处理组织和形态复杂性。相比之下,当前模型依赖于均匀补丁采样或宽注意力图,常常平等关注不相关区域而忽略关键视觉证据。在这项工作中,我们试图使模型更接近人类实际检查切片的方式。我们提出了一个问题引导、组织感知、由粗到细的检索框架HistoSelect,它由两个关键组件组成:一个识别问题相关组织区域的组采样器,以及一个在这些区域内检索最具信息量补丁的补丁选择器。通过仅选择最具信息量的补丁,我们的方法显著提高了效率:平均减少70%的视觉标记使用,同时提高了三个病理QA任务的准确性。在356,000个问答对上评估,我们的方法优于现有方法,并产生基于可解释、与病理学家一致的区域的答案。我们的结果表明,将类人搜索和注意力模式引入WSI推理是构建实用且可靠的病理VLM的一个有前景的方向。代码可在https://github.com/winston52/HistoSelect获取。

英文摘要

Computational pathology has advanced rapidly in recent years, driven by domain-specific image encoders and growing interest in using vision-language models to answer natural-language questions about diseases. Yet, the core problem behind pathology question-answering remains unsolved, considering that a gigapixel slide contains far more information than necessary for a given question. Pathologists naturally navigate tissue and morphology complexity by scanning broadly, and zooming in selectively according to the clinical questions. Current models, in contrast, rely on uniform patch sampling or broad attention maps, often attending equally to irrelevant regions while overlooking key visual evidence. In this work, we try to bring models closer to how humans actually examine slides. We propose a question-guided, tissue-aware, and coarse-to-fine retrieval framework, HistoSelect, that consists of two key components: a group sampler that identifies question-relevant tissue regions, followed by a patch selector that retrieves the most informative patches within those regions. By selecting only the most informative patches, our method becomes significantly more efficient: reducing visual token usage by 70% on average, while improving accuracy across three pathology QA tasks. Evaluated on 356,000 question-answer pairs, our approach outperforms existing methods and produces answers grounded in interpretable, pathologist-consistent regions. Our results suggest that bringing human-like search and attention patterns into WSI reasoning is a promising direction for building practical and reliable pathology VLMs. Code is available at https://github.com/winston52/HistoSelect.

2603.18599 2026-06-03 cs.CV 版本更新

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

SJD-PAC:通过主动草稿和自适应延续加速推测性雅可比解码

Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Affiliations: Peking University; Huawei Technologies

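AI summary: Proposes SJD-PAC, which raises the acceptance rate of Speculative Jacobi Decoding through a proactive drafting strategy and an adaptive continuation mechanism, giving lossless acceleration (a reported 3.8x speedup) of autoregressive text-to-image synthesis.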

Comments CVPR 2026

详情
AI中文摘要

推测性雅可比解码(SJD)提供了一种无需草稿模型的方法来加速自回归文本到图像合成。然而，视觉生成的高熵特性导致复杂区域中草稿令牌接受率低，形成严重限制整体吞吐量的瓶颈。为了克服这一问题，我们引入了SJD-PAC，一个增强的SJD框架。首先，SJD-PAC采用主动草稿策略来提高这些具有挑战性的高熵区域的局部接受率。其次，我们引入了一种自适应延续机制，在初始拒绝后维持序列验证，无需完全重新采样。这些优化协同工作，显著增加了每步的平均接受长度，在严格保持目标分布的同时提升了推理速度。在标准文本到图像基准上的实验表明，SJD-PAC实现了$3.8\times$的加速，且图像质量无损。代码可在https://github.com/KangJialiang/SJD-PAC获取。

英文摘要

Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a $3.8\times$ speedup with lossless image quality. Code is available at https://github.com/KangJialiang/SJD-PAC.

2602.07768 2026-06-03 cs.CV cs.AI cs.LG cs.MM 版本更新

PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification

PAND:面向提示的邻域蒸馏用于轻量级细粒度视觉分类

Qiuming Luo, Yuebing Li, Feng Li, Chang Kong

Affiliations: arXiv

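AI summary: Proposes PAND, a framework that transfers knowledge from large vision-language models to lightweight networks through prompt-aware semantic calibration and neighborhood-aware structural distillation, outperforming existing methods on four fine-grained classification benchmarks.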

Comments Accepted by ICIP2026

详情
AI中文摘要

在细粒度视觉分类(FGVC)中,从大型视觉语言模型(VLM)中蒸馏知识到轻量级网络至关重要但具有挑战性,原因是依赖于固定提示和全局对齐。为解决此问题,我们提出PAND(提示感知邻域蒸馏),一个两阶段框架,将语义校准与结构迁移解耦。首先,我们引入提示感知语义校准以生成自适应语义锚点。其次,我们提出邻域感知结构蒸馏策略以约束学生的局部决策结构。PAND在四个FGVC基准上持续优于现有方法。值得注意的是,我们的ResNet-18学生在CUB-200上达到76.09%的准确率,超过强基线VL2Lite 3.4%。代码可在https://github.com/LLLVTA/PAND获取。

英文摘要

Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual Classification (FGVC), due to the reliance on fixed prompts and global alignment. To address this, we propose PAND (Prompt-Aware Neighborhood Distillation), a two-stage framework that decouples semantic calibration from structural transfer. First, we incorporate Prompt-Aware Semantic Calibration to generate adaptive semantic anchors. Second, we introduce a neighborhood-aware structural distillation strategy to constrain the student's local decision structure. PAND consistently outperforms state-of-the-art methods on four FGVC benchmarks. Notably, our ResNet-18 student achieves 76.09% accuracy on CUB-200, surpassing the strong baseline VL2Lite by 3.4%. Code is available at https://github.com/LLLVTA/PAND.

2512.10888 2026-06-03 cs.CV 版本更新

PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

PubTables-v2: 一个新的用于全页和多页表格提取的大规模数据集

Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Amrit Ramesh, Maury Courtland

Affiliations: Kensho Technologies

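AI summary: To address the lack of annotated data for full-page and multi-page table extraction, this paper creates the large-scale dataset PubTables-v2 and compares current frontier models with small models across tasks with different levels of surrounding context.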

Comments 28 pages, separated POTATR to its own paper, added frontier model results

详情
AI中文摘要

表格提取(TE)是文档理解中的一个关键挑战。传统方法先检测表格,然后识别其结构。最近,人们对开发直接在全页或文档上下文中提取表格的方法(如视觉语言模型(VLM))的兴趣激增。然而,缺乏标注数据使得进展难以展示。为了解决这个问题,我们创建了一个新的大规模数据集PubTables-v2。PubTables-v2统一了各种周围上下文级别的TE,并且值得注意的是,它是第一个用于多页TE的基准。我们的评估显示,虽然当前前沿模型在最复杂的任务(全文档多页TE)上显著优于小模型(+0.354 GriTS_Con),但在较窄的任务(裁剪表格提取)上,通过针对性训练,这种差距可以被缩小甚至逆转(-0.056 GriTS_Con)。数据可在 https://huggingface.co/datasets/kensho/PubTables-v2 获取。代码和模型将发布。

英文摘要

Table extraction (TE) is a key challenge in document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), to extract tables directly in their full page or document context. However, a lack of annotated data has made progress difficult to demonstrate. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 unifies TE across various levels of surrounding context and, notably, is the first benchmark for multi-page TE. Our evaluations reveal that while current frontier models strongly outperform ($+0.354\ \textrm{GriTS}_\textrm{Con}$) small models on the most complex task (full-document multi-page TE), this gap can be closed or even reversed ($-0.056\ \textrm{GriTS}_\textrm{Con}$) on narrower tasks (cropped table extraction) with targeted training. Data is available at https://huggingface.co/datasets/kensho/PubTables-v2. Code and models will be released.

2603.14377 2026-06-03 cs.CV 版本更新

LoCAtion: Long-time Collaborative Attention Framework for High Dynamic Range Video Reconstruction

LoCAtion: 用于高动态范围视频重建的长时间协同注意力框架

Qianyu Zhang, Bolun Zheng, Lingyu Zhu, Aiai Huang, Zongpeng Li, Shiqi Wang

Affiliations: School of Automation, Hangzhou Dianzi University; Department of Computer Science, City University of Hong Kong

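AI summary: Proposes LoCAtion, which decouples alignment from fusion and uses collaborative attention plus a learned global sequence solver to reconstruct high-dynamic-range video without explicit alignment, reporting state-of-the-art visual quality and temporal stability.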

详情
AI中文摘要

主流的高动态范围(HDR)视频重建方法从根本上陷入了一种脆弱的对齐与融合范式。虽然显式空间对齐能在受控环境中成功恢复细节,但在无约束的动态场景中却成为严重瓶颈。通过强制对不可预测的运动和不同曝光进行刚性对齐,这些方法不可避免地会将配准误差转化为严重的鬼影伪影和时间闪烁。在本文中,我们重新思考了这一传统前提。认识到显式对齐本质上易受现实世界复杂性的影响,我们提出了LoCAtion,一种长时间协同注意力框架,将HDR视频生成从脆弱的空间扭曲任务重新构建为鲁棒的、无需对齐的协同特征路由问题。在这一新公式的指导下,我们的架构显式地解耦了高度纠缠的重建任务。我们不是努力刚性扭曲相邻帧,而是将场景锚定在一个连续的中等曝光骨干上,并利用协同注意力从未对齐的曝光中动态获取和注入可靠辐照度线索。此外,我们引入了一个学习的全局序列求解器。通过利用双向上下文和长时时间建模,它在整个序列中传播校正信号和结构特征,固有地强制执行全视频一致性并消除抖动。大量实验表明,LoCAtion在视觉质量和时间稳定性上达到了最先进水平,在准确性和计算效率之间提供了极具竞争力的平衡。

英文摘要

Prevailing High Dynamic Range (HDR) video reconstruction methods are fundamentally trapped in a fragile alignment-and-fusion paradigm. While explicit spatial alignment can successfully recover fine details in controlled environments, it becomes a severe bottleneck in unconstrained dynamic scenes. By forcing rigid alignment across unpredictable motions and varying exposures, these methods inevitably translate registration errors into severe ghosting artifacts and temporal flickering. In this paper, we rethink this conventional prerequisite. Recognizing that explicit alignment is inherently vulnerable to real-world complexities, we propose LoCAtion, a Long-time Collaborative Attention framework that reformulates HDR video generation from a fragile spatial warping task into a robust, alignment-free collaborative feature routing problem. Guided by this new formulation, our architecture explicitly decouples the highly entangled reconstruction task. Rather than struggling to rigidly warp neighboring frames, we anchor the scene on a continuous medium-exposure backbone and utilize collaborative attention to dynamically harvest and inject reliable irradiance cues from unaligned exposures. Furthermore, we introduce a learned global sequence solver. By leveraging bidirectional context and long-range temporal modeling, it propagates corrective signals and structural features across the entire sequence, inherently enforcing whole-video coherence and eliminating jitter. Extensive experiments demonstrate that LoCAtion achieves state-of-the-art visual quality and temporal stability, offering a highly competitive balance between accuracy and computational efficiency.

2603.07664 2026-06-03 cs.CV cs.AI cs.GR 版本更新

Ref-DGS: Reflective Dual Gaussian Splatting

Ref-DGS: 反射性双高斯泼溅

Ningjing Fan, Yiqun Wang, Dong-Ming Yan, Peter Wonka

Affiliations: Chongqing University; MAIS, Institute of Automation, Chinese Academy of Sciences and UCAS; King Abdullah University of Science and Technology (KAUST)

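AI summary: Proposes Ref-DGS, a framework that decouples surface reconstruction from specular reflection inside an efficient rasterization pipeline, using a dual Gaussian scene representation and a physically-aware specular adaptive mixing shader; it achieves state-of-the-art novel view synthesis on reflective scenes while training substantially faster than ray-based methods.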

Comments Project page: https://njfan.github.io/Ref-DGS/

详情
AI中文摘要

反射外观,尤其是强烈的近场镜面反射,对精确的表面重建和新视图合成构成了根本性挑战。现有的高斯泼溅方法要么无法建模近场镜面反射,要么依赖显式光线追踪而计算成本高昂。我们提出了Ref-DGS,一个反射性双高斯泼溅框架,通过在高效光栅化管线中将表面重建与镜面反射解耦来解决这一权衡。Ref-DGS引入了一种双高斯场景表示,由几何高斯和互补的局部反射高斯组成,无需显式光线追踪即可捕捉近场镜面交互,并包含一个全局环境反射场用于建模远场镜面反射。为了预测镜面辐射,我们进一步提出了一种轻量级的、物理感知的镜面自适应混合着色器,融合全局和局部镜面特征。实验表明,Ref-DGS在反射场景上达到了最先进的性能,同时训练速度显著快于基于光线的高斯方法。

英文摘要

Reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surface reconstruction and novel view synthesis. Existing Gaussian splatting methods either fail to model near-field specular reflections or rely on explicit ray tracing at substantial computational cost. We present Ref-DGS, a reflective dual Gaussian splatting framework that addresses this trade-off by decoupling surface reconstruction from specular reflection within an efficient rasterization-based pipeline. Ref-DGS introduces a dual Gaussian scene representation consisting of geometry Gaussians and complementary local reflection Gaussians that capture near-field specular interactions without explicit ray tracing, along with a global environment reflection field for modeling far-field specular reflections. To predict specular radiance, we further propose a lightweight, physically-aware specular adaptive mixing shader that fuses global and local specular features. Experiments demonstrate that Ref-DGS achieves state-of-the-art performance on reflective scenes while training substantially faster than ray-based Gaussian methods.

2602.18690 2026-06-03 q-bio.NC cs.CV cs.LG 版本更新

Neural Fields as World Models

神经场作为世界模型

Joshua Nunley

Affiliations: Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington; Cognitive Science Program, Indiana University, Bloomington

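AI summary: Proposes isomorphic world models that use motor-gated neural fields to make physical predictions within a spatial map, enabling offline task learning and body-linked representations.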

Comments 6 pages, 6 figures. Annual Meeting of the Cognitive Science Society (CogSci 2026)

详情
AI中文摘要

人类可以在离线状态下预演可能的未来,例如在心理练习和可能的梦境中,这表明世界模型可能支持远离环境的学习。标准的机器学习世界模型将视觉输入压缩为潜在向量,丢弃了感觉皮层的空间结构特征。我们提出了同构世界模型:一种保持感觉拓扑结构的架构,使得物理预测成为几何传播而非抽象状态转换。我们通过运动门控神经场实现这一想法,其中活动通过局部侧向连接演化,运动命令乘性地调制特定通道。在三个实验中,相同的架构学习了无“瞬移”的弹道预测,通过将任务误差通过冻结的学习世界模型传播,改进了离线接球策略,并在没有身体标签的情况下发展出身体选择性的运动通道。这些结果提供了初步证据,表明物理预测、离线任务学习和身体相关表征共享一个共同的计算基础:空间地图内的动作条件预测。

英文摘要

Humans rehearse possible futures offline, as in mental practice and perhaps dreaming, suggesting that world models may support task learning away from the environment. Standard machine learning world models compress visual input into latent vectors, discarding the spatial structure that characterizes sensory cortex. We propose isomorphic world models: architectures that preserve sensory topology, so physics prediction becomes geometric propagation rather than abstract state transition. We implement this idea with motor-gated neural fields, where activity evolves through local lateral connectivity and motor commands multiplicatively modulate specific channels. Across three experiments, the same architecture learns ballistic prediction without "teleporting," improves a catching policy offline by propagating task error through a frozen learned world model, and develops body-selective motor channels without body labels. These results provide preliminary evidence that physical prediction, offline task learning, and body-linked representation share a common computational substrate: action-conditional prediction within a spatial map.

2602.17063 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

符号锁定:随机初始化的权重符号持续存在并成为亚比特模型压缩的瓶颈

Akira Sakai, Yuma Ichikawa

Affiliations: Fujitsu Limited; Tokai University; RIKEN Center for AIP

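AI summary: Studies the sign-bit bottleneck in sub-bit model compression, uses sign lock-in theory to explain where the randomness of weight signs comes from, and proposes a from-scratch low-rank sign-template training method to break through the bottleneck.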

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

亚比特模型压缩的目标是将每个权重的存储降至1比特以下;当幅度被激进压缩时,符号位成为固定成本的瓶颈。在Transformer、CNN和MLP中,学习到的符号矩阵抵抗低秩近似,并且在频谱上与i.i.d. Rademacher基线无法区分。这种随机性导致了亚比特模型压缩的下界——1比特墙。尽管存在这种明显的随机性,大多数权重仍保留其初始化符号;翻转主要通过罕见的近零边界穿越发生,表明符号模式的随机性很大程度上继承自初始化。我们通过符号锁定理论形式化了这一行为,这是对SGD噪声下符号翻转的停时分析。在有界更新和零的小邻域内罕见重新进入的条件下,有效符号翻转的数量呈现几何尾部。基于这一机制,我们引入了一种从头开始的低秩符号模板训练方法,以防止这种1比特墙的出现。

英文摘要

Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. This randomness gives rise to the lower bound of sub-bit model compression -- the one-bit wall. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood of zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a from-scratch low-rank sign-template training method that prevents the emergence of this one-bit wall.
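
The observation that most weights keep their initialization sign can be illustrated with a toy simulation. The sketch below is not the authors' stopping-time analysis or training setup. It assumes each weight is drawn from a standard normal and then receives small independent Gaussian updates; the function name `fraction_signs_retained`, the step size, and the step count are hypothetical choices made only for illustration.

```python
import random


def fraction_signs_retained(n_weights=2000, n_steps=200, step_std=0.01, seed=0):
    """Toy model (assumption): weights start at N(0, 1) and take small noisy steps.

    Returns the fraction of weights whose final sign equals their initial sign.
    """
    rng = random.Random(seed)
    retained = 0
    for _ in range(n_weights):
        w0 = rng.gauss(0.0, 1.0)      # random initialization
        w = w0
        for _ in range(n_steps):
            w += rng.gauss(0.0, step_std)  # bounded-scale noisy update
        if (w > 0) == (w0 > 0):       # sign unchanged since initialization
            retained += 1
    return retained / n_weights


if __name__ == "__main__":
    print(f"fraction of signs retained: {fraction_signs_retained():.3f}")
```

Under these assumptions only weights that start close to zero are likely to cross it, which matches the near-zero boundary-crossing picture in the abstract.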

2602.12221 2026-06-03 cs.CV 版本更新

Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching

两全其美:通过统一离散流匹配实现多模态推理与生成

Onkar Susladkar, Tushar Prakash, Gayatri Deshmukh, Kiet A. Nguyen, Jiaxun Zhang, Adheesh Juvekar, Tianshu Bao, Lin Chai, Sparsh Mittal, Inderjit S Dhillon, Ismini Lourentzou

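AI summary: Proposes UniDFlow, which decouples understanding and generation through task-specific low-rank adapters and uses reference-based multimodal preference alignment to improve faithfulness and controllability, reaching state-of-the-art performance on eight benchmarks.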

详情
AI中文摘要

我们提出了UniDFlow,一个统一的多模态理解、生成和编辑的离散流匹配框架。它通过任务特定的低秩适配器解耦理解和生成,避免了目标干扰和表示纠缠,同时一种新颖的基于参考的多模态偏好对齐在相同条件下优化相对结果,提高了忠实性和可控性,无需大规模重新训练。UniDFlow在八个基准上达到了最先进的性能,并在包括修复、上下文图像生成、基于参考的编辑和组合生成等任务上展现出强大的零样本泛化能力,尽管没有进行明确的特定任务训练。

英文摘要

We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.

2602.11804 2026-06-03 cs.CV eess.IV 版本更新

Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

基于深度感知融合与有限训练数据的高效分割一切

Yiming Zhou, Xuenjie Xie, Panfeng Li, Albrecht Kunz, Ahmad Osman, Xavier Maldague

Affiliations: University of Cambridge

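AI summary: Proposes a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors; trained on only 11.2k samples (less than 0.1% of SA-1B), it achieves higher segmentation accuracy than EfficientViT-SAM.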

详情
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1731-1735
AI中文摘要

分割一切模型(SAM)实现了令人印象深刻的通用分割性能,但需要大规模数据集(例如1100万张图像)且仅依赖RGB输入。最近的高效变体减少了计算量,但仍依赖于大规模训练。我们提出了一种轻量级RGB-D融合框架,用单目深度先验增强EfficientViT-SAM。深度图通过预训练的估计器生成,并通过专门的深度编码器与RGB特征进行中层融合。仅使用11.2k样本(不到SA-1B的0.1%)训练,我们的方法比EfficientViT-SAM取得了更高的准确率,表明深度线索为分割提供了强大的几何先验。

英文摘要

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.

2602.09708 2026-06-03 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Physics-informed diffusion models in spectral space

谱空间中的物理信息扩散模型

Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen

Affiliations: ETH Zürich

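AI summary: Proposes physics-informed spectral diffusion (PISD), which combines generative latent diffusion models with physics-informed machine learning, models PDE parameters and solutions by diffusion in a latent space of spectral representations, and enforces physics constraints and measurement conditions through diffusion posterior sampling; on Poisson, Helmholtz, and incompressible Navier-Stokes equations it shows higher accuracy and computational efficiency than existing diffusion-based solvers.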

Comments 18 pages, 10 figures

详情
AI中文摘要

我们提出物理信息谱扩散(PISD),一种将生成式潜扩散模型与物理信息机器学习相结合的方法,用于生成基于部分观测的偏微分方程(PDE)的解,特别包括正向和逆向PDE问题。我们在缩放谱表示的潜空间中通过扩散过程学习PDE参数和解的联合分布,其中高斯噪声对应于具有受控正则性的函数。与基于网格的扩散模型相比,这种谱公式能够实现显著的降维,并确保函数空间中的诱导过程保持在PDE算子定义良好的函数类内。基于扩散后验采样,我们在推理过程中施加物理信息约束和测量条件,在每个扩散步骤应用基于Adam的更新。我们在泊松、亥姆霍兹和不可压缩纳维-斯托克斯方程上评估了所提出的方法,与现有的基于扩散的PDE求解器(在稀疏观测下达到最先进水平)相比,展示了更高的精度和计算效率。代码可在 https://github.com/deeplearningmethods/PISD 获取。

英文摘要

We propose physics-informed spectral diffusion (PISD), a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier-Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.

2601.22841 2026-06-03 cs.CV 版本更新

How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models

我们需要多少模型?遥感基础模型中的冗余与可瘦身性

Leonard Hackel, Tom Burgert, Begüm Demir

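AI summary: Uses post-hoc slimming (uniform width reduction of encoder transformer blocks) to assess representational redundancy in eight remote sensing foundation models; the RS models retain 69% to 109% relative accuracy under aggressive width reduction, whereas models pretrained on natural images degrade sharply, indicating that RS models encode information redundantly and can be slimmed effectively.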

详情
AI中文摘要

遥感中的大规模基础模型(RS FMs)遵循计算机视觉(CV)中建立的范式开发,但将CV缩放定律迁移至RS的有效性尚未系统检验。我们假设RS FMs在比CV对应模型小得多的规模下进入过参数化区域,任务相关信息在模型维度间冗余编码。为验证这一假设,我们应用后验瘦身(即均匀减少预训练编码器Transformer块的宽度)作为衡量8个最先进RS FMs在分类、分割和变化检测任务中表示冗余的工具。在激进宽度缩减下,RS FMs在RS数据集上保持69%至109%的相对精度,而基于自然图像预训练的掩码自编码器(MAE)和DINOv2(记为CV MAE和CV DINOv2)在相同计算需求范围内,在匹配类别数的ImageNet子集上性能急剧下降。直接在相同RS数据集上评估的CV MAE缩小了差距但未消除,表明数据集特性和领域特定预训练共同导致了模型间的差异。特征相关性、解释方差和有效维度等机制分析表明,任务相关方差集中在少数主成分中,并在模型维度间冗余编码。我们进一步证明,对于对比目标,学习型可瘦身训练优于后验瘦身,而基于重建的目标无法从当前可瘦身训练协议中受益。我们的发现确立了后验瘦身作为资源受限RS应用的实际部署策略,以及作为RS FMs表示冗余的诊断工具。论文接收后,我们将发布所有代码。

英文摘要

Large-scale foundation models (FMs) in remote sensing (RS) (denoted as RS FMs) are developed following paradigms established in computer vision (CV), yet the validity of transferring CV scaling laws to RS has not been systematically examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, with task-relevant information encoded redundantly across model dimensions. To test this hypothesis, we apply post-hoc slimmability, uniform width reduction of pretrained encoder transformer blocks, as a tool to measure representational redundancy across eight state-of-the-art RS FMs on classification, segmentation, and change detection tasks. RS FMs retain 69% to 109% relative accuracy on RS datasets under aggressive width reduction, while masked autoencoder (MAE) and DINOv2 pretrained on natural images (denoted as CV MAE and CV DINOv2) degrade sharply on ImageNet subsets of matched class count over the same range of computational requirements. A CV MAE evaluated directly on the same RS datasets narrows but does not close the gap, indicating that both dataset characteristics and domain-specific pretraining contribute to the differences between the models. Mechanistic analyses such as feature correlation, explained variance, and effective dimensionality indicate that task-relevant variance concentrates in few principal components and is redundantly encoded across model dimensions. We further show that learned slimmable training improves over post-hoc slimmability for contrastive objectives, while reconstruction-based objectives do not benefit from current slimmable training protocols. Our findings establish post-hoc slimming as a practical deployment strategy for resource-constrained RS applications and as a diagnostic tool for representational redundancy in RS FMs. Upon acceptance, we will publish all code.
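
Uniform width reduction can be pictured on a single feed-forward block. The following is a minimal sketch, assuming that slimming simply keeps the leading fraction of hidden units; the abstract does not say which units are kept, so that selection rule, the function name `slim_mlp`, and the list-of-lists weight layout are illustrative assumptions, not the authors' procedure.

```python
def slim_mlp(w_in, w_out, keep_ratio):
    """Keep the first `keep_ratio` fraction of hidden units of an MLP block.

    w_in:  hidden x d matrix (one row per hidden unit).
    w_out: d x hidden matrix (one column per hidden unit).
    Keeping the leading units is an assumption made for this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    hidden = len(w_in)
    keep = max(1, int(hidden * keep_ratio))
    slim_in = [row[:] for row in w_in[:keep]]   # drop trailing hidden rows
    slim_out = [row[:keep] for row in w_out]    # drop matching output columns
    return slim_in, slim_out


if __name__ == "__main__":
    w_in = [[float(i + j) for j in range(3)] for i in range(4)]   # 4 hidden units, d = 3
    w_out = [[float(i * j) for j in range(4)] for i in range(3)]  # d = 3, 4 hidden units
    slim_in, slim_out = slim_mlp(w_in, w_out, keep_ratio=0.5)
    print(len(slim_in), len(slim_out[0]))  # hidden width after slimming
```

Evaluating the slimmed block without retraining is what makes the procedure post hoc: accuracy that survives the cut suggests the removed units carried redundant information.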

2601.22443 2026-06-03 cs.LG cs.CV stat.CO stat.ML 版本更新

Weak Diffusion Priors Can Still Achieve Strong Inverse-Problem Performance

弱扩散先验仍能实现强逆问题性能

Jing Jia, Wei Yuan, Sifan Liu, Liyue Shen, Guanyang Wang

Affiliations: University of California, Berkeley; Stanford University

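AI summary: Studies the robustness of weak diffusion priors in inverse problems, using Bayesian-consistency theory and local-correlation analysis to explain why they remain effective when measurements are highly informative.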

Comments 37 pages, ICML 2026 spotlight. Code: https://github.com/jjia131/weak-diffusion-priors-inverse-problem, Project Page: https://jjia131.github.io/weak-diffusion-priors-inverse-problem/

详情
AI中文摘要

在卧室图像上训练的扩散模型能否恢复人脸图像?扩散模型被广泛用作逆问题的先验,但标准方法通常假设一个高保真模型,该模型在与未知信号高度匹配的数据上训练。实践中,常常必须使用不匹配或低保真的扩散先验。令人惊讶的是,这些弱先验的表现往往几乎与全强度的域内基线相当。我们研究了逆求解器何时以及为何对弱扩散先验具有鲁棒性。通过大量实验,我们发现当测量信息高度丰富(例如,大量观测像素)时,弱先验能够成功,并识别了它们失败的场景。为了解释这一行为,我们将贝叶斯一致性理论与局部相关性分析相结合:理论给出了高维测量使后验集中于真实信号附近的条件,而相关性分析表明弱先验和更强的自然图像先验可以共享相似的局部空间结构。这些结果为何时可以可靠地使用弱扩散先验提供了原则性依据。代码可在 https://github.com/jjia131/weak-diffusion-priors-inverse-problem 获取。

英文摘要

Can a diffusion model trained on bedrooms recover human faces? Diffusion models are widely used as priors for inverse problems, but standard approaches usually assume a high-fidelity model trained on data that closely match the unknown signal. In practice, one often must use a mismatched or low-fidelity diffusion prior. Surprisingly, these weak priors often perform nearly as well as full-strength, in-domain baselines. We study when and why inverse solvers are robust to weak diffusion priors. Through extensive experiments, we find that weak priors succeed when measurements are highly informative (e.g., many observed pixels), and we identify regimes where they fail. To explain this behavior, we combine Bayesian-consistency theory with local-correlation analysis: the theory gives conditions under which high-dimensional measurements make the posterior concentrate near the true signal, while the correlation analysis shows that weak and stronger natural-image priors can share similar local spatial structure. These results provide a principled justification for when weak diffusion priors can be used reliably. Code is available at https://github.com/jjia131/weak-diffusion-priors-inverse-problem.

2510.22491 2026-06-03 cs.LG cs.CE cs.CV 版本更新

LAMP: Data-Efficient Linear Affine Weight-Space Models for Parameter-Controlled 3D Shape Generation and Extrapolation

LAMP: 数据高效的线性仿射权重空间模型用于参数控制的3D形状生成与外推

Ghadi Nehme, Yanxia Zhang, Dule Shu, Matt Klenk, Faez Ahmed


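AI summary: Proposes LAMP, which overfits signed distance function decoders from a shared initialization to align their weight spaces, enabling controllable 3D generation and extrapolation under parameter constraints from few samples, and introduces a linearity-mismatch safety metric for reliability.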

详情
AI中文摘要

在显式参数约束下生成高保真3D几何体是工程设计的核心,但当前方法通常需要大型数据集,且无法在训练分布之外提供可靠控制。我们提出LAMP,一个数据高效的框架,用于可控和可解释的3D生成,该框架通过从共享初始化过拟合每个样本并对齐符号距离函数(SDF)解码器,然后在对齐的权重空间中通过求解参数约束的仿射混合问题来生成新设计。为了提高可靠性,我们提出一种线性失配安全度量,用于检测混合解码器何时离开有效的局部区域。我们在DrivAerNet++、BlendedNet以及额外的工业级车辆系列(包括跑车、SUV和敞篷车)上评估LAMP。LAMP能够以少至50个样本实现受控插值,在训练范围外安全外推高达100%,并在固定参数下进行性能引导优化,在外推、数据效率和参数保真度方面优于条件自编码器和深度网络插值(DNI)基线。我们的结果表明,LAMP推进了用于设计探索、数据集生成和性能驱动优化的可控、数据高效且安全的3D生成。

英文摘要

Generating high-fidelity 3D geometries under explicit parameter constraints is central to engineering design, yet current methods often require large datasets and fail to provide reliable control beyond the training distribution. We introduce LAMP, a data-efficient framework for controllable and interpretable 3D generation that aligns signed distance function (SDF) decoders by overfitting each exemplar from a shared initialization, then generates new designs by solving a parameter-constrained affine mixing problem in the aligned weight space. To improve reliability, we propose a linearity-mismatch safety metric that detects when mixed decoders leave the valid local regime. We evaluate LAMP on DrivAerNet++, BlendedNet, and additional industry-level vehicle families, including sports cars, SUVs, and convertibles. LAMP enables controlled interpolation with as few as 50 samples, safe extrapolation up to 100% beyond training ranges, and performance-guided optimization under fixed parameters, outperforming conditional autoencoder and Deep Network Interpolation (DNI) baselines in extrapolation, data efficiency, and parameter fidelity. Our results demonstrate that LAMP advances controllable, data-efficient, and safe 3D generation for design exploration, dataset generation, and performance-driven optimization.
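
The idea of affine mixing in an aligned weight space can be shown in its simplest form with two exemplars. This is a sketch under stated assumptions, not LAMP itself: it assumes a single design parameter that varies linearly along the line between two aligned decoders, whereas the abstract describes a constrained mixing problem over a set of exemplars. The function name `affine_mix` and the flat weight lists are hypothetical.

```python
def affine_mix(weights_a, weights_b, param_a, param_b, target_param):
    """Mix two aligned weight vectors so a linearly varying parameter hits the target.

    Coefficients sum to 1 (affine combination). A coefficient outside [0, 1]
    means the target lies outside the range spanned by the two exemplars,
    i.e. the mix is an extrapolation.
    """
    if param_a == param_b:
        raise ValueError("exemplars must differ in the controlled parameter")
    alpha = (target_param - param_a) / (param_b - param_a)
    coeffs = (1.0 - alpha, alpha)
    mixed = [coeffs[0] * wa + coeffs[1] * wb
             for wa, wb in zip(weights_a, weights_b)]
    return mixed, coeffs


if __name__ == "__main__":
    # Exemplars with parameter values 10 and 20; a target of 25 extrapolates.
    mixed, coeffs = affine_mix([0.0, 2.0], [1.0, 4.0], 10.0, 20.0, 25.0)
    print(coeffs, mixed)
```

A safety check such as the linearity-mismatch metric mentioned in the abstract would sit on top of a mix like this, flagging coefficients that push the decoder too far from the region where the linear assumption holds.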

2512.23234 2026-06-03 cs.CV cs.AI 版本更新

Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring

边缘感知与内容自适应的工业安全监控红外气体泄漏检测

Dongsheng Li, Tianli Ma, Siling Wang, Beibei Duan, Song Gao

Affiliations: School of Mechatronic Engineering, Xi’an Technological University; School of Electronic Information Engineering, Xi’an Technological University; Shaanxi Shanhua Coal Chemical Co., Ltd.

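AI summary: To address the difficulty of detecting infrared gas plumes that are faint, semi-transparent, and weakly bounded, proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) built from a plume-oriented local-global feature enhancement block, a multi-scale edge perception module, and a content-adaptive sparse routing path aggregation network; it improves on the RT-DETR-R18 baseline on the IIG dataset and generalizes to the LangGas dataset.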

详情
AI中文摘要

红外气体泄漏检测对于工业安全和环境监测至关重要,但由于气体羽流通常微弱、细小、半透明且边界模糊,自动检测仍然具有挑战性。本文提出了一种边缘感知与内容自适应特征融合检测器(ECAF-Det),用于杂乱热场景中的弱羽流检测。ECAF-Det集成了三个面向任务的设计:羽流导向的局部-全局特征增强块,用于保留精细边界线索并捕获长程上下文连续性;多尺度边缘感知模块,将方向梯度和相位一致性线索转化为分层边缘先验,用于边界敏感的羽流表示;以及内容自适应稀疏路由路径聚合网络,动态调节多尺度特征传播,以强调信息丰富的羽流特征并抑制冗余背景响应。在IIG数据集上的实验表明,ECAF-Det实现了29.8%的AP、84.3%的AP50和25.3%的小目标AP,分别比RT-DETR-R18基线提高了3.0、6.5和5.4个百分点,计算量为43.7 GFLOPs,参数量为14.9 M。在LangGas数据集上,ECAF-Det实现了36.3%的AP和68.5%的AP50,展示了其对不同红外气体羽流外观的泛化能力。主要的人工智能贡献在于边缘感知表示学习与内容自适应稀疏特征路由,用于弱红外羽流感知。所提出的检测器可作为工业气体泄漏监测中早期预警和远程巡检的视觉感知组件。

英文摘要

Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging because gas plumes are often faint, small, semi-transparent, and weakly bounded. This paper proposes an Edge-Aware and Content-Adaptive Feature Fusion Detector (ECAF-Det) for weak-plume detection in cluttered thermal scenes. ECAF-Det integrates three task-oriented designs: a plume-oriented local-global feature enhancement block to preserve fine boundary cues and capture long-range contextual continuity; a multi-scale edge perception module that transforms directional gradient and phase-consistency cues into hierarchical edge priors for boundary-sensitive plume representation; and a content-adaptive sparse routing path aggregation network that dynamically regulates multi-scale feature propagation to emphasize informative plume features and suppress redundant background responses. Experiments on the IIG dataset show that ECAF-Det achieves 29.8% AP, 84.3% AP50, and 25.3% small-object AP, improving the RT-DETR-R18 baseline by 3.0, 6.5, and 5.4 percentage points, respectively, with 43.7 GFLOPs and 14.9 M parameters. On the LangGas dataset, ECAF-Det achieves 36.3% AP and 68.5% AP50, demonstrating its generalization to different infrared gas plume appearances. The main AI contribution is edge-aware representation learning with content-adaptive sparse feature routing for weak infrared plume perception. The proposed detector can serve as a visual perception component for early warning and remote inspection in industrial gas leak monitoring.

2512.22539 2026-06-03 cs.RO cs.CV 版本更新

VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

VLA-Arena:一个用于基准测试视觉-语言-动作模型的开源框架

Borong Zhang, Jiahao Li, Jiachen Shen, Yuhao Zhang, Yishuai Cai, Yuanpei Chen, Juntao Dai, Jiaming Ji, Yaodong Yang

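AI summary: Proposes the VLA-Arena benchmark, which quantifies task difficulty along three orthogonal axes (task structure, language command, visual observation) to systematically evaluate the capability frontiers and failure modes of vision-language-action models.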

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管视觉-语言-动作模型(VLA)正快速向通用机器人策略发展,但定量理解其局限和失败模式仍然困难。为此,我们引入了一个名为VLA-Arena的全面基准。我们提出了一种新颖的结构化任务设计框架,用于在三个正交轴上量化难度:(1)任务结构,(2)语言命令,以及(3)视觉观察。这使我们能够系统地设计具有细粒度难度级别的任务,从而精确测量模型能力边界。对于任务结构,VLA-Arena的170个任务被分为四个维度:安全性、干扰物、外推和长时域。每个任务设计有三个难度级别(L0-L2),仅在L0上进行微调以评估通用能力。正交于此,语言(W0-W4)和视觉(V0-V4)扰动可应用于任何任务,以实现鲁棒性的解耦分析。我们对最先进的VLA进行了广泛评估,揭示了几个关键局限性,包括强烈的记忆化倾向而非泛化、不对称鲁棒性、缺乏对安全约束的考虑,以及无法组合已学技能以完成长时域任务。为了促进针对这些挑战的研究并确保可重复性,我们提供了完整的VLA-Arena框架,包括从任务定义到自动评估的端到端工具链,以及用于微调的VLA-Arena-S/M/L数据集。我们的基准、数据、模型和排行榜可在https://vla-arena.github.io获取。

英文摘要

While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at https://vla-arena.github.io.

2512.07394 2026-06-03 cs.CV 版本更新

Reconstructing Objects along Hand Interaction Timelines in Egocentric Video

在手交互时间线中重建第一人称视频中的物体

Zhifan Zhu, Siddhant Bansal, Shashank Tripathi, Dima Damen

Affiliations: University of Bristol, UK; Max Planck Institute for Intelligent Systems, Tübingen, Germany

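AI summary: Proposes the ROHIT task, which defines the Hand Interaction Timeline (HIT) and uses a Constrained Optimisation and Propagation (COP) framework to reconstruct rigid-object poses from egocentric video without 3D ground truth, improving stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5%.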

Comments webpage: https://zhifanzhu.github.io/objects-along-hit

详情
AI中文摘要

我们引入了沿手交互时间线重建物体(ROHIT)的任务。首先从刚性物体的角度定义手交互时间线(HIT)。在HIT中,物体最初相对于场景静止,然后被手持并接触,其姿态发生变化。通常在使用过程中会有一个牢固的抓握,之后物体被释放,再次相对于场景静止。我们对HIT上的这些姿态约束进行建模,并提出沿HIT传播物体姿态,通过我们提出的约束优化与传播(COP)框架实现更优的重建。重要的是,我们关注稳定抓取的时间线——即手稳定地握住物体,在使用过程中保持恒定接触。这使得我们能够在没有3D真值的情况下,高效地标注、研究和评估视频中的物体重建。我们在两个第一人称数据集HOT3D和野外EPIC-Kitchens上评估了我们提出的任务ROHIT。在HOT3D中,我们整理了1.2K个稳定抓取片段。在EPIC-Kitchens中,我们标注了2.4K个稳定抓取片段,包括来自141个环境中日常交互视频的9个类别的390个物体实例。在没有3D真值的情况下,我们利用2D投影误差来评估重建。定量结果表明,COP通过约束姿态传播,将稳定抓取重建提高了6.2-11.3%,将HIT重建提高了高达24.5%。

英文摘要

We introduce the task of Reconstructing Objects along Hand Interaction Timelines (ROHIT). We first define the Hand Interaction Timeline (HIT) from a rigid object's perspective. In a HIT, an object is first static relative to the scene, then is held in hand following contact, where its pose changes. This is usually followed by a firm grip during use, before it is released to be static again with respect to the scene. We model these pose constraints over the HIT, and propose to propagate the object's pose along the HIT, enabling superior reconstruction using our proposed Constrained Optimisation and Propagation (COP) framework. Importantly, we focus on timelines with stable grasps, i.e. where the hand is stably holding an object, effectively maintaining constant contact during use. This allows us to efficiently annotate, study, and evaluate object reconstruction in videos without 3D ground truth. We evaluate our proposed task, ROHIT, over two egocentric datasets, HOT3D and in-the-wild EPIC-Kitchens. In HOT3D, we curate 1.2K clips of stable grasps. In EPIC-Kitchens, we annotate 2.4K clips of stable grasps including 390 object instances across 9 categories from videos of daily interactions in 141 environments. Without 3D ground truth, we utilise 2D projection error to assess the reconstruction. Quantitatively, COP improves stable grasp reconstruction by 6.2-11.3% and HIT reconstruction by up to 24.5% with constrained pose propagation.

2511.19995 2026-06-03 cs.CV 版本更新

CREward: A Type-Specific Creativity Reward Model

CREward:一种类型特定的创造力奖励模型

Jiyeon Han, Ali Mahdavi-Amiri, Hao Zhang, Haedong Jeong

Affiliations: Simon Fraser University; Sogang University

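AI summary: Proposes CREward, the first type-specific creativity reward model, which assesses creativity along three axes (geometry, material, and texture) and is applied to creativity assessment, explainable creativity, and creative sample acquisition.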

Comments Accepted to CVPR 2026

详情
AI中文摘要

创造力是一种复杂现象。在表征和评估创造力时,将其视为单一的未分化量显得幼稚且不足。在这项工作中,我们学习了第一个类型特定的创造力奖励模型,称为CREward,它跨越三个创造力“轴”:几何、材质和纹理,使我们能够通过图像形成流程的视角来审视创造力。为了构建我们的奖励模型,我们首先进行人类基准评估,以捕捉人类对各种创意图像中每种类型的创造力感知。然后,我们分析人类判断与大型视觉语言模型(LVLMs)预测之间的相关性,确认LVLMs与人类感知高度一致。基于这一观察,我们收集LVLM生成的标签来训练我们的CREward模型,该模型适用于创意图像的评估和生成。我们探索了CREward的三个应用:创造力评估、可解释创造力以及创意样本获取,用于人类设计灵感和通过低秩适应引导创意生成。

英文摘要

Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the first type-specific creativity reward model, coined CREward, which spans three creativity "axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.

2511.17126 2026-06-03 eess.IV cs.CV cs.LG physics.optics 版本更新

Towards Blind Lens Aberration Correction via Large LensLib Pre-training and Discrete Degradation Priors

面向盲镜头像差校正的大规模LensLib预训练与离散退化先验

Xiaolong Qian, Qi Jiang, Yao Gao, Lei Sun, Kailun Yang, Xian Wang, Zhonghua Yi, Wenyong Li, Ming-Hsuan Yang, Luc Van Gool, Kaiwei Wang

Affiliations: National Research Center for Optical Instrumentation, Zhejiang University; INSAIT, Sofia University "St. Kliment Ohridski"; School of Artificial Intelligence and Robotics, Hunan University; National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

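AI summary: Proposes the FoundCAC framework, which addresses data scaling and missing prior guidance by constructing AODLibpro, a large-scale unbiased lens library, and a discrete degradation prior, the Latent PSF Representation (LPR), achieving zero-shot generalization and efficient few-shot adaptation for blind lens aberration correction.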

Comments Accepted to 2026 IEEE International Conference on Computational Photography (ICCP). The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/FoundCAC

详情
AI中文摘要

新兴的基于深度学习的镜头库预训练(LensLib-PT)流程通过训练通用神经网络,为盲镜头像差校正提供了新途径,展现出处理多种未知光学退化的强大能力。本文提出FoundCAC,一个通用的基础框架,解决了阻碍现有流程泛化的两个挑战:训练数据扩展的困难以及缺乏表征光学退化的先验指导。为提高数据可扩展性,我们扩展设计规范以增加退化多样性,并基于均匀采样策略构建了大规模无偏镜头库AODLibpro,该策略量化了空间变化模式和严重程度。在模型设计方面,为利用点扩散函数(PSF)作为指导同时保持盲范式,我们提出了一种多阶段向量量化表示学习方案。该范式专门设计用于构建潜在PSF表示(LPR),将复杂的连续PSF显式编码为离散退化先验,以规范高度病态的恢复过程。通过简单而有效的码本冻结策略,我们的框架利用离散先验提升全样本恢复性能,并实现对未见镜头的高效少样本适应。在合成LensLib和真实镜头的多种像差上的实验表明,我们的框架实现了最先进的零样本泛化,同时支持针对特定镜头的高效少样本适应。源代码和数据集将在https://github.com/zju-jiangqi/FoundCAC公开提供。

英文摘要

The emerging deep-learning-based lens library pre-training (LensLib-PT) pipeline offers a new avenue for blind lens aberration correction by training a universal neural network, demonstrating strong capability in handling diverse unknown optical degradations. This work proposes FoundCAC, a universal foundational framework that resolves two challenges hindering the generalization of existing pipelines: the difficulty of scaling training data and the absence of prior guidance characterizing optical degradation. To improve data scalability, we expand the design specifications to increase degradation diversity and construct AODLibpro, a large-scale, unbiased lens library based on a uniform sampling strategy that quantifies spatial-variation patterns and severity. In terms of model design, to leverage Point Spread Functions (PSFs) as guidance while maintaining the blind paradigm, we propose a multi-stage vector-quantized representation learning scheme. This paradigm is specifically designed to construct a Latent PSF Representation (LPR), explicitly encoding complex continuous PSFs into a discrete degradation prior to regularize the highly ill-posed restoration process. Through a simple yet effective codebook-freezing strategy, our framework leverages the discrete prior to elevate full-shot restoration performance and unlock highly efficient few-shot adaptation for unseen lenses. Experiments on diverse aberrations of synthetic LensLib and real-world lenses demonstrate that our framework achieves state-of-the-art zero-shot generalization while enabling highly efficient few-shot adaptation for specific lenses. The source code and datasets will be made publicly available at https://github.com/zju-jiangqi/FoundCAC.

2503.07265 2026-06-03 cs.CV cs.AI cs.CL 版本更新

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

WISE: 一种基于世界知识的文本到图像生成语义评估方法

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan

Affiliations: University of Science and Technology of China

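AI summary: Because existing evaluations of text-to-image generation lack assessment of complex semantic understanding and world knowledge integration, proposes the WISE benchmark, with 1,000 carefully designed prompts across 25 subdomains, and the WiScore metric for knowledge-image alignment; experiments show that current models are significantly limited in integrating world knowledge.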

Comments Accepted to ICML 2026. We have also released an updated version of the benchmark, WISE_Verified. Please refer to https://github.com/PKU-YuanGroup/WISE for the latest version

详情
AI中文摘要

文本到图像(T2I)模型能够生成高质量的艺术创作和视觉内容。然而,现有研究和评估标准主要关注图像真实性和浅层的文本-图像对齐,缺乏对文本到图像生成中复杂语义理解和世界知识整合的全面评估。为解决这一挑战,我们提出了WISE,这是首个专门用于World Knowledge-Informed Semantic Evaluation(世界知识引导的语义评估)的基准。WISE超越了简单的词-像素映射,通过1000个精心设计的提示,涵盖文化常识、时空推理和自然科学等25个子领域,对模型进行挑战。为了克服传统CLIP指标的局限性,我们引入了WiScore,一种用于评估知识-图像对齐的新型定量指标。通过对20个模型(10个专用T2I模型和10个统一多模态模型)在涵盖25个子领域的1000个结构化提示上进行全面测试,我们的发现揭示了它们在图像生成过程中有效整合和应用世界知识的能力存在显著局限,为下一代T2I模型增强知识整合与应用指明了关键路径。代码和数据可在 https://github.com/PKU-YuanGroup/WISE 获取。

英文摘要

Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.

2511.13020 2026-06-03 cs.CV cs.AI 版本更新

PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation

PHASE: 通过对象到人体域适应的生理感知高光谱重建

Yufei Wen, Shuxing Zhong, Jingdan Kang, Yuting Zhang, Jintai Chen, Kaishun Wu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China University of Technology(华南理工大学)

AI总结 针对现有高光谱重建方法在生理成像中失效的问题,提出PHASE范式,通过生理通道重新解释和生理约束对齐,实现从对象到人体的域适应,仅需1.5%标注数据即可显著提升重建质量。

Comments To KDD26

详情
AI中文摘要

尽管高光谱成像提供了无与伦比的无创生理洞察,但其笨重的硬件、缓慢的采集速度和监管负担严重限制了其临床可用性。一种自然的替代方案是从无处不在的RGB或CASSI测量中重建高光谱信息。然而,现有的为以对象为中心的场景开发的范式依赖于基于反射率的特征对齐,假设光谱相似性保持语义一致性。这一假设在生理成像中不成立,因为视觉上相似的RGB响应可能源于不同且纠缠的生理状态。这种不匹配促使从反射率对齐转向基于共享光-物质相互作用原理的生理感知表示学习——这一转变引入了来自跨通道语义偏移(C1)和基于RGB采集的不可逆信息丢失(C2)的基本挑战。因此,我们设计了PHASE,一种生理感知的高光谱重建范式,通过生理通道重新解释解耦跨通道生理语义,并通过生理约束对齐将重建限制在生理上合理的解,从根本上重新定义了对象到人体的迁移。在两种源到目标迁移协议下,PHASE仅需1.5%的标注监督,在SSIM上一致优于最先进方法最多+2.20,在SAM上最多-3.06。

英文摘要

Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory burden severely limit its clinical availability. A natural workaround is to reconstruct hyperspectral information from ubiquitous RGB or CASSI measurements. However, existing paradigms, developed for object-centric scenes, rely on reflectance-based feature alignment, assuming that spectral similarity preserves semantic meaning. This assumption breaks down in physiological imaging, where visually similar RGB responses may arise from distinct and entangled physiological states. This mismatch motivates a shift from reflectance alignment to physiology-aware representation learning, grounded in shared light-matter interaction principles -- a shift that introduces fundamental challenges from cross-channel semantic shifts (C1) and irreversible information loss in RGB-based acquisition (C2). We therefore design PHASE, a physiology-aware hyperspectral reconstruction paradigm that fundamentally redefines object-to-human transfer by disentangling cross-channel physiological semantics via Physiological Channel Reinterpretation and restricting reconstruction to physiologically plausible solutions through Physiologically Constrained Alignment. Under two source-to-target transfer protocols, PHASE consistently outperforms state-of-the-art methods by up to +2.20 SSIM and -3.06 in SAM with merely 1.5% labeled supervision.
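The SAM figure quoted in this abstract can be made concrete with a small sketch. It assumes that SAM denotes the spectral angle mapper, as is usual in hyperspectral work: the angle between a reconstructed spectrum and the reference spectrum at one pixel, where lower is better. The function name and the sample spectra are illustrative and do not come from the paper.

```python
import math

def spectral_angle(pred, target, degrees=True):
    """Angle between two spectra treated as vectors (one value per band)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_target = math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_pred * norm_target)))
    angle = math.acos(cosine)
    return math.degrees(angle) if degrees else angle

# Hypothetical 3-band spectra: a pure brightness change leaves the angle near zero,
# while orthogonal spectra are 90 degrees apart.
same_shape = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = spectral_angle([1.0, 0.0], [0.0, 1.0])
```

Because the angle ignores overall scale, it measures the shape of the spectrum and not its brightness, which is why it is usually reported alongside an image-space metric such as SSIM.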

2511.10055 2026-06-03 cs.CV 版本更新

Physical Plausibility Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

通过 HCM-GRPO 实现物理合理性推理:赋能紧凑模型以获得卓越性能

Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

发表机构 * Tsinghua University(清华大学) Alibaba Health Information Technology Limited(阿里巴巴健康信息技术有限公司)

AI总结 针对多模态大语言模型在物理合理性推理中数据缺乏和推理能力弱的问题,提出包含大规模数据集和 HCM-GRPO 方法的完整解决方案,以紧凑模型超越大规模开源和闭源模型。

详情
AI中文摘要

近年来,图像生成的性能得到了显著提升。然而,图像筛选的研究很少,且由于缺乏数据以及多模态大语言模型(MLLMs)中物理合理性推理能力较弱,其性能并不令人满意。在这项工作中,我们提出了一个完整的解决方案,从数据和方法论两方面解决这些问题。在数据方面,我们收集了一个包含超过 128k 样本的综合图像筛选数据集,涉及约 640k 张图像。每个样本由一张原始图像和四张生成图像组成。该数据集从四个方面评估物理合理性推理能力:外观变形、物理阴影、放置布局和扩展合理性。关于数据标注,我们研究了多种方法,包括纯人工、全自动和答案驱动的标注,以最经济的方式获取高质量的思维链(CoT)数据。在方法论上,我们将一种硬案例挖掘(HCM)策略与动态比例准确率(DPA)奖励引入到组相对策略优化(GRPO)框架中,称为 HCM-GRPO。与原始 GRPO 相比,这种增强方法展示了更优越的物理合理性推理能力。我们的实验结果表明,即使是像 GPT5.2 和 Gemini3-Pro 这样的最先进的闭源 MLLMs,在物理合理性推理方面也表现出不令人满意的性能。相比之下,通过利用 HCM-GRPO,我们能够以更小的模型超越大规模开源和领先闭源模型的分数。

英文摘要

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare, and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak physical plausibility reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, comprising about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the physical plausibility reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior physical plausibility reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT5.2 and Gemini3-Pro, exhibit unsatisfactory performance in physical plausibility reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.

2511.02417 2026-06-03 cs.CV cs.RO 版本更新

CropCraft: A Procedural World Generator for Robotic Simulation of Agricultural Tasks

CropCraft:用于农业任务机器人仿真的程序化世界生成器

Riccardo Bertoglio, Cyrille Pierre, Johann Laconte, Roland Lenain

发表机构 * Institut National de la Recherche Agronomique(法国国家农业科研院)

AI总结 提出基于Blender和Python的开源程序化世界生成器CropCraft,通过YAML配置生成多样化农田场景,支持间作、葡萄园和杂草田,并生成带标注的3D仿真环境,用于农业机器人感知和导航算法开发。

详情
AI中文摘要

现代农业中农业生态实践的采用要求机器人系统能够在高度多样化和复杂的田间环境中运行。开发和评估此类系统严重依赖仿真,但生成代表农业生态多样性的逼真且可配置的3D环境仍然是一个主要挑战。本文提出了 CropCraft,一个基于 Blender 和 Python 构建的开源程序化世界生成器,旨在生成适用于农业机器人的3D仿真环境。CropCraft 通过简单的 YAML 配置文件生成作物田,支持多种场景,包括间作、葡萄园和杂草丛生的田地。该工具包含一个多生长阶段的3D植物模型库(作物、草和杂草),并使用随机放置算法真实地再现实际田地中观察到的空间变异性。生成的场景可直接导入 Gazebo 仿真器,并包含所有放置元素的地面真值标注,支持感知和导航算法的开发。为了展示 CropCraft 的实际用途,我们将其应用于使用深度学习的作物-杂草语义分割任务。生成了包含10,000张玉米田合成图像的数据集,这些图像具有不同的杂草密度、生长阶段和光照条件,并用于训练多个分割架构。仅使用合成数据训练的模型在真实田间图像上实现了约10%的平均交并比(mIoU)的仿真到现实(sim-to-real)差距,优于先前的先进合成生成方法。我们进一步表明,即使将少量真实图像与合成数据结合,也能提高跨领域的泛化能力,为农业感知任务中合成数据的有效使用提供了新见解。

英文摘要

The adoption of agroecological practices in modern agriculture requires robotic systems capable of operating in highly diverse and complex field environments. Developing and evaluating such systems relies heavily on simulation, yet generating realistic and configurable 3D environments representative of agroecological diversity remains a major challenge. This paper presents CropCraft, an open-source procedural world generator built on Blender and Python, designed to produce 3D simulation environments tailored to agricultural robotics. CropCraft generates crop fields from a simple YAML configuration file, supporting a wide range of scenarios including intercropping, vineyards, and weed-infested fields. The tool includes a library of 3D plant models (crops, grasses, and weeds) at multiple growth stages, and uses stochastic placement algorithms to realistically reproduce the spatial variability observed in real fields. Generated worlds are directly importable into the Gazebo simulator and include ground-truth annotations for all placed elements, supporting both perception and navigation algorithm development. To demonstrate the practical utility of CropCraft, we apply it to the task of crop-weed semantic segmentation using deep learning. A dataset of 10,000 synthetic images of maize fields with varying weed densities, growth stages, and lighting conditions was generated and used to train several segmentation architectures. Models trained exclusively on synthetic data achieve a sim-to-real gap of approximately 10% mean Intersection over Union (mIoU) on real field images, outperforming previous state-of-the-art synthetic generation approaches. We further show that combining even a few real images with synthetic data improves generalization across domains, providing new insights into the effective use of synthetic data for agricultural perception tasks.
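The sim-to-real gap above is stated in mean Intersection over Union. A minimal sketch of that metric follows, assuming flat per-pixel label lists and a hypothetical three-class labelling (0 = soil, 1 = crop, 2 = weed); it is not taken from the CropCraft code base.

```python
def mean_iou(pred, truth, num_classes):
    """Average the per-class intersection-over-union, skipping absent classes."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:  # a class missing from both maps contributes nothing
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Hypothetical six-pixel example.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, truth, num_classes=3)
```

Under this reading, a sim-to-real gap would be the difference between the score on held-out synthetic images and the score on real field images for the same model.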

2510.13565 2026-06-03 cs.CV 版本更新

XD-RCDepth: Lightweight Radar-Camera Depth Estimation with Explainability-Aligned and Distribution-Aware Distillation

XD-RCDepth: 轻量级雷达-相机深度估计,具有可解释性对齐和分布感知蒸馏

Huawei Sun, Zixu Wang, Xiangyuan Peng, Julius Ott, Georg Stettinger, Lorenzo Servadei, Robert Wille

发表机构 * Technical University of Munich(慕尼黑技术大学) Infineon Technologies AG(英飞凌科技)

AI总结 提出轻量级雷达-相机深度估计架构XD-RCDepth,通过可解释性对齐蒸馏和深度分布蒸馏减少参数29.7%并保持精度,在nuScenes和ZJU-4DRadarCam数据集上实现实时性能。

详情
AI中文摘要

深度估计仍然是自动驾驶的核心,雷达-相机融合通过提供互补的几何线索在恶劣条件下提供鲁棒性。在本文中,我们提出XD-RCDepth,一种轻量级架构,相对于最先进的轻量级基线减少了29.7%的参数,同时保持相当的精度。为了在压缩下保持性能并增强可解释性,我们引入了两种知识蒸馏策略:可解释性对齐蒸馏,将教师的显著性结构迁移给学生;以及深度分布蒸馏,将深度回归重新表述为离散箱上的软分类。这些组件共同将MAE相对于直接训练降低了7.97%,并在nuScenes和ZJU-4DRadarCam数据集上以实时效率提供了有竞争力的精度。代码:https://github.com/harborsarah/XD_RCDepth

英文摘要

Depth estimation remains central to autonomous driving, and radar-camera fusion offers robustness in adverse conditions by providing complementary geometric cues. In this paper, we present XD-RCDepth, a lightweight architecture that reduces the parameter count by 29.7% relative to the state-of-the-art lightweight baseline while maintaining comparable accuracy. To preserve performance under compression and enhance interpretability, we introduce two knowledge-distillation strategies: an explainability-aligned distillation that transfers the teacher's saliency structure to the student, and a depth-distribution distillation that recasts depth regression as soft classification over discretized bins. Together, these components reduce the MAE by 7.97% compared with direct training and deliver competitive accuracy with real-time efficiency on the nuScenes and ZJU-4DRadarCam datasets. Code: https://github.com/harborsarah/XD_RCDepth
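The idea of recasting depth regression as soft classification over bins can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the uniform bin centres, the two-bin linear interpolation, and the KL-divergence distillation term are all choices made here for illustration.

```python
import math

def soft_bin_label(depth, centers):
    """Spread a scalar depth over its two nearest bin centres by linear interpolation."""
    probs = [0.0] * len(centers)
    if depth <= centers[0]:
        probs[0] = 1.0
        return probs
    if depth >= centers[-1]:
        probs[-1] = 1.0
        return probs
    for i in range(len(centers) - 1):
        lo, hi = centers[i], centers[i + 1]
        if lo <= depth <= hi:
            w = (depth - lo) / (hi - lo)
            probs[i], probs[i + 1] = 1.0 - w, w
            break
    return probs

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as the teacher-to-student distillation term."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def expected_depth(probs, centers):
    """Recover a scalar depth from a bin distribution."""
    return sum(p * c for p, c in zip(probs, centers))

centers = [5.0, 15.0, 25.0, 35.0]         # hypothetical bin centres in metres
label = soft_bin_label(20.0, centers)      # half the mass on 15 m, half on 25 m
teacher = softmax([0.1, 2.0, 1.0, -1.0])   # hypothetical per-pixel teacher logits
student = softmax([0.0, 1.0, 1.5, -0.5])   # hypothetical per-pixel student logits
distill_loss = kl_divergence(teacher, student)
```

Matching whole distributions gives the student information about the teacher's uncertainty across neighbouring depths, which a single regressed value cannot carry.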

2510.09845 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Harnessing Self-Supervised Deep Learning and Geostationary Remote Sensing for Advancing Wildfire and Associated Air Quality Monitoring: Improved Smoke and Fire Front Masking using GOES and TEMPO Radiance Data

利用自监督深度学习和地球静止遥感推进野火及相关空气质量监测:使用GOES和TEMPO辐射数据改进烟雾和火锋掩膜

Nicholas LaHaye, Thilanka Munashinge, Hugo Lee, Xiaohua Pan, Gonzalo Gonzalez Abad, Hazem Mahmoud, Jennifer Wei

AI总结 本研究利用NASA TEMPO卫星任务的每小时数据和自监督深度学习,提出了一种创新系统,通过GOES-18和TEMPO数据有效区分烟雾与云层,实时绘制野火火锋和烟雾羽流,显著优于现有业务产品。

Comments https://2025.ieeeigarss.org/view_paper.php?PaperNum=6389&SessionID=1611

详情
AI中文摘要

这项工作展示了通过利用NASA的TEMPO卫星任务前所未有的每小时数据以及自监督深度学习的进展,改善美国西部野火和空气质量管理的可能性。我们展示了一种创新的自监督深度学习系统在绘制近实时每小时野火火锋和烟雾羽流扩散方面的有效性:成功使用GOES-18和TEMPO数据区分烟雾与云层,不同传感模态生成的烟雾和火掩膜之间具有强一致性,并且对于相同案例相比业务产品有显著改进。

英文摘要

This work demonstrates the possibilities for improving wildfire and air quality management in the western United States by leveraging the unprecedented hourly data from NASA's TEMPO satellite mission and advances in self-supervised deep learning. Here we demonstrate the efficacy of an innovative self-supervised deep learning system for mapping the near-real-time hourly spread of wildfire fronts and smoke plumes. The system successfully distinguishes smoke plumes from clouds using GOES-18 and TEMPO data, shows strong agreement between the smoke and fire masks generated from different sensing modalities, and significantly improves on operational products for the same cases.

2510.03316 2026-06-03 cs.CV cs.AI cs.LG 版本更新

The View From Space: Navigating Instrumentation Differences with EOFMs

从太空视角:利用EOFMs导航仪器差异

Ryan P. Demilt, Nicholas LaHaye, Karis Tenneson

发表机构 * Spatial Informatics Group(空间信息组)

AI总结 本研究通过分析地球观测基础模型(EOFMs)对传感器架构的敏感性,揭示了当前模型设计的缺陷,并为模型开发者、用户和遥感科学社区指明了前进方向。

详情
Journal ref
https://neurips.cc/virtual/2025/loc/san-diego/122891
AI中文摘要

地球观测基础模型(EOFMs)作为处理大量遥感及其他地球观测数据、并对许多关键地球监测任务产生影响的工具,其普及程度急剧上升。一个新兴趋势是利用预训练模型的输出作为“嵌入”,这些嵌入总结了高维数据,可用于通用任务,如相似性搜索和内容特定查询。然而,大多数EOFMs仅在单一模态数据上训练,然后通过匹配不同模态的波段进行应用或基准测试。现有工作尚不清楚多样化的传感器架构如何影响当前EOFMs套件的内部表示。我们在本工作中表明,EOFMs的表示空间对传感器架构高度敏感,理解这一差异为我们提供了关于当前EOFMs设计陷阱的关键视角,并指明了作为模型开发者、用户以及以稳健遥感科学为指导的社区应如何前进的方向。

英文摘要

Earth Observation Foundation Models (EOFMs) have exploded in prevalence as tools for processing the massive volumes of remotely sensed and other earth observation data, and for delivering impact on the many essential earth monitoring tasks. An emerging trend posits using the outputs of pre-trained models as 'embeddings' which summarize high dimensional data to be used for generic tasks such as similarity search and content-specific queries. However, most EOFM models are trained only on single modalities of data and then applied or benchmarked by matching bands across different modalities. It is not clear from existing work what impact diverse sensor architectures have on the internal representations of the present suite of EOFMs. We show in this work that the representation space of EOFMs is highly sensitive to sensor architecture and that understanding this difference gives a vital perspective on the pitfalls of current EOFM design and signals for how to move forward as model developers, users, and a community guided by robust remote-sensing science.

2509.25859 2026-06-03 cs.CV cs.SY eess.SY 版本更新

LiDAR Point Cloud Colourisation Using Multi-Camera Fusion and Low-Light Image Enhancement

使用多相机融合和低光图像增强的LiDAR点云着色

Pasindu Ranasinghe, Dibyayan Patra, Bikram Banerjee, Simit Raval

AI总结 提出一种硬件无关的方法,通过多相机融合和低光增强模块,实现机械LiDAR点云的360度着色,在低光照条件下仍能恢复场景细节。

详情
Journal ref
Sensors 25(21), 6582 (2025)
AI中文摘要

近年来,相机数据与LiDAR测量的融合已成为增强空间理解的一种强大方法。本研究引入了一种新颖的、与硬件无关的方法,该方法使用多个相机输入从机械LiDAR生成着色点云,提供完整的360度覆盖。主要创新在于其在低光照条件下的鲁棒性,这是通过在融合管道中集成低光图像增强模块实现的。系统需要初始校准以确定相机内参,然后自动计算LiDAR与相机之间的几何变换,无需专门的校准目标,简化了设置。数据处理框架使用颜色校正来确保融合前相机馈送的一致性。该算法使用Velodyne Puck Hi-Res LiDAR和四相机配置进行了测试。优化后的软件实现了实时性能,即使在极低照度下也能可靠着色,成功恢复了原本无法检测的场景细节。

英文摘要

In recent years, the fusion of camera data with LiDAR measurements has emerged as a powerful approach to enhance spatial understanding. This study introduces a novel, hardware-agnostic methodology that generates colourised point clouds from mechanical LiDAR using multiple camera inputs, providing complete 360-degree coverage. The primary innovation lies in its robustness under low-light conditions, achieved through the integration of a low-light image enhancement module within the fusion pipeline. The system requires initial calibration to determine intrinsic camera parameters, followed by automatic computation of the geometric transformation between the LiDAR and cameras, removing the need for specialised calibration targets and streamlining the setup. The data processing framework uses colour correction to ensure uniformity across camera feeds before fusion. The algorithm was tested using a Velodyne Puck Hi-Res LiDAR and a four-camera configuration. The optimised software achieved real-time performance and reliable colourisation even under very low illumination, successfully recovering scene details that would otherwise remain undetectable.
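The core geometric step behind point-cloud colourisation can be sketched as below: transform each LiDAR point into a camera frame using the extrinsic rotation and translation, project it with the pinhole intrinsics, and read the pixel colour. All numeric values are hypothetical, lens distortion and the multi-camera and low-light stages described in the abstract are omitted, and the function names are not from the paper.

```python
def transform(point, rotation, translation):
    """Apply a 3x3 rotation and a translation (LiDAR frame to camera frame)."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def colourise(points_lidar, image, rotation, translation, fx, fy, cx, cy):
    """Pair each point that lands inside the image with the colour at that pixel."""
    height, width = len(image), len(image[0])
    coloured = []
    for p in points_lidar:
        uv = project(transform(p, rotation, translation), fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        if 0 <= u < width and 0 <= v < height:
            coloured.append((p, image[v][u]))
    return coloured

# Hypothetical 4x4 RGB image and an identity extrinsic (frames already aligned).
image = [[(row * 10, col * 10, 0) for col in range(4)] for row in range(4)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 5.0), (0.0, 0.0, -5.0), (10.0, 0.0, 1.0)]
result = colourise(points, image, identity, (0.0, 0.0, 0.0), 2.0, 2.0, 2.0, 2.0)
```

Only the first point receives a colour here: the second lies behind the camera and the third projects outside the image, which is the situation that additional cameras are meant to cover in a 360-degree setup.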

2505.17659 2026-06-03 cs.RO cs.CV 版本更新

Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling

Plan-R1:安全且可行的轨迹规划作为语言建模

Xiaolong Tang, Meina Kan, Shiguang Shan, Xilin Chen

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出Plan-R1两阶段轨迹规划框架,通过原则对齐与行为学习解耦,结合规则奖励和方差解耦GRPO,显著提升自动驾驶规划的安全性和可行性。

Comments Accepted by ICLR2026

详情
AI中文摘要

安全且可行的轨迹规划对于现实世界的自动驾驶系统至关重要。然而,现有的基于学习的规划器严重依赖专家演示,这不仅缺乏明确的安全意识,还可能继承次优人类驾驶数据中的不良行为(如超速)。受大型语言模型成功的启发,我们提出了Plan-R1,一种两阶段轨迹规划框架,将原则对齐与行为学习解耦。在第一阶段,通用轨迹预测器在专家数据上进行预训练,以捕获多样化的、类人的驾驶行为。在第二阶段,使用基于规则的奖励通过组相对策略优化(GRPO)对模型进行微调,明确地将自我规划与安全、舒适和交通规则遵守等原则对齐。这种两阶段范式保留了类人行为,同时增强了安全意识并丢弃了演示中的不良模式。此外,我们识别了直接应用GRPO到规划的一个关键限制:组级归一化消除了跨组的尺度差异,导致罕见、高方差的安全违规组与大量低方差的安全组具有相似的优势,从而抑制了对安全关键目标的优化。为解决此问题,我们提出了方差解耦GRPO(VD-GRPO),用中心化和固定缩放替代归一化以保留绝对奖励幅度,确保安全关键目标在整个训练过程中保持主导地位。在nuPlan基准上的实验表明,Plan-R1显著提高了规划的安全性和可行性,达到了最先进的性能,特别是在现实反应性设置中。我们的代码可在https://github.com/XiaolongTang23/Plan-R1获取。

英文摘要

Safe and feasible trajectory planning is critical for real-world autonomous driving systems. However, existing learning-based planners rely heavily on expert demonstrations, which not only lack explicit safety awareness but also risk inheriting undesirable behaviors such as speeding from suboptimal human driving data. Inspired by the success of large language models, we propose Plan-R1, a two-stage trajectory planning framework that decouples principle alignment from behavior learning. In the first stage, a general trajectory predictor is pre-trained on expert data to capture diverse, human-like driving behaviors. In the second stage, the model is fine-tuned with rule-based rewards using Group Relative Policy Optimization (GRPO), explicitly aligning ego planning with principles such as safety, comfort, and traffic rule compliance. This two-stage paradigm retains human-like behaviors while enhancing safety awareness and discarding undesirable patterns from demonstrations. Furthermore, we identify a key limitation of directly applying GRPO to planning: group-wise normalization erases cross-group scale differences, causing rare, high-variance safety-violation groups to have similar advantages as abundant low-variance safe groups, thereby suppressing optimization for safety-critical objectives. To address this, we propose Variance-Decoupled GRPO (VD-GRPO), which replaces normalization with centering and fixed scaling to preserve absolute reward magnitudes, ensuring that safety-critical objectives remain dominant throughout training. Experiments on the nuPlan benchmark demonstrate that Plan-R1 significantly improves planning safety and feasibility, achieving state-of-the-art performance, particularly in realistic reactive settings. Our code is available at https://github.com/XiaolongTang23/Plan-R1.
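The limitation described above can be seen in a small numerical sketch. It assumes the usual GRPO advantage (reward minus group mean, divided by the group standard deviation) and contrasts it with centring followed by a fixed scale, as the abstract describes for VD-GRPO. The reward values and the scale constant are hypothetical, and this is not the authors' code.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-wise normalization: centre, then divide by the group's own std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def centered_fixed_scale_advantages(rewards, scale=1.0):
    """Centre each group, then divide by one constant shared by every group."""
    mu = mean(rewards)
    return [(r - mu) / scale for r in rewards]

def largest(values):
    return max(abs(v) for v in values)

safe_group = [0.98, 1.00, 0.99, 1.00]    # all rollouts safe, tiny reward spread
violation_group = [1.0, -5.0, 1.0, 1.0]  # one rollout incurs a large safety penalty

normalized_safe = largest(grpo_advantages(safe_group))
normalized_violation = largest(grpo_advantages(violation_group))
centered_safe = largest(centered_fixed_scale_advantages(safe_group))
centered_violation = largest(centered_fixed_scale_advantages(violation_group))
```

With per-group normalization both groups end up with advantages of similar size, so the collision is weighted about as heavily as a 0.01 reward difference. With centring and a fixed scale, the violation group keeps an advantage hundreds of times larger than the safe group's.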

2507.09105 2026-06-03 cs.CV 版本更新

Hybrid Autoregressive-Diffusion Model for Real-Time Sign Language Production

混合自回归-扩散模型用于实时手语生成

Maoxiao Ye, Xinfeng Ye, Mano Manoharan

发表机构 * University of Auckland(奥克兰大学)

AI总结 提出HybridSign混合自回归-扩散模型,结合因果帧生成与基于流的扩散精炼,实现低延迟高质量手语生成,在PHOENIX14T和How2Sign上取得最佳质量-效率权衡。

Comments Accepted at ACL 2026

详情
AI中文摘要

早期的手语生成(SLP)模型通常依赖于自回归解码,这自然保持了时间因果性,但在推理时会出现错误累积。最近的基于扩散的方法通过迭代去噪提高了生成质量,但其序列级精炼过程引入了大量延迟。为了解决这一权衡问题,我们提出了HybridSign,一种用于低延迟手语生成的混合自回归-扩散模型,它结合了因果帧生成与基于流的扩散精炼。多尺度姿态表示模块捕获细粒度发音特征,而置信度感知因果注意力机制利用关节级置信度分数提高在噪声2D姿态观测下的鲁棒性。在PHOENIX14T和How2Sign上的实验表明,HybridSign在比较的基线中始终实现了最佳的质量-效率权衡。在How2Sign测试集上,在60帧评估协议下,它达到了BLEU-1/4分数30.12/6.48和DTW 3.89,同时将首帧时间减少到5.90秒,吞吐量提高到10.17 FPS。

英文摘要

Earlier Sign Language Production (SLP) models typically relied on autoregressive decoding, which naturally preserves temporal causality but suffers from error accumulation at inference time. More recent diffusion-based approaches improve generation quality through iterative denoising, yet their sequence-level refinement process introduces substantial latency. To address this trade-off, we propose HybridSign, a hybrid autoregressive-diffusion model for low-latency sign language production that combines causal frame generation with flow-based diffusion refinement. A Multi-Scale Pose Representation module captures fine-grained articulator features, while a Confidence-Aware Causal Attention mechanism leverages joint-level confidence scores to improve robustness under noisy 2D pose observations. Experiments on PHOENIX14T and How2Sign show that HybridSign consistently achieves the best quality--efficiency trade-off among the compared baselines. On the How2Sign test split, it reaches BLEU-1/4 scores of 30.12/6.48 and DTW of 3.89, while reducing time-to-first-frame to 5.90s and increasing throughput to 10.17 FPS under a 60-frame evaluation protocol.
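The DTW score reported above compares a generated pose sequence with a reference that may be performed at a different speed. The sketch below shows classic dynamic time warping with a Euclidean per-frame distance. The paper's exact variant (joint set, normalization) is not stated in the abstract, so this is an assumption, and the toy 2D "poses" are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two pose frames given as flat coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Minimum cumulative frame distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slower = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # same motion, one frame held
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]             # same timing, offset by 1

timing_only = dtw_distance(reference, slower)
spatial_error = dtw_distance(reference, shifted)
```

The held frame costs nothing because the alignment absorbs it, while the spatial offset accumulates one unit per frame, so the metric penalizes wrong poses and not differences in tempo.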

2509.11323 2026-06-03 cs.CV cs.AI 版本更新

Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding

基于语义无关编码的KalmanNet多目标跟踪运动估计

Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long

AI总结 提出语义无关KalmanNet(SIKNet),通过语义无关编码器(SIE)改进运动估计,在MOT中比传统卡尔曼滤波和学习辅助滤波器更鲁棒、更准确。

详情
AI中文摘要

运动估计是多目标跟踪(MOT)中的关键组成部分。它通过分析连续帧图像中物体位置的变化来预测物体的轨迹,减少跟踪失败和身份切换。基于线性恒速模型的卡尔曼滤波器(KF)是MOT中最常用的方法之一。然而,当KF参数不匹配且物体非平稳运动时,可能产生不理想的结果。在这项工作中,我们利用学习辅助滤波器来处理MOT的运动估计。具体地,我们提出了一种名为语义无关KalmanNet(SIKNet)的新方法,该方法通过两步使用语义无关编码器(SIE)对状态向量(输入特征)进行编码。首先,SIE使用核大小为1的一维卷积,该卷积沿不同状态向量中同语义元素维度进行卷积,以编码独立的语义信息。然后,它采用全连接层和非线性激活层来编码异语义元素之间的非线性和交叉依赖信息。为了独立评估MOT中运动估计模块的性能,我们从几个开源MOT数据集构建了一个大规模半模拟数据集。实验结果表明,所提出的SIKNet优于传统KF,并且比现有的学习辅助滤波器具有更好的鲁棒性和准确性。代码可在(https://github.com/SongJgit/filternet 和 https://github.com/SongJgit/TBDTracker)获取。

英文摘要

Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectories of objects by analyzing the changes in their positions across consecutive frames, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when the KF's parameters are mismatched and objects move non-stationarily. In this work, we use a learning-aided filter to handle motion estimation in MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) with a Semantic-Independent Encoder (SIE) in two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves better robustness and accuracy than existing learning-aided filters. The code is available at https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker.
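One possible reading of the two encoder steps is sketched below. It assumes each state vector is treated as one input channel, so the kernel-size-1 convolution combines the same semantic element (for example, x with x) across vectors without mixing x with y, and that a fully-connected layer followed by ReLU then mixes the different elements. The weights, the (x, y, w, h) layout, and the choice of ReLU are hypothetical; the linked repositories hold the actual implementation.

```python
def conv1d_kernel1(vectors, weight, bias):
    """Kernel-size-1 1D convolution over K input channels of length D.

    Output position d depends only on position d of the inputs, so elements
    with different semantics are never combined in this step.
    """
    length = len(vectors[0])
    return [
        [b + sum(w * v[d] for w, v in zip(w_row, vectors)) for d in range(length)]
        for w_row, b in zip(weight, bias)
    ]

def fully_connected(x, weight, bias):
    return [b + sum(w * xi for w, xi in zip(row, x)) for row, b in zip(weight, bias)]

def relu(xs):
    return [max(0.0, x) for x in xs]

# Two hypothetical state vectors laid out as (x, y, w, h).
prev_state = [10.0, 20.0, 4.0, 8.0]
observation = [11.0, 22.0, 4.0, 8.0]

# Step 1: per-element mixing across vectors (channel 0 = difference, channel 1 = average).
conv_w = [[-1.0, 1.0], [0.5, 0.5]]
conv_b = [0.0, 0.0]
features = conv1d_kernel1([prev_state, observation], conv_w, conv_b)

# Step 2: flatten, then mix across elements with a dense layer and a nonlinearity.
flat = [value for channel in features for value in channel]
fc_w = [
    [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # sums the x and y differences
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0],  # average width minus average height
]
fc_b = [0.0, 0.0]
encoded = relu(fully_connected(flat, fc_w, fc_b))
```

The split keeps the first step blind to which element it is processing, while the dense layer is the only place where position, size, and velocity terms can interact.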

2509.03376 2026-06-03 cs.CV 版本更新

Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing

Transformer引导的内容自适应图学习用于高光谱解混

Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu

发表机构 * School of Automation Engineering, Shanghai University of Electric Power(上海电力大学自动化工程学院) School of Mechatronic Engineering and Automation, Shanghai University(上海大学机电工程与自动化学院) School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能系统工程学院)

AI总结 提出T-CAGU框架,结合Transformer捕获全局依赖和内容自适应图神经网络增强局部关系,通过多阶传播动态学习图结构并引入图残差机制,实现高光谱图像的高效解混。

详情
AI中文摘要

高光谱解混(HU)旨在将遥感图像中的每个混合像素分解为一组端元及其对应的丰度。尽管深度学习在该领域取得了显著进展,但大多数方法无法同时表征全局依赖和局部一致性,难以保持长程交互和边界细节。本文提出了一种新颖的Transformer引导的内容自适应图解混框架(T-CAGU),通过采用Transformer捕获全局依赖并引入内容自适应图神经网络增强局部关系,克服了这些挑战。与以往工作不同,T-CAGU集成多个传播阶次以动态学习图结构,确保对噪声的鲁棒性。此外,T-CAGU利用图残差机制保留全局信息并稳定训练。实验结果表明其优于最先进的方法。我们的代码可在https://github.com/xianchaoxiu/T-CAGU获取。

英文摘要

Hyperspectral unmixing (HU) aims to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.

2508.15130 2026-06-03 cs.CV 版本更新

HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

HiRQA: 面向无意见图像质量评估的层次化排序与质量对齐

Vaishnav Ramesh, Haining Wang, Md Jahidul Islam

AI总结 提出HiRQA框架,通过层次化排序和对比学习实现自监督无参考图像质量评估,无需主观标签即可泛化到真实失真场景。

Comments Accepted for publication in Machine Vision and Applications

详情
AI中文摘要

尽管无参考图像质量评估(NR-IQA)取得了显著进展,但数据集偏差和对主观标签的依赖仍阻碍其泛化性能。我们提出HiRQA(层次化排序与质量对齐),一个自监督、无意见的框架,通过结合排序和对比学习提供层次化的质量感知嵌入。与依赖于推理时的原始参考或辅助模态的先前方法不同,HiRQA仅使用输入图像预测质量分数。我们引入了一种新颖的高阶排序损失,通过失真对之间的关系排序来监督质量预测,以及一个嵌入距离损失,强制特征距离与感知差异之间的一致性。由结构化文本提示引导的训练时对比对齐损失进一步增强了学习到的表示。仅在合成图像失真上训练的HiRQA能够泛化到真实退化,通过对各种未见失真(如镜头光晕、雾霾、运动模糊和低光条件)的全面评估得到了证明。为了实时部署,我们引入了HiRQA-S,一个轻量级变体,每张图像的推理时间仅为3.5毫秒。在合成和真实基准上的大量实验验证了HiRQA的竞争性能、强泛化能力和可扩展性。HiRQA模型和推理管道可在https://github.com/uf-robopi/HiRQA获取。

英文摘要

Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder the generalization of NR-IQA models. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic image distortions, HiRQA generalizes to authentic degradations, as demonstrated through comprehensive evaluations on various unseen distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA's competitive performance, strong generalization ability, and scalability. The HiRQA model and inference pipeline are available at https://github.com/uf-robopi/HiRQA.

2508.05852 2026-06-03 cs.CV 版本更新

Interpretable Modeling of Driver Attention Shifts with a Vision-Language Model

基于视觉-语言模型的驾驶员注意力转移可解释建模

Kaiser Hamid, Khandakar Ashrafi Akbar, Peihang Li, Nade Liang

发表机构 * Texas Tech University(德克萨斯理工大学) Towson University(托森大学)

AI总结 本研究通过少量人工监督微调视觉-语言模型,生成可解释的驾驶员注意力转移描述,以补充传统注视热图,提升人因分析、监控和态势感知支持。

详情
AI中文摘要

驾驶员注视通常被建模为空间热图,但热图本身难以解释,因为它们不说明正在监控哪个道路对象或区域,也不说明注意力转移为何重要。本研究探讨了最小的人工监督是否能够引导视觉-语言模型生成驾驶员注意力转移的可解释描述。利用Berkeley DeepDrive-Attention数据集中选定的高变化注视时刻,我们比较了零样本、单样本和LoRA微调VLM条件与人工精炼参考描述和专家评分。结果表明,使用80个专家精炼的注意力示例进行微调,相对于未引导的VLM输出,提高了ROUGE-L、METEOR、实体对齐F1和人类对齐分数。研究结果表明,基于语言的描述可以通过使驾驶员注意力更易于人因分析、驾驶员监控审查和态势感知支持来补充注视热图。

英文摘要

Driver gaze is commonly modeled as a spatial heatmap, but heatmaps alone are difficult for humans to interpret because they do not explain which road object or region is being monitored or why an attention shift may matter. This study examines whether minimal human-grounded supervision can steer a vision--language model toward interpretable descriptions of driver attention shifts. Using selected high-change gaze moments from the Berkeley DeepDrive-Attention dataset, we compare zero-shot, one-shot, and LoRA fine-tuned VLM conditions against human-refined reference descriptions and expert ratings. Results show that fine-tuning with 80 expert-refined attention examples improves ROUGE-L, METEOR, Entity Alignment F1, and Human Alignment Score relative to unsteered VLM outputs. The findings suggest that language-based descriptions can complement gaze heatmaps by making driver attention more accessible for human-factors analysis, driver-monitoring review, and situation-awareness support.
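Of the metrics listed above, ROUGE-L is the simplest to illustrate: it scores a generated description by the longest common subsequence (LCS) of words it shares with the reference. The sketch below assumes whitespace tokenization and a balanced F-measure; published implementations differ in tokenization and weighting, and the two sentences are hypothetical examples, not data from the study.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference sentence."""
    cand_tokens, ref_tokens = candidate.split(), reference.split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

reference = "driver looks at the pedestrian on the right"
candidate = "driver looks at pedestrian on right"
score = rouge_l(candidate, reference)
```

Here every candidate word appears in order in the reference (precision 1.0), but two reference words are missing (recall 0.75), giving a score of 6/7. Word-overlap scores of this kind say nothing about whether the right road object was named, which is presumably why the study also reports an entity-alignment measure and expert ratings.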

2507.19684 2026-06-03 cs.LG cs.AI cs.CL cs.CV 版本更新

CoMPAS3D: A Dataset and Benchmark for Interactive Motion

CoMPAS3D: 一个用于交互动作的数据集和基准

Bermet Burkanova, Yasaman Etesam, Payam Jome Yazdian, Trinity Evans, Chuxuan Zhang, Zoe Stanley, Paige Tuttösí, Angelica Lim

发表机构 * School of Computing Science, Simon Fraser University(西蒙弗雷泽大学计算科学学院)

AI总结 提出CoMPAS3D数据集和评估框架,通过动作可读性和熟练度适当性等客观指标,解决交互式动作生成中缺乏社交上下文评估的问题。

Comments https://rosielab.github.io/compas3d

详情
AI中文摘要

社交互动型人形机器人必须通过身体与人类互动,实时适应伙伴的动作、意图和能力。这需要模型不仅理解身体如何移动,还要理解在共享社交背景下动作的含义。然而,交互式动作生成的评估框架并未衡量生成的动作是否在共享动作词汇中可读,也不评估其是否适合伙伴的熟练水平。这一差距有两个原因:现有框架依赖运动学指标(如FID和节拍对齐),无法衡量上述特性;现有数据集缺乏动作标注和熟练度变化。萨尔萨舞作为评估领域很合适:即兴、双人、由动作词汇和评判标准(涵盖时机、音乐性、技巧、难度、配合和原创性)指导。我们提出CoMPAS3D,一个即兴双人萨尔萨舞的动作捕捉数据集,附带评估框架,涵盖运动学质量、两个客观指标(动作可读性和熟练度适当性)以及六个基于竞赛的主观维度。数据集包含18名舞者(涵盖初级、中级和高级水平)的3小时即兴表演,超过2800个专家标注片段,涵盖动作类型、错误和风格元素。我们定义了三个基准:动作分类(类似于转录)、熟练度估计(流利度评估)和跟随者生成(对话响应)。微调的视觉语言模型在应用于真实动作序列的客观指标上表现强劲。应用于Duolando和InterGen时,这些指标揭示了运动学指标遗漏的失败。人工评估确认了生成动作与真实动作之间的差距。CoMPAS3D、标注、基准代码和基线结果公开可用。

英文摘要

Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, and abilities. This requires models that understand not just how bodies move, but what movement means in a shared social context. Yet evaluation frameworks for interactive motion generation do not measure whether generated follower motion is legible within a shared movement vocabulary, nor whether it is appropriate to the partner's proficiency level. This gap has two causes: existing frameworks rely on kinematic metrics such as FID and beat alignment that cannot measure either property, and existing datasets lack the move annotations and proficiency variation needed. Salsa is well-suited as an evaluation domain: improvised, dyadic, and governed by a move vocabulary and judging criteria covering timing, musicality, technique, difficulty, partnering, and originality. We present CoMPAS3D, a motion capture dataset of improvised partner salsa paired with an evaluation framework covering kinematic quality, two objective metrics (move legibility and proficiency appropriateness), and six competition-based subjective dimensions. The dataset includes 3 hours of improvisation by 18 dancers spanning beginner, intermediate, and professional levels, with over 2,800 expert-annotated segments covering move types, errors, and stylistic elements. We define three benchmarks: move classification (analogous to transcription), proficiency estimation (fluency assessment), and follower generation (dialogue response). Fine-tuned vision-language models perform strongly on objective metrics applied to ground-truth motion sequences. Applied to Duolando and InterGen, the metrics reveal failures that kinematic metrics miss. Human evaluations confirm the gap between generated and ground-truth motion. CoMPAS3D, annotations, benchmark code, and baseline results are publicly available.

2506.04367 2026-06-03 cs.CV 版本更新

Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks

微调视频变换器用于词级孟加拉手语:分类任务的比较分析

Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan

发表机构 * Systems and Software Lab (SSL), Department of CSE, Islamic University of Technology (IUT)(伊斯兰科技大学(IUT)计算机科学与工程系系统与软件实验室)

AI总结 本研究通过微调VideoMAE、ViViT和TimeSformer三种视频变换器模型,在BdSLW60和BdSLW401数据集上实现了高精度孟加拉手语识别,其中VideoMAE在帧率校正后的BdSLW60上达到95.5%准确率。

Comments 16 pages, 8 figures, 6 tables

详情
Journal ref
PLOS ONE, Vol. 21, No. 5, e0341909, 2026
AI中文摘要

手语识别(SLR)涉及从图像或视频中自动识别和分类手势,将其转换为文本或语音,以改善听障社区的可访问性。在孟加拉国,孟加拉手语(BdSL)是许多听障人士的主要交流方式。本研究在BdSLW60(arXiv:2402.08635)上微调了最先进的视频变换器架构——VideoMAE、ViViT和TimeSformer,BdSLW60是一个包含60个频繁手势的小规模BdSL数据集。我们将视频标准化为30 FPS,得到9,307个用户试用片段。为了评估可扩展性和鲁棒性,模型还在BdSLW401(arXiv:2503.02360)上进行了微调,这是一个包含401个手势类别的大规模数据集。此外,我们还在公开数据集(包括LSA64和WLASL)上进行了基准测试。应用了随机裁剪、水平翻转和短边缩放等数据增强技术以提高模型鲁棒性。为了在模型选择期间确保跨折的平衡评估,我们在训练集上采用了10折分层交叉验证,同时使用来自未见用户U4和U8的留出测试数据进行了独立于手语者的评估。结果表明,视频变换器模型显著优于传统的机器学习和深度学习方法。性能受数据集大小、视频质量、帧分布、帧率和模型架构等因素影响。在这些模型中,VideoMAE变体(MCG-NJU/videomae-base-finetuned-kinetics)在帧率校正后的BdSLW60数据集上达到了95.5%的最高准确率,在BdSLW401的正面手势上达到了81.04%——展示了可扩展且准确的BdSL识别的强大潜力。

英文摘要

Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures -- VideoMAE, ViViT, and TimeSformer -- on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 -- demonstrating strong potential for scalable and accurate BdSL recognition.
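The evaluation protocol described above (a signer-independent hold-out plus stratified folds on the remaining clips) can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline: the clip dictionaries, the helper names `signer_independent_split` and `stratified_folds`, and the round-robin fold assignment are all assumptions made for the example.

```python
from collections import defaultdict

def signer_independent_split(clips, test_users):
    """Split clips so that held-out signers never appear in training."""
    train = [c for c in clips if c["user"] not in test_users]
    test = [c for c in clips if c["user"] in test_users]
    return train, test

def stratified_folds(clips, k=10):
    """Assign clips to k folds round-robin within each class label."""
    by_label = defaultdict(list)
    for clip in clips:
        by_label[clip["label"]].append(clip)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        for i, clip in enumerate(by_label[label]):
            folds[i % k].append(clip)
    return folds

# Toy data (hypothetical): 3 sign classes, 10 users, 2 clips per (user, class).
clips = [{"user": f"U{u}", "label": s, "id": (u, s, r)}
         for u in range(1, 11) for s in range(3) for r in range(2)]

# Hold out users U4 and U8, then build 10 stratified folds on the rest.
train, test = signer_independent_split(clips, {"U4", "U8"})
folds = stratified_folds(train, k=10)
```

Splitting by signer before building folds keeps the test users out of model selection entirely, which is what makes the final score signer-independent.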

2505.08886 2026-06-03 cs.CV cs.LG 版本更新

Optimizing Neuro-Fuzzy and Colonial Competition Algorithms for Skin Cancer Diagnosis in Dermatoscopic Images

优化神经模糊与殖民竞争算法用于皮肤镜图像中的皮肤癌诊断

Hamideh Khaleghpour, Brett McKinney

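AI summary: This study combines image processing with neuro-fuzzy and colonial competition algorithms, reaching 94% accuracy on 560 dermoscopic images from the ISIC database, with the aim of supporting early clinical detection of melanoma.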

Comments 7 pages, 10 figures. Accepted at the 2nd Asia Pacific Computer Systems Conference (APCS 2024), March 15-17, 2024

详情
Journal ref
Proceedings of the 2024 7th International Conference on Information and Computer Technologies, pages 166-172, IEEE, March 2024
AI中文摘要

皮肤癌发病率的上升,加上公众意识有限和临床专业知识的不足,凸显了对先进诊断辅助工具的迫切需求。人工智能(AI)已成为该领域有前景的工具,特别是在区分恶性与良性皮肤病变方面。利用公开可用的皮肤病变数据集,研究人员一直在开发基于AI的诊断解决方案。然而,此类计算机系统在临床环境中的整合仍处于初期阶段。本研究旨在通过融合图像处理技术和机器学习算法(特别是神经模糊和殖民竞争方法)来弥合这一差距。应用于ISIC数据库中的皮肤镜图像,我们的方法在560张图像的数据集上达到了94%的显著准确率。这些结果强调了我们的方法在帮助临床医生早期检测黑色素瘤方面的潜力,从而为皮肤癌诊断做出重要贡献。

英文摘要

The rising incidence of skin cancer, coupled with limited public awareness and a shortfall in clinical expertise, underscores an urgent need for advanced diagnostic aids. Artificial Intelligence (AI) has emerged as a promising tool in this domain, particularly for distinguishing malignant from benign skin lesions. Leveraging publicly available datasets of skin lesions, researchers have been developing AI-based diagnostic solutions. However, the integration of such computer systems in clinical settings is still nascent. This study aims to bridge this gap by employing a fusion of image processing techniques and machine learning algorithms, specifically neuro-fuzzy and colonial competition approaches. Applied to dermoscopic images from the ISIC database, our method achieved a notable accuracy of 94% on a dataset of 560 images. These results underscore the potential of our approach in aiding clinicians in the early detection of melanoma, thereby contributing significantly to skin cancer diagnostics.

2406.18544 2026-06-03 cs.CV cs.GR 版本更新

GS-ROR$^2$: Bidirectional-guided 3DGS and SDF for Reflective Object Relighting and Reconstruction

GS-ROR$^2$: 双向引导的3DGS和SDF用于反射物体重光照与重建

Zuo-Liang Zhu, Beibei Wang, Jian Yang

发表机构 * VCIP, College of Computer Science, Nankai University(VCIP,计算机科学学院,南开大学) School of Intelligence Science and Technology, Nanjing University(智能科学与技术学院,南京大学)

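AI summary: Proposes a bidirectional guidance framework that optimizes the relighting model through SDF-aided Gaussian splatting and achieves high-quality geometry reconstruction through GS-guided SDF enhancement, addressing the geometric-constraint and detail-capture problems in relighting and reconstructing reflective objects.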

Comments Accepted by ACM TOG

详情
AI中文摘要

3D高斯溅射(3DGS)因其细致的表达能力和高效的渲染速度,在新视角合成方面展现出强大能力。然而,使用3DGS创建可重光照的3D资产并重建忠实几何仍然存在问题,特别是对于反射物体,其不连续表示给几何约束带来困难。体积符号距离场(SDF)方法提供了鲁棒的几何重建,但昂贵的射线步进阻碍了其实时应用并减慢了训练速度。此外,这些方法难以捕捉尖锐的几何细节。为此,我们提出以互补方式双向引导3DGS和SDF,包括SDF辅助的高斯溅射用于重光照模型的高效优化,以及GS引导的SDF增强用于高质量几何重建。SDF辅助高斯溅射的核心是混合高斯与SDF之间的深度和法线相互监督,避免了SDF昂贵的体积渲染。得益于这种相互监督,学习到的混合高斯以最小的时间成本得到良好约束。由于高斯以延迟着色模式渲染,alpha混合的高斯是平滑的,但单个高斯可能仍然是异常值,产生漂浮伪影。因此,我们引入SDF感知的剪枝策略,移除位于SDF定义表面远处的高斯异常值,避免漂浮问题。这样,我们的GS框架提供了合理的法线并实现了逼真的重光照,但来自深度的网格仍然存在问题。因此,我们设计了GS引导的SDF细化,利用来自高斯的混合法线微调SDF。通过这种增强,我们的方法可以以额外17%的训练时间为代价,为反射物体提供高质量的网格。

英文摘要

3D Gaussian Splatting (3DGS) has shown a powerful capability for novel view synthesis due to its detailed expressive ability and highly efficient rendering speed. Unfortunately, creating relightable 3D assets and reconstructing faithful geometry with 3DGS is still problematic, particularly for reflective objects, as its discontinuous representation raises difficulties in constraining geometries. Volumetric signed distance field (SDF) methods provide robust geometry reconstruction, but their expensive ray marching hinders real-time application and slows training. Besides, these methods struggle to capture sharp geometric details. To this end, we propose to guide 3DGS and SDF bidirectionally in a complementary manner, including an SDF-aided Gaussian splatting for efficient optimization of the relighting model and a GS-guided SDF enhancement for high-quality geometry reconstruction. At the core of our SDF-aided Gaussian splatting is the mutual supervision of the depth and normal between blended Gaussians and SDF, which avoids the expensive volume rendering of SDF. Thanks to this mutual supervision, the learned blended Gaussians are well-constrained with a minimal time cost. As the Gaussians are rendered in a deferred shading mode, the alpha-blended Gaussians are smooth, while individual Gaussians may still be outliers, yielding floater artifacts. Therefore, we introduce an SDF-aware pruning strategy to remove Gaussian outliers located far from the surface defined by the SDF, avoiding the floater issue. This way, our GS framework provides reasonable normals and achieves realistic relighting, while the mesh from depth is still problematic. Therefore, we design a GS-guided SDF refinement, which utilizes the blended normal from Gaussians to finetune the SDF. With this enhancement, our method can further provide high-quality meshes for reflective objects at the cost of 17% extra training time.
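The SDF-aware pruning idea mentioned in the abstract can be illustrated with a small sketch: discard Gaussian centres whose absolute signed distance to the surface exceeds a threshold. The analytic sphere SDF, the function names, and the threshold value are assumptions chosen for the example; the paper's learned SDF and actual pruning criterion may differ.

```python
import math

def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def prune_by_sdf(centers, sdf, max_dist):
    """Keep only Gaussian centres whose |SDF| is at most max_dist."""
    return [c for c in centers if abs(sdf(c)) <= max_dist]

# Hypothetical Gaussian centres: two near the surface, one floater, one interior.
centers = [(1.0, 0.0, 0.0), (0.0, 1.02, 0.0), (0.0, 0.0, 1.5), (0.2, 0.0, 0.0)]
kept = prune_by_sdf(centers, sphere_sdf, max_dist=0.05)
```

Here the point at distance 1.5 from the origin plays the role of a floater, and the point at distance 0.2 lies deep inside the object; both are removed.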

2412.01282 2026-06-03 cs.CV cs.AI 版本更新

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement

Align-KD:为移动视觉语言模型增强提取跨模态对齐知识

Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, China(通用人工智能国家重点实验室,智能科学与技术学院,北京大学,中国) Huawei Noah’s Ark Lab, China(华为诺亚方舟实验室,中国)

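AI summary: Proposes Align-KD, which distills shallow-layer cross-modal alignment knowledge from the teacher model to guide a 1.7B student model in learning vision-text matching, improving the average score across 6 benchmarks by 2.0.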

Comments CVPR 2025 Paper

详情
AI中文摘要

视觉语言模型(VLM)为多模态任务带来了强大的理解和推理能力。同时,移动设备对强大人工智能的需求也日益增长,例如AI助手软件。一些工作试图将VLM迁移到边缘设备以扩展其应用范围。简化模型结构是一种常见方法,但随着模型缩小,性能与大小之间的权衡变得越来越困难。知识蒸馏(KD)可以帮助模型在不增加大小或数据量的情况下提升综合能力。然而,现有的大模型蒸馏技术大多只考虑单模态LLM的应用,或者仅使用教师为学生创建新的数据环境。这些方法都没有考虑VLM中最重要的跨模态对齐知识的蒸馏。我们提出了一种名为Align-KD的方法,引导学生模型学习发生在浅层的跨模态匹配。教师还帮助学生基于文本的关注点学习将视觉标记投影到文本嵌入空间。在Align-KD的指导下,1.7B的MobileVLM V2模型能够从7B教师模型中学习丰富的知识,且训练损失设计轻量,在两个训练子集上分别在6个基准上平均得分提升2.0。代码地址:https://github.com/fqhank/Align-KD。

英文摘要

Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, there is a growing need for capable artificial intelligence on mobile devices, such as AI assistant software. Some efforts try to migrate VLMs to edge devices to expand their application scope. Simplifying the model structure is a common method, but as the model shrinks, the trade-off between performance and size becomes more and more difficult. Knowledge distillation (KD) can help models improve comprehensive capabilities without increasing size or data volume. However, most of the existing large model distillation techniques only consider applications on single-modal LLMs, or only use teachers to create new data environments for students. None of these methods take into account the distillation of the most important cross-modal alignment knowledge in VLMs. We propose a method called Align-KD to guide the student model to learn the cross-modal matching that occurs at the shallow layer. The teacher also helps the student learn the projection of vision tokens into the text embedding space based on the focus of the text. Under the guidance of Align-KD, the 1.7B MobileVLM V2 model can learn rich knowledge from the 7B teacher model with a lightweight training loss design, and achieve an average score improvement of 2.0 across 6 benchmarks under two training subsets respectively. Code is available at: https://github.com/fqhank/Align-KD.

2411.15851 2026-06-03 cs.CV 版本更新

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

ResCLIP: 用于无训练密集视觉-语言推理的残差注意力

Yuhang Yang, Jinhong Deng, Wen Li, Lixin Duan

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

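AI summary: Proposes a Residual Cross-correlation Self-attention module and a Semantic Feedback Refinement module that use cross-correlation attention from intermediate layers to reorganize spatial information, improving CLIP's performance on dense prediction tasks.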

详情
Journal ref
Proceedings of the Computer Vision and Pattern Recognition Conference. 2025: 29968-29978
AI中文摘要

尽管像CLIP这样的视觉-语言模型在开放词汇任务中取得了显著成功,但其应用目前局限于图像级任务,在密集预测方面仍存在困难。最近的研究通常将这种密集预测的不足归因于最终块中的自注意力层,并通过将原始的查询-键注意力修改为自相关注意力(例如查询-查询和键-键注意力)取得了可观的成果。然而,这些方法忽略了捕捉丰富空间对应关系的交叉相关注意力(查询-键)特性。在本文中,我们揭示了CLIP非最终层中自注意力的交叉相关性也表现出定位特性。因此,我们提出了残差交叉相关自注意力(RCS)模块,该模块利用中间层的交叉相关自注意力来重塑最终块中的注意力。RCS模块有效重组了空间信息,释放了CLIP在密集视觉-语言推理中的定位潜力。此外,为了增强对相同类别区域的关注和局部一致性,我们提出了语义反馈精炼(SFR)模块,该模块利用语义分割图进一步调整注意力分数。通过整合这两种策略,我们的方法(称为ResCLIP)可以轻松作为即插即用模块集成到现有方法中,显著提升其在密集视觉-语言推理中的性能。在多个标准基准上的大量实验表明,我们的方法超越了最先进的无训练方法,验证了所提方法的有效性。代码可在 https://github.com/yvhangyang/ResCLIP 获取。

英文摘要

While vision-language models like CLIP have shown remarkable success in open-vocabulary tasks, their application is currently confined to image-level tasks, and they still struggle with dense predictions. Recent works often attribute such deficiency in dense predictions to the self-attention layers in the final block, and have achieved commendable results by modifying the original query-key attention to self-correlation attention (e.g., query-query and key-key attention). However, these methods overlook the cross-correlation attention (query-key) properties, which capture the rich spatial correspondence. In this paper, we reveal that the cross-correlation of the self-attention in CLIP's non-final layers also exhibits localization properties. Therefore, we propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block. The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference. Furthermore, to enhance the focus on regions of the same categories and local consistency, we propose the Semantic Feedback Refinement (SFR) module, which utilizes semantic segmentation maps to further adjust the attention scores. By integrating these two strategies, our method, termed ResCLIP, can be easily incorporated into existing approaches as a plug-and-play module, significantly boosting their performance in dense vision-language inference. Extensive experiments across multiple standard benchmarks demonstrate that our method surpasses state-of-the-art training-free methods, validating the effectiveness of the proposed approach. Code is available at https://github.com/yvhangyang/ResCLIP.

2407.18428 2026-06-03 cs.LG cs.AI cs.CV 版本更新

Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift

加权风险不变性:不变特征偏移下的领域泛化

Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Maryland, College Park(马里兰大学学院公园分校) New York University(纽约大学) Center for Data Science(数据科学中心)

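AI summary: To address the poor performance of existing invariant learning methods under invariant covariate shift, proposes the weighted risk invariance (WRI) framework, which imposes invariance of the loss across environments with reweighted training examples; WRI provably learns invariant models in linear-Gaussian settings and outperforms prior methods empirically.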

详情
Journal ref
TMLR 2024
AI中文摘要

学习预测在多个环境下不变的模型是一种有前景的分布外泛化方法。这类模型被训练来提取特征 $X_{\text{inv}}$,其中给定提取特征的条件分布 $Y \mid X_{\text{inv}}$ 在不同环境下不发生变化。不变模型还应能泛化到提取特征 $X_{\text{inv}}$ 的边缘分布 $p(X_{\text{inv}})$ 的偏移,这种偏移称为 $\textit{不变协变量偏移}$。然而,我们表明,现有学习不变模型的方法在不变协变量偏移下表现不佳,要么无法学习到不变模型(即使对于从简单且经过充分研究的线性-高斯模型生成的数据也是如此),要么有限样本性能较差。为了解决这些问题,我们提出 $\textit{加权风险不变性}$(WRI)。我们的框架基于对训练样本进行适当加权,强制要求损失在不同环境下保持不变。我们证明,在线性-高斯设置下,WRI 可证明地学习到不变模型,即丢弃虚假相关性。我们提出了一种实用算法,通过同时学习密度 $p(X_{\text{inv}})$ 和模型参数来实现 WRI,并且实验表明,在不变协变量偏移下,WRI 优于先前的不变学习方法。

英文摘要

Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an $\textit{invariant covariate shift}$. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models (even for data generated from simple and well-studied linear-Gaussian models) or having poor finite-sample performance. To alleviate these problems, we propose $\textit{weighted risk invariance}$ (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.
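A short derivation, under simplifying assumptions that are not taken from the paper, shows why a density-weighted risk can be comparable across environments. Suppose the predictor $f$ depends on the input only through $X_{\text{inv}}$, the conditional $Y \mid X_{\text{inv}}$ is shared by environments $e_i$ and $e_j$, and $p_{e}(X_{\text{inv}})$ denotes the invariant-feature density in environment $e$. Weighting the loss in $e_i$ by the density from $e_j$ gives

$$
R^{e_i, e_j}(f) = \mathbb{E}_{e_i}\big[\ell(f(X_{\text{inv}}), Y)\, p_{e_j}(X_{\text{inv}})\big] = \int p_{e_i}(x)\, p_{e_j}(x)\, \bar{\ell}(x)\, dx, \qquad \bar{\ell}(x) = \mathbb{E}\big[\ell(f(x), Y) \mid X_{\text{inv}} = x\big].
$$

The integrand is symmetric in $e_i$ and $e_j$, so $R^{e_i, e_j}(f) = R^{e_j, e_i}(f)$ even when $p_{e_i} \neq p_{e_j}$, whereas the unweighted risks generally differ under such a shift. The exact weighting and penalty used by WRI may differ from this sketch.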

2407.05312 2026-06-03 cs.CV 版本更新

An Improved Method for Personalizing Diffusion Models

一种改进的扩散模型个性化方法

Yan Zeng, Masanori Suganuma, Takayuki Okatani

发表机构 * Graduate School of Information Sciences, Tohoku University(东北大学信息科学研究生院) RIKEN Center for AIP(理化学研究所AIP研究中心)

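AI summary: Proposes a diffusion model personalization method that retains the model's original knowledge while integrating new information, giving better results with less training time than Dreambooth and textual inversion.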

详情
AI中文摘要

扩散模型已经展示了令人印象深刻的图像生成能力。个性化方法,如文本反转和Dreambooth,通过使用特定图像增强模型的个性化。这些方法能够基于多样的文本上下文生成特定对象的图像。我们提出的方法旨在在整合新信息时保留模型的原有知识,从而在比Dreambooth和文本反转更少的训练时间内获得更优的结果。

英文摘要

Diffusion models have demonstrated impressive image generation capabilities. Personalized approaches, such as textual inversion and Dreambooth, enhance model individualization using specific images. These methods enable generating images of specific objects based on diverse textual contexts. Our proposed approach aims to retain the model's original knowledge during new information integration, resulting in superior outcomes while necessitating less training time compared to Dreambooth and textual inversion.

1007.3881 2026-06-03 cs.CV cs.NA math.NA 版本更新

Orthogonal multifilters image processing of astronomical images from scanned photographic plates

扫描照相底片天文图像的正交多滤波器处理

Vasil Kolev

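AI summary: Constructs new orthogonal multifilters based on the Haar and Daubechies orthogonal wavelets for multiscale analysis of astronomical images, and applies them to decompose astronomical images from scanned photographic plates.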

Comments 6 pages, The ACM proceedings of CompSysTech 2010

详情
AI中文摘要

本文介绍了用于天文图像处理的正交多滤波器。我们基于Haar和Daubechies正交小波获得了新的正交多滤波器。最近,多小波作为一种更强大的多尺度分析工具被引入。它在多滤波器设计中增加了若干自由度,并使得同时具有多个有用属性成为可能,如对称性、正交性、短支撑和更高的消失矩。对带有天文图像的扫描照相底片进行了多滤波器分解。

英文摘要

In this paper, orthogonal multifilters for astronomical image processing are presented. We obtained new orthogonal multifilters based on the orthogonal wavelets of Haar and Daubechies. Recently, multiwavelets have been introduced as a more powerful multiscale analysis tool. They add several degrees of freedom in multifilter design and make it possible to have several useful properties, such as symmetry, orthogonality, short support, and a higher number of vanishing moments, simultaneously. Multifilter decomposition of scanned photographic plates with astronomical images is performed.
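As background for the orthogonality property mentioned above, the following sketch runs one level of the scalar Haar filter bank and reconstructs the input. It covers only the ordinary (scalar) Haar case; the multifilters in the paper use matrix-valued coefficients, which this example does not attempt to reproduce. The function names are chosen for illustration.

```python
import math

S = 1.0 / math.sqrt(2.0)
LOW = (S, S)     # Haar scaling (low-pass) filter
HIGH = (S, -S)   # Haar wavelet (high-pass) filter

def haar_analysis(signal):
    """One decomposition level: returns (approximation, detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [LOW[0] * a + LOW[1] * b for a, b in pairs]
    detail = [HIGH[0] * a + HIGH[1] * b for a, b in pairs]
    return approx, detail

def haar_synthesis(approx, detail):
    """Invert haar_analysis using the transposed (orthogonal) filters."""
    out = []
    for a, d in zip(approx, detail):
        out.append(LOW[0] * a + HIGH[0] * d)
        out.append(LOW[1] * a + HIGH[1] * d)
    return out

# Example: a short even-length row of pixel values.
row = [4.0, 6.0, 10.0, 2.0]
approx, detail = haar_analysis(row)
restored = haar_synthesis(approx, detail)
```

Because the two filters are orthonormal, the transform preserves signal energy and the synthesis step is just the transpose of the analysis step.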

1105.1302 2026-06-03 q-bio.QM cs.CV cs.NA math.NA 版本更新

A Modified Cross Correlation Algorithm for Reference-free Image Alignment of Non-Circular Projections in Single-Particle Electron Microscopy

一种改进的互相关算法用于单颗粒电子显微镜中非圆形投影的无参考图像对齐

Wooram Park, Gregory S. Chirikjian

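AI summary: For aligning images of highly non-spherical structures in single-particle electron microscopy, proposes a modified cross correlation method that combines coarse alignment, a search space reduced using the statistics of the background noise, artificially blurred images, and segmentation of intermediate class averages, and that outperforms classical cross correlation and maximum likelihood methods at low signal-to-noise ratios.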

Comments 29 pages

详情
AI中文摘要

本文提出了一种改进的互相关方法,用于对齐单颗粒电子显微镜中高度非球形结构的同一类图像。在该新方法中,首先对投影图像进行粗对齐,然后使用互相关(CC)方法重新对齐所得图像。粗对齐通过匹配图像的质心和主轴实现。基于加性背景噪声的统计特性,可以量化粗对齐中的未对准分布。因此,互相关方法中重新对齐的搜索空间可以缩小以实现更好的对齐。为了克服互相关函数中虚假峰值相关的问题,我们在迭代互相关方法的早期阶段使用人工模糊图像,并从每次迭代步骤中分割中间类平均。这两种额外的操作与互相关方法中缩小的搜索空间相结合,对于低信噪比图像,比经典互相关和最大似然(ML)方法产生更好的对齐效果。

英文摘要

In this paper we propose a modified cross correlation method to align images from the same class in single-particle electron microscopy of highly non-spherical structures. In this new method, we first coarsely align projection images, and then re-align the resulting images using the cross correlation (CC) method. The coarse alignment is obtained by matching the centers of mass and the principal axes of the images. The distribution of misalignment in this coarse alignment can be quantified based on the statistical properties of the additive background noise. As a consequence, the search space for re-alignment in the cross correlation method can be reduced to achieve better alignment. In order to overcome problems associated with false peaks in the cross correlation function, we use artificially blurred images for the early stage of the iterative cross correlation method and segment the intermediate class average from every iteration step. These two additional manipulations, combined with the reduced search space size in the cross correlation method, yield better alignments for low signal-to-noise ratio images than both classical cross correlation and maximum likelihood (ML) methods.
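The coarse alignment step (matching centers of mass and principal axes) can be sketched with intensity-weighted image moments. This is a generic moment computation, assumed here for illustration rather than taken from the paper; it presumes non-negative pixel values with a positive total, and the returned angle is ambiguous by 180 degrees.

```python
import math

def centroid_and_axis(image):
    """Return (cx, cy, theta) from intensity-weighted image moments."""
    total = sum(sum(row) for row in image)
    cx = sum(x * v for row in image for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(image) for v in row) / total
    mu20 = mu02 = mu11 = 0.0
    for y, row in enumerate(image):
        for x, v in enumerate(row):
            mu20 += v * (x - cx) ** 2
            mu02 += v * (y - cy) ** 2
            mu11 += v * (x - cx) * (y - cy)
    # Orientation of the principal axis from second-order central moments.
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return cx, cy, theta

# Two toy projections: a horizontal bar and a diagonal line.
bar = [[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0]]
diagonal = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
```

Translating each image so its centroid sits at the origin and rotating by `-theta` gives the coarse alignment; a cross correlation search then only needs to explore a small neighbourhood of that pose.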

0710.0736 2026-06-03 cs.CV cs.NA math.NA 版本更新

Colour image segmentation by the vector-valued Allen-Cahn phase-field model: a multigrid solution

基于向量值Allen-Cahn相场模型的彩色图像分割:多重网格解法

David A Kay, Alessandro Tomasi

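AI summary: Presents a PDE model for colour image segmentation that combines the vector-valued Allen-Cahn phase-field equation with initial data fitting terms, solved with a multigrid finite element method for efficient and robust segmentation.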

Comments 17 pages, 9 figures

详情
Journal ref
IEEE Trans. Im. Proc. 18.10 (2009)
AI中文摘要

我们提出了一种用于彩色图像分割的PDE驱动模型数值解的新方法,并给出了结果的数值示例。该方法将向量值Allen-Cahn相场方程与初始数据拟合项相结合。已知该方法与Mumford-Shah问题以及Chan和Vese的水平集分割密切相关。我们的数值解使用有限元空间的多重网格分裂进行,从而为大型图像的分割产生了一种高效且鲁棒的方法。

英文摘要

We propose a new method for the numerical solution of a PDE-driven model for colour image segmentation and give numerical examples of the results. The method combines the vector-valued Allen-Cahn phase field equation with initial data fitting terms. This method is known to be closely related to the Mumford-Shah problem and the level set segmentation by Chan and Vese. Our numerical solution is performed using a multigrid splitting of a finite element space, thereby producing an efficient and robust method for the segmentation of large images.

1003.2022 2026-06-03 cs.CV cs.CE cs.IT cs.NA math.IT math.NA 版本更新

Fast space-variant elliptical filtering using box splines

使用盒样条进行快速空间变椭圆滤波

Kunal Narayan Chaudhury, Arrate Munoz-Barrutia, Michael Unser

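AI summary: Proposes a method based on radially-uniform box splines that uses pre-integration and local finite differences to perform space-variant Gaussian-like elliptical filtering at a fixed computational cost per pixel, with continuous control of size, elongation, and orientation.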

Comments 12 figures; IEEE Transactions on Image Processing, vol. 19, 2010

详情
Journal ref
IEEE Transactions on Image Processing, vol. 19(9), pp. 2290 - 2306, 2010
AI中文摘要

线性空间变(非卷积)滤波器的高效实现是图像处理中一个具有挑战性的计算问题。在本文中,我们证明可以使用每像素固定数量的计算来对图像进行具有变化大小、伸长和方向的高斯型椭圆窗口滤波。相关算法基于一族光滑紧支撑分段多项式——径向均匀盒样条,通过预积分和局部有限差分实现。径向均匀盒样条是通过重复卷积固定数量的盒分布构造的,这些盒分布经过适当缩放并以均匀方式径向分布。这些盒样条的吸引人特性包括其渐近行为、简单的协方差结构以及准可分离性。随着阶数的增加,它们收敛到高斯函数,并可通过控制组成盒分布的尺度来近似具有不同协方差的各向异性高斯函数。基于第二个特性,我们开发了一种连续控制这些高斯型函数大小、伸长和方向的技术。最后,利用准可分离结构以及盒分布的某种缩放性质,高效实现了相关的空间变椭圆滤波,该滤波每像素需要O(1)次计算,与滤波器的形状和大小无关。

英文摘要

The efficient realization of linear space-variant (non-convolution) filters is a challenging computational problem in image processing. In this paper, we demonstrate that it is possible to filter an image with a Gaussian-like elliptic window of varying size, elongation and orientation using a fixed number of computations per pixel. The associated algorithm, which is based on a family of smooth compactly supported piecewise polynomials, the radially-uniform box splines, is realized using pre-integration and local finite-differences. The radially-uniform box splines are constructed through the repeated convolution of a fixed number of box distributions, which have been suitably scaled and distributed radially in a uniform fashion. The attractive features of these box splines are their asymptotic behavior, their simple covariance structure, and their quasi-separability. They converge to Gaussians with the increase of their order, and are used to approximate anisotropic Gaussians of varying covariance simply by controlling the scales of the constituent box distributions. Based on the second feature, we develop a technique for continuously controlling the size, elongation and orientation of these Gaussian-like functions. Finally, the quasi-separable structure, along with a certain scaling property of box distributions, is used to efficiently realize the associated space-variant elliptical filtering, which requires O(1) computations per pixel irrespective of the shape and size of the filter.
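The pre-integration plus finite-difference idea behind the O(1) cost can be seen in one dimension with a single box: after one pass of running sums, a box average of any width needs only two lookups. The sketch below is a deliberately reduced illustration with hypothetical function names; the paper's method combines several scaled, rotated box distributions in 2D, which is not reproduced here.

```python
def prefix_sums(signal):
    """Pre-integration: sums[i] holds signal[0] + ... + signal[i-1]."""
    sums = [0.0]
    for v in signal:
        sums.append(sums[-1] + v)
    return sums

def space_variant_box(signal, radii):
    """Average over a window of per-sample radius using two lookups each."""
    sums = prefix_sums(signal)
    n = len(signal)
    out = []
    for i, r in enumerate(radii):
        lo = max(0, i - r)          # window clipped at the left border
        hi = min(n, i + r + 1)      # window clipped at the right border
        out.append((sums[hi] - sums[lo]) / (hi - lo))
    return out

# Each sample gets its own window radius, yet the cost per sample is constant.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
radii = [0, 1, 2, 1, 0]
smoothed = space_variant_box(signal, radii)
```

Because the window size enters only through the two indices `lo` and `hi`, changing the radius from sample to sample does not change the amount of work per sample.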

1203.2995 2026-06-03 eess.SY cs.CV cs.SY 版本更新

Marginal multi-Bernoulli filters: RFS derivation of MHT, JIPDA and association-based MeMBer

边缘多伯努利滤波器:MHT、JIPDA和基于关联的MeMBer的RFS推导

Jason L. Williams

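AI summary: Derives a form of the full Bayes RFS filter using random finite sets, shows that data association is implicitly present, and obtains two algorithms related to JIPDA and MeMBer by approximating the distribution of associations, improving performance in challenging environments.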

Comments Journal version at http://ieeexplore.ieee.org/document/7272821. Matlab code of simple implementation included with ancillary files

详情
Journal ref
IEEE Transactions on Aerospace and Electronic Systems, vol 51, no 3, pp 1664-1687, July 2015
AI中文摘要

随机有限集(RFS)的最新发展产生了多种避免数据关联的跟踪方法。本文推导了全贝叶斯RFS滤波器的一种形式,并观察到数据关联隐式存在于类似于MHT的数据结构中。随后,通过近似关联分布得到算法。得到两种算法:一种与JIPDA几乎相同,另一种与MeMBer滤波器相关。两者均在具有挑战性的环境中提升了性能。

英文摘要

Recent developments in random finite sets (RFSs) have yielded a variety of tracking methods that avoid data association. This paper derives a form of the full Bayes RFS filter and observes that data association is implicitly present, in a data structure similar to MHT. Subsequently, algorithms are obtained by approximating the distribution of associations. Two algorithms result: one nearly identical to JIPDA, and another related to the MeMBer filter. Both improve performance in challenging environments.

1302.6105 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Image restoration using sparse approximations of spatially varying blur operators in the wavelet domain

利用小波域中空间变化模糊算子的稀疏近似进行图像恢复

Paul Escande, Pierre Weiss, Francois Malgouyres

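AI summary: For restoring images degraded by spatially varying blur, proposes approximating the blur operator by a matrix that is sparse in the wavelet domain, justifies the approach mathematically, evaluates the approximation quality numerically, and shows that the sparsity pattern can be pre-defined, which suits tasks such as blind deconvolution.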

Comments 6 pages

详情
AI中文摘要

在摄影、卫星或显微成像中,恢复由空间变化模糊降质的图像是一个日益重要的问题。解决这一问题的主要困难之一在于模糊矩阵的巨大维度,这阻碍了使用朴素方法进行矩阵-向量乘法。在本文中,我们提出在小波域中用稀疏矩阵近似模糊算子。我们从数学角度证明了该方法的合理性,并数值研究了近似质量。最后,我们表明矩阵的稀疏模式可以预定义,这在盲反卷积等任务中至关重要。

英文摘要

Restoration of images degraded by spatially varying blurs is an issue of increasing importance in the context of photography, satellite or microscopy imaging. One of the main difficulties in solving this problem comes from the huge dimensions of the blur matrix, which prevent the use of naive approaches for performing matrix-vector multiplications. In this paper, we propose to approximate the blur operator by a matrix sparse in the wavelet domain. We justify this approach from a mathematical point of view and investigate the approximation quality numerically. We finish by showing that the sparsity pattern of the matrix can be pre-defined, which is central in tasks such as blind deconvolution.

1210.6649 2026-06-03 astro-ph.IM cs.CV cs.NA math.NA 版本更新

Extended object reconstruction in adaptive-optics imaging: the multiresolution approach

自适应光学成像中的扩展目标重建:多分辨率方法

Roberto Baena Gallé, Jorge Núñez, Szymon Gladysz

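AI summary: Proposes multiresolution transforms such as wavelets and curvelets for reconstructing images of extended objects acquired with adaptive optics systems; multichannel deconvolution with a static PSF generally outperforms the blind and myopic deconvolution approaches on the tested images.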

Comments In revision in Astronomy & Astrophysics. 19 pages, 13 figures

详情
AI中文摘要

我们提出将多分辨率变换(如小波变换(WT)和曲波变换(CT))应用于自适应光学(AO)系统获取的扩展目标图像重建。这种多通道方法通常利用概率工具来区分显著结构与噪声和重建残差。此外,我们旨在检验历史假设:使用静态PSF的图像重建算法不适用于AO成像。我们将哈勃太空望远镜(HST)拍摄的土星图像与帕洛马天文台5米海尔望远镜的AO PSF进行卷积,并添加散粒噪声和读出噪声。随后,我们对模糊和噪声数据应用不同方法以恢复原始目标。这些方法包括多帧盲反卷积(使用IDAC算法)、带正则化的近视反卷积(使用MISTRAL)以及基于小波或曲波的静态PSF反卷积(AWMLE和ACMLE算法)。我们使用均方误差(MSE)和结构相似性指数(SSIM)来比较结果。我们讨论了这两种指标的优缺点。我们发现,根据MSE和SSIM的测量,CT比WT产生更好的结果。使用静态PSF的多通道反卷积产生的结果通常优于近视/盲方法(对于我们测试的图像),这表明方法抑制噪声和跟踪底层迭代过程的能力与近视/盲方法更新PSF的能力同样关键。

英文摘要

We propose the application of multiresolution transforms, such as wavelets (WT) and curvelets (CT), to the reconstruction of images of extended objects that have been acquired with adaptive optics (AO) systems. Such multichannel approaches normally make use of probabilistic tools in order to distinguish significant structures from noise and reconstruction residuals. Furthermore, we aim to check the historical assumption that image-reconstruction algorithms using static PSFs are not suitable for AO imaging. We convolve an image of Saturn taken with the Hubble Space Telescope (HST) with AO PSFs from the 5-m Hale telescope at the Palomar Observatory and add both shot and readout noise. Subsequently, we apply different approaches to the blurred and noisy data in order to recover the original object. The approaches include multi-frame blind deconvolution (with the algorithm IDAC), myopic deconvolution with regularization (with MISTRAL) and wavelets- or curvelets-based static PSF deconvolution (AWMLE and ACMLE algorithms). We used the mean squared error (MSE) and the structural similarity index (SSIM) to compare the results. We discuss the strengths and weaknesses of the two metrics. We found that CT produces better results than WT, as measured in terms of MSE and SSIM. Multichannel deconvolution with a static PSF produces results which are generally better than the results obtained with the myopic/blind approaches (for the images we tested) thus showing that the ability of a method to suppress the noise and to track the underlying iterative process is just as critical as the capability of the myopic/blind approaches to update the PSF.
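The two comparison metrics named above can be sketched as follows. The MSE is standard; the SSIM here is computed over the whole image as a single window with the commonly used constants (0.01 and 0.03 times the dynamic range), which is a simplification assumed for this example. Typical SSIM implementations average the index over local windows, and the paper's exact settings are not stated in the abstract.

```python
def mse(a, b):
    """Mean squared error between two equal-length pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def global_ssim(a, b, dynamic_range=255.0):
    """SSIM computed over the whole image as a single window."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilizes the contrast/structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

# Flattened toy images: the second has reduced contrast but the same mean.
reference = [10.0, 50.0, 90.0, 130.0]
blurred = [30.0, 50.0, 90.0, 110.0]
```

MSE penalizes every pixel difference equally, while SSIM separates luminance, contrast, and structure, which is one reason the two metrics can rank reconstructions differently.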

1210.3098 2026-06-03 math.NA cs.CV cs.IT cs.NA math.IT 版本更新

Near-optimal compressed sensing guarantees for total variation minimization

全变差最小化的近最优压缩感知保证

Deanna Needell, Rachel Ward

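AI summary: For compressed sensing reconstruction of multidimensional signals, proves that a signal can be reconstructed from O(sd*log(N^d)) linear measurements by total variation minimization, with error within a factor of the best s-term approximation of its gradient, and that this guarantee is optimal up to polynomial factors in the spatial dimension d.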

详情
AI中文摘要

考虑压缩感知设置中从欠定测量集重建多维信号的问题。没有任何额外假设,该问题是不适定的。然而,对于自然图像或电影等信号,与测量一致的最小全变差估计通常能产生对潜在信号的良好近似,即使测量数量远小于环境维度。本文将二维图像的最新重建保证推广到任意维度 d>1 的信号和各向同性全变差问题。具体来说,我们证明多维信号 x 可以从 O(sd*log(N^d)) 个线性测量中通过全变差最小化重建,重建误差在其梯度最佳 s 项近似的因子内。我们提供的重建保证在空间维度 d 的多项式因子内必然是最优的。

英文摘要

Consider the problem of reconstructing a multidimensional signal from an underdetermined set of measurements, as in the setting of compressed sensing. Without any additional assumptions, this problem is ill-posed. However, for signals such as natural images or movies, the minimal total variation estimate consistent with the measurements often produces a good approximation to the underlying signal, even if the number of measurements is far smaller than the ambient dimensionality. This paper extends recent reconstruction guarantees for two-dimensional images to signals of arbitrary dimension d>1 and to isotropic total variation problems. To be precise, we show that a multidimensional signal x can be reconstructed from O(sd*log(N^d)) linear measurements using total variation minimization to within a factor of the best s-term approximation of its gradient. The reconstruction guarantees we provide are necessarily optimal up to polynomial factors in the spatial dimension d.
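For readers unfamiliar with the quantity being minimized, the sketch below evaluates the isotropic total variation of a 2D image: the sum over pixels of the Euclidean norm of the discrete gradient. Forward differences with a zero difference at the last row and column are an assumption of this example; the paper may use a different boundary convention.

```python
import math

def isotropic_tv(image):
    """Sum over pixels of the Euclidean norm of the forward-difference gradient."""
    rows, cols = len(image), len(image[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            dx = image[y][x + 1] - image[y][x] if x + 1 < cols else 0.0
            dy = image[y + 1][x] - image[y][x] if y + 1 < rows else 0.0
            total += math.hypot(dx, dy)
    return total

# A flat image has zero total variation; a single vertical edge has a small one.
flat = [[1.0, 1.0], [1.0, 1.0]]
edge = [[0.0, 1.0], [0.0, 1.0]]
corner = [[1.0, 0.0], [0.0, 0.0]]
```

Images that are piecewise constant have sparse gradients and hence small total variation, which is the structure the reconstruction guarantees rely on.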

1110.3649 2026-06-03 math.NA cs.CV cs.GR cs.NA 版本更新

Algorithms to automatically quantify the geometric similarity of anatomical surfaces

自动量化解剖表面几何相似性的算法

D. Boyer, Y. Lipman, E. St. Clair, J. Puente, T. Funkhouser, B. Patel, J. Jernvall, I. Daubechies

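AI summary: Proposes polynomial algorithms that use local structures and global geometric relationships to automatically compute distances and correspondences between 2-dimensional surfaces without manual landmarking, enabling efficient comparison of large collections of digitized surfaces.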

Comments Changes with respect to v1, v2: an Erratum was added, correcting the references for one of the three datasets. Note that the datasets and code for this paper can be obtained from the Data Conservancy (see Download column on v1, v2)

详情
Journal ref
PNAS 2011 108 (45) 18221-18226
AI中文摘要

我们描述了用于计算(嵌入三维空间的)二维表面对之间距离的新方法,这些方法利用局部结构以及结构间几何关系所包含的全局信息。我们提出了自动确定这些距离以及几何对应关系的算法。这一研究源于自然科学学生对理解统一生命多样性的形态连续性的追求。目前,科学家利用物理特征研究现存和灭绝动物之间的进化关系时,分析的是从精心定义的解剖对应点(地标)中提取的数据。识别和记录这些地标耗时且只能由训练有素的形态学家准确完成。这使得非形态学家无法进行这些研究,并导致表型组学在阐明进化模式方面落后于基因组学。与已提出的其他形态对应算法不同,我们的方法不需要用户预先标记任何特殊特征或地标。它也与计算几何中的其他开创性工作不同,因为我们的算法本质上是多项式的,因此更快,使得对大量数字化表面进行成对比较成为可能。我们使用代表灵长类和人类牙齿及不同骨骼的三个数据集展示了我们的方法,并表明它能产生高度准确的结果。

英文摘要

We describe new approaches for distances between pairs of 2-dimensional surfaces (embedded in 3-dimensional space) that use local structures and global information contained in inter-structure geometric relationships. We present algorithms to automatically determine these distances as well as geometric correspondences. This is motivated by the aspiration of students of natural science to understand the continuity of form that unites the diversity of life. At present, scientists using physical traits to study evolutionary relationships among living and extinct animals analyze data extracted from carefully defined anatomical correspondence points (landmarks). Identifying and recording these landmarks is time consuming and can be done accurately only by trained morphologists. This renders these studies inaccessible to non-morphologists, and causes phenomics to lag behind genomics in elucidating evolutionary patterns. Unlike other algorithms presented for morphological correspondences our approach does not require any preliminary marking of special features or landmarks by the user. It also differs from other seminal work in computational geometry in that our algorithms are polynomial in nature and thus faster, making pairwise comparisons feasible for significantly larger numbers of digitized surfaces. We illustrate our approach using three datasets representing teeth and different bones of primates and humans, and show that it leads to highly accurate results.

1203.2992 2026-06-03 eess.SY cs.CV cs.SY 版本更新

Hybrid Poisson and multi-Bernoulli filters

混合泊松和多伯努利滤波器

Jason L. Williams

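AI summary: Proposes a hybrid method combining the probability hypothesis density and multi-target multi-Bernoulli filters; maintaining a Poisson component for undetected targets and recycling Bernoulli components with a low probability of existence gives faster track initiation and reduces the number of Bernoulli components.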

Comments Submitted to 15th International Conference on Information Fusion (2012)

详情
AI中文摘要

概率假设密度(PHD)和多目标多伯努利(MeMBer)滤波器是基于随机有限集(RFS)的两种主要算法。本文研究了一种结合这两种方法的方法。我们的工作受到一篇姊妹论文的启发,该论文证明了全贝叶斯RFS滤波器自然包含一个代表从未被检测到的目标的泊松分量,以及一个代表跟踪中目标的多伯努利分量的线性组合。这里我们展示了维持未检测目标的泊松分量所带来的好处(在航迹起始速度方面)。随后,我们提出了一种回收方法,将存在概率较低的伯努利分量投影到泊松分量上(而不是删除它们)。我们表明,这使我们能够使用更少的伯努利分量(即航迹)实现相似的跟踪性能。

英文摘要

The probability hypothesis density (PHD) and multi-target multi-Bernoulli (MeMBer) filters are two leading algorithms that have emerged from random finite sets (RFS). In this paper we study a method which combines these two approaches. Our work is motivated by a sister paper, which proves that the full Bayes RFS filter naturally incorporates a Poisson component representing targets that have never been detected, and a linear combination of multi-Bernoulli components representing targets under track. Here we demonstrate the benefit (in speed of track initiation) that maintenance of a Poisson component of undetected targets provides. Subsequently, we propose a method of recycling, which projects Bernoulli components with a low probability of existence onto the Poisson component (as opposed to deleting them). We show that this allows us to achieve similar tracking performance using a fraction of the number of Bernoulli components (i.e., tracks).
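The recycling step described above can be sketched with a heavily simplified data model. In this illustration each Bernoulli component is just a pair of an existence probability and a state label, and the Poisson intensity is a dictionary of weights per label; both representations, the function name, and the threshold are assumptions. A real filter would carry full state densities and add the existence probability times that density to the Poisson intensity.

```python
def recycle(bernoullis, poisson_weights, threshold):
    """Move low-existence Bernoulli components into the Poisson intensity.

    bernoullis: list of (existence_probability, state_label) pairs.
    poisson_weights: dict mapping state_label -> intensity weight.
    """
    kept = []
    updated = dict(poisson_weights)
    for r, state in bernoullis:
        if r < threshold:
            # Recycle: fold the component's expected target count into the intensity.
            updated[state] = updated.get(state, 0.0) + r
        else:
            kept.append((r, state))
    return kept, updated

# Hypothetical tracks: one confident, two with low probability of existence.
tracks = [(0.9, "a"), (0.05, "b"), (0.02, "a")]
undetected = {"a": 0.1}
kept, updated = recycle(tracks, undetected, threshold=0.1)
```

Unlike deletion, this keeps the expected number of targets unchanged while shrinking the list of tracks that must be maintained explicitly.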

1202.6429 2026-06-03 cs.CV cs.IT cs.NA math.IT math.NA 版本更新

Stable image reconstruction using total variation minimization

利用全变差最小化的稳定图像重建

Deanna Needell, Rachel Ward

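AI summary: Using total variation minimization, this article achieves accurate and robust image reconstruction from under-sampled noisy measurements and gives near-optimal guarantees.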

Comments 25 pages

详情
AI中文摘要

本文提出了利用全变差最小化从欠采样噪声测量中实现准确且鲁棒的图像恢复的近最优保证。特别地,我们证明从 O(slog(N)) 个非自适应线性测量中,图像可以重建到其梯度最佳 s 项近似的对数因子范围内,并且通过略微增加测量次数可以消除该因子。在此过程中,我们证明了对于位于适当不相干矩阵零空间中的函数,存在一个加强的 Sobolev 不等式。

英文摘要

This article presents near-optimal guarantees for accurate and robust image recovery from under-sampled noisy measurements using total variation minimization. In particular, we show that from $O(s\log(N))$ nonadaptive linear measurements, an image can be reconstructed to within the best $s$-term approximation of its gradient up to a logarithmic factor, and this factor can be removed by taking slightly more measurements. Along the way, we prove a strengthened Sobolev inequality for functions lying in the null space of suitably incoherent matrices.
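For readers unfamiliar with the quantity being minimized, here is a minimal sketch of a discrete total variation seminorm. It assumes the anisotropic variant (the sum of absolute forward differences); the abstract does not say which discretization the paper uses, and the function name is illustrative.

```python
def total_variation(image):
    """Anisotropic discrete TV: sum of absolute forward differences."""
    rows, cols = len(image), len(image[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                tv += abs(image[i][j + 1] - image[i][j])  # horizontal difference
            if i + 1 < rows:
                tv += abs(image[i + 1][j] - image[i][j])  # vertical difference
    return tv


flat = [[1.0] * 4 for _ in range(4)]               # constant image
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]    # one vertical edge
```

A constant image has zero total variation, and the step image accumulates one unit jump per row. Images with few edges have sparse gradients, which is the structure the $s$-term approximation in the abstract refers to.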

1109.0217 2026-06-03 math.NA cs.CV cs.NA 版本更新

Vessel Segmentation in Medical Imaging Using a Tight-Frame Based Algorithm

基于紧框架算法的医学图像血管分割

Xiaohao Cai, Raymond Chan, Serena Morigi, Fiorella Sgallari

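AI summary: Proposes a tight-frame based iterative algorithm for automatic segmentation of tube-like structures (such as blood vessels) in magnetic resonance angiography images; it denoises, smooths, and sharpens the boundary region, converges within a few iterations, and outperforms existing PDE and variational methods.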

详情
AI中文摘要

紧框架作为正交小波的推广,已成功应用于图像处理中的多种问题,包括修复、脉冲噪声去除、超分辨率图像恢复等。分割是识别图像中物体轮廓的过程。目前存在多种基于变分方法和偏微分方程(PDE)建模的高效分割算法。本文提出应用紧框架方法自动识别磁共振血管造影(MRA)图像中的管状结构(如血管)。我们的方法迭代地细化一个包围血管可能边界或表面的区域。在每次迭代中,我们应用紧框架算法对可能边界进行去噪和平滑,并锐化该区域。我们证明了算法的收敛性。在真实2D/3D MRA图像上的数值实验表明,我们的方法非常高效,通常在几次迭代内收敛,并且由于能够提取图像中更多的管状目标和精细细节,优于现有的PDE和变分方法。

英文摘要

The tight-frame approach, a generalization of orthogonal wavelets, has been used successfully in various problems in image processing, including inpainting, impulse noise removal, and super-resolution image restoration. Segmentation is the process of identifying object outlines within images. There are quite a few efficient segmentation algorithms based on variational approaches and partial differential equation (PDE) modeling. In this paper, we propose to apply the tight-frame approach to automatically identify tube-like structures such as blood vessels in Magnetic Resonance Angiography (MRA) images. Our method iteratively refines a region that encloses the possible boundary or surface of the vessels. In each iteration, we apply the tight-frame algorithm to denoise and smooth the possible boundary and sharpen the region. We prove the convergence of our algorithm. Numerical experiments on real 2D/3D MRA images demonstrate that our method is very efficient, with convergence usually within a few iterations, and that it outperforms existing PDE and variational methods because it can extract more tubular objects and fine details in the images.

1101.4373 2026-06-03 stat.AP cs.CV cs.SY eess.SY math.OC stat.CO 版本更新

Statistical Multiresolution Dantzig Estimation in Imaging: Fundamental Concepts and Algorithmic Framework

成像中的统计多分辨率Dantzig估计:基本概念与算法框架

Klaus Frick, Philipp Marnitz, Axel Munk

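AI summary: For function estimation in a "signal + noise" model, this paper proposes a class of statistical multiresolution estimators and develops a computational framework based on the alternating direction method of multipliers and Dykstra's algorithm, demonstrating the method's effectiveness through imaging and signal detection examples.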

详情
Journal ref
Electron. J. Stat. 6 (2012) 231-268
AI中文摘要

本文关注于“信号+噪声”模型中函数的全自动和局部自适应估计,其中回归函数可能进一步被线性算子(例如卷积)模糊。为此,我们引入了一类通用的统计多分辨率估计器,并开发了用于计算这些估计器的算法框架。这意味着估计器被定义为具有上确界型约束的凸优化问题的解。我们结合了交替方向乘子法和Dykstra算法来计算凸集交集上的正交投影,并证明了数值收敛性。通过成像和信号检测的各种示例,展示了所提出方法的能力。

英文摘要

In this paper we are concerned with fully automatic and locally adaptive estimation of functions in a "signal + noise"-model where the regression function may additionally be blurred by a linear operator, e.g. by a convolution. To this end, we introduce a general class of statistical multiresolution estimators and develop an algorithmic framework for computing those. By this we mean estimators that are defined as solutions of convex optimization problems with supremum-type constraints. We employ a combination of the alternating direction method of multipliers with Dykstra's algorithm for computing orthogonal projections onto intersections of convex sets and prove numerical convergence. The capability of the proposed method is illustrated by various examples from imaging and signal detection.

1112.3010 2026-06-03 cs.CV cs.NA math.NA 版本更新

A new variational principle for the Euclidean distance function: Linear approach to the non-linear eikonal problem

欧几里得距离函数的新变分原理:非线性程函问题的线性方法

Karthik S. Gurumoorthy, Anand Rangarajan

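AI summary: Proposes a fast convolution-based algorithm that approximates the Euclidean distance function by solving a linear differential equation and taking the negative logarithm, implemented efficiently with the fast Fourier transform and avoiding the direct solution of the non-linear Hamilton-Jacobi equation required by traditional methods.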

详情
AI中文摘要

我们提出了一种基于卷积的快速技术,用于在二维和三维网格位置上计算近似的有符号欧几里得距离函数 $S$。我们的方法不是求解非线性的静态Hamilton-Jacobi方程($\\|\nabla S\\|=1$),而是首先求解线性微分方程中的标量场 $\phi$,然后通过取负对数推导出 $S$ 的解。换句话说,当 $S$ 和 $\phi$ 通过 $\phi = \exp\left(-\frac{S}{\tau}\right)$ 关联,且 $\phi$ 满足对应于变分问题极值的特定线性微分方程时,我们得到近似的欧几里得距离函数 $S = -\tau\log(\phi)$,该函数在 $\tau\rightarrow 0$ 的极限下收敛于真实解。这与快速行进法和快速扫描法等直接通过Godunov迎风离散格式求解Hamilton-Jacobi方程的技术形成鲜明对比。我们的线性公式导致近似欧几里得距离函数的闭式解可表示为离散卷积,因此可通过快速傅里叶变换(FFT)高效计算。我们的解还避免了对导数算子进行空间离散化的需要。当 $\tau\rightarrow 0$ 时,我们展示了结果收敛于真实解,并针对给定的 $\tau$ 值限定了误差。我们解的可微性允许我们通过一组卷积计算近似距离函数的一阶和二阶导数。为了确定距离函数的符号(定义为在封闭区域内为正,区域外为负),我们计算二维中的缠绕数和三维中的拓扑度,这些计算也可以通过快速卷积进行。我们通过一组实验结果证明了我们方法的有效性。

英文摘要

We present a fast convolution-based technique for computing an approximate, signed Euclidean distance function $S$ on a set of 2D and 3D grid locations. Instead of solving the non-linear, static Hamilton-Jacobi equation ($\|\nabla S\|=1$), our solution stems from first solving for a scalar field $\phi$ in a linear differential equation and then deriving the solution for $S$ by taking the negative logarithm. In other words, when $S$ and $\phi$ are related by $\phi = \exp\left(-\frac{S}{\tau}\right)$ and $\phi$ satisfies a specific linear differential equation corresponding to the extremum of a variational problem, we obtain the approximate Euclidean distance function $S = -\tau\log(\phi)$, which converges to the true solution in the limit as $\tau\rightarrow 0$. This is in sharp contrast to techniques like the fast marching and fast sweeping methods, which directly solve the Hamilton-Jacobi equation by the Godunov upwind discretization scheme. Our linear formulation results in a closed-form solution to the approximate Euclidean distance function expressible as a discrete convolution, and hence efficiently computable using the fast Fourier transform (FFT). Our solution also circumvents the need for spatial discretization of the derivative operator. As $\tau\rightarrow 0$ we show the convergence of our results to the true solution and also bound the error for a given value of $\tau$. The differentiability of our solution allows us to compute, using a set of convolutions, the first and second derivatives of the approximate distance function. In order to determine the sign of the distance function (defined to be positive inside a closed region and negative outside), we compute the winding number in 2D and the topological degree in 3D, whose computations can also be performed via fast convolutions. We demonstrate the efficacy of our method through a set of experimental results.
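The relation $S = -\tau\log(\phi)$ can be illustrated with a small 1D sketch. It makes several assumptions not taken from the abstract: the kernel is the unnormalized exponential $\exp(-|x|/\tau)$ (the shape of the 1D Green's function of $-\tau^2\phi'' + \phi = \delta$), the distance is unsigned, and a direct convolution loop stands in for the FFT to keep the example dependency-free. The function name and grid parameters are illustrative.

```python
import math


def approx_distance_1d(source_idx, n, h, tau):
    """Approximate unsigned distance on a 1D grid via S = -tau * log(phi)."""
    # Assumed kernel exp(-|x| / tau), sampled at grid offsets -(n-1) .. (n-1).
    kernel = [math.exp(-abs(k) * h / tau) for k in range(-(n - 1), n)]
    indicator = [0.0] * n
    for i in source_idx:
        indicator[i] = 1.0
    dist = []
    for i in range(n):
        # Direct discrete convolution; an FFT would replace this loop at scale.
        phi = sum(indicator[j] * kernel[(i - j) + (n - 1)] for j in range(n))
        dist.append(-tau * math.log(phi))
    return dist


# Two source points on a grid of 101 nodes with spacing 0.01.
dist = approx_distance_1d([20, 70], n=101, h=0.01, tau=0.02)
```

With $K$ source points, the log-sum-exp structure gives $d - \tau\log K \le S \le d$ for the true distance $d$, so the error shrinks as $\tau \to 0$. In floating point, a very small $\tau$ makes the exponentials underflow, so $\tau$ cannot be taken arbitrarily small in this naive form.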

1212.3385 2026-06-03 math.NA cs.CV cs.NA 版本更新

Approximating rational Bezier curves by constrained Bezier curves of arbitrary degree

用任意次数的约束贝塞尔曲线逼近有理贝塞尔曲线

Mao Shi, Jiansong Deng

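AI summary: Proposes a method for constrained approximation of rational Bézier curves by polynomial Bézier curves via weighted least squares, and studies the weight functions ρ(t)=ω(t) and ρ(t)=ω(t)^2 respectively.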

详情
AI中文摘要

本文提出了一种通过多项式贝塞尔曲线获得有理贝塞尔曲线的约束逼近的方法。该问题被重新表述为基于加权最小二乘法的两条多项式贝塞尔曲线之间的逼近问题,其中分别研究了权重函数ρ(t)=ω(t)和ρ(t)=ω(t)^2。通过一些例子测试了所提方法的有效性。

英文摘要

In this paper, we propose a method to obtain a constrained approximation of a rational Bézier curve by a polynomial Bézier curve. This problem is reformulated as an approximation problem between two polynomial Bézier curves based on a weighted least-squares method, where the weight functions $\rho(t)=\omega(t)$ and $\rho(t)=\omega(t)^{2}$ are studied respectively. The efficiency of the proposed method is tested using some examples.

1304.1408 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Restoration of Images Corrupted by Impulse Noise and Mixed Gaussian Impulse Noise using Blind Inpainting

使用盲修复恢复被脉冲噪声和混合高斯脉冲噪声污染的图像

Ming Yan

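AI summary: Proposes two methods based on blind inpainting and ℓ0 minimization that simultaneously detect damaged pixels and restore the image; experiments show better performance than other methods, and a convergence analysis is provided.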

Comments 18 pages, 4 figures

详情
Journal ref
SIAM J. Imaging Sci., 6(2013), 1227-1245
AI中文摘要

本文研究了被脉冲噪声和混合高斯脉冲噪声污染的观测图像的恢复问题。由于被脉冲噪声损坏的像素不包含真实图像的任何信息,如何正确找到这个集合是一个非常重要的问题。我们提出了两种基于盲修复和ℓ0最小化的方法,可以同时找到受损像素并恢复图像。通过迭代恢复图像和更新受损像素集合,这些方法在实验中表现出比其他方法更好的性能。此外,我们提供了这些方法的收敛性分析,这些算法将收敛到坐标极小点。另外,通过对算法进行一些修改,它们将收敛到局部极小点(或以概率1收敛)。

英文摘要

This article studies the restoration of observed images corrupted by impulse noise and mixed Gaussian impulse noise. Since the pixels damaged by impulse noise contain no information about the true image, finding this set correctly is a very important problem. We propose two methods based on blind inpainting and $\ell_0$ minimization that can simultaneously find the damaged pixels and restore the image. By iteratively restoring the image and updating the set of damaged pixels, these methods perform better than other methods, as shown in the experiments. In addition, we provide a convergence analysis for these methods: the algorithms converge to coordinatewise minimum points. With some modifications, they converge to local minimum points (or converge to them with probability one).

1302.5554 2026-06-03 stat.AP cs.CV cs.NA math.NA physics.flu-dyn 版本更新

Self-similar prior and wavelet bases for hidden incompressible turbulent motion

用于隐藏不可压缩湍流运动的自相似先验和小波基

Patrick Héas, Frédéric Lavancier, Souleymane Kadri-Harouna

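AI summary: For the ill-posed inverse problem of estimating turbulent flows from image sequences, proposes a self-similar prior model based on divergence-free isotropic fractional Brownian motion and uses wavelet bases to solve it effectively.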

Comments SIAM Journal on Imaging Sciences, 2014

详情
AI中文摘要

本文关注从图像序列观测估计湍流这一病态逆问题。从贝叶斯角度,选择散度自由各向同性分数布朗运动作为瞬时湍流速度场的先验模型。该自相似先验准确刻画了不可压缩各向同性湍流中速度场的二阶统计特性。然而,相关的最大后验估计涉及分数阶拉普拉斯算子,实际实现较为困难。为解决此问题,我们提出将散度自由分数布朗运动分解到精心选择的小波基上。作为第一种方案,我们设计小波作为白化滤波器,并证明这些滤波器是由Leray投影算子组成的分数阶拉普拉斯小波。作为第二种方案,我们使用散度自由小波基,该基隐式考虑了物理中的不可压缩约束。尽管后一种分解涉及相关小波系数,我们仍能在实践中处理这种依赖性。基于这两种小波分解,我们最终提供了有效且高效的算法来逼近最大后验估计。大量数值评估证明了所提出的小波基自相似先验的相关性。

英文摘要

This work is concerned with the ill-posed inverse problem of estimating turbulent flows from the observation of an image sequence. From a Bayesian perspective, a divergence-free isotropic fractional Brownian motion (fBm) is chosen as a prior model for instantaneous turbulent velocity fields. This self-similar prior accurately characterizes the second-order statistics of velocity fields in incompressible isotropic turbulence. Nevertheless, the associated maximum a posteriori estimate involves a fractional Laplacian operator, which is delicate to implement in practice. To deal with this issue, we propose to decompose the divergence-free fBm on well-chosen wavelet bases. As a first alternative, we propose to design wavelets as whitening filters. We show that these filters are fractional Laplacian wavelets composed with the Leray projector. As a second alternative, we use a divergence-free wavelet basis, which implicitly takes into account the incompressibility constraint arising from physics. Although the latter decomposition involves correlated wavelet coefficients, we are able to handle this dependence in practice. Based on these two wavelet decompositions, we finally provide effective and efficient algorithms to approximate the maximum a posteriori estimate. An intensive numerical evaluation proves the relevance of the proposed wavelet-based self-similar priors.

1209.3318 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Hessian Schatten-Norm Regularization for Linear Inverse Problems

Hessian Schatten-范数正则化用于线性逆问题

Stamatios Lefkimmiatis, John Paul Ward, Michael Unser

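AI summary: Proposes a family of convex, non-quadratic regularization functionals based on Schatten norms of the Hessian matrix for linear inverse imaging problems, avoiding the staircase effect and suiting a wide range of applications.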

Comments 15 pages double-column format. This manuscript will appear in IEEE Transactions on Image Processing

详情
Journal ref
IEEE Trans. Image Process. 22 (2013), no. 5, 1873--1888
AI中文摘要

我们引入了一类新的不变、凸且非二次的泛函,用于推导病态线性逆成像问题的正则化解。所提出的正则化项涉及图像每个像素处Hessian矩阵的Schatten范数。它们可以看作是流行的全变差(TV)半范数的二阶扩展,因为满足相同的不变性。同时,通过利用二阶导数,它们避免了基于TV的重建中常见的阶梯效应,并在广泛的应用中表现良好。为了解决相应的优化问题,我们提出了一种基于原始-对偶形式的算法。该算法的一个基本组成部分是将矩阵投影到任意半径的Schatten范数球上。基于我们提供的向量投影到ℓ_q范数球与矩阵投影到Schatten范数球之间的直接联系,可以高效地执行此操作。最后,我们通过几个逆成像问题的实验(包括真实和模拟数据)展示了所提出方法的有效性。

英文摘要

We introduce a novel family of invariant, convex, and non-quadratic functionals that we employ to derive regularized solutions of ill-posed linear inverse imaging problems. The proposed regularizers involve the Schatten norms of the Hessian matrix, computed at every pixel of the image. They can be viewed as second-order extensions of the popular total-variation (TV) semi-norm since they satisfy the same invariance properties. Meanwhile, by taking advantage of second-order derivatives, they avoid the staircase effect, a common artifact of TV-based reconstructions, and perform well for a wide range of applications. To solve the corresponding optimization problems, we propose an algorithm that is based on a primal-dual formulation. A fundamental ingredient of this algorithm is the projection of matrices onto Schatten norm balls of arbitrary radius. This operation is performed efficiently based on a direct link we provide between vector projections onto $\ell_q$ norm balls and matrix projections onto Schatten norm balls. Finally, we demonstrate the effectiveness of the proposed methods through experimental results on several inverse imaging problems with real and simulated data.
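As a small illustration of the per-pixel quantity involved, the sketch below computes the Schatten $p$-norm of a symmetric 2x2 matrix such as a 2D image Hessian. It relies on the standard facts that the Schatten norm is the $\ell_p$ norm of the singular values and that, for a symmetric matrix, the singular values are the absolute eigenvalues. The function name and argument layout are illustrative, not taken from the paper.

```python
import math


def schatten_norm_2x2(hxx, hxy, hyy, p):
    """Schatten p-norm of the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    mean = 0.5 * (hxx + hyy)
    radius = math.hypot(0.5 * (hxx - hyy), hxy)
    # Eigenvalues are mean +/- radius; singular values are their absolute values.
    s1, s2 = abs(mean + radius), abs(mean - radius)
    if p == math.inf:
        return max(s1, s2)  # spectral norm
    return (s1 ** p + s2 ** p) ** (1.0 / p)


nuclear = schatten_norm_2x2(3.0, 0.0, -4.0, 1)           # sum of singular values
frobenius = schatten_norm_2x2(3.0, 0.0, -4.0, 2)         # Frobenius norm
spectral = schatten_norm_2x2(3.0, 0.0, -4.0, math.inf)   # largest singular value
```

A regularizer of the kind described would sum such a norm over all pixels. How the Hessian entries are discretized is left open here.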

1208.4391 2026-06-03 cs.CV cs.SY eess.SY 版本更新

Shape Tracking With Occlusions via Coarse-To-Fine Region-Based Sobolev Descent

基于粗到细区域Sobolev下降的遮挡形状跟踪

Yanchao Yang, Ganesh Sundaramoorthi

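AI summary: Proposes a joint shape and appearance tracking method that handles self-occlusions and dis-occlusions through coarse-to-fine optimization on a Riemannian manifold of parameterized regions, achieving precise shape detection.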

Comments Extension of ICCV paper, added coarse-to-fine optimization based on new Riemannian manifold of parameterized regions

详情
AI中文摘要

我们提出了一种方法,基于参数化区域的新型黎曼流形上的新建模和优化,跟踪视频中物体的精确形状。联合动态形状和外观模型,其中物体的模板被传播以匹配下一帧中的物体形状和辐射度,在复杂物体辐射度和杂乱背景的情况下优于使用全局图像统计的方法。在3D物体运动和视点变化的情况下,物体的自遮挡和去遮挡很突出,当前使用联合形状和外观模型的方法无法适应新的形状和外观信息,导致形状检测不准确。在这项工作中,我们在联合形状和外观跟踪框架中建模自遮挡和去遮挡。自遮挡和用于传播模板的扭曲是耦合的,因此提出了一个联合问题。我们推导了一个粗到细的优化方案,在物体跟踪中具有优势,该方案首先通过粗扰动扰动模板,然后过渡到更细尺度的扰动,无缝且自动地遍历所有尺度。该方案是在我们引入的新型无限维黎曼流形上的梯度下降。该流形由平面参数化区域组成,我们引入的度量是定义在区域上的无穷小向量场上的新型Sobolev型度量。该度量的性质是,梯度下降自动优先考虑粗尺度变形(当它们减少能量时),然后才转向更细尺度的变形。在展示遮挡/去遮挡、复杂辐射度和背景的视频上的实验表明,与最近使用联合形状/外观模型或使用全局统计的方法相比,遮挡/去遮挡建模导致更优越的形状精度。

英文摘要

We present a method to track the precise shape of an object in video based on new modeling and optimization on a new Riemannian manifold of parameterized regions. Joint dynamic shape and appearance models, in which a template of the object is propagated to match the object shape and radiance in the next frame, are advantageous over methods employing global image statistics in cases of complex object radiance and cluttered background. In cases of 3D object motion and viewpoint change, self-occlusions and dis-occlusions of the object are prominent, and current methods employing joint shape and appearance models are unable to adapt to new shape and appearance information, leading to inaccurate shape detection. In this work, we model self-occlusions and dis-occlusions in a joint shape and appearance tracking framework. Self-occlusions and the warp to propagate the template are coupled, thus a joint problem is formulated. We derive a coarse-to-fine optimization scheme, advantageous in object tracking, that initially perturbs the template by coarse perturbations before transitioning to finer-scale perturbations, traversing all scales, seamlessly and automatically. The scheme is a gradient descent on a novel infinite-dimensional Riemannian manifold that we introduce. The manifold consists of planar parameterized regions, and the metric that we introduce is a novel Sobolev-type metric defined on infinitesimal vector fields on regions. The metric has the property of resulting in a gradient descent that automatically favors coarse-scale deformations (when they reduce the energy) before moving to finer-scale deformations. Experiments on video exhibiting occlusion/dis-occlusion, complex radiance and background show that occlusion/dis-occlusion modeling leads to superior shape accuracy compared to recent methods employing joint shape/appearance models or employing global statistics.

1210.2380 2026-06-03 cs.CV cs.IT cs.NA math.IT math.NA 版本更新

Stable and robust sampling strategies for compressive imaging

压缩成像的稳定鲁棒采样策略

Felix Krahmer, Rachel Ward

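AI summary: For compressive imaging with Fourier measurements and Haar wavelet sparsity, proposes a variable-density sampling strategy based on local coherence and proves the restricted isometry property with near-optimal embedding dimensions, enabling stable and robust reconstruction.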

Comments 17 pages, 4 figures

详情
AI中文摘要

在许多信号处理应用中,人们希望通过频域采样获取在变换域(如空间有限差分或小波)中稀疏的图像。对于此类应用,大量经验证据表明,通过集中于低频的变密度采样策略可以获得更优的图像重建。小波和傅里叶变换域并非不相干,因为低阶小波和低阶频率是相关的,因此压缩感知理论并不能直接推出采样策略和重建保证。本文转向一种更精细的相干性概念——所谓的局部相干性——分别测量每个感知向量与稀疏基的相关程度。对于傅里叶测量和Haar小波稀疏性,局部相干性可以被显式控制和界定,因此对于由从合适的逆平方幂律密度中采样的频率构成的矩阵,我们可以证明具有近最优嵌入维度的限制等距性质。因此,我们提供的变密度采样策略允许对稀疏缺陷稳定且对测量噪声鲁棒的图像重建。我们的结果涵盖了通过ℓ1最小化和全变差最小化的重建。本文开发的局部相干性框架在更一般的稀疏恢复问题中应具有独立意义,因为它表明,对于最优稀疏恢复结果,只要采样策略相应调整,只需感知基到稀疏基的有界平均相干性——而非有界最大相干性——就足够了。

英文摘要

In many signal processing applications, one wishes to acquire images that are sparse in transform domains such as spatial finite differences or wavelets using frequency domain samples. For such applications, overwhelming empirical evidence suggests that superior image reconstruction can be obtained through variable density sampling strategies that concentrate on lower frequencies. The wavelet and Fourier transform domains are not incoherent because low-order wavelets and low-order frequencies are correlated, so compressive sensing theory does not immediately imply sampling strategies and reconstruction guarantees. In this paper we turn to a more refined notion of coherence, the so-called local coherence, which measures for each sensing vector separately how correlated it is to the sparsity basis. For Fourier measurements and Haar wavelet sparsity, the local coherence can be controlled and bounded explicitly, so for matrices comprised of frequencies sampled from a suitable inverse square power-law density, we can prove the restricted isometry property with near-optimal embedding dimensions. Consequently, the variable-density sampling strategy we provide allows for image reconstructions that are stable to sparsity defects and robust to measurement noise. Our results cover both reconstruction by $\ell_1$-minimization and by total variation minimization. The local coherence framework developed in this paper should be of independent interest in sparse recovery problems more generally, as it implies that for optimal sparse recovery results, it suffices to have bounded average coherence from sensing basis to sparsity basis (as opposed to bounded maximal coherence), as long as the sampling strategy is adapted accordingly.
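A variable-density sampler of the kind described might look like the following sketch. The abstract only says "a suitable inverse square power-law density"; the specific form $1/\max(1, k_1^2 + k_2^2)$, sampling with replacement, an even grid size `n`, and both function names are assumptions made for illustration.

```python
import random


def power_law_weights(n):
    """Probabilities proportional to 1 / max(1, k1^2 + k2^2) on an n x n grid (n even)."""
    half = n // 2
    freqs = [(k1, k2) for k1 in range(-half, half) for k2 in range(-half, half)]
    raw = [1.0 / max(1, k1 * k1 + k2 * k2) for k1, k2 in freqs]
    total = sum(raw)
    return freqs, [w / total for w in raw]


def sample_frequencies(n, m, seed=0):
    """Draw m 2D frequencies (with replacement) from the assumed density."""
    freqs, probs = power_law_weights(n)
    rng = random.Random(seed)
    return rng.choices(freqs, weights=probs, k=m)


freqs, probs = power_law_weights(16)
samples = sample_frequencies(16, 50)
```

The density concentrates samples near the origin of the frequency plane, matching the empirical practice of favoring low frequencies that the abstract mentions.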

1304.2367 2026-06-03 cs.CV cs.AI cs.SY eess.SY 版本更新

Utility-Based Control for Computer Vision

基于效用的计算机视觉控制

Tod S. Levitt, Thomas O. Binford, Gil J. Ettinger, Patrice Gelband

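AI summary: To address computational efficiency in Bayesian-network implementations of computer vision, proposes controlling vision tasks by maximizing utility rather than probability, so as to optimize sensor information gathering and data analysis.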

Comments Appears in Proceedings of the Fourth Conference on Uncertainty in Artificial Intelligence (UAI1988)

详情
AI中文摘要

在利用贝叶斯网络实现计算机视觉识别世界对象时,出现了几个关键问题。计算效率是驱动力。感知网络非常深,通常有十五层结构。图像很宽,例如,在512×512像素或更大的图像中,未指定数量的边缘可能出现在任何位置。为了提高效率,我们动态实例化观察到的对象的假设。网络不是固定的,而是在运行时逐步创建。世界对象假设的生成和识别模型的索引很重要,但本文不讨论[4,11]。这项工作旨在近期通过并行计算在雷达监视系统ADRIES[5,15]和工业零件识别系统SUCCESSOR[2]中实现。对于许多应用,视觉必须更快才能实用,因此有效控制机器视觉过程至关重要。感知操作可能扫描百万像素,并可能需要数分钟的计算时间。必须避免不必要的传感器动作和计算。并行计算在多个处理器能力级别上可用。用于高层视觉的并行分布式计算的潜力意味着分配非均匀计算。本文解决了基于贝叶斯概率模型的机器视觉系统中的任务控制问题。我们将控制与推理分离,以扩展先前的工作[3],最大化效用而非概率。最大化效用允许采用感知策略,以有效收集传感器信息并分析传感器数据。本文展示了通过效用控制机器视觉以识别军事场景的结果。未来工作将将其扩展到SUCCESSOR的工业零件识别。

英文摘要

Several key issues arise in implementing computer vision recognition of world objects in terms of Bayesian networks. Computational efficiency is a driving force. Perceptual networks are very deep, typically fifteen levels of structure. Images are wide, e.g., an unspecified number of edges may appear anywhere in an image of 512 x 512 pixels or larger. For efficiency, we dynamically instantiate hypotheses of observed objects. The network is not fixed, but is created incrementally at runtime. Generation of hypotheses of world objects and indexing of models for recognition are important, but they are not considered here [4,11]. This work is aimed at near-term implementation with parallel computation in a radar surveillance system, ADRIES [5, 15], and a system for industrial part recognition, SUCCESSOR [2]. For many applications, vision must be faster to be practical, so efficiently controlling the machine vision process is critical. Perceptual operators may scan megapixels and may require minutes of computation time. It is necessary to avoid unnecessary sensor actions and computation. Parallel computation is available at several levels of processor capability. The potential for parallel, distributed computation for high-level vision means distributing non-homogeneous computations. This paper addresses the problem of task control in machine vision systems based on Bayesian probability models. We separate control and inference to extend the previous work [3] to maximize utility instead of probability. Maximizing utility allows adopting perceptual strategies for efficient information gathering with sensors and analysis of sensor data. Results of controlling machine vision via utility to recognize military situations are presented in this paper. Future work extends this to industrial part recognition for SUCCESSOR.

1112.3166 2026-06-03 cs.CV cs.NA math.NA 版本更新

Higher-Order Momentum Distributions and Locally Affine LDDMM Registration

高阶动量分布与局部仿射LDDMM配准

Stefan Sommer, Mads Nielsen, Sune Darkner, Xavier Pennec

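AI summary: Introduces higher-order momentum distributions into the LDDMM framework; first-order momenta give a compact representation of locally affine transformations, so non-rigid registration can be done with very few parameters while directly providing interpretable mathematical and modeling information.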

详情
AI中文摘要

为了实现允许直观分析的稀疏参数化,我们旨在用包含可解释元素的基来表示变形,并希望使用具有描述能力的元素来紧凑地表示变形。为此,本文在LDDMM配准框架中引入了高阶动量分布。先前在LDDMM中使用的零阶动量仅描述局部位移,而本文提出的一阶动量表示一个基,允许局部描述仿射变换,进而紧凑地描述全局非刚性变形中的非平移运动。所得表示从数学和建模角度都包含直接可解释的信息。我们开发了具有高阶动量的配准框架的数学构造,展示了其对稀疏图像配准和变形描述的意义,并提供了参数化如何以极少数参数实现配准的示例。使用高阶动量的参数化的能力和可解释性导致了关节运动的自然建模,该方法有望用于量化阿尔茨海默病期间的心室扩张和进行性萎缩。

英文摘要

To achieve sparse parametrizations that allow intuitive analysis, we aim to represent deformation with a basis containing interpretable elements, and we wish to use elements that have the description capacity to represent the deformation compactly. To accomplish this, we introduce in this paper higher-order momentum distributions in the LDDMM registration framework. While the zeroth-order momenta previously used in LDDMM only describe local displacement, the first-order momenta that are proposed here represent a basis that allows local description of affine transformations and subsequent compact description of non-translational movement in a globally non-rigid deformation. The resulting representation contains directly interpretable information from both mathematical and modeling perspectives. We develop the mathematical construction of the registration framework with higher-order momenta, we show the implications for sparse image registration and deformation description, and we provide examples of how the parametrization enables registration with a very low number of parameters. The capacity and interpretability of the parametrization using higher-order momenta lead to natural modeling of articulated movement, and the method promises to be useful for quantifying ventricle expansion and progressing atrophy during Alzheimer's disease.

1211.1690 2026-06-03 cs.RO cs.CV cs.LG cs.SY eess.SY 版本更新

Learning Monocular Reactive UAV Control in Cluttered Natural Environments

学习在杂乱自然环境中进行单目反应式无人机控制

Stephane Ross, Narek Melik-Barkhudarov, Kumar Shaurya Shankar, Andreas Wendel, Debadeepta Dey, J. Andrew Bagnell, Martial Hebert

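AI summary: Using a monocular camera and imitation learning, this paper trains a controller that lets a small quadrotor navigate autonomously and avoid obstacles in natural forest environments at speeds of up to 1.5 m/s.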

Comments 8 pages, 10 figures

详情
AI中文摘要

大型无人机的自主导航相对简单,因为可以使用昂贵的传感器和监控设备。相比之下,在杂乱环境中低空飞行的微型飞行器(MAV)的避障仍然是一项具有挑战性的任务。与大型飞行器不同,MAV只能携带非常轻的传感器,如摄像头,这使得通过障碍物的自主导航更具挑战性。本文描述了一个系统,该系统能够使小型四旋翼直升机在自然森林环境中低空自主导航。仅使用单个廉价摄像头感知环境,我们能够保持高达1.5m/s的恒定速度。通过少量人类飞行员演示,我们使用最新的模仿学习技术训练了一个控制器,该控制器通过调整MAV的航向来避免树木。我们在室内更受控的环境和室外真实自然森林环境中展示了系统的性能。

英文摘要

Autonomous navigation for large Unmanned Aerial Vehicles (UAVs) is fairly straightforward, as expensive sensors and monitoring devices can be employed. In contrast, obstacle avoidance remains a challenging task for Micro Aerial Vehicles (MAVs), which operate at low altitude in cluttered environments. Unlike large vehicles, MAVs can only carry very light sensors, such as cameras, making autonomous navigation through obstacles much more challenging. In this paper, we describe a system that navigates a small quadrotor helicopter autonomously at low altitude through natural forest environments. Using only a single cheap camera to perceive the environment, we are able to maintain a constant velocity of up to 1.5 m/s. Given a small set of human pilot demonstrations, we use recent state-of-the-art imitation learning techniques to train a controller that can avoid trees by adapting the MAV's heading. We demonstrate the performance of our system in a more controlled environment indoors, and in real natural forest environments outdoors.

1210.5034 2026-06-03 cs.LG cs.CV cs.NA math.NA 版本更新

Optimal Computational Trade-Off of Inexact Proximal Methods

非精确近端方法的最优计算权衡

Pierre Machart, Sandrine Anthoine, Luca Baldassarre

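AI summary: Studies the trade-off between computational cost and convergence rate in proximal-gradient methods and proposes SIP, a Speedy Inexact Proximal-gradient algorithm that is computationally efficient and easy to implement.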

详情
AI中文摘要

在本文中,我们研究了在使用近端梯度方法(机器学习中流行的优化工具)最小化复合泛函时,收敛速度与计算代价之间的权衡。我们考虑近端算子通过迭代过程计算的情况,该过程提供了精确近端算子的近似。在这种情况下,我们得到具有两个嵌套循环的算法。我们表明,在有限时间内达到所需精度的解时,最小化计算代价的策略是将内迭代次数设置为常数,这与收敛速度分析所指示的策略不同。在此过程中,我们还提出了一种称为SIP(快速非精确近端梯度算法)的新程序,该程序既计算高效又易于实现。我们的数值实验证实了理论发现,并表明SIP可以成为标准程序的非常有竞争力的替代方案。

英文摘要

In this paper, we investigate the trade-off between convergence rate and computational cost when minimizing a composite functional with proximal-gradient methods, which are popular optimisation tools in machine learning. We consider the case when the proximity operator is computed via an iterative procedure, which provides an approximation of the exact proximity operator. In that case, we obtain algorithms with two nested loops. We show that the strategy that minimizes the computational cost to reach a solution with a desired accuracy in finite time is to set the number of inner iterations to a constant, which differs from the strategy indicated by a convergence rate analysis. In the process, we also present a new procedure called SIP (that is Speedy Inexact Proximal-gradient algorithm) that is both computationally efficient and easy to implement. Our numerical experiments confirm the theoretical findings and suggest that SIP can be a very competitive alternative to the standard procedure.

1210.4081 2026-06-03 math.NA cs.CV cs.DS cs.LG cs.NA math.OC 版本更新

Getting Feasible Variable Estimates From Infeasible Ones: MRF Local Polytope Study

从不可行变量估计获得可行变量估计:MRF局部多面体研究

Bogdan Savchynskyy, Stefan Schmidt

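AI summary: For large-scale optimization problems with separability properties, proposes a method for constructing approximate feasible primal solutions from dual ones, applies it to the local polytope relaxation of Markov random field inference problems, and demonstrates its superiority over existing methods.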

Comments 20 pages, 4 figures

详情
AI中文摘要

本文提出了一种方法,用于从对偶解构造具有特定可分离性的大规模优化问题的近似可行原始解。虽然通常可以从对偶函数的(次)梯度产生不可行的原始估计,但将其投影到原始可行集往往并不容易,因为投影本身的复杂度与初始问题的复杂度相当。我们提出了一种替代的有效方法来获得可行性,并证明了其影响收敛到最优性的性质与欧几里得投影的性质相似。我们将我们的方法应用于马尔可夫随机场推理问题的局部多面体松弛,并证明了其优于现有方法。

英文摘要

This paper proposes a method for construction of approximate feasible primal solutions from dual ones for large-scale optimization problems possessing certain separability properties. Whereas infeasible primal estimates can typically be produced from (sub-)gradients of the dual function, it is often not easy to project them to the primal feasible set, since the projection itself has a complexity comparable to the complexity of the initial problem. We propose an alternative efficient method to obtain feasibility and show that its properties influencing the convergence to the optimum are similar to the properties of the Euclidean projection. We apply our method to the local polytope relaxation of inference problems for Markov Random Fields and demonstrate its superiority over existing methods.

1207.3554 2026-06-03 cs.CV cs.NA math.NA stat.ME stat.ML 版本更新

Designing various component analysis at will

随意设计各种成分分析

Akisato Kimura, Masashi Sugiyama, Sakano Hitoshi, Hirokazu Kameoka

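AI summary: Proposes a generic component analysis framework based on the Generalized Pairwise Expression (GPE), covering standard methods, regularization, weighting, clustering, and semi-supervised extensions, and gives a simple strategy for designing new methods by combining templates.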

Comments Accepted to IAPR International Conference on Pattern Recognition, submitted to IPSJ Transactions on Mathematical Modeling and its Applications (TOM). Just only one-page abstract for new due to novelty violation for journal submission. The details will be disclosed in late September

详情
AI中文摘要

本文提供了一个通用的成分分析(CA)方法框架,引入了一种新的散度矩阵和Gram矩阵表达式,称为广义成对表达(GPE)。该表达式非常紧凑但功能强大:该框架不仅包括(1)标准CA方法,还包括(2)几种正则化技术,(3)加权扩展,(4)一些聚类方法,以及(5)它们的半监督扩展。本文还提出了一种非常简单的方法,用于从所提出的框架中设计所需的CA方法:采用已知的GPE作为模板,并通过适当组合这些模板生成新方法。

英文摘要

This paper provides a generic framework of component analysis (CA) methods introducing a new expression for scatter matrices and Gram matrices, called Generalized Pairwise Expression (GPE). This expression is quite compact but highly powerful: The framework includes not only (1) the standard CA methods but also (2) several regularization techniques, (3) weighted extensions, (4) some clustering methods, and (5) their semi-supervised extensions. This paper also presents quite a simple methodology for designing a desired CA method from the proposed framework: Adopting the known GPEs as templates, and generating a new method by combining these templates appropriately.

1210.0822 2026-06-03 math.NA cs.CV cs.NA 版本更新

Discrete geodesic calculus in the space of viscous fluidic objects

粘性流体对象空间中的离散测地线计算

Martin Rumpf, Benedikt Wirth

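AI summary: Based on a local approximation of the Riemannian distance, proposes a time-discrete geodesic calculus and applies it to shape morphing, extrapolation, and feature transfer in shape space.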

详情
AI中文摘要

基于流形上黎曼距离的局部近似(通过计算成本低的相异性度量),发展了一种时间离散的测地线计算,并探索了在形状空间中的应用。该相异性度量源自变形能量,其Hessian矩阵再现了底层的黎曼度量,并用于定义形状空间中离散路径的长度和能量。离散测地线定义为能量最小化路径,由此引出了离散对数映射、离散指数映射的变分定义以及时间离散的平行传输。这一新概念应用于形状空间,其中形状被视为由粘性材料构成的物理对象的边界轮廓。通过保持拓扑的形状变形、将局部形状变化作为路径生成器来表示形状空间中的路径、通过离散测地线流进行形状外推以及几何特征的传递,展示了该方法的灵活性和计算效率。

英文摘要

Based on a local approximation of the Riemannian distance on a manifold by a computationally cheap dissimilarity measure, a time discrete geodesic calculus is developed, and applications to shape space are explored. The dissimilarity measure is derived from a deformation energy whose Hessian reproduces the underlying Riemannian metric, and it is used to define length and energy of discrete paths in shape space. The notion of discrete geodesics defined as energy minimizing paths gives rise to a discrete logarithmic map, a variational definition of a discrete exponential map, and a time discrete parallel transport. This new concept is applied to a shape space in which shapes are considered as boundary contours of physical objects consisting of viscous material. The flexibility and computational efficiency of the approach is demonstrated for topology preserving shape morphing, the representation of paths in shape space via local shape variations as path generators, shape extrapolation via discrete geodesic flow, and the transfer of geometric features.
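The discrete length and energy of a path can be sketched as follows. The abstract states only that the dissimilarity measure defines them; the specific forms used here (length as the sum of $\sqrt{W}$ over consecutive points, energy as $K$ times the sum of $W$ over $K$ segments) and the use of squared Euclidean distance as a stand-in for the deformation-energy-based $W$ are assumptions for illustration.

```python
import math


def sq_dist(a, b):
    """Stand-in dissimilarity W: squared Euclidean distance (an assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def discrete_length(path, W=sq_dist):
    """Assumed discrete length: sum of sqrt(W) over consecutive points."""
    return sum(math.sqrt(W(p, q)) for p, q in zip(path, path[1:]))


def discrete_energy(path, W=sq_dist):
    """Assumed discrete energy: K * sum of W over the K segments."""
    K = len(path) - 1
    return K * sum(W(p, q) for p, q in zip(path, path[1:]))


straight = [(k / 4, 0.0) for k in range(5)]
bent = [(0.0, 0.0), (0.25, 0.3), (0.5, 0.0), (0.75, 0.3), (1.0, 0.0)]
```

With these definitions the Cauchy-Schwarz inequality gives length squared at most the energy, with equality when all segments have equal $W$. In this flat stand-in setting, the equispaced straight path has lower energy than the bent path with the same endpoints, consistent with the notion of a discrete geodesic as an energy-minimizing path.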

1209.5826 2026-06-03 math.NA cs.CV cs.NA 版本更新

Refinability of splines from lattice Voronoi cells

来自格点Voronoi细胞的样条的可细化性

Jorg Peters

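AI summary: Presents simple criteria showing that only a few spline families (such as box splines and tensor-product splines) are refinable, whereas non-refinable splines such as hex-splines can exhibit increased approximation error upon lattice refinement.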

详情
AI中文摘要

样条可以通过对格点的Voronoi细胞的指示函数进行卷积来构造。本文提出了简单的准则,表明只有少数这样的样条族是可细化的:本质上就是众所周知的箱样条和张量积样条。许多不可细化的构造包括六边形样条及其在非笛卡尔格点上的推广。一个例子展示了不可细化样条在格点细化时如何表现出增加的近似误差。

英文摘要

Splines can be constructed by convolving the indicator function of the Voronoi cell of a lattice. This paper presents simple criteria that imply that only a small subset of such spline families can be refined: essentially the well-known box splines and tensor-product splines. Among the many non-refinable constructions are hex-splines and their generalization to non-Cartesian lattices. An example shows how non-refinable splines can exhibit increased approximation error upon refinement of the lattice.

1007.3753 2026-06-03 cs.CV cs.NA math.NA 版本更新

Fast L1-Minimization Algorithms For Robust Face Recognition

用于鲁棒人脸识别的快速L1最小化算法

Allen Y. Yang, Zihan Zhou, Arvind Ganesh, S. Shankar Sastry, Yi Ma

AI总结 针对鲁棒人脸识别中的稀疏表示分类框架,提出基于增广拉格朗日方法的快速L1最小化解法,解决了传统算法在大规模应用中的可扩展性问题。

详情
AI中文摘要

L1最小化是指在欠定线性系统b=Ax中寻找最小L1范数解。根据压缩感知理论中的某些条件,最小L1范数解也是最稀疏的解。本文研究其算法的速度和可扩展性。特别地,我们关注鲁棒人脸识别中基于稀疏性的分类框架的数值实现,其中通过稀疏表示从可能被光照、面部伪装和姿态变化破坏的高维人脸图像中恢复人类身份。尽管底层数值问题是线性规划,但传统算法在大规模应用中可扩展性差。我们研究了一种基于经典凸优化框架——增广拉格朗日方法(ALM)的新解法。新的凸求解器为实时、时间关键的应用(如人脸识别)提供了可行的解决方案。我们进行了大量实验,验证并比较了ALM算法与几种流行的L1最小化解法(包括内点法、Homotopy、FISTA、SESOP-PCD、近似消息传递(AMP)和TFOCS)的性能。为便于同行评估,所有算法的代码均已公开。

英文摘要

L1-minimization refers to finding the minimum L1-norm solution to an underdetermined linear system b=Ax. Under certain conditions as described in compressive sensing theory, the minimum L1-norm solution is also the sparsest solution. In this paper, our study addresses the speed and scalability of its algorithms. In particular, we focus on the numerical implementation of a sparsity-based classification framework in robust face recognition, where sparse representation is sought to recover human identities from very high-dimensional facial images that may be corrupted by illumination, facial disguise, and pose variation. Although the underlying numerical problem is a linear program, traditional algorithms are known to suffer poor scalability for large-scale applications. We investigate a new solution based on a classical convex optimization framework, known as Augmented Lagrangian Methods (ALM). The new convex solvers provide a viable solution to real-world, time-critical applications such as face recognition. We conduct extensive experiments to validate and compare the performance of the ALM algorithms against several popular L1-minimization solvers, including interior-point method, Homotopy, FISTA, SESOP-PCD, approximate message passing (AMP) and TFOCS. To aid peer evaluation, the code for all the algorithms has been made publicly available.
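To make the L1 objective concrete, here is a minimal sketch of iterative soft-thresholding (ISTA) for the unconstrained form min 0.5*||Ax - b||^2 + lam*||x||_1. ISTA is the unaccelerated relative of FISTA, which the abstract lists as a comparison solver; it is not the ALM solver the paper proposes. The function names, the penalized formulation, and the tiny test matrices are illustrative assumptions, not taken from the paper or its released code.

```python
def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal map of t * ||.||_1)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def objective(A, b, x, lam):
    """Value of 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    r = [ri - bi for ri, bi in zip(matvec(A, x), b)]
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xi) for xi in x)


def ista(A, b, lam, step, iters=200):
    """Iterative soft-thresholding, starting from x = 0.

    step is assumed to be at most 1 / ||A||_2^2 (hypothetical caller contract).
    """
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [ri - bi for ri, bi in zip(matvec(A, x), b)]
        grad = matvec(At, residual)  # gradient of the smooth least-squares term
        x = soft_threshold([xi - step * gi for xi, gi in zip(x, grad)], step * lam)
    return x
```

When A is the identity and step is 1, the iteration reduces to a single soft-thresholding of b, which is a convenient sanity check before trying an underdetermined system.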

1206.4676 2026-06-03 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

Clustering by Low-Rank Doubly Stochastic Matrix Decomposition

基于低秩双随机矩阵分解的聚类

Zhirong Yang, Erkki Oja

AI总结 提出一种超越矩阵分解的低秩学习方法,通过两步二分随机游走逼近聚类分配概率,利用KL散度最小化实现判别模型的最大似然估计,并采用松弛的MM算法优化,显著提升大规模流形数据的聚类纯度。

Comments ICML2012

详情
AI中文摘要

在过去十年中,通过非负低秩近似进行聚类分析取得了显著进展。然而,该方向上的大多数近似方法仍局限于矩阵分解。我们提出了一种新的低秩学习方法以提高聚类性能,该方法超越了矩阵分解。该近似基于通过虚拟聚类节点的两步二分随机游走,其中近似仅由聚类分配概率构成。通过Kullback-Leibler散度测量的近似误差最小化等价于判别模型的最大似然估计,这为我们的方法提供了坚实的概率解释。优化通过一种松弛的Majorization-Minimization算法实现,该算法在寻找良好局部最小值方面具有优势。此外,我们指出带有Dirichlet先验的正则化算法仅作为初始化。实验结果表明,新方法在各种数据集上,特别是大规模流形数据上,具有强大的聚类纯度性能。

英文摘要

Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matrix factorization. We propose a new low-rank learning method to improve the clustering performance, which is beyond matrix factorization. The approximation is based on a two-step bipartite random walk through virtual cluster nodes, where the approximation is formed by only cluster assigning probabilities. Minimizing the approximation error measured by Kullback-Leibler divergence is equivalent to maximizing the likelihood of a discriminative model, which endows our method with a solid probabilistic interpretation. The optimization is implemented by a relaxed Majorization-Minimization algorithm that is advantageous in finding good local minima. Furthermore, we point out that the regularized algorithm with Dirichlet prior only serves as initialization. Experimental results show that the new method has strong performance in clustering purity for various datasets, especially for large-scale manifold data.

1206.2061 2026-06-03 math.NA cs.CV cs.NA 版本更新

Comments on "On Approximating Euclidean Metrics by Weighted t-Cost Distances in Arbitrary Dimension"

关于“任意维度中通过加权t-代价距离逼近欧几里得度量”的评论

M. Emre Celebi, Hassan A. Kingravi, Fatih Celiker

AI总结 本文评论了Mukherjee提出的加权t-代价距离逼近欧几里得范数的方法,指出其在ℝⁿ中的平均误差过于乐观,并提出了改进精度的归一化方案。

Comments 7 pages, 1 figure, 3 tables. arXiv admin note: substantial text overlap with arXiv:1008.4870

详情
Journal ref
Pattern Recognition Letters 33 (2012) 1422–1425
AI中文摘要

Mukherjee(Pattern Recognition Letters, vol. 32, pp. 824-831, 2011)最近引入了一类称为加权t-代价距离的距离函数,它推广了m-邻域、八边形和t-代价距离。他证明了加权t-代价距离构成一个度量族,并推导了在$\mathbb{Z}^n$中欧几里得范数的近似。在本注释中,我们将此近似与先前提出的两种欧几里得范数近似进行比较,并证明Mukherjee给出的经验平均误差在$\mathbb{R}^n$中显著乐观。我们还提出了一种简单的归一化方案,该方案在平均相对误差和最大相对误差方面都显著提高了其近似的精度。

英文摘要

Mukherjee (Pattern Recognition Letters, vol. 32, pp. 824-831, 2011) recently introduced a class of distance functions called weighted t-cost distances that generalize m-neighbor, octagonal, and t-cost distances. He proved that weighted t-cost distances form a family of metrics and derived an approximation for the Euclidean norm in $\mathbb{Z}^n$. In this note we compare this approximation to two previously proposed Euclidean norm approximations and demonstrate that the empirical average errors given by Mukherjee are significantly optimistic in $\mathbb{R}^n$. We also propose a simple normalization scheme that improves the accuracy of his approximation substantially with respect to both average and maximum relative errors.

1008.5372 2026-06-03 math.OC cs.CV cs.IT cs.LG cs.NA math.IT math.NA stat.ME 版本更新

Penalty Decomposition Methods for $L0$-Norm Minimization

L0-范数最小化的罚分解方法

Zhaosong Lu, Yong Zhang

AI总结 提出罚分解方法求解含L0-范数的优化问题,通过转化为秩最小化问题并利用向量化操作,在压缩感知等应用中优于现有方法。

Comments This paper has been withdrawn by the author because an updated version has been resubmitted

详情
AI中文摘要

本文考虑一般的l0-范数最小化问题,即目标函数或约束中出现l0-范数的问题。特别地,我们首先将l0-范数约束问题重新表述为等价的秩最小化问题,然后应用[33]中提出的罚分解(PD)方法求解后者。通过利用特殊结构,我们将该方法的所有矩阵运算转化为向量运算,得到仅涉及向量运算的PD方法。在适当的假设下,我们证明PD方法生成的序列的任何聚点满足一阶最优性条件,该条件通常比一个自然最优性条件更强。我们进一步扩展PD方法以求解目标函数中出现l0-范数的问题。最后,通过将PD方法应用于压缩感知、稀疏逻辑回归和稀疏逆协方差选择来测试其性能。计算结果表明,我们的方法在解质量和/或速度方面通常优于现有方法。

英文摘要

In this paper we consider general l0-norm minimization problems, that is, the problems with l0-norm appearing in either objective function or constraint. In particular, we first reformulate the l0-norm constrained problem as an equivalent rank minimization problem and then apply the penalty decomposition (PD) method proposed in [33] to solve the latter problem. By utilizing the special structures, we then transform all matrix operations of this method to vector operations and obtain a PD method that only involves vector operations. Under some suitable assumptions, we establish that any accumulation point of the sequence generated by the PD method satisfies a first-order optimality condition that is generally stronger than one natural optimality condition. We further extend the PD method to solve the problem with the l0-norm appearing in objective function. Finally, we test the performance of our PD methods by applying them to compressed sensing, sparse logistic regression and sparse inverse covariance selection. The computational results demonstrate that our methods generally outperform the existing methods in terms of solution quality and/or speed.

1108.5359 2026-06-03 math.NA cs.CV cs.NA 版本更新

Solving Principal Component Pursuit in Linear Time via $l_1$ Filtering

通过 $l_1$ 滤波在线性时间内求解主成分追踪

Risheng Liu, Zhouchen Lin, Siming Wei, Zhixun Su

AI总结 提出一种名为 $l_1$ 滤波的算法,以 $O(r^2(m+n))$ 复杂度精确求解主成分追踪问题,实现线性时间内的核范数最小化,并具有高度可并行性。

详情
AI中文摘要

在过去的几十年中,从被破坏的观测数据中精确恢复内在数据结构(即鲁棒主成分分析,RPCA)引起了极大的兴趣,并在计算机视觉中找到了许多应用。最近,该问题被表述为从观测数据矩阵中恢复低秩分量和稀疏分量。已证明,在适当的条件下,该问题可以通过主成分追踪(PCP)精确求解,即最小化核范数和 $l_1$ 范数的组合。现有的求解 PCP 的方法大多需要对数据矩阵进行奇异值分解(SVD),导致计算复杂度高,从而阻碍了 RPCA 在超大规模计算机视觉问题中的应用。在本文中,我们提出了一种新颖的算法,称为 $l_1$ 滤波,用于以 $O(r^2(m+n))$ 的复杂度精确求解 PCP,其中 $m\times n$ 是数据矩阵的大小,$r$ 是要恢复矩阵的秩,假设远小于 $m$ 和 $n$。此外,$l_1$ 滤波是高度可并行的。它是第一个能够以线性时间(相对于数据大小)精确求解核范数最小化问题的算法。在合成数据和实际应用上的实验证明了 $l_1$ 滤波在速度上相对于最先进算法的巨大优势。

英文摘要

In the past decades, exactly recovering the intrinsic data structure from corrupted observations, which is known as robust principal component analysis (RPCA), has attracted tremendous interest and found many applications in computer vision. Recently, this problem has been formulated as recovering a low-rank component and a sparse component from the observed data matrix. It has been proved that, under some suitable conditions, this problem can be exactly solved by principal component pursuit (PCP), i.e., minimizing a combination of the nuclear norm and the $l_1$ norm. Most of the existing methods for solving PCP require singular value decompositions (SVD) of the data matrix, resulting in a high computational complexity and hence preventing the application of RPCA to very large scale computer vision problems. In this paper, we propose a novel algorithm, called $l_1$ filtering, for exactly solving PCP with an $O(r^2(m+n))$ complexity, where $m\times n$ is the size of the data matrix and $r$ is the rank of the matrix to recover, which is assumed to be much smaller than $m$ and $n$. Moreover, $l_1$ filtering is highly parallelizable. It is the first algorithm that can exactly solve a nuclear norm minimization problem in linear time (with respect to the data size). Experiments on both synthetic data and real applications testify to the great advantage of $l_1$ filtering in speed over state-of-the-art algorithms.

1202.5844 2026-06-03 math.NA cs.CV cs.NA 版本更新

Divide-and-Conquer Method for L1 Norm Matrix Factorization in the Presence of Outliers and Missing Data

存在异常值和缺失数据时L1范数矩阵分解的分治方法

Deyu Meng, Zongben Xu

AI总结 针对L1范数矩阵分解问题,提出分治方法,将原问题分解为一系列最小子问题,每个子问题有闭式解,通过递归优化构建高效算法,复杂度与数据规模和维度近似线性,在计算时间和精度上优于现有方法。

Comments 19 pages, 2 figures, 2 tables

详情
AI中文摘要

低秩矩阵分解作为L1范数最小化问题,因其对异常值和缺失数据的内在鲁棒性而受到广泛关注。本文提出一种新方法,称为分治方法,用于解决该问题。主要思想是将原问题分解为一系列尽可能小的子问题,每个子问题仅涉及唯一的标量参数。每个子问题被证明是凸的且有闭式解。通过以解析方式递归优化这些小问题,可以自然地构建出完全避免耗时的数值优化作为内循环的高效算法来解决原问题。所提算法的计算复杂度在数据规模和维度上均近似线性,使其能够处理大规模L1范数矩阵分解问题。该算法在理论上也被证明是收敛的。基于一系列实验结果,我们证实了在L1矩阵分解计算中,我们的方法在计算时间和精度上始终优于当前最先进的方法,尤其是在人脸识别和运动恢复结构等大规模应用中。

英文摘要

Low-rank matrix factorization posed as an L1 norm minimization problem has recently attracted much attention due to its intrinsic robustness to outliers and missing data. In this paper, we propose a new method, called the divide-and-conquer method, for solving this problem. The main idea is to break the original problem into a series of smallest possible sub-problems, each involving only a single scalar parameter. Each of these sub-problems is proved to be convex and to have a closed-form solution. By recursively optimizing these small problems in an analytical way, an efficient algorithm for solving the original problem can naturally be constructed, entirely avoiding time-consuming numerical optimization as an inner loop. The computational complexity of the proposed algorithm is approximately linear in both data size and dimensionality, making it possible to handle large-scale L1 norm matrix factorization problems. The algorithm is also theoretically proved to be convergent. Based on a series of experimental results, it is substantiated that our method always achieves better results than the current state-of-the-art methods for L1 matrix factorization in both computational time and accuracy, especially on large-scale applications such as face recognition and structure from motion.

1204.4476 2026-06-03 cs.CV cs.SY eess.SY 版本更新

Dynamic Template Tracking and Recognition

动态模板跟踪与识别

Rizwan Chaudhry, Gregory Hager, Rene Vidal

AI总结 提出使用线性动态系统建模非刚性物体的外观/运动时间演化,作为动态模板进行跟踪,并实现同时跟踪与识别。

详情
AI中文摘要

本文解决局部外观和运动随时间变化的非刚性物体跟踪问题。这类物体包括动态纹理(如蒸汽、火、烟、水等)以及关节物体(如执行各种动作的人)。我们使用线性动态系统(LDS)对物体外观/运动的时间演化进行建模。从样本视频中学习此类模型,并将其作为动态模板用于跟踪新视频中的物体。我们将当前帧中动态非刚性物体的跟踪问题视为在给定当前图像特征和前帧状态最佳估计下,物体位置和动态系统潜在状态的最大后验估计。我们方法的优势在于,通过使用先前训练的纹理动力学模型,可以预先指定场景中要跟踪的纹理类型。我们的框架自然地将常见的跟踪方法(如SSD和基于核的跟踪)从静态模板推广到动态模板。我们在合成和真实动态纹理示例上测试算法,并表明我们基于简单动力学的跟踪器性能与最先进方法相当甚至更优。由于我们的方法具有通用性且适用于任何图像特征,我们还将其应用于人体动作跟踪问题,构建了特定动作的光流跟踪器,在跟踪执行特定动作的人时性能优于最先进方法。最后,由于我们的方法是生成式的,我们可以使用针对不同纹理或动作类别预先训练的跟踪器,同时跟踪和识别视频中的纹理或动作。

英文摘要

In this paper we address the problem of tracking non-rigid objects whose local appearance and motion change as a function of time. This class of objects includes dynamic textures such as steam, fire, smoke, water, etc., as well as articulated objects such as humans performing various actions. We model the temporal evolution of the object's appearance/motion using a Linear Dynamical System (LDS). We learn such models from sample videos and use them as dynamic templates for tracking objects in novel videos. We pose the problem of tracking a dynamic non-rigid object in the current frame as a maximum a-posteriori estimate of the location of the object and the latent state of the dynamical system, given the current image features and the best estimate of the state in the previous frame. The advantage of our approach is that we can specify a-priori the type of texture to be tracked in the scene by using previously trained models for the dynamics of these textures. Our framework naturally generalizes common tracking methods such as SSD and kernel-based tracking from static templates to dynamic templates. We test our algorithm on synthetic as well as real examples of dynamic textures and show that our simple dynamics-based trackers perform on par with, if not better than, the state-of-the-art. Since our approach is general and applicable to any image feature, we also apply it to the problem of human action tracking and build action-specific optical flow trackers that perform better than the state-of-the-art when tracking a human performing a particular action. Finally, since our approach is generative, we can use a-priori trained trackers for different texture or action classes to simultaneously track and recognize the texture or action in the video.

1203.2210 2026-06-03 cs.CV cs.NA math.NA 版本更新

Fixed-Rank Representation for Unsupervised Visual Learning

固定秩表示用于无监督视觉学习

Risheng Liu, Zhouchen Lin, Fernando De la Torre, Zhixun Su

AI总结 本文提出固定秩表示(FRR)作为无监督视觉学习的统一框架,通过闭式解揭示多子空间结构,并引入稀疏正则化以增强鲁棒性,同时开发了快速数值求解器。

Comments accepted by CVPR 2012

详情
AI中文摘要

子空间聚类和特征提取是计算机视觉和模式识别中最常用的两种无监督学习技术。最先进的子空间聚类技术利用了稀疏性和秩最小化的最新进展。然而,现有技术计算成本高,并且在数据采样不足的情况下可能导致退化解,从而降低聚类性能。为了部分解决这些问题,并受现有矩阵分解工作的启发,本文提出固定秩表示(FRR)作为无监督视觉学习的统一框架。当数据无噪声时,FRR能够以闭式形式揭示多个子空间的结构。此外,我们证明在某些适当条件下,即使观测不足,FRR仍然能够揭示真实的子空间成员关系。为了实现对异常值和噪声的鲁棒性,我们在FRR框架中引入了稀疏正则化。除了子空间聚类,FRR还可用于无监督特征提取。作为一个非平凡的副产品,我们为FRR开发了一个快速数值求解器。在合成数据和实际应用上的实验结果验证了我们的理论分析,并展示了FRR在无监督视觉学习中的优势。

英文摘要

Subspace clustering and feature extraction are two of the most commonly used unsupervised learning techniques in computer vision and pattern recognition. State-of-the-art techniques for subspace clustering make use of recent advances in sparsity and rank minimization. However, existing techniques are computationally expensive and may result in degenerate solutions that degrade clustering performance in the case of insufficient data sampling. To partially solve these problems, and inspired by existing work on matrix factorization, this paper proposes fixed-rank representation (FRR) as a unified framework for unsupervised visual learning. FRR is able to reveal the structure of multiple subspaces in closed-form when the data is noiseless. Furthermore, we prove that under some suitable conditions, even with insufficient observations, FRR can still reveal the true subspace memberships. To achieve robustness to outliers and noise, a sparse regularizer is introduced into the FRR framework. Beyond subspace clustering, FRR can be used for unsupervised feature extraction. As a non-trivial byproduct, a fast numerical solver is developed for FRR. Experimental results on both synthetic data and real applications validate our theoretical analysis and demonstrate the benefits of FRR for unsupervised visual learning.

1202.5414 2026-06-03 math.AP cs.CV cs.NA math.NA math.RT 版本更新

Left-Invariant Diffusion on the Motion Group in terms of the Irreducible Representations of SO(3)

基于SO(3)不可约表示的运动群上的左不变扩散

Marco Reisert, Henrik Skibbe

AI总结 利用SO(3)不可约表示将SE(3)上的左不变向量场表示为平移坐标的微分形式和旋转的代数形式,避免了对SO(3)或S2的显式离散化,并应用于扩散加权磁共振成像和目标检测。

详情
AI中文摘要

本文研究了基于SO(3)不可约表示的三维运动群SE(3)上的对流/扩散方程的公式化。因此,SE(3)上的左不变向量场被表示为线性算子,这些算子是平移坐标的微分形式和旋转的代数形式。在三维图像处理的背景下,该方法避免了对SO(3)或S2的显式离散化。这对于SO(3)尤其重要,因为直接离散化由于巨大的内存消耗而不可行。我们展示了该框架的两个应用:一个在扩散加权磁共振成像的背景下,另一个在目标检测的背景下。

英文摘要

In this work we study the formulation of convection/diffusion equations on the 3D motion group SE(3) in terms of the irreducible representations of SO(3). Therefore, the left-invariant vector fields on SE(3) are expressed as linear operators that are differential forms in the translation coordinate and algebraic in the rotation. In the context of 3D image processing this approach avoids the explicit discretization of SO(3) or $S_2$, respectively. This is particularly important for SO(3), where a direct discretization is infeasible due to the enormous memory consumption. We show two applications of the framework: one in the context of diffusion-weighted magnetic resonance imaging and one in the context of object detection.

1109.3827 2026-06-03 cs.IT cs.CV cs.SY eess.SY math.IT math.OC stat.ML 版本更新

Online Robust Subspace Tracking from Partial Information

基于部分信息的在线鲁棒子空间跟踪

Jun He, Laura Balzano, John C. S. Lui

AI总结 提出GRASTA算法,利用鲁棒l1范数从高度不完整数据中在线跟踪子空间,应用于鲁棒矩阵补全和视频背景-前景实时分离,在基准视频上达到57帧/秒。

Comments 28 pages, 12 figures

详情
AI中文摘要

本文提出了GRASTA(Grassmannian鲁棒自适应子空间跟踪算法),一种高效且鲁棒的在线算法,用于从高度不完整的信息中跟踪子空间。该算法使用鲁棒的$l^1$-范数代价函数,以便在流数据向量被异常值污染时估计和跟踪非平稳子空间。我们将GRASTA应用于鲁棒矩阵补全以及视频中背景与前景的实时分离问题。在第二个应用中,我们展示了GRASTA以异常高的速度执行运动物体与背景的高质量分离:在一个流行的基准视频示例中,即使在个人笔记本电脑上运行MATLAB,GRASTA也能达到每秒57帧的速率。

英文摘要

This paper presents GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm), an efficient and robust online algorithm for tracking subspaces from highly incomplete information. The algorithm uses a robust $l^1$-norm cost function in order to estimate and track non-stationary subspaces when the streaming data vectors are corrupted with outliers. We apply GRASTA to the problems of robust matrix completion and real-time separation of background from foreground in video. In this second application, we show that GRASTA performs high-quality separation of moving objects from background at exceptional speeds: In one popular benchmark video example, GRASTA achieves a rate of 57 frames per second, even when run in MATLAB on a personal laptop.

0906.0434 2026-06-03 cs.CV cs.NA math.NA stat.ME 版本更新

Total Variation, Adaptive Total Variation and Nonconvex Smoothly Clipped Absolute Deviation Penalty for Denoising Blocky Images

全变分、自适应全变分和非凸平滑剪切绝对偏差惩罚用于块状图像去噪

Aditya Chopra, Heng Lian

AI总结 针对全变分模型的偏差问题,提出一种受高维变量选择启发的非凸惩罚函数,通过MM算法高效求解,实验证明在块状图像去噪中性能优于传统方法。

详情
AI中文摘要

基于全变分的图像去噪模型已被广泛推广和扩展,在不同场景下提升了性能。我们提出一种新的惩罚函数,其灵感来自高维变量选择统计文献的最新进展。利用特定实例化的MM算法,优化问题可以高效求解,且计算过程与空间自适应全变分模型类似。我们的两像素图像模型从理论上证明,新惩罚函数解决了全变分模型固有的偏差问题。通过多个实验展示了新惩罚的优越性能。我们的研究仅限于具有小全变分的“块状”图像。

英文摘要

The total variation-based image denoising model has been generalized and extended in numerous ways, improving its performance in different contexts. We propose a new penalty function motivated by the recent progress in the statistical literature on high-dimensional variable selection. Using a particular instantiation of the majorization-minimization algorithm, the optimization problem can be efficiently solved and the computational procedure realized is similar to the spatially adaptive total variation model. Our two-pixel image model shows theoretically that the new penalty function solves the bias problem inherent in the total variation model. The superior performance of the new penalty is demonstrated through several experiments. Our investigation is limited to "blocky" images which have small total variation.
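The bias argument can be illustrated by comparing the two penalties on a single pixel difference. The sketch below uses the standard SCAD penalty of Fan and Li from the variable-selection literature, with the conventional default a = 3.7. How the paper applies it to image differences, and which parameter values it uses, is not stated in the abstract, so the function names and defaults here are assumptions.

```python
def tv_penalty(t, lam):
    """Absolute-value (total-variation style) penalty on one pixel difference."""
    return lam * abs(t)


def scad_penalty(t, lam, a=3.7):
    """Smoothly clipped absolute deviation penalty for one difference t (a > 2)."""
    u = abs(t)
    if u <= lam:
        # Small differences: identical to the absolute-value penalty.
        return lam * u
    if u <= a * lam:
        # Transition zone: quadratic blend that flattens smoothly.
        return (2.0 * a * lam * u - u * u - lam * lam) / (2.0 * (a - 1.0))
    # Large differences (true edges): constant, so no extra shrinkage.
    return lam * lam * (a + 1.0) / 2.0
```

Because the SCAD value stops growing once |t| exceeds a*lam, a large jump between two pixels is not penalized further, whereas the absolute-value penalty keeps growing linearly and therefore keeps pulling the two values together.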

1106.2124 2026-06-03 physics.med-ph cs.CV cs.NA math.NA stat.AP 版本更新

Omni-tomography/Multi-tomography -- Integrating Multiple Modalities for Simultaneous Imaging

全模态断层成像/多模态断层成像——整合多种模态实现同步成像

Ge Wang, Jie Zhang, Hao Gao, Victor Weir, Hengyong Yu, Wenxiang Cong, Xiaochen Xu, Haiou Shen, James Bennett, Yue Wang, Michael Vannier

AI总结 本文提出全模态断层成像(omni-tomography)概念,通过整合CT、MRI、PET、SPECT、超声、光学等多种成像机制实现真正同步的局部重建,克服现有模态融合方法在配准误差和物理限制方面的固有局限。

Comments 43 pages, 15 figures, 99 references, provisional patent applications filed by Virginia Tech

详情
AI中文摘要

当前的断层成像系统需要重大改进,尤其是在研究多维、多尺度、多时间及多参数现象时。临床前和临床成像现在都依赖于体内断层成像,通常需要不同成像模态分别评估以定义形态细节、描绘疾病或干预引起的变化,并研究具有相互关联方面的生理功能。过去十年中,多模态图像融合出现了两种不同方法:事后图像配准以及PET-CT、PET-MRI及其他混合扫描仪上的联合采集。事后图像分析和双/三模态方法都存在固有局限性,这些局限性由配准误差和采集链中的物理约束决定。我们预见断层成像将超越当前的模态融合,走向大融合,即所有或许多成像模态的大规模融合,可称为全模态断层成像或多模态断层成像。与模态融合不同,这里提出的大融合旨在实现真正同步但通常局部的重建,涉及所有或许多相关成像机制,如CT、MRI、PET、SPECT、超声、光学以及可能更多。本文介绍了全模态断层成像的技术基础,并通过下一代扫描仪的顶层设计、代表性模态的内部断层重建以及全模态断层成像的预期应用进行了说明。

英文摘要

Current tomographic imaging systems need major improvements, especially when multi-dimensional, multi-scale, multi-temporal and multi-parametric phenomena are under investigation. Both preclinical and clinical imaging now depend on in vivo tomography, often requiring separate evaluations by different imaging modalities to define morphologic details, delineate interval changes due to disease or interventions, and study physiological functions that have interconnected aspects. Over the past decade, fusion of multimodality images has emerged with two different approaches: post-hoc image registration and combined acquisition on PET-CT, PET-MRI and other hybrid scanners. There are intrinsic limitations for both the post-hoc image analysis and dual/triple modality approaches defined by registration errors and physical constraints in the acquisition chain. We envision that tomography will evolve beyond current modality fusion and towards grand fusion, a large scale fusion of all or many imaging modalities, which may be referred to as omni-tomography or multi-tomography. Unlike modality fusion, grand fusion is here proposed for truly simultaneous but often localized reconstruction in terms of all or many relevant imaging mechanisms such as CT, MRI, PET, SPECT, US, optical, and possibly more. In this paper, the technical basis for omni-tomography is introduced and illustrated with a top-level design of a next generation scanner, interior tomographic reconstructions of representative modalities, and anticipated applications of omni-tomography.

1011.2292 2026-06-03 math.NA cs.CV cs.NA 版本更新

Image Segmentation with Multidimensional Refinement Indicators

基于多维细化指标的图像分割

Hend Ben Ameur, Guy Chavent, Francois Clément, Pierre Weis

AI总结 提出将最优控制技术转用于图像分割,通过自适应参数化迭代构建最优参数表示,利用误差梯度驱动区域划分,实现稳健灵活的分割算法。

详情
Journal ref
N° RR-7446 (2010)
AI中文摘要

我们将最优控制技术应用于图像分割问题。其思想是将图像分割视为一个参数估计问题。待估计的参数是图像像素的颜色。我们采用自适应参数化技术,该技术迭代地构建参数的最优表示,形成域的分割,从而对应于图像的分割。在迭代过程中,我们最小化误差函数,并且图像到区域的划分由该误差的梯度最优驱动。最终的分割算法继承了其最优控制起源的优良特性:可靠性、鲁棒性和灵活性。

英文摘要

We transpose an optimal control technique to the image segmentation problem. The idea is to consider image segmentation as a parameter estimation problem. The parameter to estimate is the color of the pixels of the image. We use the adaptive parameterization technique which builds iteratively an optimal representation of the parameter into uniform regions that form a partition of the domain, hence corresponding to a segmentation of the image. We minimize an error function during the iterations, and the partition of the image into regions is optimally driven by the gradient of this error. The resulting segmentation algorithm inherits desirable properties from its optimal control origin: soundness, robustness, and flexibility.

1102.0899 2026-06-03 cs.AI cs.CV cs.LG cs.NA math.NA math.PR 版本更新

Evidence Feed Forward Hidden Markov Model: A New Type of Hidden Markov Model

证据前馈隐马尔可夫模型:一种新型隐马尔可夫模型

Michael DelRose, Christian Wagner, Philip Frederick

AI总结 针对隐马尔可夫模型无法建模观测间关联的问题,提出证据前馈隐马尔可夫模型,通过引入观测间概率链接提升分类性能,并在视觉动作和测量数据上验证其有效性。

Comments 19 pages, International Journal of Artificial Intelligence and Applications

详情
Journal ref
International Journal of Artificial Intelligence and Applications (IJAIA), Vol. 2, No. 1, Jan 2011
AI中文摘要

仅基于视觉动作预测他人意图的能力是人类和动物独有的技能。当前计算机算法的智能尚未达到这种复杂程度,但已有若干研究正朝此方向努力。由于可用的分类算法众多,难以确定哪种算法最适合特定情境。在视觉人类意图数据分类中,隐马尔可夫模型(HMM)及其变体是主要候选方法。HMM无法提供观测间链接的概率,这是该分类技术的一大缺陷。当人通过视觉识别他人的动作时,会监控观测中的模式。通过估计下一个观测,人们能够总结动作,从而相当准确地判断执行动作者的意图。这些视觉线索和链接对于创建基于视觉观测确定人类动作的智能算法至关重要。证据前馈隐马尔可夫模型是一种新开发的算法,它提供了观测间链接。本研究阐述了证据前馈HMM背后的理论,提供了其学习这些参数以优化观测似然性的数学证明(这对所有计算智能算法都至关重要),并给出了与标准HMM在视觉动作数据和测量数据分类中的比较示例,从而为证据前馈HMM在多种问题分类中的应用奠定了坚实基础。

英文摘要

The ability to predict the intentions of people based solely on their visual actions is a skill only performed by humans and animals. The intelligence of current computer algorithms has not reached this level of complexity, but several research efforts are working towards it. With the number of classification algorithms available, it is hard to determine which algorithm works best for a particular situation. In classification of visual human intent data, Hidden Markov Models (HMMs) and their variants are leading candidates. The inability of HMMs to provide a probability for the observation-to-observation linkages is a major drawback of this classification technique. If a person is visually identifying an action of another person, they monitor patterns in the observations. By estimating the next observation, people are able to summarize the actions and thus determine, with fairly good accuracy, the intention of the person performing the action. These visual cues and linkages are important in creating intelligent algorithms for determining human actions based on visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm which provides observation-to-observation linkages. The following research addresses the theory behind Evidence Feed Forward HMMs, provides mathematical proofs of how they learn these parameters to optimize the likelihood of observations with an Evidence Feed Forward HMM, which is important in all computational intelligence algorithms, and gives comparative examples with standard HMMs in classification of both visual action data and measurement data, thus providing a strong base for Evidence Feed Forward HMMs in classification of many types of problems.

1011.0997 2026-06-03 math.NA cs.CV cs.NA math.FA stat.ML 版本更新

Performance Analysis of Spectral Clustering on Compressed, Incomplete and Inaccurate Measurements

压缩、不完整和不准确测量下的谱聚类性能分析

Blake Hunter, Thomas Strohmer

AI总结 本文结合压缩感知和矩阵完成的距离保持测量与鲁棒谱聚类,分析了亲和矩阵微小误差对谱坐标和聚类能力的影响,并将双类谱聚类的扰动结果推广到多类聚类。

详情
AI中文摘要

谱聚类是提取数据集潜在全局结构最广泛使用的技术之一。压缩感知和矩阵完成已成为分别有效恢复稀疏和部分观测信号的主流方法。我们将压缩感知和矩阵完成的距离保持测量与鲁棒谱聚类的力量相结合。我们的分析提供了关于亲和矩阵中微小误差如何影响谱坐标和聚类能力的严格界限。这项工作将双类谱聚类的当前扰动结果推广到使用k个特征向量的多类聚类。我们彻底追踪了使用压缩感知和矩阵完成引起的小扰动如何影响亲和矩阵,进而影响谱坐标。这些多类聚类的扰动结果要求亲和矩阵的第k个和第(k+1)个特征值之间存在特征间隙,这在具有k个良好定义簇的数据中自然出现。我们的理论保证辅以数值结果以及图像数据的无监督组织和聚类的若干示例。

英文摘要

Spectral clustering is one of the most widely used techniques for extracting the underlying global structure of a data set. Compressed sensing and matrix completion have emerged as prevailing methods for efficiently recovering sparse and partially observed signals, respectively. We combine the distance-preserving measurements of compressed sensing and matrix completion with the power of robust spectral clustering. Our analysis provides rigorous bounds on how small errors in the affinity matrix can affect the spectral coordinates and clusterability. This work generalizes the current perturbation results of two-class spectral clustering to incorporate multi-class clustering with k eigenvectors. We thoroughly track how small perturbations from using compressed sensing and matrix completion affect the affinity matrix and, in turn, the spectral coordinates. These perturbation results for multi-class clustering require an eigengap between the kth and (k+1)th eigenvalues of the affinity matrix, which naturally occurs in data with k well-defined clusters. Our theoretical guarantees are complemented with numerical results along with a number of examples of the unsupervised organization and clustering of image data.

0912.4571 2026-06-03 math.OC cs.CV cs.NA math.NA 版本更新

Fast Alternating Linearization Methods for Minimizing the Sum of Two Convex Functions

最小化两个凸函数和的快速交替线性化方法

Donald Goldfarb, Shiqian Ma, Katya Scheinberg

AI总结 提出基于交替方向增广拉格朗日方法的一阶交替线性化算法,用于最小化两个凸函数的和,基本方法需O(1/ε)次迭代达到ε-最优解,加速版本需O(1/√ε)次迭代,并给出数值结果。

详情
AI中文摘要

本文提出基于交替方向增广拉格朗日方法的一阶交替线性化算法,用于最小化两个凸函数的和。我们的基本方法最多需要$O(1/ε)$次迭代即可获得$ε$-最优解,而加速(即快速)版本最多需要$O(1/\sqrt{ε})$次迭代,且每次迭代的计算量变化很小。对于这两种方法,我们提出了一种要求两个函数均具有Lipschitz连续梯度的光滑性的算法,以及一种仅要求其中一个函数具有该性质的算法。本文中的算法是Gauss-Seidel型方法,与Goldfarb和Ma在[21]中提出的Jacobi型方法形成对比。数值结果支持了我们的理论结论,并展示了算法的实际潜力。

英文摘要

We present in this paper first-order alternating linearization algorithms based on an alternating direction augmented Lagrangian approach for minimizing the sum of two convex functions. Our basic methods require at most $O(1/ε)$ iterations to obtain an $ε$-optimal solution, while our accelerated (i.e., fast) versions of them require at most $O(1/\sqrt{ε})$ iterations, with little change in the computational effort required at each iteration. For both types of methods, we present one algorithm that requires both functions to be smooth with Lipschitz continuous gradients and one algorithm that needs only one of the functions to be so. Algorithms in this paper are Gauss-Seidel type methods, in contrast to the ones proposed by Goldfarb and Ma in [21] where the algorithms are Jacobi type methods. Numerical results are reported to support our theoretical conclusions and demonstrate the practical potential of our algorithms.

1010.0301 2026-06-03 cs.CV cs.NA math.NA 版本更新

A Microwave Imaging and Enhancement Technique from Noisy Synthetic Data

一种基于含噪合成数据的微波成像与增强技术

Anjan Kumar Kundu, Bijoy Bandopadhyay, Sugata Sanyal

AI总结 提出一种基于矩量法求解的逆迭代算法用于微波成像,通过约束优化确保收敛,并利用Levenberg-Marquardt方法处理病态性,最后对含噪合成数据重建的图像进行增强。

Comments 8 Pages, 10 Figures, International Symposium on Advanced Engineering and Applied Management-40th Anniversary in Higher Education-Image Processing-University Politegnica, Timisoara, 4-5 November, 2010, Hunedoara, ROMANIA

详情
AI中文摘要

本文提出了一种基于矩量法求解的微波成像逆迭代算法。该迭代方案基于约束优化技术开发,并确保收敛。为克服逆问题犯罪,模型采用了不同的网格尺寸。接收器处的合成数据被不同百分比的噪声污染。问题的病态性通过Levenberg-Marquardt方法解决。该算法应用于合成数据,然后通过图像增强技术进一步改善重建图像。

英文摘要

An inverse iterative algorithm for microwave imaging based on the moment method solution is presented here. The iterative scheme has been developed using a constrained optimization technique and is certain to converge. Different mesh sizes for the model have been used to overcome the inverse crime. The synthetic data at the receivers are contaminated with different percentages of noise. The ill-posedness of the problem is addressed by the Levenberg-Marquardt method. The algorithm is applied to synthetic data, and the reconstructed image is then further enhanced through an image enhancement technique.

1009.0051 2026-06-03 math.NA cs.CV cs.NA 版本更新

Variational Iteration Method for Image Restoration

变分迭代法用于图像恢复

Keyvan Yahya, Jafar Biazar, Hossein Azari, Pouyan Rafiei Fard

AI总结 本文首次应用变分迭代法求解Perona-Malik方程,通过误差分析获得近似解,并验证了该方法的有效性。

详情
AI中文摘要

著名的Perona-Malik (P-M)方程最初用于图像恢复,已有多种数值方法求解。本文首次应用一种称为变分迭代法(VIM)的新数值方法求解该方程,并针对相关误差分析获得了P-M方程的对应近似解。通过实现我们的算法,我们得到了一些有效的结果,这些结果值得与其他方法给出的解一样被重视。

英文摘要

The famous Perona-Malik (P-M) equation, which was first introduced for image restoration, has been solved by various numerical methods. In this paper we solve it for the first time by applying a new numerical method called the Variational Iteration Method (VIM), and we obtain the corresponding approximate solutions of the P-M equation together with the relevant error analysis. By implementing our algorithm we obtain effective results that merit consideration alongside the solutions produced by other methods.
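For readers unfamiliar with the Perona-Malik equation itself, the following is a minimal sketch of one common way to discretize it: an explicit finite-difference scheme on a 1D signal. This is not the Variational Iteration Method studied in the paper. The edge-stopping function g(s) = 1 / (1 + (s/kappa)^2), the zero-flux boundaries, and the parameter values are illustrative assumptions.

```python
def conductance(gradient, kappa):
    """Edge-stopping function g(s) = 1 / (1 + (s / kappa)^2)."""
    return 1.0 / (1.0 + (gradient / kappa) ** 2)


def perona_malik_1d(signal, kappa, dt=0.2, steps=20):
    """Explicit Perona-Malik diffusion of a 1D signal with zero-flux boundaries."""
    u = list(signal)
    n = len(u)
    for _ in range(steps):
        # flux[i] is the flow from sample i + 1 into sample i.
        flux = []
        for i in range(n - 1):
            d = u[i + 1] - u[i]
            flux.append(conductance(d, kappa) * d)
        new_u = u[:]
        for i in range(n):
            inflow = flux[i] if i < n - 1 else 0.0
            outflow = flux[i - 1] if i > 0 else 0.0
            new_u[i] = u[i] + dt * (inflow - outflow)
        u = new_u
    return u
```

Differences much smaller than kappa diffuse almost like ordinary heat flow, while differences much larger than kappa get a small conductance, so strong edges are smoothed far more slowly than low-amplitude noise.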

1008.4870 2026-06-03 math.NA cs.CV cs.NA 版本更新

On Euclidean Norm Approximations

关于欧几里得范数近似

M. Emre Celebi, Fatih Celiker, Hassan A. Kingravi

AI总结 本文研究了欧几里得范数的多种近似方法,揭示了它们统一的数学形式,并纠正了Seol和Cheun方法中最大误差的乐观估计。

Comments 9 pages, 1 figure, Pattern Recognition

详情
AI中文摘要

欧几里得范数计算在科学和工程应用中频繁出现。文献中提出了几种具有不同复杂度和精度的该范数近似方法。早期方法基于最小化最大误差。最近,Seol和Cheun提出了一种基于最小化平均误差的近似方法。在本文中,我们首先详细考察这些近似,表明它们符合单一的数学公式,并比较它们的平均误差和最大误差。然后,我们证明Seol和Cheun给出的最大误差显著过于乐观。

英文摘要

Euclidean norm calculations arise frequently in scientific and engineering applications. Several approximations for this norm with differing complexity and accuracy have been proposed in the literature. Earlier approaches were based on minimizing the maximum error. Recently, Seol and Cheun proposed an approximation based on minimizing the average error. In this paper, we first examine these approximations in detail, show that they fit into a single mathematical formulation, and compare their average and maximum errors. We then show that the maximum errors given by Seol and Cheun are significantly optimistic.
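As a concrete instance of the kind of approximation being compared, the sketch below evaluates the well-known two-dimensional "alpha max plus beta min" estimate of the Euclidean norm and measures its average and maximum relative error numerically. The coefficient choices in the usage note and the angle-uniform averaging are illustrative assumptions. They are not the formulation, coefficients, or error measure of the paper.

```python
import math


def approx_norm_2d(x, y, alpha, beta):
    """Approximate sqrt(x^2 + y^2) as alpha * max(|x|, |y|) + beta * min(|x|, |y|)."""
    big, small = max(abs(x), abs(y)), min(abs(x), abs(y))
    return alpha * big + beta * small


def relative_errors(alpha, beta, samples=10001):
    """Return (mean, max) relative error over unit vectors in the first octant.

    By symmetry of max/min and absolute values, the first octant
    (angles from 0 to pi/4) covers every direction in the plane.
    """
    errors = []
    for k in range(samples):
        theta = (math.pi / 4.0) * k / (samples - 1)
        x, y = math.cos(theta), math.sin(theta)  # exact norm is 1
        errors.append(abs(approx_norm_2d(x, y, alpha, beta) - 1.0))
    return sum(errors) / len(errors), max(errors)
```

For example, `relative_errors(1.0, 0.5)` evaluates the cheap shift-friendly choice alpha = 1, beta = 1/2. Its worst case occurs at the angle atan(1/2), where the estimate overshoots by sqrt(1.25) - 1, roughly 11.8 percent. Tuning alpha and beta trades average error against maximum error, which is the distinction the abstract draws between the earlier and later approaches.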

1006.5739 2026-06-03 math.NA cs.CV cs.NA 版本更新

Polyharmonic Daubechies type wavelets in Image Processing and Astronomy, II

图像处理与天文学中的多调和Daubechies型小波(II)

Ognyan Kounchev, Damyan Kalaglarsky, Milcho Tsvetkov

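AI summary: This paper studies the application of polyharmonic subdivision wavelets (of Daubechies type) to image processing, in particular to astronomical images; the results show a clear advantage over some standard multivariate wavelets and a potential for better compression.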

Comments 9 pages

详情
AI中文摘要

我们考虑多调和细分小波(Daubechies型)在图像处理,特别是天文图像中的应用。结果显示,相对于某些标准多变量小波,该方法具有显著优势,并展现出更好的压缩潜力。

英文摘要

We consider the application of the polyharmonic subdivision wavelets (of Daubechies type) to Image Processing, in particular to Astronomical Images. The results show an essential advantage over some standard multivariate wavelets and a potential for better compression.

0909.1310 2026-06-03 math.NA cs.CV cs.NA 版本更新

Sparse image representation by discrete cosine/spline based dictionaries

基于离散余弦/样条字典的稀疏图像表示

James Bowley, Laura Rebollo-Neira

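AI summary: This paper considers mixed dictionaries generated by cosine and B-spline functions and shows that, with highly nonlinear approaches such as Orthogonal Matching Pursuit, the discrete version of the proposed dictionaries yields a significant gain in the sparsity of image representations.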

详情
AI中文摘要

考虑了由余弦和B样条函数生成的混合字典。结果表明,通过高度非线性的方法(如正交匹配追踪),所提字典的离散版本在图像表示的稀疏性上获得了显著提升。

英文摘要

Mixed dictionaries generated by cosine and B-spline functions are considered. It is shown that, by highly nonlinear approaches such as Orthogonal Matching Pursuit, the discrete version of the proposed dictionaries yields a significant gain in the sparsity of an image representation.

0804.1046 2026-06-03 cs.CV cs.CG cs.GR cs.NA math.NA 版本更新

Discrete schemes for Gaussian curvature and their convergence

高斯曲率的离散格式及其收敛性

Zhiqiang Xu, Guoliang Xu

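AI summary: This paper surveys several discrete schemes for Gaussian curvature, presents a new scheme and proves that it converges at regular vertices of valence at least 5, shows by counterexample that no convergent scheme can be built at regular vertices of valence 4, and compares the asymptotic errors of several schemes.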

详情
AI中文摘要

本文综述了高斯曲率的几种离散格式。考虑了一种修正的高斯曲率离散格式的收敛性。此外,提出了一种新的高斯曲率离散格式。我们证明了新格式在价数不小于5的正则顶点处收敛。通过构造反例,我们还表明不可能构建一个在价数为4的正则顶点处收敛的高斯曲率离散格式。最后,比较了几种高斯曲率离散格式的渐近误差。

英文摘要

In this paper, several discrete schemes for Gaussian curvature are surveyed. The convergence property of a modified discrete scheme for the Gaussian curvature is considered. Furthermore, a new discrete scheme for Gaussian curvature is presented. We prove that the new scheme converges at regular vertices with valence not less than 5. By constructing a counterexample, we also show that it is impossible to build a discrete scheme for Gaussian curvature that converges at regular vertices with valence 4. Finally, the asymptotic errors of several discrete schemes for Gaussian curvature are compared.
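To make "discrete scheme for Gaussian curvature" concrete, the sketch below implements the classical angle-deficit estimate at a mesh vertex: K ≈ 3 (2π − Σ θ_i) / A, where θ_i are the triangle angles at the vertex and A is the total area of the incident triangles. The abstract does not say which schemes are surveyed, so treating this one as representative is an assumption, and it is not the new scheme the paper proposes. The function name and input layout are hypothetical.

```python
import math


def angle_deficit_curvature(center, ring):
    """Angle-deficit estimate of Gaussian curvature at an interior vertex.

    center: (x, y, z) of the vertex.
    ring:   its one-ring neighbours as (x, y, z) tuples, in cyclic order.
    Illustrative sketch of a classical scheme, under the assumptions above.
    """
    def sub(p, q):
        return (p[0] - q[0], p[1] - q[1], p[2] - q[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def norm(a):
        return math.sqrt(dot(a, a))

    total_angle = 0.0
    total_area = 0.0
    n = len(ring)
    for i in range(n):
        a = sub(ring[i], center)
        b = sub(ring[(i + 1) % n], center)   # wrap around to close the ring
        c = cross(a, b)
        total_angle += math.atan2(norm(c), dot(a, b))  # angle between a and b
        total_area += 0.5 * norm(c)                    # area of triangle (center, a, b)
    # One third of the incident area is attributed to the vertex.
    return 3.0 * (2.0 * math.pi - total_angle) / total_area


# A valence-6 vertex with a regular hexagonal one-ring in the plane z = 0.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3), 0.0)
           for k in range(6)]
flat = angle_deficit_curvature((0.0, 0.0, 0.0), hexagon)
```

The number of neighbours in `ring` is the valence of the vertex, which is the quantity the convergence results in the abstract are stated in terms of.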