arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30352 2026-05-29 cs.CV Version update

GMOS: Grounding Moving Object Segmentation in 3D Space and Time

Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

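Affiliations: Visual Geometry Group, Department of Engineering Science, University of Oxford, UK; SAI, Shanghai Jiao Tong University, China

AI summary: Proposes GMOS, a framework that performs 3D-aware, temporally fine-grained segmentation of multiple moving objects directly on RGB video, and builds the GMOS-2K dataset and the MOS-I evaluation protocol, achieving state-of-the-art results on multiple benchmarks.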

Comments Project Page: https://www.robots.ox.ac.uk/vgg/research/gmos/

Abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground-background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.

2605.30351 2026-05-29 cs.CV cs.AI Version update

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

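Affiliations: Virginia Tech; fal Project

AI summary: Proposes VideoMLA, which uses Multi-Head Latent Attention (MLA) to replace per-head KV with a shared low-rank content latent and a decoupled 3D-RoPE positional key, cutting KV memory in video diffusion by 92.7% while preserving quality and improving throughput.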

Comments Project Page: https://videomla.github.io/

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
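
The "99%-energy effective rank" mentioned in the abstract can be illustrated with a short sketch. The definition assumed here (the smallest k whose top-k squared singular values capture the given fraction of total spectral energy) is a common convention, not necessarily the exact measure used in the paper. The function name and both example spectra are hypothetical.

```python
from itertools import accumulate


def effective_rank(singular_values, energy=0.99):
    """Smallest k whose top-k squared singular values hold `energy` of the total.

    Assumed definition (a common convention); the paper may measure it differently.
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    for k, partial in enumerate(accumulate(squared), start=1):
        if partial >= energy * total:
            return k
    return len(squared)


# Hypothetical spectra: a flat spectrum needs nearly every direction,
# while a fast-decaying spectrum is captured by only a few.
flat_spectrum = [1.0] * 64
decaying_spectrum = [2.0 ** -i for i in range(64)]
print(effective_rank(flat_spectrum), effective_rank(decaying_spectrum))
```

A spectrum closer to the flat case is what "not low-rank" would look like under this definition: the effective rank stays near the full dimension, well above a small latent size.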

2605.30349 2026-05-29 cs.CV Version update

AdaState: Self-Evolving Anchors for Streaming Video Generation

Yusuf Dalva, Pinar Yanardag

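Affiliations: Virginia Tech

AI summary: To address static first-frame anchors suppressing video dynamics in autoregressive video diffusion models, proposes an adaptive state anchor, a latent variable that is denoised jointly with the content and evolves over time, substantially improving motion richness and natural scene progression.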

Comments Project page: https://adastate.github.io/

Abstract

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.

2605.30347 2026-05-29 cs.CV cs.GR Version update

NeuROK: Generative 4D Neural Object Kinematics

Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

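Affiliations: Stanford University; University of Cambridge; Cornell University

AI summary: Proposes NeuROK, a transformer-based encoder-decoder model that learns a latent kinematic parameterization of objects to generate realistic 4D dynamic deformations from static 3D objects, overcoming the reliance of traditional methods on predefined physical models and specific categories.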

Comments CVPR 2026

Abstract

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics (realistic temporal deformations of static objects under various physical conditions) remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: https://chen-geng.com/neurok

2605.30346 2026-05-29 cs.CV Version update

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

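Affiliations: National Yang Ming Chiao Tung University; Shanda AI Research Tokyo

AI summary: Proposes the YoCausal benchmark, which generates counterfactual samples by temporally reversing real videos and uses the Reverse Surprise Index (RSI) and the Causality Cognition Index (CCI) to evaluate the causal understanding of video diffusion models; it finds that perceiving the direction of time does not amount to understanding causality, and that a significant gap to human-level performance remains.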

Comments Project page: https://www.youzhexie.me/papers/YoCausal/index.html

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

2605.30342 2026-05-29 cs.CV cs.RO Version update

Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field

Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu

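Affiliations: Georgia Institute of Technology

AI summary: Proposes the GAVIS framework, which quantifies uncertainty in 3DGS through an anisotropic visibility field and performs active mapping based on maximum information gain, significantly outperforming existing methods in accuracy and efficiency.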

Comments Accepted to CVPR 2026. Project page https://gatech-rl2.github.io/GAVIS/

Abstract

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.

2605.30341 2026-05-29 cs.CV cs.AI Version update

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

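Affiliations: Stanford University; Radical Numerics; University of Michigan; Salesforce Research

AI summary: Presents GPIC, a large permissively licensed image dataset of about 28 trillion pixels with 100 million training examples, captioned by a state-of-the-art vision-language model, for research on visual generative modeling.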

Comments 25 pages; Dataset: https://huggingface.co/datasets/stanford-vision-lab/giant-permissive-image-corpus; Project website: https://gpic.stanford.edu

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu
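
The stated scale can be put in per-image terms with a rough back-of-the-envelope calculation. This sketch assumes the "approximately 28 trillion pixels" figure covers all three splits together, which the abstract does not state explicitly, so the result is only an order-of-magnitude estimate.

```python
import math

# Figures from the abstract; treating 28 trillion as the total over all splits is an assumption.
total_pixels = 28e12
num_images = 100_000_000 + 200_000 + 1_000_000  # train + validation + test

mean_pixels = total_pixels / num_images
side = math.sqrt(mean_pixels)  # side length of an equivalent square image

print(f"{num_images:,} images, ~{mean_pixels:,.0f} pixels each, ~{side:.0f} px square")
```

Under that assumption the average image is roughly half a megapixel wide by area, in the range of a 512-pixel square.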

2605.30339 2026-05-29 cs.CV cs.MM cs.SD eess.AS Version update

Benchmarking Single-Factor Physical Video-to-Audio Generation

Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

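Affiliations: UC Berkeley; NVIDIA; University of Washington

AI summary: Proposes the FlatSounds benchmark, which evaluates the physical reasoning of video-to-audio models through controlled counterfactual pairs and single-video pattern tests; it finds that models rely on text captions more than on the visual stream, and that physical accuracy trades off against temporal alignment.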

Comments CVPR 2026

Abstract

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

2605.30338 2026-05-29 cs.CV Version update

REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani

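Affiliations: Carnegie Mellon University

AI summary: Proposes the REST3D framework, which reconstructs physically stable 3D scenes from a single RGB image by combining physical scene understanding with physics-constrained optimization, significantly reducing physical errors and improving simulation stability.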

Comments Project page: https://shirleymaxx.github.io/REST3D/

Abstract

Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing single-image reconstruction methods fall short in capturing the physical structure of a scene. As a result, they often produce geometrically plausible but physically inconsistent results, including object floating and penetration, which lead to unstable behavior in physics simulations. Image-conditioned scene generation methods improve physical plausibility but often rely on strong scene priors, yielding plausible yet inaccurate object arrangements that fail to match the input image. We propose REST3D, a single-image reconstruction framework that can reconstruct physically stable 3D scenes by integrating physical scene understanding with physics-constrained refinement. We first introduce an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective, providing a structural prior for reconstruction. Leveraging this structure, we initialize the scene using image-to-3D models, followed by scene-tree-guided alignment and physics-constrained optimization to resolve physical violations while preserving visual consistency with the input image. Experiments show that our method significantly reduces physical errors and improves simulation stability on both synthetic and real-world datasets while maintaining strong reconstruction quality. We further demonstrate the reconstructed scenes in VR-based human-object interaction, showing their potential for immersive applications.

2605.30332 2026-05-29 cs.CV Version update

Colored Noise Diffusion Sampling

Hadar Davidson, Noam Issachar, Sagie Benaim

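Affiliations: The Hebrew University of Jerusalem

AI summary: To address the inefficiency of injecting uniform white noise during diffusion model generation, proposes Colored Noise Sampling (CNS), a training-free method that exploits spectral bias through a dynamic, frequency-dependent noise schedule and significantly improves image generation quality.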

Abstract

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS significantly outperforms standard ODE and SDE baselines as a strictly plug-and-play, inference-time sampler substitution across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS achieves substantial unguided FID reductions, improving from 8.26 to 6.27 on SiT-XL/2, 32.39 to 26.69 on JiT-B/16, and 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. Project page is available at https://hadardavidson.github.io/CNS/.
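
The general idea of frequency-dependent ("colored") noise can be sketched by scaling each frequency bin of a white-noise signal. This is a minimal 1D illustration using only the standard library, not the CNS solver: the `band_gain` schedule, its cutoff, and its direction (low bins first, high bins later) are hypothetical choices, and the abstract does not specify the actual schedule.

```python
import cmath
import random


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform; adequate for a short illustrative signal."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, x in enumerate(signal))
        out.append(acc / n if inverse else acc)
    return out


def colored_noise(white, gain):
    """Scale each frequency bin of `white` by gain(f), where f is the bin's distance from DC."""
    n = len(white)
    spectrum = dft(white)
    # min(k, n - k) keeps the gain symmetric, so the shaped signal stays real-valued.
    shaped = [c * gain(min(k, n - k)) for k, c in enumerate(spectrum)]
    return [c.real for c in dft(shaped, inverse=True)]


def band_gain(progress, cutoff=8, floor=0.2):
    """Toy schedule (hypothetical): favour low bins when progress < 0.5, high bins afterwards."""
    def gain(f):
        low_band = f <= cutoff
        favoured = low_band if progress < 0.5 else not low_band
        return 1.0 if favoured else floor
    return gain


rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(64)]
early = colored_noise(white, band_gain(0.1))  # energy concentrated in low frequencies
late = colored_noise(white, band_gain(0.9))   # energy concentrated in high frequencies
```

With a gain of 1.0 everywhere the function returns the original white noise, which makes it easy to compare a shaped sampler against a uniform one under the same random draws.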

2605.30328 2026-05-29 cs.CV Version update

Supercharging Thermal Gaussian Splatting with Depth Estimation

Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

Affiliations: Technical University of Munich; Munich Center for Machine Learning (MCML); Munich Institute of Robotics and Machine Intelligence (MIRMI); Human-Centered Computing and Extended Reality Lab; TUM University Hospital

AI summary: Proposes TDg, a single-modality method that uses only thermal infrared images and depth estimation, deriving radiance fields via Thermal-to-Depth Gaussian Splatting and outperforming a multimodal baseline in rendering quality and training time.

Comments 8 pages, 4 figures. Accepted and will be published in ISPRS proceedings (ISPRS Congress 2026)

Abstract

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities such as thermal or depth can provide additional information about the environment. Recently, novel view synthesis methods such as 3D Gaussian Splatting have started using multiple modalities to further boost their performance. However, fusing or combining multimodal data can slow the process down and introduce additional challenges. Our project therefore aims to use a single modality based on the thermal infrared domain, removing the reliance on visible light as much as possible. A single modality can be expected to be faster because it does not rely on multimodal data. We propose Thermal-to-Depth Gaussian Splatting (TDg), a method that uses only thermal images and depth estimation in its architecture to derive the radiance fields. TDg outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics of TDg, namely learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR), are 1.12%, 0.034%, and 0.01% better than the MSMG baseline values, respectively. TDg also reduces training time significantly, by 12 min 47 s (a 55% improvement). Overall, our method succeeds in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.
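
The reported training-time saving implies approximate absolute training times, which a short calculation can recover. This sketch assumes the 55% figure means time saved divided by the baseline training time; the abstract does not define it, so the derived durations are estimates rather than reported results.

```python
# Reported in the abstract: 12 minutes 47 seconds saved, described as a 55% improvement.
saved_s = 12 * 60 + 47
relative_saving = 0.55  # assumed to be (time saved) / (baseline time)

baseline_s = saved_s / relative_saving  # implied MSMG training time
tdg_s = baseline_s - saved_s            # implied TDg training time


def fmt(seconds):
    """Format a duration in seconds as 'M min SS s', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs:02d} s"


print(fmt(baseline_s), "->", fmt(tdg_s))
```

Under this reading, the baseline trains in a little over 23 minutes and TDg in about 10 and a half.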

2605.30325 2026-05-29 cs.CV Version update

Veda: Scalable Video Diffusion via Distilled Sparse Attention

Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

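Affiliations: ByteDance Inc.; The University of Hong Kong; University of Science and Technology of China

AI summary: Proposes Veda, a distilled sparse attention framework that uses statistics-aware tile scoring and head-aware tile selection to accelerate video diffusion models efficiently while preserving generation quality.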

Comments Accepted to ICML 2026

Abstract

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan2.1, demonstrate substantial acceleration with no noticeable degradation in generation quality. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1x end-to-end speedup and a 10.5x self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.
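
The relationship between the attention speedup, the attention share of runtime, and the end-to-end speedup can be checked roughly with Amdahl's law. This sketch assumes that only self-attention is accelerated and that everything else costs exactly what it did before, which is a simplification not taken from the paper.

```python
def amdahl(fraction, component_speedup):
    """Overall speedup when `fraction` of the runtime is accelerated by `component_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)


attention_share = 0.92    # attention overhead before, from the abstract
attention_speedup = 10.5  # self-attention speedup, from the abstract

overall = amdahl(attention_share, attention_speedup)

# Attention's share of the shortened runtime under the same simplified model.
new_attention_time = attention_share / attention_speedup
new_share = new_attention_time / ((1.0 - attention_share) + new_attention_time)

print(f"estimated end-to-end speedup: {overall:.2f}x, attention share after: {new_share:.0%}")
```

The simplified model lands near the reported figures (5.1x and 50%) without matching them exactly, which would be expected if the real pipeline carries costs this model ignores.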

2605.30320 2026-05-29 cs.CV Version update

MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos

Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta

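Affiliations: University of North Carolina at Chapel Hill; Meta

AI summary: Proposes the MonoPhysics framework, which uses differentiable MPM simulation and 3D Gaussian Splatting to jointly optimize the geometry, appearance, and physical parameters of deformable objects from monocular video, addressing scale ambiguity and inaccurate geometry.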

Abstract

Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe scale ambiguity, inaccurate geometry, and weak coupling between appearance optimization and physical simulation. We propose MonoPhysics, a framework for monocular inverse physics estimation of deformable objects using differentiable MPM simulation and 3D Gaussian Splatting, which jointly optimizes geometry, appearance, and physical parameters from a single camera view. We address these challenges through three visual-physical bridges: global scale alignment, physics-aware geometry refinement, and a differentiable position map, which together enable accurate optimization from monocular observations alone. We evaluate on Vid2Sim and our new dataset of elastic and plastic objects, showing that MonoPhysics outperforms existing baselines in monocular settings and achieves performance comparable to multi-view baselines using only a single camera. Our project page is available at https://daniel03c1.github.io/MonoPhysics/

2605.30318 2026-05-29 cs.GR cs.AI cs.CV Version update

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Ruixiang Jiang, Chang Wen Chen

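Affiliations: The Hong Kong Polytechnic University

AI summary: Proposes a method for generating portrait pose, camera, lighting, and exposure plans in 3D scenes; by building a Photographic Scene Graph, it enables aesthetic-guided planning that yields visually compelling portraits that are geometrically and photometrically feasible.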

Abstract

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

2605.30317 2026-05-29 cs.CV Version update

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao

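Affiliations: National University of Singapore; Huazhong University of Science & Technology

AI summary: Proposes VPG, a training-free, inference-time guidance method that improves next-step prediction in autoregressive image and video generation by contrasting the model's outputs under the generated prefix and a corrupted prefix, raising generation quality.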

英文摘要

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
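The contrast-and-extrapolate step described in the abstract can be illustrated with a small sketch. It assumes a linear, classifier-free-guidance-style extrapolation of logits; the abstract does not give VPG's exact update rule, and the function names and the `scale` parameter here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_guided_logits(logits_generated, logits_corrupted, scale):
    """Push logits away from the corrupted-prefix prediction.

    Assumed rule (not taken from the paper): g + scale * (g - c).
    scale = 0 returns the generated-prefix logits unchanged.
    """
    return [g + scale * (g - c)
            for g, c in zip(logits_generated, logits_corrupted)]


# Toy next-token logits over a three-entry vocabulary.
logits_generated = [2.0, 1.0, 0.5]   # model conditioned on its own prefix
logits_corrupted = [1.0, 1.0, 1.0]   # model conditioned on a corrupted prefix

guided = prefix_guided_logits(logits_generated, logits_corrupted, scale=1.5)
probs_before = softmax(logits_generated)
probs_after = softmax(guided)
```

Candidates whose score drops when the prefix is corrupted gain probability mass after guidance, which is one plausible reading of "strengthening the posterior support of the generated prefix".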

2605.30311 2026-05-29 cs.CV cs.AI 版本更新

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon:面向整体数字人生成的统一多模态模型

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) Google(谷歌) Google DeepMind(谷歌DeepMind)

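AI summary: Proposes Archon, a fully pretrained, human-centric unified multimodal model that uses modality-specific tokenizers, semantic video reparameterization, and a "Thinking in Modality" strategy for holistic digital human generation across seven modalities, including text, audio, motion, and visual content.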

Comments Accepted to CVPR 2026. Project Page: https://zju3dv.github.io/archon/

详情
AI中文摘要

数字人是沉浸式交互的基础,然而创建一个统一模型来处理包括文本、音频、动作和视觉内容在内的整体模态仍然是一个开放的挑战。在本文中,我们提出了Archon,一个完全预训练的、以人为中心的统一多模态模型,用于整体虚拟形象生成。Archon通过模态特定分词器统一了七种模态,并利用一个在同步模态和72个不同任务上预训练的原生自回归统一多模态模型来建模整体联合分布。为了解决高保真说话视频中的标记爆炸挑战,我们引入了一种内存高效的语义视频重参数化方法,在保持细粒度动态的同时实现了4倍的标记减少,并结合了一个语义驱动的视频扩散解码器。我们进一步提出了一种“模态思维”,它将模糊的跨模态任务分解为替代模态链中的逐步思维,逐步增强保真度和可控性。大量实验表明,Archon在各种数字人生成任务中实现了优越或可比的性能,验证了我们统一框架的有效性。项目页面:https://zju3dv.github.io/archon/。

英文摘要

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

2605.30310 2026-05-29 cs.CV cs.AI cs.GR 版本更新

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

City-Mesh3R:面向仿真就绪的城市级多视图三维网格重建

Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

发表机构 * Visual Computing & Embodied AI Lab, TCS Research(视觉计算与具身人工智能实验室,TCS研究)

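AI summary: Proposes City-Mesh3R, a framework that reconstructs watertight surface meshes end-to-end from large unordered image collections with a divide-and-conquer strategy, addressing incomplete geometry, irregular surfaces, and computational cost in city-scale reconstruction.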

Comments Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/

详情
AI中文摘要

从多视图图像进行城市级三维表面重建以支持下游三维仿真,由于城市场景的规模和复杂性,带来了极具挑战性的问题。现有的基于NeRF、高斯泼溅等方法的城市级三维重建技术,常因几何不完整/缺失以及不规则、噪声表面而无法恢复可用于仿真的三维网格。将现有小规模三维重建方法扩展到任意大规模城市场景因计算复杂而不可行。我们提出City-Mesh3R,一个可扩展的框架,直接从大规模无序图像集合重建水密表面网格。与近期使用全局稀疏SfM点云初始化后分布式稠密重建大规模场景的方法不同,我们的方法采用分治策略,遵循端到端的图像到网格三维重建流程。通过拓扑图像聚类、聚类独立稀疏SfM和地图合并重建稀疏城市地图,无需穷举图像特征匹配。然后对该地图进行空间划分,执行几何感知的相机选择,接着进行稠密表面重建,并使用曲率感知的自适应顶点密度重网格化进行表面细化。这些分区网格随后拼接成城市全局网格。所提出的端到端框架在城市级重建数据集上进行了评估。定性和定量结果表明,我们的方法能生成具有规则几何、捕捉精细表面细节的高保真水密三维网格,并因其分布式端到端处理而适用于任意大规模场景。

英文摘要

City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation is highly challenging due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar representations often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by distributed dense 3D reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. Our qualitative and quantitative results show that the method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and that it scales to arbitrarily large scenes owing to end-to-end processing in a distributed setting.

2605.30307 2026-05-29 cs.CV 版本更新

Grounded 3D-Aware Spatial Vision-Language Modeling

基于三维感知的空间视觉语言建模

An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu

发表机构 * UCSD(加州大学圣迭戈分校) MIT(麻省理工学院) NVIDIA(英伟达)

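AI summary: Proposes GR3D, a model that combines three complementary grounding capabilities (explicit 2D, implicit 2D, and monocular 3D grounding) in a single framework to support spatial chain-of-thought reasoning, with consistent gains on grounded and non-grounded spatial benchmarks.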

Comments CVPR 2026 https://www.anjiecheng.me/gr3d

详情
AI中文摘要

我们提出了GR3D,一个空间视觉语言模型,在单一框架内配备了三种互补的定位能力——显式2D定位、隐式2D定位和单目3D定位。GR3D引入了一种隐式定位机制,在生成过程中识别实体提及,并将相应的区域标记插入文本流中,使模型在生成空间链式推理响应时能够即时引用视觉证据。同时,一种区域提示的单目3D定位设计从定位的区域查询中预测相机视图中的3D边界框,并由内在感知归一化和密集几何监督支持。这些定位能力共同使GR3D能够将复杂的空间理解问题分解为定位的2D感知,随后进行3D推理。GR3D在定位和非定位空间基准上均取得了一致的改进,证明了定位作为增强VLM空间理解的有效归纳偏差。这些定位能力共同增强了超越定位任务本身的通用空间理解。

英文摘要

We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.

2605.30269 2026-05-29 cs.CV eess.IV 版本更新

Boosting Image Quality Assessment Performance: Unsupervised Score Fusion by Deep Maximum a Posteriori Estimation

提升图像质量评估性能:基于深度最大后验估计的无监督分数融合

Zhongling Wang, Raymond Zhou, Shahrukh Athar, Wenbo Yang, Zhou Wang

发表机构 * University of Waterloo, Canada(加拿大滑铁卢大学) McMaster University, Canada(加拿大麦马斯特大学)

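AI summary: Proposes an unsupervised framework for fusing image quality assessment scores based on deep maximum a posteriori estimation, using fine-grained uncertainty estimation to improve the accuracy and reduce the uncertainty of fused predictions.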

Comments 2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

详情
AI中文摘要

在过去的几十年中,出现了许多图像质量评估(IQA)模型,旨在预测图像的感知质量。然而,单个模型往往偏向于某些类型的图像内容或失真,具体取决于设计原则和过程。一个直观的想法是通过将多个模型的分数融合成一个更强的模型,来利用每个IQA模型的优势并减轻其弱点。在此,我们首次尝试为这一想法寻求最优解,并提出一个基于深度最大后验(MAP)估计的无监督IQA分数融合通用框架。所提出的模型在分数级别进行细粒度不确定性估计,以提高准确性并降低融合预测中的不确定性。综合实验表明,所提出的模型优于单个IQA模型和其他融合方法。它还在融合过程中展现出拒绝“坏”模型的有趣能力。

英文摘要

Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.

2605.30268 2026-05-29 cs.CV cs.AI 版本更新

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

PhyGenHOI:物理感知的动态人-物交互4D生成

Omer Benishu, Gal Fiebelman, Sagie Benaim

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学)

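AI summary: Proposes PhyGenHOI, a framework that couples a Motion Diffusion Model with the Material Point Method and uses a Windowed Attraction Loss, Contact-Driven Re-simulation, and a Masked Video-SDS objective to generate physically consistent, visually faithful 4D human-object interaction scenes.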

详情
AI中文摘要

我们解决了生成物理准确且视觉逼真的4D人-物交互(HOI)的任务。给定一个静态3D人体和以3D高斯泼溅(3DGS)表示的目标物体,我们的目标是合成动态场景,其中人体根据给定的输入文本主动与物体交互,例如拳击或踢腿。为此,我们引入了PhyGenHOI,一种新颖的框架,将生成式人体运动与显式物理物体模拟相结合。我们将人体建模为由运动扩散模型(MDM)驱动的语义智能体,将物体建模为通过物质点方法(MPM)模拟的物理智能体,并利用3D高斯作为统一的、可微分的表示。我们通过三种耦合机制监督它们的交互:(1)窗口吸引损失,时间上同步生成运动以拦截物体;(2)接触驱动重模拟步骤,在碰撞时触发物理一致动量传递;(3)掩码视频SDS目标,注入基于视频的先验以增强接触保真度。实验表明,PhyGenHOI在多种动作、人体和物体上生成物理一致的4D HOI,优于基线方法。项目页面和视频:https://omerbenishu.github.io/PhyGenHOI/

英文摘要

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

2605.30265 2026-05-29 cs.CV cs.CL 版本更新

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

LoMo: 局部模态替换以实现更深的视觉-语言融合

Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学)

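AI summary: To address "carrier sensitivity", the performance drop of vision-language models under modality substitution, proposes Local Modality Substitution (LoMo), a data curation paradigm that dynamically renders text spans as images to train cross-modal representational invariance, significantly improving multimodal reasoning and fusion.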

详情
AI中文摘要

视觉-语言模型(VLM)在广泛的理解和推理任务中取得了显著进展,这得益于旨在多模态融合的大规模图像-文本训练。理想情况下,将文本问题替换为其渲染图像对应物应基本不影响模型性能。然而,在实践中,这种模态替换会导致性能急剧下降。我们将这种“载体敏感性”问题归因于当前训练语料中固有的偏差。在图像描述、VQA、OCR和网络来源的交错数据等流行数据集中,文本和图像通常被组织成不同且不对称的角色,文本作为语言查询,图像作为视觉参考。这种数据偏差导致VLM在不同模态的信息获取上表现出不同的偏好。因此,VLM无法对齐语义等价内容在文本和视觉载体上的表示,使得模型推理在模态替换下变得脆弱。为了解决这个问题,我们提出了局部模态替换(LoMo),一种轻量级、架构无关的数据策展范式,旨在为语义等价的文本和图像载体之间的跨模态表示不变性提供监督。LoMo通过将单模态提示重新表述为无缝交错的跨模态序列来实现这一点。它动态选择目标文本跨度并将其重新表述为渲染图像,从而在“文本、视觉、文本”载体上保持相同的语义。在13个不同的多模态基准上的大量实验表明,LoMo显著改善了整体多模态推理,并实现了更深的跨模态融合。具体来说,它在基础模型上带来了一致的提升,在LLaVA-OneVision-1.5-8B上比标准SFT提高了2.67个百分点,在Qwen3.5-9B上提高了2.82个百分点。

英文摘要

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

2605.30263 2026-05-29 cs.CV 版本更新

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

minWM: 用于实时交互式视频世界模型的全栈开源框架

Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

发表机构 * ShengShu(盛书) THU(清华大学) RUC(中国人民大学) HKUST(香港科技大学) UT-Austin(德克萨斯大学奥斯汀分校)

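AI summary: Proposes minWM, a full-stack open-source framework that uses the Causal Forcing / Causal Forcing++ pipeline to convert bidirectional video diffusion models into controllable, low-latency autoregressive world models, with support for camera control and multiple backbone architectures.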

详情
AI中文摘要

最近的视频扩散基础模型在高品质视频生成方面取得了显著进展,但将其转化为实时交互式视频世界模型仍然具有挑战性。交互式世界模型需要可控、因果和低延迟的展开,这在实际中需要涵盖数据构建、可控微调、自回归训练、少步蒸馏和流式推理的完整流水线。在这项工作中,我们提出了minWM,一个用于构建实时交互式视频世界模型的全栈开源框架。minWM提供了一个端到端流水线,将现有的双向T2V/TI2V视频基础模型转化为相机可控的少步自回归世界模型。具体来说,minWM首先微调一个带有相机控制的双向视频扩散模型,然后应用因果强制/因果强制++流水线,包括AR扩散训练、因果ODE或因果一致性蒸馏以及非对称DMD,将其蒸馏为少步自回归生成器以实现低延迟展开。该框架是模块化和架构可扩展的:我们在代表性开源骨干上实例化它,包括Wan2.1-T2V-1.3B和HY1.5-TI2V-8B,覆盖了基于交叉注意力的条件注入和MMDiT风格架构。minWM还支持将现有的视频世界模型(如HY-WorldPlay)适应到新的数据分布、训练配方和延迟目标。除了发布可运行脚本、检查点、文档和推理代码外,我们还提供了关于相机轨迹质量、可控性训练步骤和最小批量大小要求的实际消融实验。我们希望minWM能够作为构建和适应实时交互式视频世界模型的可复现和可扩展的配方。

英文摘要

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: https://github.com/shengshu-ai/minWM

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG 版本更新

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

LoRA如何记忆?大语言模型微调的参数记忆定律

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

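AI summary: Proposes the Parametric Memory Law, a power law relating loss reduction under LoRA fine-tuning to effective parameters and sequence length, and builds on it to design MemFT, an optimization strategy that improves memory fidelity and efficiency.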

Comments Ongoing work

详情
AI中文摘要

大型语言模型(LLM)必须持续学习和更新知识,以在动态的真实世界环境中保持有效。虽然低秩适应(LoRA)被广泛用于此类记忆更新,但现有研究主要依赖于定性的下游评估,使得精确参数记忆的定量容量限制和潜在动态在很大程度上未被探索。为了弥合这一差距,我们在潜在空间中使用LoRA作为受控记忆容量探针,以系统量化精确参数记忆。我们引入了参数记忆定律,这是一个将损失降低ΔL与有效参数和序列长度联系起来的稳健幂律。在令牌级别,细粒度分析揭示了确定性相变,表明在贪婪解码下,预测概率p > 0.5构成逐字回忆的充分条件。基于这些见解,我们引入了MemFT,一种阈值引导的优化策略,该策略动态地将训练预算重新分配给低于阈值的令牌。实证评估表明,MemFT可以提高记忆保真度和效率。代码将在https://github.com/zjunlp/ParametricMemoryLaw发布。

英文摘要

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
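The p > 0.5 condition follows from a short argument: if the target token holds more than half of the probability mass, every other token holds less than half, so greedy decoding (argmax) must select the target. Applied step by step, the decoded prefix stays on the memorized sequence. The sketch below illustrates this on toy distributions; the helper names are hypothetical and it is not the paper's analysis code.

```python
def greedy_token(probs):
    """Index of the most probable token, i.e. the greedy decoding choice."""
    return max(range(len(probs)), key=probs.__getitem__)


def all_above_threshold(step_probs, target_ids, threshold=0.5):
    """True if every target token's probability exceeds the threshold."""
    return all(p[t] > threshold for p, t in zip(step_probs, target_ids))


def recalled_verbatim(step_probs, target_ids):
    """True if greedy decoding picks the target token at every step."""
    return all(greedy_token(p) == t for p, t in zip(step_probs, target_ids))


# Each row is a toy next-token distribution conditioned on the true prefix.
confident = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
confident_targets = [0, 1]

# The target wins the argmax with only 0.4 of the mass, so the condition
# is sufficient but not necessary.
borderline = [[0.4, 0.35, 0.25]]
borderline_targets = [0]

confident_ok = (all_above_threshold(confident, confident_targets)
                and recalled_verbatim(confident, confident_targets))
borderline_above = all_above_threshold(borderline, borderline_targets)
borderline_recalled = recalled_verbatim(borderline, borderline_targets)
```

The second case is why the abstract says "sufficient" rather than "necessary": sub-threshold tokens may still be recalled, but only tokens above 0.5 are guaranteed to be.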

2605.30257 2026-05-29 cs.CV 版本更新

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Stable-Layers: 使用VLM评分强化学习微调图像层分解模型

Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

发表机构 * Stability AI

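AI summary: Proposes Stable-Layers, a framework that fine-tunes a pretrained layer decomposition model without paired supervision using reinforcement learning (Flow-GRPO) and vision-language model (VLM) scoring, addressing the low variance of the scoring signal and improving layer separation and reconstruction accuracy.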

Comments 25 pages, 8 figures, 4 tables. Project page: https://stability-ai.github.io/stable-layers.github.io/

详情
AI中文摘要

我们提出了Stable-Layers,一个强化学习框架,通过仅使用视觉语言模型(VLM)的反馈来微调预训练的层分解模型,从而消除了对配对监督的需求。从Qwen-Image-Layered开始,我们应用带有LoRA适应的Flow-GRPO,对每张图像采样多个候选分解,用VLM进行评分,并根据组相对优势优化策略。关键挑战在于设计可靠的奖励信号:单独对样本评分的VLM倾向于将其判断压缩到一个狭窄的范围内,使得GRPO几乎没有组内方差可供学习。我们通过一个两阶段评估流水线解决了这个问题,该流水线将基于五个编辑中心标准的结构化逐样本评分与基于网格的校准步骤配对,在该步骤中VLM并排重新评分所有候选。与基础模型相比,Stable-Layers在Crello数据集上产生了具有更强层分离、更少空白或伪影层以及更低逐层重建误差的分解结果。

英文摘要

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.
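The reward-variance problem mentioned in the abstract can be seen in a small sketch of group-relative advantages. It assumes the commonly used mean-and-standard-deviation normalization within each sampled group; the abstract does not state the exact form used by Flow-GRPO, and the function name and `eps` value are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A VLM that gives every candidate the same score yields no learning signal.
flat_scores = [7, 7, 7, 7]
flat_advantages = group_relative_advantages(flat_scores)

# Side-by-side re-scoring that spreads the candidates out yields a usable ranking.
spread_scores = [2, 4, 6, 8]
spread_advantages = group_relative_advantages(spread_scores)
```

With identical scores every advantage is zero, so the policy receives no gradient from that group; spreading the scores, as the grid-based calibration step aims to do, restores a ranking the policy can learn from.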

2605.30256 2026-05-29 cs.CV cs.CL cs.HC 版本更新

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

VideoFDB: 评估对话代理中的全双工视觉-语音能力

Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

发表机构 * NVIDIA David AI

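AI summary: Proposes VideoFDB, the first benchmark for full-duplex audio-visual-to-audio-visual (AV2AV) conversation, which uses 237 real-world video clips, a taxonomy separating perception from generation behaviors, and a rubric-based LM-as-judge framework to evaluate agents on nonverbal conversational dynamics, and finds failure modes such as captioning collapse and visual-stream ignorance in existing systems.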

Comments Project page: https://research.nvidia.com/labs/amri/projects/video-fdb/

详情
AI中文摘要

自然的人类对话是全双工且视听融合的:人们同时说话和倾听,同时持续解读并产生非语言线索,如点头、微笑和手势。为了支持成功的人机交互,代理必须建模全双工视听对话;然而,现有的全双工基准仅评估语音。在这项工作中,我们提出了VideoFDB,这是首个评估全双工视听到视听(AV2AV)对话代理的基准。VideoFDB贡献了:(i) 237个来自真实世界视频通话的二元片段,涵盖11种非语言对话动态;(ii) 将感知行为与生成行为分离的分类法;(iii) 基于评分规则的LM评判评估框架,具有可解释的轴,用于评估关于非语言对话动态的对话质量。在开源和闭源的视觉-语音代理中,我们发现了系统性的失败模式:字幕崩溃和视觉流忽视,并且我们表明当前系统利用视觉进行显式视觉问答,但不用于自然对话中所需的流式联合视听基础。我们进一步评估了级联的语音到虚拟形象系统,发现其架构从根本上排除了全双工非语言线索的产生。作为全双工AV2AV交互的首个基准,VideoFDB为系统评估奠定了基础,我们希望这将加速下一代多模态对话代理的进步和发展。

英文摘要

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

2605.30250 2026-05-29 cs.CV cs.GR 版本更新

Ambient-robust Inverse Rendering using Active RGB-NIR Imaging

使用主动RGB-NIR成像的环境鲁棒逆渲染

Hoon-Gyu Chung, Jinnyeong Kim, Hyunwoo Kang, Seung-Hwan Baek

发表机构 * POSTECH

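AI summary: Proposes a three-stage inverse rendering method based on active RGB-NIR imaging that combines multi-view RGB images under ambient light with active NIR flash images to reconstruct geometry and reflectance robustly under changing ambient illumination.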

Comments 11 pages

详情
AI中文摘要

逆渲染旨在从图像中重建物体的几何和反射率。尽管近期取得了进展,现有方法通常会产生不准确的重建,且对环境光照条件敏感。本文介绍了一种由主动RGB-NIR成像实现的环境鲁棒逆渲染方法。我们的关键洞察是利用近红外(NIR)闪光照明(对人眼不可见)来获得稳定的点光源阴影,该阴影在很大程度上不受环境光照影响。通过使用环境光照下的多视角RGB图像和主动NIR闪光照明获取的NIR图像,我们利用RGB和NIR图像的互补优势,通过三阶段逆渲染方法重建精确的几何和反射率。为了实现密集多视角采集,我们开发了一个主动成像系统,配备RGB-NIR相机和安装在移动底座上的NIR闪光灯。利用该系统,我们收集了首个在多种环境光照条件下捕获的多视角RGB-NIR逆渲染数据集。实验表明,我们的方法优于先前方法,在多种环境光照场景下实现了准确的几何和反射率估计。

英文摘要

Inverse rendering aims to reconstruct the geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. Using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.

2605.30244 2026-05-29 cs.CV cs.AI 版本更新

Reinforcement Learning with Robust Rubric Rewards

基于稳健评分规则的强化学习

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

发表机构 * Huawei Technologies Co., Ltd.(华为技术有限公司)

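AI summary: For partially verifiable vision-language tasks, proposes RLR^3, which extends verification from the task level to the criterion level through dual-path rubric execution, a minimal exposure strategy, and hierarchical aggregation, yielding a 4.7-point improvement over the base model across 15 benchmarks.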

详情
AI中文摘要

虽然基于可验证奖励的强化学习(RLVR)对于确定性可检查的任务有效,但许多视觉-语言任务部分可验证,需要多准则监督(例如,感知细节、推理步骤和约束)。评分规则为此细粒度监督提供了自然接口,但其有效性取决于在线RL期间的执行准确性。我们提出基于稳健评分规则的强化学习($\text{RLR}^3$),将RLVR从任务级验证扩展到准则级验证。$\text{RLR}^3$通过两条执行路径路由实例特定的评分规则:LLM作为提取器与确定性验证器配对,或LLM作为裁判用于不可验证的准则。为确保忠实评分,$\text{RLR}^3$引入最小暴露策略,从提取器中屏蔽真实标签,从裁判中屏蔽图像。此外,$\text{RLR}^3$采用层次聚合,优先考虑基本准则而非附加准则,并缓解rollout组内的分数饱和。在Qwen3-VL-30B-A3B上跨15个基准评估,$\text{RLR}^3$始终优于RLVR,比基础模型提升4.7分,并超过官方instruct-to-thinking模型差距。受控审计证实,我们的确定性验证和最小暴露显著减少了可利用的假阳性。

英文摘要

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.

2605.30239 2026-05-29 cs.CV 版本更新

SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

SAM3D-Phys:迈向真实世界中的多物体交互仿真

Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Pengcheng Laboratory(鹏城实验室)

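AI summary: Proposes SAM3D-Phys, a framework that combines scene reconstruction with the generative 3D priors of SAM3D to recover complete, simulatable object geometry from partial observations, enforces scene consistency through physics-constrained optimization and mask-guided appearance distillation, and supports simultaneous interactive simulation of multiple objects.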

Comments 23 pages, 11 figures

详情
AI中文摘要

这项工作解决了从重建的真实世界场景中恢复完整、可仿真的物体几何的问题,使得与场景中嵌入的物体进行基于物理的交互成为可能。虽然现代多视图重建方法可以产生视觉上准确的环境,但由于遮挡和有限的观测,物体往往不完整,因此不适合物理仿真。为了解决这一局限性,我们提出了SAM3D-Phys,一个将场景重建与SAM3D的生成式3D先验相结合以恢复可物理仿真的物体的框架。我们的方法首先从多视图图像重建场景,获得场景几何和物体的部分观测。然后,我们利用SAM3D从这些部分观测中推断出完整的物体几何。为了确保恢复的物体与重建场景保持一致,我们通过两种互补策略恢复场景一致的物体状态:一种物理约束的空间优化算法,迭代地将恢复的物体对齐到其原始位置;以及一种掩码引导的外观蒸馏模块,基于观测图像细化纹理保真度。通过恢复完整的物体几何并在场景中恢复其姿态和外观,SAM3D-Phys产生了适用于基于物理仿真的干净物体表示,使得在重建场景中能够对多个物体进行同时且物理一致的交互仿真。项目页面:https://chnxindong.github.io/sam3d-phys/

英文摘要

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

2605.30235 2026-05-29 cs.CV 版本更新

BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

BullingerDB:用于手写文本识别和作者检索的数据集

Marco Peer, Anna Scius-Bertrand, Patricia Scheurer, Andreas Fischer

发表机构 * AIBEX Group, University of Fribourg, Switzerland(AIBEX集团,弗里堡大学,瑞士) iCoSys Institute, University of Applied Sciences and Arts Western Switzerland(iCoSys研究所,西方瑞士应用科学与艺术大学) Department of Computational Linguistics, University of Zurich, Switzerland(计算语言学系,苏黎世大学,瑞士)

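AI summary: Presents BullingerDB, a large-scale historical document dataset based on the correspondence of Heinrich Bullinger for handwritten text recognition and writer retrieval, and introduces a temporal nDCG metric to evaluate retrieval.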

Comments Accepted for presentation at ICDAR 2026. Dataset available via Zenodo.

详情
AI中文摘要

我们提出了BullingerDB,这是一个基于Heinrich Bullinger(1504-1575)书信的大规模历史文档分析基准数据集。该语料库包含由796位作者在六十年间书写的20,898页和499,222行文本,具有风格变化、多语言内容(主要是拉丁语和早期新高地德语)以及作者身份和时间等元信息。我们在文本识别和作者检索上评估了BullingerDB。表现最佳的模型TrOCR实现了9.1%的字符错误率(CER)。对于作者检索,我们引入了一个时间感知的nDCG指标来评估时间感知检索。虽然可以实现时间连贯的检索,但mAP(78.3%)分数表明由于长期风格变化而存在挑战。通过BullingerDB,我们旨在为多语言历史文本识别和时间感知的作者分析建立一个新的基准。

英文摘要

We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German) as well as meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best performing model, achieves a CER of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, mAP (78.3%) scores indicate challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally-aware writer analysis.
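The character error rate (CER) quoted above is conventionally the character-level edit distance between the recognised text and the reference transcription, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function names are illustrative and the sample line in the usage note is invented.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumes at least one non-empty reference line.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars
```

For example, `cer(["gratia et pax"], ["gratia et pox"])` counts one substitution over 13 reference characters. Whether the paper aggregates per line or per corpus is not stated in the abstract; the corpus-level form here is an assumption.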

2605.30231 2026-05-29 cs.CV cs.AI 版本更新

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

超越3D VQA:将3D空间先验注入视觉-语言模型以增强几何推理

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

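Affiliations FAIR at Meta

AI summary Proposes GASP, a framework that injects geometric priors into the LLM's transformer layers and is trained with a contrastive loss and depth consistency supervision, markedly improving the 3D spatial reasoning of VLMs with large gains on several benchmarks.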

Comments CVPR 2026. Project page: https://danielchyeh.github.io/GASP/

详情
AI中文摘要

视觉-语言模型(VLM)通常在鲁棒的3D空间推理方面存在困难。依赖于使用3D视觉问答(VQA)数据集进行微调的主流方法可能过度拟合数据集特定的偏差,而集成专门的3D视觉编码器往往不灵活且繁琐。在本文中,我们认为真正的空间理解应该源于学习基本的几何先验,而不仅仅是来自高级VQA监督。我们提出了GASP(几何感知空间先验),这是一个将这些先验直接注入LLM的Transformer层的框架。GASP采用一个小的对应头,作为跨所有层的深度监督信号,并使用一个双重目标进行训练,该目标利用大规模视频场景的真实几何:基于真实点对应的对比损失强制2D视图不变性,而深度一致性监督解决3D几何歧义。我们的分析首先提供了一个诊断,表明标准VLM的内部对应匹配精度非常低(通常低于5%)。然后我们证明,我们的训练显著改善了这种行为,将逐层峰值对应提升到70%以上,并保持超过85%的时间鲁棒性,而基线仍低于5%。这些内部改进转化为下游空间基准的显著提升,包括在All-Angles Bench上+18.2%,在VSI-Bench上+29.0%,所有这些都没有在任何3D VQA数据上进行训练。我们的发现表明,从基本几何先验中学习是实现具有更可靠3D空间推理的VLM的一条有前途且可推广的途径。

英文摘要

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
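A contrastive loss on point correspondences can be pictured as follows: features of the same 3D point seen in two views should be more similar to each other than to features of any other point. The sketch below uses the common InfoNCE form with cosine similarity. The abstract does not give the exact loss used by GASP, so the form, the temperature value and the function name are all assumptions.

```python
import math


def info_nce(feats_a, feats_b, temperature=0.1):
    """Mean InfoNCE loss; feats_a[i] and feats_b[i] describe the same point.

    Every other row of feats_b acts as a negative for feats_a[i].
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    loss = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of point i in view A against all points in view B.
        logits = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)
```

The loss is small when each point's feature matches its counterpart in the other view and large when correspondences are scrambled, which is the behaviour a correspondence head trained this way would be pushed towards.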

2605.30230 2026-05-29 cs.CV 版本更新

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

IP-Adapter 就够了:迈向免微调扩散模型的人脸说话视频生成

Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang

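Affiliations Information Engineering University; Huai’an University; Chongqing University of Posts and Telecommunications; Nankai University

AI summary Proposes a fine-tuning-free paradigm that uses pretrained Stable Diffusion and IP-Adapter together with three components that have no trainable parameters to address identity drift, synchronization errors, and temporal instability, surpassing existing methods in lip-sync accuracy and visual fidelity.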

详情
AI中文摘要

随着扩散模型的快速发展,人脸说话视频生成取得了显著进展。然而,现有的基于扩散的方法仍然需要特定任务的微调和大规模音视频数据集,导致计算成本高昂,阻碍了扩散方法在学术界的可扩展性和可访问性。为了解决这个问题,我们提出了一种免微调范式,直接使用 Stable Diffusion 和 IP-Adapter 的预训练权重进行人脸说话视频生成。该骨干网络利用 IP-Adapter 的视觉嵌入能力,从预训练的 Stable Diffusion 中挖掘与嘴唇相关的语义。为了解决身份漂移、同步误差和时间不稳定的挑战,我们还设计了三个无训练参数组件:(1)结构器(Structurist),显式解耦并重新组合嘴唇和外观特征,以减轻身份漂移和外观失真;(2)结构控制器(Structure Controller),基于准单调运动趋势自适应细化嵌入,实现精确的唇同步;(3)噪声传感器(Noise Sensor),引入高斯先验来检测和抑制闪烁和抖动伪影,增强时间一致性。实验结果表明,我们的方法在唇同步精度(PCLD 至少提升 0.16)和视觉保真度(FID 至少提升 0.7)方面均优于现有最先进方法,建立了一种新颖的免微调扩散框架用于人脸说话视频生成。

英文摘要

With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder the scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

2605.30211 2026-05-29 cs.CV 版本更新

Cycle Consistency in Video Object-Centric Learning

视频目标中心学习中的循环一致性

Rongzhen Zhao, Zhiyuan Li, Ruonan Wei, Juho Kannala, Joni Pajarinen

Affiliations Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland; School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; Department of Computer Science, Aalto University, Espoo, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland

AI summary Because cycle consistency is hard to apply directly in the latent slot space of video object-centric learning, proposes Implicit Cycle Consistency (ICC), which moves the constraint from slot space to the continuous reconstruction manifold, avoiding feature collapse and improving performance.

Comments 14 pages

详情
AI中文摘要

自监督视频目标中心学习(OCL)旨在发现不同目标并跨时间关联它们,而自监督多目标跟踪(MOT)则侧重于关联预定义的目标检测或分割。尽管循环一致性(CC)在MOT中已成熟应用,但它不能简单或显式地应用于OCL的潜在槽空间。与MOT中确定性和理想的目标表示不同,OCL槽由于非唯一的场景分解而固有地具有随机性和模糊性。在槽上强制执行显式循环一致性(ECC)会导致刚性均值寻求,这严重惩罚了模型探索替代但同样有效的分解,从而驱动特征坍塌。为解决这一困境,我们提出隐式循环一致性(ICC),它将循环一致性约束从限制性的槽空间转移到连续的重建流形,鼓励槽在集体解释视觉场景上达成软共识,而不是强制刚性点对点特征对齐。在复杂视频OCL基准上的大量实验表明,ICC避免了特征坍塌,并优于ECC基线。我们的源代码、模型检查点和训练日志可在 https://github.com/Genera1Z/ICC 获取。

英文摘要

Self-supervised video Object-Centric Learning (OCL) aims to discover distinct objects and associate them across time, whereas self-supervised Multi-Object Tracking (MOT) focuses on associating pre-defined object detections or segmentations. Although well-established in MOT, Cycle Consistency (CC) cannot naively or explicitly apply to the latent slot space of OCL. Unlike the deterministic and ideal object representations in MOT, OCL slots are inherently stochastic and ambiguous due to non-unique scene decompositions. Enforcing explicit cycle consistency (ECC) on slots imposes rigid mean seeking. This severely penalizes the model for exploring alternative but equally valid decompositions, thereby driving towards feature collapse. To resolve this dilemma, we propose Implicit Cycle Consistency (ICC), which shifts the cycle-consistency constraint from the restrictive slot space to the continuous reconstruction manifold, encouraging slots to reach a soft consensus on collectively interpreting the visual scene rather than forcing rigid point-to-point feature alignment. Extensive experiments on complex video OCL benchmarks demonstrate that ICC avoids feature collapse and outperforms ECC baselines. Our source code, model checkpoints and training logs are available at https://github.com/Genera1Z/ICC.

2605.30174 2026-05-29 cs.CV 版本更新

LiveSVG: Zero-Shot SVG Animation via Video Generation

LiveSVG:通过视频生成的零样本SVG动画

Matan Levy, Ran Margolin, Bar Cavia, Dvir Samuel, Yael Pritch, Shmuel Peleg, Alex Rav Acha, Ariel Shamir, Dani Lischinski

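Affiliations Google; The Hebrew University of Jerusalem; Bar-Ilan University; Reichman University

AI summary Proposes LiveSVG, which achieves zero-shot SVG animation by fitting the SVG directly to a target video produced by a video diffusion model, without skeletons or category priors, and handles complex motion and color ambiguity with a dual-level motion representation and a sphere-packing recolorization strategy.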

Comments Project Page: https://levymsn.github.io/LiveSVG

详情
AI中文摘要

我们介绍了LiveSVG,一种利用视频扩散模型生成可缩放矢量图形(SVG)动画的零样本方法。当前的SVG动画方法在处理复杂运动时存在困难:基于LLM的代码合成难以表达精细的非刚性贝塞尔变形,而分数蒸馏采样(SDS)提供有噪声的梯度,并且通常需要类别特定的先验(如骨架)。相比之下,LiveSVG将矢量几何直接拟合到显式生成的目标视频上。给定输入SVG图像和运动提示,我们使用冻结的图像到视频模型生成可预览的目标视频,然后通过可微分渲染将原始SVG拟合到该视频。我们的拟合阶段无需骨架,利用双层运动表示:每个组的单应性矩阵用于粗略关节运动,每个路径的贝塞尔控制点偏移用于局部变形。为了解决逐像素拟合过程中颜色引起的对应歧义,我们引入了一种新颖的球体填充重着色策略。我们还提出了ChallengeSVG,一个包含复杂多对象场景的基准测试,揭示了先前工作的局限性。评估表明,LiveSVG在AniClipart和ChallengeSVG上均显著优于现有方法,确立了直接参考视频拟合作为实现提示对齐和完全可编辑矢量动画的实用、稳健途径。

英文摘要

We introduce LiveSVG, a zero-shot approach for generating Scalable Vector Graphics (SVG) animations using video diffusion models. Current SVG animation methods struggle with complex motions: LLM-based code synthesis fails to express fine, non-rigid Bézier deformations, while Score Distillation Sampling (SDS) provides noisy gradients and often requires category-specific priors like skeletons. In contrast, LiveSVG fits vector geometry directly to an explicitly generated target video. Given an input SVG image and a motion prompt, we generate a previewable target video using a frozen image-to-video model, then fit the original SVG to this video via differentiable rendering. Our fitting stage is skeleton-free, utilizing a dual-level motion representation that combines per-group homographies for coarse articulation with per-path Bézier control-point offsets for local deformations. To resolve color-induced correspondence ambiguities during pixel-wise fitting, we introduce a novel sphere-packing recolorization strategy. We also present ChallengeSVG, a benchmark of complex, multi-object scenes that exposes the limitations of prior work. Evaluations demonstrate that LiveSVG significantly outperforms existing methods on both AniClipart and ChallengeSVG, establishing direct reference-video fitting as a practical, robust route to prompt-aligned and fully editable vector animation.
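The dual-level motion representation can be illustrated with plain geometry: a 3x3 homography moves every control point of a group coarsely, and a small per-point offset then adjusts each Bézier control point locally. The sketch below is only a picture of that composition. The order of operations (homography first, offsets second), the function names and the data layout are assumptions, and the differentiable rendering and fitting loop are omitted entirely.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)  # divide by w to return to image coordinates


def deform_path(control_points, H, offsets):
    """Coarse group-level homography followed by per-control-point offsets."""
    moved = [apply_homography(H, p) for p in control_points]
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(moved, offsets)]


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1]."""
    s = 1.0 - t
    weights = (s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3)
    pts = (p0, p1, p2, p3)
    return (sum(w * p[0] for w, p in zip(weights, pts)),
            sum(w * p[1] for w, p in zip(weights, pts)))
```

In a fitting setting, the entries of `H` and `offsets` would be the per-frame parameters being optimised, while the path topology of the original SVG stays fixed, which is what keeps the result editable.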

2605.30170 2026-05-29 cs.MM cs.CV cs.LG 版本更新

Unveiling the Visual Counting Bottleneck in Vision-Language Models

揭示视觉语言模型中的视觉计数瓶颈

Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

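Affiliations Department of Computer Science, ETH Zürich

AI summary Decomposes visual counting into three cognitive stages and finds that vision-language models fail at the symbolic mapping stage; proposes the fractured magnitude hypothesis: models learn disjoint, modality-specific statistical manifolds and cannot achieve cross-modal grounding.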

Comments ICML 2026

详情
AI中文摘要

尽管大型视觉语言模型(VLM)在插值任务上表现出色,但在系统泛化方面,尤其是视觉计数任务中,会遭遇灾难性失败。本文通过将视觉计数分解为三个认知阶段:视觉个体化、数量感知和符号映射,来研究这一外推瓶颈。利用合成围棋棋盘和线性探针,我们证明视觉骨干网络在进入外推区域后仍能保持稳健、线性可分离的数量表示,排除了感知失败的可能性。此外,模型保留了潜在的数量感知能力,能够成功对无法枚举的数量进行比较推理。我们将崩溃定位在符号映射阶段,即模型无法将有效的视觉数量投影到符号标记上。我们的发现支持断裂数量假说:VLM未能获得通用数字空间,而是学习了不相交的、模态特定的统计流形,这阻止了对未见数量的跨模态对齐。在最新基础模型上的验证结果表明,弥合这一差距需要引入强制统一表示的归纳先验,因为仅靠数据扩展是不够的。

英文摘要

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

2605.30168 2026-05-29 cs.CV 版本更新

OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics

OmniCD:多模态语义引导的遥感图像变化检测基础框架

Chenhao Sun

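Affiliations Wuhan University

AI summary Proposes OmniCD, a framework that unifies remote sensing change detection tasks through multimodal semantic guidance (image and text prompts), combines hierarchical scene retrieval with a style disentanglement mechanism, and builds the large-scale RSITCD dataset, achieving state-of-the-art performance on multiple benchmarks.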

详情
AI中文摘要

遥感中的变化检测(CD)对于城市监测和灾害评估等应用至关重要,但传统方法难以在不同场景下泛化。我们提出OmniCD,一个通过多模态语义引导统一并增强遥感CD的基础框架。OmniCD将图像和文本提示(如文本描述、语义地图和地理空间元数据)整合到统一架构中,支持从二元CD到零样本语义变化理解的任务。该框架集成了层次化场景检索模块和变化检测模块,并通过风格解耦机制增强跨域鲁棒性。我们进一步引入RSITCD,一个包含30万+标注图像-文本对的大规模多模态数据集。大量实验表明,OmniCD在多个基准上达到最先进性能,展现出强大的适应性,为遥感中的通用CD系统奠定了坚实基础。

英文摘要

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

2605.30167 2026-05-29 stat.ML cs.CV cs.LG stat.AP 版本更新

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

视觉空间学习:使用卷积神经网络的单场空间插值

Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

Affiliations Centro de Matemática (CMAT), Universidade do Minho; DEI-FEUP & INESC TEC, Universidade do Porto; Instituto Português do Mar e da Atmosfera, I. P. (IPMA, I. P.), Lisboa, Portugal; Centro de Ciências do Mar e do Ambiente (MARE), Évora, Portugal

AI summary Proposes a CNN-based architecture that learns spatial interpolation directly from a single partially observed field, without external data or prior fields, as an alternative to Kriging.

Comments 53 pages, 10 figures

详情
AI中文摘要

从稀疏观测中预测完整的空间相关场是空间统计和环境建模中的一个基本挑战。经典的插值方法如克里金法依赖于高斯过程假设和变异函数分析,这可能会限制其在非平稳环境中的有效性,并且需要大量的领域专业知识。在这项工作中,我们利用基于卷积神经网络(CNN)的架构进行空间插值,该架构在单个部分观测场上进行训练和应用,无需访问外部数据或先验场。模型直接在观测位置进行监督,并学习在用户定义的网格上预测未观测点的值。与克里金法不同,我们的方法不需要显式的协方差建模或变异函数估计,并且可以以数据驱动的方式灵活捕捉局部空间模式。这项工作展示了CNN在稀疏监督下进行单实例空间插值的潜力,为经典地统计方法提供了实用的替代方案,并将CNN的应用扩展到新的问题领域。

英文摘要

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on the user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.
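Supervising a model only at observed locations usually comes down to a masked loss: the network predicts a value for every grid cell, but the error is accumulated only where a measurement exists. The sketch below shows such a masked mean squared error on nested lists. The choice of MSE and the names are assumptions, since the abstract does not specify the loss.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over grid cells whose mask entry is truthy (observed).

    Values of `target` at unobserved cells are placeholders and never read
    into the loss. Assumes at least one observed cell.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count
```

Because unobserved cells contribute nothing to the loss, the prediction there is shaped only by what the convolutional layers propagate from nearby observed cells, which is where the interpolation comes from.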

2605.30161 2026-05-29 cs.CV 版本更新

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

为什么远处看起来在上方:探究视觉-语言模型中的空间表征

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

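Affiliations Seoul National University; The Ohio State University; NVIDIA

AI summary Using minimal contrastive pairs, finds that vision-language models exhibit vertical-distance entanglement (conflating vertical image position with distance); this perspective bias produces an accuracy gap and intensifies under data scaling, while models with well-separated spatial axes are more robust.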

详情
AI中文摘要

视觉-语言模型(VLM)在空间推理基准上取得了强劲性能,但仍不清楚这是否反映了结构化的3D理解,还是依赖于自然图像中的统计捷径。我们引入了一个表征级分析框架,构建最小对比对来测量VLM嵌入中空间轴的组织和分离程度。跨多个模型族的分析揭示了一致的垂直-距离纠缠:模型将图像垂直位置与距离混淆,反映了自然照片的透视偏差。这种偏差导致透视一致与反启发式示例之间存在显著的准确率差距,并且随着数据规模的扩大而加剧,即使整体基准准确率有所提高。我们进一步表明,具有相似基准分数的模型可能表现出不同的内部表征,并且这些差异可预测跨不同空间推理基准的准确率和鲁棒性。为了将这种偏差与评估集偏斜隔离,我们引入了SpatialTunnel,这是一个合成基准,通过去除自然图像中常见的相关性来暴露空间捷径偏差。实验证实,纠缠是模型固有的,并且具有良好分离空间轴的模型表现出更强的鲁棒性,这表明结构良好的空间表征可在不同基准上带来更可靠的空间推理。代码和基准可在项目页面获取:https://cheolhong0916.github.io/whyfarlooksup.github.io/。

英文摘要

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

2605.30140 2026-05-29 cs.CV 版本更新

AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection

AnomalyAgent: 用于零样本/少样本异常检测的无训练智能体模型

Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang

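Affiliations Singapore Management University; Sun Yat-sen University

AI summary Proposes AnomalyAgent, a training-free agentic framework built on multimodal large language models that performs zero-/few-shot anomaly detection with a customized toolset and memory module, outperforming existing methods on complex cases such as logical/contextual anomalies.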

详情
AI中文摘要

受益于视觉语言模型(如CLIP)的泛化能力,许多零样本/少样本异常检测方法已在各种数据集上取得了令人印象深刻的检测性能。然而,它们需要在大规模辅助数据集上进行大量训练以适应异常检测,并且其推理主要依赖于基于视觉-文本嵌入相似度的异常分数,缺乏检测需要深度上下文理解的复杂异常的推理能力。为了解决这一局限性,我们提出了AnomalyAgent,一种新颖的无训练智能体框架,利用多模态大语言模型的先进推理和泛化能力进行异常检测。关键要素包括:1)一个全面的以异常为中心的工具集,能够在零样本设置下实现自适应MLLM驱动的智能体异常推理;2)一个定制的记忆模块,通过少样本上下文参考示例来支撑异常推理。我们将评估从广泛使用的基准测试中检测简单异常(例如,裂纹和凹痕等表面缺陷以及明显病变)扩展到更多样化的异常类型,例如物流和制造环境中的逻辑/上下文异常。大量实验结果表明,我们的AnomalyAgent与无训练的基于VLM的异常检测和通用智能体方法相比,实现了显著更好的性能,突显了其在零样本和少样本异常检测设置中的优越泛化能力。代码实现可在此地址找到。

英文摘要

Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on anomaly scores based on visual-text embedding similarity, lacking the reasoning abilities needed to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose AnomalyAgent, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include 1) a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and 2) a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.

2605.30131 2026-05-29 cs.CL cs.CV 版本更新

CCS: Clinical Consensus Selection for Radiology Report Generation

CCS:放射学报告生成的临床共识选择

Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Affiliations School of Computing Science, University of Glasgow; School of Electrical and Computer Engineering, University of Sydney; Language Technology Lab, University of Cambridge

AI summary Proposes the CCS framework, which samples multiple candidate reports and selects the one with the highest clinical consensus to improve radiology report generation quality at inference time.

Comments 17 pages, 6 figures

详情
AI中文摘要

放射学报告生成(RRG)通常被表述为单路径生成任务,其中多模态大语言模型(MLLM)产生一个解码报告作为最终输出。虽然最近的进展主要通过扩展训练数据、模型容量和检索机制来推动,但在推理时提高报告质量仍未被充分探索。在这项工作中,我们观察到固定的放射学MLLM在其候选池中通常生成比默认解码选择的报告临床更强的报告,这表明推理时的决策仍然是一个被忽视的瓶颈。为了解决这个问题,我们提出了临床共识选择(CCS),一个解码器无关的推理时选择框架,它采样多个候选报告,并选择在展开池中具有最高临床共识的报告。CCS将基于文本的效用与由图像-报告训练的多模态嵌入器计算的放射学适应效用统一起来,该嵌入器测量超越表面文本相似性的候选一致性。在三个数据集和多个放射学MLLM上,CCS始终优于单路径解码和通用Best-of-N基线,特别是在临床指标上取得了明显提升。进一步分析表明,基于图像的效用形成了与文本共识不同的选择轴,并且在推理时改进RRG仍有很大的提升空间。

英文摘要

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.
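Consensus selection over a candidate pool can be sketched in a few lines: score each sampled report by its average agreement with every other sample and keep the highest scorer. The sketch below uses token-level Jaccard overlap as a stand-in utility. CCS itself combines text-based utilities with an image-report embedder, which is not reproduced here; the function names and the toy strings in the usage note are invented for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two texts (a stand-in utility, an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def select_by_consensus(candidates, utility=jaccard):
    """Return the index of the candidate with the highest mean utility
    against all other candidates in the pool (ties go to the earliest)."""
    best_idx, best_score = 0, float("-inf")
    for i, cand in enumerate(candidates):
        others = [utility(cand, o) for j, o in enumerate(candidates) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx
```

With the toy pool `["no pleural effusion", "no pleural effusion seen", "no effusion", "large pneumothorax"]`, the first candidate agrees most with the rest and is selected, while the outlier scores zero. Swapping `utility` for an embedding-based similarity is the natural extension point.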

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG 版本更新

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

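Affiliations Max Planck Institute for Informatics; Google

AI summary Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.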

Comments 33 pages, 4 figures

详情
AI中文摘要

大型视觉-语言模型(LVLMs)将视觉输入映射为密集的令牌序列,导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而,现有方法在激进压缩下表现不佳。空间压缩(如嵌套池化)表现为不完美的低通滤波器,并引起频谱混叠,掩盖了细粒度细节。查询压缩(如嵌套查询重采样)用非局部摘要替代显式的网格对齐令牌,显著降低了空间定位能力。为解决这一表示冲突,我们引入了PARCEL(基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解),一种视觉分词架构,动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点,并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征,而非冗余的空间映射。在27个基准上的广泛评估表明,PARCEL改进了性能-效率帕累托前沿,在各种视觉令牌预算下持续优于现有的嵌套基线,同时保留了“一次训练,随处部署”的范式。

英文摘要

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
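The spatial-only baseline mentioned above, nested pooling, is easy to picture: the same grid of visual tokens is average-pooled with increasing window sizes, so one model can be run at several token budgets. The sketch below shows only that baseline on nested lists and is not PARCEL's pool-conditioned query resampling; the function names and the pooling factors are assumptions.

```python
def average_pool(grid, k):
    """Average-pool a square grid of feature vectors with a k x k window, stride k.

    Assumes the grid side length is divisible by k.
    """
    n = len(grid)
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, n, k):
        row = []
        for c in range(0, n, k):
            cell = [grid[i][j] for i in range(r, r + k) for j in range(c, c + k)]
            row.append([sum(v[d] for v in cell) / len(cell) for d in range(dim)])
        pooled.append(row)
    return pooled


def nested_pool_tokens(grid, factors=(1, 2, 4)):
    """Return {token_count: flat token list} for each pooling factor."""
    budgets = {}
    for k in factors:
        tokens = [tok for row in average_pool(grid, k) for tok in row]
        budgets[len(tokens)] = tokens
    return budgets
```

On a 4x4 grid this yields budgets of 16, 4 and 1 tokens. Each coarser level averages away more local detail, which is the low-pass behaviour the abstract attributes to spatial-only compression.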

2605.30116 2026-05-29 cs.CV cs.LG 版本更新

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

SGMD: 得分梯度匹配蒸馏用于少步视频扩散蒸馏

Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

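Affiliations Beihang University; SenseTime Research; Hong Kong University of Science and Technology

AI summary To address the costly training and conservative motion dynamics of distribution matching distillation in few-step video diffusion, proposes Score Gradient Matching Distillation (SGMD), which directly optimizes the fake score toward the teacher and uses teacher stop-gradient Fisher as a stable objective, achieving a roughly 3× training speedup and substantially better motion dynamics.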

Comments ICML 2026

详情
AI中文摘要

分布匹配蒸馏(DMD)是加速少步视频扩散模型推理的常用范式。然而,DMD风格的视频蒸馏面临两个耦合挑战:假得分必须跟踪不断演化的生成器,当需要频繁更新时训练成本高昂,而反向KL风格匹配可能具有模式寻求性和保守性,难以保持强运动动态。为解决这些问题,我们提出得分梯度匹配蒸馏(SGMD)。SGMD采用假得分视角,直接优化假得分朝向教师,同时使用教师停止梯度Fisher作为稳定的分布匹配目标。我们提供了梯度分析,论证了在理想跟踪下该目标选择的合理性。在此基础上,SGMD引入一对双重势:负残差(NR)用于外环校正,残差收缩(RC)用于内环跟踪。实验上,与DMD2相比,SGMD实现了约3倍的训练加速,并显著改善了4步蒸馏模型的运动动态,同时保持了时间一致性。一项人类研究证实,SGMD在运动质量和整体偏好上更受青睐,而视觉质量和文本对齐保持相当。代码可在https://github.com/ModelTC/LightX2V获取。

英文摘要

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately 3× training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

2605.30115 2026-05-29 cs.CV 版本更新

Large Depth Completion Model from Sparse Observations

来自稀疏观测的大深度补全模型

Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

Affiliations Zhejiang University; Tongyi Lab, Alibaba Group; Fudan University; Ningbo Innovation Center, Zhejiang University; NingboTech University; Jinhua Institute of Zhejiang University

AI summary Proposes LDCM, which uses monocular foundation models and a Poisson-based depth initialization strategy, together with a point map head that regresses 3D coordinates, to achieve metric-accurate depth completion from sparse observations.

Comments ICLR 2026. Project webpage: https://pkqbajng.github.io/ldcm/

详情
AI中文摘要

本文提出了大深度补全模型(LDCM),一个简单、有效且鲁棒的框架,用于稀疏观测下的单视图度量深度估计。在不依赖复杂架构设计的情况下,LDCM使用Transformer生成度量准确的密集深度图。它在多种数据集和稀疏观测下优于现有方法。我们从两个关键角度实现这一点:(1)利用现有的单目基础模型提高稀疏深度输入的质量,(2)重新制定训练目标以更好地捕捉几何结构和度量一致性。具体来说,首先引入基于泊松的深度初始化策略,从不同的稀疏观测生成均匀的粗密集深度图,为网络提供强大的结构先验。关于训练目标,我们用点图头替换传统的深度头,该点图头回归相机空间中的逐像素3D坐标,使模型能够直接学习底层3D场景结构,而不是执行逐像素深度图恢复。此外,这种设计消除了对相机内参的需求,使LDCM能够自然地产生度量尺度的3D点图。大量实验表明,LDCM在多个基准测试和不同稀疏度水平下,在深度补全和点图估计方面均持续优于最先进的方法,展示了其有效性和对未见数据分布的强泛化能力。

英文摘要

This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.
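The difference between a depth head and a point map head can be made concrete with the pinhole camera model. A depth map needs the intrinsics (focal lengths and principal point) to be lifted into camera-space 3D points, whereas a point map already stores (X, Y, Z) per pixel and depth is simply its Z channel. The sketch below shows that standard relation; it is not LDCM code, and the function and parameter names are assumptions.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map to a per-pixel point map (X, Y, Z) in camera space.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy,
    where u is the column index and v is the row index.
    """
    return [[((u - cx) * z / fx, (v - cy) * z / fy, z)
             for u, z in enumerate(row)]
            for v, row in enumerate(depth)]


def depth_from_point_map(point_map):
    """Depth is the Z coordinate of each camera-space point."""
    return [[p[2] for p in row] for row in point_map]
```

A network that regresses the point map directly never has to evaluate `backproject`, which is consistent with the abstract's claim that the design removes the need for camera intrinsic parameters.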

2605.30111 2026-05-29 cs.CV cs.AI 版本更新

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

xModel-KD:基于LiDAR的3D场景感知跨模态知识蒸馏

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

发表机构 * Dept. of Computer Science, Lakehead University, Thunder Bay, Canada; School of Computer Science Engg. & Info. Systems, Vellore Institute of Technology, Tamil Nadu, India; Dept. of Software Engg., Lakehead University, Thunder Bay, Canada

AI总结 提出跨模态知识蒸馏框架xModel-KD,通过对比学习对齐2D图像纹理与3D点云几何特征,在无额外标注下提升LiDAR点云分割性能。

Comments 3 figures, and 5 tables

详情
AI中文摘要

点云分割是3D场景理解中的基础任务。其进展受到密集3D标注高成本和高时间的限制,导致标注样本难以获取。除了标注稀缺,不同感知模态面临固有局限性。2D图像提供丰富的纹理和外观线索,但缺乏明确的深度和几何结构。相比之下,3D点云捕捉精确的空间几何,但稀疏且不含纹理信息。因此,依赖单一模态限制了所学表示的丰富性并削弱了泛化能力。尽管最近结合3D点云与2D图像的多模态方法在分类和检索等任务中表现出色,但它们通常依赖大规模标注数据集,且尚未充分用于数据高效的密集预测。为解决这些限制,我们提出一种新颖的跨模态知识蒸馏框架xModel-KD,用于3D点云分割。我们的方法通过跨模态对齐学习统一的逐点表示,利用2D纹理和3D几何的互补优势。具体而言,我们设计了一个跨模态融合编码器,通过对比目标训练,强制多视图下对应的2D和3D表示之间的特征一致性。通过将强大的预训练骨干与有针对性的融合策略相结合,所提框架有效地将图像的外观线索迁移到几何感知的点特征中。实验结果表明,跨模态融合在mIoU上比仅使用LiDAR的基线实现了2%的绝对提升,证明了利用互补多模态信息进行可扩展和标注高效的3D场景理解的优势。

英文摘要

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.
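
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss that pulls each 2D feature toward its corresponding 3D feature and away from the others in the batch. The function names, the temperature value, and the one-directional (2D to 3D) form are assumptions, not details of xModel-KD.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(feats_2d, feats_3d, temperature=0.1):
    """Mean InfoNCE loss; row i of feats_2d is the positive for row i of feats_3d."""
    a = [normalize(v) for v in feats_2d]
    b = [normalize(v) for v in feats_3d]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, bj) / temperature for bj in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


# Correctly paired features give a low loss, mismatched pairs a high one.
aligned = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Minimizing such a loss is one way to enforce the feature consistency between corresponding 2D and 3D representations that the abstract describes.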

2605.30099 2026-05-29 cs.CV 版本更新

Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection

对话代理评估:理解情感检测中的文化、背景与环境

Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

发表机构 * Cluster of Excellence, University of Stuttgart(斯图加特大学卓越中心) Department of Computer Science, Kwame Nkrumah University of Science and Technology(夸梅·恩克鲁玛科技大学计算机科学系) Institute for Ethics in Artificial Intelligence, Technical University of Munich(慕尼黑工业大学人工智能伦理研究所)

AI总结 针对黑人非洲社会,提出结合语音和图像数据、使用3层CNN和AFME算法的情感预测模型,准确率85%-96%,并识别讽刺,提升对话AI情感识别系统的可信度。

Comments IEEE paper on arxiv

详情
Journal ref
IEEE Access 10 (2022) 24976-24984; Erratum: IEEE Access (2022) 35900-35900
AI中文摘要

现在,有价值决策和高度优先分析依赖于面部生物识别、社交媒体照片标记和人机交互等应用。然而,成功部署这些应用的能力取决于它们在考虑可能边缘情况下的测试用例效率。多年来,已经实施了大量通用解决方案来模仿人类情感,包括讽刺。然而,地理位置或文化差异等因素在其解决伦理问题和改进对话AI(人工智能)的相关性中尚未得到充分探索。在本文中,我们旨在解决在黑人非洲社会中对话AI使用的潜在挑战。我们开发了一个情感预测模型,准确率在85%到96%之间。我们的模型结合了语音和图像数据来检测七种基本情感,并特别关注识别讽刺。它使用了3层卷积神经网络,并结合了一种新的音频帧平均表情(AFME)算法,重点放在模型的预处理和后处理阶段。最后,我们的解决方案有助于维护对话AI中情感识别系统的可信度。

英文摘要

Valuable decisions and high-priority analyses now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, the ability to deploy such applications successfully depends on how well they perform on tested use cases, including possible edge cases. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location and cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the use of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines speech and image data to detect the seven basic emotions, with an additional focus on identifying sarcasm. It uses a 3-layer convolutional neural network together with a new Audio-Frame Mean Expression (AFME) algorithm and focuses on the model's pre-processing and post-processing stages. Ultimately, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

2605.30093 2026-05-29 cs.CV 版本更新

Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

几何至关重要:用于学习语义对应的3D基础先验

Artur Jesslen, Olaf Dünkel, Adam Kortylewski

发表机构 * University of Freiburg(弗赖堡大学) Max Planck Institute for Informatics(马克斯·普朗克信息研究所) CISPA Helmholtz Center for Information Security(CISPA 亥姆霍兹信息安全中心)

AI总结 提出一种3D感知的后训练框架,利用3D基础模型(SAM3D)估计物体几何和姿态,生成几何感知特征图,结合DINO和Stable Diffusion特征,通过测地距离过滤候选对应,训练轻量适配器改进语义对应。

Comments 9 pages (main paper), 21 pages (total), 4 figures

详情
AI中文摘要

来自自监督视觉模型和文本到图像扩散模型的基础特征已被证明对语义对应估计有效。然而,由于这些特征主要从2D图像目标学习,它们缺乏明确的3D意识,并且常常混淆对称物体侧面、重复部分以及在3D中不同的视觉相似结构。我们引入了一个3D感知的后训练框架,通过结合3D基础模型的先验,超越了现有的2D基础特征。给定一张图像,我们的方法使用SAM3D估计物体几何和姿态,并通过渲染-比较优化来细化姿态。随后,我们根据估计的物体姿态,将重建几何中的PartField描述符渲染到图像平面。由此产生的几何感知特征图补充了DINO和Stable Diffusion特征,而重建形状上的测地距离能够可靠地过滤候选对应。我们使用过滤后的匹配作为监督,在DINO和Stable Diffusion之上训练一个轻量适配器用于语义对应。与之前需要姿态标注并依赖粗略球形几何的后训练方法相比,我们的方法自动获得实例特定的3D结构,并用它来指导对应学习。实验表明,我们的方法改进了语义对应,同时减少了人工几何监督。代码和模型可在 https://github.com/GenIntel/3D-SC 获取。

英文摘要

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

2605.30090 2026-05-29 cs.CL cs.CV 版本更新

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

DirectorBench: 通过个性化多智能体评估诊断长视频生成

Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

发表机构 * ByteDance Inc.(字节跳动公司) City University of Hong Kong(香港城市大学)

AI总结 提出DirectorBench,一种基于多智能体的诊断基准,通过80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估长视频生成,并定位瓶颈和用户偏好依赖。

详情
AI中文摘要

长视频生成正从短的单场景合成快速转向分钟级、多镜头的创作,具有叙事结构、电影控制、音频和跨模态同步。然而,评估此类视频仍然具有挑战性,因为现有基准主要关注局部视觉质量、短期时间一致性或通用提示对齐,并且对工作流故障和用户依赖偏好的诊断有限。我们引入了DirectorBench,一个用于长视频生成的个性化多智能体诊断基准。DirectorBench根据80个结构化元数据、7个用户画像和40个检查点标准,在脚本、视觉、音频、跨模态和稳定性五个维度上评估生成的视频。DirectorBench不将质量简化为单一聚合分数,而是定位检查点级别的瓶颈并支持画像感知评估。我们评估了4个长视频生成工作流、6个基础LLM和7个用户画像。在不同工作流中,DirectorBench揭示了一个单元间瓶颈:过渡质量平均仅为0.256,最佳工作流达到0.356,而提示级别的用户需求满足度平均为0.71。我们进一步进行了14名标注者的人工评估,以验证DirectorBench与人类判断的一致性。结果表明,DirectorBench捕捉到了人类可感知的质量差异,并揭示了聚合评分所隐藏的工作流和画像依赖的故障模式。这些发现强调了长视频生成中诊断性和画像感知基准的重要性。

英文摘要

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

2605.30083 2026-05-29 cs.CV 版本更新

Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation

未来强制:自回归视频生成中无需训练的未来感知KV缓存策略

Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen

发表机构 * BUAA(北京航空航天大学) ZGCA(中关村学院) PKU(北京大学) CASIA(中国科学院自动化研究所) USTC(中国科学技术大学) NUS(新加坡国立大学)

AI总结 提出Future Forcing,一种无需训练的未来感知KV缓存策略,通过利用自回归视频模型中查询分布的平稳性来估计未来查询,从而改进长视频生成的一致性。

详情
AI中文摘要

自回归(AR)视频生成已成为长时域视频合成的一种有前景的范式,其中每一帧的生成基于先前生成的令牌。为了加速推理,使用KV缓存避免跨生成步骤的冗余重计算。然而,随着生成长度的增长,KV缓存会引入越来越多的内存和误差累积,限制了AR模型扩展到更长序列的可扩展性。现有的KV缓存压缩方法通过选择性地保留被认为重要的视频令牌来缓解这一问题。然而,大多数现有方法使用从当前或历史生成上下文中提取的短时域信号来评估令牌重要性,这使得这些方法容易忽略在早期步骤中看似不重要但后来对未来帧至关重要的令牌。在这项工作中,我们识别了训练好的AR视频模型的一个重要性质:尽管RoPE调制的查询在自回归步骤中演变,但底层的规范预RoPE查询分布在视频生成过程中保持显著稳定。这种近似平稳性意味着未来查询分布可以从历史统计中估计,从而无需额外训练即可实现原则性的未来感知缓存决策。基于这一洞察,我们提出了Future Forcing,一种用于AR视频生成的无需训练的未来感知KV缓存策略。具体来说,Future Forcing首先从历史统计中构建未来查询代理,然后根据该代理下的重要性对KV缓存令牌进行评分,最后在未来查询诱导的仿射子空间内合并冗余令牌对。大量实验表明,Future Forcing在有限的KV缓存下改善了长时域一致性,在VBench-Long上针对60秒生成,与现有的AR视频KV缓存策略相比,主体一致性提升了高达1.49。

英文摘要

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.
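
To make the idea of scoring cached tokens against an estimated future query concrete, here is a deliberately simplified sketch. It assumes the future query proxy is the mean of historical pre-RoPE queries and that importance is a plain dot product with each cached key; both choices, and all names, are illustrative assumptions rather than the paper's actual procedure, and the subspace merging step is omitted.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_tokens(history_queries, cached_keys, budget):
    """Keep the `budget` cached tokens that score highest under the query proxy."""
    proxy = mean_vector(history_queries)  # assumed stand-in for future queries
    scores = [sum(q * k for q, k in zip(proxy, key)) for key in cached_keys]
    ranked = sorted(range(len(cached_keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # return kept indices in cache order


# Toy data: historical queries cluster around [1, 0], so keys aligned with
# that direction are retained.
history = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
keys = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
kept = select_tokens(history, keys, budget=2)
```

The sketch relies on the stationarity the abstract reports: if the pre-RoPE query distribution is stable, statistics of past queries are a reasonable estimate of what future queries will attend to.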

2605.30073 2026-05-29 cs.CV 版本更新

Native Audio-Visual Alignment for Generation

原生音视频对齐生成

Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

发表机构 * ERNIE Team, Baidu Inc.(百度公司ERNIE团队)

AI总结 提出NAVA框架,通过原生音视频对齐和上下文条件联合去噪,实现高质量、同步且可控的音视频生成。

Comments Project page: https://ernie-research.github.io/NAVA/

详情
AI中文摘要

联合音视频生成旨在合成时间同步且语义一致的视觉-声学内容。然而,现有的开源方法主要依赖于带有后对齐的双塔设计或统一的三模态设计,将文本上下文、音频和视频混合在一个共享空间中。前者削弱了细粒度的音视频协同进化,而后者将语义条件与低级同步耦合。为了解决这些限制,我们提出了NAVA,一个用于联合音视频生成的原生音视频对齐框架。NAVA建立在上下文条件的原生音视频对齐之上:它首先在专用的交互空间中建立音视频对应关系,然后使用外部上下文来条件化联合去噪过程。具体地,NAVA通过Align-then-Fuse MMDiT架构实例化,该架构从模态感知的音视频对齐过渡到模态共享的联合去噪。此外,我们引入了上下文音色条件,将参考音色线索与相应的语音跨度关联,以实现可控的语音音色。在Verse-Bench和Seed-TTS上的实验以及用户研究表明,NAVA仅使用6.3B参数就实现了卓越的视频质量、精确的音视频同步、有竞争力的音频质量和更强的参考音色可控性。

英文摘要

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

2605.30065 2026-05-29 cs.CV 版本更新

Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors

利用二维预训练先验提升零样本三维风格迁移

Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Pengcheng Laboratory(鹏城实验室)

AI总结 提出Data-Sufficient StyleGaussian模型,通过集成大规模2D图像数据集预训练的解码器,结合特征高斯溅射与延迟风格化,在数据稀缺条件下实现零样本3D风格迁移的高质量多视图一致渲染。

Comments Accepted by IEEE IVMSP2026

详情
AI中文摘要

在这项工作中,我们专注于零样本三维风格迁移,即给定任意风格图像,生成三维场景的多视图一致风格化视图。我们主要解决三维风格迁移中的数据稀缺问题,该问题源于每个模型仅在单个场景上训练,从而限制了可用内容图像的数量。这种稀缺性严重阻碍了风格化性能,因为模型优化依赖于足够数量的内容-风格图像对来提供监督信号。我们的核心思想是将在大规模二维图像数据集上预训练的解码器集成到三维风格迁移流程中,从而利用解码器从大量内容-风格图像对中学习到的先验知识。我们的方法结合了特征高斯溅射和延迟风格化,通过将视图相关操作统一为视图不变过程,在确保视图一致性的同时,利用数据充足的解码器网络实现高质量风格化。实验表明,我们的Data-Sufficient StyleGaussian(DS-StyleGaussian)模型在多个数据集上的视觉质量优于现有的零样本三维风格迁移方法。这项工作也表明,二维预训练可以作为三维任务的强增强手段,弥合二维与三维之间的数据差距。

英文摘要

In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.

2605.30062 2026-05-29 cs.CV 版本更新

FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection

FakeVLM-R1:通过思维链内化物理定律进行合成图像检测

Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

发表机构 * Shanghai AI Lab(上海人工智能实验室) Nanjing University(南京大学) Sun Yat-Sen University(中山大学) Shenzhen University(深圳大学) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 提出FakeVLM-R1框架,结合监督微调、组相对策略优化和批判性思维链机制,通过双向辩证推理和物理常识构建真实性反证,实现高精度、逻辑可解释的合成图像检测,解决现有方法的过度拒绝偏差。

详情
AI中文摘要

生成式人工智能技术的发展已将合成图像的视觉真实性提升至前所未有的水平。尽管当前基于大型多模态模型(LMM)的可解释检测方法取得了一定进展,但它们仍然依赖于从大量伪造数据中获得的模仿学习,因此缺乏真正的因果推理能力,容易产生解释性幻觉。为克服这一瓶颈,我们提出FakeVLM-R1,旨在赋予模型在执行合成检测任务时类似人类的批判性思维能力。该框架在监督微调(SFT)基础上,将组相对策略优化(GRPO)与批判性思维链(CoT)机制相结合。在推理阶段,模型执行“双向辩证推理”过程:在提出伪造假设的同时,必须同时调用物理常识构建真实性反证。此外,我们构建了包含高质量样本的FakeClue++数据集,该数据集广泛引入了基于真实图像物理定律的注释,为模型提供了统一的真实性锚点。实验证实,FakeVLM-R1在多个基准测试中达到了评估模型中的最优性能(SOTA)。它不仅实现了高精度、逻辑可解释的检测,还解决了现有方法对真实图像的过度拒绝偏差,展现出对扰动的泛化性和鲁棒性。

英文摘要

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.

2605.30045 2026-05-29 cs.CV 版本更新

GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver

GenEraser:通过平衡文本-掩码引导和解耦定位器-保持器实现可泛化的视频对象移除

Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

发表机构 * Tsinghua University(清华大学) Pengcheng National Laboratory(鹏城实验室) Huawei(华为) Southeast University(东南大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出GenEraser框架,通过多条件混合专家、可学习深度CFG融合机制和解耦专家架构,解决视频对象移除中目标与物理效应同时消除的泛化难题,在ROSE和VOR-Eval上分别提升2.16 dB和1.44 dB。

详情
AI中文摘要

视频对象移除在域外场景中常因复杂的时空歧义而难以同时消除目标对象及其关联的物理效应(如烟雾、反射、光线和涟漪)。现有方法主要依赖空间掩码,但往往无法捕捉弱相关效应,且显式文本引导的潜力尚未充分探索。此外,移除模型在高层语义泛化与精确像素级背景保持之间存在根本性的优化冲突。为解决这些挑战,我们提出GenEraser,一种用于泛化高保真视频对象与效应移除的新框架。首先,我们引入多条件混合专家(MC-MoE)配合二分文本引导,充分利用扩散变换器的多模态先验,显著增强复杂效应的识别。其次,开发可学习深度“CFG”融合机制(LD-CFG),以自适应平衡不同场景下掩码和文本条件的相对主导地位。最后,提出解耦专家架构,包含定位器和保持器,以缓解语义泛化与像素对齐之间的固有权衡。大量实验表明,我们的GenEraser超越了近期最先进方法,在ROSE基准和VOR-Eval上分别实现了显著的定量提升(2.16 dB和1.44 dB),同时在开放世界场景中保持了异常稳健的泛化能力。

英文摘要

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. Project page: https://cyqii.github.io/GenEraser.github.io/

2605.30038 2026-05-29 cs.LG cs.AI cs.CV 版本更新

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

对齐引导的分数匹配用于扩散模型中的文本到图像对齐

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院(KAIST)人工智能研究生院)

AI总结 提出一种轻量级、无奖励的后训练方法,通过将对比对齐引导直接整合到扩散模型的分数匹配目标中,以解决文本-图像对齐中的过度惩罚和计数错误问题。

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

详情
AI中文摘要

扩散模型生成高度逼真的图像,但通常难以实现精确的文本-图像对齐。虽然最近的后训练方法使用外部奖励或人类偏好信号改善对齐,但其性能严重依赖奖励质量,且不直接解决扩散过程中的对齐问题。最近的无奖励方法如SoftREPA表明,通过对比学习优化软文本令牌可以有效改善文本-图像表示对齐,优于标准参数高效微调基线。然而,对比公式可能过度惩罚负对,表现为典型的失败案例,如过度计数和重复。为解决此问题,我们提出一种轻量级、无奖励的后训练方法,通过将对比对齐引导直接整合到扩散模型的分数匹配目标中来细化软令牌。通过在分数级别分配对齐方向,我们的方法缓解了这些限制,并产生更连贯和语义忠实的生成。实验表明,我们的方法与SoftREPA相当,同时显著改善了其失败案例,在GenEval基准上计数准确性提高了超过35%。我们的方法可无缝应用于现有扩散骨干网络(SD1.5、SDXL和SD3),并与现有的基于RL的扩散后训练方法互补。项目页面:https://jaayeon.github.io/AGSM

英文摘要

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30027 2026-05-29 cs.CV cs.IR 版本更新

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark

DocRetriever:面向多模态文档检索的即插即用框架与综合基准

Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

发表机构 * Zhejiang University(浙江大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出DocRetriever即插即用框架,通过布局感知的稀疏嵌入和推理增强的重排序器解决多模态文档检索中语义模糊和泛化瓶颈问题,并构建MultiDocR基准实现更严格评估。

Comments Accepted at KDD 2026 Research Track

详情
AI中文摘要

多模态文档包含表格、图形和布局等多样元素,可能使检索任务复杂化。当前方法通常将密集视觉嵌入模型与有监督重排序器相结合以实现高精度检索,但存在固有局限性。首先,密集嵌入的粗粒度特性往往模糊显式语义,无法利用结构显著信息。其次,有监督重排序模型面临泛化瓶颈,其性能严重依赖领域特定训练数据。此外,现有基准通常缺乏多样化的评估维度和全面的相关性标注,限制了可靠评估。为解决这些挑战,我们提出DocRetriever,一个即插即用框架。它通过布局感知的稀疏嵌入技术增强视觉检索,实现无需光学字符识别(OCR)开销的有效混合编码。我们还引入了一个可泛化的重排序器,利用推理增强的示范和优化采样来提高少样本场景下的准确性。最后,我们构建了一个新基准MultiDocR,以实现更严格的评估。在多个基准上的实验验证了DocRetriever相对于最先进方法的优越性。

英文摘要

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

2605.30011 2026-05-29 cs.CV cs.AI 版本更新

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

VisualThink-VLA:用于高效低延迟视觉-语言-动作策略的视觉中间推理

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学) Cornell University(康奈尔大学) National University of Singapore(新加坡国立大学) Xi'an University of Electronic Science and Technology(西安电子科技大学)

AI总结 提出VisualThink-VLA框架,通过视觉中间推理和选择性路由机制,在保持高精度的同时将推理延迟从数秒降至亚秒级。

详情
AI中文摘要

近期工作开始为视觉-语言-动作(VLA)策略配备显式的中间推理。然而,在具身控制中,文本思维链并不适用:无关或弱文本信息会干扰动作预测,而自回归文本解码为实时闭环执行增加了过多延迟。我们提出VISUALTHINK-VLA,一个用于准确、低延迟VLA策略的视觉中间推理框架。我们的引导哲学是通过有效的视觉思维来指导动作:VISUALTHINK-VLA通过一个紧凑的视觉证据接口引导动作预测,该接口在避免解码开销的同时保持空间精度。此外,为了进一步提升性能和效率,VISUALTHINK-VLA采用了一种定制的选择性路由机制来学习视觉证据令牌,从而实现低延迟推理同时保持高容量专用性。我们还引入了VisualEvidence-Kit,这是一个以VisualEvidence-Agent为核心的监督与审计资源,该智能体构建了754.7k条VLA指令的VisualEvidence-Set,用于路由监督和反事实忠实性测试。在多个基准测试和真实机器人评估中,VISUALTHINK-VLA在大多数基准测试上实现了最高成功率,同时将推理增强基线的多秒延迟降至亚秒级。例如,在BridgeData V2上,它将步骤延迟从ECoT的8.377秒降至0.367秒,实现了22.8倍的加速。

英文摘要

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.
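
The reported speedup follows directly from the two step latencies quoted for BridgeData V2; a quick check, with variable names chosen here for illustration:

```python
# Step latencies on BridgeData V2 as quoted in the abstract, in seconds.
ecot_latency_s = 8.377
visualthink_latency_s = 0.367

# Speedup is the ratio of baseline latency to the new latency.
speedup = ecot_latency_s / visualthink_latency_s
print(f"{speedup:.1f}x")  # prints "22.8x"
```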

2605.30010 2026-05-29 cs.CV 版本更新

EarlyTom: Early Token Compression Completes Fast Video Understanding

EarlyTom: 早期令牌压缩实现快速视频理解

Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学) Alibaba Cloud Computing(阿里云计算)

AI总结 针对视频大语言模型中视觉编码阶段效率低下的问题,提出EarlyTom无训练令牌压缩框架,通过在视觉编码器内部进行早期压缩,显著降低首令牌延迟并提升吞吐量。

Comments Accepted by CVPR 2026. 16 pages, 8 figures, 8 tables. Project page: https://viridisgreen.github.io/EarlyTom

详情
AI中文摘要

视频大语言模型(Video-LLMs)在视频理解任务中展现了强大的能力。然而,处理大量视觉令牌带来的低效率仍然阻碍了它们的实际部署。尽管近期的方法在保持与全令牌基线相当准确性的同时实现了极低的令牌保留率,但大多数方法仅在预填充的后期阶段进行压缩,视觉编码器的效率未得到优化。在本文中,我们首先表明视觉编码对首令牌时间(TTFT)贡献很大。因此,与仅在视觉编码器之后压缩视觉令牌不同,在编码器内部进行压缩仍有很大的探索空间。基于这一见解,我们提出了EarlyTom,一种无训练的令牌压缩框架,在视觉编码器内部执行早期视觉令牌压缩,从而显著降低TTFT并提高吞吐量。此外,我们引入了一种解耦的空间令牌选择策略,提高了整体压缩效果。在单个NVIDIA A100 GPU上,对于LLaVA-OneVision-7B模型,EarlyTom将TTFT降低高达2.65倍,FLOPs降低高达61%,同时保持与全令牌基线相当的准确性。这些改进显著增强了Video-LLMs在实际生产场景中部署的实用性。

英文摘要

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion of the time-to-first-token (TTFT). Compressing visual tokens inside the vision encoder, rather than only after it, therefore remains largely unexplored and leaves substantial room for improvement. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

2605.29997 2026-05-29 cs.CV 版本更新

FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views

FRUC:来自未标定协作驾驶视图的前馈动态场景重建

Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

发表机构 * Hong Kong JC STEM Lab of Smart City City University of Hong Kong(香港JC STEM实验室,城市大学)

AI总结 提出FRUC框架,基于前馈3D高斯泼溅和视觉几何Transformer,从未标定的多车协作视图实现动态场景的一次性、免标定重建,通过自中心因果遮挡场和零初始化残差去噪实现非破坏性几何补充。

详情
AI中文摘要

我们提出了FRUC,一个用于从未标定协作驾驶视图进行动态场景重建的前馈3D高斯泼溅框架。现有的多智能体重建框架常常受到严格先决条件的阻碍,需要精确的空间标定和缓慢的逐场景优化。在本文中,我们通过将分布式多车辆网络概念化为一个时空非结构化的自中心多相机系统来重新思考这一任务,其核心挑战在于在不降低自中心准确观测到的可见几何的情况下,通过协作增强自中心遮挡几何,同时保持重建效率。为了实现高效重建,FRUC基于视觉几何Transformer骨干网络,支持从灵活数量的多车辆视图进行一次性、免标定推理。为了在未标定的跨智能体错位下实现非破坏性几何补充,FRUC首先引入了一个自中心因果遮挡场,通过建模智能体时空相关性,将遮挡演化显式推导为潜在先验。在这些遮挡先验的指导下,它进一步将跨智能体集成公式化为一个通过零初始化注入的确定性残差去噪过程,将具有挑战性的跨智能体融合转化为有界残差学习,以实现鲁棒的协作盲点补全。通过在真实世界V2XReal和UrbanIng-V2X数据集上的广泛评估,FRUC被证明是动态协作驾驶环境场景重建的新最先进方法,在渲染质量和效率上均显著优于现有方法。

英文摘要

We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

2605.29983 2026-05-29 cs.LG cs.CV 版本更新

Improving Adversarial Robustness of Attribution via Implicit Regularization

通过隐式正则化提高归因的对抗鲁棒性

Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

发表机构 * Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden(瑞典皇家理工学院计算机科学系) Science for Life Laboratory, Stockholm, Sweden(瑞典斯德哥尔摩生命科学实验室) Department of Computer Science, Brown University, USA(美国布朗大学计算机科学系)

AI总结 本文发现标准随机梯度下降的学习动态可以隐式地提高归因的对抗鲁棒性,并证明在softmax归一化下注意力归因的鲁棒性提升受限,而基于核的注意力可恢复鲁棒性。

Comments 39 pages, 22 figures, to be published in International Conference on Machine Learning 2026

详情
AI中文摘要

归因的对抗鲁棒性是深度学习中可靠可解释性的基本要求,但现有方法通常依赖计算昂贵的显式正则化。在这项工作中,我们表明归因鲁棒性可以从标准随机梯度下降的学习动态中隐式产生。我们通过参数空间和输入空间曲率之间的联系从理论上论证了这种效应,并在各种架构、数据集和归因方法上进行了验证,计算开销可忽略不计。相反,我们证明由于固有的熵约束,这种鲁棒性提升通常不会转移到softmax归一化下的注意力归因,并通过实验验证了这一局限性。最后,我们表明用基于核的注意力替换softmax注意力可以恢复Transformer模型中的鲁棒性提升。我们的结果突出了学习动态作为鲁棒可解释性的一种原则性且实用的机制,并揭示了归一化下注意力归因的基本局限性。

英文摘要

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Genetically Aligned Patient Representations Improve Hematological Diagnosis

基因对齐的患者表示改善血液学诊断

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

发表机构 * Institute of AI for Health, Helmholtz Munich, Germany International School of Medicine, Istanbul Medipol University, Türkiye Munich Leukemia Laboratory, Germany Department of Medicine III, Ludwig-Maximilian-University Hospital, Germany Department of Physics, University of Munich, Germany Munich Center for Machine Learning (MCML), Germany DKTK, German Cancer Consortium, Germany

AI总结 提出一种两阶段框架,通过自监督视觉预训练和监督对比学习对齐白细胞图像与染色体畸变及体细胞突变,提升血液学诊断性能。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液学细胞学的独特之处在于,视觉单细胞评估通常与细胞遗传学和分子遗传学相结合用于血癌诊断。在本研究中,我们提出了一个框架,将单个白细胞图像与染色体畸变(核型)以及来自靶向基因面板的体细胞突变对齐。我们的训练策略采用两阶段方法:(i)在超过1500名患者的队列上,使用iBOT头进行自监督、仅视觉的Transformer聚合器预训练;(ii)通过急性髓系白血病患者的监督对比损失进行基因对齐。我们的基因对齐患者编码器改善了血液学诊断任务,优于切片级组织病理学基础模型。此外,该模型为疾病和遗传改变提供了即用型检索能力。将遗传数据纳入患者编码器提高了患者表示的质量,提供了一个与临床诊断工作流程对齐的框架,并为未来的多模态血液学特定AI铺平了道路。代码和模型权重可在https://github.com/marrlab/GenBloom获取。

英文摘要

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29954 2026-05-29 cs.CV 版本更新

SwInception -- Local Attention Meets Convolutions

SwInception -- 局部注意力与卷积的结合

David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindström, Fredrik Kahl, Lennart Svensson

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) Zenseact(Zenseact公司)

AI总结 提出SwInception架构,通过在Swin Transformer的前馈层引入Inception块增强归纳偏置,并改进解码器以更少参数捕捉细节,在多个医学数据集上提升分割性能。

Comments International Conference on Pattern Recognition and Artificial Intelligence, 2024

详情
AI中文摘要

稀疏视觉变换器作为医学体积分割的高效编码器已广受欢迎,其中Swin成为突出选择。Swin使用局部注意力降低复杂度,在许多任务上表现优异,但仍倾向于在小数据集上过拟合。为缓解这一弱点,我们提出了一种新颖架构,通过在前馈层引入Inception块进一步增强Swin的归纳偏置。这些多分支卷积的引入使得在变换器块内能够更直接地对局部多尺度特征进行推理。我们还修改了解码器层,以使用更少的参数捕捉更精细的细节。通过大量实验,我们在十一个不同的医学数据集上展示了性能提升。我们特别展示了在医学分割十项全能(Medical Segmentation Decathlon)和颅穹窿外(Beyond the Cranial Vault)等基准挑战中,相较于先前最先进骨干网络的进步。通过证明Swin中现有的归纳偏置可以进一步改进,我们的工作为增强稀疏视觉变换器在医学和自然图像分割任务中的能力提供了一条有前景的途径。代码和预训练权重可在 https://github.com/Eiphodos/SwInception 获取。

英文摘要

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.

2605.29953 2026-05-29 cs.CV 版本更新

Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball

网格感知的对极匹配用于篮球多视角多人3D姿态估计

Li Yin, Qin Haobin, Tomohiro Suzuki, Calvin Yeung, Mariko Isogawa, Keisuke Fujii

发表机构 * RIKEN Center for Advanced Intelligence Project(RIKEN先进情报项目中心)

AI总结 提出一种无训练框架MAEM,通过单目3D人体网格恢复模型和两阶段对极匹配策略,解决团队运动场景中多视角多人3D姿态估计的遮挡和外观相似问题。

详情
AI中文摘要

团队运动场景中的多视角多人3D姿态估计因球员遮挡、队服造成的外观相似性以及标注多视角数据的稀缺而仍然具有挑战性,这些因素限制了基于学习方法的有效性和泛化能力。相比之下,无训练方法的性能固有地受限于2D关键点检测的准确性和跨视角关联的鲁棒性。为应对这些挑战,我们提出了网格感知的对极匹配(MAEM),一种用于多视角多人3D姿态估计的无训练框架。我们的方法采用单目3D人体网格恢复模型作为前端,并基于恢复的网格输出引入了一种两阶段对极匹配策略。具体而言,所提出的框架结合了基于并查集的聚类与每关节三角测量,以实现鲁棒的跨视角关联和准确的3D姿态重建。在两个公开的多视角篮球数据集上的实验表明,MAEM持续优于现有的无训练关联基线,同时在室内和室外篮球场景中实现了有竞争力的仅RGB性能。MAEM在SportCenter EPFL上达到MPJPE/PA-MPJPE分数59.8/40.7毫米,在Human-M3 Basketball上达到74.0/51.8毫米,突显了密集网格几何在无需目标域训练或微调的情况下进行跨视角关联的有效性。

英文摘要

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
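The abstract names disjoint-set-union-based clustering as the mechanism for cross-view association. As a rough illustration only, the sketch below groups per-view detections into identities with a generic union-find structure. The function `cluster_detections`, the integer detection indices, and the `matches` list of already-accepted cross-view pairs are hypothetical; how MAEM scores and accepts epipolar matches is not described in the abstract and is not modelled here.

```python
class DisjointSet:
    """Generic union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        # Walk to the root, shortening the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_detections(num_detections, matches):
    """Group detection indices into identities.

    `matches` is a list of (i, j) pairs judged to show the same person
    in two different views (assumed to be given; hypothetical input).
    """
    dsu = DisjointSet(num_detections)
    for a, b in matches:
        dsu.union(a, b)
    groups = {}
    for i in range(num_detections):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())
```

For example, with five detections and accepted matches (0, 2) and (2, 4), detections 0, 2 and 4 end up in one cluster, while 1 and 3 remain singletons. Each multi-view cluster could then be passed to a per-joint triangulation step.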

2605.29935 2026-05-29 cs.CV cs.AI 版本更新

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

CityGen: 结构引导的城市风格合成用于跨城市自动驾驶

Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

发表机构 * Jiangsu Cytoderm Intelligent Technology Co., Ltd., China(江苏细胞膜智能科技有限公司,中国) Xi'an Jiaotong University, Xi'an, China(西安交通大学,中国) Tsinghua University, Beijing, China(清华大学,中国) University of Science and Technology of China, Hefei, China(中国科学技术大学,中国)

AI总结 提出CityGen,一种基于扩散模型的生成框架,通过高清地图条件和城市级视觉提示实现零标签城市适应,提升跨城市自动驾驶在感知、分割和规划任务上的鲁棒性。

详情
AI中文摘要

自动驾驶系统通常在有限的地理区域内进行训练和评估,这阻碍了它们在新城市部署时的可扩展性。然而,外观、道路拓扑和交通模式的显著域偏移常常导致跨城市部署时性能严重下降。现有的基于域适应、数据增强或合成数据生成的方法通常依赖于标注的目标数据、城市特定的标注或任务特定的设计,限制了它们在整体评估中的可扩展性和有效性。在本文中,我们引入了CityTransfer-Bench,一个地理上不重叠的基准,用于评估跨城市泛化在感知、分割和规划任务上的表现,并提出了CityGen,一个基于扩散的生成框架,通过城市级视觉提示引导的高清地图条件合成实现零标签城市适应。大量实验表明,CityGen在多个任务上持续提高了跨城市鲁棒性,为可泛化的自动驾驶建立了可扩展且标签高效的基石。

英文摘要

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

2605.29932 2026-05-29 cs.LG cs.CV 版本更新

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

治疗条件扩散用于预测神经退行性疾病进展

Danylo Boiko, Viktoriia Mishkurova

发表机构 * Innoloft Inc.(Innoloft公司) Bogomolets National Medical University(博戈莫列茨国家医学大学)

AI总结 提出一种治疗条件扩散框架,通过条件化生成过程于患者的筛查DaTscan图像和一年内左旋多巴等效日剂量,预测高保真未来脑状态,在临床保真度上显著优于基线。

Comments 9 pages, 5 figures, 1 table

详情
AI中文摘要

预测帕金森病等神经退行性疾病的进展对于有效的长期规划和个性化治疗干预至关重要。现有系统通常产生忽略纵向神经影像丰富结构的标量临床评分,而传统生成方法则遭受解剖细节丢失和细微进展模式模糊的问题。为此,我们引入了一种新颖的治疗条件扩散框架,通过将生成过程条件化于患者的筛查DaTscan图像和一年内左旋多巴等效日剂量,预测高保真的未来脑状态。该流程使用基于Transformer的编码器表示非线性、时间依赖的药理学动态,并通过一个关注生物关键区域的多权重感兴趣区域掩码优化生成。实验评估表明,我们的框架保持了清晰的解剖边界,并在临床保真度上显著优于基线,实现了MSE降低14.0%,MAE降低7.2%,SSIM提高4.9%。

英文摘要

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches suffer from a loss of anatomical detail and blurring of subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.

2605.29911 2026-05-29 cs.LG cs.CV 版本更新

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

通过逐像素生成图像插值减少空间推进薄膜冷却分析中的实验测试

Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

发表机构 * Heilbronn University of Applied Sciences(海尔布隆应用科学大学) Center for Machine Learning(机器学习中心) Max-Planck-Str. 39(马克斯-普朗克街39号) German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Space Propulsion(空间推进研究所)

AI总结 提出一种基于轻量级前馈神经网络和位置编码的机器学习方法,从稀疏实验测量中进行图像回归,以减少推进系统薄膜冷却研究中的物理测试需求。

Comments Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: 10.13009/EUCASS2025-285

详情
AI中文摘要

我们提出了一种从稀疏实验测量中进行图像回归的机器学习方法。我们展示了该方法在推进系统开发中薄膜冷却研究中的应用,旨在减少对大量物理测试的需求。我们的方法采用带有位置编码的轻量级前馈神经网络,根据输入参数生成图像。在真实和合成数据上的验证表明,该方法在减少30%测量量的同时,实现了高图像相似度(RMSE < 8%,SSIM > 93%)。我们进一步提出了一种知识驱动的扩展,用于生成图像的局部适应性。该方法显著减少了所需测试次数,同时保持了高质量数据,从而能够高效优化冷却剂喷射器配置,其应用范围超越航空航天领域。

英文摘要

We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method on film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction in measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
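The abstract mentions a positional encoding feeding a lightweight feed-forward network but does not state its form. The sketch below shows one common choice, a sinusoidal (Fourier-feature) encoding of a normalized scalar such as a pixel coordinate or an operating parameter. The function name, the power-of-two frequency schedule, and the default `num_freqs` are assumptions for illustration, not the authors' configuration.

```python
import math


def positional_encoding(x, num_freqs=4):
    """Map a scalar x (assumed normalized to [0, 1]) to
    [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0 .. num_freqs - 1.

    The frequency schedule is an assumed, commonly used choice.
    """
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features
```

Encoding each pixel coordinate and each input parameter this way and concatenating the results gives a feature vector of length 2 * num_freqs per scalar, which a small pixelwise network could map to an intensity value.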

2605.29894 2026-05-29 cs.CV 版本更新

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

训练智能体而非专家:学习利用异构专家进行多轮视觉推理

Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

发表机构 * Sun Yat-sen University(中山大学) HKUST(香港科技大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出VisHarness,一种可训练的视觉智能体,通过解耦高层感知推理与低层任务执行,学习利用异构视觉专家模型,以轻量训练实现多轮交互下的通用视觉任务求解。

详情
AI中文摘要

计算机视觉的最新进展产生了大量用于检测、分割、计数和其他视觉任务的强大专用模型。然而,这些模型通常针对孤立的任务形式进行优化,使得直接支持通用视觉智能变得困难,尤其是当任务需要复杂的语言理解和密集的小物体感知时。在本文中,我们提出了VisHarness,一种可训练的视觉智能体,它将高层感知、推理和决策与低层任务执行解耦。VisHarness不是训练模型来解决特定的视觉任务,而是学习利用一组精心设计的异构视觉专家。这种范式保留了智能体的通用智能,同时充分利用了专用视觉模型在具体视觉任务中的精度优势。仅通过轻量训练,VisHarness就能学习到可泛化的视觉专家利用策略,并通过与视觉专家模型的多轮交互,在各种复杂条件下解决常见的基础视觉任务。为了在实时环境中实现高效的在策略强化学习训练,我们引入了动态视觉记忆归档,这缓解了与视觉专家模型多轮交互导致的快速累积的视觉令牌开销。在涵盖推理分割、广义指代分割、密集小物体检测和指代计数的四个代表性基准上的实验表明,VisHarness显著优于现有的通用模型,并与任务专用模型相比取得了具有竞争力或更优的性能。

英文摘要

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.

2605.29891 2026-05-29 cs.CV 版本更新

DVSM: Decoder-only View Synthesis Model Done Right

DVSM: 正确的仅解码器视图合成模型

Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang

发表机构 * NVIDIA National Taiwan University(国立台湾大学)

AI总结 提出仅解码器架构DVSM,通过隐式KV-cache表示场景,在相同渲染复杂度下以更少参数超越编码器-解码器变体,并利用共享权重、基础模型先验和分阶段块大小优化效率与质量,在多个基准上实现新视点合成的最优结果。

Comments Code at https://github.com/NVLabs/dvsm

详情
AI中文摘要

近期的大型视图合成模型(LVSMs)倡导一种编码器-解码器架构,将重建和渲染分离到不同的网络中。我们重新审视了这种设计。通过控制实验,我们表明仅解码器架构(将场景隐式表示为KV-cache)在相同渲染复杂度下使用更少参数,性能优于编码器-解码器变体。进一步分析表明,在颜色输入重建网络和仅相机渲染网络之间共享权重,能更好地对齐同一视点下的特征,从而促进图像合成。基于这一发现,我们的模型DVSM进一步结合了基础模型先验和分阶段块大小调整,以改进效率与质量的权衡。我们的结果在多个基准上为新颖视图合成设立了新的最先进水平,在某些情况下,甚至在密集输入视图下优于每场景优化的3DGS。

英文摘要

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.

2605.29881 2026-05-29 cs.CV cs.AI 版本更新

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

通过屏障调控自适应闭式引导缓解视觉语言模型中的幻觉

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati(印度理工学院古瓦哈提分校)

AI总结 提出BRACS框架,通过监测视觉注意力并仅在接地退化时进行闭式修正,无需训练即可有效减少LVLM中的物体幻觉。

详情
AI中文摘要

大型视觉语言模型(LVLMs)经常幻觉出输入图像中不存在的物体,这主要是因为随着解码进行,视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态,但它们存在三个关键限制:缺乏明确的接地目标,即使在模型已经良好接地时也进行干预,以及使用固定的修正强度,无法适应接地失败的严重程度。我们提出BRACS(屏障调控自适应闭式引导),一种无需训练的引导框架,通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地,并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算,无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明,BRACS在幻觉基准上持续优于先前方法,将CHAIR$_s$降低9.4个点,将POPE F1提高2.7个点,同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效,运行速度为贪心解码吞吐量的80%,平均速度比基线快1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.

2605.29868 2026-05-29 cs.CR cs.CV cs.DC 版本更新

Ciphera: A Decentralised Biometric Identity Framework

Ciphera: 一种去中心化的生物特征身份框架

Ankit Kanaiyalal Prajapati, Shahzad Memon, Mohammed Mahir Rahman, Ameer Al-Nemrat

发表机构 * University of East London(东伦敦大学)

AI总结 提出Ciphera框架,结合隐私保护面部识别、多节点验证、IPFS凭证元数据存储和区块链锚定撤销,实现去中心化生物特征身份管理,并通过功能、性能、安全性和分布式一致性评估验证其可行性。

Comments Accepted at the CyberAI 2026 Conference, and to be indexed at IEEE-Scopus

详情
Journal ref
CyberAI 2026 (https://cyberai-conf.org/)
AI中文摘要

中心化的生物特征身份系统使用户面临单点故障、不透明的验证过程以及不可逆的生物特征泄露风险。去中心化标识符(DID)和可验证凭证(VC)提供了更强的隐私保障,但它们与生物特征认证和分布式验证的整合仍未被充分探索。本文提出了Ciphera,一个去中心化的生物特征身份框架,结合了隐私保护的面部识别、多节点验证、基于IPFS的凭证元数据存储和区块链锚定的撤销。在功能、性能、安全性和分布式一致性维度上评估,Ciphera实现了81%的功能成功率,具有稳定的注册和认证,但存在可测量的撤销传播延迟和偶尔的审计日志不一致。性能测试显示,在并发多节点条件下,p95验证延迟约为820毫秒,低于1秒。安全性分析确认了强大的机密性和完整性保证,但不完整的活体检测使其容易受到深度伪造和重放攻击。结果证明了去中心化生物特征身份的可行性,同时指出了生产级部署的关键工程挑战。

英文摘要

Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment.
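The abstract reports a p95 verification latency of roughly 820 ms. As a reminder of what such a figure means, the sketch below computes a percentile from raw latency samples using the nearest-rank definition. The function name and the choice of the nearest-rank method are assumptions; the abstract does not say how the percentile was computed.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) of `samples`
    using the nearest-rank definition."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank covering at least p% of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]
```

Under this definition, a p95 of 820 ms would mean that at least 95% of the measured verifications completed in 820 ms or less.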

2605.29858 2026-05-29 cs.CV 版本更新

Masked Diffusion Vision-Language Models for Temporal Action Localization

用于时序动作定位的掩码扩散视觉语言模型

Fengshun Wang, Zhengbo Zhang, Zhigang Tu

发表机构 * Wuhan University(武汉大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 提出掩码扩散视觉语言模型(MDVLM)用于时序动作定位,通过双向注意力迭代去噪联合优化语义和边界,并引入边界感知掩码和步级IoU奖励解决训练不匹配问题。

详情
AI中文摘要

时序动作定位(TAL)需要在未修剪视频中识别目标事件并精确定位其开始和结束时间。最近的视觉语言公式改进了语义推理并支持语言条件输出,但其自回归解码器仍然从左到右生成令牌,阻止了后续语义证据修正早期时间戳预测。我们将掩码扩散视觉语言模型(MDVLM)适配到TAL,使得语义令牌和边界令牌在具有双向注意力的迭代去噪过程中保持可编辑,从而允许时间边界和语义内容共同细化。然而,直接适配会产生两个TAL特定的不匹配:标准掩码扩散训练随机均匀地破坏所有位置,但时间令牌在有足够语义上下文时更可靠;令牌级交叉熵不反映时序IoU。为了解决这些不匹配,我们引入了一个计划训练目标,该目标使用边界感知掩码和步加权重构来排练时间令牌的后期恢复,同时引入步级IoU奖励,在去噪过程中提供重叠感知监督。标准序列级交叉熵项提供基础重构信号。在ActivityNet-RTL、ActivityNet-1.3和THUMOS-14上的实验表明,MDVLM-TAL在时序推理和边界定位方面均优于自回归视觉语言基线,在更严格的时序IoU标准下尤其显著。

英文摘要

Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, preventing later semantic evidence from revising earlier timestamp predictions. We adapt masked diffusion vision-language models (MDVLMs) to TAL so that semantic tokens and boundary tokens remain editable throughout iterative denoising with bidirectional attention, allowing temporal boundaries and semantic content to be refined jointly. Direct adaptation, however, creates two TAL-specific mismatches: standard masked diffusion training corrupts all positions uniformly at random, but the time tokens are more reliable when enough semantic context is available; and token-level cross-entropy does not reflect temporal IoU. To address these mismatches, we introduce a Planned Training Objective that uses boundary-aware masking and step-weighted reconstruction to rehearse the late recovery of time tokens, together with a Step-Level IoU Reward that provides overlap-aware supervision during denoising. A standard sequence-level cross-entropy term provides the base reconstruction signal. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL improves both temporal reasoning and boundary localization over autoregressive vision-language baselines, with especially strong gains under stricter temporal IoU criteria.
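Temporal IoU is the overlap measure the abstract refers to, both for evaluation and for the step-level reward. The sketch below is the standard definition for two time intervals. The function name and the (start, end) tuple convention are illustrative, and the way the paper turns this quantity into a reward is not reproduced here.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals.

    Each interval is a (start, end) pair in seconds with start <= end.
    """
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    # Two zero-length intervals have no defined overlap; report 0.
    return intersection / union if union > 0 else 0.0
```

A prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 seconds out of a 15-second union, giving a temporal IoU of 1/3. It would therefore count as a miss at a threshold of 0.5.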

2605.29856 2026-05-29 cs.CV 版本更新

Building and Road Recognition in Dense Urban Informal Settlements: A Dataset and Benchmark

密集城市非正规住区中的建筑与道路识别:数据集与基准

Hongyu Long, Jiaxuan Liu, Rui Cao

发表机构 * Guangdong Provincial Project(广东省项目) Guangzhou-HKUST(GZ) Joint Funding Program(广州-香港科技大学联合基金计划) AI Research and Learning Base of Urban Culture Project(城市文化AI研究与学习基地项目)

AI总结 针对城市村等高密度非正规住区缺乏精细标注数据的问题,构建了首个高分辨率遥感数据集DenseUIS,并评估了现有深度学习模型,揭示了其在处理密集非正规形态上的局限性,为复杂高密度环境下的精细城市制图提供了基准。

Comments 5 pages, 4 figures

详情
AI中文摘要

作为一种普遍存在的非正规住区形式,城中村对可持续城市发展和治理提出了重大挑战。精确绘制其基础设施至关重要,然而,现有的遥感数据集主要关注正规城市环境,缺乏针对城中村典型的高密度建筑模式和狭窄道路网络的精细标注数据。为填补这一空白,我们引入了DenseUIS数据集,这是首个专门用于极度密集城市非正规住区中建筑和道路提取的高分辨率遥感数据集,覆盖了中国深圳和广州的126个城中村。此外,我们对该数据集上的最先进深度学习模型进行了全面评估。实验结果表明,现有方法在处理密集非正规住区的独特形态模式方面存在局限性,凸显了对专门方法的需求。因此,DenseUIS为推进复杂高密度非正规环境中的精细城市制图提供了一个稳健的基准。该数据集公开于https://github.com/rui-research/DenseUIS。

英文摘要

As a widespread form of informal settlement, urban villages present significant challenges for sustainable urban development and governance. Precise mapping of their infrastructure is essential; however, existing remote sensing datasets primarily focus on formal urban environments and lack fine-grained annotated data for the high-density building patterns and narrow road networks typical of urban villages. To address this gap, we introduce the DenseUIS dataset, the first high-resolution remote sensing dataset specifically designed for building and road extraction in extremely dense urban informal settlements, covering 126 urban villages across Shenzhen and Guangzhou in China. Furthermore, we conduct a comprehensive evaluation of state-of-the-art deep learning models on this dataset. Experimental results reveal the limitations of existing methods in handling the unique morphological patterns of dense informal settlements, underscoring the need for specialized approaches. DenseUIS therefore provides a robust benchmark for advancing fine-grained urban mapping in complex and high-density informal environments. The dataset is publicly available at https://github.com/rui-research/DenseUIS.

2605.29827 2026-05-29 cs.CV 版本更新

Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging

超越人口统计学的公平性:在医学影像中优化基于外观的隐藏队列性能

Milad Masroor, Cuong Nguyen, Kevin Wells, Gustavo Carneiro

发表机构 * Centre for Vision, Speech and Signal Processing(视觉、语音和信号处理中心)

AI总结 提出无标签隐藏队列公平性(LHCF)训练范式,通过聚类图像为外观队列并优化其公平性,解决医学影像模型在人口统计学属性上的性能差异问题。

Comments Pre-review version submitted to MICCAI 2026. 10 pages, 5 figures

详情
AI中文摘要

医学图像分析模型可能在患者子组间表现出性能差异,威胁临床安全性和公平性。现有方法通常通过优化可见人口统计学属性(如性别或年龄)的准确性和公平性指标来解决这一问题,但这些属性是孤立考虑的。这种策略不仅忽略了可能更具信息量的潜在分层(这些分层可能揭示模型错误和不平等的更深层来源),而且当同时考虑多个人口统计学属性时,由于每个子组内训练数据的稀疏性,该方法无法扩展。我们通过引入无标签隐藏队列公平性(LHCF)训练范式来处理这些问题,该范式不是最大化可见人口统计学属性的公平性,而是优化从图像外观发现的潜在子群体的公平性。通过将图像聚类为K个基于外观的队列并对其应用公平性优化,LHCF揭示了模型错误的潜在来源,避免了多人口统计学属性的组合稀疏性,减少了单个和多个人口统计学属性上的差异。我们在我们提出的公平性基准HIDFairBench上证明,尽管从未使用人口统计学标签进行训练,LHCF在单个和多个人口统计学属性上提供了最先进的公平性结果。我们的结果将隐藏队列公平性定位为基于人口统计学的公平性优化的实用、可扩展且稳健的替代方案,用于可信的医学图像分析。

英文摘要

Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously due to the resulting sparsity of training data within each subgroup. We address these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm, which, instead of maximizing fairness over visible demographic attributes, optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.

2605.29812 2026-05-29 cs.CV 版本更新

Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language

并非所有输入都有效:面向开放集视频时刻检索的语言方法

Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, Yu Cheng

发表机构 * Huazhong University of Science and Technology(华中科技大学) Peking University(北京大学) Zhejiang Gongshang University(浙江工商大学) Dalian University of Technology(大连理工大学) Shanghai Jiao Tong University(上海交通大学) Xinjiang University(新疆大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 针对开放集场景下视频时刻检索任务中无关查询导致错误检索的问题,提出基于归一化流的开放集视频时刻检索模型OpenVMR,实现分布内查询的精确检索与分布外查询的拒绝。

Comments Published in ACM MM 2024

详情
AI中文摘要

视频时刻检索(VMR)旨在从未修剪的视频中检索与句子查询对应的特定时刻。尽管近期工作在该任务上取得了显著进展,但它们隐含地基于封闭集假设,即所有给定查询都与视频相关(注:在本文中,我们将“视频相关查询”视为“分布内(ID)查询”,将“视频无关查询”视为“分布外(OOD)查询”)。在开放集场景中,给定OOD查询时,它们仍会用于错误检索,这可能在高风险场景(例如犯罪活动检测)中导致不可挽回的损失。为此,我们创造性地探索了一种全新的VMR设置,称为开放集视频时刻检索(OS-VMR),其中我们不仅应基于ID查询检索精确时刻,还应拒绝OOD查询。在本文中,我们首次尝试迈向OS-VMR,并提出了一种新颖模型OpenVMR,该模型首先基于归一化流技术区分ID和OOD查询,然后基于ID查询进行时刻检索。具体而言,我们首先通过构建归一化流学习ID分布,并假设ID查询分布服从多元高斯分布。然后,我们引入不确定性分数来搜索ID-OOD分离边界。之后,通过拉近ID查询特征来细化ID-OOD边界。此外,分别设计了视频-查询匹配和帧-查询匹配用于粗粒度和细粒度的跨模态交互。最后,引入正-无标签学习模块用于时刻检索。在三个VMR数据集上的实验结果表明了我们的OpenVMR的有效性。

英文摘要

Video Moment Retrieval (VMR) aims to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent works have made remarkable progress on this task, they are implicitly rooted in the closed-set assumption that all given queries are video-relevant. (In this paper, a "video-relevant query" is treated as an "in-distribution (ID) query" and a "video-irrelevant query" as an "out-of-distribution (OOD) query".) Given an OOD query in open-set scenarios, they still use it for retrieval and return a wrong moment, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we creatively explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on ID queries, but also reject OOD queries. In this paper, we make the first attempt to step toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID and OOD queries based on the normalizing flow technique, and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, and assume the ID query distribution follows a multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling together ID query features. Besides, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of our OpenVMR.
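The reject-or-retrieve step described here can be sketched as a likelihood threshold. The sketch below is a simplification under loud assumptions: it skips the normalizing flow entirely and fits a diagonal Gaussian directly to ID query features, and the function names, the negative log-likelihood score, and the `keep` quantile are hypothetical choices, not details taken from the paper.

```python
import math

def fit_diag_gaussian(feats):
    """Per-dimension mean and variance of ID query features (stand-in for the flow's Gaussian)."""
    n, d = len(feats), len(feats[0])
    mean = [sum(f[j] for f in feats) / n for j in range(d)]
    var = [max(sum((f[j] - mean[j]) ** 2 for f in feats) / n, 1e-6) for j in range(d)]
    return mean, var

def uncertainty(x, mean, var):
    """Negative log-likelihood under the diagonal Gaussian; higher means more OOD-like."""
    return 0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                     for xi, m, v in zip(x, mean, var))

def fit_threshold(id_feats, mean, var, keep=0.95):
    """Place the ID-OOD boundary at the `keep` quantile of ID uncertainty scores."""
    scores = sorted(uncertainty(f, mean, var) for f in id_feats)
    return scores[min(len(scores) - 1, int(keep * len(scores)))]

def accept_query(x, mean, var, threshold):
    """True: treat the query as ID and run retrieval. False: reject it as OOD."""
    return uncertainty(x, mean, var) <= threshold
```

A query that is accepted would then be passed to the moment-retrieval model; a rejected query returns no moment at all, which is the behaviour the closed-set methods lack.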

2605.29809 2026-05-29 cs.CR cs.CV cs.GR cs.LG cs.MM 版本更新

Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing

Cert-LAS:通过层自适应平滑实现文本到图像扩散模型的认证模型所有权验证

Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu, Dacheng Tao

发表机构 * Generative AI Lab, College of Computing and Data Science, Nanyang Technological University, Singapore Department of Computer Science and Engineering, Texas A&M University, USA

AI总结 提出Cert-LAS方法,基于层自适应平滑和扩散分类器嵌入水印,通过假设检验验证模型所有权,并证明在恶意移除攻击下仍能可靠验证。

Comments This paper has been accepted to the International Conference on Machine Learning (ICML) 2026. 26 pages

详情
AI中文摘要

大规模文本到图像(T2I)扩散模型实现了前所未有的创意应用,但其未经授权的使用引发了严重的知识产权问题,使得模型所有权验证(MOV)日益关键。我们发现现有的基于后门的扩散水印方法通常(隐式地)假设一个“忠实”的验证过程,即验证者可以查询可疑模型并获得忠实的水印响应以完成MOV。然而,在实践中,攻击者可能有意或无意地破坏潜在的水印信号,显著降低验证可靠性。为解决此问题,我们提出Cert-LAS,首个基于层自适应平滑的T2I模型认证MOV方法。通常,Cert-LAS使用扩散分类器和LFS引导的层自适应噪声嵌入指定水印,并通过假设检验检查可疑模型是否表现出比无水印参考显著更强的水印响应来验证所有权。我们进一步证明,在特定条件下,即使存在恶意移除攻击,我们的Cert-LAS仍能实现可靠验证。大量实验验证了Cert-LAS的有效性及其对自适应攻击的抵抗力。我们的代码可在https://github.com/Leyi-Qi/Cert-LAS获取。

英文摘要

Large-scale text-to-image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use has raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor-based diffusion watermarking methods often (implicitly) assume a "faithful" verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert-LAS, the first certified MOV method for T2I models based on layer-adaptive smoothing. In general, Cert-LAS embeds specified watermarks using diffusion classifiers and an LFS-guided layer-adaptive noise, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert-LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert-LAS and its resistance to adaptive attacks. Our code is available at https://github.com/Leyi-Qi/Cert-LAS.
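The verification step, "significantly stronger watermark responses compared to unwatermarked references through hypothesis testing", can be illustrated with a one-sided permutation test. The abstract does not say which test or statistic Cert-LAS uses, so the difference-of-means statistic, the permutation scheme, the function names, and the 0.05 level below are assumptions made purely for illustration.

```python
import random

def _mean(xs):
    return sum(xs) / len(xs)

def permutation_p_value(suspect, reference, n_perm=2000, seed=0):
    """One-sided p-value for H1: suspect-model watermark responses exceed reference responses."""
    rng = random.Random(seed)
    observed = _mean(suspect) - _mean(reference)
    pooled = list(suspect) + list(reference)
    n = len(suspect)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Count label shufflings that look at least as extreme as the real split.
        if _mean(pooled[:n]) - _mean(pooled[n:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p strictly positive

def claims_ownership(suspect, reference, alpha=0.05):
    """Reject H0 (no watermark) when the p-value falls below alpha."""
    return permutation_p_value(suspect, reference) < alpha
```

Here `suspect` and `reference` would be watermark response scores collected from the suspected model and from unwatermarked reference models respectively; how those scores are produced is outside this sketch.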

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG 版本更新

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5:一种轻量级且可扩展的AI智能体安全与安保对齐框架

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对开放世界智能体的新兴安全风险,提出一种轻量级可扩展的安全对齐框架,通过更新安全分类法、构建数据引擎并训练小模型(0.8B-8B参数),实现与闭源模型相当的性能,并降低部署开销两个数量级。

Comments 44 pages, 12 Figures, 9 Tables

详情
AI中文摘要

现代开放世界智能体(如OpenClaw)展现出强大的跨环境执行能力,但同时也引入了广泛的新安全风险源。同时,先进的前沿AI模型大幅降低了攻击门槛,使得当前的智能体对齐框架不足以应对实际部署。为了应对这些新兴威胁,我们提出了一种轻量级且可扩展的智能体安全对齐框架。具体而言,我们更新了智能体安全分类法,以涵盖来自Codex和OpenClaw执行场景的新兴风险。我们进一步构建了一个基于分类法指导的数据引擎,并采用影响函数净化,仅使用约1k样本训练轻量级AgentDoG 1.5变体(0.8B、2B、4B和8B参数),达到了与领先闭源模型(如GPT-5.4)相当的性能。基于AgentDoG 1.5,我们构建了一个高效的智能体安全SFT和RL训练环境,将Docker级环境的部署开销降低了两个数量级。最后,我们将AgentDoG 1.5部署为无需训练的在线护栏,用于实时安全审核。大量实验结果表明,AgentDoG 1.5在多样且复杂的交互式智能体场景中达到了最先进的性能。所有模型和数据集均已公开发布。

英文摘要

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new sources of safety risk. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29798 2026-05-29 cs.CV cond-mat.mtrl-sci eess.IV 版本更新

Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina

低倍率SEM可能足够:用于氧化锆增韧氧化铝多尺度断裂原因分类的可解释深度学习

Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz

发表机构 * CeramTec GmbH(CeramTec公司) School of Life Sciences, University of Applied Sciences(应用科学与艺术北瑞士学院生命科学学院)

AI总结 提出一种可解释的视觉变换器工作流,利用低倍率SEM图像对氧化铝基复合材料植入物断裂原因进行自动分类,达到与高倍率相当的准确率。

详情
AI中文摘要

可靠识别氧化铝基复合材料髋关节和膝关节植入物的断裂起源对于质量保证和患者安全至关重要,然而当前的断口分析工作流程耗时、部分主观且依赖高倍率扫描电子显微镜(SEM)。我们提出了一种可解释的视觉变换器(ViT)工作流,用于对广泛用于全关节置换的氧化铝基复合材料(BIOLOX delta, CeramTec GmbH)的断裂原因进行自动分类。从五年的生产爆破和验证测试中整理了8,493张SEM图像(50倍至10,000倍)的数据集,并按照制造链定义的三个缺陷类别(生坯、硬加工和材料缺陷)进行标注。在严重的类别不平衡下,微调后的ViT在分层五折交叉验证中达到了0.907的准确率和0.888的宏F1分数,两阶段感知哈希/SSIM泄漏审计确认了样本重叠可忽略。值得注意的是,低倍率(50倍)下的性能与高倍率(1k-10k倍)相当,表明宏观特征——镜面几何和羽状纹线场——已经编码了足够的诊断信号。Grad-CAM归因一致地定位在经典的断口线索(镜面、羽状纹、孔隙、加工痕迹)上,与既定的断口分析标准一致。这些结果共同将可解释ViT定位为陶瓷植入物质量保证的补充工具,能够实现低倍率预筛选并减少对耗时的高倍率检查的依赖。

英文摘要

Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.

2605.29793 2026-05-29 cs.CV 版本更新

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

更少步骤,更优性能:基于语言的高效跨模态视频片段修剪用于视频时刻检索

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

发表机构 * Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology(湖北大数据安全工程研究中心,网络安全学院,华中科技大学) Peking University(北京大学) Henan University(河南大学) Dalian University of Technology(大连理工大学) Sichuan University(四川大学) Shenzhen University(深圳大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出SpotVMR方法,通过可学习的片段搜索模型和低成本语义索引特征,高效修剪查询相关视频片段,作为即插即用模块提升现有VMR方法的效率与性能。

Comments Published in AAAI 2024

详情
AI中文摘要

给定一个未修剪的视频和一个句子查询,基于语言的视频时刻检索(VMR)旨在定位目标查询相关的时刻。由于未修剪的视频过长,几乎所有现有的VMR方法首先将每个未修剪的视频稀疏下采样为多个固定长度的视频片段,然后与查询特征和昂贵的片段特征进行多模态交互以进行推理,这对于跨越数小时的长真实世界视频是不可行的。由于视频被下采样为固定长度的片段,一些与查询相关的帧可能被过滤掉,这将模糊目标时刻的特定边界,将相邻的不相关帧作为新边界,容易导致跨模态错位,并引入边界偏差和推理偏差。为此,在本文中,我们提出了一种高效的方法SpotVMR,用于修剪与查询相关的片段。此外,我们提出的SpotVMR可以作为即插即用模块,在保持良好检索性能的同时提高最先进VMR方法的效率。特别地,我们首先设计了一个新颖的片段搜索模型,该模型学习根据语言查询识别有希望的视频区域进行搜索。然后,我们引入一组低成本的语义索引特征来捕获对象和交互的上下文,这些上下文提示在哪里搜索查询相关的时刻。此外,利用蒸馏损失来解决片段选择器和VMR模型端到端联合训练中出现的优化问题。在三个具有挑战性的数据集上的大量实验证明了其有效性。

英文摘要

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since untrimmed videos are long, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions between the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Moreover, because the video is down-sampled into fixed-length clips, some query-related frames may be filtered out. This blurs the boundary of the target moment and causes adjacent irrelevant frames to be taken as new boundaries, which easily leads to cross-modal misalignment and introduces both boundary bias and reasoning bias. To this end, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. SpotVMR can also serve as a plug-and-play module that improves the efficiency of state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.

2605.29776 2026-05-29 cs.CV 版本更新

Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

通过打破尾部对齐改进CLIP适应:用于源无关跨域小样本学习

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

发表机构 * School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China(华中科技大学计算机科学与技术学院) Institute of Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, China(华中科技大学人工智能研究院)

AI总结 针对CLIP在跨域小样本学习中的性能下降问题,提出自适应尾头对齐策略(ATHA),通过有选择地削弱低相似度图像令牌的对齐来减少过拟合,在四个基准上取得最优结果。

Comments Accepted by ICML 2026

详情
AI中文摘要

视觉语言模型(如CLIP)展现出强大的零样本泛化能力,但在目标域训练数据稀缺的跨域场景(跨域小样本学习,CDFSL)中性能显著下降。本文聚焦于基于CLIP的CDFSL任务中的目标域小样本微调。现有的微调范式将所有图像块令牌与其对应的文本嵌入统一对齐。然而,我们发现一个反直觉的现象:主动将某些低相似度图像令牌(称为“尾部令牌”)推离其文本嵌入能持续提升目标域性能。我们深入探究这一现象并给出新的解释:在巨大的域偏移和稀缺的训练数据下,模型难以从视觉输入中提取语义信息;因此,常见的对齐信念仅对已包含足够语义信息的令牌有效;对于尾部令牌,强制对齐会导致对稀缺训练的过度过拟合,而打破对齐则更有用。受此启发,我们提出自适应尾头对齐(ATHA),一种新颖的CLIP微调策略,将传统的统一对齐范式转变为自适应对齐范式,同时包含对齐增强和削弱。在四个具有挑战性的CDFSL基准上的大量实验验证了我们的最先进性能。我们的代码可在 https://github.com/shuaiyi308/ATHA 获取。

英文摘要

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance degrades significantly in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing certain low-similarity image tokens, termed "tail tokens", away from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under large domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief in alignment is valid only for tokens that already contain sufficient semantic information. For tail tokens, forcing alignment leads to excessive overfitting to the scarce training data, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm into an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.
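The contrast between uniform and adaptive alignment can be made concrete with a toy loss. This is a hedged sketch, not ATHA itself: the abstract does not give the loss, so the cosine-similarity terms, the `tail_frac` cut-off, the `push_weight` factor, and the function names are hypothetical choices that only illustrate "strengthen alignment for head tokens, weaken it for tail tokens".

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def adaptive_alignment_loss(tokens, text, tail_frac=0.25, push_weight=0.5):
    """Return (loss, tail_indices) for one image's patch tokens and its text embedding.

    Head tokens are pulled toward the text embedding (term 1 - sim).
    Tail tokens, the lowest-similarity fraction, are pushed away (term push_weight * sim).
    With tail_frac=0 this reduces to uniform alignment of every token.
    """
    sims = [cosine(t, text) for t in tokens]
    order = sorted(range(len(sims)), key=lambda i: sims[i])
    tail = set(order[:int(tail_frac * len(sims))])
    total = 0.0
    for i, s in enumerate(sims):
        total += push_weight * s if i in tail else 1.0 - s
    return total / len(sims), sorted(tail)
```

In a real finetuning loop the same selection would be applied to differentiable similarities so that gradients pull head tokens in and push tail tokens out.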

2605.29773 2026-05-29 cs.CV cs.AI cs.RO 版本更新

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

能量感知NECO:用于语义分割中单次逐像素分布外检测

Boyuan Zhang, Huanshan Huang, Yifei Cao

发表机构 * Ecole Polytechnique, Institut Polytechnique de Paris(巴黎综合理工学院,巴黎理工学院) CIAD, UTBM, Université Marie et Louis Pasteur(CIAD、UTBM、玛丽与路易·巴斯德大学) U2IS, ENSTA, Institut Polytechnique de Paris(U2IS、ENSTA、巴黎理工学院)

AI总结 提出一种结合NECO几何比率和能量分数的混合方法,实现单次前向传播的逐像素分布外检测,在miniMUAD数据集上AUROC达0.8539,优于单独使用NECO或能量分数。

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

详情
AI中文摘要

移动机器人的可靠语义分割需要准确的密集预测和分布偏移下的鲁棒不确定性估计。强不确定性基线如蒙特卡洛Dropout通常需要重复的随机前向传播,难以在边缘平台上部署。我们提出能量感知NECO,一种用于语义分割的单次逐像素分布外(OOD)检测器。该方法将从解码器特征计算的居中NECO风格几何比率与基于logit的能量分数相结合。两个分量均使用在纯分布内验证集上拟合的统计量进行标准化,并通过凸组合融合。我们在miniMUAD子集上使用真实像素级OOD标签评估该方法。所提出的混合分数达到0.8539的AUROC,优于仅NECO(0.8280)、仅能量(0.8171)和集成预测熵基线(0.8124)。额外的定性和操作点分析表明,混合检测器在保持单次设计效率优势的同时,提高了整体排名性能。代码可在https://github.com/boyuan-zhangx/Energy-Aware_NECO获取。

英文摘要

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
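The "standardize, then fuse through a convex combination" step can be sketched per pixel as follows. Several things here are assumptions rather than facts from the abstract: the NECO-style ratio is taken as a precomputed input, a low ratio is assumed to indicate OOD (hence the sign flip), the energy score uses the common negative log-sum-exp form, and `alpha` and the function names are hypothetical.

```python
import math

def energy_score(logits):
    """Negative log-sum-exp of the logits (numerically stable); higher means more OOD-like."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def fit_stats(values):
    """Mean and standard deviation fitted on in-distribution validation values."""
    mu = sum(values) / len(values)
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, max(sd, 1e-8)

def fused_ood_score(neco_ratio, logits, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of two z-scores for one pixel; higher means more OOD-like."""
    z_neco = -(neco_ratio - neco_stats[0]) / neco_stats[1]  # assumption: low ratio signals OOD
    z_energy = (energy_score(logits) - energy_stats[0]) / energy_stats[1]
    return alpha * z_neco + (1.0 - alpha) * z_energy
```

Both inputs come from a single forward pass (decoder features for the ratio, logits for the energy), which is what keeps the detector single-pass.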

2605.29762 2026-05-29 cs.CV 版本更新

GeoMag: Geometric-Aware Video Motion Magnification via State Space Model

GeoMag: 基于状态空间模型的几何感知视频运动放大

Kecheng Han, Yuchen Zhang, Bingqing Liu, Boqiang Guo, Wenbin Zheng, Shiyuan Pei

发表机构 * School of Software Engineering, Xi'an Jiaotong University(西安交通大学软件工程学院) Xi'an Jiaotong University(西安交通大学)

AI总结 提出GeoMag框架,利用状态空间模型实现全局一致的运动放大,并构建Geo-200K数据集提升训练多样性,在视觉保真度和计算效率上优于现有方法。

Comments ICME 2026 Spotlight

详情
AI中文摘要

视频运动放大(VMM)揭示了不可感知的动态,但在复杂几何变换下常常遭受结构不一致的问题。现有的基于学习的方法通常面临CNN的有限全局上下文与Transformer的高计算成本之间的权衡。此外,当前的训练协议主要由简单的线性运动主导,未能捕捉真实世界视频中遇到的几何和成像复杂性。为了解决这些问题,我们提出了GeoMag,一个基于状态空间模型的几何感知VMM框架,以实现具有线性复杂度的全局一致运动放大。我们进一步构建了Geo-200K,一个大规模合成数据集,引入了丰富的几何变换以及传感器真实的退化,提高了训练信号的多样性和真实性。在合成和真实世界基准上的大量实验表明,GeoMag在视觉保真度和计算效率上始终优于先前的方法,同时产生更少的伪影和更好的结构一致性。

英文摘要

Video Motion Magnification (VMM) reveals imperceptible dynamics but often suffers from structural inconsistencies under complex geometric transformations. Existing learning-based methods generally face a trade-off between the limited global context of CNNs and the high computational cost of Transformers. In addition, current training protocols, largely dominated by simple linear motion, fail to capture the geometric and imaging complexities encountered in real-world videos. To address these issues, we propose GeoMag, a geometric-aware VMM framework built upon State Space Models to achieve globally consistent motion amplification with linear complexity. We further construct Geo-200K, a large-scale synthetic dataset that introduces rich geometric transformations together with sensor-realistic degradations, improving the diversity and realism of training signals. Extensive experiments on synthetic and real-world benchmarks show that GeoMag consistently outperforms prior methods in visual fidelity and computational efficiency, while producing fewer artifacts and better structural consistency.

2605.29761 2026-05-29 cs.CV cs.CG 版本更新

S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields

S2MDF:用于无交叉多物体有符号距离场的即插即用层

Deniz Sayin Mercadier, Federico Stella, Aurel Bizeau, Nicolas Talabot, Pascal Fua

发表机构 * CVLab, Ecole Polytechnique Fédérale de Lausanne (EPFL)(计算机视觉实验室,瑞士联邦理工学院(EPFL))

AI总结 提出S2MDF模块,通过硬约束强制向量值有符号距离场避免物体间几何交叉,无需修改网络架构,在训练或后处理中均可使用,显著减少交叉至数值精度且保持重建质量。

详情
AI中文摘要

组合隐式表面表示将场景建模为物体集合,每个物体由有符号距离场(SDF)编码。该方法的一个基本限制是多个SDF可能产生相互穿透的几何形状,违反物理合理性。现有的缓解策略依赖于软惩罚项,这些项减少但不能消除交叉,并且需要仔细的损失加权。为了真正防止相互穿透,我们提出了对向量值SDF的硬约束,并引入了S2MDF,一个轻量级的即插即用模块,无需架构修改即可对任何物体组合SDF表示施加约束。它引入可忽略的计算开销,并与线性插值的标准网格化算法(如Marching Cubes)兼容。它可以在训练期间或作为后处理步骤应用。在多种最先进的组合方法上的实验表明,S2MDF将交叉减少到数值精度,同时保持重建质量,优于现有的缓解策略。

英文摘要

Compositional implicit surface representations model scenes as collections of objects, each encoded by a Signed Distance Field (SDF). A fundamental limitation of this approach is that multiple SDFs can produce geometries that interpenetrate, violating physical plausibility. Existing mitigation strategies rely on soft penalty terms that reduce but do not eliminate intersections, and require careful loss weighting. To truly prevent interpenetration, we propose a hard constraint on vector-valued SDFs and introduce S2MDF, a lightweight plug-and-play module that enforces the constraint on any object-compositional SDF representation without architectural modifications. It introduces negligible computational overhead and is compatible with linearly-interpolated standard meshing algorithms such as Marching Cubes. It can be applied during training or as a post-processing step. Experiments on multiple state-of-the-art compositional methods show that S2MDF reduces intersections to numerical precision while preserving reconstruction quality, outperforming existing mitigation strategies.

2605.29726 2026-05-29 cs.CV 版本更新

SLAD : Shared LoRA Adapters for Task Specific Distillation

SLAD:用于任务特定蒸馏的共享LoRA适配器

Reda Bensaid, Yassir Bendou, Vincent Gripon, François Leduc-Primeau

发表机构 * IMT Atlantique(IMT大西洋) Polytechnique Montréal(蒙特利尔理工学院)

AI总结 提出SLAD方法,通过共享低秩适配器参数对齐教师和学生模型的特征表示,实现高效的知识蒸馏,在多个分类和分割数据集上达到最先进性能。

Comments CVPR Findings 2026

详情
AI中文摘要

在资源受限环境(如嵌入式系统)中,将缩小版基础模型适配到下游任务变得越来越流行。这最近激发了任务特定蒸馏的新场景,其中同一基础模型的较大和较小版本都适配到同一下游任务,目标是将知识从前者转移到后者。最近的工作展示了使用同一基础模型的较大版本协助较小版本适配的好处。通常,较大模型(教师)首先通过微调或线性探测进行适配,然后将其知识蒸馏到较小模型(学生)。虽然微调教师通常能提升其性能,但最近的工作表明,对教师进行探测能更好地向学生蒸馏知识。我们的发现表明,这主要是由于教师微调过程中教师和学生之间特征表示的对齐偏差。受现有保留先前学习知识的努力启发,我们首先提出利用低秩适配,从而带来更好的特征对齐,进而实现更好的知识转移。基于这一洞察,我们进一步通过联合训练期间两个编码器之间适配器的参数共享策略来增强特征对齐。我们提出的方法SLAD在教师和学生之间展现出更好的特征对齐,不仅提升了学生模型的性能,也提升了教师模型的性能,同时训练速度比微调快2倍。通过在多个分类和分割数据集上的大量实验,我们展示了该方法在准确性和迁移效率上的提升,在任务特定蒸馏框架中达到了最先进性能。

英文摘要

In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a misalignment in feature representation between the teacher and the student, which arises during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy for the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.

2605.29720 2026-05-29 cs.CV cs.LG 版本更新

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

面向大规模人脸识别数据集的高效、免验证的内在质量评估

Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络科学与技术学院)

AI总结 提出一种无需训练的内在质量(IQ)指标,通过邻域一致性得分和全局表示子空间复杂度来估计人脸识别数据集生成高性能模型的潜力,实现快速数据集诊断与筛选。

Comments ICML 2026

详情
AI中文摘要

我们提出内在质量(IQ),一种无需验证的度量,旨在估计人脸识别(FR)数据集产生高性能模型的固有潜力,而无需进行全规模训练。IQ 包含两个组成部分:(i)邻域一致性得分,通过最近邻量化局部身份标签一致性;(ii)全局表示子空间复杂度(有效秩,ER),捕捉底层嵌入几何和数据集多样性。IQ 允许使用轻量级代理模型或数据子集进行快速评估,便于在资源密集型的全规模训练之前进行数据集诊断和筛选。我们描述了一个针对干净、噪声和混合质量 FR 数据集定制的实验协议,并概述了验证 IQ 对下游性能预测能力的评估方法。

英文摘要

We propose Intrinsic Quality (IQ), a validation-free metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed-quality FR datasets, and outline evaluation methodologies to validate IQ's predictive power for downstream performance.
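The two ingredients named in this abstract can be sketched independently. The abstract gives no formulas, so this illustration assumes a brute-force k-nearest-neighbour label agreement for the Neighbor-Consistency Score and the usual entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalised singular values); the function names and the choice of Euclidean distance are hypothetical, and how IQ combines the two numbers is not shown.

```python
import math

def neighbor_consistency(embeddings, labels, k=2):
    """Average fraction of each sample's k nearest neighbours that share its identity label."""
    total = 0.0
    for i, e in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        # Brute-force nearest neighbours by squared Euclidean distance.
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(e, embeddings[j])))
        nearest = others[:k]
        total += sum(labels[j] == labels[i] for j in nearest) / len(nearest)
    return total / len(embeddings)

def effective_rank(singular_values):
    """exp(entropy) of the normalised singular-value spectrum; zero values are skipped."""
    positive = [s for s in singular_values if s > 0]
    norm = sum(positive)
    entropy = -sum((s / norm) * math.log(s / norm) for s in positive)
    return math.exp(entropy)
```

The singular values would typically come from a decomposition of the embedding matrix produced by a lightweight proxy model; a flat spectrum gives a high effective rank, while a single dominant direction gives a value near 1.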

2605.29703 2026-05-29 q-bio.NC cs.CV q-bio.TO 版本更新

Subcortical Shape Variations and Their Associations with Cognition Across the 8th Decade of Life. A Study in the Lothian Birth Cohort 1936

皮层下形状变化及其与第八个十年生命期认知的关联:洛锡安出生队列1936研究

Maria del C. Valdes-Hernandez, Wonjung Park, Joanna Moodie, Susana Muñoz Maniega, Janie Corley, Fraser N. Sneden, Mark E. Bastin, Joanna M. Wardlaw, Simon R. Cox, Jinah Park

发表机构 * Department of Neuroimaging Sciences(神经影像科学系) University of Edinburgh(爱丁堡大学) Computer Graphics and Visualization Laboratory(计算机图形与可视化实验室) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) Department of Psychology(心理学系) Edinburgh Futures Institute(爱丁堡未来研究所)

AI总结 利用洛锡安出生队列1936的纵向数据,通过ANCOVA和混合线性模型分析,研究第八个十年中皮层下结构的形状变化及其与认知老化的关联。

Comments 34 pages

详情
AI中文摘要

对正常个体脑形态变化的研究可能捕捉到与功能相关的脑老化方面,而这些方面不一定完全由总体积测量所指示。尽管皮层下脑结构在认知中起重要作用,但其形态轨迹与认知老化之间的关联尚未被记录。我们利用来自一项大型认知老化纵向研究——洛锡安出生队列1936——的神经影像、人口统计学和认知数据,探索社区居住个体在第八个十年生命期中皮层下脑结构的形状变化。我们使用ANCOVA和混合线性模型分析研究这些变化与认知老化的关联。皮层下形状变化是异质性的,在整个时期呈现不同的萎缩模式。海马体和腹侧DC经历了不同的形态变形(相对于其基线点),左右半球不同,而丘脑和苍白球形状则经历了更均匀的体积收缩,几乎在不同时间线上对称。一般认知的变化主要与时间点之间的向内和向外顶点位移相关。

英文摘要

The study of brain morphology changes in normal individuals may capture aspects of functionally relevant brain aging not fully indicated by gross volumetry. Despite the important role of subcortical brain structures in cognition, the associations between their morphological trajectories and cognitive changes in aging have not been documented. We use neuroimaging, demographic, and cognitive data from a large longitudinal study of cognitive aging, the Lothian Birth Cohort 1936, to explore shape changes in subcortical brain structures of community-dwelling individuals across their 8th decade of life. We investigate the association of these changes with cognitive aging using ANCOVA and mixed linear model analyses. Subcortical shape changes were heterogeneous, with varied atrophy patterns across the whole period. The hippocampus and the ventral DC underwent varied morphological deformations (relative to their baseline) that differed between the left and right hemispheres, while the shapes of the thalami and globus pallidi, for example, underwent a more uniform volume contraction that was nearly symmetrical throughout the different timelines. Changes in general cognition were mainly associated with inward and outward vertex displacements between time points.

2605.29691 2026-05-29 cs.CV 版本更新

Unsupervised Semantic Segmentation Facilitates Model Understanding

无监督语义分割促进模型理解

Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

发表机构 * Max-Delbruck-Center(马克斯·德尔布鲁克中心) Helmholtz Imaging(亥姆霍兹成像) Humboldt-Universität zu Berlin(柏林洪堡大学) Charité Universitätsmedizin(夏里特大学医学院) University of Potsdam(波茨坦大学)

AI总结 提出基于无监督语义分割的可视化协议,直观揭示不同自监督视觉Transformer的注意力机制、位置偏差和缩放行为等模型特性。

详情
AI中文摘要

自监督学习(SSL)产生了多种视觉Transformer(ViT),其预训练表示支持广泛的下游任务。为了更好地理解这些模型,已有工作评估了自注意力的机制以及表示中捕获的信息类型,例如揭示了对比学习(CL)和掩码图像建模(MIM)训练模型之间的显著差异。然而,模型理解的这些进展尚未完全渗透到更广泛的社区,其中针对CL模型的见解有时被泛化到MIM模型。为了使模型理解对广大受众直接且直观,我们提出了一种简单且易于解释的可视化协议。我们的协议基于可视化无监督语义分割结果,但目标不是最大化分割性能。相反,它允许我们传达跨图像一致出现的模型行为。通过对不同层和表示上的多种SSL模型进行基准测试,我们获得了关于不同位置偏差和缩放行为的新见解,包括DINOv3-Large模型令牌中的强边界伪影。这些见解补充并有助于传达一系列先前发现。我们的协议进一步能够清晰地区分位置效应与密切相关但不同的局部性偏差,后者在文献中已被更广泛地研究。该协议在GitHub上公开,我们相信它将促进更广泛社区的进一步模型理解。

英文摘要

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

2605.29673 2026-05-29 cs.LG cs.CV 版本更新

A Geometric View of SRC: Learning Representations for Stable Residual Inference

SRC的几何视角:学习用于稳定残差推理的表示

Vangelis P. Oikonomou

AI总结 本文从几何角度分析稀疏表示分类(SRC)的残差排序稳定性,提出几何塑造目标以改善表示学习,并在多个数据集上验证了效果。

Comments 37 pages

详情
AI中文摘要

基于重构的推理通过比较类重构残差来分配类别;稀疏表示分类(SRC)是一个典型实例,其可靠性取决于学习表示的几何结构。我们采用严格的训练-推理分离:SRC仅作为固定的测试时规则使用,在训练过程中从不进行微分、展开或优化。在基于类条件张成子空间及其相关投影残差的张成子空间理想化中,我们通过残差间隔形式化残差排序稳定性,并刻画了可能在最坏方向破坏该间隔的几何障碍——张成子空间重叠、支配以及通过小主角产生的近重叠。这一张成子空间理论是首要的:它指定了理想化残差族何时良好分离,并为实际残差近似(如OMP)提供了条件性的求解器级解释,只要它们接近张成子空间级别的残差排序。在显式的覆盖和分离假设下,我们推导了(理想化)残差间隔的定量下界。在这些目标的指导下,我们提出了几何塑造目标,这些目标促进掩蔽的类内自表达性,抑制跨类重构路径和类间张成子空间对齐,并防止坍塌——而在训练过程中不调用SRC残差或预测。在图像(COIL-100)、文本(TREC)和EEG连接性上的实验,在相同的固定SRC/OMP推理下评估所有表示,并报告残差间隔和几何诊断;交叉熵仅作为相同评估协议下的参考几何包含在内。

英文摘要

Reconstruction-based inference assigns a class by comparing class-wise reconstruction residuals; Sparse Representation Classification (SRC) is a canonical instance whose reliability depends on the geometry of the learned representation. We adopt a strict training-inference separation: SRC is used only as a fixed test-time rule and is never differentiated, unrolled, or optimized during training. In a span-level idealization based on class-conditional spans and their associated projection residuals, we formalize residual-ordering stability through a residual margin and characterize geometric obstructions -- span overlap, dominance, and near-overlap via small principal angles -- that can collapse this margin in worst-case directions. This span-level theory is primary: it specifies when the idealized residual family is well-separated, and it provides a conditional solver-level interpretation for practical residual approximations (e.g., OMP) insofar as they remain close to the span-level residual ordering. Under explicit coverage and separation assumptions, we derive a quantitative lower bound on the (idealized) residual margin. Guided by these targets, we propose geometry-shaping objectives that promote masked within-class self-expressiveness, discourage cross-class reconstruction pathways and inter-class span alignment, and prevent collapse -- without invoking SRC residuals or predictions during training. Experiments on images (COIL-100), text (TREC), and EEG connectivity evaluate all representations under identical fixed SRC/OMP inference and report residual margins and geometric diagnostics; cross-entropy is included only as a reference geometry under the same evaluation protocol.
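The span-level idealization in this abstract can be illustrated with a small sketch: each class is represented by the span of a basis, the residual is the distance from a sample to that span, and the class with the smallest residual wins. The function names, the toy data, and the margin definition (smallest wrong-class residual minus true-class residual) are assumptions made for illustration; the abstract does not give the paper's exact definitions, and practical SRC uses a sparse solver such as OMP rather than a full projection.

```python
import numpy as np

def projection_residual(x, basis):
    """Distance from x to the span of the columns of `basis` (least-squares projection)."""
    coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
    return float(np.linalg.norm(x - basis @ coef))

def classify_with_margin(x, class_bases, true_label):
    """Return (predicted label, residual margin).

    Assumed margin: smallest wrong-class residual minus the true-class residual.
    A positive value means the true class has the strictly smallest residual.
    """
    residuals = {c: projection_residual(x, B) for c, B in class_bases.items()}
    pred = min(residuals, key=residuals.get)
    best_wrong = min(r for c, r in residuals.items() if c != true_label)
    return pred, best_wrong - residuals[true_label]

# Toy example: two one-dimensional class spans in R^2 (hypothetical data).
bases = {
    "a": np.array([[1.0], [0.0]]),  # span of e1
    "b": np.array([[0.0], [1.0]]),  # span of e2
}
x = np.array([3.0, 1.0])
pred, margin = classify_with_margin(x, bases, true_label="a")
```

Here the sample lies closer to the span of class "a", so the margin is positive. When two class spans overlap or nearly align, both residuals shrink together and the margin can collapse, which is the kind of obstruction the abstract describes.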

2605.29657 2026-05-29 cs.CV cs.AI 版本更新

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

OccamToken: 无需训练且预算自适应的令牌剪枝实现高效VLM推理

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

Affiliations: Nanyang Technological University (NTU)

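AI summary: Proposes OccamToken, a framework that replaces the absolute-ranking paradigm with register-anchored relative evidence testing to achieve training-free, budget-adaptive visual token pruning, sharply reducing the token count while retaining high accuracy.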

Comments 26 pages, 8 figures

详情
AI中文摘要

视觉语言模型(VLM)依赖长视觉令牌序列进行视觉理解,导致预填充阶段在计算和内存上开销巨大。现有大多数剪枝方法遵循绝对排名范式,为视觉令牌分配重要性分数并保留固定的Top-K子集。本文认为这种范式本质上是脆弱的:注意力汇聚点扭曲令牌重要性排名,而图像冗余和查询依赖的视觉证据使得固定令牌预算在不同输入间不可靠。我们提出OccamToken,一个无需训练的框架,用寄存器锚定的相对证据测试替代绝对令牌排名。OccamToken不询问哪些令牌全局重要,而是评估视觉令牌是否提供了超越寄存器基线的信息。我们的关键洞察是,寄存器令牌自然吸收低信息注意力模式,使其成为识别真正信息性视觉证据的稳定参考。基于这一原理,OccamToken通过从寄存器注意力中导出的动态阈值,执行图像自适应冗余剪枝和查询自适应相关性剪枝。在LLaVA-NeXT、LLaVA-v1.5和Qwen3-VL上,OccamToken一致地改善了准确率-效率权衡,无需额外训练。值得注意的是,在LLaVA-NeXT上,它将2880个视觉令牌减少到约40个,同时保留了超过93%的原始准确率,即使在极端的1.4%保留率下也能实现稳定的视觉令牌压缩。

英文摘要

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
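The contrast between a fixed top-K budget and a register-anchored test can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold rule (a scaled maximum of the register attention), the function name, and the attention values are hypothetical, since the abstract does not specify how the dynamic thresholds are computed.

```python
def register_anchored_keep(visual_attn, register_attn, scale=1.0):
    """Return indices of visual tokens whose attention exceeds a register-derived threshold.

    Assumption: threshold = scale * max(register attention). The number of kept
    tokens is not fixed in advance; it depends on the input.
    """
    threshold = scale * max(register_attn)
    return [i for i, a in enumerate(visual_attn) if a > threshold]

visual_attn = [0.02, 0.30, 0.01, 0.15, 0.03]  # hypothetical attention mass per visual token
register_attn = [0.05, 0.04]                  # hypothetical attention mass per register token
kept = register_anchored_keep(visual_attn, register_attn)

# Retention rate quoted in the abstract: about 40 of 2,880 visual tokens.
retention_pct = round(100 * 40 / 2880, 1)
```

With these toy values two tokens survive; a different image or query would yield a different count, which is the budget-adaptive behavior the abstract argues for. The last line checks that 40 of 2,880 tokens matches the quoted 1.4% retention regime.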

2605.29647 2026-05-29 cs.CV 版本更新

MARTIAN: A Rendering Framework for Aerial Mars Imagery from HiRISE Orbital Data

MARTIAN:基于HiRISE轨道数据的火星空中影像渲染框架

Dario Pisanti, Georgios Georgakis

Affiliations: Space Robotics Research Group, SnT, University of Luxembourg; Jet Propulsion Laboratory, California Institute of Technology

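AI summary: Proposes MARTIAN, an open-source Blender-based rendering framework that uses real HiRISE orbital map data to synthesize realistic aerial views of Martian terrain under varying illumination and altitude, with accurate pose annotations, to address the scarcity of training data for vision-based navigation on Mars.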

详情
AI中文摘要

火星上的空中导航需要基于视觉的管道,这些管道必须对火星表面的多样光照条件和地形形态具有鲁棒性。训练和评估此类方法的一个关键瓶颈是缺乏大规模、带标注的空中数据集。我们提出了MARTIAN,一个基于Blender的开源渲染框架,它利用真实的HiRISE轨道地图产品,在可控光照条件和不同高度下合成火星地形的逼真空中视图。MARTIAN生成带有精确姿态标注的观测数据,直接解决了火星视觉导航训练数据稀缺的问题。该框架已通过其在基于地图的定位系统(用于Ingenuity和未来火星旋翼机)的并行工作中的部署得到验证,其中合成训练的深度图像匹配器已成功在真实火星图像上进行了评估。MARTIAN公开于:https://github.com/nasa-jpl/martian。

英文摘要

Aerial navigation on Mars requires vision-based pipelines that are robust to the diverse illumination conditions and terrain morphology of the Martian surface. A key bottleneck for training and evaluating such methods is the scarcity of large-scale, annotated aerial datasets. We present MARTIAN, an open-source Blender-based rendering framework that leverages real HiRISE orbital map products to synthesize realistic aerial views of the Martian terrain under controllable lighting conditions and at varying altitudes. MARTIAN generates observations with accurate pose annotations, directly addressing the scarcity of training data for vision-based navigation on Mars. The framework has been validated through its deployment in concurrent work on map-based localization systems for Ingenuity and future Mars rotorcraft, where synthetically trained deep image matchers were successfully evaluated on real Mars imagery. MARTIAN is publicly available at: https://github.com/nasa-jpl/martian.

2605.29643 2026-05-29 cs.CV cs.MA 版本更新

AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

AgentCVR:通过脚本模拟强化学习的主动多智能体跨视频推理

Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan

Affiliations: Xiaohongshu Inc.; Tsinghua Shenzhen International Graduate School, Tsinghua University

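AI summary: Proposes AgentCVR, a multi-agent framework that treats cross-video reasoning as an active evidence-acquisition task: a Master Agent coordinates Visual and Audio Agents for targeted evidence extraction, and Script-Simulated RL optimizes the policy. It outperforms single-pass baselines on cross-video alignment and localization and performs comparably to closed-source systems.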

详情
AI中文摘要

跨视频推理(CVR)已成为多模态智能的关键前沿,要求模型检索、对齐和聚合分布在多个视频中的证据。当前的多模态大语言模型(MLLMs)往往难以应对CVR,因为简单的单次策略将多个视频编码到共享压缩上下文中,可能掩盖罕见但关键的证据。在本文中,我们提出AgentCVR,一个多智能体框架,将CVR视为主动证据获取任务。AgentCVR使用主智能体迭代协调专门的视觉和音频智能体进行定向证据提取。为确保高效训练,我们引入脚本模拟强化学习,利用LLM生成的语义脚本和轻量级文本模拟器优化智能体策略,在在线探索期间避免昂贵的多模态推理。在综合CVR基准上的实验结果表明,AgentCVR优于单次基线,并在复杂跨视频对齐和定位任务上达到与最先进闭源系统相当的性能。为确保可复现性,我们的代码可在https://github.com/wang-jh24/AgentCVR获取。

英文摘要

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.

2605.29615 2026-05-29 cs.CV cs.CL 版本更新

DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

DiffSpot:VLM能发现网页界面中的细微视觉差异吗?

Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

Affiliations: WeChat AI, Tencent Inc

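AI summary: Proposes the DiffSpot benchmark, which generates controlled image pairs through CSS-property mutations to evaluate how well vision-language models detect subtle visual differences in web interfaces; the best model identifies only 40.7% of true changes.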

详情
AI中文摘要

视觉语言模型(VLM)在高层次图像-文本对齐方面取得了显著进展,但其感知细微视觉差异的能力仍然有限。我们在渲染的网页界面中研究这一问题,其中局部视觉变化既是对细粒度感知的诊断测试,也是GUI代理和设计工具的实际需求。我们引入了DiffSpot,一个用于网页界面开放式找不同的代码驱动基准。DiffSpot通过突变自包含HTML中目标元素的单个CSS属性,重新渲染页面,并记录变化的属性、元素和突变幅度,从而构建受控图像对。一个接地门控仅保留渲染像素差异局限于目标元素的图像对。该基准包含4,400对图像,包括3,900对有差异对(平衡分布在13个CSS属性操作符和三个难度级别上)以及500对无差异对用于幻觉控制。对13个前沿VLM进行零样本评估,我们发现即使最佳模型也只能识别40.7%的真实变化,所有模型在困难级别的召回率低于23%。DiffSpot进一步表明,难度强烈依赖于属性:在CSS操作符中,像素幅度和CLIP距离都不能可靠预测召回率。

英文摘要

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
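The grounding gate described in this abstract can be sketched as a pixel-level check: a mutated pair is kept only if every changed pixel falls inside the target element's bounding box. The function name, the bounding-box convention, and the choice to reject pairs with no visible change are assumptions for illustration, not details taken from the benchmark's code.

```python
import numpy as np

def passes_grounding_gate(before, after, bbox):
    """Return True if the images differ and all changed pixels lie inside bbox.

    bbox = (top, left, bottom, right) with exclusive bottom/right (assumed convention).
    """
    changed = np.any(before != after, axis=-1)   # per-pixel change mask, shape (H, W)
    if not changed.any():
        return False                             # assumed: an invisible mutation is rejected
    top, left, bottom, right = bbox
    outside = changed.copy()
    outside[top:bottom, left:right] = False      # ignore changes inside the target element
    return not outside.any()

# Hypothetical 8x8 RGB renders.
before = np.zeros((8, 8, 3), dtype=np.uint8)
after = before.copy()
after[2:4, 2:4] = 255        # change confined to the target element
leaky = after.copy()
leaky[7, 7] = 255            # extra change elsewhere, e.g. a layout shift

target_bbox = (2, 2, 4, 4)
confined_ok = passes_grounding_gate(before, after, target_bbox)
leaky_ok = passes_grounding_gate(before, leaky, target_bbox)
```

A gate of this kind filters out mutations that reflow the page, so that the recorded property, element, and magnitude remain the only ground truth for each retained pair.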

2605.29610 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Learning Context-Conditioned Predicate Semantics via Prototype Feedback

通过原型反馈学习上下文条件谓词语义

NamGyu Jung, Chang Choi

Affiliations: Department of Computer Engineering, Gachon University, Seongnam, Republic of Korea

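AI summary: Proposes AlignG, which uses prototype feedback to infer context-conditioned predicate semantics from the relation candidates in each image and recalibrate relation representations, improving F@100 under SGDet by 1.4 on VG-150 and 2.7 on GQA-200.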

Comments Accepted at ICML 2026. Code: https://github.com/Namgyu97/AlignG-SGG.pytorch

详情
AI中文摘要

在场景图生成中,一个核心挑战是建模多义谓词,其含义随上下文变化。先前的方法通过将谓词分解为多个静态原型或检索语义相似的示例来解决此问题。然而,这些策略保持谓词表示静态,无法重新组织语义以反映图像特定的证据,导致在模糊上下文中出现系统性混淆。我们提出AlignG,通过原型反馈学习上下文条件谓词语义。AlignG从每幅图像中的关系候选中推断上下文条件谓词语义,并将调整后的语义反馈回来以重新校准关系表示。学习目标将此适应锚定到全局语义中心,防止语义漂移,同时当场景提供一致的关系线索时仍允许选择性重组。在VG-150和GQA-200上的实验表明,在SGDet下,F@100指标分别提升了+1.4和+2.7,优于最先进的基线。我们进一步可视化每幅图像的原型相似性变化,并观察到一致的上下文相关重组,其中原型根据场景证据选择性地合并或分离谓词。代码可在https://github.com/Namgyu97/AlignG-SGG.pytorch获取。

英文摘要

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG-SGG.pytorch.

2605.29602 2026-05-29 cs.CV 版本更新

CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

CogniVerse: 用认知反思与几何推理革新多模态检索增强生成

Xiang Fang, Wanlong Fang, Changshuo Wang

Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; University College London

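AI summary: Proposes the CogniVerse framework, which combines a cognitive reflection module, a multi-modal retrieval module based on Riemannian-manifold alignment, and a hierarchical generation module with an optimal-transport-based loss to address noisy retrieval, cross-modal semantic misalignment, and incoherent generation in multi-modal retrieval-augmented generation.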

Comments Accepted in CVPR 2026

详情
AI中文摘要

多模态检索增强生成(MMRAG)已成为一种强大的范式,通过整合外部视觉、文本和结构知识,增强多模态大语言模型在知识密集型问答中的能力。然而,现有的MMRAG框架存在关键限制,包括噪声和无关检索、跨模态语义错位、缺乏自适应推理以及局部和全局上下文生成不连贯。我们提出了CogniVerse,一种新颖的MMRAG框架,通过受认知启发的数学严谨方法解决这些挑战。借鉴人类推理,CogniVerse集成了三个协同组件:(1)认知反思模块,动态评估检索必要性并过滤相关多模态内容,减少噪声和计算开销;(2)多模态检索模块,使用信息几何在黎曼流形中对齐嵌入,并通过谱图理论优化知识图谱,确保精确且连贯的检索;(3)层次生成模块,采用基于最优传输的损失来平衡词元级准确性和全局语义连贯性。大量实验表明,CogniVerse在准确性和连贯性上显著优于最先进系统,同时降低了检索延迟。

英文摘要

Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.

2605.29599 2026-05-29 cs.RO cs.CV 版本更新

How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments

如何缓解越野环境语义分割中的分布偏移

Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

Affiliations: Department of Electrical and Communication Engineering, Seoul National University

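AI summary: Proposes the ST-Seg framework, which uses style expansion and texture regularization to mitigate distribution shifts caused by source-target domain discrepancies and sensor corruption in off-road scenes, improving the robustness of semantic segmentation.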

Comments 8 pages, 6 figures. Accepted to IEEE Robotics and Automation Letters (RA-L). © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 10, issue 5, pp. 4500-4507, 2025
AI中文摘要

语义分割对于越野环境中的自主导航至关重要,能够精确分类周围环境以识别可通行区域。然而,越野条件固有的独特因素,如源-目标域差异和粗糙地形导致的传感器退化,可能引起分布偏移,使数据变化与训练条件不同。这常导致语义标签预测不准确,进而造成导航任务失败。为解决此问题,我们提出ST-Seg,一种通过风格扩展(SE)和纹理正则化(TR)扩展源分布的新框架。与先前在固定源分布内隐式应用泛化的方法不同,ST-Seg提供了一种直观的分布偏移处理方法。具体而言,SE通过生成多样化的逼真风格来拓宽域覆盖范围,增强源域有限的风格信息。TR通过深度纹理流形稳定受风格增强学习影响的局部纹理表示。在各种分布偏移的目标域上的实验证明了ST-Seg的有效性,相较于现有方法有显著改进。这些结果凸显了ST-Seg的鲁棒性,增强了越野导航中语义分割的实际应用性。

英文摘要

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

2605.29592 2026-05-29 cs.CV 版本更新

Non-Forgetting Knowledge Allocation with Bi-level Competition for Class-Incremental Learning

非遗忘知识分配与双层竞争用于类增量学习

Xiang Tan, Run He, Yawen Cui, Mengchen Zhao, Yan Wu, Tianyi Chen, Huiping Zhuang, Xiaonan Luo, Guanbin Li

Affiliations: South China University of Technology; Hong Kong Polytechnic University; Agency for Science, Technology and Research; Microsoft; Guilin University of Electronic Technology; Sun Yat-sen University

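AI summary: To address suboptimal adapter knowledge allocation and allocator forgetting in class-incremental learning with pre-trained models, proposes Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC), which builds a non-forgetting allocator via recursive least squares and adds an intra-task Winner-Takes-All mechanism and an inter-task Last-Ones-Fall elimination to improve adapter utilization.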

详情
AI中文摘要

基于预训练模型(PTM)的类增量学习(CIL)旨在顺序地将PTM适应到新类别而不遗忘旧知识。现有的基于适配器的方法主要通过不同的任务特定适配器训练模型,并在推理时为每个适配器呈现统一的知识分配。然而,这种分配机制忽略了任务差异的本质,导致适配器的利用次优。此外,在CIL约束下,分配器在任务演化时容易遗忘。为了解决这些问题,我们提出了一种具有双层竞争的非遗忘分配(NoFA-BC)。NoFA-BC通过将分配器训练转化为递归最小二乘问题来构建非遗忘分配器(NFA),并实现了与使用所有数据训练等效的分配器。基于NFA,提出了双层竞争(BLC),包括任务内级别的赢家通吃(WTA)机制和任务间级别的最后淘汰(LOF)消除,以提供更好的适配器知识分配。WTA提取任务内最显著的logit来表示适配器的贡献,LOF抑制不相关的适配器。通过BLC,每个适配器的参与比例可以根据每个输入进行调整。此外,还加入了稳定性增强(SE)过程,以进一步提高旧任务的性能。

英文摘要

Class-Incremental Learning (CIL) with pre-trained models (PTMs) aims to sequentially adapt PTMs to new categories without forgetting old knowledge. Built upon PTMs, existing adapter-based methods mainly train models via distinct task-specific adapters, and present a uniform knowledge allocation for each adapter during inference. However, this allocation mechanism ignores the nature of task discrepancy and leads to suboptimal utilization of adapters. Also, under CIL constraint, an allocator is prone to forgetting when tasks evolve. To address these issues, we propose a Non-Forgetting Allocation with Bi-Level Competition (NoFA-BC). NoFA-BC constructs a non-forgetting allocator (NFA) by transforming the allocator training into a recursive least-squares problem and achieves an allocator equivalent to that trained with all data. Based on the NFA, a Bi-Level Competition (BLC) including an intra-task level Winner-Takes-All (WTA) mechanism and inter-task Last-Ones-Fall (LOF) elimination is proposed to provide better allocation of adapter knowledge. WTA extracts the most significant logit within a task to represent the adapter's contribution and LOF suppresses the irrelevant adapters. With BLC, participation ratio of each adapter can be tailored for each input. Moreover, a Stability Enhancement (SE) process is incorporated to further improve the performance of old tasks.
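The claim that a recursively trained allocator can be equivalent to one trained on all data follows from a standard property of recursive least squares. The sketch below shows that property for a ridge-regularized linear map updated one task at a time; the class name, the regularizer, and the random data are assumptions for illustration, and the abstract does not say that the paper's allocator takes exactly this form.

```python
import numpy as np

class RecursiveLeastSquares:
    """Ridge regression updated batch by batch without storing earlier data."""

    def __init__(self, dim_in, dim_out, reg=1.0):
        self.P = np.eye(dim_in) / reg           # inverse of (reg * I + X^T X) seen so far
        self.W = np.zeros((dim_in, dim_out))    # current weights

    def update(self, X, Y):
        """Absorb one task's features X (n, dim_in) and targets Y (n, dim_out)."""
        n = X.shape[0]
        # Woodbury identity: update the inverse without re-inverting the full matrix.
        K = self.P @ X.T @ np.linalg.inv(np.eye(n) + X @ self.P @ X.T)
        self.P = self.P - K @ X @ self.P
        # Correct the weights using only the new task's prediction error.
        self.W = self.W + self.P @ X.T @ (Y - X @ self.W)

rng = np.random.default_rng(0)
X1, Y1 = rng.normal(size=(20, 5)), rng.normal(size=(20, 3))   # hypothetical task 1
X2, Y2 = rng.normal(size=(30, 5)), rng.normal(size=(30, 3))   # hypothetical task 2

rls = RecursiveLeastSquares(dim_in=5, dim_out=3, reg=1.0)
rls.update(X1, Y1)
rls.update(X2, Y2)

# Joint solution that has access to both tasks at once.
X, Y = np.vstack([X1, X2]), np.vstack([Y1, Y2])
W_batch = np.linalg.solve(np.eye(5) + X.T @ X, X.T @ Y)
max_gap = float(np.max(np.abs(rls.W - W_batch)))
```

The sequential and joint solutions agree up to floating-point error, so nothing learned from task 1 is lost when task 2 arrives. That equivalence is what makes a least-squares formulation attractive for a non-forgetting component.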

2605.29583 2026-05-29 cs.CV 版本更新

BitC-3DGS: High-Capacity 3D Gaussian Splatting Watermarking via Bit Compression

BitC-3DGS: 基于位压缩的高容量3D高斯泼溅水印技术

Yuquan Bi, Baosheng Yu, Yingke Lei, Jianwei Yang, Hongsong Wang, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok

Affiliations: School of Cyber Science and Engineering, Southeast University; Lee Kong Chian School of Medicine, Nanyang Technological University; College of Electronic Engineering, National University of Defense Technology; Institute of AI for Industries, Chinese Academy of Sciences; School of Computer Science and Engineering, Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Southeast University; Department of Computer and Information Science, University of Macau; Faculty of Science and Technology, UOW College Hong Kong; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

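AI summary: Proposes the BitC-3DGS framework, which uses bit-compressed tokenization, a dual-branch architecture, and a hard-message sampling strategy to overcome the 77-bit message limit imposed by the CLIP text encoder, enabling high-capacity 3DGS watermark embedding and recovery.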

详情
AI中文摘要

高容量水印对于3D高斯泼溅(3DGS)资产嵌入丰富信息(例如所有权、来源和认证码)是必要的,从而在大规模3D资产管线中实现可靠的识别和完整性验证。现有的基于预训练文本编码器的位到令牌水印方法由于CLIP固定的77令牌上下文长度而仅限于77位消息,因为超出此限制的令牌不被学习的位置嵌入支持。为了解决这一限制,我们引入了BitC-3DGS,一种位压缩框架,每个令牌编码多个消息位。它采用位压缩令牌化方案,将同一块内的多个位编码为单个语义令牌。为了恢复压缩信息,它进一步引入了双分支架构用于联合块解压缩和位解码,以及硬消息采样策略以改善解码器训练期间的组合覆盖。在Blender和LLFF数据集上的大量实验证明了BitC-3DGS在高容量水印方面的有效性,实现了高消息恢复精度和渲染保真度。例如,它支持128位消息容量,恢复精度与最近最先进方法中64位消息相当。

英文摘要

High-capacity watermarking is necessary for 3D Gaussian Splatting (3DGS) assets to embed rich information (e.g., ownership, provenance, and authentication codes), enabling reliable identification and integrity verification in large-scale 3D asset pipelines. Existing bit-to-token watermarking methods based on a pre-trained text encoder are limited to 77-bit messages due to CLIP's fixed 77-token context length, as tokens beyond this limit are unsupported by learned positional embeddings. To address this limitation, we introduce BitC-3DGS, a bit-compression framework that encodes multiple message bits per token. It employs a bit-compressed tokenization scheme that encodes multiple bits within the same chunk into a single semantic token. To enable recovery of the compressed information, it further introduces a dual-branch architecture for joint chunk decompression and bit decoding, along with a hard-message sampling strategy to improve combinatorial coverage during decoder training. Extensive experiments on the Blender and LLFF datasets demonstrate the effectiveness of BitC-3DGS for high-capacity watermarking, achieving high message recovery accuracy and rendering fidelity. For example, it supports 128-bit message capacity with recovery accuracy comparable to that of 64-bit messages in recent state-of-the-art methods.
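The capacity argument in this abstract is simple counting: if each token carries one bit, a 77-token context caps the message at 77 bits, whereas packing several bits into each token lifts that cap. The sketch below packs fixed-size chunks of bits into integer token ids and unpacks them again. The chunk size of 2 and the function names are assumptions for illustration; the abstract does not state the chunk size, and the sketch ignores any special tokens that occupy part of the context.

```python
def bits_to_tokens(bits, bits_per_token):
    """Pack consecutive chunks of bits into integer token ids (most significant bit first)."""
    if len(bits) % bits_per_token != 0:
        raise ValueError("message length must be a multiple of bits_per_token")
    tokens = []
    for start in range(0, len(bits), bits_per_token):
        value = 0
        for bit in bits[start:start + bits_per_token]:
            value = (value << 1) | bit
        tokens.append(value)
    return tokens

def tokens_to_bits(tokens, bits_per_token):
    """Inverse of bits_to_tokens: expand each token id back into its chunk of bits."""
    bits = []
    for token in tokens:
        bits.extend((token >> shift) & 1 for shift in range(bits_per_token - 1, -1, -1))
    return bits

CONTEXT_LENGTH = 77                                   # CLIP context length cited in the abstract
message = [(i * 7) % 3 % 2 for i in range(128)]       # hypothetical 128-bit message
tokens = bits_to_tokens(message, bits_per_token=2)    # assumed chunk size
recovered = tokens_to_bits(tokens, bits_per_token=2)
```

With two bits per token, the 128-bit message occupies 64 token positions, which fits within the 77-token context. The trade-off is a larger token vocabulary (2^k ids for k bits per token), which is presumably why the abstract pairs the scheme with a dedicated decompression branch and hard-message sampling.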

2605.27696 2026-05-29 cs.CV cs.LG 版本更新

Structure over Pixels: Learning Variable-Length Visual Programs

结构优于像素:学习可变长度视觉程序

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec

发表机构 * Institute of Computing Science(计算科学研究所) Poznan University of Technology(波兹南技术大学)

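AI summary: Proposes STROP, a discrete visual tokenizer architecture that learns variable-length visual programs under supervision from local rate-distortion probes on DINOv3 features, favoring structural representation over pixel reconstruction.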

详情
AI中文摘要

离散视觉分词器将图像转换为有序的代码序列,为场景的结构描述提供了自然表示。然而,现有的自适应分词器要么需要事后搜索,要么在预训练速率的离散集合中进行选择,而不是学习与模型和场景耦合的连续每图像序列长度,并且它们通常针对像素重建进行训练,强调纹理而非结构。我们提出STROP,一种离散视觉分词器架构,形成结构场景表示并同时学习图像的视觉程序应该有多长。使用由冻结的DINOv3特征的局部率失真探针监督的四阶段课程,STROP优化了一个专门的长度头,在单次前向传递中估计活动前缀长度。通过绕过像素级重建梯度,码本完全由高层潜在表示的质量塑造。程序长度随场景复杂性增长,组合结构的迹象出现在下游密集预测迁移和对学习代码词汇的直接检查中。

英文摘要

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate--distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

2605.26064 2026-05-29 cs.CV cs.LG 版本更新

Paris 2.0: A Decentralized Diffusion Model for Video Generation

Paris 2.0: 一种去中心化的视频生成扩散模型

Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang

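AI summary: Presents Paris 2.0, the first video generation model pre-trained through decentralized computation. Building on the training recipe of Paris 1.0, it cuts FVD from 561.04 to 279.01 (about a 2x improvement) relative to a centralized model on low-resolution text-to-video, and raises CLIP text-video similarity and aesthetic score.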

Comments 6 pages, 5 figures

详情
AI中文摘要

我们提出了Paris 2.0,这是首个通过去中心化计算预训练的视频生成模型。其训练方案建立在Paris 1.0(arXiv:2510.03434)的基础上,后者是首个开源权重的去中心化扩散模型(DDM),证明了图像生成可以在没有单一GPU集群的情况下进行训练。然而,时间上连贯的视频生成在去中心化训练下一直是一个未解决的问题,而Paris 2.0解决了这个问题。在低分辨率文本到视频训练中,与在相同数据上以匹配的总计算预算训练的集中式模型相比,Paris 2.0将Frechet视频距离(FVD)从561.04降至279.01,提升了约2.0倍,并提高了CLIP文本-视频相似度和美学评分。

英文摘要

We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.

2605.25975 2026-05-29 cs.GR cs.CV 版本更新

F-RNG: Feed-Forward Relightable Neural Gaussians

F-RNG: 前馈可重光照神经高斯

Guangming Fu, Jiahui Fan, Jian Yang, Miloš Hašan, Beibei Wang

Affiliations: Nankai University; Nanjing University; NVIDIA

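AI summary: Proposes F-RNG, a feed-forward framework that leverages priors from an existing large reconstruction model and an intrinsic decomposition model to generate relightable 3D Gaussian assets directly from sparse views, enabling fast, high-quality relighting.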

详情
AI中文摘要

从真实世界物体中捕捉可重光照的3D资产是一个广泛研究的问题。几种基于3D高斯溅射(3DGS)的逐场景优化方法支持重光照,但它们通常需要密集的输入视图,并且其过拟合特性使其难以跨场景泛化。与逐场景优化方法不同,泛化的前馈模型可以直接从稀疏输入视图重建高斯。然而,得到的资产具有烘焙好的光照,不能轻易用于重光照。在本文中,我们提出了F-RNG,一个前馈框架,直接从稀疏视图输入生成可重光照的3DGS资产。从头开始训练这样的模型可能需要大量的数据和计算资源,并且以可接受的成本以前馈方式生成可重光照资产尤其具有挑战性。我们在现有的大型重建模型(LRM)上开发F-RNG以提取可重光照表示,同时利用内在分解模型(IDM)的先验。具体来说,我们首先引入一种潜在插值的细粒度几何合成来增强LRM的几何表示。其次,我们提出了一种先验引导的可重光照外观蒸馏,通过结合IDM先验来提取可重光照神经表示。最后,一个通用的神经渲染器实现了灵活且高保真的重光照。F-RNG不需要重新训练或微调底层的LRM,因此可以自动受益于未来更好的LRM和IDM。仅使用可以用可负担的数据和计算资源训练的小型网络,F-RNG避免了在不同光照条件下大型模型的重复推理。与最先进的基于LRM的重光照方法相比,F-RNG实现了约25倍更快的重光照,以及更优的质量(约+2.0 dB)。

英文摘要

Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input views, and their overfitting nature makes it difficult to generalize across scenes. Unlike per-scene optimization methods, generalized feed-forward models can directly reconstruct Gaussians from sparse input views. However, the resulting assets have baked-in illumination and cannot be easily used for relighting. In this paper, we present F-RNG, a feed-forward framework that directly generates relightable 3DGS assets from sparse-view inputs. Training such a model from scratch can require massive data and computing resources, and it is especially challenging to generate relightable assets in a feed-forward manner with acceptable cost. We develop F-RNG upon an existing large reconstruction model (LRM) to extract relightable representations, while also utilizing priors from an intrinsic decomposition model (IDM). Specifically, we first introduce a latent-interpolated fine-grained geometry synthesis to enhance the LRM's geometry representation. Second, we propose a prior-guided relightable appearance distillation to extract relightable neural representations by incorporating IDM priors. Finally, a universal neural renderer enables flexible and high-fidelity relighting. F-RNG requires neither re-training nor fine-tuning of the underlying LRMs, thus can automatically benefit from better LRMs and IDMs in the future. With only small networks that can be trained with affordable data and computational resources, F-RNG avoids the repetitive inference of large models under different light conditions. By comparison to the state-of-the-art LRM-based relighting method, F-RNG achieves ~25x faster relighting, as well as superior quality (~+2.0 dB).

2605.24934 2026-05-29 cs.RO cs.AI cs.CV cs.LG 版本更新

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

HumanEgo:基于数分钟人类第一视角视频的零样本机器人学习

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

Affiliations: University of Maryland

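AI summary: Proposes the HumanEgo framework, which lifts human demonstrations to an entity-level representation of hand-object interaction and trains a flow matching policy with dense auxiliary objectives, enabling zero-shot, robot-data-free, hardware-agnostic skill transfer from human egocentric video to robots.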

Comments Project page: https://humanego-ai.github.io

详情
AI中文摘要

人类自我中心视频捕捉了丰富的操作演示,无需任何机器人硬件,但由于人类和机器人在视觉外观和运动学上的具身差距,将这些技能迁移到机器人仍然具有挑战性。我们提出了HumanEgo,一个通过将每个人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略来弥合具身差距的框架,该策略放大了每个轨迹的监督信号。HumanEgo无需机器人数据、硬件无关、数据高效且可零样本地从人类迁移到机器人。每个任务仅需30分钟的人类视频,HumanEgo在四个真实世界任务中实现了92.5%的平均成功率(仅15分钟即可达到75%),比匹配时间的机器人遥操作高出41%,并且能够稳健地零样本迁移到新的机器人、相机和环境。我们发布了HumanEgo作为一个易于使用的开源框架,用于直接从人类数据学习机器人策略:https://github.com/TX-Leo/HumanEgo

英文摘要

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

2605.24460 2026-05-29 cs.CV cs.AI 版本更新

Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery

面向多光谱影像采矿足迹分割的粗到细领域增量学习与注意力蒸馏

Alif Tri Handoyo, Vincent C. S. Lee, Rizka Widyarini Purwanto, Alex M. Lechner, Deanna Kemp, Muhamad Risqi U. Saputra

Affiliations: Monash University, Indonesia; Monash University, Australia; Northeastern University, China; The University of Queensland, Australia

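AI summary: Proposes the MineC2FNet framework, which uses coarsely annotated data, a teacher-student architecture, and attentive distillation to enhance fine-grained mining footprint segmentation under domain shift.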

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026), AI and Social Good track

详情
AI中文摘要

利用遥感和深度学习自动绘制和分割全球采矿足迹对于监测采矿的社会环境风险和影响至关重要,但其进展受到细粒度标注数据稀缺的阻碍。尽管具有粗略边界的大规模数据集广泛可用,但由于显著的领域偏移,利用它们改进细粒度分割具有挑战性。为此,我们提出了MineC2FNet,一种粗到细的领域增量学习框架,利用丰富的粗数据增强细粒度采矿足迹分割。MineC2FNet采用教师-学生架构,在特征和预测层面进行注意力蒸馏,选择性地从粗领域迁移通用知识,同时利用有限的细粒度数据(细领域)实现边界细化。我们进一步引入了一个经过专家验证的数据集,包含219张图像,具有跨不同地理和商品类型的精确边界标注。与包括领域适应和领域增量学习方法在内的最先进方法进行的大量实验表明,MineC2FNet在有效处理领域偏移的同时实现了优越的性能。数据集和代码公开于https://github.com/risqiutama/MineC2FNet。

英文摘要

Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-environmental risks and impacts of mining, yet its progress is hindered by the scarcity of fine-grained annotated data. Although large-scale datasets with coarse boundaries are widely available, leveraging them to improve fine-grained segmentation is challenging due to significant domain shift. To address this, we propose MineC2FNet, a coarse-to-fine domain incremental learning framework that exploits abundant coarse data to enhance fine-grained mining footprint segmentation. MineC2FNet adopts a teacher-student architecture with attentive distillation at both the feature and prediction levels, selectively transferring generalized knowledge from the coarse domain while enabling boundary refinement using limited fine-grained data (fine domain). We further introduce an expertly validated dataset of 219 images with precise boundary annotations across diverse geographies and commodities. Extensive experiments against state-of-the-art approaches, including domain adaptation and domain incremental learning methods, demonstrate that MineC2FNet achieves superior performance while effectively handling domain shift. The dataset and code are publicly available at https://github.com/risqiutama/MineC2FNet.

2605.22255 2026-05-29 cs.CV cs.IR 版本更新

Direct content-based retrieval from music scores images

基于内容的乐谱图像直接检索

Noelia Luna-Barahona, Antonio Ríos-Vila, Félix Fuentes-Hurtado, David Rizo, Jorge Calvo-Zaragoza

发表机构 * Pattern Recognition and Artificial Intelligence Group, University of Alicante(阿利坎特大学模式识别与人工智能小组) Instituto Superior de Enseñanzas Artísticas de la Comunidad Valenciana(瓦伦西亚自治区高等艺术教育学院)

AI总结 研究乐谱图像内容检索方法,比较基于光学音乐识别的转录方法、无转录Transformer模型和文本提示大语言模型在不同数据集上的表现。

Comments 17 pages (14 pages + references), 3 figures (with subfigures)

详情
AI中文摘要

乐谱数字化对其保存和可访问性至关重要,但信息检索仍主要依赖于元数据搜索,如按标题或作曲家搜索。与文本文档相比,乐谱图像中的基于内容搜索仍未得到充分探索,尽管它对音乐家、音乐学家和教育工作者具有潜在价值。本文首先研究了乐谱中哪些特征与搜索最相关,并定义了一种从任何带注释语料库构建查询数据集的系统方法,从而为该领域做出贡献。我们还考虑了多种用于乐谱图像内容搜索的方法,从依赖光学音乐识别(OMR)的基于转录的方法,到训练用于直接从乐谱图像识别查询的无转录Transformer模型,以及文本提示的大语言模型。我们的实验在四个具有不同特征(数据集大小、图像质量和排版机制)的语料库上评估了这些模型。总体而言,每种方法在不同条件下表现出色:基于OMR的流水线在域内检索中表现更好,而无转录模型更有效地处理域变异性。

英文摘要

The digitization of musical scores plays a crucial role in their preservation and accessibility, yet information retrieval still depends mainly on metadata searches, such as by title or composer. Content-based search in music score images remains underexplored compared to text documents, despite its potential value for musicians, musicologists, and educators. This work contributes to the field by first studying which characteristics of a score are most relevant for search and by defining a systematic method to build query datasets from any annotated corpus. We also consider diverse methods for content-based search on music score images, ranging from transcription-based approaches relying on Optical Music Recognition (OMR), to a transcription-free Transformer model trained to recognize queries directly from score images, and a text-prompted Large Language Model. Our experiments evaluate these models on four corpora exhibiting diverse characteristics in terms of dataset size, image quality, and typesetting mechanisms. Overall, each method excels under different conditions: OMR-based pipelines achieve higher in-domain retrieval, whereas transcription-free models handle domain variability more effectively.

2605.20460 2026-05-29 cs.GR cs.CV 版本更新

HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning

HyperBones: 基于超网络调节的实时骨骼驱动神经服装模拟

Astitva Srivastava, Hsiao-Yu Chen, Ryan Goldade, Philipp Herholz, Zhongshi Jiang, Gene Wei-Chin Lin, Lingchen Yang, Nikolaos Sarafianos, Tuur Stuyck, Doug Roble, Avinash Sharma, Egor Larionov

发表机构 * Meta Reality Labs(Meta现实实验室)

AI总结 提出一种结合虚拟骨骼驱动粗粒度模拟和卷积神经映射恢复细粒度褶皱的实时神经服装模拟方法,通过超网络调节实现高效物理监督,无需外部模拟器。

详情
AI中文摘要

服装模拟的最新进展使高质量结果更接近实时性能。基于物理的模拟器可以产生精确的运动,但对于交互式应用而言计算成本仍然过高。相比之下,线性混合蒙皮效率高,但无法捕捉宽松服装的复杂动态,常常导致不真实的运动和视觉伪影。神经方法提供了一种有前景的替代方案,但在严格的运行时约束下仍难以合理动画化宽松衣物。我们提出了一种快速且物理上合理的动态服装模拟方法。我们的方法训练了一个由独立的粗粒度和细粒度组件组成的降维神经动力学模拟器。在粗粒度层面,服装由一组与轻量级神经网络集成的虚拟骨骼驱动。然后使用训练好的卷积神经映射恢复细粒度的褶皱细节。通过将身份特定计算与实时神经集成解耦,我们的架构在支持多样化的体型和运动的同时保持了高性能。我们进一步引入了一种有效的物理监督方案,无需依赖外部模拟器即可获得准确结果。实验表明,我们的方法产生了物理上合理的服装动态,能够泛化到各种运动和体型,并支持固定服装集。我们的模拟器在商用GPU上以300+ FPS运行,使其适用于实时应用。

英文摘要

Recent advances in garment simulation have brought high-quality results closer to real-time performance. Physics-based simulators can produce accurate motion, but remain too computationally expensive for interactive applications. In contrast, linear blend skinning is efficient, but cannot capture the complex dynamics of loose-fitting garments, often leading to unrealistic motion and visual artifacts. Neural methods offer a promising alternative, yet they still struggle to animate loose clothing plausibly under strict runtime constraints. We present a fast and physically plausible approach for dynamic garment simulation. Our method trains a reduced-space neural dynamics simulator composed of independent coarse- and fine-level components. At the coarse level, the garment is driven by a set of virtual bones integrated with a lightweight neural network. Fine-scale wrinkle details are then recovered using a trained convolutional neural map. By decoupling identity-specific computation from real-time neural integration, our architecture maintains high performance while supporting diverse body shapes and motions. We further introduce an effective physics-supervision scheme that enables accurate results without relying on an external simulator. Experiments show that our method produces physically plausible garment dynamics, generalizes across a range of motions and body shapes, and supports a fixed set of garments. Our simulator runs at 300+ FPS on a commodity GPU, making it suitable for real-time applications.

2605.14113 2026-05-29 cs.CV cs.AI cs.LG cs.MA 版本更新

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent: 通过隐私感知的智能体工作流实现多模态临床可解释性

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

发表机构 * School of Computing and Communications(计算与通信学院) Lancaster University(兰卡斯特大学) Lancaster Medical School(兰卡斯特医学院) PUC-Rio(里约热内卢天主教大学) Puc-Behring Institute for AI(PUC-Behring人工智能研究所)

AI总结 提出ProtoMedAgent框架,通过神经符号瓶颈和反射性Scribe-Critic循环约束生成过程,解决原型网络在临床报告中的语义结构缺失和检索谄媚问题,并引入k-匿名和ℓ-多样性隐私门控。

Comments CVR 2026

详情
AI中文摘要

尽管可解释的原型网络为临床诊断提供了引人注目的基于案例的推理,但其原始连续输出缺乏医学文档所需的语义结构。通过标准检索增强生成(RAG)弥合这一差距通常会触发“检索谄媚”,即大语言模型(LLM)产生事后合理化幻觉以与视觉预测对齐。我们引入了ProtoMedAgent,一个将多模态临床报告形式化为在严格神经符号瓶颈上的迭代、零梯度测试时优化问题的框架。在冻结的原型骨干上运行,我们将潜在视觉和表格特征蒸馏为离散语义记忆。在线生成严格受限于精确的集合论差分和反射性Scribe-Critic循环,从数学上排除了无根据的叙述性声明。为了安全地限制数据泄露,我们引入了一个由k-匿名和ℓ-多样性控制的语义隐私门控。在4,160名患者临床队列上的评估显示,ProtoMedAgent达到了91.2%的比较集忠实度,从根本上优于标准RAG(46.2%)。ProtoMedAgent还利用一个绑定ℓ-多样性的相变,系统性地将工件级成员推理风险降低了绝对9.8%。

英文摘要

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by k-anonymity and ℓ-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, fundamentally outperforming standard RAG (46.2%). ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
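The abstract says the privacy gate is governed by k-anonymity and ℓ-diversity but does not spell out the check. The following is a minimal sketch of the textbook definitions only (using distinct ℓ-diversity), not the authors' implementation. The function name `passes_privacy_gate`, the field names, and the toy cohort are all hypothetical.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_ids, sensitive, k, l):
    """Textbook check (assumed, not taken from the paper): every group of
    records sharing the same quasi-identifier values must contain at least
    k records (k-anonymity) and at least l distinct sensitive values
    (distinct l-diversity)."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_ids)
        groups[key].append(rec[sensitive])
    return all(
        len(values) >= k and len(set(values)) >= l
        for values in groups.values()
    )


# Hypothetical toy cohort; field names are illustrative only.
cohort = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "osteoporosis"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "normal"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "osteopenia"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "normal"},
    {"age_band": "70-79", "sex": "M", "diagnosis": "normal"},
]

print(passes_privacy_gate(cohort, ["age_band", "sex"], "diagnosis", k=2, l=1))  # True
print(passes_privacy_gate(cohort, ["age_band", "sex"], "diagnosis", k=2, l=2))  # False
```

The second call fails because the 70-79/M group holds only one distinct diagnosis, which is the kind of case an ℓ-diversity constraint is meant to block even when k-anonymity holds.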

2605.11723 2026-05-29 cs.CV cs.AI 版本更新

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

CaC:通过分层时空聚焦推进视频奖励模型

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao

发表机构 * BJTU(北京交通大学) NTU(南洋理工大学) BUPT(北京邮电大学) Kuaishou Technology(快手科技)

AI总结 提出基于视觉语言模型的粗到细异常奖励模型CaC,通过全局时间扫描、局部空间定位和结构化时空思维链推理,结合大规模生成视频异常数据集和三阶段渐进训练,显著提升细粒度异常检测精度并减少生成视频异常。

Comments 27 pages, 10 figures

详情
AI中文摘要

在本文中,我们提出了Concentrate and Concentrate (CaC),一种基于视觉语言模型的粗到细异常奖励模型。在推理过程中,它首先进行全局时间扫描以锚定异常时间窗口,然后在局部区间内进行细粒度空间定位,最后通过结构化的时空思维链推理得出稳健判断。为了使模型具备这些能力,我们构建了第一个大规模生成视频异常数据集,包含逐帧边界框注释、时间异常窗口和细粒度归因标签。基于该数据集,我们设计了三阶段渐进训练范式。模型首先通过单帧和多帧监督微调学习空间和时间锚定,然后通过基于两轮组相对策略优化(GRPO)的强化学习策略进行优化。除了传统的准确率奖励,我们引入了时间和空间IoU奖励来监督中间定位过程,有效引导模型进行更扎实和可解释的时空推理。大量实验表明,CaC能够稳定聚焦于细微异常,在细粒度异常基准上实现了25.7%的准确率提升,并且作为奖励信号时,CaC将生成视频异常减少了11.7%,同时提高了整体视频质量。

英文摘要

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
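The Temporal and Spatial IoU rewards mentioned above can be illustrated with the standard intersection-over-union definitions. This is a sketch under stated assumptions: the abstract does not give the reward formula, so the `combined_reward` function and its weights are hypothetical, and windows and boxes are assumed to be given as `(start, end)` and `(x1, y1, x2, y2)` tuples.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows, each given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred, gt):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def combined_reward(correct, t_iou, s_iou, w_acc=1.0, w_t=0.5, w_s=0.5):
    """Hypothetical weighted sum of accuracy and localization rewards;
    the weights are illustrative, not taken from the paper."""
    return w_acc * float(correct) + w_t * t_iou + w_s * s_iou


# A predicted window of 0-4 s against a ground-truth window of 2-6 s
# overlaps for 2 s out of a 6 s union.
print(temporal_iou((0.0, 4.0), (2.0, 6.0)))
```

Rewarding the intermediate windows and boxes, rather than only the final verdict, gives the policy a signal even when the final judgment is right for the wrong reason.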

2605.05155 2026-05-29 cs.CV cs.AI 版本更新

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Aes3D: 3D高斯泼溅中的美学评估

Chuanzhi Xu, Boyu Wei, Haoxian Zhou, Xuanhua Yin, Zihan Deng, Haodong Chen, Qiang Qu, Weidong Cai

发表机构 * The University of Sydney(悉尼大学) The University of Hong Kong(香港大学)

AI总结 针对3D高斯泼溅场景缺乏美学评估的问题,提出首个系统框架Aes3D,包含专用数据集Aesthetic3D和轻量级模型Aes3DGSNet,直接预测场景级美学分数,无需渲染多视图图像。

详情
AI中文摘要

随着3D高斯泼溅(3DGS)在沉浸式媒体和数字内容创作中受到关注,评估3D场景的美学对于帮助创作者构建更具视觉吸引力的3D内容变得重要。然而,现有的3D场景评估方法主要强调重建保真度和感知真实感,在很大程度上忽略了构图、和谐度和视觉吸引力等更高层次的美学属性。这一局限性源于两个关键挑战:(1)缺乏带有美学标注的通用3DGS数据集,以及(2)3DGS作为低级基元表示的内在性质,使其难以捕捉高级美学特征。为应对这些挑战,我们提出Aes3D,这是首个用于评估3D神经渲染场景美学的系统框架。Aes3D包含Aesthetic3D,这是首个专用于3D场景美学评估的数据集,基于我们提出的3D场景美学标注策略构建。此外,我们提出Aes3DGSNet,一个轻量级模型,可直接从3DGS表示预测场景级美学分数。值得注意的是,我们的模型仅基于3D高斯基元运行,无需渲染多视图图像,从而降低了计算成本和硬件要求。通过对多视图3DGS场景表示进行美学监督学习,Aes3DGSNet有效捕获高级美学线索并准确回归美学分数。实验结果表明,我们的方法在保持轻量级设计的同时实现了强劲性能,为3D场景美学评估建立了新基准。代码和数据集将在未来版本中提供。

英文摘要

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.

2605.04569 2026-05-29 cs.CV 版本更新

LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention

LIVEditor-14B:基于上下文稀疏注意力的闪电视频编辑

Shitong Shao, Zikai Zhou, Haopeng Li, Yingwei Song, Wenliang Zhong, Lichen Bai, Zeke Xie

发表机构 * Hong Kong University of Science and Technology(香港科技大学) University of Arizona, USA(美国亚利桑那大学)

AI总结 提出上下文稀疏注意力(ISA)框架,通过冗余上下文剪枝和动态查询分组实现近无损加速,构建LIVEditor-14B模型在多个基准上超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

视频编辑已向上下文学习(ICL)范式发展,但由此产生的二次注意力成本造成了关键的计算瓶颈。在这项工作中,我们提出了上下文稀疏注意力(ISA),这是首个专为ICL视频编辑设计的近无损经验稀疏框架。我们的设计基于两个关键见解:首先,上下文标记的显著性显著低于源标记;其次,我们从理论上证明并经验验证了查询锐度与近似误差相关。受这些发现启发,ISA实现了一种高效的预选择策略来剪枝冗余上下文,随后通过动态查询分组机制将高误差查询路由到全注意力,将低误差查询路由到计算高效的0阶泰勒稀疏注意力。此外,我们构建了LIVEditor-14B,这是一种通过ISA和提出的视频编辑数据流水线(整理了170万高质量数据集)的新型闪电视频编辑模型。大量实验表明,LIVEditor-14B在注意力模块延迟上减少了约60%,同时在EditVerseBench、IVE-Bench和VIE-Bench上超越了最先进的方法,实现了近无损加速且不损害视觉保真度。

英文摘要

Video editing has evolved toward In-Context Learning (ICL) paradigms, yet the resulting quadratic attention costs create a critical computational bottleneck. In this work, we propose In-context Sparse Attention (ISA), the first near-lossless empirical sparse framework tailored for ICL video editing. Our design is grounded in two key insights: first, context tokens exhibit significantly lower saliency than source tokens; second, we theoretically prove and empirically validate that query sharpness correlates with approximation error. Motivated by these findings, ISA implements an efficient pre-selection strategy to prune redundant context, followed by a dynamic query grouping mechanism that routes high-error queries to full attention and low-error ones to a computationally efficient 0-th order Taylor sparse attention. Furthermore, we build LIVEditor-14B, a novel lightning video editing model via ISA and a proposed video-editing data pipeline that curated a 1.7M high-quality dataset. Extensive experiments demonstrate that LIVEditor-14B achieves a ~60% reduction in attention-module latency while surpassing state-of-the-art methods across EditVerseBench, IVE-Bench, and VIE-Bench, delivering near-lossless acceleration without compromising visual fidelity.

2605.02772 2026-05-29 cs.CV 版本更新

Linearizing Vision Transformer with Test-Time Training

通过测试时训练线性化视觉Transformer

Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang

发表机构 * Tsinghua University(清华大学)

AI总结 提出利用测试时训练(TTT)架构与Softmax注意力的结构对齐性,结合键实例归一化和局部性增强模块,实现从预训练Transformer到线性注意力模型的有效权重迁移,在Stable Diffusion 3.5上仅需1小时微调即可达到相近的图像生成质量并加速推理。

Comments ICML 2026

详情
AI中文摘要

虽然线性复杂度注意力机制为克服二次瓶颈提供了Softmax注意力的有前途替代方案,但从头训练此类模型仍然成本高昂。继承预训练Transformer的权重提供了一种有吸引力的捷径,但Softmax与线性注意力之间的基本表征差距阻碍了有效的权重迁移。在这项工作中,我们从两个角度解决这一转换挑战:架构对齐和表征对齐。我们确定测试时训练(TTT)是一种线性复杂度架构,其两层动态公式在结构上与Softmax注意力对齐,从而能够直接继承预训练注意力权重。为了进一步对齐表征属性,包括键平移不变性和局部性,我们引入了键实例归一化和一个轻量级局部性增强模块。我们通过线性化Stable Diffusion 3.5验证了我们的方法,并推出了SD3.5-T$^5$(Transformer到测试时训练)。仅在4$\times$H20 GPU上微调1小时,SD3.5-T$^5$在文本到图像质量上即可与微调后的Softmax模型相媲美,同时在1K和2K分辨率下分别加速推理1.32倍和1.47倍。代码可在https://github.com/LeapLabTHU/Transformer-to-TTT获取。

英文摘要

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.

2605.02288 2026-05-29 cs.CV 版本更新

LabBuilder: Protocol-Grounded 3D Layout Generation for Interactable and Safe Laboratory

LabBuilder: 基于协议的可交互且安全的3D实验室布局生成

Jianbao Cao, Zhangrui Zhao, Bohan Feng, Zixuan Hu, Rui Li, Haiyuan Wan, Chenxi Li, Jingyuan Li, Wenzhe Cai, Lei Bai, Wanli Ouyang, Lingyu Duan, Di Huang, Minting Pan, Sha Zhang, Xinzhu Ma, Shixiang Tang, Dongzhan Zhou

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Wuhan University(武汉大学) Beihang University(北京航空航天大学) Peking University(北京大学) Tsinghua University(清华大学) Shanghai Jiaotong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出LabBuilder系统,通过协议引导和约束感知优化,从文本描述生成安全且可执行的3D实验室布局,显著优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

自动化实验室有望加速科学发现,但其部署受限于设计安全且可执行环境的难度。虽然基于模拟器的设计提供了可扩展性,但现有的3D场景生成方法主要针对家庭环境,优化视觉合理性而忽略了科学实验所需的协议基础和布局级安全约束。我们提出了LabBuilder,一个端到端系统,从简洁的文本规范生成并验证3D实验室布局。它通过三个紧密耦合的组件运行:LabForge首先整理一个包含注释资产和化学知识的元数据集,将自然语言规范转化为结构化协议;基于这些协议,LabGen通过迭代的、约束感知的优化策略合成实验室布局;最后,LabTouchstone评估生成的布局作为统一基准。大量实验表明,LabBuilder显著优于现有最先进方法,生成的实验室环境在建模的几何、化学安全和导航约束下既真实又有效。

英文摘要

Automated laboratories hold the promise of accelerating scientific discovery, yet their deployment is bottlenecked by the difficulty of designing safe and executable environments. While simulator-based design offers scalability, existing 3D scene generation methods are primarily tailored for household settings, optimizing for visual plausibility while neglecting the protocol grounding and layout-level safety constraints essential for scientific experimentation. We present LabBuilder, an end-to-end system that generates and verifies 3D laboratory layouts from concise textual specifications. It operates through three tightly coupled components: LabForge first curates a meta-dataset of annotated assets and chemical knowledge, translating natural language specifications into structured protocols; building on these protocols, LabGen synthesizes laboratory layouts via an iterative, constraint-aware optimization strategy; finally, LabTouchstone evaluates the resulting layouts as a unified benchmark. Extensive experiments demonstrate that LabBuilder significantly outperforms existing state-of-the-art methods, producing laboratory environments that are realistic and valid under modeled geometric, chemical-safety, and navigation constraints.

2604.22280 2026-05-29 cs.CV 版本更新

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

超越思维链:重写作为生成式多模态嵌入的通用接口

Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

发表机构 * WeChat Vision, Tencent Inc.(腾讯微信视觉部) Zhejiang University(浙江大学) Tsinghua University(清华大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 针对思维链推理在检索中产生冗余和语义歧义的问题,提出重写驱动的多模态嵌入框架RIME,联合优化生成与嵌入,并通过跨模式对齐(生成式与判别式嵌入空间之间)和精炼强化学习实现高效准确的检索。

详情
AI中文摘要

多模态大语言模型已成为通用多模态嵌入的有前景的基础。最近的研究表明,推理驱动的生成式多模态嵌入在多个嵌入任务上可以超越判别式嵌入。然而,思维链推理往往会产生冗余的思考步骤,并在更广泛的检索场景中引入总结答案的语义歧义。为了解决这一限制,我们提出了重写驱动的多模态嵌入(RIME),这是一个通过检索友好的重写联合优化生成和嵌入的统一框架。同时,我们提出了跨模式对齐(CMA)来桥接生成式和判别式嵌入空间,从而实现灵活的相互检索以权衡效率和准确性。在此基础上,我们还引入了精炼强化学习(Refine-RL),将判别式嵌入作为稳定的语义锚点来指导重写优化。在MMEB-V2、MRMR和UVRB上的大量实验表明,RIME显著优于先前的生成式嵌入模型,同时大幅减少了思考长度。

英文摘要

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.

2603.29954 2026-05-29 cs.CV 版本更新

Detecting Unknown Objects via Energy-based Separation for Open World Object Detection

基于能量分离的未知物体检测用于开放世界目标检测

Jun-Woo Heo, Keonhee Park, Gyeong-Moon Park

发表机构 * Korea University, South Korea(韩国高丽大学) Seoul National University, South Korea(韩国首尔大学)

AI总结 提出DEUS框架,通过等角紧框架子空间未知分离和基于能量的已知区分损失,解决开放世界目标检测中未知物体检测和类别遗忘问题。

Comments 8 pages, Accepted at CVPR 2026

详情
AI中文摘要

在这项工作中,我们解决了开放世界目标检测(OWOD)问题。这一具有挑战性的场景要求检测器在不遗忘的情况下增量学习分类已知物体,同时在没有监督的情况下识别未知物体。先前的OWOD方法增强了未知发现过程,并采用记忆重放来缓解灾难性遗忘。然而,由于现有方法严重依赖检测器的已知类别预测来检测未知物体,它们难以有效学习和识别未知物体表示。此外,虽然记忆重放缓解了旧类别的遗忘,但往往牺牲了新学习类别的知识。为了解决这些限制,我们提出了DEUS(基于能量分离的未知检测),这是一个新颖的框架,应对开放世界目标检测的挑战。DEUS由等角紧框架(ETF)-子空间未知分离(EUS)和基于能量的已知区分(EKD)损失组成。EUS利用基于ETF的几何特性创建正交子空间,从而实现已知和未知物体表示的更干净分离。与仅考虑已知空间的先前基于能量的方法不同,EUS利用两个空间的能量来更好地捕捉未知物体的独特模式。此外,EKD损失强制先前和当前分类器之间的分离,从而在记忆重放期间最小化先前和新学习类别之间的知识干扰。我们在OWOD基准上彻底验证了DEUS,展示了在未知检测方面的显著性能改进,同时保持竞争力的已知类别性能。

英文摘要

In this work, we tackle the problem of Open World Object Detection (OWOD). This challenging scenario requires the detector to incrementally learn to classify known objects without forgetting while identifying unknown objects without supervision. Previous OWOD methods have enhanced the unknown discovery process and employed memory replay to mitigate catastrophic forgetting. However, since existing methods heavily rely on the detector's known class predictions for detecting unknown objects, they struggle to effectively learn and recognize unknown object representations. Moreover, while memory replay mitigates forgetting of old classes, it often sacrifices the knowledge of newly learned classes. To resolve these limitations, we propose DEUS (Detecting Unknowns via energy-based Separation), a novel framework that addresses the challenges of Open World Object Detection. DEUS consists of Equiangular Tight Frame (ETF)-Subspace Unknown Separation (EUS) and an Energy-based Known Distinction (EKD) loss. EUS leverages ETF-based geometric properties to create orthogonal subspaces, enabling cleaner separation between known and unknown object representations. Unlike prior energy-based approaches that consider only the known space, EUS utilizes energies from both spaces to better capture distinct patterns of unknown objects. Furthermore, EKD loss enforces the separation between previous and current classifiers, thus minimizing knowledge interference between previous and newly learned classes during memory replay. We thoroughly validate DEUS on OWOD benchmarks, demonstrating outstanding performance improvements in unknown detection while maintaining competitive known class performance.
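For readers unfamiliar with Equiangular Tight Frames, the geometric property that EUS relies on can be seen in the standard simplex ETF, where every pair of unit vectors has the same cosine of -1/(K-1). This sketch shows only that textbook construction; whether DEUS uses this exact form, and how it derives its orthogonal subspaces from it, is not stated in the abstract and is an assumption here. The function names are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k vectors (rows) in R^k forming a simplex ETF:
    each row has unit norm and every pair has cosine -1/(k-1).
    Built as sqrt(k/(k-1)) * (I - (1/k) * ones), requires k >= 2."""
    if k < 2:
        raise ValueError("a simplex ETF needs at least 2 vectors")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


frame = simplex_etf(4)
print(round(dot(frame[0], frame[0]), 6))  # squared norm of one vector
print(round(dot(frame[0], frame[1]), 6))  # cosine between two distinct vectors
```

Because all class directions are maximally and equally separated, features that align with none of them are easier to flag, which is one plausible reading of why an ETF helps separate known from unknown representations.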

2603.21746 2026-05-29 cs.CV 版本更新

Getting to the Point: Pointing Improves LVLMs at Counting

直击要点:指向提升LVLMs的计数能力

Simone Alghisi, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi

发表机构 * Signals and Interactive Systems Lab, University of Trento(信号与交互系统实验室,特伦托大学)

AI总结 提出Point-then-Count方法,通过生成目标物体坐标进行零样本计数,在多个LVLM上取得最高准确率,并揭示坐标编码的空间信息是性能提升的关键。

详情
AI中文摘要

基于指向的方法将复杂任务分解为顺序的定位和推理步骤。给定查询,模型首先生成相关对象的坐标进行定位,然后基于这些点预测答案。虽然这种方法已被证明能提高大型视觉语言模型(LVLM)的性能,但其为何以及如何改善模型的视觉推理仍不清楚。在这项工作中,我们评估了基于指向的方法在视觉场景零样本计数任务中的表现。我们在最先进的LVLM上实验了多种微调和免训练方法,并将其与Point-then-Count(PtC)进行比较,其中模型首先生成目标对象的点坐标,然后预测其数量。我们的结果表明,PtC在评估方法中达到了最高准确率,预测的点在超过94%的情况下正确位于图像中(基于F1分数)。机制分析表明,性能提升源于预测坐标中编码的空间信息。然而,定位性能在不同图像区域存在差异,揭示了空间偏差。最后,结果表明PtC在合成和真实数据上都改善了分布外泛化,表明坐标有潜力帮助LVLM提升计数技能。

英文摘要

Pointing-based methods decompose complex tasks as sequential grounding and reasoning steps. Given a query, the model first grounds the relevant objects by generating their coordinates, and then predicts an answer conditioned on these points. While this approach has been shown to increase the performance of Large Vision-Language Models (LVLMs), it remains unclear why and how it improves the models' visual reasoning. In this work, we evaluate pointing-based methods in the task of zero-shot counting in visual scenes. We experiment with multiple fine-tuning and training-free approaches on state-of-the-art LVLMs, and compare them with Point-then-Count (PtC), where models first generate point coordinates for the target objects and then predict their count. Our results show that PtC achieves the highest accuracy among the evaluated approaches, with predicted points correctly grounded in the image in more than 94% of cases (based on F1-score). Mechanistic analyses show that gains arise from spatial information encoded in the predicted coordinates. Nevertheless, grounding performance varies across image regions, revealing spatial biases. Finally, the results indicate that PtC improves out-of-distribution generalization on both synthetic and real data, suggesting the potential of coordinates to help LVLMs improve their counting skills.

2603.12588 2026-05-29 cs.CV 版本更新

SDF-Net: Structure-Aware Disentangled Feature Learning for Optical-SAR Ship Re-identification

SDF-Net:面向光学-SAR船舶重识别的结构感知解耦特征学习

Furui Chen, Han Wang, Yuhan Sun, Jianing You, Yixuan Lv, Zhuang Zhou, Hong Tan, Shengyang Li

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心) Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院太空应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学) School of Software, Beihang University(北京航空航天大学软件学院)

AI总结 针对光学与SAR图像间辐射差异导致的船舶重识别挑战,提出SDF-Net,通过结构一致性约束和解耦特征学习,实现模态不变的身份特征提取,在HOSS-ReID数据集上达到最优性能。

详情
AI中文摘要

光学与合成孔径雷达(SAR)图像之间的跨模态船舶重识别(ReID)面临根本性挑战,即被动光学成像与相干主动雷达传感之间的严重辐射差异。现有方法主要依赖统计分布对齐或语义匹配,但往往忽略了一个关键的物理先验:船舶是刚性物体,其几何结构在不同传感模态下保持稳定,而纹理外观则高度依赖模态。本文提出SDF-Net,一种结构感知解耦特征学习网络,系统地将几何一致性引入光学-SAR船舶重识别。基于ViT骨干网络,SDF-Net引入结构一致性约束,从中间层提取尺度不变的梯度能量统计量,以稳健地锚定表示对抗辐射变化。在终端阶段,SDF-Net将学习到的表示解耦为模态不变的身份特征和模态特定的特征。然后通过无参数的加性残差融合整合这些解耦线索,有效增强判别能力。在HOSS-ReID数据集上的大量实验表明,SDF-Net持续优于现有最先进方法。代码和训练模型已在https://github.com/cfrfree/SDF-Net公开。

英文摘要

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery is fundamentally challenged by the severe radiometric discrepancy between passive optical imaging and coherent active radar sensing. While existing approaches primarily rely on statistical distribution alignment or semantic matching, they often overlook a critical physical prior: ships are rigid objects whose geometric structures remain stable across sensing modalities, whereas texture appearance is highly modality-dependent. In this work, we propose SDF-Net, a Structure-Aware Disentangled Feature Learning Network that systematically incorporates geometric consistency into optical--SAR ship ReID. Built upon a ViT backbone, SDF-Net introduces a structure consistency constraint that extracts scale-invariant gradient energy statistics from intermediate layers to robustly anchor representations against radiometric variations. At the terminal stage, SDF-Net disentangles the learned representations into modality-invariant identity features and modality-specific characteristics. These decoupled cues are then integrated through a parameter-free additive residual fusion, effectively enhancing discriminative power. Extensive experiments on the HOSS-ReID dataset demonstrate that SDF-Net consistently outperforms existing state-of-the-art methods. The code and trained models are publicly available at https://github.com/cfrfree/SDF-Net.

2603.04314 2026-05-29 cs.CV cs.AI 版本更新

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

MOO:用于牛个体重识别视角分析的多视角观测数据集

William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard

发表机构 * Universite Paris-Saclay, CEA, List(巴黎-萨克雷大学,CEA,List) Sorbonne University, CNRS, ISIR(索邦大学,CNRS,ISIR)

AI总结 提出大规模合成多视角观测数据集MOO,通过128个均匀采样视角的1000头牛图像,量化视角变化对重识别的影响,并验证合成几何先验在真实场景中的迁移性。

Comments 6 pages, 3 figures, accepted to the CVPR 2026 Workshop on Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

详情
AI中文摘要

动物重识别(ReID)由于视角变化面临严峻挑战,特别是在航空-地面(AG-ReID)场景中,模型需要跨越剧烈的高度变化匹配个体。然而,现有数据集缺乏精确的角度标注来系统分析这些几何变化。为此,我们引入了多视角观测(MOO)数据集,这是一个大规模合成AG-ReID数据集,包含从128个均匀采样视角捕获的1000头牛个体(128,000张标注图像)。利用这个受控数据集,我们量化了高度的影响,并识别出一个关键高度阈值,超过该阈值模型对未见视角的泛化能力显著提升。最后,我们在零样本和监督设置下验证了向真实世界应用的迁移性,展示了在四个真实牛数据集上的性能提升,并确认合成几何先验有效弥合了领域差距。总之,该数据集和分析为跨视角动物ReID的未来模型开发奠定了基础。MOO公开于https://github.com/TurtleSmoke/MOO。

英文摘要

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.

2603.03503 2026-05-29 cs.CV cs.LG 版本更新

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

地理加权弱监督贝叶斯高分辨率Transformer:利用Sentinel-1、RCM和AMSR2数据实现200米分辨率泛北极海冰密集度制图与不确定性估计

Mabel Heffring, Lincoln Linlin Xu

发表机构 * Department of Geomatics Engineering, Schulich School of Engineering, University of Calgary(地质工程系,Schulich 工程学院,卡尔加里大学)

AI总结 提出一种贝叶斯高分辨率Transformer模型,结合地理加权弱监督损失函数和决策级数据融合,利用Sentinel-1、RCM和AMSR2数据实现200米分辨率泛北极海冰密集度制图与不确定性量化。

Comments 23 pages, 20 figures

详情
AI中文摘要

尽管具有可靠对应不确定性的泛北极海冰高分辨率制图对于业务化海冰密集度(SIC)制图至关重要,但由于冰特征信号的细微性、SIC标签的不精确性、模型不确定性和数据异质性等关键挑战,这是一项艰巨的任务。本研究提出了一种新颖的贝叶斯高分辨率Transformer方法,利用Sentinel-1、RADARSAT星座任务(RCM)和先进微波扫描辐射计2(AMSR2)数据,实现200米分辨率泛北极SIC制图和不确定性量化。首先,为了改进微小和细微海冰特征(例如裂缝/水道、融池和浮冰)的提取,我们设计了一种新颖的高分辨率Transformer模型,该模型具有全局和局部模块,能够更好地区分海冰模式的细微差异。其次,为了解决低分辨率和非精确SIC标签的问题,我们设计了一种地理加权弱监督损失函数,在区域级别而非像素级别监督模型,并优先考虑纯开阔水和冰盖特征,同时减轻边缘冰区(MIZ)中模糊性的影响。第三,为了改进不确定性量化,我们设计了所提Transformer模型的贝叶斯扩展,将其参数视为随机变量,以更有效地捕获不确定性。第四,为了解决数据异质性,我们在决策级融合三种不同类型的数据(Sentinel-1、RCM和AMSR2),以改进SIC制图和不确定性量化。所提方法在2021年和2025年泛北极最小范围条件下进行了评估。结果表明,所提模型在使用Sentinel-1数据时实现了0.70的总体特征检测精度,同时保留了泛北极SIC模式(相对于ARTIST海冰产品,Sentinel-1 R² = 0.90)。

英文摘要

Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 meter resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve small and subtle sea ice feature (e.g., cracks/leads, ponds, and ice floes) extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function to supervise the model at region level instead of pixel level, and to prioritize pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 R² = 0.90 relative to the ARTIST Sea Ice product).

2602.20316 2026-05-29 astro-ph.SR cs.CV 版本更新

Inspectorch: Efficient rare event exploration in solar observations

Inspectorch: 太阳观测中稀有事件的高效探索

C. J. Díaz Baso, I. J. Soler Poquet, C. Kuckein, M. van Noort, N. Poirier

发表机构 * Institute of Theoretical Astrophysics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway Rosseland Centre for Solar Physics, University of Oslo, P.O. Box 1029 Blindern, N-0315 Oslo, Norway Instituto de Astrofísica de Canarias, C/Vía Láctea s/n, E-38205 La Laguna, Tenerife, Spain Departamento de Astrofísica, Universidad de La Laguna, E-38206 La Laguna, Tenerife, Spain Max-Planck Institute for Solar System Research, Justus-von-Liebig-Weg 3, 37077 Göttingen, Germany LPC2E, OSUC, Univ Orléans, CNRS, CNES, F-45071 Orléans, France

AI总结 提出基于流的密度估计模型Inspectorch,用于从高维太阳观测数据中高效识别稀有事件,并聚焦计算资源于极端现象。

Comments 12+1 pages, 11+2 figures, submitted to A&A

详情
AI中文摘要

太阳正以前所未有的细节被观测,使得我们能够研究其非常小时空尺度上的活动。然而,望远镜收集的大量数据无法用传统方法完全分析。流行的机器学习方法从观测中识别一般趋势,但由于罕见事件发生频率低,往往忽略它们。我们研究无监督概率方法在多维太阳观测中高效识别罕见事件的适用性,并优化计算资源以研究这些极端现象。我们介绍了Inspectorch,一个开源框架,利用基于流的模型:灵活的概率密度估计器,能够学习太阳观测的多维分布。一旦优化,它为每个样本分配概率,使我们能够识别异常事件。我们通过将其应用于Hinode光谱偏振仪、界面区域成像光谱仪、瑞典1米太阳望远镜上的微透镜高光谱成像仪、太阳动力学观测站上的大气成像组件以及太阳轨道器上的极紫外成像仪的观测来应用该方法。我们发现该算法始终为表现出异常特征的光谱分配较低的概率。例如,它识别出具有非常强多普勒频移、不常见展宽以及与小型重联事件相关的时间动态的谱线等。因此,Inspectorch证明了使用基于流的模型进行密度估计为在大型太阳数据集中识别罕见事件提供了一种强大的方法。由此产生的概率异常分数允许将计算资源集中在最具信息量和物理相关的事件上。我们公开提供Python包,网址为https://github.com/cdiazbas/inspectorch。

英文摘要

The Sun is observed in unprecedented detail, enabling studies of its activity on very small spatiotemporal scales. However, the large volume of data collected by our telescopes cannot be fully analyzed with conventional methods. Popular machine learning methods identify general trends from observations, but tend to overlook unusual events due to their low frequency of occurrence. We study the applicability of unsupervised probabilistic methods to efficiently identify rare events in multidimensional solar observations and to focus our computational resources on the study of these extreme phenomena. We introduce Inspectorch, an open-source framework that utilizes flow-based models: flexible density estimators capable of learning the multidimensional distribution of solar observations. Once optimized, it assigns a probability to each sample, allowing us to identify unusual events. We apply this approach to observations from the Hinode Spectro-Polarimeter, the Interface Region Imaging Spectrograph, the Microlensed Hyperspectral Imager at the Swedish 1-m Solar Telescope, the Atmospheric Imaging Assembly on board the Solar Dynamics Observatory, and the Extreme Ultraviolet Imager on board Solar Orbiter. We find that the algorithm assigns consistently lower probabilities to spectra that exhibit unusual features. For example, it identifies profiles with very strong Doppler shifts, uncommon broadening, and temporal dynamics associated with small-scale reconnection events, among others. As a result, Inspectorch demonstrates that density estimation using flow-based models offers a powerful approach to identifying rare events in large solar datasets. The resulting probabilistic anomaly scores allow computational resources to be focused on the most informative and physically relevant events. We make our Python package publicly available at https://github.com/cdiazbas/inspectorch.
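The idea of ranking samples by their estimated probability can be illustrated with a minimal sketch. It is not the Inspectorch API and does not use a flow-based model: as a loudly labeled assumption, a one-dimensional Gaussian stands in for the learned density, and the function names (`fit_gaussian`, `log_prob`, `rarest_indices`) are hypothetical. Only the scoring principle (low log-probability means unusual) is taken from the abstract.

```python
import math
from statistics import fmean, pstdev


def fit_gaussian(samples):
    """Fit a 1-D Gaussian; a stand-in (assumption) for a flow-based density model."""
    mu = fmean(samples)
    sigma = pstdev(samples)
    if sigma == 0.0:
        raise ValueError("samples must not all be identical")
    return mu, sigma


def log_prob(x, mu, sigma):
    """Log-density of x under the fitted Gaussian."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def rarest_indices(samples, k):
    """Return the indices of the k samples with the lowest log-probability."""
    mu, sigma = fit_gaussian(samples)
    order = sorted(range(len(samples)), key=lambda i: log_prob(samples[i], mu, sigma))
    return order[:k]


if __name__ == "__main__":
    # Hypothetical Doppler-shift-like values with one extreme sample at index 5.
    values = [0.1, -0.2, 0.0, 0.3, -0.1, 8.0]
    print(rarest_indices(values, 1))
```

With a real flow-based model, only `log_prob` would change; the ranking step that concentrates follow-up analysis on the lowest-probability samples would stay the same.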

2602.18527 2026-05-29 cs.CV cs.AI cs.SD 版本更新

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER:模拟物理环境中的联合3D音频-视觉定位与推理

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

发表机构 * Tsinghua University(清华大学) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Tencent AI Lab(腾讯AI实验室)

AI总结 提出JAEGER框架,通过集成RGB-D观测和多通道一阶环境声学,将音频-视觉大语言模型扩展到3D空间,实现联合空间定位与推理,并引入神经强度向量(Neural IV)提升声源方向估计的鲁棒性。

Comments Accepted to ICML 2026

详情
AI中文摘要

当前的音频-视觉大语言模型(AV-LLMs)主要局限于2D感知,依赖于RGB视频和单声道音频。这种设计选择引入了基本的维度不匹配,阻碍了在复杂3D环境中可靠的声源定位和空间推理。我们通过提出JAEGER框架来解决这一限制,该框架将AV-LLMs扩展到3D空间,通过集成RGB-D观测和多通道一阶环境声学实现联合空间定位与推理。我们工作的核心贡献是神经强度向量(Neural IV),一种学习的空间音频表示,它编码了鲁棒的方向线索,以增强到达方向估计,即使在具有重叠声源的不利声学场景中也是如此。为了促进大规模训练和系统评估,我们提出了SpatialSceneQA,一个包含从模拟物理环境中整理的6.1万个指令调优样本的基准。大量实验表明,我们的方法在各种空间感知和推理任务中始终优于以2D为中心的基线,强调了显式3D建模对于推进物理环境中AI的必要性。我们的源代码、预训练模型检查点和数据集可在https://github.com/liuzhan22/JAEGER获取。

英文摘要

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.

2602.15382 2026-05-29 cs.CL cs.CV cs.LG 版本更新

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞:异构多智能体系统中的潜在空间通信

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University(普渡大学) Contextual AI(情境人工智能) Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Vision Wormhole框架,通过通用视觉编解码器将推理轨迹映射到共享连续空间,实现异构VLM间的潜在状态传输,无需配对翻译器,降低对齐复杂度并提升效率。

Comments Preprint. Work in progress

详情
AI中文摘要

由大型语言模型驱动的多智能体系统(MAS)实现了先进的协作推理,但仍受限于离散文本通信,这带来了运行时开销和信息量化损失。虽然潜在状态传输提供了一种替代方案,但现有方法要么假设同构的发送器-接收器架构,要么依赖于特定配对的学得翻译器,限制了跨具有不连续流形的不同模型族的可扩展性。我们将为自然图像训练的视觉-语言模型(VLM)的视觉界面重新概念化为异构智能体之间的连续通信通道,并将这一思想实例化为视觉虫洞(Vision Wormhole):一种通用视觉编解码器,将推理轨迹映射到共享的连续参考空间,并将其注入接收器的视觉通路,实现无需配对翻译器的跨架构潜在状态传输。该框架采用中心辐射拓扑,将对齐复杂度从$O(N^2)$降低到$O(N)$,并通过无标签的教师-学生蒸馏针对文本通道进行训练,无需并行隐藏状态监督。在异构VLM族(Qwen-VL、Gemma、SmolVLM2、LFM2.5-VL)和九个推理基准上的大量实验表明,视觉虫洞在大多数评估设置中减少了端到端挂钟时间,并产生了正的宏平均$Δ$-准确率。

英文摘要

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender--receiver architectures or rely on pair-specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision-Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the Vision Wormhole: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross-architecture latent state transfer without per-pair translators. The framework adopts a hub-and-spoke topology that reduces alignment complexity from $O(N^2)$ to $O(N)$, and is trained by label-free teacher--student distillation against the text channel, requiring no parallel hidden-state supervision. Extensive experiments across heterogeneous VLM families (Qwen-VL, Gemma, SmolVLM2, LFM2.5-VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end-to-end wall-clock time across most evaluated settings and yields positive macro-average $Δ$-accuracy.
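The $O(N^2)$ versus $O(N)$ claim can be made concrete with a small counting sketch. The counting convention here is an assumption, not taken from the paper: pair-specific translators are counted per directed sender-receiver pair, and the hub-and-spoke design is counted as one encoder into and one decoder out of the shared space per model. The function names are hypothetical.

```python
def pairwise_translators(n_models: int) -> int:
    """Directed sender->receiver translators when every pair is aligned directly."""
    return n_models * (n_models - 1)


def hub_and_spoke_adapters(n_models: int) -> int:
    """Adapters when each model maps only to and from one shared reference space.

    Assumes one encoder plus one decoder per model (illustrative convention).
    """
    return 2 * n_models


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, pairwise_translators(n), hub_and_spoke_adapters(n))
```

Under this convention, adding one more model to a pool of N requires 2N new pairwise translators but only two new adapters in the hub-and-spoke layout, which is the scalability argument the abstract makes.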

2602.01456 2026-05-29 cs.LG cs.CV 版本更新

Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

Rectified LpJEPA:具有稀疏和最大熵表示的联合嵌入预测架构

Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun

发表机构 * New York University(纽约大学) Duke University(杜克大学) University of Toronto(多伦多大学) Brown University(布朗大学)

AI总结 提出Rectified Distribution Matching Regularization (RDMReg)损失,通过将表示对齐到Rectified Generalized Gaussian分布,实现稀疏且最大熵的表示,从而改进联合嵌入预测架构(JEPA)的性能。

Comments ICML 2026

详情
AI中文摘要

联合嵌入预测架构(JEPA)学习视角不变表示,并采用基于投影的分布匹配来防止崩溃。现有方法将表示正则化为各向同性高斯分布,但固有地偏向密集表示,未能捕捉高效表示中观察到的稀疏性关键特性。我们引入了Rectified Distribution Matching Regularization (RDMReg),这是一种切片双样本分布匹配损失,将表示对齐到Rectified Generalized Gaussian (RGG)分布。RGG通过整流显式控制期望的$\ell_0$范数,而其连续截断部分在期望$\ell_p$范数和支撑约束下具有最大熵特性。将RDMReg应用于JEPA得到Rectified LpJEPA,它严格推广了先前基于高斯的JEPA。实验表明,Rectified LpJEPA学习到稀疏、非负的表示,具有有利的稀疏性-性能权衡,并在图像分类基准上取得了有竞争力的下游性能,表明RDMReg可以在保留任务相关信息的同时强制执行稀疏性。

英文摘要

Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while its continuous truncated component admits a maximum-entropy characterization under expected $\ell_p$ norm and support constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity--performance trade-offs and competitive downstream performance on image classification benchmarks, showing that RDMReg can enforce sparsity while preserving task-relevant information.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV 版本更新

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

NCSAM: 噪声补偿的锐度感知最小化用于噪声标签学习

Jiayu Xu, Junbiao Pang

发表机构 * Beijing University of Technology(北京工业大学)

AI总结 提出NCSAM方法,通过噪声补偿扰动修正噪声标签引起的优化偏差,缓解对噪声标签的记忆,在合成和真实噪声标签基准上优于SAM基线。

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

详情
AI中文摘要

从噪声标签学习(LNL)仍然是深度学习中的一个基本挑战,因为现实世界的数据集通常包含损坏的注释。大多数现有方法依赖于标签校正或样本选择机制。相比之下,我们从优化角度研究LNL,通过建立标签噪声与锐度感知最小化(SAM)的平坦性寻求行为之间的理论联系。基于此分析,我们提出了噪声补偿的锐度感知最小化(NCSAM),它使用噪声补偿扰动来抵消由噪声标签引起的优化偏差。通过纠正失真的SAM扰动,NCSAM在训练过程中减轻了对噪声标签的记忆,同时保持了基于优化的学习的简单性。在合成和真实噪声标签基准上的实验表明,NCSAM在基于SAM的优化基线上持续改进,并与代表性的噪声标签学习方法保持竞争力。

英文摘要

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
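For readers unfamiliar with the baseline, the following is a minimal sketch of one plain SAM update, the optimizer that NCSAM builds on. It does not implement the noise-compensated perturbation, whose form the abstract does not give. The function name `sam_step`, the toy quadratic loss, and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One plain SAM update (not NCSAM).

    1. Compute the gradient g at w.
    2. Move to the perturbed point w + rho * g / ||g||.
    3. Update w with the gradient evaluated at the perturbed point.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point; nothing to perturb
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2 (illustrative assumption)."""
    return list(w)


if __name__ == "__main__":
    print(sam_step([3.0, 4.0], quadratic_grad))
```

According to the abstract, NCSAM alters the perturbation in step 2 so that it counteracts the bias introduced by noisy labels; the surrounding two-pass structure is what this sketch is meant to show.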

2601.12500 2026-05-29 cs.CV 版本更新

Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

来自移动无人机的视频个体计数与跟踪:基准与方法

Yaowu Fan, Jia Wan, Tao Han, Andy J. Ma, Wanli Ouyang, Antoni B. Chan

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Computer Science and Engineering, Hong Kong University of Science and Technology(香港科技大学计算机科学与工程学院) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) Chinese University of Hong Kong(香港中文大学)

AI总结 针对大规模密集人群场景,提出移动无人机视频数据集MovingDroneCrowd++,并设计基于最优传输和描述子投票的计数与跟踪方法GD3A和DVTrack,显著降低计数误差并提升跟踪精度。

详情
AI中文摘要

在大规模场景中计数和跟踪密集人群是一个高度实用但具有挑战性的问题。现有方法大多依赖于场景覆盖有限的固定摄像头数据集,使其不足以用于大规模场景的人群分析。为弥补这一差距,我们引入了MovingDroneCrowd++,这是最大的视频级数据集,专门用于快速移动无人机下的密集人群计数和跟踪,在多种飞行高度、相机角度和光照条件下采集。然而,现有方法在这些具有挑战性的空中条件下仍无法达到令人满意的视频个体计数或跟踪性能。为此,我们提出了GD3A(通过分组描述符关联的全局密度图分解),一种视频个体计数方法,该方法首先通过带有自适应垃圾桶分数的最优传输建立帧间行人描述符的像素级对应关系。然后,采用分组关联来指导将全局密度图分解为共享、流入和流出密度图。我们进一步引入了一种行人跟踪方法DVTrack(描述子投票跟踪),该方法通过描述子投票将描述符级匹配转换为实例级关联。我们的方法依赖于每个行人的分组多个描述符的关联结果,而不是单个向量。由于组内匹配错误不影响最终的计数和跟踪结果,我们的方法在密集人群和具有挑战性的空中条件下更加鲁棒。实验表明,我们的方法在密集人群和复杂运动的移动无人机视频上,在人群计数和跟踪方面均取得了显著提升,计数误差降低了47.4%,跟踪精度提高了64.6%。代码、数据集和预训练模型可在 https://github.com/fyw1999/MovingDroneCrowd 获取。

英文摘要

Counting and tracking dense crowds in large-scale scenes is a highly practical yet challenging problem. Existing methods mostly rely on fixed-camera datasets with limited scene coverage, making them inadequate for crowd analysis in large-scale scenes. To bridge this gap, we introduce MovingDroneCrowd++, the largest video-level dataset dedicated to dense crowd counting and tracking with fast-moving drones, captured under diverse flight altitudes, camera angles, and illumination conditions. Existing methods, however, still fail to achieve satisfactory video individual counting or tracking performance under these challenging aerial conditions. To this end, we propose GD3A (Global Density map Decomposition via group-wise Descriptor Association), a video individual counting method that first establishes pixel-level correspondences between pedestrian descriptors across frames via optimal transport with an adaptive dustbin score. Then, group-wise association is adopted to guide the decomposition of the global density map into shared, inflow, and outflow density maps. We further introduce a pedestrian tracking method, DVTrack (Descriptor Voting Track), which converts descriptor-level matching into instance-level association through descriptor voting. Our methods rely on the association results of group-wise multiple descriptors for each pedestrian rather than a single vector. Since intra-group matching errors do not affect the final counting and tracking results, our methods are more robust in dense crowds and challenging aerial conditions. Experiments show that our methods achieve substantial gains in both crowd counting and tracking on moving-drone videos with dense crowds and complex motions, reducing counting error by 47.4% and improving tracking accuracy by 64.6%. Code, dataset, and pretrained models are available at https://github.com/fyw1999/MovingDroneCrowd.

2601.05149 2026-05-29 cs.CV 版本更新

Multi-Scale Local Speculative Decoding for Image Generation

多尺度局部推测解码用于图像生成

Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出多尺度局部推测解码(MuLo-SD)框架,通过低分辨率草稿模型与高分辨率目标模型结合、局部拒绝与重采样机制,加速自回归图像生成,实现高达5倍加速并保持语义对齐和感知质量。

Comments Accepted at CVPR 2026

详情
AI中文摘要

自回归(AR)模型在图像合成中取得了显著成功,但其顺序性带来了严重的延迟限制。推测解码提供了一种有前景的加速途径,但现有方法受限于令牌级模糊性和缺乏空间感知。在这项工作中,我们引入了多尺度局部推测解码(MuLo-SD),一种新颖的框架,结合多分辨率草稿与空间感知验证来加速AR图像生成。我们的方法利用低分辨率草稿模型配合上采样步骤来提出候选图像令牌,然后由高分辨率目标模型并行验证。关键的是,我们引入了局部拒绝和重采样机制,通过关注空间邻域而非在第一次拒绝后进行光栅扫描重采样,从而高效纠正草稿错误。当与并行解码重采样集成时,MuLo-SD实现了显著的加速(高达$\mathbf{5\times}$),在加速方面优于推测解码和并行解码基线,同时保持相当的语义对齐和感知质量。这些结果在MS-COCO 5k验证集上使用GenEval、DPG-Bench和FID/HPSv2进行了验证。广泛的消融实验突出了上采样设计、概率池化以及局部拒绝和重采样与邻域扩展的影响。我们的方法为图像合成中的推测解码设立了新的最先进水平,弥合了效率与保真度之间的差距。项目页面见https://qualcomm-ai-research.github.io/mulo-sd-webpage/。

英文摘要

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/ .
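As background for the draft-then-verify idea, here is a minimal sketch of the standard token-level speculative sampling rule. It is not MuLo-SD's local rejection and resampling mechanism, which the abstract does not specify in detail. A drafted token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p − q. The function names and the toy distributions are assumptions.

```python
def accept_probability(p_target, q_draft, token):
    """Probability of keeping a token drawn from the draft distribution.

    q_draft[token] is positive whenever the token was actually sampled from q_draft.
    """
    return min(1.0, p_target[token] / q_draft[token])


def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection: normalize max(0, p - q)."""
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        return list(p_target)  # p and q coincide; fall back to the target
    return [r / total for r in residual]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]    # hypothetical target-model probabilities
    q = [0.25, 0.25, 0.5]  # hypothetical draft-model probabilities
    print(accept_probability(p, q, 2))
    print(residual_distribution(p, q))
```

In raster-scan speculative decoding, the first rejection discards every later drafted token. The abstract describes MuLo-SD as instead resampling within the spatial neighborhood of the rejected position.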

2601.03729 2026-05-29 cs.CV 版本更新

MATANet: A Multi-context Attention and Taxonomy-Aware Network for Fine-Grained Underwater Recognition of Marine Species

MATANet:用于海洋物种细粒度水下识别的多上下文注意与分类感知网络

Donghwan Lee, Byeongjin Kim, Geunhee Kim, Hyukjin Kwon, Nahyeon Maeng, Wooju Kim

发表机构 * Department of Industrial Engineering, Yonsei University(延世大学工业工程系)

AI总结 提出MATANet框架,通过多上下文环境注意力模块和层级感知表示学习模块,结合生物外观、环境上下文和分类结构,实现海洋生物细粒度识别,在FathomNet2025和LifeCLEF2015-Fish上取得最优性能。

详情
AI中文摘要

海洋生物的细粒度识别对于生态研究、生物多样性监测、栖息地保护和基于证据的政策制定至关重要。然而,许多现有方法主要依赖于以物体或ROI为中心的表征。这些限制在具有挑战性的水下场景中会降低判别性能,因为视觉上相似的生物通常出现在不同的环境条件下。为了解决这些问题,我们提出了MATANet(多上下文注意与分类感知网络),一个用于海洋生物细粒度分类识别的框架。MATANet的动机来自专家分类识别实践,其中在识别过程中同时考虑生物体形态和上下文线索。该框架由两个主要组件组成。首先,多上下文环境注意力模块(MCEAM)对主要感兴趣区域(ROI)与多尺度周围环境区域之间的交叉注意力进行建模,从而将局部形态线索与栖息地级上下文信息相结合。其次,层级感知表示学习模块(HRLM)使用分类层次作为辅助监督来正则化表示学习,并鼓励跨分类级别的语义结构化嵌入。通过联合建模生物外观、环境上下文和分类结构,MATANet学习了用于细粒度分类识别的更具判别性的表示。在FathomNet2025和LifeCLEF2015-Fish上的实验表明,MATANet持续优于现有方法的识别性能。在FAIR1M上的额外实验进一步检验了所提框架在水下图像之外的适用性。值得注意的是,MATANet在CVPR 2025 FGVC12研讨会的FathomNet 2025挑战赛中获得了第一名。

英文摘要

Fine-grained recognition of marine organisms is important for ecological research, biodiversity monitoring, habitat conservation, and evidence-based policy-making. However, many existing approaches primarily rely on object- or ROI-centered representations. These limitations can reduce discriminative performance in challenging underwater scenes, where visually similar organisms often appear under diverse environmental conditions. To address these challenges, we propose MATANet (Multi-context Attention and Taxonomy-Aware Network), a framework for fine-grained taxonomic recognition of marine organisms. MATANet is motivated by expert taxonomic identification practices, in which both organism-level morphology and contextual cues are considered during recognition. The framework consists of two main components. First, the Multi-Context Environmental Attention Module (MCEAM) models cross-attention between the primary region of interest (ROI) and multi-scale surrounding environmental regions, thereby combining local morphological cues with habitat-level contextual information. Second, the Hierarchy-Aware Representation Learning Module (HRLM) uses taxonomic hierarchy as auxiliary supervision to regularize representation learning and encourage semantically structured embeddings across taxonomic levels. By jointly modeling organism appearance, environmental context, and taxonomic structure, MATANet learns more discriminative representations for fine-grained taxonomic recognition. Experiments on FathomNet2025 and LifeCLEF2015-Fish demonstrate that MATANet consistently improves recognition performance over existing methods. Additional experiments on FAIR1M further examine the applicability of the proposed framework beyond underwater imagery. Notably, MATANet ranked first in the FathomNet 2025 Challenge at the CVPR 2025 FGVC12 workshop.

2512.04733 2026-05-29 cs.CV cs.AI 版本更新

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

E3AD:面向以人为中心的端到端自动驾驶的情感感知视觉-语言-动作模型

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

发表机构 * McGill University(麦吉尔大学) University of Macau(澳门大学) The Hong Kong Polytechnic University(香港理工大学) Massachusetts Institute of Technology(麻省理工学院) University of Washington(华盛顿大学)

AI总结 提出E3AD框架,通过连续VAD情感模型和双路径空间推理模块,将情感理解融入视觉-语言-动作模型,实现开放域端到端自动驾驶中的情感感知轨迹规划,在真实数据集上达到SOTA性能。

详情
AI中文摘要

端到端自动驾驶系统越来越多地采用视觉-语言-动作模型,但它们通常忽略乘客的情绪状态,而情绪状态对舒适度和自动驾驶接受度至关重要。我们引入了开放域端到端自动驾驶,其中自动驾驶车辆必须解释自由形式的自然语言命令,推断情绪,并规划物理上可行的轨迹。我们提出了E3AD,一个情感感知的VLA框架,通过两个认知启发的组件增强语义理解:一个连续的Valence-Arousal-Dominance情感模型,从语言中捕捉语调和紧迫性;以及一个双路径空间推理模块,融合自我中心和异中心视角以实现类人空间认知。结合模态预训练和基于偏好的对齐的一致性导向训练方案,进一步强化了情感意图与驾驶行为之间的一致性。在真实世界数据集上,E3AD改进了视觉定位和路径点规划,并在情感估计方面达到了最先进的VAD相关性。这些评估结果表明,将情感注入VLA风格的驾驶能够产生更符合人类行为的定位、规划和反馈。

英文摘要

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2511.19316 2026-05-29 cs.CV cs.AI 版本更新

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

评估数据集水印用于定制扩散模型微调可追溯性:一个综合基准与移除方法

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

发表机构 * Donghua University(东华大学) Shanghai Jiao Tong University(上海交通大学) Xidian University(西安电子科技大学) Hefei University of Technology(合肥工业大学) East China Normal University(华东师范大学)

AI总结 针对扩散模型微调中的版权与安全风险,本文建立统一威胁模型并提出包含普适性、可传递性和鲁棒性的评估框架,揭示现有数据集水印方法的脆弱性,并进一步提出一种实用的水印移除方法。

详情
AI中文摘要

最近扩散模型的微调技术使其能够再现特定图像集,例如特定人脸或艺术风格,但也引入了版权和安全风险。数据集水印已被提出,通过将不可察觉的水印嵌入训练图像来确保可追溯性,即使在微调后这些水印在输出中仍然可检测。然而,当前方法缺乏统一的评估框架。为解决这一问题,本文建立了一个通用威胁模型,并引入了一个包含普适性、可传递性和鲁棒性的综合评估框架。实验表明,现有方法在普适性和可传递性方面表现良好,并对常见图像处理操作具有一定的鲁棒性,但在真实威胁场景下仍然不足。为揭示这些脆弱性,本文进一步提出了一种实用的水印移除方法,该方法在不影响微调的情况下完全消除数据集水印,突出了未来研究的一个关键挑战。

英文摘要

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.08423 2026-05-29 cs.CV 版本更新

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

OmniAID: 解耦语义与伪影以实现通用AI生成图像野外检测

Yuncheng Guo, Junyan Ye, Chenjue Zhang, Hengrui Kang, Haohuan Fu, Conghui He, Weijia Li

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Sun Yat-Sen University(中山大学) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出OmniAID框架,通过解耦混合专家架构分离语义缺陷和通用伪影,结合两阶段训练策略和Mirage数据集,实现跨生成模型和语义内容的鲁棒AI生成图像检测。

Comments Accepted by ICML 2026

详情
AI中文摘要

一个真正通用的AI生成图像(AIGI)检测器必须同时泛化到多种生成模型和不同的语义内容。当前方法学习单一的、纠缠的伪造表示,混淆了内容相关的缺陷与内容无关的伪影,并进一步受到过时基准的限制。我们提出OmniAID,一种以解耦混合专家(MoE)架构为核心的新框架,该架构分离了:(1)通过可路由的专门语义专家在不同内容领域中的语义缺陷,以及(2)通过固定的通用伪影专家从内容相关缺陷中分离出内容无关的通用伪影。两阶段训练策略首先通过领域特定的困难采样独立专门化专家,然后训练一个轻量级门控网络以实现有效的输入路由。通过明确解耦“生成了什么”(内容特定缺陷)与“如何生成”(通用伪影),OmniAID实现了鲁棒的泛化。我们还引入了Mirage,一个大规模、当代的数据集,包含现代训练集和具有挑战性的测试集。大量实验表明,OmniAID超越了现有检测器,为针对现代野外威胁的AIGI检测建立了新标准。代码可在https://github.com/yunncheng/OmniAID获取。

英文摘要

A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current methods learn a single, entangled forgery representation, conflating content-dependent flaws with content-agnostic artifacts, and are further constrained by outdated benchmarks. We propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture that separates: (1) semantic flaws across distinct content domains via Routable Specialized Semantic Experts, and (2) content-agnostic universal artifacts from content-dependent flaws via a Fixed Universal Artifact Expert. A two-stage training strategy first specializes experts independently with domain-specific hard-sampling, then trains a lightweight gating network for effective input routing. By explicitly decoupling "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), OmniAID achieves robust generalization. We also introduce Mirage, a large-scale, contemporary dataset comprising a modern training set and a challenging test set. Extensive experiments demonstrate that OmniAID surpasses existing detectors, establishing a new standard for AIGI detection against modern, in-the-wild threats. Code is available at https://github.com/yunncheng/OmniAID.

2510.27391 2026-05-29 cs.CV cs.LG 版本更新

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

异质双曲流形上的树间模态对齐

Wei Wu, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

发表机构 * Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology(北京智能信息科技重点实验室,计算机科学与技术学院,北京理工大学) Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University(广东机器感知与智能计算实验室,深圳MSU-BIT大学) Department of Electrical and Computer System Engineering, Monash University(电气与计算机系统工程系,蒙纳士大学)

AI总结 提出一种在异质双曲流形上对齐图像和文本树状层次特征的方法,通过交叉注意力提取视觉层次特征、异质流形嵌入及KL距离度量学习中间流形,在开放集分类任务中优于基线。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026
AI中文摘要

模态对齐对于视觉-语言模型(VLM)有效整合跨模态信息至关重要。然而,现有方法在提取文本层次特征的同时,对每个图像仅用单一特征表示,导致不对称和次优的对齐。为解决此问题,我们提出树间对齐(Alignment across Trees)方法,该方法为图像和文本模态构建并对齐树状层次特征。具体而言,我们引入一个语义感知的视觉特征提取框架,该框架对来自中间Transformer层的视觉类别标记应用交叉注意力机制,由文本线索引导以提取具有从粗到细语义的视觉特征。然后,我们将两种模态的特征树嵌入到具有不同曲率的双曲流形中,以有效建模其层次结构。为了在不同曲率的异质双曲流形之间进行对齐,我们推导了异质流形上分布之间的KL距离度量,并通过最小化该距离学习一个用于流形对齐的中间流形。我们证明了最优中间流形的存在性和唯一性。在多个图像数据集上的分类学开放集分类任务实验表明,我们的方法在少样本和跨域设置下持续优于强基线。

英文摘要

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.

2510.03550 2026-05-29 cs.CV 版本更新

Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!

流式拖拽导向的交互式视频操作:随时拖动任何物体!

Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang

发表机构 * Nanyang Technological University(南洋理工大学) University of Science and Technology of China(中国科学技术大学) Hefei University of Technology(合肥工业大学)

AI总结 提出REVEL任务和DragStream方法,通过自适应分布自校正和空间频率选择性优化,实现自回归视频扩散模型的流式拖拽交互操作。

详情
AI中文摘要

实现对自回归视频扩散模型输出的流式、细粒度控制仍然具有挑战性,难以确保其始终与用户期望一致。为弥补这一差距,我们提出流式拖拽导向的交互式视频操作(REVEL),这是一个新任务,允许用户通过细粒度的交互式拖拽随时对任何物体修改生成的视频。超越DragVideo和SG-I2V,REVEL将拖拽式视频操作统一为编辑和动画化视频帧,同时支持用户指定的平移、变形和旋转效果,使拖拽操作更加通用。在解决REVEL时,我们观察到:i)拖拽引起的扰动在潜在空间中累积,导致严重的潜在分布漂移,从而中断拖拽过程;ii)流式拖拽容易受到上下文帧的干扰,从而产生视觉上不自然的结果。因此,我们提出一种无需训练的方法DragStream,包括:i)自适应分布自校正策略,利用相邻帧的统计信息有效约束潜在嵌入的漂移;ii)空间频率选择性优化机制,允许模型充分利用上下文信息,同时通过沿生成过程选择性传播视觉线索来减轻其干扰。我们的方法可以无缝集成到现有的自回归视频扩散模型中,大量实验有力地证明了DragStream的有效性。

英文摘要

Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as editing and animating video frames, both of which support user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: i) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; ii) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, DragStream, comprising: i) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; ii) a spatial-frequency selective optimization mechanism, allowing the model to fully exploit contextual information while mitigating its interference by selectively propagating visual cues along the generation process. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments demonstrate the effectiveness of DragStream.

2510.00936 2026-05-29 cs.CV 版本更新

Resolution as a Direction: Vector-Panning Feature Alignment for Cross-Resolution Re-Identification

分辨率作为方向:跨分辨率重识别的向量平移特征对齐

Zanwu Liu, Chao Yuan, Bo Li, Xiaowei Zhang, Guanglin Niu

发表机构 * School of Artificial Intelligence, Beihang University, Beijing, China(北京航空航天大学人工智能学院) School of Computer Science and Engineering, Beihang University, Beijing, China(北京航空航天大学计算机科学与工程学院) College of Computer Science and Technology, Qingdao University, Qingdao, China(青岛大学计算机科学与技术学院)

AI总结 提出向量平移特征对齐(VPFA)方法,通过将低分辨率特征沿学习到的分辨率方向平移得到伪高分辨率表示,实现轻量级且高效的跨分辨率行人重识别。

详情
AI中文摘要

跨分辨率行人重识别(CR-ReID)在实际监控中仍然具有挑战性,其中相机质量和拍摄距离导致低分辨率(LR)查询与高分辨率(HR)图库图像之间存在显著的分辨率差距。先前的方法通常依赖于超分辨率(SR)或分辨率不变表示学习,这往往增加系统复杂性,并且可能无法直接解决由分辨率退化引起的特征不匹配问题。在这项工作中,我们从一项专门分析中报告了一个新的经验发现,其中身份特定的变化被平均化:标准ReID主干产生的HR-LR特征差异在嵌入空间中表现出一致的、与分辨率相关的语义方向。我们进一步基于典型相关分析(CCA)和皮尔逊相关分析支持这一观察。受此发现启发,我们提出了向量平移特征对齐(VPFA),一个轻量级的后处理模块,学习将LR特征沿学习到的分辨率方向平移,以获得伪HR表示。VPFA在特征提取后运行,可以以可忽略的开销集成到现有的ReID系统中。在多个CR-ReID基准上的大量实验表明,VPFA实现了最先进的性能,同时与基于SR或联合训练的方法相比提高了效率。

英文摘要

Cross-resolution person re-identification (CR-ReID) remains challenging in practical surveillance, where camera quality and capture distance lead to substantial resolution gaps between low-resolution (LR) queries and high-resolution (HR) gallery images. Prior approaches commonly rely on super-resolution (SR) or resolution-invariant representation learning, which often increases system complexity and may not directly address the feature mismatch induced by resolution degradation. In this work, we report a new empirical finding from a dedicated analysis in which identity-specific variation is averaged out: the HR--LR feature discrepancy produced by standard ReID backbones exhibits a consistent, resolution-related semantic direction in the embedding space. We further support this observation with statistical analyses based on Canonical Correlation Analysis (CCA) and Pearson correlation analysis. Motivated by this finding, we propose Vector Panning Feature Alignment (VPFA), a lightweight post-hoc module that learns to pan LR features along the learned resolution direction to obtain pseudo-HR representations. VPFA operates after feature extraction and can be integrated into existing ReID systems with negligible overhead. Extensive experiments on multiple CR-ReID benchmarks show that VPFA achieves state-of-the-art performance while improving efficiency compared to SR-based or jointly trained alternatives.
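
To make the "resolution as a direction" idea concrete, here is a minimal sketch under stated assumptions: the direction is estimated as the plain mean of paired HR minus LR feature differences (so identity-specific variation averages out), and panning is a simple additive shift. VPFA learns its direction with a dedicated module; the mean-difference estimate, the function names, and the `scale` parameter below are illustrative assumptions only.

```python
def mean_difference_direction(hr_feats, lr_feats):
    """Average the HR - LR difference over paired features.

    Assumption: a mean difference stands in for the learned resolution direction.
    """
    n = len(hr_feats)
    dim = len(hr_feats[0])
    return [
        sum(h[d] - l[d] for h, l in zip(hr_feats, lr_feats)) / n
        for d in range(dim)
    ]

def pan(lr_feat, direction, scale=1.0):
    """Shift one LR feature along the direction to obtain a pseudo-HR feature."""
    return [x + scale * d for x, d in zip(lr_feat, direction)]
```

Because the shift is applied after feature extraction, a sketch like this leaves the backbone untouched, which is consistent with the post-hoc design the abstract describes.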

2509.21979 2026-05-29 cs.CV cs.AI 版本更新

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

医疗视觉语言模型中的谄媚行为基准测试与缓解

Juangui Xu, Zikun Guo, Jingwei Lv, Hongbin Lin, Shu Yang, Jun Wen, Di Wang, Lijie Hu

发表机构 * MBZUAI Saarland University(萨尔兰大学) HKUST(GZ)(香港科技大学(广州)) KAUST(阿卜杜拉国王科技大学)

AI总结 针对医疗视觉语言模型中的谄媚问题,提出分层医疗视觉问答基准和VIPER策略,通过过滤非证据社会线索减少谄媚,提升模型鲁棒性。

Comments 19 figures, 61 pages. The first two authors contributed equally

详情
AI中文摘要

视觉语言模型(VLM)有潜力改变医疗工作流程。然而,其部署受到谄媚行为的限制。尽管这对患者安全构成严重威胁,但系统性的基准测试仍然缺乏。本文通过引入一个医疗基准来填补这一空白,该基准在分层医疗视觉问答任务中对VLM应用多种模板。我们发现当前的VLM极易受到视觉线索的影响,失败率与模型大小或整体准确性相关。我们发现感知权威和用户模仿是强大的触发因素,表明存在独立于视觉数据的偏差机制。为了克服这一点,我们提出了一种基于证据的视觉信息净化响应(VIPER)策略,该策略主动过滤掉非基于证据的社会线索,从而强化基于证据的推理。VIPER在保持可解释性的同时减少了谄媚,并且始终优于基线方法,为VLM的稳健和安全集成奠定了必要的基础。

英文摘要

Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation to model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.

2508.03221 2026-05-29 cs.CR cs.CV 版本更新

BadBlocks: Low-Cost and Stealthy Backdoor Attacks Tailored for Text-to-Image Diffusion Models

BadBlocks: 针对文本到图像扩散模型的低成本、隐蔽后门攻击

Jia Wu, Yu Pan, Junjun Yang, Yi Du

发表机构 * Shanghai Polytechnic University(上海第二工业大学) ShanghaiTech University(上海科技大学)

AI总结 提出BadBlocks攻击方法,通过仅污染UNet架构中的特定块,在保持其他组件不变的情况下,以30%的计算资源和20%的GPU时间实现高成功率且绕过注意力检测防御,揭示了不同神经层的脆弱性差异。

详情
AI中文摘要

尽管扩散模型在图像生成方面取得了显著进展,但最近的研究揭示了它们通过隐蔽的视觉或文本触发器易受后门攻击。虽然不断发展的防御机制可以通过视觉检查或特征分析检测大多数现有威胁,但我们引入了BadBlocks——一种新颖、轻量且高度隐蔽的攻击,挑战了这些防护措施。通过选择性地污染UNet架构中的特定块,同时保持其他组件不变,BadBlocks仅需传统攻击30%的计算资源和20%的GPU时间,有效地在消费级GPU上实现了后门注入的民主化。实证评估表明,BadBlocks实现了高攻击成功率,且感知质量损失可忽略不计,同时成功绕过了最先进的防御,特别是基于注意力的检测框架。层级别消融研究进一步证实,后门映射不需要全网络微调,揭示了不同神经层的脆弱性差异。总体而言,BadBlocks显著降低了执行后门攻击的门槛,构成了关键的安全风险。我们的代码可在 https://github.com/paoche11/BadBlocks 获取。

英文摘要

Despite the remarkable progress of diffusion models in image generation, recent studies reveal their vulnerability to backdoor attacks via covert visual or textual triggers. Although evolving defense mechanisms can detect most existing threats through visual inspection or feature analysis, we introduce BadBlocks, a novel, lightweight, and highly covert attack that challenges these safeguards. By selectively poisoning specific blocks within the UNet architecture while keeping other components intact, BadBlocks requires only 30% of the computational resources and 20% of the GPU time of conventional attacks, effectively democratizing backdoor injection on consumer-grade GPUs. Empirical evaluations demonstrate that BadBlocks achieves a high attack success rate with negligible perceptual quality loss, while successfully bypassing state-of-the-art defenses, particularly attention-based detection frameworks. Layer-level ablation studies further confirm that backdoor mapping does not require full-network fine-tuning, revealing the disparate vulnerability of different neural layers. Overall, BadBlocks significantly lowers the barrier for executing backdoor attacks, presenting a critical security risk. Our code is available at: https://github.com/paoche11/BadBlocks.

2507.16880 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Finding DoRI: Discovery of Retained Images in Diffusion Models

Finding DoRI: 扩散模型中保留图像的发现

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

发表机构 * CISPA Helmholtz Center for Information Security(CISPA信息安全研究中心) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) Technical University of Darmstadt(达姆施塔特技术大学) Hessian Center for AI (Hessian.AI)(黑森人工智能中心(Hessian.AI)) Centre for Cognitive Science, Technical University of Darmstadt(达姆施塔特技术大学认知科学中心)

AI总结 通过挑战记忆局部化假设,发现文本嵌入的小扰动可重新触发数据复制,并证明记忆本质上是非局部的,从而提出对抗微调实现更鲁棒的缓解方法。

Comments Published at ICML 2026

详情
AI中文摘要

文本到图像扩散模型(DMs)在图像生成方面取得了显著成功。然而,由于它们可能无意中记忆并复制训练数据,数据隐私和知识产权问题仍然存在。最近的缓解工作集中在识别和剪枝负责触发逐字训练数据复制的权重,基于记忆可以被局部化的假设。我们挑战这一假设,并证明即使经过这样的剪枝,对先前缓解的提示的文本嵌入进行微小扰动可以重新触发数据复制,揭示了此类方法的脆弱性。我们的进一步分析提供了多个迹象表明记忆确实本质上不是局部的:(1)记忆图像的复制触发因素分布在文本嵌入空间中;(2)产生相同复制图像的嵌入会产生不同的模型激活;(3)不同的剪枝方法对同一图像识别出不一致的记忆相关权重集。最后,我们表明绕过局部性假设可以通过对抗微调实现更鲁棒的缓解。这些发现为文本到图像DMs中记忆的基本性质提供了新见解,并为未来开发更可靠的对抗DM记忆的缓解方法提供了信息。

英文摘要

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

2507.09266 2026-05-29 cs.CV 版本更新

SAGE: Segment-Aware Gloss-Free Encoding for Token-Efficient Sign Language Translation

SAGE: 面向令牌高效手语翻译的分段感知无词汇编码

JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden

发表机构 * CVSSP, University of Surrey, United Kingdom(CVSSP,萨里大学,英国)

AI总结 提出分段感知视觉标记化框架,通过手语分段将连续视频转换为离散视觉令牌,结合令牌级对比对齐和双层监督,在减少序列长度50%的同时,在PHOENIX14T基准上超越现有方法。

Comments Accepted in International Conference on Computer Vision (ICCV) Workshops. Code released at https://github.com/JianHe0628/SAGE

详情
AI中文摘要

无词汇手语翻译(SLT)发展迅速,无需词汇标注即可实现强性能。然而,这些进展往往伴随着模型复杂度和计算需求增加,引发了对可扩展性的担忧,尤其是在大规模手语数据集日益普及的背景下。我们提出了一种分段感知视觉标记化框架,利用手语分段将连续视频转换为离散的、基于手语信息的视觉令牌。与先前方法相比,这使输入序列长度减少多达50%,内存使用降低高达2.67倍,并在更大数据集上具有更好的可扩展性。为了桥接视觉和语言模态,我们引入了令牌到令牌的对比对齐目标,以及双层监督,同时对齐语言嵌入和中间隐藏状态。这在不依赖词汇级监督的情况下改善了细粒度跨模态对齐。我们的方法在PHOENIX14T基准上显著超越了现有技术的性能,同时大幅减少了序列长度。进一步实验还表明,在可比序列长度下,我们的性能优于先前工作,验证了我们的标记化和对齐策略的潜力。

英文摘要

Gloss-free Sign Language Translation (SLT) has advanced rapidly, achieving strong performance without relying on gloss annotations. However, these gains have often come with increased model complexity and high computational demands, raising concerns about scalability, especially as large-scale sign language datasets become more common. We propose a segment-aware visual tokenization framework that leverages sign segmentation to convert continuous video into discrete, sign-informed visual tokens. This reduces input sequence length by up to 50% compared to prior methods, resulting in up to 2.67x lower memory usage and better scalability on larger datasets. To bridge the visual and linguistic modalities, we introduce a token-to-token contrastive alignment objective, along with a dual-level supervision that aligns both language embeddings and intermediate hidden states. This improves fine-grained cross-modal alignment without relying on gloss-level supervision. Our approach notably exceeds the performance of state-of-the-art methods on the PHOENIX14T benchmark, while significantly reducing sequence length. Further experiments also demonstrate our improved performance over prior work under comparable sequence lengths, validating the potential of our tokenization and alignment strategies.
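
As a rough illustration of how segmentation can shorten the input sequence, the sketch below collapses per-frame features into one token per sign segment. It assumes segment boundaries are already available as half-open `(start, end)` frame index pairs from some segmenter, and it uses mean pooling; both the boundary format and the pooling choice are assumptions for illustration, not details taken from the paper.

```python
def pool_segments(frame_feats, segments):
    """Mean-pool frame features over each [start, end) segment.

    Returns one token per segment. Mean pooling is an illustrative assumption.
    """
    tokens = []
    for start, end in segments:
        chunk = frame_feats[start:end]
        dim = len(chunk[0])
        # Average every feature dimension across the frames of this segment.
        tokens.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return tokens
```

With four frames grouped into two segments, the token sequence is half the length of the frame sequence, which is the kind of reduction the abstract reports as an upper bound.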

2504.06022 2026-05-29 cs.CV 版本更新

CamC2V: Context-aware Controllable Video Generation

CamC2V: 上下文感知的可控视频生成

Luis Denninger, Sina Mokhtarzadeh Azar, Juergen Gall

发表机构 * University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习与人工智能研究所)

AI总结 提出CamC2V模型,通过集成多图像条件与3D约束及相机控制,实现上下文感知的连贯视频生成,在RealEstate10K数据集上FVD提升24.09%。

Comments Published at 3DV 2026

详情
AI中文摘要

近年来,图像到视频(I2V)扩散模型展示了令人印象深刻的场景理解和生成质量,通过引入图像条件来指导生成。然而,这些模型主要将静态图像动画化,而不扩展其提供的上下文。引入额外的约束,如相机轨迹,可以增强多样性,但往往会降低视觉质量,限制了它们在需要忠实场景表示的任务中的适用性。我们提出了CamC2V,一种上下文到视频(C2V)模型,它将多个图像条件作为上下文与3D约束以及相机控制集成在一起,以丰富全局语义和细粒度视觉细节。这使得视频生成更加连贯且上下文感知。此外,我们论证了有效上下文表示中时间感知的必要性。我们在RealEstate10K数据集上的全面研究表明,视觉质量和相机可控性提高了24.09%(FVD)。我们的代码公开在:https://github.com/LDenninger/CamC2V。

英文摘要

Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamC2V, a context-to-video (C2V) model that integrates multiple image conditions as context with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates a 24.09% (FVD) improvement in visual quality and camera controllability. Our code is publicly available at: https://github.com/LDenninger/CamC2V.

2502.21004 2026-05-29 cs.CV 版本更新

Soften the Mask: Adaptive Temporal Soft Mask for Efficient Dynamic Facial Expression Recognition

软化掩码:自适应时间软掩码用于高效动态面部表情识别

Meng-zhu Li, Quanxing Zha, Hongjun Wu

发表机构 * Beijing Union University(北京联合大学) Huaqiao University(华侨大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一种结合自监督重建与监督分类的AdaTosk网络,通过自适应时间软掩码(类不可知和类语义软掩码)增强关键表情时刻并减少语义冗余,在降低计算成本的同时保持竞争性能。

Comments 6 pages, 3 figures

详情
AI中文摘要

动态面部表情识别(DFER)通过非语言交流促进对心理意图的理解。现有方法难以管理无关信息(如背景噪声和冗余语义),影响效率和有效性。本文提出一种新颖的监督式时间软掩码自编码器网络用于DFER,即AdaTosk,它将并行监督分类分支与自监督重建分支相结合。自监督重建分支应用随机二元硬掩码生成多样化的训练样本,促进可见令牌中的有意义的特征表示。同时,分类分支采用自适应时间软掩码,根据时间重要性灵活地掩盖可见令牌。其两个关键组成部分,即类不可知软掩码和类语义软掩码,分别用于增强关键表情时刻并随时间减少语义冗余。在广泛使用的基准测试上进行的大量实验表明,与当前最先进方法相比,我们的AdaTosk显著降低了计算成本,同时仍保持竞争性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) facilitates the understanding of psychological intentions through non-verbal communication. Existing methods struggle to manage irrelevant information, such as background noise and redundant semantics, which impacts both efficiency and effectiveness. In this work, we propose a novel supervised temporal soft masked autoencoder network for DFER, namely AdaTosk, which integrates a parallel supervised classification branch with the self-supervised reconstruction branch. The self-supervised reconstruction branch applies a random binary hard mask to generate diverse training samples, encouraging meaningful feature representations in visible tokens. Meanwhile, the classification branch employs an adaptive temporal soft mask to flexibly mask visible tokens based on their temporal significance. Its two key components, the class-agnostic and class-semantic soft masks, serve respectively to enhance critical expression moments and to reduce semantic redundancy over time. Extensive experiments conducted on widely used benchmarks demonstrate that AdaTosk remarkably reduces computational costs compared with current state-of-the-art methods while still maintaining competitive performance.

2412.00452 2026-05-29 cs.LG cs.CV 版本更新

Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels

局部学习,全局修正:面向含噪标签联邦学习的全局修正器

Yuxin Tian, Mouxing Yang, Yuhao Zhou, Jian Wang, Qing Ye, Tongliang Liu, Gang Niu, Jiancheng Lv

发表机构 * College of Computer Science, Sichuan University, Chengdu, China(四川大学计算机学院,中国成都) Engineering Research Center of Machine Learning(机器学习工程研究中心) University of Sydney, Sydney, Australia(悉尼大学,澳大利亚悉尼) Southeast University, Nanjing, China(东南大学,中国南京)

AI总结 针对联邦学习中标签噪声与数据异质性共存的问题,提出一种利用全局模型慢记忆特性的联邦全局修正器(FedGR),通过三个模块协同修正噪声标签并正则化局部训练,在三个基准上优于八种基线方法。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

传统的联邦学习(FL)严重依赖高质量标签,这在实际应用中往往不现实,导致联邦标签噪声(F-LN)问题。更糟糕的是,FL的异质性加剧了F-LN问题,因为客户端经历不同的标签噪声类型、比率和数据分布。在本研究中,我们首先观察到FL的全局模型表现出对噪声标签的缓慢记忆现象,这表明其在FL中能够维持可靠的预测和鲁棒的表示。受此启发,我们提出了一种名为联邦全局修正器(FedGR)的新方法,这是一种直接而有效的方法,包含三个模块,协同修正噪声标签并正则化局部训练。通过利用这一固有属性,FedGR以自包含的方式提高了FL对标签噪声的鲁棒性。在三个广泛使用的F-LN基准上的大量实验表明,即使在严重的标签噪声和数据异质性下,FedGR也表现出优越的性能,始终优于八个最先进的基线。代码:https://github.com/cs-yuxintian/FedGR-ICML26

英文摘要

Conventional federated learning (FL) heavily depends on high-quality labels, which are often impractical in the real world, leading to the federated label-noise (F-LN) problem. Worse still, the F-LN problem is exacerbated by the heterogeneity of FL, as clients experience different label-noise types, ratios, and data distributions. In this study, we first observe an intriguing phenomenon that the global model of FL exhibits a slow memorization of noisy labels, suggesting its ability to maintain reliable predictions and robust representations in FL. Motivated by this, we propose a novel method termed Federated Global Reviser (FedGR), a straightforward yet effective method comprising three modules that collaboratively rectify noisy labels and regularize local training. By exploiting this inherent property, FedGR improves the label-noise robustness of FL in a self-contained manner. Extensive experiments on three widely used F-LN benchmarks demonstrate the superior performance of FedGR, consistently outperforming eight state-of-the-art baselines even under severe label noise and data heterogeneity. Code: https://github.com/cs-yuxintian/FedGR-ICML26

2605.29579 2026-05-29 cs.CV 版本更新

ReactBench: A Cause-Driven Benchmark for Multimodal Hallucination via Systematic Evaluation

ReactBench:通过系统评估的多模态幻觉因果驱动基准

Shizhe Zhou, Bohan Jia, Kai Wu, Yan Shen, Tongyun Li, Yuyang Wu, Shaohui Lin

发表机构 * East China Normal University(华东师范大学)

AI总结 提出ReactBench基准,通过对抗性图像和诱导幻觉的查询,系统评估多模态大模型在关系擦除、反事实属性、变化追踪和密集计数等任务中的因果幻觉。

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在视觉-语言理解方面取得了快速进展,但它们仍然容易产生多模态幻觉,即生成与视觉输入不一致的响应。现有基准主要侧重于检测幻觉结果,而非评估这些失败的潜在原因。此外,许多基准依赖于简单的场景和有限的评估格式,不再能挑战最先进的模型。为了解决这些局限性,我们引入了ReactBench,一个因果驱动的幻觉基准,具有多个任务和考试式评估格式。通过生成对抗性图像和诱导幻觉的查询,ReactBench引入了四个目标任务:关系擦除、反事实属性、变化追踪和密集计数。这些任务系统地暴露了共现偏差、语言先验、跨图像比较感知缺陷和细粒度感知瓶颈。除了基于标准准确率的评估外,我们利用思维链推理来识别每个任务中幻觉的细粒度子原因。大量评估表明,当前的MLLMs仍然容易受到特定因果幻觉触发因素的影响,这证明了ReactBench作为诊断和提高多模态模型鲁棒性的系统化和可解释测试平台的价值。项目页面见https://reactbench.github.io/。

英文摘要

While multimodal large language models (MLLMs) have achieved rapid progress in vision-language understanding, they remain prone to multimodal hallucinations, producing responses that are inconsistent with the visual input. Existing benchmarks predominantly focus on detecting hallucination outcomes rather than evaluating the underlying causes of these failures. Moreover, many benchmarks rely on simplistic scenarios and limited evaluation formats that no longer challenge state-of-the-art models. To address these limitations, we introduce ReactBench, a cause-driven hallucination benchmark featuring multiple tasks and an exam-style evaluation format. By generating adversarial images and hallucination-inducing queries, ReactBench introduces four targeted tasks: Relational Erasure, Counterfactual Attribute, Alteration Tracing, and Dense Counting. These tasks systematically expose co-occurrence bias, language priors, cross-image comparative perception deficiencies, and fine-grained perceptual bottlenecks. Beyond standard accuracy-based evaluation, we leverage Chain-of-Thought reasoning to identify fine-grained sub-causes of hallucination within each task. Extensive evaluations reveal that current MLLMs remain notably vulnerable to cause-specific hallucination triggers, demonstrating the value of ReactBench as a systematic and interpretable testbed for diagnosing and improving multimodal model robustness. The project page is available at https://reactbench.github.io/.

2605.29577 2026-05-29 cs.CV 版本更新

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

通过逆动力学学习缓解视觉-语言-动作模型中的状态混叠

Kyujin Lee, Injae Kim, Jihwan Park, Yejun Ju, Minseok Joo, Hyunwoo J. Kim

发表机构 * KAIST(韩国科学技术院) Korea University(韩国大学)

AI总结 提出将逆动力学学习作为辅助目标,直接监督VLA视觉编码器,通过预测当前与未来观测之间的动作来捕捉细粒度视觉差异,从而缓解状态混叠问题。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过将预训练的视觉-语言模型(VLM)适应于动作预测,成为统一机器人操作中感知、推理和控制的有前景的框架。然而,VLM 衍生的表示通常对低级控制所需的细微视觉差异不敏感,导致视觉相似但需要截然不同动作的状态之间出现状态混叠。先前的 VLA 研究通过生成视觉或推理输出(如未来帧、2D 接地点或轨迹、或中间空间推理步骤)来改善视觉理解,但这些目标通常仅通过端到端预测间接塑造视觉编码器,并未显式分析学习到的视觉特征空间中的状态混叠。为了缓解状态混叠,我们引入逆动力学学习作为辅助目标,直接监督 VLA 视觉编码器。通过预测当前与未来观测之间的动作,我们的目标鼓励编码器捕捉决定低级动作的细粒度视觉差异。我们进一步使用伪反向监督,使编码器暴露于更广泛的动作方向,并在有限的机器人演示下提高泛化能力。我们的方法适用于多种 VLA 基线,仅使用标准的观测-动作对,无需额外标注,并在测试时保留原始推理流程。在 CALVIN ABC-D 和 SimplerEnv 上的实验表明,在多种 VLA 基线上均获得一致的性能提升。冻结编码器探测和状态-特征对齐分析进一步表明,我们的方法学习了状态判别性的视觉表示,减少了状态混叠,并更好地与机器人状态变化对齐。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising framework that unifies perception, reasoning, and control for robot manipulation by adapting pretrained vision-language models (VLMs) to action prediction. However, VLM-derived representations are often insensitive to subtle visual distinctions required for low-level control, causing state aliasing between visually similar states that require substantially different actions. Prior VLA studies improve visual understanding by generating visual or reasoning outputs, such as future frames, 2D grounding points or traces, or intermediate spatial reasoning steps, but these objectives typically shape the vision encoder only indirectly through end-to-end prediction and do not explicitly analyze state aliasing in the learned visual feature space. To mitigate state aliasing, we introduce inverse dynamics learning as an auxiliary objective that directly supervises the VLA vision encoder. By predicting the action between current and future observations, our objective encourages the encoder to capture fine-grained visual distinctions that determine low-level actions. We further use pseudo-reversed supervision to expose the encoder to a broader range of action directions and improve generalization under limited robot demonstrations. Our method applies to diverse VLA baselines, uses only standard observation-action pairs without additional annotations, and preserves the original inference pipeline at test time. Experiments on CALVIN ABC-D and SimplerEnv show consistent gains across diverse VLA baselines. Frozen-encoder probing and state-feature alignment analyses further show that our method learns state-discriminative visual representations that reduce state aliasing and better align with robot state changes.
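
The data side of an inverse dynamics objective with pseudo-reversed supervision can be sketched as below. This is a minimal illustration under strong assumptions: actions are relative (delta) vectors, `actions[t]` moves the robot from `observations[t]` to `observations[t + 1]`, the net action over a horizon is the sum of the per-step deltas, and the reversed pair is labeled with the negated action. None of these choices is confirmed by the abstract, and `inverse_dynamics_pairs` is a hypothetical helper name.

```python
def inverse_dynamics_pairs(observations, actions, horizon=1):
    """Build (current_obs, future_obs, target_action) samples plus reversed ones.

    Assumptions: delta actions, additive over steps, and negation as the
    pseudo-reversed label.
    """
    samples = []
    for t in range(len(observations) - horizon):
        steps = actions[t:t + horizon]
        # Net action between the two observations: sum of per-step deltas.
        net = [sum(a[d] for a in steps) for d in range(len(steps[0]))]
        # Forward sample: predict the action that leads from obs_t to obs_{t+h}.
        samples.append((observations[t], observations[t + horizon], net))
        # Pseudo-reversed sample: swap the observations and negate the action.
        samples.append((observations[t + horizon], observations[t], [-x for x in net]))
    return samples
```

An auxiliary head on the vision encoder would then regress the target action from each observation pair; the reversed samples double the number of pairs and expose action directions that may be rare in the demonstrations.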

2605.29575 2026-05-29 cs.CV 版本更新

Optimizing Latent Representations for Robust Building Damage Assessment Onboard Earth Observation Satellites

优化潜在表示以实现地球观测卫星上稳健的建筑物损坏评估

Thomas Goudemant, Benjamin Francesconi

AI总结 提出一种基于AI的星上系统,通过编码预灾图像为紧凑潜在表示并与灾后图像在轨比较,实现建筑物损坏的定位与分类,减少下行数据量并提高响应速度。

Comments IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2026), Jun 2026, Denver, United States

详情
AI中文摘要

在自然灾害或战区后快速识别受损建筑物对于支持应急响应和优先干预至关重要。地球观测星座提供及时、大范围的覆盖,但可操作信息常因数据下行限制、地面处理及人工解读而延迟。减少这种延迟对于提高决策响应能力至关重要。本文提出一种原创的基于AI的系统,可直接在卫星上从灾前和灾后高分辨率光学图像中进行目标级建筑物损坏评估(定位和损坏分类)。可用的灾前图像在地面编码为紧凑潜在表示,传输至卫星,并与新获取的灾后观测在轨比较。利用AI解读能力和星上处理能力的提升,所提设计支持在数据源直接处理,减少需下行的信息量,同时保留任务相关内容并提高系统整体响应性。我们通过系统基准测试星上兼容变体,分析孪生处理、交叉注意力、潜在空间压缩和面向鲁棒性的数据增强的影响。在xBD数据集上的实验表明,在未对准情况下具有可靠且稳健的损坏评估,且在强压缩下性能退化最小。

英文摘要

Rapid identification of damaged buildings after natural disasters or in war zones is crucial to support emergency response and prioritize interventions. Earth Observation constellations provide timely, large-scale coverage, but actionable information is often delayed by data downlink constraints, on-ground processing, and human interpretation. Reducing this latency is essential to improve decision-making responsiveness. In this work, we propose an original AI-based system that enables object-level building damage assessment (localization and damage classification) directly onboard satellites from pre-disaster and post-disaster high-resolution optical imagery. Available pre-disaster images are encoded on the ground into compact latent representations, transmitted to the satellite, and compared onboard with newly acquired post-event observations. Leveraging AI interpretation capabilities and the increasing processing capabilities onboard satellites, the proposed design enables processing directly at the data source, reducing the amount of information to be downlinked while preserving task-relevant content and improving overall system responsiveness. We explore the design space through a systematic benchmark of onboard-compatible variants, analyzing the impact of siamese processing, cross-attention, latent-space compression, and robustness-oriented data augmentation. Experiments on the xBD dataset demonstrate reliable and robust damage assessment under misalignment, with minimal performance degradation under strong compression.

2605.29570 2026-05-29 cs.CV 版本更新

DefSynUS: Real-time Patient-specific Intrahepatic Vessel Identification via Deformation-Aware CT-US Domain Adaptation

DefSynUS:通过形变感知CT-超声域自适应的实时患者特异性肝内血管识别

Karl-Philippe Beaudet, Yordanka Velikova, Sidaty El Hadramy, Nassir Navab, Philippe Cattin, Juan Verde, Stéphane Cotin

发表机构 * Inria(法国国家信息与自动化研究所) University of Strasbourg(斯特拉斯堡大学) Technical University of Munich(慕尼黑技术大学) University of Basel(巴塞尔大学) Institute of Image-Guided Surgery(图像引导手术研究所)

AI总结 提出一种基于物理渲染和形变感知数据增强的域自适应框架,无需术前超声即可实现术中实时、患者特异性的肝内血管分支识别。

详情
AI中文摘要

目的:腹腔镜超声通过实时可视化肝内血管增强肝脏手术的安全性。然而,由于探头限制、复杂的血管结构和组织形变,血管识别仍然困难。本研究旨在通过可变形超声增强,实现实时、患者特异性的血管识别,并在形变下保持鲁棒性。方法:利用术前CT血管标注,通过优化的基于物理的渲染生成合成超声数据,并结合域自适应到术中超声。渲染过程以端到端方式训练,用于血管识别和患者特异性,无需术前超声。形变感知增强在渲染流程中模拟真实的术中运动和软组织形变。结果:在腹部体模和有限临床可行性实验(单病例临床评估)中,该框架实现了实时肝内血管分支识别,并在新患者姿势下保持性能。结论:该框架无需术前超声即可实现实时血管识别,并支持技术可行性,但仍需多患者验证以评估泛化性和临床可行性。

英文摘要

Purpose: Laparoscopic ultrasound (LUS) enhances the safety of liver surgery by visualizing intrahepatic vessels in real-time. Still, vessel identification remains difficult due to probe constraints, complex vascular structure, and tissue deformation. This work aims to enable real-time, patient-specific vessel identification that remains robust under deformation through deformable ultrasound augmentation. Methods: Preoperative CT vessel annotations are used to generate synthetic ultrasound data via optimized physics-based rendering, coupled with domain adaptation to intraoperative ultrasound. The rendering is trained end-to-end for vessel identification and patient-specificity, eliminating the need for preoperative ultrasound. A deformation-aware augmentation simulates realistic intraoperative motion and tissue deformation within the rendering pipeline. Results: In abdominal phantom and limited clinical feasibility experiments (single-case clinical evaluation), the framework achieved real-time intrahepatic vessel-branch identification, maintaining performance under new patient poses. Conclusion: The framework enables real-time vessel identification without preoperative ultrasound and supports technical feasibility, but multi-patient validation is still needed for generalizability and clinical feasibility.

2605.29565 2026-05-29 cs.CV cs.RO 版本更新

From General Vision to Reliable Traversability Estimation: Adapting Vision Foundation Models for Unstructured Outdoor Environments

从通用视觉到可靠的可通行性估计:适应视觉基础模型用于非结构化户外环境

Ji-Hoon Hwang, Jisung Bae, Dong-Wook Kim, Yeonkyu Lee, Seung-Woo Seo

AI总结 提出ViTA框架,通过可学习提示、视角多样化训练和几何知识蒸馏,将视觉基础模型适应于非结构化户外环境的可靠可通行性估计,显著降低误报并提升跨域泛化。

Comments 8 pages, 5 figures

详情
AI中文摘要

基于视觉的方法已成为非结构化户外环境中可通行性估计的主导范式,通常通过语义分割监督来适应视觉基础模型(VFM)。然而,该范式面临三个根本性挑战,削弱了其可靠性:VFM的任务无关设计、可通行性标注的模糊性以及语义标签与物理安全性之间的差异。我们提出了视觉到可通行性适应(ViTA)框架,该框架将VFM适应于可靠的可通行性估计,并在SAM2上实例化。ViTA通过可学习的可通行性提示注入任务特定知识,同时保留VFM的跨域泛化能力。为处理标注模糊性,我们引入了视角多样化训练,通过估计语义不确定性来抑制模糊边界处的自信预测。为弥合语义与可通行性之间的差异,我们在训练期间蒸馏几何知识,使得推理时仅从RGB图像即可进行坡度和高程推理。语义和几何输出融合为一个连续的可通行性分数,同时反映语义不确定性和几何风险。在包括具有挑战性的真实越野数据集在内的多个领域的评估表明,ViTA实现了最先进的IoU和精确度,同时大幅减少误报并具备强大的跨域泛化能力。

英文摘要

Vision-based approaches have become the dominant paradigm for traversability estimation in unstructured outdoor environments, typically adapting vision foundation models (VFMs) via semantic segmentation supervision. However, this paradigm faces three fundamental challenges that undermine its reliability: the task-agnostic design of VFMs, the ambiguity of traversability annotations, and the discrepancy between semantic labels and physical safety. We propose Vision-to-Traversability Adaptation (ViTA), a framework that adapts VFMs for reliable traversability estimation, instantiated on SAM2. ViTA injects task-specific knowledge through learnable traversability prompts while preserving the VFM's cross-domain generalization. To handle annotation ambiguity, we introduce Perspective-Diversified Training, which estimates semantic uncertainty to suppress confident predictions at ambiguous boundaries. To bridge the semantic-traversability discrepancy, we distill geometric knowledge during training, enabling slope and elevation reasoning from RGB images alone at inference. The semantic and geometric outputs are fused into a continuous traversability score that reflects both semantic uncertainty and geometric risk. Evaluations across diverse domains, including challenging real-world off-road datasets, demonstrate that ViTA achieves state-of-the-art IoU and Precision with substantial false-positive reduction and strong cross-domain generalization.

2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro:面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.(上海新智具身智能技术有限公司)

AI总结 提出VLA-Pro框架,通过存储和检索任务相关的LoRA适配器作为程序性记忆,实现跨任务泛化,在仿真和真实任务中成功率显著提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用机器人操作中展现出强大潜力,但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro,一种即插即用框架,通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言,VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时,VLA-Pro基于当前多模态上下文检索相关程序性记忆,并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明,VLA-Pro在多个骨干网络上持续提升跨任务泛化能力,在仿真中实现高达207%的相对改进,并将真实世界成功率从5.8%提升至65.0%。这些结果表明,程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制,同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.
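
The retrieve-then-fuse step can be sketched as follows. This is a hedged illustration, not the paper's procedure: it assumes each stored memory is a pair of a key embedding and a flattened adapter weight delta, that retrieval ranks memories by cosine similarity to the current context embedding, and that fusion is a softmax-weighted sum of the top-k deltas. The names `retrieve_and_fuse`, `top_k`, and `temperature` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_and_fuse(query, memory, top_k=2, temperature=1.0):
    """Fuse the top-k most similar adapter deltas with softmax weights.

    `memory` is a list of (key_vector, adapter_delta) pairs.
    Assumption: softmax-weighted averaging stands in for the dynamic fusion.
    """
    # Rank memories by similarity between the query context and each key.
    ranked = sorted(memory, key=lambda m: cosine(query, m[0]), reverse=True)[:top_k]
    sims = [cosine(query, key) for key, _ in ranked]
    # Softmax over similarities gives the fusion weights.
    exps = [math.exp(s / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(ranked[0][1])
    return [sum(w * delta[d] for w, (_, delta) in zip(weights, ranked)) for d in range(dim)]
```

Keeping each adapter as a separate memory entry, as in this sketch, means new tasks can be added without retraining the stored ones, which matches the modularity the abstract emphasises. The sketch does not guard against zero-norm vectors.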

2605.29558 2026-05-29 cs.CV 版本更新

TAE: Target-aware enhancer for nighttime UAV tracking

TAE:面向夜间无人机跟踪的目标感知增强器

Yanyan Chen, Ruigang Fu, Yu Song, Ping Zhong

发表机构 * College of Electrical Science and Technology(电子科学与技术学院) National University of Defense Technology(国防科技大学)

AI总结 提出一种目标感知的低光增强框架TAE,利用跟踪框弱监督信号进行区域感知增强和自适应RGB多曲线融合,显著提升夜间无人机跟踪性能,并贡献了包含268个序列的DarkSOT基准。

Comments Accepted at ICIP 2026. Dataset is available at: https://github.com/Fu0511/DarkSOT-Dataset

详情
AI中文摘要

夜间低光条件下的严重图像退化是基于无人机的单目标跟踪全天候应用的核心瓶颈。现有的图像增强方法通常难以区分目标和背景区域,容易放大背景噪声或损害目标特征。为克服这一限制,我们提出TAE,一种专为夜间目标跟踪设计的目标感知低光增强框架。在跟踪边界框的弱监督信号显式引导下,该框架进行区域感知增强,确保操作聚焦于目标区域。它进一步采用自适应RGB多曲线融合机制,实现不同区域的精细建模和自适应调整。为促进该领域研究,我们还贡献了DarkSOT,一个新的夜间无人机跟踪基准,包含9个目标类别的268个序列。在DarkSOT和UAVDark135上的实验结果表明,TAE显著提升了低光夜间场景下的跟踪性能,展现出强鲁棒性和泛化能力。DarkSOT数据集可在https://github.com/Fu0511/DarkSOT-Dataset获取。

英文摘要

Severe image degradation under low-light nighttime conditions constitutes a core bottleneck preventing all-day applications for UAV-based single object tracking. Existing image enhancement methods often struggle to distinguish between target and background regions, which can easily amplify background noise or compromise target features. To overcome this limitation, we propose TAE, a target-aware low-light enhancement framework tailored for nighttime object tracking. Guided explicitly by weak supervisory signals from tracking bounding boxes, the framework performs region-aware enhancement to ensure operations focus on the target area. It further adopts an adaptive RGB multi-curve fusion mechanism to achieve refined modeling and adaptive adjustment across different regions. To facilitate research in this domain, we also contribute DarkSOT, a new benchmark for nighttime UAV tracking, comprising 268 sequences across 9 target categories. Experimental results on DarkSOT and UAVDark135 demonstrate that TAE significantly improves tracking performance in low-light nighttime scenarios, exhibiting strong robustness and generalization. The DarkSOT dataset is available at https://github.com/Fu0511/DarkSOT-Dataset.

2605.29549 2026-05-29 cs.CV 版本更新

Learning Representations from 3D Gaussian Splats

从3D高斯溅射中学习表示

Julia Farganus, Krzysztof Żurawicki, Arkadiusz Gaweł, Weronika Jakubowska, Halina Kwaśnicka

发表机构 * Department of Artificial Intelligence, Wrocław University of Science

AI总结 本研究通过比较多种几何深度学习架构,评估了基于3D高斯溅射的场景表示在分类任务中的有效性,揭示了不同架构和输入特征对表示质量的影响。

Comments 5 figures, 15 pages

详情
AI中文摘要

3D高斯溅射(3DGS)是一种用于场景渲染的最新方法。尽管其主要设计用于视图合成,但其在场景理解任务中的潜力尚未得到充分探索。在这项工作中,我们对使用高斯溅射表示的3D场景分类的各种几何深度学习架构进行了比较评估。我们在传统点云数据集和专用高斯溅射数据集上对基于点和基于图的模型进行了基准测试。场景被嵌入到潜在表示中,并通过端到端分类、线性探测和聚类分析进行评估。我们的研究为不同几何感知架构和输入特征配置在学习有效3D高斯溅射表示方面的适用性提供了见解。结果突出了架构家族之间的一致差异,并揭示了高斯特定属性对表示质量的影响。

英文摘要

3D Gaussian Splatting (3DGS) is a recent approach for scene rendering. Although primarily designed for view synthesis, its potential for scene understanding tasks remains underexplored. In this work, we conduct a comparative evaluation of various geometric deep learning architectures for the classification of 3D scenes represented using Gaussian Splatting. We benchmark point-based and graph-based models across both traditional point cloud datasets and dedicated Gaussian Splatting datasets. Scenes are embedded into latent representations, which are evaluated through end-to-end classification, linear probing, and clustering analysis. Our study provides insight into the suitability of different geometry-aware architectures and input feature configurations for learning effective 3D Gaussian Splat representations. The results highlight consistent differences between architectural families and reveal the impact of Gaussian-specific attributes on the quality of representation.

2605.29538 2026-05-29 cs.CV 版本更新

RadioFormer3D: Weakly Supervised 3D Radio Map Estimation in Low-Altitude Airspace via Generative Modeling

RadioFormer3D:通过生成式建模在低空空域中进行弱监督三维无线电地图估计

Zheng Fang, Junjie Liu, Kangjun Liu, Jianguo Zhang, Yaowei Wang, Ke Chen

发表机构 * Pengcheng Laboratory(鹏城实验室) Southern University of Science and Technology(南方科技大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出RadioFormer3D模型,采用傅里叶采样编码器、体素解码器和联合频谱完整性损失,在弱监督下实现三维空间稀疏测量的无线电地图估计,有效提升未标注高度层的重建质量。

详情
AI中文摘要

随着三维环境中无线应用(如低空空域和三维异构网络)的出现,无线电地图估计越来越需要表征信号在水平和垂直维度上的传播。然而,由于空间稀疏性增加和连续高度上的监督有限,将无线电地图估计从二维扩展到三维仍然具有挑战性。在本文中,我们提出了RadioFormer3D,一种专门用于弱监督下体素频谱重建的模型。基于RadioFormer的双流多粒度融合架构,RadioFormer3D引入了基于傅里叶的采样编码器和体素解码器,以有效处理三维空间中的稀疏测量。为了缓解垂直监督的缺乏,我们提出了联合频谱完整性损失,它将体素级伪标签监督、地图级几何感知无线电渲染和像素级局部约束整合到一个统一的优化方案中。这种设计使模型能够在稀疏监督下更有效地捕捉复杂的垂直结构关系。在多个无线电地图数据集上的大量实验表明,与现有代表性方法相比,RadioFormer3D实现了优越的整体性能。特别是,它在保持精度和推理效率之间良好权衡的同时,在未标注高度层上展示了改进的重建质量,使其成为未来三维环境感知无线网络的一个非常有前景的解决方案。

英文摘要

With the emergence of wireless applications in three-dimensional environments, such as the low-altitude airspace and 3D heterogeneous networks, radio map estimation is increasingly required to characterize signal propagation across both horizontal and vertical dimensions. However, extending radio map estimation from 2D to 3D remains challenging due to increased spatial sparsity and limited supervision across continuous altitudes. In this paper, we propose RadioFormer3D, a specialized model for volumetric spectrum reconstruction under weak supervision. Building on the dual-stream, multi-granularity fusion architecture of RadioFormer, RadioFormer3D introduces a Fourier-based sampling encoder and a volumetric decoder to efficiently process sparse measurements in 3D space. To alleviate the lack of vertical supervision, we propose the Joint Spectrum Integrity Loss, which integrates volume-level pseudo-label supervision, map-level geometry-aware radio rendering, and pixel-level localized constraints within a unified optimization scheme. This design enables the model to capture complex vertical structural relationships more effectively under sparse supervision. Extensive experiments across several radio map datasets show that RadioFormer3D achieves superior overall performance compared to representative existing methods. In particular, it demonstrates improved reconstruction quality at unlabeled altitudes while maintaining a favorable trade-off between accuracy and inference efficiency, positioning it as a highly promising solution for future 3D environment-aware wireless networks.

2605.29531 2026-05-29 cs.SD cs.CV cs.LG 版本更新

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

使用交叉注意力特征融合的半真音频深度伪造检测与定位

S. Sutharya, Remya K. Sasi

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出CAFNet模型,通过三元分类和边界回归联合检测部分伪造音频,在MLADDC数据集上达到92.71%准确率和0.075s定位误差。

Comments 13 pages, 5 figures, 11 tables

详情
AI中文摘要

音频深度伪造检测通常作为二分类问题研究,但部分篡改语音(其中一段短合成片段被拼接进真实语音)构成了更困难且更现实的威胁。检测此类半真音频不仅需要区分真实和完全伪造语音,还需要定位篡改发生的位置。我们提出了CAFNet,一个576k参数的架构,联合处理这两个任务:它在单次前向传播中执行三元分类(真实、完全伪造或半真)并回归合成区域的时间边界。CAFNet通过并行深度可分离卷积分支和交叉注意力融合梅尔频率倒谱系数(MFCC)、线性频率倒谱系数(LFCC)和色度短时傅里叶变换(Chroma-STFT)特征,随后使用双向长短期记忆(BiLSTM)回归头进行边界预测。在组合的多语言音频深度伪造检测语料库(MLADDC)T2+T3测试集上,CAFNet达到92.71%的准确率和0.9910的宏观曲线下面积(AUC),边界定位平均绝对误差(MAE)为0.075秒,中位误差为0.052秒。在二分类检测中,它达到96.76%的准确率和3.20%的等错误率(EER),以超过500倍的参数减少优于微调的XLS-R 300M(78.31%)和AST 87M(93.03%)。跨数据集研究进一步表明,即使在降低骨干学习率的情况下,标准微调也会破坏跨域表示。

英文摘要

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

2605.29505 2026-05-29 cs.CV 版本更新

ESAM++: Efficient Online 3D Perception on the Edge

ESAM++:边缘上的高效在线3D感知

Qin Liu, Lavisha Aggarwal, Saptarashmi Bandyopadhyay, Vikas Bahirwani, Marc Niethammer, Ehsan Adeli, Andrea Colaco

发表机构 * Stanford University(斯坦福大学) Google(谷歌) UC San Diego(加州大学圣地亚哥分校)

AI总结 提出ESAM++,一种轻量级可扩展的在线3D场景感知方法,通过3D稀疏特征金字塔网络(SFPN)在边缘设备上实现高效、准确的3D实例分割。

详情
AI中文摘要

实时在线3D场景感知对于机器人、AR/VR和自主系统至关重要,尤其是在计算资源有限且隐私至关重要的边缘计算场景中。最近的最先进方法如EmbodiedSAM(ESAM)通过利用Segment Anything Model(SAM)进行实时、细粒度且泛化的3D实例分割,展示了在线3D感知的前景。然而,ESAM仍然依赖计算昂贵的3D稀疏UNet进行点云特征提取,这占据了3D推理时间的大部分,阻碍了其在资源受限设备上的实用性。在本文中,我们提出ESAM++,一种轻量级且可扩展的在线3D场景感知替代方案,专为无GPU加速的边缘设备设计。我们的方法引入了3D稀疏特征金字塔网络(SFPN),该网络高效地从流式3D点云中捕获多尺度几何特征,同时显著降低计算开销和模型大小。我们在四个具有挑战性的分割基准(即ScanNet、ScanNet200、SceneNN和3RScan)上评估了我们的方法,结果表明,与ESAM相比,我们的模型在实现竞争性精度的同时,推理速度提升高达3倍,模型大小缩小2倍,从而能够在边缘设备上实际部署。

英文摘要

Online 3D scene perception in real time is essential for robotics, AR/VR, and autonomous systems, particularly in edge computing scenarios where computational resources are limited and privacy is crucial. Recent state-of-the-art methods like EmbodiedSAM (ESAM) demonstrate the promise of online 3D perception by leveraging the Segment Anything Model (SAM) for real-time, fine-grained, and generalized 3D instance segmentation. However, ESAM still relies on a computationally expensive 3D sparse UNet for point cloud feature extraction, which accounts for the majority of the 3D inference time, hindering its practicality on resource-constrained devices. In this paper, we propose ESAM++, a lightweight and scalable alternative for online 3D scene perception tailored to edge devices without GPU acceleration. Our method introduces a 3D Sparse Feature Pyramid Network (SFPN) that efficiently captures multi-scale geometric features from streaming 3D point clouds while significantly reducing computational overhead and model size. We evaluate our approach on four challenging segmentation benchmarks, namely ScanNet, ScanNet200, SceneNN, and 3RScan, demonstrating that our model achieves competitive accuracy with up to 3 times faster inference with a 2 times smaller model size compared to ESAM, enabling practical deployment on edge devices.

2605.29498 2026-05-29 cs.CL cs.CV 版本更新

Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting

Mask the Target: 一种即插即用的正则化器,用于对抗LoRA遗忘

Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

发表机构 * Australian Institute for Machine Learning(澳大利亚机器学习研究所)

AI总结 针对LoRA微调中目标分布与原始训练分布差异大时导致的灾难性遗忘问题,提出一种无需重放数据的输出空间正则化方法,通过遮蔽目标token并仅对非目标词汇进行KL正则化,在不增加推理开销的前提下改善新学习与遗忘之间的平衡。

Comments In Submission

详情
AI中文摘要

低秩适应(LoRA)已成为将大型语言模型适应新领域、任务和用户的最广泛使用的微调机制之一。然而,仅凭适应性能可能掩盖一个重要失败模式:LoRA更新可能在提升目标分布性能的同时,削弱预训练和对齐阶段学习到的先前能力。我们表明,当适应分布与模型的原始训练或对齐分布存在显著差异时,这种遗忘变得尤为严重。在实际场景中,原始训练和对齐数据通常不可用,这加剧了挑战。受此约束,我们研究了基于LoRA的适应如何在无重放设置中平衡新学习与遗忘,并引入了一个简单的输出空间正则化器,可直接添加到现有训练流程中。我们的方法从基模型和适应模型分布中移除真实标记,重新归一化剩余概率,并仅对非目标词汇应用KL正则化。这保留了基模型在替代标记之间的相对偏好,同时不直接对抗适应所需的交叉熵信号。由于正则化器仅在损失层面起作用,它不需要重放数据、架构更改、适配器重新设计或推理时开销,并且可以直接应用于现有LoRA变体。在所有测试的LoRA变体和各种骨干网络上,当适应分布与基模型的原始训练或对齐分布存在显著差异时,我们的方法改善了新学习与遗忘之间的边界,表明这是一条通往更可靠LLM更新的广泛适用途径。

英文摘要

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
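The masking step described in this abstract can be illustrated for a single token position. The following is a minimal sketch, not the authors' implementation: the function names, the KL direction (base to adapted), and the weight `lam` are assumptions made for illustration, since the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_renormalize(probs, target):
    """Drop the ground-truth token and renormalize the remaining mass."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]

def masked_kl(base_logits, adapted_logits, target, eps=1e-12):
    """KL(base || adapted) over the non-target vocabulary only.

    The direction of the KL term is an assumption of this sketch.
    """
    p = masked_renormalize(softmax(base_logits), target)
    q = masked_renormalize(softmax(adapted_logits), target)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_loss(base_logits, adapted_logits, target, lam=0.1):
    """Cross-entropy on the target plus the masked KL term (lam is hypothetical)."""
    ce = -math.log(softmax(adapted_logits)[target])
    return ce + lam * masked_kl(base_logits, adapted_logits, target)
```

In this sketch, raising only the target logit leaves the masked KL term at essentially zero, which matches the abstract's point that the regularizer does not directly oppose the cross-entropy signal. Reordering the non-target logits makes the term positive.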

2605.29496 2026-05-29 cs.CL cs.CV 版本更新

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

视觉语言模型后训练中推理与感知的非对称优化研究

Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 通过合成任务诊断发现,后训练中推理提升显著优于感知,SFT源于感知token少导致训练信号弱,RL源于奖励耦合,提出动态重加权损失和感知奖励可缓解不平衡并提升端到端性能。

Comments Project: https://asymmetric-vlm-post-training.github.io/

详情
AI中文摘要

后训练极大地提升了前沿视觉语言模型中的推理能力,但其对感知的提升相对有限,这成为端到端视觉推理的瓶颈。为探究这一差距,我们引入了一个受控的诊断框架,包含两个将感知与推理分离的合成任务。我们的分析揭示了一致的感知-推理非对称性:后训练对推理的提升显著大于感知,尽管其内在机制因训练范式而异。对于监督微调(SFT),这种非对称性源于思维链监督中的token不平衡,其中感知占据较少token,因此接收到的训练信号较弱。动态重加权损失可缓解这种不平衡,并将端到端性能提升高达18.2。对于强化学习(RL),非对称性则源于奖励耦合:结果奖励与推理的相关性比与感知更强,从而削弱了感知学习的信号。添加感知感知奖励可缓解不平衡,并将端到端准确率提升高达6.0;即使没有真实感知奖励,可靠的替代奖励也能提供有用信号,带来3.2个百分点的提升。综合来看,我们的结果全面诊断了非对称优化,并提出了平衡感知与推理的具体干预措施。

英文摘要

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.
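The token-imbalance argument for SFT can be made concrete with a toy calculation. The sketch below is one possible reweighting, assuming each token is tagged as either perception or reasoning. The abstract does not describe the paper's actual dynamic scheme, so the group-balanced average and all names here are illustrative assumptions.

```python
def token_mean_loss(losses):
    """Standard SFT objective: every token gets the same weight."""
    return sum(losses) / len(losses)

def group_balanced_loss(losses, tags):
    """Average within each tag group first, then across groups.

    Weights depend on the per-sample token counts, so a short perception
    span is not drowned out by a long reasoning span.
    """
    groups = {}
    for loss, tag in zip(losses, tags):
        groups.setdefault(tag, []).append(loss)
    group_means = [sum(v) / len(v) for v in groups.values()]
    return sum(group_means) / len(group_means)

# Hypothetical sample: 2 perception tokens (high loss), 8 reasoning tokens.
losses = [2.0, 2.0] + [0.5] * 8
tags = ["perception"] * 2 + ["reasoning"] * 8
uniform = token_mean_loss(losses)             # perception carries 20% of the weight
balanced = group_balanced_loss(losses, tags)  # perception carries 50% of the weight
```

With uniform token averaging, the two perception tokens contribute only a fifth of the weight. The balanced variant gives both groups equal weight, so the perception error has a larger effect on the objective.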

2605.29471 2026-05-29 cs.CV 版本更新

V2XCrafter: Learning to Generate Driving Scene Across Agents

V2XCrafter:学习生成跨智能体的驾驶场景

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

发表机构 * Hong Kong JC STEM Lab of Smart City(香港JC智能城市STEM实验室) City University of Hong Kong(香港城市大学) Lingnan University(岭南大学)

AI总结 提出V2XCrafter框架,通过渐进式多智能体扩散模型和跨智能体注意力模块,生成跨智能体相机视角的一致可控协作驾驶场景,以增强数据并提升下游协作3D目标检测性能。

详情
AI中文摘要

协作驾驶系统利用车联网(V2X)通信进行多智能体协作感知,以提升驾驶安全性,但仍受限于标注的真实世界V2X驾驶数据集稀缺以及在多样化驾驶条件下的泛化能力有限。虽然图像生成技术为数据增强提供了可行的解决方案,但现有针对单车辆多视角场景的方法在多智能体驾驶设置中面临两个基本挑战:(1)学习目标的扩展降低了生成质量;(2)跨智能体的高度动态变化阻碍了对联合观测对象物理属性(如颜色、类别)一致性的建模。为弥补这一差距,我们提出V2XCrafter,这是首个用于跨智能体相机视角生成可控且逼真的协作驾驶场景的框架。为了实现有效学习,我们基于单智能体骨干网络开发了一种渐进式多智能体扩散模型,利用相邻智能体的潜在状态作为参考信号,逐步引导从单智能体到多智能体的扩散过程。为解决跨车辆不一致性问题,我们提出了一个跨智能体注意力模块,该模块利用协作视图图和可学习的联合观测对象表示来建模动态的跨智能体相机视角关系。实验表明,V2XCrafter能够生成高保真且可控的街道视图,并保持跨智能体的一致性,从而有效提升下游协作3D目标检测任务的效果。

英文摘要

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scenes across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representations to model the dynamic cross-agent camera view relationships. Experiments show that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing downstream collaborative 3D object detection.

2605.29462 2026-05-29 cs.CV cs.AI 版本更新

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

大型视觉语言模型在CFMME上的基准测试:一个全面的中文金融多模态评估数据集

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang

发表机构 * Qwen DianJin Team, Alibaba Cloud Computing(通义点金团队,阿里云计算)

AI总结 提出CFMME,一个包含6052个实例的中文金融多模态评估基准,涵盖八种主要金融图像模态和四项核心多模态任务,用于评估LVLMs在金融业务全流程中的感知、理解、推理和认知能力。

详情
AI中文摘要

大型视觉语言模型(LVLMs)的出现显著扩展了模型的能力,超越了仅文本理解,实现了跨视觉和文本模态的统一推理,并支持更广泛的实际应用。为了全面评估LVLMs在中文环境下整个金融业务流程中的感知、理解、推理和认知能力,我们引入了CFMME,一个新颖的中文金融多模态评估基准。CFMME包含6052个实例,涵盖从基础学术知识到复杂实际应用,涉及八种主要金融图像模态和四项核心多模态任务。在CFMME上,我们对代表性LVLMs进行了全面评估。结果表明,最先进的模型在问答任务上达到了66.11%的总体准确率,在检测、识别和信息提取任务上平均得分为77.18,表明当前LVLMs仍有很大的改进空间。此外,我们对错误原因、跨模态能力和多方向设置进行了详细分析,为未来研究提供了有价值的见解。我们希望CFMME能推动LVLMs的进一步进展,特别是在金融领域多个多模态任务上的性能提升。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.

2605.29461 2026-05-29 cs.CV 版本更新

FlowSeg: Dynamic Semantic Guidance for LLM-Conditioned Segmentation

FlowSeg: 面向大语言模型条件分割的动态语义引导

Zekang Zhang, Guangyu Gao, Youyun Tang, ChengJing Wu, Xiaochao Qu, Chi Harold Liu, Jianbo Jiao, Yunchao Wei, Luoqi Liu, Ting Liu

发表机构 * School of Computer Science, Beijing Institute of Technology(北京理工大学计算机科学学院) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院) WEI Lab, Institute of Information Science, Beijing Jiaotong University(北京交通大学信息科学学院WEI实验室) Beijing Key Laboratory of Advanced Information Science(北京高级信息科学重点实验室)

AI总结 针对大语言模型条件分割中语义错位问题,提出FlowSeg方法,通过双向语义流动态引导掩码生成,实现语义对齐并达到最优性能。

Comments 18 pages, accepted by ICML 2026

详情
AI中文摘要

大语言模型条件分割最近通过将大语言模型与迭代掩码生成框架相结合而迅速发展。然而,我们在当前的“提议-选择”流程中发现了一个持续的失败模式。尽管通常能生成高质量的掩码候选,但最终预测可能无法匹配给定的语言条件。这种失败源于语言语义通常被用作静态提示或事后匹配信号,而不是参与迭代掩码生成过程。通过系统分析,我们表明许多错误源于语义错位而非掩码质量差。为解决此问题,我们提出FlowSeg,它通过在整个生成过程中引入中间解码状态与大语言模型导出的条件嵌入之间的双向语义流,实现动态语义引导。语言条件在每个阶段主动引导掩码细化,而条件嵌入则通过出现的视觉证据逐步更新。这种设计产生了语义基础的掩码表示和视觉对齐的语言条件,从而实现更可靠的匹配。我们进一步引入轻量级边界感知细化,以选择性增强不确定区域而不扰动置信内部。在指代表达分割和推理分割任务上的大量实验表明,FlowSeg持续改善语言-掩码对齐,并达到最先进的性能。项目页面:https://zkzhang98.github.io/FlowSeg_page

英文摘要

LLM-conditioned segmentation has recently advanced rapidly by coupling large language models with iterative mask generation frameworks. However, we identify a persistent failure mode in current propose-then-select pipelines. Although high-quality mask candidates are often generated, the final prediction may fail to match the given linguistic condition. This failure arises because language semantics are typically used as static prompts or post-hoc matching signals, rather than participating in the iterative mask generation process. Through systematic analysis, we show that many errors stem from semantic misalignment rather than poor mask quality. To address this issue, we propose FlowSeg, which introduces dynamic semantic guidance via a bidirectional semantic flow between intermediate decoding states and LLM-derived condition embeddings throughout the generation process. Language conditions actively guide mask refinement at each stage, while condition embeddings are progressively updated by emerging visual evidence. This design yields semantically grounded mask representations and visually aligned language conditions, enabling more reliable matching. We further incorporate a lightweight boundary-aware refinement to selectively enhance uncertain regions without perturbing confident interiors. Extensive experiments on referring expression segmentation and reasoning segmentation tasks demonstrate that FlowSeg consistently improves language-mask alignment and achieves state-of-the-art performance. Project page: https://zkzhang98.github.io/FlowSeg_page

2605.29460 2026-05-29 cs.CV 版本更新

FedSmoothLoRA: Toward Smoother and Faster Convergence in Federated Low-Rank Adaptation

FedSmoothLoRA:面向更平滑和更快速收敛的联邦低秩适配

Zehao Wang, Guanglei Yang, Yihan Zeng, Hang Xu, Hongzhi Zhang, Wangmeng Zuo, Chun-Mei Feng

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah’s Ark Lab(华为诺亚方舟实验室) University College Dublin(都柏林大学学院)

AI总结 针对联邦低秩适配中更新空间有限、轮间状态不匹配和客户端无关起始状态的问题,提出FedSmoothLoRA框架,通过轮匹配矩阵和梯度对齐矩阵实现更平滑和更快速的收敛。

Comments 26 pages, 4 figures

详情
AI中文摘要

使用低秩适配(LoRA)对基础模型进行联邦微调提供了一种高效的解决方案,可在降低通信和计算成本的同时保持数据本地性。然而,FedAvg与LoRA的直接组合存在三个关键问题:有限的更新空间限制了模型的有效学习能力;轮间状态不匹配破坏了跨轮局部优化的连续性;以及客户端无关的起始状态减慢了客户端上的局部收敛。尽管最近的方法通过跨通信轮将LoRA更新合并到主干中缓解了有限更新空间问题,但轮间状态不匹配和客户端无关的起始状态仍未得到充分解决。为了解决这些问题,我们提出了FedSmoothLoRA,一个联邦LoRA微调框架,它保留了扩大的更新空间,改善了跨轮局部优化的连续性,并为局部训练提供了客户端感知的起始状态。在每个通信轮,FedSmoothLoRA使用两个矩阵构建局部LoRA初始化:一个轮匹配矩阵,用于保持跨轮局部状态连续性;以及一个梯度对齐矩阵,用于从局部数据估计的梯度信号提供客户端特定的优化指导。这些设计共同实现了更平滑和更快速的收敛。在图像分类和自然语言生成任务上的大量实验表明,FedSmoothLoRA始终优于现有的联邦LoRA微调方法。代码:https://github.com/wangzehao0704/FedSmoothLoRA

英文摘要

Federated fine-tuning of foundation models with Low-Rank Adaptation (LoRA) provides an efficient solution for reducing communication and computation costs while preserving data locality. However, the direct combination of FedAvg and LoRA suffers from three key issues: limited update space, which restricts the model's effective learning capacity; inter-round state mismatch, which disrupts cross-round local optimization continuity; and a client-agnostic starting state, which slows local convergence on clients. Although recent methods mitigate the limited update space issue by merging LoRA updates into the backbone across communication rounds, inter-round state mismatch and the client-agnostic starting state remain insufficiently addressed. To address these issues, we propose FedSmoothLoRA, a federated LoRA tuning framework that preserves the enlarged update space, improves cross-round local optimization continuity, and provides a client-aware starting state for local training. At each communication round, FedSmoothLoRA constructs the local LoRA initialization using two matrices: a Round-Matching matrix that preserves cross-round local state continuity, and a Gradient-Aligned matrix that provides client-specific optimization guidance from gradient signals estimated on local data. Together, these designs enable smoother and faster convergence. Extensive experiments on image classification and natural language generation tasks demonstrate that FedSmoothLoRA consistently outperforms existing federated LoRA tuning methods. Code: https://github.com/wangzehao0704/FedSmoothLoRA

2605.29455 2026-05-29 cs.CV eess.SP 版本更新

Uni-RCM: Unified Reference-guided Cross-modal Mapping for Multi-Class Anomaly Detection

Uni-RCM:面向多类异常检测的统一参考引导跨模态映射

Yangchen Wu, Huiqiang Xie

发表机构 * School of Information Science and Technology, Jinan University(信息科学与技术学院,暨南大学)

AI总结 提出Uni-RCM框架,通过参考引导块和离线残差量化器,实现多类工业异常检测的统一建模,在MVTec-3D AD数据集上达到最优性能。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

多模态工业异常检测通常依赖于每个产品类别的单独模型,从根本上限制了实际可扩展性。当转向同时处理多种类别的统一范式时,由于类间干扰和特征流形混淆,检测精度往往会下降。为了克服这些挑战,我们提出了一个统一的参考引导跨模态映射框架,命名为Uni-RCM。其核心是,我们提出了一个参考引导块,通过引入可学习的参考特征来动态过滤特定类别的噪声,该参考特征捕捉了不同模态之间的共性。此外,我们提出了一个离线残差量化器,通过多个级联码本来表征正态分布。在MVTec-3D AD数据集上的大量评估表明,在具有挑战性的多类设置以及图像级检测和像素级定位方面,该方法达到了最先进的性能。

英文摘要

Multi-modal industrial anomaly detection typically relies on separate models for each product category, fundamentally limiting practical scalability. When shifting to a unified paradigm that handles diverse classes simultaneously, detection accuracy often degrades due to inter-class interference and feature manifold confusion. To overcome these challenges, we propose a Unified Reference-guided Cross-modal Mapping framework, named Uni-RCM. At its core is a reference guide block that dynamically filters out category-specific noise by introducing a learnable reference feature, which captures the commonalities across different modalities. In addition, an offline residual quantizer is proposed to characterize the normal distribution with multiple cascaded codebooks. Extensive evaluations on the MVTec-3D AD dataset demonstrate state-of-the-art performance in the challenging multi-class setting, in both image-level detection and pixel-level localization.

2605.29452 2026-05-29 cs.CV 版本更新

Comparative evaluation of photogrammetric reconstruction methods and 3D Gaussian Splatting for road surface roughness analysis

摄影测量重建方法与3D高斯泼溅用于路面粗糙度分析的比较评估

Marouane Elmegdar, Teng Xiao

发表机构 * School of International Education, Hubei University of Technology(湖北工业大学国际教育学院) School of Computer Science, Hubei University of Technology(湖北工业大学计算机科学学院)

AI总结 本研究比较了COLMAP、Meshroom、Metashape和3D高斯泼溅四种重建方法,评估它们从智能手机图像估计路面粗糙度的能力,结果表明COLMAP对微纹理最敏感,而开源方法适用于低成本路面监测。

Comments Accepted by RSMIP 2026

详情
AI中文摘要

基于图像的三维重建为传统的基于传感器的路面评估技术提供了一种低成本替代方案。本研究比较了四种重建流程——COLMAP、Meshroom、Metashape和3D高斯泼溅(3DGS),以评估它们从智能手机图像估计路面粗糙度的能力。所有点云均在CloudCompare中使用一致的工作流程进行处理,包括方向对齐、分割、法线估计以及在0.2、0.4和0.6模型单位的邻域半径下进行粗糙度计算。结果表明,COLMAP对微纹理的灵敏度最高,而Meshroom产生具有中等粗糙度变化的平衡重建。Metashape由于其内部滤波而生成最平滑的几何形状,3DGS捕捉到可见的不规则性但表现出更高的噪声和较低的密度。比较表明,开源管道可用于相对粗糙度评估,为低成本路面监测提供了一种实用方法。

英文摘要

Image-based 3D reconstruction offers a low-cost alternative to traditional sensor-based techniques for road surface assessment. This study compares four reconstruction pipelines--COLMAP, Meshroom, Metashape, and 3D Gaussian Splatting (3DGS)--to evaluate their ability to estimate road surface roughness from smartphone imagery. All point clouds were processed in CloudCompare using a consistent workflow involving orientation alignment, segmentation, normal estimation, and roughness computation at neighborhood radii of 0.2, 0.4, and 0.6 model units. The results show that COLMAP provides the highest sensitivity to micro-texture, while Meshroom yields balanced reconstructions with moderate roughness variation. Metashape produces the smoothest geometry due to its internal filtering, and 3DGS captures visible irregularities but exhibits higher noise and lower density. The comparison demonstrates that open-source pipelines are viable for relative roughness evaluation, offering a practical approach for low-cost pavement monitoring.

2605.29448 2026-05-29 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

数据集值多少钱?缩放定律、Vendi分数与矩阵谱函数

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

发表机构 * Department of Electrical & Computer Engineering(电气与计算机工程系) Paul G. Allen School of Computer Science & Engineering(保罗·G·艾伦计算机科学与工程学院) University of Washington(华盛顿大学)

AI总结 本文通过子模性理论统一了神经缩放定律与Vendi分数,提出矩阵谱函数作为广义数据评估框架,并开发了基于割线方程的快速优化算法,在ImageNet-1K规模上实现了约35,000倍加速,实验表明设施选址函数在预测子集价值方面表现最佳。

Comments 75 pages

详情
AI中文摘要

神经缩放定律通过数据集大小评估数据,而Vendi分数使用量子熵衡量数据集价值。我们证明常见的神经缩放定律目标和Vendi分数都是子模的。进一步,我们表明Vendi分数是一类更广泛的子模目标(称为矩阵谱函数)的特例,这还包括行列式点过程(DPP)目标以及许多其他目标。我们还引入了弱矩阵单调函数,并展示了它们如何导致弱子模矩阵谱函数,从而产生一系列实用的数据评估目标。我们开发了基于割线方程的更新方法,避免了贪心优化过程中的重复特征分解,将$m$维嵌入的边际增益评估相对于预言机查询减少了$O(m)$因子。这实现了平均约35,000倍的实证加速,使得在ImageNet-1K规模的数据集上直接优化Vendi分数成为可能。由此,我们比较了多个目标在固定大小、类别平衡和固定训练预算条件下预测训练子集对保留测试性能价值的能力,包括Vendi分数、DPP、设施选址以及三种新的矩阵谱变体。在多个数据集上,设施选址表现最佳。直接优化还揭示,虽然Vendi分数在中等分数范围内具有预测性,但将目标推向更高值可能使其成为下游性能的糟糕代理。我们还发现,均匀随机选择的固定大小子集(无论是否类别平衡)在评估分数和保留性能上都表现出显著的集中性。最后,我们表明大小、类别平衡和训练预算单独并不决定数据价值:即使控制这些因素,性能范围也从好到差平滑变化。

英文摘要

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
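Facility location, the objective this abstract reports as the best predictor, is a standard submodular function, and greedy selection on it is easy to sketch. The example below is illustrative only: the similarity matrix is made up, similarities are assumed non-negative, and the naive greedy loop does not use the paper's secular-equation updates, which target the spectral objectives.

```python
def facility_location(sim, subset):
    """f(S) = sum over items i of max_{j in S} sim[i][j], with f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def greedy_select(sim, k):
    """Pick k items, each time adding the item with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        current = facility_location(sim, chosen)
        best, best_gain = None, float("-inf")
        for j in range(len(sim)):
            if j in chosen:
                continue
            gain = facility_location(sim, chosen + [j]) - current
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.append(best)
    return chosen

# Hypothetical similarities: items 0-1 form one cluster, items 2-3 another.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
chosen = greedy_select(sim, 2)
```

On this toy matrix the greedy procedure takes one representative from each cluster rather than two near-duplicates, which is the coverage behaviour that makes facility location a natural data-appraisal objective.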

2605.29447 2026-05-29 cs.CV cs.CL 版本更新

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

恢复策略诱导错误:鲁棒GUI智能体的基准测试与轨迹合成

Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang

发表机构 * Alibaba(阿里巴巴)

AI总结 提出GUI-RobustEval基准和鲁棒驱动轨迹合成框架RoTS,通过树状管道主动发现错误模式并合成恢复步骤,训练模型在GUI任务上取得最先进性能。

Comments ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix

详情
AI中文摘要

尽管GUI智能体发展迅速,但它们通常缺乏从自身错误中恢复的鲁棒性,阻碍了实际部署。为了在评估和数据层面弥补这一差距,我们引入了GUI-RobustEval并提出了鲁棒驱动轨迹合成。GUI-RobustEval包含1,216个可执行测试用例,系统性地衡量在广泛且真实的错误模式下的错误恢复能力。在数据层面,RoTS是一个可扩展的合成框架,通过树状管道主动发现多样化的错误模式并合成相应的恢复步骤,创建了80万高质量数据。我们的两个模型RoTS-7B和RoTS-32B,在数据集上微调后,在GUI-RobustEval和传统GUI基准测试上均表现出显著提升。值得注意的是,RoTS-32B在OSWorld上达到了最先进性能,成功率为47.4%,All-Pass@4得分为33.8%,表明改进的长时域错误恢复能力有助于鲁棒性和整体性能。我们的代码可在https://github.com/AlibabaResearch/RoTS获取。

英文摘要

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data samples via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

2605.29429 2026-05-29 cs.CV 版本更新

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

每细胞类型一次点击足矣:无需训练的组交互用于细胞实例分割

Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim

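Affiliations * OGQ, Korea; Seoul National University, Korea; LG CNS, Korea

AI summary: Proposes the Group Prompting paradigm, in which a single click per cell type segments all instances of that type. Building on the feature-clustering property of SAM's frozen encoder, it designs a training-free Chain-of-Prompts framework that recursively expands the click and maintains high performance across multiple benchmarks.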

Comments Accepted to MICCAI 2026 (Early Accept)

详情
AI中文摘要

在特定细胞数据集上训练的细胞实例分割模型在分布外的细胞类型上性能严重下降,而交互式基础模型通过每个实例提示克服了这一点,但对于包含数百到数千个密集实例的组织病理学图像,其成本过高。我们引入了组提示,这是一种新范式,将交互式分割从每个实例 $O(N)$ 转变为每个类型 $O(T)$,其中每细胞类型一次点击即可分割该类型的所有实例。我们的关键观察是,Segment Anything Model (SAM) 的冻结图像编码器在给出任何提示之前,已经在其特征空间中对相同类型的细胞进行了聚类。利用这一特性,我们提出了Chain-of-Prompts (CoP),这是一个无需训练的框架,通过以下方式递归扩展单个用户点击:(1) 通过非参数门控多尺度编码器特征识别可靠的相同类型位置,以及 (2) 选择空间上最远的可靠点作为下一个提示以最大化覆盖范围。在三个细胞类型标注的基准上,每类型一次点击的CoP保留了超过90%的每个实例性能,并且无需任何额外训练就超越了全监督方法。在四个形态均匀的基准上,一次点击保留了超过99%。项目页面:https://shjo-april.github.io/Chain-of-Prompts/

英文摘要

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
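Step (2) of the abstract, choosing the most spatially distant reliable point as the next prompt, can be sketched as a farthest-point selection. This is a minimal illustration only: the abstract does not say whether distance is measured to the last prompt or to all previous prompts, so the max-min rule below is an assumption, as are the function names and the fixed list of reliable points (the paper re-identifies reliable locations at each step from encoder features).

```python
import math

def farthest_reliable_point(prompts, reliable_points):
    """Return the reliable point whose distance to its nearest existing prompt is largest.

    Assumption: "most spatially distant" is read as the max-min (farthest-point) rule.
    """
    def gap(p):
        return min(math.dist(p, q) for q in prompts)
    return max(reliable_points, key=gap)

def expand_clicks(first_click, reliable_points, num_prompts):
    """Grow a prompt chain from one user click (hypothetical helper, not the authors' API)."""
    prompts = [first_click]
    remaining = [p for p in reliable_points if p != first_click]
    while remaining and len(prompts) < num_prompts:
        nxt = farthest_reliable_point(prompts, remaining)
        prompts.append(nxt)
        remaining.remove(nxt)
    return prompts

chain = expand_clicks((0, 0), [(1, 0), (10, 0), (5, 0)], num_prompts=3)
```

Picking the point farthest from all earlier prompts spreads the chain over the image quickly, which matches the stated goal of maximizing coverage.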

2605.29416 2026-05-29 cs.RO cs.CV 版本更新

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

3DVLA:通过3D空间和实例理解增强视觉-语言-动作模型

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

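Affiliations * Wangxuan Institute of Computer Technology, Peking University

AI summary: Proposes the 3DVLA framework, which addresses VLA models' lack of 3D scene understanding through multi-view-consistent 3D feature encoding, an instance estimation module, and masked self-supervised 3D encoding, significantly improving manipulation performance on LIBERO-Plus and RoboTwin 2.0.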

详情
AI中文摘要

视觉-语言-动作模型在机器人操作中取得了显著进展,但存在一个关键限制:缺乏3D场景理解。这一缺陷表现为三个相互交织的挑战:在不强制执行多视角一致性的情况下弱提取3D空间位置、不足的3D实例理解以及遮挡下的脆弱推理。尽管存在成熟的3D感知方法,但由于架构不兼容以及对昂贵实例级标注的严重依赖,它们难以直接集成到VLA流程中。为解决上述挑战,我们提出3DVLA,一个即插即用框架,将稳健的3D推理注入预训练的VLA,无需额外人工标注或丢弃VLM先验。具体来说,3DVLA通过以下方式应对三个挑战:(1)在所有模态上具有显式多视角一致性约束的普遍3D特征编码和空间条件几何聚合方法,(2)具有高级实例令牌的实例估计模块以实现3D实例感知,以及(3)保留预测器用于视觉令牌完成的掩码自监督3D编码分支以处理遮挡。我们将3DVLA与多个VLA基线集成,并在LIBERO-Plus和RoboTwin 2.0上进行评估。结果显示操作性能持续且显著提升,验证了我们方法的有效性和即插即用兼容性。

英文摘要

Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.

2605.29415 2026-05-29 eess.IV cs.CV cs.LG eess.SP stat.ML 版本更新

Constructing efficient channels for ideal observers using the conjugate gradient method

使用共轭梯度法构建理想观察者的高效通道

Weimin Zhou

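Affiliations * University of Arizona, Wyant College of Optical Sciences; University of Arizona, Department of Radiology & Imaging Sciences

AI summary: For task-based assessment of image quality in medical imaging systems, proposes a conjugate gradient (CG)-based method for constructing efficient channels that approximate the performance of the Bayesian Ideal Observer (IO) and the Hotelling observer (HO).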

Comments Submitted to the Journal of Medical Imaging (JMI) Special Issue Honoring Dr. Harrison H. Barrett

详情
AI中文摘要

基于任务的图像质量(IQ)评估对于医学成像系统的设计和优化至关重要。理想观察者,包括贝叶斯理想观察者(IO)和理想线性观察者(即霍特林观察者(HO)),提供了客观的品质因数(FOM),用于量化系统在信号检测任务上的性能。然而,将理想观察者应用于高维图像数据通常在计算上难以处理。通道机制提供了一种有效的降维框架,可以促进理想观察者的计算。本文提出了一种基于共轭梯度(CG)的方法,用于构建近似IO和HO性能的高效通道。

英文摘要

Task-based assessment of image quality (IQ) is critically important for the design and optimization of medical imaging systems. Ideal observers, including the Bayesian Ideal Observer (IO) and the ideal linear observer, i.e., the Hotelling observer (HO), provide objective figures of merit (FOMs) that quantify system performance on signal detection tasks. However, the application of ideal observers to high-dimensional image data is often computationally intractable. Channel mechanisms provide an effective framework for dimensionality reduction that can facilitate the computation of ideal observers. This work presents a conjugate gradient (CG)-based method to construct efficient channels for approximating the IO and HO performance.
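As background for the CG ingredient: the Hotelling observer template is conventionally the solution w of K w = Δs, where K is the image covariance matrix and Δs is the mean signal difference, and conjugate gradient solves such symmetric positive-definite systems iteratively without forming K's inverse. The sketch below is plain textbook CG on a toy 2x2 system; it is not the channel construction proposed in the paper, which the abstract does not describe, and the names `K`, `delta_s`, and `w` are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(K, v):
    return [dot(row, v) for row in K]

def conjugate_gradient(K, b, tol=1e-10, max_iter=100):
    """Solve K w = b for a symmetric positive-definite matrix K."""
    w = [0.0] * len(b)
    r = list(b)            # residual b - K w for the initial guess w = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break          # residual norm is small enough
        Kp = matvec(K, p)
        alpha = rs_old / dot(p, Kp)
        w = [wi + alpha * pi for wi, pi in zip(w, p)]
        r = [ri - alpha * kpi for ri, kpi in zip(r, Kp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return w

# Toy covariance and mean signal difference (made-up numbers for illustration).
K = [[4.0, 1.0], [1.0, 3.0]]
delta_s = [1.0, 2.0]
w = conjugate_gradient(K, delta_s)
```

Each iteration needs only matrix-vector products with K, which is why CG is attractive when images are high-dimensional.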

2605.29402 2026-05-29 cs.CV cs.AI 版本更新

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

面向高效长视频推理的语义与视觉证据:HD-EPIC VQA挑战赛的解决方案

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li

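Affiliations * Lenovo, China

AI summary: Proposes a unified framework that decouples long-video reasoning into semantic evidence (coarse-to-fine extraction of global procedural structure) and visual evidence (object-centric fine-grained grounding), combined with query-conditioned evidence retrieval and integration, achieving competitive performance in the HD-EPIC VQA Challenge.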

详情
AI中文摘要

理解长格式自我中心视频对于多模态大语言模型(MLLMs)仍然具有挑战性,原因在于有限的上下文长度和对细粒度视觉细节的定位不足。最近提出的HD-EPIC基准突出了这些局限性:即使是强大的长上下文模型,在多样化的视频问答任务中也表现较低。在本文中,我们提出了一个统一框架,将长视频推理解耦为两种互补的证据形式:语义证据和视觉证据。语义证据通过粗到细的提取流程捕获全局过程结构,而基于目标的视觉证据通过边界框和视觉嵌入保留细粒度的定位。在推理过程中,我们将推理形式化为查询条件的证据检索和整合过程,动态地从两个来源选择相关信息。我们的方法在HD-EPIC-VQA挑战赛的多个任务类别中取得了竞争性能。更广泛地说,我们的结果表明,显式地结构化、检索和整合语义与视觉证据对于使用MLLMs进行有效的长视频理解至关重要。

英文摘要

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.

2605.29390 2026-05-29 cs.CV 版本更新

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

注意力特征空间中的正交负引导用于文本到图像生成

Jungmin Ko, Jungwon Park, Jimyeong Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

Affiliations * Interdisciplinary Program in Artificial Intelligence, Seoul National University; Research Institute for Convergence Science, Seoul National University; Artificial Intelligence Institute, Seoul National University; Department of Intelligence and Information, Seoul National University; Daegu Gyeongbuk Institute of Science and Technology; Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd

AI summary: Proposes an orthogonal negative guidance method in attention feature space: it orthogonalizes negative-prompt attention features against positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts without training while preserving image quality and prompt alignment.

Comments Preprint

详情
AI中文摘要

文本到图像(T2I)模型生成高质量图像的能力日益增强。然而,强制显式地避免指定对象或属性仍然是一个根本性的难题。现有方法,包括提示否定、事后编辑和负引导,对于显式概念抑制仍显不足,常常无法移除目标概念或降低整体图像质量。为此,我们提出了注意力特征空间中的正交负引导方法,这是一种无需训练的方法,在基于MM-DiT的T2I变换器的注意力输出空间中操作。我们的方法将负提示注意力特征相对于正提示特征进行正交化,并仅减去正交分量,从而在保留期望语义的同时抑制不需要的概念。在FLUX-dev和FLUX-schnell上的实验表明,我们的方法在概念抑制、提示对齐和图像质量之间取得了有利的权衡。在人工评估中,我们的方法比第二好的基线高出18.78%。我们进一步展示了该方法支持多概念抑制和可调概念抑制。

英文摘要

Text-to-image (T2I) models have become increasingly capable of generating high-quality images. Yet, enforcing the explicit absence of a specified object or attribute remains a fundamentally challenging problem. Existing approaches, including prompt negation, post-hoc editing, and negative guidance, remain insufficient for explicit concept suppression, often failing to remove the target concept or degrading overall image quality. To this end, we propose Orthogonal Negative Guidance in attention feature space, a training-free method that operates in the attention output space of MM-DiT-based T2I transformers. Our method orthogonalizes negative-prompt attention features with respect to positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving desired semantics. Experiments on FLUX-dev and FLUX-schnell show that our method achieves favorable trade-offs between concept suppression, prompt alignment, and image quality. In human evaluation, our method outperforms the second-best baseline by 18.78%. We further show that our method supports multi-concept suppression and adjustable concept suppression.
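The core operation described above, orthogonalizing the negative-prompt feature against the positive-prompt feature and subtracting only the orthogonal part, can be sketched on a single feature vector. This is a minimal sketch under stated assumptions: the `scale` parameter, the per-vector granularity, and the function name are illustrative and not taken from the paper, which operates on attention outputs inside MM-DiT blocks.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonal_negative_guidance(pos, neg, scale=1.0, eps=1e-12):
    """Subtract from `pos` only the component of `neg` that is orthogonal to `pos`.

    `scale` is a hypothetical guidance strength; `eps` guards against a zero vector.
    """
    coeff = dot(neg, pos) / (dot(pos, pos) + eps)
    # Remove the part of `neg` that lies along `pos` (Gram-Schmidt step).
    neg_orth = [n - coeff * p for n, p in zip(neg, pos)]
    return [p - scale * o for p, o in zip(pos, neg_orth)]

pos = [1.0, 0.0]
neg = [1.0, 1.0]
guided = orthogonal_negative_guidance(pos, neg)
```

Because the subtracted vector is orthogonal to `pos`, the update leaves the projection of the result onto the positive direction unchanged, which is one way to read the claim that desired semantics are preserved.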

2605.29380 2026-05-29 cs.LG cs.AI cs.CV 版本更新

TRACER: Persistent Regularization for Robust Multimodal Finetuning

TRACER: 用于鲁棒多模态微调的持久正则化

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

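Affiliations * School of Computing and Information Systems (CIS), Faculty of Engineering and IT (FEIT), University of Melbourne, Australia

AI summary: Proposes TRACER, which achieves persistent regularization through a weighted moving average teacher, addressing catastrophic forgetting and EMA collapse in multimodal contrastive finetuning and improving out-of-distribution robustness.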

Comments ICML 2026

详情
AI中文摘要

微调预训练多模态模型的主流策略通常会降低分布外(OOD)鲁棒性,这种现象被称为灾难性遗忘。在本文中,我们为多模态对比微调开发了一个理论框架,为每种策略提供了闭式解和几何分解。该框架表明,自蒸馏在保留预训练模型知识方面比其他正则化方法更有效。我们的分析揭示了一个被广泛忽视的局限性:在鲁棒微调中广泛使用的标准指数移动平均(EMA)教师存在坍缩问题。为了解决这个问题,我们证明加权移动平均(WMA)教师在有限时间范围内保持持久的正则化力,并在任务子空间中实现无偏收敛,同时保留正交知识。这些见解促使了**TRACER**(**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization)的提出,它将对比学习与WMA引导的多视角蒸馏相结合。在CLIP微调上的大量实验表明,在三种骨干架构上,OOD准确率和校准性能持续提升,全面的消融实验证实TRACER既有理论依据,又对超参数选择具有鲁棒性。代码可在[https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER)获取。

英文摘要

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).
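For readers unfamiliar with EMA teachers, the standard update is sketched below. The abstract does not define "collapse" or the WMA weighting, so neither is reproduced here; the sketch only shows a well-known property of EMA, namely that the coefficient on the initial (pretrained) checkpoint shrinks geometrically with the number of updates. Function names and the `decay` value are illustrative assumptions.

```python
def ema_update(teacher, student, decay):
    """Standard EMA teacher update: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

def pretrained_weight_share(decay, steps):
    """Coefficient on the initial teacher weights after `steps` EMA updates."""
    return decay ** steps

# Toy check: start the teacher at 1.0 and average against a student fixed at 0.0,
# so the teacher value equals the share still attributable to the initial weights.
teacher = [1.0]
for _ in range(3):
    teacher = ema_update(teacher, [0.0], decay=0.9)
```

With a decay of 0.9 the initial checkpoint's share falls below one percent after a few dozen updates, which gives some intuition for why a differently weighted teacher might keep a longer-lasting anchor.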

2605.29353 2026-05-29 cs.CR cs.CV 版本更新

DeepFake Forensics AI: A Multi-Modal Detection and Blockchain-Anchored Evidence Management Platform

DeepFake Forensics AI:多模态检测与区块链锚定证据管理平台

Naisha Minnah

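Affiliations * Department of Computer Science; Providence Women's College; University of Calicut

AI summary: Proposes a unified platform that trains four neural networks to detect synthetic media in images, video, and audio, and uses the Ethereum blockchain for tamper-proof storage and management of evidence.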

Comments 5 pages, 5 figures, 3 tables

详情
AI中文摘要

AI生成的合成媒体的激增对法律和取证背景下数字证据的完整性构成了严重威胁。现有的深度伪造检测系统通常处理单一模态,并且没有提供防篡改证据保存的机制。我们提出了DeepFake Forensics AI,这是一个统一平台,能够检测图像、视频和音频模态中的合成媒体,识别生成架构指纹,并将取证证据不可变地锚定在以太坊区块链上。我们的系统从头训练了四个独立的神经网络:一个EfficientNet-B4图像检测器(AUC = 0.9868)、一个双向LSTM视频检测器(AUC = 0.9628)、一个ECAPA-TDNN音频检测器(EER = 18.63%),以及一个新颖的GAN指纹模块(准确率 = 99.88%),用于识别伪造图像背后的生成架构。证据文件通过SHA-256哈希,通过Pinata存储在IPFS上,并通过基于角色的访问控制的Solidity智能合约在链上注册。该平台提供了React前端和FastAPI后端,适用于取证和法律工作流程的部署。据我们所知,这是第一个将多模态深度伪造检测与基于区块链的链上保管管理相统一的系统。

英文摘要

The proliferation of AI-generated synthetic media poses a critical threat to the integrity of digital evidence in legal and forensic contexts. Existing deepfake detection systems typically address a single modality and provide no mechanism for tamper-proof evidence preservation. We present DeepFake Forensics AI, a unified platform that detects synthetic media across image, video, and audio modalities, identifies generative architecture fingerprints, and anchors forensic evidence immutably on the Ethereum blockchain. Our system trains four independent neural networks from scratch: an EfficientNet-B4 image detector (AUC = 0.9868), a Bidirectional LSTM video detector (AUC = 0.9628), an ECAPA-TDNN audio detector (EER = 18.63%), and a novel GAN fingerprinting module (accuracy = 99.88%) that identifies the generative architecture behind a fake image. Evidence files are hashed with SHA-256, stored on IPFS via Pinata, and registered on-chain via a Solidity smart contract with role-based access control. The platform provides a React frontend and FastAPI backend suitable for deployment in forensic and legal workflows. To our knowledge, this is the first system to unify multi-modal deepfake detection with blockchain-based chain-of-custody management.
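The hashing step of the evidence pipeline can be illustrated with Python's standard `hashlib`. This is a sketch of SHA-256 fingerprinting only; the IPFS upload and the smart-contract registration are not shown, and the helper names are illustrative rather than taken from the platform's code.

```python
import hashlib

def sha256_bytes(data):
    """Hex SHA-256 digest of an in-memory byte string."""
    return hashlib.sha256(data).hexdigest()

def sha256_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks so large media files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

fingerprint = sha256_bytes(b"abc")
```

Any later change to the file changes the digest, so comparing a fresh digest with the registered one is what makes tampering detectable.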

2605.29339 2026-05-29 cs.CV 版本更新

DMC-CF: Dynamic Multimodal CounterFactual QA benchmark for Causal Reasoning

DMC-CF: 用于因果推理的动态多模态反事实QA基准

Junzhe Zhang, Huixuan Zhang, Guirong Wang, Xingyao Zhang, Pei Liu, Lin Qu, Hu Wei, Xiaojun Wan

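Affiliations * Wangxuan Institute of Computer Technology, Peking University; Alibaba Group

AI summary: To address existing causal reasoning datasets being limited in scale or built on non-realistic data, proposes DMC-CF-Static, a large-scale multimodal causal counterfactual reasoning benchmark built on real-world videos, and uses the Dynamic Graph Intervention framework to construct the dynamic evaluation benchmark DMC-CF-Dynamic; experiments show that the causal reasoning of current multimodal large models in real-world scenarios still needs substantial improvement.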

详情
AI中文摘要

随着多模态大语言模型(MLLMs)的快速发展,模型已展现出日益强大的多模态能力。然而,通过统计学习训练的MLLMs能否真正理解现实世界背后的因果关系仍是一个关键研究问题。近年来,众多多模态因果推理数据集被提出,但这些数据集要么规模有限,要么基于合成图像和视频、卡通内容或其他非真实多模态来源构建。为解决这些局限性,我们收集真实世界视频并构建了DMC-CF-Static,一个大规模多模态因果反事实推理基准。此外,为缓解传统静态评估中的数据污染等问题,我们使用因果图表示因果事件,并提出动态图干预(DGI)框架,从DMC-CF-Static构建动态评估基准DMC-CF-Dynamic。在包含静态和动态评估基准的整体DMC-CF上的实验结果表明,当前多模态大语言模型在真实场景下的多模态因果推理能力仍需大幅提升。

英文摘要

With the rapid advancement of multimodal large language models (MLLMs), models have demonstrated increasingly powerful multimodal capabilities. However, whether MLLMs trained through statistical learning can truly understand the causal relationships underlying the real world remains a key research question. In recent years, numerous multimodal causal reasoning datasets have been proposed. Nevertheless, these datasets are either limited in scale or constructed from synthetic images and videos, cartoon-based content, or other non-realistic multimodal sources. To address these limitations, we collect real-world videos and construct DMC-CF-Static, a large-scale benchmark for multimodal causal counterfactual reasoning. Furthermore, to mitigate issues such as data contamination in traditional static evaluation, we represent causal events using causal graphs and propose the Dynamic Graph Intervention (DGI) framework to build the dynamic evaluation benchmark DMC-CF-Dynamic from DMC-CF-Static. Experimental results on the overall DMC-CF, which includes both static and dynamic evaluation benchmarks, demonstrate that the multimodal causal reasoning capabilities of current multimodal large language models in real-world scenarios still require substantial improvement.

2605.29335 2026-05-29 cs.CV cs.AI 版本更新

Rethinking FID Through the Geometry of the Reference Dataset

通过参考数据集的几何结构重新思考FID

Yunghee Lee, Byeonghyun Pak

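AI summary: Explains the inconsistency between Fréchet Inception Distance (FID) and sample quality by analyzing geometric properties of the reference dataset (density and effective rank), and argues that generative models are evaluated more reliably when the reference dataset's geometry is taken into account.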

Comments 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks

详情
AI中文摘要

Fréchet Inception Distance (FID) 被广泛用于评估图像生成器,但较低的FID并不总是对应更好的样本质量。我们表明,这种不匹配部分取决于参考数据集的几何结构。在六个数据集的受控研究中,分布密度和有效秩显著解释了随着样本质量提高FID如何变化。集中数据集往往产生更有利的FID趋势,而更分散的数据集可能导致尽管样本更好但FID恶化。对精确率和召回率的归因以及使用替代特征空间和距离的消融实验支持了相同的结论。这些结果表明,分布度量应与参考数据集的几何结构一起解释,以实现更可靠的基准测试。

英文摘要

Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample quality. We show that this mismatch depends in part on the geometry of the reference dataset. In a controlled study across six datasets, distributional density and effective rank significantly explain how FID changes as sample quality improves. Concentrated datasets tend to yield more favorable FID trends, whereas more dispersed datasets can make FID worsen despite better samples. Attribution to precision and recall and ablations with alternative feature spaces and distances support the same conclusion. These results suggest that distributional metrics should be interpreted together with the geometry of the reference dataset for more reliable benchmarking.
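For reference, FID is conventionally computed as the Fréchet distance between two Gaussians fitted to Inception features of the reference and generated images. The symbols below are notation assumed here rather than taken from the abstract: $\mu_r, \Sigma_r$ are the feature mean and covariance of the reference set, and $\mu_g, \Sigma_g$ those of the generated set.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

Both terms depend on the reference statistics $(\mu_r, \Sigma_r)$, which is consistent with the abstract's point that the spread and rank of the reference features can shape how FID responds to changes in sample quality.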

2605.29330 2026-05-29 cs.CV 版本更新

EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation

EarthShift: 衡量地球观测中真实分布偏移鲁棒性的基准

Kelsey Doerksen, Hannah Kerner

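Affiliations * School of Computing and Augmented Intelligence; Arizona State University

AI summary: Proposes the EarthShift benchmark, which uses paired multi-source datasets to evaluate the robustness of geospatial foundation models under realistic distribution shifts in time, geography, scale, and sensor, finding that model performance drops by 15-20% on average.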

详情
AI中文摘要

当前地球观测基准侧重于衡量多样任务和应用上的性能,通常衡量分布内泛化能力。但当模型部署时,它们必须泛化到无数分布外场景,例如新的时间段、地理区域、尺度和传感器。我们提出EarthShift:首个用于衡量遥感中多种真实分布偏移鲁棒性的公开测试平台。EarthShift通过使用来自不同来源、时间窗口、地理位置和传感器的配对数据集,比较分布内和分布外的性能,使用户能够衡量分布鲁棒性。我们在8个地理空间基础模型(GFMs)和覆盖5种偏移类型的11个任务上的实验表明,无论模型架构、大小、预训练或微调策略如何,GFMs在分布外的平均性能始终低15-20%。我们表明GFM的鲁棒性与通用视觉基础模型甚至全监督模型相似。这凸显了未来研究需要致力于提升分布鲁棒性,而不仅仅是性能,这可以通过EarthShift进行基准测试。我们发布代码和数据集,提供一个测试平台,以指导未来工作创建在真实应用中鲁棒且可靠的基础模型。EarthShift的代码和数据可在https://earthshift.github.io获取。

英文摘要

Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing. EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average regardless of model architecture, size, pre-training or fine-tuning strategy. We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: https://earthshift.github.io

2605.29325 2026-05-29 cs.CV 版本更新

Multi-Stage VLM Pipeline for Zero-Shot Traffic Accident Understanding

用于零样本交通事故理解的多阶段VLM流水线

Fumiya Tatematsu, Fumihiko Takahashi

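Affiliations * GO Drive Inc

AI summary: Proposes a three-stage VLM pipeline for zero-shot traffic accident prediction on frozen Qwen3-VL-32B-Instruct and 235B MoE models, which won the CVPR 2026 ACCIDENT challenge using 9:1 blending and snapping to vehicle detections.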

Comments Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 13. Code: https://github.com/fuumin621/cvpr2026-accident-1st-place-solution

详情
AI中文摘要

我们提出了CVPR 2026 AUTOPILOT Workshop中ACCIDENT挑战赛的第一名解决方案,该挑战要求从CCTV视频中零样本预测事故时间、撞击中心点和碰撞类型。在冻结的Qwen3-VL-32B-Instruct检查点上,我们构建了一个三阶段流水线(全视频联合预测、时间细化、单帧撞击中心点定位),在235B混合专家模型上再次运行相同的流水线,以9:1的比例融合两个输出,最后将每个预测点对齐到最近的车辆检测框。最终系统在Public LB上达到0.55469,在Private LB上达到0.57080,比最强的主办方基线(Molmo-7B,0.358)高出约0.21,并赢得了挑战赛。我们对每个组件进行了消融实验,报告了影响最终设计的负面结果,并在https://github.com/fuumin621/cvpr2026-accident-1st-place-solution 上发布了代码。

英文摘要

We present the 1st-place solution to the ACCIDENT challenge at the CVPR 2026 AUTOPILOT Workshop, which asks for zero-shot prediction of accident timing, impact centroid, and collision type from CCTV footage. On a frozen Qwen3-VL-32B-Instruct checkpoint we build a three-stage pipeline (full-video joint prediction, time refinement, and single-frame grounding of the impact centroid), run the same pipeline a second time on a 235B Mixture-of-Experts sibling, blend the two outputs 9:1, and finally snap each predicted point onto the nearest vehicle detection. The final system reaches Public LB 0.55469 / Private LB 0.57080, roughly +0.21 over the strongest host baseline (Molmo-7B, 0.358) and wins the challenge. We ablate each component, report the negative results that shaped the final design, and release the code at https://github.com/fuumin621/cvpr2026-accident-1st-place-solution.
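The last two steps, blending two model outputs 9:1 and snapping the predicted point onto the nearest vehicle detection, can be sketched as follows. Several details are assumptions not stated in the abstract: that the blend is a weighted average of point coordinates, that the 32B model receives the larger weight, and that "snapping" means moving to the center of the nearest box. The function names and box format `(x1, y1, x2, y2)` are hypothetical.

```python
import math

def blend_points(p_main, p_aux, w_main=0.9):
    """Weighted average of two predicted points; 0.9 / 0.1 corresponds to a 9:1 blend."""
    return tuple(w_main * a + (1.0 - w_main) * b for a, b in zip(p_main, p_aux))

def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def snap_to_nearest_box(point, boxes):
    """Return the center of the detection box closest to `point` (assumed snapping rule)."""
    return min((box_center(b) for b in boxes), key=lambda c: math.dist(point, c))

blended = blend_points((0.0, 0.0), (10.0, 10.0))
snapped = snap_to_nearest_box(blended, [(0, 0, 4, 4), (10, 10, 20, 20)])
```

Snapping acts as a sanity constraint: an impact point predicted on empty road is pulled back onto an object that could actually be involved in a collision.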

2605.29324 2026-05-29 cs.CL cs.CV 版本更新

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

STAMP:在可控且可扩展的虚拟环境中训练移动GUI代理的显式记忆

Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

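Affiliations * Tongyi AI Lab, Alibaba Group; Beijing Jiaotong University

AI summary: Proposes the STAMP framework, which injects deterministic memory variables through controllable virtual environments to generate verifiable supervised data and support online reinforcement learning, addressing mobile GUI agents' failures on long-horizon tasks caused by limited context windows and the lack of explicit memory.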

Comments 24 pages, 4figures, 21 tables

详情
AI中文摘要

移动GUI代理在即时反应控制方面表现出色,但在需要记忆的现实长时任务中经常失败。这种失败源于有限的上下文窗口与令牌密集的屏幕截图之间的根本冲突。为了节省有限的上下文,代理必须逐步丢弃较旧的视觉历史,永久丢失关键的瞬时信息。此外,现有的以行动为中心的数据集无法教会代理记忆什么或何时显式记忆,并且增强静态真实世界数据成本高昂且缺乏交互验证。为了解决这个问题,我们提出了STAMP,一个通过可控虚拟环境训练移动代理显式记忆的框架,其中确定性记忆变量被程序化地注入到合成任务中,以控制必须记忆的内容、何时编码以及何时检索,从而大规模生成可验证的监督数据,并通过环境驱动的奖励反馈实现在线强化学习。在我们新引入的Memory-World基准测试上评估,得到的Stamp-GUI代理在GUI专用模型中达到了最先进的性能,并在我们的Memory-World基准测试上树立了新的高水位线,展示了卓越的记忆准确性和任务韧性,同时保持了强大的通用移动导航能力。

英文摘要

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high-water mark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

2605.29318 2026-05-29 cs.GR cs.CV 版本更新

FreeForm: Reduced-Order Deformable Simulation from Particle-Based Skinning Eigenmodes

FreeForm: 基于粒子蒙皮特征模态的降阶可变形仿真

Donglai Xiang, Vismay Modi, Rishit Dagli, Ty Trusty, Gilles Daviet, Anka He Chen, Nicholas Sharp, David I. W. Levin

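Affiliations * NVIDIA; University of Toronto

AI summary: Proposes a mesh-free, reduced-order simulation method for hyperelastic objects based on the Reproducing Kernel Particle Method, constructing reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy, achieving a 40x training speedup and lower simulation error.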

Comments CVPR 2026, project website: https://research.nvidia.com/labs/sil/projects/freeform/

详情
AI中文摘要

我们提出了一种新的无网格、降阶可变形超弹性物体仿真方法。现有的降阶弹性动力学仿真工作要么通过网格表示输入几何体(由于扫描和三角化复杂形状的挑战,网格难以获得),要么通过需要逐形状优化的神经场表示。我们提出采用再生核粒子法(RKPM)表示,通过求解弹性能量Hessian矩阵上的广义特征系统,构建降阶蒙皮权重。我们证明,与神经场的逐形状优化相比,该公式不仅实现了40倍的训练加速,而且在与有限元方法的收敛结果进行评估时,实现了更低的仿真误差。我们在不同表示(包括网格和高斯溅射)的各种物体上展示了仿真结果,以及我们的方法在机器人仿真下游任务中的应用。

英文摘要

We present a novel formulation for mesh-free, reduced-order simulation of deformable hyperelastic objects. Existing work in reduced-order elastodynamic simulation represents the input geometry by either meshes, which can be difficult to obtain due to challenges in scanning and triangulating complex shapes, or by neural fields that require per-shape optimization. We propose to adopt a Reproducing Kernel Particle Method (RKPM) representation, which enables the construction of reduced-order skinning weights by solving a generalized eigensystem on the Hessian matrix of the elastic energy. We demonstrate that this formulation not only leads to a 40x training speedup compared with the per-shape optimization of neural fields, but also achieves lower simulation error when evaluated against the converged results of the finite element method. We show our simulation results on a wide variety of objects in different representations, including meshes and Gaussian splats, as well as the application of our method in the downstream task of robot simulation.

2605.29316 2026-05-29 cs.CV 版本更新

CapTalk: Text-Guided Stylization and Speech-Driven 3D Head Animation

CapTalk: 文本引导的风格化与语音驱动的3D头部动画

Xuangeng Chu, Yuan Gan, Ziteng Cui, Shuhong Liu, Jian Wang, Bing Zhou, Tatsuya Harada

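Affiliations * The University of Tokyo; Snap Research, Snap Inc.; RIKEN AIP

AI summary: Proposes the CapTalk framework, which controls speaking style and emotion through text descriptions and combines them with speech-driven generation of synchronized lip motion and facial expressions, supporting dynamic emotion changes.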

详情
AI中文摘要

音频驱动的3D面部动画旨在从任意音频片段生成同步的唇部运动和生动的面部表情。现有方法虽能产生同步唇动,但通常依赖预定义的身份或风格潜在特征,限制了用户自由控制说话风格的能力。此外,将固定风格或身份应用于整个音频片段通常导致面部动画风格无法适应音频的情感内容。为解决这些挑战,我们重新审视风格与情感的纠缠,构建了一个包含风格和情感文本描述的大规模数据集,并提出了一种新颖的说话头生成框架,能够分别控制风格和情感。我们的模型以说话风格和角色情感的文本描述以及驱动音频流为输入,能够实时生成与描述高度同步的唇部运动和面部表情。此外,我们的模型在推理时支持动态情感控制,能够处理目标情感在语音过程中变化的情况。

英文摘要

Audio-driven 3D facial animation aims to generate synchronized lip movements and vivid facial expressions from arbitrary audio clips. While existing methods can produce synchronized lip motions, they often rely on predefined identity or style latent features, which limits users' ability to freely control speaking styles. Moreover, applying a fixed style or identity to an entire audio segment typically results in facial animation styles that do not adapt to the emotional content of the audio. To address these challenges, we revisit the entanglement between style and emotion, construct a large-scale dataset with textual descriptions of both style and emotion, and propose a novel talking head generation framework that enables separate control over style and emotion. Our model takes as input both textual descriptions of speaking style and character emotion, as well as the driving audio stream, enabling real-time generation of highly synchronized lip movements and facial expressions that match the provided descriptions. Furthermore, our model supports dynamic emotion control during inference, allowing it to handle scenarios where the target emotion changes throughout the speech.

2605.29302 2026-05-29 cs.CV 版本更新

ViASNet: A Video Ad Saliency Network for Predicting Dynamic Saliency and Viewer Engagement

ViASNet:用于预测动态显著性和观众参与度的视频广告显著性网络

Jianping Ye, Michel Wedel

Affiliations * Department of Mathematics, University of Maryland, College Park, MD 20742, USA; Robert H. Smith School of Business, University of Maryland, College Park, MD 20742, USA

AI summary: Proposes ViASNet, a model based on the 3D U-Net architecture that incorporates audio and scene semantics to predict dynamic saliency maps for video ads, and diagnoses viewer engagement through entropy analysis.

详情
AI中文摘要

数字媒体领域已普遍转向电视、社交媒体和电子商务平台上的短视频广告。本研究聚焦于短视频广告的深度显著性预测。深度显著性模型已被用于生成人类眼动注视模式的预测,以增强用户与数字技术的交互并优化其设计。对于视频广告,动态显著性图捕捉观众观看的位置和时间,揭示视频广告为何有效以及如何优化其内容。我们开发并测试了一种新的深度动态显著性预测模型ViASNet(视频广告显著性网络),其架构基于3D U-Net,并考虑了音频和场景语义的影响。我们评估了该模型在151个视频广告上的性能,每个广告约有20名观众观看并记录其眼动,并通过消融实验探索影响模型性能的关键因素。我们逐帧计算预测显著性图的熵,作为诊断工具来识别未能吸引观众的广告和场景,并在15个未见广告的测试数据上展示了其应用。我们的研究表明,通过基于ViASNet等深度显著性模型的自动化系统,可以显著加快广告设计和测试的速度。

英文摘要

The digital media landscape has seen a pervasive shift toward short-form video advertising on TV, social media and e-commerce platforms. The present study focuses on deep saliency prediction for short-form video advertising. Deep saliency models have been used to generate predictions of human eye fixation patterns with the purpose of enhancing user interaction with digital technology and optimizing its design. For video ads, dynamic saliency maps capture where and when viewers are looking, revealing why video ads are effective, and how their content should be optimized. We develop and test a new deep dynamic saliency prediction model called ViASNet (Video Ad Saliency Network), which has an architecture founded on the 3D U-Net, and accommodates the influence of audio and the semantic meaning of scenes. We assess the model's performance on 151 video ads, each seen by about 20 viewers while their eye movements were tracked, and explore the critical factors influencing model performance through ablation experiments. We calculate the entropy of the predicted saliency maps frame-by-frame as a diagnostic tool to identify ads and scenes that fail to engage viewers, and illustrate its use on test data of 15 unseen ads. Our study reveals that ad design and testing can be sped up considerably through automated systems built on deep saliency models such as ViASNet.
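The frame-by-frame entropy diagnostic can be sketched by treating each predicted saliency map as a probability distribution over pixels and computing its Shannon entropy. The normalization, the base-2 logarithm, and the function name are assumptions made for illustration; the abstract does not give the exact formula or the threshold used to flag a scene.

```python
import math

def saliency_entropy(saliency_map):
    """Shannon entropy (in bits) of one saliency map given as a 2D list of non-negative values."""
    values = [v for row in saliency_map for v in row]
    total = sum(values)
    if total <= 0:
        raise ValueError("saliency map must contain positive mass")
    # Normalize to a distribution; zero-valued pixels contribute nothing.
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)

def entropy_per_frame(frames):
    """Entropy trace over a sequence of saliency maps, one value per frame."""
    return [saliency_entropy(frame) for frame in frames]

uniform_frame = [[1.0, 1.0], [1.0, 1.0]]   # attention spread evenly
peaked_frame = [[1.0, 0.0], [0.0, 0.0]]    # attention on a single location
trace = entropy_per_frame([uniform_frame, peaked_frame])
```

A map with attention spread evenly over the frame has the maximum entropy for its size, while a map concentrated on one location has entropy zero, so the trace summarizes how focused the predicted gaze is over time.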

2605.29230 2026-05-29 cs.CV cs.AI 版本更新

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

面向道德的面部年龄估计:无需儿童数据训练的广义零样本基准

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

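Affiliations * New York University

AI summary: Proposes a generalized zero-shot benchmark that excludes children's data during training and evaluates generalization to unseen age groups, finding that all methods suffer severe performance degradation and seen-class bias.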

Comments 12 pages; 3 figures; 5 tables

详情
Abstract

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
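
The split protocol described above can be sketched in a few lines. This is an illustrative reading of the abstract, not the benchmark's released code: the age boundaries come from the abstract, while the function names, the hash-based subject assignment, and the 10%/10% validation and test fractions are assumptions made for the example.

```python
import hashlib


def assign_partition(age):
    """Map an age in years to the partition described in the abstract."""
    if age < 18:
        return "zero_shot_eval"     # never seen during training
    if age < 60:
        return "seen"               # split further into train/val/test
    return "unseen_validation"      # 60+, used for model selection


def seen_split(subject_id, val_frac=0.1, test_frac=0.1):
    """Place a subject (not an image) in train/val/test.

    Hashing the identity keeps all images of one person in the same split,
    which is one way to obtain subject-exclusive splits (fractions are assumed).
    """
    digest = hashlib.sha256(subject_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # deterministic value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def assign(sample):
    """Route one sample, given as a dict with 'age' and 'subject_id' keys."""
    partition = assign_partition(sample["age"])
    if partition == "seen":
        return seen_split(sample["subject_id"])
    return partition
```

Because the split depends only on the subject identifier, two photos of the same person taken at ages 30 and 45 always land in the same split, which is the identity-leakage guarantee the abstract refers to.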

2605.29221 2026-05-29 cs.CV updated version

An Approach for Thyroid Nodule Analysis Using Thermographic Images

J. R. González, É. O. Rodrigues, C. P. Damião, C. A. P. Fontes, A. C. Silva, A. C. Paiva, H. Li, C. Du, A. Conci

Affiliations * Computer Science Department, Universidade Federal Fluminense; Radiology Department, Hospital Universitário Antônio Pedro (HUAP); Applied Computation Group NCA-UFMA, Universidade Federal do Maranhão

AI summary Reviews thermography for thyroid analysis, proposes image acquisition protocols and an autonomous registration method, and analyses the images through feature extraction, image processing, and a classification approach that separates healthy from unhealthy patients.

Details
Journal ref
Application of Infrared to Biomedical Sciences 2017
Abstract

Thyroid cancer is projected to become the second most common type of cancer in women and the third in men by 2030. In general, detecting cancer in its early stages improves the individual's chance of survival. Thermography is a diagnostic tool that has been increasingly used to detect cancer and abnormalities, including those of the thyroid. Various methods have been proposed to segment and detect hot regions in thermograms and, consequently, to detect suspicious tissues present in these images. It is well known that medical diagnosis yields a great deal of information. Thus, physicians have to comprehensively analyse and evaluate this information in a short period of time, which is infeasible in most cases. In this work, we perform a general review of thermography, focusing on thyroid analysis. We propose protocols for image acquisition and an autonomous registration method for thyroid images. We also perform analyses of the image data, which include feature extraction, image processing, and a possible approach for classifying patients as healthy or unhealthy. In summary, this work presents a pilot project for detection of tumors in our university hospital, which is part of an effort to support preventive medical actions in our endocrinology department. After some future adjustments, this project will be submitted for approval by the ethics and research committee of Hospital Universitário Antônio Pedro at Universidade Federal Fluminense (HUAP-UFF) and by the Brazilian Ministry of Health ethics committee under the name: Evaluation of the importance of thermography to aid diagnosis of thyroid nodules of patients in HUAP-UFF (in Portuguese: Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF).

2605.29220 2026-05-29 cs.CV updated version

Motion-guided sparse correction enables expert-quality point tracking across diverse microscopy regimes

Leonidas Zimianitis, Pasindu Thenahandi, Kai Buckhalter, Dineth Jayakody, Julian O. Kimura, Xinyue Liang, Karen Cunningham, Azeem Ahmad, Balpreet S. Ahluwalia, Sampath Jayarathna, Nikos Chrisochoides, Brandon Weissbourd, Dushan N. Wadduwage

Affiliations * Department of Computer Science, Old Dominion University; Department of Biology, Massachusetts Institute of Technology; The Picower Institute for Learning and Memory, Massachusetts Institute of Technology; Department of Physics and Technology, UiT--The Arctic University of Norway; Department of Physics, University of Oslo; School of Data Science, Old Dominion University; Department of Physics, Old Dominion University

AI summary Proposes RIPPLE, which uses motion-guided sparse correction to achieve expert-quality point tracking across diverse microscopy videos, cutting manual annotation effort 3- to 25-fold.

Details
Abstract

Tracking the dynamics of non-canonical biological systems in microscopy videos remains a persistent challenge. Both classical and learning-based trackers depend on expert-reviewed data to be evaluated and adapted, yet exhaustive manual annotation rarely scales to the videos where these tools are needed most. We developed RIPPLE (Refinement Interpolation Platform for Point Location Estimation), which recasts annotation as sparse correction: a user clicks a starting point, RIPPLE proposes a full trajectory, and the user intervenes only where the trajectory drifts. We tested RIPPLE on five challenging microscopy datasets from our laboratories, four from the transparent jellyfish Clytia hemisphaerica and one tracking landmarks on rapidly moving sperm. Across these datasets, RIPPLE matched the quality of exhaustive manual annotation while reducing manual clicks 3- to 25-fold, depending on the dataset. RIPPLE thereby fills a missing layer between manual annotation and fully automated tracking, enabling immediate quantification of biological dynamics, method benchmarking, and the production of the gold-standard data needed to adapt future automated microscopy trackers.

2605.29217 2026-05-29 cs.CV updated version

Towards the automated segmentation of epicardial and mediastinal fats: A multi-manufacturer approach using intersubject registration and random forest

É. O. Rodrigues, A. Conci, F. F. C. Morais, M. G. Pérez

Affiliations * Institute of Computing; Institute of Medicine; Fac. de Ing. en Sist. Electr. e Ind.; Universidade Federal Fluminense; Universidade Federal do Rio de Janeiro; Universidad Técnica de Ambato

AI summary Proposes a fully automated method based on intersubject registration and random forest for segmenting epicardial and mediastinal fat in CT images, reaching a mean accuracy of 98.4% and a Dice similarity index of 96.8%.

Details
Journal ref
2015 IEEE International Conference on Industrial Technology (ICIT)
Abstract

The amount of fat surrounding the heart is correlated with several health risk factors such as carotid stiffness, coronary artery calcification, atrial fibrillation, atherosclerosis, cancer incidence and others. Furthermore, cardiac fat varies independently of the subject's overall fat, which makes quantitative analysis of these adipose tissues essential. Clinical decision support systems are computer programs capable of evaluating information and providing a corresponding diagnosis or data to complement the physicians' analyses. The aim of this work is to propose a method capable of fully automatically segmenting two types of cardiac adipose tissue, separated from each other by the pericardium, on CT images obtained by the standard acquisition protocol used for coronary calcium scoring. Much effort was devoted to minimizing user intervention and easing reproducibility. The proposed methodology consists of a registration step, which roughly aligns input images to a standard; an extraction of features related to pixels and their surrounding area; and a segmentation step based on data mining classification algorithms that decide whether an incoming pixel is of a certain type. Experiments showed that the mean accuracy achieved for the epicardial and mediastinal fats was 98.4%, with a mean true positive rate of 96.2%. On average, the Dice similarity index was 96.8%.
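
The three figures reported at the end of this abstract (accuracy, true positive rate, Dice similarity index) are all derived from the same pixel-level confusion counts. The sketch below shows the standard definitions on flattened binary masks; the toy masks and function names are made up for illustration and are unrelated to the paper's data. It assumes the ground truth contains at least one positive pixel.

```python
def confusion_counts(pred, truth):
    """Count true/false positives and negatives over two flat binary masks."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1
        elif p and not t:
            fp += 1
        elif not p and t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Accuracy, true positive rate, and Dice index for one class."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "true_positive_rate": tp / (tp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }


pred = [1, 1, 1, 0, 0, 0, 0, 0]   # toy predicted fat mask
truth = [1, 1, 0, 1, 0, 0, 0, 0]  # toy ground-truth mask
metrics = segmentation_metrics(pred, truth)
```

On the toy masks, accuracy is 0.75 while Dice is about 0.667, which illustrates why accuracy alone can look optimistic when the background dominates the image.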

2605.29212 2026-05-29 cs.CV cs.HC updated version

MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality

Yujin Park, Haejun Chung, Ikbeom Jang

Affiliations * Hanyang University; Hankuk University of Foreign Studies

AI summary Proposes MetaRanker, a human-in-the-loop active ranking framework that evaluates metalens image quality by semantic interpretability, reducing pairwise annotation by about 80% while producing rankings closely aligned with human assessments.

Comments 12 pages, 6 figures

Details
Abstract

Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception--distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision--language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.

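2605.29136 2026-05-29 cs.CV cs.LG updated version

Eulerian Gaussian Splatting using Hashed Probability Pyramids

Mia Gaia Polansky, George Kopanas, Stephan Garbin, Todd Zickler, Dor Verbin

Affiliations * Harvard University; Google DeepMind; Google

AI summary Proposes a probabilistic splat-based radiance field framework that replaces heuristic primitive manipulation with gradient-based optimization of a volumetric probability density, instantiated with a multi-scale hashed grid for end-to-end optimization; it reaches state-of-the-art reconstruction quality on mip-NeRF 360 at 3DGS-level rendering speed.

Comments CVPR 2026. Project Page: https://euleriansplatting.github.io

Details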
Abstract

We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density using a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based optimization. To stabilize the optimization, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our framework eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF 360 while preserving 3DGS-level rendering speed.

2605.29122 2026-05-29 cs.CV updated version

Robust Cross-Domain Generalization Using Unlabeled Target Data with Source-Domain Supervision

Yuyue Zhou, Shrimanti Ghosh, Michael Xie, Justin JY Kim, Jessica Knight, Steel McDonald, Vincent Man, Jacob L. Jaremko, Abhilash Hareendranathan

Affiliations * Department of Radiology and Diagnostic Imaging, University of Alberta

AI summary Targets cross-device generalization of medical imaging AI models with a strategy that combines label-free target-domain pretraining (masked image modeling and contrastive learning) with supervised source-domain training, yielding a Dice improvement of over 6% in pediatric wrist ultrasound fracture assessment.

Details
Abstract

It is often desirable to generalize medical imaging AI models trained with dense annotations to data acquired from different ultrasound scanners or clinical sites; however, retraining these models with new annotations is often difficult and costly. We examine this challenge in pediatric wrist fracture assessment using point-of-care ultrasound (POCUS), where fractures are common and can be effectively triaged via ultrasound. AI has shown radiologist-level performance for fracture detection, often aided by high-quality bony structure segmentation. However, due to significant domain shifts, models perform poorly on data from other centers or probes, and obtaining segmentation labels across devices is impractical due to manual annotation effort and data privacy concerns. To address this, we propose a target-informed self-supervised pretraining and model-ensemble strategy. Specifically, our approach combines masked image modeling (MIM) and contrastive learning to learn target-domain structural representations without labels, and introduces a confidence-aware infusion head to adaptively integrate predictions. The source dataset, collected with a Philips Lumify probe, contained dense labels, while the target dataset, acquired with a TeleMED portable probe, was unlabeled. The datasets were kept strictly separate throughout the entire process. Our method used labeled source data for supervised training and leveraged target-domain pretraining to improve generalization. On 318 images from 62 pediatric POCUS videos, this approach significantly improved cross-device performance, achieving over 6% Dice improvement on the target domain versus the baseline. These results demonstrate a label-efficient and privacy-preserving approach for cross-device-robust ultrasound AI, offering a framework that can be extended to multi-center studies or federated learning setups.

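2605.29098 2026-05-29 cs.CV updated version

Seeing through boxes: Non-Line-of-Sight 3D Reconstruction from Radar Signals

Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh

Affiliations * École Polytechnique Fédérale de Lausanne

AI summary Proposes GeRaF 2.0, a unified line-of-sight and non-line-of-sight neural geometry reconstruction framework that uses outside line-of-sight geometry to constrain and guide RF signal propagation, giving stable training, physically consistent reconstruction, and a new state of the art in RF-based geometry reconstruction.

Comments Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

Details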
Abstract

Reconstructing object geometry from radio frequency (RF) signals is fundamentally challenging due to the lensless imaging nature of RF sensing, which leads to low spatial resolution and high noise. Unlike light signals, RF signals can penetrate occlusions and thus capture information about hidden scenes. Existing Non-Line-of-Sight (NLoS) 3D neural reconstruction methods can recover coarse surfaces inside enclosed environments but often suffer from unstable optimization, noisy surface geometry, and surface ambiguity, failing to produce accurate zero-level sets from the signed distance field (SDF). These limitations largely stem from neglecting the role of Line-of-Sight (LoS) geometry outside the enclosed region, which provides valuable physical constraints for modeling signal propagation. In this paper, we introduce GeRaF 2.0, a unified LoS and NLoS neural geometry reconstruction framework that leverages the outside LoS geometry to model and guide RF propagation from the LoS region into the NLoS region. By integrating visual LoS priors into the neural field formulation, GeRaF 2.0 achieves stable training and physically consistent reconstruction of both visible and hidden geometry, setting a new state of the art in RF-based geometry reconstruction.

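2605.29097 2026-05-29 cs.CV updated version

GeRaF: Neural Geometry Reconstruction from Radio Frequency Signals

Jiachen Lu, Hailan Shanbhag, Haitham Al Hassanieh

Affiliations * École Polytechnique Fédérale de Lausanne

AI summary Proposes GeRaF, which uses neural implicit learning to reconstruct near-range 3D geometry from RF signals, addressing low resolution, noise, and specular reflection through filter-based rendering, physics-based RF volumetric rendering, and a lensless sampling strategy.

Comments Accepted at NeurIPS 2025 (Spotlight)

Details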
Journal ref
Advances in Neural Information Processing Systems 38 (2026): 94200-94230
Abstract

GeRaF is the first method to use neural implicit learning for near-range 3D geometry reconstruction from radio frequency (RF) signals. Unlike RGB or LiDAR-based methods, RF sensing can see through occlusion but suffers from low resolution and noise due to its lensless imaging nature. While lenses in RGB imaging constrain sampling to 1D rays, RF signals propagate through the entire space, introducing significant noise and leading to cubic complexity in volumetric rendering. Moreover, RF signals interact with surfaces via specular reflections, requiring fundamentally different modeling. To address these challenges, GeRaF (1) introduces filter-based rendering to suppress irrelevant signals, (2) implements a physics-based RF volumetric rendering pipeline, and (3) proposes a novel lensless sampling and lensless alpha blending strategy that makes full-space sampling feasible during training. By learning signed distance functions, reflectiveness, and signal power through MLPs and trainable parameters, GeRaF takes the first step towards reconstructing millimeter-level geometry from RF signals in real-world settings.

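2605.29092 2026-05-29 cs.CV cs.LG cs.MM updated version

Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection

Sunghwan Baek, Tariq Anwaar, Karanveer Singh, Rita Singh

Affiliations * Carnegie Mellon University

AI summary Proposes a lightweight fusion module that combines handcrafted features (a wavelet-denoised feature with either a phase spectrum or local binary patterns), markedly improving the robustness of video face forgery detection with a negligible increase in parameters.

Comments 13 pages, 6 figures, 3 tables

Details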
Abstract

Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million). Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8 and 4.4 percentage points over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL on eight public benchmarks, without extra data or test-time augmentation. These results show that carefully paired, handcrafted features, combined through the lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.

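2605.29089 2026-05-29 cs.LG cs.AI cs.CV updated version

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Affiliations * Auburn University; William & Mary

AI summary Proposes the OISD framework, which distills predictive signals from the final layer into intermediate layers through logit alignment and attention alignment to improve reasoning, substantially outperforming baselines on mathematical reasoning tasks.

Comments Under Review for Publication

Details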
Abstract

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD
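
The Jensen--Shannon term at the core of the logit alignment can be illustrated in isolation. The sketch below computes the JS divergence between the token distributions implied by two logit vectors, one standing in for the final layer (teacher) and one for an intermediate layer (student). It is only the textbook divergence: the signed advantage weighting, the teacher detachment, and the attention alignment described in the abstract are deliberately left out, since their exact form is not given here. All names and values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


teacher = softmax([2.0, 0.5, -1.0])  # stand-in for final-layer logits
student = softmax([1.0, 1.0, 0.0])   # stand-in for intermediate-layer logits
loss = js_divergence(teacher, student)
```

Unlike a plain KL term, the JS divergence is symmetric and stays finite even when the two distributions share no support, which may be one reason to prefer it for aligning layers whose predictions can differ sharply.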

2605.29088 2026-05-29 cs.CV updated version

A Deep Learning Iterative Framework for Sentinel-1 Stripmap Enhancement Based on Azimuth Doppler Decomposition

Juan Francisco Amieva, Christian Ayala, Roberto Del Prete, Mikel Galar

Affiliations * Tracasa Instrumental S.L.; European Space Agency; Public University of Navarre

AI summary Proposes a self-supervised enhancement framework based on azimuth subaperture decomposition that builds training pairs from the physical consistency between subaperture and full-aperture images and refines image quality through single- and multi-frame learning with iterative inference, outperforming MERLIN in PSNR and SSIM on Sentinel-1 Stripmap data.

Comments Accepted at the AI4Space Workshop, CVPR 2026

Details
Abstract

Synthetic Aperture Radar (SAR) imagery enables all-weather, day-and-night Earth observation; however, it remains difficult to interpret due to speckle noise and other intrinsic imaging artifacts. Sentinel-1 (S1) constitutes one of the most widely used spaceborne SAR missions, offering systematic global coverage, high temporal resolution, dual-polarization imaging, and free data availability. Among S1 modes, Stripmap (SM) provides the highest resolution, yet speckle noise and spatial constraints often hinder applications requiring finer spatial detail. This motivates the need for effective image enhancement strategies. In this work, we propose a self-supervised enhancement framework for S1 SM imagery based on azimuth subaperture decomposition. The method exploits the physical consistency between subaperture reconstructions and the corresponding full-aperture image to generate paired training data without external sensors, simulated ground truth, or multi-temporal stacks. The proposed framework integrates single- and multi-frame learning and incorporates an iterative inference scheme that progressively refines image quality. Experiments on real S1 SM data show that the proposed approach consistently outperforms the widely adopted self-supervised deep learning baseline MERLIN, in terms of PSNR and SSIM, while MERLIN attains higher ENL, highlighting a trade-off between structural fidelity and speckle smoothing. Overall, the results demonstrate that subaperture-based supervision provides a physically grounded, reproducible, and operationally viable approach for SAR image enhancement using S1 data. It is worth noting that the proposed approach can be extended to other SAR platforms, polarizations, and acquisition modes.
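
The trade-off noted in this abstract, better PSNR for one method and higher ENL for the other, is easier to read with the metric definitions at hand. The sketch below uses the common textbook forms: the equivalent number of looks (ENL) as squared mean over variance of intensity in a homogeneous region, and PSNR from the mean squared error against a reference. The sample values, function names, and the `peak` default are illustrative assumptions, not the paper's evaluation code.

```python
import math
from statistics import fmean, pvariance


def enl(intensities):
    """Equivalent number of looks over a homogeneous patch: mean^2 / variance.

    Higher ENL means stronger speckle smoothing. Assumes non-zero variance.
    """
    mu = fmean(intensities)
    return mu * mu / pvariance(intensities, mu)


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB. Assumes the two inputs differ."""
    mse = fmean([(r - e) ** 2 for r, e in zip(reference, estimate)])
    return 10 * math.log10(peak * peak / mse)


speckled = [1.0, 3.0]   # toy homogeneous patch with strong fluctuation
smoothed = [1.9, 2.1]   # same mean, much lower variance
print(enl(speckled))    # 4.0
```

ENL only rewards flatness inside homogeneous areas, so an over-smoothed image can score a high ENL while losing the structure that PSNR and SSIM measure, which is consistent with the trade-off the abstract describes.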

2605.29074 2026-05-29 cs.CV cs.RO updated version

Embodied3DBench: Benchmarking Low-Level Embodied Spatial Intelligence of Vision Language Models

Jiyao Zhang, Mingxu Zhang, Yitong Peng, Haoxuan Liu, Chenshuo Wang, Yuxing Long, Haoyang Huang, Dongjiang Li, Nan Duan, Hui Shen, Hao Dong

Affiliations * CFCS, School of CS, PKU; Jingdong Technology Information Technology Co., Ltd

AI summary Proposes the Embodied3DBench benchmark, which systematically evaluates the low-level spatial intelligence of vision language models in 3D environments across 6 task categories (spatial structural understanding and interaction-oriented perception), and synthesizes 1.3M QA pairs of training data to close the capability gap.

Details
Abstract

Are current Vision Language Models (VLMs) ready to comprehend and reason about complex embodied interactions in 3D environments? We introduce Embodied3DBench, a robot-centric benchmark targeting low-level spatial intelligence in embodied 3D environments. To systematically evaluate these foundational perceptual capabilities, the benchmark includes 6 task categories divided into two core groups: Spatial Structural Understanding (Grounding, Spatial Relation Prediction, and Multi-view Correspondence) and Interaction-Oriented Perception (Affordance Prediction, Grasp Point Prediction, and Trajectory Prediction). The benchmark spans 12 subcategories and contains over 21k high-quality question-answer pairs. We evaluate 13 state-of-the-art models, and the results show that while current models exhibit relatively strong high-level spatial reasoning, such as understanding object-to-object positional relations, they remain fragile in interaction-oriented perception, highlighting a significant lack of robust 3D-aware interaction priors. To actively bridge this capability gap revealed by our benchmark, we further synthesize a large-scale training dataset comprising 1.3M QA pairs. Notably, fine-tuning on this dataset yields significant improvements in low-level spatial intelligence. Ultimately, Embodied3DBench fills a critical gap by providing both a systematic evaluation framework and a scalable data solution, setting a clear target for the development of interaction-aware multimodal systems.

2605.29064 2026-05-29 cs.CL cs.CV cs.HC cs.MA updated version

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Neemias da Silva, Myriam Delgado, Rodrigo Minetto, Daniel Silver, Thiago H Silva

Affiliations * Universidade Tecnologica Federal do Parana; University of Toronto

AI summary Compares text generated by multimodal LLMs under different persona prompts and no-persona settings: captions converge, justifications vary systematically with socioeconomic and political attributes, and perception tags show no significant differences.

Comments 10 pages, 6 figures

Details
Abstract

We study how persona prompting shapes language generated by multimodal large language models in an urban perception setting. Using 59,808 annotations from 1,200 persona-conditioned agents and two no-persona settings, we analyze captions, justifications, and perception tags across personas. Captions converge strongly across personas, whereas justifications vary systematically with socioeconomic and political attributes. Perception tags show no statistically significant persona-related differences, though effect trends are observed. Topic analysis further reveals that personas emphasize different evaluative themes when interpreting the same scenes.

2605.29063 2026-05-29 eess.IV cs.CV 版本更新

Accelerating HEVC Intra Partitioning via a CNN-Hierarchical Attention Transformer Hybrid

通过CNN-分层注意力Transformer混合加速HEVC帧内划分

Krishna Kumar Sharma, Somdyuti Paul

发表机构 * Department of Artificial Intelligence, Indian Institute of Technology Kharagpur(人工智能系,印度理工学院Kharagpur)

AI总结 提出HFViT混合架构,融合重参数化深度可分离卷积与分层注意力Transformer,以低复杂度实现高效全局信息传播,在HEVC帧内划分预测中降低VMAF BD-rate惩罚并保持低CPU延迟。

详情
AI中文摘要

高效视频编码(HEVC)中的递归四叉树划分带来了大量计算开销,其中针对CTU划分预测的穷举率失真优化消耗了编码时间的主要部分。尽管通过深度学习进行划分预测已成为一种可行的编码加速器,但架构上的二分法仍未得到充分解决:CNN计算效率高,但由于其局部有效感受野而空间短视,无法捕捉长程语义关系和重复纹理;相反,基于Transformer的架构更擅长捕捉全局上下文,但会带来过高的CPU延迟,这是阻碍其在主要CPU受限环境中部署的关键缺陷。本文介绍了混合快速视觉Transformer(HFViT),这是一种旨在加速HEVC帧内模式划分预测的混合架构。HFViT将重参数化的深度可分离卷积骨干与分层注意力Transformer(HAT)机制融合,利用载体令牌方案以次二次复杂度实现高效的全局信息传播。训练后的结构融合将批归一化折叠到前一层,以进一步减少延迟。全面评估揭示了HFViT在跨分辨率加速HEVC帧内编码方面的有效性。在标准JCT-VC测试序列上,与竞争的ETH-CNN基线相比,HFViT在A、B和E类上分别将平均VMAF BD-rate惩罚降低了2.4、2.6和7.9个百分点,同时将CPU推理延迟维持在CNN基线的8%以内,并在GPU上超越其40%,为实时编码器集成建立了实际可行性。

英文摘要

The recursive quad-tree partitioning in High Efficiency Video Coding (HEVC) incurs considerable computational overhead, with exhaustive rate-distortion optimization for CTU partition prediction consuming the dominant share of encoding time. Although partition prediction through deep learning has emerged as a viable encoding accelerator, an architectural dichotomy remains largely unaddressed: CNNs are computationally efficient but spatially myopic due to their localized effective receptive fields, failing to capture long-range semantic relationships and repetitive textures; conversely, transformer-based architectures are better at capturing global context but incur prohibitive CPU latency, a critical liability that impedes deployment in encoding environments, which are predominantly CPU-bound. This paper introduces Hybrid Fast Vision Transformer (HFViT), a hybrid architecture designed to accelerate HEVC intra-mode partition prediction. HFViT fuses a reparameterized depthwise-separable convolutional backbone with a Hierarchical Attention Transformer (HAT) mechanism, leveraging a carrier token scheme to enable efficient global information propagation at sub-quadratic complexity. Post-training structural fusion collapses batch normalization into preceding layers to further reduce latency. Comprehensive evaluation reveals the efficacy of HFViT in accelerating HEVC intra-encoding across resolutions. On standard JCT-VC test sequences, HFViT reduces the average VMAF BD-rate penalty by 2.4, 2.6, and 7.9 percentage points on Classes A, B, and E, respectively, compared to the competing ETH-CNN baseline, while maintaining CPU inference latency within 8% of the CNN baseline and surpassing it on GPU by 40%, establishing practical viability for real-time encoder integration.
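The batch-normalization folding mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a plain linear layer followed by inference-mode batch normalization; the function name `fold_batchnorm` and all tensor shapes are illustrative assumptions, not taken from the HFViT code.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold y = gamma * (W x + b - mean) / sqrt(var + eps) + beta into one affine layer."""
    scale = gamma / np.sqrt(running_var + eps)   # one factor per output channel
    folded_weight = weight * scale[:, None]      # rescale each output row of W
    folded_bias = (bias - running_mean) * scale + beta
    return folded_weight, folded_bias

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
gamma = rng.normal(size=4)
beta = rng.normal(size=4)
mean = rng.normal(size=4)
var = rng.uniform(0.5, 2.0, size=4)
x = rng.normal(size=(5, 3))

# Reference path: linear layer followed by a separate batch-norm step.
reference = gamma * ((x @ W.T + b) - mean) / np.sqrt(var + 1e-5) + beta

# Fused path: a single affine layer with folded parameters.
Wf, bf = fold_batchnorm(W, b, gamma, beta, mean, var)
fused = x @ Wf.T + bf
```

The fused layer reproduces the reference output while removing one normalization pass at inference time, which is where the latency saving comes from. The same algebra applies per output channel of a convolution.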

2605.29012 2026-05-29 cs.CV 版本更新

Trajectory Constraints for Imaging Inverse Problems

成像逆问题的轨迹约束

Chaoyan Huang, Haijie Yuan, Saiprasad Ravishankar

发表机构 * Department of Computational Mathematics, Science, & Engineering, Michigan State University(密歇根州立大学计算数学、科学与工程系) Department of Electrical Engineering and Computer Science, University of Michigan(密歇根大学电气工程与计算机科学系) Department of Biomedical Engineering, Michigan State University(密歇根州立大学生物医学工程系)

AI总结 提出TRACE框架,通过相邻状态耦合约束重建轨迹,稳定扩散和迭代方法在成像逆问题中的重建过程,并提升重建质量。

Comments 20 pages, 10 figures

详情
AI中文摘要

基于扩散和迭代的方法已成为解决成像逆问题的有效工具。它们的重建过程自然形成一条由中间估计组成的轨迹。尽管这些中间估计定义了重建轨迹,但大多数方法并未显式正则化连续状态之间的转换。为了解决这一局限,我们引入了TRACE,一种无需训练的轨迹约束重建框架,通过沿轨迹耦合相邻状态来稳定重建路径。这产生了一个轨迹级模型,可解释为一系列近端更新。由于精确的近端更新通常是难解的,我们用一个神经映射来近似它。这产生了一个具有相邻状态间显式耦合的类扩散重建过程。我们提供了稳定性分析,表明时间耦合限制了轨迹变化,并且这种控制在未训练的网络更新下得以保持。在线性和非线性图像重建任务上的实验表明,TRACE提高了重建质量。轨迹级分析和消融实验证实,时间耦合直接影响重建路径上的状态转换。

英文摘要

Diffusion-based and iterative methods have become effective tools for solving imaging inverse problems. Their reconstruction process naturally forms a trajectory of intermediate estimates. Although these intermediate estimates define a reconstruction trajectory, most methods do not explicitly regularize the transitions between consecutive states. To address this limitation, we introduce TRACE, a training-free TRAjectory-Constrained rEconstruction framework that stabilizes the reconstruction path by coupling adjacent states along the trajectory. This gives a trajectory-level model that can be interpreted as a sequence of proximal updates. Since the exact proximal update is generally intractable, we approximate it with a neural mapping. This yields a diffusion-like reconstruction process with an explicit coupling between neighboring states. We provide a stability analysis showing that temporal coupling bounds trajectory variation and that this control is preserved under untrained network updates. Experiments on linear and nonlinear image reconstruction tasks show that TRACE improves reconstruction quality. Trajectory-level analyses and ablations confirm that temporal coupling directly affects state transitions along the reconstruction path.
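The idea of coupling adjacent states can be sketched with a classical proximal-point iteration on a linear least-squares problem. This is only an illustration under stated assumptions: TRACE approximates the proximal update with a neural mapping, whereas the sketch below uses the closed-form solution, and the names `coupled_step`, `run_trajectory`, and `lam` are hypothetical.

```python
import numpy as np

def coupled_step(A, y, x_prev, lam):
    """Solve argmin_x 0.5*||A x - y||^2 + 0.5*lam*||x - x_prev||^2 in closed form."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y + lam * x_prev)

def run_trajectory(A, y, lam, steps=20):
    """Return the sequence of intermediate estimates, starting from zero."""
    x = np.zeros(A.shape[1])
    states = [x]
    for _ in range(steps):
        x = coupled_step(A, y, x, lam)
        states.append(x)
    return np.stack(states)

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))
y = A @ rng.normal(size=10) + 0.1 * rng.normal(size=30)

loose = run_trajectory(A, y, lam=0.1)    # weak coupling between neighbors
tight = run_trajectory(A, y, lam=10.0)   # strong coupling between neighbors

first_step_loose = np.linalg.norm(loose[1] - loose[0])
first_step_tight = np.linalg.norm(tight[1] - tight[0])
residuals_tight = [np.linalg.norm(A @ s - y) for s in tight]
```

A larger `lam` shortens each transition between consecutive states while the data-fit residual still decreases monotonically, which is the sense in which coupling bounds trajectory variation in this toy setting.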

2605.29004 2026-05-29 cs.CV cs.GR 版本更新

Auditing Training-Free 3D Shape Retrieval with Diffused Geodesic Moments

审计基于扩散测地矩的无训练三维形状检索

Zhicheng Du, Changyue Liu, Wenji Xi, Zhaotian Xie, Zhuo Deng, Ziheng Zhang, Yang Liu, Lan Ma

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Guangzhou International Economics College(广州国际经济学院) School of Electrical and Electronic Engineering, The University of Sheffield(谢菲尔德大学电子与电气工程学院)

AI总结 本文提出扩散测地矩(DGM)作为无训练形状描述符,通过协议审计方法隔离评估局部信号设计、归一化、聚合、码本拟合和度量选择等组件的影响,并在FAUST-Reg和TOSCA数据集上验证了协议主导性。

详情
AI中文摘要

无训练形状描述符的报告检索分数混淆了局部信号设计、归一化、聚合、码本拟合和度量选择,使得孤立组件评估困难。本文将描述符评估重新定义为协议审计。我们引入扩散测地矩(DGM),一种种子条件描述符,计算稀疏隐式热响应,将其转换为距离类场,并通过跨种子和尺度的低阶矩汇总每个顶点。DGM既作为实用的非谱基线,也作为隔离协议效应的工具。在注册的FAUST基准分割(FAUST-Reg)和TOSCA形状集合上,聚合匹配实验表明,基于热核签名特征构建的独立几何矩形状描述符基线(GMSD-HKS)在此实现中获得最高分数(平均精度(mAP)/top-1分别为0.621/0.820和0.865/0.963),波核签名(WKS)仍然是强经典信号,而DGM主要在稀疏求解、非谱部署或对称信息种子帧优先时有用。更广泛的发现是方法论的:输入场和聚合协议可以主导矩公式。本文贡献了可复现的协议级联分析、用于功能映射兼容性的跨形状对齐诊断,以及设计和报告无训练形状描述符的具体建议。

英文摘要

Reported retrieval scores for training-free shape descriptors conflate local signal design, normalization, aggregation, codebook fitting, and metric choices, making isolated component evaluation difficult. This paper reframes descriptor evaluation as a protocol audit. We introduce Diffused Geodesic Moments (DGM), a seed-conditioned descriptor that computes sparse implicit heat responses, converts them to distance-like fields, and summarizes each vertex by low-order moments across seeds and scales. DGM is used both as a practical non-spectral baseline and as an instrument for isolating protocol effects. On the registered FAUST benchmark split (FAUST-Reg) and the TOSCA shape collection, aggregation-matched experiments show that an independent Geometric Moment Shape Descriptor baseline built on Heat Kernel Signature features (GMSD-HKS) obtains the highest scores in this implementation ($0.621/0.820$ and $0.865/0.963$ mean average precision (mAP)/top-1), Wave Kernel Signature (WKS) remains a strong classical signal, and DGM is useful mainly when sparse solves, non-spectral deployment, or symmetry-informative seed frames are priorities. The broader finding is methodological: the input field and aggregation protocol can dominate the moment formula. The paper contributes a reproducible protocol-cascade analysis, a cross-shape alignment diagnostic for functional-map compatibility, and concrete recommendations for designing and reporting training-free shape descriptors.
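The two reported metrics, mAP and top-1, can be computed from a pairwise distance matrix as sketched below. This is a generic leave-one-out retrieval evaluation written for illustration; the paper's exact protocol (tie handling, query set, distance metric) is not given in the abstract, and `retrieval_scores` is a hypothetical name.

```python
import numpy as np

def retrieval_scores(distances, labels):
    """Return (mAP, top-1 accuracy) for leave-one-out retrieval."""
    labels = np.asarray(labels)
    average_precisions, top1_hits = [], []
    for q in range(len(labels)):
        order = np.argsort(distances[q], kind="stable")
        order = order[order != q]                  # never retrieve the query itself
        relevant = labels[order] == labels[q]
        if not relevant.any():
            continue                               # skip queries with no same-class shape
        ranks = np.flatnonzero(relevant) + 1       # 1-based ranks of relevant items
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        average_precisions.append(precision_at_hits.mean())
        top1_hits.append(bool(relevant[0]))
    return float(np.mean(average_precisions)), float(np.mean(top1_hits))

# Toy descriptors: two well-separated classes on a line.
descriptors = np.array([0.0, 0.1, 5.0, 5.1])
classes = np.array([0, 0, 1, 1])
pairwise = np.abs(descriptors[:, None] - descriptors[None, :])
mean_ap, top1 = retrieval_scores(pairwise, classes)
```

Because every choice in this function (self-exclusion, skipped queries, tie order) shifts the final numbers, it also illustrates why the abstract argues that protocol details need to be reported alongside scores.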

2605.28962 2026-05-29 cs.CV 版本更新

Resolving Endpoint Underfitting in Diffusion Bridges via Noise Alignment

通过噪声对齐解决扩散桥中的端点欠拟合

Yurong Gao, Zicheng Zhang, Congying Han, Tiande Guo, Xinmin Qiu

发表机构 * University of Chinese Academy of Sciences(中国科学院大学)

AI总结 针对扩散桥模型在目标端点附近出现的欠拟合问题,提出噪声对齐扩散桥(NADB),通过均值网络和噪声对齐映射解决噪声不匹配,在图像恢复和翻译任务中验证有效性。

Comments Accepted by CVPR2026

详情
AI中文摘要

扩散桥模型为连接两个数据分布(如图像恢复和翻译)提供了强大框架。许多现有方法通过模仿标准扩散模型的分数匹配公式来学习这种桥接。在这项工作中,我们发现这种方式会导致在接近目标端点($t \to 0$)时出现异常的欠拟合现象。这种欠拟合以预测方差和方向的显著漂移为特征,是由网络输入与其回归目标之间的噪声水平差异过大引起的。为了解决这个问题,我们提出了噪声对齐扩散桥(NADB)。我们的方法通过首先使用均值网络提供更清晰的条件目标,然后引入一种新颖的噪声对齐映射关系来重新表述扩散桥。这种新表述解决了噪声不匹配问题,并纠正了目标端点附近的欠拟合。在多个图像恢复和图像翻译任务上的实验验证了我们的方法的有效性。代码可在 https://github.com/gyr02/NADB 获取。

英文摘要

Diffusion bridge models offer a powerful framework for connecting two data distributions, such as in image restoration and translation. Many existing methods learn this bridge by mimicking the score-matching formulation of standard diffusion models. In this work, we find that this approach leads to an anomalous underfitting phenomenon near the target endpoint, as the process approaches the target distribution ($t \to 0$). This underfitting, characterized by significant drift in the predicted variance and direction, results from an excessively large discrepancy in noise levels between the network's input and its regression target. To resolve this issue, we propose the Noise-Aligned Diffusion Bridge (NADB). Our approach reformulates the diffusion bridge by first employing a mean network to provide a cleaner conditional target, and then introducing a novel, noise-aligned mapping relationship. This new formulation resolves the noise mismatch and corrects the underfitting near the target endpoint. Experimental validation across multiple image restoration and image translation tasks demonstrates the effectiveness of our approach. Code is available at https://github.com/gyr02/NADB.

2605.28551 2026-05-29 cs.CV cs.GR cs.LG 版本更新

Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

无分辨率依赖的几何参数化与映射神经替代模型:面向空间变化场

Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

发表机构 * Department of Mathematics, The Chinese University of Hong Kong(香港中文大学数学系)

AI总结 提出一种无分辨率依赖的神经替代模型,通过多分辨率几何编码和几何感知约束(变分能量、扩散密度均衡、拟共形理论)无监督学习,直接从空间变化参数场预测映射位置,适用于任意结构化或非结构化点集。

详情
AI中文摘要

许多成像问题需要计算由空间变化的强度、特征或密度场引起的空间变换。典型例子包括畸变校正、可变形图像配准、基于图谱的分割以及变形驱动的图像分析。这些任务可以表述为几何映射问题,其中变换被约束以保持局部结构、控制边界行为或调节角度畸变。此类公式通常导致变分模型、扩散过程或椭圆偏微分方程。然而,当底层参数场在不同实例间变化时,重复求解高分辨率系统在计算上变得昂贵。在这项工作中,我们提出了一种无分辨率依赖的神经替代模型,用于几何参数化和映射问题。给定一个空间变化的参数场 $p:\Omega\to\mathbb{R}^m$ 和查询位置 $\{x_i\}_{i=1}^N\subset\Omega$,该模型预测任意结构化或非结构化点集上的映射位置 $\{u(x_i)\}_{i=1}^N$。为了避免对固定网格的依赖,我们采用了一种多分辨率几何编码策略,该策略将网络条件建立在参数场的坐标增强样本上。该模型通过强制执行源自变分能量、基于扩散的密度均衡和拟共形理论的几何感知约束进行训练,无需标记解数据。在拟共形映射和密度均衡映射问题上的实验结果展示了我们提出方法的有效性。

英文摘要

Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.
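In quasi-conformal theory, angular distortion is commonly measured by the Beltrami coefficient $\mu = f_{\bar z} / f_z$ of a planar map $f = u + iv$. The sketch below computes it from a 2x2 Jacobian. It is a textbook illustration of the quantity such constraints act on, not the loss used in the paper, and `beltrami_coefficient` is a hypothetical helper name.

```python
import numpy as np

def beltrami_coefficient(jacobian):
    """Beltrami coefficient of a planar map with Jacobian [[u_x, u_y], [v_x, v_y]]."""
    (ux, uy), (vx, vy) = jacobian
    f_z = 0.5 * complex(ux + vy, vx - uy)      # Wirtinger derivative d f / d z
    f_zbar = 0.5 * complex(ux - vy, vx + uy)   # Wirtinger derivative d f / d z-bar
    return f_zbar / f_z

theta = 0.3
rotation_scale = 2.0 * np.array([[np.cos(theta), -np.sin(theta)],
                                 [np.sin(theta),  np.cos(theta)]])
stretch = np.array([[2.0, 0.0],
                    [0.0, 1.0]])

mu_conformal = beltrami_coefficient(rotation_scale)   # angle-preserving map
mu_stretch = beltrami_coefficient(stretch)            # anisotropic stretch

# Maximal dilatation K = (1 + |mu|) / (1 - |mu|): ratio of the stretch axes.
dilatation = (1 + abs(mu_stretch)) / (1 - abs(mu_stretch))
```

A conformal map has $\mu = 0$, and $|\mu| < 1$ characterizes an orientation-preserving map with bounded angular distortion, so penalizing or prescribing $\mu$ is one way to regulate that distortion.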

2605.27959 2026-05-29 cs.CV cs.AI 版本更新

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

ROVER: 面向有依据多图像推理的以对象为中心的视觉证据路由

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

发表机构 * Kuaishou Technology(快手科技)

AI总结 提出ROVER,一种轻量级可学习插件,通过对象中心差分注意力聚合上下文、蒸馏图像内线索并路由历史感知证据,实现高效全局视觉证据路由,在多图像推理中提升答案和定位精度。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地定位和交错视觉证据以进行审慎推理。基于定位的方法通常通过将裁剪的图像块或感兴趣区域(RoI)特定特征注入推理上下文来关注RoI。然而,这种设计可能削弱整体场景理解和对象间关系,同时导致解码成本随RoI数量和大小增加而增加。或者,自适应视觉特征选择通常需要细粒度监督或复杂启发式方法。为解决这些限制,我们提出ROVER(Routing Object-centric Visual Evidence for grounded multi-image Reasoning,即面向有依据多图像推理的以对象为中心的视觉证据路由),一种轻量级、可学习的插件,用于高效的全局视觉证据路由。在每次对象定位预测时,ROVER注入一个步骤特定的令牌三元组,以协同地:(i) 聚合正在进行的推理上下文,(ii) 通过对象中心差分注意力将图像内线索蒸馏到视觉工作空间中,以及(iii) 在该空间内跨对象和图像路由并整合历史感知证据以供后续推理。我们将ROVER集成到Qwen2.5-VL-7B中,并开发了一个交错的SFT到GRPO训练流程。严格遵循原始数据集和评估协议,我们的方法在MM-GCoT(+4.8%答案准确率,+14.6%定位准确率)和VideoEspresso(+8.6%答案准确率)上取得了最佳性能。在VideoEspresso上训练的模型表现出强大的迁移能力,在多个基准测试上平均比基础模型高出+4.7%。

英文摘要

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

2605.26994 2026-05-29 cs.CV 版本更新

ChartAct: A Benchmark for Dynamic Chart Understanding

ChartAct: 动态图表理解基准

Muye Huang, Lin Wu, Lingling Zhang, Hang Yan, Zhiyuan Wang, Yumeng Fu, Zesheng Yang, Jun Liu

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院) MOE KLNN Lab, Xi’an Jiaotong University(西安交通大学MOE KLNN实验室)

AI总结 提出ChartAct基准,通过收集673个动态图表和1440个问答样本,评估多模态模型在交互式图表理解中的能力,发现现有模型表现有限。

详情
AI中文摘要

图表广泛用于呈现复杂数据以支持分析和决策。现有的图表理解基准主要关注静态图表,但现实中的图表通常是动态且可交互的。关键信息可能仅在悬停、点击、缩放或拖拽等操作后出现。因此,动态图表理解要求模型识别可见内容、选择合适的交互方式,并在变化的图表状态中进行推理。为了评估这一能力,我们提出了ChartAct,一个用于动态图表理解的交互式基准。ChartAct从8个真实图表网站收集并筛选了673个动态图表,涵盖7种常见图表类型,并构建了1440个高质量问答样本。每个样本在两个环境(动态图表和仪表板图表)中实例化,以评估不同上下文下的动态图表理解能力。基于ChartAct,我们系统评估了11个先进的多模态模型和GUI智能体。实验结果表明,现有模型在动态图表理解方面仍存在明显局限。最强的模型Claude-Opus-4.7达到了84.5%的平均成功率,而大多数模型仍低于60%。我们还进行了详细的失败归因和案例分析。ChartAct为研究真实交互环境中的图表理解提供了新的基准。代码见https://github.com/wulin-wulin/OSWorld_Chart。

英文摘要

Charts are widely used to present complex data for analysis and decision making. Existing chart understanding benchmarks mainly focus on static charts, but real-world charts are often dynamic and interactive. Key information may only appear after actions such as hovering, clicking, zooming, or dragging. Dynamic chart understanding therefore requires models to identify visible content, choose proper interactions, and reason over changing chart states. To evaluate this ability, we propose ChartAct, an interactive benchmark for dynamic chart understanding. ChartAct collects and filters 673 dynamic charts from 8 real chart websites, covers 7 common chart types, and constructs 1,440 high-quality question-answer samples. Each sample is instantiated in two environments, Dynamic Chart and Dashboard Chart, to evaluate dynamic chart understanding under different contexts. Based on ChartAct, we systematically evaluate 11 advanced multimodal models and GUI agents. Experimental results show that existing models still have clear limitations in dynamic chart understanding. The strongest model, Claude-Opus-4.7, achieves an average success rate of 84.5%, while most models remain below 60%. We also conduct detailed failure attribution and case analysis. ChartAct provides a new benchmark for studying chart understanding in real interactive environments. Code is available at https://github.com/wulin-wulin/OSWorld_Chart.

2605.25299 2026-05-29 cs.CV cs.LG 版本更新

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

一种面向深度图像先验的原则性自参考早停方法

Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

发表机构 * Department of Computational Mathematics, Science, & Engineering, Michigan State University(密歇根州立大学计算数学、科学与工程系) Department of Electrical Engineering and Computer Science, University of Michigan(密歇根大学电气工程与计算机科学系) X Computational Physics Division, Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室计算物理部) Michigan Institute for Computational Discovery & Engineering, University of Michigan(密歇根大学计算发现与工程研究所) Mathematical Sciences, Michigan State University(密歇根州立大学数学科学系)

AI总结 针对深度图像先验(DIP)过拟合问题,提出一种基于构造伪自引用图像的过拟合检测框架,实现无需噪声水平估计的早期停止方法。

Comments 35 pages, 10 figures, 14 tables

详情
AI中文摘要

最近,深度图像先验(DIP)通过在无训练数据的情况下优化随机初始化的卷积神经网络,展示了解决逆成像问题(IIPs)的强大能力。然而,由于网络过参数化,DIP会过拟合噪声测量,使得早期停止(ES)至关重要。最成功的ES方法通过跟踪网络输出运行方差的波动来检测过拟合。然而,在许多应用中,这些波动可能过早出现,导致重建不稳定。本文首先证明,当退化图像的两个独立噪声副本可用时,可以实现近乎最优的DIP早期停止。受此观察启发,且由于获取两个完全独立的副本不可行,我们提出了一种基于构造伪自引用图像的过拟合检测框架,从而得到三种IIP特定算法。我们的方法还得到了关于单引用验证、伪验证估计以及共享噪声影响的理论结果的支持。在不同的IIP中,从自然图像恢复到医学图像重建,以及在不同噪声水平和噪声类型下,我们的方法始终优于现有的DIP早期停止方法,且无需准确估计噪声水平。

英文摘要

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
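The observation that two independent noisy copies enable near-optimal stopping can be illustrated with a toy 1D experiment. The sketch below is not DIP: it replaces the network with a hypothetical frequency-wise fitting schedule (low frequencies fitted first) as a stand-in for spectral bias, and the names `select_stop`, `gains`, and `sigma` are assumptions. The clean signal is used only to check the result, which is possible in a simulation but not in practice.

```python
import numpy as np

def select_stop(iterates, reference):
    """Index of the iterate with the lowest MSE against a held-out noisy copy."""
    errors = ((iterates - reference) ** 2).mean(axis=1)
    return int(np.argmin(errors))

n, sigma = 512, 0.5
rng = np.random.default_rng(0)
t = np.arange(n) / n
clean = np.sin(2 * np.pi * 3 * t)
fit_copy = clean + sigma * rng.normal(size=n)     # copy the "network" is fitted to
held_out = clean + sigma * rng.normal(size=n)     # independent copy, never fitted

# Toy fitting dynamics: after k steps, frequency f is fitted with gain
# 1 - (1 - 1/(1+f))^k, so low frequencies converge first and noise is fitted late.
steps = np.arange(1, 301)[:, None]
freqs = np.arange(n // 2 + 1)[None, :]
gains = 1.0 - (1.0 - 1.0 / (1.0 + freqs)) ** steps
iterates = np.fft.irfft(gains * np.fft.rfft(fit_copy), n=n, axis=1)

stop = select_stop(iterates, held_out)
mse_to_clean = ((iterates - clean) ** 2).mean(axis=1)   # oracle curve, toy only
```

Because the held-out copy carries noise that is independent of the fitted copy, its error curve tracks the true error up to a constant offset, so its minimum lands near the oracle stopping point. The paper's contribution addresses the harder case where such a second copy has to be constructed from a single measurement.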

2605.25059 2026-05-29 cs.CV 版本更新

VEOcc: Voxel-Centric Online Semantic Occupancy Prediction For Embodied Scene Understanding

VEOcc:面向具身场景理解的体素中心在线语义占用预测

Ruoyu Wang, Yong Liu, Sheng Tao, Yuhang Lin, Yukai Ma

发表机构 * Institute of Cyber-Systems and Control(控制系统研究院)

AI总结 提出一种基于体素的递归感知-同化框架VEOcc,通过时空感知在线更新策略实现无需初始尺度估计的高效、鲁棒语义占用预测,在局部和具身场景中达到最先进性能。

详情
AI中文摘要

在线3D占用预测与建图对自主探索至关重要,它能够即时、增量地构建稠密的空间表示。然而,近期以高斯为中心的方法在结构边界保真度上存在困难,且严重依赖预定义的场景大小先验,从根本上限制了其运行效率。在这项工作中,我们提出了VEOcc,一个以体素为中心的框架,表述为递归感知-同化范式。通过消除初始尺度估计的需要,VEOcc实现了高度精简、开放的地图扩展。此外,为了在离散体素空间内鲁棒地聚合带噪声的时间观测,我们提出了一种时空感知在线更新策略。它集成了跨时间对数几率聚合(TLA)以保持时间一致性、可靠性感知置信度调制(RCM)以进行空间不确定性校准,以及置信度驱动的增量状态更新(CSU)以实现鲁棒的全局状态同化。在Occ-ScanNet和EmbodiedOcc-ScanNet上的大量实验表明,VEOcc在局部和具身设置中均建立了新的最先进性能。值得注意的是,在自收集视频序列上的零样本评估进一步证实了其在完全未见过的真实世界环境中的鲁棒分布外泛化能力。最终,我们的框架为自主探索提供了准确且高效的解决方案。代码和补充可视化可在我们的项目页面获取:https://wryzju.github.io/VEOcc/。

英文摘要

Crucial for autonomous exploration, online 3D occupancy prediction and mapping incrementally constructs dense spatial representations on the fly. However, recent Gaussian-centric methods struggle with structural boundary fidelity and rely heavily on predefined scene-size priors, fundamentally limiting their operational efficiency. In this work, we present VEOcc, a voxel-centric framework formulated as a recursive perception-and-assimilation paradigm. By eliminating the need for initial scale estimation, VEOcc enables highly streamlined, open-ended map expansion. Furthermore, to robustly aggregate noisy temporal observations within the discrete voxel space, we propose a Spatio-Temporal-Aware Online Update Strategy. It integrates Cross-Temporal Logit Aggregation (TLA) for temporal consistency, Reliability-Aware Confidence Modulation (RCM) for spatial uncertainty calibration, and Confidence-Driven Incremental State Update (CSU) for robust global state assimilation. Extensive experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate that VEOcc establishes new state-of-the-art performance in both local and embodied settings. Notably, zero-shot evaluations on self-collected video sequences further confirm its robust out-of-distribution generalization capability in completely unseen real-world environments. Ultimately, our framework provides an accurate and highly efficient solution for autonomous exploration. Code and supplementary visualizations are available on our project page: https://wryzju.github.io/VEOcc/.
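A rough sketch of a voxel map that grows without a predefined scene size and fuses per-frame class logits by confidence is shown below. The abstract does not give the formulas behind TLA, RCM, or CSU, so the confidence-weighted running average here is an assumed stand-in, and the class `VoxelMap` with its fields is hypothetical.

```python
import numpy as np

class VoxelMap:
    """Sparse voxel map keyed by integer grid index, so it can expand open-endedly."""

    def __init__(self, voxel_size, num_classes):
        self.voxel_size = voxel_size
        self.num_classes = num_classes
        self.logits = {}       # voxel index -> fused class logits
        self.confidence = {}   # voxel index -> accumulated confidence

    def update(self, points, frame_logits, frame_confidence):
        """Assimilate one frame: confidence-weighted running average of logits."""
        indices = np.floor(points / self.voxel_size).astype(int)
        for row, logit, conf in zip(indices, frame_logits, frame_confidence):
            key = tuple(int(v) for v in row)
            old_conf = self.confidence.get(key, 0.0)
            old_logit = self.logits.get(key, np.zeros(self.num_classes))
            new_conf = old_conf + float(conf)
            if new_conf <= 0.0:
                continue   # ignore observations that carry no confidence
            self.logits[key] = (old_conf * old_logit + conf * logit) / new_conf
            self.confidence[key] = new_conf

voxel_map = VoxelMap(voxel_size=0.5, num_classes=2)
# Frame 1: low-confidence vote for class 1 in the voxel at the origin.
voxel_map.update(np.array([[0.1, 0.1, 0.1]]), np.array([[0.0, 4.0]]), np.array([1.0]))
# Frame 2: high-confidence vote for class 0 in the same voxel, plus a far-away voxel.
voxel_map.update(np.array([[0.2, 0.3, 0.4], [-7.3, 12.0, 3.1]]),
                 np.array([[4.0, 0.0], [1.0, 2.0]]),
                 np.array([3.0, 1.0]))
```

Storing voxels in a dictionary means a new observation far from the existing map simply adds a key, which mirrors the open-ended expansion the abstract describes without any bounding-box prior.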

2605.23993 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Nano World Models: A Minimalist Implementation of Future Video Prediction

纳米世界模型:未来视频预测的极简实现

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

发表机构 * DeepMind

AI总结 提出Nano World Models,一个基于扩散强迫的极简代码库,用于未来视频预测,支持可控研究世界模型的设计选择,并通过实验分析预测参数化、架构规模等因素对视频预测质量的影响。

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

详情
AI中文摘要

世界模型已成为学习预测模拟器的核心范式,支持生成、规划和决策。然而,尽管工业级交互式视频生成取得了快速进展,更广泛的研究社区仍然缺乏紧凑、可重复且易于扩展的实现来研究现代世界模型的设计选择。我们介绍了Nano World Models,一个围绕扩散强迫的极简代码库,用于未来视频预测。Nano World Models为生成目标、模型规模、动作条件机制、潜在观测空间、数据集、评估协议和长程展开程序提供了统一接口。这种设计使得通常在不同实现中纠缠的世界模型组件可以进行受控研究。通过在简单控制环境、游戏模拟和真实机器人数据上的实验,我们考察了预测参数化、架构规模、动作注入、采样预算和领域复杂性如何影响视频预测质量和自回归展开行为。通过发布代码、配置、评估脚本和预训练检查点,Nano World Models旨在为开放、可重复和科学的世界模型研究提供一个紧凑但可扩展的实验基础。

英文摘要

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.
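The core of diffusion forcing is that each frame in a sequence receives its own, independently sampled noise level during training, rather than one shared level per clip. The sketch below shows only that noising step under an assumed linear beta schedule. The function name, the schedule, and the tensor layout are illustrative assumptions and do not reflect the Nano World Models API.

```python
import numpy as np

def noise_frames_independently(frames, alpha_bar, rng):
    """Apply forward diffusion with a separate random noise level for every frame."""
    num_frames = frames.shape[0]
    levels = rng.integers(0, len(alpha_bar), size=num_frames)   # one level per frame
    a = alpha_bar[levels].reshape((num_frames,) + (1,) * (frames.ndim - 1))
    eps = rng.normal(size=frames.shape)
    noisy = np.sqrt(a) * frames + np.sqrt(1.0 - a) * eps
    return noisy, levels, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)          # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

video = rng.normal(size=(8, 16, 16, 3))        # (frames, height, width, channels)
noisy, levels, eps = noise_frames_independently(video, alpha_bar, rng)

# Sanity check: with the true noise known, each frame can be recovered exactly.
a = alpha_bar[levels].reshape(8, 1, 1, 1)
recovered = (noisy - np.sqrt(1.0 - a) * eps) / np.sqrt(a)
```

At rollout time the same mechanism lets past frames be kept nearly clean while future frames start fully noised, which is what makes the objective convenient for autoregressive video prediction.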

2605.23531 2026-05-29 cs.CV 版本更新

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

PixIE: 提示驱动的像素空间低光照图像增强

Ruirui Lin, Guoxi Huang, David Bull, Nantheera Anantrasirichai

发表机构 * Visual Information Laboratory, University of Bristol, United Kingdom(布里斯托大学视觉信息实验室,英国)

AI总结 提出PixIE框架,利用视觉基础模型的语义提示,通过跨尺度去噪和DINO提示像素块进行像素空间低光照图像增强,在多个基准上提升PSNR和LPIPS。

详情
AI中文摘要

低光照图像遭受严重的噪声、对比度损失和语义模糊,使得增强成为去噪和细节恢复的联合问题。我们提出PixIE,一种由视觉基础模型语义提示的前馈像素空间LLIE框架。PixIE首先执行跨尺度去噪以抑制噪声并保持结构,然后使用DINO提示像素块(DPPBs)细化细节,通过补丁条件、空间连续的逐像素调制注入中间DINOv3特征。为了使像素空间注意力在跨尺度上高效,我们引入了空间通道压缩(SCC),它联合减少空间令牌网格和通道维度。我们进一步提出多感受野像素嵌入(MRPE),在语义提示之前提供邻域感知的像素表示,提高对信号依赖噪声的鲁棒性,超越逐点嵌入。在LLIE基准上的实验表明,与最近的最先进方法相比,PixIE将平均PSNR提高了1.9-15.0%,并将LPIPS降低了8.5-44.4%。定性比较进一步显示更清晰的细节和更稳定的纹理,提高了重建保真度和感知质量。

英文摘要

Low-light images suffer from severe noise, contrast loss, and semantic ambiguity, making enhancement a joint problem of denoising and detail recovery. We propose PixIE, a feed-forward pixel-space LLIE framework semantically prompted by a vision foundation model. PixIE first performs cross-scale denoising to suppress noise and preserve structure, then refines details using DINO-Prompted Pixel Blocks (DPPBs), which inject intermediate DINOv3 features through patch-conditioned, spatially continuous per-pixel modulation. To make pixel-space attention efficient across scales, we introduce Spatial-Channel Compaction (SCC), which jointly reduces the spatial token grid and channel dimension. We further propose Multi-Receptive-Field Pixel Embedding (MRPE) to provide neighborhood-aware pixel representations before semantic prompting, improving robustness to signal-dependent noise beyond point-wise embeddings. Experiments on LLIE benchmarks show that PixIE improves average PSNR by 1.9-15.0% over recent state-of-the-art methods and reduces LPIPS by 8.5-44.4%. Qualitative comparisons further show sharper details and more stable textures, improving both reconstruction fidelity and perceptual quality.

2605.23345 2026-05-29 cs.CV 版本更新

SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

SCOPE: 在可玩环境中模拟跨游戏操作以构建FPS世界模型

Zizhao Tong, Yeying Jin, Hongfeng Lai, Zeqing Wang, Zhaohu Xing, Kexu Cheng, Haoran Xu, Zhao Pu, Shangwen Zhu, Ruili Feng, Jian Zhao, Yan Zhang, Hao Tang, Ling Shao

发表机构 * UCAS-Terminus AI Lab, University of Chinese Academy of Sciences(中国科学院大学Terminus AI实验室) Tencent(腾讯) National University of Singapore(新加坡国立大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Waterloo(滑铁卢大学) Shanghai Jiaotong University(上海交通大学) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机科学学院多媒体信息处理国家重点实验室)

AI总结 提出SCOPE方法,通过在每个Transformer块中插入条件模块,将特征重塑为逐像素时间序列,以分离FPS游戏中局部作用域(scope)内的操作效果与全局生成,并引入跨游戏数据集CrossFPS,实现零样本迁移。

Comments Project page: https://z2tong.github.io/SCOPE/. Code is available at https://github.com/z2tong/SCOPE

详情
AI中文摘要

第一人称射击(FPS)游戏的交互式世界模型必须在每一帧解析高频重叠控制信号,同时不干扰未受影响的区域。现有方法全局注入动作并在单一游戏上训练,在密集FPS输入下失败。我们观察到FPS动作具有空间选择性:离散事件(如射击或换弹)仅影响武器周围的局部区域(scope),而连续的相机和移动信号控制稳定的环境。我们提出SCOPE,它在预训练视频扩散模型的每个Transformer块中插入一个条件模块。它将特征重塑为逐像素时间序列,使得每个位置根据局部视觉内容计算其动作响应。这无需分割标签即可将作用域内效果与作用域外生成分离。我们还引入了CrossFPS,这是第一个具有帧对齐动作遥测的多游戏FPS数据集。它包含来自7个游戏的69K个片段,具有10自由度控制器信号,并经过策划以消除游戏玩法偏差。该模型学习通用的视觉到动作映射,而非特定游戏模式,从而实现对未见场景的零样本迁移。实验证实了强动作响应性、精确的作用域分离以及有效的跨游戏泛化。

英文摘要

Interactive world models for first-person shooter (FPS) games must resolve high-frequency overlapping control signals at every frame without disrupting unaffected regions. Existing methods inject actions globally and train on single titles, failing under dense FPS inputs. We observe that FPS actions are spatially selective: discrete events such as firing or reloading affect only a localized region around the weapon (the scope), while continuous camera and movement signals govern stable surroundings. We propose SCOPE, which inserts a conditioning module into each transformer block of a pretrained video diffusion model. It reshapes features into per-pixel temporal sequences so that each position computes its action response from local visual content. This separates in-scope effects from out-of-scope generation without segmentation labels. We also introduce CrossFPS, the first multi-game FPS dataset with frame-aligned action telemetry. It comprises 69K clips from 7 titles with 10-DoF controller signals, curated to remove gameplay bias. The model learns general visual-to-action mappings rather than game-specific patterns, enabling zero-shot transfer to unseen scenes. Experiments confirm strong action responsiveness, precise scope separation, and effective cross-game generalization.
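The reshaping into per-pixel temporal sequences can be sketched as a pure layout change. The example assumes video features stored as (batch, time, height, width, channels); that layout and the helper names are assumptions for illustration, not details from the SCOPE implementation.

```python
import numpy as np

def to_pixel_sequences(features):
    """(B, T, H, W, C) -> (B*H*W, T, C): one temporal sequence per spatial position."""
    b, t, h, w, c = features.shape
    return features.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)

def from_pixel_sequences(sequences, b, h, w):
    """Inverse of to_pixel_sequences: (B*H*W, T, C) -> (B, T, H, W, C)."""
    _, t, c = sequences.shape
    return sequences.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 5, 4, 6, 8))     # B=2, T=5, H=4, W=6, C=8
sequences = to_pixel_sequences(features)

# The sequence for batch 0, row 1, column 2 sits at flat index (0*4 + 1)*6 + 2.
pixel_index = (0 * 4 + 1) * 6 + 2
restored = from_pixel_sequences(sequences, b=2, h=4, w=6)
```

Once features are laid out this way, any module applied along the time axis processes each pixel location separately, so an action signal can modulate the weapon region differently from the background without segmentation labels.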

2605.22080 2026-05-29 cs.CV cs.AI 版本更新

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

JMed48k:用于视觉语言模型评估的多专业日本医疗执照基准

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Kyoto University(京都大学) The University of Tokyo(东京大学) Hohai University(河海大学) University of Science and Technology of China(中国科学技术大学) University of Toronto(多伦多大学)

AI总结 本文提出JMed48k,一个包含48,862道试题和20,142张图像的多专业日本医疗执照基准,通过评估21个模型并引入配对图像移除审计,发现专有和开源模型显著受益于图像,而医学专用模型对视觉证据利用有限。

详情
AI中文摘要

我们引入了JMed48k,一个用于评估视觉语言模型的多专业日本医疗执照基准。该基准基于日本厚生劳动省发布的官方PDF材料构建,包含2005年至2025年间11个国家执照考试的48,862道试题和20,142张图像,视觉内容按8类分类法进行标注。从该语料库中,我们提取了JMed48k-Eval,一个近五年的评估子集,包含12,484道评分题,其中9,905道纯文本题和2,579道带图像题。我们评估了21个专有、开源和医学专用模型,分别报告纯文本和带图像的性能。由于这些子集包含不同的问题,我们进一步引入了一种配对图像移除审计,评估带图像的问题在移除视觉内容前后的表现,以探索四种答案转换状态。审计显示,专有和开源模型从图像中获益显著,而医学专用系统对视觉证据的利用有限,许多正确答案在图像移除后仍然存在。即使在专有模型中,净图像移除效应在不同专业间变化七倍,从医师问题的+5.7分到公共卫生护士问题的+39.8分。我们发布JMed48k以支持在医疗执照场景中对视觉语言模型进行可重复的、按专业分层的评估。

英文摘要

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
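The paired image-removal audit can be sketched as a tally over per-question outcome pairs. The four state labels and the function names below are hypothetical, and the net effect is computed here as with-image accuracy minus without-image accuracy in percentage points, which is one plausible reading of the abstract rather than the paper's confirmed definition.

```python
from collections import Counter

def transition_state(correct_with_image, correct_without_image):
    """Classify one question by its outcome before and after image removal."""
    if correct_with_image and correct_without_image:
        return "stays_correct"        # answerable without the image
    if correct_with_image:
        return "correct_to_wrong"     # the image was needed
    if correct_without_image:
        return "wrong_to_correct"     # the image misled the model
    return "stays_wrong"

def audit(pairs):
    """Return transition counts and the net image effect in percentage points."""
    counts = Counter(transition_state(w, wo) for w, wo in pairs)
    with_image = sum(1 for w, _ in pairs if w)
    without_image = sum(1 for _, wo in pairs if wo)
    net_effect = 100.0 * (with_image - without_image) / len(pairs)
    return counts, net_effect

# Toy results: (correct with image, correct after image removal) per question.
toy_pairs = [(True, True), (True, False), (True, False), (False, True), (False, False)]
counts, net_effect = audit(toy_pairs)
```

Evaluating the same questions twice is what makes the comparison valid: the text-only and with-image subsets contain different questions, so their raw accuracies cannot be subtracted directly.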

2605.22069 2026-05-29 cs.CV cs.LG 版本更新

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

TWINGS: 基于薄板样条翘曲对齐的稀疏视图高斯泼溅初始化

Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang

发表机构 * Yonsei University(延世大学) Korea Institute of Science and Technology(韩国科学技术院)

AI总结 提出TWINGS框架,利用薄板样条(TPS)对齐反投影点与三角化控制点,为3D高斯泼溅提供几何精确的初始化,从而在稀疏视图下提升场景重建的细节保留和颜色保真度。

Comments Accepted at CVPR 2026, Project page: https://sandokim.github.io/twings/

详情
AI中文摘要

从稀疏视图输入进行新视角合成是3D计算机视觉中的一个重大挑战,特别是在有限视角下实现高质量场景重建。我们引入了TWINGS,这是一个通过直接解决点稀疏性来增强3D高斯泼溅(3DGS)的框架。我们采用薄板样条(TPS),一种平滑的非刚性变形模型,通过最小化弯曲能量从控制点对应关系估计全局一致的翘曲,将估计深度反投影的点与三角化的3D控制点对齐,从而生成校准的反投影点。通过在这些控制点附近采样校准点,TWINGS为3DGS提供了快速且几何精确的初始化,最终改善了重建场景中结构细节的保留和颜色保真度。在DTU、LLFF和Mip-NeRF360上的大量实验表明,TWINGS在稀疏视图场景下始终优于现有方法,提供详细且准确的重建。

英文摘要

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that minimizes bending energy to estimate a globally coherent warp from control-point correspondences, to align backprojected points from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

2605.17286 2026-05-29 cs.CV 版本更新

HyperVision: A Channel-Adaptive Ground-Based Hyperspectral Vision Pre-trained Backbone

HyperVision: 一种通道自适应的地基高光谱视觉预训练骨干网络

Guanyiman Fu, Jingtao Li, Zihang Cheng, Zhuanfeng Li, Diqi Chen, Yan Xu, Xiangyu Liu, Fengchao Xiong, Jianfeng Lu, Chengrong Chen, Jun Zhou

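Affiliations Griffith University, Australia; Wuhan University, China; Nanjing University of Science and Technology, China; Huaiyin Normal University, China; Massey University, New Zealand

AI Summary To address differing sensor configurations, scarce and inconsistent labels, and limited dataset scale in ground-based hyperspectral imaging, proposes HyperVision, the first ground-based hyperspectral pre-trained backbone. It combines channel-adaptive dynamic embedding, multi-source pseudo-labeling, and cross-modal knowledge distillation, and achieves the best performance on three downstream tasks.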

详情
AI中文摘要

虽然高光谱成像通过数百个窄波长波段提供丰富的空间-光谱信息,用于精确的材料识别,但地基高光谱预训练骨干网络仍然缺失,受限于传感器间的光谱配置差异、标签的稀缺性和不一致性,以及现有数据集的规模有限和场景多样性不足。为了解决这些挑战并实现通用感知,我们提出了HyperVision,这是首个地基高光谱预训练骨干网络。首先,为了处理不同的光谱配置,HyperVision采用通道自适应动态嵌入机制,将异构输入映射到统一的标记空间。其次,我们开发了一个无监督表示学习框架。具体来说,为了解决标签稀缺和不一致问题,引入了一种多源伪标签方法,融合来自SAM2的空间结构和来自HyperFree的细粒度光谱材料信息。此外,为了丰富场景多样性并补偿有限的数据集规模,利用跨模态知识蒸馏机制,将预训练RGB视觉模型的丰富语义表示迁移到我们的骨干网络。HyperVision在来自26个不同地基数据集的15000张图像集合上进行预训练,展现出卓越的泛化能力。仅需高效的头适配而无需调整骨干参数,它在不同传感器配置下的三个下游任务中取得了比任务特定方法更优的性能,在高光谱语义分割中$\mathrm{Acc}_{\mathrm{M}}$相对提升高达16.3%,目标跟踪AUC相对提升2.1%,显著目标检测MAE降低35.5%。源代码和预训练模型将在https://github.com/lronkitty/HyperVision 公开。

英文摘要

While hyperspectral imaging provides rich spatial-spectral information across hundreds of narrow wavelength bands for precise material identification, ground-based hyperspectral pre-trained backbones remain absent, constrained by varying spectral configurations across sensors, the scarcity and inconsistency of labels, and the limited scale and scene diversity of existing datasets. To address these challenges and enable universal perception, we propose HyperVision, the first ground-based hyperspectral pre-trained backbone. First, to handle varying spectral configurations, HyperVision adopts a channel-adaptive dynamic embedding mechanism to map heterogeneous inputs into a unified token space. Second, we develop an unsupervised representation learning framework. Specifically, to address label scarcity and inconsistency, a multi-source pseudo-labeling method is introduced to fuse spatial structures from SAM2 and fine-grained spectral material information from HyperFree. Furthermore, to enrich scene diversity and compensate for limited dataset scale, a cross-modal knowledge distillation mechanism is utilized to transfer rich semantic representations from a pre-trained RGB vision model to our backbone. Pre-trained on a collection of 15k images from 26 diverse ground-based datasets, HyperVision demonstrates exceptional generalization. Requiring only efficient head-only adaptation without adjusting backbone parameters, it achieves state-of-the-art performance compared to task-specific methods across three downstream tasks under varying sensor configurations, yielding up to a 16.3% relative improvement in hyperspectral semantic segmentation $\mathrm{Acc}_{\mathrm{M}}$, a 2.1% relative gain in object tracking AUC, and a 35.5% reduction in salient object detection MAE. The source code and pre-trained model will be publicly available on https://github.com/lronkitty/HyperVision .

2605.15852 2026-05-29 cs.CV 版本更新

GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction

GHOST: 用于高效3D重建的几何层次化在线流式令牌驱逐

Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang

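Affiliations Shanghai Jiao Tong University

AI Summary Proposes GHOST, a framework that uses the model's own 3D geometry outputs to evict redundant tokens online, cutting the KV cache by nearly half and delivering a 1.75x speedup while preserving reconstruction quality.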

详情
AI中文摘要

从长单目视频序列进行流式3D重建需要维护一个随序列长度线性增长的键值(KV)缓存,造成严重的内存瓶颈。现有方法要么将缓存截断为固定的一组锚帧,导致重建质量下降,要么依赖于对3D场景结构无关的注意力分数启发式方法,未能保留几何上有价值的令牌。为解决这些问题,我们提出GHOST(几何层次化在线流式令牌驱逐),一种无需训练的KV缓存管理框架,利用模型自身的3D几何输出在线驱逐冗余令牌。GHOST引入了三项相互增强的创新:层次化双层重要性评分方案、保护特殊令牌不被驱逐的特权机制,以及余弦相似度引导的逐层预算分配。在各种基准上的实验表明,GHOST在保持出色重建质量的同时,将KV缓存削减近一半,并且与最先进方法相比实现了1.75倍的推理加速。我们的代码可在 https://github.com/lokiniuniu/GHOST 获取。

英文摘要

Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model's own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
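
As a rough illustration of budgeted token eviction with a protected set, the following sketch keeps the highest-scoring cached tokens while never evicting privileged ones. It is not the GHOST scoring scheme: the function `evict_tokens`, the flat per-token `scores` list, and the fixed `budget` are hypothetical stand-ins, and the geometry-derived importance scores and layer-wise budget allocation from the abstract are not modeled.

```python
def evict_tokens(scores, protected, budget):
    """Return the sorted indices of cached tokens to keep.

    scores:    importance score per cached token (higher means more valuable)
    protected: set of token indices that must never be evicted
    budget:    maximum number of tokens to keep in the cache
    """
    if budget < len(protected):
        raise ValueError("budget is smaller than the protected set")
    candidates = [i for i in range(len(scores)) if i not in protected]
    # The highest-scoring unprotected tokens fill the remaining budget.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = set(protected) | set(candidates[: budget - len(protected)])
    return sorted(keep)

# Toy cache of six tokens; token 1 is a protected special token.
scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.3]
kept = evict_tokens(scores, protected={1}, budget=3)
```

Here the cache shrinks from six tokens to three, and the low-scoring token 1 survives only because it is protected.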

2605.14270 2026-05-29 cs.CV 版本更新

Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

诊断和纠正多模态扩散Transformer中的概念遗漏

Kanghyun Baek, Jaihyun Lew, Chaehun Shin, Jungbeom Lee, Sungroh Yoon

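Affiliations Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul, South Korea; Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea; Department of Computer Science & Engineering, Korea University, Seoul, South Korea; ISRC, Seoul National University, Seoul, South Korea

AI Summary Using linear probing, finds an "omission signal" in text embeddings that represents the absence of a target concept, and proposes Omission Signal Intervention (OSI), which amplifies that signal to actively catalyze generation of the missing concept. OSI significantly alleviates concept omission on FLUX.1-Dev and SD3.5-Medium.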

Comments Accepted to ICML 2026

详情
AI中文摘要

多模态扩散Transformer(MM-DiTs)在文本到图像生成方面取得了显著进展,但它们经常遭受概念遗漏,即指定的对象或属性未能出现在生成的图像中。通过对文本标记进行线性探测,我们证明文本嵌入可以区分代表目标概念缺失的特征性“遗漏信号”。利用这一见解,我们提出了遗漏信号干预(OSI),该方法放大遗漏信号以主动催化缺失概念的生成。在FLUX.1-Dev和SD3.5-Medium上的全面实验表明,即使在极端场景下,OSI也能显著缓解概念遗漏。

英文摘要

Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic "omission signal" representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.

2604.21654 2026-05-29 cs.CV cs.AI 版本更新

Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment

因果解耦启发的退化表示学习用于全参考图像质量评估

Zhen Zhang, Jielei Chu, Tian Zhang, Lin Ma, Fengmao Lv, Weide Liu, Tianrui Li, Yuming Fang

Affiliations School of Computing and Artificial Intelligence, Southwest Jiaotong University; School of Transportation and Logistics, Southwest Jiaotong University; School of Physics, Northeast Normal University; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics; School of Information Management, Jiangxi University of Finance and Economics

AI Summary Proposes a new full-reference image quality assessment paradigm based on causal inference and decoupled representation learning, which estimates degradation through interventions on latent representations and performs strongly across multiple settings and cross-domain scenarios.

详情
AI中文摘要

现有的基于深度网络的全参考图像质量评估(FR-IQA)模型通常通过对参考图像和失真图像的深度特征进行成对比较来工作。在本文中,我们从不同的角度处理这个问题,提出了一种基于因果推断和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同,我们的方法将退化估计表述为一个由对潜在表示进行干预引导的因果解耦过程。我们首先利用参考图像和失真图像之间的内容不变性来解耦退化表示和内容表示。其次,受人类视觉掩蔽效应的启发,我们设计了一个掩蔽模块来建模图像内容与退化特征之间的因果关系,从而从失真图像中提取受内容影响的退化特征。最后,通过监督回归或无标签降维从这些退化特征预测质量分数。大量实验表明,我们的方法在全监督、少标签和无标签设置的标准IQA基准上取得了极具竞争力的性能。此外,我们还在数据稀缺的多种非标准自然图像域(包括水下、放射线、医学、中子和屏幕内容图像)上评估了该方法。得益于其能够在没有标记IQA数据的情况下进行场景特定训练和预测的能力,我们的方法在跨域泛化方面优于现有的无训练FR-IQA模型。

英文摘要

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

2604.18518 2026-05-29 cs.CV cs.LG 版本更新

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

UDM-GRPO:面向均匀离散扩散模型的稳定高效组相对策略优化

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang

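Affiliations Beijing University of Posts and Telecommunications; Beijing Academy of Artificial Intelligence

AI Summary To address the unstable training and limited gains seen when uniform discrete diffusion models (UDM) are combined with reinforcement learning (RL), proposes UDM-GRPO, which treats the final clean sample as the action, reconstructs trajectories via the diffusion forward process, and adds Reduced-Step and CFG-Free strategies, significantly improving text-to-image generation performance.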

Comments UDM-GRPO is accepted by ICML 2026 (Spotlight). Code is available at https://github.com/Yovecent/UDM-GRPO

详情
AI中文摘要

均匀离散扩散模型(UDM)最近成为离散生成建模的一种有前景的范式;然而,其与强化学习的集成仍然很大程度上未被探索。我们观察到,将GRPO直接应用于UDM会导致训练不稳定和边际性能提升。为了解决这个问题,我们提出了UDM-GRPO,这是第一个将UDM与RL集成的框架。我们的方法基于两个关键见解:(i)将最终干净样本作为动作提供更准确和稳定的优化信号;(ii)通过扩散前向过程重建轨迹更好地将概率路径与预训练分布对齐。此外,我们引入了两种策略,即简化步数(Reduced-Step)和无CFG(CFG-Free),以进一步提高训练效率。UDM-GRPO在多个T2I任务上显著提升了基础模型性能。值得注意的是,GenEval准确率从69%提高到96%,PickScore从20.46增加到23.81,在连续和离散设置中均达到了最先进的性能。在OCR基准测试中,准确率从8%提高到57%,进一步验证了我们方法的泛化能力。代码可在https://github.com/Yovecent/UDM-GRPO获取。

英文摘要

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
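
For readers unfamiliar with GRPO, the group-relative part can be sketched as normalizing each sample's reward against the other samples drawn for the same prompt. This shows only that generic normalization step. It does not model the UDM-specific choices in the abstract (the clean sample as action, forward-process trajectory reconstruction), and the helper name, the `eps` constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group: four generations for one prompt, two rewarded and two not.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this toy group the rewarded samples receive advantages of roughly +1 and the others roughly -1, so the update signal depends on relative rather than absolute reward.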

2604.13019 2026-05-29 cs.CV 版本更新

PrecisionCUA: Iterative Visual Refinement for Pixel-Precise Cursor Grounding in Code Editors

PrecisionCUA:代码编辑器中像素级光标定位的迭代视觉细化

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

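Affiliations Microsoft; Carnegie Mellon University

AI Summary Proposes PrecisionCUA, which uses iterative refinement driven by visual feedback to achieve pixel-precise cursor grounding in code editors, significantly improving click precision and task success rate.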

详情
AI中文摘要

计算机使用代理(CUA)从根本上依赖图形用户界面(GUI)定位,将语言指令转化为可执行的屏幕操作,但在密集编码界面(如VS Code和Cursor)中,需要亚像素精度才能与密集IDE元素交互的编辑级定位尚未得到充分探索。现有方法通常依赖单次坐标预测,缺乏纠错机制,在高密度界面中常常失败。在本技术报告中,我们对编码环境中的像素级光标定位进行了实证研究。我们的代理不是单步执行,而是参与迭代细化过程,利用先前尝试的视觉反馈来达到目标元素。这种闭环定位机制使代理能够自我纠正位移误差并适应动态UI变化。我们在Claude、Qwen和GPT上的一系列复杂编码基准上评估了我们的方法,结果表明多轮细化在点击精度和整体任务成功率上均显著优于最先进的单次模型。我们的结果表明,迭代视觉推理是下一代可靠软件工程代理的关键组成部分。代码:https://github.com/microsoft/precision-cua-bench/tree/main。

英文摘要

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces (such as VS Code and Cursor), where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across Claude, Qwen, and GPT on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench/tree/main.

2604.12772 2026-05-29 cs.CV cs.MA 版本更新

A Multi-Agent Feedback System for Detecting and Describing News Events in Satellite Imagery

用于检测和描述卫星图像中新闻事件的多智能体反馈系统

Madeline Anderson, Mikhail Klassen, Ash Hoover, Kerri Cahoy

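Affiliations Massachusetts Institute of Technology; Planet Labs

AI Summary Proposes SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for satellite image sequences, effectively surfacing multi-temporal events and yielding a dataset of 5,000 sequences.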

详情
AI中文摘要

卫星图像的变化通常发生在多个时间步长上。尽管出现了双时相变化描述数据集,但遥感领域缺乏多时相事件描述数据集(每个序列至少两张图像)。这一差距的存在是因为(1)在卫星图像中搜索可见事件和(2)标记多时相序列需要大量的时间和人力。为了解决这些挑战,我们提出了SkyScraper,一种迭代的多智能体工作流,它对新闻文章进行地理编码,并为相应的卫星图像序列合成描述。我们的实验表明,SkyScraper成功找到的事件数量是传统地理编码方法的5倍,证明了智能体反馈是发现卫星图像中新的多时相事件的有效策略。我们将我们的框架应用于全球新闻文章的大型数据库,整理出一个包含5000个序列的新多时相描述数据集。通过自动识别与新闻事件相关的图像,我们的工作也支持新闻和报道工作。

英文摘要

Changes in satellite imagery often occur over multiple time steps. Despite the emergence of bi-temporal change captioning datasets, there is a lack of multi-temporal event captioning datasets (at least two images per sequence) in remote sensing. This gap exists because (1) searching for visible events in satellite imagery and (2) labeling multi-temporal sequences require significant time and labor. To address these challenges, we present SkyScraper, an iterative multi-agent workflow that geocodes news articles and synthesizes captions for corresponding satellite image sequences. Our experiments show that SkyScraper successfully finds 5x more events than traditional geocoding methods, demonstrating that agentic feedback is an effective strategy for surfacing new multi-temporal events in satellite imagery. We apply our framework to a large database of global news articles, curating a new multi-temporal captioning dataset with 5,000 sequences. By automatically identifying imagery related to news events, our work also supports journalism and reporting efforts.

2604.11080 2026-05-29 cs.CV cs.AI 版本更新

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant: 通过子空间残差旋转近似实现高效逐层大模型量化

Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

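Affiliations Seoul National University

AI Summary Proposes ReSpinQuant, a framework that uses offline activation rotation fusion and basis matching via efficient residual subspace rotation to remove the online computation overhead of layer-wise quantization methods, achieving state-of-the-art performance on W4A4 and W3A3 quantization.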

Comments ICML 2026

详情
AI中文摘要

基于旋转的后训练量化(PTQ)已成为缓解大型语言模型(LLMs)量化中激活值异常值的有前景的解决方案。全局旋转方法通过将激活旋转融合到注意力块和前馈网络块中实现推理效率,但由于受限于在所有层中使用单一可学习旋转矩阵,其表达能力有限。为了解决这一问题,出现了逐层变换方法,通过局部自适应实现了更高的精度。然而,逐层方法无法将激活旋转矩阵融合到权重中,需要在线计算并导致显著开销。在本文中,我们提出ReSpinQuant,一种量化框架,通过利用离线激活旋转融合和使用高效残差子空间旋转匹配基来解决此类开销。这种设计调和了逐层自适应的高表达性与仅可忽略的推理开销。在W4A4和W3A3量化上的大量实验表明,ReSpinQuant实现了最先进的性能,优于全局旋转方法,并以最小开销匹配计算昂贵的逐层方法的精度。

英文摘要

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.
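
The reason a rotation can be fused into weights offline is the identity x W = (x R)(Rᵀ W) for any orthogonal R. The sketch below checks that identity on a toy 2x2 case. It is not the ReSpinQuant algorithm: the matrices, the angle, and the helper names are made up for illustration, and no quantization or subspace residual rotation is performed.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

theta = math.pi / 6
# A 2x2 rotation matrix is orthogonal, so R times its transpose is the identity.
R = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
x = [[3.0, -1.0]]              # one activation row vector (toy values)
W = [[0.5, 2.0], [-1.5, 1.0]]  # toy weight matrix

y_ref = matmul(x, W)                   # unrotated reference output
W_fused = matmul(transpose(R), W)      # computed once, offline
y_rot = matmul(matmul(x, R), W_fused)  # rotated activation times fused weight
max_err = max(abs(a - b) for a, b in zip(y_ref[0], y_rot[0]))
```

The outputs agree up to floating-point error, so the rotation costs nothing at inference time as long as the `x R` step can itself be absorbed into the preceding layer. According to the abstract, layer-wise methods cannot fuse their rotations into weights and so pay for them online.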

2603.27758 2026-05-29 cs.CV 版本更新

RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization

RHO: 基于OSM的鲁棒整体度量跨视角地理定位

Junwei Zheng, Ruize Dai, Ruiping Liu, Zichao Zeng, Yufan Chen, Fangjinhua Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Rainer Stiefelhagen

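Affiliations Karlsruhe Institute of Technology; Hunan University; ETH Zurich; UCL

AI Summary Proposes RHO, a model for metric cross-view geo-localization from panoramas and OpenStreetMap, with a Split-Undistort-Merge module to handle panoramic distortion and a Position-Orientation Fusion mechanism to improve localization accuracy, gaining up to 20% over baseline methods on the CV-RHO dataset.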

Comments Accepted by CVPR 2026. Project page: https://github.com/InSAI-Lab/RHO

详情
AI中文摘要

度量跨视角地理定位(MCVGL)旨在通过匹配地面和卫星图像来估计3自由度相机位姿(位置和朝向)。在这项工作中,我们研究使用整体全景图和OpenStreetMap(OSM)的鲁棒MCVGL,而非针孔和卫星图像。为此,我们建立了一个大规模MCVGL基准数据集CV-RHO,包含超过270万张图像,涵盖不同天气和光照条件以及传感器噪声。此外,我们提出了一种名为RHO的模型,采用双分支Pin-Pan架构进行精确视觉定位。引入了Split-Undistort-Merge(SUM)模块来解决全景畸变问题,并设计了Position-Orientation Fusion(POF)机制来增强定位精度。大量实验证明了我们CV-RHO数据集的价值和RHO模型的有效性,与最先进的基线方法相比,性能提升高达20%。项目页面:https://github.com/InSAI-Lab/RHO。

英文摘要

Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain of up to 20% compared with the state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.

2603.14644 2026-05-29 eess.IV cs.CV cs.DB cs.LG 版本更新

LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

LUMINA:采用能量协调协议的多供应商乳腺X线摄影基准

Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

Affiliations Department of Radiology, Northwestern University; Department of Radiation Oncology, University of Health Sciences Prof. Dr. Cemil Tascioglu City Hospital; Department of Radiology, Istanbul University

AI Summary To address the small size, sparse annotations, and limited vendor diversity of existing FFDM datasets, proposes the multi-vendor LUMINA dataset and an energy harmonization method that reduces domain shift through foreground pixel alignment, with performance gains validated on diagnosis, BI-RADS classification, and density estimation.

Comments This paper was accepted to CVPR 2026

详情
AI中文摘要

公开可用的全视野数字乳腺X线摄影(FFDM)数据集在规模、临床标注和供应商多样性方面仍然有限,阻碍了稳健模型的发展。我们引入了LUMINA,一个经过整理的多供应商FFDM数据集,明确编码了采集能量和供应商元数据,以捕捉现有基准中常被忽略的临床相关外观变化。该数据集包含来自468名患者的1824张图像(960张良性,864张恶性),附有病理确认标签、BI-RADS评估和乳腺密度标注。LUMINA涵盖六个采集系统,包括高能和低能成像模式,能够系统分析供应商和能量引起的域偏移。为应对这些变化,我们提出了一种仅前景的像素空间对齐方法(“能量协调”),将图像映射到低能参考,同时保留病变形态。我们在三个临床相关任务上对CNN和Transformer模型进行了基准测试:诊断(良性 vs. 恶性)、BI-RADS分类和密度估计。双视图模型一致优于单视图模型。EfficientNet-B0在诊断任务上达到93.54%的AUC,而Swin-T在密度预测上达到最佳宏平均AUC 89.43%。协调方法提升了各架构的性能,并产生了更局部的Grad-CAM响应。总体而言,LUMINA提供了(1)一个供应商多样化的基准和(2)一个模型无关的协调框架,用于可靠且可部署的乳腺X线摄影AI。

英文摘要

Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method ("energy harmonization") that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.

2602.17200 2026-05-29 cs.CV 版本更新

GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation

GASS: 几何感知球面采样用于文本到图像生成中解耦多样性增强

Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky

Affiliations Laboratoire d'Informatique (LIX), CNRS, École Polytechnique, IPP, France; Department of Computer Science, Princeton University, USA; FAIR at Meta - Montreal, Canada; McGill University, Canada; Mila, Quebec AI Institute, Canada; Canada CIFAR AI Chair

AI Summary Proposes Geometry-Aware Spherical Sampling (GASS), which decomposes variation in CLIP embeddings into orthogonal prompt-dependent and prompt-independent directions and widens the projection spread along both axes to guide sampling, enhancing generation diversity while preserving image fidelity and semantic alignment.

Comments ICML 2026 Camera-ready. Code available at https://github.com/L-YeZhu/GASS_T2I

详情
AI中文摘要

尽管具有较高的语义对齐性,现代文本到图像(T2I)生成模型仍难以从给定提示中合成多样化的图像。在这项工作中,我们通过几何视角增强T2I多样性。与大多数现有方法主要依赖基于熵的引导来增加样本差异性不同,我们引入了几何感知球面采样(GASS),通过显式控制提示相关和提示无关的变异来源来增强多样性。具体地,我们使用两个正交方向分解CLIP嵌入中的多样性度量:文本嵌入(捕获与提示相关的语义变异)和识别出的正交方向(捕获提示无关的变异,如背景)。基于此分解,GASS增加生成图像嵌入沿两个轴的几何投影分布,并通过沿生成轨迹的扩展预测引导T2I采样过程。我们在不同冻结T2I骨干网络(U-Net和DiT,扩散和流)及基准上的实验证明了解耦多样性增强的有效性,且对图像保真度和语义对齐影响极小。

英文摘要

Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. In this work, we enhance the T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
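
The decomposition idea can be sketched with plain vector arithmetic: project image embeddings onto a text direction and onto a second direction made orthogonal to it, then measure the spread of each set of projections. This is a toy illustration only. The three-dimensional vectors stand in for CLIP embeddings, the Gram-Schmidt step and the variance-based `projection_spread` are assumptions, and the abstract does not specify how GASS identifies the orthogonal direction or how the spread is used to guide sampling.

```python
import math
from statistics import pvariance

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def orthogonalize(v, unit_t):
    """Remove the component of v along unit vector unit_t, then normalize."""
    c = dot(v, unit_t)
    return normalize([x - c * y for x, y in zip(v, unit_t)])

def projection_spread(embeddings, direction):
    """Population variance of the embeddings' projections onto a direction."""
    return pvariance([dot(e, direction) for e in embeddings])

text_dir = normalize([1.0, 0.0, 0.0])                 # stand-in text embedding
other_dir = orthogonalize([1.0, 1.0, 0.0], text_dir)  # prompt-independent axis
images = [[0.9, 0.1, 0.0], [0.8, 0.5, 0.1], [0.7, -0.3, 0.2]]  # toy embeddings

spread_text = projection_spread(images, text_dir)
spread_other = projection_spread(images, other_dir)
```

Tracking the two spreads separately is what makes the diversity measure disentangled in this sketch: a sampler could in principle widen one without changing the other.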

2602.14399 2026-05-29 cs.CV 版本更新

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

多轮自适应提示攻击对大型视觉-语言模型

In Chong Choi, Jiacheng Zhang, Feng Liu, Yiliao Song

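Affiliations The University of Melbourne; The University of Adelaide

AI Summary Proposes the multi-turn adaptive prompting attack (MAPA), which alternates text and vision attack actions within each turn and iteratively adjusts the attack trajectory across turns, significantly raising the success rate of multi-turn jailbreak attacks on large vision-language models.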

详情
AI中文摘要

多轮越狱攻击已被证明对纯文本大型语言模型(LLMs)有效,其中恶意内容逐渐引入以绕过安全对齐。然而,将此类攻击有效扩展到大型视觉-语言模型(LVLMs)仍未被充分探索。在本文中,我们发现简单地将视觉输入纳入多轮越狱可能使其更容易防御;例如,过度恶意的视觉内容容易触发安全对齐的LVLMs中的防御机制,导致更保守的响应。基于这一发现,我们提出了多轮自适应提示攻击(MAPA),该攻击:1)在每一轮中,交替文本-视觉攻击动作以引发最恶意的响应;2)跨轮,通过迭代来回优化调整攻击轨迹,逐步放大响应的恶意程度。这种两级设计使MAPA能够持续优于最先进的方法,在最近的基准测试中,针对LLaVA-v1.6-Mistral-7B、Qwen2.5-VL-7B-Instruct、Llama-3.2-Vision-11B-Instruct和GPT-4o-mini,攻击成功率提高了15-30%。我们的代码可在https://github.com/thomaschoi143/MAPA获取。

英文摘要

Multi-turn jailbreak attacks have proven effective against text-only large language models (LLMs), where malicious content is gradually introduced to bypass safety alignment. However, effectively extending such attacks to large vision-language models (LVLMs) remains underexplored. In this paper, we find that naively incorporating visual inputs can make multi-turn jailbreaks easier to defend against; for example, overly malicious visual content will easily trigger the defense mechanism in safety-aligned LVLMs, resulting in more conservative responses. Based on this finding, we propose multi-turn adaptive prompting attack (MAPA) that 1) at each turn, alternates text-vision attack actions to elicit the most malicious response; and 2) across turns, adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify response maliciousness. This two-level design enables MAPA to consistently outperform state-of-the-art methods, improving attack success rates by 15-30% on recent benchmarks against LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct and GPT-4o-mini. Our code is available at: https://github.com/thomaschoi143/MAPA.

2602.13600 2026-05-29 cs.CV 版本更新

SAVAA: Mitigating Hallucinations in LVLMs via Step-wise Adaptive Visual Attention Amplification

SAVAA: 通过逐步自适应视觉注意力放大减轻LVLMs中的幻觉

Jiacheng Zhang, Feng Liu, Chao Du, Tianyu Pang

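Affiliations Sea AI Lab; The University of Melbourne

AI Summary Proposes SAVAA, a framework that estimates hallucination risk with Visual Grounding Entropy and adaptively adjusts the visual attention amplification factor, significantly reducing hallucinations in large vision-language models on multiple benchmarks.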

详情
AI中文摘要

最近一系列无需训练的减轻大型视觉语言模型(LVLMs)幻觉的方法,通过在单次前向传递的自回归生成过程中放大对视觉标记的注意力。我们将这种范式称为视觉注意力放大(VAA)。在本文中,我们识别出现有VAA方法的一个双重失败模式,原因是它们在生成步骤中使用固定的放大因子:在某些步骤可能太弱,无法解决幻觉,而在其他步骤太强,引入新的幻觉。受此发现启发,我们提出逐步自适应视觉注意力放大(SAVAA),一种新的VAA框架,它估计每个生成标记的幻觉风险,并使用估计的风险自适应地放大下一个生成步骤的视觉注意力。具体来说,我们引入视觉接地熵(VGE),一种轻量级的幻觉风险估计器,它用视觉接地增强预测熵,为那些不确定、在图像中接地较弱或两者兼有的标记分配更高的风险。在VGE的指导下,SAVAA使用估计的风险校准下一个生成步骤的VAA因子,对高风险步骤应用更强的放大,对低风险步骤应用更弱的放大。在LLaVA-NeXT-7B、Qwen3-VL-8B和InternVL3.5-8B上,SAVAA在生成幻觉基准(如CHAIR、SHR和AMBER)上显著优于基线方法。代码可在https://github.com/JiachengZ01/SAVVA获取。

英文摘要

A line of recent training-free methods for mitigating hallucinations in large vision-language models (LVLMs) operates by amplifying attention to visual tokens during autoregressive generation within a single forward pass. We refer to this paradigm as visual attention amplification (VAA). In this paper, we identify a dual failure pattern in existing VAA methods caused by their use of a fixed amplification factor across generation steps: it can be too weak at some steps, leaving hallucinations unresolved, while too strong at others, introducing new hallucinations. Motivated by this finding, we propose Step-wise Adaptive Visual Attention Amplification (SAVAA), a new VAA framework that estimates hallucination risk for each generated token and uses the estimated risk to adaptively amplify visual attention at the next generation step. Specifically, we introduce Visual Grounding Entropy (VGE), a lightweight hallucination-risk estimator that augments predictive entropy with visual grounding, assigning higher risk to tokens that are uncertain, weakly grounded in the image, or both. Guided by VGE, SAVAA uses the estimated risk to calibrate the VAA factor for the next generation step, applying stronger amplification to higher-risk steps and weaker amplification to lower-risk steps. Across LLaVA-NeXT-7B, Qwen3-VL-8B, and InternVL3.5-8B, SAVAA significantly outperforms baseline methods on generative hallucination benchmarks such as CHAIR, SHR and AMBER. Code is available at: https://github.com/JiachengZ01/SAVVA.
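
A minimal sketch of risk-adaptive amplification, under stated assumptions: risk is computed from the normalized entropy of the token distribution and a grounding score in [0, 1], and then mapped linearly to an amplification factor. The combination rule in `risk_score`, the `grounding` input, and the `low`/`high` bounds are hypothetical. The abstract does not give the actual definition of Visual Grounding Entropy or the calibration rule.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def risk_score(probs, grounding):
    # Hypothetical combination: risk is high when the token is uncertain,
    # weakly grounded in the image (grounding near 0), or both.
    return 1.0 - (1.0 - normalized_entropy(probs)) * grounding

def amplification_factor(risk, low=1.0, high=2.0):
    """Map risk in [0, 1] linearly to a factor for the next generation step."""
    return low + risk * (high - low)

# A confident, well-grounded token versus an uncertain, weakly grounded one.
confident_grounded = amplification_factor(
    risk_score([0.97, 0.01, 0.01, 0.01], grounding=0.9))
uncertain_ungrounded = amplification_factor(
    risk_score([0.25, 0.25, 0.25, 0.25], grounding=0.2))
```

The confident token gets a factor close to `low` and the uncertain one gets `high`, which is the behavior the abstract contrasts with a fixed factor applied at every step.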

2602.07044 2026-05-29 cs.CV cs.AI 版本更新

PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

PipeMFL-240K:管道磁通量泄漏成像中目标检测的大规模数据集与基准

Tianyi Qu, Songxiao Yang, Haolin Wang, Huadong Song, Xiaoting Guo, Wenguang Hu, Guanlin Liu, Honghe Chen, Yafei Ou

Affiliations SINOMACH Sensing Technology Co., Ltd., Shenyang, Liaoning, China; Institute of Science Tokyo, Tokyo, Japan; Hokkaido University, Sapporo, Hokkaido, Japan

AI Summary To address the lack of large-scale public datasets and benchmarks for pipeline magnetic flux leakage inspection, builds PipeMFL-240K, a dataset of 249,320 images and 200,020 bounding-box annotations, and evaluates existing object detectors, revealing their shortcomings under long-tailed distributions, tiny objects, and intra-class variability.

Comments Accepted by ACM KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

管道完整性对工业安全和环境保护至关重要,磁通量泄漏(MFL)检测是一种主要的无损检测技术。尽管深度学习在自动化MFL解释方面具有前景,但由于缺乏大规模公开数据集和基准,可靠模型的进展受到限制,导致公平比较和可重复评估困难。我们引入了PipeMFL-240K,这是一个大规模、精心标注的数据集和基准,用于管道MFL伪彩色图像中的复杂目标检测。PipeMFL-240K反映了真实检测的复杂性,并提出了几个独特挑战:(i) 覆盖12个类别的极端长尾分布,(ii) 大量仅包含少数像素的小目标,(iii) 显著的类内变异。该数据集包含249,320张图像和200,020个高质量边界框标注,采集自12条总长约1,530公里的管道。我们使用最先进的目标检测器进行了大量实验以建立基线。结果表明,现代检测器仍然难以应对MFL数据的固有特性,凸显了巨大的改进空间,而PipeMFL-240K为驱动未来研究提供了可靠且具有挑战性的试验平台。作为管道MFL检测领域首个如此规模和范围的数据集和基准,它为高效的管道诊断和维护规划提供了关键基础,并有望加速基于MFL的管道完整性评估中的算法创新和可重复研究。

英文摘要

Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 249,320 images and 200,020 high-quality bounding-box annotations, collected from 12 pipelines spanning approximately 1,530 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

2602.06282 2026-05-29 cs.CV q-bio.QM 版本更新

An Interpretable Vision Transformer as a Fingerprint-Based Diagnostic Aid for Kabuki and Wiedemann-Steiner Syndromes

一种可解释的基于指纹的视觉Transformer辅助诊断Kabuki和Wiedemann-Steiner综合征

Marilyn Lionts, Arnhildur Tomasdottir, Viktor I. Agustsson, Yuankai Huo, Hans T. Bjornsson, Lotta M. Ellingsen

发表机构 * Dept. of Computer Science, Vanderbilt University(范德比尔特大学计算机科学系) Dept. of Genetics and Molecular Medicine, Landspitali University Hospital(陆斯帕蒂医院遗传学与分子医学系) Louma G. Laboratory of Epigenetics Research, Faculty of Medicine, University of Iceland(冰岛大学医学院表观遗传学研究实验室) McKusick-Nathans Dept. of Genetic Medicine, Johns Hopkins University School of Medicine(约翰霍普金斯大学医学院遗传医学部) Faculty of Electrical and Computer Engineering, University of Iceland(冰岛大学电气与计算机工程学院)

AI总结 本研究提出一种基于视觉Transformer的深度学习模型,利用指纹图像区分Kabuki综合征(KS)和Wiedemann-Steiner综合征(WSS)患者与健康对照,并通过注意力可视化增强可解释性,为罕见遗传病的非侵入性诊断提供新工具。

详情
AI中文摘要

Kabuki综合征(KS)和Wiedemann-Steiner综合征(WSS)是罕见但不同的发育障碍,具有重叠的临床特征,包括神经发育迟缓、生长受限和持续性胎儿指尖垫。尽管基因检测仍是诊断的金标准,但由于基因检测和专业知识获取的障碍,许多KS或WSS患者仍未得到诊断。皮纹异常虽然是几种遗传综合征的既定标志,但在分子检测时代仍是一种未被充分利用的诊断信号。本研究提出一种基于视觉Transformer的深度学习模型,利用指纹图像区分KS和WSS患者与未受影响的对照组以及彼此。我们在三个二分类任务中评估模型性能。在三个分类任务中,模型在对照组vs. KS、对照组vs. WSS和KS vs. WSS上分别达到了0.80、0.73和0.85的AUC分数,相应的F1分数分别为0.71、0.72和0.83。除了分类,我们应用基于注意力的可视化来识别对模型预测最显著的指纹区域,增强了可解释性。总之,这些发现表明存在综合征特异性的指纹特征,证明了基于指纹的人工智能(AI)工具作为一种非侵入性、可解释且可获取的未来诊断辅助手段,用于早期诊断未充分诊断的遗传综合征的可行性。

英文摘要

Kabuki syndrome (KS) and Wiedemann-Steiner syndrome (WSS) are rare but distinct developmental disorders that share overlapping clinical features, including neurodevelopmental delay, growth restriction, and persistent fetal fingertip pads. While genetic testing remains the diagnostic gold standard, many individuals with KS or WSS remain undiagnosed due to barriers in access to both genetic testing and expertise. Dermatoglyphic anomalies, despite being established hallmarks of several genetic syndromes, remain an underutilized diagnostic signal in the era of molecular testing. This study presents a vision transformer-based deep learning model that leverages fingerprint images to distinguish individuals with KS and WSS from unaffected controls and from one another. Across three binary classification tasks, the model achieved AUC scores of 0.80 (control vs. KS), 0.73 (control vs. WSS), and 0.85 (KS vs. WSS), with corresponding F1 scores of 0.71, 0.72, and 0.83, respectively. Beyond classification, we apply attention-based visualizations to identify fingerprint regions most salient to model predictions, enhancing interpretability. Together, these findings suggest the presence of syndrome-specific fingerprint features, demonstrating the feasibility of a fingerprint-based artificial intelligence (AI) tool as a noninvasive, interpretable, and accessible future diagnostic aid for the early diagnosis of underdiagnosed genetic syndromes.

2602.00324 2026-05-29 math.OC cs.CV cs.RO eess.SP 版本更新

Dual Quaternion SE(3) Synchronization with Recovery Guarantees

对偶四元数 SE(3) 同步及其恢复保证

Jianing Zhao, Linglingzhi Zhu, Anthony Man-Cho So

发表机构 * Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong(香港中文大学系统工程与工程管理学系,香港新界沙田) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, USA(H. Milton Stewart工业与系统工程学院,佐治亚理工学院)

AI总结 采用对偶四元数表示,通过谱初始化和对偶四元数广义幂法实现 SE(3) 同步,并给出误差界和线性收敛保证。

Comments ICML 2026

详情
AI中文摘要

特殊欧几里得群 SE(3) 上的同步旨在从含噪的成对相对变换中恢复绝对位姿,是机器人和 3D 视觉中的核心基本操作。标准方法通常需要多步启发式程序来恢复有效位姿,这些程序难以分析且通常缺乏理论保证。本文采用对偶四元数表示,并直接在对偶四元数单位上制定 SE(3) 同步。开发了一个两阶段算法:通过 Hermitian 对偶四元数测量矩阵上的幂法计算谱初始化,随后是对偶四元数广义幂法 (DQGPM),通过每次迭代投影来强制执行可行性。建立了谱估计器的估计误差界,并证明 DQGPM 具有有限迭代误差界,并实现线性误差收缩直至显式的噪声相关阈值。在合成基准和真实多扫描点集配准上的实验表明,所提出的流程在准确性和效率上均优于代表性的基于矩阵的方法。

英文摘要

Synchronization over the special Euclidean group SE(3) aims to recover absolute poses from noisy pairwise relative transformations and is a core primitive in robotics and 3D vision. Standard approaches often require multi-step heuristic procedures to recover valid poses, which are difficult to analyze and typically lack theoretical guarantees. This paper adopts a dual quaternion representation and formulates SE(3) synchronization directly over the unit dual quaternions. A two-stage algorithm is developed: a spectral initializer computed via the power method on a Hermitian dual quaternion measurement matrix, followed by a dual quaternion generalized power method (DQGPM) that enforces feasibility through per-iteration projection. Estimation error bounds are established for spectral estimators, and DQGPM is shown to admit a finite-iteration error bound and to achieve linear error contraction up to an explicit noise-dependent threshold. Experiments on synthetic benchmarks and real-world multi-scan point-set registration demonstrate that the proposed pipeline improves both accuracy and efficiency over representative matrix-based methods.
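The two-stage structure described above (spectral initialization, then projected power iterations) can be illustrated on a much simpler problem. The sketch below is not the paper's DQGPM: it swaps SE(3) and dual quaternions for planar rotations represented as unit complex numbers (phase synchronization), and every function name, the noise level, and the iteration counts are assumptions chosen for illustration.

```python
import cmath
import random


def project(v):
    """Entrywise projection onto the unit circle (the feasibility step)."""
    return [x / abs(x) if abs(x) > 1e-12 else 1 + 0j for x in v]


def matvec(C, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(cij * vj for cij, vj in zip(row, v)) for row in C]


def synchronize(C, power_iters=50, gpm_iters=20):
    n = len(C)
    # Stage 1: spectral initialization via the power method on the
    # Hermitian measurement matrix, then projection to feasibility.
    v = [complex(1.0, 0.1 * i) for i in range(n)]
    for _ in range(power_iters):
        w = matvec(C, v)
        norm = sum(abs(x) ** 2 for x in w) ** 0.5
        v = [x / norm for x in w]
    x = project(v)
    # Stage 2: generalized power iterations with per-iteration projection.
    for _ in range(gpm_iters):
        x = project(matvec(C, x))
    return x


def make_measurements(z, sigma, rng):
    """Noisy relative rotations C[i][j] ~ z[i] * conj(z[j]), kept Hermitian."""
    n = len(z)
    C = [[0j] * n for _ in range(n)]
    for i in range(n):
        C[i][i] = 1 + 0j
        for j in range(i + 1, n):
            noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
            C[i][j] = z[i] * z[j].conjugate() + noise
            C[j][i] = C[i][j].conjugate()
    return C


rng = random.Random(0)
truth = [cmath.exp(1j * rng.uniform(0, 2 * cmath.pi)) for _ in range(20)]
estimate = synchronize(make_measurements(truth, 0.2, rng))
# Alignment equals 1.0 for perfect recovery up to a global rotation.
alignment = abs(sum(e.conjugate() * t for e, t in zip(estimate, truth))) / len(truth)
```

The absolute rotations are only identifiable up to a common global rotation, which is why the quality measure takes the modulus of the inner product rather than comparing entries directly.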

2512.21032 2026-05-29 cs.CV 版本更新

Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

基于潜在扩散模型的多属性引导热人脸图像翻译

Mingshu Cai, Osamu Yoshie, Yuya Ieiri

发表机构 * Graduate School of Information, Production and Systems, Waseda University(早稻田大学信息、生产与系统研究生院)

AI总结 提出一种基于潜在扩散模型的多属性引导方法,从热红外图像生成高质量可见光人脸图像,同时保留关键身份特征,解决异质人脸识别中的域偏移和特征丢失问题。

Comments Accepted by 2025 IEEE International Joint Conference on Biometrics (IJCB 2025)

详情
Journal ref
2025 IEEE International Joint Conference on Biometrics (IJCB), 2025
AI中文摘要

现代监控系统越来越依赖多波长传感器和深度神经网络来识别夜间拍摄的红外图像中的人脸。然而,大多数人脸识别模型是在可见光数据集上训练的,由于显著的域偏移,在红外输入上性能大幅下降。早期的基于特征的红外人脸识别方法被证明效果不佳,促使研究人员采用生成式方法将红外图像转换为可见光图像以提高识别性能。这种被称为异质人脸识别(HFR)的范式面临模型和模态差异等挑战,导致生成图像出现失真和特征丢失。为了解决这些限制,本文引入了一种新颖的基于潜在扩散的模型,旨在从热输入生成高质量的可见光人脸图像,同时保留关键身份特征。我们集成一个多属性分类器,从可见光图像中提取关键面部属性,减轻红外到可见光图像恢复过程中的特征丢失。此外,我们提出了Self-attn Mamba模块,该模块增强了跨模态特征的全局建模,并显著提高了推理速度。在两个基准数据集上的实验结果表明,我们的方法在图像质量和身份保持方面均达到了最先进的性能。

英文摘要

Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

2512.03010 2026-05-29 cs.CV cs.GR cs.RO 版本更新

SurfFill: Completion of LiDAR Point Clouds via Gaussian Surfel Splatting

SurfFill: 通过高斯曲面元填充完成LiDAR点云

Svenja Strobel, Matthias Innmann, Bernhard Egger, Marc Stamminger, Linus Franke

发表机构 * NavVis GmbH(NavVis公司) Inria, Université Côte d'Azur(Inria与蔚蓝海岸大学)

AI总结 针对LiDAR点云缺失薄结构和边缘细节的问题,提出基于高斯曲面元(Gaussian surfel)的补全方案SurfFill,利用光束发散启发式识别模糊区域并优化曲面元重建以生长新点,在合成和真实场景中优于先前方法。

Comments Project page: https://lfranke.github.io/surffill

详情
AI中文摘要

LiDAR捕获的点云通常被视为主动3D重建的金标准。尽管其在平坦区域精度极高,但捕获容易遗漏小的几何结构,并可能在暗色、吸光材料上失败。或者,拍摄场景的多张照片并应用3D摄影测量可以推断这些细节,因为它们通常代表特征丰富的区域。然而,对于无特征区域,LiDAR的精度很少能达到。因此,我们建议通过引入SurfFill:一种基于高斯曲面元的LiDAR补全方案,结合LiDAR和基于相机的捕获的优势。我们分析LiDAR捕获,并将LiDAR光束发散归因于伪影的主要因素,主要表现为薄结构和边缘。我们利用这一见解,通过评估点云中密度的变化,引入一种用于完成扫描的模糊启发式方法。这使我们能够识别靠近缺失区域的点,然后我们可以使用这些点生长额外的点以完成扫描。对于这种点生长,我们约束高斯曲面元重建,将优化和密集化集中在这些模糊区域。最后,提取模糊区域重建的高斯基元并采样以获取点来完成点云。为了解决大规模重建的挑战,我们将此流程扩展为一种分治方案,用于建筑大小的点云补全。我们在合成和真实场景的LiDAR点云补全任务上评估,发现我们的方法优于先前的重建方法。

英文摘要

LiDAR-captured point clouds are often considered the gold standard in active 3D reconstruction. While their accuracy is exceptional in flat regions, LiDAR capture is prone to missing small geometric structures and may fail on dark, absorbent materials. Alternatively, capturing multiple photos of the scene and applying 3D photogrammetry can infer these details, as they often lie in feature-rich regions. However, photogrammetry rarely reaches the accuracy of LiDAR in featureless regions. Therefore, we suggest combining the strengths of LiDAR and camera-based capture by introducing SurfFill: a Gaussian surfel-based LiDAR completion scheme. We analyze LiDAR captures and identify LiDAR beam divergence as a main cause of artifacts, which manifest mostly at thin structures and edges. We use this insight to introduce an ambiguity heuristic for completed scans that evaluates the change in density in the point cloud. This allows us to identify points close to missed areas, from which we can then grow additional points to complete the scan. For this point growing, we constrain Gaussian surfel reconstruction to focus optimization and densification on these ambiguous areas. Finally, Gaussian primitives of the reconstruction in ambiguous areas are extracted and sampled for points to complete the point cloud. To address the challenges of large-scale reconstruction, we extend this pipeline with a divide-and-conquer scheme for building-sized point cloud completion. We evaluate on the task of LiDAR point cloud completion of synthetic and real-world scenes and find that our method outperforms previous reconstruction methods.

2512.01334 2026-05-29 cs.CV 版本更新

AlignVid: Training-Free Attention Scaling for Semantic Fidelity in Text-Guided Image-to-Video Generation

AlignVid: 文本引导图像到视频生成中语义保真度的免训练注意力缩放

Yexin Liu, Wen-Jie Shu, Zile Huang, Haoze Zheng, Yueze Wang, Jingjin Zhu, Manyuan Zhang, Ser-Nam Lim, Harry Yang

发表机构 * Hong Kong University of Science(香港科学大学) University of Central Florida, Orlando, FL, USA(中央佛罗里达大学) Beijing Academy of Artificial Intelligence, Beijing, China(北京人工智能研究院) The Chinese University of Hong Kong, Hong Kong SAR, China(香港中文大学)

AI总结 针对文本引导图像到视频生成中视觉主导导致语义编辑失败的问题,提出免训练注意力缩放调制(ASM)和引导调度(GS)方法,并构建OmitI2V基准,有效提升语义保真度且计算开销可忽略。

详情
AI中文摘要

文本引导的图像到视频生成取得了显著进展,但在执行需要对参考图像进行实质性更改(例如,添加、移除或修改对象)的文本指定编辑时仍存在困难。经验上,我们的分析表明,这源于视觉主导,即参考图像导致严重的注意力分散,抑制了模型整合新语义信息的能力。为解决此问题,我们提出AlignVid,一种免训练干预方法,重新校准模型内部的注意力分布。基于注意力的能量视角,AlignVid采用注意力缩放调制(ASM)以降低注意力熵并将焦点集中在语义标记上,同时结合引导调度(GS)以保持生成稳定性。为严格评估此能力,我们提出OmitI2V,一个全面的基准,用于评估对象修改、添加和删除中的提示遵循度。大量实验表明,AlignVid有效增强了语义保真度,且计算开销可忽略。代码和OmitI2V基准可在https://github.com/LAW1223/AlignVid获取。

英文摘要

Text-guided image-to-video generation has made substantial progress, yet it still struggles to execute text-specified edits that require substantial changes to a reference image (e.g., object addition, removal, or modification). Empirically, our analysis reveals that this stems from visual dominance, where the reference image causes severe attention dispersion, inhibiting the model's ability to incorporate new semantic information. To address this, we propose AlignVid, a training-free intervention that re-calibrates the model's internal attention distribution. Drawing on an energy-based perspective of attention, AlignVid employs Attention Scaling Modulation (ASM) to reduce attention entropy and concentrate focus on semantic tokens, alongside Guidance Scheduling (GS) to maintain generation stability. To rigorously assess this capability, we present OmitI2V, a comprehensive benchmark for evaluating prompt adherence across object modification, addition, and deletion. Extensive experiments demonstrate that AlignVid effectively enhances semantic fidelity with negligible computational overhead. Code and the OmitI2V benchmark are available at https://github.com/LAW1223/AlignVid.
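The link between scaling attention and lower attention entropy can be checked on a toy example. The sketch below is a generic illustration, not the paper's ASM: it assumes that "scaling" means multiplying the pre-softmax scores by a factor greater than one, and the scores and the factor 2.0 are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def scaled_attention(logits, scale):
    """Attention weights after multiplying the scores by `scale`.

    scale > 1 sharpens the distribution; scale = 1 leaves it unchanged.
    """
    return softmax([scale * x for x in logits])


logits = [2.0, 1.0, 0.5, 0.1]  # hypothetical query-key scores
base = scaled_attention(logits, 1.0)
sharp = scaled_attention(logits, 2.0)
entropy_drop = entropy(base) - entropy(sharp)  # positive: focus is more concentrated
```

For any scores that are not all equal, the entropy of the softmax decreases as the scale grows, so the weight on the highest-scoring token increases at the expense of the rest.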

2511.10861 2026-05-29 cs.CV cs.AI cs.LG 版本更新

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

一种面向CNN的基于LRP剪枝的精度感知扩展,以防止数据稀缺迁移学习中的级联精度下降

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science National Defense Academy of Japan(日本防卫大学校数学与计算机科学系)

AI总结 针对数据稀缺迁移学习中预训练CNN剪枝导致的级联精度下降问题,提出一种精度感知的剪枝控制机制,通过动态调整剪枝率和顺序来抑制精度下降,提升模型压缩后的分类性能。

Comments Accepted to scientific reports. The title was revised during the peer review process

详情
AI中文摘要

在大规模数据集(如ImageNet)上预训练的卷积神经网络(CNN)被广泛用作特征提取器,从稀缺数据中构建特定任务的高精度分类模型。在此类场景中,由于数据稀缺,微调预训练CNN变得困难,因此必须使用固定权重。然而,当权重固定时,许多对目标任务无贡献的滤波器仍保留在模型中,导致不必要的冗余和效率降低。因此,需要有效的方法通过剪枝对推理不必要的滤波器来减小模型大小。为此,已有研究提出了利用逐层相关性传播(LRP)的方法。LRP量化每个滤波器对推理结果的贡献,从而可以剪枝低相关性的滤波器。然而,现有基于LRP的剪枝方法被观察到会导致级联精度下降。在本研究中,我们为现有基于LRP的滤波器剪枝方法引入了一种精度感知的剪枝控制机制,该机制通过使用类别精度的调和平均数动态调整剪枝率和剪枝顺序,抑制级联精度下降,并在小数据环境下压缩预训练模型的同时保持任务特定性能。我们证明,该控制机制有效缓解了级联精度下降,与现有基于LRP的剪枝方法相比,实现了更高的分类精度,将VGG16的精度-剪枝率曲线下的类别平均面积(AUC)比传统基于LRP的方法提高了约15%。

英文摘要

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15% over conventional LRP-based approaches.
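One reason to monitor the harmonic mean of class accuracy rather than the plain average is that it drops sharply when a single class collapses. The sketch below shows that property and pairs it with a purely hypothetical control rule (halve the pruning rate when the harmonic mean falls more than a tolerance below its baseline); the abstract does not specify the actual rule, so `next_pruning_rate` and its parameters are assumptions.

```python
def harmonic_mean(accuracies):
    """Harmonic mean of per-class accuracies; 0.0 if any class has zero accuracy."""
    if any(a <= 0 for a in accuracies):
        return 0.0
    return len(accuracies) / sum(1.0 / a for a in accuracies)


def next_pruning_rate(rate, class_acc, baseline_hm, tolerance=0.05):
    """Hypothetical control rule: back off when the harmonic mean degrades."""
    if harmonic_mean(class_acc) < baseline_hm - tolerance:
        return rate / 2
    return rate


balanced = [0.9, 0.9, 0.9]
collapsed = [1.0, 1.0, 0.7]  # same arithmetic mean as `balanced`

hm_balanced = harmonic_mean(balanced)
hm_collapsed = harmonic_mean(collapsed)  # lower than hm_balanced
```

Both accuracy lists average to 0.9 arithmetically, but the harmonic mean of the second is lower, so a controller driven by it reacts to the weakened class before the average moves.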

2510.27607 2026-05-29 cs.CV cs.RO 版本更新

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

双流扩散用于世界模型增强的视觉-语言-动作模型

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin

发表机构 * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Technology, Seoul, Republic of Korea(金 Jaechul 人工智能研究生院,韩国科学技术院,首尔,大韩民国)

AI总结 提出DUST框架,通过双流扩散Transformer和异步采样方法,解决世界模型增强的视觉-语言-动作模型中的模态差距问题,在模拟和真实任务中取得显著性能提升。

Comments Accepted at ICML 2026. Project page at https://periphanes.github.io/dust (20 pages, 10 figures)

详情
AI中文摘要

用世界模型增强视觉-语言-动作模型(VLA)对于机器人策略学习很有前景,但由于模态差距,在联合预测状态和动作方面面临挑战。为了解决这个问题,我们提出了DUal-STream diffusion(DUST),一个世界模型增强的VLA框架,其特点是一个多模态扩散Transformer,在保持独立模态流的同时实现跨模态知识共享。此外,DUST利用独立的噪声扰动和解耦的流匹配损失来学习跨模态因果关系。我们进一步引入了一种用于动作和视觉令牌的异步采样方法,通过推理时缩放来增强性能。在RoboCasa和GR-1等模拟基准上的实验结果表明,DUST相对于最先进的VLA和世界建模基线实现了高达6%的性能提升,推理时缩放额外提供了2-5%的提升。在使用Franka Research 3的真实世界任务中,DUST的成功率比基线高出10%。最后,我们证明了DUST通过在无动作视频上的预训练以及与异构机器人和人类数据集的联合训练,实现了有效的迁移学习。

英文摘要

Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.

2510.26412 2026-05-29 cs.CV cs.AI 版本更新

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

LoCoT2V-Bench: 长文本与复杂文本到视频生成的基准测试

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳研究院) The University of Hong Kong(香港大学)

AI总结 针对长视频生成在复杂文本输入下的评估挑战,提出包含多场景提示与层次元数据的基准LoCoT2V-Bench,并设计多维度评估框架LoCoT2V-Eval,实验发现模型在细粒度文本-视频对齐和角色一致性方面存在显著不足。

Comments Accepted by ICML 2026 (Regular)

详情
AI中文摘要

近期文本到视频生成在短片段上取得了令人印象深刻的性能,但在复杂文本输入下评估长视频生成仍然是一个重大挑战。为应对这一挑战,我们提出了LoCoT2V-Bench,一个用于长视频生成(LVG)的基准,包含具有层次元数据(如角色设置和相机行为)的多场景提示,这些提示从收集的真实世界视频中构建。我们进一步提出了LoCoT2V-Eval,一个多维度评估框架,涵盖感知质量、文本-视频对齐、时间质量、动态质量和人类期望实现程度(HERD),重点关注细粒度文本-视频对齐和时间角色一致性等方面。在17个代表性LVG模型上的实验揭示了评估维度之间的显著能力差异,模型在感知质量和背景一致性方面表现强劲,但在细粒度文本-视频对齐和角色一致性方面明显较弱。这些发现表明,提高提示忠实度和身份保持仍是长视频生成的关键挑战。我们的代码和数据发布在https://github.com/XqZeppelinhead0702/LoCoT2V-Bench。

英文摘要

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

2508.16873 2026-05-29 cs.CV cs.SI 版本更新

Multimodal LLMs See Sentiment

多模态大语言模型感知情感

Neemias B. da Silva, John Harrison, Rodrigo Minetto, Myriam R. Delgado, Bogdan T. Nassu, Thiago H. Silva

发表机构 * Universidade Tecnológica Federal do Paraná(巴拉那联邦理工大学) University of Toronto(多伦多大学)

AI总结 本文通过系统评估研究,探讨多模态大语言模型在图像情感分析中的三种方法,发现基于MLLM描述的两阶段流水线在微调后性能显著优于传统基线。

Comments 24 pages, 7 figures

详情
AI中文摘要

理解视觉内容如何传达情感在以图像为主导的数字环境中日益重要。然而,情感感知依赖于复杂的场景级语义,这对计算模型而言是一项具有挑战性的任务。本文通过一项系统性的、以评估为导向的研究,从三个视角考察多模态大语言模型如何执行图像情感分析:(i) 使用MLLM直接从图像进行情感分类;(ii) 使用预训练LLM对MLLM生成的描述进行情感分析;(iii) 在情感标注的描述上微调这些LLM以评估性能和泛化能力。在最新基准上的实验表明,两阶段MLLM描述中介流水线在多种评估设置下能显著提高预测准确性,尤其是当LLM组件被微调时。在不同的一致性阈值和情感粒度下,该流水线的最强配置在基准测试中分别优于基于词典、CNN和Transformer的基线高达30.9%、64.8%和42.4%。在跨数据集评估中,所提出的流水线——无需在目标数据集上进行训练或微调——仍比最佳域内基线高出8%以上。总体而言,本研究提供了对MLLM描述中介情感分析的综合评估,阐明了其有效的条件、失败的场景以及与基于传统视觉方法的比较,同时为未来研究提供了可复现的基准资源。

英文摘要

Understanding how visual content conveys sentiment is increasingly important in a digital landscape dominated by imagery. However, sentiment perception depends on complex scene-level semantics, making this a challenging task for computational models. This paper examines how Multimodal Large Language Models (MLLMs) perform sentiment analysis in images through a systematic, evaluation-driven study encompassing three perspectives: (i) direct sentiment classification from images using MLLMs; (ii) sentiment analysis on MLLM-generated descriptions using pre-trained LLMs; and (iii) fine-tuning these LLMs on sentiment-labeled descriptions to assess performance and generalization. Experiments on a recent benchmark show that a two-stage MLLM description-mediated pipeline can substantially improve prediction accuracy under several evaluation settings, particularly when the LLM component is fine-tuned. Across different agreement thresholds and sentiment granularities, the strongest configurations of this pipeline outperform lexicon-, CNN-, and Transformer-based baselines in our benchmark by up to 30.9%, 64.8%, and 42.4%, respectively. In cross-dataset evaluation, the proposed pipeline - without training or fine-tuning on the target dataset - still surpasses the best in-domain baseline by over 8%. Overall, the study provides a comprehensive assessment of MLLM description-mediated sentiment analysis, clarifying the conditions under which it is effective, the scenarios in which it fails, and its comparison with traditional vision-based approaches, while also providing a reproducible benchmark resource for future research.

2508.15151 2026-05-29 eess.IV cs.CV 版本更新

Zero-shot CT Super-Resolution using Diffusion-based 2D Projection Priors and Signed 3D Gaussians

基于扩散的二维投影先验和有符号三维高斯的零样本CT超分辨率

Jeonghyun Noh, Hyun-Jic Oh, Won-Ki Jeong

发表机构 * Department of Computer Science and Engineering, Korea University, Seoul, Korea(高丽大学计算机科学与工程系,韩国首尔)

AI总结 提出一种零样本三维CT超分辨率框架,通过扩散模型上采样二维投影先验并结合有符号三维高斯溅射(NAB-GS)重建高分辨率CT体积,在公开数据集上实现4倍超分辨率的优越性能。

Comments MICCAI 2026 early accepted

详情
AI中文摘要

计算机断层扫描(CT)在临床诊断中至关重要,但获取高分辨率(HR)CT受到辐射暴露风险的限制。虽然基于深度学习的超分辨率(SR)方法在从低分辨率(LR)输入重建HR CT方面显示出前景,但监督方法需要通常不可用的配对数据集。零样本方法通过处理单个LR输入来解决这一限制;然而,由于单个体积内LR信息有限,它们常常无法恢复精细的结构细节。为克服这些限制,我们提出了一种新颖的零样本三维CT SR框架,将基于扩散的上采样二维投影先验集成到三维重建过程中。具体而言,我们的框架包含两个阶段:(1)LR CT投影SR,在丰富的X射线数据上训练扩散模型以对LR投影进行上采样,从而增强LR输入中固有的稀缺信息。(2)三维CT体积重建,使用我们新颖的负Alpha混合(NAB-GS)的三维高斯溅射,该技术建模正负高斯密度以学习扩散生成的HR投影与上采样的LR投影之间的有符号残差。我们的框架在两个公开数据集上展示了优越的定量和定性性能,专家评估表明了该框架在4倍超分辨率下的临床潜力。

英文摘要

Computed tomography (CT) is important in clinical diagnosis, but acquiring high-resolution (HR) CT is constrained by radiation exposure risks. While deep learning-based super-resolution (SR) methods have shown promise for reconstructing HR CT from low-resolution (LR) inputs, supervised approaches require paired datasets that are often unavailable. Zero-shot methods address this limitation by operating on single LR inputs; however, they frequently fail to recover fine structural details due to limited LR information within individual volumes. To overcome these limitations, we propose a novel zero-shot 3D CT SR framework that integrates diffusion-based upsampled 2D projection priors into the 3D reconstruction process. Specifically, our framework consists of two stages: (1) LR CT projection SR, which trains a diffusion model on abundant X-ray data to upsample LR projections, thereby enhancing the scarce information inherent in the LR inputs; and (2) 3D CT volume reconstruction, which uses 3D Gaussian splatting with our novel Negative Alpha Blending (NAB-GS) to model positive and negative Gaussian densities and learn signed residuals between diffusion-generated HR and upsampled LR projections. Our framework demonstrates superior quantitative and qualitative performance on two public datasets, and expert evaluations indicate the framework's clinical potential at 4x super-resolution.

2508.12176 2026-05-29 cs.CV cs.AI eess.SP 版本更新

Scalable RF Simulation in Generative 4D Worlds

生成式4D世界中的可扩展射频仿真

Zhiwei Zheng, Dongyin Hu, Mingmin Zhao

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出WaveVerse框架,通过语言引导的4D世界生成器和物理信号模拟器实现可扩展的射频信号仿真,在相位敏感基准上表现高保真度,并有效提升下游任务性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

射频(RF)感知已成为一种强大的、保护隐私的替代视觉方法,用于各种感知任务。然而,在动态和多样化的环境中构建高质量的RF数据集仍然是一个重大挑战。为了解决这一问题,我们引入了WaveVerse,一个基于提示的可扩展框架,该框架从生成的室内场景中模拟真实的RF信号,并包含由空间路径引导的人体运动,从而无需手动轨迹设计即可实现多样且可行的行为。WaveVerse具有语言引导的4D世界生成器和基于物理的信号模拟器,能够在多样化的环境中实现RF信号的逼真模拟。它采用了一个相位相干光线追踪器,保留了空间和时间上的相位一致性。模拟信号在相位敏感基准上显示出高保真度,并且与真实世界收集的测量数据以及来自专有电磁求解器的模拟结果高度一致。当用于数据增强时,WaveVerse在RF成像和人类活动识别等下游任务中持续提升性能,其增益随模拟数据量的增加而增长,并超越了现有方法。代码和附加材料可在网页上获取。

英文摘要

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods. Code and additional materials are available on the webpage.

2508.10566 2026-05-29 cs.CV 版本更新

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

HM-Talker:用于高保真说话头合成的混合运动建模

Shiyu Liu, Kui Jiang, Junjun Jiang, Xianming Liu, Xiaocheng Feng, Hongxun Yao, Qi Tian

发表机构 * Harbin Institute of Technology University(哈尔滨工业大学) Guangdong Laboratory of Artificial Intelligence and Digital Economy(广东省人工智能与数字经济实验室)

AI总结 提出HM-Talker框架,通过混合显式发音线索与隐式韵律特征,结合交叉模态映射和随机特征配对策略,解决说话头生成中个性化与泛化的权衡问题,在视觉真实感和唇同步精度上超越现有方法。

详情
AI中文摘要

音频驱动的说话头生成面临个性化与泛化之间的基本权衡,限制了其实际应用。隐式模型通常以结构不一致为代价实现泛化,导致不稳定的头部运动和不准确的唇同步。而显式方法引入了几何和解剖先验,如参数化面部几何的3D可变形模型(3DMM)或编码面部肌肉运动的动作单元(AU),但它们往往产生过度中性的表情或泛化能力有限。为解决这一矛盾,我们提出了HM-Talker,一个音频驱动的说话头框架,它协同整合显式发音线索与隐式韵律特征,以刻画身份特定动态,同时实现音频驱动的泛化。其显著特点可概括为:i) 跨模态映射模块(CMMM),从音频和视频中提取全面的运动线索词汇表;ii) 混合运动建模模块(HMMM),采用随机特征配对(SFP)策略,动态融合配对的隐式和显式特征以进行运动合成。该设计促进了下半部分面部运动的迭代优化,在身份特定目标与身份无关(仅音频)目标之间交替进行。大量实验表明,HM-Talker在多种设置下的视觉真实感和唇同步精度方面均优于最先进方法。

英文摘要

Audio-driven talking head generation faces a fundamental trade-off between personalization and generalization, limiting its practical application. Implicit models often achieve generalization at the cost of structural incoherence, resulting in unstable head motion and inaccurate lip synchronization. Explicit methods, in contrast, incorporate geometric and anatomical priors such as 3D Morphable Models (3DMMs), which parameterize facial geometry, or Action Units (AUs), which code facial muscle movements, but they tend to produce overly neutral expressions or suffer from limited generalization. To resolve this conflict, we present HM-Talker, an audio-driven talking head framework that synergistically integrates explicit articulatory cues with implicit prosodic features to characterize identity-specific dynamics while enabling audio-driven generalization. Its distinctive features can be summarized as: i) the Cross-Modal Mapping Module (CMMM) that extracts a comprehensive vocabulary of motion cues from audio and video, and ii) the Hybrid Motion Modeling Module (HMMM) that employs a Stochastic Feature Pairing (SFP) strategy to dynamically merge paired implicit and explicit features for motion synthesis. This design facilitates an iterative optimization of the lower face motion, alternating between identity-specific and identity-agnostic (audio-only) objectives. Extensive experiments demonstrate that HM-Talker outperforms state-of-the-art methods in both visual realism and lip-sync accuracy across diverse settings.

2508.08677 2026-05-29 cs.LG cs.CV 版本更新

Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

多级协作蒸馏遇见全局工作空间模型:面向OCIL的统一框架

Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran

发表机构 * School of Computer Science, Northwestern Polytechnical University(西北工业大学计算机学院) School of Telecommunications Engineering, Xidian University(西安电子科技大学电信工程学院)

AI总结 提出一种结合全局工作空间模型和多级协作蒸馏的统一框架,通过融合多学生模型参数形成共享隐式记忆并周期性广播,以及跨学生一致性和历史知识对齐机制,有效平衡在线类增量学习中的稳定性与可塑性。

Comments 15 pages, 8 figures

详情
AI中文摘要

在线类增量学习(OCIL)使模型能够从非独立同分布的数据流中持续学习。由于数据流中的样本只能被看到一次,因此与离线学习相比,它更适用于现实场景。然而,这一约束加剧了OCIL在维持稳定性与可塑性之间适当平衡的挑战。此外,在现实世界中更严格的内存缓冲区约束下,当前基于重放的方法效果较差。虽然集成方法提高了可塑性,但它们常常在稳定性上遇到困难。受全局工作空间理论(GWT)启发,我们提出了一种新颖方法,通过全局工作空间模型(GWM)——一种共享的隐式记忆,指导多个学生模型的学习——来增强集成学习。GWM通过在每个训练批次中融合所有学生的参数形成,捕获历史学习轨迹,并作为知识巩固的动态锚点。类似于GWT的广播机制,GWM定期重新分发给学生,稳定学习并促进跨任务一致性。此外,我们引入了一种多级协作蒸馏机制。它强制学生之间保持对等一致性,并通过将每个学生与GWM对齐来保留历史知识。因此,学生模型在保持先前所学知识的同时,仍能适应新任务,在稳定性与可塑性之间实现更好的平衡。在三个标准OCIL基准上的大量实验表明,我们的方法在各种内存预算下为多个OCIL模型带来了显著的性能提升。代码可在https://github.com/susususushi/GWM获取。

英文摘要

Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams. Since each sample in the data stream can be seen only once, OCIL is better suited to real-world scenarios than offline learning. However, this constraint intensifies the challenge for OCIL in maintaining an appropriate balance between stability and plasticity. Moreover, under the stricter memory buffer constraints of the real world, current replay-based methods are less effective. While ensemble methods improve plasticity, they often struggle with stability. Inspired by the Global Workspace Theory (GWT), we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM), a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. Like the broadcasting mechanism of GWT, the GWM is redistributed periodically to students, stabilizing learning and promoting cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. It enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. The code is available at https://github.com/susususushi/GWM.
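The fuse-then-broadcast cycle described above can be sketched with plain parameter lists. This is an illustrative guess at the mechanics, not the released implementation: the abstract does not state the fusion rule, so the elementwise mean, the exponential moving average in `update_workspace`, the overwrite-style `broadcast`, and the period `BROADCAST_EVERY` are all assumptions.

```python
def fuse(students):
    """Elementwise mean of the students' parameter vectors (assumed fusion rule)."""
    count = len(students)
    return [sum(values) / count for values in zip(*students)]


def update_workspace(workspace, students, momentum=0.9):
    """Blend the freshly fused parameters into the running workspace (assumed EMA)."""
    fused = fuse(students)
    return [momentum * w + (1.0 - momentum) * f for w, f in zip(workspace, fused)]


def broadcast(workspace, students):
    """Give every student its own copy of the workspace parameters."""
    return [list(workspace) for _ in students]


# Two toy students with two parameters each (illustrative values).
students = [[1.0, 2.0], [3.0, 4.0]]
workspace = fuse(students)  # initialise the workspace from the first batch

BROADCAST_EVERY = 2  # hypothetical broadcast period, in batches
for step in range(1, 5):
    # Stand-in for a gradient step: each student drifts by a different amount.
    students = [[p + 0.1 * (i + 1) for p in params] for i, params in enumerate(students)]
    workspace = update_workspace(workspace, students)
    if step % BROADCAST_EVERY == 0:
        students = broadcast(workspace, students)
```

Because the workspace is a running blend rather than the latest fusion alone, it changes more slowly than any single student, which is the sense in which it can act as an anchor between broadcasts.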

2507.21114 2026-05-29 cs.IR cs.AI cs.CV 版本更新

Page image classification for content-specific data processing

面向特定内容数据处理的页面图像分类

Kateryna Lutsai

AI总结 本研究针对人文学科数字化项目中历史文档页面图像内容多样、手动分类困难的问题,开发并评估了一种基于人工智能和机器学习的图像分类系统,通过按内容类别(如文本类型、图形元素、布局)自动分类页面,以支持定制化的下游分析流程。

Comments Dataset licensing issues occurred

详情
AI中文摘要

人文学科的数字化项目通常会产生大量历史文档的页面图像,这给手动分类和分析带来了巨大挑战。这些档案包含多样化的内容,包括各种文本类型(手写体、打字体、印刷体)、图形元素(图画、地图、照片)以及布局(纯文本、表格、表单)。高效处理这些异构数据需要基于页面内容进行自动分类的方法,从而能够启用定制化的下游分析流程。本项目通过开发并评估一种专门为历史文档页面设计的图像分类系统来满足这一需求,该系统利用了人工智能和机器学习的最新进展。所选的类别集旨在促进特定内容处理工作流程,将需要不同分析技术(例如,用于文本的OCR、用于图形的图像分析)的页面区分开来。

英文摘要

Digitization projects in the humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics).

2507.09574 2026-05-29 cs.CV cs.AI cs.CL 版本更新

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

MENTOR: 面向自回归视觉生成模型的高效多模态条件微调

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tsinghua University(清华大学) Peking University(北京大学) Microsoft(微软公司)

AI总结 提出MENTOR框架,通过两阶段训练范式实现自回归图像生成器与多模态输入的细粒度token级对齐,无需辅助适配器或交叉注意力模块,在DreamBench++上取得优异性能。

Comments Findings of ACL 2026

详情
AI中文摘要

最近的文本到图像模型能够生成高质量结果,但在精确视觉控制、平衡多模态输入以及需要大量训练以实现复杂多模态图像生成方面仍存在困难。为解决这些局限,我们提出MENTOR,一种新颖的自回归(AR)框架,用于高效的多模态条件微调以实现自回归多模态图像生成。MENTOR将AR图像生成器与两阶段训练范式相结合,无需依赖辅助适配器或交叉注意力模块,即可实现多模态输入与图像输出之间的细粒度、token级对齐。两阶段训练包括:(1)多模态对齐阶段,建立稳健的像素级和语义级对齐;随后是(2)多模态指令微调阶段,平衡多模态输入的整合并增强生成可控性。尽管模型规模适中、基础组件非最优且训练资源有限,MENTOR在DreamBench++基准测试上仍取得了强劲性能,在概念保持和提示遵循方面优于竞争基线。此外,与基于扩散的方法相比,我们的方法具有更优的图像重建保真度、广泛的任务适应性以及更高的训练效率。数据集、代码和模型可在 https://github.com/HaozheZhao/MENTOR 获取。

英文摘要

Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

2505.21996 2026-05-29 cs.CV cs.AI 版本更新

VRAG: Learning World Models for Interactive Video Generation

VRAG:面向交互式视频生成的世界模型学习

Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

发表机构 * Peking University(北京大学) University of Oxford(牛津大学) Princeton University(普林斯顿大学)

AI总结 针对自回归视频生成中累积误差和记忆机制不足的问题,提出视频检索增强生成(VRAG)方法,通过显式全局状态条件降低长期累积误差并提升时空一致性。

Comments Published at NeurIPS 2025. Project page: https://sites.google.com/view/vrag

详情
AI中文摘要

基础世界模型必须既具有交互性,又能保持时空连贯性,以便通过动作选择进行有效的未来规划。然而,当前的长时间视频生成模型由于两个主要挑战而具有有限的内在世界建模能力:累积误差和记忆机制不足。我们通过额外的动作条件和自回归框架增强了图像到视频模型的交互能力,并揭示了在自回归视频生成中累积误差本质上是不可约的,而记忆机制不足则导致世界模型的不连贯。我们提出了带有显式全局状态条件的视频检索增强生成(VRAG),它显著减少了长期累积误差并提高了世界模型的时空一致性。相比之下,具有扩展上下文窗口的朴素自回归生成和检索增强生成在视频生成中被证明效果较差,这主要是由于当前视频模型有限的上下文学习能力。我们的工作阐明了视频世界模型中的基本挑战,并为改进具有内在世界建模能力的视频生成模型建立了全面的基准。

英文摘要

Foundational world models must both be interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanisms lead to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

2505.21876 2026-05-29 cs.CV cs.AI 版本更新

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

EPiC: 基于精确锚点视频引导的高效视频摄像机控制学习

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

AI总结 提出EPiC框架,通过基于首帧可见性掩码构建精确对齐的锚点视频,并引入轻量模块Anchor-ControlNet,以极低参数实现高效、精确的3D摄像机控制,在RealEstate10K和MiraData上达到最先进性能。

Comments Accepted to ICML 2026. Project website: https://zunwang1.github.io/Epic

详情
AI中文摘要

近期带摄像机控制的视频生成方法通常通过从估计的点云沿摄像机轨迹渲染,创建锚点视频(即近似所需摄像机运动的渲染视频),以作为结构化先验引导扩散模型。然而,点云和摄像机轨迹估计中的误差常导致不准确的锚点视频,并带来更高的训练成本和低效率,因为模型被迫补偿渲染错位。为解决这些局限,我们提出EPiC,一种高效且精确的摄像机控制学习框架,无需摄像机姿态或点云估计即可构建良好对齐的训练锚点视频。具体而言,我们通过基于首帧可见性掩码掩蔽源视频来创建高精度锚点视频,这确保了强对齐,消除了对摄像机/点云估计的需求,因此可轻松应用于任意野外视频。此外,我们引入Anchor-ControlNet,一种轻量模块,将可见区域中的锚点视频引导集成到预训练视频扩散模型中,仅增加不到1%的额外参数。EPiC以显著更少的参数、训练步骤和数据实现高效训练,并在测试时对使用点云制作的锚点视频具有鲁棒泛化能力,从而实现精确的3D感知摄像机控制。EPiC在RealEstate10K和MiraData上的I2V摄像机控制任务中达到最先进性能。值得注意的是,EPiC还展现出对视频到视频(V2V)场景的强零样本泛化能力。

英文摘要

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often lead to inaccurate anchor videos, which raise training cost and lower efficiency because the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs well-aligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions into pretrained video diffusion models, with less than 1% of additional parameters. EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData on the I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

2504.12747 2026-05-29 cs.CV 版本更新

Privacy Protection Against Personalized Text-to-Image Synthesis via Cross-image Consistency Constraints

针对个性化文本到图像合成的跨图像一致性约束隐私保护

Guanyu Wang, Kailong Wang, Yihao Huang, Mingyi Zhou, Geguang Pu, Li Li

发表机构 * Beihang University(北京航空航天大学) Huazhong University of Science and Technology(华中科技大学) East China Normal University(华东师范大学)

AI总结 提出跨图像反个性化框架,通过强制扰动图像间的风格一致性并采用动态比率调整策略,增强对扩散模型个性化攻击的抵抗能力。

详情
AI中文摘要

扩散模型和个性化技术的快速发展使得仅凭少量公开图像就能重建个人肖像成为可能。虽然这种能力赋能了各种创意应用,但也带来了严重的隐私问题,因为攻击者可以利用它生成高度逼真的冒充图像。为应对这些威胁,反个性化方法被提出,通过向已发布图像添加对抗性扰动来破坏个性化模型的训练。然而,现有方法很大程度上忽视了个性化固有的多图像特性,而是采用一种朴素的独立应用扰动策略(如同在单图像设置中常见的那样)。这忽略了利用图像间关系实现更强隐私保护的机会。因此,我们倡导从群体层面看待针对个性化的隐私保护。具体而言,我们引入了跨图像反个性化(CAP),一种通过强制扰动图像间的风格一致性来增强对个性化抵抗能力的新型框架。此外,我们开发了一种动态比率调整策略,可在攻击迭代过程中自适应地平衡一致性损失的影响。在经典CelebHQ和VGGFace2基准上的大量实验表明,CAP显著改进了现有方法。

英文摘要

The rapid advancement of diffusion models and personalization techniques has made it possible to recreate individual portraits from just a few publicly available images. While such capabilities empower various creative applications, they also introduce serious privacy concerns, as adversaries can exploit them to generate highly realistic impersonations. To counter these threats, anti-personalization methods have been proposed, which add adversarial perturbations to published images to disrupt the training of personalization models. However, existing approaches largely overlook the intrinsic multi-image nature of personalization and instead adopt a naive strategy of applying perturbations independently, as commonly done in single-image settings. This neglects the opportunity to leverage inter-image relationships for stronger privacy protection. Therefore, we advocate for a group-level perspective on privacy protection against personalization. Specifically, we introduce Cross-image Anti-Personalization (CAP), a novel framework that enhances resistance to personalization by enforcing style consistency across perturbed images. Furthermore, we develop a dynamic ratio adjustment strategy that adaptively balances the impact of the consistency loss throughout the attack iterations. Extensive experiments on the classical CelebHQ and VGGFace2 benchmarks show that CAP substantially improves existing methods.

2503.20897 2026-05-29 cs.CV 版本更新

Domain-Agnostic Feature Modulation for Semi-Supervised Domain Generalization

面向半监督领域泛化的域无关特征调制

Venuri Amarasinghe, Kalinga Bandara, Isun Randila, Asini Jayakody, Chamuditha Jayanga Galappaththige, Ranga Rodrigo

发表机构 * University of Moratuwa(莫勒图沃大学) Queensland University of Technology(昆士兰科技大学)

AI总结 针对半监督领域泛化中无域标签的挑战,提出一种特征调制策略与损失缩放函数,通过增强类判别特征、抑制域特定信息并动态降低伪标签置信度阈值,显著提升模型在多个基准上的泛化性能。

Comments Accepted at CVPRW 2026

详情
AI中文摘要

半监督领域泛化(SSDG)利用少量标注数据与大量未标注数据来增强模型泛化能力。现有SSDG方法大多依赖伪标签(PL)处理未标注数据,且常假设可获取域标签——这一特权并非总是可用。然而,域偏移引入域噪声,导致不一致的伪标签,从而降低模型性能。源自FixMatch的方法尤其受限于较低的伪标签准确率,削弱了未标注数据的效用。为解决此问题,我们应对更具挑战性的域标签不可知SSDG场景,即在训练过程中未标注数据的域标签不可用。首先,我们提出一种特征调制策略,该策略在抑制域特定信息的同时增强类判别特征。此调制将特征推向“相似平均表示”(类原型的改进版本),该表示跨域鲁棒,促使分类器区分紧密相关的类别,并促使特征提取器形成紧密聚类、域不变的表征。其次,为缓解域噪声并提高伪标签准确率,我们引入一个损失缩放函数,该函数动态降低伪标签的固定置信度阈值,从而优化未标注数据的利用。凭借这些关键创新,我们的方法在四个主要领域泛化基准上取得了显著改进——即使在没有域标签的情况下。我们将公开代码。

英文摘要

Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data and often assume access to domain labels, a privilege that is not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations (a modified version of class prototypes) that are robust across domains, encouraging the classifier to distinguish between closely related classes and the feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks, even without domain labels. We will make the code available.
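
To make the pseudo-labeling step in the abstract above more concrete, here is a minimal sketch of FixMatch-style confidence thresholding combined with a loss weight that admits samples below the fixed threshold. The abstract does not give the form of the loss-scaling function, so the linear ramp between `tau_min` and `tau` is purely an assumption, and `pseudo_label`, `tau`, and `tau_min` are hypothetical names.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits: List[float], tau: float = 0.95,
                 tau_min: float = 0.5) -> Tuple[int, float]:
    """Return (predicted class, loss weight) for one unlabeled sample."""
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    if confidence >= tau:
        return label, 1.0  # plain FixMatch behaviour: keep at full weight
    if confidence <= tau_min:
        return label, 0.0  # too uncertain: contributes nothing to the loss
    # Assumed linear ramp: samples below tau still count, but with less weight.
    return label, (confidence - tau_min) / (tau - tau_min)


confident = pseudo_label([10.0, 0.0])
borderline = pseudo_label([math.log(3.0), 0.0])  # confidence of about 0.75
uncertain = pseudo_label([0.0, 0.0])
```

With a hard threshold of 0.95 the borderline sample would be discarded; under the assumed ramp it is kept with a weight between 0 and 1, which is one simple way to use more of the unlabeled data.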

2502.16548 2026-05-29 cs.LG cs.AI cs.CV 版本更新

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

用于电影心脏磁共振-文本驱动的心力衰竭结局预测的可组合多模态框架

Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

发表机构 * Department of Cardiology, Nanjing Drum Tower Hospital, State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University(南京鼓楼医院心内科,南京大学医药生物技术国家重点实验室) School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电子信息与电气工程学院) College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications(南京邮电大学电子与光学工程学院) College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications(南京邮电大学集成电路科学与工程学院) Department of Cardiology, Nanjing Drum Tower Hospital Clinical College of Nanjing Medical University(南京医科大学南京鼓楼医院临床学院心内科) Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications(南京邮电大学量子信息与技术研究院)

AI总结 提出一种可组合多模态框架,通过整合cine CMR影像、结构化临床指标和非结构化文本记录,实现比单模态AI算法更准确的心力衰竭预后预测,并支持个性化治疗优化。

详情
AI中文摘要

目的。根据世界卫生组织(WHO)及其他公共卫生机构的数据,心力衰竭是全球主要死因之一,每年导致数百万人死亡。尽管心力衰竭领域已取得显著进展,生存率和射血分数有所改善,但由于其复杂性和多因素特征,仍存在大量未满足的需求。本研究旨在提出并评估一种用于心力衰竭评估和治疗优化的可组合策略框架,旨在提供更全面的患者评估和管理。方法。该框架利用多模态算法分析全面的患者数据,明确整合了电影心脏磁共振(cine CMR)序列、结构化临床指标(如实验室结果、人口统计学数据)和非结构化文本记录(如病史、处方)。通过整合这些多种数据源,我们的框架为患者提供了更全面的评估和优化的治疗方案。主要结果。与单模态AI算法相比,该多模态框架在心力衰竭预后预测方面展现出更高的准确性。此外,它还能详细评估各种病理指标对心力衰竭结局的影响。意义。通过系统性地整合异质性临床数据,该方法支持更全面的预后评估,并有助于为心力衰竭患者制定优化的个性化治疗计划。

英文摘要

Objective. Heart failure (HF) is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and ejection fraction, there remain substantial unmet needs owing to the complexity and multifactorial nature of the disease. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in HF prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2412.15632 2026-05-29 cs.CV 版本更新

A New Method to Capturing Compositional Knowledge in Linguistic Space

一种在语言空间中捕获组合知识的新方法

Jiahe Wan

发表机构 * School of Computer Science, South-Central Minzu University(中南民族大学计算机科学学院)

AI总结 提出YUKINO方法,通过文本反转和“no”逻辑正则化,在无需硬负样本的情况下提升视觉语言模型的组合理解能力,在SugarCREPE基准上超越现有多模态SOTA模型8%以上。

详情
Journal ref
Neurocomputing 2026, 679, 133150
AI中文摘要

组合理解使视觉语言模型能够解释图像和文本中对象、属性和关系之间的复杂联系。然而,现有方法通常依赖硬负样本和微调,这可能会高估改进效果,且受限于获取硬负样本的难度。在这项工作中,我们引入了零样本组合理解(ZS-CU),这是一个无需硬负训练数据即可增强组合理解的新任务。我们提出了YUKINO(通过带有“NO”的文本反转产生的组合理解知识),该方法利用文本反转将未标记图像映射到预训练CLIP模型中的伪标记。我们提出引入“no”逻辑正则化来解决反转中标记交互的问题。此外,我们建议使用知识蒸馏来降低文本反转的时间复杂度。实验结果表明,YUKINO在SugarCREPE基准上比现有多模态SOTA模型高出8%以上,并且在图像检索任务中也取得了显著改进。

英文摘要

Compositional understanding allows visual language models to interpret complex relationships between objects, attributes, and relations in images and text. However, most existing methods rely on hard negative examples and fine-tuning, which can overestimate improvements and are limited by the difficulty of obtaining hard negatives. In this work, we introduce Zero-Shot Compositional Understanding (ZS-CU), a novel task that enhances compositional understanding without requiring hard negative training data. We propose YUKINO (Yielded Compositional Understanding Knowledge via Textual Inversion with NO), which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model. We propose introducing "no" logical regularization to address the issue of token interaction in inversion. Additionally, we suggest using knowledge distillation to reduce the time complexity of textual inversion. Experimental results show that YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark, and also achieves significant improvements in image retrieval tasks.

2411.14279 2026-05-29 cs.CV cs.CL 版本更新

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

超越文本:通过多模态双注意力和软图像引导减少大型视觉语言模型中的语言偏差

Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 针对大型视觉语言模型因语言偏差导致的幻觉问题,提出LACING框架,采用多模态双注意力机制和软图像引导策略,在不增加训练资源的情况下增强视觉理解并减少幻觉。

Comments EMNLP 2025

详情
AI中文摘要

大型视觉语言模型在各种视觉语言任务中取得了令人印象深刻的结果。然而,尽管表现出有前景的性能,大型视觉语言模型仍因语言偏差而产生幻觉,导致对图像的关注度降低和视觉理解效率低下。我们确定了这种偏差的两个主要原因:1. 大语言模型预训练阶段与多模态对齐阶段之间训练数据的规模差异。2. 文本数据短期依赖性导致的学习推理偏差。因此,我们提出了LACING,一个系统性框架,旨在通过多模态双注意力机制和软图像引导来解决大型视觉语言模型的语言偏差。具体来说,多模态双注意力机制引入了一种并行双注意力机制,增强了整个模型中视觉输入的整合。软图像引导在训练和推理过程中引入了一个可学习的软视觉提示,以替代视觉输入,旨在迫使大型视觉语言模型优先处理文本输入。然后,软图像引导进一步提出了一种使用软视觉提示的新解码策略,以减轻模型对相邻文本输入的过度依赖。综合实验表明,我们的方法有效地消除了大型视觉语言模型的语言偏差,增强了视觉理解并减少了幻觉,无需额外的训练资源或数据。代码和模型可在 https://lacing-lvlm.github.io 获取。

英文摘要

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: (1) the different scales of training data between the pretraining stage of the LLM and the multimodal alignment stage, and (2) the learned inference bias due to the short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.

2404.07977 2026-05-29 cs.CV 版本更新

Gaga: Group Any Gaussians via 3D-aware Memory Bank

Gaga: 通过3D感知记忆库分组任意高斯体

Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang

发表机构 * University of California, Merced(加州大学默塞德分校) NVIDIA Research(英伟达研究院) Google DeepMind(谷歌DeepMind) Atmanity Inc.(Atmanity公司)

AI总结 提出Gaga框架,利用零样本类别无关分割模型预测的不一致2D掩码,通过3D感知记忆库关联不同视角下的物体掩码,实现开放世界3D场景的重建与分割。

Comments TMLR Camera-Ready Version. Project Page: https://weijielyu.github.io/Gaga

详情
AI中文摘要

我们介绍了Gaga,一个通过利用零样本类别无关分割模型预测的不一致2D掩码来重建和分割开放世界3D场景的框架。与先前依赖视频对象跟踪或对比学习方法的3D场景分割方法不同,Gaga利用空间信息并通过新颖的3D感知记忆库有效关联不同相机姿态下的物体掩码。通过消除训练图像中连续视角变化的假设,Gaga展现出对相机姿态变化的鲁棒性,尤其有利于稀疏采样图像,确保精确的掩码标签一致性。此外,Gaga可兼容来自不同来源的2D分割掩码,并与不同的开放世界零样本类别无关分割模型展现出稳健性能,显著增强了其通用性。大量的定性和定量评估表明,Gaga的性能优于现有最先进方法,凸显了其在3D场景理解与操作等实际应用中的潜力。

英文摘要

We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot class-agnostic segmentation models. In contrast to prior 3D scene segmentation approaches that rely on video object tracking or contrastive learning methods, Gaga utilizes spatial information and effectively associates object masks across diverse camera poses through a novel 3D-aware memory bank. By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses, which is particularly beneficial for sparsely sampled images, and ensures precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and demonstrates robust performance with different open-world zero-shot class-agnostic segmentation models, significantly enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as 3D scene understanding and manipulation.
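
The idea of associating masks across views through a 3D-aware memory bank can be sketched as follows. This is only an illustration under stated assumptions: each 2D mask is assumed to have already been mapped to the set of 3D Gaussian indices it covers, and the overlap-ratio rule, the `min_overlap` threshold, and the `MemoryBank` class itself are hypothetical rather than taken from the paper.

```python
from typing import Dict, Set


class MemoryBank:
    """Keeps one set of 3D Gaussian indices per object group (illustrative)."""

    def __init__(self, min_overlap: float = 0.1) -> None:
        self.groups: Dict[int, Set[int]] = {}
        self.min_overlap = min_overlap

    def assign(self, mask_gaussians: Set[int]) -> int:
        """Return a view-consistent group id for one 2D mask."""
        best_id, best_score = -1, 0.0
        for group_id, members in self.groups.items():
            # Fraction of this mask's Gaussians already stored in the group.
            shared = len(mask_gaussians & members)
            score = shared / max(len(mask_gaussians), 1)
            if score > best_score:
                best_id, best_score = group_id, score
        if best_id == -1 or best_score < self.min_overlap:
            best_id = len(self.groups)  # no good match: open a new group
            self.groups[best_id] = set()
        self.groups[best_id] |= mask_gaussians
        return best_id


bank = MemoryBank()
first = bank.assign({1, 2, 3})    # first mask seen: opens group 0
second = bank.assign({10, 11})    # no shared Gaussians: opens group 1
third = bank.assign({2, 3, 4})    # shares 2 of 3 Gaussians with group 0
```

Because the match is made in 3D rather than between neighbouring frames, the order and spacing of the views do not matter in this sketch, which mirrors the abstract's point about sparsely sampled images.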