arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.22823 2026-05-22 cs.CV Version update

Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs

Jongseo Lee, Hyuntak Lee, Sunghun Kim, Sooa Kim, Jihoon Chung, Jinwoo Choi

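Affiliations Kyung Hee University; Princeton University

AI summary This paper studies the blind spot of video large language models in understanding directional motion and proposes the MoDirect dataset family and the DeltaDirect method. By improving the model's perception of motion direction, it significantly improves direction recognition in both synthetic and real-world scenes.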

Comments Preprint. 59 pages, including appendix. Code: https://github.com/KHU-VLL/DeltaDirect

English abstract

Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-chance cases largely attributable to prediction biases rather than genuine direction understanding. We call this failure directional motion blindness. We localize the failure by tracing motion direction information through the Video-LLM pipeline. Motion direction remains linearly accessible from the vision encoder, projector, and LLM hidden states, but the readout fails to bind this signal to the correct verbal answer option, revealing a direction binding gap. Although synthetic motion direction instruction tuning reduces this gap on the source domain, motion direction concept vector analysis shows that visual complexity weakens the signal magnitude and limits out-of-domain generalization. We introduce MoDirect, a dataset family for motion direction instruction tuning and evaluation, and DeltaDirect, a diagnosis-driven, projector-level objective that predicts normalized 2-D motion vectors from adjacent-frame feature deltas. On MoDirect-SynBench, instruction tuning with DeltaDirect improves motion direction accuracy from 25.9% to 85.4%. On MoDirect-RealBench, DeltaDirect improves real-world motion direction accuracy by 21.9 points over the vanilla baseline without real-world tuning data, while preserving standard video-understanding performance. Code: https://github.com/KHU-VLL/DeltaDirect
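
The projector-level objective described in the abstract can be sketched roughly as follows: take per-frame feature vectors, form adjacent-frame deltas, map each delta to a 2-D vector, and compare it with a unit-norm ground-truth direction. This is a minimal sketch, not the released DeltaDirect code. The linear head, the cosine loss, and the names `feature_deltas`, `predict_motion`, and `direction_loss` are all assumptions made for illustration.

```python
import math

def feature_deltas(frames):
    """Differences between adjacent per-frame feature vectors."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]

def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def predict_motion(delta, weight):
    """Hypothetical linear head: weight is a 2 x D matrix mapping a delta to (dx, dy)."""
    return [sum(w * d for w, d in zip(row, delta)) for row in weight]

def direction_loss(pred, target):
    """1 - cosine similarity between the predicted and target directions."""
    p, t = normalize(pred), normalize(target)
    return 1.0 - sum(a * b for a, b in zip(p, t))

# Toy example: three frames with 2-D features in which the object drifts right.
frames = [[0.0, 0.0], [1.0, 0.1], [2.0, 0.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity head, for illustration only
deltas = feature_deltas(frames)
losses = [direction_loss(predict_motion(d, weight), [1.0, 0.0]) for d in deltas]
```

The loss is near 0 when the predicted direction matches the target and near 2 when it points the opposite way, so the sign of the motion is what is being supervised.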

2605.22819 2026-05-22 cs.CV Version update

Cambrian-P: Pose-Grounded Video Understanding

Jihan Yang, Zifan Zhao, Xichen Pan, Shusheng Yang, Junyi Zhang, Bingyi Kang, Hu Xu, Saining Xie

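Affiliations New York University; UC Berkeley; Meta FAIR

AI summary This study proposes Cambrian-P, a video multimodal LLM augmented with per-frame learnable camera tokens and a pose regression head. Using pose as a lightweight supervisory signal, it substantially improves spatial reasoning, generalizes across additional spatial and general video QA benchmarks, and achieves state-of-the-art streaming pose estimation on ScanNet.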

Comments Project Page: https://cambrian-mllm.github.io/

English abstract

Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead of the persistent scene humans perceive. We revisit pose as a lightweight supervisory signal and introduce Cambrian-P, a video MLLM augmented with per-frame learnable camera tokens and a pose regression head. With a carefully designed sampling scheme, the model achieves substantial gains of 4.5-6.5% on spatial reasoning benchmarks such as VSI-Bench, generalizes across eight additional spatial and general video QA benchmarks, and, as a byproduct, achieves state of the art streaming pose estimation on ScanNet. Surprisingly, training on pseudo-annotated poses from in-the-wild video further improves general video QA benchmarks, showing pose helps beyond spatial reasoning. Together, these results position camera pose as a fundamental signal for video models that reason about the physical world.

2605.22818 2026-05-22 cs.CV Version update

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

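Affiliations University of California, Merced; Adobe Research

AI summary This paper proposes MotiMotion, a framework that recasts motion control as a reasoning-then-generation problem to improve causal and commonsense consistency in video generation. It introduces a training-free vision-language reasoner and a confidence-aware control scheme, and evaluates the plausibility of generated object behaviors and interactions on the MotiBench benchmark.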

Comments ICML 2026. Project page: https://motimotion.github.io/

English abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.

2605.22816 2026-05-22 cs.RO cs.CV Version update

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

Wenxuan Guo, Xiuwei Xu, Yichen Liu, Xiangyu Li, Hang Yin, Huangxing Chen, Wenzhao Zheng, Jianjiang Feng, Jie Zhou, Jiwen Lu

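Affiliations Tsinghua University

AI summary This paper proposes AwareVLN, a framework that achieves end-to-end vision-language navigation through a self-aware reasoning mechanism. It addresses the limited understanding in prior methods of the relationships between the agent, the instruction, and the scene, and outperforms existing methods on multiple datasets.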

Comments Accepted to CVPR 2026. Project page: https://gwxuan.github.io/AwareVLN/

English abstract

Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit and explainable understanding of the relationships between the agent, the instruction, and the scene. Conversely, explicitly building a scene map for heuristic planning is intuitively appealing but relies on additional 3D sensors and hinders large-scale vision-language pre-training. To bridge this gap, we propose AwareVLN, a novel framework that equips the navigation model with a self-aware reasoning mechanism, enabling it to understand the agent's state and task progress in a fully end-to-end and data-driven manner. Our approach features two key innovations: (1) a structural reasoning module that fosters spatial and task-oriented self-awareness, and (2) an automatic data engine with progress division for effective training. Extensive experiments on various datasets in Habitat simulator show our AwareVLN significantly outperforms previous state-of-the-art vision-language navigation methods. Project page: https://gwxuan.github.io/AwareVLN/.

2605.22812 2026-05-22 cs.RO cs.CV Version update

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Wenxuan Guo, Ziyuan Li, Meng Zhang, Yichen Liu, Yimeng Dong, Chuxi Xu, Yunfei Wei, Ze Chen, Erjin Zhou, Jianjiang Feng

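Affiliations Tsinghua University; Dexmal

AI summary This paper proposes GesVLA, which introduces gesture as a parallel instruction modality to resolve the spatial ambiguity that existing VLA systems face in complex scenes. It adopts a dual-VLM architecture to tightly couple gesture representations with action policies, and uses a gesture data generation pipeline and a two-stage training strategy to improve target grounding accuracy and human-robot interaction efficiency.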

Comments Project page: https://gwxuan.github.io/GesVLA/

English abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap while producing rich data with diverse motion patterns and corresponding pointing annotations. In addition, we employ a two-stage training strategy to equip the model with both gesture perception and action prediction capabilities. We evaluate our approach on multiple real-world robotic tasks, including a controlled block manipulation task for validation and more practical scenarios such as product and produce selection. Experimental results show that incorporating gesture consistently improves target grounding accuracy and human-robot interaction efficiency, especially in complex and cluttered environments. Project page: https://gwxuan.github.io/GesVLA/.

2605.22777 2026-05-22 cs.CV Version update

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang

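Affiliations Zhejiang University; Shanghai Innovation Institute; Fudan University; Westlake University

AI summary This paper proposes DecQ, a framework that introduces lightweight detail-condensing queries to mitigate the trade-off between reconstruction and generation in representation autoencoders, improving both reconstruction quality and generative performance.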

English abstract

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
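
The PSNR figures can be put in perspective with a short calculation. PSNR is defined as 10·log10(MAX²/MSE), so a gain in dB corresponds to a multiplicative drop in mean squared error. The sketch below applies that standard definition to the two numbers quoted in the abstract; the function names are illustrative and nothing here comes from the DecQ code.

```python
import math

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_from_psnr_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

gain_db = 22.76 - 19.13          # PSNR gain reported in the abstract
ratio = mse_ratio_from_psnr_gain(gain_db)
print(f"gain = {gain_db:.2f} dB, MSE reduced by a factor of {ratio:.2f}")
```

A gain of 3.63 dB therefore corresponds to reconstruction error that is roughly 2.3 times lower in the mean-squared sense.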

2605.22767 2026-05-22 cs.CV Version update

Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition

Ganlin Feng, Yuxi Long, Erin Lou, Lianghong Chen, Zihao Jing, Pingzhao Hu, Wei Xu

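Affiliations Western University; University of Toronto

AI summary This study examines whether synthetic data alone can overcome data scarcity in pediatric rare disease recognition. Experiments find that high-fidelity synthetic data can approximate clinically meaningful distributions, which supports its use as a privacy-preserving visual resource for genetic counseling.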

Comments CVPR 2026 CV4CHL workshop

English abstract

Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automated diagnosis but also restrict the availability of visual resources for clinical genetic counseling. While prior work has shown that synthetic data can augment real datasets and preserve phenotype-level semantics, it remains unclear whether synthetic data alone is sufficient for learning in ultra-low-resource pediatric settings. In this work, we study the synthetic-only regime for pediatric rare disease recognition. Under a controlled experimental setup, models are trained exclusively on phenotype-aware synthetic facial images at increasing scales. We find that synthetic-only training achieves performance comparable to real-data-only baselines at sufficient scale across multiple backbones, suggesting that high-fidelity synthetic data can approximate clinically meaningful distributions. These findings together further enable the use of synthetic pediatric facial images as privacy-preserving resources for genetic education and counseling, supporting clinician training and patient communication. Our results highlight the potential of computer vision to improve data efficiency and expand accessible visual tools in children's healthcare.

2605.22751 2026-05-22 cs.CV Version update

Spectral Tail Auxiliary Learning for AI-Generated Image Detection

Xingyi Li, Jiahui Zhang, Yiheng Li, Yun Cao, Wenhao Wang

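Affiliations Institute of Information Engineering, Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; Vast Intelligence Lab

AI summary This paper proposes STAL, an auxiliary learning framework based on spectral-tail features for detecting AI-generated images. By analyzing the radial log-power spectra of real and generated images, it finds that generated images show an anomalous uplift in the ultra-high-frequency tail, termed spectral tail uplift. STAL uses this cue for auxiliary supervision, improving generalization and stability across generators, data distributions, and real-world scenarios.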

English abstract

As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or high-frequency discrepancies. However, the specific and recurring spectral regularities remain insufficiently understood and characterized. In this paper, we systematically analyze the one-dimensional radial log-power spectra of real and generated images. We find that generated images do not necessarily exhibit higher or lower energy across the entire spectrum or high-band range. Instead, their spectra deviate from the power-law decay and show an anomalous uplift in the ultra-high-frequency tail. We term this phenomenon spectral tail uplift. We further attribute this phenomenon to nonlinear harmonic accumulation in trained generative models, suggesting that it can serve as a structural cue across generative architectures. Based on this observation, we propose Spectral Tail Auxiliary Learning (STAL), a frequency-domain auxiliary supervision framework for generalizable AI-generated image detection. STAL transfers spectral-tail cues from a tail-aware frequency teacher to a spatial detector during training, while all frequency-domain modules are discarded at inference time. Consequently, STAL introduces no inference overhead. Extensive experiments on 9 public datasets show that STAL achieves strong generalization and stability across generators, data distributions, and real-world scenarios.
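
The one-dimensional radial log-power spectrum analyzed in the abstract can be illustrated with a small sketch: compute the 2-D Fourier power of an image, average it over rings of equal distance from the zero frequency, and take the logarithm. The naive DFT below keeps the example self-contained (an FFT library would be used in practice), and the integer-radius binning and log-of-mean convention are assumptions, not details taken from the paper. The checkerboard is only a toy input with energy at the highest frequency, not a generated image.

```python
import cmath
import math

def dft2(img):
    """Naive 2-D discrete Fourier transform of a square grayscale image."""
    n = len(img)
    out = [[0j] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0j
            for y in range(n):
                for x in range(n):
                    s += img[y][x] * cmath.exp(-2j * math.pi * (u * y + v * x) / n)
            out[u][v] = s
    return out

def radial_log_power(img, eps=1e-12):
    """Log of the mean power in integer-radius rings around the zero frequency."""
    n = len(img)
    spec = dft2(img)
    sums, counts = {}, {}
    for u in range(n):
        for v in range(n):
            # Map indices to signed frequencies so the radius is measured from DC.
            fu = u if u <= n // 2 else u - n
            fv = v if v <= n // 2 else v - n
            r = int(round(math.hypot(fu, fv)))
            sums[r] = sums.get(r, 0.0) + abs(spec[u][v]) ** 2
            counts[r] = counts.get(r, 0) + 1
    return [math.log(sums[r] / counts[r] + eps) for r in sorted(sums)]

# Toy input: an 8x8 checkerboard, whose energy sits at DC and at the highest frequency.
checker = [[(x + y) % 2 for x in range(8)] for y in range(8)]
profile = radial_log_power(checker)
```

In this toy profile the last ring is far above the mid-frequency rings, which is the kind of tail behavior a detector would look for when comparing against a smooth power-law decay.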

2605.22718 2026-05-22 cs.CV Version update

WorldKV: Efficient World Memory with World Retrieval and Compression

Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

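Affiliations KAIST AI; Naver AI Lab

AI summary This paper proposes WorldKV, a training-free framework that uses world retrieval and world compression to improve efficiency while maintaining consistency, matching or exceeding full-KV memory fidelity on Matrix-Game-2.0 and LingBot-World-Fast.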

Comments Project Page: https://cvlab-kaist.github.io/WorldKV/

English abstract

Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor frame, halving per-chunk storage to fit 2x more history under a fixed budget. On Matrix-Game-2.0 and LingBot-World-Fast, WorldKV matches or exceeds full-KV memory fidelity at roughly 2x the throughput, and is competitive with memory-trained baselines without any fine-tuning. Project Page: https://cvlab-kaist.github.io/WorldKV/
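
The World Compression step can be pictured with a small sketch: score each cached token by how similar its key is to the keys of an anchor frame, then keep only the least redundant half. The scoring rule (maximum cosine similarity), the keep-lowest-half policy, and the name `compress_chunk` are assumptions made for illustration; the abstract does not specify these details.

```python
import math

def cosine(a, b, eps=1e-8):
    """Cosine similarity between two key vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def compress_chunk(chunk_keys, anchor_keys, keep_ratio=0.5):
    """Return the sorted indices of tokens kept after pruning.

    Assumption: a token is redundant if its key closely matches some
    anchor-frame key, so tokens with the lowest maximum similarity are kept.
    """
    redundancy = [max(cosine(k, a) for a in anchor_keys) for k in chunk_keys]
    n_keep = max(1, int(len(chunk_keys) * keep_ratio))
    order = sorted(range(len(chunk_keys)), key=lambda i: redundancy[i])
    return sorted(order[:n_keep])

anchor_keys = [[1.0, 0.0], [0.0, 1.0]]
chunk_keys = [[1.0, 0.01], [0.7, 0.7], [0.02, 1.0], [-1.0, 0.2]]
kept = compress_chunk(chunk_keys, anchor_keys)  # tokens 0 and 2 nearly duplicate anchor keys
```

With `keep_ratio=0.5` the stored chunk is half its original size, which is what allows twice as much history to fit in a fixed memory budget.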

2605.22697 2026-05-22 cs.CV Version update

Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

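Affiliations Université Bourgogne Europe, CNRS; TEB Group, Prynel SAS

AI summary This paper proposes a cross-domain human action recognition method that combines multiview motion cues with textual descriptions to improve the robustness and generalization of zero-shot action recognition models across domains.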

Comments Accepted to ICPR 2026. Code and trained models available at: https://icb-vision-ai.github.io/OrientationAware-HAR

English abstract

Robustness to domain changes is a key capability for effective deployment of human action recognition systems in real-world scenarios, where action categories at inference can present important domain shifts or even unseen actions from training. In this context, improving the recognition capabilities of Zero-Shot Action Recognition models (ZSAR), without requiring strong annotation efforts, remains a central challenge. Most ZSAR approaches assume that actions are observed under geometric conditions similar to those seen during training. In practice, variations in human body orientation and camera viewpoint add a significant domain gap in ZSAR, substantially limiting generalization to novel action-motion combinations. In this context, this paper presents a novel orientation-aware action recognition approach with improved cross-domain capabilities. Our approach combines motion cues of multiple camera viewpoints and text descriptions of human actions in the training phase. We present a new orientation-aware motion encoding network to learn different motion features, and adapt a specific orientation-aware text prompt to match the corresponding features at inference. Extensive experiments demonstrate that the proposed method consistently improves ZSAR performance across different recognition benchmarks, outperforming recent state-of-the-art zero-shot approaches on NTU-RGB+D, BABEL, NW-UCLA, and on two surveillance datasets. In addition, the learned representations exhibit strong transfer learning capabilities, yielding competitive performance on both cross-domain and same-domain recognition of seen actions. Code and trained models are available at: https://icb-vision-ai.github.io/OrientationAware-HAR

2605.22695 2026-05-22 cs.CV Version update

Improving Viewpoint-Invariance and Temporal Consistency for Action Detection

Yannick Porto, Renato Martins, Thomas Chalumeau, Cedric Demonceaux

Affiliations Université Bourgogne Europe, CNRS, ICB; TEB Group, Prynel SAS

AI summary This paper proposes a two-stage action detection method that improves viewpoint invariance and global temporal consistency, outperforming existing methods on the PKU-MMD and BABEL benchmarks.

Comments Accepted at ICIP 2026. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

English abstract

Viewpoint change invariance and action temporal consistency are critical aspects for the effective deployment of human action detection of untrimmed videos. Existing appearance-based video detection methods often struggle with limited viewpoint diversity during training, while motion-based detection approaches frequently fail to model fine-grained temporal relationships across consecutive motion windows. This paper introduces a novel two-stage action detection approach designed to improve both view-invariance and global temporal coherence properties. In the first stage, we extract motion features from augmented virtual viewpoints, solely used at training. Then, the second stage introduces a new view-invariant, multi-scale temporal encoder based on selective state-space sequence modelling to aggregate information across viewpoints and time scales. Experiments on PKU-MMD and BABEL benchmarks demonstrate that this approach significantly outperforms state-of-the-art methods in all considered splits. Code and trained models are available at: https://icb-vision-ai.github.io/HydraView-TAD

2605.22679 2026-05-22 cs.CV cs.LG Version update

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

Piotr Kubaty, Patryk Marszałek, Łukasz Struski, Adam Wróbel, Jacek Tabor, Marek Śmieja

Affiliations Faculty of Mathematics and Computer Science, Jagiellonian University; Doctoral School of Exact and Natural Sciences, Jagiellonian University; Centre for Credible AI, Warsaw University of Technology

AI summary This paper proposes CEDAR, which uses sparse disentanglement to reveal the compositional structure of pretrained embeddings without increasing dimensionality, improving the interpretability of vision-language models and their alignment with human perception.

English abstract

Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We introduce CEDAR (Conceptual Embedding Disentanglement via Adaptive Rotation), a post-hoc method that reveals the compositional structure of pretrained embeddings without increasing dimensionality. By learning an invertible transformation with a top-$k$ sparsity bottleneck, CEDAR concentrates semantic information into axis-aligned disentangled coordinates. In CLIP-like architecture, individual coordinates can be interpreted with textual concepts, while for generative models such as BLIP, they can be decoded into natural language descriptions. Experiments demonstrate that CEDAR achieves a competitive reconstruction-sparsity trade-off while producing explanations that are more interpretable and better aligned with human perception. Our results suggest that the apparent entanglement in vision-language representations can be resolved through a suitable change of basis, eliminating the need for overcomplete expansions.
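
The idea of an invertible change of basis followed by a top-k bottleneck can be shown on a two-dimensional toy embedding. In the sketch below the transformation is a fixed orthogonal rotation (so its inverse is its transpose), which is an assumption suggested by the method's name rather than a description of how CEDAR is trained; `top_k` and `encode_decode` are illustrative names.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_k(z, k):
    """Zero out all but the k largest-magnitude coordinates."""
    keep = set(sorted(range(len(z)), key=lambda i: abs(z[i]), reverse=True)[:k])
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]

def encode_decode(x, rotation, k):
    """Rotate, sparsify, and map back to the original space."""
    z = top_k(matvec(rotation, x), k)
    # For an orthogonal matrix the inverse is the transpose.
    return z, matvec(transpose(rotation), z)

theta = math.pi / 4
rotation = [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]
x = [1.0, 1.0]  # spread over both axes originally, axis-aligned after rotation
z, x_hat = encode_decode(x, rotation, k=1)
```

Here a single active coordinate reconstructs the embedding without adding any dimensions, which is the contrast the abstract draws with overcomplete sparse autoencoders.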

2605.22678 2026-05-22 cs.CV cs.AI Version update

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

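Affiliations Boston University; Microsoft Research India

AI summary This study proposes Swift Sampling, a training-free frame selection algorithm that models a video as a differentiable trajectory in the visual latent space and uses a Taylor expansion to predict the path of subsequent frames. It automatically identifies highly informative, temporally surprising frames and improves performance on long-video question answering.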

English abstract

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02x additional computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
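
The selection rule described in the abstract can be sketched with finite differences: estimate velocity and acceleration from the previous frames, extrapolate the next frame's features with a second-order Taylor step, and score each frame by how far it lands from that prediction. The unit time step, the Euclidean distance, the top-`budget` selection, and the function names are assumptions for illustration, not the authors' implementation.

```python
import math

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def taylor_surprise(features):
    """Surprise score per frame; frames without enough history score 0."""
    scores = [0.0] * len(features)
    for t in range(2, len(features) - 1):
        vel = sub(features[t], features[t - 1])
        acc = sub(vel, sub(features[t - 1], features[t - 2]))
        # Second-order Taylor prediction of the next frame's features.
        pred = [f + v + 0.5 * a for f, v, a in zip(features[t], vel, acc)]
        scores[t + 1] = math.sqrt(sum(d * d for d in sub(features[t + 1], pred)))
    return scores

def select_frames(features, budget):
    """Indices of the `budget` most surprising frames, in temporal order."""
    scores = taylor_surprise(features)
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy 1-D features: steady drift with an abrupt jump at frame 5.
features = [[0.0], [1.0], [2.0], [3.0], [4.0], [9.0], [10.0], [11.0]]
scores = taylor_surprise(features)
picked = select_frames(features, budget=2)  # frames 5 and 6
```

Frames on the steady drift score zero, while the jump and the frame right after it score highest; the latter happens because the prediction extrapolates the jump itself, a side effect worth keeping in mind with this kind of estimator.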

2605.22677 2026-05-22 cs.CV Version update

Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment

Janek Haberer, Jon Eike Wilhelm, Olaf Landsiedel

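Affiliations Kiel University; Hamburg University of Technology (TUHH); UNU-INWEH

AI summary This paper proposes Slimmable ConvNeXt, which trains a single set of shared weights containing multiple nested subnetworks to enable width-adaptive inference, so one model can be deployed efficiently on devices with different resource constraints. The method exploits ConvNeXt's modern design, namely LayerNorm and inverted bottlenecks, to slim channel widths, removing the normalization overhead of classical slimmable networks and simplifying the training pipeline.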

Comments Accepted at Mobile AI Workshop 2026 (CVPR'26 Workshop)

English abstract

Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this by training a single set of shared weights containing multiple nested subnetworks of increasing capacity, but prior CNN-based approaches required switchable batch normalization, while recent scalable methods have focused on Vision Transformers. We present Slimmable ConvNeXt, which shows that ConvNeXt's modern design, specifically LayerNorm and inverted bottlenecks, makes it particularly suited for channel-width slimming, eliminating the normalization overhead of classical slimmable networks and producing a simpler training pipeline than both prior CNN and ViT approaches. On ImageNet-1k, Slimmable ConvNeXt-T with 3 subnetworks achieves 80.8% top-1 accuracy at 4.5 GMACs and 77.4% at 1.2 GMACs, trained from scratch for 600 epochs. At comparable compute, this exceeds HydraViT's 6-head subnetwork (78.4% at 4.6 GMACs) by 2.4 percentage points and its 3-head configuration (73.0% at 1.3 GMACs) by 4.4 percentage points, while also outperforming MatFormer-S (78.6%) and SortedNet-S (78.2%) at the same GMACs. Scaling to Slimmable ConvNeXt-B further improves maximum accuracy to 82.8% at 15.35 GMACs.
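
Nested subnetworks over shared weights can be illustrated with a single layer: a narrower subnetwork reuses the leading rows and columns of the full weight matrix, and LayerNorm normalizes over whichever channels are active, so no per-width statistics need to be stored. This is a toy sketch under those assumptions with made-up weights; `slim_linear` is a hypothetical helper, not part of the paper's code.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize over the active channels; no running statistics are kept."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def slim_linear(x, weight, bias, out_width):
    """Use only the leading out_width rows and len(x) columns of the shared weight."""
    return [sum(w * v for w, v in zip(row[:len(x)], x)) + b
            for row, b in zip(weight[:out_width], bias[:out_width])]

# Shared 4x4 weight; the half-width subnetwork reuses its top-left 2x2 block.
weight = [[0.5, -0.2, 0.1, 0.0],
          [0.3, 0.8, -0.5, 0.2],
          [-0.1, 0.4, 0.9, -0.3],
          [0.2, 0.0, 0.6, 0.7]]
bias = [0.0, 0.1, -0.1, 0.2]

full = slim_linear(layer_norm([1.0, 2.0, 3.0, 4.0]), weight, bias, out_width=4)
half = slim_linear(layer_norm([1.0, 2.0]), weight, bias, out_width=2)
```

Batch normalization, by contrast, keeps running statistics that differ for each width, which is why classical slimmable networks needed a separate (switchable) set per subnetwork.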

2605.22668 2026-05-22 cs.CV 版本更新

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

SEGA:用于扩散变换器中分辨率外推的频谱-能量引导注意力

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 SEGA通过动态调整注意力权重来提升扩散变换器在高分辨率生成中的表现,其核心方法是根据潜在空间的频谱结构调整RoPE组件的注意力缩放,从而在保持全局结构和恢复细节方面取得平衡。

Comments 27 pages, 14 figures. Project page: https://rajabi2001.github.io/sega/

详情
AI中文摘要

扩散变换器(DiTs)已成为文本到图像生成的主流架构,但在生成超出训练范围的分辨率时性能会下降。现有的免训练方法通过修改推理时的注意力行为来缓解这一问题,通常将旋转位置嵌入(RoPE)外推与注意力缩放相结合。然而,这些策略对频率特性各不相同的RoPE分量施加统一且与内容无关的缩放,导致在保持全局结构与恢复精细细节之间产生权衡。我们提出SEGA,一种免训练方法,在每个去噪步骤根据潜变量的空间频率结构,对各RoPE分量的注意力进行动态缩放。这种自适应缩放同时提升了结构一致性和细节保真度。实验表明,SEGA在多个目标分辨率上均能稳定提升高分辨率合成效果,优于最先进的免训练基线。

英文摘要

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.

2605.22658 2026-05-22 cs.CV cs.LG cs.MM eess.IV 版本更新

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

SegCompass: 探索通过稀疏自编码器实现可解释对齐以增强推理分割

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Haoqian Kang, Yan Feng, Ke Chen, Yaowei Wang

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Peng Cheng Laboratory(鹏城实验室) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Meituan, Beijing(美团,北京) University of Chinese Academy of Sciences(中国科学院大学) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院)

AI总结 本文提出SegCompass,一种通过稀疏自编码器实现可解释对齐的端到端模型,以提升推理分割的性能和可解释性。

Comments Accepted by CVPR 2026. 15 pages, 9 figures, 6 tables

详情
AI中文摘要

尽管大语言模型提供了强大的组合推理能力,但现有推理分割流程未能清晰地将这种推理与视觉感知连接起来。当前方法,如潜在查询对齐,虽然端到端但却是不透明的“黑箱”。相反,文本定位读出仅可读但不真正可解释,通常作为无约束的后处理步骤。为弥合这一可解释性差距,我们提出了SegCompass,一种端到端模型,利用稀疏自编码器(SAE)建立一个显式、可解释且可微的对齐路径。给定一个图像-指令对,SegCompass首先生成一个思维链(CoT)轨迹。该方法的核心是一个将CoT和视觉标记映射到共享高维稀疏概念空间的SAE。一个查询代码本从该空间中选择显著概念,然后通过槽映射器在空间上定位到多槽热图,引导最终的掩码解码器。整个模型联合训练,将强化学习用于推理路径与标准分割监督相结合。这种由SAE驱动的接口提供了显著比潜在查询更可追溯的“白盒”连接,比文本读出更连贯。在五个具有挑战性的基准测试中,SegCompass匹配或超越了最先进的性能。关键的是,我们的视觉和定量分析显示,所学稀疏概念的质量与最终掩码准确性之间存在强相关性,证实了SegCompass通过其增强且可检查的对齐实现了优越的结果。代码可在https://github.com/ZhenyuLU-Heliodore/SegCompass获取。

英文摘要

While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization readout is merely readable, not truly interpretable, often functioning as an unconstrained post-hoc step. To bridge this interpretability gap, we propose SegCompass, an end-to-end model that leverages a Sparse Autoencoder (SAE) to forge an explicit, interpretable, and differentiable alignment pathway. Given an image-instruction pair, SegCompass first generates a chain-of-thought (CoT) trace. The core of our method is an SAE that maps both the CoT and visual tokens into a shared, high-dimensional sparse concept space. A query codebook selects salient concepts from this space, which are then spatially grounded by a slot mapper into a multi-slot heatmap that guides the final mask decoder. The entire model is trained jointly, unifying reinforcement learning for the reasoning path with standard segmentation supervision. This SAE-driven interface provides a "white-box" connection that is significantly more traceable than latent queries and more coherent than textual readouts. Extensive experiments on five challenging benchmarks demonstrate that SegCompass matches or surpasses state-of-the-art performance. Crucially, our visual and quantitative analyses show a strong correlation between the quality of the learned sparse concepts and final mask accuracy, confirming that SegCompass achieves superior results through its enhanced and inspectable alignment. Code is available at https://github.com/ZhenyuLU-Heliodore/SegCompass.

2605.22654 2026-05-22 cs.CL cs.CV 版本更新

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

看见诗歌:基于多模态大语言模型(MLLMs)的AI生成现代汉语诗歌图像-语义检测

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

发表机构 * Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系) University of Rochester(罗切斯特大学) Sichuan University(四川大学) Department of Portuguese, Faculty of Arts and Humanities, University of Macau(澳门大学人文学院葡萄牙语系)

AI总结 本文提出了一种图像-语义引导的诗歌检测方法,通过整合图像内容与诗歌文本信息,提升大语言模型在检测现代汉语诗歌中的性能,实验结果表明该方法在多个数据集上均优于传统方法。

详情
AI中文摘要

先前的检测研究显示,LLMs无法有效用作检测器,但这些研究未涉及现代汉语诗歌。此外,没有相关研究探讨LLMs在检测现代汉语诗歌中的性能。本文评估并提升了LLMs作为现代汉语诗歌检测器的性能,并提出了一种图像-语义引导的诗歌检测方法。与传统检测方法相比,我们的方法创新性地整合了反映诗歌内容的图像。通过示例驱动的方法,我们的方法有效整合了图像中的意义、意象和情感信息,然后与诗歌文本形成互补判断。实验结果表明,基于我们方法的LLM检测器在多个数据集上均优于基于纯文本的基线检测器,甚至超越了表现最佳的传统检测器RoBERTa。使用我们方法的Gemini检测器在Macro-F1得分上达到85.65%,达到最先进的水平。不同LLM检测器在多个LLM生成数据上的性能提升证明了我们方法的有效性。

英文摘要

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on multiple LLMs-generated data prove the effectiveness of our method.

2605.22651 2026-05-22 cs.CV 版本更新

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

描述文本究竟说了什么?面向视觉语言预训练中组合性数据选择的反事实短语干预

Hyejin Go, Semi Lee, Hyesong Choi

发表机构 * Soongsil University(崇实大学)

AI总结 本文研究了在视觉语言预训练中如何通过反事实短语干预来改进组合数据选择,提出了CPI方法以解决现有方法中全局过滤信号失效的问题,从而提升模型在关系识别任务上的表现。

Comments 11 pages, 2 figures, 4 tables. Preprint

详情
AI中文摘要

CLIP风格的对比预训练通常使用样本级过滤信号来筛选网络规模的图像-文本对,这类信号往往基于图文对级别的对齐程度。我们证明这种信号会饱和:一旦移除粗粒度的不匹配样本,更严格的全局过滤就不再能反映所保留描述文本(caption)提供的组合性监督。原因是结构性的:全局评分把“图文对是否大体合理”与“描述文本中的各个对象、属性和关系短语是否对图文匹配有实质性支撑”混为一谈。后者正是组合泛化所需要的,而图文对级别的过滤器对此视而不见。为此我们提出反事实短语干预(CPI),一种短语级数据筛选框架,将受控的无意义词元(nonce token)替换转化为以图像为条件的短语敏感度评分。CPI仅在移除粗粒度不匹配时使用全局对齐,然后根据描述文本中的短语在受控替换下是否可测量地影响图文评分,对保留下来的样本池进行排序。我们将CPI定位为一阶短语敏感度信号,而非定位(grounding)或识别结果,并在CC3M规模上进行评估。按该信号排序得到的50%数据子集,在VL-CheckList-VG Relation上比全量数据基线提高+1.91,在相同预算下比仅对齐过滤提高+1.00,同时提升SugarCrepe总体表现并保持通用迁移能力。CPI与损失函数正交:不加修改地应用于NegCLIP时,VL-CheckList-VG Relation进一步提高+3.84,在CE-CLIP上的额外增益见论文正文。

英文摘要

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural - a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
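
The phrase-sensitivity idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_score` is a word-overlap stand-in for a real image-text similarity model, the nonce word `blicket` is arbitrary, and the single-substitution difference is only one plausible reading of "phrase-sensitivity score", not the paper's exact formulation.

```python
def toy_score(image_tags, caption):
    """Hypothetical stand-in for an image-text score:
    the fraction of caption words that appear among the image tags."""
    words = caption.lower().split()
    return sum(w in image_tags for w in words) / len(words)

def phrase_sensitivity(image_tags, caption, phrase, nonce="blicket"):
    """Score drop when `phrase` is replaced by a meaningless nonce token."""
    swapped = caption.replace(phrase, nonce)
    return toy_score(image_tags, caption) - toy_score(image_tags, swapped)

image_tags = {"dog", "red", "ball", "grass"}  # pretend image content
caption = "a dog chasing a red ball"

scores = {
    phrase: phrase_sensitivity(image_tags, caption, phrase)
    for phrase in ("red ball", "chasing")
}
```

In this toy setup, replacing "red ball" lowers the score while replacing "chasing" does not, so only the first phrase measurably supports the match; a curation step could then rank captions by how many of their phrases behave like the first.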

2605.22649 2026-05-22 cs.CV cs.LG 版本更新

From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder

从基线到随访:利用因果层次变分自编码器在UK Biobank中进行反事实脊柱DXA图像合成

Yilin Zhang, Nicholas C. Harvey, Nicholas R. Fuggle, Rahman Attar

发表机构 * School of Electronics and Computer Science(电子与计算机科学学院) University of Southampton(南安普顿大学) MRC Lifecourse Epidemiology Centre(英国医学研究理事会生命历程流行病学中心) University of Southampton, Southampton General Hospital(南安普顿大学南安普顿总医院) Computer Science University of Southampton(南安普顿大学计算机科学)

AI总结 本文提出了一种基于元数据的因果层次变分自编码器,用于在UK Biobank中生成一致的脊柱DXA图像,通过基线到随访的设置评估因果一致性,展示了年龄干预下关键椎体形态学变量的高一致性,支持了在解剖上合理的DXA图像合成。

Comments 7 pages, 4 figures, 3 tables. Accepted at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

详情
AI中文摘要

双能X射线吸收法(DXA)广泛用于大规模骨骼评估,但学习可控且可解释的、针对特定因素的解剖变异仍具挑战性。我们提出一种以元数据为条件的因果层次变分自编码器(CHVAE),用于因果一致地生成UK Biobank(UKB)中的前后位(AP)脊柱DXA图像。模型在首次成像访视的3,743例原始AP脊柱扫描上训练,并以参与者基本属性和腰椎形态测量为条件。因果一致性在基线到随访的设置下通过溯因-行动-预测(abduction-action-prediction,AAP)进行评估:先由基线图像溯因推断潜在变量,再将年龄干预为重复成像时的取值,最后将得到的反事实随访形态测量与实际观测的重复成像测量值进行比较。结果表明,在年龄干预下,关键椎体形态测量变量具有较强的绝对水平一致性,支持与干预对齐的、解剖学上合理的DXA图像合成。

英文摘要

Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consistent generation of anteroposterior (AP) spine DXA images from the UK Biobank (UKB). The model is trained on 3,743 raw AP spine scans from the first imaging visit and conditioned on basic participant attributes and lumbar morphometry. Causal consistency is evaluated in a baseline-to-follow-up setting using abduction-action-prediction (AAP): latent variables are abducted from baseline images, age is intervened to the repeat-imaging value, and the resulting counterfactual follow-up morphometry is compared with observed repeat-imaging measurements. Results show strong absolute-level agreement for key vertebral morphometry variables under age intervention, supporting intervention-aligned synthesis of anatomically plausible DXA images.
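
The three AAP steps can be sketched on a toy one-variable structural model. This is a generic illustration of counterfactual inference, not the CHVAE: the linear equation, its coefficients, and all numbers below are made up.

```python
def predict_height(age, noise, intercept=30.0, slope=-0.05):
    """Hypothetical structural equation:
    vertebral height = intercept + slope * age + individual noise."""
    return intercept + slope * age + noise

def counterfactual_height(observed_height, baseline_age, followup_age):
    # Abduction: recover the individual's noise term from the baseline observation.
    noise = observed_height - predict_height(baseline_age, 0.0)
    # Action: set age to the follow-up value.
    # Prediction: rerun the same mechanism with the recovered noise.
    return predict_height(followup_age, noise)

cf = counterfactual_height(observed_height=26.5, baseline_age=60, followup_age=66)

# Compare against a (made-up) observed repeat-imaging measurement.
observed_followup = 26.0
abs_error = abs(cf - observed_followup)
```

Intervening to the same age returns the baseline observation unchanged, which is a quick sanity check on the abduction step.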

2605.22631 2026-05-22 cs.CV 版本更新

AtomicMotion: Learning Human Motion From Different Human Parts

AtomicMotion: 从不同人体部分学习人体动作

Runzhen Liu, Chuhua Xian, Fa-Ting Hong

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) South China University of Technology(华南理工大学) Department of Computer Science and Engineering(计算机科学与工程系) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 该研究提出AtomicMotion框架,通过解耦和重新整合身体动态,解决从稀疏头部和手部轨迹准确重建完整身体姿态的挑战,核心方法是逻辑身体分区、全身体预条件化策略和运动学注意力机制,实验表明其在AMASS数据集上显著优于现有基线。

详情
AI中文摘要

准确从稀疏头部和手部轨迹重建完整身体姿态是沉浸式AR/VR远程存在的基础挑战。当前方法常面临误差累积和不自然关节协调的问题,主要因为将人体视为单一实体,无法捕捉细微信号变化中的细粒度“原子意图”,并忽视了固有的结构拓扑。为弥合这一差距,我们提出了AtomicMotion,一个通过三个核心创新解耦和重新整合身体动态的框架。首先,我们引入一种逻辑身体分区方案,根据功能意图将骨架分解为五个不同的簇;这确保每个分区保留内部关节协同性,同时隔离局部运动原语。其次,为了稳健地将稀疏输入映射到高维姿态,我们在训练期间采用掩码全身体预条件化策略,迫使模型内化全局骨骼拓扑和潜在运动学约束。最后,针对常规空间注意力机制常忽略固定生理连接的局限性,我们提出了运动学注意力。通过将经典运动学树结构嵌入注意力机制中,我们确保合成动作具有生物合理性。在AMASS数据集上的广泛评估表明,AtomicMotion显著优于现有基线,实现了更高的重建保真度和更优越的生物力学真实性。

英文摘要

Accurately reconstructing full-body poses from sparse head and hand trajectories is a foundational challenge for immersive AR/VR telepresence. Current methods often struggle with error accumulation and unnatural joint coordination, primarily because they treat the human body as a monolithic entity, thereby failing to capture the fine-grained "atomic intents" embedded in subtle signal variations and overlooking the inherent structural topology. To bridge this gap, we present AtomicMotion, a framework designed to decouple and re-integrate body dynamics through three core innovations. First, we introduce a logical body partitioning scheme that decomposes the skeleton into five distinct clusters based on functional intent; this ensures that each partition preserves internal joint synergies while isolating local motion primitives. Second, to robustly map sparse inputs to high-dimensional poses, we employ a masked full-body pre-conditioning strategy during training, forcing the model to internalize global skeletal topology and latent kinematic constraints. Finally, addressing the limitations of vanilla spatial attention, which often ignores fixed physiological connectivity, we propose Kinematic Attention. By embedding the classical kinematic tree structure into the attention mechanism, we ensure biological plausibility in the synthesized motions. Extensive evaluations on the AMASS dataset demonstrate that AtomicMotion significantly outperforms existing baselines, yielding higher reconstruction fidelity and superior biomechanical realism.

2605.22629 2026-05-22 cs.CV 版本更新

H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning

H-Flow:通过物理启发的联合多模态学习实现自监督的人体场景流

Zhanbo Huang, Xiaoming Liu, Yu Kong

发表机构 * Michigan State University(密歇根州立大学) University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出H-Flow,一种能够同时捕捉骨骼运动学和表面变形的密集人体场景流方法,通过物理启发的联合多模态学习实现自监督,引入高保真合成基准DynAct4D,并在标准基准和零样本场景中优于现有方法。

Comments 19 pages, 7 figures, 4 tables

详情
AI中文摘要

参数化人体模型能够捕捉全局姿态,但无法表示衣物和软组织的非刚性表面动态。通用场景流方法可以估计稠密运动,但在铰接式人体上会失效,而且这类场景的像素级监督也难以获取。我们提出H-Flow,一种同时捕捉骨骼运动学和表面形变的稠密人体场景流。统一的多头Transformer从单目视频中估计场景流,并联合预测姿态和深度作为配套输出。难点在于缺乏监督信号。我们不依赖无法获得的标签,而是将网络锚定于人体运动的物理规律,把几何、结构和生物力学先验编码为跨模态训练目标。我们还提出DynAct4D,一个高保真合成基准,提供涵盖多样人物、服装和动作的稠密流标注。在标准基准上,H-Flow优于场景流和参数化基线,并能零样本泛化到真实场景(in-the-wild)视频。代码、模型和DynAct4D基准将在论文发表时发布。

英文摘要

Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense human scene flow that captures both skeletal kinematics and surface deformation. A unified multi-head transformer estimates flow from monocular video, jointly predicting pose and depth as companion outputs. The challenge lies in the lack of supervision. In place of unattainable labels, we anchor the network in the physics of human motion, encoding geometric, structural, and biomechanical priors as cross-modal training objectives. We further introduce DynAct4D, a high-fidelity synthetic benchmark providing dense flow annotations across diverse subjects, garments, and motions. On standard benchmarks, H-Flow outperforms scene-flow and parametric baselines, and generalizes zero-shot to in-the-wild video. Code, models, and the DynAct4D benchmark will be released upon publication.

2605.22619 2026-05-22 cs.CV 版本更新

GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

GLeVE: 3D CT中图引导的病灶定位与候选区域验证

Shuo Jiang, Yuhao Hong, Chunbo Jiang, Weihong Chen, Huangwei Chen, Shenghao Zhu, Beining Wu, Mingxuan Liu, Zhu Zhu, Feiwei Qin, Min Tan, Yifei Chen

发表机构 * Zhejiang Key Laboratory of Space Information Sensing and Transmission(浙江空间信息感知与传输重点实验室) Hangzhou Dianzi University(杭州电子科技大学) Zhejiang University(浙江大学) Tsinghua University(清华大学) Children's Hospital, Zhejiang University School of Medicine(浙江大学医学院附属儿童医院)

AI总结 本文提出GLeVE框架,通过图引导的病灶定位和解剖学先验验证,弥合3D CT中自由文本叙述与体数据解剖结构之间的语义-空间鸿沟,提升病灶定位的准确性。

Comments 11 pages, 4 figures

详情
AI中文摘要

将放射科报告中的描述定位(grounding)到3D CT体数据,对于可验证的临床解读至关重要,但自由文本叙述与体数据解剖结构之间的语义-空间鸿沟使其仍具挑战性。现有的报告辅助方法和视觉-语言定位方法通常依赖短语级对齐或稠密像素监督,导致病灶级对应关系有限、定位精度欠佳。我们提出GLeVE,一种图引导的病灶定位框架,带有解剖学先验验证和基于八叉树的自回归细化。GLeVE将每条病灶描述视为一个原子语义单元,通过关系感知的图推理对器官归属、属性和病灶间关系进行编码,生成具有判别性的病灶级查询。带区域级验证的解剖感知候选区域生成强制实现一对一的文本-病灶对齐,而分层八叉树细化则逐步改善边界勾画。在AbdomenAtlas 3.0上的实验表明,GLeVE在分割精度和病灶级定位两方面均稳定优于经典多模态基础模型和报告监督基线。

英文摘要

Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on phrase-level alignment or dense pixel supervision, resulting in limited lesion-wise correspondence and suboptimal localization accuracy. We propose GLeVE, a graph-guided lesion grounding framework with anatomical prior verification and octree-based autoregressive refinement. GLeVE treats each lesion description as an atomic semantic unit and encodes organ attribution, attributes, and inter-lesion relations through relation-aware graph reasoning to produce discriminative lesion-wise queries. Anatomy-aware proposal generation with region-level verification enforces one-to-one text-lesion alignment, while hierarchical octree refinement progressively improves boundary delineation. Experiments on AbdomenAtlas 3.0 demonstrate consistent gains over classical multimodal foundation models and report-supervised baselines in both segmentation accuracy and lesion-level localization.

2605.22607 2026-05-22 cs.CV 版本更新

Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

增强视觉基础模型中的视线推理以实现视线跟随

Shijing Wang, Yaping Huang, Chaoqun Cui, David Wong, Yihua Cheng, Alexandros Neophytou, Hyung Jin Chang

发表机构 * Beijing Jiaotong University(北京交通大学) University of Birmingham(英国伯明翰大学) MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所MAIS部) Microsoft, UK(微软公司(英国))

AI总结 本文提出了一种新的训练机制,通过以头部为条件的局部LoRA和锥外惩罚来增强视觉基础模型中的视线推理,以提升视线跟随任务的性能,在目标不显著时提升尤为明显。

Comments 11 pages, 8 figures

详情
AI中文摘要

视线跟随需要同时具备场景理解和视线推理能力,以定位场景中人物的注视目标。近来,视觉基础模型(VFMs)在该任务上表现出色,使更简单的架构即可超越先前方法。然而,我们观察到基于VFM的方法存在一个关键局限:VFM虽显著提升了场景理解,对视线推理的贡献却很小。因此,现有方法常依赖语义上显著的物体而非真实的视线线索,导致目标不显著时性能下降。为此,我们提出一种新的训练机制来增强VFM中的视线推理。该方法包括:(1)以头部为条件的局部LoRA,通过局部化适配在保留场景token学习的同时改进用于视线推理的头部token学习;(2)锥外惩罚,在将视线线索注入头部token的同时使其与场景token对齐。在GazeFollow和VAT数据集上的实验表明,我们的方法取得了最先进的性能,在注视目标语义不显著时提升尤为明显。我们的发现为未来的视线跟随研究提供了有价值的见解。代码将在论文录用后发布。

英文摘要

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.

2605.22605 2026-05-22 cs.RO cs.CV 版本更新

Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection

通过双区间运动线索解耦自身运动与目标动态以实现无人机检测

Liuyang Wang, Feitian Zhang

发表机构 * Department of Robotics, School of Advanced Manufacturing and Robotics(先进制造与机器人学院机器人系) State Key Laboratory of Turbulence and Complex Systems(湍流与复杂系统国家重点实验室) Peking University(北京大学) Great Bay University(大湾区大学)

AI总结 本文提出了一种基于视觉的运动引导检测框架,通过双区间运动提取策略和轻量级运动引导注意力模块,解耦目标运动与相机干扰,提升无人机检测在剧烈自身运动下的性能。

详情
AI中文摘要

无人机(UAV)视角下的目标检测面临剧烈自身运动、相机抖动和大幅尺度变化的挑战。尽管现代检测器在静态图像上表现良好,但直接应用于无人机视频时往往失效,对动态场景中的小目标尤其如此。现有基于运动的方法要么依赖计算开销大的光流,要么使用单区间帧差,后者对抖动敏感,且难以捕捉多样的运动模式。我们提出一种纯视觉的运动引导检测框架,将目标运动与相机引起的扰动解耦。首先,基于单应性(homography)的全局运动补偿(GMC)对相邻帧进行对齐。随后,我们引入双区间运动提取策略,同时捕捉短期和长期运动线索。为融合这些线索,轻量级运动引导注意力(MGA)模块在特征金字塔网络中增强特征表示。在VisDrone-VID数据集上的实验表明,在剧烈自身运动下,该方法相对强YOLOv8基线取得了稳定提升。消融研究进一步验证了双区间设计和所提运动引导注意力机制的有效性。

英文摘要

Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing motion-based methods either rely on computationally expensive optical flow or use single-interval differencing, which is sensitive to jitter and limited in capturing diverse motion patterns. We propose a vision-only motion-guided detection framework that decouples target motion from camera-induced disturbances. A homography-based Global Motion Compensation (GMC) first aligns adjacent frames. We then introduce a Dual-Interval Motion Extraction strategy that captures both short-term and long-term motion cues. To integrate these cues, a lightweight Motion-Guided Attention (MGA) module enhances feature representations within a Feature Pyramid Network. Experiments on the VisDrone-VID dataset demonstrate consistent improvements over a strong YOLOv8 baseline under severe ego-motion. Ablation studies further confirm the effectiveness of the dual-interval design and the proposed motion-guided attention mechanism.
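
A rough sketch of what short- and long-interval motion cues might look like, under strong assumptions: the frames are taken to be already aligned (the homography-based compensation step is skipped), the cues are plain absolute frame differences, and the intervals of 1 and 3 frames are arbitrary. The abstract does not specify how the paper computes its cues, so this is only an illustration of why two intervals carry different information.

```python
def make_frame(pos, width=12, size=3):
    """1-D toy 'frame': a bright object of `size` pixels starting at `pos`."""
    return [1 if pos <= i < pos + size else 0 for i in range(width)]

def abs_diff(a, b):
    """Pixel-wise absolute difference between two aligned frames."""
    return [abs(u - v) for u, v in zip(a, b)]

def dual_interval_cues(frames, t, short=1, long=3):
    """Return (short-interval cue, long-interval cue) for frame index t."""
    return (
        abs_diff(frames[t], frames[t - short]),
        abs_diff(frames[t], frames[t - long]),
    )

# Object drifting one pixel per frame, i.e. slow motion.
frames = [make_frame(p) for p in range(5)]
short_cue, long_cue = dual_interval_cues(frames, t=4)
```

For this slow-moving object the short interval only lights up the leading and trailing edge, while the long interval responds over the full displacement; fast objects would show the opposite trade-off, with the long interval more exposed to residual misalignment.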

2605.22591 2026-05-22 cs.CV 版本更新

Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure

重新思考冻结视觉基础模型的噪声鲁棒训练:一个跨数据集基准与小损失失败的案例研究

Zitong Li, Haoyu Wang

发表机构 * Department of Biostatistics and Health Informatics(生物统计学与健康信息学系)

AI总结 本文通过跨五个医学数据集、三种主干网络、两种噪声类型和五种噪声率的基准测试,重新评估了冻结特征域中噪声标签学习方法的性能,揭示了小损失假设在高风险场景下的局限性,并提出了基于特征空间的选择器以指导实际应用。

详情
AI中文摘要

配备轻量级分类头的冻结视觉基础模型(VFMs)因部署高效且可复现,在医学影像中应用日益广泛。然而,针对这种冻结特征场景的噪声标签学习方法仍缺乏深入理解,且大多数现有方法仍依赖于从端到端训练沿袭而来的小损失假设。我们提出一个受控基准,涵盖八种噪声标签方法、五个医学数据集、三种主干网络、两种噪声类型和五种噪声率(150种条件,6,000次训练运行),以平衡准确率进行评估。基准结果表明不存在通用的最优方法:在150种条件上的Friedman排序检验得到χ²=333.2(p=4.77×10⁻⁶⁸),ELR在最多的条件下胜出(49/150),而CUFIT取得最佳平均排名(2.51)。方法选择的实际代价随噪声严重程度急剧增大,从干净数据上的4.5pp增至非对称40%噪声下的18.8pp。为解释这些基准层面的规律,我们在一个具有代表性的高风险场景中重新审视小损失假设。在冻结的DINOv2特征下,干净样本与噪声样本的损失分布重叠达53-61%;匹配比例下的干净样本检测显示,在非对称噪声下,预测一致性明显比损失排序更稳定(精度下降3pp对13pp)。在ISIC2019上,非对称40%噪声下,Co-Teaching的总体准确率达到68%,但平衡准确率骤降至35.1%,且在三个少数类上召回率为零。综合来看,这些结果将冻结VFM的噪声标签学习重新定位为一个感知场景的方法选择问题,而非寻找单一主导算法。最后,我们给出基于证据的指导,并提供一个低遗憾的特征空间选择器用于实际推荐。

英文摘要

Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still rely on a small-loss assumption inherited from end-to-end training. We present a controlled benchmark of eight noisy-label methods across five medical datasets, three backbones, two noise types, and five noise rates (150 conditions, 6,000 training runs), evaluated with balanced accuracy. The benchmark shows that there is no universal winner: Friedman ranking over the 150 conditions yields χ² = 333.2 (p = 4.77 × 10⁻⁶⁸), ELR wins the most conditions (49/150), while CUFIT attains the best mean rank (2.51). The practical cost of method choice grows sharply with noise severity, from 4.5pp on clean data to 18.8pp at asymmetric 40% noise. To explain these benchmark-level patterns, we revisit the small-loss assumption in a representative high-risk regime. Under frozen DINOv2 features, clean and noisy loss distributions overlap by 53-61%, and matched-rate clean-sample detection shows that prediction agreement is markedly more stable than loss ranking under asymmetric noise (3pp vs. 13pp precision drop). On ISIC2019 with asymmetric 40% noise, Co-Teaching reaches 68% overall accuracy while collapsing to 35.1% balanced accuracy with zero recall on three minority classes. Together, these results recast noisy-label learning for frozen VFMs as a regime-aware method-selection problem rather than a search for a single dominant algorithm. We conclude with evidence-based guidance and a low-regret feature-space selector for practical recommendation.
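
The gap between overall and balanced accuracy reported for Co-Teaching is easier to see on a toy example. The sketch below uses made-up labels, not ISIC2019 data, and defines balanced accuracy in the usual way as the unweighted mean of per-class recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Imbalanced toy data: 8 majority samples, 1 sample in each minority class.
y_true = [0] * 8 + [1] + [2]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

overall = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the classifier looks strong on overall accuracy while its recall on both minority classes is zero, which is the same failure pattern the abstract describes.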

2605.22581 2026-05-22 cs.CV cs.AI cs.LG 版本更新

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

SceneAligner: 真实场景中基于3D的平面图定位

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

发表机构 * Cornell University(康奈尔大学) Kempner Institute, Harvard University(哈佛大学Kempner研究所)

AI总结 本文提出了一种在真实场景中进行平面图定位的方法,将任务建立在场景的重建3D表示之上,解决了现有方法难以适用于大规模建筑和栅格化平面图的问题。

Comments Project Page: https://Cornell-VAILab.github.io/SceneAligner

详情
AI中文摘要

许多公共建筑提供带有“您在此处”标记的平面图,帮助访客确定方位。平面图定位旨在用计算方法复现这一能力,即确定视觉观测是在平面图中的哪个位置拍摄的。然而,现有方法通常假设受控的小规模环境和精确的矢量化平面图,限制了它们在大规模建筑和栅格化平面图上的适用性。在本文中,我们提出一种在真实场景中进行平面图定位的方法,将任务建立在场景的重建3D表示之上。给定一组无约束的图像,我们的方法重建重力对齐的3D场景,并将其投影为2D密度图,作为平面图的代理。随后,平面图定位被表述为通过2D相似变换将该代理与输入平面图对齐。为弥合密度图与建筑平面图之间的外观差异,我们对一个2D基础模型进行适配以学习跨模态对应关系,并引入一种微调方案,在保持结构一致性的同时鼓励语义对齐的匹配。大量实验表明,该方法相比先前方法有大幅提升,包括在输入极其稀疏、甚至仅有单张图像的设置下。我们的代码和数据将公开发布。

英文摘要

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
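
The alignment step, fitting a 2D similarity transform between the proxy and the floorplan, has a compact closed form when point correspondences are given. The sketch below assumes such correspondences already exist (the paper learns them with an adapted foundation model, which is not reproduced here) and represents 2D points as complex numbers, so that a similarity transform is `q = a * p + b` with complex `a` (scale and rotation) and `b` (translation). The function name and test points are hypothetical.

```python
import cmath

def fit_similarity(src, dst):
    """Least-squares 2-D similarity transform (no reflection) mapping src to dst.
    Points are complex numbers x + yj; returns (a, b) with dst ~= a * src + b."""
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    num = sum((q - dst_mean) * (p - src_mean).conjugate() for p, q in zip(src, dst))
    den = sum(abs(p - src_mean) ** 2 for p in src)
    a = num / den
    b = dst_mean - a * src_mean
    return a, b

# Synthetic check: scale 2, rotation 30 degrees, translation (3, -1).
true_a = cmath.rect(2.0, cmath.pi / 6)
true_b = complex(3.0, -1.0)
src = [complex(0, 0), complex(1, 0), complex(1, 2), complex(-1, 1)]
dst = [true_a * p + true_b for p in src]

a, b = fit_similarity(src, dst)
scale, angle = abs(a), cmath.phase(a)
```

With noisy or partially wrong matches, a robust wrapper such as RANSAC around this fit would normally be needed; that is outside the scope of this sketch.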

2605.22578 2026-05-22 cs.CV 版本更新

Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping

超越倒角距离:面向在线建图的细粒度顺序感知评估指标

Chouaib Bencheikh Lehocine, Adam Lilja, Junsheng Fu, Lars Hammarstrand

发表机构 * Zenseact AB(Zenseact公司) Chalmers University of Technology(查尔姆斯理工大学)

AI总结 本文提出面向在线建图方法的细粒度顺序感知评估指标,引入单实例层面的序列最优子模式分配(SOSPA)和多实例评估框架下的折线定位与检测(PLD),改进了传统基于倒角距离的评估方式,并揭示检测能力是当前方法的主要瓶颈。

详情
AI中文摘要

在线地图估计是自动驾驶系统的关键组成部分,可减少对昂贵高精地图的依赖。最先进(SOTA)的方法通常将地图元素预测为有序点序列,构成折线和多边形。这些方法的评估主要依赖基于阈值化倒角距离(CD)的平均精度均值(mAP)。该框架对点的顺序不敏感,在评估几何质量时粒度有限,难以分辨哪些方法真正更优。本文从两方面解决这些局限。在单实例相似性度量方面,我们提出序列最优子模式分配(SOSPA),一种顺序感知的度量,能对单个几何体进行细粒度评估,同时满足所有度量公理。在多实例评估框架方面,我们提出折线定位与检测(PLD),一种同时刻画检测质量和几何精度的软指标,用有原则的软分配取代mAP的硬阈值。通过在nuScenes上的评估,我们证明PLD能够有效地对SOTA在线建图方法(MapTRv2、StreamMapNet、MapTracker)进行排序,并提供分解的误差分析。该分析指出检测能力是当前方法的主要瓶颈,揭示了mAP未能反映的性能趋势。基于我们指标的评估代码将会发布。

英文摘要

Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies predominantly on mean average precision (mAP) based on thresholded Chamfer distance (CD). This framework lacks sensitivity to point ordering and provides limited granularity in assessing geometric quality, making it difficult to distinguish which methods truly excel over others. In this work, we address these limitations on two fronts. For the single-instance similarity measure, we introduce sequence optimal sub-pattern assignment (SOSPA), an order-aware metric that enables fine-grained evaluation of individual geometries while satisfying all metric axioms. For the multi-instance evaluation framework, we propose polyline localisation and detection (PLD), a soft metric that jointly captures detection quality and geometric accuracy, replacing the hard thresholding of mAP with a principled soft assignment. Through evaluations on nuScenes, we demonstrate that PLD effectively ranks SOTA online mapping methods (MapTRv2, StreamMapNet, MapTracker) while providing a decomposed error analysis. This analysis identifies detection capability as the dominant bottleneck in current methods, revealing a performance trend that mAP fails to capture. Code for evaluation using our metrics will be released.
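
The claim that Chamfer distance ignores point ordering can be checked directly. The sketch below uses one common symmetric definition (mean nearest-neighbour distance in both directions; other variants sum or square the terms) and compares a polyline with its reversed copy. The `indexwise` distance is only a naive order-aware contrast for illustration; it is not SOSPA.

```python
import math

def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def indexwise(a, b):
    """Naive order-aware distance: mean distance between points at the same index."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

polyline = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
reversed_polyline = polyline[::-1]  # same points, opposite direction

cd = chamfer(polyline, reversed_polyline)
ordered = indexwise(polyline, reversed_polyline)
```

The Chamfer distance is exactly zero because both sequences contain the same point set, while the index-wise distance is large; a prediction with reversed direction would therefore pass any Chamfer threshold.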

2605.22572 2026-05-22 cs.CV 版本更新

SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation

SegGuidedNet: 基于子区域的注意力监督用于可解释性脑肿瘤分割

Hasaan Maqsood, Saif Ur Rehman Khan, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

发表机构 * German Research Center for Artificial Intelligence(德国人工智能研究中心) Intelligentx GmbH (intelligentx.com)(Intelligentx GmbH) National University of Sciences and Technology (NUST)(国家科学与技术大学) Department of Core Informatics, Graduate School of Informatics, Osaka Metropolitan University(大阪公立大学信息学研究生院核心信息学系)

AI总结 本文提出SegGuidedNet,一种引入新颖SegAttentionGate模块的三维残差编码器-解码器网络,通过轻量级辅助损失显式监督解码器生成每个肿瘤子区域(坏死核心、周围水肿、增强肿瘤)的空间判别注意力图,从而在无需后处理解释方法的情况下提供免费的空间可解释性,并在BraTS2021和BraTS2023 GLI上实现了优异的分割性能。

详情
AI中文摘要

从多参数MRI中准确分割脑肿瘤子区域对治疗规划至关重要,但由于形态多变、类别不平衡以及肿瘤区域在不同成像序列中外观重叠,该任务仍具挑战性。我们提出SegGuidedNet,一种三维残差编码器-解码器网络,引入新颖的SegAttentionGate模块,通过轻量级辅助损失显式监督解码器,为每个肿瘤子区域(坏死核心、瘤周水肿、增强肿瘤)生成具有空间判别性的注意力图,参数开销低于0.2%。这种子区域监督保持了解码器对视觉上易混淆类别的判别能力,同时在推理时无需任何事后解释方法即可零成本地提供空间可解释性。在BraTS2021和BraTS2023 GLI上分别独立评估(各251名留出受试者),SegGuidedNet的平均Dice分别为0.905(ET=0.873,TC=0.906,WT=0.935)和0.897(ET=0.859,TC=0.902,WT=0.931),以单模型超越基于集成的nnU-Net和HNF-Netv2,并以很小一部分推理成本逼近10模型集成的Swin UNETR,差距在2-4个Dice点以内。两个基准版本上结果的一致性进一步证实了所提方法的泛化性,在轻量、临床实用的框架中兼具有竞争力的精度和内置的可解释性。

英文摘要

Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimensional residual encoder-decoder network introducing a novel SegAttentionGate module that explicitly supervises the decoder to produce spatially discriminative attention maps for each tumour sub-region (necrotic core, peritumoral oedema, and enhancing tumour) via a lightweight auxiliary loss, adding less than 0.2% parameter overhead. This sub-region supervision maintains decoder discriminability between visually ambiguous classes while providing free-of-cost spatial interpretability at inference without any post-hoc explanation method. Evaluated independently on BraTS2021 and BraTS2023 GLI across 251 held-out subjects each, SegGuidedNet achieves mean Dice of 0.905 (ET=0.873, TC=0.906, WT=0.935) and 0.897 (ET=0.859, TC=0.902, WT=0.931) respectively, surpassing ensemble-based nnU-Net and HNF-Netv2 as a single model and approaching Swin UNETR (a 10-model ensemble) within 2-4 Dice points at a fraction of the inference cost. The consistency of results across two benchmark editions further confirms the generalisability of the proposed approach, offering competitive accuracy with built-in interpretability in a lightweight, clinically practical framework.
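
As a quick consistency check on the reported numbers, the sketch below defines the Dice coefficient on flat binary masks and recomputes the mean over the three sub-regions. It assumes the reported mean Dice is the unweighted average of ET, TC, and WT, which the abstract does not state explicitly; the tiny masks are made up.

```python
def dice(pred, target):
    """Dice coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

def mean(values):
    return sum(values) / len(values)

# Per-region Dice scores quoted in the abstract.
brats2021 = {"ET": 0.873, "TC": 0.906, "WT": 0.935}
brats2023 = {"ET": 0.859, "TC": 0.902, "WT": 0.931}

mean_2021 = round(mean(list(brats2021.values())), 3)
mean_2023 = round(mean(list(brats2023.values())), 3)

# Toy mask example: one overlapping voxel, 2 predicted and 1 true voxel.
toy = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Both recomputed means agree with the quoted 0.905 and 0.897, so the unweighted-average reading is at least consistent with the abstract.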

2605.22570 2026-05-22 cs.CV cs.AI 版本更新

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

VGenST-Bench: 一个通过主动视频合成进行时空推理的基准

Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University(成均馆大学人工智能系) Department of Artificial Intelligence, Yonsei University(延世大学人工智能系)

AI总结 本文提出VGenST-Bench,一个通过生成模型主动合成多样化评估场景的视频基准,旨在评估多模态大语言模型的时空推理能力,通过引入多代理流程和3x2x2视频分类体系,实现对细粒度时空理解的精准诊断。

Comments 82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: https://zinosii.github.io/VGenST-Bench/

详情
AI中文摘要

时空推理是多模态大语言模型(MLLMs)在现实世界中的一项核心能力。因此,精确评估这一能力已成为一个关键挑战。然而,现有的时空推理基准数据集主要依赖静态图像集或被动整理的视频数据,这限制了对细粒度推理能力的评估。在本文中,我们介绍了VGenST-Bench,一个视频基准,该基准利用生成模型主动合成高度可控且多样化的评估场景。为了构建VGenST-Bench,我们提出一个包含人类质量控制阶段的多代理流程,确保所有生成的视频和问答对的质量。我们建立了一个全面的3x2x2视频分类体系,涵盖空间尺度、视角和场景动态,以涵盖多样化的场景。此外,我们设计了一个分层任务套件,将低层次的视觉感知与高层次的时空推理分离。通过从被动整理转向主动合成,VGenST-Bench能够对MLLMs的时空理解进行细粒度诊断。

英文摘要

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

2605.22563 2026-05-22 cs.CV 版本更新

Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain

椭圆傅里叶描述符域中的细胞假体视频生成

Francesco Benedetto, Roberto Basla, Luca Magri, Giacomo Boracchi

发表机构 * Department of Electronics Information and Bioengineering(电子信息与生物工程系)

AI总结 本研究提出了一种在椭圆傅里叶描述符(EFD)域中生成细胞假体视频的新框架,通过将细胞假体演变表示为多变量时间序列的EFD系数,引入了强先验知识,从而高效生成在时间上一致的视频,验证了在EFD空间建模时间演变能够生成生物合理性的假体视频,为合成标注数据生成提供了方法,减少了标注努力。

Comments 6 pages, Accepted at the International Conference on Image Processing (ICIP) 2026

详情
AI中文摘要

训练用于生物视频中单个细胞跟踪的深度神经网络需要大量标注数据。对细胞跟踪视频进行标注非常耗时,通常需要领域专业知识;这解释了公共标注数据在解决重要医疗问题如组织修复或癌症治疗方面有限的可用性。生成合成视频及其地面真实标注是一个有前景的解决方案,其基础第一步是单个细胞标注(或假体)的合成。假体需要时间一致,因为它们必须复制特定细胞类型的生物过程。在本文中,我们提出了一种新的框架,用于在椭圆傅里叶描述符(EFDs)域中生成细胞假体视频,这是一种紧凑且几何上可解释的2D闭合轮廓表示。我们将细胞假体演变表示为EFD系数的多变量时间序列,引入了强先验知识用于细胞形态,从而高效生成在时间上一致演变的序列。我们的实验验证证明,建模EFD空间中的时间演变能够生成生物合理性的假体视频。我们的方法可用于生成合成标注数据的生成管道,从而强烈缓解创建新数据集的标注努力。我们的代码可在此处下载:https://github.com/FrancescoBenedetto99/efd-cell-video-gen。

英文摘要

Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to address important medical problems like tissue repair or cancer treatment. Generating synthetic videos along with their Ground Truth annotations is a promising solution that relies, as a foundational first step, on the synthesis of single cell annotations (or phantoms). Phantoms need to be time consistent, as they have to replicate biological processes that are specific to the cell types. In this work, we propose a novel framework for generating videos of cell phantoms in the Elliptical Fourier Descriptors (EFDs) domain, a compact and geometrically interpretable representation for 2D closed contours. We represent the cell phantom evolution as a multivariate time series of EFD coefficients, introducing a strong prior for cell morphology and enabling the efficient generation of sequences that evolve coherently in time. Our experimental validation proves that modelling the temporal evolution in EFD space enables the generation of biologically plausible phantom videos. Our method can be used in generative pipelines for synthesizing annotated data for cell tracking, thus strongly mitigating the annotation effort for creating new datasets. Our code is available for download here: https://github.com/FrancescoBenedetto99/efd-cell-video-gen.
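To make the EFD representation concrete, the sketch below reconstructs a closed contour from elliptical Fourier coefficients and produces a tiny sequence of frames. It is an illustration only: the function names are hypothetical, and the linear blend between two coefficient sets merely stands in for whatever time-series model the paper uses, which the abstract does not describe.

```python
import math


def efd_contour(coeffs, center=(0.0, 0.0), num_points=64):
    """Reconstruct a closed 2D contour from elliptical Fourier coefficients.

    coeffs is a list of (a_n, b_n, c_n, d_n) tuples for harmonics n = 1, 2, ...
    """
    points = []
    for k in range(num_points):
        t = k / num_points  # normalised position along the contour, in [0, 1)
        x, y = center
        for n, (a, b, c, d) in enumerate(coeffs, start=1):
            angle = 2.0 * math.pi * n * t
            x += a * math.cos(angle) + b * math.sin(angle)
            y += c * math.cos(angle) + d * math.sin(angle)
        points.append((x, y))
    return points


def interpolate_coeffs(start, end, alpha):
    """Linear blend of two coefficient sets (a stand-in for a learned temporal model)."""
    return [tuple((1 - alpha) * s + alpha * e for s, e in zip(hs, he))
            for hs, he in zip(start, end)]


# A single harmonic describes an ellipse; here the semi-axes are 2 and 1.
ellipse = [(2.0, 0.0, 0.0, 1.0)]
circle = [(1.0, 0.0, 0.0, 1.0)]

# A three-frame "video" in which the ellipse morphs into a circle.
frames = [efd_contour(interpolate_coeffs(ellipse, circle, a), num_points=4)
          for a in (0.0, 0.5, 1.0)]
```

Because every frame is rebuilt from a handful of coefficients, any smooth trajectory in coefficient space yields a smooth, closed outline in image space, which is one way to read the abstract's claim about a strong shape prior.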

2605.22558 2026-05-22 cs.CV 版本更新

GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

GeoWeaver: 在场景推理前通过几何证据 grounding 视觉 token

Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Pazhou Lab (Huangpu)(琶洲实验室(黄埔)) Hainan University(海南大学) University of California at Merced(加州大学默塞德分校)

AI总结 本文提出 GeoWeaver,一种在场景推理前通过几何证据对视觉 token 进行 grounding 的框架,以提升空间推理能力并保持多模态能力。

详情
AI中文摘要

视觉语言模型中的时空推理需要保持物理几何的视觉表示,而非仅仅语义外观。最近的多模态模型通过结构分支、3D感知监督、推理阶段融合或长视界记忆来整合几何信息。尽管这些方法展示了几何对空间智能的重要性,但它们通常将几何线索视为所有视觉 token 的共享信号。我们注意到,这忽略了更细致的挑战:不同的视觉 token 需要根据其空间角色不同的几何证据。为了解决这一限制,我们引入 GeoWeaver,一种预推理的几何 grounding 框架,将几何视为时空推理的表示前提。GeoWeaver 从冻结的几何编码器构建多层次的几何库,并执行 token 自适应的几何证据分配,使每个视觉 token 能够检索最相关的几何抽象。所选证据通过残差 grounding 操作整合到视觉 token 中,在语言建模之前,产生几何 grounding 的表示,以支持后续推理。在空间推理基准上的广泛评估表明,GeoWeaver 一致地增强了几何感知推理,同时保持了通用多模态能力。这表明几何信息带来的最大收益不是作为后期融合的辅助信号,而是作为塑造大型语言模型推理基础的必要前提。所有源代码和模型将在 https://github.com/yahooo-m/GeoWeaver 上发布。

英文摘要

Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver.

2605.22552 2026-05-22 cs.CV cs.MM 版本更新

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning

FashionLens:通过任务自适应学习实现多功能时尚图像检索

Haokun Wen, Xuemeng Song, Xinghao Xie, Xiaolin Chen, Xiangyu Zhao, Weili Guan

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Institute of Data Science, National University of Singapore(新加坡国立大学数据科学研究所) School of Information Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)信息科学与技术学院) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出FashionLens框架,通过任务自适应学习实现多功能时尚图像检索,解决现有方法无法处理多样检索需求的问题。

详情
AI中文摘要

时尚图像检索是现代电子商务系统的核心。在实践中,一个能够支持多种查询格式和搜索意图的统一框架备受青睐。然而,现有方法专注于狭窄的检索任务,无法充分捕捉这种多样性。因此,在本工作中,我们旨在开发一个能够处理多样现实时尚检索场景的统一框架,实现真正多功能的时尚图像检索。为了建立数据基础,我们首先引入U-FIRE,一个综合基准,将碎片化的时尚数据集整合到统一的集合中,并辅以两个人工整理的数据集进行测试通用性。在此基础上,我们提出了基于多模态大语言模型的FashionLens框架。为处理不同的匹配目标,我们设计了Proposal-Guided Spherical Query Calibrator,通过自适应球形线性插值动态将查询表示转移到任务对齐的度量空间中。此外,为缓解因任务复杂性和数据规模不同导致的优化不平衡问题,我们开发了Gradient-Guided Adaptive Sampling策略,根据实时学习难度和数据规模先验自动重新加权任务。在U-FIRE上的实验表明,FashionLens在多种检索场景中均取得最佳性能,并能稳健地推广到未见任务。数据和代码已公开发布在https://github.com/haokunwen/FashionLens。

英文摘要

Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this work, we aim to develop a unified framework capable of handling diverse realistic fashion retrieval scenarios, achieving truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on realtime learning difficulty and the data scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
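The abstract names spherical linear interpolation (slerp) as the operation that moves query representations. The sketch below shows plain slerp between two vectors on the unit sphere. How FashionLens chooses the target vector or the interpolation weight is not stated in the abstract, so the function and the toy inputs are assumptions for illustration rather than the paper's calibrator.

```python
import math


def slerp(u, v, t):
    """Spherical linear interpolation between two vectors, returned as a unit vector."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    u = [x / norm_u for x in u]
    v = [x / norm_v for x in v]
    # Clamp the dot product so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    omega = math.acos(dot)  # angle between the two unit vectors
    if omega < 1e-8:
        return u  # the vectors already coincide
    if math.pi - omega < 1e-8:
        raise ValueError("slerp is undefined for antipodal vectors")
    s = math.sin(omega)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(u, v)]


# Hypothetical 2-D example: a query embedding pulled halfway toward a task anchor.
query = [1.0, 0.0]
task_anchor = [0.0, 1.0]
calibrated = slerp(query, task_anchor, 0.5)
```

Unlike a straight linear blend, slerp keeps the result on the unit sphere, so cosine-similarity retrieval scores stay comparable before and after the shift.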

2605.22550 2026-05-22 cs.CV 版本更新

MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding

MOTOR: 两轮车骑行行为理解的多模态数据集

Varun A. Paturkar, Shankar Gangisetty, C. V. Jawahar

发表机构 * CVIT, IIIT-Hyderabad(IIIT-海得拉巴学院计算机视觉研究所)

AI总结 本文提出MOTOR数据集,用于研究两轮车在密集无结构交通中的骑行行为,通过多视角、多模态数据融合,为自动驾驶辅助系统提供新的研究基础。

详情
AI中文摘要

两轮车在发展中国家道路上的致命事故比例显著偏高。然而,关于两轮车骑行行为的研究远远落后于四轮车,后者多模态数据集推动了高级驾驶辅助系统(ADAS)的重大进展。为填补这一空白,我们提出了MOTOR数据集,这是首个大规模、多视角、多模态资源,专门用于密集无结构交通中的两轮车。MOTOR包含1,629个序列(25多个小时的视频数据),由16名骑行者收集,整合了同步的前视、后视和头盔视频、可穿戴追踪器的骑行目视数据、道路音频和 telemetry(GPS、加速度计、陀螺仪)。丰富的注释捕捉交通情境、骑行状态、12种骑行动作(涵盖传统和非常规行为)以及合法性标签(合法、非法、未指定)。我们使用最先进的视频动作识别骨干网络(CNN和Transformer-based)进行骑行行为识别和动作合法性分类,并发现结合RGB、目视和telemetry数据能够获得最佳性能。MOTOR因此为两轮车驾驶的安全关键理解提供了独特基础。它为研究社区提供了一个基准,以开发和评估用于行为分析、合法性感知预测和智能交通系统模型。数据集和代码可在https://varuniiith.github.io/MOTOR-Dataset/获取。

英文摘要

Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we present the MOtorized TwO-wheeler Rider (MOTOR) dataset, the first large-scale, multi-view, multimodal resource dedicated to two-wheelers in dense, unstructured traffic. MOTOR comprises 1,629 sequences (25+ hours of video data) collected from 16 riders and integrates synchronized front, rear, and helmet videos, rider eye-gaze from wearable trackers, on-road audio, and telemetry (GPS, accelerometer, gyroscope). Rich annotations capture traffic context, rider state, 12 riding maneuvers spanning conventional and unconventional behaviors, and legality labels (Legal, Illegal, Unspecified). We benchmark rider behavior recognition and maneuver legality classification using state-of-the-art video action recognition backbones (CNN and Transformer-based), extended with multimodal fusion, and find that combining RGB, gaze, and telemetry consistently yields the best performance. MOTOR thus provides a unique foundation for advancing safety-critical understanding of two-wheeler riding. It offers the research community a benchmark to develop and evaluate models for behavior analysis, legality-aware prediction, and intelligent transportation systems. Dataset and code are available at https://varuniiith.github.io/MOTOR-Dataset/

2605.22538 2026-05-22 cs.CV 版本更新

Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking

基于运动、几何和语义适应的复杂非线性视觉目标跟踪

Deyi Zhu, Yuji Wang, Yong Liu, Yansong Tang, Bingyao Yu, Jiwen Lu, Jie Zhou

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 本文提出SAMOSA框架,通过显式利用运动、几何和语义线索,改进SAM 2在复杂非线性视觉目标跟踪中的表现,实现了更鲁棒和通用的跟踪方法。

详情
AI中文摘要

传统视觉目标跟踪(VOT)方法通常依赖于任务特定的监督训练,限制了其对未见对象和具有干扰、遮挡和非线性运动的挑战场景的泛化能力。最近的视觉基础模型,如SAM 2,通过大规模预训练学习强大的视频理解先验,并为构建更鲁棒和通用的跟踪器提供了有前景的基础。然而,直接将SAM 2应用于VOT仍然不够优化,因为它没有显式建模目标运动动态或在帧之间强制几何和语义一致性,这两者对于可靠的跟踪至关重要。为了解决这个问题,我们提出了SAMOSA,一个新的跟踪框架,通过显式利用运动、几何和语义线索,将SAM 2适应于复杂的VOT场景。具体来说,我们引入了一个轻量级的非线性运动预测器来建模目标动态并指导掩码选择以及内存过滤。我们进一步利用语义线索来检测目标位移并从跟踪失败中恢复,同时将几何线索作为结构约束以提高跟踪稳定性。通过这种方式,SAMOSA弥合了SAM 2隐含视频理解先验与显式跟踪导向建模之间的差距。广泛的实验表明,SAMOSA在通用基准上始终优于最先进的基于SAM 2的方法,展示了比监督VOT方法更强的泛化能力,并在反UAV数据集上实现了显著的提升,这些数据集典型地代表了复杂的非线性运动场景。我们的代码可在https://github.com/DurYi/SAMOSA上获得。

英文摘要

Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion predictor to model target dynamics and guide mask selection as well as memory filtering. We further exploit semantic cues to detect target shifts and recover from tracking failures, while geometric cues are incorporated as structural constraints to improve tracking stability. In this way, SAMOSA bridges the gap between the implicit video understanding prior of SAM 2 and explicit tracking-oriented modeling. Extensive experiments show that SAMOSA consistently outperforms state-of-the-art SAM 2-based approaches on general benchmarks, demonstrates stronger generalization than supervised VOT methods, and achieves substantial gains on anti-UAV datasets, which typify complex nonlinear motion scenarios. Our code is available at https://github.com/DurYi/SAMOSA.

2605.22536 2026-05-22 cs.CV cs.CL 版本更新

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

SpaceDG: 在视觉退化下评估空间智能的基准测试

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Electronic Science and Technology of China(电子科技大学) Chongqing University(重庆大学) The University of Tokyo(东京大学) Beihang University(北航) Northwestern Polytechnical University(西北工业大学)

AI总结 本文提出SpaceDG,首个针对退化感知空间理解的大型数据集,通过物理基础的退化合成引擎生成9种退化类型,评估多模态大语言模型在视觉退化下的空间推理能力,并展示在退化条件下微调可提升模型鲁棒性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在空间智能方面取得了快速进展,但现有空间推理基准大多假设纯净的视觉输入,忽略了现实部署中常见的退化现象,如运动模糊、低光照、恶劣天气、镜头畸变和压缩伪影。这提出了一个根本性问题:当前MLLMs在视觉观察不完美时的空间智能鲁棒性如何?为回答这个问题,我们引入SpaceDG,首个大规模退化感知空间理解数据集。它通过物理基础的退化合成引擎将退化形成过程嵌入3D高斯点散布(3DGS)渲染,能够真实模拟九种退化类型。所生成的数据集包含约100万对QA问题,来自近1000个室内场景。我们进一步引入SpaceDG-Bench,一个经人类验证的基准,包含11种推理类别和9种视觉退化类型的1102个问题,产生超过10000个VQA实例。评估25个开源和闭源MLLMs发现,视觉退化一致且显著损害空间推理能力,暴露出关键的鲁棒性差距。最后,我们展示在SpaceDG上微调可显著提高退化鲁棒性,并且在退化条件下甚至可以超越人类性能,而不会在清晰图像上造成性能下降,突显了退化感知训练在鲁棒空间智能方面的潜力。

英文摘要

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22504 2026-05-22 cs.AI cs.CV 版本更新

LACO: Adaptive Latent Communication for Collaborative Driving

LACO:适应性潜在通信用于协同驾驶

Tianhao Chen, Yuheng Wu, Dongman Lee

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院)

AI总结 本文提出LACO,一种无需训练的潜在通信范式,通过迭代潜在推理、跨时间显著性归因和结构化语义知识蒸馏,解决协同驾驶中潜在通信的延迟和信息丢失问题,实验证明其在降低通信和推理延迟的同时保持了强大的协同驾驶性能。

详情
AI中文摘要

协同驾驶旨在通过使联网车辆在部分可观测性下相互协调来提高安全性和效率。最近的方法已从共享视觉特征进行感知发展到通过基础模型交换基于语言的推理以实现行为协调。尽管用语言交流能够提供直观的信息,但它带来了两个挑战:由自回归解码引起的高延迟,以及将丰富的内部表示压缩成离散标记所导致的信息丢失。为了解决这些挑战,我们分析了协同驾驶中潜在通信在多智能体设置固有限制下的表现。我们的分析揭示了智能体身份混淆问题,即直接融合潜在状态会使不同车辆的决策表示相互纠缠。受此启发,我们提出了LACO,一种无需训练的潜在通信范式,能够无缝地将预训练驾驶模型适应到协同设置中。LACO引入了迭代潜在审议(ILD)用于潜在推理,跨时间显著性归因(CHSA)用于通信高效的信息选择,以及结构化语义知识蒸馏(SSKD)以稳定以自我为中心的决策。在CARLA中的闭环实验表明,LACO显著降低了通信和推理延迟,同时保持了强大的协同驾驶性能。

英文摘要

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

2605.22492 2026-05-22 cs.CV 版本更新

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

无需训练的细粒度语义分割在低数据环境下:一个FungiTastic基线

Sebastian Cavada, Francesco Pelosin, Lapo Faggi

发表机构 * Covision Lab(Covision实验室)

AI总结 本文提出了一种无需训练的两阶段框架,用于在低数据环境下实现细粒度语义分割,通过宏分类提示生成蘑菇掩码,并利用嵌入空间中的原型匹配进行细粒度标签分配,提高了可扩展性并保持了较低的分割成本。

Comments Accepted at the 13th Workshop on Fine-Grained Visual Categorization, CVPR 2026

详情
AI中文摘要

细粒度语义分割既需要精确的定位,也需要区分视觉上相似的类别。在FungiTastic中,长尾分布和图像采集条件的强烈变化使这一问题更加复杂。我们提出了一种无需训练的两阶段框架,将分割与分类解耦。SAM3首先使用宏分类提示生成类别无关的蘑菇掩码,DINOv3随后通过嵌入空间中的原型匹配分配细粒度标签。为了改进这一阶段,我们对DINOv3特征空间应用了一个简单的变换,以提高基于原型的分类效果。与类别特定提示相比,我们的方法更具可扩展性,并保持较低的分割成本。我们报告了从单样本到数百样本设置下的结果,据我们所知,这是低数据条件下细粒度语义分割的首个基线。

英文摘要

Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage framework that decouples segmentation from classification. SAM3 first produces class-agnostic mushroom masks using macro-taxonomic prompts, and DINOv3 then assigns fine-grained labels through prototype matching in the embedding space. To improve this stage, we apply a simple transformation of the DINOv3 feature space that improves prototype-based classification. Compared with class-specific prompting, our approach is more scalable and keeps the segmentation cost low. We report results from one-shot to few-hundred-shot regimes, providing, to the best of our knowledge, the first baseline for fine-grained semantic segmentation in low-data settings.
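The classification stage described above, prototype matching in an embedding space, can be sketched in a few lines. This is a generic nearest-prototype classifier under stated assumptions: the embeddings are made-up 3-D vectors rather than DINOv3 features, the helper names are hypothetical, and the feature-space transformation mentioned in the abstract is omitted because its form is not given.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_prototypes(support):
    """Average the unit-normalised support embeddings of each class."""
    prototypes = {}
    for label, embeddings in support.items():
        normalised = [unit(e) for e in embeddings]
        mean = [sum(column) / len(normalised) for column in zip(*normalised)]
        prototypes[label] = unit(mean)
    return prototypes


def classify(embedding, prototypes):
    """Return the label whose prototype has the highest cosine similarity."""
    query = unit(embedding)

    def similarity(label):
        return sum(q * p for q, p in zip(query, prototypes[label]))

    return max(prototypes, key=similarity)


# Hypothetical embeddings of masked mushroom crops (real features are much larger).
support = {
    "species_a": [[0.9, 0.1, 0.0], [1.0, 0.2, 0.1]],
    "species_b": [[0.1, 0.9, 0.2], [0.0, 1.0, 0.1]],
}
prototypes = build_prototypes(support)
prediction = classify([0.8, 0.3, 0.1], prototypes)
```

Adding a new class only requires a few support embeddings and one more prototype, with no retraining, which is consistent with the one-shot to few-hundred-shot regimes the abstract reports.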

2605.22484 2026-05-22 cs.CV 版本更新

Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

监督分类头作为语义原型:通过权重重用解锁视觉-语言对齐

David Méndez, Roberto Confalonieri, Natalia Díaz Rodríguez

发表机构 * Department of Computer Science and Artificial Intelligence, DaSCI Institute, University of Granada, Granada, Spain(计算机科学与人工智能系,DaSCI研究所,格拉纳达大学,格拉纳达,西班牙) Department of Mathematics "Tullio Levi-Civita", University of Padova, Padova, Italy(图利奥·列维-奇维塔数学系,帕多瓦大学,帕多瓦,意大利)

AI总结 本文提出利用预训练视觉模型的分类头作为语义原型,通过权重重用实现视觉-语言对齐,提升跨模态检索、零样本和少样本分类任务的性能。

详情
AI中文摘要

视觉-语言模型(VLMs)通过将图像和文本映射到共享空间,在零样本分类和跨模态检索等任务中表现出色,但需要昂贵的端到端训练和大量配对数据。当前的后处理对齐方法通过轻量级映射连接预训练编码器来降低计算成本,但仍需大量配对数据。在本文中,我们研究了重新利用预训练视觉模型的分类头作为语义原型的潜力。这些权重的重用,通常在预训练后被丢弃,解锁了两种不同的能力:它使零样本对齐成为可能,通过将权重用作语义锚点,并通过将这些原型与真实图像-文本对混合,成为一种稳健的数据增强策略。我们证明,将我们的方法与几种最先进的后处理对齐技术结合,能够一致地提高跨模态检索、零样本和少样本分类任务的准确性。

英文摘要

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

2605.22469 2026-05-22 cs.CV 版本更新

MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation

MaSC:一种用于评估概念驱动生成的遮蔽相似度度量

Patryk Bartkowiak, Lennart Petersen, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

发表机构 * Adam Mickiewicz University(亚当·密茨凯维奇大学) Kiel University(基尔大学) ArtCollect(艺术收藏) KAUST(阿卜杜拉国王科技大学)

AI总结 本文提出MaSC,一种基于遮蔽的相似度度量方法,用于评估文本到图像扩散模型中单概念个性化生成的保真度和提示遵循性,通过使用外部提供的前景概念遮罩将评估分解为针对主体的概念保真度和基于背景的提示遵循性。

Comments 20 pages, 2 figures, 7 tables

详情
AI中文摘要

评估文本到图像扩散模型中单概念个性化生成需要测量概念保真度(捕捉参考的识别保真度)和提示遵循性(捕捉生成场景是否匹配提示)。现有度量通常使用全局图像或文本-图像嵌入,如CLIP-I、DINO和CLIP-T。我们证明这些度量与人类感知相关性差,因为它们将图像视为整体而非将概念主体与背景分离。我们引入MaSC,一种遮蔽相似度度量,使用外部提供的前景概念遮罩将评估分解为主体特定的概念保真度和基于背景的提示遵循性。MaSC通过冻结的SigLIP2 SO400M-NaFlex特征计算两个分数:概念保真度通过前景参考块与生成图像块之间的遮蔽最大余弦匹配测量,提示遵循性通过比较仅背景的池化图像嵌入与无主体提示嵌入进行比较。在DreamBench++人类评分中,MaSC在概念保真度上达到Krippendorff alpha = 0.471,优于所有测试的非LLM基线和GPT-4V,并接近GPT-4o。在ORIDa,一个跨物理环境的真实照片身份保真度基准中,MaSC达到AUC = 0.992,几乎完美地区分相同主体与跨主体对。其提示遵循性分数也优于DreamBench++中自带的CLIP-T基线。这些结果表明,空间分解聚合是评估概念驱动生成的强大设计原则。

英文摘要

Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals using global image or text-image embeddings, such as CLIP-I, DINO, and CLIP-T. We show that such metrics correlate poorly with human perception because they attend to the image as a whole instead of separating the concept subject from the background. We introduce MaSC, a masked similarity metric that uses externally provided foreground concept masks to decompose evaluation into subject-specific concept preservation and background-based prompt following. MaSC computes both scores from frozen SigLIP2 SO400M-NaFlex features: concept preservation is measured by masked max-cosine matching between foreground reference patches and generated-image patches, while prompt following is measured by comparing a background-only pooled image embedding to a subject-stripped prompt embedding. On DreamBench++ human ratings, MaSC achieves Krippendorff alpha = 0.471 for concept preservation, outperforming all tested non-LLM baselines and GPT-4V, and approaching GPT-4o. On ORIDa, a real-photo identity-preservation benchmark across physical environments, MaSC achieves AUC = 0.992, nearly perfectly distinguishing same-subject from cross-subject pairs. Its prompt-following score also outperforms the CLIP-T baseline shipped with DreamBench++. These results show that spatially decomposed aggregation is a strong design principle for evaluating concept-driven generation.

2605.22467 2026-05-22 cs.CV 版本更新

SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data

SADGE:合成与真实数据的结构和外观领域差距估计

Patryk Bartkowiak, Bartosz Kotrys, Dominik Michels, Soren Pirk, Wojtek Palubicki

发表机构 * Adam Mickiewicz University(亚当·密茨凯维奇大学) ArtCollect(艺术收藏) KAUST(阿卜杜拉国王科技大学) Kiel University(基尔大学)

AI总结 本文提出SADGE,一种定量相似性度量指标,用于预测合成图像数据集在常见计算机视觉任务上的性能,而无需下游模型训练。研究发现,现有评估指标(如PSNR、FID、CLIP)主要衡量真实与合成图像之间的语义对齐(外观相似性分数),而结构相似性则用于评估领域差距(几何相似性分数)。本文通过多种合成数据集和下游任务证明,单一的外观或几何相似性无法可靠预测下游性能,而是它们的非线性交互决定了合成数据的效用。SADGE在五个公开的合成到真实基准家族和15个数据集变体(79k图像对)中,达到了线性和排名标准下最强的下游转移性能关联性。

详情
英文摘要

We propose SADGE, a quantitative similarity metric that predicts the performance of synthetic image datasets for common computer vision tasks without downstream model training. Estimating whether a synthetic dataset will lead to a model that performs well on real-world data remains a bottleneck in model development. Existing evaluation metrics (e.g., PSNR, FID, CLIP) primarily measure semantic alignment between real and synthetic images (Appearance Similarity Score). Less commonly, structural similarity between images is considered to assess the domain gap (Geometric Similarity Score). However, to the best of our knowledge, there exist no studies that evaluate which similarity metric is the best downstream predictor for a given synthetic dataset. In this paper, we show over a wide variety of different synthetic datasets and downstream tasks that neither appearance nor geometry alone can reliably predict downstream performance; rather, it is their non-linear interplay that dictates synthetic data utility. Specifically, we measure how commonly used Appearance and Geometric Similarity metrics computed between synthetic and real images correlate with downstream performance in object detection, semantic segmentation, and pose estimation. Across five public synthetic-to-real benchmark families and 15 dataset-level variants (79k image pairs), SADGE achieves the strongest association with downstream transfer performance under both linear and rank-based criteria, reaching Pearson r=0.88 and Spearman rho=0.77. We compute SADGE scores for each combination of geometry-based and appearance-based methods across all benchmark families. The best configuration is obtained by fusing DINOv3 appearance similarity with MASt3R geometric consistency through a constrained bilinear interaction, outperforming both the strongest geometry-only baseline and the strongest appearance-only baseline.
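The two criteria quoted above, Pearson r for linear association and Spearman rho for rank association, can be computed as in the sketch below. The similarity scores and downstream results in the example are invented toy numbers, not values from the paper, and the helper functions are a plain-Python illustration rather than the authors' evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data: a monotonic but non-linear relation between a score and performance.
similarity_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
downstream_performance = [1.0, 4.0, 9.0, 16.0, 25.0]
r = pearson(similarity_scores, downstream_performance)
rho = spearman(similarity_scores, downstream_performance)
```

In this toy case rho is 1 because the ordering is preserved exactly, while r is slightly lower because the relation is curved. Reporting both, as the abstract does, separates "ranks datasets correctly" from "predicts performance linearly".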

2605.22455 2026-05-22 cs.CV cs.AI cs.LG physics.optics 版本更新

Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light

使离散的成为连续的:合成RAW增强用于细粒度评估人检测性能在低光环境

Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti

发表机构 * University of Glasgow(格拉斯哥大学) Dotphoton

AI总结 本文提出了一种合成RAW增强方法,用于在低光条件下更准确地评估人检测模型的性能,通过生成与相机传感器噪声模型匹配的低光样本,以改善基准测试的数据覆盖。

Comments Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)

详情
AI中文摘要

人工智能视觉模型的实际应用既受到可用训练和测试数据的推动,也受到其限制。真实数据集稀疏且不均匀:长尾或不平衡分布会阻碍泛化,而低密度区域中的样本数量少使得评估困难。合成数据可以填补这些空白,提供更连续地采样输入空间的方法,提高基准测试的数据覆盖。聚焦于自动驾驶安全关键场景中的夜间行人检测,我们展示了如何利用合成低光样本更好地刻画最先进目标检测模型的性能随场景光照的变化。我们使用合成RAW图像增强技术生成低光样本,以匹配相机传感器的噪声模型。在真实和合成低光数据上的性能指标相似,表明AI模型难以区分它们。

英文摘要

Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the autonomous driving safety-critical case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.

2605.22446 2026-05-22 cs.CV cs.AI cs.RO Version update

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

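Affiliations Beihang University; Tsinghua University; Peking University; JDT AI Infra; Zhejiang University

AI summary This paper proposes Pre-VLA, a unified runtime verification architecture that assesses action validity before physical execution or world-model imagination, improving the reliability of vision-language-action and world-model rollouts.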

English abstract

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.
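The abstract names a focal classification term as its way of handling class imbalance. As a hedged illustration, here is the standard binary focal loss in plain Python. The `gamma` and `alpha` defaults are common choices rather than values from the paper, and the paper's full multi-task objective (advantage regression, soft-threshold calibration) is not reproduced.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive class, y: label (0 or 1).
    gamma down-weights easy examples; alpha balances the two classes.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    p_t = min(max(p_t, eps), 1.0)               # clamp to avoid log(0)
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less than a confident wrong one.
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

With `gamma=0` the expression reduces to class-weighted cross-entropy, which makes the effect of the focusing term easy to check.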

2605.22425 2026-05-22 eess.IV cs.CV Version update

Time-varying rPPG signal separation via block-sparse signal model

Kosuke Kurihara, Yoshihiro Maeda, Daisuke Sugimura, Takayuki Hamamoto

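Affiliations Tokyo University of Science; Shibaura Institute of Technology; Tokyo Metropolitan University

AI summary This paper proposes an rPPG signal extraction method that exploits the quasi-periodic nature of rPPG signals. It builds a time-varying signal separation framework for adaptive separation under illumination changes, and experiments validate its effectiveness.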

Comments Accepted by IEEE International Conference on Image Processing (ICIP 2026)

English abstract

Remote photoplethysmography (rPPG) enables non-contact measurement of cardiac pulse signals by analyzing subtle color changes in facial videos. Nevertheless, extracting rPPG signals remains challenging because of their extremely weak signal strength and susceptibility to illumination noise. In this paper, we propose an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. Our approach models quasi-periodicity of the rPPG signal, which arises from the stable cardiac cycle, as a block-sparse structure in the time-frequency domain. To incorporate a block-sparse model and enable adaptive signal separation under illumination fluctuations, we construct a time-varying signal separation framework. Experiments using a public dataset demonstrate the effectiveness of our method.

2605.22422 2026-05-22 cs.CV cs.AI Version update

FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

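Affiliations LITIS

AI summary This paper proposes FastTab, a grid-based table structure recognition model that uses a lightweight Tiny Recursive Module and axial 1D Transformer encoders to recover table structure efficiently, showing low latency and good robustness across several benchmarks.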

English abstract

Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at https://github.com/hamdilaziz/FastTab .
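The abstract describes building a grid from predicted separators before inferring spans. As a rough sketch of that grid-construction step only, assuming straight, axis-aligned separators given as pixel coordinates (the function name and input format are hypothetical, not FastTab's actual interface):

```python
def build_grid(row_seps, col_seps, width, height):
    """Turn separator positions into a grid of cell boxes.

    row_seps: y coordinates of horizontal separators inside the table.
    col_seps: x coordinates of vertical separators inside the table.
    Returns grid[r][c] = (x0, y0, x1, y1).
    """
    ys = [0] + sorted(row_seps) + [height]
    xs = [0] + sorted(col_seps) + [width]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]

# One horizontal and two vertical separators give a 2 x 3 grid.
grid = build_grid(row_seps=[50], col_seps=[40, 120], width=200, height=100)
print(len(grid), len(grid[0]), grid[1][2])
```

Spanning cells would then be expressed by merging neighbouring boxes of this grid. The abstract says the model infers rowspan and colspan from ROI-aligned cell features, which this sketch does not attempt.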

2605.22420 2026-05-22 cs.CV cs.AI cs.RO Version update

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

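Affiliations Waabi; University of Toronto; University of Illinois Urbana-Champaign

AI summary This paper proposes GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. By learning generative priors across diverse scenes, it efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints, enabling robust and scalable sensor simulation for autonomous driving.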

Comments ICRA 2026. Project page: https://waabi.ai/genre

English abstract

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting the applicability for closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes the deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe produces robust and high-fidelity representation efficiently that generalizes reliably to challenging unseen viewpoints (e.g., lane change). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

2605.22414 2026-05-22 cs.CV cs.AI Version update

Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu

Affiliations Department of Computer Science and Engineering, Southern University of Science and Technology; The Hong Kong Polytechnic University; National University of Singapore; University of Washington; Institute of High Performance Computing, Agency for Science, Technology and Research

AI summary This paper proposes the FundusGround benchmark, which improves the clinical interpretability of ophthalmic VQA through spatially grounded lesion evidence. It collects lesion-annotated fundus images through a three-stage pipeline and evaluates several vision-language models on answer accuracy and lesion-level reasoning.

English abstract

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 image-level meticulously annotated lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general- and medical- large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.
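The nine regions mentioned in the abstract match the standard ETDRS grid: a central subfield plus inner and outer rings, each split into superior, inferior, nasal, and temporal quadrants. The sketch below assumes the conventional 1 mm, 3 mm, and 6 mm circle diameters and quadrant boundaries on the 45-degree diagonals. The coordinate convention (millimetres from the fovea, x to the right of the image, y upward), the function name, and the region labels are assumptions, not the paper's implementation.

```python
import math

def etdrs_region(dx_mm, dy_mm, eye="right"):
    """Map an offset from the fovea to an ETDRS region name.

    Returns None when the point lies outside the 6 mm grid.
    In a fundus photograph the nasal side is on the right of the image
    for a right eye and on the left for a left eye.
    """
    r = math.hypot(dx_mm, dy_mm)
    if r <= 0.5:
        return "central"
    if r > 3.0:
        return None
    ring = "inner" if r <= 1.5 else "outer"
    angle = math.degrees(math.atan2(dy_mm, dx_mm)) % 360
    if 45 <= angle < 135:
        quadrant = "superior"
    elif 225 <= angle < 315:
        quadrant = "inferior"
    else:
        right_side = angle < 45 or angle >= 315
        nasal_on_right = eye == "right"
        quadrant = "nasal" if right_side == nasal_on_right else "temporal"
    return f"{ring} {quadrant}"

print(etdrs_region(1.0, 0.0, eye="right"))
```

A lesion annotation would first need its pixel coordinates converted to millimetres relative to a detected fovea, which depends on camera field of view and is left out here.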

2605.22413 2026-05-22 cs.CV Version update

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Yandi Wang, Libin Zhan, Ziwei Huang, Tiancheng Luo, Yuxuan Jiang, Wang Dong, Leilei Gan, Jun Chen

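Affiliations Zhejiang University

AI summary This paper proposes the ReceiptBench benchmark, which organizes receipt information extraction into four hierarchical sub-tasks, together with a two-stage training framework incorporating Metric-Aware GRPO that uses reinforcement learning signals to improve structural consistency. Experiments show superior performance on complex reasoning tasks.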

English abstract

Extracting structured information from visual documents (Visual Information Extraction, VIE) is a cornerstone of business automation. While recent Multimodal Large Language Models (MLLMs) have shown promising capabilities, existing benchmarks suffer from critical limitations in scale and realism, lack semantic granularity, and fail to cover diverse document types. To bridge this gap, we introduce ReceiptBench, a large-scale, human-annotated benchmark consisting of 10k diverse receipts, organizing information extraction into four hierarchical sub-tasks: (1) Basic Perception for raw text spotting, (2) Format Normalization for strictly following standardization instructions, (3) Semantic Reasoning for inferring implicit attributes from context, and (4) Structure Parsing for handling nested line items. Furthermore, we propose a two-stage training framework incorporating Metric-Aware Group Relative Policy Optimization (GRPO), which translates rigorous evaluation constraints into reinforcement learning signals to enhance structural consistency. Extensive experiments demonstrate that our method yields state-of-the-art performance, surpassing leading proprietary models on complex reasoning tasks. We release our datasets and code at https://github.com/wwwT0ri/ReceiptBench.
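Group Relative Policy Optimization scores each sampled response against the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage step follows. The rewards are made-up stand-ins for whatever metric-aware score the paper uses, and the population standard deviation and `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled extractions of one receipt.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group mean get positive advantages and the rest get negative ones, so no separate value network is needed to supply a baseline.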

2605.22403 2026-05-22 cs.CV Version update

Translating Signals to Languages for sEMG-Based Activity Recognition

Ming Wang, Haoxuan Qu, Qiuhong Ke, Wei Zhou, Hossein Rahmani, Jun Liu

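Affiliations Lancaster University; Monash University; Cardiff University

AI summary This paper proposes an LLM-based framework for sEMG activity recognition that converts continuous sEMG signals into a language form to improve recognition accuracy.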

English abstract

Surface electromyography (sEMG) signal-based activity recognition has attracted increasing research attention in recent years. To develop accurate sEMG signal-based activity recognizers, numerous approaches have been proposed. Some studies focus on designing larger and more expressive model architectures to enhance the representational capacity of sEMG signals, while others aim to enrich model priors through large-scale pretraining, thereby improving recognition performance. Recently, large language models (LLMs) have shown remarkable generalization and reasoning capabilities in natural language processing, whose implicit knowledge, learned from extensive linguistic descriptions of actions, opens new possibilities for interpreting sEMG signals and inferring activity intentions. Motivated by this, we propose LLM-sEMG, a novel framework that leverages LLMs as sEMG activity recognizers. Within this framework, we design a language-oriented mapping mechanism that converts continuous sEMG sequences into sEMG language, integrating several strategies to further facilitate the signal-to-language mapping process. Extensive experiments demonstrate that the proposed framework achieves highly accurate sEMG signal-based activity recognition using large language models.

2605.22366 2026-05-22 cs.CV Version update

AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

Zi Ye, Yibin Wen, Xiaoya Fan, Xinyu Zhang, Jing Wu, Kun Zeng, Zurong Mai, Jiarui Zhang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu

Affiliations Sun Yat-Sen University; Southwest University; Northeastern University; National Supercomputing Center in Shenzhen; Southwest Jiaotong University; China Agricultural University; Tsinghua University

AI summary This paper proposes AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. Using 539 question-answer instances and 1,097 heterogeneous agricultural images, it assesses both process-level execution quality and outcome-level task success in tool use.

English abstract

Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.

2605.22359 2026-05-22 cs.CV Version update

GazePrior: Zero-Shot AR/VR Eye Tracking via Learned 3D Gaze Reconstruction

Corentin Dumery, David Colmenares, Alexander Fix, Pascal Fua, Ali Behrooz, Jogendra Kundu

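Affiliations Meta Reality Labs; EPFL

AI summary This paper proposes GazePrior, a data-driven 3D gaze prior model that enables high-quality AR/VR eye tracking without additional data collection, using synthesized data to improve model accuracy and robustness.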

Comments Project page: https://corentindumery.github.io/projects/gazeprior.html

English abstract

Eye tracking (ET) is a foundational technology for advanced AR/VR applications. However, training ET models for every new ET device is challenging: real data collection is costly and time-consuming, while existing synthetic data generation methods lack realism. To remove the need for additional data collection while maintaining data quality, we introduce a data-driven 3D prior that models the distribution of human eyes across diverse identities, gaze directions, and light settings. This model, which we coin GazePrior, then enables sparse-input 3D reconstruction of annotated data collected with previous ET devices, which can in turn be rendered from the cameras of any target ET device. Our approach synthesizes data with the realism, diversity and ground-truth accuracy of real data collection without its prohibitive costs. Our experiments demonstrate that ET models trained with our synthesized data outperform previous zero-shot methods, achieving higher accuracy and robustness.

2605.22357 2026-05-22 cs.CV cs.AI Version update

VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography

Ziya Ata Yazıcı, N. Sinem Gezer, İlkay Öksüz, İlker Özgür Koska, Tuğçe Toprak, Pervin Bulucu, Ufuk Beşenk, A. Emre Kavur, Pierre-Henri Conze, Hazım Kemal Ekenel, Oğuz Dicle, Mustafa Ege Şeker, Mustafa Said Kartal, Ariorad Moniri, Orhan Özkan, Osman Faruk Bayram, Hakan Polat, Musa Balcı, Ece Tuğba Cebeci, Baran Cılga, Kardelen Peçenek, M. Alper Selver

Affiliations Department of Radiology, Dokuz Eylul University; Department of Computer Engineering, Istanbul Technical University; Institute of Natural and Applied Sciences, Dokuz Eylul University; Department of Electrical and Electronics Engineering, Dokuz Eylul University; Department of Radiology, University of Wisconsin-Madison; School of Medicine, Sivas Cumhuriyet University; School of Medicine, Acibadem Mehmet Ali Aydinlar University; Department of Artificial Intelligence Engineering, Bahçeşehir University; Faculty of Pharmacy, Sivas Cumhuriyet University

AI summary This paper presents the VEELA dataset for hepatic and portal vessel segmentation in CT angiography. Strict manual annotation under multi-expert consensus keeps the annotations clinically realistic and accurate, and several evaluation metrics are introduced to assess vessel segmentation from multiple perspectives.

Comments 27 pages, 25 figures, 5 tables

English abstract

Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging due to complex vascular topology, peripheral visibility limitations, and acquisition-induced ambiguities. While existing public datasets offer valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice-by-slice under multi-expert consensus, using a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results demonstrate that different metrics capture distinct aspects of vascular integrity, underscoring the necessity of multi-perspective evaluation for clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vascular segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at https://www.synapse.org/Synapse:syn65471967.

2605.22351 2026-05-22 cs.CV Version update

QuantSR+: Pushing the Limit of Quantized Image Super-Resolution Networks

Haotong Qin, Xudong Ma, Xianglong Liu, Jie Luo, Jinyang Guo, Michele Magno, Yulun Zhang

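Affiliations ETH Zurich; Shanghai Jiao Tong University; Beihang University

AI summary This paper proposes the QuantSR+ framework, which improves quantization operators, network design, and training optimization to achieve a better accuracy-efficiency trade-off. To counter the performance drop at ultra-low precision, it makes three key technical contributions: Redistribution-driven Bit Determination, Quantized Slimmable Architecture, and Slimming-guided Function-localized Distillation.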

English abstract

Low-bit quantization is widely used to compress super-resolution (SR) models and reduce storage and computation costs for deployment on resource-limited devices. However, when SR models are pushed to ultra-low precision (2-4 bits), performance can drop sharply due to diminished representational capacity and the detail-sensitive nature of SR. To address these issues, we propose QuantSR+, a unified framework that improves quantization operators, network design, and training optimization, achieving better trade-offs between accuracy and efficiency than prior low-bit SR methods. QuantSR+ mainly relies on three technical contributions: (1) Redistribution-driven Bit Determination (RBD), which reshapes quantization distributions in both forward and backward passes to preserve representation fidelity; (2) Quantized Slimmable Architecture (QSA), which begins with an over-parameterized model and progressively prunes less critical blocks to meet efficiency budgets while pushing the accuracy performance; and (3) Slimming-guided Function-localized Distillation (SFD), which enforces block-aware feature alignment via a direct loss and a progressive, function-local training schedule to capture quantization effects better and speed up convergence. Extensive experiments show that QuantSR+ achieves state-of-the-art performance against both specialized quantized SR methods and generic quantization approaches. For SwinIR-S on Urban100 (x4), it improves PSNR by 0.29 dB over the 2-bit SOTA baseline. Meanwhile, it delivers strong efficiency gains at 2-bit, reducing operations by up to 87.9% and storage by 89.4%. QuantSR+ is effective for both convolutional and transformer-based SR models, indicating broad applicability.
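To make the loss of representational capacity at 2 bits concrete, here is a generic min-max uniform quantizer in plain Python. It is a textbook baseline used only for illustration and is not the paper's Redistribution-driven Bit Determination. The function name and sample values are hypothetical.

```python
def quantize_uniform(values, bits):
    """Snap each value to one of 2**bits evenly spaced levels in [min, max]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

weights = [0.0, 0.1, 0.4, 0.6, 0.9, 1.5]
quantized = quantize_uniform(weights, bits=2)
print(quantized)
```

At 2 bits only four distinct values survive, so nearby weights such as 0.4 and 0.6 collapse onto the same level. This kind of collapse is what makes detail-sensitive tasks like super-resolution hard to quantize.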

2605.22344 2026-05-22 cs.CV cs.AI cs.MM Version update

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

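Affiliations Bernini Team

AI summary This paper proposes the Bernini framework, which unifies video generation and editing by using a multimodal large language model for semantic planning and a diffusion model for pixel rendering, improving generalization on editing tasks.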

Comments Project Page: https://bernini-ai.github.io/

English abstract

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.

2605.22342 2026-05-22 cs.CV cs.AI Version update

4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li

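Affiliations Southeast University; Guangdong Laboratory of Artificial Intelligence and Digital Economy; University of Chinese Academy of Sciences

AI summary This paper proposes 4D-GSW, a kinematic-aware, spatio-temporally consistent watermarking technique that embeds robust copyright information in 4D Gaussian Splatting while preserving high spatio-temporal consistency.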

Comments 9 pages main paper, 7 figures, 18 pages in total

English abstract

While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants," adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments have demonstrated the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatiotemporal consistency.

2605.22328 2026-05-22 cs.CV Version update

3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes

Narges Takhtkeshha, Aldino Rizaldy, Markus Hollaus, Juha Hyyppä, Fabio Remondino, Gottfried Mandlburger

Affiliations 3D Optical Metrology (3DOM) Unit, Bruno Kessler Foundation (FBK); Department of Geodesy and Geoinformation, TU Wien; Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (HIF); Freie Universität Berlin, Remote Sensing and Geoinformatics; Department of Remote Sensing and Photogrammetry, Finnish Geospatial Research Institute FGI, The National Land Survey of Finland

AI summary This paper presents a 3D land use/land cover classification approach based on multispectral LiDAR and deep learning. It introduces NMCA-aligned L1 and L2 classification schemes and a new multispectral LiDAR benchmark dataset, evaluates seven state-of-the-art deep learning models, and shows that spectral information improves classification performance.

English abstract

Land Use Land Cover (LULC) classification is essential for national 3D mapping, geospatial analysis, and sustainable planning. Multispectral (MS) LiDAR provides synchronized spatial-spectral information, and deep learning (DL) enables 3D point cloud semantic segmentation; however, adoption is limited by the lack of publicly available urban and suburban MS LiDAR datasets aligned with National Mapping and Cadastral Agencies (NMCAs) classification schemes. This study addresses these gaps by introducing L1 and L2 NMCA-aligned LULC classification schemes and a new benchmark MS LiDAR dataset. We evaluate seven state-of-the-art DL models and perform spectral ablation studies at both levels of detail. Results show that Point Transformer V3 achieves the best performance, with mIoU of 79.4% (L1, 8 classes) and 58.9% (L2, 20 classes) using a dual-wavelength LiDAR system (532 nm and 1064 nm). Ablation results show that multispectral information improves performance over geometry-only inputs, with gains of 1.1 percentage points at L1 and 7.8 points at L2. These results highlight the value of LiDAR reflectance for fine-grained material discrimination and support the evolution of NMCA LULC schemes toward higher semantic detail. The Loosdorf-MSL dataset contributes a new benchmark for consistent national and international LULC mapping.
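The headline numbers in this abstract are mean intersection-over-union (mIoU) values. As a brief reminder of how that metric is usually computed, the sketch below derives per-class IoU from a confusion matrix and averages it. The two-class matrix is invented for illustration and has nothing to do with the Loosdorf-MSL results.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: true, columns: predicted).

    Classes that never appear in either the labels or the predictions
    are skipped rather than counted as zero.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)

# Hypothetical point counts for two classes.
confusion = [[8, 2],
             [1, 9]]
print(mean_iou(confusion))
```

Because every class contributes equally to the mean, rare classes weigh as much as common ones. This may partly explain why the 20-class L2 score sits well below the 8-class L1 score.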

2605.22327 2026-05-22 cs.CV physics.med-ph 版本更新

Robustness of breast lesion segmentation under MRI undersampling improves with k-space-aware deep learning

在MRI欠采样下,基于k空间的深度学习改进了乳腺病变分割的鲁棒性

Lukas T. Rotkopf, Marco Schlimbach, Julius C. Holzschuh, Heinz-Peter Schlemmer, Jens Kleesiek, Moritz Rempe

发表机构 * Institute for AI in Medicine (IKIM), University Hospital Essen(人工智能医学研究所(IKIM),埃森大学医院) Department of Physics, Technical University Dortmund(物理系,多特蒙德技术大学) Division of Radiology, German Cancer Research Center (DKFZ)(放射学部,德国癌症研究中心(DKFZ)) Cancer Research Center Cologne Essen (CCCE), University Medicine Essen(科隆埃森癌症研究中心(CCCE),埃森大学医学中心) RACOON Study Group, Site Essen(RACOON研究组,埃森站点) German Cancer Consortium (DKTK), Partner Site Essen(德国癌症联合会(DKTK),埃森合作伙伴站点) Medical Faculty and Faculty of Computer Science, University of Duisburg-Essen(医学系和计算机科学系,杜伊斯堡-埃森大学)

AI总结 本文研究了直接从获得的MRI k空间学习乳腺病变分割是否能提高在加速或噪声下的鲁棒性,通过比较不同模型发现基于k空间的深度学习方法在欠采样和噪声下表现更优。

详情
AI中文摘要

目的:评估是否可以直接从获得的MRI k空间学习乳腺病变分割,并判断在数据加速或噪声情况下这种学习方式是否能提高鲁棒性。材料和方法:本回顾性研究使用了公开的乳腺动态对比增强MRI(DCE-MRI)数据集,包含获得的和合成的k空间,以及数据集内的合成对照。我们比较了四种3D U-Net变体:混合k空间到图像模型、原生k空间模型以及幅度和复数图像空间基线。模型在增加的欠采样和添加复数高斯k空间噪声下进行评估。主要结果是交叉验证下的患者级Dice相似性系数,其中混合模型被预设为主要比较对象,与幅度图像空间基线进行比较。结果:在完全采样下,混合模型和图像空间模型表现相似。随着加速增加,混合模型在欠采样水平中保持了显著的分割准确性,并在中等至高欠采样水平上显著优于幅度图像空间基线。当直接向k空间添加噪声时,相同模式被观察到:混合模型退化更慢,而图像空间基线在更重噪声下失败。这种优势在数据集内的合成对照中被重复验证。特征分析表明,k空间阶段和图像空间阶段发挥了互补作用,频率域过滤集中在图像域病变定位之前。结论:基于k空间的深度学习在MRI欠采样和k空间噪声下提高了乳腺病变分割的鲁棒性,同时在完全采样下与图像空间方法相当。

英文摘要

Purpose: To assess whether breast lesion segmentation can be learned directly from acquired MRI k-space, and whether doing so improves robustness when data are accelerated or noisy. Materials and Methods: This retrospective study used public breast dynamic contrast-enhanced MRI (DCE-MRI) datasets with acquired and synthetic k-space, together with a within-dataset synthetic control. We compared four 3D U-Net variants: a hybrid k-space-to-image model, a native k-space model, and magnitude and complex image-space baselines. Models were evaluated under increasing undersampling and added complex Gaussian k-space noise. The primary outcome was patient-level Dice similarity coefficient under cross-validation, with the hybrid model prespecified as the main comparison against the magnitude image-space baseline. Results: At full sampling, the hybrid and image-space models performed similarly. As acceleration increased, the hybrid model retained substantially more segmentation accuracy and significantly outperformed the magnitude image-space baseline across moderate to high undersampling levels. The same pattern was observed when noise was added directly to k-space: the hybrid model degraded more slowly, whereas the image-space baseline failed under heavier noise. This advantage was reproduced in the within-dataset synthetic control. Feature analysis suggested that the k-space stage and image-space stage played complementary roles, with frequency-domain filtering concentrated before image-domain lesion localization. Conclusion: K-space-aware deep learning improves the robustness of breast lesion segmentation under MRI undersampling and k-space noise, while matching image-space methods at full sampling.
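The primary outcome above is a patient-level Dice similarity coefficient. As a rough illustration only, the sketch below computes Dice for binary masks and averages it over patients. The helper names `dice` and `patient_level_dice`, the flat 0/1 mask representation, and the empty-mask convention are assumptions made for this example; the study's exact aggregation is not described in the abstract.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * inter / total


def patient_level_dice(cases):
    """Mean of per-patient Dice scores; `cases` is a list of (pred, truth) pairs."""
    scores = [dice(p, t) for p, t in cases]
    return sum(scores) / len(scores)


# One overlapping voxel out of two predicted and two true voxels.
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

Averaging per patient, rather than pooling all voxels, keeps patients with large lesions from dominating the score.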

2605.22311 2026-05-22 cs.CV 版本更新

PIU: Proximity-guided Identity Unlearning in ID-Conditioned Diffusion Models

PIU:基于接近性的身份去学习在ID条件化的扩散模型中

Jose Edgar Hernandez Cancino Estrada, Mauro Díaz Lupone, Žiga Emeršič, Vitomir Štruc, Peter Peer, Darian Tomašević

发表机构 * University of Ljubljana, Faculty of Computer and Information Science(卢布尔雅那大学计算机与信息科学系) University of Ljubljana, Faculty of Electrical Engineering(卢布尔雅那大学电子工程系)

AI总结 本文研究了在ID条件化的扩散模型中身份去学习的问题,提出了一种基于接近性的身份去学习框架PIU,通过在学习的身份空间中重新分配源身份到选定的锚身份来实现身份移除,并结合基于ArcFace表示几何的锚点选择策略,通过局部微调少量身份敏感的交叉注意力层实现有效的去学习。

详情
AI中文摘要

身份条件化的扩散模型能够生成高质量且身份一致的面部图像,但它们也引发了严重的隐私问题,因为模型可能在个人被遗忘后仍继续合成个体。尽管机器去学习已被广泛研究用于概念和数据删除,但身份去学习仍鲜有探索,特别是在直接基于身份嵌入而非文本提示的模型中。在本文中,我们研究了Arc2Face,一个最先进的身份条件化的潜在扩散模型用于面部生成,并引入了基于接近性的身份去学习(PIU),一种锚点引导的身份去学习框架。具体而言,我们将身份移除建模为身份替换目标,该目标将源身份重新分配到学习身份空间中选定的锚身份,并补充了受ArcFace表示几何启发的基于接近性的锚点选择策略。我们进一步表明,通过局部微调少量身份敏感的交叉注意力层可以实现有效的去学习。在许多目标身份上的实验表明,我们的框架能够有效抑制目标身份的生成,同时保持保留身份的真实性和身份一致性,这通过改进的去学习和图像质量指标以及定性评估得到验证。PIU框架的源代码可在https://github.com/edgarcancinoe/piu_unlearning 公开获取。

英文摘要

Identity-conditioned diffusion models enable high-quality and identity-consistent face generation, but they also raise severe privacy concerns, as models may continue to synthesize individuals despite their right to be forgotten. While machine unlearning has been extensively studied for concept and data removal, identity unlearning remains largely unexplored, particularly in models conditioned directly on identity embeddings rather than text prompts. In this work, we study identity unlearning in Arc2Face, a state-of-the-art identity-conditioned latent diffusion model for face generation, and introduce Proximity-guided Identity Unlearning (PIU), an anchor-guided framework for identity unlearning. Specifically, we formulate identity removal as an identity replacement objective that reassigns the source identity to a selected anchor identity in the learned identity space, and we complement it with a proximity-based anchor selection strategy motivated by the geometry of ArcFace representations. We further show that effective unlearning can be achieved through localized fine-tuning of a small subset of identity-sensitive cross-attention layers. Experiments across many target identities show that our framework effectively suppresses generation of the target identity while preserving realism and identity consistency for retained identities, as validated by improved performance on unlearning and image-quality metrics, together with qualitative evaluation. The source code for the PIU framework is publicly available at https://github.com/edgarcancinoe/piu_unlearning .

2605.22290 2026-05-22 cs.CV 版本更新

Detection of Virus and Small Cell Patches in Foci Images Using Switchable Convolution and Feature Pyramid Networks

利用可切换卷积和特征金字塔网络在焦点图像中检测病毒和小细胞斑块

Amrita Singh, Snehasis Mukherjee

AI总结 本文提出了一种改进的YOLOv2检测器,结合特征金字塔网络和可切换空洞卷积机制,以提高在生物医学焦点图像中检测病毒斑块和小细胞斑块的性能,实验结果显示在不同IoU阈值下的mAP值显著提升。

详情
AI中文摘要

准确检测和计数焦点形成单位(FFU)图像中的病毒斑块对于量化病毒感染和分析细胞结构至关重要。这项任务具有挑战性,因为生物医学目标在大小、密度、对比度和形状上往往差异显著。本文提出了一种增强的YOLOv2检测器,集成了特征金字塔网络(FPN)以提高多尺度特征表示。我们还引入了可切换空洞卷积机制,以适应密集显微图像中细粒度目标的接收域。所提出的方法在生物医学焦点图像数据集上进行评估,用于病毒斑块和小细胞斑块的检测。对于小细胞斑块检测,模型在25%的交并比(IoU)阈值下达到40.5%的平均精度均值(mAP)。对于FFU病毒斑块检测,模型达到68%的mAP。这些结果表明,结合FPN特征融合与可切换卷积能够提高YOLOv2在专门生物医学目标检测任务中的适用性。

英文摘要

Accurate detection and counting of virus patches in focus-forming unit (FFU) images, also known as foci images, are important for quantifying viral infection and analyzing cellular structures. This task is challenging because biomedical targets often vary substantially in size, density, contrast, and shape. In this paper, we propose an enhanced YOLOv2-based detector that integrates a Feature Pyramid Network (FPN) to improve multi-scale feature representation. We also incorporate a switchable atrous convolution mechanism to adapt the receptive field for fine-grained targets in dense microscopy images. The proposed method is evaluated on biomedical foci image datasets for virus patch and small cell patch detection. For small cell patch detection, the model achieves a mean average precision (mAP) of 40.5% at a 25% Intersection over Union (IoU) threshold. For FFU virus patch detection, the model achieves an mAP of 68%. These results indicate that combining FPN-based feature fusion with switchable convolution improves the suitability of YOLOv2 for specialized biomedical object detection tasks.
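The 25% IoU threshold mentioned above decides when a predicted box counts as a correct detection. The sketch below shows that matching step for axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the names `box_iou` and `is_true_positive` are assumptions for illustration; the full mAP computation (ranking by confidence, precision-recall integration) is not shown.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, truth, threshold=0.25):
    """A prediction matches a ground-truth box when IoU reaches the threshold."""
    return box_iou(pred, truth) >= threshold


# Overlap of one third: accepted at a 0.25 threshold, rejected at 0.5.
print(is_true_positive((0, 0, 2, 2), (0, 1, 2, 3)))
print(is_true_positive((0, 0, 2, 2), (0, 1, 2, 3), threshold=0.5))
```

A loose threshold such as 0.25 is often chosen for very small objects, where a shift of a few pixels changes IoU sharply.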

2605.22273 2026-05-22 cs.CV 版本更新

Exposing Vulnerabilities in Visible-Infrared VLMs: A Unified Geometric Adversarial Framework with Cross-Task Transferability

揭示可见-红外VLMs中的漏洞:一种具有跨任务迁移性的统一几何对抗框架

Xiang Chen, Yuxian Dong, Chao Li, Chengyin Hu, Jiaju Han, Fengyu Zhang, Yiwei Wei, Jiahuan Long, Jiujiang Guo

AI总结 本文针对可见-红外视觉语言模型在多模态任务中的对抗鲁棒性不足问题,提出了一种基于分形几何的对抗框架CFGPatch,通过引入曲边分形元素和Fraser螺旋渲染机制,有效攻击VLMs并展示出跨任务迁移能力。

详情
AI中文摘要

视觉语言模型(VLMs)在多样化的多模态任务中实现了强大的性能,但其在可见-红外(VIS-IR)场景中的对抗鲁棒性仍未得到充分探索。这一空白十分关键,因为VIS-IR感知被广泛用于现实世界的感知系统,以在恶劣成像条件下支持可靠的理解。为了解决这种跨模态威胁设置,我们提出了CFGPatch,一种曲边分形几何对抗补丁框架,用于攻击VIS-IR VLMs。CFGPatch基于三角分形几何,用贝塞尔曲线元素替代刚性的直边元素,在保持多尺度分形自相似性的同时引入更平滑的轮廓、更丰富的方向变化和更灵活的形状变形。此外,我们设计了模态特定的Fraser螺旋渲染机制,以在可见和红外图像中注入细粒度纹理扭曲和误导性感知线索。通过将全局曲边分形几何与局部基于螺旋的外观干扰相结合,CFGPatch同时破坏了形状感知和纹理解释。我们进一步采用变换期望(EOT)以提高对常见图像级变换的鲁棒性。大量实验表明,CFGPatch能够有效欺骗VIS-IR VLMs,并在攻击效果和鲁棒性上均优于标准补丁基线。此外,针对零样本分类优化的对抗样本在图像描述和视觉问答任务中表现出良好的迁移能力,展示了在下游任务中的强大跨任务迁移性和泛化能力。

英文摘要

Vision-language models (VLMs) have achieved strong performance across diverse multimodal tasks, but their adversarial robustness in visible-infrared (VIS-IR) scenarios remains underexplored. This gap is critical because VIS-IR sensing is widely used in real-world perception systems to support reliable understanding under challenging imaging conditions. To address this cross-modal threat setting, we propose CFGPatch, a curved-edge fractal geometric adversarial patch framework for attacking VIS-IR VLMs. CFGPatch builds on triangular fractal geometry and replaces rigid straight-edged primitives with Bezier-curved elements, preserving multi-scale fractal self-similarity while introducing smoother contours, richer directional variation, and more flexible shape deformation. In addition, we design a modality-specific Fraser-spiral rendering mechanism to inject fine-grained texture distortions and misleading perceptual cues into visible and infrared images. By coupling global curved-fractal geometry with local spiral-based appearance interference, CFGPatch disrupts both shape perception and texture interpretation. We further adopt expectation over transformation (EOT) to improve robustness against common image-level transformations. Extensive experiments show that CFGPatch effectively fools VIS-IR VLMs and consistently outperforms standard patch baselines in attack effectiveness and robustness. Moreover, adversarial samples optimized for zero-shot classification transfer well to image captioning and visual question answering, demonstrating strong cross-task transferability and generalizability across downstream tasks.

2605.22269 2026-05-22 cs.CV cs.AI cs.MM 版本更新

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

MuKV:多粒度KV缓存压缩用于长流视频问答

Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出MuKV,一种多粒度KV缓存压缩方法,通过半分层检索方法提升长流视频问答的效率和准确性,实验表明其在答案准确率、内存使用和在线问答效率方面均优于基线方法。

Comments To appear at CVPR'26. Code is available at https://github.com/IMBALDY/MuKV

详情
AI中文摘要

长流视频问答仍面临挑战,由于视觉token数量增加和大语言模型(LLM)推理长度有限。KV缓存通过LLM预填充存储历史token的Key-Value(KV),从而实现更高效的流式问答。然而,现有方法缓存每个或每两个帧,导致内存使用冗余并丢失帧内或跨帧的细粒度空间细节。本文提出MuKV,一种具有多粒度KV缓存压缩模块和半分层检索方法的方法,以提高长流视频问答的效率和准确性。对于离线KV缓存,MuKV在patch、frame和segment级别提取视觉表示。多个粒度层次保留了局部线索和全局时间上下文,同时通过自注意力和频率引导的双信号token压缩机制保持效率。对于在线问答,MuKV设计了一种半分层检索方法以检索相关KV缓存用于答案生成。在长流视频问答基准测试中,MuKV显著提高了答案准确率,而无需牺牲内存和在线问答效率。此外,我们的压缩机制本身在答案准确率、内存和问答效率方面均对基线方法带来了持续的改进,展示了高度有效的贡献。

英文摘要

Long streaming video QA remains challenging due to growing visual tokens and limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) of the historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frame or temporal contexts across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at patch-, frame-, and segment-levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression mechanism guided by self-attention and frequency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long-streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy, without sacrificing memory and online QA efficiency. Moreover, our compression mechanism alone brings consistent benefits across answer accuracy, memory, and QA efficiency over baselines, showcasing highly effective contribution.

2605.22268 2026-05-22 cs.NI cs.AI cs.CV 版本更新

Impact of Atmospheric Turbulence and Pointing Error on Earth Observation

大气湍流和指向误差对地球观测的影响

Celia Sánchez-de-Miguel, Antonio M. Mercado-Martínez, Beatriz Soret, Antonio Jurado-Navas, Miguel Castillo-Vázquez

发表机构 * TELMA, University of Malaga(TELMA,马拉加大学)

AI总结 本文研究了大气湍流和指向误差对地球观测图像的影响,提出了一种增强的图像模拟器来生成物理真实的失真图像,并通过案例研究评估了YOLOv8和RetinaNet在不同湍流和指向误差条件下的性能。

Comments Conference

详情
AI中文摘要

地球观测(EO)图像常常受到大气湍流和指向抖动的退化;然而,这些效应很少被考虑在用于训练基于AI的检测模型的数据集中。基于先前的工作,本文提出了一种增强的图像模拟器,能够将垂直路径的大气湍流和卫星指向抖动(源于平台和传感器振动)纳入其中,以生成物理上真实的失真图像。作为案例研究,使用YOLOv8和RetinaNet在由所提出模拟器生成的图像上评估船舶检测,结果表明,在理想条件下,YOLOv8的召回率从91%下降到弱湍流存在时的60%,在强湍流或抖动下低于40%。相比之下,RetinaNet表现出更大的鲁棒性,在退化条件下保持约75%的召回率。这些结果突显了在EO训练数据集中纳入真实物理退化的重要性,以确保AI模型在操作环境中的可靠性能,如在海上监控应用中所展示的那样。

英文摘要

Earth Observation (EO) imagery is often degraded by atmospheric turbulence and pointing jitter; yet, these effects are rarely considered in datasets used to train AI-based detection models. Based on prior work, this paper presents an enhanced image simulator that enables the incorporation of vertical-path atmospheric turbulence and satellite pointing jitter, arising from platform and sensor vibrations, to generate physically realistic distorted images. As a case study, vessel detection is evaluated using YOLOv8 and RetinaNet on images generated by the proposed simulator under different levels of turbulence and pointing errors. Results show that YOLOv8 recall decreases from 91% under ideal conditions to 60% in the presence of weak turbulence, and falls below 40% under strong turbulence or jitter. In contrast, RetinaNet demonstrates greater robustness, maintaining approximately 75% recall across degraded conditions. These results highlight the importance of incorporating realistic physical degradations into EO training datasets to ensure reliable performance of AI-based models in operational environments, as demonstrated in maritime surveillance applications.

2605.22259 2026-05-22 cs.LG cs.CV cs.RO 版本更新

An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion

基于OSINT辅助异质传感器融合的贝叶斯目标分类证据层级

Jan Nausner, Michael Hubner

发表机构 * Center for Digital Safety & Security, Austrian Institute of Technology GmbH (AIT)(数字安全与安保中心,奥地利技术研究院(AIT))

AI总结 本文提出了一种基于OSINT辅助的异质传感器融合方法,通过建立新的证据层级模型,结合上下文信息和领域知识,提升对CBRNE威胁的分类准确率,实验结果表明该方法在抗干扰和先验不匹配方面具有优势,分类准确率高达95%。

Comments 6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review

详情
AI中文摘要

异质传感器融合对于检测、定位和分类CBRNE威胁至关重要。然而,单独的传感器通常只能检测相关威胁的子集,其可靠性各异,甚至只能提供间接威胁指示,使威胁分类变得困难。此外,传感器侧的高杂波率对融合系统提出了巨大挑战。此外,高质量数据集的有限供应阻碍了智能传感器中基于学习的检测和分类模型的发展。为缓解这些传感器相关缺点,提出了一种上下文感知和领域知识增强的融合过程。首先,建立了一个新的证据层级,能够建模直接、指示性和上下文信息。其次,通过收集、处理和利用OSINT输入,将环境上下文信息引入融合过程。第三,利用证据层级的所有级别,构建一个结合领域知识的贝叶斯威胁类型分类机制。所提出的方法在模拟场景中进行了评估,结果表明该融合方法在抗杂波和先验不匹配方面具有优势,总体分类准确率高达95%。

英文摘要

Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability or can even provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process, by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
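To make the idea of Bayesian threat-type classification with informed priors concrete, here is a minimal sketch that fuses several pieces of evidence into a posterior over classes. It assumes the evidence sources are conditionally independent given the class (a naive-Bayes simplification), and the class names, prior values, and likelihood values are invented for illustration. It is not the authors' evidence hierarchy, whose details are not given in the abstract.

```python
def posterior(prior, likelihoods):
    """Posterior over classes from a prior and independent evidence likelihoods.

    prior:       dict mapping class -> P(class)
    likelihoods: list of dicts, each mapping class -> P(observation | class)
    """
    unnorm = dict(prior)
    for lik in likelihoods:
        for cls in unnorm:
            unnorm[cls] *= lik[cls]
    z = sum(unnorm.values())
    if z == 0:
        raise ValueError("evidence is impossible under every class")
    return {cls: value / z for cls, value in unnorm.items()}


# Hypothetical numbers: a context-informed prior, one direct sensor
# report, and one contextual (for example OSINT-derived) cue.
prior = {"chemical": 0.2, "radiological": 0.1, "none": 0.7}
direct = {"chemical": 0.9, "radiological": 0.1, "none": 0.05}
context = {"chemical": 0.6, "radiological": 0.3, "none": 0.4}

post = posterior(prior, [direct, context])
print(max(post, key=post.get))
```

In this toy setup the prior favors "none", but the direct sensor evidence is strong enough to move most of the posterior mass to "chemical". This shows how a mismatched prior can be overridden when the evidence is informative.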

2605.22249 2026-05-22 cs.CV 版本更新

D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

D3Seg: 依赖感知的扩散模型用于缺失模态的脑肿瘤分割

Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

发表机构 * The University of Western Australia(西澳大学) The University of Melbourne(墨尔本大学)

AI总结 本文提出D3Seg模型,通过多跳模态图融合、轻量扩散插补机制和概率空间决策细化,解决缺失MRI模态下的脑肿瘤分割问题,提升分割性能并保持计算效率。

详情
AI中文摘要

使用多参数MRI进行准确的脑肿瘤分割对于有效的治疗计划至关重要。然而,在临床环境中,完整获取所有MRI序列并不总是可能。某些MRI模态的缺失会导致现有分割方法性能显著下降,这些方法通常依赖于朴素的特征拼接或直接融合策略。为了解决这一限制,我们提出了一种新的分割模型D3Seg,其设计旨在在缺失模态设置下保持稳定的性能。D3Seg引入了多跳模态图融合(MMGF)来建模更高阶的跨模态依赖关系,一种轻量级的扩散基插补机制来补偿潜在空间中缺失的T1ce表示,并在概率空间中进行决策细化以缓解主导类的过度自信并改进低表示肿瘤亚区域的界定。在BraTS 2023数据集上的广泛评估表明,我们的D3Seg模型在缺失模态配置下持续改善了分割性能。所提出的模型在多个缺失模态配置中相比当前最先进的模型,在增强肿瘤(ET)方面实现了约1.5-2.0%的Dice改进,在肿瘤核心(TC)方面实现了约1.0%的改进,同时保持了计算效率。

英文摘要

Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model D3Seg which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.

2605.22231 2026-05-22 cs.CV 版本更新

REACH: Hand Pose Estimation from Room Corners

REACH:从房间角落估计手部姿态

Shu Nakamura, Ryo Kawahara, Genki Kinoshita, Ryosuke Hirai, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino

发表机构 * Graduate School of Informatics, Kyoto University(京都大学信息学研究科) RIKEN(理化学研究所) Kyoto Institute of Technology(京都工业大学)

AI总结 本文提出了一种新的3D手部姿态估计器,能够从远处(通常是从房间角落的固定摄像头)在极低分辨率且频繁遮挡的视图中准确恢复人的手部形状和姿态。核心方法是充分利用手部与身体的协调性、时间序列变化以及多视角观测,通过一种新的基于Transformer的模型实现,利用视图令牌之间的相关性建模手部和身体的配置,并以自回归方式利用时间协调性。同时引入了一个名为REACH的新型数据集,用于训练和测试方法。REACH是首个大规模的手部姿态数据集,记录了50名参与者在多种日常活动中的准确手部运动。通过大量实验,包括与现有方法的比较研究,证明了我们的模型REACH-Net在远距离3D手部姿态估计上取得了高度准确的结果。这些结果拓展了3D手部姿态估计的视野,尤其在“野外”连续人类行为分析方面。

详情
AI中文摘要

我们介绍了一种新颖的3D手部姿态估计器,能够从远处(通常是从房间角落的固定摄像头)在极低分辨率且频繁遮挡的视图中准确恢复人的手部形状和姿态。我们的核心思想是充分利用手部与身体的协调性、其时间序列变化以及多视角观测。我们通过一种新的基于Transformer的模型实现这一目标,其中手部和身体的配置通过其视觉特征之间的相关性建模为每视角令牌,其时间协调性则以自回归方式利用。我们引入了一个新的数据集,称为REACH,即带有胸部摄像头注释的房间环境数据集,用于训练和测试我们的方法。REACH是首个大规模的手部姿态数据集,记录了50名参与者在广泛日常活动中的准确手部运动。为了在标注手部准确形状和姿态时避免干扰自然运动,我们利用隐藏的胸部摄像头。通过广泛的实验,包括与现有方法的比较研究,我们证明了我们的模型REACH-Net在远距离3D手部姿态估计上取得了高度准确的结果。这些结果拓展了3D手部姿态估计的视野,尤其在“野外”连续人类行为分析方面。

英文摘要

We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people's hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards "in-the-wild" continuous human behavior analysis.

2605.22209 2026-05-22 cs.CV 版本更新

GALAR-TemporalNet v2: Anatomy-Guided Dual-Branch Temporal Classification with Bidirectional Mamba and Dual-Graph GCN for Video Capsule Endoscopy -- after competition results

GALAR-TemporalNet v2: 基于解剖引导的双分支时间分类方法,结合双向Mamba和双图GCN用于视频胶囊内镜

Jiye Won, Seangmin Lee, Soon Ki Jung

发表机构 * School of Computer Science and Engineering, Kyungpook National University(韩国庆北国立大学计算机科学与工程学院)

AI总结 该研究针对视频胶囊内镜中同时定位8个解剖区域和检测9种病理发现的多标签时间分类问题,提出GALAR-TemporalNet v2模型,通过结合窗口自注意力、双图GCN和双向Mamba解决类别不平衡、长程时间依赖和病理-解剖纠缠问题,最终在RARE-VISION测试集上取得更高的mAP指标。

Comments 7 pages, 2 figures. Post-competition preprint for the ICPR 2026 RARE-VISION Challenge

详情
AI中文摘要

视频胶囊内镜(VCE)提出了具有挑战性的多标签时间分类问题,要求在数万帧中同时定位8个解剖区域并检测9种病理发现。我们提出了GALAR-TemporalNet v2,一种分层时间模型,旨在解决三个核心挑战:极端类别不平衡、长程时间依赖性和病理-解剖纠缠。我们的架构结合了窗口自注意力进行局部建模,双图GCN用于全局帧关系,以及双向Mamba用于选择性边界上下文编码。新颖的解剖原型残差路径将病理偏差信号与正常器官外观分离,帧级GCN跳跃连接稳定了视觉上易混淆的稀有类别的训练。竞赛版本的GALAR-TemporalNet在RARE-VISION测试集上实现了整体mAP@0.5为0.2644和mAP@0.95为0.2353。在竞赛后,重新设计的GALAR-TemporalNet v2,结合了重构的病理分支、优化的损失函数和扩展的后处理,将这些结果提升到mAP@0.5为0.3409和mAP@0.95为0.3333。

英文摘要

Video Capsule Endoscopy (VCE) poses a challenging multi-label temporal classification problem, requiring simultaneous localization of 8 anatomical regions and detection of 9 pathological findings across tens of thousands of frames. We present GALAR-TemporalNet v2, a hierarchical temporal model that addresses three core challenges: extreme class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement. Our architecture combines windowed self-attention for local modeling, a Dual-Graph GCN for global frame relationships, and Bidirectional Mamba for selective boundary context encoding. A novel anatomy prototype residual pathway decouples pathological deviation signals from normal organ appearance, and a frame-level GCN skip connection stabilizes training of visually confusable rare classes. The competition version, GALAR-TemporalNet, achieved an overall mAP@0.5 of 0.2644 and mAP@0.95 of 0.2353 on the RARE-VISION test set. Following the competition, the redesigned GALAR-TemporalNet v2, which incorporates a restructured pathology branch, refined loss functions, and extended post-processing, improved these results to mAP@0.5 of 0.3409 and mAP@0.95 of 0.3333.

2605.22201 2026-05-22 cs.CV 版本更新

Zero-Shot Temporal Action Localization Through Textual Guidance

通过文本指导实现零样本时间动作定位

Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci

发表机构 * University of Trento(特伦托大学) Fondazione Bruno Kessler(布鲁诺·凯瑟勒基金会)

AI总结 本文提出TEGU方法,通过利用大规模语言模型和结构化文本提取的丰富文本信息,解决零样本时间动作定位中因缺乏训练监督导致的细粒度动作分类困难问题,实验表明该方法在THUMOS14和ActivityNet-v1.3数据集上优于现有方法。

Comments Accepted to FG 2026

详情
AI中文摘要

零样本时间动作定位(ZS-TAL)涉及在未修剪视频中对动作进行分类和定位,其中动作类别在训练时是未见过的。现有工作利用视觉语言模型(VLMs),借助其强大的零样本迁移能力。然而,这些模型在细粒度动作分类上面临明显挑战,难以直接用于区分动作存在与否。大多数当前ZS-TAL方法通过在大规模视频数据集上训练模型来解决这些问题,这需要标注数据且通常导致泛化性能有限。最近,不使用标注数据的方法出现了作为替代方案。沿着这一方向,我们提出了一种新的方法,即“视频中动作更精细定位的文本指导”(TEGU),通过利用大规模语言模型和从描述中提取的结构化文本所衍生的丰富文本信息,弥补训练数据缺乏监督的不足。这种额外的语境信息可以通过提供更丰富的视频内细粒度动作差异的线索,提高细粒度辨别能力。我们通过在THUMOS14和ActivityNet-v1.3数据集上进行实验验证所提出方法的有效性。我们的结果表明,通过利用丰富的文本信息来改进动作定位,TEGU在不涉及训练的最先进ZS-TAL方法上表现更优。

英文摘要

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

2605.22200 2026-05-22 cs.CV cs.AI cs.LG 版本更新

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

OSS: 2024-2025 开放缝合技能基于视觉的评估挑战

Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d'Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi

发表机构 * Department of Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC) Dresden(转化外科肿瘤学部,肿瘤疾病国家中心(NCT/UCC)德累斯顿) The Centre for Tactile Internet with Human-in-the-Loop (CeTI), TUD Dresden University of Technology(具有人环路触觉互联网中心(CeTI),德累斯顿技术大学) Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen(口腔和颌面外科部,亚琛大学医院) Center for Tooth-, Mouth- and Jaw Medicine, University Göttingen(牙科、口科和颌科医学中心,哥廷根大学) Institute of Medical Informatics, University Hospital RWTH Aachen(医学信息学研究所,亚琛大学医院) Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology(医学系和卡尔·古斯塔夫·卡鲁斯大学医院,德累斯顿技术大学) German Cancer Research Center (DKFZ)(德国癌症研究中心(DKFZ)) Muroran Institute of Technology(室兰工业大学) Niigata University of Health and Welfare(新潟医疗福祉大学) Konica Minolta, Inc.(柯尼卡美能达公司) Jmees, Inc.(Jmees公司) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(计算机科学与工程部,香港科技大学) Center Algoritmi/LASI, University of Minho(Algoritmi中心/LASI,米尼奥大学) Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho(生命与健康科学研究院(ICVS),医学院,米尼奥大学) ICVS/3B's - PT Government Associate Laboratory(ICVS/3B's - PT政府附属实验室) Institute for AI in Medicine (IKIM), University Medicine Essen(医学人工智能研究所(IKIM),埃森大学医学部) The Faculty of Data and Decisions Science, Technion - Israel Institute of Technology(数据与决策科学系,以色列理工学院(Technion)) UCL Hawkes Institute, University College London(UCL Hawkes研究所,伦敦大学学院) School of Computing, Queen's University(计算学院,女王大学) Department of Transdisciplinary Medicine, Seoul National University Hospital(跨学科医学部,首尔国立大学医院) Interdisciplinary Program in Medical Informatics, Seoul National University(医学信息学跨学科项目,首尔国立大学) Department of Clinical Medical Sciences, Seoul National University(临床医学科学部,首尔国立大学) Institute of Convergence Medicine with Innovative Technology, Seoul National University Hospital(融合医学与创新技术研究所,首尔国立大学医院) Department of Surgery, Seoul National University College of Medicine and Seoul National University Hospital(外科部,首尔国立大学医学院和首尔国立大学医院)

AI总结 本文提出OSS挑战,旨在通过基于视觉的评估方法提升开放手术技能训练,通过挑战数据集和多任务评估,评估不同方法在开放手术技能评估中的表现,揭示视频评估的潜力与限制。

Comments Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA

详情
AI中文摘要

通过有效的训练实现高水平的外科技能对于最佳的患者结果至关重要。自动化、数据驱动的技能评估有潜力改善外科训练。尽管基于机器学习的方法在微创手术技能评估中越来越受欢迎,但其在开放手术中的应用仍然有限。我们提出了一个专门的MICCAI挑战,旨在基准测试和推进开放手术中的基于视觉的技能评估。挑战数据集包含在干实验室环境中用静态GoPro相机记录的开放缝合训练任务视频,除了主要视频模态外,还包含仪器轨迹数据。OSS挑战连续两年举办,分别包含两个和三个独立任务:(1) 将技能水平分类为四个类别,(2) 预测涵盖八个类别的完整客观结构化评估技术技能分数,(3) 跟踪手部和手术工具。参与者提交了多种解决方案,包括基于深度学习的视频模型、跟踪驱动的方法和混合方法。通用的空间时间视频模型始终实现了最强的性能,尽管概念上多样的方法在执行良好的情况下也能达到竞争水平。预测细粒度的OSATS分数仍然具有挑战性,但受益于增加的训练数据。关键点跟踪由于频繁的遮挡和出帧实例而变得困难,限制了当前基于运动的技能分析的应用。这项工作评估了创新和多样的解决方案,突显了基于视频的评估在开放手术中的潜力和当前限制,并识别了推进自动化技能评估向临床影响发展的关键方向。

英文摘要

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

2605.22192 2026-05-22 cs.CV 版本更新

Ultra-High-Definition Image Quality Assessment via Graph Representation Learning

通过图表示学习实现超高清图像质量评估

Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun

发表机构 * 1 School of Information Communication Engineering, Communication University of China, Beijing 100024, China 2 College of Engineering, Northeastern University, Silicon Valley, San Jose, CA 95113, USA 3 JancsiLab, JancsiTech, Hong Kong 999077, China 4 Center of Information & Network Technology, Beijing Normal University, Beijing 100875, China

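AI summary This paper proposes UHD-GCN-BIQA, a graph representation learning framework that improves blind quality assessment of ultra-high-definition images by explicitly modeling the structural dependencies among sampled image regions, giving efficient, high-quality prediction of image quality.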

详情
AI中文摘要

盲图像质量评估(BIQA)对于超高清(UHD)图像仍具挑战性,因为原分辨率推理计算成本高,而强制缩放或孤立裁剪可能抑制尺度敏感的失真并削弱局部瑕疵与全局场景上下文之间的关系。本文旨在通过显式建模采样图像区域之间的结构依赖关系来改进UHD-BIQA,而不是将它们视为独立视图。所提出的图表示学习框架UHD-GCN-BIQA从每个UHD图像中采样长宽比对齐的块,将它们编码为图节点,并利用空间接近性和特征相似性构建混合k-最近邻图。残差图卷积用于在区域间传播上下文信息,门控注意力池化将块级证据聚合为图像级质量预测。采用指数移动平均归一化的多目标损失函数以稳定回归、相关性和排序目标的联合优化。在UHD-IQA基准测试中,UHD-GCN-BIQA实现了PLCC=0.7784,SRCC=0.8019,RMSE=0.0519,取得了与比较方法相竞争的相关性性能和最低的RMSE。这些结果表明,基于图的区域关系建模对UHD图像质量评估是有效的,特别是在高分辨率视觉内容下提高绝对质量评分估计。

英文摘要

Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions rather than treating them as independent views, and proposes a graph representation learning framework, UHD-GCN-BIQA. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. A multi-objective loss function normalized by an exponential moving average is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
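The hybrid k-nearest-neighbor graph mentioned above can be illustrated with a minimal sketch. The blending weight `alpha`, the cosine feature distance, and the function name `hybrid_knn_graph` are assumptions made for illustration; the abstract does not specify how spatial proximity and feature similarity are combined.

```python
import math

def hybrid_knn_graph(positions, features, k=2, alpha=0.5):
    """Return {node: [neighbor indices]} from a blended spatial/feature distance.

    positions: list of (x, y) patch centers, assumed normalized to [0, 1]
    features:  list of equal-length patch feature vectors
    alpha:     hypothetical weight on the spatial term (1 - alpha on features)
    """
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (na * nb + 1e-12)

    n = len(positions)
    graph = {}
    for i in range(n):
        scored = []
        for j in range(n):
            if i == j:
                continue  # no self-loops in this sketch
            spatial = math.dist(positions[i], positions[j])
            semantic = cosine_distance(features[i], features[j])
            scored.append((alpha * spatial + (1.0 - alpha) * semantic, j))
        scored.sort()
        graph[i] = [j for _, j in scored[:k]]  # k closest nodes
    return graph
```

Each node's neighbor list could then drive message passing in a graph convolution layer; a single blended distance is only one of several plausible ways to build a hybrid graph.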

2605.22190 2026-05-22 cs.CV 版本更新

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

无需姿态,无问题:从未姿态多视角视频中馈送动态高斯

Matteo Balice, Yanik Kunzi, Chenyangguang Zhang, Matteo Matteucci, Marc Pollefeys, Sungwhan Hong

发表机构 * Politecnico di Milano(米兰理工大学) ETH Zürich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心)

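AI summary This paper proposes NoPo4D, the first pose-free feed-forward system that jointly handles dynamic content, multi-view input, and unknown camera poses. It improves performance through a velocity decomposition and a bidirectional motion encoder, and outperforms existing methods.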

Comments https://bralani.github.io/nopo4d_html/

详情
AI中文摘要

近期的馈送式3D高斯散射方法在3D场景重建的单个方面取得了显著进展,但现有方法无法在单次馈送过程中同时处理动态内容、多视角输入和未知相机姿态。处理动态的 方法要么需要准确的相机姿态,要么只能接受单目输入;无姿态多视角方法仅能处理静态场景;而每场景优化方法在填补这些差距时,每场景的成本为分钟到小时。我们引入NoPo4D,首个馈送式系统,通过预训练的几何骨干网络和最近的4D高斯框架,引入速度分解,将高斯运动分解为每个像素图像平面位移和深度变化,从而可以直接从伪地面真实光流获得2D组件的监督。这规避了可微渲染将先验姿态方法与姿态准确性耦合以及先验无姿态方法所需的3D运动地面真实。系统还通过双向运动编码实现跨视角和跨帧特征聚合,以及视图依赖的不透明度,以缓解跨视角和跨时间步的高斯错位。在四个多视角动态基准上,NoPo4D一致优于现有馈送式基线,并通过可选后优化阶段超越每场景优化方法,同时运行速度快十倍。

英文摘要

Recent feed-forward 3D Gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.
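To make the velocity decomposition concrete, the sketch below recombines a per-pixel image-plane shift and a depth change into a 3D displacement in the camera frame. It assumes a standard pinhole camera with intrinsics `(fx, fy, cx, cy)`; the function names are hypothetical and the paper's exact parameterization may differ.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) at the given depth to a 3D camera-frame point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def compose_velocity(u, v, depth, shift_uv, depth_change, intrinsics):
    """Combine a 2D image-plane shift and a depth change into a 3D displacement.

    shift_uv:     (du, dv) in pixels, the component an optical-flow signal can supervise
    depth_change: change in depth along the camera's z axis
    """
    fx, fy, cx, cy = intrinsics
    start = unproject(u, v, depth, fx, fy, cx, cy)
    end = unproject(u + shift_uv[0], v + shift_uv[1], depth + depth_change,
                    fx, fy, cx, cy)
    return tuple(e - s for s, e in zip(start, end))
```

With this split, the `(du, dv)` part can be compared directly against a 2D flow estimate, while the depth change carries the remaining degree of freedom.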

2605.22186 2026-05-22 cs.CV 版本更新

Event-Illumination Collaborative Low-light Image Enhancement with a High-resolution Real-world Dataset

事件-照明协同低光照图像增强与高分辨率现实数据集

Senyan Xu, Zhijing Sun, Kean Liu, Xin Lu, Ruixuan Jiang, Mingyang Huang, Xueyang Fu, Zheng-Jun Zha

发表机构 * University of Science and Technology of China(中国科学技术大学)

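AI summary This paper proposes the EIC-LIE framework, which uses an event-illumination collaborative interaction module and an illumination-aware event filter to address the neglect of global illumination and the noise sensitivity of event signals in event-based low-light image enhancement. It also builds the first high-resolution, real-world event-based dataset for this task, and experiments show that it outperforms existing methods on multiple datasets.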

详情
AI中文摘要

事件基于低光照图像增强(LIE)方法主要关注整合高动态范围(HDR)信息,而忽视图像中的全局照明和现实场景中事件信号的固有噪声敏感性。为解决这些问题,我们提出EIC-LIE,一种事件-照明协同LIE框架。具体而言,我们首先设计了一个事件-照明协同交互(EICI)模块,包含两个关键过程:前向收集,用于在不同光照条件下收集HDR特征,以及后向注入,为照明和事件表示提供互补内容。接下来,我们引入了一个照明感知事件滤波器(IAEF),根据图像导出的亮度统计动态减少事件噪声。此外,我们构建了一个基于光束分割器的混合成像系统,以从动态场景中收集高质量的事件-图像对,实现时间同步,提供了首个高分辨率、现实的事件基LIE数据集。广泛的实验表明,我们的EIC-LIE在五个现实和合成数据集上优于现有方法,显著超越了以前的方法,在PSNR上提高了1.24dB,在SSIM上提高了0.069。代码和数据集已发布在https://github.com/QUEAHREN/EIC-LIE。

英文摘要

Event-based low-light image enhancement (LIE) methods mainly focus on incorporating high dynamic range (HDR) information from events while overlooking the essential global illumination in images and the inherent noise sensitivity of event signals in real-world scenarios. To address these issues, we propose EIC-LIE, an event-illumination collaborative LIE framework. Concretely, we first design an Event-Illumination Collaborative Interaction (EICI) module, which contains two key processes: forward gathering, which gathers HDR features across varying lighting conditions, and backward injection, which provides complementary content for illumination and event representations. Next, we introduce an Illumination-aware Event Filter (IAEF) that dynamically reduces event noise based on brightness statistics derived from images. Additionally, we build a beam-splitter-based hybrid imaging system to collect high-quality event-image pairs with temporal synchronization from dynamic scenes, providing the first high-resolution, real-world event-based LIE dataset. Extensive experiments show that EIC-LIE outperforms state-of-the-art methods on five real-world and synthetic datasets, with improvements of up to 1.24 dB in PSNR and 0.069 in SSIM. The code and dataset are released at https://github.com/QUEAHREN/EIC-LIE.

2605.22185 2026-05-22 cs.CV cs.LG 版本更新

Enhancing Multimodal Large Language Models for Safety-Critical Driving Video Analysis

增强多模态大语言模型以用于安全关键驾驶视频分析

Tomaso Trinci, Henrique Piñeiro Monteagudo, Leonardo Taccari

发表机构 * Verizon Connect

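AI summary This study improves the perception and reasoning of multimodal large language models in safety-critical driving scenarios by fusing downsampled video frames with synchronized high-frequency telematics data and semantic information from specialized computer vision models, enabling more accurate identification and description of safety-critical events in real-world driving.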

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026)

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在一般视觉理解方面展现了出色的性能。然而,其在安全关键驾驶场景中的应用受限于无法准确感知和推理罕见高风险动态事件(如碰撞或接近碰撞)的能力。为此,我们提出了一种增强MLLM感知能力的流程,通过融合降采样视频帧与同步高频telematics数据(IMU和GPS)以及专用计算机视觉模型的语义信息生成高质量的伪标签,包括描述性标题和问答对,专门用于训练MLLM识别和描述现实驾驶中的安全关键事件(SCEs)。我们通过微调开源QwenVL-2.5模型并使用DoRA适配器展示了该方法的有效性:实验表明在少于50M可训练参数和有限计算预算下,显著提高了识别和解释安全关键事件的能力。

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding. However, their application to safety-critical driving scenarios remains limited by an inability to accurately perceive and reason about rare high-stakes dynamic events, such as collisions or near-collisions. To address this, we introduce a pipeline that enhances MLLM perception by fusing downsampled video frames with synchronized high-frequency telematics data (IMU and GPS) and semantic insights from specialized computer vision models. Our pipeline generates high-quality pseudo-labels, including descriptive captions and question-answer pairs, specifically designed to train MLLMs to identify and describe Safety-Critical Events (SCEs) in real-world driving footage. We show the effectiveness of our approach by fine-tuning the open-source QwenVL-2.5 model via DoRA adapters: our experiments demonstrate significant improvements in identifying and explaining safety-critical events, with fewer than 50M trainable parameters and a limited computational budget.

2605.22169 2026-05-22 cs.CV 版本更新

Balancing Uncertainty and Diversity of Samples: Leveraging Diversity of Least, High Confidence Samples for Effective Active Learning

平衡不确定性和样本多样性:利用低、高置信度样本的多样性进行有效的主动学习

Vipul Arya, S. H. Shabbeer Basha, Srikrishna U N, Sunainha Vijay, Snehasis Mukherjee

发表机构 * School of Computer Science and Engineering, RV University(计算机科学与工程学院,RV大学) School of Engineering & Technology, Vidyashilp University(工程与技术学院,Vidyashilp大学) Shiv Nadar Institution of Eminence(Shiv Nadar卓越研究院)

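AI summary This paper proposes new hybrid sampling methods that select both easy and hard samples while accounting for diversity, to make active learning more effective. Experiments show that the proposed Least Confident and Diverse (LCD) method outperforms existing methods; selecting uncertain and diverse instances helps the model learn more distinct features.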

详情
AI中文摘要

深度学习模型,包括卷积神经网络(CNNs)和视觉Transformer(ViTs),在各种计算机视觉任务如物体分类、检测、分割、生成等任务中取得了最先进的性能。然而,这些模型对数据需求很高,因为它们需要更多的训练数据来学习数百万或数十亿的参数。特别是对于监督学习任务,为模型训练收集大量标记样本是一个昂贵且耗时的任务。主动学习(AL)已被用于解决这个问题多年。现有的主动学习方法旨在从未标记样本池中选择用于注释的样本,这些样本要么是多样化的要么是不确定的。选择这样的样本可能会阻碍模型的性能,因为我们基于单一维度进行池化,即要么多样化要么不确定。在本文中,我们提出四种新颖的混合采样方法,用于同时池化容易和困难的样本,这些样本也是多样的。为了验证所提出方法的有效性,进行了大量的实验,分别使用高和低置信度样本。我们从实验中发现,所提出的混合采样方法,即Least Confident and Diverse(LCD),在性能上始终优于最先进的方法。观察到选择不确定且多样的实例有助于模型学习更明显的特征。与本研究相关的代码将在https://github.com/XXX/LCD上提供。

英文摘要

Deep learning models, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have achieved state-of-the-art performance on various computer vision tasks such as object classification, detection, segmentation, generation, and many more. However, these models are data-hungry, as they require large amounts of training data to learn millions or billions of parameters. Especially for supervised learning tasks, curating a large number of labeled samples for model training is expensive and time-consuming. Active Learning (AL) has been used to address this problem for many years. Existing active learning methods choose samples for annotation from a pool of unlabeled samples that are either diverse or uncertain. Choosing such samples may hinder the model's performance because the pooling is based on only one dimension, i.e., either diversity or uncertainty. In this paper, we propose four novel hybrid sampling methods for pooling both easy and hard samples, which are also diverse. To verify the efficacy of the proposed methods, extensive experiments are conducted using high- and low-confidence samples separately. We observe from our experiments that the proposed hybrid sampling method, Least Confident and Diverse (LCD), consistently performs better than state-of-the-art methods. Selecting uncertain and diverse instances helps the model learn more distinct features. The code related to this study will be available at https://github.com/XXX/LCD.
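One way to combine uncertainty and diversity, as the abstract describes, is sketched below. This is not the authors' LCD procedure, which the abstract does not spell out: it assumes a two-step scheme that first shortlists the least confident samples and then greedily picks shortlisted samples that are far apart in feature space. The `pool_factor` parameter and the function name are hypothetical.

```python
import math

def least_confident_diverse(probs, feats, budget, pool_factor=3):
    """Select `budget` sample indices that are both uncertain and mutually diverse.

    probs: per-sample class probability lists (softmax outputs)
    feats: per-sample feature vectors used to measure diversity
    pool_factor: hypothetical multiplier for the size of the uncertain shortlist
    """
    # Confidence is the maximum class probability; lower means more uncertain.
    confidence = [max(p) for p in probs]
    order = sorted(range(len(probs)), key=lambda i: confidence[i])
    candidates = order[:budget * pool_factor]

    # Greedy farthest-point selection inside the uncertain shortlist.
    selected = [candidates[0]]
    while len(selected) < min(budget, len(candidates)):
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: min(math.dist(feats[c], feats[s]) for s in selected),
        )
        selected.append(best)
    return selected
```

The shortlist keeps the selection focused on uncertain samples, while the farthest-point step avoids spending the annotation budget on near-duplicates.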

2605.22158 2026-05-22 cs.AI cs.CV 版本更新

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

ST-SimDiff:平衡时空相似性与差异以实现高效的视频理解与大语言模型

Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

发表机构 * Tsinghua University(清华大学) Shenzhen University(深圳大学) Xidian University(西安电子科技大学)

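AI summary This paper proposes the ST-SimDiff framework, which makes video understanding more efficient by balancing spatio-temporal similarity and difference, using a spatio-temporal graph and a dual-selection strategy to reduce computational cost and improve performance.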

Comments Accepted by ICLR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在处理长视频时面临显著的计算开销,因为需要处理大量的视觉标记。为了提高效率,现有方法主要通过修剪或合并标记来减少冗余,但这些方法忽略了视频内容的一个关键维度,即变化和转折点,并且缺乏对时空关系的协作模型。为此,我们提出了一种新的视角:相似性用于识别冗余,而差异用于捕捉关键事件。基于此,我们设计了一个无需训练的框架,名为ST-SimDiff。我们首先从视觉标记中构建时空图,以统一建模其复杂的关联。随后,我们采用并行双选择策略:1)基于相似性的选择使用社区检测保留代表性标记,压缩静态信息;2)基于时间差异的选择精确定位内容变化点,以保留捕捉关键动态变化的标记。这使它能够用最少的标记保留静态和动态内容。广泛实验表明,我们的方法在显著优于现有最先进方法的同时,大幅减少了计算成本。我们的代码可在https://github.com/bingjunluo/ST-SimDiff上获得。

英文摘要

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we design a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows the framework to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show that our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
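The temporal-difference branch can be illustrated with a simplified, frame-level sketch. The paper operates on visual tokens within a spatio-temporal graph; the version below assumes one pooled feature vector per frame and scores each frame by its cosine distance to the previous one. The function names and the top-k rule are illustrative assumptions.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + 1e-12)

def select_change_points(frame_feats, keep):
    """Return the indices of the `keep` frames that differ most from their predecessor."""
    scores = [
        (cosine_distance(frame_feats[t - 1], frame_feats[t]), t)
        for t in range(1, len(frame_feats))
    ]
    top = sorted(scores, reverse=True)[:keep]
    return sorted(t for _, t in top)  # report in temporal order
```

Frames returned here would be the ones whose tokens are preserved for dynamic content, while a separate similarity-based step would compress the static remainder.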

2605.22147 2026-05-22 cs.CV 版本更新

Flow-based Gaussian Splatting for Continuous-Scale Remote Sensing Image Super-Resolution

基于流的高斯点散射用于连续尺度遥感图像超分辨率

Jiangwei Mo, Xi Lu, Hanlin Wu

发表机构 * School of Information Science and Technology, Beijing Foreign Studies University(信息科学与技术学院,北京外国语大学)

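AI summary This paper proposes the FlowGS framework, which achieves arbitrary-scale super-resolution of remote sensing images through flow matching and Gaussian splatting, improving generation efficiency and quality.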

详情
AI中文摘要

高分辨率遥感图像(RSI)对于地球观测应用至关重要,但获取它们通常受到传感器限制和成本的限制。近年来,生成式超分辨率(SR)方法,特别是扩散模型,取得了显著进展。然而,它们通常需要缓慢的迭代推断,需要40-1000步,并且在连续尺度SR设置中表现出有限的灵活性。为了解决这些问题,我们提出FlowGS,一种用于任意尺度RSI超分辨率的生成性重建框架。FlowGS建模高分辨率和低分辨率图像之间的高频细节表示,并通过流匹配(FM)约束于快捷一致性,学习从噪声到细节先验的连续概率流,从而减少生成复杂性并提高推断效率。此外,我们采用2D高斯点散射来构建连续特征场,从而在任意查询位置上实现灵活的重建。实验结果表明,FlowGS在连续尺度和固定尺度SR设置中均能提供与现有方法相媲美的感知质量,同时具有显著提高的推断效率。

英文摘要

High-resolution remote sensing images (RSIs) are crucial for Earth observation applications, yet acquiring them is often limited by sensor constraints and costs. In recent years, generative super-resolution (SR) methods, particularly diffusion models, have made significant progress. However, they typically require slow iterative inference with 40 to 1000 steps and exhibit limited flexibility in continuous-scale SR settings. To address these issues, we propose FlowGS, a generative reconstruction framework for arbitrary-scale SR of RSIs. FlowGS models the high-frequency detail representations between high- and low-resolution images and learns a continuous probability flow from noise to detail priors via flow matching (FM) constrained by shortcut consistency, thereby reducing generative complexity and improving inference efficiency. Additionally, we employ 2D Gaussian splatting to construct a continuous feature field, thereby enabling flexible reconstruction at arbitrary query locations. Experimental results show that FlowGS delivers competitive perceptual quality compared with existing methods in both continuous-scale and fixed-scale SR settings, with substantially improved inference efficiency.
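A continuous feature field built from 2D Gaussians can be queried at any coordinate, which is what makes arbitrary-scale reconstruction possible. The sketch below is a minimal illustration under simplifying assumptions that are not taken from the paper: isotropic Gaussians, and a normalized weighted average of per-Gaussian features.

```python
import math

def query_field(gaussians, x, y):
    """Evaluate a feature field at the continuous location (x, y).

    gaussians: list of (cx, cy, sigma, feature_vector); isotropic for brevity
    Returns the weight-normalized blend of the Gaussian features.
    """
    dim = len(gaussians[0][3])
    acc = [0.0] * dim
    total = 0.0
    for cx, cy, sigma, feat in gaussians:
        # Gaussian falloff with distance from the center.
        w = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
        total += w
        for i, f in enumerate(feat):
            acc[i] += w * f
    return [a / (total + 1e-12) for a in acc]
```

Because `query_field` accepts real-valued coordinates, sampling it on a denser grid gives a higher output resolution without changing the representation.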

2605.22144 2026-05-22 cs.CV 版本更新

One Sentence, One Drama: Personalized Short-Form Drama Generation via Multi-Agent Systems

一句话,一出戏剧:通过多智能体系统实现个性化短剧生成

Yufei Shi, Weilong Yan, Naixuan Huang, Yucheng Chen, Chenyu Zhang, Tao He, Si Yong Yeo, Ming Li

发表机构 * MedVisAI Lab, Lee Kong Chian School of Medicine, Nanyang Technological University(MedVisAI实验室,李光前医学院,南洋理工大学) National University of Singapore(新加坡国立大学) Beijing Institute of Technology(北京理工大学) Tsinghua University(清华大学) University of Electronic Science and Technology of China(电子科技大学) Guangming Laboratory(光明实验室)

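AI summary This paper proposes a multi-agent framework that turns a user's single-sentence idea into a complete short drama through structured intermediate modules and iterative refinement, addressing narrative pacing, spatial consistency, and production-level quality control in short-drama generation.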

详情
AI中文摘要

现有的数字短剧制作方法通常依赖单次生成的LLM脚本和松散耦合的流程,无法满足短剧生成的三个关键要求:(1) 叙事节奏,导致钩子弱、情节不足和不吸引人的结局;(2) 空间一致性,导致场景布局漂移和跨片段角色位置不一致;(3) 生产级质量控制,需要在脚本和视觉阶段进行大量手动审查和修正。我们提出了One Sentence, One Drama,一种分层多智能体框架,通过结构化中间模块和迭代优化,将用户的单句想法转化为完整短剧。我们的方法基于三个关键组件:(1) 基于多智能体辩论的故事生成模块,强制短剧节奏和叙事连贯性;(2) 3D基础的第一帧生成机制,建立共享的空间参考,确保跨片段的一致性角色定位和场景布局;(3) 多阶段评审循环,在脚本、视觉和视频生成阶段进行全面的错误检测和有针对性的修订。我们还引入了场景级BGM匹配和场景转换规划,以提高观众的沉浸体验。为了系统评估该任务,我们引入了Short-Drama-Bench,一个扩展标准视频质量指标的基准,包含短剧特定的评估标准。实验结果表明,我们的方法在叙事质量、跨片段一致性以及整体观看体验上显著优于现有流程。

英文摘要

Existing approaches for digital short-drama production typically rely on one-shot LLM-generated scripts and loosely coupled pipelines, which fail to satisfy three key requirements of short-drama generation: (1) narrative pacing, resulting in weak hooks, insufficient escalation, and unattractive endings; (2) spatial consistency, leading to drifting scene layouts and inconsistent character positions across clips; and (3) production-level quality control, requiring extensive manual review and correction across script and visual stages. We present One Sentence, One Drama, a hierarchical multi-agent framework that transforms a user's single-sentence idea into a fully produced short drama through structured intermediate modules and iterative refinement. Our approach is built upon three key components: (1) a multi-agent debate-based story generation module that enforces short-drama pacing and narrative coherence; (2) a 3D-grounded first-frame generation mechanism that establishes a shared spatial reference for consistent character positioning and scene layout across clips; and (3) multi-stage reviewer loops that perform comprehensive error detection and targeted revision across script, visual, and video generation stages. We also introduce scene-level BGM matching and scene transition planning to improve the audience's immersive experience. To systematically evaluate this task, we introduce Short-Drama-Bench, a benchmark that extends standard video quality metrics with short-drama-specific criteria. Experimental results demonstrate that our method significantly outperforms existing pipelines in narrative quality, cross-clip consistency, and overall viewing experience.

2605.22139 2026-05-22 cs.CV 版本更新

EventGait: Towards Robust Gait Recognition with Event Streams

EventGait: 向事件流中实现鲁棒的步态识别

Senyan Xu, Shuai Chen, Chuanfu Shen, Kean Liu, Zhijing Sun, Chengzhi Cao, Xueyang Fu

发表机构 * University of Science and Technology of China(中国科学技术大学) University of Electronic Science and Technology of China(电子科技大学)

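AI summary This paper proposes EventGait, an end-to-end dual-stream framework that uses event cameras to capture motion and shape information, making gait recognition more robust under complex illumination and motion conditions. Its effectiveness is validated with a synthesis pipeline and new benchmarks.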

详情
AI中文摘要

步态识别能够实现非侵入性和隐私保护的识别,但在不受控环境中由于传统相机的光照和运动敏感性而面临挑战。本文探讨了使用事件相机进行步态识别,事件相机提供微秒级时间分辨率和高动态范围,自然捕捉鲁棒的动态特征并抑制静态噪声。现有基于事件的方法通常将事件流聚合为事件图像,从而丢弃了对步态识别至关重要的细粒度运动动态。因此,我们提出了EventGait,一种端到端的双流框架,分别建模运动和形状,同时保留事件的优势。我们的动态流利用混合脉冲专家(MoSE)和多样化的神经元常数,以在复杂的运动和光照场景中实现稳健的动态感知,而静态流通过跨模态结构对齐(CroSA)学习密集的形状表示,使用大规模视觉基础模型。为了解决大规模基于事件的步态数据集的缺乏,我们引入了合成管道并发布了两个新的基准:SUSTech1K-E和CCGR-Mini-E。广泛的实验表明,基于事件的步态识别不仅在正常条件下实现了与基于相机的步态识别相当的结果,而且在低光场景中显著优于前者。我们的方法在合成和真实世界基于事件的步态基准上均达到了新的状态,突显了事件驱动步态分析的鲁棒性和潜力。代码和数据集已发布在https://github.com/QUEAHREN/EventGait。

英文摘要

Gait recognition enables non-intrusive, privacy-preserving identification but suffers in uncontrolled environments due to illumination and motion sensitivity of conventional cameras. In this work, we explore gait recognition using event cameras, which offer microsecond temporal resolution and high dynamic range, naturally capturing robust dynamic cues and suppressing static noise. Existing event-based approaches typically aggregate event streams into event images over long time windows, thereby discarding fine-grained motion dynamics critical for gait recognition. Therefore, we propose EventGait, an end-to-end dual-stream framework that separately models motion and shape while preserving the advantages of events. Our dynamic stream leverages a Mixture of Spiking Experts (MoSE) with diverse neuron constants for robust dynamic perception across complex motion and illumination scenes, while the static stream learns dense shape representations via Cross-modal Structure Alignment (CroSA) with large vision foundation models. To address the absence of large-scale event-based gait datasets, we introduce a synthesis pipeline and release two new benchmarks: SUSTech1K-E and CCGR-Mini-E. Extensive experiments show that event-based gait recognition not only achieves results comparable to camera-based gait recognition under normal conditions but also significantly outperforms it in low-light scenarios. Our approach sets a new state of the art on both synthesized and real-world event-based gait benchmarks, highlighting the robustness and potential of event-driven gait analysis. The code and datasets are released at https://github.com/QUEAHREN/EventGait.

2605.22132 2026-05-22 cs.CV 版本更新

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

通过可插拔的深度卷积加速视觉基础模型

Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学) INSAIT, Sofia University "St. Kliment Ohridski", Sofia, Bulgaria(INSAIT,索菲亚大学『圣克莱门特·欧赫里迪斯』,索菲亚,保加利亚)

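AI summary This paper accelerates large-scale pretrained Vision Transformers by replacing some attention heads with a drop-in depthwise convolution layer while preserving feature extraction capability, achieving a 17-20% inference speedup with minimal performance loss on image classification and segmentation tasks.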

Comments Accepted at ICPR 2026

详情
AI中文摘要

预训练的视觉基础模型在少量微调下即可在多种任务中取得优异性能。然而,其视觉Transformer(ViT)主干结构导致较高的推理开销,限制了在资源受限设备上的部署。在本文中,我们通过利用某些注意力头内在的卷积类行为,加速大规模预训练ViT的同时保持其特征提取能力。具体而言,我们引入了一个高效的基于深度卷积的层,作为这些头的可插拔替代方案。此外,我们提出了简单策略来识别可替换的头,并引入一种微调过程以恢复下游任务性能。在图像分类和分割任务中,我们的方法实现了17-20%的推理加速,且性能损失极小。我们通过详细的推导、广泛的实验和效率基准验证了该方法。参考实现已公开。

英文摘要

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
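For readers unfamiliar with the operation, the sketch below shows a plain depthwise convolution: each channel is filtered by its own small kernel and channels are never mixed. It is a generic reference implementation on nested lists with zero padding and stride 1 (computed as cross-correlation, as is conventional in deep learning), not the paper's layer.

```python
def depthwise_conv2d(x, kernels):
    """Apply one k x k kernel per channel.

    x:       [C][H][W] nested lists (e.g. a token grid reshaped to 2D)
    kernels: [C][k][k] nested lists with odd k
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, k = len(channel), len(channel[0]), len(kernel)
        r = k // 2
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding
                            acc += kernel[di][dj] * channel[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out
```

As an intuition for the convolution-like behavior the abstract refers to, a head that always attends to the token at a fixed relative offset behaves like a kernel with a single nonzero entry at that offset.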

2605.22126 2026-05-22 cs.CV 版本更新

AesFormer: Transform Everyday Photos into Beautiful Memories

AesFormer: 将日常照片转化为美丽的记忆

Tianxiang Du, Hulingxiao He, Yuxin Peng

发表机构 * Wangxuan Institute of Computer Technology, Peking University, Beijing, China(王轩计算机技术研究所,北京大学,北京,中国)

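AI summary This paper proposes the AesFormer framework, which improves the aesthetic quality of photos by decoupling aesthetic planning from image editing while preserving subject identity and scene semantics, and builds AesRecon, a benchmark of 9,071 strictly aligned image pairs.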

Comments Accepted by ICML 2026

详情
AI中文摘要

在日常摄影中,吸引人的时刻往往受到结构缺陷(如构图、相机视角或姿势)的影响,而现有的修图和人像增强方法无法修复这些缺陷。我们提出将审美照片重建(APR)定义为通过结构重建来提高照片的审美质量,同时保持主体身份和场景语义。尽管最近的图像编辑模型使APR成为可能,但它们通常缺乏审美理解,导致编辑结果在语义上合理但审美上薄弱。为此,我们提出了AesFormer,一个两阶段框架,将审美规划与图像编辑解耦。在第一阶段,一个审美动作模型(AesThinker)分析输入沿七个渐进的摄影维度,并输出可执行的编辑动作;我们进一步应用GRPO-A来鼓励在多样化的动作计划上进行广泛探索,超越SFT。在第二阶段,一个动作条件编辑器(AesEditor)在这些动作的指导下执行结构编辑。为了支持APR,我们构建了一个基于视频的语料挖掘管道(VCMP)并构建了AesRecon,一个包含9,071对严格对齐(差,好)图像对的基准。实验表明,AesFormer显著提高了APR性能,并与Nano Banana Pro具有竞争力。代码可在https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026获取。

英文摘要

In everyday photography, aesthetically appealing moments are often captured with structural flaws (e.g., composition, camera viewpoint, or pose) that existing retouching and portrait enhancement methods cannot fix. We formulate Aesthetic Photo Reconstruction (APR) as improving a photo's aesthetic quality via structural reconstruction while preserving subject identity and scene semantics. Although recent advances in image editing models make APR feasible, they often lack aesthetic understanding, yielding edits that are semantically plausible yet aesthetically weak. To address this, we propose AesFormer, a two-stage framework that decouples aesthetic planning from image editing. In Stage 1, an aesthetic action model (AesThinker) analyzes the input along seven progressive photographic dimensions and outputs executable editing actions; we further apply GRPO-A to encourage broad exploration over diverse action plans beyond SFT. In Stage 2, an action-conditioned editor (AesEditor) performs structural edits guided by these actions. To support APR, we build a video-based corpus-mining pipeline (VCMP) and construct AesRecon, a benchmark of 9,071 strictly aligned (poor, good) image pairs. Experiments show that AesFormer substantially improves APR performance and is competitive with Nano Banana Pro. Code is available at https://github.com/PKU-ICST-MIPL/AesFormer_ICML2026.

2605.22121 2026-05-22 cs.CV 版本更新

MotionDPS: Motion-Compensated 3D Brain MRI Reconstruction

MotionDPS: 3D脑部MRI重建中的运动补偿

Antonio Ortiz-Gonzalez, Erich Kobler, Lukas Schletter, Alexander Effland

发表机构 * Life and Medical Sciences Institute, University of Bonn(波恩大学生命与医学科学研究所) Institute for Machine Learning, LIT AI Lab, Department of Virtual Morphology, Clinical Research Institute Medical AI, Johannes Kepler University Linz(林茨约翰尼斯·凯撒大学机器学习研究所、LIT AI实验室、虚拟形态部门、医学人工智能临床研究机构) German Center for Neurodegenerative Diseases (DZNE)(德国神经退行性疾病研究中心(DZNE)) Institute for Applied Mathematics, University of Bonn(波恩大学应用数学研究所)

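AI summary This paper proposes a unified Bayesian framework for motion-compensated 3D MRI reconstruction that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data, enabling fully unsupervised reconstruction without paired motion-free training data.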

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

磁共振成像(MRI)由于其相对较长的采集时间和k空间中按顺序采集数据的事实,对患者运动高度敏感。即使是很小的患者移动也会在测量之间引入相位不一致,导致严重的伪影,如模糊、鬼影和几何失真,这些伪影可能影响诊断质量。回顾性运动补偿仍然具有挑战性,尤其是在加速采集中,由于联合重建和运动估计问题的不恰当性。在本工作中,我们提出了一种统一的贝叶斯框架,用于运动补偿的3D MRI,该框架直接从运动损坏的k空间数据中联合估计解剖图像、刚体运动参数和线圈灵敏度图。我们的方法将预训练的3D复值分数扩散模型作为表达性解剖图像先验整合到基于物理的正向模型中。通过交替扩散后验图像更新和高效的近端优化步骤进行运动和线圈灵敏度估计,实现完全无监督的重建,无需配对无运动训练数据。在模拟和真实运动脑部MRI数据集上的实验表明,所提出的方法在图像质量和运动鲁棒性方面优于最先进的经典和学习运动校正技术,特别是在存在严重运动和高加速的情况下。

英文摘要

Magnetic resonance imaging (MRI) is highly susceptible to patient motion due to its relatively long acquisition times and the fact that data are acquired sequentially in k-space. Even small patient movements introduce phase inconsistencies across measurements, leading to severe artifacts such as blurring, ghosting, and geometric distortions that can compromise diagnostic quality. Retrospective motion compensation remains challenging, particularly in accelerated acquisitions, due to the ill-posed nature of the joint reconstruction and motion estimation problem. In this work, we propose a unified Bayesian framework for motion-compensated 3D MRI that jointly estimates the anatomical image, rigid-body motion parameters, and coil sensitivity maps directly from motion-corrupted k-space data. Our approach integrates pretrained 3D complex-valued score-based diffusion models as expressive anatomical image priors within a physics-based forward model. Inference is performed by alternating diffusion posterior image updates with efficient proximal optimization steps for motion and coil sensitivity estimation, enabling fully unsupervised reconstruction without the need for paired motion-free training data. Experiments on simulated and real-motion brain MRI datasets demonstrate that the proposed method achieves improved image quality and motion robustness compared to state-of-the-art classical and learning-based motion correction techniques, particularly in the presence of severe motion and high acceleration.
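The phase inconsistencies mentioned above follow from the Fourier shift theorem: translating an object multiplies its k-space samples by a linear phase ramp. The 1D sketch below demonstrates this with a naive discrete Fourier transform. It covers only integer circular shifts in one dimension and is not the paper's forward model, which also includes rotations and coil sensitivities.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a 1D signal."""
    n = len(signal)
    return [
        sum(signal[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]

def shift_in_kspace(kspace, shift):
    """Apply a circular shift of `shift` samples as a phase ramp in k-space."""
    n = len(kspace)
    return [kspace[k] * cmath.exp(-2j * cmath.pi * k * shift / n) for k in range(n)]
```

If different k-space lines are acquired at different object positions, each line carries a different phase ramp, and the mismatch shows up as ghosting and blurring after the inverse transform.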

2605.22109 2026-05-22 cs.AI cs.CV cs.CY 版本更新

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

感知还是偏见:大语言模型能否超越个性的第一印象?

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

发表机构 * The University of Tokyo(东京大学) Shanda AI Research Tokyo(Shanda AI 研究所东京) Dalian University of Technology(大连理工大学)

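AI summary This paper examines how well multimodal large language models (MLLMs) perceive personality. It proposes a new task, Grounded Personality Reasoning (GPR), builds a new dataset, MM-OCEAN, and uses a three-tier evaluation to reveal prejudice in the personality reasoning of MLLMs.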

详情
AI中文摘要

多模态大语言模型(MLLMs)正在越来越多地应用于需要感知个性的人类交互角色中,但现有的基准测试仅评估其对大五人格特质分数的预测能力,未能确定模型是通过行为理解真正感知个性,还是仅通过表面模式匹配进行偏见判断。我们通过三个贡献填补了这一空白:(i)一个新的任务:我们正式定义了Grounded Personality Reasoning(GPR),要求MLLMs通过一系列评分、推理和锚定过程,将每个大五评分与可观察的证据联系起来;(ii)一个新的数据集:我们发布了MM-OCEAN(1,104个视频,5,320个多项选择题),由多代理流程生成,包含时间戳行为观察、证据支持的特质分析以及七类线索锚定多项选择题;(iii)基准测试和分析:我们设计了一个三级评估体系(评分、推理、锚定)以及四个样本级失败模式指标:偏见率(PR)、编造率(CR)、整合失败率(IR)和整体锚定率(HR),并基准测试了27个MLLMs(13个封闭式,14个开放式)。分析揭示了一个显著的偏见差距:在所有正确评分中,51%的评分没有基于检索到的线索进行锚定,而整体锚定率仅在0-33.5%之间。这些发现揭示了获得正确分数与为正确原因推理之间的脱节,为MLLMs中的扎根社会认知绘制了路线图。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

2605.22104 2026-05-22 cs.CV 版本更新

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

OPERA: 一种用于图像修复的智能体,通过端到端联合规划-执行优化

Feng Zhu, Shuyang Xie, Yihan Zeng, Ming Liu, Wangmeng Zuo

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

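AI summary This study proposes the OPERA framework, which jointly optimizes restoration planning and tool execution end to end to handle complex mixed degradations in image restoration, outperforming existing agent-based methods and all-in-one models.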

详情
AI中文摘要

现实中的图像修复因复杂的、相互作用的混合退化而具有挑战性。最近的基于智能体的方法通过组合多个任务特定的修复工具来解决这个问题。然而,实证分析表明,其性能根本上受到隐式约束的规划空间和独立预训练工具之间缺乏协调的限制。为了解决这些问题,我们提出了OPERA(优化规划-执行修复智能体),一种框架,通过端到端的方式联合优化修复规划和工具执行。在规划方面,OPERA使用强化学习直接优化工具组合在一个组合计划空间上,最终修复质量作为奖励。在执行方面,OPERA引入了智能体引导的修复工具协同训练,使它们能够在顺序组合下学习合作行为。在多退化基准和真实世界数据集上的大量实验表明,OPERA在多样且复杂的退化场景中始终优于所有-in-one修复模型和现有基于智能体的方法。

英文摘要

Real-world image restoration is challenging due to complex and interacting mixed degradations. Recent agent-based approaches address this problem by composing multiple task-specific restoration tools. However, empirical analysis reveals that their performance is fundamentally limited by implicitly constrained planning spaces and the lack of coordination among independently pretrained tools. To address these issues, we propose OPERA (Optimized Planning-Execution Restoration Agent), a framework that jointly optimizes restoration planning and tool execution in an end-to-end manner. On the planning side, OPERA uses reinforcement learning to directly optimize tool composition over a combinatorial plan space, with the final restoration quality as the reward. On the execution side, OPERA introduces agent-guided co-training of restoration tools, enabling them to learn cooperative behaviors under sequential composition. Extensive experiments on multi-degradation benchmarks and real-world datasets demonstrate that OPERA consistently outperforms both all-in-one restoration models and existing agent-based methods across diverse and complex degradation scenarios.

2605.22098 2026-05-22 cs.CV cs.AI cs.LG 版本更新

TextTeacher: What Can Language Teach About Images?

TextTeacher: 语言能教会我们关于图像什么?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

发表机构 * RPTU University Kaiserslautern-Landau(莱茵兰-普法尔茨凯泽斯劳滕-兰道工业大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

AI总结 该研究提出TextTeacher方法,通过将语言模型的语义知识注入到图像分类训练中,提升视觉模型的性能,同时保持推理时的模型简洁性。

Comments Published at TMLR

详情
Journal ref
Transactions on Machine Learning Research, ISSN 2835-8856, 2026
AI中文摘要

柏拉图表示假设认为,足够大的模型会收敛到共享的表示几何结构,即使跨模态。受此启发,我们提出问题:语言模型的语义知识能否有效提升视觉模型?为此,我们引入TextTeacher,一种简单的辅助目标,将文本嵌入作为额外信息注入图像分类训练。TextTeacher利用现成的图像描述、预训练并冻结的文本编码器以及轻量级投影,生成语义锚点,高效引导训练期间的表示,同时保持推理时的模型不变。在ImageNet上使用标准ViT骨干网络,TextTeacher将准确率提升高达+2.7个百分点(p.p.),并在相同配方和计算条件下产生一致的迁移增益(平均+1.0 p.p.)。它优于视觉知识蒸馏,在相同计算预算下更准确,或在相似准确率下快33%。我们的分析表明,TextTeacher充当特征空间预条件器,在训练初期塑造更深的层,并通过提供互补的语义线索帮助泛化。TextTeacher增加的开销很小,不需要对目标模型进行昂贵的多模态训练,并保持纯视觉模型的简洁性和延迟。项目页面(含代码和图像描述):https://nauen-it.de/publications/text-teacher

英文摘要

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding higher accuracy at a constant compute budget, or similar accuracy 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model, and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher
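
To make the idea of a text-anchored auxiliary objective concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not state the form of the loss, so the cosine-distance term, the linear projection `W`, and the mixing `weight` are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(W, x):
    """Lightweight linear projection; W is a list of rows (assumed form)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cross_entropy(logits, label):
    """Standard softmax cross-entropy, computed stably via log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def text_anchor_loss(image_feat, text_emb, W):
    """Assumed auxiliary term: cosine distance between the projected
    image feature and the frozen caption embedding (the 'anchor')."""
    return 1.0 - cosine(project(W, image_feat), text_emb)

def total_loss(logits, label, image_feat, text_emb, W, weight=0.5):
    """Classification loss plus the weighted auxiliary term.
    `weight` is a hypothetical hyperparameter."""
    return cross_entropy(logits, label) + weight * text_anchor_loss(image_feat, text_emb, W)
```

Because the auxiliary term only appears in `total_loss`, dropping it at inference leaves the classifier untouched, which matches the abstract's statement that the inference-time model is unchanged.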

2605.22096 2026-05-22 cs.CV 版本更新

VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

VISTA:基于解剖解码的时空基础模型验证引导集成用于罕见病理VCE事件检测——竞赛结果后

Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu

发表机构 * National Cheng Kung University, Taiwan(国立成功大学, 台湾) National Yang Ming Chiao Tung University, Taiwan(国立阳明交通大学, 台湾)

AI总结 本文提出VISTA框架,结合时空基础模型和解剖解码,通过验证引导的加权融合和时间事件解码,提升罕见病理VCE事件检测的性能,竞赛后进一步优化后取得第二名。

详情
AI中文摘要

胶囊内镜事件检测具有挑战性,因为临床相关发现稀少、视觉异质,且以事件级别而非帧级精度进行评估。我们提出VISTA,一个针对RAREVISION任务的度量对齐多主干框架。VISTA结合EndoFM-LV进行时间上下文分析和DINOv3 ViTL/16进行帧级视觉语义,随后通过Diverse Head Ensemble (DHE)、Validation-Guided Weighted Fusion (VGWF)和Anatomy-Aware Temporal Event Decoding (ATED)。原始官方提交在隐藏测试中达到时间mAP@0.5为0.3530和mAP@0.95为0.3235。竞赛后,在局部阈值细化的基础上加入全局粗搜索,性能提升至mAP@0.5为0.3726和mAP@0.95为0.3431,使ACVLab团队在赛后评估中排名第二。

英文摘要

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
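
The temporal mAP@0.5 and mAP@0.95 figures rest on temporal intersection-over-union between a predicted event and a ground-truth event. The sketch below shows that overlap test only. The segment representation (start, end) and the simple threshold rule are assumptions, since the abstract does not describe the RAREVISION matching protocol.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments on the time axis."""
    (p_start, p_end), (g_start, g_end) = pred, gt
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, gt, threshold=0.5):
    """A prediction counts as a hit when its temporal IoU meets the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

At a 0.95 threshold a predicted event must align almost exactly with the annotated one, so small boundary errors that pass at 0.5 are rejected.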

2605.22089 2026-05-22 cs.CV cs.AI 版本更新

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

LVDrive: 基于潜在视觉表征的视觉-语言-动作自动驾驶模型

Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Xiaomi EV(小米电动车)

AI总结 本文提出LVDrive,一种增强视觉-语言-动作能力的自动驾驶模型,通过引入未来场景预测任务,在高维潜在空间中学习语义丰富的场景表示,从而提升闭环驾驶性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型已逐渐成为端到端自动驾驶的有前途的框架。然而,现有VLA通常依赖于稀疏的动作监督,这未能充分利用其强大的场景理解和推理能力。最近尝试通过世界建模引入密集视觉监督时,往往过度强调像素级图像重建,忽略了语义丰富的场景表示学习。在本文中,我们提出LVDrive,一种基于潜在视觉表征的VLA框架,用于自动驾驶。LVDrive在VLA范式中引入了未来场景预测任务,其中未来表示在预训练视觉主干的辅助监督下完全在高维潜在空间中学习。脱离低效的自回归生成,我们在一个统一的嵌入空间中联合建模未来场景和运动预测,通过单次前向传递进行未来感知推理。我们进一步设计了一种两阶段轨迹解码策略,明确利用所学的潜在未来表示来细化轨迹生成。在具有挑战性的Bench2Drive基准测试中,大量实验表明,LVDrive在闭环驾驶性能上实现了显著提升,优于动作监督方法和基于图像重建的世界模型方法。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to perform future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action-supervised methods and image-reconstruction-based world model approaches.

2605.21273 2026-05-22 cs.CV 版本更新

DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions

DriveMA: 重新思考驾驶VLAs中的语言接口以单步元动作

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

AI总结 本文提出DriveMA,通过单步元动作替代传统的自然语言推理,解决了驾驶VLAs中语言接口的三个瓶颈问题,实现了更高效的端到端规划。

Comments We withdraw this submission because the current version contains a mismatch between the paper title, conceptual framing, and the intended contribution of the work. To avoid potential misunderstanding by readers, the authors have decided to withdraw this version and substantially revise the title, organization, and presentation before any future submission

详情
AI中文摘要

驾驶视觉-语言-动作模型(Driving VLAs)通常将自然语言推理作为端到端规划的中间接口,但以推理为中心的接口面临三个实际瓶颈:获得高质量的推理注释困难,生成和理解长推理链对紧凑模型具有挑战性,且推理延迟显著增加。本文重新思考了驾驶VLAs中的语言接口设计,表明简洁的单步元动作是替代冗长推理的简单而有效的方案。元动作提供语义决策基础,同时保持低熵,并能自动从专家轨迹推导出来,从而实现可扩展的监督和可靠的轨迹条件化。基于此接口,我们提出了DriveMA,结合以动作为中心的监督训练和回合级(turn-level)信用分配强化学习框架,共同优化元动作的正确性、轨迹质量和轨迹-元动作一致性。实验表明,DriveMA仅用2B模型即在Waymo端到端驾驶挑战中取得新的最优结果(state of the art),Rater Feedback Score(RFS)为8.060,其4B版本进一步将最优结果提升至8.079;DriveMA在NAVSIM上也取得了具有竞争力的性能。消融实验显示,单步元动作在表达性、可预测性和推理效率之间提供了更好的实际权衡,优于自然语言推理或更细粒度的动作序列。代码、数据和模型将被发布以促进未来研究。

英文摘要

Driving Vision-Language-Action Models (Driving VLAs) commonly introduce natural-language reasoning as an intermediate interface for end-to-end planning, but reasoning-centric interfaces face three practical bottlenecks: obtaining high-quality reasoning annotations is difficult, generating and understanding long reasoning chains is challenging for compact models, and inference latency is substantially increased. In this paper, we rethink the design of language interfaces in Driving VLAs and show that concise one-step meta-actions are a simple yet effective alternative to verbose reasoning. Meta-actions provide semantic decision grounding while remaining low-entropy, and are automatically derivable from expert trajectories, enabling scalable supervision and reliable trajectory conditioning. Building on this interface, we propose DriveMA, which combines action-centric supervised training with a turn-level credit-assignment reinforcement learning framework that jointly optimizes meta-action correctness, trajectory quality, and trajectory--meta-action consistency. Experiments show that DriveMA already achieves a new state of the art on the Waymo End-to-End Driving Challenge with a 2B model, reaching a Rater Feedback Score (RFS) of 8.060, while its 4B version further improves the state of the art to 8.079; DriveMA also obtains competitive performance on NAVSIM. Ablations demonstrate that one-step meta-actions offer a better practical trade-off between expressiveness, predictability, and inference efficiency than natural-language reasoning or finer-grained action sequences. Code, data, and models will be released to facilitate future research.

2605.20578 2026-05-22 cs.SD cs.CV 版本更新

A strongly annotated passive acoustic dataset for tropical bird monitoring

一个强注解的被动声学数据集用于热带鸟类监测

Daniela Ruiz, Juan Sebastián Ulloa, Zhongqi Miao, Nicolás Betancourt, Maria Paula Toro-Gómez, Andrés Hernández, Bruno Demuro, Eliana Barona-Cortés, Angela Mendoza-Henao, Andrés Sierra-Ricaurte, Sebastián Pérez-Peña, Rahul Dodhia, Pablo Arbeláez, Juan M. Lavista Ferres

发表机构 * Microsoft AI for Good Research Lab(微软AI for Good研究实验室) Instituto de Investigación de Recursos Biológicos Alexander von Humboldt(亚历山大冯·洪堡生物资源研究所) Center for Research and Formation in Artificial Intelligence(人工智能研究与培养中心) Fundación Manacus(曼卡斯基金会) Louisiana State University(路易斯安那州立大学) Museum of Natural Sciences(自然博物馆)

AI总结 本文提出PteroSet数据集,用于热带鸟类监测,通过强注解的音频数据和COCO-inspired JSON格式,为机器学习提供基准,并展示了二元鸟类检测的深度学习基线。

详情
AI中文摘要

被动声学监测能够实现对多样化生态系统的连续、非侵入性生物多样性评估。这些数据集的规模推动了机器学习的应用,监督方法表现出强劲的性能。然而,监督方法需要时间分辨的注解数据集,这些数据仍然稀缺,尤其是在复杂的热带声音景观中。我们提出了PteroSet,这是一个经过精心编纂的数据集,包含在哥伦比亚Putumayo的Puerto Asis和Magdalena的Pivijay之间2023年至2025年录制的强注解新热带鸟类叫声数据集。该数据集包含563个录音(73.62小时)和15,372个时频注解,包括6,702个事件,这些事件被识别到物种水平,涵盖168个物种。我们以COCO启发的JSON模式发布注解,将音频文件、分类类别和机器学习工作流程的标签统一起来。除了提供注解数据外,PteroSet还充当一个现实的基准,突显了热带声音景观的关键特征,包括不同录制地点的声学共现和领域转移。我们提供了一个二元鸟类检测的深度学习基线,展示了PteroSet的可用性和其带来的挑战。

英文摘要

Passive acoustic monitoring enables continuous, non-invasive biodiversity assessment across diverse ecosystems. The scale of these datasets has driven the adoption of machine learning, with supervised approaches showing strong performance. However, supervised methods require time-resolved annotated datasets, which remain scarce, especially in complex tropical soundscapes. We present PteroSet, a curated dataset of strongly annotated Neotropical bird vocalizations recorded in Puerto Asis (Putumayo) and Pivijay (Magdalena), Colombia, between 2023 and 2025. The dataset comprises 563 recordings (73.62 h) and 15,372 time-frequency annotations, including 6,702 events identified to the species level across 168 species. We release the annotations in a COCO-inspired JSON schema that unifies audio files, taxonomic categories, and labels for machine learning workflows. Beyond providing annotated data, PteroSet serves as a realistic benchmark that highlights key characteristics of tropical soundscapes, including acoustic co-occurrence and domain shift across recording sites. We provide a deep learning baseline for binary bird detection, demonstrating PteroSet's usability and the challenges it presents.
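
For readers unfamiliar with COCO-style layouts, the sketch below shows what a COCO-inspired schema for time-frequency audio annotations could look like. Every field name, the file path, and the placeholder category are hypothetical; the abstract only says that the schema unifies audio files, taxonomic categories, and labels, so the released format may differ.

```python
import json

# Hypothetical COCO-inspired layout: three top-level lists linked by ids.
dataset = {
    "audio": [
        {"id": 1, "file_name": "site_a/rec_0001.wav", "duration_s": 60.0},
    ],
    "categories": [
        {"id": 10, "name": "placeholder species", "rank": "species"},
    ],
    "annotations": [
        # One time-frequency box: time bounds in seconds, frequency bounds in Hz.
        {"id": 100, "audio_id": 1, "category_id": 10,
         "t_min_s": 3.2, "t_max_s": 4.7, "f_min_hz": 2000.0, "f_max_hz": 6500.0},
    ],
}

def annotations_for(data, audio_id):
    """Return all annotation records attached to one recording."""
    return [a for a in data["annotations"] if a["audio_id"] == audio_id]

def event_duration(annotation):
    """Duration of an annotated event in seconds."""
    return annotation["t_max_s"] - annotation["t_min_s"]

# The structure survives a JSON round trip unchanged.
restored = json.loads(json.dumps(dataset))
```

Keeping recordings, categories, and annotations in separate id-linked lists is the main idea borrowed from COCO: a binary bird detector can ignore `category_id`, while a species classifier can join on it.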

2605.20342 2026-05-22 cs.CV 版本更新

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

ParaVT: 平衡工具先验悖论以实现代理视频强化学习中的并行工具使用

Zuhao Yang, Kaichen Zhang, Sudong Wang, Keming Wu, Zhongyu Yang, Bo Li, Xiaojuan Qi, Shijian Lu, Xingxuan Li, Lidong Bing

发表机构 * MiroMind NTU(南洋理工大学) HKU(香港大学) HKUST(GZ)(香港科技大学(广州)) THU(清华大学) LMMs-Lab(LMMs实验室)

AI总结 本文提出ParaVT,一种用于并行视频工具调用的端到端强化学习框架,通过引入PARA-GRPO机制解决工具先验悖论,提升了长视频理解的性能。

Comments Project Page: https://evolvinglmms-lab.github.io/ParaVT/

详情
AI中文摘要

通过强化学习(RL)训练大型多模态模型(LMMs)以原生调用视频处理工具(如裁剪)已成为实现长视频理解的有前景途径。然而,现有原生RL方法按顺序调度工具调用(即每回合一个):单个错误的裁剪会传播错误而无法得到同伴纠正,多回合工具调用会破坏上下文,且推理成本与回合数成线性关系。我们引入ParaVT,首个多智能体端到端RL训练框架用于并行视频工具调用,通过单个回合内调度多个时间窗口裁剪以获得更干净的上下文和更好的容错能力。然而,将标准RL应用于ParaVT揭示了一个我们称之为工具先验悖论的障碍:预训练的工具先验能够促进工具探索,但也破坏了冷启动的结构格式并暴露了在温度采样下的跳过工具奖励捷径。一个较弱先验LMM的跨模型对比支持这一观点:格式保持稳定但RL触发零工具调用,表明先验强度是格式崩溃和工具探索的共同驱动因素。我们提出PARA-GRPO(Parseability-Anchored和Ratio-gAted GRPO),它通过两种互补机制增强标准RL:(i)仅在最易崩溃的结构标记位置应用目标格式奖励;(ii)每提示帧预算随机化,创建训练提示,其中调用工具会提供可测量的奖励信号,而跳过工具则不会。在六个长视频理解基准测试中,ParaVT在平均上比Qwen3-VL基线提升了7.9%,而PARA-GRPO将训练时间格式合规性从0.13提升到0.64。随着工具能力在现代LMMs中日益内部化,RL必须与由此产生的先验合作,ParaVT提供了一种通用的代理RL配方。代码、数据和模型权重已公开可用。

英文摘要

Training large multimodal models (LMMs) via reinforcement learning (RL) to natively invoke video-processing tools (e.g., cropping) has become a promising route to long-video understanding. However, existing native-RL methods dispatch tool calls sequentially (i.e., one per turn): a single wrong crop propagates errors without peer correction, multi-turn tool calls corrupt context, and inference cost scales linearly with the number of turns. We introduce ParaVT, the first multi-agent end-to-end RL-trained framework for Parallel Video Tool calling, dispatching multiple time-window crops in a single turn for cleaner context and better fault tolerance. Yet applying standard RL to ParaVT reveals an obstacle we term the Tool Prior Paradox: the pretrained tool priors that enable tool exploration also destabilize cold-started structural format and expose the skip-tool reward shortcut under temperature sampling. A cross-model contrast on a weaker-prior LMM supports this claim: format stays stable but RL elicits zero tool calls, indicating that prior strength is the shared driver of both format collapse and tool exploration. We propose PARA-GRPO (Parseability-Anchored and Ratio-gAted GRPO), which augments standard RL with two complementary mechanisms: (i) a targeted format reward applied only at the structural-token positions most prone to collapse, and (ii) a per-prompt frame-budget randomization that creates training prompts where calling the tool yields a measurable reward signal over skipping it. Across six long-video understanding benchmarks, ParaVT improves over the Qwen3-VL baseline by +7.9% on average, with PARA-GRPO lifting training-time format compliance from 0.13 to 0.64. As tool capabilities become increasingly internalized in modern LMMs, RL must cooperate with the resulting priors, and ParaVT offers a general recipe for agentic RL. Code, data, and model weights are publicly available.

2605.19578 2026-05-22 cs.CV cs.AI 版本更新

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Lens Privacy Sealing: 一种新的基准和方法用于物理隐私保护的动作识别

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School(北京大学深圳研究生院通用人工智能国家重点实验室) Department of Computer Science and Engineering, State University of New York at Buffalo(纽约州立大学布法罗分校计算机科学与工程系)

AI总结 本文提出了一种名为Lens Privacy Sealing (LPS)的硬件解决方案,通过可调节的贴膜物理遮挡摄像头镜头,实现低成本的预传感器隐私保护,并引入P$^3$AR数据集用于隐私保护的动作识别,同时提出MSPNet框架以应对LPS带来的视频退化问题,实验表明MSPNet在动作识别准确率和隐私保护方面具有优势。

Comments Accepted by IEEE Transactions on Image Processing (TIP), 2026

详情
AI中文摘要

基于RGB摄像头的监控系统能够为公共安全和医疗保健提供人类动作识别,但引发了严重的隐私问题。现有方法依赖于事后捕获算法,这些算法在数据采集过程中无法保护隐私。我们提出Lens Privacy Sealing (LPS),一种简单的硬件解决方案,通过可调节的贴膜物理遮挡摄像头镜头,以最低的成本提供预传感器隐私保护。与软件方法或昂贵的工程光学不同,LPS通过随机多层散射实现强隐私保护,这种散射是物理不可逆的。我们引入了P$^3$AR数据集用于隐私保护的动作识别,该数据集包含大规模回放捕获(P$^3$AR-NTU,114K视频)和现实世界收集(P$^3$AR-PKU)的子集,并带有隐私属性注释。为处理LPS带来的视频退化,我们提出MSPNet,一种单阶段框架,结合了帧间噪声抑制器(IFNS)和跨帧语义聚合器(CFSA),并借助对比语言-图像预训练进行增强的语义提取。大量实验表明,与基线方法相比,MSPNet结合IFNS和CFSA几乎将动作识别准确率提高了一倍,同时抑制身份识别到低水平。全面验证显示,LPS在隐私-效用权衡方面优于现有最先进的硬件方法,能够抵御包括PSF反向计算和数据驱动恢复在内的重建攻击,并在不同光学配置和挑战性环境中具有良好的泛化能力。代码可在https://github.com/wangzy01/MSPNet上获得。

英文摘要

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

2605.19354 2026-05-22 eess.IV cs.CV 版本更新

Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction

用于自回归MRI重建的下一步加速尺度预测

Yilmaz Korkmaz, Vishal M. Patel

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出了一种基于离散多尺度潜在空间的自回归下一步加速尺度预测方法,通过引入特权信息蒸馏技术,提升了在极端欠采样下的MRI重建性能。

详情
AI中文摘要

MRI重建本质上是一个病态的逆问题,因为不完整的测量允许许多可能的解决方案。在高加速情况下,这种不确定性变得更加严重,像素域连续预测器倾向于在可行的重建之间平均并抑制高频解剖结构。我们通过将重建移动到离散多尺度潜在空间,并将其表述为自回归下一步加速尺度预测来解决这一限制。利用在视觉自回归建模中证明有效的离散先验,我们的方法将解限制在紧凑的码本令牌序列中,即使从极稀疏的测量中也能实现锐利的重建。这种离散自回归公式也自然与现代大型语言模型后训练技术对齐。基于这一观察,我们为视觉自回归建模引入了在线策略特权信息蒸馏,其中教师可获得仅在训练时可用、推理时不可用的特权上下文(本文中为全采样采集数据),并监督在自身生成序列上训练的学生,从而实现一致的重建增益。通过在fastMRI基准上的广泛实验,我们展示了我们的方法在各种采样模式下、在极端欠采样时提供了改进的重建性能。项目网站:https://yilmazkorkmaz1.github.io/discrete-mri-reconstruction-opd/

英文摘要

MRI reconstruction is an inherently ill-posed inverse problem, since incomplete measurements admit many plausible solutions. This ambiguity becomes more severe under high acceleration, where pixel-domain continuous predictors tend to average over feasible reconstructions and suppress high-frequency anatomy. We address this limitation by moving reconstruction to a discrete multi-scale latent space and posing it as autoregressive next-acceleration-scale prediction. Leveraging discrete priors proven effective in visual autoregressive modeling, our method restricts the solution to compact sequences of codebook tokens, enabling sharp reconstructions even from extremely sparse measurements. This discrete autoregressive formulation also aligns naturally with modern large language model post-training techniques. Building on this observation, we introduce on-policy privileged information distillation for visual autoregressive modeling, where a teacher is given privileged context that is available only during training and unavailable at inference (in our case, fully sampled acquisitions) and supervises a student trained on its own rollouts, leading to consistent reconstruction gains. Through extensive experiments on the fastMRI benchmark, we show that our approach delivers improved reconstruction performance across diverse sampling patterns under extreme undersampling. Project website: https://yilmazkorkmaz1.github.io/discrete-mri-reconstruction-opd/

2605.19329 2026-05-22 cs.CV cs.AI 版本更新

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

RE-VLM:事件增强的视觉-语言模型用于场景理解

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 本文提出RE-VLM,一种结合RGB图像和事件流的双流视觉-语言模型,旨在提升在正常和恶劣条件下对场景的理解能力。通过事件相机提供的高时间分辨率和宽动态范围的数据,RE-VLM在场景描述和视觉问答任务中优于现有模型。

Comments 10 pages, 6 figures, 6 tables

详情
AI中文摘要

传统视觉-语言模型(VLMs)在恶劣条件下(如低光、高动态范围或快速运动)捕获的场景解释能力不足,因为标准RGB图像在这些环境中质量下降。事件相机提供了一种互补的模态:它们异步记录每个像素的亮度变化,具有高时间分辨率和宽动态范围,在帧失效时保留运动线索。我们提出了RE-VLM,第一个双流视觉-语言模型,联合利用RGB图像和事件流,以在正常和挑战性条件下实现稳健的场景理解。RE-VLM采用并行的RGB和事件编码器,以及一种渐进训练策略,将异构视觉特征与语言对齐。为了解决RGB-Event-Text监督不足的问题,我们进一步提出了一种图驱动的流程,将同步的RGB-Event流转换为可验证的场景图,从中合成描述和问答对。为了开发和评估RE-VLM,我们构建了两个数据集:PEOD-Chat,针对光照挑战性场景,和RGBE-Chat,涵盖多样化的场景。在描述和VQA基准测试中,RE-VLM在与现有RGB-only和事件-only模型参数量相当的情况下,始终优于现有模型,特别是在挑战性条件下表现显著提升。这些结果证明了事件增强的VLMs在广泛现实环境中实现稳健视觉-语言理解的有效性。

英文摘要

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments.

2605.18507 2026-05-22 cs.CV 版本更新

Weakly Supervised Cross-Modal Learning for 4D Radar Scene Flow Estimation

弱监督跨模态学习用于4D雷达场景流估计

Jingyun Fu, Zhiyu Xiang, Na Zhao

发表机构 * Zhejiang University, Hangzhou, Zhejiang, China(浙江大学) Zhejiang Provincial Key Laboratory of Multi-Modal Communication Networks and Intelligent Information Processing(浙江省多模态通信网络与智能信息处理重点实验室) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 本文提出了一种弱监督的雷达场景流学习框架,利用图像和里程计进行辅助监督,通过实例感知的自监督损失和静态区域的刚性损失,实现了更高效的场景流估计。

Comments Accepted by ICML2026

详情
AI中文摘要

由于获取4D雷达场景流的真实数据困难,先前的方法通常依赖于自监督损失或利用3D激光雷达数据、2D图像和里程计进行跨模态监督。然而,自监督方法由于雷达固有的低保真度测量往往导致次优结果,而现有跨模态监督方法引入复杂的多任务架构并需要昂贵的激光雷达传感器来从预训练的3D跟踪模型中生成伪雷达场景流标签。为克服这些限制,我们提出了一种任务特定的迭代框架,仅使用图像和里程计进行训练中的辅助监督。特别地,我们通过利用现成的2D跟踪和分割算法获得跟踪实例掩码,并将其投影到3D空间,以提供实例级别的语义指导;对于静态区域,我们整合车辆里程计与雷达的内在运动线索以构建刚性静态损失。在现实世界的View-of-Delft(VoD)数据集上的大量实验表明,我们的方法不仅超越了依赖密集LiDAR点云的最新跨模态监督方法,还优于现有的全监督场景流估计方法。代码已开源在https://github.com/FuJingyun/IterFlow。

英文摘要

Due to the difficulty of obtaining ground-truth data for 4D radar scene flow estimation, previous methods typically rely on either self-supervised losses or cross-modal supervision using 3D LiDAR data, 2D images, and odometry. However, self-supervised approaches often yield suboptimal results due to radar's inherently low-fidelity measurements, while existing cross-modal supervised methods introduce complex multi-task architectures and require costly LiDAR sensors to generate pseudo radar scene flow labels from pretrained 3D tracking models. To overcome these limitations, we propose a task-specific iterative framework for weakly supervised radar scene flow learning, using only images and odometry for auxiliary supervision during training. Specifically, we establish two novel instance-aware self-supervised losses by exploiting off-the-shelf 2D tracking and segmentation algorithms to obtain tracked instance masks, which are back-projected into 3D space to provide instance-level semantic guidance; for static regions, we integrate vehicle odometry with radar's intrinsic motion cues to construct a rigid static loss. Extensive experiments on the real-world View-of-Delft (VoD) dataset demonstrate that our method not only surpasses state-of-the-art cross-modal supervised approaches that rely on 3D multi-object tracking on dense LiDAR point clouds but also outperforms existing fully supervised scene flow estimation methods. The code is open-sourced at https://github.com/FuJingyun/IterFlow.

2605.14926 2026-05-22 cs.CV 版本更新

SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation

Hanxu Zhang, Chen Jia, Hui Liu, Xu Cheng, Fan Shi, Shengyong Chen

AI总结 本文提出了一种高效的SCRWKV网络,通过新颖的结构场编码器和轻量级解码器,实现结构裂缝拓扑分割的高精度建模,其在多个复杂纹理和严重干扰的基准测试中表现出色,参数量仅为1.22M,达到了84.28%的F1分数和85.12%的mIoU。

Comments Accepted by ICML2026

详情
AI中文摘要

实现跨多样场景的结构裂缝像素级准确分割仍然是一个严峻的挑战。现有方法在平衡裂缝拓扑建模与计算效率之间面临显著瓶颈,往往无法在高分割质量与低资源需求之间取得平衡。为了解决这些限制,我们提出了Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV),一种通过新颖的Structure-Field Encoder (SFE) backbone实现高精度建模的网络,同时保持线性复杂度。SFE集成了Adaptive Multi-scale Cascaded Modulator (AMCM)以增强纹理表示,并利用Structure-Calibrated Insight Unit (SCIU)作为其核心引擎。具体而言,SCIU采用Geometry-guided Bidirectional Structure Transformation (GBST)来捕捉拓扑相关性,并将Dynamic Self-Calibrating Decay (DSCD)整合到Dy-WKV中以抑制噪声传播。此外,我们引入了一种轻量级的Cross-Scale Harmonic Fusion (CSHF)解码器以实现精确的特征聚合。系统评估表明,在多个具有复杂纹理和严重干扰的基准测试中,仅拥有1.22M参数的SCRWKV显著优于现有最佳方法。在TUT数据集上,该模型达到了84.28%的F1分数和85.12%的mIoU,证明了其在高效现实部署中的鲁棒潜力。代码可在https://github.com/zhxhzy/SCRWKV上获取。

英文摘要

Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at https://github.com/zhxhzy/SCRWKV.

2605.08389 2026-05-22 cs.CV cs.AI 版本更新

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

解耦端点与语义转换学习以实现零样本复合图像检索

Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang, Jian-Fang Hu, Jianhuang Lai

发表机构 * Sun Yat-sen University(中山大学) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing(机器智能与先进计算重点实验室)

AI总结 本文提出了一种解耦端点与语义转换学习的方法DeCIR,用于零样本复合图像检索,通过构造配对的正向/反向编辑元组,训练独立的低秩文本适配器分支,并利用低秩方向合并(LRDM)将它们合并为一个可部署的适配器,从而提升了投影基于的零样本复合图像检索性能。

详情
AI中文摘要

零样本复合图像检索(ZS-CIR)在不依赖人工标注的CIR三元组的情况下,从参考图像和文本修改中检索目标图像。基于投影的ZS-CIR方法因其不依赖LLM并在推理时保持轻量而具有吸引力,但它们在复杂语义修改上往往表现不佳。这一差距反映了基于投影的ZS-CIR中的语义转换瓶颈:端点级匹配可以让编辑文本作为目标侧的属性线索,而不是作为源条件的语义转换。我们进一步表明,将语义转换监督添加到相同的文本适配器中会创建端点对齐与语义转换对齐之间的冲突。为了解决这一冲突,DeCIR解耦端点与转换学习。它从图像-标题对中构建配对的正向/反向编辑元组,训练独立的低秩文本适配器分支用于端点对齐和语义转换对齐,并将它们通过低秩方向合并(LRDM)合并为一个可部署的适配器。在CIRR、CIRCO、FashionIQ和GeneCIS上的大量实验表明,DeCIR在不增加推理复杂性的情况下,一致提升了基于投影的ZS-CIR性能。

英文摘要

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint--transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

2605.05749 2026-05-22 cs.CV 版本更新

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

具有自适应更新的射线感知指针记忆用于流式3D重建

Feifei Li, Qi Song, Chi Zhang, Rui Huang

发表机构 * The Chinese University of HongKong, Shenzhen(香港中文大学(深圳)) Tsinghua University(清华大学)

AI总结 本文提出了一种射线感知指针记忆,用于流式3D重建,通过统一的记忆表示模型同时建模空间位置和视角方向,采用自适应指针更新策略以保留信息性指针并丢弃冗余指针,从而提高长期重建稳定性和相机姿态精度。

详情
AI中文摘要

从连续图像流中进行密集3D重建需要准确的几何聚合和稳定的长期内存管理。最近的前馈重建框架通过持久内存表示整合观测,但大多数方法在更新内存时主要依赖基于外观的相似性。这种基于外观的整合常常导致在视角变化时出现观测冗余和不稳定的几何结构。在本文中,我们提出了一种用于流式3D重建的射线感知指针记忆,该记忆在统一的记忆表示中显式建模空间位置和视角方向。每个内存指针存储其3D位置、关联的射线方向和特征嵌入,使系统能够联合推理几何接近性和视角一致性。基于此表示,我们引入了一种自适应指针更新策略,将传统的融合基记忆压缩替换为保留或替换机制。而不是平均附近的观测,系统会选择性地保留信息性指针并丢弃冗余的,从而在保持内存增长有限的同时保留独特的几何结构。此外,对空间距离和射线方向差异的联合推理使系统能够统一区分局部冗余、新观测和潜在的环路重访。当检测到环路候选时,会触发姿态细化以强制在重建中保持全局几何一致性。大量实验表明,所提出的射线感知记忆设计显著提高了长期重建的稳定性和相机姿态精度,同时保持了高效的流式推理。我们的方法提供了一个系统的方法框架,用于可扩展且抗漂移的在线3D重建。

英文摘要

Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
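
The retain-or-replace idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's algorithm: the `score` field (a stand-in for how informative a pointer is), the distance `radius`, and the `max_angle` threshold are hypothetical, and loop-revisit detection with pose refinement is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class Pointer:
    position: tuple   # 3D location of the observation
    ray: tuple        # unit viewing direction
    score: float      # hypothetical informativeness measure

def ray_angle(u, v):
    """Angle in radians between two unit direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def update_memory(memory, new, radius=0.1, max_angle=math.radians(15)):
    """Insert `new` into `memory` without averaging observations.

    A stored pointer is treated as redundant with `new` only when it is
    both spatially close and seen from a similar direction. In that case
    exactly one of the two is kept; otherwise `new` is appended.
    """
    for i, old in enumerate(memory):
        close = math.dist(old.position, new.position) <= radius
        same_view = ray_angle(old.ray, new.ray) <= max_angle
        if close and same_view:
            if new.score > old.score:
                memory[i] = new          # replace the weaker pointer
                return "replaced"
            return "retained"            # keep the stored pointer, drop `new`
    memory.append(new)                   # novel location or novel viewpoint
    return "added"
```

Because redundant observations never increase the pointer count, memory grows only when a new location or a new viewing direction appears, which is the bounded-growth property the abstract describes.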

2605.02784 2026-05-22 cs.CV 版本更新

HumanSplatHMR: Closing the Loop Between Human Mesh Recovery and Gaussian Splatting Avatar

HumanSplatHMR:打通人体网格恢复与高斯泼溅虚拟化身之间的闭环

Yeheng Zong, Pou-Chun Kung, Yike Pan, Seth Isaacson, Yizhou Chen, Ram Vasudevan, Katherine A. Skinner

发表机构 * University of Michigan(密歇根大学)

AI总结 本文提出HumanSplatHMR方法,通过在几何姿态估计与可微渲染之间形成闭环,联合优化人体姿态恢复与高斯泼溅虚拟化身的学习,提升新视角和新姿态下的渲染质量。

Comments Project page: https://scottyehengz.github.io/HumanSplat/

详情
AI中文摘要

从视频中准确恢复人体姿态和外观是场景重建的关键组成部分,可应用于动作捕捉、动作预测、虚拟现实和数字孪生等领域。尽管从视频构建逼真人体虚拟化身已受到广泛关注,本文证明现有方法无法准确恢复人体的3D几何结构。基于ViT的方法并非始终可靠,且可能过拟合于2D视角;而基于NeRF和高斯泼溅的虚拟化身将姿态与外观分开处理,限制了对新姿态的渲染泛化能力。为解决这些问题,本文提出HumanSplatHMR,一种联合优化框架,在细化3D人体姿态的同时学习高保真虚拟化身,以实现新视角和新姿态的合成。我们的关键见解是在几何姿态估计与可微渲染之间形成闭环。以往的人体虚拟化身方法依赖通过运动捕捉系统或离线细化获得的准确人体姿态,这在真实野外场景中并不实用;与之不同,我们的方法仅使用最先进的人体姿态估计器给出的人体网格估计,以更好地反映现实条件。因此,HumanSplatHMR不再仅将人体姿态用作变形先验,而是通过可微渲染器将光度、分割和深度损失反向传播到姿态参数和全局位置。这种耦合随时间细化全局3D姿态,提高精度和对齐程度,同时产生更高质量的新视角渲染。实验表明,与省略图像级细化的姿态恢复基线以及将姿态估计与虚拟化身重建解耦的基线相比,该方法均取得一致的改进。

英文摘要

Accurately recovering human pose and appearance from video is an essential component of scene reconstruction, with applications to motion capture, motion prediction, virtual reality, and digital twinning. Despite significant interest in building realistic human avatars from video, this paper demonstrates that existing methods do not accurately recover the 3D geometry of humans. ViT-based approaches are not consistently reliable and can overfit to 2D views, while NeRF- and Gaussian Splatting-based avatars treat pose and appearance separately, limiting rendering generalization to new poses. To resolve these shortcomings, this paper proposes HumanSplatHMR, a joint optimization framework that refines 3D human poses while simultaneously learning a high-fidelity avatar for novel-view and novel-pose synthesis. Our key insight is to close the loop between geometric pose estimation and differentiable rendering. Unlike prior human avatar methods that rely on accurate human pose obtained through motion capture systems or offline refinement, which are impractical in in-the-wild scenarios, our approach uses only human mesh estimates from a state-of-the-art human pose estimator to better reflect real-world conditions. Therefore, instead of using the human pose only as a deformation prior, HumanSplatHMR backpropagates photometric, segmentation, and depth losses through a differentiable renderer to the pose parameters and global position. This coupling refines the global 3D pose over time, improving accuracy and alignment while producing better renderings from novel views. Experiments show consistent improvements over pose recovery baselines that omit image-level refinement and avatar baselines that decouple pose estimation from avatar reconstruction.

2605.02098 2026-05-22 cs.CV 版本更新

From Spherical to Gaussian: A Comparative Analysis of Point Cloud Cropping Strategies in Large-Scale 3D Environments

从球形到高斯:在大规模3D环境中点云裁剪策略的比较分析

Maximilian Kellner, Dominik Merkle, Michael Brunklaus, Alexander Reiterer

发表机构 * Fraunhofer Institute for Physical Measurement Techniques IPM(弗劳恩霍夫物理测量技术研究所IPM) University of Freiburg, Department of Sustainable Systems Engineering INATECH(弗赖堡大学可持续系统工程系INATECH)

AI总结 本文比较了点云裁剪策略,提出了一种新的方法以提高大规模3D环境中的模型性能,特别是在户外场景中取得了新的最佳成果。

详情
AI中文摘要

大规模3D点云可能包含数亿个点。即使经过下采样,这些点云对于现代3D神经网络来说仍然太大。为了形成对场景的语义理解,点云被划分为可处理的更小子云。通常,这种划分通过球形裁剪完成,导致周围几何上下文的丢失。为了解决这个问题,我们提出了替代方法,在保持点数相近的同时产生裁剪范围更大的子云。具体来说,我们将指数、高斯和线性裁剪方法与球形方法进行比较。我们使用多个室内和室外环境数据集评估了三种3D深度学习模型架构。结果表明,改变裁剪策略可以提高模型性能,尤其是在大规模室外场景中,并取得了新的最先进结果。代码可在https://github.com/mvg-inatech/point_cloud_cropping获取。

英文摘要

Large-scale 3D point clouds can consist of hundreds of millions of points. Even after downsampling, these point clouds are too large for modern 3D neural networks. In order to develop a semantic understanding of the scene, the point clouds are divided into smaller subclouds that can be processed. Typically, this division is done using spherical crops, resulting in a loss of surrounding geometric context. To address this issue, we propose alternative methods that produce subclouds with larger crop sizes while maintaining a similar number of points. Specifically, we compare exponential, Gaussian, and linear cropping methods with the spherical method. We evaluated three 3D deep learning model architectures using multiple indoor and outdoor environment datasets. Our results demonstrate that altering the cropping strategy can enhance model performance, especially for large-scale outdoor scenes, yielding new state-of-the-art results. Code is available at https://github.com/mvg-inatech/point_cloud_cropping
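To make the contrast between a hard spherical crop and a distance-weighted crop concrete, here is a small sketch. The function names, the `sigma` parameter, and the exact Gaussian keep-probability are illustrative assumptions; the repository linked above holds the authors' actual implementation.

```python
import math
import random

def spherical_crop(points, center, radius):
    """Keep every point within `radius` of `center` (hard cutoff)."""
    return [p for p in points if math.dist(p, center) <= radius]

def gaussian_keep_prob(distance, sigma):
    """Keep probability that decays smoothly with distance (assumed form)."""
    return math.exp(-0.5 * (distance / sigma) ** 2)

def gaussian_crop(points, center, sigma, rng=None):
    """Sample points with a Gaussian falloff: dense near the center,
    progressively sparser further out."""
    rng = rng or random.Random(0)
    return [p for p in points
            if rng.random() < gaussian_keep_prob(math.dist(p, center), sigma)]
```

With a suitable `sigma`, the Gaussian crop spends a similar point budget on a wider area, trading local density for surrounding context, which is the trade-off the abstract evaluates.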

2605.00392 2026-05-22 cs.CV cs.LG 版本更新

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

RTPrune: 两次阅读启发的令牌修剪用于高效DeepSeek-OCR推理

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 本文提出RTPrune,一种针对DeepSeek-OCR的两阶段令牌修剪方法,通过优先保留高范数视觉令牌并利用最优传输理论进行令牌配对和合并,从而在OCR任务中实现更高效的推理性能和更优的效率-精度权衡。

Comments 21 pages, accepted by ICML2026

详情
AI中文摘要

DeepSeek-OCR利用视觉-文本压缩来减少长文本处理成本并加速推理,但视觉令牌仍然容易出现冗余的文本和结构信息。此外,当前用于传统视觉-语言模型(VLMs)的令牌修剪方法由于不恰当的压缩机制而无法保持文本保真度。通过分析DeepSeek-OCR的解码过程,我们发现了一种独特的双阶段阅读轨迹:模型最初优先处理大多数高范数令牌,然后随后重新分配其注意力到剩余的令牌上。受此启发,我们提出RTPrune,一种专为DeepSeek-OCR设计的双阶段令牌修剪方法。在第一阶段,我们优先保留捕捉显著文本和结构信息的高范数视觉令牌。在第二阶段,剩余的令牌基于最优传输理论进行配对和合并,以实现高效的特征聚合。我们进一步引入了一个动态修剪比率,以适应令牌相似性和文本密度,从而在OCR任务中实现更优的效率-精度权衡。广泛的实验表明,RTPrune在OmniDocBench上实现了99.47%的准确率和1.23倍更快的prefill速度,当应用于DeepSeek-OCR-Large时,仅保留84.25%的令牌。

英文摘要

DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
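The two stages can be sketched as follows. This is only an illustration under stated assumptions: the fixed `keep_ratio` stands in for the paper's dynamic pruning ratio, and greedy cosine-similarity pairing is swapped in for the optimal-transport matching, which is not reproduced here.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b) + 1e-12)

def two_stage_prune(tokens, keep_ratio=0.5):
    """Stage 1 keeps the highest-norm tokens; stage 2 greedily pairs the
    rest by cosine similarity and averages each pair (a simple stand-in
    for optimal-transport matching)."""
    order = sorted(range(len(tokens)), key=lambda i: norm(tokens[i]), reverse=True)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    kept = [tokens[i] for i in sorted(order[:n_keep])]   # preserve input order
    rest = [tokens[i] for i in sorted(order[n_keep:])]

    merged = []
    while len(rest) >= 2:
        a = rest.pop(0)
        j = max(range(len(rest)), key=lambda k: cosine(a, rest[k]))
        b = rest.pop(j)
        merged.append([(x + y) / 2 for x, y in zip(a, b)])
    merged.extend(rest)  # an unpaired leftover token is kept as-is
    return kept + merged
```

High-norm tokens pass through untouched, so the salient text tokens are never blurred by averaging; only the low-norm remainder is compressed.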

2604.28177 2026-05-22 cs.CV cs.CY 版本更新

AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

AEGIS:一个评估人工智能生成学术图像取证分析的综合基准

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文提出AEGIS基准,通过七个学术类别和39个细粒度子类型覆盖,揭示了人工智能生成学术图像取证分析的内在难度,同时评估了多种模型在检测、推理和定位方面的性能,揭示了不同模型家族的互补优势。

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

我们介绍了AEGIS,一个用于评估人工智能生成学术图像取证分析的综合基准。与现有基准相比,AEGIS有三个关键改进:(1)领域特定复杂性:涵盖七个学术类别和39个细粒度子类型,暴露了内在的取证难度,其中即使GPT-5.1的整体性能也仅为48.80%,而专家模型只能达到有限的定位精度(IoU 30.09%);(2)多样化的伪造模拟:在25种生成模型中建模四种普遍的学术伪造策略,其中11种模型的平均取证准确率低于50%,表明取证技术落后于生成技术的发展;(3)多维取证评估:共同评估检测、推理和定位,揭示了不同模型家族之间的互补优势,其中多模态大语言模型(MLLMs)在文本伪影识别上的准确率高达84.74%,专家检测器在二元真实性检测上的最高准确率为79.54%。通过评估25种领先的MLLMs、九个专家模型和一个统一的多模态理解和生成模型,AEGIS成为了一个诊断测试平台,揭示了学术图像取证分析中的根本性限制。

英文摘要

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.

2604.20665 2026-05-22 cs.CV cs.AI 版本更新

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

视见之代价:在单体范式内实现可信的多模态推理

Karan Goyal

发表机构 * IIIT Delhi, India(印度德里英德拉普拉斯塔信息技术学院)

AI总结 本文提出了一种新的多模态评估方法,通过信息论视角揭示了多模态推理中的视见代价问题,提出了三个新指标并提出了语义充分性准则,挑战了传统多模态评估方法。

Comments Addresses practical viability of Vlabel construction. Writing is grounded. Acknowledgement is duly added

详情
AI中文摘要

视觉语言模型(VLMs)的快速普及常被视为实现统一多模态知识发现的途径,但其背后存在一个未经充分检验的假设:当前VLMs能够忠实地综合多模态数据。我们认为它们往往做不到,这一差距反映了主流的视觉编码器-投影器-LLM范式中的可信性问题。最先进的模型并非从视觉输入中提取有依据的知识,而是经常表现出功能性失明,即利用强大的语言先验绕过严重的视觉表示瓶颈。在本文中,我们质疑传统的多模态评估方法,该方法依赖于数据消融或新数据集构建,因而将数据集偏差与架构能力不足混为一谈。我们提出一种信息论视角的新思路:模态翻译协议,旨在量化我们所称的视见代价。通过翻译语义负载而非将其消融,我们提出了三个新指标,即视见的通行费(Toll, ToS)、诅咒(Curse, CoS)和谬误(Fallacy, FoS),并最终得出语义充分性准则(SSC)。此外,我们假设存在多模态扩展的分歧定律:随着底层语言引擎扩展到前所未有的推理能力,视觉知识瓶颈带来的惩罚可能增加而非减少。我们主张社区不应再将“多模态增益”作为主要评估目标。通过将SSC从被动的诊断约束提升为主动的架构蓝图,我们为引导下一代人工智能系统走向真正的多模态推理提供了基础。

英文摘要

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.

2604.08295 2026-05-22 cs.AI cs.CV 版本更新

U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

U-CECE:一个通用的多分辨率框架用于概念反事实解释

Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

发表机构 * Artificial Intelligence and Learning Systems (AILS) laboratory, National Technical University of Athens(雅典国立技术大学人工智能与学习系统(AILS)实验室)

AI总结 本文提出U-CECE框架,旨在解决概念反事实方法在表达性和效率之间的权衡问题,通过多分辨率层次结构提供不同层次的解释能力,并在不同数据集上验证了其效率与表达性的平衡。

详情
AI中文摘要

随着AI模型日益复杂,可解释性对于建立信任至关重要,然而基于概念的反事实方法仍面临表达性与效率之间的权衡。将底层概念表示为原子集合虽然快速,但忽略了关系上下文;完整的图表示更加忠实,却需要求解NP难的图编辑距离(GED)问题。我们提出U-CECE,一个统一的、模型无关的多分辨率框架,用于概念反事实解释,能够适应不同的数据条件和计算预算。U-CECE涵盖三个层次的表达性:用于宽泛解释的原子概念、用于简单交互的关系型“集合的集合”,以及用于完整语义结构的结构图。在结构层,框架同时支持基于监督图神经网络(GNNs)的精度导向直推模式和基于无监督图自编码器(GAEs)的可扩展归纳模式。在结构差异显著的CUB和Visual Genome数据集上的实验刻画了各层次之间的效率-表达性权衡,而人类调查和基于LVLM的评估表明,检索到的结构反事实与基于精确GED的真值解释在语义上等价,且常常更受青睐。

英文摘要

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
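At the atomic level, a conceptual counterfactual can be pictured as the cheapest set of concept edits that turns a query into an instance of another class. The sketch below assumes unit-cost insertions and deletions (the symmetric difference) and uses hypothetical names; the framework's actual concept distances are not given in the abstract.

```python
def concept_edit_cost(source, target):
    """Number of concept insertions plus deletions needed to turn
    `source` into `target` (size of the symmetric difference)."""
    return len(source ^ target)

def nearest_counterfactual(query, query_label, dataset):
    """Return the cheapest differently-labelled instance and its edits."""
    candidates = [(concepts, label) for concepts, label in dataset
                  if label != query_label]
    concepts, label = min(candidates,
                          key=lambda c: concept_edit_cost(query, c[0]))
    return {"label": label,
            "add": concepts - query,
            "remove": query - concepts}
```

Set operations like these are cheap but see no relations between concepts, which is the limitation the relational and graph levels are meant to address.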

2604.07180 2026-05-22 cs.CV cs.AI 版本更新

Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

基于能量的组织流形用于纵向多参数MRI分析

Kartikay Tehlan, Lukas Förner, Sina Wendrich, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

发表机构 * Dept. of diagnostic and interventional Radiology and Neuroradiology, University Hospital Augsburg, Germany(诊断与介入放射科和神经放射科,奥格斯堡大学医院,德国) Digital Medicine, University Hospital Augsburg, Germany(数字医学,奥格斯堡大学医院,德国) Chair for Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich, Germany(计算机辅助医疗程序与增强现实教席,慕尼黑工业大学,德国) Bavarian Center for Cancer Research (BZKF) Augsburg, Germany(巴伐利亚癌症研究中心(BZKF)奥格斯堡,德国) Dept. of Pediatrics and Adolescent Medicine, University Hospital Augsburg, Germany(儿科和青少年医学科,奥格斯堡大学医院,德国) Center for Advanced Analytics and Predictive Sciences, University of Augsburg, Germany(高级分析与预测科学中心,奥格斯堡大学,德国)

AI总结 本文提出了一种基于患者特定能量建模的几何框架,用于纵向多参数MRI分析,通过训练紧凑的隐式神经表示来学习能量函数,为组织状态提供微分几何描述,无需分割标签,展示了患者特定能量流形在纵向mpMRI分析中的应用潜力。

Comments The code is available at https://github.com/tkartikay/EnFold-MRI

详情
AI中文摘要

我们提出了一种用于纵向多参数MRI分析的几何框架,其基础是序列空间中的患者特定能量建模。该方法不使用空间网络直接处理图像,而是将每个体素表示为其多序列强度向量(T1、T1c、T2、FLAIR、ADC),并通过去噪分数匹配训练紧凑的隐式神经表示,从单次基线扫描中学习定义在R^d上的能量函数E_θ(u)。学习到的能量景观无需分割标签即可为组织状态提供微分几何描述:局部极小值定义组织盆地,梯度大小反映与状态边界的接近程度,拉普拉斯曲率刻画局部约束结构。重要的是,该基线能量流形被视为固定的几何参考:它编码了诊断时观察到的对比度组合,并且在随访时不重新训练。因此,纵向评估被表述为相对于该基线几何对后续扫描进行评估。我们不比较解剖分割,而是分析MRI序列向量的分布在基线能量函数下如何演变。在一例后来复发的儿童病例中,随访扫描在出现明确的放射学复发之前,就显示出能量的渐进性偏离以及序列空间中朝向基线肿瘤相关状态的方向性位移。在一例病情稳定的病例中,体素分布仍局限于已建立的低能盆地内,没有系统性漂移。所展示的病例作为概念验证表明,患者特定能量流形可以作为纵向mpMRI分析的几何参考系统,而无需显式分割或监督分类,为进一步研究神经肿瘤学中基于流形的风险组织追踪提供了基础。

英文摘要

We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_θ(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.
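For readers unfamiliar with denoising score matching, one common form of the objective for an energy model is shown below. The Gaussian corruption with a single noise scale $\sigma$ is an assumption made for illustration; the abstract does not state which variant or noise schedule is used.

```latex
\tilde{\mathbf{u}} = \mathbf{u} + \sigma\,\boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\qquad
\mathcal{L}(\theta) = \mathbb{E}_{\mathbf{u},\,\boldsymbol{\epsilon}}
\left[ \left\| \nabla_{\tilde{\mathbf{u}}} E_\theta(\tilde{\mathbf{u}})
- \frac{\tilde{\mathbf{u}} - \mathbf{u}}{\sigma^{2}} \right\|_2^2 \right]
```

Minimizing this loss makes the energy gradient at a noisy point $\tilde{\mathbf{u}}$ point away from the clean intensity vector $\mathbf{u}$, so observed contrast combinations end up in low-energy regions. That is consistent with the abstract's reading of local minima as tissue basins.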

2603.18003 2026-05-22 cs.CV 版本更新

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

通过可微渲染和多模态大语言模型实现通用骨架理解

Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School, China(通用人工智能国家重点实验室,北京大学深圳研究生院,中国) Tencent(腾讯) Nanjing University of Aeronautics(南京航空航天大学)

AI总结 本文提出SkeletonLLM,通过可微渲染将任意骨架序列转换为多模态大语言模型的原生视觉模态,实现通用骨架理解,同时引入协同训练策略提升推理能力,展示了在开放词汇动作识别中的强泛化能力,并扩展到异构骨架格式的运动描述和问答任务。

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言推理方面表现出色,但无法处理结构化非视觉数据如人体骨架。现有方法要么将骨架动力学压缩成有损特征向量以进行文本对齐,要么将运动量化为离散标记,但这些方法在异构骨架格式上泛化能力较差。我们提出了SkeletonLLM,通过将任意骨架序列转换为MLLM的本机视觉模态实现通用骨架理解。其核心是DrAction,一种可微、格式无关的渲染器,将骨骼运动学转换为紧凑的图像序列。由于整个流程是端到端可微的,MLLM的梯度可以直接引导渲染以生成任务相关信息的视觉标记。为进一步增强推理能力,我们引入了协同训练策略:因果推理蒸馏将结构化的逐步推理从教师模型转移过来,而判别微调则增强可混淆动作之间的决策边界。SkeletonLLM在开放词汇动作识别中表现出强泛化能力,其学习的推理能力自然扩展到异构骨架格式的运动描述和问答任务——表明了将MLLM应用于非本机模态的可行路径。代码:https://github.com/wangzy01/SkeletonLLM。

英文摘要

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet cannot process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization in open-vocabulary action recognition, while its learned reasoning capabilities naturally extend to motion captioning and question answering across heterogeneous skeleton formats -- suggesting a viable path for applying MLLMs to non-native modalities. Code: https://github.com/wangzy01/SkeletonLLM.

2603.08403 2026-05-22 cs.CV 版本更新

SPIRAL: Self-Evolving Action-Conditioned Video Generation via Reflective Planning Agents

SPIRAL:通过反思规划代理实现自演化动作条件视频生成

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Liang Lv, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

发表机构 * Zhejiang University(浙江大学) KnowledgeXLab at Shanghai AI Lab(上海人工智能实验室KnowledgeXLab) National University of Singapore(新加坡国立大学) Chinese Academy of Sciences(中国科学院) Tencent Youtu Lab(腾讯优图实验室) Nanyang Technological University(南洋理工大学) Wuhan University(武汉大学)

AI总结 本文提出SPIRAL框架,通过反思规划代理实现长时域动作条件视频生成,解决传统方法在长时间视频生成中的不足,通过闭环设计和自演化机制提升视频生成的一致性和准确性。

Comments 42 Pages, 21 Figures, Project page at https://yuyang-cloud.github.io/spiral

详情
AI中文摘要

长时域动作条件视频生成旨在合成符合复杂动作指令的时序一致视频,要求过程有序、持续执行动作和场景一致,超越传统TI2V的短时精度。现有单次视频生成模型通常采用开环方式,导致动作执行不完整、幻觉运动和时间漂移。为解决此问题,我们提出SPIRAL,一种闭环框架,通过顺序规划和迭代反思进行动作条件长时域视频生成。具体而言,SPIRAL实现一个思考-行动-反思过程:PlanAgent将高层目标分解为子动作,这些动作条件VideoGenerator生成每个片段并伴随记忆上下文,同时CriticAgent评估中间视频片段以提供迭代优化的反馈。此闭环设计进一步通过利用PlanAgent提出的行为和CriticAgent得出的奖励进行GRPO基于的后训练,以增强视频生成器的长时域一致性。此外,我们引入ActVideoGen-Dataset用于任务特定训练,并建立ActVideoGen-Bench作为专用评估套件,用于衡量动作质量和时间一致性。在多个TI2V后端和自演化策略下的实验显示,在ActVideoGen-Bench和VBench上均取得一致提升,证明了SPIRAL的有效性。

英文摘要

Long-horizon action-conditioned video generation aims to synthesize temporally coherent videos that follow complex action instructions over extended horizons, requiring procedural ordering, persistent action execution, and scene consistency beyond conventional TI2V's short-term fidelity. Existing single-shot video generation models typically operate in an open-loop manner, leading to incomplete action execution, hallucinated motions, and temporal drift. To address this, we propose SPIRAL, a closed-loop framework that performs sequential planning and iterative reflection for action-conditioned long-horizon video generation. Specifically, SPIRAL instantiates a think-act-reflect process: a PlanAgent decomposes high-level goals into sub-actions, which condition a VideoGenerator to synthesize each segment alongside a memory context, while a CriticAgent evaluates intermediate video segments to provide corrective feedback for iterative refinement. This closed-loop design further supports self-evolution by utilizing PlanAgent-proposed actions and CriticAgent-derived rewards for GRPO-based post-training to enhance the video generator's long-horizon consistency. Moreover, we introduce ActVideoGen-Dataset for task-specific training, and establish ActVideoGen-Bench as a dedicated evaluation suite for measuring action quality and temporal coherence. Experiments across multiple TI2V backbones alongside the self-evolving strategy show consistent gains on ActVideoGen-Bench and VBench, demonstrating the effectiveness of SPIRAL.

2602.23833 2026-05-22 eess.IV cs.CV 版本更新

Revisiting Integration of Image and Metadata for DICOM Series Classification: Cross-Attention and Dictionary Learning

重新审视图像与元数据整合用于DICOM系列分类:交叉注意力与字典学习

Tuan Truong, Melanie Dohmen, Sara Lorio, Matthias Lenga

发表机构 * Bayer AG(拜耳公司)

AI总结 本文提出了一种端到端的多模态框架,用于DICOM系列分类,通过联合建模图像内容和获取元数据,显式考虑异质切片内容、可变系列长度和完全缺失、不完整或不一致的DICOM元数据等挑战。

Comments Early acceptance at MICCAI 2026

详情
AI中文摘要

自动化识别DICOM图像系列对于大规模医学图像分析、质量控制、协议标准化和可靠后续处理至关重要。然而,由于异质切片内容、可变系列长度和完全缺失、不完整或不一致的DICOM元数据,DICOM系列分类仍具挑战性。我们提出了一种端到端的多模态框架,用于DICOM系列分类,该框架联合建模图像内容和获取元数据,同时显式考虑这些挑战。(i)图像和元数据通过模态感知模块编码,并使用双向跨模态注意力机制融合。(ii)元数据通过基于可学习特征字典和值条件调制的稀疏、缺失感知编码器进行处理。通过设计,该方法不需要任何形式的填补。(iii)系列长度和图像数据维度的变化通过2.5D视觉编码器和在等距采样的切片上操作的注意力机制来处理。我们评估了所提出的方法在公开可用的Duke Liver MRI数据集和一个大型多机构内部队列上的表现,评估了域内性能和域外泛化能力。在所有评估设置中,所提出的方法一致优于相关的仅图像、仅元数据和多模态2D/3D基线。结果表明,显式建模元数据稀疏性和跨模态交互提高了DICOM系列分类的鲁棒性。

英文摘要

Automated identification of DICOM image series is essential for large-scale medical image analysis, quality control, protocol harmonization, and reliable downstream processing. However, DICOM series classification remains challenging due to heterogeneous slice content, variable series length, and entirely missing, incomplete, or inconsistent DICOM metadata. We propose an end-to-end multimodal framework for DICOM series classification that jointly models image content and acquisition metadata while explicitly accounting for all these challenges. (i) Images and metadata are encoded with modality-aware modules and fused using a bi-directional cross-modal attention mechanism. (ii) Metadata is processed by a sparse, missingness-aware encoder based on learnable feature dictionaries and value-conditioned modulation. By design, the approach does not require any form of imputation. (iii) Variability in series length and image data dimensions is handled via a 2.5D visual encoder and attention operating on equidistantly sampled slices. We evaluate the proposed approach on the publicly available Duke Liver MRI dataset and a large multi-institutional in-house cohort, assessing both in-domain performance and out-of-domain generalization. Across all evaluation settings, the proposed method consistently outperforms relevant image-only, metadata-only, and multimodal 2D/3D baselines. The results demonstrate that explicitly modeling metadata sparsity and cross-modal interactions improves robustness for DICOM series classification.
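Point (iii) relies on picking a fixed number of slices at equal spacing from series of varying length. A minimal sketch of such sampling follows; the function name and the handling of short series and of `k == 1` are assumptions, not details taken from the paper.

```python
def equidistant_indices(n_slices, k):
    """Pick k slice indices spread evenly over a series of n_slices."""
    if n_slices <= 0 or k <= 0:
        return []
    if n_slices <= k:
        return list(range(n_slices))   # short series: keep every slice
    if k == 1:
        return [n_slices // 2]         # single pick: take the middle slice
    step = (n_slices - 1) / (k - 1)    # spacing that includes both ends
    return [round(i * step) for i in range(k)]
```

For example, `equidistant_indices(10, 4)` returns `[0, 3, 6, 9]`, so a long series and a short one both reach the encoder as a bounded set of slices.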

2602.23231 2026-05-22 cs.CV 版本更新

Skarimva: Skeleton-based Action Recognition is a Multi-view Application

Skarimva:基于骨架的动作识别是一种多视图应用

Daniel Bermuth, Alexander Poeppel, Wolfgang Reif

发表机构 * Institute for Software and Systems Engineering, University of Augsburg(软件与系统工程研究所,奥格斯堡大学)

AI总结 本文研究了基于骨架的动作识别中多视图应用的重要性,指出通过多摄像头视图三角化获得更准确的3D骨架数据,可以显著提升现有动作识别模型的性能,表明输入数据质量是限制模型性能的关键因素,未来研究应将多视图应用作为标准设置。

详情
AI中文摘要

人类动作识别在开发人机智能交互中起着重要作用。尽管有很多研究致力于改进用于基于骨架的动作识别的机器学习算法,但对输入骨架数据本身质量的关注却很少。本文证明,通过利用多个摄像头视图来三角化更准确的3D骨架,可以显著提高现有动作识别模型的性能。这表明,输入数据的质量目前是这些模型性能的限制因素。基于这些结果,认为在大多数实际应用场景中,使用多个摄像头的成本效益比非常有利,因此未来基于骨架的动作识别研究应将多视图应用作为标准设置。

英文摘要

Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use cases; future research in skeleton-based action recognition should therefore consider multi-view applications as the standard setup.
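As a reminder of what triangulating a joint from two views involves, the sketch below uses the classic midpoint method: each camera contributes a ray (center plus viewing direction), and the joint is estimated as the midpoint of the shortest segment between the two rays. This is a generic textbook technique chosen for illustration, not necessarily the triangulation used in the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + t1*d1 and c2 + t2*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is not observable")
    t1 = (b * e - c * d) / denom      # parameter of closest point on ray 1
    t2 = (a * e - b * d) / denom      # parameter of closest point on ray 2
    p1 = [p + t1 * v for p, v in zip(c1, d1)]
    p2 = [p + t2 * v for p, v in zip(c2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]
```

With noisy 2D detections the two rays no longer intersect, and the midpoint gives a reasonable compromise between them; additional cameras constrain the estimate further.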

2602.20845 2026-05-22 cs.CV 版本更新

FLIM Networks with Bag of Feature Points

具有特征点袋的FLIM网络

João Deltregia Martinelli, Marcelo Luis Rodrigues Filho, Felipe Crispim da Rocha Salvagnini, Gilson Junior Soares, Jefersson A. dos Santos, Alexandre X. Falcão

发表机构 * Institute of Computing, UNICAMP, Campinas, Brazil(坎皮纳斯州立大学计算研究所,巴西坎皮纳斯) School of Computer Science, University of Sheffield, Sheffield, United Kingdom(谢菲尔德大学计算机科学学院,英国谢菲尔德)

AI总结 本文提出FLIM-BoFP,一种更高效的滤波器估计方法,用于显微镜图像中的寄生虫检测,相较于FLIM-Cluster和其他先进基线,在效率、效果和泛化能力上均有优势。

Comments Accepted at the 28th Iberoamerican Congress on Pattern Recognition (CIARP 2025). To appear in Lecture Notes in Computer Science (LNCS), Springer

详情
AI中文摘要

卷积网络需要大量的图像标注,这可能成本高昂且耗时。通过从少量代表性图像上用户绘制的标记中估计编码器滤波器(即核权重),特征学习从图像标记(FLIM)解决了这一挑战,而无需传统优化。这种编码器与自适应解码器结合构成了一个完全训练而无需反向传播的FLIM网络。先前研究已证明其在显著物检测(SOD)中的有效性,比现有轻量模型显著更轻。本研究重新审视FLIM SOD,并引入FLIM-Bag of Feature Points(FLIM-BoFP),一种显著更快的滤波器估计方法。先前方法FLIM-Cluster通过每个编码器块的补丁聚类来推导滤波器,导致计算开销和对滤波器位置的控制减少。FLIM-BoFP通过在输入块进行一次聚类,创建特征点袋,并在所有块上直接从映射的特征点定义滤波器。论文评估了FLIM-BoFP与FLIM-Cluster和其他最先进的基线在寄生虫检测中的效率、效果和泛化能力的益处。

英文摘要

Convolutional networks require extensive image annotation, which can be costly and time-consuming. Feature Learning from Image Markers (FLIM) tackles this challenge by estimating encoder filters (i.e., kernel weights) from user-drawn markers on discriminative regions of a few representative images without traditional optimization. Such an encoder combined with an adaptive decoder comprises a FLIM network fully trained without backpropagation. Prior research has demonstrated their effectiveness in Salient Object Detection (SOD), being significantly lighter than existing lightweight models. This study revisits FLIM SOD and introduces FLIM-Bag of Feature Points (FLIM-BoFP), a considerably faster filter estimation method. The previous approach, FLIM-Cluster, derives filters through patch clustering at each encoder's block, leading to computational overhead and reduced control over filter locations. FLIM-BoFP streamlines this process by performing a single clustering at the input block, creating a bag of feature points, and defining filters directly from mapped feature points across all blocks. The paper evaluates the benefits in efficiency, effectiveness, and generalization of FLIM-BoFP compared to FLIM-Cluster and other state-of-the-art baselines for parasite detection in optical microscopy images.

2602.17517 2026-05-22 cs.CV 版本更新

Depth Augmented and FE Free 3D/2D Liver Registration for Laparoscopic Liver AR

深度增强和无有限元分析的3D/2D肝脏注册用于腹腔镜肝脏AR

Hanyuan Zhang, Lucas He, Runlong He, Weixi Yi, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

发表机构 * UCL Hawkes Institute, University College London, London WC1E 6BT, UK(伦敦大学学院UCL霍克斯研究所) Division of Surgery and Interventional Science, University College London, London WC1E 6BT, UK(伦敦大学学院外科与介入科学系) Unit for Lifelong Health and Ageing at UCL, University College London, London WC1E 7HB, UK(伦敦大学学院终身健康与老龄化研究单元) Medtronic plc., London, UK(美敦力公司,伦敦)

AI总结 本研究提出了一种深度增强且无需有限元分析的3D/2D肝脏注册方法,通过结合鲁棒的刚性初始化和患者特定的非刚性细化,以提高腹腔镜肝脏手术AR中的3D到2D注册精度。

详情
AI中文摘要

增强现实(AR)在腹腔镜肝脏手术中的引导需要准确地将术前3D模型与术中2D视频进行注册,但因部分可见性、镜面反射和组织变形而具有挑战性。现有方法通常依赖于基于轮廓的刚性初始化和有限元(FE)模型进行可变形注册,增加了建模和工程复杂性。我们提出了一种深度增强且无有限元分析的3D-2D注册流程,结合了鲁棒的刚性初始化和患者特定的非刚性细化。对于刚性对齐,我们通过使用多类轮廓图和单目深度来适应FoundationPose的RefineNet模块以适应腹腔镜肝脏场景,以实现相对姿态的细化。对于可变形对齐,我们从非刚性ICP(NICP)对应关系中构建患者特定的统计变形模型,并使用粗到细的L-BFGS-B策略优化姿态和形状参数。在公开的临床腹腔镜肝脏数据集上,所提出的方法在受控的手动轮廓设置下实现了平均目标注册误差(TRE)为14.73毫米。消融研究显示,单目深度在轮廓输入上提高了刚性初始化,而肿瘤映射分析表明良好的表面对齐并不一定转化为更低的目标定位误差。在没有地面真实数据的外部数据集上,该方法产生视觉上合理的叠加以进行定性评估。这些结果表明,深度增强的姿态细化和无有限元分析的统计变形建模为受控的3D-2D肝脏注册在手术AR中提供了一个有前景的替代方案。

英文摘要

Augmented reality (AR) guidance in laparoscopic liver surgery requires accurate registration of preoperative 3D models to intraoperative 2D video, but remains challenging due to partial visibility, specularities, and tissue deformation. Existing methods often rely on contour-based rigid initialization and finite-element (FE) models for deformable registration, increasing modeling and engineering complexity. We present a depth-augmented, FE-free 3D-2D registration pipeline that combines robust rigid initialization with patient-specific non-rigid refinement. For rigid alignment, we adapt the RefineNet module of FoundationPose to laparoscopic liver scenes by using multi-class contour maps and monocular depth for relative pose refinement. For deformable alignment, we construct a patient-specific statistical deformation model from non-rigid ICP (NICP) correspondences and optimize pose and shape parameters using a coarse-to-fine L-BFGS-B strategy. On a public clinical laparoscopic liver dataset, the proposed method achieves a mean target registration error (TRE) of 14.73 mm under a controlled manual-contour setting designed to isolate registration performance. Ablation studies show that monocular depth improves rigid initialization over contour-only inputs, while tumor-mapping analysis indicates that good surface alignment does not necessarily translate into lower target localization error. On an external dataset without ground truth, the method produces visually plausible overlays for qualitative assessment. These results suggest that depth-augmented pose refinement and FE-free statistical deformation modeling provide a promising alternative to FE-based pipelines for controlled 3D-2D liver registration in surgical AR.
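The headline metric, mean target registration error, is the average Euclidean distance between registered target points and their ground-truth positions. A small sketch is given below; the function name and the point-list input format are assumptions for illustration.

```python
import math

def target_registration_error(registered, ground_truth):
    """Mean Euclidean distance (same unit as the inputs, e.g. mm)
    between registered target points and their ground-truth positions."""
    if len(registered) != len(ground_truth) or not registered:
        raise ValueError("need two non-empty point lists of equal length")
    dists = [math.dist(p, q) for p, q in zip(registered, ground_truth)]
    return sum(dists) / len(dists)
```

Because TRE is computed on internal targets rather than on the visible surface, a registration can align the surface well and still score poorly, which matches the tumor-mapping observation in the abstract.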

2602.12952 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Transporting Task Vectors across Different Architectures without Training

在不同架构间传输任务向量而无需训练

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

发表机构 * AImageLab, University of Modena and Reggio Emilia(AImageLab,摩德纳和雷焦艾米利亚大学)

AI总结 本文提出Theseus方法,通过功能匹配在不同宽度模型间传输任务更新,无需训练或反向传播,展示了在视觉和语言模型上的改进效果。

Comments Accepted at the International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

适应大型预训练模型以完成下游任务时,通常会产生针对特定任务的参数更新,这些更新对于每个模型变体重新学习都很昂贵。尽管最近的研究表明,这些更新可以在具有相同架构的模型之间转移,但跨不同宽度的模型转移仍鲜有探索。在本文中,我们引入Theseus,一种无需训练的方法,用于在异构宽度模型间传输任务更新。与其匹配参数,我们通过其在中间表示上诱导的功能效应来表征任务更新。我们正式将任务向量传输定义为在观察到的激活上进行的功能匹配问题,并显示在通过正交Procrustes分析对齐表示空间后,它允许一个稳定的闭式解,该解保留了更新的几何结构。我们在不同宽度的视觉和语言模型上评估Theseus,显示在不进行额外训练或反向传播的情况下,相对于基线有持续的改进。我们的结果表明,当任务身份通过功能而非参数定义时,任务更新可以有意义地在不同架构间转移。代码可在https://github.com/apanariello4/merge-and-rebase获取。

英文摘要

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.
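
For readers unfamiliar with the alignment step, the standard orthogonal Procrustes problem has a well-known closed-form solution, sketched below. The symbols $A$ and $B$ (activation matrices of the two models on the same inputs, assumed here to have equal width) and $R$ are illustrative notation only. How Theseus handles models of different widths is not specified in the abstract and is not shown.

```latex
R^\star = \arg\min_{R:\, R^\top R = I} \lVert A R - B \rVert_F^2,
\qquad
A^\top B = U \Sigma V^\top \;\Longrightarrow\; R^\star = U V^\top .
```

Because $R^\star$ is orthogonal, it preserves norms and angles, which is consistent with the abstract's claim that the transported update keeps its geometry.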

2602.06995 2026-05-22 cs.RO cs.CV cs.IT cs.MA math.IT 版本更新

When Simultaneous Localization and Mapping Meets Wireless Communications: A Survey

当同时定位与建图遇见无线通信:一篇综述

Konstantinos Gounis, Sotiris A. Tegos, Dimitrios Tyrovolas, Panagiotis D. Diamantoulakis, George K. Karagiannidis

发表机构 * Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki(塞萨洛尼基亚里士多德大学电气与计算机工程系)

AI总结 本文综述了SLAM与无线通信交汇领域的最新进展,重点探讨了视觉SLAM(V-SLAM)整合中的双向影响,总结了无线信号传播、几何信道建模、基于射频(RF)的定位与感知等关键概念,以及图像处理技术如何检测地标并预测无线信道的最优路径,同时分析了SLAM与无线通信交叉领域的技术、挑战和未来方向。

详情
AI中文摘要

本文综述了SLAM与无线通信交叉领域的最新进展,梳理了二者之间的双向影响,并重点关注视觉SLAM(V-SLAM)的整合。我们概述了无线信号传播、几何信道建模以及基于射频(RF)的定位与感知等关键概念。此外,我们介绍了能够检测地标并主动预测无线信道最优路径的图像处理技术。本文从多个维度展开讨论,包括SLAM与无线通信交叉领域的前提条件、技术、背景以及未来方向与挑战。我们分析了贝叶斯滤波器、基于特征的位姿估计、考虑感知的运动控制、向量场等信号处理的空间方法,以及若干关键技术要素。我们还介绍了有助于高效获取自主机器人状态的技术和要素。在诸多有意义的发现中,我们观察到单目V-SLAM可受益于射频相关信息,因为后者可以作为解决尺度模糊的代理信息。反过来,5G及未来网络中的无线通信有望受益于作为SLAM核心环节的视觉里程计。此外,我们还考察了相机之外可用于SLAM的其他传感源,并阐述其与无线通信之间的双向关系。最后,联合执行通信与SLAM的一体化方案似乎仍处于起步阶段:需要理论和实践上的进一步突破,才能为射频和多天线技术增添更高层次的定位与语义感知能力。

英文摘要

This paper surveys the state-of-the-art in the nexus of SLAM and Wireless Communications, attributing the bidirectional impact of each with a focus on visual SLAM (V-SLAM) integration. We provide an overview of key concepts related to wireless signal propagation, geometric channel modeling, and radio frequency (RF)-based localization and sensing. In addition to this, we show image processing techniques that can detect landmarks, proactively predicting optimal paths for wireless channels. Several dimensions are considered, including the prerequisites, techniques, background, and future directions and challenges of the intersection between SLAM and wireless communications. We analyze estimation and control approaches such as Bayesian filters, feature-based pose estimation, perception-aware motion control, spatial methods for signal processing such as vector fields, and key technological aspects. We expose techniques and items towards enabling a highly effective retrieval of the autonomous robot state. Among other interesting findings, we observe that monocular V-SLAM would benefit from RF relevant information, as the latter can serve as a proxy for the scale ambiguity resolution. Conversely, we find that wireless communications in the context of 5G and beyond can potentially benefit from visual odometry that is central in SLAM. Moreover, we examine other sources besides the camera for SLAM and describe the twofold relation with wireless communications. Finally, integrated solutions performing joint communications and SLAM appear to be in their infancy: theoretical and practical advancements are required to add higher-level localization and semantic perception capabilities to RF and multi-antenna technologies.

2602.06676 2026-05-22 cs.CV 版本更新

Can We Build a Monolithic Model for Fake Image Detection? SICA: Semantic-Induced Constrained Adaptation for Unified-Yet-Discriminative Artifact Feature Space Reconstruction

我们能否为伪造图像检测构建一个单一模型?SICA:语义诱导约束适应用于统一且具有判别性的伪影特征空间重建

Bo Du, Xiaochen Ma, Xuekang Zhu, Zhe Yang, Chaogun Niu, Chenfan Qu, Mingqi Fang, Zhenming Wang, Jingjing Liu, Jian Liu, Ji-Zhe Zhou

发表机构 * Sichuan University(四川大学) The Hong Kong University of Science and Technology(香港科技大学) University of Science and Technology of China(中国科学技术大学) South China University of Technology(华南理工大学)

AI总结 本文提出了一种新的单体伪造图像检测模型SICA,通过语义诱导约束适应方法,解决伪影特征空间重建的统一与判别性矛盾,实验表明其优于15种现有方法。

详情
AI中文摘要

伪造图像检测(FID)旨在对四个图像取证子领域进行统一检测,在真实取证场景中至关重要。与集成方法相比,单体FID模型在理论上更具前景,但迄今为止在实践中的表现始终较差。在本文中,我们指出了不同子领域间伪影的本质差异,并将这一关键障碍称为“Ji-Zhe现象”。基于这一现象,我们首次诊断出这种表现不佳的原因:伪影特征空间的坍缩。因此,开发实用单体FID模型的核心挑战归结为对伪影特征空间进行“统一且具有判别性”的重建。为应对这一看似矛盾的挑战,我们假设高层语义可以作为重建的结构先验,并进一步提出语义诱导约束适应(SICA),这是首个单体FID范式。在我们的OpenMMSec数据集上进行的大量实验表明,SICA优于15种最先进的方法,并以近似正交的方式重建了目标中统一且具有判别性的伪影特征空间,从而有力地验证了我们的假设。代码和数据集可在 https://github.com/venus-guangjian/SICA_OpenMMSec 获取。

英文摘要

Fake Image Detection (FID), aiming at unified detection across four image forensic subdomains, is critical in real-world forensic scenarios. Compared with ensemble approaches, monolithic FID models are theoretically more promising, but to date, consistently yield inferior performance in practice. In this work, we identify the intrinsic distinctness of artifacts across subdomains, a critical barrier we term the "Ji-Zhe phenomenon". Driven by this phenomenon, we diagnose the cause of this underperformance for the first time: the collapse of the artifact feature space. The core challenge for developing a practical monolithic FID model thus boils down to the "unified-yet-discriminative" reconstruction of the artifact feature space. To address this paradoxical challenge, we hypothesize that high-level semantics can serve as a structural prior for the reconstruction, and further propose Semantic-Induced Constrained Adaptation (SICA), the first monolithic FID paradigm. Extensive experiments on our OpenMMSec dataset demonstrate that SICA outperforms 15 state-of-the-art methods and reconstructs the target unified-yet-discriminative artifact feature space in a near-orthogonal manner, thus firmly validating our hypothesis. The code and dataset are available at: https://github.com/venus-guangjian/SICA_OpenMMSec.

2602.05536 2026-05-22 cs.LG cs.AI cs.CL cs.CV 版本更新

When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

当共享知识有害:模型融合中的谱过积累

Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.(新型软件技术国家重点实验室,南京大学,南京210023,中国。) Institute of Brain-Computer Interface, Nanjing University, Nanjing 210023, China.(脑机接口研究院,南京大学,南京210023,中国。)

AI总结 本文研究了模型融合中共享知识过积累的问题,提出SVC方法通过校准奇异值来恢复谱平衡,提升了模型融合和任务算术的性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

模型融合通过叠加多个微调模型的权重更新,将它们合并为单一模型,为重新训练提供了一种轻量级的替代方案。现有方法主要致力于解决任务更新之间的冲突,而忽视了共享知识被重复计入这一失败模式。我们发现,当各任务共享对齐的谱方向(即重叠的奇异向量)时,简单的线性组合会反复累积这些方向,使奇异值膨胀,并使融合模型偏向共享子空间。为缓解这一问题,我们提出奇异值校准(Singular Value Calibration,SVC),一种无需训练、无需数据的后处理方法,它量化子空间重叠程度,并对膨胀的奇异值重新缩放,以恢复平衡的谱。在视觉和语言基准上,SVC持续提升了强融合基线的性能,并达到了最先进的水平。此外,仅通过修改奇异值,SVC就将任务算术(Task Arithmetic)的性能提高了13.0%。代码可在 https://github.com/lyymuwu/SVC 获取。

英文摘要

Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at https://github.com/lyymuwu/SVC.
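
The over-accumulation effect can be seen in an idealized special case, written out below as a sketch. It assumes that all $T$ task updates share exactly the same singular vectors $u_i, v_i$, which is a stronger condition than the partial overlap discussed in the abstract. The notation is illustrative, and the actual SVC rescaling rule is not reproduced here.

```latex
\Delta W_t = \sum_i \sigma_{t,i}\, u_i v_i^\top
\quad\Longrightarrow\quad
\sum_{t=1}^{T} \Delta W_t = \sum_i \Big( \sum_{t=1}^{T} \sigma_{t,i} \Big)\, u_i v_i^\top .
```

Under this assumption, a direction shared by all $T$ tasks ends up with a singular value equal to the sum of the per-task values, whereas a direction used by a single task keeps its original magnitude. That imbalance is the bias toward shared subspaces that the abstract describes.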

2602.01334 2026-05-22 cs.CV 版本更新

What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

视觉工具使用强化学习究竟学到了什么?解耦裁剪与缩放中的工具诱导效应与内在效应

Yan Ma, Weiyu Zhang, Tianle Li, Linge Du, Xuyang Shen, Pengfei Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) Peking University(北京大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文研究了视觉工具使用强化学习在裁剪与缩放(crop-and-zoom)设置中的学习机制,通过引入MED框架解耦内在能力变化与工具诱导效应,发现性能提升主要由内在学习驱动,而工具使用强化学习主要是减少工具诱导的负面影响,而非真正掌握工具。

Comments ICML 2026 camera ready. Code: https://github.com/GAIR-NLP/Med

详情
AI中文摘要

视觉工具使用强化学习(RL)可以为视觉语言模型配备裁剪与缩放(crop-and-zoom)等视觉操作,并带来显著的性能提升,但尚不清楚这些提升是源于工具使用能力的改进还是内在能力的演变。我们引入MED(测量-解释-诊断,Measure-Explain-Diagnose),一个由粗到细的框架,用于将内在能力变化与工具诱导效应解耦,将工具诱导的性能差异分解为增益项和损害项,并探究驱动二者演变的机制。在裁剪与缩放设置下,我们对两个具有不同工具先验的VLM在六个基准上进行了检查点级分析,发现性能提升主要由内在学习驱动,而工具使用RL主要是减少工具诱导的损害(例如更少的调用引发的错误和更弱的工具模式(schema)干扰),在利用工具纠正内在失败方面进展有限。总体而言,在本文研究的裁剪与缩放设置中,当前的视觉工具使用RL学到的是与工具安全共存,而非掌握工具。

英文摘要

Vision tool-use reinforcement learning (RL) can equip vision language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses in the crop-and-zoom setting on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, in the crop-and-zoom setting studied here, current vision tool-use RL learns to coexist safely with tools rather than master them.
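
One simple way to read the gain/harm decomposition is sketched below. It assumes per-example correctness flags with and without the tool, and defines gain as the fraction of examples the tool fixes and harm as the fraction it breaks. These definitions and the name `tool_effect` are assumptions made for illustration and may differ from the paper's exact formulation.

```python
def tool_effect(correct_without_tool, correct_with_tool):
    """Split the accuracy difference (with tool minus without) into gain and harm."""
    if len(correct_without_tool) != len(correct_with_tool) or not correct_with_tool:
        raise ValueError("need two non-empty flag lists of equal length")
    n = len(correct_with_tool)
    pairs = list(zip(correct_without_tool, correct_with_tool))
    gain = sum(1 for before, after in pairs if not before and after) / n  # fixed by the tool
    harm = sum(1 for before, after in pairs if before and not after) / n  # broken by the tool
    return gain, harm, gain - harm

# Hypothetical per-example outcomes (illustrative values only).
without_tool = [True, True, False, False, True]
with_tool = [True, False, True, True, True]

gain, harm, net = tool_effect(without_tool, with_tool)
# Two of five examples are fixed, one of five is broken, so the net change is one in five.
print(gain, harm, net)
```

Under these definitions the net term equals the accuracy with the tool minus the accuracy without it, so a small net change can hide large, offsetting gain and harm terms.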

2601.23224 2026-05-22 cs.CV 版本更新

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Video-o3:长视频多跳推理的原生交错线索搜索

Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, Ziang Yan, Yi Wang, Hongjie Zhang, Yali Wang, Limin Wang

发表机构 * Nanjing University(南京大学) Shanghai AI Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) SIAT, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 本研究提出Video-o3框架,通过迭代发现显著视觉线索、细粒度检查关键片段以及适应性终止,提升长视频多跳推理能力,实验表明其在MLVU和Video-Holmes上分别达到72.1%和46.5%的准确率。

Comments 27 pages, 15 figures, 15 tables

详情
AI中文摘要

现有用于长视频理解的多模态大语言模型主要依赖均匀采样和单轮推理,限制了其在大量冗余中识别稀疏但关键证据的能力。我们引入Video-o3,一种支持迭代发现显著视觉线索、细粒度检查关键片段以及在获得足够证据后适应性终止的新框架。技术上,我们解决了交错工具调用中的两个核心挑战。首先,为减轻由推理和工具调用异质性引起的注意力分散,我们提出任务解耦注意力掩码,该方法在保持共享全局上下文的同时,隔离每一步的专注。其次,为控制多轮交互中的上下文长度增长,我们引入可验证轨迹引导奖励,平衡探索覆盖与推理效率。为了支持大规模训练,我们进一步开发了数据合成管道,并构建了包含173,000个高质量工具交互轨迹的Seeker-173K数据集。大量实验表明,Video-o3显著优于现有方法,在MLVU上达到72.1%的准确率,在Video-Holmes上达到46.5%的准确率。这些结果展示了Video-o3在长视频场景中的强大多跳证据搜索和推理能力,并验证了原生工具调用的有效性。

英文摘要

Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

2601.20107 2026-05-22 cs.CV cs.CL cs.IR 版本更新

Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

结构锚点剪枝:用于视觉文档检索的无训练多向量压缩

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

发表机构 * Aalto University(阿尔托大学)

AI总结 本文提出结构锚点剪枝(SAP),一种无需训练的多向量压缩方法,由评分保留(SR)诊断、SR引导的窗口选择和视觉入度中心性评分器三个组件构成,在无需针对模型调参的情况下,剪除超过90%的视觉token,同时保留90%以上的NDCG@5。

Comments methodology revision and new title

详情
AI中文摘要

近期的视觉-语言模型(例如ColPali)能够实现细粒度的视觉文档检索(VDR),但会带来难以承受的多向量索引存储开销。现有的无训练剪枝方法要么依赖启发式的层选择,要么在激进压缩下性能急剧下降,这使得先前工作认为有效的高压缩率剪枝需要依赖查询的训练。我们提出结构锚点剪枝(SAP)来挑战这一观点。SAP是一种自校准、无训练且与查询无关的索引阶段剪枝框架,包含三个组件:(i)评分保留(Score Retention,SR),一种白盒的逐层压缩诊断指标;(ii)SR引导的窗口选择,该过程可为任意主干网络自动定位结构剪枝区域,无需针对每个模型设置超参数;(iii)视觉入度中心性评分器,用于在所选窗口内识别锚点图像块。在ViDoRe v1/v2基准上,对于主干层数分别为18、28和36的三种架构,SAP在剪除超过90%视觉token的同时保留了90%以上的NDCG@5,且无需任何针对模型的参数调优。我们的逐层SR分析揭示了一种“对齐-聚合分歧”(Alignment-Aggregation Divergence):文档的视觉结构在主干网络内部保持为稳定的“结构高原”,但最后几层会将这种表示重塑为稀疏的、与查询对齐的形式,不再适合剪枝。这正是SAP能在基于最终层的方法失败之处取得成功的机理性原因。

英文摘要

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
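
The idea of scoring patches by visual in-degree centrality and keeping only the top-scoring anchors can be sketched as follows. The sketch assumes that in-degree means the total attention a patch receives from all other patches, that is, a column sum of a row-normalized attention matrix. The function name `select_anchor_patches`, the toy matrix, and the number of kept patches are hypothetical, and the paper's exact scorer and layer-window selection are not reproduced.

```python
def select_anchor_patches(attention, keep):
    """Return indices of the `keep` patches that receive the most attention.

    attention[i][j] is the weight with which patch i attends to patch j.
    """
    n = len(attention)
    # In-degree of patch j: attention received from every other patch (self-attention excluded).
    in_degree = [sum(attention[i][j] for i in range(n) if i != j) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: in_degree[j], reverse=True)
    return sorted(ranked[:keep])

# Toy 4-patch attention matrix; each row sums to 1 (illustrative values only).
attention = [
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
]

# Patches 1 and 2 receive the most attention from the others, so they are kept.
print(select_anchor_patches(attention, keep=2))
```

Because the score depends only on the document's own patches, it can be computed once at index time, which matches the query-agnostic setting the abstract describes.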

2601.07603 2026-05-22 cs.CV 版本更新

UIKA: Fast Universal Head Avatar from Pose-Free Images

UIKA:从无姿态图像快速生成通用头部化身

Zijian Wu, Boyao Zhou, Liangxiao Hu, Hongyu Liu, Yuan Sun, Xuan Wang, Xun Cao, Yujun Shen, Hao Zhu

发表机构 * Nanjing University(南京大学) Ant Group(蚂蚁集团) HKUST(香港科技大学) Xi’an Jiaotong University(西安交通大学)

AI总结 本文提出UIKA,一种前馈式、可动画的高斯头部模型,可从任意数量的无姿态输入(包括单张图像、多视角采集和手机拍摄视频)构建。与需要影棚级多视角采集并经长时间优化的传统化身方法不同,UIKA从模型表示、网络设计和数据准备三方面重新思考该任务,引入了UV引导的化身建模策略,设计了可学习的UV token,并聚合所有输入视角的UV信息解码为规范空间的高斯属性。

Comments CVPR 2026 Highlight. Code: https://github.com/ant-research/UIKA

详情
AI中文摘要

我们提出UIKA,一种前馈式、可动画的高斯头部模型,可从任意数量的无姿态输入(包括单张图像、多视角采集和手机拍摄视频)构建。传统的化身方法需要影棚级的多视角采集系统,并通过长时间的优化过程重建针对特定人物的模型;与之不同,我们从模型表示、网络设计和数据准备三个角度重新思考这一任务。首先,我们引入了UV引导的化身建模策略,其中每张输入图像都关联一个逐像素的面部对应关系估计。这种对应关系估计使我们能够将每个有效像素的颜色从屏幕空间重投影到UV空间,而UV空间与相机姿态和人物表情无关。此外,我们设计了可学习的UV token,可在屏幕和UV两个层面对其应用注意力机制。利用从所有输入视角聚合的UV信息,学习到的UV token可被解码为规范空间的高斯属性。为了训练我们的大型化身模型,我们还准备了一个大规模、身份丰富的合成训练数据集。我们的方法在单目和多视角设置下均显著优于现有方法。

英文摘要

We present UIKA, a feed-forward animatable Gaussian head model from an arbitrary number of pose-free inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike the traditional avatar method, which requires a studio-level multi-view capture system and reconstructs a human-specific model through a long-time optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy, in which each input image is associated with a pixel-wise facial correspondence estimation. Such correspondence estimation allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens on which the attention mechanism can be applied at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings.

2512.20538 2026-05-22 cs.CV 版本更新

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment

AlignPose: 通过多视角特征-度量对齐实现通用的6D位姿估计

Anna Šárová Mikeštíková, Médéric Fourmy, Martin Cífka, Josef Sivic, Vladimir Petrik

发表机构 * Czech Institute of Informatics, Robotics and Cybernetics(捷克信息学、机器人学与控制论研究所) Czech Technical University in Prague(布拉格捷克理工大学)

AI总结 本文提出AlignPose,一种无需特定对象训练或对称标注的多视角6D位姿估计方法,通过多视角特征-度量细化优化单一一致的世界坐标系位姿,实验表明其在六个数据集上优于其他方法,尤其在工业数据集上表现突出。

Comments CVPR 2026

详情
AI中文摘要

基于模型的单视角RGB物体位姿估计方法具有很强的泛化能力,但从根本上受限于深度模糊、杂乱场景和遮挡。多视角位姿估计方法有望解决这些问题,但现有工作要么依赖精确的单视角位姿估计,要么无法泛化到未见过的物体。我们通过以下三项贡献应对这些挑战。首先,我们提出AlignPose,一种6D物体位姿估计方法,它聚合来自多个外参已标定的RGB视角的信息,且无需任何针对特定物体的训练或对称性标注。其次,该方法的关键组件是一个专为物体位姿设计的全新多视角特征度量细化模块:它同时在所有视角上最小化即时渲染的物体特征与观测到的图像特征之间的差异,从而优化出单一且一致的世界坐标系下的物体位姿。第三,我们采用BOP基准评估,在六个数据集(YCB-V、T-LESS、HouseCat6D、ITODD-MV、IPD、XYZ-IBD)上进行了大量实验,结果表明AlignPose优于其他已发表的方法,尤其是在具有挑战性的工业数据集上,而这类场景在实践中很容易获得多个视角。

英文摘要

Single-view RGB model-based object pose estimation methods achieve strong generalization but are fundamentally limited by depth ambiguity, clutter, and occlusions. Multi-view pose estimation methods have the potential to solve these issues, but existing works rely on precise single-view pose estimates or lack generalization to unseen objects. We address these challenges via the following three contributions. First, we introduce AlignPose, a 6D object pose estimation method that aggregates information from multiple extrinsically calibrated RGB views and does not require any object-specific training or symmetry annotation. Second, the key component of this approach is a new multi-view feature-metric refinement specifically designed for object pose. It optimizes a single, consistent world-frame object pose by minimizing the feature discrepancy between on-the-fly rendered object features and observed image features across all views simultaneously. Third, we report extensive experiments on six datasets (YCB-V, T-LESS, HouseCat6D, ITODD-MV, IPD, XYZ-IBD) using the BOP benchmark evaluation and show that AlignPose outperforms other published methods, especially on challenging industrial datasets where multiple views are readily available in practice.

2511.20785 2026-05-22 cs.CV 版本更新

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

LongVT: 通过原生工具调用激励'通过长视频思考'

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

发表机构 * MiroMind NTU(南洋理工大学) HKUST(GZ)(香港科技大学(广州)) THU(清华大学) LMMs-Lab(LMMs实验室)

AI总结 本文提出LongVT,一种端到端的代理框架,通过交错的多模态工具链式思考实现'通过长视频思考',通过利用LMM的固有时间定位能力作为原生视频裁剪工具,以解决长视频推理中的幻觉问题,并通过VideoSIAH数据集提升训练和评估效果。

Comments CVPR 2026

详情
AI中文摘要

大型多模态模型(LMM)借助文本思维链(Chain-of-Thought)在视频推理方面展现出巨大潜力。然而,它们仍然容易产生幻觉,尤其是在处理证据稀疏且在时间上分散的长视频时。受人类理解长视频方式的启发(先全局浏览,再查看相关片段以获取细节),我们提出LongVT,一个端到端的智能体框架,通过交错的多模态工具思维链(Multimodal Chain-of-Tool-Thought)实现“用长视频思考”。具体而言,我们利用LMM固有的时间定位能力作为原生视频裁剪工具,聚焦到特定视频片段并重新采样更细粒度的视频帧。这种从全局到局部的推理循环会持续进行,直到答案能够以检索到的视觉证据为依据。鉴于长视频推理任务中细粒度问答(QA)数据稀缺,我们整理并将发布名为VideoSIAH的数据套件,以支持训练和评估。具体而言,我们的训练数据集包含24.79万(247.9K)个用于工具集成冷启动监督微调的样本、1600(1.6K)个用于智能体强化学习的样本,以及1.54万(15.4K)个用于智能体强化微调的样本。我们的评估基准包含1,280个QA对,通过带有人在环验证的半自动数据流程精心整理而成。凭借精心设计的三阶段训练策略和大量实证验证,LongVT在四个具有挑战性的长视频理解与推理基准上持续优于现有的强基线。我们的代码、数据和模型检查点已在 https://github.com/EvolvingLMMs-Lab/LongVT 公开。

英文摘要

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY 版本更新

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

SONIC:面向自然人形机器人全身控制的超大规模运动跟踪

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

发表机构 * NVIDIA

AI总结 本文提出了一种超大规模运动追踪方法,通过扩大模型容量、数据和计算资源,实现了一种能够产生自然且稳健全身体态的通用人形控制器,并展示了其在运动追踪任务中的可扩展性及在下游任务中的应用价值。

Comments Project page: https://nvlabs.github.io/SONIC/

详情
AI中文摘要

尽管在数千块GPU上训练的十亿参数级基础模型不断涌现,但类似的规模化收益尚未在人形机器人控制中得到体现。当前的人形机器人神经控制器规模仍然较小,只针对有限的行为集合,并且只在少量GPU上训练。我们证明,扩大模型容量、数据和算力可以得到一个通用的人形机器人控制器,能够实现自然而鲁棒的全身运动。我们将运动跟踪定位为人形机器人控制中的可扩展任务,利用多样化动作捕捉数据提供的密集监督来获取人类运动先验,而无需手工设计奖励。我们沿三个维度进行扩展,构建了一个运动跟踪基础模型:网络规模(120万到4200万参数)、数据集规模(来自700小时动作捕捉的1亿+帧)以及算力(2.1万GPU小时)。除了展示规模化的收益之外,我们还通过以下两点展示了下游应用价值:(1)一个实时运动学规划器,将运动跟踪与导航等任务衔接起来,实现自然的交互式控制;(2)一个统一的token空间,以单一策略同时支持VR遥操作和视觉-语言-动作(VLA)模型。通过这一接口,我们展示了由VLA驱动的自主全身移动操作(loco-manipulation),该任务需要协调手和脚的放置。规模化的运动跟踪表现出良好的特性:性能随算力和数据多样性稳步提升,学到的策略能够泛化到未见过的运动,从而使大规模运动跟踪成为人形机器人控制的实用基础。

英文摘要

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2511.02014 2026-05-22 cs.CV 版本更新

Towards Selection of Large Multimodal Models as Engines for Burned-in Protected Health Information Detection in Medical Images

面向医学图像中烧录式受保护健康信息检测的大型多模态模型引擎选型

Tuan Truong, Guillermo Jimenez Perez, Pedro Osorio, Matthias Lenga

发表机构 * QwenLM(通义实验室)

AI总结 本文研究如何选用大型多模态模型(LMM)检测医学图像中烧录(burned-in)的受保护健康信息(PHI),比较了三种主流模型在两种流程配置下的表现,发现LMM的OCR性能优于传统方法,但这并不总能带来整体PHI检测准确率的提升,最大的收益出现在具有复杂印记(imprint)样式的测试用例上;文中还给出了针对特定运行约束的模型选择建议和部署策略。

Comments Accepted at EMBC 2026

详情
AI中文摘要

在医学影像中检测受保护健康信息(PHI)对于保护患者隐私和确保符合监管框架至关重要。传统检测方法主要将光学字符识别(OCR)模型与命名实体识别结合使用。然而,大型多模态模型(LMM)的最新进展为增强文本提取和语义分析带来了新的机会。在本研究中,我们系统地评测了三种主流的闭源和开源LMM,即GPT-4o、Gemini 2.5 Flash和Qwen 2.5 7B,并采用两种不同的流程配置:一种仅用于文本分析,另一种同时整合OCR和语义分析。结果表明,与EasyOCR等传统模型相比,LMM具有更优的OCR性能(WER:0.03-0.05,CER:0.02-0.03)。然而,OCR性能的提升并不总能带来整体PHI检测准确率的提高。性能提升最明显的是具有复杂印记(imprint)样式的测试用例。当文本区域清晰可读、对比度充足,并且在OCR之后使用强LMM进行文本分析时,不同流程配置的结果相近。此外,我们针对特定的运行约束给出了有实证依据的LMM选择建议,并提出了一种利用可扩展、模块化基础设施的部署策略。

英文摘要

The detection of Protected Health Information (PHI) in medical imaging is critical for safeguarding patient privacy and ensuring compliance with regulatory frameworks. Traditional detection methodologies predominantly utilize Optical Character Recognition (OCR) models in conjunction with named entity recognition. However, recent advancements in Large Multimodal Models (LMMs) present new opportunities for enhanced text extraction and semantic analysis. In this study, we systematically benchmark three prominent closed- and open-source LMMs, namely GPT-4o, Gemini 2.5 Flash, and Qwen 2.5 7B, using two distinct pipeline configurations: one dedicated to text analysis alone and another integrating both OCR and semantic analysis. Our results indicate that LMMs exhibit superior OCR efficacy (WER: 0.03-0.05, CER: 0.02-0.03) compared to conventional models such as EasyOCR. However, this improvement in OCR performance does not consistently correlate with enhanced overall PHI detection accuracy. The strongest performance gains are observed on test cases with complex imprint patterns. In scenarios where text regions are clearly readable with sufficient contrast, and strong LMMs are employed for text analysis after OCR, different pipeline configurations yield similar results. Furthermore, we provide empirically grounded recommendations for LMM selection tailored to specific operational constraints and propose a deployment strategy that leverages scalable and modular infrastructure.
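
The WER and CER figures quoted in this abstract are conventionally computed from the Levenshtein edit distance, as in the minimal sketch below. It assumes the usual definitions (edit distance divided by the reference length, over words for WER and over characters for CER) and whitespace tokenization. The function names and the sample strings are hypothetical, and the paper's exact text normalization is not known.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical burned-in text and OCR output (illustrative only).
reference = "patient name john doe"
hypothesis = "patient name jon doe"

print(wer(reference, hypothesis))  # one wrong word out of four: 0.25
print(cer(reference, hypothesis))  # one missing character out of 21: about 0.048
```

The example also shows why the two metrics diverge: a single dropped character costs a full word in WER but only one of 21 characters in CER.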

2510.17991 2026-05-22 cs.LG cs.CV 版本更新

Demystifying Transition Matching: When and Why It Can Beat Flow Matching

解开转换匹配之谜:何时以及为何它能超越流匹配

Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park

发表机构 * KAIST(韩国科学技术院) Amazon Web Services(亚马逊网络服务)

AI总结 本文研究了转换匹配(TM)在何时以及为何能超越流匹配(FM),通过证明在单峰高斯分布下TM具有更低的KL散度,并分析了在高斯混合分布中TM在局部单峰区域的优势,以及在目标方差非可忽略时TM的优越性。

Comments Code: https://github.com/amazon-science/TransitionFlowMatching (AISTATS 2026)

详情
AI中文摘要

流匹配(FM)是许多最先进的生成模型的基础,但最近的结果表明转换匹配(TM)可以以更少的采样步骤获得更高的质量。本文回答了TM何时以及为何能超越FM的问题。首先,当目标是一个单峰高斯分布时,我们证明在有限的步骤数下,TM的KL散度严格低于FM。改进源于TM中的随机差分潜在更新,这些更新保留了目标协方差,而确定性FM则低估了它。我们随后表征了收敛速率,显示在固定计算预算下,TM比FM收敛得更快,从而在单峰高斯情况下确立了其优势。其次,我们将分析扩展到高斯混合分布,并识别出局部单峰区域,在这些区域中,采样动态近似于单峰情况,TM可以超越FM。近似误差随着组件均值之间的最小距离增加而减少,突显了当模式良好分离时TM的优势。然而,当目标方差接近零时,每个TM更新收敛到FM更新,TM的性能优势减弱。总之,我们证明了当目标分布具有良好分离的模式和非可忽略的方差时,TM优于FM。我们通过受控实验在高斯分布上验证了我们的理论结果,并将比较扩展到现实世界中的图像和视频生成应用。

英文摘要

Flow Matching (FM) underpins many state-of-the-art generative models, yet recent results indicate that Transition Matching (TM) can achieve higher quality with fewer sampling steps. This work answers the question of when and why TM outperforms FM. First, when the target is a unimodal Gaussian distribution, we prove that TM attains strictly lower KL divergence than FM for a finite number of steps. The improvement arises from stochastic difference latent updates in TM, which preserve target covariance that deterministic FM underestimates. We then characterize convergence rates, showing that TM achieves faster convergence than FM under a fixed compute budget, establishing its advantage in the unimodal Gaussian setting. Second, we extend the analysis to Gaussian mixtures and identify local-unimodality regimes in which the sampling dynamics approximate the unimodal case, where TM can outperform FM. The approximation error decreases as the minimal distance between component means increases, highlighting that TM is favored when the modes are well separated. However, when the target variance approaches zero, each TM update converges to the FM update, and the performance advantage of TM diminishes. In summary, we show that TM outperforms FM when the target distribution has well-separated modes and non-negligible variances. We validate our theoretical results with controlled experiments on Gaussian distributions, and extend the comparison to real-world applications in image and video generation.

2510.05094 2026-05-22 cs.CV 版本更新

VChain: Chain-of-Visual-Thought for Reasoning in Video Generation

VChain:用于视频生成中推理的视觉思维链

Ziqi Huang, Ning Yu, Gordon Chen, Haonan Qiu, Paul Debevec, Ziwei Liu

发表机构 * Eyeline Labs

AI总结 本文提出VChain,一种在视频生成中引入多模态模型视觉推理信号的新型推理时间视觉思维链框架,通过生成关键帧来指导预训练视频生成器的稀疏推理时间视觉状态适应,从而提升视频生成质量。

Comments ACL 2026 (Findings Paper), ICCV 2025 Workshop Outstanding Paper Award, Project page: https://eyeline-labs.github.io/VChain

详情
AI中文摘要

最近的视频生成模型可以生成流畅且视觉吸引人的片段,但它们经常难以合成具有连贯后果链的复杂动态。准确建模随时间推移的视觉结果和状态转换仍然是核心挑战。相比之下,大型语言和多模态模型(例如GPT-4o)表现出强大的视觉状态推理和未来预测能力。为了弥合这些优势,我们引入了VChain,一种新颖的推理时间视觉思维链框架,该框架将多模态模型的视觉推理信号注入到视频生成中。具体而言,VChain包含一个专用管道,利用大型多模态模型生成一组稀疏的关键帧作为快照,然后在这些关键时刻引导预训练视频生成器的稀疏推理时间视觉状态适应。我们的方法是调优高效的,引入了最小的开销,并避免了密集监督。在复杂的多步骤场景上进行的广泛实验表明,VChain显著提高了生成视频的质量。

英文摘要

Recent video generation models can produce smooth and visually appealing clips, but they often struggle to synthesize complex dynamics with a coherent chain of consequences. Accurately modeling visual outcomes and state transitions over time remains a core challenge. In contrast, large language and multimodal models (e.g., GPT-4o) exhibit strong visual state reasoning and future prediction capabilities. To bridge these strengths, we introduce VChain, a novel inference-time chain-of-visual-thought framework that injects visual reasoning signals from multimodal models into video generation. Specifically, VChain contains a dedicated pipeline that leverages large multimodal models to generate a sparse set of critical keyframes as snapshots, which are then used to guide the sparse inference-time visual-state adaptation of a pre-trained video generator only at these key moments. Our approach is tuning-efficient, introduces minimal overhead and avoids dense supervision. Extensive experiments on complex, multi-step scenarios show that VChain significantly enhances the quality of generated videos.

2509.23582 2026-05-22 cs.CV 版本更新

RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization

RobuQ: 通过鲁棒激活量化推动DiTs至W1.58A2

Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang

发表机构 * Shanghai Jiao Tong University, Shanghai, China(上海交通大学) Tsinghua University, Beijing, China(清华大学)

AI总结 本文提出RobuQ框架,通过鲁棒激活量化技术,解决了DiTs在极低比特下的部署问题,实现了在子4比特量化配置下的最佳性能,首次在大规模数据集上实现了稳定且具有竞争力的图像生成。

Comments Accepted by ICML 2026

详情
AI中文摘要

扩散变换器(DiTs)最近已成为图像生成的强大骨干网络,展示了比U-Net架构更优越的可扩展性和性能。然而,其实际部署受到高昂计算和内存成本的阻碍。尽管量化感知训练(QAT)在U-Net上显示出前景,但将其应用于DiTs面临独特的挑战,主要源于激活的敏感性和分布复杂性。在本文中,我们发现激活量化是将DiTs推向极低比特设置的主要瓶颈。为此,我们提出了一个面向DiTs的系统性QAT框架,命名为RobuQ。我们首先建立了一个强大的三值权重(W1.58A4)DiT基线。在此基础上,我们提出RobustQuantizer以实现鲁棒的激活量化。我们的理论分析表明,Hadamard变换可以将未知的逐token分布转换为逐token正态分布,为该方法提供了坚实的基础。此外,我们提出AMPN,即首个面向DiTs的仅激活混合精度网络流程。该方法在整个网络中使用三值权重,同时为每一层分配不同的激活精度以消除信息瓶颈。通过在无条件和有条件图像生成上的大量实验,RobuQ框架在4比特以下的量化配置中取得了DiT量化的最先进性能。据我们所知,RobuQ是首个在激活平均量化至2比特的情况下,于ImageNet-1K等大规模数据集上实现稳定且有竞争力的图像生成的方法。代码和模型将在https://github.com/racoonykc/RobuQ上提供。

英文摘要

Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary-weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit configurations. To the best of our knowledge, RobuQ is the first to achieve stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to an average of 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ.
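
To make the role of the Hadamard transform more concrete, here is a minimal sketch of how an orthonormal Walsh-Hadamard rotation spreads a single-channel outlier across all channels of one token. It is not RobustQuantizer itself: the token values, the `crest_factor` helper, and the choice of peak-to-RMS ratio as the diagnostic are illustrative assumptions, and the sketch says nothing about the paper's normality result beyond showing the flattening effect.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly stage: combine entries that are h positions apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def crest_factor(values):
    """Peak magnitude divided by root-mean-square (hypothetical diagnostic)."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return max(abs(v) for v in values) / rms


# One token whose first channel is a large outlier (made-up values).
token = [8.0, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]
rotated = fwht(token)
print(crest_factor(token), crest_factor(rotated))
```

Because the normalized transform is orthogonal and its own inverse, applying `fwht` a second time recovers the original token, so the rotation can be undone (or folded into adjacent weights) without losing information.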

2509.22769 2026-05-22 cs.CV 版本更新

PartCo: Part-Level Correspondence Priors Enhance Category Discovery

PartCo: 基于部分级对应先验的类别发现增强

Fernando Julio Cendra, Kai Han

发表机构 * Visual AI Lab, School of Computing and Data Science, The University of Hong Kong, Hong Kong(香港大学计算与数据科学学院视觉人工智能实验室,香港)

AI总结 PartCo通过引入部分级对应先验,提升了类别发现的性能,通过捕捉更细粒度的语义结构,改进了现有方法在区分密切相关类别方面的表现。

Comments ICML 2026, Project page: https://visual-ai.github.io/partco

详情
AI中文摘要

通用类别发现(GCD)旨在通过利用已知类别的标注示例,在未标记数据中识别已知和新类别。现有GCD方法主要依赖语义标签和全局图像表示,往往忽视了对区分密切相关类别至关重要的细节部分级线索。在本文中,我们引入了PartCo,即部分级对应先验,一种新的框架,通过整合部分级视觉特征对应关系来增强类别发现。通过利用部分级关系,PartCo捕捉到更细粒度的语义结构,从而更精确地理解类别关系。重要的是,PartCo能够无缝集成到现有GCD方法中,而无需进行显著修改。我们在多个基准数据集上的广泛实验表明,PartCo显著提高了当前GCD方法的性能,通过弥合语义标签与部分级视觉组成之间的差距,从而为GCD设定了新的基准。

英文摘要

Generalized Category Discovery (GCD) aims to identify both known and novel categories within unlabeled data by leveraging a set of labeled examples from known categories. Existing GCD methods primarily depend on semantic labels and global image representations, often overlooking the detailed part-level cues that are crucial for distinguishing closely related categories. In this paper, we introduce PartCo, short for Part-Level Correspondence Prior, a novel framework that enhances category discovery by incorporating part-level visual feature correspondences. By leveraging part-level relationships, PartCo captures finer-grained semantic structures, enabling a more nuanced understanding of category relationships. Importantly, PartCo seamlessly integrates with existing GCD methods without requiring significant modifications. Our extensive experiments on multiple benchmark datasets demonstrate that PartCo significantly improves the performance of current GCD approaches, outperforming most existing methods by bridging the gap between semantic labels and part-level visual compositions, thereby setting new benchmarks for GCD.

2508.02127 2026-05-22 cs.CV 版本更新

Enhancing Event-based Object Detection with Monocular Normal Maps

通过单目法线图增强基于事件的目标检测

Mingjie Liu, Hanqing Liu, Luoping Cui, Chuang Zhu

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(人工智能学院,北京邮电大学)

AI总结 本文提出NRE-Net框架,结合法线图的结构先验、RGB图像的外观上下文和事件的高频动态,通过自适应双流融合模块和事件模态感知融合模块提升自动驾驶中复杂光照下的目标检测性能。

详情
AI中文摘要

自动驾驶中的目标检测常受到复杂光照条件的干扰。虽然事件相机提供了一种稳健的解决方案,但它们容易受到突然的对比度变化(如反射)的影响,这通常会触发密集且误导性的事件信号。为了解决这个问题,我们利用RGB衍生的表面法线图作为显式的几何约束。关键在于,即使RGB退化,它们也保留了低频的结构先验,能够有效辅助基于事件的检测。因此,我们提出了NRE-Net,一个三模态框架,该框架整合了来自表面法线图的结构先验、来自RGB图像的外观上下文以及来自事件的高频动态。自适应双流融合模块(ADFM)首先对几何和外观线索进行对齐,随后是事件模态感知融合模块(EAFM),它选择性地整合事件动态。在DSEC-Det-sub和PKU-DAVIS-SOD上的大量评估表明,引入几何先验相比双模态基线在AP50上带来了额外3.0%的提升,并且我们的方法始终优于SFNet(+2.7%)和SODFormer(+7.1%)等融合方法。

英文摘要

Object detection in autonomous driving is frequently compromised by complex illumination. While event cameras offer a robust solution, they are susceptible to sudden contrast changes such as reflections which often trigger dense, misleading event signals. To overcome this, we leverage RGB-derived surface normal maps as explicit geometric constraints. Crucially, even when RGB degrades, they preserve low-frequency structural priors that effectively assist in event-based detection. Consequently, we present NRE-Net, a trimodal framework that integrates structural priors from surface Normal maps, appearance context from RGB images, and high-frequency dynamics from Events. The Adaptive Dual-stream Fusion Module (ADFM) first aligns geometric and appearance cues, followed by the Event-modality Aware Fusion Module (EAFM) which selectively integrates event dynamics. Extensive evaluations on DSEC-Det-sub and PKU-DAVIS-SOD demonstrate that incorporating geometric priors yields an additional 3.0% AP50 gain over dual-modal baselines, while our approach consistently outperforms fusion methods such as SFNet (+2.7%) and SODFormer (+7.1%).

2505.16416 2026-05-22 cs.CV cs.AI 版本更新

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Circle-RoPE: 用于大视觉-语言模型的锥形解耦旋转位置嵌入

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) City University of Hong Kong(香港城市大学) University of Sydney(悉尼大学) State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室,智能科学与技术学院)

AI总结 本文提出Circle-RoPE,通过将图像标记坐标映射到与文本位置轴正交的圆环上,实现跨模态位置解耦,同时保留图像内部空间结构,并通过交替几何编码增强跨模态位置解耦和细粒度图像空间结构保留。

Comments Accepted at ICML 2026

详情
AI中文摘要

旋转位置嵌入(RoPE)在大型语言模型中被广泛采用,但应用于视觉-语言模型(VLMs)时会耦合文本和图像位置索引,并可能引入虚假的跨模态相对位置偏差。我们提出Per-Token Distance(PTD)来量化跨模态位置解耦,并证明PTD = 0是消除RoPE引起的几何注意力偏差的充分条件。基于此准则,我们引入Circle-RoPE,将2D图像标记坐标重新映射到与文本位置轴正交的圆环上,得到一种锥形几何结构,其中每个文本标记到所有图像标记等距,同时保留图像内部空间结构。我们进一步提出交替几何编码(AGE),通过在不同层之间交替使用Circle-RoPE的解耦几何和标准RoPE的网格先验来结合互补的几何先验。这种设计在保持细粒度图像内部空间结构的同时实现了跨模态位置解耦。在多种VLM骨干网络和多模态基准上的实验显示,该方法在空间定位和视觉推理方面均取得了稳定的提升。代码可在https://github.com/lose4578/CircleRoPE上获得。

英文摘要

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
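
The equidistance property described above follows from simple geometry, which the toy sketch below illustrates. It is not the Circle-RoPE implementation: placing text positions on the x axis, flattening the image grid to angles by raster index, and the `circle_positions` helper are all assumptions made for illustration, and this flattening does not attempt to reproduce how the paper preserves 2D intra-image structure.

```python
import math


def circle_positions(height, width, radius=1.0):
    """Place H x W image tokens on a circle in the (y, z) plane (toy mapping)."""
    count = height * width
    points = []
    for idx in range(count):
        angle = 2.0 * math.pi * idx / count
        points.append((0.0, radius * math.cos(angle), radius * math.sin(angle)))
    return points


def distance(p, q):
    """Euclidean distance between two 3-D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


image_tokens = circle_positions(3, 4, radius=2.0)
text_token = (5.0, 0.0, 0.0)  # text positions lie along the x axis
dists = [distance(text_token, p) for p in image_tokens]
print(max(dists) - min(dists))  # spread of text-to-image distances
```

Every image token sits at distance sqrt(p^2 + R^2) from a text token at axis position p, so the text token together with the circle forms a cone, and no image token is positionally closer to the text than any other.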

2410.19787 2026-05-22 cs.CV cs.LG 版本更新

Leveraging Multi-Temporal Sentinel 1 and 2 Satellite Data for Leaf Area Index Estimation With Deep Learning

利用多时相哨兵1和2卫星数据进行叶面积指数估计的深度学习方法

Clement Wang, Antoine Debouchage, Valentin Goldité, Aurélien Wery, Jules Salzinger

发表机构 * Austrian Institute of Technology - Vienna, Austria(奥地利技术研究所-维也纳,奥地利)

AI总结 本文提出了一种基于多时相哨兵1雷达数据和哨兵2多谱段数据的深度学习方法,用于像素级叶面积指数预测,通过多U-Net网络结构和共同潜在空间实现不同输入模态的互补信息融合,最终在公开数据上取得了0.06 RMSE和0.93 R2分数。

详情
Journal ref
Proc. 2023 Conference on Big Data from Space (BiDS'23), Publications Office of the European Union, Luxembourg, 2023
AI中文摘要

叶面积指数(LAI)是理解生态系统健康和植被动态的关键参数。在本文中,我们提出了一种新的像素级LAI预测方法,通过利用多时间戳的哨兵1雷达数据和哨兵2多谱段数据的互补信息。我们的方法基于多个针对此任务定制的多U-Net深度神经网络。为处理不同输入模态的复杂性,该方法由多个预先训练的模块组成,以在共同的潜在空间中表示所有输入数据。然后,我们通过一个共同的解码器进行端到端微调,该解码器还考虑了季节性因素,我们发现季节性在其中起重要作用。我们的方法在公开可用数据上实现了0.06 RMSE和0.93 R2分数。我们的贡献可在https://github.com/valentingol/LeafNothingBehind上获得,供未来工作进一步改进当前进展。

英文摘要

The Leaf Area Index (LAI) is a critical parameter for understanding ecosystem health and vegetation dynamics. In this paper, we propose a novel method for pixel-wise LAI prediction by leveraging the complementary information from Sentinel 1 radar data and Sentinel 2 multi-spectral data at multiple timestamps. Our approach uses a deep neural network based on multiple U-nets tailored specifically to this task. To handle the complexity of the different input modalities, it comprises several modules that are pre-trained separately to represent all input data in a common latent space. Then, we fine-tune them end-to-end with a common decoder that also takes into account seasonality, which we find to play an important role. Our method achieved 0.06 RMSE and 0.93 R² score on publicly available data. We make our contributions available at https://github.com/valentingol/LeafNothingBehind so that future work can build on our current progress.

2404.05307 2026-05-22 cs.CV cs.RO 版本更新

4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks

利用时序多视角网络进行野外条件下4D雷达的人体语义分割

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

发表机构 * Center for Advanced Autonomous Sensor Systems (AASS)(先进自主传感器系统中心)

AI总结 本文提出TMVA4D网络,利用4D雷达数据进行人体语义分割,通过多视角投影区分背景与人体,在低能见度条件下实现75.9%的Dice系数和61.2%的IoU指标。

详情
AI中文摘要

可靠的人员检测对于移动机器人和重型车辆在道路和工业环境(如采矿和建筑)中的安全自主至关重要。然而,常规传感器如摄像头或激光雷达在尘埃、雾或烟等恶劣条件下容易失效,限制了其在现实机器人系统中的应用。相比之下,雷达在广泛的环境条件下都能提供稳健的测量。特别是现代高分辨率4D成像雷达可提供跨距离、方位和仰角的4D点云,以及每个点的多普勒速度数据,非常适合机器人感知。我们提出TMVA4D,一种基于CNN和ConvLSTM编码器的神经网络架构家族,利用4D雷达模态进行语义分割。这些架构使用4D雷达数据的一系列2D投影(涵盖仰角、方位、距离和多普勒速度维度)进行训练,以区分背景和人员类别。在多个作业场地的评估中,我们的模型即使在低能见度条件下也取得了有前景的性能(人员类别的Dice为75.9%,IoU为61.2%)。数据和代码将在发表后公开发布。

英文摘要

Reliable people detection is crucial for the safe autonomy of mobile robots and heavy vehicles, both on roads and in industrial settings like mining and construction. However, common sensors like cameras or lidars are prone to failure in adverse conditions such as dust, fog, or smoke, which limits their use in real-world robotic systems. Radar, on the other hand, delivers robust measurements in a wide range of environmental conditions. In particular, modern high-resolution 4D imaging radars provide 4D point clouds across range, azimuth, and elevation, as well as per-point Doppler velocity data, well suited for robot perception. We propose TMVA4D, a family of artificial neural network architectures based on CNN and ConvLSTM encoders that leverage the 4D radar modality for semantic segmentation. The architectures are trained to distinguish between background and person classes using a series of 2D projections of the 4D radar data, encompassing elevation, azimuth, range, and Doppler velocity dimensions. Evaluated across several operational sites, our models achieve promising performance (Dice 75.9%, IoU 61.2% for class person) even in low-visibility conditions. The data and code will be made publicly available upon publication.
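
The two reported scores can be cross-checked with the standard identity Dice = 2·IoU / (1 + IoU), which holds when both metrics are computed from the same true-positive, false-positive, and false-negative counts. The short sketch below applies it to the reported IoU; the function name is hypothetical, and the identity need not hold exactly if the paper averages each metric separately over sequences or sites.

```python
def dice_from_iou(iou: float) -> float:
    """Convert IoU (Jaccard index) to the Dice coefficient for the same counts."""
    return 2.0 * iou / (1.0 + iou)


# Reported IoU for the person class is 61.2%.
print(round(100 * dice_from_iou(0.612), 1))  # → 75.9
```

The result matches the reported 75.9% Dice, so the two figures are mutually consistent.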

2209.03358 2026-05-22 cs.NE cs.AI cs.CR cs.CV cs.LG 版本更新

Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples

攻击脉冲:论脉冲神经网络对抗样本的可迁移性与安全性

Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen

发表机构 * Lehigh University(理海大学) University of Minnesota Twin Cities(明尼苏达大学双城分校) North Carolina State University(北卡罗来纳州立大学) University of Rhode Island(罗德岛大学) Northeastern University(东北大学)

AI总结 本文研究了脉冲神经网络(SNN)在对抗示例中的鲁棒性,揭示了对抗攻击的转移性,并提出了混合动态脉冲估计(MDSE)攻击方法,以提高SNN和非SNN模型的对抗示例生成效果。

Comments Accepted manuscript. Published in Neurocomputing, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506

详情
Journal ref
Neurocomputing, Volume 656, 2025, 131506
AI中文摘要

脉冲神经网络(SNNs)因其高能效和最近在分类性能上的进展而受到广泛关注。然而,与传统深度学习方法不同,SNN对对抗示例的鲁棒性研究仍相对薄弱。在本文中,我们通过三个贡献推进了SNN的对抗攻击研究。首先,我们表明对SNN的成功白盒对抗攻击高度依赖于底层的替代梯度估计器,即使对于对抗训练的SNN也是如此。其次,使用最佳的单一替代梯度估计器,我们分析了对抗攻击在SNN、视觉Transformer(ViTs)和CNN之间的可转移性。我们的分析揭示了两个关键差距:现有的白盒攻击没有利用多个替代梯度估计器来攻击SNN,且没有单个模型攻击能够可靠地生成同时欺骗SNN和非SNN模型的对抗示例。作为我们的第三个贡献,我们开发了混合动态脉冲估计(MDSE)攻击来解决这些问题。MDSE使用动态梯度估计方案,充分利用多个替代梯度估计器函数,生成能够同时欺骗SNN和非SNN模型的对抗示例。MDSE在SNN/ViT模型集合上比传统白盒攻击如Auto-PGD有效多达91.4%,在对抗训练的SNN集合上提供了3倍的提升。实验覆盖了三个数据集(CIFAR-10、CIFAR-100、ImageNet)和十九个分类器模型(每个CIFAR数据集七个,ImageNet五个)。我们的MDSE实现和评估的模型在https://github.com/nuoxuxxx/attacking-the-spike-mdse上公开可用。

英文摘要

Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning approaches, the study of SNN robustness to adversarial examples remains relatively underdeveloped. In this work, we advance the adversarial attack side of SNNs through three contributions. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient estimator, even for adversarially trained SNNs. Second, using the best single surrogate gradient estimator, we analyze the transferability of adversarial attacks across SNNs, Vision Transformers (ViTs) and CNNs. Our analysis reveals two key gaps: no existing white-box attack exploits multiple surrogate gradient estimators for SNNs, and no single-model attack reliably generates adversarial examples that simultaneously fool both SNN and non-SNN models. For our third contribution, we develop the Mixed Dynamic Spiking Estimation (MDSE) attack to address these issues. MDSE uses a dynamic gradient estimation scheme to fully exploit multiple surrogate gradient estimator functions and generates adversarial examples capable of fooling SNN and non-SNN models simultaneously. MDSE is up to 91.4% more effective on SNN/ViT model ensembles and provides a 3x boost on adversarially trained SNN ensembles compared to conventional white-box attacks like Auto-PGD. Experiments cover three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifier models (seven per CIFAR dataset, five for ImageNet). Our implementation of MDSE and the evaluated models is publicly available at https://github.com/nuoxuxxx/attacking-the-spike-mdse.
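
Since the first contribution hinges on the choice of surrogate gradient estimator, a small sketch may help show why that choice matters. A spiking neuron fires through a step function whose true derivative is zero almost everywhere, so white-box attacks backpropagate through a smooth stand-in. The three surrogates below are common textbook forms; which estimators MDSE actually mixes, and with what slopes, is not stated in the abstract, so the function names and parameter values here are assumptions.

```python
import math


def spike(v, threshold=1.0):
    """Forward pass: emit a spike when the membrane potential reaches threshold."""
    return 1.0 if v >= threshold else 0.0


def grad_sigmoid(v, threshold=1.0, slope=4.0):
    """Derivative of sigmoid(slope * (v - threshold))."""
    s = 1.0 / (1.0 + math.exp(-slope * (v - threshold)))
    return slope * s * (1.0 - s)


def grad_arctan(v, threshold=1.0, slope=2.0):
    """Derivative of arctan(pi/2 * slope * x) / pi + 1/2 with x = v - threshold."""
    x = v - threshold
    return slope / (2.0 * (1.0 + (math.pi / 2.0 * slope * x) ** 2))


def grad_rectangular(v, threshold=1.0, width=1.0):
    """Boxcar surrogate: constant inside a window around threshold, zero outside."""
    return 1.0 / width if abs(v - threshold) < width / 2.0 else 0.0


# The same membrane potential yields different backward signals per surrogate.
for v in (0.8, 1.0, 1.6):
    print(v, spike(v), grad_sigmoid(v), grad_arctan(v), grad_rectangular(v))
```

With these settings all three surrogates agree at the threshold but diverge away from it, and the rectangular one passes no gradient at all for v = 1.6. A gradient-based attack therefore follows a different direction depending on the estimator, which is consistent with the abstract's observation that attack success depends strongly on this choice.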

1709.03806 2026-05-22 cs.CV 版本更新

Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark

视觉模型是否编码物体层面的语义相关性?一种受认知心理学启发的基准

Hansang Lee, Haeil Lee, Junmo Kim

发表机构 * Department of Computer Science, Seoul Women’s University(首尔女子大学计算机科学系) LG Energy Solution(LG能源解决方案) School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院)

AI总结 本文通过一种受认知心理学启发的基准,探讨了视觉模型是否能编码物体层面的语义相关性,研究了两种仅基于图像的测试集,并揭示了分类准确率之外的表征特性。

详情
AI中文摘要

现代视觉模型在物体识别任务上取得了显著的性能,但尚不清楚其表示是否编码物体层面的语义相关性,即支持人类视觉认知的对象概念之间的有意义联系。现有的基准主要针对类别预测或依赖图像-文本匹配,忽略了视觉表示本身的研究。受认知心理学启发,我们将语义相关性重新定义为三元组排序任务,并研究了两个仅基于图像的测试集:POPORO,一个已有的400个三元组心理刺激集,重新用于表示评估;以及PoporoIN,一个新构建并人工编写的1000个三元组ImageNet验证扩展集。每个三元组沿两个正交轴进行注释:一个相关目标轴区分类别相关性(CR,分类学)和上下文相关性(TR,主题性),一个干扰轴区分颜色匹配干扰项(CD)和形状匹配干扰项(SD)。二十种预训练模型,涵盖监督、自监督、视觉-语言和生成范式,在仅推理的协议下通过余弦相似度进行评估。基于变换器的表示在PoporoIN上比卷积表示高出高达18.30个百分点,且在可比的ImageNet准确率下,视觉-语言编码器在POPORO上比视觉-only编码器高出高达22.50个百分点。在所有范式中,模型在分类学目标上比主题性目标更可靠地识别,且更容易被形状匹配干扰项所误导,而不是颜色匹配干扰项。这些基准揭示了分类准确率之外的表征特性,连接了认知心理学和视觉表征评估。

英文摘要

Modern vision models have achieved strong object-recognition performance, yet it remains unclear whether their representations encode object-level semantic relatedness, the meaningful connection between object concepts that supports human visual cognition. Existing benchmarks predominantly target category prediction or rely on image-text matching, leaving the visual representation itself underexamined. Drawing on cognitive psychology, we recast semantic relatedness as a triplet-ranking task and study two image-only test beds: POPORO, an existing 400-triplet psychological stimulus set repurposed for representation evaluation, and PoporoIN, a newly constructed and manually curated 1,000-triplet ImageNet-validation extension. Each triplet is annotated along two orthogonal axes: a related-target axis distinguishing Categorical Relatedness (CR, taxonomic) from conTextual Relatedness (TR, thematic), and a distractor axis distinguishing Color-matched Distractors (CD) from Shape-matched Distractors (SD). Twenty pretrained models spanning supervised, self-supervised, vision-language, and generative paradigms were evaluated by cosine similarity in an inference-only protocol. Transformer-based representations exceeded convolutional counterparts by up to 18.30 percentage points on PoporoIN at comparable ImageNet accuracy, and vision-language encoders exceeded vision-only counterparts by up to 22.50 percentage points under matched ImageNet accuracy on POPORO. Across paradigms, models recognized taxonomic targets more reliably than thematic ones and were more easily misled by shape-matched than by color-matched distractors. The benchmarks expose representational properties that classification accuracy alone does not fully predict, bridging cognitive psychology and visual representation evaluation.
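
The inference-only protocol described above can be sketched in a few lines. A triplet counts as correct when the anchor embedding is more similar (by cosine) to the related target than to the distractor. The exact scoring rule, tie handling, and embedding extraction used in the paper are not given in the abstract, so the strict inequality, the `triplet_accuracy` name, and the 2-D toy vectors below are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets):
    """Fraction of (anchor, related, distractor) triplets ranked correctly."""
    correct = sum(
        1 for anchor, related, distractor in triplets
        if cosine(anchor, related) > cosine(anchor, distractor)
    )
    return correct / len(triplets)


# Made-up embeddings standing in for frozen model features.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # related target is closer: correct
    ([1.0, 1.0], [-1.0, 0.2], [1.0, 0.8]),  # distractor is closer: incorrect
]
print(triplet_accuracy(toy_triplets))  # → 0.5
```

Splitting the triplets by annotation (CR vs. TR, CD vs. SD) before calling the function would give the per-axis accuracies the abstract compares.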

2605.22086 2026-05-22 cs.CV 版本更新

GenHAR: Generalizing Cross-domain Human Activity Recognition for Last-mile Delivery

GenHAR:面向最后一公里配送的跨领域人类活动识别通用化

Zhiqing Hong, Zelong Li, Xiubin Fan, Guang Yang, Baoshen Guo, Haotian Wang, Tian He, Desheng Zhang

发表机构 * Beijing Normal–Hong Kong Baptist University(北京师范大学-香港浸会大学) Rutgers University(罗格斯大学) SMART Center, MIT(MIT SMART中心)

AI总结 本文提出GenHAR框架,通过学习领域不变的传感器表示来解决跨领域人类活动识别中的分布偏移问题,提升了目标领域的泛化能力,并在实际部署中实现了高效率和高精度的实时活动检测。

详情
AI中文摘要

人类活动识别(HAR)在各种应用中表现出显著的有效性,如智能医疗和智能制造。然而,HAR面临的主要挑战是不同传感器数据域之间的分布偏移,这通常会导致在现实应用中性能下降。为了解决这个问题,本文引入了GenHAR,一种新的框架,旨在通过学习领域不变的传感器表示来缩小领域差距。GenHAR的目标是通过仅使用源域的数据来增强HAR在目标域上的泛化能力。GenHAR的关键创新体现在两个方面:首先,GenHAR对传感器数据进行分词,并学习频率传感器通道维度之间的相关性,以提高HAR模型的鲁棒性;其次,GenHAR通过选择性掩码和高效的注意力机制来提高效率。我们通过在现实世界的人类活动数据集上与最先进的HAR方法进行比较,系统分析了GenHAR。结果表明,GenHAR在准确性上比最先进的方法高出9.97%,并减少了6.4倍的浮点运算。此外,我们还在四个城市的一家领先物流公司部署了GenHAR,并检测到21.5亿次实时活动。我们发布了代码:https://github.com/Sensor-FoundationModel/GenHAR。

英文摘要

Human Activity Recognition (HAR) has shown remarkable effectiveness in various applications, such as smart healthcare and intelligent manufacturing. However, a major challenge faced by HAR is the distribution shift across different sensor data domains, which often leads to decreased performance when deployed for real-world applications. To address this issue, this paper introduces GenHAR, a novel framework designed to mitigate the domain gap by learning domain-invariant sensor representations. GenHAR aims to enhance the generalization capabilities of HAR on target domains purely with data from the source domain. The key novelty of GenHAR lies in two aspects. Firstly, GenHAR tokenizes sensor data and learns correlations among frequency sensor channel dimensions to improve the robustness of HAR models. Secondly, GenHAR improves the efficiency via selective masking and an efficient attention mechanism. We conduct a systematic analysis of GenHAR by comparing it with state-of-the-art HAR methods on real-world human activity datasets. Results show that GenHAR outperforms state-of-the-art methods by 9.97% in accuracy, and reduces Floating Point Operations by 6.4 times. Moreover, we deploy GenHAR at a leading logistics company in 4 cities, and have detected 2.15 billion real-time activities. We release our code at: https://github.com/Sensor-FoundationModel/GenHAR.

2605.22078 2026-05-22 cs.AI cs.CV 版本更新

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

通过无训练空间-时间池化和栅格化增强视频大语言模型的视觉令牌表示

Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding

发表机构 * Tsinghua University(清华大学) Shenzhen University(深圳大学) Xidian University(西安电子科技大学)

AI总结 本文提出了一种无需训练的空间-时间池化和栅格化方法ST-GridPool,用于提升视频大语言模型的视觉令牌表示,通过多级时空交互和基于范数的空间池化技术,在不需重新训练的情况下提高性能。

Comments Accepted by ICLR 2026

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在视频理解任务中取得了显著进展,但如何在保持时空交互的同时高效压缩视觉令牌仍面临挑战。现有方法如LLaVA家族使用简单的池化或插值技术,忽视了视觉令牌的复杂动态。为弥合这一差距,我们提出了ST-GridPool,一种专为视频LLM设计的新型无训练视觉令牌增强方法。我们的方法整合了金字塔时间栅格(PTG),通过层次化时间栅格捕捉多粒度时空交互,以及基于范数的空间池化(NSP),通过利用令牌范数与语义丰富度之间的相关性来保留高信息视觉区域。在各种基准上的大量实验表明,ST-GridPool无需代价高昂的重新训练即可一致提升视频LLM的性能。我们的方法提供了一种高效且即插即用的解决方案来改进视觉令牌表示。我们的代码可在https://github.com/bingjunluo/ST-GridPool上获得。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved video understanding, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, rely on simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient, plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.
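
One way to read "norm-based spatial pooling" is sketched below: within each spatial window, keep the token with the largest L2 norm instead of averaging all tokens. This is only a guess at the flavor of NSP made for illustration. The abstract does not say whether the method selects, weights, or otherwise combines tokens, and the `norm_select_pool` name, window size, and toy token values are all assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a token vector."""
    return math.sqrt(sum(x * x for x in vec))


def norm_select_pool(grid, window=2):
    """Pool an H x W grid of token vectors, keeping the highest-norm token per window.

    H and W are assumed to be divisible by window.
    """
    pooled = []
    for r in range(0, len(grid), window):
        row = []
        for c in range(0, len(grid[0]), window):
            cell = [
                grid[r + dr][c + dc]
                for dr in range(window)
                for dc in range(window)
            ]
            row.append(max(cell, key=l2_norm))
        pooled.append(row)
    return pooled


# A 2 x 4 grid of 2-D toy tokens is reduced to a 1 x 2 grid.
frame = [
    [[1.0, 0.0], [0.0, 3.0], [0.1, 0.1], [2.0, 2.0]],
    [[0.0, 0.0], [1.0, 1.0], [0.0, 0.5], [1.0, 0.0]],
]
print(norm_select_pool(frame))
```

Compared with average pooling, a selection rule like this keeps a high-norm token intact rather than diluting it with low-norm neighbors, which is one plausible way to exploit the norm-richness correlation the abstract mentions.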

2605.22072 2026-05-22 cs.CL cs.CV 版本更新

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Faithful-MR1: 通过锚定和强化视觉注意力实现忠实的多模态推理

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

发表机构 * AMAP, Alibaba Group(高德,阿里巴巴集团) University of Chinese Academy of Sciences(中国科学院大学) Tsinghua University(清华大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出Faithful-MR1框架,通过锚定和强化视觉注意力解决多模态推理中的忠实性问题,提升模型在多模态基准上的表现。

Comments 20 pages, 7 figures, 3 tables. Preprint

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为推动大语言模型复杂推理发展的有希望范式,最近的研究将其扩展到多模态大语言模型(MLLMs)。然而,这种迁移带来了忠实性挑战,即对任务相关视觉证据的忠实感知,以及在推理中对该证据的忠实使用,这导致模型在多模态基准上的收益不尽如人意。具体而言,现有的感知监督通常作用于文本描述而非原生的图像区域,而忠实使用在很大程度上被忽视,由此暴露出感知与推理的脱节:被正确感知的证据在推理过程中被丢弃或被违背。为弥合这些差距,我们提出Faithful-MR1,一个通过锚定和强化视觉注意力来解决忠实多模态推理两个方面的训练框架。锚定阶段将感知转化为一个显式的预推理子任务,直接以图像区域而非文本描述来监督专用<Focus>标记的注意力。强化阶段通过反事实图像干预来揭示忠实使用,奖励那些答案正确且将视觉注意力集中在视觉信息具有因果作用之处的轨迹。大量实验表明,Faithful-MR1在Qwen2.5-VL-Instruct 3B和7B骨干网络上均优于最近的多模态推理基线,同时使用的训练数据显著更少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

2605.22068 2026-05-22 cs.CV 版本更新

COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

COCOTree: 一个用于开放树状视觉分解的数据集和基准

Junhyub Lee, Seunghun Chae, Hyosu Kim

发表机构 * Chung-Ang University(中央大学)

AI总结 本文提出COCOTree数据集和基准,通过自动化生成管道和开放词汇空间,实现了对复杂物理组装的长尾分布的捕捉,并提出了Open Tree Quality (OTQ)评估指标。

详情
AI中文摘要

我们形式化并实现了开放树分解任务,该任务将图像分割为具有不受限粒度和灵活性的层次化视觉组件树。具体而言,我们为这一新范式提供了基础基准,包含以下三个关键贡献:首先,通过开发一个完全自动化的生成管道,将大视觉-语言模型(LVLMs)的语义推理与SAM 3的精确几何定位相结合,克服了手动标注过高的认知和体力瓶颈;其次,利用该管道构建了大规模基准COCOTree,包含超过21,000张图像和180万个结构节点,并通过超过3,500个唯一标签的开放词汇空间,成功捕捉了复杂物理组装的长尾分布。值得注意的是,严格的人工评估证实,我们生成的标注与人类的结构判断高度一致;最后,我们提出Open Tree Quality (OTQ)指标,建立了标准化评估协议,该指标联合评估掩码精度、标签准确性和结构一致性。我们已发布数据集和基准代码:https://github.com/melonkick3090/COCOTree.

英文摘要

We formalize and enable the task of open tree decomposition, which segments an image into hierarchical trees of visual components with unconstrained granularity and flexibility. Specifically, we provide the foundation benchmark for this new paradigm with the following three key contributions. First, we overcome the prohibitively high cognitive and physical bottlenecks of manual annotation by developing a fully automated generation pipeline that synergizes the semantic reasoning of Large Vision-Language Models (LVLMs) with the precise geometric grounding of SAM 3. Second, leveraging this pipeline, we construct COCOTree, a massive-scale benchmark featuring over 21K images and 1.8M structural nodes. By embracing an open-vocabulary space of over 3.5K unique labels, it successfully captures the long-tail distribution of complex physical assemblies. Notably, rigorous human evaluation confirms our generated annotations demonstrate strong alignment with human structural judgment. Third, we establish a standardized evaluation protocol by proposing the Open Tree Quality (OTQ) metric, which jointly assesses mask precision, label accuracy, and structural consistency. We release our dataset and benchmark code at https://github.com/melonkick3090/COCOTree.

2605.22066 2026-05-22 cs.CV cs.AI 版本更新

Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Echo4DIR: 从2D超声视频重建4D隐式心脏结构

Yanan Liu, Qinya Li, Hao Zhang, Kangjian He, Xuan Yang, Hao Li, Dan Xu, Lei Li

发表机构 * Department of Biomedical Engineering, National University of Singapore, Singapore(新加坡国立大学生物医学工程系) School of Information Science and Engineering, Yunnan University, Kunming, China(云南大学信息科学与工程学院)

AI总结 本文提出Echo4DIR框架,通过隐式重建方法从稀疏2D超声视频中重建4D心脏几何结构,解决了几何歧义和时间不连续性问题,实现了高精度的临床重叠度。

详情
AI中文摘要

从稀疏的2D超声图像中重建4D(3D+t)心脏几何结构具有高度的实用性,但受到几何歧义和时间不连续性的根本挑战。为了解决这些问题,我们提出了Echo4DIR,一种新颖的测试时4D隐式重建框架。具体来说,我们通过心脏条件SDF学习鲁棒的3D形状先验,构建了具有极线交叉注意力的Epipolar Mask Encoder模块,以有效融合多视角特征。为了弥合合成到现实的领域差距,我们引入了一种自监督的SDF定制可微渲染策略,利用未经校准的临床掩码进行患者特定的3D形状适应,而无需3D地面真实数据。关键的是,隐式表示的内在连续性克服了稀疏观测,使在任意分辨率下都能获得解剖学可靠的几何结构。此外,为了使我们的框架具备物理连续的4D扩展能力,我们引入了一种径向SDF对齐策略,严格锁定形状演变到预测的速度场,从根本上消除了网格漂移。在合成基准和真实临床数据集上的广泛实验表明,Echo4DIR实现了最先进的4D心脏网格重建,特别是在临床重叠度方面,达到了高达98.35%的Dice和96.75%的IoU。

英文摘要

Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.

2605.22061 2026-05-22 cs.CV 版本更新

Distributed Image Compression with Multimodal Side Information at Extremely Low Bitrates

分布式图像压缩与多模态侧信息在极低比特率下的应用

Guojun Xu, Mingyang Zhang, Jianwen Xiang, Cheng Tan, Yanchao Yang, Junwei Zhou

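Affiliations: School of Computer Science and Artificial Intelligence, Wuhan University of Technology

AI summary: This paper proposes a multimodal distributed image compression framework (MDIC) that exploits multimodal side information to achieve high-quality image reconstruction at extremely low bitrates. Its core components are a text-to-image diffusion decoder and a feature-mask generator, which improve global perceptual quality and the preservation of local detail.

Comments Accepted by CVPR 2026

Details
AI Chinese abstract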

分布式图像压缩(DIC)对于多视图传输至关重要,尤其是在极低比特率(< 0.1 bpp)下。其核心挑战是有效利用侧信息以在严格比特率预算下实现高质量重建。然而,现有DIC方法难以利用全局上下文和对象级细节,导致局部模糊和细节丢失。为此,我们提出了一种多模态DIC框架(MDIC),首次将多模态侧信息引入DIC范式,有效保留细粒度局部细节并提升重建图像的全局感知质量。具体而言,我们引入基于文本到图像扩散的解码器,该解码器根据从相关图像中提取的文本侧信息进行条件化,以捕捉共享的全局语义。此外,我们设计了一个由多模态细粒度对齐任务监督的特征掩码生成器,以加强视觉侧信息的利用。生成的掩码具有两个作用:首先,它指导从无损传输的侧信息中提取细粒度细节,以保持重建细节的语义一致性;其次,它调节从量化VQ-VAE嵌入中提取的聚类特征表示,补偿主图像在极端压缩下的类别信息丢失。在广泛使用的KITTI立体和Cityscapes数据集上的大量实验表明,MDIC在极低比特率下实现了最先进的感知质量。

English abstract

Distributed Image Compression (DIC) is crucial for multi-view transmission, especially when operating at extremely low bitrates (< 0.1 bpp). Its core challenge is effectively utilizing side information to achieve high-quality reconstruction under strict bitrate budgets. However, existing DIC approaches struggle to exploit global context and object-level details from side information, leading to local blurring and the loss of fine details in the reconstruction. To address these limitations, we propose a Multimodal DIC framework (MDIC), which, for the first time, leverages side information in a multimodal manner into the DIC paradigm, effectively preserving fine-grained local details and enhancing global perceptual quality in reconstructed images. Specifically, we introduce a text-to-image diffusion-based decoder conditioned on textual side information extracted from correlated images to capture shared global semantics. Moreover, we design a feature-mask generator, supervised by a multimodal fine-grained alignment task, to strengthen the exploitation of visual side information. The generated mask serves two purposes: first, it guides the extraction of fine-grained details from losslessly transmitted side information to preserve the semantic consistency of reconstructed details; second, it regulates the extraction of clustered feature representations from the quantized VQ-VAE embeddings, compensating for category information lost under the extreme compression of the primary image. Extensive experiments on the widely used KITTI Stereo and Cityscapes datasets demonstrate that MDIC achieves state-of-the-art perceptual quality at extremely low bitrates.
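
To give a sense of how tight a budget below 0.1 bpp is, the following is a minimal arithmetic sketch. The helper names are hypothetical, and the 2048x1024 frame size is used only as an example of a Cityscapes-sized image; nothing here reflects the MDIC codec itself.

```python
def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Bitrate of a compressed image in bits per pixel."""
    return 8 * num_bytes / (height * width)


def max_bytes(bpp: float, height: int, width: int) -> int:
    """Largest whole-byte payload that stays within the given bpp budget."""
    return int(bpp * height * width // 8)


height, width = 1024, 2048          # example frame size (assumption)
budget = max_bytes(0.1, height, width)
print(budget, bits_per_pixel(budget, height, width))
```

At this size the whole primary image has to fit in about 26 KB, which is why the abstract emphasizes reusing side information rather than spending bits on detail.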

2605.22051 2026-05-22 cs.CV Version update

EasyVFX: Frequency-Driven Decoupling for Resource-Efficient VFX Generation

EasyVFX: 用于资源高效视觉效果生成的频率驱动解耦

Yue Ma, Xu Ye, Qinghe Wang, Yucheng Wang, Hongyu Liu, Yinhan Zhang, Xinyu Wang, Yuanpeng Che, Shanhui Mo, Paul Liang, Fangneng Zhan, Qifeng Chen

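Affiliations: HKUST (Hong Kong University of Science and Technology); DUT; THU (Tsinghua University); MIT (Massachusetts Institute of Technology)

AI summary: This paper proposes the EasyVFX framework, which uses frequency-domain decomposition to decouple high- and low-frequency components, reducing the compute and data requirements of VFX generation and enabling efficient, high-quality VFX synthesis.

Comments Accepted by SIGGRAPH 2026. Project page: https://easy-vfx.github.io/

Details
AI Chinese abstract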

生成高保真视觉效果(VFX)通常需要大量数据集和高昂的计算资源,因为空间纹理和时间动态之间存在复杂的耦合。本文介绍了EasyVFX,一个资源高效的框架,能够在严格约束下实现逼真的VFX合成。我们的核心理念在于频域分解:我们观察到通过解耦高频成分(代表复杂的空间外观)和低频成分(代表全局运动动态),可以显著降低VFX的复杂性。这种频域解耦将高维学习问题转化为可管理的子任务,从而降低优化障碍并减少数据依赖性。基于这一见解,我们提出了一种双阶段训练范式。首先,我们设计了一种频率感知的专家混合(Freq-MoE)架构。通过利用软路由机制,我们的模型将专门的专家分配到不同的频谱带,使它们能够培养稳健的先验知识用于外观和运动动态。这种专业化使模型能够以更少的GPU资源获取基础的VFX知识。其次,我们引入了一种由新型频率约束损失驱动的测试时训练策略。这使预训练模型能够通过局部优化快速适应特定的、未见过的效果,仅需在单个GPU上进行约100步的训练。实验结果表明,EasyVFX生成的结构一致且视觉震撼的效果,证明了频率感知学习是使专业级VFX民主化的重要催化剂。

English abstract

Generating high-fidelity visual effects (VFX) typically demands massive datasets and prohibitive computational power due to the intricate coupling of spatial textures and temporal dynamics. In this paper, we introduce EasyVFX, a resource-efficient framework that achieves realistic VFX synthesis under stringent constraints. Our core philosophy lies in frequency-domain decomposition: we observe that the complexity of VFX can be significantly mitigated by decoupling high-frequency components, which represent intricate spatial appearances, from low-frequency components that encapsulate global motion dynamics. This spectral disentanglement transforms a high-dimensional learning problem into manageable sub-tasks, thereby lowering the optimization barrier and reducing data dependency. Building upon this insight, we propose a two-stage training paradigm. First, we design a Frequency-aware Mixture-of-Experts (Freq-MoE) architecture. By utilizing a soft routing mechanism, our model assigns specialized experts to distinct spectral bands, enabling them to cultivate robust priors for appearance and motion dynamics. This specialization allows the model to acquire foundational VFX knowledge with fewer GPU resources. Second, we introduce a Test-Time Training strategy powered by a novel Frequency-constraint Loss. This allows the pre-trained model to swiftly adapt to specific, unseen effects through localized optimizations, requiring only about 100 steps on a single GPU. Experimental results demonstrate that EasyVFX produces structurally consistent and visually stunning effects, proving that frequency-aware learning is a key catalyst for democratizing professional-grade VFX.
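
The frequency-domain decoupling described above can be illustrated on a toy 1D signal. This is only a sketch under stated assumptions: the naive DFT, the hard cutoff, and the names `split_bands` and `cutoff` are illustrative choices, not the decomposition used by EasyVFX, which operates on video data.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, adequate for a short toy signal."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_bands(x, cutoff):
    """Split x into a low band (|frequency| <= cutoff) and a high band (the rest)."""
    n = len(x)
    X = dft(x)
    is_low = [min(k, n - k) <= cutoff for k in range(n)]  # bins k and n-k mirror each other
    low = idft([X[k] if is_low[k] else 0 for k in range(n)])
    high = idft([0 if is_low[k] else X[k] for k in range(n)])
    return low, high


n = 32
signal = [math.sin(2 * math.pi * t / n) + 0.3 * math.sin(2 * math.pi * 10 * t / n)
          for t in range(n)]
low, high = split_bands(signal, cutoff=3)
```

Because the two masks are complementary, `low` and `high` sum back to the original signal, so each band can be modeled separately without losing information.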

2605.22044 2026-05-22 cs.CV Version update

Physiology and Anatomy Aware Inverse Inference of Myocardial Infarction for Cardiac Digital Twin

心肌梗死逆推推理的生理与解剖意识:用于心脏数字双胞胎

Mengxiao Wang, Yilin Lyu, Julia Camps, Ching Hui Sia, Mark Yan-Yee Chan, Yanrui Jin, Shuzhi Sam Ge, Chengliang Liu, Lei Li

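Affiliations: Shanghai Jiao Tong University, Shanghai, China; National University of Singapore, Singapore; Universitat Pompeu Fabra, Barcelona, Spain

AI summary: This paper proposes a noninvasive myocardial infarction localization method based on cardiac digital twins. It combines cine MRI with ECG and uses a Physiology and Anatomy Aware Network (PAA-Net) to infer the morphology and location of infarct regions more accurately, improving the accuracy and interpretability of inverse inference.

Comments Early-accepted by MICCAI 2026. This version corresponds to the submitted version. The final version will be available on Springer Link

Details
AI Chinese abstract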

准确定位心肌梗死对于风险分层至关重要。虽然LGE-MRI仍是金标准,但其资源消耗大。将运动MRI与ECG结合可以更详细地表示梗死特性。现有的逆推心肌梗死推断方法忽略了真实疤痕形态和心脏复极化,降低了对ECG细微变化的敏感性和对梗死引起电生理变化的可解释性。在本文中,我们提出了一种用于非侵入性心肌梗死定位的新框架。为了弥合仿真与现实之间的领域差距,我们引入了一种解剖意识的随机梗死合成策略,以合成真实、不规则的疤痕和边缘区,模拟缺血性横纹进展。我们然后构建了一个虚拟队列来模拟QRS-T波形,捕捉去极化和复极化动态。此外,我们设计了一种生理和解剖意识网络(PAA-Net),联合编码3D心肌几何和多导联ECG,以推断具有不同定位、大小、空间范围和横纹性的梗死区域。实验结果表明,我们的框架在逆推推断中显著优于现有方法,实现了疤痕和边缘区分割的Dice分数分别为0.7391和0.5503,同时进一步提高了ECG-梗死关系的可解释性。我们的代码将在接受后发布。

English abstract

Accurate localization of myocardial infarction is essential for risk stratification. While LGE-MRI remains the gold standard, it is resource-intensive. Integrating cine MRI with ECG enables a more detailed representation of infarct properties. Existing inverse MI inference methods overlook realistic scar morphology and cardiac repolarization, reducing sensitivity to subtle ECG variations and interpretability of infarct-induced electrophysiological changes. In this paper, we propose a novel framework for noninvasive MI localization using cardiac digital twins. To bridge the domain gap between simulation and reality, we introduce an anatomy-aware stochastic infarct synthesis strategy to synthesize realistic, irregular scars with border zones, mimicking ischemic transmural progression. We then construct a virtual cohort to simulate QRS-T waveforms, capturing both depolarization and repolarization dynamics. Furthermore, we design a Physiology and Anatomy Aware Network (PAA-Net) that jointly encodes 3D myocardial geometry and multi-lead ECGs to infer infarct areas with varying localizations, sizes, spatial extents, and transmuralities. Experimental results demonstrate that our framework significantly outperforms existing methods in inverse inference, achieving Dice scores of 0.7391 and 0.5503 for scar and border zone segmentation, respectively, while further enhancing the interpretability of the ECG-infarct relationship. Our code will be released upon acceptance.

2605.22036 2026-05-22 cs.CV cs.AI Version update

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

GA-VLN: 用于高效视觉-语言导航的几何感知鸟瞰图表示

Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Robbyant; School of Computing, National University of Singapore; The Hong Kong University of Science and Technology

AI summary: This paper proposes the GA-VLN framework, which introduces a geometry-aware bird's-eye-view representation (GA-BEV) that integrates explicit and implicit geometric information to improve the efficiency and performance of vision-language navigation. Experiments show state-of-the-art results using only navigation data.

Details
AI Chinese abstract

尽管在视觉-语言导航(VLN)领域取得了显著进展,现有方法仍依赖密集的RGB视频,产生过多的片段标记且缺乏显式的空间结构,导致计算开销大且空间推理能力有限。为了解决这些问题,我们引入了几何感知的鸟瞰图(GA-BEV)-一种紧凑且3D基础的特征表示,将显式和隐式的几何线索整合到多模态大语言模型(MLLM)导航系统中。我们通过将视觉特征投影到3D空间并聚合为以代理为中心的布局来构建BEV空间地图,该布局在保持几何一致性的同时减少标记冗余。为了进一步丰富几何理解,我们将预训练的3D基础模型的特征融入BEV空间,注入从大规模3D重建任务中学习到的结构先验。这些互补的线索-基于深度的显式投影和隐式学习的先验-产生紧凑但空间表达能力强的表示,显著提高了导航效率和性能。实验表明,我们的方法仅使用导航数据即可取得最先进的结果,无需DaGger增强或混合VQA训练,证明了所提GA-VLN框架的鲁棒性和数据效率。

English abstract

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.
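
The explicit, depth-based half of the BEV construction described above can be sketched as follows. This is a simplified illustration under assumptions: a pinhole camera with hypothetical intrinsics `fx` and `cx`, a hypothetical cell size, mean pooling per cell, and no use of image height. It is not the GA-VLN implementation.

```python
import math


def build_bev(pixels, fx, cx, cell=0.5):
    """Aggregate per-pixel features into agent-centric BEV cells.

    pixels: list of (u, depth_m, feature) with u the image column.
    Returns {(ix, iz): mean feature}, ix lateral and iz forward cell indices.
    """
    sums, counts = {}, {}
    for u, depth, feat in pixels:
        x = (u - cx) * depth / fx      # lateral offset from the pinhole model
        z = depth                      # forward distance from the agent
        key = (math.floor(x / cell), math.floor(z / cell))
        if key not in sums:
            sums[key] = [0.0] * len(feat)
            counts[key] = 0
        sums[key] = [s + f for s, f in zip(sums[key], feat)]
        counts[key] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}


# Three hypothetical pixel tokens; the first two land in the same BEV cell.
pixels = [(320, 2.0, [1.0, 0.0]), (330, 2.1, [0.0, 1.0]), (640, 4.0, [2.0, 2.0])]
bev = build_bev(pixels, fx=320.0, cx=320.0)
```

Pooling nearby pixels into one cell is what reduces token redundancy: here three pixel tokens become two BEV tokens whose positions carry metric meaning.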

2605.22035 2026-05-22 cs.CV cs.CL Version update

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

HyLoVQA: 动态超网络生成低秩适应用于连续视觉问答

Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li

Affiliations: School of Computer Science, Hubei University, Wuhan 430062, China; Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China; Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China

AI summary: HyLoVQA uses a dynamic hypernetwork to generate low-rank adapters, addressing task interference in continual visual question answering and improving adaptation to the current task and object.

Comments Accepted by IJCAI 2026

Details
AI Chinese abstract

连续视觉问答(VQA)需要在非稳态的视觉输入和问题流中学习,同时保持过去知识。大多数先前方法通过更新大量共享参数集来适应,这通常导致跨层任务干扰,阻碍对当前任务和对象的准确适应。为了解决这一限制,我们提出了HyLoVQA。它维护一个具有漂移鲁棒性的锚点记忆库。该库存储视觉对象的内容和文本任务的内容,并使用当前输入特征进行更新。基于检索到的锚点,超网络生成轻量级低秩适应(LoRA)适配器。这确保了参数效率,使模型能够动态适应每个任务和对象。此外,我们提出了一个对齐损失,将特征空间中的语义差异与参数空间中的功能变化对齐,从而约束LoRA适配器保持专注于当前任务和对象。在VQA v2和NExT-QA上广泛实验表明,HyLoVQA在标准和组合设置下优于先前最先进的方法。

English abstract

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.
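
The idea of a hypernetwork emitting LoRA adapters from a retrieved anchor can be sketched in a few lines. Everything below is an assumption made for illustration: the class name `LoRAHyperNet`, the purely linear hypernetwork, the tiny dimensions, and the random initialization. The paper's actual architecture and its alignment loss are not reproduced here.

```python
import random


def matvec(M, v):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRAHyperNet:
    """Maps an anchor embedding to low-rank factors A (rank x d_in) and B (d_out x rank)."""

    def __init__(self, anchor_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank

        def init(rows):
            return [[rng.gauss(0.0, 0.1) for _ in range(anchor_dim)] for _ in range(rows)]

        self.to_A = init(rank * d_in)    # anchor -> flattened A
        self.to_B = init(d_out * rank)   # anchor -> flattened B

    def generate(self, anchor):
        a = matvec(self.to_A, anchor)
        b = matvec(self.to_B, anchor)
        A = [a[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        B = [b[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return A, B


def adapted_forward(W, A, B, x):
    """Frozen weight W plus the generated low-rank update: (W + B A) x."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [p + q for p, q in zip(base, delta)]


hyper = LoRAHyperNet(anchor_dim=3, d_in=4, d_out=2, rank=1)
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # frozen base layer (toy values)
x = [1.0, 2.0, 3.0, 4.0]
A0, B0 = hyper.generate([0.0, 0.0, 0.0])            # null anchor: no adaptation
A1, B1 = hyper.generate([1.0, 0.0, 0.0])            # anchor for one task/object
A2, B2 = hyper.generate([0.0, 1.0, 0.0])            # anchor for another
```

Only the hypernetwork is trained while `W` stays shared and frozen, which is how per-task adaptation avoids overwriting a common parameter set.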

2605.22034 2026-05-22 cs.CV cs.AI Version update

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

AgroVG:一个大规模多源基准用于农业视觉 grounding

Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

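Affiliations: China Agricultural University; Sun Yat-sen University; Tianjin University; Tsinghua University; Southwest Jiaotong University; National Supercomputing Center in Shenzhen

AI summary: This paper proposes the AgroVG benchmark for evaluating agricultural visual grounding. Using multi-source datasets and task-specific protocols, it evaluates models in single-target, multi-target, and target-absent settings and reveals the shortcomings of existing models on agricultural visual grounding.

Comments 45 pages, 12 figures

Details
AI Chinese abstract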

视觉 grounding,即根据自然语言描述定位物体的任务,是农业人工智能系统的基础能力,可应用于选择性除草、疾病监测和定向收获。农业视觉 grounding的可靠评估具有挑战性,因为农业目标往往小、重复、被遮挡或形状不规则,且指令可能指一个、多个或没有物体。因此,评估此能力需要联合测试定位精度、目标集完整性和存在感知的回避。为了解决这些挑战,我们引入了AgroVG,一个多源基准,将农业 grounding 视为广义集合预测:给定一张图像和一个指称表达,模型必须返回所有匹配的目标实例或在没有目标时回避。AgroVG包含来自十个数据集的10,071个注释-图像查询对,涵盖六个目标类别:作物/杂草、水果、小麦头、害虫、植物疾病和树冠。它支持所有六个类别上的边界框 grounding(T1)和具有可靠实例级像素注释的数据源上的实例掩码 grounding(T2),查询涵盖单目标、多目标和无目标场景。AgroVG进一步提供任务特定的协议用于框集匹配和查询级掩码覆盖。对26种模型配置的零样本评估揭示了持续的差距:最好的多目标Set-F1仅达到0.35,最好的正查询掩码成功率在IoU@0.75下仍低于0.17。数据和代码可在https://anonymous.4open.science/r/AgroVG-5172/上获得。

English abstract

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.
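
A set-level F1 with abstention, as described above, might be computed along the following lines. This is a sketch under assumptions: greedy one-to-one matching, an IoU threshold of 0.5, and scoring a correct abstention as 1.0 are all illustrative choices, and the benchmark's actual box-set matching protocol may differ.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def set_f1(pred, gt, thr=0.5):
    """F1 between a predicted and a ground-truth set of boxes."""
    if not pred and not gt:
        return 1.0                      # correct abstention on a target-absent query
    if not pred or not gt:
        return 0.0                      # missed all targets, or hallucinated some
    pairs = sorted(((box_iou(p, g), i, j)
                    for i, p in enumerate(pred)
                    for j, g in enumerate(gt)), reverse=True)
    used_p, used_g, tp = set(), set(), 0
    for score, i, j in pairs:           # greedy matching, best overlaps first
        if score < thr:
            break
        if i in used_p or j in used_g:
            continue
        used_p.add(i)
        used_g.add(j)
        tp += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
partial = set_f1([(0, 0, 10, 10)], gt_boxes)   # one of two targets found
```

Scoring at the set level penalizes both missing instances and extra boxes, which is why a model that localizes one plant well can still score poorly on a multi-target query.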

2605.22031 2026-05-22 cs.CV Version update

SO-Mamba: State-Ownership Mamba for Unrolled MRI Reconstruction

SO-Mamba:用于展开MRI重建的态所有权Mamba

Pengcheng Fang, Hongli Chen, Fangfang Tang, Feng Liu, Xiaohao Cai, Shanshan Shan

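Affiliations: University of Southampton; University of Queensland; Soochow University

AI summary: This paper proposes SO-Mamba, a state-ownership Mamba regularizer for unrolled MRI reconstruction that assigns the reconstruction evidence in each Mamba stage to recurrent residency, state-interface access, and non-state output correction, with the aim of improving reconstruction quality and efficiency.

Details
AI Chinese abstract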

加速MRI重建需要在大空间区域内恢复缺失细节的同时保持解剖学一致的结构。状态空间模型如Mamba提供高效的长距离建模,使其成为展开重建中的有吸引力的学得正则化器。然而,在数据一致性耦合的展开求解器中,不同阶段操作于不同的重建迭代,其中驻留载体应在不同阶段保持一致的重建内容,而阶段依赖的非驻留证据则与当前更新相关。将这些角色统一处理会将持久驻留载体证据和更新依赖的非驻留证据置于相同的递归内容路由中。为此,我们提出了SO-Mamba,一种态所有权Mamba正则化器,该正则化器将每个Mamba阶段的重建证据分配到递归驻留、态接口访问和非态输出校正。SO-Mamba通过State-Ownership Router (SOR)实现这一所有权规则,构建递归内容的驻留载体,并将非驻留证据路由到B/C态接口的仿射调制和输出校正出口。驻留载体提供Mamba内容路由,而非驻留证据流调整态接口并通过输出出口贡献,而无需进入递归内容路由。我们进一步引入了两级外带泄漏诊断,通过测量选择性扫描状态轨迹中的外带能量和扫描后Mamba读取中的外带能量,将隐藏状态存储与读取表达分开。在五个公开的MRI重建基准上进行的实验表明,SO-Mamba在具有竞争性计算效率的CNN、Transformer和Mamba基线中表现一致提升。

English abstract

Accelerated MRI reconstruction requires recovering missing details while preserving anatomically coherent structures across large spatial regions. State-space models such as Mamba provide efficient long-range modeling, making them attractive learned regularizers for unrolled reconstruction. However, in a data-consistency-coupled unrolled solver, different stages operate on different reconstruction iterates, where the resident carrier should preserve coherent reconstruction content across stages while stage-dependent non-resident evidence is tied to the current update. Treating these roles uniformly can place persistent resident-carrier evidence and update-dependent non-resident evidence into the same recurrent content route. We therefore propose SO-Mamba, a state-ownership Mamba regularizer that assigns reconstruction evidence within each Mamba stage to recurrent residency, state-interface access, and non-state output correction. SO-Mamba implements this ownership rule with a State-Ownership Router (SOR), which constructs a resident carrier for recurrent content and routes non-resident evidence to affine modulation of the B/C state interfaces and an output correction outlet. The resident carrier supplies the Mamba content route, while the non-resident evidence stream adapts the state interfaces and contributes through the output outlet without entering the recurrent content route. We further introduce a two-level outer-band leakage diagnostic that separates hidden-state storage from readout expression by measuring outer-band energy in the selective-scan state trajectory and the post-scan Mamba readout. Experiments on five public MRI reconstruction benchmarks spanning diverse anatomies, sampling patterns, and coil configurations show that SO-Mamba consistently improves over CNN-, Transformer-, and Mamba-based baselines with competitive computational efficiency.

2605.22017 2026-05-22 cs.CV Version update

Diverse Yet Consistent: Context-Guided Diffusion with Energy-Based Joint Refinement for Multi-Agent Motion Prediction

多样且一致:基于能量的联合细化的上下文引导扩散用于多智能体运动预测

Lei Chu, Yuhuan Zhao

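Affiliations: University of Southern California

AI summary: This paper proposes a diffusion-based framework that improves multi-agent motion prediction by exploiting rich contextual information from historical trajectories. A guidance mechanism increases the diversity and expressiveness of predicted motions, and an energy-based formulation refines the joint trajectory distribution while keeping individual trajectories plausible. Experiments on several benchmark datasets show that it outperforms existing methods.

Comments MEIS-- CVPR

Details
AI Chinese abstract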

深度生成模型由于其能够捕捉多模态分布和表示多样化的人类行为的能力,已成为人类运动预测的有希望的方法。然而,生成在相互作用代理之间既多样又联合一致的预测仍然具有挑战性。此外,大多数现有方法主要使用单代理(边缘)度量进行评估,这无法充分反映多代理互动的联合动态。我们提出了一种基于扩散的框架,通过利用历史轨迹中的丰富上下文信息来改进多代理运动预测。这种信息通过引导机制进行整合,以增强预测动作的多样性和表达性。为了进一步强制交互一致性,我们引入了基于能量的公式,通过细化联合轨迹分布的同时保持个体轨迹的合理性。在四个基准数据集上的大量实验表明,我们的方法在多个指标上均优于现有方法。值得注意的是,我们的方法在ETH/UCY上显著提高了边缘(ADE/FDE)和联合(JADE/JFDE)度量,与先前的联合预测方法相比,它在保持竞争性联合性能的同时,显著提高了边缘度量。

English abstract

Deep generative models have become a promising approach for human motion prediction due to their ability to capture multimodal distributions and represent diverse human behaviors. However, generating predictions that are both diverse and jointly consistent among interacting agents remains challenging. In addition, most existing approaches are primarily evaluated using single-agent (marginal) metrics, which fail to fully reflect the joint dynamics of multi-agent interactions. We propose a diffusion-based framework that improves multi-agent motion prediction by leveraging rich contextual information from historical trajectories. This information is incorporated through a guidance mechanism to enhance the diversity and expressiveness of predicted motions. To further enforce interaction consistency, we introduce an energy-based formulation that refines the joint trajectory distribution while preserving the plausibility of individual trajectories. Extensive experiments on four benchmark datasets demonstrate that our approach consistently outperforms existing methods. Notably, our approach substantially improves both marginal (ADE/FDE) and joint (JADE/JFDE) metrics on ETH/UCY over strong marginal baselines. Compared with prior joint prediction methods, it delivers significant gains in marginal metrics while maintaining competitive joint performance.
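
The gap between marginal and joint metrics mentioned above can be shown with a toy case. The sketch assumes the common best-of-K definitions: marginal ADE picks the best sample independently for each agent, while joint ADE (JADE) must pick one sample for the whole scene. Function names and trajectories are hypothetical.

```python
import math


def traj_ade(pred, gt):
    """Average displacement error of one trajectory (list of (x, y) points)."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def marginal_ade(samples, gt):
    """Best-of-K ADE where every agent may choose a different sample."""
    n_agents = len(gt)
    return sum(min(traj_ade(s[a], gt[a]) for s in samples)
               for a in range(n_agents)) / n_agents


def joint_ade(samples, gt):
    """Best-of-K ADE where all agents must share the same sample (JADE)."""
    n_agents = len(gt)
    return min(sum(traj_ade(s[a], gt[a]) for a in range(n_agents)) / n_agents
               for s in samples)


# Two agents, two joint samples; each sample gets only one agent right.
gt = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
samples = [
    [[(0, 0), (1, 0)], [(0, 2), (1, 2)]],     # agent 0 exact, agent 1 off by 1
    [[(0, -1), (1, -1)], [(0, 1), (1, 1)]],   # agent 0 off by 1, agent 1 exact
]
print(marginal_ade(samples, gt), joint_ade(samples, gt))
```

In this toy scene the marginal score is perfect even though no single sample predicts both agents correctly, which is the failure mode that joint metrics expose.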

2605.22015 2026-05-22 cs.CV cs.AR Version update

ORBIS: Output-Guided Token Reduction with Distribution-Aware Matching for Video Diffusion Acceleration

ORBIS: 通过分布感知匹配的输出引导标记减少以加速视频扩散

Hangyeol Lee, Joo-Young Kim

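Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: This paper proposes ORBIS, a software-hardware co-designed accelerator for video diffusion Transformers. It uses output activations from the previous timestep to obtain more accurate token similarity, which improves matching quality and allows a higher token reduction ratio. Together with a distribution-aware token matching algorithm and dedicated hardware, it achieves a higher token reduction ratio, faster execution, and lower energy consumption than existing approaches.

Details
AI Chinese abstract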

扩散Transformer(DiT)已发展为生成高质量图像和视频的强大模型架构。在视频DiT中,3D空间时间注意力使token长度与帧数成正比,显著增加计算成本。标记减少方法通过利用空间冗余来缓解这一成本,但现有方法依赖于不准确的相似性估计和轻量级匹配算法,导致匹配质量差且仅带来微小的加速效果。为克服这些限制,我们提出了ORBIS,一种为视频DiT设计的SW-HW协同加速器。ORBIS利用前一时间步的输出激活以获得更准确的token间相似性,显著提高匹配质量并实现更高的token减少比例。我们进一步引入了分布感知标记匹配(DATM)算法,该算法捕捉全局token分布并显式最小化token对损失以获得额外收益。为了完全隐藏DATM延迟,我们设计了专用、深度流水线化的硬件并通过量化来最小化其硬件成本,仅占用总面积的2.4%,且精度损失可忽略不计。大量实验表明,ORBIS的token减少比例比最先进的方法AsymRnR高约2倍,相比NVIDIA A100 GPU实现了高达4.5倍的速度提升和79.3%的能耗降低。

English abstract

Diffusion Transformer (DiT) has emerged as a powerful model architecture for generating high-quality images and videos. In the case of video DiT, 3D Spatio-Temporal Attention increases token length in proportion to the number of frames, sharply increasing computational cost. Token reduction methods mitigate this cost by exploiting spatial redundancy, but existing approaches rely on inaccurate similarity estimates and lightweight matching algorithms, resulting in poor matching quality and only marginal acceleration. To overcome these limitations, we propose ORBIS, an SW-HW co-designed accelerator for video DiT. ORBIS leverages the output activation from the previous timestep to obtain more accurate inter-token similarity, substantially improving matching quality and enabling a higher token reduction ratio. We further introduce a Distribution-Aware Token Matching (DATM) algorithm that captures global token distribution and explicitly minimizes token-pair loss for additional gains. To fully hide DATM latency, we design specialized, deeply pipelined hardware and minimize its hardware cost through quantization, occupying only 2.4% of total area with negligible accuracy loss. Extensive experiments show that ORBIS achieves about 2x higher token reduction ratio than the state-of-the-art approach, AsymRnR, while delivering up to 4.5x speedup and 79.3% energy reduction compared to an NVIDIA A100 GPU.

2605.22013 2026-05-22 cs.CV cs.GR cs.LG Version update

PointLLM-R: Enhancing 3D Point Cloud Reasoning via Chain-of-Thought

PointLLM-R: 通过链式推理增强3D点云推理

Chaoqi Chen, Qile Xu, Wenjun Zhou, Hui Huang

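Affiliations: Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE), Shenzhen University, China

AI summary: This paper proposes a data-centric framework for constructing large-scale chain-of-thought supervision to improve 3D point cloud understanding. A two-stage pipeline refines point-text instruction data and synthesizes high-quality reasoning paths, yielding the 55K-sample PoCoTI dataset. PointLLM-R, trained on it, is a reasoning-capable 3D multimodal language model, and experiments show state-of-the-art performance on generative 3D classification and captioning.

Details
AI Chinese abstract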

通过语言理解3D点云仍然是计算机图形学和视觉计算中的基本挑战,由于点云数据的不规则结构和现有3D多模态模型中缺乏显式推理。尽管链式推理(CoT)在LLM和基于图像的MLLM中表现出强大的有效性,但其在3D理解中的扩展仍鲜有探索。本文提出了一种数据驱动的框架,用于构建大规模CoT监督,专门针对3D点云理解。我们的框架由一个两阶段流程组成,首先通过基于视觉语言模型的质量评估和参考引导细化点文本指令数据,然后通过人机协同提示优化(HiLPO)合成高质量的推理路径。使用这种方法,我们构建了PoCoTI,一个包含55K样本的CoT增强点文本指令遵循数据集。在PoCoTI上微调PointLLM,得到PointLLM-R,一个具备推理能力的3D多模态语言模型。在生成3D分类和描述任务上的大量实验表明,PointLLM-R在生成3D分类和描述任务中达到了最先进的性能,并且能够稳健地推广到现实世界扫描点云和多轮对话场景中。

English abstract

Understanding 3D point clouds through language remains a fundamental challenge in computer graphics and visual computing, due to the irregular structure of point cloud data and the lack of explicit reasoning in existing 3D multimodal models. While Chain-of-Thought (CoT) reasoning has shown strong effectiveness in LLMs and image-based MLLMs, its extension to 3D understanding remains largely underexplored. In this paper, we propose a data-centric framework for constructing large-scale CoT supervision tailored to 3D point cloud understanding. Our framework consists of a two-stage pipeline that first refines point-text instruction data via vision-language-model-based quality evaluation and reference-guided refinement, and then synthesizes high-quality reasoning paths through Human-in-the-Loop Prompt Optimization (HiLPO). Using this approach, we build PoCoTI, a CoT-enhanced point-text instruction-following dataset containing 55K samples with explicit reasoning paths. Fine-tuning PointLLM on PoCoTI yields PointLLM-R, a reasoning-capable 3D multimodal language model. Extensive experiments on generative 3D classification and captioning demonstrate that PointLLM-R achieves state-of-the-art performance and generalizes robustly to real-world scanned point clouds and multi-turn dialogue scenarios.

2605.22012 2026-05-22 cs.CL cs.CV Version update

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

LatentOmni: 通过统一的音频-视觉潜在推理重新思考多模态理解

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

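Affiliations: School of AI, Shanghai Jiao Tong University; Kling Team, Kuaishou Technology; Peking University; HKUST (Hong Kong University of Science and Technology); CASIA (Institute of Automation, Chinese Academy of Sciences); Nanjing University; Renmin University of China; Tsinghua University

AI summary: This paper proposes the LatentOmni framework, which performs cross-modal reasoning in a unified audio-visual latent space, using feature-level supervision and Omni-Sync Position Embedding to maintain temporal consistency. It achieves the best performance among the evaluated open-source models on multiple audio-visual reasoning benchmarks.

Comments 21 pages, 15 figures

Details
AI Chinese abstract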

联合音频-视觉推理对于多模态理解至关重要,但当前的多模态大语言模型(MLLMs)在需要从两种模态中提取细粒度证据进行推理时仍存在困难。一个核心限制是显式的基于文本的推理链(CoT)将连续的音频-视觉信号压缩成离散的标记,削弱了时间定位并使中间推理偏向语言先验。我们主张统一的潜在空间是此类推理更好的媒介,因为它保留了密集的感知信息,同时仍能与自回归生成兼容。基于这一见解,我们提出了LatentOmni,一个跨模态推理框架,将文本推理与音频-视觉潜在状态交织在一起。LatentOmni引入了特征级监督,以对齐潜在推理状态与任务相关的感知特征,并使用Omni-Sync Position Embedding(OSPE)来保持潜在音频和视觉状态之间的时间一致性。我们进一步构建了LatentOmni-Instruct-35K数据集,该数据集包含音频-视觉交织推理轨迹,用于监督潜在空间推理。在多个音频-视觉推理基准测试中的综合评估表明,LatentOmni在评估的开源模型中取得了最佳性能,并且在显式文本CoT基线中表现一致,支持潜在空间联合推理作为更强多模态理解的有前途的路径。

English abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

2605.22011 2026-05-22 cs.CV Version update

Rethinking Token Reduction for Diffusion Models via Output-Similarity-Awareness

重新思考扩散模型的token减少:通过输出相似性意识

Hangyeol Lee, Hyojeong Lee, Joo-Young Kim

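Affiliations: KAIST (Korea Advanced Institute of Science and Technology)

AI summary: This paper proposes DiTo, an output-centric token reduction method that uses output similarity across adjacent timesteps to establish token correspondences, reducing computational cost while improving generation quality.

Details
AI Chinese abstract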

扩散变换器(DiTs)在图像生成质量上表现出色,但其计算复杂度与token数量呈二次关系。尽管已提出多种token减少(TR)方法以缓解这一成本,但它们忽略了生成模型的主要目标:最小化恢复误差,这需要反映输出token的相似性。它们仅依赖于输入token相似性,这是来自仅减少的ViT范式继承的,导致与该目标的根本不一致。为弥合这一差距,我们提出DiTo,一种新的TR范式,其重点转向以输出为中心的token减少。基于观察到输出token相似性在相邻时间步中保持一致,DiTo利用先前步骤的相似性作为有效代理,在匹配时间步中建立token对应关系,然后在多个后续减少时间步中重用。为了优化这种交错调度,我们提出Pair Match Ratio(PMR)引导的区间调度,以确定最佳匹配频率。此外,为了减轻由重复重用导致的局部近似误差和由此产生的阻塞伪影,我们提出频率感知的token匹配,通过引入选择频率惩罚。广泛的实验表明,DiTo在可比的加速下,比现有TR方法在PSNR上高出1.6-3.9 dB,实现了更优的帕累托前沿。

English abstract

Diffusion Transformers (DiTs) achieve superior image generation quality but suffer from quadratic computational complexity relative to token count. While various token reduction (TR) methods have been proposed to mitigate this cost, they overlook the primary objective of generative models: minimizing recovery error, which requires reflecting output token similarity. They rely solely on input token similarity inherited from reduction-only ViT paradigms, leading to a fundamental misalignment with this objective. To bridge this gap, we propose DiTo, a novel TR paradigm that shifts the focus toward output-centric token reduction. Based on the observation that output token similarity is consistently preserved across adjacent timesteps, DiTo utilizes prior-step similarities as an effective proxy to establish token correspondences at a Matching timestep, which are then reused across multiple subsequent Reduction timesteps. To optimize this interleaved scheduling, we propose Pair Match Ratio (PMR)-guided Interval Scheduling to determine the optimal matching frequency. Furthermore, to mitigate localized approximation errors and resulting blocking artifacts caused by repeated reuse, we propose Frequency-aware Token Matching by incorporating a selection-frequency penalty. Extensive experiments demonstrate that DiTo consistently outperforms existing TR methods with 1.6-3.9 dB higher PSNR at comparable speedups, achieving a superior Pareto frontier.
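
The match-once, reuse-later schedule described above can be sketched as follows. The details are assumptions made for illustration: an alternating source/destination split in the style of bipartite token merging, cosine similarity on the previous step's outputs, and dropping (rather than averaging) merged tokens. PMR-guided scheduling and the frequency penalty are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_tokens(prev_outputs, r):
    """Matching step: pair the r most redundant source tokens with destinations."""
    n = len(prev_outputs)
    src, dst = range(0, n, 2), range(1, n, 2)
    edges = []
    for i in src:
        j = max(dst, key=lambda d: cosine(prev_outputs[i], prev_outputs[d]))
        edges.append((cosine(prev_outputs[i], prev_outputs[j]), i, j))
    edges.sort(reverse=True)
    return {i: j for _, i, j in edges[:r]}    # source index -> destination index


def reduce_tokens(tokens, mapping):
    """Reduction step: keep only tokens that were not merged away."""
    kept = [i for i in range(len(tokens)) if i not in mapping]
    return kept, [tokens[i] for i in kept]


def restore_tokens(n, kept, reduced_out, mapping):
    """Recover a full-length output by copying each destination's result to its source."""
    out = [None] * n
    for i, v in zip(kept, reduced_out):
        out[i] = v
    for i, j in mapping.items():
        out[i] = out[j]
    return out


prev_out = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.1, 1.0]]   # hypothetical step t-1 outputs
mapping = match_tokens(prev_out, r=1)                          # computed once
kept, reduced = reduce_tokens(["t0", "t1", "t2", "t3"], mapping)
restored = restore_tokens(4, kept, [s.upper() for s in reduced], mapping)
```

The same `mapping` can be passed to `reduce_tokens` and `restore_tokens` over several later timesteps, so the cost of matching is amortized across them.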

2605.22002 2026-05-22 cs.CV Version update

ConvNeXt-FD: A Fractal-Based Deep Model for Robust Biomedical Image Segmentation

ConvNeXt-FD:一种基于分形的深度模型用于鲁棒的生物医学图像分割

Joao Batista Florindo, Amanda Pontes de Oliveira Ornelas

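Affiliations: Institute of Mathematics, Statistics and Scientific Computing, Department of Applied Mathematics, University of Campinas

AI summary: This paper proposes ConvNeXt-FD, a fractal-based deep learning model for more robust biomedical image segmentation. It combines the Dice coefficient with a boundary-aware regularization term to increase the model's sensitivity to object boundaries and shape fidelity.

Details
AI Chinese abstract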

生物医学图像分割是医疗诊断和治疗计划中的关键任务,能够精确勾勒解剖结构和病理区域。尽管取得了显著进展,但由于不同医学成像模态中固有的变异性、噪声和复杂的形态,仍存在挑战。本文介绍了一种新的深度学习架构ConvNeXt-FD,基于类似U-Net的编码器-解码器框架,利用强大的ConvNeXt主干网络。我们的方法结合了一种混合损失函数,该函数结合了Dice系数和受可微分分形维度公式启发的边界感知正则化项,旨在增强模型对物体边界和形状保真的敏感性。我们严格评估了ConvNeXt-FD在六个不同的生物医学数据集上的表现:BUSI(乳腺超声图像)、DDTI(甲状腺超声图像)、FluoCells(荧光细胞图像)、IDRiD(糖尿病视网膜病变图像用于视盘分割)、ISIC2018(皮肤病变图像)和MoNuSeg(核分割)。实验结果表明,ConvNeXt-FD,特别是在使用ImageNet预训练权重初始化时,在各种指标上(包括Dice、Jaccard、准确率、灵敏度、特异度和假阳性率)均表现出竞争性甚至更优的性能。ConvNeXt作为强大编码器的结合,与边界感知正则化相结合,证明了在挑战性的生物医学上下文中捕获高级语义特征和细粒度边界细节的有效性,从而实现更准确和可靠的分割。

English abstract

Biomedical image segmentation is a critical task in medical diagnosis and treatment planning, enabling precise delineation of anatomical structures and pathological regions. Despite significant advancements, challenges persist due to the inherent variability, noise, and complex morphology present in diverse medical imaging modalities. This paper introduces ConvNeXt-FD, a novel deep learning architecture for robust biomedical image segmentation, built upon a U-Net-like encoder-decoder framework leveraging the powerful ConvNeXt backbone. Our approach integrates a hybrid loss function combining the Dice coefficient with a boundary-aware regularization term inspired by a differentiable formulation of Fractal Dimension, designed to enhance the model's sensitivity to object boundaries and shape fidelity. We rigorously evaluate ConvNeXt-FD across six distinct biomedical datasets: BUSI (Breast Ultrasound Images), DDTI (Thyroid Ultrasound Images), FluoCells (Fluorescent Cell Images), IDRiD (Diabetic Retinopathy Images for Optic Disc Segmentation), ISIC2018 (Skin Lesion Images), and MoNuSeg (Nuclei Segmentation). Experimental results demonstrate that ConvNeXt-FD, particularly when initialized with ImageNet pre-trained weights, achieves competitive and often superior performance compared to existing state-of-the-art methods across various metrics, including Dice, Jaccard, Accuracy, Sensitivity, Specificity, and False Positive Rate. The integration of ConvNeXt as a strong encoder, coupled with the boundary-aware regularization, proves effective in capturing both high-level semantic features and fine-grained boundary details, leading to more accurate and reliable segmentations in challenging biomedical contexts.
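
For readers unfamiliar with fractal dimension, the classical box-counting estimate gives the intuition behind the regularizer mentioned above. Note the swap: this sketch is the standard, non-differentiable estimator on a binary mask, not the differentiable formulation the paper uses in its loss. The box sizes and function names are assumptions.

```python
import math


def box_count(mask, s):
    """Number of s x s boxes that contain at least one foreground pixel."""
    n = len(mask)
    count = 0
    for by in range(0, n, s):
        for bx in range(0, n, s):
            if any(mask[y][x]
                   for y in range(by, min(by + s, n))
                   for x in range(bx, min(bx + s, n))):
                count += 1
    return count


def box_counting_dimension(mask, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N(s) against log(1/s); mask must be non-empty."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(mask, s)) for s in sizes]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))


n = 16
filled = [[1] * n for _ in range(n)]                                   # solid region
line = [[1 if y == 0 else 0 for _ in range(n)] for y in range(n)]      # thin boundary
```

A solid region scales like a 2D object and a thin curve like a 1D one, so the estimate is sensitive to how ragged or smooth a predicted boundary is.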

2605.22000 2026-05-22 cs.CV cs.AI 版本更新

Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

从相位对比背光干涉断层扫描生成虚拟3D的H&E染色

Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr

发表机构 * Department of Biomedical Engineering, Johns Hopkins University(约翰霍普金斯大学生物医学工程系) Department of Pathology, Johns Hopkins Hospital(约翰霍普金斯医院病理学系)

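AI summary This paper introduces HistoBIT3D, the first voxel-wise paired dataset of BIT volumes and fluorescence-labeled nuclei, which enables quantitative evaluation of structural preservation in unsupervised virtual staining. Using this dataset, the authors present a virtual staining framework that uses bidirectional multiscale content consistency and cross-domain style reuse to translate BIT volumes with shift-variant contrast into realistic H&E volumes, improving 3D nuclei segmentation accuracy and boundary preservation.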

详情
AI中文摘要

三维(3D)未处理组织的病理学具有潜在的疾病管理变革能力,通过使组织微结构的体积分析和活体评估成为可能。背光干涉断层扫描(BIT)是一种新的相位显微镜技术,能够提供快速、非破坏性的未处理组织体积分像。然而,将BIT体积转化为临床可解释的H&E图像仍然具有挑战性,特别是由于移变对比和缺乏定量验证基准。我们引入HistoBIT3D,首个voxel-wise配对的BIT和荧光标记核数据集,使在无监督虚拟染色中结构保持的定量评估成为可能。利用该数据集,我们提出了一种新的虚拟染色框架,通过双向多尺度内容一致性和跨域风格复用来增强结构保真度和感知现实性,将具有移变对比度的BIT体积转化为逼真的H&E体积。我们的方法在现实感度量方面达到最先进的水平,同时显著提高了3D核分割精度和边界保持性,特别是在零shot Cellpose评估下。这些贡献共同建立了一个经过定量验证、结构忠实且可扩展的3D虚拟H&E染色流程,推动了无切片、体积分计算病理学的范式转变。我们的数据和代码可在:https://github.com/aasong113/HistoBIT3D_VirtualStaining。

英文摘要

Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.

2605.21988 2026-05-22 cs.CV cs.AI 版本更新

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

通过反事实强化学习学习视频大语言模型中的时空敏感性

Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯)

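AI summary This paper proposes CRPO, a counterfactual reinforcement learning method that makes Video LLMs more sensitive to spatiotemporal dynamics. By constructing counterfactual videos and adding a counterfactual relation reward, it suppresses shortcut policies that rely on static cues and improves spatiotemporal sensitivity on the DyBench benchmark.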

Comments Project website: https://ddz16.github.io/crpo.github.io/

详情
AI中文摘要

视频大语言模型(Video LLMs)在基准测试中表现出色,但往往通过单帧线索和语言先验来回答视频问题,而不是通过跟踪时空动态。在训练后强化学习(RL)中,这种问题进一步加剧,因为仅正确性奖励会进一步强化那些不跟踪视频动态但能获得高奖励的简略策略。为此,我们提出一个受控的反事实问题:如果视觉世界发生变化而问题保持不变,答案应改变还是保持不变?基于这一观点,我们提出了反事实关系策略优化(CRPO),一种双分支强化学习框架,用于提升时空敏感性。CRPO通过水平翻转和时间反转构建反事实视频,在原始和反事实分支上进行训练,并引入反事实关系奖励(CRR)以鼓励答案在动态问题中改变而在静态问题中保持不变。这种跨分支约束使简略策略难以在两个分支中持续获得奖励。为了评估这一特性,我们引入了DyBench,一个配对反事实视频基准,包含3,014个视频,涵盖可逆动态、运动方向和事件序列,以及一个严格的配对准确度指标,防止固定答案简略策略夸大分数。实验表明,CRPO在时空敏感性评估中优于先前的RL方法,同时保持了竞争性的通用视频性能。在Qwen3-VL-8B上,CRPO在DyBench P-Acc上比基模型提高了+7.7,在TimeBlind I-Acc上提高了+8.2,表明改进了时空敏感性而非更强依赖静态简略策略。项目网站可在https://ddz16.github.io/crpo.github.io/上找到。

英文摘要

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose \textbf{Counterfactual Relational Policy Optimization (CRPO)}, a dual-branch RL framework for improving \emph{spatiotemporal sensitivity}. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a \textbf{Counterfactual Relation Reward (CRR)} between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce \textbf{DyBench}, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .
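The counterfactual construction and the cross-branch reward described above can be sketched in a few lines. The abstract does not give the exact CRR formula or the DyBench scoring rule, so the binary reward and the "both answers correct" pair accuracy below are assumptions, and all function names are hypothetical.

```python
import numpy as np

def horizontal_flip(video):
    """video has shape (T, H, W, C); mirror every frame left to right."""
    return video[:, :, ::-1]

def temporal_reverse(video):
    """Play the frames in reverse order."""
    return video[::-1]

def relation_reward(ans_orig, ans_cf, is_dynamic):
    """Assumed binary reward: the answer should change for dynamic questions
    and stay the same for static questions."""
    changed = ans_orig != ans_cf
    return 1.0 if changed == is_dynamic else 0.0

def pair_accuracy(preds_orig, preds_cf, gold_orig, gold_cf):
    """Assumed strict metric: a pair counts only if both answers are correct."""
    hits = [po == go and pc == gc
            for po, pc, go, gc in zip(preds_orig, preds_cf, gold_orig, gold_cf)]
    return sum(hits) / len(hits)
```

Under this reading, a policy that always answers "left" scores zero on pairs whose gold answers are ("left", "right"), which is how a strict pair metric keeps fixed-answer shortcuts from inflating the score.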

2605.21981 2026-05-22 cs.CV 版本更新

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Le Zhang, Ning Mang, Aishwarya Agrawal

发表机构 * Mila – Québec AI Institute, UdeM(魁北克AI研究院,蒙特利尔大学) Utrecht University(乌得勒支大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

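AI summary This work studies image generation with vanilla diffusion transformers in representation space. It finds that a pretrained representation space makes flow-matching learning better conditioned, and the resulting model outperforms DiT$^\text{DH}$-XL on ImageNet.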

详情
AI中文摘要

流匹配与$x$预测--回归干净的数据点而非环境速度--已被证明在像素空间中能有效利用低维流形结构\cite{li2025back}。我们探究:预训练的表示空间虽然同样包含具有可比内在维度的低维数据流形,是否能提供更有利于流匹配学习的分布。通过比较像素、SD-VAE和DINOv2特征在四个几何轴上的表现,我们发现像素和DINOv2具有几乎相同的内在维度(两者$\hat{d}\!\approx\!33$),但DINOv2表现出$7.3\times$更高的有效秩、$35\times$更好的协方差条件数、$11.5\times$更低的超额峰度以及$1.7\times$更低的流形上插值误差;SD-VAE潜在特征始终处于中间位置,表明优势源于表示学习目标而非单纯的压缩。这些统计特性使流匹配回归变得条件良好,并消除了先前DINOv2扩散方法中对专门预测头或Riemannian运输的需要。我们提出了表示图像变换器(RiT):一个在冻结的DINOv2特征上通过$x$预测训练的vanilla Diffusion Transformer,仅通过维度感知的噪声调度和联合\texttt{[CLS]}-patch建模进行增强。在ImageNet $256{\times}256$上,RiT在无引导时达到FID 1.45,使用无分类器引导(classifier-free guidance)时达到1.14,以少$19\%$的参数量(676M vs.\ 839M)优于DiT$^\text{DH}$-XL。所得到的ODE在粗略离散化下可以高效求解:使用无分类器引导时,5步Heun步骤已达到FID 2.0,10步达到1.25,无需蒸馏或一致性训练。代码在https://github.com/lezhang7/RiT。

英文摘要

Flow matching with $x$-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space \cite{li2025back}. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both $\hat{d}\!\approx\!33$) yet DINOv2 exhibits $7.3\times$ higher effective rank, $35\times$ better covariance conditioning, $11.5\times$ lower excess kurtosis, and $1.7\times$ lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the \emph{Representation Image Transformer} (RiT): a vanilla Diffusion Transformer trained by $x$-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint \texttt{[CLS]}-patch modeling. On ImageNet $256{\times}256$, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT$^\text{DH}$-XL with $19\%$ fewer parameters (676M vs.\ 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, $5$ Heun steps already reach FID 2.0 and $10$ steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
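The link between $x$-prediction and the velocity that the ODE solver integrates can be written out under one common convention. The block below assumes the linear path $x_t = (1-t)\,\epsilon + t\,x$ with $t \in [0, 1)$ and a network $\hat{x}_\theta$ that predicts the clean point. The abstract does not state the paper's exact path or noise schedule, so the symbols are illustrative only.

```latex
\begin{aligned}
x_t &= (1-t)\,\epsilon + t\,x, \qquad \epsilon \sim \mathcal{N}(0, I),\\
v_t &= \frac{\mathrm{d}x_t}{\mathrm{d}t} = x - \epsilon = \frac{x - x_t}{1-t},\\
\hat{v}_\theta(x_t, t) &= \frac{\hat{x}_\theta(x_t, t) - x_t}{1-t},\\
\tilde{x}_{t+h} &= x_t + h\,\hat{v}_\theta(x_t, t),\\
x_{t+h} &= x_t + \tfrac{h}{2}\bigl[\hat{v}_\theta(x_t, t) + \hat{v}_\theta(\tilde{x}_{t+h}, t+h)\bigr].
\end{aligned}
```

The last two lines are one Heun (second-order) step of size $h$. Because the implied velocity divides by $1-t$, a practical sampler under this convention has to treat the final step specially, for example by stopping slightly before $t = 1$.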

2605.21980 2026-05-22 cs.CV cs.AI 版本更新

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

通过跨模态信息流解读并增强大视觉-语言模型中的情感电路

Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China(类脑智能感知与认知教育部重点实验室,中国科学技术大学) AIPD, Tencent(AIPD,腾讯)

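AI summary This paper proposes a steering-vector-based causal attribution framework for descriptive emotional reasoning. Using a purpose-built dataset, it reveals the emotional circuits behind a three-stage "Adapt-Aggregate-Execute" mechanism: visual emotional cues are aggregated in middle layers by sentiment-specific attention heads and then translated into narrative generation in deep layers through emotion-general pathways. Regulating the routing of emotional information to strengthen attention flow and semantic activation improves performance and mitigates emotional hallucinations.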

Comments Accepted by ICML 2026

详情
英文摘要

Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage ``Adapt-Aggregate-Execute'' mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

2605.21977 2026-05-22 cs.CV cs.AI 版本更新

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

视频作为自然增强:迈向统一的AI生成图像和视频检测

Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shenzhen Loop Area Institute(深圳河套学院)

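AI summary To address the cross-modal gap in AI-generated content detection, this work proposes VINA, a framework that trains jointly on image and video data, treats video frames as natural augmentations, and adds a cross-modal supervised contrastive objective. The result is a unified detector with improved robustness and transferability.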

详情
AI中文摘要

AI生成内容(AIGC)正在迅速提升,催生了需要在数据源、部署管道和视觉模态间通用的检测器的紧迫需求。一个高度通用的检测器应在分布变化下保持稳健。然而,我们发现了一种一致的失败模式:最先进的AI生成图像检测器在应用于从视频中提取的帧时往往会崩溃。通过系统分析,我们发现这种跨模态差距源于交织的合成无关视频处理转换,包括颜色转换、编码压缩、缩放和模糊,以及由现代视频生成器引入的模型特定指纹。受这些发现的启发,我们提出了VINA(Video as Natural Augmentation),一个统一的AIGC检测框架,联合训练图像和视频数据。VINA利用视频帧作为物理上合理的自然增强,并进一步引入跨模态监督对比目标,以在共享的真/假决策边界下对齐图像和视频表示。在14个图像、视频和现实世界基准测试中,VINA展示了双向收益,提高了鲁棒性和迁移性,并在几乎所有评估设置中实现了最先进的性能,无需复杂的增强或数据集特定调整。

英文摘要

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
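A supervised contrastive objective of the kind mentioned above pulls together embeddings that share a real/fake label, regardless of whether they come from an image or a video frame. The sketch below is a generic NumPy version of the standard supervised contrastive loss, not the authors' implementation; the temperature value and the function name are assumptions.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch of embeddings.

    z: (N, D) embeddings from images and video frames mixed together.
    labels: (N,) real/fake labels; samples with the same label are positives.
    """
    z = np.asarray(z, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    labels = np.asarray(labels)
    n = len(labels)
    sim = z @ z.T / temperature
    not_self = ~np.eye(n, dtype=bool)
    # Log-sum-exp over every other sample in the batch (the anchor is excluded).
    sim_masked = np.where(not_self, sim, -np.inf)
    row_max = sim_masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim_masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    pos = (labels[:, None] == labels[None, :]) & not_self
    pos_count = pos.sum(axis=1)
    has_pos = pos_count > 0  # anchors without a positive are skipped
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[has_pos] / pos_count[has_pos]
    return float(per_anchor.mean())
```

When same-label embeddings coincide and different-label embeddings are orthogonal, the loss is close to zero; when the labels cut across the embedding clusters, it is large. That is the pressure that aligns the two modalities under one decision boundary.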

2605.21973 2026-05-22 cs.CV 版本更新

Foresee-to-Ground: From Predictive Temporal Perception to Evidence-Driven Reasoning for Video Temporal Grounding

Foresee-to-Ground: 从预测性时间感知到证据驱动推理的视频时间接地

Zelin Zheng, Xinyan Liu, Ruixin Li, Antoni B. Chan, Guorong Li, Qingming Huang, Laiyun Qing

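AI summary This paper proposes F2G, a video temporal grounding framework that reformulates grounding as a verifiable Identify-then-Measure problem and combines Predictive Temporal Perception with Evidence-Driven Reasoning to improve grounding accuracy and robustness.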

Comments Accepted by ICML 2026

详情
AI中文摘要

当前视频大语言模型(Video-LLM)在视频时间接地(VTG)中的方法通常依赖于从无结构的视觉令牌流中直接生成时间戳,这通常导致脆弱的数值和不一致的边界。为了解决这个问题,我们提出了Foresee-to-Ground(F2G),一种将VTG重新表述为可验证的识别-测量问题的框架。F2G集成了预测性时间感知与证据驱动推理:它学习对边界敏感的时间表示,以构建一个覆盖整个视频的候选事件片段证据池,并将这些片段暴露给LLM作为可引用的证据单元,将边界预测与显式事件假设绑定。通过将事件识别与精确边界测量解耦,F2G稳定了接地并使预测可验证。广泛的实验表明,F2G在各种基准上都一致提高了接地准确性,能够在不同的Video-LLM后端之间稳健地转移,并保持了通用视频理解能力。

英文摘要

Current Video-LLM approaches for Video Temporal Grounding (VTG) typically rely on direct timestamp generation from an unstructured visual-token stream, often leading to brittle numerics and inconsistent boundaries. To address this, we propose Foresee-to-Ground (F2G), a framework that reformulates VTG as a verifiable Identify-then-Measure problem. F2G integrates Predictive Temporal Perception with Evidence-Driven Reasoning: it learns boundary-sensitive temporal representations to build a video-wide evidence pool of candidate event segments, and exposes these segments to the LLM as citable evidence units that bind boundary prediction to explicit event hypotheses. By decoupling event identification from precise boundary measurement, F2G stabilizes grounding and makes predictions verifiable. Extensive experiments demonstrate that F2G consistently improves grounding accuracy across diverse benchmarks, transfers robustly across different Video-LLM backbones, and preserves general video understanding capabilities.

2605.21970 2026-05-22 eess.IV cs.CV 版本更新

Entropy-Guided Self-Supervised Learning for Medical Image Classification

熵引导的自监督学习用于医学图像分类

Joao Florindo, Viviane Moura

发表机构 * Institute of Mathematics, Statistics and Scientific Computing(数学、统计与科学计算研究所) Department of Applied Mathematics, University of Campinas(应用数学系,坎皮纳斯大学)

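AI summary This paper proposes a deep learning framework that combines self-supervised learning and transfer learning, using an entropy-guided masked autoencoder and an ImageNet-pretrained model to improve the performance and robustness of medical image classification.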

详情
AI中文摘要

准确且鲁棒的医学图像分类对于早期疾病诊断和治疗计划至关重要。然而,有限的标注数据、高类内变异性以及细微的类间差异往往阻碍深度学习模型的性能。本文介绍了一种协同深度学习框架,利用自监督学习和迁移学习的优势来增强医学图像分类。我们的方法使用两个不同的ConvNeXt-Tiny模型:一个在大规模自然图像数据集(ImageNet)上预训练,另一个在目标医学数据集上使用熵引导的掩码自动编码器(MAE)预训练。然后,这两个模型在特定的医学图像分类任务上进行微调。最终采用基于平均预测概率的集成策略,结合这两个模型的互补见解。在四个多样化的医学成像数据集(乳腺超声图像(BUSI)、国际皮肤成像协作(ISIC)2018、Kvasir和COVID)上的严格实验验证显示,我们的集成方法在性能和鲁棒性方面均优于现有方法。MAE预训练显著提升了领域特定数据的特征学习,而ImageNet预训练提供了强大的可迁移特征。集成方法始终取得最先进的结果,优于单独模型和现有方法,突显了结合多样预训练策略在挑战性医学图像分析中的有效性。

英文摘要

Accurate and robust medical image classification is paramount for early disease diagnosis and treatment planning. However, challenges such as limited annotated data, high intra-class variability, and subtle inter-class differences often hinder the performance of deep learning models. This paper introduces a synergistic deep learning framework that leverages the strengths of self-supervised learning and transfer learning for enhanced medical image classification. Our approach employs two distinct ConvNeXt-Tiny models: one pre-trained on a large-scale natural image dataset (ImageNet) and another pre-trained using an entropy-guided Masked Autoencoder (MAE) on the target medical dataset. Both models are then fine-tuned on specific medical image classification tasks. A final ensemble strategy, based on averaging predicted probabilities, is utilized to combine the complementary insights from these two models. Rigorous experimental validation across four diverse medical imaging datasets (Breast Ultrasound Images (BUSI), International Skin Imaging Collaboration (ISIC) 2018, Kvasir, and COVID) demonstrates the superior performance and robustness of our ensemble approach. The MAE pre-training significantly improves feature learning on domain-specific data, while the ImageNet pre-training provides strong generalizable features. The ensemble consistently achieves state-of-the-art results, outperforming individual models and existing methods, highlighting the efficacy of combining diverse pre-training strategies for challenging medical image analysis.
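The final ensemble step, averaging the predicted probabilities of the two fine-tuned models, is simple enough to show directly. This is a minimal sketch with equal weights, which is the plain reading of "averaging predicted probabilities"; the function names and the example logits are illustrative.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_a, logits_b):
    """Average the class probabilities of two models and pick the argmax."""
    probs = 0.5 * (softmax(logits_a) + softmax(logits_b))
    return probs, probs.argmax(axis=-1)

# One sample, two classes: the models disagree and the more confident one wins.
probs, pred = ensemble_predict([[2.0, 0.0]], [[0.0, 3.0]])
```

Averaging probabilities rather than logits keeps each model's contribution bounded, so a single overconfident logit cannot dominate the vote.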

2605.21957 2026-05-22 cs.CV 版本更新

Bounding-Box Trajectories Matter for Video Anomaly Detection

边界框轨迹对视频异常检测至关重要

Inpyo Song, Jangwon Lee

发表机构 * Sungkyunkwan University(成均馆大学)

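AI summary This paper presents TrajVAD, a framework that models multi-class bounding-box trajectories to learn normal motion patterns. Using bounding-box trajectories as the primary anomaly cue, it outperforms the compared pose-based methods on the ShanghaiTech dataset.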

Comments 17 pages, 3 figures

详情
AI中文摘要

视频异常检测对于公共安全和安保至关重要,尽管已有大量研究,但仍极具挑战性,因为存在大量外观、视角和场景动态的变化。在现有方法中,基于人类姿态的方法已成为主要研究方向,由于许多公共数据集中的异常涉及人类,姿态表示对外观变化具有鲁棒性,同时提供紧凑的运动描述。然而,这些方法往往忽视了边界框轨迹,尽管这种信息在基于姿态的管道中本应是固有的。在本文中,我们明确利用这些轨迹作为主要异常线索。我们提出了TrajVAD框架,使用归一化流建模多类边界框轨迹以学习正常运动模式。其仅轨迹变体(TrajVAD-T)消除了姿态估计,并在ShanghaiTech上以87.7%的AP超越了所有比较的姿态基方法,同时在MSAD上取得最佳结果。扩展版本(TrajVAD-P)纳入了姿态信息,进一步将ShanghaiTech上的性能提升至88.6%的AUROC和90.9%的AP,突显了边界框轨迹作为视频异常检测中有效但尚未充分研究的模态。

英文摘要

Video anomaly detection is critical for public safety and security, yet remains highly challenging despite extensive research due to large variations in appearance, viewpoint, and scene dynamics. Among existing approaches, human pose-based methods have emerged as a major line of research, showing strong performance since many anomalies in public datasets involve humans and pose representations are robust to appearance changes while providing compact motion descriptions. However, these methods often overlook bounding-box trajectories, although such information is inherently available in pose-based pipelines. In this paper, we explicitly leverage these trajectories as a primary anomaly cue. We present TrajVAD, a framework that models multi-class bounding-box trajectories using normalizing flows to learn normal kinematic patterns. Its trajectory-only variant (TrajVAD-T) eliminates pose estimation and surpasses all compared pose-based methods on ShanghaiTech in AP (87.7%), while achieving the best results on MSAD. An extended version (TrajVAD-P) incorporates pose information and further improves performance to 88.6% AUROC and 90.9% AP on ShanghaiTech, highlighting bounding-box trajectories as an effective yet underexplored modality for video anomaly detection.

2605.21954 2026-05-22 cs.CV cs.AI 版本更新

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Xi’an Jiaotong University(西安交通大学) Tencent(腾讯)

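AI summary This paper studies the gap between perception and generation in the video temporal grounding of multimodal large language models (MLLMs). It proposes an inference-time read-then-regenerate framework that uses attention cues to improve temporal grounding, raising the performance of MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three video temporal grounding benchmarks.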

Comments Project Website: https://ddz16.github.io/mllmsknowwhen.github.io/

详情
AI中文摘要

视频时间定位(VTG),即在未剪裁的视频中定位查询事件的起止时间,是检验多模态大语言模型(MLLMs)是否理解不仅发生了什么,而且何时发生的关键测试。尽管现代MLLMs能够流畅地描述视频内容,但它们的时间戳预测仍然不可靠,而现有的解决方案要么需要昂贵的后训练时间标注,要么依赖于粗略的训练无关启发式方法。在本文中,我们探测了MLLMs的跨模态注意力,并揭示了一个感知-生成的差距。我们的关键发现是,MLLMs在prefill阶段往往知道目标区间,但在生成最终答案时会丢失这个信号。在prefill阶段,一组稀疏的注意力头(我们称之为时间定位头(TG-Heads))会将查询到视频的注意力集中在真实区间上。然而,在自回归解码过程中,答案标记会将注意力从该区间转移到视觉显著但与查询无关的段落。这一观察促使我们提出了一种推理阶段的读取-再生成框架。我们首先将TG-Head prefill注意力转换为一个去偏的帧级相关性信号,并提取它突出的高注意力区间。然后,我们使用视频裁剪或注意力掩码来限制MLLM的视觉上下文,仅限于该区间,以抑制干扰项。在不进行参数更新和架构更改的情况下,我们的框架在三个VTG基准上一致地提高了MiMo-VL-7B、Qwen3-VL-8B和TimeLens-8B的性能,最大提升达到+3.5 mIoU。该项目网站可在https://ddz16.github.io/mllmsknowwhen.github.io/上找到。

英文摘要

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call \emph{Temporal Grounding Heads} (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates and architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
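The "read" step, turning a frame-level relevance signal into a single interval, and the IoU used to score it can be sketched as follows. The abstract does not specify how the interval is extracted or how the signal is debiased, so the peak-and-threshold rule, the `ratio` parameter, and the function names are assumptions.

```python
import numpy as np

def extract_interval(relevance, ratio=0.5):
    """Grow a contiguous interval around the peak frame while scores stay
    at or above ratio * peak. Returns inclusive (start, end) frame indices."""
    r = np.asarray(relevance, dtype=float)
    peak = int(r.argmax())
    thr = ratio * r[peak]
    start = end = peak
    while start > 0 and r[start - 1] >= thr:
        start -= 1
    while end < len(r) - 1 and r[end + 1] >= thr:
        end += 1
    return start, end

def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A late isolated spike at index 6 is ignored because it is not contiguous with the peak.
print(extract_interval([0.1, 0.1, 0.6, 0.9, 0.7, 0.1, 0.8]))  # (2, 4)
print(temporal_iou((2.0, 5.0), (3.0, 6.0)))                   # 0.5
```

The extracted interval would then be used to crop the video or mask attention before the model is invoked a second time, as the abstract describes.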

2605.21931 2026-05-22 cs.CV 版本更新

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

EvoVid: 以时间为中心的自我进化用于视频大语言模型

Shiqi Huang, Ziyue Wang, Zhongrong Zuo, Han Qiu, Qi She, Bihan Wen

发表机构 * School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院) ByteDance(字节跳动)

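AI summary This paper proposes EvoVid, a temporal-centric self-evolving framework that lets Video-LLMs improve directly from unannotated videos. It introduces two complementary temporal-centric rewards, a temporal-aware Questioner reward and a temporal-grounded Solver reward. Across four base models and six benchmarks, EvoVid improves over both the base models and existing self-evolving baselines.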

Comments Project page: https://huangshiqi128.github.io/EvoVid.io/

详情
AI中文摘要

近期的视频大语言模型(Video-LLMs)通过强化学习(RL)展示了在视频推理中的强大能力。然而,现有的RL流程严重依赖于人工标注的任务和解决方案,使其扩展成本高且本质上受人类专业知识的限制。自我进化框架最近作为一种有前途的替代方案出现,通过自主的提问者-求解者自玩。不幸的是,这些方法主要针对静态模态,如文本和图像,从根本上无法捕捉视频推理中至关重要的时间动态。在本工作中,我们提出了EvoVid,一种以时间为中心的自我进化框架,使Video-LLMs能够直接从原始、未标注的视频中改进。具体来说,我们引入了两个互补的时间感知奖励:一个时间感知的问题生成奖励,通过时间扰动敏感性鼓励时间依赖性的问题生成;一个时间基础的求解奖励,通过固有的视频片段定位提供自动的时间监督。在四个基础模型和六个基准测试中的广泛实验显示,EvoVid在基线模型和现有自我进化基线上实现了持续的改进,取得了与监督方法相竞争的性能。这些结果突显了时间为中心的自我进化作为视频理解和推理的有效且可扩展的范式。

英文摘要

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose $\textbf{EvoVid}$, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.

2605.21924 2026-05-22 cs.CV 版本更新

Visual-Advantage On-Policy Distillation for Vision-Language Models

基于视觉优势的在线策略蒸馏用于视觉-语言模型

Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu

发表机构 * Institute of Automation, CAS(中国科学院自动化研究所) School of Advanced Interdisciplinary Sciences, UCAS(中国科学院大学(UCAS)先进交叉学科学院) Hello Group Inc.(Hello集团有限公司) Sun Yat-sen University(中山大学)

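AI summary This paper proposes Visual-Advantage On-Policy Distillation, which strengthens the reliance of vision-language models on visual input. It introduces a visual advantage measure that separates vision-critical tokens from language tokens so that distillation can treat them differently.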

详情
AI中文摘要

在线策略知识蒸馏在语言模型中已被证明有效,但其在视觉-语言模型(VLMs)中的应用仍显不足。我们发现标准在线策略蒸馏可以提高学生模型的输出质量,但未能增强其对视觉输入的依赖性:在视觉关键token上,学生模型的预测在是否具备细粒度视觉细节时基本保持不变,尽管教师模型的预测依赖于它。为了使这种差异变得明显,我们引入了视觉优势(VA),即当教师在评分学生生成的rollout时,有无细粒度视觉细节的token级对数概率差异。VA集中在少数token上,这些高VA token实际上承载了视觉监督信号。这促使我们提出了一种蒸馏目标,使它们与语言支架不同,以避免其被大量语言token稀释。我们提出了视觉优势在线策略蒸馏(VA-OPD),它在两个粒度上使用VA:通过轨迹平均VA进行rollout级重新加权,以及在高VA和低VA组内分别计算token级KL平均值。我们在这两个数学数据集(Geometry3K和ViRL39K)上进行训练,并在八个基准测试上进行评估,涵盖数学推理和视觉理解,跨三种教师大小(4B、8B和32B)在Qwen3-VL系列上。VA-OPD在每个基准测试上均优于标准在线策略蒸馏,增益随着教师大小和数据规模轴单调增长,表明这些因素一致地相互作用。

英文摘要

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that treats them differently from language scaffolding, so their contribution is not diluted by the abundant surrounding language tokens. We propose Visual-Advantage On-Policy Distillation (VA-OPD), which uses VA at two granularities: rollout-level reweighting by trajectory-averaged VA, and token-level KL averaged within high-VA and low-VA groups separately. We train on two math datasets (Geometry3K and ViRL39K) and evaluate on eight benchmarks covering both mathematical reasoning and visual understanding, across three teacher sizes (4B, 8B, and 32B) on the Qwen3-VL family. VA-OPD improves over standard on-policy distillation on every benchmark, with the gain growing monotonically along both the teacher-size and data-scale axes, suggesting that these factors compound consistently.
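The two uses of visual advantage described above can be sketched with NumPy. The abstract defines VA as a log-probability difference but does not say how the two group averages are combined or how rollout weights are normalized, so the sum of group means, the clipping, the `threshold` parameter, and all function names below are assumptions.

```python
import numpy as np

def visual_advantage(logp_with, logp_without):
    """Token-level VA: teacher log-prob with fine visual detail minus without."""
    return np.asarray(logp_with, dtype=float) - np.asarray(logp_without, dtype=float)

def grouped_kl(token_kl, va, threshold):
    """Average the per-token KL inside the high-VA and low-VA groups separately,
    then add the two means (assumed combination)."""
    token_kl = np.asarray(token_kl, dtype=float)
    high = np.asarray(va) >= threshold
    parts = []
    if high.any():
        parts.append(token_kl[high].mean())
    if (~high).any():
        parts.append(token_kl[~high].mean())
    return float(sum(parts))

def rollout_weights(va_per_rollout):
    """Weight each rollout by its trajectory-averaged VA (assumed: clipped at
    zero and normalized to sum to one, uniform if every mean is non-positive)."""
    means = np.array([np.mean(v) for v in va_per_rollout])
    w = np.clip(means, 0.0, None)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
```

With per-token KL values of 0.1, 4.0, 0.1, 0.1 and only the second token marked high-VA, a plain mean gives about 1.08, while the grouped version keeps the full 4.0 from the vision-critical token. This is the dilution effect the abstract refers to.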

2605.21919 2026-05-22 cs.CV cs.AI 版本更新

SDGBiasBench: Benchmarking and Mitigating Vision--Language Models' Biases in Sustainable Development Goals

SDGBiasBench: 评估和减轻可持续发展目标中视觉-语言模型的偏见

Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

发表机构 * Nanyang Technological University(南洋理工大学)

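AI summary This paper proposes SDGBiasBench, a large-scale benchmark suite for evaluating and mitigating the biases of vision-language models on the Sustainable Development Goals. It analyzes decision-level and estimation-level bias and proposes CADE, a method that reduces this bias and improves accuracy and reliability.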

详情
AI中文摘要

评估可持续发展目标(SDGs)的进展需要对视觉线索、上下文知识和发展指标进行多步骤推理,其中不完整的证据使用和不完美的证据整合可能引入隐藏的预测偏见。现实中的SDG监测还涵盖定性判断和定量估计。然而,现有基准通常孤立地评估这些方面,掩盖了当模型用先验代替证据时系统性偏见。为解决这一差距,我们提出了SDGBiasBench,一个面向SDG的视觉-语言推理大型基准测试集。该基准涵盖50万专家参与的多项选择题和5万回归任务,能够全面评估视觉-语言模型(VLMs)在决策和估计层面的偏见。在SDGBiasBench上的评估揭示了当前VLMs中固有的SDG偏见,其中预测通常由SDG特定的先验驱动,而非可靠的多模态线索。为减轻这种偏见,我们提出CADE(对比自适应去偏集合),一种无需训练的即插即用方法,利用模态特定的答案先验。CADE在所提出的基准上取得显著成效,提高了多项选择的准确率高达25%,并减少了回归MAE高达12点,适用于多种VLMs。我们希望我们的工作能促进更公平和可靠的AI系统在可持续发展中的发展。

英文摘要

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision--Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of more fair and reliable AI systems for sustainable development.

2605.21917 2026-05-22 cs.CV cs.AI 版本更新

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

MAVEN:一种多阶段代理标注管道用于视频推理任务

Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

发表机构 * NVIDIA

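AI summary This paper presents MAVEN, a multi-stage agentic annotation pipeline that generates multi-task training data with Chain-of-Thought reasoning traces for video event reasoning. Its core is a Multi-Scale Spatio-Temporal Event Description. The pipeline supports agent-driven domain adaptation, improves data quality through a hierarchical refinement loop, and is validated on several datasets.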

Comments CVPR 2026 Workshop

详情
AI中文摘要

训练视频事件推理的视觉语言模型(VLMs)需要高质量的结构化标注,这些标注不仅要描述发生了什么,还要捕捉何时、何地、为何以及后果。我们提出了MAVEN(多阶段代理视频事件标注),一种多阶段代理管道,通过链式推理(CoT)轨迹将原始视频转换为多任务训练数据,围绕指定的事件焦点组织。在核心部分,MAVEN从三个互补的标题级别合成多尺度时空事件描述(MSTED),该显式中间体是下游问答生成的唯一输入,适用于多种任务格式。关键的是,MAVEN支持代理驱动的领域适应:给定新的视频数据集和目标问题示例,代理可以重新设计所有提示,而无需手动重新工程。分层细化循环进一步将注释错误分类到分类学中,追溯根本原因到起始管道阶段,并应用有针对性的编辑,重写提示或修改管道结构本身,迭代改进数据质量。我们应用MAVEN标注超过5,300个交通视频,并在生成的数据上微调Cosmos-Reason2-8B。在私人CCTV评估集上,微调优于Gemini 2.5 Pro和3.1 Flash,包括在零样本情况下MCQ准确率提高了38.8个百分点。在AccidentBench上,仅使用CCTV训练提升了Cosmos-Reason2的MCQ分数10.7分,并在没有dashcam视频的情况下与Gemini 2.5 Pro持平;添加代理适应的dashcam注释缩小了与Gemini 3.1 Flash的差距,RL后训练将总体性能推过了Gemini基线。对仓库监控和公共安全视频的定性结果进一步表明,代理工作流能够轻松适应新领域。

英文摘要

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a $+38.8$-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by $+10.7$ MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

2605.21913 2026-05-22 cs.CV 版本更新

Multi-scale interaction network for stereo image super-resolution

多尺度交互网络用于立体图像超分辨率

Liyi Xu, Lin Qi

发表机构 * Ocean University of China(中国海洋大学)

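AI summary The paper proposes a multi-scale interaction network for stereo image super-resolution that improves intra-view feature extraction and cross-view matching accuracy, achieving competitive results that outperform most state-of-the-art methods.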

详情
AI中文摘要

立体图像超分辨率旨在通过利用双目系统的互补信息生成高分辨率图像。尽管先前研究取得了显著成果,但视内和视间信息的潜力尚未被充分挖掘。为了解决这个问题,我们提出了一种新颖的多尺度交互网络用于立体图像超分辨率。具体来说,我们设计了一个多尺度空间-通道注意模块,利用多尺度大可分离核注意和简单的通道注意来改进视内特征提取。此外,我们提出了一个双视图极线注意模块,利用最优传输算法实现更精确的极线匹配。广泛的实验和消融研究显示,我们的方法实现了具有竞争力的结果,优于大多数最先进的方法。

英文摘要

Stereo image super-resolution aims to generate high-resolution images by leveraging complementary information from binocular systems. Although previous studies have achieved impressive results, the potential of intra-view and cross-view information has not been fully exploited. To address this issue, we propose a novel multi-scale interaction network for stereo image super-resolution. Specifically, we design a Multi-scale Spatial-Channel Attention Module that utilizes multi-scale large separable kernel attention and simple channel attention to improve intra-view feature extraction. Additionally, we propose a Dual-View Epipolar Attention Module, utilizing an optimal transport algorithm to achieve more accurate matching along the epipolar line. Extensive experiments and ablation studies show that our method achieves competitive results that outperform most SOTA methods.
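As a rough illustration of optimal-transport matching along an epipolar line, the sketch below applies Sinkhorn normalization to a similarity matrix between the pixels of one left-view row and the same row of the right view (in rectified stereo pairs, epipolar lines coincide with image rows). The abstract does not say which optimal transport solver the authors use; Sinkhorn iteration, the uniform marginals, the `eps` temperature, and every function name here are assumptions made for illustration only.

```python
from math import exp

def sinkhorn(scores, n_iters=50, eps=0.1):
    """Turn an n x n similarity matrix into a soft one-to-one matching.

    Assumed setup (not from the paper): uniform marginals and an entropic
    temperature `eps`. Scores should be bounded (e.g. cosine similarity)
    so that exp(score / eps) does not overflow.
    """
    n = len(scores)
    plan = [[exp(s / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Normalize every row so that it sums to 1.
        plan = [[v / sum(row) for v in row] for row in plan]
        # Normalize every column so that it sums to 1.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(n)]
        plan = [[plan[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return plan

# Toy similarities between 3 left-row pixels and 3 right-row pixels.
scores = [[1.0, 0.2, 0.1],
          [0.2, 1.0, 0.3],
          [0.1, 0.3, 1.0]]
plan = sinkhorn(scores)
# Hard match per left pixel: index of the largest entry in its row.
matches = [max(range(3), key=lambda j: row[j]) for row in plan]
print(matches)
```

Compared with a plain row-wise softmax, the alternating normalization discourages several left pixels from claiming the same right pixel, which is one common motivation for using optimal transport in matching problems.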

2605.21907 2026-05-22 cs.CV 版本更新

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

引导轨迹优化与稀疏缩放用于测试时间扩散

Gang Dai, Yining Huang, Yiming Xia, Guohao Chen, Shuaicheng Niu

发表机构 * Guangdong University of Technology(广东工业大学) South China University of Technology(华南理工大学) Nanyang Technological University(南洋理工大学)

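AI summary The paper proposes RTS, which combines a reward-guided noise optimization strategy with a sparse test-time scaling framework to improve the generation performance of diffusion models; experiments report gains over baselines on both GenEval and ImageReward.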

详情
AI中文摘要

高效的测试时间缩放(TTS)范式为提升扩散模型的生成性能提供了有前途的视角。然而,当前解决方案局限于静态、预定义的噪声池,并在去噪轨迹中的噪声探索上表现出灵活性不足。为弥合这一差距,我们提出了RTS,一种新颖的奖励引导轨迹缩放方法,以充分释放扩散模型的生成潜力。与现有方法不同,RTS通过两个核心创新实现了高质量图像的合成:1)奖励引导的噪声优化策略,主动将搜索方向引导至有前途的区域;2)结合PCA驱动的曲率分析方案的稀疏测试时间缩放框架,优先考虑去噪空间中的关键中间步骤,有效压缩搜索空间。实验表明,我们的方法在GenEval得分上比基线高出15.6%,在ImageReward得分上提升60.4%,设定了新的SOTA,并为扩散特定架构的更有效的测试时间缩放提供了实用指南。

英文摘要

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show that our approach outperforms baselines by 15.6% on GenEval score and by 60.4% on ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

2605.21882 2026-05-22 cs.CV 版本更新

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

Thermo-VL:扩展视觉语言模型以适应热红外感知

Rusiru Thushara, Yasiru Ranasinghe, Jay Paranjape, Vishal M. Patel

发表机构 * Department of Electrical & Computer Engineering, Johns Hopkins University(约翰霍普金斯大学电气与计算机工程系)

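AI summary The paper proposes Thermo-VL, a vision-language model extended to thermal infrared perception. It adds a trainable thermal encoder and a text-guided dual-attention fusion module to improve multispectral fusion under low illumination, and reports strong gains on thermal-only and RGB+thermal reasoning tasks.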

Comments 18 pages, 11 figures

详情
AI中文摘要

视觉语言模型(VLMs)在低光照条件下往往表现不佳,因为它们的视觉基础主要学习自RGB图像,而热红外图像在可见线索退化时能保留互补的场景结构。我们提出了Thermo-VL,一种波长感知的VLM,它在冻结的Molmo-7B主干上添加了可训练的热编码器和文本引导的双注意力融合模块。给定对齐的RGB标记、热标记和提示嵌入,融合模块将热特征条件化为语言和RGB上下文,然后将门控残差注入冻结的RGB流中,使热证据能够被纳入而不破坏Molmo预训练的RGB-语言接口。我们使用标准的语言建模目标以及辅助对齐和正则化损失来训练模型,这些损失提高了跨模态基础并减少了对RGB的依赖。我们还引入了一个像素对齐的RGB-热指令微调数据集和Thermo-VL-Bench,一个手动筛选的RGB-热VQA基准,用于低光照和跨光谱推理。实验表明,在具有挑战性的热红外和RGB+热红外推理任务中取得了显著的提升,突显了基于提示的多光谱融合的价值。我们的数据集和代码可在:https://thusharakart.github.io/Thermo-VL 公开获取。

英文摘要

Vision-language models (VLMs) often fail under low illumination because their visual grounding is learned predominantly from RGB imagery, whereas thermal infrared preserves complementary scene structure when visible cues degrade. We present Thermo-VL, a wavelength-aware VLM that augments a frozen Molmo-7B backbone with a trainable thermal encoder and a text-guided dual-attention fusion module. Given aligned RGB tokens, thermal tokens, and prompt embeddings, the fusion module conditions thermal features on both language and RGB context, then injects a gated residual into the frozen RGB stream so thermal evidence can be incorporated without disrupting Molmo's pretrained RGB-language interface. We train the model with the standard language-modeling objective together with auxiliary alignment and regularization losses that improve cross-modal grounding and reduce over-reliance on RGB. We also introduce a pixel-aligned RGB-thermal instruction-tuning dataset and Thermo-VL-Bench, a manually screened RGB-thermal VQA benchmark for low-light and cross-spectrum reasoning. Experiments show strong gains on challenging thermal-only and RGB+thermal reasoning tasks, highlighting the value of prompt-conditioned multispectral fusion. Our dataset and code are publicly available at: https://thusharakart.github.io/Thermo-VL
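The gated residual injection described above can be sketched in a few lines. The abstract only says that a gated residual is injected into the frozen RGB stream; the scalar `tanh` gate, its zero initialization, and the function name below are assumptions chosen to show why such a gate can leave a pretrained interface undisturbed at the start of training.

```python
from math import tanh

def gated_residual(rgb_tokens, fused_tokens, gate):
    """Return rgb + tanh(gate) * fused for every token vector.

    `gate` is a single learnable scalar in this sketch (an assumption);
    with gate = 0 the output equals the frozen RGB tokens exactly.
    """
    g = tanh(gate)
    return [[r + g * f for r, f in zip(rgb_vec, fused_vec)]
            for rgb_vec, fused_vec in zip(rgb_tokens, fused_tokens)]

rgb = [[0.5, -1.0], [2.0, 0.0]]      # two toy RGB tokens of width 2
thermal = [[1.0, 1.0], [-2.0, 4.0]]  # matching fused thermal features
print(gated_residual(rgb, thermal, gate=0.0))  # identical to rgb
print(gated_residual(rgb, thermal, gate=0.5))  # thermal evidence mixed in
```

Starting from a closed gate means the model initially behaves like the unmodified backbone, and thermal evidence is blended in only as the gate is learned.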

2605.21869 2026-05-22 cs.CV cs.AI cs.HC 版本更新

Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

双阶段多模态框架用于情感模仿强度预测

Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara

发表机构 * Augmented Human Lab, National University of Singapore, Singapore(新加坡国立大学增强人类实验室) University of Moratuwa, Sri Lanka(斯里兰卡穆拉图瓦大学)

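AI summary The paper proposes a two-stage multimodal framework that predicts six continuous emotion intensity dimensions from in-the-wild video clips by combining textual, acoustic, and visual representations with an optional motion branch, providing a practical and reproducible baseline.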

Comments 10th Affective & Behavior Analysis in-the-wild, CVPR Workshop 2026

详情
AI中文摘要

我们提交了Hume-ABAW10情感模仿强度(EMI)挑战的参赛方案,旨在从真实多模态视频片段中预测六个连续情绪强度维度:钦佩、娱乐、决心、共情痛苦、兴奋和快乐。我们提出了一种分阶段的多模态框架,结合文本、音频和视觉表示,可选运动分支。我们的方法首先独立训练模态特定的编码器,然后通过轻量级回归器融合其学习的表示,通过模态丢弃和受控编码器适应。在我们提交的系统中,最佳验证性能由文本-音频-视觉-运动融合模型在扩展的4:1划分下获得,平均皮尔逊相关系数为0.4722。尽管运动分支仅带来极小的提升,但其行为值得研究。我们的团队在EMI挑战中获得第三名,测试集的平均皮尔逊相关系数为0.57。总体而言,我们提供了一个实用且可复现的EMI预测基线。

英文摘要

We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions: Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy, from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text--audio--vision--motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only very slight gains, its behavior can be interesting to study. Our team was placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 for the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
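The reported metric, the average Pearson correlation over the six emotion dimensions, can be computed as in the sketch below. The dictionary layout and the function names are assumptions for illustration; the challenge's official scoring script may differ in details such as how constant predictions are handled.

```python
from math import sqrt

EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_pearson(pred, target):
    """Average the per-emotion correlations.

    pred / target map each emotion name to a list of per-clip intensities.
    """
    return sum(pearson(pred[e], target[e]) for e in EMOTIONS) / len(EMOTIONS)

# Toy data: predictions that are a scaled copy of the targets.
target = {e: [0.1, 0.4, 0.35, 0.8] for e in EMOTIONS}
pred = {e: [2 * v for v in vals] for e, vals in target.items()}
print(mean_pearson(pred, target))
```

Because Pearson correlation is invariant to scale and offset, a model is rewarded for ranking and relative spacing of intensities rather than for matching their absolute values.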

2605.21861 2026-05-22 cs.CV cs.AI 版本更新

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

在多模态医学视觉基础模型中学习涌现的模块化表示

Yuting He, Chenyu You, Shuo Li

发表机构 * Case Western Reserve University(凯斯西储大学) Stony Brook University(石溪大学)

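AI summary The paper proposes the Director-Experts (DEX) framework, which regulates modular dynamics to learn stable modular representations in multi-modality medical vision foundation models, and evaluates it on 26 downstream tasks after pre-training on a new medical vision benchmark dataset.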

Comments Accepted by KDD 2026

详情
AI中文摘要

多模态医学视觉(MV)基础模型(FM)在异质成像模态间面临显著的非独立同分布(Non-IID)特征统计挑战。对这类数据进行整体式自监督优化会引发梯度冲突,导致表示向模态主导的捷径坍缩。本文将这一失败重新解释为涌现模块化中专门化与协调之间的失衡,并提出Director-Experts(DEX)模块化网络,该网络在堆叠模块中显式调控这些动态。每个DEX模块包含一组专家和一个Director:专家通过我们的图像级激活策略动态适应,自主专注于模态主导的统计特征;Director通过我们的组指数移动平均进行更新,将多专家知识蒸馏到共享空间,实现跨模态的语义整合,从而驱动模块化表示的涌现。我们构建了一个新的基准数据集Medical Vision Universe,包含超过400万张图像,覆盖10种模态,为DEX提供了覆盖不同成像模态最广的FM级预训练。在26个下游任务上的广泛评估表明,DEX在优化行为和迁移性方面有所改进,表明DEX是迈向通用多模态医学AI的有原则的一步。我们的代码和数据集将在https://github.com/YutingHe-list/DEX上公开。

英文摘要

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, with over 4 million images across 10 modalities, which provides FM-level pre-training for DEX with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at https://github.com/YutingHe-list/DEX.

2605.21852 2026-05-22 cs.CV 版本更新

Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding

癫痫发作症状学套件(S3):用于癫痫发作症状学理解的临床多模态数据集、基准和模型

Lina Zhang, Tonmoy Monsoor, Peizheng Li, Jiarui Cui, Xinyi Peng, Chong Han, Prateik Sinha, Siyuan Dai, Jessica Nichole Pasqua, Colin M McCrimmon, Weiting Liu, Hailey Marie Miranda, Bing Hu, Xiangting Wu, Tengyou Xu, Chunhan Li, Jiaye Tian, Jiarui Tang, Detao Ma, Lingye Kong, Junnan Lyu, Jungang Li, Yan Zan, Junhua Huang, Rajarshi Mazumder, Vwani Roychowdhury

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) University of Pittsburgh(匹兹堡大学) Fudan University(复旦大学) University of California, Riverside(加州大学河滨分校) Hong Kong University of Science(香港科学大学) Maharishi International University(玛希拉国际大学)

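AI summary The paper introduces the S3 dataset and benchmark for fine-grained, structured understanding of seizure semiology. It evaluates multimodal large language models on tasks ranging from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis, revealing systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. Seizure-specific fine-tuning improves performance, and a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification.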

Comments Accepted to ICML 2026 as a Spotlight presentation

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在通用视频理解方面表现出色,但其解读不自主且随时空演变的病理性运动行为(如癫痫发作症状学)的能力仍基本未经检验。为此,我们引入癫痫发作症状学套件(Seizure-Semiology-Suite,S3),一个立足临床的数据集和基准,用于细粒度、结构化的癫痫发作症状学理解。数据集包含438个癫痫发作视频,标注了超过35,000个密集标签,涵盖20个ILAE定义的症状学特征。基于此数据集,我们提出了一个七任务分层基准,系统评估MLLMs从低级视觉感知到时间排序、叙述性报告生成和发作诊断的能力。为了对生成的报告进行具有临床意义的评估,我们进一步引入了癫痫发作症状学报告质量指数(Seizure-RQI)。对11个开放权重MLLMs的广泛基线评测揭示了其在侧别推理、时间定位、症状排序和临床忠实报告方面的系统性弱点。我们表明,针对癫痫发作的微调显著提高了各任务的性能,而两阶段神经符号框架在癫痫性与非癫痫性发作分类中的F1分数达到0.96。Seizure-Semiology-Suite为评估多模态模型在安全关键的医疗视频理解中的表现建立了严谨的基准,并为开发临床可靠、领域自适应的多模态智能提供指引。

英文摘要

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in general video understanding, their capacity to interpret involuntary, and spatio-temporally evolving pathologic motor behaviors such as seizure semiology remains largely untested. To address this gap, we introduce Seizure-Semiology-Suite, a clinically grounded dataset and benchmark for fine-grained, structured seizure semiology understanding. The dataset includes 438 seizure videos annotated with over 35,000 dense labels covering 20 ILAE-defined semiological features. Building on this dataset, we propose a seven-task hierarchical benchmark that systematically evaluates MLLMs from low-level visual perception to temporal sequencing, narrative report generation, and seizure diagnosis. To enable clinically meaningful evaluation of generated reports, we further introduce the Report Quality Index for Seizure Semiology (Seizure-RQI). Extensive baselines across 11 open-weight MLLMs reveal systematic weaknesses in laterality reasoning, temporal localization, symptom sequencing, and clinically faithful reporting. We show that seizure-specific fine-tuning substantially improves performance across tasks, and that a two-stage neuro-symbolic framework achieves an F1 score of 0.96 on epileptic versus non-epileptic seizure classification. Seizure-Semiology-Suite establishes a rigorous benchmark for evaluating multimodal models in safety-critical medical video understanding and guides the development of clinically reliable, domain-adaptive multimodal intelligence.

2605.21835 2026-05-22 eess.IV cs.AI cs.CV physics.med-ph 版本更新

An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation

一种开放的多中心全身FDG PET/CT基础模型用于肿瘤分割

Xiaofeng Liu, Qianru Zhang, Thibault Marin, Menghua Xia, Chi Liu, Georges El Fakhri, Jinsong Ouyang

发表机构 * Department of Radiology and Biomedical Imaging, Yale Biomedical Imaging Institute, Yale University(放射学与生物医学成像系,耶鲁生物医学影像研究所,耶鲁大学)

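AI summary The paper presents an open, multi-center, whole-body FDG PET/CT foundation model built on 4,997 harmonized scans from four public datasets. It uses hierarchical UNet-shaped backbones with early channel-wise concatenation so that anatomical and metabolic features interact from the first layer, improving label efficiency and cross-modality representation learning for tumor segmentation.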

Comments Code available at: https://github.com/liu-xiaofeng/Foundation-Model-for-PET-CT

详情
AI中文摘要

解剖信息来自计算机断层扫描(CT)和代谢信息来自正电子发射断层扫描(PET)的协同解释对于肿瘤成像至关重要。然而,现有的PET/CT深度学习方法大多任务特定,通常在单一中心队列上训练,或者采用双分支融合方案,这延迟了跨模态交互并低估了PET和CT之间早期空间对应关系。为了解决这些限制,我们提出了一种开源的、多中心的、全身FDG PET/CT基础模型,利用四个公开数据集中的4997份标准化扫描。我们的框架采用层次UNet形状的后端,并在早期通道拼接,使解剖和代谢特征从第一个嵌入层开始交互。我们进一步引入基于零均值填补的掩码自编码目标,结合加权全局重建损失。这种设计避免了由于可学习掩码标记产生的非物理强度不连续性。在下游AutoPET病变分割中,所提出的模型显示出强大的标签效率:仅使用10%的标记训练数据,即可达到在完整数据集上训练的模型的性能。在极端5-shot线性探测下,联合PET/CT预训练也比单独模态预训练取得了更高的Dice分数。这种多中心基础模型展示了PET/CT肿瘤分割的标签效率和跨模态表征学习能力。它为推进自动化肿瘤成像提供了稳健、开源的基础,显著减少了临床实践中大规模手动注释的需求。

英文摘要

The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.
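One plausible reading of "zero-mean imputation" with a "weighted global reconstruction loss" is sketched below on flattened toy data: each modality is standardized, masked voxels are filled with 0 (the mean after standardization) instead of a learnable mask token, and the loss covers every voxel while weighting masked ones more heavily. The weights, the per-modality standardization, and all names are assumptions; the abstract does not give these details.

```python
def zero_mean_mask(volume, mask):
    """Standardize a flattened image, then set masked voxels to 0,
    which is the mean intensity after standardization."""
    n = len(volume)
    mean = sum(volume) / n
    std = (sum((v - mean) ** 2 for v in volume) / n) ** 0.5 or 1.0
    normed = [(v - mean) / std for v in volume]
    masked = [0.0 if m else v for v, m in zip(normed, mask)]
    return normed, masked

def weighted_global_loss(recon, target, mask,
                         masked_weight=1.0, visible_weight=0.1):
    """Squared error over ALL voxels; masked voxels get a larger weight.
    Both weights are hypothetical values."""
    weights = [masked_weight if m else visible_weight for m in mask]
    total = sum(w * (r - t) ** 2 for w, r, t in zip(weights, recon, target))
    return total / sum(weights)

pet = [1.0, 2.0, 3.0, 4.0]        # toy PET intensities
ct = [10.0, 10.0, 30.0, 50.0]     # toy CT intensities
mask = [0, 1, 0, 1]               # 1 = voxel hidden from the encoder

pet_target, pet_in = zero_mean_mask(pet, mask)
ct_target, ct_in = zero_mean_mask(ct, mask)
# Early channel-wise concatenation: one 2-channel input per voxel.
model_input = list(zip(pet_in, ct_in))
print(model_input)
print(weighted_global_loss(pet_in, pet_target, mask))
```

The last line scores the trivial "reconstruction" that simply returns the masked input, which gives a baseline loss a trained model should beat.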

2605.21804 2026-05-22 eess.IV cs.CV cs.LG 版本更新

Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis

使用AlphaEarth地理空间嵌入和深度学习分析绘制加利福尼亚州番茄种植系统分布图

Mohammadreza Narimani, Alireza Pourreza, Parastoo Farajpoor

发表机构 * Department of Biological and Agricultural Engineering, University of California, Davis(加州大学戴维斯分校生物与农业工程系)

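AI summary The study evaluates whether Google DeepMind's AlphaEarth geospatial embeddings can serve as an alternative for mapping tomato cropping systems in California. It builds a balanced reference dataset from LandIQ 2018 crop polygons and uses a U-Net segmentation model with Monte Carlo dropout to achieve high-accuracy tomato mapping.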

Comments 5 pages, 3 figures, 1 table. Preprint submitted to ASABE 2026 AIM

详情
AI中文摘要

田间尺度的作物地图支持供应链预测和政策制定,但州级作物识别仍常常依赖于回顾性调查或基于手工设计光谱特征的遥感工作流程。这些流程可以做到准确,但需要重复预处理,并且在跨年份应用时往往失去鲁棒性。本研究评估了Google DeepMind的AlphaEarth地理空间嵌入是否可以作为一种可直接用于分析的替代方案,用于加利福尼亚州加工番茄种植系统的制图。使用LandIQ 2018的作物多边形构建了包含4,742个番茄和4,742个非番茄地块的平衡参考数据集。对于每个多边形,提取了64波段的AlphaEarth嵌入图块(chip),并与二值掩码对齐,然后分为空间独立的训练集(n = 6,638)、验证集(n = 1,422)和测试集(n = 1,424)。在AWS SageMaker上使用由掩码二元交叉熵和软Dice损失组成的复合损失训练了U-Net分割模型。为了补充硬预测,在推理时保留蒙特卡洛dropout,并对每个图块重复100次以估计预测均值和方差。在独立的测试集上,模型实现了99.19%的像素准确率、98.69%的精确率、99.40%的召回率、99.04%的F1分数、98.11%的交并比和99.02%的图块准确率。不确定性地图在田块边缘附近始终最高,在田块内部较低。结果表明,AlphaEarth嵌入保留了与作物相关的空间和时间结构,并且可以在无需手动特征工程的情况下支持准确的田间尺度番茄制图。

英文摘要

Field-scale crop maps support supply-chain forecasting and policy, yet statewide crop identification still often depends on retrospective surveys or remote-sensing workflows built around hand-engineered spectral features. Those pipelines can be accurate, but they require repeated preprocessing and often lose robustness across years. This study evaluated whether Google DeepMind's AlphaEarth geospatial embeddings can serve as an analysis-ready alternative for mapping processing tomato systems in California. LandIQ 2018 crop polygons were used to assemble a balanced reference dataset of 4,742 tomato and 4,742 non-tomato fields. For each polygon, 64-band AlphaEarth embedding chips were extracted and aligned with binary masks, then divided into spatially independent training (n = 6,638), validation (n = 1,422), and test (n = 1,424) sets. A U-Net segmentation model was trained on AWS SageMaker using a composite masked binary cross-entropy and soft Dice loss. To complement hard predictions, Monte Carlo dropout was retained at inference and repeated 100 times per chip to estimate predictive mean and variance. On the independent test set, the model achieved 99.19% pixel accuracy, 98.69% precision, 99.40% recall, 99.04% F1 score, 98.11% intersection over union, and 99.02% chip accuracy. Uncertainty maps were consistently highest near field edges and low within field interiors. The results show that AlphaEarth embeddings retain crop-relevant spatial and temporal structure and can support accurate, field-scale tomato mapping without manual feature engineering.
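The reported metrics can be cross-checked against one another with two standard identities for binary segmentation: F1 is the harmonic mean of precision and recall, and IoU = F1 / (2 - F1). The sketch below assumes that all four numbers were computed from the same pooled pixel counts, which the abstract does not state, and it works from the rounded percentages, so small discrepancies would not be surprising. Function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def iou_from_f1(f1):
    """Dice-to-Jaccard identity for binary masks: IoU = F1 / (2 - F1)."""
    return f1 / (2 - f1)

precision, recall = 0.9869, 0.9940   # values reported in the abstract
f1 = f1_score(precision, recall)
iou = iou_from_f1(f1)
print(f"F1  = {100 * f1:.2f}%")
print(f"IoU = {100 * iou:.2f}%")

# The split sizes should add up to the balanced reference dataset.
n_fields = 4742 + 4742
n_chips = 6638 + 1422 + 1424
print(n_fields, n_chips)
```

Under these assumptions the script reproduces the reported 99.04% F1 and 98.11% IoU, and both totals come to 9,484, so the figures in the abstract are mutually consistent.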

2605.21796 2026-05-22 cs.CV cs.CL 版本更新

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

MM-Conv: 一种多模态数据集和基准,用于上下文感知的3D对话中指代解析

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

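AI summary The paper presents a multimodal dataset and benchmark for context-aware grounding in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, together with a two-stage grounding pipeline that improves the resolution of referring expressions in dialogue.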

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

详情
Journal ref
Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain
AI中文摘要

在物理世界中将语言进行定位需要AI系统解释在对话中动态出现的参考。尽管当前的视觉语言模型(VLMs)在静态图像任务上表现出色,但在自发的多轮对话中解决歧义表达方面存在困难。我们通过引入(1)一个用于动态3D环境中的指代交流的基准,该基准基于6.7小时的第一人称VR交互,同步语音、动作、注视和3D场景几何数据,以及(2)一个两阶段的定位流水线,该流水线在视觉定位之前显式解决对话中的歧义,来填补这一空白。该基准包含超过4,200个经过人工验证的指代表达,涵盖完整、部分和代词类型。我们的上下文重写方法在平均上将定位性能提高了11-22个百分点,纯检测器(GroundingDINO)在重写后在代词上达到了56.7%的准确率,几乎是最佳端到端基线的两倍。结果表明,将语言推理与视觉感知解耦比端到端方法在对话定位中更有效。

英文摘要

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

2605.21788 2026-05-22 cs.CV cs.RO 版本更新

SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching

SceneGraphGrounder: 通过结构化场景图匹配实现零样本3D视觉定位

Xuefei Sun, Xujia Zhang, Brendan Crowe, Doncey Albin, Christoffer Heckman

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

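AI summary The paper proposes SceneGraphGrounder, which reformulates 3D grounding as structured graph matching. A visual marker prompting strategy lets a VLM infer object-object relations from 2D views, which are lifted into a persistent 3D scene graph. The method achieves competitive performance among zero-shot approaches on the ScanRefer benchmark and is validated for robust spatial reasoning in long-horizon physical environments through real-world robot deployment.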

详情
AI中文摘要

零样本3D视觉定位需要从非结构化环境中通过自由形式自然语言定位物体。最近的视觉-语言模型(VLM)方法取得了有希望的结果,但依赖于视点依赖的推理或隐式表示,限制了组合查询的空间一致性和可解释性。我们提出了SceneGraphGrounder,一个将3D定位重新表述为在重建的3D场景图上的结构化图匹配的框架。为了实现这种表述,我们引入了一种视觉标记提示策略,使VLM能够从2D视图推断物体-物体关系,这些关系随后被提升为持久的3D场景图编码,既包含空间关系又包含语义关系。给定一个查询,我们构建查询图并与场景图进行受限对齐,确保多视图一致性和可解释的推理。在ScanRefer基准测试中,我们的方法在零样本条件下实现了与现有方法相当的性能,仅使用RGB-D输入。我们进一步通过在移动机器人上的真实世界部署验证了我们的框架,展示了其在长周期物理环境中的鲁棒空间推理能力。我们将在接受后公开我们的代码。

英文摘要

Zero-shot 3D visual grounding requires localizing objects in unstructured environments from free-form natural language. Recent vision-language model (VLM) approaches achieve promising results but rely on view-dependent reasoning or implicit representations, limiting spatial consistency and interpretability for compositional queries. We propose SceneGraphGrounder, a framework that reformulates 3D grounding as structured graph matching over a reconstructed 3D scene graph. To enable this formulation, we introduce a visual marker prompting strategy that enables a VLM to infer object-object relationships from 2D views, which are subsequently lifted into a persistent 3D scene graph encoding both spatial and semantic relations. Given a query, we construct a query graph and perform constrained alignment with the scene graph, ensuring multi-view consistency and interpretable reasoning. Experiments on the ScanRefer benchmark demonstrate that our method achieves competitive performance among zero-shot approaches, using only RGB-D inputs. We further validate our framework through real-world deployment on a mobile robot, demonstrating robust spatial reasoning in long-horizon physical environments. We will make our code publicly available upon acceptance.

2605.21766 2026-05-22 cs.CV cs.GR 版本更新

BodyReLux: Temporally Consistent Full-Body Video Relighting

BodyReLux: 时序一致的全身人体视频重照明

Li Ma, Mingming He, Xueming Yu, David M. George, Ahmet Levent Taşel, Paul Debevec, Julien Philip

发表机构 * Eyeline Labs(Eyeline实验室)

AI总结 本文提出BodyReLux,一种基于视频扩散的框架,用于在时序一致的方式下重照明全身人体表演。该方法利用混合数据集训练,结合传统静态单光源捕捉和新型动态表演捕捉技术,通过引入新的光照条件表示方法和数据增强管道,实现了高质量、鲁棒且时序一致的视频重照明。

Comments Siggraph 2026 Journal Track. Project page: https://eyeline-labs.github.io/bodyrelux/

详情
AI中文摘要

能够重照明人体表演是后期制作和内容创作中的基本任务。我们提出了BodyReLux,一种针对特定主体的视频扩散框架,用于在时序一致的方式下重照明全身人体表演。我们的模型是在一个混合的像素对齐视频重照明数据集上训练的,涵盖了多样化的光照条件、表演和视角组合。为了获得这样的数据集,我们结合了传统的静态单光源捕捉(OLAT)和一种新的动态表演捕捉方法,在其中两个平滑变化的光照序列被快速交错。由于光照操作在人类闪烁融合阈值之上,交错不会显得闪烁。我们从预训练的文本到视频模型中训练视频重照明模型,以充分利用生成先验来产生高质量视频。为了实现精确的光照控制,我们引入了一种新的光照条件方法,将每个光源表示为一个标记。我们进一步使用掩码注意力对光照序列进行条件处理,以支持动态光照控制。结合精心设计的数据增强管道,我们实现了高质量、鲁棒且时序一致的特定主体人体表演视频重照明。

英文摘要

Being able to relight human performance is a fundamental task for post production and content creation. We present BodyReLux, a subject-specific video diffusion-based framework for relighting full-body human performances in a temporally consistent way. Our model is trained on a hybrid dataset of pixel-aligned video relighting pairs, covering a diverse combination of lighting conditions, performances and viewpoints. To acquire such a dataset, we combine traditional static One-Light-at-a-Time (OLAT) capture and a novel dynamic performance capture in which two smoothly varying lighting sequences are rapidly interleaved. Because the lighting operates above the human flicker-fusion threshold, the interleaving does not appear to strobe. We train our video relighting model from a pretrained text-to-video model to fully leverage the generative priors for producing high quality videos. To achieve accurate lighting control, we introduce a new lighting conditioning method that represents each light source as a token. We further condition on sequences of lighting using masked attention to support dynamic lighting control. Together with a carefully designed data augmentation pipeline, we achieve photorealistic, robust, and temporally consistent video relighting of subject-specific human performances.

2605.21747 2026-05-22 cs.CV cs.RO 版本更新

Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models

通过利用视觉语言模型推断车辆信息以改进自动驾驶中的3D标注

Steven Chen, Shivesh Khaitan, Nemanja Djuric

发表机构 * Aurora Innovation, Inc.(Aurora创新公司)

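AI summary The paper proposes using vision-language models to infer vehicle information in a zero-shot manner, building on Vehicle Make and Model Recognition (VMMR) methods, in order to improve the accuracy of 3D vehicle labeling for self-driving and to raise labeling efficiency and quality.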

Comments To appear in Proceedings of the IEEE Intelligent Vehicles Symposium (IV), 2026. Accepted for oral presentation

详情
AI中文摘要

我们提出了一种通过零样本推理车辆信息来提高自动驾驶应用中3D车辆标注的方法,利用车辆制造商和型号识别(VMMR)方法。所提出的方法利用视觉语言模型(VLM)从图像片段中推断车辆的制造商、型号和代数,并输出准确的3D包围盒尺寸以引导手动标注。我们评估了迭代提示工程和不同VLMs选择对车辆包围盒推断和制造商/型号/代数识别的影响。与强大的基线相比,所提出的方法不仅在准确性上表现出色,而且在缓解特定失败模式方面也表现出色,例如在车辆显著遮挡的情况下,VLMs提供的尺寸比初始激光雷达辅助的人工标注标签更优。在公共和专有数据上的实验强烈表明,我们的结论可以推广到不同的标注者和数据集。结果表明,将VLMs整合到标注过程中可以减少手动标注时间,同时提高标注质量。

英文摘要

We present an approach to improve 3D vehicle labeling in self-driving applications through zero-shot inference of vehicle information, leveraging Vehicle Make and Model Recognition (VMMR) methods. The proposed approach utilizes a Vision Language Model (VLM) to both infer a vehicle's make, model, and generation from image crops, and output accurate 3D bounding box dimensions to seed manual labeling. We evaluate the impact of iterative prompt engineering and the choice of different VLMs on both vehicle bounding box inference and make/model/generation recognition. When compared to strong baselines, the proposed approach not only shows high accuracy, but also excels in mitigating specific failure modes where VLMs provide better dimensions than initial lidar-aided human annotated labels (e.g., in cases of significant vehicle occlusion). Experiments on both public and proprietary data strongly suggest that our conclusions are generalizable across different labelers and datasets. The results demonstrate that integrating VLMs into the labeling process can reduce manual labeling time while increasing label quality.

2605.21728 2026-05-22 cs.CV cs.CL cs.LG 版本更新

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

BEiTScore: 一种基于高效交叉编码器的无参考图像描述评估方法

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

发表机构 * Instituto Superior Técnico(里斯本大学理工学院) INESC-ID Instituto de Telecomunicações(电信机构)

AI总结 本文提出了一种无参考图像描述评估方法BEiTScore,通过高效的交叉编码器模型解决传统评估方法在计算成本和敏感性方面的不足,提出了一种新的评估指标,并在多种场景下验证了其优越的性能。

详情
AI中文摘要

图像描述评估仍是一个重大挑战,因为视觉-语言模型朝着生成长形式和上下文丰富的描述等更具挑战性的能力发展。最先进的评估度量标准涉及使用大型语言模型(LLMs)作为评判者的大量计算成本,或者受到标准CLIP基于编码器的限制,例如严格的令牌限制、缺乏细粒度敏感性或缺乏组合泛化能力,因为将描述视为“词袋”。我们提出了一种新的学习度量标准,以解决上述挑战,基于一个轻量级交叉编码器,其初始化来自视觉问答模型检查点,平衡了强大的权重初始化与计算效率。我们的训练方案使用精心编排的数据混合进行监督学习,特征是对抗性的LLM基于数据增强,以增强模型对细粒度视觉-语言错误的敏感性。我们还引入了一个新的基准,用于在多种场景中评估详细的描述评估。实验结果表明,所提出的度量标准在保持大规模基准测试、质量感知解码或奖励指导所需的效率的同时,实现了最先进的性能。

英文摘要

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21714 2026-05-22 cs.CV cs.RO 版本更新

AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking

AVI-HT:自适应视觉-IMU融合用于3D手部跟踪

Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

发表机构 * Meta Reality Labs(Meta现实实验室)

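AI summary The paper proposes AVI-HT, an adaptive vision-IMU fusion method that tracks 3D hand poses by jointly modeling egocentric images and on-glove 6-DoF IMU signals. Its core components are synchronized multi-modal training data and a cross-sensor deep attention mechanism; the main contribution is improved accuracy and availability in hand-object interaction scenarios.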

详情
AI中文摘要

我们提出了AVI-HT,一种用于通过联合建模第一人称视角图像与手套上的6自由度IMU信号来跟踪3D手部姿态的自适应视觉-IMU融合方法。AVI-HT在手-物体交互(HOI)场景中,特别是在重视觉遮挡情况下,实现了显著提高的准确性和可用性。其成功基于两个互补的成分:(1)同步多模态训练数据配对身体上的视觉-IMU传感器流与运动捕捉系统的地面真实3D手部姿态;(2)一种跨传感器深度注意力机制,能够自适应地调节对视觉和单个IMU传感器的信任度。为了在真实世界中评估AVI-HT,我们在包含100000+对视觉-IMU样本的DexGloveHOI数据集中进行了广泛的实验,这些样本具有同步的3D标注姿态,用户在日常任务中操作各种物体。我们比较了多种单模态和多模态跟踪方法,基于两种手部模型(UmeTrack、MANO)。结果表明,AVI-HT在基准上将平均关键点误差减少了16.1%,其腕对齐变体减少了24.2%。消融研究进一步揭示了IMU传感器在不同活动类型中的每指贡献,以及模型对IMU噪声和视觉-IMU融合中的时间偏移的敏感性。

英文摘要

We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during daily tasks. We compare against multiple single- and multi-modal tracking approaches under two hand models (UmeTrack, MANO). The results show that AVI-HT reduces mean keypoint error by 16.1% and its wrist-aligned variant by 24.2% over the baselines. Ablation studies further reveal the per-finger contribution of IMU sensors across activity types, and the model's sensitivity to IMU noise and temporal misalignment in vision-IMU fusion.
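For readers unfamiliar with the metric, mean keypoint error is commonly computed as the average Euclidean distance between predicted and ground-truth 3D keypoints. The sketch below also shows a translation-only wrist alignment, which is an assumption about what the abstract's "wrist-aligned variant" means; the numbers in the usage lines are hypothetical and not taken from the paper.

```python
from math import dist

def mean_keypoint_error(pred, gt, wrist_index=None):
    """Average Euclidean distance between matching 3D keypoints.

    If `wrist_index` is given, both poses are first translated so that
    their wrist keypoints coincide (assumed meaning of "wrist-aligned").
    """
    if wrist_index is not None:
        pred_wrist, gt_wrist = pred[wrist_index], gt[wrist_index]
        pred = [tuple(a - b for a, b in zip(p, pred_wrist)) for p in pred]
        gt = [tuple(a - b for a, b in zip(g, gt_wrist)) for g in gt]
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def relative_reduction(baseline_error, new_error):
    """Percentage by which `new_error` improves on `baseline_error`."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Hypothetical 3-keypoint pose; the prediction is shifted by (3, 4, 0).
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0), (3.0, 5.0, 0.0)]
print(mean_keypoint_error(pred, gt))                 # global error
print(mean_keypoint_error(pred, gt, wrist_index=0))  # after wrist alignment
print(relative_reduction(10.0, 8.39))                # hypothetical errors
```

The toy pose illustrates why the two variants can differ: a pure global offset produces a large unaligned error but vanishes once the wrists are aligned, so the aligned metric isolates articulation accuracy.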

2605.21671 2026-05-22 eess.IV cs.CV 版本更新

HyperBench: Standardizing and Scaling Synthetic Evaluation for Hyperspectral Super-Resolution

HyperBench: 标准化和扩展超光谱超分辨率的合成评估

Ritik Shah, Marco F. Duarte

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文提出HyperBench框架,通过标准化和扩展合成实验来评估超光谱超分辨率方法,以解决现有评估方法中配置不一致、结果难以比较和复现的问题。

详情
AI中文摘要

高光谱超分辨率(HSR)通过融合低分辨率高光谱图像(LR-HSI)和高分辨率多光谱图像(HR-MSI)来重建高空间分辨率的高光谱图像。在缺乏真实世界配对数据的情况下,HSR方法几乎完全是在通过Wald协议从高光谱数据集衍生的合成实验上进行评估的。尽管该协议被广泛采用,但其实际实现在不同研究工作中差异显著,通常只依赖单一(通常是高斯)或极少数点扩散函数(PSFs)、一到两个光谱响应函数(SRFs)以及少数几个空间下采样因子。因此,文献中报告的性能数据难以相互比较,且往往难以复现;此外,它们可能无法推广到现实的传感条件。我们引入HyperBench,一个统一且可扩展的框架,用于标准化HSR的合成实验。HyperBench支持多样的退化配置,涵盖十个PSFs、四个源自实际运行多光谱传感器的SRFs、可配置的空间下采样因子以及匹配的加性高斯白噪声;其目标是实现大规模评估和结构化日志记录的自动化。通过将模型开发与实验设计解耦,该框架以最小的代价实现可复现、公平的跨方法比较。我们使用HyperBench在四个广泛使用的高光谱场景上对六种最近提出的HSR方法进行了70种配置的评估,并观察到方法间PSNR的差距从最简单的PSF上的约5 dB扩大到最困难的PSF上的超过13 dB,这种脆弱性在现行的单配置评估协议中从结构上就是不可见的。HyperBench代码可在https://github.com/ritikgshah/HyperBench上获取。

英文摘要

Hyperspectral super-resolution (HSR) reconstructs a high-spatial-resolution hyperspectral image by fusing a low-resolution hyperspectral image (LR-HSI) with a high-resolution multispectral image (HR-MSI). In the absence of real-world paired data, HSR methods are evaluated almost exclusively on synthetic experiments derived from hyperspectral datasets through Wald's protocol. Despite the protocol's widespread adoption, its practical implementation varies markedly across research works, typically relying on a single (usually Gaussian) or very few point spread functions (PSFs), one or two spectral response functions (SRFs), and a couple of spatial downsampling factors. As a result, reported performance figures are difficult to compare across the literature, in addition to being often difficult to reproduce; furthermore, they may not generalize across realistic sensing conditions. We introduce HyperBench, a unified and extensible framework that standardizes synthetic experimentation for HSR. HyperBench supports diverse degradation configurations spanning ten PSFs, four SRFs derived from operational multispectral sensors, configurable spatial downsampling factors, and matched additive white Gaussian noise; its goal is to automate large-scale evaluation and structured logging. By decoupling model development from experimental design, the framework enables reproducible, apples-to-apples cross-method comparison with minimal friction. We use HyperBench to evaluate six recently proposed HSR methods across a 70-configuration sweep on four widely used hyperspectral scenes and observe that the inter-method PSNR spread widens from approximately 5 dB on the easiest PSF to over 13 dB on the hardest - a fragility that is structurally invisible to the prevailing single-configuration evaluation protocol. HyperBench code is available at https://github.com/ritikgshah/HyperBench .
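The Wald-style degradation that the abstract refers to can be sketched in a few lines. This is only an illustration under stated assumptions, not HyperBench's implementation: the box-shaped PSF, the toy spectral response matrix `srf`, and the function names `spatial_degrade` and `spectral_degrade` are all hypothetical, and the additive noise step is left out.

```python
def spatial_degrade(cube, factor):
    """Box-PSF blur plus decimation: average each factor x factor block, band by band.

    cube is a list of bands, each band a list of rows (assumed layout).
    """
    out = []
    for band in cube:
        h, w = len(band), len(band[0])
        rows = []
        for i in range(0, h - h % factor, factor):
            row = []
            for j in range(0, w - w % factor, factor):
                block = [band[i + di][j + dj]
                         for di in range(factor) for dj in range(factor)]
                row.append(sum(block) / len(block))
            rows.append(row)
        out.append(rows)
    return out


def spectral_degrade(cube, srf):
    """Apply a spectral response matrix (MSI bands x HSI bands) at every pixel."""
    h, w = len(cube[0]), len(cube[0][0])
    n_bands = len(cube)
    return [
        [[sum(weights[b] * cube[b][i][j] for b in range(n_bands))
          for j in range(w)] for i in range(h)]
        for weights in srf
    ]


# Toy ground truth: 4 bands of 4x4 pixels, band b filled with the value b + 1.
hr_hsi = [[[float(b + 1)] * 4 for _ in range(4)] for b in range(4)]
srf = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # hypothetical 2-band sensor

lr_hsi = spatial_degrade(hr_hsi, 2)     # 4 bands, 2x2 pixels
hr_msi = spectral_degrade(hr_hsi, srf)  # 2 bands, 4x4 pixels
```

A fusion method receives `lr_hsi` and `hr_msi` and is scored against `hr_hsi`; changing the PSF, SRF, or factor changes the difficulty of the task, which is why a single configuration can hide large differences between methods.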

2605.21669 2026-05-22 cs.CV cs.AI 版本更新

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

MRecover:一种利用AI生成对比度恢复受运动伪影损坏的MR图像的条件生成模型

Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

发表机构 * Department of Bioengineering, University of Pittsburgh(匹兹堡大学生物工程系) School of Medicine, University of Pittsburgh(匹兹堡大学医学院) Department of Radiology, University of Pittsburgh(匹兹堡大学放射科) Department of Psychiatry, University of Pittsburgh(匹兹堡大学精神病学系)

AI总结 该研究提出了一种条件生成模型MRecover,由常规采集的T1w图像合成T2w TSE图像,用AI生成的对比度替代受运动伪影损坏的MRI图像,并通过自回归切片条件化保证体积一致性,从而增加了可用于海马亚区分割分析的受试者数量,并能泛化到域外数据。

详情
AI中文摘要

海马亚区分割需要高分辨率的T2w快速自旋回波(TSE)MRI,但该序列容易受运动伪影影响,造成大量数据损失。我们开发了一种条件生成模型(MRecover),以常规采集的T1w图像为输入合成TSE图像,并通过自回归切片条件化保证体积一致性。模型在7T MRI数据(n=577)上训练,在域内取得了高保真度(n=148,SSIM=0.84,FSIM=0.94),并能很好地泛化到域外3T数据:合成图像与实际采集图像得到的亚区体积高度一致(n=416,r=0.87-0.97);在受运动影响的ADNI3数据集中,经质量控制后可分析的受试者数量增加了31.8%(593对450)。由于样本量增加,合成图像在诊断组间海马亚区萎缩差异上也得到了更大的效应量(整个海马ε²=0.121-0.100对0.086-0.062,左、右半球)。项目页面:https://jinghangli98.github.io/MRecover/

英文摘要

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, with autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from synthesized and as-acquired images closely matched (n=416, r=0.87-0.97), and the model yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). The synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy, owing to the increased sample size (whole hippocampus $ε^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/

2605.21661 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Hierarchical Variational Policies for Reward-Guided Diffusion

分层变分策略用于奖励引导的扩散

Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

发表机构 * Department of Computer Science(计算机科学系) University of California Irvine(加州大学尔湾分校)

AI总结 本文提出了一种分层变分模型框架,将控制摊销到轻量级且表达能力强的随机策略中,在降低推理成本的同时生成高质量的奖励对齐样本;在4倍超分辨率任务中,该方法的推理速度比最佳基线快5倍以上,且感知质量更好。

详情
AI中文摘要

将预训练扩散模型适配到逆问题等下游目标,通常需要昂贵的测试时引导或优化。我们提出了一个有原则的框架,能够以大幅降低的推理成本生成高质量的奖励对齐样本。我们的方法将测试时适配表述为一个分层变分模型,其中控制被摊销(amortize)到一个轻量级但表达能力强的随机策略中。这种表述天然支持少步扩散采样:大步长带来快速推理,而学习到的策略通过提供结构化的逐步控制来保持样本质量。由此得到的完全摊销采样器实现了很好的质量-速度权衡,在计算量显著更少的情况下达到或超过近期的测试时扩展基线。例如,在4倍超分辨率任务中,我们的方法在推理速度比最佳基线快5倍以上的同时取得了更好的感知质量。我们进一步将该方法扩展到半摊销设定,把廉价的摊销提议与有限的测试时优化结合起来,在多个具有挑战性的逆问题上取得了最先进的感知质量。

英文摘要

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality--speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21642 2026-05-22 cs.CV 版本更新

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

Ablate-to-Validate: 视觉语言模型真的在使用连续思维令牌吗?

Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna

发表机构 * University of Washington(华盛顿大学)

AI总结 本文提出了一种诊断原则Ablate-to-Validate,通过Token Replacement Test(TRT)测试视觉语言模型是否真正利用了连续令牌内容,发现模型性能提升可能并非源于令牌内容,而是令牌存在本身。

详情
AI中文摘要

视觉语言模型(VLMs)越来越多地引入连续或潜在的非文本令牌以支持'视觉思维'。尽管任务准确性有所提高,但这并不能证明模型确实使用这些令牌进行推理——收益可能来自于诸如增加的上下文长度、特殊令牌锚定或训练时的正则化等混淆因素。我们正式提出了一种诊断原则,Ablate-to-Validate,用于测试潜在令牌内容是否被真正利用,并将其实例化为Token Replacement Test(TRT),一个标准化的内容替换消融套件。TRT固定提示、图像、令牌预算和解码,同时用零、随机、首次重复或Oracle替代中间令牌,以确定性能是否依赖于令牌内容或仅仅是令牌存在。作为受控测试平台,我们研究了LLaVA-13B和Qwen2.5-VL-3B在相对深度推理中的表现,训练模型在多个冻结编码器(SigLIP2,CLIP,DINOv2)和令牌预算下预测和消耗连续或离散深度跨度。此外,我们还将TRT应用于三个现成的视觉思维系统(Mirage,Mull-Tokens,CoVT)在BLINK,VSP和CV-Bench上。在所有设置中,准确性提升都是潜在令牌推理的误导性代理:VLMs在令牌内容被破坏或替换时仍能保持大部分改进,揭示了拥有潜在通道与将其用作信息瓶颈之间的持续差距。我们推荐TRT作为任何引入连续思维令牌的方法的标准诊断工具,与准确性并行使用。

英文摘要

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
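The content-replacement ablations named in the abstract can be pictured with a small helper. This is a hedged sketch, not the authors' TRT code: the function name `replace_tokens`, the list-of-vectors token layout, the unit-Gaussian random tokens, and the reading of "first-repeat" as "copy the first token into every position" are assumptions made for illustration.

```python
import random


def replace_tokens(tokens, mode, oracle=None, seed=0):
    """Return a replacement with the same number and width of latent tokens.

    tokens: list of equal-length float vectors produced by the model (assumed layout).
    mode:   "zero", "random", "first_repeat", or "oracle".
    """
    if mode == "zero":
        return [[0.0] * len(t) for t in tokens]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the ablation is repeatable
        return [[rng.gauss(0.0, 1.0) for _ in t] for t in tokens]
    if mode == "first_repeat":
        return [list(tokens[0]) for _ in tokens]
    if mode == "oracle":
        if oracle is None or len(oracle) != len(tokens):
            raise ValueError("oracle tokens must match the token budget")
        return [list(t) for t in oracle]
    raise ValueError(f"unknown mode: {mode}")


latent = [[0.2, -1.0], [0.7, 0.3], [-0.4, 0.9]]
ablations = {m: replace_tokens(latent, m) for m in ("zero", "random", "first_repeat")}
```

Because every variant keeps the token count fixed, any accuracy that survives the swap cannot be attributed to the content of the tokens, which is the comparison the abstract describes.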

2605.21633 2026-05-22 eess.IV cs.CV 版本更新

VRXU-net: A Deep Learning Approach for Brain Ischemic Stroke Lesion Detection and Segmentation in T1W MRI

VRXU-net: 一种用于T1W MRI中脑缺血性中风病变检测和分割的深度学习方法

Sayed Amir Mousavi Mobarakeh

发表机构 * Sayed Amir Mousavi Mobarakeh

AI总结 该研究提出了一种基于视觉特征、残差连接和U型网络的VRU-Net架构,用于在3D磁共振成像扫描中检测和分割脑缺血性中风病变,通过改进的VGG模型和U型分割模型在不同切面中独立处理,并通过聚合结果提高分割精度和处理速度。

详情
AI中文摘要

当大脑供血被血栓阻断时,脑组织得不到足够的氧气,导致细胞坏死。在医疗环境中,准确识别和勾画缺血性病变的边界对于治疗和手术规划至关重要。然而,缺血性卒中病变在形状、大小和位置上差异很大,在T1W等灰度MRI模态中,它们可能与周围脑结构相似,这使得病变检测和分割对临床医生来说颇具挑战。本研究提出了一种新的VRU-Net架构,该架构基于视觉特征、残差连接和U型网络,用于检测和分割3D磁共振成像扫描中的缺血性卒中病变。所提方法首先使用改进的VGG模型在各个2D切片中识别缺血性卒中,然后由带残差块的U型分割模型对每个切片中的病变进行分割。这一过程分别独立应用于轴状面、矢状面和冠状面,并通过聚合三个分割结果生成最终输出。为了同时提高性能和处理速度,在顺序框架中先用一个高性能分类器,再进行分割。这种策略减少了对非病变切片的不必要分割,并提高了整体准确性。此外,将3D图像分解为2D切片降低了模型复杂度,同时让三个解剖平面的信息共同支持更准确的病变定位。所提模型在卒中后病变解剖描记(Anatomical Tracings of Lesions After Stroke)数据集上训练,在准确率和Dice系数方面优于最先进的模型。此外,分割输出提供的反馈有助于分类模型减少假阳性预测。

英文摘要

When the blood supply to the brain is obstructed by a clot, oxygen delivery to brain tissues becomes insufficient, leading to cellular necrosis. In healthcare settings, accurately identifying and delineating ischemic lesion boundaries is essential for treatment and surgical planning. However, ischemic stroke lesions vary widely in shape, size, and location, and in grayscale MRI modalities such as T1W they may resemble surrounding brain structures. This makes lesion detection and segmentation a challenging task for clinicians. This study introduces a novel VRU-Net architecture, derived from visual features, residual connections, and a U-shaped network, for detecting and segmenting ischemic stroke lesions in 3D magnetic resonance imaging scans. The proposed method first uses a modified VGG model to identify ischemic stroke in separate 2D slices. Then, a U-shaped segmentation model with residual blocks segments the lesion in each slice. This procedure is applied independently to the axial, sagittal, and coronal planes, and the final output is generated by aggregating the three segmentation results. To improve both performance and processing speed, a high-performance classifier is applied before the segmentation model in a sequential framework. This strategy reduces unnecessary segmentation of non-lesion slices and improves overall accuracy. In addition, decomposing 3D images into 2D slices reduces model complexity while allowing information from three anatomical planes to support more accurate lesion localization. The proposed model is trained on the Anatomical Tracings of Lesions After Stroke dataset and outperforms state-of-the-art models in terms of accuracy and Dice coefficient. Moreover, the segmentation output provides feedback that helps the classification model reduce false-positive predictions.

2605.21625 2026-05-22 cs.CV cs.AI cs.CL 版本更新

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Flat-Pack Bench: 通过家具组装评估大视觉-语言模型的时空理解

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

发表机构 * Cornell University(康奈尔大学) Cornell Tech(康奈尔科技) MBZUAI(穆罕默德·本·扎耶德人工智能大学) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出Flat-Pack Bench基准,用于评估大视觉-语言模型在复杂视频场景中的时空理解能力,发现当前模型在细粒度时空推理上存在显著不足。

Comments CVPR 2026

详情
AI中文摘要

大视觉-语言模型(LVLMs)的出现显著提升了视频理解能力。然而,现有基准主要集中在粗粒度任务,如动作分割、分类、描述和检索,且这些基准通常依赖于易于口头识别的实体,如家庭物品、动物、人类主体等,限制了其在复杂真实视频场景中的适用性。但许多应用,如家具组装、烹饪等,需要对视频进行逐步细粒度的时空理解,而当前基准并未充分评估。为解决这一差距,我们引入了Flat-Pack Bench,一个专注于家具组装任务的新基准。我们的基准评估LVLMs在细微任务上的表现,包括组装动作的时间顺序、组装状态的时间定位、理解部件配合和追踪,使用多选问题配以视觉提示突出相关部分作为参考,以回答细粒度问题。我们的实验表明,最先进的LVLMs在细粒度时空推理上表现显著不足,凸显了其在有效利用视频时间信息、跟踪能力和理解空间交互(如物理接触)方面的局限性。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, human subjects, etc., limiting their applicability to complex, in-the-wild video scenarios. But, many applications such as furniture assembly, cooking, etc., require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in effectively leveraging temporal information from videos, limited tracking ability, and understanding of spatial interactions like physical contact.

2605.21611 2026-05-22 cs.CV cs.LG 版本更新

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

UniVL:用于空间定位上下文图像生成的统一视觉-语言嵌入

Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

发表机构 * Center for Advanced AI(先进人工智能中心)

AI总结 本文提出了一种统一的视觉-语言嵌入方法,通过单一的视觉输入直接将语义绑定到空间位置,从而减少计算并提高图像生成质量。

详情
AI中文摘要

我们提出了空间定位的上下文图像生成(spatially grounded contextual image generation),这是一种重新设计条件输入范式的可控图像生成任务。以往方法通过视觉和语言两个独立的编码器分别输入参考图像和全局文本提示;UniVL则被训练为直接从单一的统一视觉输入中将语义绑定到空间位置,其中文本指令被渲染在空间掩码上。这样在推理时就不再需要独立的文本编码器。所得模型按照用户指定的“在何处出现何种内容”的指令进行上下文图像生成,同时大幅减少计算量。针对这一任务,我们提出了一个框架:由光学字符识别预训练骨干网络改造而来的UniVL编码器以光学方式读取统一条件,并生成UniVL嵌入fVIL,在单个令牌序列中融合视觉与语义意图以及空间位置。两阶段流程首先将UniVL与VAE嵌入空间对齐,然后让预训练的扩散骨干网络完全以UniVL嵌入为条件,从而省去T5等独立文本编码器。尽管这种设计刻意采用了极简的文本接口,仍然取得了显著的实证收益。UniVL-ImgGen是我们为训练和评估构建的基准,包含47.7万张带掩码标注的图像;在该基准上,UniVL的图像质量优于文本提示基线,FID从14降至11,PSNR从16提高到20。它还完全去掉了文本编码器,推理TFLOPs最多减少52%,运行时间最多减少44%。额外的消融研究验证了所提各组件的贡献,为采用统一条件范式的高效、空间定位图像生成铺平了道路。

英文摘要

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

2605.21573 2026-05-22 cs.CV 版本更新

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Lens:重新思考基础文本到图像模型的训练效率

Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen

发表机构 * Microsoft Lens Team(微软Lens团队)

AI总结 本文提出Lens,一个具有38亿参数的文本到图像模型,在多种基准测试中表现与超过60亿参数的最新模型相当甚至更优,同时训练计算需求显著降低。通过最大化训练批次的数据信息密度和改进收敛速度的架构选择,实现了高效的训练和优化。

Comments Project Page: https://github.com/microsoft/Lens

详情
AI中文摘要

我们提出Lens,一个38亿参数的文本到图像(T2I)模型,在多种基准测试上达到了与60亿参数以上的最先进模型相当、在若干情况下更优的性能,同时所需训练计算量显著更少。例如,Lens的训练计算量仅约为Z-Image的19.3%。除了紧凑的模型规模外,Lens的训练效率还源于两个关键策略。首先,我们最大化每个训练批次的数据信息密度:(i)在Lens-800M上训练,该数据集包含8亿个密集标注的图像-文本对,标注由GPT-4.1生成,平均约109个词,比传统的短标注提供更丰富的语义监督;(ii)用多种分辨率和不同长宽比的图像构建每个批次,从而扩大每个优化步骤的有效视觉覆盖范围。其次,我们通过精心的架构选择加快收敛,包括采用能提供更好潜在表示的语义VAE,以及采用强语言编码器,后者既能加速优化,又能使模型从仅含英语的训练数据泛化到多语言。预训练之后,我们采用基于分类体系驱动提示(Lens-RL-8K)和结构化奖励评分标准的强化学习来抑制伪影、提升视觉质量,引入带有免训练系统提示搜索的推理器模块以更好地对齐用户请求与模型,并通过蒸馏加速实现4步推理。凭借高效的训练和系统化的优化,Lens可泛化到1:2至2:1之间的任意长宽比和最高1440^2的分辨率,并支持多种常用语言的提示。得益于紧凑的规模,Lens在单块NVIDIA H100 GPU上生成一张1024^2图像需3.15秒,其蒸馏得到的turbo版本完成4步生成仅需0.84秒。

英文摘要

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.

2605.21572 2026-05-22 cs.CV cs.RO 版本更新

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

PhysX-Omni:面向刚体、可变形体和铰接物体的统一仿真就绪物理3D生成

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S实验室)

AI总结 本文提出PhysX-Omni,一个统一的仿真就绪(simulation-ready)物理3D生成框架,包含面向视觉-语言模型的高效几何表示、首个通用仿真就绪3D数据集PhysXVerse,以及评估生成与理解能力的PhysX-Bench;实验表明其在生成和理解上均表现出色,有望推动具身AI和物理仿真等下游应用。

Comments Project page: https://physx-omni.github.io/

详情
AI中文摘要

仿真就绪(simulation-ready)的物理3D资产因其在下游任务中的广泛适用性,已成为一个很有前景的方向。然而,现有的3D生成方法大多要么忽略物理属性,要么局限于单一资产类别,例如刚体、可变形体或铰接物体。为解决这些局限,我们提出PhysX-Omni,一个面向多种资产类型的统一仿真就绪物理3D生成框架。具体而言,我们为视觉-语言模型设计了一种新颖且高效的几何表示,它无需压缩即可直接编码高分辨率3D结构,显著提升了生成性能。此外,我们构建了首个通用的仿真就绪3D数据集PhysXVerse,涵盖多种室内和室外类别。进一步地,为了全面而灵活地评估真实场景下的生成与理解能力,我们提出PhysX-Bench,它涵盖六个关键属性:几何、绝对尺度、材质、可供性、运动学和功能描述。基于传统指标和PhysX-Bench的大量实验表明,PhysX-Omni在生成和理解两方面均表现出色。额外的研究还进一步验证了PhysX-Omni在仿真就绪场景生成和机器人策略学习中的应用潜力。我们相信PhysX-Omni能够显著推动广泛的下游应用,尤其是具身AI和基于物理的仿真。

英文摘要

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

2605.21527 2026-05-22 eess.IV cs.CV cs.LG 版本更新

CryoNet: A Deep Learning Framework for Multi-Modal Debris-Covered Glacier Mapping. A Case Study of the Poiqu Basin, Central Himalaya

CryoNet:用于多模态表碛覆盖冰川制图的深度学习框架。以喜马拉雅中段波曲(Poiqu)流域为例

Farzaneh Barzegar, Tobias Bolch, Norbert Kuehtreiber, Silvia L. Ullo

发表机构 * University of Sannio(桑尼奥大学) Graz University of Technology(格拉茨技术大学)

AI总结 本研究提出CryoNet,一种利用多模态数据集的深度学习框架,用于区分裸冰冰川、表碛覆盖冰川和冰湖,并以喜马拉雅中段波曲(Poiqu)流域为例展示了其在复杂高山环境中的有效性。

Comments 15 pages, 10 figures, 5 tables. Preprint submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS); currently under review

详情
AI中文摘要

冰川是重要的淡水储备和气候变化指示器,但其自动提取仍具挑战性,尤其是表碛覆盖冰川,因为它与周围地形的光谱特征相似。本研究提出CryoNet,一种深度学习框架,利用丰富的多模态数据集来区分裸冰冰川、表碛覆盖冰川和冰湖;该数据集结合了Sentinel-2光学影像、DEM导出的地形变量、光谱指数、主成分分析(PCA)、InSAR相干性与相位、缨帽变换特征和GLCM纹理。CryoNet是一种编码器-解码器CNN,带有嵌套跳跃连接和空间-通道Squeeze-and-Excitation(scSE)注意力,并以ResNet101为编码器来捕获层次化的上下文与空间特征。研究区为喜马拉雅中段的波曲(Poiqu)流域,并通过将训练好的模型应用于阿尔卑斯山的勃朗峰(Mont Blanc)山体来评估其可迁移性。我们还分析了各数据层对提升冰川制图性能的重要性。所提模型的总体IoU为90.52%,平均召回率为98.08%,平均精确率为92.26%。对于表碛覆盖冰川,CryoNet的IoU为90.46%,召回率为95.79%,精确率为94.21%。在各类别指标和总体指标上,CryoNet均超过了作为最先进(SOTA)参考的DeepLabV3+、SegFormer和U-Net,证明了其在复杂高山环境中进行稳健冰川制图的有效性。

英文摘要

Glaciers play a critical role as freshwater reserves and indicators of climate change, yet their automatic delineation, especially for debris-covered glaciers, remains challenging due to spectral similarity with surrounding terrain. This study introduces CryoNet, a deep learning framework that leverages a rich multi-modal dataset combining Sentinel-2 optical imagery, DEM-derived topographic variables, spectral indices, Principal Component Analysis (PCA), InSAR coherence and phase, tasseled-cap features, and GLCM texture to discriminate clean-ice glaciers, debris-covered glaciers, and glacial lakes. CryoNet is an encoder-decoder CNN with nested skip connections and spatial-channel Squeeze-and-Excitation (scSE) attention, built upon a ResNet101 encoder to capture hierarchical contextual and spatial features. The study is conducted in the Poiqu Basin in the central Himalaya, and transferability is evaluated by applying the trained model to the Mont Blanc Massif in the Alps. We additionally analyse the importance of each data layer in improving glacier mapping performance. The proposed model achieves an overall IoU of 90.52%, mean Recall of 98.08%, and mean Precision of 92.26%. For debris-covered glaciers specifically, CryoNet obtains an IoU of 90.46%, a recall of 95.79%, and a precision of 94.21%. Across both per-class and overall metrics, CryoNet surpasses DeepLabV3+, SegFormer, and U-Net, taken as state-of-the-art (SOTA) references, demonstrating its effectiveness for robust glacier mapping in complex high-mountain environments.
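For readers comparing the reported IoU, recall, and precision figures, the sketch below shows how the three quantities relate for one class. It is a generic illustration, not CryoNet's evaluation code: the function name `class_metrics`, the flat label lists, and the integer class ids are assumptions.

```python
def class_metrics(pred, truth, cls):
    """Per-class IoU, precision and recall from flat per-pixel label lists."""
    tp = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    fp = sum(1 for p, t in zip(pred, truth) if p == cls and t != cls)
    fn = sum(1 for p, t in zip(pred, truth) if p != cls and t == cls)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return iou, precision, recall


# Hypothetical ids: 0 = background, 1 = debris-covered glacier, 2 = glacial lake.
pred = [1, 1, 0, 2, 1]
truth = [1, 0, 0, 2, 1]
iou, precision, recall = class_metrics(pred, truth, cls=1)
```

IoU penalises both false positives and false negatives in a single ratio, so it can never exceed the smaller of precision and recall for the same class.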

2605.21523 2026-05-22 eess.IV cs.AI cs.CV cs.MM eess.SP 版本更新

Tackle CSM in JPEG Steganalysis with Data Adaptation

用数据适应法对抗JPEG隐写分析中的CSM

Rony Abecidan, Vincent Itier, Jérémie Boulanger, Patrick Bas, Tomáš Pevný

发表机构 * LABEL4.AI Univ. Lille(里尔大学) CNRS(国家科学研究中心) Centrale Lille(里尔中央理工大学) UMR 9189 CRIStAL(里尔大学UMR 9189 CRIStAL) IMT Nord Europe(北欧IMT) Centre for Digital System(数字系统中心) Department of Computers(计算机系)

AI总结 本文提出TADA框架,通过数据适应方法学习未知的处理流程,以提高在真实场景中对抗CSM问题的鲁棒性,并改进实际应用中的泛化能力。

Comments ACM Workshop on Information Hiding and Multimedia Security, (IH&MMSec '26), Jun 2026, Florence, Italy

详情
AI中文摘要

隐写分析模型在基准数据集上表现优异,但在实际应用中,当待分析图像由训练时未见过的处理流程生成时,性能会明显下降。这一问题被称为载体源失配(Cover Source Mismatch,CSM),在现实场景中尤为棘手,因为实践者(1)只能获得一个小规模的无标注数据集,(2)不确定这些图像经过了哪些处理,(3)也不知道该数据集中载体图像与含密图像的比例。为应对这一挑战,我们提出TADA(Target Alignment through Data Adaptation,通过数据适应实现目标对齐),该框架从小规模无标注目标集中学习模拟未知的处理流程。该架构的训练损失结合了残差协方差对齐、残差分布匹配,以及一个约束模拟器生成逼真图像的ℓ²损失。在玩具目标和实际业务目标上,与强大的整体式(holistic)和原子式(atomistic)基线相比,TADA显著提升了对CSM的鲁棒性,并改善了实际应用中的泛化能力。更多资源见:https://github.com/RonyAbecidan/TADA

英文摘要

Steganalysis models excel on benchmark datasets but struggle in the wild when the analyzed images are produced by a processing pipeline unseen during training. This problem, known as Cover Source Mismatch (CSM), is particularly hard in realistic settings where practitioners (1) have access to only a small, unlabeled dataset, (2) are unsure of the processing techniques applied to these images, and (3) lack information on the proportion of covers and stegos in that set. To address this challenge, we introduce TADA (Target Alignment through Data Adaptation), a framework that learns to emulate the unknown processing pipeline from a small unlabeled target set. This architecture is trained with a loss combining residual covariance alignment, residual distribution matching, and an $\ell^2$ loss constraining the emulator to produce realistic images. Across toy and operational targets, TADA yields substantial gains in robustness to CSM and improves operational generalization compared to strong holistic and atomistic baselines. Additional resources are available at this link: https://github.com/RonyAbecidan/TADA

2605.21500 2026-05-22 eess.IV cs.CV 版本更新

A Task-Agnostic Algebraic Integrity Metric for Event-Camera Streams Toward SOTIF-Compliant Perception using Pearson Correlation Coefficient

一种基于皮尔逊相关系数、面向SOTIF合规感知的事件相机流任务无关代数完整性度量

Arthur de Miranda Neto

发表机构 * Federal University of Lavras(拉夫拉斯联邦大学)

AI总结 本文提出了一种任务无关的代数完整性度量,通过将皮尔逊相关系数提升到三个标准事件表示中,以实现SOTIF兼容的感知。

Comments 12 pages, 6 figures, 3 tables, 14 equations. Theoretical framework paper with procedural-synthetic illustrations; empirical validation on real datasets reserved for follow-up. Code and demonstration video available

详情
AI中文摘要

事件相机已成为自动驾驶系统(ADS)安全关键感知中的一种高带宽、低延迟传感模态,具有微秒级时间分辨率、120-140 dB动态范围,并且本质上没有运动模糊。然而,目前还没有能直接作用于异步事件流的任务无关质量度量:最先进的替代指标需要借助下游任务(例如检测精度、跟踪误差)来评估事件流的完整性,这与ISO 21448(SOTIF)和ISO/PAS 8800:2024的认证要求不相容。近期的BiasBench基准(CVPR 2025)明确指出了这一空白。本文提出了一个统一的代数框架,将皮尔逊相关系数(PCC)提升(lift)到三种标准事件表示:时间表面(Time Surface)、事件帧(Event Frame)和体素网格(Voxel Grid);PCC此前曾在两项工作中用于基于帧的图像的冗余过滤和ROI选择。该框架给出三个度量:(i)r-TS,通过与自运动预测的时间表面进行比较来监控事件流完整性;(ii)r2-EF,用于只需整数比较的自适应ROI选择;(iii)r-VG,用于时间冗余门控。本文在事件相机的对比度阈值机制(|Delta L| >= C)与基于PCC的变化判据之间建立了结构同构,对三个提升后的度量进行了形式化,并以原始事件流为参照对称地分析了流水线延迟和信息损失。各度量的行为在程序化合成的事件流上进行了示例性演示,该事件流由发射模型的直接仿真生成,而非取自任何真实数据集或由视频转换得到的数据集;其中包括一个tunnel-dip完整性异常场景,r_C从0.93(一致的流)降至0以下(告警)。一套明确的认知状态标记([ESTABLISHED]、[SOLID]、[HYPOTH.]、[OPEN])界定了每项贡献的状态。

英文摘要

Event cameras have emerged as a high-bandwidth, low-latency sensing modality for safety-critical perception in automated driving systems (ADS), offering microsecond temporal resolution, 120-140 dB dynamic range, and intrinsic absence of motion blur. However, no task-agnostic quality metric currently operates directly on the asynchronous event stream: state-of-the-art proxies require a downstream task (e.g., detection accuracy, tracking error) to assess stream integrity, which is incompatible with the certification requirements of ISO 21448 (SOTIF) and ISO/PAS 8800:2024. The recent BiasBench benchmark (CVPR 2025) explicitly identifies this gap. This work proposes a unified algebraic framework that lifts the Pearson Correlation Coefficient (PCC), historically used in two prior works for redundancy filtering and ROI selection on frame-based images, to the three standard event representations: Time Surface, Event Frame, and Voxel Grid. The framework yields three metrics: (i) r-TS for stream integrity monitoring against an ego-motion-predicted Time Surface, (ii) r2-EF for adaptive ROI selection requiring only integer comparisons, and (iii) r-VG for temporal redundancy gating. A structural isomorphism is established between the contrast-threshold mechanism of the event camera (|Delta L| >= C) and the PCC-based change criterion, the three lifted metrics are formalized, and pipeline latency and information loss are analyzed symmetrically against the raw stream. Illustrative behavior of each metric is demonstrated on a procedural-synthetic event stream, generated by direct simulation of the emission model rather than drawn from any real or video-derived dataset, including a tunnel-dip integrity-anomaly scenario in which r_C drops from 0.93 (coherent flow) to below 0 (alarm). An explicit epistemic convention ([ESTABLISHED], [SOLID], [HYPOTH.], [OPEN]) delineates the status of every contribution.
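As a rough picture of what applying the Pearson Correlation Coefficient to an Event Frame could look like, the sketch below accumulates events into a signed count image and correlates two such images. It is an assumption-laden illustration rather than the paper's r-TS, r2-EF, or r-VG definitions: the `(x, y, polarity)` event tuples, the helper names, and the convention of returning 0.0 for a constant frame are all hypothetical.

```python
import math


def event_frame(events, width, height):
    """Accumulate (x, y, polarity) events into a flattened signed count image."""
    frame = [0] * (width * height)
    for x, y, polarity in events:
        frame[y * width + x] += 1 if polarity > 0 else -1
    return frame


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumed convention: a constant frame carries no correlation
    return cov / math.sqrt(var_a * var_b)


observed = event_frame([(0, 0, 1), (1, 0, 1), (1, 1, -1)], width=2, height=2)
flipped = event_frame([(0, 0, -1), (1, 0, -1), (1, 1, 1)], width=2, height=2)
r_same = pearson(observed, observed)
r_flipped = pearson(observed, flipped)
```

A monitor built on this idea would compare the observed frame against a predicted one and raise an alarm when the coefficient falls below a chosen threshold; the threshold itself is not specified in the abstract.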

2605.21493 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

别让特征坍缩:为什么CenterLoss损害OOD检测,而多尺度Mahalanobis胜出

Rahul D Ray

发表机构 * Department of Electronics and Electrical Engineering(电子与电气工程系)

AI总结 本文提出GOEN方法,通过多尺度特征、L2归一化、Mahalanobis距离和校准头来提升OOD检测性能,发现CenterLoss会降低OOD检测性能,而GOEN-NoCenterLoss在CIFAR-10基准上表现优于其他基线方法。

详情
AI中文摘要

检测分布外(OOD)输入的能力是安全部署机器学习系统的基础。然而,当前方法往往依赖仅针对分类准确率优化的特征表示,忽视了认知(epistemic)不确定性的特殊要求。我们提出GOEN(Geometry-Optimised Epistemic Network,几何优化的认知网络),这是一个简单的流程,结合了多尺度特征、L2归一化、Mahalanobis距离,以及用真实的困难OOD样本训练的校准头。通过系统的消融实验,我们得到一个反直觉的发现:CenterLoss这一常用于增强特征紧凑性的正则项虽然提高了分类准确率,却明显降低了OOD检测性能,使平均OOD AUROC从0.9483降至0.9366。最佳变体GOEN-NoCenterLoss在CIFAR-10基准上的平均OOD AUROC达到0.9483,超过了包括深度集成(0.8827)、KNN(0.8967)和ODIN(0.8870)在内的所有基线,同时保持了有竞争力的分布内准确率。我们的结果挑战了“更好的分类几何会自动带来更好的认知不确定性”这一普遍假设。相反,我们表明过于紧致的特征簇会压缩类间间隔,并扭曲有效OOD检测所需的协方差结构。GOEN十分高效,在单块GPU上训练不到20分钟,为构建能够可靠识别自身局限的AI系统提供了实用蓝图。

英文摘要

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
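To make the Mahalanobis-distance component of the abstract concrete, here is a minimal sketch of OOD scoring with class means and a shared covariance. It is not the GOEN pipeline: it uses 2-D features so the covariance can be inverted in closed form, and it omits multi-scale features, L2 normalisation, and the calibration head. All function names and the `eps` regulariser are assumptions.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def shared_covariance_2d(class_to_feats, means, eps=1e-6):
    """Pooled within-class covariance (sxx, sxy, syy) for 2-D features."""
    sxx = sxy = syy = 0.0
    count = 0
    for label, feats in class_to_feats.items():
        mx, my = means[label]
        for x, y in feats:
            dx, dy = x - mx, y - my
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
            count += 1
    # eps on the diagonal keeps the matrix invertible (assumption).
    return (sxx / count + eps, sxy / count, syy / count + eps)


def mahalanobis_sq(point, mu, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def ood_score(point, means, cov):
    """Distance to the nearest class mean; larger values suggest OOD."""
    return min(mahalanobis_sq(point, mu, cov) for mu in means.values())
```

A point close to a training cluster receives a small score, while a point far from every class mean receives a large one; a threshold on this score would then separate in-distribution from OOD inputs.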

2605.21079 2026-05-22 cs.CV 版本更新

VDFP: Video Deflickering with Flicker-banding Priors

VDFP:基于闪烁带先验的视频去闪烁

Zhiyi Zhou, Libo Zhu, Zihan Zhou, Yulun Zhang, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

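AI summary: This paper proposes VDFP, a video deflickering framework based on flicker-banding priors. By constructing the DeViD dataset and introducing the DFM and CPP modules, it addresses banding artifacts in screen capture; experiments show that it outperforms existing methods in deflickering quality and spatio-temporal consistency.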

Comments Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP

详情
AI中文摘要

使用智能手机捕捉数字屏幕时,由于硬件同步不匹配,经常会产生严重的带状伪影。现有的视频修复方法难以处理这些结构化、周期性的亮度波动,通常导致残留伪影或过度平滑的纹理。我们首先构建了DeViD数据集,以应对可用数据集不足的问题。然后我们提出了VDFP(Video Deflickering with Flicker-banding Priors),一种新颖的感知引导生成框架。首先,我们引入了一种基于滚动快门机制的退化场建模(DFM),能够合成复杂的多带状场景。其次,我们提出了空间-时间连续先验感知(CPP)。不同于传统的二元分割,该模块通过闪烁感知的均方误差(FA-MSE)进行优化,以捕捉亮度过渡。通过零初始化增强的输入层,我们的模型保留了预训练的生成先验以及空间-时间先验感知。广泛的实验表明,VDFP在去闪烁效果和时空一致性方面显著优于其他方法,能够高效消除复杂的带状伪影并保留高保真的空间细节。我们的数据集和代码将在https://github.com/ZhiyiZZhou/VDFP上发布。

英文摘要

Capturing digital screens with smartphones frequently induces severe banding due to hardware synchronization mismatches. Existing video restoration methods struggle with these structured, periodic luminance fluctuations, often resulting in residual artifacts or over-smoothed textures. We first construct DeViD, a real-world dataset covering various scenes, to address the lack of available datasets. We then propose VDFP (Video Deflickering with Flicker-banding Priors), a novel perception-guided generation framework. First, we introduce Degradation Field Modeling Based on Rolling Shutter Mechanism (DFM), which can synthesize complex multi-banding scenarios. Second, we present spatial-temporal continuous prior perception (CPP). Unlike traditional binary segmentation, this module is optimized via a Flicker-Aware Mean Squared Error (FA-MSE) to capture the luminance transitions. By zero-initializing an augmented input layer, our model preserves pre-trained generative priors as well as spatial-temporal prior perception. Extensive experiments demonstrate that VDFP significantly outperforms other methods, eliminating complex banding while preserving high-fidelity spatial details and temporal consistency. Our dataset and code will be released at https://github.com/ZhiyiZZhou/VDFP.

2605.20302 2026-05-22 cs.LG cs.CV 版本更新

Neural Collapse by Design: Learning Class Prototypes on the Hypersphere

按设计实现神经崩溃:在超球面上学习类别原型

Panagiotis Koromilas, Theodoros Giannakopoulos, Mihalis A. Nicolaou, Yannis Panagakis

发表机构 * The Cyprus Institute(塞浦路斯研究所) University of Athens(雅典大学) Archimedes AI/Athena Research Center(阿基米德AI/阿泰纳研究中心) University of Cyprus(塞浦路斯大学)

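AI summary: This paper studies Neural Collapse (NC), the theoretical optimum of supervised classification, and points out that the two dominant paradigms, cross entropy (CE) and supervised contrastive learning (SCL), fail to reach it in practice. The authors improve both CE and SCL by contrasting prototypes on the hypersphere, coming closer to NC on multiple benchmarks.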

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/nc_by_design

详情
AI中文摘要

监督分类有一个理论最优解,即神经崩溃(NC),然而其两种主导范式在实践中都无法达到这一最优。交叉熵(CE)保留了径向自由度,导致收敛到退化几何结构,而监督对比学习(SCL)在预训练阶段驱动特征向NC靠近,但在后续的线性探测阶段丢弃了这一结构。我们证明这两种范式实际上是同一种方法的不同表现,即在单位超球面上对比原型。缩小差距需要在各自失败点进行修正。从CE侧,我们提出NTCE和NONL两种归一化损失,将对比优化缺失的成分引入分类器学习:大有效负样本集和解耦的对齐和均匀性项。从SCL侧,我们证明SCL的目标在训练过程中已经优化了原理分类器,其权重是类别均值嵌入,使线性探测变得冗余且有害。实验表明,在四个基准测试(包括ImageNet-1K)中,NTCE和NONL在准确率上超过了CE,接近NC(≥95%),并在不到7.5%的迭代次数中在4/5个指标上匹配CE的收敛NC,而SCL在固定原型的情况下无需线性探测阶段即可达到。学习的几何结构在迁移学习中带来了+5.5%的平均相对改进,严重类别不平衡下可达+8.7%,并且在ImageNet-C上提高了对损坏的鲁棒性。本文将监督学习重新定义为在超球面上的原型学习,通过设计达到NC。

英文摘要

Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method that contrasts prototypes on the unit hypersphere, and that closing the gap requires fixing each at its point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization's missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL's objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE's converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and improved robustness to corruptions on ImageNet-C. Our work recasts supervised learning as prototype learning on the hypersphere, with NC reached by design.
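The claim that a classifier can use class mean embeddings as its weights can be illustrated with a nearest-prototype rule on the unit hypersphere. The sketch below is only that rule, not the NTCE or NONL losses, and the function names are hypothetical. It assumes every embedding has non-zero norm.

```python
import math


def normalize(vector):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def class_prototypes(embeddings, labels):
    """One prototype per class: the re-normalized mean of its unit-norm embeddings."""
    sums = {}
    for embedding, label in zip(embeddings, labels):
        unit = normalize(embedding)
        acc = sums.setdefault(label, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += x
    # Normalizing the sum gives the same direction as normalizing the mean.
    return {label: normalize(acc) for label, acc in sums.items()}


def predict(prototypes, embedding):
    """Assign the class whose prototype has the highest cosine similarity."""
    unit = normalize(embedding)
    return max(
        prototypes,
        key=lambda label: sum(a * b for a, b in zip(prototypes[label], unit)),
    )
```

Because the prototypes are computed directly from training embeddings, no separate classifier training phase is needed in this sketch, which mirrors the motivation given in the abstract for skipping linear probing.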

2605.17837 2026-05-22 cs.CV cs.AI 版本更新

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

具有时间意识的剪枝用于高效扩散式视频生成

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

Affiliations * University of Pittsburgh, Illinois Institute of Technology, Rutgers University, Rice University

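AI summary: This paper proposes TAPE, a training-free temporal-aware pruning method for efficient diffusion-based video generation. It combines temporal smoothing, token reselection in selected layers, and timestep-level budget scheduling to speed up generation while preserving high visual quality.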

详情
AI中文摘要

视频扩散模型最近通过基于ViT的架构实现了高质量视频生成,但生成过程由于需要在长时空序列上进行注意力计算而计算成本高。token剪枝已被证明在ViTs和VLMs中有效。然而,大多数先前的剪枝方法基于注意力,按帧操作,无法确保视频生成任务中帧间的重要时间一致性。在实践中,简单采用仅注意力的剪枝会导致明显退化,由于背景一致性变差、闪烁和图像质量下降。为此,我们提出TAPE,一种无需训练的时间感知剪枝方法,用于高效扩散式视频生成。TAPE(i)应用时间平滑以对齐相邻帧之间的token重要性并抑制选择抖动;(ii)在选定的层中进行token重选,以使token剪枝与层的多样化语义关注相一致,并避免特定区域的误差累积;它还(iii)采用时间步级预算调度,在早期噪声步骤中进行激进剪枝,并在保真度关键的细化阶段放松剪枝。实验结果表明,TAPE在保持高质量视觉保真度的同时提供了显著的加速,优于先前的token减少方法。

英文摘要

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning method for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers' diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
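Two of the ideas in the abstract, temporal smoothing of token importance and a timestep-dependent pruning budget, can be sketched as follows. The abstract does not specify the smoothing kernel or the schedule shape, so the neighbour-averaging rule, the `alpha` weight, and the linear schedule here are assumptions, and all function names are hypothetical.

```python
def smooth_over_time(scores, alpha=0.5):
    """Blend each frame's token scores with the mean of its adjacent frames.

    scores[t][i] is the importance of token i in frame t.
    """
    num_frames = len(scores)
    smoothed = []
    for t in range(num_frames):
        neighbours = [scores[s] for s in (t - 1, t + 1) if 0 <= s < num_frames]
        row = []
        for i, own in enumerate(scores[t]):
            if neighbours:
                nb = sum(frame[i] for frame in neighbours) / len(neighbours)
                row.append((1.0 - alpha) * own + alpha * nb)
            else:
                row.append(own)
        smoothed.append(row)
    return smoothed


def keep_top_k(row, k):
    """Indices of the k highest-scoring tokens, returned in ascending order."""
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    return sorted(ranked[:k])


def keep_ratio(step, num_steps, low=0.5, high=1.0):
    """Linear budget: keep the fewest tokens at step 0, the most at the last step."""
    if num_steps <= 1:
        return high
    return low + (high - low) * step / (num_steps - 1)
```

With per-frame selection, a token that dips in importance for a single frame is dropped and then re-added, which is the selection jitter the abstract refers to; smoothing the scores across frames before taking the top k keeps the selection stable.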

2605.17602 2026-05-22 cs.AI cs.CV cs.LG 版本更新

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I: 一种用于文本到图像对齐的鲁棒基于规则的奖励模型

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

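AI summary: This paper proposes AutoRubric-T2I, the first rubric learning framework for text-to-image generation, which automatically synthesizes and selects explicit rubrics to guide vision-language model (VLM) judges. It synthesizes reasoning traces from preference pairs into candidate rubrics, has a VLM judge score paired images under each rubric, and uses the resulting pairwise rubric-score differences for preference learning. An ℓ1-regularized logistic regression refiner removes noisy and redundant rubrics, yielding high-quality, interpretable reward signals from a small amount of annotated preference data and outperforming existing reward model baselines on several image reward benchmarks.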

Comments 27 pages

详情
AI中文摘要

将文本到图像(T2I)生成模型与人类偏好对齐越来越依赖于图像奖励模型,这些模型根据提示对齐和感知质量对生成图像进行评分或排序。现有的奖励模型通常在大规模人类偏好语料上训练为Bradley-Terry(BT)偏好模型,这使得训练成本高、适应困难且评估标准不透明。同时,视觉语言模型(VLM)法官可以通过文本评分规则提供更细致的评估,但其手动设计或启发式生成的评分规则可能无法可靠地反映人类偏好。在本文中,我们提出AutoRubric-T2I,这是首个用于T2I的规则学习框架,能够自动合成和选择显式规则以指导VLM法官。AutoRubric-T2I首先通过合成偏好对的推理轨迹生成候选规则,然后利用VLM法官在每种规则下对配对图像进行评分,产生配对规则评分差异用于偏好学习。为了去除噪声和冗余规则,我们进一步采用ℓ1正则化逻辑回归精简器,选择Top-N最判别性的规则。广泛评估表明,AutoRubric-T2I在使用不到0.01%的标注偏好数据的情况下,能够生成高质量、可解释的奖励信号,大幅减少了大规模奖励模型训练的需求。在图像奖励基准如MMRB2上,AutoRubric-T2I优于强奖励模型基线。我们进一步验证AutoRubric-T2I作为强化学习奖励在下游T2I任务中的效果,包括TIIF和UniGenBench++,其中它通过流-GRPO管道在扩散模型上提升了生成质量,优于标量奖励模型。

英文摘要

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ an $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.

2605.16923 2026-05-22 cs.CV 版本更新

Neuroscience-inspired Staged Representation Learning with Disentangled Coarse- and Fine-Grained Semantics for EEG Visual Decoding

受神经科学启发的分阶段表征学习:解纠缠的粗粒度和细粒度语义用于EEG视觉解码

Xiang Gao, Hui Tian, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

发表机构 * School of Information and Communication Technology, Griffith University(信息与通信技术学院,格里菲斯大学)

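AI summary: This paper proposes a neuroscience-inspired staged representation learning framework that improves EEG visual decoding by disentangling coarse- and fine-grained semantics, addressing existing methods' neglect of the staged and hierarchical nature of human visual processing.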

Comments 17 pages, 5 figures

详情
AI中文摘要

从电生理图(EEG)信号解码视觉信息仍然是脑机接口和医疗康复中的基本挑战。现有的EEG视觉解码方法主要集中在学习一个单一的全局EEG嵌入以实现跨模态对齐,但它们大多忽略了人类视觉处理的分阶段和层次特性。为了解决这一限制,我们提出了一种受神经科学启发的分阶段表征学习框架,将EEG视觉解码重新表述为一个阶段特定的表征分解问题。所提出的框架将EEG表征学习分为三个互补的阶段:低级视觉表征学习、高级语义表征学习和整合信息融合。为了加强语义建模,我们进一步引入了一种多模态双级语义学习机制,将粗标签级别的语义与细图像级别的视觉-语义信息分开。此外,引入了语义潜在通道作为从观察到的视觉EEG信号生成的计算表征通道,扩展了通道级别的语义表征空间以实现结构化的语义抽象和跨模态对齐。在THINGS-EEG基准上的大量实验表明,所提出的方法在受试者依赖的零样本评估中表现优异,并在受试者独立的零样本评估中实现了改进的精确检索。此外,包括逐层检索、时间累积、扩展多图像检索和消融研究的额外分析进一步支持了分阶段分解和结构化语义建模的有效性。这些结果表明,显式建模分阶段的感知、语义和整合表征提供了一种有效的受神经科学启发的EEG视觉解码框架。

英文摘要

Decoding visual information from electroencephalography (EEG) signals remains a fundamental challenge in brain-computer interfaces and medical rehabilitation. Existing EEG visual decoding methods mainly focus on learning a single global EEG embedding for cross-modal alignment, but they largely overlook the staged and hierarchical characteristics of human visual processing. To address this limitation, we propose a neuroscience-inspired staged representation learning framework that reformulates EEG visual decoding as a stage-specific representation decomposition problem. The proposed framework organizes EEG representation learning into three complementary phases: low-level visual representation learning, high-level semantic representation learning, and integrative information fusion. To strengthen semantic modeling, we further introduce a multimodal dual-level semantic learning mechanism that separates coarse label-level semantics from fine image-level visual-semantic information. In addition, semantic latent channels are introduced as computational representation channels generated from observed visual EEG signals, expanding the channel-level semantic representation space for structured semantic abstraction and cross-modal alignment. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves superior performance under subject-dependent zero-shot evaluation and improved exact retrieval under subject-independent zero-shot evaluation. Additional analyses, including layer-wise retrieval, temporal accumulation, expanded multi-image retrieval, and ablation studies, further support the effectiveness of staged decomposition and structured semantic modeling. These results suggest that explicitly modeling staged perceptual, semantic, and integrative representations provides an effective neuroscience-inspired framework for EEG-based visual decoding.

2605.16579 2026-05-22 cs.CV cs.LG 版本更新

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

局部关注,线性记忆:线性注意力作为跨帧记忆用于自回归视频扩散

Kunyang Li, Mubarak Shah, Yuzhang Shang

发表机构 * Institute of Artificial Intelligence, University of Central Florida(中央佛罗里达大学人工智能研究所)

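AI summary: This paper proposes ARL2, a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. It addresses the scalability bottleneck of autoregressive video diffusion models in long video generation, achieving linear time complexity and constant memory while improving temporal consistency.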

详情
AI中文摘要

自回归(AR)视频扩散是一种强大的视频生成范式,用于流式和交互式视频生成。然而,其依赖于softmax自注意力机制导致序列长度的二次计算复杂度和内存使用,由于键值缓存,限制了其扩展到长视频时间范围的能力。现有的解决方案(例如稀疏注意力和KV缓存压缩)降低了每步成本,但仍依赖于线性增长的缓存或不可逆地丢弃过去上下文,因此无法解决线性内存增长和流式上下文管理问题。为了解决这一可扩展性瓶颈,我们提出了ARL2(局部关注,线性记忆),一种混合注意力模块,通过将二次跨帧注意力替换为固定大小的递归状态。我们将自注意力分解为两个分支:一个用于空间细节和局部依赖的帧内softmax分支,以及一个用于维护固定大小状态以流式管理上下文的帧间门控线性分支。我们的关键见解是softmax注意力捕捉细粒度的局部交互,而递归状态提供可控的长程记忆。这种设计实现了线性时间复杂度和常数内存消耗,同时在全softmax模型上提高了时间一致性。为防止噪声中间状态破坏记忆,我们只在去噪步骤后更新递归状态。为了避免帧内信息不对称,所有token共享相同的预更新状态,而不是按顺序更新。据我们所知,这是首次将预训练的AR视频扩散模型转换为混合线性注意力架构的工作,通过一种高效的两阶段训练方案实现AR视频的训练。在75%的层被替换为混合线性注意力的情况下,模型实现了高达2.26倍的时钟时间加速和54%的内存减少,同时保持与改进的时间一致性相当的质量。

英文摘要

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and growing memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to a 2.26× wall-clock speedup and a 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
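The inter-frame branch can be pictured as a fixed-size matrix that every token of the current frame reads from before the frame's own keys and values are folded in. The sketch below shows only that read-then-update pattern. The scalar `gate`, the absence of feature maps and normalization, and all function names are assumptions rather than details from the abstract.

```python
def read(state, query):
    """Linear-attention read: output = query @ state, with state of shape d_k x d_v."""
    d_v = len(state[0])
    return [
        sum(query[a] * state[a][b] for a in range(len(query)))
        for b in range(d_v)
    ]


def update(state, keys, values, gate=0.9):
    """Decay the old state, then add the outer product key^T value for each token."""
    new_state = [[gate * x for x in row] for row in state]
    for key, value in zip(keys, values):
        for a, k_a in enumerate(key):
            for b, v_b in enumerate(value):
                new_state[a][b] += k_a * v_b
    return new_state


def process_frame(state, queries, keys, values, gate=0.9):
    """All tokens of a frame read the same pre-update state; the state is updated once.

    In a diffusion loop, the caller would invoke the update only after the final
    denoising pass of the frame (assumption about how this sketch is used).
    """
    outputs = [read(state, q) for q in queries]
    return outputs, update(state, keys, values, gate)
```

The state keeps the same `d_k x d_v` shape no matter how many frames have been processed, which is where the constant-memory property comes from.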

2605.16258 2026-05-22 cs.CV cs.AI cs.RO 版本更新

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT:隐式视觉几何变换器用于神经场景表示

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

发表机构 * Intelligent Vision Group, Tsinghua University(清华大学智能视觉组)

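AI summary: This paper proposes IVGT, an implicit visual geometry transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. The resulting neural scene representation supports continuous spatial queries at arbitrary 3D positions to predict signed distance and color, and performs well across multiple tasks.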

Comments Code: https://github.com/wzzheng/IVGT/

详情
AI中文摘要

从未经姿态的多视角图像中重建一致的3D几何和外观是计算机视觉中的基础但具有挑战性的问题。现有的视觉几何基础模型通常通过回归像素对齐的点图来预测显式几何,常常面临冗余和几何连续性有限的问题。我们提出了IVGT,一种隐式视觉几何变换器,能够从无姿态的多视角图像中隐式建模连续且一致的几何。这种形式在规范坐标系中学习了连续的神经场景表示,并支持在任意3D位置进行连续空间查询,通过轻量级解码器检索局部特征,以预测签名距离(SDF)值和颜色。它允许直接提取连续且一致的表面几何,从而能够从任意视角渲染RGB图像、深度图和表面法线图。我们通过多数据集联合优化进行训练,结合2D监督和3D几何正则化。IVGT在不同场景中表现出良好的泛化能力,并在多种任务中实现了优异的性能,包括网格和点云重建、新视角合成、深度和表面法线估计以及相机姿态估计。

英文摘要

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at arbitrary 3D positions, retrieving local features to predict signed distance function (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT generalizes across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2605.12623 2026-05-22 cs.CL cs.CV cs.LG 版本更新

DocAtlas: Multilingual Document Understanding Across 80+ Languages

DocAtlas: 跨80多种语言的多语言文档理解

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

Affiliations * MBZUAI (Mohamed bin Zayed University of Artificial Intelligence), IBM Research

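AI summary: This paper proposes DocAtlas, a framework that builds high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Its dual pipelines produce precise structural annotations, and the paper shows that Direct Preference Optimization is effective for multilingual adaptation, improving both in-domain and out-of-domain accuracy.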

Comments Under submission

详情
AI中文摘要

多语言文档理解在低资源语言中受限于稀缺的训练数据和基于模型的标注流程,这些流程会加剧现有偏见。我们引入DocAtlas,一个构建覆盖82种语言和9个评估任务的高保真OCR数据集和基准测试的框架。我们的双重管道,包括本地DOCX文档的差异渲染和针对从右到左脚本的合成LaTeX生成,生成统一的DocTag格式注解,编码布局、文本和组件类型,无需学习模型进行核心注解。评估16种最先进的模型揭示了低资源脚本中的持续差距。我们展示直接偏好优化(DPO)使用渲染派生的真实情况作为正信号,实现了稳定的多语言适应,提高了领域内(+1.9%)和领域外(+1.8%)的准确性,而监督微调会导致领域外性能下降高达21%。我们的最佳变体,DocAtlas-DeepSeek,在最强基线基础上提高了+1.7%。代码可在https://github.com/ahmedheakl/DocAtlas获取。

英文摘要

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.
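For readers unfamiliar with DPO, the standard per-pair objective is sketched below. The abstract states only that the rendering-derived ground truth serves as the positive (chosen) sample; what serves as the rejected sample is not stated there, so the argument names and the `beta` default here are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

The loss equals log 2 when the policy and the reference agree on both responses, and it decreases as the policy raises the chosen response relative to the rejected one.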

2605.07287 2026-05-22 cs.CV 版本更新

SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

SplatWeaver: 学习分配高斯原语以实现可泛化的新型视角合成

Yecong Wan, Fan Li, Mingwen Shao, Wangmeng Zuo

发表机构 * Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院) Zhengzhou Advanced Research Institute of Harbin Institute of Technology(哈尔滨工业大学郑州先进研究院) Huawei Noah’s Ark Lab(华为诺亚实验室) Artificial Intelligence Research Institute, Shenzhen University of Advanced Technology(深圳先进技术大学人工智能研究院)

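AI summary: This paper proposes SplatWeaver, a framework for generalizable novel view synthesis that dynamically allocates Gaussian primitives, addressing the wasted resources and insufficient expressiveness caused by fixed allocation in conventional methods.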

Comments Project Page: https://yecongwan.github.io/SplatWeaver/

详情
AI中文摘要

可泛化的新型视角合成旨在从未经校准的输入图像中渲染未见过的视角,而无需每个场景的优化。最近基于3D高斯点划的前馈方法在效率和渲染质量上取得了显著进展。然而,大多数方法将固定数量的高斯分布分配给每个像素或体素,忽略了现实场景中空间变化的复杂性。这种均匀分配通常在平滑区域浪费高斯原语,而在细结构、复杂几何和高频细节方面提供不足的容量。这促使我们预测区域依赖的原语数量,而不是在所有地方施加固定原语预算,从而实现更具表达力的3D场景表示。因此,我们提出SplatWeaver,一个能够动态分配高斯原语的可泛化新型视角合成框架。具体而言,SplatWeaver引入了基数高斯专家和像素级路由方案,其中每个专家专门生成从0到M的特定数量的原语,路由方案协调这些专家以适应性地确定每个空间位置应分配多少高斯原语。此外,SplatWeaver结合了高频先验和相关的指导模块和路由正则化,以稳定专家选择并促进复杂度感知的分配。通过利用高频线索,路由过程被鼓励将更多的高斯原语分配给细结构和纹理区域,同时抑制平滑区域的冗余。在多样化的场景中进行的广泛实验表明,SplatWeaver在大多数情况下都优于最先进的方法,能够以更少的高斯原语生成更逼真的新型视角渲染。项目页面:https://yecongwan.github.io/SplatWeaver/

英文摘要

Generalizable novel view synthesis aims to render unseen views from uncalibrated input images without requiring per-scene optimization. Recent feed-forward approaches based on 3D Gaussian Splatting have achieved promising efficiency and rendering quality. However, most of them assign a fixed number of Gaussians to each pixel or voxel, ignoring the spatially varying complexity of real-world scenes. Such uniform allocation often wastes Gaussian primitives in smooth regions while providing insufficient capacity for fine structures, complex geometry, and high-frequency details. This motivates us to predict region-dependent primitive cardinalities rather than impose a fixed primitive budget everywhere, enabling a more expressive 3D scene representation. Therefore, we propose SplatWeaver, a generalizable novel view synthesis framework that dynamically allocates Gaussian primitives over different regions in a feed-forward manner. Specifically, SplatWeaver introduces cardinality Gaussian experts and a pixel-level routing scheme, wherein each expert specializes in producing a specific number of primitives from 0 to M, and the routing scheme coordinates these experts to adaptively determine how many Gaussian primitives should be allocated to each spatial location. Moreover, SplatWeaver incorporates a high-frequency prior with an attendant guidance module and routing regularization to stabilize expert selection and promote complexity-aware allocation. By leveraging high-frequency cues, the routing process is encouraged to assign more Gaussian primitives to fine structures and textured regions, while suppressing redundancy in smooth areas. Extensive experiments across diverse scenarios show that SplatWeaver consistently outperforms state-of-the-art methods, delivering more faithful novel-view renderings with fewer Gaussian primitives. Project Page: https://yecongwan.github.io/SplatWeaver/

2605.05765 2026-05-22 cs.CV 版本更新

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

X-OmniClaw 技术报告:一种统一的移动代理用于多模态理解和交互

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou, Yanhao Zhang, Peng Liu, Qi Qi, Quanlong Zheng, Qi Wu, Zhenyi Liao, Binqiang Pan, Haobo Ji, Haonan Lu

发表机构 * OPPO AI Center(OPPO人工智能中心)

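AI summary: This paper presents X-OmniClaw, a unified mobile agent for multimodal understanding and interaction in the Android ecosystem. Its unified architecture of perception, memory, and action improves contextual awareness on complex mobile tasks, and demonstrations show gains in interaction efficiency and task reliability.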

Comments 12 pages, 7 figures

详情
AI中文摘要

受OpenClaw发展启发,随着对能够处理复杂和直观交互的移动个人代理需求增加,本文介绍了X-OmniClaw,一种专为Android生态系统设计的统一移动代理,用于多模态理解和交互。该统一架构的感知、记忆和行动模块使代理能够通过高上下文感知处理复杂移动任务。具体而言,Omni Perception提供了一个统一的多模态输入管道,整合UI状态、现实世界视觉上下文和语音输入,利用时间对齐模块将原始数据分解为结构化的多模态意图表示。Omni Memory利用多模态记忆优化来增强个性化智能,通过整合运行时工作记忆与从本地数据中提取的长期个人记忆,实现高度上下文感知和个性化的交互。最后,Omni Action采用混合接地策略,结合结构性XML元数据与视觉感知以实现稳健的交互。通过行为克隆和轨迹回放,系统捕获用户导航作为可重用的技能,实现精确的直接访问执行。在多样化的场景中展示表明,X-OmniClaw有效提高了交互效率和任务可靠性,为下一代移动原生个人助手提供了实用的架构蓝图。

英文摘要

Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

2605.01466 2026-05-22 cs.CV cs.LG 版本更新

SplAttN: Bridging 2D and 3D with Gaussian Soft Splatting and Attention for Point Cloud Completion

SplAttN: 通过高斯软溅射和注意力在2D和3D之间架桥以实现点云补全

Zhaoyang Li, Zhichao You, Tianrui Li

Affiliations * School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China

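AI summary: This paper proposes SplAttN, which uses Gaussian soft splatting and attention to connect the 2D and 3D modalities in point cloud completion. It mitigates the cross-modal entropy collapse caused by conventional hard projection and makes the cross-modal connection easier to learn.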

Comments Accepted as a Spotlight paper at ICML 2026; camera-ready version

详情
AI中文摘要

尽管多模态学习在点云补全方面取得了进展,但理论机制仍不明确。最近的研究将成功归因于模态间的联系,但我们发现标准硬投影破坏了这种联系:将稀疏点云投影到图像平面会产生极稀疏的支持,阻碍视觉先验传播,这种失败模式我们称为跨模态熵塌陷。为解决这一实际限制,我们提出了SplAttN,用可微高斯溅射替代硬投影,生成密集的连续图像平面表示。通过将投影重新公式化为连续密度估计,SplAttN避免了塌陷的稀疏支持,促进了梯度流动,并提高了跨模态连接的学习能力。广泛的实验表明,SplAttN在PCN和ShapeNet-55/34上实现了最先进的性能。关键的是,我们利用现实世界的KITTI基准作为多模态依赖的应力测试。反事实评估显示,尽管基线退化为对视觉移除不敏感的单模态模板检索器,SplAttN仍能保持对视觉线索的稳健依赖,验证了我们的方法建立了有效的跨模态连接。代码可在https://github.com/zay002/SplAttN获取。

英文摘要

Although multi-modal learning has advanced point cloud completion, the theoretical mechanisms remain unclear. Recent works attribute success to the connection between modalities, yet we identify that standard hard projection severs this connection: projecting a sparse point cloud onto the image plane yields an extremely sparse support, which hinders visual prior propagation, a failure mode we term Cross-Modal Entropy Collapse. To address this practical limitation, we propose SplAttN, which replaces hard projection with Differentiable Gaussian Splatting to produce a dense, continuous image-plane representation. By reformulating projection as continuous density estimation, SplAttN avoids collapsed sparse support, facilitates gradient flow, and improves cross-modal connection learnability. Extensive experiments show that SplAttN achieves state-of-the-art performance on PCN and ShapeNet-55/34. Crucially, we utilize the real-world KITTI benchmark as a stress test for multi-modal reliance. Counter-factual evaluation reveals that while baselines degenerate into unimodal template retrievers insensitive to visual removal, SplAttN maintains a robust dependency on visual cues, validating that our method establishes an effective cross-modal connection. Code is available at https://github.com/zay002/SplAttN.
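The contrast between hard projection and Gaussian soft splatting can be seen on a toy image grid. The sketch below assumes the points are already projected to 2-D pixel coordinates and uses an isotropic kernel with a hand-picked `sigma`; it is an illustration of the general idea, not the SplAttN rasterizer, and the function names are hypothetical.

```python
import math


def hard_project(points, size):
    """Mark only the nearest pixel for each projected point."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < size and 0 <= col < size:
            image[row][col] = 1.0
    return image


def soft_splat(points, size, sigma=1.0):
    """Accumulate an isotropic Gaussian footprint for each projected point."""
    image = [[0.0] * size for _ in range(size)]
    for x, y in points:
        for row in range(size):
            for col in range(size):
                dist_sq = (col - x) ** 2 + (row - y) ** 2
                image[row][col] += math.exp(-dist_sq / (2.0 * sigma ** 2))
    return image


def support(image, threshold=1e-3):
    """Number of pixels that carry a non-negligible value."""
    return sum(1 for row in image for value in row if value > threshold)
```

For two points on an 8x8 grid, hard projection leaves only two active pixels, whereas the soft splat spreads each point over its neighbourhood. Because the Gaussian weight is a smooth function of the point position, gradients with respect to the coordinates are also defined, which the rounding step in hard projection prevents.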

2604.24762 2026-05-22 cs.CV 版本更新

OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer

OmniShotCut: 以-shot查询Transformer实现整体关系性shot边界检测

Boyang Wang, Guangyi Xu, Jiahui Zhang, Zhipeng Tang, Zezhou Cheng

发表机构 * University of Virginia(弗吉尼亚大学) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

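AI summary: This paper proposes OmniShotCut, which formulates shot boundary detection as structured relational prediction using a shot-query-based dense video Transformer that jointly estimates intra-shot and inter-shot relations. It targets the shortcomings of existing methods: non-interpretable boundaries, missed subtle yet harmful discontinuities, and reliance on noisy, low-diversity annotations and outdated benchmarks.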

详情
AI中文摘要

Shot Boundary Detection (SBD)旨在自动识别shot变化并将视频划分为连贯的shot。尽管SBD在文献中被广泛研究,现有方法往往在转换处产生不可解释的边界,错过细微但有害的断点,并依赖于噪声大、低多样性的标注和过时的基准。为缓解这些限制,我们提出OmniShotCut,将SBD建模为结构化关系预测,通过shot查询基于的密集视频Transformer,联合估计shot范围、shot内关系和shot间关系。为避免不精确的手动标注,我们采用完全合成的过渡合成管道,自动重现主要过渡家族并精确生成参数化变体。我们还引入OmniShotCutBench,一个现代宽领域基准,能够实现整体和诊断评估。在基准上的实验展示了我们方法的有效性和通用性。

英文摘要

Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD has been widely studied in the literature, existing methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction, jointly estimating shot ranges with intra-shot and inter-shot relations using a shot-query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition synthesis pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation. Experiments on the benchmarks demonstrate the effectiveness and generality of our method.

2604.17623 2026-05-22 cs.CV cs.GR 版本更新

ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

ViPS: 为自动绑定网格的视频感知姿态空间

Honglin Chen, Karran Pandey, Rundi Wu, Matheus Gadelha, Yannick Hold-Geoffroy, Ayush Tewari, Niloy J. Mitra, Changxi Zheng, Paul Guerrero

发表机构 * Columbia University(哥伦比亚大学) University of Toronto(多伦多大学) Adobe Research(Adobe研究) University of Cambridge(剑桥大学) University College London(伦敦大学学院)

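AI summary: This paper proposes ViPS, a feedforward framework that discovers the distribution of valid poses for auto-rigged meshes by distilling motion priors from a video diffusion model. The resulting pose space supports sampling diverse shape variations, inverse kinematics, and animation and keyframing.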

Comments Project page: https://honglin-c.github.io/vips/

详情
AI中文摘要

运动绑定提供了一个结构化的接口来表达3D网格,但缺乏任何关联的姿态空间,即给定网格的可能关节配置的显式表示。没有这样的姿态空间,随机采样或手动操作原始绑定参数很容易导致语义和/或几何违规,例如解剖学超伸展和非物理自相交。我们提出了Video-informed Pose Spaces (ViPS),一种前馈框架,通过从预训练的视频扩散模型中提取运动先验,发现自动绑定网格有效姿态的潜在分布。与现有方法依赖稀缺的艺术家编写的4D数据集或专注于重建单个运动实例不同,ViPS将生成视频模型的先验转移到给定绑定参数化的通用分布中。应用于皮肤网格的可微几何验证器在不需手动调节器的情况下强制执行形状特定的完整性。我们的前馈模型揭示了平滑、紧凑且可控的姿态空间。这反过来支持了对多样形状变化的采样、逆向运动学的流形投影以及动画和关键帧的时序一致轨迹。此外,提取的3D姿态样本作为语义代理指导视频扩散,有效地闭合了生成2D先验和结构化3D运动控制之间的循环。我们的评估显示,仅使用视频先验训练的ViPS在合理性和多样性方面与基于合成艺术家创建的4D数据训练的最新模型表现相当。此外,作为通用模型,ViPS在分布外物种和未见骨骼拓扑上表现出鲁棒的零样本泛化能力。

英文摘要

Kinematic rigs provide a structured interface for articulating 3D meshes but lack any associated pose space, i.e., an explicit representation of the plausible manifold of joint configurations for a given mesh. Without such a pose space, stochastic sampling or manual manipulation of raw rig parameters easily results in semantic and/or geometric violations, such as anatomical hyperextension and non-physical self-intersections. We propose Video-informed Pose Spaces (ViPS), a feedforward framework that discovers the latent distribution of valid articulations for auto-rigged meshes by distilling motion priors from a pretrained video diffusion model. Unlike existing methods that rely on scarce, artist-authored 4D datasets, or focus on reconstructing instances of individual motions, ViPS transfers generative video model priors into a universal distribution over the given rig parameterization. Differentiable geometric validators applied to the skinned mesh enforce shape-specific integrity without requiring manual regularizers. Our feedforward model reveals a smooth, compact, and controllable pose space. This, in turn, supports sampling for diverse shape variations, manifold projection for inverse kinematics, and temporally coherent trajectories for animation and keyframing. Further, the distilled 3D pose samples serve as semantic proxies to guide video diffusion, effectively closing the loop between generative 2D priors and structured 3D kinematic control. Our evaluations show that ViPS, trained solely using video priors, matches the performance of state-of-the-art models trained on synthetic artist-created 4D data in both plausibility and diversity. Additionally, as a universal model, ViPS exhibits robust zero-shot generalization to out-of-distribution species and unseen skeletal topologies.

2604.15003 2026-05-22 cs.CV 版本更新

Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

流之真相:面向图像到视频生成的主动时间鉴伪

Yuzhuo Chen, Zehua Ma, Han Fang, Hengyi Wang, Guanjie Wang, Weiming Zhang

发表机构 * Anhui Province Key Laboratory of Digital Security (School of Cyber Science and Technology, University of Science and Technology of China)(安徽省数字安全重点实验室(网络安全学院,中国科学技术大学))

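AI summary This paper proposes a proactive temporal forensics method for image-to-video generation that traces how pixels flow and transform through the video, addressing the shortcomings of traditional spatial forensics along the temporal dimension.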

详情
AI中文摘要

图像到视频(I2V)生成的迅速发展使单张图像可以生成逼真的视频,但也带来了新的鉴伪需求。与静态图像不同,I2V内容随时间演变,要求鉴伪方法超越二维像素级篡改定位,追踪像素在视频中的流动和变换。随着帧数增加,嵌入的痕迹会漂移和变形,使传统空间鉴伪失效。为应对这一未探索的维度,我们提出了**Flow of Truth**,首个专注于I2V生成中时间鉴伪的主动框架。关键挑战在于发现一个能够与生成过程一致演化的鉴伪特征,这本质上是一种创造性的转换而非确定性重建。尽管存在这种内在困难,我们创新性地将视频生成重新定义为*像素随时间的运动而非帧的合成*。基于这一观点,我们提出了一种可学习的鉴伪模板,追踪像素运动,并提出一个模板引导的流模块,将运动与图像内容解耦,实现稳健的时间追踪。实验表明,Flow of Truth在商业和开源I2V模型上均表现出色,显著提升了时间鉴伪性能。

英文摘要

The rapid rise of image-to-video (I2V) generation enables realistic videos to be created from a single image but also brings new forensic demands. Unlike static images, I2V content evolves over time, requiring forensics to move beyond 2D pixel-level tampering localization toward tracing how pixels flow and transform throughout the video. As frames progress, embedded traces drift and deform, making traditional spatial forensics ineffective. To address this unexplored dimension, we present **Flow of Truth**, the first proactive framework focusing on temporal forensics in I2V generation. A key challenge lies in discovering a forensic signature that can evolve consistently with the generation process, which is inherently a creative transformation rather than a deterministic reconstruction. Despite this intrinsic difficulty, we innovatively redefine video generation as *the motion of pixels through time rather than the synthesis of frames*. Building on this view, we propose a learnable forensic template that follows pixel motion and a template-guided flow module that decouples motion from image content, enabling robust temporal tracing. Experiments show that Flow of Truth generalizes across commercial and open-source I2V models, substantially improving temporal forensics performance.
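The idea of a template that "follows pixel motion" can be illustrated with a plain flow-based warp. This is only a sketch under strong assumptions: the abstract does not describe how the template is moved, so the nearest-neighbour backward warp, the `(dx, dy)` flow layout, and the function name `warp` are all hypothetical choices made for illustration.

```python
import numpy as np

def warp(template, flow):
    """Backward-warp a 2D template by a per-pixel flow of shape (H, W, 2) holding (dx, dy)."""
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each output pixel looks up the source location it came from (nearest neighbour).
    src_x = np.rint(xs - flow[..., 0]).astype(int)
    src_y = np.rint(ys - flow[..., 1]).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(template)
    out[valid] = template[src_y[valid], src_x[valid]]
    return out

# A single marked pixel in the template, and a uniform motion of 2 px right, 1 px down.
template = np.zeros((6, 6))
template[1, 1] = 1.0
flow = np.zeros((6, 6, 2))
flow[..., 0] = 2.0
flow[..., 1] = 1.0
moved = warp(template, flow)
```

With a uniform flow the marked pixel simply shifts with the motion; a real I2V video would need a dense, per-frame flow estimate in place of the constant field used here.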

2602.17186 2026-05-22 cs.CV 版本更新

Focusing Where Vision Matters: Selective Training for Large Vision Language Models via Visual Information Gain

聚焦视觉关键点:通过视觉信息增益进行大视觉语言模型的定向训练

Seulbi Lee, Sangheum Hwang

发表机构 * Department of Data Science, Seoul National University of Science and Technology(数据科学系,首尔科学技术大学)

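AI summary This paper proposes selective training of large vision language models guided by the Visual Information Gain (VIG) metric, improving visual grounding and reducing language bias by prioritizing high-VIG samples and tokens.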

Comments Accepted at ICML 2026

详情
AI中文摘要

大视觉语言模型(LVLMs)已取得显著进展,但它们常常受到语言偏见的影响,产生答案时往往不依赖视觉证据。尽管先前工作试图通过解码策略、架构修改或精心挑选的指令数据来缓解这一问题,但它们通常缺乏对单个训练样本或token实际从图像中获益程度的定量衡量。在本工作中,我们引入了视觉信息增益(VIG),一种基于困惑度的度量指标,用于衡量视觉输入对预测不确定性的减少。VIG能够在样本和token层面进行细粒度分析,有效突出视觉基础元素,如颜色、空间关系和属性。借助这一指标,我们提出了一种VIG引导的定向训练方案,优先选择高VIG样本和token。这种方法提高了视觉基础性并减轻了语言偏见,通过专注于仅视觉信息丰富的样本和token,实现了显著减少监督下的优越性能。

英文摘要

Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
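A perplexity-based gain of this kind can be sketched from per-token log-probabilities. The sketch assumes that the gain for a token is the increase in its log-probability when the image is supplied, and that the sample-level score is the mean of those gains (equivalently, the log of the text-only perplexity over the with-image perplexity). The paper's exact definition may differ, and the function names and the threshold rule are hypothetical.

```python
import math

def token_vig(logp_with_image, logp_text_only):
    """Per-token gain: how much the image raises each token's log-probability."""
    return [w - t for w, t in zip(logp_with_image, logp_text_only)]

def sample_vig(logp_with_image, logp_text_only):
    """Sample-level gain: mean token gain, i.e. log(PPL_text_only / PPL_with_image)."""
    gains = token_vig(logp_with_image, logp_text_only)
    return sum(gains) / len(gains)

def select_tokens(gains, threshold):
    """Indices of tokens whose gain exceeds the threshold (hypothetical selection rule)."""
    return [i for i, g in enumerate(gains) if g > threshold]

# Toy log-probabilities for the answer "the red ball": only "red" depends on the image.
with_img = [math.log(0.9), math.log(0.8), math.log(0.9)]
text_only = [math.log(0.9), math.log(0.1), math.log(0.9)]
gains = token_vig(with_img, text_only)
```

In this toy case only the colour word has a positive gain, which is the kind of visually grounded token the abstract says VIG highlights.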

2602.13294 2026-05-22 cs.CV cs.AI 版本更新

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

VisPhyWorld: 通过代码驱动的视频重建探测物理推理

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

发表机构 * University of Waterloo(滑铁卢大学) Autodesk AI Lab(Autodesk人工智能实验室) Independent Researcher(独立研究者)

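AI summary This paper proposes VisPhyWorld, a framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations, and introduces the VisPhyBench benchmark to test how well models reconstruct appearance and simulate physical motion; it finds that state-of-the-art MLLMs struggle to infer physical parameters accurately and to simulate consistent physical dynamics.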

详情
AI中文摘要

评估多模态大语言模型(MLLMs)是否真正理解物理动态仍然具有挑战性。现有的基准测试大多依赖于识别式协议,如视觉问答(VQA)和期望违反(VoE),这些协议通常可以在不承诺明确、可测试的物理假设的情况下回答。我们提出了VisPhyWorld,一个基于执行的框架,通过要求模型从视觉观察生成可执行的模拟器代码来评估物理推理能力。通过生成可运行的代码,推断的世界表示可以直接检查、编辑和证伪。这将物理推理与渲染分开。基于此框架,我们引入了VisPhyBench,包含209个评估场景,这些场景源自108个物理模板,以及一个系统化的协议,用于评估模型在重建外观和再现物理合理的运动方面的能力。在触发回退之前,我们的流水线在97.7%的基准运行中生成了有效的重建视频。实验表明,尽管最先进的MLLM在语义场景理解方面表现强劲,但在准确推断物理参数和模拟一致的物理动态方面存在困难。我们的代码可在https://github.com/TIGER-AI-Lab/VisPhyWorld上获得。

英文摘要

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld.

2602.10062 2026-05-22 cs.LG cs.CV 版本更新

Vendi Novelty Scores for Out-of-Distribution Detection

Amey P. Pasarkar, Adji Bousso Dieng

发表机构 * Lewis-Sigler Institute For Integrative Genomics, Princeton University(普林斯顿大学整合基因组学研究所) Department of Computer Science, Princeton University(普林斯顿大学计算机科学系)

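AI summary This paper proposes the Vendi Novelty Score (VNS), a method based on Vendi Scores that approaches out-of-distribution detection from a diversity perspective; it requires no density modeling, is linear-time and non-parametric, and achieves state-of-the-art OOD detection performance on multiple image classification benchmarks.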

详情
英文摘要

Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.
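The "increase in Vendi Score" idea can be sketched naively. The Vendi Score of n samples is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is a similarity matrix with unit diagonal. The novelty function below is an assumption for illustration (a plain difference of scores with a cosine kernel); it uses a full eigendecomposition and so is not the linear-time estimator the abstract refers to, and the names `vendi_score` and `novelty` are hypothetical.

```python
import numpy as np

def vendi_score(features):
    """Vendi Score with a cosine-similarity kernel: exp of the eigenvalue entropy of K/n."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    k = x @ x.T / len(x)
    eig = np.linalg.eigvalsh(k)
    eig = eig[eig > 1e-12]          # drop numerically zero eigenvalues (0 * log 0 = 0)
    return float(np.exp(-np.sum(eig * np.log(eig))))

def novelty(in_dist, sample):
    """Naive novelty: how much the Vendi Score grows when the sample joins the set."""
    return vendi_score(np.vstack([in_dist, sample])) - vendi_score(in_dist)

# Three nearly identical in-distribution features, one similar and one orthogonal test sample.
in_dist = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.05, 0.0]])
near = np.array([0.95, 0.05, 0.0])
far = np.array([0.0, 0.0, 1.0])
```

A sample that points in a new direction raises the effective number of distinct items far more than one that resembles the existing set, so thresholding `novelty` gives a simple detector in this toy setting.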

2602.01851 2026-05-22 cs.CV 版本更新

How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing

模型能多好地遵循视觉指令?VIBE:一个系统性的视觉指令驱动图像编辑基准

Huanyu Zhang, Xuehai Bai, Chengzu Li, Chen Liang, Haochen Tian, Haodong Li, Ruichuan An, Yifan Zhang, Anna Korhonen, Zhang Zhang, Liang Wang, Tieniu Tan

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Science(中国科学院大学人工智能学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Language Technology Lab, University of Cambridge(剑桥大学语言技术实验室) Peking University(北京大学) South China University of Technology(华南理工大学) Nanjing University(南京大学)

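AI summary This paper proposes VIBE, a benchmark for visual instruction-driven image editing whose three-level interaction hierarchy covers deictic grounding, morphological manipulation, and causal reasoning; it finds that proprietary models show early-stage visual instruction-following capabilities and outperform open-source models, but that performance degrades as task difficulty increases.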

Comments https://vibe-benchmark.github.io/

详情
AI中文摘要

最近的生成模型在图像编辑方面取得了显著进展。然而,现有系统和基准仍然主要是文本引导的。相比之下,人类交流本质上是多模态的,视觉指令如草图能高效传达空间和结构意图。为填补这一差距,我们引入VIBE,即图像编辑的视觉指令基准,其三级交互层次捕捉了指涉 grounding、形态操作和因果推理。在这些层次中,我们精心挑选了高质量且多样的测试用例,反映了视觉指令遵循的逐步增加的复杂性。我们进一步提出一个稳健的LMM-as-a-judge评估框架,配有任务特定的指标,以实现可扩展且细致的评估。通过全面评估17个代表性的开源和专有图像编辑模型,我们发现专有模型在早期阶段展现出视觉指令遵循能力,并且一贯优于开源模型。然而,随着任务难度的增加,即使是最强的系统性能也会显著下降,这揭示了未来研究的有希望方向。

英文摘要

Recent generative models have achieved remarkable progress in image editing. However, existing systems and benchmarks remain largely text-guided. In contrast, human communication is inherently multimodal, where visual instructions such as sketches efficiently convey spatial and structural intent. To address this gap, we introduce VIBE, the Visual Instruction Benchmark for Image Editing with a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. Across these levels, we curate high-quality and diverse test cases that reflect progressively increasing complexity in visual instruction following. We further propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. Through a comprehensive evaluation of 17 representative open-source and proprietary image editing models, we find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models. However, performance degrades markedly with increasing task difficulty even for the strongest systems, highlighting promising directions for future research.

2602.01760 2026-05-22 cs.CV 版本更新

MagicFuse: Single Image Fusion for Visual and Semantic Reinforcement

MagicFuse: 单图像融合用于视觉与语义增强

Hao Zhang, Yanping Zha, Zizhuo Li, Meiqi Gong, Jiayi Ma

发表机构 * Electronic Information School, Wuhan University, China(武汉大学电子信息学院) Suzhou Institute of Wuhan University, China(武汉大学苏州研究院) School of Automation, Wuhan University, China(武汉大学自动化学院)

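AI summary This paper proposes MagicFuse, a single-image fusion framework that uses diffusion models to generate a cross-spectral scene representation under both visual and semantic constraints; experiments show performance comparable to or better than fusion methods with multi-modal inputs.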

Comments Accepted by CVPR 2026

详情
AI中文摘要

本文聚焦于一个高度实用的场景:在仅使用可见成像传感器的情况下,如何继续利用多模态图像融合的优势。为此,我们提出了一种新的单图像融合概念,将其扩展到知识层面。具体而言,我们开发了MagicFuse,一种新的单图像融合框架,能够从单个低质量可见图像中推导出全面的跨光谱场景表示。MagicFuse首先引入了基于扩散模型的内在光谱知识增强分支和跨光谱知识生成分支。它们分别挖掘在可见光谱中被掩盖的场景信息,并学习转移到红外光谱的热辐射分布模式。在此基础上,我们设计了一个多领域知识融合分支,整合这两个分支的扩散流的概率噪声,从而通过连续采样获得跨光谱场景表示。然后,我们施加了视觉和语义约束,确保该场景表示能够满足人类观察同时支持下游语义决策。大量实验表明,尽管仅依赖单个退化的可见图像,我们的MagicFuse在视觉和语义表示性能上与或优于多模态输入的最先进融合方法。代码已公开在https://github.com/zhayanping/MagicFuse。

英文摘要

This paper focuses on a highly practical scenario: how to continue benefiting from the advantages of multi-modal image fusion under harsh conditions when only visible imaging sensors are available. To achieve this goal, we propose a novel concept of single-image fusion, which extends conventional data-level fusion to the knowledge level. Specifically, we develop MagicFuse, a novel single image fusion framework capable of deriving a comprehensive cross-spectral scene representation from a single low-quality visible image. MagicFuse first introduces an intra-spectral knowledge reinforcement branch and a cross-spectral knowledge generation branch based on the diffusion models. They mine scene information obscured in the visible spectrum and learn thermal radiation distribution patterns transferred to the infrared spectrum, respectively. Building on them, we design a multi-domain knowledge fusion branch that integrates the probabilistic noise from the diffusion streams of these two branches, from which a cross-spectral scene representation can be obtained through successive sampling. Then, we impose both visual and semantic constraints to ensure that this scene representation can satisfy human observation while supporting downstream semantic decision-making. Extensive experiments show that our MagicFuse achieves visual and semantic representation performance comparable to or even better than state-of-the-art fusion methods with multi-modal inputs, despite relying solely on a single degraded visible image. The code is publicly available at https://github.com/zhayanping/MagicFuse.

2512.10719 2026-05-22 cs.CV 版本更新

SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

SpaceDrive: 在基于视觉语言模型的自动驾驶中引入空间感知

Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) University of Tübingen(图宾根大学) Tübingen AI Center(图宾根人工智能中心) TU Munich(慕尼黑工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) University of Stuttgart(斯图加特大学) UCLA(加州大学洛杉矶分校)

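AI summary This paper proposes SpaceDrive, a framework that treats spatial information as explicit positional encodings to strengthen the understanding of fine-grained 3D spatial relationships in VLM-based autonomous driving, improving planning accuracy and open-loop performance.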

详情
AI中文摘要

基于视觉语言模型(VLM)的端到端自动驾驶方法因具备通用的视觉理解和强大的推理能力而迅速发展。然而,我们发现当前VLM在理解细粒度的3D空间关系方面存在困难,这在与物理世界交互的系统中是基本要求。为了解决这一问题,我们提出了SpaceDrive,一个基于空间感知的VLM自动驾驶框架,将空间信息作为显式位置编码(PEs)而非文本数字标记,从而实现语义和空间表示的联合推理。SpaceDrive采用通用的位置编码器处理从多视角深度估计、历史自我状态和文本提示中得到的所有3D坐标。这些3D PE首先叠加到相应的2D视觉标记上,同时作为任务无关的坐标表示,取代数字形式的数值标记作为VLM的输入和输出。这种机制使模型能够更好地在空间推理中索引特定的视觉语义,并直接回归轨迹坐标而非逐位生成,从而提升规划精度。广泛的实验验证了SpaceDrive在nuScenes数据集上实现了最先进的开放环性能,并在Bench2Drive闭环基准中取得了78.02的第二好Driving Score。代码可在:https://github.com/zhenghao2519/SpaceDrive获取。

英文摘要

End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed on the corresponding 2D visual tokens to augment them. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 among existing VLM-based methods on the Bench2Drive closed-loop benchmark. Code is available at: https://github.com/zhenghao2519/SpaceDrive.
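Feeding coordinates as positional encodings rather than digit tokens can be sketched as follows. The sinusoidal form, the embedding width of 48, and the name `encode_xyz` are assumptions made for illustration; the abstract only says that a universal positional encoder is used and that the 3D PEs are superimposed on the 2D visual tokens.

```python
import numpy as np

def encode_xyz(points, dim=48, max_wavelength=1000.0):
    """Sinusoidal encoding of (N, 3) coordinates into (N, dim); dim must be divisible by 6."""
    n_freq = dim // 6
    freqs = 1.0 / max_wavelength ** (np.arange(n_freq) / n_freq)
    angles = points[:, :, None] * freqs[None, None, :]      # shape (N, 3, n_freq)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return pe.reshape(len(points), -1)

# Four visual tokens (zeros as placeholders) and the 3D point each one corresponds to.
visual_tokens = np.zeros((4, 48))
xyz = np.array([[1.0, 2.0, 0.5], [10.0, -3.0, 0.0], [25.0, 0.0, 1.2], [1.0, 2.0, 0.5]])
tokens_with_pe = visual_tokens + encode_xyz(xyz)
```

Because the encoding is a fixed function of the coordinate, two tokens at the same 3D location receive the same offset regardless of how the number would be spelled out as text.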

2510.20814 2026-05-22 cs.CV 版本更新

SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

SpectraMorph: 结构化潜在学习用于自监督超光谱超分辨率

Ritik Shah, Marco F Duarte

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

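AI summary This study proposes SpectraMorph, a physics-guided self-supervised fusion framework that achieves hyperspectral super-resolution through a structured latent space by fusing multispectral and hyperspectral images; it produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band multispectral image.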

详情
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
AI中文摘要

超光谱传感器每像素捕获密集的光谱信息,但空间分辨率低,导致边界模糊和混合像素效应。共注册的互补传感器如多光谱、RGB或全色相机提供高空间分辨率细节,推动通过超光谱与多光谱图像融合实现超光谱超分辨率。现有的基于深度学习的方法虽然性能强大,但依赖于不透明的回归器,缺乏可解释性且在多光谱图像波段很少时往往失效。我们提出了SpectraMorph,一种具有结构化潜在空间的物理指导自监督融合框架。SpectraMorph不通过直接回归,而是强制一个解混瓶颈:从低分辨率超光谱图像中提取端成员签名,并通过紧凑的多层感知机从多光谱图像预测类似丰度的地图。通过线性混合重建光谱,训练通过多光谱传感器的光谱响应函数进行自监督方式。SpectraMorph产生可解释的中间结果,训练时间短于一分钟,并且即使在单波段(全色)多光谱图像下也保持鲁棒性。在合成和真实数据集上的实验表明,SpectraMorph在自监督和无监督基线中表现一致优于最先进方法,同时在监督基线中也保持非常具有竞争力。

英文摘要

Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor's spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
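The unmixing bottleneck can be sketched with plain matrices. This is an illustration under assumptions, not the authors' implementation: the abundances are supplied directly instead of being predicted by an MLP, the L1 form of the loss is a guess, and the sizes and names (`endmembers`, `srf`, `msi_loss`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
bands_hsi, bands_msi, n_end, n_pix = 31, 3, 4, 5

# Endmember spectra (in the paper these are extracted from the low-resolution HSI).
endmembers = rng.random((n_end, bands_hsi))
# Spectral response function: each MSI band is a weighted average of HSI bands.
srf = rng.random((bands_hsi, bands_msi))
srf /= srf.sum(axis=0, keepdims=True)

def reconstruct(abundances):
    """Linear mixing: every pixel spectrum is a combination of endmember spectra."""
    return abundances @ endmembers

def msi_loss(abundances, msi_observed):
    """Self-supervised loss: project the reconstruction through the SRF, compare to the MSI."""
    return float(np.abs(reconstruct(abundances) @ srf - msi_observed).mean())

# Simulate an observed MSI from known abundances, then score two candidate abundance maps.
true_abund = rng.dirichlet(np.ones(n_end), size=n_pix)
msi = reconstruct(true_abund) @ srf
uniform_abund = np.full((n_pix, n_end), 1.0 / n_end)
```

The loss never touches a high-resolution hyperspectral target, which is what makes the scheme self-supervised: only the MSI and the SRF are needed to score a candidate set of abundances.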

2510.08759 2026-05-22 cs.CV cs.RO 版本更新

Dissecting Embodied Abilities in Multimodal Language Models through Skill-level Evaluation and Diagnosis

通过技能级评估与诊断解构多模态语言模型的具身能力

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Yizhe Zhu, Shiji Xin, Yijian Huang, Boce Hu, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Haojie Huang, Lawson L. S. Wong

发表机构 * Northeastern University, Boston, MA, USA; The Chinese University of Hong Kong, Hong Kong, China; Peking University, Beijing, China; Westlake University, Hangzhou, China; Harvard University, Cambridge, MA, USA; Purdue University, West Lafayette, IN, USA; University of Oxford, Oxford, United Kingdom

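AI summary This paper proposes the BEAR benchmark, which decomposes embodied tasks into 14 atomic skills for fine-grained evaluation, finds that perceptual capabilities are a major bottleneck behind reasoning failures, and proposes BEAR-Agent, a multimodal conversational agent that substantially improves performance on embodied skills.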

Comments Accepted to ICML 2026

详情
AI中文摘要

理解具身多模态大语言模型(MLLMs)的能力瓶颈对于改进具身代理至关重要。然而,现有具身基准主要集中在任务级评估,未能提供模型失败的潜在原因的可操作见解。为解决这一限制,我们引入BEAR,一个将具身任务分解为14个原子技能以进行细粒度技能级评估的基准。BEAR包含4,469个交错的图像-视频-文本样本,涵盖6类中的14种技能,从低级感知到高级规划。我们评估了20个MLLMs在BEAR上的表现,采用分层技能级诊断框架,并揭示了两个关键发现:(1)感知能力是推理失败的主要瓶颈,(2)当前模型存在不稳定的时间空间建模问题,这在先前基准中未被充分暴露。受这些发现启发,我们进一步提出BEAR-Agent,一个多模态对话代理,通过添加视觉和空间推理工具来增强MLLMs。BEAR-Agent在具身技能上显著提升了性能,在BEAR上相对于GPT-5基模型实现了17.5%的相对提升,同时在仿真和现实世界机器人实验中也优于强基线模型。项目页面:https://bear-official66.github.io/

英文摘要

Understanding the capability bottlenecks of embodied multimodal large language models (MLLMs) is crucial for improving embodied agents. However, existing embodied benchmarks mainly focus on task-level evaluation and fail to provide actionable insights into the underlying causes of model failures. To address this limitation, we introduce BEAR, a benchmark that decomposes embodied tasks into 14 atomic skills for fine-grained skill-level evaluation. BEAR comprises 4,469 interleaved image-video-text samples spanning 14 skills across 6 categories, ranging from low-level perception to high-level planning. We evaluate 20 MLLMs on BEAR under a hierarchical skill-level diagnosis framework and uncover two key findings: (1) perceptual capabilities are major bottlenecks behind reasoning failures, and (2) current models suffer from unstable spatiotemporal modeling that remains largely unexposed in prior benchmarks. Motivated by these findings, we further propose BEAR-Agent, a multimodal conversational agent that augments MLLMs with visual and spatial reasoning tools. BEAR-Agent substantially improves performance across embodied skills, achieving a relative improvement of 17.5% on GPT-5 over the base model on BEAR, while also outperforming strong baselines in both simulation and real-world robotic experiments. Project page: https://bear-official66.github.io/

2509.17086 2026-05-22 cs.CV 版本更新

SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks

SFN-YOLO:通过尺度感知融合网络实现自由放养禽类检测

Jie Chen, Yuhong Feng, Tao Dai, Hao Wang, Hongtao Chen, Zhaoxi He, Mingzhe Liu, Jiancong Bai

发表机构 * Shenzhen University(深圳大学) The Hong Kong University of Science and Technology(香港科技大学)

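AI summary This paper proposes SFN-YOLO, a poultry detection method that uses scale-aware fusion to improve detection in complex environments, and introduces the M-SCOPE dataset designed for free-range conditions; experiments show the model reaches 80.7% mAP with only 7.2M parameters, 35.1% fewer than the baseline, while retaining good generalization.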

详情
AI中文摘要

检测和定位禽类对于推进智能禽类养殖至关重要。尽管检测导向方法已取得进展,但在自由放养环境中仍面临多尺度目标、遮挡和复杂或动态背景带来的挑战。为解决这些问题,我们引入了一种名为SFN-YOLO的创新禽类检测方法,该方法利用尺度感知融合技术,将详细的局部特征与更广泛的全局上下文相结合,以提高复杂环境中的检测性能。此外,我们还开发了一个新的扩展数据集(M-SCOPE),专门针对多样的自由放养条件。全面的实验表明,我们的模型在仅7.2M参数的情况下实现了80.7%的mAP,比基准模型少35.1%的参数,同时在不同领域中保持了强大的泛化能力。SFN-YOLO的高效和实时检测能力支持了自动化智能禽类养殖。

英文摘要

Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate our model achieves an mAP of 80.7% with just 7.2M parameters, which is 35.1% fewer than the benchmark, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming.
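The parameter figures can be turned into an implied baseline size with one line of arithmetic. This assumes "35.1% fewer" is measured relative to the baseline's parameter count; the abstract does not state the baseline size, so the result is an inference and the function name is hypothetical.

```python
def implied_baseline_params(model_params_m, reduction_pct):
    """Baseline size (in millions) implied by 'the model has reduction_pct % fewer parameters'."""
    return model_params_m / (1.0 - reduction_pct / 100.0)

baseline_m = implied_baseline_params(7.2, 35.1)   # baseline size in millions of parameters
saved_m = baseline_m - 7.2                        # parameters removed, in millions
```

Under that reading the baseline would have roughly 11.1M parameters, so the reported model saves a little under 4M.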

2507.17640 2026-05-22 cs.CV 版本更新

Not All Starting Points Are Equal: Pre-trained Priors and Their Outsized Impact on Person Identification

并非所有起点都平等:预训练先验及其对人员识别的巨大影响

Thomas M. Metz, Matthew Q. Hill, Alice J. O'Toole

发表机构 * School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas, USA(德克萨斯大学达拉斯分校行为与脑科学学院,美国德克萨斯州理查森)

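AI summary This paper studies how pre-training methods affect person identification (re-id) tasks, finds that pre-trained weights act as a strong prior during domain adaptation, and shows that simple domain adaptation of large vision foundation models achieves SOTA results.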

详情
AI中文摘要

近年来,计算机视觉领域出现了大量多样化的通用预训练方法。然而,这些预训练方法对人员识别任务(re-id)的影响仍缺乏深入研究。我们发现,在等效域适应流程下,不同起始模型(架构和预训练权重)会产生显著不同的人员识别结果。我们指出,对不同下游性能的一系列直观解释是不足的,并提出预训练权重对域适应过程中学习到的权重起着强先验作用。在此框架下,域适应解决方案可被视为Gibbs后验的最大概率点估计,其中预训练权重充当先验。在此框架下,我们展示了使用大预训练基础模型进行简单域适应可在多个re-id数据集(Market、PRCC、DeepChange、BTS)上获得SOTA结果,且所得解在参数空间中与起始参数非常接近。此外,我们对这些解决方案进行了消融研究,发现它们可以使用小的迁移集和不同迁移数据集实现,但对优化器、权重衰减和损失函数的选择敏感。最终,我们提出直接使用大视觉基础模型(如CLIP、Dino、EVA、AIM等)进行微调的简单方法应作为未来re-id研究的重要基准。

英文摘要

Recent years have seen an explosion of diverse general purpose pre-training methodologies for computer vision. However, the impact that these pre-training methodologies have on person identification tasks (re-id) remains under-explored. We show that under equated domain adaptation pipelines, there is dramatic variance in person identification outcomes using different starting models (architectures and pre-trained weights). We show that a range of intuitive explanations for differing downstream performance on a range of re-id tests are insufficient and propose that pre-trained weights serve as a strong prior to the weights learned during domain adaptation. This framework allows for domain adapted solutions to be viewed as a maximum probability point estimate of the Gibbs posterior with the pre-trained weights acting as a prior. Under this framework, we show that large, pre-trained foundation models with simple domain adaptation achieve SOTA solutions on a range of re-id datasets (Market, PRCC, DeepChange, BTS) with solutions that are very close in the parameter space to the starting parameters. Moreover, we perform ablations on these solutions and show that they can be reached with small transfer sets and with varying transfer datasets but are sensitive to choice of optimizer, weight-decay, and loss function. Ultimately, we propose that the simple approach of direct fine-tuning using large vision foundation models (CLIP, Dino, EVA, AIM, etc.) needs to serve as an important baseline for future work in re-id.
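The prior view can be written out explicitly. As a sketch only: the Gaussian form of the prior and the symbols $\lambda$, $\sigma$, and $\theta_0$ (the pre-trained weights) are assumptions made for illustration and are not taken from the abstract. With a loss $L$ on the adaptation data $D$, a Gibbs posterior and its maximum-probability point estimate read:

```latex
p(\theta \mid D) \;\propto\; \exp\!\bigl(-\lambda\, L(\theta; D)\bigr)\,\pi(\theta),
\qquad
\pi(\theta) = \mathcal{N}\!\bigl(\theta;\ \theta_0,\ \sigma^2 I\bigr)

\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid D)
  = \arg\min_{\theta}\; \Bigl[\lambda\, L(\theta; D)
      + \tfrac{1}{2\sigma^2}\,\lVert \theta - \theta_0 \rVert_2^2\Bigr]
```

Under this assumed prior, a small $\sigma$ keeps the adapted weights near $\theta_0$, which is consistent with the abstract's observation that the solutions lie very close to the starting parameters.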

2507.13339 2026-05-22 eess.IV cs.CV 版本更新

SpectraLift: Physics-Guided Spectral-Inversion Network for Self-Supervised Hyperspectral Image Super-Resolution

SpectraLift: 一种基于物理的频谱反演网络用于自监督超分辨率高光谱图像

Ritik Shah, Marco F. Duarte

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

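AI summary This study proposes SpectraLift, a self-supervised spectral-inversion network that fuses hyperspectral and multispectral images using the multispectral sensor's spectral response function, without point spread function calibration or ground-truth high-resolution hyperspectral images, and outperforms existing methods on PSNR, SAM, SSIM, and RMSE.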

详情
Journal ref
2025 15th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)
AI中文摘要

高空间分辨率的高光谱图像(HSI)对于遥感和医学成像等应用至关重要,但HSI传感器本质上是牺牲空间细节来换取光谱丰富性。将高空间分辨率多谱段图像(HR-MSI)与低空间分辨率高光谱图像(LR-HSI)融合是恢复精细空间结构而不牺牲光谱保真度的有希望的途径。大多数最先进的HSI-MSI融合方法需要点扩散函数(PSF)校准或地面真实高分辨率HSI(HR-HSI),这两种在现实世界中都难以获得。我们提出SpectraLift,一种完全自监督的框架,利用仅MSI的光谱响应函数(SRF)融合LR-HSI和HR-MSI输入。SpectraLift通过(i)将SRF应用于LR-HSI得到的合成低空间分辨率多谱段图像(LR-MSI)作为输入,(ii)LR-HSI作为输出,以及(iii)估计与真实LR-HSI之间的ℓ₁光谱重建损失作为优化目标,训练一个轻量级的每像素多层感知机(MLP)网络。在推理时,SpectraLift使用训练好的网络将HR-MSI像素映射到HR-HSI估计。SpectraLift在几分钟内收敛,对空间模糊和分辨率不敏感,并在PSNR、SAM、SSIM和RMSE基准测试中优于现有方法。

英文摘要

High-spatial-resolution hyperspectral images (HSI) are essential for applications such as remote sensing and medical imaging, yet HSI sensors inherently trade spatial detail for spectral richness. Fusing high-spatial-resolution multispectral images (HR-MSI) with low-spatial-resolution hyperspectral images (LR-HSI) is a promising route to recover fine spatial structures without sacrificing spectral fidelity. Most state-of-the-art methods for HSI-MSI fusion demand point spread function (PSF) calibration or ground truth high resolution HSI (HR-HSI), both of which are impractical to obtain in real world settings. We present SpectraLift, a fully self-supervised framework that fuses LR-HSI and HR-MSI inputs using only the MSI's Spectral Response Function (SRF). SpectraLift trains a lightweight per-pixel multi-layer perceptron (MLP) network using (i) a synthetic low-spatial-resolution multispectral image (LR-MSI) obtained by applying the SRF to the LR-HSI as input, (ii) the LR-HSI as the output, and (iii) an ℓ1 spectral reconstruction loss between the estimated and true LR-HSI as the optimization objective. At inference, SpectraLift uses the trained network to map the HR-MSI pixel-wise into a HR-HSI estimate. SpectraLift converges in minutes, is agnostic to spatial blur and resolution, and outperforms state-of-the-art methods on PSNR, SAM, SSIM, and RMSE benchmarks.
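The three training ingredients and the inference step can be sketched as a data-flow example. The array sizes, the random SRF, the one-hidden-layer network, and its untrained random weights are all placeholders chosen for illustration; the sketch shows the shapes and the objective, not a trained model or the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hsi, n_msi, hidden = 31, 4, 16                 # HSI bands, MSI bands, hidden units

srf = rng.random((n_msi, n_hsi))
srf /= srf.sum(axis=1, keepdims=True)            # each MSI band averages HSI bands

lr_hsi = rng.random((8, 8, n_hsi))               # low-resolution hyperspectral cube
hr_msi = rng.random((32, 32, n_msi))             # high-resolution multispectral image

# (i) input: synthetic LR-MSI from applying the SRF; (ii) target: the LR-HSI itself.
y_train = lr_hsi.reshape(-1, n_hsi)
x_train = y_train @ srf.T

# Untrained per-pixel MLP with placeholder weights.
w1, b1 = rng.normal(0, 0.1, (n_msi, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(0, 0.1, (hidden, n_hsi)), np.zeros(n_hsi)

def mlp(pixels):
    """Map each MSI pixel (row) to an estimated HSI spectrum."""
    return np.maximum(pixels @ w1 + b1, 0.0) @ w2 + b2

def l1_loss(pred, target):
    """(iii) l1 spectral reconstruction loss."""
    return float(np.abs(pred - target).mean())

loss = l1_loss(mlp(x_train), y_train)            # quantity a training loop would minimize
hr_hsi = mlp(hr_msi.reshape(-1, n_msi)).reshape(32, 32, n_hsi)   # inference, pixel by pixel
```

Because the network acts on one pixel at a time, the same weights apply at any spatial resolution, which matches the abstract's claim that the method is agnostic to spatial blur and resolution.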

2506.23808 2026-05-22 cs.CV 版本更新

Towards Initialization-free Calibrated Bundle Adjustment

迈向无初始化的校准捆绑调整

Carl Olsson, Amanda Nilsson

发表机构 * Lund University(隆德大学)

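AI summary This paper proposes an initialization-free calibrated SfM method that exploits known camera calibration by introducing pairwise relative rotation estimates, yielding near-metric reconstructions and improving 3D reconstruction accuracy.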

详情
AI中文摘要

近期一系列工作表明,可以通过伪对象空间误差(pOSE)作为替代目标函数来实现无初始化的捆绑调整(BA)。初始重建步骤优化一个所有项都是射影不变的目标函数,无法纳入相机校准的知识。因此,解法仅在射影变换下确定,该过程需要更多的数据才能成功重建。相反,我们提出了一种能够利用已知相机校准的方法,从而产生近等距解,即精确到相似变换的重建。为此,我们引入了携带相机校准信息的成对相对旋转估计。这些估计仅对相似变换不变,因此鼓励保留真实场景的度量特征的解。我们的方法可以看作是将旋转平均整合到pOSE框架中,朝着无初始化校准SfM迈进。我们的实验评估表明,我们能够可靠地优化我们的目标函数,从随机起始解中以高概率收敛到全局最小值,从而产生准确的近等距重建。

英文摘要

A recent series of works has shown that initialization-free BA can be achieved using pseudo Object Space Error (pOSE) as a surrogate objective. The initial reconstruction-step optimizes an objective where all terms are projectively invariant and it cannot incorporate knowledge of the camera calibration. As a result, the solution is only determined up to a projective transformation of the scene and the process requires more data for successful reconstruction. In contrast, we present a method that is able to use the known camera calibration thereby producing near metric solutions, that is, reconstructions that are accurate up to a similarity transformation. To achieve this we introduce pairwise relative rotation estimates that carry information about camera calibration. These are only invariant to similarity transformations, thus encouraging solutions that preserve metric features of the real scene. Our method can be seen as integrating rotation averaging into the pOSE framework striving towards initialization-free calibrated SfM. Our experimental evaluation shows that we are able to reliably optimize our objective, achieving convergence to the global minimum with high probability from random starting solutions, resulting in accurate near metric reconstructions.
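The invariance claim about relative rotations can be checked numerically. The sketch assumes the camera convention x ~ R (X - c), under which a similarity transform X' = s Q X + d of the scene turns each camera rotation R into R Q^T. The non-similarity case uses an arbitrary invertible linear map as a stand-in for a general projective change; all names are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))          # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]               # force determinant +1
    return q

def relative_rotation(r_i, r_j):
    """Rotation taking camera i's frame to camera j's frame."""
    return r_j @ r_i.T

rng = np.random.default_rng(0)
r1, r2 = random_rotation(rng), random_rotation(rng)

# Similarity transform of the scene: every camera rotation becomes R @ Q.T.
q_global = random_rotation(rng)
r1_sim, r2_sim = r1 @ q_global.T, r2 @ q_global.T

# Non-similarity change of coordinates A: the left 3x3 camera block becomes R @ inv(A).
a = np.eye(3) + 0.3 * rng.normal(size=(3, 3))
m1, m2 = r1 @ np.linalg.inv(a), r2 @ np.linalg.inv(a)

same_under_similarity = np.allclose(relative_rotation(r1_sim, r2_sim), relative_rotation(r1, r2))
same_under_general = np.allclose(relative_rotation(m1, m2), relative_rotation(r1, r2))
```

The global rotation cancels in R_j R_i^T, while a general linear map does not, which is why a cost term built on relative rotation estimates can favor near-metric solutions over merely projective ones.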

2503.00747 2026-05-22 cs.CV cs.RO eess.IV 版本更新

LFX: Towards Unified Light Field Dense Semantic Segmentation and Salient Object Detection

LFX:迈向统一的光场密集语义分割和显著物体检测

Fei Teng, Lingxin Huang, Buyin Deng, Kai Luo, Boyuan Zheng, Zheng Fang, Hong Zheng, Kunyu Peng, Jiaming Zhang, Yaonan Wang, Kailun Yang

Affiliations: School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University, China; China Mobile Group Hunan Company Ltd., China; Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Germany

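AI summary: This paper proposes the LFX framework, which uses a unified feature modulation space for light field representations to adapt to multiple light field representations and different perception tasks, achieving state-of-the-art results on three light field benchmarks and significantly outperforming representation-specific methods.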

Comments The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX

Details
English abstract

Light field cameras capture multi-view observations within a single exposure. However, existing studies are typically tailored to specific LF representations, leaving the field without a unified learning framework. To bridge this gap, we present LFX, the first unified framework for LF perception. LFX establishes a representation-invariant feature modulation space, enabling it to adapt to heterogeneous LF representations and diverse perception tasks. Specifically, we propose Field-of-Parallax Angular Subspace Modeling (FoP-ASM), which assigns an independent angular marker to each auxiliary view, enabling view-wise independent modeling. Meanwhile, shared manifold subspace constraints and regularization losses enforce globally consistent semantic modulation across views. Extensive evaluations across three LF benchmarks show that LFX achieves state-of-the-art results across distinct LF representations, outperforming representation-specific methods by up to 12% and 20% with 0.029/0.027 MAE for salient object detection, and achieving 84.37 mIoU for semantic segmentation. The source code will be made publicly available at https://github.com/FeiT-FeiTeng/LFX.

2501.00677 2026-05-22 cs.LG cs.CV cs.IT cs.NA math.IT math.NA stat.ML version update

Deeply Learned Robust Matrix Completion for Large-scale Low-rank Data Recovery

HanQin Cai, Chandra Kundu, Jialin Liu, Wotao Yin

Affiliations: School of Data, Mathematical, and Statistical Sciences and the Department of Computer Science, University of Central Florida; School of Data, Mathematical, and Statistical Sciences, University of Central Florida; Damo Academy, Alibaba US

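AI summary: This paper proposes Learned Robust Matrix Completion (LRMC), a scalable and learnable non-convex method for large-scale robust matrix completion. The method has low computational complexity and linear convergence, its free parameters are learned effectively via deep unfolding to achieve optimal performance, and its superior empirical performance is verified on synthetic datasets and real applications.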

Details
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 48(6): 6541-6556, 2026
English abstract

Robust matrix completion (RMC) is a widely used machine learning tool that simultaneously tackles two critical issues in low-rank data analysis: missing data entries and extreme outliers. This paper proposes a novel scalable and learnable non-convex approach, coined Learned Robust Matrix Completion (LRMC), for large-scale RMC problems. LRMC enjoys low computational complexity with linear convergence. Motivated by the proposed theorem, the free parameters of LRMC can be effectively learned via deep unfolding to achieve optimum performance. Furthermore, this paper proposes a flexible feedforward-recurrent-mixed neural network framework that extends deep unfolding from fix-number iterations to infinite iterations. The superior empirical performance of LRMC is verified with extensive experiments against state-of-the-art on synthetic datasets and real applications, including video background subtraction, ultrasound imaging, face modeling, and cloud removal from satellite imagery.

2403.16552 2026-05-22 cs.NE cs.AI cs.CV version update

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

Affiliations: Pengcheng Laboratory; Harbin Institute of Technology; Peking University

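AI summary: This paper proposes QKFormer, a hierarchical spiking transformer based on Q-K attention. By introducing a new spike-form Q-K attention mechanism, a hierarchical structure, and a versatile patch embedding module, it improves the performance of spiking neural networks on image classification, reaching 85.65% top-1 accuracy on ImageNet-1K.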

Comments Accepted by NeurIPS 2024 (Spotlight). Code and Model: https://github.com/zhouchenlin2096/QKFormer

Details
English abstract

Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with comparable size to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1k, substantially outperforming Spikformer by 10.84%. To our best knowledge, this is the first time that directly training SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer
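One plausible reading of "models the importance of token or channel dimensions through binary vectors with linear complexity" can be sketched as follows. This is a simplified illustration, not the released implementation: a hard threshold stands in for the spiking neuron, and the function names, the threshold value, and the toy matrices are all hypothetical. The point is that a binary token mask is derived from Q by a row sum and then applied to K, so no N x N attention map is ever formed.

```python
def heaviside_spike(value, threshold):
    # Stand-in for a spiking neuron: fire (1) when the input reaches the threshold.
    return 1 if value >= threshold else 0

def qk_token_attention(q, k, threshold=1.0):
    """Mask the rows (tokens) of K with a binary vector derived from Q.

    q, k: N x D lists of 0/1 spikes. The cost is O(N * D), linear in the
    number of tokens, because only row sums and a row-wise mask are computed.
    """
    # One binary importance value per token, from the row sums of Q.
    token_mask = [heaviside_spike(sum(row), threshold) for row in q]
    # Keep the K rows of tokens marked important and zero out the rest.
    out = [[m * x for x in row] for m, row in zip(token_mask, k)]
    return token_mask, out

# Toy spike matrices with N = 3 tokens and D = 4 channels (hypothetical data).
q = [[1, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 0]]
k = [[1, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 0, 1, 0]]

mask, out = qk_token_attention(q, k, threshold=2)
# mask == [1, 0, 0]; only the first token's K row survives.
```

Summing over columns instead of rows would give the channel-wise variant of the same idea. Because the mask and the output stay binary, the operation needs no softmax and no floating-point multiplication.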

2311.04938 2026-05-22 cs.CV cs.AI cs.LG version update

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

Prasad Gabbur

Affiliations: Independent Researcher; Apple

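AI summary: This paper proposes using a Gaussian mixture model as the reverse transition operator in the DDIM framework, constraining the GMM parameters to match the moments of the DDPM forward marginals, which improves sample quality when few sampling steps are used; experiments show that the GMM kernel outperforms the conventional Gaussian kernel on FID and IS.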

Comments 34 pages, 12 figures; Accepted to TMLR; Code open sourced

Details
Journal ref
Transactions on Machine Learning Research, 05/2026
English abstract

We propose using a Gaussian Mixture Model (GMM) as reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically we match the first and second order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on COYO700M datasets respectively. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example on ImageNet 256x256, using 10 sampling steps, we achieve a FID of 6.94 and IS of 207.85 with a GMM kernel compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.
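The idea of constraining mixture parameters so that the first two moments match a target can be illustrated in one dimension. The sketch below is a generic moment-matching construction under stated assumptions (shared component variance, user-chosen mean offsets); the function names and the parameterization are hypothetical and are not taken from the paper. It relies on the standard identities that a mixture's mean is the weighted sum of its component means and that its variance is the weighted sum of each component variance plus the squared deviation of that component's mean.

```python
def moment_matched_gmm(mean, var, weights, offsets):
    """Build a 1-D GMM whose overall mean and variance equal (mean, var).

    weights: mixture weights summing to 1.
    offsets: desired displacements of the component means (any values).
    Returns (component_means, shared_component_variance).
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Center the offsets so the weighted mean of the components equals `mean`.
    shift = sum(w * d for w, d in zip(weights, offsets))
    centered = [d - shift for d in offsets]
    # Variance contributed by the spread of the component means.
    spread = sum(w * d * d for w, d in zip(weights, centered))
    if spread >= var:
        raise ValueError("offsets too large for the target variance")
    # The remaining variance is assigned to every component equally.
    comp_var = var - spread
    return [mean + d for d in centered], comp_var

def gmm_moments(weights, means, comp_var):
    """Mean and variance of a 1-D GMM with a shared component variance."""
    m = sum(w * mu for w, mu in zip(weights, means))
    v = sum(w * (comp_var + (mu - m) ** 2) for w, mu in zip(weights, means))
    return m, v

# Hypothetical target moments and mixture layout.
weights = [0.5, 0.3, 0.2]
means, comp_var = moment_matched_gmm(0.5, 1.0, weights, [-0.6, 0.2, 0.9])
m, v = gmm_moments(weights, means, comp_var)
# m and v reproduce the target mean 0.5 and variance 1.0 up to rounding.
```

The offsets remain free parameters after the constraint is imposed, so a multimodal kernel can share the mean and variance of a single Gaussian kernel while differing in shape.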