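arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.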

2606.19534 2026-06-19 cs.CV cs.AI cs.CL New submission

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

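Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

Affiliations: Peking University; MSALab; ByteDance

AI summary: Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models, using efficient prompting and structured attention masking to perceive multiple regions in parallel. This substantially improves inference efficiency; the authors also build the ParaDLC-Bench benchmark for evaluation.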

Comments Code available at https://github.com/MSALab-PKU/PerceptionDLM

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
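A minimal sketch of how a structured attention mask could keep several region descriptions independent while they share one prompt. The abstract does not specify the mask layout, so the segment scheme and the names `build_region_mask`, `prompt_len`, and `region_lens` are assumptions for illustration, not the authors' implementation.

```python
def build_region_mask(prompt_len, region_lens):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: a shared prompt followed by one caption slot per region.
    """
    # Segment id per token: 0 for the shared prompt, 1..R for the region slots.
    segments = [0] * prompt_len
    for region_id, length in enumerate(region_lens, start=1):
        segments += [region_id] * length

    total = len(segments)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Every token sees the shared prompt; a region token additionally
            # sees its own slot (bidirectionally), but never another region's.
            mask[i][j] = segments[j] == 0 or segments[i] == segments[j]
    return mask


if __name__ == "__main__":
    for row in build_region_mask(prompt_len=2, region_lens=[2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no region slot attends to another, all slots can in principle be denoised in the same forward pass, which is the sequence-level parallelism the abstract describes.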

2606.19584 2026-06-19 cs.CV New submission

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

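Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

Affiliations: Google

AI summary: Proposes Language-Instructed Vision Embeddings (LIVE), which uses language to dynamically steer a vision encoder toward task-centric embeddings without task-specific retraining, reducing visual hallucination and improving generalization.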

Journal ref Published as a conference paper at ICLR 2026

2606.19828 2026-06-19 cs.CV New submission

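The Hidden Evolution of Disguised Visual Context Inside VLMs

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Affiliations: Surrey Institute for People-Centred AI, University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey

AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context injection versus layer-wise injection), revealing their internal evolution and its effect on performance.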

Abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. How these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored, and a controlled comparison is lacking. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution in which visual tokens enter the LLM as disguised visual context (raw representations lacking linguistic structure) but are progressively reshaped depending on the integration paradigm, with each paradigm capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20177 2026-06-19 cs.CV cs.AI New submission

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

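Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

Affiliations: Peng Cheng Laboratory; Tsinghua University; Central South University

AI summary: Proposes the RS-Neg benchmark for evaluating negation comprehension in remote sensing MLLMs, and designs NeFo, a test-time learning method that uses about 5% unlabeled samples to significantly improve model performance.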

Comments ECCV 2026 Accepted

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.20244 2026-06-19 cs.CV cs.AI New submission

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

Affiliations: University of Science and Technology of China; Shanghai AI Laboratory; Shanghai Jiao Tong University; Dalian University of Technology; Huawei Technologies Co., Ltd.; The University of Hong Kong; Tsinghua University; Peking University

AI summary: To address the memory bottleneck in long-horizon robotic manipulation, proposes the EventVLA framework, whose dynamic Keyframe Evidence Memory module autonomously captures task-critical visual events; across 17 simulated and 4 real-world tasks it improves the average success rate by 40%.

Abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
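The sparse evidence memory described above can be pictured with a small sketch. It assumes a keyframe probability is already available for each observation (in the paper it is predicted from the VLA's latent embeddings); the class name `KeyframeMemory`, the threshold, the capacity, and the lowest-probability eviction rule are all hypothetical choices, not details taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class KeyframeMemory:
    """Bounded store of observations judged likely to matter later (illustrative)."""

    capacity: int = 4        # hypothetical cap on stored keyframes
    threshold: float = 0.5   # hypothetical probability cut-off
    frames: list = field(default_factory=list)  # (probability, step, observation)

    def observe(self, step, observation, keyframe_prob):
        """Store the observation if it looks like a keyframe; return True if kept."""
        if keyframe_prob < self.threshold:
            return False
        entry = (keyframe_prob, step, observation)
        self.frames.append(entry)
        if len(self.frames) > self.capacity:
            # Assumed eviction rule: drop the least confident keyframe.
            self.frames.remove(min(self.frames, key=lambda f: f[0]))
        return entry in self.frames

    def context(self):
        """Stored observations in temporal order, ready to condition the policy."""
        return [obs for _, _, obs in sorted(self.frames, key=lambda f: f[1])]


if __name__ == "__main__":
    memory = KeyframeMemory(capacity=2)
    for step, (obs, prob) in enumerate([("a", 0.9), ("b", 0.2), ("c", 0.6), ("d", 0.8)]):
        memory.observe(step, obs, prob)
    print(memory.context())
```

The point of the sketch is only that memory grows with the number of task-critical events rather than with episode length, which is what distinguishes it from an unselective frame buffer.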

2606.20110 2026-06-19 cs.CV New submission

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

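Yuhwan Jeong, Hyeonseong Kim, Daehyun We, Seonkyu Song, Jinnyeong Yang, Hyun-Kurl Jang, Youngho Yoon, Kuk-Jin Yoon

Affiliations: KAIST, Visual Intelligence Lab

AI summary: Proposes the FrozenDrive framework, which uses a frozen pretrained diffusion model with knowledge-preserving spatio-temporal attention to achieve multi-view consistency and temporal coherence, generating driving scenes under adverse weather without fine-tuning and improving the robustness of autonomous driving models.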

Comments Accepted to ECCV 2026

Abstract

Synthetic data for autonomous driving is surging, powered by diffusion models that promise scalable scene generation. Yet key obstacles remain, as enforcing multi-view and temporal consistency often relies on backbone fine-tuning or added layers, which erodes pre-trained knowledge and weakens text alignment. Models also stay close to the training distribution, struggling under adverse weather and unseen configurations, and fidelity favors frequent over rare classes. We address these gaps with FrozenDrive, a controllable generative framework that preserves a pretrained diffusion model's knowledge while achieving strong consistency. FrozenDrive conditions on rich driving-stack signals and text prompts, and introduces knowledge-preserving spatio-temporal attention to impose cross-view alignment and temporal coherence in a single pass within a parameter-free frozen diffusion backbone. An additional object-focused constraint improves per-object fidelity for rare categories. Without any weather- or scene-specific fine-tuning, our model synthesizes globally coherent multi-view driving scenes from text, particularly under adverse and rare conditions, and surpasses prior baselines. On nuScenes, FrozenDrive-augmented data significantly improves the performance of AD models, especially at night and in rain, demonstrating stronger robustness when they are trained with our scenario-targeted data.

2606.20189 2026-06-19 cs.CV cs.AI cs.RO New submission

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

Maciej Wozniak, Jesper Ericsson, Hariprasath Govindarajan, Truls Nyberg, Thomas Gustafsson, Patric Jensfelt, Olov Andersson

Affiliations: KTH Royal Institute of Technology; Linköping University; TRATON AB; Qualcomm Auto Ltd Sweden Filial

AI summary: Proposes the HilDA framework, which pre-trains LiDAR backbones in a self-supervised way by combining hierarchical distillation (multi-layer distillation and global context distillation) with a temporal occupancy diffusion objective, reaching state-of-the-art results on 3D detection, scene flow, and semantic occupancy prediction.

Comments Accepted to ECCV 2026. Maciej and Jesper contributed equally

Abstract

Leveraging Vision Foundation Models (VFMs) for camera-to-LiDAR knowledge distillation offers a promising solution to the scarcity of annotated data needed to represent the immense geometric and kinematic diversity of real-world autonomous driving (AD). However, current approaches typically treat VFMs as black-box teachers, relying exclusively on frame-wise feature similarity. Consequently, they do not fully exploit the teacher's layer-wise semantic structure and global context, as well as the rich spatiotemporal information inherent in LiDAR sequences. We propose HilDA, a self-supervised pretraining framework for LiDAR backbones that better captures the semantic what and geometric where needed for driving tasks. HilDA combines hierarchical distillation comprising multi-layer distillation for progressive semantic alignment and global context distillation for scene-level semantics, with a temporal occupancy diffusion objective promoting spatiotemporal consistency. Models pre-trained with HilDA achieve state-of-the-art results on cross-modal distillation benchmarks and outperform models trained via prior distillation approaches on 3D object detection, scene flow, and semantic occupancy prediction. Code available at: https://maxiuw.github.io/hilda.

2606.20515 2026-06-19 cs.CV New submission

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

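Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

Affiliations: NTU (Nanyang Technological University); THU (Tsinghua University); ByteDance; NWPU (Northwestern Polytechnical University)

AI summary: Proposes S-Agent, a spatial tool-use agentic paradigm that combines spatio-temporal evidence accumulation with a hierarchical toolset, using the VLM as a semantic planner, to reason spatially over continuous multi-view images and videos. It improves open-source and closed-source VLMs without training, and fine-tuning on S-300K trajectories yields the compact spatial agent S-Agent-8B.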

Comments Project Page : https://Ropedia.github.io/S-Agent

Abstract

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories S-300K yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.20521 2026-06-19 cs.CV New submission

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

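Juncheng Ma, Jianxin Bi, Yufan Deng, Xuanran Zhai, Kewei Zhang, Ye Huang, Bo Liang, Shukai Gong, Jiankai Tu, Xiaotian Tang, Jiaxin Li, Kaiqi Chen, Duomin Wang, Yuqi Wang, Bingyi Kang, Eric Huang, Zhiyang Dou, Zhen Dong, Enze Xie, Wojciech Matusik, Tat-Seng Chua, Daquan Zhou

Affiliations: PKU (Peking University); NUS (National University of Singapore); MIT (Massachusetts Institute of Technology); UCSB (University of California, Santa Barbara); NVIDIA

AI summary: Through a systematic comparison, this paper finds that, with a carefully designed filtering and annotation pipeline, egocentric human video is not only viable for pretraining embodied foundation models but also outperforms teleoperated real-robot data, validating a scalable paradigm of pretraining on human video and adapting with a small amount of robot data.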

Comments Github: https://github.com/DAGroup-PKU/HumanNet/


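ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

Affiliations: School of Computer Science and Technology, Tongji University; College of Surveying and Geo-Informatics, Tongji University

AI summary: Proposes a training-free reliability-aware open-vocabulary change detection framework that uses semantic change reasoning and a boundary-aware refinement strategy to address two problems: instance-level comparison overlooks fine-grained changes, and pixel-level comparison is unreliable. It improves F1 by 2.13% to 9.75% on multiple datasets.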

Abstract

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty by validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD

2606.20130 2026-06-19 cs.CV New submission

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

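Xuesong Wang

Affiliations: Wayne State University

AI summary: Proposes a segmentation model built on the SAM3 image encoder with a lightweight decoder; using self-distillation, multi-scale test-time augmentation, and a transplanted photometric distortion, it reaches 69.73% mIoU in the GOOSE 2D challenge.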

Comments 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

Abstract

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.

2606.20161 2026-06-19 cs.CV New submission

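NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Affiliations: Department of Computer Science, Virginia Tech

AI summary: Proposes the NEST dataset (1005 full-length movies), which uses multimodal narrative event annotations and relation links to evaluate how well models understand event structure, temporal order, and long-range dependencies in long videos; experiments show that tasks such as event detection are highly challenging.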

Abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress: for example, whether a model can connect an early setback (such as a job loss) to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures these events with structured annotations and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.

2606.19849 2026-06-19 cs.CV New submission

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

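Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang, Xiaoyu Shen

Affiliations: Southeast University; Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University

AI summary: Proposes the ViCoStream framework, a stage-wise coordinated pipeline (chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval) for high-throughput, low-latency streaming VideoLLM inference, reaching 134 FPS video throughput and under 50 ms time-to-first-token on a single A100 with accuracy close to full-history baselines.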

Comments 19 pages, 7 figures, 13 tables

Abstract

Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.
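To make the idea of bounding per-chunk cost concrete, here is a small sketch of chunk-wise ingestion with a fixed-size window of retained visual tokens. It only counts tokens; the class `BoundedVisualContext`, the helper `stream_chunks`, and the parameters `max_chunks`, `tokens_per_frame`, and `keep_ratio` are hypothetical and do not reproduce ViCoStream's actual token control or attention scheme.

```python
from collections import deque


def stream_chunks(frames, chunk_size):
    """Yield consecutive chunks of at most chunk_size frames."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]


class BoundedVisualContext:
    """Tracks how many visual tokens remain attendable under a sliding window."""

    def __init__(self, max_chunks, tokens_per_frame, keep_ratio):
        # deque(maxlen=...) silently drops the oldest chunk once the window is full.
        self.window = deque(maxlen=max_chunks)
        # Token dropping: keep only a fraction of each frame's tokens.
        self.kept_per_frame = max(1, int(tokens_per_frame * keep_ratio))

    def ingest(self, chunk):
        self.window.append(len(chunk) * self.kept_per_frame)

    def attended_tokens(self):
        return sum(self.window)


if __name__ == "__main__":
    context = BoundedVisualContext(max_chunks=2, tokens_per_frame=100, keep_ratio=0.25)
    for chunk in stream_chunks(list(range(10)), chunk_size=4):
        context.ingest(chunk)
        print(context.attended_tokens())
```

With these settings the attended token count can never exceed `max_chunks * chunk_size * kept_per_frame`, regardless of how long the stream runs. For scale, sustaining 134 FPS leaves an average budget of roughly 1000 / 134 ≈ 7.5 ms per frame across all pipeline stages.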

2606.19927 2026-06-19 cs.CV New submission

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; School of Computing, National University of Singapore

AI summary: Proposes the CARE framework, which adaptively optimizes reasoning length through competence-aware reward shaping: it estimates competence with an exponential moving average, adjusts reward preferences in stages, and combines batch-level normalization with a posterior amplifier to improve efficiency and accuracy.

Abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
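The competence estimate and stage routing described above can be sketched in a few lines. The exponential moving average follows the abstract; the decay value, the two thresholds, and the three stage names are assumptions for illustration, since the abstract does not state how many stages CARE uses or where their boundaries lie.

```python
def update_competence(previous, pass_rate, decay=0.9):
    """Exponential moving average of the batch pass rate (decay is an assumed value)."""
    return decay * previous + (1.0 - decay) * pass_rate


def route_stage(competence, explore_below=0.3, concise_above=0.7):
    """Map the smoothed competence to a reward regime (thresholds are hypothetical)."""
    if competence < explore_below:
        return "explore"   # reward preference favors long-form reasoning
    if competence >= concise_above:
        return "concise"   # reward preference favors short reasoning
    return "balanced"


if __name__ == "__main__":
    competence = 0.0
    for pass_rate in [0.2, 0.4, 0.6, 0.8, 0.9, 0.9]:
        competence = update_competence(competence, pass_rate, decay=0.5)
        print(round(competence, 3), route_stage(competence))
```

Smoothing matters here because per-batch pass rates are noisy; routing on the raw value could flip the reward preference back and forth between consecutive updates.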

2606.20140 2026-06-19 cs.CV New submission

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

Affiliations: CVL, ETH Zurich; Align Technology; VISICS, KU Leuven; INSAIT, Sofia

AI summary: Proposes SA-VIS, a sparse-frame-annotation method whose Past-frames Feature Propagation module exploits low-dimensional features; with only 1/5 of frames annotated, performance drops by just 0.4%, substantially reducing annotation cost.

Abstract

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup for VIS is expensive in terms of both compute and the dense annotations required. To address these drawbacks, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific instance queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and improves AP by over 1% relative to the state of the art in a limited-annotation scenario.

2606.20312 2026-06-19 cs.CV 新提交

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

面向冻结姿态流视频异常检测的可靠性感知原型校准

Ning Dong, Yingna Su, Xin Dong, Ziyun Jiao, Xinnian Guo, Zhuangzhuang Pan

AI总结提出一种后验评分校准方法RPC，通过标准化潜在空间中的最近原型偏差修正冻结姿态流检测器的排名，在8个骨干-数据集组合上平均提升AUROC 2.03个百分点。

Comments 15 pages, 5 figures, 7 tables. Code available at https://github.com/iNing10/RPC

详情

AI中文摘要

姿态流视频异常检测器因其能为跟踪的骨架窗口提供基于似然的排名，在一类监控中具有吸引力。然而，单个似然分数可能隐藏多模态正常行为，并对姿态观测噪声敏感。我们研究了一个冻结检测器设置，其中姿态流骨干网络、缓存的骨架轨迹和评估流程是固定的。可靠性感知原型校准（RPC）是针对该设置的一种后验评分校准方法。它在冻结潜在空间中添加标准化的最近原型偏差到标准化的流分数，并仅使用关键点置信度来门控这一新增的几何证据。因此，RPC在保留原始密度信号的同时，利用姿态可靠性下的经验正常模式结构修正排名。在两个冻结姿态流骨干网络和四个数据集上，RPC在所有八个骨干-数据集对中提升了帧级AUROC，增益范围为0.34到4.49个百分点，平均为2.03个百分点。消融和可靠性分析表明，原型偏差是主要的修正信号，而可靠性门控在姿态观测不可靠时最为有用。这些结果表明，当重新训练或复现完整姿态流程不可行时，轻量级后验校准可以增强缓存的姿态流系统。

英文摘要

Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
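
The scoring rule described in the abstract can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes z-score standardization with statistics from normal training data, Euclidean distance to the nearest prototype, and a multiplicative confidence gate. The function names, the weight `lam`, and the argument layout are all hypothetical.

```python
import math

def zscore(x, mean, std):
    """Standardize a raw value with (mean, std) estimated on normal data."""
    return (x - mean) / std

def nearest_prototype_distance(latent, prototypes):
    """Euclidean distance from a latent vector to its closest prototype."""
    return min(math.dist(latent, p) for p in prototypes)

def rpc_score(flow_score, latent, confidence, prototypes,
              flow_stats, proto_stats, lam=1.0):
    """Calibrated anomaly score (higher = more anomalous).

    flow_score : anomaly score from the frozen pose-flow detector.
    confidence : keypoint confidence in [0, 1]; assumed here to scale
                 only the added prototype term, never the flow score.
    flow_stats, proto_stats : (mean, std) pairs used for standardization.
    lam        : hypothetical weight on the prototype term.
    """
    s_flow = zscore(flow_score, *flow_stats)
    deviation = nearest_prototype_distance(latent, prototypes)
    s_proto = zscore(deviation, *proto_stats)
    # The original density signal is kept intact; the geometric
    # evidence is added on top and gated by pose reliability.
    return s_flow + lam * confidence * s_proto
```

With `confidence = 0` the sketch falls back to the standardized flow score, which mirrors the stated goal of preserving the original density signal when pose observations are untrustworthy.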

2606.20559 2026-06-19 cs.CV cs.LG 新提交

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

UNIEGO：代理作为中介的统一自我中心视频表示学习

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

AI总结提出分层多教师蒸馏框架UNIEGO，通过代理模型将异构教师知识转化为同质自我中心空间，并采用选择性代理蒸馏自适应筛选可靠监督，在三个自我中心视频理解任务上达到最优。

详情

AI中文摘要

自我中心视频理解本质上受限于可穿戴摄像头的狭窄视角：单一视角、单一模态、单一模型无法捕捉人类动作的全部丰富性。我们认为，真正富有表现力的自我中心表示必须包含跨视角、跨模态和基础模型表示的互补知识，同时仍能仅从自我中心视频部署。为此，我们引入了一个分层多教师蒸馏框架，生成UNIEGO，一个统一的自我中心编码器，使用九个教师（涵盖自我-外部视角、RGB、深度和骨架模态）以及四个基础模型进行训练。我们的框架不是直接从异构教师中蒸馏（其不兼容的架构和特征几何会导致冲突梯度），而是在其中插入一层表示特定的代理模型，将多样的教师知识转化为同质的自我中心空间。第二阶段蒸馏，即选择性代理蒸馏（SPD），然后自适应地为每个训练样本选择既正确又自信的代理子集，仅从可靠监督中蒸馏并抑制错误信号。SPD进一步通过将UNIEGO初始化为代理参数的凸组合来稳定，在蒸馏开始前将统一模型置于损失景观的良好条件区域。UNIEGO在三个自我中心视频理解任务（动作识别、视频检索和动作分割）上，在三个具有挑战性的自我-外部基准测试中达到了最先进的性能，优于朴素的多教师蒸馏基线，并证明了结构化的、代理中介的知识转移能产生更丰富、更具判别性的自我中心表示。

英文摘要

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, or a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
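
The per-sample selection step of SPD ("correct and confident" proxies) can be illustrated with a small sketch. The confidence threshold `min_conf`, the use of the maximum class probability as the confidence measure, and the averaging of the selected proxies' outputs are assumptions made here for illustration; the abstract does not specify them.

```python
def select_proxies(proxy_probs, label, min_conf=0.6):
    """Return indices of proxies whose top prediction equals the label
    and whose probability for that prediction is at least min_conf."""
    selected = []
    for i, probs in enumerate(proxy_probs):
        pred = max(range(len(probs)), key=probs.__getitem__)
        if pred == label and probs[pred] >= min_conf:
            selected.append(i)
    return selected

def distillation_target(proxy_probs, label, min_conf=0.6):
    """Average the class distributions of the selected proxies.

    Returns None when no proxy qualifies, so the caller can fall back
    to the ground-truth label alone (a hypothetical fallback).
    """
    idx = select_proxies(proxy_probs, label, min_conf)
    if not idx:
        return None
    n_classes = len(proxy_probs[0])
    return [sum(proxy_probs[i][c] for i in idx) / len(idx)
            for c in range(n_classes)]
```

Proxies that are wrong, or right but unsure, contribute nothing to the target for that sample, which is one simple way to suppress erroneous signals.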

2606.20561 2026-06-19 cs.CV 新提交

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

TimeProVe: 先提出后验证，实现日常活动中的高效长视频时间推理

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

AI总结提出TimeProVe框架，先通过轻量模块生成基于动作的候选假设，再调用昂贵VLM验证，在长视频问答中降低75%VLM调用和93%推理成本，性能提升7.3%。

详情

AI中文摘要

长视频问答（LVQA）需要在数小时未修剪的视频中识别稀疏的、与查询相关的证据。现有方法要么使用大型视觉语言模型（VLM）密集处理视频，导致计算成本过高，要么依赖稀疏的基于字幕的推理，这往往会遗漏时间局部化和以运动为中心的证据。我们提出TimeProVe，一种用于长视频中时间基础推理的高效混合框架。TimeProVe首先使用轻量模块生成基于动作的答案-证据假设，随后仅调用昂贵的VLM进行针对性验证。我们框架的核心在于基于动作的候选证据（ACE）模块，该模块通过轻量级LLM推理将时间局部化的动作转换为查询条件化的候选答案和支持证据窗口。我们进一步引入OpenTSUBench（OTB），一个开放基准测试，旨在评估真实世界日常活动（ADL）场景中的时间基础推理。实验表明，TimeProVe在OTB上比最强基线高出7.3%，同时减少了75%的VLM调用和93%的推理成本。此外，在没有显式时间基础训练的情况下，TimeProVe在Charades-STA上取得了竞争性性能，并在结合基础VLM增强时达到了最先进的结果。

英文摘要

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer-evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.
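
The propose-then-verify control flow can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the TimeProVe pipeline: the hypothesis tuple layout, the ranking by a prior score, the early exit on the first verified hypothesis, and the `max_checks` budget are all hypothetical, and `verify` stands in for an expensive VLM call.

```python
def propose_then_verify(hypotheses, verify, max_checks=3):
    """Check cheap hypotheses with an expensive verifier, best first.

    hypotheses : list of (answer, (start_sec, end_sec), prior_score)
                 produced by a lightweight proposer.
    verify     : callable (answer, window) -> bool; the costly step.
    Returns (verified answer or None, number of verifier calls made).
    """
    calls = 0
    ranked = sorted(hypotheses, key=lambda h: h[2], reverse=True)
    for answer, window, _score in ranked[:max_checks]:
        calls += 1  # each iteration costs exactly one expensive call
        if verify(answer, window):
            return answer, calls
    return None, calls
```

The cost saving comes from the verifier only ever seeing a handful of short evidence windows instead of the full video.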

2606.19495 2026-06-19 cs.CV 新提交

LooseControlVideo: Directorial Video Control using Spatial Blocking

LooseControlVideo: 使用空间分块进行导演式视频控制

Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

发表机构 * Adobe Research（Adobe研究院）

AI总结提出LooseControlVideo框架，通过稀疏定向3D框作为“分块”代理，实现文本到视频生成中多对象场景的直观布局与轨迹控制，显著优于现有2D框和流方法。

Comments Project page at https://shariqfarooq123.github.io/LooseControlVideo/

详情

AI中文摘要

在文本到视频生成中，精确的3D空间编排仍然是一个重大挑战，特别是对于语义布局和时间动态经常纠缠的多对象场景。虽然现有的深度条件模型实现了良好的结构保真度，但它们需要密集的、帧精确的指导，这对于涉及可变形对象的动态事件来说，制作起来非常费力。我们提出了LooseControlVideo，一个通过使用稀疏的、定向的3D框作为“分块”代理来实现直观和表达性控制的框架。这允许用户创作高级布局和轨迹，同时利用视频生成模型生成逼真的遮挡、动态和交互。我们通过在带有DNOCS（一种用于3D大小、方向和深度排序遮挡的新型编码）注释的视频数据集上微调Wan 2.2骨干网络来实现这一点。此外，我们的方法允许局部细化，例如调整跳跃轨迹或添加交互，而对全局场景上下文的干扰最小。在nuScenes、HO-3D和BEHAVE基准上的广泛评估表明，LooseControlVideo显著优于现有的2D框和基于流的基线。我们的结果表明，与当前最先进的布局条件模型相比，轨迹误差提高了1.2倍到3倍；刚体运动一致性提高了2倍；遮挡精度提高了1.5倍到2倍，表明定向3D基元为复杂的多智能体视频创作提供了良好的几何先验。

英文摘要

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.

2606.19662 2026-06-19 cs.CV 新提交

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

学习何时去噪：优化潜在扩散的异步调度

Bingshuo Qian, Xiang Cheng

AI总结提出学习异步调度策略，通过调度校正目标优化多表示扩散模型的去噪顺序，学习调度的额外训练计算不到1%；在ImageNet 256x256上，200 epoch模型以减少4倍的训练量达到FID 1.05，训练到600 epoch时FID达1.02。

Comments 25 pages, 9 figures, 4 tables

详情

AI中文摘要

多表示扩散模型可以通过对图像的互补视图进行去噪来改善视觉合成，但其性能关键取决于决定每个表示何时去噪的异步调度。我们提出学习这种调度。我们的方法在多个表示空间上制定异步流匹配，并使用调度校正目标，该目标在调度变化时保持每个表示的局部噪声时间权重固定。我们用一个灵活的参数类实例化调度，该类通过构造是凸且单调的，并使用快速联合探针进行学习，额外训练计算少于1%。在ImageNet 256x256上，学习的调度在匹配的675M参数XL骨干下显著提高了收敛速度和最终质量。使用AutoGuidance，我们的200 epoch模型达到FID 1.05，与800 epoch的SFD-XL基线相当，训练量减少4倍。训练到600 epoch进一步改善到FID 1.02，优于1B参数的SFD-XXL结果（FID 1.04），同时使用更小的模型。在无引导设置中，我们的200 epoch模型达到FID 2.37，已经低于最佳800 epoch SFD-XL结果（2.54），训练量减少4倍，并在600 epoch时改善到FID 2.14。代码可在 https://github.com/bsq532087/LWD 获取。

英文摘要

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
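
To make "convex and monotone by construction" concrete, the sketch below builds one such family: a convex combination of power functions t^p with p >= 1. This family is chosen purely for illustration and is not claimed to be the parametric class used in the paper; the raw parameters and function names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable softplus; always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def make_schedule(raw_weights, raw_exponents):
    """Map unconstrained parameters to a schedule s: [0, 1] -> [0, 1].

    s(t) = sum_k w_k * t ** p_k, with w = softmax(raw_weights) and
    p_k = 1 + softplus(raw_exponents[k]) >= 1. Each term is convex and
    non-decreasing on [0, 1], and a convex combination keeps both
    properties, so any parameter values give a valid schedule with
    s(0) = 0 and s(1) = 1.
    """
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    powers = [1.0 + softplus(a) for a in raw_exponents]

    def schedule(t):
        return sum(w * t ** p for w, p in zip(weights, powers))

    return schedule
```

Because the constraints hold for every parameter setting, such a schedule can be optimized with plain gradient descent without projection steps. In a multi-representation setup, each representation could be given its own schedule mapping a shared global time to its local noise time.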

2606.19676 2026-06-19 cs.CV cs.AI 新提交

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

TeleMorpher: 迈向鲁棒的同步运动-位置编辑

Haengbok Chung

AI总结提出TeleMorpher，一种基于扩散模型的单样本（one-shot）框架，通过运动先验、姿态扭曲和基线运动编辑器注入，实现视频中主角运动与位置的同步编辑，在定量和定性评估中表现优异。

详情

AI中文摘要

扩散模型在图像和视频生成与编辑中取得了显著成功。尽管最近的研究将工作扩展到运动编辑，但同步变换运动与位置（尽管具有实际重要性）仍基本未被探索。为了更好地理解鲁棒的运动-位置编辑，我们首先分析了降低其质量的根本因素。基于此分析，我们提出了TeleMorpher，据我们所知，这是首批用于同步运动-位置编辑的单样本（one-shot）框架之一。我们的方法利用运动先验（从现成模型生成的目标运动中心视频作为运动编辑指导）和真实运动，实现更可控和精确的运动-位置编辑。我们的框架工作如下：(1) 首先通过预训练的分割和修复模型分离主角和背景。(2) 然后，我们引入一种无需训练的姿势扭曲，以运动先验为指导编辑主角的运动。(3) 扭曲运动视频的结果在推理时直接注入基线运动编辑器，减轻源运动与目标运动之间的差异，同时保留源视频的外观。(4) 为提高定量评估的可靠性，我们提出了两个新的基于LPIPS的指标，分别测量运动编辑前后的背景一致性，以及通过测量从源视频和目标视频中提取的主角骨架差异来评估运动编辑的保真度。在野外视频和TaiChi数据集上的实验表明，TeleMorpher在定量和定性测量（真实人类评估）中均取得了优越性能，凸显了其有效性。

英文摘要

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, to the best of our knowledge one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics that measure the background consistency before and after motion editing and the fidelity of motion editing, the latter by measuring the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.

2606.19718 2026-06-19 cs.CV 新提交

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

基于3D先验引导扩散模型的单样本新视角与姿态人体图像合成

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology（南京理工大学计算机科学与工程学院教育部高维信息智能感知与系统重点实验室、江苏省社会安全图像与视频理解重点实验室及PCA实验室）； Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center（国防科技大学电子工程学院安徽省先进激光技术实验室及江淮前沿技术中心）

AI总结提出一种基于条件去噪扩散模型的方法，利用3D人体先验（法线图和颜色提示）作为几何和颜色条件，从单张参考图像合成任意姿态和视角的高质量人体图像，包括被遮挡部分。

Comments 30 pages, 10 figures

详情

DOI: 10.1016/j.patcog.2026.113644

AI中文摘要

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态，或基于可泛化人体NeRF（使用人体模型先验提取逐点特征）合成人体图像。然而，基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态，而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题，我们提出了一种基于条件去噪扩散模型的新方法，用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言，为了生成具有复杂和任意姿态的人体，我们将3D人体先验（即3D法线图和颜色提示）作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体，我们的扩散模型能够实现高质量合成，包括被遮挡/不可见部分。此外，我们提出了一种基于自重建的自定义细化方法，以在新人物上测试时增强细节。在多个公共数据集上的实验结果表明，我们的方法显著优于先前方法，并显示出更好的跨数据集泛化能力。代码将在 https://github.com/Yankeegsj/3DPGDM 上公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints or synthesize human images based on a generalizable human NeRF, which uses human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when an ambiguous 2D pose is used as the condition, while generalizable human NeRFs may fail to accurately recover occluded/invisible human parts when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction-based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.

2606.19889 2026-06-19 cs.CV 新提交

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

SurgVista：具有合理器械-组织动力学的长程手术世界建模

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

发表机构 * The Chinese University of Hong Kong（香港中文大学）； EPFL（洛桑联邦理工学院）； Imperial College London（帝国理工学院）

AI总结提出SurgVista手术世界模型，通过变形一致性正则化和漂移适应训练，解决空间交互不连贯和时间保真度崩溃问题，在长程预测中显著优于现有方法。

详情

AI中文摘要

将机器人策略学习扩展到自主手术面临挑战，因为专家演示成本高昂且体内探索存在重大安全风险。手术世界模型通过从初始观测生成逼真的、动作条件下的未来帧来解决这一问题，但现有方法存在两种持续失效模式：空间交互不连贯，即可见器械接触未能引起空间一致的组织变形；以及时间保真度崩溃，即预测误差在自回归展开中累积并逐渐破坏视觉质量。我们提出SurgVista，一种通过两种训练策略缓解这两种失效的手术世界模型。变形一致性正则化从训练视频中提取场景点轨迹，并通过潜在对比学习强制跨帧一致性，增强物理一致的器械-组织动力学。漂移适应训练通过用在线预测残差和根据长程漂移统计校准的光度增强扰动条件帧，减轻长程漂移，在扩展展开中维持视觉保真度。为了进行严格评估，我们进一步引入SurgWorld-Bench，包含多样化的手术类型、长程展开以及用于器械运动精度和组织响应保真度的解耦指标。大量实验表明，SurgVista在视觉质量、时间一致性和交互保真度方面持续优于最先进方法，且随着预测视界增长优势扩大。

英文摘要

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.

2606.19958 2026-06-19 cs.CV 新提交

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

SketchKeyAnime：基于参考锚点的稀疏关键草图动画合成

Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出SketchKeyAnime视频扩散框架，通过双分支条件机制和可学习门控的草图交叉注意力，从单张参考RGB图像和稀疏关键草图生成结构可控、外观一致且时间连贯的动画，在Sakuga-42M数据集上显著优于基线方法。

详情

AI中文摘要

传统动画制作严重依赖手工绘制和迭代细化，特别是关键姿势设计、中间帧生成和角色着色。虽然现有的动画和视频生成方法取得了显著进展，但它们通常依赖于RGB边界帧、密集的帧级条件或完整的草图序列，限制了在低成本输入条件下的适用性。我们提出了SketchKeyAnime，一个视频扩散框架，用于从稀疏关键草图输入生成结构可控、外观一致且时间连贯的动画。给定单个参考RGB图像和几个按时间索引的关键草图，SketchKeyAnime引入了一种双分支条件机制，以编码局部几何约束以及语义-时间上下文。它利用草图交叉注意力，通过可学习门控融合参考图像和草图条件，并加入自适应加权损失以加强对关键草图帧和线条艺术区域的监督。在Sakuga-42M的Aesthetic子集上的实验结果表明，我们的方法始终优于代表性的动画插值和草图引导生成基线。与最佳基线相比，SketchKeyAnime将EDMD降低了31.9%，FVD降低了9.5%，展示了卓越的草图保真度和时间连贯性，同时在大多数定量指标上实现了最佳整体性能。这些结果验证了所提出的框架，并突显了其在低成本、高度可控动画创作中的潜力。

英文摘要

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.

2606.19970 2026-06-19 cs.CV 新提交

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

CrossFlow: 跨潜在空间与像素空间的单步生成

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University（北京大学人工智能研究院）； Tencent（腾讯）； Fudan University（复旦大学）

AI总结提出CrossFlow，一种跨空间流模型，将噪声潜在输入直接映射到像素图像，通过无速度单步目标实现潜在到像素的生成，并替代潜在扩散中的解码器，在ImageNet-1k上达到1.62 FID。

Comments Preprint, Under Review

详情

AI中文摘要

大多数扩散和流匹配生成器在相同的表示空间中定义先验、概率路径和预测目标。潜在扩散通过将该路径移动到自编码器潜在空间来提高效率，但最终样本仍由单独训练的解码器生成。这种分离造成了不匹配：生成器针对潜在空间预测进行优化，而最终质量取决于解码器如何处理可能与干净编码器输出不同的生成潜在变量。我们引入了CrossFlow，一种跨空间流公式，将噪声潜在输入直接映射到像素空间图像。关键技术步骤是一个无速度的单步目标：潜在轨迹定义了训练路径，但监督预测是图像而非潜在位移。这使得一个模型既可以作为单步潜在到像素生成器，也可以作为潜在扩散管道的解码器替代品。在类别条件ImageNet-1k $256\times256$上，CrossFlow-XL通过一次函数评估达到了1.62 FID。消融实验表明，潜在编码器以及像素空间感知和对抗损失对保真度很重要。这些结果表明，跨空间流目标可以结合潜在表示的效率与直接像素空间监督，而无需在推理时使用单独的解码器。

英文摘要

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.

2606.20076 2026-06-19 cs.CV cs.AI 新提交

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

基于可学习全局合并的可变长度分词用于扩散变换器

Dong Hoon Lee, Seunghoon Hong

发表机构 * Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea（韩国科学技术院金载哲人工智能研究生院，大田，韩国）； School of Computing, KAIST, Daejeon, South Korea（韩国科学技术院计算学院，大田，韩国）

AI总结针对固定压缩比限制扩散模型质量-计算权衡的问题，提出基于可学习全局合并的可变长度分词器，通过合并令牌实现跨长度表示对齐，在ImageNet 256×256生成中实现更优的gFID-计算权衡。

详情

AI中文摘要

潜在扩散模型（LDM）在视觉合成中占据主导地位，但其质量-计算权衡很大程度上受限于分词器的固定压缩比。可变长度分词器（VLT）通过改变令牌数量实现自适应压缩，使扩散模型能够灵活平衡质量和计算。然而，传统的VLT通过截断有序令牌序列来调节长度，这使得令牌语义依赖于令牌位置，并破坏了跨长度的表示对齐。这导致潜在分布出现跨长度偏移，阻碍单个可变长度扩散模型有效运行。为了解决这个问题，我们提出了一种新颖的可变长度分词器，通过合并令牌来调节长度。我们表明，当扩散变换器根据合并模式运行时，鼓励相似令牌合并可以实现直接的跨长度表示对齐。由于传统的合并方法是数据依赖的，使得生成过程中无法访问合并模式，我们引入了可学习的全局合并，它是数据独立的，以确保与扩散变换器的兼容性。在ImageNet 256×256生成中，我们的基于合并的可变长度分词器与扩散变换器集成，相比之前的VLT方法实现了更优的gFID-计算权衡。代码可在 https://github.com/movinghoon/lgm 获取。

英文摘要

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm
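
The difference between data-dependent and global merging can be illustrated with a toy example. In the sketch below, the merge pattern is a fixed list mapping each token position to a group, shared by every input, so it is known before any image has been generated. The hand-written `assignment` list stands in for whatever the learned pattern would be, and plain averaging is an assumed merge operator; neither is taken from the paper.

```python
def merge_tokens(tokens, assignment, num_groups):
    """Average tokens that share a group id.

    tokens     : list of equal-length feature vectors, one per position.
    assignment : assignment[i] is the group of position i. It does not
                 depend on the token values, so the same pattern applies
                 to every image (a "global" pattern).
    Returns num_groups merged vectors, i.e. a shorter token sequence.
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for tok, g in zip(tokens, assignment):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += tok[d]
    if 0 in counts:
        raise ValueError("every group needs at least one token")
    return [[v / counts[g] for v in sums[g]] for g in range(num_groups)]
```

Different target lengths then correspond to different global patterns (coarser or finer groupings), rather than to truncating an ordered sequence.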

2606.20083 2026-06-19 cs.CV 新提交

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Holo-World: 视频世界模型的统一相机、物体和天气控制

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

AI总结提出Holo-World，一种从单张图像联合控制相机、物体运动和天气的统一视频世界模型，通过场景适配器和解耦CFG实现世界保持与天气迁移。

Comments Project Page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

详情

AI中文摘要

视频世界模型正朝着在可控相机和物体运动下保持观察到的世界，同时允许其环境状态变化的方向发展。然而，这些控制仍然是孤立的，天气生成通常依赖于已经指定未来结构的源视频或重建场景。我们研究了一种基于第一帧锚定的源到状态设置，其中模型从单张图像开始，遵循明确的相机和物体控制以及可选的天气指令，然后生成一个视频，该视频要么保持源世界，要么将其转移到目标天气状态。为了解决这些挑战，我们首先构建了HoloStateData，一个状态视频数据集，将多样化的视频转换为用于相机、物体和天气监督的统一控制样本。其次，我们引入了Holo-World，一个统一的、可控制的视频世界模型，从单张图像联合控制场景。其统一场景适配器将世界保持和天气迁移分解为不同的参数子空间，使用渲染背景、几何缓冲区和物体控制来维持受控场景结构，同时建模依赖天气的外观和粒子效果。此外，场景-天气解耦CFG分别引导场景和天气残差，增强目标天气效果而不过度放大完整条件。定量和定性实验表明，Holo-World在保持精确的相机和物体控制以及一致场景结构的同时，将场景迁移到多样化的目标天气状态，在天气状态生成上优于视频到视频的天气编辑基线。我们的项目页面可在 https://xiangchenyin.github.io/Holo-World/ 获取。

英文摘要

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
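
One common way to guide two conditions separately is to split the classifier-free guidance (CFG) update into two residuals, each with its own weight. The sketch below shows that generic two-term form as an assumption about what "decomposed CFG" could look like; the exact formulation, the conditioning order, and the default weights are not given in the abstract and are hypothetical here.

```python
def decomposed_cfg(eps_uncond, eps_scene, eps_full,
                   w_scene=3.0, w_weather=5.0):
    """Combine three noise predictions with separate guidance weights.

    eps_uncond : prediction with no conditioning.
    eps_scene  : prediction with scene controls only.
    eps_full   : prediction with scene controls and weather instruction.
    The scene residual (eps_scene - eps_uncond) and the weather residual
    (eps_full - eps_scene) are scaled independently.
    """
    return [u + w_scene * (s - u) + w_weather * (f - s)
            for u, s, f in zip(eps_uncond, eps_scene, eps_full)]
```

When both weights are equal, the scene terms cancel and the expression reduces to standard CFG on the full condition. Raising only `w_weather` strengthens the weather effect without also amplifying the scene controls.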

2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM 新提交

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

MakeupMirror：在用于化妆迁移的扩散模型中改进面部属性保持

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

发表机构 * Amazon（亚马逊）

AI总结提出MakeupMirror扩散模型，通过ControlNet几何条件、区域特定迁移控制、肤色调制和Langevin采样器，在保持面部特征和肤色的同时实现高质量化妆迁移，相比Stable-Makeup提升面部识别相似度60%、降低肤色差异50%。

详情

AI中文摘要

化妆迁移模型能够实现有趣的增强现实（AR）体验以及在线化妆购物的虚拟试妆（VTO）。尽管最近最先进的基于扩散的解决方案（如Stable-Makeup）显著提高了化妆迁移的准确性和逼真度，但在身份和肤色保持方面仍存在局限性，使得用于化妆购物的生产级VTO不切实际。在这项工作中，我们提出了MakeupMirror，一种基于扩散的化妆迁移方法，在保持面部特征和肤色方面取得了显著进展。我们在Stable-Makeup的基础上引入了多项技术创新：（1）将面部几何条件与ControlNets集成以保持面部保真度；（2）区域特定的化妆迁移控制，以便在面部区域（如皮肤、眼睛和嘴唇）实现精确的化妆应用；（3）基于肤色的化妆迁移调制，防止跨主体迁移场景中的肤色改变；（4）集成Levenberg-Marquardt Langevin采样器以加速推理同时保持生成质量。我们在CPM-Real、Makeup Wild以及（本文新收集的、更多样化的）MakeupSelfies数据集上的实验表明，与Stable-Makeup相比，MakeupMirror将相对面部识别相似度提高了+60%，将相对肤色差异降低了-50%，延迟为0.7秒，同时在核心面部身份保持标准上达到了94%的专家接受率。

英文摘要

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on CPM-Real, Makeup Wild, and (herein newly collected, more diverse) MakeupSelfies datasets show that, relative to Stable-Makeup, MakeupMirror improves facial recognition similarity by 60% and reduces skin tone difference by 50%, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.

2606.20233 2026-06-19 cs.CV 新提交

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

使用角色-环境协调视频生成模型的电影级合成

Tianyi Xiang, Mingming He, Li Ma, Jing Liao

发表机构 * City University of Hong Kong（香港城市大学）； Independent Researcher（独立研究员）

AI总结提出端到端视频扩散框架，通过三掩码引导和RGB-D联合去噪建模角色与环境的双向物理与光照交互，实现高质量动态视频合成。

AI中文摘要

电影级合成旨在将绿幕角色融入新环境，同时保持物理和光度真实性。先前的方法通常未能捕捉角色与其周围环境之间的复杂双向交互，我们将其表征为角色到环境（C2E）的物理交互和环境到角色（E2C）的光照协调。为了解决这个问题，我们提出了一个端到端的视频扩散框架，联合建模C2E和E2C交互，特别处理交互道具的挑战。我们的方法引入了一种三掩码引导架构，结合RGB-D联合去噪，以确保角色、道具和环境之间的物理一致交互。我们进一步开发了一种高效的先验驱动数据整理流程，无需昂贵的渲染即可构建高质量的重光照对。最后，参考条件机制实现了可控的环境合成和精确的道具替换。大量实验表明，我们的框架在电影级动态视频合成方面显著优于现有方法。

英文摘要

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.

2606.20310 2026-06-19 cs.CV 新提交

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

通过PRISM：视频扩散模型中间状态中的偏好表示

Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

发表机构 * City University of Hong Kong（香港城市大学）； Video Rebirth ； The Chinese University of Hong Kong（香港中文大学）

AI总结提出PRISM方法，利用冻结的视频扩散骨干网络和轻量级查询聚合头从噪声潜变量中解码偏好信号，实现高精度偏好预测和噪声鲁棒性，支持早期最佳采样以降低计算成本并提升视频质量。

AI中文摘要

使用干净的、基于像素的奖励模型评估视频生成，会使评估与噪声扩散过程脱节，并产生巨大的VAE解码成本。在本文中，我们通过提出一个基本问题来挑战这一范式：一个强大的视频生成器能否直接从噪声潜变量中内在地区分偏好？为了回答这个问题，我们引入了PRISM（Preference Representation in Intermediate States of Diffusion Models）。PRISM采用一个轻量级的基于查询的聚合头，配合冻结的视频扩散骨干网络，从噪声潜变量中解码偏好信号。令人惊讶的是，PRISM不仅达到了最先进的偏好准确率，还解锁了强大的噪声鲁棒性，从而实现了早期最佳-N采样。这使得在去噪的初始阶段就能过滤掉次优候选，大幅减少计算量并提升视频质量。我们还揭示了骨干网络的生成性能与其内在评估能力之间的强正相关性，从而实现了视频骨干网络的自我改进。

英文摘要

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-N sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
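
The early-stage Best-of-N idea in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `early_best_of_n`, `score_fn`, and `denoise_step` are hypothetical names, and `score_fn` merely stands in for a preference scorer that can rank still-noisy latents.

```python
def early_best_of_n(candidates, score_fn, denoise_step, total_steps, probe_step, keep=1):
    """Denoise every candidate up to probe_step, keep the top `keep`, finish only those.

    Returns the finished survivors and the number of denoising calls spent.
    """
    states = list(candidates)
    calls = 0
    for t in range(probe_step):
        states = [denoise_step(s, t) for s in states]
        calls += len(states)
    # Rank the still-noisy states; score_fn is a stand-in for a latent-space scorer.
    survivors = sorted(states, key=score_fn, reverse=True)[:keep]
    for t in range(probe_step, total_steps):
        survivors = [denoise_step(s, t) for s in survivors]
        calls += len(survivors)
    return survivors, calls


# Toy setup (assumed): a "latent" is an integer, each step adds 1,
# and the score is simply the current value.
finished, calls = early_best_of_n(
    candidates=[3, 9, 1, 5],
    score_fn=lambda s: s,
    denoise_step=lambda s, t: s + 1,
    total_steps=10,
    probe_step=2,
    keep=1,
)
full_cost = 4 * 10  # denoising calls if all four candidates ran to completion
```

In this toy run only the best-ranked candidate is denoised past step 2, so the number of denoising calls falls well below `full_cost`. How early the cut can be made in practice depends on how reliable the scorer is on noisy inputs.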

2606.20404 2026-06-19 cs.CV 新提交

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FlowBender: 面向自校正条件流的反馈感知训练

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

发表机构 * Technion（以色列理工学院）； NVIDIA（英伟达）； University of Toronto（多伦多大学）； Vector Institute（向量研究所）

AI总结针对条件扩散/流模型常违反任务约束的问题，提出FlowBender闭环框架，将对齐误差作为输入训练网络学习校正策略，在图像翻译、复原和3D纹理贴图中同时提升保真度与合理性。

Comments Project page: https://flow-bender.github.io/

AI中文摘要

条件扩散和流模型通常无法满足定义其任务的约束条件。例如，深度条件模型经常产生重新提取的深度与输入不一致的图像，尽管定义约束的前向算子（深度预测器）在训练和推理期间都可用。现有方法通常分为两类：将条件信号视为静态线索并在推理时忽略对齐信息的监督模型，以及通过手动调整的线性更新咨询约束的基于引导的方法，通常以生成样本的合理性为代价来换取对条件的保真度。我们认为这两种范式的根本差距在于模型从未被训练利用自身的对齐误差。我们引入FlowBender，一个闭环框架，将此误差视为一等输入，训练网络学习基于推理时反馈的校正策略。在每一步，无引导的前瞻传递估计干净信号，通过前向算子计算特定任务的偏差，然后细化传递消耗此信号以产生校正速度。我们提出了FlowBender的几种变体，包括用于可微算子的基于梯度的公式和用于不可微设置（如JPEG压缩）的零阶变体。为了实现高效采样，我们引入了一个前一步捷径，使得以最小的额外计算成本实现闭环校正。在图像到图像翻译、复原和3D网格纹理贴图中，FlowBender始终优于标准监督基线、对齐损失增强训练和最先进的推理时引导，同时提高保真度和合理性，而不是在它们之间进行权衡。项目页面：https://flow-bender.github.io/

英文摘要

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator--the depth predictor defining the constraint--is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/
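
The three passes described in this abstract (look-ahead, deviation, refinement) can be sketched on a scalar toy problem. Everything here is an assumption made for illustration: the function names are hypothetical, and `refine` is a hand-written formula standing in for what the abstract describes as a trained network that consumes the alignment error.

```python
def closed_loop_step(x, t, dt, velocity, forward_op, target, refine):
    """One Euler step with look-ahead, deviation, and refinement passes."""
    v = velocity(x, t)                         # unguided look-ahead pass
    x_clean = x + (1.0 - t) * v                # one-step estimate of the clean signal at t = 1
    deviation = forward_op(x_clean) - target   # task-specific alignment error
    v_corrected = refine(x, t, deviation)      # refinement pass consumes the error
    return x + dt * v_corrected


def sample(x0, steps, velocity, forward_op, target, refine):
    """Integrate from t = 0 to t = 1 with `steps` closed-loop Euler steps."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x = closed_loop_step(x, k * dt, dt, velocity, forward_op, target, refine)
    return x


# Toy problem (assumed): the constraint is forward_op(x) == target, i.e. 2 * x == 4.
forward_op = lambda z: 2.0 * z
target = 4.0
velocity = lambda x, t: 1.0                    # base model that ignores the constraint

# Stand-in for the learned correction: shift the velocity to cancel the deviation.
refine = lambda x, t, deviation: 1.0 - 0.5 * deviation / (1.0 - t)
# Open-loop baseline: the refinement pass just returns the base velocity.
ignore_feedback = lambda x, t, deviation: velocity(x, t)

x_open = sample(0.0, 4, velocity, forward_op, target, ignore_feedback)
x_closed = sample(0.0, 4, velocity, forward_op, target, refine)
```

The open-loop run ends at a point that violates the constraint, while the closed-loop run satisfies it. The sketch shows only the control flow of a step; it says nothing about how the correction policy is trained.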

2606.20506 2026-06-19 cs.CV cs.AI 新提交

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

FreeStyle: 从社区LoRA挖掘中实现风格-内容双参考生成的自由控制

Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang, Rui Wang, Yufeng Yang, Xuanyang Zhang, Xianfang Zeng, Difan Zou, Gang Yu, Chi Zhang

AI总结提出FreeStyle框架，利用社区LoRA作为锚点，通过两阶段课程学习（注意力级约束和频率感知RoPE调制）解决双参考生成中的内容泄露问题，并引入新基准和评估指标，实现风格对齐、内容保持与泄露抑制的平衡。

Comments 35 pages, 26 figures. Project page: https://github.com/Blue2Giant/FreeStyle

AI中文摘要

风格-内容双参考生成旨在合成一张图像，该图像保留内容参考的结构和语义，同时采用单独风格参考的风格。尽管近期有所进展，但这一设置仍然具有挑战性，因为模型必须平衡内容保真度、风格对齐和指令遵循，同时避免风格参考的语义泄露。一个关键瓶颈是缺乏大规模的三元组数据，这些数据具有清晰的内容-风格分离和广泛的长尾风格。在这项工作中，我们提出了FreeStyle，一个基于社区LoRA的可扩展双参考生成框架。我们将社区LoRA视为风格和内容的组合锚点，并设计了一个严格的生成和过滤流水线，以在多个基础模型上构建大规模的风格参考和内容参考三元组。为了解决内容泄露，我们采用了两阶段课程学习，并设计了特定阶段的解耦机制：在风格迁移阶段，采用注意力级增强约束来抑制风格参考泄露；在更困难的双参考阶段，采用频率感知的RoPE调制策略来针对基于位置对应的泄露。我们还引入了一个基准，涵盖风格参考和双参考生成，并在风格相似性、内容保持、美学质量、指令遵循和泄露拒绝方面进行评估。该基准包含一个风格不变的内容对齐分数（CAS），并引入了一个基于校准的VLM的拒绝分数，用于评估生成可靠性和泄露。大量实验表明，我们的模型在风格对齐、内容保持和泄露抑制之间实现了强平衡。

英文摘要

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.

2606.20543 2026-06-19 cs.CV 新提交

Occ-VLM: 面向室内场景理解的占用接地视觉语言模型

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

发表机构 * School of Electronic Science and Engineering, Nanjing University（南京大学电子科学与工程学院）

AI总结提出Occ-VLM，仅用姿态RGB图像和单一2D视觉编码器，通过重建3D占用作为几何先验，实现统一的3D场景理解，在占用预测、3D VQA和密集描述任务上达到领先水平。

AI中文摘要

近期，视觉语言模型（VLM）在3D场景理解方面取得了显著进展，推动了具身智能和机器人视觉等应用的发展。然而，现有方法通常要么直接依赖显式的3D输入（如点云或RGB-D序列），要么引入额外的3D几何编码器从2D图像中推导出3D感知的视觉标记。这种设计在结构上将3D几何感知与通过视觉语言预训练学到的丰富2D语义解耦，阻碍了统一3D视觉语言表示的发展。在这项工作中，我们提出了Occ-VLM，一个仅基于姿态RGB图像并采用单一2D视觉编码器的3D场景理解新框架。具体而言，Occ-VLM重建3D场景占用作为辅助几何先验，用于将前景2D标记与3D空间进行空间关联。然后，这些标记由大型语言模型（LLM）解码，实现统一的场景理解。大量实验表明，Occ-VLM实现了准确的几何感知和稳健的视觉语言推理：在多视角占用预测上达到最先进性能，同时在3D视觉问答（VQA）和3D密集描述基准上与使用3D输入的VLM表现相当。

英文摘要

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.19805 2026-06-19 cs.CV cs.AI 新提交

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

ParaScale: 通过规范不变视差数进行尺度校准的相机运动迁移

Zijie Meng

发表机构 * Peking University（北京大学）

AI总结提出ParaScale模块，通过规范不变的视差数Pi实现尺度忠实相机运动迁移，无需重新训练，在四个数量级尺度上降低视差一致性误差3倍以上。

Comments Accepted by SCA 2026 (poster)

AI中文摘要

将参考视频的相机运动迁移到新生成的视频中，可以让创作者重复使用电影级运镜。然而，参考视频和目标视频往往处于不兼容的尺度——例如跨越银河系的扫视与桌面上的轻推——直接复用恢复的轨迹会导致运动要么不可察觉，要么剧烈夸张。我们将此归结为一个几何事实：平移引起的图像运动与||T||/Z成比例，因此单目轨迹仅在深度尺度规范下才有意义。我们将此提炼为视差数Pi = ||Delta T|| / Zbar，这是一个无量纲、规范不变的描述符，用于衡量相机运动的感知强度，并证明它是尺度忠实迁移必须保持的量，而非原始轨迹。ParaScale是一个即插即用模块，它从任何参考视频中读取Pi，并针对目标场景的深度逐帧重新实现它，保持旋转不变。它位于姿态提取和姿态注入之间，无需重新训练，可插入任何姿态条件生成器。我们进一步引入了视差一致性误差（PCE），这是一种尺度对称的度量，与相似性对齐的TransErr不同，它能暴露场景尺度不匹配。在跨越四个数量级的尺度范围和多个骨干网络上，ParaScale将实现的视差保持在恒等线上，并将PCE比未校准的迁移降低3倍以上，且不损失视觉保真度。

英文摘要

Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
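
The Parallax Number defined in this abstract, Pi = ||Delta T|| / Zbar, lends itself to a short worked sketch. The code below assumes a per-frame translation step and a mean scene depth are already available. The function names and the numbers are hypothetical, and the retargeting rule is just the direct consequence of holding Pi fixed, not the authors' module.

```python
import math


def parallax_number(delta_t, mean_depth):
    """Pi = ||delta_t|| / mean_depth, a dimensionless measure of per-frame parallax."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return math.sqrt(sum(c * c for c in delta_t)) / mean_depth


def retarget_translation(delta_t_ref, ref_depth, target_depth):
    """Rescale a reference translation step so the target scene sees the same Pi.

    The direction is preserved and only the magnitude changes, by the factor
    target_depth / ref_depth. Rotation is not touched by this function.
    """
    scale = target_depth / ref_depth
    return [c * scale for c in delta_t_ref]


# Hypothetical numbers: a 2-unit step in a scene 100 units deep,
# re-realized in a desk-scale scene that is 0.5 units deep.
ref_step = [2.0, 0.0, 0.0]
target_step = retarget_translation(ref_step, ref_depth=100.0, target_depth=0.5)
pi_ref = parallax_number(ref_step, 100.0)
pi_target = parallax_number(target_step, 0.5)

# Gauge invariance: rescaling translation and depth together leaves Pi unchanged.
pi_rescaled = parallax_number([c * 7.0 for c in ref_step], 100.0 * 7.0)
```

Copying `ref_step` unchanged into the desk-scale scene would give a Parallax Number of 4, two hundred times the reference value of 0.02. This is the exaggerated motion the abstract warns about.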

2606.20103 2026-06-19 cs.CV 新提交

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

3D高斯溅射中保持几何结构的LiDAR-相机外参标定

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

发表机构 * Kyung Hee University（庆熙大学）

AI总结针对LiDAR-相机标定中跨模态特征稀缺问题，提出通过多视图LiDAR深度监督和阻止光度梯度更新高斯空间参数来保持3DGS代理的度量几何，提升标定精度。

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

AI中文摘要

精确的LiDAR-相机标定对于鲁棒的多模态感知至关重要。无目标方法避免了手动设置，但仍受限于跨模态判别特征的稀缺性。最近的方法通过在可微模型中重建场景，通过密集光度监督实现外参优化。其中，3D高斯溅射（3DGS）被广泛用作几何代理，在单一可微框架内桥接LiDAR和相机。然而，由于3DGS最初是为新视图合成设计的，现有方法倾向于优先考虑渲染质量，导致代理几何偏离真实的LiDAR结构。我们提出了一种框架，通过聚合多视图LiDAR观测进行密集深度监督，并阻止光度梯度更新高斯空间参数，从而保持高斯代理的度量几何。我们在公开驾驶数据集上验证了该方法，在标定精度上持续优于现有无目标方法。

英文摘要

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

2606.20131 2026-06-19 cs.CV cs.GR 新提交

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

TriFlow: 通过最近顶点向量场生成类艺术家3D网格拓扑

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich（慕尼黑工业大学）； AUDI AG（奥迪股份公司）； University of Virginia（弗吉尼亚大学）

AI总结提出TriFlow，一种基于最近顶点向量场（NVF）的生成方法，通过流匹配模型合成NVF并引导拓扑感知的网格简化，直接从输入几何条件生成紧凑且具有类艺术家拓扑的3D网格。

AI中文摘要

我们提出了TriFlow，一种新的生成方法，能够直接从输入几何条件（如符号距离场）生成具有类艺术家三角形拓扑的紧凑3D网格。我们的关键见解是将网格拓扑表示为在表面上定义的最近顶点向量场（NVF），其中每个点编码其在局部重心坐标系中与最近三角形顶点的关联。我们训练一个潜在流匹配模型来合成该场，从而实现基于输入几何条件的拓扑生成。为了提取连贯的网格，我们使用生成的NVF对表面区域进行聚类，并引导具有拓扑感知优化的约束二次误差度量（QEM）网格简化。这产生了与输入几何紧密匹配且具有结构化、类艺术家连接性的输出网格。实验表明，与最先进的基于学习方法相比，TriFlow实现了更强的泛化能力和显著提高的拓扑质量，同时Chamfer距离降低了90%，速度提升了8倍。

英文摘要

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
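
The quadric error metric (QEM) that the abstract uses for mesh simplification can be illustrated in its classic form: each plane contributes a 4x4 quadric, and the error of a vertex is its summed squared distance to those planes. The sketch below shows only this standard building block, with hypothetical helper names. The constrained, topology-aware variant guided by the NVF is not reproduced here.

```python
def plane_quadric(a, b, c, d):
    """Outer product p p^T for the plane ax + by + cz + d = 0 with unit normal (a, b, c)."""
    p = (a, b, c, d)
    return [[pi * pj for pj in p] for pi in p]


def add_quadrics(q1, q2):
    """Quadrics are additive: the sum accumulates squared distances to all planes."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]


def quadric_error(q, vertex):
    """v^T Q v with v = (x, y, z, 1): the summed squared distance to the planes in Q."""
    v = (vertex[0], vertex[1], vertex[2], 1.0)
    return sum(v[i] * q[i][j] * v[j] for i in range(4) for j in range(4))


# Two planes meeting along the y-axis: z = 0 and x = 0.
q = add_quadrics(plane_quadric(0.0, 0.0, 1.0, 0.0), plane_quadric(1.0, 0.0, 0.0, 0.0))

error_off_axis = quadric_error(q, (2.0, 5.0, 3.0))  # 3^2 from z = 0 plus 2^2 from x = 0
error_on_axis = quadric_error(q, (0.0, 7.0, 0.0))   # lies on both planes
```

A simplification pass built on this metric would collapse the edge whose merged vertex has the smallest such error.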

2606.20531 2026-06-19 cs.CV 新提交

HypOProto: 用于左心室充盈压分类的双曲序数原型

Victoria Wu, Nima Hashemi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang

发表机构 * The University of British Columbia（不列颠哥伦比亚大学）； Vancouver General Hospital（温哥华综合医院）

AI总结提出HypOProto框架，利用双曲空间中的序数原型对左心室充盈压进行分类，通过冻结的可解释基础模型实现高精度与临床可解释性。

AI中文摘要

超声心动图（echo）是一种广泛用于评估心脏功能的成像模态，左心室充盈压（LVFP）是心力衰竭等疾病的关键生理标志物。将LVFP分为正常和升高类别的标准依赖于多普勒衍生的E/e'比值，该比值依赖于操作者，且在资源有限的环境中通常不可用，这促使了直接从B模式超声推断LVFP的方法。现有的深度学习方法实现了高性能，但大多是黑盒模型，限制了临床可解释性。我们提出了HypOProto，一个基于双曲序数原型的可解释LVFP分类框架，使用冻结的可解释基础模型骨干。HypOProto沿着生理E/e'尺度排列原型，将边界情况放置在双曲面根附近，其中小的角度差异区分相似情况，而正常和升高情况占据向外位置，反映诊断确定性的增加。这种双曲几何编码了临床上有意义的序数关系，并提高了可解释性。我们还引入了一种新的双曲原型角度分离（HyperPAS）损失，强制在双曲空间中实现类间原型分离。HypOProto在保持透明性的同时实现了最先进的性能，并在可视化中突出显示临床相关区域。这项工作代表了超声中LVFP分类的第一个基于原型的框架。我们的代码可在以下网址找到：https://github.com/DeepRCL/HypOProto

英文摘要

Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived E/e' ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological E/e' scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at https://github.com/DeepRCL/HypOProto.
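
The geometric intuition in this abstract, that the same angular gap means less near the hyperboloid root than far from it, can be checked numerically. The sketch assumes the standard Lorentz (hyperboloid) model with curvature -1, which may differ from the paper's exact parameterization. `lift` and `lorentz_distance` are hypothetical helper names.

```python
import math


def lift(v):
    """Map a Euclidean vector v onto the hyperboloid: x0 = sqrt(1 + ||v||^2)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) under the Lorentzian inner product."""
    inner = -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))
    return math.acosh(max(1.0, -inner))  # clamp guards against rounding just below 1


# Two pairs of points, each separated by a 90-degree angle:
# one pair close to the root, one pair far from it.
near = lorentz_distance(lift([0.1, 0.0]), lift([0.0, 0.1]))
far = lorentz_distance(lift([2.0, 0.0]), lift([0.0, 2.0]))
```

Here `near` is much smaller than `far` even though the angular gap is identical. This matches the abstract's placement of borderline cases near the root and confident cases further out.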

2606.19824 2026-06-19 cs.CV cs.AI 新提交

CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images

CSWinUNETR: 医学图像中薄解剖结构的分割

Junho Moon, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出CSWinUNETR通用骨干网络，通过交叉形条带自注意力、循环移位、细节增强多尺度自注意力和稀疏控制动态蛇形卷积，解决薄结构分割中的低对比度、断裂和类不平衡问题，在眼科、神经血管和皮肤科基准上超越现有方法。

Comments Accepted at MICCAI 2026

AI中文摘要

准确分割薄而曲折的解剖结构，如视网膜血管、脑血管和面部皱纹，由于低对比度、频繁断裂和严重的类别不平衡仍然具有挑战性。尽管最近的卷积和基于Transformer的模型提高了性能，但它们常常产生碎片化的预测，并且无法恢复细小的分支。我们提出了CSWinUNETR，一个用于2D和3D薄结构分割的通用骨干网络。它采用交叉形条带自注意力来建模长距离主轴上下文，并结合循环移位以增强条带间的信息交换。为了更好地保留细粒度细节，我们进一步引入了一个细节增强的多尺度自注意力模块，该模块从多分辨率表示中聚合上下文特征。此外，我们提出了稀疏控制动态蛇形卷积，它从稀疏预测的控制点重建可靠的密集曲线核，以更好地跟随曲折的几何形状。在眼科、神经血管成像和皮肤科的四个基准上的大量实验表明，CSWinUNETR在没有任务特定后处理或拓扑感知损失的情况下，始终优于最先进的方法。代码可在以下网址获取：https://github.com/labhai/CSWinUNETR

英文摘要

Accurate segmentation of thin, tortuous anatomical structures, such as retinal vessels, cerebral vasculature, and facial wrinkles, remains challenging due to low contrast, frequent discontinuities, and severe class imbalance. Although recent convolutional and Transformer-based models have improved performance, they often yield fragmented predictions and fail to recover fine branches. We propose CSWinUNETR, a general-purpose backbone for 2D and 3D thin-structure segmentation. It employs cross-shaped stripe self-attention to model long-range principal-axis context and incorporates cyclic shifts to enhance information exchange across stripes. To better preserve fine-grained details, we further introduce a detail-enhanced multi-scale self-attention module that aggregates contextual features from multi-resolution representations. In addition, we propose sparse-control dynamic snake convolution, which reconstructs reliable dense curvilinear kernels from sparsely predicted control points to better follow tortuous geometry. Extensive experiments on four benchmarks across ophthalmology, neurovascular imaging, and dermatology demonstrate that CSWinUNETR consistently outperforms state-of-the-art methods without task-specific post-processing or topology-aware losses. The code is available at https://github.com/labhai/CSWinUNETR.

2606.19838 2026-06-19 cs.CV 新提交

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

OTCHA: 基于最优传输的置信度感知潜在中心对齐用于多视图医学图像分类

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出OTCHA模块，通过最优传输对齐多视图补丁令牌与共享潜在中心令牌，结合置信度门控和部分匹配，消除无关特征，提升多视图医学图像分类鲁棒性。

Comments Accepted at MICCAI 2026

AI中文摘要

多视图成像（如乳腺X线摄影和胸部X线摄影）是临床实践的标准组成部分。然而，医学图像通常未配准，且包含视图特定的伪影或无关背景线索，这些可能掩盖诊断相关发现。许多现有方法直接融合每个视图的表征，使得此类无关内容污染融合嵌入，并在不同视图配置下降低鲁棒性。我们提出OTCHA，一种基于最优传输（OT）的置信度感知潜在中心令牌对齐模块，在融合前细化补丁令牌以用于多视图分类。OTCHA引入一组跨视图共享的可学习潜在中心令牌。对于每个视图，我们计算补丁令牌与中心令牌之间的OT计划，该计划联合考虑特征相似性和几何结构，并通过令牌条件尘埃箱增强OT公式以实现部分匹配并丢弃无关令牌。所得传输计划提供令牌级匹配置信度，该置信度门控中心介导的消息传递，并加权一种新的基于最优传输的表征对齐损失以稳定细化。在三个多视图医学图像数据集上的实验表明，在不同解剖结构和视图配置下，相比竞争基线方法取得一致改进。我们的代码可在以下网址获取：https://github.com/labhai/OTCHA

英文摘要

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.
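
The combination of an OT plan with a dustbin for partial matching can be sketched with plain entropic Sinkhorn iterations. This is a generic illustration under loud assumptions: it uses a single dustbin column with a fixed cost and a fixed mass, whereas the abstract describes token-conditional dustbins. The function name, the cost values, and the confidence formula are hypothetical.

```python
import math


def sinkhorn_with_dustbin(cost, dustbin_cost=1.0, dustbin_mass=1.0 / 3.0,
                          epsilon=0.5, iterations=200):
    """Entropic OT between n patch tokens (rows) and m hub tokens plus one dustbin (columns).

    Returns the transport plan and a per-token confidence: the share of each
    token's mass that was matched to real hubs rather than sent to the dustbin.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel, with one extra dustbin column so a token can stay unmatched.
    kernel = [[math.exp(-c / epsilon) for c in row] + [math.exp(-dustbin_cost / epsilon)]
              for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [(1.0 - dustbin_mass) / m] * m + [dustbin_mass]
    u, v = [1.0] * n, [1.0] * (m + 1)
    for _ in range(iterations):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m + 1)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m + 1)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m + 1)] for i in range(n)]
    confidence = [1.0 - plan[i][m] / row_mass[i] for i in range(n)]
    return plan, confidence


# Hypothetical costs: tokens 0 and 1 each sit close to one hub; token 2 is far from both.
cost = [
    [0.1, 1.5],
    [1.5, 0.1],
    [2.0, 2.0],
]
plan, confidence = sinkhorn_with_dustbin(cost)
```

Token 2 sends most of its mass to the dustbin and receives the lowest confidence. A gate built on this value would therefore let it contribute little to hub-mediated message passing.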

2606.19867 2026-06-19 cs.CV cs.AI 新提交

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

PSCT-Net: 通过可微反投影和注意力引导细化实现几何感知的儿科颅骨CT重建

Dong Yeong Kim, Jaewon Choi, Youmin Shin, Jungyu Lee, Myeongseop Kim, Jinwook Choi, Joo Whan Kim, Young-Gon Kim

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University（首尔大学生物工程跨学科项目）； Department of Transdisciplinary Medicine, Seoul National University Hospital（首尔大学医院跨学科医学系）； Department of Artificial Intelligence, Yonsei University（延世大学人工智能系）； Department of Medicine, Seoul National University College of Medicine（首尔大学医学院医学系）； Healthcare AI Research Institute, Seoul National University Hospital（首尔大学医院医疗人工智能研究所）

AI总结提出PSCT-Net，利用可微反投影建立空间先验，结合注意力引导投影和双向Mamba模块，从稀疏双平面X射线重建3D CT，缓解深度模糊并改善骨边界。

Comments 11 pages, 5 figures

AI中文摘要

计算机断层扫描（CT）对于诊断儿科颅面异常至关重要，但对发育中的解剖结构存在辐射风险。从稀疏双平面X射线重建3D CT提供了一种低剂量替代方案，但问题严重不适定。现有方法采用几何无关的特征提升，将2D特征天真地投影到3D中，缺乏显式空间建模，导致深度模糊和骨边界退化。我们提出PSCT-Net，一种具有可微反投影的几何感知框架。可微反投影建立了空间保真的体积先验，缓解了深度模糊。然后，注意力引导投影（AGP-3D）模块学习2D区域与3D位置之间的非线性体素级对应关系。双向Mamba（BiM-3D）模块以线性复杂度捕获长程体积依赖关系。我们进一步整理了一个私有的机构儿科颅骨CT数据集PedSkull-CT，包含正常和病理病例用于内部评估，弥补了以成人中心和躯干为主的数据集的空白。

英文摘要

Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

2606.19908 2026-06-19 cs.CV 新提交

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

用于内窥镜视频的高斯过程先验变分自编码器

Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

发表机构 * Department of Electromechanics, InViLab, University of Antwerp（安特卫普大学机电工程系InViLab实验室）； Department of Computer Science, University of Manchester（曼彻斯特大学计算机科学系）； Department of Electrical Engineering, Eindhoven University of Technology（埃因霍温理工大学电气工程系）

AI总结提出高斯过程先验变分自编码器（GPVAE），通过时间高斯过程先验替代因子化先验，结合两种可扩展GP近似和镜面反射掩码，实现内窥镜视频缺失帧的插值与修复，在C3VDv2数据集上平均降低RMSE 21.9%。

AI中文摘要

内窥镜视频分析对于胃肠道诊断和计算机辅助干预至关重要，但视频序列经常受到镜面反射、运动伪影和缺失帧的退化影响。这些瞬态损坏会分散临床医生的注意力，降低图像可解释性，并干扰下游任务（如3D重建和导航）。因此，有效的修复需要利用时间连续性而非孤立处理帧的方法。我们提出了一种用于内窥镜视频修复的高斯过程先验变分自编码器（GPVAE）框架，该框架用时间高斯过程先验替代标准因子化潜在先验，从而能够以不确定性感知的重建方式插值缺失帧。该框架结合了内窥镜专用编码器（包括卷积EndoVAE骨干网络和来自GastroNet-5M的预训练Vision Transformer编码器）以及两种可扩展GP近似：层次先验近似（HPA）和稀疏精度近似（SPA）。镜面反射通过基于DUCKNet的掩码流水线处理，该流水线从重建目标中排除损坏像素。在C3VDv2结肠镜数据集上，最佳GPVAE变体相对于匹配的VAE基线，图像重建RMSE平均降低21.9%，最高降低26.1%。下游轨迹RMSE在经典视觉里程计和预训练PoseNet上平均降低12.7%，而每epoch训练时间平均增加27.3%。最后，GP后验提供每帧不确定性估计，反映时间支持并为修复帧提供置信度信号。

PU-UNet：用于医学图像分割的稳定乘法交互

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

发表机构 * Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz（科布伦茨应用科学大学数学、信息学与技术系）； Technical University of Munich（慕尼黑工业大学）

AI总结提出PU-UNet，通过稳定乘积单元残差块在低分辨率阶段实现显式乘法特征交互，在三个医学图像分割数据集上提升Dice和IoU，降低假阳性率。

Comments Accepted to ICANN 2026

AI中文摘要

许多密集预测网络依赖于加性特征变换，并且仅隐式地建模高阶特征交互。乘积单元为乘法特征建模提供了显式机制，但其对数-指数公式可能导致数值不稳定性，这限制了它们在深度密集预测网络中的使用。在这项工作中，我们提出了乘积单元U-Net（PU-UNet），这是一种残差U-Net，它将稳定的乘积单元残差块集成到丰富的低分辨率阶段，用于医学图像分割。所提出的公式结合了平滑正性映射和对数域裁剪，实现了稳定的乘法特征学习，且计算开销可忽略不计。在ISIC 2018、Kvasir-SEG和BUSI上，PU-UNet分别达到了0.942、0.959和高达0.925的Dice分数。与匹配的残差U-Net基线相比，PU-UNet在保持参数、FLOPs和推理延迟几乎不变的情况下，持续提高了Dice和IoU，并将正常BUSI病例的图像级假阳性率从0.077降至零。消融研究表明，这些增益与乘积单元交互相关，在低分辨率放置下最强，并受益于所提出的稳定化设计。这些结果表明，稳定的乘积单元残差学习可以成为通过显式乘法交互增强U-Net风格分割网络的有效方式。

英文摘要

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic--exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
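
The stabilization described in this abstract (a smooth positivity mapping plus log-domain clipping around the logarithmic-exponential form of a product unit) can be sketched for a single scalar unit. The specific choices below are assumptions for illustration, not details taken from the paper: softplus as the positivity mapping, the `eps` offset, and the clipping bound.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); non-negative for all finite x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, weights, eps=1e-6, clip=10.0):
    """Compute prod_i p(x_i)^{w_i} as exp(sum_i w_i * log p(x_i)).

    p(x) = softplus(x) + eps keeps the log argument strictly positive, and the
    weighted log-sum is clipped to [-clip, clip] before exponentiation so the
    output stays within [exp(-clip), exp(clip)].
    """
    log_sum = sum(w * math.log(softplus(x) + eps) for x, w in zip(inputs, weights))
    log_sum = max(-clip, min(clip, log_sum))
    return math.exp(log_sum)


moderate = stable_product_unit([0.0, 0.0], [1.0, 1.0])      # (log 2 + eps) squared
saturated_high = stable_product_unit([1000.0] * 5, [5.0] * 5)  # clipped from above
saturated_low = stable_product_unit([-1000.0], [3.0])          # clipped from below
```

Without the positivity mapping, a zero or negative input would make the logarithm undefined. Without the clipping, large weights and inputs can drive the exponential to overflow.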

2606.20108 2026-06-19 cs.CV cs.LG 新提交

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

EFIQA: 基于解剖先验的可解释眼底图像质量评估

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria（维也纳医科大学医学数据科学中心人工智能研究所）； Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria（维也纳医科大学视网膜人工智能克里斯蒂安·多普勒实验室）

AI总结提出无需质量标签的EFIQA框架，利用解剖先验通过掩膜解剖修复学习正常结构，生成空间质量图，在多个基准上超越监督方法，兼具可解释性。

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

AI中文摘要

图像质量控制对于广泛的下游应用至关重要。基于深度学习的图像质量评估方法通常根据数据集特定的质量标签训练分类器，这继承了两种局限性：（1）泛化能力受限于训练集的标注标准；（2）这些方法无法提供质量下降的空间反馈，缺乏可解释性。在这项工作中，我们提出了EFIQA，一个无需质量相关监督的框架，并通过设计生成空间质量图。EFIQA不是从人工标注的标签中学习“什么是退化”，而是通过利用解剖先验来学习“应该有什么”。对于眼底摄影，我们将其实例化为两阶段方法：首先通过掩膜解剖修复训练无监督异常检测器，以识别缺失血管区域；然后将这一先验知识蒸馏到一个浅层适配器中，将冻结基础模型的特征映射到精确的质量图。外部数据集评估表明，这种无需标签且只需最小适配的方法，在不同质量标准的基准上，与监督方法相比，实现了更好的性能和可解释性，突显了其在现实应用中的潜力。

英文摘要

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

2606.20112 2026-06-19 cs.CV eess.IV new submission

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

像素级残差扩散Transformer：可扩展的3D CT体生成

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

Affiliations: School of Computing and Information Systems, The University of Melbourne

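AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which uses two-stage training (a local MLP blind estimator that separates low-frequency structure, plus a global residual diffusion transformer that models high-frequency residuals) to generate high-fidelity 3D CT volumes, outperforming existing methods on the LIDC-IDRI and RAD-ChestCT datasets.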

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

AI Chinese abstract

由于现有生成模型固有的巨大计算需求和优化困难，生成具有精细细节的高分辨率3D CT体仍然具有挑战性。在本文中，我们提出了像素级残差扩散Transformer（PRDiT），这是一种可扩展的生成框架，可直接在体素级别合成高质量的3D医学体。PRDiT引入了一个两阶段训练架构，包括：1）一个局部去噪器，形式为基于MLP的盲估计器，作用于重叠的3D块，以有效分离低频结构；2）一个全局残差扩散Transformer，采用内存高效注意力来建模和细化整个体上的高频残差。这种从粗到细的建模策略简化了优化，增强了训练稳定性，并有效保留了细微结构，而无需自编码器瓶颈。在LIDC-IDRI和RAD-ChestCT数据集上进行的大量实验表明，PRDiT始终优于最先进的模型，如HA-GAN、3D LDM和WDM-3D，在3D FID、MMD和Wasserstein距离指标上显著降低。

English abstract

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.
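
The coarse-to-fine split described above can be illustrated with a toy decomposition. This is only a sketch under stated assumptions: a 1D signal stands in for a CT volume, and a plain moving average (the hypothetical `local_mean`) stands in for the learned MLP blind estimator. The diffusion transformer that models the residual is not shown.

```python
def local_mean(signal, radius):
    """Low-frequency estimate: mean over an overlapping window around each sample.

    Assumption: a box filter replaces the learned local estimator of the paper.
    """
    coarse = []
    for i in range(len(signal)):
        lo = max(0, i - radius)
        hi = min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        coarse.append(sum(window) / len(window))
    return coarse


def split_coarse_residual(signal, radius=1):
    """Return (coarse, residual) such that coarse + residual equals the signal."""
    coarse = local_mean(signal, radius)
    residual = [s - c for s, c in zip(signal, coarse)]
    return coarse, residual


coarse, residual = split_coarse_residual([0.0, 0.0, 3.0, 0.0, 0.0])
```

The residual carries the sharp detail (here the isolated spike), which is the part a second-stage model would have to generate.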

2606.20143 2026-06-19 cs.CV new submission

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

头颈肿瘤 (HECKTOR) 2025 挑战赛：多模态 PET/CT 中的分割、诊断与预后基准

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Amsterdam UMC; The Netherlands Cancer Institute; Radboud University Medical Centre; University College London; Imperial College London; Shenzhen Technology University; Shenzhen University; Newland Digital Technology; The University of Sydney; Shanghai Jiao Tong University; University Hospital, Nantes; Nantes Université, Centrale Nantes, CNRS, LS2N; Hangzhou Dianzi University; Tsinghua University; Central South University; University of Glasgow; China Mobile System Integration Co., Ltd.; Subtle Medical Inc.; University Hospital, LMU Munich; Munich Center for Machine Learning; BC Cancer Research Institute; HES-SO Valais-Wallis University of Applied Sciences and Arts; Lausanne University Hospital (CHUV); LaTIM, INSERM, UMR 1101, Univ Brest

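AI summary: The HECKTOR 2025 challenge uses multimodal PET/CT and electronic health records to establish a benchmark for automated head and neck cancer analysis, covering three tasks: tumor segmentation, recurrence prediction, and HPV classification. The best algorithms reach a Dice of 0.75, a C-index of 0.66, and a balanced accuracy of 0.56, respectively.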

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

AI Chinese abstract

头颈癌 (HNC) 构成显著的全球健康负担，准确的肿瘤勾画对于有效的放疗计划至关重要。口咽部解剖结构的复杂性，加上肿瘤在影像上的异质性表现，使得手动分割耗时且存在观察者间差异。除分割外，从非侵入性影像预测长期临床结局（如无复发生存期 RFS）和确定人乳头瘤病毒 (HPV) 状态，仍然是具有挑战性但临床价值高的目标。HECKTOR 2025 挑战赛通过使用多模态 PET/CT 影像和电子健康记录，建立了一个用于自动 HNC 分析的全面基准。基于前几届（2020-2022），本次挑战赛采用了扩展的多机构数据集，包含来自全球 10 个中心的 1100 多名患者。参与者需完成三个互补目标：(1) 分割原发肿瘤体积 (GTVp) 和转移淋巴结 (GTVn)，(2) 预测无复发生存期，(3) 分类 HPV 状态。挑战赛吸引了 35 个注册团队，其中 15 个最终提交在保留测试集上进行了评估。表现最佳的算法在分割上达到平均 Dice 相似系数 0.75，在生存预测上达到一致性指数 0.66，在 HPV 分类上达到平衡准确率 0.56。本文对所提交的方法进行了全面分析，评估了它们在不同病变特征上的性能，并讨论了它们在自动化肿瘤学工作流程和决策支持系统中临床转化的意义。

English abstract

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
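
For readers unfamiliar with two of the reported metrics, the sketch below shows how a Dice similarity coefficient and a concordance index are commonly computed. It is a simplified illustration, not the challenge's evaluation code: masks are flat 0/1 lists, tied survival times are skipped, and the function names are hypothetical.

```python
from itertools import combinations


def dice_coefficient(pred, target):
    """Dice similarity coefficient between two binary masks (flat 0/1 lists)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    A pair is comparable when the patient with the shorter time had an observed
    event (event flag 1). Ties in predicted risk count as 0.5.
    """
    concordant, comparable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: pairs with tied times are skipped
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # the earlier time is censored, so the pair is not comparable
        comparable += 1
        if risks[first] > risks[second]:
            concordant += 1.0
        elif risks[first] == risks[second]:
            concordant += 0.5
    return concordant / comparable if comparable else float("nan")
```

A C-index of 0.5 corresponds to random ranking, which puts the reported 0.66 for survival prediction in context.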

2606.20223 2026-06-19 cs.CV q-bio.QM new submission

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

DeepForestVisionV2：面向非洲热带森林相机监测的生态驱动分类扩展

Hugo Magaldi, Theau d'Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson, Claire Auger, Noemie Cappelle, Daniel Cornelis, Raphael Cornette, Tobias Deschner, Gabriel Dubus, Davy Fonteyn, Rosa M. Garriga, Jennifer Hatlauf, Innocent Kasekendi, Raymond Katumba, Aram Kazandjian, Alfred Ngomanda, Stephan Ntie, Simone Pika, Xavier Rufray, Harold Rugonge, John Justice Tibesigwa, Peter van Lunteren, Hadrien Vanthomme, Joeri A. Zwerts, Sabrina Krief

Affiliations: UMR7206 Eco-Anthropologie, MNHN; One Forest Vision initiative; Sebitoli Chimpanzee Project; Centre National de la Recherche Scientifique et Technologique; Institut de Recherche en Ecologie Tropicale; Tacugama Chimpanzee Sanctuary; Biotope; CIRAD; Max Planck Institute for Evolutionary Anthropology; BOKU University; Agence Nationale des Parcs Nationaux du Gabon; Uganda Wildlife Authority; Addax Data Science; Utrecht University

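AI summary: Ecological gradients in African tropical-forest camera-trap monitoring (vertical stratification, scene openness, anthropogenic interfaces) make the original 35-class taxonomy too coarse. DeepForestVisionV2 expands it to 64 classes, improving field utility while keeping the offline workflow.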

Comments Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop

AI Chinese abstract

非洲热带森林中的相机监测正从封闭冠层内部扩展到河岸、空地和公园边缘。在现有的非洲森林相机分类开放工具中，DeepForestVision是唯一提供照片和视频匹配离线工作流的工具，先前研究表明其在可比基准上优于其他基线。然而，它专为封闭冠层、地面森林内部设计，使用35类预测空间，当部署遇到树栖灵长类、鸟类、半水生类群或家畜等人为混杂因素时，该空间变得过于粗糙。我们提出DeepForestVisionV2，这是一个从35类扩展到64类预测空间（61个动物类加上人类、车辆和空白）的生态驱动扩展，旨在解决三个反复出现的部署梯度：垂直分层、场景开放度和人为界面。DeepForestVisionV2保留相同的离线工作流，并在来自多国非洲热带森林项目的1,535,010张照片和243,354个视频上训练。评估结合了一个跨国家裁剪照片验证集（用于评估跨站点和相机设置的鲁棒性）和三个涵盖目标梯度的留出乌干达视频基准。在验证集上，DeepForestVisionV2达到0.86准确率、0.82宏F1和0.81平衡准确率。在部署基准上，尽管分类任务更困难，它仍保持或提高了基线准确率，同时将识别的类群数量从森林内部视频的22个增加到29个，河岸视频从4个增加到9个。在公园边缘用例中，它将准确率从0.62提高到0.86，并将误报从11次减少到0次。这些结果表明，DeepForestVisionV2在保持跨站点、栖息地和相机设置鲁棒性的同时，显著提高了野外实用性。

English abstract

Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.
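
The abstract reports accuracy, macro-F1, and balanced accuracy side by side. The sketch below shows how the latter two are typically computed and why they can differ from plain accuracy on imbalanced classes. The function names are illustrative, and `labels` is assumed to contain every class that appears in either list.

```python
def per_class_counts(y_true, y_pred, labels):
    """Count true positives, false positives and false negatives for each class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1
            counts[p]["fp"] += 1
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (a class with no samples scores 0)."""
    counts = per_class_counts(y_true, y_pred, labels)
    scores = []
    for c in labels:
        tp, fp, fn = counts[c]["tp"], counts[c]["fp"], counts[c]["fn"]
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred, labels):
    """Mean recall over the classes that actually occur in y_true."""
    counts = per_class_counts(y_true, y_pred, labels)
    recalls = []
    for c in labels:
        support = counts[c]["tp"] + counts[c]["fn"]
        if support:
            recalls.append(counts[c]["tp"] / support)
    return sum(recalls) / len(recalls)
```

With `y_true = ["a", "a", "a", "b"]` and `y_pred = ["a", "a", "b", "b"]`, plain accuracy is 0.75, while the two class-averaged metrics weight the rare class `"b"` equally with `"a"`.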

2606.20250 2026-06-19 cs.CV new submission

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

单阶段层次化校正用于弱监督组织病理学分割

Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

Affiliations: VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam; The Computer Vision and Medical AI Lab, VinUniversity, Hanoi, Vietnam; Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

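AI summary: Proposes a single-stage hierarchical rectification framework whose hierarchical feature rectification module produces high-fidelity activation maps directly in a single training run, addressing the error propagation and computational overhead of multi-stage weakly supervised segmentation.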

Comments Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

AI Chinese abstract

现有的计算病理学中的弱监督语义分割方法依赖于多阶段范式：类激活图生成、离线伪掩码细化和全监督再训练。虽然这种解耦方法已被广泛采用，但它存在根本性缺陷。多阶段过程不仅导致高计算训练成本，还遭受误差传播：浅层CNN中的局部纹理偏差产生假阳性伪影，后续细化步骤往往无法纠正。为了通过简单而高效的方法解决这些持续存在的挑战，我们提出了单阶段层次化校正（SSHR）框架。我们的方法不是事后被动地细化CAM，而是在前向传播过程中主动净化中间特征表示。我们引入了一个层次化特征校正模块（HFRM），利用深层全局语义上下文过滤浅层中的局部异常。该机制在单个训练循环内直接生成高保真激活图。在LUAD-HistoSeg和BCSS数据集上的实验表明，SSHR优于最先进的多阶段方法。此外，SSHR将训练时间减少了2到5倍。这种效率降低了计算开销，并加速了大规模组织病理学工作流的临床转化。代码可在以下网址获取：this https URL

English abstract

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

2606.20390 2026-06-19 cs.CV new submission

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

几何感知超像素图变换器结合元数据用于皮肤病变分类

Muhammad Azeem, Tanveer Hussain, Amr Ahmed, Ardhendu Behera

Affiliations: Edge Hill University

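AI summary: Proposes a region-based graph learning framework that models lesions as superpixel graphs, using geometric edge attributes and a metadata context node with an edge-aware graph transformer for multimodal fusion, and achieves better classification performance than existing methods on four public datasets.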

Comments Accepted at MICCAI 2026

AI Chinese abstract

由于病变结构异质性、类内变异大以及良恶性病例间细微视觉差异，从皮肤镜图像进行自动化皮肤癌分类仍然具有挑战性。现有的CNN/ViT流程通常依赖全局或补丁级特征，并常通过后期融合结合患者元数据，这限制了空间基础的多模态推理。我们提出一种新颖的基于区域的图学习框架，将病变显式建模为空间连贯的超像素区域图，这些区域表示为冻结的CNN特征。为了捕捉细粒度的病变排列，我们将区域间几何编码为边属性，并引入一个与所有区域相连的专用元数据上下文节点，从而在同一关系空间内结构化地整合人口统计学/临床变量。节点表示通过我们的边缘感知图变换器进行更新，随后进行注意力驱动的传播，最终生成用于良恶性分类的图级嵌入。在四个公开基准上的实验表明，显式的区域级关系建模和图原生多模态融合相较于现有技术取得了持续改进。因此，我们建立了一种新的以图为中心的视角，其中CNN特征被建模为关系节点，并通过上下文整合得到改进，从而产生更具表现力和鲁棒性的分类结果。

English abstract

Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.
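
The graph construction described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: regions are reduced to centroids, neighbours are chosen by an assumed distance threshold (`max_distance`), and the edge attributes are simply distance and angle. The abstract does not specify these details.

```python
import math


def build_region_graph(centroids, max_distance):
    """Build an undirected graph over superpixel centroids plus one metadata node.

    Returns (metadata_node_index, edges), where edges maps (i, j) to a
    (distance, angle) tuple for region pairs, or to None for metadata links.
    """
    edges = {}
    n = len(centroids)
    for i in range(n):
        for j in range(i + 1, n):
            dx = centroids[j][0] - centroids[i][0]
            dy = centroids[j][1] - centroids[i][1]
            distance = math.hypot(dx, dy)
            if distance <= max_distance:
                # Geometric edge attributes: how far apart and in which direction.
                edges[(i, j)] = (distance, math.atan2(dy, dx))
    metadata_node = n
    for i in range(n):
        # The context node is linked to every region and carries no geometry.
        edges[(i, metadata_node)] = None
    return metadata_node, edges
```

Connecting the metadata node to every region lets patient-level variables reach each region in a single message-passing step, which is the structural difference from late fusion that the abstract highlights.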

2606.20449 2026-06-19 cs.CV new submission

InfantFace: Detecting infant faces in neonatal clinical environments

InfantFace：新生儿临床环境中的婴儿面部检测

Abdullah Bin-Obaid, Maria M. Cobo, Rebeccah Slater, Lionel Tarassenko, Mauricio Villarroel

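AI summary: To handle occlusion and lighting problems in neonatal clinical environments, proposes a one-stage YOLOv11m-based face detection model that is pretrained on several public datasets and then fine-tuned on clinical data, raising AP50 from 0.87 to 0.96.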

Comments 32 pages, 7 figures, 4 tables; supplementary information included

AI Chinese abstract

新生儿面部的可靠定位是基于视频摄像头的非接触式评估的第一步，例如疼痛和痛苦相关的面部表情分析、疼痛评分、心肺信号提取和呼吸停止警报。然而，新生儿临床环境中仍存在重大挑战。杂乱的背景、光照变化和不良照明条件会降低面部检测模型的准确性。临床干预、监测设备以及在某些情况下的医疗设备可能会遮挡面部，使视觉评估变得困难。我们提出了一种基于YOLOv11m的单阶段模型，专门用于新生儿临床环境中的婴儿面部检测。我们结合了多个公开数据集（VGGFace2、CelebA、FDDB、WIDER FACE）来训练和评估我们提出的模型。然后，我们在一个新生儿研究数据集上对模型进行了微调，该数据集包含来自114个记录会话的228个视频，涉及113名独立婴儿。在微调之前，我们的模型达到了0.87的AP50，超过了三个最先进的通用面部检测器的性能。在临床领域适应后，性能进一步提高到0.96的AP50。由于缺乏公开的新生儿数据集，评估不同数据集上的面部检测性能仍然是一个挑战。优先创建此类数据集，同时在其创建和使用中维护适当的隐私保护措施和伦理标准，将极大地支持该领域的进一步进展。

English abstract

Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.
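
AP50 counts a detection as correct when its box overlaps a ground-truth box with an intersection over union (IoU) of at least 0.5. The sketch below shows that matching step only, using a greedy score-ordered assignment. The helper names are illustrative, and the precision-recall averaging that turns these flags into an AP value is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def true_positive_flags(detections, ground_truth, threshold=0.5):
    """Mark each detection (score, box) as TP or FP, in descending score order.

    Each ground-truth box can be matched at most once, so duplicate
    detections of the same face count as false positives.
    """
    matched = set()
    flags = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags
```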

2606.20477 2026-06-19 cs.CV cs.CL cs.LG new submission

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

Affiliations: Computer Vision Group, University of Freiburg, Germany; Department of Radiology, Medical Center - University of Freiburg, Germany; CRIION-AI Lab, Freiburg, Germany

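AI summary: Introduces RefRad2D, a large-scale bilingual dataset whose spatial grounding data is generated with LLMs and automated segmentation, and trains the RadGrounder model to jointly perform report generation, VQA, and spatial grounding, with competitive results on external benchmarks.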

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

AI Chinese abstract

我们研究了如何在没有手动空间标注的情况下，为放射学训练具有视觉定位能力的视觉-语言模型（VLM）。我们引入了RefRad2D，这是一个大规模的双语（德语/英语）数据集，包含来自临床实践的120万对CT和MR图像-文本对，并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准（Slake，VQA-RAD）上，RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集，相比于仅在下游数据集上微调，提高了开放式VQA的性能，显示了数据集的迁移性。关键在于，添加定位监督不会降低语言质量，从而在不牺牲VQA性能的情况下实现空间可验证的输出。

English abstract

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.
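
One plausible way to obtain box-level grounding targets without manual annotation is to take the tight bounding box of an automatically produced segmentation mask. The abstract does not say how the boxes are derived, so the sketch below is an assumption for illustration, and `mask_to_bbox` is a hypothetical helper.

```python
def mask_to_bbox(mask):
    """Tight bounding box (x_min, y_min, x_max, y_max) of a binary 2D mask.

    The mask is a list of rows of 0/1 values. Returns None for an empty mask.
    Coordinates are inclusive pixel indices.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return (min(cols), min(rows), max(cols), max(rows))


organ_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
box = mask_to_bbox(organ_mask)
```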

2606.19939 2026-06-19 cs.CV new submission

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

DiffMath：面向手写数学表达式生成的符号与图感知潜在扩散Transformer

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

Affiliations: South China University of Technology; Huawei Technologies Co., Ltd.

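AI summary: Proposes the DiffMath framework, which uses the hierarchical structure of LaTeX as a prior and combines a relational abstract syntax tree, structure-preserving latent representations, and conditional denoising to generate structurally consistent handwritten mathematical expressions without positional supervision.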

AI Chinese abstract

手写数学表达式生成（HMEG）由于数学表达式的复杂二维布局和长程结构依赖而具有挑战性。现有方法通常依赖显式空间监督，如符号级边界框，这导致高标注成本并限制可扩展性。在这项工作中，我们提出了DiffMath，一个符号与图感知的潜在扩散框架，利用LaTeX固有的层次结构作为结构先验，消除了位置监督的需求。首先，我们设计了关系抽象语法树（RelAST），一种面向生成的表示，将MathML树蒸馏为紧凑的三元组序列[S, R, D]，其中每个标记直接编码符号身份、空间关系或嵌套深度。其次，我们引入了MathVAE，通过符号感知和关系感知的感知正则化学习保持结构的潜在表示，确保潜在空间同时捕获字符语义和空间拓扑。第三，MathDiT在这个结构化潜在空间中进行条件去噪，并通过自适应层归一化（AdaLN）进一步由全局符号计数先验引导，以改善结构一致性。实验表明，DiffMath生成结构一致的手写表达式，在现有方法上实现了优越性能，并通过合成数据增强提高了下游OCR模型的准确性。

English abstract

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.
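
To give a feel for what a sequence of [S, R, D] triplets might look like, the sketch below flattens a small symbol layout tree into (symbol, relation, depth) tuples. The tree format, the relation names (`root`, `right`, `sup`), and the rule that only non-`right` relations increase nesting depth are all assumptions made for illustration. The paper's actual RelAST construction from MathML is not described in the abstract.

```python
def flatten_layout_tree(node, relation="root", depth=0):
    """Depth-first flattening of a layout tree into (symbol, relation, depth) triplets.

    A node is (symbol, children), where children is a list of
    (relation_to_parent, child_node) pairs.
    """
    symbol, children = node
    triplets = [(symbol, relation, depth)]
    for child_relation, child in children:
        # Assumption: moving right stays on the same level, while other
        # relations such as superscripts nest one level deeper.
        child_depth = depth if child_relation == "right" else depth + 1
        triplets.extend(flatten_layout_tree(child, child_relation, child_depth))
    return triplets


# Hypothetical tree for the expression x^{2} + 1
expression = ("x", [
    ("sup", ("2", [])),
    ("right", ("+", [("right", ("1", []))])),
])
sequence = flatten_layout_tree(expression)
```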

2606.19617 2026-06-19 cs.CV cs.GR cs.LG new submission

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

GB-LSR：一种具有单一全局带宽的快速局部光谱图像表示，用于连续重建和超分辨率

Max Shad, Naeem Khoshnevis

Affiliations: Harvard University

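AI summary: Proposes GB-LSR, a local spectral representation with a single global bandwidth, in which a shared convolutional encoder predicts truncated Fourier basis coefficients for continuous image reconstruction. On Kodak and other benchmarks it improves PSNR by 2.8-3.6 dB at roughly one-quarter of the slowest baseline's inference cost.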

AI Chinese abstract

我们提出GB-LSR（全局带宽局部光谱表示），一种用于连续图像重建的固定网格局部光谱表示。图像域被划分为非重叠的方形块，每个块携带从共享卷积编码器特征预测的截断傅里叶基系数。一个可训练的标量带宽在所有块和图像中全局共享，在任何连续坐标处的重建是固定大小的基收缩，其成本与图像大小无关。我们研究了三种带宽处理变体：可训练的全局标量（主要）、固定的全局标量和逐块带宽场。在Kodak、Set14和Urban100上的标准化原生重建基准测试中，主要变体在匹配预算的LIIF/LTE/WIRE重实现上PSNR高出2.8-3.6 dB，LPIPS低0.11-0.15，同时推理成本约为最慢基线的四分之一。经验上，单个全局标量就足够了：逐块自适应带宽替代方案在闭式局部性诊断或端到端消融中均未带来改进。在独立的任意尺度超分辨率（ASR）扩展中，GB-LSR在标准SR协议下实现了具有竞争力的PSNR-Y，并在x4时比LIIF-RDN快1.44倍，比LTE-SwinIR快3.25倍；在同一扩展中，一个变体在训练和评估时不使用四角局部集成平均，速度提升1.77倍，峰值内存降低35%，PSNR变化可忽略，而将RDN编码器从64通道扩展到96通道时，PSNR略有提升，速度提升1.58倍，峰值内存降低31%。原生重建声明限定于匹配预算的摊销协议，ASR声明限定于独立的标准SR协议。

English abstract

We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
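
The claim that reconstruction is a fixed-size basis contraction can be made concrete with a 1D toy version. The sketch assumes a cosine/sine basis whose frequencies are integer multiples of one shared scalar `bandwidth`. The exact basis and parameterisation used in the paper are not given in the abstract, so treat the names and the formula as illustrative.

```python
import math


def reconstruct_1d(x, coeffs, patch_size, bandwidth):
    """Evaluate a patch-wise truncated Fourier representation at coordinate x.

    coeffs[i] is a (cos_coeffs, sin_coeffs) pair for patch i. The work done
    per query depends only on the number of basis terms, not on how many
    patches (that is, how large the signal) there are.
    """
    index = min(int(x // patch_size), len(coeffs) - 1)
    u = (x - index * patch_size) / patch_size  # local coordinate within the patch
    cos_coeffs, sin_coeffs = coeffs[index]
    value = 0.0
    for k, (a, b) in enumerate(zip(cos_coeffs, sin_coeffs)):
        phase = bandwidth * k * u  # one global bandwidth shared by all patches
        value += a * math.cos(phase) + b * math.sin(phase)
    return value
```

Because `x` is a real number, the same coefficients can be queried on any sampling grid, which is what makes the representation usable for arbitrary-scale super-resolution.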

2606.19901 2026-06-19 cs.CV new submission

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

基于语义调制的线性递归单元用于图像超分辨率

Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

Affiliations: Korea University; DGIST

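AI summary: Proposes a linear recurrent network with a semantic modulation unit that achieves efficient image super-resolution through modulation, spatial classification, and prototype enhancement, outperforming existing methods.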

Comments Accepted to CVPR 2026 Findings

2606.19938 2026-06-19 cs.CV cs.AI new submission

Triangular Consistency as a Universal Constraint for Learning Optical Flow

三角一致性作为光流学习的通用约束

Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

Affiliations: Louisiana State University; University of California, Los Angeles; Yale University

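AI summary: Proposes a triangular consistency constraint that composes two optical flows to induce a third and enforces consistency among the three. It applies across network architectures, supervision types, and datasets, and improves performance in supervised, unsupervised, and transfer learning.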

Comments Accepted by ECCV 2026

AI Chinese abstract

我们提出三角一致性作为光流的第一性原理约束，该约束与网络架构、监督类型和数据集无关，适用于图像对和多帧设置。这个简单但强大的约束是通过组合两个光流来诱导第三个光流，并强制三者之间的一致性。组合的光流可能来自：(i) 图像对，产生循环一致性；(ii) 多个视频帧，通过时间链产生更长范围的运动；或 (iii) 图像对与受控合成变换相结合，这成为数据增强。这种三角一致性引入的计算开销可忽略不计，且不需要额外的标注。由于它直接源自光流的几何特性，不依赖于模型特定的假设，因此可作为光流训练的“通用”即插即用组件。实验表明，在监督、无监督和迁移学习设置中均有一致的改进。

English abstract

We propose triangular consistency as a first-principled constraint for optical flow, which is agnostic to network architecture, supervision type, and dataset, and applies to both image-pair and multi-frame settings. The constraint is simple but powerful: compose two flows to induce a third flow and enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which acts as data augmentation. This triangular consistency introduces negligible computational overhead and requires no additional annotations. Since it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a "universal" plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
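
The composition at the heart of the constraint can be written down directly. In the sketch below a flow is a function from a 2D point to its displacement, which sidesteps the interpolation a dense flow field would need. The L1 penalty and the function names are assumptions for illustration rather than the authors' loss.

```python
def compose(flow_ab, flow_bc):
    """Flow from frame A to frame C induced by chaining A->B and B->C."""
    def flow_ac(point):
        dx1, dy1 = flow_ab(point)
        # The second flow is sampled where the point lands in frame B.
        landed = (point[0] + dx1, point[1] + dy1)
        dx2, dy2 = flow_bc(landed)
        return (dx1 + dx2, dy1 + dy2)
    return flow_ac


def triangular_error(flow_ab, flow_bc, flow_ac, points):
    """Mean L1 gap between the composed flow and the directly estimated A->C flow."""
    composed = compose(flow_ab, flow_bc)
    total = 0.0
    for point in points:
        cx, cy = composed(point)
        ax, ay = flow_ac(point)
        total += abs(cx - ax) + abs(cy - ay)
    return total / len(points)
```

When frame C is frame A again, the target flow is zero everywhere and the same error reduces to the cycle-consistency case (i) in the abstract.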

2606.19961 2026-06-19 cs.CV new submission

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

解决潜在扩散模型中RGB到SWIR图像翻译的细节瓶颈

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

Affiliations: imec; imec-IPI-Ghent University; Yale University

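AI summary: To address the loss of spatial detail in latent diffusion models for RGB-to-SWIR image translation, proposes two lightweight improvements, a source-conditioned autoencoder and a learnable guidance encoder. On driving scenes they improve detection mAP by up to 2x, and by up to 3.4x on small objects, while achieving state-of-the-art FID.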

AI Chinese abstract

潜在扩散模型（LDM）能够高效地进行图像到图像的翻译，但在压缩过程中丢弃了精细的空间细节，从而降低了下游感知任务的性能。我们识别出两个瓶颈：自编码器（丢失空间信息）和条件路径（通过朴素下采样进一步退化源信号）。我们提出了两种轻量级、与骨干网络无关的修复方法：源条件自编码器（SCAE），通过跳跃连接将高分辨率源特征注入解码器；以及可学习引导编码器（LGE），用学习到的条件信号替代朴素下采样。在驾驶场景的RGB到SWIR翻译任务上，使用两种去噪骨干网络（U-Net和DiT）进行评估，我们的方法在潜在扩散基线基础上将检测mAP提升了高达2倍，小目标（COCO-small，<32^2像素^2）上提升高达3.4倍，同时达到了最先进的FID。我们进一步表明FID与检测性能相关性较差，从而激励多轴评估。结果零样本泛化到公开的RASMD基准。我们将公开发布带有标注的测试数据、所有检查点和训练代码。

English abstract

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.

2606.19985 2026-06-19 cs.CV new submission

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

Affiliations: Johannes Kepler University

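AI summary: Proposes a framework that combines light field integration with vision-language models, recovering occluded scenes through multi-view fusion and semantic priors, with state-of-the-art performance on synthetic and real data.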

AI Chinese abstract

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战，特别是在自然环境中，密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架，该框架结合了光场积分（LFI）的可见性恢复能力和视觉语言模型（VLM）的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡，生成初始的可见性增强表示。然后，引入VLM作为条件语义先验，在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影，我们引入了一种多样本融合策略，将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明，该方法达到了最先进的性能，在四个合成光场基准场景（4-Syn）上取得了最高的平均SSIM，并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性，可应用于搜索救援和探索性机器人导航。

English abstract

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.
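
The abstract does not say how the multiple generated hypotheses are aggregated. One common choice, shown here purely as an assumption, is a per-pixel median, which discards a detail that appears in only a minority of samples (a typical signature of hallucination).

```python
from statistics import median


def fuse_hypotheses(hypotheses):
    """Per-pixel median over several equally sized grayscale images.

    Each hypothesis is a list of rows of numeric pixel values.
    """
    height = len(hypotheses[0])
    width = len(hypotheses[0][0])
    return [
        [median(h[r][c] for h in hypotheses) for c in range(width)]
        for r in range(height)
    ]


# Three hypothetical restorations of a 1x2 image; the third contains an outlier.
fused = fuse_hypotheses([[[1, 10]], [[2, 10]], [[9, 11]]])
```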

2606.15648 2026-06-19 cs.CV new submission

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

Affiliations: The Hong Kong Polytechnic University

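AI summary: Proposes a transfer learning method that needs no paired labels. It decomposes underwater image enhancement into global color correction, haze removal, and background noise suppression, and supervises each step with cross-domain priors to achieve physically consistent enhancement.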

Journal ref Information Fusion (2026): 104557

DOI: 10.1016/j.inffus.2026.104557

AI中文摘要

水下图像在不同水质条件下拍摄，导致复杂的退化，包括颜色偏差、低对比度和模糊效应。最近，基于学习的方法已显示出在水下图像增强（UIE）方面的潜力。然而，以往的大多数工作侧重于训练策略或网络设计，使增强结果与数据集中的标签良好对齐，忽略了标签是从先前UIE方法的增强结果中选取的，这些伪标签存在噪声。因此，它们的模型性能在一定程度上并不令人满意。然而，收集水下图像的真实标签具有挑战性。在这项工作中，我们提出了一种基于迁移学习的UIE方法，该方法不需要水下图像具有成对的噪声或真实标签来学习。相反，首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后，利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式，通过迁移学习实现了一种新颖的UIE，并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明，我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能，并有效提升了下游视觉任务，显著优于基准方法。项目仓库：https://github.com/Haru2022/P2-UIE。

英文摘要

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on training strategies or network design that align the enhanced result with the labels in datasets, ignoring that these labels are selected from the enhanced results of previous UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of these models is not fully satisfactory. Collecting true labels for underwater images, meanwhile, is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.19565 2026-06-19 cs.CV 新提交

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Mix-QVLA：任务证据感知的视觉-语言-动作模型混合精度量化

Navin Ranjan, Andreas Savakis

发表机构 * Rochester Institute of Technology（罗彻斯特理工学院）

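AI summary: Proposes Mix-QVLA, a task-evidence-aware mixed-precision post-training quantization framework that substantially reduces the memory and compute cost of VLA models while preserving task performance, reaching a 4.1 GB memory footprint and a 1.52x speedup on LIBERO.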

AI中文摘要

我们提出Mix-QVLA，一种针对VLA模型的任务证据感知混合精度PTQ框架。Mix-QVLA将每个量化变体锚定到全精度动作令牌参考决策，并评估量化是否在关键VLA功能边界上保留了任务相关证据。它从边界激活计算归一化的梯度加权任务证据图，并使用证据质量和归因分布失真比较全精度和量化图，捕捉决策支持证据的强度和分配变化。一个软瓶颈目标将边界级退化聚合为层敏感度分数。Mix-QVLA进一步在整个任务执行过程中建模敏感度，捕捉层重要性的阶段依赖变化，而不是假设固定的敏感度分布。由此产生的证据和时间感知分数指导在模型大小和BitOps预算下的混合精度位分配。在OpenVLA风格策略上的广泛评估表明，Mix-QVLA改善了低比特VLA部署的精度-效率权衡。在LIBERO上，Mix-QVLA将OpenVLA-OFT内存从15.4 GB减少到4.1 GB，保留了96.3的平均成功率（BF16模型为97.1），并实现了1.52倍的推理加速。

英文摘要

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
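To make "sensitivity scores guide mixed-precision bit allocation under a model-size budget" concrete, here is a minimal greedy sketch. It is not the Mix-QVLA allocator: the layer names, sizes, sensitivities, the two bit-widths, and the `allocate_bits` helper are all hypothetical, and the BitOps budget mentioned in the abstract is left out.

```python
def allocate_bits(layers, budget_bits, low=4, high=8):
    """Greedy allocation: start every layer at `low` bits, then upgrade
    layers to `high` bits in order of sensitivity per extra bit spent,
    as long as the total stays within `budget_bits`.

    layers: list of (name, num_params, sensitivity) tuples.
    Returns (allocation dict, total bits used).
    """
    alloc = {name: low for name, _, _ in layers}
    used = sum(n * low for _, n, _ in layers)
    if used > budget_bits:
        raise ValueError("budget too small even at the lowest precision")
    ranked = sorted(
        layers,
        key=lambda layer: layer[2] / (layer[1] * (high - low)),
        reverse=True,
    )
    for name, n, _ in ranked:
        extra = n * (high - low)
        if used + extra <= budget_bits:
            alloc[name] = high
            used += extra
    return alloc, used


# Hypothetical layers: (name, parameter count, sensitivity score).
layers = [
    ("vision_proj", 100, 0.9),
    ("llm_block_0", 400, 0.4),
    ("action_head", 50, 0.8),
]
alloc, used = allocate_bits(layers, budget_bits=2800)
```

With these made-up numbers the small, sensitive `action_head` and `vision_proj` layers are kept at 8 bits, while the large, less sensitive `llm_block_0` stays at 4 bits.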

2606.19736 2026-06-19 cs.CV 新提交

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

VFACamou: 视图融合的对抗性伪装用于环境自适应物理规避

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

发表机构 * State Key Laboratory of Intelligent Vehicle Safety Technology（智能汽车安全技术国家重点实验室）； School of Cyber Science and Engineering, Huazhong University of Science and Technology（华中科技大学网络空间安全学院）； School of Computer Science and Technology, Huazhong University of Science and Technology（华中科技大学计算机科学与技术学院）； School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Hebei Energy College of Vocation And Technology（河北能源职业技术学院）

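AI summary: Proposes an end-to-end framework that combines UV volume rendering with a diffusion texture generator, and introduces an illumination color consistency estimator and a multi-scale dynamic training strategy, to generate wearable adversarial patterns that attack stably in the physical world under the changing viewpoints and lighting of scenarios such as UAV reconnaissance.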

Comments Accepted by ICME 2026

AI中文摘要

Mamba在建模长视觉序列方面表现出强大的效率。然而，当将token缩减应用于结构增强的Mamba变体时，这些模型会出现严重的性能崩溃。我们将这种退化归因于现有缩减方法在空间上的不可知性，这违反了选择性扫描机制所需的二维结构前提。在这项工作中，我们提出了STORM，一个空间感知的token缩减框架，旨在在压缩过程中保持结构完整性。STORM将缩减重新表述为对空间单元的结构化操作，强制局部约束以保持网格拓扑和邻域一致性。作为一个即插即用模块，STORM无需任何训练即可为现有缩减流程赋予明确的空间感知能力。实验结果表明，STORM在无训练设置下，在多种视觉Mamba骨干网络上实现了最先进的剪枝精度。值得注意的是，STORM在VMamba上实现了显著的精度恢复，在top-1准确率上比先前方法高出63.3%。同时，STORM在PlainMamba上仅造成1.0%的准确率下降，达到了与ViT相当的性能。

英文摘要

Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.

2606.19934 2026-06-19 cs.CV cs.AI 新提交

Speeding up the annotation process in semantic segmentation industrial applications

加速工业应用中的语义分割标注过程

Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

发表机构 * Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence, DaSCI, University of Granada（格拉纳达大学计算机科学与人工智能系，安达卢西亚数据科学与计算智能研究所，DaSCI）； Department of Computer Science and Automatic Control, National Distance Education University (UNED)（国立远程教育大学计算机科学与自动控制系）

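AI summary: Uses unsupervised algorithms to cut annotation time for semantic segmentation in materials science from 170 hours to 37 hours (a 78% reduction), and releases the largest public steel microstructure segmentation dataset.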

AI中文摘要

当前的机器学习模型通常需要大量且标注良好的数据集。然而，标注过程常常成为瓶颈，随着复杂性的增加，人为错误的机会也更高。在此背景下，本文旨在利用无监督算法提高工业材料科学中复杂语义分割问题的数据标注效率。以往的研究量化了标注时间，并探索了无监督方法。但据我们所知，这是首次量化无监督算法加速标注过程程度的研究。我们旨在验证这一繁琐过程可以加速的程度，重点关注涉及高分辨率图像每个像素标注的语义分割任务，例如材料科学中的微观结构表征挑战。具体来说，我们证明通过使用无监督计算机视觉算法，标注过程所需的时间可以从170小时减少到37小时，实现了约78%的减少。我们处理的数据集包括尺寸为1280x959和960x703的大图像，这进一步增加了标注任务的复杂性。尽管存在这些挑战，我们创建并共享了迄今为止最大的公开钢微观结构分割数据集，在MIT许可下提供，并具有永久DOI，为该领域贡献了一个完全标注的高分辨率数据集。此外，这是首次将从头开始标注的时间（以往研究中的常见方法）与使用这些无监督算法作为预标注步骤时的标注时间进行比较。此外，我们提供了一个在此数据集上训练的深度学习模型，该模型经过领域专家验证，并部署在工业环境中，作为该公共数据集的初始基准。

英文摘要

Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to higher chances of human errors. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Some previous studies have quantified labeling time, and others have explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.
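The reported reduction can be checked directly from the two figures in the abstract. The variable names below are ours; only the 170-hour and 37-hour values come from the source.

```python
hours_from_scratch = 170        # labeling time without pre-annotation
hours_with_preannotation = 37   # labeling time with unsupervised pre-annotation

saved = hours_from_scratch - hours_with_preannotation
reduction = saved / hours_from_scratch
speedup = hours_from_scratch / hours_with_preannotation

print(f"{saved} hours saved, {reduction:.1%} less time, {speedup:.1f}x faster")
# prints "133 hours saved, 78.2% less time, 4.6x faster"
```

The 78.2% figure matches the "approximately 78%" stated in the abstract and corresponds to roughly a 4.6-fold speedup.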

2606.19965 2026-06-19 cs.CV cs.AI 新提交

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE：多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

发表机构 * Sun Yat-sen University（中山大学）； Shaanxi Normal University（陕西师范大学）

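AI summary: Proposes the ROSE benchmark, which holds the visual scene fixed while varying region constraints and symbolic outputs, to test whether multimodal large language models can turn the same visual evidence into the action each context requires. Performance drops by up to 44.5 percentage points, exposing a perception-to-action bottleneck.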

Comments 29 pages, 11 figures

AI中文摘要

多模态大语言模型（MLLMs）越来越被期望基于视觉信息采取行动，然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动？为了回答这个问题，我们引入了ROSE（Reference-conditioned Oddity and Symbolic Execution），一个受控基准，它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务，ROSE测试模型是否能够推断出隐含的多数参考，并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中，从计数导向任务到区域条件行动的性能下降高达44.5个百分点，而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在，即使同一模型在这些场景和区域上返回正确的计数，而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失，揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

英文摘要

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.20095 2026-06-19 cs.CV 新提交

Stitching and dimensionality effects on large artificially generated volume datasets

拼接和维度对大规模人工生成体数据集的影响

Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

发表机构 * GFZ Helmholtz-Zentrum für Geoforschung（亥姆霍兹地球科学中心）； Max Delbrück Center for Molecular Medicine in the Helmholtz Association（亥姆霍兹协会马克斯·德尔布吕克分子医学中心）； Helmholtz Imaging（亥姆霍兹成像）； Humboldt-Universität zu Berlin（柏林洪堡大学）； University of Potsdam（波茨坦大学）

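AI summary: Studies how stitching artifacts from deep-learning generation of large images affect style transfer, and compares 2D with 3D models. FID fails to detect subtle artifacts that hurt downstream tasks; 3D models are slightly better but computationally expensive.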

AI中文摘要

通过深度学习生成大图像需要对输入数据进行分块以适应硬件内存限制，然后组装输出块，这一过程在相邻块边界不对齐时可能引入拼接伪影。虽然已知这些伪影会影响分割任务，但它们对风格迁移生成模型的影响尚不清楚。我们使用在冷冻电镜数据集上训练的cycleGAN模型，研究了三种拼接方法和两种块维度（2D vs 3D）。我们评估了感知质量和下游线粒体分割的性能。主要发现如下：（1）FID分数无法检测到显著影响下游分割性能的细微拼接伪影；（2）具有无伪影拼接的3D模型在下游任务上略优于2D模型，尽管改进勉强证明计算成本合理；（3）2D模型由于更大的批量大小而训练更稳定。此外，我们证明从三个正交方向集成预测可以改善低质量体，但对高质量输出无益。这些结果表明，在大型科学数据集上最大化生成模型性能需要仔细考虑和减轻拼接伪影，并且仅凭感知指标不足以评估生物医学成像中的域适应质量。

英文摘要

Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.
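To illustrate why patch borders matter, the sketch below blends overlapping 1D patches with triangular weights, one common way to soften seams. The abstract does not name the three stitching approaches the authors compared, so this is a generic illustration; `tent` and `stitch` are hypothetical helpers, and the patches are assumed to cover every output position.

```python
def tent(n):
    """Triangular weights that are largest at the patch centre."""
    return [min(i + 1, n - i) for i in range(n)]


def stitch(patches, starts, length):
    """Weighted-average overlapping 1D patches back into one signal.

    Assumes every output index is covered by at least one patch.
    """
    total = [0.0] * length
    weight = [0.0] * length
    for patch, start in zip(patches, starts):
        w = tent(len(patch))
        for i, value in enumerate(patch):
            total[start + i] += w[i] * value
            weight[start + i] += w[i]
    return [t / w for t, w in zip(total, weight)]


# Two model outputs that disagree (1.0 vs 3.0) and overlap on indices 2-3.
patch_a = [1.0, 1.0, 1.0, 1.0]
patch_b = [3.0, 3.0, 3.0, 3.0]
blended = stitch([patch_a, patch_b], starts=[0, 2], length=6)
```

Simply concatenating the two outputs would jump from 1.0 to 3.0 at the seam; the weighted blend instead ramps through about 1.67 and 2.33 across the overlap.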

2606.20100 2026-06-19 cs.CV 新提交

GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

GEN-Guard：纠正可部署联邦手术AI的泛化失败

Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy

发表机构 * University of Strasbourg, CNRS, INSERM, ICube, UMR7357（斯特拉斯堡大学，法国国家科学研究中心，法国国家健康与医学研究院，ICube实验室，UMR7357）； Bioimage Analysis Center, Fondazione Policlinico Universitario Agostino Gemelli IRCCS（生物图像分析中心，阿戈斯蒂诺·杰梅利大学综合医院基金会IRCCS）； Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, University of Milan（米兰IRCCS卡格兰达基金会马焦雷综合医院，米兰大学）； Monaldi Hospital, AORN dei Colli（莫纳尔迪医院，AORN dei Colli）

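AI summary: Proposes the GEN-Guard framework, which detects performance leakage through client-blocked evaluation and applies feature-level correction through disagreement-aware distillation, improving the cross-institutional generalization of federated surgical AI.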

Journal ref Int J Comput Assist Radiol Surg. 2026 Jun 14

DOI: 10.1007/s11548-026-03713-0

AI中文摘要

联邦学习（FL）在手术视频AI中实现了协作模型训练，无需共享敏感数据。然而，标准评估实践——仅基于参与医院的验证数据选择“最佳”全局模型——可能导致次优的部署选择。我们将这种关键失败模式识别为性能泄漏，即所选模型过拟合内部联邦数据，无法泛化到未见机构。我们提出GEN-Guard，一个实用的后处理框架，用于检测和纠正联邦手术AI中的泛化失败。它集成了通过客户端阻塞评估（CBE）进行泛化检测，该方法在隔离的客户端分布上验证性能以防止性能泄漏，以及通过分歧感知蒸馏（DAD）进行泛化纠正，该方法学习自适应的特征级校正以实现跨机构鲁棒性。两个组件在标准FL收敛后运行，同时为零样本适应未见环境提供鲁棒支持。我们首先量化了性能泄漏的严重性，观察到在标准评估下模型选择失败（MSF）超过80%。GEN-Guard在两个多中心临床挑战上进行了评估：腹腔镜胆囊切除术中的手术阶段识别和结肠镜中的息肉分割。在两个数据集上，GEN-Guard一致地纠正了这些失败，将联邦内F1分数提高了最多2个点，未见机构性能提高了最多3个点，最差情况机构性能提高了3-9个点。性能泄漏是联邦手术AI中一个系统性且以前未被充分认识的风险。GEN-Guard为检测和纠正此类失败提供了实用解决方案。通过提高跨机构鲁棒性和零样本泛化，它增强了FL在真实世界手术部署中的可靠性。

英文摘要

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.
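The "performance leakage" failure mode can be shown with a toy model-selection example. This is only a loose illustration of the idea of validating on a client held out of the federation; it is not the paper's CBE procedure, and the hospital names, checkpoint names, and F1 values are invented.

```python
def pick_best(scores, clients):
    """Return the checkpoint with the highest mean score over `clients`."""
    return max(
        scores,
        key=lambda ckpt: sum(scores[ckpt][c] for c in clients) / len(clients),
    )


# Hypothetical F1 per checkpoint and hospital. hospital_c is assumed to be
# blocked from training, so it stands in for an unseen institution.
scores = {
    "round_40": {"hospital_a": 0.84, "hospital_b": 0.82, "hospital_c": 0.80},
    "round_80": {"hospital_a": 0.90, "hospital_b": 0.88, "hospital_c": 0.71},
}

standard_choice = pick_best(scores, ["hospital_a", "hospital_b"])
blocked_choice = pick_best(scores, ["hospital_c"])
```

Selecting on the participating hospitals favours `round_80`, which is the worse model on the held-out hospital; selecting on the blocked client picks `round_40` instead.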

2606.20455 2026-06-19 cs.CV 新提交

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

PCFootprint：用于从航空LiDAR点云中提取矢量化建筑足迹的大规模数据集与基准

Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu

发表机构 * School of Architecture and Urban Planning, Shenzhen University（深圳大学建筑与城市规划学院）

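AI summary: Presents PCFootprint, the first large-scale dataset for building footprint extraction from airborne laser scanning point clouds, with 33,000 tiles and a cross-domain test set; evaluating mainstream methods reveals the challenges posed by complex geospatial environments.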

Comments 14 pages, 9 figures

AI中文摘要

建筑足迹提取是摄影测量、遥感和计算机视觉中的基本任务。近年来，基于图像的方法在高分辨率光学影像的矢量化足迹提取方面取得了显著进展。然而，光学影像本质上易受遮挡、透视畸变和残余地形位移的影响，导致足迹提取不完整或错位。此外，缺乏显式高程信息限制了其在细节层次建筑建模中的直接适用性。本文提出PCFootprint，这是首个用于从机载激光扫描点云中提取足迹的大规模公共数据集。PCFootprint包含来自爱沙尼亚土地和空间发展局的33000个瓦片，覆盖多样化的城市和乡村景观。每个瓦片大小为128×128米，并配有与点云对齐的系统性矢量化足迹。该数据集包括一个3000个瓦片的跨域测试集，用于评估跨地理区域的泛化能力。我们通过评估主流方法建立了全面的基准。实验结果表明，在复杂地理环境中存在高类内方差、数据不平衡和噪声等显著挑战。我们相信PCFootprint将推动建筑建模、城市场景理解和地理空间分析的未来研究。PCFootprint数据集公开于：https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint。

英文摘要

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m × 128 m and comes with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB 新提交

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

SARLO-80：全球斜距SAR语言光学数据集80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

发表机构 * DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay（法国航空航天实验室DEMR-ONERA，巴黎-萨克雷大学）； DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay（法国航空航天实验室DTIS-ONERA，巴黎-萨克雷大学）； Hugging Face

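AI summary: To address the scarcity of data aligning high-resolution SAR with optical imagery and text, builds a dataset of SAR-optical-text triplets on an 80 cm slant-range grid from Umbra SLC data, supporting cross-modal retrieval and generation tasks.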

AI中文摘要

多模态基础模型因大规模光学基准而快速发展，但合成孔径雷达（SAR）的类似资源仍然有限。现有的SAR-光学数据集主要依赖低分辨率、仅强度的地面距离检测（GRD）产品，未保留复值SAR测量或原生采集几何，限制了基于物理的多模态学习。特别是，结合甚高分辨率（VHR）SAR SLC、对齐光学图像和自然语言描述的大规模公开数据集仍然缺乏。我们提出了一个基于开源Umbra聚束模式采集的传感器独立复数据（SICD）构建的VHR SAR-光学-文本数据集。从约2500个全球场景（VV/HH，20cm–2m原生分辨率）出发，通过带限FFT重采样将所有SAR数据标准化到80cm斜距网格，并将图像分割为1024×1024的图块。对于每个SAR图块，我们检索高分辨率光学图块，并利用局部坐标对应关系将其扭曲到SAR网格以实现局部像素级对齐。我们进一步为每个样本生成三种描述变体（短/中/长），以支持视觉-语言训练和评估。我们的数据集包含119,566个三元组（复数和幅度斜距SAR图块、对齐光学图块、自然语言描述），覆盖72个国家的257个地点以及广泛的地物类型和基础设施。我们发布固定的训练/验证/测试划分以及完整的预处理和基线代码，以支持在原生SAR几何中进行跨模态检索和条件生成的多模态对齐的可重复基准测试。该数据集在Hugging Face Hub上公开可用，网址为https://huggingface.co/datasets/ONERA/SARLO-80。

英文摘要

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm to 2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

2606.20536 2026-06-19 cs.CV 新提交

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

FID 彩票：量化生成模型评估中的隐藏随机性

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

发表机构 * Kyutai ； UC Berkeley（加州大学伯克利分校）

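AI summary: Studies the variance of FID as a random variable over training and generation seeds, finds that retraining causes larger FID swings than resampling, and proposes a new evaluation protocol: use per-cell optimal guidance and report error bars over several training seeds.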

Comments Website: https://kyutai.org/fid-lottery

AI中文摘要

Frechet Inception Distance (FID) 是图像生成的事实标准仲裁者，但大多数论文仅报告来自单个训练模型使用单个采样种子的单一数值。如果我们重新训练模型，或仅重新从中采样，该数字的可重复性如何？在本文中，我们将 FID 视为训练和生成种子二维面板上的随机变量，并直接在数百个在类别条件 ImageNet 256x256 上训练的 SiT 网络上测量其方差。我们报告了令人惊讶的发现：(a) 使用相同配方但不同种子重新训练模型，FID 的变动幅度（在 Inception 特征空间中）是从固定网络重新采样时的 3.2 倍。(b) 这一差距由三个因素驱动：随机初始化、数据排序和流匹配损失的每步高斯噪声。(c) 增加计算量或模型大小几乎不会缩小离散程度，FID 变异系数 (CoV) 仍保持在 1-2% 的区间内。(d) 逐单元格的无分类器引导调整使离散程度减半，但改变了哪些种子表现最好的排序，幸运的训练种子达到相同 FID 所需的计算量最多可比不幸的种子少一半。基于这些发现，我们推荐一种新的 FID 评估协议：在逐单元格最优引导下进行评估，将任何低于经验测量的约 1.3% CoV 的 FID 差距视为不确定，并报告多个训练种子的误差条，而不是单一的 FID 数值。

英文摘要

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
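The recommended protocol boils down to two small computations: a coefficient of variation over training seeds, and a threshold test on the gap between two models. The sketch below assumes the gap is measured relative to the mean of the two FID values, which the abstract does not specify; the FID numbers and function names are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def gap_is_conclusive(fid_a, fid_b, cov=0.013):
    """Treat a relative FID gap at or below `cov` as inconclusive.

    Assumption: the gap is taken relative to the mean of the two values.
    """
    relative_gap = abs(fid_a - fid_b) / mean([fid_a, fid_b])
    return relative_gap > cov


# Hypothetical FID scores from five training seeds of one recipe.
fids = [2.00, 2.04, 1.98, 2.02, 1.96]
cov = coefficient_of_variation(fids)

print(f"FID = {mean(fids):.2f} +/- {stdev(fids):.2f} (CoV {cov:.1%})")
small_gap = gap_is_conclusive(2.00, 2.02)   # about a 1% gap
large_gap = gap_is_conclusive(2.00, 2.10)   # about a 5% gap
```

Under these assumptions a 2.00 versus 2.02 comparison falls inside seed noise, while 2.00 versus 2.10 does not.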

2606.20542 2026-06-19 cs.CV 新提交

CalTennis: Large Multi-View Tennis Video Dataset and Benchmark of Monocular-to-3D Pose Estimation

CalTennis：大型多视角网球视频数据集及单目到3D姿态估计基准

Ilona Demler, Xinran Xie, Blake Werner, Anna Szczuka, Pietro Perona

发表机构 * California Institute of Technology（加州理工学院）

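AI summary: Presents CalTennis, a large multi-view tennis video dataset (11 million frames, 40 players) for evaluating monocular-to-3D pose estimation in the wild, and finds that existing models fall short on depth estimation and foot contact.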

AI中文摘要

Caltech网球数据集（CalTennis）是一个大规模视频基准，用于评估野外单目到3D姿态估计。CalTennis包含超过1100万帧（51小时）来自40名球员的网球练习和比赛视频，由2-6台同步摄像机以60 Hz频率采集。它比现有的野外人体运动视频数据集大10倍，比现有的MOCAP真值数据集大3倍，并且是第一个提供专家运动同步多视角记录的大规模基准。多视角设置使得对单目到3D姿态估计算法进行廉价、无标签的评估成为可能。我们描述了一个简单、标准化的协议，无需专业设备或专业知识即可进行数据收集，并实现了全自动视频校准和同步。在CalTennis上对最先进的单目到3D姿态方法进行基准测试，我们发现，虽然3D关节角度恢复现在相当准确，但所有模型在一致地估计深度和足部接触方面仍然存在困难。我们进一步提出了两个新的性能指标——步法和稳定性，并定性研究了身体形状不一致性。这些指标揭示了以前未充分探索的失败模式，并为姿态估计和动作分析的改进提供了具体机会。

英文摘要

The Caltech Tennis Dataset (CalTennis) is a large-scale video benchmark for evaluating monocular-to-3D pose estimation in the wild. CalTennis comprises over 11 million frames (51 hours) of tennis practice and match play from 40 players, captured with 2-6 synchronized cameras at 60 Hz. It is 10 times larger than existing in-the-wild human motion video datasets and 3 times larger than existing MOCAP-ground-truthed datasets, and it is the first large-scale benchmark to provide synchronized multi-view recordings of expert athletic motion. The multi-view setup enables inexpensive, label-free evaluation of monocular-to-3D pose estimation algorithms. We describe a simple, standardized protocol that enables data collection without specialized equipment or expertise, along with fully automated video calibration and synchronization. Benchmarking state-of-the-art monocular-to-3D pose methods on CalTennis, we find that while 3D joint angle recovery is now quite accurate, all models struggle to estimate depth and foot contact consistently. We further propose two novel performance metrics, footwork and stability, and qualitatively study body shape inconsistency. These metrics expose previously underexplored failure modes and point to concrete opportunities for improvement in pose estimation and action analysis.
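The frame count and duration quoted above can be cross-checked with simple arithmetic. The check assumes that the 51 hours count total footage summed over all cameras; the abstract does not state how the hours are counted.

```python
hours = 51          # footage duration from the abstract
frame_rate_hz = 60  # capture rate from the abstract

frames = hours * 3600 * frame_rate_hz
print(f"{frames:,} frames")  # prints "11,016,000 frames"
```

The result, about 11.0 million frames, agrees with the "over 11 million frames" figure under that assumption.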

2606.20545 2026-06-19 cs.CV 新提交

Current World Models Lack a Persistent State Core

当前世界模型缺乏持久状态核心

Jinpeng Lu, Dexu Zhu, Haoyuan Shi, Linghan Cai, Guo Tang, Yinda Chen, Jie Cao, Duyu Tang, Yi Zhang, Yong Dai, Xiaozhu Ju

发表机构 * University of Science and Technology of China（中国科学技术大学）； Beijing Innovation Center of Humanoid Robotics (X-Humanoid)（北京人形机器人创新中心）； NLPR, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所模式识别国家重点实验室）； Independent Researcher（独立研究者）； Dresden University of Technology（德累斯顿工业大学）； Peking University（北京大学）

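AI summary: Proposes the WRBench benchmark, finds that existing world models fail to keep the world state evolving when observation is interrupted, and argues that the stability of the physical state kernel should be a primary objective of world-model design.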

Comments 39 pages, 16 figures

AI中文摘要

世界模型日益被视为迈向通用人工智能的关键一步，然而对物理世界建模需要的不仅仅是按需生成令人信服的帧：它需要一个内部世界状态随时间持续演化，与观测解耦，使得物体持久存在、事件运行至结束，无论是否有相机在观察，就像月球在无人注视时仍保持轨道运行一样。这一要求是现有基准的盲点，它们奖励表面属性如保真度、运动和相机可控性，却从不询问生成的世界在未被观测时是否持续演化。我们引入 WRBench，首个系统性的诊断基准，将相机运动视为对可观测性的干预，并将评估分解为一个人工校准的链条：询问相机是否执行了请求的交互，场景在视野内是否保持连续和可识别，以及返回的目标是否与已启动的事件保持一致。在来自 23 个模型（涵盖四种控制范式）的 9,600 个视频中，一个发现顽固地存在：当前系统将观测到的世界维持为跟踪镜头，返回的目标恢复为被遗弃时的状态，而非在未被观测时推进事件。由于这一失败在控制范式、模型家族和规模增量中重复出现，稳健的世界状态演化并非来自更清晰的图像、更严格的控制、更丰富的几何先验或单纯的参数数量。因此，我们主张物理状态核的稳定性和视角干预下世界线的一致性应成为世界模型设计的一级目标，使得世界模型捕捉世界将如何展开，而非下一帧如何呈现。

英文摘要

World models are increasingly regarded as a decisive step toward artificial general intelligence, yet modeling the physical world demands more than rendering convincing frames on demand: it requires an internal world state that keeps evolving over time, decoupled from observation, so that objects endure and events run to their conclusions whether or not a camera is watching, much as the moon holds to its orbit when no one is looking. This requirement is a blind spot of existing benchmarks, which reward surface properties such as fidelity, motion, and camera controllability while never asking whether a generated world keeps evolving once it is unobserved. We introduce WRBench, the first systematic diagnostic benchmark that treats camera motion as an intervention on observability and resolves evaluation into a human-calibrated chain that asks whether the camera executes the requested interaction, whether the scene stays continuous and identifiable while in view, and whether a returning target remains consistent with the event that was set in motion. Across 9,600 videos from 23 models spanning four control paradigms, one finding proves stubborn: current systems maintain the observed world as a tracking shot, resuming a returning target in the state at which it was abandoned rather than advancing the event while it went unseen. Because this failure recurs across control paradigms, model families, and increments of scale, robust world-state evolution does not follow from cleaner imagery, tighter control, richer geometric priors, or sheer parameter count. We therefore argue that the stability of the physical state kernel and the consistency of worldlines under viewpoint intervention should become first-class objectives of world-model design, so that a world model captures how the world will unfold rather than how the next frame appears.

2606.19835 2026-06-19 cs.CV 新提交

Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision

神经事件：用于事件视觉的离散异步自编码器

Roberto Pellerito, Daniel Gehrig, Shintaro Shiba, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich（苏黎世大学机器人感知组）； University of Pennsylvania（宾夕法尼亚大学）； The University of Tokyo（东京大学）； Keio University（庆应义塾大学）

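AI summary: Proposes re-tokenizing event streams into a small set of highly informative "neural events", each a discrete learnable code for a local spatio-temporal context window; matches or surpasses existing methods on object detection and classification while reducing the event rate by a factor of 2.0.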

AI中文摘要

事件相机通过将动态场景表示为微秒分辨率的连续事件流，以卓越的时间保真度捕捉动态场景。然而，每个单独的事件仅携带最小的语义价值，仅仅表示局部亮度变化。为了获得有意义的信号，下游算法需要快速整合来自潜在大量低信息事件流的线索。然而，当前的架构很容易被淹没，难以在捕捉细粒度时间动态和维持可管理的数据吞吐量之间取得平衡。本文提出一个框架，将事件流重新标记为少量高信息量的“神经事件”，每个事件代表一个局部时空上下文窗口，并带有离散可学习编码。每次该编码翻转时，触发一个神经事件，产生高度压缩的数据流。我们证明，在物体检测和分类任务中，基于神经事件训练的网络与最先进方法性能相当或更优，同时将事件率降低2.0倍。

英文摘要

Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond-resolution events. Each individual event, however, only carries minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics and maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events match or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.
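The "emit only when the code flips" rule can be sketched in a few lines. The code sequence below is invented, and the learned autoencoder that would produce the codes is out of scope; `neural_events` is a hypothetical helper that only shows the triggering logic for one spatial location.

```python
def neural_events(codes):
    """Return (index, code) pairs for every position where the discrete
    code differs from the previous one. The first code always fires."""
    events = []
    previous = None  # assumes real codes are never None
    for t, code in enumerate(codes):
        if code != previous:
            events.append((t, code))
            previous = code
    return events


# Hypothetical per-step codes for one local spatio-temporal window.
codes = [3, 3, 3, 7, 7, 2, 2, 2, 2, 7]
events = neural_events(codes)
compression = len(codes) / len(events)
```

Here ten code readings collapse into four emitted events, because repeated codes carry no new information.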


1. Multimodal and Vision-Language Models (10 papers)

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

Multimodal Concept Bottleneck Models

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

The Hidden Evolution of Disguised Visual Context inside the VLM

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

2. Embodied AI, Robotics, and Autonomous Driving (7 papers)

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

3. Image Recognition, Retrieval, and Classification (3 papers)

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Evaluation of Image Matching for Art Skills Assessment

4. Object Detection, Segmentation, and Localization (5 papers)

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

5. Video Understanding and Temporal Vision (8 papers)

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

NEST: Narrative Event Structures in Time for Long Video Understanding

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

6. Generative Vision and World Models (16 papers)

LooseControlVideo: Directorial Video Control using Spatial Blocking

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Holo-World: Unified Camera, Object and Weather Control for Video World Model

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising

7. 3D Vision, Point Clouds, and Spatial Intelligence (7 papers)

QueryGaussian: Scalable and Training-Free Open-Vocabulary 3D Instance Retrieval

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

Thinking in Boxes: 3D Editing in Real Images Made Easy

8. Medical Imaging and Biological Vision (18 papers)

Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers

HypOProto: Hyperbolic Ordinal Prototypes for Left Ventricular Filling Pressure Classification

CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

Confidence Calibration for Multimodal LLMs: An Empirical Study through Medical VQA

Semantic-Anchored Evidential Fusion for Domain-Robust Whole-Slide Survival Analysis

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification