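arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.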

2606.19534 2026-06-19 cs.CV cs.AI cs.CL New submission

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Yueyi Sun, Yuhao Wang, Jason Li, Ye Tian, Tao Zhang, Jacky Mai, Yihan Wang, Haochen Wang, Jinbin Bai, Ling Yang, Yunhai Tong

Affiliations: Peking University; MSALab; ByteDance

AI summary: Proposes PerceptionDLM, which exploits the parallel decoding of diffusion language models and uses efficient prompting and structured attention masking to perceive multiple regions in parallel, substantially improving inference efficiency. Also builds the ParaDLC-Bench benchmark for evaluation.

Comments: Code available at https://github.com/MSALab-PKU/PerceptionDLM

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate the parallelism property of visual perception capability for DLMs, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by scaling the DLC-Bench to include multiple region masks per image, enabling joint evaluation of both caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region caption and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
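
To make the structured attention masking idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical token layout in which a shared prefix (image and prompt tokens) is followed by one caption segment per region. Each caption segment may attend to the prefix and, bidirectionally, to itself, but not to other regions' segments, so all captions could be denoised in a single pass. The function name `build_region_mask` and the layout are assumptions made for illustration.

```python
def build_region_mask(prefix_len, region_lens):
    """Return mask[i][j] == True when token i may attend to token j.

    Hypothetical layout: [shared prefix | caption 0 | caption 1 | ...].
    """
    total = prefix_len + sum(region_lens)
    # Segment id per position: -1 for the shared prefix, k for the k-th caption.
    seg = [-1] * prefix_len
    for k, n in enumerate(region_lens):
        seg.extend([k] * n)

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if seg[j] == -1:
                mask[i][j] = True   # every token can read the shared prefix
            elif seg[i] == seg[j]:
                mask[i][j] = True   # bidirectional attention inside one caption
    return mask


mask = build_region_mask(prefix_len=2, region_lens=[2, 3])
```

Under this layout the captions are mutually invisible, which is what would let several regions be described in parallel without interfering with one another.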

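2606.19584 2026-06-19 cs.CV New submission

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

Affiliations: Google

AI summary: Proposes Language-Instructed Vision Embeddings (LIVE), which uses language to dynamically steer the vision encoder toward task-centric embeddings without task-specific retraining, reducing visual hallucination and improving generalization.

Journal ref: Published as a conference paper at ICLR 2026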

2606.19828 2026-06-19 cs.CV New submission

The Hidden Evolution of Disguised Visual Context Inside VLMs

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

Affiliations: Surrey Institute for People-Centred AI, University of Surrey; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey

AI summary: Studies how visual tokens in vision-language models are transformed into meaningful representations under different integration architectures (in-context injection versus layer-wise injection), revealing their internal evolution and its effect on performance.

Abstract

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture: whether visual tokens are treated as in-context prompts within the input sequence or injected directly into the LLM's intermediate layers. A controlled comparison of these architectural choices, and an understanding of how they affect visual information and its internal transformation within the LLM, are still missing. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single-image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20177 2026-06-19 cs.CV cs.AI New submission

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

Affiliations: Peng Cheng Laboratory; Tsinghua University; Central South University

AI summary: Introduces the RS-Neg benchmark for evaluating negation understanding in remote sensing MLLMs and designs NeFo, a test-time learning method that uses about 5% unlabeled samples to significantly improve model performance.

Comments: ECCV 2026 Accepted

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.20244 2026-06-19 cs.CV cs.AI New submission

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

Affiliations: National University of Singapore; Fudan University; Technical University of Munich; Sagenic Tech; Zhejiang University; vivo; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes SPOT-E, which uses test-time entropy shaping and visual spotlights to address VLMs underperforming on evidence-intensive tasks because they overlook small, localized key evidence. Improves grounding and robustness without retraining.

Abstract

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is often small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
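
The feedback signal named in the abstract, answer-span prediction entropy, can be illustrated with a small sketch. It assumes the per-token probability distributions over the vocabulary for the answer span are already available. The function names and the `anchor_threshold` value are hypothetical, and the anchor rule shown (tokens whose baseline entropy falls below a threshold) is only one plausible reading of "low-entropy anchors", not the paper's definition.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def answer_span_entropy(span_probs):
    """Mean token entropy over the answer span."""
    return sum(token_entropy(p) for p in span_probs) / len(span_probs)


def low_entropy_anchors(span_probs, anchor_threshold=0.1):
    """Indices of tokens the baseline model is already confident about.

    Hypothetical rule: a token is an anchor if its entropy is below the threshold.
    """
    return [i for i, p in enumerate(span_probs)
            if token_entropy(p) < anchor_threshold]


# Toy span: one maximally uncertain token and one fully confident token.
span = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
mean_entropy = answer_span_entropy(span)
anchors = low_entropy_anchors(span)
```

A shaping objective in this spirit would lower `mean_entropy` while penalizing any change to the predictions at the anchor positions, so that confidence cannot be gained simply by collapsing onto a shortcut answer.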

2606.20419 2026-06-19 cs.CV New submission

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

Karn Tiwari, Varnith Chordia, Prathosh A P

Affiliations: Indian Institute of Science, Bengaluru; Snap Research

AI summary: Proposes QK Product Steering, a data-free, training-free, zero-inference-cost weight edit that reduces object hallucination by suppressing dominant singular modes in middle layers, achieving an average relative CHAIR$_s$ reduction of 4.0% across three GQA-based VLMs.

Comments: Under Review

Abstract

Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of $4.0\%$, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.
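
The weight edit described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the per-head weights are taken to have shape (d_model, d_head), the "closed-form query-only update" is approximated here by a least-squares solve with the pseudo-inverse of the key weights, and the function names, `num_modes`, and `scale` are all hypothetical.

```python
import numpy as np


def steer_qk(W_q, W_k, num_modes=2, scale=0.5):
    """Damp the top singular modes of W_q @ W_k.T and fold the edit into W_q only."""
    M = W_q @ W_k.T                          # QK product, shape (d_model, d_model)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    S_edit = S.copy()
    S_edit[:num_modes] *= scale              # suppress the dominant singular modes
    M_edit = (U * S_edit) @ Vt               # rebuild the edited product
    # Least-squares query-only update: W_q_new @ W_k.T approximates M_edit,
    # while W_k (shared across query heads in GQA) stays untouched.
    W_q_new = M_edit @ np.linalg.pinv(W_k.T)
    return W_q_new, M_edit


def sym_antisym(M):
    """Split a square matrix into symmetric and antisymmetric parts."""
    return 0.5 * (M + M.T), 0.5 * (M - M.T)


rng = np.random.default_rng(0)
W_q = rng.standard_normal((8, 4))            # toy sizes: d_model=8, d_head=4
W_k = rng.standard_normal((8, 4))
W_q_new, M_edit = steer_qk(W_q, W_k)
M_sym, M_anti = sym_antisym(M_edit)
```

Because the edited product keeps the same right singular vectors as the original, its rows stay in the column space of the key weights, so in this toy setting the query-only update reproduces the edited product up to numerical precision. The symmetric part corresponds to mutual similarity between token pairs and the antisymmetric part to directional attention.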

2606.19646 2026-06-19 cs.IR cs.CV Cross-list

Do Vision-Language Models Understand 3D Scenes or Merely Catalogue Objects?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

Affiliations: Deccan AI

AI summary: Using a human-curated benchmark of 3,034 samples, probes three components of spatial understanding in vision-language models (depth-ordered occlusion, optical-geometry inference, and volumetric rearrangement planning) and finds that models plan rearrangements over visible layouts well but perform poorly on occlusion and reflection inference.

Abstract

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints fall to 6-45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2606.05833 2026-06-19 cs.CV cs.AI Replacement

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

Affiliations: University of California, Davis

AI summary: Proposes GeoVR, a framework that uses only 2D video sequences to distill 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features) from pre-trained 3D foundation models, reshaping the internal representations of multimodal large language models to give them spatial intelligence and reaching state-of-the-art performance on spatial reasoning benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.16615 2026-06-19 cs.CV Replacement

Scaling Self-Play for End-to-End Driving

Luke Rowe, Roger Girgis, Rodrigue de Schaetzen, Daphne Cornelisse, Alaap Grandhi, Felix Heide, Eugene Vinitsky, Christopher Pal, Liam Paull

Affiliations: Mila; Université de Montréal; Polytechnique Montréal; Torc Robotics; NYU Tandon School of Engineering; McMaster University; Princeton University

AI summary: Proposes a large-scale self-play training strategy in which the high-throughput simulator Gigapixel enables self-play directly from pixels, combined with DAgger distillation and perception adaptation, to improve end-to-end driving models.

Abstract

End-to-end autonomous driving models are typically trained on offline human-demonstration datasets that provide limited state coverage and often no closed-loop feedback, making them prone to compounding errors when deployed in closed-loop and brittle to long-tail agent interactions. To overcome these limitations, we propose an alternative strategy for training end-to-end driving models: large-scale self-play directly from pixels in simulation. While prior self-play approaches have shown promising transfer to real-world driving, they typically assume vectorized Bird's-Eye-View (BEV) observations that are incompatible with end-to-end policies operating directly on sensor observations. To this end, we introduce Gigapixel, a high-throughput batched driving simulator with perspective rendering, enabling scalable self-play directly from pixel observations. Rather than targeting compute-costly photorealistic sensor simulation, Gigapixel renders a simplified bounding-box world that preserves essential scene structure while achieving throughput at 50k agent steps per second. Since direct pixel-space self-play RL is prohibitively sample-inefficient at end-to-end model scale, we propose self-play DAgger training: we train pixel-based policies in self-play via on-policy distillation from a privileged RL teacher. To bridge the sim-to-real gap, we subsequently transfer the self-play trained policies to real-world sensor data through lightweight perception adaptation. Policies trained in Gigapixel and adapted to real-world sensor data achieve competitive performance on the HUGSIM and NAVSIM-v2 benchmarks without human trajectory supervision. Moreover, scaling self-play training yields proportional gains in policy performance, establishing self-play as a practical and scalable strategy for training end-to-end models.

2606.19836 2026-06-19 cs.RO cs.CV Cross-list

World Engine: Towards the Era of Post-Training for Autonomous Driving

Tianyu Li, Li Chen, Caojun Wang, Haochen Liu, Kashyap Chitta, Zhenjie Yang, Yuhang Lu, Naisheng Ye, Yihang Qiu, Yufei Wang, Luoxi Zou, Jiaxin Peng, Jin Pan, Zhaoyu Su, Andrei Bursuc, Shengbo Eben Li, Andreas Geiger, Peng Su, Hongyang Li

AI summary: Proposes World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and extrapolates them into safety-critical variations, then uses reinforcement-based post-training to align policies with safety constraints, substantially reducing failures in rare safety-critical scenarios and improving autonomous driving safety.

Comments: Technical Report. Project Page: https://opendrivelab.com/WorldEngine/

Abstract

Autonomous vehicles must operate safely in the real world, where errors can have severe consequences. Although modern end-to-end driving policies excel in routine scenarios, their reliability is limited by the scarcity of safety-critical "long-tail" events in real driving datasets. These rare interactions define the practical safety boundary of the learned policy, yet they are difficult to collect at scale in the real world. Here we show that this fundamental limitation can be addressed by post-training pre-trained driving models on synthesized high-stakes interactions. We introduce World Engine, a generative framework that reconstructs high-fidelity interactive environments from real-world logs and systematically extrapolates them into realistic safety-critical variations. This paradigm enables reinforcement-based post-training to align policies with safety constraints, circumventing the physical risks inherent in real-world exploration. On a public benchmark built on nuPlan, World Engine substantially reduces failures in rare safety-critical scenarios and yields significantly larger gains than scaling pre-training data alone. Furthermore, when deployed on a production-scale autonomous driving system, the resulting policy reduces simulated collisions and demonstrates measurable improvements in on-road testing, showing that post-training on synthesized, safety-critical interactions offers a scalable and effective pathway to safer autonomous driving. The full codebase suite, including training, is released to the public.

2606.19998 2026-06-19 cs.RO cs.AI cs.CV cs.LG Cross-list

Mem-World: Memory-Augmented Action-Conditioned World Model for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to counter the scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
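
Policy evaluation quality is reported here as the Pearson correlation between performance estimated inside the world model and performance measured on the real robot. A minimal sketch of that metric follows. The success-rate numbers are made up purely for illustration and do not come from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical success rates of four policies: simulated rollouts vs. real robot.
simulated = [0.30, 0.45, 0.60, 0.80]
real = [0.25, 0.50, 0.55, 0.75]
r = pearson(simulated, real)
```

A value of `r` close to 1 would mean that ranking policies by their rollouts in the world model agrees well with ranking them by real-world trials.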

2606.18112 2026-06-19 cs.RO cs.CV Replacement

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotNav, a scalable navigation model whose parameterized interface supports multiple task modes and tunable observation parameters. Trained on 15.6M samples jointly with vision-language data to prevent behavioral collapse, it sets new state-of-the-art results on several navigation benchmarks and shows zero-shot generalization.

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

AI summary: Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval by composing a reference object, its mask, and a retrieval text, and builds the COR125K benchmark and the CORE model, which significantly outperforms existing methods.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2512.03199 2026-06-19 cs.CV Replacement

Does Head Pose Correction Improve Biometric Facial Recognition?

Justin Norman, Hany Farid

Affiliations: University of California, Berkeley

AI summary: Examines how AI-driven head pose correction and image restoration affect facial recognition accuracy, finding that selectively applying CFR-GAN and CodeFormer can improve recognition performance.

2604.19196 2026-06-19 cs.CV Replacement

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

Affiliations: Graduate School of Information Sciences, Tohoku University, Japan

AI summary: Systematically evaluates 15 pre-trained vision models for domain-generalizable face anti-spoofing and finds that self-supervised ViTs (especially DINOv2 with Registers), combined with data augmentation and an attention-weighted loss, achieve state-of-the-art results on the MICO protocol while remaining computationally efficient.

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

Abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

2606.20032 2026-06-19 cs.CV New submission

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

Hongming Zhu, Huaji Chen, Bowen Du, Sicong Liu, Qin Liu

Affiliations: School of Computer Science and Technology, Tongji University; College of Surveying and Geo-Informatics, Tongji University

AI summary: Proposes a training-free, reliability-aware open-vocabulary change detection framework that uses semantic change reasoning and boundary-aware refinement to address two problems: instance-level comparison misses fine-grained changes, and pixel-level comparison is unreliable. Improves F1 by 2.13% to 9.75% across multiple datasets.

Abstract

Unlike traditional remote sensing change detection that relies on predefined categories, Open-Vocabulary Change Detection (OVCD) identifies land cover changes flexibly using arbitrary text prompts. However, existing methods suffer from an inherent trade-off when modeling changes: instance-level comparison overlooks fine-grained semantic variations (e.g., partial building extensions), while direct pixel comparison proves unreliable, yielding unstable responses and boundary artifacts due to semantic ambiguity and spatial inconsistency. To this end, we propose an efficient training-free Reliability-Aware Open-Vocabulary Change Detection (ReA-OVCD) framework. It first derives candidate change regions from pixel-wise semantic discrepancies to ensure flexible and detailed localization. To ensure reliability, it subsequently introduces a collaborative refinement strategy to explicitly model change validity from both semantic and spatial perspectives. Specifically, we develop a Semantic Change Reasoning (SCR) module that reassesses changes by jointly analyzing distributional divergence and response variation, enabling the suppression of incidental inconsistencies while preserving reliable semantic shifts. In addition, a Boundary-aware Change Refinement (BCR) module is designed to mitigate artifacts stemming from boundary misalignment and uncertainty through validating whether candidate regions are supported by reliable interior pixels. Extensive experiments across multiple datasets (LEVIR-CD, WHU-CD, DSIFN, and SECOND) demonstrate that our method consistently outperforms state-of-the-art approaches, achieving $\mathrm{F}_{1}^{C}$ improvements of 2.13% to 9.75% with higher computational efficiency. The code is publicly available at https://github.com/Funny0101/ReA-OVCD


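2606.20130 2026-06-19 cs.CV new submission

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

Xuesong Wang

Affiliations: Wayne State University

AI summary: Proposes a segmentation model built on the SAM3 image encoder with a lightweight decoder; with self-distillation, multi-scale test-time augmentation, and a transplanted photometric distortion, it reaches 69.73% mIoU in the GOOSE 2D challenge.

Comments 4th place in ICRA 2026 GOOSE 2D Semantic Segmentation Challenge

AI Chinese abstract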

我们描述了在ICRA 2026 GOOSE 2D细粒度语义分割挑战赛中获得第四名的方案，该方案在官方1815张图像测试集上达到了69.73%的复合平均交并比（mIoU）。我们的模型适配了近期视觉基础模型Segment Anything Model 3（SAM3）的图像编码器，并搭配轻量级解码器。除此之外，我们贡献了两项技术和一项经验发现：（i）一种自蒸馏方案，该方案重新利用SAM3本身，以真实边界框作为提示，在SAM3性能优于我们自身模型的类别上充当教师；（ii）一种图像级多尺度测试时增强方案，通过重新缩放图像而非模型输入，为固定输入尺寸的模型恢复多尺度推理；（iii）一项发现：来自2025年GOOSE 2D获胜方案的一种激进光度畸变，移植到我们的流程中，是单一最大的改进来源。

English abstract

We describe our 4th-place entry to the ICRA 2026 GOOSE 2D Fine-Grained Semantic Segmentation Challenge, which reached a composite mean Intersection-over-Union (mIoU) of 69.73% on the official 1,815-image test set. Our model adapts the image encoder of a recent visual foundation model, Segment Anything Model 3 (SAM3), with a lightweight decoder. Beyond this, we contribute two techniques and one empirical finding: (i) a self-distillation scheme that re-uses SAM3 itself, prompted with ground-truth boxes, as a teacher on the classes where it outperforms our own model; (ii) an image-level multi-scale test-time augmentation scheme that restores multi-scale inference for a fixed-input-size model by rescaling the image rather than the model input; and (iii) the finding that an aggressive photometric distortion from a winning 2025 GOOSE 2D entry, transplanted onto our pipeline, is its single largest source of improvement.
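The class-selective teacher in (i) can be pictured with a small sketch. It assumes per-class IoU scores for both the SAM3 teacher and the student are already available on a validation split. The function name `select_teacher_classes`, the class names, and the numbers are hypothetical and are not taken from the paper.

```python
def select_teacher_classes(teacher_iou, student_iou):
    """Return the classes on which the teacher beats the student.

    Both arguments map class name -> IoU in [0, 1]. Only classes present
    in both dictionaries are compared.
    """
    shared = teacher_iou.keys() & student_iou.keys()
    return sorted(c for c in shared if teacher_iou[c] > student_iou[c])


# Hypothetical validation scores, for illustration only.
teacher = {"bush": 0.71, "gravel": 0.52, "fence": 0.64}
student = {"bush": 0.68, "gravel": 0.60, "fence": 0.59}

# Classes where teacher predictions would be used as distillation targets.
distill_on = select_teacher_classes(teacher, student)
```

Under this reading, the student keeps learning from ground truth on every class and receives the extra teacher signal only on the selected ones; how the paper combines the two losses is not stated in the abstract.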


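2606.20161 2026-06-19 cs.CV new submission

Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

Affiliations: Engineering Research Center of Learning-Based Intelligent System (Ministry of Education); Key Laboratory of Computer Vision and System (Ministry of Education); School of Computer Science and Engineering, Tianjin University of Technology

AI summary: Proposes a light field epipolar-plane structure image (ESI) representation and an angular-temporal interaction network (ATINet) that explicitly models geometric structure and supports self-supervised optimization, enabling efficient object tracking in low-light scenes with state-of-the-art performance.

AI Chinese abstract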

高质量的四维光场表示结合高效的角特征建模对于场景感知至关重要，因为它可以提供判别性的空间-角度线索来识别移动目标。然而，近期的发展仍然难以在时间域中提供可靠的角建模，尤其是在复杂的低光场景中。在本文中，我们提出了一种新颖的光场极线平面结构图像（ESI）表示，该表示显式定义了光场内的几何结构。通过利用极线平面内光线角度的突变，这种表示可以增强低光场景中的视觉表达，并减少高维光场的冗余。我们进一步提出了一种用于光场目标跟踪的角-时交互网络（ATINet），该网络从光场的几何结构线索和角-时交互线索中学习角感知表示。此外，ATINet还可以通过自监督方式进行优化，以增强时间域上的几何特征交互。最后，我们引入了一个大规模的光场低光数据集用于目标跟踪。大量实验表明，ATINet在单目标跟踪中达到了最先进的性能。此外，我们将所提方法扩展到多目标跟踪，这也显示了高质量光场角-时建模的有效性。

English abstract

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.


2510.24399 2026-06-19 cs.CV cs.RO version update

GenTrack: A New Generation of Multi-Object Tracking

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

Affiliations: SDU Robotics, University of Southern Denmark

AI summary: Proposes GenTrack, a multi-object tracking method that mixes stochastic and deterministic strategies and combines particle swarm optimization with social interactions, maintaining target identity consistency and reducing ID switches under weak detectors and occlusion.

Comments This work has been submitted to the IEEE for possible publication

AI Chinese abstract

本文介绍了一种新颖的多目标跟踪（MOT）方法，称为GenTrack，其主要贡献包括：第一，一种混合跟踪方法，采用随机和确定性方式，以鲁棒地处理未知且时变的目标数量，特别是在维持目标身份（ID）一致性和管理非线性动态方面；第二，利用粒子群优化（PSO）和一些提出的适应度度量，引导随机粒子朝向其目标分布模式，从而即使在弱且噪声大的目标检测器下也能实现有效跟踪；第三，整合目标间的社会交互，以增强PSO引导的粒子，并改进强（匹配）和弱（未匹配）轨迹的连续更新，从而减少ID切换和轨迹丢失，尤其是在遮挡期间；第四，基于GenTrack重新定义的视觉MOT基线，结合了基于空间一致性、外观、检测置信度、轨迹惩罚和社会分数的综合状态与观测模型，以实现系统且高效的目标更新；第五，首个公开可用的最小依赖源代码参考实现，包含三种变体，包括GenTrack Simple、Strengthen和Super，便于灵活重新实现。实验结果表明，与最先进的跟踪器相比，GenTrack在标准基准和现实场景中提供了优越的性能，并集成了基线实现以进行公平比较。还讨论了未来工作的潜在方向。所提方法和比较跟踪器的源代码参考实现已在GitHub上提供：this https URL

English abstract

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions are: first, a hybrid tracking approach that combines stochastic and deterministic mechanisms to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics; second, the use of particle swarm optimization (PSO) with proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors; third, the integration of social interactions among targets to enhance PSO-guided particles and to improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions; fourth, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates; and fifth, the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants (GenTrack Simple, Strengthen, and Super) to facilitate flexible reimplementation. Experimental results show that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
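To make the PSO guidance step concrete, here is a minimal sketch of textbook particle swarm optimization pulling 2-D particles toward a high-fitness mode. It is not GenTrack's implementation: the fitness function, the coefficients `w`, `c1`, `c2`, and the helper name `pso_refine` are assumptions chosen for illustration, and the paper's own fitness measures are not reproduced here.

```python
import random


def pso_refine(particles, fitness, steps=30, w=0.6, c1=1.5, c2=1.5, seed=0):
    """Move 2-D particles toward a high-fitness mode with standard PSO.

    particles: list of [x, y] positions; fitness: callable, higher is better.
    Returns (global_best_position, global_best_fitness).
    """
    rng = random.Random(seed)
    pos = [list(p) for p in particles]
    vel = [[0.0, 0.0] for _ in pos]
    pbest = [list(p) for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(len(pos)), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = list(pbest[g]), pbest_fit[g]

    for _ in range(steps):
        for i, p in enumerate(pos):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - p[d])
                             + c2 * r2 * (gbest[d] - p[d]))
                p[d] += vel[i][d]
            f = fitness(p)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(p), f
                if f > gbest_fit:
                    gbest, gbest_fit = list(p), f
    return gbest, gbest_fit


def toy_fitness(p, target=(5.0, -2.0)):
    """Negative squared distance to a hypothetical detection centre."""
    return -((p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2)


init_rng = random.Random(1)
init = [[init_rng.uniform(-10, 10), init_rng.uniform(-10, 10)] for _ in range(20)]
init_best_fit = max(toy_fitness(p) for p in init)
best, best_fit = pso_refine(init, toy_fitness)
```

Because the global best is only ever replaced by a strictly better position, the returned fitness can never be worse than the best initial particle. In a tracker, the toy fitness would be replaced by a score built from appearance, motion, and detection cues.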


2510.24410 2026-06-19 cs.CV cs.RO version update

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

Affiliations: SDU Robotics, University of Southern Denmark

AI summary: Proposes a multi-object tracking method that combines stochastic particle filtering with deterministic association, using particle swarm optimization and a new cost matrix to keep identities consistent under nonlinear dynamics; reported to outperform existing methods.

Comments The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

AI Chinese abstract

本文提出一种视觉多目标跟踪方法，联合使用随机和确定性机制，以确保在非线性动态下未知且时变目标数量的标识一致性。随机粒子滤波处理非线性动态和非高斯噪声，并借助粒子群优化（PSO）将粒子引导至状态分布模式，通过提出的适应度度量（包含运动一致性、外观相似性和与邻近目标的社交互动线索）减轻发散。确定性关联通过提出的代价矩阵进一步强制标识一致性，该矩阵包含粒子与当前检测之间的空间一致性、检测置信度和轨迹惩罚。随后，提出一种新颖方案，在保持目标身份的同时平滑更新目标状态，特别是对于与其他目标交互和长时间遮挡期间的弱轨迹。此外，对过去状态的速度回归提供趋势种子速度，增强粒子采样和状态更新。所提出的跟踪器设计灵活，适用于预录视频和相机直播流（未来帧不可用）。实验结果表明，与最先进的跟踪器相比，性能优越。所提出方法和对比跟踪器的源代码参考实现已在GitHub上提供：此 https URL

English abstract

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
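The "velocity regression over past states" idea can be illustrated with an ordinary least-squares slope fitted to a short history of positions. This is only a sketch under the assumption of evenly spaced frames and one scalar coordinate; the function name `trend_velocity` and the sample values are hypothetical, and the paper may use a different regression.

```python
def trend_velocity(positions, dt=1.0):
    """Least-squares slope of positions sampled every `dt` time units.

    positions: list of scalar coordinates, oldest first (at least 2).
    Returns the fitted velocity in position units per time unit.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two past states")
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    x_mean = sum(positions) / n
    num = sum((t - t_mean) * (x - x_mean) for t, x in zip(times, positions))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


xs = [10.0, 12.0, 14.0, 16.0]   # hypothetical x-centres of one track
vx = trend_velocity(xs)          # slope of the fitted line
```

A fitted slope is less sensitive to a single noisy detection than the difference between the last two states, which is one plausible reason to seed particle sampling with it.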


2606.19682 2026-06-19 cs.CV new submission

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

Duc-Tho Nguyen, Hieu-Hoc Tran-Minh, Khanh-Hoa Lam, Hoang-Nhut Ly, Huu-Phuc Huynh, Thanh-Tien Tran, Trung-Nghia Le

Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes Vortex, a system that combines adaptive keyframe extraction, multimodal metadata generation, and a hybrid retrieval strategy (Reciprocal Rank Fusion of CLIP and SigLIP2), together with Rocchio feedback and multi-stage temporal search; it achieved strong results in the competition.

Comments SOICT 2025

AI Chinese abstract

本文介绍了Vortex，这是我们的团队FocusOnFun为胡志明市AI挑战赛2025开发的多模态视频检索系统，旨在推进智能多媒体搜索和时间推理。该系统集成了自适应关键帧提取、来自视觉语言和语音模型的多模态元数据生成，以及通过倒数秩融合融合CLIP和SigLIP2嵌入的混合检索策略，以平衡全局和细粒度语义。为了增强交互性，Vortex引入了基于Rocchio的相关性反馈和多阶段时序搜索机制，用于顺序事件对齐。该系统基于Milvus和Elasticsearch构建，支持可扩展的索引和高效检索。在官方比赛中，我们的FocusOnFun团队的系统在初赛中获得了79.6/88（90.5%）的分数，并在决赛中进一步评估，整体表现达到“优秀”，在问答（QA）任务中取得“杰出”成绩。这证明了CLIP和SigLIP2的互补优势，并确认了混合检索方法的有效性。该系统为未来在智能、上下文感知和交互式视频检索方面的研究奠定了坚实基础。

English abstract

This paper presents Vortex, the multimodal video retrieval system developed by our team, FocusOnFun, for the Ho Chi Minh City AI Challenge 2025, designed to advance intelligent multimedia search and temporal reasoning. The system integrates adaptive keyframe extraction, multimodal metadata generation from vision-language and speech models, and a hybrid retrieval strategy that fuses CLIP and SigLIP2 embeddings through Reciprocal Rank Fusion to balance global and fine-grained semantics. To enhance interactivity, Vortex incorporates Rocchio-based relevance feedback and a multi-stage temporal search mechanism for sequential event alignment. Built on Milvus and Elasticsearch, the architecture enables scalable indexing and efficient retrieval. Evaluated in the official competition, our FocusOnFun team's system achieved a score of 79.6/88 (90.5%) in the Preliminary Round and was further evaluated in the Final Round, achieving an "Excellent" overall performance with "Outstanding" results in the question-answering (QA) task. These results demonstrate the complementary strengths of CLIP and SigLIP2 and confirm the effectiveness of the hybrid retrieval approach. The system establishes a robust foundation for future research in intelligent, context-aware, and interactive video retrieval.
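Reciprocal Rank Fusion itself is simple enough to sketch. Each item receives the sum of 1 / (k + rank) over the ranked lists that contain it. The constant k = 60 is the commonly used default and is an assumption here, since the abstract does not state the value used in Vortex; the frame ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of item ids with Reciprocal Rank Fusion.

    Each item scores sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Returns ids sorted by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical keyframe ids returned by two embedding models.
clip_hits = ["f12", "f07", "f33"]
siglip_hits = ["f07", "f41", "f12"]
fused = reciprocal_rank_fusion([clip_hits, siglip_hits])
```

Because only ranks are used, the two models' similarity scores never need to be put on a common scale, which is convenient when fusing embeddings from different encoders.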


2606.19706 2026-06-19 cs.CV cs.CL new submission

NEST: Narrative Event Structures in Time for Long Video Understanding

Ali Asgarov, Kaushik Narasimhan, Najibul Haque Sarker, Hani Alomari, Chia-Wei Tang, Anushka Sivakumar, Zaber Ibn Abdul Hakim, Shaurya Mallampati, Chris Thomas

Affiliations: Department of Computer Science, Virginia Tech

AI summary: Introduces the NEST dataset (1005 full-length movies) with multimodal narrative event annotations and relation links, to evaluate how well models understand event structure, temporal order, and long-range dependencies in long videos; experiments show that tasks such as event detection are highly challenging.

AI Chinese abstract

视觉-语言模型的最新进展使得处理越来越长的视频序列成为可能，但处理扩展令牌流的能力并不能转化为对长视频中叙事结构的理解。现有的长视频基准侧重于大海捞针式检索，而不是评估低级动作如何形成事件、事件如何跨时间交互以及叙事如何进展，例如，模型是否能够将早期的挫折（如失业）与后来的关系破裂联系起来，尽管存在长时间间隔、中间场景或重新诠释事件的闪回。我们引入了NEST（面向长视频理解的时间叙事事件结构），一个包含1005部全长电影（平均98分钟）的数据集，每部电影都标注了102个基于视觉内容、对话和音频的多模态叙事事件。NEST通过基于视觉内容、对话和音频的结构化标注捕捉多模态叙事事件，并通过反映叙事结构的关系（包括时间顺序、层次组合和长程依赖）将它们联系起来。我们引入了事件触发检测（ETD）、事件定位（EL）、事件论元抽取（EAE）和事件关系抽取（ERE）的基线。该基准对于基于事件发现极具挑战性，ETD低于8%，EL低于6%，EAE低于11%。相比之下，一旦事件给定，ERE更容易处理，零样本F1达到35.45%，微调后F1达到44.42%。

English abstract

Recent progress in vision-language models has enabled the processing of increasingly long video sequences, but the ability to handle extended token streams does not translate to understanding of narrative structure in long videos. Existing long video benchmarks focus on needle-in-a-haystack retrieval rather than evaluating how low-level actions form events, how events interact across time, and how narratives progress, for example, whether a model can connect an early setback, such as a job loss to a later relationship breakup, despite long gaps, intervening scenes, or flashbacks that reframe what occurred. We introduce NEST (Narrative Event Structures in Time for Long Video Understanding), a dataset of 1005 full-length movies (avg. 98 minutes), each annotated with 102 multimodal narrative events grounded in visual content, dialogue, and audio. NEST captures multimodal narrative events with structured annotations grounded in visual content, dialogue, and audio, and links them through relations that reflect narrative structure, including temporal ordering, hierarchical composition, and long-range dependencies. We introduce baselines for event trigger detection (ETD), event localization (EL), event argument extraction (EAE), and event relation extraction (ERE). The benchmark is highly challenging for grounded event discovery, with ETD below 8%, EL under 6%, and EAE below 11%. In contrast, ERE is more tractable once events are given, reaching 35.45% F1 zero-shot and 44.42% F1 after fine-tuning.


2606.19849 2026-06-19 cs.CV new submission

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

Yang Tan, Junlong Tong, Linan Yue, Hao Wu, Pengfei Fang, Xiaoyu Shen

Affiliations: Southeast University; Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University

AI summary: Proposes ViCoStream, a stage-wise coordinated pipeline (chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, query-side retrieval) for high-throughput, low-latency streaming VideoLLM inference, reaching 134 FPS video throughput and <50 ms time to first token on a single A100 with accuracy close to full-history baselines.

Comments 19 pages, 7 figures, 13 tables

AI Chinese abstract

流式视频大模型必须持续处理传入的视频，同时保持低查询延迟，这使得视频摄入吞吐量和查询时间响应性对于实时部署至关重要。现有方法主要集中于加速单个模块，如视觉编码、令牌剪枝或KV缓存压缩，但对由此产生的系统能否维持实时流式性能提供的见解有限。我们将流式视频大模型推理形式化为一个协调的流水线，涵盖视觉预处理、视觉编码、令牌丢弃和LLM预填充/解码。基于这一形式化，我们提出了ViCoStream（视频协调流式处理），一个阶段协调的流式框架，结合了分块执行、CUDA流重叠、视觉令牌控制、有界视觉注意力和查询端检索，以限制每块的计算和内存成本。我们进一步对瓶颈迁移进行了系统研究，揭示了块大小、令牌保留、注意力局部性和检索范围如何影响吞吐量-准确率权衡。在多个流式基准测试上使用Qwen2.5-VL-3B/7B-Instruct进行的实验表明，ViCoStream在单块A100 GPU上实现了134 FPS的视频吞吐量和小于50 ms的首令牌延迟，同时保持接近全历史基线的准确率。

English abstract

Streaming VideoLLMs must continuously process incoming video while maintaining low query latency, making both video-ingestion throughput and query-time responsiveness critical for real-time deployment. Existing methods largely focus on accelerating individual modules, such as visual encoding, token pruning, or KV-cache compression, but provide limited insight into whether the resulting system can sustain real-time streaming performance. We formulate streaming VideoLLM inference as a coordinated pipeline spanning visual preprocessing, visual encoding, token dropping, and LLM prefilling/decoding. Building on this formulation, we propose ViCoStream (Video Coordinated Streaming), a stage-wise coordinated streaming framework that combines chunk-wise execution, CUDA-stream overlap, visual token control, bounded visual attention, and query-side retrieval to bound per-chunk computation and memory costs. We further provide a systematic study of bottleneck migration, revealing how chunk size, token retention, attention locality, and retrieval scope shape the throughput-accuracy trade-off. Experiments with Qwen2.5-VL-3B/7B-Instruct across multiple streaming benchmarks show that ViCoStream achieves 134 FPS video throughput and less than 50 ms TTFT on a single A100 GPU while maintaining accuracy close to full-history baselines.
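One way to picture how a bounded window caps per-chunk cost is a fixed-length cache of recent chunks. The sketch below is an illustration only: the class `BoundedVisualCache`, the window of three chunks, and the four placeholder tokens per chunk are assumptions, not ViCoStream's actual data structures. The last line simply converts the reported 134 FPS into a per-frame time budget.

```python
from collections import deque


class BoundedVisualCache:
    """Keep visual tokens from only the most recent `max_chunks` chunks."""

    def __init__(self, max_chunks):
        # A deque with maxlen silently drops the oldest chunk when full.
        self.chunks = deque(maxlen=max_chunks)

    def add_chunk(self, tokens):
        self.chunks.append(list(tokens))

    def visible_tokens(self):
        """Tokens that a new chunk or query is allowed to attend to."""
        return [tok for chunk in self.chunks for tok in chunk]


cache = BoundedVisualCache(max_chunks=3)
for chunk_id in range(10):
    # Four placeholder tokens per chunk; real tokens would be feature vectors.
    cache.add_chunk([(chunk_id, j) for j in range(4)])

# Time available per frame, in milliseconds, at 134 frames per second.
frame_budget_ms = 1000.0 / 134
```

However long the stream runs, the number of visible tokens stays at most `max_chunks` times the tokens per chunk, so attention cost per chunk does not grow with video length. At 134 FPS the whole pipeline has roughly 7.5 ms per frame on average.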


2606.19927 2026-06-19 cs.CV new submission

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

Affiliations: School of Information Science and Engineering, Lanzhou University; School of Medical Technology, Beijing Institute of Technology; School of Computing, National University of Singapore

AI summary: Proposes CARE, which adaptively optimizes reasoning length through competence-aware reward shaping: it estimates competence with an exponential moving average, adjusts the reward preference in stages, and uses batch-level normalization and a posterior amplifier to improve efficiency and accuracy.

AI Chinese abstract

在多模态视频推理中，基于强化学习的方法通常依赖简单且不灵活的推理长度控制策略，无法适应模型不断变化的能力。这种不匹配可能在早期阶段抑制必要的探索，而在模型变得更有能力后鼓励冗余推理和低效解码。本文提出CARE，一种用于多模态推理中自适应推理长度优化的能力感知奖励塑形框架。具体来说，CARE通过通过率的指数移动平均维护平滑的能力估计，并利用它将训练路由到渐进阶段，将奖励偏好从探索导向的长形式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混淆，CARE进一步使用批次级统计归一化推理努力，并引入后验放大器以增强对历史上困难样本上意外强性能的奖励信号。所提出的机制无缝集成到GRPO训练流程中，且不增加额外推理开销。在多个视频推理和通用视频理解基准上的大量实验表明，CARE持续提高推理准确性，稳定强化学习，并显著提升令牌效率。此外，CARE在训练过程中展现出推理长度的特征性倒U型轨迹，并在收敛时产生更短但信息更丰富的推理轨迹，表明推理预算的有效自适应分配。我们在以下网址提供CARE框架和实验的源代码：此https URL。

English abstract

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.
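The competence estimate and stage routing described above might look roughly like the following. The smoothing factor, the two thresholds, and the stage names are invented for illustration; the abstract does not give CARE's actual values or the form of its reward terms.

```python
def update_competence(ema, pass_rate, beta=0.9):
    """Exponential moving average of the batch pass rate."""
    return beta * ema + (1.0 - beta) * pass_rate


def reward_stage(ema, explore_below=0.3, concise_above=0.7):
    """Map smoothed competence to a training stage (thresholds are made up)."""
    if ema < explore_below:
        return "explore"      # reward preference favours long-form reasoning
    if ema < concise_above:
        return "balance"
    return "concise"          # reward preference favours short reasoning


ema = 0.0
stages = []
for pass_rate in [0.1, 0.2, 0.5, 0.9, 0.9, 0.9]:
    ema = update_competence(ema, pass_rate, beta=0.5)
    stages.append(reward_stage(ema))
```

With these toy numbers the run moves from "explore" through "balance" to "concise" as the pass rate rises, which mirrors the inverted-U trajectory of reasoning length that the abstract reports.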


2606.20140 2026-06-19 cs.CV new submission

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Edoardo Mello Rella, Ajad Chhatkuli, Shipra Jain, Ender Konukoglu, Luc Van Gool

Affiliations: CVL, ETH Zurich; Align Technology; VISICS, KU Leuven; INSAIT, Sofia

AI summary: Proposes SA-VIS, a sparse frame annotation method whose Past-frames Feature Propagation module uses low-dimensional features; with annotations on only 1/5 of the frames, performance drops by just 0.4%, substantially reducing annotation cost.

AI Chinese abstract

最近的在线视频实例分割（VIS）方法取得了令人印象深刻的结果，因此成为视频中实例分割的首选方法。尽管令人印象深刻的单图像模型（例如基于SAM的模型）重新兴起，但在线（或半在线）VIS方法通过在训练期间使用长序列的密集标注帧，优于单图像模型。然而，这种VIS的训练设置在计算和所需密集标注方面成本高昂。为了解决这些主要缺陷，我们认为实例及其在视频中的演变的有效建模并不需要密集标注的帧。为此，我们提出了一个简单有效的模块，称为过去帧特征传播（PFP），它聚合来自多个帧的图像编码器的低维特征。这个简单的低计算量模块为使用稀疏视频帧标签进行端到端训练提供了巨大的学习能力。结合轻量级的帧特定实例查询，我们的稀疏帧标注VIS（SA-VIS）显著提高了其基线的性能。最有趣的是，我们避免复杂性的简单设计有效地弥合了在稀疏和密集标注视频序列上训练之间的精度差距。这意味着当仅使用数据集中1/5图像的标注时，SA-VIS的性能仅下降0.4%。实验上，SA-VIS在YouTube-VIS 2019/2021/2022和Occluded VIS（OVIS）上显示出相对于基线的强劲改进，并且在有限标注场景下，AP比最先进方法提高了1%以上。

English abstract

Recent online video instance segmentation (VIS) methods have achieved impressive results, thus becoming the preferred approach to segment instances in videos. Despite the resurgence of impressive single-image models, the online (or semi-online) VIS approaches outperform single-image models (e.g., based on SAM) by using long sequences of densely annotated frames during training. However, such a training setup for VIS is expensive in terms of both compute and the dense annotations required. To address these major flaws, we argue that effective modeling of instances and their evolution in videos does not require densely annotated frames. To that end, we propose a simple and effective module, called Past-frames Feature Propagation (PFP), which aggregates low-dimensional features from the image encoder of multiple frames. This simple low-compute module provides tremendous learning capability in using sparse video frame labels for end-to-end training. Combined with lightweight frame-specific Instance Queries, our Sparse frame Annotation VIS (SA-VIS) significantly improves performance over its baseline. Most interestingly, our simple design that avoids complexities effectively bridges the gap in accuracy between training on sparsely and densely annotated video sequences. This translates to a mere 0.4% drop in performance of SA-VIS when using annotations for only 1/5 of the images in the dataset. Empirically, SA-VIS shows strong improvements over the baseline on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), and an improvement of over 1% in AP over the state of the art in a limited-annotation scenario.


2606.20312 2026-06-19 cs.CV new submission

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

Ning Dong, Yingna Su, Xin Dong, Ziyun Jiao, Xinnian Guo, Zhuangzhuang Pan

AI summary: Proposes RPC, a post-hoc score calibration method that corrects the ranking of frozen pose-flow detectors with a standardized nearest-prototype deviation in latent space, improving AUROC by 2.03 percentage points on average across 8 backbone-dataset pairs.

Comments 15 pages, 5 figures, 7 tables. Code available at https://github.com/iNing10/RPC

AI Chinese abstract

姿态流视频异常检测器因其能为跟踪的骨架窗口提供基于似然的排名，在一类监控中具有吸引力。然而，单个似然分数可能隐藏多模态正常行为，并对姿态观测噪声敏感。我们研究了一个冻结检测器设置，其中姿态流骨干网络、缓存的骨架轨迹和评估流程是固定的。可靠性感知原型校准（RPC）是针对该设置的一种后验评分校准方法。它在冻结潜在空间中添加标准化的最近原型偏差到标准化的流分数，并仅使用关键点置信度来门控这一新增的几何证据。因此，RPC在保留原始密度信号的同时，利用姿态可靠性下的经验正常模式结构修正排名。在两个冻结姿态流骨干网络和四个数据集上，RPC在所有八个骨干-数据集对中提升了帧级AUROC，增益范围为0.34到4.49个百分点，平均为2.03个百分点。消融和可靠性分析表明，原型偏差是主要的修正信号，而可靠性门控在姿态观测不可靠时最为有用。这些结果表明，当重新训练或复现完整姿态流程不可行时，轻量级后验校准可以增强缓存的姿态流系统。

English abstract

Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
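The calibration rule can be sketched as "standardized flow score plus a gated, standardized prototype deviation". Everything specific below is an assumption made for illustration: the hard confidence threshold used as the gate, Euclidean distance to the nearest prototype, and all names and numbers. The paper's exact formulation may differ, for example by using a soft gate.

```python
import math


def zscore(values, mean, std):
    return [(v - mean) / std for v in values]


def rpc_score(flow_nll, latents, prototypes, confidences,
              flow_stats, dev_stats, conf_floor=0.5):
    """Hypothetical calibrated anomaly score (higher means more anomalous).

    flow_nll:    per-window negative log-likelihood from the frozen flow
    latents:     per-window latent vectors from the frozen backbone
    prototypes:  latent vectors summarising normal training windows
    confidences: mean keypoint confidence per window, in [0, 1]
    *_stats:     (mean, std) estimated on normal training data
    """
    deviations = [min(math.dist(z, p) for p in prototypes) for z in latents]
    z_flow = zscore(flow_nll, *flow_stats)
    z_dev = zscore(deviations, *dev_stats)
    # Gate: add the geometric term only when the pose is reliable.
    gates = [1.0 if c >= conf_floor else 0.0 for c in confidences]
    return [f + g * d for f, g, d in zip(z_flow, gates, z_dev)]


scores = rpc_score(
    flow_nll=[2.0, 2.0, 2.0],
    latents=[(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)],
    prototypes=[(0.0, 0.0), (10.0, 10.0)],
    confidences=[0.9, 0.9, 0.2],
    flow_stats=(1.0, 1.0),
    dev_stats=(1.0, 2.0),
)
```

All three windows have the same flow score, so the frozen detector alone cannot rank them. The prototype term lowers the score of the window sitting on a prototype, raises the one far from every prototype, and is switched off for the low-confidence window, which therefore falls back to the original density signal.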


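2606.20559 2026-06-19 cs.CV cs.LG new submission

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

Wenhao Chi, Arkaprava Sinha, Dominick Reilly, Hieu Le, Srijan Das

AI summary: Proposes UNIEGO, a hierarchical multi-teacher distillation framework that uses proxy models to translate heterogeneous teacher knowledge into a homogeneous egocentric space and applies Selective Proxy Distillation to adaptively pick reliable supervision, reaching state-of-the-art results on three egocentric video understanding tasks.

AI Chinese abstract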

自我中心视频理解本质上受限于可穿戴摄像头的狭窄视角：单一视角、单一模态、单一模型无法捕捉人类动作的全部丰富性。我们认为，真正富有表现力的自我中心表示必须包含跨视角、跨模态和基础模型表示的互补知识，同时仍能仅从自我中心视频部署。为此，我们引入了一个分层多教师蒸馏框架，生成UNIEGO，一个统一的自我中心编码器，使用九个教师（涵盖自我-外部视角、RGB、深度和骨架模态）以及四个基础模型进行训练。我们的框架不是直接从异构教师中蒸馏（其不兼容的架构和特征几何会导致冲突梯度），而是在其中插入一层表示特定的代理模型，将多样的教师知识转化为同质的自我中心空间。第二阶段蒸馏，即选择性代理蒸馏（SPD），然后自适应地为每个训练样本选择既正确又自信的代理子集，仅从可靠监督中蒸馏并抑制错误信号。SPD进一步通过将UNIEGO初始化为代理参数的凸组合来稳定，在蒸馏开始前将统一模型置于损失景观的良好条件区域。UNIEGO在三个自我中心视频理解任务（动作识别、视频检索和动作分割）上，在三个具有挑战性的自我-外部基准测试中达到了最先进的性能，优于朴素的多教师蒸馏基线，并证明了结构化的、代理中介的知识转移能产生更丰富、更具判别性的自我中心表示。

English abstract

Egocentric video understanding is inherently limited by the narrow perspective of wearable cameras: a single viewpoint, a single modality, a single model cannot capture the full richness of human action. We argue that a truly expressive egocentric representation must subsume complementary knowledge across viewpoints, modalities, and foundation model representations, yet remain deployable from egocentric video alone. To this end, we introduce a hierarchical multi-teacher distillation framework that produces UNIEGO, a unified egocentric encoder trained with nine teachers spanning ego-exo viewpoints, RGB, depth, and skeleton modalities, and four foundation models. Rather than distilling directly from heterogeneous teachers whose incompatible architectures and feature geometries induce conflicting gradients, our framework interposes a layer of representation-specific Proxy models that translate diverse teacher knowledge into a homogeneous egocentric space. A second distillation stage, Selective Proxy Distillation (SPD), then adaptively selects, for each training sample, the subset of proxies that are both correct and confident, distilling exclusively from reliable supervision and suppressing erroneous signals. SPD is further stabilized by initializing UNIEGO as a learned convex combination of proxy parameters, placing the unified model in a well-conditioned region of the loss landscape before distillation begins. UNIEGO achieves state-of-the-art performance across three egocentric video understanding tasks (action recognition, video retrieval, and action segmentation) on three challenging ego-exo benchmarks, outperforming naive multi-teacher distillation baselines and demonstrating that structured, proxy-mediated knowledge transfer yields richer and more discriminative egocentric representations.
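Two ingredients of SPD lend themselves to a short sketch: choosing the proxies that are "both correct and confident" for a sample, and forming a convex combination of proxy parameters. The confidence threshold, the softmax parameterisation of the weights, and the function names are assumptions for illustration and are not details given in the abstract.

```python
import math


def select_reliable_proxies(proxy_probs, label, min_conf=0.6):
    """Indices of proxies whose top class equals `label` with enough confidence."""
    keep = []
    for i, probs in enumerate(proxy_probs):
        top = max(range(len(probs)), key=probs.__getitem__)
        if top == label and probs[top] >= min_conf:
            keep.append(i)
    return keep


def convex_combination(param_sets, logits):
    """Average parameter vectors with softmax weights (non-negative, sum to 1)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(param_sets[0])
    return [sum(w * p[d] for w, p in zip(weights, param_sets)) for d in range(dim)]


# Three hypothetical proxies voting on one training sample with label 2.
probs = [[0.1, 0.1, 0.8], [0.5, 0.3, 0.2], [0.3, 0.2, 0.5]]
reliable = select_reliable_proxies(probs, label=2)

# Equal logits give equal weights, so the result is the plain average.
init_params = convex_combination([[1.0, 0.0], [0.0, 1.0]], logits=[0.0, 0.0])
```

Here only the first proxy is kept: the second predicts the wrong class and the third is correct but below the confidence threshold. Parameterising the weights through a softmax keeps them non-negative and summing to one while the logits are learned freely.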


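Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

Bingshuo Qian, Xiang Cheng

AI summary: Proposes learning the asynchronous schedule: a schedule-corrected objective optimizes the denoising order of multi-representation diffusion models, giving a 4x training speedup on ImageNet 256x256 with less than 1% additional training compute and reaching FID 1.02.

Comments 25 pages, 9 figures, 4 tables

AI Chinese abstract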

多表示扩散模型可以通过对图像的互补视图进行去噪来改善视觉合成，但其性能关键取决于决定每个表示何时去噪的异步调度。我们提出学习这种调度。我们的方法在多个表示空间上制定异步流匹配，并使用调度校正目标，该目标在调度变化时保持每个表示的局部噪声时间权重固定。我们用一个灵活的参数类实例化调度，该类通过构造是凸且单调的，并使用快速联合探针进行学习，额外训练计算少于1%。在ImageNet 256x256上，学习的调度在匹配的675M参数XL骨干下显著提高了收敛速度和最终质量。使用AutoGuidance，我们的200 epoch模型达到FID 1.05，与800 epoch的SFD-XL基线相当，训练量减少4倍。训练到600 epoch进一步改善到FID 1.02，优于1B参数的SFD-XXL结果（FID 1.04），同时使用更小的模型。在无引导设置中，我们的200 epoch模型达到FID 2.37，已经低于最佳800 epoch SFD-XL结果（2.54），训练量减少4倍，并在600 epoch时改善到FID 2.14。代码可在https://this URL获取。

English abstract

Multi-representation diffusion models can improve visual synthesis by denoising complementary views of an image, but their performance depends critically on the asynchronous schedule that determines when each representation is denoised. We propose to learn this schedule. Our method formulates asynchronous flow matching over multiple representation spaces and uses a schedule-corrected objective that keeps each representation's local noising-time weights fixed as the schedule changes. We instantiate the schedule with a flexible parametric class that is convex and monotone by construction, and learn it using a fast joint probe with less than 1% additional training compute. On ImageNet 256x256, the learned schedule substantially improves both convergence speed and final quality under a matched 675M-parameter XL backbone. With AutoGuidance, our 200-epoch model reaches FID 1.05, matching the 800-epoch SFD-XL baseline with 4x less training. Training to 600 epochs further improves to FID 1.02, outperforming the 1B-parameter SFD-XXL result of FID 1.04 while using a smaller model. In the unguided setting, our 200-epoch model reaches FID 2.37, already below the best 800-epoch SFD-XL result (2.54) at 4x less training, and improves to FID 2.14 at 600 epochs. Code is available at https://github.com/bsq532087/LWD
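A parametric family that is "convex and monotone by construction" can be built in several ways. The sketch below uses one simple option, a convex combination of power functions on [0, 1]; this choice, the name `make_schedule`, and the weights are assumptions for illustration and are not the class used in the paper.

```python
def make_schedule(weights, powers):
    """Return s(t) = sum_i w_i * t**p_i on [0, 1].

    With w_i >= 0, sum(w_i) == 1 and p_i >= 1, s is convex and
    non-decreasing by construction, with s(0) = 0 and s(1) = 1.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(p < 1 for p in powers):
        raise ValueError("powers must be >= 1")
    return lambda t: sum(w * t ** p for w, p in zip(weights, powers))


# Hypothetical schedule mapping global time t to one representation's time.
s = make_schedule(weights=[0.3, 0.7], powers=[1.0, 3.0])
grid = [i / 10 for i in range(11)]
values = [s(t) for t in grid]
```

Because each term is convex and non-decreasing on [0, 1] and the weights are non-negative, no constraint needs to be enforced during optimisation; any parameter setting inside the family is a valid schedule. The reported 4x figure follows from comparing 200 epochs with the 800-epoch baseline.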


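2606.19676 2026-06-19 cs.CV cs.AI new submission

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

Haengbok Chung

AI summary: Proposes TeleMorpher, a one-shot diffusion-based framework that edits the protagonist's motion and location in a video simultaneously by means of motion priors, pose warping, and injection into a baseline motion editor, with strong results in quantitative and qualitative evaluation.

AI Chinese abstract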

扩散模型在图像和视频生成与编辑中取得了显著成功。尽管最近的研究将工作扩展到运动编辑，但同步变换运动与位置——尽管具有实际重要性——仍基本未被探索。为了更好地理解鲁棒的运动-位置编辑，我们首先分析了降低其质量的根本因素。基于此分析，我们提出了TeleMorpher，据我们所知，这是首个用于同步运动-位置编辑的一步式框架之一。我们的方法利用运动先验（从现成模型生成的目标运动中心视频作为运动编辑指导）和真实运动，实现更可控和精确的运动-位置编辑。通过这种方式，我们的框架工作如下：(1) 首先通过预训练的分割和修复模型分离主角和背景。(2) 然后，我们引入一种无需训练的姿势扭曲，以运动先验为指导编辑主角的运动。(3) 扭曲运动视频的结果在推理时直接注入基线运动编辑器，减轻源运动与目标运动之间的差异，同时保留源视频的外观。(4) 为提高定量评估的可靠性，我们提出了两个新的基于LPIPS的指标，分别测量运动编辑前后背景一致性以及通过测量从源视频和目标视频中提取的主角骨架差异来评估运动编辑性能的保真度。在野外视频和TaiChi数据集上的实验表明，TeleMorpher在定量和定性测量（真实人类评估）中均取得了优越性能，凸显了其有效性。

English abstract

Diffusion models have achieved remarkable success in image and video generation and editing. While recent studies have extended these efforts toward motion editing, simultaneously transforming both motion and location, despite its practical importance, remains largely unexplored. To better understand robust motion-location editing, we first analyze the fundamental factors that degrade its quality. Based on this analysis, we propose TeleMorpher, which is, to the best of our knowledge, one of the first one-shot frameworks for simultaneous motion-location editing. Our approach leverages motion priors, a target motion-centric video generated from an off-the-shelf model as motion-editing guidance, and the ground truth motion to enable more controllable and precise motion-location editing. Our framework works as follows: (1) We first disentangle the protagonist and the background via pre-trained segmentation and inpainting models. (2) Then, we introduce a training-free pose warping that edits the protagonist's motion with the motion prior as guidance. (3) The resulting warped motion video is directly injected into a baseline motion editor during inference, mitigating the difference between source and target motions while preserving the appearance of the source video. (4) To enhance the reliability of quantitative evaluations, we propose two new LPIPS-based metrics: one measures background consistency before and after motion editing, and the other measures the fidelity of motion editing via the difference between the protagonist's skeletons extracted from the source and target videos. Experiments with in-the-wild videos and the TaiChi dataset demonstrate that TeleMorpher achieves superior performance across both quantitative and qualitative measurements (real-human evaluation), underscoring its effectiveness.


2606.19718 2026-06-19 cs.CV new submission

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

Shenjian Gong, Kangkan Wang, Shanshan Zhang, Jian Yang

Affiliations: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology; Advanced Laser Technology Laboratory of Anhui Province, Electronic Engineering Institute, National University of Defense Technology, and Jianghuai Advance Technology Center

AI summary: Proposes a method based on a conditional denoising diffusion model that uses 3D human priors (normal map and color prompt) as geometry and color conditions to synthesize high-quality human images in arbitrary poses and views, including occluded parts, from a single reference image.

Comments 30 pages, 10 figures

DOI: 10.1016/j.patcog.2026.113644

AI Chinese abstract

本文解决了单样本新视角和姿态人体图像合成的挑战。现有方法通过一组2D姿态关键点将参考人体图像转移到目标姿态，或基于可泛化人体NeRF（使用人体模型先验提取逐点特征）合成人体图像。然而，基于姿态转移的方法无法处理使用模糊2D姿态作为条件的复杂人体姿态，而可泛化人体NeRF在缺乏可靠特征时可能无法准确恢复被遮挡/不可见的人体部分。为解决这些问题，我们提出了一种基于条件去噪扩散模型的新方法，用于从单张人体图像进行新视角和姿态合成。我们的扩散模型将新视角和姿态合成问题分解为一系列条件去噪步骤。具体而言，为了生成具有复杂和任意姿态的人体，我们将3D人体先验（即3D法线图和颜色提示）作为几何和颜色条件引入生成过程。通过一系列扩散步骤将参考人体转移到目标人体，我们的扩散模型能够实现高质量合成，包括被遮挡/不可见部分。此外，我们提出了一种基于自重建的自定义细化方法，以在测试新视角时增强细节。在多个公共数据集上的实验结果表明，我们的方法显著优于先前方法，并显示出更好的跨数据集泛化能力。代码将在https://this https URL上公开。

英文摘要

This paper addresses the challenge of one-shot novel view and pose human image synthesis. Existing methods either transfer the reference human image to a target pose using a set of 2D pose keypoints or synthesize human images based on generalizable human NeRFs, which use human model priors to extract point-wise features. However, pose-transfer-based methods cannot handle complex human poses when conditioned on ambiguous 2D poses, while generalizable human NeRFs may fail to accurately recover occluded/invisible human parts when reliable features cannot be extracted. To solve these problems, we propose a novel approach for novel view and pose synthesis from a single human image via a conditional denoising diffusion model. Our diffusion model divides the novel view and pose synthesis problem into a sequence of conditional denoising steps. Specifically, to generate humans with complex and arbitrary poses, we introduce 3D human priors, i.e., a 3D normal map and a color prompt, as geometry and color conditions into the generation process. By transferring the reference human into the target human with a series of diffusion steps, our diffusion model enables high-quality synthesis including the occluded/invisible parts. Further, we propose a self-reconstruction based customized refinement to enhance fine details when tested on novel persons. Experimental results on different public datasets demonstrate that our approach significantly outperforms previous methods and also shows better generalization ability across datasets. The code will be made publicly available at https://github.com/Yankeegsj/3DPGDM.


2606.19889 2026-06-19 cs.CV new submission

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

Wentao Pan, Wuyang Li, Shengyuan Liu, Xinyu Liu, Hengyu Liu, Yixuan Yuan

Affiliations: The Chinese University of Hong Kong; EPFL; Imperial College London

AI summary: Proposes SurgVista, a surgical world model that uses Deformation Consistency Regularization and Drift Adaptation Training to address spatial interaction incoherence and temporal fidelity collapse, outperforming existing methods in long-horizon prediction.

Abstract

Scaling robot policy learning for autonomous surgery is challenging, as expert demonstrations are expensive and in vivo exploration poses substantial safety risks. Surgical world models address this by generating realistic, action-conditioned future frames from an initial observation, but existing methods exhibit two persistent failure modes: spatial interaction incoherence, where visible instrument contact fails to induce spatially consistent tissue deformation, and temporal fidelity collapse, where prediction errors compound across autoregressive rollouts and progressively corrupt visual quality. We present SurgVista, a surgical world model that mitigates both failures through two training recipes. Deformation Consistency Regularization extracts scene-point trajectories from training videos and enforces cross-frame coherence through latent contrastive learning, strengthening physically consistent instrument-tissue dynamics. Drift Adaptation Training mitigates long-horizon drift by perturbing conditioning frames with online prediction residuals and photometric augmentations calibrated to long-horizon drift statistics, sustaining visual fidelity over extended rollouts. To enable rigorous evaluation, we further introduce SurgWorld-Bench, featuring diverse procedure types, long-range rollouts, and decoupled metrics for instrument-motion accuracy and tissue-response fidelity. Extensive experiments show that SurgVista consistently outperforms state-of-the-art methods across visual quality, temporal consistency, and interaction fidelity, with gains widening as the prediction horizon grows.


2606.19958 2026-06-19 cs.CV new submission

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

Meixi Li, Xianlin Zhang, Yue Zhang, Xueming Li

Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes SketchKeyAnime, a video diffusion framework that uses a dual-branch conditioning mechanism and Sketch Cross Attention with learnable gating to generate structurally controllable, appearance-consistent, and temporally coherent animations from a single reference RGB image and sparse key sketches, outperforming baselines on the Aesthetic subset of Sakuga-42M.

Abstract

Traditional animation production relies heavily on manual drawing and iterative refinement, particularly for key-pose design, in-betweening, and character coloring. While existing animation and video generation methods have made notable progress, they typically depend on RGB boundary frames, dense frame-wise conditions, or complete sketch sequences, limiting their applicability under low-cost input conditions. We present SketchKeyAnime, a video diffusion framework for generating structurally controllable, appearance-consistent, and temporally coherent animations from sparse key-sketch inputs. Given a single reference RGB image and a few temporally indexed key sketches, SketchKeyAnime introduces a dual-branch conditioning mechanism to encode local geometric constraints alongside semantic-temporal context. It leverages Sketch Cross Attention to fuse reference image and sketch conditions with learnable gating, and incorporates an Adaptive Weighted Loss to strengthen supervision on key-sketch frames and line-art regions. Experimental results on the Aesthetic subset of Sakuga-42M show that our approach consistently outperforms representative animation interpolation and sketch-guided generation baselines. Compared to the best-performing baseline, SketchKeyAnime reduces EDMD by 31.9% and FVD by 9.5%, demonstrating superior sketch fidelity and temporal coherence, while achieving the best overall performance across most quantitative metrics. These results validate the proposed framework and highlight its potential for low-cost, highly controllable animation creation.


2606.19970 2026-06-19 cs.CV new submission

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Xiyuan Wang, Xiao Zhang, Yang Li, Ruoxi Jiang, Zhao Zhong, Liefeng Bo, Muhan Zhang

Affiliations: Institute for Artificial Intelligence, Peking University; Tencent; Fudan University

AI summary: Proposes CrossFlow, a cross-space flow model that maps noisy latent inputs directly to pixel images. A velocity-free one-step objective enables latent-to-pixel generation and lets the model replace the decoder in latent diffusion, reaching 1.62 FID on ImageNet-1k.

Comments: Preprint, Under Review

Abstract

Most diffusion and flow-matching generators define the prior, probability path, and prediction target in the same representation space. Latent diffusion improves efficiency by moving this path into an autoencoder latent space, but the final sample is still produced by a separately trained decoder. This separation creates a mismatch: the generator is optimized for latent-space prediction, while final quality depends on how the decoder handles generated latents that may differ from clean encoder outputs. We introduce CrossFlow, a cross-space flow formulation that maps noisy latent inputs directly to pixel-space images. The key technical step is a velocity-free one-step objective: the latent trajectory defines the training path, but the supervised prediction is an image rather than a latent displacement. This lets one model act both as a one-step latent-to-pixel generator and as a decoder replacement for latent diffusion pipelines. On class-conditional ImageNet-1k at $256\times256$, CrossFlow-XL achieves 1.62 FID with one function evaluation. Ablations show that the latent encoder and pixel-space perceptual and adversarial losses are important for fidelity. These results indicate that cross-space flow objectives can combine the efficiency of latent representations with direct pixel-space supervision, without requiring a separate decoder at inference.


2606.20076 2026-06-19 cs.CV cs.AI new submission

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Dong Hoon Lee, Seunghoon Hong

Affiliations: Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea; School of Computing, KAIST, Daejeon, South Korea

AI summary: To address the fixed compression ratio that limits the quality-compute trade-off of diffusion models, proposes a variable-length tokenizer based on learnable global merging. Merging tokens aligns representations across lengths and yields a better gFID-compute trade-off on ImageNet 256×256 generation.

Abstract

Latent Diffusion Models (LDMs) have become dominant in visual synthesis, but their quality-compute trade-off is largely constrained by the tokenizer's fixed compression ratio. Variable-length tokenizers (VLTs) promise adaptive compression by varying token counts, allowing diffusion models to flexibly balance quality and compute. However, conventional VLTs modulate length by truncating ordered token sequences, which makes token semantics depend on token position and breaks representational alignment across lengths. This leads to a cross-length shift in the latent distribution that hinders a single variable-length diffusion model from operating effectively. To address this, we propose a novel variable-length tokenizer that modulates length by merging tokens. We show that encouraging similar tokens to merge enables direct cross-length representation alignment when the diffusion transformer operates according to the merging pattern. Since conventional merging methods are data-dependent, making the merging pattern inaccessible during generation, we introduce learnable global merging, which is data-independent, to ensure compatibility with diffusion transformers. On ImageNet 256$\times$256 generation, our merging-based variable-length tokenizer integrated with a diffusion transformer achieves a superior gFID-compute trade-off compared to prior VLT methods. Code is available at https://github.com/movinghoon/lgm.
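
The idea of shortening a token sequence with a data-independent merging pattern can be illustrated with a small sketch. This is not the authors' tokenizer: the patterns below are written by hand rather than learned, averaging is assumed as the merge operation, and the name `merge_tokens` is hypothetical.

```python
def merge_tokens(tokens, assignment, num_groups):
    """Average tokens that share a group index.

    tokens:     list of N feature vectors (lists of floats)
    assignment: list of N group indices in [0, num_groups); it is fixed
                ahead of time and does not depend on the token values
    """
    dim = len(tokens[0])
    sums = [[0.0] * dim for _ in range(num_groups)]
    counts = [0] * num_groups
    for tok, g in zip(tokens, assignment):
        counts[g] += 1
        for d in range(dim):
            sums[g][d] += tok[d]
    if 0 in counts:
        raise ValueError("every group needs at least one token")
    return [[v / c for v in row] for row, c in zip(sums, counts)]

tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
pattern_2 = [0, 0, 1, 1]   # hypothetical pattern: 4 tokens -> 2 tokens
pattern_1 = [0, 0, 0, 0]   # hypothetical pattern: 4 tokens -> 1 token
short = merge_tokens(tokens, pattern_2, 2)
shortest = merge_tokens(tokens, pattern_1, 1)
print(short, shortest)
```

Because the pattern does not depend on the data, a generator can know which positions are merged before any image exists, which is the compatibility property the abstract emphasizes.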


2606.20083 2026-06-19 cs.CV new submission

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Xiangchen Yin, Wenzhang Sun, Jiahui Yuan, Zijie Liu, Yinda Chen, Wei Li, Dachun Kai, Chunfeng Wang, Xiaoyan Sun

AI summary: Proposes Holo-World, a unified video world model that jointly controls camera, object motion, and weather from a single image, using a Unified Scene Adapter and Scene-Weather Decomposed CFG for world preservation and weather transfer.

Comments: Project page: https://xiangchenyin.github.io/Holo-World Code: https://github.com/XiangchenYin/Holo-World

Abstract

Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
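
One common way to guide two conditions separately is to split the classifier-free guidance update into nested residuals, each with its own scale. The sketch below assumes that form. The abstract does not give the exact formula, so the nesting order, the names `w_scene` and `w_weather`, and the toy vectors are all assumptions rather than the paper's definition.

```python
def decomposed_cfg(eps_uncond, eps_scene, eps_full, w_scene, w_weather):
    """Combine three noise predictions with separate guidance scales.

    eps_uncond: prediction with no condition
    eps_scene:  prediction with scene controls only
    eps_full:   prediction with scene controls and the weather instruction
    """
    return [
        u + w_scene * (s - u) + w_weather * (f - s)
        for u, s, f in zip(eps_uncond, eps_scene, eps_full)
    ]

eps_u = [0.0, 0.0]
eps_s = [1.0, 0.0]
eps_f = [1.0, 2.0]

# Both scales at 1 recover the fully conditioned prediction.
same = decomposed_cfg(eps_u, eps_s, eps_f, 1.0, 1.0)
# Raising only the weather scale amplifies the weather residual alone.
strong_weather = decomposed_cfg(eps_u, eps_s, eps_f, 1.0, 3.0)
print(same, strong_weather)
```

In this toy case the first coordinate (the scene residual) is unchanged when only `w_weather` grows, which is the kind of separation the abstract describes as avoiding over-amplification of the full condition.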


2606.20094 2026-06-19 cs.CV cs.AI cs.GR cs.LG cs.MM new submission

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Nefeli Andreou, Angel Martínez-González, Sabine Sternig, Matthieu Guillaumin, Epameinondas Antonakos, Michael Opitz

Affiliations: Amazon

AI summary: Proposes MakeupMirror, a diffusion model that combines ControlNet geometry conditioning, region-specific transfer control, skin-tone modulation, and a Levenberg-Marquardt Langevin sampler to achieve high-quality makeup transfer while preserving facial features and skin tone. Relative to Stable-Makeup, it improves facial recognition similarity by 60% and reduces skin tone difference by 50%.

Abstract

Makeup transfer models enable fun augmented reality (AR) experiences as well as virtual try-on (VTO) for online makeup shopping. While recent state-of-the-art diffusion-based solutions such as Stable-Makeup dramatically improve the accuracy and realism of makeup transfer, they still face limitations in identity and skin color preservation, making production-level VTO for makeup shopping unrealistic. In this work, we propose MakeupMirror, a diffusion-based approach to makeup transfer that makes significant progress towards preserving facial features and skin tone. We introduce several technical innovations over Stable-Makeup: (1) integration of facial geometry conditioning with ControlNets to maintain facial fidelity; (2) region-specific makeup transfer control to enable precise makeup application across facial regions such as skin, eyes and lips; (3) skin tone-based makeup transfer modulation that prevents skin tone alteration in cross-subject transfer scenarios; and (4) integration of a Levenberg-Marquardt Langevin sampler to speed up inference while maintaining generation quality. Our experiments on the CPM-Real, Makeup Wild, and newly collected, more diverse MakeupSelfies datasets show that, compared with Stable-Makeup, MakeupMirror improves relative facial recognition similarity by 60% and reduces relative skin tone difference by 50%, with a latency of 0.7s, while achieving an expert acceptance rate of 94% across core facial identity preservation criteria.


2606.20233 2026-06-19 cs.CV new submission

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Tianyi Xiang, Mingming He, Li Ma, Jing Liao

Affiliations: City University of Hong Kong; Independent Researcher

AI summary: Proposes an end-to-end video diffusion framework that models bidirectional physical and lighting interactions between character and environment through tri-mask guidance and RGB-D joint denoising, enabling high-quality dynamic video compositing.

Abstract

Cinematic compositing aims to integrate green-screen characters into novel environments while maintaining physical and photometric realism. Previous methods often fail to capture the complex bidirectional interactions between characters and their surroundings, which we characterize as Character-to-Environment (C2E) physical interaction and Environment-to-Character (E2C) lighting harmonization. To address this, we propose an end-to-end video diffusion framework that jointly models C2E and E2C interactions, specifically handling the challenges of interactive props. Our approach introduces a tri-mask-guided architecture with RGB-D joint denoising to ensure physically consistent interactions among the character, props, and environment. We further develop an efficient prior-driven data curation pipeline to construct high-quality relighting pairs without expensive rendering. Finally, a reference-conditioned mechanism enables controllable environment synthesis and precise prop replacement. Extensive experiments demonstrate that our framework significantly outperforms existing methods in cinematic-quality dynamic video compositing.


2606.20310 2026-06-19 cs.CV new submission

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

Affiliations: City University of Hong Kong; Video Rebirth; The Chinese University of Hong Kong

AI summary: Proposes PRISM, which uses a frozen video diffusion backbone and a lightweight query-based aggregation head to decode preference signals from noisy latents, achieving high-accuracy preference prediction and noise robustness and supporting early-stage Best-of-N sampling that reduces compute and improves video quality.

Abstract

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
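
The compute saving from early-stage Best-of-N filtering can be sketched with toy stand-ins. This is only an illustration of the control flow, not PRISM itself: the latents are plain floats, and `denoise_step`, `score_latent`, and `early_best_of_n` are hypothetical placeholders for a diffusion step, a preference head on noisy latents, and the sampling loop.

```python
import random

def early_best_of_n(init_latents, denoise_step, score_latent,
                    num_steps, score_at_step, keep):
    """Denoise all candidates up to `score_at_step`, keep the top `keep`
    by score, then finish denoising only the survivors.

    Returns the final latents and the number of denoiser calls made.
    """
    candidates = list(init_latents)
    calls = 0
    for step in range(num_steps):
        if step == score_at_step:
            candidates.sort(key=score_latent, reverse=True)
            candidates = candidates[:keep]
        candidates = [denoise_step(z, step) for z in candidates]
        calls += len(candidates)
    return candidates, calls

random.seed(0)
init = [random.random() for _ in range(8)]   # 8 toy "noisy latents"
denoise_step = lambda z, step: 0.9 * z       # toy denoiser
score_latent = lambda z: z                   # toy preference score

finals, calls = early_best_of_n(init, denoise_step, score_latent,
                                num_steps=10, score_at_step=2, keep=2)
full_calls = len(init) * 10                  # cost of denoising every candidate fully
print(len(finals), calls, full_calls)
```

With these toy settings the filtered run makes 32 denoiser calls instead of 80, since only two of the eight candidates are carried past step 2.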


2606.20404 2026-06-19 cs.CV new submission

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

Daniel Gilo, Sven Elflein, Ido Sobol, Or Litany

Affiliations: Technion; NVIDIA; University of Toronto; Vector Institute

AI summary: Addressing the tendency of conditional diffusion and flow models to violate task constraints, proposes FlowBender, a closed-loop framework that feeds the alignment error to the network as an input so that it learns a correction policy, improving both fidelity and plausibility in image translation, restoration, and 3D texturing.

Comments: Project page: https://flow-bender.github.io/

Abstract

Conditional diffusion and flow models routinely fail to satisfy the very constraints that define their task. For instance, a depth-conditioned model often produces images whose re-extracted depth disagrees with the input, even though the forward operator--the depth predictor defining the constraint--is available during both training and inference. Existing approaches generally fall into two categories: supervised models that treat the conditioning signal as a static cue and ignore alignment information at inference, and guidance-based methods that consult it through hand-tuned linear updates, typically trading fidelity to the condition against the plausibility of the generated sample. We argue that the fundamental gap in both paradigms is that the model is never trained to utilize its own alignment error. We introduce FlowBender, a closed-loop framework that treats this error as a first-class input, training the network to learn a correction policy conditioned on inference-time feedback. At each step, an unguided look-ahead pass estimates the clean signal, a task-specific deviation is computed via the forward operator, and a refinement pass consumes this signal to produce a corrected velocity. We propose several variants of FlowBender, including a gradient-based formulation for differentiable operators and a zero-order variant for non-differentiable settings such as JPEG compression. For efficient sampling, we introduce a prior-step shortcut that enables closed-loop correction at a minimal additional computational cost. Across image-to-image translation, restoration, and 3D mesh texturing, FlowBender consistently outperforms standard supervised baselines, alignment-loss-augmented training, and state-of-the-art inference-time guidance, improving fidelity and plausibility simultaneously rather than trading them against each other. Project page: https://flow-bender.github.io/
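
The look-ahead, deviation, and refinement sequence described above can be sketched as a single sampling step. This is a toy under stated assumptions, not the FlowBender implementation: it assumes a straight-line flow ending at t = 1, works on a scalar, and replaces the learned refinement network with a hand-written rule. All names are hypothetical.

```python
def closed_loop_step(x, t, h, base_velocity, forward_op, target, refine):
    """One Euler step with feedback.

    1. Look-ahead: estimate the clean sample from the unguided velocity,
       assuming a straight-line path that ends at t = 1.
    2. Deviation: compare forward_op(estimate) with the condition.
    3. Refinement: let `refine` turn velocity and deviation into a
       corrected velocity.
    """
    v = base_velocity(x, t)
    x1_hat = x + (1.0 - t) * v
    deviation = forward_op(x1_hat) - target
    v_corrected = refine(v, deviation, t)
    return x + h * v_corrected

# Toy 1-D stand-ins (all hypothetical).
base_velocity = lambda x, t: 1.0            # unguided model heads for x = 1
forward_op = lambda x: x                    # identity "measurement"
target = 2.0                                # condition the sample should satisfy
refine = lambda v, d, t: v - d / (1.0 - t)  # hand-written rule, NOT a learned policy

def sample(num_steps, use_feedback):
    x, h = 0.0, 1.0 / num_steps
    for i in range(num_steps):
        t = i * h
        if use_feedback:
            x = closed_loop_step(x, t, h, base_velocity, forward_op, target, refine)
        else:
            x = x + h * base_velocity(x, t)
    return x

open_loop = sample(4, use_feedback=False)
closed_loop = sample(4, use_feedback=True)
print(open_loop, closed_loop)
```

In the toy, the open-loop sampler ends at 1 and misses the condition, while the closed-loop sampler ends at 2. The paper's point is that the refinement is learned from this feedback rather than hand-tuned as it is here.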


2606.20506 2026-06-19 cs.CV cs.AI new submission

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

Jinghong Lan, Wei Cheng, Yunuo Chen, Ziqi Ye, Peng Xing, Yixiao Fang, Rui Wang, Yufeng Yang, Xuanyang Zhang, Xianfang Zeng, Difan Zou, Gang Yu, Chi Zhang

AI summary: Proposes FreeStyle, a framework that uses community LoRAs as anchors and a two-stage curriculum (an attention-level constraint and frequency-aware RoPE modulation) to address content leakage in dual-reference generation, and introduces a new benchmark and evaluation metrics, balancing style alignment, content preservation, and leakage suppression.

Comments: 35 pages, 26 figures. Project page: https://github.com/Blue2Giant/FreeStyle

Abstract

Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style of a separate style reference. Despite recent progress, this setting remains challenging because models must balance content fidelity, style alignment, and instruction following while avoiding semantic leakage from the style reference. A key bottleneck is the lack of large-scale triplet data with clean content-style separation and broad long-tail style coverage. In this work, we propose FreeStyle, a scalable dual-reference generation framework based on community LoRA mining. We treat community LoRAs as compositional anchors for style and content, and design a rigorous generation and filtering pipeline to construct large-scale Style-Reference and Content-Reference triplets across multiple base models. To address content leakage, we adopt a two-stage curriculum with stage-specific disentanglement mechanisms: an attention-level enrichment constraint that suppresses style-reference leakage in the style-transfer stage, and a frequency-aware RoPE modulation strategy that targets positional-correspondence-based leakage in the harder dual-reference stage. We also introduce a benchmark covering both style-reference and dual-reference generation, with evaluations on style similarity, content preservation, aesthetics, instruction following, and leakage rejection. The benchmark incorporates a style-invariant Content Alignment Score (CAS) and introduces a calibrated VLM-based Rejection Score for evaluating generation reliability and leakage suppression. Extensive experiments show that our model achieves a strong balance among style alignment, content preservation, and leakage suppression.


2606.20543 2026-06-19 cs.CV new submission

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes BA-solver, which uses a lightweight SideNet (1-2% of the backbone size) to learn bidirectional temporal perception and bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it matches the quality of a 100+ step Euler solver within 10 steps and is plug-and-play.

Abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
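
The low-NFE degradation of training-free solvers mentioned above can be seen on a toy ODE. The sketch below is only an illustration with a hand-written velocity field (dx/dt = x, whose exact solution at t = 1 is e), not a flow-matching model and not BA-solver. Each explicit Euler step costs one velocity evaluation, so the step count plays the role of NFE.

```python
import math

def euler_solve(velocity, x0, num_steps, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 with `num_steps`
    explicit Euler steps (one velocity evaluation per step)."""
    h = (t1 - t0) / num_steps
    x, t = x0, t0
    for _ in range(num_steps):
        x = x + h * velocity(x, t)
        t += h
    return x

velocity = lambda x, t: x      # toy field; exact solution is x0 * e^t
exact = math.e
errors = {n: abs(euler_solve(velocity, 1.0, n) - exact) for n in (5, 10, 100)}
for n, err in errors.items():
    print(f"NFE={n:3d}  abs error={err:.4f}")
```

The error shrinks only linearly with the number of steps, which is why a plain Euler solver needs many evaluations and why methods that estimate intermediate velocities cheaply are attractive.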


2602.15819 2026-06-19 cs.CV updated version

Occ-VLM: Occupancy-Grounded Vision-Language Model for Indoor Scene Understanding

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

Affiliations: School of Electronic Science and Engineering, Nanjing University

AI summary: Proposes Occ-VLM, which uses only posed RGB images and a single 2D vision encoder and reconstructs 3D occupancy as a geometric prior for unified 3D scene understanding, reaching state-of-the-art occupancy prediction and results on par with 3D-input VLMs on 3D VQA and dense captioning.

Abstract

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.


2606.19805 2026-06-19 cs.CV cs.AI new submission

ParaScale: Scale-Calibrated Camera-Motion Transfer via a Gauge-Invariant Parallax Number

Zijie Meng

Affiliations: Peking University

AI summary: Proposes ParaScale, a module that achieves scale-faithful camera-motion transfer through the gauge-invariant Parallax Number Pi, requires no retraining, and reduces Parallax Consistency Error by more than 3x across four orders of magnitude of scale.

Comments: Accepted by SCA 2026 (poster)

Abstract

Transferring the camera motion of a reference video to a freshly generated one lets creators reuse cinematic moves. Yet reference and target often live at incompatible scales -- a sweep across a galaxy versus a nudge across a desk -- and naively reusing the recovered trajectory yields either imperceptible or violently exaggerated motion. We trace this to a geometric fact: translation-induced image motion scales as ||T||/Z, so a monocular trajectory is meaningful only up to a depth-scale gauge. We distill this into the Parallax Number Pi = ||Delta T|| / Zbar, a dimensionless, gauge-invariant descriptor of how strongly a camera move is felt, and prove that it -- not the raw trajectory -- is the quantity that scale-faithful transfer must preserve. ParaScale is a plug-and-play module that reads Pi off any reference video and re-realizes it against the target scene's own depth, per frame, leaving rotation untouched. Sitting between pose extraction and pose injection, it requires no retraining and drops into any pose-conditioned generator. We further introduce the Parallax Consistency Error (PCE), a scale-symmetric metric that -- unlike the similarity-aligned TransErr -- exposes scene-scale mismatch. Across scale regimes spanning four orders of magnitude and multiple backbones, ParaScale keeps the realized parallax on the identity line and cuts PCE by more than 3x over uncalibrated transfer with no loss of visual fidelity.
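
The rescaling implied by the Parallax Number can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes one translation step and one mean scene depth per frame, and the function names `parallax_number` and `retarget_translation` are hypothetical. How the mean depth Zbar is estimated is outside this sketch.

```python
import math

def parallax_number(delta_t, mean_depth):
    """Pi = ||Delta T|| / Zbar for one frame-to-frame camera translation."""
    norm = math.sqrt(sum(c * c for c in delta_t))
    return norm / mean_depth

def retarget_translation(delta_t_ref, mean_depth_ref, mean_depth_tgt):
    """Rescale a reference translation step so the target keeps the same Pi.

    Only the magnitude changes; the direction is preserved and rotation
    is not touched (assumption: a single scalar depth per frame).
    """
    scale = mean_depth_tgt / mean_depth_ref
    return [c * scale for c in delta_t_ref]

# Hypothetical numbers: a 3-unit move in a scene about 100 units deep,
# transferred to a tabletop scene about 0.5 units deep.
ref_step = [3.0, 0.0, 0.0]
tgt_step = retarget_translation(ref_step, mean_depth_ref=100.0, mean_depth_tgt=0.5)
pi_ref = parallax_number(ref_step, 100.0)
pi_tgt = parallax_number(tgt_step, 0.5)
```

Reusing `ref_step` unchanged in the tabletop scene would give a Parallax Number of 6 instead of 0.03, which is the "violently exaggerated motion" case the abstract describes.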

2606.20103 2026-06-19 cs.CV 新提交

Geometry-Preserving in 3D Gaussian Splatting for LiDAR-Camera Extrinsic Calibration

3D高斯溅射中保持几何结构的LiDAR-相机外参标定

Kyoleen Kwak, Daeho Kim, Jeong Woon Lee, Hyoseok Hwang

发表机构 * Kyung Hee University（庆熙大学）

AI总结针对LiDAR-相机标定中跨模态特征稀缺问题，提出通过多视图LiDAR深度监督和阻止光度梯度更新高斯空间参数来保持3DGS代理的度量几何，提升标定精度。

Comments Accepted to ECCV 2026. 15 pages (excluding references), 5 figures

AI中文摘要

精确的LiDAR-相机标定对于鲁棒的多模态感知至关重要。无目标方法避免了手动设置，但仍受限于跨模态判别特征的稀缺性。最近的方法通过在可微模型中重建场景，通过密集光度监督实现外参优化。其中，3D高斯溅射（3DGS）被广泛用作几何代理，在单一可微框架内桥接LiDAR和相机。然而，由于3DGS最初是为新视图合成设计的，现有方法倾向于优先考虑渲染质量，导致代理几何偏离真实的LiDAR结构。我们提出了一种框架，通过聚合多视图LiDAR观测进行密集深度监督，并阻止光度梯度更新高斯空间参数，从而保持高斯代理的度量几何。我们在公开驾驶数据集上验证了该方法，在标定精度上持续优于现有无目标方法。

英文摘要

Accurate LiDAR-camera calibration is essential for robust multi-modal perception. Targetless approaches avoid manual setup but remain limited by the scarcity of discriminative cross-modal features. Recent methods address this by reconstructing the scene within a differentiable model, enabling extrinsic optimization through dense photometric supervision. Among these, 3D Gaussian Splatting (3DGS) has been widely adopted as a geometric proxy that bridges LiDAR and camera within a single differentiable framework. However, since 3DGS was originally designed for novel view synthesis, existing methods tend to prioritize rendering quality, causing the proxy geometry to drift from the true LiDAR structure. We propose a framework that preserves the metric geometry of the Gaussian proxy by aggregating multi-view LiDAR observations for dense depth supervision and blocking photometric gradients from updating the Gaussian spatial parameters. We validate our method on public driving datasets, where it consistently outperforms existing targetless methods in calibration accuracy.

2606.20131 2026-06-19 cs.CV cs.GR 新提交

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

TriFlow: 通过最近顶点向量场生成类艺术家3D网格拓扑

Haoxuan Li, Ziya Erkoç, Daniele Sirigatti, Vladislav Rosov, Lei Li, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich（慕尼黑工业大学）； AUDI AG（奥迪股份公司）； University of Virginia（弗吉尼亚大学）

AI总结提出TriFlow，一种基于最近顶点向量场（NVF）的生成方法，通过流匹配模型合成NVF并引导拓扑感知的网格简化，直接从输入几何条件生成紧凑且具有类艺术家拓扑的3D网格。

AI中文摘要

我们提出了TriFlow，一种新的生成方法，能够直接从输入几何条件（如符号距离场）生成具有类艺术家三角形拓扑的紧凑3D网格。我们的关键见解是将网格拓扑表示为在表面上定义的最近顶点向量场（NVF），其中每个点编码其在局部重心坐标系中与最近三角形顶点的关联。我们训练一个潜在流匹配模型来合成该场，从而实现基于输入几何条件的拓扑生成。为了提取连贯的网格，我们使用生成的NVF对表面区域进行聚类，并引导具有拓扑感知优化的约束二次误差度量（QEM）网格简化。这产生了与输入几何紧密匹配且具有结构化、类艺术家连接性的输出网格。实验表明，与最先进的基于学习方法相比，TriFlow实现了更强的泛化能力和显著提高的拓扑质量，同时Chamfer距离降低了90%，速度提升了8倍。

英文摘要

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.

2606.20531 2026-06-19 cs.CV 新提交

VisDom: Sparse Novel View Synthesis with Visible Domain Constraint

VisDom: 具有可见域约束的稀疏新视角合成

Mariia Gladkova*, Tarun Yenamandra*, Edmond Boyer, Robert Maier, Tony Tung, Daniel Cremers

发表机构 * TU Munich（慕尼黑工业大学）； MCML（慕尼黑机器学习中心）

AI总结提出VisDom，一种无学习的几何约束，通过最小多视角可见性要求增强视觉外壳重建，作为稀疏新视角合成中的空间先验，集成到NeRF和GS管线中，从四张输入图像实现高质量重建。

AI中文摘要

稀疏新视角合成（NVS）由于从少量输入视角恢复3D几何的歧义性仍然具有挑战性。虽然基于NeRF和高斯泼溅（GS）的方法在密集监督下表现良好，但在稀疏设置中它们往往过拟合，产生漂浮伪影和不一致的几何。轮廓一致性通常用作正则化器，但还不够，因为轮廓一致区域可能超出真实物体几何。我们引入VisDom，一种无学习的几何约束，通过强制执行最小多视角可见性要求来增强经典的基于雕刻的视觉外壳重建。具体地，我们将可见域定义为至少被$K$个视角观察到的3D空间子集，并将其用作标准基于轮廓重建之上的额外过滤标准。这在稀疏视角设置中提供了更强的空间先验。我们通过限制体积采样和指导优化过程中的高斯放置，将VisDom集成到隐式（NeRF）和显式（GS）管线中。在三个具有挑战性的数据集上的实验表明，该方法在稀疏视角NVS上取得了一致的改进，使得从仅四张输入图像就能实现高质量以物体为中心的重建。我们的方法是领域无关的，仅需要轮廓，并且不引入学习参数，使其成为现有方法的简单补充。在GaussianObject之上应用VisDom进一步提高了在Omni3D和MipNeRF360上的性能，同时以低22倍的训练成本达到或超越其性能。

英文摘要

Sparse novel view synthesis (NVS) remains challenging due to the ambiguity of recovering 3D geometry from few input views. While NeRF- and Gaussian Splatting (GS)-based methods perform well with dense supervision, they often overfit in sparse settings, producing floating artifacts and inconsistent geometry. Silhouette consistency is commonly used as a regularizer, but it remains insufficient, as silhouette-consistent regions can extend beyond the true object geometry. We introduce VisDom, a learning-free geometric constraint that augments classical carving-based visual hull reconstruction by enforcing a minimum multi-view visibility requirement. Specifically, we define a visible domain as the subset of 3D space observed by at least $K$ views and use it as an additional filtering criterion on top of standard silhouette-based reconstruction. This provides a stronger spatial prior in sparse-view settings. We integrate VisDom into both implicit (NeRF) and explicit (GS) pipelines by restricting volumetric sampling and guiding Gaussian placement during optimization. Experiments on three challenging datasets show consistent improvements in sparse-view NVS, enabling high-quality object-centric reconstruction from as few as four input images. Our method is domain-agnostic, requires only silhouettes, and introduces no learned parameters, making it a simple complement to existing approaches. Applying VisDom on top of GaussianObject further improves performance on Omni3D and MipNeRF360, while matching or surpassing it at 22 $\times$ lower training cost.
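
The visible-domain filter described above can be sketched with toy cameras. This is an illustration under stated assumptions, not the paper's implementation: the cameras are orthographic and axis-aligned, each silhouette is a centred disc, and `make_view` and `in_visible_domain` are hypothetical names. A point is kept only if at least K views observe it and every observing view places it inside the silhouette.

```python
def make_view(axis, frame_half_width, silhouette_radius):
    """Toy orthographic camera looking along `axis` (0, 1, or 2).

    Returns a function mapping a 3D point to (in_frame, in_silhouette).
    The square frame and disc silhouette are illustrative assumptions.
    """
    def observe(point):
        # Drop the coordinate along the viewing axis to get image coordinates.
        u, v = [c for i, c in enumerate(point) if i != axis]
        in_frame = abs(u) <= frame_half_width and abs(v) <= frame_half_width
        in_silhouette = in_frame and (u * u + v * v) <= silhouette_radius ** 2
        return in_frame, in_silhouette
    return observe

def in_visible_domain(point, views, k):
    """Silhouette carving plus a minimum-visibility requirement of k views."""
    observations = [view(point) for view in views]
    seen = [obs for obs in observations if obs[0]]
    if len(seen) < k:
        return False  # observed by too few views to be trusted
    return all(in_silhouette for _, in_silhouette in seen)

# Three views; the third has a narrow field of view.
views = [make_view(0, 1.0, 0.5), make_view(1, 1.0, 0.5), make_view(2, 0.2, 0.5)]
center_kept = in_visible_domain((0.0, 0.0, 0.0), views, k=3)
edge_kept_k3 = in_visible_domain((0.3, 0.3, 0.0), views, k=3)
edge_kept_k2 = in_visible_domain((0.3, 0.3, 0.0), views, k=2)
```

The point `(0.3, 0.3, 0.0)` is silhouette-consistent in the two views that see it, so plain carving would keep it; the visibility requirement with K = 3 rejects it because the third view never observes it.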

2606.20556 2026-06-19 cs.CV 新提交

Thinking in Boxes: 3D Editing in Real Images Made Easy

Thinking in Boxes: 真实图像中的3D编辑变得简单

Pradhaan S Bhat, Naveen Chandra R, Rishubh Parihar, Vaibhav Vavilala, R. Venkatesh Babu, D. A. Forsyth, Anand Bhattad

发表机构 * Indian Institute of Science（印度科学研究所）； Apple（苹果公司）； UIUC（伊利诺伊大学厄巴纳-香槟分校）； Johns Hopkins University（约翰霍普金斯大学）

AI总结提出使用3D盒子作为结构化规范，通过用户提供输入和输出盒子来精确控制真实图像中的平移、旋转、缩放和视角变化，同时保持场景和物体身份，恢复未见的物体区域。

Comments Project Page: https://thinking-in-boxes.github.io/

AI中文摘要

文本和2D条件接口在图像编辑中提供对空间变换的弱、模糊控制——特别是在大物体运动和相机变化下。先前的工作使用了如盒子这样的3D基元，但仅作为松散的调节信号指示近似物体位置，而非指定变换。我们则使用3D盒子作为结构化规范：用户提供编辑的输入和输出盒子，将编辑视为一个适定的几何问题。这种“在盒子中思考”的界面，其中每个盒子面都带有颜色编码以传达3D方向，提供了对真实图像中平移、旋转、缩放和视角变化的精确控制，同时保留场景和物体身份，并恢复之前未见的物体区域。为了将变换与场景外观联系起来，我们引入了一个深度对齐的平面地板作为全局参考框架，并用深度感知线索进行着色。基于这种结构，图像生成器在大变换下产生一致的结果。该系统在两个阶段训练——在合成多物体场景和来自Objectron的小型真实世界视频集上——能够泛化到复杂的、野外真实图像。我们的方法直接作用于真实照片，并在大型3D编辑上显著优于最近的最先进方法。

英文摘要

Text and 2D-conditioning interfaces provide weak, ambiguous control over spatial transformations in image editing -- particularly under large object motions and camera changes. Prior work has used 3D primitives such as boxes, but only as loose conditioning signals indicating approximate object location rather than specifying the transformation. We instead use 3D boxes as structured specifications: the user provides the input and output boxes of the edit, casting editing as a well-posed geometry problem. This ``thinking in boxes'' interface, where each box face is color-coded to convey 3D orientation, gives precise control over translation, rotation, scaling, and viewpoint changes in real images while preserving scene and object identity, and recovering previously unseen object regions. To ground transformations in scene appearance, we introduce a depth-aligned planar floor as a global reference frame, shaded with depth-aware cues. Conditioned on this structure, an image generator produces consistent results under large transformations. Trained in two stages -- on synthetic multi-object scenes and a small set of real-world videos from Objectron -- the system generalizes to complex, in-the-wild real images. Our method operates directly on real photographs and substantially outperforms recent state-of-the-art methods on large 3D edits.

2606.19451 2026-06-19 cs.LG cs.CV cs.RO 交叉投稿

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

3D-DLP：自监督3D物体中心场景表示学习

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

AI总结提出3D-DLP模型，通过自监督学习将场景级RGB-D或体素观测分解为3D潜在粒子，每个粒子编码解耦属性，实现可解释的逐粒子分割图，并支持场景操控和下游机器人操作。

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

AI中文摘要

我们引入了3D-DLP，一种自监督的物体中心表示学习模型，它将场景级RGB-D或体素观测分解为一组3D潜在粒子。基于深度潜在粒子（DLP）框架，每个粒子编码解耦的属性，包括3D关键点位置、边界框尺寸和外观特征，并代表场景中的一个独特实体。该模型通过端到端的自监督重建目标学习可解释的逐粒子分割图。我们在模拟和真实数据集上证明，学习到的潜在空间是可解释和可控的：通过操纵粒子位置并解码，我们可以生成新颖的场景配置。此外，我们展示了将这些紧凑的3D潜在粒子用于下游机器人操作，相比缺乏显式3D信息或依赖无物体中心结构且内存开销大的密集3D输入的基线方法，性能有所提升。代码和视频可在 https://eubooks3003.github.io/3d-dlp 获取。

英文摘要

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19874 2026-06-19 cs.RO cs.CV 交叉投稿

MMD-SLAM: Structure-Enhanced Multi-Meta Gaussian Distribution-Guided Visual SLAM

MMD-SLAM：结构增强的多元高斯分布引导视觉SLAM

Fan Zhu, Ziyu Chen, Peichen Liu, Yifan Zhao, Zhisong Xu, Hui Zhu, Hongxing Zhou, Sixun Liu, Chunmao Jiang

发表机构 * HFIPS, Chinese Academy of Sciences（中国科学院合肥物质科学研究院）； University of Science and Technology of China（中国科学技术大学）； Aarhus University（奥胡斯大学）； University of Tokyo（东京大学）； Beijing University of Chemical Technology（北京化工大学）； North China Electric Power University（华北电力大学）

AI总结提出MMD-SLAM，利用亚特兰大世界假设引导多元高斯表示，通过点线融合、主导方向编码和高斯进化策略，提升视觉SLAM的跟踪精度与建图质量。

Comments ICRA 2026

AI中文摘要

3D高斯泼溅（3DGS）显著提升了新视角合成和高保真场景重建，扩展了基于3DGS的视觉同步定位与建图（SLAM）方法的潜力。然而，大多数现有系统未能充分利用底层结构信息，这限制了渲染质量并常常导致地图不一致。为了解决这些限制，我们提出了MMD-SLAM，一个结构增强的视觉SLAM框架，利用亚特兰大世界（AW）假设来引导多元高斯表示以实现逼真的建图。首先，我们引入了一种点线融合策略用于位姿优化，其中3D线段被纳入以提高跟踪鲁棒性并为建图提供额外约束。其次，我们设计了一种具有主导方向的多元高斯表示，显式编码来自AW假设的结构先验。最后，我们提出了一种高斯进化策略，该策略适应场景几何并将结构线索融入全局优化。大量实验表明，这些创新使MMD-SLAM在跟踪精度和建图质量方面均达到了最先进的性能。例如，与MonoGS相比，我们的方法在ScanNet上实现了48.56%的ATE RMSE降低，在Replica上实现了5.71%的PSNR提升。

英文摘要

3D Gaussian Splatting (3DGS) has significantly boosted novel view synthesis and high-fidelity scene reconstruction, expanding the potential of 3DGS-based Visual Simultaneous Localization and Mapping (SLAM) methods. However, most existing systems fail to fully exploit the underlying structural information, which limits rendering quality and often leads to inconsistent maps. To address these limitations, we propose MMD-SLAM, a structure-enhanced Visual SLAM framework that leverages the Atlanta World (AW) assumption to guide a Multi-Meta Gaussian representation for photorealistic mapping. First, we introduce a point-line fusion strategy for pose optimization, where 3D line segments are incorporated to improve tracking robustness and provide additional constraints for mapping. Second, we design a Multi-Meta Gaussian representation with dominant directions, explicitly encoding structural priors from the AW hypothesis. Finally, we propose a Gaussian evolution strategy that adapts to scene geometry and incorporates structural cues into global optimization. Extensive experiments demonstrate that these innovations enable MMD-SLAM to achieve state-of-the-art performance in both tracking accuracy and mapping quality. For example, compared with MonoGS, our method achieves a 48.56% reduction in ATE RMSE on ScanNet and a 5.71% improvement in PSNR on Replica.

2508.15228 2026-06-19 cs.CV 版本更新

Collaborative Multi-Modal Coding for High-Quality 3D Generation

协作多模态编码用于高质量3D生成

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University, Singapore（南洋理工大学S实验室）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出TriMM，首个前馈式3D原生生成模型，通过协作多模态编码融合RGB、RGBD和点云特征，结合辅助2D/3D监督和三平面潜在扩散模型，实现高质量3D资产生成。

AI中文摘要

3D内容本质上具有多模态特性，可投影到不同模态（如RGB图像、RGBD和点云）。每种模态在3D资产建模中表现出独特优势：RGB图像包含生动的3D纹理，而点云定义精细的3D几何。然而，现有大多数3D原生生成架构要么主要在单模态范式下运行——从而忽略了多模态数据的互补优势，要么局限于3D结构，从而限制了可用训练数据集的范围。为了全面利用多模态进行3D建模，我们提出了TriMM，这是第一个从基本多模态（如RGB、RGBD和点云）学习的前馈式3D原生生成模型。具体来说，1) TriMM首先引入协作多模态编码，该编码在保留各模态独特表示优势的同时整合模态特定特征。2) 此外，引入辅助2D和3D监督以提高多模态编码的鲁棒性和性能。3) 基于嵌入的多模态编码，TriMM采用三平面潜在扩散模型生成更高质量的3D资产，增强了纹理和几何细节。在多个知名数据集上的大量实验表明，TriMM通过有效利用多模态，尽管使用少量训练数据，仍能达到与在大规模数据集上训练的模型相竞争的性能。此外，我们在最近的RGB-D数据集上进行了额外实验，验证了将其他多模态数据集纳入3D生成的可行性。

英文摘要

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets, despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2512.00850 2026-06-19 cs.CV 版本更新

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Smol-GS: 抽象3D高斯溅射的紧凑表示

Haishan Wang, Mohammad Hassan Vali, Arno Solin

发表机构 * ELLIS Institute Finland（芬兰ELLIS研究所）； Aalto University（阿尔托大学）

AI总结提出Smol-GS方法，通过八叉树位置编码和熵压缩学习高效溅射特征，实现3D高斯溅射的紧凑表示，在保持渲染质量的同时大幅降低存储。

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg（弗赖堡大学）； Bosch Research（博世研究院）； University of Haifa（海法大学）

AI总结提出潜在高斯泼溅（LaGS）方法，通过特征高斯体作为动态关键点实现多视图特征聚合，用于4D全景占据跟踪，在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

DOI: 10.1109/LRA.2026.3703990

AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而，现有方法通常只解决部分问题：它们要么通过边界框提供粗略的几何跟踪，要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中，我们提出了潜在高斯泼溅（LaGS）用于4D全景占据跟踪（4D-POT）。我们重新审视底层表示，将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点，在泼溅到体素网格进行解码之前，能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互，这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节，进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在 https://lags.cs.uni-freiburg.de/ 提供代码和模型。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
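
The splatting step, in which feature-bearing Gaussians are written into a voxel grid, can be sketched as a distance-weighted average. This is a simplified illustration and not the LaGS implementation: it assumes isotropic Gaussians, normalises the weights per voxel, and uses a hypothetical function name `splat_to_voxels`. The multi-view feature aggregation and the hierarchy described in the abstract are not modelled.

```python
import math

def splat_to_voxels(gaussians, grid_shape, voxel_size=1.0):
    """Splat (mean_xyz, sigma, feature) Gaussians into a dense voxel grid.

    Each voxel receives the Gaussian-weighted average of all features,
    evaluated at the voxel centre. Returns {(ix, iy, iz): feature_list}.
    """
    nx, ny, nz = grid_shape
    feature_dim = len(gaussians[0][2])
    grid = {}
    for ix in range(nx):
        for iy in range(ny):
            for iz in range(nz):
                center = ((ix + 0.5) * voxel_size,
                          (iy + 0.5) * voxel_size,
                          (iz + 0.5) * voxel_size)
                accumulated = [0.0] * feature_dim
                total_weight = 0.0
                for mean, sigma, feature in gaussians:
                    sq_dist = sum((c - m) ** 2 for c, m in zip(center, mean))
                    weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
                    total_weight += weight
                    for d in range(feature_dim):
                        accumulated[d] += weight * feature[d]
                # Voxels far from every Gaussian keep their near-zero features.
                if total_weight > 1e-12:
                    accumulated = [a / total_weight for a in accumulated]
                grid[(ix, iy, iz)] = accumulated
    return grid

# Two Gaussians with one-hot features, centred on neighbouring voxels.
gaussians = [((0.5, 0.5, 0.5), 0.5, [1.0, 0.0]),
             ((1.5, 0.5, 0.5), 0.5, [0.0, 1.0])]
grid = splat_to_voxels(gaussians, grid_shape=(2, 1, 1))
```

Because the weight depends on continuous distance rather than on a fixed neighbourhood, each Gaussian's reach is set by its own `sigma`, which is one way to read the "data-dependent receptive fields" mentioned in the abstract.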

2606.15908 2026-06-19 cs.CV 版本更新

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉：基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR（谷歌XR）； University of Science and Technology of China (USTC)（中国科学技术大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出无需模板和标记的多视角系统，通过跨视角几何与时间线索的Transformer初始化，结合物理感知高斯优化，实现鲁棒且无伪影的4D手-物体交互重建。

Comments Project page: https://hostpg.github.io/

AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互（HOI）数据的需求日益增长，但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果，但它们对手和物体姿态的初始估计高度敏感。然而，从图像中估计这些姿态具有挑战性，尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统，用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体，无需任何模板或标记。我们的系统包含两个主要创新组件：（1）一个多视角前馈Transformer模型，聚合跨视角几何和时间线索，为姿态和密集物体几何提供可靠的、度量一致的初始化；（2）一个手-物体物理感知高斯优化框架，用于细化初始估计，集成四面体约束、碰撞细化和外观分解，以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明，我们的流程实现了高度鲁棒、无伪影的重建，为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet estimating these poses from images is challenging, particularly under the severe occlusion inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2606.15966 2026-06-19 cs.CV cs.GR 版本更新

HypOProto: 用于左心室充盈压分类的双曲序数原型

Victoria Wu, Nima Hashemi, Hooman Vaseli, Christina Luong, Purang Abolmaesumi, Teresa S. M. Tsang

发表机构 * The University of British Columbia（不列颠哥伦比亚大学）； Vancouver General Hospital（温哥华综合医院）

AI总结提出HypOProto框架，利用双曲空间中的序数原型对左心室充盈压进行分类，通过冻结的可解释基础模型实现高精度与临床可解释性。

AI中文摘要

超声心动图（echo）是一种广泛用于评估心脏功能的成像模态，左心室充盈压（LVFP）是心力衰竭等疾病的关键生理标志物。将LVFP分为正常和升高类别的标准依赖于多普勒衍生的$E/e'$比值，该比值依赖于操作者，且在资源有限的环境中通常不可用，这促使了直接从B模式超声推断LVFP的方法。现有的深度学习方法实现了高性能，但大多是黑盒模型，限制了临床可解释性。我们提出了HypOProto，一个基于双曲序数原型的可解释LVFP分类框架，使用冻结的可解释基础模型骨干。HypOProto沿着生理$E/e'$尺度排列原型，将边界情况放置在双曲面根附近，其中小的角度差异区分相似情况，而正常和升高情况占据向外位置，反映诊断确定性的增加。这种双曲几何编码了临床上有意义的序数关系，并提高了可解释性。我们还引入了一种新的双曲原型角度分离（HyperPAS）损失，强制在双曲空间中实现类间原型分离。HypOProto在保持透明性的同时实现了最先进的性能，并在可视化中突出显示临床相关区域。这项工作代表了超声中LVFP分类的第一个基于原型的框架。我们的代码可在 https://github.com/DeepRCL/HypOProto 获取。

英文摘要

Echocardiography (echo) is a widely used imaging modality for assessing cardiac function, with Left Ventricular Filling Pressure (LVFP) serving as a critical physiological marker for conditions such as heart failure. Standard LVFP classification into normal vs. elevated categories relies on the Doppler-derived $E/e'$ ratio, which is operator-dependent and often unavailable in resource-limited settings, motivating methods that infer LVFP directly from B-mode echo. Existing deep learning approaches achieve high performance but remain largely black-box, limiting clinical interpretability. We propose HypOProto, a hyperbolic, ordinal prototype-based framework for interpretable LVFP classification using a frozen, explainable foundation model backbone. HypOProto arranges prototypes along the physiological $E/e'$ scale, placing borderline cases near the hyperboloid root where small angular differences separate similar cases, while normal and elevated cases occupy outward positions reflecting increasing diagnostic certainty. This hyperbolic geometry encodes clinically meaningful ordinal relationships and improves interpretability. We also introduce a novel Hyperbolic Prototype Angular Separation (HyperPAS) loss, enforcing inter-class prototype separation in hyperbolic space. HypOProto achieves SOTA performance while maintaining transparency, and highlights clinically relevant regions in visualizations. This work represents the first prototype-based framework for LVFP classification in echo. Our code can be found at https://github.com/DeepRCL/HypOProto.
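
The geometry behind "near the hyperboloid root" versus "outward positions" can be made concrete with the standard hyperboloid (Lorentz) model of hyperbolic space. The sketch below uses textbook formulas only; the prototype coordinates are made-up illustrations, and neither the HyperPAS loss nor the authors' prototype layout is reproduced here.

```python
import math

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v) on the hyperboloid."""
    x0 = math.sqrt(1.0 + sum(c * c for c in v))
    return [x0] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def hyperbolic_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) on the unit hyperboloid."""
    # The clamp guards against rounding pushing the argument just below 1.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

root = lift_to_hyperboloid([0.0, 0.0])        # hyperboloid root (origin)
borderline = lift_to_hyperboloid([0.1, 0.0])  # hypothetical borderline prototype
elevated = lift_to_hyperboloid([2.0, 0.0])    # hypothetical elevated prototype
normal = lift_to_hyperboloid([-2.0, 0.0])     # hypothetical normal prototype

d_borderline = hyperbolic_distance(root, borderline)
d_elevated = hyperbolic_distance(root, elevated)
d_normal_elevated = hyperbolic_distance(normal, elevated)
```

In this model the distance from the root to a lifted vector `v` equals `asinh(|v|)`, so distance from the root can act as a scalar proxy for how far a prototype sits from the borderline regime.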

2606.19824 2026-06-19 cs.CV cs.AI 新提交

CSWinUNETR: Segmentation of Thin Anatomical Structures in Medical Images

CSWinUNETR: 医学图像中薄解剖结构的分割

Junho Moon, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出CSWinUNETR通用骨干网络，通过交叉形条带自注意力、循环移位、细节增强多尺度自注意力和稀疏控制动态蛇形卷积，解决薄结构分割中的低对比度、断裂和类不平衡问题，在眼科、神经血管和皮肤科基准上超越现有方法。

Comments Accepted at MICCAI 2026

AI中文摘要

准确分割薄而曲折的解剖结构，如视网膜血管、脑血管和面部皱纹，由于低对比度、频繁断裂和严重的类别不平衡仍然具有挑战性。尽管最近的卷积和基于Transformer的模型提高了性能，但它们常常产生碎片化的预测，并且无法恢复细小的分支。我们提出了CSWinUNETR，一个用于2D和3D薄结构分割的通用骨干网络。它采用交叉形条带自注意力来建模长距离主轴上下文，并结合循环移位以增强条带间的信息交换。为了更好地保留细粒度细节，我们进一步引入了一个细节增强的多尺度自注意力模块，该模块从多分辨率表示中聚合上下文特征。此外，我们提出了稀疏控制动态蛇形卷积，它从稀疏预测的控制点重建可靠的密集曲线核，以更好地跟随曲折的几何形状。在眼科、神经血管成像和皮肤科的四个基准上的大量实验表明，CSWinUNETR在没有任务特定后处理或拓扑感知损失的情况下，始终优于最先进的方法。代码可在 https://github.com/labhai/CSWinUNETR 获取。

英文摘要

Accurate segmentation of thin, tortuous anatomical structures, such as retinal vessels, cerebral vasculature, and facial wrinkles, remains challenging due to low contrast, frequent discontinuities, and severe class imbalance. Although recent convolutional and Transformer-based models have improved performance, they often yield fragmented predictions and fail to recover fine branches. We propose CSWinUNETR, a general-purpose backbone for 2D and 3D thin-structure segmentation. It employs cross-shaped stripe self-attention to model long-range principal-axis context and incorporates cyclic shifts to enhance information exchange across stripes. To better preserve fine-grained details, we further introduce a detail-enhanced multi-scale self-attention module that aggregates contextual features from multi-resolution representations. In addition, we propose sparse-control dynamic snake convolution, which reconstructs reliable dense curvilinear kernels from sparsely predicted control points to better follow tortuous geometry. Extensive experiments on four benchmarks across ophthalmology, neurovascular imaging, and dermatology demonstrate that CSWinUNETR consistently outperforms state-of-the-art methods without task-specific post-processing or topology-aware losses. The code is available at https://github.com/labhai/CSWinUNETR.

2606.19838 2026-06-19 cs.CV 新提交

OTCHA: Optimal Transport-driven Confidence-aware Latent Hub Alignment for Multi-View Medical Image Classification

OTCHA: 基于最优传输的置信度感知潜在中心对齐用于多视图医学图像分类

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University（汉阳大学）； Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出OTCHA模块，通过最优传输对齐多视图补丁令牌与共享潜在中心令牌，结合置信度门控和部分匹配，消除无关特征，提升多视图医学图像分类鲁棒性。

Comments Accepted at MICCAI 2026

AI中文摘要

多视图成像（如乳腺X线摄影和胸部X线摄影）是临床实践的标准组成部分。然而，医学图像通常未配准，且包含视图特定的伪影或无关背景线索，这些可能掩盖诊断相关发现。许多现有方法直接融合每个视图的表征，使得此类无关内容污染融合嵌入，并在不同视图配置下降低鲁棒性。我们提出OTCHA，一种基于最优传输（OT）的置信度感知潜在中心令牌对齐模块，在融合前细化补丁令牌以用于多视图分类。OTCHA引入一组跨视图共享的可学习潜在中心令牌。对于每个视图，我们计算补丁令牌与中心令牌之间的OT计划，该计划联合考虑特征相似性和几何结构，并通过令牌条件垃圾箱（dustbins）增强OT公式以实现部分匹配并丢弃无关令牌。所得传输计划提供令牌级匹配置信度，该置信度门控中心介导的消息传递，并加权一种新的基于最优传输的表征对齐损失以稳定细化。在三个多视图医学图像数据集上的实验表明，在不同解剖结构和视图配置下，相比竞争基线方法取得一致改进。我们的代码可在 https://github.com/labhai/OTCHA 获取。

英文摘要

Multi-view imaging, such as mammography and chest radiography, is a standard component of clinical practice. However, medical images are often unregistered and contain view-specific artifacts or irrelevant background cues that can obscure diagnostically relevant findings. Many existing methods directly fuse per-view representations, allowing such irrelevant content to contaminate the fused embedding and reducing robustness under varying view configurations. We propose OTCHA, a confidence-aware latent hub token alignment module based on optimal transport (OT) that refines patch tokens before fusion for multi-view classification. OTCHA introduces a set of learnable latent hub tokens shared across views. For each view, we compute an OT plan between patch tokens and hub tokens that jointly considers feature similarity and geometry, and augment the OT formulation with token-conditional dustbins to enable partial matching and discard irrelevant tokens. The resulting transport plan provides token-wise matching confidence, which gates hub-mediated message passing and weights a novel optimal-transport-based representation alignment loss to stabilize refinement. Experiments on three multi-view medical image datasets demonstrate consistent improvements over competing baselines across diverse anatomies and view configurations. Our code is available at https://github.com/labhai/OTCHA.

2606.19867 2026-06-19 cs.CV cs.AI 新提交

PSCT-Net: Geometry-Aware Pediatric Skull CT Reconstruction via Differentiable Back-Projection and Attention-Guided Refinement

PSCT-Net: 通过可微反投影和注意力引导细化实现几何感知的儿科颅骨CT重建

Dong Yeong Kim, Jaewon Choi, Youmin Shin, Jungyu Lee, Myeongseop Kim, Jinwook Choi, Joo Whan Kim, Young-Gon Kim

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University（首尔大学生物工程跨学科项目）； Department of Transdisciplinary Medicine, Seoul National University Hospital（首尔大学医院跨学科医学系）； Department of Artificial Intelligence, Yonsei University（延世大学人工智能系）； Department of Medicine, Seoul National University College of Medicine（首尔大学医学院医学系）； Healthcare AI Research Institute, Seoul National University Hospital（首尔大学医院医疗人工智能研究所）

AI总结提出PSCT-Net，利用可微反投影建立空间先验，结合注意力引导投影和双向Mamba模块，从稀疏双平面X射线重建3D CT，缓解深度模糊并改善骨边界。

Comments 11 pages, 5 figures

AI中文摘要

计算机断层扫描（CT）对于诊断儿科颅面异常至关重要，但对发育中的解剖结构存在辐射风险。从稀疏双平面X射线重建3D CT提供了一种低剂量替代方案，但问题严重不适定。现有方法采用几何无关的特征提升，将2D特征天真地投影到3D中，缺乏显式空间建模，导致深度模糊和骨边界退化。我们提出PSCT-Net，一种具有可微反投影的几何感知框架。可微反投影建立了空间保真的体积先验，缓解了深度模糊。然后，注意力引导投影（AGP-3D）模块学习2D区域与3D位置之间的非线性体素级对应关系。双向Mamba（BiM-3D）模块以线性复杂度捕获长程体积依赖关系。我们进一步整理了一个私有的机构儿科颅骨CT数据集PedSkull-CT，包含正常和病理病例用于内部评估，弥补了以成人中心和躯干为主的数据集的空白。

英文摘要

Computed Tomography (CT) is essential for diagnosing pediatric craniofacial abnormalities, yet poses radiation risks to developing anatomies. Reconstructing 3D CT from sparse bi-planar X-rays offers a low-dose alternative but is severely ill-posed. Existing methods employ geometry-agnostic feature lifting, naively projecting 2D features into 3D without explicit spatial modeling, causing depth ambiguity and degraded osseous boundaries. We present PSCT-Net, a geometry-aware framework with differentiable back-projection. Differentiable back-projection establishes a spatially faithful volumetric prior, alleviating depth ambiguity. An Attention-Guided Projection (AGP-3D) module then learns non-linear voxel-wise correspondences between 2D regions and 3D locations. A Bidirectional Mamba (BiM-3D) module captures long-range volumetric dependencies with linear complexity. We further curate a private institutional pediatric skull CT cohort, PedSkull-CT, comprising normal and pathological cases for internal evaluation, addressing the gap in adult-centric, trunk-focused datasets.

2606.19908 2026-06-19 cs.CV 新提交

Gaussian Process Prior Variational Autoencoder for Endoscopic Videos

用于内窥镜视频的高斯过程先验变分自编码器

Ivan De Boi, Xinxing Shi, Xiaoyu Jiang, Tim J. M. Jaspers, Francisco Caetano, Mauricio A. Alvarez, Fons van der Sommen, Sam Van der Jeught

发表机构 * Department of Electromechanics, InViLab, University of Antwerp（安特卫普大学机电工程系InViLab实验室）； Department of Computer Science, University of Manchester（曼彻斯特大学计算机科学系）； Department of Electrical Engineering, Eindhoven University of Technology（埃因霍温理工大学电气工程系）

AI总结提出高斯过程先验变分自编码器（GPVAE），通过时间高斯过程先验替代因子化先验，结合两种可扩展GP近似和镜面反射掩码，实现内窥镜视频缺失帧的插值与修复，在C3VDv2数据集上平均降低RMSE 21.9%。

AI中文摘要

内窥镜视频分析对于胃肠道诊断和计算机辅助干预至关重要，但视频序列经常受到镜面反射、运动伪影和缺失帧的退化影响。这些瞬态损坏会分散临床医生的注意力，降低图像可解释性，并干扰下游任务（如3D重建和导航）。因此，有效的修复需要利用时间连续性而非孤立处理帧的方法。我们提出了一种用于内窥镜视频修复的高斯过程先验变分自编码器（GPVAE）框架，该框架用时间高斯过程先验替代标准因子化潜在先验，从而能够以不确定性感知的重建方式插值缺失帧。该框架结合了内窥镜专用编码器（包括卷积EndoVAE骨干网络和来自GastroNet-5M的预训练Vision Transformer编码器）以及两种可扩展GP近似：层次先验近似（HPA）和稀疏精度近似（SPA）。镜面反射通过基于DUCKNet的掩码流水线处理，该流水线从重建目标中排除损坏像素。在C3VDv2结肠镜数据集上，最佳GPVAE变体相对于匹配的VAE基线，图像重建RMSE平均降低21.9%，最高降低26.1%。下游轨迹RMSE在经典视觉里程计和预训练PoseNet上平均降低12.7%，而每epoch训练时间平均增加27.3%。最后，GP后验提供每帧不确定性估计，反映时间支持并为修复帧提供置信度信号。

PU-UNet: Stable Multiplicative Interactions for Medical Image Segmentation

Ziyuan Li, Osamah Sufyan, Uwe Jaekel, Babette Dellen

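Affiliations: Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz; Technical University of Munich

AI summary: Proposes PU-UNet, which uses stable product-unit residual blocks to model explicit multiplicative feature interactions at the low-resolution stages, improving Dice and IoU and lowering the false-positive rate on three medical image segmentation datasets.

Comments Accepted to the ICANN 2026

AI Chinese abstract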

许多密集预测网络依赖于加性特征变换，并且仅隐式地建模高阶特征交互。乘积单元为乘法特征建模提供了显式机制，但其对数-指数公式可能导致数值不稳定性，这限制了它们在深度密集预测网络中的使用。在这项工作中，我们提出了乘积单元U-Net（PU-UNet），这是一种残差U-Net，它将稳定的乘积单元残差块集成到丰富的低分辨率阶段，用于医学图像分割。所提出的公式结合了平滑正性映射和对数域裁剪，实现了稳定的乘法特征学习，且计算开销可忽略不计。在ISIC 2018、Kvasir-SEG和BUSI上，PU-UNet分别达到了0.942、0.959和高达0.925的Dice分数。与匹配的残差U-Net基线相比，PU-UNet在保持参数、FLOPs和推理延迟几乎不变的情况下，持续提高了Dice和IoU，并将正常BUSI病例的图像级假阳性率从0.077降至零。消融研究表明，这些增益与乘积单元交互相关，在低分辨率放置下最强，并受益于所提出的稳定化设计。这些结果表明，稳定的乘积单元残差学习可以成为通过显式乘法交互增强U-Net风格分割网络的有效方式。

English abstract

Many dense prediction networks rely on additive feature transformations and model higher-order feature interactions only implicitly. Product units provide an explicit mechanism for multiplicative feature modeling, but their logarithmic-exponential formulation can cause numerical instability, which has limited their use in deep dense prediction networks. In this work, we propose Product-Unit U-Net (PU-UNet), a residual U-Net that integrates stable product-unit residual blocks into rich low-resolution stages for medical image segmentation. The proposed formulation combines smooth positivity mapping with log-domain clipping, enabling stable multiplicative feature learning with negligible computational overhead. On ISIC 2018, Kvasir-SEG, and BUSI, PU-UNet achieves Dice scores of 0.942, 0.959, and up to 0.925, respectively. Compared with a matched Residual U-Net baseline, PU-UNet consistently improves Dice and IoU while keeping parameters, FLOPs, and inference latency nearly unchanged, and reduces the image-level false-positive rate on normal BUSI cases from 0.077 to zero. Ablation studies suggest that the gains are associated with product-unit interactions, are strongest under low-resolution placement, and benefit from the proposed stabilization design. These results suggest that stable product-unit residual learning can be an effective way to enhance U-Net-style segmentation networks with explicit multiplicative interactions.
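
To illustrate the stabilization idea named in the abstract, the following minimal Python sketch evaluates a single product unit in the log domain, using a softplus as the smooth positivity mapping and clipping the log-domain sum before exponentiation. The function names, the softplus choice, `eps`, and the clip bound are assumptions made for this sketch; the abstract does not give the paper's exact formulation.

```python
import math


def softplus(x: float) -> float:
    """Smooth positivity mapping log(1 + exp(x)), written to avoid overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_product_unit(inputs, exponents, eps=1e-6, clip=10.0):
    """Evaluate prod_i p_i ** w_i with p_i = softplus(x_i) + eps.

    The product is computed as exp(sum_i w_i * log(p_i)). The log-domain
    sum is clipped to [-clip, clip] (an assumed bound) so the result stays
    within [exp(-clip), exp(clip)] even for extreme inputs or exponents.
    """
    log_sum = sum(w * math.log(softplus(x) + eps)
                  for x, w in zip(inputs, exponents))
    log_sum = max(-clip, min(clip, log_sum))
    return math.exp(log_sum)


if __name__ == "__main__":
    # A large input with a large exponent is bounded by exp(clip).
    print(stable_product_unit([1000.0], [5.0]))
```

Without the positivity mapping, `log` would be undefined for non-positive features; without the clip, the exponentiation could overflow or vanish, which is the kind of instability the abstract refers to.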

2606.20108 2026-06-19 cs.CV cs.LG New submission

EFIQA: Explainable Fundus Image Quality Assessment via Anatomical Priors

EFIQA: 基于解剖先验的可解释眼底图像质量评估

Pengwei Wang, José Morano, Qian Wan, Hrvoje Bogunović

Affiliations: Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria; Christian Doppler Lab for Artificial Intelligence in Retina, Medical University of Vienna, Austria

AI summary: Proposes EFIQA, a framework that needs no quality labels: it uses anatomical priors, learning normal structure through masked anatomical inpainting, to produce spatial quality maps, outperforming supervised methods on several benchmarks while remaining explainable.

Comments Accepted in MIDL 2026. Code: https://github.com/penway/EFIQA

Journal ref Proceedings of Machine Learning Research 315:2248-2264, 2026

AI Chinese abstract

图像质量控制对于广泛的下游应用至关重要。基于深度学习的图像质量评估方法通常根据数据集特定的质量标签训练分类器，这继承了两种局限性：（1）泛化能力受限于训练集的标注标准；（2）这些方法无法提供质量下降的空间反馈，缺乏可解释性。在这项工作中，我们提出了EFIQA，一个无需质量相关监督的框架，并通过设计生成空间质量图。EFIQA不是从人工标注的标签中学习“什么是退化”，而是通过利用解剖先验来学习“应该有什么”。对于眼底摄影，我们将其实例化为两阶段方法：首先通过掩膜解剖修复训练无监督异常检测器，以识别缺失血管区域；然后将这一先验知识蒸馏到一个浅层适配器中，将冻结基础模型的特征映射到精确的质量图。外部数据集评估表明，这种无需标签且只需最小适配的方法，在不同质量标准的基准上，与监督方法相比，实现了更好的性能和可解释性，突显了其在现实应用中的潜力。

English abstract

Image quality control is vital for a wide range of downstream applications. Deep learning-based image quality assessment methods typically train classifiers on dataset-specific quality labels, inheriting two limitations: (1) generalization is tied to the labeling criteria of the training set and (2) these methods cannot provide spatial feedback on where the quality is degraded, lacking explainability. In this work, we propose EFIQA, a framework that requires no quality-related supervision and produces spatial quality maps by design. Rather than learning "what is degradation" from human-annotated labels, EFIQA learns "what should be there" by leveraging anatomical priors. For fundus photography, we instantiate this as a two-stage approach, by first training an unsupervised anomaly detector via masked anatomical inpainting to identify regions of missing vasculature, and then distilling this prior knowledge into a shallow adapter mapping features of a frozen foundation model to precise quality maps. External-dataset evaluation demonstrates that this label-free approach with minimal adaptation achieves better performance and explainability compared with supervised methods across benchmarks with different quality criteria, highlighting its potential for real-world applications.

2606.20112 2026-06-19 cs.CV eess.IV New submission

Pixel-Level Residual Diffusion Transformer: Scalable 3D CT Volume Generation

像素级残差扩散Transformer：可扩展的3D CT体生成

Zhenkai Zhang, Markus Hiller, Krista A. Ehinger, Tom Drummond

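Affiliations: School of Computing and Information Systems, The University of Melbourne

AI summary: Proposes the Pixel-Level Residual Diffusion Transformer (PRDiT), which uses two-stage training (a local MLP blind estimator that separates low-frequency structure, followed by a global residual diffusion transformer that models high-frequency residuals) to generate high-fidelity 3D CT volumes, outperforming existing methods on the LIDC-IDRI and RAD-ChestCT datasets.

Comments Accepted at ICLR 2026. Code available at https://github.com/Fredy-Zhang/PRDiT

AI Chinese abstract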

由于现有生成模型固有的巨大计算需求和优化困难，生成具有精细细节的高分辨率3D CT体仍然具有挑战性。在本文中，我们提出了像素级残差扩散Transformer（PRDiT），这是一种可扩展的生成框架，可直接在体素级别合成高质量的3D医学体。PRDiT引入了一个两阶段训练架构，包括：1）一个局部去噪器，形式为基于MLP的盲估计器，作用于重叠的3D块，以有效分离低频结构；2）一个全局残差扩散Transformer，采用内存高效注意力来建模和细化整个体上的高频残差。这种从粗到细的建模策略简化了优化，增强了训练稳定性，并有效保留了细微结构，而无需自编码器瓶颈。在LIDC-IDRI和RAD-ChestCT数据集上进行的大量实验表明，PRDiT始终优于最先进的模型，如HA-GAN、3D LDM和WDM-3D，在3D FID、MMD和Wasserstein距离指标上显著降低。

English abstract

Generating high-resolution 3D CT volumes with fine details remains challenging due to substantial computational demands and optimization difficulties inherent to existing generative models. In this paper, we propose the Pixel-Level Residual Diffusion Transformer (PRDiT), a scalable generative framework that synthesizes high-quality 3D medical volumes directly at voxel-level. PRDiT introduces a two-stage training architecture comprising 1) a local denoiser in the form of an MLP-based blind estimator operating on overlapping 3D patches to separate low-frequency structures efficiently, and 2) a global residual diffusion transformer employing memory-efficient attention to model and refine high-frequency residuals across entire volumes. This coarse-to-fine modeling strategy simplifies optimization, enhances training stability, and effectively preserves subtle structures without the limitations of an autoencoder bottleneck. Extensive experiments conducted on the LIDC-IDRI and RAD-ChestCT datasets demonstrate that PRDiT consistently outperforms state-of-the-art models, such as HA-GAN, 3D LDM and WDM-3D, achieving significantly lower 3D FID, MMD and Wasserstein distance scores.

2606.20143 2026-06-19 cs.CV New submission

HEad and neCK TumOR (HECKTOR) 2025: Benchmark of Segmentation, Diagnosis, and Prognosis in Multimodal PET/CT

头颈肿瘤 (HECKTOR) 2025 挑战赛：多模态 PET/CT 中的分割、诊断与预后基准

Numan Saeed, Salma Hassan, Shahad Hardan, Lishan Cai, Xinglong Liang, Moona Mazher, Abdul Qayyum, Yansong Bu, Mengye Lyu, Yue Lin, Mingyuan Meng, Chuanyi Huang, Lisheng Wang, Dalal Chamseddine, Shamimeh Ahrari, Beining Wu, Yifei Chen, Fuyou Mao, Hao Zhang, Baixiang Zhao, Surajit Ray, Muzi Guo, Lei Xiang, Jakob Dexl, Michael Ingrisch, Adrien Depeursinge, Arman Rahmim, Mathieu Hatt, Vincent Andrearczyk, Mohammad Yaqub

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Amsterdam UMC; The Netherlands Cancer Institute; Radboud University Medical Centre; University College London; Imperial College London; Shenzhen Technology University; Shenzhen University; Newland Digital Technology; The University of Sydney; Shanghai Jiao Tong University; University Hospital, Nantes; Nantes Université, Centrale Nantes, CNRS, LS2N; Hangzhou Dianzi University; Tsinghua University; Central South University; University of Glasgow; China Mobile System Integration Co., Ltd.; Subtle Medical Inc.; University Hospital, LMU Munich; Munich Center for Machine Learning; BC Cancer Research Institute; HES-SO Valais-Wallis University of Applied Sciences and Arts; Lausanne University Hospital (CHUV); LaTIM, INSERM, UMR 1101, Univ Brest

AI summary: The HECKTOR 2025 challenge uses multimodal PET/CT and electronic health records to establish a benchmark for automated head and neck cancer analysis covering three tasks (tumor segmentation, recurrence prediction, and HPV classification); the best algorithms reach a Dice of 0.75, a C-index of 0.66, and a balanced accuracy of 0.56, respectively.

Comments 17 pages, 4 figures, 4 tables. Overview paper for the HECKTOR 2025 challenge, held as a satellite event at MICCAI 2025. Challenge website: https://hecktor.grand-challenge.org/

AI Chinese abstract

头颈癌 (HNC) 构成显著的全球健康负担，准确的肿瘤勾画对于有效的放疗计划至关重要。口咽部解剖结构的复杂性，加上肿瘤在影像上的异质性表现，使得手动分割耗时且存在观察者间差异。除分割外，从非侵入性影像预测长期临床结局（如无复发生存期 RFS）和确定人乳头瘤病毒 (HPV) 状态，仍然是具有挑战性但临床价值高的目标。HECKTOR 2025 挑战赛通过使用多模态 PET/CT 影像和电子健康记录，建立了一个用于自动 HNC 分析的全面基准。基于前几届（2020-2022），本次挑战赛采用了扩展的多机构数据集，包含来自全球 10 个中心的 1100 多名患者。参与者需完成三个互补目标：(1) 分割原发肿瘤体积 (GTVp) 和转移淋巴结 (GTVn)，(2) 预测无复发生存期，(3) 分类 HPV 状态。挑战赛吸引了 35 个注册团队，其中 15 个最终提交在保留测试集上进行了评估。表现最佳的算法在分割上达到平均 Dice 相似系数 0.75，在生存预测上达到一致性指数 0.66，在 HPV 分类上达到平衡准确率 0.56。本文对所提交的方法进行了全面分析，评估了它们在不同病变特征上的性能，并讨论了它们在自动化肿瘤学工作流程和决策支持系统中临床转化的意义。

English abstract

Head and neck cancers (HNC) represent a significant global health burden, with accurate tumor delineation being essential for effective radiotherapy planning. The complexity of the oropharyngeal anatomy, combined with the heterogeneous appearance of tumors on imaging, makes manual segmentation time-intensive and subject to inter-observer variability. Beyond segmentation, predicting long-term clinical outcomes, such as recurrence-free survival (RFS), and determining human papillomavirus (HPV) status from noninvasive imaging, remain challenging yet clinically valuable goals. The HECKTOR 2025 challenge addresses these needs by establishing a comprehensive benchmark for automated HNC analysis using multimodal PET/CT imaging and electronic health records. Building on previous editions (2020-2022), this challenge features an expanded multi-institutional dataset comprising over 1,100 patients from 10 centers worldwide. Participants were tasked with three complementary objectives: (1) segmenting primary gross tumor volumes (GTVp) and metastatic lymph nodes (GTVn), (2) predicting recurrence-free survival, and (3) classifying HPV status. The challenge attracted 35 registered teams, with 15 final submissions evaluated on a held-out test set. Top-performing algorithms achieved a mean Dice similarity coefficient of 0.75 for segmentation, a concordance index of 0.66 for survival prediction, and a balanced accuracy of 0.56 for HPV classification. This paper presents a comprehensive analysis of the submitted methodologies, evaluates their performance across different lesion characteristics, and discusses their implications for clinical translation in automated oncology workflows and decision support systems.
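
For readers less familiar with two of the metrics reported above, the sketch below computes the Dice similarity coefficient for a pair of binary masks and the balanced accuracy for a set of class labels. It is a plain-Python illustration with hypothetical function names, not the challenge's evaluation code; masks are assumed to be flat lists of 0/1 values, and the convention of returning 1.0 when both masks are empty is an assumption of this sketch.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P intersect T| / (|P| + |T|) for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
    # Always predicting one class on a two-class problem.
    print(balanced_accuracy([1, 1, 1, 0], [1, 1, 1, 1]))
```

If HPV status is treated as a two-class label, a classifier that always predicts the same class scores 0.5 balanced accuracy, which gives some context for the reported 0.56.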

2606.20223 2026-06-19 cs.CV q-bio.QM New submission

DeepForestVisionV2: Ecology-Driven Taxonomy Expansion for Camera-Trap Monitoring in African Tropical Forests

DeepForestVisionV2：面向非洲热带森林相机监测的生态驱动分类扩展

Hugo Magaldi, Theau d'Audiffret, Etienne Francois Akomo-Okoue, Bala Amarasekaran, Naomi Anderson, Claire Auger, Noemie Cappelle, Daniel Cornelis, Raphael Cornette, Tobias Deschner, Gabriel Dubus, Davy Fonteyn, Rosa M. Garriga, Jennifer Hatlauf, Innocent Kasekendi, Raymond Katumba, Aram Kazandjian, Alfred Ngomanda, Stephan Ntie, Simone Pika, Xavier Rufray, Harold Rugonge, John Justice Tibesigwa, Peter van Lunteren, Hadrien Vanthomme, Joeri A. Zwerts, Sabrina Krief

Affiliations: UMR7206 Eco-Anthropologie, MNHN; One Forest Vision initiative; Sebitoli Chimpanzee Project; Centre National de la Recherche Scientifique et Technologique; Institut de Recherche en Ecologie Tropicale; Tacugama Chimpanzee Sanctuary; Biotope; CIRAD; Max Planck Institute for Evolutionary Anthropology; BOKU University; Agence Nationale des Parcs Nationaux du Gabon; Uganda Wildlife Authority; Addax Data Science; Utrecht University

AI summary: Ecological gradients in African tropical forest camera-trap monitoring (vertical stratification, scene openness, and anthropogenic interfaces) make the original 35-class taxonomy too coarse; DeepForestVisionV2 expands it to 64 classes, improving field utility while keeping the offline workflow.

Comments Accepted at ICPR 2026 - Computer Vision for Biodiversity Monitoring and Conservation Workshop

AI Chinese abstract

非洲热带森林中的相机监测正从封闭冠层内部扩展到河岸、空地和公园边缘。在现有的非洲森林相机分类开放工具中，DeepForestVision是唯一提供照片和视频匹配离线工作流的工具，先前研究表明其在可比基准上优于其他基线。然而，它专为封闭冠层、地面森林内部设计，使用35类预测空间，当部署遇到树栖灵长类、鸟类、半水生类群或家畜等人为混杂因素时，该空间变得过于粗糙。我们提出DeepForestVisionV2，这是一个从35类扩展到64类预测空间（61个动物类加上人类、车辆和空白）的生态驱动扩展，旨在解决三个反复出现的部署梯度：垂直分层、场景开放度和人为界面。DeepForestVisionV2保留相同的离线工作流，并在来自多国非洲热带森林项目的1,535,010张照片和243,354个视频上训练。评估结合了一个跨国家裁剪照片验证集（用于评估跨站点和相机设置的鲁棒性）和三个涵盖目标梯度的留出乌干达视频基准。在验证集上，DeepForestVisionV2达到0.86准确率、0.82宏F1和0.81平衡准确率。在部署基准上，尽管分类任务更困难，它仍保持或提高了基线准确率，同时将识别的类群数量从森林内部视频的22个增加到29个，河岸视频从4个增加到9个。在公园边缘用例中，它将准确率从0.62提高到0.86，并将误报从11次减少到0次。这些结果表明，DeepForestVisionV2在保持跨站点、栖息地和相机设置鲁棒性的同时，显著提高了野外实用性。

English abstract

Camera-trap monitoring in African tropical forests increasingly extends beyond closed-canopy interiors to riverbanks, clearings, and park edges. Among available open tools for African forest camera-trap classification, DeepForestVision is the only one providing a matched offline workflow for both photographs and videos, and previous work showed that it outperformed other available baselines on a comparable benchmark. However, it was designed for closed-canopy, ground-level forest interiors and uses a 35-class prediction space that becomes too coarse when deployments encounter arboreal primates, birds, semi-aquatic taxa, or human-associated confounders such as livestock. We present DeepForestVisionV2, an ecology-driven expansion from 35 to 64 prediction classes (61 animal classes plus human, vehicle, and blank) designed to address three recurrent deployment gradients: vertical stratification, scene openness, and anthropogenic interfaces. DeepForestVisionV2 retains the same offline workflow and is trained on 1,535,010 photographs and 243,354 videos from multi-country African tropical-forest projects. Evaluation combines a cross-country cropped-photo validation set, used to assess robustness across sites and camera-trap settings, with three held-out Uganda video benchmarks spanning the targeted gradients. On the validation set, DeepForestVisionV2 reaches 0.86 accuracy, 0.82 macro-F1, and 0.81 balanced accuracy. On the deployment benchmarks, it preserves or improves baseline accuracy despite its harder classification task, while increasing the number of identified taxa from 22 to 29 in forest-interior videos and from 4 to 9 at riverbanks. In the park-edge use case, it raises accuracy from 0.62 to 0.86 and reduces false alarms from 11 to 0. These results show that DeepForestVisionV2 materially improves field utility while preserving robustness across sites, habitats, and camera-trap settings.

2606.20250 2026-06-19 cs.CV New submission

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

单阶段层次化校正用于弱监督组织病理学分割

Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

Affiliations: VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam; The Computer Vision and Medical AI Lab, VinUniversity, Hanoi, Vietnam; Posts and Telecommunications Institute of Technology, Hanoi, Vietnam

AI summary: Proposes a single-stage hierarchical rectification framework whose hierarchical feature rectification module produces high-fidelity activation maps directly within a single training run, addressing the error propagation and computational overhead of multi-stage weakly supervised segmentation.

Comments Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

AI Chinese abstract

现有的计算病理学中的弱监督语义分割方法依赖于多阶段范式：类激活图生成、离线伪掩码细化和全监督再训练。虽然这种解耦方法已被广泛采用，但它存在根本性缺陷。多阶段过程不仅导致高计算训练成本，还遭受误差传播：浅层CNN中的局部纹理偏差产生假阳性伪影，后续细化步骤往往无法纠正。为了通过简单而高效的方法解决这些持续存在的挑战，我们提出了单阶段层次化校正（SSHR）框架。我们的方法不是事后被动地细化CAM，而是在前向传播过程中主动净化中间特征表示。我们引入了一个层次化特征校正模块（HFRM），利用深层全局语义上下文过滤浅层中的局部异常。该机制在单个训练循环内直接生成高保真激活图。在LUAD-HistoSeg和BCSS数据集上的实验表明，SSHR优于最先进的多阶段方法。此外，SSHR将训练时间减少了2到5倍。这种效率降低了计算开销，并加速了大规模组织病理学工作流的临床转化。代码可在以下网址获取：https://github.com/trongduc-nguyen/SSHR

English abstract

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

2606.20390 2026-06-19 cs.CV New submission

Geometry-Aware Superpixel Graph Transformer with Metadata for Skin Lesion Classification

几何感知超像素图变换器结合元数据用于皮肤病变分类

Muhammad Azeem, Tanveer Hussain, Amr Ahmed, Ardhendu Behera

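Affiliations: Edge Hill University

AI summary: Proposes a region-based graph learning framework that models lesions as superpixel graphs with geometric edge attributes and a metadata context node, using an edge-aware graph transformer for multimodal fusion, and achieves better classification performance than existing methods on four public datasets.

Comments Accepted at MICCAI 2026

AI Chinese abstract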

由于病变结构异质性、类内变异大以及良恶性病例间细微视觉差异，从皮肤镜图像进行自动化皮肤癌分类仍然具有挑战性。现有的CNN/ViT流程通常依赖全局或补丁级特征，并常通过后期融合结合患者元数据，这限制了空间基础的多模态推理。我们提出一种新颖的基于区域的图学习框架，将病变显式建模为空间连贯的超像素区域图，这些区域表示为冻结的CNN特征。为了捕捉细粒度的病变排列，我们将区域间几何编码为边属性，并引入一个与所有区域相连的专用元数据上下文节点，从而在同一关系空间内结构化地整合人口统计学/临床变量。节点表示通过我们的边缘感知图变换器进行更新，随后进行注意力驱动的传播，最终生成用于良恶性分类的图级嵌入。在四个公开基准上的实验表明，显式的区域级关系建模和图原生多模态融合相较于现有技术取得了持续改进。因此，我们建立了一种新的以图为中心的视角，其中CNN特征被建模为关系节点，并通过上下文整合得到改进，从而产生更具表现力和鲁棒性的分类结果。

English abstract

Automated skin cancer classification from dermoscopic images remains challenging due to heterogeneous lesion structure, strong intra-class variability, and subtle visual differences between benign and malignant cases. Existing CNN/ViT pipelines typically rely on global or patch-level features and often combine patient metadata via late fusion, which limits spatially grounded multimodal reasoning. We present a novel region-based graph learning framework that explicitly models lesions as graphs of spatially coherent superpixel regions represented as frozen CNN features. To capture fine-grained lesion arrangements, we encode inter-regional geometry as edge attributes and introduce a dedicated metadata context node connected to all regions, providing structured integration of demographic/clinical variables within the same relational space. Node representations are updated using our edge-aware graph transformer followed by attention-driven propagation, and a final graph-level embedding for benign-malignant classification. Experiments on four public benchmarks demonstrate that explicit region-level relational modeling and graph-native multimodal fusion yield consistent gains over the state-of-the-art. Consequently, we establish a new graph-centric perspective in which CNN features are modeled as relational nodes and improved through contextual integration, yielding more expressive and robust classifications.

2606.20449 2026-06-19 cs.CV New submission

InfantFace: Detecting infant faces in neonatal clinical environments

InfantFace：新生儿临床环境中的婴儿面部检测

Abdullah Bin-Obaid, Maria M. Cobo, Rebeccah Slater, Lionel Tarassenko, Mauricio Villarroel

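AI summary: To handle occlusion and lighting problems in neonatal clinical environments, proposes a one-stage face detection model based on YOLOv11m; after training on several public datasets and fine-tuning on clinical data, AP50 rises from 0.87 to 0.96.

Comments 32 pages, 7 figures, 4 tables; supplementary information included

AI Chinese abstract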

新生儿面部的可靠定位是基于视频摄像头的非接触式评估的第一步，例如疼痛和痛苦相关的面部表情分析、疼痛评分、心肺信号提取和呼吸停止警报。然而，新生儿临床环境中仍存在重大挑战。杂乱的背景、光照变化和不良照明条件会降低面部检测模型的准确性。临床干预、监测设备以及在某些情况下的医疗设备可能会遮挡面部，使视觉评估变得困难。我们提出了一种基于YOLOv11m的单阶段模型，专门用于新生儿临床环境中的婴儿面部检测。我们结合了多个公开数据集（VGGFace2、CelebA、FDDB、WIDER FACE）来训练和评估我们提出的模型。然后，我们在一个新生儿研究数据集上对模型进行了微调，该数据集包含来自114个记录会话的228个视频，涉及113名独立婴儿。在微调之前，我们的模型达到了0.87的AP50，超过了三个最先进的通用面部检测器的性能。在临床领域适应后，性能进一步提高到0.96的AP50。由于缺乏公开的新生儿数据集，评估不同数据集上的面部检测性能仍然是一个挑战。优先创建此类数据集，同时在其创建和使用中维护适当的隐私保护措施和伦理标准，将极大地支持该领域的进一步进展。

English abstract

Reliable localisation of the neonatal face is the first step for several video-camera based non-contact assessments such as pain and distress related facial expression analysis, pain scoring, cardiorespiratory signal extraction and cessation of breathing alerts. However, major challenges persist in neonatal clinical environments. Cluttered backgrounds, illumination changes and poor lighting conditions can reduce the accuracy of face detection models. Clinical interventions, monitoring equipment and, in some cases, medical devices can obstruct the face, making visual assessment difficult. We propose a one-stage YOLOv11m-based model tailored for face detection of infants in neonatal clinical environments. We combined multiple publicly available datasets (VGGFace2, CelebA, FDDB, WIDER FACE) to train and evaluate our proposed model. We then fine-tuned our model on a neonatal research dataset involving 228 videos from 114 recording sessions of 113 independent infants. Before fine-tuning, our model achieved an AP50 of 0.87, surpassing the performance of three state-of-the-art general face detectors. Performance improved further to an AP50 of 0.96 after clinical-domain adaptation. Evaluating face detection performance across different datasets remains a challenge due to the lack of publicly available neonatal datasets. Prioritising the creation of such datasets, while upholding appropriate privacy safeguards and ethical standards in their creation and use, would greatly support further progress in this field.

2606.20477 2026-06-19 cs.CV cs.CL cs.LG New submission

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

Affiliations: Computer Vision Group, University of Freiburg, Germany; Department of Radiology, Medical Center - University of Freiburg, Germany; CRIION-AI Lab, Freiburg, Germany

AI summary: Introduces RefRad2D, a large-scale bilingual dataset whose spatial grounding data is generated with LLMs and automated segmentation, and trains the RadGrounder model to jointly perform report generation, VQA, and spatial grounding, with competitive results on external benchmarks.

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

AI Chinese abstract

我们研究了如何在没有手动空间标注的情况下，为放射学训练具有视觉定位能力的视觉-语言模型（VLM）。我们引入了RefRad2D，这是一个大规模的双语（德语/英语）数据集，包含来自临床实践的120万对CT和MR图像-文本对，并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准（Slake，VQA-RAD）上，RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集，相比于仅在下游数据集上微调，提高了开放式VQA的性能，显示了数据集的迁移性。关键在于，添加定位监督不会降低语言质量，从而在不牺牲VQA性能的情况下实现空间可验证的输出。

English abstract

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.19371 2026-06-19 cs.LG cs.AI cs.CV Cross-list

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

ProMUSE: 渐进式多模态不确定性引导的分阶段证据阿尔茨海默病分类

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

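Affiliations: Kennesaw State University; Michigan Technological University; University of Iowa

AI summary: Proposes ProMUSE, a progressive multi-modal uncertainty-guided staged evidential network that adaptively decides when additional modalities are needed, lowering data acquisition cost while maintaining accuracy.

AI Chinese abstract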

阿尔茨海默病（AD）是一种致命性疾病，会破坏老年人的记忆和认知能力。大多数AD治疗在早期阶段有效，导致对早期AD诊断的需求日益增加。AD诊断越来越依赖多模态数据，如临床评估、结构磁共振成像（MRI）和正电子发射断层扫描（PET）成像。然而，MRI和PET采集仍然昂贵且不易普及，使得全模态推理在现实临床工作流程中不切实际。我们提出ProMUSE，一种渐进式多模态不确定性引导的分阶段证据网络，该网络自适应地确定何时需要额外模态，有助于在保持准确性的同时降低数据采集的总体成本。ProMUSE首先使用低成本临床数据进行证据分类，并通过基于Dirichlet的主观逻辑模型量化不确定性。当不确定性超过学习阈值时，ProMUSE逐步引入MRI或PET特征，通过Dempster-Shafer理论融合模态层面的信念和不确定性，获得校准的多模态预测。这种分阶段采集策略能够在最小化对昂贵成像依赖的同时实现准确诊断。在ADNI、AIBL和OASIS数据集上针对CN-AD、CN-MCI和MCI-AD任务的实验表明，ProMUSE在减少50-90%的MRI/PET使用量的同时，实现了与全模态基线相当或更优的准确性，从而大幅节省成本。这些结果突显了ProMUSE作为现实世界AD筛查中一种实用、不确定性感知且资源高效的解决方案。

English abstract

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
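
The staged procedure described above can be sketched in a few lines of Python. The code below uses the standard subjective-logic mapping from Dirichlet evidence to belief masses and an uncertainty mass, and a reduced form of Dempster's rule for combining two opinions over the same classes. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the evidence vectors stand in for network outputs, and the threshold is a fixed argument here, whereas the abstract describes a learned threshold.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (beliefs, uncertainty).

    With alpha_k = e_k + 1 and S = sum(alpha), the belief masses are
    b_k = e_k / S and the uncertainty mass is u = K / S, so that
    sum(b) + u == 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    return beliefs, num_classes / strength


def combine_opinions(op1, op2):
    """Reduced Dempster's rule for two opinions over the same classes."""
    (b1, u1), (b2, u2) = op1, op2
    n = len(b1)
    # Conflict: mass assigned to different classes by the two opinions.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    beliefs = [scale * (b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1)
               for i in range(n)]
    return beliefs, scale * u1 * u2


def staged_predict(evidence_by_stage, threshold):
    """Use modalities in order; stop once uncertainty <= threshold.

    Returns the final (beliefs, uncertainty) and the number of
    modalities actually used.
    """
    opinion = opinion_from_evidence(evidence_by_stage[0])
    used = 1
    for evidence in evidence_by_stage[1:]:
        if opinion[1] <= threshold:
            break
        opinion = combine_opinions(opinion, opinion_from_evidence(evidence))
        used += 1
    return opinion, used


if __name__ == "__main__":
    # Stage 1: weak clinical evidence; stage 2: stronger imaging evidence.
    print(staged_predict([[1.0, 1.0], [8.0, 0.0]], threshold=0.3))
```

The ordering of stages (cheap clinical data first, imaging later) is what lets confident cases stop before any imaging evidence is requested.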

2606.19372 2026-06-19 eess.IV cs.CV cs.LG Cross-list

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

全自诊断(FSD): 通过逆问题和算子学习从智能手机视频进行基于物理的可视生物标志物推断

Jonathan Thomas, Harsh Thaker

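AI summary: Proposes the Full-Self Diagnostics (FSD) framework, which combines a physics-based forward model, information-theoretic observability, a regularized inverse problem, operator learning, and stochastic variational inference to recover physiological state from 9-second facial videos; validated on 38,812 paired scans from 59 subjects, with a glucose MARD of 29.86% on the lead author's self-collected data.

Comments 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

AI Chinese abstract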

我们提出全自诊断(FSD)，一个统一的数学框架，用于从消费级智能手机拍摄的无约束9秒面部视频中恢复潜在生理状态。该方法整合了五个相互增强的组件：(1)基于辐射传输方程和发色团吸收的物理前向模型，将相机观测映射到生物标志物浓度；(2)信息论可观测性理论，证明多通道视觉信号（光谱、脉搏、呼吸、微表情和眼动）与生理状态包含严格递增的互信息；(3)具有域均匀可辨识性保证的稳定Tikhonov正则化逆问题；(4)算子学习公式，实现跨设备、分辨率和人群的泛化；(5)可解释为随机变分推断的监督学习过程，从配对生物传感器真实值持续优化模型，性能随配对观测数量的平方根倒数比例提升。在59名受试者的38812次真实世界配对扫描上的实证验证展示了实际性能。第一作者自采数据（血糖范围35-550 mg/dL）的MARD为29.86%，97.57%的预测落在Clarke误差网格A+B区，仅0.27%在危险E区。一位管理良好的糖尿病参与者在较窄的70-180 mg/dL范围内达到MARD 17%。这些结果证实，消费级面部视频编码了足够的结构化信息，可在完全无约束条件下进行临床相关的非侵入性生物标志物推断，且性能随更多配对数据的可用性可预测地提升。

English abstract

We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
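
The headline accuracy figure above is MARD, the mean absolute relative difference between predicted and reference glucose readings. The short Python sketch below shows how that quantity is usually computed; the function name and the sample readings are hypothetical and are not taken from the paper's data.

```python
def mard_percent(predicted, reference):
    """Mean absolute relative difference (MARD), in percent.

    Each error is normalized by its paired reference reading, so the
    reference values must be positive (e.g. glucose in mg/dL).
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    relative_errors = [abs(p - r) / r for p, r in zip(predicted, reference)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


if __name__ == "__main__":
    # Hypothetical paired readings in mg/dL.
    print(mard_percent([110.0, 90.0], [100.0, 100.0]))
```

Because each error is divided by the reference value, the same absolute error weighs more at low glucose than at high glucose, which is worth keeping in mind when comparing MARD across cohorts with different glucose ranges.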

2606.19651 2026-06-19 cs.AI cs.CV cs.LG Cross-list

BrainG3N: A Dual-Purpose Tokenizer for Controllable 3D Brain MRI Generation

BrainG3N：用于可控3D脑MRI生成的双用途分词器

Max Van Puyvelde, Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

Affiliations: Department of Biomedical Data Science, Stanford University School of Medicine; Department of Mathematical Modelling, Statistics & Bioinformatics, Ghent University; Department of Electrical Engineering, Stanford University

AI summary: Proposes a tokenizer based on a 3D masked autoencoder that decouples encoder and decoder; the encoder outperforms or matches SOTA models on 21 of 23 linear-probing tasks, and the tokenizer supports conditional generation and longitudinal forecasting.

AI Chinese abstract

三维（3D）脑MRI是临床神经病学和神经肿瘤学的核心，生成模型可以增强代表性不足的队列、模拟疾病轨迹并支持隐私保护的数据共享。潜在扩散已成为建模成像数据的首选解决方案，但它对分词器提出了两个竞争性要求：编码器嵌入必须保留下游任务所需的临床信息，解码器必须重建解剖学上准确的体积。现有的重建驱动分词器以牺牲前者为代价实现了后者。为了解决这个问题，我们引入了一种基于全体积掩码自编码器（MAE）的分词器，用于3D脑MRI潜在扩散，解耦编码器和解码器：冻结的3D MAE编码器产生临床信息丰富的嵌入，而专用的CNN解码器从这些嵌入的线性投影重建体素。我们在来自18个公共队列的35,309个体积上预训练编码器，涵盖四种模态、十种疾病类别和200多个采集站点，并在两种设置中展示了其双重用途。首先，在23项线性探测基准测试中，编码器在21项任务上优于或匹配SOTA模型（即BrainIAC、BrainSegFounder和MedicalNet）。其次，在这些临床信息丰富的嵌入上训练的条件扩散变压器（DiT）支持跨六个变量的条件生成和患者特定的纵向预测。这些结果共同建立了一个单一的3D脑MRI嵌入空间，能够同时支持下游临床任务和可控生成。

English abstract

Three-dimensional (3D) brain MRI is central to clinical neurology and neuro-oncology, where generative models could augment under-represented cohorts, simulate disease trajectories, and support privacy-preserving data sharing. Latent diffusion has been the go-to solution for modeling imaging data, but it places two competing demands on the tokenizer: encoder embeddings must retain the clinical information that downstream tasks act on, and the decoder must reconstruct anatomically faithful volumes. Existing reconstruction-driven tokenizers achieve the second at the expense of the first. To address this, we introduce a fully volumetric masked-autoencoder (MAE) based tokenizer for 3D brain MRI latent diffusion, decoupling encoder and decoder: a frozen 3D MAE encoder produces clinically informative embeddings, while a dedicated CNN decoder reconstructs voxels from a linear projection of those embeddings. We pretrain the encoder on 35,309 volumes from 18 public cohorts spanning four modalities, ten disease categories, and 200+ acquisition sites, and demonstrate its dual utility in two settings. First, on a 23-task linear-probing benchmark, the encoder outperforms or matches SOTA models (i.e., BrainIAC, BrainSegFounder, and MedicalNet) on 21 of 23 tasks. Second, a conditional diffusion transformer (DiT) trained on these clinically informative embeddings supports both conditional generation across six variables and patient-specific longitudinal forecasting. Together these results establish a single 3D brain-MRI embedding space capable of both downstream clinical tasks and controllable generation.

2606.19767 2026-06-19 eess.IV cs.CV physics.med-ph 交叉投稿

Contour-Constrained Deformable Registration with Parameter Characterization for Head and Neck Surgical Guidance

面向头颈外科引导的带参数表征的轮廓约束可变形配准

Qingyun Yang, Jon S. Heiselman, Ayberk Acar, Morgan J. Ringel, Michael I. Miga, Matthieu Chabanas, Michael C. Topf, Jie Ying Wu

AI总结提出一种基于正则化Kelvinlet基函数的可变形配准框架，通过表面点云、基准标记和轮廓约束校正术后组织变形，在9例头颈标本上将配准误差从刚性配准的11.11mm降至5.62mm，降幅达49.41%。

AI中文摘要

全球每年新增89万例头颈部鳞状细胞癌，它是复发率最高的实体恶性肿瘤之一。尽管冰冻切片分析是术中切缘评估的标准方法，但由于切除标本与切除床之间的对准不精确，加上切除后黏膜组织收缩，准确地将检测到的阳性切缘重新定位到切除床上仍然具有挑战性。我们提出了一种生物力学驱动的可变形配准框架，用于校正切除后的组织变形以提供术中引导。该方法采用基于正则化Kelvinlet基函数的可变形配准，将3D标本网格配准到术中切除床点云。配准同时匹配表面点云、基准标记和边界轮廓约束，其中轮廓约束直接惩罚标本边界与切除床边界之间的垂直一致性距离（distance-to-agreement）。在来自皮肤、颊粘膜和舌部位的9个标本上，使用刚性配准的整体平均目标配准误差为$11.11 \pm 4.07$ mm，使用无轮廓约束的可变形配准则降至$8.20 \pm 2.68$ mm（降低26.19%）。所提出的轮廓约束可变形配准进一步将误差降至$5.62 \pm 2.28$ mm，相对于刚性配准降低了49.41%。我们在临床最具挑战性的舌标本中观察到最大降幅。我们还进行了系统的两阶段参数搜索，以表征表面配准、基准对应、轮廓约束和应变能正则化的相对重要性。该搜索表明，对于具有大侧向变形的组织类型，轮廓权重主导配准精度，而算法在广泛的参数组合范围内均可运行。

英文摘要

With 890,000 annual new cases globally, head and neck squamous cell carcinoma has one of the highest recurrence rates among solid malignancies. Although frozen section analysis is the standard of care for intraoperative margin assessment, accurately relocating detected positive margins on the resection bed remains challenging due to imprecise alignment between resected specimens and their resection bed, compounded by post-resection mucosal tissue shrinkage. We present a biomechanics-driven deformable registration framework that corrects post-resection tissue deformation to provide intraoperative guidance. Our approach registers 3D specimen meshes to intraoperative resection bed point clouds using a deformable registration approach based on regularized Kelvinlet basis functions. The registration matches surface point clouds, fiducial landmarks, and boundary contour constraints that directly penalize perpendicular distance-to-agreement between specimen and resection bed boundaries. Across nine specimens from skin, buccal mucosa, and tongue sites, the overall mean target registration error was $11.11 \pm 4.07$ mm using rigid registration, which decreased to $8.20 \pm 2.68$ mm (26.19% reduction) using deformable registration without contour constraint. The proposed contour-constrained deformable registration further reduced the error to $5.62 \pm 2.28$ mm, a 49.41% reduction relative to rigid registration. We observed the largest reduction in the most clinically challenging tongue specimens. We also performed a systematic two-stage parameter search to characterize the relative importance of surface alignment, fiducial correspondences, contour constraint, and strain energy regularization. This search revealed that contour weighting dominates registration accuracy for tissue types with large lateral deformation, while the algorithm operates over a broad range of parameter combinations.
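
As a quick check on the reported figures, the relative reductions follow directly from the three mean target registration errors quoted in the abstract. The short Python sketch below recomputes them; the helper name `percent_reduction` is illustrative and does not come from the paper.

```python
def percent_reduction(baseline_mm: float, new_mm: float) -> float:
    """Relative reduction of new_mm with respect to baseline_mm, in percent."""
    return 100.0 * (baseline_mm - new_mm) / baseline_mm


# Mean target registration errors quoted in the abstract (mm).
rigid_tre = 11.11       # rigid registration
deformable_tre = 8.20   # deformable registration without contour constraint
contour_tre = 5.62      # contour-constrained deformable registration

reduction_deformable = round(percent_reduction(rigid_tre, deformable_tre), 2)
reduction_contour = round(percent_reduction(rigid_tre, contour_tre), 2)
print(reduction_deformable, reduction_contour)
```

Both values agree with the 26.19% and 49.41% reductions stated in the abstract, each measured against the rigid baseline.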

2606.20115 2026-06-19 cs.LG cs.CV 交叉投稿

When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

当校准失败于脆弱的医院：通过风险曲线收缩实现联邦共形风险控制

Nafis Fuad Shahid

AI总结针对联邦部署中标准共形风险控制（CRC）对个体机构覆盖不足的问题，提出基于风险曲线收缩的联邦CRC协议，在真实脑肿瘤数据上实现2.7/20的违规率且预测集仅扩大2.0倍。

Comments 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026

AI中文摘要

共形风险控制（CRC）通过在保留数据上校准预测集阈值，提供分割质量的无分布保证。在联邦部署中，标准方法将各站点的校准分数合并为一个阈值。我们在真实多机构脑肿瘤数据（FeTS-2022，1251名受试者，20个机构）上首次量化表明，这种朴素的合并CRC保护了平均医院，但违反了40%个体机构的覆盖，最差站点的假阴性率超出目标7.8个百分点。朴素的替代方案——每个站点本地CRC——基本恢复了覆盖，但将预测集扩大了83倍，使其在临床上无用。我们提出一种基于收缩的联邦CRC协议：每个站点仅将其经验风险曲线（G个标量）传输到服务器，服务器为每个站点计算收缩正则化阈值。单个超参数n0平滑地权衡最坏情况覆盖与预测集效率；留一站点敏感性分析确定n0=19，在2.0倍拉伸下实现2.7/20的违规。我们进一步表明，覆盖预算的直接拉格朗日优化失败，将风险集中在脆弱的医院，并且有限样本修正项是必不可少的：移除它会使违规增加三倍。在所述站点混合假设下，边际CRC保证通过构造得以保留；在三个种子下针对四个目标验证了每个站点的覆盖。没有患者级别的图像、掩膜或每体积分数离开任何站点。

英文摘要

Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
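
The abstract does not spell out the shrinkage estimator, so the Python sketch below is only one plausible reading under loudly stated assumptions. Each site's empirical risk curve is pulled toward the pooled curve with weight n / (n + n0), and the threshold is the first grid point whose risk passes the standard CRC finite-sample correction. The function names, the weight formula, the use of n + n0 as the effective sample size, and all numbers are hypothetical.

```python
from typing import List, Optional


def shrink_curve(site_curve: List[float], pooled_curve: List[float],
                 n_site: int, n0: float) -> List[float]:
    """Blend a site's risk curve with the pooled curve (assumed weight n/(n+n0))."""
    w = n_site / (n_site + n0)
    return [w * s + (1.0 - w) * p for s, p in zip(site_curve, pooled_curve)]


def crc_threshold(risk_curve: List[float], n: float, alpha: float,
                  bound: float = 1.0) -> Optional[int]:
    """First grid index whose finite-sample-corrected risk is at most alpha.

    Assumes the curve is non-increasing along the grid and the loss is
    bounded by `bound`, as in standard conformal risk control.
    """
    for idx, risk in enumerate(risk_curve):
        if (n / (n + 1.0)) * risk + bound / (n + 1.0) <= alpha:
            return idx
    return None  # no grid point meets the target


# Hypothetical risk curves on a 5-point threshold grid (G = 5 scalars per site).
pooled = [0.30, 0.20, 0.09, 0.05, 0.01]
small_site = [0.50, 0.40, 0.25, 0.12, 0.02]
alpha, n_site, n_pooled, n0 = 0.15, 19, 1000, 19

idx_pooled = crc_threshold(pooled, n_pooled, alpha)
idx_local = crc_threshold(small_site, n_site, alpha)
idx_shrunk = crc_threshold(shrink_curve(small_site, pooled, n_site, n0),
                           n_site + n0, alpha)
print(idx_pooled, idx_local, idx_shrunk)
```

With these made-up curves the pooled threshold sits at grid index 2, the purely local one at index 4, and the shrunk one in between at index 3, which mirrors the coverage-versus-efficiency trade-off that n0 is said to control.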

2602.22959 2026-06-19 cs.CV 版本更新

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

智能体能否在零样本设置中区分视觉上难以分离的疾病？一项初步研究

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

发表机构 * Department of Diagnostic and Interventional Radiology, University Hospital Aachen, 52074 Aachen, Germany（诊断与介入放射科，亚琛大学医院，德国亚琛，52074）

AI总结本研究探索多模态大语言模型智能体在零样本下区分视觉混淆疾病（如黑色素瘤与不典型痣、肺水肿与肺炎）的能力，提出基于对比裁决的多智能体框架，在皮肤镜数据上准确率提升11个百分点，但总体性能仍不足临床部署。

Comments Code available at https://github.com/TruhnLab/Contrastive-Agent-Reasoning. Accepted by MICCAI 2026

AI中文摘要

多模态大语言模型（MLLMs）的快速进展引发了对基于智能体系统的日益关注。尽管大多数医学影像先前工作集中于自动化常规临床工作流程，我们研究了一个未被充分探索但临床意义重大的场景：在零样本设置中区分视觉上难以分离的疾病。我们在两个仅基于影像的代理诊断任务上对代表性智能体进行基准测试：（1）黑色素瘤与不典型痣，以及（2）肺水肿与肺炎，尽管临床管理存在显著差异，但视觉特征高度混淆。我们引入了一种基于对比裁决的多智能体框架。实验结果显示诊断性能提升（在皮肤镜数据上准确率提高11个百分点），并在定性样本上减少了无根据的声明，尽管整体性能仍不足以用于临床部署。我们承认人类注释中固有的不确定性以及临床背景的缺失，这进一步限制了向真实世界场景的转化。在此受控设置中，这项初步研究为视觉混淆场景下的零样本智能体性能提供了初步见解。

英文摘要

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

2603.01250 2026-06-19 cs.CV cs.AI 版本更新

基于血管概率引导衰减学习的稀疏视角动态DSA图像三维血管重建

Zhentao Liu, Huangxuan Zhao, Wenhui Qin, Zhenghong Zhou, Xinggang Wang, Wenping Wang, Xiaochun Lai, Chuansheng Zheng, Dinggang Shen, Zhiming Cui

发表机构 * School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China ； National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China ； School of Electronic Information and Communications, Huazhong University of Science and Technology ； Department of Computer Science & Engineering, Texas A&M University, USA

AI总结提出血管概率引导衰减学习框架，通过静态与动态衰减场互补加权实现稀疏视角DSA重建，降低辐射剂量，并采用渐进训练和时间扰动损失提升质量。

Comments Accepted by Medical Image Analysis (MedIA), 2026

DOI: 10.1016/j.media.2026.104088

AI中文摘要

数字减影血管造影（DSA）是血管疾病诊断的金标准之一。借助造影剂，时间分辨的二维DSA图像提供全面的血流信息，可用于重建三维血管结构以进行医学评估。当前的商用DSA系统通常需要数百个扫描视角进行重建，导致大量辐射暴露。在本研究中，我们提出了一种基于神经渲染的优化框架，专门用于高质量稀疏视角DSA重建，以减少辐射剂量。我们的方法称为血管概率引导衰减学习，将DSA成像表示为静态和动态衰减场的互补加权组合，权重来自时间无关的血管概率场。作为前景掩膜，血管概率为静态和动态场提供适应不同场景类型的适当梯度。该机制实现了静态背景与动态造影剂流的自监督分解，并显著提高了重建质量。我们的模型通过最小化合成投影与真实DSA图像之间的差异进行训练。我们进一步采用两种训练策略来提高重建质量：（1）由粗到细的渐进训练以改善几何结构，以及（2）时间扰动渲染损失以保持时间一致性。实验结果表明了高质量的三维血管重建和二维DSA图像合成。

英文摘要

Digital Subtraction Angiography (DSA) is one of the gold standards for vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive blood flow information and can be utilized to reconstruct 3D vessel structures for medical assessment. Current commercial DSA systems typically require hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. In this study, we propose a neural rendering-based optimization framework tailored for high-quality sparse-view DSA reconstruction to reduce radiation dosage. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the time-independent vessel probability field. Functioning as a foreground mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism enables self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the discrepancy between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training for better geometry and (2) temporal perturbed rendering loss for temporal consistency. Experimental results have demonstrated high-quality 3D vessel reconstruction and 2D DSA image synthesis.
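
The "complementary weighted combination" can be illustrated at a single 3D point. The minimal Python sketch below assumes that the vessel probability p weights the dynamic field and that 1 - p weights the static field. The abstract only says the weights are derived from the vessel probability, so this exact form, the function name, and the sample values are assumptions.

```python
def blended_attenuation(vessel_prob: float, static_mu: float,
                        dynamic_mu: float) -> float:
    """Attenuation at one point and time, mixing static and dynamic fields.

    vessel_prob is time-independent and acts as a soft foreground mask:
    1.0 means "inside a vessel" (dynamic field only), 0.0 means background.
    """
    if not 0.0 <= vessel_prob <= 1.0:
        raise ValueError("vessel_prob must lie in [0, 1]")
    return vessel_prob * dynamic_mu + (1.0 - vessel_prob) * static_mu


background = blended_attenuation(0.0, static_mu=0.2, dynamic_mu=0.9)
vessel = blended_attenuation(1.0, static_mu=0.2, dynamic_mu=0.9)
boundary = blended_attenuation(0.5, static_mu=0.2, dynamic_mu=0.8)
print(background, vessel, boundary)
```

Because the two weights sum to one, gradients from the rendering loss flow mostly to the dynamic field where p is high and mostly to the static field where p is low, which is one way to read the self-supervised decomposition described above.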

2503.23179 2026-06-19 eess.IV cs.CV 版本更新

OncoReg: Medical Image Registration for Oncological Challenges

OncoReg：面向肿瘤学挑战的医学图像配准

Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Tim Hable, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich

发表机构 * Institute of Medical Informatics, University of Lübeck（吕贝克大学医学信息学研究所）； Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein（石勒苏益格-荷尔斯泰因大学医院放射学与核医学研究所）； Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute（伦斯勒理工学院生物医学工程系和生物技术与跨学科研究中心）； School of Computer Science, University of Birmingham（伯明翰大学计算机科学学院）； Division of Informatics, Imaging and Data Sciences, University of Manchester（曼彻斯特大学信息学、成像和数据科学系）； DAMO Academy, Alibaba Group（阿里巴巴集团达摩院）； Hangzhou Shengshi Technology Co., Ltd（杭州盛世科技有限公司）； Department of Radiation Oncology, University Hospital Schleswig-Holstein（石勒苏益格-荷尔斯泰因大学医院放射肿瘤科）； EchoScout GmbH ； Radboud University Medical Center, Nijmegen（奈梅亨拉德堡德大学医学中心）； Institute of Interventional Radiology, University Hospital Schleswig-Holstein（石勒苏益格-荷尔斯泰因大学医院介入放射学研究所）

AI总结提出OncoReg挑战，通过两阶段框架在保护患者隐私的同时开发可泛化的图像配准方法，用于放射治疗中锥束CT与扇束CT的配准，发现特征提取是关键，深度学习和经典方法结合最有效。

Comments 21 pages, 13 figures

AI中文摘要

在现代癌症研究中，由于患者隐私相关的挑战，产生的大量医学数据往往未被充分利用。OncoReg挑战通过一个两阶段框架解决了这一问题，该框架使研究人员能够在确保患者隐私的同时开发和验证图像配准方法，并促进更可泛化的AI模型的发展。第一阶段涉及使用公开可用的数据集，第二阶段则专注于在安全的医院网络内对私有数据集进行模型训练。OncoReg建立在Learn2Reg挑战的基础上，纳入了放射治疗中介入性锥束计算机断层扫描与标准计划扇束CT图像的配准。准确的图像配准在肿瘤学中至关重要，特别是在图像引导放射治疗的动态治疗调整中，需要精确对齐以最小化对健康组织的辐射暴露，同时有效靶向肿瘤。本文详细介绍了OncoReg挑战的方法和数据，并对竞赛参赛作品和结果进行了全面分析。研究发现，特征提取在此配准任务中起着关键作用。从该挑战中涌现的一种新方法展示了其多功能性，而现有方法的表现与新技术相当。深度学习和经典方法在图像配准中仍扮演重要角色，尤其是方法的组合，特别是在特征提取方面，被证明最为有效。

英文摘要

In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography with standard planning fan-beam CT images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods, particularly in feature extraction, proving most effective.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV 版本更新

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics（数学系）； Department of Political and Social Sciences（政治与社会科学系）

AI总结通过受控基准测试，比较量子与经典生成器在脑MRI数据增强中的性能，发现两者均未显著优于仅用真实数据训练，且量子生成器无额外优势。

AI中文摘要

医学图像分类常受限于有限的标注数据，因此生成式增强被提出；最近，量子生成模型被用于此目的，并经常报告准确率提升。然而，这些声称通常基于单次训练运行，未匹配量子与经典生成器的参数预算，也未表征任何收益出现的数据范围。我们提出了一个受控基准测试，隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中，在该空间中，使用变分量子生成器或参数数量几乎相同的经典生成器（1648 vs. 1632）训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器，覆盖从5%到100%的标注数据比例，通过八个随机种子进行配对显著性检验（多重比较校正）以及集内多样性和潜在分布分析。在所有比例下，没有增强变体显著优于仅用真实数据训练，且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展：合成样本分布外移，并且在数据稀缺时严重模式崩溃，而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off-distribution and severely mode-collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
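
The abstract mentions paired significance testing with multiple-comparison correction but does not name the correction. As an illustration only, the Python sketch below applies the Holm step-down adjustment, a common choice when several augmentation variants are compared against one baseline. Using Holm here is an assumption, and the p-values are invented.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three augmentation variants vs. real-data-only.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
print(adjusted)
```

For these made-up inputs the adjusted values are approximately 0.03, 0.06, and 0.06, so a comparison that looks significant at 0.05 before correction (raw p = 0.04) no longer is afterwards.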

2606.19939 2026-06-19 cs.CV 新提交

DiffMath: Symbol- and Graph-Aware Latent Diffusion Transformer for Handwritten Mathematical Expression Generation

DiffMath：面向手写数学表达式生成的符号与图感知潜在扩散Transformer

Wei Pan, Xuhan Zheng, Yilin Shi, Huiguo He, Hiuyi Cheng, Dezhi Peng, Minghui Liao, Lianwen Jin

发表机构 * South China University of Technology（华南理工大学）； Huawei Technologies Co., Ltd.（华为技术有限公司）

AI总结提出DiffMath框架，利用LaTeX层次结构作为先验，通过关系抽象语法树、结构保持潜在表示和条件去噪，无需位置监督即可生成结构一致的手写数学表达式。

AI中文摘要

手写数学表达式生成（HMEG）由于数学表达式的复杂二维布局和长程结构依赖而具有挑战性。现有方法通常依赖显式空间监督，如符号级边界框，这导致高标注成本并限制可扩展性。在这项工作中，我们提出了DiffMath，一个符号与图感知的潜在扩散框架，利用LaTeX固有的层次结构作为结构先验，消除了位置监督的需求。首先，我们设计了关系抽象语法树（RelAST），一种面向生成的表示，将MathML树蒸馏为紧凑的三元组序列[S, R, D]，其中每个标记直接编码符号身份、空间关系或嵌套深度。其次，我们引入了MathVAE，通过符号感知和关系感知的感知正则化学习保持结构的潜在表示，确保潜在空间同时捕获字符语义和空间拓扑。第三，MathDiT在这个结构化潜在空间中进行条件去噪，并通过自适应层归一化（AdaLN）进一步由全局符号计数先验引导，以改善结构一致性。实验表明，DiffMath生成结构一致的手写表达式，在现有方法上实现了优越性能，并通过合成数据增强提高了下游OCR模型的准确性。

英文摘要

Handwritten Mathematical Expression Generation (HMEG) is challenging due to the complex two-dimensional layouts and long-range structural dependencies of mathematical expressions. Existing methods typically rely on explicit spatial supervision, such as symbol-level bounding boxes, which incurs high annotation costs and limits scalability. In this work, we propose DiffMath, a symbol- and graph-aware latent diffusion framework that leverages the hierarchical structure inherent in LaTeX as a structural prior, eliminating the need for positional supervision. First, we design a Relational Abstract Syntax Tree (RelAST), a generation-oriented representation that distills MathML trees into compact triplet sequences [S, R, D], where each token directly encodes a symbol identity, spatial relation, or nesting depth. Second, we introduce MathVAE, which learns structure-preserving latent representations through symbol-aware and relation-aware perceptual regularization, ensuring that the latent space captures both character semantics and spatial topology. Third, MathDiT performs conditional denoising in this structured latent space, further guided by a global symbol-count prior via Adaptive Layer Normalization (AdaLN) to improve structural coherence. Experiments show that DiffMath produces structurally consistent handwritten expressions, achieves superior performance over existing methods, and improves the accuracy of downstream OCR models through synthetic data augmentation.

2606.19617 2026-06-19 cs.CV cs.GR cs.LG 新提交

GB-LSR: A Fast Local Spectral Image Representation with a Single Global Bandwidth for Continuous Reconstruction and Super-Resolution

GB-LSR：一种具有单一全局带宽的快速局部光谱图像表示，用于连续重建和超分辨率

Max Shad, Naeem Khoshnevis

发表机构 * Harvard University（哈佛大学）

AI总结提出GB-LSR，一种基于全局带宽的局部光谱表示，通过共享卷积编码器预测截断傅里叶基系数，实现连续图像重建，在Kodak等基准上PSNR提升2.8-3.6 dB，推理速度比最慢基线快约4倍。

AI中文摘要

我们提出GB-LSR（全局带宽局部光谱表示），一种用于连续图像重建的固定网格局部光谱表示。图像域被划分为非重叠的方形块，每个块携带从共享卷积编码器特征预测的截断傅里叶基系数。一个可训练的标量带宽在所有块和图像中全局共享，在任何连续坐标处的重建是固定大小的基收缩，其成本与图像大小无关。我们研究了三种带宽处理变体：可训练的全局标量（主要）、固定的全局标量和逐块带宽场。在Kodak、Set14和Urban100上的标准化原生重建基准测试中，主要变体在匹配预算的LIIF/LTE/WIRE重实现上PSNR高出2.8-3.6 dB，LPIPS低0.11-0.15，同时推理成本约为最慢基线的四分之一。经验上，单个全局标量就足够了：逐块自适应带宽替代方案在闭式局部性诊断或端到端消融中均未带来改进。在独立的任意尺度超分辨率（ASR）扩展中，GB-LSR在标准SR协议下实现了具有竞争力的PSNR-Y，并在x4时比LIIF-RDN快1.44倍，比LTE-SwinIR快3.25倍；在同一扩展中，一个变体在训练和评估时不使用四角局部集成平均，速度提升1.77倍，峰值内存降低35%，PSNR变化可忽略，而将RDN编码器从64通道扩展到96通道时，PSNR略有提升，速度提升1.58倍，峰值内存降低31%。原生重建声明限定于匹配预算的摊销协议，ASR声明限定于独立的标准SR协议。

英文摘要

We present GB-LSR (Global-Bandwidth Local Spectral Representation), a fixed-grid local spectral representation for continuous image reconstruction. The image domain is partitioned into non-overlapping square patches, each carrying coefficients for a truncated Fourier basis predicted from shared convolutional-encoder features. A single trainable scalar bandwidth is shared globally across all patches and images, and reconstruction at any continuous coordinate is a fixed-size basis contraction whose cost is independent of image size. We study three bandwidth-handling variants: a trainable global scalar (main), a fixed global scalar, and a per-patch bandwidth field. On a standardized native-reconstruction benchmark across Kodak, Set14, and Urban100, the main variant outperforms matched-budget amortized LIIF / LTE / WIRE re-implementations by 2.8-3.6 dB PSNR and 0.11-0.15 LPIPS, while running at roughly one-quarter of the slowest baseline's inference cost. The single global scalar suffices empirically: per-patch adaptive-bandwidth alternatives do not improve over it on either a closed-form locality diagnostic or an end-to-end ablation. In a separate arbitrary-scale super-resolution (ASR) extension, GB-LSR achieves competitive PSNR-Y under a canonical-style SR protocol and runs 1.44x faster than LIIF-RDN and 3.25x faster than LTE-SwinIR at x4; within the same extension, a variant trained and evaluated without 4-corner local-ensemble averaging gives a 1.77x speedup with 35% lower peak memory and negligible PSNR change, while additionally widening the RDN encoder from 64 to 96 channels gives a small positive PSNR shift with a 1.58x speedup and 31% lower peak memory. Native-reconstruction claims are scoped to the matched-budget amortized protocol, and ASR claims are scoped to a separate canonical-style SR protocol.
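
To make the "fixed-size basis contraction" concrete, the Python sketch below evaluates a patch-local truncated basis at one continuous coordinate. It is a toy under stated assumptions: a separable cosine basis stands in for the paper's truncated Fourier basis, the coefficients are supplied by hand instead of being predicted by an encoder, and the names `reconstruct_at`, `patch_size`, and `bandwidth` are hypothetical.

```python
import math
from typing import List

PatchCoeffs = List[List[float]]  # K x K coefficients for one patch


def reconstruct_at(x: float, y: float, coeffs: List[List[PatchCoeffs]],
                   patch_size: float, bandwidth: float) -> float:
    """Evaluate the local spectral representation at continuous (x, y)."""
    # Locate the non-overlapping patch that contains the query point.
    pi = min(max(int(y // patch_size), 0), len(coeffs) - 1)
    pj = min(max(int(x // patch_size), 0), len(coeffs[0]) - 1)
    # Patch-local coordinates, nominally in [0, 1].
    u = x / patch_size - pj
    v = y / patch_size - pi
    value = 0.0
    # Fixed K x K contraction: the cost does not depend on the image size.
    for k, row in enumerate(coeffs[pi][pj]):
        for l, c in enumerate(row):
            value += c * math.cos(bandwidth * k * v) * math.cos(bandwidth * l * u)
    return value


# One patch, K = 2: a constant term plus one horizontal cosine.
grid = [[[[2.0, 1.0], [0.0, 0.0]]]]
at_corner = reconstruct_at(0.0, 0.0, grid, patch_size=8.0, bandwidth=math.pi)
at_middle = reconstruct_at(4.0, 0.0, grid, patch_size=8.0, bandwidth=math.pi)
print(at_corner, at_middle)
```

The single scalar `bandwidth` is shared by every patch in this sketch, which is the role the abstract assigns to the global trainable bandwidth. A per-patch variant would look that value up alongside the coefficients.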

2606.19901 2026-06-19 cs.CV 新提交

Linear Recurrent Unit with Semantic Modulation for Image Super-Resolution

基于语义调制的线性递归单元用于图像超分辨率

Mingyu Choi, Woo Kyoung Han, Sunghoon Im, Kyong Hwan Jin

发表机构 * Korea University（高丽大学）； DGIST（大邱庆北科学技术院）

AI总结提出一种结合语义调制单元的线性递归网络，通过调制、空间分类和原型增强实现高效图像超分辨率，性能超越现有方法。

Comments Accepted to CVPR 2026 Findings

2606.19938 2026-06-19 cs.CV cs.AI 新提交

Triangular Consistency as a Universal Constraint for Learning Optical Flow

三角一致性作为光流学习的通用约束

Yi Xiao, Carlos Rodriguez Coronel, Jing Zhan, Haniyeh Ehsani Oskouie, Alex Wong, Dong Lao

发表机构 * Louisiana State University（路易斯安那州立大学）； University of California, Los Angeles（加州大学洛杉矶分校）； Yale University（耶鲁大学）

AI总结提出三角一致性约束，通过组合两个光流诱导第三个光流并强制三者一致，适用于不同网络架构、监督类型和数据集，在监督、无监督和迁移学习中均提升性能。

Comments Accepted by ECCV 2026

AI中文摘要

我们提出三角一致性作为光流的第一性原理约束，该约束与网络架构、监督类型和数据集无关，适用于图像对和多帧设置。这个简单但强大的约束是通过组合两个光流来诱导第三个光流，并强制三者之间的一致性。组合的光流可能来自：(i) 图像对，产生循环一致性；(ii) 多个视频帧，通过时间链产生更长范围的运动；或 (iii) 图像对与受控合成变换相结合，这成为数据增强。这种三角一致性引入的计算开销可忽略不计，且不需要额外的标注。由于它直接源自光流的几何特性，不依赖于模型特定的假设，因此可作为光流训练的“通用”即插即用组件。实验表明，在监督、无监督和迁移学习设置中均有一致的改进。

英文摘要

We propose triangular consistency as a first-principles constraint for optical flow that is agnostic to network architecture, supervision type, and dataset, and that applies to both image-pair and multi-frame settings. The constraint is simple but powerful: compose two flows to induce a third flow, then enforce consistency among the three. The composed flows may arise from (i) image pairs, yielding cycle consistency; (ii) multiple video frames, producing longer-range motion through temporal chaining; or (iii) image pairs combined with controlled synthetic transformations, which act as data augmentation. Triangular consistency introduces negligible computational overhead and requires no additional annotations. Because it is derived directly from the geometry of optical flow, it does not rely on model-specific assumptions and serves as a "universal" plug-and-play component for optical flow training. Experiments show consistent improvement across supervised, unsupervised, and transfer learning settings.
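
The composition step can be sketched in a few lines of Python. The example below assumes dense flows stored as nested lists of (dx, dy) tuples, nearest-neighbour lookup with border clamping, and a mean absolute error as the consistency penalty. The paper's actual sampling scheme and loss are not given in the abstract, so these choices and the function names are illustrative.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy)


def compose_flows(flow_ab: Flow, flow_bc: Flow) -> Flow:
    """Induce an a->c flow by following a->b, then reading b->c at the landing point."""
    h, w = len(flow_ab), len(flow_ab[0])
    composed = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow_ab[y][x]
            # Nearest pixel in frame b, clamped to the image border.
            xb = min(max(int(round(x + dx)), 0), w - 1)
            yb = min(max(int(round(y + dy)), 0), h - 1)
            dx2, dy2 = flow_bc[yb][xb]
            row.append((dx + dx2, dy + dy2))
        composed.append(row)
    return composed


def triangular_loss(flow_ab: Flow, flow_bc: Flow, flow_ac: Flow) -> float:
    """Mean absolute gap between the composed a->c flow and a direct a->c flow."""
    composed = compose_flows(flow_ab, flow_bc)
    total, count = 0.0, 0
    for row_c, row_d in zip(composed, flow_ac):
        for (cx, cy), (ax, ay) in zip(row_c, row_d):
            total += abs(cx - ax) + abs(cy - ay)
            count += 2
    return total / count


def constant_flow(dx: float, dy: float, size: int = 3) -> Flow:
    return [[(dx, dy)] * size for _ in range(size)]


consistent = triangular_loss(constant_flow(1.0, 0.0), constant_flow(0.0, 2.0),
                             constant_flow(1.0, 2.0))
inconsistent = triangular_loss(constant_flow(1.0, 0.0), constant_flow(0.0, 2.0),
                               constant_flow(0.0, 0.0))
print(consistent, inconsistent)
```

Choosing frame c equal to frame a turns the same routine into a forward-backward cycle check, which corresponds to case (i) in the abstract.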

2606.19961 2026-06-19 cs.CV 新提交

Addressing Detail Bottlenecks in Latent Diffusion for RGB-to-SWIR Image Translation

解决潜在扩散模型中RGB到SWIR图像翻译的细节瓶颈

Kaili Wang, Martin Dimitrievski, Jose Maria Salvador, Ben Stoffelen, David Van Hamme, Lore Goetschalckx

发表机构 * imec ； imec-IPI-Ghent University（imec-IPI-根特大学）； Yale University（耶鲁大学）

AI总结针对潜在扩散模型在RGB到SWIR图像翻译中丢失空间细节的问题，提出源条件自编码器和可学习引导编码器两种轻量级改进，在驾驶场景下将检测mAP最高提升至基线的2倍，小目标上最高提升至3.4倍，并达到最先进的FID。

AI中文摘要

潜在扩散模型（LDM）能够高效地进行图像到图像的翻译，但在压缩过程中丢弃了精细的空间细节，从而降低了下游感知任务的性能。我们识别出两个瓶颈：自编码器（丢失空间信息）和条件路径（通过朴素下采样进一步退化源信号）。我们提出了两种轻量级、与骨干网络无关的修复方法：源条件自编码器（SCAE），通过跳跃连接将高分辨率源特征注入解码器；以及可学习引导编码器（LGE），用学习到的条件信号替代朴素下采样。在驾驶场景的RGB到SWIR翻译任务上，使用两种去噪骨干网络（U-Net和DiT）进行评估，我们的方法在潜在扩散基线基础上将检测mAP提升了高达2倍，小目标（COCO-small，<32^2像素^2）上提升高达3.4倍，同时达到了最先进的FID。我们进一步表明FID与检测性能相关性较差，从而激励多轴评估。结果零样本泛化到公开的RASMD基准。我们将公开发布带有标注的测试数据、所有检查点和训练代码。

英文摘要

Latent diffusion models (LDMs) enable efficient image-to-image translation but discard fine spatial details during compression, degrading downstream perception tasks. We identify two bottlenecks: the autoencoder, which loses spatial information, and the conditioning pathway, which further degrades the source signal through naive downsampling. We propose two lightweight, backbone-agnostic fixes: a Source-Conditioned Autoencoder (SCAE) that injects high-resolution source features into the decoder via skip connections, and a Learnable Guidance Encoder (LGE) that replaces naive downsampling with a learned conditioning signal. Evaluated on RGB-to-SWIR translation for driving scenes with two denoiser backbones (U-Net and DiT), our approach improves detection mAP by up to 2x over the latent diffusion baseline, with up to 3.4x gains on small objects (COCO-small, <32^2 px^2), while achieving state-of-the-art FID. We further show that FID and detection performance are poorly correlated, motivating multi-axis evaluation. Results generalise zero-shot to the public RASMD benchmark. We will publicly release test data with annotations, all checkpoints, and training code.

2606.19985 2026-06-19 cs.CV 新提交

Vision-Reasoning-Guided Occlusion Removal from Light Fields

视觉推理引导的光场遮挡去除

Mohamed Youssef, Oliver Bimber

发表机构 * Johannes Kepler University（约翰·开普勒大学）

AI总结提出结合光场积分与视觉语言模型的框架，通过多视图融合和语义先验恢复被遮挡场景，在合成和真实数据上取得最优性能。

AI中文摘要

遮挡鲁棒的场景恢复仍然是计算成像中的一个主要挑战，特别是在自然环境中，密集的前景植被严重限制了可见性。我们提出了一种视觉推理引导的光场遮挡去除框架，该框架结合了光场积分（LFI）的可见性恢复能力和视觉语言模型（VLM）的语义推理能力。首先通过LFI集成多视图观测以抑制前景遮挡，生成初始的可见性增强表示。然后，引入VLM作为条件语义先验，在观测测量的指导下恢复退化结构并恢复细节。为了提高恢复一致性并减少幻觉伪影，我们引入了一种多样本融合策略，将多个生成的假设聚合为统一的估计。在合成和真实世界数据集上的实验结果表明，该方法达到了最先进的性能，在四个合成光场基准场景（4-Syn）上取得了最高的平均SSIM，并在结构化和非结构化采集设置中表现出强大的泛化能力。这些结果凸显了将物理成像约束与视觉语言推理相结合在严重遮挡下实现鲁棒感知的有效性，可应用于搜索救援和探索性机器人导航。

英文摘要

Occlusion-robust scene recovery remains a major challenge in computational imaging, particularly in natural environments where dense foreground vegetation severely limits visibility. We propose a vision-reasoning-guided light field occlusion removal framework that combines the visibility recovery capability of light field integration (LFI) with the semantic reasoning capacity of vision-language models (VLMs). Multi-view observations are first integrated via LFI to suppress foreground occlusions and produce an initial visibility-enhanced representation. A VLM is then incorporated as a conditional semantic prior to restore degraded structures and recover fine details, guided by the observed measurements. To improve recovery consistency and reduce hallucination artifacts, we introduce a multi-sample fusion strategy that aggregates multiple generated hypotheses into a unified estimate. Experimental results on synthetic and real-world datasets demonstrate state-of-the-art performance, achieving the highest average SSIM across four synthetic light field benchmark scenes (4-Syn) and strong generalization across structured and unstructured acquisition settings. These results highlight the effectiveness of combining physical imaging constraints with vision-language reasoning for robust perception under severe occlusion, with applicability to search-and-rescue and exploratory robotic navigation.

2606.15648 2026-06-19 cs.CV 新提交

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结提出一种无需配对标签的迁移学习方法，将水下图像增强分解为全局颜色校正、去雾和背景噪声抑制，利用跨域先验监督各步骤，实现物理一致的增强。

Journal ref Information Fusion (2026): 104557

DOI: 10.1016/j.inffus.2026.104557

AI中文摘要

水下图像在不同水质条件下拍摄，导致复杂的退化，包括颜色偏差、低对比度和模糊效应。最近，基于学习的方法已显示出在水下图像增强（UIE）方面的潜力。然而，以往的大多数工作侧重于训练策略或网络设计，使增强结果与数据集中的标签良好对齐，忽略了标签是从先前UIE方法的增强结果中选取的，这些伪标签存在噪声。因此，它们的模型性能在一定程度上并不令人满意。然而，收集水下图像的真实标签具有挑战性。在这项工作中，我们提出了一种基于迁移学习的UIE方法，该方法不需要水下图像具有成对的噪声或真实标签来学习。相反，首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后，利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式，通过迁移学习实现了一种新颖的UIE，并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明，我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能，并有效提升了下游视觉任务，显著优于基准方法。项目仓库：https://github.com/Haru2022/P2-UIE。

英文摘要

Underwater images are captured under diverse water-medium conditions, leading to complex degradation that includes color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on training strategy or network design so that the enhanced result aligns well with dataset labels, ignoring that those labels are themselves selected from the outputs of earlier UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of such models is limited. Collecting true labels for underwater images, however, is challenging. In this work, we propose a transfer-learning-based UIE method that does not require paired noisy or true labels for underwater images. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression, following underwater physics. Multiple types of priors from other vision tasks are then leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance on the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.19574 2026-06-19 eess.IV cs.CV 交叉投稿

FrequencyFormer: A Co-Designed Sensor-to-Processor Pipeline for Frequency-Domain Vision Transformer Inference

FrequencyFormer: 面向频域视觉Transformer推理的协同设计传感器到处理器流水线

Chengwei Zhou, Ovishake Sen, Xuming Chen, Rishith Paramasivam, Shaahin Angizi, Swarup Bhunia, Baibhab Chatterjee, Gourav Datta

AI总结提出FrequencyFormer，通过多尺度DCT标记化将图像压缩为频域令牌，结合近传感器LUT硬件和低功耗通信架构，实现高达128倍数据压缩和28.8 TOPS/W能效，兼容多种视觉任务。

AI中文摘要

在传感器边缘系统上部署视觉Transformer（ViT）不仅受限于设备计算能力，还受限于从传感器到处理器传输高维图像数据所需的能量和带宽。虽然传感器内和近传感器计算通过早期特征提取降低了这一成本，但现有方法通常仅提供适度的压缩。我们观察到频域提供了视觉信息的自然紧凑表示，并且可以在传感器级别利用以减少传感器到处理器的数据移动。基于这一见解，我们提出了FrequencyFormer，一种用于高效ViT推理的协同设计传感器到处理器流水线。FrequencyFormer包括：（1）多尺度DCT标记化器，将224x224图像压缩为紧凑的频域令牌，实现高达128倍的片外数据量减少，且精度损失较小；（2）基于查找表（LUT）的近传感器硬件实现，利用固定DCT系数实现无乘法器、节能且面积高效的标记化；（3）改进的基于MIPI的低功耗通信架构，进一步降低传输能量。FrequencyFormer可作为标准ViT补丁嵌入的直接替代，并与分类、检测和分割任务的预训练骨干网络兼容。该流水线实现了28.8 TOPS/W的能效，将通信能量降低230倍，并将总传感器侧能量降低2.22倍，展示了频域标记化作为传感器内ViT部署的可扩展基础。

英文摘要

Deploying vision transformers (ViTs) on sensor-edge systems is limited not only by on-device compute, but also by the energy and bandwidth required to transmit high-dimensional image data from the sensor to the processor. While in-sensor and near-sensor computing reduce this cost through early feature extraction, existing methods often provide only modest compression. We observe that the frequency domain provides a naturally compact representation of visual information and can be exploited at the sensor level to reduce sensor-to-processor data movement. Building on this insight, we present FrequencyFormer, a co-designed sensor-to-processor pipeline for efficient ViT inference. FrequencyFormer includes: (1) a multi-scale DCT tokenizer that compresses a 224x224 image into compact frequency-domain tokens, achieving up to 128x reduction in off-chip data volume with modest accuracy loss; (2) a LUT-based near-sensor hardware implementation that leverages fixed DCT coefficients for multiplier-free, energy- and area-efficient tokenization; and (3) a modified MIPI-based low-power communication architecture that further reduces transfer energy. FrequencyFormer serves as a drop-in replacement for standard ViT patch embedding and remains compatible with pretrained backbones across classification, detection, and segmentation tasks. The pipeline achieves 28.8 TOPS/W, reduces communication energy by 230x, and lowers total sensor-side energy by 2.22x, demonstrating frequency-domain tokenization as a scalable foundation for in-sensor ViT deployment.
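
The idea behind DCT tokenization can be shown on a single patch. The Python sketch below computes an orthonormal 2D DCT-II and keeps only the lowest-frequency corner as the token. It is a single-scale simplification of the multi-scale tokenizer described above, and the 8x8 patch size, the 2x2 kept block, and the function names are assumptions chosen for illustration, not the paper's configuration.

```python
import math
from typing import List


def dct2(block: List[List[float]]) -> List[List[float]]:
    """Orthonormal 2D DCT-II of a square block."""
    n = len(block)

    def scale(k: int) -> float:
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    # The cosine table is fixed for a given n, which is what makes a
    # lookup-table hardware implementation possible in principle.
    table = [[math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
             for k in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for y in range(n):
                for x in range(n):
                    acc += block[y][x] * table[u][y] * table[v][x]
            out[u][v] = scale(u) * scale(v) * acc
    return out


def low_frequency_token(block: List[List[float]], keep: int) -> List[float]:
    """Flatten the keep x keep lowest-frequency DCT coefficients into a token."""
    coeffs = dct2(block)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


patch = [[1.0] * 8 for _ in range(8)]        # flat 8x8 patch
token = low_frequency_token(patch, keep=2)   # 4 numbers instead of 64
compression = (8 * 8) / len(token)
print(token, compression)
```

A flat patch puts all of its energy in the DC coefficient, so the token here is approximately [8, 0, 0, 0] and the per-patch reduction is 16x. Larger ratios, such as the up-to-128x figure in the abstract, depend on design choices that this sketch does not model.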

2606.19802 2026-06-19 cs.LG cs.CV 交叉投稿

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

流映射去噪器：遍历逆问题的失真-感知平面

Nicolas Zilberstein, Morteza Mardani, Santiago Segarra

发表机构 * Rice University（莱斯大学）； NVIDIA Inc.（英伟达公司）

AI总结提出流映射模型，通过单一参数t在MMSE和感知质量间连续调节，实现逆问题的失真-感知权衡，无需额外监督或调参。

AI中文摘要

图像复原面临一个基本权衡：最小化误差的方法产生模糊重建，而最大化感知质量的方法产生锐利但不够保真的图像。现有方法要么在失真-感知（DP）前沿上固定一个操作点，要么需要配对数据监督、辅助模型或对采样器进行超参数调优以访问不同点。我们证明，流映射模型——一种用于少步采样的流匹配的近期扩展，学习一个平均场——隐式定义了一个单参数去噪器族，连续跨越DP前沿。前瞻参数t充当MMSE和感知区域之间的控制旋钮。对于高斯目标，我们证明改变t精确恢复最优DP前沿；对于自然图像，我们在经验上观察到类似行为。在即插即用求解器中，相同机制扩展到一般逆问题，控制感知对齐与数据一致性之间的权衡。尽管在此设置中缺乏精确最优性保证，单个训练的流映射跨越DP权衡，在两端匹配或超越专门基线。在CelebA（128×128）和AFHQ（256×256）上的多个线性和非线性逆任务的广泛实验验证了我们的发现。

英文摘要

Image restoration faces a fundamental tradeoff: methods that minimize error produce blurry reconstructions, while those that maximize perceptual quality yield sharp but less faithful images. Existing approaches either commit to a single operating point on this distortion-perception (DP) frontier or require paired-data supervision, auxiliary models, or hyperparameter tuning of the sampler to access different points. We show that flow map models, a recent extension of flow matching for few-step sampling that learns an average field, implicitly define a one-parameter family of denoisers that continuously spans the DP frontier. The lookahead parameter t acts as a control knob between the MMSE and perceptual regimes. For Gaussian targets, we prove that varying t exactly recovers the optimal DP frontier; for natural images, we observe similar behavior empirically. Within a Plug-and-Play solver, the same mechanism extends to general inverse problems, where it controls a tradeoff between perceptual alignment and data consistency. Despite the lack of exact optimality guarantees in this setting, a single trained flow map spans the DP tradeoff, matching or exceeding specialized baselines at both extremes. Extensive experiments on CelebA (128×128) and AFHQ (256×256) across several linear and nonlinear inverse tasks validate our findings.
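
The two ends of the distortion-perception tradeoff can be illustrated on a scalar Gaussian toy problem. The sketch below is not the paper's flow map; it assumes x ~ N(0, 1) observed as y = x + sigma * noise, and compares the posterior mean (the MMSE estimate) with a single posterior sample (whose outputs follow the prior distribution). All names and parameter values are illustrative.

```python
import random

def simulate(sigma, n_samples=100_000, seed=0):
    """Return (MSE of posterior mean, MSE of one posterior sample)."""
    rng = random.Random(seed)
    shrink = 1.0 / (1.0 + sigma ** 2)          # posterior mean is shrink * y
    post_std = (sigma ** 2 * shrink) ** 0.5    # posterior standard deviation
    mse_mean = 0.0
    mse_sample = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        y = x + sigma * rng.gauss(0.0, 1.0)
        x_mmse = shrink * y                                # lowest distortion
        x_post = x_mmse + post_std * rng.gauss(0.0, 1.0)   # matches the prior
        mse_mean += (x_mmse - x) ** 2
        mse_sample += (x_post - x) ** 2
    return mse_mean / n_samples, mse_sample / n_samples

mmse_err, sample_err = simulate(sigma=1.0)
```

In this toy setting the posterior sample incurs roughly twice the squared error of the posterior mean, which is the gap a single control parameter would have to traverse.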

2602.01391 2026-06-19 cs.CV 版本更新

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

通过增强潜在本征属性将重光照作为视觉先验的探针

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands（阿姆斯特丹大学UvA-Bosch Delta实验室，荷兰阿姆斯特丹）； The University of Chicago, Chicago, USA（芝加哥大学）； Johns Hopkins University, Baltimore, USA（约翰霍普金斯大学）

AI总结提出增强潜在本征属性（ALI）方法，融合密集像素对齐视觉特征到潜在本征重光照模型，平衡语义与光度保真度，提升复杂材质重光照质量。

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

AI中文摘要

图像到图像的重光照需要能够将光照与场景属性分离，同时保留密集几何、材质和光度线索的表征。我们将此任务用作视觉先验的探针：与奖励不变性的识别任务不同，重光照测试视觉特征是否保留光传输所需的信息。通过一个受控的生成式重光照框架，我们发现强语义编码器会降低重光照质量，揭示了抽象与物理保真度之间的语义-光度权衡。我们引入了增强潜在本征属性（ALI），通过将密集的、像素对齐的视觉特征融合到潜在本征重光照模型中，并在未标注的真实图像对上通过自监督进行细化，来平衡这一权衡。ALI提高了重光照质量，尤其是在光泽、金属和透明材质上，并证明了生成式重光照是量化视觉编码器对物理世界编码内容的有效工具。

英文摘要

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

2606.19565 2026-06-19 cs.CV 新提交

Mix-QVLA: Task-Evidence-Aware Mixed-Precision Quantization of Vision-Language-Action Models

Mix-QVLA：任务证据感知的视觉-语言-动作模型混合精度量化

Navin Ranjan, Andreas Savakis

发表机构 * Rochester Institute of Technology（罗彻斯特理工学院）

AI总结提出Mix-QVLA框架，通过任务证据感知的混合精度后训练量化，在保持任务性能的同时大幅降低VLA模型的内存和计算开销，在LIBERO上实现4.1GB内存和1.52倍加速。

AI中文摘要

我们提出Mix-QVLA，一种针对VLA模型的任务证据感知混合精度PTQ框架。Mix-QVLA将每个量化变体锚定到全精度动作令牌参考决策，并评估量化是否在关键VLA功能边界上保留了任务相关证据。它从边界激活计算归一化的梯度加权任务证据图，并使用证据质量和归因分布失真比较全精度和量化图，捕捉决策支持证据的强度和分配变化。一个软瓶颈目标将边界级退化聚合为层敏感度分数。Mix-QVLA进一步在整个任务执行过程中建模敏感度，捕捉层重要性的阶段依赖变化，而不是假设固定的敏感度分布。由此产生的证据和时间感知分数指导在模型大小和BitOps预算下的混合精度位分配。在OpenVLA风格策略上的广泛评估表明，Mix-QVLA改善了低比特VLA部署的精度-效率权衡。在LIBERO上，Mix-QVLA将OpenVLA-OFT内存从15.4 GB减少到4.1 GB，保留了96.3的平均成功率（BF16模型为97.1），并实现了1.52倍的推理加速。

英文摘要

We propose Mix-QVLA, a task-evidence-aware mixed-precision PTQ framework for VLA models. Mix-QVLA anchors each quantized variant to the full-precision action-token reference decision and evaluates whether quantization preserves task-relevant evidence across key VLA functional boundaries. It computes normalized gradient-weighted task-evidence maps from boundary activations and compares full-precision and quantized maps using evidence-mass and attribution-distribution distortion, capturing changes in both the strength and allocation of decision-supporting evidence. A soft-bottleneck objective aggregates boundary-level degradation into layer-wise sensitivity scores. Mix-QVLA further models sensitivity throughout task execution, capturing phase-dependent shifts in layer importance rather than assuming a fixed sensitivity profile. The resulting evidence- and time-aware scores guide mixed-precision bit allocation under model-size and BitOps budgets. Extensive evaluations on OpenVLA-style policies show that Mix-QVLA improves the accuracy-efficiency trade-off of low-bit VLA deployment. On LIBERO, Mix-QVLA reduces OpenVLA-OFT memory from 15.4 GB to 4.1 GB, retains 96.3 average success compared with 97.1 for the BF16 model, and achieves a 1.52x inference speedup.
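
The final step, turning sensitivity scores into a bit allocation under a size budget, can be sketched with a simple greedy rule. Everything here is an assumption for illustration: the layer names, parameter counts, sensitivity values, the candidate bit-widths and the greedy rule itself stand in for the paper's evidence- and time-aware scores, and the BitOps budget is omitted.

```python
def allocate_bits(layers, budget_bytes, choices=(8, 4, 2)):
    """layers: list of (name, n_params, sensitivity). Returns {name: bits}.

    Start every layer at the highest precision, then lower precision one
    level at a time, least sensitive layers first, until the budget is met.
    """
    bits = {name: choices[0] for name, _, _ in layers}

    def total_bytes():
        return sum(n_params * bits[name] / 8 for name, n_params, _ in layers)

    by_sensitivity = sorted(layers, key=lambda layer: layer[2])
    for level in choices[1:]:
        for name, _, _ in by_sensitivity:
            if total_bytes() <= budget_bytes:
                return bits
            bits[name] = level
    if total_bytes() > budget_bytes:
        raise ValueError("budget not reachable with the given bit choices")
    return bits

# Hypothetical layers: (name, parameter count, sensitivity score).
layers = [
    ("vision", 1000, 0.9),
    ("language", 4000, 0.2),
    ("action_head", 500, 1.5),
]
plan = allocate_bits(layers, budget_bytes=3000)
```

With these made-up numbers the most sensitive layer keeps 8 bits while the other two drop to 4 bits, which meets the 3000-byte budget exactly.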

2606.19736 2026-06-19 cs.CV 新提交

VFACamou: View-Fused Adversarial Camouflage for Environment-Adaptive Physical Evasion

VFACamou: 视图融合的对抗性伪装用于环境自适应物理规避

Shihui Yan, Hu Liu, Junyu Shi, Zihui Zhu, Ziqi Zhou, Yufei Song, Youming Geng, Minghui Li, Shengshan Hu

发表机构 * State Key Laboratory of Intelligent Vehicle Safety Technology（智能汽车安全技术国家重点实验室）； School of Cyber Science and Engineering, Huazhong University of Science and Technology（华中科技大学网络空间安全学院）； School of Computer Science and Technology, Huazhong University of Science and Technology（华中科技大学计算机科学与技术学院）； School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Hebei Energy College of Vocation And Technology（河北能源职业技术学院）

AI总结提出一种端到端框架，结合UV体积渲染与扩散纹理生成器，并引入照明颜色一致性估计器和多尺度动态训练策略，生成可穿戴对抗图案，在无人机侦察等动态视角和光照变化下实现稳定物理攻击。

Comments Accepted by ICME 2026

AI中文摘要

GLARE: 用于查询全局解释的自然语言接口

Bhavan Vasu, Rajesh Mangannavar

发表机构 * Oregon State University（俄勒冈州立大学）

AI总结提出基于LLM的交互接口GLARE，将自然语言问题转换为SQL查询以聚合局部解释数据，提升全局解释的可访问性和可用性。

Comments 16 pages, 2 figures

AI中文摘要

虽然全局解释对于理解跨数据集、类别和决策上下文的视觉模型至关重要，但其复杂和单一的性质常常阻碍实际探索。由于用户通常寻求针对特定问题的目标答案，而不是静态产物，我们提出了一种基于LLM的交互接口，提供对黑盒图像分类器全局解释的自然语言访问。系统的核心LLM充当调解者，将自然语言问题转换为对局部解释数据的结构化SQL查询。这使得灵活聚合成为可能，而无需向用户暴露低级表示。对于每个查询，接口输出统计增强的自然语言响应，支持局部解释和意图对齐的可视化。我们在意图解释、查询映射准确性、对新查询和数据集的泛化能力以及对语言错误的鲁棒性方面评估了该系统。我们的结果表明，LLM中介的查询显著提高了以人为中心的XAI中全局解释的可访问性和可用性。

英文摘要

While global explanations are crucial for understanding vision models across datasets, classes, and decision contexts, their complex and monolithic nature often hinders practical exploration. Because users typically seek targeted answers to specific questions rather than static artifacts, we present an LLM-based interactive interface that provides natural language access to global explanations for black-box image classifiers. The system's core LLM acts as a mediator, translating natural language questions into structured SQL queries over local explanation data. This enables flexible aggregation without exposing users to low-level representations. For each query, the interface outputs statistics-augmented natural language responses, supporting local explanations, and intent-aligned visualizations. We evaluate the system on intent interpretation, query mapping accuracy, generalization to novel queries and datasets, and robustness to linguistic errors. Our results demonstrate that LLM-mediated querying substantially improves the accessibility and usability of global explanations for human-centered XAI.
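
To make the mediation step concrete, the sketch below shows the kind of SQL aggregation an LLM might emit for a question such as "Which concepts matter most for the class zebra?". The table name, columns and rows are hypothetical, the LLM call itself is omitted, and the abstract does not specify the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical store of local explanations: one row per (image, concept).
conn.execute(
    """CREATE TABLE local_explanations (
           image_id INTEGER, predicted_class TEXT,
           concept TEXT, attribution REAL)"""
)
rows = [
    (1, "zebra", "stripes", 0.8),
    (1, "zebra", "grass", 0.1),
    (2, "zebra", "stripes", 0.6),
    (2, "zebra", "grass", 0.3),
    (3, "horse", "mane", 0.7),
]
conn.executemany("INSERT INTO local_explanations VALUES (?, ?, ?, ?)", rows)

# Aggregating local explanations yields a class-level (global) view.
query = """
    SELECT concept, AVG(attribution) AS mean_attr, COUNT(*) AS n
    FROM local_explanations
    WHERE predicted_class = ?
    GROUP BY concept
    ORDER BY mean_attr DESC
"""
result = conn.execute(query, ("zebra",)).fetchall()
```

The rows in `result` carry the statistics (mean attribution and support count) that a natural language response could then cite.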

2606.20527 2026-06-19 cs.CL cs.CV 交叉投稿

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

StylisticBias: 少数人类视觉线索驱动多模态大语言模型中的大部分社会偏见

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

发表机构 * Technical University of Munich（慕尼黑工业大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； Princeton Center for Information and Technology Policy（普林斯顿信息与技术政策中心）

AI总结提出StylisticBias基准，通过控制单一视觉属性变化，发现年龄和体型主导身份层面偏见，而时尚风格等约15个属性解释近80%的偏见变化，偏见集中于少数视觉线索。

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

详情

AI中文摘要

多模态大语言模型（MLLMs）越来越多地部署在个人和社会影响重大的场景中，但影响这些模型判断人物的视觉线索仍知之甚少。先前的工作通常比较不同的（群体）个体，难以将外貌效应与身份差异分离。我们引入StylisticBias，一个用于评估MLLMs中属性级社会偏见的受控基准。我们生成500张逼真的基础人脸，每张脸创建约50个单一属性变体，产生约25K张图像。这种设计保持身份不变，每次改变一个视觉属性，使我们能够测量特定线索如何改变模型判断。我们在25个二元社会判断场景中评估了六个MLLMs。我们发现年龄和体型主导身份层面的效应，而时尚风格和其他视觉线索驱动最大的属性级变化。我们进一步发现，约15个属性解释了近80%的总变异，表明偏见集中在少数视觉线索上。在与外貌语义对齐的判断中，尤其是社会经济和风格相关判断，敏感性最强。我们发布StylisticBias作为多模态模型细粒度偏见评估的基准。代码和数据集：https://github.com/timo-cavelius/StylisticBias 和 https://hf.co/datasets/shaghayegh/stylistic-bias-dataset。

英文摘要

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

2511.04260 2026-06-19 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet：面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science（数学与计算机科学系）； University of Catania（卡塔尼亚大学）

AI总结提出Proto-LeakNet，利用扩散模型中的信号泄漏痕迹，结合闭集分类与密度开集评估，实现可解释的生成器归因，在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

DOI: 10.1016/j.cviu.2026.104848

AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明，扩散管道会在其输出中无意中留下持久的统计痕迹，称为信号泄漏，特别是在潜在表示中。基于这一观察，我们提出了Proto-LeakNet，一个信号泄漏感知且可解释的归因框架，它将闭集分类与在学习到的嵌入上进行的基于密度的开集评估相结合，从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域，重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征，而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC，Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒，超越了最先进的方法，并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取：https://github.com/claudiunderthehood/Proto-LeakNet。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2510.27285 2026-06-19 cs.CV cs.CR 版本更新

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

重新思考扩散模型中的鲁棒对抗性概念擦除

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Yue Ming, Xueming Li, Yue Zhang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University（计算机科学与技术系，人工智能研究院，清华大学）； University of Chinese Academy of Sciences（中国科学院大学）； Nanjing University of Aeronautics and Astronautics（南京航空航天大学）

AI总结针对扩散模型中概念擦除的对抗训练忽视概念语义导致拟合不足的问题，提出语义引导的鲁棒对抗概念擦除方法S-GRACE，显著提升擦除性能26%并减少90%训练时间。

AI中文摘要

概念擦除旨在选择性地遗忘扩散模型（DMs）中的不良内容，以降低敏感内容生成的风险。作为概念擦除的一种新范式，现有方法大多采用对抗训练来识别和抑制目标概念，从而减少敏感输出的可能性。然而，这些方法常常忽视对抗训练在DMs中的特异性，导致仅能部分缓解。在这项工作中，我们从概念空间的角度调查并量化了这种特异性，即对抗样本能否真正拟合目标概念空间？我们观察到现有方法在生成对抗样本时忽视了概念语义的作用，导致对概念空间的拟合效果不佳。这种忽视导致了以下问题：1）当对抗样本较少时，它们无法全面覆盖目标概念；2）反之，它们会破坏其他目标概念空间。受这些发现分析的启发，我们引入了S-GRACE（语义引导的鲁棒对抗概念擦除），它优雅地利用概念空间内的语义引导来生成对抗样本并执行擦除训练。使用七种最先进方法和三种对抗提示生成策略在各种DM遗忘场景下进行的实验表明，S-GRACE显著提高了擦除性能26%，更好地保留了非目标概念，并将训练时间减少了90%。我们的代码可在 https://github.com/Qhong-522/S-GRACE 获取。

英文摘要

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which gracefully leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

2605.07821 2026-06-19 cs.CV cs.AI 版本更新

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

通过对象共现分析缓解OOD检测中的简单性偏差

Boyang Dai, Chaoqi Chen, Yizhou Yu

发表机构 * The University of Hong Kong（香港大学）； Shenzhen University（深圳大学）； Shenzhen Loop Area Institute（深圳河套学院）

AI总结提出基于对象共现的OOD检测框架，通过解耦表示和分治策略区分近OOD，缓解简单性偏差，在多种设置下取得竞争结果。

Comments This paper has been accepted by CVPR2026

AI中文摘要

分布外（OOD）检测对于确保深度学习模型的可靠性至关重要。现有方法大多依赖常规的纠缠表示以区分分布内（ID）和OOD数据，忽略了图像中丰富的上下文信息。这一问题在检测近OOD时尤其具有挑战性，因为具有简单性偏差的模型难以在解耦表示中学习判别性特征。人类视觉系统可以利用自然环境中对象的共现来促进场景理解。受此启发，我们提出了一种以对象为中心的OOD检测框架，学习捕捉图像中的对象共现（OCO）模式。该方法引入了一种新的OOD检测范式，通过预测测试样本的解耦表示来理解图像中的对象共现，然后根据ID训练数据中观察到的对象共现模式自适应地将模式分为三种场景，最后以分治方式进行OOD检测。通过这种方式，OCO可以通过考虑图像中存在的语义上下文关系来区分近OOD，避免仅关注简单、易学习区域的倾向。我们通过在具有挑战性和全频谱OOD设置下的实验评估了OCO，展示了竞争性结果，并证实了其处理语义和协变量偏移的能力。代码发布在：https://github.com/Michael-McQueen/OCO。

英文摘要

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2502.03227 2026-06-19 cs.LG cs.CV 版本更新

Adversarial Dependence Minimization

对抗性依赖最小化

Pierre-François De Plaen, Tinne Tuytelaars, Marc Proesmans, Luc Van Gool

发表机构 * CVL, ETH Zürich, Switzerland（CVL，苏黎世联邦理工学院，瑞士）； INSAIT, Sofia University, Bulgaria（INSAIT，索菲亚大学，保加利亚）

AI总结提出ADM算法，通过对抗博弈最小化特征维度间的统计依赖性，证明全局最优时达到相互独立，并应用于非线性去相关、图像分类泛化提升和自监督学习维度坍塌预防。

2606.19483 2026-06-19 cs.CV 新提交

LEAP: Layer-skipping Efficiency via Adaptive Progression for Vision Transformer Distillation

LEAP: 通过自适应进度实现视觉Transformer蒸馏的层跳过效率

Jiaqi Zhang, Ashton Lee, Anthony Wong, John Zou, Sami BuGhanem, Randall Balestriero

发表机构 * Brown University（布朗大学）； Rice University（莱斯大学）

AI总结提出LEAP训练课程，通过自适应选择教师中间特征图作为渐进式目标，加速学生ViT的知识蒸馏，在ImageNet-100上提升12.24%准确率，并节省25.1%训练FLOPs。

AI中文摘要

基于视觉Transformer（ViT）骨干的视觉基础模型（VFMs），如DINOv2，已成为目标识别和语义分割等下游任务的关键。骨干网络的巨大计算需求通常需要将其蒸馏到更小的架构中以便在边缘部署。基于特征的知识蒸馏（KD）常受师生差距影响；学生由于容量有限难以模仿教师复杂的特征图。为缓解这一瓶颈，我们提出LEAP：通过自适应进度实现层跳过效率，一种用于ViT特征知识蒸馏的训练课程。通过利用教师的中间特征图作为一系列逐渐困难的渐进目标，我们的课程允许学生在处理更高层抽象之前构建基础表示。我们的结果表明，这种范式通过在不同学生模型大小和数据集规模上自适应选择难度，显著加速了收敛。采用我们的课程，LEAP蒸馏的ViT-S在ImageNet-100上达到90.1%的准确率，相比基线提升12.24%。在ImageNet-1K上，LEAP在Oxford和Paris数据集上的实例检索任务分别提升3.84%和7.75%。此外，该课程通过在训练初始阶段对教师推理实施早停，在ImageNet-100上节省了25.1%的训练FLOPs和21%的训练时间。代码可在以下网址获取：https://github.com/KevinZ0217/LEAP

英文摘要

Vision Foundation Models (VFMs) with Vision Transformer (ViT) backbones, such as DINOv2, have become essential for downstream tasks like object recognition and semantic segmentation. The immense computational requirements of these backbones often necessitate distillation into smaller architectures for edge deployment. Feature-based knowledge distillation (KD) often suffers from the teacher-student gap; the student struggles to imitate the teacher's complex feature maps due to its limited capacity. To mitigate this bottleneck, we propose LEAP: Layer-skipping Efficiency via Adaptive Progression, a training curriculum for ViT feature-based knowledge distillation. By utilizing the teacher's intermediate feature maps as a sequence of progressively more difficult targets, our curriculum allows the student to build a foundational representation before tackling higher-level abstractions. Our results demonstrate that this paradigm significantly accelerates convergence through adaptive difficulty selection across various student model sizes and dataset scales. With our curriculum, the LEAP-distilled ViT-S achieves 90.1% accuracy on ImageNet-100, a +12.24% improvement over the baseline. On ImageNet-1K, LEAP achieves improvements of +3.84% and +7.75% for the instance retrieval task on the Oxford and Paris datasets, respectively. Furthermore, the curriculum enables 25.1% savings in training FLOPs and 21% savings in training time on ImageNet-100 by implementing early-stopping for teacher inference during the initial stages of training. Code is available at https://github.com/KevinZ0217/LEAP
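
The curriculum idea can be sketched as a rule that moves the distillation target to a deeper teacher layer once the student's loss stops improving, plus the resulting saving from stopping the teacher forward pass early. The plateau rule, its thresholds and the example schedule are assumptions for illustration; the abstract does not state how LEAP selects difficulty. The cost estimate assumes every teacher block costs the same.

```python
def advance_target(layer, losses, last_layer, patience=2, min_gain=0.01):
    """Move to the next teacher layer when the loss has plateaued.

    layer: current target layer index (1-based).
    losses: distillation losses recorded so far, oldest first.
    """
    if layer >= last_layer or len(losses) <= patience:
        return layer
    gain = losses[-patience - 1] - losses[-1]
    return min(layer + 1, last_layer) if gain < min_gain else layer

def teacher_cost_fraction(schedule, last_layer):
    """Teacher forward cost relative to always running all layers,
    if inference stops at the target layer in each epoch."""
    return sum(schedule) / (len(schedule) * last_layer)

stalled = advance_target(3, [1.0, 0.995, 0.993], last_layer=12)
improving = advance_target(3, [1.0, 0.8, 0.6], last_layer=12)

# Hypothetical per-epoch target layers for a 12-block teacher.
schedule = [3, 3, 6, 6, 9, 12, 12, 12]
fraction = teacher_cost_fraction(schedule, last_layer=12)
```

In this made-up schedule the teacher runs about 66% of its full forward cost, which shows where savings in training FLOPs could come from.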

2606.19817 2026-06-19 cs.CV 新提交

Training-Free Metrics for Synthetic Object Detection Data: A Proxy for Detector Performance

无需训练的合成目标检测数据度量：检测器性能的代理指标

Myeongseok Nam, Donghoon Yeo, Seungwook Kim

发表机构 * GenGenAI

AI总结提出CCDM度量族，无需训练即可评估合成数据集对下游目标检测的效用，在VisDrone-DET上实现与YOLOv8性能的完全Spearman相关。

Comments 9 pages, 4 figures

2606.19932 2026-06-19 cs.CV cs.AI 新提交

Spatial-Aware Reduction Framework: Towards Efficient and Faithful Visual State Space Models

空间感知缩减框架：迈向高效且忠实的视觉状态空间模型

Jindi Lv, Aoyu Li, Yuhao Zhou, Zheng Zhu, Xiaofeng Wang, Qing Ye, Yueqi Duan, Wentao Feng, Jiancheng Lv

发表机构 * Sichuan University（四川大学）； Tsinghua University（清华大学）

AI总结提出STORM框架，通过保持空间结构完整性解决视觉Mamba模型在token缩减时的性能崩溃问题，无需训练即可实现高精度剪枝。

Comments Accepted by ICML 2026

AI中文摘要

Mamba在建模长视觉序列方面表现出强大的效率。然而，当将token缩减应用于结构增强的Mamba变体时，这些模型会出现严重的性能崩溃。我们将这种退化归因于现有缩减方法在空间上的不可知性，这违反了选择性扫描机制所需的二维结构前提。在这项工作中，我们提出了STORM，一个空间感知的token缩减框架，旨在在压缩过程中保持结构完整性。STORM将缩减重新表述为对空间单元的结构化操作，强制局部约束以保持网格拓扑和邻域一致性。作为一个即插即用模块，STORM无需任何训练即可为现有缩减流程赋予明确的空间感知能力。实验结果表明，STORM在无训练设置下，在多种视觉Mamba骨干网络上实现了最先进的剪枝精度。值得注意的是，STORM在VMamba上实现了显著的精度恢复，在top-1准确率上比先前方法高出63.3%。同时，STORM在PlainMamba上仅造成1.0%的准确率下降，达到了与ViT相当的性能。

英文摘要

Mamba demonstrates strong efficiency in modeling long visual sequences. However, when token reduction is applied to structurally enhanced Mamba variants, these models exhibit a severe performance collapse. We attribute this degradation to the spatially agnostic nature of existing reduction methods, which violate the two-dimensional structural premise required by the selective scanning mechanism. In this work, we propose STORM, a spatial-aware token reduction framework designed to maintain structural integrity throughout the compression process. STORM reformulates reduction into a structured operation on spatial units, enforcing localized constraints to maintain both grid topology and neighborhood coherence. As a plug-and-play module, STORM equips existing reduction pipelines with explicit spatial awareness without any training. Empirical results demonstrate that STORM achieves state-of-the-art pruning accuracy across diverse vision Mamba backbones under training-free settings. Notably, STORM delivers a substantial accuracy recovery on VMamba, outperforming prior methods by up to 63.3% in top-1 accuracy. Meanwhile, STORM incurs only a 1.0% accuracy drop on PlainMamba, achieving performance comparable to ViT.
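
One simple way to reduce tokens while keeping a valid 2D grid is to merge each non-overlapping window of neighbouring tokens into one. The sketch below is a stand-in for that idea only: the 2x2 averaging rule and the function name are assumptions, and STORM's actual reduction rule is not described in the abstract. The contrast is with score-based pruning that drops arbitrary tokens and leaves a ragged layout that a 2D scan can no longer traverse row by row.

```python
def merge_windows(tokens, height, width, window=2):
    """Average non-overlapping window x window groups of a row-major grid.

    tokens: list of feature vectors, length height * width.
    Returns (merged tokens, new height, new width); the output is again
    a complete row-major grid, so neighbours stay neighbours.
    """
    assert len(tokens) == height * width
    assert height % window == 0 and width % window == 0
    dim = len(tokens[0])
    merged = []
    for r in range(0, height, window):
        for c in range(0, width, window):
            group = [tokens[(r + dr) * width + (c + dc)]
                     for dr in range(window) for dc in range(window)]
            merged.append([sum(t[d] for t in group) / len(group)
                           for d in range(dim)])
    return merged, height // window, width // window

# A 4x4 grid of one-dimensional tokens numbered 0..15.
grid = [[float(i)] for i in range(16)]
merged, new_h, new_w = merge_windows(grid, 4, 4)
```

The 16 tokens become 4, and the result can still be scanned as a 2x2 grid in any of the usual directions.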

2606.19934 2026-06-19 cs.CV cs.AI 新提交

Speeding up the annotation process in semantic segmentation industrial applications

加速工业应用中的语义分割标注过程

Marta Fernandez-Moreno, Margarita Guerrero, Rosalia Rementeria, Pablo Mesejo, Raul Moreno

发表机构 * Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence, DaSCI, University of Granada（格拉纳达大学计算机科学与人工智能系，安达卢西亚数据科学与计算智能研究所，DaSCI）； Department of Computer Science and Automatic Control, National Distance Education University (UNED)（国立远程教育大学计算机科学与自动控制系）

AI总结本文利用无监督算法将材料科学中语义分割的标注时间从170小时降至37小时（减少78%），并发布了最大的公开钢微观结构分割数据集。

AI中文摘要

当前的机器学习模型通常需要大量且标注良好的数据集。然而，标注过程常常成为瓶颈，随着复杂性的增加，人为错误的机会也更高。在此背景下，本文旨在利用无监督算法提高工业材料科学中复杂语义分割问题的数据标注效率。以往的研究量化了标注时间，并探索了无监督方法。但据我们所知，这是首次量化无监督算法加速标注过程程度的研究。我们旨在验证这一繁琐过程可以加速的程度，重点关注涉及高分辨率图像每个像素标注的语义分割任务，例如材料科学中的微观结构表征挑战。具体来说，我们证明通过使用无监督计算机视觉算法，标注过程所需的时间可以从170小时减少到37小时，实现了约78%的减少。我们处理的数据集包括尺寸为1280x959和960x703的大图像，这进一步增加了标注任务的复杂性。尽管存在这些挑战，我们创建并共享了迄今为止最大的公开钢微观结构分割数据集，在MIT许可下提供，并具有永久DOI，为该领域贡献了一个完全标注的高分辨率数据集。此外，这是首次将从头开始标注的时间（以往研究中的常见方法）与使用这些无监督算法作为预标注步骤时的标注时间进行比较。此外，我们提供了一个在此数据集上训练的深度学习模型，该模型经过领域专家验证，并部署在工业环境中，作为该公共数据集的初始基准。

英文摘要

Current machine learning models commonly require large and well-annotated datasets. However, the annotation process often becomes a bottleneck, with increased complexity leading to higher chances of human errors. Within this context, our goal in this paper is to leverage unsupervised algorithms to improve data annotation efficiency for complex semantic segmentation problems in industrial materials science. Some previous studies have quantified labeling time, and others have explored unsupervised methods. However, to the best of our knowledge, this is the first study to quantify how much unsupervised algorithms accelerate the labeling process. We aim to validate the extent to which this laborious process can be accelerated, focusing on semantic segmentation tasks that involve annotating each pixel of high-resolution images, such as the microstructure characterization challenge in materials science. Specifically, we demonstrate that by using unsupervised computer vision algorithms, the time required for the labeling process can be reduced from 170 hours to 37 hours, achieving an approximate reduction of 78%. The dataset we work with includes large images of dimensions 1280x959 and 960x703, which further increases the complexity of the annotation task. Despite these challenges, we create and share the largest public steel microstructure segmentation dataset to date, available under MIT License with permanent DOI, contributing a fully annotated, high-resolution dataset to the field. Additionally, this is the first work to compare the labeling time from scratch (a common approach in previous studies) to the labeling time when using these unsupervised algorithms as a pre-annotation step. Furthermore, we provide a Deep Learning model trained on this dataset, validated by field experts, and deployed in an industrial setting, serving as an initial benchmark for this public dataset.

2606.19965 2026-06-19 cs.CV cs.AI 新提交

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE：多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

发表机构 * Sun Yat-sen University（中山大学）； Shaanxi Normal University（陕西师范大学）

AI总结提出ROSE基准，通过固定视觉场景并变化区域约束与符号输出，测试多模态大模型在不同上下文中将相同视觉证据转化为所需行动的能力，发现模型性能下降高达44.5个百分点，揭示感知到行动的瓶颈。

Comments 29 pages, 11 figures

AI中文摘要

多模态大语言模型（MLLMs）越来越被期望基于视觉信息采取行动，然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动？为了回答这个问题，我们引入了ROSE（Reference-conditioned Oddity and Symbolic Execution），一个受控基准，它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务，ROSE测试模型是否能够推断出隐含的多数参考，并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中，从计数导向任务到区域条件行动的性能下降高达44.5个百分点，而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在，即使同一模型在这些场景和区域上返回正确的计数，而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失，揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

英文摘要

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce ROSE (Reference-conditioned Oddity and Symbolic Execution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, ROSE tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.20095 2026-06-19 cs.CV 新提交

Stitching and dimensionality effects on large artificially generated volume datasets

拼接和维度对大规模人工生成体数据集的影响

Lucas von Chamier, Jan Philipp Albrecht, Dagmar Kainmüller

发表机构 * GFZ Helmholtz-Zentrum für Geoforschung（亥姆霍兹地球科学中心）； Max Delbrück Center for Molecular Medicine in the Helmholtz Association（亥姆霍兹协会马克斯·德尔布吕克分子医学中心）； Helmholtz Imaging（亥姆霍兹成像）； Humboldt-Universität zu Berlin（柏林洪堡大学）； University of Potsdam（波茨坦大学）

AI总结研究深度学习生成大图像时的拼接伪影对风格迁移的影响，比较2D与3D模型，发现FID无法检测影响下游任务的细微伪影，3D模型略优但计算成本高。

AI中文摘要

通过深度学习生成大图像需要对输入数据进行分块以适应硬件内存限制，然后组装输出块，这一过程在相邻块边界不对齐时可能引入拼接伪影。虽然已知这些伪影会影响分割任务，但它们对风格迁移生成模型的影响尚不清楚。我们使用在冷冻电镜数据集上训练的cycleGAN模型，研究了三种拼接方法和两种块维度（2D vs 3D）。我们评估了感知质量和下游线粒体分割的性能。主要发现如下：（1）FID分数无法检测到显著影响下游分割性能的细微拼接伪影；（2）具有无伪影拼接的3D模型在下游任务上略优于2D模型，尽管改进勉强证明计算成本合理；（3）2D模型由于更大的批量大小而训练更稳定。此外，我们证明从三个正交方向集成预测可以改善低质量体，但对高质量输出无益。这些结果表明，在大型科学数据集上最大化生成模型性能需要仔细考虑和减轻拼接伪影，并且仅凭感知指标不足以评估生物医学成像中的域适应质量。

英文摘要

Generating large images via deep learning requires patching input data to accommodate hardware memory limitations, then assembling output patches, a process that can introduce stitching artifacts when neighboring patches do not align at borders. While these artifacts are known to affect segmentation tasks, their impact on generative models for style-transfer remains poorly understood. We investigated three stitching approaches and two patch dimensionalities (2D vs 3D) using cycleGAN models trained on cryo-electron microscopy datasets. We evaluated both perceptual quality and performance on downstream mitochondria segmentation. Our key findings reveal that: (1) FID scores fail to detect subtle stitching artifacts that significantly impact downstream segmentation performance, (2) 3D models with artifact-free stitching marginally outperform 2D models on downstream tasks, though the improvement barely justifies the computational cost, and (3) 2D models train more stably due to larger batch sizes. Additionally, we demonstrate that ensembling predictions from three orthogonal directions can improve low-quality volumes but provides no benefit for high-quality outputs. These results demonstrate that maximizing generative model performance on large scientific datasets requires careful consideration and mitigation of stitching artifacts, and that perceptual metrics alone are insufficient for evaluating domain adaptation quality in biomedical imaging.
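
The patch-then-assemble workflow can be sketched in one dimension. The code below tiles a signal into overlapping patches and blends them back with triangular weights, a common way to soften seams. It is an illustrative assumption, not one of the three stitching approaches compared in the paper, and all names and sizes are made up.

```python
def tile(signal, size, stride):
    """Cut overlapping patches; returns (patches, start offsets)."""
    starts = list(range(0, len(signal) - size + 1, stride))
    return [signal[s:s + size] for s in starts], starts

def stitch_1d(patches, starts, length):
    """Blend overlapping patches with weights that peak at each centre."""
    acc = [0.0] * length
    weight_sum = [0.0] * length
    for patch, start in zip(patches, starts):
        n = len(patch)
        for i, value in enumerate(patch):
            w = min(i + 1, n - i)          # 1 at the edges, largest in the middle
            acc[start + i] += w * value
            weight_sum[start + i] += w
    return [a / w for a, w in zip(acc, weight_sum)]

# Unmodified patches reassemble into the original signal.
signal = [float(i) for i in range(10)]
patches, starts = tile(signal, size=4, stride=2)
restored = stitch_1d(patches, starts, len(signal))

# Two patches that disagree (0 versus 1) blend into a ramp, not a hard step.
ramp = stitch_1d([[0.0] * 4, [1.0] * 4], [0, 2], 6)
```

A hard cut between the two disagreeing patches would jump from 0 to 1 at the border, whereas the blended result passes through 1/3 and 2/3. The seam is smoother, but the disagreement between patches is hidden rather than removed.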

2606.20100 2026-06-19 cs.CV New submission

GEN-Guard: Correcting Generalization Failures in Deployable Federated Surgical AI

Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy

Affiliations * University of Strasbourg, CNRS, INSERM, ICube, UMR7357; Bioimage Analysis Center, Fondazione Policlinico Universitario Agostino Gemelli IRCCS; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, University of Milan; Monaldi Hospital, AORN dei Colli

AI summary: Proposes the GEN-Guard framework, which detects performance leakage through Client-Blocked Evaluation and applies feature-level correction through Disagreement-Aware Distillation, improving the cross-institution generalization of federated surgical AI.

Journal ref Int J Comput Assist Radiol Surg. 2026 Jun 14

DOI: 10.1007/s11548-026-03713-0

Abstract

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.

2606.20455 2026-06-19 cs.CV New submission

PCFootprint: A Large-Scale Dataset and Benchmark for Vectorized Building Footprint Extraction from Aerial LiDAR Point Clouds

Haoyuan Shen, Kuihao Wang, Ruisheng Wang, Yujun Liu

Affiliations * School of Architecture and Urban Planning, Shenzhen University

AI summary: Presents PCFootprint, the first large-scale dataset for building footprint extraction from airborne laser scanning point clouds, with 33,000 tiles and a cross-domain test set; benchmarking mainstream methods reveals the challenges posed by complex geospatial environments.

Comments 14 pages, 9 figures

Abstract

Building footprint extraction is a fundamental task in photogrammetry, remote sensing, and computer vision. Recent image-based methods have achieved remarkable progress in extracting vectorized footprints from high-resolution optical imagery. However, optical imagery is inherently susceptible to occlusions, perspective distortions, and residual relief displacement, yielding incomplete or misaligned footprint extraction. Furthermore, the lack of explicit elevation information limits its direct applicability to Level of Detail building modeling. In this paper, we present PCFootprint, the first large-scale public dataset for footprint extraction from airborne laser scanning point clouds. PCFootprint comprises 33,000 tiles derived from the Estonian Land and Spatial Development Board, covering diverse urban and rural landscapes. Each tile spans 128 m x 128 m, with vectorized footprints systematically aligned to the point clouds. The dataset includes a 3,000-tile cross-domain test set for evaluating generalization across geographic regions. We establish comprehensive benchmarks by evaluating mainstream methods. Experimental results reveal significant challenges including high intra-class variance, data imbalance, and noise across complex geospatial environments. We believe PCFootprint will advance future research in building modeling, urban scene understanding, and geospatial analysis. The PCFootprint dataset is publicly available at https://huggingface.co/datasets/Haoyuan-Shen/PCFootprint.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB New submission

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

Affiliations * DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay; DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay; Hugging Face

AI summary: To address the scarcity of data aligning high-resolution SAR with optical imagery and text, builds a dataset of SAR-optical-text triplets on an 80 cm slant-range grid from Umbra SLC data, supporting cross-modal retrieval and generation tasks.

Abstract

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR-optical datasets largely rely on low-resolution, intensity-only Ground Range Detected (GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR-optical-text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20 cm to 2 m native resolution), we standardize all SAR data to an 80 cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision-language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.

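2606.20536 2026-06-19 cs.CV New submission

The FID Lottery: Quantifying Hidden Randomness in Generative-Model Evaluation

Nicolas Dufour, Alexei A. Efros, Patrick Pérez

Affiliations * Kyutai; UC Berkeley

AI summary: Studies the variance of FID, treated as a random variable over training and generation seeds, and finds that retraining moves FID more than resampling does; recommends a new evaluation protocol that evaluates under per-cell optimal guidance and reports error bars over several training seeds.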

Comments Website: https://kyutai.org/fid-lottery

Abstract

The Frechet Inception Distance (FID) is the de facto arbiter of image generation, yet most papers report just a single number from a single trained model using a single sampling seed. How reproducible is that number if we retrain the model, or merely resample from it? In this paper, we treat FID as a random variable on a two-axis panel of training and generation seeds, and measure its variance directly on several hundred SiT networks trained on class-conditional ImageNet 256x256. We report surprising findings: (a) Retraining the model using the same recipe with a different seed moves FID 3.2x more (in Inception feature space) than redrawing samples from a fixed network. (b) That gap is driven by three factors: random initialisation, data ordering, and the per-step Gaussian noise of the flow-matching loss. (c) Increasing compute or model size barely tightens the spread, holding the FID coefficient of variation (CoV) inside a 1-2% band. (d) Per-cell classifier-free-guidance tuning halves the spread but reshuffles which seeds work best, and a lucky training seed reaches the same FID with up to 2x less compute than an unlucky one. Based on these findings, we recommend a new FID evaluation protocol: evaluate under per-cell optimal guidance, treat any FID gap below the empirically measured ~1.3% CoV as inconclusive, and report an error bar over several training seeds rather than a single FID number.
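The recommended protocol can be illustrated with a minimal Python sketch. The FID values below are hypothetical placeholders, not results from the paper, and comparing the relative gap between two FIDs against the CoV is an assumption about how the threshold would be applied in practice.

```python
from statistics import mean, stdev


def coefficient_of_variation(fids):
    """Sample standard deviation of the FIDs divided by their mean."""
    return stdev(fids) / mean(fids)


def gap_is_inconclusive(fid_a, fid_b, cov):
    """True when the relative FID gap is smaller than the seed-to-seed CoV.

    Assumption: the gap is measured relative to the better (lower) FID.
    """
    reference = min(fid_a, fid_b)
    return abs(fid_a - fid_b) / reference < cov


# Hypothetical FIDs from retraining one recipe with five training seeds.
seed_fids = [2.10, 2.14, 2.08, 2.12, 2.11]
cov = coefficient_of_variation(seed_fids)
print(f"mean={mean(seed_fids):.3f} std={stdev(seed_fids):.3f} CoV={cov:.2%}")

# Compare two hypothetical methods against the ~1.3% CoV quoted in the abstract.
print(gap_is_inconclusive(2.10, 2.12, cov=0.013))
```

Reporting the mean and standard deviation over seeds, as printed above, is one simple way to supply the error bar the abstract asks for.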

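2606.20542 2026-06-19 cs.CV New submission

Efficiently Bridging Real-World Scenes and Synthetic Data Generation for AI-Based Cognitive Robotics and Computer Vision Applications

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

Affiliations * Fraunhofer IPK; TU Berlin

AI summary: Discusses the limitations of current AI vision models in cognitive robotics applications and proposes bridging the domain gap by linking simulation with real-world training data generation.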

Comments Accepted and best paper award at MHI-Kolloquium 2024

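2411.10077 2026-06-19 cs.CV Updated version

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

Jiwoong Yang, Haejun Chung, Ikbeom Jang

Affiliations * Hanyang University; Hankuk University of Foreign Studies

AI summary: Proposes a novel multi-view uncertainty-weighted mutual distillation method that improves prediction consistency through hierarchical mutual distillation, making effective use of each view's information and reducing the influence of uncertain predictions.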

Journal ref Pattern Recognition 178 (2026) 113432

DOI: 10.1016/j.patcog.2026.113432

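AI Chinese abstract (translated)

Multi-view learning often struggles to make effective use of images captured from different angles and positions, especially when handling inconsistency and uncertainty across views. This paper proposes a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method. The method enhances prediction consistency by performing hierarchical mutual distillation over all possible view combinations, including single-view, partial multi-view, and full multi-view predictions. It introduces an uncertainty-based weighting mechanism that, through mutual distillation, makes effective use of the information unique to each view while reducing the influence of uncertain predictions. We extend a CNN-Transformer hybrid architecture to support robust feature learning and integration across multiple view combinations. We run extensive experiments on a large-scale unstructured dataset captured from diverse, non-fixed viewpoints. The results show that MV-UWMD improves prediction accuracy and consistency over existing multi-view learning methods.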

Abstract

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.
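As a rough illustration of what "all possible view combinations" means, the Python sketch below enumerates every non-empty subset of a set of views (single view, partial multi-view, full multi-view) and randomly samples a few of them per iteration, loosely mirroring the light version described above. The function names and view labels are hypothetical, and the distillation and uncertainty weighting themselves are not reproduced here.

```python
import random
from itertools import combinations


def all_view_combinations(views):
    """Every non-empty subset of views, ordered from single views to the full set."""
    return [
        combo
        for size in range(1, len(views) + 1)
        for combo in combinations(views, size)
    ]


def sample_view_combinations(views, n_samples, rng):
    """Randomly pick up to n_samples combinations for one training iteration."""
    combos = all_view_combinations(views)
    return rng.sample(combos, min(n_samples, len(combos)))


views = ["front", "side", "top"]  # hypothetical view labels
for combo in all_view_combinations(views):
    print(combo)

print(sample_view_combinations(views, n_samples=3, rng=random.Random(0)))
```

With n views there are 2^n - 1 non-empty combinations, which is why sampling a subset per iteration can reduce training cost as the number of views grows.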

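2512.24592 2026-06-19 cs.CV Updated version

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

Affiliations * Apple

AI summary: Proposes the GH-ESD framework, which generates hypotheses with an LLM and verifies them with vision-language models to automatically discover spatially grounded, relational error slices in instance-level tasks; also builds the GESD benchmark and improves error slice discovery precision on detection and segmentation tasks.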

Comments Accepted by ECCV2026

Abstract

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate and verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.
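For readers unfamiliar with the reported metric, the following Python sketch shows the standard definition of Precision@k for a ranked list of discovered slices. The slice names are hypothetical, and the abstract does not state how a discovered slice is matched to an expert-defined one, so exact set membership is used here as a stand-in.

```python
def precision_at_k(ranked_slices, valid_slices, k=10):
    """Fraction of the top-k ranked slices that count as valid error slices."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_slices[:k] if item in valid_slices)
    return hits / k


# Hypothetical ranking of ten discovered slices, seven of which are valid.
ranked = [f"slice_{i}" for i in range(10)]
valid = {"slice_0", "slice_1", "slice_2", "slice_4", "slice_5", "slice_7", "slice_9"}
print(precision_at_k(ranked, valid, k=10))
```

A benchmark-level score such as the 0.73 quoted above would presumably average this quantity over many models or settings; that aggregation is not shown.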

2604.13240 2026-06-19 cs.CV cs.LG Updated version

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations * Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

AI summary: Proposes the first concept-based explainable AI approach for species distribution models: builds a landscape concept dataset from high-resolution multispectral and LiDAR drone imagery and uses Robust TCAV to quantify how landscape concepts influence model predictions, with case studies supporting the approach.

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

2604.13416 2026-06-19 cs.CV cs.AI Updated version

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

Affiliations * University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To fill the lack of a large-scale real-world dataset for distractor-free radiance fields, builds DF3DV-1K, with 1,048 scenes that each provide clean and cluttered image sets, and uses it to benchmark nine recent methods, identifying the most robust methods and the most challenging scenarios.

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

2605.10873 2026-06-19 cs.CV cs.AI Updated version

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

Affiliations * Massachusetts Institute of Technology

AI summary: Introduces CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families; evaluating 11 vision-language models reveals three recurring failure modes in CAD program generation.

Abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.

2606.10136 2026-06-19 cs.CV Updated version

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

AI summary: Proposes the iSAGE framework, in which experts click on pixels where the model is wrong rather than on arbitrary pixels, matching dense supervision without auxiliary machinery; on BsB Aerial and ISPRS Vaihingen it reaches performance comparable to dense supervision at very low annotation rates.

Comments 47 pages, 8 tables, 6 figures

Abstract

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1-100x, pseudo-labels across thresholds 0.90-0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.
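The recovery figures quoted above can be checked with a short Python calculation. This is a sketch under the assumption that "recovers 97.2% of dense supervision" means the ratio of sparse-supervision mIoU to dense-supervision mIoU; the abstract does not give the BsB Aerial dense baseline, so the value derived below is implied rather than reported.

```python
def recovery_fraction(sparse_miou, dense_miou):
    """Share of the dense-supervision mIoU reached with sparse supervision."""
    return sparse_miou / dense_miou


def implied_dense_miou(sparse_miou, recovery):
    """Dense baseline implied by a sparse mIoU and a stated recovery fraction."""
    return sparse_miou / recovery


# ISPRS Vaihingen: both numbers are given in the abstract.
print(f"Vaihingen recovery: {recovery_fraction(76.78, 76.65):.2%}")

# BsB Aerial: only the sparse mIoU and the recovery fraction are given.
print(f"BsB Aerial implied dense mIoU: {implied_dense_miou(74.79, 0.972):.2f}%")
```

Under this reading, the Vaihingen result slightly exceeds its dense baseline, which is consistent with the abstract's wording that it matches dense supervision.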

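2507.23534 2026-06-19 cs.LG cs.CV Updated version

Continual Learning with Support Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

Affiliations * National Taiwan University

AI summary: Proposes the Experience Blending framework, which generates support boundary data with differential-privacy-inspired noise and trains jointly on exemplars and boundary data to regularize decision boundaries, improving continual learning accuracy on several datasets.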

Abstract

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained on sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.

2606.19835 2026-06-19 cs.CV New submission

Neural Events: Discrete Asynchronous Autoencoders for Event-Based Vision

Roberto Pellerito, Daniel Gehrig, Shintaro Shiba, Davide Scaramuzza

Affiliations * Robotics and Perception Group, University of Zurich; University of Pennsylvania; The University of Tokyo; Keio University

AI summary: Proposes re-tokenizing event streams into a small set of highly informative "neural events", each representing a local spatio-temporal context window with a discrete learnable code; on object detection and classification it matches or surpasses existing methods while reducing the event rate by a factor of 2.0.

Abstract

Event cameras capture dynamic scenes with exceptional temporal fidelity by representing them as a continuous stream of microsecond-resolution events. Each individual event, however, carries only minimal semantic value, merely signaling a localized brightness change. To derive meaningful signals, downstream algorithms need to quickly integrate cues from a potentially massive torrent of low-information events. Current architectures, however, are easily overwhelmed, struggling to balance capturing fine-grained temporal dynamics and maintaining a manageable data throughput. This paper proposes a framework to re-tokenize event streams into a small set of highly informative neural events, each representing a local spatio-temporal context window with a discrete learnable code. Every time this code flips, a neural event is triggered, yielding a highly compressed data stream. We demonstrate that, across object detection and classification, networks trained on neural events match or surpass the performance of state-of-the-art approaches while reducing the event rate by a factor of 2.0.
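The triggering rule ("every time this code flips, a neural event is triggered") can be sketched in a few lines of Python. This is only an illustration of the change-detection idea: the discrete codes are taken as given integers for a single context window, the autoencoder that would produce them is not modeled, and emitting an event for the very first observation is an assumption.

```python
def code_flip_events(timed_codes):
    """Emit (timestamp, code) whenever the discrete code of one window changes.

    timed_codes: iterable of (timestamp, code) pairs in time order.
    """
    events = []
    previous = None
    for timestamp, code in timed_codes:
        if code != previous:
            events.append((timestamp, code))
            previous = code
    return events


# Hypothetical code sequence for one spatio-temporal window.
stream = [(0, 3), (1, 3), (2, 5), (3, 5), (4, 3)]
print(code_flip_events(stream))
```

Because unchanged codes produce no output, the emitted stream is never longer than the input sequence, which is the source of the compression the abstract describes.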

2606.19383 2026-06-19 cs.RO cs.CV Cross-list

3D Scene Graphs: Open Challenges and Future Directions

Dennis Rotondi, Francesco Argenziano, Sebastian Koch, Nathan Hughes, Martin Buechner, Johanna Wald, Lukas Rosenberger Schmid, Daniele Nardi, Abhinav Valada, Liam Paull, Federico Tombari, Luca Carlone, Kai O. Arras

AI summary: Provides a unified survey of the construction, applications, and evaluation of 3D scene graphs (3DSGs), analyzing existing modeling choices and open challenges with the aim of advancing robust deployment.

Comments Invited article for the Annual Review of Control, Robotics, and Autonomous Systems Volume 10

Abstract

3D Scene Graphs (3DSGs) have emerged as a powerful representation for spatial AI by combining geometric grounding with semantic and relational abstractions of the environment. Their expressiveness has made them relevant to a broad range of problems in robotics and computer vision, including manipulation, navigation, task planning, scene understanding, and many others. However, the field remains fragmented: different communities adopt distinct formulations, construction pipelines, and evaluation protocols, making it difficult to compare methods, identify common assumptions, and assess remaining challenges for robust real-world deployment. This survey provides a unified and critical review of 3DSGs, with particular emphasis on open challenges and future directions. We first formalize 3DSGs under a common definition and analyze the principal modeling choices that characterize existing formulations, including node and edge attributes, hierarchical structure, dynamic scene representations, and affordance-aware extensions. We then review how 3DSGs are built from raw sensory observations, discussing the most common terminologies, conventions, and techniques. Finally, we examine downstream applications and evaluation strategies, from intrinsic graph quality to task-level performance. To support the community, we also provide a dedicated website that organizes and extends the surveyed content, accessible at https://3dscenegraphs.com/.

2606.20291 2026-06-19 cs.LG cs.CV cross-list

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

Affiliations: Vibrant Planet Public Benefit Corporation

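AI summary: Proposes the VibrantForests framework, which combines satellite imagery, lidar-derived samples, and computer vision to map forest attributes such as canopy cover, canopy height, and biomass across the contiguous United States at 10-meter resolution, reducing saturation and regression-to-mean effects.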

Abstract

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.
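The regression-to-mean behavior mentioned in the abstract (overestimation in sparse conditions, underestimation in dense ones) can be illustrated with a simple diagnostic: regress predictions on observations and inspect the slope. A slope below 1 with a positive intercept indicates predictions pulled toward the mean. The sketch below uses synthetic, hypothetical canopy-cover values and is not the authors' evaluation procedure.

```python
import numpy as np


def calibration_slope(observed, predicted):
    """Least-squares slope and intercept of predicted regressed on observed."""
    slope, intercept = np.polyfit(observed, predicted, deg=1)
    return slope, intercept


# Hypothetical canopy cover observations in percent.
observed = np.array([5.0, 20.0, 40.0, 60.0, 80.0])

# Synthetic predictions shrunk toward 40 percent to mimic regression to the mean.
shrunk = 40.0 + 0.6 * (observed - 40.0)

slope, intercept = calibration_slope(observed, shrunk)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

A model without this bias would give a slope near 1 and an intercept near 0 on held-out data.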

2606.20547 2026-06-19 cs.LG cs.CV cs.GR cs.RO math.DG cross-list

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

Przemyslaw Musialski

Affiliations: New Jersey Institute of Technology

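AI summary: Proposes Lie-Algebra Attention, which treats each token as a matrix Lie group element and uses the Lie-algebra norm of the relative pose as the attention score, with no learned kernel or representation-theoretic machinery; it applies to non-compact, non-abelian groups such as the affine full-frame groups.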

Comments preprint, 19 pages, 3 figures

Abstract

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
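The score formula in the abstract can be tried out numerically on SE(2), where the matrix logarithm has a closed form. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function names, the two block weights `lam_rot` and `lam_trans` (standing in for the block-weighted Frobenius norm), the principal-branch logarithm, and the row-wise softmax are all choices made here for the example.

```python
import numpy as np


def se2(theta, tx, ty):
    """Build a 3x3 homogeneous SE(2) matrix from an angle and a translation."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])


def se2_log(g):
    """Principal logarithm of an SE(2) matrix as (theta, u), u the translational block."""
    theta = np.arctan2(g[1, 0], g[0, 0])
    t = g[:2, 2]
    if abs(theta) < 1e-8:
        return theta, t.copy()          # V is close to the identity here
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta
    V = np.array([[a, -b], [b, a]])     # t = V @ u for the exponential map
    return theta, np.linalg.solve(V, t)


def lie_algebra_scores(tokens, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """s_ij = -||log(g_i^{-1} g_j)||^2 / tau with assumed block weights."""
    n = len(tokens)
    scores = np.zeros((n, n))
    for i in range(n):
        g_i_inv = np.linalg.inv(tokens[i])
        for j in range(n):
            theta, u = se2_log(g_i_inv @ tokens[j])
            # The rotation block theta*J has squared Frobenius norm 2*theta^2.
            sq_norm = lam_rot * 2.0 * theta**2 + lam_trans * float(u @ u)
            scores[i, j] = -sq_norm / tau
    return scores


def softmax_rows(scores):
    """Row-wise softmax turning scores into attention weights (an assumption)."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical tokens: three poses in the plane.
tokens = [se2(0.1, 0.0, 0.0), se2(0.5, 1.0, 2.0), se2(-0.8, -1.0, 0.5)]
scores = lie_algebra_scores(tokens)
weights = softmax_rows(scores)
```

Because the score depends only on $g_i^{-1} g_j$, multiplying every token on the left by the same group element leaves the score matrix unchanged, which is the invariance the abstract calls tautological.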

2603.07236 2026-06-19 cs.CV updated version

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

Affiliations: Tencent HY Team

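AI summary: Proposes the HY-WU framework, whose functional neural memory module generates instance-specific weight updates on the fly, avoiding the interference caused by overwriting shared weights and addressing catastrophic forgetting in continual learning and personalization.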

Abstract

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO updated version

Critique of World Model

Eric Xing, Mingkai Deng, Jinyu Hou

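AI summary: Starting from the notion of "hypothetical thinking" in psychology, this essay argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and proposes a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.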

Abstract

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2605.15231 2026-06-19 cs.LG cs.CV updated version

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

Affiliations: Dyson School of Design Engineering, Imperial College London; TUM School of Engineering and Design, Technical University of Munich; Faculty of Mechanical Engineering, OTH Regensburg; NVIDIA

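AI summary: Proposes Mask-Morph Graph U-Net, which uses feature-aligned barycentric parameterisation and node-masked pretraining to improve the generalisability and data efficiency of mesh-based simulation surrogates for crashworthiness design exploration.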

Comments 48 pages, 15 figures, journal paper under review

Abstract

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.
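Two ingredients named in the abstract are easy to picture in code: the mean Euclidean distance used for evaluation and the node masking applied during pretraining. The sketch below is a generic illustration of both. The function names, the constant mask value, and the uniform random selection of nodes are assumptions for this example; the abstract does not describe the paper's exact masking scheme.

```python
import numpy as np


def mean_euclidean_distance(pred, target):
    """Average per-node Euclidean distance between two (num_nodes, 3) arrays."""
    return float(np.linalg.norm(pred - target, axis=1).mean())


def mask_nodes(features, mask_ratio, rng, mask_value=0.0):
    """Replace the features of a random subset of nodes with mask_value.

    Returns the masked copy and the indices of the masked nodes.
    """
    num_nodes = features.shape[0]
    num_masked = int(round(mask_ratio * num_nodes))
    masked_idx = rng.choice(num_nodes, size=num_masked, replace=False)
    masked = features.copy()            # leave the input array untouched
    masked[masked_idx] = mask_value
    return masked, masked_idx


# Hypothetical toy data: four mesh nodes with predicted and reference positions.
pred = np.zeros((4, 3))
target = np.tile(np.array([3.0, 4.0, 0.0]), (4, 1))
med = mean_euclidean_distance(pred, target)

rng = np.random.default_rng(0)
features = np.ones((10, 2))
masked_features, masked_idx = mask_nodes(features, mask_ratio=0.3, rng=rng)
```

In a masked pretraining setup the network would be asked to predict the full field from `masked_features`, which is one way such a scheme can reduce reliance on any single node's input.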

2605.00569 2026-06-19 cs.CV cs.GR updated version

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

Affiliations: TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

2511.23071 2026-06-19 cs.CV cs.AI cs.CL updated version

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations: Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2603.27698 2026-06-19 cs.CV cs.DL updated version

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

Affiliations: Vesuvius Challenge, USA; Università degli Studi di Napoli Federico II, Italy; University of Glasgow, Scotland, UK; EduceLab, University of Kentucky, USA

Comments 9 pages, 3 figures, 2 tables. Currently under review

Journal ref Scientific Reports (2026)

2601.15119 2026-06-19 eess.IV cs.CV updated version

Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani, Md Sharif Mollah

Affiliations: Department of CSE, CCN University of Science & Technology; Department of EEE, International Islamic University Chittagong; Faculty of Engineering, Multimedia University; Department of CSE, Stamford University of Bangladesh; Department of Biology, Lucknow University; Department of CSE, Bangladesh Army International University of Science & Technology

2508.21190 2026-06-19 cs.CV updated version

Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

Affiliations: Linköping University; Ericsson Research

Journal ref 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

2507.23027 2026-06-19 cs.CV cs.AI updated version

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations: Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

1902.06202 2026-06-19 cs.CV cs.CG updated version

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

Affiliations: Michigan State University, Dept. of Computational Mathematics, Science and Engineering; Michigan State University, Dept. of Mathematics; Cooperative Institute for Marine and Atmospheric Studies, University of Miami; Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory; University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences

1. Multimodal and Vision-Language Models (18 papers)

PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

Multimodal Concept Bottleneck Models

SpatialSV: Internalizing Interpretable 3D Spatial Awareness in MLLMs via Task-Oriented Visual Supervision

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

The Hidden Evolution of Disguised Visual Context inside the VLM

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Vero: An Open RL Recipe for General Visual Reasoning

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

2. Embodied Intelligence, Robotics, and Autonomous Driving (17 papers)

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

FrozenDrive: Zero-Shot Text-Guided Driving Scene Generation and Data Augmentation with Parameter-Free Frozen Diffusion Model

HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Human Universal Grasping

Scaling Self-Play for End-to-End Driving

World Engine: Towards the Era of Post-Training for Autonomous Driving

Tri-Info: Generalizable, Interpretable Failure Prediction for VLA Models via Information Theory

Fast Human Attention Prediction for Fixation-guided Active Perception in Autonomous Navigation

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Class-Incremental Motion Forecasting

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

3. Image Recognition, Retrieval, and Classification (6 papers)

Exploring Multi-Modal Large Language Models and Two-Stage Fine-Tuning for Fashion Image Retrieval

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Evaluation of Image Matching for Art Skills Assessment

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Does Head Pose Correction Improve Biometric Facial Recognition?

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

4. Object Detection, Segmentation, and Localization (8 papers)

ReA-OVCD: Reliability-Aware Open-Vocabulary Change Detection via Semantic and Spatial Refinement

SAM3 Self-Distillation for Fine-Grained GOOSE 2D Semantic Segmentation

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

GenTrack: A New Generation of Multi-Object Tracking

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

5. Video Understanding and Temporal Vision (9 papers)

Vortex: Multi-Modal Fusion System for Intelligent Video Retrieval

NEST: Narrative Event Structures in Time for Long Video Understanding

ViCoStream: Streaming VideoLLMs Can Run Beyond 100 FPS with Stage-Wise Coordinated Inference

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

SA-VIS: Sparse frame Annotations for training Video Instance Segmentation

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

6. Generative Vision and World Models (26 papers)

LooseControlVideo: Directorial Video Control using Spatial Blocking

Learning When to Denoise: Optimizing Asynchronous Schedules for Latent Diffusion

TeleMorpher: Toward Robust Simultaneous Motion-Location Editing

One-Shot Novel View and Pose Human Image Synthesis via 3D Prior Guided Diffusion Model

SurgVista: Long-Horizon Surgical World Modeling with Plausible Instrument-Tissue Dynamics

SketchKeyAnime: Reference-anchored Sparse Key-Sketch Animation Synthesis

CrossFlow: One-Step Generation Across Latent and Pixel Spaces

Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers

Holo-World: Unified Camera, Object and Weather Control for Video World Model

MakeupMirror: Improving Facial Attribute Preservation in Diffusion Models for Makeup Transfer

Cinematic Compositing Using Character-Environment-Harmonized Video Generation Models

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

JanusMesh: Fast and Zero-Shot 3D Visual Illusion Generation via Cross-Space Denoising