2606.05833 2026-06-19 cs.CV cs.AI Version update

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

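Affiliations: University of California, Davis

AI summary: Proposes GeoVR, a framework that uses only 2D video sequences and distills 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features) from pre-trained 3D foundation models to reshape the internal representations of multimodal large language models and give them spatial intelligence, reaching state-of-the-art performance on spatial reasoning benchmarks.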

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.


2606.16615 2026-06-19 cs.CV Version update

Mem-World: A Memory-Augmented Action-Conditioned World Model for Persistent Robot Manipulation

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

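Affiliations: Dalian University of Technology; Samsung R&D Institute China-Beijing (SRCB)

AI summary: Proposes Mem-World, which uses W-VMem, a 4D wrist-view-centered surfel-indexed memory, to address scene forgetting caused by occlusion and camera motion during manipulation, enabling persistent world modeling and improving both policy evaluation and policy improvement.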

Abstract

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.
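
Illustrative sketch (not from the paper): the abstract measures policy-evaluation reliability by the Pearson correlation between success rates estimated in the world model and success rates measured on the real robot. The snippet below shows how such a correlation could be computed. The function name `pearson` and the four success-rate pairs are assumptions made up for illustration; they are not results from Mem-World.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical success rates for four policies (illustrative numbers only).
real_world = [0.20, 0.45, 0.60, 0.80]
world_model = [0.25, 0.40, 0.65, 0.75]
r = pearson(world_model, real_world)
```

A value of `r` close to 1 would mean that ranking policies inside the world model closely tracks their ranking on the real robot, which is the property the abstract's evaluation targets.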


2606.18112 2026-06-19 cs.RO cs.CV Version update

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

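Affiliations: Qwen Team

AI summary: Proposes Qwen-RobotNav, a scalable navigation model whose parameterised interface supports multiple task modes and adjustable observation parameters. Trained on 15.6M samples, with vision-language co-training to prevent behavioural collapse, it sets new state-of-the-art results on multiple navigation benchmarks and shows zero-shot generalisation.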

Abstract

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses this requirement through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomisation over all parameters, Qwen-RobotNav is robust to any inference-time configuration and requires no architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.


2508.04424 2026-06-19 cs.CV Version update

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

AI summary: Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval by combining a reference object, its mask, and a retrieval text, and introduces the COR125K benchmark and the CORE model, which significantly outperforms existing methods.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.


2512.03199 2026-06-19 cs.CV Version update

Does Head Pose Correction Improve Biometric Facial Recognition?

Justin Norman, Hany Farid

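Affiliations: University of California, Berkeley

AI summary: Studies how AI-driven head pose correction and image restoration affect facial recognition accuracy, finding that selectively applying CFR-GAN together with CodeFormer can improve recognition performance.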

2604.19196 2026-06-19 cs.CV Version update

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

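Affiliations: Graduate School of Information Sciences, Tohoku University, Japan

AI summary: Systematically evaluates 15 pre-trained vision models for domain-generalizable face anti-spoofing, finding that self-supervised ViTs (especially DINOv2 with Registers), combined with data augmentation and an attention-weighted patch loss, achieve the best results on the MICO protocol while remaining computationally efficient.

Comments: 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)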

Abstract

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .


2507.21460 2026-06-19 cs.CV Version update

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

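Affiliations: The Hong Kong University of Science and Technology

AI summary: Proposes BA-solver, in which a lightweight SideNet (1-2% of the backbone size) learns bidirectional temporal perception and supports bi-anchor velocity integration. Without retraining the backbone and at very low training cost, it reaches the quality of a 100+-step Euler solver within 10 steps and works plug-and-play.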

Abstract

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one-step or few-step generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: 1) Bidirectional Temporal Perception, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision "anchors" and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet 256×256 demonstrate that BA-solver achieves generation quality comparable to that of a 100+ NFE Euler solver in just 10 NFEs and maintains high fidelity with as few as 5 NFEs, while incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.
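
Illustrative sketch (not from the paper): the abstract compares against a fixed-step Euler solver and counts cost in Neural Function Evaluations (NFEs). The snippet below shows only that baseline, integrating dx/dt = v(x, t) from t = 0 to t = 1 and counting one NFE per velocity call. It does not implement BA-solver or its SideNet. The function `toy_velocity` is a hypothetical stand-in for a learned velocity network, chosen because its ODE has a closed-form solution.

```python
import math

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and the number of velocity evaluations (NFEs).
    """
    x, nfe = x0, 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one velocity call per step
        nfe += 1
    return x, nfe

def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity network.
    return -x

exact = math.exp(-1.0)  # closed-form solution of dx/dt = -x at t=1 for x0=1
x10, nfe10 = euler_sample(toy_velocity, 1.0, 10)
x100, nfe100 = euler_sample(toy_velocity, 1.0, 100)
err10, err100 = abs(x10 - exact), abs(x100 - exact)
```

On this toy problem the 10-step error is larger than the 100-step error, which mirrors the trade-off the abstract describes: plain Euler needs many evaluations to stay accurate, so a method that reaches similar quality at 10 NFEs saves most of the backbone calls.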


2602.15819 2026-06-19 cs.CV Version update

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

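Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities. It fuses RGB, RGBD, and point cloud features through collaborative multi-modal coding and combines auxiliary 2D/3D supervision with a triplane latent diffusion model to generate high-quality 3D assets.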

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets, despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.


2512.00850 2026-06-19 cs.CV Version update

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Haishan Wang, Mohammad Hassan Vali, Arno Solin

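Affiliations: ELLIS Institute Finland; Aalto University

AI summary: Proposes Smol-GS, which learns efficient splat features through octree-based position coding and entropy compression to obtain a compact representation for 3D Gaussian splatting, greatly reducing storage while preserving rendering quality.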

2602.23172 2026-06-19 cs.CV cs.AI cs.RO Version update

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

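Affiliations: University of Freiburg; Bosch Research; University of Haifa

AI summary: Proposes Latent Gaussian Splatting (LaGS), which uses feature-bearing Gaussians as dynamic keypoints to aggregate multi-view features for 4D panoptic occupancy tracking, achieving state-of-the-art performance on Occ3D nuScenes and Waymo.

Comments: Accepted to IEEE Robotics and Automation Letters (RA-L), 2026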

DOI: 10.1109/LRA.2026.3703990

Abstract

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.
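
Illustrative sketch (not from the paper): the abstract describes feature-bearing Gaussians whose features are aggregated with distance-based weights and then splatted into a voxel grid. The snippet below shows one simple way such a splat could look, assuming isotropic Gaussians and a normalized weighted average per voxel. The function name `splat_to_voxels`, the isotropic simplification, and the normalization are all assumptions for illustration and are not taken from LaGS.

```python
import math

def splat_to_voxels(means, sigmas, features, voxel_centers):
    """Distance-weighted aggregation of per-Gaussian features at voxel centers.

    Each voxel receives a weighted average of Gaussian features, with weight
    exp(-d^2 / (2 * sigma^2)) for an isotropic Gaussian (a simplification).
    """
    dim = len(features[0])
    out = []
    for c in voxel_centers:
        weights = []
        for mu, sigma in zip(means, sigmas):
            d2 = sum((ci - mi) ** 2 for ci, mi in zip(c, mu))
            weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
        total = sum(weights)
        if total == 0.0:
            # No Gaussian contributes numerically; leave the voxel empty.
            out.append([0.0] * dim)
            continue
        out.append([
            sum(w * f[k] for w, f in zip(weights, features)) / total
            for k in range(dim)
        ])
    return out

# Two hypothetical Gaussians with 2-D features, sampled at three voxels.
means = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
sigmas = [1.0, 1.0]
features = [[1.0, 0.0], [0.0, 1.0]]
voxels = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
grid = splat_to_voxels(means, sigmas, features, voxels)
```

In this toy setup the voxel midway between the two Gaussians receives an equal blend of both features, while voxels at a Gaussian's center are dominated by that Gaussian, which illustrates the spatially continuous weighting the abstract refers to.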


2606.15908 2026-06-19 cs.CV Version update

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

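Affiliations: Google XR; University of Science and Technology of China (USTC); The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes a template-free, marker-free multi-view system that initializes with a transformer aggregating cross-view geometric and temporal cues and then applies physics-aware Gaussian optimization, achieving robust, artifact-free reconstruction of 4D hand-object interaction.

Comments: Project page: https://hostpg.github.io/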

Abstract

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet estimating these poses from images is challenging, particularly under the severe occlusion inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.


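2606.15966 2026-06-19 cs.CV cs.GR Version update

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction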

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

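Affiliations: Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona

AI summary: Presents the MAMA-MIA Challenge, a standardized benchmark for evaluating breast MRI tumor segmentation and prediction of pathologic complete response. It analyzes model generalization and fairness on cross-continental, multi-center data and finds a trade-off between overall performance and subgroup fairness.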

Abstract

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.


2605.00665 2026-06-19 cs.CV Version update

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

Affiliations: J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida; University of Florida Research Computing; Meta AI (FAIR); School of Behavioral and Brain Sciences, University of Texas at Dallas; Dept. of Electrical and Computer Engineering, University of Florida; Dept. of Computer and Information Science and Engineering, University of Florida; Center for Cognitive Aging and Memory, University of Florida

AI summary: Uses deep learning to predict 12 Alzheimer's disease-related risk factors from color fundus photographs and characterizes the retinal structures behind the predictions, finding that regions such as the optic nerve head and retinal vasculature are associated with the risk factors and with preclinical Alzheimer's disease changes.

Comments: Accepted to the "Journal of Alzheimer's Disease" for publication

Abstract

Systemic, metabolic, and lifestyle factors have established associations with Alzheimer's disease (AD) through epidemiologic and AD-specific biomarker studies. Whether color fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. This study aimed to determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants in the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic blood pressure, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (on average 8.55 years before onset) and matched controls. DL performance ranged from AUROC = 0.5654 to 0.9480 for categorical factors and from R² = -0.0291 to 0.7620 for continuous factors, outperforming most of the morphometry-based machine learning models. Saliency-based scores consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature, and aligned with existing morphometric variations. Several saliency-based scores differed significantly between incident-AD cases and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful, risk-related structural changes that mirror potential AD vulnerability.
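
Illustrative sketch (not from the paper): the abstract reports AUROC for categorical factors and the coefficient of determination for continuous factors, with a lower bound that is slightly negative. The snippet below uses the standard definitions of both metrics to show why a negative coefficient of determination is possible: it happens when a model's squared error exceeds that of always predicting the mean. The function names and the small example arrays are assumptions for illustration and are unrelated to the UK Biobank data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [1.0, 2.0, 3.0, 4.0]
r2_mean = r2_score(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_reversed = r2_score(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here predicting the mean gives a score of exactly zero and the reversed predictor gives a negative score, so a value just below zero, as in the abstract's lower bound, indicates a factor that the model predicts about as well as a constant. An AUROC near 0.5 has the analogous reading for categorical factors.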


2606.14957 2026-06-19 cs.CV Version update

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

Affiliations: New York University, Center for Data Science; NYU Grossman School of Medicine, Department of Radiology; State University of New York at Binghamton, School of Computing; NYU Grossman School of Medicine, Department of Neurology; NYU Grossman School of Medicine, Department of Neurosurgery; NYU Grossman School of Medicine, Department of Pathology; School of Medicine, Department of Radiology, Stanford; NYU Grossman School of Medicine, Department of Neuroscience; NYU Grossman School of Medicine, Neuroscience Institute

AI summary: Proposes Neuro-JEPA, which combines a latent predictive objective with a Mixture-of-Experts architecture to learn unified representations of three MRI sequences (T1w, T2w, and FLAIR), outperforming existing foundation models and CNN baselines on 25 clinical tasks and 22 public-dataset tasks.

Comments: Under Review Preprint

Details

Abstract

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems (NYU Langone, NYU Long Island, and Massachusetts General Hospital) and 22 tasks from 12 public datasets, covering unimodal, multimodal, and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts, and controlled multimodal comparisons.

2405.10705 2026-06-19 eess.IV cs.CV Updated version

3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

Zhentao Liu, Huangxuan Zhao, Wenhui Qin, Zhenghong Zhou, Xinggang Wang, Wenping Wang, Xiaochun Lai, Chuansheng Zheng, Dinggang Shen, Zhiming Cui

Affiliations: School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China; National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China; School of Electronic Information and Communications, Huazhong University of Science; Department of Computer Science & Engineering, Texas A&M University, USA

AI summary: Proposes a vessel probability guided attenuation learning framework that performs sparse-view DSA reconstruction through complementary weighting of static and dynamic attenuation fields, reducing radiation dose, and uses progressive training and a temporal perturbation loss to improve quality.

Comments Accepted by Medical Image Analysis (MedIA), 2026

Details

DOI: 10.1016/j.media.2026.104088

Abstract

Digital Subtraction Angiography (DSA) is one of the gold standards for vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive blood flow information and can be utilized to reconstruct 3D vessel structures for medical assessment. Current commercial DSA systems typically require hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. In this study, we propose a neural rendering-based optimization framework tailored for high-quality sparse-view DSA reconstruction to reduce radiation dosage. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the time-independent vessel probability field. Functioning as a foreground mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism enables self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the discrepancy between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training for better geometry and (2) temporal perturbed rendering loss for temporal consistency. Experimental results have demonstrated high-quality 3D vessel reconstruction and 2D DSA image synthesis.
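
One way to read the "complementary weighted combination" described above is a per-point convex blend of the two attenuation fields. The following is a minimal sketch, assuming the vessel probability p weights the dynamic field and 1 - p weights the static field; the abstract does not give the exact formula, and the function and argument names are hypothetical.

```python
def blended_attenuation(vessel_prob, static_mu, dynamic_mu):
    """Convex blend: background follows the static field, vessels the dynamic one."""
    if not 0.0 <= vessel_prob <= 1.0:
        raise ValueError("vessel_prob must lie in [0, 1]")
    return (1.0 - vessel_prob) * static_mu + vessel_prob * dynamic_mu


# vessel_prob is time-independent; only dynamic_mu would change between frames.
print(blended_attenuation(0.0, 0.2, 0.9))  # pure background: 0.2
print(blended_attenuation(1.0, 0.2, 0.9))  # pure vessel: 0.9
```

Because the two weights sum to one, the gradient with respect to the static field scales with 1 - p and the gradient with respect to the dynamic field scales with p, which is consistent with the foreground-mask role the abstract assigns to the vessel probability.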

2503.23179 2026-06-19 eess.IV cs.CV Updated version

OncoReg: Medical Image Registration for Oncological Challenges

Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Tim Hable, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich

Affiliations: Institute of Medical Informatics, University of Lübeck; Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein; Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute; School of Computer Science, University of Birmingham; Division of Informatics, Imaging and Data Sciences, University of Manchester; DAMO Academy, Alibaba Group; Hangzhou Shengshi Technology Co., Ltd; Department of Radiation Oncology, University Hospital Schleswig-Holstein; EchoScout GmbH; Radboud University Medical Center, Nijmegen; Institute of Interventional Radiology, University Hospital Schleswig-Holstein

AI summary: Presents the OncoReg Challenge, which uses a two-phase framework to develop generalisable image registration methods while protecting patient privacy, targeting registration of cone-beam CT to fan-beam CT in radiotherapy; finds that feature extraction is key and that combining deep learning with classical methods is most effective.

Comments 21 pages, 13 figures

Details

Abstract

In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography with standard planning fan-beam CT images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods, particularly in feature extraction, proving most effective.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV Updated version

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Syed Mujtaba Haider, Silvia Figini

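Affiliations: Department of Mathematics; Department of Political and Social Sciences

AI summary: Uses a controlled benchmark to compare quantum and classical generators for brain MRI data augmentation, finding that neither significantly outperforms training on real data alone and that the quantum generator offers no additional advantage.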

Details

Abstract

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion: synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse than its classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.
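
The evaluation protocol above (eight seeds, paired significance testing, multiple-comparison correction) can be illustrated with a small sketch. The abstract does not name the test or the correction, so the exact sign-flip permutation test and the Holm step-down adjustment below are assumptions chosen because they need only the standard library; the accuracy values are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            count += 1
    return count / total


def holm_adjust(pvalues):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical per-seed accuracies (percent) for one labeled-data fraction.
augmented = [81, 83, 82, 80, 84, 82, 83, 81]
real_only = [80, 84, 80, 82, 83, 83, 83, 81]
print(paired_sign_flip_pvalue(augmented, real_only))  # 1.0
```

With eight seeds the test enumerates only 2^8 = 256 sign patterns, so the smallest attainable two-sided p-value is 2/256. One p-value per data fraction would then be passed through `holm_adjust` before any significance claim is made.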

2602.01391 2026-06-19 cs.CV Updated version

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

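Affiliations: UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands; The University of Chicago, Chicago, USA; Johns Hopkins University, Baltimore, USA

AI summary: Proposes Augmented Latent Intrinsics (ALI), which fuses dense, pixel-aligned visual features into a latent-intrinsic relighting model, balancing semantic and photometric fidelity and improving relighting quality on complex materials.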

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

Details

Abstract

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic-photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

2511.04260 2026-06-19 cs.CV cs.AI Updated version

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

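Affiliations: Department of Mathematics and Computer Science; University of Catania

AI summary: Proposes Proto-LeakNet, which exploits signal-leak traces in diffusion models and combines closed-set classification with density-based open-set evaluation for interpretable generator attribution; trained only on closed-set data, it remains effective on unseen generators.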

Comments 44 pages, 27 figures, 11 tables

Details

DOI: 10.1016/j.cviu.2026.104848

Abstract

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates closed-set classification with a density-based open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed-set data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2510.27285 2026-06-19 cs.CV cs.CR Updated version

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Yue Ming, Xueming Li, Yue Zhang

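Affiliations: Beijing University of Posts and Telecommunications; Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University; University of Chinese Academy of Sciences; Nanjing University of Aeronautics and Astronautics

AI summary: Addresses the problem that adversarial training for concept erasure in diffusion models ignores concept semantics and therefore fits the concept space poorly; proposes S-GRACE, a semantics-guided robust adversarial concept erasure method that improves erasure performance by 26% and cuts training time by 90%.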

Details

Abstract

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which gracefully leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

2605.07821 2026-06-19 cs.CV cs.AI Updated version

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

Boyang Dai, Chaoqi Chen, Yizhou Yu

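Affiliations: The University of Hong Kong; Shenzhen University; Shenzhen Loop Area Institute

AI summary: Proposes an OOD detection framework based on object co-occurrence that distinguishes near-OOD samples through disentangled representations and a divide-and-conquer strategy, mitigating simplicity bias and achieving competitive results across multiple settings.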

Comments This paper has been accepted by CVPR2026

Details

Abstract

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2502.03227 2026-06-19 cs.LG cs.CV Updated version

Adversarial Dependence Minimization

Pierre-François De Plaen, Tinne Tuytelaars, Marc Proesmans, Luc Van Gool

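Affiliations: CVL, ETH Zürich, Switzerland; INSAIT, Sofia University, Bulgaria

AI summary: Proposes the ADM algorithm, which minimizes statistical dependence between feature dimensions through an adversarial game, proves that mutual independence is reached at the global optimum, and applies it to nonlinear decorrelation, improved generalization in image classification, and preventing dimensional collapse in self-supervised learning.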

2411.10077 2026-06-19 cs.CV Updated version

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

Jiwoong Yang, Haejun Chung, Ikbeom Jang

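Affiliations: Hanyang University; Hankuk University of Foreign Studies

AI summary: Proposes a novel multi-view uncertainty-weighted mutual distillation method that improves prediction consistency through hierarchical mutual distillation, making effective use of each view's information and mitigating the impact of uncertain predictions.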

Journal ref Pattern Recognition 178 (2026) 113432

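Details

DOI: 10.1016/j.patcog.2026.113432

AI abstract (translated from Chinese)

Multi-view learning often struggles to make effective use of images captured from different angles and positions, especially when dealing with inconsistency and uncertainty across views. This paper proposes a novel Multi-View Uncertainty-Weighted Mutual Distillation (MV-UWMD) method. Our method enhances prediction consistency by performing hierarchical mutual distillation across all possible view combinations, including single-view, partial multi-view, and full multi-view predictions. It introduces an uncertainty-based weighting mechanism that uses mutual distillation to exploit the unique information in each view while reducing the influence of uncertain predictions. We extend a hybrid CNN-Transformer architecture to support robust feature learning and integration across multiple view combinations. We run extensive experiments on a large-scale unstructured dataset captured from diverse, non-fixed viewpoints. The results show that MV-UWMD improves prediction accuracy and consistency over existing multi-view learning methods.

Abstract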

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.
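
The phrase "all possible view combinations" above has a concrete size: V views give 2^V - 1 non-empty subsets. The following is a minimal sketch of enumerating them and of a confidence-weighted fusion of their predictions. Weighting each prediction by exp(-entropy) is an assumption made for illustration, not the paper's stated mechanism, and the distillation loss itself is not shown; all names are hypothetical.

```python
from itertools import combinations
import math


def all_view_combinations(num_views):
    """Every non-empty subset of view indices: single, partial, and full."""
    views = range(num_views)
    return [c for k in range(1, num_views + 1) for c in combinations(views, k)]


def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_weighted_fusion(predictions):
    """Average probability vectors, down-weighting high-entropy predictions."""
    weights = [math.exp(-entropy(p)) for p in predictions]
    total = sum(weights)
    num_classes = len(predictions[0])
    return [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(num_classes)
    ]


print(len(all_view_combinations(3)))  # 7
```

The exponential growth in combinations is presumably why the abstract also mentions a light version that samples a random subset of combinations per training iteration.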

2512.24592 2026-06-19 cs.CV Updated version

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

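Affiliations: Apple

AI summary: Proposes the GH-ESD framework, which automatically discovers spatial-relationship error slices in instance-level tasks through LLM hypothesis generation and vision-language model verification, and builds the GESD benchmark, substantially improving error slice discovery precision on detection and segmentation tasks.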

Comments Accepted by ECCV2026

Details

Abstract

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate-and-verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.
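
The Precision@10 figure quoted above can be made concrete with a short sketch. It assumes that a predicted slice counts as correct when its identifier appears in the set of expert-defined slices and that the denominator is always k; how GESD actually matches predicted slices to ground truth is not described in the abstract, so the matching rule and all names here are hypothetical.

```python
def precision_at_k(ranked_slices, true_slices, k=10):
    """Fraction of the top-k ranked slices that are true error slices."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_slices[:k]
    hits = sum(1 for s in top_k if s in true_slices)
    # Dividing by k (not len(top_k)) penalizes methods that return fewer than k slices.
    return hits / k


ranked = [f"slice_{i}" for i in range(10)]
expert = {f"slice_{i}" for i in range(7)}
print(precision_at_k(ranked, expert))  # 0.7
```

On this reading, the reported gain of 0.10 means that, on average, one more of the ten top-ranked slices is a true error slice.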

2604.13240 2026-06-19 cs.CV cs.LG Updated version

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Affiliations: Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554; LTSER Zone Atelier Armorique; University of Würzburg, Center for Artificial Intelligence and Data Science

AI summary: Proposes the first concept-based explainable AI approach for species distribution models, builds a landscape concept dataset from high-resolution multispectral and LiDAR drone imagery, and quantifies the influence of landscape concepts on model predictions with Robust TCAV; a case study validates the approach.

Details

Abstract

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

2604.13416 2026-06-19 cs.CV cs.AI Updated version

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

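Affiliations: University of Technology Sydney; University of Sydney; National Yang Ming Chiao Tung University

AI summary: To fill the lack of large-scale real-world datasets for distractor-free radiance fields, builds DF3DV-1K, a dataset of 1,048 scenes with clean and cluttered image sets per scene, and benchmarks nine recent methods on it, identifying the most robust methods and the most challenging scenarios.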

Details

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.
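
To put the reported 0.96 dB PSNR gain in context, the sketch below computes PSNR under the standard definition 10 * log10(MAX^2 / MSE). It assumes pixel values normalized to [0, 1] and images given as flat lists; the abstract does not describe the evaluation code, so the function name and inputs are illustrative only.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(round(psnr([0.0] * 4, [0.1] * 4), 6))  # 20.0
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.96 dB corresponds to multiplying the MSE by 10^(-0.096), roughly 0.80, that is, about a 20% reduction in mean squared error.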

2605.10873 2026-06-19 cs.CV cs.AI Updated version

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

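Affiliations: Massachusetts Institute of Technology

AI summary: Presents CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families, evaluates 11 vision-language models, and reveals three common failure modes in CAD program generation.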

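Details

AI abstract (translated from Chinese)

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is hard to measure because existing evaluations are scattered across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities, including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count, and all families are diversity-sampled to support controlled analysis across complexity and object variation. We evaluate 11 CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable. CADBench further reveals three common failure modes: reconstruction quality degrades as geometric complexity increases, CAD-specialized models can become brittle under modality shift, and model rankings change across metrics. These results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://huggingface.co/datasets/DeCoDELab/CADBench.

Abstract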

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2606.10136 2026-06-19 cs.CV Version update

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

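AI summary: Proposes the iSAGE framework, in which experts click on pixels the model gets wrong rather than on arbitrary pixels, matching dense supervision without auxiliary machinery. On the BsB Aerial and ISPRS Vaihingen datasets it reaches performance comparable to dense supervision at extremely low annotation rates.

Comments 47 pages, 8 tables, 6 figures

Details

AI Chinese abstract (translated)

Semantic segmentation in remote sensing requires costly pixel-level annotations, and because models rarely transfer across sensors, platforms, or geographic regions, nearly every problem requires a new dataset. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all of which operate on the model's predictive distribution. In that distribution, a confidently wrong pixel is structurally indistinguishable from a confidently correct one, so no rule that reads the distribution can separate the two; the distinguishing signal lies outside the model. This paper hypothesizes that expert clicks targeting model errors, rather than arbitrary pixels, suffice to match dense supervision with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset: extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) and shows contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed points, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (an external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy with budgets of 1-100x, pseudo-labels with thresholds of 0.90-0.99, CRF-based propagation, and uniform random) fall 7.4 to 14.5 percentage points below iSAGE. Among the 31 methods surveyed, iSAGE is the only iterative human-in-the-loop framework that operates without auxiliary machinery.

English abstract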

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1-100x, pseudo-labels across thresholds 0.90-0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

2507.23534 2026-06-19 cs.LG cs.CV Version update

Continual Learning with Support Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

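Affiliations * National Taiwan University

AI summary: Proposes the Experience Blending framework, which generates Support Boundary Data via differential-privacy-inspired noise and trains jointly on exemplars and boundary data to regularize decision boundaries, improving continual learning accuracy on multiple datasets.

Details

AI Chinese abstract (translated)

Continual learning aims to mitigate catastrophic forgetting when a model is trained on sequential tasks. A common approach, experience replay, stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data, generated by injecting differential-privacy-inspired noise into latent features to produce boundary-adjacent representations that implicitly regularize decision boundaries. Building on this, we propose the Experience Blending framework, which jointly trains on exemplars and Support Boundary Data through a dual-model aggregation strategy. Experience Blending has two components: (1) latent-space noise injection to generate Support Boundary Data, and (2) end-to-end training that jointly leverages exemplars and Support Boundary Data. Unlike standard experience replay, Support Boundary Data enriches the feature space near decision boundaries, enabling more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K show continual accuracy improvements of 10%, 6%, 13%, and 2%, respectively.

English abstract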

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained with sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.
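
The latent-space noise-injection step named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the noise distribution, so zero-mean Laplace noise (a common choice in differential privacy) is assumed here. The names `laplace_noise` and `make_support_boundary_features`, the `scale` and `copies` parameters, and the toy latent vectors are all hypothetical. The dual-model aggregation strategy is not shown.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one zero-mean Laplace sample with the given scale (assumed noise model)."""
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def make_support_boundary_features(
    latents: List[List[float]], scale: float, copies: int, seed: int = 0
) -> List[List[float]]:
    """Perturb each stored latent vector `copies` times (hypothetical helper)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    rng = random.Random(seed)
    noisy = []
    for z in latents:
        for _ in range(copies):
            noisy.append([v + laplace_noise(scale, rng) for v in z])
    return noisy


# Toy latent features standing in for two stored exemplars.
exemplar_latents = [[0.2, -1.0, 0.5], [1.5, 0.3, -0.7]]
sbd = make_support_boundary_features(exemplar_latents, scale=0.1, copies=4)

# A blended batch holds the original exemplar features plus the noisy ones.
blended_batch = exemplar_latents + sbd
print(len(sbd), len(blended_batch))  # prints "8 10"
```

In this sketch a larger `scale` spreads the perturbed features further from each exemplar; how the paper chooses the noise level is not stated in the abstract.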

2603.07236 2026-06-19 cs.CV Version update

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

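Affiliations * Tencent HY Team

AI summary: Proposes the HY-WU framework, in which a functional neural memory module generates instance-specific weight updates on the fly, avoiding the interference caused by overwriting shared weights and addressing catastrophic forgetting in continual learning and personalization.

Details

AI Chinese abstract (translated)

Foundation models are transitioning from offline predictors to deployed systems expected to run for long periods. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks emerge after the model is released. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static-weight paradigm: after training (or any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, different objectives may induce separated feasible regions over the parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are usually implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on the fly from the instance condition, producing instance-specific operators without test-time optimization.

English abstract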

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO Version update

Critique of World Model

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

Eric Xing, Mingkai Deng, Jinyu Hou

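AI summary: Starting from the psychological notion of "hypothetical thinking", this paper argues that the core goal of a world model is to simulate all actionable possibilities of the real world, and designs a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, mixed continuous/discrete representations.

Details

AI Chinese abstract (translated)

World models, algorithmic simulators of the real-world environments that biological agents experience and act upon, have become an emerging topic in recent years owing to the growing need to develop virtual agents with artificial (general) intelligence. There has been much discussion of what a world model really is, how to build it, how to use it, and how to evaluate it. Starting from the imagination in the famed science-fiction classic Dune, and drawing on the concept of "hypothetical thinking" from the psychology literature, this essay argues that the primary goal of a world model is to simulate all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling (data, representation, architecture, learning objective, and usage), survey existing approaches, and analyze their tradeoffs. On this basis, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, mixed continuous/discrete representations and a generative, self-supervised learning framework, and we look ahead to a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

English abstract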

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2605.15231 2026-06-19 cs.LG cs.CV Version update

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

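Affiliations * Dyson School of Design Engineering, Imperial College London; TUM School of Engineering and Design, Technical University of Munich; Faculty of Mechanical Engineering, OTH Regensburg; NVIDIA

AI summary: This paper proposes Mask-Morph Graph U-Net, which uses feature-aligned barycentric parameterisation and node-masked pretraining to improve the generalisability and data efficiency of mesh-based simulation for crashworthiness design exploration.

Comments 48 pages, 15 figures, journal paper under review

Details

AI Chinese abstract (translated)

Nonlinear finite element crash simulations are accurate but computationally expensive, which limits their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions generalise relatively well across graph structures. By contrast, non-shared edge-specific aggregation layers capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), an approach that addresses this limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Edge-specific layers require fixed coarse-graph connectivity. To preserve this connectivity while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy onto each input mesh via feature-aligned barycentric parameterisation and then constructs cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which the high-parameter edge-specific layers are frozen. The method is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error than external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate models for crashworthiness design exploration.

English abstract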

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2605.00569 2026-06-19 cs.CV cs.GR Version update

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

Affiliations * TU Darmstadt; ELIZA; Max Planck Institute for Intelligent Systems

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

2511.23071 2026-06-19 cs.CV cs.AI cs.CL Version update

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

Affiliations * Indian Institute of Technology Jodhpur

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

2603.27698 2026-06-19 cs.CV cs.DL Version update

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

Affiliations * Vesuvius Challenge, USA; Università degli Studi di Napoli Federico II, Italy; University of Glasgow, Scotland, UK; EduceLab, University of Kentucky, USA

Comments 9 pages, 3 figures, 2 tables. Currently under review

Journal ref Scientific Reports (2026)

2601.15119 2026-06-19 eess.IV cs.CV Version update

Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani, Md Sharif Mollah

Affiliations * Department of CSE, CCN University of Science & Technology; Department of EEE, International Islamic University Chittagong; Faculty of Engineering, Multimedia University; Department of CSE, Stamford University of Bangladesh; Department of Biology, Lucknow University; Department of CSE, Bangladesh Army International University of Science & Technology

2508.21190 2026-06-19 cs.CV Version update

Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

Affiliations * Linköping University; Ericsson Research

Journal ref 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

2507.23027 2026-06-19 cs.CV cs.AI Version update

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

Affiliations * Indian Institute of Technology Madras; All India Institute of Medical Sciences; Indian Institute of Technology Hyderabad

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

1902.06202 2026-06-19 cs.CV cs.CG Version update

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

Affiliations * Michigan State University, Dept. of Computational Mathematics, Science and Engineering; Michigan State University, Dept. of Mathematics; Cooperative Institute for Marine and Atmospheric Studies, University of Miami; Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory; University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences

1. Multimodal and Vision-Language Models (7 papers)

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Vero: An Open RL Recipe for General Visual Reasoning

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

2. Embodied AI, Robotics, and Autonomous Driving (5 papers)

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

Class-Incremental Motion Forecasting

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

3. Image Recognition, Retrieval, and Classification (3 papers)

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Does Head Pose Correction Improve Biometric Facial Recognition?

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

4. Object Detection, Segmentation, and Localization (3 papers)

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

GenTrack: A New Generation of Multi-Object Tracking

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

5. Video Understanding and Temporal Vision (1 paper)

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

6. Generative Vision and World Models (9 papers)

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Abstraction in Style: Beyond Texture and Color

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

7. 3D Vision, Point Clouds, and Spatial Intelligence (6 papers)

Collaborative Multi-Modal Coding for High-Quality 3D Generation

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

8. Medical Imaging and Biological Vision (7 papers)

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

OncoReg: Medical Image Registration for Oncological Challenges

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

9. Low-Level Vision, Computational Imaging, and Image Enhancement (1 paper)

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

10. Robustness, Security, Privacy, and Trustworthy Vision (4 papers)

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

Adversarial Dependence Minimization

11. Datasets, Benchmarks, Evaluation, and Training Methods (7 papers)

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

Continual Learning with Support Boundary Experience Blending

12. Other / General Vision (10 papers)

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

Critique of World Model

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Ink Detection from Surface Topography of the Herculaneum Papyri

Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Radially Distorted Homographies, Revisited

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix