arXivDaily arXiv每日学术速递 周一至周五更新

1. 多模态与视觉语言模型 7 篇

2305.14985 2026-06-19 cs.CV cs.CL 版本更新

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

IdealGPT: 通过大型语言模型迭代分解视觉与语言推理

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University(哥伦比亚大学) HKUST(香港科技大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出IdealGPT框架,利用大型语言模型迭代分解视觉语言推理任务,通过子问题生成、子答案获取和最终答案推理的循环过程,在零样本设置下显著提升多步推理性能。

Comments 13 pages, 5 figures

详情
AI中文摘要

视觉与语言(VL)理解领域通过端到端的大型预训练VL模型(VLM)取得了前所未有的进展。然而,它们在需要多步推理的零样本推理任务中仍存在不足。为了实现这一目标,先前的工作采用了分而治之的流程。本文认为,先前的工作存在几个固有的缺点:1)它们依赖于特定领域的子问题分解模型。2)即使子问题或子答案提供的信息不足,它们也强制模型预测最终答案。我们通过IdealGPT框架解决了这些局限性,该框架利用大型语言模型(LLM)迭代分解VL推理。具体来说,IdealGPT使用一个LLM生成子问题,一个VLM提供相应的子答案,另一个LLM进行推理以得出最终答案。这三个模块迭代地执行分而治之的过程,直到模型对主问题的最终答案有信心。我们在零样本设置下对多个具有挑战性的VL推理任务评估了IdealGPT。特别是,我们的IdealGPT在VCR上比现有最好的GPT-4类模型绝对提高了10%,在SNLI-VE上提高了15%。代码可在以下网址获取:此 https URL

英文摘要

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inferencing. To achieve this goal, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason to achieve the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT

2504.11171 2026-06-19 cs.CV cs.AI 版本更新

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

发表机构 * IBM Research – Europe(IBM欧洲研究院) ETH Zurich(苏黎世联邦理工学院) Forschungszentrum Jülich(尤利希研究中心) European Space Agency(欧洲航天局) Φ \Phi -Lab(Φ实验室) NASA IMPACT University of Iceland(爱沙尼亚大学)

AI总结 提出首个任意到任意生成式多模态基础模型TerraMind,通过双尺度表示(token级和像素级)预训练,实现零样本/少样本应用,并引入“模态思考”能力,在PANGAEA等基准上达到领先性能。

Comments Accepted at ICCV'25

详情
AI中文摘要

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

英文摘要

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出Vero系列开放视觉语言模型,通过构建600K样本数据集Vero-600K和任务路由奖励,在30个基准测试中平均提升2.9-5.4点,Vero-Qwen3I-8B超越Qwen3-VL-8B-Thinking 3.8点。

Comments Project page: https://vero-reasoning.github.io/

详情
AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.

2605.20448 2026-06-19 cs.CV cs.LG 版本更新

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是理解3D场景还是仅仅 catalogue 物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

发表机构 * Deccan AI(德克南人工智能)

AI总结 本文通过一个包含3034个样本的人工整理基准,探讨了视觉-语言模型对空间理解的深度有序遮挡、光学几何推断和体积重新安排规划能力,发现模型在重新安排可见布局时表现优异,但在遮挡和反射推断上表现较差。

详情
AI中文摘要

视觉-语言模型能够可靠地命名场景中的物体,但它们是否代表这些物体所处的3D布局?我们引入了一个包含3034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化进行探测)、可见反射的光学几何推断,以及体积重新安排规划。六个前沿和开放权重的VLMs在18,204个响应上由训练注释者评分,没有使用LLM作为判断标准,揭示了明显的分离:在53-97%的准确率下,能够对可见布局进行重新安排的模型,在遮挡任务中表现不佳,仅在6-45%之间,而在反射任务中低于7%。一个具身推理模型重现了相同的模式。对Qwen3-VL-8B-Thinking的白盒分析显示,失败归因于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可用,只有在清洁的标记合并后激活被重新引入语言解码器后才恢复。

英文摘要

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2606.05833 2026-06-19 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.16615 2026-06-19 cs.CV 版本更新

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

AI总结 提出SUP-MCRL框架,通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器,解决多模态对比学习中语义一致性和主体选择性问题,在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on the THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.

2606.18249 2026-06-19 cs.CV 版本更新

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

统一多模态自回归建模:共享上下文-视觉分词器是实现统一的关键

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(可信具身AI研究院,复旦大学) Shanghai Innovation Institute(上海创新研究院) Qwen Team, Alibaba Inc.(通义实验室,阿里公司)

AI总结 提出UniAR框架,通过单一离散视觉分词器桥接视觉理解与生成,采用并行位预测和扩散解码,在图像生成和编辑上达到最优,同时保持多模态理解竞争力。

Comments ICML2026. Project page https://sharelab-sii.github.io/uniar-web

详情
AI中文摘要

统一多模态建模旨在将视觉理解和生成集成到单个系统中。然而,现有方法通常依赖两个不同的视觉分词器,这分割了表示空间并阻碍了真正的统一建模。我们提出UniAR,一个统一的自回归框架,其中单个离散视觉分词器作为理解和生成之间的关键桥梁,使得模型能够直接解释其自身生成的视觉标记而无需额外的重新编码,从而实现共享上下文。UniAR采用预训练的视觉编码器,结合多级特征融合和无查找的逐位量化方案,在保留高层语义和低层细节的同时,以最小代价扩展有效视觉词汇。在此基础上,统一自回归模型采用并行逐位预测来联合预测空间分组的多级视觉编码,大幅减少视觉序列长度并加速生成。最后,基于扩散的视觉解码器对离散视觉标记进行操作,以解码高保真图像。通过大规模预训练,随后进行监督微调和强化学习,UniAR在图像生成和图像编辑上达到了最先进的性能,同时在多模态理解基准上保持竞争力。项目页面可在此URL获取。

英文摘要

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.

2. 具身智能、机器人与自动驾驶 5 篇

2505.17006 2026-06-19 cs.CV cs.RO 版本更新

CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

CoMo: 从互联网视频中学习连续潜在运动以实现可扩展的机器人学习

Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang

发表机构 * Nanjing University(南京大学) Shanghai AI Lab(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) Fudan University(复旦大学) Tongji University(同济大学)

AI总结 提出CoMo方法,通过早期时间差分和时序对比学习从互联网视频中学习连续潜在运动,避免离散化信息损失,实现零样本泛化生成伪动作标签,联合训练策略在仿真和真实实验中表现优异。

Comments CVPR 2026

详情
AI中文摘要

从互联网视频中无监督学习潜在运动对于机器人学习至关重要。现有的离散方法通常通过小码本大小的向量量化来减轻提取过多静态背景导致的捷径学习,但它们存在信息损失,难以捕捉更复杂和细粒度的动态。此外,离散潜在运动与连续机器人动作之间存在固有分布差距,阻碍了统一策略的联合学习。我们提出CoMo,旨在从互联网规模视频中学习更精确的连续潜在运动。CoMo采用早期时间差分(Td)机制来增加捷径学习难度并显式增强运动线索。此外,为确保潜在运动更好地捕捉有意义的背景,我们进一步提出时序对比学习(Tcl)方案。具体地,正样本对通过小的未来帧时间偏移构建,而负样本对则通过直接反转时间方向形成。所提出的Td和Tcl协同工作,有效确保潜在运动更好地关注前景并增强运动线索。关键的是,CoMo表现出强大的零样本泛化能力,使其能够为未见过的视频生成有效的伪动作标签。大量的仿真和真实实验表明,使用CoMo伪动作标签联合训练的策略在扩散和自回归架构下均实现了优越性能。

英文摘要

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot action, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the shortcut learning difficulty and explicitly enhance motion cues. Additionally, to ensure latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically and effectively ensure that the latent motion focuses better on the foreground and reinforces motion cues. Critically, CoMo exhibits strong zeroshot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.

2603.00654 2026-06-19 cs.CV 版本更新

RC-GeoCP: Geometric Consensus for Radar-Camera Collaborative Perception

RC-GeoCP:雷达-相机协同感知的几何一致性

Xiaokai Bai, Lianqing Zheng, Runwei Guan, Siyuan Cao, Songkai Wang, Huiliang Shen

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息科学与电子工程学院) School of Automotive Studies, Tongji University(同济大学汽车学院) Thrust of Artificial Intelligence, Hong Kong University of Science and Technology(香港科技大学人工智能研究所)

AI总结 提出首个4D雷达与相机协同感知框架RC-GeoCP,通过雷达锚定几何一致性解决深度模糊和空间分散导致的错位,实现高效通信与全局一致表示。

Comments 11 pages, 6 figures, 9 tables

详情
AI中文摘要

协同感知(CP)通过多智能体信息共享增强场景理解。尽管以LiDAR为中心的系统提供精确几何,但高成本和恶劣天气下的性能下降需要多模态替代方案。尽管具有密集的视觉语义和鲁棒的空间测量,相机与4D雷达之间的协同在协作环境中仍未得到充分探索。本文介绍RC-GeoCP,这是首个探索CP中4D雷达与图像融合的框架。为解决由深度模糊和跨智能体空间分散引起的错位,RC-GeoCP建立了雷达锚定的几何一致性。具体而言,几何结构修正(GSR)将视觉语义与雷达导出的几何对齐,以生成空间有根基的、几何一致的表示。不确定性感知通信(UAC)将选择性传输表述为条件熵减少过程,基于智能体间分歧优先处理信息特征。最后,共识驱动聚合器(CDA)通过共享几何锚聚合多智能体信息,形成全局一致的表示。我们在V2X-Radar和V2X-R上建立了首个统一的雷达-相机CP基准,展示了最先进的性能,同时显著降低了通信开销。代码即将发布。

英文摘要

Collaborative perception (CP) enhances scene understanding through multi-agent information sharing. While LiDAR-centric systems offer precise geometry, high costs and performance degradation in adverse weather necessitate multi-modal alternatives. Despite dense visual semantics and robust spatial measurements, the synergy between cameras and 4D radar remains underexplored in collaborative settings. This work introduces RC-GeoCP, the first framework to explore the fusion of 4D radar and images in CP. To resolve misalignment caused by depth ambiguity and spatial dispersion across agents, RC-GeoCP establishes a radar-anchored geometric consensus. Specifically, Geometric Structure Rectification (GSR) aligns visual semantics with geometry derived from radar to generate spatially grounded, geometry-consistent representations. Uncertainty-Aware Communication (UAC) formulates selective transmission as a conditional entropy reduction process to prioritize informative features based on inter-agent disagreement. Finally, the Consensus-Driven Assembler (CDA) aggregates multi-agent information via shared geometric anchors to form a globally coherent representation. We establish the first unified radar-camera CP benchmark on V2X-Radar and V2X-R, demonstrating state-of-the-art performance with significantly reduced communication overhead. Code will be released soon.

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系) Qualcomm SARL France(法国.qualcomm SARL) Automated Driving, Qualcomm Technologies, Inc.(qualcomm Technologies, Inc. 自动驾驶部门)

AI总结 提出类别增量运动预测新任务,通过端到端框架结合伪标签与开放词汇分割,利用3D-2D投票机制和查询特征方差重放策略,缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

详情
AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而,现有方法通常假设一个封闭世界设定,具有固定的对象分类法并依赖高质量感知,限制了其在现实世界中的应用,因为现实世界中感知不完美,且新对象类别可能随时间出现。在这项工作中,我们引入了类别增量运动预测,这是一个新颖的设定,其中新对象类别随时间顺序引入,并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架,该框架适应新引入的类别,同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签,并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测,而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明,我们的方法成功地在已知类别上保持性能,同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移,并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在该https URL上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2606.18960 2026-06-19 cs.CV cs.RO 版本更新

Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Mem-World:用于持久机器人操作的内存增强动作条件世界模型

Zirui Zheng, Jiaqian Yu, Xiongfeng Peng, jun shi, Mingyi Li, Chao Zhang, Weiming Li, Dong Wang, Huchuan Lu, Xu Jia

发表机构 * Dalian University of Technology(大连理工大学) Samsung R&D Institute China-Beijing (SRCB)(三星中国北京研究院)

AI总结 提出Mem-World,通过4D腕部视角曲面元索引内存W-VMem,解决操作中因遮挡和运动导致的场景遗忘问题,实现持久世界建模,提升策略评估与改进效果。

详情
AI中文摘要

动作条件世界模型已成为机器人学习的一种有前景的范式,通过生成动作一致的视频推演,为昂贵的真实世界实验提供了可扩展的替代方案。然而,在操作中持久世界建模仍然具有挑战性:频繁的末端执行器遮挡和快速的腕部相机运动使得当前观测不足以预测未来视图,导致模型遗忘或幻觉先前帧中看到的场景细节。现有的内存检索策略在动态操作场景中往往无法识别信息丰富的历史。为解决这一限制,我们提出了Mem-World,一种内存增强的多视图动作条件世界模型。其核心是W-VMem,一种4D腕部视图为中心的曲面元索引内存,将历史观测锚定到随时间演变的表面元素上。通过显式建模场景元素被观测的时间和位置,W-VMem能够根据未来动作实现几何感知的相关历史帧检索。在生成过程中,通过基于曲面元的渲染和评分选择相关历史帧,为预测提供信息丰富且非冗余的上下文。大量实验表明,Mem-World在复杂操作场景中生成持久推演,比Ctrl-World实现更可靠的策略评估,将皮尔逊相关系数提高14.5%,并通过合成数据生成支持有效的策略改进,在长时域任务中将成功率从58%提升到72%。

英文摘要

Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5\%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58\% to 72\% on long-horizon tasks.

2606.18112 2026-06-19 cs.RO cs.CV 版本更新

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Zhibo Yang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出 Qwen-RobotNav 可扩展导航模型,通过参数化接口支持多种任务模式和可调观测参数,在15.6M样本上训练,联合视觉语言数据防止行为坍缩,在多个导航基准上取得新最优结果,并展示零样本泛化能力。

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

3. 图像识别、检索与分类 3 篇

2508.04424 2026-06-19 cs.CV 版本更新

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

组合对象检索:通过组合表达式进行对象级检索

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China(新一代人工智能技术及跨学科应用国家重点实验室,东南大学,教育部,江苏,中国) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学(MBZUAI),阿布扎赫德,阿联酋)

AI总结 提出组合对象检索(COR)任务,通过组合参考对象、掩码和检索文本进行对象级检索,并构建COR125K基准和CORE模型,显著优于现有方法。

详情
AI中文摘要

基于用户意图检索细粒度视觉内容在多模态系统中仍然是一个挑战。尽管当前的组合图像检索(CIR)方法结合了参考图像和检索文本,但它们局限于图像级匹配,无法定位特定对象。为此,我们提出了组合对象检索(COR),一种新的对象级检索任务,从目标图像中的候选对象中检索目标对象,并用像素级掩码对检索结果进行定位。给定一个参考对象、其掩码、一个目标图像以及描述所需修改的检索文本,COR要求模型执行组合视觉-文本推理,而不是依赖显式的类别名称。这一设置带来了若干挑战,包括细粒度组合匹配、在视觉相似干扰物下的负对象过滤以及灵活的单对象或多对象检索。我们构建了COR125K,第一个大规模COR基准,包含408个类别的125,541个检索三元组,并划分基础/新类别以评估类别级泛化能力。我们还提出了CORE,一个统一的端到端模型,集成了参考区域编码、自适应视觉-文本交互和区域级对比学习,以将组合表示与目标对象对齐,同时抑制背景和干扰物。大量实验表明,CORE在基础和新类别上均显著优于现有的基于CIR的流程和强基线,为细粒度对象级多模态检索建立了一个简单而有效的基础。代码将在此https URL公开发布。

英文摘要

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2512.03199 2026-06-19 cs.CV 版本更新

Does Head Pose Correction Improve Biometric Facial Recognition?

姿态校正是否能提升生物特征面部识别?

Justin Norman, Hany Farid

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究探讨了AI驱动的头部姿态校正与图像修复对面部识别准确率的影响,发现选择性应用CFR-GAN与CodeFormer可提升识别性能。

详情
AI中文摘要

生物特征面部识别模型在处理现实世界图像时常表现出显著的准确性下降,通常表现为图像质量差、非正面姿态和主体遮挡。我们调查了针对这些挑战的AI驱动头部姿态校正和图像修复是否能提高识别准确率。使用模型无关的大规模法医评估流程,我们评估了三种修复方法:3D重建(NextFace)、2D正面化(CFR-GAN)和特征增强(CodeFormer)。我们发现这些技术的简单应用会显著降低面部识别准确率。然而,我们还发现选择性应用CFR-GAN结合CodeFormer可以带来有意义的提升。

英文摘要

Biometric facial recognition models often demonstrate significant decreases in accuracy when processing real-world images, often characterized by poor quality, non-frontal subject poses, and subject occlusions. We investigate whether targeted, AI-driven, head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

2604.19196 2026-06-19 cs.CV 版本更新

Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing

面向域泛化人脸反欺骗的视觉基础模型基准测试

Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki

发表机构 * Graduate School of Information Sciences, Tohoku University, Japan(东北大学信息科学研究生院,日本)

AI总结 本文系统评估15种预训练视觉模型在人脸反欺骗域泛化中的表现,发现自监督ViT(尤其是DINOv2+Registers)结合数据增强和注意力损失在MICO协议上达到最优,且计算高效。

Comments 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

详情
AI中文摘要

人脸反欺骗(FAS)由于需要在未见过的环境中进行鲁棒的域泛化而仍然具有挑战性。尽管最近的趋势利用视觉-语言模型(VLM)进行语义监督,但这些多模态方法通常需要高昂的计算资源并表现出高推理延迟。此外,它们的有效性本质上受限于底层视觉特征的质量。本文重新审视仅视觉基础模型建立高效鲁棒FAS基线的潜力。我们在严苛的跨域场景下(包括MICO和有限源域(LSD)协议)对15个预训练模型进行了系统基准测试,例如有监督CNN、有监督ViT和自监督ViT。我们的全面分析表明,自监督视觉模型,特别是带有寄存器的DINOv2,显著抑制了注意力伪影并捕获了关键的细粒度欺骗线索。结合人脸反欺骗数据增强(FAS-Aug)、分块数据增强(PDA)和注意力加权分块损失(APL),我们提出的仅视觉基线在MICO协议上达到了最先进的性能。该基线在数据受限的LSD协议下优于现有方法,同时保持优越的计算效率。这项工作为FAS提供了一个确定的仅视觉基线,表明优化的自监督视觉变换器可以作为仅视觉和未来多模态FAS系统的骨干。项目页面见:此https URL。

英文摘要

Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .

4. 目标检测、分割与定位 3 篇

2507.21460 2026-06-19 cs.CV 版本更新

An Angular-Temporal Interaction Network for Light Field Object Tracking in Low-Light Scenes

用于低光场景光场目标跟踪的角-时交互网络

Mianzhao Wang, Fan Shi, Xu Cheng, Feifei Zhang, Shengyong Chen

发表机构 * Engineering Research Center of Learning-Based Intelligent System (Ministry of Education)(教育部学习驱动智能系统工程研究中心) key Laboratory of Computer Vision and System (Ministry of Education)(教育部计算机视觉与系统重点实验室) School of Computer Science and Engineering, Tianjin University of Technology(天津工业大学计算机科学与工程学院)

AI总结 提出一种光场极线平面结构图像表示和角-时交互网络,通过显式建模几何结构和自监督优化,在低光场景下实现高效目标跟踪,性能达到最优。

详情
AI中文摘要

高质量的四维光场表示结合高效的角特征建模对于场景感知至关重要,因为它可以提供判别性的空间-角度线索来识别移动目标。然而,近期的发展仍然难以在时间域中提供可靠的角建模,尤其是在复杂的低光场景中。在本文中,我们提出了一种新颖的光场极线平面结构图像(ESI)表示,该表示显式定义了光场内的几何结构。通过利用极线平面内光线角度的突变,这种表示可以增强低光场景中的视觉表达,并减少高维光场的冗余。我们进一步提出了一种用于光场目标跟踪的角-时交互网络(ATINet),该网络从光场的几何结构线索和角-时交互线索中学习角感知表示。此外,ATINet还可以通过自监督方式进行优化,以增强时间域上的几何特征交互。最后,我们引入了一个大规模的光场低光数据集用于目标跟踪。大量实验表明,ATINet在单目标跟踪中达到了最先进的性能。此外,我们将所提方法扩展到多目标跟踪,这也显示了高质量光场角-时建模的有效性。

英文摘要

High-quality 4D light field representation with efficient angular feature modeling is crucial for scene perception, as it can provide discriminative spatial-angular cues to identify moving targets. However, recent developments still struggle to deliver reliable angular modeling in the temporal domain, particularly in complex low-light scenes. In this paper, we propose a novel light field epipolar-plane structure image (ESI) representation that explicitly defines the geometric structure within the light field. By capitalizing on the abrupt changes in the angles of light rays within the epipolar plane, this representation can enhance visual expression in low-light scenes and reduce redundancy in high-dimensional light fields. We further propose an angular-temporal interaction network (ATINet) for light field object tracking that learns angular-aware representations from the geometric structural cues and angular-temporal interaction cues of light fields. Furthermore, ATINet can also be optimized in a self-supervised manner to enhance the geometric feature interaction across the temporal domain. Finally, we introduce a large-scale light field low-light dataset for object tracking. Extensive experimentation demonstrates that ATINet achieves state-of-the-art performance in single object tracking. Furthermore, we extend the proposed method to multiple object tracking, which also shows the effectiveness of high-quality light field angular-temporal modeling.

2510.24399 2026-06-19 cs.CV cs.RO 版本更新

GenTrack: A New Generation of Multi-Object Tracking

GenTrack:新一代多目标跟踪

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人实验室,南丹麦大学)

AI总结 提出GenTrack多目标跟踪方法,采用随机与确定性混合策略,结合粒子群优化与社会交互,在弱检测器、遮挡等场景下有效维持目标身份一致性并减少ID切换。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

本文介绍了一种新颖的多目标跟踪(MOT)方法,称为GenTrack,其主要贡献包括:第一,一种混合跟踪方法,采用随机和确定性方式,以鲁棒地处理未知且时变的目标数量,特别是在维持目标身份(ID)一致性和管理非线性动态方面;第二,利用粒子群优化(PSO)和一些提出的适应度度量,引导随机粒子朝向其目标分布模式,从而即使在弱且噪声大的目标检测器下也能实现有效跟踪;第三,整合目标间的社会交互,以增强PSO引导的粒子,并改进强(匹配)和弱(未匹配)轨迹的连续更新,从而减少ID切换和轨迹丢失,尤其是在遮挡期间;第四,基于GenTrack重新定义的视觉MOT基线,结合了基于空间一致性、外观、检测置信度、轨迹惩罚和社会分数的综合状态与观测模型,以实现系统且高效的目标更新;第五,首个公开可用的最小依赖源代码参考实现,包含三种变体,包括GenTrack Simple、Strengthen和Super,便于灵活重新实现。实验结果表明,与最先进的跟踪器相比,GenTrack在标准基准和现实场景中提供了优越的性能,并集成了基线实现以进行公平比较。还讨论了未来工作的潜在方向。所提方法和比较跟踪器的源代码参考实现已在GitHub上提供:this https URL

英文摘要

This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: first-a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, second-leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, third-integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, fourth-a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and five-the first ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Simple, Strengthen, and Super, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack

2510.24410 2026-06-19 cs.CV cs.RO 版本更新

GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

GenTrack2: 一种改进的多目标跟踪混合方法

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

发表机构 * SDU Robotics, University of Southern Denmark(SDU机器人研究所,南丹麦大学)

AI总结 提出结合随机粒子滤波与确定性关联的多目标跟踪方法,通过粒子群优化和新型代价矩阵解决非线性动态下的标识一致性问题,性能优于现有方法。

Comments The content of this paper was included in the full manuscript of GenTrack family which has been submitted to the journal for possible publication

详情
AI中文摘要

本文提出一种视觉多目标跟踪方法,联合使用随机和确定性机制,以确保在非线性动态下未知且时变目标数量的标识一致性。随机粒子滤波处理非线性动态和非高斯噪声,并借助粒子群优化(PSO)将粒子引导至状态分布模式,通过提出的适应度度量(包含运动一致性、外观相似性和与邻近目标的社交互动线索)减轻发散。确定性关联通过提出的代价矩阵进一步强制标识一致性,该矩阵包含粒子与当前检测之间的空间一致性、检测置信度和轨迹惩罚。随后,提出一种新颖方案,在保持目标身份的同时平滑更新目标状态,特别是对于与其他目标交互和长时间遮挡期间的弱轨迹。此外,对过去状态的速度回归提供趋势种子速度,增强粒子采样和状态更新。所提出的跟踪器设计灵活,适用于预录视频和相机直播流(未来帧不可用)。实验结果表明,与最先进的跟踪器相比,性能优越。所提出方法和对比跟踪器的源代码参考实现已在GitHub上提供:此 https URL

英文摘要

This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

5. 视频理解与时序视觉 1 篇

2606.09547 2026-06-19 cs.CV cs.LG 版本更新

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

AI总结 提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力,并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments The project page is available at https://apratimbh.github.io/livecookv2/

详情
AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is it's ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non -interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

6. 生成式视觉与世界模型 9 篇

2506.06952 2026-06-19 cs.CV 版本更新

LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

LaTtE-Flow: 基于层间时间步专家流的Transformer

Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Maryland(马里兰大学) Nvidia(英伟达) Salesforce AI Research(Salesforce AI研究) Intuit AI Research(Intuit AI研究)

AI总结 提出LaTtE-Flow,一种基于预训练视觉语言模型的高效统一架构,通过层间时间步专家流和条件残差注意力机制,实现图像理解与生成,生成速度提升约6倍。

Comments Unified multimodal model, Flow-matching

详情
AI中文摘要

多模态基础模型在统一图像理解与生成方面取得了最新进展,为在单一框架内处理广泛的视觉-语言任务开辟了令人兴奋的途径。尽管取得了进展,现有的统一模型通常需要大量的预训练,并且与专门针对每项任务的模型相比,难以达到相同的性能水平。此外,许多这些模型存在图像生成速度慢的问题,限制了它们在实时或资源受限环境中的实际部署。在这项工作中,我们提出了基于层间时间步专家流的Transformer(LaTtE-Flow),一种新颖且高效的架构,可在单个多模态模型中统一图像理解与生成。LaTtE-Flow建立在强大的预训练视觉语言模型(VLM)之上,以继承强大的多模态理解能力,并通过新颖的层间时间步专家流架构扩展它们,以实现高效的图像生成。LaTtE-Flow将流匹配过程分布到专门的Transformer层组中,每组负责不同的时间步子集。这种设计通过在每个采样时间步仅激活一小部分层,显著提高了采样效率。为了进一步提升性能,我们提出了一种时间步条件残差注意力机制,用于跨层高效的信息重用。实验表明,LaTtE-Flow在多模态理解任务上取得了强劲的性能,同时与最近的统一多模态模型相比,实现了具有竞争力的图像生成质量,推理速度提高了约6倍。

英文摘要

Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.

2601.21081 2026-06-19 cs.CV 版本更新

Shape of Thought: Progressive Object Assembly via Visual Chain-of-Thought

思维形状:通过视觉思维链进行渐进式物体组装

Yu Huo, Siyu Zhang, Kun Zeng, Haoyue Liu, Owen Lee, Junlin Chen, Yuquan Lu, Yifu Guo, Yaodong Liang, Xiaoying Tang

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)科学与工程学院) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) Sun Yat-sen University(中山大学) The Hong Kong University of Science and Technology, Guangzhou(香港科学与技术大学(广州)) Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)(深圳未来网络智能研究所(FNii-Shenzhen)) Guangdong Provincial Key Laboratory of Future Networks of Intelligence, CUHK(SZ)(广东省未来网络智能重点实验室,CUHK(SZ))

AI总结 提出Shape-of-Thought (SoT)框架,通过视觉思维链在渲染2D域中逐步组装形状,解决文本到图像生成中的组合结构约束问题,在组件计数和结构拓扑上显著优于直接生成。

Comments ICML2026

详情
AI中文摘要

用于文本到图像生成的多模态模型已实现强视觉保真度,但在组合结构约束(特别是生成计数、属性绑定和部分级关系)下仍然脆弱。为解决这些挑战,我们提出了Shape-of-Thought (SoT),一种视觉思维链框架,用于在渲染2D域中进行过程监督的渐进式形状组装,推理时无需外部引擎。SoT训练一个统一的多模态自回归模型,生成交错文本计划和渲染中间状态,帮助模型在不产生显式几何表示的情况下捕捉形状组装逻辑。与纯文本思维链不同,每个决策都基于渲染状态,使得计数、连接、拓扑和中间部件添加错误在整个轨迹中可检查。为支持这一范式,我们引入了SoT-26K,一个基于部件CAD层次结构的大规模接地组装轨迹数据集,以及T2S-CompBench,一个用于评估结构完整性和轨迹忠实度的基准。在SoT-26K上微调在组件计数上达到88.4%,在结构拓扑上达到84.8%,在组件计数上比直接生成高出24.2个百分点,在结构拓扑上高出19.3个百分点。SoT为渲染域结构感知生成建立了一个透明测试平台。代码见此https URL。

英文摘要

Multimodal models for text-to-image generation have achieved strong visual fidelity, yet they remain brittle under compositional structural constraints, notably generative numeracy, attribute binding, and part-level relations. To address these challenges, we propose Shape-of-Thought (SoT), a visual CoT framework for process-supervised progressive shape assembly in the rendered 2D domain, without external engines at inference time. SoT trains a unified multimodal autoregressive model to generate interleaved textual plans and rendered intermediate states, helping the model capture shape-assembly logic without producing explicit geometric representations. Unlike text-only CoT, each decision is grounded in a rendered state, making counts, attachments, topology, and intermediate part-addition errors inspectable across the trajectory. To support this paradigm, we introduce SoT-26K, a large-scale dataset of grounded assembly traces derived from part-based CAD hierarchies, and T2S-CompBench, a benchmark for evaluating structural integrity and trace faithfulness. Fine-tuning on SoT-26K achieves 88.4% on component numeracy and 84.8% on structural topology, outperforming direct generation by +24.2 points on component numeracy and +19.3 points on structural topology. SoT establishes a transparent testbed for rendered-domain structure-aware generation. The code is available at https://github.com/yuhuo03/Shape-of-Thought.

2601.21542 2026-06-19 cs.CV cs.AI 版本更新

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

双锚点插值求解器加速生成建模

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

发表机构 * The Hong Kong University of Science(香港科学与技术大学)

AI总结 提出BA-solver,通过轻量SideNet(1-2%主干大小)学习双向时间感知和双锚点速度积分,在不重新训练主干的情况下,以极低训练成本实现10步内达到100+步Euler求解器质量,支持即插即用。

详情
AI中文摘要

流匹配(FM)模型已成为高保真合成的前沿范式。然而,它们对迭代常微分方程(ODE)求解的依赖造成了显著的延迟瓶颈。现有解决方案面临两难:无训练求解器在低神经函数评估(NFE)下性能严重下降,而基于训练的一步或几步生成方法则面临高昂的训练成本且缺乏即插即用的通用性。为弥合这一差距,我们提出了双锚点插值求解器(BA-solver)。BA-solver保留了标准无训练求解器的通用性,同时通过引入轻量级SideNet(主干大小的1-2%)与冻结主干并行,实现了显著加速。具体而言,我们的方法基于两个协同组件:1)双向时间感知,其中SideNet学习近似未来和过去的速度,无需重新训练重型主干;2)双锚点速度积分,利用带有两个锚点速度的SideNet高效近似中间速度,用于批量高阶积分。通过利用主干建立高精度“锚点”并利用SideNet加密轨迹,BA-solver能够以最小误差实现大步长。在ImageNet-256^2上的实验结果表明,BA-solver仅需10次NFE即可达到与100+次NFE的Euler求解器相当的生成质量,并在仅5次NFE时保持高保真度,且训练成本可忽略不计。此外,BA-solver确保与现有生成流水线的无缝集成,便于图像编辑等下游任务。

英文摘要

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-steps generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: \textbf{1) Bidirectional Temporal Perception}, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision ``anchors'' and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

2602.15819 2026-06-19 cs.CV 版本更新

VideoSketcher: Sequential Sketch Generation Using Video Model Priors

VideoSketcher:利用视频模型先验的序列草图生成

Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker

发表机构 * MIT(麻省理工学院)

AI总结 提出VideoSketcher方法,结合LLM的语义规划与视频扩散模型的时序渲染,通过两阶段微调从少量样本学习笔画顺序与风格,生成高质量序列草图。

详情
AI中文摘要

素描本质上是序列化的:笔画逐步绘制以探索和完善想法。然而,大多数生成方法将草图视为静态图像,忽略了创造性探索背后的时间过程。建模这种序列结构仍然具有挑战性:先前的方法要么依赖大规模但多样性有限的人类绘制数据集,要么使用大型语言模型(LLM)生成绘制指令,但往往以视觉保真度为代价。我们提出VideoSketcher,一种通过将预训练的文本到视频扩散模型适应于草图形成的稀疏连续性质来生成高质量绘制过程的方法。我们的关键洞察是LLM和视频扩散模型提供互补优势:LLM作为语义规划器,将概念分解为逐步指令,而视频扩散模型作为强大的“渲染器”,将它们转化为时间连贯的草图序列。我们引入一种两阶段微调策略,将时间结构与视觉外观解耦:笔画顺序从合成形状组合中学习,而风格则从少至七幅手绘示例中提炼。尽管监督极少,我们的方法能够生成多样、高质量的序列草图,并忠实遵循指定的绘制顺序。我们的框架自然扩展到笔刷风格控制和自回归生成,支持艺术应用。

英文摘要

Sketching is inherently sequential: strokes are drawn progressively to explore and refine ideas. Yet most generative approaches treat sketches as static images, ignoring the temporal process underlying creative exploration. Modeling this sequential structure remains challenging: prior methods either rely on large-scale human-drawn datasets with limited diversity, or use large language models (LLMs) to produce drawing instructions, often at the cost of visual fidelity. We present VideoSketcher, a method for generating high-quality sketching processes by adapting pretrained text-to-video diffusion models to the sparse, continuous nature of sketch formation. Our key insight is that LLMs and video diffusion models offer complementary strengths: LLMs act as semantic planners that decompose concepts into step-by-step instructions, while video diffusion models serve as powerful "renderers" that translate them into temporally coherent sketch sequences. We introduce a two-stage fine-tuning strategy that decouples temporal structure from visual appearance: stroke ordering is learned from synthetic shape compositions, while style is distilled from as few as seven hand-drawn examples. Despite minimal supervision, our method can generate diverse, high-quality sequential sketches that faithfully follow specified drawing orders. Our framework naturally extends to brush style control and autoregressive generation, supporting artistic applications.

2603.12252 2026-06-19 cs.CV cs.CL 版本更新

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

EndoCoT:扩散模型中的内生思维链推理扩展

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出EndoCoT框架,通过迭代思维引导模块激活MLLM的推理潜力,并利用终端思维接地模块确保推理轨迹与文本监督对齐,使DiT逐步执行复杂任务,在多个基准上平均准确率达92.1%。

Comments 23 pages, 18 figures, The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这种范式存在两个关键限制:(i)MLLM文本编码器表现出不足的推理深度。单步编码无法激活思维链过程,而这对MLLM为复杂任务提供准确指导至关重要。(ii)在解码过程中,指导保持不变。即使有正确的MLLM编码,解码过程中的不变指导也阻止了DiT逐步将复杂指令分解为可执行的去噪步骤。为此,我们提出了内生思维链(EndoCoT),一种新颖的框架,首先通过迭代思维引导模块迭代细化潜在思维状态来激活MLLM的推理潜力,然后将这些状态桥接到DiT的去噪过程。其次,应用终端思维接地模块,通过将最终状态与真实答案对齐,确保推理轨迹保持与文本监督的接地。通过这两个组件,MLLM文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在多个基准(如Maze、TSP、VSP和Sudoku)上的广泛评估实现了平均准确率92.1%,比最强基线高出8.3个百分点。代码和数据集在此https URL公开。

英文摘要

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2603.29924 2026-06-19 cs.CV 版本更新

Abstraction in Style: Beyond Texture and Color

风格中的抽象:超越纹理与色彩

Min Lu, Yuanfeng He, Anthony Chen, Jianhuang He, Pu Wang, Daniel Cohen-Or, Hui Huang

发表机构 * Shenzhen University(深圳大学) Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE)(视觉计算研究中心(VCC),计算机科学与软件工程学院) Peking University(北京大学)

AI总结 提出Abstraction in Style (AiS)框架,将结构抽象与视觉风格分离,通过中间抽象代理实现几何保真度放松,从而支持更广泛的非真实感风格迁移。

Comments SIGGRAPH 2026

详情
AI中文摘要

艺术风格通常嵌入超越表面外观的抽象,涉及对结构的有意重新诠释,而不仅仅是纹理或色彩的变化。传统的风格迁移方法通常保留输入几何结构,因此难以捕捉这种更深层次的抽象行为,尤其是对于插画和非真实感风格。在这项工作中,我们引入了Abstraction in Style (AiS),一个将结构抽象与视觉风格化分离的生成框架。给定目标图像和少量风格样本,AiS首先推导出一个中间抽象代理,该代理根据风格所展现的抽象逻辑重新诠释目标的结构。代理捕捉语义结构,同时放松几何保真度,使得后续的风格化能够在抽象表示而非原始图像上进行操作。在第二阶段,渲染抽象代理以产生最终风格化输出,保持与参考风格的视觉一致性。两个阶段都使用共享的图像空间类比实现,使得变换可以从视觉样本中学习,无需显式的几何监督。通过将抽象与外观解耦,并将抽象视为显式、可迁移的过程,AiS支持更广泛的风格变换,提高了可控性,并实现了更具表现力的风格化。

英文摘要

Artistic styles often embed abstraction beyond surface appearance, involving deliberate reinterpretation of structure rather than mere changes in texture or color. Conventional style transfer methods typically preserve the input geometry and therefore struggle to capture this deeper abstraction behavior, especially for illustrative and nonphotorealistic styles. In this work, we introduce Abstraction in Style (AiS), a generative framework that separates structural abstraction from visual stylization. Given a target image and a small set of style exemplars, AiS first derives an intermediate abstraction proxy that reinterprets the target's structure in accordance with the abstraction logic exhibited by the style. The proxy captures semantic structure while relaxing geometric fidelity, enabling subsequent stylization to operate on an abstracted representation rather than the original image. In a second stage, the abstraction proxy is rendered to produce the final stylized output, preserving visual coherence with the reference style. Both stages are implemented using a shared image space analogy, enabling transformations to be learned from visual exemplars without explicit geometric supervision. By decoupling abstraction from appearance and treating abstraction as an explicit, transferable process, AiS supports a wider range of stylistic transformations, improves controllability, and enables more expressive stylization.

2605.31158 2026-06-19 cs.CV cs.LG 版本更新

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

光交互:交互式视频世界模型的免训练推理加速

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

发表机构 * Zhejiang University(浙江大学) NVIDIA

AI总结 针对交互式视频世界模型推理成本高的问题,提出免训练加速框架Light Interaction,通过自适应上下文管理、去噪缓存加速和3D块稀疏注意力实现最高2.59倍加速。

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

详情
AI中文摘要

交互式视频世界模型根据用户控制的相机运动逐块生成视频,支持实时游戏模拟、虚拟场景导航和具身AI训练等应用。然而,由于上下文记忆增长、二次注意力复杂度和重复去噪步骤,扩展到长交互轨迹的成本过高。我们提出Light Interaction,一种用于交互式视频世界模型的免训练推理加速框架。我们的关键洞察是,交互自然支持轨迹依赖的自适应计算:在探索新区域时可丢弃检索到的空间记忆,根据局部潜在动态调整时间上下文,当相机重新访问熟悉区域时可重用早期步骤的模型输出。基于此洞察,Light Interaction结合了自适应上下文管理、去噪缓存加速以及硬件-软件协同设计的3D块稀疏注意力(融合Triton内核)。在HY-WorldPlay和Matrix-Game-3.0上的评估表明,Light Interaction在无需模型重训练的情况下实现了最高2.59倍加速,同时保持有竞争力的视觉质量。

英文摘要

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2606.15015 2026-06-19 cs.CV cs.AI 版本更新

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

2601.03112 2026-06-19 eess.IV cs.CV 版本更新

DiT-JSCC: Rethinking Deep JSCC with Diffusion Transformers and Semantic Representations

DiT-JSCC:基于扩散变换器与语义表示的深度JSCC再思考

Kailin Tan, Jincheng Dai, Sixian Wang, Guo Lu, Shuo Shao, Kai Niu, Wenjun Zhang, Ping Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Shanghai Jiao Tong University(上海交通大学) University of Shanghai for Science and Technology(上海科技大学)

AI总结 提出DiT-JSCC框架,联合学习语义优先表示编码器和扩散变换器生成解码器,通过粗细粒度条件解码和基于Kolmogorov复杂度的自适应带宽分配,在极端信道条件下提升语义一致性与传输效率。

Comments 14pages, 14figures, 2tables

详情
AI中文摘要

生成式联合源信道编码(GJSCC)已成为一种新的深度JSCC范式,用于在极端无线信道条件(如超低带宽和低信噪比)下实现高保真和鲁棒的图像传输。近期研究通常采用扩散模型作为生成解码器,但经常产生视觉上逼真但语义一致性有限的结果。这种局限性源于面向重建的JSCC编码器与生成解码器之间的根本性不匹配,因为前者缺乏显式的语义判别能力,无法提供可靠的条件线索。在本文中,我们提出DiT-JSCC,一种新颖的GJSCC骨干网络,能够联合学习语义优先的表示编码器和基于扩散变换器(DiT)的生成解码器,我们的开源项目旨在促进GJSCC的未来研究。具体来说,我们设计了一个语义-细节双分支编码器,与从粗到细的条件DiT解码器自然对齐,在极端信道条件下优先考虑语义一致性。此外,受Kolmogorov复杂度启发,引入了一种无需训练的自适应带宽分配策略,以进一步提高传输效率,从而真正重新定义生成解码时代的信息价值概念。大量实验表明,DiT-JSCC在语义一致性和视觉质量上始终优于现有JSCC方法,尤其是在极端条件下。

英文摘要

Generative joint source-channel coding (GJSCC) has emerged as a new Deep JSCC paradigm for achieving high-fidelity and robust image transmission under extreme wireless channel conditions, such as ultra-low bandwidth and low signal-to-noise ratio. Recent studies commonly adopt diffusion models as generative decoders, but they frequently produce visually realistic results with limited semantic consistency. This limitation stems from a fundamental mismatch between reconstruction-oriented JSCC encoders and generative decoders, as the former lack explicit semantic discriminability and fail to provide reliable conditional cues. In this paper, we propose DiT-JSCC, a novel GJSCC backbone that can jointly learn a semantics-prioritized representation encoder and a diffusion transformer (DiT) based generative decoder, our open-source project aims to promote the future research in GJSCC. Specifically, we design a semantics-detail dual-branch encoder that aligns naturally with a coarse-to-fine conditional DiT decoder, prioritizing semantic consistency under extreme channel conditions. Moreover, a training-free adaptive bandwidth allocation strategy inspired by Kolmogorov complexity is introduced to further improve the transmission efficiency, thereby indeed redefining the notion of information value in the era of generative decoding. Extensive experiments demonstrate that DiT-JSCC consistently outperforms existing JSCC methods in both semantic consistency and visual quality, particularly in extreme regimes.

7. 3D视觉、点云与空间智能 6 篇

2508.15228 2026-06-19 cs.CV 版本更新

Collaborative Multi-Modal Coding for High-Quality 3D Generation

协作多模态编码用于高质量3D生成

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University, Singapore(南洋理工大学S实验室) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出TriMM,首个前馈式3D原生生成模型,通过协作多模态编码融合RGB、RGBD和点云特征,结合辅助2D/3D监督和三平面潜在扩散模型,实现高质量3D资产生成。

详情
AI中文摘要

3D内容本质上具有多模态特性,可投影到不同模态(如RGB图像、RGBD和点云)。每种模态在3D资产建模中表现出独特优势:RGB图像包含生动的3D纹理,而点云定义精细的3D几何。然而,现有大多数3D原生生成架构要么主要在单模态范式下运行——从而忽略了多模态数据的互补优势,要么局限于3D结构,从而限制了可用训练数据集的范围。为了全面利用多模态进行3D建模,我们提出了TriMM,这是第一个从基本多模态(如RGB、RGBD和点云)学习的前馈式3D原生生成模型。具体来说,1) TriMM首先引入协作多模态编码,该编码在保留各模态独特表示优势的同时整合模态特定特征。2) 此外,引入辅助2D和3D监督以提高多模态编码的鲁棒性和性能。3) 基于嵌入的多模态编码,TriMM采用三平面潜在扩散模型生成更高质量的3D资产,增强了纹理和几何细节。在多个知名数据集上的大量实验表明,TriMM通过有效利用多模态,尽管使用少量训练数据,仍能达到与在大规模数据集上训练的模型相竞争的性能。此外,我们在最近的RGB-D数据集上进行了额外实验,验证了将其他多模态数据集纳入3D生成的可行性。

英文摘要

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms-thus overlooking the complementary benefits of multi-modality data-or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

2512.00850 2026-06-19 cs.CV 版本更新

Smol-GS: Compact Representations for Abstract 3D Gaussian Splatting

Smol-GS: 抽象3D高斯溅射的紧凑表示

Haishan Wang, Mohammad Hassan Vali, Arno Solin

发表机构 * ELLIS Institute Finland(芬兰ELLIS研究所) Aalto University(阿alto大学)

AI总结 提出Smol-GS方法,通过八叉树位置编码和熵压缩学习高效溅射特征,实现3D高斯溅射的紧凑表示,在保持渲染质量的同时大幅降低存储。

详情
AI中文摘要

我们提出Smol-GS,一种学习3D高斯溅射(3DGS)紧凑表示的新方法。我们的方法学习高效的逐溅射特征来建模3D空间,这些特征捕获抽象线索,包括颜色、不透明度、变换和材质属性。我们提出八叉树导出的位置编码,显式建模空间局部性并增强表示效率。我们进一步应用基于熵的压缩来利用特征冗余,并使用递归体素层次压缩溅射坐标。这种设计在保持表示灵活性的同时,实现了数量级的存储减少。Smol-GS在标准基准测试上以高渲染质量实现了最先进的压缩性能。

英文摘要

We present Smol-GS, a novel method for learning compact representations for 3D Gaussian Splatting (3DGS). Our approach learns highly efficient splat-wise features to model 3D space, which capture abstracted cues, including color, opacity, transformation, and material properties. We propose octree-derived positional encoding, which explicitly models spatial locality and enhances representation efficiency. We further apply entropy-based compression to exploit feature redundancy and compress splat coordinates using a recursive voxel hierarchy. This design enables orders-of-magnitude reduction in storage while preserving representation flexibility. Smol-GS achieves state-of-the-art compression performance on standard benchmarks with high-level rendering quality.

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Bosch Research(博世研究院) University of Haifa(海法大学)

AI总结 提出潜在高斯泼溅(LaGS)方法,通过特征高斯体作为动态关键点实现多视图特征聚合,用于4D全景占据跟踪,在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

详情
AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而,现有方法通常只解决部分问题:它们要么通过边界框提供粗略的几何跟踪,要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中,我们提出了潜在高斯泼溅(LaGS)用于4D全景占据跟踪(4D-POT)。我们重新审视底层表示,将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点,在泼溅到体素网格进行解码之前,能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互,这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节,进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型:this https URL。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2606.15908 2026-06-19 cs.CV 版本更新

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉:基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR(谷歌XR) University of Science and Technology of China (USTC)(中国科学技术大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出无需模板和标记的多视角系统,通过跨视角几何与时间线索的Transformer初始化,结合物理感知高斯优化,实现鲁棒且无伪影的4D手-物体交互重建。

Comments Project page: https://hostpg.github.io/

详情
AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互(HOI)数据的需求日益增长,但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果,但它们对手和物体姿态的初始估计高度敏感。然而,从图像中估计这些姿态具有挑战性,尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统,用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体,无需任何模板或标记。我们的系统包含两个主要创新组件:(1)一个多视角前馈Transformer模型,聚合跨视角几何和时间线索,为姿态和密集物体几何提供可靠的、度量一致的初始化;(2)一个手-物体物理感知高斯优化框架,用于细化初始估计,集成四面体约束、碰撞细化和外观分解,以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明,我们的流程实现了高度鲁棒、无伪影的重建,为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page are available at https://zyshen021.github.io/HOSTPG/.

2606.15966 2026-06-19 cs.CV cs.GR 版本更新

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

AI总结 提出面向有限视角(约20个)的端到端手部动态捕捉与配准管线,通过无掩膜神经方法和物理启发框架解决几何歧义与自接触变形难题,在12000+序列上验证了高保真重建与配准。

详情
AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础,但在实际多视角系统中仍具挑战性,这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线,专为视图高效设置(约20个视角)设计。我们通过两项主要创新应对关键挑战。首先,为克服重建困难(如视角重叠有限和背景杂乱),我们的无掩膜神经方法通过场景参数化和场景特定密度正则化,从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次,针对配准挑战(如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果),我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数,将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下,捕捉精细表面变形,确保在严重关节运动和自接触下的合理结果,并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性,并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面:https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page are available at https://vephand.github.io/.

2503.01425 2026-06-19 cs.GR cs.CV 版本更新

MeshPad: Interactive Sketch-Conditioned Artist-Reminiscent Mesh Generation and Editing

MeshPad: 交互式草图条件艺术家风格网格生成与编辑

Haoxuan Li, Ziya Erkoc, Lei Li, Daniele Sirigatti, Vladislav Rosov, Angela Dai, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑技术大学) AUDI AG(奥迪股份公司)

AI总结 提出MeshPad,一种基于草图输入的交互式3D网格生成与编辑方法,通过分解为网格区域的删除和添加操作,结合Transformer和顶点对齐推测策略,实现快速迭代编辑,在Chamfer距离上提升22%以上质量,并获90%用户偏好。

Comments Project page: https://derkleineli.github.io/meshpad/ Video: https://www.youtube.com/watch?v=_T6UTGTMZ1E

详情
AI中文摘要

我们介绍了MeshPad,一种从草图输入生成3D网格的生成方法。基于最近在艺术家风格三角形网格生成方面的进展,我们的方法解决了交互式网格创建的需求。为此,我们专注于通过将编辑分解为网格区域的“删除”和随后新网格几何的“添加”来实现一致编辑。这两个操作都由用户对草图图像的简单编辑触发,促进了迭代内容创建过程,并能够构建复杂的3D网格。我们的方法基于三角形序列网格表示,利用大型Transformer模型进行网格三角形的添加和删除。为了交互式地执行编辑,我们在加法网格生成器之上引入了一种顶点对齐的推测预测策略。该推测器预测对应于一个顶点的多个输出标记,从而显著降低推理的计算成本并加速编辑过程,使得每个编辑步骤只需几秒钟即可完成。综合实验表明,MeshPad优于最先进的草图条件网格生成方法,在Chamfer距离上实现了超过22%的网格质量改进,并且在感知评估中被90%的参与者所偏好。

英文摘要

We introduce MeshPad, a generative approach that creates 3D meshes from sketch inputs. Building on recent advances in artist-reminiscent triangle mesh generation, our approach addresses the need for interactive mesh creation. To this end, we focus on enabling consistent edits by decomposing editing into 'deletion' of regions of a mesh, followed by 'addition' of new mesh geometry. Both operations are invoked by simple user edits of a sketch image, facilitating an iterative content creation process and enabling the construction of complex 3D meshes. Our approach is based on a triangle sequence-based mesh representation, exploiting a large Transformer model for mesh triangle addition and deletion. In order to perform edits interactively, we introduce a vertex-aligned speculative prediction strategy on top of our additive mesh generator. This speculator predicts multiple output tokens corresponding to a vertex, thus significantly reducing the computational cost of inference and accelerating the editing process, making it possible to execute each editing step in only a few seconds. Comprehensive experiments demonstrate that MeshPad outperforms state-of-the-art sketch-conditioned mesh generation methods, achieving more than 22% mesh quality improvement in Chamfer distance, and being preferred by 90% of participants in perceptual evaluations.

8. 医学影像与生物视觉 7 篇

2602.22959 2026-06-19 cs.CV 版本更新

Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study

智能体能否在零样本设置中区分视觉上难以分离的疾病?一项初步研究

Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn

发表机构 * Department of Diagnostic and Interventional Radiology, University Hospital Aachen, 52074 Aachen, Germany(诊断与介入放射科,亚琛大学医院,德国亚琛,52074)

AI总结 本研究探索多模态大语言模型智能体在零样本下区分视觉混淆疾病(如黑色素瘤与不典型痣、肺水肿与肺炎)的能力,提出基于对比裁决的多智能体框架,在皮肤镜数据上准确率提升11个百分点,但总体性能仍不足临床部署。

Comments Code available at https://github.com/TruhnLab/Contrastive-Agent-Reasoning. Accepted by MICCAI 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)的快速进展引发了对基于智能体系统的日益关注。尽管大多数医学影像先前工作集中于自动化常规临床工作流程,我们研究了一个未被充分探索但临床意义重大的场景:在零样本设置中区分视觉上难以分离的疾病。我们在两个仅基于影像的代理诊断任务上对代表性智能体进行基准测试:(1)黑色素瘤与不典型痣,以及(2)肺水肿与肺炎,尽管临床管理存在显著差异,但视觉特征高度混淆。我们引入了一种基于对比裁决的多智能体框架。实验结果显示诊断性能提升(在皮肤镜数据上准确率提高11个百分点),并在定性样本上减少了无根据的声明,尽管整体性能仍不足以用于临床部署。我们承认人类注释中固有的不确定性以及临床背景的缺失,这进一步限制了向真实世界场景的转化。在此受控设置中,这项初步研究为视觉混淆场景下的零样本智能体性能提供了初步见解。

英文摘要

The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.

2603.01250 2026-06-19 cs.CV cs.AI 版本更新

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

MAMA-MIA挑战:推进乳腺MRI肿瘤分割与治疗反应预测的泛化性和公平性

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

发表机构 * Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona(巴塞罗那人工智能在医学实验室(BCN-AIM),巴塞罗那大学数学与计算机学院)

AI总结 提出MAMA-MIA挑战,通过标准化基准评估乳腺MRI肿瘤分割和病理完全缓解预测,在跨洲多中心数据上分析模型泛化性与公平性,发现性能与亚组公平性之间存在权衡。

详情
AI中文摘要

乳腺癌是全球女性中最常诊断的恶性肿瘤,也是癌症相关死亡的主要原因之一。动态对比增强磁共振成像在肿瘤表征和治疗监测中发挥核心作用,尤其是接受新辅助化疗的患者。然而,现有的乳腺磁共振成像人工智能模型通常使用异质性数据集、研究人群和评估协议进行开发和评估,使得直接比较困难,并限制了跨机构和临床相关患者亚组的模型鲁棒性理解。MAMA-MIA挑战旨在通过提供标准化基准来解决这些问题,该基准用于联合评估原发性肿瘤分割和仅使用治疗前磁共振成像预测病理完全缓解。训练队列包括来自美国多家机构的1506名患者,而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行,以评估跨大陆和跨机构的泛化性。统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度方面的亚组一致性。26个国际团队参加了最终评估阶段。结果表明,在共同的外部评估框架下,性能存在显著差异,并揭示了整体准确性与亚组公平性之间的权衡。该挑战提供了标准化数据集、评估协议和公共资源,以促进开发稳健且公平的乳腺癌影像人工智能系统。

英文摘要

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

2605.00665 2026-06-19 cs.CV 版本更新

Prediction of Alzheimer's Disease Risk Factors from Retinal Images via Deep Learning: Development and Validation of Biologically Relevant Morphological Associations in the UK Biobank

基于深度学习的视网膜图像预测阿尔茨海默病风险因素:英国生物银行中生物学相关形态学关联的开发和验证

Seowung Leem, Yunchao Yang, Adam J. Woods, Ruogu Fang

发表机构 * J. Crayton Pruitt Family Dept. of Biomedical Engineering, University of Florida(朱·克雷顿·普瑞特生物医学工程系,佛罗里达大学) University of Florida Research Computing(佛罗里达大学研究计算中心) Meta AI (FAIR)(Meta AI(FAIR)) School of Behavioral and Brain Sciences, University of Texas at Dallas(德克萨斯大学达拉斯分校行为与脑科学学院) Dept. of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Dept. of Computer and Information Science and Engineering, University of Florida(佛罗里达大学计算机与信息科学与工程系) Center for Cognitive Aging and Memory, University of Florida(佛罗里达大学认知衰老与记忆中心)

AI总结 利用深度学习从视网膜彩色眼底照片预测12个阿尔茨海默病相关风险因素,并揭示其背后的视网膜结构特征,发现视神经头和视网膜血管等区域与风险因素及阿尔茨海默病前期变化相关。

Comments Accepted to the "Journal of Alzheimer's Disease" for publication

详情
AI中文摘要

系统性的、代谢性的、生活方式的因素已通过流行病学和AD特异性生物标志物研究与阿尔茨海默病(AD)建立关联。彩色眼底摄影(CFP)是否包含与这些AD相关风险域相对应的视网膜结构特征仍不清楚。为了确定深度学习(DL)模型能否从CFP预测12个AD相关风险因素,并表征这些预测背后的视网膜结构,从而评估CFP是否反映AD易感性的通路。使用来自英国生物银行的44,501名独特参与者的62,876张CFP,训练DL模型预测与AD发病率相关的12个因素:6个分类变量(性别、吸烟、失眠、经济状况、饮酒、抑郁)和6个连续变量(年龄、受教育完成年龄、BMI、收缩压、舒张压、HbA1c)。评估模型性能、模型显著性和显著性衍生得分(CAM-Score),并与视网膜形态测量进行比较。还将得分在AD发病病例(平均发病前8.55年)与匹配对照之间进行比较。DL的性能范围为分类变量的AUROC=0.5654-0.9480,连续变量的R2=-0.0291-0.7620,优于大多数形态测量-机器学习模型。基于显著性的得分一致地突出了生物学上有意义的区域,特别是视神经头和视网膜血管。它也与现有的形态测量变异一致。多个基于显著性的得分在AD发病病例与匹配对照之间存在显著差异,表明风险因素的视网膜相关性与临床前AD相关变化之间存在潜在重叠。CFP编码了与AD风险因素相关的视网膜特征。尽管不具有诊断性,但DL衍生的视网膜表征可能揭示反映潜在AD易感性的生物学上有意义的风险相关结构变化。

英文摘要

The systemic, metabolic, lifestyle factors have established associations with Alzheimer's Disease (AD) through epidemiologic and AD-specific biomarker studies. Whether colored fundus photography (CFP) contains retinal structural signatures corresponding to these AD-related risk domains remains unclear. To determine whether deep learning (DL) models can predict 12 AD-related risk factors from CFP and to characterize the retinal structures underlying these predictions, thereby assessing whether CFP reflects pathways to AD vulnerability. Using 62,876 CFPs from 44,501 unique participants from the UK Biobank, DL models were trained to predict 12 factors linked to AD incidence: 6 categorical (sex, smoking, sleeplessness, economic status, alcohol use, depression) and 6 continuous (age, age at completing education, BMI, systolic, diastolic blood pressure, HbA1c). Model performance, model saliency, and saliency-derived scores (CAM-Score) were evaluated and compared to retinal morphometry. The scores were also compared between incident-AD cases (average 8.55 years before onset) and matched controls. Performance of DL ranged from AUROC= 0.5654-0.9480 for categorical and R2=-0.0291-0.7620 for continuous factors, outperforming most of the morphometry-machine learning models. Saliency-based score consistently highlighted biologically meaningful regions, particularly the optic nerve head and retinal vasculature. It also aligned with present morphometric variations. Several saliency-based scores differed significantly between incident AD and matched controls, suggesting potential overlap between retinal correlates of risk factors and preclinical AD-associated changes. CFP encodes retinal signatures linked to AD risk factors. Although not diagnostic, DL-derived retinal representations may uncover biologically meaningful risk-related structural changes mirroring the potential AD vulnerability.

2606.14957 2026-06-19 cs.CV 版本更新

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

学习用于多模态神经影像的稀疏潜在预测基础模型

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

发表机构 * New York University, Center for Data Science(纽约大学数据科学中心) NYU Grossman School of Medicine, Department of Radiology(纽约大学格罗斯曼医学院放射学系) State University of New York at Binghamton, School of Computing(纽约州立大学宾汉姆顿分校计算机学院) NYU Grossman School of Medicine, Department of Neurology(纽约大学格罗斯曼医学院神经病学系) NYU Grossman School of Medicine, Department of Neurosurgery(纽约大学格罗斯曼医学院神经外科学系) NYU Grossman School of Medicine, Department of Pathology(纽约大学格罗斯曼医学院病理学系) School of Medicine, Department of Radiology, Stanford(斯坦福大学医学院放射学系) NYU Grossman School of Medicine, Department of Neuroscience(纽约大学格罗斯曼医学院神经科学系) NYU Grossman School of Medicine, Neuroscience Institute(纽约大学格罗斯曼医学院神经科学研究所)

AI总结 提出Neuro-JEPA模型,结合潜在预测目标和专家混合架构,学习T1w、T2w和FLAIR三种MRI序列的统一表示,在25项临床任务和22项公开数据集任务上优于现有基础模型和CNN基线。

Comments Under Review Preprint

详情
AI中文摘要

脑部MRI通常作为多个互补序列采集,具有独特的对比度加权,包括T1加权成像(T1w)解剖对比和液体敏感T2加权(T2w)对比。然而,在健康系统规模上,跨多种MRI对比机制学习统一表示的方法尚缺乏。在本研究中,我们引入了Neuro-JEPA,一种稀疏多模态神经影像基础模型,它结合了潜在预测目标和专家混合架构,以编码跨核心T1w、T2w和液体抑制FLAIR成像(FLAIR)的脑部MRI。我们进一步对架构、掩码、目标和稀疏性设计选择进行了系统的方法论研究,这些选择有利于稳健的神经影像多模态表示学习。Neuro-JEPA在428,647项研究的1,551,862次扫描上进行了预训练,这些扫描经过了模态特定的预处理和跨三种核心结构脑部MRI序列的数据整理。我们在临床和研究环境中评估了学习到的表示,包括来自三个健康系统(NYU Langone、NYU Long Island和Massachusetts General Hospital)的25项任务,以及来自12个公开数据集的22项任务,涵盖了单模态、多模态和跨域评估配置。在这些基准测试中,现有的神经影像基础模型相对于简单的卷积神经网络(CNN)基线显示出不一致的提升,而Neuro-JEPA在所有评估设置中实现了更强且更一致的性能。这些结果建立了一个可扩展的多模态神经影像表示学习方法论框架,并强调了基础模型评估协议需要包括简单基线、临床异质性队列和受控的多模态比较。

英文摘要

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighed imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

2405.10705 2026-06-19 eess.IV cs.CV 版本更新

3D Vessel Reconstruction from Sparse-View Dynamic DSA Images via Vessel Probability Guided Attenuation Learning

基于血管概率引导衰减学习的稀疏视角动态DSA图像三维血管重建

Zhentao Liu, Huangxuan Zhao, Wenhui Qin, Zhenghong Zhou, Xinggang Wang, Wenping Wang, Xiaochun Lai, Chuansheng Zheng, Dinggang Shen, Zhiming Cui

发表机构 * School of Biomedical Engineering \& State Key Laboratory of Advanced Medical Materials Devices, ShanghaiTech University, Shanghai, China National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China School of Electronic Information Communications, Huazhong University of Science Department of Computer Science \& Engineering, Texas A\&M University, USA

AI总结 提出血管概率引导衰减学习框架,通过静态与动态衰减场互补加权实现稀疏视角DSA重建,降低辐射剂量,并采用渐进训练和时间扰动损失提升质量。

Comments Accepted by Medical Image Analysis (MedIA), 2026

详情
AI中文摘要

数字减影血管造影(DSA)是血管疾病诊断的金标准之一。借助造影剂,时间分辨的二维DSA图像提供全面的血流信息,可用于重建三维血管结构以进行医学评估。当前的商用DSA系统通常需要数百个扫描视角进行重建,导致大量辐射暴露。在本研究中,我们提出了一种基于神经渲染的优化框架,专门用于高质量稀疏视角DSA重建,以减少辐射剂量。我们的方法称为血管概率引导衰减学习,将DSA成像表示为静态和动态衰减场的互补加权组合,权重来自时间无关的血管概率场。作为前景掩膜,血管概率为静态和动态场提供适应不同场景类型的适当梯度。该机制实现了静态背景与动态造影剂流的自监督分解,并显著提高了重建质量。我们的模型通过最小化合成投影与真实DSA图像之间的差异进行训练。我们进一步采用两种训练策略来提高重建质量:(1)由粗到细的渐进训练以改善几何结构,以及(2)时间扰动渲染损失以保持时间一致性。实验结果表明了高质量的三维血管重建和二维DSA图像合成。

英文摘要

Digital Subtraction Angiography (DSA) is one of the gold standards for vascular disease diagnosis. With the help of a contrast agent, time-resolved 2D DSA images deliver comprehensive blood flow information and can be utilized to reconstruct 3D vessel structures for medical assessment. Current commercial DSA systems typically require hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. In this study, we propose a neural rendering-based optimization framework tailored for high-quality sparse-view DSA reconstruction to reduce radiation dosage. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the time-independent vessel probability field. Functioning as a foreground mask, vessel probability provides proper gradients for both static and dynamic fields adaptive to different scene types. This mechanism enables self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the discrepancy between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training for better geometry and (2) temporal perturbed rendering loss for temporal consistency. Experimental results have demonstrated high-quality 3D vessel reconstruction and 2D DSA image synthesis.

2503.23179 2026-06-19 eess.IV cs.CV 版本更新

OncoReg: Medical Image Registration for Oncological Challenges

OncoReg:面向肿瘤学挑战的医学图像配准

Wiebke Heyer, Yannic Elser, Lennart Berkel, Xinrui Song, Xuanang Xu, Pingkun Yan, Xi Jia, Jinming Duan, Zi Li, Tony C. W. Mok, BoWen LI, Tim Hable, Christian Staackmann, Christoph Großbröhmer, Lasse Hansen, Alessa Hering, Malte M. Sieren, Mattias P. Heinrich

发表机构 * Institute of Medical Informatics, University of Lübeck(吕贝克大学医学信息学研究所) Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院放射科和核医学研究所) Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute(伦塞拉塞尔理工学院生物医学工程系和生物技术与跨学科研究中心) School of Computer Science, University of Birmingham(伯明翰大学计算机科学学院) Division of Informatics, Imaging and Data Sciences, University of Manchester(曼彻斯特大学信息学、成像和数据科学系) DAMO Academy, Alibaba Group(阿里集团DAMO学院) Hangzhou Shengshi Technology Co., Ltd(杭州盛世科技有限公司) Department of Radiation Oncology, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院放射肿瘤科) EchoScout GmbH Radboud University Medical Center, Nijmegen(奈密根大学医学中心) Institute of Interventional Radiology, University Hospital Schleswig-Holstein(石勒斯维希-霍尔斯坦大学医院介入放射科)

AI总结 提出OncoReg挑战,通过两阶段框架在保护患者隐私的同时开发可泛化的图像配准方法,用于放射治疗中锥束CT与扇束CT的配准,发现特征提取是关键,深度学习和经典方法结合最有效。

Comments 21 pages, 13 figures

详情
AI中文摘要

在现代癌症研究中,由于患者隐私相关的挑战,产生的大量医学数据往往未被充分利用。OncoReg挑战通过一个两阶段框架解决了这一问题,该框架使研究人员能够在确保患者隐私的同时开发和验证图像配准方法,并促进更可泛化的AI模型的发展。第一阶段涉及使用公开可用的数据集,第二阶段则专注于在安全的医院网络内对私有数据集进行模型训练。OncoReg建立在Learn2Reg挑战的基础上,纳入了放射治疗中介入性锥束计算机断层扫描与标准计划扇束CT图像的配准。准确的图像配准在肿瘤学中至关重要,特别是在图像引导放射治疗的动态治疗调整中,需要精确对齐以最小化对健康组织的辐射暴露,同时有效靶向肿瘤。本文详细介绍了OncoReg挑战的方法和数据,并对竞赛参赛作品和结果进行了全面分析。研究发现,特征提取在此配准任务中起着关键作用。从该挑战中涌现的一种新方法展示了其多功能性,而现有方法的表现与新技术相当。深度学习和经典方法在图像配准中仍扮演重要角色,尤其是方法的组合,特别是在特征提取方面,被证明最为有效。

英文摘要

In modern cancer research, the vast volume of medical data generated is often underutilised due to challenges related to patient privacy. The OncoReg Challenge addresses this issue by enabling researchers to develop and validate image registration methods through a two-phase framework that ensures patient privacy while fostering the development of more generalisable AI models. Phase one involves working with a publicly available dataset, while phase two focuses on training models on a private dataset within secure hospital networks. OncoReg builds upon the foundation established by the Learn2Reg Challenge by incorporating the registration of interventional cone-beam computed tomography with standard planning fan-beam CT images in radiotherapy. Accurate image registration is crucial in oncology, particularly for dynamic treatment adjustments in image-guided radiotherapy, where precise alignment is necessary to minimise radiation exposure to healthy tissues while effectively targeting tumours. This work details the methodology and data behind the OncoReg Challenge and provides a comprehensive analysis of the competition entries and results. Findings reveal that feature extraction plays a pivotal role in this registration task. A new method emerging from this challenge demonstrated its versatility, while established approaches continue to perform comparably to newer techniques. Both deep learning and classical approaches still play significant roles in image registration, with the combination of methods, particularly in feature extraction, proving most effective.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV 版本更新

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics(数学系) Department of Political and Social Sciences(政治与社会科学系)

AI总结 通过受控基准测试,比较量子与经典生成器在脑MRI数据增强中的性能,发现两者均未显著优于仅用真实数据训练,且量子生成器无额外优势。

详情
AI中文摘要

医学图像分类常受限于有限的标注数据,因此生成式增强被提出;最近,量子生成模型被用于此目的,并经常报告准确率提升。然而,这些声称通常基于单次训练运行,未匹配量子与经典生成器的参数预算,也未表征任何收益出现的数据范围。我们提出了一个受控基准测试,隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中,在该空间中,使用变分量子生成器或参数数量几乎相同的经典生成器(1648 vs. 1632)训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器,覆盖从5%到100%的标注数据比例,通过八个随机种子进行配对显著性检验(多重比较校正)以及集内多样性和潜在分布分析。在所有比例下,没有增强变体显著优于仅用真实数据训练,且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展:合成样本分布外移,并且在数据稀缺时严重模式崩溃,而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion:synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse thanits classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

9. 低层视觉、计算成像与图像增强 1 篇

2602.01391 2026-06-19 cs.CV 版本更新

Relighting as a Probe of Visual Priors via Augmented Latent Intrinsics

通过增强潜在本征属性将重光照作为视觉先验的探针

Xiaoyan Xing, Xiao Zhang, Sezer Karaoglu, Theo Gevers, Anand Bhattad

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam, Amsterdam, Netherlands(乌得勒支大学阿姆斯特丹分校博世Delta实验室) The University of Chicago, Chicago, USA(芝加哥大学) Johns Hopkins University, Baltimore, USA(约翰霍普金斯大学)

AI总结 提出增强潜在本征属性(ALI)方法,融合密集像素对齐视觉特征到潜在本征重光照模型,平衡语义与光度保真度,提升复杂材质重光照质量。

Comments Camera-ready version for ICML 2026. Project page: https://augmented-latent-intrinsics.github.io

详情
AI中文摘要

图像到图像的重光照需要能够将光照与场景属性分离,同时保留密集几何、材质和光度线索的表征。我们将此任务用作视觉先验的探针:与奖励不变性的识别任务不同,重光照测试视觉特征是否保留光传输所需的信息。通过一个受控的生成式重光照框架,我们发现强语义编码器会降低重光照质量,揭示了抽象与物理保真度之间的语义-光度权衡。我们引入了增强潜在本征属性(ALI),通过将密集的、像素对齐的视觉特征融合到潜在本征重光照模型中,并在未标注的真实图像对上通过自监督进行细化,来平衡这一权衡。ALI提高了重光照质量,尤其是在光泽、金属和透明材质上,并证明了生成式重光照是量化视觉编码器对物理世界编码内容的有效工具。

英文摘要

Image-to-image relighting requires representations that separate illumination from scene properties while preserving dense geometry, material, and photometric cues. We use this task as a probe of visual priors: unlike recognition tasks that reward invariance, relighting tests whether visual features retain the information needed for light transfer. Through a controlled generative relighting framework, we find that strong semantic encoders can degrade relighting quality, exposing a semantic--photometric trade-off between abstraction and physical fidelity. We introduce Augmented Latent Intrinsics (ALI), which balances this trade-off by fusing dense, pixel-aligned visual features into a latent-intrinsic relighting model and refining it with self-supervision on unlabeled real image pairs. ALI improves relighting quality, especially on glossy, metallic, and transparent materials, and demonstrates that generative relighting is an effective tool for quantifying what visual encoders encode about the physical world.

10. 鲁棒性、安全、隐私与可信视觉 4 篇

2511.04260 2026-06-19 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学)

AI总结 提出Proto-LeakNet,利用扩散模型中的信号泄漏痕迹,结合闭集分类与密度开集评估,实现可解释的生成器归因,在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

详情
AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与基于密度的开集评估相结合,对学习到的嵌入进行开集评估,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC,Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取:this https URL。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13\%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet .

2510.27285 2026-06-19 cs.CV cs.CR 版本更新

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

重新思考扩散模型中的鲁棒对抗性概念擦除

Qinghong Yin, Yu Tian, Heming Yang, Xiang Chen, Xianlin Zhang, Yue Ming, Xueming Li, Yue Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua University(计算机科学与技术系,人工智能研究院,清华大学) University of Chinese Academy of Sciences(中国科学院大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

AI总结 针对扩散模型中概念擦除的对抗训练忽视概念语义导致拟合不足的问题,提出语义引导的鲁棒对抗概念擦除方法S-GRACE,显著提升擦除性能26%并减少90%训练时间。

详情
AI中文摘要

概念擦除旨在选择性地遗忘扩散模型(DMs)中的不良内容,以降低敏感内容生成的风险。作为概念擦除的一种新范式,现有方法大多采用对抗训练来识别和抑制目标概念,从而减少敏感输出的可能性。然而,这些方法常常忽视对抗训练在DMs中的特异性,导致仅能部分缓解。在这项工作中,我们从概念空间的角度调查并量化了这种特异性,即对抗样本能否真正拟合目标概念空间?我们观察到现有方法在生成对抗样本时忽视了概念语义的作用,导致对概念空间的拟合效果不佳。这种忽视导致了以下问题:1)当对抗样本较少时,它们无法全面覆盖目标概念;2)反之,它们会破坏其他目标概念空间。受这些发现分析的启发,我们引入了S-GRACE(语义引导的鲁棒对抗概念擦除),它优雅地利用概念空间内的语义引导来生成对抗样本并执行擦除训练。使用七种最先进方法和三种对抗提示生成策略在各种DM遗忘场景下进行的实验表明,S-GRACE显著提高了擦除性能26%,更好地保留了非目标概念,并将训练时间减少了90%。我们的代码可在此https URL获取。

英文摘要

Concept erasure aims to selectively unlearning undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which grace leveraging semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.

2605.07821 2026-06-19 cs.CV cs.AI 版本更新

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

通过对象共现分析缓解OOD检测中的简单性偏差

Boyang Dai, Chaoqi Chen, Yizhou Yu

发表机构 * The University of Hong Kong(香港大学) Shenzhen University(深圳大学) Shenzhen Loop Area Institute(深圳环城区域研究所)

AI总结 提出基于对象共现的OOD检测框架,通过解耦表示和分治策略区分近OOD,缓解简单性偏差,在多种设置下取得竞争结果。

Comments This paper has been accepted by CVPR2026

详情
AI中文摘要

分布外(OOD)检测对于确保深度学习模型的可靠性至关重要。现有方法大多关注正则纠缠表示以区分分布内(ID)和OOD数据,忽略了图像中丰富的上下文信息。这一问题在检测近OOD时尤其具有挑战性,因为具有简单性偏差的模型难以在解耦表示中学习判别性特征。人类视觉系统可以利用自然环境中对象的共现来促进场景理解。受此启发,我们提出了一种以对象为中心的OOD检测框架,学习捕捉图像中的对象共现(OCO)模式。该方法引入了一种新的OOD检测范式,通过预测测试样本的解耦表示来理解图像中的对象共现,然后根据ID训练数据中观察到的对象共现模式自适应地将模式分为三种场景,最后以分治方式进行OOD检测。通过这种方式,OCO可以通过考虑图像中存在的语义上下文关系来区分近OOD,避免仅关注简单、易学习区域的倾向。我们通过在具有挑战性和全频谱OOD设置下的实验评估了OCO,展示了竞争性结果,并证实了其处理语义和协变量偏移的能力。代码发布在:https://this https URL。

英文摘要

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2502.03227 2026-06-19 cs.LG cs.CV 版本更新

Adversarial Dependence Minimization

对抗性依赖最小化

Pierre-François De Plaen, Tinne Tuytelaars, Marc Proesmans, Luc Van Gool

发表机构 * CVL, ETH Zürich, Switzerland(CVL,苏黎世联邦理工学院,瑞士) INSAIT, Sofia University, Bulgaria(INSAIT,索菲亚大学,保加利亚)

AI总结 提出ADM算法,通过对抗博弈最小化特征维度间的统计依赖性,证明全局最优时达到相互独立,并应用于非线性去相关、图像分类泛化提升和自监督学习维度坍塌预防。

详情
AI中文摘要

最小冗余表示通常通过最小化特征协方差来学习。然而,基于协方差的方法无法消除所有依赖/冗余,因为线性不相关的变量仍可能表现出非线性关系。为了解决这个问题,我们引入了ADM,一种可微分的算法,通过对抗博弈最小化特征维度之间的统计依赖性:辅助网络识别依赖关系,而编码器去除它们。我们证明了在全局最优时实现了相互独立,经验验证了收敛性,并研究了三个潜在应用:将PCA扩展到非线性去相关、提高图像分类的泛化能力以及防止自监督学习中的维度坍塌。通过促进统计独立的表示,ADM为在多种应用中学习更鲁棒、更压缩和更泛化的表示铺平了道路。

英文摘要

Minimally redundant representations are typically learned by minimizing feature covariance. However, covariance-based methods fail to eliminate all dependencies/redundancies, as linearly uncorrelated variables can still exhibit nonlinear relationships. To address this, we introduce ADM, a differentiable algorithm that minimizes statistical dependence between feature dimensions through an adversarial game: auxiliary networks identify dependencies, while the encoder removes them. We prove that mutual independence is achieved at the global optimum, empirically verify convergence, and study three potential applications: extending PCA to nonlinear decorrelation, improving generalization in image classification, and preventing dimensional collapse in self-supervised learning. By promoting statistically independent representations, ADM paves the way for learning more robust, compressed, and generalizable representations across diverse applications.

11. 数据集、基准、评测与训练方法 7 篇

2411.10077 2026-06-19 cs.CV 版本更新

Hierarchical mutual distillation for multi-view fusion: Learning from all possible view combinations

多视角融合的分层互蒸馏:从所有可能的视角组合中学习

Jiwoong Yang, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(翰阳大学) Hankuk University of Foreign Studies(韩国民法大学)

AI总结 本文提出一种新颖的多视角不确定性加权互蒸馏方法,通过分层互蒸馏提升预测一致性,有效利用各视角信息并缓解不确定预测的影响。

Journal ref Pattern Recognition 178 (2026) 113432

详情
AI中文摘要

多视角学习常面临有效利用不同角度和位置拍摄图像的挑战,尤其是在处理视角间不一致性和不确定性时更为突出。本文提出了一种新颖的多视角不确定性加权互蒸馏(MV-UWMD)方法。我们的方法通过在所有可能的视角组合中进行分层互蒸馏来增强预测一致性,包括单视角、部分多视角和全多视角预测。这引入了一种基于不确定性的加权机制,通过互蒸馏有效利用每个视角的独特信息,同时减轻不确定预测的影响。我们扩展了CNN-Transformer混合架构以促进在多个视角组合中的稳健特征学习和整合。我们使用了一个大规模、非结构化的数据集进行广泛实验,该数据集来自多样且非固定视角的拍摄。结果表明,MV-UWMD相比现有多视角学习方法在预测准确性和一致性方面有所提升。

英文摘要

Multi-view learning often struggles to effectively leverage images captured from diverse angles and locations. Learning methods for unstructured multi-view images remain largely underexplored. We propose a novel Hierarchical Mutual Distillation for Multi-View Fusion (HMDMV) method, which can handle both structured and unstructured multi-view scenarios. It makes predictions utilizing all possible view combinations: single view, partial multi-view, and full multi-view. The method generates predictions for each view combination and then applies hierarchical mutual distillation to enhance inter-view consistency. An uncertainty-based weighting mechanism further refines the fusion process by adjusting the influence of each view combination according to its prediction confidence, reducing the impact of low-confidence views. Extensive experiments on large-scale structured and unstructured datasets demonstrate that HMDMV consistently achieves state-of-the-art classification accuracy. Another unique advantage of HMDMV is that it provides improved flexibility in inference, allowing for more or fewer view counts in inference than those used in training without additional processing. We also provide a light version with reduced training cost by designing an efficient strategy that randomly samples subsets of view combinations during each training iteration. These results highlight HMDMV's robustness in real-world settings where view availability is variable or incomplete. The code is available at https://github.com/labhai/HMDMV.

2512.24592 2026-06-19 cs.CV 版本更新

GH-ESD: Grounded Hypothesis-Driven Error Slice Discovery for Instance-Level Vision Tasks

GH-ESD:基于假设驱动的实例级视觉任务错误切片发现

Wei Zhang, Chaoqun Wang, Zixuan Guan, Sam Kao, Pengfei Zhao, Peng Wu, Sifeng He

发表机构 * Apple(苹果公司)

AI总结 提出GH-ESD框架,通过LLM生成假设与视觉语言模型验证,在实例级任务中自动发现空间关系错误切片,并构建GESD基准,显著提升检测和分割任务的错误切片发现精度。

Comments Accepted by ECCV2026

详情
AI中文摘要

视觉模型在语义一致子集上的系统性失败(称为错误切片)揭示了鲁棒性和评估的局限性。现有的切片发现方法主要将切片建模为表示空间中的聚类或预定义属性的组合。虽然对图像级分类有效,但这种公式对于目标检测和分割等实例级任务不足,因为失败通常源于上下文关系性和空间定位的视觉模式。我们提出GH-ESD(基于假设驱动的实例级错误切片发现),一个生成与验证框架,将切片发现重新表述为基于假设的生成和统计验证。GH-ESD利用LLM先验和基于空间的视觉证据构建关系失败假设,通过视觉语言模型在实例级发现假设切片,并通过实例级错误的统计趋势分析进行验证。我们还引入了GESD(基于空间的错误切片数据集),一个用于实例级错误切片发现的新基准,提供由专家定义且基于空间的切片,这些切片源自检测和分割失败。大量实验表明,GH-ESD持续优于基线,在检测任务的GESD基准上Precision@10提高了0.10(0.73对比0.63),同时也支持分割场景。GH-ESD识别出可解释的切片,促进可操作的模型改进。GESD数据集将在接收后公开。

英文摘要

Systematic failures of vision models on semantically coherent subsets, known as error slices, reveal limitations in robustness and evaluation. Existing slice discovery approaches largely model slices as clusters in representation space or combinations of predefined attributes. While effective for image-level classification, such formulations are insufficient for instance-level tasks such as object detection and segmentation, where failures often arise from contextual relational and spatially grounded visual patterns. We propose GH-ESD (Grounded Hypothesis-Driven Error Slice Discovery), a generate and verify framework that reformulates slice discovery as grounded hypothesis generation and statistical verification. GH-ESD constructs relational failure hypotheses using LLM priors and grounded visual evidence, discovers hypothesis slices at the instance level via Vision Language Models, and verifies them through statistical trend analysis over instance-level errors. We also introduce GESD (Grounded Error Slice Dataset), a new benchmark for instance-level error slice discovery, providing expert-defined and spatially grounded slices derived from detection and segmentation failures. Extensive experiments demonstrate that GH-ESD consistently outperforms baselines, improving Precision@10 by 0.10 (0.73 vs. 0.63) on the GESD benchmark for detection tasks, while also supporting segmentation scenarios. GH-ESD identifies interpretable slices that facilitate actionable model improvements. The GESD dataset will be made publicly available upon acceptance.

2604.13240 2026-06-19 cs.CV cs.LG 版本更新

A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

基于概念的可解释AI的高分辨率景观数据集及其在物种分布模型中的应用

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

发表机构 * Université Rennes 2, CNRS, Nantes Université, Univ Brest, LETG, UMR 6554(里昂大学第二分校、法国国家科学研究中心、南特大学、布列塔尼大学、LETG、UMR 6554) LTSER Zone Atelier Armorique(Armorique 领域实验室区) University of Würzburg, Center for Artificial Intelligence and Data Science(乌尔姆大学、人工智能与数据科学中心)

AI总结 提出首个基于概念的可解释AI方法用于物种分布模型,利用高分辨率多光谱和LiDAR无人机影像构建景观概念数据集,通过Robust TCAV量化景观概念对模型预测的影响,案例研究验证了方法的有效性。

详情
AI中文摘要

绘制物种空间分布对于保护政策和入侵物种管理至关重要。物种分布模型(SDMs)是完成此任务的主要工具,具有两个目的:实现稳健的预测性能,同时提供关于分布驱动因素的生态见解。然而,深度学习SDMs日益增长的复杂性使得提取这些见解更具挑战性。为了调和这些目标,我们提出了首个基于概念的可解释AI(XAI)在SDMs中的实现。我们利用Robust TCAV(测试与概念激活向量)方法量化景观概念对模型预测的影响。为此,我们提供了一个新的开放获取的景观概念数据集,该数据集源自高分辨率多光谱和LiDAR无人机影像。它包括跨越15个不同景观概念的653个斑块和1,450个随机参考斑块,旨在适用于广泛的物种。我们通过两个水生昆虫(襀翅目和毛翅目)的案例研究,使用两个卷积神经网络和一个视觉Transformer来展示这种方法。结果表明,基于概念的XAI有助于根据专家知识验证SDMs,同时发现产生新生态假说的新颖关联。Robust TCAV还提供了景观层面的信息,对政策制定和土地管理有用。代码和数据集公开可用。

英文摘要

Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

2604.13416 2026-06-19 cs.CV cs.AI 版本更新

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K:用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

发表机构 * University of Technology Sydney(悉尼科技大学) University of Sydney(悉尼大学) National Yang Ming Chiao Tung University(阳明交通大学)

AI总结 为弥补无干扰辐射场领域缺乏大规模真实世界数据集的空白,构建了包含1048个场景、每场景提供干净和杂乱图像集的DF3DV-1K数据集,并基于此基准测试了九种最新方法,识别出最鲁棒的方法和最具挑战的场景。

详情
AI中文摘要

辐射场领域的进展已实现逼真的新视角合成。在多个领域中,已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而,对于无干扰辐射场,每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏,限制了发展。为填补这一空白,我们引入了DF3DV-1K,一个包含1048个场景的大规模真实世界数据集,每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像,模拟随意拍摄,涵盖128种干扰类型和161种场景主题,包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K,我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试,识别出最鲁棒的方法和最具挑战的场景。除了基准测试,我们还展示了DF3DV-1K的一个应用:微调基于扩散的2D增强器以改进辐射场方法,在保留集(例如DF3DV-41)和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展,并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取:此 https URL。

英文摘要

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出CADBench,一个统一的多模态CAD程序生成基准,包含18000个样本和六类基准,评估11种视觉语言模型,揭示了CAD程序生成中的三种常见失败模式。

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2606.10136 2026-06-19 cs.CV 版本更新

iSAGE: A Human-in-the-Loop Framework for Remote Sensing Semantic Segmentation via Sparse Point Supervision

iSAGE: 一种通过稀疏点监督进行遥感语义分割的人机协同框架

Osmar Luiz Ferreira de Carvalho, Osmar Abilio de Carvalho Junior, Anesmar Olino de Albuquerque, Daniel Guerreiro e Silva

AI总结 提出iSAGE框架,通过专家点击模型错误像素而非任意像素,无需辅助机制即可匹配密集监督,在BsB Aerial和ISPRS Vaihingen数据集上以极低标注率达到与密集监督相当的性能。

Comments 47 pages, 8 tables, 6 figures

详情
AI中文摘要

遥感中的语义分割需要昂贵的像素级标注,且由于模型很少能在传感器、平台或地理区域间迁移,几乎每个问题都需要新的数据集。现有的人机协同框架通过辅助机制(伪标签、传播、CRF、基础模型提示、辅助头)将稀疏点击扩展为密集监督,这些机制均基于模型的预测分布。在该分布中,一个自信的错误像素与一个自信的正确像素在结构上无法区分,因此任何读取该分布的规则都无法区分两者;区分信号位于模型外部。本文假设,专家针对模型错误(而非任意像素)的点击足以匹配密集监督,无需扩展机制。iSAGE(基于专家指导的迭代稀疏标注)在一个集成的开源平台上实现了这一假设,其中错误加权损失放大了每次点击的梯度,而标注记录本身即为数据集,可扩展、可纠正、可审计。实验采用最小努力策略:每帧每类最多一个标注像素。在BsB Aerial上,iSAGE恢复了密集监督的97.2%(在0.040%的像素上达到74.79% mIoU),并呈现出对比性的类别动态:无定形类别(渗透区域)从种子点开始饱和,而小类别(汽车)需要后期迭代的努力。在ISPRS Vaihingen(外部基准)上,iSAGE以0.011%的像素达到76.78% mIoU,匹配密集基线(76.65%)并超越所有已发表方法。在相同流程下,四种输出读取机制(预算1-100倍的oracle熵、阈值0.90-0.99的伪标签、基于CRF的传播、均匀随机)比iSAGE低7.4至14.5个百分点。在调查的31种方法中,iSAGE是唯一无需辅助机制即可运行的迭代式人机协同框架。

英文摘要

Semantic segmentation in remote sensing requires costly pixel-level annotations, and nearly every problem demands a new dataset since models rarely transfer across sensors, platforms, or geographies. Existing human-in-the-loop frameworks expand sparse clicks into dense supervision via auxiliary machinery (pseudo-labels, propagation, CRFs, foundation-model prompts, auxiliary heads), all operating on the model's predictive distribution. A confidently wrong pixel is indistinguishable from a confidently correct one in that distribution by construction, so no rule reading it can separate the two; the distinguishing signal is external to the model. This paper hypothesizes that expert clicks targeting confident model errors, not arbitrary pixels, suffice to match dense supervision, with no expansion machinery. iSAGE (Iterative Sparse Annotation Guided by Expert) realizes this hypothesis on an integrated open-source platform, where an error-weighted loss amplifies the gradient at each click and the annotation record itself is the dataset, extensible, correctable, and auditable. Experiments use a minimum-effort regime: at most one labeled pixel per class per frame. On BsB Aerial, iSAGE recovers 97.2% of dense supervision (74.79% mIoU on 0.040% of pixels) with contrasting class dynamics: amorphous classes (permeable areas) saturate from the seed, while small classes (cars) require late-iteration effort. On ISPRS Vaihingen (external benchmark), iSAGE reaches 76.78% mIoU with 0.011% of pixels, matching the dense baseline (76.65%) and exceeding all published methods. Under the same pipeline, four output-reading mechanisms (oracle entropy across budgets 1--100x, pseudo-labels across thresholds 0.90--0.99, CRF-based propagation, uniform random) plateau 7.4 to 14.5 pp below iSAGE. Across 31 surveyed methods, iSAGE is the only iterative human-in-the-loop framework operating without auxiliary machinery.

2507.23534 2026-06-19 cs.LG cs.CV 版本更新

Continual Learning with Support Boundary Experience Blending

支持边界经验混合的持续学习

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

发表机构 * National Taiwan University(国立台湾大学)

AI总结 提出经验混合框架,通过差分隐私启发的噪声生成支持边界数据,联合训练样本和边界数据以正则化决策边界,在多个数据集上提升持续学习准确率。

详情
AI中文摘要

持续学习旨在减轻模型在顺序任务训练时的灾难性遗忘。常见方法经验回放存储过去的样本,但仅稀疏地近似数据分布,导致决策边界脆弱且过于简化。我们通过引入支持边界数据来解决这一限制,该数据通过差分隐私启发的噪声注入潜在特征,生成边界邻近表示,隐式正则化决策边界。基于此,我们提出经验混合框架,通过双模型聚合策略联合训练样本和支持边界数据。经验混合有两个组成部分:(1) 潜在空间噪声注入以生成支持边界数据,(2) 联合利用样本和支持边界数据的端到端训练。与标准经验回放不同,支持边界数据丰富了决策边界附近的特征空间,从而实现更稳定和鲁棒的持续学习。在CIFAR-10、CIFAR-100、Tiny ImageNet和ImageNet1K上的大量实验分别展示了10%、6%、13%和2%的持续准确率提升。

英文摘要

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained with sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated via differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 14%, 2%, respectively.

12. 其他/综合视觉 10 篇

2603.07236 2026-06-19 cs.CV 版本更新

HY-WU (Part I): An Extensible Functional Neural Memory Framework and An Instantiation in Text-Guided Image Editing

HY-WU (第一部分): 一种可扩展的功能性神经记忆框架及其在文本引导图像编辑中的应用

Mengxuan Wu, Xuanlei Zhao, Ziqiao Wang, Ruicheng Feng, Zhangyang Wang, Kai Wang

发表机构 * Tencent HY Team(腾讯 HY 团队)

AI总结 提出HY-WU框架,通过功能性神经记忆模块即时生成实例特定权重更新,避免共享权重覆盖导致的干扰,解决持续学习与个性化中的灾难性遗忘问题。

详情
AI中文摘要

基础模型正从离线预测器过渡到期望长时间运行的部署系统。在实际部署中,目标并非固定:领域漂移、用户偏好演变,以及模型发布后出现新任务。这将持续学习和即时个性化从可选功能提升为核心架构要求。然而,大多数适应流程仍遵循静态权重范式:训练后(或任何适应步骤后),推理执行单一参数向量,而不考虑用户意图、领域或实例特定约束。这将训练或适应后的模型视为参数空间中的单个点。在异构且持续演变的机制中,不同目标可能在参数上诱导分离的可行区域,迫使任何单一共享更新陷入妥协、干扰或过度专业化。结果,持续学习和个性化通常实现为对共享权重的重复覆盖,冒着先前学习行为退化的风险。我们提出HY-WU(权重释放),一种记忆优先的适应框架,将适应压力从覆盖单一共享参数点转移。HY-WU将功能性(算子级)记忆实现为神经模块:一个根据实例条件即时合成权重更新的生成器,产生实例特定算子而无需测试时优化。

英文摘要

Foundation models are transitioning from offline predictors to deployed systems expected to operate over long time horizons. In real deployments, objectives are not fixed: domains drift, user preferences evolve, and new tasks appear after the model has shipped. This elevates continual learning and instant personalization from optional features to core architectural requirements. Yet most adaptation pipelines still follow a static weight paradigm: after training (or after any adaptation step), inference executes a single parameter vector regardless of user intent, domain, or instance-specific constraints. This treats the trained or adapted model as a single point in parameter space. In heterogeneous and continually evolving regimes, distinct objectives can induce separated feasible regions over parameters, forcing any single shared update into compromise, interference, or overspecialization. As a result, continual learning and personalization are often implemented as repeated overwriting of shared weights, risking degradation of previously learned behaviors. We propose HY-WU (Weight Unleashing), a memory-first adaptation framework that shifts adaptation pressure away from overwriting a single shared parameter point. HY-WU implements functional (operator-level) memory as a neural module: a generator that synthesizes weight updates on-the-fly from the instance condition, yielding instance-specific operators without test-time optimization.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of ``hypothetical thinking'' in psychology literature, we argue the primary goal of a world model to be {\it simulating all actionable possibilities of the real world for purposeful reasoning and acting}. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2605.15231 2026-06-19 cs.LG cs.CV 版本更新

Mask-Morph Graph U-Net: A Generalisable Mesh-Based Surrogate for Crashworthiness Field Prediction under Large Geometric Variation

Mask-Morph Graph U-Net:一种通用的基于网格的替代模型,用于在大几何变化下预测碰撞worthiness领域

Haoran Li, Tobias Lehrer, Yingxue Zhao, Haosu Zhou, Philipp Stocker, Tobias Pfaff, Marcus Wagner, Nan Li

发表机构 * Dyson School of Design Engineering, Imperial College London(帝国理工学院伦敦设计工程学院) TUM School of Engineering and Design, Technical University of Munich(慕尼黑技术大学工程与设计学院) Faculty of Mechanical Engineering, OTH Regensburg(雷根斯堡机械工程学院) NVIDIA(NVIDIA公司)

AI总结 本文提出Mask-Morph Graph U-Net,通过特征对齐的重心参数化和节点掩码预训练,提升网格模拟的通用性和数据效率,适用于碰撞worthiness设计探索。

Comments 48 pages, 15 figures, jounral paper under review

详情
AI中文摘要

非线性有限元碰撞模拟准确但计算成本高,限制了其在迭代设计优化中的应用。基于图神经网络(GNN)的机器学习替代模型提供了更快的替代方案。消息传递GNN广泛用于网格模拟,其共享节点和边更新函数在不同图结构中相对通用。相比之下,非共享边特定聚合层能更准确地捕捉非线性关系,但通常需要固定图连接性,限制了通用性。本文提出Mask-Morph Graph U-Net(MMGUNet),一种解决分层图U-Net架构限制的方法,该架构使用边特定下采样和上采样层。固定粗图连接性是边特定层所必需的。为了在保留此连接性的同时提高空间对应性,所提出的方法通过特征对齐的重心参数化将粗化图层次变形到每个输入网格,然后构建跨图边。它进一步在监督预训练中应用节点掩码,随后进行参数高效的微调,其中高参数边特定层被冻结。所提出的方法在分布内、分布外和跨组件迁移设置中使用均欧距离和最大入侵百分比误差进行评估。结果表明,粗图变形相对于固定粗图基线提高了测试准确性,而掩码监督预训练减少了训练-测试差异并提高了迁移期间的数据效率。所提出的模型还比外部基线取得了更低的预测误差。这些结果展示了通往可重用、数据高效网格替代模型的实用路径,用于碰撞worthiness设计探索。

英文摘要

Nonlinear finite element crash simulations are accurate but computationally expensive, limiting their use in iterative design optimisation. Machine-learning surrogate models based on graph neural networks (GNNs) offer a faster alternative. Message-passing GNNs are widely used for mesh simulation, and their shared node and edge update functions are relatively generalisable across varying graph structures. By contrast, non-shareable edge-specific aggregation layers can capture nonlinear relationships more accurately but usually require fixed graph connectivity, which limits generalisability. This paper presents Mask-Morph Graph U-Net (MMGUNet), a practical approach to addressing the limitation of hierarchical Graph U-Net architectures that use edge-specific downsampling and upsampling layers. Fixed coarse graph connectivity is required for edge-specific layers. To retain this while improving spatial correspondence, the proposed method morphs the coarsened graph hierarchy to each input mesh using feature-aligned barycentric parameterisation before constructing cross-graph edges. It further applies node masking during supervised pretraining, followed by parameter-efficient fine-tuning in which high-parameter edge-specific layers are frozen. The proposed approach is evaluated in in-distribution, out-of-distribution, and cross-component transfer settings using mean Euclidean distance and maximum intrusion percentage error. Results show that coarse-graph morphing improves test accuracy relative to a fixed-coarse-graph baseline, while masked supervised pretraining reduces the train-test discrepancy and improves data efficiency during transfer. The proposed model also achieves lower prediction error compared with external baselines. These results demonstrate a practical route toward reusable, data-efficient mesh-based surrogate modelling for crashworthiness design exploration.

2605.00569 2026-06-19 cs.CV cs.GR 版本更新

2D-SuGaR: Surface-Aware Gaussian Splatting for Geometrically Accurate Mesh Reconstruction

Prajwal Gupta C. R., Divyam Sheth, Jinjoo Ha, Mirela Ostrek, Justus Thies

发表机构 * TU Darmstadt(图宾根大学) ELIZA(ELIZA实验室) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所)

Journal ref Eurographics 2026 Short Papers, The Eurographics Association, 2026

详情
英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for generating photorealistic renderings of a scene in real-time. However, the volumetric nature of 3DGS limits its ability to accurately capture surface geometry. To address this, 2D Gaussian Splatting (2DGS) was proposed to enable view-consistent and geometrically accurate surface reconstruction from multi-view images. However, 2DGS can be sensitive to the initialization of the Gaussian primitives. Reliance on Structure-from-Motion (SfM) initializations, which can produce poor estimates on challenging image sets, may lead to subpar results. In this work, we enhance 2DGS by incorporating monocular depth and normal priors to improve both geometric accuracy and robustness. We propose a depth-guided initialization strategy for Gaussians and introduce a clustering-based technique for pruning degenerate Gaussians. We evaluate our method on the DTU dataset, where it achieves state-of-the-art results in mesh reconstruction while preserving high-quality novel view synthesis.

2511.23071 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

发表机构 * Indian Institute of Technology Jodhpur(印度理工学院朱道尔)

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

详情
英文摘要

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

2603.27698 2026-06-19 cs.CV cs.DL 版本更新

Ink Detection from Surface Topography of the Herculaneum Papyri

Giorgio Angelotti, Federica Nicolardi, Paul Henderson, W. Brent Seales

发表机构 * Vesuvius Challenge, USA(维苏威挑战赛,美国) Università degli Studi di Napoli Federico II, Italy(那不勒斯费德里科二世大学,意大利) University of Glasgow, Scotland, UK(格拉斯哥大学,苏格兰,英国) EduceLab, University of Kentucky, USA(EduceLab,肯塔基大学,美国)

Comments 9 pages, 3 figures, 2 tables. Currently under review

Journal ref Scientific Reports (2026)

详情
英文摘要

Reading the Herculaneum papyri is challenging because both the scrolls and the ink, which is carbon-based, are carbonized. In X-ray radiography and tomography, ink detection typically relies on density- or composition-driven contrast, but carbon ink on carbonized papyrus provides little attenuation contrast. Building on the morphological hypothesis, we show that the surface morphology of written regions contains enough signal to distinguish ink from papyrus. To this end, we train machine learning models on three-dimensional optical profilometry from mechanically opened Herculaneum papyri to separate inked and uninked areas. We further quantify how lateral sampling governs learnability and how a native-resolution model behaves on coarsened inputs. We show that high-resolution topography alone contains a usable signal for ink detection. Diminishing segmentation performance with decreasing lateral resolution provides insight into the characteristic spatial scales that must be resolved on our dataset to exploit the morphological signal. These findings inform spatial resolution targets for morphology-based reading of closed scrolls through X-ray tomography.

2601.15119 2026-06-19 eess.IV cs.CV 版本更新

Vision Models for Medical Imaging: A Hybrid Approach for PCOS Detection from Ultrasound Scans

Md Mahmudul Hoque, Md Mehedi Hassain, Muntakimur Rahaman, Md. Towhidul Islam, Shaista Rani, Md Sharif Mollah

发表机构 * Department of CSE, CCN University of Science & Technology(计算机科学与工程系,CCN科学与技术大学) Department of EEE,International Islamic University Chittagong(电子工程系,国际伊斯兰大学恰tagong分校) Faculty of Engineering, Multimedia University(工程学院,多媒体大学) Department of CSE, Stamford University of Bangladesh(计算机科学与工程系,斯塔福德大学孟加拉国分校) Department of Biology, Lucknow University(生物学系,拉胡尔大学) Department of CSE, Bangladesh Army International University of Science & Technology(计算机科学与工程系,孟加拉国军队国际科学与技术大学)

详情
英文摘要

Polycystic Ovary Syndrome (PCOS) is the most familiar endocrine illness in women of reproductive age. Many Bangladeshi women suffer from PCOS disease in their older age. The aim of our research is to identify effective vision-based medical image analysis techniques and evaluate hybrid models for the accurate detection of PCOS. We introduced two novel hybrid models combining convolutional and transformer-based approaches. The training and testing data were organized into two categories: "infected" (PCOS-positive) and "noninfected" (healthy ovaries). In the initial stage, our first hybrid model, 'DenConST' (integrating DenseNet121, Swin Transformer, and ConvNeXt), achieved 85.69% accuracy. The final optimized model, 'DenConREST' (incorporating Swin Transformer, ConvNeXt, DenseNet121, ResNet18, and EfficientNetV2), demonstrated superior performance with 98.23% accuracy. Among all evaluated models, DenConREST showed the best performance. This research highlights an efficient solution for PCOS detection from ultrasound images, significantly improving diagnostic accuracy while reducing detection errors.

2508.21190 2026-06-19 cs.CV 版本更新

Radially Distorted Homographies, Revisited

Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt

发表机构 * Linköping University(林雪平大学) Ericsson Research(爱立信研究)

Journal ref 2026, Proceedings of the International Conference on 3D Vision (3DV). Vancouver, BC, Canada: IEEE, pp. 52-62

详情
英文摘要

Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion-particularly the radial component, called radial distortion-simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion; (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. A reference implementation of the proposed solvers is made available as part of HomLib (https://github.com/marcusvaltonen/HomLib).

2507.23027 2026-06-19 cs.CV cs.AI 版本更新

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校) All India Institute of Medical Sciences(全印度医学科学研究所) Indian Institute of Technology Hyderabad(印度理工学院海得拉巴分校)

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

详情
英文摘要

Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography-a widely accessible but noise-prone modality-remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models-Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metric-particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

1902.06202 2026-06-19 cs.CV cs.CG 版本更新

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

发表机构 * Michigan State University, Dept. of Computational Mathematics, Science and Engineering(密歇根州立大学,计算数学、科学与工程系) Michigan State University, Dept. of Mathematics(密歇根州立大学,数学系) Cooperative Institute for Marine and Atmospheric Studies, University of Miami(马里安诺大气研究合作机构,迈阿密大学) Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory(飓风研究部,国家海洋和大气管理局/大西洋海洋学和气象实验室) University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences(阿尔巴尼大学 - 纽约州立大学阿尔巴尼分校,大气与环境科学系)

详情
英文摘要

The diurnal cycle of tropical cyclones (TCs) is a daily cycle in clouds that appears in satellite images and may have implications for TC structure and intensity. The diurnal pattern can be seen in infrared (IR) satellite imagery as cyclical pulses in the cloud field that propagate radially outward from the center of nearly all Atlantic-basin TCs. These diurnal pulses, a distinguishing characteristic of the TC diurnal cycle, begin forming in the storm's inner core near sunset each day and appear as a region of cooling cloud-top temperatures. The area of cooling takes on a ring-like appearance as cloud-top warming occurs on its inside edge and the cooling moves away from the storm overnight, reaching several hundred kilometers from the circulation center by the following afternoon. The state-of-the-art TC diurnal cycle measurement has a limited ability to analyze the behavior beyond qualitative observations. We present a method for quantifying the TC diurnal cycle using one-dimensional persistent homology, a tool from Topological Data Analysis, by tracking maximum persistence and quantifying the cycle using the discrete Fourier transform. Using Geostationary Operational Environmental Satellite IR imagery data from Hurricane Felix (2007), our method is able to detect an approximate daily cycle.