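arXivDaily: a daily digest of academic papers, synced with the full arXiv dataset, with AI-generated summaries and translations. Coverage spans artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.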

2606.02580 2026-06-02 cs.CV Version update

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Kiymet Akdemir, Pinar Yanardag

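Affiliations: Virginia Tech

AI summary: Proposes SPAWN, which exploits a structural property of image-to-video backbones: by swapping the pinned reference-frame anchor with an external concept latent, it spawns user-specified visual concepts in world models without any training.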

Abstract

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
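The anchor swap described in the abstract can be pictured as a schedule over generated chunks. The following is a minimal sketch, not the authors' implementation: the function name `anchor_for_chunk`, the parameters `inject_start` and `inject_len`, and the string stand-ins for latents are all hypothetical.

```python
def anchor_for_chunk(chunk_idx, ref_latent, concept_latent, inject_start, inject_len):
    """Pick the latent placed in the pinned first memory slot for one chunk.

    Inside the injection window the concept latent replaces the reference
    anchor; outside the window the original anchor is used again.
    (Hypothetical helper; the window parameters are assumptions.)
    """
    in_window = inject_start <= chunk_idx < inject_start + inject_len
    return concept_latent if in_window else ref_latent


# Strings stand in for real latent tensors in this illustration.
schedule = [
    anchor_for_chunk(i, "ref", "concept", inject_start=2, inject_len=2)
    for i in range(6)
]
print(schedule)
```

With these illustrative settings the anchor is swapped for chunks 2 and 3 only, after which the reference anchor returns.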

2606.02573 2026-06-02 cs.CV Version update

HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos

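Affiliations: University of Texas at Austin; National University of Singapore; Texas A&M University

AI summary: Proposes HumanNOVA, which combines a scalable data generation pipeline with a feed-forward, token-conditioned architecture to rapidly generate photorealistic 3D human avatars from a single RGB image, with no test-time optimization.

Comments: CVPR 2026 Highlight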

Abstract

In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first is to leverage existing rigged assets and animate them with extensive poses from daily life. The second is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page: https://HumanNOVA.github.io

2606.02572 2026-06-02 cs.CV Version update

Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation

Siyuan Bian, Congrong Xu, Jun Gao

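Affiliations: University of Michigan; NVIDIA

AI summary: Proposes MDA, a mixture-density representation that predicts multiple depth hypotheses and their probabilities for each pixel, addressing flying-point artifacts at object boundaries in depth estimation; it substantially improves boundary reconstruction and largely removes flying points.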

Abstract

Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth hypothesis. At boundaries, a pixel can straddle a foreground and a background surface, so its true depth is ambiguous between the two. A model that predicts a single depth cannot keep both possibilities, so training instead pulls the prediction toward an intermediate depth that lies on neither surface. We address this with MDA, a mixture-density representation that lets the model predict multiple depth hypotheses and their associated probabilities for each pixel. Near boundaries, different hypotheses can align with different surfaces, and the decoded depth is selected from one of these hypotheses rather than placed in the empty space between them. Across different backbones, MDA substantially improves boundary reconstruction and largely removes flying-point artifacts even under severe input blur, while adding negligible runtime overhead. The same mixture-density framework naturally extends to transparent objects, where it predicts multiple depth layers at transparent pixels, and to sky regions, where a dedicated component separates the unbounded sky from finite-depth regions, producing flying-point-free skylines. Project Page: https://biansy000.github.io/mda-site/.
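The boundary argument above can be illustrated with a toy pixel that straddles a foreground surface at 2 m and a background surface at 10 m. This is a minimal sketch with made-up numbers; picking the most probable hypothesis is one plausible decoding rule and is an assumption here, since the abstract only says the decoded depth is selected from one of the hypotheses.

```python
def decode_mean(hypotheses, probs):
    """Single-value style decoding: probability-weighted average depth."""
    return sum(d * p for d, p in zip(hypotheses, probs))


def decode_select(hypotheses, probs):
    """Mixture style decoding (assumed rule): keep the most probable hypothesis."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return hypotheses[best]


# Illustrative boundary pixel: foreground at 2 m, background at 10 m.
depths = [2.0, 10.0]
probs = [0.6, 0.4]

mean_depth = decode_mean(depths, probs)  # lands between the two surfaces
selected = decode_select(depths, probs)  # lands on the foreground surface
print(mean_depth, selected)
```

The averaged depth falls in the empty space between the surfaces, which is exactly where a flying point would appear, while the selected depth stays on a real surface.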

2606.02551 2026-06-02 cs.RO cs.CV Version update

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

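Affiliations: University of Michigan; University of California, San Diego; NVIDIA

AI summary: Proposes AFUN, which predicts a task-conditional functional mask and a 3D post-contact motion curve from a single RGB-D image and a language task description. A large-scale standardized data pipeline supports open-world generalization, and the model substantially outperforms existing methods on multiple benchmarks.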

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7-61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
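The two segmentation metrics quoted above are commonly defined as follows in language-guided segmentation benchmarks: gIoU averages the per-image IoU, while cIoU divides the cumulative intersection by the cumulative union. The sketch below assumes those definitions (the abstract does not spell them out) and uses made-up pixel counts.

```python
def giou_ciou(pairs):
    """Compute (gIoU, cIoU) from per-image (intersection, union) pixel counts.

    Assumes every union is positive. gIoU weights each image equally;
    cIoU weights each pixel equally, so large masks dominate it.
    """
    per_image = [inter / union for inter, union in pairs]
    giou = sum(per_image) / len(per_image)
    ciou = sum(inter for inter, _ in pairs) / sum(union for _, union in pairs)
    return giou, ciou


# Illustrative counts: one small mask with IoU 0.5, one large mask with IoU 0.9.
g, c = giou_ciou([(50, 100), (900, 1000)])
print(round(g, 4), round(c, 4))
```

Because the large mask dominates the cumulative counts, cIoU sits closer to 0.9 than gIoU does, which is why the two numbers are usually reported together.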

2606.02535 2026-06-02 cs.CV Version update

LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang

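Affiliations: Shanghai Jiao Tong University

AI summary: Proposes LL-Bench, a benchmark of real-world degraded images with human preference annotations for systematically evaluating large-scale generative models on low-level vision tasks, and introduces the LL-Score evaluator to better align with human preferences.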

Abstract

Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models on low-level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with representative conventional restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench and find that they diverge significantly from human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and the presence of hallucinations. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.

2606.02532 2026-06-02 cs.CV Version update

Improving Combined Detection and Classification of TEM Defects via Mask-Conditioned Latent Diffusion Augmentation

Ni Li, Nuohao Liu, Ryan Jacobs, Ajay Annamareddy, Maciej P. Polak, Kevin Field, Izabela Szlufarska, Dane Morgan

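Affiliations: University of Wisconsin-Madison; University of Michigan-Ann Arbor

AI summary: Proposes a generative data augmentation method based on a mask-conditioned latent diffusion model (LDM) that synthesizes TEM images with controllable, automatically labeled multi-class defect masks, aiming to improve defect detection and classification with Mask R-CNN models trained on small datasets.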

Abstract

Analyzing microstructural defects in transmission electron microscopy (TEM) images, particularly in irradiated metal alloys, is often limited by the availability of high-quality, labeled data. To address this, we introduce a generative data augmentation approach using a mask-conditioned latent diffusion model (LDM) for synthesizing realistic TEM images with controllable, automatically labeled multi-class defect masks. Without requiring manual annotations for generation, our method enables the creation of synthetic image-mask pairs by sampling distributions learned from experimental masks. These generated data were used to augment small experimental datasets of varying sizes (10, 50, and 100 labeled experimental images) to train a Mask Regional Convolutional Neural Network (R-CNN) model for defect detection and classification. Our results show that generative augmentation yields small overall model performance improvements, with up to a 0.02 gain in the harmonic mean of detection and classification F1 scores. However, we also find that the relative contributions to detection and classification improvement depend on the specific train/test data split. These findings highlight the potential of targeted generative models to enhance deep learning performance in data-scarce microscopy-based image quantification tasks.
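The headline metric above is the harmonic mean of the detection and classification F1 scores. A minimal sketch of that arithmetic follows; the precision and recall values are made up for illustration and are not results from the paper.

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def harmonic_mean(a, b):
    """Harmonic mean of two positive scores."""
    return 2 * a * b / (a + b)


# Illustrative values only (not taken from the paper).
det_f1 = f1(precision=0.8, recall=0.6)
cls_f1 = f1(precision=0.9, recall=0.9)
combined = harmonic_mean(det_f1, cls_f1)
print(round(det_f1, 4), round(cls_f1, 4), round(combined, 4))
```

The harmonic mean is pulled toward the weaker of the two scores, so a gain of 0.02 in the combined metric requires improving whichever task is currently lagging.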

2606.02526 2026-06-02 cs.CV cs.AI Version update

ToolFG: Towards Well-Grounded Fine-Grained Image Classification

Yu Xue, Haoxuan Qu, Zhuoling Li, Yihang Lou, Yan Bai, Hossein Rahmani, Jun Liu

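Affiliations: Lancaster University; Peking University

AI summary: Proposes ToolFG, a framework that uses MCTS-guided tool-use knowledge distillation and a model-tool co-evolution mechanism so that MLLMs autonomously call external tools to gather reliable visual cues for fine-grained image classification.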

Abstract

Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing ToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more reliable and well-grounded manner. To equip the model with such tool-use ability, we design a novel MCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a model-tool co-evolution mechanism that jointly refines the toolset and the model's tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.

2606.02510 2026-06-02 cs.CV cs.RO Version update

Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis

Xiang Xu, Alan Liang, Youquan Liu, Xian Sun, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu

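Affiliations: NUAA; NUS; FDU; Duke; NTU; NJUPT; SKL-TI

AI summary: Proposes U4D, a framework that uses spatial uncertainty to guide LiDAR scene generation: entropy maps identify high-uncertainty regions, which are synthesized first, and the remaining regions are then completed, yielding high-fidelity 4D scenes.

Comments: CVPR 2026 E2E3D Workshop; GitHub at https://github.com/worldbench/U4D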

Abstract

Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a "hard-to-easy" schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
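The per-point uncertainty described above is the Shannon entropy of a segmentor's class probabilities. The sketch below shows that computation on two made-up points; the threshold `hard_threshold` used to split "hard" from "easy" points is a hypothetical parameter, not a value from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one point's class probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative class probabilities for two LiDAR points.
confident = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # well-observed structure
ambiguous = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # e.g. a distant or occluded point

hard_threshold = 1.0  # hypothetical cut-off for the "hard" regions
hard_mask = [h > hard_threshold for h in (confident, ambiguous)]
print(round(confident, 4), round(ambiguous, 4), hard_mask)
```

A uniform distribution over four classes reaches the maximum entropy of ln 4, so the ambiguous point falls into the region that a hard-to-easy schedule would synthesize first.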

2606.02498 2026-06-02 cs.CV Version update

GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction

Boyu Yuan, Jiamiao Lu, Weichuan Zhang, Benqing Wu, Tuo Wang, Changshan Wang, Changming Sun, Liang Guo

Affiliations: Image Computing Laboratory, Shaanxi University of Science and Technology; Department of Neonatology, Shenzhen University of Advanced Technology General Hospital; Department of Neurosurgery, The First Affiliated Hospital of Xi’an Jiaotong University; CSIRO Technology

AI summary: Proposes GloResNet, a lightweight 3D CNN based on ResNet-10 that combines global manifold mapping with a preprocessing strategy to predict preterm brain injury on the dHCP dataset, reaching 75.18% average accuracy.

Abstract

This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to address data scarcity. A global manifold mapping strategy first resamples each 3D volume to 128x128x128 and then applies subject-wise z-score intensity normalization, thereby preserving global topology while standardizing appearance. Training integrates mixup, class weighting, and test-time augmentation for robustness. In 5-fold cross-validation, GloResNet achieved 75.18% average accuracy (peak 81.82%), with specificity 0.81 and sensitivity 0.76. Results demonstrate that a topology-aware lightweight CNN has the capability to effectively predict neonatal BI, offering a non-invasive screening tool. The source code of this paper can be obtained from the GitHub repository: https://github.com/ICL-SUST/GloResNet-Preterm-Brain
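The subject-wise z-score step mentioned above standardizes each scan by its own mean and standard deviation. This is a minimal sketch under simplifying assumptions: a short flat list stands in for the resampled 128x128x128 volume, the resampling step is omitted, and the population standard deviation is used (the paper's exact choice is not stated).

```python
import statistics


def zscore_normalize(voxels):
    """Standardize one subject's intensities to zero mean and unit variance."""
    mean = statistics.fmean(voxels)
    std = statistics.pstdev(voxels)
    if std == 0:
        # A constant volume carries no contrast; map it to all zeros.
        return [0.0 for _ in voxels]
    return [(v - mean) / std for v in voxels]


# Illustrative intensities standing in for one subject's voxels.
normalized = zscore_normalize([10.0, 20.0, 30.0, 40.0])
print([round(v, 4) for v in normalized])
```

Because the statistics are computed per subject, scans acquired with different intensity scales end up on a comparable range while their relative spatial pattern is unchanged.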

2606.02491 2026-06-02 cs.CV Version update

MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

Minkyung Kwon, Jinhyeok Choi, Youngjin Shin, Jaeyeong Kim, JongMin Lee, Seungryong Kim

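Affiliations: KAIST AI

AI summary: Proposes MORPHOS, a framework that uses Temporal Structured Latents (T-SLAT) as a unified representation of dynamic 4D assets and generates them autoregressively with causal attention, addressing multi-representation compatibility, topological changes, and long-horizon consistency.

Comments: Project page: https://cvlab-kaist.github.io/MORPHOS/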

Abstract

We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
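Conditioning each frame only on its preceding history is what a causal attention mask enforces. The sketch below builds a frame-level mask as a generic illustration; the function name `causal_mask` is hypothetical, and the actual model presumably masks at the latent-token level rather than per frame.

```python
def causal_mask(num_frames):
    """Boolean mask where entry [i][j] is True if frame i may attend to frame j.

    Frame i sees itself and every earlier frame, never a later one.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


mask = causal_mask(3)
for row in mask:
    print(row)
```

The resulting mask is lower triangular: the first frame attends only to itself, and the last frame attends to the full history.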

2606.02481 2026-06-02 cs.CV Version update

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

Yusong Zhao, Yuejin Xie, Youliang Yuan, Junjie Hu, Jitian Guo, Yujiu Yang, Pinjia He

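Affiliations: The Chinese University of Hong Kong, Shenzhen; Tsinghua University

AI summary: Proposes PaSBench-Video, a 740-video benchmark that evaluates whether multimodal large language models can issue timely warnings before an accident occurs, and finds that existing models fall short on timing precision and false-positive rate.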

Abstract

Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
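The frame-level annotations above (risk onset and accident boundary) suggest a simple way to judge whether a warning is temporally calibrated. The sketch below is an assumed scoring rule for illustration only; the function name, the category labels, and the exact boundary handling are hypothetical and may differ from the benchmark's actual metrics.

```python
def classify_warning(warn_frame, onset_frame, accident_frame):
    """Label a warning relative to the annotated intervention window.

    The window is taken as [onset_frame, accident_frame); this boundary
    convention is an assumption, not the benchmark's definition.
    """
    if warn_frame is None:
        return "missed"      # the model never warned
    if warn_frame < onset_frame:
        return "premature"   # warned before any visible sign of danger
    if warn_frame < accident_frame:
        return "timely"      # warned while intervention was still possible
    return "late"            # warned at or after the accident


# Illustrative annotation: risk becomes visible at frame 120, accident at frame 200.
labels = [classify_warning(f, 120, 200) for f in (None, 60, 150, 230)]
print(labels)
```

A content check (whether the warning names the correct hazard) would sit on top of this timing check, since the abstract requires warnings to be both temporally calibrated and content-correct.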

2606.02441 2026-06-02 cs.CV Version update

Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Lizhuang Ma, Jiangning Zhang

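Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University

AI summary: Proposes ST-DRC, a framework that achieves high-fidelity identity-preserving video generation through spatial-temporal decoupled reference conditioning, the TASS-RoPE scheme, and identity objectives.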

Abstract

Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.

2606.02436 2026-06-02 cs.CV Version update

Geometry-Aware Implicit Memory for Video World Models

Zhengxuan Wei, Xu Guo, Xinghui Li, Xunzhi Xiang, Min Wei, Yiran Zhu, Qiulin Wang, Xintao Wang, Pengfei Wan, Xiangwang Hou, Qi Fan

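Affiliations: School of Intelligence Science and Technology, Nanjing University; Kling Team, Kuaishou Technology; Tsinghua University

AI summary: Proposes GIM-World, a framework in which a lightweight transformer encoder compresses variable-length history into fixed-size memory tokens and a camera-queryable geometry head distills 3D scene structure from a frozen foundation model during training, preserving geometric and visual consistency in long-horizon video generation.

Comments: Project page: https://gim-world.github.io/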

Abstract

Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.

2606.02424 2026-06-02 cs.CV cs.AI cs.LG Version update

PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation

Xiaohang Yu, Ti Wang, Mackenzie Weygandt Mathis

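Affiliations: École Polytechnique Fédérale de Lausanne

AI summary: Proposes PRIMA, a framework that uses biological priors (BioCLIP embeddings) and a test-time adaptation strategy for 3D quadruped mesh recovery under severe species and pose imbalance, achieving strong generalization and producing Quadruped3D, a large-scale pseudo-3D dataset.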

Abstract

We present PRIMA (*PRI*ors for *M*esh *A*daptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at https://github.com/AdaptiveMotorControlLab/PRIMA.

2606.02357 2026-06-02 cs.CV cs.AI 版本更新

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

多模态智能体真的从工具使用中受益吗？能力增益的系统性研究

Garvin Guo, Donglei Yu, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Qinghao Wang, Minpeng Liao

AI总结通过对比工具增强与无工具的多模态智能体在多项任务上的表现，发现工具使用并未带来一致的性能提升，智能体更多是学会了工具调用模式而非真正利用工具扩展能力。

详情

AI中文摘要

工具增强的多模态智能体在基准测试中表现出显著提升，这常被视为智能体已学会使用工具的证据。我们认为这种解读可能为时过早：仅凭工具调用轨迹并不能证明工具提供了对答案至关重要的信息。我们研究了两种代表性的“用图像思考”智能体 Thyme 和 DeepEyesV2 在真实世界理解、OCR、图表理解和数学推理任务上的表现。每个智能体都与其无工具版本，以及从同一源池训练但不含工具调用轨迹的纯文本推理器进行比较。工具访问几乎没有带来一致的总体改进，未能可靠地降低生成令牌成本，并且仅靠工具才能解决的问题集很小：DeepEyesV2 借助工具解决的问题中有 93%、Thyme 的有 96% 也能被至少一种无工具设置解决。机制消融进一步表明，完整的工具使用循环并不始终优于单独的工具调用格式或单独的返回执行结果。在我们研究的设置中，所分析的智能体似乎更可靠地学会了工具调用模式，而非由工具贡献的能力，这表明评估应区分工具的可用性与工具是否真正扩展了智能体可解决的问题。

英文摘要

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative "thinking with images" agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2's tool-solved problems and 96% of Thyme's are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
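The "tool-only solved set" in this abstract can be read as a set difference: problems the tool-enabled agent solves that no non-tool setting solves. The sketch below illustrates that reading in Python. The function name `tool_only_fraction` and the problem IDs are hypothetical and are not taken from the paper.

```python
def tool_only_fraction(tool_solved, non_tool_solved_sets):
    """Fraction of tool-solved problems that no non-tool setting solves."""
    tool_solved = set(tool_solved)
    if not tool_solved:
        raise ValueError("tool_solved must be non-empty")
    # Union of everything solved by at least one non-tool setting.
    solved_without_tool = set().union(*non_tool_solved_sets)
    tool_only = tool_solved - solved_without_tool
    return len(tool_only) / len(tool_solved)


# Toy problem IDs (illustrative only, not results from the paper).
tool_solved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
tool_free_solved = {1, 2, 3, 4, 5, 11}
pure_text_solved = {4, 5, 6, 7, 8, 9, 12}

fraction = tool_only_fraction(tool_solved, [tool_free_solved, pure_text_solved])
print(f"tool-only fraction: {fraction:.2f}")  # prints "tool-only fraction: 0.10"
```

Under this reading, the reported 93% overlap for DeepEyesV2 corresponds to a tool-only fraction of about 7%, and the 96% overlap for Thyme to about 4%.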

2606.02352 2026-06-02 cs.CV 版本更新

Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection

多模态视频表示对齐用于鲁棒的自监督驾驶员分心检测

David J. Lerch, Livien Majer, Zeyun Zhong, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

发表机构 * Fraunhofer IOSB（弗劳恩霍夫光电、系统技术与图像处理研究所）； Karlsruhe Institute of Technology (KIT)（卡尔斯鲁厄理工学院）

AI总结提出一种多模态全局对齐框架，通过软目标和加权机制处理错误负样本和不可靠正样本，在Drive&Act数据集上优于现有方法，实现鲁棒的驾驶员分心检测。

Comments Accepted at the IEEE ITSC 2026

详情

AI中文摘要

鲁棒的自监督多模态视频表示学习对于现实应用（如驾驶员分心检测）至关重要，其中多个传感器提供互补但嘈杂的信号。传统的对比目标（如InfoNCE）假设所有负样本信息量相等且所有正样本可靠。然而，由于视角变化、遮挡或模态间的语义重叠，这一假设在多模态数据中经常被违反。在这项工作中，我们提出了一种新颖的多模态全局对齐框架，通过联合建模错误负样本和不可靠或错误正样本来解决这些挑战。我们引入基于循环一致性分数的软目标来放松硬负样本假设，并基于相似性分布的加权机制来减轻噪声或错误正样本的影响。我们的方法将传统的成对对齐扩展到原则性的全局多模态设置，聚合所有模态对的对齐信息。我们在Drive&Act数据集上评估了我们的方法，结果表明它在RGB、IR、深度和骨架模态上始终优于成对和现有的全局对齐基线。跨视角消融研究进一步显示了对未见相机视角的强泛化能力，突出了我们表示的鲁棒性。总体而言，我们的框架为自监督全局多模态表示学习提供了一种可扩展且有效的解决方案，实现了可靠的驾驶员分心检测，并在现实世界的多模态视频理解中具有开创性。我们的代码将在GitHub上发布。

英文摘要

Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the Drive&Act dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and pioneering in real-world multi-modal video understanding. Our code will be published on GitHub.
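To make the contrast between hard and soft targets concrete, here is a minimal Python sketch of a contrastive loss written as a cross-entropy against a target distribution. With one-hot targets it reduces to standard InfoNCE. The soft target rows below are hand-set for illustration, whereas the abstract derives them from cycle-consistency scores, and the function names are assumptions rather than the authors' code.

```python
import math


def log_softmax(row, temperature):
    """Numerically stable log-softmax of one similarity row."""
    scaled = [s / temperature for s in row]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def soft_target_nce(similarities, targets, temperature=0.1):
    """Cross-entropy between target rows and softmax(similarity rows)."""
    total = 0.0
    for sim_row, target_row in zip(similarities, targets):
        log_probs = log_softmax(sim_row, temperature)
        total -= sum(t * lp for t, lp in zip(target_row, log_probs))
    return total / len(similarities)


# Toy cross-modal similarities: samples 0 and 1 are semantically close.
sims = [[0.9, 0.8, 0.1],
        [0.7, 0.9, 0.0],
        [0.1, 0.2, 0.9]]
hard = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # InfoNCE targets
soft = [[0.7, 0.3, 0.0], [0.3, 0.7, 0.0], [0.0, 0.0, 1.0]]  # hand-set soft targets

hard_loss = soft_target_nce(sims, hard)
soft_loss = soft_target_nce(sims, soft)
print(f"hard: {hard_loss:.4f}  soft: {soft_loss:.4f}")
```

The soft rows no longer push the likely faulty negative (sample 1 for anchor 0) as far away as a true negative, which is the relaxation the abstract describes.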

2606.02350 2026-06-02 cs.CV 版本更新

无模型坍塌的熵最小化：减轻医学影像中的预测偏差

Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang, Julia A. Schnabel

发表机构 * School of Computation, Information and Technology, Technical University of Munich, Germany（德国慕尼黑工业大学计算、信息与技术学院）； Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany（德国亥姆霍兹慕尼黑中心生物医学成像机器学习研究所）； School of Biomedical Engineering and Imaging Sciences, King’s College London, UK（英国伦敦国王学院生物医学工程与成像科学学院）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； relAI – Konrad Zuse School of Excellence in Reliable AI（relAI 康拉德·楚泽可靠人工智能卓越学院）； TUM University Hospital Rechts der Isar（慕尼黑工业大学附属伊萨尔河右岸医院）

AI总结针对测试时适应中熵最小化导致的模型坍塌问题，提出分布偏移偏差减少（DSBR）方法，通过均衡各预测类对无监督熵最小化损失的贡献来纠正预测偏差，在四个医学影像数据集和ImageNet-C上验证了其稳定性和有效性。

详情

AI中文摘要

熵最小化（EM）是测试时适应的主导目标，但其失败模式——模型坍塌——仍然知之甚少。在这项工作中，我们表明分布偏移会导致模型表示空间中对应不同类别的特征簇合并，而决策边界保持不变。这导致预测类别分布出现系统性偏差，称为预测偏差。预测偏差是指预测类别分布的偏移，其中一些类别被过度代表，而其他类别被抑制。我们表明熵最小化通过收紧现有簇来放大这种预测偏差，强化错误的分类，直到所有预测坍缩为平凡解。接下来，为了证明预测偏差的重要性并减轻它，我们进一步提出了分布偏移偏差减少（DSBR），这是一种偏差纠正目标，通过均衡每个预测类别对无监督熵最小化损失的贡献来专门针对这种失败模式。为了研究这种失败模式，我们使用四个医学影像数据集设计了合适的适应设置，并在ImageNet-C上进行了额外评估。我们发现DSBR一致地稳定了测试时适应，防止了模型坍塌，并且匹配或超越了最先进的方法。此外，DSBR仅在测试时运行。

英文摘要

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, referred to as prediction bias. Prediction bias refers to a shift in the predicted class distribution, with some classes overrepresented and others suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test-time.
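One plausible way to picture "equalizing the contribution of each predicted class" is to average the entropy within each predicted class first, and only then average across classes. The Python sketch below shows that idea. It is an assumption made for illustration, not the paper's DSBR objective, and `class_balanced_entropy` is a hypothetical name.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy (in nats) of one predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(batch):
    """Plain entropy-minimization loss: every sample weighs the same."""
    return sum(entropy(p) for p in batch) / len(batch)


def class_balanced_entropy(batch):
    """Average entropy per predicted class, then average over classes."""
    per_class = defaultdict(list)
    for probs in batch:
        predicted = max(range(len(probs)), key=probs.__getitem__)
        per_class[predicted].append(entropy(probs))
    class_means = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_means) / len(class_means)


# Skewed toy batch: three samples predicted as class 0, one as class 1.
batch = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.4, 0.6]]
print(f"plain: {mean_entropy(batch):.4f}")
print(f"balanced: {class_balanced_entropy(batch):.4f}")
```

In the plain loss the over-represented class supplies three of the four terms, while in the balanced version each predicted class supplies exactly half, so a dominant class cannot drive the update on its own.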

2606.02331 2026-06-02 cs.CV cs.LG 版本更新

可信生成式逆问题的测量几何与设计

Pengfei Jin, Na Li, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School（先进医学计算与分析中心，麻省总医院和哈佛医学院）； School of Engineering and Applied Sciences, Harvard University（工程与应用科学学院，哈佛大学）

AI总结提出局部测量-流形兼容性度量，证明其控制重建误差的稳定部分，并基于体积保持设计固定和自适应测量策略，在多个成像任务中预测失败模式、减少幻觉并指导采样。

详情

AI中文摘要

生成模型越来越多地被用作逆问题的先验，但它们生成逼真图像的能力带来了一个基本的信任问题：一个看似合理的重建可能由测量支持，也可能由先验沿未观测方向填充。这一区别在医学成像中尤为重要，因为采集操作是在扫描时间、剂量和校准约束下设计的。我们从测量几何的角度研究生成式逆问题。核心问题是：固定的测量算子能否区分在生成先验下看似合理的邻近图像，以及这种关系能否指导更好的测量。我们引入了一个局部测量-流形兼容性度量，用于量化算子观测先验相关切线方向的程度。在局部正则性假设下，我们证明该量控制重建误差的稳定部分，而生成先验控制流形外漂移。这一最坏方向证书基于整体局部体积保持，提出了实用的固定和顺序采集规则，包括一种后验云设计，该设计在测试时自适应调整测量，无需训练采样策略。在行采样、断层扫描和MR采集设置中，所提出的分数预测失败模式，解释测量引起的幻觉，并指导更好的采样。在fastMRI笛卡尔采样中，后验云测量设计优于强大的非学习ACS保留基线，包括可变密度和泊松类掩模。

英文摘要

Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.

2606.02303 2026-06-02 cs.CV 版本更新

Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery

跨域航拍图像死树检测：基于知识蒸馏的方法

Anis Ur Rahman, Mete Ahishali, Einari Heinaro, Samuli Junttila

发表机构 * CSC – IT Center for Science Ltd.（CSC信息科技研究中心有限公司）； Department of Forest Sciences, University of Helsinki（赫尔辛基大学森林科学系）； KOKO Forest Ltd.（KOKO森林有限公司）； School of Forest Sciences, University of Eastern Finland（东芬兰大学森林科学学院）

AI总结针对航拍图像中死树检测的域差异和标注数据稀缺问题，提出基于知识蒸馏的TreeMort-1T-UNet模型，通过特征级蒸馏在多个目标域上实现鲁棒性能，并验证其在低数据场景下的优越性。

Comments 14 pages, 6 figures, journal

详情

AI中文摘要

航拍图像中的死树检测对于评估森林健康至关重要，尤其是随着气候变化导致全球树木死亡率上升，但域变异性和稀缺的标注数据常常限制模型的泛化能力。本研究改进了最初在芬兰航拍图像（源域）上训练的TreeMort-1T-UNet（树木死亡率单任务U-Net）模型，通过应用知识蒸馏（KD）使其适应各种目标域，包括代表不同森林类型的波兰、德国和爱沙尼亚数据集。我们评估了四种KD变体：基础、自蒸馏、特征级和集成，与微调基线进行比较，使用平均树木IoU、实例F1分数、实例精度和平均质心误差作为关键指标，并结合表征分析（如余弦相似度、CKA、SSIM、t-SNE和线性探针）评估域不变性。特征级KD优于其他方法，在波兰数据集上实现了平均树木IoU为0.106、实例F1分数为0.63、实例精度为0.55、平均质心误差为3.039，并在其他目标域上保持稳健精度（例如，芬兰为0.15，波兰为0.67，德国为0.60，爱沙尼亚为0.59）。它在低数据场景下表现优异，假阳性更少，并展现出优越的表征不变性（例如，更高深层CKA/SSIM、t-SNE中更好的域混合、线性探针AUC为0.95），使其成为精度关键的林业应用的理想选择。额外的消融研究证实，特征对齐等关键组件增强了其跨指标的平衡性能。我们的发现证明了KD在遥感中增强迁移学习的潜力，为生态监测和可持续森林管理提供了可扩展、域鲁棒的工具。

英文摘要

Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains, including Polish, German, and Estonian datasets representing diverse forest types. We assess four KD variants: Basic, Self, Feature-level, and Ensemble, against a fine-tuning baseline, using Mean Tree IoU, Instance F1-score, Instance Precision, and Mean Centroid Error as key metrics, alongside representational analyses (e.g., cosine similarity, CKA, SSIM, t-SNE, and linear probing) for domain invariance. Feature-level KD outperforms others, yielding a Mean Tree IoU of 0.106, Instance F1-score of 0.63, Instance Precision of 0.55, and Mean Centroid Error of 3.039 on the Polish dataset, with robust precision across other target domains (e.g., 0.15 on Finnish, 0.67 on Polish, 0.60 on German, 0.59 on Estonian). It excels in low-data scenarios with fewer false positives and shows superior representational invariance (e.g., higher deep-layer CKA/SSIM, better domain mixing in t-SNE, and linear probing AUC of 0.95), making it ideal for precision-critical forestry applications. Additional ablation studies confirm that key components like feature alignment enhance its performance balance across metrics. Our findings demonstrate KD's potential to enhance transfer learning in remote sensing, offering a scalable, domain-robust tool for ecological monitoring and sustainable forest management.
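Feature-level knowledge distillation is commonly written as a task loss plus a penalty that pulls the student's intermediate features toward those of a frozen teacher. The Python sketch below shows that generic form, assuming a mean-squared-error feature penalty and a single weighting factor. The abstract does not specify the exact loss used for TreeMort-1T-UNet, so every name here is illustrative.

```python
def mse(a, b):
    """Mean squared error between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_kd_loss(task_loss, student_feats, teacher_feats, weight=1.0):
    """Task loss plus the weighted mean per-layer feature MSE."""
    layer_losses = [mse(s, t) for s, t in zip(student_feats, teacher_feats)]
    return task_loss + weight * sum(layer_losses) / len(layer_losses)


# Toy features from two layers (flattened); the teacher is treated as fixed.
student = [[0.2, 0.4, 0.6], [1.0, 0.0]]
teacher = [[0.2, 0.5, 0.6], [0.5, 0.5]]
loss = feature_kd_loss(task_loss=0.5, student_feats=student,
                       teacher_feats=teacher, weight=2.0)
print(f"combined loss: {loss:.4f}")
```

The weight trades off fitting the target-domain labels against staying close to the source-trained teacher's representation.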

2606.02301 2026-06-02 cs.HC cs.AI cs.CV 版本更新

Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video

定量运动测试：从单部智能手机视频测量患者运动

Pranav Mahajan, Amanda Wall, Eleonora Maria Camerone, Julie Stebbins, Eoin Kelleher, Shuangyi Tong, Annina Schmid, Katja Wiech, Anushka Irani, Ben Seymour

发表机构 * Nuffield Department of Clinical Neurosciences, University of Oxford（牛津大学纳菲尔德临床神经科学系）； Max Planck Institute of Biological Cybernetics（马克斯·普朗克生物控制论研究所）； Oxford Gait Laboratory, University of Oxford（牛津大学步态实验室）； Harvard Medical School（哈佛医学院）； Massachusetts General Hospital（麻省总医院）； Institute of Biomedical Engineering, University of Oxford（牛津大学生物医学工程研究所）； Mayo Clinic（梅奥诊所）

AI总结提出基于计算机视觉的定量运动测试（QMT）方法，利用深度学习3D姿态估计从单目智能手机视频提取运动生物标志物，在实验室验证中与光学运动捕捉高度一致（r>0.85），并在纤维肌痛和慢性坐骨神经痛患者中展示了可靠性和纵向监测能力。

详情

AI中文摘要

慢性疼痛通过降低功能能力而损害生活质量，但在现实环境中客观测量这种功能影响仍然具有挑战性。虽然光学运动捕捉为评估运动质量改变提供了高精度，但成本高昂且局限于实验室环境。我们旨在开发并验证定量运动测试（QMT），这是一个从标准单目智能手机视频中提取3D运动生物标志物的计算机视觉流程，平衡临床可及性与生物力学精度。我们利用基于深度学习的3D姿态估计，在健康对照组（N=13）中针对金标准光学运动捕捉验证了QMT流程。经过留一法受试者校准以纠正系统偏差后，我们在两个前瞻性临床队列中部署QMT以评估现实世界效用：一项纤维肌痛患者的干预前后试验，以及一项慢性坐骨神经痛患者和健康对照的30天纵向家庭监测研究。在实验室验证中，QMT提取的临床运动指标与光学运动捕捉高度一致，显示出强相关性（r>0.85）和低平均绝对误差。QMT在纤维肌痛患者中显示出高重测信度（r>0.86），并成功追踪了慢性坐骨神经痛患者的日常运动波动。虽然现实家庭环境引入了比实验室环境更高的测量方差，但QMT完全基于远程记录发现了健康对照组和坐骨神经痛患者之间的组级差异。单目3D姿态估计为传统评估提供了一种可扩展的替代方案。QMT为临床试验中跟踪疾病进展和治疗反应提供了客观、可及的生物标志物，但需要进一步研究以优化家庭环境中的可靠性。

英文摘要

Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r > 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r > 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
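The validation statistics named in the abstract (Pearson correlation, mean absolute error, leave-one-subject-out calibration) can be sketched in a few lines of Python. The sketch assumes the systematic bias is a constant offset estimated from all other subjects. The paper's actual calibration model is not described in the abstract, and the joint-angle values below are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mean_absolute_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def loso_offset_calibration(video, mocap):
    """Subtract, per subject, the mean offset estimated from the others."""
    n = len(video)
    errors = [v - m for v, m in zip(video, mocap)]
    total = sum(errors)
    return [video[i] - (total - errors[i]) / (n - 1) for i in range(n)]


# Hypothetical peak joint angles in degrees, one value per subject.
mocap = [40.0, 55.0, 62.0, 70.0, 85.0]
video = [44.0, 60.0, 65.0, 75.0, 88.0]
calibrated = loso_offset_calibration(video, mocap)

print(f"r = {pearson_r(video, mocap):.3f}")
print(f"MAE before: {mean_absolute_error(video, mocap):.2f}")       # 4.00
print(f"MAE after:  {mean_absolute_error(calibrated, mocap):.2f}")  # 1.00
```

Because each subject's own error is excluded from the offset applied to that subject, the calibration never sees the data it is evaluated on.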

2606.02292 2026-06-02 cs.CV 版本更新

Neural Acquisition & Representation of Subsurface Scattering

次表面散射的神经获取与表示

Arjun Majumdar, Raphael Braun, Hendrik Lensch

发表机构 * University of Tübingen（图宾根大学）

AI总结提出一种通过U-Net CNN学习物体表面每个点的像素足迹响应来获取和估计高细节层次次表面散射特性的方法，实现任意高分辨率投影图案的重光照。

Comments 8 pages

详情

DOI: 10.2312/vmv.20251228

AI中文摘要

我们提出了一种方法，通过学习物体表面每个点的像素足迹响应，以高度细节化的水平获取和估计光传输的次表面散射特性。重建利用3D扫描技术作为U-Net CNN的输入。使用相移轮廓测量（PSP）图案的立体投影仪-相机设置高效捕获各种散射物体的数据。重建密集像素足迹允许使用任意高分辨率投影图案进行重光照。最终输出是重光照后的彩色图像。与真实世界捕获图像的定性和定量比较表明，预测的足迹与实际响应几乎相同。同一模型针对多个物体的多个视图进行训练，使得学习到的表示也能泛化到未见过的次表面散射材料。

英文摘要

We present a method to acquire and estimate the sub-surface scattering properties of light transport at a highly detailed level by learning the pixel footprint response at each point on the object surface. The reconstruction leverages 3D scanning techniques as input to a U-Net CNN. A stereo projector-camera setup using phase-shifted profilometry (PSP) patterns efficiently captures the data for a variety of scattering objects. Reconstructing dense pixel footprints allows for relighting with arbitrary high-resolution projector patterns. The final output is a relit color image. Qualitative and quantitative comparisons against illuminated real-world captured images demonstrate that the predicted footprints are almost identical to the actual responses. The same model is trained for multiple views across multiple objects such that the learned representations can be used to generalize to unseen sub-surface scattering materials as well.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Cross-modal linkage risk in clinical vision-language models

临床视觉-语言模型中的跨模态链接风险

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

发表机构 * Lab for AI in Medicine（医学人工智能实验室）； RWTH Aachen University（亚琛工业大学）； Department of Diagnostic and Interventional Radiology（诊断与介入放射学部门）

AI总结研究临床视觉-语言模型（VLM）在图像与报告分离场景下通过余弦相似度实现跨模态重链接的风险，并采用仅对投影头进行差分隐私微调的方法在保持图像效用同时显著降低重链接率。

详情

AI中文摘要

在配对胸部X光片和放射学报告上训练的视觉-语言模型（VLM）学习了一个共享嵌入空间，该空间可以保留实例级别的图像-报告对应关系。这在故意将X光片和报告在获取后分开的场景中（例如仅图像数据共享或受控访问的报告）构成了隐私风险，因为一个去标识的图像可能仅通过余弦相似度就重新链接到其原始叙述性报告。我们将此形式化为图像到报告的检索，并使用公共配对队列（其中真实配对是已知的）作为基准来审计风险，而不是作为隐私场景。在来自MIMIC-CXR（43,793个保留对）和外部CheXpert Plus（29,296个对）的126,804名患者的406,241个配对示例上评估了临床专业化程度递增的VLM，我们发现重链接率随专业化程度系统性地上升：最强的VLM在候选池N=100时以15倍随机概率检索到正确报告，在N=10,000时以50倍随机概率，在全数据库规模下仍远高于随机概率。该信号在去除疾病标签捷径的病理匹配困难负样本下仍然存在，表明对应关系超出了广泛的诊断类别。为了在不重新训练的情况下减少这种风险，我们冻结了两个编码器，仅对定义对齐层的投影头应用差分隐私优化（epsilon=0.34，delta=6x10^-6）。这使得MIMIC-CXR上N=10,000时的Recall@1降低了61.8%，并无需重新训练即可迁移到CheXpert Plus，同时图像侧效用基本保持：线性探针分类在14个标签上的宏AUROC仅从79.63%变为79.43%。对共享对齐层的定向DP微调可以大幅减少跨模态重链接，而不会实质性降低使这些模型在临床上有用的图像表示。

英文摘要

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10^-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
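The re-linkage audit described here amounts to nearest-neighbour retrieval under cosine similarity, scored by Recall@1 and compared with chance. The Python sketch below uses toy two-dimensional embeddings and assumes each candidate pool holds exactly one correct report, so chance-level Recall@1 is 1/N. The function names and vectors are illustrative, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(image_embs, report_embs):
    """Share of images whose nearest report is the paired one (same index)."""
    hits = 0
    for i, img in enumerate(image_embs):
        best = max(range(len(report_embs)),
                   key=lambda j: cosine(img, report_embs[j]))
        hits += best == i
    return hits / len(image_embs)


def chance_multiple(recall, pool_size):
    """How many times better than random guessing (chance = 1 / pool_size)."""
    return recall * pool_size


def recall_from_chance_multiple(multiple, pool_size):
    return multiple / pool_size


images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
reports = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7], [0.8, 0.2]]

r1 = recall_at_1(images, reports)
print(r1, chance_multiple(r1, len(reports)))  # prints "0.75 3.0"
```

Under the same assumption, the reported multiples correspond to a Recall@1 of roughly 15% at N = 100 and roughly 0.5% at N = 10,000 (`recall_from_chance_multiple(15, 100)` and `recall_from_chance_multiple(50, 10_000)`).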

2606.02273 2026-06-02 cs.CV 版本更新

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

用于驾驶员监控系统的视觉语言模型：一个驾驶员活动描述数据集

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

发表机构 * Fraunhofer IOSB（弗劳恩霍夫光电、系统技术与图像处理研究所）； Technische Hochschule Ingolstadt（英戈尔施塔特应用技术大学）； Karlsruhe Institute of Technology (KIT)（卡尔斯鲁厄理工学院）

AI总结本文通过创建Drive&Act数据集的详细自然语言版本，评估并微调视觉语言模型，以提升对驾驶员细微动作的识别能力，微调后的模型在跨数据集评估中表现更优。

Comments Accepted at IEEE ITSC 2026

详情

AI中文摘要

理解细微的驾驶员动作对于构建可靠的驾驶员监控系统至关重要。现有的视觉语言模型（VLM）在通用数据集上训练，难以识别驾驶员行为的细微差别。本文通过创建Drive&Act数据集的详细自然语言版本来解决这一限制。我们使用基于LLM的评分方法在新的基准上评估了三个VLM。它们在新基准上的表现表明，它们无法可靠地生成准确的细粒度驾驶员活动描述。基于标注的Drive&Act数据集，我们创建了一个新的Drive&Act描述数据集，其中包含细粒度描述，用于训练VLM理解驾驶员活动。在驾驶员监控数据集（DMD）上的跨数据集评估表明，在我们的新Drive&Act描述数据集上微调的VLM能够很好地泛化到DMD数据集中的动作。在我们的Drive&Act描述数据集上微调的VLM取得了76的ACCR分数，优于零样本VLM基线的66 ACCR分数。这些发现表明，用丰富描述的驾驶员动作来适应VLM可以显著提高其解释驾驶员行为的能力，同时也突显了需要更多样化的数据集以支持未来应用中更广泛的泛化。我们的Drive&Act描述数据集和代码将在GitHub上公开。

英文摘要

Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new Drive&Act description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior, while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.

2606.02268 2026-06-02 cs.CV 版本更新

解决基于图像和基于文本的行人重识别之间的优化冲突

Karina Kvanchiani, Timur Mamedov

发表机构 * Tevian, Russia（俄罗斯Tevian）； Lomonosov Moscow State University, Russia（俄罗斯罗蒙诺索夫莫斯科国立大学）

AI总结针对图像与文本行人重识别任务因模态差异和目标冲突导致共享表示次优的问题，提出解耦两阶段训练流程，使用单一视觉编码器避免跨任务干扰，实验表明图像预训练和文本监督能提升双任务性能。

详情

AI中文摘要

基于图像（I2I）和基于文本（T2I）的行人重识别（ReID）的联合优化受到模态差异和冲突训练目标的阻碍，导致共享表示次优。虽然I2I ReID关注同一人图像间的身份级不变性，但T2I ReID由与独特视觉特征相关的实例特定文本描述驱动。本文探讨了两个ReID任务及其优化过程之间的根本差异，以实现有效训练。由于I2I和T2I ReID通常分开研究，为一种检索设置优化的损失函数可能对另一种所需的表示质量产生负面影响。基于这些发现，我们提出了一种解耦的两阶段训练流程，用于学习跨图像和文本模态的共享表示。该流程基于单个视觉编码器，支持I2I和T2I检索，同时避免训练期间的跨任务干扰。我们在多种配置下进行了大量实验，改变了域混合程序、学习策略和任务目标。我们观察到I2I ReID预训练对T2I数据的泛化能力有积极影响。此外，我们发现视觉编码器训练阶段引入文本监督能提升I2I和T2I性能。我们相信，我们的见解为统一的ReID系统和跨模态检索整体迈出了有意义的一步。

英文摘要

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between the two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observe that I2I ReID pre-training positively impacts the generalization ability to T2I data. In addition, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02228 2026-06-02 stat.ML cs.CV cs.LG 版本更新

缩小联邦原型学习中的对齐-成熟度差距

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

发表机构 * CITIC, Universidade da Coruña（CITIC，科鲁纳大学）

AI总结针对联邦学习中原型对齐压力抑制局部判别结构的问题，提出FedSAP框架，通过确定性对齐课程和几何驱动代理分离损失稳定表征学习，在多种异质性条件下提升分类性能。

详情

AI中文摘要

从分布式异质数据中学习判别性视觉表示是联邦学习（FL）中的一个基本挑战。基于原型的方法通过跨客户端共享类级表示来解决统计异质性，但在早期训练轮次中会产生距离依赖的梯度压力，这种压力尤其严重：对从噪声局部表示聚合而来的不成熟全局原型施加的对齐压力会产生大梯度，从而抑制局部判别结构的出现。结果导致嵌入空间组织不良，识别性能下降，尤其是在严重的非独立同分布（non-IID）条件下。我们提出FedSAP，一个通过两种互补机制稳定联邦表示学习的框架：一个确定性对齐课程，将全局对齐延迟到局部表示变得稳定；以及一个几何驱动的代理分离损失，利用现有原型库在单位超球面上强制执行类间结构，而不引入额外参数或通信开销。这些机制共同产生紧凑、分离良好的类簇，而不改变联邦参与者之间的底层通信协议。在三个基准测试和不同程度的异质性下的实验表明，与评估的原型基线相比，性能提升高达4个百分点，在高异质性下改进最为显著。我们框架的表示性质还使其能够直接扩展到半监督设置，其中未标记数据只需最小修改即可纳入，突显了调度对齐作为设计原则的通用性。

英文摘要

Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients, but they create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable, and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between the federation's participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.
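The two mechanisms named in this abstract can be pictured with a small Python sketch: a round-based schedule that keeps the alignment weight at zero during warm-up, and a separation penalty on unit-normalised class prototypes. Both functions are assumptions made for illustration. The abstract gives neither FedSAP's schedule nor its loss, and the linear ramp and positive-cosine penalty here are stand-ins.

```python
import math


def alignment_weight(round_idx, warmup_rounds, max_weight=1.0):
    """Zero during warm-up, then a linear ramp up to max_weight."""
    if warmup_rounds <= 0:
        raise ValueError("warmup_rounds must be positive")
    if round_idx < warmup_rounds:
        return 0.0
    ramp = (round_idx - warmup_rounds + 1) / warmup_rounds
    return max_weight * min(1.0, ramp)


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def proxy_separation_loss(prototypes):
    """Mean positive cosine similarity between distinct class prototypes."""
    if len(prototypes) < 2:
        raise ValueError("need at least two prototypes")
    unit = [normalize(p) for p in prototypes]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += max(0.0, cos)  # only penalise prototypes that point alike
            pairs += 1
    return total / pairs


schedule = [alignment_weight(r, warmup_rounds=5) for r in range(12)]
print(schedule)
print(proxy_separation_loss([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The separation term needs only the prototypes that are already exchanged, which matches the abstract's claim of no additional parameters or communication overhead.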

2606.02171 2026-06-02 cs.CV 版本更新

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

InsightVQA: 高维情感认知视觉问答基准

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao, Jing Chen, Jiaqi Song, Yunshi Lan, Yan Wang

发表机构 * East China Normal University（华东师范大学）

AI总结为解决现有基准仅关注情感识别而缺乏深层认知推理的问题，提出大规模层次化视觉问答数据集InsightVQA，包含725K问答对，并构建评估基准InsightVQA-Bench和基线模型InsightNet。

Comments 16 pages, 22 figures

详情

AI中文摘要

视觉情感理解要求模型不仅识别情感状态，还要理解其产生原因并进行更高层次的认知推理。然而，现有基准主要关注情感识别，对基于依据的理解和面向响应的分析支持有限。为弥补这一差距，我们引入了InsightVQA，一个用于情感理解和认知推理的层次化视觉问答大规模数据集。我们从六个公开来源收集的351K图像出发，应用严格的多阶段过滤流程，筛选出138K高置信度图像。每张图像在三个层次上进行标注：用于情感和效价识别的感知QA、通过约束引导生成从视觉触发提取构建的基于依据的理解QA，以及以响应意图预测和序列洞察推理为中心的认知QA。总计，InsightVQA包含725K个问答对。我们还提出了InsightVQA-Bench，一个包含30K样本的高质量评估基准，用于细粒度评估。为支持评估，我们引入了InsightNet，一个针对多模态大语言模型的情感调优基线。结果表明，InsightVQA对基于依据的情感理解和推理提出了重大挑战。

英文摘要

Visual emotion understanding requires models not only to recognize emotional states, but also to understand why they arise and perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Building from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce InsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.

2606.02168 2026-06-02 cs.CV cs.LG 版本更新

Disentanglement-Based Equivariant Learning for Compositional VQA

基于解耦的等变学习用于组合式VQA

Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Engineering Research Center of Sustainable Urban Intelligence Transportation, Ministry of Education, China（可持续城市智能交通教育部工程研究中心）； State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所多模态人工智能系统国家重点实验室）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出DEAL框架，通过因果干预解耦视觉和文本概念，并利用等变约束增强组合推理能力，在CLEVR-CoGenT和GQA-SGL上超越现有方法。

Comments Accepted by IEEE Transactions on Multimedia

详情

DOI: 10.1109/TMM.2025.3604897
Journal ref: IEEE Trans. Multimedia, vol. 27, pp. 8160-8173, 2025

AI中文摘要

组合式视觉问答（VQA）是一项具有挑战性但基础的任务，要求模型理解先前学习概念的新组合。当前方法往往忽视潜在概念的解耦，并且在有效捕捉组合变化机制方面受到限制。此外，最先进的技术依赖于额外的线索进行训练，这在现实世界的VQA场景中不可行。为了解决这些问题，本文提出了一种新颖的基于解耦的等变学习（DEAL）框架用于组合式VQA，该框架仅由真实答案指导。在DEAL中，我们采用因果启发的干预措施，在重新编码框架内解耦来自视觉和文本输入的概念。基于等变性原理，我们随后对推理输入进行组合变换，并对输出施加等变约束，以增强模型的组合推理能力。在基准数据集CLEVR-CoGenT和GQA-SGL上进行的全面实验验证了我们提出的DEAL方法在视觉和语言泛化设置下均优于现有的最先进方法。

英文摘要

Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR 版本更新

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法：一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

AI总结针对视觉丰富文档类型分类中多模态建模策略难以系统比较的问题，本文在统一实验框架下对基于Transformer和LLM的四种代表性模型进行受控对比，发现专用多模态Transformer优于LLM方法，且图像信息贡献最大。

详情

AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性，因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性，当前方法依赖于多样化的多模态建模策略，导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中，这些研究通常依赖于异构评估设置，进一步复杂化了系统比较，并使得评估进展变得困难。为了解决这些局限性，本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析，并结合统一实验框架内的受控实证比较。具体来说，在RVL-CDIP基准上评估了四种代表性模型（LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B），以系统分析文本、图像和布局信息对文档类型分类的贡献，特别关注对比OCR依赖和OCR无关的方法。结果表明，专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大，而OCR派生的文本提供有用但次要的支持。这些发现强调，对于具有显著布局结构的文档，多模态处理仍然是必不可少的。总体而言，该研究为比较多模态架构提供了系统基础，并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02161 2026-06-02 cs.CV cs.CL 版本更新

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

InfoMerge: 信息感知的令牌压缩用于高效视频大语言模型

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

发表机构 * State Key Laboratory of Novel Software Technology（新型软件技术国家重点实验室）

AI总结提出InfoMerge，一种无需训练的视觉令牌压缩方法，通过鲁棒冗余估计和内容感知预算分配，在减少85%视觉令牌的同时保持98.8%性能，实现4.24倍预填充加速。

Comments 15 pages, 8 figures

详情

AI中文摘要

视频大语言模型在视频理解中表现出色，但过多的视觉令牌带来了巨大的计算开销。现有的免训练压缩方法通过减少视觉令牌来提高推理效率，但它们通常依赖局部相邻帧相似性进行时间冗余估计，或主要根据片段长度分配令牌预算。这种设计对帧级噪声敏感，且无法捕捉真实视频的非均匀信息分布。为解决这些挑战，我们提出InfoMerge，一种无需训练的视觉令牌压缩方法，通过鲁棒冗余估计和内容感知预算分配来提高令牌利用率。具体来说，我们提出时间指纹差异：一种片段级二阶时间冗余估计策略，用于建模每个片段内相同空间位置令牌的时间相似性结构。我们进一步引入内容感知预算分配（CABA），根据片段独特性和基于谱熵的表征丰富性动态分配片段级令牌预算。通过减少对冗余静态区域的重复保留，并将更多令牌分配给信息丰富的片段，InfoMerge在保持强大性能的同时更好地利用了有限的令牌预算。大量实验表明，InfoMerge在多个基准和骨干网络上实现了强效的精度-效率权衡，在激进压缩下优势更为明显。在LLaVA-OneVision-7B上，InfoMerge保留了原始平均性能的98.8%，同时减少了85%的视觉令牌，并在预填充阶段实现了4.24倍的加速。

英文摘要

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency-accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.
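
The budget-allocation idea in this abstract can be illustrated with a minimal sketch. It assumes each segment already has a non-negative informativeness score; the scores, the function name `allocate_budgets`, and the largest-remainder rounding are illustrative assumptions, not the paper's CABA formula.

```python
def allocate_budgets(scores, total_budget):
    """Split total_budget tokens across segments in proportion to scores.

    Uses the largest-remainder method so the integer budgets sum exactly
    to total_budget. The scores are hypothetical per-segment values, not
    the uniqueness / spectral-entropy terms described in the abstract.
    """
    if total_budget < 0 or any(s < 0 for s in scores):
        raise ValueError("scores and total_budget must be non-negative")
    total_score = sum(scores)
    if total_score == 0:
        # No segment stands out: fall back to uniform shares.
        scores = [1.0] * len(scores)
        total_score = float(len(scores))
    exact = [total_budget * s / total_score for s in scores]
    budgets = [int(x) for x in exact]  # floor, since shares are non-negative
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the largest fractional parts.
    order = sorted(range(len(scores)),
                   key=lambda i: exact[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Keeping 15% of 1000 visual tokens (an 85% reduction) across three segments.
kept = round(1000 * (1 - 0.85))
print(allocate_budgets([2, 5, 3], kept))
```

Segments with higher scores receive proportionally more of the kept tokens, which is the behaviour the abstract contrasts with allocation by segment length alone.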

2606.02156 2026-06-02 eess.IV cs.AI cs.CV cs.IR cs.LG 版本更新

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

基于术前肠道血供映射预测结直肠吻合口漏风险

Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna

发表机构 * Computer Science Department, Københavns Universitet (KU)（哥本哈根大学计算机科学系）； University of Southern Denmark（南丹麦大学）； Odense University Hospital（奥登塞大学医院）； OpenSourceResearch Collaboration（开源研究协作）

AI总结提出一种基于术前CT影像的AI驱动系统，通过分析血管和组织特征量化吻合口漏风险，并结合内容检索支持临床决策。

详情

AI中文摘要

吻合口漏仍然是结直肠癌手术后最严重的并发症之一，显著影响患者预后、康复轨迹和医疗成本。尽管影像技术有所进步，目前的术前评估仍依赖临床评估，这一过程主观、易出错且高度依赖个人经验。迄今为止，尚无经过验证的基于CT的方法能够在术前预测吻合口漏风险。本方案论文概述了一个全面的框架，用于开发和验证一个AI驱动的系统，该系统利用对比增强前后的CT影像进行术前风险评估。研究描述了数据收集、伦理处理、符合GDPR的患者数据预处理、图像预处理以及旨在生成临床可解释输出的深度学习架构探索等阶段。该工作流程的两个主要成果是：1) 风险评估模块，通过分析CT扫描中的血管和组织特征量化漏液可能性；2) 基于内容的医学图像检索（CBMIR）模块，识别并显示相似历史病例以支持循证手术决策。该方案论文需要医院和大学之间的密切合作；本方案表明，此类系统在现有医疗基础设施内技术上可行且临床可实施。通过遵循所提出的方法论阶段和监管原则，其他机构可以复制此工作流程以开发类似的决策支持工具。最终，这一跨学科框架旨在加强手术规划、减少漏液发生率，并推动向可解释、数据驱动的精准手术的更广泛范式转变。

英文摘要

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.

2606.02153 2026-06-02 cs.CV cs.GR 版本更新

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances

超扩散姿态估计器：基于扩散的从稀疏惯性传感器和测距传感器间距离的人体运动跟踪

Dominik Hollidt, Tommaso Bendinelli, Christian Holz

发表机构 * Department of Computer Science, ETH Zurich（苏黎世联邦理工学院计算机科学系）

AI总结提出Ultra Diffusion Poser扩散模型，通过显式建模UWB测距的几何约束（空间布局模块解析重建传感器位置）和引入UWB扩散引导，在扩散采样中强制预测姿态与实测距离对齐，将关节位置误差降低22%。

Comments CVPR 2026 - Computer Vision and Pattern Recognition

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046

AI中文摘要

基于惯性测量单元（IMU）的方法为基于摄像头的运动捕捉提供了一种可穿戴的替代方案。为了减轻惯性信号的漂移，最近的稀疏惯性姿态估计器集成了由超宽带（UWB）测距测量的传感器间距离。到目前为止，UWB距离仅被用作额外的输入特征，忽略了它们对传感器位置施加的物理约束。然而，这些距离也可以用于重建底层3D传感器布局，从而为姿态重建提供更具信息性的输入。我们提出了Ultra Diffusion Poser，一种显式建模这些几何约束的扩散模型。它包括一个空间布局模块，该模块从UWB测量中解析地重建3D传感器位置。这些传感器位置与IMU信号和UWB距离一起作为扩散过程中的条件信号。尽管如此，网络预测仍可能违反传感器间距离测量。为了解决这个问题，我们引入了UWB扩散引导，它在扩散采样过程中鼓励预测姿态与测量距离之间的对齐。这些贡献共同使我们的模型达到了最先进的性能，将关节位置误差相比先前工作降低了高达22%。

英文摘要

Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
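
The idea of nudging predictions toward measured inter-sensor distances can be sketched as a gradient step on a distance-consistency loss. This is only an illustration of the general guidance principle under stated assumptions: the loss, the step size, and the names `distance_loss` and `guidance_step` are hypothetical, and the paper's actual UWB-Diffusion Guidance operates inside a diffusion sampler that is not modelled here.

```python
import math


def distance_loss(positions, measured):
    """Sum of squared gaps between predicted and measured pairwise distances."""
    loss = 0.0
    for (i, j), d in measured.items():
        loss += (math.dist(positions[i], positions[j]) - d) ** 2
    return loss


def guidance_step(positions, measured, step_size=0.1):
    """One gradient-descent step on distance_loss with respect to positions."""
    grads = [[0.0, 0.0, 0.0] for _ in positions]
    for (i, j), d in measured.items():
        diff = [a - b for a, b in zip(positions[i], positions[j])]
        dist = math.sqrt(sum(c * c for c in diff))
        if dist == 0.0:
            continue  # gradient is undefined for coincident points; skip the pair
        scale = 2.0 * (dist - d) / dist
        for k in range(3):
            grads[i][k] += scale * diff[k]
            grads[j][k] -= scale * diff[k]
    return [[p - step_size * g for p, g in zip(pos, grad)]
            for pos, grad in zip(positions, grads)]


# Two predicted sensor positions 2 m apart, but the measured range is 1 m.
predicted = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
ranges = {(0, 1): 1.0}
adjusted = guidance_step(predicted, ranges)
print(distance_loss(predicted, ranges), distance_loss(adjusted, ranges))
```

Each step moves the two points toward each other along their connecting line, so the predicted distance approaches the measured one.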

2606.02134 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg（弗赖堡大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结针对认证训练中自然精度与认证精度的权衡问题，提出基于Pareto前沿的多目标超参数优化方法，实现公平的方法间比较，并发现先前配置的欠调优现象，建立新的最优性能。

Comments Accepted to ICML 2026

详情

AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能，但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证，但计算成本高昂。为缓解这一问题，认证训练技术在训练过程中优化可验证的鲁棒性，通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的，报告单一配置的常见做法存在问题：它可能误导关于整体性能的结论，并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较，我们执行高效的自动化多目标超参数优化，为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优，从而获得更优性能并建立新的最优水平。利用这些前沿，我们首次对认证训练方法进行了全面的多目标比较，表明先前的进展并不像假设的那样显著，并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural-certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.
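
A Pareto front over two accuracy objectives can be computed with a short filter. The sketch below assumes each configuration is summarised as a (natural accuracy, certified accuracy) pair; the numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other (both objectives maximised)."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(set(front))


# Hypothetical (natural, certified) accuracies for five hyperparameter settings.
configs = [(0.80, 0.50), (0.75, 0.55), (0.78, 0.48), (0.70, 0.60), (0.72, 0.52)]
print(pareto_front(configs))
```

Reporting the whole front rather than a single point shows how much certified accuracy each method can buy at every level of natural accuracy, which is the comparison the abstract argues for.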

2606.02129 2026-06-02 cs.CV 版本更新

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

均衡扩散：面向均衡图像定制的频率感知文本嵌入

Liyuan Ma, Xueji Fang, Guo-Jun Qi

发表机构 * Westlake University（西湖大学）； Zhejiang University（浙江大学）

AI总结提出均衡扩散方法，通过频率空间分解概念特征并独立优化嵌入，实现风格与主体解耦，提升定制图像的保真度和文本对齐。

详情

AI中文摘要

图像定制从参考概念图像中学习目标主体，并根据文本提示生成条件图像，主要修改风格或背景。主流方法采用微调将多样化的概念属性打包到统一的潜在嵌入中，但纠缠的属性阻碍了从风格和背景中消除无关干扰。为解决此问题，我们提出均衡扩散，一种频率驱动的方法，解缠纠缠的概念特征，实现均衡定制和一致的文本-视觉匹配。与使用共享嵌入和统一调优学习完整概念的传统方法不同，我们的工作利用图像频率分量与语义之间的内在联系：低频表示主体内容，高频对应风格。我们在频率空间中分解概念并独立优化每个嵌入。这种分离优化使去噪器能够捕获与主体身份分离的风格，并更好地泛化到未见过的风格提示。合并多频率嵌入保留了模型原有的空间定制能力。我们进一步部署掩码引导扩散以限制无关背景变化并增强文本对齐。将残差参考注意力（RRA）插入空间注意力中以保持主体结构和身份一致性。实验证明，均衡扩散在主体保真度和文本遵循方面超过主流基线，验证了我们方法的优越性。

英文摘要

Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method's superiority.
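
The low/high frequency split that this abstract relies on can be illustrated on a 1D signal with a plain discrete Fourier transform. This is a sketch only: the paper works with images and learned embeddings, and the cutoff rule and the name `split_bands` are assumptions made for the example.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(signal, cutoff):
    """Split a real 1D signal into low- and high-frequency parts.

    Bins k with min(k, n - k) <= cutoff form the low band, the rest the
    high band, so the two parts sum back to the input.
    """
    n = len(signal)
    spectrum = dft(signal)
    low_spec = [c if min(k, n - k) <= cutoff else 0 for k, c in enumerate(spectrum)]
    high_spec = [c - l for c, l in zip(spectrum, low_spec)]
    return idft(low_spec), idft(high_spec)


signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, 2.0, 1.0]
low, high = split_bands(signal, cutoff=1)
print([round(l + h, 6) for l, h in zip(low, high)])
```

Because the two bands are complementary, each can be processed separately and then merged again without losing information.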

2606.02120 2026-06-02 cs.CV cs.AI cs.LG 版本更新

LALE：用于土地覆盖估计的轻量级Transformer架构

Ümit Mert Çağlar, Alptekin Temizel

发表机构 * Middle East Technical University（中东技术大学）

AI总结提出LALE架构，通过分辨率分支编码器（轻量级ConvMixer处理高分辨率局部特征，Transformer处理低分辨率全局上下文）和全MLP多尺度解码器，在遥感图像分割中实现高效性能与计算成本的平衡。

详情

AI中文摘要

遥感图像的语义分割需要模型在严格的计算预算下同时捕捉全局上下文和局部细节。先前的工作通常针对这些轴之一进行优化：注意力用于全局上下文，卷积用于局部细节，或紧凑性用于效率。虽然混合方法旨在同时捕捉两者，但它们需要架构更改和带有计算开销的编码器骨干，限制了效率和性能。我们提出了LALE（用于土地覆盖估计的轻量级Transformer架构），一种端到端的遥感图像分割架构，它按分辨率将编码器分为两支：轻量级ConvMixer阶段处理高分辨率局部特征，而Transformer阶段处理低分辨率全局上下文，将自注意力的二次成本限制在深层、下采样的特征图上。全MLP多尺度解码器，以及贯穿始终的RMSNorm和StarReLU，进一步减少了计算量和参数数量。在大规模ARAS400k遥感分割基准上，LALE相对于CNN、Transformer和混合基线建立了强大的效率-性能权衡。我们最小的变体（仅1.6M参数）在F1分数上与最佳基线（UPerNet）的差距在2.6分以内，同时使用4.5倍更少的参数、7倍更少的存储、17倍更少的GMACs，并提供1.8倍更高的吞吐量。

英文摘要

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant (just 1.6M parameters) reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.
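
The benefit of confining self-attention to downsampled feature maps follows from simple arithmetic, sketched below. The input size and the downsampling factor are illustrative assumptions, not LALE's actual stage configuration, and one token per spatial location is assumed.

```python
def attention_pairs(height, width, downsample=1):
    """Number of token pairs self-attention compares on a feature map.

    Assumes one token per spatial location; the cost grows with the
    square of the token count.
    """
    tokens = (height // downsample) * (width // downsample)
    return tokens * tokens


full = attention_pairs(256, 256)                 # attention at input resolution
deep = attention_pairs(256, 256, downsample=16)  # attention on a 16x16 map
print(full // deep)  # 16**4 = 65536
```

Downsampling each side by a factor s divides the token count by s squared and the pairwise attention cost by s to the fourth power, which is why attention is affordable only in the deep stages.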

2606.02080 2026-06-02 cs.MA cs.AI cs.CV 版本更新

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Agentic-J：用于生物显微镜图像分析的AI智能体

Lukas Johanns, Marilin Moor, Davide Panzeri, Yu Zhou, Xinyi Chen, Nora F. K. Pauly, Zixuan Pan, Matthias Gunzer, Andreas Müller, Yiyu Shi, Hedi Peterson, Jianxu Chen

AI总结提出基于容器的多智能体AI助手Agentic-J，通过自然语言接口集成ImageJ/Fiji工具，实现从细胞分割到多条件量化的可追溯、可复现生物图像分析工作流。

Comments Presented at Cell Biology at Scale 2026 (Poster). The Agentic-J project is available at https://mmv-lab.github.io/Agentic-J/

2606.02079 2026-06-02 cs.CV 版本更新

FACT: A Simple and Efficient Framework for Active Finetuning

FACT：一种简单高效的主动微调框架

Wenshuai Xu, You Song, Yuzhuo Cui, Minjie Ren, Qingjie Liu, Zhenghui Hu

发表机构 * Zhejiang (No. 2024C01020)（浙江（No. 2024C01020））； National Natural Science Foundation of China (No. 62302031)（中国国家自然科学基金委员会（No. 62302031））； Zhejiang Provincial Natural Science Foundation of China (Nos. LQ23F020024 and LZJMZ24D050009)（中国浙江省自然科学基金委员会（Nos. LQ23F020024 and LZJMZ24D050009））

AI总结针对主动微调中全量微调导致预训练特征失真和过拟合的问题，提出FACT三阶段分层微调框架，通过冻结特征增强和参数高效微调，在多种数据集和架构上显著提升性能，尤其在低采样率下实现超过20%的增益。

Comments ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Image Processing (T-IP)

详情

AI中文摘要

主动微调的主要目标是通过使用精心挑选的信息性或挑战性数据对预训练模型进行微调，以提高其在特定任务或领域上的性能。先前的研究主要关注主动方面（即数据选择），同时统一采用全量微调进行模型适应，这不可避免地因分布偏移而扭曲预训练特征。当模型大小相对于微调数据量较大时，这个问题变得尤为突出，导致过拟合风险增加。为了解决这一关键差距，我们正式概述了FiAF任务，该任务强调在主动学习中系统探索微调方法。我们提出了FACT，一个三阶段分层微调框架，兼具高效性和简洁性，专门为主动微调场景设计。我们的综合实验涵盖：（1）三大数据集类别，包括经典（CIFAR10、CIFAR100、ImageNet-1k）、不平衡（CIFAR10-LT、CIFAR100-LT）和细粒度（StanfordCars、FGVCAircraft）图像分类数据集，每个在3-5种不同采样率下评估；（2）多样化的预训练架构，包括卷积神经网络（ConvNeXt）、视觉变换器（ViT）和视觉LSTM（ViL）网络；（3）对冻结特征增强（FroFA）策略的系统研究；（4）对效率和泛化性的全面严格分析。结果表明，我们的框架具有显著改进，并具备强大的泛化性和鲁棒性。值得注意的是，在低采样率下，我们的框架在CIFAR10、CIFAR100和ImageNet-1k基准测试中，ViT模型实现了超过20%的显著性能提升。这种系统性的方法在保持参数效率的同时建立了新的最先进性能，在标记数据稀缺时尤其有效。

英文摘要

The main goal of active finetuning is to improve a pretrained model's performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task that emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) a systematic investigation of frozen feature augmentation (FroFA) strategies; and (4) a comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.

2606.02068 2026-06-02 cs.CV cs.AI 版本更新

Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image

基于可微多平面图像的快速轻量级新视角合成

Kaidi Zhang, Guanxu Zhu

发表机构 * Universiti Malaya（马来大学）； Wuhan University（武汉大学）

AI总结针对现有方法在速度、模型大小和稀疏视角下的不足，提出基于可微多平面图像（MPI）的快速轻量级新视角合成方法，利用点图进行几何初始化并引入一步扩散处理空洞和伪影。

详情

AI中文摘要

近年来，新视角合成取得了显著进展，主流方法如神经辐射场（NeRF）和3D高斯泼溅（3DGS）产生了令人印象深刻的结果。然而，这些方法往往难以平衡渲染速度和模型大小，且其基于优化的训练可能非常耗时。此外，它们通常依赖于密集观测，在稀疏视角条件下往往无法产生令人满意的结果。尽管前馈重建显著减少了3DGS的优化时间，但其像素对齐公式从单张图像生成数百万个高斯，严重限制了其在移动设备上的实际部署。为了解决这些限制，我们重新审视了多平面图像（MPI）表示，该表示使用一组紧凑的平面层来表示场景，以实现高效的新视角合成。利用视觉基础模型的最新进展，我们使用预测的点图进行可靠的几何初始化，然后进行可微优化。为了解决稀疏初始化MPI中的空洞和伪影问题，我们引入了一步扩散，该扩散既参与MPI的可微优化，也参与渲染结果的后处理。与代表性的基于GS的方法相比，我们的方法速度快30.7%，模型大小仅为其14.8%，同时在前视场景中实现了具有竞争力的合成质量。

英文摘要

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image (MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios.
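
A multiplane image is rendered by alpha-compositing its planar layers. The sketch below shows the standard "over" compositing rule for a single pixel with scalar colors; it omits the per-plane warping into the target view and is not the paper's renderer. The name `composite_mpi` is an assumption for illustration.

```python
def composite_mpi(planes):
    """Blend one pixel's (color, alpha) samples, ordered front to back."""
    color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer planes
    for plane_color, alpha in planes:
        color += transmittance * alpha * plane_color
        transmittance *= 1.0 - alpha
    return color


# A half-transparent white plane in front of an opaque black plane.
print(composite_mpi([(1.0, 0.5), (0.0, 1.0)]))
```

Every operation in the loop is differentiable with respect to the plane colors and alphas, which is what allows the layers to be refined by gradient-based optimization.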

2606.02058 2026-06-02 cs.CV cs.RO 版本更新

TIDES: Time-Derivative Event Simulation via Deformable Reconstruction

TIDES：基于可变形重建的时间导数事件模拟

Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield

发表机构 * University of Surrey（萨里大学）

AI总结提出TIDES，一种基于动态高斯泼溅的连续时间事件模拟器，通过显式3D场景表示推导逐像素强度动态，实现精确的阈值交叉预测，并利用遮挡引导自适应时间步长，达到最先进的事件流保真度。

详情

AI中文摘要

事件相机响应环境外观变化而发出异步事件。真实世界事件数据集的稀缺使得模拟至关重要。然而，大多数模拟器从帧序列推断事件时间戳，迫使许多阈值交叉共享一小组离散时间；我们将这种失效模式称为时间戳批处理，它在快速运动和遮挡下会恶化。我们提出TIDES，一种基于动态高斯泼溅的连续时间事件模拟器。由于TIDES在具有学习几何和运动的显式3D场景表示上运行，它可以直接从场景推导每像素强度动态，而不是通过渲染帧的差分。这使得能够精确预测阈值交叉，包括每个渲染步骤的多次交叉，而无需时间上采样或帧插值。相同的3D场景模型揭示了物体之间部分遮挡的位置；TIDES利用这一点来指导自适应时间步长，仅将计算集中在遮挡动力学使简单亮度变化模型不可靠的区域。最后，我们使用瓦片级仲裁器对有限传感器带宽进行建模，其吞吐量、抖动和事件丢失再现了真实的传感器伪影。在配对的RGB-事件基准测试中，TIDES达到了最先进的事件流保真度。我们还表明，TIDES模拟的事件比竞争对手更有效地转移到真实下游任务。

英文摘要

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching, which worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors'.
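
The contrast-threshold event model behind this abstract can be sketched for a single pixel. The example assumes log intensity changes linearly within one rendering step, which is a simplification chosen for illustration; TIDES itself derives intensity dynamics from a 3D scene model, and the function name is hypothetical.

```python
import math


def events_in_interval(log_i0, log_i1, t0, t1, threshold):
    """Return (timestamp, polarity) events for one pixel between t0 and t1.

    Assumes log intensity changes linearly over the interval and that the
    reference level equals log_i0 at t0; an event fires each time the
    change reaches another multiple of threshold.
    """
    delta = log_i1 - log_i0
    if delta == 0.0:
        return []
    polarity = 1 if delta > 0 else -1
    count = math.floor(abs(delta) / threshold)
    return [(t0 + (t1 - t0) * k * threshold / abs(delta), polarity)
            for k in range(1, count + 1)]


# A brightness rise of 0.9 log units with threshold 0.25 crosses three times.
for timestamp, polarity in events_in_interval(0.0, 0.9, 0.0, 1.0, 0.25):
    print(round(timestamp, 4), polarity)
```

A simulator that stamped all three events with the end-of-step time would exhibit the timestamp batching described above, whereas solving for each crossing gives distinct timestamps inside the step.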

2606.02048 2026-06-02 cs.AI cs.CV physics.bio-ph 版本更新

Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties

动态酪蛋白凝胶化显微图像拓扑纹理分析及其与流变学性质的关系

Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring

发表机构 * Department of Computer Science, University of Copenhagen, Denmark（哥本哈根大学计算机科学系）； Department of Green Technology, University of Southern Denmark, Denmark（南丹麦大学绿色技术系）； Department of Food Science, University of Copenhagen, Denmark（哥本哈根大学食品科学系）

AI总结提出结合拓扑数据分析、差分盒计数、多重分形分割和局部二值模式的工具箱，分析STED显微图像中酪蛋白凝胶化的拓扑与纹理特征，揭示与流变学性质相关的微观结构转变。

详情

AI中文摘要

我们提出了一种新颖的计算工具箱，集成了拓扑数据分析（TDA）、差分盒计数（DBC）、多重分形分割（MFP）和局部二值模式（LBP），应用于由葡萄糖酸-δ-内酯（GDL）在30°C和40°C以及两种GDL浓度（1.8%和3.5% w/v）下诱导的酪蛋白酸钠凝胶化的时间序列超分辨率STED显微图像。TDA通过最大Betti-1曲线追踪拓扑环，即反映蛋白质网络互连性的封闭环状结构，揭示了分散聚集体的滞后阶段、与网络渗透和流变学观察到的溶胶-凝胶转变相一致的急剧衰减，以及对应于网络重排的凝胶后增加。这些拓扑转变通过DBC和MFP得到证实，因为这些方法能够解析结构复杂性和空间异质性的变化。该工具箱在实验应用前在模拟分形图像上进行了验证。总之，这些描述符对体相流变学作为平均体相力学响应捕获的细微微观结构转变具有敏感性。这种集成方法为表征食品和材料科学中具有演化微观结构动力学的复杂微观结构提供了稳健的定量工具。代码可在https://github.com/Zahratabatabaei/Delifood_CV_paper.git获取。

英文摘要

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git
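
Box counting, one ingredient of the toolbox, can be illustrated on a binary image. This sketch uses plain binary box counting, a simpler relative of the differential box counting (DBC) named in the abstract, and assumes a square grid with at least one foreground pixel; it is not taken from the authors' repository.

```python
import math


def box_count(grid, size):
    """Count size x size boxes containing at least one foreground pixel."""
    n = len(grid)
    count = 0
    for r in range(0, n, size):
        for c in range(0, n, size):
            if any(grid[i][j]
                   for i in range(r, min(r + size, n))
                   for j in range(c, min(c + size, n))):
                count += 1
    return count


def box_counting_dimension(grid, sizes):
    """Least-squares slope of log(count) against log(1 / size)."""
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(box_count(grid, s)) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


filled = [[1] * 16 for _ in range(16)]
line = [[1] * 16] + [[0] * 16 for _ in range(15)]
print(box_counting_dimension(filled, [1, 2, 4, 8]))
print(box_counting_dimension(line, [1, 2, 4, 8]))
```

A filled square and a straight line give slopes close to 2 and 1 respectively; a gel network would be expected to fall between such extremes, and tracking the slope over time gives one scalar descriptor of structural complexity.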

2606.02042 2026-06-02 cs.CV 版本更新

Normality-Preserving Continual Industrial Anomaly Detection via Orthogonal LoRA Banks

通过正交LoRA库保持正常性的持续工业异常检测

Weibai Fang, Haijun Che, Feiyang Ren, Qiancheng Lao

发表机构 * Yisu University（Yorkshire University）

AI总结提出基于历史冻结正交LoRA库和分层新颖性自适应库增长模块的框架，解决扩散模型在持续工业异常检测中的历史正常性先验漂移和灾难性遗忘问题。

Comments 33 pages,6 figures,Submitted to Advanced Engineering Informatics

详情

AI中文摘要

基于扩散模型的持续工业异常检测面临历史正常性先验漂移和灾难性遗忘问题。现有的持续扩散方法通过回放或约束优化保留先前知识，但缺乏在顺序适应过程中隔离和保护类别特定正常性先验的显式机制。尽管低秩适应提供了模块化残差更新，但标准LoRA既未冻结历史正常性子空间，也未阻止新适配器干扰先前适配器。为解决此问题，我们提出保持正常性的持续异常检测框架，该框架基于两个模块：历史冻结正交LoRA库（HF-OLB）和分层新颖性自适应库增长模块（HNABG）。HF-OLB冻结预训练的U-Net主干和已学习的LoRA库，并将新任务特定的正常性残差约束到历史LoRA子空间的正交补空间中。HNABG进一步分配层依赖的残差容量，并仅在残差正常性新颖性超过现有库的表达容量时扩展库。在MVTec和VisA上的大量实验证明了所提方法的有效性。在具有挑战性的VisA 2x6设置下，我们的方法实现了83.6/91.8的图像和像素级A-AUROC，以及3.8/3.9的FM，将像素级A-AUROC相比现有最优方法提升了3.2个百分点，同时将像素级FM降低了1.3。这些结果表明，我们的方法在长时间跨度的持续类别序列中有效保留了历史正常性先验。

英文摘要

Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.
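
Constraining an update to the orthogonal complement of a historical subspace is a standard linear-algebra operation, sketched here for plain vectors. The abstract does not say how HF-OLB enforces the constraint, so the projection below, and the names `orthonormalize` and `project_to_complement`, are assumptions for illustration only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # drop vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, history):
    """Remove from update every component lying in the span of history."""
    result = list(update)
    for b in orthonormalize(history):
        coeff = dot(result, b)
        result = [ri - coeff * bi for ri, bi in zip(result, b)]
    return result


history = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # directions used by earlier tasks
update = [3.0, 4.0, 5.0]                      # hypothetical new-task residual
print(project_to_complement(update, history))
```

After projection the update has zero inner product with every historical direction, so applying it cannot change what the frozen subspace already encodes.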

2606.02022 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

排名 vs. 分配：多视角目标关联中的度量不匹配

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

发表机构 * Tevian Moscow（莫斯科Tevian）； Lomonosov Moscow State University（莫斯科国立罗蒙诺索夫大学）

AI总结本文揭示了多视角目标关联中常用的排名度量（如AP、FPR-95）与分配目标之间的根本性不匹配，并提出了基于Sinkhorn归一化的后处理方法以缓解该问题。

详情

AI中文摘要

多视角目标关联是一个重要的计算机视觉问题，是许多多相机感知任务的基础。虽然该任务自然被表述为受约束的一对一匹配问题，但最近的工作严重依赖成对排名度量（如AP和FPR-95）进行模型评估。我们强调了这些度量与实际分配目标之间的根本性不匹配。理论上，我们表明即使分配已经正确，AP和FPR-95也可能不完美，而基于Sinkhorn的归一化可以使它们完美。相反，最优的成对排名仍然可能导致错误的分配。我们通过使用基于Sinkhorn的归一化作为受控的后处理压力测试，在实践中验证了这种不匹配。我们表明，仅优化几个后处理参数就能显著提升AP和FPR-95，而分配级别的度量（如ACC和IPAA）却没有相应改进。

英文摘要

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
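
Sinkhorn normalization alternately rescales the rows and columns of a positive score matrix until both sum to one. The sketch below applies it to a made-up 2x2 similarity matrix; the iteration count and the values are assumptions, and the paper's post-processing parameters are not reproduced.

```python
def sinkhorn_normalize(scores, iterations=50):
    """Alternately rescale rows and columns of a positive square matrix."""
    m = [list(row) for row in scores]
    n = len(m)
    for _ in range(iterations):
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m


# Hypothetical similarities: rows are objects in view A, columns in view B.
raw = [[0.9, 0.8],
       [0.7, 0.1]]
balanced = sinkhorn_normalize(raw)
print([max(range(2), key=lambda j: row[j]) for row in raw])
print([max(range(2), key=lambda j: row[j]) for row in balanced])
```

In the raw matrix both rows prefer column 0, which violates one-to-one matching; after normalization the row-wise maxima pick different columns. This shows how a normalization step can reshape pairwise scores, and why ranking metrics computed on scores and assignment metrics computed on matches can disagree.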


2606.02021 2026-06-02 cs.CV 版本更新

PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation

PerBite: 一种用于咬合感知食物体积估计的精选诊断工作流

Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva

发表机构 * University of Barcelona（巴塞罗那大学）； LogMeal ； Universitat Pompeu Fabra（庞培法布拉大学）

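AI summary: Proposes the PerBite workflow, which estimates food volume from before- and after-consumption states through segmentation, 3D reconstruction, scale calibration, and mesh post-processing, and ranks first in the MetaFood challenge.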

详情

AI中文摘要

一个视觉上合理的食物网格能否被信任来估计消耗食物的体积？PerBite 使用来自MetaFood CVPR 2026连续三维重建与进食挑战的选定配对餐前和餐后状态来研究这个问题。提交的工作流遵循一个精选的重建协议：SAM 3分割食物和盘子区域；Hunyuan3D/SAM 3D生成无量纲食物网格；盘子直径提供度量尺度；在Blender中移除盘子几何形状；剩余的网格进行孔洞填充、水密化并积分以估计体积。MoGe-2仅作为辅助线索用于初始菜肴直径估计，当直接盘子测量不确定时；它不是报告挑战结果的主要尺度来源。PerBite 排名第一，在34个网格上使用刚性ICP（无尺度校正）的平均Chamfer距离为8.31。在17个餐前餐后对上，它实现了33.87%的状态级体积MAPE和零单调性违规，而消耗体积MAPE为53.74%。结果表明，表面重建、度量尺度、受控网格清理、水密体积积分和物理消耗一致性应分别评估以用于饮食评估。源代码和评估脚本将在 https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026 提供。

英文摘要

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
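
The last two steps of the workflow, metric scaling from the plate diameter and volume integration over a watertight mesh, can be sketched as follows. This is a minimal illustration and not the released PerBite code: the function names and the plain list-based mesh layout are assumptions. The volume of a closed triangle mesh is obtained by summing signed tetrahedron volumes formed by each face and the origin, and a uniform length scale `s` changes volume by `s ** 3`.

```python
def signed_volume(vertices, faces):
    """Signed volume of a closed triangle mesh (divergence theorem).

    vertices: list of (x, y, z); faces: list of (i, j, k) vertex indices.
    The sign depends on face orientation; the mesh must be watertight.
    """
    total = 0.0
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        # Scalar triple product a . (b x c) = 6 * signed tetrahedron volume.
        total += (
            a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0])
        )
    return total / 6.0

def metric_volume(vertices, faces, plate_diameter_mesh_units, plate_diameter_cm):
    """Convert a dimensionless mesh volume to cubic centimetres.

    The plate diameter measured in mesh units and in centimetres gives the
    length scale (hypothetical inputs for this sketch).
    """
    scale = plate_diameter_cm / plate_diameter_mesh_units
    return abs(signed_volume(vertices, faces)) * scale ** 3
```

Because the scale enters cubed, a 10% error in the plate diameter produces roughly a 33% error in volume, which is one reason to evaluate metric scale separately from surface reconstruction quality.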


2606.02002 2026-06-02 cs.CV 版本更新

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

面向盲图像质量评估的统计与视觉-语言特征的失真感知融合

Bishr Omer Abdelrahman Adam, Xu Li

发表机构 * Northwestern Polytechnical University（西北工业大学）

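AI summary: Proposes a distortion-aware fusion framework that dynamically weights NSS statistical features and VLM embeddings through a multiplicative gating mechanism, achieves state-of-the-art or competitive performance on three benchmarks, and shows how the contribution of NSS varies across distortion types.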

详情

AI中文摘要

盲图像质量评估（BIQA）旨在无参考图像的情况下预测感知图像质量。经典的自然场景统计（NSS）描述符和现代视觉语言模型（VLM）嵌入从根本不同的角度解决这一问题，但两者结合是否能产生互补优势以及如何根据输入图像加权其贡献尚待探索。我们提出一种失真感知融合框架，通过乘法门控机制将138维NSS描述符与两种互补的VLM嵌入（SigLIP和CLIP-H）集成，该门控机制学习基于图像内容的每输入流权重。与静态拼接融合不同，所提出的门控网络根据输入抑制或放大每个流的贡献，产生的权重与在KADID-10k上通过独立消融测量的每失真NSS贡献呈正相关（Spearman秩相关系数ρ=0.33）。该框架无需对VLM骨干网络进行端到端微调，并使用结合均方误差、Pearson线性相关和成对排序目标的混合损失进行训练。我们在三个标准基准上评估：KonIQ-10k（SROCC=0.9142，PLCC=0.9279）、KADID-10k（SROCC=0.9715，PLCC=0.9733，超越近期最先进方法）和LIVE Challenge in-the-Wild（通过跨数据集预训练和微调，SROCC=0.8527，PLCC=0.8802）。在KADID-10k上的每失真分析表明，NSS特征对噪声和色彩偏移失真（像素统计直接影响）贡献最大，对感知失真（如色彩饱和度变化）贡献最小。学习到的门控值验证了这些发现，确认模型自主发现了与手动每失真研究一致的失真-流亲和模式。

英文摘要

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
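
The multiplicative gating idea can be sketched in a few lines. The abstract does not specify the gate's architecture or activation, so this sketch assumes one sigmoid gate per stream and takes the gate logits as given inputs; in the actual framework they would come from a learned network conditioned on the image. All names here are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gated_fusion(streams, gate_logits):
    """Scale each feature stream by its own gate, then concatenate.

    streams: list of feature vectors (e.g. NSS, SigLIP, CLIP-H features).
    gate_logits: one scalar per stream (assumed output of a gating network).
    Returns the fused vector and the gate values.
    """
    gates = [sigmoid(z) for z in gate_logits]
    fused = []
    for g, feats in zip(gates, streams):
        fused.extend(g * f for f in feats)
    return fused, gates
```

Unlike static concatenation, a stream whose gate is near zero contributes almost nothing for that particular image, and the gate values themselves can be inspected per distortion type, as the per-distortion analysis in the abstract does.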


2606.02000 2026-06-02 cs.CV cs.AI eess.IV 版本更新

Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization

迈向3D感知视频扩散模型：基于网格标记化的无渲染人体运动控制

Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang

发表机构 * DAMO Academy, Alibaba Group（阿里巴巴达摩院）； Hupan Lab（湖畔实验室）； Zhejiang University（浙江大学）； INSAIT

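AI summary: Proposes a render-free framework that conditions video generation directly on compressed 3D human mesh tokens, enabling precise human motion control, reducing artifacts from 2D guidance, and improving 3D structure modeling.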

Comments Project page: https://jingyunliang.github.io/MeshToken/

详情

AI中文摘要

扩散模型在视频生成方面取得了显著成功。然而，这类模型是否真正感知视觉观察背后的3D结构，而不仅仅是生成合理的2D投影，仍是一个开放问题。本文通过人体运动控制这一任务来探究该问题，该任务需要对人体3D几何、运动、相机视角和场景上下文进行精确建模。与依赖渲染的2D运动引导视频的先前方法不同，我们提出了一种无渲染框架，直接基于压缩的3D人体网格标记条件化视频生成。该表示保留了完整的3D几何信息，同时实现了统一的基于标记的生成流程，在DiT架构中联合处理视频标记和运动标记。这种设计要求模型在视频生成过程中联合推理外观、3D结构和相机视角。实验结果表明，该方法在人体运动控制基准上表现强劲，同时减少了由视角依赖的2D引导和编辑过程中轨迹-姿态不匹配引起的伪影。这些发现表明，配备网格标记化的视频扩散模型能够更好地捕捉复杂的3D人体结构及其与周围环境的交互。

英文摘要

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.


2606.01992 2026-06-02 cs.CV cs.AI cs.LG 版本更新

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准：当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

发表机构 * Politecnico di Milano, AIRLab（米兰理工学院，AIRLab）； S&H – Software & Hardware（S&H – 软件与硬件）

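AI summary: Proposes TGAD, a structured benchmark that progressively increases the functional role of language across three scenarios to evaluate the text-guided capabilities of multimodal anomaly detection systems; it finds that current systems are only superficially conditioned on language and that standard benchmarks overstate their capabilities.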

详情

AI中文摘要

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统，并被呈现为支持文本引导的零样本和少样本检测。然而，这些方法使用继承自单模态基准的协议进行评估，这些协议保持文本条件不变，因此无法衡量语言是否条件化决策；报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测（TGAD），这是一个结构化基准，通过三个场景逐步增加语言的功能角色：MVTec AD上的受控提示敏感性设置；MVTec AD的组件标记扩展，要求模型将其评估限制在指定部件；以及新的组装面板数据集（APD），这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型：生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中，文本接口仅表面条件化决策：除非移除对象名词，否则提示内容被吸收（生成模型的I-AUROC从97.4降至82.6）；一旦指令部件外的缺陷被视为正常，组件级指令不约束决策（从90.3降至66.3）；当两者在APD上结合时，图像级判别崩溃至MVTec水平以下，一种情况低于随机水平（71.2、50.5、31.5）。这些结果表明，标准基准夸大了当前多模态异常检测系统的文本引导能力，并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

英文摘要

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.


2606.01985 2026-06-02 cs.CV 版本更新

MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching

MT-EditFlow：基于流匹配的多轮图像编辑强化学习

Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu

发表机构 * Apple（苹果公司）； University of California, Los Angeles（加州大学洛杉矶分校）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； Lambda, Inc（Lambda公司）

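AI summary: Proposes MT-EditFlow, a flow-matching reinforcement learning framework that optimizes the reward signal for multi-turn image editing, addressing the failures and error propagation of single-turn editing models in multi-turn interaction and significantly improving multi-turn editing performance.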

详情

AI中文摘要

近年来，基于指令的图像编辑取得了重大突破，模型现在能够处理现实世界中的编辑需求，满足日常用户的实用性要求。然而，主要为单轮编辑训练的编辑模型在多轮编辑中常常失败——在这种自然的交互设置中，用户基于模型自身之前的输出迭代地细化图像。这种失败源于“全有或全无”的要求，即单次失败会破坏整个序列，以及误差传播，即暴露偏差导致编辑误差累积。为了解决这些挑战，我们引入了MT-EditFlow，一个流匹配强化学习框架，旨在优化序列图像编辑的奖励信号。MT-EditFlow整合了多轮视角和多奖励公式，为基于GRPO和NFT的强化学习方法提供了统一的结构。我们通过研究有效的轮次级聚合评分策略、权衡奖励偏差与方差的VLM推理模式以及防止奖励破解的优势融合级别，系统地分析和优化了奖励信号。我们的发现表明，将聚合优势广播到整个编辑轨迹中，有效地弥合了局部规划与全局多轮任务成功之间的差距。大量实验表明，MT-EditFlow在多种基础模型上显著提升了性能。值得注意的是，它在FLUX.1-Kontext-dev上将第3轮整体性能提升了6.85分，超越了Qwen-Image-Edit等最先进的开源模型。通过保持高边际成功率和减少暴露偏差，MT-EditFlow为视觉内容创作中更可靠、更自然的人机协作奠定了基础。

英文摘要

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing, the natural interactive setting in which a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
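
The idea of aggregating turn-level rewards and broadcasting one advantage across the whole trajectory can be sketched as below. This is an assumption-laden illustration, not MT-EditFlow's implementation: the product aggregation is chosen here only to mimic the all-or-nothing requirement, the group normalization follows the usual GRPO-style mean and standard deviation recipe, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def aggregate(turn_rewards):
    """Trajectory score as the product of per-turn rewards in [0, 1].

    One failed turn (reward 0) zeroes the whole trajectory. The paper studies
    several aggregation strategies; this choice is only an assumption.
    """
    return math.prod(turn_rewards)

def broadcast_advantages(group_turn_rewards, eps=1e-8):
    """Group-normalized advantage, copied to every turn of each trajectory.

    group_turn_rewards: one list of per-turn rewards per sampled trajectory.
    """
    totals = [aggregate(r) for r in group_turn_rewards]
    mu, sigma = mean(totals), pstdev(totals)
    advantages = [(t - mu) / (sigma + eps) for t in totals]
    # Every turn in a trajectory receives the same trajectory-level advantage.
    return [[a] * len(r) for a, r in zip(advantages, group_turn_rewards)]
```

With two sampled trajectories, one succeeding on all three turns and one failing the second turn, every turn of the first receives a positive advantage and every turn of the second a negative one, so early turns are credited for the eventual multi-turn outcome.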


2606.01981 2026-06-02 cs.CV 版本更新

Generalization Limits in Vehicle Re-Identification

车辆再识别中的泛化极限

Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi, Gabriele Facciolo, Axel Davy, Rodrigo Verschae

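AI summary: Addressing the poor generalization of vehicle re-identification models to unseen vehicle types, the paper proposes a new evaluation approach and, through a view-based split of the evaluation, reveals the limits of existing methods in viewpoint robustness and attention to detail.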

详情

AI中文摘要

车辆再识别关注于根据查询图像从图库中检索同一车辆的图像。通过仔细检查常用数据集，我们观察到视觉差异很小的车辆——例如相同的品牌、型号和颜色——同时出现在训练集和测试集中。因此，有效记忆训练数据的方法在这些测试集上表现良好，但难以泛化到其他数据集。在本文中，我们通过提出一种新的评估方法来解决这个问题，该方法能更有效地衡量对未见车辆类型的泛化能力。为了进一步研究泛化性能，我们还提出基于视角进行分割评估，从而区分视角鲁棒性与同视角再识别的影响。我们的发现表明，大多数最先进的方法在处理未见车辆类型时存在困难，并且它们对视角变化的鲁棒性和对细节的关注仅限于训练中见过的车辆类型。

英文摘要

Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.


2606.01973 2026-06-02 cs.LG cs.CV 版本更新

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

开放集测试时自适应中分布内与分布外准确率的深入分析

Zefeng Li, Evan Shelhamer

发表机构 * University of British Columbia and Vector Institute（不列颠哥伦比亚大学和向量研究所）

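AI summary: Through benchmarking and a new baseline, this paper shows that current open-set test-time adaptation methods struggle to balance in-distribution accuracy against out-of-distribution detection.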

Comments TMLR 2026

详情

AI中文摘要

开放集测试时自适应（TTA）在存在输入偏移和未知输出类别的情况下更新模型。尽管近期方法在提高已知类别的分布内（InD）准确率方面取得了进展，但它们准确检测分布外（OOD）未知类别的能力仍未得到充分探索。我们在小规模CIFAR-10-C和大规模ImageNet-C的标准损坏基准上，对鲁棒和开放集TTA方法（SAR、OSTTA、UniEnt和SoTTA）进行了基准测试。对于CIFAR-10-C，我们使用来自SVHN和CIFAR-100的OOD数据，分别对应其损坏形式SVHN-C和CIFAR-100-C。对于ImageNet-C，我们使用来自ImageNet-O和Textures的OOD数据，分别对应其损坏形式ImageNet-O-C和Textures-C。ImageNet-O更接近ImageNet，包含未知但相关的物体类别（如食物类的“蒜香面包”与“热狗”，基础设施类的“高速公路”与“水坝”），而Textures则远离ImageNet，包含非物体图案（如“裂纹”泥土、“多孔”海绵、“纹理”树叶）。我们评估了TTA方法在CIFAR-10-C和ImageNet-C上对InD与OOD识别的准确率和置信度。我们在CIFAR-10-C上验证了每种方法自身OOD检测技术的准确率。我们还在ImageNet-C上进行了评估，并报告了准确率和标准OOD检测指标。我们进一步考察了更现实的设置，其中OOD数据的比例和速率可以变化。为了探索InD识别与OOD拒绝之间的权衡，我们提出了一种新的基线，将softmax/多类输出替换为sigmoid/多标签输出。我们的分析首次表明，当前的开放集TTA方法难以平衡InD和OOD准确率，并且它们仅能不完全地过滤OOD数据以进行自身的自适应更新。

英文摘要

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
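
A small sketch can show why swapping softmax for sigmoid outputs matters for OOD rejection. The thresholded rejection rule and the function names below are assumptions made for illustration; the abstract does not describe the baseline's decision rule. Softmax probabilities must sum to one, so an input for which every class logit is low can still receive a confident prediction, whereas independent sigmoid scores can all be low at once.

```python
import math

def softmax(logits):
    """Multi-class probabilities that sum to 1 (shifted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_scores(logits):
    """Independent per-class scores; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def predict_or_reject(scores, threshold=0.5):
    """Return the top class index, or None to reject the input as OOD.

    The fixed threshold is a hypothetical choice for this sketch.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None
```

For logits `[-4.0, -6.0, -6.0]`, the softmax output still assigns most of the mass to class 0 and the input is accepted, while every sigmoid score falls below the threshold and the input is rejected.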


2606.01955 2026-06-02 cs.RO cs.CV 版本更新

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM：在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

发表机构 * X Square Robot Team（X Square机器人团队）

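AI summary: Proposes WALL-WM, a World Action Model that uses event-level vision-language-action pretraining to resolve the granularity mismatch that fixed-length action chunks create among language, vision, and action; it generalizes across language, scenes, and tasks and achieves state-of-the-art performance in large-scale real-world evaluation.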

详情

AI中文摘要

WALL-WM是一种世界动作模型，它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练，使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化，然后直接基于当前观测和指令优化固定长度的动作块。尽管方便，但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件，视觉通过连续场景动态演变，动作在控制级时间尺度上运行；将三者强制纳入相同的固定长度预测窗口，使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说，它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对，从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发，WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块，而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理，同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施，WALL-WM为通用WAM提供了实用的规模化方案。实验表明，WALL-WM在语言、场景和任务上广泛泛化，在大规模真实世界泛化评估中达到了最先进的性能。

英文摘要

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.


2606.01950 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group（技术系统智能感知组）

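AI summary: Proposes MRO-GWM, which learns action-conditional 3D dynamics of rigid objects using an object-centric Gaussian representation and a spatio-temporal transformer architecture, supporting multi-object scenes and future motion prediction under partial observations.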

详情

AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中，我们提出了多刚性物体高斯世界模型（MRO-GWM），一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景，我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构，该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示，从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练，这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能，这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
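
Representing an object by Gaussians in a canonical frame means that a predicted rigid motion can be applied to every Gaussian in closed form. The sketch below shows that step only, under the standard rule for transforming a Gaussian: the mean maps to `R @ mean + t` and the covariance to `R @ cov @ R.T`. It is not the MRO-GWM model itself, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Plain matrix product for small list-of-lists matrices."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def transpose(A):
    return [list(col) for col in zip(*A)]

def transform_gaussian(mean, cov, R, t):
    """Apply a rigid transform (rotation R, translation t) to one 3D Gaussian."""
    new_mean = [sum(R[i][k] * mean[k] for k in range(3)) + t[i] for i in range(3)]
    new_cov = matmul(matmul(R, cov), transpose(R))
    return new_mean, new_cov

def transform_object(gaussians, R, t):
    """Move every canonical-frame Gaussian of an object with the same (R, t)."""
    return [transform_gaussian(m, c, R, t) for m, c in gaussians]
```

Because all Gaussians of one object share a single `(R, t)`, a dynamics model only needs to predict one rigid transformation per object and time step rather than a motion for every Gaussian.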


2606.01947 2026-06-02 cs.CV cs.AI 版本更新

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * University of Freiburg（弗赖堡大学）

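AI summary: This study explores two parameter-efficient fine-tuning methods for instance segmentation, adapters and Low-Rank Adaptation (LoRA), achieving competitive performance while fine-tuning only about 1-6% of parameters, and finds that 2-3 adapters per transformer block give the best balance of performance and efficiency.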

Comments Published by the Machine Learning and Knowledge Extraction Journal

详情

DOI: 10.3390/make6040133
Journal ref: Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807

AI中文摘要

近年来，随着大型预训练模型的兴起，人工智能的研究和应用发生了转变，这些模型在众多任务中取得了最先进的结果。然而，参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展，但针对基于Transformer的模型在实例分割任务中的参数高效微调（PEFT）方法的研究仍然有限。为填补这一空白，本研究调查了PEFT方法的有效性，特别是适配器和低秩适应（LoRA），并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力（本文首次探索），在仅微调约1-6%模型参数的情况下取得了竞争性能，相比传统微调所需的40-55%有显著改进。关键发现表明，每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外，LoRA在应用于可变形注意力时表现出强大的参数效率，并在某些情况下超越了适配器配置。这些结果表明，PEFT技术的影响因数据集复杂性和模型架构而异，强调了上下文特定调优的重要性。总体而言，这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

英文摘要

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
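
The source of LoRA's parameter savings can be made concrete with a short sketch. It shows the generic LoRA update for one linear layer, where a frozen weight `W` is augmented by a trainable low-rank product `B @ A`; it does not reproduce the paper's deformable-attention integration, and the function names are hypothetical. The per-layer fraction computed here is also not the same quantity as the model-level 1-6% figure in the abstract.

```python
def matvec(M, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable.

    W: d_out x d_in, A: rank x d_in, B: d_out x rank.
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]

def lora_param_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)
```

For a 256 x 256 layer with rank 4, `lora_param_fraction(256, 256, 4)` gives 0.03125, so the adapter trains about 3% as many parameters as the layer it modifies. Initializing `B` to zeros makes the adapted layer start out identical to the pretrained one.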


2606.01945 2026-06-02 cs.CV 版本更新

Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

超越低秩：通过脉冲神经网络和提示分解实现低秩稀疏提示

Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang

发表机构 * Information Materials and Intelligent Sensing Laboratory of Anhui Province（安徽省信息材料与智能感知实验室）； Anhui Provincial Key Laboratory of Multimodal Cognitive Computation（安徽省多模态认知计算重点实验室）； School of Computer Science and Technology, Anhui University（安徽大学计算机科学与技术学院）

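AI summary: Proposes the LoRSP framework, which uses the sparse firing mechanism of spiking neurons together with low-rank factorization to generate instance-specific sparse visual prompts for efficient and robust visual prompt learning.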

详情

AI中文摘要

视觉提示（VP）已成为一种高效范式，通过在输入层引入可学习提示来适应大规模预训练视觉模型到下游任务。然而，现有的VP方法通常采用密集的像素级提示，往往存在冗余扰动、泛化能力有限和能效低的问题。为克服这些限制，我们提出将脑启发脉冲学习融入视觉提示学习任务。我们知道，脉冲神经元可以通过将输入数据转换为离散脉冲序列并返回稀疏输出来进行低成本信息处理。受此启发，我们提出低秩视觉脉冲提示（LoRSP），一种新颖框架，通过脉冲神经元学习机制自然地学习动态低秩稀疏视觉提示。LoRSP的核心思想是利用脉冲神经元的脑启发稀疏发放机制为每个实例生成像素级稀疏提示。具体而言，我们首先通过低秩分解构建一系列提示因子以捕获不同的提示子空间。然后将这些提示因子输入SNN架构，执行整合-发放过程以发射脉冲。因此，我们的LoRSP在保持低秩约束的同时生成稀疏视觉提示。这种设计实现了实例特定的选择性提示，从而在多样化的下游任务中实现更紧凑和鲁棒的适应。在五个异构视觉骨干网络和多个基准上的大量实验表明，与现有VP方法相比，LoRSP在需要更少可调参数的情况下实现了具有竞争力的性能。

英文摘要

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons can process information inexpensively by converting input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
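
The two ingredients named in the abstract, low-rank prompt factors and integrate-and-fire spiking, can be sketched roughly as follows. This is a loose illustration and not the LoRSP architecture: how the paper wires the factors into the SNN is not described here, so the sketch assumes that each pixel of a dense low-rank prompt drives one integrate-and-fire neuron and that pixels whose neuron never fires are zeroed. All names, the threshold, and the number of time steps are hypothetical, and this simple masking does not by itself preserve the low-rank constraint that the paper's design maintains.

```python
def low_rank_prompt(U, V):
    """Dense prompt as a sum of rank-1 outer products.

    U: rank x height, V: rank x width; prompt[i][j] = sum_k U[k][i] * V[k][j].
    """
    rows, cols = len(U[0]), len(V[0])
    return [
        [sum(u[i] * v[j] for u, v in zip(U, V)) for j in range(cols)]
        for i in range(rows)
    ]

def firing_rate(current, threshold=1.0, steps=4):
    """Integrate-and-fire neuron driven by a constant input current."""
    potential, spikes = 0.0, 0
    for _ in range(steps):
        potential += current          # integrate
        if potential >= threshold:    # fire
            spikes += 1
            potential = 0.0           # hard reset
    return spikes / steps

def sparse_prompt(U, V, threshold=1.0, steps=4):
    """Keep only prompt pixels whose neuron fired at least once."""
    dense = low_rank_prompt(U, V)
    return [
        [x if firing_rate(x, threshold, steps) > 0 else 0.0 for x in row]
        for row in dense
    ]
```

Weak prompt values never accumulate enough potential to cross the threshold and are dropped, which is the intuition behind using spiking dynamics to obtain a sparse, input-dependent prompt.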


2606.01940 2026-06-02 cs.CV 版本更新

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

SCAPO: 从单次3D观测中自监督学习类别级关节物体姿态估计

Can Zhang, Gim Hee Lee

发表机构 * Department of Computer Science, National University of Singapore（新加坡国立大学计算机科学系）

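AI summary: Proposes SCAPO, a self-supervised framework that estimates the canonical geometry, rigid part segmentation, and joint parameters of articulated objects from a single RGB-D image, without ground-truth labels or category-specific models.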

详情

AI中文摘要

现有的从单次3D观测中估计类别级物体关节的方法通常依赖密集监督、多帧输入或CAD模板，并且仍然难以从关节中解耦几何或恢复显式关节参数。我们提出SCAPO，一个自监督框架，从单张RGB-D观测中估计规范几何、刚性部件分割以及关节枢轴、轴和关节状态，无需真实标签或类别特定模型。我们的SCAPO首先使用SE(3)-等变向量神经元自编码器来分解全局姿态并将不同实例对齐到共享规范空间。在此对齐形状上，设计了一个关节感知的混合蒙皮模块来建模部件运动。我们通过观测形状和规范形状之间的循环重建以及可学习规范模板的跨空间对齐来学习这种表示，该模板将共享类别几何与实例特定残差形状解耦。在合成和真实关节物体数据集上的实验表明，我们的SCAPO恢复了一致的部件结构和准确的关节参数，并优于所有自监督基线。

英文摘要

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.


2606.01939 2026-06-02 cs.CV 版本更新

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

SAVMap: 基于结构辅助的全景视频大规模2.5D曼哈顿线框视觉映射

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

发表机构 * Nokia Bell Labs（诺基亚贝尔实验室）； NYU（纽约大学）

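AI summary: Proposes SAVMap, which combines panoramic video, a semantic segmentation network, and Manhattan-grid geometric constraints to generate semantic wireframe maps of warehouse scenes, achieving accurate large-scale 3D reconstruction.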

Comments IEEE ICRA 2026

详情

AI中文摘要

工业环境的精确3D表示能够支持机器人定位和数字孪生生成等任务。我们提出SAVMap，一种仅使用全景视频相机作为传感器输入，生成仓库货架和灯光结构语义线框地图的方法。从沿仓库通道拍摄的全景视频中提取一系列带有货架和天花板视角的校正图像。通过语义分割网络前端，从每张图像中提取一组稀疏的语义结构特征点（例如货架结构的角点、灯光的中心），并在序列中跟踪这些点。通过考虑点之间的真实世界几何关系（如曼哈顿网格），一种受约束的运动恢复结构算法生成构成线框地图的3D点。我们在一个拥有46排货架的仓库中展示了我们方法的可扩展性和准确性，每排货架的面尺寸为55米×7米。从一小时的视频内容中，我们为超过5000个货架元素创建了线框地图，与真实值相比，总体平均绝对误差为4.8厘米。

英文摘要

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.

2606.01933 2026-06-02 cs.CV updated version

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

Raghad Albusayes, Munirah Alyahya

Affiliations: TAHAKOM

AI summary: Proposes a training-free agentic framework that uses a video knowledge graph and a hierarchical retrieval index to handle complex spatiotemporal reasoning over large-scale multi-view video, placing third in the CASTLE Challenge.

Abstract

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.

2606.01920 2026-06-02 cs.CV updated version

Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement

Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa

Affiliations: Graduate School of Engineering, University of Fukui; College of Computer Science and Artificial Intelligence, Southwest Minzu University

AI summary: Proposes Pool-Select-Refine, a two-stage framework that decouples generation, selection, and refinement through selection from an over-complete candidate pool and soft-label-guided latent refinement, making better use of the budget in diffusion-based dataset distillation.

Abstract

Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may waste the limited budget on redundant or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.

2606.01914 2026-06-02 cs.CL cs.CV updated version

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

Affiliations: Kyoto University; NII LLMC; RIKEN AIP; Case Western Reserve University; The Hong Kong Polytechnic University; The University of Osaka; University of Tokyo

AI summary: Finds that multimodal large language models exhibit a spatial lexical bias, in which adding a spatial relation word to an answer option draws the model toward that option; mechanistic interpretability tools show that a substantial part of the bias originates on the language side rather than the visual side, and a lightweight LLM-only DPO update mitigates it.

Abstract

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR, respectively.
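
As a rough illustration of the DPO update mentioned in the abstract, the sketch below computes the standard Direct Preference Optimization loss for a single preference pair. It is not the authors' training code: the function name `dpo_loss`, the `beta` value, and the example log-probabilities are assumptions made for illustration. One plausible pairing, also an assumption, would treat the correct spatial relation as the chosen answer and the lexically attractive distractor as the rejected one.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) answer pair.

    Each argument is the summed log-probability of a full answer under the
    trainable policy or the frozen reference model. beta=0.1 is an assumed value.
    """
    # How much more the policy prefers the chosen answer than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), split by sign so exp() never overflows.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical log-probabilities: before the update the policy equals the reference.
loss_before = dpo_loss(-2.0, -1.0, -2.0, -1.0)
# After an update that raises the chosen answer and lowers the rejected one.
loss_after = dpo_loss(-1.0, -3.0, -2.0, -1.0)
print(loss_before > loss_after)
```

The loss depends only on the difference between policy and reference log-probabilities, so a policy identical to the reference starts at log 2 regardless of how likely either answer is.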

2606.01911 2026-06-02 cs.CV updated version

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

Affiliations: Central South University; University of Oxford; Microsoft Research

AI summary: Proposes the Residual Decoder Adapter (RDA), which adds a paired codebook and a parallel branch that learns pixel-space residuals, substantially improving text rendering without retraining the tokenizer or the autoregressive model.

Comments CVPR 2026 poster

Abstract

Visual autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the tiny differences (residuals) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, the OCR accuracy of a fine-tuned Janus-Pro rises from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA.

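2606.01910 2026-06-02 cs.GR cs.CV updated version

Single-Line Drawing Generation via Semantics-Driven Optimization

Tanguy Magne, Alexandre Binninger, Ruben Wiersma, Olga Sorkine-Hornung

Affiliations: ETH Zurich

AI summary: Proposes a semantics-driven method that generates single-line drawings in vector format from a text prompt or an input image, optimizing the parameters of a uniform rational B-spline curve with score distillation sampling and adding loss terms to control artistic style; the results outperform existing methods and support downstream fabrication.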

Comments 18 pages, published in Computer Graphics Forum 2026

DOI: 10.1111/cgf.70502

Abstract

Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at https://github.com/tanguymagne/SLDgen.

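2606.01908 2026-06-02 cs.LG cs.CV updated version

Private and Stable Test-Time Adaptation with Differential Privacy

Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

Affiliations: University of California, Berkeley

AI summary: Casts several test-time adaptation methods into differentially private forms that protect test data through per-sample gradient clipping and Gaussian noise, balancing privacy and accuracy on ImageNet-C and finding that clipping can improve the accuracy and stability of continual adaptation.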

Comments ICML 2026

Abstract

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
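
The mechanism the abstract describes (per-sample gradient clipping followed by Gaussian noise) is the usual DP-SGD aggregation step. A minimal sketch under that assumption follows; the function name, the `clip_norm` and `noise_multiplier` parameters, and the toy gradients are illustrative and not taken from the paper, and the privacy accounting that turns the noise level into an (ε, δ) guarantee is omitted.

```python
import math
import random

def dp_aggregate(per_sample_grads, clip_norm, noise_multiplier, rng=None):
    """Clip each sample's gradient to L2 norm <= clip_norm, sum, add noise, average.

    per_sample_grads: list of equal-length gradient vectors, one per test sample.
    The Gaussian noise has standard deviation noise_multiplier * clip_norm.
    """
    # Assumption: a plain PRNG is enough for a sketch; a real deployment would
    # need a cryptographically secure noise source.
    rng = rng if rng is not None else random.Random()
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Gradients already inside the clipping ball are left unchanged.
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for i, x in enumerate(grad):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm
    n = len(per_sample_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]

# Toy batch: the first gradient (norm 5) is clipped, the second (norm 0.5) is not.
grads = [[3.0, 4.0], [0.3, 0.4]]
clipped_only = dp_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0)
noisy = dp_aggregate(grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(1))
print(clipped_only, noisy)
```

Clipping bounds how much any single test input can move the update, which is consistent with the abstract's observation that clipping alone can stabilize continual adaptation.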

2606.01901 2026-06-02 cs.CV cs.AI cs.CL updated version

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

Affiliations: Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI), Berlin

AI summary: Proposes the Image Reconstruction Game benchmark, in which a vision-language model issues corrective instructions to an image generator over multiple turns so that accumulated common ground is directly visible as the reconstructed image; the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps.

Abstract

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01896 2026-06-02 cs.CV cs.AI updated version

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

Atmika Bhardwaj, Silvia Vock, Nico Steckhan

Affiliations: Federal Institute for Occupational Safety and Health

AI summary: Evaluates, through multi-stage training-schedule experiments, how generative inpainting data affects hand detection in safety-critical settings, finding that a suitable training procedure yields substantial gains in real deployment.

Comments 16 pages, 4 figures

Abstract

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment, which trains on real ∪ synthetic data and then fine-tunes the resulting weights on real-only data at a lower learning rate, increases mAP@0.5 over the real-only baseline model on the standard real test set and narrows the real-gloves out-of-distribution gap. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
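
The two reported metrics differ only in how strict the overlap test is. mAP@0.5 counts a detection as correct when its intersection-over-union (IoU) with a ground-truth box is at least 0.5, while mAP@0.5:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05 (the COCO convention), so it rewards tighter boxes. A small sketch of the overlap test follows; the helper names and box coordinates are hypothetical, and the full precision-recall computation behind mAP is not shown.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# The ten IoU thresholds averaged by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred, gt):
    """Number of IoU thresholds at which `pred` would count as a match for `gt`."""
    iou = box_iou(pred, gt)
    return sum(1 for t in IOU_THRESHOLDS if iou >= t)

# Hypothetical hand boxes: both match at IoU 0.5, but only the tight one
# keeps matching as the threshold rises.
ground_truth = (0.0, 0.0, 10.0, 10.0)
loose_box = (0.0, 0.0, 10.0, 5.6)
tight_box = (0.0, 0.0, 10.0, 9.7)
print(thresholds_passed(loose_box, ground_truth), thresholds_passed(tight_box, ground_truth))
```

This is why a training schedule can lead on mAP@0.5:0.95 without leading on mAP@0.5: the stricter metric separates detectors by box tightness rather than by whether the hand was found at all.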

2606.01895 2026-06-02 cs.CV cs.AI updated version

Deep Learning Generates Computational PIN-4 Immunohistochemistry Stains from Prostate Biopsy H&E Images

Vietbao Tran, Pratik Shah

Affiliations: Biomedical Engineering, University of California, Irvine, Irvine, CA, USA; Laboratory Medicine, Biomedical Engineering, Electrical Engineering; Computer Science, University of California, Irvine, Irvine, CA, USA

AI summary: Uses a conditional generative adversarial network (cGAN) to synthesize PIN-4 IHC staining from H&E images, achieving direct spatial correspondence and favorable results in pathologist review.

Abstract

Immunohistochemistry (IHC) is frequently used to resolve diagnostically ambiguous prostate cancer biopsy findings on hematoxylin and eosin (H&E)-stained tissue. However, PIN-4 IHC staining is typically performed on adjacent tissue sections, limiting direct spatial comparison between the H&E morphology and the corresponding immunophenotypic signal. A paired, registered H&E/PIN-4 dataset was constructed from routine clinical prostate biopsy whole-slide images (WSIs), and a conditional generative adversarial network (cGAN) was trained to synthesize PIN-4 staining patterns directly from native H&E image patches. The final dataset comprised 172 paired WSIs from 93 patients and 27,298 registered 1024x1024 patch pairs, spanning adenocarcinoma-positive and benign cases with representation across age, race, and ethnicity groups. The model was evaluated on a held-out test set of 1,814 patch pairs from 17 WSIs, achieving a mean peak signal-to-noise ratio (PSNR) of 21.88 dB, structural similarity index measure (SSIM) of 0.667, Pearson correlation coefficient (PCC) of 0.684, and learned perceptual image patch similarity (LPIPS) of 0.417. Qualitative review by a board-certified pathologist showed that generated images captured diagnostically relevant PIN-4 staining patterns, including AMACR/racemase expression and basal-cell-associated staining, while preserving spatial correspondence with the source H&E morphology. Accuracy of synthesis varied across morphologically complex regions, including high-grade carcinoma and intraductal carcinoma. These results support the feasibility of supervised PIN-4 synthesis from routinely acquired brightfield H&E prostate biopsy images. The approach enables direct interpretation of predicted PIN-4 marker patterns in the context of the source prostate H&E architecture, addressing a current spatial limitation of conventional adjacent-section IHC.
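
Two of the reported image-quality metrics have simple closed forms. The sketch below computes PSNR and the Pearson correlation coefficient on flattened pixel lists; the function names and the 8-bit `max_value` default are assumptions, and SSIM and LPIPS are left out because they need windowed statistics and a pretrained network, respectively.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length pixel lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0

# Hypothetical 2x2 patches, flattened: real PIN-4 stain vs. synthesized stain.
real_patch = [120.0, 130.0, 200.0, 90.0]
fake_patch = [118.0, 135.0, 190.0, 95.0]
print(round(psnr(real_patch, fake_patch), 2), round(pearson(real_patch, fake_patch), 3))
```

PSNR penalizes absolute intensity error, whereas the Pearson coefficient is unchanged by a global brightness or contrast shift, which is one reason such studies report both.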

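2606.01858 2026-06-02 cs.CV updated version

Polaris: Scaling Up Instruction-Guided Image Generation Towards Millions of Personalized Style Needs

Zhi-Kai Chen, Jun-Peng Jiang, Jun-Jie Tao, De-Chuan Zhan, Han-Jia Ye

Affiliations: Tsinghua University

AI summary: Proposes Polaris, an intelligent retrieval framework that indexes more than 6,500 checkpoints and 75,000 adapters and automatically selects and integrates the most relevant model components, enabling scalable, controllable, and well-aligned instruction-driven image generation without additional training.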

Abstract

Users increasingly expect image generation models to quickly adapt to highly diverse and personalized requirements, such as producing images with distinctive styles or characteristics. Traditional approaches rely on fine-tuning, which is costly and difficult to scale. To cope with these limitations, the community has accumulated a growing library of fine-tuned modules and adapters, where each component targets specific generation needs and collectively serves as a foundation for handling new demands. This naturally raises a question: instead of repeatedly training new models, can we systematically exploit this expanding ecosystem to better fulfill user instructions? To this end, we present Polaris, an intelligent retrieval framework that automatically selects and integrates suitable models from the model library based on a user's instructions. The key insight is that harnessing such a massive and heterogeneous pool requires not only finding the most relevant modules among thousands of candidates, but also aligning them effectively for instruction-driven generation and editing. Polaris addresses this challenge by indexing over 6,500 checkpoints and 75,000 adapters, and retrieving the most relevant components given a user's input and instruction. In doing so, it delivers scalable, controllable, and well-aligned generation -- without any additional training.

2606.01848 2026-06-02 cs.CV updated version

RescueBench: Can Embodied Agents Save Lives in the Wild?

Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong

Affiliations: Beihang University; Beijing Normal University; Peking University; City University of Macau; ATEC2025 Challenge Committee

AI summary: Proposes RescueBench, a photo-realistic diagnostic benchmark built as a four-stage pipeline that evaluates the exploration, memory, and interaction abilities of embodied agents in search-and-rescue tasks and reveals how exploration and memory failures propagate.

Abstract

Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baseline completes the full task at the greatest difficulty. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench.

2606.01843 2026-06-02 cs.CV cs.AI updated version

Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection

Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua

Affiliations: Hefei University of Technology; National University of Singapore

AI summary: Proposes the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts through subspace modeling to improve the cross-method generalization of deepfake detection.

Abstract

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.
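
The shortcut-subspace idea can be sketched with a few lines of linear algebra. The code below is an illustration under assumptions, not the authors' implementation: it treats the weight matrix of a linear forgery-method probe as given, takes its top-k right singular vectors as the shortcut subspace, and removes a fraction `strength` of each feature's component in that subspace. The names `shortcut_basis`, `suppress`, `k`, and `strength`, and the toy matrices, are hypothetical, and the abstract does not say which matrix the SVD is applied to.

```python
import numpy as np

def shortcut_basis(probe_weights, k):
    """Top-k right singular vectors of a probe weight matrix (num_methods x dim).

    Assumption: the directions the probe uses to tell forgery methods apart
    stand in for the method-specific shortcut subspace.
    """
    _, _, vt = np.linalg.svd(probe_weights, full_matrices=False)
    return vt[:k]  # shape (k, dim), rows are orthonormal

def suppress(features, basis, strength=1.0):
    """Remove a fraction `strength` of each feature's shortcut-subspace component."""
    coeffs = features @ basis.T            # (n, k) coordinates in the subspace
    return features - strength * (coeffs @ basis)

# Toy probe that separates two forgery methods using only the first feature axis.
probe = np.array([[1.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0]])
basis = shortcut_basis(probe, k=1)
features = np.array([[3.0, 4.0, 5.0]])
print(suppress(features, basis))                # full suppression of axis 0
print(suppress(features, basis, strength=0.5))  # soft suppression
```

With `strength` below 1 the projection is only partial, which is one simple way to read the "soft" suppression the abstract mentions; the remaining feature axes are untouched and stay available for real/fake discrimination.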

2606.01834 2026-06-02 cs.CV cs.AI updated version

Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing

Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao, Hanli Wang, Rui Fan

Affiliations: College of Electronic and Information Engineering, Tongji University; Department of Control Science and Engineering, Harbin Institute of Technology; Department of Vehicle Control System and Software Development, NIO; School of Computer Science and Technology, Tongji University; Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University

AI summary: Proposes UCDA, an unsupervised collaborative domain adaptation framework that improves the robustness and generalization of target-domain driving scene parsing in the source-free setting through collaborative optimization of multiple source models and knowledge distillation.

Abstract

Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.
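
One way to picture prototype-based reliability is shown below. This is a sketch under stated assumptions rather than the UCDA procedure: each source model is assumed to keep one prototype vector per class, updated as a running average, and a prediction is scored by the cosine similarity between the pixel feature and the prototype of the predicted class. The function names, the momentum value, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; insensitive to the overall scale of either vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def update_prototype(prototype, feature, momentum=0.9):
    """Running-average update of one class prototype (assumed update rule)."""
    return [momentum * p + (1.0 - momentum) * f for p, f in zip(prototype, feature)]

def pick_most_reliable(model_outputs):
    """Return (similarity, predicted_class) of the most prototype-consistent model.

    model_outputs: one (predicted_class, pixel_feature, prototypes) tuple per
    source model, where prototypes maps class id -> prototype vector.
    """
    scored = [(cosine(feat, protos[cls]), cls) for cls, feat, protos in model_outputs]
    return max(scored)

# Toy example with two source models and two classes (0 = road, 1 = sidewalk).
prototypes = {0: [1.0, 0.0], 1: [0.0, 1.0]}
outputs = [
    (0, [2.0, 0.1], prototypes),  # model A: feature close to the road prototype
    (1, [1.0, 1.0], prototypes),  # model B: feature only weakly matches sidewalk
]
similarity, label = pick_most_reliable(outputs)
print(label, round(similarity, 3))
```

Because cosine similarity ignores vector magnitude, the score does not depend on how confident each model's softmax happens to be, which matches the abstract's motivation of avoiding inconsistent confidence scales across source models.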

2606.01808 2026-06-02 cs.CV updated version

Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins

Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li

Affiliations: Department of Biomedical Engineering, National University of Singapore; Department of Medicine, National University of Singapore; Department of Cardiology, National University Heart Centre Singapore

AI summary: Proposes an explicit geometry-motion embedded model that fully automatically reconstructs personalized, simulation-ready 3D myocardial infarct geometry from multi-view cine MRI, using dual-branch adaptive fusion and AHA-17-guided multi-scale supervision for contrast-free infarct characterization.

Comments 14 pages

AI中文摘要

准确的三维心肌梗死（MI）几何表征对于构建心脏数字孪生（CDT）以精确模拟梗死相关电生理至关重要。晚期钆增强磁共振成像（LGE MRI）是定位MI的临床参考，但其对造影剂的依赖限制了在肾功能受损患者中的使用，并限制了纵向随访。作为替代，无对比剂电影MRI可可视化异常心室壁运动，这高度指示梗死区域。在本研究中，我们提出了一种新颖的显式几何-运动嵌入模型，直接从多视角电影MRI中全自动重建个性化、可仿真的三维MI几何结构。具体地，我们构建了一个4D（3D+t）双心室网格，以显式提取和解耦几何感知和运动感知特征。我们进一步设计了一个双分支模块用于自适应几何-运动融合，以捕获时空依赖性来映射梗死区域。此外，我们引入了一种利用AHA-17节段引导的交叉注意力机制的多尺度监督来指导预测，确保生物物理一致的重建。在225例电影MRI上的实验结果表明，所提出的三维MI重建实现了高性能，平均Dice得分为0.678±0.011。在下游的计算机电生理模拟评估中，结果与LGE衍生的真实情况高度一致，突显了所提出模型在无对比剂瘢痕表征和无缝集成到CDT建模中的巨大潜力。代码将在稿件被接受发表后公开。

英文摘要

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs in a fully automatic manner. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping the infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 $\pm$ 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
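For readers unfamiliar with the reported metric, the Dice score measures overlap between a predicted and a reference mask. The sketch below uses the standard definition on binary arrays; how the paper evaluates Dice on mesh-based infarct geometry is not stated in the abstract, so the array representation here is an assumption.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Standard Dice coefficient 2|A∩B| / (|A| + |B|) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

A score of 1 means perfect overlap and 0 means none.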

2606.01790 2026-06-02 cs.CV cs.AI 版本更新

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

STaR-KV: 面向GUI视觉语言模型的时空自适应KV缓存压缩重加权方法

Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang

发表机构 * EPIC Lab, SJTU（上海交通大学EPIC实验室）； HKUST (GZ)（香港科技大学（广州））； The University of Sydney（悉尼大学）； UESTC（电子科技大学）； ZJU（浙江大学）

AI总结提出STaR-KV，一种无需训练的KV缓存压缩框架，通过子空间感知评分、时间稳定性折扣和熵驱动温度三个维度自适应校准令牌重要性，在GUI任务中实现高精度和近40%的峰值GPU内存节省。

AI中文摘要

基于视觉语言模型的图形用户界面（GUI）代理展现出广泛的自动化能力，但其部署受限于随交互步骤线性增长的键值（KV）缓存。例如，UI-TARS-1.5-7B在仅五个屏幕截图上消耗76 GB的GPU内存，接近主流80 GB加速器的容量。现有的KV压缩方法共享两个结构假设：将视觉令牌重要性聚合为单个共享显著性图，并对融合的分数分布应用固定的top-B截断。初步测量反驳了这两点：空间专门化存在于注意力子空间层面并在层间迁移，而分数分布沿轨迹漂移。我们提出STaR-KV（时空自适应重加权），一种无需训练的KV缓存压缩框架，沿三个维度校准令牌重要性：（i）由在线空间互信息驱动的子空间感知评分；（ii）时间稳定性折扣，抑制来自持续关注子空间的冗余缓存条目；（iii）熵导出的温度，自适应重塑分数分布。在四个GUI基准测试中，STaR-KV在匹配预算下实现了最先进的KV压缩方法（如GUIKV、SnapKV）中最强的平均准确率，无压缩阶段FLOPs开销（-0.07%），并在20% KV缓存预算下削减近40%的峰值GPU内存。代码可在https://github.com/kawhiiiileo/STaR-KV获取。

英文摘要

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.
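To make the idea of an entropy-derived temperature concrete, here is a toy sketch of budgeted token selection across several attention subspaces. The entropy-to-temperature mapping, the `t_min`/`t_max` bounds, and the additive fusion are hypothetical choices for illustration only; the abstract does not specify STaR-KV's actual formulas.

```python
import numpy as np

def softmax(x, temperature=1.0):
    z = (x - x.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def entropy_temperature(scores, t_min=0.5, t_max=2.0):
    """Hypothetical rule: flatter score distributions get a lower (sharpening) temperature.

    Assumes at least two tokens so that the entropy normaliser log(n) is non-zero.
    """
    p = softmax(scores)
    normalized_entropy = -(p * np.log(p + 1e-12)).sum() / np.log(len(scores))
    return t_max - (t_max - t_min) * normalized_entropy

def select_tokens(subspace_scores, budget):
    """Keep the top `budget` fraction of tokens after per-subspace reshaping.

    subspace_scores: array of shape (num_subspaces, num_tokens).
    """
    subspace_scores = np.asarray(subspace_scores, dtype=float)
    fused = sum(softmax(s, entropy_temperature(s)) for s in subspace_scores)
    k = max(1, int(round(budget * fused.shape[0])))
    return np.sort(np.argsort(fused)[-k:])
```

Because each subspace is normalised with its own temperature before fusion, a subspace with a flat score profile does not drown out one that attends sharply to a few tokens.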

2606.01788 2026-06-02 cs.CV 版本更新

JenBridge: 跨场景转换的自适应长视频配乐

Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang

发表机构 * Jen Music AI

AI总结提出JenBridge框架，通过基于Transformer的生成模型、双文本-视觉条件对齐和LLM代理驱动的自适应过渡机制，实现长视频配乐的高保真生成与场景转换自然连贯。

AI中文摘要

我们解决了在场景转换中生成高保真、长格式配乐并保持连贯性的挑战。现有的AI音乐系统主要针对短片段设计，缺乏确保叙事连续性的机制。我们提出了JenBridge，一个模块化且可解释的自适应长视频配乐框架，确保高保真音频生成和转换自然性。核心架构是一个基于Transformer的生成模型，采用流匹配目标训练，遵循两阶段范式：在大规模文本-音频语料库上进行预训练以建立稳健的音乐先验，然后通过双文本-视觉条件适应视频领域以实现精确的跨模态对齐。关键的是，为了实现跨不同场景变化的长格式连贯性，JenBridge引入了一种新颖的自适应过渡机制。该系统具有一个多功能的过渡风格工具包，包括一种生成式过渡方法，并独特地采用了一个大型语言模型（LLM）代理，作为导演智能地为每个叙事转变选择最合适的过渡。为了严格评估这一任务，我们提出了LVS基准，这是一个新基准，包含一个精选数据集和新的评估指标，侧重于整体和过渡感知评估。在提出的基准上进行的大量实验表明，JenBridge在客观和主观指标上均显著优于现有方法，特别是在转换自然性和整体叙事连贯性方面。JenBridge代表了向全自动、专业质量的视频配乐迈出的重要一步。

英文摘要

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

2606.01701 2026-06-02 cs.CV 版本更新

Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding

时空相关性引导的几何划分用于多功能视频编码

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

发表机构 * Institute of Digital Media, Department of Electronics Engineering and Computer Science, Peking University（数字媒体研究所，电子工程与计算机科学系，北京大学）； Information Technology R&D Innovation Center of Peking University（北京大学信息科技研发创新中心）； Peng Cheng Laboratory（鹏城实验室）； School of Computer Science and Technology, University of Chinese Academy of Sciences（中国科学院大学计算机科学与技术学院）

AI总结针对VVC中几何划分开销大的问题，提出时空相关性引导的几何划分（STGEO）方案，通过模式预测和运动候选选择减少边信息比特，提升编码效率。

DOI: 10.1109/TIP.2021.3126420
Journal ref: IEEE Transactions on Image Processing, vol. 31, pp. 30-42, 2022

AI中文摘要

几何划分因其在混合视频编码框架中卓越的运动场描述能力而受到越来越多的关注。然而，多功能视频编码（VVC）中现有的几何划分（GEO）方案给边信息的信令带来了不可忽视的负担，从而限制了编码效率。鉴于此，我们提出了一种时空相关性引导的几何划分（STGEO）方案，以有效描述视频编码运动场中的物体信息。所提方法可以节省用于边信息信令的比特，包括划分模式和运动信息。我们首先以统计合理的方式分析了划分模式决策和运动矢量选择的特性。基于观察到的时空相关性，我们设计了一种模式预测和编码方法，以减少表示上述边信息的开销。主要思想是预测具有较高选择可能性的STGEO模式和运动候选，这可以指导熵编码，即用更少的比特表示预测的高概率模式和运动候选。特别地，高概率STGEO模式基于边缘信息和相邻STGEO编码块的历史模式进行预测。相应的运动信息由合并候选列表中的索引表示，该索引基于离线训练的合并候选选择概率自适应地推断。仿真结果表明，与未使用GEO的VTM-8.0相比，所提方法在随机接入和低延迟B配置下平均分别节省了0.95%和1.98%的比特率。

英文摘要

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) imposes a non-negligible burden for signaling the side information, which limits the coding efficiency. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method saves bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that are more likely to be selected, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the offline-trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

2606.01700 2026-06-02 cs.CV 版本更新

MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification

MixerSENet: 一种用于高效高光谱图像分类的轻量级框架

Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali

发表机构 * College of Engineering and IT, University of Dubai（迪拜大学工程与信息技术学院）； Department of Computer Science and Engineering, Alipurduar Government Engineering and Management College（阿利普杜尔政府工程与管理学院计算机科学与工程系）； Department of Geography, Simon Fraser University（西蒙·弗雷泽大学地理系）

AI总结提出轻量级框架MixerSENet，通过解耦空间与通道维度混合并引入挤压激励模块，在保持低参数量的同时实现高光谱图像分类的高精度与高效率。

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

DOI: 10.1109/LGRS.2025.3616338

AI中文摘要

本文提出了一种新颖的框架MixerSENet，用于高光谱图像（HSI）分类，旨在解决计算效率和有限标注数据带来的挑战。所提出的模型处理高光谱图像块，同时在整个网络中保持一致的尺寸和分辨率，有效解耦了空间和通道维度的混合。值得注意的是，MixerSENet轻量且计算高效，与传统模型相比所需参数更少，适用于资源受限环境。模型中嵌入了挤压激励块以细化特征提取，增强网络捕获更多信息特征的能力。在两个基准数据集上的实验结果表明，MixerSENet实现了优越的性能，在Houston13数据集上达到82.47%的总体精度（OA），在Qingyun数据集上达到96.70%，优于包括3D-CNN、HybridKAN、HSIFormer、SimPoolFormer和MorphMamba在内的最先进方法。此外，对计算效率的详细分析表明，MixerSENet在准确性和效率之间实现了良好的平衡，仅需53,146个参数和较低的推理时间，证实了其在实际应用中的实用性。发布时，源代码将在https://github.com/mqalkhatib/MixerSENet公开。

英文摘要

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters than traditional models, making it suitable for resource-constrained environments. A squeeze-and-excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.
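The squeeze-and-excitation block mentioned in the abstract is a well-known channel-gating pattern. The sketch below shows the generic form on a single feature map; the weight shapes and the absence of biases are simplifying assumptions, and this is not taken from the MixerSENet code.

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """Generic squeeze-and-excitation gating.

    x:  feature map of shape (H, W, C)
    w1: reduction weights of shape (C, C // r)
    w2: expansion weights of shape (C // r, C)
    """
    squeezed = x.mean(axis=(0, 1))                   # squeeze: global average pool -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)          # excitation, step 1: ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # excitation, step 2: sigmoid gates in (0, 1)
    return x * gates                                 # rescale each channel by its gate
```

The output keeps the input's shape, which is consistent with the abstract's point that size and resolution stay constant through the network.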

2606.01698 2026-06-02 cs.CV 版本更新

Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model

通过半监督超图概念瓶颈模型实现标签高效的医学图像可解释诊断

Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu

发表机构 * HKUST(GZ)（香港科技大学（广州））； Joy Future Academy（未来正义学院）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； Tsinghua University（清华大学）； Sichuan University（四川大学）； PolyU

AI总结提出一种半监督超图概念瓶颈模型，利用双层超图学习建模高阶概念依赖并生成领域自适应伪标签，在胎盘植入谱系等医学图像诊断中实现高可解释性和性能。

AI中文摘要

深度学习在医学图像分析中取得了革命性进展，在多种应用中提供了卓越的诊断准确性。然而，其决策缺乏可解释性阻碍了临床采纳，特别是在高风险医疗场景中，透明度对可信度至关重要。例如，在胎盘植入谱系（PAS）中，超声图像中的细微线索挑战了可靠诊断，使得黑盒模型难以获得准确的评分信任。为了解决这一问题，概念瓶颈模型（CBM）通过将临床上有意义的中间概念嵌入诊断流程，提供了一种有前景的途径，使临床医生能够审查和优化模型输出。然而，传统的CBM在捕捉复杂的概念间依赖关系方面表现不佳，并且需要昂贵、专家驱动的概念注释，限制了其可扩展性。本研究引入了一种新颖的半监督CBM框架，专为医学成像设计，利用双层超图学习来建模高阶概念依赖并生成领域自适应伪标签。我们的方法通过集成概念级超图以增强推理和图像级超图以生成鲁棒的伪标签，实现了卓越的可解释性和性能。在新标注的PAS超声数据集和乳腺超声公共数据集上的实验证明了所提出的概念标签高效可解释框架的有效性。其通用性在皮肤镜图像数据集SkinCon上得到了进一步验证。代码可在https://github.com/scott-yjyang/HyperCBM获取。

英文摘要

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA（电气与计算机工程系，信息处理实验室，华盛顿大学，美国）

AI总结针对热行人多目标跟踪中身份碎片化问题，提出轻量级后处理方法，通过在线短间隙重映射和离线轨迹重链接恢复身份连续性，在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419

AI中文摘要

热行人多目标跟踪仍然具有挑战性，因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始，我们添加了一个模块化的身份修复后端，包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明，主要身份增益来自保守的重链接，将IDF1从82.25提升到84.93，同时保持MOTA，而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明，在低信息热图像中，通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析，表明与局部帧到帧关联相比，场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
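Offline tracklet relinking of the kind described here can be sketched as a conservative gating rule on temporal gap and spatial distance. The `Tracklet` fields, the thresholds `max_gap` and `max_speed`, and the greedy nearest-match strategy are hypothetical; the abstract also mentions motion and border cues that this toy version omits.

```python
import math
from dataclasses import dataclass

@dataclass
class Tracklet:
    track_id: int
    start_frame: int
    end_frame: int
    start_xy: tuple
    end_xy: tuple

def relink(tracklets, max_gap=30, max_speed=5.0):
    """Return {later_id: earlier_id} for tracklet pairs that pass both gates."""
    mapping = {}
    used = set()  # earlier tracklets that already have a successor
    for later in sorted(tracklets, key=lambda t: t.start_frame):
        best, best_dist = None, float("inf")
        for earlier in tracklets:
            if earlier.track_id == later.track_id or earlier.track_id in used:
                continue
            gap = later.start_frame - earlier.end_frame
            if not 0 < gap <= max_gap:
                continue  # temporal gate: must start shortly after the other ends
            dist = math.dist(earlier.end_xy, later.start_xy)
            # spatial gate: displacement must be reachable at max_speed pixels per frame
            if dist <= max_speed * gap and dist < best_dist:
                best, best_dist = earlier, dist
        if best is not None:
            mapping[later.track_id] = best.track_id
            used.add(best.track_id)
    return mapping
```

Tight gates favour precision over recall, matching the abstract's emphasis on conservative relinking.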

2606.01689 2026-06-02 cs.CV cs.AI 版本更新

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection

RPCASSM: 基于鲁棒主成分分析的状态空间模型用于红外小目标检测

Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University（教育部符号计算与知识工程重点实验室）； College of Software, Jilin University（吉林大学软件学院）； School of Geosciences, Yangtze University（长江大学地球科学学院）； College of Communication Engineering, Jilin University（吉林大学通信工程学院）

AI总结针对红外小目标检测中主流状态空间模型难以准确建模目标边缘的问题，提出基于鲁棒主成分分析（RPCA）的RPCASSM网络，通过设计背景状态空间模块（BSSM）和目标状态空间模块（TSSM）分别利用空间异质信号显著性和目标稀疏局部高亮特性进行状态空间建模，有效解决了边缘建模难题。

Comments 12 pages, 8 figures, under review

AI中文摘要

红外小目标的检测与分割在监控安防、海上救援等领域具有重要的应用意义。由于这些目标在远距离成像中占据像素少，主流的视觉状态空间模型效率低下且难以准确建模目标边缘。现有的红外状态空间模型并未从红外小目标的结构特性出发偏离主流视觉状态空间结构框架。为了解决这一问题，本文基于鲁棒主成分分析（RPCA）的模型范式提出了RPCASSM网络，旨在通过红外小目标在空间域的性质设计背景状态空间模块（BSSM）和目标状态空间模块（TSSM）。BSSM旨在利用空间异质信号的显著性设计空间探测扫描机制（SPCM）来建模背景信息。TSSM利用目标的稀疏性和局部高亮特性设计可变形提示扫描机制（DPCM），聚焦于目标的可变形空间进行状态空间建模。通过上述设计，我们有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了RPCASSM设计的有效性。我们的代码将在 https://github.com/PepperCS/RPCASSM 公开。

英文摘要

The detection and segmentation of infrared small targets are important in applications such as surveillance, security, and maritime rescue. Because these targets occupy very few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models still follow the mainstream visual state space framework instead of being designed around the structural properties of infrared small targets. To address this, this paper proposes the RPCASSM network, based on the model paradigm of robust principal component analysis (RPCA), which designs a background state space module (BSSM) and a target state space module (TSSM) according to the spatial-domain properties of infrared small targets. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) for modeling background information. The TSSM uses the sparsity and local highlight of the target to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the deformable space of the target. With this design, we effectively address the difficulty that existing mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.
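For context, the classical RPCA problem (principal component pursuit) splits an observation matrix into a low-rank part and a sparse part, which maps naturally onto background and small targets. The symbols $D$, $L$, $S$, and $\lambda$ below are textbook notation assumed here; the abstract does not state the exact objective that RPCASSM uses.

```latex
\min_{L,\,S}\ \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad D = L + S
```

Here $D$ is the observed infrared image (or patch matrix), $L$ the low-rank background, $S$ the sparse target component, $\|\cdot\|_{*}$ the nuclear norm, and $\|\cdot\|_{1}$ the entrywise $\ell_1$ norm.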

2606.01652 2026-06-02 eess.SP cs.CV 版本更新

Physics-Aware Linearized ADMM and Its Unrolling

物理感知线性化ADMM及其展开

Satoshi Takabe, Shunta Arai, Tadashi Wadayama

发表机构 * Japan Science and Technology Agency (JST), CRONOS（日本科学技术振兴机构（JST）、CRONOS）

AI总结针对基于PDE测量过程的逆问题，提出物理感知线性化ADMM算法，通过子问题线性化实现高效更新，并利用深度展开训练内部参数，在光纤通信压缩感知和噪声各向异性扩散图像恢复中验证有效性。

Comments 5 pages, 3 figures

2606.01651 2026-06-02 cs.CV 版本更新

Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment

通过几何对齐恢复文本到图像蒸馏中的初始噪声敏感性

Huayang Huang, Ruoyu Wang, Jinhui Zhao, Wei Deng, Daiguo Zhou, Jian Luan, Yu Wu, Ye Zhu

发表机构 * Huazhong University of Science and Technology（华中科技大学）

AI总结提出几何感知蒸馏（GAD）框架，通过匹配雅可比-向量积来对齐教师和学生模型的局部功能行为，从而恢复文本到图像蒸馏中丢失的初始噪声敏感性，提升下游噪声驱动控制任务的性能。

Comments ICML 2026

AI中文摘要

生成式蒸馏通过将多步轨迹压缩为少步学生模型，在保持感知质量的同时显著加速文本到图像（T2I）生成。然而，现有方法主要优化效率和输出保真度，往往忽略了原始轨迹的关键属性。在这项工作中，我们识别出一个缺失的关键属性：对初始噪声的敏感性，其退化会损害依赖噪声优化和操作的下游控制方法。我们将此问题追溯到标准的蒸馏目标，这些目标强制逐点输出对齐，无意中压平了输入-输出景观并抑制了教师的局部几何结构。为了解决这个问题，我们提出了几何感知蒸馏（GAD），一种保持敏感性的框架，用于对齐教师和学生模型的局部功能行为。具体而言，GAD匹配关于输入噪声的雅可比-向量积，使学生能够再现教师对扰动的微分响应。在多个T2I范式和噪声驱动控制任务上的大量实验表明，GAD显著恢复了敏感性并提高了多样性，同时保持了高视觉保真度。代码可在 https://github.com/Hannah1102/GAD 获取。

英文摘要

Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher's local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher's differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at https://github.com/Hannah1102/GAD.
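The difference between pointwise output matching and Jacobian-vector product (JVP) matching can be seen in a small numerical sketch. The finite-difference estimator and the mean-squared loss below are illustrative assumptions; the abstract does not say how GAD computes or weights its JVP term, and a real implementation would more likely use automatic differentiation.

```python
import numpy as np

def jvp_fd(f, x, v, eps=1e-4):
    """Central finite-difference estimate of the Jacobian-vector product J_f(x) @ v."""
    return (f(x + eps * v) - f(x - eps * v)) / (2.0 * eps)

def jvp_matching_loss(teacher, student, noise, direction):
    """Mean squared gap between teacher and student responses to a noise perturbation."""
    diff = jvp_fd(teacher, noise, direction) - jvp_fd(student, noise, direction)
    return float(np.mean(diff ** 2))
```

A student that differs from the teacher only by a constant offset has zero JVP loss despite a non-zero output error, while a student with matching outputs at one point can still respond very differently to perturbations. That gap is what a sensitivity-preserving objective targets.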

2606.01643 2026-06-02 cs.CV 版本更新

Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument

手语生成中的条件坍塌：诊断与缩放论证

Rui Hong, Jana Košecká

发表机构 * George Mason University（乔治·梅森大学）

AI总结本文通过提出三个独立评估层级（初始姿态条件、输出多样性、目标忠实度）并利用冻结运动自编码器的潜在表示计算成对距离比，诊断手语生成模型中的条件坍塌问题，并论证句子级配对数据集规模是瓶颈。

AI中文摘要

手语生成（SLP）是从自然语言文本生成虚拟人物手语动作的任务。生成动作的质量通常通过运动空间弗雷歇距离（FID）和反向翻译（BT）BLEU分数在How2Sign等基准上进行评估。这两个指标可能大幅提升，而底层生成器未能忠实表示手语手势。在这项工作中，我们提出在三个独立层级上评估生成的动作：（τ1）初始姿态条件，（τ2）输出多样性，以及（τ3）目标忠实度。我们使用冻结运动自编码器（MoAE）的潜在表示计算这些成对距离比。我们在How2Sign数据集上评估了14个SLP模型检查点，包括重新实现的Neural Sign Actors（NSA），并表明τ3忠实度从未达到，而FID变化近两个数量级且与忠实度不相关。我们表明，在孤立词汇数据集ASL3DWord上可以达到有利的τ3，因此将句子级配对数据集的大小确定为瓶颈。

英文摘要

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: ($\tau_1$) initial-pose conditioning, ($\tau_2$) output diversity, and ($\tau_3$) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that $\tau_3$ faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable $\tau_3$ can be attained, hence isolating the size of the sentence-level paired dataset as the bottleneck.

2606.01641 2026-06-02 cs.CV 版本更新

Edge-directed geometric partitioning for versatile video coding

面向多功能视频编码的边缘导向几何划分

Xuewei Meng, Xinfeng Zhang, Chuanmin Jia, Xia Li, Shanshe Wang, Siwei Ma

AI总结针对VVC标准，提出基于时空边缘信息构建最可能模式列表的几何划分模式预测策略，以降低索引开销并提升编码效率，平均BD-rate增益0.58%-1.00%。

Comments This paper has been published in IEEE ICME

DOI: 10.1109/ICME46284.2020.9102781
Journal ref: IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1-6

AI中文摘要

为了提升编码性能，针对即将到来的VVC标准提出了几何划分（GEO）。GEO提供140个划分候选。最优GEO模式的索引需要显式地信令。考虑到不同CU的结构特性以及空间相邻块与时序同位块之间的相关性，我们提出了一种GEO模式预测策略，通过构建最可能模式（MPM）列表来减少GEO索引的开销并提高编码效率。基于划分模式与物体边界高度相关的观察，提出了一种边缘导向的几何划分方案，根据时空边缘信息构建MPM列表。与VTM-6.0相比，所提方法在RA和LDB配置下平均提供了0.58%和1.00%的客观BD-rate增益。此外，它还提升了物体边界的视觉质量。

英文摘要

To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.
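A back-of-the-envelope calculation shows why a Most Probable Mode list can reduce index overhead when there are 140 candidates. The sketch assumes plain fixed-length binarisation with a one-bit hit flag; the list size and hit rate are hypothetical inputs, and the real codec uses context-adaptive entropy coding, so the numbers are illustrative only.

```python
import math

def expected_index_bits(num_modes, mpm_size, hit_rate):
    """Compare fixed-length signaling with a flag-plus-MPM-index scheme.

    Returns (fixed_bits, expected_mpm_bits).
    """
    fixed = math.ceil(math.log2(num_modes))
    hit_bits = 1 + math.ceil(math.log2(mpm_size))                # flag + index inside the list
    miss_bits = 1 + math.ceil(math.log2(num_modes - mpm_size))   # flag + index among the rest
    return fixed, hit_rate * hit_bits + (1.0 - hit_rate) * miss_bits
```

With 140 modes, a 4-entry list, and a 60% hit rate, the expected cost under these assumptions drops from 8 bits to 5.4 bits per index. A low hit rate makes the scheme worse than fixed-length coding, which is why prediction quality matters.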

2606.01638 2026-06-02 cs.CV 版本更新

CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation

CanonCGT：通过规范枢轴表示实现基于参考的颜色分级

Jinwon Ko, Keunsoo Ko, Chang-Su Kim

发表机构 * Korea University（高丽大学）； The Catholic University of Korea（韩国天主教大学）

AI总结提出一种基于规范枢轴的两阶段框架CanonCGT，通过去除内在色调偏差并匹配参考风格，实现稳定、真实的颜色分级。

Comments CVPR 2026 accepted

AI中文摘要

基于参考的颜色分级旨在再现参考图像的色调和光照，同时保持色彩和谐与场景结构。现有的逼真和基于滤镜的方法通常产生不稳定的色调映射（过度偏移或不一致地保留颜色），导致不自然的结果。我们提出CanonCGT，一个基于规范枢轴（一种用于稳定颜色映射的风格中立中间表示）的两阶段框架。第一阶段通过去除内在色调偏差来规范化输入，第二阶段对其进行颜色分级以匹配参考风格。一种双阶段训练方案DP-CGT结合了监督预设学习和非配对照片上的自监督细化。CanonCGT在多种数据集上产生逼真且色调一致的结果，在稳定性和视觉保真度上超越了最先进的方法。我们的代码可在 https://github.com/Jinwon-Ko/CanonCGT 获取。

英文摘要

Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings, over-shifting or inconsistently retaining colors, leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot, a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our code is available at https://github.com/Jinwon-Ko/CanonCGT.

2606.01636 2026-06-02 cs.CV 版本更新

Pave-GRPO: Beyond Instantaneous Guidance through Principled Average Velocity Decomposition

Pave-GRPO：通过原则性平均速度分解超越瞬时引导

Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Shanghai Jiao Tong University（上海交通大学）； Fudan University（复旦大学）； Harbin Institute of Technology（哈尔滨工业大学）； Beihang University（北京航空航天大学）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出Pave-GRPO方法，通过原则性平均速度分解将粗粒度过渡分解为细粒度子轨迹，在不增加生成成本的情况下将奖励反馈传播到更多中间步骤，实现更全面的偏好对齐。

Comments 8 pages, 5 figures

AI中文摘要

通过群体相对策略优化（GRPO）的后训练已成为将基于流的生成模型与人类偏好对齐的强大范式。然而，流模型的迭代去噪性质在生成用于策略梯度更新的群体展开时会产生巨大成本，迫使现有方法使用极少的去噪步骤进行训练。这种时间稀疏性严重限制了偏好优化：奖励反馈只能到达每个轨迹的少数阶段，使得绝大多数中间去噪步骤缺乏直接监督，从而损害了对齐的粒度。为了解决这个问题，我们提出了Pave-GRPO，它通过原则性平均速度分解重新表述了GRPO目标。我们不生成昂贵的高步数展开，而是保持高效的少步数群体采样，但将每个粗粒度转换分解为跨越多个中间时间步的等效细粒度子轨迹集合。这将奖励反馈传播到更密集的时间阶段集，从而实现更全面的偏好对齐，而无需额外的生成成本。这种设计有两个好处：（i）零成本视野扩展：通过直接重用分段群体样本及其相关奖励，Pave-GRPO在固定采样预算下显著拓宽了有效优化范围；（ii）全面的时间监督：通过将瞬时速度目标等效分解为多时间步集合，它将奖励信号分布到去噪过程的更多中间阶段，从而实现更细粒度、更彻底的偏好优化。大量实验验证了Pave-GRPO在不同奖励设置下有效推进了偏好对齐，提供了全面的性能提升。

英文摘要

Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
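One identity that makes such a decomposition possible is the additivity of the average velocity over sub-intervals. The notation below ($\bar{v}$, $x_\tau$, the partition $t_0 < \dots < t_K$) is assumed for illustration; whether Pave-GRPO uses exactly this form is not stated in the abstract.

```latex
\bar{v}(t_0, t_K)
  = \frac{1}{t_K - t_0}\int_{t_0}^{t_K} v(x_\tau, \tau)\, d\tau
  = \sum_{k=0}^{K-1} \frac{t_{k+1} - t_k}{t_K - t_0}\, \bar{v}(t_k, t_{k+1}),
\qquad t_0 < t_1 < \dots < t_K .
```

In words, the average velocity of one coarse step equals the length-weighted mean of the average velocities of its finer sub-steps, so a reward attached to the coarse transition can be associated with every intermediate timestep without sampling additional rollouts.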

2606.01620 2026-06-02 cs.CV Updated version

Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs

Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo

Affiliations: Microsoft Research; Microsoft AI

AI summary: Proposes a streaming talking-portrait video generation framework that combines a causal video VAE with an autoregressive latent denoising model and uses reference-image guidance to achieve real-time, high-quality generation.

Comments: CVPR 2026 (Highlight) Camera ready

Abstract

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

2606.01615 2026-06-02 cs.CV cs.MM Updated version

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua

Affiliations: Nanyang Technological University; National University of Singapore

AI summary: Proposes RDMF, a multimodal fusion framework based on reaction-diffusion processes that dynamically aligns video and text by mimicking biological pattern-formation mechanisms, for video moment retrieval and highlight detection.

Comments: Published in ACM MM 2025. Address some typos

Abstract

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
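
For readers unfamiliar with the Gray-Scott model named in this abstract, the following is a minimal sketch of one explicit-Euler update of the textbook two-species system on a periodic 1D grid: `u` diffuses and is consumed by the reaction `u * v * v`, while `v` is produced by it and decays. The function name and all parameter values (`Du`, `Dv`, `F`, `k`, `dt`) are illustrative assumptions. This is the classical model only, not RDMF's fusion module, whose mapping of video and text features onto the two species is not described here.

```python
def gray_scott_step(u, v, Du=0.16, Dv=0.08, F=0.035, k=0.060, dt=1.0):
    """One explicit-Euler step of the 1D Gray-Scott system with periodic boundaries."""
    n = len(u)
    new_u, new_v = [], []
    for i in range(n):
        # Discrete Laplacian; u[i - 1] wraps around at i == 0
        lap_u = u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]
        lap_v = v[i - 1] - 2.0 * v[i] + v[(i + 1) % n]
        reaction = u[i] * v[i] * v[i]  # non-linear coupling term
        new_u.append(u[i] + dt * (Du * lap_u - reaction + F * (1.0 - u[i])))
        new_v.append(v[i] + dt * (Dv * lap_v + reaction - (F + k) * v[i]))
    return new_u, new_v

# Start from the rest state (u = 1, v = 0) and seed a local perturbation
u = [1.0] * 64
v = [0.0] * 64
for i in range(28, 36):
    u[i], v[i] = 0.5, 0.25
for _ in range(200):
    u, v = gray_scott_step(u, v)
```

The uniform state `u = 1, v = 0` is a fixed point of this update, so any structure that appears comes from the seeded perturbation interacting with diffusion and reaction.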

2606.01612 2026-06-02 cs.CV cs.LG Updated version

Self-Improving Small Object Grounding in LVLMs

Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun

Affiliations: University of Georgia

AI summary: Uses the internal attention patterns of LVLMs to select the best box from multiple candidates, via either a lightweight IoU regressor or a training-free attention-entropy selector, achieving self-improvement in small-object grounding.

Comments: 29 Pages, 15 Figures

Abstract

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.
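
The idea of ranking candidate boxes by attention entropy can be sketched as follows. This is an illustration under stated assumptions, not the authors' ACS-Free implementation: the abstract does not say which ranking direction is used, so the sketch assumes that lower entropy (more concentrated attention) marks the better candidate. The names `attention_entropy` and `select_candidate` and the flat list-of-weights input format are likewise hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative attention weights after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def select_candidate(candidates):
    """candidates: list of (box, attention_weights) pairs.

    Assumption: the candidate whose attention is most concentrated
    (lowest entropy) is preferred.
    """
    return min(candidates, key=lambda c: attention_entropy(c[1]))[0]

# Two hypothetical candidate boxes with attention over four image patches
candidates = [
    ((10, 10, 40, 40), [0.25, 0.25, 0.25, 0.25]),  # diffuse attention
    ((12, 14, 30, 32), [0.85, 0.05, 0.05, 0.05]),  # concentrated attention
]
best_box = select_candidate(candidates)
```

Because the score is computed directly from attention weights, selection needs no learned component at inference, which matches the training-free property the abstract describes.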

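2606.01608 2026-06-02 cs.CV Updated version

RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robot Manipulation

Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu

Affiliations: Singapore Management University; Fudan University; Princeton University

AI summary: To address the trustworthiness of video world models in robot manipulation, proposes the RoboTrustBench benchmark, which covers four scenarios (normal, constraint-sensitive, counterfactual, and adversarial). Using expert-verified instruction-image pairs and a six-dimensional evaluation protocol, it finds that current models fall short in constraint reasoning, counterfactual grounding, physical interaction, and suppression of unsafe instructions.

Comments: Project: https://huiqiongli.github.io/RoboTrustBench/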

2606.01591 2026-06-02 cs.CV cs.LG Updated version

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

Ali Alavi

Affiliations: The Ohio State University

AI summary: Proposes TLG, a three-tier system that reconstructs action timelines, parses questions into temporal-logic programs, and executes them deterministically, combined with a strong vision-language model and a frontier reasoning model, raising video question answering accuracy from 46.9% to 71.37%.

Abstract

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video - 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations - not larger models - drive accuracy.
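
To make "parse a question into a temporal-logic program and execute it deterministically" concrete, here is a minimal sketch of two operators evaluated over an annotated action timeline. The interval representation, the operator semantics (for example, `before` meaning that some occurrence of one action ends no later than some occurrence of the other starts), and the function names are all assumptions for illustration. The actual TLG operator definitions and the benchmark's 16-operator set are not reproduced here.

```python
from typing import List, Tuple

Interval = Tuple[str, float, float]  # (action label, start second, end second)

def spans(timeline: List[Interval], action: str) -> List[Tuple[float, float]]:
    """All (start, end) intervals annotated with the given action."""
    return [(s, e) for a, s, e in timeline if a == action]

def before(timeline: List[Interval], a: str, b: str) -> bool:
    """True if some occurrence of a ends no later than some occurrence of b starts."""
    return any(ea <= sb for _, ea in spans(timeline, a) for sb, _ in spans(timeline, b))

def co_occur(timeline: List[Interval], a: str, b: str) -> bool:
    """True if some occurrence of a overlaps in time with some occurrence of b."""
    return any(
        sa < eb and sb < ea
        for sa, ea in spans(timeline, a)
        for sb, eb in spans(timeline, b)
    )

# Hypothetical annotated timeline for one video
timeline = [
    ("open fridge", 0.0, 2.0),
    ("hold cup", 2.5, 7.0),
    ("pour milk", 3.0, 6.0),
]
answer = before(timeline, "open fridge", "pour milk") and co_occur(timeline, "pour milk", "hold cup")
```

Once the timeline is available, answering reduces to evaluating plain boolean expressions, which is why the abstract attributes accuracy to the quality of the annotations rather than to model size.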

2606.01590 2026-06-02 cs.CV cs.GR Updated version

Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis

Zhengfei Kuang, Adam Sun, Liyuan Zhu, Tong Wu, Shengqu Cai, Jonathan Tremblay, Iro Armeni, Ehsan Adeli, Lior Yariv, Gordon Wetzstein

Affiliations: Stanford University; NVIDIA

AI summary: Proposes StreetNVS, a video diffusion framework that jointly exploits LiDAR, surround-view images, and camera poses through a Reference-Enhanced Camera Attention module and a relative ray-level positional encoding, achieving high-quality street-view novel-view synthesis under sparse LiDAR conditioning.

Abstract

Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning and matches methods that rely on 10-100 times denser point clouds. We further show that it can synthesize coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: https://streetnvs.github.io

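2606.01577 2026-06-02 cs.CV Updated version

FLAME: Physics-Guided Neural Operators for Onboard Satellite Methane Detection in Hyperspectral Imagery

Junhyuk Heo, Junhwan Park, Sancheol Sim, Beomkyu Choi, Woojin Cho

Affiliations: KAIST

AI summary: Proposes FLAME, a physics-guided neural operator that embeds methane absorption physics directly into the architecture. For onboard satellite methane detection it achieves the highest accuracy, reduces the pixel-level false positive rate by nearly 3x, has the fewest parameters, and meets the latency budget of onboard hardware.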

2606.01576 2026-06-02 cs.CV Updated version

Deformable Wiener Filter for Future Video Coding

Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma

Affiliations: National Engineering Research Center of Visual Technology, School of Computer Science, Peking University; Core Media Technology, Disney Streaming; Wangxuan Institute of Computer Technology, Peking University; Information Technology R&D Innovation Center of Peking University; Peng Cheng Laboratory, Shenzhen

AI summary: Proposes a deformable Wiener filter (DWF) that combines local and non-local characteristics and performs efficient in-loop filtering through supervised training and adaptive fusion, saving 1.16% to 2.67% of bit rate on average on the VVC standard.

Comments: This paper has been published in IEEE Transactions on Image Processing

DOI: 10.1109/TIP.2022.3221278
Journal ref: IEEE Transactions on Image Processing, vol. 31, pp. 7222-7236, 2022

Abstract

In-loop filters have attracted increasing attention due to their remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of local image similarity. Although some non-local in-loop filters can make up for this shortcoming, the unsupervised parameter estimation widely used by non-local filters limits their performance. In view of this, we propose a deformable Wiener Filter (DWF). It combines local and non-local characteristics and trains the filter coefficients in a supervised manner based on Wiener Filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on patch-level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed in detail with different reference sample derivation schemes. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to the VTM-11.0 for All Intra, Random Access, and Low-Delay B configurations, respectively.

2606.01572 2026-06-02 eess.IV cs.CV Updated version

PINNOCHIO: Physics-Informed Neural Network for Coupled Hyperelastic Interface-Volume Simulation in Orthognathic Surgery

Jungwook Lee, Daeseung Kim, Kevin Gu, Zhangfeng Hu, Tianshu Kuang, Finn Hopeman, Michael A. K. Liebschner, Jaime Gateno, Pingkun Yan

Affiliations: Department of Biomedical Engineering and Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute; Department of Oral and Maxillofacial Surgery, Houston Methodist Research Institute; Department of Neurosurgery, Baylor College of Medicine

AI summary: Proposes the PINNOCHIO framework, which uses a hybrid sequential decomposition to decouple discontinuous bone-soft-tissue interface movements from continuous volumetric hyperelastic deformation, enabling stable training and a physics-enabled sim-to-real adaptation strategy. On a 40-patient cohort it outperforms existing baselines and resolves the accuracy-efficiency trade-off.

Comments: This work has been submitted to MICCAI 2026

Abstract

Predicting patient-specific facial soft-tissue deformation is critical for iterative orthognathic surgery planning. However, current computational methods face a strict accuracy-efficiency trade-off: high-fidelity Finite Element Methods (FEM) are computationally prohibitive, whereas pure deep learning models often produce biomechanically inconsistent results. While Physics-Informed Neural Networks (PINNs) offer a promising avenue, learning the complex heterogeneous mechanics of bone-soft-tissue interactions with only partial clinical supervision (i.e., outer facial surfaces) remains highly unstable. To overcome these challenges, we present PINNOCHIO, a novel physics-informed framework for facial soft-tissue simulation. PINNOCHIO introduces a hybrid sequential decomposition that explicitly decouples discontinuous bone-soft-tissue interface movements from continuous volumetric hyperelastic deformation. This structural separation enables stable training and facilitates a physics-enabled sim-to-real adaptation strategy, ensuring internal biomechanical consistency without requiring volumetric ground truth. Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface accuracy and physical validity. Furthermore, it achieves a substantial speedup over FEM, successfully resolving the accuracy-efficiency trade-off to provide a highly reliable and practical tool for interactive surgical planning.

2606.01565 2026-06-02 cs.RO cs.CV Updated version

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Xiang Fang, Wanlong Fang, Changshuo Wang

Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore; University College London

AI summary: Proposes the Hierarchical Semantic-Augmented Navigation framework, which tackles vision-language navigation in continuous environments with a dynamic hierarchical semantic scene graph, an optimal-transport-based topological planner, and a graph-aware reinforcement learning policy, achieving state-of-the-art performance.

Comments: Published in NeurIPS 2025, address some typos

Abstract

Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the Hierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich's duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.

2606.01558 2026-06-02 cs.CV Updated version

Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning

Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang

Affiliations: University of Virginia

AI summary: To address the poor effectiveness of chain-of-thought reasoning in multimodal large language models, proposes Attentive-CoT, an attention-guided fine-tuning objective that improves reasoning performance by delaying answer commitment and maintaining visual-token access.

Abstract

The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.

2606.01549 2026-06-02 cs.CV Updated version

On the Limits of Token Reduction in Efficient Unified Vision-Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

Affiliations: University of Michigan; Sony AI

AI summary: By analyzing layerwise attention allocation, this paper finds an asymmetry in token redundancy between visual understanding and visual generation and designs task-specific accelerators. Under unified training, however, task-specific token dropping causes a synergy loss, indicating that efficient unified modeling must preserve shared cross-task structure.

Abstract

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training: task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01493 2026-06-02 cs.CV Updated version

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

Hao Liang, Zhixuan Ge, Soumendu Majee, Joanna Li, Ashok Veeraraghavan, Guha Balakrishnan

Affiliations: Rice University; Samsung Research America

AI summary: Proposes SplatShot, a training-free method that couples 3D Gaussian Splatting with the denoising process of a diffusion model to generate multi-view-consistent, photorealistic 3D face avatars from a single photo.

Comments: 28 pages, 15 figures

Abstract

Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.

2606.01485 2026-06-02 cs.CV cs.LG Updated version

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

Ali Alavi

Affiliations: The Ohio State University

AI summary: Through systematic experiments, this paper finds that the implicit video question answering benchmark is perception-bound rather than reasoning-bound, and identifies improved base-model perception and lightweight test-time denoising as the only reliable levers.

Abstract

We describe our submission to the VRR Challenge @ CVPR 2026, built on the ImplicitQA / VRR-QA benchmark: multiple-choice video question answering in which answers are deliberately not observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1 and VideoChat-R1.5) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is perception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception (relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved), and a prompt that explicitly injects monocular depth cues to attack the weakest category lowers test accuracy by 5.8 points, confirming that the model needs a better percept, not a better procedure.
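
Self-consistency, one of the inference-time strategies listed in this abstract, is usually implemented as a majority vote over several independently sampled answers. The sketch below shows that generic voting step for multiple-choice outputs. The function name `self_consistent_answer` and the sample values are hypothetical, and the submission's actual sampling setup (number of samples, temperature, tie-breaking) is not given here.

```python
from collections import Counter

def self_consistent_answer(samples):
    """Return the most frequent answer and its share of the votes.

    Ties are broken by first occurrence, which is how Counter.most_common
    orders equal counts.
    """
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Five hypothetical sampled answers to one multiple-choice question
answer, agreement = self_consistent_answer(["B", "A", "B", "C", "B"])
```

The agreement ratio doubles as a rough confidence signal: questions where samples disagree strongly are the ones where a single sampled answer is most likely to be noise.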

2606.01481 2026-06-02 cs.CV Updated version

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

Yingzi Ma, Xiaogeng Liu, Yawen Zheng, Chaowei Xiao

Affiliations: University of Wisconsin-Madison; Tsinghua University; Johns Hopkins University

AI summary: To address the problem that combinations of safe text and images can still produce harmful content in image-conditioned text-to-video generation, proposes the SafeGen-Bench benchmark, which defines 10 malicious categories and evaluates existing models, finding that current models struggle to avoid generating malicious content and that unimodal guardrails provide insufficient defense.

Comments: 8 pages, 7 figures, 2 tables

Abstract

With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation, especially when guided by an initial image, often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where a combination of text prompt and image may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content, with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone are insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.

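2606.01443 2026-06-02 cs.LG cs.AI cs.CV Updated version

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

Triet M. Le

Affiliations: Spatiolyx LLC

AI summary: Proposes UR-JEPA, which regularizes toward uniformly n-rectifiable measures via a Gaussian-kernel-smoothed Carleson-type square function to prevent representation collapse, reaching peak accuracy comparable to LeJEPA on several datasets with lower seed variance.

Abstract

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by imposing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target conflicts with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose UR-JEPA, whose target is a uniformly $n$-rectifiable measure with local tangent dimension $n$ at small scales, realized through a Gaussian-kernel-smoothed Carleson-type square function $\mathcal{L}^{\text{CGLT}}$ and complemented by a Jones $\beta$-number formulation. On Inet10, UR-JEPA ($\mathcal{L}^{\text{CGLT}}$) reaches $0.9141 \pm 0.0014$, an improvement of $+0.83$ percentage points over LeJEPA ($\mathcal{L}^{\text{SIGReg}}$), with seed standard deviation reduced by about 30%. In matched-recipe Galaxy10 SDSS, single-seed ImageNet-100, and 3-seed EuroSAT remote-sensing runs, the two methods land in the same peak-accuracy band at convergence, and UR-JEPA retains its lower seed-variance profile. On EuroSAT, the in-domain pair is competitive at 96.0 to 96.1%, with a backbone 25 times smaller than in transfer from large remote-sensing foundation models. The difference lies in the geometry: direct visualization of the projector output distribution shows that, on all four datasets, UR-JEPA ($\mathcal{L}^{\text{CGLT}}$) produces a global PCA spectrum that drops by 4 to 5 orders of magnitude at indices $\sim 20$ to $25$ (out of $D=32$), whereas LeJEPA's spectrum is nearly flat (top-to-bottom ratio at most 3.6). The per-dimension marginals of both methods are simultaneously close to Gaussian (mean Shapiro-Wilk $W \in [0.992, 0.996]$), a corollary of the Diaconis-Freedman result. Thus, at matched accuracy, the two regularizers yield structurally different projected representations.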

Dr. DocBench：专家级与困难文档解析的综合基准

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

发表机构 * Stanford University（斯坦福大学）； MIT（麻省理工学院）； Carnegie Mellon University（卡内基梅隆大学）； University of Southern California（南加州大学）； Harvard University（哈佛大学）； IBM Research（IBM研究院）； University of Arizona（亚利桑那大学）； Duke University（杜克大学）； UC Berkeley（加州大学伯克利分校）； LMU Munich（慕尼黑路德维希-马克西米利安大学）

AI总结提出Dr. DocBench基准，通过基于解析器失败的采样从多语言书籍语料库中选取挑战性文档，包含52个BISAC主题领域和65k高质量标注，用于评估专家级文档解析能力。

Comments 27 pages, 13 figures, 14 tables

AI中文摘要

文档解析和识别是视觉语言模型（VLM）和文档处理系统的基本能力。然而，现有的光学字符识别（OCR）和文档解析基准在覆盖范围和难度上日益受限：许多基准专注于常见文档类型或均匀采样的页面，现代解析器在这些页面上已表现良好，而对专家领域结构（如化学公式、乐谱、复杂表格和跨页布局）的标注有限。我们引入了Dr. DocBench，一个面向专家级文档解析的难度感知基准。Dr. DocBench基于大规模多语言书籍语料库构建，涵盖52个BISAC主题领域，并通过基于解析器失败的采样选择挑战性文档，针对多个最先进系统难以处理的案例。它包含来自平均约100页的长文档的4,514个标注页面，具有65k高质量的页面级和块级标注，涵盖布局、阅读顺序、层次关系和特定领域的视觉内容。对基于流水线的解析器和通用VLM的评估表明，在现有基准上的强性能并不能迁移到我们的专家级文档解析中。我们的分析揭示了跨主题、内容类型和结构属性的重大失败，突显了Dr. DocBench作为诊断和推进文档智能的综合测试平台的作用。

英文摘要

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing setting. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.


2606.01380 2026-06-02 cs.CV 版本更新

Training-free image inversion for one-step diffusion models

无需训练的一步扩散模型图像反演

Tao Wu, Senmao Li, Yaxing Wang, Shiqi Yang, Kai Wang, Joost van de Weijer

Affiliations: CVC, University of Alabama in Birmingham; Machine Intelligence Institute, Masdar Institute of Science and Technology; Jilin University; City University of Hong Kong, Department of Geography

AI总结提出一种无需训练的反演框架TFinv，通过迭代噪声对齐和后缀学习解决一步扩散模型中真实图像反演与编辑的关键挑战，实现高效编辑。

Comments Accepted to Pattern Recognition

DOI: 10.1016/j.patcog.2026.114063

AI中文摘要

在这项工作中，我们为一步扩散模型引入了一种新颖的无需训练的反演（TFinv）框架，解决了真实图像反演和编辑中的关键挑战。我们首先确定了阻碍真实图像反演和编辑的两个关键因素：（1）初始潜在可编辑性，与初始噪声与理想高斯分布之间的距离有关；（2）描述差距，即文本描述与图像表示之间的对齐。这两个因素都影响一步扩散模型的反演效率和可编辑性。然后，我们提出了两种新颖的技术：迭代噪声对齐（iterNA），它最小化分布差距以与正态高斯分布对齐；以及后缀学习（suffL），它通过引入学习到的后缀提示令牌来增强文本到图像的描述对齐。这些技术能够将输入图像精确反演为其初始噪声表示，并促进图像编辑。此外，我们提出了一种基于掩码的编辑技术，用于局部编辑同时保持背景完整性。在PIE-Bench数据集上的全面实验验证了我们的方法TFinv不仅在一阶扩散编辑中实现了最先进的性能，而且在效率上显著优于现有的多步方法。代码可在https://github.com/tttao-uwu/TFinv.git获取。

英文摘要

In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models, addressing key challenges in real image inversion and editing. We first identify two critical factors hampering real-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between the initial noise and the ideal Gaussian distribution, and (2) Caption Gap, which means the alignment between text captions and image representations. Both factors influence inversion efficiency and the editability of one-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), which minimizes the distribution gap to align with the normal Gaussian distribution, and suffix learning (suffL), which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniques enable precise inversion of input images into their initial noise representations and facilitate image editing. Furthermore, we propose a mask-based editing technique for localized edits while preserving background integrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not only achieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existing multistep approaches in efficiency. The code is available at https://github.com/tttao-uwu/TFinv.git.


2606.01372 2026-06-02 cs.LG cs.AI cs.CV 版本更新

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA：在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； NYU Langone Health（纽约大学Langone医疗中心）

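AI summary: Proposes BRo-JEPA, a model that imposes the cyclic structure of mod-10 arithmetic on the latent space to achieve zero-shot generalization, addressing the failure of standard models to extrapolate to unseen operations.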

Comments 10 pages, 14 figures
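
One way to picture a cyclic latent structure for mod-10 arithmetic is to place each residue on the unit circle, so that addition becomes rotation. The sketch below is a toy illustration under that assumption; the function names `embed`, `add_in_latent`, and `decode` are hypothetical and nothing here is taken from BRo-JEPA's actual architecture.

```python
import cmath
import math

MODULUS = 10  # mod-10 arithmetic, as named in the summary


def embed(digit: int) -> complex:
    """Place a residue on the unit circle: k -> exp(2*pi*i*k / 10)."""
    return cmath.exp(2j * math.pi * (digit % MODULUS) / MODULUS)


def add_in_latent(z_a: complex, z_b: complex) -> complex:
    """Multiplying unit complex numbers adds their angles, i.e. adds residues mod 10."""
    return z_a * z_b


def decode(z: complex) -> int:
    """Read the residue back from the angle of the latent point."""
    angle = cmath.phase(z) % (2 * math.pi)
    return round(angle * MODULUS / (2 * math.pi)) % MODULUS


if __name__ == "__main__":
    # 7 + 8 = 15, which is 5 mod 10
    print(decode(add_in_latent(embed(7), embed(8))))
```

Because the rotation rule is the same for every pair of residues, a pair never seen together still decodes correctly, which is the intuition behind structure-driven extrapolation.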

2606.01367 2026-06-02 cs.RO cs.CV 版本更新

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

ActMVS：基于单目多视图立体的主动场景重建

Guo Pu, Yixuan Han, Zhouhui Lian

Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI总结提出ActMVS框架，通过视图因子图构建和全局深度优化，实现单目相机在线生成高质量、全局一致的密集深度图，支持机器人/UAV的主动场景重建与安全轨迹规划。

Comments ICRA 2026

AI中文摘要

主动场景重建使机器人/UAV能够自主规划轨迹并重建环境，无需昂贵的手动数据采集。与被动方法不同，主动重建需要实时构建高置信度占据地图以实现无碰撞导航。现有方法依赖深度传感器更新占据地图，增加了平台成本和重量。为推进空间智能，我们旨在实现纯视觉单目解决方案。然而，当前单目场景重建方法离线运行，无法在机器人/UAV导航所需的帧率下提供全局一致的密集深度。为弥补这一差距，我们引入ActMVS，这是首个单目主动重建框架。我们的框架集成了用于信息多视图立体深度预测的视图因子图构建，以及全局深度优化，从而实现在线生成高质量、全局一致的密集深度图。这使得单目机器人/UAV能够在重建过程中维护可靠的占据地图，以实现安全的轨迹规划。在Replica数据集上的实验表明，其性能与RGB-D方法相当。我们的代码和数据可在https://github.com/TrickyGo/ActMVS获取。

英文摘要

Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.


2606.01362 2026-06-02 cs.GR cs.CV 版本更新

AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

AlbedoEdit: 基于反照率引导的统一实例级视频编辑

Xilong Zhou, Bao-Huy Nguyen, Zheng Zeng, Jacob Munkberg, Jon Hasselgren, Thomas Leimkühler, Nima Kalantari, Miloš Hašan, Christian Theobalt

发表机构 * Max Planck Institute for Informatics（马克斯·普朗克信息研究所）； University of California Santa Barbara（加州大学圣巴巴拉分校）； NVIDIA Research（NVIDIA研究）； Texas A & M University（德克萨斯A&M大学）

AI总结提出 AlbedoEdit，一个统一框架，利用反照率图实现对象插入、移除和纹理编辑，通过微调视频基础模型，在合成数据集上训练，实现编辑内容的和谐融合与复杂视觉效果模拟。

AI中文摘要

视频生成模型在合成逼真视频序列方面取得了显著进展。然而，实现更广泛和更具创造性的下游应用需要细粒度的实例级视频编辑，包括对象插入、对象移除和纹理编辑，这已成为一个突出但具有挑战性的问题。现有方法要么提出仅具有粗略语义控制的统一生成框架，要么为单个编辑任务设计特定任务框架，限制了它们在多样化真实场景中的灵活性和适用性。为解决这些限制，我们提出了 AlbedoEdit，一个统一的生成式视频编辑框架，同时支持对象插入、对象移除和纹理编辑。我们的关键洞察是，内在反照率图（对光照不变且不包含镜面反射、阴影和相互反射效应）为指定细粒度外观编辑提供了一种有效且用户友好的机制。基于视频基础模型，AlbedoEdit 被微调以将源 RGB 视频转换为编辑后的 RGB 视频，条件为用户编辑的第一帧反照率。在覆盖所有三种编辑任务的新配对合成数据集上训练后，AlbedoEdit 隐式学习协调编辑内容并模拟由编辑操作触发的复杂真实世界视觉效果，包括镜面高光、软阴影和镜面反射。AlbedoEdit 在定性和定量上均优于最先进的视频编辑方法。项目网页为 https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/。

英文摘要

Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing and inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited contents and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively. Project webpage is https://vcai.mpi-inf.mpg.de/projects/AlbedoEdit/.


2606.01361 2026-06-02 cs.CV 版本更新

Diamonds in the Sky: Pareidolic Animals in Clouds

天空中的钻石：云中的空想性动物

Miriam Horovicz, Yacov Hel-Or, Yael Moses

发表机构 * Reichman University, Israel（里奇曼大学，以色列）

AI总结提出基于扩散模型的方法，预测人们可能在云中感知到的空想性动物，并通过生成相似形状的动物图像和变形视频辅助识别。

AI中文摘要

人们常在云中看到动物形状，这种现象被称为空想性错视。我们提出一种基于AI的方法，旨在预测人们可能在云中感知到哪些动物，尽管最先进的识别方法通常无法检测到此类动物。此外，我们引入一种方法帮助个体感知特定的空想性动物，即使他们最初未能识别。我们的方法使用扩散模型将云片段转换为视觉上类似于原始云的动物形状。这种扩散技术的灵感来源于观察：扩散过程仅在目标动物与云形状相似时成功，且微妙的视觉线索通常足以帮助个体识别特定的空想性动物。从扩散模型成功生成的图像随后用于预测空想性动物。此外，使用从生成图像过渡回原始云片段的短变形视频进一步增强人类对空想性动物的感知。

英文摘要

People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into animal shapes that visually resemble the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance human perception of the pareidolic animals.


2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET 版本更新

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite：一种轻量级频率分解线性模型，具有自适应可逆归一化，用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University（哈姆达德大学）

AI总结提出FreqLite，一种超轻量级、通道独立的频率分解线性预测器，通过可学习的无损谱滤波器进行频带分解和线性预测，并引入自适应可逆实例归一化（A-RevIN）处理非平稳性，在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色，但仍存在两个问题：可逆实例归一化（RevIN）使用单一回溯统计量对整个预测区间进行去归一化，在非平稳性下不准确；时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite，一种超轻量级、通道独立的频率分解线性预测器：一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带，由每个频带的线性头进行预测，与低通截断方法不同，高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型，在长回溯（L=336）时，其平均误差低于PatchTST Transformer（0.3244 vs 0.3587 MSE），同时参数减少4倍，内存减少2.2倍，在单块4 GB笔记本GPU上每轮时间减少2.2倍；尽管幅度不大，但在所有匹配单元上的配对Wilcoxon检验中，其改进具有统计显著性（p < 1e-5）。我们进一步引入自适应可逆实例归一化（A-RevIN），一种自适应可逆归一化，严格推广了RevIN（在其门关闭时完全恢复），在非平稳性下起作用，并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集（ILI，MSE降低约5%）和一个受控合成漂移扫描中验证了这一点，其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融（Linear和RLinear是FreqLite的特例），所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.
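
The "lossless, partition-of-unity spectral filter" idea can be sketched in a few lines: give every band a weight at each frequency, make the weights sum to one across bands, and the bands then add back up to the input. The code below is a minimal pure-Python sketch under that assumption; the softmax parameterization, the `band_logits` layout, and all function names are hypothetical choices for illustration, not FreqLite's published implementation.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (fine for a small demo)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part, valid for conjugate-symmetric spectra."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_into_bands(signal, band_logits):
    """Split `signal` into len(band_logits) bands whose sum reproduces the input.

    band_logits[b][f] is a (hypothetical) learnable score for band b at
    frequency index f, with f in 0..len(signal)//2.
    """
    n = len(signal)
    spectrum = dft(signal)
    num_bands = len(band_logits)
    # Softmax across bands at each frequency: weights sum to 1 (partition of unity).
    weights = [
        softmax([band_logits[b][f] for b in range(num_bands)])
        for f in range(n // 2 + 1)
    ]
    bands = []
    for b in range(num_bands):
        # Bins k and n-k share a weight so each band stays real-valued.
        masked = [spectrum[k] * weights[min(k, n - k)][b] for k in range(n)]
        bands.append(idft(masked))
    return bands
```

Since no frequency is zeroed out, the high-frequency band is kept rather than truncated; each band could then be passed to its own linear forecasting head.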


2606.01334 2026-06-02 cs.CV 版本更新

HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

HOLA: 面向开放集3D识别的全息多模态对齐

Koby Aharonov, Oren Shrout, Ayellet Tal

Affiliations: Technion – Israel Institute of Technology

AI总结提出HOLA方法，通过解耦多正例对比损失和对齐点云与多视图图像及文本描述，实现开放集3D识别中的全息多模态对齐，在长尾基准上取得最先进零样本性能。

AI中文摘要

开放集3D识别需要模型能够泛化到罕见或未见类别。最近的方法通过将语言-视觉知识蒸馏到3D编码器来解决这一问题，通常依赖重型2D ViT，并将每个点云与单张图像或标题对齐，从而将表示锚定到局部视图。我们提出将每个点云与多张图像和文本描述对齐，以捕获对3D对象的更全面理解。为实现这一想法，必须设计一个损失函数，能够联合对齐一个3D实例与多个匹配信号（多视图图像和多个文本），同时将正例聚合与负例竞争分离。我们引入了这样的函数，称为解耦多正例对比损失。我们的公式增强了损失对困难负例的难度感知关注，避免了当多个正例与所有负例共享同一个softmax时出现的“聚光灯拥挤”现象。作为补充，我们提出了一个轻量级文本适配器，仅应用于网络标题，减少了与精心标注之间的领域差距，并能够有效利用大规模无监督文本。我们的模型在长尾基准上展示了最先进的开放词汇性能，在保持高帧率的同时实现了显著的零样本改进。

英文摘要

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss's hardness-aware focus on challenging negatives, avoiding the "spotlight crowding" that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.


2606.01315 2026-06-02 cs.CV 版本更新

DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images

DeblurNVS：基于几何潜在扩散的稀疏运动模糊图像新视角合成

Changyue Shi, Wangbo Yu, Chaoran Feng, Li Yuan

发表机构 * School of AI for Science, Peking University Shenzhen Graduate School（人工智能科学学院，北京大学深圳研究生院）； School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School（电子与计算机工程学院，北京大学深圳研究生院）

AI总结提出DeblurNVS框架，利用几何潜在扩散从稀疏运动模糊图像中直接合成高保真新视角，无需逐场景优化。

AI中文摘要

新视角合成（NVS）是计算机视觉和图形学中的一个基本问题。神经辐射场（NeRF）、3D高斯泼溅（3DGS）和生成式视角合成的最新进展显著提高了其质量。然而，大多数方法仍然依赖于清晰观测，其中图像结构和跨视角几何线索得以良好保留。运动模糊通过破坏局部细节和削弱多视角对应关系打破了这一假设。这种模糊通常由实际拍摄中的相机抖动、场景运动或有限曝光引起。模糊感知的NVS方法通过建模图像形成来解决这一退化问题，但它们依赖于昂贵的逐场景优化，限制了高效且可泛化的稀疏视角合成。为了解决这个问题，我们提出了DeblurNVS，一种新颖的框架，可以直接从稀疏运动模糊图像中合成高保真新视角，无需逐场景优化。DeblurNVS恢复了多视角推理所需的中间几何表示，使模糊输入能够恢复可靠的结构和对应线索。然后，将恢复的表示与目标相机信息结合，合成目标视角表示并重建清晰的RGB新视角。为了实现大规模训练，我们使用基于插值的有限曝光模糊合成方法，从DL3DV-10K构建了一个运动模糊NVS数据集。大量实验表明，DeblurNVS在合成运动模糊基准上优于现有基线，并能泛化到真实运动模糊场景，生成感知上更清晰、结构上更稳定的新视角，同时避免了昂贵的逐场景优化。项目页面：https://github.com/PKU-YuanGroup/DeblurNVS。

英文摘要

Novel view synthesis (NVS) is a fundamental problem in computer vision and graphics. Recent advances in neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and generative view synthesis have substantially improved its quality. Yet most methods still rely on clean observations, where image structures and cross-view geometric cues are well preserved. Motion blur breaks this assumption by corrupting local details and weakening multi-view correspondences. Such blur commonly arises from camera shake, scene motion, or finite exposure in practical capture. Blur-aware NVS methods address this degradation by modeling image formation, but their reliance on costly per-scene optimization limits efficient and generalizable sparse-view synthesis. To address this, we propose DeblurNVS, a novel framework for synthesizing high-fidelity novel views directly from sparse motion-blurred images, without requiring per-scene optimization. DeblurNVS restores the intermediate geometric representations needed for multi-view reasoning, enabling blurred inputs to recover reliable structure and correspondence cues. The restored representations are then combined with target camera information to synthesize the target-view representation and reconstruct a sharp RGB novel view. To enable large-scale training, we construct a motion-blurred NVS dataset from DL3DV-10K using interpolation-based finite-exposure blur synthesis. Extensive experiments demonstrate that DeblurNVS outperforms existing baselines on synthetic motion-blur benchmarks and generalizes to real motion-blurred scenes, producing perceptually sharper and structurally more stable novel views while avoiding costly per-scene optimization. Project page: https://github.com/PKU-YuanGroup/DeblurNVS.
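
Finite-exposure blur is commonly approximated by averaging several sharp frames sampled across the exposure window. The sketch below illustrates that averaging step on tiny grayscale frames stored as nested lists. It uses per-pixel linear interpolation between two frames as a stand-in for whatever interpolation the dataset pipeline actually uses, so `lerp_frame`, `synthesize_blur`, and `num_samples` are assumptions for illustration only.

```python
def lerp_frame(frame_a, frame_b, alpha):
    """Per-pixel linear blend between two equally sized grayscale frames."""
    return [
        [(1.0 - alpha) * pa + alpha * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def synthesize_blur(frame_start, frame_end, num_samples=8):
    """Average `num_samples` interpolated frames spanning the exposure window."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    samples = [
        lerp_frame(frame_start, frame_end, i / (num_samples - 1))
        for i in range(num_samples)
    ]
    height, width = len(frame_start), len(frame_start[0])
    return [
        [sum(s[y][x] for s in samples) / num_samples for x in range(width)]
        for y in range(height)
    ]
```

A static scene (identical start and end frames) comes out unchanged, while any pixel that changes during the exposure is smeared toward the mean of its sampled values.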


2606.01293 2026-06-02 eess.IV cs.AI cs.CV 版本更新

ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI

ResNet-34与轻量级解码器用于胎儿脑部MRI的准确高效分割

Ashiqur Rahman, Muhammad E. H. Chowdhury, Md. Abu Sayed, Md. Sharjis Ibne Wadud, Abu Naser Md. Arafat, Mehedi Hasan Prince

发表机构 * Department of Biomedical Physics and Technology, University of Dhaka（达卡大学生物医学物理与技术系）； Department of Electrical Engineering, College of Engineering, Qatar University（卡塔尔大学工程学院电气工程系）； Department of Biomedical Engineering, Jashore University of Science and Technology（贾沙尔大学科学与技术学院生物医学工程系）

AI总结提出一种结合ResNet-34编码器和基于MLP的轻量级解码器的深度学习模型，以解决胎儿脑MRI分割中的运动伪影和强度不均匀问题，在FeTA 2021数据集上达到97.37%准确率和90.33%平均DSC。

AI中文摘要

在磁共振成像（MRI）中准确分割胎儿脑组织对于先天性异常的早期诊断和改善产前护理至关重要。然而，由于胎儿运动、组织对比度低以及整个孕龄期解剖结构变异大，特别是分割白质、灰质、侧脑室、深部灰质、脑外脑脊液、小脑和脑干等复杂结构时，该任务仍然困难。针对这些难题，本研究引入了一种新颖的深度学习模型，该模型将ResNet-34编码器与利用多层感知器（MLP）模块进行自适应特征细化的轻量级解码器相结合。这种设计特别增强了模型保留解剖边界并减轻由运动伪影和强度不均匀引起的分割误差的能力。通过减少参数数量、采用双线性上采样代替转置卷积以及优化解码器以提高速度而不牺牲精度，实现了计算效率。在FeTA 2021数据集上使用5折交叉验证进行训练和验证，所提出的模型优于UNet、UNet++、DeepLabV3和DeepLabV3+等基线架构，平均准确率达到97.37%，平均Dice相似系数（DSC）为90.33%，平均交并比（IoU）为86.93%，精确率为90.83%。此外，其快速的推理时间和减少的计算负载使其非常适合集成到实时临床工作流程中。

英文摘要

Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability across gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, external cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model's ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.
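
The two overlap metrics reported above, DSC and IoU, have standard per-class definitions that can be sketched as follows. The flat label lists and the convention of returning 1.0 when a class is absent from both prediction and target are assumptions of this example, not details from the paper's evaluation code.

```python
def dice_and_iou(pred, target, label):
    """Per-class Dice coefficient and IoU for two flat lists of integer labels."""
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    pred_count = sum(pred_mask)
    target_count = sum(target_mask)
    union = pred_count + target_count - intersection
    # Convention (an assumption): a class absent from both masks scores 1.0.
    dice = 2 * intersection / (pred_count + target_count) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou
```

The two are linked by Dice = 2·IoU / (1 + IoU) for a single class, so Dice is never lower than IoU; a mean DSC above the mean IoU, as reported, is what one would expect.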


2606.01287 2026-06-02 cs.CV cs.AI 版本更新

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

超越视觉记忆：潜在视觉推理的机制诊断

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li, Xinpei Zhao, Huaxing Liu, Shuai Dong

Affiliations: Amap, Alibaba Group; Shanghai Innovation Institute

AI总结通过分解潜在令牌为三个可测试组件，发现边界标记和格式而非潜在槽贡献了主要性能提升，揭示了潜在视觉推理的真正机制。

AI中文摘要

最近的潜在视觉推理方法通过在多模态语言模型中插入连续潜在令牌取得了显著提升。这些提升通常归因于令牌编码了视觉证据；然而，最近的分析揭示了一个悖论：令牌与图像关联松散，对答案贡献甚微。关键的是，这些分析将潜在令牌视为一个整体，掩盖了提升的真正来源。因此，我们将潜在令牌分解为三个可测试组件：潜在槽、边界标记和格式，并在有利条件下开发了一种最先进的方法作为探针。在六个方法-阶段设置和四个感知密集型基准测试中，潜在槽未能通过视觉记忆解释的所有预测。引人注目的是，在几种设置中，仅保留边界标记即可保留78%至100%的提升，而模型在潜在位置比在答案位置更窄地关注图像。因此，提升来自边界标记、格式以及这种注意力模式，而非潜在槽。每种方法如何利用这一机制取决于其训练监督：在匹配的准确率下，机制仍可能显著不同。因此，潜在视觉推理不仅需要根据准确率评估，还需要根据模型实际依赖的内容进行评估。

英文摘要

Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.
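
A figure such as "preserves 78 to 100% of the gain" is naturally read as the share of the accuracy improvement that survives an ablation. The helper below sketches that arithmetic; the formula is an assumed reading of the abstract, and the accuracies in the usage example are made-up placeholders, not results from the paper.

```python
def retained_gain_fraction(acc_base, acc_full, acc_ablated):
    """Fraction of the (full - base) accuracy gain that remains after an ablation."""
    gain = acc_full - acc_base
    if gain == 0:
        raise ValueError("the full method shows no gain over the baseline")
    return (acc_ablated - acc_base) / gain


if __name__ == "__main__":
    # Hypothetical numbers: baseline 60.0, full method 70.0,
    # boundary-markers-only variant 68.0.
    print(retained_gain_fraction(60.0, 70.0, 68.0))
```

A value near 1.0 for the markers-only variant would indicate that the ablated component (here, the latent slots) contributes little to the gain.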


2606.01285 2026-06-02 cs.CV cs.AI 版本更新

Knowledge-Intensive Video Generation

知识密集型视频生成

Chenxu Wang, Mingda Chen

发表机构 * Fudan University（复旦大学）； Shanghai Jiao Tong University（上海交通大学）

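AI summary: To address the shortcomings of text-to-video generation in factuality and practical usefulness, proposes the knowledge-intensive video generation (KIVI) task and builds the KIVI-Bench benchmark with automatic evaluation metrics; experiments show that existing models lag behind humans in visual attributes, operational procedures, and information presentation.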

2606.01282 2026-06-02 cs.CV cs.CY cs.LG 版本更新

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation

KG-FairDiff: 知识图谱引导的提示词精炼用于人口统计公平的文本到图像生成

Farbod Davoodi, Seyed Reza Tavakoli Shiyadeh, Pooria Safaei, Sana Harighi, Parsa Gholami, Amirali Amini, Kimia Vanaei, Emad Firoozi, Parham Abed Azad, Babak Khalaj, Siavash Ahmadi, Amir Hossein Payberah, Mohammad Hossein Rohban, Soheil Kolouri, Ali Diba

Affiliations: University of Science and Technology of China; Sharif University of Technology; Iran University of Science and Technology

AI总结提出KG-FairDiff框架，通过知识图谱引导的提示词精炼，在推理时优化公平性损失，减少文本到图像生成中的性别、种族、年龄等人口统计偏差，同时保持语义保真度。

AI中文摘要

文本到图像（TTI）系统现已成为新闻、教育、广告和公共传播的日常基础设施，它们从训练数据中继承的人口统计和文化刻板印象（将女性、有色人种、老年人和非西方文化描绘为代表性不足或漫画化）在部署规模上成为人口层面的危害。现有的缓解措施要么需要昂贵的重新训练，这对于主导消费产品的闭源骨干网络不可行，要么依赖于忽略文化背景的固定人口统计模板。我们提出了KG-FairDiff，一个模型无关、推理时框架，将公平感知的提示词精炼形式化为一个约束优化问题，并将其实现为一个闭环流水线：一个包含约1200个文化和偏见相关三元组的知识图谱检索结构化上下文，一个LLM改写器提出精炼，一个验证器仅接受那些减少基于散度的公平性损失同时保持用户原始意图语义保真度的提示词。我们证明了精炼循环的有限终止界限，贡献了一个数学上一致的评估套件，将Bias-P/Bias-W与目标分布的散度以及ENS与KL散度联系起来，并审计了八个广泛部署的骨干生成器。KG-FairDiff显著减少了性别、种族、年龄和交叉差异，同时保持了提示词语义，为更公平的生成式AI提供了一条实用、可部署的路径。

英文摘要

Text-to-Image (TTI) systems are now everyday infrastructure for journalism, education, advertising, and public communication, and the demographic and cultural stereotypes they inherit from training data (rendering women, people of colour, older adults, and non-Western cultures as under-represented or caricatured) become a population-level harm at deployment scale. Existing mitigations either require costly retraining, infeasible for the closed-source backbones that dominate consumer products, or rely on fixed demographic templates that ignore cultural context. We present KG-FairDiff, a model-agnostic, inference-time framework that formalises fairness-aware prompt refinement as a constrained optimisation problem and operationalises it as a closed-loop pipeline: a knowledge graph of ~1,200 culture- and bias-related triples retrieves structured context, an LLM rewriter proposes refinements, and a validator accepts only prompts that reduce a divergence-based fairness loss while preserving semantic fidelity to the user's original intent. We prove a finite-termination bound for the refinement loop, contribute a mathematically consistent evaluation suite linking Bias-P/Bias-W to divergence from target distributions and ENS to KL divergence, and audit eight widely-deployed backbone generators. KG-FairDiff substantially reduces gender, race, age, and intersectional disparities while preserving prompt semantics, offering a practical, deployment-ready route to more equitable generative AI.
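
The validator step described above, accepting a refined prompt only if it lowers a divergence-based fairness loss while staying faithful to the original, can be sketched as below. KL divergence to a target distribution is used here as one plausible instance of such a loss. The function names, the `min_similarity` threshold, and the externally supplied similarity score are all assumptions for illustration, not KG-FairDiff's actual interface.

```python
import math
from collections import Counter


def empirical_distribution(labels, groups):
    """Share of generated images falling into each demographic group."""
    counts = Counter(labels)
    return {g: counts[g] / len(labels) for g in groups}


def kl_divergence(observed, target, eps=1e-12):
    """KL(observed || target); zero-probability observed groups contribute nothing."""
    return sum(
        p * math.log(p / max(target[g], eps))
        for g, p in observed.items()
        if p > 0
    )


def accept_refinement(loss_before, loss_after, similarity, min_similarity=0.8):
    """Accept only if fairness loss drops and semantic similarity stays high."""
    return loss_after < loss_before and similarity >= min_similarity
```

Because a candidate is accepted only on a strict decrease of the loss, a loop built on this check cannot cycle, which is consistent with the kind of termination argument the abstract mentions.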


2606.01280 2026-06-02 cs.CV 版本更新

Event-Based Vision in Space: Applications, Trends, and Future Directions

太空中的事件视觉：应用、趋势与未来方向

Luigi Capogrosso, Pietro Bonazzi, Michele Magno

发表机构 * Interdisciplinary Transformation University of Austria（交叉学科转型奥地利大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结本文综述了事件视觉传感器在太空领域的应用，通过分类四个主要领域（大气与高速观测、环境监测与变化检测、操作支持与星上处理、地理空间建模与预测分析），指出神经形态工程是解决现代遥感与可持续太空探索关键瓶颈的范式转变。

Comments Accepted at the XXIV Annual Conference on Sensors and Microsystems (AISEM) 2026

AI中文摘要

地球观测（EO）正经历由新型传感技术部署驱动的重大变革。传统的基于帧的光学传感器在具有挑战性的轨道环境中常受运动模糊、高功耗和极端数据冗余的困扰。相比之下，事件传感器（也称为神经形态相机）提供了一种仿生异步方法。通过仅捕获局部光照变化，它们提供微秒级时间分辨率、极高动态范围和卓越能效。尽管这些传感器的使用正从地面系统迅速扩展到轨道平台，但围绕其太空应用的科学文献仍然高度分散。为弥合这一差距，本文对太空领域事件视觉的最新技术进行了全面综述。基于检索到的文献，我们引入了一个围绕四个主要领域构建的分类体系：1）大气与高速观测；2）环境监测与变化检测；3）操作支持与星上处理；4）地理空间建模与预测分析。因此，本综述强调，神经形态工程远不止是一种补充成像技术；它是一种范式转变，可直接用于解决现代遥感和可持续太空探索中的关键瓶颈。

英文摘要

Earth Observation (EO) is undergoing a significant transformation driven by the deployment of novel sensing technologies. Traditional frame-based optical sensors often struggle with motion blur, high power consumption, and extreme data redundancy in challenging orbital environments. In contrast, event-based sensors, also known as neuromorphic cameras, offer a bio-inspired asynchronous approach. By capturing only local illumination changes, they provide microsecond temporal resolution, an extremely high dynamic range, and exceptional energy efficiency. Although the use of these sensors is rapidly expanding from terrestrial systems to orbital platforms, the scientific literature surrounding their space-based applications remains heavily fragmented. To bridge this gap, this article presents a comprehensive review of the state-of-the-art in event-based vision in the space domain. Based on the retrieved literature, we introduce a taxonomy structured around four primary domains: 1) atmospheric and high-speed observation; 2) environmental monitoring and change detection; 3) operational support and onboard processing; and 4) geospatial modeling and predictive analysis. As a result, this survey highlights that neuromorphic engineering is far more than a supplementary imaging technique; it is a paradigm shift that can be used to directly address critical bottlenecks in modern remote sensing and sustainable space exploration.
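
The phrase "capturing only local illumination changes" corresponds to the usual idealized event-camera model: a pixel emits an event whenever its log intensity moves by more than a contrast threshold since the last event. The single-pixel sketch below follows that idealized model; the threshold value and the function name are assumptions, and real sensors add noise and refractory effects that are ignored here.

```python
import math


def events_from_intensity(samples, threshold=0.2):
    """Generate (timestamp, polarity) events for one pixel.

    `samples` is a list of (timestamp, intensity) pairs with intensity > 0.
    An event fires each time the log intensity drifts by `threshold`
    from the reference level set at the previous event.
    """
    events = []
    _, first_intensity = samples[0]
    reference = math.log(first_intensity)
    for timestamp, intensity in samples[1:]:
        delta = math.log(intensity) - reference
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((timestamp, polarity))
            reference += polarity * threshold
            delta = math.log(intensity) - reference
    return events
```

A constant-brightness pixel produces no output at all, which is where the data-redundancy and power savings over frame-based sensors come from.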


2606.01277 2026-06-02 cs.RO cs.AI cs.CV cs.SY eess.IV eess.SY 版本更新

GRPO-TTA：基于GRPO驱动的强化学习进行视觉语言模型的测试时视觉调优

Yujun Li, Hongyuan Zhang, Yuan Yuan

发表机构 * School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University（人工智能、光学与电子学院（iOPEN），西北工业大学）

AI总结提出GRPO-TTA方法，将GRPO应用于测试时适应，通过将类特定提示预测重构为组策略优化问题，并设计对齐奖励和分散奖励，在多种基准上优于现有方法。

AI中文摘要

组相对策略优化（GRPO）最近在大型语言模型和视觉语言模型的后训练中展现出强大性能。这引发了一个问题：GRPO是否也能显著促进视觉语言模型的测试时适应（TTA）。在本文中，我们提出了用于测试时适应的组相对策略优化（GRPO-TTA），通过将类特定提示预测重构为组策略优化问题，将GRPO适应到TTA设置。具体来说，我们通过从CLIP相似度分布中采样top-K类候选来构建输出组，从而在无需真实标签的情况下实现概率驱动的优化。此外，我们设计了针对测试时适应的奖励函数，包括对齐奖励和分散奖励，以指导有效的视觉编码器调优。在多种基准上的大量实验表明，GRPO-TTA一致优于现有的测试时适应方法，在自然分布偏移下性能提升尤为显著。

英文摘要

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also substantially benefit test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
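
Two ingredients named above can be sketched without any model: drawing a group of candidates from the top-K classes of a similarity distribution, and scoring each candidate relative to its own group. The code below is a minimal sketch under those assumptions. The reward values are supplied by the caller, and the alignment and dispersion rewards of GRPO-TTA are not reproduced here; `sample_group` and `group_relative_advantages` are hypothetical names.

```python
import math
import random


def softmax(scores, temperature=1.0):
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_group(similarities, k, group_size, rng):
    """Keep the top-k classes by similarity, then sample a group from their softmax."""
    top = sorted(range(len(similarities)), key=lambda c: similarities[c], reverse=True)[:k]
    probs = softmax([similarities[c] for c in top])
    return rng.choices(top, weights=probs, k=group_size)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]
```

Standardizing within the group removes the need for a learned value baseline, which is what makes the scheme usable at test time when no labels are available.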

2606.01234 2026-06-02 econ.GN cs.CE cs.CV cs.GT cs.LG physics.soc-ph q-fin.EC 版本更新

Differing Roles of Leisure and Productivity in GDP - A Machine Learning based comparative analysis of Germany and USA

休闲与生产力在GDP中的不同作用——基于机器学习的德国与美国比较分析

Achintya Ranjan, Uma Ranjan

发表机构 * Achintya Ranjan（阿金蒂亚·兰詹）； Uma Ranjan（乌玛·兰詹）

AI总结本研究通过随机森林模型分析工作时间和全要素生产率对GDP的影响，并利用Gini重要性、SHAP图和部分依赖图揭示德国与美国社会结构差异在GDP贡献中的体现。

Comments International Conference on Emerging Techniques in Computational Intelligence 2025

2606.01217 2026-06-02 cs.CV cs.LG stat.AP 版本更新

Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers

幼儿自闭症谱系障碍中的种族差异分析

Aadithya Prabha Ramaharsha, Deevna Reddy, Uma Ranjan

发表机构 * Sri Ramachandra Institute of Higher Education and Research（Sri Ramachandra高等教育与研究学院）

AI总结通过逻辑回归分析，研究种族、行为评分、性别和新生儿黄疸对幼儿自闭症谱系障碍（ASD）的影响，发现白种人ASD风险比亚洲人高81%，中东人低79%，并确认新生儿黄疸和男性为显著风险因素。

Comments Third International Conference Biomedical Engineering Science and Technology

2606.01215 2026-06-02 cs.CV cs.AI cs.CL cs.MM 版本更新

Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs

将神经符号程序蒸馏到3D多模态大语言模型中

Wentao Mo, Yang Liu

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出APEIRIA，通过三阶段课程学习将符号推理模式蒸馏到3D多模态大语言模型中，实现透明推理与开放词汇空间推理的统一。

Comments To appear in ICML 2026

AI中文摘要

当前的3D空间推理方法面临根本性权衡：神经符号3D（NS3D）概念学习器通过组合程序实现可解释推理，但受限于封闭集概念词汇和简单程序；端到端3D多模态大语言模型（3D MLLMs）能处理复杂自然语言和开放词汇概念，但缺乏显式空间验证的黑箱推理。我们提出APEIRIA，一种神经符号3D MLLM，通过将符号推理模式以自然语言思维链形式蒸馏到MLLMs中，桥接两种范式。我们的三阶段课程逐步构建推理能力：a) 3D感知对齐将物体视觉-几何特征接地到LLM，b) CoT-SFT从符号程序轨迹中教授查询分解和逐步验证，c) CoT-RL将推理模式扩展到开放集概念和深度嵌套指令。通过迁移推理模式而非概念特定知识，APEIRIA保留了NS3D的关键优点：透明推理以及规划和感知组件的模块化可互换性。在接地、问答和描述任务上的评估表明，APEIRIA超越了先前的NS3D方法，并在3D空间推理数据集上匹配最先进的3D MLLMs，统一了符号方法的系统推理与MLLMs的灵活性。代码见https://github.com/oceanflowlab/APEIRIA。

英文摘要

Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods' systematic reasoning with MLLMs' flexibility. Code is available at https://github.com/oceanflowlab/APEIRIA.

2606.01213 2026-06-02 cs.CV cs.AI cs.CL 版本更新

TECCI: Tricky Edits of Collected and Curated Images

TECCI：收集与策划图像的棘手编辑

Aishwarya Agrawal, Roy Hirsch, Yasumasa Onoe, Sherry Ben, Jason Baldridge

发表机构 * Google Research（谷歌研究）； Google DeepMind（谷歌DeepMind）

AI总结提出TECCI基准，包含7550对图像与编辑指令，通过人工与自动评估揭示现有图像编辑模型在指令遵循、最小编辑和视觉质量方面的不足。

AI中文摘要

尽管近期取得了巨大进展，但当前的文本引导图像编辑方法在涉及指令遵循、最小化编辑源图像以及确保高视觉质量等多个方面仍面临困难。当请求的编辑具有挑战性时，例如涉及位置、运动、视角、比例和创意编辑，这些问题尤为明显。为了系统性地测试生成式图像编辑器，我们提出了一个新的图像编辑基准——TECCI：收集与策划图像的棘手编辑。TECCI包含我们发布的全新图像集。TECCI中的图像涵盖7个图像类别。这些图像和类别经过有意策划，以针对现有方法的弱点。TECCI中的编辑指令由Gemini自动生成，每个源图像覆盖5种编辑类型。我们还策划了一组530张图像，为其创建了具有挑战性的人工编写编辑指令。总体而言，TECCI包含7550对图像和编辑指令。我们对TECCI上的五个领先图像编辑模型进行了人工评估。人类从三个维度判断输出：1）指令遵循，2）编辑的最小性，以及3）视觉质量。为了扩大评估规模，我们还使用Gemini构建了一个自动评分器，在匹配人类评估方面达到了74.7%的准确率。我们的评估揭示：1）没有一个模型的总体成功率超过22%，这显示了TECCI的挑战性；2）Nano Banana Pro是整体表现最好的模型；3）模型在指令遵循方面表现显著优于最小编辑和视觉质量；4）模型在编辑建筑和自然图像方面存在困难，这些需要较强的空间布局和复杂视觉细节理解能力；5）推理和创意编辑是最困难的，而颜色和外观编辑是最容易的。

英文摘要

Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark -- TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images, which require strong understanding of spatial layout and intricate visual details, and 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.

2606.01207 2026-06-02 cs.CV cs.LG 版本更新

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

特征对齐决定融合策略：多模态学习中交叉注意力与拼接的比较研究

Zhiqiang Zhou, Xuezhen Xie

发表机构 * Hunan Chemical Industry Vocational and Technical College（湖南化学工业职业技术学院）

AI总结通过实验和理论分析，证明特征对齐质量而非数据规模是决定多模态融合策略优劣的关键因素，当特征预对齐时拼接优于交叉注意力。

Comments 8 pages, 6 figures, 4 tables

AI中文摘要

在多模态融合中，交叉注意力与拼接的选择仍由实践者直觉而非原理性理解主导。本文通过使用两个特征提取骨干（ResNet18和CLIP ViT-B/32）在Flickr8k上的控制实验，证明特征对齐质量（而非仅数据规模）是决定哪种融合策略更优的主要因素。当特征通过视觉语言预训练目标预对齐时，在所有测试规模（2048-16384样本）下，拼接比交叉注意力高出4.1-5.1个百分点。我们提供了基于样本复杂度分析的理论解释：拼接需要O(d_v + d_t)个样本来学习其融合投影，而交叉注意力需要O(d_v * d_t)个样本来学习双线性注意力权重，对于512维CLIP特征，后者是前者的256倍以上。当特征已经对齐时，两种方法的近似误差差距消失，拼接的样本效率在所有实际数据集规模上占优。对齐退化研究证实了单调趋势：随着特征对齐退化，拼接的优势从1.3%增长到2.8%。这些发现为多模态系统中的融合方法选择提供了原理性决策框架，对多模态大语言模型的设计具有直接影响。

英文摘要

The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
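
The "256 times" figure quoted in this abstract follows directly from the two stated scaling terms. The sketch below only evaluates those terms for 512-dimensional features; it ignores the constants hidden by the O(·) notation, so the ratio is an order-of-magnitude guide rather than a sample-size prediction. The function name `fusion_sample_ratio` is hypothetical.

```python
def fusion_sample_ratio(d_v, d_t):
    """Return the concatenation term d_v + d_t, the cross-attention term
    d_v * d_t, and their ratio, following the scaling in the abstract."""
    concat_term = d_v + d_t   # O(d_v + d_t): linear fusion projection
    cross_term = d_v * d_t    # O(d_v * d_t): bilinear attention weights
    return concat_term, cross_term, cross_term / concat_term

concat_term, cross_term, ratio = fusion_sample_ratio(512, 512)
print(concat_term, cross_term, ratio)  # 1024 262144 256.0
```

For equal dimensions d the ratio simplifies to d / 2, which is why it grows with the embedding width.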

2606.01192 2026-06-02 cs.CV 版本更新

PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis

PairedGTA：用于受控光度偏移分析的驾驶数据集生成

Andrea Chianese, Giulio Rossolini, Alessandro Biondi, Marco Cococcioni, Giorgio Buttazzo

发表机构 * Scuola Superiore Sant’Anna（圣安娜高等学院）； Department of Excellence in Robotics & AI（机器人与人工智能卓越部门）； University of Pisa（比萨大学）

AI总结提出基于高保真游戏引擎的PairedGTA框架，通过生成完美配对的图像，实现独立于几何和语义变化的光度偏移分析，并用于评估语义分割模型在恶劣条件下的性能退化。

Comments Under review

AI中文摘要

评估自动驾驶视觉感知系统的性能对于确保在不同环境场景下的可靠运行至关重要。理想情况下，要在不同恶劣条件下进行平衡和公平的分析，需要同一场景在不同天气或光照变化下的完美配对图像。这将允许独立于几何和语义变化来评估光度偏移的影响。不幸的是，真实世界数据集很少提供同一场景在不同环境条件下的图像，因为通常相机姿态、交通和动态物体（车辆、行人等）的位置随时间变化，因此只能提供粗略配对的数据。为了解决这一挑战，本工作引入了一种基于高保真游戏引擎的数据生成框架，用于提取完美配对的图像。通过利用与GTA游戏引擎通信的软件API，该框架在保持场景几何、相机姿态以及动态物体的身份和位置的同时，修改光照和天气条件。对于每个采样位置，它程序化地实例化动态实体，并在各种恶劣条件下渲染像素对齐的图像。通过在语义分割模型上的系统分析，展示了所提出的生成框架在驾驶场景中的优势，其输出退化可以更直接地归因于光度偏移，而不是不受控制的语义或几何因素。

英文摘要

Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.

2606.01173 2026-06-02 cs.CV 版本更新

Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection

复用融合时频谱可靠性用于RGB-红外目标检测的自适应融合与专家路由

Yefeng Wu

发表机构 * Tsinghua University（清华大学）

AI总结提出一种无参数的7维频谱可靠性描述符，通过频谱可靠性融合和可靠性条件专家路由，提升RGB-红外目标检测在退化条件下的性能。

AI中文摘要

RGB-红外检测器通常会丢弃跨模态融合过程中产生的统计信息，使得下游模块无法知晓当前交互是否可靠。我们提出提取一个无参数的7维频谱可靠性描述符——汇总频带能量、幅度比、相位一致性和跨模态相关性——并在融合阶段之外复用该描述符。该描述符驱动频谱可靠性融合（SRF），它将频谱残差与保守的空间基进行门控，以及可靠性条件专家路由（RCER），它将描述符与池化内容结合以引导稀疏的后融合专家。在匹配消融实验下，描述符感知门控相比仅内容自适应门控提高了mAP50；一个2×2因子分析进一步表明，在参数数量几乎相等的情况下，描述符条件路由相比仅专家架构提供了更大的边际增益。在DroneVehicle上的六种合成退化条件下，平均保留率提升至95.0%，而仅内容MoE为92.0%，拼接为87.9%，在模态缺失下增益最大；同一模型在自然白天/黑夜分割上也分别提高了+5.2/+5.3的mAP50。这些结果表明，将融合时可靠性作为显式信号保留有利于自适应融合和融合后条件计算。

英文摘要

RGB-infrared detectors typically discard the statistics generated during cross-modal fusion, leaving downstream modules unaware of whether the current interaction is reliable. We propose to extract a parameter-free, 7-dimensional spectral reliability descriptor -- summarizing band energy, amplitude ratio, phase consistency, and cross-modal correlation -- and to reuse it beyond the fusion stage. The descriptor drives both Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. Under matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a 2×2 factorial analysis further shows that descriptor-conditioned routing provides the larger marginal gain over expert architecture alone at near-equal parameter count. Under six synthetic degradations on DroneVehicle, average retention rises to 95.0%, versus 92.0% for content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.

2606.01164 2026-06-02 cs.CV 版本更新

Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends

迈向交互式视频世界建模：前沿、挑战、基准与未来趋势

Jiuming Liu, Chaojun Ni, Mengmeng Liu, Chensheng Peng, Fangjinhua Wang, Sitian Shen, Marc Pollefeys, Masayoshi Tomizuka, Ayush Tewari, Per Ola Kristensson

发表机构 * Department of Engineering, University of Cambridge, U.K.（剑桥大学工程系）； Peking University（北京大学）； University of Twente（特文特大学）； Mechanical Systems Control Laboratory, University of California, Berkeley, USA（加州大学伯克利分校机械系统控制实验室）； ETH Zurich（苏黎世联邦理工学院）； Microsoft（微软公司）； University of Oxford（牛津大学）

AI总结本文系统综述了交互式世界建模的研究趋势、技术挑战、评估基准，并提出了未来方向，重点在于动作条件可控性、长程交互与记忆以及实时响应性。

Comments Under review. The GitHub repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model

AI中文摘要

随着大语言模型和基于扩散的内容生成的快速发展，世界建模引起了越来越多的研究关注，惠及游戏引擎、具身人工智能、自动驾驶等多个下游领域。通过将用户动作明确纳入世界状态转换，最近的文献在动作条件视频或3D生成范式中赋予了世界建模交互性，进一步增强了世界演化的可控性，并促进用户自由遍历、操纵、导航和个性化状态演化。本文旨在系统回顾交互式世界建模的最新研究趋势、技术发展、评估基准，并提出未来潜在方向。具体而言，我们首先总结了在应用场景、世界状态演化和场景模态方面的近期工作和趋势。随后，我们深入探讨三个关键的技术挑战，包括动作条件可控性、长程交互与记忆，以及实时交互的动作跟随响应性。此外，我们还全面比较了四个特定应用领域（开放世界探索、游戏引擎、自动驾驶和机器人）中的现有基准和指标。最后，我们讨论了实现下一代交互式世界建模的几个有前景的未来方向。相应的代码库已公开在：https://github.com/liujiuming123/Awesome-Interactive-World-Model。

英文摘要

With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous driving, etc. Through explicitly incorporating user actions into world state transition, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolutions and facilitating users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, evaluation benchmarks, and also propose future potential directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges, including action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we also thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engine, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: https://github.com/liujiuming123/Awesome-Interactive-World-Model.

STARFISH: 从内部状态修复中实现剪枝网络的快速精度恢复

Shir Maon, Odelia Melamed, Adi Shamir

发表机构 * Weizmann Institute of Science（魏茨曼科学研究所）

AI总结提出STARFISH方法，通过少量无标签校准集优化剪枝网络与原始网络内部状态对齐，高效恢复精度，在ViT网络上优于现有方法。

AI中文摘要

剪枝是一种旨在减少大型神经网络中权重数量的过程。这可以显著加快推理速度，但可能导致模型精度大幅下降，因此通常随后会进行修复过程以恢复部分丢失的精度。在本文中，我们提出了一种新的修复方法STARFISH，它可以高效地恢复任何剪枝网络的（大部分）精度。STARFISH的主要思想是使用少量无标签示例的校准集，优化剪枝网络以与原始网络的内部状态表示对齐。对于去除50%权重的常见情况，在基于ViT的网络中，STARFISH修复相比最先进方法将恢复精度提高了高达22%。在激进剪枝下其优势更为显著。例如，在ImageNet的DeiT-B网络中去除75%权重后，STARFISH仅使用训练图像数量的0.4%作为校准集，恢复了原始稠密模型精度的82%，而竞争恢复技术仅达到稠密模型精度的40%。

英文摘要

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

2606.01118 2026-06-02 cs.CV 版本更新

Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery

面向无人机影像中运动鲁棒作物分割的秩感知分位数激活

Abinav Kiran, Sravan Danda, Aditya Challa, Sougata Sen, Daya Sagar B S

发表机构 * Senior Member, IEEE（IEEE高级会员）

AI总结针对高速无人机影像中的运动模糊导致语义分割退化的问题，提出秩感知的双分位数激活（QAct）模块，通过实例级秩归一化替代幅度门控，在零样本和模糊监督两种设置下均显著提升mIoU，尤其在稀有纹理依赖类上表现突出，且与模糊域训练互补。

AI中文摘要

高速无人机采集的运动模糊会降低对具有高农业价值的稀有纹理依赖类别的语义分割性能。标准CNN依赖于高频幅度特征，而模糊会破坏这些特征，导致少数信号被统计性擦除。我们提出双分位数激活（QAct），一种秩感知模块，用实例级秩归一化替代幅度门控。在Agriculture-Vision 2021数据集上，在零样本和模糊监督两种设置下、多种严重程度上进行评估，QAct是主导架构因素：它在两种设置和所有严重程度上都比ReLU带来一致的mIoU提升，在稀有结构和纹理依赖类别上增益最强。一些主导类别（水、播种机跳过）在蒸馏下表现出混合的每类性能。在中等模糊下，零样本QAct优于蒸馏训练的ReLU；在所有严重程度上，Distill-QAct达到最佳性能，证实了秩感知激活和模糊域训练是互补的鲁棒性来源。

英文摘要

Motion blur from high-speed UAV acquisition degrades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated on Agriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with the strongest gains on rare structural and texture-dependent classes. Some dominant classes (water, planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves the best performance, confirming that rank-aware activation and blur-domain training are complementary sources of robustness.
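
To make "instance-level rank normalization" concrete, here is a minimal sketch that replaces each activation with its empirical quantile within its own channel of a single instance. This is an illustration of the general idea only, not the paper's QAct block, whose exact definition is not given in the abstract; the function name `rank_normalize` is hypothetical.

```python
import numpy as np

def rank_normalize(features):
    """Map each activation to its quantile in [0, 1] within its channel.

    `features` has shape (channels, ...) for one instance; ranks are
    computed separately per channel over all remaining positions.
    """
    x = np.asarray(features, dtype=float)
    flat = x.reshape(x.shape[0], -1)              # (channels, positions)
    # Double argsort yields the rank of every element within its row.
    ranks = flat.argsort(axis=1).argsort(axis=1)
    n = flat.shape[1]
    quantiles = ranks / max(n - 1, 1)
    return quantiles.reshape(x.shape)

sharp = np.array([[3.0, 1.0, 2.0]])
blurred = 0.1 * sharp  # blur modeled here as a uniform loss of magnitude
print(rank_normalize(sharp))    # [[1.  0.  0.5]]
print(rank_normalize(blurred))  # [[1.  0.  0.5]]
```

The output depends only on ordering, so a positive rescaling of the activations leaves it unchanged, whereas a magnitude threshold such as ReLU followed by a fixed gate would respond differently to the attenuated input.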

2606.01106 2026-06-02 cs.CV 版本更新

Temporal Evidence Routing with Structured Visual Evidence for TimeLogicQA

基于结构化视觉证据的时间证据路由用于TimeLogicQA

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

发表机构 * Southeast University（东南大学）； National University of Singapore（新加坡国立大学）； Independent Researcher（独立研究员）； Opus AI Research（Opus AI研究院）； University of Science and Technology of China（中国科学技术大学）

AI总结提出视觉证据路由流水线，分离感知与符号时间推理，通过结构化视觉证据和确定性时间规则在TimeLogicQA上达到81.8 AvgAcc。

AI中文摘要

TimeLogicQA评估视频问答系统是否能推理事件存在、顺序、持续性、边界条件和重叠等时间关系。我们通过一个视觉证据路由流水线来处理此任务，该流水线将感知与符号时间推理分离。系统首先将每个问题解析为事件目标、答案模式、候选选项和时间算子。然后，根据持续时间和算子难度对视频进行路由，对短片段使用有序的全帧证据，对长视频使用以事件为中心的候选窗口。多模态大语言模型为相关事件生成结构化视觉证据，而程序化验证器恢复密集的动作区间，确定性归约器应用算子特定的时间规则产生最终答案。保守融合仅在视觉证据、时间程序和置信度检查一致时接受答案，减少噪声答案翻转。在官方测试评估中，我们的最终系统实现了81.8的平均准确率。

英文摘要

TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.

2606.01104 2026-06-02 cs.CV 版本更新

Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge

自适应密集证据精炼用于视频关系推理：VRR-QA挑战

Yuyang Sun, Yongliang Wu, Xingyu Zhu, Yuxia Chen, Zhenxiang Jiang, Yangguang Ji, Wenbo Zhu, Yanxi Shi, Jay Wu, Shuo Wang, Xu Yang

发表机构 * Southeast University（东南大学）； National University of Singapore（新加坡国立大学）； Independent Researcher（独立研究员）； Opus AI Research（Opus AI研究院）； University of Science and Technology of China（中国科学技术大学）

AI总结提出一种自适应测试时计算系统，通过轻量视图识别不稳定问题并路由到高预算密集证据模块，在VRR-QA测试集上达到90.07%平均准确率。

AI中文摘要

VRR-QA评估视频语言系统能否推断空间、时间、视角、深度和可见性关系，这些关系通常无法通过单帧解决。我们提出一个仅推理的系统，基于自适应测试时计算。系统首先通过直接视频语言模型传递回答每个问题，然后使用多个轻量视图发现不稳定问题。只有这些困难问题被路由到高预算密集证据模块，该模块构建带时间戳的帧观察、关系特定探针、候选验证和保守的时间聚合。这种设计分离了视频问答中常混淆的两个问题：寻找合理的替代答案以及决定何时应更改当前答案。在测试集上，最终系统获得90.07平均准确率和87.81宏平均准确率。报告重点介绍最终测试系统和复现自适应密集验证器所需的实现设置。

英文摘要

VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.

2606.01097 2026-06-02 cs.CV 版本更新

TextFake: 对富含文本图像中AI生成图像检测的基准测试

Yuning Zhang, Changtao Miao, Mingyu Liao, Tingyu Liu, Xinghao Wang, Tao Gong, Qi Chu, Nenghai Yu

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China（中国科学技术大学网络科学与技术学院）； Anhui Province Key Laboratory of Digital Security（安徽省数字安全重点实验室）； Individual Researcher（独立研究者）

AI总结针对AI生成图像检测在富含文本图像上的空白，构建包含28种语言、2万图像的TextFake基准，评估14种检测器和3种VLM API，发现系统性能差距并诊断三种失败模式。

AI中文摘要

最近的AI生成图像（AIGI）检测器在自然图像基准上表现良好，但它们在富含文本的伪造图像（如虚假截图、文档和新闻页面）上的行为尚未得到测试，这些伪造图像在虚假信息中普遍存在。我们引入了TextFake，一个包含20,000张图像的富含文本AIGI检测基准，涵盖28种语言、4个主题类别和2种场景模态。伪造图像通过一个四阶段流水线合成，该流水线沿三个受控维度注释真实图像，并通过分布对齐的结构化提示生成对应图像，排除了协变量捷径。对14个专用检测器和3个前沿VLM API的零样本评估揭示了巨大的系统性差距：没有方法超过80%的准确率，有些方法相比自然图像基准下降了60%以上。诊断评估识别出三种失败模式：文本密度诅咒，即密集字形压倒低级检测器；通过渲染保真度进行伪装，即更强的文本渲染抑制生成伪影；以及阈值崩溃，即常规扰动将检测器推向随机水平。

英文摘要

Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses generative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.

URL PDF HTML ☆

赞 0 踩 0

2606.01048 2026-06-02 cs.CV 版本更新

Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation

解耦残差去噪扩散模型用于统一且数据高效的图像到图像翻译

Ziyue Lin, Jiahe Hou, Hongyu Xia, Xinrui Xie, Feifei Wang, Yuyin Zhou, Wei Wang, Jiawei Liu, Liangqiong Qu

发表机构 * The University of Hong Kong（香港大学）； Shenyang Institute of Automation, Chinese Academy of Sciences（中国科学院沈阳自动化研究所）； The Chinese University of Hong Kong（香港中文大学）； University of California, Santa Cruz（加州大学圣克鲁兹分校）

AI总结提出解耦残差去噪扩散模型（DRDD），通过将扩散过程解耦为随机噪声扩散和确定性残差扩散两个独立阶段，实现统一且数据高效的图像到图像翻译。

Comments CVPR 2026

AI中文摘要

我们提出解耦残差去噪扩散模型（DRDD），用于统一且数据高效的图像到图像（I2I）翻译。尽管扩散模型在质量和多样性方面推动了I2I翻译的发展，但我们揭示了扩散模型中一个先前未被充分探索的性质。关键在于，除了其传统的流形提升作用（即将数据移出低维流形），注入高斯噪声通过隐式对齐跨域的特征分布促进了域协调，这一性质对于统一的I2I翻译尤其有利。然而，现有的扩散模型过早地削弱了这种协调效果，因为噪声和残差在单个耦合的扩散过程中被同时移除。为解决这一问题，DRDD将扩散过程解耦为两个顺序且独立的扩散阶段：（1）用于域协调和流形提升的随机噪声扩散，以及（2）完全在固定噪声域内学习核心语义映射的确定性残差扩散。这种解耦在整个变换过程中保留了协调和流形提升效果，极大地简化了跨不同任务和域的统一映射学习。值得注意的是，噪声扩散阶段仅在丰富的、无配对的目标域图像上训练，大大提高了数据效率。全面的理论和实证分析表明，DRDD与主流扩散模型广泛兼容，即使在有限配对数据下也能持续提供稳健、统一的I2I翻译。我们的代码可在 https://github.com/HKU-HealthAI/DRDD 获取。

英文摘要

We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at https://github.com/HKU-HealthAI/DRDD.

2606.01044 2026-06-02 cs.CV 版本更新

Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA

Ask4VG: 用于减少医学VQA中先验驱动答案的风险感知问题选择

Xiaorong Zhu, Qiang Li, Zibo Xu, Weijie Wang, Weizhi Nie

发表机构 * School of Microelectronics, Tianjin University, Tianjin 300072, China（天津大学微电子学院，天津 300072，中国）； DISI, University of Trento, Trento, Italy（特伦托大学DISI研究所，意大利特伦托）； School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China（天津大学电气与信息工程学院，天津 300072，中国）

AI总结提出Ask4VG框架，通过反事实视觉探测估计问题引发的幻觉风险，并重排问题改写以选择更依赖图像证据的问题，从而减少医学VQA中的先验驱动答案。

详情

AI中文摘要

医学视觉问答要求模型将回答建立在图像证据上，因为缺乏视觉支持的答案可能误导下游解读。然而，许多医学VQA问题是通用的、模板化的或形式高度相似，这可能鼓励模型学习问答捷径而非依赖图像的推理，从而增加幻觉回答的风险。我们提出Ask4VG，一个无标签的试点框架，用于风险感知的问题选择。Ask4VG通过反事实视觉探测估计问题引发的幻觉风险：在原始图像、扰动图像、空白图像和错配图像下提出相同问题，并将得到的答案关系转换为反事实风险估计器的弱监督信号。然后，学习到的估计器对候选问题改写进行重排，以优先选择那些对缺失或错配视觉证据更不具不变性的、保留意图的问题，再进行最终答案生成。在VQA-RAD上使用Qwen2-VL-2B-Instruct，仅提示改写增加了反事实风险，而基于预测风险的重排将留出风险从0.658降至0.623，并将精确准确率从0.337提升至0.356。一个300样本的PMC-VQA外部检查显示了相同的风险降低方向，并伴有小幅准确率提升。这些结果表明，问题选择是响应级幻觉缓解的一个有前景的补充，有助于实现可靠的医学VQA。

英文摘要

Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
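
The counterfactual probing idea in this abstract can be sketched as a simple invariance score: if a question receives the same answer when the image is blank or mismatched, the answer is probably driven by priors rather than by the image. The scoring rule below (fraction of probes whose answer matches the original) and the name `invariance_risk` are assumptions made for illustration. The paper instead trains a risk estimator from such answer relations, and its exact supervision signal is not described in the abstract.

```python
def invariance_risk(answers, probes=("blank", "mismatched")):
    """Fraction of counterfactual probes whose answer equals the answer
    given for the original image (1.0 = fully image-invariant)."""
    original = answers["original"].strip().lower()
    unchanged = sum(answers[p].strip().lower() == original for p in probes)
    return unchanged / len(probes)

# Hypothetical answers from a VQA model for one question.
prior_driven = {"original": "No", "blank": "no", "mismatched": "No"}
image_grounded = {"original": "left lung", "blank": "unclear", "mismatched": "right lung"}
print(invariance_risk(prior_driven))    # 1.0
print(invariance_risk(image_grounded))  # 0.0
```

Under this toy rule, candidate rewrites of a question would be reranked toward lower scores. The perturbed-image probe mentioned in the abstract is left out here because a mild perturbation is not necessarily expected to change the answer.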

2606.01031 2026-06-02 cs.GR cs.AI cs.CV cs.LG cs.MM 版本更新

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

音频驱动说话头生成的时序对齐评估

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)（新南威尔士大学商学院）； School of Engineering and Built Environment, Griffith University（格里菲斯大学工程与建筑环境学院）； Data61/CSIRO（Data61/澳大利亚联邦科学与工业研究组织）

AI总结针对现有帧级评估指标对时序偏差敏感的问题，提出基于软动态时间规整的序列级对齐评估框架，提升评估鲁棒性并揭示不同建模范式间的系统权衡。

Comments Research report

AI中文摘要

音频驱动的说话头生成技术发展迅速，但现有评估协议主要依赖帧级指标，假设生成视频与参考视频之间存在严格的时间对应关系。这一假设与语音驱动的面部运动不符，后者自然包含轻微的时间偏移、不同的说话速度和风格变化。因此，传统指标可能将无害的时间差异视为质量错误，使得公平比较方法并理解其权衡变得更加困难。在这项工作中，我们认为动态生成模型的评估应被表述为序列对齐问题，而非独立的帧比较。我们引入了一种统一的序列级重新表述，将软动态时间规整集成到已有的评估流程中。通过在对齐特征轨迹的同时保持时间顺序，所提出的框架对有限的时间错位具有鲁棒性，且不改变底层的感知、身份或同步编码器。我们表明，在刚性对齐下，帧级评估可被视为一个特例，而序列级对齐提供了更好的稳定性、对时间差异的更低敏感性以及建模范式之间更清晰的区分。基于这一原则性表述，我们在标准化协议下，对涵盖规范、野外和风格多样场景的七个数据集上的20种方法进行了大规模基准测试。大量实验表明，时序对齐的指标对时间差异更鲁棒，跨数据集提供更一致的结果，并能更好地揭示建模范式之间的系统权衡，例如同步性与真实性、表现力与稳定性之间的权衡。

英文摘要

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
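
The contrast between frame-wise and sequence-level comparison can be shown with a small sketch of Soft Dynamic Time Warping. It uses the standard soft-DTW recursion with a squared Euclidean cost and a soft minimum with smoothing parameter `gamma`. The toy 1-D "feature trajectories" and the function names are illustrative assumptions; the paper applies the alignment to features from existing perceptual, identity, or synchronization encoders.

```python
import numpy as np

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW discrepancy between two sequences of shape (T,) or (T, D)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Pairwise squared Euclidean cost between all frames.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    n, m = cost.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            lo = prev.min()
            # Numerically stable soft minimum over the three predecessors.
            softmin = lo - gamma * np.log(np.exp(-(prev - lo) / gamma).sum())
            R[i, j] = cost[i - 1, j - 1] + softmin
    return R[n, m]

def framewise(x, y):
    """Rigid frame-by-frame squared error (sequences of equal length)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(((x - y) ** 2).sum())

reference = [0, 0, 1, 2, 1, 0]
shifted = [0, 1, 2, 1, 0, 0]   # same motion, one frame earlier
print(framewise(reference, shifted))             # 4.0
print(soft_dtw(reference, shifted, gamma=0.01))  # close to zero
```

The frame-wise score penalizes the one-frame shift, while the aligned score stays near zero because a monotonic warping path matches the two trajectories. Soft-DTW can take small negative values, since the soft minimum is never larger than the hard minimum.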


2606.01022 2026-06-02 cs.CV cs.AI version update

ProductWebGen: Benchmarking Multimodal Product Webpage Generation

Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng

Affiliations: School of Computer Science & Zhiyuan College; Shanghai Jiao Tong University; Kuaishou Technology

AI summary: Proposes the ProductWebGen benchmark for evaluating how well multimodal generative models produce consistent product showcase webpages from product images and instructions, and compares an editing-based workflow with a unified-model-based workflow.

Comments: Accepted by KDD 2026

DOI: 10.1145/3770855.3817507

AI Chinese abstract

从源产品图像以及布局和视觉内容指令中制作产品展示网页，对于营销、广告和电子商务等领域具有重要的实用价值。直观上，该任务要求产品展示之间严格的视觉一致性以及高保真度的指令遵循，以联合生成可渲染的HTML代码。这些对可控性和指令遵循的要求与先进多模态生成模型（如图像编辑模型和统一模型）的核心特征紧密一致。为此，本文引入ProductWebGen来系统性地基准测试这些模型的产品网页生成能力。我们组织了包含500个测试样本的ProductWebGen，涵盖13个产品类别；每个样本由源图像、视觉内容指令和网页指令组成。任务是根据源图像和指令生成包含多个一致图像的产品展示网页。鉴于任务的混合模态输入输出性质，我们设计并系统比较了两种评估工作流——一种使用大语言模型和图像编辑模型分别生成HTML代码和图像（基于编辑），另一种依赖单个统一模型生成两者，其中图像生成依赖于先前的多模态上下文（基于统一模型）。实验结果表明，基于编辑的方法在网页指令遵循和内容吸引力方面取得领先结果，而基于统一模型的方法在满足视觉内容指令方面可能展现出更多优势。我们还构建了一个监督微调数据集ProductWebGen-1k，包含1000组真实产品图像和LLM生成的HTML代码。我们在开源统一模型BAGEL上验证了其有效性。数据和代码可在https://github.com/SJTU-DENG-Lab/ProductWebGen获取。

English abstract

Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation -- one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.


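2606.01021 2026-06-02 cs.CV version update

Learning Neural Deformation Representation for 4D Dynamic Shape Generation

Gyojin Han, Jiwan Hur, Jaehyun Choi, Junmo Kim

Affiliations: Korea Advanced Institute of Science and Technology

AI summary: Proposes a new neural deformation representation that, combined with conditional neural signed distance fields, yields a 4D representation architecture with disentangled motion and shape latent spaces, and uses a diffusion model to generate high-quality, temporally consistent 4D dynamic shapes.

Comments: ECCV 2024

AI Chinese abstract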

近期3D形状表示的发展为生成精细3D形状开辟了新可能性。尽管取得了这些进展，但关于生成随时间变形的3D对象形式的4D动态形状的研究仍然很少。为弥补这一差距，本文聚焦于生成4D动态形状，同时强调生成质量和效率。先前关于4D生成的工作HyperDiffusion提出了一种直接生成4D占用场权重参数的方法，但由于运动表示未与4D占用场的形状表示分离，导致时间一致性差且渲染速度慢。因此，我们提出一种新的神经变形表示，并将其与条件神经符号距离场结合，设计了一种4D表示架构，其中运动潜在空间与形状潜在空间解耦。所提出的变形表示通过预测多个部分的蒙皮权重和刚体变换来工作，在理解形状结构方面也优于现有4D表示的变形模块。此外，我们设计了一种扩散模型的训练过程，利用由我们的4D表示提取的形状和运动特征作为数据点。无条件生成、条件生成和运动重定向实验结果表明，我们的方法不仅在4D动态形状生成方面表现出优于先前工作的性能，而且具有多种潜在应用。

English abstract

Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and rigid transformations for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications.


2606.01014 2026-06-02 cs.CV cs.AI version update

Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing

Gyojin Han, Junmo Kim

Affiliations: School of Electrical Engineering, KAIST

AI summary: Proposes a cross-axis feature fusion architecture and an auxiliary task in which a joint-anchored transformer predicts joint-wise motion differences, enabling text-driven 3D human motion editing with state-of-the-art results on the MotionFix dataset.

Comments: CVPR 2026

AI Chinese abstract

我们研究基于文本的三维人体运动编辑，目标是保留源运动的风格和结构，同时应用自然语言描述的编辑。MotionFix数据集的发布推动了基于训练扩散模型的直接生成编辑运动的研究，这些模型从源运动和文本指令生成编辑运动。虽然先前的工作主要关注学习编辑在时间上何时发生，但我们的目标是创建一个不仅理解时间方面，还理解哪些特定关节负责变化的模型。为此，我们提出了一种新颖的架构和一个互补的辅助任务来辅助其训练。我们的架构由两个轴锚定变换器组成，分别沿关节和时间维度提取不同特征，以及一个跨轴融合块来整合这些表示。我们进一步引入一个辅助任务，训练关节锚定变换器回归源和目标关节旋转之间的Soft-DTW距离。该目标教会模块理解哪些关节需要修改，哪些需要保留。通过在MotionFix数据集上的全面实验，我们证明我们的方法显著提高了与文本指令和源运动的语义对齐，以及生成运动的整体保真度，达到了最先进的结果。

English abstract

We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.


2606.01006 2026-06-02 cs.CV version update

Automated Erythrocyte Detection and Tracking for Retinal Blood Flow Quantification in Erythrocyte-Mediated Angiography

Chiao-Yi Wang, Havish S Gadde, Yi-Ting Shen, Saige M. Oechsli, Osamah Saeedi, Yang Tao

Affiliations: Department of Bioengineering, University of Maryland, College Park, MD 20742, USA; Department of Ophthalmology and Visual Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA; Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA

AI summary: Proposes the EMTrack framework, which uses a flow-context module and a topology-aware tracking strategy to detect and track erythrocytes automatically for retinal blood flow quantification, and outperforms baseline methods on the new RBF-EMA dataset.

AI Chinese abstract

毛细血管水平的视网膜血流（RBF）作为多种眼病的生物标志物具有巨大潜力。然而，测量毛细血管水平RBF的方法仍然有限。红细胞介导血管造影（EMA）是一种新兴成像技术，通过可视化单个红细胞实现毛细血管水平RBF测量，但自动红细胞检测与追踪（量化血流所必需）仍鲜有探索。为填补这一空白，我们提出EMTrack，一种新颖框架，包含用于区分运动与静止细胞的红细胞检测流上下文模块，以及能够在帧间大位移和显著运动变化下进行追踪的拓扑感知追踪策略。此外，我们建立了RBF-EMA，一个包含全面红细胞检测与追踪标注的新EMA数据集。实验结果表明，我们的方法在RBF-EMA数据集上的检测与追踪任务中，在定量和定性上均优于基线方法。此外，RBF量化结果凸显了我们的框架在自动化视网膜血流测量中的巨大潜力。

English abstract

Capillary-level retinal blood flow (RBF) has strong potential as a biomarker for various ocular diseases. However, modalities for measuring capillary-level RBF remain limited. Erythrocyte-mediated angiography (EMA), an emerging imaging technique, enables capillary-level RBF measurement by visualizing individual erythrocytes, yet automated erythrocyte detection and tracking, which are essential for quantifying blood flow, remain largely unexplored. To address this gap, we propose EMTrack, a novel framework featuring a flow-context module for erythrocyte detection that distinguishes moving from paused cells and a topology-aware tracking strategy that enables tracking under large inter-frame displacements and substantial motion variations. In addition, we establish RBF-EMA, a new EMA dataset with comprehensive erythrocyte detection and tracking annotations. Experimental results demonstrate that our method outperforms baseline methods both quantitatively and qualitatively on detection and tracking tasks in the RBF-EMA dataset. Moreover, RBF quantification results highlight the strong potential of our framework for automated retinal blood flow measurement.


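2606.00999 2026-06-02 cs.CV version update

SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation

Aditya Makineni, Qing Tian

Affiliations: Department of Computer Science, University of Alabama at Birmingham

AI summary: Proposes the SWARD framework, which uses multi-scale windowed attention distillation and prototype discriminative regularization to bridge the representational gap between transformer teachers and CNN students, enabling cross-architecture knowledge distillation for semantic segmentation.

AI Chinese abstract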

大规模视觉基础模型在语义分割等密集预测任务上取得了显著进展，但其规模使得在资源受限环境中部署不切实际，因此知识蒸馏成为将其能力迁移至轻量级学生网络的一种手段。然而，现代基础教师模型主要基于Transformer，编码全局上下文，而高效学生模型通常是具有局部偏置感受野的卷积网络。现有蒸馏方法大多假设架构同质性，并依赖直接特征模仿，这未能弥合这种表征差距，且忽略了准确语义分割所需的结构化空间依赖和判别性组织。在本文中，我们提出SWARD，一种通过两种互补机制解决这一差距的知识蒸馏框架。首先，我们引入多尺度窗口注意力蒸馏（MWAD）模块，该模块在随机移位窗口分区中对齐师生基于注意力的关系，窗口偏移在每次训练迭代中随机重新采样。这消除了窗口边界偏差，并结合多尺度设计，捕获了短程和长程空间依赖。其次，我们引入原型判别正则化（PDR），一种通过强制类间分离和类内紧凑性来塑造学生特征分布的损失，进一步锐化判别结构，超越仅靠特征模仿在学生容量减少下所能产生的效果。在不同视觉应用（即城市场景解析和医学图像分割）上的实验表明，SWARD达到了最先进的性能。

English abstract

Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based models that encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student's feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student's reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.
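
The stochastically shifted window partition that MWAD relies on can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the cyclic wrap-around of the shifted grid, the function name `shifted_windows`, and the window sizes are hypothetical, and the attention-relation matching inside each window is left out.

```python
import random


def shifted_windows(height, width, window, rng):
    """Partition an H x W feature grid into window x window groups after a random shift.

    Assumes height and width are multiples of `window` and that the shift wraps
    around cyclically (an assumption of this sketch, not stated in the abstract).
    """
    dy, dx = rng.randrange(window), rng.randrange(window)
    groups = {}
    for r in range(height):
        for c in range(width):
            key = (((r + dy) % height) // window, ((c + dx) % width) // window)
            groups.setdefault(key, []).append((r, c))
    return list(groups.values())


rng = random.Random(0)
# Offsets are resampled on every call, i.e. once per training iteration.
for step in range(3):
    for window in (2, 4):  # hypothetical multi-scale window sizes
        windows = shifted_windows(8, 8, window, rng)
        # Teacher and student relations would be compared within each group here.
        assert all(len(group) == window * window for group in windows)
```

Because the same partition is applied to teacher and student features in a given iteration, and the offsets change between iterations, no fixed window boundary is favoured over the course of training.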


2606.00987 2026-06-02 cs.CV cs.AI version update

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao, Junyu Gao, Xuelong Li

Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI); China Telecom; School of Artificial Intelligence, Optics and Electronics (iOPEN); Northwestern Polytechnical University

AI summary: Introduces the multi-temporal referring segmentation task, builds the first benchmark, MTRefSeg-21K, with the automated data construction pipeline CRAFT-Agent, and designs MTRefSeg-R1, a change-aware LVLM framework trained in two stages that often outperforms existing baselines.

AI Chinese abstract

大型视觉语言模型（LVLMs）展现了强大的视觉理解和语言引导定位能力，但其多时相视觉推理能力仍未充分探索。为填补这一空白，我们引入了多时相指代分割（MTRS），这是一个新任务，旨在从多时相图像中分割语言描述的时间变化。MTRS通过联合要求时相对应推理、语言定位和像素级掩码预测，扩展了传统的指代分割和变化检测。我们提出了CRAFT-Agent，一个带有人工审核的自动化数据构建管道，并构建了MTRefSeg-21K，这是第一个MTRS基准，包含21K个高质量的多时相图像-文本-掩码三元组，覆盖多样化的场景、视角和领域。对一系列基于VLM和LVLM的模型进行基准测试表明，直接推理表现较差，而任务特定的微调仍然有限。为解决这一问题，我们提出了MTRefSeg-R1，一个采用两阶段策略训练的变化感知LVLM框架。它首先从20K个仅视觉的双时相样本中学习通用时间变化感知，然后在MTRefSeg-21K上进行微调，以实现细粒度的语言引导时间定位。MTRefSeg-R1显式建模跨时相视觉差异，将语言指令与时间变化对齐，并预测所指变化掩码。大量实验表明，与现有的LVLM基线相比，MTRefSeg-R1实现了强大且通常更优的性能，展示了MTRS的挑战和潜力。

English abstract

Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce Multi-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose CRAFT-Agent, an automated data construction pipeline with human auditing, and build MTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose MTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.


2606.00963 2026-06-02 cs.CV cs.CL version update

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

Affiliations: Cornell Tech, Cornell University; NVIDIA; illoca AI; The University of California, Merced

AI summary: Proposes the Reasmory framework, which treats spatial reasoning as structured program execution over reconstructed explicit 3D memory and introduces a lightweight domain-specific language that constrains VLM queries and operations, improving spatial reasoning tasks by 6-18%.

AI Chinese abstract

视觉语言模型（VLM）展现出新兴的空间推理能力，但在需要精确空间理解的任务（如视角推理、方向比较和距离估计）上仍不可靠。在多视图图像和单目视频中，相关空间线索通常稀疏且分布在冗余观测中，难以组织和利用。基于重建的视觉基础模型（VFM）提供了一种自然的方式将这些观测聚合为显式空间记忆，例如点云。然而，简单地将重建模型作为自由形式工具使用是脆弱的，VLM可能错误调用工具、跳过所需的空间变换或误用中间结果。我们提出Reasmory，一个将空间推理形式化为对重建空间记忆的结构化程序执行的框架。Reasmory构建显式3D记忆，用语义锚定的3D对象实例增强它，并引入轻量级领域特定语言（DSL），约束VLM在推理过程中如何查询对象和相机、变换视角以及渲染观测。生成的程序在执行前被解析和验证，从而比无约束的工具使用更可靠地与空间记忆交互。在多视图图像和视频空间推理基准上的实验表明，与强基线（包括GPT-5-mini和Gemini-3-flash）相比，一致提升6-18%，表明显式3D记忆在通过约束、验证的操作而非自由形式的工具调用访问时最为有用。

English abstract

Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle: VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose Reasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6-18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.


2606.00957 2026-06-02 cs.CV version update

Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers

Yiming Zhao

Affiliations: Yiming Zhao

AI summary: For the Wan2.1-T2V-14B model, proposes a W8A8 HiF8 post-training quantization method with a boundary-protection strategy that keeps the first and last boundary blocks in BF16 and quantizes the middle blocks, matching or marginally exceeding the BF16 baseline on five VBench dimensions.

Comments: 6 pages, 5 figures. Accepted to ICME 2026 Grand Challenge

AI Chinese abstract

我们提出了一种针对Wan2.1-T2V-14B（一个140亿参数文生视频扩散Transformer）的后训练量化方法，目标是在Ascend 910B NPU上实现W8A8 HiFloat8（HiF8）格式。量化视频DiT模型的一个核心挑战是跨Transformer块的异构激活分布：边界块（前几个和后几个块）表现出与中间块根本不同的统计特性，使得均匀量化无效。我们对所有40个WanAttentionBlock进行了系统的逐块激活分析，并利用这些发现提出了一种边界保护策略，该策略保留前两个和后三个块为BF16，同时用W8A8 HiF8量化剩余的35个块。所提出的PTQ方法在评估的所有五个VBench维度上匹配或略优于BF16基线，表明在5提示评估集内没有可测量的精度损失。对四种保护配置的消融研究证实，完全边界保护产生最高的平均VBench分数，验证了数据驱动的块选择。我们还研究了量化感知训练作为补充微调阶段，并分析了在单卡硬件上它无法优于普通PTQ的条件。

English abstract

We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
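
The boundary-protection rule in this abstract reduces to a simple per-block precision plan. The Python sketch below only encodes the block counts given in the abstract (40 blocks, the first two and last three kept in BF16); the function name and the string labels are hypothetical, and no real quantization API is invoked.

```python
def precision_plan(num_blocks=40, protect_first=2, protect_last=3):
    """Return one precision label per transformer block, protecting the boundary blocks."""
    plan = []
    for idx in range(num_blocks):
        protected = idx < protect_first or idx >= num_blocks - protect_last
        plan.append("bf16" if protected else "w8a8_hif8")
    return plan


plan = precision_plan()
quantized_blocks = plan.count("w8a8_hif8")  # blocks that would be quantized
```

With the default arguments the plan leaves five boundary blocks in BF16 and marks the remaining 35 for quantization, matching the split described in the abstract.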


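2606.00954 2026-06-02 cs.CV version update

Reason, Retrieve, Re-rank: A Zero-Shot Reason-Aware Framework for Composed Video Retrieval

Ali Alavi

Affiliations: The Ohio State University

AI summary: Proposes R3-CoVR, a zero-shot pipeline that uses a multimodal large language model to reason about the post-edit state, a contrastive encoder for retrieval, and constraint-aware re-ranking, reaching 91.9% R@1 and 98.2% R@10 in the CVPR 2026 VidLLMs challenge.

AI Chinese abstract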

组合视频检索（CoVR）旨在通过对参考视频应用自由形式的文本修改来寻找目标视频。我们应对CVPR 2026 VidLLMs研讨会上的推理感知CoVR（CoVR-R）挑战，其中检索严格为零样本。我们提出R3-CoVR（推理、检索、重排序），一个完全由冻结基础模型构建的无训练管道。多模态大语言模型（Qwen3-VL-8B）推理编辑所隐含的“后效”——状态转换、动作阶段、场景、镜头和节奏——并生成简洁的编辑后描述；对比视频-文本编码器（SigLIP-2）对该描述和图库进行嵌入以进行第一阶段检索；最后，一个约束感知重排序阶段使用相同的多模态模型作为评判者，对每个候选视频针对预期的编辑结果进行评分。在挑战测试集上，R3-CoVR达到了91.9%的R@1和98.2%的R@10。两个发现推动了这些结果：（i）将描述长度匹配到对比编码器的文本窗口使R@1从67.5提升到72.7；（ii）仅对候选列表进行重排序的约束感知重排序器将R@1从72.7提升到91.9——这是最大的单一增益。我们分析了重排序器的行为、检索/重排序混合以及候选列表深度，并发布了一个干净的三层实现。

English abstract

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies (state transitions, action phases, scene, camera, and tempo) and verbalises a concise post-edit description; a contrastive video-text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally, a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder's text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9, the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
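
The retrieve-then-re-rank structure described here can be sketched in Python as follows. Everything concrete in the sketch is an assumption made for illustration: the embeddings are plain lists standing in for encoder outputs, `judge` is a hypothetical callable standing in for the multimodal model's score, and the retrieve/re-rank blend mentioned in the abstract is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_then_rerank(query_vec, gallery, judge, shortlist_k=10):
    """Rank the gallery by similarity, then reorder only the top-k shortlist.

    gallery: dict mapping video id -> embedding.
    judge: callable mapping video id -> score (higher is better).
    """
    ranked = sorted(gallery, key=lambda vid: cosine(query_vec, gallery[vid]), reverse=True)
    shortlist = ranked[:shortlist_k]
    reranked = sorted(shortlist, key=judge, reverse=True)
    # Candidates outside the shortlist keep their first-stage order.
    return reranked + ranked[shortlist_k:]


gallery = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
judge_scores = {"a": 0.2, "b": 0.9, "c": 1.0}  # hypothetical judge outputs
order = retrieve_then_rerank([1.0, 0.0], gallery, judge_scores.get, shortlist_k=2)
```

In this toy run the judge promotes "b" above "a" inside the shortlist, while "c" stays last even though its judge score is highest, because it never entered the shortlist. That is the sense in which the re-ranker reorders only shortlisted candidates.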


2606.00906 2026-06-02 cs.CV version update

hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging

Athanasios Angelakis

Affiliations: BioML Lab, Research Institute CODE, UniBw, Munich, Germany; Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands

AI summary: Proposes hZACH-ViT, which extends the latent space of ZACH-ViT to hyperbolic or spherical geometry to improve compact Vision Transformers in low-data medical imaging, with an average gain of +0.021 on MedMNIST datasets.

Comments: 17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at https://github.com/Bluesman79/hZACH-ViT upon publication

AI Chinese abstract

紧凑视觉Transformer在低数据和资源受限的医学成像场景中具有吸引力，但大多数现有变体假设欧几里得潜在几何足以组织图像表示。我们引入了hZACH-ViT，这是ZACH-ViT的曲率几何扩展家族，ZACH-ViT是一种紧凑的零令牌视觉Transformer，它去除了位置嵌入和类别令牌，并依赖于补丁表示的全局平均池化。为了隔离几何的作用，我们保留了经过验证的ZACH-ViT骨干网络，仅修改了最终表示空间和基于原型的分类器头部，从而实现了欧几里得、双曲和球形潜在几何之间的受控比较。我们在七个MedMNIST数据集上评估了庞加莱、克莱因和球形hZACH-ViT头部，采用相同的少样本协议，每个类别50个样本和五个随机种子。完整的基准测试包含770次训练运行，涵盖七个数据集、三种非欧几里得几何、七个曲率幅度以及一个欧几里得基线。在所有七个数据集中，最佳非欧几里得hZACH-ViT配置优于欧几里得ZACH-ViT，在数据集特定的主要指标上平均提升+0.021，在OCTMNIST上提升最大（+0.055 MacroF1）。固定的低曲率配置在大多数数据集上保持正向增益，低曲率值（c = 0.1或0.2）占据了七个数据集级别优胜者中的六个。我们的结果并未确定一个普遍最优的流形，而是将几何和曲率确立为数据集依赖的模型选择变量，固定的低曲率分析证实了增益在详尽的逐数据集调优之外仍然存在。

English abstract

Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning.
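
For readers unfamiliar with curved prototype heads, the following Python sketch shows nearest-prototype classification under the standard Poincaré-ball distance with curvature magnitude `c`. How hZACH-ViT maps features into the ball and parameterizes its prototypes is not specified in the abstract, so the prototype values, class labels, and function names here are hypothetical. The last lines show one decomposition of the 770 runs that is consistent with the counts listed in the abstract.

```python
import math


def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in v)


def poincare_distance(x, y, c=0.1):
    """Geodesic distance on the Poincare ball of curvature -c (needs c * |x|^2 < 1)."""
    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)


def classify(z, prototypes, c=0.1):
    """Assign z to the label of the nearest prototype under the hyperbolic distance."""
    return min(prototypes, key=lambda label: poincare_distance(z, prototypes[label], c))


prototypes = {"benign": [0.5, 0.0], "malignant": [-0.5, 0.0]}  # hypothetical classes
label = classify([0.3, 0.1], prototypes, c=0.1)

# One decomposition consistent with the abstract's counts:
# 7 datasets x (3 geometries x 7 curvatures + 1 Euclidean baseline) x 5 seeds.
total_runs = 7 * (3 * 7 + 1) * 5
```

As `c` shrinks toward zero, this distance approaches twice the Euclidean distance, which is one way to see why low-curvature heads behave much like the Euclidean baseline.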


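2606.00891 2026-06-02 cs.CV version update

MMDG-Bench: A Benchmark for Multimodal Domain Generalization

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

Affiliations: University of Manchester; Jiyue AI; Samsung AI Centre Cambridge; University of Surrey

AI summary: Proposes the MMDG-Bench benchmark, which unifies multimodal learning and domain generalization through two frameworks, D2M and M2D, shows on tasks such as action recognition and face anti-spoofing that the structured combinations frequently outperform existing methods, and offers key design guidelines.

AI Chinese abstract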

多模态领域泛化（MMDG）旨在利用互补模态增强模型在未见领域上的鲁棒性。尽管多模态学习（MML）和领域泛化（DG）作为独立领域取得了广泛进展，但它们的系统集成仍未被充分探索。当前的MMDG研究主要局限于动作识别，且缺乏标准化的评估协议。为此，我们引入了MMDG-Bench，一个全面的基准，包含两个基础框架：先DG后MML（D2M）和先MML后DG（M2D）。我们在多种任务上提供了统一的实验协议，包括视频-音频-光流动作识别和RGB-深度-红外人脸活体检测。通过将统一的MML配置与五种DG技术配对，在D2M和M2D两种顺序下实例化十个MMDG基线，我们证明这些结构化组合通常优于现有最先进方法，强调了统一基准工作的必要性。我们的分析得出三个关键见解：（1）集成DG技术在各种骨干网络上提供一致的泛化增益，而非DG方法对骨干网络变化高度敏感；（2）最优框架选择取决于模态间稳定性：当模态关系在领域间稳定时D2M表现更好，而M2D对跨领域关系变化更鲁棒；（3）更强的骨干网络在集成到我们的结构化框架中时会产生放大的性能收益。MMDG-Bench为未来多模态鲁棒性研究提供了原则性基础和可操作的设计指南。代码已发布在 https://github.com/qszhan/MMDG-Bench。

English abstract

Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.


2606.00890 2026-06-02 cs.CV version update

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

Vinay Edula, Nilesh Badwe, Priyanka Bagade

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Kanpur; Department of Materials Science and Engineering, Indian Institute of Technology Kanpur

AI summary: Proposes RefDiffNet, a lightweight plug-and-play input enhancement module that uses a defect-free reference image to highlight defective regions, improving downstream detectors on PCB defect detection.

AI Chinese abstract

印刷电路板（PCB）缺陷检测具有挑战性，因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测，忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中，我们提出了RefDiffNet，一种轻量级即插即用的输入增强模块，放置在检测器主干之前，用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代，利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像，捕获相对于参考图像的结构变化，并使用轻量级编码器输出缺陷区域被突出的原始图像，从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明，RefDiffNet在各类检测器上一致地提升了性能，包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益，且开销可忽略，仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs，最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块，以最小的计算成本显著提升PCB缺陷检测性能。

English abstract

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.
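
The reference-comparison idea can be illustrated with a very small Python sketch. It is not RefDiffNet itself: the learned lightweight encoder is replaced by a fixed blend of the image with its absolute difference from the reference, images are nested lists of grayscale values in [0, 1], and the function name and `gain` parameter are hypothetical.

```python
def highlight_defects(image, reference, gain=1.0):
    """Brighten pixels where the inspected image departs from the aligned reference.

    The fixed blend here is a stand-in for the learned lightweight encoder
    described in the abstract; output values are clipped to 1.0.
    """
    enhanced = []
    for img_row, ref_row in zip(image, reference):
        enhanced.append(
            [min(1.0, px + gain * abs(px - ref)) for px, ref in zip(img_row, ref_row)]
        )
    return enhanced


reference = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
inspected = [[0.2, 0.2, 0.2], [0.2, 0.5, 0.2]]  # one pixel deviates from the reference
enhanced = highlight_defects(inspected, reference, gain=1.0)
```

Pixels that match the reference pass through unchanged, while the deviating pixel is amplified, so the enhanced image can be handed to any downstream detector without changing the detector.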


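2606.00844 2026-06-02 cs.CV cs.AI cs.LG version update

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

Vinay Edula, Priyanka Bagade

Affiliations: Indian Institute of Technology Kanpur

AI summary: Proposes the MoEIoU loss, which jointly optimizes overlap, center alignment, and aspect ratio through a mixture-of-experts formulation with a curriculum-based weighting schedule, and outperforms existing IoU losses across multiple datasets and YOLO architectures.

AI Chinese abstract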

边界框回归是目标检测的基本组成部分，在精确目标定位中起着关键作用。现有的基于交并比（IoU）的损失函数通过引入几何惩罚项（如中心距离和长宽比不匹配）来扩展IoU目标，以改进边界框回归。然而，这些惩罚项通常在训练过程中保持不变，没有考虑优化动态：预测框在初始阶段表现出较大的中心距离和形状误差，而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性，我们引入了MoEIoU，一种基于混合专家的回归损失，它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件，该函数强调主要的定位误差，同时保持其他项的平滑贡献。此外，采用基于课程的权重调度，在早期训练阶段优先纠正框的位置和形状，在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失，表现出更快的收敛速度和更高的定位精度。我们进一步表明，这种自适应聚合改进了现有的基于IoU的损失，带来了一致的增益，并为目标检测框架中的边界框回归提供了更有效的优化指导。

English abstract

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts-based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluate the proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.
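
A log-sum-exp aggregation with a curriculum schedule can be sketched in Python as below. The abstract does not give the exact formula, so every concrete choice is an assumption of this sketch: the center and aspect-ratio penalties follow the familiar DIoU and CIoU definitions, the weights enter inside the log-sum-exp, the schedule is linear in training progress, and the temperature `tau` and all function names are hypothetical.

```python
import math


def iou_terms(pred, gt):
    """Return (overlap, center, aspect) penalties for boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    overlap = 1.0 - inter / union
    # Squared center distance, normalized by the enclosing box diagonal (DIoU-style).
    enc_w = max(px2, gx2) - min(px1, gx1)
    enc_h = max(py2, gy2) - min(py1, gy1)
    center_sq = (((px1 + px2) - (gx1 + gx2)) ** 2 + ((py1 + py2) - (gy1 + gy2)) ** 2) / 4.0
    center = center_sq / (enc_w ** 2 + enc_h ** 2)
    # Aspect-ratio mismatch (CIoU-style).
    angle_gap = math.atan((gx2 - gx1) / (gy2 - gy1)) - math.atan((px2 - px1) / (py2 - py1))
    aspect = (4.0 / math.pi ** 2) * angle_gap ** 2
    return overlap, center, aspect


def curriculum_weights(progress):
    """Hypothetical linear schedule: position and shape early, overlap late (progress in [0, 1])."""
    w_overlap = 0.2 + 0.6 * progress
    w_geom = (1.0 - w_overlap) / 2.0
    return w_overlap, w_geom, w_geom


def aggregate(terms, weights, tau=0.1):
    """Weighted log-sum-exp: tau * log(sum(w * exp(t / tau))), computed stably."""
    m = max(terms)
    return m + tau * math.log(sum(w * math.exp((t - m) / tau) for w, t in zip(weights, terms)))


terms = iou_terms((1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 2.0))
early_loss = aggregate(terms, curriculum_weights(0.0))
late_loss = aggregate(terms, curriculum_weights(1.0))
```

With weights that sum to one, the aggregated value lies between the weighted mean of the terms and their maximum, so the largest error dominates while the other terms still contribute to the result.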

2606.00829 2026-06-02 cs.CV 版本更新

The Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge

正确的推理策略即是一切：面向EgoCross挑战的近乎无需训练的领域感知推理

Leyi Wu, Yifan Zhao, Jinjie Zhang, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)（香港科技大学（广州））； HKUST（香港科技大学）； Knowin

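AI summary: For the source-limited track of the EgoCross challenge, where multimodal large language models struggle with egocentric video question answering under substantial domain shift, the authors propose a domain-wise inference strategy that designs separate input, prompting, and answer-mapping procedures for each of the four target domains, reaching 66.98% overall accuracy with almost no additional training.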

AI中文摘要

EgoCross评估多模态大语言模型在显著领域偏移下的自我中心视频问答，其中测试视频来自手术、工业装配、极限运动和动物佩戴相机，而非日常场景。在源受限赛道中，基础模型固定为Qwen3-VL-4B，而官方任务特定支持集仅包含20个训练样本。这一设置使得挑战更侧重于向受限模型暴露正确的视觉、时序和答案选择线索，而非模型规模。我们的关键观察是，冻结的基线模型并非完全无法处理这些罕见场景；相反，它往往缺乏合适的接口来将其现有的视觉-语言知识迁移到新任务格式。因此，我们采用领域感知推理策略，将四个目标领域分开处理，并根据每个领域的任务特点设计不同的输入、提示和答案映射流程。这些策略通过强调每个领域重要的线索，使罕见自我中心场景对VLM更具可解释性。最终系统几乎无需训练：手术和动物问题使用基础Qwen3-VL-4B模型回答，而极限运动和工业问题仅使用在提供的20个训练样本上训练两个epoch的官方SFT检查点。在最终评估中，这一简单策略达到了66.98%的整体准确率，表明精心设计的领域感知推理可以弥补基础模型能力的不足，并恢复基线模型中已存在的大部分能力。

英文摘要

EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain's task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.
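The per-domain model assignment described in the abstract can be pictured as a small dispatch table. This is only a sketch: the checkpoint labels are placeholders rather than real paths or model identifiers, and the function name is hypothetical. The domain-specific input, prompting, and answer-mapping procedures are not detailed in the abstract and are not modeled here.

```python
# Placeholder labels, not real file paths or model IDs.
BASE = "base Qwen3-VL-4B"
SFT = "official SFT checkpoint"

# Assignment as stated in the abstract: two domains use the frozen base
# model, the other two use the lightly fine-tuned checkpoint.
DOMAIN_TO_CHECKPOINT = {
    "surgery": BASE,
    "animal": BASE,
    "xsports": SFT,
    "industry": SFT,
}

def select_checkpoint(domain):
    """Return the checkpoint label for a target domain (case-insensitive)."""
    key = domain.strip().lower()
    if key not in DOMAIN_TO_CHECKPOINT:
        raise KeyError(f"unknown domain: {domain!r}")
    return DOMAIN_TO_CHECKPOINT[key]
```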

2606.00828 2026-06-02 cs.CV 版本更新

RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes

RoboStressBench: 在具身场景中基准测试VLM对物理视觉压力的鲁棒性

Leyi Wu, Yifan Zhao, Jinjie Zhang, Suzeyu Chen, Wosong Chen, Zhifei Chen, Tianshuo Xu, Qingchun He, Hongxin Hu, Haojian Huang, Yangkai Wei, Wenqian Li, Yinchuan Li, Ying-Cong Chen

发表机构 * HKUST(GZ)（香港科技大学(广州)）

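AI summary: Introduces RoboStressBench, which takes an inverse graphics perspective to decompose visual stress into four physical dimensions (material, viewpoint, lighting, and geometry), systematically evaluates VLM robustness under realistic physical stress, and adds a stress-aware agentic solver that improves performance in high-stress scenarios.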

AI中文摘要

视觉语言模型（VLM）展现出强大的视觉理解能力，并越来越多地部署在具身AI系统中，这些系统需要在真实条件下进行可靠的感知。然而，现有的基准测试使用干净图像或孤立扰动来评估VLM，而非由物理场景形成引起的压力。这种设计有两个局限性：它仅覆盖了日常视觉压力的一小部分子集，并且某些扰动在现实具身场景中很少出现。这一差距引发了一个基本问题：我们如何以一种原则性的方式定义视觉压力，以捕捉物理环境中遇到的各种因素？为了解决这个问题，我们从逆图形学角度构建视觉感知，并引入RoboStressBench，这是一个用于评估VLM在具身场景中对物理视觉压力鲁棒性的基准测试。受物理渲染方程的启发，RoboStressBench将视觉压力分解为四个物理基础维度：材质（M）、视角（V）、光照（L）和几何（G）。这种设计使RoboStressBench能够覆盖现实世界环境中的广泛视觉压力，同时允许对其在VLM能力（如视觉识别、推理和规划）上的影响进行受控分析。通过对最先进的VLM进行全面评估，我们识别出特定于压力的失败模式，并揭示了不同的物理因素会降低不同的具身能力，而这些往往被总体准确率所掩盖。我们进一步引入了一种压力感知的智能求解器，它在推理前检测视觉压力源并调用视觉编辑技能，从而提高了高压力场景下的鲁棒性。总体而言，RoboStressBench提供了一个原则性的评估框架，用于诊断和改进VLM在真实物理压力下的感知能力，支持开发更可靠的具身AI系统。

英文摘要

Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.

2606.00825 2026-06-02 cs.CV cs.ET cs.HC cs.MA 版本更新

SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory

SuperMemory-VQA：面向长期记忆的自我中心视觉问答基准

Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, James Fort, Richard Newcombe, Hyo Jin Kim, Mi Zhang

发表机构 * The Ohio State University（俄亥俄州立大学）； Meta Project（Meta项目）

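AI summary: Introduces SuperMemory-VQA, a dataset of 52.9 hours of everyday activities recorded with AI glasses and 4,853 multiple-choice question-answer pairs, for evaluating AI assistants on long-horizon memory tasks; existing systems are found to be far from reliable.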

Comments 34 pages, 21 figures, 5 tables

AI中文摘要

AI眼镜为AI代理作为个性化记忆助手提供了有吸引力的平台。要真正有用，此类系统必须超越短期视频理解，解决人类在纵向自我中心视频流中因实际、个人或社交目的而经历的记忆缺口。然而，现有的自我中心数据集主要关注动作识别或来自短片的通用问答，衡量的是感知能力而非现实的人类记忆需求。我们引入了SuperMemory-VQA，一个用于评估AI助手在实际长期记忆任务上的自我中心视觉问答（VQA）数据集。它包含52.9小时用AI眼镜记录的日常活动，包括同步的RGB视频、音频转录、眼动追踪、IMU和SLAM轨迹。通过人工验证的标注流程，我们构建了4,853个有依据的问答对，涵盖物体和位置记忆、意图回忆、视觉场景回忆、时间线重建、对话记忆和上下文检索。每个问题以多项选择形式提出，并包含明确的“不可回答”选项以测试幻觉鲁棒性。对领先的代理框架和LLM骨干的基准测试表明，现有系统在现实世界记忆任务上仍远不可靠，凸显了需要新的架构来实现有依据的AI记忆，使其仅在证据充分时才能回答。参与者调查进一步支持我们的问题具有现实性、实用性，并与日常记忆需求一致。

英文摘要

AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.

2606.00817 2026-06-02 cs.GR cs.CV 版本更新

DINO-GFSA：基于语义门控融合和Mamba序列聚合的地理定位

Beier Hu, Yuanshen Guo, Jialu Cai, Chengwei Li, Yong Wang, Shunan Wu, Zhigang Wu

发表机构 * School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen, China（中山大学航空航天学院，深圳，中国）

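AI summary: Proposes DINO-GFSA, a framework that combines a LoRA-adapted DINOv3 backbone, a Semantic Gated Residual Fusion module, and a Mamba-based Sequential Aggregation Head to reach state-of-the-art performance in UAV cross-view geo-localization.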

AI中文摘要

跨视角地理定位（CVGL）对于无人机在无GNSS环境下的自定位和目标定位至关重要。然而，在保留细粒度空间细节的同时获取鲁棒语义仍然具有挑战性。为此，我们提出DINO-GFSA框架，利用LoRA（低秩适配）适配的DINOv3（ViTL）骨干网络实现参数高效、高容量的表示。关键地，我们引入了语义门控残差融合模块，利用高层语义选择性校准和整合低层空间线索，有效弥合语义鸿沟。此外，设计了基于Mamba的序列聚合头，以线性复杂度捕获长距离空间依赖。实验表明，在University-1652和DenseUAV基准上取得了最先进性能，特别是在DenseUAV上Recall@1比之前最佳方法高出3.48%。这些结果验证了DINO-GFSA作为无人机CVGL通用鲁棒解决方案的有效性。

英文摘要

Cross-view geo-localization (CVGL) is critical for Unmanned Aerial Vehicle (UAV) self-positioning and target localization in GNSS-denied environments. However, acquiring robust semantics while preserving fine-grained spatial details remains challenging. To address this, we propose DINO-GFSA, a framework leveraging a DINOv3 (ViT-L) backbone adapted with LoRA (Low-Rank Adaptation) for parameter-efficient, high-capacity representation. Crucially, we introduce a Semantic Gated Residual Fusion module, which utilizes high-level semantics to selectively calibrate and integrate low-level spatial cues, effectively bridging the semantic gap. Furthermore, a Mamba-based Sequential Aggregation Head is designed to capture long-range spatial dependencies with linear complexity. Experiments demonstrate state-of-the-art performance on the University-1652 and DenseUAV benchmarks, notably surpassing the previous best on DenseUAV by 3.48% on Recall@1. These results validate DINO-GFSA as a generalized, robust solution for UAV CVGL.

2606.00782 2026-06-02 cs.CV 版本更新

将并行序列模型扩展到基础规模的视觉编码器

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

发表机构 * NVIDIA ； The Chinese University of Hong Kong（香港中文大学）； The University of Hong Kong（香港大学）； University of California, San Diego（加州大学圣地亚哥分校）

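AI summary: Proposes C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation; a fast CUDA kernel, a compressed latent-space propagation block, and two-stage cross-operator distillation let it match a ViT baseline with 15% fewer parameters while enabling efficient inference.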

AI中文摘要

视觉基础模型受限于自注意力的二次成本，这限制了可用分辨率并增加了大规模预训练的成本。次二次替代方案如线性注意力和状态空间模型降低了这一成本，但通常将图像序列化为1D令牌流，削弱了对视觉重要的2D空间结构。广义空间传播网络（GSPN）通过线扫描递归直接在2D网格上传播上下文，实现了接近线性的复杂度且无需位置嵌入，但很少用作基础规模的编码器。我们提出C-GSPN，一种基于2D空间传播的基础规模视觉编码器。C-GSPN通过三项改进使该算子实用化：（1）一个快速的GSPN CUDA内核，将每步启动融合为单个warp专用实现，采用共享内存分块、合并访问和紧凑的多通道传播，达到峰值内存带宽的90%以上，运行速度比原始GSPN实现快40-52倍；（2）一个带有融合归一化的压缩潜在空间传播块，将内核级速度转化为块级和模型级效率；（3）一个两阶段交叉算子蒸馏方案，从注意力教师训练新架构，无需从头开始进行基础规模训练的成本。使用6亿图像-文本对进行蒸馏，C-GSPN以少15%的参数匹配同构ViT基线，在ADE20K分割上提升+2.1%，以极少的数据迁移到高分辨率，并在2K分辨率下通过单次无分块推理实现4倍的端到端块加速。

英文摘要

Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40-52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference.

2606.00738 2026-06-02 cs.LG cs.AI cs.CV 版本更新

SORA: Free Second-Order Attacks in Fast Adversarial Training

SORA：快速对抗训练中的自由二阶攻击

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering, Sharif University of Technology, Tehran, Iran（谢里夫理工大学计算机工程系，德黑兰，伊朗）

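AI summary: Targets catastrophic overfitting in fast adversarial training: perturbation variability is used to mitigate it, the gradient-alignment metric PertAlign predicts its onset, and the adaptive step-size method SORA achieves state-of-the-art robustness and clean accuracy.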

Comments Accepted at ICML 2026

AI中文摘要

对抗训练是对抗性样本的主要防御手段，但在高效的单步变体中常常遭受灾难性过拟合，即尽管单步性能很高，但对多步攻击的鲁棒性却崩溃。我们通过两个贡献来解决这种失效模式。首先，我们形式化了epsilon过拟合（EO），这是一种固定扰动幅度和方向加剧CO的视角，并表明引入扰动变异性可以显著提高不同架构和数据集上的鲁棒泛化能力。其次，我们提出了PertAlign（扰动对齐），这是一种理论上合理、计算开销可忽略的指标，通过测量攻击阶段的梯度对齐来预测CO的发生。利用这些见解，我们引入了SORA，一种自适应步长的AT方法，它根据损失曲面几何动态调整扰动。SORA始终能防止CO，实现最先进的鲁棒性和干净准确率，并使用一组固定的超参数在数据集和架构上泛化，这对于快速AT的适用性至关重要。在不同数据集和架构上的大量实验表明，SORA在提供更高干净准确率和卓越效率的同时，匹配或超越了先前方法的鲁棒性。代码可在https://github.com/SecondOrderAT/SORA获取。

英文摘要

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.
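The abstract says PertAlign measures gradient alignment across attack stages but does not define it. A common way to quantify alignment between two gradients is cosine similarity, so the sketch below uses that as a stand-in. It is an assumption for illustration, not the paper's metric, and the function name is hypothetical.

```python
import math

def cosine_alignment(g1, g2, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors.

    Values near 1 mean the two gradients point the same way; values near 0
    or below mean they disagree. eps guards against zero-norm gradients.
    """
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2 + eps)

# Example: input gradients taken at two different stages of an attack.
grad_stage_a = [0.5, -0.2, 0.1]
grad_stage_b = [0.4, -0.1, 0.3]
alignment = cosine_alignment(grad_stage_a, grad_stage_b)
```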

2606.00712 2026-06-02 cs.CV 版本更新

CASTLE2026 Team WDL Technical Report

CASTLE2026 团队 WDL 技术报告

Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

发表机构 * Key Laboratory of Intelligent Perception and Image Understanding（智能感知与图像理解重点实验室）

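AI summary: Proposes a Qwen-based, evidence-aware multimodal reasoning pipeline that tackles long-video question answering with prompt routing and confidence-weighted voting, ranking first in the CASTLE challenge.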

Comments 4 pages
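Confidence-weighted voting, in its generic form, sums the confidence each candidate answer receives and picks the answer with the highest total. The sketch below shows that generic form only; how the report obtains its confidences or routes prompts is not described in this entry, and the function name is hypothetical.

```python
from collections import defaultdict

def confidence_weighted_vote(predictions):
    """Pick the answer with the largest summed confidence.

    predictions: iterable of (answer, confidence) pairs, for example one
    pair per prompt variant or per model run.
    """
    scores = defaultdict(float)
    for answer, confidence in predictions:
        scores[answer] += confidence
    return max(scores, key=scores.get)

# A single confident vote can outweigh two weak votes for another answer.
runs = [("A", 0.4), ("B", 0.9), ("A", 0.3)]
winner = confidence_weighted_vote(runs)
```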

2606.00704 2026-06-02 cs.CV 版本更新

VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

VICR: 面向真实图像超分辨率的视觉上下文恢复

Qichang Zhang, Hailong Wang, Baiang Li, Linhao Wang, Rong Fu, Erkang Cheng, Simon James Fong

发表机构 * Faculty of Science and Technology, University of Macau（澳门大学科技学院）； Nullmax ； Hefei University of Technology（合肥工业大学）； Shandong Normal University（山东师范大学）

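AI summary: Proposes a Diffusion Transformer-based visual in-context restoration framework that casts real-world image super-resolution as image completion and uses a decoupled visual prior injection mechanism to balance structural fidelity with detail synthesis.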

Comments 28 pages, 11 figures, 9 tables

AI中文摘要

真实世界图像超分辨率（Real-ISR）需要在结构保真度（对退化观测）与逼真细节合成之间取得平衡。然而，现有的生成式Real-ISR方法通常依赖于纠缠的条件机制，导致结构漂移或语义不一致的细节。为了解决这个问题，我们提出了视觉上下文恢复（VICR），一种基于扩散变换器（DiT）的框架，将Real-ISR表述为图像补全。具体来说，我们引入了一种解耦的视觉先验注入机制，从低质量（LQ）图像中提取局部和全局线索：局部线索有助于恢复图像结构并支持高频细节合成，而全局线索指导整体生成并促进语义一致性。对于严重退化下的模糊区域，VICR采用推理时代理，利用LQ输入的视觉证据优化语义提示，同时保持模型参数固定。实验表明，VICR仅用127M可训练参数就在多个Real-ISR基准上实现了最先进的性能。

英文摘要

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.

2606.00694 2026-06-02 cs.CV 版本更新

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation

FROST-STA: 用于Ego4D短期物体交互预测的冻结密集特征

Chaoyang Wang, Lexuan Xu

发表机构 * Beihang University（北京航空航天大学）

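AI summary: Proposes FROST-STA, which uses frozen dense image-video features and object-centric decoding to place second in the Ego4D Short-Term Object Interaction Anticipation challenge.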

AI中文摘要

第一人称视频中的短期预测需要超越对当前场景的识别：系统必须推断摄像头佩戴者将接触哪个物体、将执行什么动作以及接触将在多久后发生。本报告描述了FROST-STA，我们提交至EgoVis 2026 Ego4D短期物体交互预测（STA）挑战的方案。对于每个查询时间，模型输出一组排序的结构化假设，包含主动物体框、名词标签、动词标签、接触时间（TTC）和置信度。FROST-STA基于V-JEPA 2.1 STA评估协议，但通过使用对象中心解码、多头预测以及面向提交的训练和集成方案，使其适应挑战。我们固定V-JEPA 2.1 ViT-G骨干网络，提取两个密集token流：来自查询前缩放至384像素的短视频片段的视频token，以及来自最后观察到的最高分辨率帧的图像token。一个紧凑的对齐模块，由注意力探针和帧引导的时间池化组成，将片段表示映射到最后一帧的空间参考上，然后与图像特征融合。融合后的特征图由Faster R-CNN风格的STA头解码，估计框偏移、名词、动词、TTC值和交互质量。对于最终排行榜提交，我们使用官方训练集加上额外允许的验证标注训练25个epoch，并组合来自8个头和epoch 15-25的检查点的预测。FROST-STA在官方测试服务器上获得5.13总体Top-5 mAP，在挑战中排名第二，表明冻结的密集图像-视频特征可以作为物体级交互预测的坚实基础。

英文摘要

Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.

2606.00689 2026-06-02 cs.CV 版本更新

Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning

小波融合扩散模型用于多模态脑MRI合成，具有模态和元数据条件

Muhammad Nabi Yasinzai, Remika Mito, Mangor Pedersen

发表机构 * Department of Psychology & Neuroscience, Auckland University of Technology（奥克兰理工大学心理学与神经科学系）； Department of Psychiatry, University of Melbourne（墨尔本大学精神病学系）

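AI summary: Proposes the Wavelet-Fusion Diffusion Model (WFDM), which pairs a Wavelet-Fusion variational autoencoder (WF-VAE) with a conditional 3D U-Net diffusion model and uses explicit modality and metadata conditioning for multimodal brain MRI synthesis; it targets uneven modality coverage and heterogeneity across datasets and achieves the strongest distributional alignment among the evaluated generators.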

Comments 51 pages, 7 figures, including supplementary material. Submitted to Imaging Neuroscience

AI中文摘要

多模态MRI为神经影像分析提供互补信息，不同成像模态捕获独特的解剖、组织和病理特征，支持下游AI应用的开发和评估。尽管大规模结构MRI资源日益可用，但公共和汇集神经影像数据集的模态覆盖往往不均匀。这种不均匀的模态覆盖因站点、扫描仪和采集协议之间的异质性，以及跨研究通常稀疏、不一致记录或不可用的人口统计学和临床变量而进一步复杂化。合成MRI生成可以通过合成目标模态体积用于数据集增强和受控合成队列创建，帮助解决这种不平衡。然而，许多现有的MRI合成方法在狭窄的模态集或相对同质的队列上训练，限制了它们对大型汇集神经影像资源的适用性，其中模态可用性、采集协议和元数据覆盖在不同数据集之间差异很大。扩散模型因其强大的样本保真度和多样性而成为MRI合成的一种有吸引力的方法，但直接在3D体素空间采样在推理时计算昂贵且缓慢。潜在扩散通过在学习的3D潜在空间中合成MRI提高了实用性，尽管生成质量取决于自编码器的重建保真度和由此产生的潜在分布。我们的方法将小波融合变分自编码器（WF-VAE）潜在压缩器与在学习的潜在空间中训练的、使用显式模态和元数据条件的条件3D U-Net扩散模型相结合。我们提出的Wavelet-Fusion Diffusion Model (WFDM) 在评估的合成MRI生成器中实现了最强的分布对齐。

英文摘要

Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder's reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.

2606.00688 2026-06-02 cs.CV 版本更新

Shape-Prior-Based Point Cloud Completion for Single-Stage Fully Sparse 3D Object Detection

基于形状先验的点云补全用于单阶段全稀疏3D目标检测

Kaizheng Wang, Mingqian Ji, Jian Yang, Shanshan Zhang

发表机构 * School of Computer Science and Engineering, Nanjing University of Science and Technology（南京理工大学计算机科学与工程学院）

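AI summary: To counter the sparsity and incompleteness of point clouds in single-stage fully sparse 3D detectors, proposes a shape-prior-based point cloud completion method whose Instance Selection and Alignment-Based Point Completion modules significantly improve detection performance.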

AI中文摘要

单阶段全稀疏3D目标检测器依赖点云数据在自动驾驶场景中检测目标。然而，点云的稀疏性和不完整性严重限制了3D目标检测的性能。为解决此问题，本文提出一种专门针对单阶段全稀疏检测器的点云补全方法。整个基于形状先验的补全过程由两个连续步骤组成。第一步，我们设计了一个新颖的实例选择模块，即使在基线模型未生成提议的情况下，也能识别对应前景目标的点云，同时有效忽略背景区域的点云。第二步，我们引入了一种新颖的基于对齐的点补全模块，该模块将前景目标的点云在中心和朝向上与原型对齐。随后，从原型中选择点来填充前景目标的缺失部分。我们在KITTI数据集上使用两个单阶段全稀疏检测器评估了我们的方法。实验结果表明，所提方法显著提升了检测性能，证实了其有效性和泛化能力。

英文摘要

Single-stage fully sparse 3D object detectors rely on point cloud data to detect objects in autonomous driving scenarios. However, the sparsity and incompleteness of point clouds significantly limit the performance of 3D object detection. To address this issue, this paper proposes a point cloud completion method specifically designed for single-stage fully sparse detectors. The entire shape-prior-based completion process consists of two consecutive steps. In the first step, we design a novel Instance Selection module, which is capable of identifying point clouds corresponding to foreground objects even when the baseline model does not generate proposals, while effectively ignoring the point clouds of background regions. In the second step, we introduce a novel Alignment-Based Point Completion module, which aligns the point clouds of foreground objects with prototypes in terms of both their centers and orientations. Subsequently, points are selected from the prototype to fill in the missing parts of the foreground object. We evaluated our method on two single-stage fully sparse detectors using the KITTI dataset. The experimental results demonstrate that the proposed method significantly improves the detection performance, confirming its effectiveness and generalizability.

2606.00676 2026-06-02 cs.CV 版本更新

Wan2.2双专家视频扩散模型的协同少步蒸馏与低位量化

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu

发表机构 * IEEE ICME 2026 ； GCC Low-Bit-width Large Model Quantization Challenge（GCC 低精度大模型量化挑战）

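AI summary: For the Wan2.2-T2V-A14B video diffusion model, proposes a deployment-oriented compression pipeline that combines few-step distribution-matching distillation with low-bit quantization; separate calibration of the two expert denoising branches, protection of sensitive layers, and a HiF4-style low-bit representation preserve quality while reducing compute cost.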

AI中文摘要

大型视频扩散模型实现了强大的视觉质量，但由于每个样本需要大量去噪步骤和较大的驻留参数足迹，部署成本仍然很高。本文研究了一种面向部署的压缩流程，针对Wan2.2-T2V-A14B模型，结合少步分布匹配蒸馏与低位量化。该流程遵循模型的双专家去噪路线，分别校准高噪声和低噪声分支，保护敏感入口层，并使用HiF4风格的低位表示以改善动态范围覆盖。量化是在蒸馏后的少步学生模型上校准，而非原始的长步轨迹上，从而减少推理过程中的激活分布不匹配。所提出的协同设计使量化模型保持接近同步全精度模型，并在平均8步和20步时超越原始全精度基线。在测试配置中，20步设置提供了最佳的质量-效率权衡。

英文摘要

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.

2606.00640 2026-06-02 cs.CV 版本更新

An Attribute-Based Measure of Video Complexity

基于属性的视频复杂度度量

Aditya Sarkar, Yi Li, Zihao Wang, Jiacheng Cheng, Sai Vidyaranya Nuthalapati, Aashu Singh, Shlok Kumar Mishra, David Jacobs, Nuno Vasconcelos

发表机构 * UMIACS-University of Maryland College Park（马里兰大学College Park分校UMIACS）； University of California San Diego（加州大学圣地亚哥分校）； Yale University（耶鲁大学）； Meta AI

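AI summary: Proposes the VideoABC framework, a non-parametric complexity measure that quantizes an attribute space to estimate the probability that a video-LLM fails on a given video-question pair.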

AI中文摘要

提出了一种新的框架，用于估计视频-问题对给视频大语言模型带来的复杂度，即基于属性的视频复杂度（VideoABC）。视频复杂度定义为视频大语言模型在给定视频-问题对上的失败概率。VideoABC是一种非参数复杂度度量，使用参考视频数据集和预定义的视频属性词汇表（这些属性对复杂度有信息量，例如场景复杂度或与问题相关的视频事件速度）。在训练阶段，参考视频被投影到这些属性空间中，然后进行量化。计算每个量化单元的期望ABC。给定一个新视频及其在属性空间中的投影，通过关联量化单元的期望ABC来估计复杂度。为了能够使用小规模参考视频数据集，结合了两种量化器：k-means量化器（能对参考数据集分布内的样本进行准确复杂度估计）和通用格点量化器（保证对分布外样本的泛化）。受心理物理学研究中目标-干扰物操纵的启发，提出了一种合成视频生成程序，用于在训练期间填充格点量化器的单元，从而计算其期望ABC。实验结果表明，即使使用非常低维的属性表示，VideoABC也有效，其性能大大优于“视频大语言模型作为评判者”等方法，且复杂度更低。最后，VideoABC分数在定义良好的属性方面的可解释性，揭示了基准测试的属性组成如何影响其复杂度。

英文摘要

A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, e.g., the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABC is effective even with very low-dimensional attribute representations, substantially outperforming approaches like "video-LLM as judge" with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
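The quantize-then-average procedure in the abstract can be sketched with a uniform lattice alone. This is a simplified illustration under stated assumptions: attributes are taken to be numbers in [0, 1], the lattice is a plain cubic grid with a hypothetical `step`, and the k-means quantizer and synthetic-video population step from the paper are left out. All function names are hypothetical.

```python
from collections import defaultdict

def lattice_cell(attrs, step=0.25):
    """Map an attribute vector to its cell index on a uniform cubic lattice."""
    return tuple(int(a // step) for a in attrs)

def fit_cell_failure_rates(samples, step=0.25):
    """Estimate the failure probability of each occupied lattice cell.

    samples: iterable of (attribute_vector, failed) pairs, where failed is
    1 if the video-LLM answered the reference pair wrongly and 0 otherwise.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [failures, total]
    for attrs, failed in samples:
        cell_counts = counts[lattice_cell(attrs, step)]
        cell_counts[0] += int(failed)
        cell_counts[1] += 1
    return {cell: f / n for cell, (f, n) in counts.items()}

def estimate_complexity(attrs, rates, default=None, step=0.25):
    """Look up the expected failure rate of the cell a new sample falls in."""
    return rates.get(lattice_cell(attrs, step), default)

# Toy reference set with two attributes per video-question pair.
reference = [([0.1, 0.2], 1), ([0.2, 0.1], 0), ([0.8, 0.9], 1)]
rates = fit_cell_failure_rates(reference)
score = estimate_complexity([0.15, 0.05], rates)
```

Cells that no reference sample reaches return `default`, which is the gap the abstract's synthetic video generation procedure is meant to fill.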

2606.00630 2026-06-02 cs.CV stat.ML 版本更新

FiSeR：用于跨域AI图像检测的细粒度源表示

Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma

发表机构 * Shan Zhang, Yongxin He, Mingming Zhang, Huiwen Tian, Lei Ma（作者团队）

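AI summary: To address the poor generalization of synthetic image detectors under domain shift, proposes FiSeR, a hierarchical contrastive learning framework that jointly optimizes coarse-grained and fine-grained contrastive objectives and raises average AUROC by +10.22 in cross-domain evaluation.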

AI中文摘要

现实世界的合成图像检测器在域内表现强劲，但在域迁移下通常泛化能力差。通过无监督UMAP投影，我们发现自然和合成特征在未见数据集上仍部分可分，但性能仍然下降，表明分类头过度拟合训练域伪影。因此，关键在于学习更具迁移性的表示，使决策标准对域迁移更稳定和鲁棒。基于合成图像由多种生成器生成的结构事实，我们提出一个层次对比学习框架，在保留生成器身份信息的同时提高自然和合成图像之间的可分离性。它联合优化（i）自然和合成图像之间的粗粒度对比目标和（ii）使用生成器身份的合成图像之间的细粒度对比目标。在WildFake上训练，我们的方法在跨域评估中，在与强基线DIRE相同的设置下，在Chameleon、AIGIBench、Community Forensics和GenImage上平均AUROC提升+10.22。对于少样本适应，我们冻结骨干网络，并在每类10个标记样本上拟合SVM头，在12个广泛使用的检测器上平均，AIGIBench的AUROC提升+10.64，Chameleon提升+17.41。我们的代码公开在：https://github.com/heyongxin233/FiSeR。

英文摘要

Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: https://github.com/heyongxin233/FiSeR.

2606.00602 2026-06-02 cs.CV 版本更新

ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training

ASAP: 基于解剖感知语义自适应预训练的医学体素表示学习

Rongsheng Wang, Fenghe Tang, Zihang Jiang, Yingtai Li, Xu Zhang, Haoran Lai, Wenxin Ma, Wei Wei, Zhiyang He, Xiaodong Tao, Rui Yan, Qingsong Yao, Shaohua Kevin Zhou

发表机构 * School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China（生物医学工程学院，生命科学与医学部，中国科学技术大学）； Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE) Lab, YRD-RIGHT, USTC Suzhou Institute for Advanced Research（医学影像、机器人、分析计算与学习（MIRACLE）实验室，YRD-RIGHT，中国科学技术大学苏州高等研究院）； Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology（江苏省多模态数字孪生技术重点实验室）； Biomedical Basic Research Center (BBRC) of Jiangsu Province（江苏省生物医学基础研究中心）； Department of Radiology, The First Affiliated Hospital of USTC, Division of Life Sciences and Medicine, USTC（放射科，中国科学技术大学第一附属医院，生命科学与医学部，中国科学技术大学）； Anhui IFLYTEK CO., Ltd（安徽科大讯飞股份有限公司）； School of Medicine, Stanford University（医学院，斯坦福大学）； State Key Laboratory of Precision and Intelligent Chemistry, Hefei, Anhui, China（精准智能化学全国重点实验室，合肥，安徽，中国）

AI总结提出ASAP框架，通过解剖感知知识注入、语义自适应对齐与融合，从胸部CT扫描和放射学报告中学习可迁移且可解释的体素表示，在15个数据集和22个下游任务上取得最先进性能。

Comments MICCAI 2025 extension

详情

AI中文摘要

从医学体素扫描中学习可迁移和可解释的表示仍然具有挑战性，因为存在复杂的解剖结构和放射学报告提供的弱、异质监督。在本文中，我们提出了解剖感知语义自适应预训练（ASAP），一个用于从大规模胸部CT扫描及其对应放射学报告中进行细粒度医学体素表示学习的原理性视觉-语言预训练框架。ASAP集成了三个关键组件：（1）解剖感知知识注入模块，通过现成的分割工具融入器官级结构先验，以促进解剖上一致的表示；（2）语义自适应选择性对齐机制，动态地将句子级别的发现与局部体素区域关联；（3）语义自适应融合模块，在双模态掩码建模范式下，实现解剖信息视觉特征与基于文本线索之间的有效交互。除了方法论贡献外，我们还为胸部CT上的医学体素视觉-语言预训练建立了一个全面的基准，涵盖15个数据集和22个下游任务，包括异常分类、分割、疾病预后预测、报告生成、词汇分类、跨模态检索和视觉问答。该基准提供了标准化的评估协议，以系统评估在不同临床设置和数据制度下的表示质量。大量实验表明，ASAP在跨任务和数据集上一致地实现了最先进的性能，在有限监督和分布偏移下尤其显著，验证了其在学习可迁移和临床有意义的体素表示方面的有效性。

英文摘要

Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via an off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under a dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.

2606.00592 2026-06-02 cs.CV 版本更新

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

通过PRISM：原则感知、可解释和多尺度的视觉设计评估

Mona Gandhi, KJ Joseph, Srinivasan Parthasarathy, Sayan Nag

发表机构 * Ohio State University（俄亥俄州立大学）； Adobe Research（Adobe研究院）

AI总结提出PRISM基准和一种多尺度评估框架，通过原则扰动和分层分析实现可解释的设计质量评估。

详情

Journal ref: CVPR 2026 Findings

AI中文摘要

有效的视觉传达源于多个设计原则的和谐，如可读性、对比度、对齐、重叠和连贯性，这些原则共同支配着传达者的清晰度和意图。虽然人类设计师会整体性地考虑这些原则，但机器智能体通常将它们压缩成一个单一的启发式分数，提供有限的可解释性和诊断精度。为了解决这一差距，我们引入了PRISM（原则感知、可解释和结构引导的设计修改），这是一个基准，它沿着可测量的设计原则系统地扰动Crello数据集中的专业布局。该基准包含10万个扰动训练样本和1万个扰动验证设计，每个样本隔离特定的原则违规，以进行关于设计质量的多模态推理的受控分析。我们表明，像Qwen-2.5-VL和GPT-4o-mini这样的模型对有针对性的原则退化在很大程度上不敏感，而GPT-4o表现出全局意识但缺乏细粒度的解耦。基于这些见解，我们提出了一个多尺度评估框架，该框架集成了用于定量评估的轻量级评分器、用于局部反馈的指令调优视觉语言模型以及用于全局推理的基于提示的方法。我们的框架提供了设计失败的可解释解释。利用这些局部见解，我们展示了改善布局质量的有针对性的改进。PRISM和我们的框架共同为可解释的、具有设计素养的多模态推理系统奠定了基础。

英文摘要

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Hanyang University（汉阳大学）

AI总结提出VRPO方法，通过强化学习将静态对齐损失替换为生成式表示策略优化目标，动态平衡表示一致性与生成质量，在扩散Transformer中实现更快的收敛和更高的图像保真度。

详情

AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力，但由于生成表示与判别表示之间的弱对齐，训练效率仍然较低。虽然表示对齐框架（如REPA）通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛，但其外部监督的对齐损失是静态的，在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标，无法动态平衡表示一致性和生成质量，导致判别收益有限，且无法以任务自适应方式优化对齐。为了解决这个问题，我们提出了VRPO，一种基于强化学习的优化策略，用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束，而是将表示对齐视为一个奖励引导的过程：模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示，朝向有语义意义的方向，同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中，引入可忽略的计算成本，并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明，我们的VRPO-Alignment显著提高了收敛速度和保真度，在相同计算预算下，与REPA相比，FID提升高达1.8，训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.
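The reward-guided formulation can be made concrete with the group-relative advantage used in GRPO-style training, where each sample's reward is normalized against the other samples in its group. The sketch below is illustrative only: the equal-weight `composite_reward` and the use of cosine similarity as the alignment term are assumptions, not details confirmed by the abstract.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between a diffusion feature and an encoder embedding."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def composite_reward(fidelity, perceptual, alignment, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three reward terms named in the abstract."""
    wf, wp, wa = weights
    return wf * fidelity + wp * perceptual + wa * alignment

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples whose reward is above the group mean receive a positive advantage and are reinforced; samples below the mean are penalized, so no separate value network is needed.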

2606.00579 2026-06-02 cs.CL cs.CV 版本更新

Sandboxed Coding Agents are Competitive Omni-modal Task Solvers

沙盒化编码智能体是具有竞争力的全模态任务求解器

Dongping Chen, Xuanao Huang, Zhihan Hu, Qingyuan Shi, Dianqi Li, Tianyi Zhou

发表机构 * University of Maryland（马里兰大学）； MBZUAI

AI总结本文提出沙盒化编码智能体，仅通过文本+图像访问和工具使用，即可在全模态任务中匹配甚至超越原生全模态模型，并通过技能注入和训练配方Code-X进一步提升性能。

Comments Paper under review

详情

AI中文摘要

随着多模态大语言模型越来越多地针对视频和音频，人们通常认为这类任务需要原生全模态模型。我们表明情况并非总是如此：仅具有文本+图像访问权限和沙盒化工具使用接口的编码智能体，可以在多个音频-视频基准测试中匹配，并在某些设置中超越最先进的原生全模态模型和预定义的多模态智能体框架。我们的轨迹分析表明，它们的优势来自于编写代码和编排工具，以从转录、帧和其他模态信号中提取相关证据，从而将全模态任务转化为检索和信息处理问题，而不是摄取整个媒体流。我们进一步通过失败分类和过程级轨迹分析来刻画它们的局限性，并表明简单的技能注入（包括人工编写和自蒸馏的技能）能显著提高性能。为了探索开源激发，我们引入了Code-X，一种包含OmniCoding轨迹数据集和可验证奖励的训练方案，并在Qwen-3.5-9B和Qwen-3.6-27B上提供了基线。最后，我们认为下一个前沿是多模态处理，并引入了TerminalBench-O，一个用于现实世界全模态处理任务的过程级基准。代码将在https://github.com/Dongping-Chen/OmniCoding提供。

英文摘要

As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at https://github.com/Dongping-Chen/OmniCoding.

2606.00571 2026-06-02 cs.LG cs.AI cs.CV 版本更新

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结针对元学习训练数据选择（MTS）在实践中表现不佳的问题，本文通过数学分析揭示了梯度信噪比低和缺乏信息特征两大障碍，并提出增大批大小和利用信息特征作为解决方案。

详情

AI中文摘要

合成数据越来越多地被用于训练神经网络，但若不加区分地使用，其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重，我们称之为元学习训练数据选择（MTS）。有趣的是，在实践中，MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍：梯度信噪比（GSNR）低导致优化困难，以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析，揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明，一个简单而有效的解决方案是增大批大小。此外，我们提出了一组信息特征，用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进，与无选择的训练相比平均提升 5.49%，与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and a lack of informative features that correlate with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.
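Why a larger batch helps can be seen from a short calculation. The sketch assumes the common definition of GSNR for a single parameter (squared mean of the per-sample gradient divided by its variance) and i.i.d. samples; the paper's own analysis may use a different or more refined definition.

```python
def gsnr(grad_mean, grad_variance):
    """Gradient signal-to-noise ratio: squared mean over variance."""
    return grad_mean ** 2 / grad_variance

def minibatch_gsnr(grad_mean, per_sample_variance, batch_size):
    """Averaging B i.i.d. per-sample gradients keeps the mean and divides
    the variance by B, so the GSNR of the batch gradient grows linearly in B."""
    return gsnr(grad_mean, per_sample_variance / batch_size)

if __name__ == "__main__":
    for b in (1, 16, 256):
        print(b, minibatch_gsnr(0.1, 1.0, b))
```

Under these assumptions, going from a batch of 16 to a batch of 256 multiplies the GSNR by 16, at the cost of proportionally more computation per step.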

2606.00564 2026-06-02 cs.CV cs.CL 版本更新

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

面向视觉-语言推理的分解式在策略蒸馏：引导梯度实现视觉定位

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结通过将视觉-语言模型蒸馏损失分解为语言先验和视觉定位两个正交分量，提出视觉梯度引导（VGS）方法动态调整更新方向以优先优化视觉子空间，从而提升小模型在复杂多模态任务中的定位能力。

Comments ICML 2026 Spotlight

详情

AI中文摘要

虽然在策略蒸馏为训练小型推理模型提供了密集监督，但其在多模态领域的优化动态仍未得到充分探索。在这项工作中，我们通过数学上将损失分解为两个不同的组成部分：语言先验和视觉定位，挑战了视觉-语言模型（VLM）蒸馏的标准整体观点。我们的分析揭示，这些分量的梯度向量几乎正交，表明与教师语言分布对齐的目标在几何上独立于匹配其视觉感知的目标。因此，标准优化被动地遵循一条次优的折衷轨迹，隐式地平衡这两个目标。假设视觉定位是视觉-语言推理的主要瓶颈，我们引入了视觉梯度引导（VGS），一种动态重新定向更新向量以优先考虑视觉子空间的方法。在多个蒸馏设置和复杂多模态基准上的实验结果表明，VGS显著优于标准的在策略蒸馏整体公式，以最小的训练开销实现了卓越的定位能力。

英文摘要

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
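The geometric picture (two nearly orthogonal gradient components, with the update reoriented toward the visual one) can be sketched as follows. The steering rule here is a hypothetical illustration: it boosts the visual component by an assumed `visual_gain` and rescales to the norm of the plain sum. The abstract does not specify the actual VGS update rule.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

def steer(g_lang, g_vis, visual_gain=3.0):
    """Hypothetical steering: weight the visual gradient more heavily, then
    rescale so the step size matches the unsteered update g_lang + g_vis."""
    plain = [a + b for a, b in zip(g_lang, g_vis)]
    boosted = [a + visual_gain * b for a, b in zip(g_lang, g_vis)]
    boosted_norm = norm(boosted)
    scale = norm(plain) / boosted_norm if boosted_norm else 0.0
    return [scale * x for x in boosted]
```

For orthogonal components of equal size, the plain sum sits at 45 degrees between them, while the steered update points much closer to the visual gradient without taking a larger step.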

2606.00562 2026-06-02 cs.CV cs.LG 版本更新

ETC: 通过任务感知的视觉信息蒸馏实现视觉语言模型中的极端令牌压缩

Yiling Gao, Hongchen Wei, Zhenzhong Chen

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University（武汉大学遥感与信息工程学院）

AI总结提出ETC框架，基于变分信息蒸馏原理，在减少输入令牌数量时最小化任务损失，通过文本-图像交叉注意力加权视觉特征并引入变分信息蒸馏，实现单令牌压缩下仍保持强任务性能。

详情

AI中文摘要

在视觉语言模型（VLM）中，高分辨率图像会产生大量视觉令牌，导致推理时的高计算成本和KV缓存开销。为解决此问题，我们提出极端令牌压缩（ETC）框架，基于变分信息蒸馏原理，在减少输入令牌数量时最小化任务损失。具体而言，从信息论角度，我们表明最小化任务损失需要紧凑表示保留用于预测的指令感知充分统计量。在实践中，ETC利用文本-图像交叉注意力加权原始视觉特征以近似潜在的指令感知预测统计量。此外，ETC引入变分信息蒸馏，使紧凑表示保留必要信息以恢复该预测统计量。在LLaVA-1.5-7B和Qwen3-VL-2B上的实验表明，即使在单令牌压缩下，ETC仍保持有效性，大幅减少KV缓存开销同时保留强任务性能。

英文摘要

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
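A minimal sketch of the attention-weighting step: a text query scores every visual token, and the softmax weights pool the visual features into one compact token. This omits the learned projections and the variational distillation loss, and the function names are hypothetical rather than taken from the ETC code.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(text_query, visual_tokens):
    """Scaled dot-product scores of one text query against all visual tokens."""
    d = len(text_query)
    scores = [sum(q * v for q, v in zip(text_query, tok)) / math.sqrt(d)
              for tok in visual_tokens]
    return softmax(scores)

def compress_to_single_token(text_query, visual_tokens):
    """Weighted average of visual features: many tokens in, one token out."""
    w = attention_weights(text_query, visual_tokens)
    dim = len(visual_tokens[0])
    return [sum(w[i] * visual_tokens[i][k] for i in range(len(visual_tokens)))
            for k in range(dim)]
```

Because the weights depend on the text query, the same image compresses to different tokens for different instructions, which is the instruction-aware property the abstract emphasizes.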

2606.00514 2026-06-02 cs.LG cs.CV 版本更新

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

在重建空间中生成，在语义空间中匹配：一步生成的传输几何

Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev

发表机构 * Genentech（基因泰克）

AI总结本文研究自监督表示学习（SSL）特征在一步生成模型中的作用，提出在语义特征空间中使用Sinkhorn散度进行分布匹配，显著降低ImageNet FID，并揭示了评估指标与训练特征之间的潜在冲突。

Comments 26 pages, 4 figures

详情

AI中文摘要

生成建模和自监督表示学习（SSL）优化结构不同的目标：生成训练奖励分布保真度，而SSL奖励语义一致性。然而，最近的研究反复发现SSL特征改善了生成训练，尽管这种协同作用的机制仍不清楚。在这里，我们在一步生成的框架下研究SSL在生成建模中的优势，其中表示的作用是明确的：冻结的SSL特征用于将生成的样本与真实数据匹配。我们在该特征空间中使用Sinkhorn散度，为Wasserstein距离提供了一个可处理的代理，这是由Fréchet风格评估指标（如FID）近似的总体差异。我们发现，当在语义结构化的SSL特征空间中计算时，这个目标变得非常有效（ImageNet FID降低39倍）。我们将这种行为主要归因于匹配估计：抑制无关重建细节的语义SSL特征诱导出更紧凑的几何结构，使分布匹配更易处理。因此，最佳的训练SSL特征不一定与评估指标使用的特征匹配。特别是，我们表明使用Inception作为特征提取器可以改善FID，同时降低匹配稳定性和样本质量，揭示了一种形式的指标黑客攻击。通过在ImageNet上的大量实验，我们确定了哪些SSL特征族能带来最佳的生成性能，并表明匹配稳定性是选择它们的定量标准。代码可在https://github.com/Genentech/semantic-transport-generation获取。

英文摘要

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39× reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.
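For readers unfamiliar with the Sinkhorn divergence, the sketch below computes it between two small point sets with uniform weights and a squared-Euclidean cost. It follows the standard debiased form S(x, y) = OT(x, y) - 0.5 OT(x, x) - 0.5 OT(y, y), with each entropic OT value taken from the dual potentials. The regularization `eps` and iteration count are arbitrary choices, and in the paper's setting the points would be frozen SSL features rather than raw coordinates.

```python
import math

def _sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def _logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def entropic_ot(xs, ys, eps=1.0, iters=200):
    """Dual value of entropy-regularized OT between uniform empirical measures."""
    n, m = len(xs), len(ys)
    log_a, log_b = -math.log(n), -math.log(m)
    cost = [[_sq_dist(x, y) for y in ys] for x in xs]
    f, g = [0.0] * n, [0.0] * m
    for _ in range(iters):
        # Alternating log-domain Sinkhorn updates of the dual potentials.
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps for j in range(m)])
             for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps for i in range(n)])
             for j in range(m)]
    return sum(f) / n + sum(g) / m

def sinkhorn_divergence(xs, ys, eps=1.0, iters=200):
    """Debiased divergence: zero for identical sets, positive when they differ."""
    return (entropic_ot(xs, ys, eps, iters)
            - 0.5 * entropic_ot(xs, xs, eps, iters)
            - 0.5 * entropic_ot(ys, ys, eps, iters))
```

The two self-terms remove the bias that entropic regularization introduces, which is why the divergence vanishes when generated and real samples coincide.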

2606.00511 2026-06-02 cs.LG cs.CV 版本更新

Saliency-Aware Model Merging

显著性感知模型合并

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea（延世大学，首尔，韩国）； Ewha Womans University, Seoul, South Korea（梨花女子大学，首尔，韩国）

AI总结提出SA-Merging方法，利用结构剪枝中的连通性显著性（如SynFlow）进行数据无关模型合并，通过任务向量显著性评分和合并感知调制减少任务干扰，并在视觉和语言任务上验证有效性。

Comments ICML 2026 Camera-ready

详情

AI中文摘要

模型合并旨在将多个在不同数据集上微调的任务特定模型整合到一个统一架构中，以实现跨领域能力。当前的数据无关模型合并方法通常难以扩展，因为它们依赖于忽略层间依赖性和非均匀专业知识分布的简单参数级启发式方法。本文提出SA-Merging，它基于结构剪枝（如SynFlow）中的连通性显著性公式，并将其扩展到数据无关模型合并设置。我们相对于共享基础模型定义任务向量上的显著性分数，并进一步引入合并感知调制，该调制结合专家间的一致性以减轻任务干扰。基于此公式，迭代的显著性感知合并过程逐步移除非信息性更新，同时保留端到端连通性。此外，我们将SA-Merging扩展到为LoRA引入秩级显著性分解，而不损害其结构完整性。在视觉和语言任务上的大量实验证明了我们基于显著性方法的有效性，进一步缩小了数据无关方法和测试时自适应方法之间的差距。

英文摘要

Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that achieves cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.

2606.00509 2026-06-02 cs.CV 版本更新

Structure-Aware Consistency Priors for Shape from Polarization in Complex Media

复杂介质中偏振形状恢复的结构感知一致性先验

Kaimin Yu, Puyun Wang, Huayang He, Xianyu Wu

发表机构 * The School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, China（福州大学机械工程与自动化学院）； Research Institute of Highway, Ministry of Transport, Beijing, China（交通部公路科学研究院）

AI总结针对复杂介质（以冰为例）中偏振观测与表面法线间的非线性映射问题，提出基于自相关函数的结构感知偏振先验，并设计双分支网络IceSfP通过跨模态注意力和多尺度特征融合实现精确法线估计，在首个真实冰SfP数据集上达到16.01°的平均角度误差。

详情

Journal ref: ICML 2026

AI中文摘要

在复杂介质中从单视角偏振图像恢复表面法线仍然具有挑战性。本文以冰作为代表性复杂介质，其中复杂的光与物质相互作用导致偏振观测与表面法线之间存在非线性映射。为了解决这一问题，提出了一种基于自相关函数的结构感知偏振先验，以捕获AoLP的局部空间一致性。在此基础上，设计了一个双分支网络（IceSfP），通过跨模态注意力和多尺度特征融合将原始偏振特征与先验集成，从而在复杂介质条件下实现准确的表面法线估计。为了评估该方法，构建了首个真实世界的冰SfP数据集。实验结果表明，该方法在所有指标上均优于现有方法，平均角度误差（MAE）为16.01°，比第二好的方法低2.74°。该框架为复杂介质中的高精度几何感知提供了一种可推广的解决方案。

英文摘要

Recovering surface normals from single-view polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light-matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of the angle of linear polarization (AoLP). Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with priors via cross-modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice shape-from-polarization (SfP) dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving a mean angular error (MAE) of 16.01°, which is 2.74° lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.
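The quantities named in this abstract have standard definitions that a short sketch can make concrete. It assumes a polarization camera capturing intensities behind linear polarizers at 0°, 45°, 90°, and 135°, from which the linear Stokes parameters, AoLP, and degree of linear polarization (DoLP) follow. The angular-error function is the usual metric for normal estimation and is not taken from the IceSfP code.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes parameters from four polarizer-angle intensities."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2

def aolp_dolp(i0, i45, i90, i135):
    """Angle (radians) and degree of linear polarization for one pixel."""
    s0, s1, s2 = stokes_from_intensities(i0, i45, i90, i135)
    aolp = 0.5 * math.atan2(s2, s1)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    return aolp, dolp

def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normals."""
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        np_ = math.sqrt(sum(a * a for a in p))
        ng = math.sqrt(sum(b * b for b in g))
        c = sum(a * b for a, b in zip(p, g)) / (np_ * ng)
        total += math.degrees(math.acos(max(-1.0, min(1.0, c))))
    return total / len(pred_normals)
```

AoLP is only defined modulo 180°, which is one source of the ambiguity that makes recovering normals from polarization cues difficult.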

2606.00508 2026-06-02 cs.CV cs.AI 版本更新

V-LynX: Token Interface Alignment for Video+X LLMs

V-LynX: 视频+X 大语言模型的令牌接口对齐

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea（延世大学，首尔，韩国）； Ewha Womans University, Seoul, South Korea（梨花女子大学，首尔，韩国）

AI总结本文发现视频大语言模型中存在令牌接口连续流形，并提出V-LynX框架，通过轻量辅助路径对齐注意力响应和统计分布，无需配对监督即可集成新模态，在音视频问答、3D推理等任务上达到最优效率。

Comments ICML 2026 Camera-ready

详情

AI中文摘要

本研究揭示了视频大语言模型中的一个有趣现象：视频大语言模型不仅仅是简单地将帧转换为文本嵌入，而是建立了一个连续流形——令牌接口，使得视觉令牌能够在架构内作为独立实体运行。利用这一发现，我们提出了V-LynX，这是一个可扩展的框架，通过重新利用内部化接口，将新模态集成到视频大语言模型中。与需要大量模态特定编码器或配对监督的传统范式不同，V-LynX采用轻量辅助路径与冻结的视觉编码器并行运行。我们的方法通过使用非配对单模态数据集对齐注意力响应和统计分布，将新的感官输入与内在视频先验相结合。这确保了流形兼容性，同时保持了视频大语言模型的完整性。大量基准测试表明，V-LynX在音视频问答、3D推理、高帧率和多视角视频理解方面达到了最先进水平和高效性。代码可在https://github.com/park-jungin/lynx获取。

英文摘要

This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal data sets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves SOTA and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at https://github.com/park-jungin/lynx.

2606.00499 2026-06-02 cs.CV 版本更新

OptiWorld: Optimal Control for Video World Generation under Physical Constraints

OptiWorld: 物理约束下的视频世界生成最优控制

Yu Yuan, Jianhao Yuan, Xijun Wang, Daiqing Li, Liu He, Lu Ling, Stanley H. Chan

发表机构 * Purdue University（普渡大学）； University of Oxford（牛津大学）； SixteenMiles Labs（SixteenMiles 实验室）

AI总结提出OptiWorld框架，在推理时结合经典最优控制与视频生成，通过提取紧凑世界状态、规划最优轨迹并生成条件视频，实现符合物理约束的动态优化。

Comments Project Page: https://yuyuanspace.com/OptiWorld/

详情

AI中文摘要

视频生成模型正成为一种可扩展的世界模型形式，但它们主要生成合理的运动，而非主动控制或优化底层动态。因此，生成视频中的物体可能遵循不安全、不光滑、低效或物理不一致的轨迹。在这项工作中，我们提出了OptiWorld，一个在推理时将经典最优控制引入视频生成的框架。OptiWorld首先提取紧凑的、与任务相关的世界状态，然后在物理约束下规划最优轨迹，最后基于该轨迹渲染视频。我们将规划表述为连续流形上的几何问题，将3D几何和任务相关的物理约束转化为统一的规划几何。通过添加这一最优控制层，OptiWorld生成具有更优动态的视频，在多个任务中展现出强大潜力，包括目标条件的图像到视频生成、视频动态编辑和反事实生成。

英文摘要

Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose OptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.
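As a toy illustration of what planning an optimal trajectory can mean, the sketch below uses the classical minimum-jerk profile, the closed-form solution for moving between two rest states while minimizing integrated squared jerk. It is a stand-in chosen for brevity: it ignores obstacles and the manifold formulation described in the abstract, and it is not OptiWorld's planner.

```python
def minimum_jerk(start, goal, num_steps):
    """Sample a minimum-jerk path from start to goal (zero velocity and
    acceleration at both ends) at num_steps + 1 evenly spaced times."""
    trajectory = []
    for k in range(num_steps + 1):
        s = k / num_steps  # normalized time in [0, 1]
        blend = 10 * s ** 3 - 15 * s ** 4 + 6 * s ** 5
        trajectory.append(tuple(a + (b - a) * blend for a, b in zip(start, goal)))
    return trajectory

if __name__ == "__main__":
    for point in minimum_jerk((0.0, 0.0, 0.0), (1.0, 2.0, 0.0), 4):
        print(point)
```

A trajectory like this could serve as the conditioning signal for a video generator; a fuller planner would add constraint terms for collisions or friction to the objective.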

2606.00477 2026-06-02 cs.CL cs.CV 版本更新

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

文本编辑能否泛化到视觉生成？统一多模态模型中的跨模态知识编辑基准

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）

AI总结提出跨模态知识编辑基准UniKE，发现文本编辑在图像生成中效果显著下降（VQA准确率仅18.5%），并提出推理增强参数编辑方法提升跨模态迁移效果。

Comments Published at ICML 2026; Code and data available at https://github.com/gxx27/UniKE

详情

AI中文摘要

统一多模态模型（UMMs）已成为通用多模态智能的有前途的范式。随着它们在现实世界应用中的部署，有效更新内部知识变得至关重要。虽然知识编辑在纯文本模型中已经成熟，但成功修改文本输出的编辑是否也能迁移到UMMs中的图像生成仍不清楚。为了研究这个问题，我们引入了UniKE，这是第一个用于UMMs中跨模态知识编辑的基准，包含2,971个编辑主题，涵盖属性和关系编辑。使用基于VQA的视觉验证，我们揭示了一个显著的模态差距：文本侧的有效性可以达到约92%，而直接图像生成下的最佳整体VQA准确率仅为18.5%。我们进一步提出了推理增强参数编辑，它在生成前显式激活编辑后的知识，并提高了所有评估模型-编辑器对的整体VQA准确率，提升高达18.6个百分点。机制分析表明，这种差距与编辑后的文本表示与视觉生成的条件路径之间的部分对齐有关，其中足以用于文本输出的编辑可能仍然太弱或未对齐，无法引导图像合成。这些发现表明，文本知识编辑不能保证可靠的跨模态迁移，并激励了模态感知的编辑方法。我们的代码和数据可在https://github.com/gxx27/UniKE获取。

英文摘要

Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at https://github.com/gxx27/UniKE.

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG 版本更新

超越静态高斯：动态3D场景重建架构范式的实证研究

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo（滑铁卢大学）

AI总结本文通过实证比较结构引导与高斯中心两种动态3D高斯溅射范式，揭示重建质量/紧凑性与渲染速度之间的根本权衡。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

详情

DOI: 10.15353/jcvis.v11i1.10019
Journal ref: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025, p. 99

AI中文摘要

通过3D高斯溅射（3DGS）进行动态场景重建已成为表示演化环境的一种引人注目的方法，但理解不同方法之间的权衡仍然至关重要。本文对动态3DGS方法进行了全面分析，将其分为两种范式：结构引导方法，利用辅助表示（变形场、规范空间、网格）来建模时间变化；以及高斯中心方法，通过连续函数或4D表示将动态直接编码到基元中。我们在D-NeRF基准上评估了两种范式的代表性方法。我们的发现表明，结构引导方法实现了优越的重建保真度和紧凑的模型大小，而高斯中心方法则表现出显著更高的渲染速度，能够实现实时性能，但质量变异性更大且可能产生大量存储开销。该分析突出了重建质量/紧凑性与渲染速度之间的根本权衡，为动态场景重建的未来研究和应用开发提供了见解。

英文摘要

Dynamic scene reconstruction via 3D Gaussian Splatting (3DGS) has emerged as a compelling approach for representing evolving environments, yet understanding trade-offs between methodologies remains crucial. This paper presents a comprehensive analysis of dynamic 3DGS methods, categorizing them into two paradigms: structure-guided methods employing auxiliary representations (deformation fields, canonical spaces, grids) to model temporal changes, and gaussian-centric methods encoding dynamics directly into primitives via continuous functions or 4D representations. We evaluate representative methods from both paradigms on the D-NeRF benchmark. Our findings reveal that structure-guided methods achieve superior reconstruction fidelity and compact model sizes, while gaussian-centric approaches demonstrate significantly higher rendering speeds enabling real-time performance, though with greater quality variability and potentially substantial storage overhead. This analysis highlights a fundamental trade-off between reconstruction quality/compactness versus rendering speed, providing insights to guide future research and application development in dynamic scene reconstruction.

2606.00450 2026-06-02 cs.CV cs.GR 版本更新

Optimizing 3D Gaussian Splatting via Point Cloud Upsampling

通过点云上采样优化3D高斯泼溅

Adrian Ramlal, Yan Song Hu, John S. Zelek

发表机构 * Vision and Image Processing Group, Systems Design Engineering, University of Waterloo（滑铁卢大学视觉与图像处理组，系统设计工程）

AI总结提出多种点云上采样方法及深度引导点提升技术，改善3D高斯泼溅的初始化质量，实验表明不同场景适用不同策略。

Comments Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)

详情

DOI: 10.15353/jcvis.v10i1.10008
Journal ref: Journal of Computational Vision and Imaging Systems, Vol. 10, No. 1, p. 47, 2024

AI中文摘要

3D高斯泼溅（3DGS）是一种用于创建和渲染3D场景的技术，但其性能严重依赖于初始种子点的质量。为了改进3DGS初始化，本研究提出并评估了几种点云上采样方法：线性插值、三角插值、基于样条的曲面重建、移动最小二乘曲面拟合和基于Voronoi的点生成。此外，本研究引入了一种深度引导的点提升方法，利用深度图保持与运动恢复结构（SfM）重建的几何一致性。通过在Mip-NeRF360和Replica数据集上的大量实验，所提出的方法在多种场景类型中展示了重建质量的提升。结果表明，不同的上采样策略在不同场景中表现优异：曲面重建方法在处理有机、细节丰富的场景时表现更好，而简单的插值方法更适合以分段平滑几何为主的场景。相比之下，深度引导方法在添加整个场景中的几何感知点方面显示出潜力，尤其是在纹理缺失区域。这些发现为根据场景特征和计算约束选择合适的上采样方法提供了初步实用指南，增进了对点云初始化如何影响3DGS质量的理解。

英文摘要

3D Gaussian Splatting (3DGS) is a technique for creating and rendering 3D scenes, but its performance depends heavily on the quality of the initial seed points. To improve 3DGS initialization, this study presents and evaluates several point cloud upsampling approaches: linear interpolation, triangular interpolation, spline-based surface reconstruction, moving least squares surface fitting, and Voronoi-based point generation. Additionally, this research introduces a depth-guided point lifting method that leverages depth maps to maintain geometric consistency with Structure-from-Motion (SfM) reconstructions. Through extensive experiments on the Mip-NeRF360 and Replica datasets, the proposed methods demonstrate improvements in reconstruction quality across diverse scene types. Results indicate that different upsampling strategies excel in different scenarios: surface reconstruction methods perform better with organic, detailed scenes, while simpler interpolation approaches are more suited for scenes dominated by piecewise-smooth geometries. In comparison, the depth-guided approach shows promise for adding geometry-aware points across the entire scene, particularly in texture-less regions. These findings provide preliminary practical guidelines for selecting appropriate upsampling methods based on scene characteristics and computational constraints, and advance the understanding of how point cloud initialization affects 3DGS quality.
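
To make the idea of depth-guided point lifting concrete, the sketch below back-projects pixels with known depth into camera-space 3D points using a standard pinhole model. The function name `lift_pixels`, the `stride` parameter, and the intrinsics layout are assumptions for illustration; the abstract does not say how the authors lift points or how they enforce consistency with the SfM reconstruction.

```python
import numpy as np

def lift_pixels(depth, K, stride=1):
    """Back-project a depth map into camera-space 3D points (pinhole model).

    depth: (H, W) array of metric depths, 0 where depth is unknown.
    K:     (3, 3) camera intrinsics with fx, fy on the diagonal and cx, cy
           in the last column (an assumed, conventional layout).
    """
    h, w = depth.shape
    # Pixel grid sampled every `stride` pixels; v indexes rows, u columns.
    v, u = np.mgrid[0:h:stride, 0:w:stride]
    d = depth[v, u]
    valid = d > 0  # skip pixels without a depth estimate
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * d
    y = (v - cy) / fy * d
    pts = np.stack([x, y, d], axis=-1)
    return pts[valid]  # (N, 3) candidate seed points
```

The returned points would still need to be transformed into the world frame with the camera pose before being merged with the SfM seed points.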

2606.00447 2026-06-02 cs.CV cs.AI 版本更新

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

发表机构 * University of Minnesota, Twin Cities（明尼苏达大学，双城分校）

AI总结提出GeoSAM-3D方法，利用冻结的视觉基础模型和单目3D高斯泼溅重建，通过可微分的图-测地线传播核在场景图上传播用户提示，实现从单目视频的开放词汇3D场景分割。

详情

AI中文摘要

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置：用户上传一段短的单目视频，在一帧中点击或命名一个物体，并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示，而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性，并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

英文摘要

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
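
The contrast between graph-based and Euclidean propagation can be sketched in a few lines. The example below builds a radius graph over centroids, forms the heat kernel exp(-tL) of its Laplacian, and reads off the affinity of every centroid to a seed. It is only a toy illustration under assumptions: the function name `heat_kernel_scores`, the radius graph, and the dense eigendecomposition are ours, and it is not the kernel implemented in `geosam3d.propagate`.

```python
import numpy as np

def heat_kernel_scores(points, seed_idx, radius, t=1.0):
    """Heat-kernel affinity of every point to a seed on a radius graph."""
    # Pairwise Euclidean distances between centroids.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Unweighted radius graph: an edge joins centroids closer than `radius`.
    W = (dist < radius).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W            # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(L)          # L is symmetric
    H = (evecs * np.exp(-t * evals)) @ evecs.T  # heat kernel exp(-t L)
    return H[:, seed_idx]
```

With two parallel chains of centroids that are close in space but not connected in the graph, the seed's score stays numerically zero on the other chain, whereas a Euclidean nearest-neighbor rule would leak onto it. A dense eigendecomposition is only practical for small graphs; a real scene would need a sparse or approximate solver.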

2606.00445 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities（明尼苏达大学，双城分校）

AI总结提出DarkVesselNet，融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头，实现多模态遥感暗船检测。

2606.00444 2026-06-02 cs.CV cs.GR 版本更新

Real-Time Physics Simulation with Dynamic Mesh-Gaussian Reconstructions

基于动态网格-高斯重建的实时物理仿真

Adrian Ramlal, John S. Zelek

发表机构 * University of Waterloo（滑铁卢大学）

AI总结针对动态重建与物理仿真拓扑不兼容的问题，提出固定拓扑网格与高斯泼溅的双表示框架，实现实时物理仿真，并揭示高质量重建与物理兼容拓扑存在本质冲突。

详情

Journal ref: Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025

AI中文摘要

将动态3D重建集成到物理仿真中需要固定的网格拓扑以实现高效的碰撞检测，但像DG-Mesh这样的先进方法会生成针对几何质量优化的可变拓扑。我们研究了拓扑转换是否能在保持重建保真度的同时实现物理集成。我们提出了一种双表示框架，将用于物理的固定拓扑网格与用于渲染的高斯泼溅相结合，通过运行时顶点缓冲区更新实现了比可变拓扑基线快4.65倍的加速。我们在DG-Mesh数据集上评估了两种转换策略（时间对应跟踪和基于模板的投影）与原生固定拓扑方法（MaGS）的性能。我们的评估表明，两种转换方法都会导致65-80%的几何退化，尽管DG-Mesh具有优越的初始质量，但产生的结果不如MaGS。这表明高质量重建和物理兼容拓扑代表了根本不同的目标，无法通过后处理来调和。我们的发现为未来物理感知重建方法的发展提供了信息，并且我们的框架能够与任何固定拓扑方法实现实时仿真。

英文摘要

Integrating dynamic 3D reconstructions into physics simulation requires fixed mesh topology for efficient collision detection, but state-of-the-art methods like DG-Mesh produce varying topology optimized for geometric quality. We investigate whether topology conversion can enable physics integration while preserving reconstruction fidelity. We propose a dual-representation framework combining fixed-topology meshes for physics with Gaussian splatting for rendering, achieving a 4.65× speedup over varying-topology baselines through runtime vertex buffer updates. We evaluate two conversion strategies, temporal correspondence tracking and template-based projection, against native fixed-topology methods (MaGS) on the DG-Mesh dataset. Our evaluation reveals that both conversion approaches incur 65-80% geometric degradation, producing results inferior to MaGS despite DG-Mesh's superior initial quality. This demonstrates that high-quality reconstruction and physics-compatible topology represent fundamentally distinct objectives that cannot be reconciled through post-processing. Our findings inform future development of physics-aware reconstruction methods and our framework enables real-time simulation with any fixed-topology approach.

2606.00439 2026-06-02 cs.CV 版本更新

Physical Object Understanding with a Physically Controllable World Model

基于物理可控世界模型的物理对象理解

Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

发表机构 * Stanford University（斯坦福大学）； OpenAI（开放人工智能公司）； Noetik Inc.（Noetik公司）； Google（谷歌）

AI总结提出一类概率世界模型，通过自回归序列建模高效训练，从视频中推断对象及其物理交互，实现对象发现、3D操控和物理关系计算。

Comments CVPR 2026 Highlight. Project page at: https://neuroailab.github.io/psi-website/blog.html

详情

AI中文摘要

视觉智能的一个核心挑战是从原始视频中学习场景的物理结构：区域如何形成对象以及支配它们交互的规律。解决这些任务需要能够从部分观测中推断世界分布状态的世界模型——当前架构无法提供这种能力。我们引入了一类新的概率世界模型，支持估计任何视觉变量（如外观和动态）在给定其他变量条件下的概率。在这里，我们发现这些模型可以通过自回归序列建模高效训练，从而产生能够涌现丰富对象理解的世界模型。首先，我们展示了我们的模型通过顺序推理生成多个合理的未来世界状态，捕捉了支配对象如何运动的物理规律。然后，通过分析这些未来状态中的运动相关性，我们提取出对象及其关节子部分。在发现这些对象后，我们展示了我们的世界模型可以在3D中操控它们。最后，我们演示了如何从世界模型计算对象之间的物理关系，从而实现了诸如视觉叠叠乐等应用。

英文摘要

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

2606.00416 2026-06-02 cs.CV 版本更新

4D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

4D雷达与激光雷达和相机的结合：恶劣天气下的协同感知

Melih Yazgan, Iramm Hamdard, Qiyuan Wu, J. Marius Zoellner

发表机构 * FZI Research Center for Information Technology（FZI信息技术研究中心）； Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结针对恶劣天气下相机和激光雷达性能下降的问题，提出集成4D成像雷达作为鲁棒模态，并引入多普勒引导的空间注意力机制进行多智能体融合，显著提升雾雨环境下的协同感知鲁棒性。

Comments Accepted by CVPR - DriveX Workshop

详情

AI中文摘要

协同感知对于自动驾驶至关重要，但在恶劣天气下，当相机和激光雷达性能下降时，其可靠性会受到影响。我们通过将4D成像雷达作为一种对天气鲁棒的模态集成到协同感知中，并引入多普勒引导的空间注意力机制用于多智能体融合，来解决这一挑战。我们的方法扩展了两种代表性骨干网络：一种是雷达-相机流水线，其中雷达替代激光雷达；另一种是激光雷达-雷达流水线，其中雷达补充激光雷达。为了支持评估，我们发布了雷达增强的基准数据集OPV2V-R和Adver-City-R，并加入了基于物理的激光雷达退化模拟。实验表明，在雾和雨条件下，该方法获得了显著的鲁棒性提升，特别是在雷达替代退化激光雷达时改进明显。在MAN TruckScenes上的额外验证证明了该方法在仿真之外的迁移能力。总体而言，我们的结果突出了4D成像雷达作为一种适用于全天候协同感知的鲁棒模态。数据集和代码可在以下网址获取：https://url.fzi.de/SlimComm。

英文摘要

Cooperative perception is important for autonomous driving but remains fragile when cameras and LiDAR degrade in adverse weather. We address this challenge by integrating 4D imaging radar as a weather-robust modality into collaborative perception and introducing a Doppler-guided spatial attention mechanism for multi-agent fusion. Our approach extends two representative backbones: a radar-camera pipeline where radar substitutes LiDAR, and a LiDAR-radar pipeline where radar complements LiDAR. To support evaluation, we release radar-augmented benchmarks, OPV2V-R and Adver-City-R, with physics-based LiDAR degradation. Experiments show strong robustness gains in fog and rain, including substantial improvements when radar replaces degraded LiDAR. Additional validation on MAN TruckScenes demonstrates transfer beyond simulation. Overall, our results highlight 4D imaging radar as a robust modality for all-weather collaborative perception. Dataset and code are available at: https://url.fzi.de/SlimComm.

2606.00404 2026-06-02 cs.CV cs.LG 版本更新

Rethinking Amortized Neural Representations for High-Resolution Terrain Elevation Data

重新思考高分辨率地形高程数据的摊销神经表示

Haoan Feng, Xin Xu, Leila De Floriani

发表机构 * University of Maryland, College Park（马里兰大学帕克分校）

AI总结针对地形高程数据，提出HUVR+SIREN超网络方法，通过替换坐标解码器为平滑可微版本，在统一基准上实现最佳高度和导数保真度，且支持后训练量化压缩。

Comments 12 pages, 7 figures, 10 tables

详情

AI中文摘要

隐式神经表示（INR）将信号建模为连续的坐标到值函数。对于地形高程数据，这支持解析导数、任意分辨率解码以及底层高度场的平滑表面模型。然而，为每个瓦片拟合和存储单独的INR无法扩展到大型地形数据集。摊销神经表示通过共享网络降低了这一成本：新瓦片被映射到紧凑的每瓦片载荷，共享解码器从中重建高度场。大多数此类方法是超网络，通过单次前向传递预测载荷，而其他方法则通过短时的每瓦片优化恢复载荷。这些方法主要针对自然图像开发，其在地形高度场上的适用性尚不清楚。我们在1米/像素的地形数据集上引入了受控基准，并在统一协议下评估了三种代表性方法。观察到明显的跨领域差距后，我们提出了HUVR+SIREN，这是一种超网络，它通过将坐标解码器替换为平滑、解析可微的解码器来适应最强的基准方法（HUVR）。它在基准上实现了最佳的高度和导数保真度，无需额外的每瓦片存储且解码成本更低，并且能够容忍激进的后训练量化而质量损失可忽略，从而形成了紧凑的地形神经格式。消融和诊断进一步确定了哪些设计选择可迁移到地形，并表明每瓦片瓶颈已接近其有用极限，剩下的差距在于共享超网络的架构设计。

英文摘要

Implicit neural representations (INRs) model a signal as a continuous coordinate-to-value function. For terrain elevation data, this supports analytic derivatives, arbitrary-resolution decoding, and a smooth surface model of the underlying heightfield. However, fitting and storing a separate INR for every tile does not scale to large terrain datasets. Amortized neural representations reduce this cost with a shared network: a new tile is mapped to a compact per-tile payload, and a shared decoder reconstructs the heightfield from it. Most such methods are hypernetworks that predict the payload in a single forward pass, while others recover it through a short per-tile optimization. These methods were developed primarily for natural images, and their suitability for terrain heightfields remains unclear. We introduce a controlled benchmark on a 1 m/pixel terrain dataset and evaluate three representative methods under a unified protocol. Observing a clear cross-domain gap, we propose HUVR+SIREN, a hypernetwork that adapts the strongest benchmarked method (HUVR) by replacing its coordinate decoder with a smooth, analytically differentiable one. It attains the best height and derivative fidelity on the benchmark with no additional per-tile storage and lower decode cost, and tolerates aggressive post-training quantization with negligible quality loss, giving a compact terrain neural format. Ablations and diagnostics further identify which design choices transfer to terrain and show that the per-tile bottleneck is already near its useful limit, leaving the remaining gap in the shared hypernetwork's architectural design.
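
The appeal of a sine-activated (SIREN-style) decoder for terrain is that slope comes from a closed-form derivative rather than finite differences. The sketch below shows this for a single hidden layer. The function name, the layer sizes, and the `omega0` frequency scale are illustrative assumptions and do not reproduce the HUVR+SIREN architecture.

```python
import numpy as np

def siren_height_and_gradient(xy, W1, b1, w2, b2, omega0=30.0):
    """One-hidden-layer sine network h(x, y) and its analytic gradient.

    xy: (2,) query coordinate; W1: (hidden, 2); b1, w2: (hidden,); b2: scalar.
    """
    z = omega0 * (W1 @ xy + b1)                 # pre-activations, shape (hidden,)
    h = w2 @ np.sin(z) + b2                     # scalar height
    # Chain rule: dh/dxy = omega0 * W1^T (w2 * cos(z)).
    grad = omega0 * (W1.T @ (w2 * np.cos(z)))   # terrain slope, shape (2,)
    return h, grad
```

Because the gradient is exact for the network, quantities such as slope and aspect can be evaluated at any resolution without differencing a sampled grid.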

2606.00393 2026-06-02 eess.IV cs.CV 版本更新

AutoIQ: An Ensemble Framework for Automatic Assessment of Geometric Distortion in Prostate Diffusion-Weighted Imaging

AutoIQ：前列腺扩散加权成像中几何畸变自动评估的集成框架

Haoran Sun, Lixia Wang, Yin-Chen Hsu, Hsu-Lei Lee, Chang Gao, Fei Han, Robert Grimm, Vibhas Deshpande, Ziyang Long, Hsin-Jung Yang, Rola Saouaf, Alessandro D'Agnolo, Timothy Daskivich, Hyung Kim, Debiao Li, Yibin Xie

发表机构 * Biomedical Imaging Research Institute, Cedars-Sinai Medical Center（Cedars-Sinai医疗中心生物医学成像研究所）； Department of Bioengineering, University of California（加州大学生物工程系）； Siemens Medical Solutions USA Inc.（西门子医疗解决方案美国公司）； Siemens Healthineers AG（西门子医疗股份公司）； Department of Imaging, Cedars-Sinai Medical Center（Cedars-Sinai医疗中心影像科）； Department of Nuclear Medicine, Cedars-Sinai Medical Center（Cedars-Sinai医疗中心核医学科）； Department of Urology, Cedars-Sinai Medical Center（Cedars-Sinai医疗中心泌尿科）

AI总结提出AutoIQ集成机器学习框架，结合分割和配准方法量化DWI几何畸变，用于自动分类畸变严重程度，在独立测试集上达到0.95准确率。

Comments Original research; 11 pages, 7 figures, 1 table

详情

AI中文摘要

前列腺扩散加权成像（DWI）中的几何畸变会损害病灶定位并降低基于MRI的临床评估的可靠性。我们提出了AutoIQ，一个用于自动量化和分类DWI几何畸变严重程度的集成机器学习框架。共分析了140例回顾性前列腺双参数MRI检查，包括33次严重畸变需要重复采集的扫描和107次基于放射科专家评估可接受的畸变扫描。AutoIQ结合了两种互补的畸变量化策略：一种基于分割的方法，测量T2加权成像（T2WI）和DWI之间的前列腺边界不匹配；另一种基于配准的方法，估计DWI到T2WI对齐后的变形幅度。由此产生的畸变分数用于训练单个分类器和逻辑回归集成模型。两种计算方法均显著区分了严重和可接受的畸变病例（p < 0.001）。在独立测试集上，集成模型达到了0.95的准确率、0.93的F1分数和0.98的AUC，优于单个模型。这些结果表明，AutoIQ可以为前列腺DWI提供自动化的定量质量评估，并可能有助于识别需要重复采集的扫描。

英文摘要

Geometric distortion in prostate diffusion-weighted imaging (DWI) can impair lesion localization and reduce the reliability of MRI-based clinical assessment. We propose AutoIQ, an ensemble machine learning framework for automatic quantification and classification of DWI geometric distortion severity. A total of 140 retrospective prostate biparametric MRI examinations were analyzed, including 33 scans with severe distortion requiring repeat acquisition and 107 scans with acceptable distortion based on expert radiologist assessment. AutoIQ combines two complementary distortion quantification strategies: a segmentation-based method measuring prostate boundary mismatch between T2-weighted imaging (T2WI) and DWI, and a registration-based method estimating deformation magnitude after DWI-to-T2WI alignment. The resulting distortion scores were used to train individual classifiers and a logistic-regression ensemble model. Both computational methods significantly differentiated severe from acceptable distortion cases (p < 0.001). On an independent test set, the ensemble model achieved an accuracy of 0.95, F1-score of 0.93, and AUC of 0.98, outperforming individual models. These results suggest that AutoIQ can provide automated, quantitative quality assessment for prostate DWI and may help identify scans that require repeat acquisition.
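
A minimal sketch of the two ingredients named above, a segmentation-based mismatch score and a logistic combination of scores, might look as follows. Dice mismatch is one plausible choice of boundary-mismatch measure, and the direct combination of two raw scores is a simplification; the function names, weights, and bias are hypothetical and are not taken from the paper.

```python
import numpy as np

def dice_mismatch(mask_a, mask_b):
    """1 - Dice overlap between two binary masks (0 = identical, 1 = disjoint)."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 0.0  # both masks empty: treat as no mismatch
    return 1.0 - 2.0 * np.logical_and(a, b).sum() / denom

def ensemble_probability(seg_score, reg_score, weights, bias):
    """Logistic combination of a segmentation score and a registration score."""
    z = weights[0] * seg_score + weights[1] * reg_score + bias
    return 1.0 / (1.0 + np.exp(-z))  # probability of "severe distortion"
```

In practice the weights and bias would be fitted on labeled scans, and the registration score would come from the deformation field of a DWI-to-T2WI registration, which is not modeled here.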

2606.00390 2026-06-02 cs.CV cs.AI 版本更新

Zamba2-VL Technical Report

Zamba2-VL 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结提出基于混合架构Zamba2的视觉语言模型Zamba2-VL，在图像理解等基准上媲美Transformer模型，且首次令牌延迟降低约一个数量级。

Comments 16 pages, 2 figures

详情

AI中文摘要

我们提出Zamba2-VL，这是一套基于Zamba2构建的视觉语言模型，Zamba2是一种混合语言模型架构，结合了Mamba2状态空间层和少量共享的Transformer块。在广泛的图像理解、推理、OCR、定位和计数基准测试中，Zamba2-VL与同等规模的主流基于Transformer的开源VLM（包括Molmo2、Qwen3-VL和InternVL3.5系列）具有竞争力，并且显著优于之前的基于SSM和混合的VLM，如VL-Mamba、Cobra和mmMamba。继承了其Zamba2骨干网络的近线性预填充计算和小的、近乎恒定的循环状态，Zamba2-VL在匹配参数规模下，首次令牌延迟（TTFT）比这些Transformer基线低大约一个数量级，在最适合设备和边缘部署的较小1.2B和2.7B规模上效率差距最为明显。我们发布了三个模型——1.2B、2.7B和7B——以及推理代码，网址为https://huggingface.co/collections/Zyphra/zamba2-vl。

英文摘要

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.

2606.00386 2026-06-02 cs.CV 版本更新

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion

αDepth: 学习用于立体转换的单次软边界分解

Xiang Zhang, Yang Zhang, Lukas Mehl, Karlis Martins Briedis, Markus Gross, Christopher Schroers

发表机构 * ETH Zürich（苏黎世联邦理工学院）； DisneyResearch|Studios（迪士尼研究|工作室）

AI总结提出αDepth表示，通过圆形Alpha表示（CAR）将软边界分解为局部层次，实现高保真立体转换，无需用户干预。

详情

AI中文摘要

精确建模软边界（例如头发和散焦模糊）是立体转换中的一个基本挑战，因为前景和背景的模糊混合。现有的深度模型主要预测单层深度，导致软边界处的深度对应关系模糊。虽然抠图技术可以捕获用于分层建模的不透明度，但它们在具有多个目标的复杂场景中通常表现不佳，并且通常需要用户干预。本文介绍了αDepth，一种分层表示，用于分解软边界以实现高保真立体转换。具体来说，我们首先通过估计软边界处的分层颜色和深度值来解决混合颜色和深度模糊问题。考虑到复杂的多目标场景，我们设计了一种圆形Alpha表示（CAR），将范式从全局目标提取转变为局部边界分解。与先前仅限于单个前景/背景的抠图方法不同，CAR无需手动指导即可实现高效的场景级推理。大量评估表明，αDepth在立体转换中实现了最先进的性能，消除了软边界处的背景渗色和结构失真。

英文摘要

Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existing depth models primarily predict single-layer depth, leading to ambiguity in depth correspondence at soft boundaries. While matting techniques can capture opacity for layered modeling, they often struggle in complex scenes with multiple targets and usually require user intervention. This paper introduces αDepth, a layered representation that decomposes soft boundaries for high-fidelity stereo conversion. Specifically, we first resolve mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries. Considering complex multi-target scenes, we design a Circular Alpha Representation (CAR) that shifts the paradigm from global target extraction to local boundary decomposition. Unlike prior matting methods restricted to a single foreground/background, CAR enables efficient scene-level inference without manual guidance. Extensive evaluations demonstrate that αDepth achieves state-of-the-art performance in stereo conversion, eliminating background bleeding and structural distortions at soft boundaries.

2606.00380 2026-06-02 cs.CV cs.AI 版本更新

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

SUPREME: 一个用于可复现图像遗忘方法评估的多GPU框架

Petros Andreou, Jamie Lanyon, Axel Finke, Georgina Cosma

发表机构 * Department of Computer Science, School of Science, Loughborough University（拉夫堡大学理学院计算机科学系）； School of Mathematics, Statistics and Physics, Newcastle University（纽卡斯尔大学数学、统计与物理学院）

AI总结提出SUPREME框架，通过多GPU分布式架构加速图像分类遗忘方法的评估，支持新方法注册和多精度模式。

Comments 17 pages. Code available at https://github.com/pedroandreou/supreme-unlearning

2606.00379 2026-06-02 cs.CV 版本更新

Non-Learning Low-Light Stereo Vision

非学习低光立体视觉

Jason Wang, Lucas Nguyen, Hyunseung Eom, Wei Xu, Qi Guo

发表机构 * Department of Computer Sciences, Purdue University（普渡大学计算机科学系）； Elmore Family School of Electrical and Computer Engineering, Purdue University（普渡大学埃尔莫尔家族电气与计算机工程学院）

AI总结提出一种非学习立体框架，利用Field of Junctions (FoJ)提取粗视觉特征，结合边界感知半全局匹配(SGM)从严重噪声图像中估计视差，在基准数据集上获得比近期立体算法更准确的稀疏视差图。

Comments Accepted to ICIP 2026. Code and data available at https://github.com/guo-research-group/nonlearning-lowlight-stereo

2606.00377 2026-06-02 cs.CV 版本更新

Score-Control for Hallucination Reduction in Diffusion Models

扩散模型中减少幻觉的分数控制

Mahesh Bhosale, Naresh Kumar Devulapally, Abdul Wasi, Chau Pham, Vishnu Suresh Lokhande, David Doermann

发表机构 * University at Buffalo（布法罗大学）

AI总结针对扩散模型中的幻觉问题，提出基于方差引导的分数调制策略，通过控制分数雅可比矩阵减少幻觉，在保持高保真度和多样性的同时将幻觉最多降低约25%。

详情

AI中文摘要

扩散模型已成为现代生成式AI的支柱，推动了视觉、语言、音频及其他模态的进步。尽管取得了成功，但它们存在幻觉问题，即生成真实数据分布支撑集之外的不可信样本，这降低了可靠性和信任度。在这项工作中，我们首先通过实验证实了先前提出的假设，即分数平滑性导致图像生成扩散模型中的幻觉，并提供了基于密度的视角。我们进一步通过将幻觉概率质量与学习到的分数函数的利普希茨常数联系起来，形式化了这一概念。受此启发，我们引入了一种方差引导的分数调制（VSM）策略，该策略控制分数雅可比矩阵，从而降低分数平滑性并更好地逼近真实分数，进而减少幻觉。在合成和真实世界数据集上的实验结果表明，我们的方法在保持高保真度和多样性的同时，将幻觉降低了最多约25%，为更可靠的基于扩散的图像生成提供了原则性步骤。我们还提出了两个具有极端语义变化的基准数据集，用于系统性幻觉评估。代码和数据集公开于https://github.com/bhosalems/VSM。

英文摘要

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio, and other modalities. Despite their success, they suffer from hallucinations: implausible samples that lie outside the support of the true data distribution and degrade reliability and trust. In this work, we first empirically confirm the previously proposed hypothesis that score smoothness causes hallucinations in image-generation diffusion models and provide a density-based perspective. We further formalize this notion by linking the hallucination probability mass to the Lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, in turn reducing score smoothness and better approximating the ground-truth score, which decreases hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations (by up to ~25%) while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and datasets are publicly available at https://github.com/bhosalems/VSM.

2606.00372 2026-06-02 cs.CV 版本更新

LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving

LFA：用于自动驾驶中2D目标检测器运行时自省的分层特征注意力

Mert Keser, Alois Knoll

AI总结提出LFA方法，通过注意力机制聚合骨干网络多层特征，以提升自动驾驶中2D目标检测器的错误预测性能和可解释性。

详情

AI中文摘要

可靠的目标检测对于自动驾驶至关重要，然而即使是最先进的检测器也不可避免地会犯错误，从而危及安全。预测检测器失败的自省方法通过触发后备机制或提醒人类操作员，能够实现更安全的部署。然而，现有方法仅依赖最后一层特征或手工设计的统计量，丢弃了来自早期层的宝贵信息，这些信息捕捉了不同层次的视觉抽象。我们提出了分层特征注意力（LFA），一种轻量级的自省方法，通过注意力机制学习从多个骨干层聚合特征。我们的关键洞察是，检测错误在特征层次上表现不同：低层捕捉对检测小目标或被遮挡目标至关重要的细粒度细节，而高层编码用于场景理解的语义信息。LFA端到端地学习层重要性权重，从而既改进了错误预测，又实现了对哪些特征级别最能指示检测器失败的可解释分析。在KITTI和BDD100K上的大量实验表明，LFA实现了最先进的自省性能，在多种检测器架构上优于单层基线方法。

英文摘要

Reliable object detection is critical for automated driving, yet even state-of-the-art detectors inevitably make errors that can compromise safety. Introspection methods that predict detector failures enable safer deployment by triggering fallback mechanisms or alerting human operators. However, existing approaches rely solely on last-layer features or hand-crafted statistics, discarding valuable information from earlier layers that capture different levels of visual abstraction. We propose Layer Feature Attention (LFA), a lightweight introspection method that learns to aggregate features from multiple backbone layers through an attention mechanism. Our key insight is that detection errors manifest differently across feature hierarchies: low-level layers capture fine-grained details essential for detecting small or occluded objects, while high-level layers encode semantic information for scene understanding. LFA learns layer importance weights end-to-end, enabling both improved error prediction and interpretable analysis of which feature levels are most indicative of detector failures. Extensive experiments on KITTI and BDD100K demonstrate that LFA achieves state-of-the-art introspection performance, outperforming single-layer baselines across multiple detector architectures.
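
The core operation described here, weighting features from several backbone layers by learned importance, reduces to a softmax-weighted sum once each layer has been pooled to a common dimension. The sketch below assumes that pooling and projection have already happened and uses fixed logits as a stand-in for learned (possibly input-dependent) attention scores; `aggregate_layers` is a hypothetical name, not the paper's API.

```python
import numpy as np

def aggregate_layers(layer_feats, layer_logits):
    """Softmax-weighted sum of per-layer feature vectors.

    layer_feats:  list of (dim,) arrays, one pooled vector per backbone layer.
    layer_logits: list of scalars, one unnormalized importance score per layer.
    """
    feats = np.stack(layer_feats)                  # (num_layers, dim)
    logits = np.asarray(layer_logits, dtype=float)
    w = np.exp(logits - logits.max())              # numerically stable softmax
    w /= w.sum()
    return w @ feats, w                            # fused feature, layer weights
```

Returning the weights alongside the fused feature is what allows the kind of per-layer importance analysis the abstract mentions; the fused vector would then feed a small classifier that predicts detector failure.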

2606.00352 2026-06-02 cs.CV cs.GR 版本更新

HiGS: A Hierarchical Rendering Architecture for Real-Time 3D Gaussian Splatting

HiGS：一种用于实时三维高斯泼溅的分层渲染架构

Dawid Pająk, Martin Bisson, Rodolfo Lima

发表机构 * NVIDIA

AI总结针对3D高斯泼溅中空间分区与光栅化对瓦片尺寸需求矛盾的问题，提出分层瓦片高斯泼溅（HiGS），通过粗粒度宏瓦片分区和细粒度渲染瓦片光栅化实现加速，在保持精确alpha合成的同时实现最高15.8倍加速。

Comments Project Page: https://research.nvidia.com/labs/sil/projects/higs/

详情

AI中文摘要

3D高斯泼溅（3DGS）已成为在商用GPU上实现实时新视角合成的标准。其流程将空间分区和光栅化绑定到同一瓦片尺寸，但两者需求相反：分区（对高斯进行分箱和深度排序）随瓦片增大而成本降低，而光栅化随瓦片减小而成本降低。先前的加速工作降低了单个阶段的成本，但将两者锁定在单一尺度上，其中少数密集瓦片主导帧时间。我们提出分层瓦片高斯泼溅（HiGS），为每个阶段赋予独立尺度：分区在粗粒度宏瓦片上运行，而光栅化在宏瓦片内的细粒度渲染瓦片上运行。光栅化工作根据每个宏瓦片中的高斯数量分配，而非按瓦片分配，因此密集区域分布在多个并行单元上，而非串行通过一个单元。在测试场景中，HiGS的渲染速度最高可达原始3DGS的15.8倍，并且优于我们评估的所有其他光栅化器，同时保持精确的从前到后alpha合成。

英文摘要

3D Gaussian Splatting (3DGS) has become the standard for real-time novel view synthesis on commodity GPUs. Its pipeline ties spatial partitioning and rasterization to one tile size, yet the two pull in opposite directions: partitioning, which bins and depth-sorts gaussians, grows cheaper with larger tiles, while rasterization gets cheaper with smaller ones. Prior acceleration work reduces the cost of individual stages but keeps both locked to that single scale, where a few dense tiles dominate frame time. We present Hierarchically Tiled Gaussian Splatting (HiGS), which gives each its own scale: partitioning runs over coarse macro-tiles, while rasterization runs over the fine render tiles within them. Rasterization work is then issued in proportion to the gaussians in each macro-tile rather than per tile, so dense regions spread across many parallel units instead of serializing through one. Across tested scenes, HiGS renders up to 15.8x faster than the original 3DGS and outperforms every other rasterizer we evaluate, while preserving exact front-to-back alpha compositing.
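
The load-balancing idea, issuing rasterization work in proportion to the number of gaussians in each macro-tile, can be illustrated on the CPU. The sketch below bins gaussians by their projected centers only (real splats overlap several tiles) and sizes the work per macro-tile in fixed-size batches. The function name, the 64-pixel macro-tile, and the batch size of 256 are assumptions for illustration, not values from the paper.

```python
import numpy as np

def macro_tile_work(centers_px, image_size, macro_tile=64, batch=256):
    """Count gaussians per macro-tile and size the rasterization work per tile.

    centers_px: (N, 2) projected gaussian centers as (x, y) in pixels.
    image_size: (width, height) in pixels.
    """
    w, h = image_size
    nx = -(-w // macro_tile)  # ceil division: macro-tiles per row
    ny = -(-h // macro_tile)  # macro-tiles per column
    tx = np.clip(centers_px[:, 0] // macro_tile, 0, nx - 1).astype(int)
    ty = np.clip(centers_px[:, 1] // macro_tile, 0, ny - 1).astype(int)
    counts = np.bincount(ty * nx + tx, minlength=nx * ny).reshape(ny, nx)
    # One work unit per `batch` gaussians, so dense tiles get more units.
    work_units = -(-counts // batch)
    return counts, work_units
```

A macro-tile holding 1000 gaussians receives four work units under these settings while a tile with a single gaussian receives one, which is the sense in which dense regions spread across parallel units instead of serializing through one.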

2606.00318 2026-06-02 cs.RO cs.CV 版本更新

Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps

持久机器人地图中基础模型证据与几何感知之间的信念一致性

Christoffer Heckman, Harel Biggie, Brendan Crowe, Nicholas Roy

发表机构 * Department of Computer Science, University of Colorado, Boulder（科罗拉多大学博尔德分校计算机科学系）； Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology（麻省理工学院计算机科学与人工智能实验室）

AI总结提出一种更新算子，通过每类校准提交门和每事件冲突丢弃窗口，解决基础模型语义通道与几何感知通道在持久地图中的矛盾，显著提升地图精度。

详情

AI中文摘要

自主机器人使用的持久地图越来越多地将一个断言特征良好的几何感知栈与一个产生语义声明但未校准可靠性的基础模型通道融合到同一场景中。当代建图系统通过将基础模型通道视为每个元素后验的额外投票者来集成这两个通道，但未针对其自身的每类可靠性进行校准，也没有机制在给定时刻标记两个通道相互矛盾的情况。我们提出了一种具有两个协作机制的更新算子：一个每类校准的提交门，以及一个每事件冲突丢弃窗口，该窗口拒绝提交在声明时刻与几何通道矛盾的基础模型声明。我们在KITTI-360和ScanNet上进行了评估，使用oracle几何通道（全景真值）和现成的在线语义分割器（Mask2Former）来展示真实世界性能。该算子生成的提交地图精度显著更高（KITTI中汽车提交精度99.7%对比仅校准算子的43.9%；平均每类IoU 0.522对比0.180），并且在更高精度下保留了比整体式组合VLM提示更多的组合真阳性。该框架在oracle和现成分割器几何通道上均达到部署质量，并且对基础模型替换具有不变性。

英文摘要

Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well-characterized with a foundation-model channel that produces semantic claims about the same scene without calibrated reliability. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (on KITTI, car commit precision is 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU is 0.522 vs. 0.180) and retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
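
The two mechanisms can be pictured as a pair of checks applied before a semantic claim is written into the map. The sketch below is a hypothetical rendering: the `Claim` structure, the per-class threshold table, and the time-window test are all assumptions made for illustration, and the abstract does not describe how the actual operator calibrates its gate or defines a conflict.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    label: str          # class asserted by the foundation-model channel
    confidence: float   # foundation-model score in [0, 1]
    time: float         # timestamp of the claim, in seconds

def should_commit(claim, class_thresholds, geometric_labels, window=0.5):
    """Return True if the claim passes the per-class gate and no conflict is seen.

    class_thresholds: dict mapping class name to its calibrated threshold.
    geometric_labels: list of (timestamp, label) reports from the geometric channel.
    """
    # Per-class commit gate: classes without a calibrated threshold never commit.
    threshold = class_thresholds.get(claim.label)
    if threshold is None or claim.confidence < threshold:
        return False
    # Conflict drop: reject if the geometric channel reports a different
    # label within `window` seconds of the claim.
    for t, label in geometric_labels:
        if abs(t - claim.time) <= window and label != claim.label:
            return False
    return True
```

Keeping the gate and the conflict check separate mirrors the description of two cooperating mechanisms: the first depends only on the class and score, the second only on what the geometric channel says at that moment.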

2606.00310 2026-06-02 cs.CV 版本更新

Where to Refine, When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation

何处精炼，何时停止：通过潜在差异重新思考高效视觉自回归生成中的冗余

Changwang Mei, Peisong Wang, Zekun Li, Changsheng Li, Shuang Qiu, Qinghao Hu, Gang Li, Yifan Zhang, Zhihui Wei, Jian Cheng

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出基于潜在差异（Latent Discrepancy）的无训练剪枝框架LD-Pruning，通过解码无关区域选择和自适应无条件分支跳过，在视觉自回归模型中实现高达2.35倍加速并保持生成质量。

详情

AI中文摘要

视觉自回归（VAR）模型能够生成高质量图像，但在高分辨率下存在显著的推理延迟。最近的加速方法大多依赖基于层特征的启发式度量来剪枝令牌。这些启发式方法对复杂上下文语义敏感，导致冗余计算识别不准确且跨提示的适应性差。我们从冗余对像素空间生成影响的角度重新思考VAR中的冗余，并引入潜在差异（Latent Discrepancy）。该统一度量通过测量生成过程中模型状态的变化来量化令牌的贡献。我们的分析表明，当以图像潜在或像素空间信号为指导时，冗余识别更准确。我们进一步观察到，在无分类器引导（CFG）中，条件分支与无条件分支之间差异的收敛趋势随不同提示呈现高度动态性。基于这些发现，我们提出LD-Pruning（潜在差异剪枝），一种无训练框架，通过集成解码无关区域选择和自适应无条件分支跳过，利用潜在差异消除冗余。大量实验表明，LD-Pruning在保持高生成质量的同时显著降低推理延迟，在Infinity-8B上实现高达2.35倍加速。

英文摘要

Visual Autoregressive (VAR) models deliver high-quality image generation but suffer from significant inference latency at high resolutions. Recent acceleration approaches mostly rely on heuristic measures computed from layer features to prune tokens. Such heuristics are sensitive to complex contextual semantics, leading to inaccurate identification of redundant computation and poor adaptability across prompts. We rethink redundancy in VAR from the perspective of its impact on pixel-space generation and introduce Latent Discrepancy. This unified metric quantifies a token's contribution by measuring the change in model states during generation. Our analysis shows that redundancy is more accurately identified when guided by image latent or pixel-space signals. We further observe that in classifier-free guidance (CFG), the convergence trend of the discrepancy between the conditional and unconditional branches varies strongly across prompts. Based on these findings, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that removes redundancy via latent discrepancy by integrating decoding-free region selection and adaptive unconditional-branch skipping. Extensive experiments show that LD-Pruning substantially reduces inference latency while maintaining high generation quality, achieving up to 2.35x speedup on Infinity-8B.
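
Adaptive unconditional-branch skipping can be sketched on top of the standard classifier-free guidance formula, guided = uncond + scale * (cond - uncond). The example below caches the conditional/unconditional discrepancy and stops evaluating the unconditional branch once that discrepancy stops changing. The stopping rule, the tolerance, and the assumption that outputs keep the same shape between steps are ours; the abstract does not specify LD-Pruning's actual criterion.

```python
import numpy as np

def guided_logits(cond, uncond_fn, cache, scale=3.0, tol=1e-2):
    """Classifier-free guidance that reuses a cached discrepancy once it stabilizes.

    cond:      conditional-branch output for this step.
    uncond_fn: zero-argument callable that runs the unconditional branch.
    cache:     dict carried across steps (starts empty).
    """
    if cache.get("skip"):
        d = cache["d"]  # unconditional branch is not evaluated
    else:
        d = cond - uncond_fn()
        prev = cache.get("d")
        if prev is not None:
            # Relative change of the discrepancy since the previous step.
            rel = np.linalg.norm(d - prev) / (np.linalg.norm(prev) + 1e-12)
            cache["skip"] = rel < tol
        cache["d"] = d
    # Equivalent to uncond + scale * (cond - uncond) when d is fresh.
    return cond + (scale - 1.0) * d
```

Because the check is made per generation, a prompt whose discrepancy settles early skips more unconditional passes than one whose discrepancy keeps moving, which is the prompt-dependent behavior the abstract points to.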

2606.00299 2026-06-02 cs.CV cs.AI 版本更新

Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion

Real2SAM2Real: 生成式3D缓存作为视频扩散的互补上下文

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos

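Affiliations: University of Maryland

AI summary: Proposes the Real2SAM2Real framework, which uses a 3D lifting model to extract an editable 3D cache as a geometric scaffold and combines it with soft spatial-aligned injection and a fine-tuning strategy, giving video diffusion models precise, decoupled control over camera trajectories and multi-entity motion.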

AI中文摘要

虽然视频扩散模型（VDM）在合成高保真视频方面表现出色，但实现精确的相机和场景控制仍然具有挑战性。现有方法主要依赖隐式扩散先验来生成未观察区域，在高动态运动或复杂遮挡期间不可避免地导致结构崩溃。为了解决这一挑战，我们提出了Real2SAM2Real框架，该框架利用3D提升模型（例如SAM3D）提取显式可编辑的3D缓存，作为VDM的稳健几何支架。通过捕获前景实体的整个3D体积而不仅仅是其可见外壳，该缓存将整体空间先验注入VDM，为复杂场景动态提供可靠的3D感知指导。为了有效利用这种3D指导同时保留预训练先验，我们设计了一种软空间对齐注入机制以及一种针对VDM量身定制的微创微调策略。此外，我们采用掩码法线图作为跨模态桥梁，构建了无3D数据的数据整理和扰动流程。大量实验表明，Real2SAM2Real能够对相机轨迹和多实体运动实现精确、解耦的控制。通过利用生成式3D缓存的互补上下文，我们的框架克服了因过度依赖扩散先验而导致的典型崩溃，在大的相机位移和严重遮挡下保持了卓越的时空一致性。关键的是，通过将几何与外观解耦，我们为VDM定制的3D缓存消除了由结构空洞和错误立面引起的视角歧义，以及反射和折射引起的误导性线索。项目网站见https://jiayi-wu-leo.github.io/real2sam2real。

英文摘要

While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at https://jiayi-wu-leo.github.io/real2sam2real


2606.00275 2026-06-02 cs.CV cs.AI 版本更新

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

面向大型视觉-语言模型的双曲与证据优先专家

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

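Affiliations: China University of Petroleum (Beijing); Hainan Institute of China University of Petroleum (Beijing); South China Normal University

AI summary: Targets the asymmetry between the visual and linguistic modalities in large vision-language models by proposing the AsyMoE architecture, in which hyperbolic inter-modality experts model hierarchical relationships and evidence-priority language experts preserve contextual grounding, improving performance while activating fewer parameters.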

AI中文摘要

大型视觉-语言模型（LVLMs）通过扩展架构和大量训练在多模态任务上展现了令人印象深刻的性能。近期研究将混合专家（MoE）引入LVLMs以提高计算效率。然而，现有的MoE方法以对称架构处理视觉和语言模态，忽视了这两种模态在处理方式上固有的不对称性。这种不对称性导致两个关键问题。首先，文本和视觉形成层级而非并行关系，因为文本查询通常描述完整视觉场景的部分方面。欧几里得专家空间难以编码这种包含结构。其次，深层语言专家逐渐从基于证据的处理转向参数记忆依赖，失去对提供的视觉和语言信息的立足点。为解决这些问题，我们提出AsyMoE，一种通过三个专门专家组显式建模这种不对称性的新型架构。模态内专家负责模态特定的处理。双曲跨模态专家通过负曲率几何捕获层级跨模态关系。证据优先语言专家抑制参数记忆激活并在整个网络深度中保持上下文基础。大量实验表明，AsyMoE相比基线方法取得一致改进，平均比MoE变体提升1.5%，在幻觉敏感任务上提升高达3.8%。与密集模型相比，AsyMoE激活参数减少25.45%。

英文摘要

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
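To make "negative curvature geometry" concrete, here is a minimal Python sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model AsyMoE uses, so the choice of the Poincaré ball (curvature -1) and the function name are assumptions for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    def sq_norm(x):
        return sum(a * a for a in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [0.9, 0.0]
# Equal Euclidean steps cost more hyperbolic distance near the boundary,
# which is what lets tree-like (containment) structure embed with low distortion.
print(poincare_distance(origin, near), poincare_distance(origin, far))
```

Distances from the origin follow d = ln((1 + r) / (1 - r)), so they grow without bound as a point approaches the boundary of the ball.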


2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO 版本更新

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

StressDream: 引导视频世界模型实现鲁棒的策略评估与改进

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

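Affiliations: Carnegie Mellon University; NVIDIA Research; University of Washington; Stanford University

AI summary: Proposes StressDream, which optimizes the initial noise of diffusion-based video world models to steer generation at inference time toward high-impact yet plausible futures, supporting robust policy evaluation and improvement.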

Comments Project page: https://junwon.me/StressDream/

AI中文摘要

视频世界模型通过想象以自我机器人动作为条件的真实未来观察，在策略评估与改进方面展现出潜力。虽然世界模型可以对未来的分布进行建模，但策略评估与改进通常依赖于名义上的想象，这可能会遗漏机器人动作的高影响结果，除非抽取大量样本。为了实现对世界模型想象的鲁棒策略评估与改进，我们提出StressDream，该方法通过在推理时优化扩散世界模型的初始噪声，将想象引导至高影响且合理的结果。然而，优化高维噪声具有挑战性：优化必须推理生成视频中细微的、场景相关的目标事件，同时避免产生不合理想象的分布外噪声。我们通过两个互补目标来解决这一问题：一个语义目标，利用视觉语言模型通过推理生成视频提供信息丰富的梯度；一个合理性目标，防止优化后的噪声漂移到分布外。利用用于自动驾驶和机器人操作的最先进的视频世界模型，我们展示了StressDream能够有效地将想象引导至推理时由文本指定的高影响且合理的结果，例如任务失败，从而通过识别那些合理未来包含不良结果的动作，实现鲁棒的策略评估与改进。视频结果见https://junwon.me/StressDream/。

英文摘要

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.


2606.00261 2026-06-02 cs.CV physics.soc-ph 版本更新

The Harsh Truth: Segment-Level Analysis of Harsh Driving Events in Milan Using Large-Scale Telematics, Street Networks, and Google Street View

残酷真相：基于大规模远程信息处理、街道网络和谷歌街景的米兰激烈驾驶事件路段级分析

Andrea La Grotteria, Paolo Santi, Titus Venverloo, Umberto Fugiglando, Carlo Ratti

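Affiliations: Massachusetts Institute of Technology

AI summary: Combines large-scale telematics, traffic metrics, street-network attributes, and visual features from Google Street View, and applies non-parametric tests and machine-learning regression to characterize harsh driving events at the segment level across Milan's urban road network. Wider carriageways, crossings, transit stops, and more open visual fields are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity; a cycling-infrastructure case study reveals an intensity gradient across facility types.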

AI中文摘要

警方报告的碰撞统计数据仍然是城市道路安全评估的标准输入，但其不完整性和报告滞后限制了其在及时、细粒度干预设计中的实用性。激烈加速和制动事件被广泛用作替代安全指标，但迄今为止仅在相对较小的城市样本中进行了研究。本研究分析了米兰城市道路网络中的激烈事件，结合了来自超过420万辆配备车载单元的车辆的高分辨率远程信息处理数据、TomTom的路段级交通指标、OpenStreetMap的街道网络和基础设施属性，以及通过使用OneFormer模型进行语义分割从谷歌街景中提取的视觉街景特征。我们采用了一个分析框架，结合了高、低激烈组之间路段特征分布的非参数Mann-Whitney U检验和监督机器学习回归器。我们发现，在控制暴露量后，更宽的车道、交叉口和公交站以及更开阔的视野（更高的天空和道路像素比例）与更高的激烈事件强度相关，而更密集的建筑正面与较低的强度相关。最后，自行车基础设施案例研究确定了不同设施类型之间激烈事件强度的梯度：相对于物理隔离的自行车道，仅标线的自行车道与19.5%更高的激烈评分相关，混合交通配置与11.5%更高的评分相关，条件取决于包含的控制变量。这些结果支持针对具体情境而非统一的城市安全干预措施，并说明了大规模远程信息处理结合开放地理空间和视觉数据如何为大都市尺度的零死亡愿景决策提供信息。

英文摘要

Police-reported crash statistics remain the standard input for urban road-safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine-grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high-resolution telematics from more than 4.2 million vehicles equipped with On-Board Units, segment-level traffic metrics from TomTom, street-network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non-parametric Mann-Whitney U tests of segment-feature distributions between high- and low-harshness groups with supervised machine-learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky- and road-pixel proportions) are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling-infrastructure case study identifies a gradient in harsh-event intensity across facility types: markings-only cycle lanes are associated with a 19.5% higher harshness score, and mixed-traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context-specific rather than uniform urban-safety interventions and illustrate how large-scale telematics combined with open geospatial and visual data can inform Vision Zero decision-making at the metropolitan scale.
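For readers unfamiliar with the Mann-Whitney U test used to compare high- and low-harshness segments, the following Python sketch computes the U statistic directly from its pairwise definition. The sample values are made up for illustration and are not data from the study; a real analysis would also need the p-value, which this sketch does not compute.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: number of pairs (a, b) with a > b, ties counting 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical carriageway widths (metres) for two groups of road segments.
high_harshness = [7.5, 9.0, 10.5]
low_harshness = [5.0, 6.0]

u_high = mann_whitney_u(high_harshness, low_harshness)
u_low = mann_whitney_u(low_harshness, high_harshness)
# The two statistics always sum to len(x) * len(y).
print(u_high, u_low)
```

Dividing U by the number of pairs gives the probability that a randomly chosen segment from the first group has a larger value than one from the second, which is a convenient effect size to report alongside the test.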


2606.00204 2026-06-02 cs.CV 版本更新

APE: Agentic Prompt Enhancer for Image Generation and Editing

APE: 用于图像生成与编辑的智能提示增强器

Zijian Huang, Jay Zhangjie Wu, Zian Wang, Tianshi Cao, Jiasi Chen, Sanja Fidler, Huan Ling, Xuanchi Ren

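Affiliations: NVIDIA; University of Michigan

AI summary: Proposes the APE framework, which post-trains small language models as prompt-enhancement agents that improve prompt quality for text-to-image generation and editing in single-agent or multi-agent form, without modifying the downstream visual model.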

Comments Project Page: https://research.nvidia.com/labs/sil/projects/ape/

AI中文摘要

自然语言已成为图像生成和编辑的强大接口，但文本引导的视觉系统对提示表述高度敏感。语义相似的请求可能因措辞、具体性以及视觉约束的明确程度而产生不同输出，这促使将提示增强作为可训练组件而非外围用户选择。现有的强增强器通常依赖大型专有LLM（如ChatGPT或Gemini），增加了视觉生成流水线的成本、延迟和部署依赖性。我们提出智能提示增强器（APE），一种轻量级框架，将小型语言模型（SLM）后训练为提示增强代理。APE支持单代理重写和角色专用多代理增强。其单代理实例SAPE一次性重写提示，而多代理实例MAPE将增强分解为路由器-重写器-组合器过程，以处理对象、属性、空间关系和编辑的组合约束。通过任务感知奖励和后训练协议，APE在不修改下游视觉模型的情况下改善了视觉对齐和提示遵循。在具有挑战性的图像生成和编辑基准上的实验表明，后训练的小型提示增强器可靠地优于其基础对应物，缩小了与闭源提示增强器的差距；此外，MAPE在这些基准中的复杂组合任务上表现尤为强劲。

英文摘要

Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router-rewriter-composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.


2606.00191 2026-06-02 cs.RO cs.CV 版本更新

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Safe2Drive: 评估端到端自动驾驶模型的安全驾驶行为

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

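Affiliations: Carnegie Mellon University; Birla Institute of Technology and Science Pilani

AI summary: Addresses the brittleness of end-to-end autonomous driving models in common safety-critical scenarios by introducing the Safe2Drive scenario set and the SafeDriving Score (SDS); evaluation shows that leading models' driving scores drop sharply on these scenarios and that their SDS is low.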

Journal ref: CVPR Workshops 2026

AI中文摘要

最近的端到端（E2E）自动驾驶策略在闭环模拟中取得了高驾驶得分。然而，这些策略是否能够处理常见的安全关键场景仍不清楚。我们提出了Safe2Drive（S2D），一组与Bench2Drive对齐的场景扩展，重点关注三类常见的道路危险：施工区、行人乱穿马路和被遮挡的弱势道路使用者（VRU）。Safe2Drive增加了100个常见但具有挑战性的场景，并引入了安全驾驶评分（SDS），这是一种以安全为中心的度量，在先前评估器的基础上增加了碰撞前制动、施工区物体接触、车道居中和平滑性检查。在S2D上评估两种最先进的策略（LEAD和SimLingo），我们发现它们的驾驶得分相对于报告的Bench2Drive基线急剧下降（LEAD：从Bench2Drive上的94.70 DS下降到S2D上的39.95 DS；SimLingo：从Bench2Drive上的85.07 DS下降到S2D上的41.00 DS），并且S2D上的SDS较低（LEAD为11.85，SimLingo为15.27）。这些结果与脆弱的安全驾驶行为一致，例如对施工区理解差、闯红灯以及行人制动延迟或缺失。这项研究突显了E2E模型即使在训练集包含的CARLA城镇上进行测试时也缺乏安全行为推理。我们计划发布所有100个S2D场景的代码和视频。

英文摘要

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
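The size of the reported drop is easier to judge as a relative change. The short Python snippet below recomputes it from the driving scores quoted in the abstract; only the helper name `relative_drop` is introduced here.

```python
def relative_drop(baseline, new):
    """Percentage decrease from baseline to new."""
    return (baseline - new) / baseline * 100.0


# Driving scores (DS) quoted in the abstract: (Bench2Drive, Safe2Drive).
scores = {"LEAD": (94.70, 39.95), "SimLingo": (85.07, 41.00)}

drops = {name: round(relative_drop(b2d, s2d), 1) for name, (b2d, s2d) in scores.items()}
print(drops)  # LEAD loses about 57.8% of its score, SimLingo about 51.8%
```

Both policies lose more than half of their Bench2Drive driving score on the Safe2Drive scenarios.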


2606.00174 2026-06-02 cs.CV cs.AI 版本更新

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

MyoSem: 将肌电图与自然语言动作语义对齐以实现手部动作理解

Chiyue Wang, Dong She, Yang Gao, Zhanpeng Jin

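Affiliations: South China University of Technology

AI summary: Proposes the MyoSem framework, which combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment to enable bidirectional retrieval between EMG signals and text descriptions; it outperforms most baselines on several datasets and generalizes well.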

Comments 16 pages, 9 figures. Preprint

AI中文摘要

肌电图（EMG）直接反映肌肉激活，是手势识别、假肢控制和可穿戴交互的关键传感模态。然而，现有的EMG方法通常将手部动作理解视为固定标签的分类问题，难以支持基于动作描述的查询、检索和泛化。我们提出MyoSem，一个EMG-动作语义对齐框架，将低层EMG信号映射到由多视角动作描述构建的共享语义空间。MyoSem结合多视角动作语义构建、激活感知EMG编码和语义查询对齐，实现了EMG信号与文本描述之间的双向检索。我们在EMG2Pose和NinaPro系列数据集上系统评估了MyoSem。结果表明，MyoSem在EMG-文本双向检索上表现良好，普遍优于大多数基线，并在未见用户、保留动作类别和截肢用户迁移场景中展现出良好的泛化性。消融实验和可视化进一步验证了每个模块的有效性。总体而言，MyoSem将基于EMG的手部动作理解从固定标签识别推进到可查询的双向语义检索，为语言介导的EMG动作理解提供了新的建模范式。

英文摘要

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and wearable interaction. Existing EMG methods, however, commonly formulate hand action understanding as classification over fixed labels, making it difficult to support querying, retrieval, and generalization based on action descriptions. We present MyoSem, an EMG-action semantic alignment framework that maps low-level EMG signals into a shared semantic space constructed from multi-view action descriptions. MyoSem combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment, enabling bidirectional retrieval between EMG signals and text descriptions. We systematically evaluate MyoSem on EMG2Pose and NinaPro-series datasets. Results show that MyoSem performs well on EMG-text bidirectional retrieval, generally outperforms most baselines, and shows favorable generalization to unseen users, held-out action classes, and amputee-user transfer scenarios. Ablations and visualizations further validate the effectiveness of each module. Overall, MyoSem advances EMG-based hand action understanding from fixed-label recognition toward queryable bidirectional semantic retrieval, providing a new modeling paradigm for language-mediated EMG action understanding.
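Bidirectional retrieval in a shared embedding space can be sketched in a few lines of Python. The example below assumes that both modalities have already been encoded into vectors of the same dimension and that matching pairs share an index; the toy embeddings, cosine similarity, and the Recall@1 metric are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery):
    """Index of the gallery embedding most similar to the query."""
    return max(range(len(gallery)), key=lambda i: cosine(query, gallery[i]))


def recall_at_1(queries, gallery):
    """Fraction of queries whose top match is the item with the same index."""
    hits = sum(1 for i, q in enumerate(queries) if retrieve(q, gallery) == i)
    return hits / len(queries)


# Toy embeddings: row i of each list describes the same hand action.
emg_embeddings = [[1.0, 0.0], [0.0, 1.0]]
text_embeddings = [[0.9, 0.1], [0.2, 0.8]]

emg_to_text = recall_at_1(emg_embeddings, text_embeddings)
text_to_emg = recall_at_1(text_embeddings, emg_embeddings)
print(emg_to_text, text_to_emg)
```

Because the same similarity function serves both directions, a new action description can be queried without retraining a classifier head, which is the practical difference from fixed-label recognition.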


2606.00170 2026-06-02 cs.HC cs.AI cs.CV 版本更新

面向人类与机器的可扩展图像编码的无训练连续码率控制

Yui Tatsumi, Hiroshi Watanabe

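Affiliations: University of Tokyo

AI summary: Proposes a training-free, variable-rate scalable image coding framework that achieves continuous rate control by adjusting the quantization step size according to predicted scale values, while preserving high-scale information in the machine and enhancement layers.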

2606.00153 2026-06-02 cs.CV cs.AI 版本更新

DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion

DiffCrossGait：基于潜在扩散的2D-3D跨模态步态识别轨迹级对齐

Zhiyang Lu, Ming Cheng

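Affiliations: University of Science and Technology of China

AI summary: Addresses the domain gap in 2D-3D cross-modal gait recognition with DiffCrossGait, which aligns the modalities continuously through trajectory-level alignment in a latent diffusion space and introduces a Tri-Phase Alignment Strategy to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, reaching state-of-the-art performance on the SUSTech1K and FreeGait benchmarks.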

Comments Accepted by ICML2026

AI中文摘要

跨模态2D-3D步态识别受到2D轮廓和3D LiDAR距离视图表示之间固有域差异的阻碍。虽然先前的方法仅对齐最终嵌入，我们提出DiffCrossGait，将跨模态匹配重新表述为身份相关潜在扩散空间中的轨迹级对齐，而不是假设2D和3D观测完全等价。通过在潜在空间中使用共享高斯噪声驱动两种模态，我们实现了生成演化过程中的连续对齐。我们引入了一种三阶段对齐策略，利用不同的噪声强度来强制身份锚定、动态一致性和跨模态结构可恢复性，从而约束两种模态共享去噪动态和瓶颈结构，促进模态不变的步态特征。关键的是，我们的框架将生成对齐与判别骨干解耦；扩散机制仅作为训练目标，通过消除迭代去噪的计算开销确保高推理效率。在SUSTech1K和FreeGait基准上的大量实验表明，DiffCrossGait达到了最先进的性能。

英文摘要

Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.
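The idea of driving both modalities with shared Gaussian noise can be illustrated with a small Python sketch. It assumes a DDPM-style forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps, and toy two-dimensional latents; the function names and the noise schedule value are hypothetical, and the paper's actual formulation may differ.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """DDPM-style forward step applied elementwise to a latent vector."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def shared_noise_pair(z_2d, z_3d, alpha_bar, rng):
    """Noise the 2D and 3D latents with the SAME Gaussian sample."""
    eps = [rng.gauss(0.0, 1.0) for _ in z_2d]
    return add_noise(z_2d, eps, alpha_bar), add_noise(z_3d, eps, alpha_bar)


rng = random.Random(0)
z_2d = [1.0, 2.0]   # toy silhouette latent
z_3d = [0.5, 1.5]   # toy LiDAR range-view latent
alpha_bar = 0.64    # assumed cumulative noise schedule value

noisy_2d, noisy_3d = shared_noise_pair(z_2d, z_3d, alpha_bar, rng)
# With shared noise the stochastic term cancels in the difference, leaving
# sqrt(alpha_bar) * (z_2d - z_3d): the modality gap stays comparable at every noise level.
gap = [p - q for p, q in zip(noisy_2d, noisy_3d)]
print(gap)
```

With independent noise per modality the difference would also contain a random term, so a trajectory-level comparison between the two modalities would be confounded by sampling noise.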


2606.00148 2026-06-02 cs.CV cs.AI 版本更新

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

StemBind: 当多模态大语言模型在抽象视觉推理中迷失于规则与实例之间

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng, Qiyao Sun, Xuanyu Ji, Qingyong Hu

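Affiliations: University of Science and Technology of China

AI summary: Proposes StemBind, a diagnostic benchmark that probes a shared stem with three aligned questions (Perception, Rule, Full) to locate where MLLMs fail in abstract visual reasoning, and finds that rule-to-instance binding is the main bottleneck.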

Comments Project page: https://hexixiang.github.io/StemBind

AI中文摘要

多模态大语言模型（MLLM）常常知道规则但选错答案：在抽象视觉推理（AVR）任务中，模型可以描述所见内容并命名底层模式，但仍然无法选择匹配的候选。现有的 AVR 基准无法检测到这一点，因为它们将感知、规则归纳和答案选择合并为一个单一的对错信号。我们引入了 StemBind，一个共享主干的诊断基准，它用三个对齐的问题探测同一视觉主干：感知（图像中有什么）、规则（支配它的模式是什么）和完整（哪个选项完成它），因此最终答案的错误可以归因于同一证据上的特定子步骤。StemBind 包含 2,298 个经过精心策划的知识精简主干，涵盖九种可审计的视觉操作，总计 19,533 个 P/R/F 任务，每个完整项目都通过 Sternberg 的四个推理阶段（S1 编码、S2 推断、S3 映射、S4 应用）进行标注。评估 24 个前沿 MLLM 配置得出四个发现。（i）R-F 鸿沟：在 24 个模型中的 22 个上，规则准确率超过完整项目准确率，因此大多数失败发生在规则被识别之后。（ii）持续的绑定差距：即使在同一主干上 P 和 R 都正确，模型仍有 51.2% 的时间错误回答 F。（iii）瓶颈是 S3：过程诊断和阶段式刺激增强将主要失败定位到规则到实例的映射。（iv）扩展和思考无济于事：更大的模型和显式思考模式都无法可靠地缩小差距，思考甚至降低了规则和完整项目的准确率。StemBind 将 AVR 评估从最终答案排名重新定义为定位抽象视觉推理失败的位置，将规则到实例的绑定确定为视觉基础推理的具体下一个目标。

英文摘要

Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg's four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.
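The binding gap in finding (ii) is a conditional error rate: among stems where both the Perception and Rule questions are answered correctly, how often is the Full question still wrong? The Python sketch below shows one way to compute it from per-stem results; the record layout and the sample data are hypothetical and not taken from the benchmark.

```python
def binding_gap(records):
    """Share of stems with P and R both correct on which F is answered incorrectly.

    records -- iterable of (p_correct, r_correct, f_correct) booleans, one per stem.
    Returns None when no stem has both P and R correct.
    """
    eligible = [f_ok for p_ok, r_ok, f_ok in records if p_ok and r_ok]
    if not eligible:
        return None
    return sum(1 for f_ok in eligible if not f_ok) / len(eligible)


# Made-up results for four stems.
results = [
    (True, True, True),    # perceived, named the rule, picked the right option
    (True, True, False),   # knew the rule but picked the wrong option
    (True, False, False),  # excluded: rule question wrong
    (False, True, True),   # excluded: perception question wrong
]
print(binding_gap(results))  # 0.5
```

Conditioning on the same stem is what separates this measure from a plain accuracy difference between the Rule and Full questions.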


2606.00146 2026-06-02 eess.IV cs.AI cs.CV 版本更新

Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts

多对比度MRI运动校正：基于参数信息解缠与自适应专家网络

Honglin Xiong, Yuxian Tang, Feng Li, Yulin Wang, Lei Xiang, Dinggang Shen, Qian Wang

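Affiliations: ShanghaiTech University

AI summary: Proposes a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction: ScanCLIP extracts contrast embeddings to separate out anatomical content, and a Vision Transformer estimates motion severity and routes features to a Mixture-of-Experts network, correcting motion artifacts across contrasts and severities and outperforming existing methods on the IXI and HCP benchmarks.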

AI中文摘要

磁共振成像中的运动伪影降低了诊断可靠性。现有的深度学习方法通常针对特定对比度，无法泛化到不同模态和伪影严重度。我们提出一个统一框架，结合参数信息对比度解缠与严重度感知自适应校正。ScanCLIP在超过30,000个MRI文本-图像对上预训练，从采集参数中导出对比度嵌入，将对比度风格与解剖内容分离，得到无对比度特征。然后，视觉Transformer估计运动严重度，并通过专家混合网络路由特征，实现针对性伪影校正。双路径解码器重建干净图像和残差伪影图，强制执行图像空间一致性。在IXI和HCP基准上，我们的方法在PSNR上提升0.75 dB，SSIM最高提升0.0279，优于现有方法，且在更高伪影严重度下增益更大。该方法在真实临床数据上展现出鲁棒的零样本泛化能力，这些数据使用未见过的扫描参数采集，而现有方法要么无法去除伪影，要么引入额外失真。

英文摘要

Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.
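For context on the reported 0.75 dB gain, PSNR is defined as 10 * log10(MAX^2 / MSE), so a fixed dB improvement corresponds to a fixed ratio of mean squared errors. The Python sketch below implements the definition on flattened images and converts a dB gain into that ratio; the toy pixel values are illustrative only.

```python
import math


def psnr(reference, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized, flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


clean = [0.0, 0.0, 0.0, 0.0]
corrupted = [0.1, 0.1, 0.1, 0.1]
print(psnr(clean, corrupted))    # close to 20 dB, since the MSE is 0.01
print(mse_ratio_for_gain(0.75))  # about 0.84
```

A 0.75 dB gain therefore corresponds to roughly 16% lower mean squared error, assuming the same peak value for both methods.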


2606.00139 2026-06-02 cs.CV cs.AI 版本更新

Geodesics with Unified Tangent-constrained Priors and Curvature Regularization

具有统一切线约束先验和曲率正则化的测地线

Chong Di, Li Liu, Jinglin Zhang, Zhenjiang Li, Da Chen, Laurent D. Cohen

Affiliations: Shandong Artificial Intelligence Institute, Qilu University of Technology (Shandong Academy of Sciences); Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine; School of Control Science and Engineering, Shandong University; Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences; CEREMADE, Université Paris Dauphine, Université-PSL, CNRS, UMR 7534

AI summary: Proposes a geodesic framework that combines tangent-constrained priors with curvature penalization in the orientation-lifted space and solves the resulting HJB PDEs efficiently with fast marching variants, improving the robustness of image segmentation for objects with complex shapes.

AI中文摘要

曲率惩罚的测地线模型通过计算全局最优曲线在图像分割中证明了其有效性。不幸的是，当描绘具有复杂形状和图像强度分布的对象时，这些模型仍然容易受到捷径的影响，因为它们缺乏强制执行形状感知切线约束的机制。为了解决这一局限性，我们提出了一种统一的测地线框架，该框架将切线约束先验与曲率惩罚相结合。关键思想是直接在方向提升空间内制定切线可接受性，其中路径切线被限制在由内在形状代表（ISR）（如骨架或内部地标）导出的空间变化角度扇区内。这一公式产生了一系列切线约束的芬斯勒度量，扩展了经典的曲率惩罚测地线模型，同时强制执行强制切线约束。由此产生的Hamilton-Jacobi-Bellman（HJB）偏微分方程（PDE）可以通过快速行进法的变体进行高效数值求解，保持了单次通过的计算复杂度。在合成、自然和医学图像上的实验表明，所提出的测地线框架确实提高了对弱边界和拓扑捷径的鲁棒性，与现有测地线模型相比，产生了具有增强形状保真度的分割结果。

英文摘要

Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunately, these models remain susceptible to shortcuts when delineating objects with complex shapes and image intensity distributions, as they lack mechanisms to enforce shape-aware tangent constraints. To address this limitation, we propose a unified geodesic framework that integrates tangent-constrained priors with curvature penalization. The key idea is to formulate tangent admissibility directly within the orientation-lifted space, where path tangents are restricted to spatially varying angular sectors derived from intrinsic shape representatives (ISR) such as skeletons or interior landmarks. This formulation gives rise to a family of tangent-constrained Finslerian metrics, extending the classical curvature-penalized geodesic models while enforcing mandatory tangent constraints. The resulting Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) admit efficient numerical solutions via variants of the fast marching method, preserving the single-pass computational complexity. Experiments on synthetic, natural, and medical images demonstrate that the proposed geodesic framework indeed improves robustness against weak boundaries and topological shortcuts, yielding segmentation results with enhanced shape fidelity compared to existing geodesic models.
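The tangent admissibility constraint can be pictured as a test on the path orientation: a tangent angle is allowed only if it falls inside an angular sector attached to the current position. The Python sketch below shows that test with correct wrap-around at 2*pi. In the paper the sectors are derived from intrinsic shape representatives; here the sector centre and half-width are plain hypothetical parameters, and the link to the Finslerian metric is not modelled.

```python
import math


def wrapped_difference(a, b):
    """Signed difference a - b between two angles, mapped to (-pi, pi]."""
    d = (a - b) % (2.0 * math.pi)  # now in [0, 2*pi)
    return d - 2.0 * math.pi if d > math.pi else d


def is_admissible(theta, sector_centre, half_width):
    """True if the tangent orientation theta lies inside the angular sector."""
    return abs(wrapped_difference(theta, sector_centre)) <= half_width


# A sector centred just below 2*pi still admits small positive angles.
print(is_admissible(0.1, 2.0 * math.pi - 0.1, 0.3))  # True
print(is_admissible(math.pi, 0.0, 0.3))              # False
```

In a metric-based formulation, an inadmissible orientation would typically be assigned infinite cost, so the front propagation never advances along it.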


2606.00137 2026-06-02 cs.CV cs.GR 版本更新

Advances in Neural 3D Mesh Texturing: A Survey

神经3D网格纹理化进展：综述

Sai Raj Kishore Perla, Hao Zhang, Ali Mahdavi-Amiri

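Affiliations: Simon Fraser University

AI summary: Surveys recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion, and organizes them into a unified taxonomy.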

Comments Eurographics STAR (Computer Graphics Forum), 2026. Project Page: https://sairajk.github.io/neural-mesh-texturing/

DOI: 10.1111/cgf.70392
Journal ref: Eurographics STAR (State of The Art Report), Computer Graphics Forum, Volume 45, Number 2, 2026

AI中文摘要

3D网格纹理化在决定数字对象和场景的视觉真实感中起着至关重要的作用。尽管最近基于神经辐射场和高斯泼溅的生成式3D方法可以直接生成带纹理的资产，但多边形网格仍然是建模、动画、视觉效果和游戏管线中的核心表示。因此，神经3D网格纹理化仍然是一个重要且活跃的研究领域。在本综述中，我们对神经3D网格纹理化的最新进展进行了全面回顾，涵盖了纹理合成、迁移和补全的方法。我们首先总结了网格几何、纹理映射、可微渲染和神经生成模型的关键基础，然后将文献组织成一个统一的分类体系，涵盖从早期基于GAN的方法到现代基于扩散的管线。我们进一步分析了常见的架构和监督策略，回顾了数据集和评估协议，并讨论了新兴应用、实际/商业系统以及开放挑战。这些见解共同为当前格局提供了结构化的视角，并有助于指导基于学习的3D网格纹理化的未来发展。

英文摘要

Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.


2606.00124 2026-06-02 cs.CV cs.LG 版本更新

Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

位置编码锚定视觉Transformer中的空间结构：基于几何视角的鲁棒性研究

Mahmoud Mannes

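Affiliations: ESSTHS

AI summary: Introduces the Spatial Similarity Distance Correlation (SSDC) metric to study how different positional encodings affect the geometry of internal spatial representations in Vision Transformers, and finds that positional encodings are associated with an index-anchored spatial organization that accompanies improved robustness under content-disrupting distribution shifts.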

Comments 16 pages (9 main text, 7 appendix). 5 figures (3 main text, 2 appendix) with 8 graphics total. 5 tables (1 main text, 4 appendix). Submitted to NeurIPS 2026 main conference and the ICML 2026 mechanistic interpretability workshop

AI中文摘要

视觉Transformer中的位置嵌入（PEs）已知会影响性能和鲁棒性，但它们在塑造内部空间表示中的作用尚不明确。本文研究了不同形式的PEs如何影响ViT的表示几何结构，以及这些变化如何与内容破坏性分布偏移下的鲁棒性相关。我们引入了一个度量——空间相似性距离相关性（SSDC），用于量化token表示中的空间结构。利用该度量，我们发现未使用PEs训练的ViT仍会发展出非平凡的空间结构，但这种结构由视觉内容驱动，并在token置换下崩溃。相反，所有考虑的PEs（可学习绝对位置编码、正弦位置编码和旋转位置编码）都与向索引锚定空间组织的一致转变相关。这些模型中的表示在破坏内容的扰动下保持稳定，并对这类分布偏移表现出显著增强的鲁棒性。我们进一步表明，尽管不同的PEs产生不同的空间结构深度轨迹，但其鲁棒性属性大致相似（编码方案间存在次要差异），这表明鲁棒性似乎更依赖于稳定的位置参考框架的存在，而非特定的编码机制。这些结果为位置编码如何塑造内部表示提供了几何解释，并对未来编码方案的原则性设计具有启示意义。

英文摘要

Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.


2606.00123 2026-06-02 cs.CV cs.AI cs.LG (updated version)

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

Affiliations: Beijing Academy of Artificial Intelligence; Beijing Anzhen Hospital; Beihang University; King Abdullah University of Science and Technology

AI summary: Proposes the CardioLens testbed, which evaluates 24 multimodal large language models on multi-sequence cardiac MRI and finds that they perform poorly along the clinical workflow, show a category-collapse failure mode, and benefit little from changes to input selection or reasoning prompts.

AI Chinese abstract

多模态大语言模型在公共医学基准上表现出色，但现有评估通常依赖于孤立输入和简化识别任务，难以作为临床使用的有效代理。我们提出了CardioLens，一个针对多序列心血管磁共振的无泄漏评估测试平台，通过严格的报告到QA构建和验证流程，从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对，涵盖4D Cine、LGE、灌注和T2加权成像，并评估CMR解读的三个阶段：图像理解、报告生成和疾病诊断。在24个最先进的MLLM上，CardioLens揭示了显著的临床现实差距：模型整体表现不佳，性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式，模型倾向于默认频繁出现的异常类别，而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因，我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议；性能变化很小，通常约为1%。显式推理提示也无法挽救性能，往往使模型更加保守，而不是改善视觉证据的使用。这些结果表明，当前MLLM远未达到可靠的CMR解读，临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

English abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.


2606.00121 2026-06-02 cs.CV cs.AI (updated version)

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang, Huiguang He

Affiliations: State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology; Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: Proposes MindDiffuser, a two-stage framework that combines decoded CLIP text embeddings and CLIP visual features: Stable Diffusion first generates an image carrying the semantics, which is then iteratively refined to align structural information, significantly improving image reconstruction across fMRI, EEG, and MEG.

AI Chinese abstract

从大脑记录中重建视觉刺激一直是脑解码中一项有意义且具有挑战性的任务。特别是，实现精确且可控的图像重建对于推动脑机接口的进步和应用具有重要意义。最近的方法利用文本到图像生成模型的能力，在语义（如概念和对象）方面重建了接近复杂自然刺激的图像。然而，它们在保持与原始刺激在细粒度结构信息（如位置、方向和大小）上的一致性方面存在困难，这削弱了模型的可控性和可解释性。为了解决上述问题，我们提出了一个两阶段图像重建框架，称为MindDiffuser。在第一阶段，从大脑反应解码的对比语言-图像预训练（CLIP）文本嵌入被输入到Stable Diffusion中，生成包含语义信息的初步图像。在第二阶段，我们使用解码的浅层CLIP视觉特征作为监督信号，通过反向传播迭代优化来自第一阶段的特征向量，以对齐结构信息。我们在由视觉刺激引发的三种模态（fMRI、EEG、MEG）的大脑反应数据集上进行了大量实验，结果表明我们的框架显著提升了先前最先进模型的性能，凸显了我们方法的有效性和通用性。空间和时间可视化结果进一步支持了我们框架的神经生物学合理性，为未来跨不同大脑信号模态的神经解码工作提供了指导。

English abstract

Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.


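2606.00115 2026-06-02 cs.CV cs.LG stat.ML (updated version)

Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions

Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong

Affiliations: University of Science and Technology of China

AI summary: Studies the structural identifiability of continuous-time physical laws from raw pixels, proves that an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs under minimal trajectory conditions, and introduces a variance-floor regularizer to stabilize the decoder-free objective.

Comments: Accepted at ICML 2026

AI Chinese abstract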

弥合视觉真实感与物理理解之间的差距是基于视频的世界模型的核心挑战。我们研究从原始像素中辨识连续时间物理定律的结构可辨识性，重点关注编码器-仅管道能否唯一恢复二阶线性ODE的参数。我们证明，一个水平集斜率覆盖条件确保学习到的潜在空间与真实物理状态局部仿射，从而实现精确的参数恢复。我们的理论首次给出了不同阻尼机制下最小数据需求的刻画，建立了欠阻尼系统可从单个视频片段辨识，而其他机制需要三个不同轨迹。我们进一步引入方差底正则化器以稳定无解码器目标并防止潜在坍缩。在合成和真实数据上验证，我们的方法表明，无需计算密集的像素重建，即可从视频中可靠估计可解释的物理常数，确保物理正确性和透明性。代码可在 https://github.com/wenjiewang3/PhysicsFromVideo 获取。

English abstract

Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.
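For reference, the damping regimes mentioned in the abstract follow from the discriminant of the characteristic equation of m·x'' + c·x' + k·x = 0. The sketch below classifies a system and returns the number of trajectories that the abstract states is needed for identifiability. The function names and the (m, c, k) parameterization are assumptions for illustration and are not taken from the paper's repository.

```python
def damping_regime(m: float, c: float, k: float) -> str:
    """Classify m*x'' + c*x' + k*x = 0 (m, k > 0) by the sign of c^2 - 4*m*k."""
    disc = c * c - 4.0 * m * k
    if disc < 0:
        return "underdamped"        # complex roots: decaying oscillation
    if disc == 0:
        return "critically damped"  # one repeated real root
    return "overdamped"             # two distinct real roots

def min_trajectories(regime: str) -> int:
    """Minimal data requirement as stated in the abstract:
    one clip for underdamped systems, three diverse trajectories otherwise."""
    return 1 if regime == "underdamped" else 3

# Example: a lightly damped spring needs a single clip under the stated result.
regime = damping_regime(m=1.0, c=0.5, k=4.0)
clips_needed = min_trajectories(regime)
```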


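2606.00114 2026-06-02 cs.CV cs.IT math.IT (updated version)

General-Covariant Action Modeling: Constructing a Generalized Manifold via Spatiotemporal Disentanglement [title translated from the Chinese listing]

Huaihai Lyu, Chaofan Chen, Mingyu Cao, Yuheng Ji, Changsheng Xu

Affiliations: National University of Singapore

AI summary: Proposes the Generalized Action Manifold framework, which achieves general covariance by disentangling temporal invariance from geometric invariance, improving the robustness of generalization from sparse demonstrations.

AI Chinese abstract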

从有限数据中实现鲁棒泛化是具身智能的核心挑战。现有方法通过回归绝对坐标失败，这违反了广义协变原理。根本上，这混淆了内在任务几何与刚性执行模式，将策略绑定到特定运动风格和固定速度。为解决此问题，我们提出广义动作流形（GAM）框架，通过结构解耦强制执行广义协变。具体地，GAM通过强制两个正交维度的不变性来实现流形：（1）时间不变性，利用弧长参数化将空间路径几何与时间动力学正交化，确保对速度变化的鲁棒性；（2）几何不变性，其中模式-仿射-分解机制将轨迹映射到姿态归一化坐标框架中的规范“世界线”。这区分了不变几何模式与仿射调制，确保空间泛化性。通过将GAM集成到结构化视觉-语言-动作（VLA）架构中，我们使稀疏演示能够密集填充连续有效的动作流形。实验结果表明，GAM实现了优越的迁移和鲁棒性，优于几何无关基线。

English abstract

Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical "world lines" in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
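The temporal-invariance idea can be illustrated with generic arc-length reparameterization: resampling a trajectory at equal path-length steps removes the dependence on how fast it was executed. The sketch below is a textbook piecewise-linear version for 2D points. The function name `resample_by_arc_length` is hypothetical, and this is not claimed to be the paper's Arc-Length Parameterizer.

```python
import math

def resample_by_arc_length(points, n):
    """Resample a 2D polyline at n points equally spaced in arc length."""
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")
    # Cumulative arc length at each input sample.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0:
        raise ValueError("trajectory has zero length")
    out, seg = [], 0
    for i in range(n):
        s = total * i / (n - 1)  # target arc length for output sample i
        # Advance to the segment that contains arc length s.
        while seg < len(points) - 2 and cum[seg + 1] < s:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0.0 else (s - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

# The same straight path recorded with two different speed profiles.
slow_then_fast = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
fast_then_slow = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
a = resample_by_arc_length(slow_then_fast, 5)
b = resample_by_arc_length(fast_then_slow, 5)
```

Both recordings map to the same five equally spaced points, so a policy trained on the resampled path sees geometry only, with timing handled separately.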


2606.00109 2026-06-02 cs.CV cs.AI cs.LG (updated version)

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

Affiliations: Tianjin University

AI summary: Proposes VDSB-GWSyn, a Diffusion Schrödinger Bridge framework that uses a shape prior and vessel-segmentation constraints to generate controllable, high-fidelity guidewire samples, substantially improving downstream guidewire endpoint localization accuracy.

Comments: Early accept to MICCAI 2026

AI Chinese abstract

冠状动脉导丝端点定位是计算机辅助PCI的基本能力，随着机器人辅助PCI逐渐普及以减少操作者辐射暴露，其重要性日益增加。然而，带有导丝的标注CAG图像稀缺以及现有导丝合成模型的适应性有限，仍是导丝端点定位的关键瓶颈。为解决此问题，我们提出VDSB-GWSyn，一个基于扩散薛定谔桥（DSB）模型的框架，能够在复杂解剖背景下合成可控、高保真的导丝样本。VDSB-GWSyn首先使用我们的形状先验算法学习基本导丝几何形状，然后在血管分割掩码的约束下生成导丝掩码并输出对应的端点坐标，最后通过SPADE条件化的DSB在真实CAG图像上合成逼真的导丝样本。实验结果表明，VDSB-GWSyn合成的导丝样本取得了良好的ROI-FID和ROI-KID，以及高IPR分数。此外，将我们的合成数据用于合成预训练后接真实微调，显著改进了下游导丝端点定位，将MPE从16.01像素降低到7.71像素，PCK@3像素从52.63%提高到86.27%，从而实现了更临床可靠的机器人辅助导丝输送系统部署。此外，具有严格背景保留和解剖可行性约束的可控设备合成的核心设计理念，有可能迁移到其他标注数据稀缺的介入设备感知任务中。

English abstract

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01 px to 7.71 px and increasing PCK at 3 px from 52.63% to 86.27%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.
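The two localization metrics quoted above can be sketched as follows, reading MPE as the mean Euclidean endpoint error in pixels (the abstract does not expand the acronym, so this reading is an assumption) and PCK at 3 px as the fraction of endpoints predicted within 3 pixels of the ground truth. The coordinates are made up for illustration and are not data from the paper.

```python
import math

def endpoint_errors(pred, gt):
    """Per-sample Euclidean distance (in pixels) between predicted and true endpoints."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def mean_pixel_error(pred, gt):
    """Mean endpoint error in pixels (assumed meaning of MPE)."""
    errs = endpoint_errors(pred, gt)
    return sum(errs) / len(errs)

def pck(pred, gt, threshold_px=3.0):
    """Fraction of endpoints whose error is at most threshold_px."""
    errs = endpoint_errors(pred, gt)
    return sum(e <= threshold_px for e in errs) / len(errs)

# Illustrative coordinates only: errors are 0, 5, 1, and 8 pixels.
pred = [(10.0, 10.0), (23.0, 34.0), (51.0, 50.0), (70.0, 78.0)]
gt = [(10.0, 10.0), (20.0, 30.0), (50.0, 50.0), (70.0, 70.0)]
mpe = mean_pixel_error(pred, gt)
pck_at_3 = pck(pred, gt, threshold_px=3.0)
```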


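2606.00105 2026-06-02 cs.CV cs.AI (updated version)

Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning

Junkai Chen, Yuhao He, Junxiang You, Ruiqi Liu, Chenyu Wang, Shu Wu

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences (UCAS)

AI summary: Proposes Visual-Noise Guided In-Context Distillation (VGID), which builds a teacher distribution through dual-modal intervention and distills it into the student, achieving parameter-level unlearning in multimodal large language models while balancing unlearning effectiveness and model utility.

AI Chinese abstract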

多模态大语言模型（MLLMs）在视觉-语言任务上取得了显著进展，但它们也可能记忆和暴露敏感或受限知识，引发隐私和更广泛的安全风险。机器遗忘（MU）提供了一种有前景的方法，可以从训练好的模型中移除目标不良知识，而无需从头重新训练，同时保持通用模型效用。然而，在MLLMs中实现有效遗忘仍然特别具有挑战性。现有的基于训练的方法通常难以平衡遗忘效果和模型效用。相比之下，无训练方法如上下文遗忘通过避免参数更新来保持模型效用，但它们不会在参数级别移除记忆的知识，可能仍然容易受到逆向工程攻击。更重要的是，上下文遗忘在多模态设置中不足，其中视觉输入可以提供强条件信号并诱导不良输出。为了解决这些挑战，我们提出了视觉噪声引导的上下文蒸馏（VGID），一种基于蒸馏的MLLM遗忘框架。VGID通过结合视觉扰动与文本上下文遗忘的双模态干预，从冻结的基础模型动态构建面向遗忘的教师分布。由此产生的干预诱导分布作为蒸馏的教师信号，引导学生模型实现参数级遗忘，而无需外部教师模型或显式的不良响应注释。实验结果表明，VGID在保持竞争性模型效用的同时实现了强遗忘效果，在代表性设置中，遗忘集ROUGE-L降低了0.371，而保留集ROUGE-L仅下降0.055。

English abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.
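A distillation objective of the kind described above is often written as a KL divergence from the teacher's next-token distribution to the student's. The sketch below shows that generic loss on raw logits. The abstract does not state which divergence VGID uses, so the KL choice and the function names are assumptions, and the teacher logits stand in for the frozen base model's output under the visual-noise and in-context intervention.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    """KL from the (intervened, frozen) teacher distribution to the student."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

# The loss is zero when the student already matches the teacher and
# grows as the student keeps mass on a token the teacher has suppressed.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([0.0, 0.0, 5.0], [5.0, 0.0, 0.0])
```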


2606.00101 2026-06-02 cs.CV cs.AI (updated version)

CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection

Huidong Feng, Wentao Chen, Jie Chen, Xinqi Cai, Ruolong Ma, Yinglin Zheng, Yuxin Lin, Ming Zeng

Affiliations: School of Informatics, Xiamen University; China Academy of Information and Communications Technology; AI Transcend Pte. Ltd.

AI summary: To address existing datasets' reliance on lower-quality open-source generators and watermarked commercial samples, introduces CoCoVideo-26K, a contrastive dataset covering 13 commercial generators, and CoCoDetect, a detection framework that combines contrastive learning with confidence-gated multimodal large language model inference for robust detection of high-fidelity AI-generated video.

Comments: Accepted by CVPR 2026

AI Chinese abstract

随着人工智能生成内容（AIGC）技术的快速发展，视频伪造日益普遍，给公共讨论和社会安全带来新挑战。尽管现有深度伪造检测方法取得了显著进展，但AIGC伪造检测仍然具有挑战性，因为现有数据集主要依赖开源视频生成模型，其质量远低于商业AIGC系统。即使包含少量商业样本的数据集也常常保留可见水印，损害真实性并阻碍模型泛化到高保真AIGC视频。为解决这些问题，我们引入了CoCoVideo-26K，一个基于对比学习的商业模型AIGC视频数据集，涵盖13个主流商业生成器，并提供语义对齐的真实-伪造视频对。该数据集能够深入探索真实视频与高质量合成视频之间的差异，并为高逼真视频伪造检测建立新基准。基于该数据集，我们提出了CoCoDetect，一个集成对比学习与置信门控多模态大语言模型（MLLM）推理的检测框架。R3D-18骨干网络提取时空表示，而置信门将不确定案例路由到MLLM进行物理合理性和场景一致性的推理。在CoCoVideo-26K和公共基准上的大量实验证明了最先进的性能，验证了该框架的鲁棒性和泛化能力。我们的代码和数据集可在https://github.com/DonoToT/CoCoVideo获取。

English abstract

With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework's robustness and generalizability. Our code and dataset are available at https://github.com/DonoToT/CoCoVideo.
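The confidence-gating step can be sketched as a simple router: confident backbone predictions are accepted directly, and uncertain ones are handed to a slower MLLM check. The threshold value, the definition of confidence as max(p, 1 - p), and the `mllm_judge` callable are all hypothetical placeholders, since the abstract does not specify them.

```python
from typing import Callable, Tuple

def gated_predict(p_fake: float,
                  mllm_judge: Callable[[], bool],
                  threshold: float = 0.8) -> Tuple[bool, str]:
    """Return (is_fake, source) for one video.

    p_fake     -- backbone probability that the video is AI-generated
    mllm_judge -- hypothetical callable that runs the MLLM and returns its verdict
    threshold  -- assumed confidence level below which the case is routed onward
    """
    confidence = max(p_fake, 1.0 - p_fake)
    if confidence >= threshold:
        return p_fake >= 0.5, "backbone"
    # Uncertain case: defer to the MLLM, which is only invoked here.
    return mllm_judge(), "mllm"

# Usage with a stand-in judge that always answers "real".
confident_case = gated_predict(0.95, lambda: False)
uncertain_case = gated_predict(0.55, lambda: False)
```

Only the uncertain case pays the cost of MLLM inference, which is the usual motivation for gating of this kind.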


2606.00100 2026-06-02 cs.CV cs.AI (updated version)

CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

Tongxi Song, Ziyu Li, Zihan Li, Wen Zhong, Congyu Liao, Yang Yang, Hua Guo, Wenchuan Wu, Qiyuan Tian

Affiliations: School of Biomedical Engineering, Tsinghua Medicine, Tsinghua University; Oxford Centre for Integrative Neuroimaging, FMRIB, Nuffield Department of Clinical Neurosciences, University of Oxford; Department of Radiology & Biomedical Imaging, University of California San Francisco

AI summary: Proposes CoilDrop-MRI, which drops data along the coil dimension and uses the dropped data as self-supervised training targets within unrolled image-domain and k-space architectures, enabling parallel MRI reconstruction without fully sampled data and outperforming existing self-supervised methods on multi-site, multi-field-strength, multi-modality datasets.

AI Chinese abstract

基于自监督深度学习的方法在加速磁共振成像（MRI）重建中展现出巨大潜力，无需全采样数据即可实现高图像质量。这些方法通常将采集的数据划分为两个不相交的子集，构建输入-目标对以优化重建网络。然而，现有方法仅在空间频率（k空间）域进行划分，未探索线圈维度。为充分利用接收线圈间的信号相关性，我们提出CoilDrop-MRI，该方法对输入应用线圈级丢弃，并将丢弃的数据作为自监督框架中的训练目标。该方法被集成到图像域（SENSE）和k空间（SPIRiT）公式的展开架构中。我们进一步将CoilDrop-MRI扩展到多激发、相位校正的扩散MRI（dMRI）重建，展示了其多功能性。CoilDrop-MRI在多站点、多场强（0.3T、0.55T和3T）和多模态（T1加权、T2加权、T2-FLAIR和dMRI）数据集上进行了广泛验证，始终优于最先进的自监督方法，达到了与监督重建方法相当的质量，且无需全采样参考训练数据。此外，CoilDrop-MRI表现出强大的数据效率和跨成像条件的鲁棒泛化能力，使其成为自监督并行MRI重建的实用且通用的框架。

English abstract

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.
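The coil-wise partition described above can be sketched at the index level: a random subset of receiver coils is withheld from the network input and kept as the training target. The function name, the drop fraction, and the seeding are assumptions for illustration. The k-space data and the reconstruction network are deliberately left out.

```python
import random

def split_coils(num_coils, drop_fraction=0.25, seed=None):
    """Partition coil indices into (kept, dropped) disjoint subsets.

    kept    -- coils whose data would be fed to the reconstruction network
    dropped -- coils whose data would serve as the self-supervised target
    """
    if num_coils < 2:
        raise ValueError("need at least two coils to form input and target sets")
    rng = random.Random(seed)
    # Always drop at least one coil and keep at least one.
    num_dropped = min(num_coils - 1, max(1, round(num_coils * drop_fraction)))
    dropped = sorted(rng.sample(range(num_coils), num_dropped))
    dropped_set = set(dropped)
    kept = [c for c in range(num_coils) if c not in dropped_set]
    return kept, dropped

# Example: an 8-channel acquisition with a quarter of the coils withheld.
kept, dropped = split_coils(8, drop_fraction=0.25, seed=0)
```

Drawing a fresh split at every training step would expose the network to many input-target combinations, in the same way k-space partitioning does along the frequency dimension.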


2606.00098 2026-06-02 cs.CV eess.IV (updated version)

Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection

Izaldein Al-Zyoud, Abdulmotaleb El Saddik

Affiliations: University of Central Florida

AI summary: Proposes segmentation-guided spatial indexing: a frozen FaRL parser assigns semantic labels to DINOv3 ViT-L/16 patch tokens, and only the semantically relevant region is pooled and classified, yielding generalizable and explainable deepfake detection.

AI Chinese abstract

我们引入了分割引导的空间索引，用于可泛化和可解释的深度伪造检测。关键思想颠倒了标准设计顺序：不是先汇集所有人脸token再分类，而是先选择语义上有意义的patch token，然后仅汇集这些token。一个冻结的FaRL解析器为每个DINOv3 ViT-L/16 patch token分配一个语义标签；丢弃非目标token；一个线性探针对保留的区域进行分类。这种空间索引利用了DINOv3的patch级空间一致性（即产生涌现分割的相同属性），向探针呈现一个更纯净的区域子空间，其中与操作相关的证据较少被全脸线索稀释。区域归因是结构性的：当嘴部模型预测为假时，决策仅使用了嘴部token，而不是叠加的显著性图。在Celeb-DF v2上，嘴部索引探针的AUC达到0.905，优于LipForensics（+8.1个百分点）和Xception（+16.9个百分点），且无需对DINOv3或FaRL进行微调，也无需目标域数据。消融实验隔离了机制：用DINOv3的CLS token替换区域选择，Celeb-DF v2 AUC下降26.4个百分点；用FaRL特征替换DINOv3，AUC下降20.9个百分点。DINOv3表示和空间索引都是独立必要的；单独任何一个都无法达到完整系统的性能。

English abstract

We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3's patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3's CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
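The select-then-pool order described above can be sketched in a few lines: keep only the patch tokens whose parser label matches the target region, average them, and score the result with a linear probe. Mean pooling, the label strings, and the toy two-dimensional tokens are assumptions for illustration. Real DINOv3 features and FaRL labels are not modeled here.

```python
def pool_region(tokens, labels, target):
    """Mean-pool only the patch tokens whose semantic label equals target."""
    selected = [tok for tok, lab in zip(tokens, labels) if lab == target]
    if not selected:
        raise ValueError(f"no patch tokens labelled {target!r}")
    dim = len(selected[0])
    return [sum(tok[d] for tok in selected) / len(selected) for d in range(dim)]

def linear_probe(feature, weights, bias=0.0):
    """Logit of a linear classifier applied to the pooled region feature."""
    return sum(w * x for w, x in zip(weights, feature)) + bias

# Toy example: two mouth patches and one skin patch.
tokens = [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]]
labels = ["mouth", "mouth", "skin"]
mouth_feature = pool_region(tokens, labels, "mouth")
logit = linear_probe(mouth_feature, weights=[1.0, -1.0])
```

Because the skin token never enters the pooled feature, the prediction depends on mouth tokens alone, which is the structural attribution the abstract describes.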


2606.00095 2026-06-02 cs.CV cs.AI cs.CL cs.RO (updated version)

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

Kailing Li, Tianwen Qian, Lijin Yang, Yuqian Fu, Jingyu Gong, Xiaoling Wang, Liang He

Affiliations: School of Computer Science and Technology, East China Normal University; Bosch Corporate Research; King Abdullah University of Science and Technology

AI summary: Proposes the Hierarchical Semantic-Geometric Map (HSGM), which converts 3D geometric information into a structured representation that VLMs can interpret and combines high-level VLM semantic planning with classical path planning for zero-shot vision-language navigation, reaching state-of-the-art performance on the R2R-CE and RxR-CE benchmarks.

AI Chinese abstract

视觉语言导航（VLN）使具身智能体能够通过遵循语言指令在未知环境中到达目标位置。尽管近期视觉语言模型（VLM）取得了进展，但仍存在关键的语义-几何鸿沟：VLM擅长语言和2D视觉理解，但在3D空间推理方面表现不佳，且无法捕捉动作与空间转换之间的因果动态，导致导航不可靠，尤其在零样本设置中。为弥合这一鸿沟，我们提出分层语义几何地图（HSGM），将3D几何信息转化为与VLM兼容的结构化表示，有效将其与物理世界连接。具体而言，HSGM表示为多通道俯视图，组织为三个层次：（1）几何层，记录可导航区域和障碍物；（2）语义层，表示物体及其关系；（3）决策层，支持高层任务推理和目标选择。导航过程中，VLM作为高层语义规划器，解释HSGM编码的空间布局以选择几何有效航点，而航点间的低层无碰撞运动由经典路径规划算法执行，从而将语义推理与动作执行完全解耦。此外，复杂指令被分解为子任务，以缓解长程导航中的进度遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明，我们的零样本框架达到了最先进性能，甚至优于若干监督方法。代码见 https://github.com/Teacher-Tom/HSGM_public。

English abstract

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at https://github.com/Teacher-Tom/HSGM_public.
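The low-level half of this decoupled design can be illustrated with a minimal planner on the geometric level of a top-down map. The abstract only says "a classical path-planning algorithm", so breadth-first search on a 4-connected occupancy grid is used here as a stand-in, and the grid and waypoint are made up for the example.

```python
from collections import deque

def bfs_path(free, start, goal):
    """Shortest 4-connected path between two cells of a boolean grid, or None.

    free[r][c] is True where the geometric level marks the cell as navigable.
    """
    rows, cols = len(free), len(free[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and free[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# A wall forces a detour between the agent at (0, 0) and the waypoint at (2, 0).
free = [
    [True, True, True],
    [False, False, True],
    [True, True, True],
]
path = bfs_path(free, start=(0, 0), goal=(2, 0))
```

In the described pipeline the waypoint would come from the VLM reading the map, while a routine like this one handles collision-free motion to it.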


2606.00092 2026-06-02 cs.CV cs.AI (updated version)

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

Devansh Lalwani, Swapnil Bhat, Maulik Shah

Affiliations: Turocrates AI Private Limited

AI summary: To address imprecise attention-map localization in weakly-supervised whole-slide image classification, proposes a consistency training method that combines cellular sheaves with classifier attention, reaching a patch-level AUC of 0.940 on Camelyon16 and raising attention AUC from 0.717 to 0.953.

AI Chinese abstract

基于基础特征的注意力多实例学习（ABMIL）在Camelyon16切片级别性能上接近饱和，但相应的注意力图作为定位信号并不完美：在临床解释中，一个正确分类但未激活实际病灶的模型难以被信任。我们通过细胞层（cellular sheaves）来解决这一差距，细胞层为图的每个顶点和边赋予有限维向量空间及它们之间一致的线性映射，提供了一种在图结构数据上检测局部不一致性的原则性方法。我们将细胞层应用于全切片图像的弱监督肿瘤定位，结合了细胞层不一致场与ABMIL。自然的训练目标——鼓励相似特征之间的一致性——产生的不一致场追踪的是组织级纹理而非诊断内容。我们提出注意力条件一致性，利用分类器的注意力来定义哪些相邻补丁应该一致。在此目标下联合训练分类器和细胞层，在Camelyon16上产生的不一致场达到补丁级AUC 0.940，并将注意力头从单独ABMIL的0.717提升至0.953。两阶段消融实验（分类器冻结在ABMIL值）仅在不一致场上达到0.727，注意力保持0.717，证实增益来自投影器在两个目标下的共同适应，而非单独的损失变化。训练后的模型无需重新训练即可迁移至Camelyon17的标注切片，保持Delta AUC 0.932 +/- 0.083和注意力AUC 0.955 +/- 0.099。结果是注意力图和细胞层不一致图同时激活相同的诊断区域，为每个切片级预测提供两种互补的解释。

English abstract

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier's attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.
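For readers unfamiliar with cellular sheaves, the standard notion of disagreement on an edge (u, v) is the squared distance between the two endpoint features after each is mapped into the edge's vector space by its restriction map. The sketch below computes that quantity and accumulates it per vertex. The paper's exact disagreement field and its attention-conditional weighting are not specified in the abstract, so this is the generic construction with hypothetical names.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def edge_disagreement(x_u, x_v, f_u, f_v):
    """Squared norm of F_u x_u - F_v x_v, measured in the edge stalk."""
    pu, pv = matvec(f_u, x_u), matvec(f_v, x_v)
    return sum((a - b) ** 2 for a, b in zip(pu, pv))

def vertex_disagreement(features, edges, maps):
    """Accumulate edge disagreements onto incident vertices.

    maps[(u, v)] holds the pair (F_u, F_v) of restriction maps for that edge.
    """
    field = [0.0] * len(features)
    for u, v in edges:
        f_u, f_v = maps[(u, v)]
        d = edge_disagreement(features[u], features[v], f_u, f_v)
        field[u] += d
        field[v] += d
    return field

# Three patches in a chain; identity maps reduce this to plain feature distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1), (1, 2)]
maps = {edge: (identity, identity) for edge in edges}
field = vertex_disagreement(features, edges, maps)
```

With learned rather than identity maps, two patches can agree even when their raw features differ, which is what lets a sheaf encode what counts as consistent between neighbours.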


2606.00087 2026-06-02 cs.CV cs.AI (updated version)

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Tencent Youtu Lab; ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University; National University of Singapore

AI summary: Proposes EviOSAHS, a framework that decomposes each facial image into seven anatomical queries, converts the responses into structured evidence cards, and combines them with clinical information for high-sensitivity OSAHS screening.

AI Chinese abstract

有效的阻塞性睡眠呼吸暂停低通气综合征（OSAHS）多导睡眠图前筛查需要结合临床风险因素与可见的颅面和颈部线索。直接提示通用多模态基础模型进行医学是/否决策可能产生不稳定、校准不良的输出。我们提出EviOSAHS，一个证据驱动的多模态推理框架，将仅基于图像的解剖证据获取与最终临床判定分离。每张正面面部图像被分解为七个固定的解剖查询，涵盖颈部、下巴、嘴巴、面/颈脂肪、下颌、中面部和鼻子。视觉响应被转换为结构化证据卡，记录目标解剖结构、可见性、风险方向、证据强度、置信度和简洁摘要。这些卡片仅在最后阶段与清理后的临床档案结合，由大型语言模型进行平衡的二元筛查判定。我们在642名受试者队列上评估了EviOSAHS，将正常受试者映射为筛查阴性，轻度、中度或重度OSAHS受试者映射为筛查阳性。EviOSAHS实现了88.47%的准确率、94.86%的灵敏度、93.74%的F1分数和5.14%的假阴性率，在统一协议下优于仅临床提示、直接多模态提示和朴素两阶段流水线。消融实验表明，七问题视觉分解和平衡最终判定对高灵敏度工作点至关重要。对4,494个视觉输出的问题级审计显示100%的结构化解析率和93.88%的高可见率。EviOSAHS为二元多导睡眠图前OSAHS筛查提供了一个可审计、高灵敏度的工作流程，但应被视为分诊助手而非诊断系统。在临床部署前需要进行前瞻性验证、外部测试和校准的工作点控制。

English abstract

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
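The evidence-card structure described above maps naturally onto a small record type. The sketch below uses the six fields and seven anatomical targets listed in the abstract. The field types, the class and function names, and the example values are assumptions, since the abstract does not give a schema.

```python
from dataclasses import dataclass

# The seven fixed anatomical queries named in the abstract.
ANATOMICAL_QUERIES = (
    "neck", "chin", "mouth", "face/neck fat", "lower jaw", "midface", "nose",
)

@dataclass
class EvidenceCard:
    """One structured record per (image, anatomical query) pair."""
    target_anatomy: str
    visibility: str         # assumed categorical, e.g. "high" or "low"
    risk_direction: str     # assumed categorical, e.g. "raises" or "neutral"
    evidence_strength: str  # assumed categorical, e.g. "weak" or "strong"
    confidence: float       # assumed to lie in [0, 1]
    summary: str

def expected_visual_outputs(num_subjects: int) -> int:
    """Number of cards produced when every subject gets all seven queries."""
    return num_subjects * len(ANATOMICAL_QUERIES)

# Hypothetical card; the values are placeholders, not results from the paper.
card = EvidenceCard("neck", "high", "raises", "weak", 0.6, "placeholder summary")
total_cards = expected_visual_outputs(642)
```

With 642 subjects and seven queries each, the count comes to 4,494, matching the number of visual outputs audited in the abstract. The reported false-negative rate is likewise the complement of the reported sensitivity (100% - 94.86% = 5.14%).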

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE 版本更新

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

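Affiliations: Inria Chile Research Center

AI summary: To address the poor generalization of plankton classification models, the authors introduce Planktonzilla-17M, a unified dataset of 17.4 million images that includes 3.74 million plankton images spanning over 602 taxonomic classes, and compare supervised learning with CLIP-style training. A supervised classifier matches or exceeds CLIP-style training that uses taxonomic lineage as text, and existing biological foundation models (BioCLIP, BioCLIP2) perform poorly on plankton.

AI Chinese abstract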

海洋浮游生物支撑着水生食物网，并在全球二氧化碳封存中发挥关键作用，因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好，但由于训练数据集孤立且标签不一致，无法跨仪器和环境泛化。为解决这一问题，我们引入了Planktonzilla-17M，这是一个统一的数据集，整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像，具有标准化的分类学和地理环境元数据，其中包括374万张浮游生物图像，涵盖602个分类类群，其中201个在物种级别被识别，使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集，我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现，当使用分类谱系作为文本时，监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到，BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能，凸显了当前生物基础模型在海洋成像领域的局限性。

English abstract

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image-text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training that uses taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

2606.00078 2026-06-02 cs.CV cs.AI 版本更新

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

基于流的生成建模优化压缩感知应用中的采样策略

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

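Affiliations: Eindhoven University of Technology

AI summary: Proposes a task-aware flow-based generative framework that trains a flow model to optimize subsampling masks in compressed sensing, substantially improving performance in image classification, image reconstruction, and MRI acceleration.

AI Chinese abstract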

信号处理和医学成像中的许多现代应用需要在严格的资源约束下获取高维信号。传统采样理论表明，准确重建信号所需的测量次数与信号的维数成正比，这一要求往往过于昂贵或不切实际。压缩感知通过证明稀疏信号可以在较少的测量下恢复（前提是测量算子满足某些条件）挑战了这一观念。这项概念验证研究提出了一种任务感知的基于流的生成框架——对传统流匹配训练范式的重新表述，其中流模型被训练用于优化压缩感知应用中的子采样。我们建立了所提出的学习子采样掩码框架的基本可行性，该框架显著提升了压缩感知在图像分类、图像重建和MRI加速中的性能。在图像重建任务中，我们的方法展示了最先进的性能，在CelebA数据集上以5%的子采样率实现了25.17 dB的峰值信噪比，在重建8倍加速的MRI测量（fastMRI数据集）时以最小的计算开销达到了29.24 dB。这些结果突显了生成流模型中任务条件化的有效性，并揭示了表示学习策略的一个有前景的方向。总体而言，所提出的框架提供了一种统一、灵活的方法来设计数据和任务驱动的感知方案，有望适用于广泛的逆问题。

English abstract

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework, a reformulation of the conventional Flow Matching training paradigm in which a flow model is trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework of learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving a peak signal-to-noise ratio (PSNR) of 25.17 dB at a subsampling rate of 5% on the CelebA dataset and 29.24 dB when reconstructing 8× accelerated MRI measurements (fastMRI dataset) with minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can potentially be adapted to a broad range of inverse problems.
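
To make the reported quantities concrete, the sketch below builds a subsampling mask at a given rate and computes PSNR in dB. The uniformly random mask is only a baseline stand-in for the learned masks described in the abstract, not the authors' method; the function names and the flat-list signal representation are assumptions for illustration.

```python
import math
import random


def random_subsampling_mask(num_samples, rate, seed=0):
    """Boolean mask keeping round(rate * num_samples) random positions.

    A learned, task-aware mask would replace this random choice.
    """
    rng = random.Random(seed)
    keep = set(rng.sample(range(num_samples), round(rate * num_samples)))
    return [i in keep for i in range(num_samples)]


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(data_range ** 2 / mse)
```

At a 5% rate, a 64×64 image (4,096 pixels) keeps 205 measurements. Because PSNR is logarithmic in the mean squared error, every 10 dB gain corresponds to a tenfold reduction in MSE.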

2606.00077 2026-06-02 cs.CV cs.AI 版本更新

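When Jokes Cross the Line: Analyzing Regular and Dark Humor in YouTube Shorts

Sydney Johns, Sanjeev Parthasarathy, Shantnu Bhalla, Vaibhav Garg

Affiliations: Virginia Polytechnic Institute and State University

AI summary: Builds the hand-annotated TwistedHumor dataset (1,211 YouTube Shorts paired with 33,041 comments) and combines three analyses (LLooM-based concept induction, comment sentiment analysis, and large language model evaluation) to show how regular and dark humor in short-form video differ in themes, audience response, and model detection, underscoring the need for context-aware moderation.

AI Chinese abstract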

YouTube等视频平台重塑了用户参与娱乐和信息的方式，强调简短、高参与度的内容，如Shorts。在这个生态系统中，某些内容处于灰色地带：虽然允许存在，但仍可能对部分观众产生意想不到的负面影响。为了研究这一问题，我们引入了TwistedHumor数据集，包含1,211个YouTube Shorts及其配对的33,041条相关评论，并手工标注了幽默存在性、幽默类型、伤害性、主题、修辞手法和单口喜剧背景。除了数据集构建，我们还提出了对短格式社交媒体中幽默与伤害表现的多视角分析。通过使用基于LLooM的概念归纳对视频描述进行分析，我们发现黑色幽默经常围绕批评、应对、尴尬和身份表达等主题聚集，而不是作为一个单一的类别出现。我们进一步通过关联评论分析观众反应，表明常规幽默与更积极的情感相关，而黑色幽默则收到更多混合、中性甚至有时更有毒的反馈。最后，我们评估了大语言模型与人类标注的一致性，发现它们在单口喜剧上的表现优于短笑话。综合来看，这些结果将TwistedHumor不仅定位为一个新的基准，而且是对短格式视频中幽默与伤害灰色地带的实证研究，强调了需要上下文感知的审核和更稳健的多模态评估。

English abstract

Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand-up context. Beyond dataset creation, we present a multi-view analysis of how humor and harm appear in short-form social media. Using LLooM-based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand-up comedy than on shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short-form video, highlighting the need for context-aware moderation and more robust multimodal evaluation.

2606.00001 2026-06-02 cs.HC cs.CV cs.MM 版本更新

Shu Dao: A Calligraphy Score Framework Linking Calligraphy, Music, and Performance

书道：连接书法、音乐与表演的评分框架

Lican Huang

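Affiliations: Hangzhou Domain Zones Technology Co., Ltd.

AI summary: Proposes the CWSR notation and the Shu Dao framework, which model East Asian calligraphy as a structured performance analogous to a musical score and support human-AI co-creation.

Comments 47 pages

Journal ref: Journal of Advances in Information Science and Technology, 2026 4(2), 1-47. https://yvsou.com/journal/index.php/jaist/article/view/43

AI Chinese abstract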

本文介绍了书法书写评分表示法（CWSR），并提出了书道框架，将东亚书法解读为一种表演艺术而非静态视觉产物。受日本书道和茶道等体现文化实践的启发，该框架将书法建模为类似于音乐符号的结构化表演。该方法不将字符表示为固定图像，而是将每个笔画编码为有序且可执行的动作，形成书法评分。字符在结构化空间网格中组织，笔画标注有类型、执行顺序、空间坐标、轨迹、构图角色以及动态属性（如笔压和节奏）。这种表示捕捉了书法书写中通常图像表示所缺失的时间和表达方面。本文做出三项主要贡献：首先，引入CWSR作为结构化符号系统，在笔画、字符结构和构图组织（如布局和章法）等多个层面表示书法，及其节奏和表演动态；其次，将书道概念化为基于评分的框架，将书法建模为结构化表演；第三，为基于AI的书法智能体分析、可视化和可执行生成书法作品建立计算基础。这些贡献共同连接了书法、音乐符号和表演文化实践，支持计算书法和数字人文研究中的人机共创。

English abstract

This paper introduces Calligraphy Writing Score Representation (CWSR) and proposes Shu Dao as a framework that interprets East Asian calligraphy as a performative art rather than a static visual artifact. Inspired by traditions such as Japanese Shodō and embodied cultural practices such as Chadao, the framework models calligraphy as a structured performance analogous to musical notation. Instead of representing characters as fixed images, the proposed approach encodes each brush stroke as an ordered and executable action, forming a calligraphy score. Characters are organized within a structured spatial grid, and strokes are annotated with attributes including stroke type, execution order, spatial coordinates, trajectory, compositional role, and dynamic properties such as brush pressure and pacing. This representation captures temporal and expressive aspects of calligraphic writing that are typically absent from image-based representations. The paper makes three main contributions. First, it introduces CWSR as a structured notation system for representing calligraphy across multiple levels, including strokes, character structures, and compositional organization (e.g., layout and zhangfa), together with their rhythmic and performative dynamics. Second, it conceptualizes Shu Dao as a score-mediated framework that models calligraphy as structured performance. Third, it establishes a computational foundation for the analysis, visualization, and executable generation of calligraphic works by AI-based calligraphic agents. Together, these contributions bridge calligraphy, musical notation, and performative cultural practices, supporting human-AI co-creation in computational calligraphy and digital humanities research.
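
The idea of a calligraphy score, in which strokes are ordered, executable actions rather than pixels, can be sketched as a small data structure. The attribute list follows the abstract, but the class names, field types, and stroke-type labels below are hypothetical and are not the CWSR notation itself.

```python
from dataclasses import dataclass, field


@dataclass
class Stroke:
    """One brush stroke as an executable action (hypothetical schema)."""
    stroke_type: str                  # e.g. "heng" (horizontal), "shu" (vertical)
    order: int                        # execution order within the character
    trajectory: list                  # list of (x, y) points in the grid cell
    compositional_role: str = ""
    pressure: list = field(default_factory=list)  # assumed per-point values
    pacing: float = 1.0               # assumed relative tempo


@dataclass
class CharacterScore:
    """A character placed in a spatial grid, with its strokes."""
    character: str
    grid_cell: tuple                  # (row, column) in the layout grid
    strokes: list = field(default_factory=list)

    def performance_order(self):
        """Return strokes in the order they would be performed."""
        return sorted(self.strokes, key=lambda s: s.order)
```

Storing execution order and per-stroke dynamics explicitly is what lets such a score be replayed or rendered, which an image of the finished character cannot support.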

2605.31597 2026-06-02 cs.CV 版本更新

SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

SOCO: 视觉基础模型中语义对象对应关系的基准测试

Olaf Dünkel, Basavaraj Sunagad, Haoran Wang, David T. Hoffmann, Christian Theobalt, Adam Kortylewski

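Affiliations: Max Planck Institute for Informatics; Saarland Informatics Campus; CISPA Helmholtz Center for Information Security; University of Freiburg

AI summary: Proposes the SOCO benchmark, which introduces a taxonomy of correspondence types and functionally meaningful keypoint annotations covering 100 categories and over 1M correspondence pairs, to systematically evaluate semantic correspondence in vision foundation models. It reveals weak cross-category transfer and a gap between language-grounded localization and visual correspondence.

Comments Project page: https://genintel.github.io/SOCO/

AI Chinese abstract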

由于评估协议不一致和部分级监督有限，测量视觉基础模型中的结构化对象理解仍然具有挑战性。语义对应（SC）通过测试对象部分是否能在外观、视角和几何形状的巨大变化下跨实例和类别匹配来评估这种能力。为了实现系统的SC评估，我们引入了SOCO，一个新的语义对象对应基准，它引入了对应类型的分类法，并在100个类别和超过100万对应对上提供了一致、功能上有意义的关键点注释。此外，SOCO包括关键点语言描述，使得能够评估大型视觉语言模型（LVLMs）及其细粒度部分级理解。综合实验揭示：(i) 视觉基础骨干编码了强大的语义结构，但在相关类别之间转移对应关系较差，且仅部分捕捉对象部分位置；(ii) LVLMs在文本提示的部分定位方面比视觉参考的跨图像匹配更强，暴露了语言引导定位与细粒度视觉对应之间的差距；(iii) 对应性能比ImageNet分类更能预测密集下游任务（包括分割、跟踪、3D姿态估计和3D检测）的性能。总之，这些发现将SOCO定位为视觉和多模态基础模型中结构化、部分级表示质量的基准。

English abstract

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.

2605.31557 2026-06-02 cs.CV 版本更新

EGOSTREAM: A Diagnostic Benchmark for Streaming Episodic Memory in Egocentric Vision

EGOSTREAM: 面向第一人称视角的流式情景记忆诊断基准

Rosario Forte, Giuseppe Lando, Antonino Furnari

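Affiliations: Department of Mathematics and Computer Science, University of Catania

AI summary: Proposes the EGOSTREAM benchmark, which uses seven cognitive dimensions and Answer Validity Windows to diagnose episodic memory in streaming-video models, and evaluates several memory-management mechanisms.

AI Chinese abstract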

连续情景记忆是自主代理在动态真实环境中运行的核心能力，然而当前的流式视频基准为诊断模型记住什么以及记忆多久提供的工具有限。我们引入Egostream，一个面向第一人称视角的流式情景记忆评估诊断基准。Egostream沿七个认知维度组织了2250个精心设计的问题：细节、空间、时间、事件、社会、因果和前瞻记忆。我们引入了答案有效期窗口（AVW），它指定了随着观察场景演变答案保持有效的时间跨度。这使得我们将问题扩展为8528个回忆条件评估，从而能够控制从即时到超长期的回忆测试，同时将模型真正的遗忘与自然世界状态变化区分开来。我们通过一个统一的流式多模态大语言模型框架严格建立了基线性能，该框架比较了几种最先进的记忆管理机制，包括滑动窗口、注意力汇聚、KV缓存剪枝、合并和卸载。在统一的Qwen3-VL骨干网络上的实验表明，可比较的总体准确率掩盖了截然不同的记忆特征。例如，token剪枝在保留细粒度细节和时间结构方面显著优于token合并，而量化卸载则挽救了超长期回忆。最终，所有机制都远低于实时运行（>1秒每帧），且表现最好的方法准确率上限约为45%，揭示了当前架构中的关键差距。Egostream提供了弥合这些差距所需的诊断测试平台。项目网站、新闻和更新请访问：https://saroo25.github.io/Egostream/

English abstract

Continuous episodic memory is a core capability for autonomous agents operating in dynamic, real-world environments, yet current streaming video benchmarks provide limited tools for diagnosing what models remember and for how long. We introduce Egostream, a diagnostic benchmark for streaming episodic memory evaluation in egocentric vision. Egostream organizes 2,250 curated questions along seven cognitive dimensions: detail, spatial, temporal, event, social, causal, and prospective memory. We introduce the Answer Validity Window (AVW), which specifies the temporal span over which an answer remains valid as the observed scene evolves. This allows us to expand the questions into 8,528 recall-conditioned evaluations, enabling controlled testing from instant to ultra-long-term recall while separating genuine model forgetting from natural world-state changes. We rigorously establish baseline performance through a unified streaming MLLM framework that compares several state-of-the-art memory-management mechanisms, covering sliding windows, attention sinks, KV-cache pruning, merging, and offloading. Experiments within a unified Qwen3-VL backbone reveal that comparable aggregate accuracies mask starkly different memory profiles. For instance, token pruning preserves fine-grained details and temporal structure significantly better than token merging, while quantized offloading rescues ultra-long-term recall. Ultimately, all mechanisms operate well below real-time (>1s per frame), and top-performing methods plateau at about 45% accuracy, exposing critical gaps in current architectures. Egostream provides the diagnostic testbed needed to close these gaps. Project website, news and updates at: https://saroo25.github.io/Egostream/
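
One way to read the Answer Validity Window is as a time interval that filters the moments at which a question may fairly be asked. The sketch below illustrates that reading only: the class, the inclusive-interval semantics, and the delay schedule are assumptions, and the benchmark's actual construction may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerValidityWindow:
    """Time span (in seconds of video) during which an answer stays correct."""
    start_s: float
    end_s: float

    def is_valid_at(self, query_time_s):
        return self.start_s <= query_time_s <= self.end_s


def expand_question(evidence_time_s, window, delays_s):
    """Turn one question into several recall-conditioned query times.

    Query times outside the window are dropped: a wrong answer there
    would reflect a changed world state, not model forgetting.
    """
    candidates = [evidence_time_s + d for d in delays_s]
    return [t for t in candidates if window.is_valid_at(t)]
```

For evidence seen at 10 s with a window of 10 s to 100 s, delays of 0, 30, and 300 s yield query times of 10 s and 40 s; the 310 s query is discarded because the answer may no longer hold.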

2605.31487 2026-06-02 cs.CV 版本更新

Enhancing Computer Vision Model Generalization in Warehouse Facilities: A Case Study on Anomaly Detection in Vertical Material Handling Systems

提升仓库设施中计算机视觉模型泛化能力：垂直物料搬运系统异常检测案例研究

Ruiliang Liu, Tina Dongxu Li, Joshua Migdal, Ken Meszaros, Trevor Dardik

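Affiliations: Amazon, USA

AI summary: Using optimal camera placement, an image-triggering strategy, model selection, and ensembling, all developed in a laboratory setting, the study generalizes an anomaly detection model for vertical material handling systems from the lab to diverse warehouse environments, simplifying deployment and saving annotation and retraining effort.

Comments 6 pages, 10 figures. Accepted at IEEE International Conference on Mechatronics and Automation (ICMA) 2026

AI Chinese abstract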

在仓库设施中部署计算机视觉模型传统上需要大量资源用于相机安装、图像采集、标注、训练和部署——由于相机安装限制和环境变化，这一过程通常需要在每个新环境中重复。本文探索了一种创新方法，通过仅在实验室环境中执行标准流程来简化这一过程，重点关注垂直物料搬运系统及其叉的异常检测。通过大量实验，我们发现结合最优相机布置、策略性图像触发、谨慎的模型选择和模型集成，能够实现从实验室条件到多种仓库设施环境的有效泛化，可能通过将仓库设施部署简化为仅需相机安装、图像采集和模型部署，从而节省通常用于图像标注和模型重训练的大量资源和时间，改变仓库自动化实施方式。这是一项实验研究，并非生产部署。

English abstract

Deploying computer vision models in warehouse facilities traditionally requires extensive resources for camera mounting, image collection, annotation, training, and deployment, a process that often must be repeated in each new environment because of camera mounting constraints and environmental variability. This paper explores an approach that streamlines this process by conducting the standard procedure solely in a laboratory setting, focusing on vertical material handling systems and anomaly detection in their forks. Through extensive experimentation, we found that combining optimal camera placement, strategic image triggering, careful model selection, and model ensembling enables effective generalization from laboratory conditions to diverse warehouse environments. This could reduce warehouse deployment to just camera mounting, image collection, and model deployment, saving the significant resources and time typically spent on image annotation and model retraining. This is an experimental research study and not a production deployment.

2605.31437 2026-06-02 cs.CV 版本更新

Astra: a generalizable report generation foundation model for 3D computed tomography

Astra：一种用于三维计算机断层扫描的通用报告生成基础模型

Zhuhao Wang, Fang Chen, Chaohui Yu, Zihan Li, Yuchao Zheng, Jing Wang, Xuan Yang, Jia Guo, Zhenlu Yang, Xingju Zheng, Yihua Sun, Haojie Han, Xiaoxiao Qin, Zhan Feng, Wenbo Xiao, Chao Zhu, Yuehua Li, Shipeng Zhang, Hao Luo, Yunsong Peng, Fan Wang, Hongen Liao

Affiliations: School of Biomedical Engineering, Tsinghua University; School of Biomedical Engineering, Shanghai Jiao Tong University; DAMO Academy, Alibaba Group; Hupan Laboratory; Department of Biomedical Engineering, National University of Singapore; Department of Radiology, Guizhou Provincial People’s Hospital; Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine; Department of Radiology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine; College of Computer Science and Technology, Zhejiang University

AI summary: Proposes Astra, which uses report-style harmonization and reinforcement learning to generate accurate CT reports across eight organ systems, with a 44.1% average improvement in fine-grained diagnostic metrics and faster clinical workflows.

AI Chinese abstract

CT解读需要放射科医生每次检查审查数百个容积切片，使得报告耗时且高度依赖专业知识。自动CT报告生成为提高临床效率提供了一条有前景的途径，但该领域仍缺乏一个支持多区域报告并在外部真实世界队列中保持鲁棒性的通用CT报告生成基础模型。不同队列间报告风格和诊断术语的内在不一致性使得朴素联合训练容易受到噪声文本监督的影响，从而限制了模型的泛化能力。本文提出Astra，一个通用的CT报告生成基础模型，在包含90,678个胸腹部CT-报告对（CTRgDB）的数据集上训练，涵盖8个器官系统的353,671个异常。通过统一报告风格并进一步通过强化学习细化诊断一致性，Astra实现了跨不同解剖区域和机构的风格一致且诊断准确的报告生成。在CTRgDB和六个外部队列上的评估显示，Astra在细粒度诊断指标上平均提升44.1%（P<0.001），达到最先进性能。在实际临床工作流中，Astra辅助将胸部报告起草速度提高29.6%，并将腹部报告完整性提高11.3%（P<0.001）。此外，Astra作为CT AI开发的基础也展现出广泛实用性，通过高质量报告合成改善下游诊断性能并扩展视觉-语言预训练。总体而言，Astra作为一个广泛可用的临床助手和下一代AI医疗的关键基础设施。

English abstract

CT interpretation requires radiologists to review hundreds of volumetric slices per examination, making reporting time-consuming and highly expertise-dependent. Automated CT report generation offers a promising route to improving clinical efficiency, yet the field still lacks a generalizable CT report generation foundation model that supports multi-region reporting and remains robust across external real-world cohorts. Intrinsic inconsistencies in reporting style and diagnostic terminology across cohorts make naive joint training prone to noisy textual supervision, thereby limiting model generalizability. Here we present Astra, a generalizable CT report generation foundation model trained on 90,678 thoracoabdominal CT-report pairs (CTRgDB) with 353,671 abnormalities spanning eight organ systems. By harmonizing report style and further refining diagnostic consistency via reinforcement learning, Astra achieves style-consistent and diagnostically accurate report generation across diverse anatomical regions and institutions. Evaluated on CTRgDB and six external cohorts, Astra achieves state-of-the-art performance with a 44.1% average improvement in fine-grained diagnostic metrics (P<0.001). In real-world clinical workflows, Astra assistance accelerates chest report drafting by 29.6% and improves abdominal report completeness by 11.3% (P<0.001). Astra also demonstrates broad utility as a foundation for CT AI development, improving downstream diagnostic performance and scaling vision-language pretraining through high-quality report synthesis. Overall, Astra serves as a broadly accessible clinical assistant and a pivotal infrastructure for the next generation of AI-powered healthcare.

2605.31162 2026-06-02 cs.CV cs.LG 版本更新

Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models

无条件扩散模型中低级感知编辑的引导

Shreyansh Modi, Akshat Tomar, Aarush Aggarwal

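Affiliations: Indian Institute of Technology Roorkee

AI summary: Unconditional diffusion models struggle with global low-level transformations for aesthetic and perceptual enhancement. The paper proposes a training-free inference-time mechanism that extracts degradation concept vectors and combines bottleneck patching with classifier-free guidance to edit images and improve their quality.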

Comments 11 pages, 12 figures, Generative Models for Computer Vision Workshop CVPR 2026

2605.30855 2026-06-02 cs.CV 版本更新

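ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning in Large-Scale Forest Scenes

Zihang Cheng, Duanchu Wang, Cheng Li, Jing Huang, Huanzhao Fu, Di Wang

AI summary: Proposes the ForestHG-Trace framework, which uses an ecological hypergraph representation and an LLM-guided chain of deterministic tools for traceable multi-step ecological reasoning over forest scenes, and builds the ForestTraceQA benchmark. It substantially improves answer accuracy and execution faithfulness in long-horizon ecological QA.

Comments It has theoretical flaws and experimental errors

AI Chinese abstract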

遥感问答（RS-QA）通常需要超越直接语义预测的能力，尤其是在大规模森林场景中，生态分析涉及多步过滤、数值聚合、邻域推理和可验证证据。我们提出ForestHG-Trace，一个用于森林环境中可追踪长程生态推理的框架。它将多模态NEON森林场景表示为生态超图，其中树木实例、空间单元、语义组和邻域关系支持超越成对场景图的高阶推理。然后，一个LLM引导的智能体调用确定性工具进行读取、过滤、扩展、聚合、比较和审计，生成可重放的执行轨迹和紧凑的证据记录，而不仅仅是自由形式的答案。我们进一步构建了ForestTraceQA，一个可执行的基准，用于评估跨不同任务类型和推理深度的生态问答。实验表明，ForestHG-Trace在答案准确性和执行忠实度上显著优于单步基线和场景图智能体，同时指出执行深度是长程生态问答的主要瓶颈。

English abstract

Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.
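
The combination of a hypergraph scene representation with deterministic, trace-producing tools can be illustrated with a toy example. Everything here is an assumption for illustration: the class name, the three tools shown, the trace format, and the `height_m` attribute are hypothetical and do not reproduce the ForestHG-Trace toolset.

```python
class TracedHypergraph:
    """Toy hypergraph whose tool calls append to a replayable trace."""

    def __init__(self, nodes, hyperedges):
        self.nodes = nodes            # node id -> attribute dict
        self.hyperedges = hyperedges  # hyperedge name -> set of node ids
        self.trace = []               # ordered record of executed steps

    def expand(self, edge_name):
        """Return all nodes grouped by one hyperedge (e.g. a spatial unit)."""
        ids = sorted(self.hyperedges[edge_name])
        self.trace.append(("expand", edge_name, len(ids)))
        return ids

    def filter(self, ids, key, predicate):
        """Keep nodes whose attribute satisfies a predicate."""
        kept = [i for i in ids if predicate(self.nodes[i][key])]
        self.trace.append(("filter", key, len(kept)))
        return kept

    def aggregate_mean(self, ids, key):
        """Numerical aggregation over the selected nodes."""
        values = [self.nodes[i][key] for i in ids]
        result = sum(values) / len(values) if values else None
        self.trace.append(("aggregate_mean", key, result))
        return result
```

Because each tool is deterministic, replaying the recorded trace on the same graph reproduces the answer, which is what makes the evidence auditable rather than free-form.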

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

融合异质注意力结构的Transformer模型通用解释方法

Yongjin Cui, Xiaohui Fan, Huajun Chen

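Affiliations: Zhejiang University

AI summary: To address the multi-source information fusion challenge posed by heterogenous attention structures such as co-attention in Transformers, the paper proposes a generic interpretation method and applies an experimental analysis paradigm to give semantic and logical interpretations of representative models.

AI Chinese abstract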

Transformer极大地推动了人工智能的发展，也推动了智能体（agent）的发展。我们将Transformer的注意力结构根据输入信息的来源分为两类：同质注意力结构和异质注意力结构。异质注意力结构以共注意力（co-attention）为典型例子，处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能、融合更多模态信息的基础。无论是出于研究目的还是政策要求，对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括方法和实验两部分。在方法方面，我们提出了一种针对具有异质注意力结构的Transformer模型的解释方法。在实验方面，基于我们的实验分析范式，我们解释代表性模型的操作机制，进行语义解释和逻辑解释。

English abstract

The Transformer has significantly advanced artificial intelligence, including the development of agents. We categorize the attention structures of Transformers into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. They are the foundation that lets Transformer models achieve more complex functions and integrate information from more modalities. Whether for research purposes or policy requirements, interpreting Transformer models with heterogenous attention structures is an important task, and the fusion of information from different sources brings new challenges. Our work has two parts: method and experimentation. For the method, we propose an interpretation method for Transformer models with heterogenous attention structures. For the experiments, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models and conduct both semantic and logical interpretation.

2605.26292 2026-06-02 cs.CV cs.CL 版本更新

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Evi-Steer：通过高效且可泛化的证据调优学习引导生物医学视觉-语言模型

Taha Koleilat, Hassan Rivaz, Yiming Xiao

Affiliations: Concordia University

AI summary: Proposes Evi-Steer, an evidential cross-modal low-dimensional steering framework for uncertainty-aware, parameter-efficient fine-tuning of BiomedCLIP that updates only 0.11% of parameters and outperforms existing methods on 15 biomedical datasets in few-shot and domain generalization settings.

Comments MICCAI 2026 Early Accept; Project Page: https://tahakoleilat.github.io/Evi-Steer. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

AI Chinese abstract

视觉-语言基础模型的参数高效适配对于生物医学图像的精确多模态理解至关重要，但现有方法仍具有确定性，且在域偏移或模糊的图像-文本对齐下常常表现不佳。这一限制在临床中尤为关键，因为模型应在低数据 regime 和域偏移下保持鲁棒性。我们提出了Evi-Steer，一个用于BiomedCLIP的证据跨模态低维引导框架，能够在仅更新总模型参数0.11%的情况下实现不确定性感知的参数高效微调。我们的方法在视觉和文本编码器中执行轻量级低维令牌更新，同时估计认知不确定性。这些不确定性估计更新门控残差，使模型在证据较弱时能够保守地适应。此外，我们引入了基于Dempster-Shafer理论的跨模态置信度融合，使视觉适应能够以文本置信度为条件，并抑制冲突或不确定的跨模态更新。我们在涵盖8个器官和8种成像模态的15个生物医学成像数据集上，在少样本学习和域泛化设置下进行了全面评估。Evi-Steer在少样本学习和域偏移设置下始终优于最先进的方法，展示了在真实临床环境中部署视觉-语言模型的实用且鲁棒的途径。代码可在https://github.com/HealthX-Lab/Evi-Steer获取。

English abstract

Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.
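
The abstract states only that the cross-modal confidence fusion is based on Dempster-Shafer theory. As background, the sketch below implements the classic Dempster rule of combination for two mass functions; the paper's actual fusion variant may differ, and the function name and dictionary representation are assumptions for illustration.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    with masses summing to 1. Mass on the full frame expresses uncertainty.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                # Products of masses on disjoint sets count as conflict.
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination undefined")
    # Renormalize the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}
```

When two sources agree, the fused belief becomes sharper than either input; when they disagree, the conflicting mass is discarded and the rest renormalized, which is one way conflicting cross-modal evidence can be damped.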

2605.24634 2026-06-02 cs.CV 版本更新

Resolving Ambiguity in Composed Image Retrieval via Calibrated Interaction

通过校准交互解决组合图像检索中的歧义

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

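AI summary: Reframes composed image retrieval as calibrated intent resolution under uncertainty: a conformal prediction layer returns a candidate set with a coverage guarantee, and an expected-information-gain policy asks the most useful clarifying question, addressing query ambiguity and the false-negative problem.

AI Chinese abstract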

组合图像检索（CIR）使用参考图像和描述如何修改它的文本搜索语料库。尽管从三元组训练的合成器到零样本和生成方法取得了快速进展，但所有系统本质上都共享一个假设：查询映射到单个目标，通过Recall@K针对一个标注进行评分。我们认为这与任务根本不一致。诸如“使其更正式”之类的查询并不命名一个图像，而是命名语料库的一个区域，用户意图中的哪个成员真正是不确定的。这种欠指定是众所周知的假阴性问题的根源，并使得当前模型无法区分精确查询和模糊查询。我们将CIR重新定义为不确定性下的校准意图解析：检索器被包裹在一个共形预测层中，该层返回一个具有覆盖保证的候选集，其大小是歧义的原则性度量；当集合很大时，期望信息增益策略从可解释的歧义轴中提出一个最有用的澄清问题，然后集合收缩。我们引入了AmbiCIR，一个基准和经过人工验证的用户模拟器，它复活了CIRR中休眠的辅助和对话标注，并扩展了CIRCO的多正例设置。在开放域和时尚基准上，我们的方法匹配了单轮最先进水平，确认了校准解析在精确查询上是无成本的，同时以朴素对话基线所需交互预算的一小部分达到预期目标，并且它是第一个为任务报告有效覆盖和校准的方法。

English abstract

Composed image retrieval (CIR) searches a corpus with a reference image and a text describing how to modify it. Despite rapid progress from triplet-trained compositors to zero-shot and generative methods, essentially all systems share one assumption: that a query maps to a single target, scored by Recall@K against one annotation. We argue this is fundamentally at odds with the task. A query such as "make it more formal" does not name an image but a region of the corpus, and which member the user intends is genuinely underdetermined. This underspecification is the root of the well-known false-negative problem and leaves current models unable to tell a precise query from an ambiguous one. We reframe CIR as calibrated intent resolution under uncertainty: a retriever is wrapped in a conformal prediction layer that returns a candidate set with a coverage guarantee and whose size is a principled measure of ambiguity; when the set is large, an expected-information-gain policy asks the single most useful clarifying question, drawn from interpretable ambiguity axes, and the set contracts. We introduce AmbiCIR, a benchmark and human-validated user simulator that revive the dormant auxiliary and dialogue annotations of CIRR and extend the multiple-positive setting of CIRCO. Across open-domain and fashion benchmarks our method matches single-turn state of the art, confirming calibrated resolution is cost-free on precise queries, while reaching the intended target in a fraction of the interaction budget required by naive conversational baselines, and it is the first to report valid coverage and calibration for the task.
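
The conformal layer described above can be illustrated with standard split conformal prediction. This is a generic sketch, not the paper's implementation: it assumes a nonconformity score where lower means a better match (for example, one minus retrieval similarity), and the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Score threshold giving at least 1 - alpha coverage (split conformal).

    calibration_scores are the nonconformity scores of the true targets
    on held-out calibration queries.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample correction
    if rank > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def candidate_set(candidate_scores, threshold):
    """All corpus items whose score is within the calibrated threshold."""
    return [item for item, score in candidate_scores.items() if score <= threshold]
```

The size of the returned set then serves as the ambiguity signal: a precise query yields a small set, while an underspecified one yields a large set that could trigger a clarifying question.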

2605.26102 2026-06-02 cs.CV 版本更新

InstructSAM: Segment Any Instance with Any Instructions

InstructSAM: 根据任意指令分割任意实例

Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang

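Affiliations: Zhejiang University; Nanjing University of Aeronautics and Astronautics

AI summary: Proposes InstructSAM, which formulates instruction-driven instance segmentation as set-structured query prediction and designs an explicit reasoning-to-instance query interface that connects a vision-language model to SAM3 for multi-instance segmentation in a single forward pass.

Comments 19 pages, 8 figures, code: https://github.com/DCDmllm/InstructSAM

AI Chinese abstract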

在本文中，我们介绍了InstructSAM，一个统一且精简的框架，旨在任意指令下进行多实例分割。我们将指令驱动的实例分割形式化为一个集合结构查询预测问题，并提出了一个显式的推理到实例查询接口，优雅地桥接了视觉语言模型（VLM）和SAM3。具体来说，一组可学习的实例查询被注入到VLM中，并与指令和视觉信息进行上下文关联，使每个查询成为一个实例感知槽。混合注意力机制进一步促进了这些查询、视觉令牌和指令令牌之间的交互，改进了实例枚举并减少了重复预测。得到的LLM条件查询被投影到SAM3的检测器查询空间中，以在单次前向传播中驱动准确的多实例分割。这种设计赋予了SAM3高级指令理解、组合推理和实例级集合预测的能力，而无需修改其核心架构。为了支持训练和评估，我们进一步构建了Inst2Seg，一个高质量、大规模的基于指令的实例分割数据集和基准，将自由形式的指令与实例级掩码配对。大量实验表明，仅2B规模的InstructSAM在复杂的指令驱动和短语级指代分割基准上取得了强劲的结果，超越了之前的端到端方法和SAM3的代理流水线，同时实现了高效的单次多实例预测。

English abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.

2605.26089 2026-06-02 cs.CV cs.AI 版本更新

Channel-wise Vector Quantization

通道级向量量化

Wei Song, Tianhang Wang, Yitong Chen, Tong Zhang, Zuxuan Wu, Min Li, Jiaqi Wang, Kaicheng Yu

发表机构 * Shanghai Innovation Institute（上海创新研究院）； Westlake University（西湖大学）； Zhejiang University（浙江大学）； Fudan University（复旦大学）； JD.COM（京东公司）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结提出通道级向量量化（CVQ）代替补丁级量化，并基于此设计通道级自回归（CAR）模型，通过逐通道预测实现渐进式细节生成，在图像重建和文本到图像生成中取得优异性能。

AI中文摘要

我们提出了通道级向量量化（CVQ），一种新颖的图像标记化范式，用通道级标记取代补丁级标记。与传统的向量量化（为每个补丁特征向量分配一个离散标记）不同，CVQ 对特征图的每个通道进行量化。这种表示将图像表示为视觉细节的离散层级，而不是空间补丁的网格。基于 CVQ，我们引入了一种新的视觉自回归框架，采用“下一通道预测”。我们的通道级自回归（CAR）模型不是按光栅顺序逐补丁渲染图像，而是顺序预测图像通道，逐步生成更丰富的视觉细节。具体来说，它首先勾勒全局结构，然后细化细粒度属性，类似于人类艺术家的创作流程。实验表明：（1）CVQ 在 16K+ 的码本大小下实现了 100% 的码本利用率，无需任何额外技巧，并且显著提高了传统 VQ 的重建质量；（2）CAR 在 DPG 评分中达到 86.7，在 GenEval 评分中达到 0.79，展示了其在文本到图像生成中的强大有效性。

英文摘要

We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This formulation represents an image as discrete levels of visual details, rather than as a grid of spatial patches. Based on CVQ, we introduce a new visual autoregressive framework with "next-channel prediction". Instead of rendering images patch by patch in raster order, our Channel-wise Autoregressive (CAR) model predicts image channels sequentially, producing progressively enriched visual details. Specifically, it first sketches global structure and then refines fine-grained attributes, akin to a human artist's workflow. Empirically, we show that: (1) CVQ achieves 100% codebook utilization with a 16K+ codebook size without any bells and whistles, and substantially improves reconstruction quality over conventional VQ; and (2) CAR attains a DPG score of 86.7 and a GenEval score of 0.79, demonstrating strong effectiveness for text-to-image generation.
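
The contrast between patch-wise and channel-wise tokenization can be illustrated with a toy example. This is a minimal sketch, assuming a plain nearest-neighbor codebook lookup as in conventional VQ; the abstract does not specify how CVQ performs the lookup, and the function names, the tiny feature map, and both codebooks below are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def patch_wise_tokens(feature_map: Sequence[Vector],
                      codebook: Sequence[Vector]) -> List[int]:
    """Conventional VQ: one token per spatial position.

    feature_map[c][p] holds channel c at flattened position p, so each
    position contributes a C-dimensional vector to quantize.
    """
    num_positions = len(feature_map[0])
    return [
        nearest_code([channel[p] for channel in feature_map], codebook)
        for p in range(num_positions)
    ]


def channel_wise_tokens(feature_map: Sequence[Vector],
                        codebook: Sequence[Vector]) -> List[int]:
    """Channel-wise idea: one token per channel.

    Each channel's full spatial vector (length P) is quantized as a unit.
    """
    return [nearest_code(channel, codebook) for channel in feature_map]


# Hypothetical toy feature map: C = 2 channels, P = 3 spatial positions.
feature_map = [[0.1, 0.9, 0.2],
               [1.0, 0.0, 0.9]]
patch_codebook = [[0.0, 1.0], [1.0, 0.0]]              # entries of dimension C
channel_codebook = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]  # entries of dimension P

print(patch_wise_tokens(feature_map, patch_codebook))      # [0, 1, 0]
print(channel_wise_tokens(feature_map, channel_codebook))  # [0, 1]
```

The token count changes from one per position (3 here) to one per channel (2 here), which is what lets an autoregressive model predict an image channel by channel rather than patch by patch.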

2605.29977 2026-06-02 cs.CV cs.LG 版本更新

EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

EVL-ECG：基于多方面异构知识蒸馏的高效心电图解读

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

发表机构 * University of Notre Dame（圣母大学）

AI总结提出EVL-ECG框架，通过多头交叉注意力对齐、最优传输视觉特征匹配和几何结构关系匹配三种创新方法，实现跨架构知识蒸馏，在资源受限环境下高效解读心电图。

Comments Accepted at the SD4H Workshop at ICML 2026. 7 pages, 3 figures

AI中文摘要

高保真心电图解读越来越依赖于大规模基础模型，但其在临床边缘护理中的部署仍受到极端计算需求的阻碍。虽然知识蒸馏（KD）是一种有前景的解决方案，但传统方法在跨异构架构传递知识时，无法捕捉心电图信号的复杂时空依赖关系。本文提出EVL-ECG，一个专门用于心脏诊断逻辑跨架构蒸馏的框架。EVL-ECG引入了三种心电图感知创新：（1）多头交叉注意力对齐，协调架构差异以保留细粒度形态特征；（2）基于最优传输的视觉特征匹配，利用最优传输在标记表示不匹配的情况下保持跨心电图导联的全局结构关系；（3）几何结构内关系匹配，蒸馏教师模型的潜在诊断推理。在心电图基准测试上的评估表明，EVL-ECG相比现有基线，AUC提升高达2.4%，临床准确率提升1.1%。值得注意的是，EVL-ECG建立了一个高效的20亿参数心电图基础模型，适用于资源受限的临床环境。

英文摘要

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.

2605.16415 2026-06-02 cs.CV cs.LG 版本更新

Diffusion Models, Denoiser Architecture and Creativity

扩散模型、去噪器架构与创造力

Itamar Levine, Yair Weiss

发表机构 * The Hebrew University of Jerusalem（耶路撒冷希伯来大学）

AI总结本文通过理论和实验表明，扩散模型的创造力源于去噪器架构与目标分布之间的相互作用，并指出去噪器架构的归纳偏差必须与真实目标分布高度一致才能成功。

AI中文摘要

扩散模型的创造力是指它们生成与训练数据不同但高度逼真的图像的能力。这种创造力多少令人惊讶，因为已知如果扩散模型中使用的去噪器是给定训练集的贝叶斯最优去噪器，那么模型只会简单地复制训练样本。在本文中，我们给出经验和理论结果，表明扩散模型的创造力源于去噪器架构与目标分布之间的相互作用。理论上，我们针对三种不同的去噪器架构（线性、多项式、瓶颈），给出了生成样本分布关于目标分布和去噪器架构的显式表达式。经验上，我们表明对流行的UNET去噪器架构做微小改动会导致非常不同的创造力形式，并且这些微小改动往往会产生高度不真实的样本。综合来看，我们的结果表明，只有当去噪器架构的归纳偏差与真实目标分布高度一致时，扩散模型才能成功。

英文摘要

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture lead to very different forms of creativity, and these small changes often yield samples that are highly unrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

2605.29539 2026-06-02 cs.CV cs.AI 版本更新

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

GiPL: 用于跨域小样本目标检测的生成增强迭代伪标签方法

Jiacong Liu, Shu Luo, Yikai Qin, Yaze Zhao, Yongwei Jiang, Yixiong Zou

发表机构 * Huazhong University of Science and Technology（华中科技大学）

AI总结提出GiPL双分支训练框架，通过迭代伪标签自训练和生成数据增强，解决跨域小样本目标检测中支持集利用不足和过拟合问题。

Comments CVPR 2026 Workshop

AI中文摘要

视觉语言基础模型在跨域小样本目标检测（CD-FSOD）中展现出有前景的零样本泛化能力。然而，它们在微调过程中面临两个关键挑战：由于稀疏的单实例标注导致支持集利用不足，以及在极其有限的目标域样本下严重过拟合。为解决这些问题，本文提出GiPL，一个高效的双分支训练框架。在第一个分支中，我们设计了一种迭代伪标签自训练范式，该范式对支持集进行零样本推理以生成可靠的伪标注，将其与真实标签融合，并迭代优化模型以充分利用支持集数据。在第二个分支中，我们引入了使用大型视觉语言模型的生成式数据增强流程，该流程合成域对齐、多目标标注的图像以丰富训练样本并抑制过拟合。在三个具有挑战性的CD-FSOD数据集（RUOD、CARPK、CarDD）上，在1/5/10样本设置下的大量实验表明，GiPL始终以显著的性能提升优于最先进的方法。代码见 https://github.com/z-yaz/CDiscover 。

英文摘要

Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce a generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at https://github.com/z-yaz/CDiscover.
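
The step of fusing pseudo-annotations with ground-truth labels could look roughly like the following. This is only a sketch under stated assumptions: the abstract does not say how the fusion is done, so the confidence threshold, the IoU-based de-duplication, and the names `iou` and `fuse_labels` are all hypothetical choices, not the authors' procedure.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_labels(gt_boxes: List[Box],
                pseudo: List[Tuple[Box, float]],
                score_thresh: float = 0.5,
                iou_thresh: float = 0.5) -> List[Box]:
    """Keep every ground-truth box, then add confident pseudo boxes
    that do not duplicate a box already kept (assumed fusion rule)."""
    fused = list(gt_boxes)
    for box, score in pseudo:
        if score < score_thresh:
            continue  # drop low-confidence predictions
        if all(iou(box, kept) < iou_thresh for kept in fused):
            fused.append(box)
    return fused


# Hypothetical support image: one annotated instance, three predictions.
gt = [(0.0, 0.0, 10.0, 10.0)]
predictions = [
    ((1.0, 1.0, 10.0, 10.0), 0.9),    # overlaps the annotated box (IoU 0.81)
    ((20.0, 20.0, 30.0, 30.0), 0.8),  # new, confident instance
    ((40.0, 40.0, 50.0, 50.0), 0.2),  # low confidence
]
print(fuse_labels(gt, predictions))
# [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
```

In an iterative scheme, the fused list would serve as the training target for the next round, after which inference is repeated on the support set.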

2605.29488 2026-06-02 cs.CV cs.AI 版本更新

深度心理视觉图像表示

Wendi Ma, Aryaman Sharma, Wei Dai, Shekhar S. Chandra

发表机构 * School of EECS, The University of Queensland（昆士兰大学电气工程与计算机科学学院）

AI总结受心理视觉模型启发，提出深度视觉编码方法，利用频域表示和复值图像表示实现心理视觉风格的抽象，构建首个基于心理视觉的深度学习框架，通过数据驱动频谱滤波器学习任务相关语义结构，实验表明该模型提取可解释性强的物体部分，且对深度依赖较小。

AI中文摘要

心理视觉模型表明，人类视觉通过首先形成中间抽象来将低级特征提取与高级认知解耦。相比之下，基于深度学习的视觉模型通常使用同质空间层堆叠来提取和聚合特征，导致其决策过程不透明。在本文中，我们提出了深度视觉编码，这是一种受20世纪90年代图像编码启发的学习频域表示，该编码量化了感知显著的频率，与复值图像表示一起产生心理视觉风格的抽象。该方法实现了首个基于心理视觉的深度学习框架，利用数据驱动的频谱滤波器学习在不同频率子带内编码任务相关的语义结构。显著性分析表明，与常规卷积神经网络产生的无定形区域相比，我们的心理视觉模型提取了高度可解释的物体部分。此外，我们发现对于模型缩放，我们的模型对深度的依赖小于CNN，因为我们的复值表示和学习抽象取代了深层空间层的作用。这些发现共同表明，心理视觉编码为更高效和透明的视觉模型提供了一条有前景的路径。

英文摘要

Psychovisual models suggest human vision decouples low-level feature extraction from higher cognition by first forming intermediate abstractions. In contrast, deep learning-based vision models routinely extract and aggregate features using homogeneous stacks of spatial layers, rendering their decision-making processes opaque. In this paper, we propose Deep Visual Coding, a learned frequency-domain representation inspired by 1990s image codes that quantised perceptually salient frequencies, which together with complex-valued image representations produces psychovisual-style abstractions. This approach enables the first psychovisual-based deep learning framework, utilizing data-driven spectral filters that learn to encode task-relevant semantic structures within distinct frequency sub-bands. Salience analyses reveal that our psychovisual models extract highly interpretable object parts compared to the amorphous regions produced by regular Convolutional Neural Networks (CNNs). Furthermore, we find that our models are less depth dependent than CNNs for model scaling, since our complex-valued representations and learned abstractions subsume the role of the deep spatial layers. Together, these findings demonstrate that psychovisual coding provides a promising path toward more efficient and transparent vision models.

2605.28995 2026-06-02 cs.CV 版本更新

GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

GAP3D: 将VLM潜在表示与补丁级嵌入进行生成式对齐以实现3D生成

Polytimi Anna Gkotsi, Andrii Zadaianchuk, Mohammad Mahdi Derakhshani

AI总结提出GAP3D，一种基于扩散的模块化方法，将VLM生成的潜在表示直接对齐到预训练图像编码器的完整补丁级特征空间，使冻结的下游生成模型能够利用VLM作为提示编码器，同时保持空间结构化的条件信号，在3D资产生成中无需大规模3D数据训练，并展现出多模态提示的零样本能力。

AI中文摘要

最近将视觉语言模型（VLM）作为生成模型条件提示编码器的方法通常依赖于昂贵的端到端训练或将特征映射到压缩表示，丢弃了像3D资产生成这类几何感知任务所需的密集空间结构。为了解决这个问题，我们提出了GAP3D，一种基于扩散的模块化方法，它将VLM生成的潜在表示直接对齐到预训练图像编码器的完整补丁级特征空间，使得冻结的下游生成模型能够利用VLM作为提示编码器，同时保持空间结构化的条件信号。在3D资产生成上的评估表明，我们的方法主要通过训练通用领域的图像-文本对来绕过对大规模3D数据的需求。尽管仅使用文本输入进行训练，但它还展现出对多模态提示的涌现零样本能力。最后，虽然目前优先考虑高级语义而非细粒度细节，但GAP3D表明，通过基于扩散的对齐，VLM和图像编码器特征空间之间的表示差距可以部分弥合，这为通过生成式对齐到密集嵌入空间实现基础模型的模块化集成迈出了第一步。

英文摘要

Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.

2505.11158 2026-06-02 eess.IV cs.CV 版本更新

Diffusion Models for Hyperspectral Image Analysis: A Comprehensive Review

扩散模型在高光谱图像分析中的应用：综述

Xing Hu, Xiangcheng Liu, Qianqian Duan, Lian Zhang, Huiliang Shang, Linhua Jiang, Haima Yang, Dawei Zhang

发表机构 * School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology（上海理工大学光电信息与计算机工程学院）； School of Electronics and Electrical Engineering, Shanghai University of Engineering Science（上海工程技术大学电子与电气工程学院）； Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University（河北医科大学第一医院医学人工智能实验室）； Hangzhou Institute of Technology, Xidian University（西安电子科技大学杭州研究院）

AI总结本文系统综述了扩散模型（包括去噪扩散概率模型和基于随机微分方程的生成框架）在高光谱图像处理中的最新进展，分类现有方法，强调其处理高维数据的优势，并与传统方法比较性能，特别关注变化检测和灾后异常识别等关键应用，同时讨论计算成本和训练稳定性等局限，并展望未来研究方向。

Comments Published in Neural Networks

DOI: 10.1016/j.neunet.2026.109109
Journal ref: Neural Networks (2026) 109109

AI中文摘要

高光谱图像（HSI）分析在遥感、农业和环境监测中起着关键作用。然而，传统方法通常难以处理HSI数据中固有的高维度、光谱冗余和噪声，限制了其准确性和可扩展性。最近，扩散模型（包括去噪扩散概率模型和其他基于随机微分方程的生成框架）在捕捉复杂光谱空间结构和生成高保真HSI数据方面显示出强大潜力。这些模型为噪声抑制、数据增强、分类和异常检测等任务提供了有效解决方案。本文系统总结了扩散模型在HSI处理中的最新进展。我们对现有方法进行分类，强调其处理高维数据的优势，并与传统方法进行性能比较。特别关注变化检测和灾后异常识别等关键应用。本文还讨论了当前局限性，如计算成本和训练稳定性，并概述了潜在的研究方向。我们的主要贡献可总结如下：提供了基于扩散的HSI方法的系统分类，考察了它们在主要遥感任务中的应用，并提供了对未来研究潜在方向的见解。通过这些努力，本综述旨在支持社区利用深度学习模型实现更有效和高效的高光谱图像分析。

英文摘要

Hyperspectral image (HSI) analysis plays a critical role in remote sensing, agriculture, and environmental monitoring. However, traditional methods often struggle to handle the high dimensionality, spectral redundancy, and noise inherent in HSI data, limiting their accuracy and scalability. Recently, diffusion models, including denoising diffusion probabilistic models and other generative frameworks based on stochastic differential equations, have shown strong potential in capturing complex spectral-spatial structures and generating high-fidelity HSI data. These models offer effective solutions for tasks such as noise suppression, data augmentation, classification, and anomaly detection. This review presents a systematic summary of recent advances in diffusion models for HSI processing. We categorize existing methods, highlight their strengths in handling high-dimensional data, and compare their performance with conventional approaches. Special attention is given to critical applications such as change detection and post-disaster anomaly identification. The review also discusses current limitations, such as computational cost and training stability, and outlines potential research directions. Our main contributions can be summarized as follows: we provide a systematic taxonomy of diffusion-based HSI methods, examine their applications across major remote sensing tasks, and offer perspectives on potential directions for future research. With these efforts, this review seeks to support the community in harnessing deep learning models to achieve more effective and efficient hyperspectral image analysis.

2605.25195 2026-06-02 cs.CV 版本更新

Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation

Baton: 用于联合视频-音频生成的显式语义蓝图

Shuyuan Tu, Qi Tian, Zihan Yang, Yue Wu, Xintong Han, Weijie Kong, Jiangfeng Xiong, Jian-Wei Zhang, Zhao Zhong, Liefeng Bo, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Tencent Hunyuan（腾讯混元）

AI总结提出Baton框架，通过VA-Planner生成语义对齐的模态感知规划令牌作为蓝图，注入扩散骨干以协调视频和音频去噪，解决现有方法因缺乏共享长期规划导致的跨模态对齐脆弱问题。

AI中文摘要

当前的开源扩散模型难以生成稳定且同步的视听内容，尤其是在需要复杂语义推理的场景中。根本原因在于现有方法依赖现成编码器生成的粗糙文本嵌入来引导音频-视频去噪，这丢弃了细粒度语义，并且关键的是缺乏共享的长期规划，导致去噪轨迹不协调和跨模态对齐脆弱。我们提出Baton，这是第一个将显式语义规划引入联合视频-音频生成的框架。我们的关键洞察是，用语义丰富、模态感知的规划令牌（在去噪前经过联合推理和相互对齐）补充粗糙文本引导，可以同时恢复细粒度语义细节并建立协调音频和视频去噪轨迹的共享蓝图。具体来说，Baton首先引入VA-Planner，这是一个配备双语义对齐塔的多模态语言模型，其中可学习查询与视频和音频特征进行交叉注意力，生成一对语义对齐的视频和音频规划令牌作为关键帧级别的蓝图。这些规划令牌通过交叉注意力层注入扩散骨干，提供与粗糙文本嵌入互补的时域引导。由于规划令牌与扩散潜变量不具有一一对应的时空对应关系，我们进一步提出相对语义RoPE，一种相对位置编码，将规划令牌和潜变量映射到共享的时空坐标框架中，使每个潜变量能够准确关注其位置对应的语义线索。基准实验在定性和定量上均证明了Baton的有效性。

英文摘要

Current open-source diffusion models struggle to generate stable and synchronized audio-visual content, particularly in scenarios demanding complex semantic reasoning. The root cause is that existing methods rely on coarse text embeddings from off-the-shelf encoders to guide audio-video denoising, which discards fine-grained semantics and, critically, lacks a shared long-horizon plan, leading to uncoordinated denoising trajectories and fragile cross-modal alignment. We propose Baton, the first framework that introduces explicit semantic planning into joint video-audio generation. Our key insight is that complementing coarse text guidance with semantically rich, modality-aware planned tokens, jointly reasoned and mutually aligned before denoising, can simultaneously restore fine-grained semantic detail and establish a shared blueprint that coordinates both audio and video denoising trajectories. Concretely, Baton first introduces the VA-Planner, a multimodal language model equipped with dual semantic alignment towers, where learnable queries cross-attend to both video and audio features to produce a pair of semantically aligned video and audio planned tokens as keyframe-level blueprints. These planned tokens are injected into the diffusion backbone via cross-attention layers, providing temporally grounded guidance complementary to coarse text embeddings. Since planned tokens do not share one-to-one spatial-temporal correspondence with diffusion latents, we further propose Relative Semantic RoPE, a relative positional encoding that maps planned tokens and latents into a shared spatial-temporal coordinate frame, enabling each latent to accurately attend to its positionally corresponding semantic cues. Experiments on benchmarks show the effectiveness of Baton both qualitatively and quantitatively.

2605.25144 2026-06-02 cs.CV 版本更新

SpikeReg: Energy-Efficient 3D Deformable Medical Image Registration with Spiking Neural Networks

SpikeReg: 基于脉冲神经网络的高能效3D可变形医学图像配准

Ali Mikaeili Barzili, Behzad Moshiri, Hamid Azadegan, Mohammad-Reza A. Dehaqani

发表机构 * School of Electrical and Computer Engineering, College of Engineering, University of Tehran（德黑兰大学电气与计算机工程学院）； Max Planck Institute for Brain Research（马克斯·普朗克脑科学研究所）； School of Computer Engineering, Iran University of Science and Technology (IUST)（伊朗科学技术大学计算机工程学院）； Department of Electrical and Computer Engineering, University of Waterloo（滑铁卢大学电气与计算机工程系）

AI总结提出SpikeReg，一种脉冲U-Net，通过层间权重迁移和激活百分位阈值校准从模拟ANN教师初始化，结合局部互相关、扩散正则化和脉冲率稀疏性的代理梯度微调，在OASIS Learn2Reg验证集上达到Dice 0.7474，与ANN教师无显著差异，同时实现12.8%平均脉冲率和55.5倍算术能量降低。

AI中文摘要

可变形医学图像配准对齐图像中的解剖结构，但在3D分辨率下计算密集。脉冲神经网络（SNN）提供稀疏事件驱动计算，但尚未系统研究用于可变形医学图像配准。我们提出SpikeReg，一种用于3D脑MRI配准的脉冲U-Net。SpikeReg从模拟ANN配准教师初始化，通过层间权重迁移和激活百分位阈值校准进行转换，并使用结合局部互相关、扩散正则化和脉冲率稀疏性的代理梯度目标进行微调。在OASIS Learn2Reg验证集（19对图像）上，SpikeReg达到Dice 0.7474 ± 0.032，与ANN教师（0.7480 ± 0.037，p = 0.67）无显著配对Dice差异，平均脉冲率为12.8%，相对于密集ANN基线，在事件稀疏SynOps/MAC代理下投影算术能量降低55.5倍。我们还报告了两个负面发现：来自ANN教师的位移蒸馏损害性能，以及使用标签Dice损失训练的ANN教师无法通过速率编码转换。这些结果共同表明，密集几何预测可以在稀疏事件驱动计算下进行，为神经形态医学图像配准开辟了道路。

英文摘要

Deformable medical image registration aligns anatomical structures across images but remains computationally dense at 3D resolution. Spiking neural networks (SNNs) offer sparse event-driven computation, yet have not been systematically studied for deformable medical image registration. We introduce SpikeReg, a spiking U-Net for 3D brain MRI registration. SpikeReg is initialized from an analog ANN registration teacher, converted by layer-wise weight transfer and activation-percentile threshold calibration, and fine-tuned with a surrogate-gradient objective combining local cross-correlation, diffusion regularization, and spike-rate sparsity. On the OASIS Learn2Reg validation split (19 image pairs), SpikeReg reaches Dice 0.7474 ± 0.032, with no significant paired Dice difference from the ANN teacher (0.7480 ± 0.037, p = 0.67), at a 12.8% mean spike rate and a 55.5× projected arithmetic-energy reduction under an event-sparse SynOps/MAC proxy relative to the dense-ANN baseline. We additionally report two negative findings: displacement distillation from the ANN teacher hurts performance, and ANN teachers trained with a label-Dice loss fail to transfer through rate-code conversion. Together these results show that dense geometric prediction can be performed under sparse event-driven computation, opening a path toward neuromorphic medical image registration.
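
For readers unfamiliar with the Dice score quoted above, the following sketch shows the standard overlap formula, 2|A∩B| / (|A| + |B|), on flattened binary masks. The function name and the toy masks are illustrative only; the paper's evaluation over anatomical labels and 3D volumes is not reproduced here.

```python
from typing import Sequence


def dice_score(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2.0 * intersection / total if total else 1.0


# Toy example: warped moving-image label vs. fixed-image label.
warped_label = [1, 1, 0, 0]
fixed_label = [1, 0, 1, 0]
print(dice_score(warped_label, fixed_label))  # 0.5
```

A score of 1.0 means the warped and fixed structures coincide exactly, and 0.0 means they do not overlap at all.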

2605.24716 2026-06-02 cs.CV eess.SP 版本更新

Physics-Guided Self-Supervised Statistical Residual Learning for Sonar Despeckling with Improved Generalization

物理引导的自监督统计残差学习用于声纳图像去斑及泛化改进

Swapna Pillai, Siddharth Singh Savner, Sujit Kumar Sahoo

发表机构 * School of Electrical Sciences, Indian Institute of Technology Goa（印度理工学院果阿分校电气科学学院）； Inria, Sophia Antipolis, France（法国Inria索菲亚-安提波利斯研究中心）

AI总结提出一种物理引导的自监督框架，通过同态对数域残差一致性约束，结合方差统计损失、边缘感知正则化和中值引导课程学习，实现无需干净监督的声纳图像去斑，并在多个真实数据集上达到最优性能且具有跨数据集鲁棒性。

DOI: 10.1109/LSP.2026.3697693
Journal ref: IEEE Signal Processing Letters, Early Access, pp. 1-5, 2026

AI中文摘要

本文介绍了一种物理引导的自监督框架用于声纳图像去斑，该框架将去斑重新表述为同态对数域中的残差一致性。通过约束对数比残差服从乘性散斑统计，所提方法无需干净监督即可防止恒等解退化。结合方差目标统计损失、边缘感知结构正则化以及中值引导的课程学习，该方法在保持结构保真度的同时实现了有效的散斑抑制。该公式与轻量级神经网络相结合，在多个真实声纳数据集上实现了最先进的性能，并展现出优异的跨数据集鲁棒性，同时适用于实时部署。

英文摘要

This letter introduces a physics-informed self-supervised framework for sonar image despeckling that reformulates despeckling as residual consistency in the homomorphic log domain. By constraining the log-ratio residual to obey multiplicative speckle statistics, the proposed method eliminates the need for clean supervision while preventing degenerate identity solutions. A variance-targeted statistical loss combined with edge-aware structural regularization and median-guided curriculum stabilization enables effective speckle suppression with preserved structural fidelity. This formulation along with a lightweight neural network achieves state-of-the-art performance across multiple real sonar datasets and demonstrates excellent cross-dataset robustness, while remaining suitable for real-time deployment.
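
The idea of constraining the log-ratio residual can be sketched as follows. Under the multiplicative model y = x · n, taking logarithms gives log y = log x + log n, so the residual log y − log x̂ should behave like log-speckle. This is a minimal sketch under assumptions: the squared-difference form of the variance loss, the `target_var` value, and the function names are hypothetical, since the letter's exact loss is not given in the abstract.

```python
import math
from statistics import pvariance
from typing import List, Sequence


def log_ratio_residual(noisy: Sequence[float],
                       estimate: Sequence[float],
                       eps: float = 1e-6) -> List[float]:
    """Residual in the log domain: log(y) - log(x_hat), per pixel."""
    return [math.log(y + eps) - math.log(x + eps)
            for y, x in zip(noisy, estimate)]


def variance_target_loss(noisy: Sequence[float],
                         estimate: Sequence[float],
                         target_var: float,
                         eps: float = 1e-6) -> float:
    """Penalize residual variance that deviates from the assumed
    log-speckle variance (hypothetical squared-difference form)."""
    residual = log_ratio_residual(noisy, estimate, eps)
    return (pvariance(residual) - target_var) ** 2


noisy_pixels = [1.0, 2.0, 4.0]

# Identity solution (output == input): the residual is all zeros, its
# variance is 0, and the loss equals target_var ** 2, so it is penalized.
print(variance_target_loss(noisy_pixels, noisy_pixels, target_var=0.25))  # 0.0625
```

Because an estimate that simply copies its input leaves zero residual variance, a loss of this shape gives the network a reason to remove speckle even without clean reference images.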

2603.09095 2026-06-02 cs.CL cs.CV 版本更新

Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

阅读，而非思考：理解并弥合多模态大语言模型中文本变为像素时的模态差距

Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, Fan Bai

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； Amazon（亚马逊）； New York University（纽约大学）； Texas A&M University（德克萨斯农工大学）

AI总结本文系统诊断多模态大语言模型在处理图像文本时的模态差距，发现其源于模型推理意愿不足而非感知失败，并提出一种轻量级自蒸馏方法有效弥合该差距。

AI中文摘要

多模态大语言模型（MLLMs）能够处理以图像形式呈现的文本，但它们的表现往往不如相同内容以文本令牌形式提供时。我们通过在五种输入模式下跨七个基准评估七个MLLM，系统性地诊断了这种“模态差距”，涵盖了从合成渲染文本到来自arXiv PDF和Wikipedia页面的真实文档图像。我们发现，该差距对字体和分辨率等渲染选择高度敏感，并且自然文档图像通常表现出更小的差距，这表明性能差异部分反映了评估伪影而非根本性限制。通过对超过4000个示例进行基于扎根理论的错误分析，我们确定了主要原因：仅图像输入抑制了推理努力，模型产生的输出短5-19倍，跳过了逐步计算或推理。不愿推理，而非感知或知识检索失败，驱动了性能差距，尤其是在需要多步推理的任务上。我们展示了一种简单的、轻量级的在线自蒸馏方法，通过让模型在其自身的文本模式推理轨迹与图像输入配对上进行微调，弥合了这一差距，将图像模式准确率提升至匹配或超过文本模式性能，提升超过50%，并且增益可迁移到未见过的基准而不会灾难性遗忘。总体而言，我们的结果和分析提供了对模态差距的系统理解，并指出了在多模态语言模型中改进视觉文本理解的实际路径。

英文摘要

Multimodal large language models (MLLMs) can process text presented as images, yet they often perform worse than when the same content is provided as textual tokens. We systematically diagnose this "modality gap" by evaluating seven MLLMs across seven benchmarks in five input modes, spanning both synthetically rendered text and realistic document images from arXiv PDFs to Wikipedia pages. We find that the gap is highly sensitive to rendering choices such as font and resolution, and that natural document images often exhibit much smaller gaps, suggesting the performance difference partly reflects evaluation artifacts rather than fundamental limitations. Through a grounded-theory error analysis of over 4,000 examples, we identify the primary cause: image input alone suppresses reasoning effort, with models producing 5-19x shorter outputs that skip step-by-step computation or reasoning. The reluctance to reason, not a failure of perception or knowledge retrieval, drives the performance gap, particularly on tasks requiring multi-step reasoning. We show that a simple, lightweight on-policy self-distillation method, which fine-tunes models on their own text-mode reasoning traces paired with image inputs, closes this gap, raising image-mode accuracy to match or exceed text-mode performance with over 50% improvement, and the gains transfer to unseen benchmarks without catastrophic forgetting. Overall, our results and analyses provide a systematic understanding of the modality gap and suggest a practical path toward improving visual text understanding in multimodal language models.

2605.23500 2026-06-02 cs.CV cs.LG 版本更新

双锚定：解决视觉语言导航中的状态漂移问题

Kangyi Wu, Pengna Li, Kailin Lyu, Xi Lin, Lin Zhao, Qingrong He, Jinjun Wang, Jianyi Liu

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence（人机混合增强智能国家重点实验室）； National Engineering Research Center for Visual Information and Applications（视觉信息与应用国家工程研究中心）； Institute of Artificial Intelligence and Robotics（人工智能与机器人研究院）； Xi’an Jiaotong University（西安交通大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Johns Hopkins University（约翰霍普金斯大学）； Joy Future Academy, JD（京东未来学院）

AI总结提出双锚定框架，通过指令进度锚定和记忆地标锚定分别解决进度漂移和记忆漂移，显著提升长场景导航成功率。

AI中文摘要

视觉语言导航（VLN）要求智能体通过遵循自然语言指令在3D环境中导航。尽管最近的视频大语言模型（Video-LLMs）极大地推进了VLN，但在长场景中它们仍然非常容易受到状态漂移的影响。在这些情况下，智能体的内部状态偏离真实的任务执行状态，导致无目的漫游和无法执行指令中的关键操作。我们将这种失败归因于两种不同的认知缺陷：进度漂移，即智能体无法区分已完成的子目标和剩余的子目标；以及记忆漂移，即智能体的历史表示退化，使其无法跟踪已访问的地标。在本文中，我们提出了一个双锚定框架，明确锚定指令进度和历史表示。首先，为了解决进度漂移，我们引入了指令进度锚定，监督智能体生成结构化的文本标记，以描述已完成与剩余的子目标。其次，为了缓解记忆漂移，我们提出了记忆地标锚定，利用以地标为中心的世界模型回顾性地预测由Segment Anything模型提取的以对象为中心的嵌入，迫使智能体显式验证过去的观察并保留已访问地标的独特表示。为促进该框架，我们整理了两个大规模数据集：360万个带有显式进度描述的样本，以及93.7万个用于回顾性验证的接地地标数据。在模拟和真实环境中的大量实验证明了我们方法的优越性，在成功率上提高了15.2%，在长时程轨迹上获得了24.7%的显著提升。为促进进一步研究，我们将发布我们的代码、数据生成流程以及收集的数据集。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While recent Video Large Language Models (Video-LLMs) have largely advanced VLN, they remain highly susceptible to State Drift in long scenarios. In these cases, the agent's internal state drifts away from the true task execution state, leading to aimless wandering and failure to execute essential maneuvers in the instruction. We attribute this failure to two distinct cognitive deficits: Progress Drift, where the agent fails to distinguish completed sub-goals from remaining ones, and Memory Drift, where the agent's history representations degrade, making it lose track of visited landmarks. In this paper, we propose a Dual-Anchoring Framework that explicitly anchors the instruction progress and history representations. First, to address progress drift, we introduce Instruction Progress Anchoring, which supervises the agent to generate structured text tokens that delineate completed versus remaining sub-goals. Second, to mitigate memory drift, we propose Memory Landmark Anchoring, which utilizes a Landmark-Centric World Model to retrospectively predict object-centric embeddings extracted by the Segment Anything Model, compelling the agent to explicitly verify past observations and preserve distinct representations of visited landmarks. To support this framework, we curate two extensive datasets: 3.6 million samples with explicit progress descriptions, and 937k grounded landmark samples for retrospective verification. Extensive experiments in both simulation and real-world environments demonstrate the superiority of our method, achieving a 15.2% improvement in Success Rate and a remarkable 24.7% gain on long-horizon trajectories. To facilitate further research, we will release our code, data generation pipelines, and the collected datasets.

2602.02214 2026-06-02 cs.CV 版本更新

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

因果强迫：自回归扩散蒸馏的正确方法，用于高质量实时交互式视频生成

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu

AI总结针对双向扩散模型蒸馏为自回归模型时的架构差距问题，提出因果强迫方法，通过自回归教师进行ODE初始化并应用DMD过程，显著提升视频生成质量。

Comments Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing. ICML 2026

AI中文摘要

为了实现实时交互式视频生成，当前方法将预训练的双向视频扩散模型蒸馏为少步自回归（AR）模型，当全注意力被因果注意力替代时面临架构差距。然而，现有方法并未从理论上弥合这一差距。它们通过ODE蒸馏初始化AR学生模型，这需要帧级单射性，即在AR教师的PF-ODE下，每个噪声帧必须映射到唯一的干净帧。从双向教师蒸馏AR学生违反了这一条件，阻止了教师流映射的恢复，反而诱导出条件期望解，导致性能下降。为解决此问题，我们提出因果强迫（Causal Forcing），它使用自回归教师进行ODE初始化以弥合架构差距，然后应用与Self Forcing相同的DMD过程。实验结果表明，我们的方法在所有指标上优于所有基线，在动态程度、VisionReward和指令跟随上分别超过SOTA Self Forcing 19.3%、8.7%和16.7%。项目页面：https://thu-ml.github.io/CausalForcing.github.io/；代码：https://github.com/thu-ml/Causal-Forcing。

英文摘要

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, facing an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity, where each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.

2605.21964 2026-06-02 cs.CV physics.optics 版本更新

Dual-Integrated Low-Latency Single-Lens Infrared Computational Imaging for Object Detection

用于目标检测的双集成低延迟单透镜红外计算成像

Xuquan Wang, Guishuo Yang, Dapeng Yan, Yujie Xing, Xuanyu Qian, Kai Zhang, Xiong Dun, Jiande Sun

发表机构 * MOE Key Laboratory of Advanced Micro-Structured Materials（教育部先进微结构材料重点实验室）； Institute of Precision Optical Engineering（精密光学工程研究院）； School of Physics Science and Engineering（物理科学与工程学院）； Shanghai Frontiers Science Center of Digital Optics（上海市数字光学前沿科学研究基地）； School of Computer Science and Artificial Intelligence（计算机科学与人工智能学院）； Shandong Normal University（山东师范大学）； Shandong Engineering Research Center for Multimodal Computing and Intelligent Decision Making（山东省多模态计算与智能决策工程研究中心）

AI总结提出物理感知双集成网络（PDI-Net），通过嵌入光学先验并共享编码器特征，在单透镜红外相机上实现低延迟高精度目标检测。

Comments 15 pages, 11 figures; supplementary material: 3 pages, 2 figures

Abstract

Computational imaging enables compact infrared systems, but deep-learning pipelines that combine image reconstruction and object detection often introduce substantial inference latency. Most existing acceleration strategies compress the reconstruction network while overlooking physical priors from the optical path, leaving a trade-off between accuracy and speed. We present Physics-aware Dual-Integrated Network (PDI-Net), a low-latency framework that integrates infrared reconstruction with object detection and further embeds optical priors into the learning process. PDI-Net uses a supervised U-Net during training, while a semi-U-Net encoder shares features directly with a YOLO-based detector during inference, avoiding full image reconstruction. To bridge the gap between fidelity-oriented reconstruction features and detection-oriented semantics, we introduce a physics-aware large-small bridge (PALS-Bridge), which uses field-dependent point spread function priors to adaptively modulate multiscale convolutional branches. A physics-informed optical degradation simulation pipeline is also developed for training and validation. The method is deployed on a single-lens infrared camera, reducing system weight by about 50% compared with traditional multi-lens designs. On the M3FD benchmark under low-SNR conditions, PDI-Net reduces inference time by 84.06% compared with the Rec+Det with pruning strategy while improving mAP@0.5:0.95 by 5.07%. These results demonstrate compact, low-latency computational infrared imaging for real-time object detection on resource-constrained platforms.


2605.20823 2026-06-02 cs.CV (updated version)

RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, Tuan Kiet Pham, Sui Yang Guang

Affiliations: Phenikaa University

AI summary: Proposes RelWitness, a framework that uses visual-geometric relation witnesses to generate open-vocabulary 3D scene graphs from incomplete relation supervision, addressing sparse relation annotations and vocabulary expansion.

Abstract

Open-vocabulary 3D scene graph generation seeks to describe object instances and their relations with flexible natural-language predicates. The central difficulty is not only vocabulary expansion, but supervision reliability: relation annotations in 3D scene graph datasets are selective, and many valid object-pair relations are unannotated. We propose RelWitness, a framework for open-vocabulary 3D scene graph generation from posed RGB-D sequences under incomplete relation supervision. The key concept is a relation witness: a concrete visual-geometric cue that makes a relation observable in the captured scene. Support relations require contact and vertical ordering; containment requires enclosure; proximity requires metric closeness; orientation requires facing direction; and stable relations should persist across views where both objects are visible. RelWitness constructs relation witness records from RGB views, depth maps, reconstructed 3D geometry, role-sensitive text, object-prior null views, and multi-view consistency. A visual-geometric witness verifier assigns unannotated relation candidates to verified missing positives, reliable negatives, or uncertain unlabeled cases. A witness-guided positive-unlabeled objective then learns from incomplete annotations without turning every missing label into a negative. We further introduce witness-consistent decoding and an RGB-D missing-relation audit protocol. Simulated manuscript-planning experiments on 3DSSG/3RScan and ScanNet-derived open-vocabulary splits show the intended behavior: improved unseen-relation recognition, higher witness precision, lower hallucination, and reduced redundant relation phrases. All numerical results are planning values and must be replaced by reproduced measurements before submission.
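
The geometric side of the witness idea (support needs contact and vertical ordering, containment needs enclosure, proximity needs metric closeness) can be illustrated with a few box tests. This is a minimal sketch under our own assumptions, not the RelWitness verifier: the `Box` type, the axis-aligned simplification, and the tolerance values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned 3D box in metres, z pointing up (hypothetical helper type)."""
    lo: tuple  # (x_min, y_min, z_min)
    hi: tuple  # (x_max, y_max, z_max)


def overlap_1d(a: Box, b: Box, axis: int) -> float:
    """Signed overlap along one axis; a negative value is the size of the gap."""
    return min(a.hi[axis], b.hi[axis]) - max(a.lo[axis], b.lo[axis])


def supports(lower: Box, upper: Box, contact_tol: float = 0.02) -> bool:
    """Support witness: upper rests on lower's top face (contact + vertical ordering)."""
    footprints_overlap = all(overlap_1d(lower, upper, ax) > 0 for ax in (0, 1))
    touching = abs(upper.lo[2] - lower.hi[2]) <= contact_tol
    return footprints_overlap and touching


def contains(outer: Box, inner: Box) -> bool:
    """Containment witness: inner is fully enclosed by outer."""
    return all(
        outer.lo[ax] <= inner.lo[ax] and inner.hi[ax] <= outer.hi[ax]
        for ax in range(3)
    )


def near(a: Box, b: Box, max_gap: float = 0.5) -> bool:
    """Proximity witness: Euclidean gap between the boxes is at most max_gap metres."""
    gaps = [max(0.0, -overlap_1d(a, b, ax)) for ax in range(3)]
    return math.sqrt(sum(g * g for g in gaps)) <= max_gap


# Toy scene: a cup standing on a table, and a chair three metres away.
table = Box((0.0, 0.0, 0.0), (1.0, 1.0, 0.75))
cup = Box((0.4, 0.4, 0.75), (0.5, 0.5, 0.85))
chair = Box((3.0, 0.0, 0.0), (3.5, 0.5, 1.0))
```

In this toy scene `supports(table, cup)` holds while `supports(cup, table)` does not, which is the kind of role-sensitive evidence a verifier could use before accepting an unannotated relation as a missing positive.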


2605.21421 2026-06-02 cs.CV (updated version)

AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing

Lauhitya Reddy, Trisha M. Kesar, Hyeokhyen Kwon

Affiliations: Department of Biomedical Informatics, Emory University; Department of Rehabilitation Medicine, Emory University; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: Proposes AIGaitor, a system that runs markerless monocular motion capture and deep-learning analysis on a smartphone via edge computing, addressing cost, privacy, and ease-of-use barriers.

Comments: 18 pages, 3 figures, 2 tables

Abstract

Motion capture is the gold standard for measuring human movement, but clinical use remains limited by cost, technical complexity, and privacy concerns. AIGaitor is a privacy-preserving, cloud-free motion analysis system that runs markerless monocular motion-capture pipelines and downstream deep-learning analysis entirely on a consumer smartphone using on-device neural accelerators. To motivate its design, we surveyed 74 rehabilitation clinicians: 92 percent said they would adopt an accurate, cost-effective, easy-to-use AI gait analysis tool, while 79.7 percent cited operating cost, 68.9 percent insufficient training, and 64.9 percent privacy concerns as leading barriers. We then optimized and benchmarked mobile iOS implementations of current monocular pipeline components, including 2D and 3D pose estimation, pose optimization, skeleton-based deep-learning analysis, and a vision-language model. A Time-Priority end-to-end on-device pipeline processes a 10 s 4K 60 fps video clip in 77 s on an iPhone 14, matching or beating the same pipeline on a high-end NVIDIA H200 cloud server when network transfer is included: 94 s at global mobile-average uplink and 66 s at developed-world Wi-Fi. Lightweight models such as ViTPose-s achieve real-time keypoint extraction, and skeleton-based action-recognition models provide sub-millisecond gait classification on the same clip. To our knowledge, AIGaitor is the first monocular system to demonstrate end-to-end on-device motion capture and downstream deep-learning analysis, supporting clinically applicable movement analysis that is low-cost, private, and accessible to smartphone users.


2605.20301 2026-06-02 cs.CV cs.AI (updated version)

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

Wenxuan Li, Qin Zou, Shoubing Chen, Chi Chen, Yingyi Yang, Qingxiang Meng

Affiliations: Tsinghua University

AI summary: Proposes Co-Fusion4D, which uses a current-frame-dominant, historical-frame-complementary mechanism and a dual attention fusion module to address cross-frame spatiotemporal inconsistency in BEV detectors, reaching 74.9% mAP and 75.6% NDS on nuScenes.

Abstract

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and ego-motion often induce cross-frame spatiotemporal inconsistencies in BEV-based detectors, leading to temporal BEV feature misalignment and degraded spatiotemporal consistency. To address these challenges, we propose Co-Fusion4D, a unified framework that explicitly preserves cross-frame spatiotemporal consistency and suppresses temporal feature drift. Co-Fusion4D adopts a current-frame-centric strategy, treating the current frame as the primary source of information while selectively incorporating historical frames after spatiotemporal filtering and alignment. This dominant-complementary mechanism effectively mitigates cumulative alignment errors, suppresses noisy feature propagation, and exploits reliable temporal cues for a more consistent BEV representation. In addition, Co-Fusion4D integrates a Dual Attention Fusion (DAF) module to further enhance spatiotemporal feature interaction. DAF jointly leverages intra-frame spatial attention and inter-frame temporal attention to adaptively align and fuse multi-frame features, emphasizing motion-consistent regions while suppressing spurious correlations. By departing from conventional uniform fusion paradigms, this design substantially improves the temporal stability and discriminative capability of BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that Co-Fusion4D achieves state-of-the-art performance, with 74.9% mAP and 75.6% NDS, without relying on test-time augmentation or external data.


2605.20282 2026-06-02 cs.CV cs.AI (updated version)

Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning

Zhenyu Yu, Yangchen Zeng, Chunlei Meng, Guangzhen Yao, Shuigeng Zhou

Affiliations: Fudan University; Southeast University; Northeast Normal University

AI summary: Proposes Mirage, a framework whose representation-level diagnostics show that existing vertical federated learning unlearning methods still retain class-structure information after passing output-level certification, and identifies an unlearning trilemma and a class-sample asymmetry.

Abstract

Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely using output-level metrics. We challenge these claims by introducing Mirage, a representation-level auditing framework comprising four complementary diagnostics: Linear Probe Recovery (LPR), Centered Kernel Alignment (CKA), Feature Separability Scoring, and Layer-Wise Recovery Analysis. Through experiments across seven datasets and seven baseline methods following recent VFL unlearning protocols, Mirage reveals three key findings: (i) Forgetting gap: methods that pass output-level certification still retain substantial class structure in their representations, with LPR exceeding the retrained baseline by up to 15.4 points; CKA shows these models remain structurally closer to the original than to the retrained reference, while separability scores indicate persistent geometric discrimination. (ii) Unlearning trilemma: no existing method simultaneously achieves high utility, output-level forgetting, and representation-level forgetting. (iii) Class-sample asymmetry: class-level forgetting leaves strong representational traces (LPR up to 97%), whereas sample-level forgetting is indistinguishable from chance (LPR approx. 50%); layer-wise analysis further shows residual class information persists across network depths. These findings call for representation-aware evaluation standards in federated unlearning research.
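
Of the four diagnostics, Centered Kernel Alignment has a compact closed form that is easy to show. The sketch below implements the linear variant in plain Python; the abstract does not say which kernel Mirage uses, so the linear kernel, the function names, and the toy features are assumptions for illustration only.

```python
import math


def _center(m):
    """Subtract the per-feature mean from an n x d matrix given as a list of rows."""
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]


def _cross_sq_frobenius(a, b):
    """Squared Frobenius norm of A^T B for an n x p matrix A and an n x q matrix B."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two feature matrices computed on the same n samples.

    Values near 1 mean the two representations share structure.
    Raises ZeroDivisionError if either matrix has only constant features.
    """
    xc, yc = _center(x), _center(y)
    numerator = _cross_sq_frobenius(xc, yc)
    denominator = math.sqrt(
        _cross_sq_frobenius(xc, xc) * _cross_sq_frobenius(yc, yc)
    )
    return numerator / denominator


# Toy features: `rotated` is `feats` rotated by 90 degrees and scaled by 3.
feats = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
rotated = [[3.0 * b, -3.0 * a] for a, b in feats]
other = [[1.0], [0.0], [0.0], [1.0]]
```

Linear CKA is invariant to rotation and isotropic scaling, so `linear_cka(feats, rotated)` stays at 1 up to rounding. In an audit of the kind described above, one would compare the unlearned model's features against both the original and the retrained model on the same inputs.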


2605.05945 2026-06-02 cs.CV cs.CL (updated version)

MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathore, Pratyush Patnaik, Shubhanshu Khatana, Ekaksh Janweja

Affiliations: University of California, Berkeley; Stanford University; University of Washington; University of California, Los Angeles; University of California, Santa Barbara

AI summary: Proposes MobileEgo Anywhere, a framework that uses smartphone sensors to capture egocentric trajectories longer than one hour, and releases the open-source processing pipeline STERA, a mobile app, and a 200-hour dataset, validating their usefulness for training vision-language-action models.

Abstract

Vision-language-action (VLA) models have driven demand for large-scale egocentric datasets, yet the hardware and infrastructure to collect long-horizon data remain inaccessible. Datasets today typically have episodes only a few minutes long, which fails to capture the long-horizon temporal dependencies that complex robotic task execution requires. We present MobileEgo Anywhere, a framework for collecting hour-plus egocentric trajectories on commodity mobile hardware that uses modern smartphone sensors for long-term pose tracking without the hardware barriers of traditional robotics data collection. We release three components: (1) STERA, an open-source video-processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA and foundation-model research; (2) a free mobile app that lets any user record egocentric activity; and (3) a 200-hour dataset of diverse, long-form egocentric data with persistent state tracking across 584 sessions. We further show this data is a usable training signal: mid-training a VLA on it lowers held-out action-prediction error.


2411.19093 2026-06-02 cs.CV cs.CY cs.LG (updated version)

Seeing SDG 6 from space: local-scale monitoring of piped water and sewage system access across Africa using satellite imagery and self-supervised learning

Othmane Echchabi, Aya Lahlou, Nizar Talty, Josh Malcolm Manto, Tongshu Zheng, Ka Leung Lam

Affiliations: Mila – Quebec AI Institute; School of Computer Science, McGill University; Department of Earth and Environmental Engineering, Columbia University; Center for Learning the Earth with Artificial Intelligence and Physics (LEAP); Division of Natural and Applied Sciences, Duke Kunshan University

AI summary: Uses Sentinel-2 imagery, Afrobarometer survey data, 30 m population data, and DINO self-supervised Vision Transformer features to build a scalable remote-sensing framework that estimates piped water and sewage system access at about 2.56 km resolution. The best model reaches AUROC values of 91.54% and 93.24% respectively, aligns closely with WHO/UNICEF JMP statistics, and reveals fine-scale environmental inequality in a Nigeria case study.

Comments: Under Review

Abstract

Access to drinking water and sanitation is essential for health and well-being, yet major disparities remain, especially in data-scarce regions such as Africa. SDG 6 aims for universal access, but current monitoring relies on costly, infrequent, and spatially uneven surveys and censuses with long reporting delays. This study develops a scalable remote-sensing framework to estimate piped water and sewage system access at approximately 2.56 km resolution using Sentinel-2 imagery, Afrobarometer survey responses, 30 m population data, and DINO self-supervised Vision Transformer features. The best model achieves AUROC values of 91.54% for piped water and 93.24% for sewage access. Across 50 African countries, population-weighted estimates strongly align with WHO/UNICEF JMP statistics for piped water ($R^2 = 0.92$) and show meaningful agreement for sewage access ($R^2 = 0.72$). In countries without Afrobarometer coverage, MAEs are 9.5% and 10.7%, with estimates within 15% of JMP values for 121.4 million and 159.7 million people, respectively. A Nigeria case study across 767 Local Government Areas (LGAs) shows that the framework reveals fine-scale environmental inequality. The largest no-access burdens reach 1.155 million people for piped water and 1.452 million for sewage, 7.9 and 8.3 times the median LGA burden, while top-decile no-access thresholds of 0.805 and 0.952 indicate that deprivation is widespread. These findings show that DINO-based satellite models can complement household surveys with low-cost, spatially detailed evidence for SDG 6 monitoring, infrastructure targeting, and environmental equity assessment.
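
The aggregation from grid cells to national or LGA figures can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the cell format `(population, predicted_access)`, the function names, and the use of the coefficient of determination for R² are hypothetical choices (the abstract does not define its R²).

```python
def population_weighted_access(cells):
    """Share of people with access, given a list of (population, predicted_access) cells."""
    total_population = sum(pop for pop, _ in cells)
    return sum(pop * access for pop, access in cells) / total_population


def no_access_burden(cells):
    """Expected number of people without access across the cells."""
    return sum(pop * (1.0 - access) for pop, access in cells)


def r_squared(reference, estimate):
    """Coefficient of determination of `estimate` against `reference` values."""
    mean_ref = sum(reference) / len(reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    return 1.0 - ss_res / ss_tot


# Two made-up cells: 1,000 people at 90% access and 3,000 people at 50% access.
toy_cells = [(1000, 0.9), (3000, 0.5)]
```

Weighting by population matters because a simple mean of the two toy cells would give 0.7, whereas the population-weighted share is 0.6; the burden figure converts the same predictions into a head count, which is the form used in the Nigeria case study.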


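2605.17921 2026-06-02 cs.CV (updated version)

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

Affiliations: Tsinghua University; ShengShu; Renmin University of China

AI summary: Proposes Causal Forcing++, a framework that uses causal consistency distillation (causal CD) to achieve frame-wise 1-2 step autoregressive diffusion distillation, improving video generation quality while lowering latency and training cost.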

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by roughly 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
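
One way to read the causal CD objective is as the standard consistency-distillation loss with an added autoregressive context. The following is a sketch in our own notation, not the paper's: $f_\theta$ is the few-step student, $f_{\theta^-}$ a stop-gradient (or EMA) copy of it, $\Phi$ one ODE solver step of the AR teacher, $c$ the previously generated frames used as causal context, $d$ a distance such as squared error, and $t_n < t_{n+1}$ adjacent timesteps. All of these symbols are assumptions introduced for illustration.

```latex
\hat{x}_{t_n} = \Phi\big(x_{t_{n+1}},\, t_{n+1},\, t_n \mid c\big),
\qquad
\mathcal{L}_{\text{causal CD}}(\theta)
  = \mathbb{E}_{x,\,c,\,n}\Big[
      d\big( f_\theta(x_{t_{n+1}}, t_{n+1} \mid c),\;
             f_{\theta^-}(\hat{x}_{t_n}, t_n \mid c) \big)
    \Big]
```

Because $\hat{x}_{t_n}$ comes from a single teacher step computed online, no full PF-ODE trajectory has to be stored, which is consistent with the efficiency argument in the abstract.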


2605.14709 2026-06-02 cs.CV (updated version)

Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners

Qingyang Liu, Bingjie Gao, Canmiao Fu, Zhipeng Huang, Chen Li, Feng Wang, Shuochen Chang, Shaobo Wang, Yali Wang, Keming Ye, Jiangtong Li, Li Niu

Affiliations: Tsinghua University

AI summary: To address the attention entanglement and visual refinement bottlenecks caused by the understanding-generation gap in unified multimodal models, proposes a framework that adaptively switches generation strategies and improves X2I performance through a hierarchical data pipeline and two-stage training (SFT + RL).

Comments: Accepted by ICML 2026

Abstract

Recent unified models integrate multimodal understanding and generation within a single framework. However, an "understanding-generation gap" persists: models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image (X2I) tasks: the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we build a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and an intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity across simple-to-complex instructions. The code is released at https://github.com/WeChatCV/Interleaved_Visual_Reasoner.


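2605.08193 2026-06-02 cs.CV cs.AI (updated version)

ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

Minh Anh Nguyen, Quang Huy Tran, Bao Ngoc Le, SuiYang Guang, Tuan Kiet Pham, Linh Chi Vo

Affiliations: Phenikaa University

AI summary: Proposes ScriptHOI, a framework that decomposes interaction phrases into soft scripted state transitions, calibrates HOI logits with a visual state tokenizer and a slot-wise matcher, and introduces interval partial-label learning and a counterfactual script contrast loss to improve recognition of rare and unseen interactions in open-vocabulary HOI detection while reducing affordance-conflict false positives.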

Abstract

Open-vocabulary human-object interaction (HOI) detection requires recognizing interaction phrases that may not appear as annotated categories during training. Recent vision-language HOI detectors improve semantic transfer by matching human-object features with text embeddings, but their predictions are often dominated by object affordance and phrase-level co-occurrence. As a result, a model may predict "cut cake" from the presence of a knife and a cake without verifying whether the hand, tool, target, contact pattern, and object state jointly support the action. We propose ScriptHOI, a structured framework that represents each interaction phrase as a soft scripted state transition. Rather than treating a phrase as a single class token, ScriptHOI decomposes it into body-role, contact, geometry, affordance, motion, and object-state slots. A visual state tokenizer parses each detected human-object pair into corresponding state tokens, and a slot-wise matcher estimates both script coverage and script conflict. These two quantities calibrate HOI logits, expose missing visual evidence, and provide training constraints for incomplete annotations. To avoid suppressing valid but unannotated interactions, we further introduce interval partial-label learning, which constrains unannotated candidates with script-derived lower and upper probability bounds instead of assigning closed-world negatives. A counterfactual script contrast loss swaps individual script slots to discourage object-only shortcuts. Experiments on HICO-DET, V-COCO, and open-vocabulary HOI splits show that ScriptHOI improves rare and unseen interaction recognition while substantially reducing affordance-conflict false positives.
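
The two mechanisms described above, logit calibration from coverage and conflict and an interval constraint on unannotated candidates, can be pictured with a small sketch. The abstract gives neither formula, so the additive calibration, the hinge-style penalty, and the weights `alpha` and `beta` below are assumptions chosen for illustration, not the ScriptHOI losses.

```python
def calibrated_logit(logit, coverage, conflict, alpha=1.0, beta=1.0):
    """Raise the score when the script is covered by visual evidence, lower it on conflict.

    The additive form and the weights alpha/beta are hypothetical.
    """
    return logit + alpha * coverage - beta * conflict


def interval_penalty(prob, lower, upper):
    """Penalty for an unannotated candidate with script-derived bounds [lower, upper].

    Zero inside the interval; grows linearly with the distance outside it, so the
    candidate is never forced all the way to 0 as a closed-world negative would be.
    """
    return max(0.0, lower - prob) + max(0.0, prob - upper)
```

For example, with bounds `[0.2, 0.7]` a predicted probability of 0.5 costs nothing, while 0.9 is penalised by the 0.2 it exceeds the upper bound, which contrasts with a closed-world negative label that would penalise any non-zero probability.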


2602.08058 2026-06-02 cs.CV cs.AI cs.RO cs.SY eess.SY (updated version)

Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

Affiliations: Massachusetts Institute of Technology; National University of Singapore

AI summary: Proposes Picasso, a holistic scene reconstruction method that reasons over multi-object interactions via fast rejection sampling while accounting for geometric, non-penetration, and physical constraints, substantially outperforming prior work in physical plausibility and reconstruction accuracy.

Comments: 15 pages, accepted to Robotics: Science and Systems (RSS) 2026

Abstract

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.
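
The basic idea of rejecting physically implausible hypotheses can be shown on a toy problem. The sketch below is plain rejection sampling over 2D discs with only a non-penetration check; it does not use a contact graph or any stability test, so it should be read as an illustration of the constraint, not as Picasso's sampler. The function names, the Gaussian perturbation model, and the noise level are assumptions.

```python
import math
import random


def overlaps(a, b):
    """True if two discs given as (x, y, radius) interpenetrate."""
    (ax, ay, ar), (bx, by, br) = a, b
    return math.hypot(ax - bx, ay - by) < ar + br


def sample_scene(estimates, noise=0.05, max_tries=1000, seed=0):
    """Perturb noisy per-object estimates until no pair of discs interpenetrates.

    Returns the first accepted configuration, or None if every try is rejected.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        candidate = [
            (x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise), r)
            for x, y, r in estimates
        ]
        penetrating = any(
            overlaps(candidate[i], candidate[j])
            for i in range(len(candidate))
            for j in range(i + 1, len(candidate))
        )
        if not penetrating:
            return candidate
    return None


# Two unit-diameter discs whose independent estimates overlap by 2 cm.
noisy_estimates = [(0.0, 0.0, 0.5), (0.98, 0.0, 0.5)]
scene = sample_scene(noisy_estimates)
```

Each object's estimate here fits its own data, yet the pair is jointly infeasible, which is the situation the abstract describes; sampling jointly and rejecting interpenetrating candidates is one simple way to reason over the scene as a whole.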


2605.09883 2026-06-02 cs.CV cs.AI (updated version)

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang

Affiliations: Stanford University; Google Research

AI summary: To address multimodal large language models exploiting a Cartesian-coordinate shortcut in visual reasoning, proposes Polaris-Bench, a benchmark that recasts tasks in polar coordinate space and reveals that models lack topology-invariant visual reasoning.

Abstract

As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks prevalently build on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, heavily leveraging text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which re-formulates 53 visual reasoning tasks in Polar coordinate space with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics -- thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on Polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on Polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
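
To make the Cartesian-versus-Polar pairing concrete, the sketch below maps a grid cell to the centre of a ring/sector cell, so that the same puzzle state could be drawn either as a rectangular grid or as concentric rings. How Polaris-Bench actually constructs its layouts is not stated in the abstract; the row-to-ring and column-to-sector mapping, the function names, and the default radii are assumptions.

```python
import math


def grid_to_polar(row, col, n_cols, r_min=1.0, ring_width=1.0):
    """Map grid cell (row, col) to the centre of a polar cell.

    Rows become concentric rings and columns become angular sectors (hypothetical mapping).
    Returns (radius, theta) with theta in radians.
    """
    radius = r_min + (row + 0.5) * ring_width
    theta = 2.0 * math.pi * (col + 0.5) / n_cols
    return radius, theta


def polar_to_xy(radius, theta):
    """Convert a polar cell centre to drawing coordinates."""
    return radius * math.cos(theta), radius * math.sin(theta)


# Cell (0, 0) of a 4-column grid lands on the 45-degree diagonal of the inner ring.
r, theta = grid_to_polar(0, 0, n_cols=4)
x, y = polar_to_xy(r, theta)
```

The logical content of a cell is unchanged by this mapping, but its pixel position no longer reads off as a row and column index, which is the property the benchmark relies on. Note that under this particular mapping the first and last columns become neighbours around the ring, so a real benchmark would need to decide whether that wrap-around is part of the task.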


2605.09503 2026-06-02 cs.CV (updated version)

retinalysis-vascx: An Explainable Software Toolbox for Extracting Retinal Vascular Biomarkers

Jose D. Vargas Quiros, Michael J. Beyeler, Sofia Ortin Vela, EyeNED Reading Center, Sven Bergmann, Caroline C. W. Klave, Bart Liefers, VascX Research Consortium

Affiliations: Department of Ophthalmology, Erasmus University Medical Center; Department of Epidemiology, Erasmus University Medical Center; Department of Ophthalmology, Radboud University Medical Center; Institute of Molecular and Clinical Ophthalmology, University of Basel; Dept. of Computational Biology, University of Lausanne; Swiss Institute of Bioinformatics, Lausanne, Switzerland; Dept. of Integrative Biomedical Sciences, University of Cape Town

AI summary: Presents VascX, an open-source Python toolbox that extracts retinal vascular biomarkers from color fundus images, including vascular density, central retinal equivalents, and tortuosity, and assesses its robustness through reproducibility and sensitivity analyses.

Abstract

Automatic extraction of retinal vascular biomarkers from color fundus images (CFI) is crucial for large-scale studies of the retinal vasculature. We present VascX, an open-source Python toolbox that extracts biomarkers from CFI artery-vein segmentations. VascX starts from vessel segmentation masks, extracts their skeletons, builds undirected and directed vessel graphs, and resolves vessel segments into longer vessels. A comprehensive set of biomarkers is derived, including vascular density, central retinal equivalents (CREs), and tortuosity. Spatially localized biomarkers may be calculated over grids placed relative to the fovea and optic disc. VascX is released via GitHub and PyPI with comprehensive documentation and examples. Our test-retest reproducibility analysis on repeat imaging of the same eye by different devices shows that most VascX biomarkers have moderate to excellent agreement (ICC > 0.5), with important differences in the level of robustness of different biomarkers. Our analyses of biomarker sensitivity to image perturbations and heuristic parameter values support these differences and further characterize VascX biomarkers. Ultimately, VascX provides an explainable and easily modifiable feature-extraction toolbox that complements segmentation to produce reliable retinal vascular biomarkers. Our graph-based biomarker computation stages support reproducible, region-aware measurements suited for large-scale clinical and epidemiological research. By enabling easy extraction of existing biomarkers and rapid experimentation with new ones, VascX supports oculomics research. Its robustness and computational efficiency facilitate scalable deployment in large databases, while open-source distribution lowers barriers to adoption for ophthalmic researchers and clinicians.
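
As an example of the kind of biomarker computed once vessels have been resolved into ordered centreline points, the sketch below evaluates the common distance-metric tortuosity (arc length divided by chord length). This is not the VascX API, and the abstract does not say which tortuosity definition VascX uses; the function name and the point format are assumptions.

```python
import math


def tortuosity(points):
    """Arc length of a vessel centreline divided by its endpoint-to-endpoint distance.

    `points` is an ordered list of (x, y) coordinates along one vessel.
    A perfectly straight vessel scores 1.0; larger values mean a more winding path.
    """
    arc_length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    chord_length = math.dist(points[0], points[-1])
    return arc_length / chord_length


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
bent = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
```

Here `tortuosity(straight)` is 1.0 and `tortuosity(bent)` is the square root of 2. The ratio is undefined for closed loops (zero chord length), and it depends on how finely the centreline is sampled, which is one reason sensitivity analyses like those described above are useful.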


2604.18326 2026-06-02 cs.CV 版本更新

OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation

OmniHuman：面向以人为中心的视频生成的大规模数据集与基准

Lei Zhu, Xing Cai, Yingjie Chen, Yiheng Li, Binxin Yang, Hao Liu, Jie Chen, Chen Li, Jing LYu

发表机构 * Peking University（北京大学）； WeChat Lab（微信实验室）； Chinese Academy of Sciences（中国科学院）

AI总结为解决现有数据集在场景多样性、交互建模和属性对齐方面的结构性缺陷，提出OmniHuman大规模多场景数据集及全自动标注流程，并建立OHBench三级评估体系，实现与人类感知高度一致的诊断。

Comments 19 pages, 6 figures

详情

AI中文摘要

近期音频-视频联合生成模型在内容创作方面展现出令人印象深刻的能力。然而，在复杂的真实世界物理场景中生成高保真以人为中心的视频仍然是一个重大挑战。我们指出根本原因在于现有数据集在三个维度上的结构性缺陷：有限的全局场景和相机多样性、稀疏的交互建模（包括人与人以及人与物体），以及不足的个体属性对齐。为弥补这些差距，我们提出了OmniHuman，一个大规模、多场景数据集，专为细粒度人体建模而设计。OmniHuman提供了层次化标注，涵盖视频级场景、帧级交互和个体级属性。为此，我们开发了一个全自动流水线，用于高质量数据收集和多模态标注。作为数据集的补充，我们建立了OmniHuman基准（OHBench），一个三级评估系统，为以人为中心的音频-视频合成提供科学诊断。关键的是，OHBench引入了与人类感知高度一致的指标，通过提供跨全局场景、关系交互和个体属性的全面诊断，填补了现有基准的空白。

英文摘要

Recent advancements in audio-video joint generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We identify that the root cause lies in the structural deficiencies of existing datasets across three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides a hierarchical annotation covering video-level scenes, frame-level interactions, and individual-level attributes. To facilitate this, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementary to the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system that provides a scientific diagnosis for human-centric audio-video synthesis. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling the gaps in existing benchmarks by providing a comprehensive diagnosis across global scenes, relational interactions, and individual attributes.


2604.17625 2026-06-02 cs.CV 版本更新

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

FlowC2S：从当前帧流向后续帧以实现快速且内存高效的视频延续

Hovhannes Margaryan, Quentin Bammey, Christian Sandor

发表机构 * Team ARAI, Université Paris-Saclay, CNRS, LISN, France（ARAI团队，巴黎萨克雷大学，法国国家科学研究中心，LISN，法国）； LTCI, Télécom Paris, Institut Polytechnique de Paris, France（LTCI，巴黎电信学院，巴黎理工学院，法国）

AI总结提出FlowC2S方法，通过微调预训练文本到视频流模型学习当前与后续视频块之间的向量场，利用固有最优耦合和目标反转实现快速、内存高效的视频延续。

详情

AI中文摘要

本文介绍了一种生成快速且内存高效的视频延续的新方法。我们的方法名为FlowC2S，它微调预训练的文本到视频流模型，以学习当前视频块与后续视频块之间的向量场。两个设计选择是关键。首先，我们引入固有最优耦合，在训练期间利用时间上相邻的视频块作为真实最优耦合的实用代理，从而产生更直的流。其次，我们纳入目标反演，将目标块经反演得到的潜变量注入输入表示中，以加强对应关系并提高视觉保真度。通过直接从当前帧流向后续帧，而不是常见的将当前帧与噪声组合以生成视频延续的方式，我们将模型输入的维度减少了一半。所提出的方法从LTXV和Wan微调而来，在FID和FVD的定量评估中超越了最先进的分数，且仅需五次神经函数评估。

英文摘要

This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.


2601.14750 2026-06-02 cs.CL cs.CV 版本更新

Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning

Render-of-Thought: 将文本思维链渲染为图像以进行视觉潜在推理

Yifan Wang, Shiyu Li, Peiming Li, Xiaochen Yang, Yang Tang, Zheng Wei

发表机构 * Tencent BAC（腾讯BAC）； Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； School of Electronic and Computer Engineering, Peking University（北京大学电子与计算机工程学院）； School of Mathematics and Statistics, University of Glasgow（格拉斯哥大学数学与统计学院）

AI总结提出Render-of-Thought框架，通过将思维链的文本步骤渲染为图像，利用视觉语言模型的视觉编码器进行语义对齐，实现3-4倍令牌压缩和推理加速，同时保持竞争性能。

Comments Accepted by ACL 2026 Main Conference

详情

AI中文摘要

思维链提示在解锁大型语言模型的推理能力方面取得了显著成功。尽管思维链提示增强了推理能力，但其冗长性带来了巨大的计算开销。最近的工作通常只关注结果对齐，缺乏对中间推理过程的监督。这些缺陷掩盖了潜在推理链的可分析性。为了解决这些挑战，我们引入了Render-of-Thought，这是第一个通过将文本步骤渲染为图像来具体化推理链的框架，使潜在推理过程显式且可追溯。具体来说，我们利用现有视觉语言模型的视觉编码器作为语义锚点，将视觉嵌入与文本空间对齐。这种设计确保了即插即用的实现，而无需额外的预训练开销。在数学和逻辑推理基准上的大量实验表明，与显式思维链相比，我们的方法实现了3-4倍的令牌压缩和显著的推理加速。此外，它与其他方法相比保持了竞争性能，验证了这种范式的可行性。我们的代码可在https://github.com/TencentBAC/RoT获取。

英文摘要

Chain-of-Thought (CoT) prompting has achieved remarkable success in unlocking the reasoning capabilities of Large Language Models (LLMs). Although CoT prompting enhances reasoning, its verbosity imposes substantial computational overhead. Recent works often focus exclusively on outcome alignment and lack supervision on the intermediate reasoning process. These deficiencies obscure the analyzability of the latent reasoning chain. To address these challenges, we introduce Render-of-Thought (RoT), the first framework to reify the reasoning chain by rendering textual steps into images, making the latent rationale explicit and traceable. Specifically, we leverage the vision encoders of existing Vision Language Models (VLMs) as semantic anchors to align the vision embeddings with the textual space. This design ensures plug-and-play implementation without incurring additional pre-training overhead. Extensive experiments on mathematical and logical reasoning benchmarks demonstrate that our method achieves 3-4x token compression and substantial inference acceleration compared to explicit CoT. Furthermore, it maintains competitive performance against other methods, validating the feasibility of this paradigm. Our code is available at https://github.com/TencentBAC/RoT


2604.17007 2026-06-02 cs.CV cs.AI 版本更新

MobileAgeNet: Lightweight Facial Age Estimation for Mobile Deployment

MobileAgeNet：面向移动部署的轻量级面部年龄估计

Arun Kumar, Aswathy Baiju, Radu Timofte, Dmitry Ignatov

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，维尔茨堡大学，德国）

AI总结提出基于MobileNetV3-Large的轻量级年龄回归框架MobileAgeNet，通过两阶段微调和有界回归策略，在UTKFace测试集上达到4.65岁的MAE，移动端延迟14.4ms，参数量3.23M。

Comments 9 Pages including references, 3 figures

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3810-3818, 2026

AI中文摘要

面部年龄估计的移动部署需要模型在预测准确性、低延迟和小尺寸之间取得平衡。在这项工作中，我们提出了MobileAgeNet，一个轻量级年龄回归框架，在UTKFace保留测试集上实现了4.65岁的MAE，同时使用AI Benchmark应用程序测量，平均延迟为14.4毫秒，保持了高效的设备端推理。该模型基于预训练的MobileNetV3-Large骨干网络，结合紧凑的回归头，支持移动设备上的实时预测。训练和评估流程集成到NN LEMUR数据集框架中，支持可重复实验、结构化超参数优化和一致评估。我们采用有界年龄回归以及两阶段微调策略，以提高训练稳定性和泛化能力。实验结果表明，MobileAgeNet以3.23M参数实现了具有竞争力的准确性，并且从PyTorch训练通过ONNX导出到TensorFlow Lite转换的部署流程，在实际设备条件下保持了预测行为，没有可测量的退化。总体而言，这项工作为面向移动的面部年龄估计提供了一个实用、可部署的基线。

英文摘要

Mobile deployment of facial age estimation requires models that balance predictive accuracy with low latency and compact size. In this work, we present MobileAgeNet, a lightweight age-regression framework that achieves an MAE of 4.65 years on the UTKFace held-out test set while maintaining efficient on-device inference with an average latency of 14.4 ms measured using the AI Benchmark application. The model is built on a pretrained MobileNetV3-Large backbone combined with a compact regression head, enabling real-time prediction on mobile devices. The training and evaluation pipeline is integrated into the NN LEMUR Dataset framework, supporting reproducible experimentation, structured hyperparameter optimization, and consistent evaluation. We employ bounded age regression together with a two-stage fine-tuning strategy to improve training stability and generalization. Experimental results show that MobileAgeNet achieves competitive accuracy with 3.23M parameters, and that the deployment pipeline, from PyTorch training through ONNX export to TensorFlow Lite conversion, preserves predictive behavior without measurable degradation under practical on-device conditions. Overall, this work provides a practical, deployment-ready baseline for mobile-oriented facial age estimation.
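The abstract names bounded age regression without giving its formula. One plausible reading, sketched below purely as an assumption, is to squash the raw regression output through a sigmoid and rescale it to a fixed age interval. The function name `bounded_age` and the 0 to 100 year range are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def bounded_age(logit, min_age=0.0, max_age=100.0):
    """Map an unbounded regression output into [min_age, max_age].

    The age range is a hypothetical choice; the paper's bounds are not
    stated in the abstract.
    """
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if logit >= 0:
        squashed = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        squashed = e / (1.0 + e)
    return min_age + (max_age - min_age) * squashed


for raw in (-4.0, 0.0, 4.0):
    print(raw, round(bounded_age(raw), 2))
```

Bounding the output this way rules out impossible predictions such as negative ages, which may be part of why it helps training stability.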


2601.02997 2026-06-02 cs.LG cs.CV 版本更新

From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures

从记忆到创造：LLM作为新型神经架构的设计者

Waleed Khalid, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，维尔茨堡大学，德国）

AI总结本文提出NNGPT框架，通过闭环架构合成流水线，利用代码型LLM的监督微调循环，结合MinHash-Jaccard新颖性过滤和低保真性能信号，迭代提升生成架构的有效性、性能和多样性，实现从记忆到创造的转变。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3252-3261, 2026

AI中文摘要

大型语言模型（LLM）在程序合成方面表现出色，但其在神经架构设计中的能力——平衡语法可靠性、性能和结构新颖性——仍未得到充分探索。我们提出了NNGPT框架内的闭环架构合成流水线，其中代码型LLM经过22次监督微调循环的演化。在每个循环中，LLM合成PyTorch卷积网络，通过低保真性能信号验证，并通过MinHash-Jaccard标准过滤以防止结构冗余，然后纳入LEMUR数据集。具有新颖架构的高性能候选被转换为提示-代码对，用于参数高效的LoRA微调。这种反馈循环驱动了可测量的分布偏移，逐步内化经验架构先验，使得有效且高性能的输出从稀缺变为主导。在CIFAR-10上，有效生成率稳定在50.6%（峰值74.5%），平均第一轮准确率从28.1%上升到51.0%，超过40%准确率的候选从2.0%增长到96.8%。跨数据集迁移到CIFAR-100和SVHN证实了改进的有效性、偏移的准确率分布和持续的新颖性在不同难度和视觉领域的基准测试中泛化。在22个循环中，有455个原始语料库中不存在的新颖架构被新颖性过滤器接受。通过将合成基于执行反馈和新颖性过滤，我们证明了迭代自监督微调将LLM重塑为任务特化的架构先验——提高了生成可靠性、代理性能和结构多样性——为手工设计的搜索空间提供了一种可复现、无需标注的替代方案。

英文摘要

Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design (balancing syntactic reliability, performance, and structural novelty) remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash-Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt-code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior, improving generation reliability, proxy performance, and structural diversity, and offering a reproducible, annotation-free alternative to hand-crafted search spaces.
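A MinHash-Jaccard novelty filter of the kind mentioned above might look roughly like the following sketch. It is not the NNGPT implementation: the tokenizer, the 3-token shingles, the 128 hash functions, the 0.8 threshold, and all function names are assumptions made for illustration.

```python
import hashlib
import re


def shingles(code, k=3):
    """Set of k-token shingles of a source string."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items, num_hashes=128):
    """Minimum hash value per seeded hash function (one simulated permutation each)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_novel(candidate, corpus, threshold=0.8):
    """Accept a candidate only if it is below the threshold against every known program."""
    cand_sig = minhash_signature(shingles(candidate))
    return all(
        estimated_jaccard(cand_sig, minhash_signature(shingles(known))) < threshold
        for known in corpus
    )


known = "self.conv = nn.Conv2d(3, 16, 3)\nself.fc = nn.Linear(16, 10)"
fresh = "x = torch.relu(self.bn(y))\nreturn self.pool(x).flatten(1)"
print(is_novel(known, [known]), is_novel(fresh, [known]))
```

In a real pipeline the corpus signatures would be computed once and cached rather than recomputed for every candidate.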


2512.24120 2026-06-02 cs.CV cs.AI 版本更新

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

增强基于LLM的神经网络生成：面向自动化架构设计的少样本提示与高效验证

Raghuvir Duvvuri, Chandini Vysyaraju, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany（计算机视觉实验室，CAIDAS与IFI，维尔茨堡大学，德国）

AI总结本文提出少样本架构提示（FSAP）和空白归一化哈希验证方法，以提升基于LLM的计算机视觉架构自动生成效率，并通过大规模实验验证其有效性。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3242-3251, 2026

AI中文摘要

自动化神经网络架构设计仍然是计算机视觉中的一个重大挑战。任务多样性和计算约束要求既有效又高效的架构与搜索方法。大型语言模型（LLMs）为计算密集型的神经架构搜索（NAS）提供了一种有前景的替代方案，但它们在计算机视觉架构生成中的应用尚未被系统研究，特别是在提示工程和验证策略方面。基于任务无关的NNGPT/LEMUR框架，本文引入并验证了两项针对计算机视觉的关键贡献。首先，我们提出了少样本架构提示（FSAP），这是首个针对基于LLM的架构生成中支持示例数量（n = 1, 2, 3, 4, 5, 6）的系统研究。我们发现使用n = 3个示例能在视觉任务的架构多样性和上下文聚焦之间取得最佳平衡。其次，我们引入了空白归一化哈希验证，一种轻量级去重方法（耗时小于1毫秒），相比AST解析实现了100倍加速，并防止了重复计算机视觉架构的冗余训练。在七个计算机视觉基准（MNIST、CIFAR-10、CIFAR-100、CelebA、ImageNette、SVHN、Places365）的大规模实验中，我们生成了1,900个独特架构。我们还引入了一种数据集平衡的评估方法，以应对跨异构视觉任务比较架构的挑战。这些贡献为计算机视觉中基于LLM的架构搜索提供了可操作的指导，并建立了严格的评估实践，使计算资源有限的研究人员也能更便捷地进行自动化设计。

英文摘要

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
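The whitespace-normalized deduplication step described above could be sketched as follows. This is a minimal illustration, not the authors' code: the function names `normalized_hash` and `deduplicate`, the use of SHA-256, and the rule of collapsing every whitespace run to a single space are all assumptions.

```python
import hashlib
import re


def normalized_hash(source):
    """Hash a program after collapsing every whitespace run to one space."""
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(candidates):
    """Keep the first occurrence of each whitespace-normalized program."""
    seen = set()
    unique = []
    for code in candidates:
        digest = normalized_hash(code)
        if digest not in seen:
            seen.add(digest)
            unique.append(code)
    return unique


a = "def f(x):\n    return x + 1\n"
b = "def f(x):\n\treturn   x + 1"   # same program, different whitespace
c = "def f(x):\n    return x + 2\n"  # genuinely different program
print(len(deduplicate([a, b, c])))
```

The normalized string is used only as a lookup key. Because collapsing whitespace discards Python indentation, it should not be executed or stored in place of the original source.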


2603.18373 2026-06-02 cs.CV cs.AI 版本更新

To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

看见还是取悦：揭示视觉语言模型中的视觉谄媚与分裂信念

Rui Hong, Shuxue Quan

发表机构 * George Mason University（乔治梅森大学）； Independent Researcher（独立研究者）

AI总结提出三层诊断框架，通过反事实干预实验发现视觉语言模型中普遍存在视觉谄媚（内部证据保留但输出幻觉答案）现象，并证明扩展模型规模无法解决该问题。

Comments 14 pages, 1 figures

详情

AI中文摘要

当视觉语言模型正确回答时，它们是否真正依赖视觉信息？我们引入了一个三层诊断框架，包含三个每样本指标：潜在异常检测、视觉必要性分数和竞争分数，用于解耦感知、依赖和对齐失败。在9个视觉语言模型和9000个模型-样本对中，在反事实的致盲、噪声和冲突干预下，72.9%的样本表现出视觉谄媚，这是一种分裂信念模式，即内部证据被保留但解码出幻觉答案，而没有任何样本表现出稳健拒绝，表明当前的对齐训练已消除拒绝作为解码结果。在Qwen-VL系列中，无论是代内还是代间扩展，都单调减少了语言捷径，但加剧了视觉谄媚，表明仅靠规模和更新的后训练无法解决接地问题。诊断分数进一步实现了一种无需训练的选择性预测策略，在50%覆盖率下准确率提升高达9.5个百分点。

英文摘要

When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-sample metrics: Latent Anomaly Detection, Visual Necessity Score, and Competition Score, which disentangle perception, dependency, and alignment failures. Across 9 VLMs and 9,000 model-sample pairs under counterfactual blind, noise, and conflict interventions, 72.9% of samples exhibit Visual Sycophancy, a Split Beliefs pattern in which internal evidence is preserved yet a hallucinated answer is decoded, while zero samples show Robust Refusal, indicating that current alignment training has eliminated refusal as a decoding outcome. Scaling within the Qwen-VL family, both within- and across-generation, monotonically reduces Language Shortcuts but amplifies Visual Sycophancy, showing that scale and newer post-training alone cannot resolve the grounding problem. Diagnostic scores further enable a training-free selective-prediction strategy yielding up to +9.5 percentage points accuracy at 50% coverage.
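To make "accuracy at 50% coverage" concrete, the sketch below ranks samples by a confidence-like score, keeps the top fraction, and measures accuracy on what is kept. How the paper combines its three diagnostic scores into one ranking is not stated in the abstract, so the single `scores` list and the name `selective_accuracy` are assumptions.

```python
def selective_accuracy(scores, correct, coverage=0.5):
    """Accuracy on the `coverage` fraction of samples with the highest scores."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and equal in length")
    # Rank samples from most to least trusted.
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)
    # Answer only the top fraction; abstain on the rest.
    kept = ranked[:max(1, round(coverage * len(ranked)))]
    return sum(is_correct for _, is_correct in kept) / len(kept)


scores = [0.9, 0.8, 0.3, 0.1]
correct = [True, True, False, True]
print(selective_accuracy(scores, correct, coverage=1.0))
print(selective_accuracy(scores, correct, coverage=0.5))
```

The gain from abstaining depends entirely on how well the score separates correct from incorrect answers, which is what the diagnostic metrics are meant to provide.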


2604.11283 2026-06-02 cs.CV 版本更新

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

多模态大语言模型驱动的视频翻译：面向角色的综述

Bingzheng Qu, Kehai Chen, Xuefeng Bai, Min Zhang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）

AI总结本文通过面向角色的分类法，系统综述了多模态大语言模型在视频翻译中的应用，将其分为语义推理器、表达执行器和视觉合成器三个功能角色，并总结了数据集、基准和评估指标，指出了端到端视频翻译的挑战与未来方向。

详情

AI中文摘要

多模态大语言模型（MLLMs）的最新进展正在将视频翻译从自动语音识别、机器翻译、文本到语音和唇形同步的级联管道重塑为统一的多模态推理和生成问题。高质量的视频翻译不仅需要语义保真度，还需要跨视觉、听觉和语言流的时间对齐、说话者一致性和情感表现力。本综述通过面向角色的分类法，对MLLM驱动的视频翻译进行了重点回顾。我们将MLLM驱动和MLLM相关的研究组织为三个功能角色：语义推理器，将翻译基于视频理解、时间推理和多模态融合；表达执行器，支持可控和上下文感知的语音生成；视觉合成器，实现唇形同步和视觉连贯的说话者渲染。我们进一步总结了每个角色的代表性数据集、基准和指标，并讨论了当前评估协议如何未能满足端到端视频翻译的要求。最后，我们指出了长视频理解、时间建模、多模态对齐、多语言鲁棒性和负责任部署方面的开放挑战，为自然和可信的跨语言视频通信勾勒了未来方向。

英文摘要

Recent progress in multimodal large language models (MLLMs) is reshaping video translation from a cascaded pipeline of automatic speech recognition, machine translation, text-to-speech, and lip synchronization into a unified multimodal reasoning and generation problem. High-quality video translation requires not only semantic fidelity, but also temporal alignment, speaker consistency, and emotional expressiveness across visual, acoustic, and linguistic streams. This survey provides a focused review of MLLM-enabled video translation through a role-oriented taxonomy. We organize MLLM-enabled and MLLM-relevant studies into three functional roles: Semantic Reasoner, which grounds translation in video understanding, temporal reasoning, and multimodal fusion; Expressive Performer, which supports controllable and context-aware speech generation; and Visual Synthesizer, which enables lip synchronization and visually coherent speaker rendering. We further summarize representative datasets, benchmarks, and metrics for each role, and discuss how current evaluation protocols fall short of end-to-end video translation requirements. Finally, we identify open challenges in long-form video understanding, temporal modeling, multimodal alignment, multilingual robustness, and responsible deployment, outlining future directions for natural and trustworthy cross-lingual video communication.


2604.09877 2026-06-02 cs.CV cs.AI cs.RO 版本更新

POVQA: 基于偏好的视频问答与数据效率的推理

Ashim Dahal, Ankit Ghimire, Saydul Akbar Murad, Nick Rahimi

发表机构 * University of Southern Mississippi（南密西西比大学）

AI总结提出POVQA方法，通过时间池化压缩视频帧、监督微调加偏好优化，在长视频问答中实现数据高效推理。

Comments Accepted in MAR at CVPR Workshop (Proceedings Track)

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 11533-11542

AI中文摘要

长视频多模态问答需要对视觉证据和对话进行结构化推理，但大型视觉语言模型（LVLMs）受限于上下文窗口和计算限制。我们提出POVQA，将每秒压缩为时间池化图像（1 fps池化图像），以在固定token预算下保持密集的时间覆盖。然后，我们在推理+答案目标上对Qwen2.5-VL-7B进行监督微调（SFT），并可选地应用直接偏好优化（DPO）进行偏好对齐。我们引入ReasonVQA作为初步诊断数据集，包含12部电影和239个人工标注的QA+推理三元组，用于在压缩下对长上下文多模态推理进行受控分析。在ReasonVQA上，SFT将最佳纯池化基线从0.212 F1提升至0.550 F1，表明池化证据加推理监督在此设置中提供了主要性能提升。在零样本迁移中，POVQA在SFT+DPO后在TVQA上也达到64.7%。这些结果是初步的：ReasonVQA规模小，池化可能丢失细粒度时间顺序，且DPO效果在不同设置中并非一致正面。代码、数据集和额外定性评估见 https://povqa.github.io。

英文摘要

Long-video multimodal question answering requires structured reasoning over visual evidence and dialogue, but Large Vision-Language Models (LVLMs) are constrained by context-window and compute limits. We propose POVQA, which compresses each second into a temporally pooled image (1 fps pooled images) to maintain dense temporal coverage under a fixed token budget. We then train Qwen2.5-VL-7B with supervised fine-tuning (SFT) on rationale+answer targets, and optionally apply Direct Preference Optimization (DPO) for preference alignment. We introduce ReasonVQA as a pilot diagnostic dataset with 12 movies and 239 human-annotated QA+rationale triplets for controlled analysis of long-context multimodal reasoning under compression. On ReasonVQA, SFT improves the best pooled-only baseline from 0.212 to 0.550 F1, showing that pooled evidence plus rationale supervision provides the main performance gains in this setting. In zero-shot transfer, POVQA also reaches 64.7% on TVQA after SFT+DPO. These results are preliminary: ReasonVQA is small, pooling can lose fine-grained temporal order, and DPO effects are not uniformly positive across settings. Code, dataset, and additional qualitative evaluations are available at https://povqa.github.io.
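The idea of compressing each second of video into one pooled image can be sketched as below. Mean pooling is used purely as an assumption, since the abstract does not say which pooling operator POVQA applies, and frames are represented as flat lists of pixel intensities to keep the example dependency-free. The name `pool_per_second` is hypothetical.

```python
def pool_per_second(frames, fps):
    """Average every `fps` consecutive frames into one pooled frame.

    frames: list of equally sized flat pixel lists, in temporal order.
    Returns one pooled frame per second of video (the last may cover
    fewer than `fps` frames).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    pooled = []
    for start in range(0, len(frames), fps):
        chunk = frames[start:start + fps]
        # zip(*chunk) groups the same pixel position across the chunk's frames.
        pooled.append([sum(pixel) / len(chunk) for pixel in zip(*chunk)])
    return pooled


video = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]  # three tiny 2-pixel frames
print(pool_per_second(video, fps=2))
```

Averaging makes the limitation acknowledged in the abstract easy to see: frames inside one second are blended, so their order within that second is no longer recoverable.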


2603.28759 2026-06-02 cs.CV 版本更新

FlowIt: Global Matching via Hierarchical Transformers and Optimal Transport for Optical Flow

FlowIt: 通过分层Transformer和最优传输实现全局匹配的光流估计

Sadra Safadoust, Fabio Tosi, Matteo Poggi, Fatma Güney

发表机构 * Department of Computer Engineering and KUIS AI Center, Koç University, Istanbul, Turkey（计算机工程系和KUIS人工智能中心，科奇大学，伊斯坦布尔，土耳其）； Department of Computer Science and Engineering (DISI), University of Bologna, Italy（计算机科学与工程系（DISI），博洛尼亚大学，意大利）

AI总结提出FlowIt架构，结合分层Transformer和最优传输进行全局匹配，并通过置信度与遮挡引导的细化步骤，在多个基准上达到最先进性能。

Comments Project Page: https://kuis-ai.github.io/FlowIt/

详情

AI中文摘要

我们提出FlowIt，一种新颖的光流估计架构，结合了全局匹配与置信度和遮挡引导的细化。其核心是利用分层Transformer架构捕获广泛的全局上下文，使模型能够有效建模长距离对应关系。为了克服局部匹配的局限性，我们将流初始化表述为一个最优传输问题。这种表述产生了一个高度鲁棒的初始流场，以及显式推导的遮挡和置信度图。然后，这些线索无缝集成到引导细化阶段，网络将可靠的运动估计从高置信度区域主动传播到模糊的低置信度区域。在Sintel、KITTI、Spring和LayeredFlow数据集上的大量实验验证了我们方法的有效性。FlowIt在具有挑战性的Sintel基准上取得了最先进的结果，并在Sintel、Spring和LayeredFlow上建立了新的跨数据集零样本泛化性能的最先进水平，同时在KITTI基准和KITTI零样本泛化设置上也提供了有竞争力的性能。

英文摘要

We present FlowIt, a novel architecture for optical flow estimation that combines global matching with confidence and occlusion-guided refinement. At its core, FlowIt leverages a hierarchical transformer architecture that captures extensive global context, enabling the model to effectively model long-range correspondences. To overcome the limitations of localized matching, we formulate the flow initialization as an optimal transport problem. This formulation yields a highly robust initial flow field, alongside explicitly derived occlusion and confidence maps. These cues are then seamlessly integrated into a guided refinement stage, where the network actively propagates reliable motion estimates from high-confidence regions into ambiguous, low-confidence areas. Extensive experiments across the Sintel, KITTI, Spring, and LayeredFlow datasets validate the effectiveness of our approach. FlowIt achieves state-of-the-art results on the competitive Sintel benchmark and establishes new state-of-the-art cross-dataset zero-shot generalization performance on Sintel, Spring, and LayeredFlow, while also delivering competitive performance on both the KITTI benchmark and KITTI zero-shot generalization settings.
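Casting matching as optimal transport is commonly solved with Sinkhorn iterations, and the sketch below shows that generic technique on a tiny cost matrix. It is an assumption that this resembles FlowIt's flow initialization: the abstract does not give the cost, the regularization, or the solver, and `sinkhorn` and `best_matches` are hypothetical names.

```python
import math


def sinkhorn(cost, epsilon=0.1, iterations=200):
    """Entropy-regularized transport plan between two uniform marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low matching cost gives high affinity.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to meet the marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def best_matches(plan):
    """Index of the target receiving the most mass from each source."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Feature-distance matrix between 3 source and 3 target pixels (toy values).
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
plan = sinkhorn(cost)
print(best_matches(plan))
```

Unlike a per-pixel argmax over similarities, the transport plan is constrained on both sides, so one target cannot absorb all sources. That global coupling is a plausible reason a formulation like this yields usable confidence and occlusion cues.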


2603.27645 2026-06-02 cs.CV 版本更新

OpenDPR: Open-Vocabulary Change Detection via Vision-Centric Diffusion-Guided Prototype Retrieval for Remote Sensing Imagery

OpenDPR：面向遥感影像的基于视觉中心扩散引导原型检索的开放词汇变化检测

Qi Guo, Jue Wang, Yinhe Liu, Yanfei Zhong

发表机构 * Wuhan University（武汉大学）； Beijing Institute of Technology（北京理工大学）

AI总结提出OpenDPR框架，通过扩散模型构建原型并检索视觉相似性，解决开放词汇变化检测中类别识别瓶颈，并设计S2C模块增强变化定位能力。

Comments Accepted by CVPR 2026

详情

AI中文摘要

开放词汇变化检测（OVCD）旨在通过泛化到预定义类别集合之外，识别任意感兴趣的变化。我们将OVCD重新表述为两阶段流程：首先使用视觉基础模型（如SAM和DINOv2）生成类别无关的变化提议，然后使用视觉语言模型（如CLIP）进行类别识别。我们发现类别识别错误是OVCD的主要瓶颈，这主要是由于基于图像-文本匹配的VLM在表示细粒度土地覆盖类别方面的能力有限。为了解决这个问题，我们提出了OpenDPR，一个无需训练的、以视觉为中心的扩散引导原型检索框架。OpenDPR利用扩散模型离线为目标类别构建多样化的原型，并在推理时与视觉空间中的变化提议进行相似性检索。次要瓶颈在于变化定位，这是由于VFM固有缺乏变化先验。为弥补这一差距，我们设计了一个名为S2C的空间到变化弱监督变化检测模块，以适应其强大的空间建模能力进行变化定位。将预训练的S2C集成到OpenDPR中，得到一个可选的弱监督变体OpenDPR-W，它通过最小监督进一步改进了OVCD。在四个基准数据集上的实验结果表明，所提出的方法在两种监督模式下均达到了最先进的性能。代码可在https://github.com/guoqi2002/OpenDPR获取。

英文摘要

Open-vocabulary change detection (OVCD) seeks to recognize arbitrary changes of interest by enabling generalization beyond a fixed set of predefined classes. We reformulate OVCD as a two-stage pipeline: first generate class-agnostic change proposals using visual foundation models (VFMs) such as SAM and DINOv2, and then perform category identification with vision-language models (VLMs) such as CLIP. We reveal that category identification errors are the primary bottleneck of OVCD, mainly due to the limited ability of VLMs based on image-text matching to represent fine-grained land-cover categories. To address this, we propose OpenDPR, a training-free vision-centric diffusion-guided prototype retrieval framework. OpenDPR leverages diffusion models to construct diverse prototypes for target categories offline, and to perform similarity retrieval with change proposals in the visual space during inference. The secondary bottleneck lies in change localization, due to the inherent lack of change priors in VFMs. To bridge this gap, we design a spatial-to-change weakly supervised change detection module named S2C to adapt their strong spatial modeling capabilities for change localization. Integrating the pretrained S2C into OpenDPR leads to an optional weakly supervised variant named OpenDPR-W, which further improves OVCD with minimal supervision. Experimental results on four benchmark datasets demonstrate that the proposed methods achieve state-of-the-art performance under both supervision modes. Code is available at https://github.com/guoqi2002/OpenDPR.
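The retrieval step, in which a change proposal is labeled by its most similar category prototype in visual feature space, can be illustrated with a nearest-prototype lookup. This is a simplified sketch under assumptions: cosine similarity, max-over-prototypes scoring, the toy 2-D features, and the names `cosine` and `retrieve_category` are not taken from the OpenDPR code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_category(proposal, prototypes):
    """Return (category, score) of the prototype most similar to the proposal.

    prototypes: dict mapping category name to a list of feature vectors,
    for example features of diffusion-generated images built offline.
    """
    best_category, best_score = None, -1.0
    for category, vectors in prototypes.items():
        score = max(cosine(proposal, vector) for vector in vectors)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


prototypes = {
    "building": [[1.0, 0.0], [0.9, 0.1]],
    "water": [[0.0, 1.0]],
}
category, score = retrieve_category([0.8, 0.2], prototypes)
print(category, round(score, 3))
```

Keeping several prototypes per category and scoring by the best match lets one category cover visually distinct appearances, which is in the spirit of the "diverse prototypes" the abstract mentions.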


2603.27223 2026-06-02 cs.CV cs.AI 版本更新

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

EuraGovExam：来自现实世界公务员考试的多语言多模态基准

Jaeseong Kim, Chaehwan Lim, Sang Hyun Gil, Suan Lee

发表机构 * School of Computer Science / Data Intelligence Lab（计算机科学学院/数据智能实验室）

AI总结提出一个包含8000多道真实公务员考试题目的多语言多模态基准EuraGovExam，要求模型直接从图像中进行布局感知的跨语言推理，当前最先进的视觉语言模型准确率仅达86%。

详情

DOI: 10.1145/3770855.3817532

AI中文摘要

我们提出了EuraGovExam，一个多语言和多模态基准，来源于五个代表性欧亚地区（韩国、日本、台湾、印度和欧盟）的现实世界公务员考试。该数据集旨在反映公共部门评估的真实复杂性，包含超过8000道高分辨率扫描选择题，涵盖17个不同的学术和行政领域。与现有基准不同，EuraGovExam将所有题目内容（包括问题陈述、答案选项和视觉元素）嵌入到单个图像中，仅提供最小化的标准答案格式指令。这种设计要求模型直接从视觉输入进行布局感知的跨语言推理。所有题目均来自真实考试文档，保留了丰富的视觉结构，如表格、多语言排版和类似表单的布局。评估结果显示，即使是最先进的视觉语言模型（VLM）也仅达到86%的准确率，突显了该基准的难度及其诊断当前模型局限性的能力。通过强调文化真实性、视觉复杂性和语言多样性，EuraGovExam为在高风险、多语言、图像基础环境中评估VLM建立了新标准。它还支持电子政务、公共部门文档分析和公平考试准备等实际应用。

英文摘要

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content (including problem statements, answer choices, and visual elements) within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.


2603.26779 2026-06-02 cs.CV cs.AI 版本更新

Limits of Spatial Imagery Reasoning in Frontier LLM Models

前沿大语言模型在空间意象推理中的局限性

Sergio Y. Hayashi, Nina S. T. Hirata

发表机构 * Institute of Mathematics and Statistics – University of São Paulo（数学统计研究所 – 圣保罗大学）

AI总结本研究通过引入外部“意象模块”辅助3D模型旋转任务，发现即使外包整体3D状态维护，前沿模型仍缺乏基础视觉空间原语，导致准确率最高仅62.5%。

Comments 25 pages. v2: Title updated; added a section on object/spatial imagery and propositional reasoning; added new experimental results for the single-object rotation probe

详情

AI中文摘要

大型语言模型（LLMs）展示了令人印象深刻的推理能力，但在需要心理模拟的空间任务（如心理旋转）中表现不佳。本文研究是否通过为LLM配备一个外部“意象模块”（一种能够渲染和旋转3D模型的工具）可以弥合这一差距，充当“认知假体”。我们使用双模块架构进行了实验，其中推理模块（MLLM）与意象模块在3D模型旋转任务上进行交互。性能低于预期，准确率最高达到62.5%。进一步研究表明，即使将维护和操作整体3D状态的负担外包，系统仍然失败。这揭示了当前前沿模型缺乏与意象交互所需的基础视觉空间原语。具体来说，它们缺乏：（1）提取空间信号的低级敏感性，例如（a）深度，（b）运动，以及（c）短时程动态预测；以及（2）对图像进行沉思性推理的能力，动态转移视觉焦点，并平衡意象与符号和关联信息。

英文摘要

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental simulation, such as mental rotation. This paper investigates whether equipping an LLM with an external "Imagery Module" (a tool capable of rendering and rotating 3D models) can bridge this gap, functioning as a "cognitive prosthetic." We conducted experiments using a dual-module architecture in which a reasoning module (an MLLM) interacts with an imagery module on 3D model rotation tasks. Performance was lower than expected, with accuracy reaching at most 62.5%. Further investigation suggests that even when the burden of maintaining and manipulating a holistic 3D state is outsourced, the system still fails. This reveals that current frontier models lack the foundational visual-spatial primitives required to interface with imagery. Specifically, they lack: (1) the low-level sensitivity to extract spatial signals such as (a) depth, (b) motion, and (c) short-horizon dynamic prediction; and (2) the capacity to reason contemplatively over images, dynamically shifting visual focus and balancing imagery with symbolic and associative information.


2603.26028 2026-06-02 cs.CV 版本更新

Learning to Trim: End-to-End Causal Graph Pruning with Dynamic Anatomical Feature Banks for Medical VQA

学习修剪：基于动态解剖特征库的端到端因果图剪枝用于医学视觉问答

Zibo Xu, Qiang Li, Weizhi Nie, Yuting Su

发表机构 * School of Microelectronics, Tianjin University（天津大学微电子学院）； School of Electrical and Information Engineering, Tianjin University（天津大学电气与信息工程学院）

AI总结提出可学习因果修剪（LCT）框架，通过动态解剖特征库（DAFB）和可微修剪模块，在端到端优化中抑制虚假相关，增强因果信号，提升医学VQA的鲁棒性和泛化性。

详情

AI中文摘要

IAG: 基于输入感知的后门攻击针对VLM视觉定位

Junxian Li, Beining Xu, Simin Chen, Jiatong Li, Jingdi Lei, Haodong Zhao, Di Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Fudan University（复旦大学）； Columbia University（哥伦比亚大学）； Hong Kong Polytechnic University（香港理工大学）； Nanyang Technological University（南洋理工大学）

AI总结提出IAG方法，通过文本条件UNet动态生成输入感知的触发器，实现首个多目标后门攻击VLM视觉定位，在多个模型和基准上达到最佳攻击成功率且不影响正常性能。

Comments Accepted by CVPR 2026; Code is at https://github.com/lijunxian111/IAG

详情

Journal ref: https://openaccess.thecvf.com/content/CVPR2026/papers/Li_IAG_Input-aware_Backdoor_Attack_on_VLM-based_Visual_Grounding_CVPR_2026_paper.pdf

AI中文摘要

近期视觉语言模型（VLM）的进展显著提升了视觉定位任务，该任务涉及根据自然语言查询在图像中定位对象。尽管取得了这些进展，基于VLM的定位系统的安全性尚未得到彻底研究。本文揭示了一个新颖且现实的安全漏洞：首个针对VLM视觉定位的多目标后门攻击。与依赖静态触发器或固定目标的先前攻击不同，我们提出了IAG，一种动态生成输入感知、文本引导触发器的方法，这些触发器以任意指定目标对象描述为条件来执行攻击。这是通过一个文本条件的UNet实现的，该网络将难以察觉的目标语义线索嵌入视觉输入，同时保持对良性样本的正常定位性能。我们进一步开发了一个联合训练目标，平衡语言能力与感知重建，以确保不可感知性、有效性和隐蔽性。在多个VLM（如LLaVA、InternVL、Ferret）和基准（RefCOCO、RefCOCO+、RefCOCOg、Flickr30k Entities和ShowUI）上的大量实验表明，IAG在几乎所有设置下都取得了优于其他基线的最佳攻击成功率，同时不损害干净准确率，保持对现有防御的鲁棒性，并展现出跨数据集和模型的迁移性。这些发现强调了具有定位能力的VLM中的关键安全风险，并突出了对可信多模态理解的进一步研究的必要性。

英文摘要

Recent advances in vision-language models (VLMs) have significantly enhanced the visual grounding task, which involves locating objects in an image based on natural language queries. Despite these advancements, the security of VLM-based grounding systems has not been thoroughly investigated. This paper reveals a novel and realistic vulnerability: the first multi-target backdoor attack on VLM-based visual grounding. Unlike prior attacks that rely on static triggers or fixed targets, we propose IAG, a method that dynamically generates input-aware, text-guided triggers conditioned on any specified target object description to execute the attack. This is achieved through a text-conditioned UNet that embeds imperceptible target semantic cues into visual inputs while preserving normal grounding performance on benign samples. We further develop a joint training objective that balances language capability with perceptual reconstruction to ensure imperceptibility, effectiveness, and stealth. Extensive experiments on multiple VLMs (e.g., LLaVA, InternVL, Ferret) and benchmarks (RefCOCO, RefCOCO+, RefCOCOg, Flickr30k Entities, and ShowUI) demonstrate that IAG achieves the best ASRs compared with other baselines on almost all settings without compromising clean accuracy, maintaining robustness against existing defenses, and exhibiting transferability across datasets and models. These findings underscore critical security risks in grounding-capable VLMs and highlight the need for further research on trustworthy multimodal understanding.

2510.19496 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CARES: Context-Aware Resolution Selector for VLMs

CARES: 面向视觉语言模型的上下文感知分辨率选择器

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

发表机构 * Technion（以色列理工学院）； IBM Research（IBM研究院）； Tel-Aviv University（特拉维夫大学）； Ben-Gurion University（本-古里安大学）

AI总结提出CARES轻量级预处理模块，通过紧凑型VLM预测图像-查询对的最小足够分辨率，在保持任务性能的同时最多减少80%计算量。

Comments Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Accepted to ACL 2026 (Oral presentation). Code available at https://github.com/mkimhi/CARES

详情

AI中文摘要

大型视觉语言模型通常以原始或高分辨率处理图像以保持跨任务有效性。这导致视觉令牌通常占总令牌的97-99%，即使低分辨率图像就足够时，也会产生高计算量和延迟。我们引入了CARES——一种上下文感知分辨率选择器，这是一个轻量级预处理模块，给定图像-查询对，预测最小的足够输入分辨率。CARES使用紧凑型VLM（350M）提取特征，并预测目标预训练VLM的响应何时收敛到其正确回答的峰值能力。尽管作为一组可选分辨率上的离散分类器进行训练，但CARES在推理时插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像以及多样化目标VLM的五个多模态基准测试中，CARES在保持任务性能的同时最多减少80%的计算量。

英文摘要

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
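
One plausible way to read "trained as a discrete classifier, interpolated at inference" is to take the probability-weighted average of the candidate resolutions. The sketch below illustrates only that reading. The candidate list, the function names, and the use of a softmax expectation are assumptions made for illustration and are not taken from the CARES paper or its code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of classifier logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate_resolution(logits, resolutions):
    """Return the expected resolution under the classifier's distribution.

    Assumption: 'interpolation' is modelled here as a weighted average of
    the discrete candidates, which yields a continuous value between the
    smallest and largest candidate.
    """
    probs = softmax(logits)
    return sum(p * r for p, r in zip(probs, resolutions))


# Hypothetical candidate resolutions (pixels on the short side).
CANDIDATES = [224, 448, 896]

if __name__ == "__main__":
    # An undecided classifier lands between the candidates.
    print(interpolate_resolution([0.0, 0.0, 0.0], CANDIDATES))
    # A confident classifier stays close to its chosen candidate.
    print(interpolate_resolution([50.0, 0.0, 0.0], CANDIDATES))
```

A weighted average keeps the output inside the range of trained candidates, so the selector never requests a resolution it has no evidence for.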

2603.18652 2026-06-02 cs.CV cs.AI cs.IR 版本更新

Beyond String Matching: Semantic Evaluation of PDF Table Extraction

超越字符串匹配：PDF表格提取的语义评估

Pius Horn, Janis Keuper

发表机构 * Institute for Machine Learning and Analytics (IMLA)（机器学习与分析研究所）； Offenburg University（奥芬堡大学）； University of Mannheim（曼海姆大学）

AI总结提出基于LLM-as-a-judge的语义评估框架，通过合成PDF和人工验证，显著优于现有规则指标（TEDS、GriTS），并评估了21种PDF解析器。

Comments Submitted to BMVC 2026

详情

AI中文摘要

从PDF中可靠地提取表格对于大规模科学数据挖掘和知识库构建至关重要，然而现有的评估方法依赖于基于规则的指标，无法捕捉表格内容的语义等价性。我们提出了一个基于合成PDF的基准测试框架，这些PDF具有精确的LaTeX真实标注，并使用来自arXiv的表格以确保现实的复杂性和多样性。作为我们的核心方法论贡献，我们将LLM-as-a-judge应用于语义表格评估，并将其集成到一个能够适应解析器输出不一致性的匹配流水线中。通过一项包含1500多条针对提取表格对的质量判断的人工验证研究，我们表明基于LLM的评估与人类判断的相关性（Pearson r=0.93）显著高于当前使用的基于树编辑距离的相似度（TEDS, r=0.68）和网格表格相似度（GriTS, r=0.70）。对21个当代PDF解析器在包含451个表格的100个合成文档上的评估揭示了显著的性能差异。我们的结果为选择用于表格数据提取的解析器提供了实用指导，并为这一关键任务建立了一种可重复、可扩展的评估方法。代码和数据：https://github.com/phorn1/pdf-parse-bench 指标研究和人工评估：https://github.com/phorn1/table-metric-study

英文摘要

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to currently used Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
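
The reported agreement figures (r=0.93, 0.68, 0.70) are Pearson correlations between a metric's scores and human judgments. As a reminder of what that statistic measures, here is a minimal sketch of the computation. The function name is hypothetical and nothing here is taken from the authors' repositories.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalised; the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up illustrative scores, not data from the paper.
    human = [1.0, 2.0, 3.0, 4.0, 5.0]
    metric = [1.2, 1.9, 3.4, 3.8, 5.1]
    print(round(pearson_r(human, metric), 3))
```

Pearson r captures linear agreement only, so a metric that ranks tables correctly but on a compressed scale can still score well.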

2504.05033 2026-06-02 cs.RO cs.CV 版本更新

CloSE: A Geometric Shape-Agnostic Cloth State Representation

CloSE: 一种几何形状无关的布料状态表示

Jay Kamat, Júlia Borràs, Carme Torras

发表机构 * Institut de Robòtica i Informàtica Industrial, CSIC-UPC（机器人与工业信息学研究所，CSIC-UPC）

AI总结提出一种基于拓扑索引的dGLI圆盘表示，并从中抽象出紧凑、连续的CloSE表示，用于预测布料折叠位置并支持语义标注与规划。

Comments Accepted at ICRA 2026 (8 pages, 11 figures, 1 table). Project page: https://close-representation.github.io/

详情

AI中文摘要

布料操作是一个难题，主要是因为布料的非刚性特性，这使得对变形的良好表示至关重要。我们提出了一种新的布料变形状态表示。首先，我们提出了基于拓扑索引的dGLI圆盘表示，这些索引是针对排列在圆形网格上的布料边界边缘段计算的。dGLI圆盘的热力图揭示了与布料状态特征相对应的模式，这些模式对于不同形状、尺寸或方向的布料是一致的。然后，我们将这些重要特征从dGLI圆盘中抽象成一个圆，称为布料状态表示（CloSE）。这种表示紧凑、连续，且适用于不同形状。我们表明，这种表示能够准确预测多个仿真布料数据集中的折叠位置。最后，我们还展示了这种表示在两个相关应用中的优势：语义标注以及高层和低层规划。代码和数据集可从以下网址获取：https://close-representation.github.io/

英文摘要

Cloth manipulation is a difficult problem mainly because of the non-rigid nature of cloth, which makes a good representation of deformation essential. We present a new representation for the deformation-state of clothes. First, we propose the dGLI disk representation based on topological indices computed for edge segments of the cloth border that are arranged on a circular grid. The heat-map of the dGLI disk uncovers patterns that correspond to features of the cloth state that are consistent for different shapes, sizes or orientation of the cloth. We then abstract these important features from the dGLI disk into a circle, calling it the Cloth StatE representation (CloSE). This representation is compact, continuous, and general for different shapes. We show that this representation is able to accurately predict the fold locations for several simulation clothing datasets. Finally, we also show the strengths of this representation in two relevant applications: semantic labeling and high- and low-level planning. The code and the dataset can be accessed from: https://close-representation.github.io/

2603.04256 2026-06-02 cs.CV 版本更新

A Hypertoroidal Covering for Perfect Color Equivariance

完美颜色等变的超环面覆盖

Yulong Yang, Zhikun Xu, Yaojun Li, Christine Allen-Blanchette

发表机构 * GitHub

AI总结提出一种通过将区间值提升到圆上的双覆盖来构建真正等变的颜色等变架构，解决了先前方法中近似饱和度和亮度为1D平移带来的伪影问题，在细粒度分类和医学成像等任务上提升了性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

当输入图像的颜色分布在推理时发生变化，传统神经网络架构的性能会显著下降。一些研究者已开始将颜色几何的先验知识融入神经网络设计。这些颜色等变架构将色调变化建模为2D旋转，饱和度和亮度变换建模为1D平移。虽然这种方法在多种情况下提高了神经网络对颜色变化的鲁棒性，但我们发现将饱和度和亮度（区间值量）近似为1D平移会引入明显的伪影。本文提出了一种真正等变的颜色等变架构。我们不再用实直线近似区间，而是将区间上的值提升到圆上的值（双覆盖），并在其上构建等变表示。我们的方法解决了先前方法的近似伪影问题，提高了可解释性和泛化能力，并在细粒度分类和医学成像等任务上取得了优于传统和等变基线的预测性能。超越颜色范畴，我们提出的提升方法还可以扩展到尺度等几何变换。

英文摘要

When the color distribution of input images changes at inference, the performance of conventional neural network architectures drops considerably. A few researchers have begun to incorporate prior knowledge of color geometry in neural network design. These color equivariant architectures model hue variation as 2D rotations, and saturation and luminance transformations as 1D translations. While this approach improves neural network robustness to color variations in a number of contexts, we find that approximating saturation and luminance (interval-valued quantities) as 1D translations introduces appreciable artifacts. In this paper, we introduce a color equivariant architecture that is truly equivariant. Instead of approximating the interval with the real line, we lift values on the interval to values on the circle (a double cover) and build equivariant representations there. Our approach resolves the approximation artifacts of previous methods, improves interpretability and generalizability, and achieves better predictive performance than conventional and equivariant baselines on tasks such as fine-grained classification and medical imaging. Going beyond the context of color, we show that our proposed lifting can also extend to geometric transformations such as scale.
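
The idea of lifting interval values to a circle can be illustrated with the textbook fold map, under which every interior point of [0, 1] has exactly two preimages on the circle. The sketch below shows that generic construction only. The function names and the specific parametrisation (angle = pi * x) are assumptions for illustration and may differ from the construction used in the paper.

```python
import math


def lift(x):
    """Lift x in [0, 1] to an angle on the circle (one of its two preimages)."""
    return math.pi * x


def project(theta):
    """Fold an angle back onto [0, 1]; theta and -theta give the same value."""
    return math.acos(math.cos(theta)) / math.pi


def shift_on_circle(x, delta):
    """Apply a shift as a rotation on the circle, then fold back.

    Unlike a 1D translation, the result can never leave [0, 1]: a shift
    past the boundary comes back as a reflection.
    """
    return project(lift(x) + math.pi * delta)


if __name__ == "__main__":
    print(project(lift(0.3)), project(-lift(0.3)))  # same value twice
    print(shift_on_circle(0.9, 0.2))  # stays inside the interval
```

Rotations of the circle form a group that acts exactly, which is the property a plain translation of a bounded interval lacks.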

2603.00171 2026-06-02 cs.CV cs.AI 版本更新

LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models

LookWise: 知道何时何地关注多模态大语言模型中的细粒度视觉推理

Yuxiang Shen, Hailong Huang, Zhenkun Gao, Xueheng Li, Man Zhou, Chengjun Xie, Haoxuan Che, Xuanhua He, Jie Zhang

发表机构 * Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences（智能机器研究所，合肥物理科学研究院，中国科学院）； University of Science and Technology of China（中国科学技术大学）； Zhejiang University（浙江大学）； East China Normal University（华东师范大学）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出LookWise框架，通过置信度模块和语义引导定位模块实现自适应视觉推理，无需额外训练即可提升细粒度推理精度并加速推理。

详情

AI中文摘要

多模态大语言模型正转向通过主动探索图像细节进行“图像思考”。虽然有效，但大规模训练计算成本高昂，这激发了对轻量级、无需训练解决方案的兴趣。然而，现有无需训练方法存在两个缺陷：无差别裁剪导致的感知冗余，增加了计算成本并引入噪声；以及语义意图与空间注意力之间的漂移，阻碍了用户关注区域的准确定位。为应对这些挑战，我们提出LookWise，一个自适应视觉推理框架。LookWise遵循两阶段流程：基于置信度的模块决定何时更仔细地观察，语义引导的定位模块确定观察位置。该设计使MLLM能够自适应获取细粒度视觉证据而无需额外训练。在细粒度和高分辨率视觉推理基准上的实验表明，LookWise在强基线上持续提升准确率，同时相较于基于搜索的方法ZoomEye实现约$4.0\times$的推理加速，展现出稳健的跨模型泛化能力。

英文摘要

Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines while achieving an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, demonstrating robust cross-model generalization.

2603.09529 2026-06-02 cs.CV 版本更新

面向事件相机的运动感知事件抑制

Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich, Switzerland（苏黎世大学机器人与感知组，瑞士）

AI总结提出首个运动感知事件抑制框架，通过联合分割当前事件流中的独立运动物体并预测其未来运动，实现动态事件的预期抑制，在EVIMO基准上分割精度提升67%，推理速度提高53%。

Comments Robotics: Science and Systems (RSS) 2026

2602.01577 2026-06-02 eess.SP cs.CV 版本更新

Visible Light Positioning With Lamé Curve LEDs: A Generic Approach for Camera Pose Estimation

基于拉梅曲线LED的可见光定位：一种通用的相机姿态估计方法

Wenxuan Pan, Yang Yang, Dong Wei, Zhiyu Zhu, Jintao Wang, Huan Wu, Yao Nie

发表机构 * Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications（北京网络系统架构与融合重点实验室，信息与通信工程学院，北京邮电大学）； Institute of Information Engineering, Chinese Academy of Sciences（信息工程研究所，中国科学院）； College of Physics and Electronic Engineering, Shanxi University（物理与电子工程学院，山西大学）； School of Electronic Information and Artificial Intelligence, West Anhui University（电子信息与人工智能学院，皖西学院）

AI总结本文提出一种基于拉梅曲线LED的通用可见光定位算法LC-VLP，通过统一表示常见LED形状并利用曲线参数进行非线性最小二乘优化，实现高精度相机姿态估计。

Comments Submitted to an IEEE journal for possible publication

详情

AI中文摘要

基于相机的可见光定位（VLP）是一种有前景的技术，可实现精确且低成本的室内相机姿态估计（CPE）。为减少所需发光二极管（LED）的数量，先进方法通常利用LED形状特征进行定位。尽管有趣，但这些方法通常局限于单一LED几何形状，导致在异构LED形状场景中失效。为应对这一挑战，本文研究拉梅曲线作为常见LED形状的统一表示，并提出一种使用拉梅曲线形状LED的通用VLP算法，称为LC-VLP。在所考虑的系统中，多个天花板安装的拉梅曲线形状LED通过可见光通信定期广播其曲线参数，这些参数由配备相机的接收器捕获。基于接收到的LED图像和曲线参数，接收器可使用LC-VLP估计相机姿态。具体而言，离线构建LED数据库以存储曲线参数，而在线定位则被表述为非线性最小二乘问题并迭代求解。为提供可靠的初始化，进一步开发了一种无需对应点的透视n点（FreePnP）算法，无需任何预校准参考点即可实现近似CPE。通过仿真和实验验证了LC-VLP的性能。仿真表明，在圆形和矩形LED场景中，LC-VLP均优于最先进的方法。与透视弧算法相比，LC-VLP可实现平均位置和旋转误差均降低30%以上。实验进一步表明，LC-VLP可实现小于4厘米的平均位置精度。

英文摘要

Camera-based visible light positioning (VLP) is a promising technique for accurate and low-cost indoor camera pose estimation (CPE). To reduce the number of required light-emitting diodes (LEDs), advanced methods commonly exploit LED shape features for positioning. However, they are typically restricted to a single LED geometry, leading to failure in heterogeneous LED-shape scenarios. To address this challenge, this paper investigates Lamé curves as a unified representation of common LED shapes and proposes a generic VLP algorithm using Lamé curve-shaped LEDs, termed LC-VLP. In the considered system, multiple ceiling-mounted Lamé curve-shaped LEDs periodically broadcast their curve parameters via visible light communication, which are captured by a camera-equipped receiver. Based on the received LED images and curve parameters, the receiver can estimate the camera pose using LC-VLP. Specifically, an LED database is constructed offline to store the curve parameters, while online positioning is formulated as a nonlinear least-squares problem and solved iteratively. To provide a reliable initialization, a correspondence-free perspective-n-points (FreePnP) algorithm is further developed, enabling approximate CPE without any pre-calibrated reference points. The performance of LC-VLP is verified by both simulations and experiments. Simulations show that LC-VLP outperforms state-of-the-art methods in both circular- and rectangular-LED scenarios. Compared to a perspective arcs algorithm, LC-VLP reduces both the average position error and the average rotation error by over 30%. Experiments further show that LC-VLP can achieve an average position accuracy of less than 4 cm.
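
A Lamé curve (superellipse) is the set of points satisfying |x/a|^n + |y/b|^n = 1, which is why one family can stand in for both round and nearly rectangular LEDs: n = 2 gives an ellipse (a circle when a = b), and large n approaches a rectangle. The sketch below evaluates that standard implicit form. The symbols a, b, n and the function names are notation chosen here and need not match the paper's parametrisation.

```python
def lame_value(x, y, a, b, n):
    """Evaluate |x/a|^n + |y/b|^n; the curve is the level set equal to 1."""
    return abs(x / a) ** n + abs(y / b) ** n


def is_inside(x, y, a, b, n):
    """True if (x, y) lies strictly inside the Lamé curve."""
    return lame_value(x, y, a, b, n) < 1.0


if __name__ == "__main__":
    corner = (0.95, 0.95)
    # With n = 2 the shape is a unit circle, so a near-corner point is outside.
    print(is_inside(*corner, a=1.0, b=1.0, n=2))
    # With n = 20 the shape is almost a square, so the same point is inside.
    print(is_inside(*corner, a=1.0, b=1.0, n=20))
```

Because the shape is controlled by a handful of scalars, an LED can broadcast them compactly and the receiver can fit the same implicit equation to the observed contour.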

2602.20807 2026-06-02 cs.CV cs.RO 版本更新

RU4D-SLAM: Reweighting Uncertainty in Gaussian Splatting SLAM for 4D Scene Reconstruction

RU4D-SLAM：面向4D场景重建的高斯溅射SLAM不确定性重加权

Yangfan Zhao, Hanwei Zhang, Ke Huang, Qiufeng Wang, Zhenzhou Shao, Dengyu Wu

发表机构 * Capital Normal University（首都师范大学）； Saarland University（萨尔兰大学）； Xi’an Jiaotong-Liverpool University（西交利物浦大学）； King’s College London（伦敦国王学院）

AI总结提出RU4D-SLAM框架，通过引入时间因子、不确定性感知和语义引导重加权机制，解决动态环境中3D高斯溅射SLAM的跟踪与4D场景重建问题。

详情

AI中文摘要

将3D高斯溅射与同时定位与地图构建（SLAM）相结合的方法因其能够在运动过程中实现连续3D环境重建而受到广泛关注。然而，现有方法在动态环境中表现不佳，尤其是移动物体使3D重建复杂化，进而阻碍了可靠的跟踪。4D重建的出现，特别是4D高斯溅射，为解决这些挑战提供了有前景的方向，但其在4D感知SLAM中的潜力尚未得到充分探索。沿着这一方向，我们提出了一种鲁棒且高效的框架，即面向4D场景重建的高斯溅射SLAM不确定性重加权（RU4D-SLAM），该框架将时间因子引入空间3D表示，同时结合了场景变化的不确定性感知、模糊图像合成和动态场景重建。我们通过集成运动模糊渲染增强了动态场景表示，并通过扩展原本为静态场景设计的逐像素不确定性建模来处理模糊图像，从而改进了不确定性感知跟踪。此外，我们提出了一种用于动态场景中逐像素不确定性估计的语义引导重加权机制，并引入可学习的不透明度权重以支持自适应4D映射。在标准基准上的大量实验表明，我们的方法在轨迹精度和4D场景重建方面显著优于最先进的方法，尤其是在存在移动物体和低质量输入的动态环境中。代码地址：https://ru4d-slam.github.io

英文摘要

Combining 3D Gaussian splatting with Simultaneous Localization and Mapping (SLAM) has gained popularity as it enables continuous 3D environment reconstruction during motion. However, existing methods struggle in dynamic environments, particularly because moving objects complicate 3D reconstruction and, in turn, hinder reliable tracking. The emergence of 4D reconstruction, especially 4D Gaussian splatting, offers a promising direction for addressing these challenges, yet its potential for 4D-aware SLAM remains largely underexplored. Along this direction, we propose a robust and efficient framework, namely Reweighting Uncertainty in Gaussian Splatting SLAM (RU4D-SLAM) for 4D scene reconstruction, that introduces temporal factors into spatial 3D representation while incorporating uncertainty-aware perception of scene changes, blurred image synthesis, and dynamic scene reconstruction. We enhance dynamic scene representation by integrating motion blur rendering, and improve uncertainty-aware tracking by extending per-pixel uncertainty modeling, originally designed for static scenarios, to handle blurred images. Furthermore, we propose a semantic-guided reweighting mechanism for per-pixel uncertainty estimation in dynamic scenes, and introduce a learnable opacity weight to support adaptive 4D mapping. Extensive experiments on standard benchmarks demonstrate that our method substantially outperforms state-of-the-art approaches in both trajectory accuracy and 4D scene reconstruction, particularly in dynamic environments with moving objects and low-quality inputs. Code is available at https://ru4d-slam.github.io

2602.19857 2026-06-02 cs.CV 版本更新

Contrastive meta-domain adaptation for robust skin lesion classification across clinical and acquisition conditions

对比元域适应用于跨临床和采集条件的鲁棒皮肤病变分类

Rodrigo Mota, Kelvin Cunha, Emanoel dos Santos, Fábio Papais, Francisco Filho, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

发表机构 * University of São Paulo（圣保罗大学）

AI总结提出基于视觉元域概念的适应策略，通过将大规模皮肤镜数据集的视觉表示迁移到临床图像域，提高皮肤病变分类的泛化鲁棒性。

Comments 4 pages, 5 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

2602.19848 2026-06-02 cs.CV 版本更新

DerMAE: Improving skin lesion classification through conditioned latent diffusion and MAE distillation

DerMAE: 通过条件潜在扩散和MAE蒸馏改进皮肤病变分类

Francisco Filho, Kelvin Cunha, Fábio Papais, Emanoel dos Santos, Rodrigo Mota, Thales Bezerra, Erico Medeiros, Paulo Borba, Tsang Ing Ren

发表机构 * Universidade Federal do Pernambuco（伯南布哥联邦大学）

AI总结针对皮肤病变分类中恶性样本不足导致的类别不平衡问题，提出使用类别条件扩散模型生成合成图像，结合自监督MAE预训练学习鲁棒特征，并通过知识蒸馏将大模型知识迁移至轻量级ViT学生模型，在提升分类性能的同时实现高效设备端推理。

Comments 4 pages, 2 figures, 1 table, Published in: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

2602.13430 2026-06-02 cs.CV 版本更新

Handling Supervision Scarcity in Chest X-ray Classification: Long-Tailed and Zero-Shot Learning

处理胸部X光分类中的监督稀缺性：长尾与零样本学习

Ha-Hieu Pham, Hai-Dang Nguyen, Thanh-Huy Nguyen, Min Xu, Ulas Bagci, Trung-Nghia Le, Huy-Hieu Pham

发表机构 * University of Technology, Vietnam（越南技术大学）； National University of Singapore（新加坡国立大学）； University of California, San Diego（加州大学圣地亚哥分校）； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结针对胸部X光分类中极端长尾多标签分布和罕见/未见发现缺失标注的问题，提出不平衡感知多标签学习（任务1）和无需监督标签的零样本预测方法（任务2），在CXR-LT 2026挑战赛中取得领先性能。

详情

DOI: 10.1109/ISBI61048.2026.11515586
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

AI中文摘要

临床实践中的胸部X光（CXR）分类常受限于不完美的监督，这源于（i）极端长尾多标签疾病分布和（ii）罕见或先前未见发现的缺失标注。CXR-LT 2026挑战赛在基于PadChest的基准上解决这些问题，其标签空间包含36个类别，分为30个用于训练的分布内类别和6个用于零样本评估的分布外（OOD）类别。我们提出了针对不同监督机制的任务特定解决方案。对于任务1（长尾多标签分类），我们采用不平衡感知的多标签学习策略，以提高尾类别的识别能力，同时保持对常见发现的稳定性能。对于任务2（零样本OOD识别），我们提出了一种预测方法，在训练期间不使用任何来自OOD类别的监督标签或示例的情况下，为未见疾病类别生成分数。以宏平均的平均精度均值（mAP）评估，我们的方法在两个任务上均取得了强劲性能，在开发阶段的公开排行榜上排名第一。代码和预训练模型可在https://github.com/hieuphamha19/CXR_LT获取。

英文摘要

Chest X-Ray (CXR) classification in clinical practice is often limited by imperfect supervision, arising from (i) extreme long-tailed multi-label disease distributions and (ii) missing annotations for rare or previously unseen findings. The CXR-LT 2026 challenge addresses these issues on a PadChest-based benchmark with a 36-class label space split into 30 in-distribution classes for training and 6 out-of-distribution (OOD) classes for zero-shot evaluation. We present task-specific solutions tailored to the distinct supervision regimes. For Task 1 (long-tailed multi-label classification), we adopt an imbalance-aware multi-label learning strategy to improve recognition of tail classes while maintaining stable performance on frequent findings. For Task 2 (zero-shot OOD recognition), we propose a prediction approach that produces scores for unseen disease categories without using any supervised labels or examples from the OOD classes during training. Evaluated with macro-averaged mean Average Precision (mAP), our method achieves strong performance on both tasks, ranking first on the public leaderboard of the development phase. Code and pre-trained models are available at https://github.com/hieuphamha19/CXR_LT.

2602.15278 2026-06-02 cs.CV cs.AI 版本更新

Visual Persuasion: What Influences Decisions of Vision-Language Models?

视觉说服：什么影响了视觉语言模型的决策？

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； MIT Media Lab（MIT媒体实验室）

AI总结提出一个框架，通过控制图像选择任务并系统性地扰动输入，利用视觉提示优化方法推断视觉语言模型的潜在视觉效用，揭示影响模型决策的视觉偏好。

Comments Accepted to ICML 2026

详情

AI中文摘要

网络上充斥着图像，这些图像最初是为人类消费而创建的，现在越来越多地被使用视觉语言模型（VLM）的智能体解释。这些智能体大规模地做出视觉决策，决定点击、推荐或购买什么。然而，我们对它们视觉偏好的结构知之甚少。我们引入了一个框架来研究这一点，通过将VLM置于受控的基于图像的选择任务中，并系统地扰动它们的输入。我们的关键思想是将智能体的决策函数视为一种潜在的视觉效用，可以通过揭示偏好来推断：在系统编辑的图像之间进行选择。从常见图像（如产品照片）开始，我们提出了视觉提示优化的方法，将文本优化方法适应为使用图像生成模型（例如在构图、光照或背景方面）迭代地提出并应用视觉上合理的修改。然后，我们评估哪些编辑增加了选择概率。通过对前沿VLM的大规模实验，我们证明了优化后的编辑在直接比较中显著改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好，识别出驱动选择的一致视觉主题。我们认为，这种方法提供了一种实用且高效的方式来揭示视觉漏洞和安全问题，否则这些问题可能会在现实世界中隐含地发现，从而支持对基于图像的AI智能体进行更主动的审计和治理。

英文摘要

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

2602.14134 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

DenseMLLM：用于密集预测的标准多模态大语言模型

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China（香港科技大学电子与计算机工程系）； Tencent, Youtu-Lab, China（腾讯优图实验室）

AI总结提出DenseMLLM，通过标准多模态大语言模型架构和视觉令牌监督策略，无需任务特定解码器即可实现语义分割、深度估计等密集预测任务，在多个基准上取得竞争性能。

Comments ICML 2026

详情

AI中文摘要

多模态大语言模型在高层次视觉理解方面展现出卓越能力。然而，将这些模型扩展到细粒度的密集预测任务（如语义分割和深度估计）通常需要引入复杂的任务特定解码器和其他定制化组件。这种架构碎片化增加了模型复杂度，偏离了多模态大语言模型的通用设计，最终限制了其实用性。在这项工作中，我们挑战了这一范式，通过调整标准多模态大语言模型来执行密集预测，无需额外的任务特定解码器。所提出的模型称为DenseMLLM，基于标准架构，并采用一种新颖的视觉令牌监督策略来处理多个标签和任务。尽管设计极简，我们的模型在广泛的密集预测和视觉语言基准测试中取得了极具竞争力的性能，表明标准的通用多模态大语言模型可以在没有架构专门化的情况下有效支持密集感知。该项目可在github.com/Eli-YiLi/DenseMLLM获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization. This project is available at github.com/Eli-YiLi/DenseMLLM.

2602.13602 2026-06-02 cs.CV cs.LG 版本更新

Towards Sparse Video Understanding and Reasoning

迈向稀疏视频理解与推理

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

发表机构 * Northwestern University（西北大学）； Johns Hopkins University（约翰霍普金斯大学）； Dolby Laboratories（杜比实验室）

AI总结提出一种多轮视频问答代理，通过稀疏帧选择、状态摘要和早期停止机制，在减少帧数和令牌数的同时提升准确率。

Comments Accepted to CVPR 2026. Project page: https://sparsevideounderstanding.github.io

详情

AI中文摘要

我们提出 REVISE（Reasoning with Video Sparsity），一种用于视频问答（VQA）的多轮代理。与均匀采样帧不同，REVISE 选择一小部分信息丰富的帧，跨轮维护摘要作为状态，并在置信时提前停止。它支持专有视觉语言模型（VLM）的“即插即用”设置，并允许对开源模型进行强化微调。对于微调，我们引入 EAGER（Evidence-Adjusted Gain for Efficient Reasoning），一种无注释奖励，包含三项：(1) 置信增益：添加新帧后，奖励正确选项与最强替代选项之间对数几率差距的增加；(2) 摘要充分性：在回答时仅使用最后提交的摘要重新提问，并奖励成功；(3) 正确且早期停止：在较小的轮次预算内正确回答即获得奖励。在多个 VQA 基准上，REVISE 在减少帧数、轮数和提示令牌数的同时提高了准确率，展示了实用的稀疏视频推理。

英文摘要

We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
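
The first EAGER term, the confidence gain, can be made concrete with a short sketch: compute the gap between the correct option's score and the strongest alternative before and after new frames are added, and take the difference. The code below illustrates only that description. The function names are hypothetical, per-option logits stand in for log-odds, and whether negative gains are clipped or how the three terms are weighted is not stated in the abstract, so neither is modelled here.

```python
def log_odds_gap(option_logits, correct_idx):
    """Gap between the correct option's logit and the strongest alternative."""
    if len(option_logits) < 2:
        raise ValueError("need at least two answer options")
    correct = option_logits[correct_idx]
    strongest_alt = max(
        value for i, value in enumerate(option_logits) if i != correct_idx
    )
    return correct - strongest_alt


def confidence_gain(logits_before, logits_after, correct_idx):
    """Increase in the gap after new frames are added (can be negative)."""
    return log_odds_gap(logits_after, correct_idx) - log_odds_gap(
        logits_before, correct_idx
    )


if __name__ == "__main__":
    # Made-up logits for a three-option question; option 0 is correct.
    before = [1.0, 2.0, 0.5]
    after = [3.0, 1.0, 0.5]
    print(confidence_gain(before, after, correct_idx=0))
```

Measuring the gap against the strongest alternative, rather than the average one, means the term only grows when the new frames help separate the answer from its closest competitor.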

2602.11554 2026-06-02 cs.RO cs.CV cs.LG 版本更新

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

发表机构 * University of Edinburgh（爱丁堡大学）； HKUST (GZ)（香港科技大学（广州））； University of Oxford（牛津大学）； MIT（麻省理工学院）

AI总结提出一种与检测器无关的框架HyperDet，通过构建任务感知的超4D雷达点云，利用时空累积、跨传感器验证和多普勒引导的运动补偿以及前景生成增强，显著提升仅用雷达的3D目标检测性能。

Comments 11 pages, 3 figures, 3 tables

详情

AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度？尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力，但其点云仍然稀疏、嘈杂且不稳定，限制了仅用雷达的3D检测。我们提出HyperDet，一种与检测器无关的框架，在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测，提高返回可靠性和时间一致性。然后，它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强，在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间，雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时，HyperDet仅需雷达输入，可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明，与原始雷达输入相比，在标准3D检测器上均取得一致改进，验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.

2602.12819 2026-06-02 cs.IR cs.CV 版本更新

WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

WISE：一种用于视觉场景、音频、物体、人脸、语音和元数据的多模态搜索引擎

Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta

发表机构 * Engineering Science, University of Oxford（牛津大学工程科学系）

AI总结提出WISE开源多模态搜索引擎，整合场景级和物体级的自然语言与反向图像查询、人脸搜索、音频事件检索、语音转录搜索及元数据过滤，支持跨模态组合查询，采用向量搜索实现高效扩展，可本地部署。

Comments Software: https://www.robots.ox.ac.uk/~vgg/software/wise/ , Online demos: https://www.robots.ox.ac.uk/~vgg/software/wise/demo/ , Example Queries: https://www.robots.ox.ac.uk/~vgg/software/wise/examples/

详情

DOI: 10.1145/3805712.3808375
Journal ref: International ACM SIGIR Conference on Research and Development in Information Retrieval (2026)

AI中文摘要

在本文中，我们提出WISE，一个开源视听搜索引擎，它将多种多模态检索能力集成到一个单一、实用的工具中，无需机器学习专业知识即可使用。WISE支持图像和视频的场景级（例如空街道）和物体级（例如马）的自然语言和反向图像查询；基于人脸的特定个体搜索；使用文本（例如木头吱吱声）或音频文件的声学事件音频检索；自动转录语音的搜索；以及按用户提供的元数据进行过滤。通过跨模态组合查询可以获得丰富的洞察——例如，通过应用物体查询“火车”和元数据查询“德国”从历史档案中检索德国火车，或在一个地方搜索人脸。通过采用向量搜索技术，WISE可以扩展到支持对数百万张图像或数千小时视频的高效检索。其模块化架构便于集成新模型。WISE可以本地部署用于私有或敏感集合，并已应用于各种实际用例。我们的代码是开源的，可在https://gitlab.com/vgg/wise/wise获取。

英文摘要

In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at https://gitlab.com/vgg/wise/wise.

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR 版本更新

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Harvard University（哈佛大学）

AI总结提出层次化智能体框架SceneSmith，通过VLM智能体协作从自然语言生成仿真就绪的室内场景，相比先前方法生成3-6倍物体且碰撞率低于2%。

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

Abstract

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages -- from architectural layout to furniture placement to small object population -- each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

2602.08236 2026-06-02 cs.CV cs.AI cs.CL (updated version)

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

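Affiliations: University of North Carolina, Chapel Hill; Nanyang Technological University

AI summary: Proposes the adaptive test-time frameworks AVIC and AVIC-R, which selectively invoke and scale world-model visual imagination to balance accuracy and efficiency in spatial reasoning, outperforming baselines including GPT-4o.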

Comments The first two authors contributed equally. Project page: https://adaptive-visual-tts.github.io/

Abstract

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
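
The AVIC-R training signal described in this abstract (a QA-correctness reward minus a penalty for imagination cost, optimized with GRPO) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the linear penalty, and the weight `cost_weight` are assumptions made here. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO advantage.

```python
import statistics

def rollout_reward(correct: bool, world_model_calls: int, cost_weight: float = 0.1) -> float:
    """Hypothetical reward: 1 for a correct answer, minus a linear imagination-cost penalty."""
    return (1.0 if correct else 0.0) - cost_weight * world_model_calls

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled rollouts for one question: (answer correct?, number of world-model calls).
group = [(True, 0), (True, 3), (False, 0), (False, 3)]
rewards = [rollout_reward(c, n) for c, n in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed settings, a correct answer reached without imagination receives the largest advantage and an incorrect answer that still spent world-model calls receives the smallest, which is the gating behavior the abstract describes.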

2602.07955 2026-06-02 cs.CV (updated version)

Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

Sebastian Jung, Leonard Klüpfel, Rudolph Triebel, Maximilian Durner

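Affiliations: German Aerospace Center (DLR)

AI summary: Proposes the Neural Memory Object (NeMO) representation, which encodes a few RGB template views into a sparse point cloud to enable detection, segmentation, and 6DoF pose estimation of unseen objects without retraining.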

Comments 17 pages including supplement, published in 3DV 2026, Project website: https://sebastian-jung.github.io/nemo/

DOI: 10.1109/3DV69130.2026.00039
Journal ref: Proceedings of the International Conference on 3D Vision (3DV), 2026

Abstract

We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo

2602.04094 2026-06-02 cs.CV (updated version)

SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models

Mingyu Lu, Soham Gadgil, Chris Lin, Chanwoo Kim, Su-In Lee

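Affiliations: University of California, Berkeley

AI summary: To address the high computational cost of fairly valuing data contributors to text-to-image diffusion models, proposes SurrogateSHAP, a retraining-free framework based on inference from a pretrained model that approximates the utility function with gradient-boosted trees and computes Shapley values analytically, outperforming existing methods on several tasks at lower overhead.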

Abstract

As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.
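
For context, the exact Shapley value that SurrogateSHAP approximates can be computed by brute force for a handful of contributors. The sketch below is only the textbook definition, not the paper's tree-based analytic method. The `utility` function and the contributor names are made-up stand-ins for "retrain on this subset and score the model". The loop over every coalition is what makes the exact computation combinatorial.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, utility):
    """Exact Shapley values by enumerating every coalition (exponential in len(players))."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

# Hypothetical utility: each contributor adds a fixed score, and A and B
# together earn a bonus of 2 (an interaction between contributors).
BASE = {"A": 1.0, "B": 2.0, "C": 3.0}

def utility(subset):
    bonus = 2.0 if {"A", "B"} <= subset else 0.0
    return sum(BASE[p] for p in subset) + bonus

phi = shapley_values(list(BASE), utility)
```

In this toy game the interaction bonus is split evenly between A and B, and the values sum to the utility of the full set, which is the efficiency property that makes Shapley values attractive for compensation.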

2601.21444 2026-06-02 cs.CV cs.AI cs.CL (updated version)

APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention

Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Ao Sun, Ziqi Yuan, Hao Zhou, Fandong Meng, Zhiyuan Liu

Affiliations: NLP Group, DCST, IAI, BNRIST, Tsinghua University, Beijing, China; Department of CS&T, Central South University, Changsha, China; BUPT, Beijing, China; Pattern Recognition Center, WeChat AI, Tencent Inc.

AI summary: Proposes APB-V, a sequence-parallel framework that accelerates long-video inference across multiple GPUs through distributed approximate attention, delivering substantial speedups without notable performance loss.

Comments ACL 2026 main

Abstract

The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Multimodal Models (LMMs). Existing methods either compress visual embeddings or apply sparse attention on a single GPU, yielding limited acceleration or degraded performance and restricting LMMs from handling longer, more complex videos. To overcome these issues, we propose APB-V, a sequence-parallel framework with optimized attention that accelerates long-video inference across multiple GPUs. By distributing approximate attention, APB-V reduces computation and increases parallelism, enabling efficient processing of more visual embeddings without compression and thereby improving task performance. System-level optimizations, such as load balancing and fused forward passes, further unleash the potential of APB-V, delivering speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, respectively, without notable performance loss. Code is available at https://github.com/thunlp/APB

2601.18340 2026-06-02 cs.CV (updated version)

Beyond Rigid: Benchmarking Non-Rigid Video Editing

Bingzheng Qu, Xuefeng Bai, Kehai Chen, Min Zhang

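Affiliations: Harbin Institute of Technology, Shenzhen, China

AI summary: Proposes NRVBench, a diagnostic benchmark whose physics-aware evaluation framework exposes the shortcomings of conventional metrics for non-rigid video editing, and introduces the VM-Edit baseline to analyze the stability-plasticity trade-off.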

Abstract

As video generation models are increasingly expected to manipulate physical dynamics, there is a growing need to move evaluation beyond appearance fidelity and semantic alignment. Non-rigid video editing offers a uniquely revealing testbed, where distinct materials impose distinct physical constraints. In this paper, we introduce NRVBench, a diagnostic benchmark for non-rigid video editing, where the task is to modify deformable motion while preserving irrelevant regions and maintaining material-specific plausibility. NRVBench contains 180 curated videos across six physics-grounded categories, 2,340 fine-grained editing instructions, 360 multiple-choice questions, and pixel-accurate masks. We further propose NRVE-Acc, a structured VLM-based protocol that decomposes editing success into instruction following, material-aware deformation plausibility, and temporal coherence with motion cues. Experiments on representative inference-time video editing methods reveal a clear mismatch between conventional metrics and physics-aware perceptual editing success: methods that preserve appearance or achieve strong global alignment may still fail under non-rigid dynamics. We additionally introduce VM-Edit, a simple region-conditioned editing baseline that frees the foreground while locking the background, exposing the stability-plasticity trade-off.

2508.06407 2026-06-02 cs.CV cs.AI eess.IV (updated version)

A Classification-Aware Super-Resolution Framework for Ship Targets in SAR Imagery

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Oktay Karakus

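Affiliations: University of Malaya

AI summary: Proposes an algorithm that integrates classification objectives into the super-resolution process, optimizing loss functions that account for both image quality and classification performance to raise SAR image resolution and improve classification accuracy.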

DOI: 10.1109/JSTARS.2026.3655550
Journal ref: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 6614-6622, 2026

Abstract

High-resolution imagery plays a critical role in improving the performance of visual recognition tasks such as classification, detection, and segmentation. In many domains, including remote sensing and surveillance, low-resolution images can limit the accuracy of automated analysis. To address this, super-resolution (SR) techniques have been widely adopted to reconstruct high-resolution images from low-resolution inputs. Traditional approaches focus solely on enhancing image quality based on pixel-level metrics, leaving the relationship between super-resolved image fidelity and downstream classification performance largely underexplored. This raises a key question: can integrating classification objectives directly into the super-resolution process further improve classification accuracy? In this paper, we address this question by investigating the relationship between super-resolution and classification through a specialised algorithmic strategy. We propose a novel methodology that increases the resolution of synthetic aperture radar imagery by optimising loss functions that account for both image quality and classification performance. Our approach improves image quality, as measured by established image quality indicators, while also enhancing classification accuracy.

2511.06163 2026-06-02 eess.IV cs.CV cs.LG physics.med-ph (updated version)

Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation

Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo

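Affiliations: National Institute of Mental Health, National Institutes of Health

AI summary: Proposes a parameter-efficient transfer learning method that uses 3D Low-Rank Adaptation (LoRA) to fine-tune a 3D convolutional foundation model pre-trained on CT images for MRI-based ADHD classification. On a public diffusion MRI dataset, one variant reaches 71.9% accuracy and another an AUC of 0.716, each with only 1.64 million trainable parameters.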

Comments Accepted for presentation at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

DOI: 10.1109/ISBI61048.2026.11515951
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), pp. 1-4

Abstract

Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.
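
To make the parameter savings of a low-rank update concrete, the sketch below counts weights for one 3D convolution layer. It assumes a generic LoRA-style factorization in which the kernel is flattened to a matrix of shape (c_out, c_in * k^3) and updated by a rank-r product B @ A. The abstract does not spell out the exact factorization used in the paper, so the layer sizes, the rank, and both function names are illustrative assumptions.

```python
def conv3d_params(c_out: int, c_in: int, k: int) -> int:
    """Weights in a dense 3D convolution kernel of size k x k x k (bias ignored)."""
    return c_out * c_in * k ** 3

def lora_params(c_out: int, c_in: int, k: int, rank: int) -> int:
    """Weights in a rank-`rank` update B @ A, with B: (c_out, rank) and A: (rank, c_in * k^3)."""
    return rank * (c_out + c_in * k ** 3)

# Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, rank 4.
full = conv3d_params(256, 256, 3)        # 1,769,472 weights
low_rank = lora_params(256, 256, 3, 4)   # 28,672 weights
reduction = full / low_rank              # about 61.7x fewer trainable weights in this layer
```

The same arithmetic applies to the reported totals: 1.64 million trainable parameters at "over 113x fewer" implies a fully fine-tuned model of more than roughly 185 million parameters.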

2601.04946 2026-06-02 cs.CV cs.AI (updated version)

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Subhadeep Roy, Gagan Bhatia, Steffen Eger

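Affiliations: University of Technology Nuremberg

AI summary: Using the controlled diagnostic benchmark PROTOBIAS, identifies and verifies prototypicality bias in multimodal evaluation metrics, which tend to prefer images that are visually or socially prototypical but semantically incorrect, and proposes PROTOSCORE, a lightweight contrastively trained evaluator, as a mitigation baseline.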

Abstract

Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and large-scale data filtering. Yet they may reward images that look plausible or prototypical rather than images that faithfully satisfy the prompt. We identify prototypicality bias as a systematic blindspot in multimodal evaluation: metrics can prefer a semantically incorrect but visually or socially prototypical image over a correct but less prototypical one. We introduce PROTOBIAS, a controlled diagnostic benchmark across Animals, Objects, and Demography, where semantically correct images are contrasted with plausible prototypical adversaries containing a single controlled semantic violation. Grounded in prototype theory and social-category prototypicality, PROTOBIAS is constructed with multiple prompt generators, image generators, and independent VLM filters, and validated through prompt-quality, human-annotation, and image-quality controls. Using PROTOBIAS, we show that widely used embedding, reward, VQA-based, and VLM-as-judge metrics frequently fail these contrasts, while human judgments remain more faithful to semantic correctness. We further introduce PROTOSCORE, a lightweight contrastively trained evaluator, as an initial mitigation baseline. PROTOBIAS provides a focused benchmark for measuring prototypicality-driven metric failures and developing more semantically faithful T2I evaluators.

2601.03309 2026-06-02 cs.CV cs.AI (updated version)

IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

Kumar Abhishek, Jeremy Kawahara, Ghassan Hamarneh

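Affiliations: Medical Image Analysis Lab, School of Computing Science, Simon Fraser University; AIP Labs

AI summary: Presents the ISIC MultiAnnot++ dataset of 17,684 segmentation masks spanning 14,967 dermoscopic images, of which 2,394 images have 2-5 segmentations each, with metadata on annotator skill level and segmentation tool to support multi-annotator medical image segmentation research.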

Comments Published in IEEE Data Descriptions, 12 pages, 7 figures

DOI: 10.1109/IEEEDATA.2026.3689801
Journal ref: IEEE Data Descr. 3 (2026) 367-378

Abstract

Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernible in regular clinical photographs. However, there are currently no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators' skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis of the characteristics of this dataset, curated data partitions, and consensus segmentation masks.

2512.20251 2026-06-02 cs.CV eess.IV (updated version)

Degradation-Aware Metric Prompting for Hyperspectral Image Restoration

Binfeng Wang, Di Wang, Haonan Guo, Ying Fu, Jing Zhang

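Affiliations: School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China; School of Computer Science, Wuhan University, Wuhan, Hubei, China; Zhongguancun Academy, Beijing, China

AI summary: Proposes the Degradation-Aware Metric Prompting (DAMP) framework, which uses interpretable spatial-spectral metrics as degradation prompts together with a Degradation-Adaptive Mixture-of-Experts (DAMoE) module to restore multi-dimensional degradations in a unified model, achieving state-of-the-art performance on natural and remote sensing hyperspectral datasets and showing zero-shot generalization.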

Comments Accepted by ICML 2026

Abstract

Unified hyperspectral image (HSI) restoration aims to recover diverse degradations within a single model. However, current methods often rely on impractical explicit priors or opaque black-box representations that overfit to training distributions, hampering generalization to unseen scenarios. To bridge this gap, we propose Degradation-Aware Metric Prompting (DAMP), a novel framework that characterizes multi-dimensional degradations through interpretable spatial-spectral metrics. These metrics serve as Degradation Prompts (DP), enabling the model to capture shared characteristics across tasks and adapt to unknown corruptions. Central to our framework is the Degradation-Adaptive Mixture-of-Experts (DAMoE), where Spatial-Spectral Adaptive Modules (SSAMs) serve as experts that utilize learnable fusion coefficients to specialize in distinct degradation degrees. By using DP as a gating router, DAMoE dynamically activates specialized experts tailored to the specific degradation profile. Extensive experiments on natural and remote sensing HSI datasets demonstrate that DAMP achieves state-of-the-art performance and exhibits exceptional zero-shot generalization on unseen restoration tasks. Code is publicly available at https://github.com/MiliLab/DAMP.

2508.20072 2026-06-02 cs.CV cs.LG cs.RO (updated version)

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

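Affiliations: University of Science and Technology of China

AI summary: Proposes Discrete Diffusion VLA, which discretizes action chunks and progressively refines them with discrete diffusion inside a unified transformer backbone, enabling an adaptive decoding order and error correction, achieving strong results on multiple benchmarks while preserving pretrained vision-language priors.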

Comments Accepted by ICML 2026. 17 pages

Abstract

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order, with poor performance, or attach separate diffusion heads outside the backbone, which fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA, which discretizes action chunks and models them with a discrete diffusion pattern that retains progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% for parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating good retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on the AgileX Cobot Magic platform to show the method's effectiveness.

2512.15647 2026-06-02 cs.CV (updated version)

Hard Labels In! Rethinking the Role of Hard Labels in Mitigating Local Semantic Drift

Jiacheng Cui, Bingkui Tong, Xinyue Bi, Xiaohan Zhao, Jiacheng Liu, Zhiqiang Shen

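Affiliations: University of Science and Technology of China

AI summary: To address the local semantic drift that soft labels cause under sparse supervision, proposes HALD, a training paradigm that mixes hard and soft labels, improving generalization in dataset distillation and large-scale classification.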

Comments ICML 2026. Code at: https://github.com/Jiacheng8/HALD

Abstract

Soft labels from teacher models are a de facto practice for knowledge transfer and large-scale dataset distillation (e.g., SRe2L, LPLD). However, when we limit the number of crops per image to reduce the substantial cost of storing precomputed soft labels, these methods suffer severely from local semantic drift: visually ambiguous crops can cause soft supervision to deviate from the image-level ground-truth semantics, leading to persistent errors and a train-test distribution mismatch. We revisit the overlooked role of hard labels and show that, when properly integrated, they can act as a content-invariant semantic anchor that calibrates such drift. We theoretically analyze the emergence of drift under sparse soft-label supervision and demonstrate that hybridizing hard and soft labels restores alignment between visual content and semantic supervision. Building on this insight, we propose a new training paradigm, Hard Label for Alleviating Local Semantic Drift (HALD), which uses hard labels as intermediate corrective signals while preserving the fine-grained benefits of soft labels. Extensive experiments on dataset distillation and large-scale classification benchmarks show consistent generalization improvements. On ImageNet-1K, our method achieves 42.7% accuracy with only 285M soft-label storage (a 100x reduction), outperforming the prior state-of-the-art LPLD by 9.0%.
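
The idea of hybridizing hard and soft labels can be illustrated with a per-crop loss that mixes cross-entropy against the image-level hard label with cross-entropy against the teacher's soft label. This is a hedged sketch only: the convex combination, the weight `alpha`, and the example probabilities are assumptions for illustration, since the abstract does not specify how HALD combines or schedules the two signals.

```python
import math

def cross_entropy(target: list[float], probs: list[float], eps: float = 1e-12) -> float:
    """Cross-entropy H(target, probs) for two distributions over the same classes."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def hybrid_label_loss(probs, soft_label, hard_class, alpha=0.5):
    """Assumed mix: alpha * CE(hard one-hot) + (1 - alpha) * CE(teacher soft label)."""
    one_hot = [1.0 if i == hard_class else 0.0 for i in range(len(probs))]
    return alpha * cross_entropy(one_hot, probs) + (1 - alpha) * cross_entropy(soft_label, probs)

# An ambiguous crop of a class-0 image: the teacher's soft label drifts toward class 1.
student = [0.3, 0.6, 0.1]
drifted_soft = [0.2, 0.7, 0.1]
soft_only = hybrid_label_loss(student, drifted_soft, hard_class=0, alpha=0.0)
hybrid = hybrid_label_loss(student, drifted_soft, hard_class=0, alpha=0.5)
```

With these toy numbers the hybrid loss is larger than the soft-only loss, because the hard label penalizes a student that follows the drifted soft label away from the image-level class.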

2505.08438 2026-06-02 cs.CV cs.AI (updated version)

A Survey of 3D Reconstruction with Event Cameras

Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Haodong Chen, Zeke Zexi Hu, Zhicheng Lu, Ying Zhou, Vera Chung, Qiang Qu, Weidong Cai

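Affiliations: University of Science and Technology of China

AI summary: Presents the first comprehensive survey of event-camera-based 3D reconstruction, categorizing methods by input modality (stereo, monocular, multimodal) and reconstruction technique (geometry-based, deep learning, and neural rendering such as NeRF and 3DGS), and discussing challenges in datasets, evaluation, representation, and dynamic scene reconstruction.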

Comments This survey has been accepted for publication in the Computational Visual Media Journal

Abstract

Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel brightness changes. Compared to traditional frame-based cameras, event cameras produce sparse yet temporally dense data streams, enabling robust and accurate 3D reconstruction even under challenging conditions such as high-speed motion, low illumination, and extreme dynamic range scenarios. These capabilities offer substantial promise for transformative applications across various fields, including autonomous driving, robotics, aerial navigation, and immersive virtual reality. In this survey, we present the first comprehensive review exclusively dedicated to event-based 3D reconstruction. Existing approaches are systematically categorised based on input modality into stereo, monocular, and multimodal systems, and further classified according to reconstruction methodologies, including geometry-based techniques, deep learning approaches, and neural rendering techniques such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Within each category, methods are chronologically organised to highlight the evolution of key concepts and advancements. Furthermore, we provide a detailed summary of publicly available datasets specifically suited to event-based reconstruction tasks. Finally, we discuss significant open challenges in dataset availability, standardised evaluation, effective representation, and dynamic scene reconstruction, outlining insightful directions for future research. This survey aims to serve as an essential reference and provides a clear and motivating roadmap toward advancing the state of the art in event-driven 3D reconstruction.


2512.17605 2026-06-02 cs.CV cs.AI 版本更新

MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration

MGRegBench：一个带有解剖标志的乳腺X线图像配准新型基准数据集

Svetlana Krasnova, Emiliya Starikova, Ilia Naletov, Andrey Krylov, Dmitry Sorokin

Affiliation: MSU (Moscow State University)

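AI summary: To address the lack of public datasets and standardized benchmarks for mammography image registration, the paper presents MGRegBench, which comprises over 5,000 image pairs plus 100 pairs with manually annotated anatomical landmarks, and evaluates a range of registration methods.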

详情

AI中文摘要

稳健的乳腺X线图像配准对于临床相关应用（如追踪乳腺组织疾病进展）至关重要。然而，由于缺乏透明的公共数据集和可重复的标准化基准，进展受到限制。现有研究通常使用私有数据和不一致的评估框架，因此难以直接比较。为解决这一问题，我们提出了MGRegBench，一个患者不重叠、控制数据泄漏的乳腺X线图像配准评估协议，包含超过5000对图像，每对图像带有乳腺分割掩膜，以及100对带有手动标注解剖标志的图像，此外还有标准化的训练/评估划分和即用基线。利用这一资源，我们对多种配准方法进行了基准测试，包括经典方法（ANTs）、基于学习的方法（VoxelMorph, TransMorph）、隐式神经表示（IDIR）、一种乳腺X线专用方法，以及最近的深度学习方法MammoRegNet，并针对该模态调整了实现，同时在独立数据集SDM-MCs上验证了泛化能力。我们的贡献包括：（1）首个此规模且带有手动标注标志和掩膜的乳腺X线图像配准公共数据集；（2）一个透明、控制数据泄漏的基准，首次实现了多种经典和基于机器学习的方法的同类比较；（3）在SDM-MCs上的外部验证，以检验主要趋势能否推广到MGRegBench之外；（4）对基于深度学习的配准进行了广泛分析。我们公开发布代码和数据，为公平、可重复且临床相关的比较建立基础资源，并推动AI驱动医学影像的未来研究。

英文摘要

Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. However, progress has been limited by the absence of transparent public datasets and reproducible standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a patient-disjoint, leakage-controlled evaluation protocol for mammography registration, comprising over 5,000 image pairs, each with a breast segmentation mask, and 100 pairs with manually annotated anatomical landmarks, plus standardized train/evaluation splits and ready-to-run baselines. Using this resource, we benchmark diverse registration methods -- including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a mammography-specific approach, and a recent deep learning method MammoRegNet, with implementations adapted to this modality, and validate generalization on the independent SDM-MCs dataset. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) a transparent, leakage-controlled benchmark enabling the first like-for-like comparison of diverse classical and machine learning-based methods; (3) external validation on SDM-MCs to test whether the main trend transfers beyond MGRegBench; and (4) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair, reproducible, and clinically relevant comparisons and catalyze future research in AI-driven medical imaging.
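The abstract describes the benchmark as patient-disjoint and leakage-controlled. As a minimal sketch of what a patient-disjoint split means, the example below groups image pairs by patient and assigns whole patients to one subset. The `(patient_id, pair_id)` record layout, the function name, and the 20% evaluation fraction are assumptions for illustration and are not taken from the MGRegBench release.

```python
import random
from collections import defaultdict


def patient_disjoint_split(pairs, eval_fraction=0.2, seed=0):
    """Split image pairs so that no patient ID occurs in both subsets.

    `pairs` is a list of (patient_id, pair_id) tuples. This record layout is
    hypothetical and does not come from the MGRegBench release.
    """
    by_patient = defaultdict(list)
    for patient_id, pair_id in pairs:
        by_patient[patient_id].append(pair_id)

    patients = sorted(by_patient)              # fixed order before shuffling
    random.Random(seed).shuffle(patients)      # reproducible shuffle of patients
    n_eval = max(1, round(len(patients) * eval_fraction))

    # Whole patients go to one side, so their pairs can never leak across.
    evaluation = [(p, i) for p in patients[:n_eval] for i in by_patient[p]]
    train = [(p, i) for p in patients[n_eval:] for i in by_patient[p]]
    return train, evaluation


# Toy data: 10 patients with 5 image pairs each.
pairs = [(f"patient_{k % 10}", f"pair_{k}") for k in range(50)]
train, evaluation = patient_disjoint_split(pairs)
```

Splitting at the patient level rather than the pair level is what prevents images of the same breast from appearing on both sides of the split.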


2512.14364 2026-06-02 cs.CV 版本更新

Unified Semantic Transformer for 3D Scene Understanding

统一语义Transformer用于3D场景理解

Sebastian Koch, Johanna Wald, Hidenobu Matsuki, Pedro Hermosilla, Timo Ropinski, Federico Tombari

Affiliations: Ulm University; Google; TU Vienna; TU Munich

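AI summary: Proposes UNITE, a unified semantic Transformer trained end to end that predicts multiple dense semantic attributes directly from RGB images for 3D scene understanding and reaches state-of-the-art performance on several tasks.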

Comments Accepted at TMLR. Project page: https://unite-page.github.io/

详情

AI中文摘要

整体3D场景理解涉及捕捉和解析非结构化3D环境。由于现实世界的固有复杂性，现有模型主要针对特定任务开发，并局限于这些任务。我们引入UNITE，一个用于3D场景理解的统一语义Transformer，这是一种新颖的前馈神经网络，将多种3D密集语义室内任务统一在单个模型中。我们的模型以完全端到端的方式训练，可直接用于未见过的场景，仅需几秒钟即可推断完整的3D语义几何。我们的方法能够仅从RGB图像直接预测多个密集语义属性，包括3D场景分割、实例嵌入、开放词汇特征和铰接结构。该方法使用2D蒸馏进行训练，高度依赖自监督，并利用新颖的多视图损失确保3D视图一致性。我们证明UNITE在多个不同的密集室内语义任务上达到了最先进的性能，甚至超越了任务特定模型，在许多情况下还超过了使用真实3D几何的方法。参见项目网站 unite-page.github.io。

英文摘要

Holistic 3D scene understanding involves capturing and parsing unstructured 3D environments. Due to the inherent complexity of the real world, existing models have predominantly been developed for, and limited to, specific tasks. We introduce UNITE, a Unified Semantic Transformer for 3D scene understanding, a novel feed-forward neural network that unifies a diverse set of 3D dense semantic indoor tasks within a single model. Our model is trained in a fully end-to-end manner, operates on unseen scenes, and takes only a couple of seconds to infer the full 3D semantic geometry. Our approach is capable of directly predicting multiple dense semantic attributes, including 3D scene segmentation, instance embeddings, open-vocabulary features, and articulations, solely from RGB images. The method is trained using 2D distillation, relies heavily on self-supervision, and leverages novel multi-view losses designed to ensure 3D view consistency. We demonstrate that UNITE achieves state-of-the-art performance on several different dense indoor semantic tasks and even outperforms task-specific models, in many cases surpassing methods that operate on ground-truth 3D geometry. See the project website at unite-page.github.io.


2511.07438 2026-06-02 cs.CV cs.NA math.NA stat.ME 版本更新

Semimage: 基于HSV的语义图像编码用于解缠文本表示

Mohammad Zare

Affiliations: AI Lab at Department of Computer Engineering; AriooBarzan Engineering Team and Information Technology; Shiraz University of Technology

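AI summary: Proposes SemImage, which represents text as a two-dimensional semantic image and uses the HSV color space to disentangle topic, sentiment, and intensity features through multi-task learning, achieving competitive performance in document classification.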

详情

Journal ref: 2026 12th International Conference on Web Research (ICWR), 253-259

AI中文摘要

我们提出SemImage，一种将文本文档表示为二维语义图像以由卷积神经网络（CNN）处理的新方法。在SemImage中，每个单词表示为二维图像中的一个像素：行对应句子，并在句子之间插入额外的边界行以标记语义转换。每个像素不是典型的RGB值，而是解缠HSV颜色空间中的向量，编码不同的语言特征：色调（具有两个分量H_cos和H_sin以考虑循环性）编码主题，饱和度编码情感，明度编码强度或确定性。我们通过多任务学习框架强制这种解缠：ColorMapper网络将每个词嵌入映射到HSV空间，并对色调和饱和度通道应用辅助监督以预测主题和情感标签，同时执行主要任务目标。在句子之间插入动态计算的边界行，当连续句子在语义上不相似时，会在图像中产生清晰的视觉边界，有效地使段落边界突出。我们将SemImage与标准2D CNN（例如ResNet）集成用于文档分类。在多标签数据集（同时具有主题和情感标注）和单标签基准上的实验表明，SemImage能够达到与强文本分类基线（包括BERT和层次注意力网络）相当或更好的准确性，同时提供增强的可解释性。消融研究证实了多通道HSV表示和动态边界行的重要性。最后，我们展示了SemImage的可视化，定性地揭示了生成图像中与主题转换和情感变化相对应的清晰模式，表明我们的表示使这些语言特征对人类和机器都可见。

英文摘要

We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: Hue (with two components, H_cos and H_sin, to account for circularity) encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.
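The abstract notes that Hue is stored as two components, H_cos and H_sin, to account for circularity. A small sketch of why that helps: as a raw angle, hues of 359 and 1 degrees look far apart, while in (cos, sin) space they are neighbours. The function names are illustrative assumptions, and this is not the paper's ColorMapper network.

```python
import math


def hue_to_components(hue_degrees):
    """Encode a hue angle as (H_cos, H_sin) so that 0 and 360 degrees coincide."""
    angle = math.radians(hue_degrees % 360.0)
    return math.cos(angle), math.sin(angle)


def component_distance(hue_a, hue_b):
    """Euclidean distance between two hues in (H_cos, H_sin) space."""
    cos_a, sin_a = hue_to_components(hue_a)
    cos_b, sin_b = hue_to_components(hue_b)
    return math.hypot(cos_a - cos_b, sin_a - sin_b)


near_wraparound = component_distance(359.0, 1.0)  # only 2 degrees apart on the circle
far_apart = component_distance(0.0, 180.0)        # opposite hues
```

With this encoding the distance between two hues depends only on the angle between them, so a network does not see an artificial jump at the 0/360 boundary.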


2506.22881 2026-06-02 cs.CV 版本更新

CLIP-like Model as a Foundational Density Ratio Estimator

CLIP-like模型作为基础密度比估计器

Fumiya Uchiyama, Rintaro Yanagi, Shohei Taniguchi, Shota Takashiro, Masahiro Suzuki, Hirokatsu Kataoka, Yusuke Iwasawa, Yutaka Matsuo

Affiliations: The University of Tokyo; AIST (National Institute of Advanced Industrial Science and Technology)

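AI summary: Reinterprets CLIP-style models as pretrained, general-purpose density ratio estimators and proposes two applications, Importance Weight Learning and KL divergence estimation; a single additional prompt improves F1 by up to 7 points, and the KL divergence is used for data curation.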

Comments Accepted to CVPR 2026. Code: https://github.com/fumiyauchiyama/CLIP_Density_Ratio

详情

AI中文摘要

密度比估计是统计机器学习中的核心概念，因为它为重要性加权、散度估计和无似然推断等任务提供了统一机制，但其在视觉和语言模型中的潜力尚未被充分探索。现代视觉-语言编码器（如CLIP和SigLIP）通过对比目标进行训练，隐式优化联合图像-文本分布与边缘分布之间的对数密度比，从而学习与对数密度比成比例的相似度分数。然而，先前的工作主要关注其嵌入效用，而对比学习诱导的密度比结构在多模态应用中尚未被系统性地检验或利用。为填补这一空白，我们重新解释CLIP类模型为预训练的通用密度比估计器，并表明这一视角能够实现新的算法能力。我们统一解释了对比目标如何估计密度比，并提出了两种实际应用：重要性权重学习和KL散度估计。我们的重要性权重学习方法仅需单个额外提示，即可将F1分数提升最多7点。我们进一步证明，基于CLIP的密度比支持估计KL散度，该散度量化了以图像或文本为条件如何改变另一模态的分布。通过定性示例和标题的N-gram分析，我们发现这些散度捕捉了多模态数据中的语义多样性和模式结构。利用这一特性，我们引入了一种简单的KL引导数据筛选方法，其性能可与LAION2B筛选相媲美。

英文摘要

Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text distributions, which implicitly learn similarity scores proportional to log density ratios. However, prior work has largely focused on their embedding utility, and the density-ratio structure induced by contrastive learning has not been systematically examined or exploited in multimodal applications. To address this gap, we reinterpret CLIP-style models as pretrained and general-purpose density ratio estimators and show that this perspective enables new algorithmic capabilities. We present a unified explanation of how contrastive objectives estimate density ratios and propose two practical applications: Importance Weight Learning and KL divergence estimation. Our Importance Weight Learning method requires only a single additional prompt and improves F1 scores by up to 7 points. We further show that CLIP-based density ratios support estimation of KL divergences that quantify how conditioning on an image or text alters the distribution of the other modality. Through qualitative examples and an N-gram analysis of captions, we find that these divergences capture semantic diversity and mode structure in multimodal data. Leveraging this property, we introduce a simple KL-guided data curation method that achieves performance competitive with LAION2B filtering.
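To make the link between density ratios and KL divergence concrete, the toy example below uses a small discrete joint distribution in place of a real image-text model. It computes the log density ratio log p(x, y) / (p(x) p(y)) exactly and uses it to evaluate KL(p(y | x) || p(y)) for each "image". The numbers are made up for illustration. With an actual CLIP-style model the similarity score would only approximate this log ratio, up to scaling and offset terms that this sketch does not model.

```python
import math

# Toy joint distribution p(x, y): 2 "images" (rows) by 3 "captions" (columns).
joint = [
    [0.30, 0.10, 0.10],
    [0.05, 0.15, 0.30],
]

p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(len(joint))) for j in range(len(joint[0]))]


def log_density_ratio(i, j):
    """Exact log p(x, y) / (p(x) p(y)) for the toy distribution."""
    return math.log(joint[i][j] / (p_x[i] * p_y[j]))


def kl_conditional_vs_marginal(i):
    """KL(p(y | x_i) || p(y)), written as an expectation of the log density ratio."""
    return sum(
        (joint[i][j] / p_x[i]) * log_density_ratio(i, j) for j in range(len(p_y))
    )


kl_per_image = [kl_conditional_vs_marginal(i) for i in range(len(p_x))]
# Averaging the per-image KL over p(x) gives the mutual information I(X; Y).
mutual_information = sum(p_x[i] * kl_per_image[i] for i in range(len(p_x)))
```

A larger per-image KL means that conditioning on that image changes the caption distribution more, which is the kind of quantity the abstract describes using for data curation.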


2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

Affiliation: Pohang University of Science and Technology (POSTECH)

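AI summary: Builds Idis, a visual question-answering dataset whose distractors vary along semantic and numerical dimensions, to study how visual distractors affect test-time scaling in vision-language models; finds that, unlike textual distractors, visual distractors reduce accuracy without increasing reasoning length, and proposes a simple prompting strategy that mitigates distractor-driven predictions.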

Comments preprint

详情

AI中文摘要

无关信息（即干扰项）如何影响视觉语言模型（VLM）的测试时缩放？先前关于纯文本语言模型的研究表明，文本干扰项可以加剧逆缩放，导致模型推理更长但推理轨迹效率更低。在这项工作中，我们研究了类似现象是否在多模态设置中出现。我们引入了Idis（带干扰项的图像），这是一个视觉问答数据集，系统性地沿着语义和数值维度变化干扰项。我们的分析揭示，视觉干扰项以与文本干扰项根本不同的方式影响推理VLM：尽管逆缩放仍然出现，但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查，我们提出了一种简单的提示策略，以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to produce longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.


2511.20615 2026-06-02 cs.CV cs.AI 版本更新

Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

评估深度学习模型在负重活动期间全身动态3D姿态预测中的性能

Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri

Affiliation: Department of Mechanical Engineering, Sharif University of Technology

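AI summary: Trains time-series models based on bidirectional long short-term memory and transformer architectures, with a cost function that enforces constant body segment lengths, to predict whole-body 3D posture during dynamic load-reaching activities.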

Comments 11 pages, 6 figures, 7 tables, This work has been submitted to the IEEE for possible publication

详情

AI中文摘要

本研究旨在探索深度神经网络在动态负重活动中全身人体姿态预测的应用。使用双向长短期记忆（BLSTM）和Transformer架构训练了两个时间序列模型。数据集包含20名正常体重健康男性个体的3D全身插件步态动态坐标，每人从不同负载位置执行204次负重任务，并采用不同的举升和处理技术。模型输入包括手-负载位置的3D坐标、举升（弯腰、全蹲和半蹲）和处理（单手和双手）技术、体重和身高，以及任务前25%时间的身体姿态3D坐标数据。模型利用这些输入预测任务剩余75%时间内的身体坐标。此外，提出了一种新方法，通过优化新的代价函数强制身体段长度恒定，以提高先前和当前姿态预测网络的准确性。结果表明，新代价函数使手臂和腿部模型的预测误差分别降低了约8%和21%。我们发现，使用Transformer架构（均方根误差为41.4 mm）的长期性能比基于BLSTM的模型准确约58%。本研究证明了利用捕捉时间序列依赖性的神经网络在3D运动帧中的价值，为理解和预测人工物料搬运活动中的运动动力学提供了独特方法。

英文摘要

This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals, each performing 204 load-reaching tasks from different load positions while adopting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. The transformer architecture, with a root-mean-square error of 41.4 mm, exhibited approximately 58% more accurate long-term performance than the BLSTM-based model. This study demonstrates the merit of neural networks that capture time-series dependencies in 3D motion frames, providing a unique approach for understanding and predicting motion dynamics during manual material handling activities.
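The abstract says the new cost function enforces constant body segment lengths but does not give its form. One plausible reading, sketched here purely as an assumption, is a penalty on how far each predicted segment length deviates from a fixed reference length, averaged over frames and segments. The joint names, reference lengths, and function name are hypothetical.

```python
import math


def segment_length_penalty(frames, segments, reference_lengths):
    """Mean squared deviation of predicted segment lengths from fixed lengths.

    frames: list of dicts mapping a joint name to an (x, y, z) position in mm.
    segments: list of (joint_a, joint_b) pairs treated as rigid segments.
    reference_lengths: dict mapping each segment to its constant length in mm.
    """
    total, count = 0.0, 0
    for frame in frames:
        for segment in segments:
            joint_a, joint_b = segment
            length = math.dist(frame[joint_a], frame[joint_b])
            total += (length - reference_lengths[segment]) ** 2
            count += 1
    return total / count


segments = [("shoulder", "elbow"), ("elbow", "wrist")]
reference_lengths = {("shoulder", "elbow"): 300.0, ("elbow", "wrist"): 250.0}

# One frame that respects both lengths, and one whose upper arm is 10 mm too long.
consistent = [{"shoulder": (0, 0, 0), "elbow": (300, 0, 0), "wrist": (300, 250, 0)}]
stretched = [{"shoulder": (0, 0, 0), "elbow": (310, 0, 0), "wrist": (310, 250, 0)}]

penalty_consistent = segment_length_penalty(consistent, segments, reference_lengths)
penalty_stretched = segment_length_penalty(stretched, segments, reference_lengths)
```

In training, a term like this would typically be added to the coordinate prediction loss with a weighting factor, so that predictions which stretch or shrink a limb are discouraged.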


2511.20295 2026-06-02 cs.CV 版本更新

Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

回归特征：用视频反事实解释来解释视频分类器

Chao Wang, Chengan Che, Xinyue Chen, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

Affiliations: Visual Understanding Research Group, Department of Informatics, King’s College London, UK; Department of Informatics, King’s College London, UK

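AI summary: Proposes BTTF, an optimization framework that uses two-stage optimization and a progressive denoising strategy to generate physically plausible, temporally coherent video counterfactual explanations that reveal what a video classifier bases its decisions on.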

Comments Accepted at CVPR2026 main conference

详情

AI中文摘要

反事实解释（CFEs）是对模型输入的最小且语义上有意义的修改，能够改变模型预测。它们突出了模型依赖的决定性特征，为分类器提供对比性解释。最先进的视觉反事实解释方法主要集中于解释图像分类器，而视频模型领域相对未被充分探索。为了使视频CFEs有用，它们必须物理上合理、时间上连贯，并表现出平滑的运动轨迹。现有的基于图像的CFE方法旨在解释图像分类器，缺乏生成时间连贯、平滑且物理合理的视频CFEs的能力。为了解决这个问题，我们提出了回归特征（BTTF），一个生成视频CFEs的优化框架。我们的方法引入了两个新颖的特性：1）一个优化方案，用于检索由输入视频第一帧条件化的初始潜在噪声；2）一个两阶段优化策略，使得能够在输入视频附近搜索反事实视频。两个优化过程仅由目标分类器指导，确保解释的忠实性。为了加速收敛，我们还引入了一种渐进式优化策略，逐步增加去噪步骤的数量。在Shape-Moving（运动分类）、MEAD（情感分类）和NTU RGB+D（动作分类）等视频数据集上的大量实验表明，我们的BTTF有效地生成了有效、视觉相似且逼真的反事实视频，为分类器的决策机制提供了具体见解。

英文摘要

Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods have primarily focused on interpreting image classifiers, leaving the domain of video models relatively underexplored. For the video CFEs to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier's decision-making mechanism.


2507.02792 2026-06-02 cs.CV 版本更新

RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

RichControl: 面向文本到图像生成的、结构和外观丰富的免训练空间控制

Lexi Pang, Liheng Zhang, Hang Ye, Xiaoxuan Ma, Yizhou Wang

Affiliations: Peking University; Carnegie Mellon University

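AI summary: Proposes a training-free framework that decouples the sampling schedule of condition features from the denoising process and adds a restart refinement schedule and an appearance-rich prompting strategy, giving controlled generation that balances structure and appearance under complex conditions.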

详情

AI中文摘要

文本到图像（T2I）扩散模型在从文本提示生成高质量图像方面取得了显著成功。最近的研究尝试扩展这些模型以融入条件图像（例如，Canny边缘）实现细粒度空间控制。其中，特征注入方法作为传统基于微调方法的免训练替代方案出现。然而，它们常常遭受结构错位、条件泄露和视觉伪影，特别是当条件图像与自然RGB分布显著偏离时。通过对现有方法的分析，我们识别出一个关键限制：条件特征的采样调度（此前未被探索）未能考虑扩散步骤中结构保持与域对齐之间不断变化的相互作用。受此观察启发，我们提出一个灵活的免训练框架，将条件特征的采样调度与去噪过程解耦，并系统性地研究特征注入调度的谱系，以实现结构对齐与外观质量之间的更好平衡。我们进一步通过引入重启细化调度来增强采样过程，并通过外观丰富的提示策略改善视觉质量。这些设计共同实现了既结构丰富又外观丰富的免训练可控生成。大量实验表明，我们的方法在复杂多样的条件下达到了最先进的性能。由于其通用性，我们的框架自然支持组合条件生成，并以即插即用的方式跨架构泛化，从基于UNet的扩散模型到现代DiT骨干网络（如FLUX）。

英文摘要

Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., canny edge) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning-based approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. Through an analysis of existing methods, we identify a key limitation: the sampling schedule of condition features, previously unexplored, fails to account for the evolving interplay between structure preservation and domain alignment throughout diffusion steps. Inspired by this observation, we propose a flexible training-free framework that decouples the sampling schedule of condition features from the denoising process, and systematically investigate the spectrum of feature injection schedules to achieve a better balance between structural alignment and appearance quality. We further enhance the sampling process by introducing a restart refinement schedule, and improve the visual quality with an appearance-rich prompting strategy. Together, these designs enable training-free controllable generation that is both structure-rich and appearance-rich. Extensive experiments demonstrate that our method achieves state-of-the-art performance under complex and diverse conditions. Owing to its generality, our framework naturally supports compositional conditional generation and generalizes across architectures in a plug-and-play manner, from UNet-based diffusion models to modern DiT backbones such as FLUX.


2511.10367 2026-06-02 cs.CV cs.AI 版本更新

DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

DermAI：通过质量驱动的图像采集实现移动端AI分类的临床皮肤病学

Thales Bezerra, Emanoel Thyago, Kelvin Cunha, Rodrigo Abreu, Fábio Papais, Francisco Mauro, Natália Lopes, Érico Medeiros, Jéssica Guido, Shirley Cruz, Paulo Borba, Tsang Ing Ren

Affiliations: Centro de Informática, Universidade Federal de Pernambuco, Brazil; Hospital das Clínicas, Universidade Federal de Pernambuco, Brazil

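AI summary: Presents DermAI, a smartphone application that uses real-time quality checks, local model adaptation, and diverse dataset collection to address dataset bias, variable image quality, and limited validation in AI dermatology.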

Comments 4 pages, 2 figures, 1 table, submitted to ISBI

2511.10806 2026-06-02 eess.IV cs.CV 版本更新

From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

从注意力到频率：融合Vision Transformer与FFT-ReLU的图像去模糊增强方法

Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan

Affiliation: Department of Computer Science and Engineering, University of Dhaka

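AI summary: Proposes a dual-domain architecture that combines a Vision Transformer with a frequency-domain FFT-ReLU module, using spatial attention modeling and frequency sparsity to suppress blur artifacts while preserving detail, and reports better PSNR, SSIM, and perceptual quality than existing methods on benchmark datasets.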

详情

DOI: 10.5220/0014441500004052
Journal ref: Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 2, Marbella, Spain, March 5-7, 2026, pp. 1810-1820. SCITEPRESS

AI中文摘要

Seq-DeepIPC：足式机器人导航中用于端到端控制的顺序感知

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

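AI summary: Proposes Seq-DeepIPC, which fuses multi-modal perception (RGB-D + GNSS) with sequential inputs for end-to-end navigation control of legged robots in real-world environments, validated on a robot dog.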

Comments This work has been accepted for publication in the IEEE Sensors Journal. https://ieeexplore.ieee.org/document/11373257/

详情

DOI: 10.1109/JSEN.2026.3656442

AI中文摘要

我们提出了Seq-DeepIPC，一种用于足式机器人在真实环境中导航的顺序端到端感知到控制模型。Seq-DeepIPC通过将多模态感知（RGB-D+GNSS）与时间融合和控制紧密结合，推进了自主足式导航的智能感知。该模型联合预测语义分割和深度估计，为规划和控制提供更丰富的空间特征。为了在边缘设备上高效部署，我们使用轻量级模型作为编码器，在保持精度的同时减少计算量。通过移除噪声较大的IMU，转而通过顺序GNSS坐标的差分分析推导全局航向，简化了航向估计。我们收集了一个更大且更多样化的数据集，包括道路和草地地形，并在机器人狗上验证了Seq-DeepIPC。对比和消融研究表明，顺序输入改善了我们的模型中的感知和控制，而其他基线则没有受益。Seq-DeepIPC以合理的模型大小取得了具有竞争力或更好的结果；尽管仅使用GNSS的航向在高大建筑物附近可靠性较低，但在开阔区域是鲁棒的。总体而言，Seq-DeepIPC将端到端导航从轮式机器人扩展到更通用和具有时间感知能力的系统。为了支持未来的研究，我们将在GitHub仓库https://github.com/oskarnatan/Seq-DeepIPC发布代码。

英文摘要

We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use a lightweight model as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead deriving global heading via differential analysis of sequential GNSS coordinates. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally aware systems. To support future research, we will release the code in our GitHub repository at https://github.com/oskarnatan/Seq-DeepIPC.
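The abstract states that global heading is derived from sequential GNSS coordinates instead of an IMU, without giving the formula. A common way to do this, shown here as an assumption rather than the authors' implementation, is the initial great-circle bearing between two consecutive fixes. The coordinates below are arbitrary illustrative values.

```python
import math


def heading_from_gnss(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing in degrees from fix 1 to fix 2.

    0 means north and 90 means east; inputs are latitude/longitude in degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    east = math.sin(delta_lon) * math.cos(phi2)
    north = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lon))
    return math.degrees(math.atan2(east, north)) % 360.0


# Two consecutive fixes a few metres apart: one step north, one step east.
heading_north = heading_from_gnss(34.7000, 137.4000, 34.7001, 137.4000)
heading_east = heading_from_gnss(34.7000, 137.4000, 34.7000, 137.4001)
```

Because the heading comes from the difference between fixes, it degrades when the robot is nearly stationary or when position noise is comparable to the distance travelled, which is consistent with the abstract's remark that GNSS-only heading is less reliable near tall buildings.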


2510.17045 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Video Reasoning without Training

无需训练的视频推理

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

Affiliations: Qualcomm AI Research; University of California, San Diego

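AI summary: Proposes V-Reason, which uses the entropy of the output distribution as a signal and a lightweight controller to adapt the value cache at inference time, improving video reasoning without reinforcement learning or fine-tuning.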

Comments CVPR Findings 2026. Project Page https://deepaksridhar.github.io/vreason.github.io/

详情

AI中文摘要

使用大型多模态模型（LMM）进行视频推理依赖于昂贵的强化学习（RL）和冗长的思维链，导致训练和推理过程中产生大量计算开销。此外，这些推理模型中控制思维过程的机制非常有限。在本文中，我们利用模型输出分布的熵作为信号来研究和指导推理行为。我们发现高质量模型表现出微探索和微利用循环的特征模式，随后出现后期熵峰值（即更长的思考）和较低的最终熵，表明更谨慎的探索和自信的收敛（即当模型探索或思考答案时避免过度随机性）。然后，我们利用这些新颖的、有理论基础的见解，引入了V-Reason（Video-Reason），一种推理时优化方法，通过轻量级、可训练的控制器自适应调整LMM的值缓存。我们提出的控制器由基于熵的目标引导，直接在推理时调整模型行为，无需使用任何RL或监督微调。我们的实验表明，V-Reason在许多视频推理数据集上显著优于基础指令调优模型，将与RL模型的差距平均缩小到0.6%的准确率以内。我们在无需任何训练的情况下实现了这一点，同时提供了效率优势：V-Reason使用的token比RL模型少58.6%。项目页面：https://deepaksridhar.github.io/vreason.github.io/

英文摘要

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We then use these novel, theoretically-grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. Our proposed controller is guided by an entropy-based objective, to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project Page https://deepaksridhar.github.io/vreason.github.io/
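The signal the abstract relies on is the entropy of the model's output distribution. As a minimal sketch of that quantity only (not of the V-Reason controller or its objective), the code below converts next-token logits to probabilities and computes Shannon entropy. The logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


uncertain_step = entropy(softmax([0.0, 0.0, 0.0, 0.0]))   # uniform over 4 tokens
confident_step = entropy(softmax([10.0, 0.0, 0.0, 0.0]))  # one dominant token
```

Tracking this value per generated token gives the kind of entropy trace in which the abstract describes exploration phases (higher entropy) and confident convergence (lower final entropy).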


2510.16660 2026-06-02 cs.CV cs.LG physics.med-ph 版本更新

Universal and Transferable Attacks on Pathology Foundation Models

病理基础模型的通用与可迁移攻击

Yuntian Wang, Xilin Yang, Che-Yung Shen, Nir Pillar, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA; Bioengineering Department, University of California, Los Angeles, CA, 90095, USA; California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA; Department of Pathology, Hadassah Hebrew University Medical Center, Jerusalem, 91120, Israel; Department of Surgery, University of California, Los Angeles, CA, 90095, USA

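AI summary: Introduces Universal and Transferable Adversarial Perturbations (UTAP), a fixed, weak noise pattern that disrupts the feature representations of multiple pathology foundation models and degrades downstream task performance, and demonstrates its universality across datasets and transferability across models.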

Comments 38 Pages, 8 Figures

详情

DOI: 10.1038/s41377-026-02347-w
Journal ref: Light: Science & Applications (2026)

AI中文摘要

我们为病理基础模型引入了通用可迁移对抗扰动（UTAP），揭示了其能力中的关键脆弱性。UTAP 使用深度学习优化，由一个固定的弱噪声模式组成，当添加到病理图像时，会系统地破坏多个病理基础模型的特征表示能力。因此，UTAP 会导致利用基础模型的下游任务性能下降，包括在广泛的未见数据分布上的错误分类。除了损害模型性能，我们展示了 UTAP 的两个关键特征：（1）通用性：其扰动可应用于不同的视野，与开发 UTAP 的数据集无关；（2）可迁移性：其扰动能成功降低各种外部、黑盒病理基础模型（从未见过）的性能。这两个特征表明 UTAP 不是针对特定基础模型或图像数据集的专用攻击，而是对多种新兴病理基础模型及其应用构成广泛威胁。我们在多个数据集上系统评估了 UTAP 对各种最先进病理基础模型的影响，通过使用固定噪声模式对输入图像进行视觉上不可察觉的修改，导致其性能显著下降。这些强大攻击的开发为模型鲁棒性评估建立了一个关键的高标准基准，凸显了推进防御机制的需求，并可能为对抗训练提供必要资产，以确保 AI 在病理学中的安全可靠部署。

英文摘要

We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse fields of view independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models that it has never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.


2507.23277 2026-06-02 cs.CV 版本更新

iLRM: An Iterative Large 3D Reconstruction Model

iLRM：一种迭代式大型3D重建模型

Gyeongjin Kang, Seungtae Nam, Seungkwon Yang, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park

Affiliations: Sungkyunkwan University; Yonsei University; Rembrand; Meta

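AI summary: Proposes iLRM, an iterative large 3D reconstruction model that decouples the scene representation from the input images, decomposes multi-view interactions, and injects high-resolution information to achieve efficient, scalable feed-forward 3D reconstruction, outperforming existing methods on RE10K and DL3DV.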

Comments Project page: https://gynjn.github.io/iLRM/

详情

AI中文摘要

前馈3D建模已成为快速高质量3D重建的一种有前景的方法。特别是直接生成显式3D表示（如3D高斯泼溅）因其快速高质量的渲染而备受关注。然而，许多基于Transformer架构的最先进方法存在严重的可扩展性问题，因为它们依赖于跨多个输入视图的图像令牌的全注意力，导致随着视图数量或图像分辨率的增加，计算成本变得难以承受。为了实现可扩展且高效的前馈3D重建，我们引入了一种迭代式大型3D重建模型（iLRM），该模型通过迭代细化机制生成3D高斯表示，并遵循三个核心原则：（1）将场景表示与输入图像解耦，以实现紧凑的3D表示；（2）将全局多视图交互分解为两阶段注意力方案，以降低计算成本；（3）在每一层注入高分辨率信息，以实现高保真重建。在广泛使用的数据集（如RE10K和DL3DV）上的实验结果表明，iLRM在重建质量和速度上均优于现有方法。

英文摘要

Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input images to enable compact 3D representations; (2) decomposing global multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed.

2510.14025 2026-06-02 cs.CV 版本更新

NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

NAPPure: 针对非加性扰动的鲁棒图像分类的对抗净化

Junjie Nan, Jianing Li, Wei Chen, Mingkun Zhang, Xueqi Cheng

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences（人工智能安全国家重点实验室，计算技术研究所，中国科学院）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结提出NAPPure框架，通过似然最大化解耦干净图像与扰动参数，有效提升图像分类模型对非加性扰动（如模糊、遮挡、失真）的鲁棒性。

2510.13774 2026-06-02 cs.LG cs.CV 版本更新

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

UrbanFusion: 用于鲁棒空间表示对比学习的随机多模态融合

Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结提出UrbanFusion模型，通过随机多模态融合（SMF）和Transformer模块整合街景、遥感、地图和POI数据，在56个城市41项任务中优于现有GeoAI模型。

详情

Journal ref: International Conference on Machine Learning (ICML), 2026

AI中文摘要

预测房价和公共卫生指标等城市现象需要有效整合各种地理空间数据。当前方法主要使用特定任务模型，而近期用于空间表示的通用模型通常仅支持有限模态且缺乏多模态融合能力。为克服这些挑战，我们提出UrbanFusion，一种具有随机多模态融合（SMF）的空间表示模型。该框架采用模态特定编码器处理不同类型输入，包括街景图像、遥感数据、制图地图和兴趣点（POI）数据。这些多模态输入通过基于Transformer的融合模块进行集成，学习统一表示。在全世界56个城市的41项任务上的广泛评估表明，与最先进的GeoAI模型相比，UrbanFusion具有强大的泛化和预测性能。具体而言，它1）在位置编码上优于先前模型，2）允许推理时多模态输入，3）能很好地泛化到训练中未见过的区域。UrbanFusion在预训练和推理过程中均可灵活利用给定位置的任何可用模态子集，从而在多样化的数据可用性场景中实现广泛适用性。

英文摘要

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent generic models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios.
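
To illustrate what "any subset of available modalities" could mean in practice, here is a minimal sketch of drawing a random non-empty modality subset per location. The abstract does not describe the sampling procedure behind Stochastic Multimodal Fusion, so this uniform-size sampling, the modality names, and the function name are all assumptions.

```python
import random

# Hypothetical modality identifiers, loosely following the inputs named in the abstract.
MODALITIES = ("street_view", "remote_sensing", "cartographic_map", "poi")


def sample_modality_subset(available, rng: random.Random):
    """Return a random non-empty subset of the modalities available at a location."""
    available = list(available)
    if not available:
        raise ValueError("at least one modality must be available")
    subset_size = rng.randint(1, len(available))  # inclusive on both ends
    return sorted(rng.sample(available, subset_size))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_modality_subset(MODALITIES, rng))
```

A fusion module trained on such subsets sees every combination of inputs, which is one plausible way to stay usable when some modalities are missing at inference time.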

2505.16915 2026-06-02 cs.CV cs.AI 版本更新

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

DetailMaster：你的文本到图像模型能处理长提示吗？

Qirui Jiao, Daoyuan Chen, Yilun Huang, Xika Lin, Ying Shen, Yaliang Li

发表机构 * Sun Yat-Sen University（中山大学）； Alibaba Group（阿里巴巴集团）； Worcester Polytechnic Institute（伍斯特理工学院）； Guangdong Provincial Key Laboratory of Fire Science and Intelligent Emergency Technology（广东省火灾科学与智能应急技术重点实验室）

AI总结提出DetailMaster基准，通过自动数据构建和评估流程，系统评估文本到图像模型在长提示下的性能，发现编码器和扩散模型在细节密集条件下的局限性，并证明高保真生成需要扩展提示限制与长提示训练的协同组合。

Comments 36 pages, 10 figures, 21 tables, accepted by ICML2026

详情

AI中文摘要

尽管最近的文本到图像（T2I）模型在从简短描述合成图像方面表现出令人印象深刻的能力，但它们在专业应用所需的冗长、详细提示上存在困难。我们提出了DetailMaster，一个全面的基准，用于评估T2I模型在具有复杂组合要求的长提示上的能力，并附有自动数据构建流程和评估工作流。我们的基准包含专家验证的提示，平均长度为284.89个标记，引入了四个关键评估维度：角色属性、结构化角色位置、多维场景属性以及空间/交互关系。对各种通用和长提示优化模型的评估揭示了关键的性能限制，表明弱编码器难以保留提示中的句法依赖关系，并且扩散模型在细节密集条件下遭受属性泄漏。通过在不同约束下的受控消融研究，我们进一步表明高保真生成需要扩展提示限制和长提示训练的协同组合。我们开源了数据集和代码，以促进长提示驱动的T2I生成的发展。

英文摘要

While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the long, detailed prompts required for professional applications. We present DetailMaster, a comprehensive benchmark for evaluating T2I capabilities on long prompts with complex compositional requirements, accompanied by an automated data construction pipeline and an evaluation workflow. Comprising expert-validated prompts averaging 284.89 tokens, our benchmark introduces four critical evaluation dimensions: Character Attributes, Structured Character Locations, Multi-Dimensional Scene Attributes, and Spatial/Interactive Relationships. Evaluations on various general-purpose and long-prompt-optimized models reveal critical performance limitations, showing that weak encoders struggle to preserve syntactic dependencies within prompts and diffusion models suffer from attribute leakage under detail-intensive conditions. Through a controlled ablation study under varying constraints, we further show that high-fidelity generation requires a synergistic combination of expanded prompt limits and long-prompt training. We open-source our dataset and code to foster progress in long-prompt-driven T2I generation.

2510.09608 2026-06-02 cs.CV cs.AI cs.CL 版本更新

StreamingVLM: Real-Time Understanding for Infinite Video Streams

StreamingVLM：无限视频流的实时理解

Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Yao Lu, Song Han

发表机构 * MIT（麻省理工学院）； NVIDIA（英伟达）

AI总结提出StreamingVLM，通过统一训练与流推理的框架，利用注意力汇点状态复用和滑动窗口机制实现无限视频流的实时稳定理解，在Inf-Streams-Eval基准上以8 FPS速度达到66.18%胜率，并提升通用VQA能力。

Comments Published as a conference paper at ICLR 2026. The first two authors contributed equally to this work

详情

AI中文摘要

视觉语言模型（VLM）可以为实时助手和自主代理提供动力，但它们面临一个关键挑战：理解近乎无限的视频流而不增加延迟和内存使用。对整个视频进行全注意力处理会导致二次计算成本和在长视频上性能不佳。同时，简单的滑动窗口方法也存在缺陷，它们要么破坏连贯性，要么由于冗余重计算而遭受高延迟。在本文中，我们介绍了StreamingVLM，一种专为实时、稳定理解无限视觉输入而设计的模型。我们的方法是一个统一框架，将训练与流推理对齐。在推理过程中，我们通过重用注意力汇点状态、最近视觉令牌的短窗口和最近文本令牌的长窗口来维护一个紧凑的KV缓存。这种流式能力通过一个简单的监督微调（SFT）策略灌输，该策略在短的重叠视频块上应用全注意力，有效地模拟了推理时的注意力模式，而无需在过长的上下文中进行训练。为了评估，我们构建了Inf-Streams-Eval，一个新的基准，包含平均超过两小时的视频，需要帧与文本之间的密集、每秒对齐。在Inf-Streams-Eval上，StreamingVLM对GPT-4O mini实现了66.18%的胜率，并在单个NVIDIA H100上以高达8 FPS的速度保持稳定、实时的性能。值得注意的是，我们的SFT策略还增强了通用的VQA能力，无需任何VQA特定的微调，在LongVideoBench上提高了+4.30，在OVOBench Realtime上提高了+5.96。代码可在https://github.com/mit-han-lab/streaming-vlm获取。

英文摘要

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
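
The cache policy described in the abstract (attention sinks, a short window of recent vision tokens, a long window of recent text tokens) can be sketched at the level of token positions. This toy version tracks positions only, not key/value tensors; the class name, the window sizes, and the rule that the first tokens become sinks regardless of modality are assumptions for illustration.

```python
from collections import deque


class StreamingCache:
    """Toy bookkeeping for a bounded cache: sinks + vision window + text window."""

    def __init__(self, num_sinks: int = 4, vision_window: int = 8, text_window: int = 32):
        self.num_sinks = num_sinks
        self.sinks = []                            # earliest tokens, never evicted
        self.vision = deque(maxlen=vision_window)  # short window of recent vision tokens
        self.text = deque(maxlen=text_window)      # long window of recent text tokens
        self.next_position = 0

    def append(self, modality: str) -> None:
        if modality not in ("vision", "text"):
            raise ValueError(f"unknown modality: {modality}")
        entry = (self.next_position, modality)
        self.next_position += 1
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(entry)
        elif modality == "vision":
            self.vision.append(entry)  # deque drops the oldest entry when full
        else:
            self.text.append(entry)

    def kept_positions(self) -> list:
        """Positions still cached, in temporal order."""
        return sorted(pos for pos, _ in [*self.sinks, *self.vision, *self.text])


if __name__ == "__main__":
    cache = StreamingCache(num_sinks=2, vision_window=3, text_window=4)
    for step in range(50):
        cache.append("text" if step % 5 == 0 else "vision")
    print(cache.kept_positions())
```

However long the stream runs, the cache never holds more than `num_sinks + vision_window + text_window` entries, which is what keeps memory and latency bounded.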

2510.03938 2026-06-02 physics.optics cs.CV cs.NE physics.app-ph 版本更新

Super-resolution image projection over an extended depth of field using a diffractive decoder

使用衍射解码器实现扩展景深上的超分辨率图像投影

Hanlong Chen, Cagatay Isil, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, California 90095, USA（加州大学洛杉矶分校电子与计算机工程系）； Bioengineering Department, University of California, Los Angeles, California 90095, USA（加州大学洛杉矶分校生物工程系）； California NanoSystems Institute (CNSI), University of California, Los Angeles, California 90095, USA（加州大学洛杉矶分校加州纳米系统研究所）

AI总结提出一种混合图像投影系统，结合CNN编码器和全光学衍射解码器，实现扩展景深和像素超分辨率，提升空间带宽积。

Comments 18 Pages, 6 Figures

详情

DOI: 10.1038/s41377-026-02320-7
Journal ref: Light: Science & Applications (2026)

AI中文摘要

图像投影系统必须在数据存储、计算和传输方面高效，同时保持输出的大空间带宽积（SBP）。本文介绍了一种混合图像投影系统，该系统结合了基于卷积神经网络（CNN）的数字编码器和全光学衍射解码器，实现了具有改进分辨率的扩展景深（DOF）。基于CNN的编码器将输入图像压缩为紧凑的相位表示，随后由低分辨率（LR）投影仪显示，并由模拟衍射解码器进行全光学图像重建。该光学解码器完全被动，设计用于合成像素超分辨图像投影，具有扩展景深，同时无需额外功耗即可实现超分辨图像重建。我们的像素超分辨率（PSR）图像投影系统在约267倍波长（W）的扩展景深内展示了高保真图像合成，同时在每个横向平面上提供高达约16倍的SBP改进。通过太赫兹波段的实验验证了该概念，并且该系统可扩展到电磁波谱的不同部分。这种图像投影架构可以减少显示系统的数据存储和传输需求，而不会对光学解码器施加额外的功率限制。除了扩展景深PSR图像投影外，该方法的基本原理还可扩展到各种应用，包括光学计量和显微镜。

英文摘要

Image projection systems must be efficient in data storage, computation and transmission while maintaining a large space-bandwidth-product (SBP) at their output. Here, we introduce a hybrid image projection system that achieves extended depth-of-field (DOF) with improved resolution, combining a convolutional neural network (CNN)-based digital encoder with an all-optical diffractive decoder. A CNN-based encoder compresses input images into compact phase representations, which are subsequently displayed by a low-resolution (LR) projector and processed by an analog diffractive decoder for all-optical image reconstruction. This optical decoder is completely passive, designed to synthesize pixel super-resolved image projections that feature an extended DOF while eliminating the need for additional power consumption for super-resolved image reconstruction. Our pixel super-resolution (PSR) image projection system demonstrates high-fidelity image synthesis over an extended DOF of ~267xW, where W is the illumination wavelength, concurrently offering up to ~16-fold SBP improvement at each lateral plane. The proof of concept of this approach is validated through an experiment conducted in the THz spectrum, and the system is scalable across different parts of the electromagnetic spectrum. This image projection architecture can reduce data storage and transmission requirements for display systems without imposing additional power constraints on the optical decoder. Beyond extended DOF PSR image projection, the underlying principles of this approach can be extended to various applications, including optical metrology and microscopy.
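
The two headline figures are easy to turn into concrete numbers. The sketch below converts the ~267×W depth of field into a physical length for a chosen wavelength and relates the ~16-fold SBP gain to a per-axis factor. The 0.75 mm wavelength is a hypothetical example value, and the equal split of the SBP gain between the two lateral axes is an assumption.

```python
import math


def depth_of_field(wavelength: float, dof_in_wavelengths: float = 267.0) -> float:
    """Extended depth of field, in the same length unit as the wavelength."""
    return dof_in_wavelengths * wavelength


def per_axis_gain(sbp_gain: float = 16.0) -> float:
    """Per-axis resolution factor, assuming the SBP gain splits equally over two axes."""
    return math.sqrt(sbp_gain)


if __name__ == "__main__":
    wavelength_mm = 0.75  # hypothetical illumination wavelength, not taken from the paper
    print(f"DOF: {depth_of_field(wavelength_mm):.2f} mm")
    print(f"per-axis factor: {per_axis_gain():.1f}x")
```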

2510.00053 2026-06-02 eess.IV cs.CV cs.LG 版本更新

DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

DPsurv: 双原型证据融合用于不确定性感知和可解释的全切片图像生存预测

Yucheng Xing, Ling Huang, Jingying Ma, Ruping Hong, Jiangdong Qiu, Pei Liu, Kai He, Huazhu Fu, Mengling Feng

发表机构 * National University of Singapore ； National University of Singapore Guangzhou Research Translation and Innovation Institute ； Imperial College London ； Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College ； Hunan University ； Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)

AI总结提出DPsurv双原型证据融合网络，通过不确定性感知的生存区间预测和基于补丁原型分配图、组件原型及组件级相对风险聚合的可解释性，在五个公开数据集上取得最佳一致性指数和积分Brier分数。

详情

AI中文摘要

病理全切片图像（WSIs）因其在细胞和组织水平上全面的组织病理学信息而被广泛用于癌症生存分析，能够进行定量、大规模且预后丰富的肿瘤特征分析。然而，现有大多数WSI生存分析方法可解释性有限，且常常忽略异质性切片图像中的预测不确定性。本文提出DPsurv，一种双原型全切片图像证据融合网络，输出不确定性感知的生存区间，同时通过补丁原型分配图、组件原型和组件级相对风险聚合实现预测的解释。在五个公开数据集上的实验取得了最高的平均一致性指数和最低的平均积分Brier分数，验证了DPsurv的有效性和可靠性。预测结果的解释在特征、推理和决策层面提供了透明度，从而增强了DPsurv的可信度和可解释性。

英文摘要

Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.
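
For readers unfamiliar with the concordance index reported above, here is a minimal reference implementation of Harrell's C-index for right-censored data. It assumes each subject is summarised by a single scalar risk score, which is a simplification; how DPsurv derives such a score from its survival intervals is not stated in the abstract.

```python
def concordance_index(times, events, risks) -> float:
    """Harrell's C-index.

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. It is concordant when the earlier failure has the
    higher risk score; tied risk scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: the third subject is censored at time 6.
    print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of failure times by risk.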

2408.01653 2026-06-02 cs.CV 版本更新

MCPDepth: Omnidirectional Depth Estimation via Stereo Matching from Multi-Cylindrical Panoramas

MCPDepth：基于多圆柱全景图的立体匹配全方位深度估计

Feng Qiao, Zhexiao Xiong, Xinge Zhu, Yuexin Ma, Qiumeng He, Nathan Jacobs

发表机构 * Washington University in St. Louis（圣路易斯华盛顿大学）； The Chinese University of Hong Kong（香港中文大学）； ShanghaiTech University（上海科技大学）； University of California, Los Angeles（加州大学洛杉矶分校）

AI总结提出MCPDepth两阶段框架，通过圆柱全景图的立体匹配和融合，利用循环注意力模块处理垂直畸变，在标准网络组件上实现高效的全方位深度估计，在Deep360和3D60数据集上MAE分别降低18.8%和19.9%。

Comments Accepted at the OmniCV Workshop, CVPR 2026

详情

AI中文摘要

全方位深度估计由于全景图像固有的畸变而面临重大挑战。尽管取得了显著进展，但投影方法的影响仍未得到充分探索。我们引入了多圆柱全景深度估计（MCPDepth），这是一种新颖的两阶段框架，旨在通过多个圆柱全景图之间的立体匹配来增强全方位深度估计。MCPDepth首先使用圆柱全景图进行立体匹配，然后对不同视图得到的深度图进行鲁棒融合。与现有方法依赖定制内核来处理畸变不同，MCPDepth利用标准网络组件，便于在嵌入式设备上无缝部署，同时提供卓越的性能。为了有效处理圆柱全景图中的垂直畸变，MCPDepth结合了循环注意力模块，显著扩展了传统卷积的感受野。我们对常见的全景投影——球面、圆柱和立方体——进行了全面的理论和实验分析，证明了圆柱投影的优越性。我们的方法在室外数据集Deep360上将平均绝对误差（MAE）降低了18.8%，在真实数据集3D60上降低了19.9%。这项工作为其他任务和实际应用提供了实用见解，建立了全方位深度估计的新范式。代码可在https://github.com/Qjizhi/MCPDepth获取。

英文摘要

Omnidirectional depth estimation presents a significant challenge due to the inherent distortions in panoramic images. Despite notable advancements, the impact of projection methods remains underexplored. We introduce Multi-Cylindrical Panoramic Depth Estimation (MCPDepth), a novel two-stage framework designed to enhance omnidirectional depth estimation through stereo matching across multiple cylindrical panoramas. MCPDepth initially performs stereo matching using cylindrical panoramas, followed by a robust fusion of the resulting depth maps from different views. Unlike existing methods that rely on customized kernels to address distortions, MCPDepth utilizes standard network components, facilitating seamless deployment on embedded devices while delivering exceptional performance. To effectively address vertical distortions in cylindrical panoramas, MCPDepth incorporates a circular attention module, significantly expanding the receptive field beyond traditional convolutions. We provide a comprehensive theoretical and experimental analysis of common panoramic projections (spherical, cylindrical, and cubic), demonstrating the superior efficacy of cylindrical projection. Our method improves the mean absolute error (MAE) by 18.8% on the outdoor dataset Deep360 and by 19.9% on the real dataset 3D60. This work offers practical insights for other tasks and real-world applications, establishing a new paradigm in omnidirectional depth estimation. The code is available at https://github.com/Qjizhi/MCPDepth.
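
As background for the projection comparison, the following sketch maps a 3D point onto a unit-radius cylindrical panorama. The axis convention (cylinder axis along y, azimuth measured from the z axis), the pixel mapping, and the function names are assumptions chosen for illustration, not the paper's implementation.

```python
import math


def cylindrical_coords(x: float, y: float, z: float):
    """Project a 3D point onto a unit cylinder whose axis is the y axis."""
    radial = math.hypot(x, z)
    if radial == 0.0:
        raise ValueError("point lies on the cylinder axis")
    azimuth = math.atan2(x, z)  # angle around the axis, in (-pi, pi]
    height = y / radial         # where the viewing ray hits the unit cylinder
    return azimuth, height


def to_pixel(azimuth: float, height: float, width: int, rows: int, max_height: float = 1.0):
    """Map (azimuth, height) to pixel coordinates; columns wrap around horizontally."""
    u = ((azimuth / (2.0 * math.pi) + 0.5) * width) % width
    v = (height / (2.0 * max_height) + 0.5) * rows
    return u, v


if __name__ == "__main__":
    # Two points on the same vertical 3D line share an azimuth, hence an image column.
    print(cylindrical_coords(1.0, 0.0, 2.0))
    print(cylindrical_coords(1.0, 5.0, 2.0))
```

Because vertical 3D lines stay vertical under this mapping, horizontally displaced cylindrical views keep a simple epipolar structure, which is one commonly cited reason to prefer cylindrical over spherical projection for stereo matching.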

2504.10552 2026-06-02 cs.LG cs.AI cs.CV cs.DL 版本更新

LEMUR Neural Network Dataset: Towards Seamless AutoML

LEMUR 神经网络数据集：迈向无缝 AutoML

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg（计算机视觉实验室，CAIDAS，维尔茨堡大学）

AI总结提出 LEMUR 开源数据集与框架，通过统一模板、结构化存储和自动化超参数优化，标准化神经网络实现与评估，以加速 AutoML 研究并促进公平基准测试。

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3291-3300, 2026

AI中文摘要

神经网络是现代人工智能的支柱，但设计、评估和比较它们仍然劳动密集。尽管存在许多用于训练的数据集，但模型本身的标准化集合很少。我们介绍 LEMUR，一个开源数据集和框架，它提供了大量基于 PyTorch 的神经网络集合，涵盖分类、分割、检测和自然语言处理等任务。每个模型遵循统一模板，配置和结果存储在结构化数据库中，以确保一致性和可重复性。LEMUR 通过 Optuna 集成自动超参数优化，包括统计分析和可视化工具，并提供 API 以无缝访问性能数据。该框架是可扩展的，允许研究人员添加新模型、数据集或指标而不破坏兼容性。通过标准化实现和统一评估，LEMUR 旨在加速 AutoML 研究，实现公平基准测试，并降低大规模神经网络实验的障碍。为支持采用和协作，LEMUR 及其插件在 MIT 许可下发布，网址为：https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

英文摘要

Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr
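
The idea of keeping configurations and results in a structured database can be illustrated with Python's built-in `sqlite3` module. This is not LEMUR's actual schema or API: the table layout, the column names, the model names, and every number below are made-up placeholders.

```python
import json
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of runs with placeholder values (not real results)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runs (model TEXT, task TEXT, dataset TEXT, hyperparams TEXT, accuracy REAL)"
    )
    rows = [
        ("model_a", "classification", "demo_set", json.dumps({"lr": 0.01, "batch": 64}), 0.81),
        ("model_a", "classification", "demo_set", json.dumps({"lr": 0.001, "batch": 64}), 0.86),
        ("model_b", "classification", "demo_set", json.dumps({"lr": 0.01, "batch": 32}), 0.84),
    ]
    conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def best_run(conn: sqlite3.Connection, task: str, dataset: str):
    """Return (model, hyperparams dict, accuracy) of the best stored run, or None."""
    row = conn.execute(
        "SELECT model, hyperparams, accuracy FROM runs "
        "WHERE task = ? AND dataset = ? ORDER BY accuracy DESC LIMIT 1",
        (task, dataset),
    ).fetchone()
    if row is None:
        return None
    model, hyperparams, accuracy = row
    return model, json.loads(hyperparams), accuracy


if __name__ == "__main__":
    print(best_run(build_demo_db(), "classification", "demo_set"))
```

Storing every run in one queryable table is what makes cross-model comparisons reproducible: the same query yields the same ranking for anyone with the database.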

2509.16635 2026-06-02 cs.CV 版本更新

Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification

面向任意时间检索：任意时间行人重识别基准

Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye, Nenghai Yu

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China（中国科学技术大学网络空间安全学院）； Anhui Province Key Laboratory of Digital Security（安徽省数字安全重点实验室）； The Chinese University of Hong Kong（香港中文大学）； School of Computer Science, Wuhan University, China（武汉大学计算机科学学院）

AI总结提出任意时间行人重识别（AT-ReID）任务，构建大规模多场景数据集AT-USTC，并设计统一模型Uni-AT实现全天候多场景有效检索。

Comments Accepted by IJCAI 2025 (oral)

详情

AI中文摘要

在实际应用中，行人重识别（ReID）需要能够在任何时间（包括白天和夜晚，从短期到长期）检索目标行人。然而，现有的ReID任务和数据集无法满足这一需求，因为它们受限于可用时间，仅提供特定场景的训练和评估。因此，我们研究了一项名为任意时间行人重识别（AT-ReID）的新任务，旨在基于时间变化在多个场景中实现有效检索。为了解决AT-ReID问题，我们收集了首个大规模数据集AT-USTC，其中包含由RGB和IR相机拍摄的403k张穿着多件衣服的个体图像。我们的数据收集跨越21个月，270名志愿者在不同日期或场景下平均被拍摄29.1次，比现有数据集多4-15倍，为AT-ReID的后续研究提供了条件。此外，为了应对多场景检索的新挑战，我们提出了一个统一模型Uni-AT，该模型包括一个用于场景特定特征学习的多场景ReID（MS-ReID）框架、一个减轻场景间干扰的属性专家混合（MoAE）模块，以及一个确保所有场景平衡训练的分层动态加权（HDW）策略。大量实验表明，我们的模型取得了令人满意的结果，并在所有场景中表现出优异的泛化能力。

英文摘要

In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 403k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans 21 months, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.

2509.15234 2026-06-02 cs.CV 版本更新

Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays

探索大语言模型编码器在胸部X光片图像-文本检索中的能力

Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park

发表机构 * Interdisciplinary Program in Bioengineering, Seoul National University Graduate School（生物工程跨学科项目，首尔国立大学研究生院）； Integrated Major in Innovative Medical Science, Seoul National University Graduate School（创新医学科学整合专业，首尔国立大学研究生院）； Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine（浙江大学医学院第一附属医院放射科）； Seoul National University College of Medicine（首尔国立大学医学院）； Department of Radiology, Seoul National University College of Medicine, Seoul National University Hospital（首尔国立大学医学院放射科，首尔国立大学医院）； Institute of Medical and Biological Engineering, Seoul National University Medical Research Center（医学与生物工程研究所，首尔国立大学医学研究中心）； Institute of Radiation Medicine, Seoul National University Medical Research Center（放射医学研究所，首尔国立大学医学研究中心）

AI总结提出一种领域自适应的双向大语言模型文本编码器，通过掩码标记预测和监督对比学习训练，结合参数高效的双塔对比视觉语言框架，提升胸部X光片图像与文本的对齐和检索性能。

Comments 12 pages, 2 figures, under review

详情

AI中文摘要

从配对的医学图像和临床文本中进行多模态学习是医学数据驱动信息学中的核心挑战，其中有效的跨模态对齐对于可扩展的分析和检索至关重要。在胸部放射学中，视觉语言预训练受到异质性放射学报告的制约，这些报告包含缩写、仅印象笔记和机构特定的写作风格。与通用领域不同，当报告风格差异显著时，简单聚合大量噪声报告可能会使多模态学习停滞甚至退化。我们提出了一种针对胸部放射学报告的领域自适应双向大语言模型文本编码器，通过在风格多样但临床等效的报告变体上进行掩码标记预测和监督对比学习训练，以生成鲁棒、可泛化的文本嵌入。然后，我们使用参数高效适配将该编码器集成到双塔对比视觉语言框架中，以改善图像-文本对齐。在来自公共数据集和去标识化医院队列的160万对配对研究上，所提出的模型提高了双向检索准确性和外部泛化能力，在MIMIC-CXR上达到0.308的GREEN分数，在Open-I上达到0.618，同时减少了在训练中添加富含缩写、仅印象的医院报告时观察到的退化。

英文摘要

Multimodal learning from paired medical images and clinical text is a central challenge in medical data-driven informatics, where effective cross-modal alignment is critical for scalable analysis and retrieval. In chest radiography, vision-language pretraining is constrained by heterogeneous radiology reports that contain abbreviations, impression-only notes, and institution-specific writing styles. Unlike general-domain settings, naively aggregating large collections of noisy reports can plateau or even degrade multimodal learning when reporting styles differ substantially. We propose a domain-adapted bidirectional large language model text encoder for chest radiograph reports, trained with masked token prediction and supervised contrastive learning on stylistically diverse but clinically equivalent report variants to produce robust, generalizable text embeddings. We then integrate this encoder into a dual-tower contrastive vision-language framework using parameter-efficient adaptation to improve image-text alignment. Across 1.6 million paired studies from public datasets and a de-identified hospital cohort, the proposed models improve bidirectional retrieval accuracy and external generalization, achieving GREEN scores of 0.308 on MIMIC-CXR and 0.618 on Open-I, while reducing the degradation observed when abbreviation-rich, impression-only hospital reports are added to training.
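
The dual-tower contrastive objective mentioned above is commonly implemented as a symmetric InfoNCE loss over an image-text similarity matrix. The sketch below shows that generic loss in plain Python; the abstract does not give the paper's exact loss or temperature, so the formulation and the default value of 0.07 are assumptions.

```python
import math


def _diagonal_cross_entropy(sim, temperature: float) -> float:
    """Mean cross-entropy where row i should select column i."""
    losses = []
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        losses.append(log_z - scaled[i])
    return sum(losses) / len(losses)


def symmetric_contrastive_loss(sim, temperature: float = 0.07) -> float:
    """Average of image-to-text and text-to-image losses.

    sim[i][j] is the similarity between image i and report j; matching
    pairs sit on the diagonal.
    """
    transposed = [list(column) for column in zip(*sim)]
    image_to_text = _diagonal_cross_entropy(sim, temperature)
    text_to_image = _diagonal_cross_entropy(transposed, temperature)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    misaligned = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive_loss(aligned), symmetric_contrastive_loss(misaligned))
```

Minimising this loss pulls each image embedding towards its own report and away from the other reports in the batch, in both retrieval directions.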

2205.02071 2026-06-02 cs.CV 版本更新

Representation-Centric Survey of Supervised Skeletal Action Recognition and the New Benchmark

以表示为中心的监督式骨骼动作识别综述与新基准

Yang Liu, Jiyao Yang, Madhawa Perera, Pan Ji, Dongwoo Kim, Min Xu, Tianyang Wang, Saeed Anwar, Tom Gedeon, Lei Wang, Zhenyue Qin

发表机构 * School of Computing, Australian National University（澳大利亚国立大学计算学院）； University of Alabama at Birmingham（阿拉巴马大学伯明翰分校）； OPPO US Research Center（OPPO美国研究中心）； Carnegie Mellon University（卡内基梅隆大学）； University of Western Australia（西澳大利亚大学）； Curtin University（科廷大学）； School of Engineering and Built Environment, Griffith University（格里菲斯大学工程与建筑环境学院）； School of Medicine, Yale University（耶鲁大学医学院）

AI总结本文以输入表示类型（关节坐标、骨骼向量、运动流及扩展表示）为中心，系统综述了监督式3D骨骼动作识别方法，并提出了包含多视角、复杂多人交互等挑战的大规模数据集ANUBIS，通过实验揭示了动作-特征依赖关系及多表示融合的局限性。

Comments Accepted for publication in Pattern Recognition

详情

AI中文摘要

3D骨骼动作识别已成为传统RGB和基于深度的方法的有力替代方案，具有对环境变化的鲁棒性、计算效率和增强的隐私性。尽管取得了显著进展，当前研究仍因输入表示多样而碎片化，且缺乏反映现实挑战场景的评估。本文以表示为中心，对监督式骨骼动作识别进行了综述，根据输入特征类型（关节坐标、骨骼向量、运动流和扩展表示）系统地对最先进方法进行分类，并分析这些选择如何影响时空建模策略。基于综述的见解，我们提出了ANUBIS，这是一个大规模、具有挑战性的数据集，旨在填补现有基准的关键空白。ANUBIS包含多视角记录（包括背面视角）、复杂的多人交互、细粒度和暴力动作以及当代社会行为。我们在ANUBIS上对多种最先进模型进行了基准测试，并深入分析了不同特征类型如何影响102个动作类别的识别性能。我们的结果显示了强烈的动作-特征依赖性，突出了朴素多表示融合的局限性，并指出了对任务感知、语义对齐的集成策略的需求。这项工作既提供了全面的基础，也提供了实用的基准资源，旨在指导下一代针对复杂现实场景的鲁棒、可泛化的基于骨骼的动作识别系统。数据集、基准框架和代码可在 https://yliu1082.github.io/ANUBIS/ 获取。

英文摘要

3D skeletal action recognition has emerged as a powerful alternative to traditional RGB and depth-based approaches, offering robustness to environmental variations, computational efficiency, and enhanced privacy. Despite remarkable progress, current research remains fragmented across diverse input representations and lacks evaluation under scenarios that reflect real-world challenges. This paper presents a representation-centric review of supervised skeletal action recognition, systematically categorizing state-of-the-art methods by their input feature types: joint coordinates, bone vectors, motion flows, and extended representations, and analyzing how these choices influence spatiotemporal modeling strategies. Building on the insights from this review, we introduce ANUBIS, a large-scale, challenging dataset designed to address critical gaps in existing benchmarks. ANUBIS incorporates multi-view recordings with back-view perspectives, complex multi-person interactions, fine-grained and violent actions, and contemporary social behaviors. We benchmark a diverse set of state-of-the-art models on ANUBIS and conduct an in-depth analysis of how different feature types affect recognition performance across 102 action categories. Our results show strong action-feature dependencies, highlight the limitations of naive multi-representational fusion, and point toward the need for task-aware, semantically aligned integration strategies. This work offers both a comprehensive foundation and a practical benchmarking resource, aiming to guide the next generation of robust, generalizable skeleton-based action recognition systems for complex real-world scenarios. The dataset, benchmarking framework, and code are available at https://yliu1082.github.io/ANUBIS/.
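
Three of the input representations named in the abstract are simple transformations of one another. The sketch below derives bone vectors and motion flows from raw joint coordinates using the definitions common in the skeleton-based recognition literature (bone = joint minus parent joint, motion = difference between consecutive frames). The three-joint skeleton and the function names are hypothetical.

```python
def bone_vectors(joints, parents):
    """Bone for joint i is joints[i] - joints[parents[i]]; the root (parent -1) gets zeros."""
    bones = []
    for joint, parent in zip(joints, parents):
        if parent < 0:
            bones.append((0.0, 0.0, 0.0))
        else:
            bones.append(tuple(c - p for c, p in zip(joint, joints[parent])))
    return bones


def motion_flow(frames):
    """Per-joint displacement between each pair of consecutive frames."""
    return [
        [tuple(b - a for a, b in zip(before, after)) for before, after in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


if __name__ == "__main__":
    parents = [-1, 0, 1]  # hypothetical chain: root -> joint 1 -> joint 2
    frame0 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
    frame1 = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 2.0, 0.0)]
    print(bone_vectors(frame0, parents))
    print(motion_flow([frame0, frame1]))
```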

2507.19881 2026-06-02 cs.CV cs.AI 版本更新

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

FedS2R: 面向自动驾驶中合成到真实语义分割的一次性联邦域泛化

Tao Lian, Jose L. Gómez, Antonio M. López

发表机构 * Computer Vision Center (CVC) Univ. Autònoma de Barcelona (UAB) Barcelona, Spain（计算机视觉中心（CVC）巴塞罗那自治大学（UAB）巴塞罗那，西班牙）

AI总结提出FedS2R框架，通过不一致性驱动的数据增强和多客户端知识蒸馏，实现自动驾驶中合成到真实语义分割的一次性联邦域泛化，在五个真实数据集上性能接近集中式训练。

Comments Accepted by IEEE Intelligent Vehicles Symposium (IV) 2026

详情

AI中文摘要

联邦域泛化在图像分类中通过多客户端协作训练而不共享原始数据已显示出有希望的进展。然而，其在自动驾驶语义分割中的潜力尚未被充分探索。本文提出FedS2R，这是第一个用于自动驾驶中合成到真实语义分割的一次性联邦域泛化框架。FedS2R包含两个组件：一种不一致性驱动的数据增强策略，用于生成不稳定类别的图像；以及一种具有特征融合的多客户端知识蒸馏方案，从多个客户端模型中蒸馏出全局模型。在五个真实数据集Cityscapes、BDD100K、Mapillary、IDD和ACDC上的实验表明，全局模型显著优于单个客户端模型，并且仅比同时访问所有客户端数据训练的模型落后2个mIoU点。这些结果证明了FedS2R在联邦学习下自动驾驶合成到真实语义分割中的有效性。

英文摘要

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple clients without sharing raw data. However, its potential in the semantic segmentation of autonomous driving remains underexplored. In this paper, we propose FedS2R, the first one-shot federated domain generalization framework for synthetic-to-real semantic segmentation in autonomous driving. FedS2R comprises two components: an inconsistency-driven data augmentation strategy that generates images for unstable classes, and a multi-client knowledge distillation scheme with feature fusion that distills a global model from multiple client models. Experiments on five real-world datasets, Cityscapes, BDD100K, Mapillary, IDD, and ACDC, show that the global model significantly outperforms individual client models and is only 2 mIoU points behind the model trained with simultaneous access to all client data. These results demonstrate the effectiveness of FedS2R in synthetic-to-real semantic segmentation for autonomous driving under federated learning.
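
The "2 mIoU points" gap refers to mean intersection-over-union, usually reported on a 0 to 100 scale. A minimal sketch of the metric, computed from a confusion matrix, is given below; skipping classes that never occur is a common convention adopted here as an assumption.

```python
def mean_iou(confusion) -> float:
    """Mean IoU from a square confusion matrix.

    confusion[t][p] counts pixels with ground-truth class t predicted as class p.
    Classes that appear in neither ground truth nor prediction are skipped.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_positive = confusion[c][c]
        false_negative = sum(confusion[c]) - true_positive
        false_positive = sum(confusion[r][c] for r in range(num_classes)) - true_positive
        denominator = true_positive + false_positive + false_negative
        if denominator > 0:
            ious.append(true_positive / denominator)
    if not ious:
        raise ValueError("confusion matrix is empty")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    toy_confusion = [[3, 1], [1, 5]]  # toy counts, not data from the paper
    print(f"mIoU = {100 * mean_iou(toy_confusion):.1f}")
```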

2507.18863 2026-06-02 cs.CV cs.CL 版本更新

Phoneme-Level Visual Speech Recognition via Point-Visual Fusion and Language Model Reconstruction

基于点视觉融合与语言模型重建的音素级视觉语音识别

Matthew Kit Khinn Teng, Haibo Zhang, Takeshi Saitoh

发表机构 * Kyushu Institute of Technology（九州工业大学）

AI总结提出一种两阶段音素级视觉语音识别框架，通过融合视觉和面部地标运动特征，并利用LLM模型重建单词，在LRS2和LRS3数据集上分别实现17.4%和21.0%的词错误率。

Comments Accepted at ICASSP 2026. This version corresponds to the camera-ready manuscript

详情

AI中文摘要

视觉自动语音识别（V-ASR）是一项具有挑战性的任务，涉及仅从视觉信息（如唇部运动和面部表情）解释口语。由于缺乏听觉线索以及音素（表现出相似视位——在唇部运动中看起来相同的不同声音）的视觉模糊性，该任务尤为困难。现有方法通常旨在直接从视觉线索预测单词或字符，但由于视位模糊性，它们通常遭受高错误率，并且需要大量预训练数据。我们提出了一种新颖的基于音素的两阶段框架，融合视觉和地标运动特征，随后使用LLM模型进行单词重建以应对这些挑战。第一阶段包括V-ASR，输出预测的音素，从而降低训练复杂度。同时，面部地标特征处理说话者特定的面部特征。第二阶段包括一个编码器-解码器LLM模型NLLB，将输出的音素重建回单词。除了使用大型视觉数据集进行深度学习微调外，我们的PV-ASR方法在LRS2数据集上实现了17.4%的词错误率，在LRS3数据集上实现了21.0%的词错误率，展现出优越性能。

英文摘要

Visual Automatic Speech Recognition (V-ASR) is a challenging task that involves interpreting spoken language solely from visual information, such as lip movements and facial expressions. This task is notably challenging due to the absence of auditory cues and the visual ambiguity of phonemes that exhibit similar visemes, that is, distinct sounds that appear identical in lip motions. Existing methods often aim to predict words or characters directly from visual cues, but they commonly suffer from high error rates due to viseme ambiguity and require large amounts of pre-training data. We propose a novel phoneme-based two-stage framework that fuses visual and landmark motion features, followed by an LLM for word reconstruction to address these challenges. Stage 1 consists of V-ASR, which outputs the predicted phonemes, thereby reducing training complexity. Meanwhile, the facial landmark features address speaker-specific facial characteristics. Stage 2 comprises an encoder-decoder LLM, NLLB, that reconstructs the output phonemes back to words. Besides using a large visual dataset for deep learning fine-tuning, our PV-ASR method demonstrates superior performance by achieving 17.4% WER on the LRS2 and 21.0% WER on the LRS3 dataset.
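
The word error rate (WER) quoted above is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. A minimal sketch using whitespace tokenisation (an assumption; evaluation toolkits typically normalise text first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```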

2503.06520 2026-06-02 cs.CV cs.MM 版本更新

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Seg-Zero: 通过认知强化学习的推理链引导分割

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia

发表机构 * The Chinese University of Hong Kong（香港中文大学）； The Hong Kong University of Science and Technology（香港科技大学）； Renmin University of China（中国人民大学）

AI总结提出Seg-Zero框架，通过解耦的推理模型和分割模型，结合GRPO强化学习与格式-精度奖励机制，实现零样本推理分割，在ReasonSeg基准上超越LISA-7B 18%。

详情

AI中文摘要

传统的推理分割方法依赖于使用类别标签和简单描述进行监督微调，限制了其域外泛化能力且缺乏显式推理过程。为解决这些限制，我们提出了Seg-Zero，一种新颖的框架，通过认知强化学习展现出显著的泛化能力并推导出显式的思维链推理。Seg-Zero引入了一个解耦架构，包含一个推理模型和一个分割模型。推理模型解释用户意图，生成显式推理链，并产生位置提示，随后分割模型利用这些提示生成精确的像素级掩码。我们设计了一个复杂的奖励机制，整合了格式奖励和精度奖励，以有效指导优化方向。仅通过GRPO的强化学习训练，无需显式推理数据，Seg-Zero实现了鲁棒的零样本泛化，并展现出涌现的测试时推理能力。实验表明，Seg-Zero-7B在ReasonSeg基准上达到了57.5的零样本性能，超越了之前的LISA-7B 18%。这一显著提升突显了Seg-Zero跨域泛化的能力，同时呈现了显式的推理过程。

英文摘要

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process.
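
A rough sketch of how format and accuracy rewards might be combined and fed into GRPO-style group-relative advantages follows. Only the advantage normalisation (reward minus group mean, divided by group standard deviation) reflects GRPO as generally described; the `<think>`/`<answer>` tag check, the box-IoU threshold of 0.5, and all function names are assumptions, not the paper's reward definition.

```python
import re
import statistics

# Hypothetical output template: reasoning in <think>, positional prompt in <answer>.
TEMPLATE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 when the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def box_iou(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_reward(predicted_box, target_box, threshold: float = 0.5) -> float:
    """1.0 when the predicted box overlaps the target enough (assumed criterion)."""
    return 1.0 if box_iou(predicted_box, target_box) > threshold else 0.0


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalise each reward against the other samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    rewards = [2.0, 1.0, 1.0, 0.0]  # format + accuracy for four sampled completions
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group, no separate value network is required, which is one reason this style of training works without annotated reasoning traces.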

2502.08884 2026-06-02 cs.CV cs.AI cs.GR 版本更新

ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models

ShapeLib: 利用大型语言模型设计程序化3D形状抽象库

R. Kenny Jones, Paul Guerrero, Niloy J. Mitra, Daniel Ritchie

发表机构 * Stanford University（斯坦福大学）； Adobe Research（Adobe研究）； University College London（伦敦大学学院）； Brown University（布朗大学）

AI总结提出ShapeLib方法，利用大型语言模型的先验知识，通过引导式工作流自动设计可泛化的程序化3D形状抽象库，并支持下游形状编辑与生成。

详情

AI中文摘要

我们提出ShapeLib，这是第一个利用大型语言模型（LLM）的先验知识来设计程序化3D形状抽象库的方法。我们的系统接受两种形式的用户提供的设计意图：输出库中应包含的功能的高级文本描述，以及一小部分示例形状的种子集。我们通过引导式LLM工作流发现与设计意图匹配的抽象库，该工作流首先提出应用和实现功能的不同方式，然后验证这些功能有助于表示种子集形状。为了扩展到种子集之外，我们开发了特定于库的识别网络，将形状（表示为基元、体素或点云）映射到使用这些新发现的抽象的程序。跨多个建模领域（按形状类别划分），我们发现，当LLM与几何推理深思熟虑地结合时，可以引导它们编写出能跨形状分布泛化的抽象函数库。我们的框架朝着实现长期以来的形状分析愿望迈出了一步，即发现可重用的、程序化的形状抽象，同时暴露可解释的、语义对齐的接口。我们的广泛评估表明，ShapeLib在泛化性、可用性和在操作下保持合理性方面，优于先前的替代抽象发现方法。最后，我们展示了ShapeLib的抽象函数解锁了多个下游应用，将LLM对形状程序的推理与几何处理工具相结合，以支持形状编辑和生成工作流。

英文摘要

We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abstractions. Our system accepts two forms of user-provided design intent: high-level text descriptions of functions to include in the output library and a small seed set of exemplar shapes. We discover a library of abstractions that matches this design intent with a guided LLM workflow that first proposes different ways of applying and implementing functions, and then validates these functions are helpful in representing seed set shapes. To extend beyond the seed set, we develop library-specific recognition networks that map shapes (represented as primitives, voxels, or point clouds) to programs that use these newly discovered abstractions. Across multiple modeling domains (split by shape category), we find that LLMs, when thoughtfully combined with geometric reasoning, can be guided to author libraries of abstraction functions that generalize across shape distributions. Our framework takes a step towards realizing the long-standing shape analysis aspiration of discovering reusable, programmatic shape abstractions while exposing interpretable, semantically aligned interfaces. Our extensive evaluation demonstrates that ShapeLib provides distinct advantages over prior alternative abstraction discovery works in terms of generalization, usability, and maintaining plausibility under manipulation. Finally, we demonstrate that ShapeLib's abstraction functions unlock a number of downstream applications, combining LLM reasoning over shape programs with geometry processing tools to support shape editing and generation workflows.

2506.10858 2026-06-02 eess.IV cs.CV 版本更新

Med-URWKV†: Toward Enhanced Pretrained Pure VRWKV Models for Medical Image Segmentation

Med-URWKV†：面向医学图像分割的增强型预训练纯VRWKV模型

Zhenhuan Zhou, Yining Li, Yanlin Wu, Haohan Zou, Yan Wang, Tao Li

发表机构 * College of Computer Science, Nankai University（南开大学计算机科学学院）； Key Laboratory of Data and Intelligent System Security, Ministry of Education（教育部数据与智能系统安全重点实验室）； School of Medicine, Nankai University（南开大学医学院）； Nankai University Eye Institute, Nankai University（南开大学眼科研究院）； Tianjin Eye Hospital（天津眼科医院）； Haihe Lab of ITAI（海河ITAI实验室）

AI总结本文提出Med-URWKV模型，通过重用预训练VRWKV编码器并设计FAWA和MSCF模块，在五个数据集上取得与SOTA方法相当或更优的性能，其中Med-URWKV†仅用Med-URWKV-S一半的参数量即取得最高平均Dice 88.00%。

详情

AI中文摘要

医学图像分割是计算机辅助诊断和治疗中的基本任务。基于CNN、ViT、Mamba和混合模型的现有方法仍存在感受野受限、计算成本高或精度不足等问题。最近，视觉感受野加权键值（VRWKV）模型作为一种有前景的替代方案出现，为视觉任务提供了强大的长距离依赖建模能力。然而，当前基于VRWKV的医学图像分割研究主要集中于从头训练的混合架构，而大规模预训练纯VRWKV模型的潜力尚未被探索。在这项工作中，我们系统研究了纯VRWKV架构在医学图像分割中的有效性。通过在不同尺度上重用预训练VRWKV编码器并搭配纯VRWKV解码器，我们构建了Med-URWKV-T和Med-URWKV-S，从而对该领域中的预训练纯VRWKV模型进行全面评估。为进一步提升性能，我们提出了两个VRWKV兼容模块：频率感知小波注意力（FAWA）模块，利用小波变换捕捉边缘细节和结构特征；以及多尺度通道融合（MSCF）模块，整合多尺度特征以增强信息性通道表示。通过将它们集成到Med-URWKV-T中，我们得到了增强模型Med-URWKV†。在五个医学图像分割数据集上的大量实验表明，Med-URWKV取得了与最先进方法及精心设计的混合VRWKV架构相当或更优的性能。此外，Med-URWKV†进一步提升了分割精度，在仅使用一半参数量的情况下超越了Med-URWKV-S，并达到了最高的平均Dice相似系数88.00%。代码将公开发布。

英文摘要

Medical image segmentation is a fundamental task in computer-aided diagnosis and treatment. Existing approaches based on CNNs, ViTs, Mamba, and hybrid models still suffer from limitations such as restricted receptive fields, high computational cost, or insufficient accuracy. Recently, Vision Receptive-field Weighted Key-Value (VRWKV) models have emerged as a promising alternative, delivering strong long-range dependency modeling for visual tasks. However, current studies on VRWKV-based medical image segmentation mainly focus on hybrid architectures trained from scratch, while the potential of large-scale pretrained pure VRWKV models remains unexplored. In this work, we systematically investigate the effectiveness of pure VRWKV architectures for medical image segmentation. We construct Med-URWKV-T and Med-URWKV-S by reusing pretrained VRWKV encoders at different scales and pairing them with pure VRWKV decoders, enabling a comprehensive evaluation of pretrained pure VRWKV models in this domain. To further enhance performance, we propose two VRWKV-compatible modules: a Frequency-Aware Wavelet Attention (FAWA) module, which exploits wavelet transforms to capture edge details and structural characteristics, and a Multi-Scale Channel Fusion (MSCF) module, which integrates multi-scale features to strengthen informative channel representations. By incorporating them into Med-URWKV-T, we obtain the enhanced model Med-URWKV†. Extensive experiments on five medical image segmentation datasets demonstrate that Med-URWKV achieves performance comparable or superior to state-of-the-art methods and carefully designed hybrid VRWKV architectures. Moreover, Med-URWKV† further improves segmentation accuracy, surpassing Med-URWKV-S while using only half of its parameter count, and achieves the highest average Dice similarity coefficient of 88.00%. The code will be released.

2506.08137 2026-06-02 cs.CV cs.AI 版本更新

IGraSS: Learning to Identify Infrastructure Networks from Satellite Imagery by Iterative Graph-constrained Semantic Segmentation

IGraSS: 通过迭代图约束语义分割从卫星图像中识别基础设施网络

Oishee Bintey Hoque, Abhijin Adiga, Aniruddha Adiga, Siddharth Chaudhary, Madhav V. Marathe, S. S. Ravi, Kirti Rajagopalan, Amanda Wilson, Samarth Swarup

发表机构 * Biocomplexity Institute, University of Virginia（弗吉尼亚大学生物复杂性研究所）； Department of Computer Science, University of Virginia（弗吉尼亚大学计算机科学系）； Department of Biomedical Systems Engineering, Washington State University（华盛顿州立大学生物医学系统工程系）； Earth System Science Center, University of Alabama in Huntsville（阿拉巴马大学亨茨维尔分校地球系统科学中心）

AI总结提出IGraSS迭代框架，结合语义分割与图约束优化，将不可达运河段从18%降至3%，并提升道路网络完整性。

详情

DOI: 10.24963/ijcai.2025/1076

AI中文摘要

精确的运河网络制图对于水资源管理（包括灌溉规划和基础设施维护）至关重要。最先进的基础设施制图语义分割模型（如道路）依赖于大规模、良好标注的遥感数据集。然而，不完整或不充分的真实标注会阻碍这些学习方法。许多基础设施网络具有图级属性，如可达性（运河）或连通性（道路），可用于改进现有真实标注。本文开发了一种新颖的迭代框架IGraSS，将结合RGB和额外模态（NDWI、DEM）的语义分割模块与基于图的真实标注精化模块相结合。分割模块处理卫星图像块，而精化模块将基础设施网络视为图，在整个数据上运行。实验表明，IGraSS将不可达运河段从约18%降至3%，并且使用精化后的真实标注进行训练显著改善了运河识别。IGraSS是一个鲁棒的框架，既可用于精化噪声真实标注，也可用于从遥感影像中绘制运河网络。我们还以道路网络为例，应用不同的图论约束来完善道路网络，证明了IGraSS的有效性和泛化能力。

英文摘要

Accurate canal network mapping is essential for water management, including irrigation planning and infrastructure maintenance. State-of-the-art semantic segmentation models for infrastructure mapping, such as roads, rely on large, well-annotated remote sensing datasets. However, incomplete or inadequate ground truth can hinder these learning approaches. Many infrastructure networks have graph-level properties, such as reachability to a source (canals) or connectivity (roads), that can be leveraged to improve the existing ground truth. This paper develops a novel iterative framework, IGraSS, which combines a semantic segmentation module that incorporates RGB and additional modalities (NDWI, DEM) with a graph-based ground-truth refinement module. The segmentation module processes satellite imagery patches, while the refinement module operates on the entire data, viewing the infrastructure network as a graph. Experiments show that IGraSS reduces unreachable canal segments from around 18% to 3%, and training with refined ground truth significantly improves canal identification. IGraSS serves as a robust framework for both refining noisy ground truth and mapping canal networks from remote sensing imagery. We also demonstrate the effectiveness and generalizability of IGraSS using road networks as an example, applying a different graph-theoretic constraint to complete road networks.
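The reachability property mentioned above can be made concrete with a small sketch that measures what fraction of canal segments cannot be reached from any water source. This is only an illustration under stated assumptions: the network is treated as an undirected graph, segments are edges between named junctions, and the function name and data layout are hypothetical, not the paper's refinement module.

```python
from collections import defaultdict, deque

def unreachable_fraction(edges, sources):
    """Fraction of segments (edges) not connected to any source node.

    edges:   list of (u, v) pairs, one per canal segment (assumed undirected)
    sources: iterable of source node names
    """
    if not edges:
        return 0.0
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search from every source that appears in the graph.
    seen = {s for s in sources if s in adjacency}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # In an undirected graph both endpoints of an edge share a component,
    # so checking one endpoint is enough.
    unreachable = sum(1 for u, _ in edges if u not in seen)
    return unreachable / len(edges)
```

For example, with segments `[("s", "a"), ("a", "b"), ("c", "d")]` and source `"s"`, one of three segments is unreachable; adding a bridging segment `("b", "c")`, as a refinement step might, brings the fraction to zero.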

2506.09035 2026-06-02 cs.CV 版本更新

Princeton365: A Diverse Dataset with Accurate Camera Pose

Princeton365: 一个具有精确相机位姿的多样化数据集

Karhan Kayan, Stamatis Alexandropoulos, Rishabh Jain, Yiming Zuo, Erich Liang, Jia Deng

发表机构 * Princeton University（普林斯顿大学）

AI总结提出Princeton365数据集，包含365个视频和精确相机位姿，通过校准板和360度相机的新颖真值采集框架弥合精度与多样性差距，并引入基于光流的尺度感知评估指标及新颖视图合成基准。

Comments Update v2: Match the ICCV 2025 camera-ready version. Fix typos

详情

AI中文摘要

我们介绍了Princeton365，一个包含365个视频的大规模多样化数据集，具有精确的相机位姿。我们的数据集通过引入一种新颖的真值采集框架，利用校准板和360度相机，弥合了当前SLAM基准中精度与数据多样性之间的差距。我们收集了室内、室外和物体扫描视频，并同步输出单目和立体RGB视频以及IMU数据。我们进一步提出了一种基于相机位姿估计误差引起的光流的新场景尺度感知SLAM评估指标。与当前指标相比，我们的新指标允许跨场景比较SLAM方法的性能，而现有指标如平均轨迹误差（ATE）则不能，从而使研究人员能够分析其方法的失败模式。我们还提出了一个具有挑战性的新颖视图合成基准，涵盖了当前NVS基准未覆盖的情况，例如具有360度相机轨迹的完全非朗伯场景。请访问 https://princeton365.cs.princeton.edu 获取数据集、代码、视频和提交信息。

英文摘要

We introduce Princeton365, a large-scale diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a 360-camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. Unlike existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, which lets researchers analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with 360-degree camera trajectories. Please visit https://princeton365.cs.princeton.edu for the dataset, code, videos, and submission.

2503.06473 2026-06-02 cs.CV cs.AI 版本更新

Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

通过剪枝冗余检索增强层注意力效率

Hanze Li, Yaosong Du, Zhibo Yao, Mengyao Zeng, Xiuqi Ge, Xiande Huang

发表机构 * De Artificial Intelligence Lab（德人工智能实验室）

AI总结针对层注意力机制中相邻层权重冗余导致特征重复和训练效率低的问题，提出基于KL散度量化冗余并利用增强Beta分位数映射（EBQM）跳过冗余层的高效层注意力（ELA）架构，在图像分类和目标检测任务中训练时间减少30%且性能提升。

Comments 5 pages

详情

AI中文摘要

越来越多的证据表明，层注意力机制增强了深度神经网络中层间的交互，显著推进了网络架构的发展。然而，现有的层注意力方法存在冗余问题，因为相邻层学习的注意力权重往往变得高度相似。这种冗余导致多个层提取几乎相同的特征，降低了模型的表示能力并增加了训练时间。为了解决这个问题，我们提出了一种新颖的方法，利用相邻层之间的Kullback-Leibler（KL）散度来量化冗余。此外，我们引入了一种增强Beta分位数映射（EBQM）方法，能够准确识别并跳过冗余层，从而保持模型稳定性。我们提出的高效层注意力（ELA）架构提高了训练效率和整体性能，在图像分类和目标检测等任务中实现了30%的训练时间减少，同时提升了性能。

英文摘要

Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
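To make the redundancy measure concrete, the sketch below computes the KL divergence between the attention weights of adjacent layers and flags layers that differ very little from their predecessor. It rests on simplifying assumptions that are not from the paper: each layer's attention weights form a probability vector of the same length, and a fixed threshold stands in for the EBQM skipping rule, whose details the abstract does not give.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def redundant_layers(attention, threshold):
    """Indices of layers whose weights are within `threshold` KL of the previous layer.

    attention: list of probability vectors, one per layer (assumed equal length)
    threshold: fixed cut-off used here in place of the paper's EBQM rule
    """
    return [
        i
        for i in range(1, len(attention))
        if kl_divergence(attention[i], attention[i - 1]) < threshold
    ]
```

With `attention = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]` and a threshold of 0.01, only layer 1 is flagged, since it repeats layer 0 while layer 2 differs substantially.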

2411.15076 2026-06-02 eess.IV cs.CV q-bio.QM 版本更新

RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency

RankByGene: 通过跨模态排序一致性实现基因引导的组织病理学表示学习

Wentao Huang, Meilong Xu, Xiaoling Hu, Shahira Abousamra, Aniruddha Ganguly, Saarthak Kapse, Alisa Yurovsky, Prateek Prasanna, Tahsin Kurc, Joel Saltz, Michael L. Miller, Chao Chen

发表机构 * Stony Brook University（石溪大学）； Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School（阿提诺拉A.马丁努斯生物医学影像中心，麻省总医院和哈佛医学院）； Department of Biomedical Data Science, Stanford University（生物医学数据科学系，斯坦福大学）； Department of Pathology and Cell Biology, Columbia University（病理学与细胞生物学系，哥伦比亚大学）

AI总结提出基于排序对齐损失的框架，利用教师-学生网络自监督知识蒸馏，解决空间转录组学与组织学图像的对齐问题，在基因表达预测、切片分类和生存分析任务中表现优异。

Comments 18 pages, 9 figures

详情

AI中文摘要

空间转录组学（ST）通过映射组织内的基因表达提供必要的空间背景，从而能够详细研究细胞异质性和组织结构。然而，由于固有的空间扭曲和模态特异性变化，将ST数据与组织学图像对齐面临挑战。现有方法主要依赖直接对齐，通常无法捕捉复杂的跨模态关系。为解决这些限制，我们提出一种新颖框架，使用基于排序的对齐损失来对齐基因和图像特征，保留跨模态的相对相似性，并实现稳健的多尺度对齐。为进一步增强对齐的稳定性，我们采用教师-学生网络架构的自监督知识蒸馏，有效减轻基因表达数据中高维性、稀疏性和噪声带来的干扰。在涵盖基因表达预测、切片级分类和生存分析的七个公共数据集上的大量实验证明了我们方法的有效性，显示出比现有方法更好的对齐和预测性能。

英文摘要

Spatial transcriptomics (ST) provides essential spatial context by mapping gene expression within tissue, enabling detailed study of cellular heterogeneity and tissue organization. However, aligning ST data with histology images poses challenges due to inherent spatial distortions and modality-specific variations. Existing methods largely rely on direct alignment, which often fails to capture complex cross-modal relationships. To address these limitations, we propose a novel framework that aligns gene and image features using a ranking-based alignment loss, preserving relative similarity across modalities and enabling robust multi-scale alignment. To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture, effectively mitigating disruptions from high dimensionality, sparsity, and noise in gene expression data. Extensive experiments on seven public datasets that encompass gene expression prediction, slide-level classification, and survival analysis demonstrate the efficacy of our method, showing improved alignment and predictive performance over existing methods.

2503.15639 2026-06-02 cs.CV cs.AI 版本更新

A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition

一种轻量级上下文驱动的免训练网络用于场景文本分割与识别

Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal, Cheng-Lin Liu

发表机构 * CVPR Unit, Indian Statistical Institute, Kolkata, India（印度加尔各答印度统计研究所CVPR单元）； Manipal University Jaipur, India（印度斋浦尔马尼帕尔大学）； University of Salford, UK（英国萨尔福德大学）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出一种基于上下文理解、无需训练的即插即用框架，通过注意力分割和语义评估实现高效场景文本识别，性能与SOTA相当且资源消耗更低。

Comments Accepted at ICDAR 2025 (ORAL) 21 pages, 8 figures, 7 tables

详情

AI中文摘要

现代场景文本识别系统通常依赖于大型端到端架构，这些架构需要大量训练，并且对于实时场景来说成本过高。在这种情况下，由于内存、计算资源和延迟的限制，部署重型模型变得不切实际。为了应对这些挑战，我们提出了一种新颖的、无需训练的即插即用框架，该框架利用预训练文本识别器的优势，同时最小化冗余计算。我们的方法使用基于上下文的理解，并引入了一个基于注意力的分割阶段，该阶段在像素级别细化候选文本区域，从而改进下游识别。我们不执行传统的文本检测（即特征图与源图像之间的块级比较），而是利用预训练的标题生成器来利用上下文信息，使框架能够直接从场景上下文生成单词预测。候选文本经过语义和词汇评估以获得最终分数。达到或超过预定义置信度阈值的预测绕过更重的端到端文本STR（场景文本识别）流程，确保更快的推理并减少不必要的计算。在公共基准上的实验表明，我们的范式实现了与最先进系统相当的性能，但所需资源大大减少。我们的代码可在此处找到：https://ritabrata04.github.io/Context-driven-STR/。

英文摘要

Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. In such cases, the deployment of heavy models becomes impractical due to constraints on memory, computational resources, and latency. To address these challenges, we propose a novel, training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level, improving downstream recognition. Instead of performing traditional text detection, which follows a block-level comparison between feature map and source image, our approach harnesses contextual information using pretrained captioners, allowing the framework to generate word predictions directly from scene context. Candidate texts are semantically and lexically evaluated to get a final score. Predictions that meet or exceed a pre-defined confidence threshold bypass the heavier process of end-to-end text STR profiling, ensuring faster inference and cutting down on unnecessary computations. Experiments on public benchmarks demonstrate that our paradigm achieves performance on par with state-of-the-art systems, yet requires substantially fewer resources. Our code can be found here: https://ritabrata04.github.io/Context-driven-STR/.

2503.06136 2026-06-02 cs.CV cs.AI 版本更新

GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

GSV3D: 基于高斯溅射的几何蒸馏与稳定视频扩散用于单图像3D物体生成

Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

发表机构 * State Key Laboratory of Virtual Reality Technology and Systems, Beihang University（虚拟现实技术与系统国家重点实验室，北京航空航天大学）； SenseTime Research（商汤研究）； PBVR

AI总结提出一种结合2D扩散模型隐式3D推理能力与高斯溅射几何蒸馏的方法，通过高斯溅射解码器将SV3D潜变量输出转换为显式3D表示，实现多视图一致性和高质量3D生成。

详情

DOI: 10.1109/iccv51701.2025.00727

AI中文摘要

基于图像的3D生成在机器人和游戏领域有广泛应用，其中高质量、多样化的输出和一致的3D表示至关重要。然而，现有方法存在局限性：3D扩散模型受限于数据集稀缺和缺乏强大的预训练先验，而基于2D扩散的方法则难以保证几何一致性。我们提出了一种方法，利用2D扩散模型的隐式3D推理能力，同时通过基于高斯溅射的几何蒸馏确保3D一致性。具体来说，所提出的高斯溅射解码器通过将SV3D潜变量输出转换为显式3D表示来强制3D一致性。与仅依赖隐式2D表示进行视频生成的SV3D不同，高斯溅射显式编码空间和外观属性，通过几何约束实现多视图一致性。这些约束纠正了视图不一致性，确保了稳健的几何一致性。因此，我们的方法同时生成高质量、多视图一致的图像和精确的3D模型，为基于单图像的3D生成提供了可扩展的解决方案，并弥合了2D扩散多样性与3D结构一致性之间的差距。实验结果表明，该方法在多个数据集上实现了最先进的多视图一致性和强泛化能力。代码将在接收后公开。

英文摘要

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

2502.07617 2026-06-02 cs.CV 版本更新

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

将视觉语言模型的预训练扩展到一千亿数据

Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结本文通过实验探究将视觉语言模型预训练数据扩展到一千亿规模的效果，发现传统基准性能饱和，但文化多样性任务和低资源语言受益显著，并指出质量过滤可能减少文化多样性。

Comments v2: CVPR Findings'26

详情

AI中文摘要

我们提供了一个关于将视觉语言模型预训练扩展到前所未有规模——一千亿样本——潜力的实证研究。我们发现，在许多常见的西方中心分类和检索基准（如COCO Captions）上，模型性能在此规模下趋于饱和。然而，文化多样性任务从一千亿规模的网络数据中获得了更实质性的提升，这得益于其对长尾概念的覆盖。此外，我们分析了模型的多语言能力，并展示了在低资源语言上的提升。另外，我们观察到，通过使用如CLIP等质量过滤器减少预训练数据集的大小（通常用于提升性能）可能会无意中减少大规模数据集中所代表的文化多样性。我们的结果强调，虽然传统基准可能不会从将噪声原始网络数据扩展到一千亿样本中显著受益，但这一数据规模对于构建真正包容的多模态系统至关重要。

英文摘要

We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric classification and retrieval benchmarks, such as COCO Captions. Nevertheless, tasks of cultural diversity achieve more substantial gains from the 100-billion scale web data, thanks to its coverage of long-tail concepts. Furthermore, we analyze the model's multilinguality and show gains in low-resource languages as well. In addition, we observe that reducing the size of the pretraining dataset with quality filters such as CLIP, typically used to enhance performance, may inadvertently reduce the cultural diversity represented in large-scale datasets. Our results highlight that while traditional benchmarks may not benefit significantly from scaling noisy, raw web data to 100 billion examples, this data scale is vital for building truly inclusive multimodal systems.

2111.03861 2026-06-02 cs.CV cs.AI cs.LG 版本更新

What augmentations are sensitive to hyper-parameters and why?

哪些数据增强对超参数敏感以及为什么？

Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch

发表机构 * Knowledge Representation Lab Innopolis University（因诺波利斯大学知识表示实验室）； Sorbonne Center for Artificial Intelligence - SCAI Sorbonne University（索邦人工智能中心 - SCAI 索邦大学）

AI总结本研究通过局部代理（LIME）解释和线性回归系数评估不同数据增强对模型超参数的敏感性、一致性和影响，发现某些增强对超参数高度敏感，而另一些则更稳健可靠。

Comments 10 pages, 17 figures

详情

DOI: 10.1007/978-3-031-10461-9_31
Journal ref: Intelligent Computing: Proceedings of the 2022 Computing Conference

AI中文摘要

我们对数据集应用增强以提高预测质量，并使最终模型对噪声数据和领域漂移更具鲁棒性。然而，问题仍然存在：这些增强在不同的超参数下表现如何？在本研究中，我们通过执行局部代理（LIME）解释来评估增强对模型超参数的敏感性、一致性和影响，当不同增强应用于机器学习模型时，解释超参数的影响。我们利用线性回归系数来加权每个增强。我们的研究证明，有些增强对超参数高度敏感，而其他增强则更具鲁棒性和可靠性。

英文摘要

We apply augmentations to our dataset to enhance the quality of our predictions and make our final models more resilient to noisy data and domain drifts. Yet the question remains: how will these augmentations perform with different hyper-parameters? In this study, we evaluate the sensitivity of augmentations with regard to the model's hyper-parameters, along with their consistency and influence, by performing a Local Surrogate (LIME) interpretation of the impact of hyper-parameters when different augmentations are applied to a machine learning model. We use linear regression coefficients to weigh each augmentation. Our research shows that some augmentations are highly sensitive to hyper-parameters, while others are more resilient and reliable.

2501.12178 2026-06-02 cs.CV 版本更新

Visualizing definitional divergence in high-dimensional data by manifold alignment: Application to 3D right ventricular strain computations

通过流形对齐可视化高维数据中的定义差异：应用于3D右心室应变计算

Maxime Di Folco, Gabriel Bernardino, Patrick Clarysse, Nicolas Duchateau

发表机构 * Univ Lyon, Université Claude Bernard Lyon 1, INSA-Lyon, CNRS, Inserm, CREATIS UMR 5220, U1294（里昂大学，克洛德·贝尔纳里昂第一大学，INSA-里昂，CNRS，Inserm，CREATIS UMR 5220，U1294）； Institute of Machine Learning in Biomedical Imaging, Helmholtz Center Munich, Germany（德国亥姆霍兹慕尼黑中心生物医学成像机器学习研究所）； LTCI, Telecom Paris, Institut Polytechnique de Paris（LTCI，电信巴黎，巴黎理工学院）； DTIC, Universitat Pompeu Fabra, Barcelona, Spain（DTIC，庞培法布拉大学，巴塞罗那，西班牙）； Institut Universitaire de France (IUF)（法国大学研究所（IUF））

AI总结提出一种基于表示学习的策略，通过流形对齐匹配不同定义的高维数据，并重建参数图以可视化定义差异，应用于右心室应变分析。

Comments Accepted for publication in IEEE Transactions on Medical Imaging, DOI: 10.1109/TMI.2026.3698240 © 2026 IEEE. Personal use is permitted. For all other uses, permission must be obtained from IEEE

详情

DOI: 10.1109/TMI.2026.3698240

AI中文摘要

医学影像研究通常依赖于每个受试者的单个样本，假设其能代表生理特征。然而，输入描述符定义或计算方式的变化（例如由于科学领域缺乏共识）可能对分析产生关键影响，但在实践中很少被考虑。本文提出一种基于表示学习的原创策略，用于估计反映这种定义差异对先前从医学图像中提取的特定生理描述符影响的参数图。我们将这些生理描述符的不同定义或计算视为不同的高维数据，可能具有异构类型。我们特别关注心肌变形（应变），其定义尚未达成共识。我们首先使用流形对齐来匹配与该描述符不同定义相关的潜在表示。然后，我们在潜在空间中制定合理的分布来表示描述符之间的定义差异，并从中重建高维参数图以可视化这种定义差异。由于缺乏针对该特定临床应用的适当真实数据，我们首先在玩具实验上演示该方法，然后扩展到从3D超声心动图图像序列获得的受试者右心室应变数据的评估，其中右心室内膜表面网格的每个点都有不同类型的应变可用。除了这一说明性应用外，我们的方法具有推广到其他考虑异构高维描述符的人群分析的潜力。

英文摘要

Medical imaging studies often rely on a single sample per subject, assuming it is representative of their physiological traits. However, variations in how input descriptors are defined or computed (e.g. due to a lack of consensus in the scientific field) may have a crucial impact on the analysis, and are hardly considered in practice. In this paper, we propose an original strategy based on representation learning to estimate a parametric map reflecting the impact of such definitional differences on a given physiological descriptor, previously extracted from medical images. We consider the different definitions or computations of such physiological descriptors as different high-dimensional data, potentially of heterogeneous types. We specifically focus on myocardial deformation (strain), for which there is limited agreement on its definition. We first use manifold alignment to match the latent representations associated with the different definitions of this descriptor. Then, we formulate plausible distributions in the latent space to represent definitional divergence across descriptors, from which we reconstruct a high-dimensional parametric map to visualize such definitional divergence. Due to the lack of proper ground truth for this specific clinical application, we first demonstrate this methodology on toy experiments and then expand the evaluation on right ventricular strain data from subjects obtained from 3D echocardiographic image sequences, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Beyond this illustrative application, our methodology has the potential to be generalised to many other population analyses considering heterogeneous high-dimensional descriptors.

2412.10362 2026-06-02 cs.LG cs.CV 版本更新

OP-LoRA: The Blessing of Dimensionality

OP-LoRA：维度的祝福

Piotr Teterwak, Kate Saenko, Bryan A. Plummer, Ser-Nam Lim

发表机构 * Boston University（波士顿大学）； University of Central Florida（中央佛罗里达大学）

AI总结提出OP-LoRA方法，通过额外MLP预测LoRA适配器权重以改善优化，训练后丢弃MLP，在零额外推理成本下提升性能并降低对学习率的敏感性。

详情

AI中文摘要

低秩适配器（LoRA）使得仅用少量参数即可微调大模型。然而，它们常常面临病态的损失景观，导致优化困难。先前的工作通过自定义优化器将适配器更新与全微调梯度对齐来解决这些挑战，但这些方法缺乏适应新适配器架构的灵活性，且计算成本高。我们引入了OP-LoRA，一种新颖的方法，它用额外的MLP预测的权重替换每个LoRA适配器，该MLP在训练后被丢弃。这允许在训练期间临时增加额外参数以改善优化，但比自定义优化器需要更少的墙钟时间，并且在推理时零额外成本，因为MLP被丢弃。关键的是，将OP-LoRA扩展到其他适配器只需修改每个新适配器类型的预测头大小。我们表明，OP-LoRA允许优化自适应地增加或减少步长，从而提高性能并降低对学习率的敏感性。在小型和大型LoRA微调任务中，我们观察到OP-LoRA相对于LoRA及其变体的一致性能提升。我们在图像生成中取得了特别显著的改进，OP-LoRA的CMMD分数相对于LoRA提高了多达15分。这使得OP-LoRA能够在推理参数减半的情况下达到LoRA的性能。

英文摘要

Low-rank adapters (LoRA) enable finetuning of large models with only a small number of parameters. However, they often suffer from an ill-conditioned loss landscape, leading to difficult optimization. Prior work addresses these challenges by aligning adapter updates with full finetuning gradients via custom optimizers, but these methods lack the flexibility to accommodate new adapter architectures and are computationally expensive. We instead introduce OP-LoRA, a novel method which replaces each LoRA adapter with weights predicted by an extra MLP, which is discarded after training. This temporarily allows additional parameters during training to improve optimization, yet requires less wall time than custom optimizers and zero extra cost at inference time because the MLP is discarded. Crucially, extending OP-LoRA to other adapters is as simple as modifying the size of the prediction head for each new adapter type. We show that OP-LoRA allows the optimization to adaptively increase or decrease step size, improving performance and decreasing sensitivity to learning rate. On both small and large-scale LoRA tuning tasks, we observe consistent performance gains of OP-LoRA relative to LoRA and its variants. We achieve especially notable improvements in image generation, with OP-LoRA CMMD scores improving by up to 15 points relative to LoRA. This allows OP-LoRA to achieve the performance of LoRA with half of the inference parameters.
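The structure described above, where an extra MLP predicts the LoRA factors during training and is dropped afterwards, can be sketched as follows. This is a structural illustration only, under assumptions that are not from the paper: the class and function names, the single hidden layer, the learned input vector `z`, and the random initialization are all hypothetical, and no training loop is shown.

```python
import random

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols list of lists."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

class PredictedLoRA:
    """Hypothetical adapter whose low-rank factors are produced by a small MLP."""

    def __init__(self, d_in, d_out, rank, hidden=8, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        n_factor = rank * (d_in + d_out)
        # Trainable pieces: an input vector and two MLP weight matrices.
        self.z = [rng.gauss(0, 1) for _ in range(hidden)]
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(n_factor)]

    def num_training_params(self):
        """Parameters updated during training (larger than the adapter itself)."""
        return len(self.z) + sum(len(r) for r in self.w1) + sum(len(r) for r in self.w2)

    def predict_factors(self):
        """Run the MLP and split its output into A (rank x d_in) and B (d_out x rank)."""
        hidden = [max(0.0, h) for h in matvec(self.w1, self.z)]  # ReLU
        flat = matvec(self.w2, hidden)
        split = self.rank * self.d_in
        a = reshape(flat[:split], self.rank, self.d_in)
        b = reshape(flat[split:], self.d_out, self.rank)
        return a, b

def export_for_inference(adapter):
    """Keep only the predicted factors; the MLP is not needed after training."""
    a, b = adapter.predict_factors()
    n_params = sum(len(row) for row in a) + sum(len(row) for row in b)
    return a, b, n_params
```

For `PredictedLoRA(d_in=4, d_out=3, rank=2)`, the exported factors hold 14 parameters, the same as a plain rank-2 LoRA adapter of that shape, while training touches the larger MLP. This is the sense in which the extra capacity is temporary.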

2411.17790 2026-06-02 cs.CV cs.AI 版本更新

Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors

基于潜在先验的自监督单目内窥镜深度与姿态估计

Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

发表机构 * University of Oxford（牛津大学）； University of Leeds（利兹大学）

AI总结提出一种结合生成潜在库和变分自编码器的自监督框架，通过自然图像深度先验和姿态潜在变量正则化，实现内窥镜复杂场景下的高精度深度与姿态估计。

详情

DOI: 10.1109/TMI.2026.3671423

AI中文摘要

内窥镜中的精确3D映射能够实现胃肠道（GI）内定量、整体的病变表征，这需要可靠的深度和姿态估计。然而，内窥镜系统是单目的，现有依赖合成数据集或复杂模型的方法在具有挑战性的内窥镜条件下往往缺乏泛化能力。我们提出了一种鲁棒的自监督单目深度和姿态估计框架，该框架结合了生成潜在库（Generative Latent Bank）和变分自编码器（VAE）。生成潜在库利用自然图像中的广泛深度场景来调节深度网络，通过潜在特征先验增强深度预测的真实感和鲁棒性。对于姿态估计，我们将其重新构建在VAE框架内，将姿态转换视为潜在变量以正则化尺度、稳定z轴突出性并提高x-y灵敏度。这种双重精炼流程能够实现精确的深度和姿态预测，有效应对胃肠道复杂的纹理和光照。在SimCol和EndoSLAM数据集上的广泛评估证实，我们的框架在内窥镜深度和姿态估计方面优于已发表的自监督方法。

英文摘要

Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

2411.12321 2026-06-02 cs.CV 版本更新

Enhancing Blind Source Separation with Dissociative Principal Component Analysis

利用解离主成分分析增强盲源分离

Muhammad Usman Khalid

发表机构 * College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University（伊斯兰国际大学计算机与信息科学学院）

AI总结提出解离主成分分析（DPCA），通过联合估计主成分和载荷向量并显式建模其相互依赖关系，克服传统稀疏PCA在源重叠时性能下降的问题，在模拟fMRI源恢复、前景背景分离等任务中优于经典sPCA。

Comments 13 pages with 6 figures, this work has not been published before

详情

AI中文摘要

主成分分析（PCA）及其稀疏变体（sPCA）被广泛用作独立成分分析（ICA）的前置步骤，用于盲源分离（BSS）。然而，sPCA通常依赖于一种逐次提取成分并在它们之间施加正交性的缩减策略。当底层源重叠时，这会丢弃ICA所依赖的跨成分结构，从而降低分离效果。本文提出解离PCA（DPCA），它联合估计成分而非通过缩减。DPCA在基于SVD的分解中引入左、右解离矩阵，以显式建模主成分（PC）和载荷向量（LV）之间的相互依赖关系，同时通过稀疏约束保持可解释性。我们开发了三种算法，称为DPCA1a、DPCA1b和DPCA2，采用自适应软阈值与梯度下降和坐标下降相结合，并辅以二次firm阈值（firm thresholding）步骤，以保持稀疏性并抑制恢复的载荷向量中的背景噪声。该方法在四个设置上进行了评估，即模拟fMRI源恢复、前景与背景分离、图像重建和图像修复，在这些设置中，它比基于经典sPCA的流程更可靠地恢复源结构，在显著空间重叠下增益最大。当稀疏参数为零时，DPCA退化为普通PCA。所提出算法的MATLAB实现可在https://github.com/usmankhalid06/DPCA公开获取。

英文摘要

Principal component analysis (PCA) and its sparse variants (sPCA) are widely used as a precursor to independent component analysis (ICA) for blind source separation (BSS). However, sPCA typically relies on a deflation strategy that extracts components sequentially and imposes orthogonality between them. When the underlying sources overlap, this discards the cross-component structure that ICA depends on, degrading separation. This paper proposes dissociative PCA (DPCA), which estimates components jointly rather than by deflation. DPCA introduces left and right dissociation matrices into the SVD-based decomposition to explicitly model the interdependencies among principal components (PCs) and loading vectors (LVs), while sparsity constraints maintain interpretability. We develop three algorithms called DPCA1a, DPCA1b, and DPCA2, using adaptive soft thresholding with gradient and coordinate descent, together with a secondary firm thresholding step that preserves sparsity and suppresses background noise in the recovered loading vectors. The method is evaluated on four settings, namely simulated fMRI source retrieval, foreground and background separation, image reconstruction, and image inpainting, where it recovers source structure more reliably than classical sPCA-based pipelines, with the largest gains under significant spatial overlap. DPCA reduces to ordinary PCA when the sparsity parameter is zero. A MATLAB implementation of the proposed algorithms is publicly available at https://github.com/usmankhalid06/DPCA.
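The two shrinkage operators named above have standard scalar definitions, sketched below for reference. The formulas are the textbook soft and firm thresholding rules; how DPCA chooses its thresholds adaptively is not stated in the abstract, so the parameters `lam`, `lam1`, and `lam2` are placeholders, and this Python sketch is unrelated to the authors' MATLAB code.

```python
import math

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def firm_threshold(x, lam1, lam2):
    """Firm thresholding with 0 <= lam1 < lam2.

    Zero below lam1, identity above lam2, linear interpolation in between,
    so large coefficients are not shrunk as they are by soft thresholding.
    """
    magnitude = abs(x)
    if magnitude <= lam1:
        return 0.0
    if magnitude <= lam2:
        return math.copysign(lam2 * (magnitude - lam1) / (lam2 - lam1), x)
    return x
```

With `lam = 0` the soft threshold is the identity, which is consistent with the statement that the method reduces to ordinary PCA when the sparsity parameter is zero.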

2411.05359 2026-06-02 cs.CV cs.AI cs.CY 版本更新

Agricultural Landscape Understanding At Country-Scale

国家级农业景观理解

Radhika Dua, Aditi Agarwal, Aishwarya Jayagopal, Depanshu Sani, Alex Wilson, Hoang Tran, Ishan Deshpande, Bogdan Floristean, Neelabh Goyal, Ramya Cheruvu, Vishal Batchu, Yan Mayster, Gaurav Aggarwal, Alok Talekar, Vaibhav Rajan

Affiliations: Google DeepMind; Google

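AI summary: Presents the first national-scale agricultural mapping system, which segments agricultural instances such as fields, trees, and water bodies, refines the resulting maps with novel post-processing heuristics, and is deployed and evaluated at national scale.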

Comments 32 pages, 11 tables, 22 figs

详情

AI中文摘要

全面的农业景观理解对于应对粮食安全、气候变化和资源管理等全球挑战至关重要。这不仅需要绘制农田地图，还需要绘制树木和水体等重要特征，这些特征在主导全球南方的复杂小农户系统中形成了错综复杂的镶嵌结构。以往开发此类土地利用地图的努力受到限制，仅专注于田地划界的方法，并且没有开发出实际部署所必需的稳健后处理步骤。此外，据我们所知，之前没有针对小农户农场的系统在国家范围内进行部署和评估。本文通过提出首个国家级农业制图系统来解决这些局限性，该系统超越了简单的田地划界，能够对田地、树木和水体等农业实例进行分割。我们的系统通过新颖的后处理启发式方法进行了优化，以确保地图的一致性和准确性，并通过严格、多方面的评估过程进行了验证。我们系统生成的精细土地利用地图可通过API在http://agri.withgoogle.com公开访问，支持从精准农业和政策制定到推进全球可持续发展目标的各种应用。

英文摘要

Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resource management. This requires mapping not just crop fields, but also vital features like trees and water bodies which form an intricate mosaic in complex smallholder systems dominating the Global South. Previous efforts to develop such land use maps have been limited by a narrow focus on methods for field delineation only, and also do not develop robust post-processing steps essential for real-world deployment. Further, to our knowledge, no prior system for smallholder farms has been deployed and evaluated at a national scale. This work addresses these limitations by presenting the first national-scale agricultural mapping system that moves beyond simple field delineation to enable segmentation of agricultural instances like fields, trees and water bodies. Our system is refined for real-world application using novel post-processing heuristics to ensure map consistency and accuracy, and is validated through a rigorous, multi-faceted evaluation process. Fine-grained land use maps generated by our system are publicly accessible via an API at http://agri.withgoogle.com, enabling a wide range of applications from precision agriculture and policy-making to advancing global sustainability development goals.

2410.21361 2026-06-02 cs.CV cs.LG 版本更新

Domain Adaptation with a Single Vision-Language Embedding

基于单一视觉-语言嵌入的域适应

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

Affiliations: Inria; Kyutai

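AI summary: Proposes a domain adaptation framework that relies on a single vision-language (VL) embedding. Prompt/photo-driven instance normalization (PIN) mines multiple visual styles from that embedding, enabling zero-shot and one-shot unsupervised domain adaptation that outperforms relevant baselines on semantic segmentation.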

Comments International Journal of Computer Vision (IJCV 2026)

详情

AI中文摘要

域适应在计算机视觉中已被广泛研究，但仍需要在训练时访问目标数据，这在现实世界的自动驾驶场景中可能难以获得，尤其是在罕见或恶劣条件下。本文提出了一种新的域适应框架，该框架依赖于单一的视觉-语言（VL）潜在嵌入，而不是完整的目标数据。首先，利用对比语言-图像预训练模型（CLIP），我们提出了提示/照片驱动的实例归一化（PIN）。PIN是一种特征增强方法，通过优化低级源特征的仿射变换，使用单一的目标VL潜在嵌入挖掘多种视觉风格。VL嵌入可以来自描述目标域的语言提示、部分优化的语言提示或单一未标记的目标图像。其次，我们表明这些挖掘的风格（即增强）可用于零样本（即无目标）和单样本无监督域适应。在真实世界驾驶数据集（包括Cityscapes和ACDC（恶劣条件））上的语义分割实验证明了所提出方法的有效性，在实用的零样本和单样本设置中优于相关基线。

英文摘要

Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in real-world autonomous driving scenarios, especially under rare or adverse conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation in real-world driving datasets, including Cityscapes and ACDC (adverse conditions), demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the practical zero-shot and one-shot settings.
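
The abstract describes PIN as optimizing affine transformations of low-level source features. The sketch below illustrates only the affine re-styling step, under the assumption that it takes the familiar instance-normalization form: normalize a channel's activations, then apply a new mean and standard deviation. The function name and parameters are hypothetical, and the optimization of those statistics against the CLIP embedding is not shown.

```python
import math


def instance_norm_affine(channel, new_mean, new_std, eps=1e-5):
    """Normalize one feature channel, then apply an affine transform.

    channel: flat list of activations for a single channel of a single image.
    new_mean, new_std: the style statistics to impose (assumed to be the
    quantities a PIN-like method would optimize).
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    std = math.sqrt(var + eps)  # eps avoids division by zero on flat channels
    return [new_std * (v - mean) / std + new_mean for v in channel]
```

Applying the function to every channel of a low-level feature map with different `(new_mean, new_std)` pairs yields differently styled versions of the same source features, which is the sense in which such statistics act as augmentations.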

2404.13621 2026-06-02 cs.CV cs.LG cs.MM 版本更新

Attack on Scene Flow using Point Clouds

使用点云对场景流进行攻击

Haniyeh Ehsani Oskouie, Mohammad-Shahram Moin, Shohreh Kasaei

Affiliations: Sharif University of Technology; ICT Research Institute

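AI summary: Proposes white-box adversarial attacks tailored to scene flow networks, causing up to 33.7% relative degradation in average end-point error on the KITTI and FlyingThings3D datasets, and shows the impact of attacks restricted to a single dimension or color channel.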

详情

DOI: 10.13140/RG.2.2.29455.19362

AI中文摘要

深度神经网络在使用点云准确估计场景流方面取得了显著进展，这对于视频分析、动作识别和导航等许多应用至关重要。然而，这些技术的鲁棒性仍然令人担忧，特别是在面对已被证明能在许多领域欺骗最先进深度神经网络的对抗攻击时。令人惊讶的是，场景流网络对此类攻击的鲁棒性尚未得到彻底研究。为解决这一问题，本文提出了一种专门针对场景流网络的白盒对抗攻击方法。实验结果表明，生成的对抗样本在KITTI和FlyingThings3D数据集上使平均端点误差相对恶化高达33.7%。研究还揭示了仅针对点云的一个维度或颜色通道的攻击对平均端点误差的显著影响。通过分析这些攻击在场景流网络及其2D光流网络变体上的成功与失败，发现光流网络具有更高的脆弱性。代码可在https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git获取。

英文摘要

Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To bridge this gap, the proposed approach introduces adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7% relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks. Code is available at https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.
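
The headline number in this abstract is a relative degradation in average end-point error (EPE). As a hedged illustration of how such a figure can be computed, the sketch below uses the common definition of EPE as the mean Euclidean distance between predicted and ground-truth flow vectors. The relative-degradation formula is an assumption, since the abstract does not define it, and the function names are hypothetical.

```python
import math


def average_epe(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if len(pred_flow) != len(true_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    total = sum(math.dist(p, t) for p, t in zip(pred_flow, true_flow))
    return total / len(pred_flow)


def relative_degradation(clean_epe, attacked_epe):
    """Assumed definition: fractional increase in EPE caused by the attack."""
    return (attacked_epe - clean_epe) / clean_epe
```

Under this assumed definition, a clean EPE of 2.0 that rises to 3.0 under attack is a relative degradation of 0.5, that is, 50%.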

2012.01494 2026-06-02 cs.CV 版本更新

Braille to Text Translation for Bengali Language: A Geometric Approach

孟加拉语盲文到文本翻译：一种几何方法

Minhas Kamal, Amin Ahsan Ali, Muhammad Asif Hossain Khan, Mohammad Shoyaib

Affiliations: Institute of Information Technology; University of Dhaka

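AI summary: To address the lack of Braille translation tools for Bengali, proposes a Braille-to-text translation method based on image processing and geometric structure analysis, with a reported recognition accuracy of 97.25%.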

Comments GitHub Repo.: https://github.com/MinhasKamal/BrailleToTextTranslator

2404.11326 2026-06-02 cs.CV 版本更新

SCL: Towards Domain Generalization via Single-Temporal Multimodal Contrastive Learning for Remote Sensing Change Detection

SCL：面向遥感变化检测的单时相多模态对比学习域泛化方法

Qiangang Du, Jinlong Peng, Xu Chen, Qingdong He, Liren He, Qiang Nie, Mingmin Chi

Affiliations: Fudan University; Tencent YouTu Lab

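AI summary: Proposes a Single-temporal multimodal Contrastive Learning (SCL) foundation model built on a vision-language pre-trained model, combined with a Dynamic Text-vision Context Optimization (DTCO) module and a controllable generation and Single-temporal trAINing strategy (SAIN), to achieve cross-dataset generalization in remote sensing change detection without training on the target dataset.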

Comments CVPRW 2026

详情

AI中文摘要

近年来，基于CNN和Transformer的变化检测与异常检测模型在基于配对数据的多个数据集上取得了显著成功。然而，由于领域特定的设计，大多数此类方法表现出有限的跨数据集泛化能力，并且通常依赖于大量配对的标注数据。本文基于视觉-语言预训练模型，引入了一种单时相多模态对比学习（SCL）基础模型，用于变化检测，无需在目标数据集上进行训练。为了进一步提高模型学习文本和视觉信息上下文的能力，我们提出了一种动态文本-视觉上下文优化（DTCO）模块用于提示学习。同时，为了解决现有方法的数据依赖性问题，我们引入了一种可控生成和单时相训练策略（SAIN）。这使得我们能够利用大量现有的单时相图像训练模型，而无需配对标签。在各种真实世界变化检测数据集上的大量实验表明，SCL具有优越的性能和泛化能力，在评估设置下优于最先进的方法。代码可在https://github.com/Kane-Du/scl-cd.git获取。

英文摘要

In recent years, change detection and anomaly detection models based on CNNs and transformers have achieved remarkable success across various datasets based on paired data. However, most such methods exhibit limited cross-dataset generalization due to domain-specific designs and typically rely on large amounts of paired labeled data. In this paper, building on a vision-language pre-trained model, we introduce a Single-temporal multimodal Contrastive Learning (SCL) foundation model for change detection without training on the target dataset. To further improve the model's ability to learn the context of textual and visual information, we propose a Dynamic Text-vision Context Optimization (DTCO) module for prompt learning. Meanwhile, to address the data dependency issue of existing methods, we introduce a controllable generation and Single-temporal trAINing strategy (SAIN). This allows us to train the model using a large number of existing single-temporal images without the need for paired labels. Extensive experiments on various real-world change detection datasets demonstrate the superior performance and generalization of SCL, which outperforms state-of-the-art methods under the evaluated settings. Code is available at https://github.com/Kane-Du/scl-cd.git.

2307.06647 2026-06-02 cs.RO cs.AI cs.CV 版本更新

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

DeepIPCv2: 基于LiDAR的鲁棒环境感知与自动驾驶导航控制

Oskar Natan, Jun Miura

Affiliations: Department of Computer Science and Electronics, Universitas Gadjah Mada; Department of Computer Science and Engineering, Toyohashi University of Technology

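AI summary: Proposes DeepIPCv2, an end-to-end autonomous driving framework that builds robust scene representations from LiDAR point cloud segmentation and multi-view projection, and combines gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate waypoints and navigational control commands, achieving the lowest total metric error and the fewest driving interventions under varying illumination.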

Comments This work has been accepted for publication in IEEE Access. https://ieeexplore.ieee.org/document/11313052

详情

DOI: 10.1109/ACCESS.2025.3647530

AI中文摘要

我们提出DeepIPCv2，一个端到端的自动驾驶框架，它集成了基于LiDAR的环境感知与命令特定的控制学习。与先前依赖摄像头的模型不同，DeepIPCv2采用点云分割和多视图投影来构建鲁棒的场景表示。这些特征通过门控循环单元、命令特定的多层感知器和PID控制器的组合进行融合和解码，以估计路径点和导航控制命令。这种设计增强了机动性并解决了驾驶数据集中的动作不平衡问题。为了验证模型，我们构建了一个覆盖不同光照条件的数据集，并进行了消融研究和与包括TransFuser在内的最新方法的对比测试。结果表明，DeepIPCv2实现了最低的总指标误差和最少的驾驶干预，突显了其对光照变化的鲁棒性和改进的控制精度。通过稍后在https://github.com/oskarnatan/DeepIPCv2发布代码，我们旨在支持端到端自动驾驶研究的可重复性和未来进展。

英文摘要

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific control learning. Unlike prior camera-reliant models, DeepIPCv2 employs point cloud segmentation and multi-view projection to construct robust scene representations. These features are fused and decoded through a combination of gated recurrent units, command-specific multi-layer perceptrons, and PID controllers to estimate both waypoints and navigational control commands. This design enhances maneuverability and addresses action imbalance in driving datasets. To validate the model, we constructed a dataset covering diverse illumination conditions and conducted ablation studies and comparative tests against recent methods, including TransFuser. Results demonstrate that DeepIPCv2 achieves the lowest total metric error and the fewest driving interventions, highlighting both its robustness to illumination changes and its improved control accuracy. We plan to release the code at https://github.com/oskarnatan/DeepIPCv2 to support reproducibility and future advancements in end-to-end autonomous driving research.

2310.15676 2026-06-02 cs.CV cs.AI 版本更新

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

多模态3D智能的最新进展：综合调查与评估

Yinjie Lei, Zixuan Wang, Feng Chen, Guoqing Wang, Peng Wang, Yang Yang

Affiliations: College of Electronics and Information Engineering, Sichuan University; School of Computer Science, University of Adelaide; School of Computer Science and Engineering, University of Electronic Science and Technology of China

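AI summary: Systematically surveys multi-modal 3D intelligence methods, proposes a new taxonomy organized by modality and task, compares results on benchmark datasets, and discusses directions for future research.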

详情

AI中文摘要

多模态3D智能因其在自动驾驶和世界模拟等领域的广泛应用而受到广泛关注。与传统的单模态3D理解相比，引入额外模态不仅提升了场景解释的丰富性和精确性，还为更高层次的物理世界交互奠定了基础。在仅依赖3D数据可能不足的多样化和挑战性环境中，这一点变得尤为关键。尽管过去六年中多模态3D方法的发展激增，特别是那些整合多相机图像（3D+2D）和文本描述（3D+语言）的方法，但缺乏全面深入的综述。在本文中，我们通过系统调查最新进展来弥补这一空白。我们首先简要总结了各种3D多模态任务中的独特挑战。之后，我们提出了一种新的分类法，根据模态和任务对现有方法进行彻底分类，探讨它们各自的优势和局限性。此外，我们提供了近期方法在几个基准数据集上的比较结果及深入分析。最后，我们讨论了未解决的问题，并提出了未来研究的几个潜在方向。

英文摘要

Multi-modal 3D intelligence has gained considerable attention due to its wide applications in areas such as autonomous driving and world simulation. Compared to conventional single-modal 3D understanding, introducing an additional modality not only elevates the richness and precision of scene interpretation but also provides a foundation for higher-level physical world interaction. This becomes especially crucial in varied and challenging environments where relying solely on 3D data might be inadequate. While there has been a surge in the development of multi-modal 3D methods over the past six years, especially those integrating multi-camera images (3D+2D) and textual descriptions (3D+language), a comprehensive and in-depth review is notably absent. In this paper, we present a systematic survey of recent progress to bridge this gap. We begin by briefly summarizing the unique challenges among various 3D multi-modal tasks. After that, we present a novel taxonomy that delivers a thorough categorization of existing methods according to modalities and tasks, exploring their respective strengths and limitations. Furthermore, comparative results of recent approaches on several benchmark datasets, together with insightful analysis, are offered. Finally, we discuss the unresolved issues and provide several potential avenues for future research.

2208.00967 2026-06-02 cs.CV 版本更新

Counterfactual Intervention Feature Transfer for Visible-Infrared Person Re-identification

反事实干预特征迁移用于可见光-红外行人重识别

Xulin Li, Yan Lu, Bin Liu, Yating Liu, Guojun Yin, Qi Chu, Jinyang Huang, Feng Zhu, Rui Zhao, Nenghai Yu

Affiliations: School of Information Science and Technology, University of Science and Technology of China; Key Laboratory of Electromagnetic Space Information, Chinese Academy of Science; School of Data Science, University of Science and Technology of China; SenseTime Research; Qing Yuan Research Institute, Shanghai Jiao Tong University

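AI summary: To address the poor generalization of graph-based models in visible-infrared person re-identification, proposes Counterfactual Intervention Feature Transfer (CIFT): Homogeneous and Heterogeneous Feature Transfer (H2FT) reduces the train-test modality balance gap, and Counterfactual Relation Intervention (CRI) makes the graph topology structure more reliable.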

Comments Accepted by ECCV 2022

详情

AI中文摘要

基于图模型的方法最近在行人重识别任务中取得了巨大成功，该方法首先计算不同行人之间的图拓扑结构（亲和度），然后跨行人传递信息以获得更强的特征。但我们发现，现有的基于图模型的方法在可见光-红外行人重识别任务（VI-ReID）中存在泛化性差的问题，原因有二：1）训练-测试模态平衡差距，这是VI-ReID任务的一个特性。训练阶段两种模态的数据量是平衡的，但在推理时极度不平衡，导致基于图的VI-ReID方法泛化性低。2）图模块的端到端学习方式导致次优的拓扑结构。我们分析认为，训练良好的输入特征削弱了图拓扑的学习，使其在推理过程中不够泛化。在本文中，我们提出了一种反事实干预特征迁移（CIFT）方法来解决这些问题。具体而言，设计了同质与异质特征迁移（H2FT），通过两种独立设计的图模块和不平衡场景模拟来减少训练-测试模态平衡差距。此外，提出了反事实关系干预（CRI），利用反事实干预和因果效应工具来突出拓扑结构在整个训练过程中的作用，使图拓扑结构更加可靠。在标准VI-ReID基准上的大量实验表明，CIFT在各种设置下均优于最先进的方法。

英文摘要

Graph-based models have recently achieved great success in person re-identification: they first compute the graph topology structure (affinities) among different people and then pass information across them to obtain stronger features. However, we find that existing graph-based methods generalize poorly in the visible-infrared person re-identification (VI-ReID) task because of two issues: 1) the train-test modality balance gap, which is a property of the VI-ReID task. The amounts of data from the two modalities are balanced in the training stage but extremely unbalanced at inference, which lowers the generalization of graph-based VI-ReID methods. 2) a sub-optimal topology structure caused by end-to-end learning of the graph module. Our analysis suggests that well-trained input features weaken the learning of graph topology, so that it does not generalize well enough at inference. In this paper, we propose a Counterfactual Intervention Feature Transfer (CIFT) method to tackle these problems. Specifically, a Homogeneous and Heterogeneous Feature Transfer (H2FT) is designed to reduce the train-test modality balance gap through two independent types of well-designed graph modules and an unbalanced scenario simulation. In addition, a Counterfactual Relation Intervention (CRI) is proposed that uses counterfactual intervention and causal effect tools to highlight the role of the topology structure throughout training, which makes the graph topology structure more reliable. Extensive experiments on standard VI-ReID benchmarks demonstrate that CIFT outperforms state-of-the-art methods under various settings.

2203.03768 2026-06-02 cs.CV 版本更新

CrowdFormer: Weakly-supervised Crowd counting with Improved Generalizability

CrowdFormer: 改进泛化性的弱监督人群计数

Siddharth Singh Savner, Vivek Kanhangad

Affiliations: Department of Electrical Engineering, Indian Institute of Technology Indore, India

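AI summary: Proposes a weakly-supervised crowd counting method based on a pyramid vision transformer that models global context, performs comparably to state-of-the-art methods, and shows remarkable generalizability.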

详情

DOI: 10.1016/j.jvcir.2023.103853
Journal ref: Journal of Visual Communication and Image Representation, vol. 94, article 103853, 2023

AI中文摘要

卷积神经网络（CNN）由于其强大的局部特征学习能力，在计算机视觉领域主导了近十年。然而，由于感受野有限，CNN无法建模全局上下文。另一方面，基于注意力的变换器可以轻松建模全局上下文。尽管如此，目前关于变换器在人群计数中有效性的研究仍然有限。此外，现有的大多数人群计数方法基于密度图回归，这需要对场景中每个人进行点级标注。这种标注任务既费力又容易出错。这导致了对仅需要计数级标注的弱监督人群计数方法的关注增加。在本文中，我们提出了一种使用金字塔视觉变换器的弱监督人群计数方法。我们进行了广泛评估以验证所提出方法的有效性。我们的方法在基准人群数据集上与最先进方法相当。更重要的是，它表现出显著的泛化性。

英文摘要

Convolutional neural networks (CNNs) have dominated the field of computer vision for nearly a decade due to their strong ability to learn local features. However, due to their limited receptive field, CNNs fail to model the global context. On the other hand, the transformer, an attention-based architecture, can model the global context easily. Despite this, few studies have investigated the effectiveness of transformers in crowd counting. In addition, the majority of existing crowd counting methods are based on the regression of density maps, which requires point-level annotation of each person present in the scene. This annotation task is laborious and error-prone. This has led to increased focus on weakly-supervised crowd counting methods, which require only count-level annotations. In this paper, we propose a weakly-supervised method for crowd counting using a pyramid vision transformer. We have conducted extensive evaluations to validate the effectiveness of the proposed method. Our method is comparable to the state-of-the-art on the benchmark crowd datasets. More importantly, it shows remarkable generalizability.
