arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.26884 2026-05-27 cs.CV 版本更新

Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation

工业回收中的小目标检测:新数据集与YOLO性能评估

Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickael Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet

Affiliations: Mines Saint-Etienne, CNRS, UMR 5307 LGF

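AI summary: To address the difficulty of detecting small, dense, and overlapping objects in industrial recycling, the paper introduces a new dataset and compares supervised deep-learning detectors, evaluating the performance, accuracy, and computational efficiency of YOLO and other systems, and explores the benefits of data augmentation and synthetic images.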

Journal ref
Journal of Electronic Imaging 2026
AI中文摘要

本文解决了检测小、密集和重叠目标的问题,这是计算机视觉中的一个主要挑战。我们重点回顾了基于深度学习监督方法提出的系统,并在一个包含超过1万张图像和12万个实例的新数据集上对这些系统进行了详细比较,突出了它们在工业回收流程用例中的性能、准确性和计算效率。通过这种比较分析,我们确定了当前最可靠的系统及其设计要解决的具体挑战。此外,我们探讨了数据增强和合成图像的好处。基于我们的分析,我们还提出了潜在的未来方向和创新解决方案,这些方案可以增强小、密集和重叠目标检测系统的有效性。我们的研究范围涵盖回收流程中的目标检测、长度测量和异常检测。异常检测策略对图像分辨率和缩放级别的变化具有鲁棒性,确保在工业应用中的可靠性能。所提出的数据集、方法和评估代码的仓库可在以下网址找到:https://github.com/o-messai/SDOOD

英文摘要

In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on supervised deep learning. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of detection systems for small, dense, and overlapping objects. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods, and evaluation code can be found at: https://github.com/o-messai/SDOOD

2605.26601 2026-05-27 cs.CV 版本更新

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

FTibSuite:面向藏语视觉语言建模的综合资源套件

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han

Affiliations: Hainan International College, Minzu University of China; School of Information Engineering, Minzu University of China; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences

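AI summary: To address the lack of reproducible training and evaluation infrastructure for Tibetan vision-language modeling, the paper proposes the FTibSuite resource suite, comprising the FTibData dataset, the FTibBench benchmark, and the FTibVLM baseline model, and reports substantial gains across multiple tasks.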

AI中文摘要

视觉语言模型取得了快速进展,但藏语由于缺乏可复现的训练和评估基础设施,仍然是一种严重服务不足的低资源语言。为填补这一空白,我们引入了FTibSuite,一个面向藏语视觉语言研究的综合资源套件,包括FTibData(人工验证的多模态训练语料库,涵盖持续预训练、图像-文本对齐和指令调优数据)、FTibBench(五个主流多模态基准的藏语改编版本,采用分层质量控制流程以减少翻译噪声)以及FTibVLM(基于Qwen3-VL-8B-Instruct通过三阶段适应流程构建的可复现基线)。在FTibBench上的实验表明,FTibVLM在所有任务上均取得一致的性能提升,例如将MMBench准确率从42.97提高到67.78,POPE-random准确率从47.53提高到80.56,同时保持了骨干模型原有的中文能力且退化最小,为藏语多模态研究提供了首个标准化基础。

英文摘要

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

2605.25046 2026-05-27 cs.CV cs.AI 版本更新

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

TinyFormer: 在YOLO-DETR混合实时检测器中保留小目标

Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng

Affiliations: College of Artificial Intelligence, National Yang Ming Chiao Tung University; Mitsubishi Electric Research Laboratories

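AI summary: Proposes TinyFormer, a hybrid detector that preserves shallow high-resolution features through a Parallel Bi-fusion Module (PBM) and uses a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization, improving small-object detection accuracy on MS COCO.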

AI中文摘要

YOLO系列和基于DETR的检测器在小目标检测方面存在困难。YOLO风格的模型受益于高效的密集预测,但其大步长骨干网络可能会抑制深层特征图中的小目标实例,并使网格分配变得模糊。基于DETR的模型通过集合预测去除了手工设计的后处理,但它们在粗粒度标记网格上进行推理,其中小目标仅占据少数弱标记,在匹配过程中容易被忽略。为了解决这些局限性,我们提出了TinyFormer,一种统一的YOLO-DETR混合实时检测器,它结合了ViT表示、无NMS的集合预测和YOLO风格的金字塔颈部,以实现准确的小目标检测。TinyFormer引入了并行双融合模块(PBM),该模块从浅层阶段构建高分辨率捷径到特征金字塔,在多尺度融合过程中保留精细的空间细节。我们进一步设计了空间语义适配器(SSA)来补偿粗粒度标记化导致的空间损失。SSA从早期阶段提取高分辨率线索并将其注入Transformer标记嵌入中,从而在不牺牲DETR全局建模能力的情况下改进小目标定位。在MS COCO上的实验表明,TinyFormer持续优于最近的YOLO系列检测器和强大的DEIMv2基线。即使没有PBM,TinyFormer-X也达到了58.4%的AP,而添加PBM将整体AP提高到58.5%,并在小目标上带来了1.6%的AP增益。使用Objects365预训练,TinyFormer-X-PBM达到了60.2%的AP,以更少的参数和更低的计算量超越了RF-DETR和其他Objects365预训练的检测器。这些结果表明,TinyFormer弥合了密集的YOLO风格特征融合和DETR风格集合预测之间的差距,为实时小目标检测提供了强大的精度-效率权衡。代码可在https://github.com/mmpmmpmmpjosh/TinyFormer获取。

英文摘要

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.

2605.27372 2026-05-27 cs.CV 版本更新

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

G3T 崛起!重力对齐的坐标框架简化点图处理

Bharath Raj Nagoor Kani, Noah Snavely

Affiliations: Cornell University

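AI summary: Proposes G3T, a model that predicts gravity-aligned pointmaps instead of camera-centric ones, using scene-structure priors to reduce rotational degrees of freedom and improve 3D reconstruction accuracy.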

Comments Project Page: https://g3t-paper.github.io/

AI中文摘要

现代前馈3D重建方法(如VGGT)在相机中心坐标框架中预测像素对齐的点图。然而,这种坐标框架的选择并非总是最优。我们提出改为在直立、重力对齐的框架中预测点图,该框架利用许多真实场景中存在的强结构线索。与相机中心框架不同,重力对齐框架在视点之间共享共同的垂直轴,减少了关联点图所需的旋转自由度。为此,我们引入了重力接地几何变换器(G3T),该模型从现有模型在重力对齐的3D数据上进行微调。G3T生成高度准确的重力感知预测,包括直立点图和相机到重力姿态。我们进一步介绍了G3T-Long,一种基于子图的增量式3D重建流程,该流程利用直立框架提供的减少的旋转自由度,实现了显著提高的重建精度。

英文摘要

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
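
The reduction in rotational degrees of freedom described above can be illustrated with a small sketch. It assumes a z-up convention, and `yaw_rotation` and `apply` are hypothetical helpers that do not come from the paper: when two frames share the vertical axis, the rotation relating them is described by a single yaw angle rather than a general three-parameter rotation.

```python
import math

def yaw_rotation(theta):
    """Rotation about the shared vertical (z) axis by theta radians.

    Assumption: z is the gravity-aligned 'up' axis; the paper may use
    a different convention.
    """
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply(rotation, vector):
    """Multiply a 3x3 rotation matrix by a 3-vector."""
    return [sum(r * v for r, v in zip(row, vector)) for row in rotation]

# The vertical axis is left unchanged by any yaw rotation.
rotated_up = apply(yaw_rotation(0.7), [0.0, 0.0, 1.0])

# A quarter turn maps the x axis onto the y axis (up to rounding).
rotated_x = apply(yaw_rotation(math.pi / 2), [1.0, 0.0, 0.0])
```

Relating two upright pointmaps under this assumption means estimating one angle plus a translation, which is the kind of simplification the abstract attributes to gravity-aligned frames.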

2605.27343 2026-05-27 cs.CV cs.LG 版本更新

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

通过表示条件扩散模型实现可控图像生成

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

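AI summary: Proposes conditioning diffusion models on representations from a pre-trained self-supervised model, enabling controllable image generation without extensive annotation, and explores the smoothness and disentanglement of the representation space.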

AI中文摘要

扩散模型已成为高质量图像生成和编辑的强大工具,但引导这些模型产生特定输出仍然是一个挑战。传统方法依赖于条件机制,如文本提示或语义图,这些需要大量标注的数据集。在这项初步工作中,我们探索了以预训练自监督模型的表示为条件的扩散模型。自条件机制不仅提高了无条件图像生成的质量,还提供了一个可用于控制生成的表示空间。我们通过识别变化方向来探索这个条件空间,并展示了在平滑性和分离性方面的有前景的特性。

英文摘要

Diffusion models have emerged as powerful tools for high-quality image generation and editing, but guiding these models to produce specific outputs remains a challenge. Conventional approaches rely on conditioning mechanisms, such as text prompts or semantic maps, which require extensively annotated datasets. In this preliminary work, we explore diffusion models conditioned on representations from a pre-trained self-supervised model. The self-conditioning mechanism not only improves the quality of unconditional image generation, but also provides a representation space that can be used to control the generation. We explore this conditioning space by identifying directions of variations, and demonstrate promising properties in terms of smoothness and disentanglement.

2605.27336 2026-05-27 cs.CV 版本更新

PARE: Pruning and Adaptive Routing for Efficient Video Generation

PARE:面向高效视频生成的剪枝与自适应路由

Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

Affiliations: The University of Sydney; Shanghai AI Laboratory; The Chinese University of Hong Kong

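AI summary: Proposes PARE, which compresses width through structure-aware pruning and depth through input-adaptive routing to jointly reduce the compute of video diffusion Transformers, substantially lowering per-step computation on Wan2.1-14B while preserving quality.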

AI中文摘要

视频扩散Transformer(DiTs)能生成高质量视频,但由于宽块、深架构和迭代采样,需要大量计算。近期方法通过压缩宽度、深度或采样步数来降低成本,但通常采用固定架构,无法适应单个输入或去噪阶段。我们提出PARE(面向高效视频生成的剪枝与自适应路由),通过结构感知剪枝和输入自适应路由联合压缩宽度和深度。对于宽度,我们观察到注意力头分化为空间和时间角色,并设计考虑这种区分的重分评分,以防止运动关键的时间头被过早剪枝。对于深度,我们训练一个轻量级路由器,以去噪时间步和视觉内容为条件,动态选择每个步骤执行哪些块,实现每个输入的计算自适应,而非静态移除块。一个渐进式流程首先通过蒸馏恢复宽度剪枝的质量,然后联合优化学生和路由器以解耦两个学习目标。在Wan2.1-14B上的图像到视频和文本到视频生成实验表明,PARE在VBench各维度上显著减少每步计算同时保持质量,并与步蒸馏结合实现进一步加速。

英文摘要

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.

2605.27332 2026-05-27 cs.SE cs.AI cs.CV 版本更新

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

EdgeFlow: 基于边缘图增强的VLM流程图处理用于工业需求工程

Zhifei Dou, Shabnam Hassani, Ou Wei

Affiliations: Huawei Research Canada

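AI summary: Proposes EdgeFlow, which adds a Canny edge map to the vision-language model (VLM) input as a structural prior, improving flowchart-to-Mermaid conversion without training data or fine-tuning; on an industrial dataset, node F1 improves by 17.39 percentage points and edge F1 by 16.94 percentage points.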

Comments 10 pages

AI中文摘要

流程图广泛应用于工业需求中,但通常以静态图像形式嵌入。视觉语言模型(VLM)在将这些流程图转换为机器可读模型以支持需求工程活动方面显示出潜力,然而,当直接应用于流程图转换时,它们常常在拓扑关键视觉细节上失败。为了解决这个问题,我们提出了EdgeFlow,它通过向VLM的原始输入添加确定性提取的Canny边缘图——作为结构先验——来改进流程图到Mermaid的转换,无需标注训练数据或领域特定的模型微调。我们在IndusReqFlow(一个来自真实世界需求的数据集)上评估了EdgeFlow。与现成的VLM相比,EdgeFlow将节点级F1提高了17.39个百分点,边级F1提高了16.94个百分点。在路径级别,EdgeFlow将路径F1提高了11.06个百分点,从而更好地支持基于模型的测试。这些结果表明,EdgeFlow提供了一种实用的、无需训练的方法,用于改进工业需求工程中保持拓扑结构的流程图到Mermaid转换。在公共合成基准上的跨数据集评估结果显示没有显著改进;这凸显了需要包含工业数据的多样化基准,以全面评估未来基于VLM的需求工程工具。

英文摘要

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in the conversion of these flowcharts into machine-readable models for requirements engineering (RE) activities, yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.
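
Node-level and edge-level F1 can be read as set overlap between a predicted graph and a reference graph. The sketch below is an assumption about how such scores might be computed, not the paper's evaluation code: `parse_edges`, `nodes_of`, and `set_f1` are hypothetical helpers, and the parser handles only plain `A --> B` lines of Mermaid.

```python
import re

# Assumption: only unlabeled edges of the form "A --> B" are parsed.
EDGE_RE = re.compile(r"^\s*(\w+)\s*-->\s*(\w+)\s*$")

def parse_edges(mermaid_text):
    """Collect (source, target) pairs from simple Mermaid edge lines."""
    edges = set()
    for line in mermaid_text.splitlines():
        match = EDGE_RE.match(line)
        if match:
            edges.add((match.group(1), match.group(2)))
    return edges

def nodes_of(edges):
    """All node identifiers that appear in at least one edge."""
    return {node for edge in edges for node in edge}

def set_f1(predicted, gold):
    """F1 between two sets of hashable items (nodes or edges)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_edges = parse_edges("flowchart TD\nA --> B\nB --> C\nB --> D")
pred_edges = parse_edges("flowchart TD\nA --> B\nB --> C\nC --> D")

node_f1 = set_f1(nodes_of(pred_edges), nodes_of(gold_edges))
edge_f1 = set_f1(pred_edges, gold_edges)
```

In this toy case every node is recovered but one edge is attached to the wrong source, so the node score is perfect while the edge score is not. That gap is why topology-sensitive metrics are reported separately from node-level ones.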

2605.27318 2026-05-27 cs.CV 版本更新

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Q-GeoMem:面向视频空间推理的问题引导几何记忆

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

Affiliations: University of Science and Technology of China; Shanghai AI Lab; Northwestern Polytechnical University; TeleAI

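AI summary: Proposes the Q-GeoMem framework, a question-guided geometric memory mechanism that combines a Fine-Grained Context Bank with a Semantic-Geometric Evidence Bank and achieves state-of-the-art performance on video spatial reasoning tasks.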

AI中文摘要

视频空间推理需要在时间上累积依赖于视角的证据,同时保留对回答问题有用的信息。现有的空间视频语言模型改进了几何感知和长程上下文建模,但通常将记忆视为通用时间缓存,这可能引入冗余或无关的几何信息,削弱长程推理能力。我们提出Q-GeoMem,一种用于视频空间推理的问题引导几何记忆框架。Q-GeoMem将相机条件几何注入视觉标记,并维护两种互补记忆:用于近期密集特征和相机状态的细粒度上下文库,以及用于紧凑长程证据的语义几何证据库。每个候选帧通过Q-Former基于的问题相关性与相对于已保留库的新颖性的乘积进行评分;该分数在读取时存储并重用,同时基于容量的替换规则保持库紧凑。在推理过程中,两种记忆在更新前被读取,并与当前帧表示自适应融合。在VSI-Bench和VSTI-Bench上的实验表明,Q-GeoMem在评估的空间推理模型中达到了最先进的性能,验证了问题引导几何记忆的有效性。消融实验进一步验证了所提出的证据评分机制的贡献。

英文摘要

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
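
The scoring and replacement rule described above can be sketched in a few lines. Several pieces here are assumptions rather than details from the abstract: relevance is passed in as a plain number (the paper derives it from a Q-Former), novelty is taken as one minus the highest cosine similarity to stored features, and the capacity rule evicts the lowest-scoring entry. `EvidenceBank` and `cosine` are hypothetical names.

```python
from dataclasses import dataclass, field

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

@dataclass
class EvidenceBank:
    capacity: int
    entries: list = field(default_factory=list)  # (score, feature) pairs

    def novelty(self, feature):
        # Assumption: novelty = 1 - highest similarity to anything stored.
        if not self.entries:
            return 1.0
        best = max(cosine(feature, stored) for _, stored in self.entries)
        return 1.0 - max(0.0, best)

    def offer(self, feature, relevance):
        """Score a candidate frame and store it if it earns a slot."""
        score = relevance * self.novelty(feature)
        if len(self.entries) < self.capacity:
            self.entries.append((score, feature))
            return True
        # Assumption: the weakest stored entry is the replacement victim.
        weakest = min(range(len(self.entries)),
                      key=lambda i: self.entries[i][0])
        if score > self.entries[weakest][0]:
            self.entries[weakest] = (score, feature)
            return True
        return False

bank = EvidenceBank(capacity=2)
kept_first = bank.offer([1.0, 0.0], relevance=0.9)
kept_second = bank.offer([0.0, 1.0], relevance=0.5)
kept_duplicate = bank.offer([1.0, 0.0], relevance=1.0)   # relevant but redundant
kept_novel = bank.offer([-1.0, 0.0], relevance=0.8)      # novel and relevant
```

Because the score is a product, a frame that is highly relevant but redundant with stored evidence scores zero and is rejected, while a novel frame with moderate relevance can displace a weaker entry.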

2605.27311 2026-05-27 cs.CL cs.CV 版本更新

Chartographer: Counterfactual Chart Generation for Evaluating Vision-Language Models

Chartographer: 用于评估视觉语言模型的反事实图表生成

Yifan Jiang, Dae Yon Hwang, Jesse C. Cresswell, Freda Shi

Affiliations: University of Waterloo; Vector Institute; Layer 6 AI

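AI summary: Proposes the Chartographer framework, which reverse engineers charts into executable code and generates counterfactual variants, exposing visual-reasoning failures of vision-language models in chart question answering.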

AI中文摘要

图表问答(QA)基准旨在提出需要视觉推理才能正确回答的问题,但模型通常可以通过捷径或基于自身背景知识对图表的先验熟悉度来达到解决方案。为了严格评估视觉推理,我们提出了反事实图表,其中图表问答任务保持不变,但底层图表和相应答案发生变化。我们引入了 Chartographer,一个将图表逆向工程为可执行代码、验证重构保真度、生成种子控制的反事实变体以及从可执行问答逻辑中推导新答案的框架。我们将该框架应用于现有的图表 QA 数据集,并评估了专有和开源视觉语言模型(VLM),测量了变化敏感性和泛化能力。反事实图表揭示了单一图表性能所隐藏的失败:VLM 在正确回答原始图表后通常无法泛化。我们发现,当更新后的图表需要新的视觉推理路径时,失败最为普遍。

英文摘要

Chart question-answering (QA) benchmarks aim to pose questions that require visual reasoning to answer correctly, but models can often reach solutions through shortcuts or prior familiarity with a chart based on their own background knowledge. To strictly evaluate visual reasoning, we propose counterfactual charts where the chart-question task remains fixed, but the underlying chart and the corresponding answer are varied. We introduce Chartographer, a framework to reverse engineer charts into executable code, validate reconstruction fidelity, generate seed-controlled counterfactual variants, and derive new answers from executable QA logic. We apply this framework to existing chart QA datasets and evaluate proprietary and open-source vision-language models (VLMs), measuring variation sensitivity and generalizability. Counterfactual charts reveal failures hidden by single-chart performance: VLMs often fail to generalize after answering the original chart correctly. We find failures are most prevalent when updated charts require novel visual reasoning pathways.
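
The idea of seed-controlled variants with answers derived from executable QA logic can be shown with a toy example. Everything named here is hypothetical (`make_chart`, `answer_tallest_bar`, `consistency`, the four categories, the value range) and stands in for the framework's real chart code: the question stays fixed while the data and the correct answer change with the seed.

```python
import random

CATEGORIES = ["A", "B", "C", "D"]  # hypothetical bar labels

def make_chart(seed):
    """Generate one chart's data deterministically from a seed."""
    rng = random.Random(seed)
    return {name: rng.randint(1, 100) for name in CATEGORIES}

def answer_tallest_bar(chart):
    """Executable QA logic for the fixed question 'which bar is tallest?'.

    Ties resolve to the first category in insertion order.
    """
    return max(chart, key=chart.get)

def consistency(model, seeds):
    """Fraction of counterfactual variants a model answers correctly."""
    charts = [make_chart(seed) for seed in seeds]
    correct = sum(model(chart) == answer_tallest_bar(chart) for chart in charts)
    return correct / len(charts)

def memorizing_model(chart):
    """A stand-in model that ignores the chart and repeats one answer."""
    return "A"

oracle_score = consistency(answer_tallest_bar, range(10))
shortcut_score = consistency(memorizing_model, range(10))
```

A model that actually reads each chart scores 1.0 by construction, whereas a model that repeats a remembered answer is only right when the variant happens to agree with it. That difference is what single-chart accuracy cannot show.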

2605.27310 2026-05-27 cs.CV 版本更新

How and What to Imagine? Visual Thinking in Unified Multimodal Models for Cross-View Spatial Reasoning

如何以及想象什么?统一多模态模型中的视觉思维用于跨视角空间推理

Qian Yang, Ankur Sikarwar, Huy Le, Le Zhang, Zhuan Shi, Perouz Taslakian, Aishwarya Agrawal

Affiliations: Mila - Québec AI Institute; Université de Montréal; McGill University; ServiceNow AI Research; Canada CIFAR AI Chair

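AI summary: Proposes View Dropout, a training strategy that makes the model use intermediate thinking images for cross-view spatial reasoning, and finds that panoramic visual thinking offers the best combination of informativeness and learnability.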

Comments Preprint

AI中文摘要

跨视角空间推理仍然是视觉语言模型(VLM)的薄弱环节:它们通常用语言推理,丢失了任务所需的细粒度几何信息。用图像思考旨在通过生成中间思维图像来解决这一问题,但近期工作表明模型常常忽略这些轨迹中的视觉证据。因此,我们提出如何让视觉思维起作用,以及哪种视觉思维效果最好。我们在统一多模态模型(UMMs)中研究这些问题,这类模型原生支持交错的图像-文本生成。对于第一个问题,我们提出视图丢弃(VDrop),一种训练时干预方法,将输入视图的部分内容从答案跨度中隐藏,同时使其对思维图像令牌可见。这鼓励模型在回答时使用思维图像,而不是仅依赖输入视图。一旦思维图像用于答案预测,我们研究哪种类型的视觉思维最有效。我们将其表述为可学习性-信息性权衡,并比较三种思维图像变体:俯视图、全景图和点匹配渲染图。在合成场景上训练,并在五个真实世界域外基准上评估,采用VDrop的全景视觉思维是唯一既信息丰富又可学习的配置,并实现了最佳的域外泛化。

英文摘要

Cross-view spatial reasoning remains a weak spot for vision-language models (VLMs): they often reason in language and lose the fine-grained geometry needed for the task. Thinking with images aims to address this by generating an intermediate thinking image, but recent work shows that models often ignore the visual evidence in these traces. We therefore ask how to make visual thinking matter, and what kind of visual thinking works best. We study these questions in unified multimodal models (UMMs), which natively support interleaved image-text generation. For the first question, we propose View Dropout (VDrop), a training-time intervention that hides parts of one input view from the answer span while keeping them visible to the thinking-image tokens. This encourages the model to use the thinking image when answering, instead of relying only on the input views. Once the thinking image is used for answer prediction, we study which type of visual thinking is most effective. We frame this as a learnability-informativeness tradeoff and compare three thinking-image variants: top-down, panoramic, and point-matching renderings. Trained on synthetic scenes and evaluated on five real-world out-of-domain benchmarks, panoramic visual thinking with VDrop is the only configuration that is both informative and learnable, and it achieves the best out-of-domain generalization.
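
One way to picture View Dropout is as an attention mask. The sketch below is an assumption about the mechanism, not the authors' implementation: `vdrop_mask` is a hypothetical helper, the segment names are invented, and a plain causal mask is used as the starting point. Dropped input-view positions are hidden from answer tokens while staying visible to thinking-image tokens.

```python
def vdrop_mask(segments, dropped):
    """Build a boolean attention mask with View Dropout applied.

    segments: one label per token position, e.g. "view1", "think", "answer".
    dropped:  positions of input-view tokens hidden from the answer span.
    Returns allowed[i][j], True when query i may attend to key j.
    """
    n = len(segments)
    # Assumption: ordinary causal attention is the baseline.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i, segment in enumerate(segments):
        if segment == "answer":
            for j in dropped:
                allowed[i][j] = False
    return allowed

segments = ["view1", "view1", "view2", "view2", "think", "think", "answer"]
mask = vdrop_mask(segments, dropped={2, 3})
```

With this mask the answer token can reach the hidden region of the second view only indirectly, through the thinking-image tokens, which matches the stated goal of making the thinking image matter at answer time.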

2605.27304 2026-05-27 cs.CV 版本更新

PlayClass: Automated Play Behaviour Classification in Poultry

PlayClass: 家禽自动玩耍行为分类

Prince Ravi Leow, Neil Scheidwasser, Rebecca Oscarsson, Per Jensen, Samir Bhatt, David Alejandro Duchêne

Affiliations: Section for Health Data Science & AI, University of Copenhagen; AVIAN Behaviour Genomics and Physiology Group, Linköping University; Department of Infectious Disease Epidemiology, Imperial College London

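AI summary: Proposes the PlayClass pipeline, which uses long-duration tracking with SAM 3 and the V-JEPA 2.1 foundation model to classify poultry play behaviour automatically from top-down video, reaching a macro-averaged F1 of 77.0.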

Comments Accepted at CV4Animals Workshop @ CVPR 2026

AI中文摘要

自动监测动物福利主要关注负面指标,而玩耍等积极福利行为尚未充分探索。为解决这一问题,我们提出了PlayClass,一个从俯拍围栏视频中对家禽玩耍行为进行分类的流水线。该流水线利用SAM 3通过YOLO引导的片段边界进行长时跟踪,以最小化点提示中的身份错误,并使用图像和视频基础模型的冻结嵌入进行玩耍动作分类。尽管仅从跟踪掩模中手工设计的运动特征达到了有竞争力的准确率,但V-JEPA 2.1在所有模型规模上始终优于其他骨干网络,当与手工特征结合时达到77.0宏平均F1。尽管如此,由于玩耍子类型与非玩耍行为具有相似的运动特征以及鸟间遮挡,数据集仍然具有挑战性。总体而言,我们的工作为家禽玩耍行为自动分类框架提供了令人鼓舞的证据。

英文摘要

Automated monitoring of animal welfare has largely targeted negative indicators, leaving positive welfare behaviours such as play underexplored. To address this gap, we present PlayClass, a pipeline for play-behaviour classification in poultry from top-down pen video. The pipeline leverages long-duration tracking with SAM 3 via YOLO-guided chunk boundaries to minimise identity errors in point-based prompting, and frozen embeddings from image and video foundation models for play action classification. Although handcrafted motion features from tracked masks alone achieved competitive accuracy, V-JEPA 2.1 consistently outperformed all other backbones across model scales, reaching 77.0 macro-averaged F1 when combined with handcrafted features. Despite this result, the dataset remains challenging because play sub-types share similar kinematic profiles with non-play behaviour and because of inter-bird occlusion. Overall, our work provides encouraging evidence towards automated frameworks for play behaviour classification in poultry.
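
Macro-averaged F1, the headline metric above, weights every class equally regardless of how often it occurs. The short sketch below shows the arithmetic on invented labels; `macro_f1` is a hypothetical helper and the toy data has nothing to do with the paper's dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Invented example: the rare class ("play") is half missed.
y_true = ["play", "play", "rest", "rest", "rest", "rest"]
y_pred = ["play", "rest", "rest", "rest", "rest", "rest"]

score = macro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here the per-class scores are 2/3 for "play" and 8/9 for "rest", so the macro average is 7/9, below the accuracy of 5/6. Errors on a rare behaviour class pull the macro score down more than they pull accuracy down.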

2605.27295 2026-05-27 cs.CV 版本更新

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Gemini Embedding 2:来自Gemini的原生多模态嵌入模型

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

发表机构 * Gemini Report(Gemini 报告)

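AI summary: Proposes Gemini Embedding 2, a native multimodal embedding model that unifies the representation space of video, audio, image, and text through multi-task, multi-stage contrastive learning and achieves state-of-the-art performance on unimodal, cross-modal, and multimodal retrieval tasks.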

AI中文摘要

我们介绍了Gemini Embedding 2,一种原生多模态嵌入模型,允许在统一表示空间中对视频、音频、图像和文本模态进行嵌入。我们利用Gemini的多模态能力,为所有这些模态的交错输入任意组合生成嵌入,这些嵌入在广泛的任务中具有良好的泛化能力。在多任务多阶段训练设置中应用大规模对比学习,我们在关键嵌入基准测试中取得了最先进的性能,包括涵盖多种任务的单模态、跨模态和多模态检索。我们展示了我们的嵌入模型在多种任务上表现出强大的性能(在MSCOCO上得分为62.9 R@1,在Vatex上为68.8 NDCG@10,在MTEB多语言上为69.9,在MTEB代码上为84.0),超越了专门模型的性能。这些统一的能力使Gemini Embedding 2成为下游用例(如RAG、推荐和搜索)的有前途的候选者。此外,它在不同领域(从天文学和生物科学到美术和烹饪艺术)的强大零样本性能,使其成为即使对于专业领域也非常可靠的即用型表示。

英文摘要

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.

2605.27287 2026-05-27 cs.CV 版本更新

A Dynamic Programming Framework for Discovering Count and Values of Multilevel Image Thresholding

一种用于发现多级图像阈值计数和值的动态规划框架

Eslam Hegazy, Mohamed Gabr

Affiliations: German University in Cairo

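AI summary: Proposes an automatic multilevel thresholding method based on dynamic programming and a modified Minimum Error Thresholding criterion; it determines the number of thresholds automatically and runs faster than traditional dynamic programming methods when the number of thresholds is high, but yields lower SSIM and PSNR than methods that take the threshold count as input.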

AI中文摘要

多级图像阈值化是当今计算机视觉应用中重要的预处理算法。由于大多数常见的阈值化方法将期望的阈值数量作为用户输入,因此能够从输入图像本身自动确定合适阈值数量的阈值化方法具有优势。本文详细介绍了一种基于动态规划算法和改进的最小误差阈值(MET)准则的新型阈值化方法。通过实证统计研究,指出了该方法为何更优。此外,在自然、卫星和医学测试图像的综合集合上,将该方法与其它最先进方法进行了扩展比较。数值结果表明,当阈值数量较高时,所提出的MET-DP方法比传统的动态规划阈值化方法耗时少得多。该方法能够为大多数不同类型的测试图像检测出合适的阈值数量。然而,以阈值数量作为输入的传统方法产生的阈值化图像在结构相似性指数(SSIM)和峰值信噪比(PSNR)值上高于MET-DP。源代码可在https://w3id.org/met-dp/article1-code找到。

英文摘要

Multilevel image thresholding is an important preprocessing step in computer vision applications. Since most common thresholding methods take the desired count of thresholds as user input, thresholding methods that automatically determine a suitable count of thresholds from the input image itself are advantageous. In this article, a novel thresholding method based on a dynamic programming algorithm and a modification of the Minimum Error Thresholding (MET) criterion is presented in detail. An empirical statistical study is performed to pinpoint why the proposed method is superior. Moreover, an extended comparison between the proposed method and other state-of-the-art methods is performed on a comprehensive set of natural, satellite, and medical test images. The numerical results show that the proposed MET-DP method takes much less time than traditional dynamic programming thresholding methods when the number of thresholds is high. The proposed method can detect a suitable count of thresholds for most of the tested images of different types. However, traditional methods that take the count of thresholds as input produce thresholded images with higher structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR) values than MET-DP. Source code can be found at https://w3id.org/met-dp/article1-code
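
For orientation, the sketch below shows the general shape of dynamic-programming thresholding on a gray-level histogram. It is not the MET-DP method: the cost is the within-class sum of squared deviations (an Otsu-style criterion swapped in for the modified MET criterion), and the number of thresholds `k` is given rather than discovered. `multilevel_thresholds` is a hypothetical name.

```python
def multilevel_thresholds(hist, k):
    """Return k cut positions splitting gray levels 0..L-1 into k+1 classes.

    Each cut is the first gray level of a class. The split minimises the
    total within-class sum of squared deviations (not the MET criterion).
    """
    levels = len(hist)
    classes = k + 1
    if classes > levels:
        raise ValueError("need at least k + 1 gray levels")

    # Prefix sums of counts, first moments, and second moments.
    p0, p1, p2 = [0], [0], [0]
    for gray, count in enumerate(hist):
        p0.append(p0[-1] + count)
        p1.append(p1[-1] + count * gray)
        p2.append(p2[-1] + count * gray * gray)

    def cost(a, b):
        """Sum of squared deviations for the class of levels a..b-1."""
        n = p0[b] - p0[a]
        if n == 0:
            return 0.0
        s = p1[b] - p1[a]
        return (p2[b] - p2[a]) - s * s / n

    inf = float("inf")
    # best[c][b]: minimum cost of splitting levels 0..b-1 into c classes.
    best = [[inf] * (levels + 1) for _ in range(classes + 1)]
    back = [[0] * (levels + 1) for _ in range(classes + 1)]
    best[0][0] = 0.0
    for c in range(1, classes + 1):
        for b in range(c, levels + 1):
            for a in range(c - 1, b):
                candidate = best[c - 1][a] + cost(a, b)
                if candidate < best[c][b]:
                    best[c][b] = candidate
                    back[c][b] = a

    # Walk the back-pointers from the last class to the second class.
    cuts, b = [], levels
    for c in range(classes, 1, -1):
        b = back[c][b]
        cuts.append(b)
    return sorted(cuts)

two_cluster_cuts = multilevel_thresholds([5, 5, 0, 0, 0, 0, 5, 5], 1)
three_cluster_cuts = multilevel_thresholds([4, 0, 0, 4, 0, 0, 4], 2)
```

The triple loop costs on the order of k·L² cost evaluations for L gray levels, which gives a sense of why running time grows with the number of thresholds in conventional dynamic-programming approaches.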

2605.27243 2026-05-27 cs.CV 版本更新

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

检索头能看见图像吗?长上下文视觉语言模型中的多模态检索头

Aaron Branson Cigres Li, Zhaowei Wang, Yu Zhao, Yiming Du, Haobo Li, Xiyu Ren, Ginny Wong, Simon See, Lishu Luo, Haodong Duan, Pasquale Minervini, Yangqiu Song

Affiliations: HKUST; University of Edinburgh; CUHK; NVAITC, NVIDIA, Santa Clara, USA; Tsinghua University

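AI summary: Proposes a detection method for multimodal retrieval heads and finds that only 4.4-10.2% of attention heads in vision-language models account for 50% of the positive retrieval-score mass; these heads are critical for long-context reasoning and can be used directly for document retrieval to improve performance.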

Comments Work in Progress

AI中文摘要

大型视觉语言模型越来越依赖长上下文建模来推理文档、小时级视频和长周期智能体轨迹,要求它们能在交错的文本和图像中定位相关证据。先前的工作使用大语言模型中的检索头研究了这种行为,但其基于复制的标准在证据出现在图像中时并不直接适用。我们引入了一种多模态检索头检测方法,对从问题标记到文本或视觉证据的注意力进行评分。通过这种方法,我们表明多模态检索头是稀疏的、内在的且因果重要的:仅4.4-10.2%的注意力头贡献了50%的正检索分数,而屏蔽前5%选定的头会使MMLongBench-Doc从48.2%降至5.7%,SlideVQA从71.2%降至8.9%,而随机头屏蔽的破坏性要小得多。进一步分析表明,这些头在模态间部分共享,但在每个模态内保持动态,随着上下文长度和“草堆”模态的变化,图像检索头比文本检索头变化更大。无需进一步训练,我们发现这些头也可直接用于对视觉丰富文档进行排序:在MMDocIR上,Qwen3-VL-8B选定的头评分在页面检索上比最强基线提高了7.7/7.4宏/微平均Recall@1,在布局检索上提高了6.3/6.8点。

英文摘要

Large vision-language models increasingly rely on long-context modeling to reason over documents, hour-level videos, and long-horizon agent trajectories, requiring them to locate relevant evidence across interleaved text and images. Prior work has studied this behavior using retrieval heads in large language models, but its copy-based criterion does not directly apply when evidence appears in images. We introduce a multimodal retrieval head detection method that scores attention from question tokens to textual or visual evidence. With this method, we show that multimodal retrieval heads are sparse, intrinsic, and causally important: only 4.4-10.2% of attention heads account for 50% of the positive retrieval-score mass, and masking the top-5% selected heads drops MMLongBench-Doc from 48.2% to 5.7% and SlideVQA from 71.2% to 8.9%, while random-head masking is far less damaging. Further analysis shows that these heads are partly shared across modalities yet remain dynamic within each modality, with image retrieval heads changing more than text retrieval heads as context length and haystack modality change. Without further training, we find that these heads can also be used directly to rank visually rich documents: on MMDocIR, Qwen3-VL-8B selected-head scoring improves Recall@1 by 7.7/7.4 macro/micro points for page retrieval and 6.3/6.8 points for layout retrieval over the strongest reported baseline.
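
The sparsity statistic quoted above (a small share of heads accounting for 50% of the positive retrieval-score mass) can be computed as follows. This is a minimal sketch with invented scores; `fraction_for_mass` is a hypothetical helper, and how the per-head retrieval scores are obtained is outside its scope.

```python
def fraction_for_mass(head_scores, target=0.5):
    """Smallest fraction of all heads covering `target` of the positive mass.

    Heads are taken in descending score order; non-positive scores
    contribute nothing to the mass but still count in the denominator.
    """
    positive = sorted((s for s in head_scores if s > 0), reverse=True)
    total = sum(positive)
    if total == 0:
        return 0.0
    running = 0.0
    for count, score in enumerate(positive, start=1):
        running += score
        if running >= target * total:
            return count / len(head_scores)
    return len(positive) / len(head_scores)

# Invented scores for ten heads: one dominant head, two weak ones.
concentrated = fraction_for_mass([8, 1, 1, 0, 0, 0, 0, 0, 0, -2])
uniform = fraction_for_mass([1, 1, 1, 1])
```

A low value, as in the first call, means retrieval behaviour is concentrated in a few heads, which is the condition under which masking a small top-ranked subset would be expected to hurt far more than masking random heads.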

2605.27235 2026-05-27 cs.CV 版本更新

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

MRT:用于大规模分层图像生成与编辑的掩码区域变换器

Zhicong Tang, Zhao Zhang, Jingye Chen, Mohan Zhou, Yifan Pu, Yuchi Liu, Yalong Bai, Ethan Smith, Yuhui Yuan

Affiliations: Canva Research

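AI summary: Proposes MRT, a 20B-parameter masked region diffusion model that unifies the text-to-layers, image-to-layers, and layers-to-layers tasks and introduces an overflow-aware canvas layer, enabling efficient multi-layer transparent image generation and editing.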

Comments CVPR 2026

AI中文摘要

分层图像生成与编辑是一项基础能力,能够实现生成视觉内容的逐层重用、编辑和组合,类似于自然语言中的词级编辑。尽管其重要性,但在大规模场景下仍是一个未充分探索的领域。为解决这一问题,我们提出了MRT,一个200亿参数的掩码区域扩散模型,专为多层透明图像生成与编辑设计,并在超过1000万个涵盖多种宽高比和文本提示的多语言设计样本上训练。为充分利用这一规模,我们做出了两项关键技术贡献。首先,我们在共享的掩码区域扩散框架内统一了三个互补任务,包括文本到层、图像到层和层到层,其中选择性标记掩码实现了灵活的逐层生成与编辑。其次,为实现溢出层生成,我们引入了一个溢出感知画布层,用于处理边界不一致性并支持半透明背景合成,从而生成超出可见画布边界的完整可编辑层。此外,我们应用扩散蒸馏实现了8步实时多层生成,且质量下降极小。大量实验表明,我们的框架在所有三个任务上显著优于先前的最先进方法(包括各种商业系统),为多层透明图像生成建立了新基准。值得注意的是,根据用户研究结果,我们的模型在图像到层质量上显著优于同期Qwen-Image-Layered模型,同时在图像到层推理中实现了10-100倍的推理速度提升,并将激活GPU内存消耗降低50-90%。

英文摘要

Layered image generation and editing is a fundamental capability that enables layer-wise reuse, editing, and composition of generated visual content, analogous to word-level editing in natural language. Despite its importance, this remains an underexplored area at scale. To address this gap, we present MRT, a 20B-parameter masked region diffusion model tailored for multi-layer transparent image generation and editing, trained on over 10M multilingual design samples spanning diverse aspect ratios and textual prompts. To fully leverage this scale, we make two key technical contributions. First, we unify three complementary tasks including text-to-layers, image-to-layers, and layers-to-layers within a shared masked region diffusion framework, where selective token masking enables flexible layer-wise generation and editing. Second, to enable overflow layer generation, we introduce an overflow-aware canvas layer that handles boundary inconsistencies and supports semi-transparent background synthesis, enabling complete editable layers extending beyond visible canvas boundaries. Additionally, we apply diffusion distillation to achieve 8-step, real-time multi-layer generation with minimal quality degradation. Extensive experiments demonstrate that our framework substantially outperforms prior state-of-the-art approaches, including various commercial systems, across all three tasks, establishing a new benchmark for multi-layer transparent image generation. Notably, our model significantly outperforms the concurrent Qwen-Image-Layered model in image-to-layers quality according to user-study results, while achieving 10-100× faster inference and reducing activation GPU memory consumption by 50-90% during image-to-layers inference.

2605.27203 2026-05-27 cs.CV cs.AI 版本更新

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

生成式动画:面向提示驱动运动合成的多模型流水线

Mannat Khurana, Sanyam Jain, Rishav Agarwal

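Affiliations * Canva; Adobe

AI summary Proposes a pipeline that combines large language models with a segmentation model to convert natural-language prompts automatically into animation motion paths that respect scene geometry, depth-based occlusion, and 3D perspective transforms.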

Comments 5 pages, 6 figures

详情
AI中文摘要

动画将数字文档提升为沉浸式体验,然而创建自定义运动路径仍然繁琐,需要设计师手动选择预设、绘制贝塞尔点并配置时间属性。我们引入了生成式动画,这是一个将自然语言提示转换为生产就绪动画的系统。通过将用于语义解析的大语言模型(LLMs)与用于视觉基础的Segment Anything Model(SAM)串联,我们的流水线自动生成尊重场景几何、处理基于深度的遮挡并考虑3D透视变换的运动路径。我们通过三个用例演示该系统:轮廓跟随轨迹、具有z轴顺序意识的轨道动画以及变换对象上的透视对齐运动。

英文摘要

Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective transforms. We demonstrate the system through three use cases: contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects.
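As a rough illustration of the second use case (orbital animation with z-order awareness), the sketch below samples an elliptical orbit around an object and flags, for each keyframe, whether the moving element would be drawn in front of or behind that object. The function name `orbit_keyframes`, the screen convention (y grows downward, so the lower half of the ellipse is treated as nearer the viewer), and the binary front/behind flag are all assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
import math
from typing import List, Tuple

# (x, y, in_front): screen position plus a draw-order flag.
Keyframe = Tuple[float, float, bool]


def orbit_keyframes(center: Tuple[float, float], rx: float, ry: float,
                    n_frames: int) -> List[Keyframe]:
    """Sample one full elliptical orbit around `center`.

    Hypothetical convention: screen y grows downward, so the half of the
    ellipse with sin(theta) > 0 is treated as passing in front of the object.
    """
    cx, cy = center
    frames = []
    for i in range(n_frames):
        theta = 2.0 * math.pi * i / n_frames
        x = cx + rx * math.cos(theta)
        y = cy + ry * math.sin(theta)
        frames.append((x, y, math.sin(theta) > 0.0))
    return frames
```

A renderer could use the flag to place the moving element above or below the orbited object in the layer stack at each keyframe. A real system would presumably derive that ordering from scene depth rather than from a fixed convention.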

2605.27194 2026-05-27 cs.CL cs.CV cs.LG 版本更新

Not All Tokens Matter Equally: Dynamic In-context Vector Distillation with Decisive-Token Supervision for Long-form Medical Report Generation

并非所有标记都同等重要:基于关键标记监督的动态上下文向量蒸馏用于长医学报告生成

Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen, Jinxi Xiang, Tao Wei, Lina Yao, Mingjie Li

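Affiliations * UNSW Sydney; University of Technology Sydney; School of Computer Science and Engineering, Sun Yat-sen University; Stanford University; Shanghai Jiao Tong University

AI summary Proposes DIVE, a framework that uses decisive-token supervision and state-conditioned dynamic steering to address token-level distillation's neglect of decisive tokens in long-form generation; on medical report generation it achieves the best BLEU-4, ROUGE-L, and RadGraph F1.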

Comments Preprint. 20 pages, 6 figures

详情
AI中文摘要

将示范效果蒸馏到隐藏空间干预中提供了一种轻量级的替代全微调的方法。然而,现有的多模态变体主要是在短文本任务上评估的,其中输出在几个标记后结束。将这些方法扩展到长文本生成暴露了一个基本但未充分研究的局限性:标记级蒸馏隐式地将所有输出标记视为同等信息量,但长文本输出由高频模板和语法标记主导,而实际决定输出质量的标记稀疏分布。在医学报告生成(MRG)中,有两种这样的关键标记突出:决定诊断内容的病理相关标记和决定终止的序列结束(EOS)事件。两者在均匀交叉熵下都受到不足的监督,自回归解码通过偏离教师强制轨迹进一步加剧了问题。我们提出DIVE,一个冻结骨干的蒸馏框架,通过两种与这些失败相匹配的互补机制来解决长文本报告生成。关键标记监督通过提高病理相关标记和EOS事件的交叉熵贡献来恢复监督平衡,确保内容保真度和终止在训练期间学习,而不是在解码时施加。状态条件动态引导用隐藏状态相关的适配器替换固定的开环残差,允许注入信号随着解码漂移而适应。在MIMIC-CXR和CheXpert Plus上使用两个医学VLM骨干的实验表明,DIVE在词汇和临床代理指标中始终位列最强方法之一。我们的方法在所有数据集-骨干设置中实现了最佳的BLEU-4、ROUGE-L和RadGraph F1,同时在粗粒度标签级CheXbert F1上保持竞争力。

英文摘要

Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset-backbone settings, while remaining competitive on coarse label-level CheXbert F1.
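The idea of upweighting decisive tokens in the cross-entropy can be sketched in a few lines. This is a minimal illustration, not the DIVE loss itself: the function name `decisive_weighted_ce`, the weight value 3.0, the weight-normalized average, and the way decisive token ids are supplied are all assumptions, since the abstract does not give the exact weighting scheme.

```python
import math
from typing import Sequence, Set


def decisive_weighted_ce(log_probs: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         decisive_ids: Set[int],
                         decisive_weight: float = 3.0) -> float:
    """Weighted mean of per-token negative log-likelihoods.

    log_probs[t][v] is the model's log-probability of vocabulary id v at
    position t; targets[t] is the reference token id. Tokens in
    `decisive_ids` (for example pathology terms and EOS) get
    `decisive_weight`; every other token gets weight 1.
    """
    total, norm = 0.0, 0.0
    for step_log_probs, target in zip(log_probs, targets):
        weight = decisive_weight if target in decisive_ids else 1.0
        total += -weight * step_log_probs[target]
        norm += weight
    return total / norm
```

With an empty `decisive_ids` set this reduces to uniform cross-entropy. When a poorly predicted decisive token is present, the weighted loss is larger than the uniform one, so the gradient signal shifts toward those sparse tokens.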

2605.27178 2026-05-27 cs.CV cs.AI cs.LG cs.RO 版本更新

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj: 自监督基础模型作为无标签3D物体分割的奖励

Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang

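Affiliations * Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University

AI summary Proposes FoundObj, a framework that uses semantic and geometric priors from self-supervised 2D/3D foundation models as rewards and guides superpoint merging through reinforcement learning, enabling 3D object segmentation in complex scenes without scene-level human annotations.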

Comments ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: https://github.com/vLAR-group/FoundObj

详情
AI中文摘要

我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体,这主要是由于学习过程中物体先验不足。在本文中,我们提出了FoundObj,一个新颖的框架,其特点是基于超点的物体发现代理,该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验,为物体发现代理提供互补反馈,并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明,我们的方法始终优于现有基线。值得注意的是,我们的方法在零样本和长尾场景中表现出强大的泛化能力,突显了其在可扩展、无标签3D物体分割方面的潜力。

英文摘要

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

2605.27158 2026-05-27 cs.CV 版本更新

Model discovery for dynamical systems with complex-valued product units

具有复数值乘积单元的动力学系统模型发现

Martin Brückmann, Babette Dellen, Uwe Jaekel

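Affiliations * Department of Mathematics, Informatics, and Technology, RheinAhrCampus; University of Applied Sciences Koblenz

AI summary Proposes a data-driven method based on complex-valued product-unit networks that learns sparse linear combinations of monomials, including those with fractional or negative exponents, directly from observed trajectories to discover the governing equations of dynamical systems.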

Comments 16 pages, 8 figures

详情
AI中文摘要

从观测轨迹中发现动力学系统的控制方程比单纯预测未来状态能更深入地理解其结构。我们提出一种基于复数值乘积单元网络的数据驱动模型发现方法,其中每个单元表示一个复数值单项式,网络输出是这些单项式的稀疏线性组合。与SINDy等基于库的方法不同,我们的方法不需要预定义候选函数集:相关的单项式(包括分数或负指数)直接从数据中学习。在四个混沌基准系统(Lorenz63、Lorenz84、四翼吸引子和Lorenz63的分数阶变体)上,使用至少3000个训练点,我们对前三个系统在90%的试验中恢复了精确的控制方程,对分数阶情况在70-90%的试验中恢复。应用于真实世界的人体步态加速度计信号,模型产生具有有界预测误差的稳定轨迹,在比训练间隔长三倍的测试时间范围内,RMSE约为信号幅度范围的12-14%,展示了其在高维系统(其中解析方程不可用)中的潜力。

英文摘要

Discovering the governing equations of a dynamical system from observed trajectories provides deeper insight into its structure than mere prediction of future states. We present a data-driven approach to model discovery based on complex-valued product-unit networks, in which each unit represents a complex monomial and the network output is a sparse linear combination of such monomials. In contrast to established library-based methods such as SINDy, our approach does not require a predefined set of candidate functions: the relevant monomials, including those with fractional or negative exponents, are learned directly from data. Across four chaotic benchmark systems (Lorenz63, Lorenz84, the Four-Wing attractor, and a fractional variant of Lorenz63), we recover the exact governing equations in 90% of trials for the first three systems, and in 70-90% of trials for the fractional case, using at least 3000 training points. Applied to real-world human-gait accelerometer signals, the model produced stable trajectories with bounded prediction errors, corresponding to an RMSE of approximately 12-14% of the signal amplitude range over a test horizon three times longer than the training interval, demonstrating its potential for high-dimensional systems in which analytic equations are unavailable.
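A product unit can be written as the exponential of a weighted sum of logarithms, which is why real-valued exponents (fractional or negative) become ordinary learnable weights, and why a complex logarithm is needed once inputs can be negative. The sketch below shows only this evaluation step under that standard identity; the names `product_unit` and `sparse_rhs` are hypothetical, and nothing here reproduces the authors' training or sparsification procedure.

```python
import cmath
from typing import List, Sequence, Tuple


def product_unit(x: Sequence[float], exponents: Sequence[float]) -> complex:
    """Evaluate prod_i x_i ** w_i as exp(sum_i w_i * log(x_i)).

    The complex logarithm keeps the unit defined for negative inputs.
    cmath.log raises ValueError at exactly 0, which this sketch does not handle.
    """
    log_sum = 0j
    for xi, wi in zip(x, exponents):
        log_sum += wi * cmath.log(complex(xi))
    return cmath.exp(log_sum)


def sparse_rhs(x: Sequence[float],
               terms: List[Tuple[float, Sequence[float]]]) -> float:
    """Real part of a linear combination of product units.

    Each term is (coefficient, exponents); a sparse model has few terms.
    """
    return sum(coef * product_unit(x, exps) for coef, exps in terms).real


# First Lorenz63 equation, dx/dt = sigma * (y - x) with sigma = 10,
# expressed as two monomials over the state (x, y, z).
lorenz_dx = [(10.0, (0.0, 1.0, 0.0)), (-10.0, (1.0, 0.0, 0.0))]
```

For example, `sparse_rhs((1.0, 2.0, 3.0), lorenz_dx)` evaluates the two-term model at one state. A library-based method would instead need `x` and `y` to be present in its candidate set beforehand.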

2605.27154 2026-05-27 cs.CV 版本更新

Touch-R1: Reinforcing Touch Reasoning in MLLMs

Touch-R1:在多模态大语言模型中强化触觉推理

Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan

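Affiliations * Xiamen University; Great Bay University; Fudan University; Nanjing University; Daimon Robotics

AI summary Targeting the ordinal nature of physical attributes and cross-sensor distribution shift in tactile reasoning, proposes Touch-R1, trained with a tactile-grounded GRPO objective; on TouchReason-Bench it outperforms Octopi-13B and GPT-4o on average.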

Comments Our code and data will be made public on the https://laiyingxin2.github.io/Projects

详情
AI中文摘要

虽然基于规则的强化学习最近在多模态模型中催化了显式推理,但触觉推理仍然很大程度上未被探索。现有的触觉语言模型主要依赖于监督或对比目标,这限制了它们将预测基于物理证据或纠正误导性视觉先验的能力。触觉推理引入了两个模态特定的挑战:物理属性(如硬度、粗糙度)的序数性质,以及光学触觉硬件固有的跨传感器分布偏移。在这项工作中,我们引入了TouchReason-1M,一个大规模多模态数据集,包含来自四个不同传感器的超过100万同步触觉对,以及TouchReason-Bench,一个用于评估触觉感知和视觉-触觉冲突解决的严格框架。在此基础上,我们提出了Touch-R1,一个基于Qwen2.5-VL-7B的触觉推理多模态大语言模型。Touch-R1通过一个触觉接地的GRPO目标进行训练,该目标结合了序数感知准确性、跨传感器物理一致性、结构化格式控制以及输入侧触觉接地目标。具体来说,触觉使用奖励仅在真实触觉输入相对于去除、打乱或噪声掩蔽触觉流的反事实控制产生更优正确性时赋予信用。在TouchReason-Bench上,Touch-R1-7B平均优于Octopi-13B 18.4%和GPT-4o 24.7%。其结构化推理轨迹揭示了探测、比较和修正的涌现行为,表明R1风格的推理可以有效地基于物理接触。

英文摘要

While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
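Two of the reward ingredients named above can be illustrated with toy functions. Both are guesses at a plausible form, not the paper's definitions: the linear distance penalty in `ordinal_accuracy`, the binary bonus in `tactile_use_reward`, and the choice to compare against the best-scoring counterfactual are assumptions introduced here.

```python
from typing import Mapping


def ordinal_accuracy(pred_level: int, true_level: int, num_levels: int) -> float:
    """Partial credit that shrinks linearly with ordinal distance.

    Levels are integers 0..num_levels-1 (for example hardness grades), so a
    near miss scores higher than a prediction at the opposite end of the scale.
    """
    return 1.0 - abs(pred_level - true_level) / (num_levels - 1)


def tactile_use_reward(score_with_touch: float,
                       control_scores: Mapping[str, float],
                       bonus: float = 1.0) -> float:
    """Grant `bonus` only if real tactile input beats every counterfactual.

    `control_scores` maps a control name (removed, shuffled, noise_masked)
    to the correctness obtained when the tactile stream is altered that way.
    """
    best_control = max(control_scores.values())
    return bonus if score_with_touch > best_control else 0.0
```

Under this reading, a model that answers correctly from visual priors alone earns no tactile-use credit, because its score does not drop when the tactile stream is removed or corrupted.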

2605.27146 2026-05-27 cs.CV 版本更新

Chaos-SSL: An Attention-Based Self-Supervised Learning Framework with Chaotic Transformation for Medical Image Classification

Chaos-SSL:基于混沌变换的注意力自监督学习框架用于医学图像分类

Joao Batista Florindo

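Affiliations * Institute of Mathematics, Statistics and Scientific Computing, University of Campinas

AI summary Proposes Chaos-SSL, a framework that uses 1D chaotic maps as non-linear data augmentation for self-supervised pre-training and combines it with an attention-based fusion model, reaching performance competitive with state-of-the-art methods on skin-lesion and diabetic-retinopathy classification.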

详情
Journal ref
In Proceedings of VISAPP 2026 - Volume 1, pages 574-581
AI中文摘要

自监督学习(SSL)已成为缓解对大规模标注数据集依赖的强大范式,这是医学图像分析中的常见瓶颈。然而,依赖简单几何和颜色增强的标准SSL方法可能无法捕捉到分类细微病理所需的细粒度、复杂纹理细节。本文介绍了Chaos-SSL,一种新颖的两阶段医学图像分类框架。在第一阶段,我们提出了一种新的自监督预训练策略,利用一维混沌映射(Logistic、Tent和Sine)作为对比学习的复杂非线性增强。我们假设这些混沌变换创建了“更难”且语义更丰富的视图,迫使网络学习细粒度医学纹理的鲁棒表示。在第二阶段,我们引入了一种基于注意力的融合模型,该模型动态地将来自Chaos-SSL模型的专门特征与来自更大的ImageNet预训练模型的通用特征相结合。我们在两个公共数据集上验证了我们的方法:ISIC 2018(皮肤病变)和APTOS 2019(糖尿病视网膜病变)。我们的结果表明,使用Tent映射预训练30个epoch的Chaos-SSL模型,随后进行注意力融合,其性能与最先进方法完全竞争,在ISIC 2018上达到0.9261的准确率,在APTOS 2019上达到0.8726的准确率。这显著优于现有的SSL方法,包括几种最新方法。

英文摘要

Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create "harder" and more semantically rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state of the art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
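The three named maps have standard textbook forms on the unit interval, so a chaotic augmentation can be sketched as a per-pixel intensity transform. The map parameters (r = 4, mu = 2, r = 1), the per-pixel application, the iteration count, and the helper name `chaotic_augment` are assumptions for illustration; the abstract does not say how the maps are applied to images.

```python
import math
from typing import Callable, List


def logistic_map(x: float, r: float = 4.0) -> float:
    """Logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def tent_map(x: float, mu: float = 2.0) -> float:
    """Tent map x -> mu * min(x, 1 - x)."""
    return mu * min(x, 1.0 - x)


def sine_map(x: float, r: float = 1.0) -> float:
    """Sine map x -> r * sin(pi * x)."""
    return r * math.sin(math.pi * x)


def chaotic_augment(image: List[List[float]],
                    chaotic_map: Callable[[float], float],
                    iterations: int = 1) -> List[List[float]]:
    """Apply a 1D chaotic map to every intensity of a [0, 1] grayscale image."""
    out = []
    for row in image:
        new_row = []
        for value in row:
            for _ in range(iterations):
                value = chaotic_map(value)
            # Clamp to guard against floating-point drift outside [0, 1].
            new_row.append(min(1.0, max(0.0, value)))
        out.append(new_row)
    return out
```

In a contrastive setup, `chaotic_augment(image, tent_map)` and the untouched image (or a second augmented copy) would form a positive pair. Unlike a flip or crop, the transform folds the intensity range, so nearby gray levels can map far apart.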

2605.27144 2026-05-27 cs.CV cs.LG 版本更新

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

图像是否也值得16x16=256个超像素?一个用于注意力图像分类的框架

Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb

Affiliations * UFRGS; Institute of Informatics; Federal University of Rio Grande do Sul; Division of Informatics; School of Health Sciences; Imaging and Data Science; Faculty of Biology, Medicine and Health; University of Manchester; Vaughan House, Portsmouth St

AI summary Proposes the Superpixel Transformer (SPT) framework, which unifies superpixel-based image classification and Vision Transformers; with a multidimensional sine-cosine positional encoding and an enriched patch data structure, it outperforms previous superpixel-based GNN methods on several datasets and remains competitive with ViTs.

详情
AI中文摘要

基于超像素的图像分类传统上利用图神经网络(GNN)处理不规则图像表示。计算机视觉的最新进展,由视觉变换器(ViT)驱动,引入了自注意力模型的新范式,在各种任务中超越了卷积神经网络(CNN)。然而,GNN、超像素和变换器之间的协同联系仍未探索。在这项工作中,我们提出了超像素变换器(SPT),这是一个统一超像素图像分类和ViT的新框架。SPT将超像素图像分类与图注意力网络(SICGAT)模型和ViT泛化,以支持任意超像素分块策略、连接图和位置编码。我们引入了改进,包括多维正弦余弦位置编码和完全包含超像素形状和颜色信息的增强补丁数据结构。通过在CIFAR10、FashionMNIST和Imagenette等数据集上测试SPT,采用各种超像素生成和图连接策略,我们证明SPT相比以前的超像素GNN方法实现了优越的性能,并与ViT保持竞争力。值得注意的是,我们的方法解决了SICGAT的局限性,例如像素聚合过程中的信息丢失,并展示了受限图连接如何增强ViT性能。SPT弥合了基于超像素和变换器模型之间的差距,为跨领域泛化和混合注意力框架的未来创新开辟了道路,并表明图像也值得$16\times16$个超像素。

英文摘要

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth 16×16 superpixels.
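Because superpixels sit at irregular, real-valued positions rather than on a patch grid, a positional encoding has to accept continuous coordinates. One common way to build a multidimensional sine-cosine encoding is to split the embedding evenly across axes and apply the usual sinusoidal formula to each coordinate. The sketch below follows that construction as an assumption; the function names, the base of 10000, and the even split are not taken from the paper.

```python
import math
from typing import List, Sequence


def sincos_1d(pos: float, dim: int, base: float = 10000.0) -> List[float]:
    """Standard sinusoidal encoding of one scalar position into `dim` values."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2.0 * i / dim))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc


def sincos_nd(coords: Sequence[float], dim: int) -> List[float]:
    """Concatenate per-axis sinusoidal encodings, e.g. for a centroid (x, y)."""
    per_axis, remainder = divmod(dim, len(coords))
    if remainder != 0 or per_axis % 2 != 0:
        raise ValueError("dim must split into an even number of values per axis")
    enc = []
    for c in coords:
        enc.extend(sincos_1d(c, per_axis))
    return enc
```

Calling `sincos_nd((3.5, 7.25), 16)` encodes a superpixel centroid into 16 values, 8 per axis. Further coordinates, such as a scale or a channel index, could be appended as extra axes in the same way.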

2605.27139 2026-05-27 eess.IV cs.CV physics.ins-det 版本更新

Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography

无监督深度图像先验用于稀疏视角和有限角度电子断层扫描

Serge Brosset, Daniel del Pozo Bueno, Thomas David, Laure Guetaz, Philippe Ciuciu, Zineb Saghi

Affiliations * Univ. Grenoble Alpes, CEA, Leti; Univ. Grenoble Alpes, CEA, Liten; CEA, Joliot, NeuroSpin; Inria, MIND, Université Paris-Saclay

AI summary Presents an unsupervised deep image prior approach that, on simulated data, performs comparably to supervised methods under sparse-view and limited-angle conditions, and applies it to experimental data to show reliable 3D quantification.

Comments 22 pages, 12 figures

详情
AI中文摘要

电子断层扫描(ET)在纳米材料的三维(3D)表征中发挥着重要作用。然而,在有限角度和稀疏视角条件下,传统算法会产生退化的重建结果,影响所得3D数据的质量和可解释性。本文提出深度图像先验(DIP),一种无监督的深度学习(DL)方法,用于高度退化的断层扫描采集,并通过模拟数据证明,即使在倾斜范围仅为60°、倾斜步长为10°的情况下,其性能也与需要训练数据集的监督方法相当。然后,我们将其应用于实验数据,并表明它在稀疏视角和有限角度条件下都能实现可靠的3D量化,突显了其在广泛材料和采集模式中的潜力。

英文摘要

Electron tomography (ET) plays an important role in the three-dimensional (3D) characterization of nanomaterials. However, under limited-angle and sparse-view conditions, conventional algorithms produce degraded reconstructions, which compromise the quality and interpretability of resulting 3D data. In this paper, we present deep image prior (DIP), an unsupervised deep learning (DL) approach, for highly degraded tomography acquisitions and demonstrate, using simulated data, that its performance is comparable to that of supervised approaches requiring training datasets, even for tilt ranges as limited as 60° and tilt increments of 10°. We then apply it to experimental data and show that it enables reliable 3D quantification under both sparse-view and limited-angle conditions, highlighting its potential for a wide range of materials and acquisition modalities.

2605.27136 2026-05-27 cs.CV 版本更新

Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

利用视觉信号实现视觉-语言生成中鲁棒的词元级不确定性

Joseph Hoche, David Brellmann, Gianni Franchi

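Affiliations * AMIAD, Pôle Recherche, Palaiseau

AI summary To address the underuse of visual information in uncertainty quantification for large vision-language models, proposes VIG-TUQ, a training-free token-level framework that weights language uncertainty by visual grounding scores and often improves on existing token-level uncertainty methods.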

详情
AI中文摘要

不确定性量化(UQ)对于大型视觉语言模型(LVLMs)的可靠预测和实际部署仍然是一个关键挑战。然而,现有方法大多源自LLM文献,主要关注语言模态,而视觉信息对LVLM不确定性的贡献在很大程度上未被探索。在本文中,我们研究了LVLMs如何处理视觉信息,以及这一过程是否可用于改进不确定性估计。通过分析生成过程中视觉特征整合后的隐藏表示,我们观察到高置信度预测比不确定预测更依赖于视觉内容。基于这一发现,我们提出了视觉锚定语元级UQ(VIG-TUQ),这是一个无需训练的框架,通过用视觉锚定分数加权词元级语言不确定性,将视觉锚定显式纳入不确定性估计。我们在多个数据集和不同的LVLM架构(包括早期融合、晚期融合和原生融合模型)上评估了VIG-TUQ。结果表明,我们的方法通常优于现有的词元级不确定性方法。代码和数据将在接收后公开。

英文摘要

Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.

2605.27135 2026-05-27 cs.CR cs.CV 版本更新

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

现代事后水印方法能否击败断箭?

Enoal Gesny, Eva Giboulot

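Affiliations * Inria

AI summary Through a fair comparison of the robustness and security of modern and classic post-hoc watermarking methods under a range of attacks, finds that in a realistic scenario classic watermarking is more secure while remaining robust.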

详情
AI中文摘要

随着扩散模型等生成模型的快速普及,数字水印已成为识别AI生成图像的关键解决方案。现代事后水印方案利用神经网络实现极低的误报率,同时对常见图像变换保持鲁棒性。然而,这些现代方法与经典方法之间缺乏比较,特别是在鲁棒性和安全性优先于极低误报概率的现实场景中。本文提出了现代与经典事后水印在多种经典增强和近期复杂攻击下的鲁棒性和安全性的公平比较。实验表明,在现实场景中,经典水印在保持鲁棒性的同时,在安全性方面优于现代技术。

英文摘要

With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.

2605.27132 2026-05-27 cs.CV 版本更新

Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions

图像阈值化:理解评估指标对特定评估函数的偏差

Eslam Hegazy, Mohamed Gabr

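Affiliations * German University in Cairo

AI summary By analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds on the BSDS500 dataset, shows that Otsu's criterion correlates strongly with SSIM and PSNR while Kapur's entropy correlates weakly, indicating an inherent metric-objective-function bias.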

Comments Submitted to ICPR 2026 (https://icpr2026.org)

详情
AI中文摘要

多级图像阈值化广泛应用于从医学成像到遥感的分割任务中。经典的目标函数,如Otsu的类间方差和Kapur的熵,通常通过元启发式算法进行优化,并使用结构相似性指数(SSIM)和峰值信噪比(PSNR)等指标评估性能。这些评估隐含地假设SSIM和PSNR提供了分割质量的无偏度量。在本研究中,我们通过分析BSDS500数据集中所有可能阈值下阈值化目标函数与质量指标之间的相关性来检验这一假设。结果表明,Otsu准则始终与SSIM和PSNR表现出高相关性,而Kapur熵的相关性较弱且变化较大。Otsu在所有图像上与PSNR的相关性优于Kapur,在超过91%的图像上与SSIM的相关性也优于Kapur。我们的发现揭示了一种固有的指标-目标函数偏差。这项工作强调了需要更中立的评估框架,并激励将分析扩展到其他阈值化准则和领域。本文的源代码可在https://w3id.org/met-dp/icpr26-95找到。

英文摘要

Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and with SSIM for over 91% of images. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at https://w3id.org/met-dp/icpr26-95
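The two objective functions and the correlation step can be written compactly for the single-threshold case. This is a simplified sketch, not the paper's code: it handles one threshold rather than multilevel thresholding, works on a plain histogram, and leaves out the SSIM and PSNR computation; the function names are chosen here for illustration.

```python
import math
from typing import Sequence


def otsu_between_class_variance(hist: Sequence[float], t: int) -> float:
    """Between-class variance when bins 0..t form class 0 and the rest class 1."""
    total = float(sum(hist))
    n0 = float(sum(hist[:t + 1]))
    n1 = total - n0
    if n0 == 0 or n1 == 0:
        return 0.0
    mu0 = sum(i * h for i, h in enumerate(hist[:t + 1])) / n0
    mu1 = sum(i * h for i, h in enumerate(hist) if i > t) / n1
    return (n0 / total) * (n1 / total) * (mu0 - mu1) ** 2


def _entropy(counts: Sequence[float]) -> float:
    """Shannon entropy (natural log) of the normalized counts."""
    n = float(sum(counts))
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def kapur_entropy(hist: Sequence[float], t: int) -> float:
    """Sum of the entropies of the two classes split at threshold t."""
    lower, upper = hist[:t + 1], hist[t + 1:]
    if sum(lower) == 0 or sum(upper) == 0:
        return 0.0
    return _entropy(lower) + _entropy(upper)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

To mirror the analysis described above, one would evaluate each objective at every threshold, compute PSNR or SSIM of the thresholded image at the same thresholds, and pass the two sequences to `pearson`. A high correlation means that maximizing the objective also tends to maximize the metric, which is the bias the paper examines.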

2605.27129 2026-05-27 cs.CV cs.RO 版本更新

YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting

YOLO26-RipeLoc Lite:用于温室机器人采摘中番茄成熟度检测与采摘点定位的轻量级架构

Rajmeet Singh, Manveen Kaur, Shahpour Alirezaee, Irfan Hussain

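Affiliations * Department of Mechanical Engineering; Khalifa University; University of Windsor

AI summary Proposes YOLO26-RipeLoc Lite, a lightweight YOLO26-based architecture that uses a Lightweight Feature Pyramid Network, a Ripeness-Aware Attention Module, and a Compact Detection Head to perform ripeness classification and center-point localization of greenhouse tomatoes, reaching 92.9% mAP@0.5 with only 2.38M parameters.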

详情
AI中文摘要

在温室番茄生产中,自动化收获需要准确检测成熟番茄、进行成熟度分类,并为机器人末端执行器精确定位采摘点。本文提出YOLO26-RipeLoc Lite,一种基于YOLO26的轻量级深度学习架构,用于同时检测、成熟度分类和温室番茄的中心点定位。该模型引入了三项改进:(1) 轻量特征金字塔网络(LFPN),采用深度可分离卷积实现高效多尺度融合;(2) 成熟度感知注意力模块(RAAM),具有双池化和可学习的成熟度偏置向量,增强颜色纹理区分能力;(3) 紧凑检测头(CDH),采用共享卷积和集成的中心点回归分支,用于直接抓取规划。该模型在来自阿联酋阿布扎比SILAL温室的自定义数据集(1500张图像,6227个实例,其中3566个成熟,2661个未成熟)上进行评估。YOLO26-RipeLoc Lite在仅使用2.38M参数的情况下,实现了92.9%的mAP@0.5(成熟95.2%,未成熟90.6%),在所有评估架构中精度最高(95.2%)。训练后批量归一化剪枝30%可将参数减少至约1.8M,且精度损失可忽略。消融研究证实,温室感知的HSV增强提供了最大的改进(+2.02个百分点 mAP@50),骨干网络冻结达到了峰值精度(93.8%),而三阶段渐进解冻获得了最佳的定位质量(mAP@50:95为64.6%)。与YOLOv8n/s、YOLO11n/s、YOLO12n/s和YOLO26s的比较证实了其优越的精度-效率:比YOLO12n精度高2.9个百分点,参数少7.0%,并集成了用于机器人末端执行器引导的中心点定位。

英文摘要

In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.
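The parameter savings from depthwise separable convolutions, which the LFPN relies on, follow from a simple count. The sketch below compares a standard k×k convolution with its depthwise-plus-pointwise factorization, ignoring biases and normalization layers; the layer sizes in the usage note are arbitrary examples, not values from the paper.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def reduction_ratio(k: int, c_in: int, c_out: int) -> float:
    """Separable / standard parameter ratio, equal to 1/c_out + 1/k**2."""
    return depthwise_separable_params(k, c_in, c_out) / conv_params(k, c_in, c_out)
```

For a 3×3 layer with 64 input and 128 output channels, `conv_params(3, 64, 128)` is 73,728 while `depthwise_separable_params(3, 64, 128)` is 8,768, roughly 12% of the original. Savings of this kind in the neck are one way a detector can stay within a budget of a few million parameters.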

2605.27128 2026-05-27 cs.CV cs.LG 版本更新

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

PILOT: 一种基于边界引导的无数据持续学习方法用于实时语义分割

Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu

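Affiliations * Department of Mathematics, College of Arts and Sciences, Embry-Riddle Aeronautical University; Department of Electrical Engineering and Computer Science, College of Engineering, Embry-Riddle Aeronautical University

AI summary Proposes PILOT, a framework that freezes the original network's parameters and adds a parallel Derivative branch to capture boundary information of novel classes, letting real-time semantic segmentation models learn incrementally without old data while mitigating catastrophic forgetting.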

详情
AI中文摘要

实时语义分割模型在准确性和推理速度之间取得了极好的平衡。然而,将这些模型部署在动态的真实世界环境中,通常需要能够在不重新训练整个数据集的情况下增量地学习新类别。这种能力被称为持续学习。在这方面,深度学习中的标准微调方法常常因灾难性遗忘而失败,即模型学习新信息但忘记了先前训练和学习的类别。针对这一关键领域,本文提出了一种针对PIDNet的新型持续学习框架,PIDNet是一种被广泛引用的最先进的实时语义分割模型。我们的方法PILOT(并行增量学习随时间)通过实现一个并行导数分支(D-branch)引入了一种实时且轻量级的策略,该分支旨在捕获新类别的高频边界信息,同时冻结原始分割网络的训练参数。这种新颖的设置允许模型适应新的语义类别,同时保留先前学习类别的知识。通过仅使用与新类别相关的数据,我们的模型显著减少了训练开销。实验结果表明,我们的方法成功分割了新类别,同时在原始基类上保持了较高的平均交并比(mIoU),从而在该领域轻松超越了所有主要的持续学习方法。总体而言,PILOT被证明能有效缓解灾难性遗忘,同时对推理延迟影响最小,从而保持实时性能。

英文摘要

Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real-world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. Standard fine-tuning methods in deep learning often fail here because of catastrophic forgetting, where the model learns new information but forgets previously learned classes. This paper proposes a novel continual learning framework tailored for PIDNet, a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high-frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.

2605.27116 2026-05-27 cs.CV 版本更新

COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection

COVD: 通过新概念注入的持续开放词汇目标检测

Yupeng Zhang, Ruize Han, Yuzhong Feng, Zixin Ren, Yuntong Tian, Liang Wan

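Affiliations * Tianjin University; Shenzhen University of Advanced Technology

AI summary Introduces COVD, a new continual open-vocabulary object detection task, and NoIn-Det, which freezes the visual encoder and injects novel concepts by updating only a small subset of text-branch parameters, enabling efficient continual learning without additional parameters.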

详情
AI中文摘要

开放词汇目标检测(OVD)取得了显著进展,使检测器能够从已见类别泛化到未见类别。然而,现实世界的类别空间不断演变,现有的OVD模型仍然难以处理新出现的概念,而重复的完全重新训练成本过高。为此,我们引入了一个新的任务设置,称为持续开放词汇目标检测与新概念注入(COVD),其中模型顺序学习传入的新概念组,同时保留先前的概念和原始的开放词汇知识,并附带一个新的基准Novel-114。我们的关键观察是,预训练的视觉编码器通常已经感知并表示了众多新概念,主要瓶颈在于视觉表示与文本概念之间缺乏稳定的语义对齐。基于此,我们提出了NoIn-Det,一个无需额外参数的高效持续注入框架。NoIn-Det冻结视觉编码器,仅使用常见概念和先前注入概念的文本来保留文本表示空间,并通过仅更新有利于新概念学习的少量文本分支参数来注入新概念。大量实验表明,NoIn-Det在不引入额外参数的情况下,有效学习了新概念,保留了旧知识,并持续优于现有的VLM持续学习方法。Novel-114和代码将发布。

英文摘要

Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters. Novel-114 and the code will be released.

2605.27101 2026-05-27 cs.CV cs.CL 版本更新

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

弹出式干扰揭示视频大语言模型中的事件袋行为

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

发表机构 * Texas A&M University(德克萨斯A&M大学) National Taiwan University(国立台湾大学) Stanford University(斯坦福大学) VinUniversity(Vin大学)

AI总结 通过插入无关广告片段,发现视频大语言模型常将不同片段的事件错误关联,表现出将视频视为事件集合而非时间序列的“事件袋”行为。

详情
AI中文摘要

视频理解的一个关键能力是跨时间可靠地将主体与事件联系起来,然而视频大语言模型(VideoLLMs)是否真正实现了这一点仍不清楚。在这项工作中,我们引入了DistractionBench来评估VideoLLMs在存在无关视频片段的情况下是否能稳健地关联主体和事件。通过受控干预,例如在较长视频中插入短广告片段,我们表明VideoLLMs经常幻觉出不同片段中实体之间的交互,错误地将注入广告中的动作归因于主视频中的主体。我们将这种系统性幻觉表征为事件袋(BoE)行为,其中模型将视频视为事件的集合而非时间结构化的序列。评估11个流行的VideoLLMs,我们发现所有模型都表现出显著的BoE行为。我们的发现表明VideoLLMs缺乏可靠的时间定位机制,并激励开发具有更稳健主体-事件关联的模型。

英文摘要

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.

2605.27080 2026-05-27 cs.CV 版本更新

Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning

基于解耦子空间对比学习的半监督视线估计

Qida Tan, Hongyu Yang, Wenchao Du

发表机构 * National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China(合成视觉基础科学国家重点实验室,四川大学,成都,中国) College of Computer Science, Sichuan University, Chengdu, China(计算机学院,四川大学,成都,中国)

AI总结 提出一种半监督学习框架DSCL,通过雅可比正则化解耦特征为俯仰角和偏航角子空间,并利用子空间内序数对比学习,仅用5%-20%标注数据即可达到竞争性能。

Comments ICML2026

详情
AI中文摘要

基于外观的视线估计由于标注样本有限和数据集多样性不足,常面临泛化能力差的问题。主流方法采用弱监督学习从无约束真实场景生成大规模伪标签数据,以缓解域偏移。本文设计了一种简单而有效的半监督学习架构,利用未标注数据增强域泛化,从而减少对劳动密集型人工标注的依赖。我们的关键洞察是施加雅可比正则化,将特征表示解耦为专门针对特定视线组件(如俯仰角和偏航角)的判别性子空间。我们进一步利用每个子空间内的内在序数排序进行对比学习,使模型能够从少量标注样本和大量未标注样本中学习鲁棒的视线表示。最终形成了我们的解耦子空间对比学习(DSCL)框架。在多个基准上的大量实验表明,所提出的DSCL是即插即用的,在域内和跨域评估设置下,仅使用20%、10%甚至5%的标注数据即可达到竞争性能。公开代码见https://github.com/da60266/DSCL。

英文摘要

Appearance-based gaze estimation always suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate the domain shifts. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at https://github.com/da60266/DSCL.

2605.27075 2026-05-27 cs.CV 版本更新

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

SoftCap: 扩散Transformer加速的软预算控制

Yuhang Zhang, Junxiang Qiu, Huixia Ben, Zhenhua Tang, Shuo Wang, Yanbin Hao

发表机构 * Hefei University of Technology(合肥工业大学) University of Science and Technology of China(中国科学技术大学) Anhui University of Science and Technology(安徽理工大学) University of Macau(澳门大学)

AI总结 提出一种无需训练的软预算控制层SoftCap,通过轨迹漂移观测器和软预算PI控制器动态调整全步触发阈值,在保持计算预算软上限的同时提升图像质量。

详情
AI中文摘要

扩散Transformer(DiTs)实现了强大的视觉质量,但其迭代去噪过程需要大量昂贵的Transformer评估。无训练加速方法通过缓存、预测或验证中间特征来降低这一成本,然而何时执行全步的运行时决策通常由固定调度或手动调整的阈值驱动。我们提出SoftCap,一种用于基于缓存的DiT推理的无训练控制层。SoftCap将轨迹漂移观测器(通过轻量级隐藏状态统计估计局部缓存风险)与软预算PI控制器(根据相对于固定参考配置的实际计算调整全步触发阈值)相结合。预算是软上限:它塑造阈值,但不要求运行消耗预定数量的全步评估。在FLUX.1-dev上,在可比的中等计算操作点下,SoftCap优于SpeCa,在几乎相同的FLOPs下将ImageReward从0.967提升至0.981,并将LPIPS-Full从0.518降至0.498,而目标扫描诊断显示随着预算放宽,预期的软上限行为得以实现。

英文摘要

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose SoftCap, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
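The control loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `SoftBudgetPI`, the gains `kp` and `ki`, the additive update rule, and the scalar drift scores are all assumptions made for illustration. The sketch only shows the general idea of a PI controller that raises the Full-step threshold when realized compute runs ahead of a reference profile, so that the budget acts as a soft ceiling and not a fixed quota.

```python
from dataclasses import dataclass


@dataclass
class SoftBudgetPI:
    """Hypothetical PI controller for the Full-step trigger threshold."""
    kp: float = 0.5               # proportional gain (assumed value)
    ki: float = 0.1               # integral gain (assumed value)
    base_threshold: float = 1.0   # threshold when compute matches the reference
    min_threshold: float = 0.05   # floor so the threshold stays positive
    integral: float = 0.0         # accumulated compute error

    def threshold(self, realized_full: int, reference_full: float) -> float:
        # Positive error: more Full steps were spent than the reference allows so far.
        error = realized_full - reference_full
        self.integral += error
        raw = self.base_threshold + self.kp * error + self.ki * self.integral
        return max(self.min_threshold, raw)


def run_schedule(drift_scores, reference_profile, controller):
    """Return per-step decisions: True = run a Full step, False = reuse the cache."""
    decisions = []
    realized = 0
    for drift, ref in zip(drift_scores, reference_profile):
        thr = controller.threshold(realized, ref)
        full = drift > thr          # trigger a Full step only when drift exceeds the threshold
        realized += int(full)
        decisions.append(full)
    return decisions


if __name__ == "__main__":
    # Tight budget: the threshold rises after two Full steps and blocks further ones.
    print(run_schedule([2.0] * 4, [0.0] * 4, SoftBudgetPI()))
    # Relaxed budget: the threshold stays at its floor and every step runs in full.
    print(run_schedule([2.0] * 4, [10.0] * 4, SoftBudgetPI()))
```

In this sketch the budget never forces a fixed number of Full steps; it only moves the threshold, which matches the soft-ceiling behavior the abstract describes in qualitative terms.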

2605.27074 2026-05-27 cs.CV 版本更新

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

IPIBench: 在连续流下评估多模态大模型的交互式主动智能

Jinzhao Li, Yinuo Chen, Wenxuan Song, Yijia Lei, Yichi Zhang, Honglei Yan, Panwang Pan, Miao Liu

发表机构 * College of AI, Tsinghua University(清华大学人工智能学院) ByteDance(字节跳动)

AI总结 提出IPIBench基准,用于评估多模态大模型在流式视频场景中的交互式主动智能,并设计IPI-Agent框架以改善主动触发和交互协调。

详情
AI中文摘要

最近的多模态大模型在反应式问答上表现强劲,但现实世界的流式助手需要对连续视觉输入进行主动推理。现有基准主要研究孤立的单轮设置中的反应式或主动式交互,忽视了用户可能在交错反应式查询中添加、修改或取消主动请求的动态多轮场景。为填补这一空白,我们引入IPIBench,这是首个在流式视频设置下评估多模态大模型交互式主动智能的基准。IPIBench涵盖主动监控、主动任务管理以及交错的反应式-主动式请求。对代表性多模态大模型的评估揭示了两个主要限制:不稳定的主动触发以及反应式和主动行为之间的弱协调。我们进一步提出IPI-Agent,一个无训练的智能体框架,包含交互控制策略和时间门控机制,用于稳定主动触发和协调多轮交互。实验表明,IPI-Agent在所有基准设置上持续改进现有多模态大模型。

英文摘要

Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.

2605.27067 2026-05-27 cs.CV 版本更新

BEAT: Rhythm-Elastic Alignment for Agentic Music-guided Movie Trailer Generation

BEAT: 节奏弹性对齐用于智能音乐引导的电影预告片生成

Yutong Wang, Yunke Wang, Xinyuan Chen, Chang Xu

发表机构 * The University of Sydney(悉尼大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出BEAT框架,通过音乐-视觉对齐编码器MuVA和能量自适应动态规划算法Bar-DP,实现弹性多对一节奏对齐,用于端到端电影预告片生成。

详情
AI中文摘要

自动电影预告片生成必须从整部电影中选择镜头并与背景音乐同步。现有方法要么将音乐对齐归为后处理,要么强制执行刚性的一一对应镜头-音乐映射,忽略了专业剪辑节奏的弹性:快速剪辑伴随高能量段落,而持续镜头跨越较安静的小节。我们提出BEAT,一个解决这一差距的框架,包含两个核心组件:MuVA,一个紧凑的音乐-视觉对齐编码器,通过Sinkhorn正则化的两阶段学习训练;以及Bar-DP,一种能量自适应动态规划算法,根据音乐动态产生弹性的多对一对齐。这些组件被集成到一个五阶段智能管道中,该管道将核心对齐建立在学习的跨模态特征上,同时通过结构化文本信号协调更高层次的创意决策。为了支持全面评估,我们还引入了TrailerArena,一个包含四个互补维度20多个指标的基准。在TrailerArena上,BEAT在镜头选择、排序和感知质量方面实现了最先进的性能,同时端到端地生成完整制作的预告片。

英文摘要

Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.

2605.27032 2026-05-27 cs.CV 版本更新

SCKAN: Structural Consensus-based KAN Prototype Learning for Semi-Supervised Pancreas Segmentation

SCKAN: 基于结构一致性的KAN原型学习用于半监督胰腺分割

Yuqi Liu, Yufei Chen, Wei Fu, Xiaodong Yue, Shuo Li

发表机构 * School of Computer Science and Technology, Tongji University, Shanghai, China(同济大学计算机科学与技术学院) Artificial Intelligence Institute, Shanghai University, Shanghai, China(上海大学人工智能研究院) Department of Computer and Data Science and Department of Biomedical Engineering, Case Western Reserve University, Cleveland, USA(凯斯西储大学计算机与数据科学系及生物医学工程系)

AI总结 针对半监督胰腺分割中稀疏监督导致的监督偏差问题,提出基于结构一致性的KAN原型学习方法(SCKAN),通过跨样本结构一致性学习和KAN自适应融合实现更泛化且准确的分割。

Comments 10.5 pages, 5 figures, Medical Image Computing and Computer Assisted Intervention 2026

详情
AI中文摘要

精确的胰腺分割对于早期癌症诊断至关重要,而标注稀缺使得半监督学习(SSL)成为必要。然而,由于样本间显著的形态变异性,现有SSL方法在稀疏监督下存在严重的泛化限制,导致监督偏差问题。为解决这一问题,我们提出了基于结构一致性的KAN原型学习(SCKAN),该方法首次利用Kolmogorov-Arnold网络(KANs)构建跨样本结构一致性学习,以实现更泛化和准确的分割。具体而言,SCKAN包含两个关键设计:结构约束的原型一致性学习(SPCL),通过原型级对比优化强制跨样本一致性,促进无偏结构表示;以及基于一致性的Kolmogorov-Arnold融合(CKaF),通过KAN的自适应B样条非线性聚合稳定一致性并过滤样本特定噪声,减少形态特异性偏差。在两个公开胰腺数据集上的大量实验证明了SCKAN的有效性。代码位于https://github.com/rhodaliu17/SCKAN。

英文摘要

Accurate pancreas segmentation is critical for early cancer diagnosis, where annotation scarcity necessitates Semi-Supervised Learning (SSL). However, due to significant inter-sample morphological variability, existing SSL methods face severe generalizability limitations under sparse supervision, leading to the Supervision Bias problem. To address this, we propose Structural Consensus-based KAN Prototype Learning (SCKAN), which constructs the first cross-sample structural consensus learning with Kolmogorov-Arnold Networks (KANs), to achieve more generalizable and accurate segmentation. Specifically, SCKAN contains two key designs: Structure-constrained Prototype Consistency Learning (SPCL), which prompts unbiased structural representation by enforcing cross-sample consistency via prototype-level contrastive optimization, and Consensus-based Kolmogorov-Arnold Fusion (CKaF), which reduces morphology-specific bias by aggregating stable consensus and filtering sample-wise noise via KAN's adaptive B-spline nonlinearity. Extensive experiments on two public pancreas datasets demonstrate the effectiveness of SCKAN. Code is at https://github.com/rhodaliu17/SCKAN.

2605.27024 2026-05-27 cs.CV cs.MM 版本更新

NeR-SC: Adapting Neural Video Representation to Screen Content

NeR-SC:适应屏幕内容的神经视频表示

Ruohan Shi, Jiaoyan Zhao, Haogang Feng

发表机构 * Management school(管理学院) The University of Sheffield(谢菲尔德大学) Undergraduate School of Artificial Intelligence(人工智能本科学院) Shenzhen Polytechnic University(深圳职业技术大学) School of Artificial Intelligence(人工智能学院) Shenzhen University of Information Technology(深圳信息大学)

AI总结 提出NeR-SC框架,通过可学习调色板、多门密集融合和嵌入级帧跳过策略,针对屏幕内容视频的离散颜色、强时间冗余等特性进行优化,在低码率下超越H.264/H.265。

Comments Submitted to PRMVAI 2026

详情
AI中文摘要

隐式神经表示已成为视频压缩的一种有前景的范式,最近的方法在自然视频上取得了有竞争力的性能。然而,屏幕内容视频——常见于远程桌面、在线教育和云游戏——表现出独特的统计特性:锐利边缘、有限调色板和强时间冗余。现有的为自然场景设计的神经表示方法缺乏利用这些特性的机制,留下了很大的改进空间。在本文中,我们提出了NeR-SC,一个为屏幕内容视频量身定制的神经表示框架。基于SNeRV骨干网络,NeR-SC引入了三个屏幕内容特定模块:(i) 可学习调色板,通过将低频子带限制到学习到的颜色集来建模屏幕内容的离散颜色结构;(ii) 多门密集融合模块,用密集的、注意力门控的跨阶段交互替代顺序特征融合;(iii) 嵌入级帧跳过策略,绕过静态帧的冗余解码器调用,且零训练开销。在DSCVC和VCD上的实验表明,NeR-SC实现了40.32 dB和41.73 dB的平均PSNR,优于代表性的神经视频表示方法,并且在低码率下超越了H.264和H.265。帧跳过策略实现了实时解码且质量无损失。

英文摘要

Implicit neural representations have emerged as a promising paradigm for video compression, with recent methods achieving competitive performance on natural video. However, screen content video -- common in remote desktop, online education, and cloud gaming -- exhibits distinct statistics: sharp edges, limited color palettes, and strong temporal redundancy. Existing neural representation methods, designed for natural scenes, lack mechanisms to exploit these properties, leaving substantial room for improvement. In this paper, we propose NeR-SC, a neural representation framework tailored for screen content video. Building on the SNeRV backbone, NeR-SC introduces three screen-content-specific modules: (i) a learnable color palette that models the discrete color structure of screen content by restricting the low-frequency sub-band to a learned color set; (ii) a multi-gate dense fusion module that replaces sequential feature fusion with dense, attention-gated cross-stage interaction; and (iii) an embedding-level frame skip strategy that bypasses redundant decoder invocations for static frames, with zero training overhead. Experiments on DSCVC and VCD show that NeR-SC achieves 40.32 dB and 41.73 dB average PSNR, outperforming representative neural video representation methods and, at low bitrates, surpassing H.264 and H.265. The skip strategy enables real-time decoding with no loss in quality.

2605.27020 2026-05-27 cs.CV cs.AI 版本更新

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

黑盒成员推断攻击:针对图像生成模型的预训练数据

Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

AI总结 提出一种基于跨模态数据扰动的黑盒成员推断攻击框架SD-MIA,通过分析扩散模型对目标图像和扰动文本指令的去噪过程,有效检测预训练数据中的成员关系。

Comments 13 pages, 9 figures; CVPR 2026 camera-ready

详情
AI中文摘要

基于扩散的图像生成模型的快速发展引发了对涉及人类创建数据的潜在版权和隐私侵犯的严重担忧。成员推断攻击(MIA)已成为识别模型训练期间未经授权数据使用的有前景工具。现有方法通常评估模型对扰动嫌疑图像的去噪能力作为成员状态的指标。然而,此类特征的判别能力高度依赖于模型记忆程度,并且在应用于曝光较少的数据(例如预训练数据)时显著下降。尽管有几种方法尝试通过利用模型内部特征来增强检测,但这些特征在主流闭源图像生成平台中通常不可访问,限制了其实用性。在本文中,我们证明分析黑盒扩散模型如何对目标图像和相应的扰动文本指令进行去噪可以揭示更具区分性的成员线索。基于这一见解,我们提出了一种名为SD-MIA的黑盒成员推断攻击框架,该框架利用跨模态数据扰动机制来检测扩散模型中的预训练数据。我们在一个公共基准数据集和一个新构建的数据集上进行了广泛实验,每个数据集包含具有相同分布的预训练成员和非成员样本。实验结果表明,SD-MIA相比现有基线(包括那些具有不公平访问模型内部特征优势的基线)实现了更优的性能。

英文摘要

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess the ability of a model to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training membership and non-membership samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

2605.27003 2026-05-27 cs.CV cs.AI 版本更新

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

时间步感知的 SVDQuant-GPTQ 用于 Wan2.2-I2V 的 W4A4 量化

Junhao Wu, Dezhong Yao, Hai Jin

发表机构 * National Engineering Research Center for Big Data Technology and System(大数据技术与系统国家工程研究中心) Services Computing Technology and System Lab(服务计算技术与系统实验室) Cluster and Grid Computing Lab(集群与网格计算实验室) School of Computer Science and Technology(计算机科学与技术学院) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对 Wan2.2-I2V 视频扩散 Transformer 的 W4A4 量化,提出结合 SVDQuant 低秩异常补偿、GPTQ 重建感知残差权重量化和时间步分箱逐层激活裁剪比搜索的后训练量化框架,在 OpenS2V-Eval 上降低 59.3% 峰值显存且仅损失 0.9% VBench 平均分。

详情
AI中文摘要

大型视频扩散 Transformer 的 W4A4 量化提供了显著的内存节省,但面临两个主要挑战:稀疏的大幅度激活异常值,以及跨多步去噪轨迹的强时间步依赖的激活分布。这些困难因 Wan2.2-I2V 的双专家混合专家 DiT 设计而加剧,其高噪声和低噪声专家表现出不同的量化敏感性,单一全局校准策略无法捕捉。我们提出了一种后训练量化框架,结合基于 SVDQuant 的低秩异常补偿、基于 GPTQ 的重建感知残差权重量化,以及针对每个专家独立进行的时间步分箱逐层激活裁剪比搜索。在 OpenS2V-Eval 基准上,我们的方法相对于 BF16 基线将峰值 GPU 内存降低了 59.3%,同时仅导致 VBench 平均分数下降 0.9%,成像质量下降 2.3%,表明专家和时间步感知的校准对于 MoE 视频 DiT 的高保真 W4A4 推理至关重要。

英文摘要

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.

2605.26992 2026-05-27 cs.CV 版本更新

On the Robustness of Machine Unlearning for Vision-Language Models

机器遗忘在视觉-语言模型中的鲁棒性研究

Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su

发表机构 * Xiamen University(厦门大学)

AI总结 本文首次系统调查了视觉-语言模型机器遗忘的鲁棒性,通过提出三种攻击范式揭示现有方法往往隐藏而非彻底移除目标知识。

详情
AI中文摘要

视觉-语言模型(VLM)可能会记忆训练数据中的不良信息,这激发了人们对机器遗忘的兴趣。在这项工作中,我们首次对VLM遗忘进行了系统调查和鲁棒性分析。我们提供了现有VLM遗忘方法的全面分类和回顾,以及在多种提示设置下的统一评估。然后,我们提出了三种攻击范式,以检验被遗忘的多模态知识是否可以通过上下文提示或下游微调重新激活。大量实验表明,许多现有方法在这些攻击下仍然脆弱,这表明当前方法往往隐藏而非完全移除目标知识。我们的研究为当前VLM遗忘方法的鲁棒性和局限性提供了新见解,并强调了需要更可靠的多模态遗忘策略。代码可在https://github.com/XMUDeepLIT/VLM-UnL-Attack获取。

英文摘要

Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.

2605.26967 2026-05-27 cs.CV 版本更新

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

CodecCap: 高保真度编解码器启发的残差建模用于密集视频字幕生成

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

发表机构 * ERNIE Team, Baidu(百度ERNIE团队) College of Artificial Intelligence, Inner Mongolia University(内蒙古大学人工智能学院)

AI总结 提出CodecCap框架,通过关键帧和残差字幕模拟视频编解码器,在保持细粒度视觉证据的同时减少冗余,并引入VidCapQA基准验证其高保真度。

Comments 11 pages, 4 figures

详情
AI中文摘要

现有的视频字幕方法难以平衡视觉保真度和冗余:整体字幕紧凑但丢失细粒度证据,而分段字幕改善覆盖但引入大量冗余。我们提出CodecCap,一种受编解码器启发的高保真度密集视频字幕框架。类似于视频编解码器,CodecCap使用关键帧和残差字幕表示视频。关键帧字幕详尽编码稳定的视觉上下文,而残差字幕仅捕获时间上局部的动作、运动和变化。这有效保留了细粒度视觉证据,同时减少冗余描述。为了量化字幕的保真度,我们引入VidCapQA,一个包含14个能力维度1000个问题的字幕-问答基准。VidCapQA上的结果表明,强VLM直接生成的字幕仍然遗漏许多视觉细节,突显字幕表示是关键瓶颈。实验表明,CodecCap显著超越使用相同底层VLM的直接字幕生成,表明关键帧-残差字幕是一种高保真度视频-语言监督的方式。我们进一步使用CodecCap构建CodecVDC-100K,一个包含锚点、残差、场景级和视频级监督的大规模密集字幕数据集。

英文摘要

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a way to obtain high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.

2605.26949 2026-05-27 cs.CV cs.GR 版本更新

DinoComplete: 3D Shape Completion with Distilled Semantic Priors and State Space Models

DinoComplete: 利用蒸馏语义先验和状态空间模型进行3D形状补全

Furkan Mert Algan, Eckehard Steinbach

发表机构 * Chair of Media Technology(媒体技术教席) Munich Institute of Robotics and Machine Intelligence(慕尼黑机器人与机器智能研究所) School of Computation, Information and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院)

AI总结 提出DinoComplete框架,通过从DINO特征中蒸馏语义先验并结合多尺度体素Mamba模块,实现高效、鲁棒的3D形状补全,在未见类别和真实噪声扫描上优于现有方法。

详情
AI中文摘要

从部分扫描进行3D形状补全对于未见类别和嘈杂的真实世界观测仍然具有挑战性,因为仅凭几何信息往往不足以推断缺失结构。我们提出了DinoComplete,一个确定且高效的形状补全框架,通过从DINO特征中蒸馏的体素对齐语义先验来增强几何重建。首先,我们构建与ShapeNet数据对齐的多视图DINO特征体积,并训练一个学生网络直接从不完整形状预测密集语义特征。这些预测特征捕获全局结构和部分感知的语义上下文,同时与底层几何保持对齐。然后,我们将这些蒸馏特征集成到一个补全网络中,其中几何和语义体素表示通过体素状态空间建模进行融合。为了在不牺牲分辨率的情况下实现高效的长距离推理,我们引入了一个多尺度体素Mamba模块,通过结合全网格和分块序列建模来细化融合特征。在未见过的ShapeNet类别和ScanNet物体上的实验表明,DinoComplete在使用更少参数、更低内存和更快推理速度的同时,实现了比先前确定性和基于生成的方法更强的补全质量。我们的结果表明,从视觉基础模型中蒸馏语义先验提高了3D形状补全的泛化能力和鲁棒性。

英文摘要

3D shape completion from partial scans remains challenging for unseen categories and noisy real-world observations, where geometry alone is often insufficient for inferring missing structure. We present DinoComplete, a deterministic and efficient shape completion framework that augments geometric reconstruction with voxel-aligned semantic priors distilled from DINO features. First, we construct multi-view DINO feature volumes aligned with ShapeNet data and train a student network to predict dense semantic features directly from incomplete shapes. These predicted features capture global structure and part-aware semantic context while remaining aligned with the underlying geometry. We then integrate these distilled features into a completion network, where geometric and semantic voxel representations are fused through voxel state-space modeling. To enable efficient long-range reasoning without sacrificing resolution, we introduce a multi-scale voxel Mamba module that refines the fused features by combining full-grid and chunk-wise sequence modeling. Experiments on unseen ShapeNet categories and ScanNet objects show that DinoComplete achieves stronger completion quality than prior deterministic and generation-based completion methods while using fewer parameters, requiring lower memory, and achieving faster inference. Our results demonstrate that distilling semantic priors from visual foundation models improves generalization and robustness in 3D shape completion.

2605.26944 2026-05-27 cs.RO cs.CV 版本更新

Object Pose and Shape Estimation for Grasping: Does it Work?

用于抓取的目标姿态与形状估计:有效吗?

Pavan Karke, Kushal Shah, Gaurav Singh, Md Faizal Karim, K Madhava Krishna, Rajat Talak

发表机构 * Robotics Research Center, IIIT Hyderabad(IIIT海得拉巴机器人研究中心) National University of Singapore(新加坡国立大学)

AI总结 本文通过对比端到端抓取合成方法与模块化方法(先估计目标姿态和形状再采样抓取),评估现有姿态和形状估计方法在抓取任务中的有效性。

Comments 9 pages, 8 figures

详情
AI中文摘要

目标姿态和形状估计问题近年来取得了关键进展。编码器-解码器(如SAM3D、LRM、CRISP)和基于扩散的模型(如InstantMesh、Zero123、SceneComplete)展示了类别无关的形状编码能力和开放集泛化性。在这项工作中,我们提出一个问题:当与对极抓取采样结合使用时,目标姿态和形状估计方法是否足够成熟,以至于能够超越端到端抓取合成方法?我们通过将研究范围限定在平行颚夹爪、7自由度抓取和单视图RGB(-D)图像输入,详细探讨了这个问题。我们实现并比较了一种最先进的端到端抓取合成方法和三种模块化方法,这些方法首先估计场景中所有目标的姿态和形状,然后使用对极采样生成抓取。我们观察到,在所有实验中,模块化方法均优于端到端方法。模块化方法能够合成大量抓取,即使是对于端到端方法失败的小目标也是如此。模块化方法的有效性取决于姿态和形状估计的准确性,并且在杂乱场景中会部分退化——这是现有姿态和形状估计方法的局限性。我们还分析了三种模块化方法的失败模式和运行时间,这些方法使用了两种不同的目标姿态和形状估计方式:一种基于编码器-解码器模型,另一种基于扩散模型。最后,我们证明单视图目标姿态和形状估计方法可以与视觉语言模型结合,仅从单视图RGB-D图像输入即可产生语言条件抓取。我们注意到其性能与最先进的LERF-TOGO基线相当。

英文摘要

The problem of object pose and shape estimation has seen key advancements lately. Encoder-decoder (e.g., SAM3D, LRM, CRISP) and diffusion-based models (e.g., InstantMesh, Zero123, SceneComplete) have shown category-agnostic shape encoding capacity and open-set generalizability. In this work, we ask the question: Are object pose and shape estimation methods mature enough that, when used with antipodal grasp sampling, they can outperform end-to-end grasp synthesis methods? We explore this question in detail by scoping our study to parallel jaw grippers, 7-DoF grasps, and a single-view RGB(-D) image as input. We implement and compare a state-of-the-art, end-to-end grasp synthesis method and three modular methods, which first estimate the object pose and shape for all objects in the scene and then generate grasps using antipodal sampling. We observe that the modular methods outperform the end-to-end method in all our experiments. The modular methods are able to synthesize plenty of grasps, even for small objects, where the end-to-end method fails. The effectiveness of the modular methods is contingent on the accuracy of the pose and shape estimation, and suffers partial degradation in cluttered scenes, a limitation of the existing pose and shape estimation methods. We also analyze the failure modes and run-times for the three modular methods, which use two different ways of object pose and shape estimation: one based on an encoder-decoder model and the other on a diffusion model. Finally, we demonstrate that the single-view object pose and shape estimation methods can be augmented with vision-language models to yield language-conditioned grasps from just a single-view RGB-D image as input. We notice comparable performance to the state-of-the-art LERF-TOGO baseline.
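For readers unfamiliar with the antipodal sampling mentioned in this abstract, the sketch below shows the standard antipodal test for a parallel-jaw grasp: two contact points form a valid pair when the line connecting them lies inside the friction cone at both contacts. This is a textbook formulation and not the paper's code; the function name `is_antipodal`, the default friction coefficient, and the toy cube contacts are assumptions chosen for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(a):
    n = math.sqrt(_dot(a, a))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / n for x in a)


def is_antipodal(p1, n1, p2, n2, friction_coef=0.5):
    """Check the antipodal condition for contacts p1, p2 with outward normals n1, n2.

    friction_coef is an assumed Coulomb friction coefficient; the friction cone
    half-angle is atan(friction_coef).
    """
    if p1 == p2:
        return False
    axis = _unit(_sub(p2, p1))              # grasp axis from contact 1 to contact 2
    n1, n2 = _unit(n1), _unit(n2)
    half_angle = math.atan(friction_coef)
    # At p1 the gripper pushes along +axis, which must stay close to the inward normal -n1.
    cos1 = max(-1.0, min(1.0, -_dot(n1, axis)))
    # At p2 the gripper pushes along -axis, which must stay close to the inward normal -n2.
    cos2 = max(-1.0, min(1.0, _dot(n2, axis)))
    return math.acos(cos1) <= half_angle and math.acos(cos2) <= half_angle


if __name__ == "__main__":
    # Opposite faces of a cube: a valid antipodal pair.
    print(is_antipodal((-1, 0, 0), (-1, 0, 0), (1, 0, 0), (1, 0, 0)))
    # Adjacent faces: the grasp axis leaves both friction cones.
    print(is_antipodal((-1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, 1, 0)))
```

In a modular pipeline of the kind the abstract describes, a test like this would be applied to point pairs sampled from the estimated object mesh, which is why grasp quality depends directly on the accuracy of the estimated shape.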

2605.26933 2026-05-27 cs.CV 版本更新

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

利用文本到图像扩散模型进行无监督视觉目标跟踪

Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

发表机构 * Information Systems Technology and Design Pillar, Singapore University of Technology and Design(新加坡科技设计大学信息系统技术与设计学院) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University(武汉大学测绘遥感信息工程国家重点实验室) Department of Computer Science and Engineering, University at Buffalo, State University of New York(纽约州立大学布法罗分校计算机科学与工程系) School of Computer Science, Wuhan University(武汉大学计算机学院)

AI总结 提出Diff-Tracking方法,利用预训练文本到图像扩散模型的跨注意力机制,通过初始提示学习器和在线提示更新器实现无监督目标跟踪。

Comments Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2026

详情
AI中文摘要

无监督视觉目标跟踪是一项具有挑战性的任务,需要在没有真实标注训练的情况下跟踪视频中的任意目标。尽管取得了显著进展,现有的最先进无监督跟踪器在处理需要细粒度理解视频帧内语义和视觉结构信息的场景时仍常遇到困难。文本到图像扩散模型以其生成准确反映输入提示中描述的语义和结构的图像的能力而闻名,展现出对视觉语义和结构的强大把握。基于这一能力,我们从新的角度处理无监督跟踪,利用预训练文本到图像扩散模型中编码的丰富语义知识。为了将原本用于图像生成的扩散模型适应到跟踪任务,我们将其重新解释为文本和图像模态之间的桥梁。这种连接通过跨注意力机制实现:当文本和图像同时输入模型时,模型会突出显示与文本语义对齐的图像区域(在跨注意力图中)。因此,我们学习一个表示跟踪目标的提示,并在每一帧中激活其在跨注意力图中的对应区域,从而利用扩散模型实现目标跟踪。具体来说,我们的方法Diff-Tracking由两个主要部分组成:初始提示学习器和在线提示更新器。初始提示学习器生成一个捕获第一帧中目标对象的提示,使扩散模型能够识别目标。在线提示更新器基于运动信息优化提示,实现跨视频帧的一致跟踪。我们在六个具有挑战性的跟踪数据集上评估了我们的方法,证明了其有效性。

英文摘要

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often struggle in scenarios that demand fine-grained understanding of semantic and visual structural information within video frames. Text-to-image diffusion models are well known for their ability to generate images that accurately reflect the semantics and structures described in the input prompt, demonstrating a strong grasp of visual semantics and structures. Building on this capability, we approach unsupervised tracking from a new perspective by exploiting the rich semantic knowledge encoded in pretrained text-to-image diffusion models. To adapt the diffusion models, which are originally developed for image generation, to the tracking task, we reinterpret the models as a bridge between text and image modalities. This connection is realized through the cross-attention mechanism: when both text and an image are input into the models, they highlight the regions of the image that are semantically aligned with the text in the cross-attention maps. We therefore learn a prompt that represents the tracking target and activates its corresponding region in the cross-attention map for each frame, which enables object tracking with the diffusion model. Specifically, our method Diff-Tracking is composed of two main components: an initial prompt learner and an online prompt updater. The initial prompt learner generates a prompt that captures the target object in the first frame, allowing the diffusion model to identify the target. The online prompt updater refines the prompt based on motion information, enabling consistent tracking across video frames. Evaluations on six challenging tracking datasets demonstrate the effectiveness of our approach.

2605.26894 2026-05-27 cs.CV 版本更新

SIMPC: Learning Self-Induced Mirror-Point Consistency for Unsupervised Point Cloud Denoising

SIMPC: 学习自诱导镜像点一致性用于无监督点云去噪

Chengwei Zhang, Xueyi Zhang, Tao Jiang, Xinhao Xu, Wenjie Li, Fubo Zhang, Longyong Chen

发表机构 * National Key Laboratory of Microwave Imaging, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, China(微波成像国家重点实验室,航天信息研究所,中国科学院,北京,中国) School of Computing, National University of Singapore, Singapore(计算学院,新加坡国立大学,新加坡)

AI总结 提出自诱导镜像点一致性(SIMPC)方法,通过几何先验生成镜像点并约束去噪目标一致性,实现无监督点云去噪,在合成和真实数据集上超越现有无监督及部分有监督方法。

Comments Accepted by ICML 2026. 17 pages, 8 figures, 8 tables

详情
AI中文摘要

在点云中,噪声直接扰动编码空间位置和几何形状的点坐标,使得构建一一对应关系比图像更具挑战性。现有方法通过噪声或最优传输在噪声变体之间施加统计映射,但存在对应歧义。本文提出自诱导镜像点一致性(SIMPC),以无监督方式学习点与潜在表面之间的确定性对应关系。对于每个噪声点,SIMPC在去噪过程中根据几何先验在潜在表面的另一侧生成一个镜像点。通过鼓励原始点与其镜像点的去噪目标之间的一致性,SIMPC有效定位潜在表面的位置。在合成和真实数据集上的大量实验表明,SIMPC显著优于最先进的无监督方法,并超越了几种强监督方法。

英文摘要

In point clouds, noise directly perturbs point coordinates that encode both spatial location and geometry, making one-to-one correspondence construction more challenging than in images. Existing methods impose statistical mappings across noisy variants via noise or optimal transport, but suffer from correspondence ambiguity. In this work, we propose Self-Induced Mirror-Point Consistency (SIMPC) to learn deterministic correspondences between points and the underlying surface in an unsupervised manner. For each noisy point, SIMPC generates a mirror-point on the opposite side of the underlying surface, guided by geometric priors during the denoising process. By encouraging consistency between the denoising targets of the original point and its mirror counterpart, SIMPC effectively localizes the position of the underlying surface. Extensive experiments on synthetic and real-world datasets demonstrate that SIMPC significantly outperforms state-of-the-art unsupervised methods and surpasses several strong supervised counterparts.
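
For intuition only, the idea of a point "on the opposite side of the underlying surface" can be pictured as a reflection across a local plane. The sketch below assumes the surface is approximated near the noisy point by a plane given as a point plus a normal; the abstract does not say how SIMPC builds its mirror-points, so the function name `mirror_point` and the plane approximation are illustrative assumptions, not the authors' method.

```python
import math


def mirror_point(p, plane_point, normal):
    """Reflect 3D point p across the plane through plane_point with the given normal.

    Hypothetical helper: the local-plane approximation is an assumption
    made for illustration, not a detail taken from the SIMPC abstract.
    """
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("normal must be non-zero")
    unit = [c / length for c in normal]
    # Signed distance from p to the plane along the unit normal.
    signed = sum((pi - ci) * ni for pi, ci, ni in zip(p, plane_point, unit))
    # Move twice that distance against the normal to land on the other side.
    return [pi - 2.0 * signed * ni for pi, ni in zip(p, unit)]


noisy = [0.0, 0.0, 0.3]
mirrored = mirror_point(noisy, [0.0, 0.0, 0.0], [0.0, 0.0, 2.0])
# The midpoint of a point and its reflection lies on the plane.
midpoint = [(a + b) / 2.0 for a, b in zip(noisy, mirrored)]
```

Under this assumption, a consistency objective would ask a denoiser to map both `noisy` and `mirrored` to the same surface location, which here is their midpoint.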

2605.26879 2026-05-27 cs.CV 版本更新

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

通过对齐单目视频中的高阶时间动态恢复自然人体运动

Dingkun Wei, Zehong Shen, Yan Xia, Georgios Pavlakos, Yujun Shen, Xiaowei Zhou

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出HTD-Refine框架,利用PVA-Net估计的高阶时间动态(速度和加速度)优化全局轨迹,恢复自然人体运动。

Comments 13 pages, 6 figures. Accepted as an Oral presentation and Best Paper Candidate at CVPR 2026. Project page: https://zju3dv.github.io/htd-refine/

详情
AI中文摘要

从单目视频中恢复的人体运动通常显得过于平滑或动态不一致,即使关节位置在数值上是准确的。我们观察到,这种局限性源于缺乏可靠的高阶时间线索——速度和加速度——这些对于重建具有真实动量、时序和高频细节的运动至关重要。我们引入了HTD-Refine,一个后处理框架,通过显式估计的高阶时间动态来增强现有的人体运动恢复(HMR)流程。我们系统的核心是PVA-Net,一个时间变换器,它直接从单目视频推断每个关节的2D位置、3D速度和3D加速度。这些预测的动态作为全局优化过程中的软约束,优化世界空间轨迹,显著减少抖动、抑制过度平滑,并恢复物理上合理的运动。在具有挑战性的野外基准上的大量实验表明,HTD-Refine持续改进了最先进的HMR方法,产生了更准确的全局轨迹和更自然的运动动态。我们的结果强调了高阶时间建模在推进单目人体运动恢复中的关键作用。

英文摘要

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues -- velocity and acceleration -- which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
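
As a reminder of what "high-order temporal dynamics" refers to, velocity and acceleration of a sampled trajectory can be approximated with central finite differences. This is only a baseline illustration under the assumption of a uniformly sampled 1D coordinate; the abstract states that PVA-Net infers these quantities directly from video with a temporal transformer, which this sketch does not attempt to reproduce.

```python
def finite_differences(positions, dt):
    """Central-difference velocity and acceleration for interior samples.

    positions: coordinates sampled every dt seconds (one joint, one axis).
    Returns two lists of length len(positions) - 2.
    """
    n = len(positions)
    if n < 3:
        raise ValueError("need at least three samples")
    velocity = [
        (positions[i + 1] - positions[i - 1]) / (2.0 * dt)
        for i in range(1, n - 1)
    ]
    acceleration = [
        (positions[i + 1] - 2.0 * positions[i] + positions[i - 1]) / dt ** 2
        for i in range(1, n - 1)
    ]
    return velocity, acceleration


# Trajectory x(t) = t^2 sampled at t = 0..4, so v = 2t and a = 2.
vel, acc = finite_differences([0.0, 1.0, 4.0, 9.0, 16.0], dt=1.0)
```

Because the acceleration estimate divides by `dt ** 2`, small position jitter is amplified strongly, which is one reason smoothing alone tends to trade jitter for over-smoothed motion.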

2605.26862 2026-05-27 cs.CV 版本更新

RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction

RoadGIE:面向通用交互式道路提取的全球尺度航拍基准

Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li

发表机构 * NKIARI, Shenzhen Futian(深圳福田NKIARI) VCIP, CS, Nankai University(南开大学VCIP研究所) AAIS, Nankai University(南开大学AAIS) College of Electronic Engineering, National University of Defense Technology, Changsha, China(国防科技大学电子工程学院,长沙,中国)

AI总结 提出最大、最多样的道路分割数据集WorldRoadSeg-360K,并设计支持连通性感知提示的交互式方法RoadGIE,在分割精度和拓扑一致性上达到最优。

详情
AI中文摘要

从航拍图像中准确分割道路是许多地理空间应用的基础。然而,现有数据集通常面临场景多样性有限、语义粒度低和结构连续性差的问题,限制了它们在不同环境中的泛化能力。为了解决这些挑战,我们引入了WorldRoadSeg-360K,这是迄今为止最大、最多样的道路分割数据集,包含从38个国家223个城市收集的366,947张高分辨率图像,覆盖不同地形和大陆。WorldRoadSeg-360K作为一个全面的基准,揭示了处理多样化和结构复杂场景的关键挑战。自动化方法通常难以保持道路连通性,而当前的交互式方法缺乏高效、拓扑敏感的工具用于实际道路编辑。为此,我们提出了RoadGIE,建立了一种新的遥感道路提取交互范式。与先前的点或框提示策略不同,RoadGIE支持连通性感知提示,包括点击和涂鸦,这些提示与道路网络的拓扑结构天然对齐。为了提高结构一致性并减轻迭代交互中的性能下降,RoadGIE集成了专家引导的提示策略,并针对交互场景调整了基于骨架的召回损失。RoadGIE在WorldRoadSeg-360K和其他基准上,在分割精度和拓扑一致性方面均达到了最先进的性能,同时仅需3.7M参数即可高效运行。代码公开于:https://github.com/chaineypung/RoadGIE

英文摘要

Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code is publicly available at: https://github.com/chaineypung/RoadGIE

2605.26861 2026-05-27 cs.CV 版本更新

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

REVERSE: 强化证据验证与搜索的智能体图像地理定位

Yong Li, Furong Jia, Dacheng Yin, Kang Rong, Fengyun Rao, Jing Lyu, Fan Zhang

发表机构 * Peking University(北京大学) The Hong Kong University of Science and Technology(香港科技大学) WeChat Vision, Tencent Inc(腾讯公司)

AI总结 提出REVERSE框架,通过多轮智能体推理强化证据搜索与验证的交互,在图像地理定位任务中优于强检索增强基线,以4B模型媲美更大模型。

详情
AI中文摘要

图像地理定位旨在确定照片的拍摄地点,该任务通常需要识别可见地标之外的信息。人类专家通常通过迭代工作流程解决:检查信息区域,形成位置假设,寻求外部证据,并根据新线索修正判断。现有方法仅部分捕捉这一过程:直接预测方法完全绕过证据获取,而检索增强方法引入外部证据但通常对中间决策(搜索位置、查询方式、过滤噪声结果)提供有限监督。我们提出REVERSE,一个强化证据搜索与验证交互的框架,实现多轮智能体推理。REVERSE教授三个中间决策:看哪里、查什么、信任什么证据。为此,我们构建了带注释区域选择、搜索观察和地理信息证据标签的工具化轨迹,并引入视觉定位、查询效用和证据辨别的过程奖励。离线搜索缓存使检索观察在强化学习过程中稳定且可重用,实现对噪声搜索结果的密集监督。使用4B模型,REVERSE在Im2GPS3k和YFCC4k上优于强检索增强基线,并媲美显著更大的模型。代码见https://github.com/yonglleee/REVERSE。

英文摘要

Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at https://github.com/yonglleee/REVERSE.

2605.26855 2026-05-27 cs.CV 版本更新

Receipt Replay OOD: A Small Benchmark for Screen Replay Detection Under Domain Shift

Receipt Replay OOD: 一个用于域偏移下屏幕重放检测的小型基准

Alexander Vinogradov

发表机构 * IU International University of Applied Science(国际应用科学大学)

AI总结 针对屏幕重放攻击检测中的域偏移问题,提出基于收据的小型OOD基准,评估跨域泛化性能。

详情
AI中文摘要

公共数据集如 DLC-2021、SynID 和 KID34K 对身份文档的呈现攻击检测(包括屏幕重放攻击)研究做出了重要贡献。然而,对域外(OOD)鲁棒性的评估仍不充分,尤其是在现实域偏移下。在这项工作中,我们引入了 Receipt Replay OOD,一个用于屏幕重放检测的小型域外基准。收据与身份文档共享多个特征,包括平面几何、圆角、磨损伪影以及文本或标志图案,同时避免了身份文档常见的个人身份信息约束。我们在跨域条件下评估文档重放检测模型,并展示了域偏移对泛化性能的影响。该数据集已公开。

英文摘要

Public datasets such as DLC-2021, SynID, and KID34K have significantly contributed to research on presentation attack detection for identity documents, including screen replay attacks. However, evaluation of out-of-domain (OOD) robustness remains insufficiently explored, especially under realistic domain shifts. In this work, we introduce Receipt Replay OOD, a small out-of-domain benchmark for screen replay detection. Receipts share several characteristics with identity documents, including planar geometry, curved corners, wear-and-tear artifacts, and text or logo patterns, while avoiding personally identifiable information constraints commonly associated with identity documents. We evaluate document replay detection models under cross-domain conditions and demonstrate the impact of domain shift on generalization performance. The dataset is publicly available.

2605.26831 2026-05-27 cs.CV cs.RO 版本更新

OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes

OSMa-Bench++:面向操作任务的语义映射开放基准测试,使用提示生成的合成场景

Regina Kurkova, Maxim Popov, Sergey Kolyubin

发表机构 * Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University(生物机电学与节能机器人实验室,ITMO大学)

AI总结 本文扩展OSMa-Bench,通过提示生成合成室内场景实现可控基准测试,并提出一种基于提示的VQA类别,用于语义映射方法在杂乱、小物体、部分遮挡和光照变化等条件下的压力测试。

Comments Code: https://github.com/be2rlab/OSMa-Bench-v2

详情
AI中文摘要

语义映射方法越来越多地被用作下游机器人推理和操作的中间场景表示,但它们的评估仍然很大程度上依赖于固定的基准数据集,这些数据集对操作相关边缘情况的覆盖有限。在这项工作中,我们将OSMa-Bench扩展到使用提示生成的合成室内场景进行可控基准测试。我们的流程自动生成场景描述,使用SceneSmith合成相应环境,并将生成的资产适配为OSMa-Bench兼容的仿真格式。这种适配需要一个非平凡的中层,包括语义归一化、材质和纹理修复、着色器回退策略、地面处理、导航设置和受控光照配置。所提出设置的一个关键优势是原始场景生成提示是预先已知的,因此可以作为预期场景的辅助语义规范。我们利用这一特性,将OSMa-Bench的VQA组件扩展了一个基于提示的问题类别。由此产生的框架支持在杂乱、小物体、部分遮挡和光照变化等条件下对语义场景表示进行有针对性的压力测试,并使基准测试更具可扩展性,更好地与下游操作需求对齐。我们的代码可在https://github.com/be2rlab/OSMa-Bench-v2获取。

英文摘要

Semantic mapping methods are increasingly used as intermediate scene representations for downstream robotic reasoning and manipulation, yet their evaluation is still largely tied to fixed benchmark datasets with limited coverage of manipulation-relevant corner cases. In this work, we extend OSMa-Bench toward controllable benchmarking with prompt-generated synthetic indoor scenes. Our pipeline automatically generates scene descriptions, synthesizes corresponding environments with SceneSmith, and adapts the resulting assets into an OSMa-Bench-compatible simulation format. This adaptation requires a nontrivial intermediate layer, including semantic normalization, material and texture repair, shader fallback policies, floor handling, navigation setup, and controlled lighting configuration. A key advantage of the proposed setup is that the original scene-generation prompt is known in advance and can therefore serve as an auxiliary semantic specification of the intended scene. We use this property to extend the VQA component of OSMa-Bench with a prompt-grounded question category. The resulting framework supports targeted stress-testing of semantic scene representations under conditions such as clutter, small objects, partial occlusions, and lighting variation, and makes benchmarking more extensible and better aligned with downstream manipulation requirements. Our code is available at https://github.com/be2rlab/OSMa-Bench-v2.

2605.26830 2026-05-27 cs.LG cs.AI cs.CV 版本更新

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

卡尔曼演化:通过可解释算法发现缩小卡尔曼滤波的差距

Vasileios Saketos, Ming Xiao

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 针对非线性传感场景下卡尔曼滤波性能下降的问题,提出Kalman Evolve框架,联合优化噪声参数与更新结构,利用大语言模型生成可解释的非仿射修改,在多个基准上实现高达12%的RMSE降低。

详情
AI中文摘要

状态估计是控制和信号处理中的一个基本问题,卡尔曼滤波器在线性动力学、高斯噪声和已知噪声协方差下提供最优解。然而,这些假设在多普勒雷达和LiDAR等实际传感场景中常常不成立。在这些情况下,最优估计器本质上是非线性的,导致系统性能下降。这产生了一个仅通过调整噪声协方差参数(即卡尔曼滤波器中的过程噪声和测量噪声)无法消除的性能差距。为了解决这一限制,我们提出了Kalman Evolve,一个通过联合优化噪声参数和更新结构来发现改进滤波算法的框架。我们的方法利用大语言模型作为程序空间上的结构化先验,能够生成对经典卡尔曼滤波器的可解释、非仿射修改,同时保留其递归形式。我们提供了分析结果,证明了在常见非线性传感模型下仿射估计器的次优性,从而激发了结构感知更新的必要性。在一系列合成和真实跟踪基准测试中,包括多普勒雷达、基于LiDAR的定位和行人跟踪,所发现的算法始终优于强基线(如优化卡尔曼滤波器),实现了高达12%的RMSE降低。这些结果表明,优化卡尔曼滤波器的结构而不仅仅是其参数,提供了一种实用且可解释的方式来改进状态估计。

英文摘要

State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman Filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
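
To make the "affine estimator" and "noise parameters" in this abstract concrete, here is a minimal scalar version of the classical Kalman filter's predict and update step. It is the textbook recursion, not one of the algorithms discovered by Kalman Evolve; the parameter names `a`, `h`, `q`, and `r` are conventional placeholders for the state transition, measurement model, process noise variance, and measurement noise variance.

```python
def kalman_step(x, p, z, a=1.0, h=1.0, q=1e-3, r=1e-1):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: previous state estimate and its variance.
    z:    new measurement.
    Returns the updated estimate, its variance, and the Kalman gain.
    """
    # Predict: propagate the estimate and inflate its variance by q.
    x_pred = a * x
    p_pred = a * p * a + q
    # Update: the gain weighs prediction uncertainty against measurement noise.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new, gain


x_new, p_new, gain = kalman_step(x=0.0, p=1.0, z=1.0, q=0.0, r=1.0)
```

Note that `x_new` is an affine function of the measurement `z` for a fixed gain. Tuning `q` and `r` changes the gain but not that affine form, which is the structural limit the abstract argues cannot be tuned away under nonlinear sensing.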

2605.26744 2026-05-27 cs.CV 版本更新

Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy

基于高效人体球代理的自交感知3D人体运动生成

Pascal Herrmann, Maarten Bieshaar, Dennis Mack, Robert Herzog, Juergen Gall

发表机构 * Bosch Research(博世研究院) University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔人工智能与机器学习研究所)

AI总结 提出一种基于人体球代理的自交损失函数,用于训练人体运动生成模型,可减少高达49%的自交现象并改善评估指标。

Comments Accepted to BMVC 2025

详情
AI中文摘要

近年来,人体运动生成取得了巨大进展,最先进的方法在领先的评估基准上超越了真实数据。然而,对生成运动的视觉检查揭示了不同情况:即使是最先进的方法也经常生成包含自交(即身体部位相互穿透)的运动,这些强烈的伪影严重限制了感知到的运动质量。我们引入了一种新的损失函数,明确惩罚自交,用于人体运动生成方法的训练。我们的损失基于人体几何的球代理,与基于三角网格的类似方法相比,计算自交损失的速度快98%,内存使用减少83%。该损失与具体方法无关,我们将其添加到最近的人体运动生成方法(人体运动扩散模型MDM和MoMask)的训练中。大量实验表明,生成运动中的自交减少了高达49%,同时改善了其他评估指标。代码可在https://github.com/boschresearch/humansphereproxy获取。

英文摘要

Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches frequently generate motions containing self-intersections, i.e., interpenetrating body parts, which are strong artifacts that severely limit the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which lets us calculate a self-intersection loss 98% faster and with 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of two recent human motion generation methods, the human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at https://github.com/boschresearch/humansphereproxy.
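
A sphere proxy makes overlap tests cheap because two spheres intersect exactly when the distance between their centers is smaller than the sum of their radii. The sketch below illustrates that idea with a squared penetration-depth penalty. The penalty form, the `skip_pairs` argument for spheres that are meant to touch (for example neighbors on the same limb), and the function name are assumptions for illustration; the authors' actual loss is in their repository.

```python
import math


def sphere_penetration_loss(centers, radii, skip_pairs=frozenset()):
    """Sum of squared penetration depths over all overlapping sphere pairs.

    centers:    list of (x, y, z) sphere centers approximating the body.
    radii:      list of sphere radii, same length as centers.
    skip_pairs: hypothetical set of index pairs (i, j), i < j, to ignore.
    """
    loss = 0.0
    n = len(centers)
    for i in range(n):
        for j in range(i + 1, n):
            if (i, j) in skip_pairs:
                continue
            distance = math.dist(centers[i], centers[j])
            depth = radii[i] + radii[j] - distance
            if depth > 0.0:  # spheres overlap
                loss += depth ** 2
    return loss


centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
radii = [1.0, 0.5, 1.0]
loss = sphere_penetration_loss(centers, radii)
```

Each pair costs one distance computation, with no triangle-triangle tests, which is consistent with the speed and memory advantage over mesh-based methods that the abstract reports.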

2605.26734 2026-05-27 cs.CV 版本更新

CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains

CIRCLED:跨领域一致对话的多轮CIR数据集

Tomohisa Takeda, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Osamu Torii, Yusuke Matsui

发表机构 * Graduate School of Information Science and Technology, The University of Tokyo(信息科学与技术研究生院,东京大学) Kioxia Corporation(铠侠公司)

AI总结 为解决现有MTCIR数据集缺乏对话历史一致性和领域局限的问题,构建了CIRCLED数据集,通过扩展FashionIQ、CIRR和CIRCO,利用CIReVL检索流水线生成多轮会话,并经过多重过滤确保质量,最终提供22,608个多轮会话,涵盖九个子集,规模与通用性显著提升。

详情
AI中文摘要

现有的多轮组合图像检索(MTCIR)数据集缺乏对话历史一致性,且仅限于时尚领域。为解决这些限制,我们通过扩展FashionIQ、CIRR和CIRCO构建了CIRCLED。在CIRCLED中,每一轮的查询逐步逼近目标图像。数据通过基于CIReVL的检索流水线生成,并经过检索成功、轮次长度、一致性和信息冗余等多重过滤以确保质量。我们总共收集了涵盖九个子集的22,608个多轮会话,在规模和通用性上显著超过Multi-turn FashionIQ(11,505个会话)。我们进一步应用了多种基线方法,并在CIRCLED上定量评估了检索准确性。我们的工作提供了一个实用、高质量的基准,以促进未来多轮CIR的研究。数据集和代码公开于https://huggingface.co/datasets/tk1441/CIRCLED和https://github.com/mti-lab/circled。

英文摘要

Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at https://huggingface.co/datasets/tk1441/CIRCLED and https://github.com/mti-lab/circled.

2605.26729 2026-05-27 cs.CV 版本更新

Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics

基于混合光照特性的参考引导曝光校正

Hao Ren, Zetong Bi, Zhaoliang Wan, Hui Cheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China(计算机科学与工程学院,中山大学,广州,中国)

AI总结 提出HICNet,一种参考引导的曝光校正框架,通过轻量编码器提取光照嵌入,结合FiLM全局调整和光度通道重平衡实现精细曝光匹配,无需真值或内在分解即可在基准测试上取得更优精度并泛化到未见场景。

Comments ICASSP2026

详情
AI中文摘要

我们提出了HICNet,一个参考引导的曝光校正框架。一个轻量级、内容无关的编码器将每张图像蒸馏成一个紧凑的光照嵌入,捕获区域亮度、边缘对比度和高阶亮度矩。源图像与其参考图像之间的嵌入差异驱动一个多尺度调制网络,该网络结合基于FiLM的全局调整和光度通道重平衡,实现细粒度的、光照感知的光谱门控,产生曝光匹配的输出,同时忠实保留场景细节。跨批次对比损失对光照流形进行排序,增强了对不同光照条件的鲁棒性。在没有真值或内在分解的情况下训练,HICNet在公共基准测试上达到了更好的精度,并且能够很好地泛化到完全未见过的场景。

英文摘要

We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.
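
FiLM (feature-wise linear modulation) scales and shifts each feature channel with a per-channel pair of coefficients. The sketch below shows only that generic operation on plain lists. How HICNet derives the coefficients from the embedding difference is not described in the abstract, so `gamma` and `beta` are simply passed in here as assumed inputs.

```python
def film(features, gamma, beta):
    """Apply feature-wise linear modulation: out[c] = gamma[c] * x[c] + beta[c].

    features: list of channels, each a list of activations.
    gamma, beta: one scale and one shift per channel.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have one entry per channel")
    return [
        [g * value + b for value in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


# Two channels: brighten the first, compress and lift the second.
modulated = film([[1.0, 2.0], [3.0, 4.0]], gamma=[2.0, 0.5], beta=[0.0, 1.0])
```

Because the same `gamma` and `beta` apply at every spatial position of a channel, this acts as a global adjustment, matching the role the abstract assigns to the FiLM branch.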

2605.26726 2026-05-27 eess.IV cs.AI cs.CV 版本更新

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

发表机构 * Computational Health Center, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑计算健康中心) Helmholtz AI, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑人工智能研究所) Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany(慕尼黑技术大学计算机辅助医疗程序研究所) Munich Center for Machine Learning, Munich, Germany(慕尼黑机器学习中心) Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany(慕尼黑路德维希-马克西米利安大学医院第三医学部) Department of Physics, University of Munich, Munich, Germany(慕尼黑大学物理系) German Cancer Consortium (DKTK), partner site Munich, Germany(德国癌症研究中心(DKTK)慕尼黑分部)

AI总结 提出一种基于动态系统收敛性的不确定性度量方法,通过扰动自动机状态并观察预测稳定性来评估神经细胞自动机在医学图像分割中的可信度。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

神经细胞自动机(NCA)为编码器-解码器分割网络提供了一种轻量级替代方案。然而,决定何时应信任预测可能很困难。在这里,我们研究基于NCA的医学图像分割的不确定性估计,无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发,其中收敛吸引子对应于可信预测。具体地,我们提出了弹性(resilience),这是一种简单的度量,通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的,而显著变化的预测被标记为不确定。我们使用选择性预测指标($\Delta$Dice@90和AURC)和排序指标(AUROC和AUPRC)通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中,弹性比基线更可靠地识别失败案例,提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($\Delta$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.
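
The perturb-and-recheck idea can be sketched with a toy iterative system. The code below runs an update rule to a final state, adds small Gaussian noise to that state, iterates again, and scores agreement between the two binarized predictions with the Dice coefficient. Everything concrete here is an assumption for illustration: the toy `step` rule stands in for a trained NCA, and the noise level, trial count, 0.5 threshold, and use of Dice as the agreement score are placeholders rather than the paper's definition of resilience.

```python
import random


def dice(a, b):
    """Dice overlap between two equal-length boolean masks."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def resilience(state, step, n_steps, noise=0.05, trials=8, seed=0):
    """Mean agreement between the final prediction and re-converged perturbed runs."""
    rng = random.Random(seed)

    def run(s):
        for _ in range(n_steps):
            s = step(s)
        return s

    final = run(state)
    reference = [v > 0.5 for v in final]
    scores = []
    for _ in range(trials):
        # Perturb the converged state, then let the system iterate again.
        perturbed = [v + rng.gauss(0.0, noise) for v in final]
        scores.append(dice(reference, [v > 0.5 for v in run(perturbed)]))
    return sum(scores) / trials


def toy_step(s):
    """Toy stand-in for an NCA update: each cell drifts toward 0 or 1."""
    return [v + 0.5 * ((1.0 if v > 0.5 else 0.0) - v) for v in s]


score = resilience([0.9, 0.1, 0.8, 0.2], toy_step, n_steps=10)
```

In this toy system every cell sits deep inside an attractor, so small perturbations decay and the score stays at its maximum. A state resting near the 0.5 decision boundary would flip under noise and lower the score, which is the behavior the abstract uses to flag uncertain predictions.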

2605.26725 2026-05-27 cs.CV 版本更新

Joint 2D-3D Segmentation and Association in Street-level Imaging

街景成像中的联合2D-3D分割与关联

Amir Melnikov, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

发表机构 * Institute of Science Tokyo(东京科学研究所)

AI总结 提出一个统一框架,结合零样本检测分割与运动恢复结构,通过3D驱动的几何一致性机制替代传统2D多目标跟踪,实现街景图像中跨视角的稳定分割与身份关联,在挑战性城市场景中性能提升22%。

Comments 15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables

详情
AI中文摘要

准确解读街景图像对于大规模城市地图绘制和创建空间数字孪生环境至关重要。本文提出了一个用于联合2D-3D分割与关联的统一框架,该框架将视觉语义与多视图几何推理相结合。与依赖时序帧进行跟踪的传统方法不同,我们的方法利用零样本检测和分割,结合运动恢复结构重建,建立稳定的跨视图对应关系。3D驱动的关联机制取代了传统的2D多目标跟踪,利用几何一致性指导宽基线视角和不同成像条件下的身份保持。通过结合2D纹理线索和全局3D上下文,所提出的管道非常适合可扩展的街景处理,并可适用于多种对象类型。实验表明,与最先进的纯2D跟踪方法相比,我们的方法显著提高了对真实序列的覆盖率和更鲁棒的身份保持,在挑战性城市场景中实现了22%的性能提升。

英文摘要

Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.

2605.26712 2026-05-27 cs.CV 版本更新

METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

METATR:一个多语言、不断演进的自动文本识别基准

Mélodie Boillet, Solène Tarride, Christopher Kermorvant

发表机构 * TEKLIA

AI总结 提出METATR基准,通过多样化多语言文档、标准化评估框架和动态更新机制,全面评估自动文本识别系统(尤其是视觉大语言模型)的性能,支持模型比较与选择。

详情
AI中文摘要

反映真实文档多样性和复杂性的基准对于准确评估自动文本识别(ATR)系统,特别是视觉大语言模型(vLLMs)至关重要。尽管最近的模型表现出令人印象深刻的性能,但它们通常在包含现代印刷文本(主要是英语)的数据集上进行评估,这限制了它们与许多实际应用的相关性。因此,为特定用例选择模型需要在与目标文档匹配的数据上进行评估。这突显了代表性基准对于实际应用的重要性。在本文中,我们介绍了METATR(v1.0),一个多语言、不断演进的基准,旨在评估ATR模型在广泛文档上的性能,促进有意义的模型比较和选择。该基准通过包含来自各种公共收藏的文档来最大化多样性。这些文档涵盖29种语言,并包含多种文字和布局的文本。除了数据集本身,METATR还定义了标准化的提示和归一化方法,并建立了一个动态评估框架。这种方法旨在产生可重复的结果,同时随着时间的推移保持可扩展性。我们评估了广泛的最先进系统,包括开源模型和闭源模型。结果从多个维度报告,包括数据集和语言级别的性能、对手写文档的鲁棒性以及计算效率。我们的发现表明,尽管专有模型实现了最一致的性能,但在不同文字和布局之间仍然存在显著差异。总体而言,METATR提供了一个多维度的、面向从业者的框架,用于在真实条件下评估多语言ATR,并随着领域的发展跟踪进展。

英文摘要

Benchmarks that reflect the diversity and complexity of real-world documents are essential for accurately evaluating Automatic Text Recognition (ATR) systems, especially Vision-Large Language Models (vLLMs). Although recent models demonstrate impressive performance, they are often evaluated on datasets containing modern, printed texts mostly written in English, which limits their relevance to many practical applications. Therefore, selecting a model for a specific use case requires evaluating it on data that matches the target documents. This highlights the importance of representative benchmarks for real-world applications. In this paper, we introduce METATR (v1.0), a multilingual, evolving benchmark designed to evaluate ATR models across a wide range of documents, facilitating meaningful model comparison and selection. The benchmark was designed to maximize diversity by including documents from various public collections. These documents cover 29 languages and include texts with multiple scripts and layouts. Beyond the dataset itself, METATR defines a standardized prompting and normalization methodology and establishes a dynamic evaluation framework. This approach is intended to produce reproducible results while remaining extensible over time. We evaluated a wide range of state-of-the-art systems, including open-source models and closed-source models. Results are reported across various dimensions, including performance at the dataset and language levels, robustness to handwritten documents, and computational efficiency. Our findings show that, although proprietary models achieve the most consistent performance, substantial variability persists across scripts and layouts. Overall, METATR provides a multidimensional, practitioner-oriented framework for assessing multilingual ATR in real-world conditions and tracking progress as the field evolves.

2605.24456 2026-05-27 cs.CV 版本更新

EgoProx: Evaluating MLLMs on Egocentric 3D Proximity Reasoning Across a Cognitive Hierarchy

EgoProx: 在认知层级上评估多模态大语言模型的自我中心3D邻近推理能力

Jinzhao Li, Yinuo Chen, Dongxu Piao, Panwang Pan, Yifan Yu, Dong Wang, Honglei Yan, Liang Yue, Shaofei Wang, Yixin Chen, Siyuan Huang, Miao Liu

发表机构 * College of AI, Tsinghua University(清华大学人工智能学院) ByteDance(字节跳动) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 提出EgoProx基准,通过认知链任务和基于智能体的数据引擎,评估多模态大语言模型在自我中心3D邻近推理中的表现,发现模型虽具备空间知识但难以有效利用。

Comments Accepted to CVPR 2026

详情
AI中文摘要

人类不断推理3D邻近性,即身体与周围物体之间的关系,以指导日常生活中的感知和行动。多模态大语言模型(MLLMs)能否进行这种具身3D推理尚不清楚。为此,我们引入了EgoProx,一个用于自我中心3D邻近推理的基准。我们沿着认知链组织任务,涵盖意图、探索、利用和行动链推理。我们还设计了一个基于智能体的数据引擎,能够大规模生成多样且一致的问答对。我们在EgoProx上对主流MLLMs进行了基准测试,并通过数据集特定和任务特定的指令微调进行了额外分析。我们观察到较大的跨领域增益,表明当前的MLLMs包含一些空间知识;然而,它们仍然难以有效利用这些知识进行空间推理VQA。

英文摘要

Humans constantly reason about 3D proximity, the relations between their body and surrounding objects, to guide perception and action in daily life. Whether multimodal large language models (MLLMs) can perform such embodied 3D reasoning remains unclear. To this end, we introduce EgoProx, a benchmark for egocentric 3D proximity reasoning. We organize our tasks along a cognitive chain, covering intention, exploration, exploitation, and chain-of-actions reasoning. We also design an agent-based data engine that produces diverse and consistent QA pairs at scale. We benchmark prevailing MLLMs on EgoProx and conduct additional analyses with dataset-specific and task-specific instruction tuning. We observe large cross-domain gains, indicating that current MLLMs contain some spatial knowledge; however, they still struggle to effectively leverage it for spatial reasoning VQA.

2605.22417 2026-05-27 cs.CV cs.SE 版本更新

The Neglected Baseline in Model Interpretation

模型解释中被忽视的基线

Yongjin Cui, Xiaohui Fan

发表机构 * Zhejiang University(浙江大学)

AI总结 针对现有模型解释方法普遍忽略基线导致不精确的问题,本文重新定义解释任务和原则,统一梯度法、积分梯度法和泰勒展开,分析相关方法缺陷,并基于清晰合理的基线改进积分梯度法,实现基于任意层特征的解释。

详情
AI中文摘要

我们观察到现有的模型解释方法普遍忽略了基线,这种忽视常常导致不精确甚至错误的解释。本文重新阐述了模型解释的任务和解释结果的原则,以证明基线的重要性。我们进一步统一了基于梯度的方法、积分梯度(IG)方法和泰勒展开,阐明了它们之间的联系,并明确识别了每种方法的基线。在此基础上,我们分析了相关模型解释方法(IG、LayerCAM、ODAM、Difference Map)中的缺陷和错误。我们主张通过归因结果与归因目标之间的归因误差来精确评估模型解释结果的质量,而不是采用有缺陷的评估方法,例如基于边际效应或假设模型性能完美的方法。我们改进了IG,并开发了一种具有清晰合理基线的模型解释方法,取得了更好的结果。我们的方法支持基于任意层特征进行模型解释。基于不同层特征的解释都是合理的,这些结果之间的差异反映了不同特征提取阶段特征提取的不同程度。

英文摘要

We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the interpretation principles for model interpretation results to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal effects or the assumption of perfect model performance. We revise IG and develop a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretations based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different stages.
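
To see why the baseline matters, it helps to run standard Integrated Gradients on a tiny function with two different baselines. The sketch below is the textbook IG formula approximated with a midpoint Riemann sum, not the revised method this paper proposes; the toy function, its hand-written gradient, and all names are assumptions chosen so the results can be checked by hand.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Standard IG: (x_i - b_i) times the average gradient along the straight path from b to x."""
    dims = len(x)
    avg_grad = [0.0] * dims
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(dims):
            avg_grad[i] += grad[i] / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def f(v):
    """Toy model: f(x0, x1) = x0^2 + 3 * x1."""
    return v[0] ** 2 + 3.0 * v[1]


def grad_f(v):
    """Analytic gradient of f."""
    return [2.0 * v[0], 3.0]


x = [2.0, 1.0]
attr_zero = integrated_gradients(grad_f, x, baseline=[0.0, 0.0])
attr_shifted = integrated_gradients(grad_f, x, baseline=[1.0, 0.0])
```

With the zero baseline the attributions are approximately `[4, 3]`, and with the shifted baseline approximately `[3, 3]`. In both cases they sum to `f(x) - f(baseline)`, so the attribution explains the difference from the baseline rather than the output in isolation. Changing the baseline therefore changes what is being explained.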

2605.26689 2026-05-27 cs.CV cs.CL 版本更新

PinPoint: Prompting with Informative Interior Points

PinPoint: 通过信息性内部点进行提示

Pouya Sadeghi, Shawn He, Pedro Pablo Guerrero Vela, C. Thomas, Alex Wong, Sirisha Rambhatla

发表机构 * University of Waterloo(滑铁卢大学) Critical ML Apple(苹果公司)

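AI summary: To close the performance gap caused by ambiguous prompts when a VLM is combined with SAM for referring image segmentation, this work proposes PinPoint, a deterministic, training-free point selector that fuses visual cues to choose stable, informative interior points and matches supervised and RL-tuned methods without any training.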

详情
AI中文摘要

现代指代图像分割流程将用于定位的视觉语言模型(VLM)与用于掩码生成的可提示分割器(如Segment Anything Model,SAM)相结合。先前该方案的无训练实例始终落后于微调和强化学习(RL)调优的专家,且不清楚差距来自VLM的定位、SAM的能力还是提示。我们表明差距主要由提示模糊性主导:VLM提出的边界框(bbox)让SAM猜测框内哪些像素属于表达式所指的对象。内部点是自然的消歧器,但它们的落点很重要;先前的工作依赖于朴素采样的点,这些点落在边界、干扰物和背景杂波上,甚至可能比单独使用bbox更差。有监督和RL调优的方法通过训练VLM预测更好的点来缩小这一差距;我们表明这种训练是不必要的。在五个内部点的匹配预算下,用稳定、信息丰富的点选择替换朴素采样,在RefCOCO/+/g上累积交并比(cIoU)提高了12-18个点,且每个模型固定。我们将这一观察转化为PinPoint,一个确定性的、无需训练的点选择器,它融合四个视觉线索为共识图,选择紧凑、空间多样且远离边界的点,并使用冻结的VLM标记每个点。无需任何任务特定训练,PinPoint在相同堆栈上匹配了有监督和RL调优的专家,同时每次查询仅调用两次VLM。

英文摘要

Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
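The cIoU metric quoted above can be sketched as follows, assuming the usual definition in referring segmentation: total intersection over total union, accumulated across all samples. The flat 0/1 mask lists and the two toy samples are illustrative assumptions, not data from the paper.

```python
def cumulative_iou(preds, gts):
    """Cumulative IoU: sum of intersections divided by sum of unions
    over all (prediction, ground truth) mask pairs."""
    inter = 0
    union = 0
    for p, g in zip(preds, gts):
        inter += sum(1 for a, b in zip(p, g) if a and b)
        union += sum(1 for a, b in zip(p, g) if a or b)
    return inter / union if union else 0.0


def mean_iou(preds, gts):
    """Per-sample IoU averaged over samples, shown for contrast."""
    values = []
    for p, g in zip(preds, gts):
        i = sum(1 for a, b in zip(p, g) if a and b)
        u = sum(1 for a, b in zip(p, g) if a or b)
        values.append(i / u if u else 0.0)
    return sum(values) / len(values)


# Two toy masks flattened to 0/1 lists (hypothetical data).
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 1, 1, 0], [0, 1, 0, 0]]

ciou = cumulative_iou(preds, gts)  # (2 + 0) / (3 + 2)
miou = mean_iou(preds, gts)        # (2/3 + 0) / 2
print(ciou, miou)
```

Because cIoU pools pixels before dividing, large objects weigh more than small ones, which is why it can differ from the per-sample mean.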

2605.26682 2026-05-27 cs.RO cs.CV 版本更新

SteelDS: A High-Resolution Video Dataset of E40 Steel Scrap for Object Detection and Instance Segmentation

SteelDS: 用于目标检测和实例分割的E40钢废料高分辨率视频数据集

Melanie Neubauer, Christian Rauch, Gerald Koinig, Alexia Tischberger-Aldrian, Roland Pomberger, Elmar Rueckert

发表机构 * Chair of Cyber-Physical-Systems(信息物理系统教席) Technical University of Leoben(莱奥本工业大学) Chair of Waste Processing Technology and Waste Management(废物处理技术与废物管理教席)

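AI summary: The dataset provides high-resolution annotated video sequences of E40-grade steel and copper scrap on a conveyor belt, supporting the development of machine learning models for material classification, object detection, and instance segmentation.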

详情
AI中文摘要

该数据集提供了粉碎的E40级钢和铜废料在传送带上的高分辨率、标注视频序列。在受控实验室环境中捕获,数据反映了工业磁选后阶段,通常需要人工干预去除铜污染物。数据集包含五个子集的24,297个标注帧,包含396个钢和101个铜物体,按大小分类。它支持材料分类、目标检测和实例分割的机器学习模型开发。包含物体间距和密度的变化,以模拟真实的工业分拣条件。地面真值标注包括像素级分割掩码和材料类别。该数据集作为评估自动化分拣算法的基准,旨在识别复杂、异质钢废料流中的铜杂质。

英文摘要

This dataset provides high-resolution, annotated video sequences of shredded E40-grade steel and copper scrap on a conveyor belt. Captured in a controlled laboratory environment, the data reflects the industrial post-magnetic sorting stage, where manual intervention is typically required to remove copper contaminants. The dataset comprises 24,297 labeled frames across five subsets, featuring 396 steel and 101 copper objects categorized by size. It supports the development of machine learning models for material classification, object detection, and instance segmentation. Variations in object spacing and density are included to simulate realistic industrial sorting conditions. Ground truth annotations include pixel-wise segmentation masks and material classes. This dataset serves as a benchmark for evaluating automated sorting algorithms aiming to identify copper impurities within complex, heterogeneous steel scrap streams.

2605.26680 2026-05-27 cs.CV cs.AI 版本更新

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

DynFrame: 自适应推理驱动的多模态框架与动态帧增强用于复杂视频理解

Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

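AI summary: Proposes DynFrame, which emits the temporal window and sampling density as native tokens for single-step retrieval and introduces Segment-Decoupled GRPO, addressing two problems in video MLLMs: sampling density that cannot be learned, and coupled optimization of retrieval and answering.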

详情
AI中文摘要

最近视频多模态大语言模型(MLLMs)越来越多地将逐步推理与按需视觉证据检索相结合,允许模型在推理过程中重新访问相关视频片段。然而,现有的思考与视频系统仍存在两个结构性缺陷。(i)采样密度不是一个可学习的决策:现有方法可能让模型决定看哪里,但每个窗口的帧率基本固定。因此,细粒度证据通常通过重复的检索调用来恢复,这增加了推理上下文长度和训练难度。(ii)检索和答案生成通常使用单个轨迹级优势进行优化,因此“看哪里”的token和“如何回答”的token获得相同的信用,即使一个正确而另一个不正确。为了解决这些缺陷,我们提出了DynFrame,一个在单次自回归过程中将时间窗口和采样密度作为原生token发出的框架。这种可学习的跨度-密度检索使得单步检索即可获取多粒度证据。基于上述token化检索接口,我们进一步引入了分段解耦GRPO(SD-GRPO),它在检索边界分割每次展开,并分配角色特定的token级优势,分别对采样决策和答案进行信用分配。在精心策划的DM-CoT-74k和DM-RL-45k上训练后,DynFrame-4B在六个基准测试(NExT-GQA、Charades-STA、ActivityNet-MR、Video-MME、MLVU、LVBench)上与强大的7B-8B基线竞争,而DynFrame-8B在大多数指标上创造了新的最先进水平。代码可在https://github.com/zhangguanghao523/DynFrame获取。

英文摘要

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets a new state of the art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.
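The idea of splitting credit at the retrieval boundary can be sketched as below. This is a hedged illustration, not the released SD-GRPO code: it assumes the usual GRPO group-relative normalization (reward minus group mean, divided by group standard deviation), and the rollout fields `retrieval_reward`, `answer_reward`, `n_retrieval_tokens`, and `n_answer_tokens` are hypothetical names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(rollouts):
    """Give retrieval tokens and answer tokens their own advantage,
    each normalized over the group with its own reward signal."""
    adv_retrieval = group_advantages([r["retrieval_reward"] for r in rollouts])
    adv_answer = group_advantages([r["answer_reward"] for r in rollouts])
    per_token = []
    for r, a_ret, a_ans in zip(rollouts, adv_retrieval, adv_answer):
        per_token.append(
            [a_ret] * r["n_retrieval_tokens"] + [a_ans] * r["n_answer_tokens"]
        )
    return per_token


# Hypothetical group of two rollouts: one retrieves well but answers wrongly,
# the other retrieves badly but answers correctly.
rollouts = [
    {"retrieval_reward": 1.0, "answer_reward": 0.0,
     "n_retrieval_tokens": 3, "n_answer_tokens": 2},
    {"retrieval_reward": 0.0, "answer_reward": 1.0,
     "n_retrieval_tokens": 3, "n_answer_tokens": 2},
]
token_advantages = segment_decoupled_advantages(rollouts)

# A single trajectory-level reward (sum of both parts) cannot separate them.
trajectory_advantages = group_advantages(
    [r["retrieval_reward"] + r["answer_reward"] for r in rollouts]
)
print(token_advantages[0])
print(trajectory_advantages)
```

In this toy group the trajectory-level advantages are both zero, while the segment-level version rewards the good retrieval tokens and penalizes the wrong answer tokens of the first rollout.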

2605.26676 2026-05-27 cs.CV 版本更新

Memory-Distilled Selection for Noise-Robust Anomaly Detection

记忆蒸馏选择用于噪声鲁棒异常检测

Sirojbek Safarov, Jaewoo Park, Yoon Gyo Jung, Kuan-Chuan Peng, Wonchul Kim, Seongdeok Bang, Octavia Camps

发表机构 * AIVEX Inc. Northeastern University Mitsubishi Electric Research Laboratories (MERL)

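AI summary: Proposes MeDS, a data-selection-based training algorithm that builds an ensemble of partial memories via random subsampling, uses the resulting sparsity as a low-pass filter to capture nominal patterns, and distills the result into a reconstruction score network for noise-robust anomaly detection.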

Comments Accepted by ICML2026. The code is available at https://github.com/SirojbekSafarov/MeDS

详情
AI中文摘要

数据污染下的异常检测对于在工业环境中部署无监督缺陷检测至关重要,因为整理完全干净的训练集是不切实际的。然而,现有方法对污染敏感,随着噪声比例增加,性能显著下降。在本文中,我们提出记忆蒸馏选择(MeDS),一种基于数据选择的训练算法。MeDS通过随机子采样构建部分记忆集成,其中产生的稀疏性作为低通滤波器,在广泛的噪声比例下捕获名义模式,从而实现对污染样本的粗粒度识别。然后,将到自举记忆的聚合距离蒸馏到重建分数网络中,随后在通过蒸馏模型过滤的干净数据上进行微调,实现异常的精确定位。MeDS在广泛的噪声比例下具有鲁棒性,无需针对特定噪声比例的超参数调整,在MVTecAD上以40%噪声比例达到99.16%的图像级AUROC,并在噪声设置下在VisA和Real-IAD上取得最先进性能。我们在噪声数据场景下的工业AD基准上彻底验证了MeDS的有效性,并进行了深入的经验分析。

英文摘要

Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.
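The first stage described above, scoring training samples against an ensemble of sparse random memories, can be sketched roughly as follows. This is an assumption-laden toy, not the MeDS implementation: features are 1-D scalars, the distance is an absolute difference, each sample is excluded from its own memory lookup, and the function name and parameters are hypothetical. The distillation and fine-tuning stages are omitted.

```python
import random


def partial_memory_scores(features, n_memories=20, memory_size=5, seed=0):
    """Average nearest-neighbour distance of each sample to several
    small random memories drawn from the (possibly contaminated) set."""
    rng = random.Random(seed)
    n = len(features)
    totals = [0.0] * n
    for _ in range(n_memories):
        memory = rng.sample(range(n), memory_size)
        for i, x in enumerate(features):
            # Leave-one-out: a sample never matches itself in a memory.
            dists = [abs(x - features[j]) for j in memory if j != i]
            totals[i] += min(dists)
    return [t / n_memories for t in totals]


# Toy training set: dense nominal values plus a few rare, dissimilar
# contaminated values (hypothetical data).
nominal = [-1.0 + 0.1 * k for k in range(21)]
contaminated = [10.0, 20.0, 30.0]
features = nominal + contaminated

scores = partial_memory_scores(features)
nominal_scores = scores[:len(nominal)]
contaminated_scores = scores[len(nominal):]
print(max(nominal_scores), min(contaminated_scores))
```

Because each memory is small, it almost always contains a nominal neighbour for nominal samples but rarely a close match for a rare contaminated pattern, so contaminated samples accumulate larger aggregated distances.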

2605.26661 2026-05-27 cs.CV cs.AI 版本更新

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

在预训练视觉语言模型的后验分布外检测中尊重模态差距

Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang, Ling Chen, Jie Lu

发表机构 * The University of Queensland(昆士兰大学) University of Technology Sydney(悉尼科技大学)

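AI summary: To address the modality gap between textual and visual prototypes in post-hoc OOD detection with pre-trained vision-language models, proposes an online pseudo-supervised framework that learns class prototypes directly in the visual feature space and achieves a new state of the art.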

详情
AI中文摘要

分布外(OOD)检测已成为一种流行的技术,通过识别来自未知类别的意外输入来增强机器学习模型的可靠性。预训练视觉语言模型(VLM)的最新进展使得无需访问分布内(ID)训练数据即可进行零样本OOD检测;在这种设置下,现有方法通常将类名的文本嵌入视为类原型。在本文中,我们通过理论证明现成的文本原型通常与最优视觉原型不对齐,从而产生无法通过提示工程单独消除的内在模态差距,来挑战广泛采用的文本即原型范式。为了在后验约束下缓解这一差距,本文提出了一种在线伪监督框架,该框架使用未标记的测试时数据流和预训练VLM的软预测,直接在视觉特征空间中学习类原型。我们为在线优化过程的收敛性提供了理论保证。大量实验经验证明,我们的方法在各种OOD检测设置中达到了新的最优水平。

英文摘要

Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.

2605.26656 2026-05-27 cs.CV 版本更新

DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding

DV-SFT: 直接视觉监督用于细粒度视觉理解

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Bing Wang, Zhixing Tan

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Zhongguancun Academy(中关村学院) Beihang University(北京航空航天大学) Zhongguancun Laboratory(中关村实验室) Southeast Academy of Information Technology, Beijing Institute of Technology(北京理工大学东南信息技术研究院)

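AI summary: Proposes DV-SFT, which constructs explicit token-level supervision for visual tokens by exploiting direct vision-text correspondence in OCR scenarios, significantly improving the fine-grained visual understanding of multimodal large language models without modifying the architecture or adding forward passes.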

Comments Under Review

详情
AI中文摘要

多模态大语言模型通常以端到端方式训练以预测真实答案,但监督信号仅应用于文本令牌。视觉令牌作为视觉信息的核心载体,仅作为上下文的一部分被隐式优化,导致粗粒度的视觉理解。先前的工作尝试监督视觉输入,但不可避免地依赖辅助组件(如额外的解码器或前向传播),因为视觉令牌缺乏可直接解释的标签。这限制了它们的实际应用。在这项工作中,我们提出了直接视觉监督微调(DV-SFT),该方法为视觉令牌构建显式的令牌级监督,并通过与文本相同的下一个令牌预测目标来训练它们。具体来说,我们利用OCR相关场景中的直接视觉-文本对应关系,自动为每个视觉令牌标注其对应图像块中的单词。DV-SFT将MLLM视为黑盒,无需修改架构或额外的前向传播。大量实验证明了直接视觉监督的优越性。DV-SFT在三个域内和四个域外基准测试中始终优于标准SFT。进一步分析表明,视觉监督有效增强了细粒度视觉理解,并实现了更高的多模态对齐效率。

英文摘要

Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose Direct Vision Supervised Fine-Tuning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision-text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.

2605.26642 2026-05-27 cs.CV 版本更新

Adaptation-Free Heterogeneous Collaborative Perception with Unseen Agent Configurations

无适应异构协同感知:应对未见过的智能体配置

Hyunchul Bae, Heejin Ahn

发表机构 * School of Electrical Engineering(电气工程学院) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

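AI summary: Proposes ALF, a framework that lifts lightweight box-level messages into ego-compatible auxiliary features to enable zero-adaptation collaborative perception with agents of unseen configurations; in zero-shot evaluation on V2X-Real it improves relative mAP@0.7 by 35.91% while needing only about 9.6 Kbps of bandwidth.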

Comments 9 pages main paper, 23 pages including references and appendix, 7 figures

详情
AI中文摘要

协同感知通过使智能体共享互补观测来改进3D目标检测,但大多数现有方法假设固定或已知的编码器配置,限制了实际部署。在这项工作中,我们考虑一个开放世界场景,其中具有未见配置的辅助智能体可能在部署后出现,例如不同的LiDAR线束数量或编码器架构。为应对这一挑战,我们提出ALF,一种协同感知框架,通过将轻量级框级消息提升为自车兼容的辅助特征,实现与未见配置智能体的零适应协作。ALF将辅助框级消息转换为伪BEV地图,并通过将目标中心线索与来自自车特征的场景上下文相结合,合成自车兼容的潜在特征。在V2X-Real上,跨越64个案例研究的零样本评估中,ALF在相对mAP@0.7上比最强先前基线高出35.91%,同时每个智能体每帧仅需120字节(在10 Hz下约9.6 Kbps带宽)。

英文摘要

Collaborative perception improves 3D object detection by enabling agents to share complementary observations, but most existing methods assume fixed or known collaborator encoder configurations, limiting deployment in practice. In this work, we consider an open-world setting in which auxiliary agents with unseen configurations may appear after deployment, such as different LiDAR beam counts or encoder architectures. To address this challenge, we propose ALF, a collaborative perception framework that enables zero-adaptation collaboration with unseen agent configurations by lifting lightweight box-level messages into ego-compatible auxiliary features. ALF converts auxiliary box-level messages into pseudo-BEV maps and synthesizes ego-compatible latent features by combining object-centric cues with scene context from the ego feature. On V2X-Real, under a zero-shot evaluation across 64 case studies, ALF outperforms the strongest prior baseline by 35.91% in relative mAP@0.7 while requiring only 120 bytes per agent per frame (approximately 9.6 Kbps bandwidth at 10 Hz).
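The bandwidth figure follows directly from the message size and frame rate quoted in the abstract. A minimal check, assuming 1 Kbps means 1000 bits per second and that the 120 bytes are sent once per frame per agent (the function name is illustrative):

```python
def bandwidth_kbps(bytes_per_frame, frame_rate_hz):
    """Per-agent bandwidth in kilobits per second (1 Kbps = 1000 bit/s)."""
    bits_per_second = bytes_per_frame * 8 * frame_rate_hz
    return bits_per_second / 1000


alf_kbps = bandwidth_kbps(bytes_per_frame=120, frame_rate_hz=10)
print(alf_kbps)  # 120 bytes * 8 bits * 10 Hz = 9600 bit/s
```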

2605.26641 2026-05-27 cs.CV 版本更新

OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation

OmniRetriever: 通过融合作为教师蒸馏实现任意到任意的音频-视频-文本检索

Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen

发表机构 * Memories.ai Research(Memories.ai研究院)

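AI summary: Proposes fusion-as-teacher distillation, which uses the fused embedding of (text, video, audio) triples as a training signal for single-modal embeddings, and builds OmniRetriever-7B, which surpasses existing methods on zero-shot retrieval benchmarks; also releases the OmniRetriever-Bench benchmark.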

Comments https://yunzeliu.github.io/OmniRetriever/

详情
AI中文摘要

统一的多模态嵌入空间已成为跨模态检索和多模态RAG的标准接口,最近的音频-视频-文本(AVT)编码器将这一设置扩展到三种模态。当所有三种模态都可用时,此类编码器可以生成联合的(T,V,A)嵌入,但标准的成对InfoNCE目标在训练过程中未使用这一信号。我们通过融合作为教师蒸馏来弥补这一差距,将融合嵌入的停止梯度副本视为单模态嵌入的教师信号,并配以Tuple-InfoNCE项直接监督融合嵌入。我们将这一目标实例化为OmniRetriever-7B。在六个零样本检索基准上,OmniRetriever-7B在Clotho和SoundDescs上以R@1超过闭源的Gemini Embedding 2达13.3-18.0,并在MSR-VTT和MSVD上达到当代零样本专家级开放视频-文本编码器的水平。为了压力测试联合表示,我们进一步发布了OmniRetriever-Bench,这是一个包含12个方向的AVT检索基准,总计3782个三元组;在此基准上,OmniRetriever-7B达到AVG-all 34.84,比Gemini Embedding 2提高1.72,比之前最好的开源AVT方法提高8.03。

英文摘要

Unified multimodal embedding spaces have become the standard interface for cross-modal retrieval and multimodal RAG, and recent audio-video-text (AVT) encoders extend this setting to three modalities. Such encoders can produce a joint (T,V,A) embedding whenever all three modalities are available, but standard pairwise InfoNCE objectives leave this signal unused during training. We close this gap with fusion-as-teacher distillation, which treats a stop-gradient copy of the fused embedding as a teacher signal for the single-modal embeddings, paired with a Tuple-InfoNCE term that supervises the fused embedding directly. We instantiate this objective as OmniRetriever-7B. Across six zero-shot retrieval benchmarks, OmniRetriever-7B surpasses the closed-source Gemini Embedding 2 by 13.3-18.0 R@1 on Clotho and SoundDescs, and reaches the contemporary zero-shot specialist band of open video-text encoders on MSR-VTT and MSVD. To stress-test joint representations, we further release OmniRetriever-Bench, a 12-direction AVT retrieval benchmark totaling 3782 triples; on it OmniRetriever-7B attains AVG-all 34.84, improving over Gemini Embedding 2 by 1.72 and over the best prior open-source AVT method by 8.03.
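The InfoNCE objective mentioned above can be sketched numerically. The abstract does not give the exact form of the distillation loss, so the example below is only one plausible reading: single-modal embeddings act as queries and the fused embeddings act as fixed teacher keys (treated as constants, standing in for the stop-gradient copy). The embeddings, temperature, and function name are assumptions for illustration.

```python
import math


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss where queries[i] should match keys[i]
    against all other keys in the batch."""
    losses = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)
        # Numerically stable log-sum-exp over the batch.
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Fused (T,V,A) embeddings used as teacher keys (hypothetical unit vectors).
fused_teacher = [[1.0, 0.0], [0.0, 1.0]]

# Single-modal student embeddings: one aligned with the teacher, one swapped.
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(text_aligned, fused_teacher)
loss_swapped = info_nce(text_swapped, fused_teacher)
print(loss_aligned, loss_swapped)
```

In an autodiff framework the teacher keys would be detached so that gradients flow only into the single-modal encoder; plain Python has no gradients, so that detail is only noted here.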

2605.26636 2026-05-27 cs.CV cs.AI 版本更新

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

JetViT: 高效高分辨率视觉Transformer与训练后注意力搜索

Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai

发表机构 * MIT(麻省理工学院) University of Pennsylvania(宾夕法尼亚大学) NVIDIA(NVIDIA公司) Physical Intelligence(物理智能)

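AI summary: Proposes JetViT, a family of hybrid-architecture Vision Transformers that uses Post-Training Attention Search to convert pre-trained full-attention ViTs into efficient hybrid-attention variants, achieving higher inference efficiency on high-resolution images without loss of accuracy.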

Comments Accepted to CVPR 2026 Findings

详情
AI中文摘要

我们介绍了JetViT,一种新颖的混合架构视觉Transformer(ViT)模型系列,它在匹配最先进的全注意力视觉基础模型精度的同时,在高分辨率图像上实现了显著更高的推理效率。我们方法的核心是训练后注意力搜索,这是一种训练后加速框架,通过识别并将冗余的全注意力块替换为线性注意力或窗口注意力块,将预训练的全注意力ViT转换为高效的混合注意力变体。通过继承基础模型的MLP和注意力权重,训练后注意力搜索通过三个关键步骤高效探索架构设计空间:(1)优化线性注意力块设计;(2)找到线性注意力块和窗口注意力块的最佳组合;(3)识别并保留关键的全注意力块。我们在两个代表性的高分辨率视觉基础模型DINOv3和DepthAnythingV2上评估了JetViT。在NVIDIA H100 GPU上,JetViT在不牺牲精度的情况下实现了高达1.79倍的吞吐量提升和高达44.81%的延迟降低。我们将很快发布我们的代码和加速后的ViT模型。

英文摘要

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.
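Why replacing full-attention blocks helps at high resolution can be seen from a rough count of query-key pairs. The sketch below is a back-of-the-envelope cost model, not JetViT's search procedure; the 1024-pixel image, 16-pixel patches, and 8x8-token windows are assumed values, and linear attention (whose cost does not scale with pair counts) is left out.

```python
def num_tokens(image_side, patch_size):
    """Number of patch tokens for a square image."""
    per_side = image_side // patch_size
    return per_side * per_side


def full_attention_pairs(n_tokens):
    """Every token attends to every token: quadratic in token count."""
    return n_tokens * n_tokens


def window_attention_pairs(n_tokens, window_tokens):
    """Every token attends only to the tokens inside its own window."""
    return n_tokens * window_tokens


n = num_tokens(image_side=1024, patch_size=16)          # 64 * 64 tokens
full_pairs = full_attention_pairs(n)
window_pairs = window_attention_pairs(n, window_tokens=8 * 8)
pair_ratio = full_pairs / window_pairs
print(n, full_pairs, window_pairs, pair_ratio)
```

Under these assumptions a window-attention block scores 64 times fewer pairs than a full-attention block, which is the kind of saving a hybrid design trades against accuracy.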

2605.26630 2026-05-27 cs.CV 版本更新

Attenuation-Resilient Alternating Optimization for Laparoscopic Liver Landmark Detection

衰减鲁棒的交替优化用于腹腔镜肝脏地标检测

Lanqing Liu, Ruize Cui, Jialun Pei, Diandian Guo, Tiffany Y. So, Pheng-Ann Heng, Jing Qin

发表机构 * The Hong Kong Polytechnic University, Hong Kong, China(香港理工大学) The Chinese University of Hong Kong, Hong Kong, China(香港中文大学)

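AI summary: Proposes A2ONet, which addresses illumination attenuation and structural mismatch in laparoscopic liver landmark detection through illumination field compensation, frequency-orientation selective filtering, and an alternating seg-curve optimization decoder.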

Comments This paper has been accepted by MICCAI 2026

详情
AI中文摘要

肝脏表面地标检测是腹腔镜肝脏手术中解剖引导的基本前提。然而,由于两个普遍存在的挑战,它在实践中仍然不可靠:欠曝光区域的照明衰减和像素级定位与连续曲线几何之间的结构不匹配。为了解决这些限制,我们提出了A2ONet,一种衰减鲁棒的交替优化网络,用于稳健的肝脏地标检测。为了减轻照明衰减,A2ONet包含一个照明场补偿(IFC)块,该块自适应增强暗区域同时保持结构一致性。同时,我们引入了一个轻量级的频率方向选择性滤波器(FOSF),以抑制重复纹理干扰并保留显著的曲线线索。基于这些鲁棒的表示,我们设计了一个交替分割-曲线优化(ASCO)解码器,该解码器迭代地将密集分割与显式曲线建模耦合,实现相互指导以优化结构连续性和端点定位。在L3D-2K、L3D和P2ILF上的广泛评估表明,与竞争方法相比,该方法具有一致的改进,为术中解剖引导建立了更可靠的基础。我们的代码将在https://github.com/hyperiondk115/A2ONet上提供。

英文摘要

Liver surface landmark detection is a fundamental prerequisite for anatomical guidance in laparoscopic liver surgery. However, it remains unreliable in practice due to two pervasive challenges: illumination attenuation in underexposed regions and the structural mismatch between pixel-wise localization and continuous curvilinear geometry. To address these limitations, we propose A2ONet, an attenuation-resilient alternating optimization network for robust liver landmark detection. To mitigate illumination attenuation, A2ONet embraces an illumination field compensation (IFC) block that adaptively enhances dark regions while preserving structural consistency. Meanwhile, we introduce a lightweight frequency-orientation selective filter (FOSF) to suppress repetitive texture interference and preserve salient curvilinear cues. Building upon these resilient representations, we design an alternating seg-curve optimization (ASCO) decoder that iteratively couples dense segmentation with explicit curve modeling, enabling mutual guidance to optimize both structural continuity and endpoint localization. Extensive evaluations on L3D-2K, L3D, and P2ILF demonstrate consistent improvements over competitive methods, establishing a more reliable foundation for intraoperative anatomy guidance. Our code will be available at https://github.com/hyperiondk115/A2ONet.

2605.26629 2026-05-27 cs.CV 版本更新

DelowlightSplat: Feed-Forward Gaussian Splatting for Lowlight 3D Scene Reconstruction

DelowlightSplat: 面向低光照3D场景重建的前馈高斯泼溅

Fuzhen Jiang, Zengtian Xie, Zhuoran Li

发表机构 * Hangzhou Dianzi University(杭州电子科技大学) Zhuhai College of Science(珠海科技学院)

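AI summary: Proposes DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework that uses a lightweight Lowlight Adapter and cost-volume-based multi-view inference to predict clean 3D Gaussians directly from sparse, noisy images for high-quality novel-view synthesis.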

详情
AI中文摘要

从稀疏有姿态图像进行新视角合成和3D重建是机器人和AR/VR的核心。然而,前馈3D高斯重建在低光照下因噪声、颜色偏移和不可靠对应而失败。我们提出DelowlightSplat,一种低光照感知的前馈高斯泼溅框架,用于干净的新视角渲染。我们通过仅退化上下文视图同时保持目标视图干净,构建了一个可控的多视图低光照基准。我们引入轻量级低光照适配器进行残差增强以提高可匹配性,并将其与基于成本体积的多视图推理相结合,直接预测干净的3D高斯。实验表明,DelowlightSplat在低光照条件下显著优于先前的前馈方法和两阶段流水线。

英文摘要

Novel-view synthesis and 3D reconstruction from sparse posed images are central to robotics and AR/VR. Yet, feed-forward 3D Gaussian reconstruction fails under lowlight due to noise, color shifts, and unreliable correspondence. We propose DelowlightSplat, a lowlight-aware feed-forward Gaussian splatting framework for clean novel-view rendering. We build a controllable multi-view lowlight benchmark by degrading only context views while keeping target views clean. We introduce a lightweight Lowlight Adapter for residual enhancement to improve matchability, and couple it with cost-volume-based multi-view inference to directly predict clean 3D Gaussians. Experiments show that DelowlightSplat significantly outperforms previous feed-forward methods and two-stage pipelines under lowlight conditions.

2605.26621 2026-05-27 cs.CV cs.AI 版本更新

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1:基于奖励驱动的证据基础用于体积推理分割

Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang

发表机构 * School of Biological Science and Medical Engineering, Beihang University, Beijing, China(生物科学与医学工程学院,北京航空航天大学) Center for Information and Computer Science, School of Science for Open and Environmental Systems, Graduate School of Science and Technology, Keio University, Kanagawa, Japan(信息与计算机科学中心,开放与环境系统科学学院,理工学研究生院,庆应义塾大学,神奈川,日本) Bytedance Inc., China(字节跳动公司,中国) Tsinghua University, Beijing, China(清华大学,北京,中国)

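AI summary: Proposes MedVol-R1, a framework that uses reinforcement learning to ground clinical reasoning in verifiable 2D evidence anchors, which are then propagated into 3D masks, achieving volumetric reasoning segmentation with state-of-the-art performance on multiple benchmarks.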

详情
AI中文摘要

体积推理分割(VRS)旨在根据自由形式的临床查询在3D医学扫描中分割目标区域,其中所指对象通常是隐含的,需要医学知识和体积基础推理。现有方法通常依赖专门的分割标记将语言与掩膜解码连接起来,但这种耦合将决策过程压缩为不透明的潜在表示,限制了可解释性和对多样化叙述表达的泛化能力。在本文中,我们提出MedVol-R1,一种基于强化学习的VRS框架,明确地将证据基础与体积描绘解耦:LVLM将临床推理定位到可验证的2D证据锚点(关键轴向切片和2D边界框),然后由冻结的MedSAM2模块将其传播为连贯的3D掩膜。我们使用冷启动监督微调后接GRPO来训练MedVol-R1,并由多组件奖励引导,该奖励鼓励信息性证据选择、准确的2D空间定位和跨切片体积连贯性,无需昂贵的思维链注释。在M3D-Seg基准的CT-ORG、AbdomenCT-1K和KiTS23上的实验表明,MedVol-R1一致优于强基线并达到最先进性能,强化学习相比纯监督微调提供了明显增益。

英文摘要

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.
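A reward that scores a predicted 2D evidence anchor might look like the sketch below. It is a hypothetical simplification: the abstract names the reward components but not their formulas, so the slice-hit term, the box-IoU term, and the equal weights are assumptions, and the cross-slice coherence component is omitted.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_reward(pred_slice, valid_slices, pred_box, gt_box,
                    w_slice=0.5, w_box=0.5):
    """Weighted sum of a key-slice hit and 2D box overlap
    (weights are illustrative, not taken from the paper)."""
    slice_hit = 1.0 if pred_slice in valid_slices else 0.0
    return w_slice * slice_hit + w_box * box_iou(pred_box, gt_box)


# Hypothetical prediction: correct axial slice, half-overlapping box.
iou = box_iou((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0))
reward = evidence_reward(
    pred_slice=42, valid_slices={40, 41, 42},
    pred_box=(0.0, 0.0, 10.0, 10.0), gt_box=(5.0, 0.0, 15.0, 10.0),
)
print(iou, reward)
```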

2605.26616 2026-05-27 cs.CV 版本更新

Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction

高斯-体素二重奏:用于快速准确单目表面重建的双支架混合表示

Zhenhua Du, Zhen Tan, Haoyu Zhang, Dewen Hu, Shuaifeng Zhi, Peidong Liu

发表机构 * Zhejiang University(浙江大学) Westlake University(西湖大学) National University of Defense Technology(国防科技大学)

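AI summary: Proposes a hybrid Gaussian-voxel representation that confines anchored Gaussians to a narrow band around surfaces defined by a voxelized SDF and adds an implicit surface tethering loss, achieving high-quality surface reconstruction and novel view synthesis while keeping training fast and rendering real-time.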

Comments 27 pages, 14 figures

详情
AI中文摘要

尽管3D高斯泼溅在逼真新视图合成方面取得了显著成功,但其追求快速高保真3D重建一直受限于几何精度与优化效率之间的权衡。专攻图像渲染的方法收敛快,但代价是由于多余基元过拟合训练视图导致几何不完美;而集成神经有符号距离场(SDF)以改善几何的方法则带来了高昂的训练成本。在本文中,我们尝试通过将支架锚定高斯与联合优化的稀疏体素支架绑定来达成更好的权衡。这种混合高斯-体素表示明确地将锚定高斯限制在体素化SDF定义的表面周围的窄带内,有效提高了表示效率并凝聚了浮动高斯,同时不牺牲几何质量。隐式表面约束损失进一步以相互正则化的方式将单个高斯基元拉近至SDF诱导的表面,从而提高重建精度。在来自ScanNet++、ScanNetv2和DeepBlending数据集的各种真实室内场景上的大量实验表明,我们的方法在保持快速训练收敛和实时渲染的同时,实现了最先进的表面重建质量以及优于领先基线的新视图合成。代码将在https://github.com/duzh11/VoxelGS提供。

英文摘要

While 3D Gaussian Splatting has achieved remarkable success in photorealistic novel view synthesis, its pursuit of fast and high-fidelity 3D reconstruction has long been constrained by a trade-off between geometric accuracy and optimization efficiency. Methods specialized in image rendering converge quickly at the cost of imperfect geometry caused by superfluous primitives overfitting training views, while methods integrating neural signed-distance field (SDF) for better geometry incur prohibitive training costs. In this paper, we attempt to strike a better trade-off by tethering scaffold-anchored Gaussians to a jointly optimized sparse voxel scaffold. This hybrid Gaussian-Voxel representation explicitly confines anchored Gaussians to a narrow band around surfaces defined by voxelized SDFs, which effectively improves representation efficiency and condenses floating Gaussians without sacrificing geometry quality. An implicit surface tethering loss further pulls individual Gaussian primitives closer to SDF-induced surfaces in a mutually regularized manner for improved reconstruction accuracy. Extensive experiments on diverse real-world indoor scenes from ScanNet++, ScanNetv2, and DeepBlending datasets demonstrate that our method achieves state-of-the-art surface reconstruction quality as well as superior novel view synthesis against leading baselines, while maintaining fast training convergence and real-time rendering. Code will be available at https://github.com/duzh11/VoxelGS.

2605.26584 2026-05-27 cs.CV 版本更新

O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

O-MARC: 全记忆增强压缩蒸馏用于高效视频理解

Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen

发表机构 * University of Bristol(布里斯托大学) Memories.ai Research(Memories.ai研究院) University of Central Florida(佛罗里达中央大学)

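AI summary: Proposes the O-MARC framework: the training-free compression method OMAC preserves visual memory and audio anchors, and compression distillation makes compact models robust, improving performance on multiple benchmarks and lowering inference cost.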

详情
AI中文摘要

全模态大语言模型实现了统一的音频视频理解,但长联合令牌序列导致推理成本高昂,且现有基准未能完全隔离噪声用户生成视频中的音视频关联。我们引入了UGC-AVQA,一个公开的UGC基准,包含1000个视频和4816个问答对,其中音频移除测试确保基准问题需要声学和视觉证据。为了降低推理成本,我们提出了OMAC,一种无需训练的即插即用压缩方法,保留显著的视觉记忆和时域锚定的音频锚点。为了进一步使紧凑模型对压缩输入鲁棒,我们引入了O-MARC,一种用于学习记忆压缩多模态上下文的压缩蒸馏框架。在Qwen2.5-Omni-3B上,O-MARC在四个基准上的平均得分提升至45.8,优于全令牌推理的44.1和OmniZip的41.0。与全令牌推理相比,OMAC还保持了推理效率,延迟降低34.6%(1.53倍加速),内存降低34.7%。

英文摘要

Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53x speedup) and memory by 34.7% compared with full-token inference.
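The reported speedup is consistent with the reported latency reduction: if latency falls by a fraction r, the speedup is 1 / (1 - r). A small check (the function name is illustrative):

```python
def speedup_from_latency_reduction(reduction):
    """Speedup factor implied by a fractional latency reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


omac_speedup = speedup_from_latency_reduction(0.346)
print(round(omac_speedup, 2))  # 1 / 0.654 rounds to 1.53
```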

2605.26576 2026-05-27 cs.CV cs.LG 版本更新

TrackRef3D: Multi-View Consistent Track-then-Label for Open-World Referring Segmentation in 3D Gaussian Splatting

TrackRef3D: 面向开放世界3D高斯泼溅分割的多视角一致跟踪-标注方法

Yuyang Tan, Renhe Zhang, Hang Zhang, Ao Li, Xin Tan

发表机构 * East China Normal University, Shanghai, China(华东师范大学,上海,中国) Shanghai AI Laboratory(上海人工智能实验室) University of Electronic Science and Technology of China, Chengdu, China(电子科技大学,成都,中国)

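AI summary: Proposes TrackRef3D, a fully automatic pipeline whose multi-view consistent track-then-label paradigm decouples object discovery from semantic grounding, achieving open-world referring segmentation in 3D Gaussian Splatting without manual annotation.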

详情
AI中文摘要

引用3D高斯泼溅(R3DGS)利用自然语言进行3D目标分割,已成为具身AI的关键能力。然而,现有方法通常依赖昂贵的每场景人工标注和每视图伪掩码生成,存在多视角不一致以及对不同查询特异性的泛化能力差的问题。为此,我们提出TrackRef3D,一种全自动流水线,通过引入多视角一致的跟踪-标注范式,从根本上将目标发现与语义定位解耦,无需人工标注即可实现3D高斯泼溅(3DGS)中的开放世界引用分割。具体而言,我们提出轨迹感知语义共识模块(TSCM),通过同义词聚类和轨迹感知投票聚合跨视图预测,建立规范语义身份,从而确保多视角一致性。此外,我们采用可见性感知描述生成策略以缓解歧义,并提出混合训练策略(HTS),利用多正例对比目标联合优化粗粒度类别语义和细粒度引用线索,确保在不同查询特异性下的鲁棒性。在基准上的大量实验表明,TrackRef3D达到了最先进的性能。

英文摘要

Referring 3D Gaussian Splatting (R3DGS), which utilizes natural language for 3D object segmentation, has emerged as a crucial capability for embodied AI. However, existing methods typically rely on expensive per-scene manual annotation and per-view pseudo mask generation, which suffer from multi-view inconsistency and poor generalization to varying query specificities. To address this, we present TrackRef3D, a fully automatic pipeline that achieves open-world referring segmentation in 3D Gaussian Splatting (3DGS) without manual annotation by introducing a multi-view consistent track-then-label paradigm that fundamentally decouples object discovery from semantic grounding. Specifically, we propose a Trajectory-Aware Semantic Consensus Module (TSCM) which aggregates cross-view predictions via synonymous clustering and trajectory-aware voting to establish a canonical semantic identity, thereby ensuring multi-view consistency. Furthermore, we employ a visibility-aware description generation strategy to mitigate ambiguity and propose a Hybrid Training Strategy (HTS) that jointly optimizes coarse category semantics and fine-grained referential cues to ensure robustness under varying query specificities using a multi-positive contrastive objective. Extensive experiments on benchmarks demonstrate that TrackRef3D achieves state-of-the-art performance.

2605.26538 2026-05-27 cs.CV 版本更新

Scheduled Style Injection: Expanding the Style-Content Pareto Frontier in Training-Free Diffusion-based Style Transfer

调度式风格注入:在免训练扩散风格迁移中扩展风格-内容帕累托前沿

Amey Sunil Kulkarni

Affiliations: Independent Researcher

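AI summary: Systematically explores scheduling along four dimensions of control: style injection strength across decoder layers and across denoising timesteps, and ControlNet geometric conditioning along both axes. Decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, outperform increasing ones, and cosine and square-root timestep schedules outperform linear. Gamma scheduling and ControlNet conditioning are nearly independent, so combining them expands the Pareto frontier and yields a 6.1% relative improvement in ArtFID.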

Comments Accepted to CVPR NTIRE 2026

详情
AI中文摘要

基于预训练扩散模型的风格迁移已取得快速进展,但一个核心问题仍未充分探索:模型中风格注入应在何处最强?领先的免训练方法StyleID在所有层和时间步上统一使用单个全局参数(gamma),这强制了风格质量与内容保留之间的固定权衡。我们证明这种权衡是不必要的刚性。我们系统地探索了四个控制维度:跨解码器层改变风格注入强度、跨去噪时间步改变强度,以及沿两个轴调度ControlNet几何条件。模式在所有地方一致:递减调度(在较浅层和较早时间步注入更强的结构信号)可靠地优于反向调度。除方向外,调度形状也很重要:余弦和平方根时间步调度优于线性。最重要的是,我们发现gamma调度和ControlNet条件几乎独立。由此产生的组合配置扩展了帕累托前沿,与任何单一基线设置相比,提供了风格保真度和内容保留之间的优越权衡。我们最佳的平衡配置实现了ArtFID 27.036,而StyleID为28.801——相对改进6.1%,在整个风格-内容权衡前沿上具有一致的增益。结果在35种配置(总计超过28,000张风格化图像)上使用四种互补指标进行了验证。这些发现在SD骨干网络上具有相同的排名顺序。所有修改都是免训练、无参数的,仅需几行调度代码;代码可在https://github.com/ameyskulkarni/scheduled_style_injection获取。

英文摘要

Style transfer with pre-trained diffusion models has advanced rapidly, but a core question remains underexplored: where in the model should style injection be strongest? StyleID, the leading training-free method, uses a single global parameter (gamma) uniformly across all layers and timesteps, which forces a fixed tradeoff between style quality and content preservation. We show this tradeoff is unnecessarily rigid. We systematically explore four dimensions of control: varying style injection strength across decoder layers, across denoising timesteps, and scheduling ControlNet geometric conditioning along both axes. The pattern is consistent everywhere: decreasing schedules, with stronger structural signal injection in shallower layers and earlier timesteps, reliably outperform the reverse. Beyond direction, schedule shape matters: cosine and square-root timestep schedules outperform linear. Most importantly, we find that gamma scheduling and ControlNet conditioning are nearly independent. The resulting combined configurations expand the Pareto frontier, offering superior tradeoffs between style fidelity and content preservation compared to any single baseline setting. Our best balanced configuration achieves ArtFID of 27.036 versus StyleID's 28.801 - a 6.1% relative improvement, with consistent gains across the full style-content tradeoff frontier. Results are validated across 35 configurations totaling over 28,000 stylized images using four complementary metrics. These findings generalize across SD backbones with identical rank ordering. All modifications are training-free, parameter-free, and require only a few lines of scheduling code; code is available at https://github.com/ameyskulkarni/scheduled_style_injection.
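A minimal sketch of what a decreasing timestep schedule could look like, together with the relative-improvement arithmetic quoted above. The function names, the normalized progress variable `t`, and the exact linear, cosine, and square-root formulas are assumptions for illustration; the abstract does not give the paper's parameterization.

```python
import math


def strength_schedule(t: float, s_max: float, s_min: float, shape: str = "cosine") -> float:
    """Decreasing strength over normalized progress t in [0, 1].

    t = 0 is the first denoising step and t = 1 the last (an assumed convention).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if shape == "linear":
        w = 1.0 - t
    elif shape == "cosine":
        w = 0.5 * (1.0 + math.cos(math.pi * t))
    elif shape == "sqrt":
        w = 1.0 - math.sqrt(t)
    else:
        raise ValueError(f"unknown shape: {shape}")
    return s_min + (s_max - s_min) * w


def relative_improvement(baseline: float, new: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (baseline - new) / baseline


# ArtFID values quoted in the abstract: StyleID 28.801, best balanced config 27.036.
print(f"{relative_improvement(28.801, 27.036):.1%}")
```

All three shapes start at `s_max` and end at `s_min`; they differ only in how quickly the strength falls off in between, which is the "schedule shape" the abstract refers to.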

2605.26535 2026-05-27 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Recursive Flow Matching

递归流匹配

Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu

Affiliations: University of California, San Diego; University of Michigan

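AI summary: Proposes Recursive Flow Matching (RecFM), a framework that uses a self-consistency constraint to align trajectories across discretization scales, enabling high-fidelity one-step or few-step (2-4 step) dynamic generation. On scientific benchmarks it is up to 20x faster than leading diffusion-based emulators while improving predictive accuracy.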

Comments Project page: https://jhhuangchloe.github.io/RecFM/

详情
AI中文摘要

生成模型已成为解决物理系统和建模复杂时空动态的强大范式。然而,在不产生高计算成本的情况下实现高物理精度仍然是一个基本挑战,因为现有方法面临关键的速度-保真度权衡。在这项工作中,我们引入了递归流匹配(RecFM),一个用于预测复杂时空动态的生成框架。RecFM强制执行自一致性以对齐跨离散化尺度的轨迹,减少离散化误差并改善基于物理任务的各种指标。据我们所知,这是第一种在科学系统中实现高保真单步和少步(2-4步)动态生成的方法,其性能可与最先进的多步求解器相媲美。在具有挑战性的科学基准测试中,RecFM相比领先的扩散模拟器实现了高达20倍的加速,同时提高了预测精度。此外,与普通流匹配相比,RecFM将均方误差降低了超过15%,为实时科学模拟提供了一种可扩展且高效的解决方案。

英文摘要

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20$\times$ speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG 版本更新

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

Affiliations: School of Computing, Telkom University; Faculty of Engineering and Technology, School of Computing and Artificial Intelligence

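AI summary: Proposes a decoupled, edge-deployable pipeline that combines a YOLO26-x-obb detector, a deterministic encoding module, and a QLoRA-fine-tuned Qwen-2.5-1.5B model to localize wind turbine blade defects and generate structured reports. It substantially outperforms a zero-shot VLM baseline on BLEU-4, hallucination rate, and expert score.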

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情
AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成;在当前的实践中,这些任务被分开处理,语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道,由三个组件组成,每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器,在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块,将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型,通过量化低秩适应(QLoRA)在947个合成生成的维护报告上进行适配,从该提示生成结构化的JSON报告。检索增强微调(RAFT)进一步将每个建议基于索引的维护程序。五项消融实验,通过BLEU-4、ROUGE-L、幻觉率(HR)和LLM-as-a-Judge评分标准,将该管道与单一视觉-语言模型(VLM)基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10,而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下,QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明,具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 = 0.41, HR = 4%, and Expert Score = 8.6/10, compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
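To make the Bridge step concrete, here is one way a deterministic, parameter-free mapping from a detected box to a grid-referenced token might look. The 4x4 grid, the `<R..C..>` token format, the detection dictionary keys, and the prompt wording are all hypothetical; the abstract does not describe the actual encoding.

```python
def box_to_grid_token(cx: float, cy: float, img_w: int, img_h: int,
                      rows: int = 4, cols: int = 4) -> str:
    """Map a box center in pixels to a grid-cell token such as '<R3C0>'."""
    # Clamp so that a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / img_w * cols), cols - 1)
    row = min(int(cy / img_h * rows), rows - 1)
    return f"<R{row}C{col}>"


def build_prompt(detections, img_w: int, img_h: int) -> str:
    """Embed each detection's label, grid token, and confidence in a text prompt."""
    lines = [
        f"- {d['label']} at {box_to_grid_token(d['cx'], d['cy'], img_w, img_h)}"
        f" (confidence {d['conf']:.2f})"
        for d in detections
    ]
    return "Detected defects:\n" + "\n".join(lines)


detections = [
    {"label": "crack", "cx": 100.0, "cy": 700.0, "conf": 0.91},
    {"label": "erosion", "cx": 800.0, "cy": 10.0, "conf": 0.78},
]
print(build_prompt(detections, img_w=800, img_h=800))
```

Because the mapping has no learned parameters, the same detection always yields the same token, so the language model only ever sees a small, fixed spatial vocabulary.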

2605.26525 2026-05-27 cs.CV cs.AI 版本更新

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

ReCA: 通过递归上下文分配实现多镜头长视频外推

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

Affiliations: Monash University; Tongyi Lab, Alibaba Group; Zhejiang University; University of Queensland

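AI summary: To address the context-allocation bottleneck in multi-shot video extrapolation, proposes a Recursive Context Allocation framework that improves the consistency and quality of long-video generation through hierarchical decomposition and structured state propagation.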

Comments Project Page: https://reca.vmv.re , Code: https://github.com/ali-vilab/ReCA

详情
AI中文摘要

分钟级电影式视频生成是生成式视频模型的核心挑战。现有范式仅解决该挑战的片段:单镜头外推保留锚点但缺乏电影结构,而多镜头叙事施加结构却可自由创造视觉状态而非延续观察到的状态。我们定义多镜头视频外推(MSVE)任务,该任务将观察到的帧或片段扩展为一系列具有电影结构的镜头,同时保留锚点状态并推进叙事意图。该设置受限于短视频模型的每次调用生成预算。我们识别出三个耦合瓶颈:(1)全局规划器从完整剧本中过度指定不支持的细节;(2)镜头级提示在携带完整故事时稀释任务相关状态;(3)时间链将生成帧转变为有损记忆,其中身份、场景、对象和动作状态衰减。MSVE揭示长视频失败不仅是上下文长度的限制,更是上下文分配失败。我们提出递归上下文分配(ReCA),一种推理时框架,在规划和生成之间分层分配上下文。ReCA递归地将MSVE分解为上下文有界子问题,在叶节点调用冻结生成器,并跨时间传播结构化状态更新。为评估该设置,我们进一步提出MSVE-Bench和NB-Q,一种源接地协议,带有专为3至5分钟长视频生成设计的提示,该场景未被现有短视频基准覆盖。与先前方法相比,ReCA在最强竞争控制器上将平均归一化分数提高8%至16%,并将多镜头一致性指标提高28%至43%。查看项目页面:https://reca.vmv.re。

英文摘要

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.

2605.26524 2026-05-27 cs.CV cs.AI 版本更新

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

CmIVTP:面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

Affiliations: Department of Logistics and Maritime Studies, The Hong Kong Polytechnic University; Research Centre for ESG Advancement (RCESGA), The Hong Kong Polytechnic University; School of Navigation, Wuhan University of Technology

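AI summary: Because single-source data limits the accuracy of vessel trajectory prediction, proposes CmIVTP, a cross-modal interaction framework that fuses AIS and CCTV data and uses a target-aware scene encoder and a cross-modal interaction transformer to improve prediction accuracy.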

详情
AI中文摘要

海事智能交通系统(MITS)对于确保繁忙水域的航行安全和效率至关重要。然而,由于单源数据的局限性,准确的船舶轨迹预测仍然具有挑战性。自动识别系统(AIS)数据对于小型船舶通常稀疏或不可用,而仅靠闭路电视(CCTV)数据无法完全捕捉动态船舶行为。为缓解这些挑战,我们提出了一种基于跨模态交互的船舶轨迹预测(称为CmIVTP)框架,以建模船舶动力学与环境约束之间的复杂交互。具体地,我们引入了一个目标感知场景编码器来提取场景语义特征,有效捕捉船舶-环境交互并提高轨迹预测精度。此外,我们提出了一个跨模态交互变换器,它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互,确保动态一致且环境可行的预测。此外,我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库,为候选轨迹生成提供了一种高效且可扩展的方法。另外,我们引入了海事多模态数据集增强版(名为Maritime-MmD$^+$),这是一个同步AIS数据和CCTV视频数据的大规模数据集,为多模态轨迹预测研究提供了有力支持。大量实验表明,CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

英文摘要

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work are available at https://github.com/LouisYxLu/CmIVTP.

2605.26520 2026-05-27 cs.CV cs.AI 版本更新

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

Affiliations: Shanghai Jiao Tong University; SenseTime Research; Shandong Normal University

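AI summary: To address the limits of the text-centric paradigm in long-horizon visual reasoning with vision-language models, proposes InterSketch, which strengthens interleaved visual-textual chain-of-thought through self-correction and stepwise reward mechanisms and outperforms proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.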

详情
AI中文摘要

尽管视觉-语言模型(VLM)已展现出多轮视觉推理能力,但其推理轨迹仍相对浅层且以文本为中心,限制了其在复杂视觉挑战中的适用性。相比之下,人类思维通常涉及长程推理,并伴有交错的视觉-文本思维链(VT-CoT)。为弥合这一差距,我们引入InterSketch,一种交错推理模型,通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图,并将其与文本推理交错进行,从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言,在第一个冷启动阶段,我们提出了一个合成的高质量交错VT-CoT数据集,并引入反思机制,使模型具备多轮交错推理和自校正能力。在后续的强化学习(RL)阶段,我们设计了一种逐步奖励机制,以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性,其性能甚至超越了Gemini-3-Pro等专有模型。

英文摘要

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG 版本更新

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

Affiliations: Department of Computer Science and Engineering; Hankuk University of Foreign Studies

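AI summary: Proposes an ROI-preserving, vertex-based, variable-sized partitioning of the cortical surface into patches (cortical supervertices) and a Vision Transformer that tolerates variable-size patches (CSV-ViT). It outperforms existing surface-based models on three classification tasks: AD diagnosis, amyloid positivity, and tau positivity.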

详情
AI中文摘要

确认阿尔茨海默病(AD)通常依赖于正电子发射断层扫描(PET),该方法仍然昂贵且有创,这促使了基于结构MRI的预筛查的使用。在非欧几里得流形,特别是大脑皮层表面上的深度学习,由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习;然而,施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说,许多基于表面的模型对感兴趣区域(ROI)的感知有限,这可能导致非皮层区域(如内侧壁)被包含在内。我们提出了一种皮层表面分块方法,该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点(CSV)。基于这种表示,我们设计了CSV视觉Transformer(CSV-ViT),这是一种可变大小补丁容忍的视觉Transformer,使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI,并通过将AD相关状态分类为三个类别来评估我们的框架:AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中,CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明,所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

英文摘要

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.
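The padding and mask-aware embedding mentioned above can be illustrated with a toy example. It is only a sketch of the general technique: each patch is a list of per-vertex scalar features, and a masked mean stands in for the learned patch embedding, whose actual form the abstract does not describe. The function names are hypothetical.

```python
def pad_patches(patches, pad_value=0.0):
    """Pad variable-length patches to a common length and build a validity mask."""
    max_len = max(len(p) for p in patches)
    padded = [list(p) + [pad_value] * (max_len - len(p)) for p in patches]
    mask = [[1] * len(p) + [0] * (max_len - len(p)) for p in patches]
    return padded, mask


def masked_mean(padded, mask):
    """Average only the real vertices of each patch, ignoring padded slots."""
    pooled = []
    for values, m in zip(padded, mask):
        n_valid = sum(m)
        total = sum(v * k for v, k in zip(values, m))
        pooled.append(total / n_valid if n_valid else 0.0)
    return pooled


# Two cortical patches with 3 and 1 vertices (toy per-vertex feature values).
patches = [[1.0, 2.0, 3.0], [4.0]]
padded, mask = pad_patches(patches)
print(padded, mask, masked_mean(padded, mask))
```

Without the mask, the second patch would be averaged over its padded zeros and come out as 4/3 instead of 4.0, which is why an embedding over padded patches needs to be mask-aware.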

2605.26513 2026-05-27 cs.CV 版本更新

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Re-M3Dr:重新平衡的多模态均值偏差回归

Haojie Yin, Chengcheng Feng, Tianyi Liu, Tianqi Zhang, Kaizhu Huang

Affiliations: Duke Kunshan University, China; Xi'an Jiaotong-Liverpool University, China; Soochow University

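AI summary: Finding that multimodal fusion of medical images can underperform unimodal models, proposes the Re-M3Dr framework, which performs multimodal Mean Deviation regression using adaptive-margin supervised contrastive learning and sharpness-aware gradient modulation, reducing mean squared error by an average of 29% on clinical datasets.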

详情
AI中文摘要

均值偏差(MD)是评估眼科视野损失的关键指标。虽然以往的工作仅关注从光学相干断层扫描(OCT)预测MD,但直观上假设将OCT与另一种眼底摄影(FP)成像结合可以提高性能,因为两种眼科医学成像提供了互补信息。当应用复杂的多目标优化时,这一点尤其值得期待,正如常见的多模态分类中所记载的那样。令人惊讶的是,我们的研究表明,在这种医学成像场景中,多模态融合的性能不如单模态模型。通过详细分析,我们确定根本原因是数据分布和模态学习冲突之间的耦合不平衡。这种不平衡扭曲了优化景观,导致训练不稳定。为了解决这一挑战,我们提出了重新平衡的多模态均值偏差回归(Re-M3Dr)方法,这是一种新颖的多模态回归框架。我们通过自适应边界的监督对比学习增强单模态表示。然后,我们的框架通过锐度感知梯度调制稳定联合优化。在公共和私人临床数据集上的实验结果表明,与最先进的多模态学习方法相比,均方误差平均降低29%,证明了Re-M3Dr的优越性。代码可在补充材料中获得。

英文摘要

Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with a second imaging modality, fundus photography (FP), could improve performance, as the two ophthalmic imaging modalities provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than unimodal models. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive-margin-based supervised contrastive learning. Then, our framework stabilizes the joint optimization with sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show an average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.

2605.26503 2026-05-27 cs.CV 版本更新

Uncertainty-Aware Gaussian Map for Vision-Language Navigation

面向视觉-语言导航的不确定性感知高斯地图

Jianzhe Gao, Rui Liu, Yuxuan Xu, Tongtong Cao, Yingxue Zhang, Zhanguang Zhang, Sida Peng, Yi Yang, Wenguan Wang

Affiliations: The State Key Lab of Brain-Machine Intelligence; Department of Foundation Model, 2012 Labs, Huawei; Noah’s Ark Lab, 2012 Labs, Huawei; School of Software Technology, Zhejiang University

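AI summary: Proposes an uncertainty-aware Gaussian map that explicitly models three kinds of perceptual uncertainty (geometric, semantic, and appearance) and integrates them into the observation space, improving the reliability of agent decisions in vision-language navigation.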

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体按照自然语言指令在3D环境中导航。在导航过程中,现有智能体通常遇到感知不确定性,例如缺乏可靠定位的证据或空间线索解释的模糊性,但在预测动作时通常忽略此类信息。在这项工作中,我们显式建模三种形式的感知不确定性(即几何、语义和外观不确定性),并将其整合到智能体的观测空间中,以实现知情决策。具体来说,我们的智能体首先构建一个语义高斯地图(SGM),由从全景观测初始化的可微3D高斯原语组成,编码环境的几何结构和语义内容。在SGM之上,通过高斯位置和尺度的变分扰动估计几何不确定性,以评估结构可靠性;通过扰动高斯语义属性捕获语义不确定性,以揭示模糊解释;通过Fisher信息刻画外观不确定性,该信息衡量渲染观测对高斯级变化的敏感性。这些不确定性被纳入SGM,将其扩展为统一的3D价值地图,将其作为支持可靠导航的可供性和约束。在多个VLN基准上的综合评估显示了我们的智能体的有效性。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.

2605.26501 2026-05-27 cs.CV cs.AI 版本更新

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

揭示视觉-语言模型的脆弱性:通过纹理约束扰动和跨模态优化的多模态对抗协同

Xiang Fang, Wanlong Fang, Changshuo Wang

Affiliations: School of Software Engineering, Huazhong University of Science and Technology; Nanyang Technological University, Singapore; University College London

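AI summary: Proposes a multi-modal adversarial synergy framework that jointly optimizes a texture-constrained universal adversarial perturbation and a learnable text prompt perturbation in a black-box setting, exposing the fragility of vision-language models under multi-modal attack.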

Comments Published in AAAI 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通过整合视觉和文本输入,在图像描述和视觉问答等任务中表现出色,改变了多模态理解。然而,它们对抗攻击的鲁棒性,特别是利用两种模态的攻击,仍未被充分探索,这给自动驾驶和内容审核等关键应用带来了风险。现有攻击集中于单一模态或需要不切实际的白盒访问,限制了其现实相关性。在本文中,我们引入了多模态对抗协同(MMAS),这是一个开创性的框架,用于针对LVLMs构建通用的黑盒多模态攻击。MMAS同时生成纹理尺度约束的通用对抗扰动用于图像,以及可学习的提示扰动用于文本,仅通过模型查询进行联合优化。图像扰动利用基于小波的纹理约束确保在各种视觉输入中的不可感知性和鲁棒性。文本扰动在嵌入空间中受L范数约束,在保持语义连贯性的同时将输出导向目标。一种新颖的跨模态正则化项对齐扰动的梯度方向,增强了它们在任务和模型间的协同影响和可迁移性。大量实验表明,我们提出的攻击在主流LVLMs上具有强大的通用对抗能力。

英文摘要

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack against prevalent LVLMs.

2605.26500 2026-05-27 cs.CV 版本更新

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

面向视觉-语言导航的开放集语义分组3D高斯地图

Jianzhe Gao, Rui Liu, Wenguan Wang

Affiliations: The State Key Lab of Brain-Machine Intelligence

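AI summary: Proposes a 3D Gaussian Map representation of the environment, built by constructing an egocentric scene map online and enriching it with geometric and semantic information through an open-set semantic grouping operation, together with a multi-level action prediction strategy. Effectiveness is validated on three public benchmarks.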

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体基于自然语言指令遍历复杂的3D环境,这需要对场景有透彻的理解。现有工作为智能体配备了各种场景表示以增强空间感知,但往往忽略了VLN场景中复杂的3D几何和丰富的语义,限制了在多样化和未见环境中的泛化能力。为应对这些挑战,本文提出一种3D高斯地图,将环境表示为一组可微分的3D高斯,并据此开发了用于VLN的导航策略。具体地,通过从稀疏伪激光雷达点云初始化3D高斯来在线构建自中心场景地图,为场景理解提供信息丰富的几何先验。每个高斯基元进一步通过开放集语义分组操作得到增强,该操作基于3D高斯在开放世界中属于对象实例或材质类别的成员关系对其进行分组,形成统一的3D高斯地图。基于该地图,设计了多层级动作预测策略,结合多粒度的空间-语义线索,辅助智能体进行决策。在三个公开基准(即R2R、R4R和REVERIE)上进行的大量实验验证了我们方法的有效性。

英文摘要

Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.

2605.26491 2026-05-27 cs.LG cs.CV 版本更新

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

超越成对偏好:扩散模型的列表级奖励感知对齐

Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

Affiliations: Caltech; Stanford University

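AI summary: Proposes Diffusion LAIR, a reward-aware listwise optimization method that uses continuous reward scores and all candidate images simultaneously to align diffusion models, outperforming pairwise preference baselines on text-to-image generation and other tasks.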

详情
AI中文摘要

偏好优化已成为从人类反馈中进行在线强化学习(RLHF)的一种高效替代方案,用于对齐文本到图像扩散模型。然而,现有方法大多将监督简化为二元成对比较。当训练数据自然包含同一提示的多个候选图像,并且连续奖励分数能提供比单一赢家-输家标签更丰富的信息时,这种成对简化具有局限性。为解决这些局限性,我们提出了Diffusion LAIR,一种用于扩散模型的奖励感知列表级偏好优化方法。对于每个提示,LAIR将一组候选图像的奖励分数转换为居中优势权重,然后在隐式奖励上优化优势加权回归目标,隐式奖励定义为当前模型相对于固定参考模型的去噪损失改进,并带有二次惩罚以正则化隐式奖励的幅度。所得目标同时使用所有候选图像而非选择成对,并通过显式控制隐式奖励的幅度保持保守性。LAIR目标在隐式奖励空间中具有有界闭式最优解,阐明了正则化强度如何控制偏好更新的幅度。实验表明,Diffusion LAIR在SD1.5和SDXL上,在文本到图像生成、组合生成和图像编辑基准测试中均优于强偏好优化基线。

英文摘要

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
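The ingredients named in the abstract (centered advantage weights, an implicit reward defined as a denoising-loss improvement, and a quadratic penalty) can be put together in a toy objective. The specific form below, `-A_i * s_i + (lam / 2) * s_i**2` averaged over candidates, is an assumption chosen because it has a simple bounded optimum `s_i = A_i / lam`; the paper's actual loss, scaling, and normalization may differ.

```python
def centered_advantages(rewards):
    """Subtract the group mean so that the weights sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def implicit_reward(loss_ref: float, loss_model: float) -> float:
    """Denoising-loss improvement of the current model over the reference."""
    return loss_ref - loss_model


def listwise_objective(rewards, loss_ref, loss_model, lam: float = 1.0) -> float:
    """Assumed objective: advantage-weighted implicit reward with a quadratic penalty."""
    adv = centered_advantages(rewards)
    total = 0.0
    for a, l_ref, l_model in zip(adv, loss_ref, loss_model):
        s = implicit_reward(l_ref, l_model)
        total += -a * s + 0.5 * lam * s * s
    return total / len(rewards)


# Three candidate images for one prompt, with continuous reward scores.
rewards = [1.0, 2.0, 3.0]
loss_ref = [1.0, 1.0, 1.0]
# Model losses chosen so that each implicit reward equals A_i / lam with lam = 2.
loss_at_optimum = [1.5, 1.0, 0.5]
print(listwise_objective(rewards, loss_ref, loss_at_optimum, lam=2.0))
```

In this toy form, a larger `lam` shrinks the optimal implicit reward toward zero, matching the abstract's point that the regularization strength controls the magnitude of the preference update.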

2605.26486 2026-05-27 cs.CV 版本更新

LongCat-Video-Avatar 1.5 Technical Report

LongCat-Video-Avatar 1.5 技术报告

Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang

Affiliations: Meituan LongCat Team

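AI summary: Presents LongCat-Video-Avatar 1.5, an open-source framework that achieves accurate lip synchronization, full-body temporal stability, and long-video generation by upgrading the audio encoder, refining the training recipe, curating data, and applying RLHF training. It matches or exceeds leading commercial systems on the authors' benchmark.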

Comments Homepage: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/ Github: https://github.com/meituan-longcat/LongCat-Video

详情
AI中文摘要

尽管音频驱动视频生成取得了进展,但实现商业级稳定性仍具挑战。我们提出 LongCat-Video-Avatar 1.5,一个升级的开源框架,优先考虑系统工程和生产就绪性而非架构新颖性。通过将音频编码器升级为 Whisper Large 并精心扩展训练配方,v1.5 实现了精确的唇同步、全身时间稳定性和严格身份一致性的鲁棒长视频生成。通过严格的数据筛选和 RLHF 训练,该模型能轻松泛化到风格化领域(如动漫和动物),并原生处理复杂现实条件(如多人交互和物体操作)。此外,针对工业部署的实际需求,我们采用高级步进蒸馏将推理加速至最优的 8 NFE,在服务效率与视觉保真度之间实现了良好权衡。通过在超过 500 个多样化测试案例的综合基准上进行的广泛定量指标和严格人工评估,验证了我们方法的优越性。结果表明,v1.5 在人类相似度评分和专家级质量评估中,与领先的闭源系统(如 HeyGen、OmniHuman 1.5、Kling Avatar 2.0)相比,达到了具有竞争力或更优的性能。通过开源发布,LongCat-Video-Avatar 1.5 缩小了学术研究原型与商业级部署之间的差距。

英文摘要

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF Training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.

2605.26485 2026-05-27 cs.CV cs.CL 版本更新

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

OmniInteract:面向实时全模态助手的流式交互基准测试

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

发表机构 * CUHK MMLab(香港中文大学多媒体实验室) SJTU(上海交通大学) NTU(南洋理工大学) McMaster(麦克马斯特大学) CityUHK(香港城市大学) JUFE(江西财经大学)

AI总结 提出OmniInteract基准,通过在线推理音视频流评估全模态大模型的实时交互能力,发现现有模型在流式交互中表现薄弱。

详情
AI中文摘要

我们引入了OmniInteract,一个用于实时全模态大语言模型的流式基准测试,通过音视频流上的原生在线推理进行评估。与离线视频理解或文本提示的流式问答不同,OmniInteract保留了原始音视频流,并要求模型在线处理,无法访问未来内容。用户查询和环境声音嵌入在音频轨道中,要求模型检测多模态触发信号,决定何时响应,并在流展开时回答问题。OmniInteract包含250个视频,具有1430个时间锚定的响应槽:其中1062个1Q1A槽涵盖实时、主动和嵌套场景,368个1QnA槽用于连续任务监控和步骤指导。每个槽包括触发信号、响应窗口和目标答案。我们使用交互感知质量-时效性F1、中断诊断套件和嵌套链完成分数来评估响应正确性、时序、无效输出、中断处理和上下文连续性。实验表明,当前模型在流式交互中仍然薄弱,最佳整体IA-QTF1仅为0.368,最佳1QnA IA-QTF1仅为0.052。在全双工设置下的数学推理进一步研究表明,离线能力不一定能迁移到在线交互中。代码和数据集将在https://github.com/Lucky-Lance/OmniInteract公开。

英文摘要

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.

2605.26483 2026-05-27 cs.CV 版本更新

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

基于临床基础的反事实推理用于医学视频诊断

Jianzhe Gao, Churan Wang, Weiyi Zhang, Jianghua Li, Li-An Li, Wenguan Wang, Yixin Zhu, Yizhou Wang

发表机构 * Center for Data Science in Clinical Medicine(临床医学数据科学中心) The State Key Lab of Brain-Machine Intelligence(脑机智能国家重点实验室) Department of Gynecology and Obstetrics, 7th Medical Center of Chinese PLA General Hospital(中国人民解放军总医院第七医学中心妇产科) School of Computer Science, Peking University(北京大学计算机学院) School of Psychological and Cognitive Sciences, Peking University(北京大学心理与认知科学学院) State Key Lab of General AI, Peking University(北京大学通用人工智能国家重点实验室) Nat’l Eng. Research Center of Visual Technology(视觉技术国家工程研究中心) Beijing Key Laboratory of Behavior and Mental Health(北京行为与心理健康重点实验室) Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence(北京大学武汉人工智能研究院具身智能实验室)

AI总结 提出MedVCR反事实推理框架,通过扩散生成器合成病理组织演变、临床规则编码诊断知识及双重诊断预测策略,在医学视频诊断任务上提升2.6%-10.2%性能。

详情
AI中文摘要

医学视频诊断涉及从整个检查过程中的动态组织反应推断临床决策。现有方法依赖于端到端学习范式,该范式i)关注外观而非病理,ii)缺乏临床先验知识,iii)仅基于观察进行推理而无反事实比较。本文引入MedVCR,一个模仿临床诊断思维的反事实推理框架。MedVCR包含三个组件:一个反事实生成器,通过扩散方式合成指定病理状态下的组织演变;一个反事实表示学习模块,通过临床规则(即时间一致性、病理可分离性和反事实对齐)编码诊断知识;以及一个双重诊断预测策略,将视频级评估与帧级反事实分析相结合。MedVCR在完全监督(如阴道镜检查)和弱监督(如结肠镜检查)视频诊断设置下进行评估,与领先基线相比取得了2.6%-10.2%的性能提升。全面的消融研究进一步验证了每个组件的有效性。代码将发布。

英文摘要

Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states via a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.

2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

基于随机解耦策略梯度的高效在策略视觉强化学习

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

发表机构 * Yale University(耶鲁大学) Shanghai Jiao Tong University(上海交通大学) University of Sydney(悉尼大学)

AI总结 提出随机解耦策略梯度(SDPG)方法,通过轨迹滚动的随机扰动估计策略梯度,在单GPU上数小时内端到端训练多样化的视觉运动控制策略,显著降低计算和内存开销,并在视觉MuJoCo基准测试中优于基线方法。

详情
AI中文摘要

我们提出了随机解耦策略梯度(SDPG),一种轻量级的视觉强化学习方法,能够在单个NVIDIA RTX 4080 GPU上在数小时内端到端训练多样化的视觉运动控制策略。SDPG通过轨迹滚动的随机扰动估计策略梯度,所需批量渲染环境数量减少几个数量级,并显著降低计算和内存开销。在视觉MuJoCo基准测试中,SDPG在训练时间、内存使用和奖励方面始终优于基线方法。最后,为支持未来研究,我们引入了一套涵盖灵巧操作、具有挑战性的运动控制的逼真视觉机器人基准测试,并在物理硬件上展示了有效的仿真到现实迁移。

英文摘要

We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation and challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.

2605.26475 2026-05-27 cs.CV cs.AI 版本更新

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

大规模平面场景的视觉度量测量比较研究

ZhiXin Sun

发表机构 * PowerChina Zhongnan Engineering Corporation Limited(中国电建集团中南勘测设计研究院有限公司)

AI总结 本文针对大规模室外场景,使用PTZ相机比较了三种基于视觉的平面度量方法(单目测距、图像拼接和立体测距),分析了它们的精度和适用性。

详情
AI中文摘要

基于视觉的度量距离和面积测量在大规模室外环境中仍然具有挑战性,原因包括远距离感知、相机变焦和不稳定的成像条件。本文研究了在实际水库监测场景中使用PTZ相机的平面度量测量,并比较了三种代表性方法:基于几何的单目测距、带有鸟瞰变换的图像拼接以及使用两个联合校准的单目相机的立体测距。对于单目测距,从相机几何推导出平面定位模型,并分析了相机俯仰角的影响。研究了用于大面积映射的图像拼接,同时开发了一种无需专用立体硬件的立体方案用于远距离测量。实验显示了明确的权衡:单目测距在足够大的俯仰角下达到米级精度,立体测距达到分米级精度且对俯仰变化敏感性较低,图像拼接在小规模场景中有效,但随着场景增大稳定性和可扩展性下降。

英文摘要

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with birds-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
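The pitch dependence reported for monocular ranging can be illustrated with the textbook flat-ground pinhole model. This is a minimal sketch, not the paper's derivation: it assumes a level ground plane, a known camera height, and zero roll, and the function names, the principal-point row, and the focal length are all illustrative.

```python
import math


def ground_distance(cam_height_m, pitch_deg, v_px, cy_px, focal_px):
    """Horizontal distance (m) to the ground point imaged at row v_px.

    Assumes a flat ground plane and a camera tilted pitch_deg below the horizon.
    """
    # Angle of the pixel ray below the optical axis; image rows grow downward.
    ray_offset = math.atan((v_px - cy_px) / focal_px)
    # Total depression angle of the ray below the horizon.
    depression = math.radians(pitch_deg) + ray_offset
    if depression <= 0:
        raise ValueError("ray is at or above the horizon and never meets the ground")
    return cam_height_m / math.tan(depression)


def pitch_error_effect(cam_height_m, pitch_deg, pitch_error_deg=0.1):
    """Change in estimated distance caused by a small pitch miscalibration."""
    # Hypothetical intrinsics; the ray is taken along the optical axis.
    centre = dict(v_px=540, cy_px=540, focal_px=1000.0)
    true_d = ground_distance(cam_height_m, pitch_deg, **centre)
    biased_d = ground_distance(cam_height_m, pitch_deg - pitch_error_deg, **centre)
    return abs(biased_d - true_d)
```

Under these assumptions, a 0.1° pitch error on a camera mounted 10 m high shifts the estimate by roughly 2 m at a 5° pitch but by well under 0.1 m at 30°, which matches the qualitative trend the abstract describes for shallow viewing angles.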

2605.26470 2026-05-27 cs.CV 版本更新

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

面向逆问题的三元动力学感知扩散后验采样:优化引导与随机性调度

Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun

发表机构 * IPAI & AIIS, Seoul National University, Republic of Korea

AI总结 提出TriPS方法,将后验采样建模为时变控制问题,通过优化数据一致性引导、无分类器引导和随机性的调度策略,显著提升成像逆问题的求解性能。

Comments ICML 2026

详情
AI中文摘要

使用扩散模型的生成后验采样已成为解决成像逆问题的主流范式,通常包含三个主要组件:数据一致性(DC)引导、无分类器引导(CFG)和随机性。虽然先前的工作专注于如何开发每个或所有组件,但很少关注如何调度它们,导致启发式固定或部分调整的次优调度。在这项工作中,我们认为所有三个组件在调度方面的相互作用对于显著提高成像逆问题的求解性能至关重要。我们的分析表明,在采样早期激进的CFG与DC引导冲突,而随机性将轨迹带回高概率区域。基于这些发现,我们提出了三元动力学感知后验采样(TriPS),它将后验采样重新表述为一个时变控制问题,并按照DC和随机性尺度递减、CFG尺度递增的三元趋势优化调度。TriPS通过两种策略实现:基于模板的函数先验搜索以获得可靠的基线调度,以及基于组相对策略优化(GRPO)的强化学习以获得更灵活的时间曲线。实验表明,TriPS在数据保真度和感知真实感方面优于最先进的基线方法。

英文摘要

Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, and usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior work has focused on how to develop each or all of these components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate that TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
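The triadic trend (DC and stochasticity scales falling while the CFG scale rises) can be pictured as one candidate in a template search. The sketch below is only an illustration of such a template: the linear shape, the endpoint values, and the names `linear_schedule` and `triadic_template` are assumptions, not the functional priors or values used in TriPS.

```python
def linear_schedule(start, end, num_steps):
    """Linearly interpolate from start to end over num_steps sampling steps."""
    if num_steps < 2:
        raise ValueError("need at least two steps")
    return [start + (end - start) * i / (num_steps - 1) for i in range(num_steps)]


def triadic_template(num_steps, dc=(1.0, 0.1), cfg=(1.0, 7.5), noise=(1.0, 0.0)):
    """One hypothetical template: DC and stochasticity fall, CFG rises.

    Index 0 is the first (noisiest) sampling step; endpoint values are placeholders.
    """
    return {
        "dc_scale": linear_schedule(dc[0], dc[1], num_steps),
        "cfg_scale": linear_schedule(cfg[0], cfg[1], num_steps),
        "stochasticity": linear_schedule(noise[0], noise[1], num_steps),
    }
```

A template search would evaluate many such parameterised curves and keep the best one; a learned schedule would replace the fixed interpolation with per-step values.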

2605.26460 2026-05-27 cs.CV cs.AI 版本更新

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

AnchorDiff: 基于锚点图传播的无训练概念定位用于多模态扩散Transformer

Jian Zhang, Zhijun Zhang

发表机构 * School of Automation Science and Engineering(自动化科学与工程学院)

AI总结 提出AnchorDiff方法,通过锚点选择和混合图传播解耦语义定位与结构细化,解决多模态扩散Transformer中视觉混淆概念间的概念泄漏问题。

详情
AI中文摘要

多模态扩散Transformer(MM-DiTs)为无训练概念定位编码了丰富的表示,但现有的基于注意力的方法通常在视觉上易混淆的概念上产生重叠激活,这种失败模式我们称为概念泄漏,即目标响应溢出到非目标对象。为了解决这个问题,我们提出了AnchorDiff,一种无训练的定位方法,将语义定位与结构细化解耦。AnchorDiff从概念到图像的注意力图中选择一个高置信度锚点,并将其作为独热种子在从图像到图像自注意力导出的混合图上传播。该图利用输出空间相似性进行密集的物体内传播,并通过逐行注意力门抑制跨物体连接。此外,我们引入了多概念混淆数据集,其中包含具有多个视觉相似概念和独立掩码的图像,从而能够显式评估概念泄漏。实验表明,AnchorDiff在ImageNet-Segmentation和PascalVOC上实现了强大的定位性能,同时在我们的多概念混淆数据集上显著减少了概念泄漏。

英文摘要

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
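The anchor-then-propagate idea can be sketched with a generic random walk with restart over a token affinity graph. This is a stand-in technique chosen for illustration, not AnchorDiff's hybrid graph or attention gate: the affinity matrix, the restart weight `alpha`, and the function name are all hypothetical.

```python
def propagate_from_anchor(concept_attention, affinity, steps=10, alpha=0.8):
    """Pick the highest-attention token as anchor and spread a one-hot seed.

    concept_attention: per-token concept-to-image attention scores.
    affinity: square list-of-lists of non-negative token-to-token weights.
    Uses a plain random walk with restart (an assumption, not the paper's graph).
    """
    n = len(concept_attention)
    anchor = max(range(n), key=lambda i: concept_attention[i])
    seed = [1.0 if i == anchor else 0.0 for i in range(n)]

    # Row-normalise the affinity so each token distributes unit mass.
    transition = []
    for row in affinity:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])

    score = seed[:]
    for _ in range(steps):
        # Mass held by token j flows to token i in proportion to transition[j][i].
        spread = [sum(transition[j][i] * score[j] for j in range(n)) for i in range(n)]
        score = [alpha * spread[i] + (1 - alpha) * seed[i] for i in range(n)]
    return anchor, score
```

In a toy graph with two disconnected objects, a token on the second object receives no score even if its raw attention is high, which is the kind of leakage suppression the abstract attributes to cutting cross-object connections.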

2605.26456 2026-05-27 cs.CV 版本更新

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

稀疏激光雷达提示的单目几何基础:面向长距离驾驶深度的实证研究

Kai Zheng, Qiang Feng, Xingjian Liu, Wenquan Tan, Yuan Li

发表机构 * Benewake (Beijing) Co., Ltd.(北醒(北京)光子科技有限公司)

AI总结 本文提出SLIM,首次将MoGe-2适配为接受真正稀疏激光雷达输入,通过部分卷积稀疏编码器和多尺度融合网络,在长距离(100-150米)将绝对相对误差降低39-51%。

Comments 6 pages, 3 figures, 2 tables

详情
AI中文摘要

稀疏激光雷达提示的深度基础模型(PromptDA, Prior Depth Anything, DMD3C)在室内场景或KITTI标准80米评估范围内表现出色。然而,存在两个局限性:(i)在长距离驾驶场景(50-150米)中缺乏系统性的距离分层评估;(ii)基于视差基础模型的先前方法依赖于预插值的密集先验,而真正稀疏激光雷达注入到点图基础模型(例如MoGe-2,NeurIPS 2025)尚未被探索。我们提出SLIM(稀疏激光雷达注入的单目几何),这是首个将MoGe-2适配为接受真正稀疏激光雷达输入的工作。SLIM集成了一个部分卷积稀疏编码器和一个多尺度融合颈部,在五个尺度上将激光雷达特征融合到点图解码器中。我们采用密度无关训练(随机注入比例在[0.005, 0.30]之间),使得单一模型能够适应不同的输入密度。在Virtual KITTI和CARLA上,SLIM在100-150米范围内将MoGe-2基线的绝对相对误差降低了约39-51%。在六种注入比例下的消融实验表明,部分卷积注入在Virtual KITTI的所有六种设置下均改善了AbsRel和RMSE;在CARLA上,AbsRel在六种设置中的五种得到改善(0.015比例下接近平局,差异为0.0013),而RMSE在不同编码器间相当,部分卷积在三种设置下有所改善(最多改善0.31单位),在其余三种设置下最多损失0.11单位。

英文摘要

Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI's standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (one near-tie at 0.015 differs by 0.0013), and RMSE is comparable across encoders, with partial-convolution improving in three settings (by up to 0.31 unit) and losing by at most 0.11 unit in the other three.
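Distance-stratified evaluation of the kind the abstract calls for can be sketched with the standard absolute relative error, AbsRel = mean(|pred − gt| / gt), computed separately per range bin. The bin edges and the function name below are illustrative assumptions; the paper's exact evaluation protocol is not given in the abstract.

```python
def abs_rel_by_range(pred, gt, bins=((0, 50), (50, 100), (100, 150))):
    """Absolute relative depth error per ground-truth distance bin.

    pred, gt: equal-length sequences of depths in metres.
    Returns {(lo, hi): mean |pred - gt| / gt}, or None for an empty bin.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    result = {}
    for lo, hi in bins:
        # Keep only valid pixels whose ground-truth depth falls in [lo, hi).
        errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0 and lo <= g < hi]
        result[(lo, hi)] = sum(errors) / len(errors) if errors else None
    return result
```

Reporting the far bin on its own prevents the many near-range pixels from masking long-range errors in a single global average.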

2605.26451 2026-05-27 cs.HC cs.CV 版本更新

Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

先设计,后编码:无模板的美观幻灯片生成

Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang

发表机构 * School of Cybersecurity, Northwestern Polytechnical University(西北工业大学网络安全学院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institution(上海创新研究院) Fudan University(复旦大学)

AI总结 提出DeepSlides层次化幻灯片生成流程,通过解耦设计与实现、引入SlideDesign数据集和多智能体强化学习训练范式,在无模板条件下生成高质量幻灯片。

详情
AI中文摘要

自动生成演示幻灯片需要在严格的空间约束下协调叙事结构与页面级图形设计。对于这种结构化多模态任务,良好的设计流程对于确保幻灯片的最终质量至关重要。现有方法依赖固定模板或直接生成可执行代码,从而限制了LLM的创意布局设计能力,并绕过了关键的幻灯片页面设计步骤。为解决这些限制,本文(1)提出了一种层次化的幻灯片生成工作流DeepSlides,无需任何预定义模板或样式,系统化地组织幻灯片设计任务,将幻灯片页面设计与实现解耦;(2)引入了SlideDesign数据集,专门针对幻灯片生成任务定制;(3)提出了一种多智能体强化学习训练范式,并训练了一对模型SlideQwens,用于幻灯片设计和实现。实验结果表明,我们提出的框架在评估指标上优于基线方法,并在人类偏好评估中取得了优越性能。数据集和代码可在https://github.com/sxswz213/DeepSlides获取。

英文摘要

Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at https://github.com/sxswz213/DeepSlides.

2605.26449 2026-05-27 cs.CV cs.AI 版本更新

Cross-scale Aligned Supervision for Training GANs

跨尺度对齐监督用于训练生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * Sungkyunkwan University(成均馆大学)

AI总结 针对GAN多尺度生成中跨尺度轨迹未对齐问题,提出CAT(跨尺度对齐Transformer),通过生成器侧一致性正则化对齐中间输出与最终输出,在ImageNet-256上实现FID-50K为1.56。

Comments Preprint

详情
AI中文摘要

现代GAN通常在中间生成器输出上引入对抗性监督,并将由此产生的多阶段合成解释为从粗到细的分层生成。在这项工作中,我们挑战了这一解释。我们认为标准的尺度级对抗监督并未构建适当的从粗到细的层次结构:每个中间图像被独立地推向其自身分辨率下的真实分布,但这种尺度级的真实性并不能确保各阶段的输出代表相同的生成样本。此外,每个阶段产生的特定尺度图像并未用作后续阶段的明确细化目标。因此,其对抗性损失可以改善特定尺度的输出,而不约束后续阶段保持相同的样本轨迹,允许它们转向不同的样本而不是细化先前的输出。我们将此问题称为跨尺度轨迹未对齐问题。为了解决这个问题,我们提出了CAT,一种用于多尺度对抗生成的跨尺度对齐Transformer。CAT保持判别器尺度级,因此每个中间输出在其自身分辨率下被评估,同时添加一个简单的生成器侧一致性正则化,以对齐中间输出与最终输出。在类别条件ImageNet-256上,CAT-H/2在仅60个训练周期后,通过一步推理实现了1.56的FID-50K,优于强大的单步GAN和扩散/流基线。

英文摘要

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
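A generator-side consistency term of the sort described can be pictured as penalising the gap between each intermediate output and a downsampled copy of the final output. The abstract does not state the loss form, so the average pooling, the squared-error penalty, and both function names below are assumptions made purely for illustration.

```python
def avg_pool(image, factor):
    """Downsample a 2D list of floats by averaging factor x factor blocks."""
    height, width = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / factor ** 2
            for x in range(width)
        ]
        for y in range(height)
    ]


def consistency_loss(intermediate, final):
    """Mean squared error between an intermediate output and the pooled final output."""
    factor = len(final) // len(intermediate)
    target = avg_pool(final, factor)
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(intermediate, target)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)
```

The term is zero when the coarse image is exactly the pooled final image and grows when a later stage drifts toward a different sample, which is the misalignment the paper sets out to remove.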

2605.26447 2026-05-27 cs.CV 版本更新

Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting

Underwater360: 基于全景高斯泼溅的全景图像水下场景重建

Jiangbei Hu, Weichao Song, Shibo Yu, Mohan Wang, Zihan Yi, Rui Wu, Mingkang Xiang, Na Lei, Shengfa Wang, Zhongxuan Luo, Ying He

发表机构 * School of Software, Dalian University of Technology(大连理工大学软件学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出Underwater360框架,利用物理信息引导的全向高斯泼溅,通过球面光线投射和外观-介质建模,实现水下全景场景的高质量重建与外观恢复。

详情
AI中文摘要

水下场景重建对于沉浸式探索水生环境至关重要,但由于复杂的参与介质效应(如吸收和散射)以及传统相机的有限视场(FoV),仍然具有挑战性。尽管将全景成像与3D高斯泼溅(3DGS)相结合为逼真的水下渲染提供了有前景的方向,但传统的3DGS难以处理球面投影畸变和水下介质退化。在本文中,我们提出了Underwater360,一个物理信息引导的全向3DGS框架,用于水下全景场景重建。首先,我们引入了一个全向高斯泼溅模块,该模块直接在球面相机空间中进行光线投射,而不是依赖2D投影近似,从而减少了360°视场下的几何畸变。其次,我们设计了一个基于物理的外观-介质建模架构,带有姿态条件的外观嵌入,以明确地将内在场景辐射与深度相关的后向散射和衰减解耦,从而实现物理基础的外观恢复。最后,我们建立了一个新的全景水下基准数据集,包含合成场景和真实场景。大量实验表明,Underwater360在水下新视图合成和场景外观恢复方面取得了优越的性能,在复杂水下环境中提供了改进的渲染质量和跨视图一致性。代码和数据集发布在https://github.com/SwcK423/Underwater360。

英文摘要

Underwater scene reconstruction is essential for immersive exploration of aquatic environments, yet remains challenging due to complex participating-media effects such as absorption and scattering, as well as the limited field of view (FoV) of conventional cameras. Although combining panoramic imaging with 3D Gaussian Splatting (3DGS) offers a promising direction for photorealistic underwater rendering, traditional 3DGS struggles with both spherical projection distortion and underwater medium degradation. In this paper, we propose Underwater360, a physics-informed omnidirectional 3DGS framework for underwater panoramic scene reconstruction. First, we introduce an Omnidirectional Gaussian Splatting module that performs ray casting directly in spherical camera space instead of relying on 2D projection approximations, thereby reducing geometric distortions under 360° FoV. Second, we design a physics-based appearance-medium modeling architecture with pose-conditioned appearance embeddings to explicitly decouple intrinsic scene radiance from depth-dependent backscatter and attenuation, enabling physically grounded scene appearance restoration. Finally, we establish a new panoramic underwater benchmark dataset containing both synthetic and real-world scenes. Extensive experiments demonstrate that Underwater360 achieves superior performance in underwater novel view synthesis and scene appearance restoration, delivering improved rendering quality and cross-view consistency in complex underwater environments. The code and datasets are released at https://github.com/SwcK423/Underwater360.

2605.26441 2026-05-27 cs.CV cs.AI 版本更新

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

从博弈视角重新思考弱监督视频时间定位

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

发表机构 * Hubei Key Laboratory of Distributed System Security(湖北省分布式系统安全重点实验室) Hubei Engineering Research Center on Big Data Security(湖北省大数据安全工程研究中心) School of Cyber Science and Engineering(网络空间安全学院) Huazhong University of Science and Technology(华中科技大学) University of Central Florida(中佛罗里达大学) Zhejiang Gongshang University(浙江工商大学) Guangzhou University(广州大学) The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学)

AI总结 本文从博弈论视角出发,通过多元合作博弈建模帧与词的不确定对应关系,实现多级跨模态交互,从而在弱监督下提升视频时间定位的准确性。

Comments Published in ECCV 2024

详情
AI中文摘要

本文针对弱监督视频时间定位这一具有挑战性的任务。现有方法通常基于时刻提案选择框架,利用对比学习和重构范式对预定义时刻提案进行评分。尽管取得了显著进展,但我们认为当前框架忽略了两个不可或缺的问题:1) 粗粒度跨模态学习:先前方法仅捕获全局视频级与查询的对齐,未能建模视频帧与查询词之间的详细一致性以准确定位时刻边界。2) 复杂的时刻提案:其性能严重依赖于提案的质量,而提案的选择既耗时又复杂。为此,本文首次尝试从新颖的博弈视角处理该任务,通过多样粒度和灵活组合有效学习每个视觉-语言对之间的不确定关系,实现多级跨模态交互。具体而言,我们创造性地将每个视频帧和查询词建模为多元合作博弈中的玩家,学习它们对跨模态相似度得分的贡献。通过博弈论交互量化联盟内帧-词合作的趋势,我们能够评估帧与词之间所有不确定但可能的对应关系。最后,我们不再使用时刻提案,而是利用学习到的查询引导的帧级得分进行更好的时刻定位。实验表明,我们的方法在Charades-STA和ActivityNet Caption数据集上均取得了优越性能。

英文摘要

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigms for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance relies heavily on the quality of proposals, which are also time-consuming and complicated to select. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.

2605.26421 2026-05-27 cs.CV 版本更新

HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

HydraPrompt: 面向合成图像检测的视觉语言模型自适应非对称框架

Senyuan Shi, Hao Tan, Zichang Tan, Shuhan Feng, Ajian Liu, Sergio Escalera, Jun Wan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) School of Advanced Interdisciplinary Sciences (SAIS), University of Chinese Academy of Sciences(中国科学院大学先进交叉学科学院) Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences(中国科学院深圳先进技术研究所) MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所MAIS) University of Barcelona(巴塞罗那大学)

AI总结 提出一种非对称提示框架HydraPrompt,通过动态调整类别中心对齐细粒度图像线索,结合条件监督对比学习,实现合成图像检测的SOTA性能。

Comments 8 pages, 6 figures

详情
AI中文摘要

生成模型的快速发展导致伪造内容激增,对现有合成图像检测方法构成重大挑战。利用视觉语言模型(如CLIP)的进展,最近的工作通过可学习的文本提示来识别合成图像。然而,它们仍使用静态提示作为真实和伪造图像的固定边界,无法适应推理过程中出现的各种伪造类型。为解决这一问题,我们提出HydraPrompt,一种非对称提示框架,通过对齐细粒度图像线索动态调整类别中心。具体而言,我们提出非对称提示适配器(APA):(1)对于真实类别,引入单组提示以捕获一致的代表性模式,作为真实内容的统一锚点;(2)对于伪造类别,构建样本自适应提示,专门捕获不同样本中的多样线索,实现伪造图像变体的自适应建模。为增强不同合成图像间的可区分性,我们进一步引入条件监督对比(CSC)目标,在压缩真实表示的同时捕获细粒度伪造线索。在主流SID基准上的大量实验表明,我们的框架达到了最先进的性能。

英文摘要

The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still use a static prompt as a fixed boundary between real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose HydraPrompt, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (APA): (1) for the authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content; (2) for the fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase discriminability among different synthetic images, we further introduce a Conditional Supervised Contrastive (CSC) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.

2605.26415 2026-05-27 cs.CV cs.AI 版本更新

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

拯救效应:空间语义早期退出绕过CLIP中的量化崩溃

Kahyeon Nam, Hyesong Choi

发表机构 * Soongsil University(崇实大学)

AI总结 针对CLIP模型INT8量化导致的表示崩溃问题,提出LRA-EE方法,通过空间语义聚合、多特征门控和层自适应阈值实现早期退出,在ImageNet-1K零样本分类中降低13.4% FLOPs并将Top-1准确率提升2.44个百分点。

详情
AI中文摘要

在资源受限的硬件上部署视觉-语言模型通常需要INT8量化,但在CLIP等联合嵌入架构中,这引入了一种不同于量化CNN分类器的故障模式:跨Transformer块累积的激活噪声扰乱了多模态嵌入的方向,侵蚀了零样本检索所依赖的余弦对齐。我们将此特征化为量化诱导的表示崩溃(QIRC),并在INT8 CLIP ViT-B/32上量化它,其中逐层噪声信号比从浅层块的低于10%增长到第11层的52%。我们提出LRA-EE(逐层表示感知早期退出),它通过空间语义聚合(用全局补丁令牌平均替代不成熟的浅层[CLS])、学习到的多特征门控(置信度、top-2间隔、空间激活方差)以及根据每层信息噪声比校准的层自适应置信阈值,绕过噪声饱和的深层。在ImageNet-1K零样本分类上,LRA-EE相比INT8基线减少了13.4%的FLOPs,并将Top-1准确率提高了+2.44个百分点(58.72% -> 61.16%)。四象限分解隔离了拯救效应:9.5%的样本在浅层出口被正确分类,但在全深度被噪声丢失,而只有7.1%遭受相反情况。

英文摘要

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
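The four-quadrant decomposition can be sketched as a simple tally of per-sample correctness under the early-exit model and the full-depth baseline. If both quadrant fractions are taken over the whole evaluation set (an assumption; the abstract does not say), the net gain is the rescued fraction minus the harmed fraction, and the reported 9.5% − 7.1% = 2.4 points is in line with the stated +2.44%p after rounding. The function names below are illustrative.

```python
def four_quadrants(exit_correct, full_correct):
    """Fraction of samples in each (early-exit, full-depth) correctness quadrant.

    exit_correct, full_correct: equal-length lists of booleans, one per sample.
    """
    if len(exit_correct) != len(full_correct) or not exit_correct:
        raise ValueError("need two equal-length, non-empty lists")
    counts = {"both": 0, "rescued": 0, "harmed": 0, "neither": 0}
    for early, full in zip(exit_correct, full_correct):
        if early and full:
            counts["both"] += 1
        elif early:
            counts["rescued"] += 1  # right at the exit, wrong at full depth
        elif full:
            counts["harmed"] += 1   # wrong at the exit, right at full depth
        else:
            counts["neither"] += 1
    total = len(exit_correct)
    return {name: count / total for name, count in counts.items()}


def net_accuracy_gain(quadrants):
    """Accuracy difference between early-exit and full-depth predictions."""
    return quadrants["rescued"] - quadrants["harmed"]
```

Samples in the "both" and "neither" quadrants cancel out of the comparison, so only the two off-diagonal quadrants determine whether early exit helps or hurts accuracy.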

2605.26399 2026-05-27 cs.CV 版本更新

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

OmniGF: 一种用于统一视线跟随的双分支视觉-语言框架

Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

发表机构 * Stony Brook University(石溪大学) The University of Adelaide(阿德莱德大学)

AI总结 提出OmniGF框架,通过双分支解码策略(语言分支生成离散推理状态,连续空间分支利用密集隐藏状态)结合头部嵌入,实现多人场景下精确的空间视线估计、语义视线预测和复杂社会视线推理,在多个基准上达到新最优。

详情
AI中文摘要

理解人类注视行为对于复杂场景理解和人机交互至关重要。传统的视线跟随模型通常局限于纯空间定位,缺乏推理语义目标或复杂社会背景的高级能力。此外,这些模型通常顺序处理个体,对同一场景图像进行多人体推理时需要冗余计算。虽然最近的视觉-语言模型(VLM)提供了处理与视线相关语义任务所需的卓越语义推理能力,但它们对离散文本生成的依赖本质上限制了在连续空间任务(如视线定位)中的精度。为弥合这一差距,我们提出OmniGF,一个统一的视觉-语言框架,使基础VLM适应高度可扩展的多人体视线推理。该模型采用双分支解码策略:结构化语言分支生成离散推理状态,而连续空间分支直接利用VLM的密集隐藏状态。用高分辨率视线目标热图监督这些提取的表示,有效克服了仅文本坐标生成的空间瓶颈。此外,为明确将模型锚定于多人场景,我们通过从裁剪的人头图像编码的头嵌入增强输入,同时为所有个体提供细粒度的外观和方向线索。通过建模所有个体并利用VLM的强大语义能力,OmniGF无缝集成了精确的空间视线目标估计、语义视线预测和复杂社会视线推理。大量实验表明,我们的框架在多个标准基准上建立了新的最优性能。代码可在https://github.com/cvlab-stonybrook/omnigf获取。

英文摘要

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.

2605.26383 2026-05-27 cs.CV 版本更新

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

基于多阶段SAM3特征融合的自我中心厨房视频零样本物体重识别

Dmytro Klepachevskyi, Alexander Wong, Sirisha Rambhatla, Yuhao Chen

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对自我中心厨房视频中物体重识别的挑战,提出一种基于SAM3分割的多阶段零样本方法,通过融合SAM3、DINOv2和CLIP特征并引入掩码形状IoU和k-倒数重排序,将mAP从45.3%提升至52.8%。

详情
AI中文摘要

由于视角快速变化、频繁遮挡、场景杂乱以及类内外观差异大,自我中心厨房视频中的物体重识别(ReID)具有挑战性。物体可能离开并重新进入视野,且实例多样性大而标注有限,使得监督式ReID难以扩展,从而推动了零样本方法的研究。我们在EPIC-Kitchens基准上研究零样本物体ReID,目标是仅使用预训练的视觉特征匹配跨帧的活跃食物和厨房工具实例。我们首先评估了五种最先进的特征提取器,包括视觉语言模型(VLM),即CLIP、DINOv2、DreamSim、I-JEPA和SAM3,结果显示这些零样本基线表现不足,最佳基线仅达到45.3% mAP。然后,我们提出了一种增强的SAM3 ReID流水线,这是一种以SAM3分割为核心组件的零样本多阶段方法。阶段1使用SAM3抑制背景杂乱。阶段2将SAM3、DINOv2和CLIP的嵌入融合为单个L2归一化描述符。阶段3用掩码形状IoU增强余弦相似度以实现几何一致性,阶段4应用k-倒数重排序。整个流水线将mAP提升7.5个百分点,从45.3%达到52.8%。

英文摘要

Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs), namely CLIP, DINOv2, DreamSim, I-JEPA, and SAM3, and show that these zero-shot baselines fall short, with the best achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5 mAP points, from 45.3% to 52.8%.
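
Stages 2 and 3 of the pipeline above can be illustrated with a minimal sketch. The abstract does not say how the embeddings are fused or how cosine similarity and IoU are combined, so the concatenation in `fuse_descriptors`, the blending weight `alpha`, and the assumption that both masks were already resized to a common grid are all hypothetical choices made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_descriptors(embeddings):
    """Normalize each per-model embedding, concatenate them, and
    L2-normalize the result. Concatenation is an assumption of this sketch."""
    fused = []
    for emb in embeddings:
        fused.extend(l2_normalize(emb))
    return l2_normalize(fused)

def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equally sized 2D lists of 0/1."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0

def similarity(desc_a, desc_b, mask_a, mask_b, alpha=0.8):
    """Blend cosine similarity with mask IoU; alpha is a hypothetical weight."""
    # Descriptors are unit vectors, so the dot product is the cosine similarity.
    cosine = sum(x * y for x, y in zip(desc_a, desc_b))
    return alpha * cosine + (1.0 - alpha) * mask_iou(mask_a, mask_b)
```

Normalizing each embedding before concatenation keeps one model with large activations from dominating the fused descriptor, which is one plausible reason to L2-normalize at both steps.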

2605.26382 2026-05-27 cs.CV 版本更新

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

细节一致的分阶段蒸馏用于高效3D MRI分割

Mengchen Fan, Baocheng Geng, Xi Xiao, Tianyang Wang, Siyuan Mei, Pulin Che, Xiaoqian Jiang, Qizhen Lan

发表机构 * University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Friedrich-Alexander-Universität Erlangen-Nürnberg(埃尔兰根-纽伦堡弗里德里希-亚历山大大学) UTHealth Houston(德克萨斯大学休斯顿健康科学中心)

AI总结 提出细节一致蒸馏(DCD)框架,通过小波分解对齐教师-学生特征,在分阶段蒸馏中保留多尺度结构细节,实现高效3D MRI分割。

Comments Accepted by MICCAI 2026. 11 pages, 3 figures

详情
AI中文摘要

部署高性能3D医学图像分割器(如nnU-Net)通常受到内存占用和推理延迟的限制。因此压缩是必要的,但紧凑的3D编码器往往会在多分辨率阶段重复下采样时丢失细微的结构线索(小病变和锐利边界)。我们提出细节一致蒸馏(DCD),一种分阶段蒸馏框架,通过在小波分解表示中对齐教师-学生特征,跨尺度保留结构细节。在每个编码器阶段,DCD在小波域中蒸馏方向细节分量,同时相对不约束粗略近似,避免对全局语义的过度正则化。DCD仅在训练期间使用,不引入推理开销。在BraTS 2024和ISLES 2022基准上的实验表明,我们的方法在使用3D多模态数据的MRI分割中取得了优越性能。DCD的代码和实现细节可在https://github.com/ClinicaAlpha/DCD-3D-MedSeg公开获取。

英文摘要

Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at https://github.com/ClinicaAlpha/DCD-3D-MedSeg.
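
The idea of distilling only the wavelet detail components can be sketched in one dimension. This is a simplified illustration, not the DCD implementation: it uses a one-level Haar transform on a 1D feature vector (the paper works with 3D features), and the mean-squared-error loss and the function names are assumptions.

```python
import math

def haar_1d(signal):
    """One-level Haar transform of an even-length sequence.
    Returns (approximation, detail) coefficient lists."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def detail_distillation_loss(teacher_feat, student_feat):
    """Mean squared error between teacher and student detail coefficients.
    The approximation band is deliberately left out of the loss."""
    _, t_detail = haar_1d(teacher_feat)
    _, s_detail = haar_1d(student_feat)
    return sum((t - s) ** 2 for t, s in zip(t_detail, s_detail)) / len(t_detail)
```

Because the loss ignores the approximation band, a student whose features differ from the teacher's by a constant offset incurs no penalty, while a student that flips a local edge does.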

2605.26381 2026-05-27 cs.CV 版本更新

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

基于Perceiver IO融合卫星与街景图像的多模态建筑巡检

Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

发表机构 * University of Amsterdam (UvA)(阿姆斯特丹大学) Spotr

AI总结 提出一种通过Perceiver IO架构融合卫星和街景图像的多模态分类框架,使用共享DINOv2骨干网络的空间补丁令牌,无需填充或固定大小池化即可处理可变数量的街景视图,并联合预测屋顶元素和材料类别,在包含10个国家32135栋建筑的数据集上验证了RGB-M掩码策略和融合模型的有效性。

详情
AI中文摘要

我们提出了一种多模态分类框架,通过Perceiver IO架构融合卫星和街景图像,该架构基于共享DINOv2骨干网络的空间补丁令牌。该设计自然地处理每栋建筑可变数量的街景视图,无需填充或固定大小池化,并联合预测多标签屋顶元素和屋顶材料类别。我们构建了一个包含10个国家32,135栋建筑(61,672个片段)的大规模数据集,将卫星图像与每个片段最多八个街景视图配对,并评估了四种用于隔离目标建筑的掩码策略。我们提出了一种RGB-M掩码策略,将建筑足迹掩码作为第四个输入通道,提供了一种软空间先验,在两种模态下均优于硬裁剪。Perceiver IO融合模型优于所有其他融合策略,并在街景可见的属性上取得了显著的每类增益(例如,石板+11.3 AP,老虎窗+1.3 AP),尽管仅卫星基线在主要从上方可见的类别的宏观平均mAP上仍保持轻微优势。这些结果为多模态建筑检测建立了一种可扩展、灵活的架构,能够处理异构输入和多个输出任务。

英文摘要

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
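
The RGB-M idea described above, appending the footprint mask as a fourth channel, can be shown on a toy image. In this sketch images are nested lists of `(r, g, b)` tuples, and `hard_mask` is only a simple stand-in for a hard baseline; it is not necessarily the cropping procedure used in the paper.

```python
def rgbm(image, mask):
    """Append the footprint mask as a fourth channel: (r, g, b) -> (r, g, b, m)."""
    return [[pixel + (m,) for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def hard_mask(image, mask):
    """Stand-in hard baseline: zero out every pixel outside the footprint."""
    return [[pixel if m else (0, 0, 0) for pixel, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]
```

With `rgbm` the pixels outside the footprint are still available to the network as context, and the mask only indicates where the target building is, which is what makes the prior soft.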

2605.26380 2026-05-27 cs.CV cs.AI 版本更新

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

VisualNeedle: 信息密集场景中的主动视觉搜索基准

Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu

发表机构 * Hunyuan, Tencent(腾讯混元) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对多模态大语言模型在细粒度感知基准中依赖捷径而非真实视觉证据的问题,提出VisualNeedle基准,通过反事实裁剪-黑化设置评估模型在信息密集场景中的主动视觉搜索能力,实验表明最佳模型准确率仅56.01%,仍低于63.00%的人类多数投票准确率。

详情
AI中文摘要

前沿多模态大语言模型(MLLMs)被报道在细粒度感知基准上达到超过90%的准确率。然而,这样的分数并不一定意味着对视觉证据的忠实使用。先前的研究已经识别出三种抬高基准性能的捷径。首先,问题中的语言先验和词汇线索使模型能够在未见图像的情况下推断出看似合理的答案。其次,来自视觉编码器的粗略全局语义可以绕过细粒度的局部细节。第三,在一些“用图像思考”的基准中,破坏视觉工具返回的中间图像几乎不影响最终答案。这些发现表明,仅靠更高的输入分辨率或更大的问题池并不能引发真正的主动视觉搜索。为了解决这个问题,我们引入了VisualNeedle,这是一个具有挑战性、信息密集且细粒度的基准,用于关键证据在空间上局限于微小区域且无法一眼看出的场景。我们进一步提出了一种反事实裁剪-黑化设置,将工具返回的裁剪区域替换为相同大小的黑色图像,以测试工具启用的性能是否真正依赖于中间视觉证据。我们在三种设置下评估了9个著名的MLLMs:无工具、标准工具启用和裁剪-黑化。无工具准确率保持在20%以下,最佳工具启用模型仅达到56.01%,仍落后于63.00%的人类多数投票准确率。这些结果揭示了细粒度视觉搜索中持续存在的局限性,而裁剪-黑化消融实验证实,VisualNeedle上的成功依赖于真正的中间视觉证据。

英文摘要

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
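
The crop-black setting described above amounts to a small change in what the visual tool returns. The sketch below is illustrative only: images are 2D lists of pixel values, and `tool_call` and its `(top, left, height, width)` box format are hypothetical names and conventions, not the benchmark's actual interface.

```python
def crop(image, box):
    """Return the region box=(top, left, height, width) of a 2D pixel grid."""
    top, left, height, width = box
    return [row[left:left + width] for row in image[top:top + height]]

def black_like(patch, black=0):
    """Return an all-black patch with the same height and width."""
    return [[black for _ in row] for row in patch]

def tool_call(image, box, crop_black=False):
    """Simulate the visual tool: return the real crop, or a same-size
    black image in the counterfactual crop-black setting."""
    patch = crop(image, box)
    return black_like(patch) if crop_black else patch
```

If a model's accuracy barely changes when `crop_black=True`, its answers cannot have depended on the content of the crop, which is the dependence the ablation is designed to test.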

2605.26376 2026-05-27 cs.CV cs.AI cs.LG 版本更新

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

BioFact-MoE:基于生物学因子分解的混合专家模型用于肝细胞癌的视觉-语言预后建模

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

发表机构 * Department of Radiology & Biomedical Imaging, Department of Biomedical Engineering, Department of Electrical Engineering, Department of Statistics & Data Science, Yale University, New Haven, CT, 06510, USA

AI总结 提出BioFact-MoE框架,通过生物学监督的混合专家模型显式分解肝脏和肿瘤因子,在肝细胞癌预后预测中提升准确性和生物学可解释性。

Comments Early accepted at MICCAI 2026

详情
AI中文摘要

肝细胞癌(HCC)具有生物学异质性,由肝功能储备和肿瘤相关肿瘤学因素之间的相互作用塑造;因此,相似的生存结果可能反映根本不同的潜在生物学过程。HCC的预后建模依赖于来自多参数MRI和常规临床实践放射学报告的丰富多模态信息。现有的预后视觉-语言模型(VLM)学习单一的纠缠潜在表示,混合了肝脏和肿瘤相关因素,限制了准确性和生物学可解释性。我们提出BioFact-MoE,一个生物学因子分解的混合专家(MoE)框架,通过残差MoE生存架构中的生物学监督专家显式分解肝脏和肿瘤因素。在N=588名患者的HCC队列(在4,582个3D MRI图像-报告对上预训练)中,BioFact-MoE在所有时间范围内持续优于所有基线的生存预测,实现了12、18和24个月的AUC分别为75.33%、75.85%和73.96%。除了标量风险预测,门控专家权重实现了表型感知的风险分层。通路感知的门控揭示了临床上有意义的治疗相关生存异质性。在保留验证中,肝脏和肿瘤嵌入分别与肝功能标志物和肿瘤负荷标志物显示出选择性关联(p<0.05),无需监督。代码可在https://github.com/jy-639/BioFact-MoE获取。

英文摘要

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.

2605.26370 2026-05-27 cs.CV 版本更新

Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

航空影像中屋顶结构的联合实例分割与几何属性回归

Luuk Versteeg, Rob G. J. Wijnhoven, Martin R. Oswald

发表机构 * University of Amsterdam (UvA)(阿姆斯特丹大学)

AI总结 提出一种从单张航空正射影像中联合预测屋顶实例分割掩码和三个连续几何属性(建筑高度、屋顶坡度、屋顶方位角)的方法,通过条件方位角损失和对数归一化高度表示解决数据噪声和分布偏斜问题,在荷兰大规模数据集上实现了高精度,并可从单张图像重建简化3D建筑模型。

详情
AI中文摘要

我们提出了一种方法,用于从单张航空正射影像中联合预测实例级屋顶分割掩码以及三个连续几何属性——建筑高度、屋顶坡度和屋顶方位角。我们的方法扩展了Mask R-CNN,增加了一个专门的属性回归分支,并引入了两个关键创新:一个条件方位角损失,抑制了对屋顶平坦段(其中方位角标签固有噪声)的监督;以及一个对数归一化高度表示,解决了建筑高度严重偏斜分布的问题。我们在一个大规模荷兰航空图像数据集上进行训练和评估,该数据集与从3DBAG(一个全国性的基于LiDAR的3D建筑数据集)自动导出的真实值配对。使用DINOv3 ConvNeXt-Base骨干网络,我们的方法在屋顶坡度上实现了约4度的平均绝对误差,方位角为7度,建筑高度为1米,实例分割AP$_{50}$为0.566。预测的每段掩码和属性足以从单张俯视图像重建简化的3D建筑模型(LoD2),仅需在训练时使用昂贵的3D参考数据。

英文摘要

We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes (building height, roof slope, and roof azimuth) from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
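
The two innovations named above can be sketched in a few lines. The exact formulas are not given in the abstract, so everything specific here is an assumption: the `log1p` normalization with a hypothetical `max_height_m`, the 5-degree flatness threshold `flat_below_deg`, and the use of a wrapped mean absolute angular error as the azimuth loss.

```python
import math

def encode_height(height_m, max_height_m=100.0):
    """Map a height in meters to [0, 1] on a log scale (max height is assumed)."""
    return math.log1p(height_m) / math.log1p(max_height_m)

def decode_height(value, max_height_m=100.0):
    """Invert encode_height."""
    return math.expm1(value * math.log1p(max_height_m))

def angular_error_deg(pred, target):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    diff = abs(pred - target) % 360.0
    return min(diff, 360.0 - diff)

def conditional_azimuth_loss(pred_az, true_az, true_slope, flat_below_deg=5.0):
    """Mean angular error over non-flat segments only.
    Segments flatter than the threshold contribute no supervision."""
    errors = [angular_error_deg(p, t)
              for p, t, s in zip(pred_az, true_az, true_slope)
              if s >= flat_below_deg]
    return sum(errors) / len(errors) if errors else 0.0
```

The log scale compresses the long tail of tall buildings so that the many low buildings occupy most of the target range, and the slope condition avoids penalizing azimuth predictions on segments where the direction is essentially undefined.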

2605.26353 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Personalized Generative Models for Contextual Debiasing

用于上下文去偏的个性化生成模型

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

发表机构 * Department of Computer Science, Princeton University(普林斯顿大学计算机科学系) LIX, CNRS, École Polytechnique(巴黎综合理工学院LIX实验室,法国国家科学研究中心)

AI总结 提出DecoupleGen方法,利用个性化文本到图像扩散模型生成罕见上下文图像,作为训练增强以缓解视觉识别中的上下文偏差。

Comments CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

详情
AI中文摘要

不同的视觉模式在世界中出现的频率不同:例如,沙滩球出现在沙滩上比出现在道路上更常见。这些统计数据反映在视觉数据集中,因此训练好的模型更容易在常见场景中识别物体。然而,在道路上识别沙滩球可能比在沙滩上识别更重要。我们研究如何缓解这种差异。由于在现实世界中收集不常见的图像可能很困难,我们探索生成具有较少频繁上下文的图像是否可以作为有效的训练增强。一个关键挑战是引导生成保持在原始数据集分布附近,同时创建具有不常见上下文的多样化图像。我们引入了DecoupleGen方法,该方法个性化文本到图像扩散模型,以促进罕见上下文图像的连贯合成,同时保留原始视觉细节。生成的图像包含语义上有意义的内容,并在视觉上与原始数据集保持一致。我们进一步应用验证约束以确保增强数据的相关性。我们在复杂场景数据集上的物体分类和识别任务中评估了我们的方法。实验表明,我们的方法比先前的方法有一致的改进,并且我们的分析确定了这些改进背后的因素。

英文摘要

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

2605.26332 2026-05-27 cs.CV cs.AI 版本更新

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

被擦除但可被利用:针对已遗忘文本到图像扩散模型的黑盒嵌入感知提示攻击

Arian Komaei Koma, Seyed Amir Kasaei, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering(计算机工程系)

AI总结 提出一种黑盒嵌入感知对抗提示攻击BEAP,利用大语言模型迭代生成有效对抗提示,以恢复被遗忘概念,并在攻击成功率上提升超过60%。

详情
AI中文摘要

机器遗忘旨在从预训练的文本到图像扩散模型中移除特定概念,然而已有多种白盒和黑盒攻击被提出以使模型生成这些被遗忘的概念。然而,这些攻击并未假设现实的威胁模型,即它们要么假设可以访问模型权重,要么产生无意义的对抗提示,即使通过简单的基于规则的防护也能轻易检测到。本文旨在填补这一空白。我们提出BEAP,一种黑盒、嵌入感知的对抗提示攻击,利用大语言模型(LLM)迭代生成有效的对抗提示并利用这些隐藏的漏洞。BEAP在文本空间中执行嵌入感知搜索,结合多个奖励信号:被遗忘概念的存在性、文本-图像对齐和图像质量,以优化生成的提示。与之前的攻击方法不同,BEAP使其提示对安全过滤器不可检测,同时生成高质量图像。大量实验表明,BEAP的攻击成功率(ASR)比先前方法提高了60%以上,而每次成功攻击平均仅需15个提示。警告:本文包含可能具有冒犯性或令人不安性质的模型输出。

英文摘要

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

2605.26328 2026-05-27 cs.CV 版本更新

RadarSim: Simulating Single-Chip Radar via Multimodal Neural Fields

RadarSim: 通过多模态神经场模拟单芯片雷达

Chuhan Chen, Tianshu Huang, Akarsh Prabhakara, Chaithanya Kumar Mummadi, Zhongxiao Cong, Anthony Rowe, Matthew O'Toole, Deva Ramanan

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Bosch Research(博世研究) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 提出RadarSim,一种利用RGB相机高角分辨率从相机初始化的神经场生成多普勒雷达距离图像的统一可微渲染器,以解决雷达空间分辨率低的问题,并产生比纯雷达重建更清晰的几何和多普勒距离帧。

Comments Accepted to 3DV 2026. Project website: https://sally-chen.github.io/radar-sim/

详情
AI中文摘要

雷达是相机的理想补充:两者都是廉价、固态的传感器,相机提供精细的角分辨率,而雷达在恶劣天气下提供度量深度和鲁棒性。然而,雷达数据比相机图像更难解释,且不同传感器之间差异显著,这增加了对仿真以进行传感器和处理流水线原型设计的依赖。最近将雷达重建视为新视角合成问题的工作在重建雷达相关几何和模拟低级雷达数据方面显示出巨大潜力。然而,此类方法受到底层雷达低空间分辨率的限制。为了解决这个问题,我们提出了一种统一的可微渲染器RadarSim,它利用RGB相机的高角分辨率从相机初始化的神经场生成多普勒雷达距离图像。通过使用来自定制手持装置的校准雷达相机记录的新数据集,我们证明RadarSim比纯雷达重建产生更清晰的几何和多普勒距离帧。

英文摘要

Radars are an ideal complement to cameras: both are inexpensive, solid-state sensors, with cameras offering fine angular resolution, while radars provide metric depth and robustness under adverse weather. However, radar data is more difficult to interpret than camera images and varies significantly between sensors, necessitating increased reliance on simulation for prototyping sensors and processing pipelines. Recent work treating radar reconstruction as a novel view synthesis problem has shown great promise in reconstructing radar-relevant geometry and simulating low-level radar data. However, such methods are constrained by the low spatial resolution of the underlying radar. To address this, we propose a unified differentiable renderer, RadarSim, which leverages the high angular resolution of RGB cameras to generate Doppler radar range images from a camera-initialized neural field. Using a novel dataset of calibrated radar-camera recordings from a custom hand-held rig, we demonstrate that RadarSim produces sharper geometry and Doppler range frames than radar-only reconstructions.

2605.26316 2026-05-27 cs.CV cs.AI 版本更新

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

E$^3$C: 具有3D环境记忆和自我-外部人体姿态控制的视频生成

Qiao Gu, Lingni Ma, Adam W Harley, Richard Newcombe, Florian Shkurti, Julian Straub

发表机构 * Meta Reality Labs(Meta现实实验室) University of Toronto(多伦多大学)

AI总结 提出E$^3$C可控视频扩散框架,通过3D点云记忆和双通道人体控制(自我与外部视角),实现物理一致的自我中心视频生成。

Comments Preprint. Project Page: https://e3c-videogen.github.io/

详情
AI中文摘要

可控且物理合理的自我中心视频生成对于具身智能体推理自身及他人动作如何表现和改变世界至关重要。与通用视频合成相比,自我中心生成尤其具有挑战性:相机与演员紧密耦合,导致视角快速变化和频繁的自遮挡;底层动作细微、关节化,且通常仅部分可见;人和场景状态必须与指定控制一致地演化。我们提出E$^3$C,一种用于自我中心生成的可控视频扩散框架,构建结构化和紧凑的条件,将持久场景结构与人类驱动动态分离。从上下文帧中,E$^3$C构建基于半稠密点云的3D记忆,并用来自视频VAE特征的外观描述符增强每个点。将此记忆渲染到目标视角产生与目标帧对齐的条件。人类动态单独建模。场景中观察到的人由骨架渲染(外部人体控制)控制,而相机佩戴者由其3D身体关节和6DoF手腕运动(自我人体控制)指定。为了在佩戴者身体部位不可见时保持自我人体控制,我们引入了一个自我运动编码器,生成持久的交叉注意力标记。在Nymeria上的实验表明,E$^3$C在视觉保真度、相机运动准确性、物体一致性以及自我和外部人体控制方面优于强基线,同时还能实现直观的场景编辑。

英文摘要

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

2605.26295 2026-05-27 cs.CV 版本更新

Sleep-stage efficient classification using a lightweight self-supervised model

使用轻量级自监督模型的睡眠阶段高效分类

Eldiane Borges dos Santos Durães, João Batista Florindo

发表机构 * Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Street Sergio Buarque de Holanda, 651, Campinas, Brazil(数学、统计与科学计算研究所,坎皮纳斯大学,塞尔吉奥·布阿尔克·德·奥兰达街651号,巴西坎皮纳斯)

AI总结 本研究通过简化mulEEG自监督模型并结合线性SVM分类器,实现了高效准确的睡眠阶段分类。

详情
Journal ref
Proceedings VISAPP 2025, 972-979 (2025)
AI中文摘要

睡眠阶段的准确分类对于诊断睡眠障碍至关重要,自动化该过程可以显著增强临床评估。本研究旨在探索使用自监督模型(具体为mulEEG的改编版本)结合线性SVM分类器来改进睡眠阶段分类。方法:mulEEG模型以自监督方式学习脑电图信号表示,本文将其中用作时间序列编码器的一维卷积ResNet-50替换为ResNet-18主干网络,从而对其进行了简化。还进行了另外两项改编:第一项评估了模型的不同配置和训练数据量,第二项测试了时间序列特征、频谱图特征及其拼接作为线性SVM分类器输入的有效性。结果:结果显示,与简化模型相比,减少数据量提供了更好的成本效益比。使用ResNet-18的拼接特征也优于原始mulEEG模型的线性评估,实现了更高的分类性能。结论:简化mulEEG模型以提取特征,并将其与稳健的分类器配对,可实现更高效、更准确的睡眠阶段分类。该方法有望改善临床睡眠评估,并可扩展到其他生物信号分类任务。

英文摘要

Accurate classification of sleep stages is crucial for diagnosing sleep disorders, and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a linear SVM classifier to improve sleep stage classification. Methods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing the ResNet-50 with 1D convolutions, used as the time-series encoder, with a ResNet-18 backbone. Two other adaptations were conducted: the first evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time-series features, spectrogram features, and their concatenation as inputs to a linear SVM classifier. Results: The results showed that reducing the volume of data offered a better cost-benefit ratio than simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.

2605.26294 2026-05-27 cs.CV 版本更新

CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection

用于皮肤癌检测的CNN、Transformer、混合模型和视觉语言模型

Durjoy Dey, Yuhong Yan, Hassan Hajjdiab

发表机构 * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada(计算机科学与软件工程系,康科迪亚大学,加拿大蒙特利尔) Ebovir Biotechnologie Inc., Montreal, Canada(Ebovir生物技术公司,加拿大蒙特利尔)

AI总结 本文在PAD-UFES-20数据集上统一评估了12种深度学习模型(包括CNN、ViT、混合卷积Transformer和视觉语言模型),结果表明混合模型和基于SigLIP的VLM在排名性能和临床相关操作点之间取得了最佳平衡。

Comments 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science

详情
AI中文摘要

皮肤癌是一种常见且快速增长的恶性肿瘤,全球范围内发病率不断上升。早期检测对于改善预后至关重要。基于皮肤镜和临床图像训练的深度学习模型可以支持自动化和快速分诊。然而,许多研究仅评估了有限的架构,且不同研究的实验设置也各不相同。在本文中,我们在PAD-UFES-20数据集上对十二种深度学习模型进行了统一的二分类皮肤癌检测评估。这些模型涵盖四个家族:卷积神经网络(CNN)、视觉Transformer(ViT)、混合卷积Transformer骨干网络和视觉语言模型(VLM)。性能评估使用AUC、最大F1分数及其精确率和召回率,以及在80%特异性下的灵敏度,以反映筛查导向的需求。我们的结果表明,调优良好的CNN已经提供了强大的基线,但基于Transformer的家族持续改善了区分能力。混合模型(MaxViT Tiny、CoAtNet0)和基于SigLIP的VLM在排名性能和临床相关操作点之间实现了最佳整体权衡,而基于CLIP的模型提供了高精确率。所有实验的完整代码库已公开发布。这些发现共同为皮肤癌筛查中实际部署最合适的模型家族提供了实用指导,并为未来在PAD-UFES-20上的工作建立了可重复的参考点。

英文摘要

Skin cancer is a common and fast-rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures, and experimental setups vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution-transformer backbones, and vision-language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening-oriented requirements. Our results show that well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while the CLIP-based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real-world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.
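
The "sensitivity at 80% specificity" operating point mentioned above can be computed directly from scores and labels. This is a minimal sketch of one common way to define it (the highest sensitivity among thresholds whose specificity is at least the target); the paper's exact procedure is not stated in the abstract, and the function name is hypothetical.

```python
def sensitivity_at_specificity(scores, labels, target_specificity=0.80):
    """Highest sensitivity over thresholds whose specificity meets the target.
    labels: 1 = positive (e.g. malignant), 0 = negative (e.g. benign).
    A sample is predicted positive when its score is >= the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    best = 0.0
    # Candidate thresholds: every observed score, plus one above the maximum.
    for threshold in sorted(set(scores)) + [max(scores) + 1.0]:
        specificity = sum(s < threshold for s in negatives) / len(negatives)
        if specificity >= target_specificity:
            sensitivity = sum(s >= threshold for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best
```

Fixing specificity first reflects a screening constraint: the false-alarm rate is capped, and models are then compared by how many true cases they still catch.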

2605.26287 2026-05-27 cs.CV 版本更新

A multifractal-based masked auto-encoder: an application to medical images

基于多重分形的掩码自编码器:在医学图像中的应用

Joao Batista Florindo, Viviane de Moura

发表机构 * Institute of Mathematics, Statistics and Scientific Computing - University of Campinas(数学、统计与科学计算研究所 - 坎皮纳斯大学)

AI总结 提出一种利用多重分形测度(Renyi熵)优化掩码策略的掩码自编码器(MO-MAE),通过聚焦高复杂度区域提升医学图像分类性能。

详情
Journal ref
Proceedings VISAPP 2025, 769-776 (2025)
AI中文摘要

掩码自编码器(MAE)在医学图像分类中显示出巨大潜力。然而,传统MAE采用的随机掩码策略可能忽略医学图像中的关键区域,而这些区域中即使微小的变化也可能指示疾病。为解决这一局限性,我们提出了一种利用多重分形测度(Renyi熵)优化掩码策略的新方法。我们的方法称为多重分形优化掩码自编码器(MO-MAE),它采用多重分形分析来识别高复杂度和信息量丰富的区域。通过将掩码过程聚焦于这些区域,MO-MAE确保模型学习重建最具诊断相关性的特征。这种方法对医学成像特别有益,因为精细检查组织结构对于准确诊断至关重要。我们在涵盖多种疾病的多个医学数据集上评估了MO-MAE,包括MedMNIST和COVID-CT。我们的结果表明,MO-MAE取得了有前景的性能,超越了其他基线和最先进的模型。由于所提出的测度计算简单,该方法还增加了最小的计算开销。我们的发现表明,多重分形优化的掩码策略增强了模型捕获和重建复杂组织结构的能力,从而实现了更准确和高效的医学图像表示。所提出的MO-MAE框架为提高医学图像分析中深度学习模型的准确性和效率提供了一个有前景的方向,可能推动计算机辅助诊断领域的发展。

英文摘要

Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other baseline and state-of-the-art models. The proposed method also adds minimal computational overhead, as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model's ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
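
Entropy-guided masking of the kind described above can be sketched as follows. This is an illustration under stated assumptions rather than the MO-MAE procedure: each patch is a flat list of quantized intensities, the Rényi entropy is computed on the patch's empirical intensity distribution, and the order `q`, the `mask_ratio`, and the rule of masking the highest-entropy patches are hypothetical choices.

```python
import math
from collections import Counter

def renyi_entropy(values, q=2.0):
    """Rényi entropy of order q of the empirical distribution of `values`.
    q = 1 is handled as the Shannon limit."""
    counts = Counter(values)
    total = len(values)
    probs = [c / total for c in counts.values()]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)

def select_patches_to_mask(patches, mask_ratio=0.75, q=2.0):
    """Return the sorted indices of the highest-entropy patches."""
    n_mask = int(round(mask_ratio * len(patches)))
    ranked = sorted(range(len(patches)),
                    key=lambda i: renyi_entropy(patches[i], q), reverse=True)
    return sorted(ranked[:n_mask])
```

A uniform patch has zero entropy and is ranked last, so the reconstruction task concentrates on textured regions, in contrast to random masking, which treats all patches alike.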

2605.26283 2026-05-27 cs.CV cs.LG 版本更新

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

卷积、Transformer、混合模型及视觉语言模型在多病种视网膜筛查中的基准测试

Durjoy Dey, Aymane Ajbar, Yuhong Yan

发表机构 * Department of Computer Science and Software Engineering(计算机科学与软件工程系) Concordia University(康科迪亚大学) Ebovir Biotechnologie Inc.(Ebovir生物技术公司)

AI总结 本研究在RFMiD数据集上对四种模型家族的12种架构进行基准测试,评估其在多病种视网膜筛查中的性能,发现基于注意力的模型(如SwinTiny、CoAtNet0、MaxViTTiny)在二元筛查和多标签分类中表现最佳,视觉语言模型与CNN基线相当但未超越最优Transformer和混合模型。

Comments 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings

详情
AI中文摘要

现代深度学习为自动化视网膜筛查提供了强大工具,但在现实多病种设置和领域偏移下,不同视觉模型家族的比较仍不明确。本研究使用视网膜眼底多病种图像数据集(RFMiD),对四种模型家族(卷积神经网络、视觉Transformer、混合CNN-Transformer骨干网络和视觉语言模型)的12种架构进行基准测试。我们评估两个任务:任何视网膜疾病的二元筛查和28个疾病类别的多标签分类。通过标准化训练、校准和评估协议,我们报告了在特异性接近80%的临床相关操作点下的AUC、F1、精确率、召回率和灵敏度。在RFMiD上,所有架构在二元筛查中表现良好,AUC均高于84%,但基于注意力的模型表现最佳。SwinTiny以及混合模型CoAtNet0和MaxViTTiny在二元筛查中取得最强结果,并在多标签设置中提高了宏F1和微F1。视觉语言模型(包括CLIP ViT-B/16和SigLIP-Base384)与CNN基线相当,但未超越最优Transformer和混合骨干网络。在Messidor-2上对可转诊糖尿病视网膜病变进行外部验证时,AUC范围为66.8%至84.7%,混合模型和Transformer模型再次表现出强劲性能。这些结果为多病种视网膜筛查中的模型选择提供了可重复的参考,并指导未来用于临床部署的自动化筛查工具。

英文摘要

Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.
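
Macro and micro F1, both reported for the multi-label setting above, can diverge sharply when rare disease classes are missed. The sketch below uses the standard definitions; treating a class with no true or predicted positives as F1 = 0 is a convention assumed here, not something stated in the abstract.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of per-sample binary label vectors of equal length.
    Returns (macro_f1, micro_f1)."""
    n_classes = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_classes)]  # tp, fp, fn per class
    for truth, pred in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(truth, pred)):
            if t and p:
                counts[c][0] += 1
            elif p:
                counts[c][1] += 1
            elif t:
                counts[c][2] += 1
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts) / n_classes
    # Micro: pool the counts over classes first, so frequent classes dominate.
    micro = f1(*[sum(col) for col in zip(*counts)])
    return macro, micro
```

For example, with two classes where the common one is always predicted correctly and the rare one is always missed, macro F1 drops to 0.5 while micro F1 stays much higher, which is why both are worth reporting across 28 imbalanced classes.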

2605.26273 2026-05-27 cs.CV Version update

Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

频率引导的RGB-热红外语义分割融合

İsmail Emre Canıtez, Özgür Erkent

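Affiliations * Hacettepe University

AI summary Proposes a multi-modal fusion architecture built on dual ConvNeXt V2 backbones that fuses RGB and thermal infrared features through frequency decomposition and a confidence-gated residual mechanism, achieving advanced performance on MFNet and PST900 with a comparatively low parameter count.

Comments 9 pages, 7 figures, to be presented at the Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR 2026

Details
AI Chinese abstract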

在城市驾驶场景等复杂环境中,语义分割在光照条件不佳时仍具挑战性,仅凭RGB图像提供的信息不足。RGB-热红外融合利用可见光和红外图像的互补优势来提升场景理解;然而,在不同特征抽象层次上有效整合这些异质模态仍是一个开放问题。本文提出一种基于双ConvNeXt V2骨干网络的多模态融合架构,采用分阶段、模态自适应的融合策略。对于早期特征,我们引入基于频率的融合模块,通过高斯滤波将红外特征分解为低频和高频分量,应用双分支空间注意力选择性强调热模式与精细边界,并通过置信门控残差机制将其与RGB特征融合。对于后期特征,我们设计了一个具有跨模态注意力和多尺度深度可分离卷积的语义融合模块,以捕捉模态间的语义对应关系。融合后的特征通过带有深度监督的PANet风格双向解码器进行解码。在MFNet和PST900上的实验表明,我们最轻量化的变体分别达到61.73%和86.24%的mIoU,仅需35.43M参数,在显著减少参数和计算成本的同时优于近期方法。代码可在https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION获取。

English abstract

Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION.
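
The low/high-frequency decomposition via Gaussian filtering mentioned in the abstract can be illustrated on a 1D signal. This is a minimal sketch under assumptions: the kernel size, the border clamping, and the definition of the high-frequency part as the residual after blurring are illustrative choices, and the function names are hypothetical. The paper's module operates on 2D infrared feature maps and adds attention and gating that are not shown here.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def split_frequencies(signal, sigma=1.0, radius=2):
    """Split a 1D signal into a low-frequency part (Gaussian blur) and a
    high-frequency residual, so that low[i] + high[i] equals signal[i]."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    low = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp indices at the borders
            acc += w * signal[j]
        low.append(acc)
    high = [x - lo for x, lo in zip(signal, low)]
    return low, high
```

On a step signal the residual concentrates around the step, which matches the intuition that the high-frequency branch carries boundary detail while the blurred branch carries smooth thermal patterns.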

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV Version update

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力:视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

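Affiliations * Technical University of Munich; Tensordyne

AI summary Addresses the systematic bias in attention weights caused by KV-cache quantization in video diffusion models by proposing an online per-attention-score correction for the Jensen bias, which restores near-BF16 video quality under INT2 quantization and can outperform INT4 quantization while using half the memory.

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

Details
AI Chinese abstract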

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算,但随着视频变长,该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力,但降低了视频质量。我们表明,这种降低的一个关键驱动因素是注意力权重的系统性偏差:由于softmax注意力中指数的凸性,量化噪声膨胀了缓存键的贡献,我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正,在期望中消除此偏差,该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似,额外的计算开销可忽略不计,且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化,我们的校正恢复了因激进量化而损失的大部分质量,达到接近BF16的视频质量,并且在使用50%更少内存的情况下优于INT4量化。

English abstract

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
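
The Jensen bias described above can be made concrete with a simplified noise model. The sketch assumes that each key coordinate carries independent rounding noise that is uniform on [-step/2, step/2]; this model, the function names, and the resulting formula are illustrative assumptions, not the correction derived in the paper. Under that assumption the expected exponentiated score is inflated by a factor whose logarithm is approximately ||q||^2 * step^2 / 24 to second order, which depends only on the query norm and the quantization step size.

```python
import math

def exact_log_bias(query, step):
    """log E[exp(q . e)] for independent e_i ~ Uniform(-step/2, step/2)."""
    total = 0.0
    for q_i in query:
        x = q_i * step / 2.0
        # E[exp(q_i * e_i)] = sinh(x) / x, which tends to 1 as x goes to 0.
        total += math.log(math.sinh(x) / x) if x != 0.0 else 0.0
    return total

def taylor_log_bias(query, step):
    """Second-order approximation of the log bias: ||q||^2 * step^2 / 24."""
    return sum(q_i * q_i for q_i in query) * step * step / 24.0

def corrected_score(score, query, step):
    """Subtract the estimated bias from a logit computed against a quantized key."""
    return score - taylor_log_bias(query, step)
```

The bias is always non-negative because the exponential is convex, so uncorrected quantized keys receive too much attention mass relative to unquantized ones. Any softmax scaling factor would have to be folded into `query` before calling these functions.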

2605.26262 2026-05-27 cs.CV Version update

Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis

维度分布情感状态:利用效价和唤醒作为视觉情感分析的通用嵌入空间

Émile Bergeron, Tadagbé Dhossou, Sébastien Tremblay, Jean-François Lalonde

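Affiliations * Université Laval

AI summary Proposes DDES, a new emotion representation that combines a continuous two-dimensional emotion space with a multi-dataset training pipeline, to help museum curators predict the emotional responses evoked by artworks.

Details
AI Chinese abstract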

博物馆是传播文化艺术的重要场所。它们是植根于历史和传统的机构;其展览通常旨在突出这些方面。最近,该领域正在探索一种新方法:基于情感的展览。这些展览专门设计用于引发游客的情感,以最大化参与度,并作为民主化艺术接触和吸引更广泛、更多样化观众的一种方式。为此,必须首先提取艺术品的情感内容,然而,由专家手动标注艺术品是一个劳动密集且成本高昂的过程,并且存在引入策展人个人偏见的风险。为了协助博物馆策展人设计这些展览,我们希望开发一种能够预测艺术作品所引发的情感反应的工具。在本文中,我们利用连续的双维情感空间来增强情感表示和深度学习模型的训练过程。借鉴现有的分类和维度情感表示,我们引入了一种新的表示方法——维度分布情感状态(DDES),以及一个多数据集训练流程。我们表明,与广泛使用的表示相比,DDES提供了多种优势,同时表现出相似的基线性能。

English abstract

Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition; their exhibitions are often designed to highlight these aspects. Recently, a new approach has been explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in visitors, both to maximize engagement and to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted. However, having experts manually annotate the artworks is prohibitively labor-intensive and risks introducing the personal bias of curators. To assist museum curators in designing these exhibitions, we aim to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.

2605.26244 2026-05-27 cs.CV cs.MM cs.SD Version update

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

LongAV-Compass:面向分钟级音视频生成在T2AV、I2AV和V2AV上的统一评估

Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang

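Affiliations * Peking University; Kling Team; Nanjing University; SJTU; HKUST(GZ); Shanghai AI Lab; Nanyang Technological University; CASIA; Tsinghua University

AI summary To address evaluation protocols that remain confined to short clips, proposes the LongAV-Compass benchmark, which uses 284 test cases and a unified evaluation framework to systematically assess the quality, consistency, and alignment of minute-scale audio-visual generation under text, image, and video conditioning.

Details
AI Chinese abstract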

音视频生成正从短片段快速发展到分钟级内容,而现有评估协议仍主要局限于短片段设置。现有基准主要关注5-10秒的文本条件生成,很少支持跨文本、图像和视频条件模态的统一评估。此外,它们对身份一致性、叙事连贯性和音视频对齐在长时间跨度上的退化提供的洞察有限。为弥补这一差距,我们引入了LongAV-Compass,一个用于分钟级音视频生成的系统基准。LongAV-Compass包含284个精选测试用例,涵盖文本到音视频(T2AV)、图像到音视频(I2AV)和视频到音视频(V2AV),按应用场景和生成复杂度组织。该基准结合了基于分类法的基准构建和统一评估框架,该框架集成了MLLM辅助评估与互补的感知和多模态指标,包括DINO-v2、ArcFace、CLIP和ImageBind。该框架评估超过20个细粒度维度,涵盖片段内质量、跨片段一致性、全局叙事连贯性、语义对齐和音视频同步。通过对11个代表性模型的实验以及人类对齐验证,LongAV-Compass提供了一个诊断测试平台,用于分析当前系统在跨不同输入模态维持连贯、语义对齐和时间一致的分钟级音视频生成方面的局限性。

English abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.

2605.26241 2026-05-27 cs.CV Version update

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

RoMo:用于人体运动生成的大规模、丰富组织的数据集和语义分类体系

Jiahao Zhang, Joseph Liu, Young-Yoon Lee, Seonghyeon Moon, Victor Zordan, Guy Tevet, Karen Liu, Stephen Gould, Oren Jacob, Haomiao Jiang, Mubbasir Kapadia, Yizhak Ben-Shabat

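Affiliations * Australian National University; Roblox; Stanford University; Rutgers University

AI summary Proposes the RoMo dataset, which uses a taxonomy-aware filtering pipeline to ensure quality and a three-level semantic taxonomy to organize the data; models trained on it achieve state-of-the-art fidelity and diversity and better understand complex text prompts.

Comments Accepted to CVPR'26

Details
AI Chinese abstract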

在语言、图像和视频领域的生成建模成功表明,大型、精心策划的数据集是构建强大模型的关键驱动力。然而,3D人体运动领域一直滞后,受限于在小型高保真运动捕捉数据集和以静态或低质量序列为主的大规模野外数据集之间的不满意选择。我们引入了RoMo,一个丰富、大规模、精心策划的野外人体运动数据集,解决了这些权衡。为确保质量,我们引入了一个分类感知过滤流水线,积极去除静态和易产生伪影的序列。每个序列都带有详细注释,并由一个新颖的三级语义分类体系组织。这种层次结构实现了细粒度的逐类别评估,揭示了全局指标所掩盖的模型优势和弱点。我们证明,在RoMo上训练的模型在保真度和多样性上达到最优,同时获得了对复杂、细微文本提示的卓越理解。最后,我们发布了Motion Toolbox以标准化指标、数据转换和可视化,为可重复和可解释的运动生成研究奠定了基础。

English abstract

Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

2605.26239 2026-05-27 cs.CV cs.MA Version update

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

Sentinel:具身协同空间推理与规划

Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan

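Affiliations * University of Massachusetts Amherst; University of Illinois Urbana-Champaign

AI summary Proposes the Sentinel Challenge and the CoSaR framework, which combines natural-language communication with spatial navigation algorithms to tackle cooperative spatial reasoning and planning for multiple agents in city-scale outdoor environments.

Comments The first two authors contributed equally

Details
AI Chinese abstract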

在这项工作中,我们研究了协同空间智能,即分散的具身智能体在跨城市规模的户外领域中,在动态环境约束下有效协调的能力。我们引入了Sentinel挑战,这是一个基准测试,其中多个分散的具身智能体必须通过自然语言进行通信,以在大规模城市户外环境中就一个相互安全且方便的会合点达成一致。然后,每个智能体必须安全导航,同时避开巡逻的动态哨兵,并使用提供粗略空间信息的工具。为了解决这个问题,我们提出了CoSaR(协同空间推理与规划)框架,该框架将基础模型的高层通信和规划能力与经典空间导航算法的精度相结合。CoSaR使智能体能够交换情境更新、推理不断变化的空间约束,并协同重新规划轨迹。在14个城市级别场景(包含3-5个智能体)的评估中,CoSaR始终导致更快的聚集、更短的路径长度和更高的安全性。我们的结果表明,将动态通信与空间推理相结合对于鲁棒的多智能体协作至关重要。通过形式化这一新设置并提供可扩展的基准测试,我们旨在为推进具身多智能体系统中的协同空间智能奠定基础。代码和挑战可在https://github.com/UMass-Embodied-AGI/Sentinel获取。

English abstract

In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city-scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high-level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city-level scenes with 3-5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi-agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi-agent systems. Code and challenge are available at https://github.com/UMass-Embodied-AGI/Sentinel.

2605.26232 2026-05-27 cs.CV Version update

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

并非所有模态都平等:面向多模态视频的指令感知门控

Bonan Ding, Umair Nawaz, Ufaq Khan, Abdelrahman M. Shaker, Muhammad Haris Khan, Jiale Cao, Jin Xie, Fahad Shahbaz Khan

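Affiliations * Mohamed bin Zayed University of Artificial Intelligence; Tianjin University; Chongqing University; Linköping University

AI summary Proposes the UniMVU framework, which performs multimodal video understanding through instruction-aware dynamic gating at the inner-modality and modality levels and outperforms static-fusion methods on six benchmarks.

Comments 19 pages, 8 figures, 7 tables, preprint

Details
AI Chinese abstract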

预训练视频大语言模型在视觉推理方面表现出色。然而,当视频伴随辅助流(如音频、深度图或密集时间证据)时,它们会陷入困境。在这种情况下,统一融合会导致模态干扰,使不相关的通道分散模型注意力。为了解决这个问题,我们提出了一个统一的多模态视频理解框架UniMVU,该框架通过两个级别的动态门控在视频、音频、深度图或任何其他模态输入之间执行指令感知融合:内模态门强调每个模态内的显著区域,而模态级门重新加权整个流;两者都根据文本指令进行条件化,以自适应地平衡模态重要性。我们的UniMVU将跨模态自注意力与指令驱动的内模态门控模块以及带有控制令牌的模态级门控模块相结合;对于时间对齐的流,我们进一步采用了一种快慢融合方案,以减少冗余。在六个基准(AVQA、AVSD、Music-AVQA、ScanQA、SQA3D和MVBench)上,我们的UniMVU相对于静态融合基线取得了一致的提升,在CIDEr指标上最高提升了13.5。此外,我们的分析表明,门控机制与人类可解释的模态相关性一致,消融实验显示了内模态和模态级门控的贡献。我们的UniMVU为指令感知的多模态视频理解提供了一种简单、统一的方案,无需手工设计的融合规则即可扩展到多种模态。

English abstract

Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth maps, or dense temporal evidence. In such scenarios, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. UniMVU combines cross-modal self-attention with an instruction-driven inner-modality gating module and a modality-level gating module with a control token; for time-aligned streams, we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D, and MVBench), UniMVU achieves consistent gains over static-fusion baselines, with improvements as high as 13.5 on the CIDEr metric. Further, our analysis shows that the gating mechanism aligns with human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.
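
The idea of modality-level gates that re-weight whole streams according to the text instruction can be sketched with a softmax over instruction-to-modality similarities. Everything here is a hypothetical simplification: the per-modality key vectors, the dot-product scoring, and the function names are assumptions made for illustration, and the paper's control-token mechanism and inner-modality gates are not reproduced.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def modality_gates(instruction, modality_keys):
    """One weight per modality, from the dot product between the instruction
    embedding and a per-modality key vector (both hypothetical here)."""
    names = list(modality_keys)
    logits = [sum(a * b for a, b in zip(instruction, modality_keys[n]))
              for n in names]
    return dict(zip(names, softmax(logits)))

def fuse(features, gates):
    """Gate-weighted sum of equally sized per-modality feature vectors."""
    dim = len(next(iter(features.values())))
    return [sum(gates[n] * features[n][i] for n in features) for i in range(dim)]
```

With an instruction embedding that aligns with the video key, the video stream receives most of the weight and the remaining streams are damped rather than dropped, which is the behavior a static, uniform fusion cannot provide.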

2605.26230 2026-05-27 cs.CV Version update

Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

几何感知表示去噪用于鲁棒的多视角3D重建

Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

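Affiliations * KAIST AI; Samsung Electronics

AI summary Proposes the Geometry-Aware Representation Denoising (GARD) framework, which performs diffusion-based multi-view restoration in the feature space of a feed-forward 3D reconstruction model, recovering both scene geometry and high-quality RGB images; its effectiveness is validated on the DA3 benchmark.

Details
AI Chinese abstract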

多视角3D重建随着前馈3D重建模型的出现取得了显著进展。然而,这些模型通常在理想的无退化成像条件下训练和评估,而真实世界的观测往往包含与此类设置显著不同的退化。因此,在退化条件下提高多视角3D重建的鲁棒性仍然是一个重要挑战。我们提出了几何感知表示去噪(GARD),一种新颖的框架,直接在前馈3D重建模型的特征空间中执行基于扩散的多视角恢复。这种设计利用3D重建器的几何感知特征表示来有效恢复准确的场景几何。此外,通过使用额外的RGB图像解码器,精炼的表示还可用于恢复高质量的RGB图像,从而同时恢复3D场景几何和高质量图像。在Depth Anything 3(DA3)基准上的全面实验证明了所提出的GARD框架的有效性。

English abstract

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

2605.26149 2026-05-27 cs.GR cs.CV Version update

AnySurf: Any Surface Generation with Directed Edge

AnySurf: 基于有向边的任意表面生成

Wenda Shi, Chenyuan Pan, Dengming Zhang, Yiren Song, Biao Zhang, Xingxing Zou

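Affiliations * The Hong Kong Polytechnic University; Zhejiang University; National University of Singapore; Xi'an Jiaotong University

AI summary Proposes AnySurf, a unified framework that uses a directed-edge enhanced Flexible Dual Grid representation to generate high-quality open, closed, and hybrid 3D surfaces, and introduces ROS-FT post-training and a lightweight DE-Adapter to preserve generation performance.

Details
AI Chinese abstract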

开放表面组件在真实工业3D内容中普遍存在,支持渲染、物理模拟和几何编辑。服装作为典型的开放表面类型,现有许多生成方法利用缝纫图案生成2D面板并缝合为3D形状。这种特定领域的设计缺乏可扩展性,无法泛化到鞋子和配饰。常见的基于场的3D生成器优先考虑水密网格,并倾向于在开放表面上创建有缺陷的双层结构。尽管Trellis2采用了无场表示,但其开放表面结果仍存在法线和拓扑错误。我们提出AnySurf,一个统一框架,生成具有准确面朝向的开放、封闭和混合3D表面。基于有向边增强的柔性双网格(FDG-D),我们的表示通过定向网格边保留法线方向信息。我们还提出了ROS-FT后训练和仅增加1%额外参数的轻量级DE-Adapter,促进有向边学习同时保持原始生成性能。我们进一步构建了包含工业服装和封闭配件的Outfit3D数据集。我们的工作将服装建模转化为通用的3D生成任务。实验结果表明,在网格质量和下游应用实用性方面具有优越性。

English abstract

Open surface components prevail in real industrial 3D content and support rendering, physical simulation, and geometric editing. Garments are a typical open surface type, and numerous existing generation methods leverage sewing patterns to generate 2D panels and stitch them into 3D shapes. Such domain-specific designs lack scalability and cannot generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Although Trellis2 adopts a field-free representation, its open surface results still contain normal and topology errors. We present AnySurf, a unified framework that generates open, closed, and hybrid 3D surfaces with accurate face orientation. Built on a directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with merely 1% extra parameters, facilitating directed edge learning while preserving original generation performance. We further construct the Outfit3D dataset, which contains industrial garments and closed accessories. Our work transforms garment modeling into a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.

2605.26137 2026-05-27 cs.GR cs.AI cs.CV Version update

AssetGen: Deployable 3D Asset Generation at Interactive Speed

AssetGen: 可部署的交互速度3D资产生成

Dilin Wang, Xiaoyu Xiang, Kihyuk Sohn, Tom Monnier, Yu-Ying Yeh, Thu Nguyen-Phuoc, Jiawen Zhang, Yuchen Fan, Antoine Toisoul, Hyunyoung Jung, Prithviraj Dhar, Michael Bunnell, Nikolaos Sarafianos, Chuhang Zou, Roman Shapovalov, Andrea Vedaldi, Rakesh Ranjan

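Affiliations * Reality Labs, Meta

AI summary Proposes the AssetGen system, which uses a coarse-to-refine VecSet framework, multi-view texture generation, and end-to-end acceleration to produce, within 30 seconds, a high-quality mesh with baked normals, a color texture, and a controllable polygon budget, supporting real-time rendering and mobile deployment.

Details
AI Chinese abstract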

尽管3D生成技术正在快速发展,但近期工作通常侧重于获取高分辨率资产,而将用户体验和可部署性视为事后考虑。我们提出AssetGen,一个专注于这两个方面的3D生成器。给定一张参考图像,它在30秒内生成一个高质量网格,带有烘焙法线、颜色纹理和可控多边形预算,适用于实时渲染,包括移动端用例。AssetGen Flash变体进一步将延迟降低到14秒,适用于交互式和代理式创作循环。我们的模型使用粗到细的VecSet框架生成物体几何,该框架在GPU上实现网格简化、清理和法线烘焙,以及快速并行UV展开。然后以多视图方式生成纹理,随后进行反投影和3D修复。模型蒸馏、内核优化和流水线并行化被协同设计以加速整个系统。我们引入了大量自动化和盲人机评估,并在30秒内展示了与领先商业解决方案相当的视觉质量,在不到15秒内展示了预览质量的结果。最终结果是一个支持AI辅助、可部署的3D内容创建的系统,适用于交互式工作流。

English abstract

While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.

2605.26103 2026-05-27 cs.CV Version update

Global Structure-from-Motion Meets Feedforward Reconstruction

全局运动恢复结构与前馈重建的结合

Linfei Pan, Johannes Schönberger, Marc Pollefeys

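Affiliations * ETH Zurich; Meta Reality Labs; Microsoft

AI summary Proposes a new pipeline that combines the strengths of classical SfM and feedforward reconstruction, achieving state-of-the-art reconstruction results across a wide range of scenarios.

Comments CVPR 2026, Highlight

Details
AI Chinese abstract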

运动恢复结构——从一组图像同时估计相机姿态和3D场景结构的过程——仍然是计算机视觉中的一个核心挑战,许多开放问题尚待解决。前馈3D重建的最新进展在克服经典SfM方法的持续失败案例方面取得了显著进步,特别是在低纹理、有限重叠和对称性等场景中。然而,尽管前馈方法在这些挑战性条件下表现出色,但它们在可扩展性、准确性或鲁棒性方面常常面临限制,并且在标准重建设置中通常不如经典方法。在这项工作中,我们系统地分析了这些限制,并通过结合经典和前馈方法的各自优势,提出了一种新的运动恢复结构流水线。在多个数据集上的广泛实验显示了我们的方法的优势,在广泛场景中实现了最先进的结果。我们将我们的系统作为开源实现分享在https://github.com/colmap/gluemap。

English abstract

Structure-from-Motion, the process of simultaneously estimating camera poses and 3D scene structure from a collection of images, remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.

2605.25861 2026-05-27 cs.CV cs.AI Version update

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

MuNet: 一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

Affiliations * National Engineering Research Center for E-Learning; National Engineering Research Center of Educational Big Data; School of Electronic Information and Communications; School of Artificial Intelligence and Automation

AI summary Proposes MuNet, a mutualistic network that jointly optimizes 3D human mesh recovery and clothed human reconstruction through a unified representation and a mutualistic mechanism, achieving state-of-the-art performance on six benchmark datasets.

Details
AI Chinese abstract

3D人体网格恢复和3D穿衣人体重建本质相关,但长期以来被孤立研究,忽视了联合优化的潜在收益。为克服这一局限,我们提出在一个统一框架中处理这两个任务,从而有效利用它们的相互依赖关系。基于这一思想,我们提出MuNet,一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络。首先,我们采用2-流形图作为所有3D模型的统一表示,从而在3D人体网格恢复和穿衣人体重建之间实现一致建模。其次,我们设计了一个端到端的图卷积网络,逐步将初始图变形为3D人体网格,并将其细化成详细的3D穿衣人体模型。第三,我们引入一种互惠机制,允许两个任务在训练期间进行相互交互,其中3D人体网格恢复为3D穿衣人体重建提供指导,而重建反馈则细化3D人体网格恢复。我们在六个基准数据集上广泛评估了MuNet,包括Human3.6M、3DPW、MPI-INF-3DHP、THuman2.0、CAPE和RenderPeople。实验结果表明,MuNet在所有数据集上的两个任务均达到了最先进的性能。MuNet的代码已在https://github.com/starVisionTeam/MuNet上发布,供研究使用。

English abstract

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.

2605.25570 2026-05-27 cs.CV Version update

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

从对比到一致性:重新思考基于事件的连续时光流估计

Rui Hu, Song Wu, Wen Yang, Jinjian Wu

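Affiliations * Xidian University

AI summary Proposes a hybrid-supervised framework based on Spatio-temporal Structural Consistency (STSC) that, combined with a bidirectionally complementary multi-scale architecture and a curriculum-guided hybrid training strategy, achieves state-of-the-art performance in continuous-time and standard optical flow estimation.

Comments Accepted by CVPR 2026

Details
AI Chinese abstract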

估计连续光流是动态视觉感知中一个基础但具有挑战性的问题。基于事件的相机具有微秒级延迟和高动态范围,能够异步捕捉亮度变化,为以精细时间精度建模运动提供了独特机会。然而,时间密集的真实标注的稀缺性限制了监督学习的有效性,而专注于锐化扭曲事件图像(IWE)的对比度最大化(CM)框架往往忽略时间连续性和结构一致性,导致复杂运动下的轨迹扭曲。为了克服这些挑战,我们提出了一种基于时空结构一致性(STSC)原则的混合监督框架,用于连续时光流估计。该范式共同强化局部结构稳定性和轨迹连续性,确保跨时间的物理一致运动。为了进一步增强表示和鲁棒性,我们设计了一种双向互补的多尺度架构,并采用课程引导的混合训练策略,实现了从监督点约束到自监督流形正则化的平滑过渡。在多个基准上的综合实验表明,我们的方法在连续时间和标准光流估计中均达到了最先进的性能,证明了所提出学习范式的有效性。

English abstract

Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of temporally dense ground-truth annotations limits the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.
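
For context, the contrast maximization (CM) baseline that the abstract contrasts itself with can be shown in a toy 1D setting: events are warped back to a reference time with a candidate velocity, accumulated into an Image of Warped Events (IWE), and scored by the variance of that image. The sketch below illustrates this baseline idea only, not the proposed STSC framework; the 1D event format, the rounding to pixel bins, and the function names are assumptions.

```python
def image_of_warped_events(events, velocity, width):
    """Accumulate 1D events (x, t) into pixel bins after warping to t = 0."""
    image = [0] * width
    for x, t in events:
        warped = round(x - velocity * t)
        if 0 <= warped < width:
            image[warped] += 1
    return image

def contrast(image):
    """Variance of the event counts; sharper images give larger values."""
    mean = sum(image) / len(image)
    return sum((c - mean) ** 2 for c in image) / len(image)
```

When the candidate velocity matches the true motion, all events of a moving edge land in the same bin and the variance peaks. The objective only rewards sharpness of the accumulated image, which is why, as the abstract argues, it says little about temporal continuity along the trajectory.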

2605.25569 2026-05-27 cs.CV Version update

ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement

ControlLight: 迈向可控、一致且泛化的低光照增强

Yufeng Yang, Jianzhuang Liu, Jisheng Chu, Yuqi Peng, Xianfang Zeng, Jiancheng Huang, Shifeng Chen

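Affiliations * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Zhejiang University

AI summary Proposes the ControlLight framework, which makes low-light enhancement controllable, consistent, and generalizable by building a large-scale dataset with continuous illumination-strength supervision and introducing a misalignment-aware weighted flow matching loss.

Comments 18 pages, 12 figures

Details
AI Chinese abstract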

现有的基于深度学习的低光照增强方法通常在有限的数据集上训练,且具有单一的增强目标,这限制了它们在现实应用中的泛化能力和可控性。为了克服这些限制,我们提出了ControlLight,一个可控、一致且泛化的低光照增强框架。我们首先构建了一个带有连续光照强度监督的真实世界退化图像的大规模数据集。为了进一步确保在不同控制强度下输出的一致性,我们引入了一种错位感知加权流匹配损失,该损失在连续增强强度下保持图像结构。ControlLight允许用户通过灵活控制强度来编辑真实世界的退化低光照图像,以获得满意的增强结果,同时保持视觉一致性和真实性。大量实验表明,ControlLight在现有低光照增强方法中达到了最先进的性能,同时展现出强大的连续可控性和对真实世界场景的泛化能力。

English abstract

Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts their generalization ability and controllability in real-world applications. To overcome these limitations, we propose ControlLight, a controllable, consistent, and generalizable framework for low-light enhancement. We first construct a large-scale dataset of real-world degraded images with continuous illumination-strength supervision. To further ensure consistent outputs under different control strengths, we introduce a misalignment-aware weighted flow matching loss that preserves image structure across continuous enhancement strengths. ControlLight allows users to edit real-world degraded low-light images toward satisfactory enhancement results by flexibly controlling the strength while preserving visual consistency and realism. Extensive experiments show that ControlLight achieves state-of-the-art performance against existing low-light enhancement approaches while demonstrating strong continuous controllability and generalization to real-world scenarios.

2605.25538 2026-05-27 cs.CV cs.DB Version update

Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking

Tetris: 用于高效高保真视频目标跟踪的瓦片级采样

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

发表机构 * U. of California, Berkeley(加州大学伯克利分校)

AI总结 提出Tetris系统,通过将视频分解为基于瓦片的骨牌数据模型,实现细粒度时空剪枝,在跟踪精度损失不超过5%的条件下,吞吐量最高比参考流水线高68.8倍。

详情
AI中文摘要

轨迹物化将原始视频转换为可重用的目标轨迹,下游查询可以直接使用而无需重新运行跟踪,但高效且高保真地提取这些轨迹仍然成本高昂。先前的系统通过时间帧采样来降低成本,但这会抹去细粒度跟踪所需的帧间运动。然而,在静态视频中,每帧的大部分区域不包含感兴趣的目标,剩余区域也能容忍不同的采样率。我们提出Tetris,一个轨迹提取系统,它将视频分解为基于瓦片的骨牌数据模型,实现细粒度时空剪枝,以最小的保真度损失减少检测器调用。Tetris在用户提供的检测器上游运行三个算子:一个分类器识别相关瓦片并将它们分组为骨牌;一个整数线性规划(ILP)在用户指定的精度约束下剪枝冗余骨牌;一个打包器将幸存者组装成画布,以最小化检测器调用。在7个静态视频数据集上,Tetris的跟踪精度损失保持在5%以内,而先前的系统在7个数据集中的3个上超过了这个界限。在这个5%的界限下,Tetris的吞吐量比先前系统高17.4倍,比参考流水线高68.8倍。项目页面位于https://tetris-db.github.io。

英文摘要

Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io .
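The tile-to-polyomino grouping described in the abstract can be pictured with a minimal sketch. It assumes the classifier output is a boolean grid of relevant tiles and that a polyomino is a 4-connected component of relevant tiles; the abstract does not state the grouping rule, and the function name `group_polyominoes` is hypothetical.

```python
from collections import deque


def group_polyominoes(relevant):
    """Group 4-connected relevant tiles into polyominoes.

    `relevant` is a 2D grid of truthy/falsy values (assumed classifier output).
    Returns a list of polyominoes, each a sorted list of (row, col) tiles.
    """
    rows = len(relevant)
    cols = len(relevant[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    polyominoes = []
    for r in range(rows):
        for c in range(cols):
            if not relevant[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited relevant tile.
            queue = deque([(r, c)])
            seen[r][c] = True
            tiles = []
            while queue:
                y, x = queue.popleft()
                tiles.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and relevant[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            polyominoes.append(sorted(tiles))
    return polyominoes
```

In this picture, tiles outside every polyomino are never sent to the detector, which is where the spatial part of the pruning would come from; the ILP pruning and canvas packing steps are not sketched here.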

2605.25353 2026-05-27 cs.LG cs.CV physics.comp-ph 版本更新

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

PDEInvBench:面向PDE逆问题的神经网络综合数据集与设计空间探索

Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

发表机构 * Department of Computer Science, UC Berkeley(加州大学伯克利分校计算机科学系) UC Berkeley(加州大学伯克利分校) Departments of Computer Science and Chemical Engineering, UC Berkeley(加州大学伯克利分校计算机科学系与化学工程系) LBNL(劳伦斯伯克利国家实验室)

AI总结 提出PDEInvBench基准数据集,通过数值模拟涵盖多种PDE,并沿优化、表示和缩放三个维度系统探索神经网络设计空间,发现两阶段训练、PDE导数输入和初始条件多样性等实用见解。

Comments 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

偏微分方程(PDE)中的逆问题涉及从观测到的时空解场估计系统的物理参数。神经网络因其对函数到函数空间变换的建模能力,非常适合PDE参数估计。虽然现有的机器学习方法基准主要关注正问题,但尚无针对PDE逆问题(即从解场映射到潜在物理参数)的类似综合研究和基准数据集。我们通过引入PDEInvBench填补了这一空白,这是一个全面的基准数据集,包含时间依赖和时间独立PDE的数值模拟,覆盖广泛的物理行为和参数。我们的数据集包括评估划分,用于评估在分布内和多种分布外设置下的性能。利用我们的基准数据集,我们沿三个关键维度全面探索了神经网络在PDE逆问题中的设计空间:(1)优化过程,分析监督、自监督和测试时训练目标对性能的作用;(2)问题表示,研究具有不同归纳偏好的架构选择和各种条件策略的价值;(3)缩放,针对模型和数据大小进行。我们的实验揭示了几个实用见解:1)神经网络在两步训练过程中表现最佳:先用PDE参数进行初始监督,然后使用PDE残差进行测试时微调;2)将PDE导数作为输入特征始终能提高精度;3)增加训练数据中初始条件的多样性比扩大PDE参数范围带来更大的性能提升。我们公开了数据集和代码库。

英文摘要

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.
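To make the idea of a PDE residual as a label-free objective concrete, here is a minimal sketch. It assumes the 1D heat equation u_t = nu * u_xx as the example PDE (the abstract does not name specific equations) and replaces gradient-based test-time fine-tuning of a network with a closed-form least-squares fit of the single parameter nu. The function name `estimate_diffusivity` and the grid sizes are hypothetical.

```python
import math


def estimate_diffusivity(u, dt, dx):
    """Return the nu that minimizes the squared heat-equation residual.

    u[n][i] is the solution field at time step n and grid point i.
    The residual u_t - nu * u_xx is evaluated with central finite
    differences on interior points; minimizing its sum of squares
    over nu has the closed form sum(u_t * u_xx) / sum(u_xx ** 2).
    """
    num = 0.0
    den = 0.0
    for n in range(1, len(u) - 1):
        for i in range(1, len(u[n]) - 1):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * dt)
            u_xx = (u[n][i + 1] - 2 * u[n][i] + u[n][i - 1]) / dx ** 2
            num += u_t * u_xx
            den += u_xx * u_xx
    return num / den


# Synthetic field from the analytic solution exp(-nu * t) * sin(x).
nu_true, dt, dx = 0.1, 0.01, 0.05
u = [
    [math.exp(-nu_true * n * dt) * math.sin(i * dx) for i in range(64)]
    for n in range(50)
]
nu_hat = estimate_diffusivity(u, dt, dx)
```

No ground-truth parameter enters the fit, only the observed field and the known form of the equation, which is what would make a residual objective usable at test time.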

2605.24001 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Diff-Instruct with Diffused Reward: 迈向有原则的一步生成器强化学习

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

发表机构 * Purdue University(普渡大学) hi-lab, Xiaohongshu Inc.(小红书实验室,小红书公司)

AI总结 针对一步生成器强化学习中奖励优化与生成动力学不匹配的问题,提出基于积分KL最小化的无数据轨迹级对齐框架DIDR,通过扩散奖励分数和代理估计器实现奖励驱动的校正,在一步SDXL和6B DiT骨干网络上取得帕累托优势。

Comments author list correction

详情
AI中文摘要

近期一步文本到图像生成的进展实现了实时合成,具有显著的效率和质量。先前用于一步生成器的强化学习方法将图像空间奖励优化与扩散噪声空间分布匹配相结合。这种范式由于终端奖励优化与底层生成动力学之间的不匹配带来了挑战。结果,优化倾向于利用随机自由度,通常以牺牲图像保真度为代价来提高奖励。为了解决这个问题,我们提出了Diff-Instruct with Diffused Reward (DIDR),一个从积分KL最小化推导出的无数据轨迹级对齐框架。DIDR将RLHF最优的奖励倾斜干净图像分布沿扩散轨迹传播到所有噪声水平。我们证明该目标与干净图像RLHF具有相同的最小化器,同时自然诱导出扩散奖励分数(DRS),它作为对参考分数函数的奖励驱动校正。为了使其实用,我们进一步引入了扩散奖励代理(DRP),一种基于可微短步去噪的DRS高效估计器。大量实验表明,DIDR持续帕累托主导现有的一步SDXL基线。此外,当迁移到6B DiT骨干网络(Z-Image)时,DIDR在偏好对齐上超越了其50步教师模型,同时仅需单步生成。

英文摘要

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
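For orientation, the standard KL-regularized RLHF optimum and its propagation to a noise level t can be written out as below. The notation is assumed for illustration and is not taken from the paper: p_ref is the reference distribution, r the reward, beta > 0 the KL weight, and q(x_t | x_0) the forward diffusion kernel. Whether the second term of the last line coincides exactly with the paper's Diffused Reward Score is likewise an assumption.

```latex
% Assumed notation: p_ref = reference distribution, r = reward,
% beta > 0 = KL weight, q(x_t | x_0) = forward diffusion kernel.
\begin{align}
p^{*}(x_0) &\propto p_{\mathrm{ref}}(x_0)\,\exp\!\big(r(x_0)/\beta\big), \\
p^{*}_t(x_t) &= \int q(x_t \mid x_0)\, p^{*}(x_0)\,\mathrm{d}x_0
  \;\propto\; p_{\mathrm{ref},t}(x_t)\,
  \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big], \\
\nabla_{x_t}\log p^{*}_t(x_t) &= \nabla_{x_t}\log p_{\mathrm{ref},t}(x_t)
  + \nabla_{x_t}\log \mathbb{E}_{p_{\mathrm{ref}}(x_0 \mid x_t)}\!\big[\exp\!\big(r(x_0)/\beta\big)\big].
\end{align}
```

The first line is the minimizer of the expected negative reward plus beta times the KL divergence to p_ref; the last line shows the noisy-level score as the reference score plus a reward-driven correction term.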

2605.23327 2026-05-27 cs.CV 版本更新

GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection

GFSR:用于可靠车道检测的几何保真度与空间细化

Tiancheng Wang, Zhaolu Ding, Richeng Xu, Tianhui Zheng, Hui Liu, Hanyu Xuan, Zhiliang Wu, Guanghui Yue

发表机构 * the School of Big Data and Statistics, Anhui University(安徽大学大数据与统计学院) the School of Artificial Intelligence and Data Science, University of Science and Technology of China(中国科学技术大学人工智能与数据科学学院) Institute of Dataspace, Hefei Comprehensive National Science Center(合肥综合性国家科学中心数据空间研究院) the College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) the School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University(深圳大学医学院生物医学工程学院)

AI总结 针对现有车道检测方法中分类置信度与几何质量脱节、回归模块弱化采样点关联导致复杂场景性能下降的问题,提出包含LaneIoU引导的置信度校准和自适应门控位置细化的GFSR框架,在CULane和CurveLanes上取得最优结果。

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems. 12 pages, 6 figures

详情
AI中文摘要

车道检测是自动驾驶和高级驾驶辅助系统中的一项关键感知任务。然而,现有方法在复杂真实场景中仍会退化,原因在于两个主要限制。首先,分类置信度仅表征车道先验的分类存在性,与几何质量无强相关性。如果仅基于该置信度进行阈值过滤和NMS,模型倾向于保留高置信度的车道先验,而消除那些置信度较低但几何表示更优的先验。其次,现有方法中的回归模块削弱了采样点之间的相关性,阻碍了对远处、高曲率和复杂拓扑车道的细粒度优化,导致欠拟合。为解决这些问题,我们提出了几何保真度与空间细化(GFSR),一个由LaneIoU引导的置信度校准(LCC)和自适应门控位置细化(AGLR)组成的框架。具体地,LCC采用LaneIoU作为软监督来显式估计车道先验的几何保真度,并将其与分类置信度融合以构建协同可靠性指数(CRI)。该指数引导车道先验过滤,有效保留那些具有高分类置信度和良好几何质量的先验。同时,在每个细化阶段与回归头协作,AGLR预测采样点横向偏移并采用门控机制自适应调节校正幅度,增强点间相关性,提升模型对复杂车道场景的适应性和鲁棒性。在CULane和CurveLanes上的大量实验表明,我们的GFSR在CULane上达到了最优性能,F1_50和F1_75分数分别为81.46%和65.01%,在CurveLanes上达到了87.35%的F1_50。

英文摘要

Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane priors and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, the regression modules in existing methods weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate the geometric fidelity of lane priors, which is further fused with classification confidence to construct the Collaborative Reliability Index (CRI). This index guides lane prior filtering, effectively retaining those with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1_50 and F1_75 scores of 81.46% and 65.01%, and reaches 87.35% F1_50 on CurveLanes.

2605.22904 2026-05-27 cs.CV cs.AI 版本更新

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

基于AI视频监控的自杀风险评估:地铁站预防的可解释框架

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

发表机构 * Université TÉLUQ(TÉLUQ大学) Polytechnique Montréal(蒙特利尔理工学院) Université du Québec à Montréal(魁北克大学蒙特利尔分校)

AI总结 提出首个可解释框架,通过行人跟踪、活动识别、站台语义分割和轨迹风险热图建模,从监控视频中评估自杀风险,在真实数据上达到83.2% ROC-AUC。

Comments 9 pages, 6 figures, 1 table. Accepted for Publication in the International Joint Conference of Artificial Intelligence (IJCAI)

详情
AI中文摘要

理解并监控地铁站中的人类行为对于支持自杀预防工作至关重要,早期识别高风险情况能够实现及时干预。这需要通过对每个乘客的行为、其空间上下文和时间动态进行联合推理,从监控视频中评估自杀风险。然而,使用监控摄像头捕获的视频进行评估具有挑战性,因为它需要准确感知人体运动、理解站台几何结构,并随时间聚合异质行为线索。在这项工作中,我们正式定义了地铁站自杀风险评估(SRA)任务,并引入了首个解决这一挑战的可解释框架。与专注于孤立子任务或试图直接推断意图的方法不同,我们的公式通过整合行人跟踪、活动识别、站台语义分割和轨迹驱动的风险热图建模,从累积证据中评估自杀风险。通过将SRA形式化为一个独特任务,并在真实监控数据上基准测试一个完整的操作流程,实现了83.2%的ROC-AUC,这项工作突出了自杀风险评估的复杂性,并为面向社会公益的可解释AI系统研究开辟了新方向。

英文摘要

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

2605.20914 2026-05-27 cs.CV 版本更新

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

RISE: 自进化视觉语言模型的可靠改进

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

发表机构 * AMAP, Alibaba Group(阿里集团AMAP实验室)

AI总结 针对视觉语言模型自进化中角色交替粗粒度、问题质量下降和类型坍缩问题,提出RISE框架,通过细粒度角色交替、质量监督器和技能感知动态平衡实现可靠自进化。

详情
AI中文摘要

视觉语言模型(VLM)已具备强大的多模态推理能力,但进一步提升仍严重依赖大规模人工构建的监督信号进行后训练。这种监督信号获取成本高昂,尤其对于推理密集型多模态任务,其中问题、答案和反馈信号必须精心设计。这激发了自进化学习,即模型通过双角色闭环自我改进:提问者自主提出问题,求解者学习解答。然而,我们观察到当前的VLM自进化方法仍面临三大挑战:粗粒度的角色交替延迟了问题生成与求解者适应之间的交互;生成的问题质量可能逐渐下降;问题类型可能坍缩至狭窄分布。这些问题限制了自进化的效率和可靠性。因此,我们提出RISE,一个可靠的视觉语言模型自进化框架。RISE基于三个互补设计:细粒度角色交替,缩短提问者与求解者之间的反馈循环以提高效率;质量监督器,提高问题有效性和伪标签可靠性;以及技能感知动态平衡,在进化过程中缓解模式坍缩并保持广泛的技能覆盖。这些组件共同使得从无标签图像中实现更可靠和有效的自进化成为可能。在两个VLM骨干网络上的七个基准测试实验表明,RISE持续改进基础模型,带来广泛而持久的性能提升。我们的代码已公开在https://github.com/AMAP-ML/RISE。

英文摘要

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

2605.20606 2026-05-27 cs.CV 版本更新

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

注意你的间隔与边界:你的蒸馏数据集真的鲁棒吗?

Muquan Li, Yingyi Ma, Yihong Huang, Hang Gou, Ke Qin, Ming Li, Yuan-Fang Li, Tao He

发表机构 * The Laboratory of Intelligent Collaborative Computing of UESTC, Chengdu, China(UESTC智能协同计算实验室,中国成都) Monash University, Melbourne, Australia(莫纳什大学,澳大利亚墨尔本) Guangdong Laboratory of Artificial Intelligence(广东人工智能实验室)

AI总结 针对数据集蒸馏中鲁棒性不足的问题,提出一种结合攻击感知课程学习与对比鲁棒性目标的框架C²R,通过优先处理最小鲁棒边界的对抗样本并扩大类间决策边界分离度,显著提升鲁棒准确率。

Comments Accepted to ICML 2026

详情
AI中文摘要

数据集蒸馏(DD)将大型训练集压缩为小型合成集以进行高效训练,但大多数DD方法仅优化干净准确率而忽略鲁棒性。最近的鲁棒DD方法提高了鲁棒性,但通常面临较差的准确率-鲁棒性权衡,因为它们(i)统一对待所有对抗扰动样本,尽管鲁棒风险主要由接近零的鲁棒边界主导,以及(ii)没有明确增加攻击集中区域的决策边界类间分离。我们提出了对比课程鲁棒数据集蒸馏(C²R),一个将攻击感知课程与对比鲁棒性目标相结合的框架。从鲁棒边界的角度,我们推导出一个扰动分数,近似每个样本的鲁棒铰链,从而能够优先考虑那些最直接驱动鲁棒误差的最小边界对抗样本。同时,一个类平衡的对比鲁棒性损失在明确扩大跨类别边界分离的同时强制执行对抗不变性。在CIFAR-10/100、Tiny-ImageNet和多个ImageNet-1K子集上进行的六种攻击实验表明,C²R实现了最佳的鲁棒准确率,平均优于先前的鲁棒DD方法2.8%。

英文摘要

Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C²R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C²R achieves the best robust accuracy, outperforming prior robust DD by 2.8% on average.

2605.16457 2026-05-27 cs.LG cs.AI cs.CV 版本更新

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能交叉学科项目) Department of Computer Science and Engineering, Seoul National University(首尔国立大学计算机科学与工程系)

AI总结 提出可辨识的令牌对应关系(ITC)方法,通过将下一帧预测建模为结构化分配问题,解决基于令牌的Transformer世界模型在长程推演中的时间不一致性,在四个基准上达到最先进性能。

详情
AI中文摘要

基于令牌的Transformer世界模型在视觉强化学习中表现出色,但常在长程推演中出现时间不一致性,包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题,而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系(ITC),这是一种用于基于令牌的Transformer世界模型的解码步骤,将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题:每个下一帧令牌要么通过从上一帧复制令牌来解释,要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变,可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数,显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

英文摘要

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.02207 2026-05-27 cs.CV cs.AI cs.LG 版本更新

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo:面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

发表机构 * Department of Computer Science, Old Dominion University, VA, USA(欧道明大学计算机科学系,美国弗吉尼亚州)

AI总结 提出MultiSense-Pneumo多模态原型系统,整合症状、咳嗽音频、语音和胸片,通过可解释的后期融合实现肺炎筛查与分诊支持。

详情
AI中文摘要

肺炎仍然是全球发病率和死亡率的主要原因,尤其是在低资源环境中,那里缺乏影像学、实验室检测和专科护理。临床评估依赖于异质性证据,包括症状、呼吸模式、口头描述和胸部影像,使得一线筛查本质上是多模态的。然而,许多现有的计算方法仍然是单模态的,并且主要关注放射影像。在这项工作中,我们提出了MultiSense-Pneumo,一个面向肺炎筛查和分诊支持的多模态研究原型,它整合了结构化症状描述符、咳嗽音频、口语和胸部X光片。该系统结合了确定性症状分诊、基于LightGBM的声学分类、使用ResNet-18的域对抗放射影像分析、基于Transformer的语音识别以及可解释的后期融合算子。每个模态被转换为归一化的关注信号,并聚合为统一的筛查估计。融合权重是手动指定的,被视为启发式、可解释的参数,而不是学习或临床优化的值。MultiSense-Pneumo的设计考虑了在标准笔记本电脑级硬件上的离线执行,但并未作为经过部署验证或临床验证的诊断系统呈现。实验结果表明,在合成域偏移下,放射影像路径具有强大的组件级性能,同时也突出了重要的局限性,特别是咳嗽声学的异常类别召回率降低以及缺乏配对的端到端多模态患者评估。因此,MultiSense-Pneumo旨在作为筛查和分诊研究的框架和组件级原型。

英文摘要

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
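The late-fusion step described above can be sketched as a weighted average of per-modality concern signals. This is a minimal illustration under stated assumptions: the abstract says only that the weights are hand-specified and that normalized signals are aggregated, so the weighted-average form, the renormalization over missing modalities, the modality names, and the function name `late_fusion` are all hypothetical.

```python
def late_fusion(signals, weights):
    """Combine per-modality concern signals in [0, 1] into one estimate.

    `signals` maps modality name to a concern value, or None when the
    modality is unavailable. `weights` maps modality name to a
    hand-specified non-negative weight (hypothetical values).
    Weights are renormalized over the modalities that are present.
    """
    available = {name: s for name, s in signals.items() if s is not None}
    if not available:
        raise ValueError("at least one modality must be available")
    total_weight = sum(weights[name] for name in available)
    if total_weight <= 0:
        raise ValueError("available modalities must have positive total weight")
    weighted_sum = sum(weights[name] * s for name, s in available.items())
    return weighted_sum / total_weight


# Hypothetical example: the speech modality is missing for this case.
example_signals = {"symptoms": 0.8, "cough": 0.4, "radiograph": 0.6, "speech": None}
example_weights = {"symptoms": 0.3, "cough": 0.2, "radiograph": 0.4, "speech": 0.1}
screening_estimate = late_fusion(example_signals, example_weights)
```

Because each weight is an explicit number, the contribution of every modality to the final estimate can be read off directly, which matches the interpretability argument in the abstract.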

2605.08146 2026-05-27 cs.CV cs.AI 版本更新

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

VT-Bench:视觉-表格多模态学习的统一基准

Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国) School of Intelligence Science and Technology, Nanjing University, China(智能科学与技术学院,南京大学,中国) School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国)

AI总结 提出首个视觉-表格多模态基准VT-Bench,涵盖9个领域14个数据集,评估23个模型,揭示视觉-表格学习的挑战。

详情
AI中文摘要

多模态学习在视觉-文本任务中引起了广泛关注。然而,在医疗和工业等高风险领域起关键作用的视觉-表格数据仍未得到充分探索。本文介绍了VT-Bench,这是第一个用于标准化视觉-表格判别预测和生成推理任务的统一基准。VT-Bench汇集了9个领域(以医疗为中心,同时涵盖宠物、媒体和交通)的14个数据集,超过756K个样本。我们评估了23个代表性模型,包括单模态专家、专门的视觉-表格模型、通用视觉-语言模型(VLM)和工具增强方法,突出了视觉-表格学习的重大挑战。我们相信VT-Bench将激励社区构建更强大的多模态视觉-表格基础模型。基准:https://github.com/Ziyi-Jia990/VT-Bench

英文摘要

Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal vision-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench

2511.19741 2026-05-27 cs.CV 版本更新

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

通过最小切片传输计划的高效可迁移最优传输

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

发表机构 * Department of Computer Science, Vanderbilt University(范德比大学计算机科学系) Department of Mathematics, Florida State University(佛罗里达州立大学数学系) Department of Biostatistics & Bioinformatics, Duke University(杜克大学生物统计学与生物信息学系) Department of Electrical & Computer Engineering, Vanderbilt University(范德比大学电气与计算机工程系)

AI总结 提出最小切片传输计划(min-STP)框架,研究优化切片器在不同分布对间的可迁移性,并引入小批量公式以提高可扩展性,在点云对齐和流生成建模中实现一次性匹配和摊销训练。

详情
AI中文摘要

最优传输(OT)为寻找分布之间的对应关系以及解决计算机视觉各个领域(包括形状分析、图像生成和多模态任务)中的匹配和对齐问题提供了强大的框架。然而,OT的计算成本阻碍了其可扩展性。基于切片的传输计划最近通过利用一维OT问题的闭式解,在降低计算成本方面显示出前景。这些方法优化一维投影(切片)以获得条件传输计划,该计划最小化环境空间中的传输成本。虽然高效,但这些方法留下了一个问题:学习到的最优切片器是否能够在分布偏移下迁移到新的分布对。理解这种可迁移性对于数据演变或跨密切相关的分布重复进行OT计算的情况至关重要。在本文中,我们研究了最小切片传输计划(min-STP)框架,并探讨了优化切片器的可迁移性:在一个分布对上训练的切片器能否为新的未见对产生有效的传输计划?理论上,我们证明优化后的切片器在数据分布轻微扰动下保持接近,从而能够在相关任务间高效迁移。为了进一步提高可扩展性,我们引入了min-STP的小批量公式,并提供了其准确性的统计保证。实验上,我们证明了可迁移的min-STP实现了强一次性匹配性能,并促进了点云对齐和基于流的生成建模的摊销训练。

英文摘要

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
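The slicing idea can be sketched in a few lines: project both point sets onto a direction, match points by the rank of their projections (the closed-form 1D OT solution for equal-size uniform samples), and score the resulting plan by its cost in the original space. The sketch below assumes 2D points and equal sample sizes, and it substitutes a grid search over angles for the optimization of the slicer described in the abstract. All function names are hypothetical.

```python
import math


def sliced_plan(xs, ys, theta):
    """Match points by the rank of their projections onto direction theta."""
    def project(p):
        return p[0] * theta[0] + p[1] * theta[1]

    order_x = sorted(range(len(xs)), key=lambda i: project(xs[i]))
    order_y = sorted(range(len(ys)), key=lambda j: project(ys[j]))
    # The k-th smallest projection in xs is paired with the k-th smallest in ys.
    return list(zip(order_x, order_y))


def plan_cost(xs, ys, plan):
    """Mean squared Euclidean cost of a plan, measured in the ambient 2D space."""
    total = 0.0
    for i, j in plan:
        total += (xs[i][0] - ys[j][0]) ** 2 + (xs[i][1] - ys[j][1]) ** 2
    return total / len(plan)


def min_sliced_plan(xs, ys, n_angles=180):
    """Pick the direction whose lifted plan has the lowest ambient cost.

    Grid search over angles in [0, pi) stands in for a learned slicer.
    Returns (cost, theta, plan).
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes non-empty, equal-size point sets")
    best = None
    for k in range(n_angles):
        angle = math.pi * k / n_angles
        theta = (math.cos(angle), math.sin(angle))
        plan = sliced_plan(xs, ys, theta)
        cost = plan_cost(xs, ys, plan)
        if best is None or cost < best[0]:
            best = (cost, theta, plan)
    return best
```

Transferability in this picture would mean reusing the `theta` found for one pair of point sets on a nearby pair, so that only the sort-and-match step is repeated.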

2605.18359 2026-05-27 cs.CV 版本更新

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

RAVE: 重新分配大型多模态模型中的视觉注意力

Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang

发表机构 * Qwen Business Unit of Alibaba(阿里巴巴千问事业部) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Beijing Institute of Technology(北京理工大学)

AI总结 针对大型多模态模型中标准注意力机制存在的跨模态误分配和视觉内不平衡问题,提出轻量级成对门控机制RAVE,通过学习查询-键偏置重新分配视觉注意力,在多个多模态基准上平均提升3个百分点,尤其对感知密集型任务效果显著。

详情
AI中文摘要

大型多模态模型(LMMs)继承了预训练语言骨干网络的自注意力机制,但标准注意力可能表现出次优的分配,包括文本和视觉证据之间的跨模态误分配以及视觉令牌之间的视觉内不平衡。我们提出RAVE(重新分配视觉注意力),一种轻量级成对门控机制,它为预softmax注意力分数添加一个学习到的查询-键偏置,该偏置基于预RoPE查询和键特征。RAVE不需要对骨干网络进行架构修改,并且可以与模型的其余部分进行端到端训练。在一系列多模态基准测试中,RAVE比标准注意力平均提升3个百分点,在感知密集型任务(包括多语言OCR、图表理解、文档VQA和场景文本VQA)上提升最大,这些任务中准确的视觉定位至关重要。

英文摘要

Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks -- including multilingual OCR, chart understanding, document VQA, and scene text VQA -- where accurate visual grounding is critical.
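A minimal single-query sketch of adding a bias to pre-softmax attention scores over visual keys only is given below. The abstract does not specify the form of the pair gate, so the gate used here (a learned linear function of the elementwise query-key product, with weights `gate_w`) is purely hypothetical. RoPE is omitted, so the features used for the score and for the bias coincide in this sketch, whereas the abstract derives the bias from pre-RoPE features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(q, keys, is_visual, gate_w):
    """Attention weights for one query with an additive bias on visual keys.

    q: query vector; keys: list of key vectors; is_visual: one bool per key;
    gate_w: hypothetical learned gate weights, same length as q.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = []
    for k, visual in zip(keys, is_visual):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        if visual:
            # Hypothetical pair gate: linear in the elementwise q * k product.
            score += sum(w * qi * ki for w, qi, ki in zip(gate_w, q, k))
        scores.append(score)
    return softmax(scores)
```

With `gate_w` set to zeros the function reduces to standard scaled dot-product attention, which is consistent with the claim that no architectural change to the backbone is needed; text keys are never biased directly, but their weights shift through the softmax normalization.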

2605.05204 2026-05-27 cs.CV 版本更新

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

D-OPSD:用于连续调优步蒸馏扩散模型的在线自蒸馏方法

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Z-Image Team, Alibaba Group(阿里集团Z-Image团队) University of California, San Diego(加州大学圣地亚哥分校) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出D-OPSD,一种在线自蒸馏训练范式,使步蒸馏扩散模型在监督微调中保持少步推理能力,通过让模型同时作为教师和学生,利用不同上下文条件(学生仅文本特征,教师多模态特征)最小化预测分布,学习新概念和风格而不牺牲原有少步能力。

Comments Project Page: https://vvvvvjdy.github.io/d-opsd/

详情
AI中文摘要

高性能图像生成模型的格局目前正在从低效的多步模型转向高效的少步模型(例如,Z-Image-Turbo和FLUX.2-klein)。然而,这些模型对直接连续监督微调提出了重大挑战。例如,应用常用的微调技术会损害其固有的少步推理能力。为了解决这个问题,我们提出了D-OPSD,一种用于步蒸馏扩散模型的新颖训练范式,能够在监督微调期间实现在线策略学习。我们首先发现,以LLM/VLM作为编码器的现代扩散模型可以继承其编码器的上下文能力。这使我们能够将训练形式化为一个在线自蒸馏过程。具体来说,在训练期间,我们让模型在不同上下文中同时充当教师和学生,其中学生仅以文本特征为条件,而教师则以文本提示和目标图像的多模态特征为条件。训练最小化学生自身轨迹上的两个预测分布。通过在模型自己的轨迹上并在其自身监督下进行优化,D-OPSD使模型能够学习新的概念、风格等,而不会牺牲原始的少步能力。

英文摘要

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models, in which an LLM/VLM serves as the encoder, can inherit their encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

2601.15891 2026-05-27 cs.CV 版本更新

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

RadJEPA:基于联合嵌入预测架构的胸部X光放射学编码器

Anas Anwarul Haq Khan, Mariam Husain, Pratik Jalan, Kshitij Jadhav

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Bombay(印度理工学院孟买分校计算机科学与工程系) Department of Biomedical Engineering, Johns Hopkins University(约翰霍普金斯大学生物医学工程系) Koita Centre for Digital Health, Indian Institute of Technology Bombay(印度理工学院孟买分校Koita数字健康中心)

AI总结 提出RadJEPA,一种无需语言监督的自监督框架,通过联合嵌入预测架构在约84万张无标签胸部X光图像上预训练,学习预测掩码区域的潜在表示,在放射学报告生成等任务中达到或超越现有基线。

详情
AI中文摘要

视觉-语言预训练推动了医学图像表示学习的最新进展,但这种范式受限于配对图像-文本数据的可用性以及临床叙述的报告偏差。我们探究是否可以在没有任何语言监督的情况下学习具有竞争力的放射学编码器。我们引入了RadJEPA,这是一个基于联合嵌入预测架构的自监督框架,并在约84万张无标签胸部X光图像上进行了预训练。该模型学习从可见上下文区域预测掩码目标区域的潜在表示,这一目标与图像-文本对比预训练和DINO风格自蒸馏不同,它显式地建模表示空间中的条件结构。我们主要在冻结的Vicuna-7B解码器上进行放射学报告生成评估,并将其编码器替换到四个广泛使用的视觉-语言骨干网络(MedLLaVA、Qwen-2.5、BLIP-2和Phi-4)中。为完整性,我们还报告了疾病分类和语义分割结果。在两个数据集和四个指标上,RadJEPA匹配或超过了最强的纯图像和视觉-语言基线,同时使用ViT-B/14骨干网络和224×224分辨率。

英文摘要

Vision-language pretraining has driven much of the recent progress in medical image representation learning, but this paradigm is constrained by the availability of paired image-text data and by the reporting bias of clinical narratives. We ask whether competitive radiology encoders can be learned without any language supervision. We introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture and pretrained on approximately 840K unlabeled chest X-ray images. The model learns to predict latent representations of masked target regions from a visible context region, an objective that differs from both image-text contrastive pretraining and DINO-style self-distillation by explicitly modelling conditional structure in representation space. We evaluate RadJEPA primarily on radiology report generation with a frozen Vicuna-7B decoder, and additionally substitute its encoder into four widely used vision-language backbones (MedLLaVA, Qwen-2.5, BLIP-2, and Phi-4). For completeness we also report disease classification and semantic segmentation results. Across two datasets and four metrics, RadJEPA matches or exceeds the strongest image-only and vision-language baselines while using a ViT-B/14 backbone at 224 x 224 resolution.
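
The latent-prediction objective described above can be illustrated with a minimal sketch. It assumes the predicted and target latents are already available as per-patch vectors and uses a plain mean squared error; the function name `masked_latent_loss`, the choice of MSE, and the toy inputs are illustrative assumptions, not details taken from the RadJEPA paper.

```python
def masked_latent_loss(predicted, target, target_mask):
    """Mean squared error between predicted and target latents.

    predicted, target: one latent vector (list of floats) per image patch.
    target_mask: True for masked target patches, False for visible context.
    Only masked patches contribute, so the loss lives in representation
    space and never touches pixels. (MSE is an assumed choice of distance.)
    """
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_target in zip(predicted, target, target_mask):
        if not is_target:
            continue  # visible context patches carry no loss
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
        count += len(tgt_vec)
    return total / count if count else 0.0


# Toy example: three patches with 2-dimensional latents; the last two are masked.
toy_loss = masked_latent_loss(
    predicted=[[1.0, 1.0], [0.0, 0.0], [2.0, 2.0]],
    target=[[1.0, 1.0], [5.0, 5.0], [0.0, 2.0]],
    target_mask=[False, True, True],
)
```

In a real pipeline the target latents would come from an encoder applied to the full image and the predictions from a predictor conditioned on the visible context; both networks are left out here.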

2605.15477 2026-05-27 cs.CV 版本更新

EgoExo-WM: Unlocking Exo Video for Ego World Models

EgoExo-WM: 为自我中心世界模型解锁外部视角视频

Danny Tran, Roberto Martín-Martín, Kristen Grauman

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出通过从外部视频提取结构化身体姿态并利用人体运动学先验将其转换为自我视频,从而利用丰富的野外外部数据训练自我世界模型,显著提升预测质量和下游规划性能。

Comments Project Page: https://vision.cs.utexas.edu/projects/EgoExo-WM/

详情
AI中文摘要

自我中心世界模型为智能体预测和规划提供了有前景的方向,但其性能受限于自我中心训练数据的有限性以及人类物理动作的固有部分可观测性。相比之下,外部中心视频丰富且能很好地揭示身体姿态,但缺乏与智能体动作空间的直接对齐,且不是自我中心的。我们提出一种方法,通过从外部中心视频中提取结构化身体姿态作为动作表示,并基于人体运动学先验将外部中心视频转换为自我中心视频,从而弥合这一差距。这一过程使得将野外外部中心数据整合到自我中心世界模型训练中成为可能。我们表明,使用转换后的数据训练全身动作条件自我中心世界模型显著提高了预测质量和下游规划性能,其中我们推断实现视觉目标状态所需的身体姿态序列。我们的方法为利用任意野外视频构建强大的自我中心世界模型铺平了道路,进一步推动了机器人规划和增强现实指导等应用。

英文摘要

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

2605.11651 2026-05-27 cs.CV cs.AI cs.CL 版本更新

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Hide to See: 面向VLM蒸馏中视觉锚定思维的推理前缀掩码

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

发表机构 * KAIST(韩国科学技术院) NVIDIA(英伟达) POSTECH(浦项科技大学)

AI总结 提出一种推理前缀掩码蒸馏框架,通过掩码学生模型的显著推理前缀,迫使其在推理过程中更依赖视觉证据,从而缓解长推理轨迹中的视觉遗忘问题,提升多模态推理性能。

Comments Pre-print

详情
AI中文摘要

近期VLM中的思考-回答方法(如Qwen3-VL-Thinking)通过在最终答案前利用中间推理步骤来提升推理性能,但其计算成本显著增加,尤其是对于较大的VLM。为了将这种能力蒸馏到紧凑的思考-回答VLM中,一个主要目标是提高学生在整个推理轨迹中利用视觉证据的能力,因为长思考-回答轨迹存在视觉遗忘问题。为此,我们引入了一种新颖的思考-回答蒸馏框架,通过掩码学生模型的显著推理前缀,鼓励学生将思考锚定在视觉信息上。为了补偿这种被掩码的文本线索,学生在蒸馏过程中被鼓励更多地依赖视觉证据作为替代信息源。我们的掩码策略包括:1)逐token的显著推理前缀掩码,针对每个下一token预测选择性掩码高影响力的推理前缀;2)自调节掩码预算调度,根据教师-学生分布之间的差异(即蒸馏难度)逐渐增加掩码规模。在蒸馏阶段,学生模型由我们的显著推理前缀掩码引导,该掩码同时阻塞未来token和显著推理线索,替代了自回归语言建模中使用的标准因果掩码。实验结果表明,我们的方法在多模态推理基准上优于最近的开源VLM、VLM蒸馏和自蒸馏方法,进一步分析证实了学生思考过程中视觉利用的增强。

英文摘要

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher-student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
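
To make the idea of a mask that blocks both future tokens and salient reasoning cues more concrete, here is a minimal sketch. It assumes a precomputed saliency matrix and a fixed per-query budget; the function name `salient_prefix_mask`, the way saliency is supplied, and the top-k selection rule are hypothetical stand-ins, since the abstract does not specify how influence is measured or how the budget is scheduled.

```python
def salient_prefix_mask(saliency, reasoning_start, budget):
    """Return allowed[i][j]: True if query position i may attend to key j.

    saliency[i][j]: assumed influence score of token j on the prediction
    made at position i (how it is computed is left open here).
    reasoning_start: index of the first reasoning token; earlier tokens
    (for example image and prompt tokens) are never hidden.
    budget: maximum number of reasoning-prefix tokens hidden per query.
    """
    n = len(saliency)
    # Start from the standard causal mask: no attention to future tokens.
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        # Candidates are reasoning tokens strictly before the query position.
        candidates = sorted(
            range(reasoning_start, i),
            key=lambda j: saliency[i][j],
            reverse=True,
        )
        # Hide the most salient reasoning-prefix tokens for this query.
        for j in candidates[:budget]:
            allowed[i][j] = False
    return allowed


# Toy example: token 0 is a visual/prompt token, tokens 1..3 are reasoning tokens.
toy_saliency = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.2, 0.7, 0.0],
]
toy_mask = salient_prefix_mask(toy_saliency, reasoning_start=1, budget=1)
```

Because the visual and prompt tokens before `reasoning_start` always stay visible, a query that loses its most influential textual cue can still attend to the image, which is the behavior the abstract describes.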

2605.14799 2026-05-27 cs.CV cs.CR cs.SI 版本更新

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

视觉Mamba能否提升AI生成图像检测?一项深入研究

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

发表机构 * Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France(IEMN实验室,法国国家科学研究中心,里尔中央理工学院,UMR 8520,上法兰西理工大学) Khalifa University(哈利法大学) School of Communication and Information Engineering, Shanghai University(上海大学通信与信息工程学院) Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi(索邦人工智能中心,索邦大学阿布扎比分校)

AI总结 本研究系统评估了Vision Mamba模型在AI生成图像检测中的性能,与CNN、ViT和VLM检测器进行对比,分析了准确性、效率和泛化能力。

详情
AI中文摘要

近年来,计算机视觉取得了显著进展,这得益于卷积神经网络(CNN)、生成对抗网络(GAN)、扩散架构、视觉Transformer(ViT)以及最近的视觉-语言模型(VLM)等创新架构的发展。这一进展无疑有助于创造越来越逼真和多样化的视觉内容。然而,图像生成的这些进步也引发了对错误信息、身份盗窃以及隐私和安全威胁等潜在滥用的担忧。与此同时,基于Mamba的架构已成为这一快速发展的领域中一系列图像分析任务(包括分类、分割、医学成像、目标检测和图像恢复)的多功能工具。然而,与已有技术相比,它们在识别AI生成图像方面的潜力仍相对未被探索。本研究提供了用于AI生成图像检测的Vision Mamba模型的系统评估和比较分析。我们在多样化的数据集和合成图像源上,将多个Vision Mamba变体与代表性的CNN、ViT和基于VLM的检测器进行基准测试,重点关注准确性、效率以及跨不同图像类型和生成模型的泛化能力等关键指标。通过这一全面分析,我们旨在阐明Vision Mamba相对于已有方法在检测AI生成图像方面的适用性、准确性和效率上的优势与局限性。总体而言,我们的研究结果突显了Vision Mamba作为区分真实与AI生成视觉内容的系统组件的潜力和当前局限性。这项研究对于在区分真实与AI生成内容成为重大挑战的时代提升检测能力至关重要。

英文摘要

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

2605.14664 2026-05-27 cs.CV 版本更新

MiVE: Multiscale Vision-language features for reference-guided video Editing

MiVE:用于参考引导视频编辑的多尺度视觉语言特征

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

发表机构 * MT Lab, Meitu Inc., Beijing 100083, China(美图实验室,美图公司,北京100083,中国) Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China(计算机科学与技术系,BNRist,IDG/麦戈文脑研究学院,清华大学,北京100084,中国) Beijing University of Posts(北京邮电大学)

AI总结 提出MiVE框架,利用VLM的多尺度层次特征(早期层保留空间细节,深层编码全局语义)统一到自注意力扩散Transformer中,解决模态间隙和细粒度信息丢失问题,在参考引导视频编辑中达到SOTA性能。

Comments ICML 2026

详情
AI中文摘要

参考引导视频编辑以源视频、文本指令和参考图像作为输入,要求模型在忠实执行指令编辑的同时保留原始运动及未编辑内容。现有方法分为两种范式,各有固有限制:解耦编码器在处理指令和视觉内容时存在模态间隙,而统一视觉语言编码器仅依赖最终层表示,丢失了细粒度空间细节。我们观察到VLM层层次化地编码互补信息——早期层捕获局部空间细节,对精确编辑至关重要;深层编码全局语义,用于指令理解。基于此洞察,我们提出MiVE(用于参考引导视频编辑的多尺度视觉语言特征),该框架将VLM重新用作多尺度特征提取器。MiVE从Qwen3-VL提取层次特征,并将其集成到统一的自注意力扩散Transformer中,消除了交叉注意力设计中固有的模态不匹配。实验表明,MiVE在人类偏好中排名最高,性能优于学术方法和商业系统,达到了最先进水平。

英文摘要

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

2605.13455 2026-05-27 cs.CV 版本更新

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

使用联合泊松反卷积和微分同胚配准的贝叶斯体内突触追踪

Shashwat Kumar, Dominic M. Padova, Binish Narang, Gabrielle I. Coste, Austin R. Graves, Richard L. Huganir, Adam S. Charles, Michael I. Miller, Anuj Srivastava

发表机构 * Department of Biomedical Engineering, Johns Hopkins University(约翰霍普金斯大学生物医学工程系) Department of Neuroscience, Johns Hopkins University(约翰霍普金斯大学神经科学系) Kavli Neuroscience Discovery Institute, Johns Hopkins University(约翰霍普金斯大学Kavli神经科学发现研究所) Data Science and AI Institute, Johns Hopkins University(约翰霍普金斯大学数据科学与人工智能研究所) Department of Applied Mathematics and Statistics, Johns Hopkins University(约翰霍普金斯大学应用数学与统计学系)

AI总结 提出一种基于模板的贝叶斯框架,通过联合泊松反卷积和微分同胚配准,同时实现突触检测、去噪、荧光强度推断、组织运动校正和置信区间估计,用于低信噪比体内显微镜数据中的突触追踪。

详情
AI中文摘要

突触是密集排列的亚微米结构,在学习和记忆形成过程中动态重组。纵向体内成像荧光标记的突触受体为研究大规模突触动力学以及这些过程在神经疾病中如何被破坏提供了有希望的机会。然而,使用双光子显微镜的体内成像采用低激光功率,因此受到低信噪比和高散粒噪声、天与天之间的非线性组织运动、突触荧光的非平稳波动以及显微镜点扩散函数引起的显著模糊的影响。这些因素共同使得检测和追踪突触变得具有挑战性,尤其是在突触密度高的区域。本文提出了一种新颖的基于模板的框架,将突触建模为在非线性组织变形下移动的可变亮度点源。采用统一的贝叶斯方法,我们通过推导一个后验分布来将该模型应用于显微镜数据,该后验分布包含用于域扭曲的微分同胚映射、用于成像过程的高斯点扩散函数以及用于原始光子计数的泊松观测模型。贝叶斯解决方案同时:(1) 构建突触位置的概率模板,(2) 对图像数据进行去噪和反卷积,(3) 推断荧光强度,(4) 执行微分同胚图像配准以校正组织运动,以及(5) 为这些参数估计提供置信区域。我们在一个2D+t模拟数据集和一个在小鼠两周内成像的荧光突触的3D+t纵向体内显微镜数据集上展示了该框架。

英文摘要

Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
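
The observation model named above (point sources blurred by a Gaussian PSF and observed through Poisson photon counts) can be sketched in a few lines. This is a simplified 2D illustration under stated assumptions: the diffeomorphic warp is omitted, the PSF is isotropic, and the function names, the constant background term, and the toy parameters are hypothetical rather than taken from the paper.

```python
import math


def render_rate(sources, shape, sigma, background=0.1):
    """Expected photon count per pixel: background plus blurred point sources.

    sources: list of (row, col, intensity) tuples (toy synapses).
    sigma: standard deviation of the isotropic Gaussian PSF, in pixels.
    background: assumed constant offset that keeps every rate positive.
    """
    rows, cols = shape
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    rate = [[background for _ in range(cols)] for _ in range(rows)]
    for r0, c0, intensity in sources:
        for r in range(rows):
            for c in range(cols):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                rate[r][c] += intensity * norm * math.exp(-d2 / (2.0 * sigma ** 2))
    return rate


def poisson_log_likelihood(counts, rate):
    """Sum over pixels of log Poisson(count | rate)."""
    total = 0.0
    for count_row, rate_row in zip(counts, rate):
        for k, lam in zip(count_row, rate_row):
            total += k * math.log(lam) - lam - math.lgamma(k + 1)
    return total


# Toy check: counts simulated at the true location score higher there
# than under a hypothesis with the source shifted by two pixels.
true_rate = render_rate([(3, 3, 200.0)], (7, 7), sigma=1.0)
counts = [[round(v) for v in row] for row in true_rate]
shifted_rate = render_rate([(3, 5, 200.0)], (7, 7), sigma=1.0)
```

A full Bayesian treatment would place priors on locations, intensities, and the deformation and would explore the posterior; the log-likelihood above is only the data term of such a posterior.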

2604.22546 2026-05-27 cs.CV 版本更新

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

ReLIC-SGG: 开放词汇场景图生成的关系格补全

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

发表机构 * Amirkabir University of Technology(阿米尔卡比尔理工大学)

AI总结 针对开放词汇场景图生成中标注不完整导致大量有效关系被误判为负例的问题,提出ReLIC-SGG框架,通过构建语义关系格建模谓词间的相似、蕴含和矛盾关系,将未标注关系视为潜在变量而非确定负例,结合视觉-语言兼容性、图上下文和语义一致性推断缺失正关系,并采用正-无标记图学习减少假负例监督,格引导解码生成紧凑且语义一致的场景图。

Comments Some errors in the experimental sections

详情
AI中文摘要

开放词汇场景图生成(SGG)旨在用超越固定谓词集的灵活关系短语描述视觉场景。现有方法通常将标注的三元组视为正例,所有未标注的对象-对关系视为负例。然而,场景图标注本质上是不完整的:许多有效关系缺失,且同一交互可以以不同粒度描述,例如 on、standing on、resting on 和 supported by。由于开放词汇SGG的关系空间更大,这一问题变得更加严重。我们提出ReLIC-SGG,一种关系不完整性感知框架,将未标注关系视为潜在变量而非确定负例。ReLIC-SGG构建语义关系格来建模开放词汇谓词间的相似性、蕴含和矛盾关系,并利用它从视觉-语言兼容性、图上下文和语义一致性中推断缺失的正关系。正-无标记图学习目标进一步减少假负例监督,而格引导解码生成紧凑且语义一致的场景图。在常规、开放词汇和全景SGG基准上的实验表明,ReLIC-SGG改进了稀有和未见谓词的识别,并更好地恢复了缺失关系。

英文摘要

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., "on", "standing on", "resting on", and "supported by". This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
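
The abstract does not say which positive-unlabeled objective ReLIC-SGG uses. As a hedged illustration of what it means to treat unannotated relations as unlabeled rather than negative, the sketch below uses the well-known non-negative PU risk estimator with a sigmoid surrogate loss as a stand-in. The function names, the class prior `prior`, and the toy scores are assumptions for illustration only.

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss for a label in {+1, -1}; small when score agrees with label."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk (stand-in objective, not the paper's formulation).

    pos_scores: classifier scores for annotated (positive) relations.
    unl_scores: scores for unannotated relations, treated as unlabeled.
    prior: assumed fraction of true positives among all candidate relations.
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Estimated risk on true negatives: unlabeled risk minus the share
    # contributed by hidden positives, clamped at zero.
    neg_risk = max(0.0, unl_as_neg - prior * pos_as_neg)
    return prior * pos_as_pos + neg_risk


# Toy example: an uninformative scorer (all scores zero) with prior 0.5.
toy_risk = nn_pu_risk(pos_scores=[0.0], unl_scores=[0.0], prior=0.5)
```

Compared with treating every unannotated pair as a negative, the correction term lowers the penalty for scoring an unlabeled relation highly, which is the general mechanism by which a PU objective reduces false-negative supervision.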

2605.12271 2026-05-27 cs.CV 版本更新

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

超越文本提示:视觉到视觉生成作为统一范式

Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan

发表机构 * City University of Hong Kong(香港城市大学) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) The Hong Kong University of Science and Technology(香港科技大学) The Chinese University of Hong Kong(香港中文大学) Celia Research HK(Celia研究香港) Great Bay University(大湾区大学) Lingnan University(岭南大学)

AI总结 提出视觉到视觉(V2V)生成范式及无需训练的V2V-Zero框架,通过利用视觉页面隐藏状态替代文本条件,在多个任务上达到或接近优化后的文本到图像性能。

Comments Project Page: https://yaofang-liu.github.io/V2V_Web

详情
AI中文摘要

人类通常通过视觉制品(如排版表、草图、参考图像和标注场景)来指定和创作。然而,现代视觉生成器仍然要求用户将这种意图序列化为文本,这一瓶颈压缩了空间结构、精确外观和字形形状等信号。我们提出视觉到视觉(V2V)生成,其中用户使用视觉规范页面(而非文本提示)来条件化生成模型。该页面不是编辑目标,而是指定所需输出的视觉文档。我们引入V2V-Zero,一个无需训练的框架,通过用从视觉页面提取的最终层隐藏状态替换纯文本条件,在现有的视觉语言模型(VLM)条件化生成器中暴露此接口,利用了冻结的VLM已将文本和图像映射到生成器条件空间的事实。在GenEval上,V2V-Zero使用冻结的Qwen-Image骨干网络达到0.85,接近其优化后的文本到图像性能而无需微调。为评估更广泛的V2V空间,我们引入Simple-V2V Bench,涵盖七个视觉条件化任务和七个模型,包括GPT Image 2、Nano Banana 2、Seedream 5.0 Lite、开源权重基线和视频扩展。V2V-Zero得分为32.7/100,优于评估的开源图像基线,并揭示了清晰的能力层次:属性绑定强,内容生成不可靠,结构控制即使对商业系统也困难。HunyuanVideo-1.5扩展得分为20.2/100,表明该接口可迁移到图像之外。机制分析显示默认推理路径主要通过视觉路由,95.0%的条件化token注意力集中在视觉页面隐藏状态上。

英文摘要

Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.

2605.11867 2026-05-27 cs.CV 版本更新

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

当大脑存在分歧:生物模糊性是结构MRI合成淀粉样蛋白PET挑战的基础

Louise E. G. Baron, Ross Callaghan, David M. Cash, Philip S. J. Weston, Hojjat Azadbakht, Hui Zhang

发表机构 * Hawkes Institute, University College London, UK(霍克斯研究所,伦敦大学学院,英国) Department of Medical Physics and Biomedical Engineering, University College London, UK(医学物理与生物医学工程系,伦敦大学学院,英国) AINOSTICS Ltd, Manchester, UK(AINOSTICS有限公司,曼彻斯特,英国) Dementia Research Centre, UCL Queen Square Institute of Neurology, University College London, UK(痴呆症研究中心,伦敦大学学院女王广场神经病学研究所,英国) UK Dementia Research Institute, London, UK(英国痴呆症研究所,伦敦,英国) Department of Computer Science, University College London, UK(计算机科学系,伦敦大学学院,英国)

AI总结 通过控制实验证明,结构MRI到淀粉样蛋白PET合成性能受限的根本原因是生物模糊性(MRI与PET测量时间解耦的病理过程),而非模型架构能力,并表明引入血浆生物标志物等多模态信息可解决该问题。

Comments MICCAI 2026 accepted paper (no rebuttal)

详情
AI中文摘要

结构MRI到淀粉样蛋白PET合成已被提出作为阿尔茨海默病(AD)中淀粉样蛋白评估的非侵入性替代方法。然而,相同模型的报告性能在不同研究中差异很大,且日益复杂的架构并未带来一致的提升。这种不一致性被认为是由基本的生物模糊性引起的:MRI捕捉神经退行性变,而PET测量淀粉样蛋白病理——这两个过程在AD中常常在时间上解耦。因此,相似的MRI模式可能对应不同的淀粉样蛋白状态,产生模糊的一对多映射。因此,MRI到淀粉样蛋白PET合成可能本质上是病态的;然而,这一想法尚未得到科学验证。本工作的目的是通过两个控制实验来检验这一假设。我们首先通过根据淀粉样蛋白和神经退行性变状态对配对的MRI-PET数据进行分层来控制训练分布。在控制设计下使用两种标准合成模型,我们表明生物学上明确的映射可以单独学习,但当引入数据模糊性时性能崩溃。这表明数据分布中的模糊性(而非架构容量)限制了性能。其次,我们表明引入血浆生物标志物形式的正交生物学信息可以解决这种模糊性。当整合多模态输入时,性能提高且稳定性恢复。总之,这些发现表明MRI到淀粉样蛋白PET合成中有限且不一致的性能是由内在的生物模糊性解释的,稳定、有意义的进展需要多模态整合而非架构复杂性。

英文摘要

Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.

2604.22274 2026-05-27 cs.CV 版本更新

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

CAGE-SGG:用于开放词汇场景图生成的反事实主动图证据

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

发表机构 * Institute of Intelligent Vision and Embodied Cognition(智能视觉与具身认知研究院)

AI总结 提出基于反事实关系验证的开放词汇场景图生成框架,通过分解谓词为软证据基并使用反事实验证器确保关系有视觉证据支持,从而提升可靠性、可解释性和泛化能力。

Comments This manuscript has been withdrawn by the authors because we found a methodological flaw in the formulation and evaluation of the proposed approach. The issue affects the reliability of the experimental results and the conclusions drawn from them. Therefore, the authors consider the current version unsuitable for citation or further use

详情
AI中文摘要

开放词汇场景图生成(SGG)旨在用超出固定谓词词汇表的灵活且细粒度的关系短语描述视觉场景。虽然最近的视觉语言模型极大地扩展了SGG的语义覆盖范围,但它们也引入了一个关键的可信性问题:预测的关系可能由语言先验或对象共现驱动,而非基于视觉证据。在本文中,我们提出了一种基于反事实关系验证、以证据为基础的开放词汇SGG框架。我们的方法不是直接接受合理的关系提议,而是验证每个候选关系是否得到关系特定的视觉、几何和上下文证据的支持。具体来说,我们首先使用视觉语言提议器生成开放词汇关系候选,然后将谓词短语分解为软证据基,如支撑、接触、包含、深度和状态。关系条件证据编码器提取谓词相关线索,而反事实验证器测试当必要证据被移除时关系分数是否下降,并在无关扰动下保持稳定。我们进一步引入矛盾感知谓词学习和图级偏好优化,以改进细粒度区分和全局图一致性。在常规、开放词汇和全景SGG基准上的实验表明,我们的方法一致地改进了标准召回率指标、未见谓词泛化和反事实基础质量。这些结果表明,从关系生成转向关系验证可产生更可靠、可解释且基于证据的场景图。

英文摘要

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.

2511.15572 2026-05-27 cs.CV 版本更新

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

从逐图像低秩到编码不匹配:重新思考视觉Transformer中的特征蒸馏

Huiyuan Tian, Bonan Xu, Shijian Li

发表机构 * Zhejiang University(浙江大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文通过发现编码不匹配现象,提出Lift或WideLast两种简单修复方法,显著提升视觉Transformer特征蒸馏在压缩场景下的性能。

Comments 22 pages, 22 figures. Accepted at the ICML 2026

详情
AI中文摘要

特征图知识蒸馏(KD)在规模相当的视觉Transformer(ViT)之间能很好地传递内部表示,但在压缩场景下常常失败。我们重新审视这一失败并揭示了一个悖论。逐样本SVD表明每个图像高度可压缩,这似乎暗示一个带有线性投影器的窄学生网络“原则上”应该匹配教师网络。然而,数据集层面的视图与这一直觉相矛盾:PCA表明教师网络是低秩子空间的并集,且不同输入间存在显著的子空间旋转。我们进一步引入token级别的频谱能量模式(SEP),发现一个架构无关的编码定律:即使token存在于低秩子空间中,它们也会在通道模式上广泛分布能量,造成带宽不匹配。我们将这一组合现象称为编码不匹配。我们提出两种最小修复方法:Lift或WideLast。(i)Lift在推理时保留一个轻量级的提升投影器以提供更宽的通道,或(ii)WideLast仅加宽学生网络的最后一个块,实现输入依赖的扩展。在ImageNet-1K上,这些修复方法复兴了ViT压缩的特征KD,将从CaiT-S24蒸馏的DeiT-Tiny的top-1准确率从74.86%提升至77.53%/78.23%,并且也增强了未经蒸馏训练的学生网络。我们的分析阐明了特征图KD何时以及为何失败,以及如何修复。代码和原始数据见https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch。

英文摘要

Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in a low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies, Lift and WideLast: (i) Lift retains a lightweight lifting projector at inference to provide wider channels, and (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and how to fix it. Code and raw data are provided at https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch.

2605.04635 2026-05-27 cs.CV 版本更新

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

UniPCB: 一种用于PCB缺陷检测的生成辅助检测框架

Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

发表机构 * School of Information Engineering, Guangdong University of Technology(广东工业大学信息工程学院) School of Computer Science, Wuhan University(武汉大学计算机学院) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院)

AI总结 提出UniPCB框架,通过多模态条件生成器合成缺陷样本以增强数据,并设计倒残差移位注意力与跨级互补融合模块提升检测性能,在DsPCBSD+上实现98.0% mAP@0.5。

详情
AI中文摘要

在工业物联网(IIoT)中,实现智能、实时的印刷电路板(PCB)缺陷检测对于确保产品可靠性至关重要。然而,现有的基于IIoT的视觉检测系统面临两个相互叠加的挑战:稀缺且不平衡的缺陷样本限制了模型训练,以及在复杂电路背景下特征表示不足。现有的生成方法依赖具有粗略结构控制的单模态条件,而检测方法则改进架构但未解决数据瓶颈。为了共同解决这两个挑战,我们提出了一种生成辅助的PCB缺陷检测框架,该框架在IIoT支持的流水线中集成了受控缺陷合成与任务特定缺陷检测。在生成侧,多模态条件生成器并行提取互补的边缘、深度和文本条件。然后,ScaleEncoder将这些条件嵌入到扩散U-Net的四个分辨率中,条件调制在每个尺度上应用FiLM风格的空间自适应调制,实现结构对齐和缺陷感知的样本合成,以增强稀缺的IIoT数据集。在检测侧,倒残差移位注意力将自注意力与移位卷积相结合,以共同捕获全局上下文和局部纹理,跨级互补融合块生成像素级门控用于选择性跨级特征融合。合成的样本直接丰富检测训练集,使得生成的改进与检测的改进相互叠加。在DsPCBSD+上的大量实验表明,UniPCB在缺陷检测上达到mAP@0.5为98.0%、mAP@0.5:0.95为61.8%,超越了所有对比方法,同时生成分支的FID为129.61、SSIM为0.619,优于现有的条件生成方法。

英文摘要

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
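
The FiLM-style spatially-adaptive modulation mentioned above amounts to a per-location scale and shift of the feature map. The sketch below shows only that arithmetic on nested lists; in a real model the `gamma` and `beta` maps would be predicted from the condition embeddings at each scale, and the function name and toy values here are illustrative assumptions rather than the UniPCB implementation.

```python
def film_modulate(features, gamma, beta):
    """Spatially-adaptive FiLM: out[c][y][x] = gamma[c][y][x] * features[c][y][x] + beta[c][y][x].

    features, gamma, beta: nested lists with identical shape
    (channels x height x width). gamma and beta are assumed to be
    produced elsewhere from the edge, depth, and text conditions.
    """
    return [
        [
            [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
            for f_row, g_row, b_row in zip(f_map, g_map, b_map)
        ]
        for f_map, g_map, b_map in zip(features, gamma, beta)
    ]


# Toy example: one channel, 2 x 2 feature map.
toy_out = film_modulate(
    features=[[[1.0, 2.0], [3.0, 4.0]]],
    gamma=[[[1.0, 0.0], [2.0, 1.0]]],
    beta=[[[0.0, 5.0], [1.0, -1.0]]],
)
```

Because the scale and shift vary per pixel rather than per channel, the conditioning signal can emphasize or suppress features at specific board locations, which is what lets structural cues steer where a synthesized defect appears.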

2603.12647 2026-05-27 cs.CV cs.AI 版本更新

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

LR-SGS:用于自动驾驶场景重建的鲁棒激光雷达反射率引导显著高斯泼溅

ZY Chen, F Zhu, H Zhu, DY Kong, XK Kuang, YJ Zhang, CM Jiang

发表机构 * Waymo Open Dataset(Waymo开放数据集)

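AI summary: Proposes a salient Gaussian representation that combines LiDAR reflectance with RGB, using structure-aware initialization, reflectance calibration, and joint alignment for efficient, robust self-driving scene reconstruction.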

Comments 8 pages, 7 figures

详情
AI中文摘要

最近的3D高斯泼溅(3DGS)方法已证明了自动驾驶场景重建和新视角合成的可行性。然而,现有方法大多仅依赖相机,或仅将激光雷达用于高斯初始化或深度监督,而点云中包含的丰富场景信息(如反射率)以及激光雷达与RGB之间的互补性尚未被充分利用,导致在具有高自运动和复杂光照等挑战性自动驾驶场景中性能下降。为解决这些问题,我们提出了一种鲁棒且高效的激光雷达反射率引导显著高斯泼溅方法(LR-SGS),用于自动驾驶场景。该方法引入了一种结构感知的显著高斯表示,该表示从激光雷达提取的几何和反射率特征点初始化,并通过显著变换和改进的密度控制来捕捉边缘和平面结构。此外,我们将激光雷达强度校准为反射率,并将其作为光照不变的材料通道附加到每个高斯上,与RGB联合对齐以强制边界一致性。在Waymo Open数据集上的大量实验表明,LR-SGS以更少的高斯和更短的训练时间实现了优越的重建性能。特别是在复杂光照场景下,我们的方法在PSNR上超过OmniRe 1.18 dB。

英文摘要

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
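For readers calibrating the 1.18 dB figure, PSNR is 10 * log10(MAX^2 / MSE), so a fixed dB gain corresponds to a fixed multiplicative drop in mean squared error. The sketch below assumes pixel values in [0, 1] and flat lists of equal length; the function names are illustrative and are not taken from the LR-SGS code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equally sized flat lists of pixel values."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 10.0 ** (-gain_db / 10.0)
```

Under these assumptions, `mse_ratio_for_gain(1.18)` is about 0.76, so a 1.18 dB gain corresponds to an MSE roughly 24% lower.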

2604.27604 2026-05-27 cs.CV cs.CE 版本更新

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

解码科学实验图像:用于感知、理解和推理的SPUR基准

Junpeng Ding, Zichen Tang, Haihong E, Mengyuan Ji, Yang Liu, Haolin Tian, Haiyang Sun, Pengqi Sun, Yang Xu, Yichen Liu, Haocheng Gao, Zijie Xi, Ruomeng Jiang, Peizhi Zhao, Rongjin Li, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Jintong Chen, Siying Lin

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

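AI summary: Introduces the SPUR benchmark, whose 4,264 QA pairs evaluate multimodal large language models on fine-grained perception, cross-panel relation understanding, and expert-level reasoning over scientific experimental images, exposing the gap between current models and expert level.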

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

我们引入了SPUR,一个全面的科学实验图像感知、理解和推理基准,包含来自1084张专家精选图像的4264个问答对。SPUR具有三个关键创新:(1)面板级细粒度感知:评估多模态大语言模型(MLLMs)在六个细粒度面板类型上的三个维度(数值、形态和信息定位)的视觉感知能力;(2)跨面板关系理解:利用平均每样本14.3个面板的复杂图像评估MLLMs解读复杂跨面板关系的能力;(3)专家级推理:跨五个实验范式评估定性和定量推理,以确定模型是否能像人类专家一样从证据中推断结论。对20个MLLMs和四种多模态思维链(MCoT)方法的全面评估表明,当前模型远未达到科学图像解释的专家级要求,凸显了人工智能科学(AI4S)研究的关键瓶颈。

英文摘要

We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.

2604.24764 2026-05-27 cs.CV 版本更新

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

World-R1:通过强化学习为文本到视频生成注入3D约束

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

发表机构 * Zhejiang University(浙江大学) Microsoft Research(微软研究院) Independent Researcher(独立研究者)

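AI summary: Proposes World-R1, a framework that uses reinforcement learning (Flow-GRPO) with feedback from 3D foundation models and vision-language models to improve the 3D consistency of video generation without modifying the architecture, plus a periodic decoupled training strategy that balances rigid geometric consistency against dynamic scene fluidity.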

Comments ICML 2026, Project Page: https://aka.ms/world-r1, Code: https://github.com/microsoft/World-R1

详情
AI中文摘要

最近的视频基础模型展示了令人印象深刻的视觉合成能力,但经常遭受几何不一致性的困扰。现有方法尝试通过架构修改注入3D先验,但往往导致高计算成本并限制可扩展性。我们提出World-R1,一个通过强化学习将视频生成与3D约束对齐的框架。为促进这种对齐,我们引入了一个专门为世界模拟定制的纯文本数据集。利用Flow-GRPO,我们使用预训练的3D基础模型和视觉语言模型的反馈来优化模型,在不改变底层架构的情况下强制执行结构一致性。我们进一步采用周期解耦训练策略来平衡刚体几何一致性与动态场景流畅性。大量评估表明,我们的方法显著增强了3D一致性,同时保留了基础模型的原始视觉质量,有效弥合了视频生成与可扩展世界模拟之间的差距。

英文摘要

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
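The group-relative advantage at the core of GRPO-style training can be sketched as follows: several videos are sampled for one prompt, each receives a scalar reward, and every reward is normalized against the group's mean and standard deviation. This is a minimal sketch under assumptions. How World-R1 combines the 3D foundation model and vision-language model feedback into a scalar reward is not specified in the abstract, so rewards are simply passed in, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one prompt's sample group to zero mean, unit scale.

    rewards: list of scalar rewards, one per sampled generation.
    eps guards against division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above their group's average get a positive advantage and are reinforced; when every sample in the group scores the same, all advantages are zero and the group contributes no gradient signal.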

2401.07669 2026-05-27 cs.CV 版本更新

SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

SRL-CLIP: 通过结构化语义角色标签实现高效的CLIP视频适配

Darshan Singh, Zeeshan Khan, Makarand Tapaswi

Affiliations: CVIT, IIIT Hyderabad; Inria, École normale supérieure, CNRS, PSL Research University

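AI summary: Proposes SRL-CLIP, which generates rule-based captions from structured Semantic Role Labels (SRLs) and adapts CLIP to general video understanding with contrastive finetuning on only 23k video-caption pairs; on zero-shot text-to-video retrieval it matches or exceeds models with 4-8x more parameters that were post-pretrained on up to 6000x more data.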

Comments Accepted to the CV4Smalls Workshop at CVPR 2026

详情
AI中文摘要

将CLIP适配到视频领域因其语义丰富表示而日益流行。虽然CLIP是一个良好的起点,但它通常需要在大型视频叙述或字幕数据集(如HowTo100M、WebVid2.5M)上进行后预训练(对比微调)。然而,此类叙述或字幕往往缺乏全面信息来整体表示视频。由于文本的学习信号稀疏,视觉学习效率低下,适配需要数百万样本进行后预训练。在这项工作中,我们提出疑问:是否可能高效地将CLIP适配到通用和整体的视频理解?我们使用带有结构化和密集语义角色标签(SRL)的视频,这些标签以结构化格式捕获动作、人物或物体、属性、副词(方式)和位置,从而整体表示整个视频。我们从SRL生成基于规则的字幕,并证明仅对23k视频-字幕对进行简单的对比微调就足以学习强大的、可迁移的表示,适用于需要不同感知粒度水平的多种视频理解任务。我们的适配CLIP模型SRL-CLIP在零样本文本-视频检索上展现出与最先进模型相当或更优的性能,而这些模型拥有4-8倍更多的参数,并在多达6000倍更多的数据上进行了后预训练。SRL-CLIP在多个视频基准上超越了CLIP,突显了高效学习和改进的表示能力。

英文摘要

Adapting CLIP for videos has gained popularity due to its semantic and rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.
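The contrastive finetuning mentioned above is commonly implemented as a symmetric cross-entropy over a video-caption similarity matrix, as in CLIP. The sketch below assumes that formulation, with matched pairs on the diagonal; the temperature of 0.07 and the function name are illustrative choices, not values reported for SRL-CLIP.

```python
import math


def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style loss for a square similarity matrix.

    sim[i][j] is the cosine similarity of video i and caption j; the matched
    pair for row i is assumed to sit on the diagonal (column i).
    """
    n = len(sim)

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [v / temperature for v in row]
            top = max(logits)  # subtract the max for numerical stability
            log_z = top + math.log(sum(math.exp(l - top) for l in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))
```

A batch whose similarities carry no information (all entries equal) yields a loss of log(n), while a batch with well-separated matched pairs drives the loss toward zero.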

2604.22774 2026-05-27 cs.CY cs.AI cs.CV cs.LG 版本更新

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

当VLM“修正”学生:多行手写数学OCR评估中的过度修正识别与惩罚

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

发表机构 * Electronics and Telecommunications Research Institute(电子通信研究所)

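AI summary: Targets VLM over-correction in the evaluation of multi-line handwritten math OCR and proposes PINK, an LLM-based semantic metric that explicitly penalizes over-correction; in a human expert study it aligns with human judgment better than BLEU.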

详情
AI中文摘要

手写数学的准确转录对于教育AI系统至关重要,但当前基准未能正确评估这一能力。大多数先前研究关注单行表达式,并依赖BLEU等词汇指标,无法评估跨多行学生解决方案的语义推理。本文首次系统研究多行手写数学光学字符识别(OCR),揭示了视觉语言模型(VLM)的一个关键失败模式:过度修正。这些模型往往“修正”错误,而非忠实地转录学生作品,从而隐藏了教育评估旨在检测的错误。为解决此问题,我们提出PINK(基于惩罚的INK分数),一种语义评估指标,利用大语言模型(LLM)进行基于评分标准的评分,并明确惩罚过度修正。我们在FERMAT数据集上对15个最先进的VLM进行全面评估,发现与BLEU相比出现显著的排名反转:GPT-4o等模型因激进的过度修正受到严重惩罚,而Gemini 2.5 Flash成为最忠实的转录者。此外,人类专家研究表明,PINK与人类判断的一致性显著更高(55.0%偏好,而BLEU为39.5%),为教育场景中的手写数学OCR提供了更可靠的评估框架。

英文摘要

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

2604.19673 2026-05-27 cs.CV 版本更新

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

InHabit: 利用图像基础模型实现可扩展的3D人体放置

Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

Affiliations: Bosch Center of Artificial Intelligence; Max Planck Institute for Informatics

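AI summary: Proposes InHabit, a render-generate-lift pipeline that draws on the knowledge in 2D foundation models to automatically populate 3D scenes with geometry-consistent human interactions, and builds the large-scale InHabitants dataset, which improves 3D human-scene reconstruction and contact estimation.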

详情
AI中文摘要

训练具身智能体像人类一样理解3D场景需要大量人类与多样环境有意义交互的数据,但此类数据稀缺。真实世界捕捉成本高昂且局限于受控环境,而现有合成数据集依赖简单几何启发式,忽略了丰富的场景上下文。相比之下,在互联网规模上训练的2D基础模型已获得关于人类-环境交互的常识知识。为了将这些知识迁移到3D,我们引入了InHabit,一种自动且可扩展的数据生成器,用于在3D场景中填充交互的人类。InHabit遵循渲染-生成-提升原则:给定渲染的3D场景,视觉语言模型提出上下文相关的动作,图像编辑模型插入一个人体,优化过程将编辑结果提升为与场景几何对齐的物理上合理的SMPL-X人体。应用于Habitat-Matterport3D,InHabit生成了InHabitants,这是首个大规模逼真3D人-场景交互数据集,包含约800个建筑规模场景中的78K个样本,具有完整的3D几何、SMPL-X人体和图像。用InHabitants增强标准训练数据改进了基于RGB的3D人-场景重建和接触估计,在感知用户研究中,我们的数据在78%的情况下优于先前技术。

英文摘要

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics, ignoring rich scene context. In contrast, 2D foundation models trained at internet scale have acquired commonsense knowledge of human-environment interactions. To transfer this knowledge to 3D, we introduce InHabit, an automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces InHabitants, the first large-scale photorealistic 3D human-scene interaction dataset, with 78K samples across ~800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and images. Augmenting standard training data with InHabitants improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over prior art.

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA 版本更新

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Tencent(腾讯)

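AI summary: Proposes Chat2Workflow, a benchmark for evaluating how well large language models generate executable visual workflows from natural language, together with an agentic baseline that improves performance.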

Comments Work in progress

详情
AI中文摘要

目前,可执行的可视化工作流已成为实际工业部署中的主流范式,提供了强大的可靠性和可控性。然而,在当前实践中,此类工作流几乎完全通过手动工程构建:开发人员必须仔细设计工作流,为每个步骤编写提示,并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程,我们引入了Chat2Workflow,一个直接从自然语言生成可执行可视化工作流的基准,并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建,每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台(如Dify和Coze)上。实验结果表明,尽管最先进的语言模型通常能捕捉高层次意图,但在生成正确、稳定且可执行的工作流方面仍存在困难,尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升,但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2604.14684 2026-05-27 cs.CV 版本更新

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

DETR-ViP:具有鲁棒判别性视觉提示的检测Transformer

Bo Qian, Dahu Shi, Xing Wei

Affiliations: School of Software Engineering, Xi’an Jiaotong University; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University; CCAI, Zhejiang University; Hikrobot Co., Ltd.

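AI summary: Proposes the DETR-ViP framework, which learns class-distinguishable visual prompts through global prompt integration and visual-textual prompt relation distillation and adds a selective fusion strategy for robust detection, substantially improving visual-prompted detection on several datasets.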

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

视觉提示目标检测能够交互式且灵活地定义目标类别,从而促进开放词汇检测。由于视觉提示直接来源于图像特征,在识别稀有类别时通常优于文本提示。然而,视觉提示检测的研究很大程度上被忽视,通常被视为训练文本提示检测器的副产品,这阻碍了其发展。为充分释放视觉提示检测的潜力,我们研究了其性能次优的原因,并揭示根本问题在于视觉提示缺乏全局可区分性。受这些观察启发,我们提出DETR-ViP,一个鲁棒的目标检测框架,能够产生类别可区分的视觉提示。在基础图像-文本对比学习之上,DETR-ViP结合了全局提示集成和视觉-文本提示关系蒸馏,以学习更具判别性的提示表示。此外,DETR-ViP采用选择性融合策略,确保稳定且鲁棒的检测。在COCO、LVIS、ODinW和Roboflow100上的大量实验表明,DETR-ViP在视觉提示检测中相比其他最先进方法取得了显著更高的性能。一系列消融研究和分析进一步验证了所提出改进的有效性,并揭示了视觉提示检测能力增强的潜在原因。

英文摘要

Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

2604.13491 2026-05-27 cs.CV 版本更新

FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation

FiRe:用于增强图像生成的细粒度多模态推理

Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim

发表机构 * KT Corporation(KT公司)

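AI summary: Proposes FiRe, which uses fine-grained multi-step reasoning and the FiRe-GRPO reinforcement learning method to address the lack of fine-grained control in text-to-image generation.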

详情
AI中文摘要

随着多模态大语言模型(MLLM)的快速发展,联合进行图像理解和生成的统一MLLM取得了显著进展。然而,尽管统一MLLM具有自我反思和自我改进的内在推理能力,它们在文本到图像生成中的应用仍未被充分探索。同时,现有的基于多模态推理的图像生成方法大多依赖于提示增强或整体图像-文本对齐判断,缺乏对详细提示属性的细粒度反思和改进,导致细粒度控制有限。为了解决这一局限性,我们提出了FiRe,一种通过MLLM增强图像生成的细粒度多模态推理方法。具体来说,FiRe执行细粒度多步推理,首先将提示分解为关键视觉需求,然后自我判断它们在生成图像中的满足程度,接着根据自我生成的精确反馈进行局部改进。此外,为了进一步增强MLLM的多模态推理能力,我们引入了FiRe-GRPO,一种针对FiRe量身定制的强化学习方法。由于标准的组相对策略优化(GRPO)在多步推理中面临稀疏的、基于结果的奖励问题,我们将推理过程形式化为一个步骤级别的决策问题,设计步骤特定的奖励,并计算步骤级别的优势以在GRPO内进行细粒度的信用分配。大量实验表明,FiRe持续优于竞争性的文本到图像基线,包括现有的基于推理的方法,在组合文本到图像基准上尤其取得了显著提升。

英文摘要

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on prompt augmentation or holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. To address this limitation, we propose FiRe, a Fine-grained Multimodal Reasoning method for enhanced image generation by MLLM. In specific, FiRe performs a fine-grained multi-step reasoning by first decomposing the prompt into key visual requirements and then self-judging their satisfaction in the generated image, followed by localized refinement according to self-generated precise feedback. In addition, to further strengthen the MLLM's multimodal reasoning ability, we introduce FiRe-GRPO, a reinforcement learning method tailored to FiRe. Since standard Group Relative Policy Optimization (GRPO) suffers from sparse, outcome-based rewards in multi-step reasoning, we formulate our reasoning process as a step-level decision-making problem, design step-specific rewards, and compute step-level advantages for granular credit assignment within GRPO. Extensive experiments demonstrate that FiRe consistently outperforms competitive text-to-image baselines, including existing reasoning-based methods, with particularly substantial gains on compositional text-to-image benchmarks.
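One plausible reading of "step-level advantages" is to normalize rewards separately at each reasoning step across the rollouts sampled for a prompt, so that a good decomposition step can be credited even when the final image is poor. The sketch below implements only that reading. The reward layout, the function name, and the per-step normalization are assumptions, and FiRe-GRPO's actual reward design may differ.

```python
import statistics


def step_level_advantages(step_rewards, eps=1e-8):
    """Normalize rewards per reasoning step across a group of rollouts.

    step_rewards[g][t] is the (hypothetical) reward of rollout g at step t,
    e.g. t = 0 decompose, 1 self-judge, 2 refine. All rollouts are assumed
    to have the same number of steps.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        column = [rollout[t] for rollout in step_rewards]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        for g, reward in enumerate(column):
            advantages[g][t] = (reward - mean) / (std + eps)
    return advantages
```

Compared with a single outcome reward per rollout, this layout gives every step its own learning signal, which is the credit-assignment benefit the abstract points to.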

2604.12918 2026-05-27 cs.CV 版本更新

Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

雷达-相机BEV多任务学习:用于联合3D检测与分割的跨任务注意力桥

Ahmet İnanç, Özgür Erkent

发表机构 * Hacettepe University(哈切特佩大学)

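AI summary: Proposes CTAB (Cross-Task Attention Bridge), a module that exchanges features between the detection and segmentation branches through multi-scale deformable attention in shared BEV space for joint 3D detection and segmentation; on nuScenes it improves segmentation while leaving detection essentially unchanged.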

Comments 8 pages, 5 figures, 3 Tables, Accepted at Radar in Robotics: New Frontiers workshop, at IEEE International Conference on Robotics & Automation (ICRA), 2026

详情
AI中文摘要

鸟瞰图(BEV)表示是自动驾驶中3D感知的主流范式,它提供了一个统一的空间画布,检测和分割特征在几何上注册到同一物理坐标系。然而,现有的雷达-相机融合方法孤立地处理这些任务,错过了跨任务特征共享的机会:来自检测的物体级几何线索可以锐化分割,而来自分割的密集道路布局上下文可以锚定检测。我们提出了CTAB(跨任务注意力桥),这是一个双向模块,通过共享BEV空间中的多尺度可变形注意力在检测和分割分支之间交换特征。CTAB集成到一个多任务框架中,该框架包含基于实例归一化的分割解码器和可学习的BEV上采样,以提供更详细的BEV表示。在nuScenes上,CTAB在联合多任务基线的基础上,在7个类别上提升了分割性能,同时检测几乎不受影响。在一个4类子集(可行驶区域、人行横道、人行道、车辆)上,我们的联合多任务模型实现了51.0 mIoU-4,同时提供了有竞争力的3D检测。

英文摘要

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity for cross-task feature sharing: object-level geometric cues from detection can sharpen segmentation, while dense road-layout context from segmentation can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model achieves 51.0 mIoU-4 while simultaneously providing competitive 3D detection.
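The mIoU-4 figure is a mean intersection-over-union across four classes. A minimal sketch of that metric follows, assuming each BEV cell carries a single integer class id and that classes absent from both prediction and ground truth are skipped. nuScenes map segmentation is often scored per class on binary masks instead, so treat the layout and the function name as illustrative.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes for two flat lists of integer class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Because every class counts equally, small classes such as pedestrian crossings weigh as much as the drivable area, which is why mIoU can move even when overall pixel accuracy barely changes.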

2505.23606 2026-05-27 cs.LG cs.CV 版本更新

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit: 通过统一离散扩散模型解放超越文本到图像的生成

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

发表机构 * M-E-AGI-Lab(M-E-AGI实验室)

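AI summary: Proposes Muddit, a unified discrete diffusion transformer that combines the strong visual priors of a pretrained text-to-image backbone with a lightweight text decoder for fast, parallel generation across text and image modalities, matching or surpassing much larger autoregressive models in quality and efficiency.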

Comments Accepted to ICLR 2026. Codes and Supplementary Material: https://github.com/M-E-AGI-Lab/Muddit

详情
AI中文摘要

统一生成模型旨在单一架构和解码范式下处理跨模态的多种任务——如文本生成、图像生成和视觉-语言推理。自回归统一模型因顺序解码导致推理缓慢,而非自回归统一模型因预训练骨干有限导致泛化能力弱。我们引入第二代Meissonic:Muddit,一种统一离散扩散Transformer,能够在文本和图像模态上实现快速并行生成。与先前从头训练的统一扩散模型不同,Muddit将来自预训练文本到图像骨干的强视觉先验与轻量文本解码器集成,从而在统一架构下实现灵活且高质量的多模态生成。实验结果表明,Muddit在质量和效率上均达到或优于显著更大的自回归模型。该工作凸显了纯离散扩散在配备强视觉先验时,作为统一生成的可扩展且有效骨干的潜力。

英文摘要

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

2604.10102 2026-05-27 cs.CV cs.AI 版本更新

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

退化一致性配对训练用于鲁棒的AI生成图像检测

Zongyou Yang, Yinghan Hou, Xiaokun Yang

发表机构 * Department of Computer Science(计算机科学系) University College London(伦敦大学学院) Department of Earth Science and Engineering(地球科学与工程系) Imperial College London(伦敦帝国理工学院) School of Electronic Information(电子信息学院)

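AI summary: Proposes Degradation-Consistent Paired Training (DCPT), which uses feature-consistency and prediction-consistency constraints to explicitly strengthen robustness to real-world degradations such as JPEG compression and Gaussian blur, raising degraded-condition average accuracy on the Synthbuster benchmark by 9.1 percentage points.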

Comments 6 pages, 5 figures, 2 tables

详情
AI中文摘要

AI生成图像检测器在真实世界图像退化(如JPEG压缩、高斯模糊和分辨率降采样)下性能显著下降。我们观察到,包括B-Free在内的最先进方法将退化鲁棒性视为数据增强的副产品,而非明确的训练目标。在这项工作中,我们提出退化一致性配对训练(DCPT),这是一种简单而有效的训练策略,通过配对一致性约束显式增强鲁棒性。对于每张训练图像,我们构建一个干净视图和一个退化视图,然后施加两个约束:特征一致性损失,最小化干净表示和退化表示之间的余弦距离;以及基于对称KL散度的预测一致性损失,对齐两个视图的输出分布。DCPT不增加额外参数和推理开销。在Synthbuster基准(9个生成器,8种退化条件)上的实验表明,与没有配对训练的相同基线相比,DCPT将退化条件下的平均准确率提高了9.1个百分点,同时仅牺牲了0.9%的干净准确率。在JPEG压缩下改进最为显著(+15.7%至+17.9%)。消融实验进一步揭示,添加架构组件会导致在有限训练数据上过拟合,证实了对于退化鲁棒性,训练目标改进比架构增强更有效。

英文摘要

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
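The two consistency terms described above can be written down directly: a cosine distance between the clean and degraded feature vectors, and a symmetric KL divergence between the two predicted distributions. The sketch below is a minimal pure-Python version under assumptions: the symmetric KL is taken as the average of the two directions, and the weights `w_feat` and `w_pred` and all function names are hypothetical, since the abstract does not give the weighting.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def symmetric_kl(p, q, eps=1e-12):
    """Average of KL(p || q) and KL(q || p) for two probability vectors."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))


def paired_consistency_loss(feat_clean, feat_degraded, prob_clean, prob_degraded,
                            w_feat=1.0, w_pred=1.0):
    """Weighted sum of the feature and prediction consistency terms."""
    return (w_feat * cosine_distance(feat_clean, feat_degraded)
            + w_pred * symmetric_kl(prob_clean, prob_degraded))
```

In training, a term like this would be added to the usual classification loss for each clean/degraded pair. Both terms act only on outputs the detector already produces, which is consistent with the abstract's claim of zero additional parameters and zero inference overhead.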

2604.10095 2026-05-27 cs.CV 版本更新

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

挖掘属性子空间以实现3D基础模型的高效微调

Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Shanghai Jiao Tong University(上海交通大学) Adobe Research(Adobe研究) Google Research(谷歌研究)

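AI summary: Generates synthetic data and extracts LoRA subspaces associated with variations in texture, geometry, camera motion, and lighting; the subspaces are approximately disentangled, and integrating them yields a reduced subspace that makes downstream fine-tuning more efficient and more accurate.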

Comments 10 pages, 8 figures. Code here: https://github.com/jpppppppppppppppppppppppp/Subspaces-Mining-for-VGGT

详情
AI中文摘要

随着3D基础模型的出现,人们越来越关注将其微调用于下游任务,其中LoRA是主要的微调范式。由于3D数据集在纹理、几何、相机运动和光照方面表现出明显的差异,因此存在有趣的基本问题:1) 是否存在与每种变化类型相关的LoRA子空间?2) 这些子空间是否解耦(即彼此正交)?3) 如何有效地计算它们?本文为所有这些问题提供了答案。我们引入了一种鲁棒的方法,生成具有受控变化的合成数据集,在每个数据集上微调LoRA适配器,并提取与每种变化类型相关的LoRA子空间。我们表明这些子空间近似解耦。将它们集成可以得到一个降维的LoRA子空间,从而能够实现高效的LoRA微调,并提高下游任务的预测精度。特别是,我们表明这样的降维LoRA子空间尽管完全来自合成数据,但可以泛化到真实数据集。消融研究验证了我们方法中各种选择的有效性。

英文摘要

With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA sub-space associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.
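Whether two attribute subspaces are approximately orthogonal can be checked with a simple overlap score: orthonormalize a basis for each subspace and average the squared cosines of the principal angles, which equals the squared Frobenius norm of Qa^T Qb divided by the smaller dimension. A score of 0 means the subspaces are orthogonal and 1 means one contains the other. This is an illustrative check, not the paper's procedure, and treating each basis vector as a flattened LoRA update direction is an assumption.

```python
import math


def orthonormalize(vectors, tol=1e-10):
    """Modified Gram-Schmidt; drops vectors that are (numerically) dependent."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis


def subspace_overlap(span_a, span_b):
    """Mean squared cosine of the principal angles between two subspaces.

    span_a and span_b are lists of spanning vectors (each must contain at
    least one non-zero vector). Returns a value in [0, 1].
    """
    qa, qb = orthonormalize(span_a), orthonormalize(span_b)
    total = sum(sum(x * y for x, y in zip(a, b)) ** 2 for a in qa for b in qb)
    return total / min(len(qa), len(qb))
```

A score near 0 for every pair of variation types (texture, geometry, camera motion, lighting) would match what the abstract calls approximately disentangled.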

2604.08819 2026-05-27 cs.CV cs.AI cs.LG cs.MM 版本更新

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen: 用于可解释内容审核的敏感场景图

Fatih Cagatay Akyon, Alptekin Temizel

发表机构 * Graduate School of Informatics, METU(信息学院研究生院,梅尔夫大学) Ultralytics, Inc.(Ultralytics公司)

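AI summary: Proposes the SenBen benchmark and a compact student model that uses multi-task training and vocabulary-balancing strategies to provide spatial grounding and interpretability for sensitive content, outperforming all evaluated VLMs except the Gemini models on grounded scene graph metrics.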

Comments Accepted at CVPRW 2026

详情
AI中文摘要

内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法解释检测到了什么敏感行为、涉及谁或发生在哪里。我们引入了敏感基准(SenBen),这是第一个用于敏感内容的大规模场景图基准,包含来自157部电影的13,999帧,标注了Visual Genome风格的场景图(25个对象类别、28个属性,包括情感状态如疼痛、恐惧、攻击性和痛苦,14个谓词)以及跨5个类别的16个敏感标签。我们通过多任务配方将前沿VLM蒸馏成一个紧凑的241M学生模型,该配方通过基于后缀的对象身份、词汇感知召回(VAR)损失和解耦的Query2Label标签头(带非对称损失)解决自回归场景图生成中的词汇不平衡问题,在SenBen召回率上比标准交叉熵训练提高了6.4个百分点。在带定位的场景图指标上,我们的学生模型优于除Gemini模型外的所有评估VLM和所有商业安全API,同时在所有模型中实现了最高的对象检测和字幕生成分数,推理速度提升7.6倍,GPU内存减少16倍。

英文摘要

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.

2512.21602 2026-05-27 cs.LG cs.CV updated version

An Empirical Study of Machine Learning Robustness and Scalability for Imbalanced Tabular Clinical Data in Emergency and Critical Care

机器学习在急诊和重症监护中不平衡表格临床数据的鲁棒性与可扩展性实证研究

Yusuf Brima, Marcellin Atemkeng

Affiliations Computer Vision Group, Institute of Cognitive Science, Osnabrück University; Department of Mathematics, Rhodes University; National Institute for Theoretical and Computational Sciences (NITheCS)

AI summary Evaluates six model families on imbalanced clinical tabular data from the MIMIC-IV-ED and eICU datasets, finding that tree-based models scale best while tabular foundation models offer a new trade-off between performance and efficiency.

Details
AI Chinese abstract

每年,数百万患者通过急诊科和重症监护室,临床医生必须在时间压力和不确定性下做出高风险决策。机器学习可以支持恶化预测、分诊和罕见关键结局的预测,但临床数据通常严重不平衡,使模型偏向多数类并降低预测性能。因此,为不平衡的临床表格数据开发鲁棒且高效的模型仍然是一个重要挑战。 我们在MIMIC-IV-ED和eICU数据库的不平衡表格数据上评估了六类模型:决策树、随机森林、XGBoost、TabNet、TabICL和TabPFN v2.6。可训练模型通过贝叶斯超参数调优进行优化,而基础模型在其预训练推理模式下进行评估,无需任务特定的重新加权。模型使用Macro F1分数、对递增不平衡的鲁棒性以及跨七个临床预测任务的计算可扩展性进行评估。 结果在不同数据集上有所不同。在MIMIC-IV-ED上,TabPFN v2.6和TabICL获得了最强的平均Macro F1排名,XGBoost保持竞争力。在eICU上,XGBoost始终表现最佳,其次是其他基于树的方法,而基础模型达到中等性能。在两个数据集中,TabNet在递增不平衡下显示出最大的性能下降和最高的计算成本。训练时间分析表明,基于树的方法随数据集大小扩展最有利,而基础模型提供了较低的每任务适应成本。 这些发现表明,没有单一模型族在所有临床环境中占主导地位。然而,表格基础模型正在缩小与强经典基线的性能差距,同时提供独特的效率-性能权衡,这可能有利于资源受限的临床环境。

English abstract

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.
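
Macro F1, the headline metric in this study, averages per-class F1 scores without weighting by class frequency, which is why it penalizes a model that ignores a rare class. The sketch below is a minimal pure-Python illustration of that effect. The 95:5 label split and the always-majority predictor are hypothetical and are not taken from the paper's experiments.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced labels and a model that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.95 while Macro F1 stays below 0.5, because the minority class contributes an F1 of 0.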

2509.08289 2026-05-27 cs.CV updated version

Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

双阈值热力图引导的提议聚类与负确定性监督及增强基础网络的弱监督目标检测

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

Affiliations Institute of Cyberspace Security, Harbin Institute of Technology; Faculty of Information Technology, Monash University; Center on Machine Learning Research, Harbin Institute of Technology; Department of New Networks, Peng Cheng Laboratory; School of Cyberspace Science, Harbin Institute of Technology

AI summary Proposes DANCE, which combines dual-thresholded heatmap-guided proposal selection, an enhanced base network, and a negative certainty supervision loss to address incomplete pseudo GT boxes, the semantic gap, and slow convergence in weakly supervised object detection.

Comments IEEE TIP Minor Revision

Details
AI Chinese abstract

弱监督目标检测(WSOD)近年来因其不需要框级标注而受到广泛关注。最先进的方法通常采用多模块网络,使用WSDDN作为多实例检测网络模块,并使用多实例细化模块来改进性能。然而,这些方法存在三个关键局限性。首先,现有方法倾向于生成仅关注判别性部分的伪GT框,未能捕捉整个物体,或者覆盖整个物体但无法区分相邻的类内实例。其次,基础WSDDN架构缺乏每个提议的关键背景类表示,并且其分支之间存在较大的语义鸿沟。第三,先前的方法在优化过程中丢弃被忽略的提议,导致收敛缓慢。为了解决这些挑战,我们提出了双阈值热力图引导的提议聚类和负确定性监督与增强基础网络(DANCE)方法用于WSOD。具体来说,我们首先设计了一种热力图引导的提议选择器(HGPS)算法,该算法利用热力图上的双阈值来预选提议,使伪GT框既能捕捉完整的物体范围,又能区分相邻的类内实例。然后,我们构建了一个弱监督基础检测网络(WSBDN),它为每个提议增加一个背景类表示,并使用热力图进行预监督以弥合矩阵之间的语义鸿沟。最后,我们在被忽略的提议上引入负确定性监督(NCS)损失以加速收敛。在具有挑战性的PASCAL VOC和MS COCO数据集上进行的大量实验证明了我们方法的有效性和优越性。我们的代码可在https://github.com/gyl2565309278/DANCE公开获取。

English abstract

Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and uses multiple instance refinement modules to improve performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we propose the Dual-thresholded heAtmap-guided proposal clustering and Negative Certainty supervision with Enhanced base network (DANCE) method for WSOD. Specifically, we first devise a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then construct a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. Finally, we introduce a negative certainty supervision (NCS) loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC and MS COCO datasets demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/gyl2565309278/DANCE.

2604.00648 2026-05-27 cs.CV updated version

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

DirectFisheye-GS: 在三维高斯泼溅中通过跨视图联合优化实现原生鱼眼输入

Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu

Affiliations BNRist, Tsinghua University; Beihang University; JD.com, Beijing, China; Shanghai AI Lab

AI summary To address the information loss and blurred detail caused by undistorting fisheye camera inputs, integrates a fisheye camera model into the 3DGS framework and introduces a feature-overlap-driven cross-view joint optimization strategy, enabling training on native fisheye images without preprocessing and improving reconstruction quality.

Comments CVPR 2026 Highlight; Fix NSFC ID

Details
AI Chinese abstract

三维高斯泼溅(3DGS)实现了从日常图像中进行高效的三维场景重建,具有实时、高保真渲染的特点,极大地推动了VR/AR应用的发展。鱼眼相机凭借其更宽的视场角(FOV),有望从更少的输入中实现高质量重建,近来备受关注。然而,由于3DGS依赖于光栅化,大多数后续涉及鱼眼相机输入的工作在训练前先对图像进行去畸变,这引入了两个问题:1)图像边缘的黑边导致信息丢失,抵消了鱼眼大FOV的优势;2)去畸变的拉伸和插值重采样将每个像素的值扩散到更大区域,稀释了细节密度——导致3DGS过拟合这些低频区域,产生模糊和漂浮伪影。在这项工作中,我们将鱼眼相机模型集成到原始3DGS框架中,实现了无需预处理的原生鱼眼图像输入进行训练。尽管建模正确,我们观察到重建场景在图像边缘仍然存在漂浮物:畸变向边缘增加,而3DGS原始的逐迭代随机选择视图优化忽略了高斯函数的跨视图相关性,导致极端形状(例如过大或拉长)降低了重建质量。为解决此问题,我们引入了一种基于特征重叠的跨视图联合优化策略,该策略在视图之间建立一致的几何和光度约束——该技术同样适用于现有的基于针孔相机的流水线。我们的DirectFisheye-GS在公共数据集上达到或超越了最先进的性能。项目页面:https://yzxqh.github.io/DirectFisheye-GS/ 。

English abstract

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works that take fisheye camera inputs undistort the images before training, which introduces two problems: 1) black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density, which causes 3DGS to overfit these low-frequency zones and produces blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original optimization, which selects a random view at each iteration, ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets. Project Page: https://yzxqh.github.io/DirectFisheye-GS/ .
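
The claim that undistortion stretches pixels near the image edge can be checked with a small calculation. The abstract does not say which fisheye camera model the authors use, so the sketch below assumes the common equidistant model r = f·θ purely for illustration and compares it with the pinhole model r = f·tan θ. The function names are hypothetical.

```python
import math


def pinhole_radius(theta, f):
    """Image radius of a ray at angle theta (radians) from the optical axis, pinhole model."""
    return f * math.tan(theta)


def equidistant_radius(theta, f):
    """Image radius under the assumed equidistant fisheye model r = f * theta."""
    return f * theta


def undistortion_stretch(theta):
    """Radial stretch d(r_pinhole) / d(r_fisheye) = 1 / cos^2(theta).

    This is how much one fisheye pixel is spread out radially when the
    image is resampled onto a pinhole image plane with the same focal length.
    """
    return 1.0 / math.cos(theta) ** 2


# Stretch factor at the image center, at 60 degrees, and at 80 degrees off-axis.
stretch = {deg: undistortion_stretch(math.radians(deg)) for deg in (0, 60, 80)}
```

Under this assumption the stretch is 1 at the center, 4 at 60 degrees, and above 30 at 80 degrees, which matches the abstract's point that detail density is diluted most strongly toward the periphery.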

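2603.28730 2026-05-27 cs.RO cs.CL cs.CV updated version

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

SOLE-R1:视频语言推理作为机器人强化学习的唯一奖励

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

Affiliations MIT; RAI Institute

AI summary Proposes SOLE-R1, a model that uses video-language spatiotemporal reasoning to produce dense task-progress estimates as the sole reward signal, enabling zero-shot online reinforcement learning without ground-truth rewards, demonstrations, or task-specific tuning.

Details
AI Chinese abstract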

视觉语言模型(VLM)在各种任务中展现出令人印象深刻的能力,这促使人们努力利用这些模型来监督机器人学习。然而,当在强化学习(RL)中用作评估器时,当今最强的模型在部分可观测性和分布偏移下常常失败,使得策略能够利用感知错误而非解决任务。我们提出SOLE-R1(自观察学习器),一种专门设计用于为在线RL提供唯一奖励信号的视频语言推理模型。仅给定原始视频观测和自然语言目标,SOLE-R1执行每时间步的时空思维链(CoT)推理,并生成可直接用作奖励的密集任务进度估计。为了训练SOLE-R1,我们开发了一个大规模视频轨迹和推理合成流水线,生成与连续进度监督对齐的时间基础CoT轨迹。这些数据与基础的空间和多帧时间推理相结合,并使用混合框架训练模型,该框架将监督微调与可验证奖励的RL相结合。在四个不同的仿真环境和真实机器人设置中,SOLE-R1实现了从随机初始化的零样本在线RL:机器人学习之前未见过的操作任务,无需真实奖励、成功指标、演示或任务特定调优。SOLE-R1在24个未见过的任务上成功,并显著优于强视觉语言奖励器,包括Robometer、RoboReward、ReWiND、GPT-5和Gemini-3-Pro,同时对奖励破解表现出明显更强的鲁棒性。我们在匿名页面发布所有模型、数据、代码和演示:https://philip-mit.github.io/sole-r1/

English abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. We introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including Robometer, RoboReward, ReWiND, GPT-5, and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking. We release all models, data, code, and demos at the anonymous page: https://philip-mit.github.io/sole-r1/

2603.20020 2026-05-27 cs.CV cs.AI updated version

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

分离跳跃链接与$R$-探针:解耦特征聚合与梯度传播用于MLLM OCR

Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang

Affiliations State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China; Tsinghua University, Beijing, China; Baidu Inc, Beijing, China

AI summary To address the loss of fine-grained visual information caused by gradient interference in multimodal large language models on OCR tasks, proposes Detached Skip-Links to decouple forward feature aggregation from backward gradient propagation, and introduces $R$-Probe to diagnose the reconstructability of visual tokens, improving performance on OCR and general multimodal tasks.

Comments Accepted by ICML 2026. Ziye Yuan and Ruchang Yao contributed equally to this work (co-first authors, listed in random order)

Details
AI Chinese abstract

多模态大语言模型(MLLMs)擅长高级推理,但在OCR任务中失败,因为细粒度视觉细节被破坏或错位。我们发现了多层特征融合中一个被忽视的优化问题。跳跃路径引入了从高级语义目标到早期视觉层的直接反向传播路径。这种机制覆盖了低级信号并破坏了训练稳定性。为了缓解这种梯度干扰,我们提出了分离跳跃链接(Detached Skip-Links),这是一种最小的修改,在前向传播中重用浅层特征,同时在联合训练期间停止通过跳跃分支的梯度。这种非对称设计减少了梯度干扰,提高了稳定性和收敛性,且无需增加可学习参数。为了诊断细粒度信息是否被保留并可供LLM使用,我们引入了$R$-探针($R$-Probe),它使用从LLM前四分之一层初始化的浅层解码器测量投影视觉令牌的像素级可重构性。在多个ViT骨干网络和多模态基准测试中,以及高达7M训练样本的规模下,我们的方法持续改进了以OCR为中心的基准测试,并在通用多模态任务上取得了明显提升。

English abstract

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
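
The idea of reusing a shallow feature in the forward pass while blocking gradients through the skip branch can be illustrated with a toy scalar autodiff. This is only a sketch of the stop-gradient principle, not the authors' implementation: the `Value` class, the two-weight "network", and the numbers are all hypothetical.

```python
class Value:
    """Tiny scalar reverse-mode autodiff node (illustrative only)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents  # tuple of (parent_node, local_gradient)

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data, ((self, other.data), (other, self.data)))

    def detach(self):
        # Same forward value, but no edge back to the producer: gradients stop here.
        return Value(self.data)

    def backward(self):
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):  # children before parents
            for parent, local in node._parents:
                parent.grad += local * node.grad


def fused_output(w_shallow, w_deep, x, detach_skip):
    shallow = w_shallow * x   # early (low-level) feature
    deep = w_deep * shallow   # later (more semantic) feature
    skip = shallow.detach() if detach_skip else shallow
    return deep + skip        # forward value is identical in both modes


def output_and_shallow_grad(detach_skip):
    w_shallow, w_deep, x = Value(2.0), Value(3.0), Value(5.0)
    out = fused_output(w_shallow, w_deep, x, detach_skip)
    out.backward()
    return out.data, w_shallow.grad
```

With these toy numbers both modes produce the same output (40.0), but the gradient reaching the shallow weight drops from 20.0 to 15.0 once the skip branch is detached: the direct path from the objective to the early layer is removed while the path through the deep layer remains.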

2603.16870 2026-05-27 cs.CV cs.AI updated version

Demystifying Video Reasoning

揭秘视频推理

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

Affiliations SenseTime Research; Nanyang Technological University; University of California, Berkeley; University of California, San Diego; Carnegie Mellon University

AI summary Shows experimentally that reasoning in video diffusion models emerges mainly along the denoising steps, proposes the Chain-of-Steps (CoS) mechanism, identifies emergent behaviors including working memory, self-correction, and perception before action, and presents a training-free ensembling strategy to improve reasoning.

Comments Homepage: https://www.wruisi.com/demystifying_video_reasoning

Details
AI Chinese abstract

近期视频生成的进展揭示了一个意外现象:基于扩散的视频模型展现出非平凡的推理能力。先前的工作将此归因于链式帧(CoF)机制,假设推理在视频帧间顺序展开。在本工作中,我们挑战这一假设,并揭示了一个根本不同的机制。我们表明视频模型中的推理主要沿着扩散去噪步骤涌现。通过定性分析和针对性探测实验,我们发现模型在早期去噪步骤中探索多个候选解,并逐步收敛到最终答案,我们将此过程称为链式步骤(CoS)。除了这一核心机制,我们还识别出对模型性能至关重要的几种涌现推理行为:(1)工作记忆,支持持久参考;(2)自我修正与增强,允许从不正确的中间解中恢复;(3)先感知后行动,早期步骤建立语义基础,后期步骤执行结构化操作。在扩散步骤内部,我们进一步揭示了扩散变换器中的自演化功能特化:早期层编码密集的感知结构,中间层执行推理,后期层巩固潜在表示。受这些见解的启发,我们提出了一种简单的无需训练的策略作为概念验证,展示了如何通过集成来自相同模型不同随机种子的潜在轨迹来改进推理。总体而言,我们的工作系统性地理解了推理如何在视频生成模型中涌现,为未来研究更好地利用视频模型固有的推理动态作为智能的新基础提供了基础。

English abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

2603.09551 2026-05-27 cs.CV updated version

GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision

GeoSolver: 利用细粒度过程监督扩展遥感中的测试时推理

Lang Sun, Ronghao Fu, Zhuoran Duan, Haoran Liu, Xueyan Liu, Bo Yang

Affiliations College of Computer Science and Technology; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University

AI summary Proposes the GeoSolver framework, which builds the large-scale process supervision dataset Geo-PRM-2M, trains the process reward model GeoPRM, and applies the Process-Aware Tree-GRPO reinforcement learning algorithm to achieve verifiable step-by-step reasoning in remote sensing, reaching state-of-the-art performance on multiple benchmarks and supporting test-time scaling.

Comments Code: https://github.com/yourname/GeoSolver

Details
AI Chinese abstract

尽管视觉语言模型(VLM)显著推进了遥感解译,但使其能够执行复杂、逐步推理仍然极具挑战性。最近在该领域引入思维链(CoT)推理的努力显示出前景,但确保这些中间步骤的视觉忠实性仍是一个关键瓶颈。为解决这一问题,我们提出了GeoSolver,一个新颖的框架,将遥感推理转向可验证的、过程监督的强化学习。我们首先构建了Geo-PRM-2M,一个大规模的、令牌级过程监督数据集,通过熵引导的蒙特卡洛树搜索(MCTS)和有针对性的视觉幻觉注入合成。基于该数据集,我们训练了GeoPRM,一个令牌级过程奖励模型(PRM),提供细粒度的忠实性反馈。为了有效利用这些验证信号,我们提出了过程感知树GRPO,一种强化学习算法,将树结构探索与忠实性加权奖励机制相结合,以精确分配中间步骤的信用。大量实验表明,我们的最终模型GeoSolver-9B在多样化的遥感基准上实现了最先进的性能。至关重要的是,GeoPRM解锁了鲁棒的测试时扩展(TTS)。作为通用的地理空间验证器,它无缝地扩展了GeoSolver-9B的性能,并直接增强了通用VLM,突显了其卓越的跨模型泛化能力。

English abstract

While Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, enabling them to perform complex, step-by-step reasoning remains highly challenging. Recent efforts to introduce Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring the visual faithfulness of these intermediate steps remains a critical bottleneck. To address this, we introduce GeoSolver, a novel framework that transitions remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building upon this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides granular faithfulness feedback. To effectively leverage these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that integrates tree-structured exploration with a faithfulness-weighted reward mechanism to precisely assign credit to intermediate steps. Extensive experiments demonstrate that our resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM unlocks robust Test-Time Scaling (TTS). Serving as a universal geospatial verifier, it seamlessly scales the performance of GeoSolver-9B and directly enhances general-purpose VLMs, highlighting its remarkable cross-model generalization.
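
One common way a process reward model supports test-time scaling is best-of-N selection: sample several reasoning chains and keep the one the verifier scores highest. The abstract does not describe GeoPRM's exact selection procedure, so the sketch below is a generic illustration under that assumption. The `select_best` helper, the step texts, and the scores are all hypothetical stand-ins for a real PRM.

```python
def select_best(candidates, score_steps, aggregate=min):
    """Return the reasoning chain whose per-step scores aggregate highest.

    candidates  : list of chains, each a list of step strings
    score_steps : callable mapping a chain to a list of per-step scores
    aggregate   : how to reduce step scores to one number; `min` treats a
                  chain as only as faithful as its weakest step
    """
    return max(candidates, key=lambda steps: aggregate(score_steps(steps)))


# Hypothetical verifier scores standing in for a process reward model.
toy_scores = {
    "locate runway": 0.90,
    "count aircraft": 0.80,
    "assume harbor": 0.20,   # an unfaithful (hallucinated) step
    "count ships": 0.95,
}

candidates = [
    ["locate runway", "count aircraft"],
    ["assume harbor", "count ships"],
]

best = select_best(candidates, lambda steps: [toy_scores[s] for s in steps])
```

Using `min` as the aggregate rejects the second chain even though its final step scores highest, which is the behavior one would want from a verifier that targets faithfulness of intermediate steps.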

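2512.09700 2026-05-27 cs.CV eess.IV updated version

LiM-YOLO: Less is More with Pyramid Level Shift for Ship Detection in Optical Remote Sensing

LiM-YOLO:基于金字塔层级偏移的光学遥感舰船检测中少即是多

Seon-Hoon Kim, Yerin Kim, Hyeji Sim, Youeyun Jung, Okchul Jung, Daewon Chung

Affiliations University of Science and Technology (UST); Korea Aerospace Research Institute (KARI)

AI summary To address the dilution of spatial features in deep feature pyramid levels caused by the small scale and high aspect ratio of ships in optical remote sensing, proposes the LiM-YOLO detector, which shifts the detection head from strides 8, 16, and 32 to strides 4, 8, and 16 through a pyramid level shift strategy and adds a group-normalized auxiliary projection module, surpassing larger state-of-the-art detectors with 64.1% fewer parameters.

Comments 16 pages, 6 figures, 9 tables

Details
AI Chinese abstract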

通用目标检测器在应用于卫星图像中的舰船检测时面临根本性的结构限制,其中舰船尺度分布集中在较小尺寸和高长宽比。在传统的YOLO架构中,最深的特征金字塔层级(步长32)将窄长船只压缩为亚像素表示,导致严重的空间特征稀释并影响准确的舰船边界回归。我们提出Less is More YOLO,一种基于YOLOv9超大变体的精简检测器,以解决这些领域特定的结构冲突。通过对四个主要基准(SODA-A、DOTA-v1.5、FAIR1M-v2.0和ShipRSImageNet)中舰船尺度分布的统计分析,我们引入了一种金字塔层级偏移策略,将检测头从步长8、16、32移至步长4、8、16。该偏移满足基于奈奎斯特-香农原理推导出的最窄目标的空间可表示性条件,同时消除了最深金字塔层级的计算冗余。为了进一步稳定高分辨率卫星输入上的训练,我们引入了一个组归一化辅助投影模块,将组归一化引入投影路径,缓解了内存受限的微批量训练中的梯度不稳定性。在这四个数据集上验证,我们的检测器仅用21.16百万参数就达到了0.600的mAP_{50-95},相比超大YOLOv9基线(58.99百万参数)减少了64.1%。尽管尺寸紧凑,我们的模型超越了多达三倍大的最先进检测器,验证了有针对性的金字塔层级偏移实现了准确性与效率之间的“少即是多”平衡。代码可在https://github.com/egshkim/LiM-YOLO获取。

English abstract

General-purpose object detectors face fundamental structural limitations when applied to ship detection in satellite imagery, where the ship scale distribution is concentrated at small sizes and high aspect ratios. In conventional You Only Look Once architectures, the deepest feature pyramid level (stride 32) compresses narrow vessels into sub-pixel representations, causing severe spatial feature dilution and compromising accurate ship boundary regression. We propose Less is More YOLO, a streamlined detector built upon the extra-large variant of YOLOv9, to address these domain-specific structural conflicts. From a statistical analysis of ship scale distributions across four major benchmarks (SODA-A, DOTA-v1.5, FAIR1M-v2.0, and ShipRSImageNet), we introduce a Pyramid Level Shift Strategy that shifts the detection head from strides 8, 16, and 32 to strides 4, 8, and 16. This shift satisfies a spatial representability condition derived from the Nyquist-Shannon principle for the narrowest targets, while eliminating the computational redundancy of the deepest pyramid level. To further stabilize training on high-resolution satellite inputs, we incorporate a group-normalized auxiliary projection module that introduces Group Normalization into the projection path, mitigating gradient instability in memory-constrained micro-batch regimes. Validated on these four datasets, our detector attains an mAP_{50-95} of 0.600 with only 21.16 million parameters, a 64.1% reduction from the extra-large YOLOv9 baseline (58.99 million). Despite this compact size, our model surpasses state-of-the-art detectors up to three times larger, validating that a well-targeted pyramid level shift achieves a "Less is More" balance between accuracy and efficiency. The code is available at https://github.com/egshkim/LiM-YOLO.
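
Two numbers in this abstract are easy to reproduce: how many feature-map cells a narrow ship spans at each stride, and the reported parameter reduction. The sketch below is a back-of-the-envelope check. The 12-pixel ship width is a hypothetical example, and the paper's actual representability condition is not given in the abstract, so no threshold is assumed here.

```python
def cells_spanned(extent_px, stride):
    """Feature-map cells covered along one axis by an object extent_px pixels wide."""
    return extent_px / stride


def param_reduction(baseline_millions, compact_millions):
    """Fractional reduction in parameter count relative to the baseline."""
    return 1.0 - compact_millions / baseline_millions


ship_width_px = 12  # hypothetical narrow vessel, not a value from the paper

# Footprint of the ship's short side at the old (8, 16, 32) and new (4, 8, 16) strides.
footprint = {stride: cells_spanned(ship_width_px, stride) for stride in (4, 8, 16, 32)}

# Parameter counts quoted in the abstract, in millions.
reduction = param_reduction(58.99, 21.16)
```

For this example the ship's short side covers 3 cells at stride 4 but less than half a cell at stride 32, which illustrates the sub-pixel compression the abstract describes. The computed reduction rounds to 64.1%, consistent with the reported figure.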

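2603.03711 2026-05-27 cs.CV updated version

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

LDP-Slicing:通过随机位平面切片实现图像的本地差分隐私

Yuanming Cao, Chengqi Li, Wenbo He

Affiliations McMaster University

AI summary Proposes the LDP-Slicing framework, which decomposes pixel values into binary bit-planes and applies a local differential privacy mechanism, combined with a perceptual obfuscation module and a privacy budget allocation strategy, to satisfy rigorous pixel-level ε-LDP while keeping images highly useful for downstream tasks.

Details
AI Chinese abstract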

本地差分隐私(LDP)是隐私保护机器学习的黄金标准信任模型,通过在数据源处保证隐私。然而,由于像素空间的高维性,其在图像数据上的应用长期以来被认为不切实际。典型的LDP机制设计用于低维数据,当应用于高维像素空间时会导致严重的效用退化。本文证明这种效用损失并非LDP固有的,而是源于将其应用于不适当的数据表示。我们引入了LDP-Slicing,一个轻量级、无需训练的框架,解决了这种领域不匹配问题。我们的关键见解是将像素值分解为一系列二进制位平面。这种转换使我们能够直接将LDP机制应用于位级表示。为了进一步加强隐私并保持效用,我们集成了一个感知混淆模块,减轻人类可感知的泄漏,以及一个基于优化的隐私预算分配策略。该流程满足严格的像素级ε-LDP,同时生成对下游任务保持高效用的图像。在人脸识别和图像分类上的大量实验表明,在可比的隐私预算下,LDP-Slicing优于现有的DP/LDP基线,且计算开销可忽略不计。

English abstract

Local Differential Privacy (LDP) is the gold-standard trust model for privacy-preserving machine learning because it guarantees privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but stems from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
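
The bit-plane idea can be sketched with binary randomized response, a standard ε-LDP mechanism for a single bit. This is an illustration only: the abstract does not name the mechanism the authors apply to each bit-plane, and their budget allocation is optimization-based, whereas the per-bit budgets here are passed in by hand. By sequential composition, the per-pixel budget in this sketch is the sum of the per-bit budgets.

```python
import math
import random


def to_bit_planes(pixel, bits=8):
    """Split an integer pixel value into bits, most significant first."""
    return [(pixel >> (bits - 1 - i)) & 1 for i in range(bits)]


def from_bit_planes(planes):
    """Recombine bits (most significant first) into an integer."""
    value = 0
    for bit in planes:
        value = (value << 1) | bit
    return value


def randomized_response(bit, epsilon, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep else 1 - bit


def privatize_pixel(pixel, epsilons, rng):
    """Perturb each bit-plane independently; total budget is sum(epsilons)."""
    planes = to_bit_planes(pixel, bits=len(epsilons))
    noisy = [randomized_response(b, e, rng) for b, e in zip(planes, epsilons)]
    return from_bit_planes(noisy)


rng = random.Random(0)
# Hypothetical allocation: more budget (less noise) on the high-order bits.
epsilons = [2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 0.25, 0.25]
noisy_pixel = privatize_pixel(200, epsilons, rng)
```

Giving the high-order bits a larger share of the budget is one plausible heuristic, since a flip in the most significant bit changes the pixel by 128 while a flip in the least significant bit changes it by 1.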

2602.19206 2026-05-27 cs.CV updated version

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

GS-CLIP: 基于几何感知提示与协同视图表示学习的零样本3D异常检测

Zehao Deng, An Liu, Yan Wang

Affiliations School of Computer Science and Technology, Soochow University; Institute for AI Industry Research (AIR), Tsinghua University

AI summary Proposes the GS-CLIP framework, which uses geometry-aware prompts and synergistic view representation learning to detect geometric anomalies in 3D point clouds in the zero-shot setting.

Comments Accepted by CVPR 2026

Details
AI Chinese abstract

零样本3D异常检测是一项新兴任务,旨在无需目标训练数据的情况下检测目标数据集中的异常,这在样本稀缺和数据隐私受限的场景中尤为重要。当前方法通过将3D点云投影到2D表示来适配CLIP,但面临挑战:投影会固有地丢失一些几何细节,且依赖单一2D模态导致视觉理解不完整,限制了检测多样异常类型的能力。为解决这些局限,我们提出几何感知提示与协同视图表示学习(GS-CLIP)框架,通过两阶段学习使模型能够识别几何异常。第一阶段,我们动态生成嵌入3D几何先验的文本提示,这些提示包含由我们的几何缺陷蒸馏模块(GDDM)提炼的全局形状上下文和局部缺陷信息。第二阶段,我们引入协同视图表示学习架构,并行处理渲染图像和深度图像,随后通过协同精炼模块(SRM)融合两个流的特征,利用它们的互补优势。在四个大规模公共数据集上的全面实验结果表明,GS-CLIP在检测中取得了优越性能。代码可在 https://github.com/zhushengxinyue/GS-CLIP 获取。

English abstract

Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face two challenges: the projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior detection performance. Code is available at https://github.com/zhushengxinyue/GS-CLIP.

2602.21636 2026-05-27 cs.CV updated version

Axial-Centric Cross-Plane Attention for 3D Medical Image Classification

轴向中心跨平面注意力用于3D医学图像分类

Doyoung Park, Jinsoo Kim, Lohendran Baskaran

Affiliations National Heart Centre Singapore, Singapore; CVS.AI, National Heart Research Institute of Singapore, Singapore; Independent Researcher, Republic of Korea

AI summary Proposes an axial-centric cross-plane attention architecture that models asymmetric dependencies between anatomical planes and outperforms existing 3D and multi-plane models on the MedMNIST3D benchmark.

Comments Submitted to BMVC 2026

Details
AI Chinese abstract

(缩写版)临床医生通常通过检查多个解剖平面而非依赖体积视图来解释3D医学图像。在临床CT工作流中,轴向平面常作为主要诊断参考,而辅助平面提供互补空间上下文。然而,许多现有3D深度学习方法要么整体处理体积数据,要么对所有平面赋予相同重要性,未能反映这种不对称的轴向中心解释策略。为此,我们提出一种用于3D医学图像分类的轴向中心跨平面注意力架构,该架构建模解剖平面间的不对称依赖关系。该架构使用大规模轴向CT图像预训练的MedDINOv3作为冻结特征提取器,用于轴向、冠状和矢状平面。RICA块和平面内变换器编码器捕获平面特定的位置和上下文信息,而轴向中心跨平面变换器编码器选择性地以互补的辅助表示条件化轴向表示。在MedMNIST3D基准的六个数据集上的实验表明,所提方法在ACC和AUC上持续优于现有3D和多平面模型。轻量级变体AC-Tiny以显著更少的可训练参数实现了竞争性能,表明架构设计对性能提升的贡献大于模型规模增加。消融研究进一步验证了轴向中心查询、QKV分配、定向跨平面融合、无残差交叉注意力和分类头设计的重要性。切片级Grad-CAM可视化表明,模型在所有平面上识别出诊断相关区域。这些发现强调了将架构设计与临床解释工作流对齐对于稳健的3D医学图像分析的价值。

English abstract

Abridged: Clinicians commonly interpret 3D medical images by examining multiple anatomical planes rather than relying on volumetric views. In clinical CT workflows, the axial plane often serves as the primary diagnostic reference, while the auxiliary planes provide complementary spatial context. However, many existing 3D deep learning approaches either process volumetric data holistically or assign equal importance to all planes, failing to reflect this asymmetric, axial-centric interpretation strategy. To address this, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that models asymmetric dependencies between anatomical planes. The architecture employs MedDINOv3, pretrained on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information, while axial-centric cross-plane transformer encoders selectively condition axial representations on complementary auxiliary representations. Experiments on six datasets from the MedMNIST3D benchmark show that the proposed method consistently outperforms existing 3D and multi-plane models in ACC and AUC. A lightweight variant, AC-Tiny, achieves competitive performance with substantially fewer trainable parameters, suggesting that architectural design contributes more to performance gains than increased model scale. Ablation studies further validate the importance of axial-centric querying, QKV allocation, directional cross-plane fusion, residual-free cross-attention, and classification head design. Slice-level Grad-CAM visualizations demonstrate that the model identifies diagnostically relevant regions across all planes. These findings highlight the value of aligning architectural design with clinical interpretation workflows for robust 3D medical image analysis.
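
The asymmetric design described here, where axial features act as queries and auxiliary-plane features supply keys and values, can be sketched as a plain scaled dot-product cross-attention without a residual connection. This is a simplified illustration, not the authors' architecture: it has a single head, no learned projections, and the token values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; no residual connection is added."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


# Hypothetical token embeddings: one axial token queries two auxiliary tokens.
axial_tokens = [[1.0, 0.0]]               # queries: axial plane
aux_tokens = [[1.0, 0.0], [0.0, 1.0]]     # keys/values: coronal and sagittal planes
conditioned = cross_attention(axial_tokens, aux_tokens, aux_tokens)
```

Because only the axial tokens issue queries, the output has one row per axial token, and each row is a weighted mix of auxiliary-plane features. Swapping the argument order would give the symmetric alternative that the abstract argues against.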

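2602.18907 2026-05-27 cs.LG cs.CV cs.CY updated version

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

DeepInterestGR: 利用多模态大语言模型挖掘深度多兴趣用于生成式推荐

Yangchen Zeng, Zhenyu Yu, Zhiyuan Hu, Wenxin Zhang, Jinze Wang, Rongfeng Guo

Affiliations Southeast University

AI summary Proposes the DeepInterestGR framework, which addresses the shallow-interest problem in generative recommendation through Multi-LLM Interest Mining, Reward-Labeled Deep Interest, and Interest-Enhanced Item Discretization, and significantly improves recommendation performance on three Amazon datasets.

Details
AI Chinese abstract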

我们介绍了DeepInterestGR,一个将深度兴趣挖掘集成到生成式推荐流程中的新颖框架。这解决了“浅层兴趣”问题——现有的生成方法依赖于表面文本特征,未能捕捉潜在的用户动机,限制了个性化深度和推荐可解释性。我们的方法通过结构化推理提示利用多LLM兴趣挖掘(MLIM),通过奖励标记深度兴趣(RLDI)进行质量控制,通过RQ-VAE进行兴趣增强物品离散化(IEID),并结合由兴趣感知奖励引导的两阶段SFT-GRPO训练流程。我们在三个Amazon Review基准(Beauty、Sports、Instruments)上验证了DeepInterestGR,与包括SASRec、BERT4Rec、TIGER、LC-Rec和S-DPO在内的14个最先进基线进行了比较。我们的方法在HR@10上实现了5.8%-8.3%的相对改进,在NDCG@10上实现了7.7%-9.9%的相对改进,跨领域泛化增益达到+24.8%。这些结果证明,融入深度语义兴趣可以有效改进基于SID的生成式推荐。

English abstract

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem: existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
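
HR@10 and NDCG@10 are the two metrics behind the reported gains. The sketch below shows their usual definitions under the assumption of one held-out target item per user, a common protocol for sequential recommendation that the abstract itself does not spell out. The item identifiers are hypothetical.

```python
import math


def hit_rate_at_k(ranked, target, k=10):
    """1.0 if the target item appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked, target, k=10):
    """NDCG with a single relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1) when the target is in the top k, else 0.
    """
    top = ranked[:k]
    if target not in top:
        return 0.0
    rank = top.index(target) + 1
    return 1.0 / math.log2(rank + 1)


# Hypothetical ranked recommendation list for one user.
ranked_items = ["item_7", "item_2", "item_9", "item_4"]

hr = hit_rate_at_k(ranked_items, "item_9")
ndcg = ndcg_at_k(ranked_items, "item_9")
```

Both metrics are averaged over users in practice. HR@10 only records whether the target was retrieved, while NDCG@10 also rewards placing it nearer the top, which is why the two can improve by different relative amounts.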

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG 版本更新

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

即时主动适应:面向地理空间发现的相关性引导在线元学习与潜在概念

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学安娜堡分校) Washington University in St. Louis, St. Louis, MO, USA(圣路易斯华盛顿大学)

AI总结 提出一个统一的地理空间发现框架,结合主动学习、在线元学习和概念引导推理,通过概念加权不确定性采样和相关性感知元批次形成策略,在有限数据和动态环境下高效发现隐藏目标。

详情
AI中文摘要

在环境监测中,数据收集通常成本高昂、稀疏且受紧急公共卫生需求影响。这对于致癌的PFAS(全氟和多氟烷基物质)污染尤其如此,与领域专家和环境组织的讨论强调需要在有限的采样预算下战略性地识别高风险、观测不足的区域。更广泛地说,在灾害响应和公共卫生环境中也出现了类似的挑战,动态环境使得从有限的地面实况中高效发现隐藏目标变得至关重要。然而,稀疏且有偏差的地理空间标签限制了现有基于学习方法(如强化学习)的适用性。为了解决这个问题,我们提出了一个统一的地理空间发现框架,该框架集成了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*概念的关键创新,该概念捕捉领域特定因素如何影响目标存在:一个*概念加权不确定性采样策略*,其中不确定性通过从现成概念(如土地覆盖和源距离)学习到的相关性进行调节;以及一个*相关性感知元批次形成策略*,该策略在在线元更新期间促进语义多样性,提高动态环境中的泛化能力。我们在PFAS污染发现任务上评估了我们的框架,这是一个受真实世界启发的环境监测任务,展示了在有限数据和变化条件下鲁棒的目标发现能力。

英文摘要

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.
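One way to read the concept-weighted uncertainty sampling described above is as predictive entropy scaled by a learned relevance score. The sketch below follows that reading. The sigmoid relevance model, the product form of the acquisition score, and all names are assumptions made for illustration, not the authors' formulation.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def concept_relevance(concepts, weights):
    """Assumed relevance model: sigmoid of a weighted sum of concept values."""
    score = sum(w * c for w, c in zip(weights, concepts))
    return 1.0 / (1.0 + math.exp(-score))


def select_queries(candidates, weights, budget):
    """Pick the `budget` sites with the highest relevance-weighted uncertainty.

    candidates: list of (site_id, predicted_probability, concept_vector).
    """
    scored = [
        (binary_entropy(p) * concept_relevance(c, weights), site)
        for site, p, c in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [site for _, site in scored[:budget]]
```

Under this scoring, two equally uncertain sites are separated by their concept values (for example land cover or source proximity), so the sampling budget goes first to the site whose concepts suggest a target is plausible.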

2510.03352 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University(电气与计算机工程系,德克萨斯A&M大学)

AI总结 提出一种即插即用、无需训练的推理时搜索框架,将侧信息融入现有扩散模型逆问题求解器,显著提升重建质量。

详情
AI中文摘要

扩散模型已被用作解决逆问题的先验。然而,现有方法通常忽略了能够显著提高重建质量的侧信息,尤其是在严重病态设置中。在这项工作中,我们提出了一种新颖的框架,通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题(包括图像修复、超分辨率和几种去模糊任务)以及多种基于扩散模型的逆问题求解器(DPS、DAPS和MPGD)上的大量实验,我们表明,用我们的框架增强每个求解器,其重建质量始终优于相应的原始方法。为了展示我们方法的通用性,我们考虑了多种形式的侧信息,包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取:https://github.com/mahdi-farahbakhsh/DISS。

英文摘要

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World: 面向视频世界模型的潜在动作定向

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore Research (A STAR), Singapore

AI总结 提出SeqΔ-REPA对齐目标,通过冻结自监督视频编码器的时序特征差异锚定潜在动作,实现无标签视频中可迁移的动作控制世界模型预训练。

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

详情
AI中文摘要

扩展动作可控世界模型受限于动作标签的稀缺性。虽然潜在动作学习有望从无标签视频中提取控制接口,但学习到的潜在表示往往难以跨上下文迁移:它们纠缠了场景特定线索,缺乏共享坐标系。这是因为标准目标仅在每个片段内操作,没有提供跨上下文对齐动作语义的机制。我们的关键洞察是,尽管动作未被观测到,但其语义效果是可观测的,可以作为共享参考。我们引入SeqΔ-REPA,一种序列级控制效果对齐目标,将集成潜在动作锚定到来自冻结自监督视频编码器的时序特征差异。基于此,我们提出Olaf-World,一个从大规模被动视频中预训练动作条件视频世界模型的流程。大量实验表明,我们的方法学习了更结构化的潜在动作空间,从而在零样本动作迁移和适应新控制接口的数据效率上优于最先进的基线方法。

英文摘要

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
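The alignment idea in this abstract, anchoring an integrated latent action to a temporal feature difference, can be illustrated with a toy loss. This is only a sketch under strong assumptions: integration is taken to be a plain sum of per-step latents, the latents are assumed to already live in the encoder's feature space (a real system would likely need a learned projection), and the loss is one minus cosine similarity. None of these choices is confirmed by the abstract, and the function names are hypothetical.

```python
import math


def vec_sum(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def vec_sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def seq_delta_alignment_loss(latent_actions, frame_features):
    """Toy sequence-level alignment loss.

    latent_actions: per-step latent action vectors for one clip.
    frame_features: per-frame features from a frozen encoder (assumed given).
    """
    integrated = vec_sum(latent_actions)                     # assumed integration
    effect = vec_sub(frame_features[-1], frame_features[0])  # temporal difference
    return 1.0 - cosine(integrated, effect)
```

Because the target depends only on how the frozen features change over the clip, clips from different scenes that show the same effect would share the same reference direction in this sketch.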

2602.09878 2026-05-27 cs.CV 版本更新

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

MVISTA-4D: 用于机器人操作的视图一致4D世界模型与测试时动作推理

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

发表机构 * MMLab, The Chinese University of Hong Kong, Hong Kong SAR(香港中文大学MMLab,香港特别行政区) The Hong Kong University of Science(香港理工大学) The University of Hong Kong(香港大学) Tsinghua University(清华大学)

AI总结 提出一种基于世界模型的4D场景生成方法,通过多视图RGBD预测和测试时动作优化,实现几何一致的4D动态预测与机器人操作。

详情
Journal ref
International Conference on Machine Learning 2026
AI中文摘要

基于世界模型的“想象-然后行动”范式成为机器人操作的一种有前景的方法,但现有方法通常仅支持纯图像预测或部分3D几何推理,限制了其预测完整4D场景动态的能力。本文提出了一种新颖的具身4D世界模型,能够实现几何一致、任意视图的RGBD生成:仅以单视图RGBD观测作为输入,模型想象其余视角,然后通过反投影和融合构建跨时间的更完整3D结构。为了高效学习多视图、跨模态生成,我们明确设计了跨视图和跨模态特征融合,共同促进RGB与深度之间的一致性,并强制视图间的几何对齐。除了预测,将生成的未来转换为动作通常由逆动力学处理,但这是病态的,因为多个动作可以解释相同的状态转换。我们通过一种测试时动作优化策略来解决这个问题,该策略通过生成模型反向传播以推断与预测未来最佳匹配的轨迹级潜在变量,以及一个残差逆动力学模型,将该轨迹先验转换为精确的可执行动作。在三个数据集上的实验表明,该方法在4D场景生成和下游操作任务上均表现出色,消融实验为关键设计选择提供了实用见解。

英文摘要

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

2511.16449 2026-05-27 cs.CV cs.AI 版本更新

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

弥合视觉令牌剪枝中的语义-动作鸿沟以实现高效VLA推理

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) University of Science and Technology of China(中国科学技术大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) BAAI(北京人工智能研究院)

AI总结 提出VLA-Pruner方法,通过结合语义预填充和时序平滑的动作相关性估计视觉令牌重要性,并采用Combine-then-Filter策略,在保持操作质量的同时实现高达1.99倍加速。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过整合视觉感知、语言理解和动作执行,在具身人工智能中展现出巨大潜力。在实时部署中,这些模型必须处理连续的视觉流,产生大量计算开销。视觉令牌剪枝——一种通过保留显著令牌同时丢弃冗余令牌来加速视觉-语言模型(VLM)的主流技术——为这一挑战提供了自然的候选解决方案。然而,直接将面向VLM的剪枝方法应用于VLA推理会导致操作性能严重下降。我们的分析将这种下降归因于一个关键不匹配:VLA推理在视觉-语言预填充阶段和动作解码阶段表现出不同的注意力模式,因此仅基于上下文预填充语义显著性的剪枝偏向语义线索,可能移除动作关键的视觉令牌。受此观察启发,我们提出VLA-Pruner,一种有效的即插即用令牌剪枝方法,基于VLA推理的视觉需求,并进一步利用机器人操作的时间连续性。具体来说,VLA-Pruner从语义预填充和时序平滑的动作相关性两方面估计视觉令牌重要性,然后采用Combine-then-Filter策略,在计算预算下保留紧凑、非冗余的令牌。实验表明,VLA-Pruner在多种VLA架构上优于最先进方法,在相当的操作质量下实现高达1.99倍加速。

英文摘要

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.
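The two-signal pruning idea above can be sketched roughly as follows. The abstract does not spell out how Combine-then-Filter works, so everything here is an assumption for illustration: temporal smoothing is an exponential moving average, "combine" is the union of the top tokens under each score, and "filter" is a greedy pass that skips tokens nearly identical (by cosine similarity) to ones already kept.

```python
import math


def ema(previous, current, alpha=0.5):
    """Exponential moving average of per-token action relevance across steps."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(previous, current)]


def top_indices(scores, m):
    """Indices of the m highest scores (ties keep the lower index first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combine_then_filter(semantic, action, features, budget, max_sim=0.95):
    """Keep at most `budget` token indices using both importance signals."""
    # Combine: candidates are the top tokens under either criterion.
    candidates = set(top_indices(semantic, budget)) | set(top_indices(action, budget))
    ordered = sorted(
        sorted(candidates),
        key=lambda i: max(semantic[i], action[i]),
        reverse=True,
    )
    # Filter: greedily keep tokens that are not near-duplicates of kept ones.
    kept = []
    for i in ordered:
        if len(kept) == budget:
            break
        if all(cosine(features[i], features[j]) <= max_sim for j in kept):
            kept.append(i)
    return sorted(kept)
```

In this sketch a token that matters only for action decoding can survive even when its semantic score is low, which is the failure case the abstract attributes to prefill-only pruning.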

2511.06625 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

可解释的跨疾病推理:基于低剂量计算机断层扫描的心血管风险评估

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Department of Radiation Oncology(放射肿瘤学部) Winship Cancer Institute, Emory University(埃默里大学Winship癌症研究所)

AI总结 提出一种可解释的跨疾病推理框架,通过提取肺部发现、基于医学知识进行跨器官机制推理,并结合心脏子体积特征,从低剂量胸部CT中实现心血管风险评估,在NLST队列中AUC达0.919。

详情
AI中文摘要

低剂量胸部计算机断层扫描(LDCT)在一次扫描中捕获肺部和心脏结构,使得能够联合评估肺部和心血管健康。现有方法通常独立建模这些领域,并未明确表示它们的生理交互。我们提出了一种可解释的跨疾病推理框架,用于从LDCT进行心血管风险评估。该框架遵循受限的临床信息路径:它提取肺部发现,将跨器官机制基于医学知识进行推理,并生成带有自然语言理由的心血管预测。它结合了四个组件:一个冻结的肺风险先验、一个肺部感知模块、一个代理推理模块和一个心脏子体积特征提取器。它们的输出被融合,以将局部心脏证据与机制层面的肺部上下文整合。在国家肺筛查试验队列中,该框架在CVD筛查中达到0.919的AUC,在CVD死亡率预测中高达0.838,优于心脏特异性、单疾病和基础模型基线。目标对照表明,这些增益不能仅由额外的胸部视觉特征、固定规则传播或单一推理后端解释。因此,所提出的框架提供了一种可审计的方法,用于从LDCT进行跨疾病心血管风险评估。

英文摘要

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.
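For readers less familiar with the AUC values reported above, the short sketch below computes ROC AUC through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. It is a generic illustration, not the paper's evaluation code, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted risk scores, same length as labels.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.919 therefore means that in roughly 92% of positive/negative pairs the model scores the positive case higher.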

2507.13428 2026-05-27 cs.CV cs.AI 版本更新

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

PhyWorldBench:文本到视频模型中物理真实性的全面评估

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校) NVIDIA Research(NVIDIA研究) Northeastern University(东北大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出PhyWorldBench基准,通过1050个提示评估12个视频生成模型在物理规律遵循上的表现,并引入反物理类别,利用多模态大语言模型进行零样本评估。

Comments 35 pages, 21 figures

详情
Journal ref
ICLR 2026 oral
AI中文摘要

视频生成模型在创建高质量、逼真内容方面取得了显著进展。然而,它们准确模拟物理现象的能力仍然是一个关键且未解决的挑战。本文提出了PhyWorldBench,一个全面的基准测试,旨在根据视频生成模型对物理定律的遵循程度进行评估。该基准涵盖了多个层次的物理现象,从基本物理原理如物体运动和能量守恒,到更复杂的场景如刚体相互作用以及人或动物的运动。此外,我们引入了一个新颖的反物理类别,其中提示故意违反现实世界的物理规律,从而评估模型在保持逻辑一致性的同时能否遵循此类指令。除了大规模人工评估外,我们还设计了一种简单而有效的方法,利用当前的多模态大语言模型以零样本方式评估物理真实性。我们评估了12个最先进的文本到视频生成模型,包括五个开源模型和五个专有模型,并进行了详细的比较和分析。通过对跨越基础、复合和反物理场景的1050个精心策划的提示进行系统测试,我们识别出这些模型在遵循现实世界物理规律方面面临的关键挑战。我们进一步研究了它们在不同物理现象和提示类型下的表现,并得出了针对性的建议,以构建增强物理原理保真度的提示。

英文摘要

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

2512.06609 2026-05-27 cs.LG cs.CV 版本更新

Training-Free Vector Quantization via Gaussian VAEs

基于高斯VAE的无训练向量量化

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

发表机构 * AIR, Tsinghua University(清华大学智能产业研究院) CST, Tsinghua University(清华大学计算机科学与技术系) University of Cambridge(剑桥大学)

AI总结 提出Gaussian Quant (GQ)方法,通过约束训练高斯VAE并直接转换为VQ-VAE,无需额外训练,在UNet和ViT架构上优于现有VQ-VAE。

详情
AI中文摘要

向量量化变分自编码器(VQ-VAEs)是将图像压缩为离散标记的离散自编码器。然而,由于离散化,它们难以训练。在本文中,我们提出了一种简单而有效的技术,称为Gaussian Quant (GQ),它首先在特定约束下训练高斯VAE,然后将其转换为VQ-VAE,无需额外训练。对于转换,GQ生成随机高斯噪声作为码本,并找到最接近后验均值的噪声向量。理论上,我们证明当码本大小的对数超过高斯VAE的bits-back编码率时,可以保证较小的量化误差。实际上,我们提出了一种启发式方法来训练高斯VAE以实现有效转换,称为目标散度约束(TDC)。实验上,我们表明GQ在UNet和ViT架构上均优于先前的VQ-VAE,如VQGAN、FSQ、LFQ和BSQ。此外,TDC还改进了先前的离散化方法,如TokenBridge。源代码见https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE。

英文摘要

Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
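The conversion step described above (draw random Gaussian noise as a codebook, then pick the noise vector closest to the posterior mean) can be sketched in a few lines. Several details are assumptions: closeness is taken to be plain squared Euclidean distance, the rate is measured in bits so the size check uses a base-2 logarithm, and the function names are illustrative. The released repository may differ on all of these.

```python
import math
import random


def make_codebook(size, dim, seed=0):
    """Codebook of `size` i.i.d. standard Gaussian vectors; no training involved."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]


def quantize(posterior_mean, codebook):
    """Index of the codebook vector nearest to the posterior mean."""
    def squared_distance(vector):
        return sum((a - b) ** 2 for a, b in zip(posterior_mean, vector))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def codebook_large_enough(size, rate_bits):
    """Condition from the abstract: log(codebook size) exceeds the coding rate."""
    return math.log2(size) > rate_bits
```

Encoder and decoder can regenerate the same codebook from a shared seed, so in this sketch only the integer index has to be stored as the discrete token.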

2511.16870 2026-05-27 cs.CV cs.LG 版本更新

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representation Alignment

对齐与反转:通过表示对齐解决扩散和流模型中的逆问题

Loukas Sfountouris, Giannis Daras, Paris Giampouras

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出将扩散或流模型的内部表示与预训练自监督编码器(DINOv2)对齐(REPA),在推理时引导逆问题重建,显著提升重建质量和感知真实感。

详情
AI中文摘要

最近研究表明,强制扩散或流生成模型的内部表示与预训练自监督编码器的表示对齐,提供了强大的归纳偏置,改善了收敛性和样本质量。在这项工作中,我们将这一思想扩展到逆问题,其中预训练生成模型被用作先验。我们提出在扩散或流模型与DINOv2视觉编码器之间应用表示对齐(REPA),以在推理时指导重建过程。尽管逆问题中无法获得真实信号,但我们实验表明,对齐模型对近似目标特征的表示可以显著提升重建质量和感知真实感。我们提供了理论结果,显示(a) REPA正则化可以视为在DINOv2嵌入空间中最小化散度度量的变分方法,(b) 在一定的正则性假设下,REPA更新将潜在扩散状态引导向干净图像的状态。这些结果揭示了REPA在提升感知保真度中的作用。最后,我们通过将REPA集成到多个最先进的逆问题求解器中证明了方法的通用性,并在超分辨率、框内补全、高斯去模糊和运动去模糊上进行了大量实验,证实我们的方法一致地改善了重建质量,同时通过减少所需的离散化步骤数提高了效率。

英文摘要

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a DINOv2 visual encoder, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we empirically show that aligning model representations of approximate target features can substantially enhance reconstruction quality and perceptual realism. We provide theoretical results showing (a) that REPA regularization can be viewed as a variational approach for minimizing a divergence measure in the DINOv2 embedding space, and (b) how under certain regularity assumptions REPA updates steer the latent diffusion states toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating REPA into multiple state-of-the-art inverse problem solvers, and provide extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirming that our method consistently improves reconstruction quality, while also providing efficiency gains by reducing the number of required discretization steps.

2508.02806 2026-05-27 cs.CV cs.LG 版本更新

PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

PyCAT4: 基于层次化视觉Transformer的3D人体姿态估计框架

Zongyou Yang, Jonathan Loo, Yinghan Hou

发表机构 * Department of Computer Science(计算机科学系) University College London(伦敦大学学院) School of Electronic Engineering(电子工程学院) Queen Mary University of London(伦敦玛丽女王大学) Department of Earth Science(地球科学系) Imperial College London(帝国理工学院)

AI总结 本研究提出PyCAT4框架,通过引入自注意力机制的Transformer特征提取层、特征时间融合技术和空间金字塔结构,优化Pymaf网络,在COCO和3DPW数据集上显著提升3D人体姿态估计的检测能力。

Comments 10 pages, 20 figures

详情
AI中文摘要

近年来,通过将卷积神经网络与金字塔网格对齐反馈循环相结合,3D人体姿态估计的准确性得到了显著提升。此外,基于Transformer的时间分析架构的采用在计算机视觉领域取得了创新性突破。鉴于这些进展,本研究旨在深度优化和改进现有的Pymaf网络架构。本文的主要创新包括:(1) 引入基于自注意力机制的Transformer特征提取网络层,以增强对低级特征的捕获;(2) 通过特征时间融合技术增强对视频序列中时间信号的理解和捕获;(3) 实现空间金字塔结构以实现多尺度特征融合,有效平衡不同尺度下的特征表示差异。本研究得到的新PyCAT4模型在COCO和3DPW数据集上进行了实验验证。结果表明,所提出的改进策略显著提升了网络在人体姿态估计中的检测能力,进一步推动了人体姿态估计技术的发展。

英文摘要

Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing differences in feature representations across scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network's detection capability in human pose estimation, further advancing the development of human pose estimation technology.

2601.15283 2026-05-27 cs.CV cs.GR 版本更新

LuxRemix: Lighting Decomposition and Remixing for Indoor Scenes

LuxRemix: 室内场景的光照分解与重新混合

Ruofan Liang, Norman Müller, Ethan Weber, Duncan Zauss, Nandita Vijaykumar, Peter Kontschieder, Christian Richardt

发表机构 * Meta Reality Labs(Meta现实实验室) University of Toronto(多伦多大学)

AI总结 提出一种基于图像的光照分解模型,从多视图场景捕获中分解室内光照为独立光源,并通过多视图光照协调集成到可重光照的3D高斯溅射表示中,实现交互式光源编辑。

Comments CVPR 2026. Project page: https://luxremix.github.io

详情
AI中文摘要

我们提出了一种新颖的方法,用于从单个多视图场景捕获中对室内场景进行交互式光照编辑。我们的方法利用基于生成图像的光照分解模型,将复杂的室内场景照明分解为其组成光源。这种分解能够独立操作各个光源,特别是控制其状态(开/关)、色度和强度。我们进一步引入了多视图光照协调,以确保光照分解在所有场景视图中的一致传播。这被集成到一个可重光照的3D高斯溅射表示中,提供对单个光源的实时交互控制。我们的结果展示了在多种室内场景中高度逼真的光照分解和重光照效果。我们在合成和真实世界数据集上评估了我们的方法,并与最先进的技术进行了定量和定性比较。视频结果和交互演示请参见 https://luxremix.github.io。

英文摘要

We present a novel approach for interactive light editing in indoor scenes from a single multi-view scene capture. Our method leverages a generative image-based light decomposition model that factorizes complex indoor scene illumination into its constituent light sources. This factorization enables independent manipulation of individual light sources, specifically allowing control over their state (on/off), chromaticity, and intensity. We further introduce multi-view lighting harmonization to ensure consistent propagation of the lighting decomposition across all scene views. This is integrated into a relightable 3D Gaussian splatting representation, providing real-time interactive control over the individual light sources. Our results demonstrate highly photorealistic lighting decomposition and relighting outcomes across diverse indoor scenes. We evaluate our method on both synthetic and real-world datasets and provide a quantitative and qualitative comparison to state-of-the-art techniques. For video results and interactive demos, see https://luxremix.github.io.

2601.14702 2026-05-27 cs.AI cs.CV cs.RO 版本更新

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Drive-P2D:自动驾驶中视觉语言模型的渐进式感知到决策基准

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学)

AI总结 提出Drive-P2D基准,通过分离推理与答案的协议,在目标、场景和决策三个层级上评估视觉语言模型的感知到决策能力,并分析错误模式。

详情
AI中文摘要

自动驾驶需要在复杂场景中实现可靠的感知和安全的决策。最近的视觉语言模型(VLM)展示了推理和泛化能力,为自动驾驶开辟了新的可能性;然而,现有的基准通常分别评估感知和决策,通过仅选择格式限制故障分析,或通过LLM评分的长格式输出引入评估偏差。为了解决这些问题,我们提出了Drive-P2D,一个渐进式感知到决策基准,包含6650个问题,涵盖目标、场景和决策三个层级。Drive-P2D采用分离的推理与答案协议:最终答案客观评分,而推理则用于分析沿渐进式感知到决策链暴露的错误模式。我们评估了所有场景和高风险场景下的主流VLM,并通过相关性分析和相似场景鲁棒性测试进一步刻画了感知到决策的能力边界。推理进一步揭示了逻辑推理错误和语义特征遗漏等故障模式,我们训练了一个轻量级分析器模型来自动化大规模推理错误模式标注。这些设计共同为构建更安全、更可靠的用于现实世界自动驾驶的VLM提供了实用见解。

英文摘要

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks often evaluate perception and decision-making separately, limit failure analysis with choice-only formats, or introduce evaluation bias through LLM-scored long-form outputs. To address these issues, we present Drive-P2D, a progressive perception-to-decision benchmark with 6,650 questions across Object, Scene, and Decision levels. Drive-P2D adopts a separated reasoning-and-answer protocol: final answers are scored objectively, while reasoning is analyzed to identify error modes exposed along the progressive perception-to-decision chain. We evaluate mainstream VLMs across all and high-risk scenarios, and further characterize the perception-to-decision capability boundary through correlation analysis and similar-scene robustness testing. Reasoning further exposes failure modes such as logical reasoning errors and semantic feature omissions, and we train a lightweight analyzer model to automate large-scale error-mode annotation of reasoning. Together, these designs provide practical insights for building safer and more reliable VLMs for real-world autonomous driving.

2601.12809 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

CLIP风格视觉语言模型在合成空间关系数据训练中的左右对称性破缺

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

发表机构 * InfoTech, Toyota Motor Corporation(丰田汽车公司信息科技部门)

AI总结 通过可控一维图像文本测试平台,研究基于Transformer的视觉语言编码器在CLIP风格对比学习下如何通过位置与标记嵌入交互产生左右关系理解,并发现标签多样性比布局多样性更关键。

Comments Accepted at ICML 2026

详情
AI中文摘要

空间理解仍然是视觉语言模型中的一个关键挑战。然而,这种理解是否真正获得,如果是,通过什么机制,目前尚不清楚。我们提出了一个可控的一维图像文本测试平台,以探究在基于Transformer的视觉和文本编码器中,使用CLIP风格的对比目标训练时,左右关系理解是如何出现的。我们在单对象和双对象场景的配对描述上端到端地训练轻量级基于Transformer的视觉和文本编码器,并评估对未见对象对的泛化能力,同时系统性地改变标签和布局多样性。我们发现对比训练学习了左右关系,并且标签多样性(而非布局多样性)是这种情况下泛化的主要驱动因素。为了获得机制性理解,我们进行了注意力分解,并表明位置嵌入和标记嵌入之间的相互作用导致了水平注意力梯度,从而打破了编码器中的左右对称性;消除这一贡献会显著降低左右辨别能力。我们的结果提供了关于CLIP风格模型何时以及如何获得关系能力的机制性见解。

英文摘要

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain a mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.

2511.02360 2026-05-27 cs.CV cs.CL 版本更新

LaRe: Latent Refocusing for Multimodal Reasoning

LaRe: 用于多模态推理的潜在重聚焦

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院)

AI总结 提出LaRe范式,在潜在空间内进行视觉重聚焦,结合语义增强训练,在提升推理准确率的同时大幅减少推理所需token数。

详情
AI中文摘要

思维链推理通过分解复杂任务提升逻辑性能,但其多模态扩展面临权衡。主流的“用图像思考”范式通过显式裁剪图像区域实现视觉重聚焦,但导致计算开销快速增长。新兴的潜在空间推理范式减少了token消耗,但缺乏动态重聚焦能力。我们认为这种权衡源于一个默认前提:有效的视觉重聚焦必须以显式token的形式发生。基于此,我们提出潜在重聚焦(LaRe),一种新的多模态推理范式,其中视觉重聚焦完全在潜在空间内进行。我们进一步设计了一种语义增强训练策略,通过视觉重建目标确保潜在空间的语义结构。实验评估表明,与现有基线相比,LaRe将平均准确率提高了7.6%,同时将推理所需的token数量减少了59.7%。当扩展到8B参数的视觉语言模型骨干时,LaRe实现了与最先进方法相当的性能,证明了我们提出的潜在重聚焦范式在多模态推理中的有效性。

英文摘要

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, yet incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

2601.08375 2026-05-27 cs.CV 版本更新

Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

地理空间点云语义分割的无源域适应

Yuan Gao, Di Cao, Xiaohuan Xi, Sheng Nie, Shaobo Xia, Cheng Wang

发表机构 * Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院空天信息创新研究院) International Research Center of Big Data for Sustainable Development Goals(可持续发展目标大数据国际研究中心) University of Chinese Academy of Sciences(中国科学院大学) Zhengzhou Institute for Advanced Research of Henan Polytechnic University(河南理工大学郑州研究院) Henan Polytechnic University(河南理工大学) School of Aeronautic Engineering, Changsha University of Science and Technology(长沙理工大学航空工程学院) China University of Geosciences, Beijing(中国地质大学(北京))

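AI summary: Proposes LoGo, a source-free domain adaptation framework that addresses domain shift in geospatial point cloud semantic segmentation through local class-balanced prototype estimation and global optimal-transport distribution alignment.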

详情
AI中文摘要

3D地理空间点云的语义分割是遥感应用的基础,但由区域和采集相关变化引起的域偏移通常会降低模型性能。尽管域适应可以缓解这种偏移,但现有方法通常需要访问源域数据,由于隐私问题和监管政策,这往往不可行。为了解决这个问题,我们提出了LoGo(局部-全局双共识),一种新颖的无源无监督域适应(SFUDA)框架,仅需要预训练模型和无标签目标数据。在局部层面,我们引入了一个类平衡原型估计模块,确保即使对于样本稀缺的尾部类别也能生成鲁棒的特征原型,有效缓解长尾分布引起的特征崩溃。在全局层面,我们引入了一个基于最优传输的全局分布对齐模块,将伪标签分配公式化为全局优化问题,有效纠正局部贪婪分配中头部类别的过度主导,从而防止模型预测严重偏向多数类别。最后,我们提出了一种双一致性伪标签过滤机制,仅保留局部多增强集成预测与全局最优传输分配一致的高置信度伪标签用于自训练。在两个具有挑战性的基准测试(包括跨场景和跨传感器设置)上的大量实验表明,LoGo始终优于现有的最先进方法。源代码可在 https://github.com/GYproject/LoGo-SFUDA 获取。

英文摘要

Semantic segmentation of 3D geospatial point clouds is fundamental to remote sensing applications, yet domain shifts caused by regional and acquisition-related variations often degrade model performance. Although domain adaptation can mitigate such shifts, existing methods typically require access to source-domain data, which is often infeasible due to privacy concerns and regulatory policies. To address this, we propose LoGo (Local-Global Dual-Consensus), a novel source-free unsupervised domain adaptation (SFUDA) framework requiring only a pretrained model and unlabeled target data. At the local level, we introduce a class-balanced prototype estimation module that ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem, effectively correcting the over-dominance of head classes inherent in local greedy assignments, and thereby preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism that retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training. Extensive experiments on two challenging benchmarks, encompassing cross-scene and cross-sensor settings, demonstrate that LoGo consistently outperforms existing state-of-the-art methods. The source code is available at https://github.com/GYproject/LoGo-SFUDA.
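The global assignment idea in this abstract can be illustrated with a small sketch. The abstract does not name a solver, so the Sinkhorn iteration, the uniform class marginal, the regularization value, and the names `sinkhorn_assign` and `hard_labels` are all assumptions made for illustration, not the authors' implementation.

```python
def sinkhorn_assign(probs, class_marginal, n_iters=50, eps=0.5):
    """Entropic optimal transport between points and classes (hypothetical helper).

    probs          -- per-point class probabilities from the pretrained model
    class_marginal -- assumed target share of each class (sums to 1)
    Returns a soft assignment matrix (rows: points, columns: classes).
    """
    n, k = len(probs), len(class_marginal)
    # Gibbs kernel exp(-cost / eps) with cost = -log p, which equals p ** (1 / eps).
    kernel = [[max(p, 1e-12) ** (1.0 / eps) for p in row] for row in probs]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Row scaling: every point carries mass 1/n.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Column scaling: every class receives its target share.
        v = [class_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


def hard_labels(matrix):
    """Index of the largest entry in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Four points whose greedy (argmax) labels all fall on the head class 0.
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.55, 0.45]]
greedy = hard_labels(probs)
balanced = hard_labels(sinkhorn_assign(probs, [0.5, 0.5]))
```

In this toy case the greedy labels are dominated by class 0, while the transport plan moves the two least confident points to class 1. This is the kind of head-class correction the abstract describes, shown here under the stated assumptions only.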

2601.07737 2026-05-27 cs.CV cs.AI 版本更新

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

看见 vs. 相信:评估开源多模态大模型在反直觉场景中的语言偏见

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

Affiliations: Zhejiang University; Beijing University of Posts and Telecommunications; Hong Kong University of Science and Technology (Guangzhou)

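AI summary: To evaluate how well multimodal LLMs handle counter-intuitive action scenes, the authors propose the CAIT benchmark (400 high-fidelity synthetic scenes). Open-source models disregard visual evidence in favor of language priors and perform near chance level. Chain-of-Thought reasoning raises accuracy but leads models to overthink and reject the visual content. Fine-tuning and structured prompting mitigate this bias.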

详情
AI中文摘要

多模态大语言模型(MLLMs)在主流视觉理解任务中表现出色,但其处理违背日常常识的动作场景的能力尚未得到充分测试。为填补这一空白,我们引入了CAIT,一个包含400个高保真合成场景的基准,专注于反直觉的视觉动作,例如“兔子在追老虎”,其中视觉证据明确违背常识预期。我们评估了人类、领先的专有模型(如Claude和Gemini)以及14个代表性的开源MLLMs。人类达到近乎完美的性能(约0.95准确率),专有模型表现出稳健的理解(达到0.88准确率),而标准的开源指令微调模型性能处于随机水平。进一步分析表明,这种失败是由强烈的语言先验驱动的:模型不信任视觉输入,而是自动用统计上常见的文本描述覆盖异常的视觉信号。尽管引入链式思维推理机制可以提高准确率,但会显著减慢响应速度并产生新的失败模式:模型过度思考场景,仅仅因为违反现实物理定律而拒绝接受实际的视觉内容。最后,我们证明有针对性的微调和结构化提示可以有效缓解这种对语言先验的依赖,使开源模型能够基于实际视觉证据准确地进行推理。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as "a rabbit is chasing a tiger", where visual evidence explicitly contradicts common-sense expectations. We evaluate humans, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. While humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), standard open-source instruction-tuned models perform at chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, these models automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

2601.05729 2026-05-27 cs.CV 版本更新

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

TAGRPO: 通过直接轨迹对齐提升图像到视频生成中的GRPO

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

Affiliations: The University of Hong Kong; Tencent Hunyuan

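AI summary: To address the weak gains of GRPO optimization in image-to-video generation, the authors propose TAGRPO, a contrastive-learning-inspired framework that aligns intermediate latents with high-reward trajectories and pushes them away from low-reward ones, and adds a memory bank to improve diversity. It significantly outperforms DanceGRPO.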

Comments 18 pages, 12 figures

详情
AI中文摘要

近期研究表明,将组相对策略优化(GRPO)集成到流匹配模型中,特别是在文本到图像和文本到视频生成中,具有显著效果。然而,我们发现将这些技术直接应用于图像到视频(I2V)模型往往无法带来一致的奖励提升。为解决这一局限,我们提出了TAGRPO,一个受对比学习启发的鲁棒后训练框架,适用于I2V模型。我们的方法基于以下观察:从相同初始噪声生成的rollout视频为优化提供了更优的指导。基于这一洞察,我们提出了一种应用于中间潜变量的新型GRPO损失,鼓励直接对齐高奖励轨迹,同时最大化与低奖励轨迹的距离。此外,我们引入了一个用于rollout视频的记忆库,以增强多样性并降低计算开销。尽管方法简单,TAGRPO在I2V生成中相比DanceGRPO取得了显著改进。相关成果将在 https://tagrpo.github.io/ 更新。

英文摘要

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation. The deliverables will be updated at https://tagrpo.github.io/.
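For readers unfamiliar with GRPO, the group-relative advantage that separates high-reward from low-reward rollouts can be sketched as follows. This shows only the standard normalization step that GRPO-style methods share. It is not the TAGRPO latent-alignment loss, whose exact form the abstract does not give. The function name and the reward values are hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three rollouts generated from the same initial noise.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
# Rollouts with positive advantage would act as the "high-reward" targets,
# those with negative advantage as the "low-reward" ones to move away from.
high = [i for i, a in enumerate(advantages) if a > 0]
low = [i for i, a in enumerate(advantages) if a < 0]
```

How the high and low sets are then used on intermediate latents is specific to the paper and is not reproduced here.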

2601.01608 2026-05-27 cs.CV 版本更新

Guiding Token-Sparse Diffusion Models

引导令牌稀疏扩散模型

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

Affiliations: CompVis

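AI summary: Sparsely trained diffusion models respond poorly to classifier-free guidance at inference. The authors propose a token-level Sparse Guidance method that keeps outputs high-quality and high-variance while lowering compute cost.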

详情
AI中文摘要

扩散模型在图像合成中质量高,但训练和推理成本昂贵。近期工作利用视觉内容固有的冗余性,仅对视觉信息子集进行训练以降低训练成本。虽然这些方法成功实现了更便宜且更有效的训练,但稀疏训练的扩散模型在推理时表现不佳,原因是它们对无分类器引导(CFG)响应不足。为解决此问题,我们提出稀疏引导(SG)。SG不使用条件丢弃作为引导扩散模型的信号,而是使用令牌级稀疏性。因此,SG更好地保留了条件预测的高方差,实现了高质量和高方差输出。在推理时利用令牌级稀疏性,SG以更低的计算量提高了保真度,在常用的ImageNet-256基准上以25%更少的FLOPs实现了1.58 FID,并在匹配基线质量时节省高达58%的FLOPs。为证明稀疏引导的有效性,我们使用训练时稀疏性训练了一个2.5B文本到图像扩散模型,并在推理时利用SG。SG在提高吞吐量的同时,在构图和人类偏好评分上取得了改进。

英文摘要

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle at inference. This is because they respond poorly to Classifier-free Guidance (CFG), which leads to underwhelming inference performance. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving high-quality, high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while also increasing throughput.
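The guidance mechanism discussed here can be pictured as an extrapolation from a weaker prediction toward a stronger one. The sketch below uses the common classifier-free guidance parametrization. Treating a token-sparse forward pass as the weaker branch is an assumption based on the abstract's wording, and `guided_prediction` is a hypothetical name.

```python
def guided_prediction(strong, weak, scale):
    """Extrapolate from the weak prediction toward the strong one.

    With `weak` set to the unconditional output this is standard
    classifier-free guidance; scale = 1.0 returns `strong` unchanged.
    Under the assumption above, Sparse Guidance would instead pass the
    output of a token-sparse forward pass as `weak`.
    """
    return [w + scale * (s - w) for s, w in zip(strong, weak)]


# Toy two-dimensional predictions with made-up values.
full = [1.0, 2.0]
sparse = [0.5, 1.0]
guided = guided_prediction(full, sparse, scale=2.0)  # -> [1.5, 3.0]
```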

2512.22666 2026-05-27 cs.CV cs.LG 版本更新

INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

INTERACT-CMIL:用于结膜黑色素细胞上皮内病变分级的任务共享学习与任务间一致性

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

Affiliations: Clinic for Diagnostic and Interventional Radiology, University Hospital Bonn, Germany; Department of Ophthalmology, Friedrich-Alexander University Erlangen-Nürnberg, Germany; TUM School of Computation, Information and Technology, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany; Helmholtz AI, Helmholtz Center Munich, Germany

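AI summary: Proposes INTERACT-CMIL, a multi-task deep learning framework that jointly predicts five histopathological axes through shared feature learning, combinatorial partial supervision, and an inter-task consistency loss. On a dataset of 486 conjunctival biopsy patches it achieves relative macro-F1 gains of up to 55.1% over CNN and foundation-model baselines.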

详情
Journal ref
IEEE ISBI 2026
AI中文摘要

结膜黑色素细胞上皮内病变(CMIL)的准确分级对于治疗和黑色素瘤预测至关重要,但由于细微的形态学线索和相互关联的诊断标准,仍然困难。我们提出INTERACT-CMIL,一个多头深度学习框架,通过共享特征学习与组合部分监督以及强制跨任务一致性的相互依赖损失,联合预测五个组织病理学轴:WHO4、WHO5、水平扩散、垂直扩散和细胞异型性。在来自三家大学医院的486张专家注释的结膜活检斑块的新整理多中心数据集上进行训练和评估,INTERACT-CMIL在CNN和基础模型(FM)基线上取得了一致的改进,相对宏F1增益高达55.1%(WHO4)和25.0%(垂直扩散)。该框架提供与专家分级一致的连贯、可解释的多标准预测,为CMIL诊断提供了可重复的计算基准,并朝着标准化数字眼科病理学迈出了一步。

英文摘要

Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains of up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.

2512.19602 2026-05-27 cs.CV 版本更新

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

无数据?没问题:面向缺失值的鲁棒视觉-表格学习

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

Affiliations: School of Computation, Information and Technology, Technical University of Munich, Germany; Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany; School of Biomedical Engineering and Imaging Sciences, King's College London, UK; Department of Diagnostic and Interventional Radiology, TUM University Hospital, Technical University of Munich, Germany; Munich Center for Machine Learning, Germany

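AI summary: Proposes the RoVTL framework, which uses tabular attribute missingness as augmentation during contrastive pretraining and a Tabular More vs. Fewer loss during downstream tuning to achieve robust multimodal learning at any level of tabular data availability, from 0% to 100%.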

详情
AI中文摘要

大规模医学数据库提供成像数据以及广泛的表格信息,如临床测量或人口统计数据。然而,这种丰富的表格属性并不反映现实世界的数据集,其中可能只有一部分属性可用。这种差异要求方法在推理时对缺失值保持鲁棒。为了解决这一挑战,我们提出了RoVTL(鲁棒视觉-表格学习),一个旨在处理任何级别表格数据可用性(从0%到100%)的框架。RoVTL包括两个关键阶段:对比预训练,其中我们将表格属性缺失作为数据增强引入以促进鲁棒性;以及下游任务微调,其中表格缺失通过一种新颖的Tabular More vs. Fewer损失来补充,该损失根据可用表格数据的数量对性能进行排序。结合门控交叉注意力融合模块,我们的微调方法在所有表格数据完整性场景下实现了一致的性能。我们在英国生物银行的 cardiac MRI 扫描上评估了RoVTL,证明了与先前方法相比对缺失表格数据的优越鲁棒性。此外,RoVTL成功泛化到外部 cardiac MRI 数据集进行多模态疾病分类,并扩展到自然图像领域,在汽车广告数据集上实现了鲁棒性能。模型权重和代码可在 https://github.com/marteczkah/RoVTL 获取。

英文摘要

Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as clinical measurements or demographics. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that remain robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, where tabular missingness is complemented by a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with a gated cross-attention fusion module, our tuning approach enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural image domain, achieving robust performance on a car advertisements dataset. The model weights and code are available at https://github.com/marteczkah/RoVTL.
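The two training ingredients named in this abstract can be sketched in a few lines. The abstract gives neither the masking scheme nor the form of the Tabular More vs. Fewer loss, so the per-attribute Bernoulli masking, the hinge form of the ranking penalty, and both function names are assumptions for illustration only.

```python
import random


def mask_attributes(row, missing_rate, rng):
    """Randomly hide tabular attributes; None marks a missing value."""
    return {name: (None if rng.random() < missing_rate else value)
            for name, value in row.items()}


def more_vs_fewer_penalty(loss_more, loss_fewer, margin=0.0):
    """Hinge penalty that is positive only when the view with MORE
    attributes has a higher task loss than the view with FEWER."""
    return max(0.0, loss_more - loss_fewer + margin)


rng = random.Random(0)
patient = {"age": 63, "bmi": 27.4, "systolic_bp": 135}  # made-up record
augmented = mask_attributes(patient, missing_rate=0.5, rng=rng)
```

A missing rate of 0.0 leaves the record intact and 1.0 hides every attribute, which matches the 0% to 100% availability range mentioned in the abstract.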

2512.14140 2026-05-27 cs.CV 版本更新

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

SketchAssist:一种用于语义编辑和精确局部重绘的实用助手

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

Affiliations: Global Business Unit, Baidu Inc.

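AI summary: Proposes SketchAssist, an interactive sketch assistant that combines instruction-guided editing with line-guided region redrawing. A controllable data generation pipeline and a unified DiT-based framework with a task-guided Mixture-of-Experts module enable efficient, controllable sketch manipulation, with state-of-the-art semantic and structural consistency.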

详情
AI中文摘要

草图编辑需要同时处理高级语义变化和精确的局部重绘,这种组合对于稀疏且风格敏感的线条艺术尤其具有挑战性。与自然图像不同,草图依赖于最小的视觉线索,使得现有方法难以在保持整体一致性的同时协调全局语义修改与细粒度结构控制。我们提出了SketchAssist,一种交互式草图助手,它统一了指令引导编辑和线条引导区域重绘,在保持整体构图的同时实现高效且可控的草图操作。为了支持这一任务,我们引入了一个可控数据生成管道,该管道构建具有精确属性变化的结构化编辑序列,并在多步修改中保持结构对齐,同时通过保持风格的变换扩展风格多样性。基于这些数据,SketchAssist采用基于DiT的统一框架,使用多通道输入表示在单一接口内编码草图、掩码和引导信号。为了进一步处理不同的编辑模式,我们将任务引导的混合专家(T-MoE)集成到LoRA层中,实现对语义和结构引导的自适应控制。大量实验表明,在两个任务上都达到了最先进的性能,与最近的方法相比,实现了更强的指令遵循以及改进的结构和风格一致性。总之,我们的方法为草图编辑提供了一种实用且可控的解决方案。

英文摘要

Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minimal visual cues, making it difficult for existing methods to reconcile global semantic modifications with fine-grained structural control while preserving overall coherence. We present SketchAssist, an interactive sketch assistant that unifies instruction-guided editing with line-guided region redrawing, enabling efficient and controllable sketch manipulation while preserving overall composition. To support this task, we introduce a controllable data generation pipeline that constructs structured edit sequences with precise attribute variations and maintains structural alignment across multi-step modifications, while expanding stylistic diversity via style-preserving transformations. Building on this data, SketchAssist adopts a unified framework based on DiT, using a multi-channel input representation to encode sketches, masks, and guidance signals within a single interface. To further handle different editing modes, we integrate a Task-guided Mixture-of-Experts (T-MoE) into LoRA layers, enabling adaptive control over semantic and structural guidance. Extensive experiments demonstrate state-of-the-art performance on both tasks, achieving strong instruction adherence and improved structural and style consistency compared to recent methods. Overall, our method provides a practical and controllable solution for sketch editing.

2510.17790 2026-05-27 cs.CV cs.CL 版本更新

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA: 一种具有混合动作的计算机使用智能体基础模型

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

Affiliations: Apple; The University of Hong Kong

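AI summary: Proposes UltraCUA, a foundation model that overcomes computer-use agents' reliance on primitive GUI actions through hybrid action, combining primitive GUI operations with high-level tool execution. Using an automated pipeline, a synthetic data engine, hybrid action trajectory collection, and two-stage training, it achieves a 22% relative improvement on OSWorld and a 21.7% success rate on WindowsAgentArena.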

详情
AI中文摘要

计算机使用智能体面临一个根本限制:它们仅依赖原始GUI动作(点击、键入、滚动),导致脆弱的执行链容易发生级联故障。虽然API驱动的智能体通过结构化接口和工具利用丰富的能力,但计算机使用智能体仍然局限于低层视觉交互。我们提出UltraCUA,一种通过混合动作(无缝统一原始GUI操作与高层工具执行)超越这一限制的基础模型。我们的创新基于四个关键进展。首先,一个自动化管道从软件文档和代码仓库中提取并扩展工具能力。其次,一个合成数据引擎生成超过17,000个可验证任务,捕捉真实世界的计算机使用复杂性。第三,全面的混合动作轨迹收集融合了GUI原语和策略性工具调用。第四,一种两阶段训练方法结合了监督微调和在线强化学习,实现了GUI与API之间的智能动作选择。对我们的7B和32B UltraCUA模型的评估揭示了变革性的性能提升。在OSWorld上,UltraCUA平均实现了22%的相对改进,同时执行速度比现有方法快11%。在WindowsAgentArena上的跨域验证展示了鲁棒的泛化能力,成功率达到21.7%,超过了在Windows上训练的基线。混合动作范式被证明至关重要,在减少错误传播的同时提高了执行效率。这项工作建立了一个可扩展的范式,桥接了原始GUI交互与高层工具智能,为多样环境和复杂现实任务提供了更具弹性和适应性的计算机使用智能体。

英文摘要

Computer-use agents face a fundamental limitation: they rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves a 22% relative improvement while executing 11% faster than existing approaches, on average. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

Affiliations: Advanced Micro Devices Inc.; The Hong Kong University of Science and Technology (Guangzhou)

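AI summary: Proposes Athena-PRM, a multimodal process reward model that efficiently produces high-quality process labels by exploiting prediction consistency between weak and strong completers, substantially improving step-level evaluation on complex reasoning problems with only 5,000 samples.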

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

详情
AI中文摘要

我们提出了 Athena-PRM,一种多模态过程奖励模型(PRM),旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入,主要因为需要推理步骤的逐步标注。传统的自动标注方法,如蒙特卡洛估计,通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据,我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是,Athena-PRM 在仅5000个样本的情况下,在各种场景和基准测试中展现出卓越的效果。此外,我们还开发了两种有效策略来提升PRM的性能:ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法:测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是,当使用 Qwen2.5-VL-7B 作为策略模型时,Athena-PRM 在 WeMath 上提升了10.2个百分点,在 MathVista 上提升了7.1个百分点(测试时扩展)。此外,Athena-PRM 在 VisualProcessBench 上取得了最先进(SoTA)结果,比之前的 SoTA 高出3.9个F1分数,展示了其准确评估推理步骤正确性的强大能力。另外,利用 Athena-PRM 作为奖励模型,我们通过奖励排序微调开发了 Athena-7B,在五个基准测试上以显著优势超越了基线。

英文摘要

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the need for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. We also develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets state-of-the-art (SoTA) results on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B via reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
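The consistency criterion described in this abstract can be sketched as a simple filter over step-level estimates. The labeling rule (a step counts as correct when at least one sampled completion from it reaches the right answer), the threshold, and the function name are assumptions. The abstract states only that agreement between weak and strong completers is used to select reliable labels.

```python
def consistent_step_labels(weak_estimates, strong_estimates, threshold=0.0):
    """Keep step labels only where the weak and strong completers agree.

    Each estimate is the fraction of sampled completions from a given
    reasoning step that reach the correct final answer. A step is labeled
    1 (correct) when its estimate exceeds `threshold`, else 0.
    Returns (step_index, label) pairs for the steps where both agree.
    """
    kept = []
    for i, (w, s) in enumerate(zip(weak_estimates, strong_estimates)):
        w_label = int(w > threshold)
        s_label = int(s > threshold)
        if w_label == s_label:
            kept.append((i, w_label))
    return kept


# Made-up estimates for a three-step solution.
weak = [0.5, 0.0, 0.0]
strong = [0.75, 0.25, 0.0]
labels = consistent_step_labels(weak, strong)  # step 1 is dropped as unreliable
```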

2512.04085 2026-05-27 cs.CV 版本更新

Unique Lives, Shared World: Learning from Single-Life Videos

独特生活,共享世界:从单个人生视频中学习

Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

Affiliations: Google DeepMind

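AI summary: Proposes the "single-life" learning paradigm, in which a visual encoder is learned by multi-view self-supervision from egocentric videos captured by one person. Models trained on different lives develop highly aligned geometric understanding, and the learned representations generalize to downstream tasks with performance comparable to training on diverse web data.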

详情
AI中文摘要

我们引入了“单个人生”学习范式,其中我们仅针对一个人拍摄的自我中心视频训练一个独特的视觉模型。我们利用单个人生中自然捕获的多个视角,以自监督方式学习视觉编码器。我们的实验展示了三个关键发现。首先,独立在不同人生上训练的模型发展出高度对齐的几何理解。我们通过在捕获不同人生(包括室内和室外)的不同数据集上训练视觉编码器,并引入一种新的基于交叉注意力的度量来量化不同模型发展的内部表示的功能对齐,来证明这一点。其次,我们展示了单个人生模型学习到可泛化的几何表示,这些表示能有效迁移到下游任务,如未见环境中的深度估计。第三,我们证明,对同一个人一周内最多30小时的数据进行训练,其性能与在30小时多样化网络数据上训练相当,突出了单个人生表示学习的优势。总体而言,我们的结果确立了世界的共享结构既导致了在个人人生上训练的模型的一致性,也为视觉表示学习提供了强大的信号。

英文摘要

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to performance comparable to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

2511.01724 2026-05-27 cs.CV cs.LG 版本更新

PRBench: A Standardized Probabilistic Robustness Benchmark

PRBench:标准化概率鲁棒性基准

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Affiliations: WMG, University of Warwick; Department of Computer Science, University of Liverpool; College of Computer Science, Nankai University; School of Computing, National University of Singapore

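AI summary: Proposes the PRBench benchmark, which uses a unified evaluation protocol and theoretical analysis to compare adversarial training and probabilistic-robustness training methods on clean accuracy, robustness, and generalization error.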

详情
AI中文摘要

深度学习模型因对不可察觉扰动的脆弱性而闻名。现有研究大多集中于对抗鲁棒性(AR),它通过检查确定性对抗样本(AE)的存在性,在最坏情况下评估模型。相比之下,概率鲁棒性(PR)采用统计视角,衡量在随机扰动下预测保持正确的概率。尽管PR被广泛视为AR的实用补充,但专门用于提升PR的训练方法仍相对未被充分探索,尽管已有初步进展。在少数针对PR的训练方法中,我们发现了三个局限性:(i) 不可比较的评估协议;(ii) 尽管AT能带来PR提升的轶事证据,但与强AT基线的比较有限;(iii) 缺乏统一框架来比较这些方法的泛化能力。因此,我们引入了PRBench,这是第一个专门评估不同鲁棒性训练方法在PR提升上的基准。PRBench使用一套全面的指标,包括干净准确率、PR和AR性能、训练效率以及泛化误差(GE),对最常见的AT和针对PR的训练方法进行实证比较。我们还对不同训练方法的PR性能的GE进行了理论分析。PRBench揭示的主要发现包括:在跨不同超参数设置提升AR和PR性能方面,AT方法比针对PR的训练方法更具通用性,而针对PR的训练方法始终产生更低的GE和更高的干净准确率。包含229个训练模型(覆盖7个数据集和10种模型架构)的排行榜公开于 https://wellzline.github.io/PRBenchLeaderboard/。

英文摘要

Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong adversarial training (AT) baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide a theoretical analysis of the GE of PR performance across different training methods. The main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 229 trained models across 7 datasets and 10 model architectures is publicly available at https://wellzline.github.io/PRBenchLeaderboard/.
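The PR definition given in this abstract (the probability that a prediction stays correct under stochastic perturbations) lends itself to a Monte Carlo estimate. The sketch below assumes uniform noise inside an L-infinity ball and uses a toy linear classifier. The function names, the noise distribution, and the radius are illustrative choices, not PRBench's evaluation protocol.

```python
import random


def estimate_pr(predict, x, label, radius, n_samples=1000, seed=0):
    """Fraction of random perturbations within `radius` that keep the prediction correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        noisy = [xi + rng.uniform(-radius, radius) for xi in x]
        correct += predict(noisy) == label
    return correct / n_samples


def toy_classifier(x):
    """Hypothetical stand-in model: class 1 if the feature sum is positive."""
    return int(sum(x) > 0.0)


far_from_boundary = estimate_pr(toy_classifier, [1.0, 1.0], 1, radius=0.5)
near_boundary = estimate_pr(toy_classifier, [0.1, 0.0], 1, radius=0.5)
```

The first input is robust to every perturbation in the ball, so its estimate is 1.0. The second input sits close to the decision boundary: a worst-case (AR) check would mark it non-robust because an adversarial perturbation exists, whereas the PR estimate reports that a majority of random perturbations still leave the prediction correct.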

2511.14993 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0:图像与视频生成的基础模型系列

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

Affiliations: Kandinsky Lab

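AI summary: Introduces the Kandinsky 5.0 model family, which achieves high-quality generation of high-resolution images and 10-second videos through multi-stage training, self-supervised fine-tuning, and reinforcement-learning-based post-training.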

Comments Website: https://kandinskylab.ai/

详情
AI中文摘要

本报告介绍了Kandinsky 5.0,一系列用于高分辨率图像和10秒视频合成的最先进基础模型。该框架包含三个核心模型系列:Kandinsky 5.0 Image Lite——6B参数的图像生成模型系列,Kandinsky 5.0 Video Lite——快速轻量级的2B参数文本到视频和图像到视频模型,以及Kandinsky 5.0 Video Pro——19B参数模型,实现了卓越的视频生成质量。我们全面回顾了数据策展生命周期——包括收集、处理、过滤和聚类——用于多阶段训练流程,该流程涉及广泛的预训练,并融入了质量增强技术,如自监督微调(SFT)和基于强化学习(RL)的后训练。我们还介绍了新颖的架构、训练和推理优化,使Kandinsky 5.0能够在各种任务上实现高生成速度和最先进的性能,如人类评估所示。作为一个大规模、公开可用的生成框架,Kandinsky 5.0充分利用其预训练及后续阶段的全部潜力,以适应广泛的生成应用。我们希望本报告,连同我们开源代码和训练检查点的发布,将大大促进高质量生成模型的研究社区发展和可访问性。

英文摘要

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2504.19203 2026-05-27 eess.IV cs.CV 版本更新

Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction

基于MRI的深度学习模型在全膝关节置换预测中的泛化能力改进

Ehsan Karami, Hamid Soltanian-Zadeh

AI总结 针对MRI深度学习模型在不同来源数据上泛化性差的问题,提出用实例归一化替代批归一化、数据增强和对比损失的方法,在OAI数据集上显著提升分类性能。

详情
Journal ref
Proceedings of the 2025 32nd National and 10th International Iranian Conference on Biomedical Engineering (ICBME)
AI中文摘要

膝骨关节炎(KOA)是一种常见的关节疾病,会导致疼痛和行动不便。尽管基于MRI的深度学习模型在全膝关节置换(TKR)和疾病进展预测中表现出优越性能,但其泛化能力仍然具有挑战性,尤其是在应用于不同来源的影像数据时。在本研究中,我们证明用实例归一化替代批归一化、使用数据增强以及应用对比损失可以改善泛化能力。在训练和评估中,我们使用了来自骨关节炎倡议(OAI)数据库的MRI数据,将矢状面脂肪抑制中间加权涡轮自旋回波(FS-IW-TSE)图像作为源域,矢状面脂肪抑制三维(3D)双回波稳态(DESS)图像作为目标域。结果表明,通过在基线模型中将批归一化替换为实例归一化,使用全局强度非线性(GIN)增强方法生成增强输入视图,以及在分类损失之外加入监督对比损失以对齐相同标签样本的表征,两个域的分类指标均有统计学显著提升。当使用3D实例归一化时,带有对比损失的GIN方法优于所有评估的单源域泛化方法。比较有无对比损失的GIN(针对两种归一化类型)表明,添加对比损失始终带来更好的性能。

英文摘要

Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we show that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves generalization. For training and evaluation, we used MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrated a statistically significant improvement in classification metrics across both domains by replacing batch normalization with instance normalization in the baseline model, generating augmented input views using the Global Intensity Non-linear (GIN) augmentation method, and incorporating a supervised contrastive loss alongside the classification loss to align representations of samples with the same label. The GIN method with contrastive loss performed better than all evaluated single-source domain generalization methods when using 3D instance normalization. Comparing GIN with and without contrastive loss (for both normalization types) showed that adding contrastive loss consistently led to better performance.
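Why instance normalization can help across MRI sequences is easy to see on a toy example: it standardizes each scan with its own statistics, so a global intensity rescaling between acquisitions largely cancels out, whereas batch statistics mix the two. The sketch below works on flattened single-channel lists and omits the learnable affine parameters. The helper names and intensity values are illustrative, not taken from the paper.

```python
import math


def _stats(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def _standardize(values, mean, var, eps=1e-5):
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def instance_norm(batch):
    """Standardize every sample with its own mean and variance."""
    return [_standardize(sample, *_stats(sample)) for sample in batch]


def batch_norm(batch):
    """Standardize with statistics pooled over the whole batch (training-mode behavior)."""
    mean, var = _stats([v for sample in batch for v in sample])
    return [_standardize(sample, mean, var) for sample in batch]


# Same toy anatomy, two hypothetical intensity scales.
scan_a = [1.0, 2.0, 3.0]
scan_b = [10.0, 20.0, 30.0]
inn = instance_norm([scan_a, scan_b])  # the two rows come out nearly identical
bn = batch_norm([scan_a, scan_b])      # the two rows stay clearly different
```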

2510.17759 2026-05-27 cs.CR cs.CL cs.CV cs.LG stat.ML 版本更新

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V:用于破解视觉语言模型的变分推断框架

Qilin Liao, Anamika Lochab, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University, USA(美国普渡大学计算机科学系)

AI总结 提出VERA-V变分推断框架,通过联合后验分布生成隐蔽的文本-图像对抗输入,以系统性地发现视觉语言模型的多模态漏洞,在多个基准上攻击成功率最高提升53.75%。

Comments 18 pages, 7 figures

详情
AI中文摘要

视觉语言模型(VLM)通过视觉推理扩展了大语言模型,但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖脆弱的模板,专注于单一攻击设置,并且仅暴露了漏洞的一小部分。为了解决这些限制,我们引入了VERA-V,一个变分推断框架,将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得能够生成绕过模型防护的隐蔽、耦合的对抗输入。我们训练一个轻量级攻击者来近似后验分布,从而能够高效采样多样化的越狱方法,并提供对漏洞的分布性洞察。VERA-V进一步整合了三种互补策略:(i)基于排版的文本提示,嵌入有害线索;(ii)基于扩散的图像合成,引入对抗信号;(iii)结构化干扰物,分散VLM的注意力。在HarmBench和HADES基准上的实验表明,VERA-V在开源和前沿VLM上均持续优于最先进的基线方法,在GPT-4o上相比最佳基线实现了高达53.75%的攻击成功率(ASR)提升。我们在项目页面提供了代码,地址为:https://github.com/kxwhiowo/VERA-V

英文摘要

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o. We include the code on the project page available here: https://github.com/kxwhiowo/VERA-V

2510.09606 2026-05-27 cs.CV 版本更新

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

SpaceVista:从毫米到公里的全尺度视觉空间推理

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

发表机构 * Multimedia Laboratory, The Chinese University of Hong Kong(香港中文大学多媒体实验室) Beijing University of Posts(北京邮电大学) Hong Kong University of Science(香港科技大学)

AI总结 本文提出全尺度空间推理解决方案,通过结构化知识系统、尺度感知建模和渐进训练范式,构建SpaceVista-1M数据集(38K视频场景、约1M空间QA对)和SpaceVista-7B模型,在5个基准上展现强泛化能力。

Comments Project Page: https://peiwensun2000.github.io/mm2km/

详情
AI中文摘要

随着当前空间推理探索的兴起,研究人员在理解室内场景方面取得了显著进展,但在机器人技术和自动驾驶等多样化应用中仍面临挑战。本文旨在通过解决两个关键挑战来推进跨不同场景的全尺度空间推理:1)数据集构建严重依赖室内3D扫描和劳动密集型人工标注;2)缺乏有效的全尺度场景建模,常常导致对单个场景的过拟合。本文提出了一种整体解决方案,集成了结构化空间推理知识系统、尺度感知建模和渐进训练范式,据我们所知,这是首次尝试拓宽多模态大语言模型的全尺度空间智能。通过任务特定、专家驱动的自动化流水线,我们在5个空间尺度上整理了超过38K个视频场景,创建了SpaceVista-1M数据集,该数据集包含约100万个空间问答对,涵盖19种不同的任务类型。虽然专家模型可以注入有用的领域知识,但它们不适合用于评估。因此,我们通过手动记录、检索和组装基于视频的数据,构建了一个具有精确标注的全尺度基准。然而,由于潜在的知识冲突,使用SpaceVista-1M进行简单训练往往会产生次优结果。因此,我们引入了SpaceVista-7B,一个空间推理模型,它接受超出语义的密集输入,并使用尺度作为尺度感知专家和渐进奖励的锚点。最后,在包括我们的SpaceVista-Bench在内的5个基准上的广泛评估展示了竞争性能,展现了跨所有尺度和场景的强大泛化能力。我们的数据集、模型和基准将在https://peiwensun2000.github.io/mm2km上发布。

英文摘要

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm; to the best of our knowledge, it is the first attempt to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.

2510.04533 2026-05-27 cs.CV 版本更新

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Sampling

TAG: 切向放大引导用于抗幻觉采样

Hyunmin Cho, Donghoon Ahn, Susung Hong, Jee Eun Kim, Seungryong Kim, Kyong Hwan Jin

发表机构 * Korea University(高丽大学) University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 提出一种无需训练、与架构无关的即插即用引导方法TAG,通过放大估计分数的切向分量来纠正采样轨迹,减少语义不一致性并提高保真度。

Comments Accepted to ICML 2026 (Regular)

详情
AI中文摘要

扩散模型实现了最先进的图像生成,但经常产生语义不一致或幻觉。现有的推理时引导方法依赖外部信号或架构修改,增加了计算开销。我们提出切向放大引导(TAG),一种无需训练、与架构无关、即插即用的引导方法,仅基于轨迹信号操作。TAG使用中间样本作为投影基,放大估计分数的切向分量以纠正采样轨迹。一阶泰勒分析表明,这会将状态引导至数据流形的高概率区域,减少不一致性并提高保真度,同时为现有采样器增加可忽略的开销。代码可在我们的项目页面(https://hyeon-cho.github.io/TAG/)获取。

英文摘要

Diffusion models achieve state-of-the-art image generation but often produce semantic inconsistencies, or hallucinations. Existing inference-time guidance methods rely on external signals or architectural modifications, adding computational overhead. We propose Tangential Amplifying Guidance (TAG), a training-free, architecture-agnostic, plug-and-play guidance method that operates purely on trajectory signals. TAG uses an intermediate sample as a projection basis and amplifies the tangential components of the estimated score to correct the sampling trajectory. A first-order Taylor analysis shows that this steers the state toward higher-probability regions of the data manifold, reducing inconsistencies and improving fidelity while adding negligible overhead to existing samplers. Code is available at our Project Page (https://hyeon-cho.github.io/TAG/).
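The projection step mentioned in the abstract can be sketched as a plain vector decomposition. This is a hedged illustration, not the released TAG code. The function name `amplify_tangential`, the gain `eta`, and the choice to leave the component along the sample unchanged are assumptions, since the abstract does not give the exact update rule.

```python
import math


def amplify_tangential(score, sample, eta=1.5):
    """Split `score` into a part along `sample` and a part orthogonal to it,
    then scale only the orthogonal (tangential) part by `eta`.

    `eta` is a hypothetical gain; eta = 1.0 returns the score unchanged.
    """
    norm = math.sqrt(sum(x * x for x in sample))
    if norm == 0.0:
        return list(score)  # no direction to project onto
    unit = [x / norm for x in sample]
    radial_mag = sum(s * u for s, u in zip(score, unit))
    radial = [radial_mag * u for u in unit]
    tangential = [s - r for s, r in zip(score, radial)]
    return [r + eta * t for r, t in zip(radial, tangential)]


# Toy 2-D example: the sample points along the x-axis, so only the
# y-component of the score is treated as tangential and amplified.
guided = amplify_tangential([2.0, 3.0], [1.0, 0.0], eta=2.0)
```

Because the operation is a projection plus a rescale, its cost is a few dot products per step, consistent with the abstract's claim of negligible overhead.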

2510.00902 2026-05-27 cs.CV cs.CY cs.HC 版本更新

Intuitions of Machine Learning Researchers about Transfer Learning for Medical Image Classification

机器学习研究者关于医学图像分类迁移学习的直觉

Yucheng Lu, Hubert Dariusz Zając, Veronika Cheplygina, Amelia Jiménez-Sánchez

发表机构 * IT University of Copenhagen(哥本哈根信息技术大学) University of Antwerp(安特卫普大学) University of Barcelona(巴塞罗那大学)

AI总结 通过任务调查揭示机器学习从业者选择源数据集的直觉依据,发现选择依赖于任务、社区实践和相似性感知,但相似性与性能并不一致,且缺乏伦理考量。

Comments Under review

详情
AI中文摘要

迁移学习对医学影像至关重要,然而源数据集的选择往往依赖于研究者的直觉而非系统原则,这可能影响算法的泛化能力,进而影响患者预后。本研究通过对机器学习从业者进行基于任务的调查来探究这些决策。与先前对模型和实验设置进行基准测试的工作不同,我们从人机交互(HCI)角度研究从业者如何选择源数据集。我们的发现表明,选择依赖于任务,并受到社区实践、数据集属性、计算(数据嵌入)或感知的视觉或语义相似性的影响。然而,相似性评分与预期性能并不总是一致,挑战了传统的“越相似越好”的观点。此外,伦理和公平性考虑在源数据集选择中基本缺失。参与者常使用模糊术语,这表明需要更清晰的定义和工具使其明确且可用。通过阐明这些启发式方法并引入迁移学习因素的概念框架,本研究为迁移学习中更系统的源选择提供了实用见解。

英文摘要

Transfer learning is crucial for medical imaging, yet the selection of source datasets often relies on researchers' intuition rather than systematic principles, which can impact the generalizability of algorithms and, thus, patient outcomes. This study investigates these decisions through a task-based survey with machine learning practitioners. Unlike prior work that benchmarks models and experimental setups, we take a human-computer interaction (HCI) perspective on how practitioners select source datasets. Our findings indicate that choices are task-dependent and influenced by community practices, dataset properties, and computational (data embedding) or perceived visual or semantic similarity. However, similarity ratings and expected performance are not always aligned, challenging a traditional "more similar is better" view. Moreover, ethical and fairness considerations remain largely absent from source dataset selection. Participants often used ambiguous terminology, which suggests a need for clearer definitions and tools to make them explicit and usable. By clarifying these heuristics and introducing a conceptual framework of transfer learning factors, this work provides practical insights for more systematic source selection in transfer learning.

2509.21552 2026-05-27 cs.CV cs.CL 版本更新

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

通过视觉反馈学习具有空间推理能力的 GUI 定位

Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

发表机构 * University of Edinburgh(爱丁堡大学) Microsoft(微软)

AI总结 本文提出将 GUI 定位重构为交互式搜索任务,利用多步在线强化学习训练 GUI-Cursor 模型,通过光标视觉反馈提升空间推理能力,在 GUI 定位和代理任务上超越强基线。

Comments Accepted at ICML 2026

详情
AI中文摘要

图形用户界面(GUI)定位通常被构建为坐标预测任务——给定自然语言指令,生成屏幕上用于点击和按键等操作的坐标。然而,最近的视觉语言模型(VLM)在处理高分辨率和复杂布局的 GUI 图像时,往往无法预测准确的数字坐标。为了解决这个问题,我们将 GUI 定位重构为交互式搜索任务,其中 VLM 生成动作以移动 GUI 中的光标来定位 UI 元素。在每一步,模型确定目标对象,评估光标与目标之间的空间关系,并根据移动历史将光标移近目标。在这个交互过程中,渲染的光标提供视觉反馈,帮助模型将其预测与相应的屏幕位置对齐。我们使用基于密集轨迹奖励函数的多步在线强化学习来训练我们的 GUI 定位模型 GUI-Cursor。实验结果表明,GUI-Cursor 在 GUI 定位和代理任务上超越了强基线,在相同基础模型下实现了更优性能,同时需要更少的训练数据。进一步分析表明,GUI-Cursor 学会在更困难的示例上自适应地执行更多步骤,并在分布外领域获得更好的空间推理能力。

英文摘要

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.

2506.08543 2026-05-27 cs.CV 版本更新

Spectral Principal Paths: A Spectral Perspective on Linear Representation Formation in LLMs

谱主路径:大语言模型中线性表示形成的谱视角

Bowei Tian, Xuntao Lyu, Meng Liu, Hongyi Wang, Ang Li

发表机构 * University of Maryland, College Park(马里兰大学 College Park 分校) North Carolina State University(北卡罗来纳州立大学) Genbio AI

AI总结 提出输入空间线性假设和谱主路径框架,利用谱理论解释大语言模型中线性表示的形成与稳定性,并给出严格保证。

Comments arXiv admin note: text overlap with arXiv:2503.22720

详情
AI中文摘要

高层表示已成为增强AI透明度和可控性的核心焦点,将注意力从单个神经元或电路转向与人类可解释概念对齐的结构化语义方向。虽然线性表示假说(LRH)表明这些方向在表示中出现,但这些表示如何起源以及为何在层间变得日益稳定仍不清楚。为解决此问题,我们引入输入空间线性假设,认为与概念对齐的方向起源于输入空间,并随着深度增加而稳定维持。然后我们提出谱主路径(SPP)框架,该框架形式化了深度网络如何沿谱主方向逐步蒸馏线性表示。我们基于Wedin $\sin\Theta$ 扰动定理为SPP提供了严格的稳定性保证,识别了可测试的条件,包括谱间隙和上下文不连贯性,这些条件共同确保层间方向保持。通过将理论分析与实证证据相结合,本工作提供了关于线性表示如何在大语言模型中产生的谱视角,并暗示了对现代AI系统中公平性和透明度的概念级可控、鲁棒和连贯方法的潜在影响。

英文摘要

High-level representations have become a central focus in enhancing AI transparency and control, shifting attention from individual neurons or circuits to structured semantic directions that align with human-interpretable concepts. While the Linear Representation Hypothesis (LRH) suggests that such directions emerge in representations, it remains unclear how these representations originate and why they become increasingly stable across layers. To address this issue, we introduce the Input-Space Linearity Hypothesis, positing that concept-aligned directions originate in the input space and are steadily maintained with increasing depth. We then propose the Spectral Principal Path (SPP) framework, which formalizes how deep networks progressively distill linear representations along the spectral principal directions. We provide rigorous stability guarantees for the SPP based on the Wedin $\sin\Theta$ perturbation theorem, identifying testable conditions, including spectral gap and context incoherence, that jointly ensure layer-wise directional preservation. By bridging theoretical analysis with empirical evidence, this work identifies a spectral view of how linear representations arise in LLMs, and suggests potential implications for concept-level controllable, robust, and coherent approaches to fairness and transparency in modern AI systems.

2509.21167 2026-05-27 cs.LG cs.CV 版本更新

A Unified Framework for Diffusion Model Unlearning with f-Divergence

基于f-散度的扩散模型遗忘统一框架

Nicola Novello, Federico Fontana, Luigi Cinque, Deniz Gunduz, Andrea M. Tonello

发表机构 * University of Klagenfurt, Austria(克雷根福特大学) Sapienza University of Rome, Italy(罗马萨皮恩扎大学) Imperial College London, UK(伦敦帝国学院)

AI总结 提出一个基于f-散度的统一框架,将扩散模型概念遗忘中的MSE损失推广到任意f-散度,并通过理论分析和实验验证不同散度对遗忘效果的影响。

Comments Accepted at ICML 2026

详情
AI中文摘要

现有的大多数文本到图像扩散模型概念遗忘方法最小化基于目标概念和锚定概念的去噪器输出之间的均方误差(MSE)损失,这隐式地是两个高斯分布之间的KL散度。我们将这一目标推广到任意$f$-散度,将MSE恢复为KL实例,并识别出一族$\alpha$-散度,其高斯闭式形式产生廉价、类似MSE的训练目标。对于剩余的$f$-散度,我们基于$f$-散度的变分公式提供了一个最小-最大目标。我们从理论上分析并数值验证了不同$f$-散度如何影响梯度幅度和算法的收敛性质,从而影响遗忘质量。例如,我们观察到Hellinger闭式实例在多种场景下始终优于MSE。更一般地,所提出的统一框架为根据应用和用户目标选择最优散度提供了灵活的范式,允许对遗忘效果与生成保真度之间的权衡进行更精细的控制。

英文摘要

Most existing methods for concept unlearning in text-to-image diffusion models minimize a mean squared error (MSE) loss between the denoiser outputs conditioned on a target and an anchor concept, which is implicitly the KL divergence between two Gaussians. We generalize this objective to any $f$-divergence, recovering MSE as the KL instance, and identify a family of $\alpha$-divergences whose Gaussian closed-form yields cheap, MSE-like training objectives. For the remaining $f$-divergences, we provide a min-max objective based on the variational formulation of the $f$-divergence. We theoretically analyze and numerically validate how different $f$-divergences impact the gradient magnitude and the convergence properties of the algorithm, affecting the quality of unlearning. For instance, we observe that the Hellinger closed-form instance consistently dominates MSE across multiple scenarios. More generally, the proposed unified framework offers a flexible paradigm for selecting the optimal divergence based on the application and user goal, allowing for finer control over the trade-off between unlearning efficacy and generative fidelity.
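The link between MSE and KL stated in the abstract follows from a standard identity for Gaussians with a shared isotropic covariance. The worked equations below are textbook results, not the paper's derivation. The symbols $\mu_1$, $\mu_2$ (standing in for the two denoiser outputs) and $\sigma^2$ are notation assumed here, and the squared Hellinger distance uses the convention $H^2(P,Q) = 1 - \int \sqrt{p\,q}$.

```latex
% KL divergence between two Gaussians with the same covariance \sigma^2 I:
% it reduces to a scaled squared error between the means.
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_1, \sigma^2 I)\,\|\,\mathcal{N}(\mu_2, \sigma^2 I)\right)
  = \frac{\lVert \mu_1 - \mu_2 \rVert_2^2}{2\sigma^2}

% Squared Hellinger distance for the same pair: a bounded, saturating
% function of the same squared error.
H^2\!\left(\mathcal{N}(\mu_1, \sigma^2 I),\,\mathcal{N}(\mu_2, \sigma^2 I)\right)
  = 1 - \exp\!\left(-\frac{\lVert \mu_1 - \mu_2 \rVert_2^2}{8\sigma^2}\right)
```

Both expressions depend on the means only through $\lVert \mu_1 - \mu_2 \rVert_2^2$, which is why such closed forms can act as inexpensive, MSE-like objectives. They differ in how the gradient scales with the size of the error, since the Hellinger form saturates for large gaps.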

2509.18919 2026-05-27 cs.CV 版本更新

Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset

通过在大规模工业数据集上的异常引导预训练推进金属表面缺陷检测

Chuni Liu, Hongjie Li, Jiaqi Du, Yangyang Hou, Qian Sun, Lei Jin, Ke Xu

发表机构 * Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing(钢铁技术协同创新中心,北京科技大学)

AI总结 提出异常引导自监督预训练(AGSSP)方法,通过两阶段框架利用异常先验引导表示学习,在金属表面缺陷检测中显著提升性能,mAP@0.5提升高达10%。

Comments Accepted for publication in Pattern Recognition

详情
Journal ref
Pattern Recognition, Volume 179, Part C, 2026, 113788
AI中文摘要

预训练-微调范式是金属表面缺陷检测中缓解数据稀缺挑战的关键策略。然而,其实现面临一个关键困境:在ImageNet等自然图像数据集上预训练存在显著的领域差距;同时,由于现有学习目标无法区分复杂背景噪声和纹理中的细微缺陷模式,在领域内工业数据上进行简单的自监督预训练往往效果不佳。为解决这一问题,我们引入了异常引导自监督预训练(AGSSP),这是一种通过异常先验显式引导表示学习的新范式。AGSSP采用两阶段框架:(1)首先通过从异常图中蒸馏知识来预训练模型的主干网络,鼓励网络捕获缺陷显著特征;(2)然后使用从这些图中导出的伪缺陷框预训练检测器,使其与定位任务对齐。为此,我们开发了一种知识增强方法来生成高质量的异常图,并收集了一个包含120,000张图像的大规模工业数据集。此外,我们提供了两个小规模、像素级标注的金属表面缺陷数据集用于验证。大量实验表明,AGSSP在各种设置下均能持续提升性能,与基于ImageNet的模型相比,mAP@0.5提升高达10%,mAP@0.5:0.95提升高达11.4%。所有代码、预训练模型和数据集均可在https://clovermini.github.io/AGSSP-Dev/公开获取。

英文摘要

The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets such as ImageNet faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective due to the inability of existing learning objectives to distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model's backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at https://clovermini.github.io/AGSSP-Dev/.

2509.09977 2026-05-27 cs.CV 版本更新

ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking

ISTASTrack:通过ISTA适配器桥接ANN和SNN用于RGB-事件跟踪

Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng

AI总结 提出首个基于Transformer的ANN-SNN混合跟踪器ISTASTrack,利用ISTA适配器双向融合RGB和事件特征,实现高效鲁棒跟踪。

Comments Accepted by IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3694138, 15 pages, 8 figures

详情
AI中文摘要

RGB-事件跟踪已成为视觉目标跟踪中一个有前景的趋势,旨在利用RGB图像和动态尖峰事件的互补优势来提高性能。然而,现有的人工神经网络(ANN)难以充分利用事件流的稀疏和异步特性。最近,结合ANN和脉冲神经网络(SNN)的混合架构研究作为RGB-事件感知中的一种有前途的解决方案出现,但有效融合跨异构范式的特征仍然是一个挑战。在这项工作中,我们提出了ISTASTrack,这是第一个基于Transformer的ANN-SNN混合跟踪器,配备了ISTA适配器用于RGB-事件跟踪。该双分支模型采用视觉Transformer从RGB输入中提取空间上下文,并使用脉冲Transformer从事件流中捕获时空动态。为了弥合ANN和SNN特征之间的模态和范式差距,我们系统地设计了一个基于模型的ISTA适配器,用于两个分支之间的双向特征交互,该适配器通过展开迭代收缩阈值算法从稀疏表示理论推导而来。此外,我们在适配器中引入了一个时间下采样注意力模块,以在潜在空间中对齐多步SNN特征与单步ANN特征,从而改善时间融合。在RGB-事件跟踪基准(如FE240hz、VisEvent、COESOT和FELT)上的实验结果表明,ISTASTrack在保持高能效的同时实现了最先进的性能,突显了混合ANN-SNN设计在鲁棒视觉跟踪中的有效性和实用性。代码已公开在https://github.com/lsying009/ISTASTrack.git。

英文摘要

RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, demonstrate that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
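For readers unfamiliar with the algorithm the adapter unfolds, here is a minimal sketch of classical ISTA for the lasso problem, using only the standard library. It is the textbook fixed-parameter algorithm, not the learned adapter from the paper. The function names, the dense list-of-rows matrix format, and the toy inputs are assumptions for illustration.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; any |v| <= t becomes exactly 0.0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, step, n_iters=100):
    """Approximately minimize 0.5 * ||A x - y||^2 + lam * ||x||_1.

    A is a list of rows. `step` should be at most 1 / L, where L is the
    largest eigenvalue of A^T A, for the iteration to converge.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iters):
        # Gradient of the smooth term: A^T (A x - y)
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the shrinkage (proximal) step
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x


# With A = identity the solution is just the soft-thresholded observation.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_hat = ista(identity, [3.0, -0.5], lam=1.0, step=1.0, n_iters=5)
```

Unfolding, in general, means treating a fixed number of these iterations as network layers and learning quantities such as the step size and threshold. How the paper parameterizes its adapter is not specified in the abstract.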

2504.08593 2026-05-27 cs.CV cs.AI 版本更新

Hands-On: Segmenting Individual Signs from Continuous Sequences

动手实践:从连续序列中分割单个手势

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

发表机构 * CVSSP, University of Surrey(CVSSP,萨里大学)

AI总结 针对连续手语分割难题,提出基于Transformer的架构,利用HaMeR手部特征和3D角度,采用BIO标注方案建模时序动态,在DGS语料库上达到最优性能。

Comments Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition. Code Implementation Released

详情
Journal ref
IEEE 19th International Conference on Automatic Face and Gesture Recognition. (2025) 1-5
AI中文摘要

这项工作解决了连续手语分割的挑战,这是一项对手语翻译和数据标注具有重大影响的关键任务。我们提出了一种基于Transformer的架构,该架构对手语的时序动态进行建模,并使用开始-内部-外部(BIO)标注方案将分割视为序列标注问题。我们的方法利用了HaMeR手部特征,并辅以3D角度。大量实验表明,我们的模型在DGS语料库上取得了最先进的结果,而我们的特征在BSLCorpus上超越了先前的基准。

英文摘要

This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
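Framing segmentation as sequence labeling means a model emits one tag per frame, and sign boundaries are recovered by decoding those tags. The sketch below shows one plausible decoding of Begin-In-Out tags into frame spans. The single-letter tag alphabet, the function name, and the handling of malformed sequences are assumptions, not details from the paper.

```python
def bio_to_segments(tags):
    """Convert per-frame tags ("B", "I", "O") into inclusive (start, end) spans.

    A "B" always opens a new segment, closing any open one.
    An "I" without a preceding "B" is tolerated and opens a segment.
    Any other tag is treated as "O" and closes the open segment.
    """
    segments = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                segments.append((start, i - 1))
            start = i
        elif tag == "I":
            if start is None:
                start = i
        else:
            if start is not None:
                segments.append((start, i - 1))
                start = None
    if start is not None:
        segments.append((start, len(tags) - 1))
    return segments


# Ten frames containing three signs; the last two are back to back,
# which the "B" tag makes separable without an intervening "O".
spans = bio_to_segments(list("OBIIOOBIBI"))
```

The explicit begin tag is what lets adjacent signs be split even when there is no pause between them. A plain in/out labeling cannot do this.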

2508.07996 2026-05-27 cs.CV 版本更新

Structured Relational Reasoning for Group Activity Assessment

结构化关系推理用于群体活动评估

Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg

发表机构 * University of Stuttgart(斯图加特大学) University of Hildesheim(希尔德斯海姆大学)

AI总结 提出ProGraD框架,利用冻结视觉基础模型和轻量级GroupContext Transformer,通过结构化关系推理在单次前向传播中联合推断群体位置、成员关系和活动,仅用10M参数即在Cafe和Social-CAD基准上取得最优性能。

Comments Accepted to CVPR 2026 Workshop (SAUAFG)

详情
AI中文摘要

群体活动检测(GAD)涉及识别视频中的社会群体及其集体行为。视觉基础模型(VFM),如DINOv2,提供优秀的特征,但是在以物体为中心的数据上预训练的。我们发现,将它们简单替换到现有GAD流程中实际上会降低性能,暴露出结构化的群体感知解码才是真正的瓶颈。我们提出了ProGraD,一个基于冻结VFM构建的结构化关系推理框架。其核心是一个轻量级的两层GroupContext Transformer,显式建模个体-群体关联并聚合全局上下文以推断集体行为。可学习的群体提示作为最小条件机制,引导冻结骨干网络朝向社交相关表示,而关系解码器对个体和群体执行核心推理。该设计在单次前向传播中联合推断群体位置、成员关系和活动,仅使用10M可训练参数,不到先前方法的一半。在具有多个并发社交群体的Cafe基准上,ProGraD将Group mAP@1.0提升了6.5%,Group mAP@0.5提升了8.2%。在Social-CAD上,它实现了最先进的社交和成员关系准确性。ProGraD还生成可解释的注意力图,为个体-群体推理提供洞察。

英文摘要

Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DINOv2, offer excellent features but are pretrained on object-centric data. We find that naively substituting them into existing GAD pipelines actually degrades performance, exposing structured group-aware decoding as the true bottleneck. We introduce ProGraD, a structured relational-reasoning framework for GAD built on top of frozen VFMs. At its core is a lightweight two-layer GroupContext Transformer that explicitly models actor-group associations and aggregates global context to infer collective behavior. Learnable group prompts serve as a minimal conditioning mechanism to guide the frozen backbone toward socially relevant representations, while the relational decoder performs the core reasoning over actors and groups. This design jointly infers group locations, memberships, and activities in a single pass using only 10M trainable parameters - less than half of prior methods. On the Cafe benchmark with multiple concurrent social groups, ProGraD improves the state-of-the-art by 6.5% Group mAP@1.0 and 8.2% Group mAP@0.5. On Social-CAD, it achieves state-of-the-art social and membership accuracy. ProGraD further produces interpretable attention maps that provide insights into actor-group reasoning.

2508.00748 2026-05-27 cs.CV cs.AI cs.CR cs.MM 版本更新

Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

真的是你吗?探索逼真说话头像视频中的生物特征验证场景

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

发表机构 * Biometrics and Data Pattern Analytics Lab(生物特征与数据模式分析实验室)

AI总结 本文研究在逼真说话头像视频中,利用面部运动模式作为行为生物特征进行身份验证,提出基于图卷积网络的轻量级模型,AUC接近80%。

Comments Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)

详情
Journal ref
2025 IEEE International Joint Conference on Biometrics (IJCB)
AI中文摘要

逼真说话头像在虚拟会议、游戏和社交平台中越来越常见。这些头像允许更沉浸式的交流,但也引入了严重的安全风险。一个新兴威胁是冒充:攻击者可以窃取用户的头像,保留其外观和声音,使得仅凭视觉或听觉几乎无法检测欺诈性使用。在本文中,我们探讨了在这种头像中介场景中生物特征验证的挑战。我们的主要问题是,当头像的视觉外观是其主人的复制品时,个体的面部运动模式能否作为可靠的行为生物特征来验证其身份。为了回答这个问题,我们引入了一个新的数据集,其中包含使用最先进的一次性头像生成模型GAGAvatar创建的逼真头像视频,包括真实和冒充的头像视频。我们还提出了一种轻量级、可解释的时空图卷积网络架构,具有时间注意力池化,仅使用面部标志点来建模动态面部手势。实验结果表明,面部运动线索能够实现有意义的身份验证,AUC值接近80%。所提出的基准和生物特征系统可供研究社区使用,以引起对基于头像的通信系统中更高级行为生物特征防御的迫切需求的关注。

英文摘要

Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving their appearance and voice, making it nearly impossible to detect fraudulent use by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available to the research community to draw attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

2508.01253 2026-05-27 cs.CV 版本更新

ODOV: Benchmark the Open-Domain Open-Vocabulary Object Detection

ODOV:开放域开放词汇目标检测基准

Yupeng Zhang, Ruize Han, Fangnan Zhou, Wei Feng, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University(天津大学智能计算学院) Key Research Center for Surface Monitoring and Analysis of Relics, State Administration of Cultural Heritage(文物表面监测与分析国家重点研究中心) Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology(深圳先进技术大学计算机科学与人工智能学院)

AI总结 针对真实场景中域偏移和类别偏移同时发生的问题,提出开放域开放词汇目标检测任务,构建OD-LVIS基准数据集,并设计基于VLM的基线方法,通过域无关类别提示和域投影嫁接模块提升检测性能。

详情
AI中文摘要

现有研究通常将域偏移和类别偏移作为独立问题进行研究,然而在真实场景中,这两种偏移常常同时发生并相互作用,导致检测性能显著下降。为了解决这一问题,我们提出并系统研究了一个新问题——开放域开放词汇(ODOV)目标检测,旨在评估模型在真实环境中适应复合域和类别偏移的能力。我们构建了一个新的基准数据集OD-LVIS,包含来自15个不同真实场景的46,949张图像和1,203个类别,用于评估目标检测性能。此外,我们提出了一种新的ODOV检测基线,充分利用VLM强大的多模态对齐能力,并引入两种关键机制以增强类别和域泛化能力。一种是域无关类别提示(DAPmt),它在增强类别语义的同时减弱域表示,从而实现纯粹的类别表示。另一种是域投影与嫁接(DP&G)模块,它融合了输入图像中的域特定特征,使模型能够动态地在各种开放域中进行泛化。这两个组件使模型能够在真实场景中同时存在类别和域变化的情况下保持有效的检测性能。我们为提出的ODOV检测任务提供了广泛的基准评估,并报告了实验结果。这些结果验证了ODOV任务的合理性、OD-LVIS数据集的实用性以及该方法的优越性。

英文摘要

Existing studies typically investigate domain shift and category shift as independent problems. However, in real-world scenarios, the two types of shifts often occur simultaneously and interact, leading to significant degradation in detection performance. To address this, we propose and systematically study a novel problem, Open-Domain Open-Vocabulary (ODOV) object detection, which aims to evaluate a model's ability to adapt to compound domain and category shifts in real-world environments. We construct a new benchmark, OD-LVIS, which contains 46,949 images spanning 15 diverse real-world scenarios and 1,203 categories, for assessing object detection performance. Furthermore, we propose a novel ODOV detection baseline that fully leverages VLM's powerful multi-modal alignment capabilities and introduces two key mechanisms to enhance both category and domain generalization. One is the Domain-Agnostic Category Prompt (DAPmt), which strengthens category semantics while attenuating domain representations, enabling pure category representation. The other is the Domain Projection and Grafting (DP&G) module, which incorporates domain-specific features from input images, allowing the model to dynamically generalize across diverse open domains. These two components enable the model to maintain effective detection performance under simultaneous category and domain variations in real-world scenarios. We provide extensive benchmark evaluations for the proposed ODOV detection task and report experimental results. These results validate the soundness of the ODOV task, the practicality of the OD-LVIS dataset, and the superiority of the method.

2507.16116 2026-05-27 cs.CV 版本更新

Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation

Pusa V1.0: 通过向量化时间步长适应解锁预训练视频扩散模型中的时间控制

Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel

发表机构 * City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong(香港中文大学) Huawei Research(华为研究) Great Bay University(大湾区大学) AI Technology Center, Tencent PCG(腾讯AI技术中心) Lingnan University(岭南大学) Hong Kong Centre for Cerebro-Cardiovascular Health Engineering(香港脑心血管健康工程中心)

AI总结 提出向量化时间步长适应(VTA)方法,在统一视频扩散框架中实现细粒度时间控制,零样本完成图像到视频生成、起止帧控制等任务,且不破坏基础模型能力。

Comments Code is open-sourced at https://github.com/Yaofang-Liu/Pusa-VidGen

详情
AI中文摘要

视频扩散模型的快速发展受到时间建模基本限制的阻碍,特别是传统标量时间步长变量导致的帧演化刚性同步。尽管任务特定适应和自回归模型试图解决这些挑战,但它们仍受限于计算效率低下、灾难性遗忘或适用性狭窄。在这项工作中,我们提出了Pusa V1.0,一个利用向量化时间步长适应(VTA)在统一视频扩散框架中实现细粒度时间控制的通用模型。注意,VTA是一种非破坏性适应,意味着它完全保留了基础模型的能力。与Wan-I2V等传统方法(通过大量资源微调基础文本到视频(T2V)模型以进行图像到视频(I2V))不同,我们在基于VTA的超高效微调过程后以零样本方式实现了可比结果。此外,该方法还同时解锁了许多其他零样本能力,例如起止帧和视频扩展,所有这些都不需要任务特定训练。同时,它保留了基础模型的T2V能力。机制分析还表明,我们的方法保留了基础模型的生成先验,同时精确注入时间动态,避免了向量化时间步长固有的组合爆炸。这项工作为下一代视频合成建立了一个可扩展、高效且通用的范式,使高保真视频生成在研究和工业领域得以普及。

英文摘要

The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa V1.0, a versatile model that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Note that VTA is a non-destructive adaptation, which means that it fully preserves the capabilities of the base model. Unlike conventional methods like Wan-I2V, which finetune a base text-to-video (T2V) model with abundant resources to do image-to-video (I2V), we achieve comparable results in a zero-shot manner after an ultra-efficient finetuning process based on VTA. Moreover, this method also unlocks many other zero-shot capabilities simultaneously, such as start-end frames and video extension -- all without task-specific training. Meanwhile, it keeps the T2V capability from the base model. Mechanistic analyses also reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to the vectorized timestep. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike.

2507.06513 2026-05-27 cs.CV 版本更新

What Demands Attention in Urban Street Scenes? From Scene Understanding towards Road Safety: A Survey of Vision-driven Datasets and Studies

城市街景中什么需要关注?从场景理解到道路安全:视觉驱动数据集与研究的综述

Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall

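Affiliations * Australian Centre For Robotics (ACFR), The University of Sydney

AI summary This survey systematically categorizes the key elements that demand attention in traffic scenes, analyzes 35 vision-driven tasks and 73 datasets, and proposes a unified analytical framework intended to support road-safety research.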

Comments 40 tasks, 78 datasets

详情
AI中文摘要

基于视觉的传感器和计算机视觉算法的进步显著提升了对交通场景的分析与理解。为促进这些进步在道路安全中的应用,本综述系统分类了交通场景中需要关注的关键元素,并全面分析了现有的视觉驱动任务和数据集。与现有聚焦于孤立领域的综述相比,我们的分类法将值得关注的交通实体分为两大类:异常实体和正常但关键的实体,整合了十个类别和二十个子类。它建立了内在相关领域之间的联系,并提供了统一的分析框架。我们的综述重点分析了35个视觉驱动任务,并基于提出的分类法对73个可用数据集进行了全面检查和可视化。跨领域调查涵盖了每个基准的优缺点,旨在提供标准统一和资源优化的信息。文章最后系统讨论了现有弱点,从不同角度强调了潜在影响和有前景的解决方案。集成的分类法、全面分析和总结性表格为这一快速发展的领域提供了宝贵贡献,为研究人员提供了整体概览,指导战略性资源选择,并突出了关键研究空白。

英文摘要

Advances in vision-based sensors and computer vision algorithms have significantly improved the analysis and understanding of traffic scenarios. To facilitate the use of these improvements for road safety, this survey systematically categorizes the critical elements that demand attention in traffic scenarios and comprehensively analyzes available vision-driven tasks and datasets. Compared to existing surveys that focus on isolated domains, our taxonomy categorizes attention-worthy traffic entities into two main groups, anomalies and normal-but-critical entities, integrating ten categories and twenty subclasses. It establishes connections between inherently related fields and provides a unified analytical framework. Our survey analyzes 35 vision-driven tasks and comprehensively examines and visualizes 73 available datasets based on the proposed taxonomy. The cross-domain investigation covers the pros and cons of each benchmark with the aim of providing information on standards unification and resource optimization. Our article concludes with a systematic discussion of the existing weaknesses, underlining the potential effects and promising solutions from various perspectives. The integrated taxonomy, comprehensive analysis, and recapitulatory tables serve as valuable contributions to this rapidly evolving field by providing researchers with a holistic overview, guiding strategic resource selection, and highlighting critical research gaps.

2507.05757 2026-05-27 cs.CV 版本更新

Normal Patch Retinex Robust Alghoritm for White Balancing in Digital Microscopy

Normal Patch Retinex 稳健算法用于数字显微镜白平衡

Radoslaw Roszczyk, Artur Krupa, Izabella Antoniuk

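Affiliations * Faculty of Electrical Engineering; Institute of Information Technology

AI summary Proposes a fully automatic white-balancing algorithm based on Normal Patch Retinex for correcting colour images from digital microscopy; experiments show it outperforms classical algorithms.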

详情
Journal ref
Vol. 29 No. 1/4 (2020)
AI中文摘要

在光学显微镜中获取准确彩色、平衡的图像即使对于经验丰富的显微镜操作者也可能是一个挑战。本文提出了一种完全自动的白平衡机制,能够充分校正显微彩色图像。该算法的结果已在200张显微图像数据集上通过实验验证。这些图像包含病理形态学中常用的三种显微标本的扫描图。此外,将所得结果与数字摄影中其他常用的白平衡算法进行了比较。本文应用的算法对于苏木精-荧光桃红-番红染色的显微图像和免疫组织化学染色图像比彩色摄影中使用的经典算法更有效。

英文摘要

The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.

2506.17633 2026-05-27 cs.CV cs.AI 版本更新

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

自适应多提示对比网络用于少样本分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest

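Affiliations * College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary For few-shot out-of-distribution detection, proposes the Adaptive Multi-prompt Contrastive Network (AMCN), which uses CLIP to learn textual prompts and inter- and intra-class distributions, adapting the ID-OOD separation boundary.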

Comments Published in ICML 2025

详情
AI中文摘要

分布外(OOD)检测旨在区分异常样本,以防止在分布内(ID)数据集上训练的模型产生不可用的输出。大多数OOD检测方法需要大量IID样本进行训练,这严重限制了它们的实际应用。为此,我们针对一个具有挑战性的场景:少样本OOD检测,其中只有少量标记的ID样本可用。因此,少样本OOD检测比传统的OOD检测设置更具挑战性。先前的少样本OOD检测工作忽略了不同类别之间的显著多样性。在本文中,我们提出了一种新颖的网络:自适应多提示对比网络(AMCN),它通过学习类间和类内分布来适应ID-OOD分离边界。为了弥补OOD的缺失和ID图像样本的稀缺,我们利用CLIP连接文本与图像,设计可学习的ID和OOD文本提示。具体来说,我们首先生成自适应提示(可学习ID提示、标签固定OOD提示和标签自适应OOD提示)。然后,我们通过引入类级阈值为每个类生成自适应类边界。最后,我们提出一个提示引导的ID-OOD分离模块来控制ID和OOD提示之间的间隔。实验结果表明,AMCN优于其他最先进的工作。

英文摘要

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unusable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, to engineer learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

2506.11253 2026-05-27 cs.CV cs.LG 版本更新

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

将数据追踪的机器遗忘提升为基础模型的知识追踪

Yuwen Tan, Boqing Gong

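Affiliations * Boston University

AI summary Proposes lifting data-tracing machine unlearning to knowledge-tracing for foundation models, in order to serve diverse unlearning requests and align more closely with how humans forget, and illustrates how the paradigm could be instantiated through a vision-language model case study.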

Comments Accepted to TMLR

详情
AI中文摘要

机器遗忘从AI模型中移除特定训练数据点及其影响(例如,当数据所有者撤销其同意允许模型从数据中学习时)。在这篇立场论文中,我们提出将数据追踪的机器遗忘提升为基础模型(FMs)的知识追踪。我们基于实际需求和认知研究的见解支持这一立场。实际上,追踪数据无法满足对FMs的多样化遗忘请求,这些请求可能来自监管机构、企业用户、产品团队等,他们无法访问FMs的大量训练数据。相反,这些方方便提出关于FMs(不应)拥有的知识或能力的遗忘请求。认知上,知识追踪遗忘比追踪单个训练数据点更接近人脑的遗忘方式。我们进一步讨论了知识追踪机器遗忘范式中的重大挑战。最后,我们提供了一个关于视觉语言FMs的具体案例研究,以说明遗忘者如何实例化知识追踪机器遗忘范式。代码可在:https://1yuwen.github.io/Knowledge-Tracing-MU-Page 获取。

英文摘要

Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

2506.07813 2026-05-27 cs.CV cs.AI 版本更新

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

自级联扩散模型用于任意尺度图像超分辨率

Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

Affiliations * Department of Electrical and Computer Engineering (ECE), Seoul National University; Institute of New Media and Communications (INMC) & Interdisciplinary Program in AI (IPAI), Seoul National University

AI summary Proposes CasArbi, a self-cascaded diffusion framework that decomposes an arbitrary scaling factor into a sequence of smaller steps, raising resolution progressively while keeping scales consistent, and outperforms existing methods on perceptual and distortion metrics.

详情
AI中文摘要

任意尺度图像超分辨率旨在将图像上采样到任意期望分辨率,比传统固定尺度超分辨率提供更大灵活性。最近基于回归或生成模型的方法显示出有希望的结果,但由于其单阶段公式必须同时处理大范围的缩放因子,常常遭受尺度不一致的问题。为了解决这个问题,我们提出了CasArbi,一个用于任意尺度图像超分辨率的自级联扩散框架。CasArbi将不同的缩放因子分解为更小的顺序步骤,逐步提升图像分辨率,并在每一步实现任意尺度的无缝过渡。CasArbi利用坐标条件扩散模型学习连续图像表示,并在推理时采用自一致性指导生成尺度一致的细节。大量实验表明,CasArbi在感知和失真指标上均优于现有方法,并在各种任意尺度超分辨率基准上展现出卓越的尺度一致性。我们的代码可在https://github.com/junseo88/CasArbi获取。

英文摘要

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.
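The idea of decomposing a scaling factor into smaller sequential steps can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how CasArbi chooses its steps, so the greedy rule, the `max_step` cap of 2.0, and the function names `cascade_steps` and `intermediate_sizes` are all hypothetical.

```python
def cascade_steps(scale, max_step=2.0):
    """Split a total upscaling factor into per-step factors, each <= max_step.

    Greedy rule (an assumption, not the paper's schedule): take full-size
    steps while more than max_step remains, then one final smaller step.
    """
    if scale < 1.0:
        raise ValueError("scale must be at least 1")
    if max_step <= 1.0:
        raise ValueError("max_step must be greater than 1")
    steps = []
    remaining = scale
    while remaining > max_step:
        steps.append(max_step)
        remaining /= max_step
    if remaining > 1.0:
        steps.append(remaining)
    return steps


def intermediate_sizes(height, width, steps):
    """Resolution reached after each step of the cascade."""
    sizes = []
    factor = 1.0
    for step in steps:
        factor *= step
        sizes.append((round(height * factor), round(width * factor)))
    return sizes


steps = cascade_steps(5.0)
print(steps)
print(intermediate_sizes(64, 48, steps))
```

Each entry in `steps` would correspond to one pass of the super-resolution model in a cascade, so no single pass has to cover the full range of scaling factors.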

2505.18603 2026-05-27 cs.AI cs.CV 版本更新

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Doc-CoB:通过视觉链式框推理增强文档理解

Ye Mo, Kai Ye, Xianwei Mao, Zirui Shao, Gang Huang, Bo Zhang, Hangdi Xing, Kehan Chen, Huan Zhou, Zixu Yan, Jiajun Bu, Sheng Zhou

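Affiliations * Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, Zhejiang University; Alibaba Group; Shanghai AI Laboratory

AI summary Proposes the Doc-CoB framework, which adds coarse-to-fine, layout-aware visual reasoning to multimodal large language models, progressively focusing on query-relevant layout regions to improve document understanding.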

详情
AI中文摘要

文档理解旨在对文档图像进行问答和信息提取,其中视觉内容信息密集,大多数查询仅依赖于少数相关布局区域。然而,现有方法要么采用一次通过策略,隐式假设所有布局同等重要,要么过度关注小区域而丢失关键布局信息。为解决这些局限性,我们引入了Doc-CoB(链式框),一个简单而有效的框架,将粗到细的布局感知视觉推理集成到多模态大语言模型中。Doc-CoB不是直接放大到小区域,而是逐步聚焦于查询相关布局,同时保留全局文档信息。具体来说,它首先选择关键布局框,然后通过视觉提示聚焦于这些框进行进一步理解。为支持这一范式,我们引入了两个推理任务:框识别和框推理,并构建了一个自动流水线,生成24.9万个带有中间视觉监督的训练样本。在七个基准测试和四种流行模型上的广泛实验表明,Doc-CoB显著提升了性能,证明了其有效性和广泛适用性。

英文摘要

Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we introduce Doc-CoB (Chain-of-Boxes), a simple-yet-effective framework that integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models. Instead of directly zooming into small regions, Doc-CoB progressively focuses on query-relevant layouts while preserving global document information. Specifically, it first selects key layout boxes and then focuses on them for further understanding with visual prompting. To support this paradigm, we introduce two reasoning tasks for box recognition and box reasoning, with an automatic pipeline that constructs 249k training samples with intermediate visual supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

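Affiliations * South China University of Technology; Huawei Technologies Co., Ltd

AI summary Proposes the OCR-Reasoning benchmark of 1,069 human-annotated examples covering 6 core reasoning abilities and 18 practical reasoning tasks. Dual annotation (final answer and step-by-step reasoning process) is used to evaluate multimodal large language models on text-rich image reasoning, and no state-of-the-art model exceeds 50% accuracy.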

Comments ICLR 2026

详情
AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而,由于缺乏专门且系统的基准,它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白,我们提出了OCR-Reasoning,一个新颖的基准,旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言,OCR-Reasoning包含1069个人工标注的示例,涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同,本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程,从而全面评估文本丰富推理能力。利用该基准,我们对最新的多模态大语言模型进行了全面评估。结果表明,即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难,在我们的基准上没有一个模型的准确率超过50%,这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

2505.16942 2026-05-27 cs.CV cs.LG 版本更新

Efficient All-Pairs Correlation Volume Sampling for Optical Flow Estimation

高效的全对相关性体素采样用于光流估计

Karlis Martins Briedis, Markus Gross, Christopher Schroers

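Affiliations * DisneyResearch|Studios; ETH Zürich

AI summary Proposes a memory- and compute-efficient algorithm that implements the exact mathematical operator for all-pairs correlation volume sampling, delivering substantial speedups at low memory usage and state-of-the-art results for high-resolution optical flow estimation.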

Comments CVPR 2026

详情
AI中文摘要

最近的光流估计方法通常从密集的全对相关性体素中进行局部代价采样。这导致计算和内存复杂度与像素数成二次关系。尽管存在一种按需代价计算的替代内存高效实现,但在实践中速度明显较慢,因此许多先前方法在降采样分辨率下处理图像,丢失了细粒度细节。为了解决这个问题,我们提出了一种算法,用于全对相关性体素采样的内存和计算高效实现,同时仍然匹配RAFT定义的精确数学算子。我们的方法在保持同样低内存使用的情况下,性能优于按需采样高达92%,并且与默认实现相比,内存使用降低高达99%的同时性能至少相当。由于代价采样占整体运行时间的很大一部分,这可以转化为高分辨率输入下端到端模型推理总时间高达63%的节省。我们对现有方法的评估包括一个8K超高清数据集和SEA-RAFT方法的推理时间扩展。通过这一点,我们在高分辨率下在准确性和运行时间上都达到了最先进的结果。

英文摘要

Recent optical flow estimation methods often employ local cost sampling from a dense all-pairs correlation volume. This results in quadratic computational and memory complexity in the number of pixels. Although an alternative memory-efficient implementation with on-demand cost computation exists, this is significantly slower in practice and therefore many prior methods process images at downsampled resolutions, missing fine-grained details. To address this, we propose an algorithm for both memory and compute-efficient implementation of the all-pairs correlation volume sampling, still matching the exact mathematical operator as defined by RAFT. Our approach outperforms on-demand sampling by up to 92% while maintaining equally low memory usage, and performs at least on par with the default implementation with up to 99% lower memory usage. As cost sampling makes up a significant portion of the overall runtime, this can translate to up to 63% savings for the total end-to-end model inference on high-resolution inputs. Our evaluation of existing methods includes an 8K ultra-high-resolution dataset and an inference-time extension of the SEA-RAFT method. With this, we achieve state-of-the-art results at high resolutions both in accuracy and runtime.
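The quadratic cost mentioned in this abstract is easy to see in a toy sketch. The code below is not the proposed algorithm; it only shows the dense baseline that the abstract describes as memory-hungry, using plain dot products between per-pixel feature vectors. The function names and the 4-byte value size are assumptions, and `height` and `width` stand for whatever grid the features live on.

```python
def all_pairs_correlation(features1, features2):
    """Dense correlation: one dot product for every pair of pixels.

    Each argument is a list of N feature vectors (a flattened H x W grid),
    so the result has N x N entries.
    """
    return [
        [sum(a * b for a, b in zip(f1, f2)) for f2 in features2]
        for f1 in features1
    ]


def correlation_volume_bytes(height, width, bytes_per_value=4):
    """Memory needed to store the dense volume for an H x W grid."""
    num_pixels = height * width
    return num_pixels * num_pixels * bytes_per_value


# Two pixels with 2-D features in each image give a 2 x 2 volume.
volume = all_pairs_correlation([[1, 0], [0, 1]], [[1, 2], [3, 4]])
print(volume)

# Doubling both height and width multiplies the storage by 16.
print(correlation_volume_bytes(4, 4) // correlation_volume_bytes(2, 2))
```

On-demand sampling avoids storing this volume by recomputing only the entries each lookup needs, which is the memory-efficient but slower alternative the abstract contrasts against.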

2503.21510 2026-05-27 cs.LG cs.CV stat.ML 版本更新

An uncertainty-aware Bayesian framework for machine learning classification models: A case study in land cover classification

一种不确定性感知的贝叶斯机器学习分类模型框架:以土地覆盖分类为例

Samuel Bilson, Miles McCrory, Anna Pustogvar

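Affiliations * National Physical Laboratory, Teddington, UK; Department of Data Science; Department of Thermal & Radiometric Metrology; School of Geography, Geology & the Environment

AI summary Proposes a Bayesian framework for generative classification models that accounts for input measurement uncertainty, validated with a Bayesian quadratic discriminant analysis model on land cover datasets; the model is reported to be better than random forests and neural networks in interpretability, uncertainty modelling, and computational efficiency.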

Comments 38 pages, 16 figures

详情
AI中文摘要

确保机器学习分类模型的预测伴随不确定性估计是可信任人工智能的主要支柱之一。当前不确定性量化研究主要关注ML模型的认知不确定性,但很少考虑输入测量不确定性,而这对于计量学的可追溯性至关重要。在这项工作中,我们提出了一种考虑输入测量不确定性的生成式ML分类模型的贝叶斯框架。我们以贝叶斯二次判别分析(BQDA)模型为例,并将其应用于来自Copernicus Sentinel-2的2020年和2021年计量土地覆盖数据集。我们将该模型的性能与土地覆盖图中更流行的分类模型(如随机森林和神经网络)进行基准测试。为了验证和评估此类模型的泛化能力,我们还在合成分类数据上进行了模拟,改变了输入测量噪声的分布类型和强度。我们发现,对于真实和合成数据,所提出的BQDA模型更可信,因为它更具可解释性,显式建模了输入测量不确定性,并在不同领域和大小的数据集上保持了类别概率输出的预测性能,同时计算效率更高。

英文摘要

Ensuring that predictions of machine learning (ML) classification models are accompanied by uncertainty estimates is one of the main pillars of trustworthy AI. Current research in uncertainty quantification focuses mainly on epistemic uncertainty of the ML model, but rarely takes account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian framework for generative ML classification models that takes account of input measurement uncertainty. We take the specific case of a Bayesian quadratic discriminant analysis (BQDA) model, and apply it to metrological land cover datasets from Copernicus Sentinel-2 from 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. To validate and assess the generalisability of such a model, we also run simulations over synthetic classification data, varying distribution type and strength of the input measurement noise. We find for both real and synthetic data, the BQDA model presented is more trustworthy, in the sense that it is more interpretable, explicitly models the input measurement uncertainty, and maintains predictive performance of class probability outputs across datasets over different domains and sizes, whilst also being more computationally efficient.
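One way a generative classifier can take input measurement uncertainty into account is sketched below. It is a one-dimensional simplification and not the paper's BQDA model: each class is assumed to be a Gaussian with known prior, mean, and standard deviation, and the measurement noise is assumed Gaussian, so the class likelihood of a noisy measurement is a Gaussian whose variance is the sum of the class variance and the measurement variance. The class names and numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def class_posteriors(x, input_std, classes):
    """Posterior class probabilities for a measurement x with known noise.

    classes maps a class name to (prior, mean, std). Adding input_std ** 2
    to each class variance marginalises over the unknown true input value.
    """
    weighted = {
        name: prior * gaussian_pdf(x, mean, std ** 2 + input_std ** 2)
        for name, (prior, mean, std) in classes.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}


# Hypothetical two-class problem on a single spectral feature.
classes = {"water": (0.5, 0.0, 1.0), "forest": (0.5, 4.0, 1.0)}
precise = class_posteriors(1.0, 0.0, classes)     # noise-free measurement
uncertain = class_posteriors(1.0, 10.0, classes)  # very noisy measurement
print(precise["water"], uncertain["water"])
```

The noisier measurement yields a posterior closer to the prior, so the reported class probability reflects how much the input can be trusted rather than treating it as exact.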

2504.08540 2026-05-27 cs.CV 版本更新

Datasets for Lane Detection in Autonomous Driving: A Comprehensive Review

自动驾驶中车道检测数据集:全面综述

Jörg Gamerdinger, Sven Teufel, Oliver Bringmann

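Affiliations * University of Tübingen, Faculty of Science, Department of Computer Science, Embedded Systems Group

AI summary Comprehensively reviews 20 public lane detection datasets, analyzing their characteristics, strengths, and limitations with a multidimensional quality metric, and identifies directions for future improvement to advance robust lane detection.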

详情
AI中文摘要

准确的车道检测对于自动驾驶至关重要,能够在各种道路场景下实现安全可靠的车辆导航。为了支持车道检测算法的开发和评估,已经引入了许多数据集,这些数据集在数据量、传感器类型、注释粒度、环境条件和场景多样性方面各不相同。本文全面综述了20个公开可用的车道检测数据集,系统地分析了它们的特性、优势和局限性。我们基于传感器分辨率、注释类型以及道路和天气条件的多样性等关键性能指标,使用一种新颖的多维数据集质量指标对这些数据集进行分类。通过识别现有挑战和研究空白,我们强调了未来数据集改进的机会,这些改进可以进一步推动鲁棒车道检测的创新。本综述为寻求适用于鲁棒车道检测的数据集的研究人员提供了资源,并为推进自动驾驶的更广泛目标做出了贡献。

英文摘要

Accurate lane detection is essential for automated driving, enabling safe and reliable vehicle navigation across a variety of road scenarios. Numerous datasets have been introduced to support the development and evaluation of lane detection algorithms, each differing in terms of the amount of data, sensor types, annotation granularity, environmental conditions, and scenario diversity. This paper provides a comprehensive review of 20 publicly available lane detection datasets, systematically analyzing their characteristics, advantages, and limitations. We classify these datasets based on key performance indicators such as sensor resolution, annotation types and diversity of road and weather conditions using a novel multidimensional metric for dataset quality. By identifying existing challenges and research gaps, we highlight opportunities for future dataset improvements that can further drive innovation in robust lane detection. This review serves as a resource for researchers seeking appropriate datasets for robust lane detection and contributes to the broader goal of advancing autonomous driving.

2504.07853 2026-05-27 cs.CV 版本更新

V2V3D: View-to-View Denoised 3D Reconstruction for Light-Field Microscopy

V2V3D:用于光场显微镜的视图到视图去噪三维重建

Jiayin Zhao, Zhenqi Fu, Tao Yu, Hui Qiao

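Affiliations * Tsinghua University, Beijing 100084, China; Shanghai AI Laboratory, China

AI summary Proposes V2V3D, an unsupervised view-to-view framework that jointly optimizes image denoising and 3D reconstruction, exploits the independence of noise across views for noise2noise-style denoising, and introduces a wave-optics-based feature alignment technique to recover high-frequency details, surpassing existing methods in efficiency and performance.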

Comments CVPR 2025; New version: Fix NSFC ID

详情
AI中文摘要

光场显微镜(LFM)因其能够捕捉基于快照的大规模三维荧光图像而受到广泛关注。然而,现有的LFM重建算法对传感器噪声高度敏感,或者需要难以获取的真实标注数据进行训练。为了解决这些挑战,本文引入了V2V3D,一个无监督的基于视图到视图的框架,在统一架构中建立了图像去噪和三维重建联合优化的新范式。我们假设LF图像源自一致的三维信号,每个视图中的噪声是独立的。这使得V2V3D能够融入噪声到噪声原理以实现有效去噪。为了增强高频细节的恢复,我们提出了一种新颖的基于波动光学的特征对齐技术,该技术将用于波动光学中前向传播的点扩散函数转换为专门用于特征对齐的卷积核。此外,我们引入了一个包含LF图像及其对应三维强度体积的LFM数据集。大量实验表明,我们的方法实现了高计算效率,并优于其他最先进的方法。这些进展使V2V3D成为在挑战性条件下进行三维成像的有前景的解决方案。

英文摘要

Light field microscopy (LFM) has gained significant attention due to its ability to capture snapshot-based, large-scale 3D fluorescence images. However, existing LFM reconstruction algorithms are highly sensitive to sensor noise or require hard-to-get ground-truth annotated data for training. To address these challenges, this paper introduces V2V3D, an unsupervised view2view-based framework that establishes a new paradigm for joint optimization of image denoising and 3D reconstruction in a unified architecture. We assume that the LF images are derived from a consistent 3D signal, with the noise in each view being independent. This enables V2V3D to incorporate the principle of noise2noise for effective denoising. To enhance the recovery of high-frequency details, we propose a novel wave-optics-based feature alignment technique, which transforms the point spread function, used for forward propagation in wave optics, into convolution kernels specifically designed for feature alignment. Moreover, we introduce an LFM dataset containing LF images and their corresponding 3D intensity volumes. Extensive experiments demonstrate that our approach achieves high computational efficiency and outperforms the other state-of-the-art methods. These advancements position V2V3D as a promising solution for 3D imaging under challenging conditions.

2504.05046 2026-05-27 cs.CV 版本更新

MotionPRO: Exploring the Role of Pressure in Human MoCap and Beyond

MotionPRO:探索压力在人体动作捕捉及其它领域中的作用

Shenghao Ren, Yi Lu, Jiayi Huang, Jiayi Zhao, He Zhang, Tao Yu, Qiu Shen, Xun Cao

Affiliations * School of Electronic Science and Engineering, Nanjing University, Nanjing, China; Key Laboratory of Optoelectronic Devices and Systems with Extreme Performances of MOE, Nanjing University, Nanjing, China; BNRist, Tsinghua University, Beijing, China

AI summary Builds MotionPRO, a large-scale human motion capture dataset with pressure, RGB, and optical sensors, and designs pose and trajectory estimation networks that use pressure alone or fuse pressure with RGB, demonstrating that pressure signals are necessary and effective for improving physical plausibility and global trajectory accuracy and for driving virtual humans and humanoid robots.

Comments fix NSFC ID

详情
AI中文摘要

现有的人体动作捕捉(MoCap)方法大多关注视觉相似性而忽略物理合理性。因此,下游任务如驱动3D场景中的虚拟人或现实世界中的类人机器人会出现时间漂移和抖动、空间问题如滑动和穿透以及全局轨迹精度差等问题。在本文中,我们通过探索压力的作用,从人体与物理世界交互的角度重新审视人体动作捕捉。首先,我们构建了一个大规模的人体动作捕捉数据集,包含压力、RGB和光学传感器(命名为MotionPRO),该数据集由70名志愿者执行400种动作,共计1240万帧姿态。其次,我们通过两个具有挑战性的任务检验压力信号的必要性和有效性:(1)仅基于压力的姿态和轨迹估计:我们提出了一个包含小核解码器和长短期注意力模块的网络,并证明压力可以提供准确的全局轨迹和合理的下半身姿态。(2)融合压力和RGB的姿态和轨迹估计:我们沿相机轴施加正交相似性约束,沿垂直轴施加全身接触约束,以增强交叉注意力策略,融合压力和RGB特征图。实验表明,将压力与RGB特征融合不仅在客观指标上显著提升了性能,而且能够合理地驱动3D场景中的虚拟人(SMPL)。此外,我们证明融入物理感知使类人机器人能够执行更精确和稳定的动作,这对具身人工智能的发展非常有益。项目页面:https://nju-cite-mocaphumanoid.github.io/MotionPRO/

英文摘要

Existing human Motion Capture (MoCap) methods mostly focus on visual similarity while neglecting physical plausibility. As a result, downstream tasks such as driving virtual humans in 3D scenes or humanoid robots in the real world suffer from issues such as timing drift and jitter, spatial problems like sliding and penetration, and poor global trajectory accuracy. In this paper, we revisit human MoCap from the perspective of interaction between the human body and the physical world by exploring the role of pressure. First, we construct a large-scale human Motion capture dataset with Pressure, RGB and Optical sensors (named MotionPRO), which comprises 70 volunteers performing 400 types of motion, encompassing a total of 12.4M pose frames. Second, we examine both the necessity and effectiveness of the pressure signal through two challenging tasks: (1) pose and trajectory estimation based solely on pressure: we propose a network that incorporates a small-kernel decoder and a long-short-term attention module, and show that pressure can provide an accurate global trajectory and a plausible lower-body pose. (2) pose and trajectory estimation by fusing pressure and RGB: we impose constraints on orthographic similarity along the camera axis and whole-body contact along the vertical axis to enhance the cross-attention strategy used to fuse pressure and RGB feature maps. Experiments demonstrate that fusing pressure with RGB features not only significantly improves performance in terms of objective metrics, but also plausibly drives virtual humans (SMPL) in 3D scenes. Furthermore, we demonstrate that incorporating physical perception enables humanoid robots to perform more precise and stable actions, which is highly beneficial for the development of embodied artificial intelligence. Project page is available at: https://nju-cite-mocaphumanoid.github.io/MotionPRO/

2504.02775 2026-05-27 cs.CV cs.LG 版本更新

TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

TailedCore: 面向无监督长尾噪声异常检测的少样本采样

Yoon Gyo Jung, Jaewoo Park, Jaeho Yoon, Kuan-Chuan Peng, Wonchul Kim, Andrew Beng Jin Teoh, Octavia Camps

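Affiliations * Northeastern University; AiV Co.; Yonsei University; Mitsubishi Electric Research Laboratories

AI summary For normal datasets that are contaminated with defects and follow an unknown long-tailed class distribution, proposes TailSampler, which estimates class sizes so that tail classes and noise can be handled independently, and builds the memory-based anomaly detection model TailedCore, which achieves state-of-the-art performance in unsupervised long-tail noisy anomaly detection.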

Comments Accepted to CVPR2025

详情
AI中文摘要

我们旨在解决一个实际且具有挑战性的无监督异常检测问题,其中正常数据集既包含缺陷区域污染,其产品类别分布又是长尾但未知的。我们观察到现有模型存在尾类与噪声之间的权衡:如果模型对像素噪声鲁棒,则其在尾类样本上的性能会下降,反之亦然。为缓解该问题,我们独立处理尾类和噪声样本。为此,我们提出TailSampler,一种新颖的类别大小预测器,基于嵌入相似度的类别分布对称假设来估计样本的类别基数。TailSampler可用于专门采样尾类样本,从而单独处理它们。基于这些方面,我们构建了基于记忆的异常检测模型TailedCore,其记忆既能很好地捕捉尾类信息,又对噪声鲁棒。我们在无监督长尾噪声异常检测设置上广泛验证了TailedCore的有效性,并表明TailedCore在大多数设置下优于现有最先进方法。

英文摘要

We aim to solve unsupervised anomaly detection in a practical, challenging environment where the normal dataset is contaminated with defective regions and its product class distribution is long-tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off: if a model is robust against pixel noise, its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be used to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model, TailedCore, whose memory both captures tail class information well and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

2503.14359 2026-05-27 cs.CV 版本更新

ImViD: Immersive Volumetric Videos for Enhanced VR Engagement

ImViD:用于增强VR沉浸感的沉浸式体积视频

Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, Tao Yu

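Affiliations * Tsinghua University; Migu Beijing Research Institute; Institute of Automation, Chinese Academy of Science

AI summary Presents ImViD, a multi-view, multi-modal dataset that supports capturing complete scenes while on the move, and provides a benchmark and reconstruction pipeline for 6-DoF multi-modal immersive VR experiences.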

Comments CVPR 2025 Highlight; Fix NSFC ID

详情
AI中文摘要

用户参与度通过结合视觉和听觉刺激的完全沉浸式多模态体验得到极大增强。因此,VR/AR技术的下一个前沿在于具有完整场景捕获、大6自由度交互空间、多模态反馈以及高分辨率和高帧率内容的沉浸式体积视频。为了促进沉浸式体积视频的重建,我们引入了ImViD,这是一个多视角、多模态数据集,具有完整的面向空间的数据捕获和各种室内/室外场景。我们的捕获设备支持在移动中进行多视角视频-音频捕获,这是现有数据集所不具备的能力,显著提高了数据捕获的完整性、灵活性和效率。捕获的多视角视频(带有同步音频)为5K分辨率、60FPS,持续1-5分钟,包含丰富的前景-背景元素和复杂的动态。我们使用我们的数据集对现有方法进行基准测试,并建立了一个基础管线,用于从多视角视听输入构建用于6自由度多模态沉浸式VR体验的沉浸式体积视频。基准测试以及重建和交互结果证明了我们数据集和基线方法的有效性,我们相信这将激发未来对沉浸式体积视频制作的研究。

英文摘要

User engagement is greatly enhanced by fully immersive multi-modal experiences that combine visual and auditory stimuli. Consequently, the next frontier in VR/AR technologies lies in immersive volumetric videos with complete scene capture, a large 6-DoF interaction space, multi-modal feedback, and high-resolution, high-frame-rate content. To stimulate the reconstruction of immersive volumetric videos, we introduce ImViD, a multi-view, multi-modal dataset featuring complete space-oriented data capture and various indoor/outdoor scenarios. Our capture rig supports multi-view video-audio capture while on the move, a capability absent in existing datasets, significantly enhancing the completeness, flexibility, and efficiency of data capture. The captured multi-view videos (with synchronized audio) are in 5K resolution at 60FPS, last from 1 to 5 minutes, and include rich foreground-background elements and complex dynamics. We benchmark existing methods using our dataset and establish a base pipeline for constructing immersive volumetric videos from multi-view audiovisual inputs for 6-DoF multi-modal immersive VR experiences. The benchmark and the reconstruction and interaction results demonstrate the effectiveness of our dataset and baseline method, which we believe will stimulate future research on immersive volumetric video production.

2501.00520 2026-05-27 cs.CV cs.LG 版本更新

Innovative Silicosis and Pneumonia Classification: Leveraging Graph Transformer Post-hoc Modeling and Ensemble Techniques

创新性矽肺和肺炎分类:利用图Transformer后验建模与集成技术

Bao Q. Bui, Tien T. T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham

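AI summary Proposes an architecture that combines graph transformer networks with a conventional deep neural network, uses a balanced cross-entropy loss and an ensemble method, and reports high-accuracy silicosis and pneumonia classification on a self-built chest X-ray dataset.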

Comments Withdrawn by the authors because the manuscript contains incomplete and potentially misleading descriptions of the dataset construction and evaluation protocol, particularly in the Dataset and Experimental Setup sections. The work should not be cited or used as an independent reference in its current form

详情
AI中文摘要

本文对矽肺相关肺部炎症的分类与检测进行了全面研究。我们的主要贡献包括:1) 创建了一个名为SVBCX的新策划胸部X光(CXR)图像数据集,该数据集针对不同病原体引起的肺部炎症的细微差别进行了定制,为矽肺和肺炎研究社区提供了宝贵资源;2) 提出了一种新颖的深度学习架构,该架构将图Transformer网络与传统深度神经网络模块相结合,用于有效分类矽肺和肺炎。此外,我们采用平衡交叉熵(BalCE)作为损失函数,以确保不同类别之间的更均匀学习,增强模型辨别肺部状况细微差异的能力。所提出的模型架构和损失函数选择旨在提高炎症检测的准确性和可靠性,特别是在矽肺背景下。此外,我们的研究探索了一种集成方法的有效性,该方法结合了不同模型架构的优势。在构建的数据集上的实验结果表明,与基线模型相比,取得了显著改进。模型集成实现了宏F1分数0.9749,每个类别的AUC ROC分数超过0.99,突显了我们的方法在准确和鲁棒的肺部炎症分类中的有效性。

英文摘要

This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX, tailored to the nuances of lung inflammation caused by distinct agents, which provides a valuable resource for the silicosis and pneumonia research community; and 2) a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.

2406.03474 2026-05-27 cs.CV version update

AD-H: Language-guided Autonomous Driving with Hierarchical Agents

AD-H: Language-Guided Autonomous Driving Based on Hierarchical Agents

Zaibin Zhang, Talas Fu, Shiyu Tang, Yuanhang Zhang, Yifan Wang, Lijun Wang, Huchuan Lu

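Affiliations * Dalian University of Technology

AI summary Proposes AD-H, a hierarchical multi-agent framework in which an upper-level MLLM planner generates mid-level driving instructions and a lightweight lower-level controller executes continuous actions. A rule-based pipeline reconstructs 1.15M mid-level instruction pairs, and with 3B + 350M parameters the system outperforms 7B models while achieving long-horizon generalization and instruction following.

Details
AI Chinese abstract (translated into English)

Language-guided autonomous driving requires bridging the large abstraction gap between high-level natural-language instructions and low-level vehicle control. End-to-end approaches that use a single multimodal large language model (MLLM) to map language directly to actions struggle with this mismatch; they often fail to exploit the model's reasoning capabilities and generalize poorly beyond the distribution of the driving datasets used for fine-tuning. To address this, we propose AD-H, a hierarchical multi-agent framework that explicitly separates high-level decision-making from low-level vehicle execution. At the upper level, an MLLM-based planner interprets natural-language commands and the environmental context and generates coherent mid-level driving instructions. At the lower level, a lightweight controller converts these mid-level instructions into precise, continuous control actions. This decomposition matches the functional strengths of each component: the planner concentrates on semantic reasoning and task decomposition, while the controller ensures stable and accurate execution. To support large-scale training under this hierarchy, we design a rule-based pipeline that reconstructs mid-level commands from driving signals, yielding 1.15 million hierarchical annotation pairs. Extensive experiments show that AD-H outperforms state-of-the-art models despite using fewer parameters (3B plus 350M, versus 7B for the compared models) and achieves superior long-horizon generalization and instruction-following performance. Our data and code are publicly available at https://github.com/zhangzaibin/AD-H.

English abstract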

Language-guided autonomous driving requires bridging a large abstraction gap between high-level natural-language instructions and low-level vehicle control. End-to-end approaches that use a single multimodal large language model (MLLM) to map language directly to actions struggle with this mismatch, often failing to exploit the reasoning capabilities of the model and exhibiting limited generalization beyond the distributions of driving datasets used for fine-tuning. To address this issue, we propose AD-H, a hierarchical multi-agent framework that explicitly separates high-level decision-making from low-level vehicle execution. At the upper level, an MLLM-based planner interprets natural-language commands and environmental context to generate coherent mid-level driving instructions. At the lower level, a lightweight controller converts these mid-level instructions into precise, continuous control actions. This decomposition aligns with the functional strengths of each component: the planner focuses on semantic reasoning and task decomposition, while the controller ensures stable and accurate actuation. To support large-scale training under this hierarchy, we design a rule-based pipeline that reconstructs mid-level commands from driving signals, producing 1.15 million hierarchical annotation pairs. Extensive experiments show that AD-H outperforms state-of-the-art models despite using fewer parameters, namely 3B plus 350M compared with 7B, and achieves superior long-horizon generalization and instruction-following performance. We make our data and code publicly accessible at https://github.com/zhangzaibin/AD-H
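
The rule-based reconstruction of mid-level commands from driving signals mentioned in the AD-H abstract can be pictured with a small sketch. The abstract does not give the actual rules, so the signal names (`speed_mps`, `steer`), the thresholds, and the command vocabulary below are hypothetical illustrations and not the authors' pipeline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    speed_mps: float  # vehicle speed in metres per second (hypothetical signal)
    steer: float      # normalized steering in [-1, 1]; negative = left (assumed convention)


def mid_level_command(prev: Frame, cur: Frame,
                      steer_thresh: float = 0.2,
                      accel_thresh: float = 0.5) -> str:
    """Map two consecutive frames to a coarse driving instruction (illustrative rules)."""
    if cur.speed_mps < 0.1:
        return "stop"
    if cur.steer <= -steer_thresh:
        return "turn left"
    if cur.steer >= steer_thresh:
        return "turn right"
    delta = cur.speed_mps - prev.speed_mps
    if delta >= accel_thresh:
        return "accelerate"
    if delta <= -accel_thresh:
        return "slow down"
    return "keep lane"


def relabel(frames: List[Frame]) -> List[str]:
    """Produce one mid-level command per consecutive frame pair."""
    return [mid_level_command(a, b) for a, b in zip(frames, frames[1:])]


if __name__ == "__main__":
    log = [Frame(5.0, 0.0), Frame(6.0, 0.0), Frame(6.0, -0.5), Frame(0.0, 0.0)]
    print(relabel(log))
```

In a setup like this, each generated command would be paired with the corresponding low-level control segment, which is one plausible reading of the "hierarchical annotation pairs" in the abstract.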

2405.16417 2026-05-27 cs.CV version update

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

CRoFT: Concurrently Optimized Robust Fine-Tuning for OOD Generalization and Open-Set OOD Detection

Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye

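Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary To address distribution shift when fine-tuning vision-language pre-trained models, proposes a unified fine-tuning framework based on minimizing the gradient of the energy score, improving closed-set OOD generalization and open-set OOD detection at the same time.

Details
AI Chinese abstract (translated into English)

Recent vision-language pre-trained models (VL-PTMs) have achieved remarkable success on open-vocabulary tasks. However, downstream use cases usually involve further fine-tuning of VL-PTMs, which can distort their general knowledge and impair their ability to handle distribution shift. In real-world scenarios, machine learning systems inevitably encounter both covariate shift (for example, changes in image style) and semantic shift (for example, classes unseen until test time). This underlines the importance of strengthening out-of-distribution (OOD) generalization under covariate shift while also detecting unseen classes that arise from semantic shift. A critical but underexplored question therefore arises: how can the generalization of VL-PTMs to closed-set OOD data be improved during fine-tuning while open-set unseen classes are still detected effectively? In this paper, we propose a novel objective function for OOD detection that also helps improve OOD generalization. We show that minimizing the gradient magnitude of the energy score on the training data leads to domain-consistent Hessians of the classification loss, which theoretical analysis reveals to be a strong indicator of OOD generalization. Building on this finding, we develop a unified fine-tuning framework that optimizes both tasks concurrently. Extensive experiments demonstrate the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.

English abstract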

Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of enhancing out-of-distribution (OOD) generalization on covariate shifts and simultaneously detecting semantic-shifted unseen classes. Thus a critical but underexplored question arises: How to improve VL-PTMs' generalization ability to closed-set OOD data, while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function of OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of classification loss, a strong indicator for OOD generalization revealed by theoretical analysis. Based on this finding, we have developed a unified fine-tuning framework that allows for concurrent optimization of both tasks. Extensive experiments have demonstrated the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.
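
As a rough illustration of the quantity the CRoFT abstract refers to, the sketch below computes an energy score and the squared magnitude of its gradient. It assumes the commonly used definition E = -log sum_k exp(logit_k) and, purely for illustration, a linear classification head with the gradient taken with respect to the input features. The abstract does not state which variables the gradient is taken over, and names such as `energy_grad_penalty` are hypothetical rather than taken from the CRoFT code.

```python
import numpy as np


def energy_score(logits: np.ndarray) -> float:
    """Energy E = -logsumexp(logits), computed in a numerically stable way."""
    m = logits.max()
    return float(-(m + np.log(np.exp(logits - m).sum())))


def energy_grad_wrt_features(W: np.ndarray, b: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Gradient of the energy with respect to features z for logits = W @ z + b.

    Since dE/dlogits = -softmax(logits), the chain rule gives dE/dz = -W.T @ softmax.
    """
    logits = W @ z + b
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -(W.T @ p)


def energy_grad_penalty(W: np.ndarray, b: np.ndarray, z: np.ndarray) -> float:
    """Squared gradient magnitude, usable as a regularization term (illustrative)."""
    g = energy_grad_wrt_features(W, b, z)
    return float(g @ g)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b, z = rng.normal(size=(3, 4)), np.zeros(3), rng.normal(size=4)
    print(energy_score(W @ z + b), energy_grad_penalty(W, b, z))
```

In a fine-tuning loop, a term like this would be added to the classification loss with some weight; the weighting and the exact variables involved are design choices the abstract leaves unspecified.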

2404.18539 2026-05-27 cs.CV cs.AI version update

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

Skeleton-Based Methods for Enhancing the Topological Accuracy of Boundary Segmentation

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

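Affiliations * University of Science and Technology Beijing; Beijing Advanced Innovation Center for Materials Genome Engineering; School of Intelligence Science and Technology; Shunde Innovation School; Institute for Advanced Materials and Technology; Key Laboratory of Intelligent Bionic Unmanned Systems; Institute of Materials Intelligent Technology; Liaoning Academy of Materials; School of Materials Science and Technology

AI summary Proposes the Skea-Topo Aware loss, which uses skeleton-aware weighting and a boundary rectified term to improve the topological consistency of boundary segmentation in reticular images; on three datasets it improves the VI metric by up to 7 points over 13 methods.

Details
Journal ref
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 1092-1100, 2024
AI Chinese abstract (translated into English)

Topological consistency plays a key role in boundary segmentation of reticular images, such as cell membrane segmentation in neuron electron microscopy images, grain boundary segmentation in material microscopy images, and road segmentation in aerial images. In these fields, topological changes in the segmentation result seriously affect downstream tasks, and their impact can even exceed that of misalignment of the boundary itself. To improve the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that accounts for the shape of each object and the topological importance of each pixel. It has two components. First, a skeleton-aware weighted loss improves segmentation accuracy by making better use of skeletons to model object geometry. Second, a boundary rectified term uses the foreground and background skeletons of both the ground truth and the prediction to identify and emphasize topologically critical pixels among the prediction errors. Experiments on three different boundary segmentation datasets, with both objective and subjective evaluation, show that our method improves topological consistency by up to 7 points in the VI metric compared with 13 state-of-the-art methods. The code is available at https://github.com/clovermini/Skea_topo.

English abstract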

Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on the downstream tasks, which can even exceed the impact of misalignment of the boundary itself. To enhance the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, which is a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-the-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.
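
The gain reported in the abstract is measured in VI. Assuming VI denotes variation of information, as is common for boundary segmentation, it can be computed from two label maps as VI = H(A|B) + H(B|A), with lower values indicating closer agreement. The sketch below is a generic NumPy implementation using natural logarithms; how the paper converts boundary maps into labelled regions, and which log base or scaling it uses, are not given in the abstract.

```python
import numpy as np


def _entropy(p: np.ndarray) -> float:
    """Shannon entropy (natural log) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


def variation_of_information(seg_a, seg_b) -> float:
    """VI between two segmentations given as integer label arrays of equal shape."""
    a = np.asarray(seg_a).ravel()
    b = np.asarray(seg_b).ravel()
    if a.size != b.size:
        raise ValueError("segmentations must have the same number of pixels")
    # Re-index labels to 0..K-1 so they can address a contingency table.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(joint, (a_idx, b_idx), 1)  # count co-occurring label pairs
    joint /= a.size
    h_joint = _entropy(joint.ravel())
    h_a = _entropy(joint.sum(axis=1))
    h_b = _entropy(joint.sum(axis=0))
    # H(A|B) + H(B|A) = 2 H(A,B) - H(A) - H(B)
    return 2 * h_joint - h_a - h_b


if __name__ == "__main__":
    truth = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
    merged = np.zeros_like(truth)  # a prediction that merges the two regions
    print(variation_of_information(truth, merged))
```

The metric depends only on how pixels are grouped, not on the label values themselves, so a prediction that merges or splits regions is penalized even when the boundary is displaced by only a few pixels.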

2306.09344 2026-05-27 cs.CV cs.LG version update

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

DreamSim: Learning New Dimensions of Human Visual Similarity with Synthetic Data

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

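Affiliations * MIT; Weizmann Institute of Science; Adobe Research

AI summary Proposes the DreamSim metric, trained on synthetic data, which aligns with human perception at the mid-to-high level of image layout, object pose, and semantic content, and outperforms existing metrics on retrieval and reconstruction tasks.

Comments Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim

Details
Journal ref
Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
AI Chinese abstract (translated into English)

Current perceptual similarity metrics operate at the level of pixels and patches. They compare images in terms of low-level color and texture but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that evaluates images holistically. The first step is to collect a new dataset of human similarity judgments on image pairs that are similar in a variety of ways. The key property of this dataset is that the judgments are nearly automatic and shared by all observers. To achieve this, we use recent text-to-image models to create synthetic pairs perturbed along different dimensions. We observe that popular perceptual metrics cannot explain our new data, so we introduce a new metric, DreamSim, tuned to align better with human perception. We analyze how different visual attributes affect our metric and find that it attends mainly to foreground objects and semantic content while remaining sensitive to color and layout. Notably, although it is trained on synthetic data, our metric generalizes to real images and achieves strong results on retrieval and reconstruction tasks. Moreover, on these tasks it outperforms both previously learned metrics and recent large vision models.

English abstract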

Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.
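
One way to picture how a learned metric is compared against human judgments is a two-alternative test: for each reference image, the metric should pick the same candidate that human raters judged more similar. The sketch below assumes the metric is a cosine distance between image embeddings and that judgments come as (reference, candidate A, candidate B, human choice) tuples. Both are assumptions for illustration, and the embeddings here are toy vectors rather than outputs of the DreamSim model.

```python
import numpy as np


def cosine_distance(u, v) -> float:
    """1 - cosine similarity between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def agreement_with_humans(ref_emb, a_emb, b_emb, human_choice) -> float:
    """Fraction of triplets where the metric picks the candidate humans picked.

    human_choice[i] is 0 if raters chose candidate A and 1 if they chose B
    (hypothetical encoding).
    """
    hits = 0
    for r, a, b, h in zip(ref_emb, a_emb, b_emb, human_choice):
        pred = 0 if cosine_distance(r, a) <= cosine_distance(r, b) else 1
        hits += int(pred == h)
    return hits / len(human_choice)


if __name__ == "__main__":
    refs = [[1.0, 0.0], [0.0, 1.0]]
    cand_a = [[1.0, 0.1], [1.0, 0.0]]
    cand_b = [[0.0, 1.0], [0.1, 1.0]]
    print(agreement_with_humans(refs, cand_a, cand_b, [0, 1]))
```

A metric that "falls short of explaining" the human data, in the abstract's words, would score closer to chance (0.5) under a protocol like this one.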

2312.02694 2026-05-27 cs.CV version update

UPOCR: Towards Unified Pixel-Level OCR Interface

UPOCR: Toward a Unified Pixel-Level OCR Interface

Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin

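Affiliations * South China University of Technology; INTSIG Information Co. Ltd; INTSIG-SCUT Joint Lab of Document Image Analysis

AI summary Proposes UPOCR, a unified pixel-level OCR model based on a ViT encoder-decoder with learnable task prompts, which achieves state-of-the-art performance on text removal, text segmentation, and tampered text detection with a single model.

Comments ICML 2024 Version

Details
AI Chinese abstract (translated into English)

Existing optical character recognition (OCR) methods rely on task-specific designs with differing paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders rapid deployment in applications. To this end, we propose UPOCR, a simple yet effective generalist model for a unified pixel-level OCR interface. Specifically, UPOCR unifies the paradigm of different OCR tasks as image-to-image transformation and unifies the architecture as a vision Transformer (ViT)-based encoder-decoder with learnable task prompts. These prompts push the general feature representations extracted by the encoder toward task-specific spaces, giving the decoder task awareness. In addition, model training uniformly aims to minimize the discrepancy between the predicted image and the ground-truth image, regardless of the heterogeneity among tasks. Experiments were conducted on three pixel-level OCR tasks: text removal, text segmentation, and tampered text detection. Without bells and whistles, the results show that the proposed method achieves state-of-the-art performance on all three tasks simultaneously with a single unified model, offering valuable strategies and insights for future research on generalist OCR models. The code is available at https://github.com/shannanyinxiang/UPOCR.

English abstract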

Existing optical character recognition (OCR) methods rely on task-specific designs with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for a Unified Pixel-level OCR interface. Specifically, UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder with learnable task prompts. The prompts push the general feature representations extracted by the encoder towards task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the predicted and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results show that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code is available at https://github.com/shannanyinxiang/UPOCR.