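arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.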

2605.26884 2026-05-27 cs.CV version update

Small Object Detection in Industrial Recycling: A New Dataset and YOLO Performance Evaluation

Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickael Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet

Affiliations: Mines Saint-Etienne, CNRS, UMR 5307 LGF

AI summary: Addressing the difficulty of detecting small, dense, and overlapping objects in industrial recycling, the paper presents a new dataset and compares supervised deep-learning methods, evaluating the performance, accuracy, and computational efficiency of YOLO and other systems, and explores the benefits of data augmentation and synthetic images.

DOI: 10.1117/1.jei.35.3.031203
Journal ref: Journal of Electronic Imaging 2026

Abstract

In this paper, we address the problem of detecting small, dense, and overlapping objects, a major challenge in computer vision. Our focus is on reviewing proposed methods based on deep learning supervised approaches. We provide a detailed comparison of these systems on a new dataset of more than 10k images and 120k instances, highlighting their performance, accuracy, and computational efficiency in the industrial recycling process use case. Through this comparative analysis, we identify the most reliable systems currently available and the specific challenges they are designed to tackle. Furthermore, we explore the benefits of data augmentation and synthetic images. Based on our analysis, we also propose potential future directions and innovative solutions that could enhance the effectiveness of small, dense and overlapped object detection systems. The scope of our investigations encompasses object detection, length measurement, and anomaly detection within the context of the recycling process. The anomaly detection strategy is robust against variations in image resolution and zoom levels, ensuring reliable performance in industrial applications. The repository of the proposed dataset, methods and evaluation codes can be found at: https://github.com/o-messai/SDOOD

2605.26601 2026-05-27 cs.CV version update

FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling

Guixian Xu, Yide Liang, Zeli Su, Xuexian Song, Ziyin Zhang, Yushuang Dong, Ting Zhang, Xu Han

Affiliations: Hainan International College, Minzu University of China; School of Information Engineering, Minzu University of China; Shanghai Jiao Tong University; Institute of Automation, Chinese Academy of Sciences

AI summary: To address the lack of reproducible training and evaluation infrastructure for Tibetan vision-language modeling, the paper proposes the FTibSuite resource suite, comprising the FTibData dataset, the FTibBench benchmark, and the FTibVLM baseline model, and reports significant gains across multiple tasks.

Abstract

Vision-language models have progressed rapidly, but Tibetan remains a severely underserved low-resource language due to the lack of reproducible training and evaluation infrastructure. To fill this gap, we introduce FTibSuite, a comprehensive resource suite for Tibetan vision-language research, consisting of FTibData (human-verified multimodal training corpora spanning continual pretraining, image-text alignment, and instruction tuning data), FTibBench (Tibetan adaptations of five mainstream multimodal benchmarks with a hierarchical quality-control workflow to reduce translation noise), and FTibVLM, a reproducible baseline built on Qwen3-VL-8B-Instruct via a three-stage adaptation pipeline. Experiments on FTibBench show FTibVLM delivers consistent performance gains across all tasks, such as improving MMBench accuracy from 42.97 to 67.78 and POPE-random accuracy from 47.53 to 80.56, while retaining the backbone's original Chinese capabilities with minimal degradation, providing the first standardized foundation for Tibetan multimodal research.

2605.25046 2026-05-27 cs.CV cs.AI version update

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng

Affiliations: College of Artificial Intelligence, National Yang Ming Chiao Tung University; Mitsubishi Electric Research Laboratories

AI summary: Proposes the TinyFormer hybrid detector, which preserves shallow high-resolution features through a Parallel Bi-fusion Module (PBM) and uses a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization, improving small-object detection accuracy on MS COCO.

Abstract

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.

2605.27372 2026-05-27 cs.CV version update

G3T Up! Gravity Aligned Coordinate Frames Simplify Pointmap Processing

Bharath Raj Nagoor Kani, Noah Snavely

Affiliations: Cornell University

AI summary: Proposes the G3T model, which predicts gravity-aligned pointmaps instead of camera-centric ones, using scene-structure priors to reduce rotational degrees of freedom and improve 3D reconstruction accuracy.

Comments: Project page: https://g3t-paper.github.io/

Abstract

Modern feed-forward 3D reconstruction methods like VGGT predict pixel-aligned pointmaps in camera-centric coordinate frames. However, this choice of coordinate frame is not always optimal. We propose instead to predict pointmaps in upright, gravity-aligned frames that exploit strong structural cues present in many real-world scenes. Unlike camera-centric frames, gravity-aligned frames share a common vertical axis across viewpoints, reducing the rotational degrees of freedom needed to relate pointmaps to one another. To this end, we introduce the Gravity Grounded Geometry Transformer (G3T), fine-tuned from existing models on gravity-aligned 3D data. G3T produces highly accurate gravity-aware predictions, including upright pointmaps and camera-to-gravity poses. We further introduce G3T-Long, a submap-based incremental 3D reconstruction pipeline that leverages the reduced rotational degrees of freedom afforded by upright frames to achieve significantly improved reconstruction accuracy.
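
The reduction in rotational degrees of freedom can be illustrated with a small sketch. This is not G3T's code: it assumes a hypothetical convention in which world "down" is (0, 0, -1), and it uses the standard Rodrigues construction to rotate a measured gravity direction onto that axis. The helper names `gravity_alignment` and `yaw_rotation` are made up for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(R, v):
    return tuple(dot(row, v) for row in R)

def gravity_alignment(gravity_cam, down=(0.0, 0.0, -1.0)):
    """Rotation taking the camera-frame gravity direction onto `down`.

    Rodrigues form: R = I + K + K^2 / (1 + c), with K the cross-product
    matrix of v = a x b and c = a . b. The `down` axis is an assumption.
    """
    a, b = normalize(gravity_cam), normalize(down)
    (vx, vy, vz), c = cross(a, b), dot(a, b)
    if c < -1.0 + 1e-12:
        raise ValueError("gravity is antiparallel to `down`; axis is ambiguous")
    k = 1.0 / (1.0 + c)
    return (
        (1 - k * (vy * vy + vz * vz), -vz + k * vx * vy, vy + k * vx * vz),
        (vz + k * vx * vy, 1 - k * (vx * vx + vz * vz), -vx + k * vy * vz),
        (-vy + k * vx * vz, vx + k * vy * vz, 1 - k * (vx * vx + vy * vy)),
    )

def yaw_rotation(theta):
    """Rotation about the vertical (z) axis: the only freedom left upright."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
```

Once two frames are both upright, any residual rotation between them is a yaw about the shared vertical axis, so relating them takes one angle rather than three.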

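2605.27343 2026-05-27 cs.CV cs.LG version update

Towards Controllable Image Generation through Representation-Conditioned Diffusion Models

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

AI summary: Proposes conditioning diffusion models on representations from pretrained self-supervised models to achieve controllable image generation without extensive annotation, and explores the smoothness and separation properties of the representation space.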

2605.27336 2026-05-27 cs.CV version update

PARE: Pruning and Adaptive Routing for Efficient Video Generation

Yutong Wang, Yunke Wang, Tianfan Xue, Yu Qiao, Yaohui Wang, Xinyuan Chen, Chang Xu

Affiliations: The University of Sydney; Shanghai AI Laboratory; The Chinese University of Hong Kong

AI summary: Proposes PARE, which compresses width through structure-aware pruning and depth through input-adaptive routing to jointly reduce the computation of video diffusion Transformers, substantially lowering per-step computation on Wan2.1-14B while preserving quality.

Abstract

Video Diffusion Transformers (DiTs) generate high-quality videos but demand substantial compute due to wide blocks, deep architectures, and iterative sampling. Recent methods reduce cost by compressing width, depth, or sampling steps, but typically commit to a fixed architecture that cannot adapt to individual inputs or denoising stages. We propose PARE (Pruning and Adaptive Routing for Efficient video generation), which jointly compresses width and depth with structure-aware pruning and input-adaptive routing. For width, we observe that attention heads specialize into spatial and temporal roles, and design importance scoring that accounts for this distinction to prevent motion-critical temporal heads from being pruned prematurely. For depth, we train a lightweight router conditioned on denoising timestep and visual content to dynamically select which blocks to execute at each step, enabling per-input compute adaptation rather than static block removal. A progressive pipeline first recovers width-pruned quality via distillation, then jointly optimizes the student and router to decouple the two learning objectives. Experiments on Wan2.1-14B for both image-to-video and text-to-video generation show that PARE substantially reduces per-step computation while preserving quality across VBench dimensions, and composes with step distillation for further acceleration.

2605.27332 2026-05-27 cs.SE cs.AI cs.CV version update

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Zhifei Dou, Shabnam Hassani, Ou Wei

Affiliations: Huawei Research Canada

AI summary: Proposes EdgeFlow, which adds a Canny edge map to the vision-language model (VLM) input as a structural prior, improving flowchart-to-Mermaid conversion accuracy without training data or fine-tuning; on an industrial dataset, node F1 improves by 17.39 percentage points and edge F1 by 16.94 percentage points.

Comments: 10 pages

Abstract

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in converting these flowcharts into machine-readable models for requirements engineering (RE) activities, yet when directly applied to flowchart conversion they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.
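
To make the node-level and edge-level F1 figures concrete, here is a minimal sketch of a set-based F1 score. The abstract does not state the paper's matching protocol, so this assumes exact matching on node labels and on (source, target) edge pairs; the function name `f1_score` and the toy graphs are hypothetical.

```python
def f1_score(predicted, reference):
    """Set-based F1 between predicted and reference graph elements.

    Assumption: elements match only when they are exactly equal.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy example: the reference flowchart versus a converted one.
reference_edges = {("A", "B"), ("B", "C"), ("B", "D")}
predicted_edges = {("A", "B"), ("B", "C"), ("C", "D")}  # one edge misrouted

reference_nodes = {"A", "B", "C", "D"}
predicted_nodes = {"A", "B", "C", "D"}

node_f1 = f1_score(predicted_nodes, reference_nodes)
edge_f1 = f1_score(predicted_edges, reference_edges)
```

In this toy case every node is recovered but one edge is attached to the wrong source, so node F1 stays perfect while edge F1 drops, which is the kind of topology error the abstract describes.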

2605.27318 2026-05-27 cs.CV version update

Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning

Xianqiang Gao, Qizhi Chen, Delin Qu, Haoming Song, Zhigang Wang, Bin Zhao, Dong Wang, Xuelong Li

Affiliations: University of Science and Technology of China; Shanghai AI Lab; Northwestern Polytechnical University; TeleAI

AI summary: Proposes the Q-GeoMem framework, which uses a question-guided geometric memory mechanism combining a Fine-Grained Context Bank and a Semantic-Geometric Evidence Bank to achieve state-of-the-art performance on video spatial reasoning tasks.

Abstract

Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
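
The scoring and replacement idea (relevance times novelty, with a capacity limit) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how novelty is computed or what the replacement rule is, so the sketch assumes novelty is one minus the highest cosine similarity to retained entries, and that a full bank evicts its lowest-scoring entry when a better candidate arrives. The relevance value is passed in as a plain number standing in for the Q-Former-based score; `EvidenceBank` and `offer` are hypothetical names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class EvidenceBank:
    """Fixed-capacity memory of (score, feature) pairs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def novelty(self, feature):
        # Assumed definition: 1 - max cosine similarity, clamped to [0, 1].
        if not self.items:
            return 1.0
        best = max(cosine(feature, f) for _, f in self.items)
        return max(0.0, min(1.0, 1.0 - best))

    def offer(self, feature, relevance):
        """Score a candidate frame; return True if it was stored."""
        score = relevance * self.novelty(feature)
        if len(self.items) < self.capacity:
            self.items.append((score, feature))
            return True
        # Assumed replacement rule: evict the lowest-scoring entry.
        worst = min(range(len(self.items)), key=lambda i: self.items[i][0])
        if score > self.items[worst][0]:
            self.items[worst] = (score, feature)
            return True
        return False
```

Because the score is a product, a frame that is highly relevant but redundant with what is already stored scores near zero, as does a novel frame that is irrelevant to the question.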

2605.27311 2026-05-27 cs.CL cs.CV version update

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

Affiliations: Gemini Report

AI summary: Presents Gemini Embedding 2, a native multimodal embedding model that unifies the representation space of video, audio, images, and text through multi-task, multi-stage contrastive learning, reaching state-of-the-art performance on unimodal, cross-modal, and multimodal retrieval tasks.

Abstract

We introduce Gemini Embedding 2, a native multimodal embedding model that allows embedding video, audio, image, and text modalities in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities that generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. We show that our embedding model demonstrates strong performance (with a score of 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual and 84.0 on MTEB Code) across a variety of tasks surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation and search. Furthermore, its robust zero-shot performance across distinct fields - from astronomy and bioscience to fine arts and the culinary arts - establishes it as a highly reliable, out-of-the-box representation even for specialized domains.

2605.27287 2026-05-27 cs.CV version update

Chaos-SSL: An Attention-Based Self-Supervised Learning Framework Using Chaotic Transformations for Medical Image Classification

Joao Batista Florindo

Affiliations: Institute of Mathematics, Statistics and Scientific Computing, University of Campinas

AI summary: Proposes the Chaos-SSL framework, which uses 1D chaotic maps as non-linear data augmentation for self-supervised pre-training and combines it with an attention-based fusion model, achieving performance competitive with the state of the art on skin lesion and diabetic retinopathy classification.

DOI: 10.5220/0014436000004084
Journal ref: In Proceedings of VISAPP 2026 - Volume 1, pages 574-581

Abstract

Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create "harder" and more semantically rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
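
For readers unfamiliar with the three maps named above, the sketch below gives one common textbook parameterization of each. How the paper applies them to images is not described in the abstract; the `chaotic_augment` helper, which iterates a map pointwise over intensities normalized to [0, 1], is an assumption made purely for illustration.

```python
import math

def logistic_map(x, r=4.0):
    """Logistic map: x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)

def tent_map(x, mu=2.0):
    """Tent map: x -> mu * x for x < 0.5, else mu * (1 - x)."""
    return mu * x if x < 0.5 else mu * (1.0 - x)

def sine_map(x, a=1.0):
    """Sine map (one common form): x -> a * sin(pi * x)."""
    return a * math.sin(math.pi * x)

def chaotic_augment(pixels, chaotic_map, iterations=1):
    """Hypothetical augmentation: iterate a map over normalized intensities.

    `pixels` is a flat list of floats in [0, 1]; outputs are clamped to
    the same range.
    """
    out = []
    for value in pixels:
        for _ in range(iterations):
            value = chaotic_map(value)
        out.append(min(1.0, max(0.0, value)))
    return out

# Example: one tent-map pass over a short row of intensities.
augmented = chaotic_augment([0.0, 0.25, 0.5, 0.75], tent_map)
```

With the default parameters all three maps send [0, 1] back into [0, 1], which is why they can be applied directly to normalized intensities in this sketch.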

2605.27144 2026-05-27 cs.CV cs.LG version update

Is an Image Also Worth 16x16=256 Superpixels? A Framework for Attentional Image Classification

Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb

Affiliations: UFRGS; Institute of Informatics; Federal University of Rio Grande do Sul; Division of Informatics; School of Health Sciences; Imaging and Data Science; Faculty of Biology, Medicine and Health; University of Manchester; Vaughan House, Portsmouth St

AI summary: Proposes the Superpixel Transformer (SPT) framework, which unifies superpixel-based image classification with Vision Transformers; with a multidimensional sine-cosine positional encoding and an enriched patch data structure, it outperforms superpixel-based GNN methods on several datasets and is competitive with ViTs.

Abstract

Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth $16\times16$ superpixels.
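
A multidimensional sine-cosine positional encoding can be sketched by encoding each coordinate with the fixed sinusoidal scheme familiar from Transformers and concatenating the results. The abstract does not give the paper's exact formulation, so the frequency base, the per-axis dimension, and the function names below are assumptions.

```python
import math

def sincos_1d(position, dim, base=10000.0):
    """Sinusoidal encoding of one scalar position into `dim` values.

    `dim` must be even; values alternate sin and cos at geometrically
    spaced frequencies.
    """
    if dim % 2:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(dim // 2):
        frequency = 1.0 / (base ** (2 * i / dim))
        encoding.append(math.sin(position * frequency))
        encoding.append(math.cos(position * frequency))
    return encoding

def sincos_nd(coords, dim_per_axis):
    """Concatenate per-axis encodings, e.g. for a superpixel centroid (x, y)."""
    encoding = []
    for coordinate in coords:
        encoding.extend(sincos_1d(coordinate, dim_per_axis))
    return encoding

# Example: encode a centroid at a non-integer location.
centroid_encoding = sincos_nd((3.5, 7.25), dim_per_axis=8)
```

Unlike a learned table indexed by patch number, this form accepts continuous coordinates, which suits superpixels whose centroids do not fall on a regular grid.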

2605.27139 2026-05-27 eess.IV cs.CV physics.ins-det version update

Unsupervised Deep Image Prior for Sparse-View and Limited-Angle Electron Tomography

Serge Brosset, Daniel del Pozo Bueno, Thomas David, Laure Guetaz, Philippe Ciuciu, Zineb Saghi

Affiliations: Univ. Grenoble Alpes, CEA, Leti; Univ. Grenoble Alpes, CEA, Liten; CEA, Joliot, NeuroSpin; Inria, MIND, Université Paris-Saclay

AI summary: Proposes an unsupervised deep image prior method that achieves electron tomography reconstruction performance comparable to supervised methods under sparse-view and limited-angle conditions, and applies it to experimental data to validate its reliability.

Comments: 22 pages, 12 figures

2605.27136 2026-05-27 cs.CV version update

Leveraging Visual Signals for Robust Token-Level Uncertainty in Vision-Language Generation

Joseph Hoche, David Brellmann, Gianni Franchi

Affiliations: AMIAD, Pôle Recherche, Palaiseau

AI summary: To address the underuse of visual information in uncertainty quantification for large vision-language models, proposes VIG-TUQ, a visually grounded token-level uncertainty quantification framework that weights language uncertainty with visual grounding scores and improves uncertainty estimation without training.

Abstract

Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.

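2605.27135 2026-05-27 cs.CR cs.CV version update

Do Modern Post-Hoc Watermarking Methods Beat Broken-Arrows?

Enoal Gesny, Eva Giboulot

Affiliations: Inria

AI summary: Through a fair comparison of the robustness and security of modern and classical post-hoc watermarking methods under a variety of attacks, the paper finds that classical methods perform better in realistic scenarios.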

2605.27132 2026-05-27 cs.CV version update

Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions

Eslam Hegazy, Mohamed Gabr

Affiliations: German University in Cairo

AI summary: By analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds on the BSDS500 dataset, the paper shows that Otsu's criterion correlates highly with SSIM and PSNR while Kapur's entropy correlates weakly, indicating an inherent metric-objective-function bias.

Comments: Submitted to ICPR 2026 (https://icpr2026.org)

Abstract

Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and in correlation with SSIM for over 91% of them. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at https://w3id.org/met-dp/icpr26-95.
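
The two objective functions compared in the abstract can be written down for the simplest case of a single threshold. The sketch below is not the paper's code: the study covers multilevel thresholding, whereas this assumes one threshold `t` that splits histogram bins into `[0, t)` and `[t, L)`. The function names are hypothetical.

```python
import math

def otsu_between_class_variance(hist, t):
    """Otsu's criterion for one threshold: w0 * w1 * (mu0 - mu1)^2."""
    total = sum(hist)
    n0 = sum(hist[:t])
    n1 = total - n0
    if n0 == 0 or n1 == 0:
        return 0.0
    mu0 = sum(i * h for i, h in enumerate(hist[:t])) / n0
    mu1 = sum(i * h for i, h in enumerate(hist[t:], start=t)) / n1
    return (n0 / total) * (n1 / total) * (mu0 - mu1) ** 2

def kapur_entropy(hist, t):
    """Kapur's criterion: sum of the Shannon entropies of the two classes."""
    def class_entropy(bins):
        weight = sum(bins)
        if weight == 0:
            return 0.0
        return -sum((h / weight) * math.log(h / weight) for h in bins if h > 0)
    return class_entropy(hist[:t]) + class_entropy(hist[t:])

def best_threshold(hist, objective):
    """Exhaustively pick the threshold that maximizes the given objective."""
    return max(range(1, len(hist)), key=lambda t: objective(hist, t))

# Toy 4-level histogram with two clear modes at the extremes.
toy_hist = [5, 1, 1, 5]
otsu_t = best_threshold(toy_hist, otsu_between_class_variance)
```

Evaluating either objective at every candidate threshold, as `best_threshold` does, yields the per-threshold curves that a correlation analysis against SSIM or PSNR would be computed over.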

2605.27129 2026-05-27 cs.CV cs.RO version update

YOLO26-RipeLoc Lite: A lightweight architecture for tomato ripeness detection and picking point localization in greenhouse robotic harvesting

Rajmeet Singh, Manveen Kaur, Shahpour Alirezaee, Irfan Hussain

Affiliations: Department of Mechanical Engineering; Khalifa University; University of Windsor

AI summary: Proposes YOLO26-RipeLoc Lite, a lightweight YOLO26-based architecture that uses a Lightweight Feature Pyramid Network, a Ripeness-Aware Attention Module, and a Compact Detection Head to classify ripeness and localize center points of greenhouse tomatoes, reaching 92.9% mAP@0.5 with only 2.38M parameters.

Abstract

In greenhouse tomato production, automated harvesting requires accurate detection of ripe tomatoes, ripeness classification, and precise picking-point localization for robotic end-effectors. This paper proposes YOLO26-RipeLoc Lite, a lightweight deep learning architecture based on YOLO26 for simultaneous detection, ripeness classification, and center-point localization of greenhouse tomatoes. The model introduces three modifications: (1) a Lightweight Feature Pyramid Network (LFPN) with depthwise separable convolutions for efficient multi-scale fusion, (2) a Ripeness-Aware Attention Module (RAAM) with dual pooling and a learnable ripeness bias vector for enhanced color-texture discrimination, and (3) a Compact Detection Head (CDH) with shared convolutions and an integrated center-point regression branch for direct grasp planning. The model is evaluated on a custom dataset of 1,500 images with 6,227 instances (3,566 ripe, 2,661 unripe) from the SILAL greenhouse, Abu Dhabi, UAE. YOLO26-RipeLoc Lite achieves mAP@0.5 of 92.9% (95.2% ripe, 90.6% unripe) with the highest precision (95.2%) among all evaluated architectures using only 2.38M parameters. Post-training BatchNorm pruning at 30% reduces parameters to ~1.8M with negligible accuracy loss. Ablation studies confirm that greenhouse-aware HSV augmentation provides the largest improvement (+2.02 pp mAP@50), backbone freezing achieves peak precision (93.8%), and 3-phase progressive unfreezing yields the best localization quality (mAP@50:95 of 64.6%). Comparisons with YOLOv8n/s, YOLO11n/s, YOLO12n/s, and YOLO26s confirm superior accuracy-efficiency: 2.9 pp higher precision than YOLO12n with 7.0% fewer parameters and integrated center-point localization for robotic end-effector guidance.
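
The parameter savings from depthwise separable convolutions, which the LFPN relies on, follow from simple arithmetic. The sketch below counts weights only (no biases or normalization layers) for a single layer; the channel sizes are arbitrary illustrative values, not taken from the paper.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution: k * k * C_in * C_out."""
    return kernel * kernel * in_channels * out_channels

def depthwise_separable_params(kernel, in_channels, out_channels):
    """Depthwise (k * k per input channel) plus 1x1 pointwise weights."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
```

For this layer the separable form needs roughly one eighth of the weights, which is the general reason such blocks appear in lightweight detectors.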

2605.27128 2026-05-27 cs.CV cs.LG 版本更新

PILOT: A Data-Free Continual Learning Approach for Real-Time Semantic Segmentation via Boundary Guidance

PILOT: 一种基于边界引导的无数据持续学习方法用于实时语义分割

Yujing Zhou, Prashant Shekhar, Thomas Yang, Yongxin Liu

发表机构 * Department of Mathematics, College of Arts and Sciences, Embry-Riddle Aeronautical University（数学系，文理学院，埃姆布里-里德航空大学）； Department of Electrical Engineering and Computer Science, College of Engineering, Embry-Riddle Aeronautical University（电气工程与计算机科学系，工程学院，埃姆布里-里德航空大学）

AI总结提出PILOT框架，通过冻结原网络参数并引入并行导数分支捕获新类边界信息，实现实时语义分割模型在无需旧数据情况下的增量学习，有效缓解灾难性遗忘。

AI中文摘要

实时语义分割模型在准确性和推理速度之间取得了极好的平衡。然而，将这些模型部署在动态的真实世界环境中，通常需要能够在不重新训练整个数据集的情况下增量地学习新类别。这种能力被称为持续学习。在这方面，深度学习中的标准微调方法常常因灾难性遗忘而失败，即模型学习新信息但忘记了先前训练和学习的类别。针对这一关键领域，本文提出了一种针对PIDNet的新型持续学习框架，PIDNet是一种被广泛引用的最先进的实时语义分割模型。我们的方法PILOT（并行增量学习随时间）通过实现一个并行导数分支（D-branch）引入了一种实时且轻量级的策略，该分支旨在捕获新类别的高频边界信息，同时冻结原始分割网络的训练参数。这种新颖的设置允许模型适应新的语义类别，同时保留先前学习类别的知识。通过仅使用与新类别相关的数据，我们的模型显著减少了训练开销。实验结果表明，我们的方法成功分割了新类别，同时在原始基类上保持了较高的平均交并比（mIoU），从而在该领域轻松超越了所有主要的持续学习方法。总体而言，PILOT被证明能有效缓解灾难性遗忘，同时对推理延迟影响最小，从而保持实时性能。

英文摘要

Real-time semantic segmentation models offer an excellent balance between accuracy and inference speed. However, deploying these models in dynamic real-world environments often requires the ability to learn novel classes incrementally without retraining on the entire dataset. This capability is known as continual learning. In this regard, the standard fine-tuning methods in deep learning often fail due to catastrophic forgetting, where the model learns new information but forgets previously trained and learned classes. Contributing to this crucial domain, the current paper proposes a novel continual learning framework tailored for PIDNet, which is a widely cited state-of-the-art real-time semantic segmentation model. Our method, PILOT (Parallel Incremental Learning Over Time), introduces a real-time and lightweight strategy by implementing a parallel Derivative-branch (D-branch) designed to capture the high-frequency boundary information of novel classes while freezing the trained parameters of the original segmentation network. This novel setup allows the model to adapt to new semantic categories while preserving the knowledge of previously learned classes. By using only data associated with the new class, our model significantly reduces training overhead. Experimental results demonstrate that our approach successfully segments new classes while maintaining high mean Intersection over Union (mIoU) on the original base classes, thereby comfortably outperforming all major continual learning approaches in this domain. Overall, PILOT is shown to effectively mitigate catastrophic forgetting with minimal impact on inference latency, thus maintaining real-time performance.
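
The abstract measures forgetting through mIoU on the original base classes. The sketch below shows one common way to compute per-class IoU from a confusion matrix and to average it separately over base and new classes. The confusion matrix and the base/new class split are made-up illustration values, not results from the paper.

```python
import math


def per_class_iou(confusion):
    """confusion[t][p] counts pixels of true class t predicted as class p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # pixels wrongly labeled c
        denom = tp + fp + fn
        ious.append(tp / denom if denom else math.nan)
    return ious


def mean_iou(ious, classes):
    """Average IoU over the given class indices, skipping undefined (NaN) entries."""
    vals = [ious[c] for c in classes if not math.isnan(ious[c])]
    return sum(vals) / len(vals) if vals else math.nan


if __name__ == "__main__":
    # Hypothetical 3-class example: classes 0 and 1 are "base", class 2 is "new".
    confusion = [
        [8, 1, 1],
        [2, 6, 2],
        [0, 1, 9],
    ]
    ious = per_class_iou(confusion)
    print("base mIoU:", mean_iou(ious, [0, 1]))
    print("new  mIoU:", mean_iou(ious, [2]))
```

Reporting the two averages side by side makes it visible whether gains on the new class come at the cost of the base classes.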

2605.27116 2026-05-27 cs.CV 版本更新

COVD: Continual Open-Vocabulary Object Detection with Novel Concept Injection

COVD: 通过新概念注入的持续开放词汇目标检测

Yupeng Zhang, Ruize Han, Yuzhong Feng, Zixin Ren, Yuntong Tian, Liang Wan

发表机构 * Tianjin University（天津大学）； Shenzhen University of Advanced Technology（深圳理工大学）

AI总结提出持续开放词汇目标检测新任务COVD，通过冻结视觉编码器并仅更新文本分支参数注入新概念，实现无需额外参数的高效持续学习。

AI中文摘要

开放词汇目标检测（OVD）取得了显著进展，使检测器能够从已见类别泛化到未见类别。然而，现实世界的类别空间不断演变，现有的OVD模型仍然难以处理新出现的概念，而重复的完全重新训练成本过高。为此，我们引入了一个新的任务设置，称为持续开放词汇目标检测与新概念注入（COVD），其中模型顺序学习传入的新概念组，同时保留先前的概念和原始的开放词汇知识，并附带一个新的基准Novel-114。我们的关键观察是，预训练的视觉编码器通常已经感知并表示了众多新概念，主要瓶颈在于视觉表示与文本概念之间缺乏稳定的语义对齐。基于此，我们提出了NoIn-Det，一个无需额外参数的高效持续注入框架。NoIn-Det冻结视觉编码器，仅使用常见概念和先前注入概念的文本来保留文本表示空间，并通过仅更新有利于新概念学习的少量文本分支参数来注入新概念。大量实验表明，NoIn-Det在不引入额外参数的情况下，有效学习了新概念，保留了旧知识，并持续优于现有的VLM持续学习方法。Novel-114和代码将发布。

英文摘要

Open-vocabulary object detection (OVD) has made significant progress, enabling detectors to generalize from seen to unseen categories. However, real-world category spaces continually evolve, and existing OVD models still struggle with newly emerging concepts, while repeated full retraining is prohibitively expensive. To this end, we introduce a new task setting, termed Continual OVD with Novel Concept Injection (COVD), where models sequentially learn incoming novel concept groups while preserving prior concepts and original open-vocabulary knowledge, along with a new benchmark, Novel-114. Our key observation is that pretrained visual encoders often already perceive and represent many novel concepts, and the main bottleneck lies in the lack of stable semantic alignment between visual representations and textual concepts. Based on this, we propose NoIn-Det, an efficient continual injection framework without additional parameters. NoIn-Det freezes the visual encoder, preserves the text representation space using only texts of common concepts and previously injected concepts, and injects novel concepts by updating only a small subset of text-branch parameters beneficial to novel concept learning. Extensive experiments show that NoIn-Det effectively learns novel concepts, preserves old knowledge, and consistently outperforms existing continual learning methods for VLMs without introducing additional parameters. Novel-114 and the code will be released.

2605.27101 2026-05-27 cs.CV cs.CL 版本更新

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

弹出式干扰揭示视频大语言模型中的事件袋行为

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

发表机构 * Texas A&M University（德克萨斯A&M大学）； National Taiwan University（国立台湾大学）； Stanford University（斯坦福大学）； VinUniversity

AI总结通过插入无关广告片段，发现视频大语言模型常将不同片段的事件错误关联，表现出将视频视为事件集合而非时间序列的“事件袋”行为。

AI中文摘要

视频理解的一个关键能力是跨时间可靠地将主体与事件联系起来，然而视频大语言模型（VideoLLMs）是否真正实现了这一点仍不清楚。在这项工作中，我们引入了DistractionBench来评估VideoLLMs在存在无关视频片段的情况下是否能稳健地关联主体和事件。通过受控干预，例如在较长视频中插入短广告片段，我们表明VideoLLMs经常幻觉出不同片段中实体之间的交互，错误地将注入广告中的动作归因于主视频中的主体。我们将这种系统性幻觉表征为事件袋（BoE）行为，即模型将视频视为事件的集合而非具有时间结构的序列。评估11个流行的VideoLLMs，我们发现所有模型都表现出显著的BoE行为。我们的发现表明VideoLLMs缺乏可靠的时间定位（temporal grounding）机制，并激励开发具有更稳健主体-事件关联的模型。

英文摘要

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.

2605.27080 2026-05-27 cs.CV 版本更新

Semi-Supervised Gaze Estimation via Disentangled Subspace Contrastive Learning

基于解耦子空间对比学习的半监督视线估计

Qida Tan, Hongyu Yang, Wenchao Du

发表机构 * National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China（合成视觉基础科学国家重点实验室，四川大学，成都，中国）； College of Computer Science, Sichuan University, Chengdu, China（计算机学院，四川大学，成都，中国）

AI总结提出一种半监督学习框架DSCL，通过雅可比正则化解耦特征为俯仰角和偏航角子空间，并利用子空间内序数对比学习，仅用5%-20%标注数据即可达到竞争性能。

Comments ICML2026

AI中文摘要

基于外观的视线估计由于标注样本有限和数据集多样性不足，常面临泛化能力差的问题。主流方法采用弱监督学习从无约束真实场景生成大规模伪标签数据，以缓解域偏移。本文设计了一种简单而有效的半监督学习架构，利用未标注数据增强域泛化，从而减少对劳动密集型人工标注的依赖。我们的关键洞察是施加雅可比正则化，将特征表示解耦为专门针对特定视线组件（如俯仰角和偏航角）的判别性子空间。我们进一步利用每个子空间内的内在序数排序进行对比学习，使模型能够从少量标注样本和大量未标注样本中学习鲁棒的视线表示。最终形成了我们的解耦子空间对比学习（DSCL）框架。在多个基准上的大量实验表明，所提出的DSCL是即插即用的，在域内和跨域评估设置下，仅使用20%、10%甚至5%的标注数据即可达到竞争性能。公开代码见https://github.com/da60266/DSCL。

英文摘要

Appearance-based gaze estimation always suffers from poor generalization due to limited annotated samples and insufficient dataset diversity. Leading approaches adopt weakly supervised learning to generate large-scale pseudo-labeled data from unconstrained real-world scenarios, aiming to mitigate the domain shifts. In this work, we devise a simple yet effective semi-supervised learning architecture that leverages unlabeled data to enhance domain generalization, thereby reducing reliance on labor-intensive manual annotations. Our key insight is to impose Jacobian regularization to disentangle feature representations into discriminative subspaces dedicated to specific gaze components, such as pitch and yaw angles. We further exploit the intrinsic ordinal ranking within each subspace for contrastive learning, enabling the model to learn robust gaze representations from a small set of labeled samples and an abundance of unlabeled ones. This ultimately yields our Disentangled Subspace Contrastive Learning (DSCL) framework. Extensive experiments on multiple benchmarks verify that the proposed DSCL is plug-and-play, achieving competitive performance using only 20%, 10%, and even 5% of the annotated data under both in-domain and cross-domain evaluation settings. The public code is available at https://github.com/da60266/DSCL.

2605.27075 2026-05-27 cs.CV 版本更新

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

SoftCap: 扩散Transformer加速的软预算控制

Yuhang Zhang, Junxiang Qiu, Huixia Ben, Zhenhua Tang, Shuo Wang, Yanbin Hao

发表机构 * Hefei University of Technology（合肥工业大学）； University of Science and Technology of China（中国科学技术大学）； Anhui University of Science and Technology（安徽理工大学）； University of Macau（澳门大学）

AI总结提出一种无需训练的软预算控制层SoftCap，通过轨迹漂移观测器和软预算PI控制器动态调整全步触发阈值，在保持计算预算软上限的同时提升图像质量。

AI中文摘要

扩散Transformer（DiTs）实现了强大的视觉质量，但其迭代去噪过程需要大量昂贵的Transformer评估。无训练加速方法通过缓存、预测或验证中间特征来降低这一成本，然而何时执行全步的运行时决策通常由固定调度或手动调整的阈值驱动。我们提出SoftCap，一种用于基于缓存的DiT推理的无训练控制层。SoftCap将轨迹漂移观测器（通过轻量级隐藏状态统计估计局部缓存风险）与软预算PI控制器（根据相对于固定参考配置的实际计算调整全步触发阈值）相结合。预算是软上限：它塑造阈值，但不要求运行消耗预定数量的全步评估。在FLUX.1-dev上，在可比的中等计算操作点下，SoftCap优于SpeCa，在几乎相同的FLOPs下将ImageReward从0.967提升至0.981，并将LPIPS-Full从0.518降至0.498，而目标扫描诊断显示随着预算放宽，预期的软上限行为得以实现。

英文摘要

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying intermediate features, yet the runtime decision of when to execute a Full step is often driven by fixed schedules or hand-tuned thresholds. We propose SoftCap, a training-free control layer for cache-based DiT inference. SoftCap couples a Trajectory Drift Observer, which estimates local cache risk from lightweight hidden-state statistics, with a Soft-Budget PI Controller, which adjusts the Full-triggering threshold from realized compute relative to a fixed reference profile. The budget is a soft ceiling: it shapes the threshold but does not require a run to spend a prescribed number of Full evaluations. On FLUX.1-dev, SoftCap improves over SpeCa at a comparable middle-compute operating point, raising ImageReward from 0.967 to 0.981 and reducing LPIPS-Full from 0.518 to 0.498 at nearly identical FLOPs, while target-sweep diagnostics show the intended soft-ceiling behavior as the budget is relaxed.
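
The control loop described above can be pictured with a generic PI controller. The sketch below is not SoftCap's implementation: the class name `SoftBudgetPI`, the gains, the error definition (realized Full-step fraction minus a reference fraction), and the scalar "drift" signal are all assumptions chosen for illustration. It only demonstrates the stated idea that the budget shapes the trigger threshold without forcing a fixed number of Full steps.

```python
from dataclasses import dataclass


@dataclass
class SoftBudgetPI:
    """Hypothetical PI controller for a Full-step trigger threshold."""
    kp: float = 0.5
    ki: float = 0.1
    base_threshold: float = 1.0
    min_threshold: float = 0.1
    max_threshold: float = 10.0
    integral: float = 0.0

    def update(self, realized: float, reference: float) -> float:
        # Positive error: more compute spent so far than the reference profile allows.
        error = realized - reference
        self.integral += error
        raw = self.base_threshold + self.kp * error + self.ki * self.integral
        # Clamp so the threshold stays in a usable range.
        return min(self.max_threshold, max(self.min_threshold, raw))


def run_schedule(drifts, reference_profile, controller):
    """Return one boolean per step: True where a Full step is executed."""
    decisions = []
    full_steps = 0
    threshold = controller.base_threshold
    for step, (drift, reference) in enumerate(zip(drifts, reference_profile), start=1):
        run_full = drift > threshold      # a high drift estimate triggers a Full step
        full_steps += int(run_full)
        decisions.append(run_full)
        realized = full_steps / step      # fraction of steps run as Full so far
        threshold = controller.update(realized, reference)
    return decisions


if __name__ == "__main__":
    decisions = run_schedule([2.0] * 30, [0.5] * 30, SoftBudgetPI())
    print(sum(decisions), "Full steps out of", len(decisions))
```

Because the controller only moves the threshold, a run with consistently low drift can finish well under the budget, which is the soft-ceiling behavior the abstract describes.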

2605.27074 2026-05-27 cs.CV 版本更新

IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams

IPIBench: 在连续流下评估多模态大模型的交互式主动智能

Jinzhao Li, Yinuo Chen, Wenxuan Song, Yijia Lei, Yichi Zhang, Honglei Yan, Panwang Pan, Miao Liu

发表机构 * College of AI, Tsinghua University（清华大学人工智能学院）； ByteDance（字节跳动）

AI总结提出IPIBench基准，用于评估多模态大模型在流式视频场景中的交互式主动智能，并设计IPI-Agent框架以改善主动触发和交互协调。

AI中文摘要

最近的多模态大模型在反应式问答上表现强劲，但现实世界的流式助手需要对连续视觉输入进行主动推理。现有基准主要研究孤立的单轮设置中的反应式或主动式交互，忽视了用户可能在交错反应式查询中添加、修改或取消主动请求的动态多轮场景。为填补这一空白，我们引入IPIBench，这是首个在流式视频设置下评估多模态大模型交互式主动智能的基准。IPIBench涵盖主动监控、主动任务管理以及交错的反应式-主动式请求。对代表性多模态大模型的评估揭示了两个主要限制：不稳定的主动触发以及反应式和主动行为之间的弱协调。我们进一步提出IPI-Agent，一个无训练的智能体框架，包含交互控制策略和时间门控机制，用于稳定主动触发和协调多轮交互。实验表明，IPI-Agent在所有基准设置上持续改进现有多模态大模型。

英文摘要

Recent multimodal large language models (MLLMs) achieve strong performance on reactive question answering, but real-world streaming assistants require proactive reasoning over continuous visual inputs. Existing benchmarks mainly study reactive or proactive interactions in isolated single-turn settings, overlooking dynamic multi-turn scenarios where users may add, modify, or cancel proactive requests alongside interleaved reactive queries. To address this gap, we introduce IPIBench, the first benchmark for evaluating Interactive Proactive Intelligence of MLLMs under streaming video settings. IPIBench covers proactive monitoring, proactive task management, and interleaved reactive-proactive requests. Evaluations on representative MLLMs reveal two major limitations: unstable proactive triggering and weak coordination between reactive and proactive behaviors. We further propose IPI-Agent, a training-free agentic framework with an interaction-control policy and a temporal-gating mechanism for stabilizing proactive triggering and coordinating multi-turn interactions. Experiments show that IPI-Agent consistently improves existing MLLMs across all benchmark settings.

2605.27067 2026-05-27 cs.CV 版本更新

时间步感知的 SVDQuant-GPTQ 用于 Wan2.2-I2V 的 W4A4 量化

Junhao Wu, Dezhong Yao, Hai Jin

发表机构 * National Engineering Research Center for Big Data Technology and System（大数据技术与系统国家工程研究中心）； Services Computing Technology and System Lab（服务计算技术与系统实验室）； Cluster and Grid Computing Lab（集群与网格计算实验室）； School of Computer Science and Technology（计算机科学与技术学院）； Huazhong University of Science and Technology（华中科技大学）

AI总结针对 Wan2.2-I2V 视频扩散 Transformer 的 W4A4 量化，提出结合 SVDQuant 低秩异常补偿、GPTQ 重建感知残差权重量化和时间步分箱逐层激活裁剪比搜索的后训练量化框架，在 OpenS2V-Eval 上降低 59.3% 峰值显存且仅损失 0.9% VBench 平均分。

AI中文摘要

大型视频扩散 Transformer 的 W4A4 量化提供了显著的内存节省，但面临两个主要挑战：稀疏的大幅度激活异常值，以及跨多步去噪轨迹的强时间步依赖的激活分布。这些困难因 Wan2.2-I2V 的双专家混合专家 DiT 设计而加剧，其高噪声和低噪声专家表现出不同的量化敏感性，单一全局校准策略无法捕捉。我们提出了一种后训练量化框架，结合基于 SVDQuant 的低秩异常补偿、基于 GPTQ 的重建感知残差权重量化，以及针对每个专家独立进行的时间步分箱逐层激活裁剪比搜索。在 OpenS2V-Eval 基准上，我们的方法相对于 BF16 基线将峰值 GPU 内存降低了 59.3%，同时仅导致 VBench 平均分数下降 0.9%，成像质量下降 2.3%，表明专家和时间步感知的校准对于 MoE 视频 DiT 的高保真 W4A4 推理至关重要。

英文摘要

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
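
To make the clipping-ratio search concrete, here is a minimal sketch of a grid search over clipping ratios for symmetric 4-bit activation quantization, run once per timestep bin. It is an illustration under assumptions, not the paper's procedure: the MSE objective, the ratio grid, the `[-7, 7]` integer range, and the names `search_clip_ratio` and `search_per_bin` are all hypothetical.

```python
def quantize_int4(values, clip):
    """Symmetric 4-bit fake quantization with clipping range [-clip, clip]."""
    scale = clip / 7.0  # assumed signed range [-7, 7]
    out = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        out.append(round(clipped / scale) * scale)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_ratio(values, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Pick the ratio of max|x| that minimizes reconstruction MSE."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, 0.0  # all-zero activations quantize exactly
    best_ratio, best_err = 1.0, float("inf")
    for r in ratios:
        err = mse(values, quantize_int4(values, r * max_abs))
        if err < best_err:
            best_ratio, best_err = r, err
    return best_ratio, best_err


def search_per_bin(calibration):
    """calibration maps a timestep-bin label to a list of activation samples."""
    return {bin_id: search_clip_ratio(acts)[0] for bin_id, acts in calibration.items()}


if __name__ == "__main__":
    # Synthetic activations: a dense bulk in [-0.5, 0.5] plus one large outlier.
    bulk = [i / 1000 for i in range(-500, 501)]
    print(search_per_bin({"early": bulk + [4.0], "late": bulk}))
```

With a single outlier stretching the range, a tighter clip trades a large error on one value for much finer resolution on the bulk, which is why the best ratio can differ between bins with different activation statistics.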

2605.26992 2026-05-27 cs.CV 版本更新

On the Robustness of Machine Unlearning for Vision-Language Models

机器遗忘在视觉-语言模型中的鲁棒性研究

Yujie Lin, Kaidi Jia, Jiayao Ma, Chengyi Yang, Jinsong Su

发表机构 * Xiamen University（厦门大学）

AI总结本文首次对视觉-语言模型的机器遗忘进行系统综述与鲁棒性分析，并提出三种攻击范式，揭示现有方法往往只是隐藏而非彻底移除目标知识。

AI中文摘要

视觉-语言模型（VLM）可能会记忆训练数据中的不良信息，这激发了人们对机器遗忘的兴趣。在这项工作中，我们首次对VLM遗忘进行了系统综述和鲁棒性分析。我们提供了现有VLM遗忘方法的全面分类和回顾，以及在多种提示设置下的统一评估。然后，我们提出了三种攻击范式，以检验被遗忘的多模态知识是否可以通过上下文提示或下游再训练重新激活。大量实验表明，许多现有方法在这些攻击下仍然脆弱，这表明当前方法往往隐藏而非完全移除目标知识。我们的研究为当前VLM遗忘方法的鲁棒性和局限性提供了新见解，并强调了需要更可靠的多模态遗忘策略。代码可在https://github.com/XMUDeepLIT/VLM-UnL-Attack获取。

英文摘要

Vision-language models (VLMs) may memorize undesirable information from training data, motivating growing interest in machine unlearning. In this work, we present the first systematic survey and robustness analysis of VLM unlearning. We provide a comprehensive taxonomy and review of existing VLM unlearning methods, together with unified evaluations under multiple prompt settings. We then propose three attack paradigms to examine whether forgotten multimodal knowledge can be reactivated through contextual prompting or downstream retraining. Extensive experiments show that many existing methods remain vulnerable under these attacks, indicating that current approaches often hide rather than fully remove target knowledge. Our study provides new insights into the robustness and limitations of current VLM unlearning methods and highlights the need for more reliable multimodal unlearning strategies. Code is available at https://github.com/XMUDeepLIT/VLM-UnL-Attack.

2605.26967 2026-05-27 cs.CV 版本更新

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

CodecCap: 高保真度编解码器启发的残差建模用于密集视频字幕生成

Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

发表机构 * ERNIE Team, Baidu（百度ERNIE团队）； College of Artificial Intelligence, Inner Mongolia University（内蒙古大学人工智能学院）

AI总结提出CodecCap框架，通过关键帧和残差字幕模拟视频编解码器，在保持细粒度视觉证据的同时减少冗余，并引入VidCapQA基准验证其高保真度。

Comments 11 pages, 4 figures

AI中文摘要

现有的视频字幕方法难以平衡视觉保真度和冗余：整体字幕紧凑但丢失细粒度证据，而分段字幕改善覆盖但引入大量冗余。我们提出CodecCap，一种受编解码器启发的高保真度密集视频字幕框架。类似于视频编解码器，CodecCap使用关键帧和残差字幕表示视频。关键帧字幕详尽编码稳定的视觉上下文，而残差字幕仅捕获时间上局部的动作、运动和变化。这有效保留了细粒度视觉证据，同时减少冗余描述。为了量化字幕的保真度，我们引入VidCapQA，一个包含14个能力维度1000个问题的字幕-问答基准。VidCapQA上的结果表明，强VLM直接生成的字幕仍然遗漏许多视觉细节，突显字幕表示是关键瓶颈。实验表明，CodecCap显著超越使用相同底层VLM的直接字幕生成，表明关键帧-残差字幕是一种高保真度视频-语言监督的方式。我们进一步使用CodecCap构建CodecVDC-100K，一个包含锚点、残差、场景级和视频级监督的大规模密集字幕数据集。

英文摘要

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose CodecCap, a codec-inspired framework for high-fidelity dense video captioning. Analogous to video codecs, CodecCap represents videos using keyframe and residual captions. Keyframe captions exhaustively encode stable visual context, while residual captions capture only temporally localized actions, motions, and changes. This effectively preserves fine-grained visual evidence while reducing redundant descriptions. To quantify the fidelity of captions, we introduce VidCapQA, a caption-then-QA benchmark with 1,000 questions across 14 capability dimensions. Results on VidCapQA show that captions directly generated by strong VLMs still miss many visual details, highlighting caption representation as a critical bottleneck. Experiments show that CodecCap significantly surpasses direct captioning with the same underlying VLMs, suggesting that keyframe-residual captioning is a viable route to high-fidelity video-language supervision. We further use CodecCap to construct CodecVDC-100K, a large-scale dense captioning dataset with anchor, residual, scene-level, and video-level supervision.

2605.26949 2026-05-27 cs.CV cs.GR 版本更新

RoadGIE：面向通用交互式道路提取的全球尺度航拍基准

Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li

发表机构 * NKIARI, Shenzhen Futian（深圳福田NKIARI）； VCIP, CS, Nankai University（南开大学VCIP研究所）； AAIS, Nankai University（南开大学AAIS）； College of Electronic Engineering, National University of Defense Technology, Changsha, China（国防科技大学电子工程学院，长沙，中国）

AI总结提出最大、最多样的道路分割数据集WorldRoadSeg-360K，并设计支持连通性感知提示的交互式方法RoadGIE，在分割精度和拓扑一致性上达到最优。

AI中文摘要

从航拍图像中准确分割道路是许多地理空间应用的基础。然而，现有数据集通常面临场景多样性有限、语义粒度低和结构连续性差的问题，限制了它们在不同环境中的泛化能力。为了解决这些挑战，我们引入了WorldRoadSeg-360K，这是迄今为止最大、最多样的道路分割数据集，包含从38个国家223个城市收集的366,947张高分辨率图像，覆盖不同地形和大陆。WorldRoadSeg-360K作为一个全面的基准，揭示了处理多样化和结构复杂场景的关键挑战。自动化方法通常难以保持道路连通性，而当前的交互式方法缺乏高效、拓扑敏感的工具用于实际道路编辑。为此，我们提出了RoadGIE，建立了一种新的遥感道路提取交互范式。与先前的点或框提示策略不同，RoadGIE支持连通性感知提示，包括点击和涂鸦，这些提示与道路网络的拓扑结构天然对齐。为了提高结构一致性并减轻迭代交互中的性能下降，RoadGIE集成了专家引导的提示策略，并针对交互场景调整了基于骨架的召回损失。RoadGIE在WorldRoadSeg-360K和其他基准上，在分割精度和拓扑一致性方面均达到了最先进的性能，同时仅需3.7M参数即可高效运行。代码公开于：https://github.com/chaineypung/RoadGIE

英文摘要

Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code is publicly available at: https://github.com/chaineypung/RoadGIE

2605.26861 2026-05-27 cs.CV 版本更新

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

REVERSE: 强化证据验证与搜索的智能体图像地理定位

Yong Li, Furong Jia, Dacheng Yin, Kang Rong, Fengyun Rao, Jing Lyu, Fan Zhang

发表机构 * Peking University（北京大学）； The Hong Kong University of Science and Technology（香港科技大学）； WeChat Vision, Tencent Inc（腾讯公司微信视觉团队）

AI总结提出REVERSE框架，通过多轮智能体推理强化证据搜索与验证的交互，在图像地理定位任务中优于强检索增强基线，以4B模型媲美更大模型。

AI中文摘要

图像地理定位旨在确定照片的拍摄地点，该任务通常需要识别可见地标之外的信息。人类专家通常通过迭代工作流程解决：检查信息区域，形成位置假设，寻求外部证据，并根据新线索修正判断。现有方法仅部分捕捉这一过程：直接预测方法完全绕过证据获取，而检索增强方法引入外部证据但通常对中间决策（搜索位置、查询方式、过滤噪声结果）提供有限监督。我们提出REVERSE，一个强化证据搜索与验证交互的框架，实现多轮智能体推理。REVERSE教授三个中间决策：看哪里、查什么、信任什么证据。为此，我们构建了带注释区域选择、搜索观察和地理信息证据标签的工具化轨迹，并引入视觉定位、查询效用和证据辨别的过程奖励。离线搜索缓存使检索观察在强化学习过程中稳定且可重用，实现对噪声搜索结果的密集监督。使用4B模型，REVERSE在Im2GPS3k和YFCC4k上优于强检索增强基线，并媲美显著更大的模型。代码见https://github.com/yonglleee/REVERSE。

英文摘要

Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at https://github.com/yonglleee/REVERSE.

2605.26855 2026-05-27 cs.CV 版本更新

Receipt Replay OOD: A Small Benchmark for Screen Replay Detection Under Domain Shift

Receipt Replay OOD: 一个用于域偏移下屏幕重放检测的小型基准

Alexander Vinogradov

发表机构 * IU International University of Applied Science（国际应用科学大学）

AI总结针对屏幕重放攻击检测中的域偏移问题，提出基于收据的小型OOD基准，评估跨域泛化性能。

2605.26831 2026-05-27 cs.CV cs.RO 版本更新

OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes

OSMa-Bench++：面向操作任务的语义映射开放基准测试，使用提示生成的合成场景

Regina Kurkova, Maxim Popov, Sergey Kolyubin

发表机构 * Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University（生物机电学与节能机器人实验室，ITMO大学）

AI总结本文扩展OSMa-Bench，通过提示生成合成室内场景实现可控基准测试，并提出一种基于提示的VQA类别，用于语义映射方法在杂乱、小物体、部分遮挡和光照变化等条件下的压力测试。

Comments Code: https://github.com/be2rlab/OSMa-Bench-v2

AI中文摘要

语义映射方法越来越多地被用作下游机器人推理和操作的中间场景表示，但它们的评估仍然很大程度上依赖于固定的基准数据集，这些数据集对操作相关边缘情况的覆盖有限。在这项工作中，我们将OSMa-Bench扩展到使用提示生成的合成室内场景进行可控基准测试。我们的流程自动生成场景描述，使用SceneSmith合成相应环境，并将生成的资产适配为OSMa-Bench兼容的仿真格式。这种适配需要一个非平凡的中层，包括语义归一化、材质和纹理修复、着色器回退策略、地面处理、导航设置和受控光照配置。所提出设置的一个关键优势是原始场景生成提示是预先已知的，因此可以作为预期场景的辅助语义规范。我们利用这一特性，将OSMa-Bench的VQA组件扩展了一个基于提示的问题类别。由此产生的框架支持在杂乱、小物体、部分遮挡和光照变化等条件下对语义场景表示进行有针对性的压力测试，并使基准测试更具可扩展性，更好地与下游操作需求对齐。我们的代码可在https://github.com/be2rlab/OSMa-Bench-v2获取。

英文摘要

Semantic mapping methods are increasingly used as intermediate scene representations for downstream robotic reasoning and manipulation, yet their evaluation is still largely tied to fixed benchmark datasets with limited coverage of manipulation-relevant corner cases. In this work, we extend OSMa-Bench toward controllable benchmarking with prompt-generated synthetic indoor scenes. Our pipeline automatically generates scene descriptions, synthesizes corresponding environments with SceneSmith, and adapts the resulting assets into an OSMa-Bench-compatible simulation format. This adaptation requires a nontrivial intermediate layer, including semantic normalization, material and texture repair, shader fallback policies, floor handling, navigation setup, and controlled lighting configuration. A key advantage of the proposed setup is that the original scene-generation prompt is known in advance and can therefore serve as an auxiliary semantic specification of the intended scene. We use this property to extend the VQA component of OSMa-Bench with a prompt-grounded question category. The resulting framework supports targeted stress-testing of semantic scene representations under conditions such as clutter, small objects, partial occlusions, and lighting variation, and makes benchmarking more extensible and better aligned with downstream manipulation requirements. Our code is available at https://github.com/be2rlab/OSMa-Bench-v2.

2605.26830 2026-05-27 cs.LG cs.AI cs.CV 版本更新

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

卡尔曼演化：通过可解释算法发现缩小卡尔曼滤波的差距

Vasileios Saketos, Ming Xiao

发表机构 * KTH Royal Institute of Technology（皇家理工学院）

AI总结针对非线性传感场景下卡尔曼滤波性能下降的问题，提出Kalman Evolve框架，联合优化噪声参数与更新结构，利用大语言模型生成可解释的非仿射修改，在多个基准上实现高达12%的RMSE降低。

AI中文摘要

状态估计是控制和信号处理中的一个基本问题，卡尔曼滤波器在线性动力学、高斯噪声和已知噪声协方差下提供最优解。然而，这些假设在多普勒雷达和LiDAR等实际传感场景中常常不成立。在这些情况下，最优估计器本质上是非线性的，导致系统性能下降。这产生了一个仅通过调整噪声协方差参数（即卡尔曼滤波器中的过程噪声和测量噪声）无法消除的性能差距。为了解决这一限制，我们提出了Kalman Evolve，一个通过联合优化噪声参数和更新结构来发现改进滤波算法的框架。我们的方法利用大语言模型作为程序空间上的结构化先验，能够生成对经典卡尔曼滤波器的可解释、非仿射修改，同时保留其递归形式。我们提供了分析结果，证明了在常见非线性传感模型下仿射估计器的次优性，从而激发了结构感知更新的必要性。在一系列合成和真实跟踪基准测试中，包括多普勒雷达、基于LiDAR的定位和行人跟踪，所发现的算法始终优于强基线（如优化卡尔曼滤波器），实现了高达12%的RMSE降低。这些结果表明，优化卡尔曼滤波器的结构而不仅仅是其参数，提供了一种实用且可解释的方式来改进状态估计。

英文摘要

State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman Filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
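
For readers less familiar with the baseline being modified, the sketch below shows the classical recursive predict/update cycle for a scalar state under an assumed random-walk model. It illustrates the standard filter only, not any algorithm discovered by Kalman Evolve. Note that the update is affine in the measurement `z`, which is the structural property the abstract argues is suboptimal under nonlinear sensing. The measurement values and noise settings are made up.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter (random-walk model).

    x, p: previous state estimate and its variance
    z:    new measurement
    q, r: process and measurement noise variances
    """
    # Predict: the state is carried over, uncertainty grows by the process noise.
    x_pred = x
    p_pred = p + q
    # Update: the gain k weights the innovation (z - x_pred).
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)   # affine in the measurement z
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


def run_filter(measurements, x0=0.0, p0=1.0, q=0.01, r=1.0):
    """Apply kalman_step over a sequence and return the list of estimates."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        x, p = kalman_step(x, p, z, q, r)
        estimates.append(x)
    return estimates


if __name__ == "__main__":
    print(run_filter([1.2, 0.9, 1.1, 1.0, 0.95]))
```

Tuning `q` and `r` changes only the gain `k`; the form of the update stays the same, which is the limitation the paper targets by searching over the update structure itself.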

2605.26744 2026-05-27 cs.CV 版本更新

Self-Intersection-Aware 3D Human Motion Generation Using an Efficient Human Sphere Proxy

基于高效人体球代理的自交感知3D人体运动生成

Pascal Herrmann, Maarten Bieshaar, Dennis Mack, Robert Herzog, Juergen Gall

发表机构 * Bosch Research（博世研究院）； University of Bonn（波恩大学）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习与人工智能研究所）

AI总结提出一种基于人体球代理的自交损失函数，用于训练人体运动生成模型，可减少高达49%的自交现象并改善评估指标。

Comments Accepted to BMVC 2025

AI中文摘要

近年来，人体运动生成取得了巨大进展，最先进的方法在领先的评估基准上超越了真实数据。然而，对生成运动的视觉检查揭示了不同情况：即使是最先进的方法也经常生成包含自交（即身体部位相互穿透）的运动，这些强烈的伪影严重限制了感知到的运动质量。我们引入了一种新的损失函数，明确惩罚自交，用于人体运动生成方法的训练。我们的损失基于人体几何的球代理，与基于三角网格的类似方法相比，计算自交损失的速度快98%，内存使用减少83%。该损失与具体方法无关，我们将其添加到最近的人体运动生成方法（人体运动扩散模型MDM和MoMask）的训练中。大量实验表明，生成运动中的自交减少了高达49%，同时改善了其他评估指标。代码可在https://github.com/boschresearch/humansphereproxy获取。

英文摘要

Human motion generation has made tremendous progress in recent years, with state-of-the-art approaches surpassing ground truth data in leading evaluation benchmarks. However, visual inspection of the generated motions paints a different picture. Even state-of-the-art approaches generate motions frequently containing self-intersections, i.e., body parts interpenetrating, which are strong artifacts, severely limiting the perceived motion quality. We introduce a novel loss, which explicitly penalizes self-intersections, to the training of human motion generation methods. We base our loss on a sphere proxy of human geometry, which allows us to calculate a self-intersection loss 98% faster and uses 83% less memory than comparable methods based on triangular meshes. The loss is agnostic to the specific approach, and we add it to the training of the recent human motion generation methods human motion diffusion model (MDM) and MoMask. Our extensive experiments show a reduction of self-intersections in generated motions of up to 49% while improving other evaluation metrics. The code is available at https://github.com/boschresearch/humansphereproxy.
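
A sphere proxy makes overlap tests cheap because two spheres intersect exactly when the distance between their centers is smaller than the sum of their radii. The sketch below is one plausible form of such a penalty, written under assumptions: the squared-penetration formulation, the `ignore_pairs` mechanism for neighboring body parts, and the toy sphere layout are illustrative and are not taken from the paper or its repository.

```python
import math


def self_intersection_penalty(spheres, ignore_pairs=frozenset()):
    """Sum of squared penetration depths over all sphere pairs.

    spheres:      list of (center_xyz, radius) tuples approximating the body
    ignore_pairs: index pairs (i, j) with i < j that are allowed to overlap,
                  e.g. spheres on adjacent body parts
    """
    total = 0.0
    for i in range(len(spheres)):
        for j in range(i + 1, len(spheres)):
            if (i, j) in ignore_pairs:
                continue
            (ci, ri), (cj, rj) = spheres[i], spheres[j]
            depth = ri + rj - math.dist(ci, cj)   # > 0 means the spheres overlap
            if depth > 0.0:
                total += depth * depth
    return total


if __name__ == "__main__":
    # Toy layout: spheres 0 and 1 overlap by 0.2, sphere 2 is far away.
    spheres = [((0.0, 0.0, 0.0), 0.6), ((1.0, 0.0, 0.0), 0.6), ((5.0, 0.0, 0.0), 0.5)]
    print(self_intersection_penalty(spheres))
    print(self_intersection_penalty(spheres, ignore_pairs=frozenset({(0, 1)})))
```

Each pair test costs one distance computation, which is the kind of saving over triangle-mesh intersection tests that the abstract's speed and memory figures refer to; a training loss would use a differentiable tensor version of the same expression.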

2605.26734 2026-05-27 cs.CV 版本更新

CIRCLED: A Multi-turn CIR Dataset with Consistent Dialogues across Domains

CIRCLED：跨领域一致对话的多轮CIR数据集

Tomohisa Takeda, Yu-Chieh Lin, Yuji Nozawa, Youyang Ng, Osamu Torii, Yusuke Matsui

发表机构 * Graduate School of Information Science and Technology, The University of Tokyo（信息科学与技术研究生院，东京大学）； Kioxia Corporation（铠侠公司）

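AI summary: To address the lack of dialogue-history consistency and the domain restriction of existing MTCIR datasets, the authors build CIRCLED by extending FashionIQ, CIRR, and CIRCO. Multi-turn sessions are generated with a CIReVL-based retrieval pipeline and passed through multiple quality filters, yielding 22,608 multi-turn sessions across nine subsets, a substantial gain in scale and generality.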

详情

AI中文摘要

现有的多轮组合图像检索（MTCIR）数据集缺乏对话历史一致性，且仅限于时尚领域。为解决这些限制，我们通过扩展FashionIQ、CIRR和CIRCO构建了CIRCLED。在CIRCLED中，每一轮的查询逐步逼近目标图像。数据通过基于CIReVL的检索流水线生成，并经过检索成功、轮次长度、一致性和信息冗余等多重过滤以确保质量。我们总共收集了涵盖九个子集的22,608个多轮会话，在规模和通用性上显著超过Multi-turn FashionIQ（11,505个会话）。我们进一步应用了多种基线方法，并在CIRCLED上定量评估了检索准确性。我们的工作提供了一个实用、高质量的基准，以促进未来多轮CIR的研究。数据集和代码公开于https://huggingface.co/datasets/tk1441/CIRCLED和https://github.com/mti-lab/circled。

英文摘要

Existing Multi-Turn Composed Image Retrieval (MTCIR) datasets lack dialogue-history consistency and are restricted to the fashion domain. To address these limitations, we construct CIRCLED by extending FashionIQ, CIRR, and CIRCO. In CIRCLED, the query at each turn progressively approaches the target image. Data are generated via a CIReVL-based retrieval pipeline and curated with multiple filters on retrieval success, turn length, consistency, and information redundancy to ensure quality. In total, we collect 22,608 multi-turn sessions across nine subsets, substantially exceeding Multi-turn FashionIQ (11,505 sessions) in both scale and generality. We further apply multiple baseline methods and quantitatively assess retrieval accuracy on CIRCLED. Our work provides a practical, high-quality benchmark to facilitate future research on multi-turn CIR. The dataset and code are publicly available at https://huggingface.co/datasets/tk1441/CIRCLED and https://github.com/mti-lab/circled.

2605.26729 2026-05-27 cs.CV 版本更新

Learning Reference-Guided Exposure Correction with Hybrid Illumination Characteristics

基于混合光照特性的参考引导曝光校正

Hao Ren, Zetong Bi, Zhaoliang Wan, Hui Cheng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China（计算机科学与工程学院，中山大学，广州，中国）

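AI summary: Proposes HICNet, a reference-guided exposure correction framework. A lightweight encoder extracts illumination embeddings, and FiLM-based global adjustment is combined with Photometric Channel Rebalancing for fine-grained exposure matching. Trained without ground truth or intrinsic decomposition, it attains better accuracy on benchmarks and generalizes to unseen scenes.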

Comments ICASSP2026

详情

DOI: 10.1109/ICASSP55912.2026.11465119

AI中文摘要

我们提出了HICNet，一个参考引导的曝光校正框架。一个轻量级、内容无关的编码器将每张图像蒸馏成一个紧凑的光照嵌入，捕获区域亮度、边缘对比度和高阶亮度矩。源图像与其参考图像之间的嵌入差异驱动一个多尺度调制网络，该网络结合基于FiLM的全局调整和光度通道重平衡，实现细粒度的、光照感知的光谱门控，产生曝光匹配的输出，同时忠实保留场景细节。跨批次对比损失对光照流形进行排序，增强了对不同光照条件的鲁棒性。在没有真值或内在分解的情况下训练，HICNet在公共基准测试上达到了更好的精度，并且能够很好地泛化到完全未见过的场景。

英文摘要

We present HICNet, a reference-guided exposure correction framework. A lightweight, content-agnostic encoder distills each image into a compact illumination embedding capturing regional brightness, edge contrast, and higher-order luminance moments. The embedding difference between a source and its reference drives a multi-scale modulation network that combines FiLM-based global adjustment with Photometric Channel Rebalancing for fine-grained, illumination-aware spectral gating, producing exposure-matched outputs while faithfully preserving scene details. A cross-batch contrastive loss orders the illumination manifold, bolstering robustness to diverse lighting conditions. Trained without ground truth or intrinsic decomposition, HICNet attains better accuracy on public benchmarks and generalizes well to entirely unseen scenes.
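
As a rough illustration of FiLM-style modulation driven by an embedding difference, the sketch below scales and shifts each feature channel using the gap between a reference and a source illumination embedding. The linear maps `w_gamma` and `w_beta`, the nested-list feature layout, and the function name are assumptions for illustration; the abstract does not say how HICNet computes its modulation parameters.

```python
def film_from_embedding_diff(features, src_embed, ref_embed, w_gamma, w_beta):
    """Apply per-channel FiLM: out = gamma[c] * x + beta[c].

    features: list of channels, each a list of float activations.
    src_embed, ref_embed: illumination embeddings (lists of floats).
    w_gamma, w_beta: one weight row per channel (hypothetical linear maps
        from the embedding difference to scale and shift).
    """
    diff = [r - s for s, r in zip(src_embed, ref_embed)]
    out = []
    for c, channel in enumerate(features):
        # gamma starts at 1 so that a zero difference leaves features unchanged.
        gamma = 1.0 + sum(w * d for w, d in zip(w_gamma[c], diff))
        beta = sum(w * d for w, d in zip(w_beta[c], diff))
        out.append([gamma * x + beta for x in channel])
    return out


features = [[1.0, 2.0]]
modulated = film_from_embedding_diff(
    features, src_embed=[0.0], ref_embed=[1.0], w_gamma=[[0.5]], w_beta=[[0.1]]
)
unchanged = film_from_embedding_diff(
    features, src_embed=[1.0], ref_embed=[1.0], w_gamma=[[0.5]], w_beta=[[0.1]]
)
```

Centering `gamma` at 1 is a design choice in this sketch: when the source already matches the reference, the modulation reduces to the identity.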

2605.26726 2026-05-27 eess.IV cs.AI cs.CV 版本更新

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

Affiliations: Computational Health Center, Helmholtz Munich, Neuherberg, Germany; Helmholtz AI, Helmholtz Munich, Neuherberg, Germany; Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany; Munich Center for Machine Learning, Munich, Germany; Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany; Department of Physics, University of Munich, Munich, Germany; German Cancer Consortium (DKTK), partner site Munich, Germany

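AI summary: Proposes an uncertainty measure grounded in dynamical-system convergence: it perturbs the automaton state and observes the stability of the prediction to assess how trustworthy neural cellular automata are in medical image segmentation.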

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情

AI中文摘要

神经细胞自动机（NCA）为编码器-解码器分割网络提供了一种轻量级替代方案。然而，决定何时应信任预测可能很困难。在这里，我们研究基于NCA的医学图像分割的不确定性估计，无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发，其中收敛吸引子对应于可信预测。具体地，我们提出了弹性（resilience），这是一种简单的度量，通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的，而显著变化的预测被标记为不确定。我们使用选择性预测指标（$\Delta$Dice@90和AURC）和排序指标（AUROC和AUPRC）通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中，弹性比基线更可靠地识别失败案例，提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($\Delta$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.
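
The perturb-and-recheck idea can be sketched on a toy iterative system. Everything below is an assumption made for illustration: `toy_step` is a hypothetical stand-in for a trained NCA update, the noise is uniform, and agreement is scored with Dice. The paper's actual perturbation scheme and scoring may differ.

```python
import random


def dice(a, b):
    """Dice overlap between two boolean masks of equal length."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    size = sum(a) + sum(b)
    return 1.0 if size == 0 else 2.0 * inter / size


def toy_step(state):
    # Hypothetical update rule: each cell moves halfway toward 0 or 1,
    # whichever side of 0.5 it currently sits on.
    return [x + 0.5 * ((1.0 if x > 0.5 else 0.0) - x) for x in state]


def predict(state):
    """Threshold the state into a binary segmentation mask."""
    return [x > 0.5 for x in state]


def resilience(state, step, eps=0.1, trials=8, extra_steps=10, seed=0):
    """Mean Dice between the original prediction and predictions obtained
    after perturbing the state and letting the system evolve again."""
    rng = random.Random(seed)
    reference = predict(state)
    scores = []
    for _ in range(trials):
        perturbed = [x + rng.uniform(-eps, eps) for x in state]
        for _ in range(extra_steps):
            perturbed = step(perturbed)
        scores.append(dice(reference, predict(perturbed)))
    return sum(scores) / trials


stable = resilience([0.0, 1.0, 1.0, 0.0], toy_step)
fragile = resilience([0.51] * 20, toy_step)
```

In this toy, a state that sits firmly at 0 or 1 returns to the same mask after every perturbation, while a state hovering just above the threshold flips cells and scores lower.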

2605.26725 2026-05-27 cs.CV 版本更新

Joint 2D-3D Segmentation and Association in Street-level Imaging

街景成像中的联合2D-3D分割与关联

Amir Melnikov, Masayuki Tanaka, Yusuke Monno, Masatoshi Okutomi

Affiliations: Institute of Science Tokyo

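AI summary: Proposes a unified framework that combines zero-shot detection and segmentation with structure-from-motion, replacing traditional 2D multi-object tracking with a 3D-driven geometric-consistency mechanism. It achieves stable cross-view segmentation and identity association in street-level imagery, with a 22% performance gain in challenging urban scenarios.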

Comments 15 pages, 6 image figures, 1 in-body table, 1 in-body algorithm, 2 indexes with tables

详情

AI中文摘要

准确解读街景图像对于大规模城市地图绘制和创建空间数字孪生环境至关重要。本文提出了一个用于联合2D-3D分割与关联的统一框架，该框架将视觉语义与多视图几何推理相结合。与依赖时序帧进行跟踪的传统方法不同，我们的方法利用零样本检测和分割，结合运动恢复结构重建，建立稳定的跨视图对应关系。3D驱动的关联机制取代了传统的2D多目标跟踪，利用几何一致性指导宽基线视角和不同成像条件下的身份保持。通过结合2D纹理线索和全局3D上下文，所提出的管道非常适合可扩展的街景处理，并可适用于多种对象类型。实验表明，与最先进的纯2D跟踪方法相比，我们的方法显著提高了对真实序列的覆盖率和更鲁棒的身份保持，在挑战性城市场景中实现了22%的性能提升。

英文摘要

Accurate interpretation of street-level imagery is essential for large-scale urban mapping and the creation of Spatial Digital Twin (SDT) environments. This work presents a unified framework for joint 2D-3D segmentation and association that integrates visual semantics with multi-view geometric reasoning. Unlike conventional approaches that rely heavily on sequential frames for temporal tracking, our method leverages zero-shot detection and segmentation together with structure-from-motion reconstruction to establish stable cross-view correspondences. A 3D-driven association mechanism replaces traditional 2D multi-object tracking, using geometric consistency to guide identity preservation across wide-baseline viewpoints and varying imaging conditions. By combining 2D texture cues with global 3D context, the proposed pipeline is well-suited for scalable street-level processing and can be used for a variety of object types. Experiments demonstrate substantially improved coverage of ground-truth sequences and more robust identity retention compared to state-of-the-art 2D-only tracking methods, achieving a 22% performance gain in challenging urban scenarios.

2605.26712 2026-05-27 cs.CV 版本更新

METATR: A Multilingual, Evolving Benchmark for Automatic Text Recognition

METATR：一个多语言、不断演进的自动文本识别基准

Mélodie Boillet, Solène Tarride, Christopher Kermorvant

发表机构 * TEKLIA

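AI summary: Proposes the METATR benchmark, which uses diverse multilingual documents, a standardized evaluation framework, and a dynamic update mechanism to evaluate automatic text recognition systems (especially vision large language models) and to support model comparison and selection.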

详情

AI中文摘要

反映真实文档多样性和复杂性的基准对于准确评估自动文本识别（ATR）系统，特别是视觉大语言模型（vLLMs）至关重要。尽管最近的模型表现出令人印象深刻的性能，但它们通常在包含现代印刷文本（主要是英语）的数据集上进行评估，这限制了它们与许多实际应用的相关性。因此，为特定用例选择模型需要在与目标文档匹配的数据上进行评估。这突显了代表性基准对于实际应用的重要性。在本文中，我们介绍了METATR（v1.0），一个多语言、不断演进的基准，旨在评估ATR模型在广泛文档上的性能，促进有意义的模型比较和选择。该基准通过包含来自各种公共收藏的文档来最大化多样性。这些文档涵盖29种语言，并包含多种文字和布局的文本。除了数据集本身，METATR还定义了标准化的提示和归一化方法，并建立了一个动态评估框架。这种方法旨在产生可重复的结果，同时随着时间的推移保持可扩展性。我们评估了广泛的最先进系统，包括开源模型和闭源模型。结果从多个维度报告，包括数据集和语言级别的性能、对手写文档的鲁棒性以及计算效率。我们的发现表明，尽管专有模型实现了最一致的性能，但在不同文字和布局之间仍然存在显著差异。总体而言，METATR提供了一个多维度的、面向从业者的框架，用于在真实条件下评估多语言ATR，并随着领域的发展跟踪进展。

数据污染下的异常检测对于在工业环境中部署无监督缺陷检测至关重要，因为整理完全干净的训练集是不切实际的。然而，现有方法对污染敏感，随着噪声比例增加，性能显著下降。在本文中，我们提出记忆蒸馏选择（MeDS），一种基于数据选择的训练算法。MeDS通过随机子采样构建部分记忆集成，其中产生的稀疏性作为低通滤波器，在广泛的噪声比例下捕获名义模式，从而实现对污染样本的粗粒度识别。然后，将到自举记忆的聚合距离蒸馏到重建分数网络中，随后在通过蒸馏模型过滤的干净数据上进行微调，实现异常的精确定位。MeDS在广泛的噪声比例下具有鲁棒性，无需针对特定噪声比例的超参数调整，在MVTecAD上以40%噪声比例达到99.16%的图像级AUROC，并在噪声设置下在VisA和Real-IAD上取得最先进性能。我们在噪声数据场景下的工业AD基准上彻底验证了MeDS的有效性，并进行了深入的经验分析。

英文摘要

Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.
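
The coarse selection stage can be illustrated with a small sketch: score each training sample by its nearest-neighbour distance to several randomly subsampled memories and average the results. This covers only that first stage, not the distillation or fine-tuning. The scalar features, the mean aggregation, the self-exclusion rule, and all names are assumptions for illustration rather than the MeDS implementation.

```python
import random


def partial_memory_scores(features, n_memories=20, memory_size=4, seed=0):
    """Mean nearest-neighbour distance of each sample to random partial memories.

    features: list of scalar features (a stand-in for real embeddings).
    n_memories: number of subsampled memories in the ensemble.
    memory_size: samples per memory; must be at least 2 so that a memory
        never consists of the query sample alone.
    """
    rng = random.Random(seed)
    indices = list(range(len(features)))
    totals = [0.0] * len(features)
    for _ in range(n_memories):
        memory = rng.sample(indices, memory_size)
        for i, x in enumerate(features):
            # Skip the sample itself so it cannot match its own copy in memory.
            totals[i] += min(abs(x - features[j]) for j in memory if j != i)
    return [t / n_memories for t in totals]


nominal = [0.01 * k for k in range(20)]   # dense cluster of normal samples
contaminated = [10.0, 10.05]              # rare contaminated samples
scores = partial_memory_scores(nominal + contaminated)
```

Because each memory is sparse, a rare contaminated sample seldom finds its few look-alikes in memory and so accumulates a large average distance, while nominal samples almost always find a close neighbour.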

2605.26661 2026-05-27 cs.CV cs.AI 版本更新

DelowlightSplat: 面向低光照3D场景重建的前馈高斯泼溅

Fuzhen Jiang, Zengtian Xie, Zhuoran Li

发表机构 * Hangzhou Dianzi University（杭州电子科技大学）； Zhuhai College of Science（珠海科技学院）

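AI summary: Proposes DelowlightSplat, a low-light-aware feed-forward Gaussian splatting framework. Using a lightweight low-light adapter and cost-volume multi-view reasoning, it predicts clean 3D Gaussians directly from sparse, noisy images for high-quality novel view synthesis.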

2605.26621 2026-05-27 cs.CV cs.AI 版本更新

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1：基于奖励驱动的证据基础用于体积推理分割

Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang

Affiliations: School of Biological Science and Medical Engineering, Beihang University, Beijing, China; Center for Information and Computer Science, School of Science for Open and Environmental Systems, Graduate School of Science and Technology, Keio University, Kanagawa, Japan; Bytedance Inc., China; Tsinghua University, Beijing, China

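AI summary: Proposes MedVol-R1, a framework that uses reinforcement learning to ground clinical reasoning in a verifiable 2D evidence anchor, which is then propagated into a 3D mask for volumetric reasoning segmentation; it achieves state-of-the-art performance on multiple benchmarks.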

详情

AI中文摘要

体积推理分割（VRS）旨在根据自由形式的临床查询在3D医学扫描中分割目标区域，其中所指对象通常是隐含的，需要医学知识和体积基础推理。现有方法通常依赖专门的分割标记将语言与掩膜解码连接起来，但这种耦合将决策过程压缩为不透明的潜在表示，限制了可解释性和对多样化叙述表达的泛化能力。在本文中，我们提出MedVol-R1，一种基于强化学习的VRS框架，明确地将证据基础与体积描绘解耦：LVLM将临床推理定位到可验证的2D证据锚点（关键轴向切片和2D边界框），然后由冻结的MedSAM2模块将其传播为连贯的3D掩膜。我们使用冷启动监督微调后接GRPO来训练MedVol-R1，并由多组件奖励引导，该奖励鼓励信息性证据选择、准确的2D空间定位和跨切片体积连贯性，无需昂贵的思维链注释。在M3D-Seg基准的CT-ORG、AbdomenCT-1K和KiTS23上的实验表明，MedVol-R1一致优于强基线并达到最先进性能，强化学习相比纯监督微调提供了明显增益。

英文摘要

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.

2605.26616 2026-05-27 cs.CV 版本更新

Gaussian-Voxel Duet: A Dual-Scaffolding Hybrid Representation for Fast and Accurate Monocular Surface Reconstruction

高斯-体素二重奏：用于快速准确单目表面重建的双支架混合表示

Zhenhua Du, Zhen Tan, Haoyu Zhang, Dewen Hu, Shuaifeng Zhi, Peidong Liu

发表机构 * Zhejiang University（浙江大学）； Westlake University（西湖大学）； National University of Defense Technology（国防科技大学）

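AI summary: Proposes a hybrid Gaussian-voxel representation that confines anchored Gaussians to a narrow band around surfaces defined by a voxelized SDF and adds an implicit surface tethering loss, achieving high-quality surface reconstruction and novel view synthesis while keeping training fast and rendering real-time.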

Comments 27 pages, 14 figures

详情

AI中文摘要

尽管3D高斯泼溅在逼真新视图合成方面取得了显著成功，但其追求快速高保真3D重建一直受限于几何精度与优化效率之间的权衡。专攻图像渲染的方法收敛快，但代价是由于多余基元过拟合训练视图导致几何不完美；而集成神经有符号距离场（SDF）以改善几何的方法则带来了高昂的训练成本。在本文中，我们尝试通过将支架锚定高斯与联合优化的稀疏体素支架绑定来达成更好的权衡。这种混合高斯-体素表示明确地将锚定高斯限制在体素化SDF定义的表面周围的窄带内，有效提高了表示效率并凝聚了浮动高斯，同时不牺牲几何质量。隐式表面约束损失进一步以相互正则化的方式将单个高斯基元拉近至SDF诱导的表面，从而提高重建精度。在来自ScanNet++、ScanNetv2和DeepBlending数据集的各种真实室内场景上的大量实验表明，我们的方法在保持快速训练收敛和实时渲染的同时，实现了最先进的表面重建质量以及优于领先基线的新视图合成。代码将在https://github.com/duzh11/VoxelGS提供。

英文摘要

While 3D Gaussian Splatting has achieved remarkable success in photorealistic novel view synthesis, its pursuit of fast and high-fidelity 3D reconstruction has long been constrained by a trade-off between geometric accuracy and optimization efficiency. Methods specialized in image rendering converge quickly at the cost of imperfect geometry caused by superfluous primitives overfitting training views, while methods integrating neural signed-distance field (SDF) for better geometry incur prohibitive training costs. In this paper, we attempt to strike a better trade-off by tethering scaffold-anchored Gaussians to a jointly optimized sparse voxel scaffold. This hybrid Gaussian-Voxel representation explicitly confines anchored Gaussians to a narrow band around surfaces defined by voxelized SDFs, which effectively improves representation efficiency and condenses floating Gaussians without sacrificing geometry quality. An implicit surface tethering loss further pulls individual Gaussian primitives closer to SDF-induced surfaces in a mutually regularized manner for improved reconstruction accuracy. Extensive experiments on diverse real-world indoor scenes from ScanNet++, ScanNetv2, and DeepBlending datasets demonstrate that our method achieves state-of-the-art surface reconstruction quality as well as superior novel view synthesis against leading baselines, while maintaining fast training convergence and real-time rendering. Code will be available at https://github.com/duzh11/VoxelGS.

2605.26584 2026-05-27 cs.CV 版本更新

O-MARC: Omni Memory-Augmented Compression Distillation for Efficient Video Understanding

O-MARC: 全记忆增强压缩蒸馏用于高效视频理解

Peiran Wu, Yunze Liu, Chi-Hao Wu, Chen Chen, Junxiao Shen

发表机构 * University of Bristol（布里斯托大学）； Memories.ai Research（Memories.ai研究院）； University of Central Florida（佛罗里达中央大学）

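AI summary: Proposes the O-MARC framework. OMAC, a training-free compression method, preserves visual memory and audio anchors, and compression distillation makes compact models robust to compressed inputs, improving performance on multiple benchmarks while lowering inference cost.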

详情

AI中文摘要

全模态大语言模型实现了统一的音频视频理解，但长联合令牌序列导致推理成本高昂，且现有基准未能完全隔离噪声用户生成视频中的音视频关联。我们引入了UGC-AVQA，一个公开的UGC基准，包含1000个视频和4816个问答对，其中音频移除测试确保基准问题需要声学和视觉证据。为了降低推理成本，我们提出了OMAC，一种无需训练的即插即用压缩方法，保留显著的视觉记忆和时域锚定的音频锚点。为了进一步使紧凑模型对压缩输入鲁棒，我们引入了O-MARC，一种用于学习记忆压缩多模态上下文的压缩蒸馏框架。在Qwen2.5-Omni-3B上，O-MARC在四个基准上的平均得分提升至45.8，优于全令牌推理的44.1和OmniZip的41.0。与全令牌推理相比，OMAC还保持了推理效率，延迟降低34.6%（1.53倍加速），内存降低34.7%。

英文摘要

Omnimodal large language models enable unified audio-video understanding, but long joint token sequences make inference costly, and existing benchmarks do not fully isolate audio-visual association in noisy user-generated videos. We introduce UGC-AVQA, a public UGC benchmark with 1,000 videos and 4,816 QA pairs, where an audio-removal test ensures that benchmark questions require both acoustic and visual evidence. To reduce inference cost, we propose OMAC, a training-free plug-in compression method that preserves salient visual memory and temporally grounded audio anchors. To further make compact models robust to compressed inputs, we introduce O-MARC, a compression distillation framework for learning with memory-compressed multimodal contexts. On Qwen2.5-Omni-3B, O-MARC improves the average score across four benchmarks to 45.8, outperforming full-token inference at 44.1 and OmniZip at 41.0. OMAC also keeps inference efficient, reducing latency by 34.6% (1.53× speedup) and memory by 34.7% compared with full-token inference.
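
The two efficiency figures in the last sentence are consistent with each other. Assuming the latency reduction is the relative drop in per-inference time against full-token inference (the symbols $t_{\text{full}}$ and $t_{\text{OMAC}}$ are introduced here for illustration), the speedup follows directly:

```latex
\[
  \text{speedup}
  = \frac{t_{\text{full}}}{t_{\text{OMAC}}}
  = \frac{t_{\text{full}}}{(1 - 0.346)\, t_{\text{full}}}
  = \frac{1}{0.654}
  \approx 1.53
\]
```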

2605.26576 2026-05-27 cs.CV cs.LG 版本更新

CmIVTP：面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

发表机构 * Department of Logistics and Maritime Studies, the Hong Kong Polytechnic University（物流及海运研究系，香港理工大学）； Research Centre for ESG Advancement (RCESGA), the Hong Kong Polytechnic University（ESG进步研究中心（RCESGA），香港理工大学）； School of Navigation, Wuhan University of Technology（航海学院，武汉理工大学）

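AI summary: To address inaccurate vessel trajectory prediction caused by the limits of single-source data, proposes CmIVTP, a cross-modal interaction framework that fuses AIS and CCTV data and uses a target-aware scene encoder and a cross-modal interaction transformer for high-accuracy prediction.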

详情

AI中文摘要

海事智能交通系统（MITS）对于确保繁忙水域的航行安全和效率至关重要。然而，由于单源数据的局限性，准确的船舶轨迹预测仍然具有挑战性。自动识别系统（AIS）数据对于小型船舶通常稀疏或不可用，而仅靠闭路电视（CCTV）数据无法完全捕捉动态船舶行为。为缓解这些挑战，我们提出了一种基于跨模态交互的船舶轨迹预测（称为CmIVTP）框架，以建模船舶动力学与环境约束之间的复杂交互。具体地，我们引入了一个目标感知场景编码器来提取场景语义特征，有效捕捉船舶-环境交互并提高轨迹预测精度。此外，我们提出了一个跨模态交互变换器，它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互，确保动态一致且环境可行的预测。此外，我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库，为候选轨迹生成提供了一种高效且可扩展的方法。另外，我们引入了海事多模态数据集增强版（名为Maritime-MmD$^+$），这是一个同步AIS数据和CCTV视频数据的大规模数据集，为多模态轨迹预测研究提供了有力支持。大量实验表明，CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

英文摘要

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code for this work is available at https://github.com/LouisYxLu/CmIVTP.

2605.26520 2026-05-27 cs.CV cs.AI 版本更新

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； SenseTime Research（商汤研究院）； Shandong Normal University（山东师范大学）

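AI summary: To address the limits of the text-centric paradigm of vision-language models in long-horizon visual reasoning, proposes InterSketch, which strengthens interleaved visual-textual chain-of-thought through self-correction and stepwise reward mechanisms and outperforms proprietary models such as Gemini-3-Pro on visual reasoning benchmarks.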

详情

AI中文摘要

尽管视觉-语言模型（VLM）已展现出多轮视觉推理能力，但其推理轨迹仍相对浅层且以文本为中心，限制了其在复杂视觉挑战中的适用性。相比之下，人类思维通常涉及长程推理，并伴有交错的视觉-文本思维链（VT-CoT）。为弥合这一差距，我们引入InterSketch，一种交错推理模型，通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图，并将其与文本推理交错进行，从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言，在第一个冷启动阶段，我们提出了一个合成的高质量交错VT-CoT数据集，并引入反思机制，使模型具备多轮交错推理和自校正能力。在后续的强化学习（RL）阶段，我们设计了一种逐步奖励机制，以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性，其性能甚至超越了Gemini-3-Pro等专有模型。

英文摘要

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG 版本更新

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

Affiliations: Department of Computer Science and Engineering; Hankuk University of Foreign Studies

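AI summary: Proposes an ROI-preserving, vertex-based, variable-sized patch partitioning of the cortical surface (cortical supervertices) and a Vision Transformer that tolerates variable-sized patches (CSV-ViT). It outperforms recent surface-based models when classifying three AD-related statuses: AD diagnosis, amyloid positivity, and tau positivity.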

详情

AI中文摘要

确认阿尔茨海默病（AD）通常依赖于正电子发射断层扫描（PET），该方法仍然昂贵且有创，这促使了基于结构MRI的预筛查的使用。在非欧几里得流形，特别是大脑皮层表面上的深度学习，由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习；然而，施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说，许多基于表面的模型对感兴趣区域（ROI）的感知有限，这可能导致非皮层区域（如内侧壁）被包含在内。我们提出了一种皮层表面分块方法，该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点（CSV）。基于这种表示，我们设计了CSV视觉Transformer（CSV-ViT），这是一种可变大小补丁容忍的视觉Transformer，使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI，并通过将AD相关状态分类为三个类别来评估我们的框架：AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中，CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明，所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

英文摘要

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.

2605.26513 2026-05-27 cs.CV 版本更新

Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression

Re-M3Dr：重新平衡的多模态均值偏差回归

Haojie Yin, Chengcheng Feng, Tianyi Liu, Tianqi Zhang, Kaizhu Huang

Affiliations: Duke Kunshan University, China; Xi'an Jiaotong-Liverpool University, China; Soochow University

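AI summary: Multimodal fusion of medical images can perform worse than a unimodal model. To address this, proposes the Re-M3Dr framework, which uses adaptive-margin supervised contrastive learning and sharpness-aware gradient modulation for multimodal mean deviation regression, reducing mean squared error by an average of 29% on clinical datasets.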

详情

AI中文摘要

均值偏差（MD）是评估眼科视野损失的关键指标。虽然以往的工作仅关注从光学相干断层扫描（OCT）预测MD，但直观上假设将OCT与另一种眼底摄影（FP）成像结合可以提高性能，因为两种眼科医学成像提供了互补信息。当应用复杂的多目标优化时，这一点尤其值得期待，正如常见的多模态分类中所记载的那样。令人惊讶的是，我们的研究表明，在这种医学成像场景中，多模态融合的性能不如单模态模型。通过详细分析，我们确定根本原因是数据分布和模态学习冲突之间的耦合不平衡。这种不平衡扭曲了优化景观，导致训练不稳定。为了解决这一挑战，我们提出了重新平衡的多模态均值偏差回归（Re-M3Dr）方法，这是一种新颖的多模态回归框架。我们通过自适应边界的监督对比学习增强单模态表示。然后，我们的框架通过锐度感知梯度调制稳定联合优化。在公共和私人临床数据集上的实验结果表明，与最先进的多模态学习方法相比，均方误差平均降低29%，证明了Re-M3Dr的优越性。代码可在补充材料中获得。

英文摘要

Mean Deviation (MD) is a critical metric for assessing visual field loss in ophthalmology. While previous work has focused solely on predicting MD from Optical Coherence Tomography (OCT), it is intuitive to assume that combining OCT with a second imaging modality, fundus photography (FP), could improve performance, as the two ophthalmic imaging modalities provide complementary information. This is particularly expected when sophisticated multi-objective optimization is applied, as documented in common multimodal classification. Surprisingly, our investigations reveal that multimodal fusion in this medical imaging scenario performs worse than a unimodal model. Through detailed analysis, we identify the root cause as a coupled imbalance between data distribution and modality learning conflict. This imbalance distorts the optimization landscape, leading to unstable training. To address this challenge, we propose Rebalanced MultiModal Mean Deviation Regression (Re-M3Dr), a novel multimodal regression framework. We enhance unimodal representation through adaptive-margin supervised contrastive learning. Then, our framework stabilizes the joint optimization with sharpness-aware gradient modulation. Experimental results on both public and private clinical datasets show an average 29% reduction in MSE compared to SOTA multimodal learning methods, demonstrating the superiority of Re-M3Dr. The code is available in the supplementary materials.

2605.26503 2026-05-27 cs.CV 版本更新

Uncertainty-Aware Gaussian Map for Vision-Language Navigation

面向视觉-语言导航的不确定性感知高斯地图

Jianzhe Gao, Rui Liu, Yuxuan Xu, Tongtong Cao, Yingxue Zhang, Zhanguang Zhang, Sida Peng, Yi Yang, Wenguan Wang

发表机构 * The State Key Lab of Brain-Machine Intelligence（脑机智能国家重点实验室）； Department of Foundation model, 2012 Labs, Huawei（基础模型部门，2012实验室，华为）； Noah’s Ark Lab, 2012 Labs, Huawei（诺亚方舟实验室，2012实验室，华为）； School of Software Technology, Zhejiang University（浙江大学软件学院）

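AI summary: Proposes an uncertainty-aware Gaussian map that explicitly models three kinds of perceptual uncertainty (geometric, semantic, and appearance) and integrates them into the observation space, improving the reliability of agent decisions in vision-language navigation.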

详情

AI中文摘要

视觉-语言导航（VLN）要求智能体按照自然语言指令在3D环境中导航。在导航过程中，现有智能体通常遇到感知不确定性，例如缺乏可靠定位的证据或空间线索解释的模糊性，但在预测动作时通常忽略此类信息。在这项工作中，我们显式建模三种形式的感知不确定性（即几何、语义和外观不确定性），并将其整合到智能体的观测空间中，以实现知情决策。具体来说，我们的智能体首先构建一个语义高斯地图（SGM），由从全景观测初始化的可微3D高斯原语组成，编码环境的几何结构和语义内容。在SGM之上，通过高斯位置和尺度的变分扰动估计几何不确定性，以评估结构可靠性；通过扰动高斯语义属性捕获语义不确定性，以揭示模糊解释；通过Fisher信息刻画外观不确定性，该信息衡量渲染观测对高斯级变化的敏感性。这些不确定性被纳入SGM，将其扩展为统一的3D价值地图，将其作为支持可靠导航的可供性和约束。在多个VLN基准上的综合评估显示了我们的智能体的有效性。

英文摘要

Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent's observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.

2605.26501 2026-05-27 cs.CV cs.AI 版本更新

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

揭示视觉-语言模型的脆弱性：通过纹理约束扰动和跨模态优化的多模态对抗协同

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Nanyang Technological University, Singapore（新加坡南洋理工大学）； University College London（伦敦大学学院）

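AI summary: Proposes a multi-modal adversarial synergy framework that jointly optimizes a texture-constrained universal adversarial perturbation and a learnable text prompt perturbation in a black-box setting, exposing the fragility of vision-language models under multi-modal attacks.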

Comments Published in AAAI 2026

详情

AI中文摘要

大型视觉-语言模型（LVLMs）通过整合视觉和文本输入，在图像描述和视觉问答等任务中表现出色，改变了多模态理解。然而，它们对抗攻击的鲁棒性，特别是利用两种模态的攻击，仍未被充分探索，这给自动驾驶和内容审核等关键应用带来了风险。现有攻击集中于单一模态或需要不切实际的白盒访问，限制了其现实相关性。在本文中，我们引入了多模态对抗协同（MMAS），这是一个开创性的框架，用于针对LVLMs构建通用的黑盒多模态攻击。MMAS同时生成纹理尺度约束的通用对抗扰动用于图像，以及可学习的提示扰动用于文本，仅通过模型查询进行联合优化。图像扰动利用基于小波的纹理约束确保在各种视觉输入中的不可感知性和鲁棒性。文本扰动在嵌入空间中受L范数约束，在保持语义连贯性的同时将输出导向目标。一种新颖的跨模态正则化项对齐扰动的梯度方向，增强了它们在任务和模型间的协同影响和可迁移性。大量实验表明，我们提出的攻击在主流LVLMs上具有强大的通用对抗能力。

英文摘要

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack against prevalent LVLMs.


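2605.26500 2026-05-27 cs.CV (updated)

3D Gaussian Map with Open-Set Semantic Grouping for Vision-Language Navigation

Jianzhe Gao, Rui Liu, Wenguan Wang

Affiliations: The State Key Lab of Brain-Machine Intelligence

AI summary: Proposes a 3D Gaussian Map to represent the environment, enriching geometric and semantic information through an online-built egocentric scene map and an open-set semantic grouping operation, and designs a multi-level action prediction strategy; effectiveness is validated on three public benchmarks.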

Abstract

Vision-language navigation (VLN) requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works equip agents with various scene representations to enhance spatial awareness, they often neglect the complex 3D geometry and rich semantics in VLN scenarios, limiting the ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Scene Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors for scene understanding. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world, resulting in a unified 3D Gaussian Map. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist agents in decision-making. Extensive experiments conducted on three public benchmarks (i.e., R2R, R4R, and REVERIE) validate the effectiveness of our method.


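2605.26491 2026-05-27 cs.LG cs.CV (updated)

Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue

Affiliations: Caltech; Stanford University

AI summary: Proposes Diffusion LAIR, a reward-aware listwise optimization method that uses continuous reward scores and all candidate images simultaneously to align diffusion models, outperforming pairwise-preference baselines on text-to-image generation and related tasks.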

Abstract

Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
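The weighting step in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the regularization symbol `lam`, and the exact form of `assumed_objective` are hypothetical, chosen only to show how centering rewards and adding a quadratic penalty can yield a bounded optimum.

```python
def centered_advantages(rewards):
    """Subtract the group mean so the weights for one prompt sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def implicit_reward(ref_loss, model_loss):
    """Denoising-loss improvement of the current model over the reference."""
    return ref_loss - model_loss


def assumed_objective(advantages, implicit_rewards, lam):
    """Hypothetical loss: advantage-weighted term plus a quadratic penalty."""
    weighted = sum(a * s for a, s in zip(advantages, implicit_rewards))
    penalty = 0.5 * lam * sum(s * s for s in implicit_rewards)
    return -weighted + penalty


def closed_form_optimum(advantages, lam):
    """Minimizer of the assumed objective above: s_i = A_i / lam."""
    return [a / lam for a in advantages]
```

Under this assumed objective the optimum is `A_i / lam`, so a larger `lam` shrinks the update, which is in line with the abstract's remark that the regularization strength controls the magnitude of the preference update.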


2605.26486 2026-05-27 cs.CV (updated)

LongCat-Video-Avatar 1.5 Technical Report

Meituan LongCat Team, Xunliang Cai, Meng Cheng, Feng Gao, Zhe Kong, Jiamu Li, Le Li, Weiheng Li, Hongyu Liu, Shuai Tan, Xiaoming Wei, Tianyu Yang, Yong Zhang

Affiliations: Meituan LongCat Team

AI summary: Presents LongCat-Video-Avatar 1.5, an open-source framework that achieves accurate lip synchronization, full-body temporal stability, and long-video generation by upgrading the audio encoder, refining the training recipe, curating data, and applying RLHF training; on the authors' benchmark it matches or exceeds commercial systems.

Comments Homepage: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/ Github: https://github.com/meituan-longcat/LongCat-Video

Abstract

Despite advances in audio-driven video generation, achieving commercial-grade stability remains challenging. We present LongCat-Video-Avatar 1.5, an upgraded open-source framework prioritizing systematic engineering and production-readiness over architectural novelty. By upgrading the audio encoder to Whisper Large and meticulously scaling our training recipes, v1.5 achieves accurate lip-synchronization, full-body temporal stability, and robust long-video generation with strict identity consistency. Through rigorous data curation and RLHF Training, the model readily generalizes to stylized domains such as anime and animals, and natively handles complex real-world conditions, such as multi-person interactions and object handling. Furthermore, addressing the practical demands of industrial deployment, we employ advanced step distillation to accelerate inference to an optimal 8 NFE, achieving a favorable trade-off between serving efficiency and visual fidelity. The superiority of our approach is validated through extensive quantitative metrics and a rigorous human evaluation conducted on a comprehensive benchmark of over 500 diverse test cases. Results show that v1.5 achieves competitive or superior performance compared to leading closed-source systems (e.g., HeyGen, OmniHuman 1.5, Kling Avatar 2.0) across human-likeness ratings and expert-level quality assessments on our benchmark. With its open-source release, LongCat-Video-Avatar 1.5 narrows the gap between academic research prototypes and commercial-grade deployment.


2605.26485 2026-05-27 cs.CV cs.CL (updated)

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

Affiliations: CUHK MMLab; SJTU; NTU; McMaster; CityUHK; JUFE

AI summary: Introduces the OmniInteract benchmark, which evaluates the real-time interaction ability of omnimodal large models through online inference over audio-visual streams, and finds that current models remain weak at streaming interaction.

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.


2605.26483 2026-05-27 cs.CV (updated)

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

Jianzhe Gao, Churan Wang, Weiyi Zhang, Jianghua Li, Li-An Li, Wenguan Wang, Yixin Zhu, Yizhou Wang

Affiliations: Center for Data Science in Clinical Medicine; The State Key Lab of Brain-Machine Intelligence; Department of Gynecology and Obstetrics, 7th Medical Center of Chinese PLA General Hospital; School of Computer Science, Peking University; School of Psychological and Cognitive Sciences, Peking University; State Key Lab of General AI, Peking University; Nat’l Eng. Research Center of Visual Technology; Beijing Key Laboratory of Behavior and Mental Health; Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence

AI summary: Proposes MedVCR, a counterfactual reasoning framework that combines a diffusion-based generator synthesizing pathological tissue evolution, clinical rules that encode diagnostic knowledge, and a dual diagnostic prediction strategy, improving medical video diagnosis performance by 2.6%-10.2%.

Abstract

Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states in a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.


2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY (updated)

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

Affiliations: Yale University; Shanghai Jiao Tong University; University of Sydney

AI summary: Proposes Stochastic Decoupled Policy Gradient (SDPG), which estimates the policy gradient from stochastic perturbations of trajectory rollouts, trains diverse visuomotor control policies end to end within hours on a single GPU, substantially reduces compute and memory overhead, and outperforms baselines on visual MuJoCo benchmarks.

2605.26475 2026-05-27 cs.CV cs.AI (updated)

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

ZhiXin Sun

Affiliations: PowerChina Zhongnan Engineering Corporation Limited

AI summary: Using PTZ cameras in large-scale outdoor scenes, compares three vision-based planar metric measurement methods (monocular ranging, image stitching, and stereo ranging) and analyzes their accuracy and applicability.

Abstract

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with bird's-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
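The pitch-angle effect reported for monocular ranging can be illustrated with the textbook flat-ground pinhole model. The sketch below is an assumption for illustration, not the paper's derivation: `ground_distance`, `pitch_sensitivity`, their parameters, and the default focal length are hypothetical names and values.

```python
import math


def ground_distance(cam_height_m, pitch_deg, pixel_offset_px=0.0, focal_px=1000.0):
    """Horizontal distance to a ground point seen by a downward-pitched camera.

    pixel_offset_px is the vertical offset of the point below the image centre;
    the ground is assumed to be a flat plane cam_height_m below the camera.
    """
    angle = math.radians(pitch_deg) + math.atan2(pixel_offset_px, focal_px)
    if angle <= 0.0:
        raise ValueError("ray does not intersect the ground plane")
    return cam_height_m / math.tan(angle)


def pitch_sensitivity(cam_height_m, pitch_deg, pitch_error_deg=0.1):
    """Change in estimated distance caused by a small pitch error."""
    nominal = ground_distance(cam_height_m, pitch_deg)
    perturbed = ground_distance(cam_height_m, pitch_deg + pitch_error_deg)
    return abs(nominal - perturbed)
```

In this model, with a 10 m camera height, a 0.1° pitch error shifts the estimate by about 2 m at a 5° pitch but by less than 0.1 m at 30°, which is one way to see why shallow viewing angles are harder for monocular ranging.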


2605.26470 2026-05-27 cs.CV (updated)

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun

Affiliations: IPAI & AIIS, Seoul National University, Republic of Korea

AI summary: Proposes TriPS, which models posterior sampling as a time-varying control problem and optimizes the schedules of data-consistency guidance, classifier-free guidance, and stochasticity, markedly improving performance on imaging inverse problems.

Comments ICML 2026

Abstract

Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging. It usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG), and stochasticity. While prior work has focused on how to develop each or all of these components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate that TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
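The triadic trend (DC and stochasticity scales decreasing while the CFG scale increases) can be pictured with a simple schedule template. The sketch below is only an illustration: the cosine ramp, the function names, and every numeric endpoint are assumptions, not the templates or values used in the paper.

```python
import math


def cosine_ramp(t, start, end):
    """Smoothly move from start (t = 0) to end (t = 1)."""
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))


def triadic_schedule(num_steps, dc=(1.0, 0.1), noise=(1.0, 0.0), cfg=(1.0, 7.5)):
    """Per-step (dc_scale, stochasticity, cfg_scale) triples.

    Each argument is a hypothetical (start, end) pair; num_steps must be >= 2.
    """
    steps = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # 0 = start of sampling, 1 = end of sampling
        steps.append((
            cosine_ramp(t, dc[0], dc[1]),        # decreasing
            cosine_ramp(t, noise[0], noise[1]),  # decreasing
            cosine_ramp(t, cfg[0], cfg[1]),      # increasing, since start < end
        ))
    return steps
```

A template like this has only a few free parameters (the endpoints and the curve shape), which is the kind of low-dimensional family a template-based search could explore before a learned policy produces more flexible curves.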


2605.26460 2026-05-27 cs.CV cs.AI (updated)

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

Jian Zhang, Zhijun Zhang

Affiliations: School of Automation Science and Engineering

AI summary: Proposes AnchorDiff, which decouples semantic localization from structural refinement through anchor selection and hybrid-graph propagation, addressing concept leakage between visually confusable concepts in multi-modal diffusion Transformers.

Abstract

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.


2605.26456 2026-05-27 cs.CV (updated)

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

Kai Zheng, Qiang Feng, Xingjian Liu, Wenquan Tan, Yuan Li

Affiliations: Benewake (Beijing) Co., Ltd.

AI summary: Presents SLIM, the first adaptation of MoGe-2 to accept truly sparse LiDAR input; with a partial-convolution sparse encoder and a multi-scale fusion neck, it reduces absolute relative error by about 39-51% at long range (100-150 m).

Comments 6 pages, 3 figures, 2 tables

Abstract

Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI's standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (one near-tie at 0.015 differs by 0.0013), and RMSE is comparable across encoders, with partial-convolution improving in three settings (by up to 0.31 unit) and losing by at most 0.11 unit in the other three.
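Distance-stratified evaluation, which the abstract identifies as largely missing for long-range driving, amounts to computing the metric separately per range bin. The following is a minimal sketch of absolute relative error (AbsRel) per bin; the function names and the bin edges are illustrative assumptions, not the paper's evaluation code.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    return sum(errors) / len(errors) if errors else float("nan")


def stratified_abs_rel(pred, gt, bins):
    """AbsRel per distance bin; each bin is a (low, high) range in metres."""
    results = {}
    for low, high in bins:
        pairs = [(p, g) for p, g in zip(pred, gt) if low <= g < high]
        results[(low, high)] = abs_rel(
            [p for p, _ in pairs], [g for _, g in pairs]
        )
    return results
```

Reporting one number per bin keeps a handful of distant pixels from being averaged away by the much larger number of nearby pixels, so long-range behaviour such as the 100-150 m regime stays visible.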


2605.26451 2026-05-27 cs.HC cs.CV (updated)

Design First, Code Later: Aesthetically Pleasing Template-Free Slides Generation

Zhiyao Cui, Chenxu Wang, Shuyue Hu, Yiqun Zhang, Wenqi Shao, Qiaosheng Zhang, Zhen Wang

Affiliations: School of Cybersecurity, Northwestern Polytechnical University; Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institution; Fudan University

AI summary: Proposes DeepSlides, a hierarchical slide generation workflow that decouples design from implementation, together with the SlideDesign dataset and a multi-agent reinforcement learning training paradigm, to generate high-quality slides without templates.

Abstract

Producing presentation slides automatically entails coordinating narrative structure with page-level graphic design under strict spatial constraints. For such structured multimodal tasks, a well-organized design process is essential to ensure the final quality of slides. Existing approaches rely on fixed templates or directly emit executable code, thereby both limiting the creative layout-design capabilities of LLMs and bypassing the essential slide-page design step. To address these limitations, this paper (1) proposes a hierarchical slides generation workflow, DeepSlides, that systematically organizes slide design tasks without any predefined template or style, decoupling slide-page design from implementation; (2) introduces SlideDesign, a dataset tailored specifically for slides generation tasks; and (3) presents a multi-agent reinforcement learning training paradigm and trains a couple of models, SlideQwens, for slide design and implementation. Experimental results demonstrate that our proposed framework outperforms baseline methods on evaluated metrics and achieves superior performance in human preference evaluations. The dataset and code are available at https://github.com/sxswz213/DeepSlides.


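2605.26449 2026-05-27 cs.CV cs.AI (updated)

Cross-scale Aligned Supervision for Training GANs

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

Affiliations: Sungkyunkwan University

AI summary: To address cross-scale trajectory misalignment in multi-scale GAN generation, proposes CAT (Cross-scale Aligned Transformer), which aligns intermediate outputs with the final output through a generator-side consistency regularization and reaches an FID-50K of 1.56 on ImageNet-256.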

Comments Preprint

Abstract

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.


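2605.26447 2026-05-27 cs.CV (updated)

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

Kahyeon Nam, Hyesong Choi

Affiliations: Soongsil University

AI summary: To address the representation collapse caused by INT8 quantization of CLIP, proposes LRA-EE, which exits early using Spatio-Semantic Aggregation, a multi-feature gate, and layer-adaptive thresholds; on ImageNet-1K zero-shot classification it cuts FLOPs by 13.4% and raises Top-1 accuracy by 2.44 percentage points.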

Abstract

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
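The early-exit idea can be sketched with a deliberately simplified gate. The paper describes a learned gate over three features; the version below is an assumption for illustration only, thresholding top-1 confidence per layer, and all names and threshold values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_features(logits):
    """Confidence (top-1 probability) and top-2 margin for one exit head."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0], probs[0] - probs[1]


def early_exit(per_layer_logits, thresholds):
    """Return (layer_index, predicted_class) for the first confident exit.

    Falls back to the final layer when no exit clears its threshold.
    """
    for layer, logits in enumerate(per_layer_logits):
        confidence, _margin = gate_features(logits)
        if confidence >= thresholds[layer]:
            return layer, logits.index(max(logits))
    last = per_layer_logits[-1]
    return len(per_layer_logits) - 1, last.index(max(last))
```

The four-quadrant figures in the abstract are roughly consistent with the headline gain: 9.5% of samples rescued minus 7.1% lost gives a net 2.4 percentage points, close to the reported +2.44 points (58.72% to 61.16%).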


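2605.26399 2026-05-27 cs.CV (updated)

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

Affiliations: Stony Brook University; The University of Adelaide

AI summary: Proposes OmniGF, which combines a dual-branch decoding strategy (a language branch that generates discrete reasoning states and a continuous spatial branch that uses dense hidden states) with head embeddings to unify precise spatial gaze estimation, semantic gaze prediction, and complex social gaze reasoning in multi-person scenes, setting a new state of the art on multiple benchmarks.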

Abstract

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.


2605.26383 2026-05-27 cs.CV (updated)

Zero-Shot Object Re-Identification in Egocentric Kitchen Videos via Multi-Stage SAM3 Feature Fusion

Dmytro Klepachevskyi, Alexander Wong, Sirisha Rambhatla, Yuhao Chen

Affiliations: University of Waterloo

AI summary: To tackle object re-identification in egocentric kitchen videos, proposes a zero-shot multi-stage method built on SAM3 segmentation that fuses SAM3, DINOv2, and CLIP features and adds mask-shape IoU and k-reciprocal re-ranking, raising mAP from 45.3% to 52.8%.

Abstract

Object re-identification (ReID) in egocentric kitchen videos is challenging due to rapid viewpoint changes, frequent occlusions, cluttered scenes, and large intra-class appearance variations. Objects may leave and re-enter the field of view, and the large diversity of instances with limited annotations makes supervised ReID difficult to scale, motivating zero-shot approaches. We study zero-shot object ReID on the EPIC-Kitchens benchmark, where the goal is to match active food and kitchen-tool instances across frames using only pre-trained visual features. We first evaluate five state-of-the-art feature extractors, including Vision-Language Models (VLMs): CLIP, DINOv2, DreamSim, I-JEPA, and SAM3. We show that these zero-shot baselines fall short, with the best baseline achieving only 45.3% mAP. We then propose an Enhanced SAM3 ReID Pipeline, a zero-shot multi-stage method built around SAM3 segmentation as the core component. Stage 1 uses SAM3 to suppress background clutter. Stage 2 fuses embeddings from SAM3, DINOv2, and CLIP into a single L2-normalized descriptor. Stage 3 augments cosine similarity with mask-shape IoU for geometric consistency, and Stage 4 applies k-reciprocal re-ranking. The full pipeline improves performance by 7.5 mAP points, to 52.8%.
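Stages 2 and 3 of the pipeline can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the function names, the concatenate-then-normalize fusion rule, the weight `alpha`, and the assumption that masks are already cropped and resized to a common grid are all hypothetical. Stage 4 (k-reciprocal re-ranking) is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(embeddings):
    """Normalize each model's embedding, concatenate, then normalize again."""
    joined = [x for emb in embeddings for x in l2_normalize(emb)]
    return l2_normalize(joined)


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as flat 0/1 sequences of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 0.0


def match_score(desc_a, desc_b, mask_a, mask_b, alpha=0.2):
    """Cosine similarity of unit-length descriptors plus a weighted shape term."""
    cosine = sum(x * y for x, y in zip(desc_a, desc_b))
    return cosine + alpha * mask_iou(mask_a, mask_b)
```

Normalizing each embedding before concatenation keeps one model with large-magnitude features from dominating the fused descriptor, which is one plausible reason to normalize at both steps.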

2605.26382 2026-05-27 cs.CV 版本更新

Detail Consistent Stage-Wise Distillation for Efficient 3D MRI Segmentation

细节一致的分阶段蒸馏用于高效3D MRI分割

Mengchen Fan, Baocheng Geng, Xi Xiao, Tianyang Wang, Siyuan Mei, Pulin Che, Xiaoqian Jiang, Qizhen Lan

发表机构 * University of Alabama at Birmingham（阿拉巴马大学伯明翰分校）； Friedrich-Alexander-Universität Erlangen-Nürnberg（埃尔兰根-纽伦堡弗里德里希-亚历山大大学）； UTHealth Houston（德克萨斯大学休斯顿健康科学中心）

AI总结提出细节一致蒸馏（DCD）框架，通过小波分解对齐教师-学生特征，在分阶段蒸馏中保留多尺度结构细节，实现高效3D MRI分割。

Comments Accepted by MICCAI 2026. 11 pages, 3 figures

详情

AI中文摘要

部署高性能3D医学图像分割器（如nnU-Net）通常受到内存占用和推理延迟的限制。因此压缩是必要的，但紧凑的3D编码器往往会在多分辨率阶段重复下采样时丢失细微的结构线索（小病变和锐利边界）。我们提出细节一致蒸馏（DCD），一种分阶段蒸馏框架，通过在小波分解表示中对齐教师-学生特征，跨尺度保留结构细节。在每个编码器阶段，DCD在小波域中蒸馏方向细节分量，同时相对不约束粗略近似，避免对全局语义的过度正则化。DCD仅在训练期间使用，不引入推理开销。在BraTS 2024和ISLES 2022基准上的实验表明，我们的方法在使用3D多模态数据的MRI分割中取得了优越性能。DCD的代码和实现细节可在https://github.com/ClinicaAlpha/DCD-3D-MedSeg公开获取。

英文摘要

Deploying high-performing 3D medical image segmenters (e.g., nnU-Net) is often limited by memory footprint and inference latency. Compression is therefore necessary, but compact 3D encoders tend to lose fine structural cues (small lesions and sharp boundaries) as downsampling repeats across multi-resolution stages. We propose Detail Consistent Distillation (DCD), a stage-wise distillation framework that preserves structural detail across scales by aligning teacher-student features in a wavelet-decomposed representation. At each encoder stage, DCD distills directional detail components in the wavelet domain while leaving the coarse approximation comparatively unconstrained, avoiding over-regularization of global semantics. DCD is used only during training and introduces no inference-time overhead. Experiments on the BraTS 2024 and ISLES 2022 benchmarks demonstrate that our approach achieves superior performance in MRI segmentation using 3D multi-modal data. Code and implementation details for DCD are publicly available at https://github.com/ClinicaAlpha/DCD-3D-MedSeg.

2605.26381 2026-05-27 cs.CV 版本更新

Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

基于Perceiver IO融合卫星和街景图像的多模态建筑检测

Niels Sombekke, Rob G. J. Wijnhoven, Martin R. Oswald

发表机构 * University of Amsterdam (UvA)（阿姆斯特丹大学）； Spotr

AI总结提出一种通过Perceiver IO架构融合卫星和街景图像的多模态分类框架，使用共享DINOv2骨干网络的空间补丁令牌，无需填充或固定大小池化即可处理可变数量的街景视图，并联合预测屋顶元素和材料类别，在包含10个国家32135栋建筑的数据集上验证了RGB-M掩码策略和融合模型的有效性。

详情

AI中文摘要

我们提出了一种多模态分类框架，通过Perceiver IO架构融合卫星和街景图像，该架构基于共享DINOv2骨干网络的空间补丁令牌。该设计自然地处理每栋建筑可变数量的街景视图，无需填充或固定大小池化，并联合预测多标签屋顶元素和屋顶材料类别。我们构建了一个包含10个国家32,135栋建筑（61,672个片段）的大规模数据集，将卫星图像与每个片段最多八个街景视图配对，并评估了四种用于隔离目标建筑的掩码策略。我们提出了一种RGB-M掩码策略，将建筑足迹掩码作为第四个输入通道，提供了一种软空间先验，在两种模态下均优于硬裁剪。Perceiver IO融合模型优于所有其他融合策略，并在街景可见的属性上取得了显著的每类增益（例如，石板+11.3 AP，老虎窗+1.3 AP），尽管仅卫星基线在主要从上方可见的类别的宏观平均mAP上仍保持轻微优势。这些结果为多模态建筑检测建立了一种可扩展、灵活的架构，能够处理异构输入和多个输出任务。

英文摘要

We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.

2605.26380 2026-05-27 cs.CV cs.AI 版本更新

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

VisualNeedle: 信息密集场景中的主动视觉搜索基准

Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu

发表机构 * Hunyuan, Tencent（腾讯混元）； Peking University（北京大学）； Zhejiang University（浙江大学）

AI总结针对多模态大语言模型在细粒度感知基准中依赖捷径而非真实视觉证据的问题，提出VisualNeedle基准，通过反事实裁剪-黑化设置评估模型在信息密集场景中的主动视觉搜索能力，实验表明最佳模型准确率仅56.01%，落后人类63.00%。

详情

AI中文摘要

前沿多模态大语言模型（MLLMs）被报道在细粒度感知基准上达到超过90%的准确率。然而，这样的分数并不一定意味着对视觉证据的忠实使用。先前的研究已经识别出三种抬高基准性能的捷径。首先，问题中的语言先验和词汇线索使模型能够在未见图像的情况下推断出看似合理的答案。其次，来自视觉编码器的粗略全局语义可以绕过细粒度的局部细节。第三，在一些“用图像思考”的基准中，破坏视觉工具返回的中间图像几乎不影响最终答案。这些发现表明，仅靠更高的输入分辨率或更大的问题池并不能引发真正的主动视觉搜索。为了解决这个问题，我们引入了VisualNeedle，这是一个具有挑战性、信息密集且细粒度的基准，用于关键证据在空间上局限于微小区域且无法一眼看出的场景。我们进一步提出了一种反事实裁剪-黑化设置，将工具返回的裁剪区域替换为相同大小的黑色图像，以测试工具启用的性能是否真正依赖于中间视觉证据。我们在三种设置下评估了9个著名的MLLMs：无工具、标准工具启用和裁剪-黑化。无工具准确率保持在20%以下，最佳工具启用模型仅达到56.01%，仍落后于63.00%的人类多数投票准确率。这些结果揭示了细粒度视觉搜索中持续存在的局限性，而裁剪-黑化消融实验证实，VisualNeedle上的成功依赖于真正的中间视觉证据。

英文摘要

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.

2605.26376 2026-05-27 cs.CV cs.AI cs.LG 版本更新

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

BioFact-MoE：基于生物学因子分解的混合专家模型用于肝细胞癌的视觉-语言预后建模

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

发表机构 * Department of Radiology & Biomedical Imaging, Department of Biomedical Engineering, Department of Electrical Engineering, Department of Statistics & Data Science, Yale University, New Haven, CT, 06510, USA

AI总结提出BioFact-MoE框架，通过生物学监督的混合专家模型显式分解肝脏和肿瘤因子，在肝细胞癌预后预测中提升准确性和生物学可解释性。

Comments Early accepted at MICCAI 2026

详情

AI中文摘要

肝细胞癌（HCC）具有生物学异质性，由肝功能储备和肿瘤相关肿瘤学因素之间的相互作用塑造；因此，相似的生存结果可能反映根本不同的潜在生物学过程。HCC的预后建模依赖于来自多参数MRI和常规临床实践放射学报告的丰富多模态信息。现有的预后视觉-语言模型（VLM）学习单一的纠缠潜在表示，混合了肝脏和肿瘤相关因素，限制了准确性和生物学可解释性。我们提出BioFact-MoE，一个生物学因子分解的混合专家（MoE）框架，通过残差MoE生存架构中的生物学监督专家显式分解肝脏和肿瘤因素。在N=588名患者的HCC队列（在4,582个3D MRI图像-报告对上预训练）中，BioFact-MoE在所有时间范围内持续优于所有基线的生存预测，实现了12、18和24个月的AUC分别为75.33%、75.85%和73.96%。除了标量风险预测，门控专家权重实现了表型感知的风险分层。通路感知的门控揭示了临床上有意义的治疗相关生存异质性。在保留验证中，肝脏和肿瘤嵌入分别与肝功能标志物和肿瘤负荷标志物显示出选择性关联（p<0.05），无需监督。代码可在https://github.com/jy-639/BioFact-MoE获取。

英文摘要

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%, respectively. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.

2605.26370 2026-05-27 cs.CV 版本更新

Joint Instance Segmentation and Geometric Attribute Regression for Roof Structures in Aerial Imagery

航空影像中屋顶结构的联合实例分割与几何属性回归

Luuk Versteeg, Rob G. J. Wijnhoven, Martin R. Oswald

发表机构 * University of Amsterdam (UvA)（阿姆斯特丹大学）

AI总结提出一种从单张航空正射影像中联合预测屋顶实例分割掩码和三个连续几何属性（建筑高度、屋顶坡度、屋顶方位角）的方法，通过条件方位角损失和对数归一化高度表示解决数据噪声和分布偏斜问题，在荷兰大规模数据集上实现了高精度，并可从单张图像重建简化3D建筑模型。

详情

AI中文摘要

我们提出了一种方法，用于从单张航空正射影像中联合预测实例级屋顶分割掩码以及三个连续几何属性——建筑高度、屋顶坡度和屋顶方位角。我们的方法扩展了Mask R-CNN，增加了一个专门的属性回归分支，并引入了两个关键创新：一个条件方位角损失，抑制了对屋顶平坦段（其中方位角标签固有噪声）的监督；以及一个对数归一化高度表示，解决了建筑高度严重偏斜分布的问题。我们在一个大规模荷兰航空图像数据集上进行训练和评估，该数据集与从3DBAG（一个全国性的基于LiDAR的3D建筑数据集）自动导出的真实值配对。使用DINOv3 ConvNeXt-Base骨干网络，我们的方法在屋顶坡度上实现了约4度的平均绝对误差，方位角为7度，建筑高度为1米，实例分割AP$_{50}$为0.566。预测的每段掩码和属性足以从单张俯视图像重建简化的3D建筑模型（LoD2），仅需在训练时使用昂贵的3D参考数据。

英文摘要

We present a method for jointly predicting instance-level roof segment masks together with three continuous geometric attributes -- building height, roof slope, and roof azimuth -- from a single aerial orthophoto. Our approach extends Mask R-CNN with a dedicated attribute regression branch and introduces two key innovations: a conditional azimuth loss that suppresses supervision for flat roof segments where azimuth labels are inherently noisy, and a log-normalized height representation that addresses the heavily skewed distribution of building heights. We train and evaluate on a large-scale dataset of Dutch aerial images paired with automatically derived ground truth from 3DBAG, a nationwide LiDAR-based 3D building dataset. Using a DINOv3 ConvNeXt-Base backbone, our method achieves a mean absolute error of approximately 4 degrees for roof slope, 7 degrees for azimuth, and 1 meter for building height, with an instance segmentation AP$_{50}$ of 0.566. The predicted per-segment masks and attributes are sufficient to reconstruct simplified 3D building models (LoD2) from a single overhead image, requiring expensive 3D reference data only for training.
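The two loss-side ideas in this abstract can be illustrated with a short sketch. It is written under stated assumptions rather than taken from the paper: the flatness threshold `flat_threshold_deg`, the wrap-around L1 error, the `log1p` normalization, and the `max_h` constant are all hypothetical choices made for illustration.

```python
import math

def azimuth_error_deg(pred, target):
    """Smallest absolute angular difference in degrees, with wrap-around at 360."""
    diff = abs(pred - target) % 360.0
    return min(diff, 360.0 - diff)

def conditional_azimuth_loss(pred_az, true_az, true_slope, flat_threshold_deg=5.0):
    """Mean azimuth error over segments whose ground-truth slope exceeds the threshold.

    Segments at or below the (assumed) threshold are treated as flat and skipped,
    because their azimuth labels carry little meaning. Returns 0.0 if all are flat.
    """
    errors = [
        azimuth_error_deg(p, t)
        for p, t, s in zip(pred_az, true_az, true_slope)
        if s > flat_threshold_deg
    ]
    return sum(errors) / len(errors) if errors else 0.0

def encode_height(height_m, max_h=200.0):
    """Map a height in meters to a log-compressed target; max_h (assumed) maps to 1."""
    return math.log1p(height_m) / math.log1p(max_h)

def decode_height(value, max_h=200.0):
    """Invert encode_height to recover meters from the regression output."""
    return math.expm1(value * math.log1p(max_h))
```

The log compression spreads the many low buildings over a wider part of the target range and compresses the few tall ones, which is one common way to handle a skewed regression target.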

2605.26353 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Personalized Generative Models for Contextual Debiasing

用于上下文去偏的个性化生成模型

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

发表机构 * Department of Computer Science, Princeton University（普林斯顿大学计算机科学系）； LIX, CNRS, École Polytechnique（巴黎综合理工学院LIX实验室，法国国家科学研究中心）

AI总结提出DecoupleGen方法，利用个性化文本到图像扩散模型生成罕见上下文图像，作为训练增强以缓解视觉识别中的上下文偏差。

Comments CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

详情

AI中文摘要

不同的视觉模式在世界中出现的频率不同：例如，沙滩球出现在沙滩上比出现在道路上更常见。这些统计数据反映在视觉数据集中，因此训练好的模型更容易在常见场景中识别物体。然而，在道路上识别沙滩球可能比在沙滩上识别更重要。我们研究如何缓解这种差异。由于在现实世界中收集不常见的图像可能很困难，我们探索生成具有较少频繁上下文的图像是否可以作为有效的训练增强。一个关键挑战是引导生成保持在原始数据集分布附近，同时创建具有不常见上下文的多样化图像。我们引入了DecoupleGen方法，该方法个性化文本到图像扩散模型，以促进罕见上下文图像的连贯合成，同时保留原始视觉细节。生成的图像包含语义上有意义的内容，并在视觉上与原始数据集保持一致。我们进一步应用验证约束以确保增强数据的相关性。我们在复杂场景数据集上的物体分类和识别任务中评估了我们的方法。实验表明，我们的方法比先前的方法有一致的改进，并且我们的分析确定了这些改进背后的因素。

英文摘要

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

2605.26332 2026-05-27 cs.CV cs.AI 版本更新

使用轻量级自监督模型的睡眠阶段高效分类

Eldiane Borges dos Santos Durães, João Batista Florindo

发表机构 * Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Street Sergio Buarque de Holanda, 651, Campinas, Brazil（数学、统计与科学计算研究所，坎皮纳斯大学，塞尔吉奥·布阿尔克·德·霍兰达街651号，巴西坎皮纳斯）

AI总结本研究通过简化mulEEG自监督模型并结合线性SVM分类器，实现了高效准确的睡眠阶段分类。

详情

DOI: 10.5220/0013367900003912
Journal ref: Proceedings VISAPP 2025, 972-979 (2025)

AI中文摘要

睡眠阶段的准确分类对于诊断睡眠障碍至关重要，自动化该过程可以显著增强临床评估。本研究旨在探索使用自监督模型（具体为mulEEG的改编版本）结合线性SVM分类器来改进睡眠阶段分类。方法：mulEEG模型以自监督方式学习脑电图信号表示，本文通过将ResNet-50替换为ResNet-18主干网络（使用1D卷积作为时间序列编码器）对其进行了简化。还进行了另外两项改编：第一项评估了模型的不同配置和训练数据量，第二项测试了时间序列特征、频谱图特征及其拼接作为线性SVM分类器输入的有效性。结果：结果显示，与简化模型相比，减少数据量提供了更好的成本效益比。使用ResNet-18的拼接特征也优于原始mulEEG模型的线性评估，实现了更高的分类性能。结论：简化mulEEG模型以提取特征，并将其与稳健的分类器配对，可实现更高效、更准确的睡眠阶段分类。该方法有望改善临床睡眠评估，并可扩展到其他生物信号分类任务。

英文摘要

Accurate classification of sleep stages is crucial for diagnosing sleep disorders, and automating this process can significantly enhance clinical assessments. This study explores the use of a self-supervised model (specifically, an adapted version of mulEEG) combined with a linear SVM classifier to improve sleep stage classification. Methods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing the ResNet-50 with 1D convolutions, used as the time-series encoder, with a ResNet-18 backbone. Two other adaptations were conducted: the first evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time series features, spectrogram features, and their concatenation as inputs to a linear SVM classifier. Results: Reducing the volume of data offered a better cost-benefit ratio than simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.

2605.26294 2026-05-27 cs.CV 版本更新

CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection

用于皮肤癌检测的CNN、Transformer、混合模型和视觉语言模型

Durjoy Dey, Yuhong Yan, Hassan Hajjdiab

发表机构 * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada（计算机科学与软件工程系，康科迪亚大学，加拿大蒙特利尔）； Ebovir Biotechnologie Inc., Montreal, Canada（Ebovir生物技术公司，加拿大蒙特利尔）

AI总结本文在PAD-UFES-20数据集上统一评估了12种深度学习模型（包括CNN、ViT、混合卷积Transformer和视觉语言模型），结果表明混合模型和基于SigLIP的VLM在排名性能和临床相关操作点之间取得了最佳平衡。

Comments 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science

详情

AI中文摘要

皮肤癌是一种常见且快速增长的恶性肿瘤，全球范围内发病率不断上升。早期检测对于改善预后至关重要。基于皮肤镜和临床图像训练的深度学习模型可以支持自动化和快速分诊。然而，许多研究仅评估了有限的架构，且不同研究的实验设置也各不相同。在本文中，我们在PAD-UFES-20数据集上对十二种深度学习模型进行了统一的二分类皮肤癌检测评估。这些模型涵盖四个家族：卷积神经网络（CNN）、视觉Transformer（ViT）、混合卷积Transformer骨干网络和视觉语言模型（VLM）。性能评估使用AUC、最大F1分数及其精确率和召回率，以及在80%特异性下的灵敏度，以反映筛查导向的需求。我们的结果表明，调优良好的CNN已经提供了强大的基线，但基于Transformer的家族持续改善了区分能力。混合模型（MaxViT Tiny、CoAtNet0）和基于SigLIP的VLM在排名性能和临床相关操作点之间实现了最佳整体权衡，而基于CLIP的模型提供了高精确率。所有实验的完整代码库已公开发布。这些发现共同为皮肤癌筛查中实际部署最合适的模型家族提供了实用指导，并为未来在PAD-UFES-20上的工作建立了可重复的参考点。

英文摘要

Skin cancer is a common and fast-rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution-transformer backbones, and vision-language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening-oriented requirements. Our results show that well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while the CLIP-based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real-world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.

2605.26287 2026-05-27 cs.CV 版本更新

A multifractal-based masked auto-encoder: an application to medical images

基于多重分形的掩码自编码器：在医学图像中的应用

Joao Batista Florindo, Viviane de Moura

发表机构 * Institute of Mathematics, Statistics and Scientific Computing - University of Campinas（数学、统计与科学计算研究所 - 坎皮纳斯大学）

AI总结提出一种利用多重分形测度（Renyi熵）优化掩码策略的掩码自编码器（MO-MAE），通过聚焦高复杂度区域提升医学图像分类性能。

详情

DOI: 10.5220/0013359300003912
Journal ref: Proceedings VISAPP 2025, 769-776 (2025)

AI中文摘要

掩码自编码器（MAE）在医学图像分类中显示出巨大潜力。然而，传统MAE采用的随机掩码策略可能忽略医学图像中的关键区域，而这些区域中即使微小的变化也可能指示疾病。为解决这一局限性，我们提出了一种利用多重分形测度（Renyi熵）优化掩码策略的新方法。我们的方法称为多重分形优化掩码自编码器（MO-MAE），它采用多重分形分析来识别高复杂度和信息量丰富的区域。通过将掩码过程聚焦于这些区域，MO-MAE确保模型学习重建最具诊断相关性的特征。这种方法对医学成像特别有益，因为精细检查组织结构对于准确诊断至关重要。我们在涵盖多种疾病的多个医学数据集上评估了MO-MAE，包括MedMNIST和COVID-CT。我们的结果表明，MO-MAE取得了有前景的性能，超越了其他基线和最先进的模型。由于所提出的测度计算简单，该方法还增加了最小的计算开销。我们的发现表明，多重分形优化的掩码策略增强了模型捕获和重建复杂组织结构的能力，从而实现了更准确和高效的医学图像表示。所提出的MO-MAE框架为提高医学图像分析中深度学习模型的准确性和效率提供了一个有前景的方向，可能推动计算机辅助诊断领域的发展。

英文摘要

Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other baseline and state-of-the-art models. The proposed method also adds minimal computational overhead, as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model's ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
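One way to picture entropy-guided masking is to score each patch by the Renyi entropy of its intensity values and mask the highest-scoring patches. The sketch below is only an illustration under assumptions: the order `alpha=2`, the use of raw value counts as the histogram, the top-k selection rule, and the function names are not taken from the paper, whose multifractal computation may differ.

```python
import math
from collections import Counter

def renyi_entropy(values, alpha=2.0):
    """Renyi entropy of order alpha for the empirical distribution of `values`.

    For alpha == 1 the Shannon entropy (the limiting case) is returned.
    """
    total = len(values)
    probs = [count / total for count in Counter(values).values()]
    if alpha == 1.0:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

def select_masked_patches(patches, mask_ratio=0.75, alpha=2.0):
    """Return sorted indices of the patches to mask: those with the highest entropy.

    `patches` is a list of flat lists of quantized pixel intensities.
    """
    scores = [renyi_entropy(patch, alpha) for patch in patches]
    n_mask = int(round(mask_ratio * len(patches)))
    ranked = sorted(range(len(patches)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_mask])
```

A uniform patch scores zero and is masked last, while a patch with many distinct intensity levels scores high and is masked first, which matches the stated goal of focusing reconstruction on complex regions.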

2605.26283 2026-05-27 cs.CV cs.LG 版本更新

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

卷积、Transformer、混合模型及视觉语言模型在多病种视网膜筛查中的基准测试

Durjoy Dey, Aymane Ajbar, Yuhong Yan

发表机构 * Department of Computer Science and Software Engineering（计算机科学与软件工程系）； Concordia University（康科迪亚大学）； Ebovir Biotechnologie Inc.（Ebovir生物技术公司）

AI总结本研究在RFMiD数据集上对四种模型家族的12种架构进行基准测试，评估其在多病种视网膜筛查中的性能，发现基于注意力的模型（如SwinTiny、CoAtNet0、MaxViTTiny）在二元筛查和多标签分类中表现最佳，视觉语言模型与CNN基线相当但未超越最优Transformer和混合模型。

Comments 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings

详情

AI中文摘要

现代深度学习为自动化视网膜筛查提供了强大工具，但在现实多病种设置和领域偏移下，不同视觉模型家族的比较仍不明确。本研究使用视网膜眼底多病种图像数据集（RFMiD），对四种模型家族（卷积神经网络、视觉Transformer、混合CNN-Transformer骨干网络和视觉语言模型）的12种架构进行基准测试。我们评估两个任务：任何视网膜疾病的二元筛查和28个疾病类别的多标签分类。通过标准化训练、校准和评估协议，我们报告了在特异性接近80%的临床相关操作点下的AUC、F1、精确率、召回率和灵敏度。在RFMiD上，所有架构在二元筛查中表现良好，AUC均高于84%，但基于注意力的模型表现最佳。SwinTiny以及混合模型CoAtNet0和MaxViTTiny在二元筛查中取得最强结果，并在多标签设置中提高了宏F1和微F1。视觉语言模型（包括CLIP ViT-B/16和SigLIP-Base384）与CNN基线相当，但未超越最优Transformer和混合骨干网络。在Messidor-2上对可转诊糖尿病视网膜病变进行外部验证时，AUC范围为66.8%至84.7%，混合模型和Transformer模型再次表现出强劲性能。这些结果为多病种视网膜筛查中的模型选择提供了可重复的参考，并指导未来用于临床部署的自动化筛查工具。

英文摘要

Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.

2605.26273 2026-05-27 cs.CV 版本更新

Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

频率引导的RGB-热红外语义分割融合

İsmail Emre Canıtez, Özgür Erkent

发表机构 * Hacettepe University（哈切特佩大学）

AI总结提出一种基于双ConvNeXt V2骨干网络的多模态融合架构，通过频率分解和置信门控残差机制融合RGB与热红外特征，在MFNet和PST900上以较低参数量实现先进性能。

Comments 9 pages, 7 figures, To be Presented at Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR, 2026

详情

AI中文摘要

在城市驾驶场景等复杂环境中，语义分割在光照条件不佳时仍具挑战性，仅凭RGB图像提供的信息不足。RGB-热红外融合利用可见光和红外图像的互补优势来提升场景理解；然而，在不同特征抽象层次上有效整合这些异质模态仍是一个开放问题。本文提出一种基于双ConvNeXt V2骨干网络的多模态融合架构，采用分阶段、模态自适应的融合策略。对于早期特征，我们引入基于频率的融合模块，通过高斯滤波将红外特征分解为低频和高频分量，应用双分支空间注意力选择性强调热模式与精细边界，并通过置信门控残差机制将其与RGB特征融合。对于后期特征，我们设计了一个具有跨模态注意力和多尺度深度可分离卷积的语义融合模块，以捕捉模态间的语义对应关系。融合后的特征通过带有深度监督的PANet风格双向解码器进行解码。在MFNet和PST900上的实验表明，我们最轻量化的变体分别达到61.73%和86.24%的mIoU，仅需35.43M参数，在显著减少参数和计算成本的同时优于近期方法。代码可在https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION获取。

英文摘要

Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION
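The frequency split used for early-stage infrared features can be illustrated in one dimension: a Gaussian blur gives the low-frequency part, and the residual gives the high-frequency part. This is a simplified sketch, assuming a 1D signal, edge-replicated padding, and hypothetical `sigma` and `radius` values; the paper operates on 2D feature maps, and the attention and gating steps are not shown.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights over offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def split_frequencies(signal, sigma=1.0, radius=2):
    """Split a 1D signal into a low-frequency (blurred) and high-frequency (residual) part.

    Borders are handled by clamping indices, i.e. replicating the edge values.
    """
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    low = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        low.append(acc)
    high = [s - l for s, l in zip(signal, low)]
    return low, high
```

By construction the two parts sum back to the input, so nothing is lost by the split; a constant signal yields a zero high-frequency part, while edges and spikes show up there.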

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV 版本更新

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力：视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

发表机构 * Technical University of Munich（慕尼黑工业大学）； Tensordyne

AI总结针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题，提出基于Jensen偏差的在线逐注意力分数校正方法，在INT2量化下恢复接近BF16的视频质量，且内存减半。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

详情

AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算，但随着视频变长，该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力，但降低了视频质量。我们表明，这种降低的一个关键驱动因素是注意力权重的系统性偏差：由于softmax注意力中指数的凸性，量化噪声膨胀了缓存键的贡献，我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正，在期望中消除此偏差，该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似，额外的计算开销可忽略不计，且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化，我们的校正恢复了因激进量化而损失的大部分质量，达到接近BF16的视频质量，并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
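The bias described above can be reproduced with a toy noise model. The sketch assumes each key coordinate carries independent rounding error uniform on [-step/2, step/2] with a single shared step size, and it ignores the usual 1/sqrt(d) attention scaling; these modeling choices and the function names are assumptions, not the paper's derivation. Under this model the error in the score q·k has variance ||q||² step²/12, so a second-order expansion of E[exp(error)] gives an inflation of about exp(||q||² step²/24), which can be subtracted from the score.

```python
import math

def score_correction(query, step_size):
    """Second-order estimate of the log-inflation of exp(q·k) caused by key rounding."""
    q_norm_sq = sum(x * x for x in query)
    return q_norm_sq * step_size ** 2 / 24.0

def exact_inflation(query, step_size):
    """Exact E[exp(q·e)] for independent e_i ~ Uniform(-step/2, step/2)."""
    factor = 1.0
    for x in query:
        a = x * step_size / 2.0
        factor *= math.sinh(a) / a if a != 0 else 1.0
    return factor

def residual_inflation(query, step_size):
    """Expected inflation that remains after subtracting the correction from the score."""
    return exact_inflation(query, step_size) * math.exp(-score_correction(query, step_size))
```

In this toy setting `exact_inflation` is always at least 1, which is the Jensen effect: quantized keys receive more softmax mass than their unquantized counterparts. Subtracting `score_correction` from each cached key's score brings the expected weight back close to its unbiased value.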

2605.26262 2026-05-27 cs.CV 版本更新

Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis

维度分布情感状态：利用效价和唤醒作为视觉情感分析的通用嵌入空间

Émile Bergeron, Tadagbé Dhossou, Sébastien Tremblay, Jean-François Lalonde

发表机构 * Université Laval（拉瓦尔大学）

AI总结提出一种新的情感表示方法DDES，结合连续双维情感空间和多数据集训练流程，以辅助博物馆策展人预测艺术品引发的情感反应。

详情

AI中文摘要

博物馆是传播文化艺术的重要场所。它们是植根于历史和传统的机构；其展览通常旨在突出这些方面。最近，该领域正在探索一种新方法：基于情感的展览。这些展览专门设计用于引发游客的情感，以最大化参与度，并作为民主化艺术接触和吸引更广泛、更多样化观众的一种方式。为此，必须首先提取艺术品的情感内容，然而，由专家手动标注艺术品是一个劳动密集且成本高昂的过程，并且存在引入策展人个人偏见的风险。为了协助博物馆策展人设计这些展览，我们希望开发一种能够预测艺术作品所引发的情感反应的工具。在本文中，我们利用连续的双维情感空间来增强情感表示和深度学习模型的训练过程。借鉴现有的分类和维度情感表示，我们引入了一种新的表示方法——维度分布情感状态（DDES），以及一个多数据集训练流程。我们表明，与广泛使用的表示相比，DDES提供了多种优势，同时表现出相似的基线性能。

英文摘要

Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition; their exhibitions are often designed to highlight these aspects. Recently, a new approach has been explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in visitors, both to maximize engagement and to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted. However, manual annotation of artworks by experts is a prohibitively labor-intensive process and risks introducing the personal bias of curators. To assist museum curators in designing these exhibitions, we wish to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.

2605.26244 2026-05-27 cs.CV cs.MM cs.SD 版本更新

几何感知表示去噪用于鲁棒的多视角3D重建

Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

发表机构 * KAIST AI（韩国国立科学技术院人工智能研究所）； Samsung Electronics（三星电子）

AI总结提出几何感知表示去噪（GARD）框架，在前馈3D重建模型的特征空间中执行扩散式多视角恢复，同时恢复场景几何与高质量RGB图像，在DA3基准上验证有效性。

English abstract

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

2605.26149 2026-05-27 cs.GR cs.CV version update

AnySurf: Any Surface Generation with Directed Edge

Wenda Shi, Chenyuan Pan, Dengming Zhang, Yiren Song, Biao Zhang, Xingxing Zou

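Affiliations: The Hong Kong Polytechnic University; Zhejiang University; National University of Singapore; Xi'an Jiaotong University

AI summary: Proposes AnySurf, a unified framework that uses a directed-edge enhanced Flexible Dual Grid representation to generate high-quality open, closed, and hybrid 3D surfaces, and introduces ROS-FT post-training and a lightweight DE-Adapter that preserve the original generation performance.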

English abstract

Open surface components prevail in real industrial 3D content and support rendering, physical simulation and geometric editing. Garments serve as a typical open surface type, with numerous existing generation methods leveraging sewing patterns to generate 2D panels and stitch them into 3D shapes. Such domain-specific designs lack scalability and cannot generalize to shoes and accessories. Common field-based 3D generators prioritize watertight meshes and tend to create flawed double-layer structures on open surfaces. Though Trellis2 adopts field-free representation, its open surface results still contain normal and topology errors. We present AnySurf, a unified framework generating open, closed and hybrid 3D surfaces with accurate face orientation. Built on directed-edge enhanced Flexible Dual Grid (FDG-D), our representation retains normal direction information via oriented grid edges. We also propose ROS-FT post-training and a lightweight DE-Adapter with merely 1% extra parameters, facilitating directed edge learning while preserving original generation performance. We further construct Outfit3D dataset containing industrial garments and closed accessories. Our work transforms garment modeling into a universal 3D generation task. Experimental results demonstrate superior mesh quality and better practicality for downstream applications.

2605.26137 2026-05-27 cs.GR cs.AI cs.CV version update

AssetGen: Deployable 3D Asset Generation at Interactive Speed

Dilin Wang, Xiaoyu Xiang, Kihyuk Sohn, Tom Monnier, Yu-Ying Yeh, Thu Nguyen-Phuoc, Jiawen Zhang, Yuchen Fan, Antoine Toisoul, Hyunyoung Jung, Prithviraj Dhar, Michael Bunnell, Nikolaos Sarafianos, Chuhang Zou, Roman Shapovalov, Andrea Vedaldi, Rakesh Ranjan

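Affiliations: Reality Labs, Meta

AI summary: Presents AssetGen, a system that combines a coarse-to-refine VecSet framework, multi-view texture generation, and end-to-end acceleration to produce, within 30 seconds, a high-quality mesh with baked normals, a color texture, and a controlled polygon budget, suitable for real-time rendering and mobile deployment.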

English abstract

While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.

2605.26103 2026-05-27 cs.CV version update

Global Structure-from-Motion Meets Feedforward Reconstruction

Linfei Pan, Johannes Schönberger, Marc Pollefeys

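Affiliations: ETH Zurich; Meta Reality Labs; Microsoft

AI summary: Proposes a new pipeline that combines the strengths of classical SfM and feedforward reconstruction, achieving state-of-the-art reconstruction results across a wide range of scenarios.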

Comments CVPR 2026, Highlight

English abstract

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.

2605.25861 2026-05-27 cs.CV cs.AI version update

Tetris: Tile-Level Sampling for Efficient, High-Fidelity Video Object Tracking

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

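Affiliations: U. of California, Berkeley

AI summary: Presents Tetris, a system that decomposes video into a tile-based polyomino data model for fine-grained spatiotemporal pruning, achieving up to 68.8x higher throughput than a full-frame, every-frame reference pipeline while staying within a 5% loss in tracking accuracy.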

English abstract

Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io .

2605.25353 2026-05-27 cs.LG cs.CV physics.comp-ph version update

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

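Affiliations: Department of Computer Science, UC Berkeley; UC Berkeley; Departments of Computer Science and Chemical Engineering, UC Berkeley; Lawrence Berkeley National Laboratory (LBNL)

AI summary: Introduces PDEInvBench, a benchmark dataset of numerical simulations covering a range of PDEs, and systematically explores the neural network design space along three dimensions (optimization, representation, and scaling), yielding practical insights on two-stage training, PDE derivatives as input features, and the diversity of initial conditions.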

Comments 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

Journal ref: Transactions on Machine Learning Research, 2026

English abstract

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.

2605.24001 2026-05-27 cs.CV cs.AI cs.LG version update

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

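Affiliations: Purdue University; hi-lab, Xiaohongshu Inc.

AI summary: To address the mismatch between reward optimization and generative dynamics in reinforcement learning for one-step generators, proposes DIDR, a data-free trajectory-level alignment framework derived from Integral KL minimization. It applies a reward-driven correction through the Diffused Reward Score and an efficient proxy estimator, Pareto-dominating one-step SDXL baselines and surpassing its 50-step teacher in preference alignment on a 6B DiT backbone.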

Comments author list correction

English abstract

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

2605.23327 2026-05-27 cs.CV version update

GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection

Tiancheng Wang, Zhaolu Ding, Richeng Xu, Tianhui Zheng, Hui Liu, Hanyu Xuan, Zhiliang Wu, Guanghui Yue

Affiliations: School of Big Data and Statistics, Anhui University; School of Artificial Intelligence and Data Science, University of Science and Technology of China; Institute of Dataspace, Hefei Comprehensive National Science Center; College of Computing and Data Science, Nanyang Technological University; School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University

AI summary: Existing lane detection methods degrade in complex scenes because classification confidence is decoupled from geometric quality and regression modules weaken the correlations among sampling points. Proposes GFSR, a framework comprising LaneIoU-guided Confidence Calibration and Adaptive Gated Location Refinement, which achieves state-of-the-art results on CULane and 87.35% F1_50 on CurveLanes.

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems. 12 pages, 6 figures

English abstract

Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane priors and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, the regression modules in existing methods weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate the geometric fidelity of lane priors, which is further fused with classification confidence to construct the Collaborative Reliability Index (CRI). This index guides lane prior filtering, effectively retaining those with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1_50 and F1_75 scores of 81.46% and 65.01%, and reaches 87.35% F1_50 on CurveLanes.
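
The abstract says that LCC fuses an estimated geometric fidelity with the classification confidence to form the CRI, but it does not give the fusion rule. The Python sketch below assumes a weighted geometric mean purely for illustration; `collaborative_reliability`, `alpha`, and the sample scores are hypothetical and not taken from the paper. It shows why ranking by a fused score can favor a well-localized lane prior that confidence-only ranking would place second.

```python
def collaborative_reliability(cls_conf, geo_fidelity, alpha=0.5):
    """Fuse two scores in [0, 1] with a weighted geometric mean.

    Assumption: the abstract does not state the actual CRI formula.
    """
    return (cls_conf ** (1.0 - alpha)) * (geo_fidelity ** alpha)


def rank_priors(priors, key):
    """Return prior names sorted from highest to lowest score under `key`."""
    return [p["name"] for p in sorted(priors, key=key, reverse=True)]


# Hypothetical lane priors: A is confident but poorly localized, B the reverse.
priors = [
    {"name": "A", "cls": 0.90, "geo": 0.30},
    {"name": "B", "cls": 0.70, "geo": 0.90},
]

by_confidence = rank_priors(priors, key=lambda p: p["cls"])
by_cri = rank_priors(
    priors, key=lambda p: collaborative_reliability(p["cls"], p["geo"])
)
print(by_confidence)  # ['A', 'B']
print(by_cri)         # ['B', 'A']
```

Under this assumed rule, a prior needs both scores to be high to rank well, which matches the stated goal of keeping priors with high classification confidence and favorable geometric quality.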

2605.22904 2026-05-27 cs.CV cs.AI version update

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

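Affiliations: Université TÉLUQ; Polytechnique Montréal; Université du Québec à Montréal

AI summary: Introduces the first interpretable framework that assesses suicide risk from surveillance video by combining person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling, achieving 83.2% ROC-AUC on real surveillance data.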

Comments 9 pages, 6 figures, 1 table. Accepted for Publication in the International Joint Conference of Artificial Intelligence (IJCAI)

English abstract

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

2605.20914 2026-05-27 cs.CV version update

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

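Affiliations: AMAP, Alibaba Group

AI summary: To address coarse-grained role alternation, declining question quality, and question-type collapse in the self-evolution of vision-language models, proposes RISE, a framework that achieves reliable self-evolution through fine-grained role alternation, a quality supervisor, and skill-aware dynamic balancing.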

English abstract

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose RISE, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

2605.20606 2026-05-27 cs.CV version update

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

Muquan Li, Yingyi Ma, Yihong Huang, Hang Gou, Ke Qin, Ming Li, Yuan-Fang Li, Tao He

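Affiliations: The Laboratory of Intelligent Collaborative Computing of UESTC, Chengdu, China; Monash University, Melbourne, Australia; Guangdong Laboratory of Artificial Intelligence

AI summary: To address the limited robustness of dataset distillation, proposes C$^2$R, a framework that couples an attack-aware curriculum with a contrastive robustness objective. It prioritizes the adversarial examples with the smallest robust margins and widens inter-class separation at the decision boundary, outperforming prior robust DD methods by 2.8% on average in robust accuracy.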

Comments Accepted to ICML 2026

English abstract

Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.

2605.16457 2026-05-27 cs.LG cs.AI cs.CV version update

Identifiable Token Correspondence for World Models

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

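Affiliations: Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Computer Science and Engineering, Seoul National University

AI summary: Proposes Identifiable Token Correspondence (ITC), which formulates next-frame prediction as a structured assignment problem to address temporal inconsistency in long-horizon rollouts of token-based transformer world models, achieving state-of-the-art performance on four benchmarks.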

English abstract

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.02207 2026-05-27 cs.CV cs.AI cs.LG version update

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

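Affiliations: Department of Computer Science, Old Dominion University, VA, USA

AI summary: Presents MultiSense-Pneumo, a multimodal research prototype that integrates symptoms, cough audio, speech, and chest radiographs and uses interpretable late fusion to support pneumonia screening and triage.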

English abstract

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
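
The late-fusion step described above, where each modality yields a normalized concern signal and hand-specified weights combine them, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight values, the modality names, the `late_fusion` helper, and the renormalization over missing modalities are all hypothetical and are not taken from the paper.

```python
# Hypothetical, hand-specified weights (not the values used by the authors).
HYPOTHETICAL_WEIGHTS = {
    "symptoms": 0.3,
    "cough": 0.2,
    "speech": 0.1,
    "radiograph": 0.4,
}


def late_fusion(signals, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted average of per-modality concern signals in [0, 1].

    Modalities whose signal is None are skipped and the remaining weights
    are renormalized (an assumption made for this sketch).
    """
    present = {m: s for m, s in signals.items() if s is not None}
    if not present:
        raise ValueError("at least one modality signal is required")
    for modality, value in present.items():
        if modality not in weights:
            raise KeyError(f"no weight defined for modality {modality!r}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal for {modality!r} must lie in [0, 1]")
    total_weight = sum(weights[m] for m in present)
    return sum(weights[m] * s for m, s in present.items()) / total_weight


# Example: the speech signal is unavailable for this hypothetical case.
estimate = late_fusion(
    {"symptoms": 0.8, "cough": 0.4, "speech": None, "radiograph": 0.9}
)
print(round(estimate, 3))  # 0.756
```

Because the weights are explicit constants rather than learned parameters, the contribution of each modality to the final estimate can be read off directly, which is the sense in which such a fusion operator is interpretable.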

2605.08146 2026-05-27 cs.CV cs.AI version update

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Intelligence Science and Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China

AI summary: Introduces VT-Bench, the first benchmark for visual-tabular multimodal learning, covering 14 datasets across 9 domains and evaluating 23 models to expose the challenges of visual-tabular learning.

2511.19741 2026-05-27 cs.CV version update

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

Affiliations: Department of Computer Science, Vanderbilt University; Department of Mathematics, Florida State University; Department of Biostatistics & Bioinformatics, Duke University; Department of Electrical & Computer Engineering, Vanderbilt University

AI summary: Studies the min-Sliced Transport Plan (min-STP) framework and the transferability of optimized slicers across distribution pairs, and introduces a minibatch formulation for scalability, enabling one-shot matching and amortized training for point cloud alignment and flow-based generative modeling.

Abstract

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
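
The core idea of a sliced transport plan (project both point sets onto a line, solve the 1D problem by sorting, then score the resulting matching in the ambient space) can be sketched as below. This is only an illustration under simplifying assumptions: equal-size point sets with uniform weights, 2D points, and a grid search over directions standing in for the paper's optimized slicer. The function names are hypothetical.

```python
import math


def sliced_plan(xs, ys, direction):
    """Match equal-size point sets by sorting their 1D projections.

    Returns (pairs, cost): pairs holds (index in xs, index in ys) tuples and
    cost is the mean squared Euclidean distance of matched pairs, measured
    in the original (ambient) space rather than on the line.
    """
    def project(point):
        return sum(a * b for a, b in zip(point, direction))

    order_x = sorted(range(len(xs)), key=lambda i: project(xs[i]))
    order_y = sorted(range(len(ys)), key=lambda j: project(ys[j]))
    pairs = list(zip(order_x, order_y))  # rank-to-rank matching solves 1D OT
    cost = sum(
        sum((a - b) ** 2 for a, b in zip(xs[i], ys[j])) for i, j in pairs
    ) / len(pairs)
    return pairs, cost


def min_sliced_plan(xs, ys, n_angles=180):
    """Grid search over 2D unit directions; keep the lowest-cost plan."""
    best = None
    for k in range(n_angles):
        angle = math.pi * k / n_angles
        direction = (math.cos(angle), math.sin(angle))
        pairs, cost = sliced_plan(xs, ys, direction)
        if best is None or cost < best[1]:
            best = (pairs, cost, direction)
    return best


xs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ys = [(2.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
pairs, cost, direction = min_sliced_plan(xs, ys)
```

The transferability question in the abstract then amounts to reusing a `direction` found on one pair of point sets when calling `sliced_plan` on a new, related pair, instead of searching again.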

2605.18359 2026-05-27 cs.CV Updated version

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang

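Affiliations: Qwen Business Unit of Alibaba; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology

AI summary: To address cross-modal misallocation and intra-visual imbalance in the standard attention of large multimodal models, proposes RAVE, a lightweight pairwise gating mechanism that re-allocates visual attention through a learned query-key bias, improving results by an average of 3 percentage points across several multimodal benchmarks, with the largest gains on perception-intensive tasks.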

2605.05204 2026-05-27 cs.CV Updated version

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi

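Affiliations: The Hong Kong University of Science and Technology; Z-Image Team, Alibaba Group; University of California, San Diego; The Chinese University of Hong Kong

AI summary: Proposes D-OPSD, an on-policy self-distillation training paradigm that lets step-distilled diffusion models keep their few-step inference ability during supervised fine-tuning. The model acts as both teacher and student under different contexts (the student conditioned on text features only, the teacher on multimodal features), and training minimizes the discrepancy between their predicted distributions, so the model learns new concepts and styles without losing its few-step capability.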

Comments Project Page: https://vvvvvjdy.github.io/d-opsd/

Abstract

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models, in which an LLM/VLM serves as the encoder, can inherit their encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the divergence between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

2601.15891 2026-05-27 cs.CV Updated version

Can Vision Mamba Improve AI-Generated Image Detection? An In-Depth Study

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

Affiliations: Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France; Khalifa University; School of Communication and Information Engineering, Shanghai University; Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi

AI summary: Systematically evaluates Vision Mamba models for AI-generated image detection, comparing them with CNN, ViT, and VLM-based detectors in terms of accuracy, efficiency, and generalization.

Abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

2605.14664 2026-05-27 cs.CV Updated version

MiVE: Multiscale Vision-language features for reference-guided video Editing

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

Affiliations: MT Lab, Meitu Inc., Beijing 100083, China; Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; Beijing University of Posts

AI summary: Proposes MiVE, a framework that feeds multiscale hierarchical VLM features (early layers keep spatial detail, deeper layers encode global semantics) into a unified self-attention Diffusion Transformer, addressing the modality gap and the loss of fine-grained information, and reaching state-of-the-art performance in reference-guided video editing.

Comments ICML 2026

Abstract

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

2605.13455 2026-05-27 cs.CV Updated version

UniPCB: A Generation-Assisted Inspection Framework for PCB Defect Detection

Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

Affiliations: School of Information Engineering, Guangdong University of Technology; School of Computer Science, Wuhan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary: Proposes the UniPCB framework, which synthesizes defect samples with a multi-modal condition generator to augment data and adds Inverted Residual Shift Attention and a Cross-level Complementary Fusion Block to improve detection, reaching 98.0% mAP@0.5 on DsPCBSD+.

Abstract

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
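
The "FiLM-style spatially-adaptive modulation" mentioned above boils down to a per-location scale and shift of a feature map. A minimal sketch for a single-channel map follows; in the actual method the scale and shift maps would be predicted from the condition embeddings by a network, whereas here they are fixed, hypothetical values, and the function name is an assumption.

```python
def film_modulate(features, gamma, beta):
    """Apply y[i][j] = gamma[i][j] * x[i][j] + beta[i][j] on an H x W map."""
    return [
        [g * x + b for x, g, b in zip(row_x, row_g, row_b)]
        for row_x, row_g, row_b in zip(features, gamma, beta)
    ]


features = [[1.0, 2.0], [3.0, 4.0]]
gamma = [[1.0, 0.5], [0.0, 2.0]]   # hypothetical per-pixel scales
beta = [[0.0, 1.0], [1.0, -1.0]]   # hypothetical per-pixel shifts
out = film_modulate(features, gamma, beta)
```

Because the scale and shift vary by position, the condition can suppress a feature in one region (scale 0) while amplifying it in another, which is what makes the modulation spatially adaptive.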

2603.12647 2026-05-27 cs.CV cs.AI Updated version

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

ZY Chen, F Zhu, H Zhu, DY Kong, XK Kuang, YJ Zhang, CM Jiang

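Affiliations: Waymo Open Dataset

AI summary: Proposes a Salient Gaussian representation that combines LiDAR reflectance with RGB, using structure-aware initialization, reflectance calibration, and joint alignment for efficient and robust self-driving scene reconstruction.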

Comments 8 pages, 7 figures

Abstract

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.
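
To put the reported 1.18 dB PSNR gain in context, the sketch below computes PSNR from mean squared error and converts a dB difference into an MSE ratio. It assumes pixel values normalized to [0, 1] and flattened into lists; the helper name is illustrative and not taken from the paper's code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives about 20 dB.
example = psnr([0.0, 0.0], [0.1, 0.1])

# A gain of d dB means the MSE shrinks by a factor of 10 ** (-d / 10).
mse_ratio = 10 ** (-1.18 / 10)
```

Under this definition, a 1.18 dB improvement corresponds to roughly a quarter less mean squared error, since PSNR is logarithmic in MSE.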

2604.27604 2026-05-27 cs.CV cs.CE Updated version

Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning

Junpeng Ding, Zichen Tang, Haihong E, Mengyuan Ji, Yang Liu, Haolin Tian, Haiyang Sun, Pengqi Sun, Yang Xu, Yichen Liu, Haocheng Gao, Zijie Xi, Ruomeng Jiang, Peizhi Zhao, Rongjin Li, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Jintong Chen, Siying Lin

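Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes the SPUR benchmark, whose 4,264 question-answer pairs evaluate multimodal large language models on fine-grained perception, cross-panel relation understanding, and expert-level reasoning over scientific experimental images, revealing the gap between current models and expert-level performance.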

Comments Accepted to ACL 2026 Main Conference

Abstract

We introduce SPUR, a comprehensive benchmark for scientific experimental image perception, understanding, and reasoning, comprising 4,264 question-answering (QA) pairs derived from 1,084 expert-curated images. SPUR features three key innovations: (1) Panel-Level Fine-Grained Perception: evaluating the visual perception of multimodal large language models (MLLMs) across three dimensions (numerical, morphological, and information localization) on six fine-grained panel types; (2) Cross-Panel Relation Understanding: utilizing complex images with an average of 14.3 panels per sample to evaluate MLLMs' ability to decipher intricate cross-panel relations; (3) Expert-Level Reasoning: assessment of qualitative and quantitative reasoning across five experimental paradigms to determine if models can infer conclusions from evidence as human experts do. Comprehensive evaluation of 20 MLLMs and four multimodal Chain-of-Thought (MCoT) methods reveals that current models fall significantly short of the expert-level requirements for scientific image interpretation, underscoring a critical bottleneck in AI for Science (AI4S) research.

2604.24764 2026-05-27 cs.CV Updated version

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

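Affiliations: Zhejiang University; Microsoft Research; Independent Researcher

AI summary: Proposes World-R1, a framework that uses reinforcement learning (Flow-GRPO) with feedback from 3D foundation models and vision-language models to improve the 3D consistency of video generation without architectural changes, and adopts a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity.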

Comments ICML 2026, Project Page: https://aka.ms/world-r1, Code: https://github.com/microsoft/World-R1

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
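
GRPO-style methods, the family Flow-GRPO belongs to, score each sample relative to the other samples generated for the same prompt instead of training a separate value network. The sketch below shows that group-relative normalization only. The reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and nothing here reproduces the paper's reward models or training loop.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical combined rewards for four videos sampled from one prompt,
# e.g. a mix of 3D-consistency feedback and VLM feedback.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward receive positive advantages and are reinforced, while below-average ones are pushed down, so only the ranking within the group matters and not the absolute reward scale.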

2401.07669 2026-05-27 cs.CV Updated version

SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

Darshan Singh, Zeeshan Khan, Makarand Tapaswi

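Affiliations: CVIT, IIIT Hyderabad; Inria, École normale supérieure, CNRS, PSL Research University

AI summary: Proposes SRL-CLIP, which generates rule-based captions from structured Semantic Role Labels (SRLs) and contrastively fine-tunes CLIP on only 23k video-caption pairs, efficiently adapting it for general video understanding. On zero-shot text-to-video retrieval it performs comparably to or better than models with 4-8x more parameters and up to 6000x more post-pretraining data.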

Comments Accepted to the CV4Smalls Workshop at CVPR 2026

Abstract

Adapting CLIP for videos has gained popularity due to its semantically rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g., HowTo100M, WebVid2.5M). However, such narrations or captions often lack the comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, visual learning is inefficient and adaptation requires millions of samples for post-pretraining. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.
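
Generating a rule-based caption from a semantic role annotation can be as simple as filling a fixed template. The sketch below illustrates that idea only: the role names (`agent`, `verb`, `patient`, `manner`, `location`) and the template order are hypothetical, since the abstract does not spell out the actual rules.

```python
def caption_from_srl(event):
    """Fill a fixed template from one semantic-role annotation (a dict)."""
    parts = [event["agent"], event["verb"]]
    # Optional roles are appended in a fixed order when present.
    for role in ("patient", "manner", "location"):
        if event.get(role):
            parts.append(event[role])
    return " ".join(parts) + "."


full = caption_from_srl({
    "agent": "a woman",
    "verb": "opens",
    "patient": "a door",
    "manner": "slowly",
    "location": "in a kitchen",
})
short = caption_from_srl({"agent": "a dog", "verb": "runs"})
```

Here `full` is "a woman opens a door slowly in a kitchen." and `short` is "a dog runs.". Captions built this way state every annotated role explicitly, which is the denser text signal the abstract contrasts with sparse narrations.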

2604.22774 2026-05-27 cs.CY cs.AI cs.CV cs.LG Updated version

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

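Affiliations: Electronics and Telecommunications Research Institute

AI summary: To address VLM over-correction in the evaluation of multi-line handwritten math OCR, proposes PINK, an LLM-based semantic evaluation metric that explicitly penalizes over-correction and aligns better with human judgment than BLEU on the FERMAT dataset.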

Abstract

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

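2604.19673 2026-05-27 cs.CV Updated version

An Empirical Study of the Robustness and Scalability of Machine Learning on Imbalanced Tabular Clinical Data in Emergency and Critical Care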

Yusuf Brima, Marcellin Atemkeng

Affiliations: Computer Vision Group, Institute of Cognitive Science, Osnabrück University; Department of Mathematics, Rhodes University; National Institute for Theoretical and Computational Sciences (NITheCS)

AI summary: Evaluates six model families on imbalanced clinical tabular data from the MIMIC-IV-ED and eICU datasets, finding that tree-based models scale best while tabular foundation models offer a new trade-off between performance and efficiency.

Abstract

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.
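
The choice of Macro F1 matters on imbalanced data because accuracy can look strong for a model that never predicts the rare class. The following sketch computes Macro F1 from scratch on a toy 9:1 label split; the data is invented for illustration and is unrelated to MIMIC-IV-ED or eICU.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Nine negatives, one positive; the model always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.9 while Macro F1 is below 0.5, because the rare class contributes an F1 of zero and every class counts equally in the average.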

2509.08289 2026-05-27 cs.CV Updated version

Dual-Thresholded Heatmap-Guided Proposal Clustering and Negative Certainty Supervision with Enhanced Base Network for Weakly Supervised Object Detection

Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang

Affiliations: Institute of Cyberspace Security, Harbin Institute of Technology; Faculty of Information Technology, Monash University; Center on Machine Learning Research, Harbin Institute of Technology; Department of New Networks, Peng Cheng Laboratory; School of Cyberspace Science, Harbin Institute of Technology

AI summary: Proposes DANCE, which combines dual-thresholded heatmap-guided proposal selection, an enhanced base network, and a negative certainty supervision loss to address incomplete pseudo GT boxes, the semantic gap, and slow convergence in weakly supervised object detection.

Comments IEEE TIP Minor Revision

Abstract

Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and uses multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we propose the Dual-thresholded heAtmap-guided proposal clustering and Negative Certainty supervision with Enhanced base network (DANCE) method for WSOD. Specifically, we first devise a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then construct a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. Finally, we introduce a negative certainty supervision (NCS) loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC and MS COCO datasets demonstrate the effectiveness and superiority of our method. Our code is publicly available at https://github.com/gyl2565309278/DANCE.

2604.00648 2026-05-27 cs.CV Updated version

DirectFisheye-GS: Enabling Native Fisheye Input in Gaussian Splatting with Cross-View Joint Optimization

Zhengxian Yang, Fei Xie, Xutao Xue, Rui Zhang, Taicheng Huang, Yang Liu, Mengqi Ji, Tao Yu

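Affiliations: BNRist, Tsinghua University; Beihang University; JD.com, Beijing, China; Shanghai AI Lab

AI summary: To address the information loss and blurred detail caused by undistorting fisheye camera input, integrates a fisheye camera model into the 3DGS framework and introduces a feature-overlap-driven cross-view joint optimization strategy, enabling training on native fisheye images without preprocessing and improving reconstruction quality.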

Comments CVPR 2026 Highlight; Fix NSFC ID

Abstract

3D Gaussian Splatting (3DGS) has enabled efficient 3D scene reconstruction from everyday images with real-time, high-fidelity rendering, greatly advancing VR/AR applications. Fisheye cameras, with their wider field of view (FOV), promise high-quality reconstructions from fewer inputs and have recently attracted much attention. However, since 3DGS relies on rasterization, most subsequent works involving fisheye camera inputs first undistort images before training, which introduces two problems: 1) black borders at image edges cause information loss and negate the fisheye's large FOV advantage; 2) undistortion's stretch-and-interpolate resampling spreads each pixel's value over a larger area, diluting detail density, which causes 3DGS to overfit these low-frequency zones and produces blur and floating artifacts. In this work, we integrate a fisheye camera model into the original 3DGS framework, enabling native fisheye image input for training without preprocessing. Despite correct modeling, we observed that the reconstructed scenes still exhibit floaters at image edges: distortion increases toward the periphery, and 3DGS's original optimization, which selects one random view per iteration, ignores the cross-view correlations of a Gaussian, leading to extreme shapes (e.g., oversized or elongated) that degrade reconstruction quality. To address this, we introduce a feature-overlap-driven cross-view joint optimization strategy that establishes consistent geometric and photometric constraints across views, a technique equally applicable to existing pinhole-camera-based pipelines. Our DirectFisheye-GS matches or surpasses state-of-the-art performance on public datasets. Project Page: https://yzxqh.github.io/DirectFisheye-GS/ .
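
To make the contrast between pinhole and native fisheye projection concrete, here is a minimal Python sketch. The abstract does not say which fisheye model the authors use; the equidistant model (image radius proportional to the incidence angle) is an assumption chosen for illustration, and the function names and parameters `f`, `cx`, `cy` are hypothetical.

```python
import math

def project_pinhole(point, f, cx, cy):
    """Pinhole projection: image radius grows as f * tan(theta)."""
    x, y, z = point
    if z <= 0.0:
        # Points at or beyond 90 degrees from the optical axis cannot be imaged.
        raise ValueError("point is not in front of the pinhole camera")
    return (cx + f * x / z, cy + f * y / z)

def project_equidistant_fisheye(point, f, cx, cy):
    """Equidistant fisheye projection (assumed model): image radius is f * theta."""
    x, y, z = point
    rho = math.hypot(x, y)          # distance from the optical axis
    if rho == 0.0:
        return (cx, cy)             # on-axis point (z > 0) lands on the principal point
    theta = math.atan2(rho, z)      # incidence angle, valid even for z <= 0
    r = f * theta                   # radial distance in the image
    return (cx + r * x / rho, cy + r * y / rho)
```

Under this assumed model a point 90 degrees off-axis, such as `(1.0, 0.0, 0.0)`, still projects to a finite pixel, whereas the pinhole model rejects it. This is one way to see why undistorting to a pinhole image discards or stretches the periphery.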

2603.28730 2026-05-27 cs.RO cs.CL cs.CV (version update)

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

Yuanming Cao, Chengqi Li, Wenbo He

Affiliations: McMaster University

AI summary: Proposes the LDP-Slicing framework, which decomposes pixel values into binary bit-planes and applies a local differential privacy mechanism to them, combined with a perceptual obfuscation module and a privacy budget allocation strategy, satisfying rigorous pixel-level ε-LDP while keeping images highly useful for downstream tasks.

Abstract

Local Differential Privacy (LDP) is the gold standard trust model for privacy-preserving machine learning because it guarantees privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP, but stems from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
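
The bit-plane idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the bit-level mechanism, so binary randomized response is assumed here, the per-plane budgets in `eps_per_plane` are hypothetical values, and the perceptual obfuscation and budget optimization steps are omitted.

```python
import math
import random

def to_bit_planes(pixels, bits=8):
    """planes[b][i] is bit b (0 = least significant) of pixel i."""
    return [[(p >> b) & 1 for p in pixels] for b in range(bits)]

def from_bit_planes(planes):
    """Recombine bit-planes into integer pixel values."""
    count = len(planes[0])
    return [sum(planes[b][i] << b for b in range(len(planes))) for i in range(count)]

def randomized_response(bit, eps, rng):
    """Keep the bit with probability e^eps / (e^eps + 1), else flip it (eps-LDP for one bit)."""
    keep_probability = math.exp(eps) / (math.exp(eps) + 1.0)
    return bit if rng.random() < keep_probability else 1 - bit

def privatize(pixels, eps_per_plane, rng):
    """Perturb every bit-plane independently with its own budget (assumed mechanism)."""
    planes = to_bit_planes(pixels, bits=len(eps_per_plane))
    noisy = [[randomized_response(bit, eps, rng) for bit in plane]
             for plane, eps in zip(planes, eps_per_plane)]
    return from_bit_planes(noisy)

# Hypothetical allocation, listed from least to most significant bit.
eps_per_plane = [0.1, 0.1, 0.2, 0.2, 0.5, 1.0, 2.0, 3.0]
noisy_pixels = privatize([12, 200, 255, 0], eps_per_plane, random.Random(0))
```

By sequential composition, the per-pixel budget in this sketch is the sum of the per-plane budgets, which is why an allocation strategy across planes matters: spending more on the high-order bits preserves coarse structure while the low-order bits absorb most of the noise.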

2602.19206 2026-05-27 cs.CV (version update)

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

Zehao Deng, An Liu, Yan Wang

Affiliations: School of Computer Science and Technology, Soochow University; Institute for AI Industry Research (AIR), Tsinghua University

AI summary: Proposes the GS-CLIP framework, which uses geometry-aware prompts and synergistic view representation learning to detect geometric anomalies in 3D point clouds in the zero-shot setting.

Comments: Accepted by CVPR 2026

Abstract

Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. Current methods adapt CLIP by projecting 3D point clouds into 2D representations, but they face two challenges: the projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code is available at https://github.com/zhushengxinyue/GS-CLIP.

2602.21636 2026-05-27 cs.CV (version update)

Axial-Centric Cross-Plane Attention for 3D Medical Image Classification

Doyoung Park, Jinsoo Kim, Lohendran Baskaran

Affiliations: National Heart Centre Singapore, Singapore; CVS.AI, National Heart Research Institute of Singapore, Singapore; Independent Researcher, Republic of Korea

AI summary: Proposes an axial-centric cross-plane attention architecture that asymmetrically models dependencies between anatomical planes and outperforms existing 3D and multi-plane models on the MedMNIST3D benchmark.

Comments: Submitted to BMVC 2026

Abstract

Abridged: Clinicians commonly interpret 3D medical images by examining multiple anatomical planes rather than relying on volumetric views. In clinical CT workflows, the axial plane often serves as the primary diagnostic reference, while the auxiliary planes provide complementary spatial context. However, many existing 3D deep learning approaches either process volumetric data holistically or assign equal importance to all planes, failing to reflect this asymmetric, axial-centric interpretation strategy. To address this, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that models asymmetric dependencies between anatomical planes. The architecture employs MedDINOv3, pretrained on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information, while axial-centric cross-plane transformer encoders selectively condition axial representations on complementary auxiliary representations. Experiments on six datasets from the MedMNIST3D benchmark show that the proposed method consistently outperforms existing 3D and multi-plane models in ACC and AUC. A lightweight variant, AC-Tiny, achieves competitive performance with substantially fewer trainable parameters, suggesting that architectural design contributes more to performance gains than increased model scale. Ablation studies further validate the importance of axial-centric querying, QKV allocation, directional cross-plane fusion, residual-free cross-attention, and classification head design. Slice-level Grad-CAM visualizations demonstrate that the model identifies diagnostically relevant regions across all planes. These findings highlight the value of aligning architectural design with clinical interpretation workflows for robust 3D medical image analysis.

2602.18907 2026-05-27 cs.LG cs.CV cs.CY (version update)

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

Yangchen Zeng, Zhenyu Yu, Zhiyuan Hu, Wenxin Zhang, Jinze Wang, Rongfeng Guo

Affiliations: Southeast University

AI summary: Proposes the DeepInterestGR framework, which uses multi-LLM interest mining, reward-labeled deep interests, and interest-enhanced item discretization to address the shallow-interest problem in generative recommendation, improving recommendation performance on three Amazon datasets.

Abstract

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem - existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
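
For readers unfamiliar with the reported metrics, the following Python sketch shows how HR@10, NDCG@10, and a relative improvement are typically computed. It assumes a single held-out target item per user, a common protocol in sequential recommendation that the abstract does not confirm; the function names are hypothetical.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k ranking, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """With one relevant item, NDCG@k reduces to 1 / log2(rank + 1) for a 1-indexed rank."""
    for index, item in enumerate(ranked_items[:k]):
        if item == target:
            return 1.0 / math.log2(index + 2)
    return 0.0

def relative_improvement(new_score, baseline_score):
    """Relative gain over a baseline, e.g. 0.083 means +8.3%."""
    return (new_score - baseline_score) / baseline_score
```

Both metrics are averaged over users. A relative improvement of 5.8% therefore means the averaged score is 1.058 times the strongest baseline's score, not a gain of 5.8 absolute percentage points.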

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG (version update)

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

Affiliations: University of Michigan, Ann Arbor, MI, USA; Washington University in St. Louis, St. Louis, MO, USA

AI summary: Proposes a unified geospatial discovery framework that combines active learning, online meta-learning, and concept-guided reasoning, using concept-weighted uncertainty sampling and relevance-aware meta-batch formation to efficiently discover hidden targets under limited data and dynamic environments.

Abstract

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.

2510.03352 2026-05-27 cs.CV cs.AI cs.LG (version update)

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

Affiliations: Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes a plug-and-play, training-free inference-time search framework that incorporates side information into existing diffusion-based inverse problem solvers, consistently improving reconstruction quality.

Abstract

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG (version update)

Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

Affiliations: Show Lab, National University of Singapore; Research (A STAR), Singapore

AI summary: Proposes the SeqΔ-REPA alignment objective, which anchors latent actions to temporal feature differences from a frozen self-supervised video encoder, enabling pretraining of action-controllable world models with transferable actions from unlabeled video.

Comments: ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

Abstract

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

2602.09878 2026-05-27 cs.CV (version update)

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

Affiliations: MMLab, The Chinese University of Hong Kong, Hong Kong SAR; The Hong Kong University of Science; The University of Hong Kong; Tsinghua University

AI summary: Proposes a world-model-based 4D scene generation method that uses multi-view RGBD prediction and test-time action optimization to achieve geometrically consistent 4D dynamics prediction and robotic manipulation.

Journal ref: International Conference on Machine Learning 2026

Abstract

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourages consistency between RGB and depth and enforces geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

2511.16449 2026-05-27 cs.CV cs.AI (version update)

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

Affiliations: School of AI, Shanghai Jiao Tong University; University of Science and Technology of China; Harbin Institute of Technology (Shenzhen); BAAI

AI summary: Proposes VLA-Pruner, which estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance and applies a Combine-then-Filter strategy, achieving up to 1.99x speedup while maintaining manipulation quality.

Abstract

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.
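
The dual-criterion selection described above can be illustrated with a small Python sketch. Everything specific here is an assumption: the abstract names temporal smoothing and a Combine-then-Filter strategy but gives no formulas, so this version uses an exponential moving average for smoothing, a union of per-criterion top-k sets for the combine step, and greedy removal of the most redundant token (by cosine similarity) for the filter step. All function names are hypothetical.

```python
import math

def ema_update(previous, current, decay=0.8):
    """Smooth per-token action relevance across timesteps (assumed EMA form)."""
    return [decay * p + (1.0 - decay) * c for p, c in zip(previous, current)]

def top_k_indices(scores, k):
    """Indices of the k highest-scoring tokens."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0

def combine_then_filter(semantic, action, features, budget):
    """Keep at most `budget` tokens that matter semantically or for action decoding."""
    # Combine: union of the top-`budget` tokens under each criterion.
    kept = sorted(set(top_k_indices(semantic, budget)) | set(top_k_indices(action, budget)))

    # Filter: repeatedly drop the most redundant token, breaking ties by lower importance.
    while len(kept) > budget:
        def drop_priority(i):
            redundancy = max(cosine(features[i], features[j]) for j in kept if j != i)
            importance = max(semantic[i], action[i])
            return (redundancy, -importance)
        kept.remove(max(kept, key=drop_priority))
    return kept
```

The point of combining before filtering in this sketch is that a token scoring low on prefill semantics but high on action relevance still enters the candidate pool, so it is not lost to a purely semantic ranking.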

2511.06625 2026-05-27 cs.CV cs.AI cs.LG (version update)

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

Affiliations: Department of Computer Science, Emory University; Department of Computer Science, Johns Hopkins University; Department of Radiation Oncology; Winship Cancer Institute, Emory University

AI summary: Proposes an explainable cross-disease reasoning framework that extracts pulmonary findings, reasons about cross-organ mechanisms grounded in medical knowledge, and combines cardiac subvolume features to assess cardiovascular risk from low-dose chest CT, reaching an AUC of 0.919 on the NLST cohort.

Abstract

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

2507.13428 2026-05-27 cs.CV cs.AI (version update)

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

Affiliations: University of California, Santa Cruz; NVIDIA Research; Northeastern University; University of California, Santa Barbara

AI summary: Proposes the PhyWorldBench benchmark, which evaluates how well 12 video generation models follow physical laws using 1,050 prompts, introduces an Anti-Physics category, and uses multimodal large language models for zero-shot evaluation.

Comments: 35 pages, 21 figures

Journal ref: ICLR 2026 oral

Abstract

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

2512.06609 2026-05-27 cs.LG cs.CV (version update)

Training-Free Vector Quantization via Gaussian VAEs

Tongda Xu, Wendi Zheng, Jiajun He, Jose Miguel Hernandez-Lobato, Yan Wang, Ya-Qin Zhang, Jie Tang

Affiliations: AIR, Tsinghua University; CST, Tsinghua University; University of Cambridge

AI summary: Proposes Gaussian Quant (GQ), which trains a Gaussian VAE under constraints and converts it directly into a VQ-VAE without additional training, outperforming existing VQ-VAEs on UNet and ViT architectures.

Abstract

Vector-quantized variational autoencoders (VQ-VAEs) are discrete autoencoders that compress images into discrete tokens. However, they are difficult to train due to discretization. In this paper, we propose a simple yet effective technique dubbed Gaussian Quant (GQ), which first trains a Gaussian VAE under certain constraints and then converts it into a VQ-VAE without additional training. For conversion, GQ generates random Gaussian noise as a codebook and finds the closest noise vector to the posterior mean. Theoretically, we prove that when the logarithm of the codebook size exceeds the bits-back coding rate of the Gaussian VAE, a small quantization error is guaranteed. Practically, we propose a heuristic to train Gaussian VAEs for effective conversion, named the target divergence constraint (TDC). Empirically, we show that GQ outperforms previous VQ-VAEs, such as VQGAN, FSQ, LFQ, and BSQ, on both UNet and ViT architectures. Furthermore, TDC also improves previous Gaussian VAE discretization methods, such as TokenBridge. The source code is provided in https://github.com/tongdaxu/VQ-VAE-from-Gaussian-VAE.
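
The conversion step (a random Gaussian codebook plus a nearest-neighbour lookup of the posterior mean) can be sketched in Python as follows. This is a minimal illustration, not the released implementation: the abstract does not say whether lookup is done over whole latent vectors or smaller groups of dimensions, so whole-vector lookup with squared Euclidean distance is assumed, and the function names are hypothetical.

```python
import random

def make_codebook(size, dim, seed=0):
    """Codebook of `size` vectors drawn i.i.d. from a standard Gaussian; no training involved."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]

def quantize(posterior_mean, codebook):
    """Return (index, code vector, squared error) of the code nearest to the posterior mean."""
    def squared_distance(i):
        return sum((m - c) ** 2 for m, c in zip(posterior_mean, codebook[i]))
    best = min(range(len(codebook)), key=squared_distance)
    return best, codebook[best], squared_distance(best)

# The discrete token is the index; a decoder would receive the code vector.
codebook = make_codebook(size=256, dim=4, seed=0)
token, code, error = quantize([0.3, -1.2, 0.5, 0.0], codebook)
```

Because the codebook is regenerated from a seed rather than learned, enlarging it can only keep or reduce the nearest-neighbour error for a given mean. This matches the intuition behind the stated condition relating the logarithm of the codebook size to the coding rate.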

2511.16870 2026-05-27 cs.CV cs.LG (version update)

Drive-P2D: A Progressive Perception-to-Decision Benchmark for Vision-Language Models in Autonomous Driving

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

Affiliations: Zhejiang University; The University of Hong Kong

AI summary: Proposes the Drive-P2D benchmark, which uses a separated reasoning-and-answer protocol to evaluate the perception-to-decision capability of vision-language models at the Object, Scene, and Decision levels and to analyze error modes.

Abstract

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks often evaluate perception and decision-making separately, limit failure analysis with choice-only formats, or introduce evaluation bias through LLM-scored long-form outputs. To address these issues, we present Drive-P2D, a progressive perception-to-decision benchmark with 6,650 questions across Object, Scene, and Decision levels. Drive-P2D adopts a separated reasoning-and-answer protocol: final answers are scored objectively, while reasoning is analyzed to identify error modes exposed along the progressive perception-to-decision chain. We evaluate mainstream VLMs across all and high-risk scenarios, and further characterize the perception-to-decision capability boundary through correlation analysis and similar-scene robustness testing. Reasoning further exposes failure modes such as logical reasoning errors and semantic feature omissions, and we train a lightweight analyzer model to automate large-scale error-mode annotation of reasoning. Together, these designs provide practical insights for building safer and more reliable VLMs for real-world autonomous driving.

2601.12809 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

CLIP风格视觉语言模型在合成空间关系数据训练中的左右对称性破缺

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

发表机构 * InfoTech, Toyota Motor Corporation（丰田汽车公司信息科技部门）

AI总结通过可控一维图像文本测试平台，研究基于Transformer的视觉语言编码器在CLIP风格对比学习下如何通过位置与标记嵌入交互产生左右关系理解，并发现标签多样性比布局多样性更关键。

Comments Accepted at ICML 2026

详情

AI中文摘要

空间理解仍然是视觉语言模型中的一个关键挑战。然而，这种理解是否真正获得，如果是，通过什么机制，目前尚不清楚。我们提出了一个可控的一维图像文本测试平台，以探究在基于Transformer的视觉和文本编码器中，使用CLIP风格的对比目标训练时，左右关系理解是如何出现的。我们在单对象和双对象场景的配对描述上端到端地训练轻量级基于Transformer的视觉和文本编码器，并评估对未见对象对的泛化能力，同时系统性地改变标签和布局多样性。我们发现对比训练学习了左右关系，并且标签多样性（而非布局多样性）是这种情况下泛化的主要驱动因素。为了获得机制性理解，我们进行了注意力分解，并表明位置嵌入和标记嵌入之间的相互作用导致了水平注意力梯度，从而打破了编码器中的左右对称性；消除这一贡献会显著降低左右辨别能力。我们的结果提供了关于CLIP风格模型何时以及如何获得关系能力的机制性见解。

英文摘要

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain a mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.
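For readers unfamiliar with the CLIP-style contrastive objective mentioned above, the following is a minimal pure-Python sketch of the standard symmetric InfoNCE loss over a batch of paired image and text embeddings. The temperature value and function names are assumptions for illustration; the paper's encoders and hyperparameters are not reproduced here.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_loss(image_emb: List[Vector], text_emb: List[Vector],
              temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: image-to-text plus text-to-image."""
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
```

In this setup, a caption such as "A left of B" and its mirrored counterpart only separate in the loss if the two encoders place them at different points in the shared space, which is why the objective can drive left-right discrimination at all.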

2511.02360 2026-05-27 cs.CV cs.CL 版本更新

LaRe: Latent Refocusing for Multimodal Reasoning

LaRe: 用于多模态推理的潜在重聚焦

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络空间安全学院）

AI总结提出LaRe范式，在潜在空间内进行视觉重聚焦，结合语义增强训练，在提升推理准确率的同时大幅减少推理所需token数。

详情

AI中文摘要

思维链推理通过分解复杂任务提升逻辑性能，但其多模态扩展面临权衡。主流的“用图像思考”范式通过显式裁剪图像区域实现视觉重聚焦，但导致计算开销快速增长。新兴的潜在空间推理范式减少了token消耗，但缺乏动态重聚焦能力。我们认为这种权衡源于一个默认前提：有效的视觉重聚焦必须以显式token的形式发生。基于此，我们提出潜在重聚焦（LaRe），一种新的多模态推理范式，其中视觉重聚焦完全在潜在空间内进行。我们进一步设计了一种语义增强训练策略，通过视觉重建目标确保潜在空间的语义结构。实验评估表明，与现有基线相比，LaRe将平均准确率提高了7.6%，同时将推理所需的token数量减少了59.7%。当扩展到8B参数的视觉语言模型骨干时，LaRe实现了与最先进方法相当的性能，证明了我们提出的潜在重聚焦范式在多模态推理中的有效性。

英文摘要

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, yet incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

2601.08375 2026-05-27 cs.CV 版本更新

Source-Free Domain Adaptation for Geospatial Point Cloud Semantic Segmentation

地理空间点云语义分割的无源域适应

Yuan Gao, Di Cao, Xiaohuan Xi, Sheng Nie, Shaobo Xia, Cheng Wang

发表机构 * Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院空天信息创新研究院）； International Research Center of Big Data for Sustainable Development Goals（可持续发展大数据国际研究中心）； University of Chinese Academy of Sciences（中国科学院大学）； Zhengzhou Institute for Advanced Research of Henan Polytechnic University（河南理工大学郑州高等研究院）； Henan Polytechnic University（河南理工大学）； School of Aeronautic Engineering, Changsha University of Science and Technology（长沙理工大学航空工程学院）； China University of Geosciences, Beijing（中国地质大学（北京））

AI总结提出LoGo无源域适应框架，通过局部类平衡原型估计和全局最优传输分布对齐，解决地理空间点云语义分割中的域偏移问题。

详情

AI中文摘要

3D地理空间点云的语义分割是遥感应用的基础，但由区域和采集相关变化引起的域偏移通常会降低模型性能。尽管域适应可以缓解这种偏移，但现有方法通常需要访问源域数据，由于隐私问题和监管政策，这往往不可行。为了解决这个问题，我们提出了LoGo（局部-全局双共识），一种新颖的无源无监督域适应（SFUDA）框架，仅需要预训练模型和无标签目标数据。在局部层面，我们引入了一个类平衡原型估计模块，确保即使对于样本稀缺的尾部类别也能生成鲁棒的特征原型，有效缓解长尾分布引起的特征崩溃。在全局层面，我们引入了一个基于最优传输的全局分布对齐模块，将伪标签分配公式化为全局优化问题，有效纠正局部贪婪分配中头部类别的过度主导，从而防止模型预测严重偏向多数类别。最后，我们提出了一种双一致性伪标签过滤机制，仅保留局部多增强集成预测与全局最优传输分配一致的高置信度伪标签用于自训练。在两个具有挑战性的基准测试（包括跨场景和跨传感器设置）上的大量实验表明，LoGo始终优于现有的最先进方法。源代码可在 https://github.com/GYproject/LoGo-SFUDA 获取。

英文摘要

Semantic segmentation of 3D geospatial point clouds is fundamental to remote sensing applications, yet domain shifts caused by regional and acquisition-related variations often degrade model performance. Although domain adaptation can mitigate such shifts, existing methods typically require access to source-domain data, which is often infeasible due to privacy concerns and regulatory policies. To address this, we propose LoGo (Local-Global Dual-Consensus), a novel source-free unsupervised domain adaptation (SFUDA) framework requiring only a pretrained model and unlabeled target data. At the local level, we introduce a class-balanced prototype estimation module that ensures that robust feature prototypes can be generated even for sample-scarce tail classes, effectively mitigating the feature collapse caused by long-tailed distributions. At the global level, we introduce an optimal transport-based global distribution alignment module that formulates pseudo-label assignment as a global optimization problem, effectively correcting the over-dominance of head classes inherent in local greedy assignments, and thereby preventing model predictions from being severely biased towards majority classes. Finally, we propose a dual-consistency pseudo-label filtering mechanism that retains only high-confidence pseudo-labels where local multi-augmented ensemble predictions align with global optimal transport assignments for self-training. Extensive experiments on two challenging benchmarks, encompassing cross-scene and cross-sensor settings, demonstrate that LoGo consistently outperforms existing state-of-the-art methods. The source code is available at https://github.com/GYproject/LoGo-SFUDA.
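The contrast between local greedy assignment and optimal-transport-based global assignment can be made concrete with a small sketch. The code below uses plain Sinkhorn scaling (entropic optimal transport with cost equal to the negative log-probability) and a uniform class prior; both choices, and all function names, are assumptions for illustration rather than the formulation used in LoGo.

```python
from typing import List

Matrix = List[List[float]]


def sinkhorn_assign(probs: Matrix, class_prior: List[float],
                    n_iters: int = 50) -> Matrix:
    """Rescale point-by-class probabilities so classes match a target prior.

    Rows are points, columns are classes. All entries must be positive.
    """
    n, k = len(probs), len(class_prior)
    q = [row[:] for row in probs]
    for _ in range(n_iters):
        # Scale columns so each class receives its target share of points.
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        for i in range(n):
            for j in range(k):
                q[i][j] *= class_prior[j] * n / col_sums[j]
        # Scale rows so every point distributes exactly one unit of mass.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [value / row_sum for value in q[i]]
    return q


def argmax_labels(matrix: Matrix) -> List[int]:
    """Greedy per-point label: index of the largest entry in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


def dual_consistent(local_labels: List[int],
                    global_labels: List[int]) -> List[int]:
    """Indices of points whose local and global pseudo-labels agree."""
    return [i for i, (a, b) in enumerate(zip(local_labels, global_labels))
            if a == b]


# Toy example: a head class dominates every greedy prediction.
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
greedy = argmax_labels(probs)
balanced = argmax_labels(sinkhorn_assign(probs, [0.5, 0.5]))
kept = dual_consistent(greedy, balanced)
```

The final filter mirrors the dual-consistency idea in spirit only: a pseudo-label is retained when the local prediction and the globally balanced assignment agree.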

2601.07737 2026-05-27 cs.CV cs.AI 版本更新

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

看见 vs. 相信：评估开源多模态大模型在反直觉场景中的语言偏见

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

发表机构 * Zhejiang University（浙江大学）； Beijing University of Posts and Telecommunications（北京邮电大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结为评估多模态大模型处理反直觉动作场景的能力，提出CAIT基准（400个高保真合成场景），发现开源模型因语言先验而忽视视觉证据，性能接近随机水平，而链式思维推理虽提升准确率但导致过度思考拒绝视觉内容，通过微调和结构化提示可缓解此偏见。

详情

AI中文摘要

多模态大语言模型（MLLMs）在主流视觉理解任务中表现出色，但其处理违背日常常识的动作场景的能力尚未得到充分测试。为填补这一空白，我们引入了CAIT，一个包含400个高保真合成场景的基准，专注于反直觉的视觉动作，例如“兔子在追老虎”，其中视觉证据明确违背常识预期。我们评估了人类、领先的专有模型（如Claude和Gemini）以及14个代表性的开源MLLMs。人类达到近乎完美的性能（约0.95准确率），专有模型表现出稳健的理解（达到0.88准确率），而标准的开源指令微调模型性能处于随机水平。进一步分析表明，这种失败是由强烈的语言先验驱动的：模型不信任视觉输入，而是自动用统计上常见的文本描述覆盖异常的视觉信号。尽管引入链式思维推理机制可以提高准确率，但会显著减慢响应速度并产生新的失败模式：模型过度思考场景，仅仅因为违反现实物理定律而拒绝接受实际的视觉内容。最后，我们证明有针对性的微调和结构化提示可以有效缓解这种对语言先验的依赖，使开源模型能够基于实际视觉证据准确地进行推理。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as "a rabbit is chasing a tiger", where visual evidence explicitly contradicts common-sense expectations. We evaluate humans, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. Humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), whereas standard open-source instruction-tuned models perform at chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, these models automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

2601.05729 2026-05-27 cs.CV 版本更新

TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

TAGRPO: 通过直接轨迹对齐提升图像到视频生成中的GRPO

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

发表机构 * The University of Hong Kong（香港大学）； Tencent Hunyuan（腾讯混元）

AI总结针对图像到视频生成中GRPO优化效果不佳的问题，提出基于对比学习的TAGRPO框架，通过中间潜变量对齐高奖励轨迹并远离低奖励轨迹，结合记忆库提升多样性，显著优于DanceGRPO。

Comments 18 pages, 12 figures

详情

AI中文摘要

近期研究表明，将组相对策略优化（GRPO）集成到流匹配模型中，特别是在文本到图像和文本到视频生成中，具有显著效果。然而，我们发现将这些技术直接应用于图像到视频（I2V）模型往往无法带来一致的奖励提升。为解决这一局限，我们提出了TAGRPO，一个受对比学习启发的鲁棒后训练框架，适用于I2V模型。我们的方法基于以下观察：从相同初始噪声生成的rollout视频为优化提供了更优的指导。基于这一洞察，我们提出了一种应用于中间潜变量的新型GRPO损失，鼓励直接对齐高奖励轨迹，同时最大化与低奖励轨迹的距离。此外，我们引入了一个用于rollout视频的记忆库，以增强多样性并降低计算开销。尽管方法简单，TAGRPO在I2V生成中相比DanceGRPO取得了显著改进。相关成果将在 https://tagrpo.github.io/ 更新。

英文摘要

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation. The deliverables will be updated at https://tagrpo.github.io/.
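As background for the GRPO-based loss above, the sketch below shows the group-relative advantage normalization that GRPO is generally built on, followed by a split of a rollout group into high-reward and low-reward members. The split rule and the function names are assumptions for illustration; the TAGRPO alignment loss on intermediate latents is not reproduced here.

```python
import statistics
from typing import List, Tuple


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one rollout group (zero mean, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def split_by_advantage(advantages: List[float]) -> Tuple[List[int], List[int]]:
    """Return indices of above-average and below-average rollouts.

    Hypothetical rule: positive advantage marks a trajectory to align with,
    negative advantage marks one to move away from.
    """
    high = [i for i, a in enumerate(advantages) if a > 0]
    low = [i for i, a in enumerate(advantages) if a < 0]
    return high, low
```

Because the advantages are computed relative to the group, the comparison is most meaningful when the rollouts in a group differ only in their sampling path, which is consistent with the abstract's observation about sharing the initial noise.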

2601.01608 2026-05-27 cs.CV 版本更新

Guiding Token-Sparse Diffusion Models

引导令牌稀疏扩散模型

Felix Krause, Stefan Andreas Baumann, Johannes Schusterbauer, Olga Grebenkova, Ming Gui, Vincent Tao Hu, Björn Ommer

发表机构 * CompVis

AI总结针对稀疏训练扩散模型在推理时对无分类器引导响应不足的问题，提出令牌级稀疏引导方法，在保持输出高质量和高方差的同时降低计算成本。

详情

AI中文摘要

扩散模型在图像合成中质量高，但训练和推理成本昂贵。近期工作利用视觉内容固有的冗余性，仅对视觉信息子集进行训练以降低训练成本。虽然这些方法成功实现了更便宜且更有效的训练，但稀疏训练的扩散模型在推理时表现不佳，原因是它们对无分类器引导（CFG）响应不足。为解决此问题，我们提出稀疏引导（SG）。SG不使用条件丢弃作为引导扩散模型的信号，而是使用令牌级稀疏性。因此，SG更好地保留了条件预测的高方差，实现了高质量和高方差输出。在推理时利用令牌级稀疏性，SG以更低的计算量提高了保真度，在常用的ImageNet-256基准上以25%更少的FLOPs实现了1.58 FID，并在匹配基线质量时节省高达58%的FLOPs。为证明稀疏引导的有效性，我们使用训练时稀疏性训练了一个2.5B文本到图像扩散模型，并在推理时利用SG。SG在提高吞吐量的同时，在构图和人类偏好评分上取得了改进。

英文摘要

Diffusion models deliver high quality in image synthesis but remain expensive during training and inference. Recent works have leveraged the inherent redundancy in visual content to make training more affordable by training only on a subset of visual information. While these methods were successful in providing cheaper and more effective training, sparsely trained diffusion models struggle at inference. This is due to their weak response to classifier-free guidance (CFG), which leads to underwhelming performance during inference. To overcome this, we propose Sparse Guidance (SG). Instead of using conditional dropout as a signal to guide diffusion models, SG uses token-level sparsity. As a result, SG better preserves the high variance of the conditional prediction, achieving high-quality and high-variance outputs. Leveraging token-level sparsity at inference, SG improves fidelity at lower compute, achieving 1.58 FID on the commonly used ImageNet-256 benchmark with 25% fewer FLOPs, and yields up to 58% FLOP savings at matched baseline quality. To demonstrate the effectiveness of Sparse Guidance, we train a 2.5B text-to-image diffusion model using training-time sparsity and leverage SG during inference. SG achieves improvements in composition and human preference score while increasing throughput at the same time.
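The guidance mechanism above can be illustrated with the standard classifier-free guidance extrapolation, in which a strong prediction is pushed away from a weaker one. In the sketch below, the weaker prediction is assumed to come from a forward pass over a random subset of tokens rather than from dropping the condition; that reading of Sparse Guidance, the keep ratio, and all function names are assumptions, since the abstract does not give the exact formula.

```python
import random
from typing import List


def sparse_token_indices(num_tokens: int, keep_ratio: float,
                         seed: int = 0) -> List[int]:
    """Pick a sorted random subset of token positions to keep."""
    rng = random.Random(seed)
    num_keep = max(1, round(num_tokens * keep_ratio))
    return sorted(rng.sample(range(num_tokens), num_keep))


def guided_prediction(strong_pred: List[float], weak_pred: List[float],
                      scale: float) -> List[float]:
    """Guidance extrapolation: weak + scale * (strong - weak).

    With scale = 1 this returns the strong prediction unchanged; larger
    scales move further away from the weak prediction.
    """
    return [w + scale * (s - w) for s, w in zip(strong_pred, weak_pred)]
```

In standard CFG the weak branch is the unconditional prediction. Under the assumption made here, the weak branch would instead be evaluated on the positions returned by `sparse_token_indices`, which also makes that branch cheaper to compute.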

2512.22666 2026-05-27 cs.CV cs.LG 版本更新

INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

INTERACT-CMIL：用于结膜黑色素细胞上皮内病变分级的任务共享学习与任务间一致性

Mert Ikinci, Luna Toma, Karin U. Loeffler, Leticia Ussem, Daniela Süsskind, Julia M. Weller, Yousef Yeganeh, Martina C. Herwig-Carl, Shadi Albarqouni

发表机构 * Clinic for Diagnostic and Interventional Radiology, University Hospital Bonn, Germany（波恩大学医院诊断与介入放射科，德国）； Department of Ophthalmology, Friedrich-Alexander University Erlangen-Nürnberg, Germany（埃尔兰根-纽伦堡弗里德里希-亚历山大大学眼科，德国）； TUM School of Computation, Information and Technology, Technical University of Munich, Germany（慕尼黑工业大学计算、信息与技术学院，德国）； Munich Center for Machine Learning, Germany（慕尼黑机器学习中心，德国）； Helmholtz AI, Helmholtz Center Munich, Germany（亥姆霍兹人工智能，亥姆霍兹慕尼黑研究中心，德国）

AI总结提出INTERACT-CMIL多任务深度学习框架，通过共享特征学习、组合部分监督和任务间一致性损失联合预测五个组织病理学轴，在486张结膜活检图像数据集上相比CNN和基础模型实现最高55.1%的宏F1提升。

详情

DOI: 10.1109/ISBI61048.2026.11515389
Journal ref: IEEE ISBI 2026

AI中文摘要

结膜黑色素细胞上皮内病变（CMIL）的准确分级对于治疗和黑色素瘤预测至关重要，但由于细微的形态学线索和相互关联的诊断标准，仍然困难。我们提出INTERACT-CMIL，一个多头深度学习框架，通过共享特征学习与组合部分监督以及强制跨任务一致性的相互依赖损失，联合预测五个组织病理学轴：WHO4、WHO5、水平扩散、垂直扩散和细胞异型性。在来自三家大学医院的486个专家标注的结膜活检图像块组成的新整理多中心数据集上进行训练和评估，INTERACT-CMIL相比CNN和基础模型（FM）基线取得了一致的改进，相对宏F1增益最高达55.1%（WHO4）和25.0%（垂直扩散）。该框架提供与专家分级一致的连贯、可解释的多标准预测，为CMIL诊断提供了可重复的计算基准，并朝着标准化数字眼科病理学迈出了一步。

英文摘要

Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes (WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia) through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.
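The headline numbers above are relative macro F1 gains, so it helps to spell out how such a figure is computed. The sketch below implements macro F1 (the unweighted mean of per-class F1 scores) and a relative gain over a baseline; the function names and the toy labels in the usage note are illustrative and are not taken from the paper.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(new_score: float, baseline_score: float) -> float:
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score
```

For instance, `relative_gain(0.6, 0.4)` is 0.5, that is, a 50% relative gain from an absolute gain of 0.2. A large relative gain can therefore correspond to a modest absolute change when the baseline score is low.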

2512.19602 2026-05-27 cs.CV 版本更新

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

无数据？没问题：面向缺失值的鲁棒视觉-表格学习

Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

发表机构 * School of Computation, Information and Technology, Technical University of Munich, Germany（慕尼黑工业大学计算、信息与技术学院，德国）； Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany（亥姆霍兹慕尼黑研究中心生物医学成像机器学习研究所，德国）； School of Biomedical Engineering and Imaging Sciences, King’s College London, UK（伦敦国王学院生物医学工程与成像科学学院，英国）； Department of Diagnostic and Interventional Radiology, TUM University Hospital, Technical University of Munich, Germany（慕尼黑工业大学附属医院诊断与介入放射科，德国）； Munich Center for Machine Learning, Germany（慕尼黑机器学习中心，德国）

AI总结提出RoVTL框架，通过对比预训练中的表格属性缺失增强和下游任务中的Tabular More vs. Fewer损失，实现从0%到100%表格数据可用性下的鲁棒多模态学习。

详情

AI中文摘要

大规模医学生物样本库提供成像数据以及广泛的表格信息，如临床测量或人口统计数据。然而，这种丰富的表格属性并不反映现实世界的数据集，其中可能只有一部分属性可用。这种差异要求方法在推理时对缺失值保持鲁棒。为了解决这一挑战，我们提出了RoVTL（鲁棒视觉-表格学习），一个旨在处理任何级别表格数据可用性（从0%到100%）的框架。RoVTL包括两个关键阶段：对比预训练，其中我们将表格属性缺失作为数据增强引入以促进鲁棒性；以及下游任务微调，其中表格缺失通过一种新颖的Tabular More vs. Fewer损失来补充，该损失根据可用表格数据的数量对性能进行排序。结合门控交叉注意力融合模块，我们的微调方法在所有表格数据完整性场景下实现了一致的性能。我们在英国生物样本库（UK Biobank）的心脏MRI扫描上评估了RoVTL，证明了与先前方法相比对缺失表格数据的优越鲁棒性。此外，RoVTL成功泛化到外部心脏MRI数据集进行多模态疾病分类，并扩展到自然图像领域，在汽车广告数据集上实现了鲁棒性能。模型权重和代码可在 https://github.com/marteczkah/RoVTL 获取。

英文摘要

Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as clinical measurements or demographics. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that remain robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning, where tabular missingness is complemented by a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with a gated cross-attention fusion module, our tuning approach enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The model weights and code are available at https://github.com/marteczkah/RoVTL.
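The two ingredients described above, missingness as augmentation and a ranking term tied to the amount of available tabular data, can be sketched in a few lines. The masking scheme, the hinge form of the ranking term, the margin, and all names below are assumptions for illustration; the actual Tabular More vs. Fewer loss is defined in the paper and may differ.

```python
import random
from typing import Dict, Optional


def mask_attributes(row: Dict[str, float], missing_rate: float,
                    rng: random.Random) -> Dict[str, Optional[float]]:
    """Randomly replace tabular attributes with None to simulate missingness."""
    return {key: (None if rng.random() < missing_rate else value)
            for key, value in row.items()}


def more_vs_fewer_penalty(loss_more: float, loss_fewer: float,
                          margin: float = 0.0) -> float:
    """Hypothetical hinge-style ranking term.

    It is zero when the prediction made with MORE tabular attributes has a
    task loss at least `margin` lower than the one made with FEWER attributes,
    and grows linearly otherwise.
    """
    return max(0.0, loss_more - loss_fewer + margin)
```

A usage sketch: apply `mask_attributes` twice to the same sample with a low and a high missing rate, run the model on both views, and add `more_vs_fewer_penalty` of the two task losses to the training objective.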

2512.14140 2026-05-27 cs.CV 版本更新

SketchAssist: A Practical Assistant for Semantic Edits and Precise Local Redrawing

SketchAssist：一种用于语义编辑和精确局部重绘的实用助手

Han Zou, Yan Zhang, Ruiqi Yu, Cong Xie, Jie Huang, Zhenpeng Zhan

发表机构 * Global Business Unit, Baidu Inc.（百度公司全球业务部）

AI总结提出SketchAssist，一种结合指令引导编辑和线条引导区域重绘的交互式草图助手，通过可控数据生成管道和基于DiT的统一框架（集成任务引导的混合专家模块）实现高效、可控的草图操作，在语义和结构一致性上达到最先进性能。

详情

AI中文摘要

草图编辑需要同时处理高级语义变化和精确的局部重绘，这种组合对于稀疏且风格敏感的线条艺术尤其具有挑战性。与自然图像不同，草图依赖于最小的视觉线索，使得现有方法难以在保持整体一致性的同时协调全局语义修改与细粒度结构控制。我们提出了SketchAssist，一种交互式草图助手，它统一了指令引导编辑和线条引导区域重绘，在保持整体构图的同时实现高效且可控的草图操作。为了支持这一任务，我们引入了一个可控数据生成管道，该管道构建具有精确属性变化的结构化编辑序列，并在多步修改中保持结构对齐，同时通过保持风格的变换扩展风格多样性。基于这些数据，SketchAssist采用基于DiT的统一框架，使用多通道输入表示在单一接口内编码草图、掩码和引导信号。为了进一步处理不同的编辑模式，我们将任务引导的混合专家（T-MoE）集成到LoRA层中，实现对语义和结构引导的自适应控制。大量实验表明，在两个任务上都达到了最先进的性能，与最近的方法相比，实现了更强的指令遵循以及改进的结构和风格一致性。总之，我们的方法为草图编辑提供了一种实用且可控的解决方案。

英文摘要

Sketch editing requires jointly handling high-level semantic changes and precise local redrawing, a combination that is particularly challenging for sparse, style-sensitive line art. Unlike natural images, sketches rely on minimal visual cues, making it difficult for existing methods to reconcile global semantic modifications with fine-grained structural control while preserving overall coherence. We present SketchAssist, an interactive sketch assistant that unifies instruction-guided editing with line-guided region redrawing, enabling efficient and controllable sketch manipulation while preserving overall composition. To support this task, we introduce a controllable data generation pipeline that constructs structured edit sequences with precise attribute variations and maintains structural alignment across multi-step modifications, while expanding stylistic diversity via style-preserving transformations. Building on this data, SketchAssist adopts a unified framework based on DiT, using a multi-channel input representation to encode sketches, masks, and guidance signals within a single interface. To further handle different editing modes, we integrate a Task-guided Mixture-of-Experts (T-MoE) into LoRA layers, enabling adaptive control over semantic and structural guidance. Extensive experiments demonstrate state-of-the-art performance on both tasks, with stronger instruction adherence and improved structural and style consistency compared to recent methods. Overall, our method provides a practical and controllable solution for sketch editing.

2510.17790 2026-05-27 cs.CV cs.CL 版本更新

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

UltraCUA: 一种具有混合动作的计算机使用智能体基础模型

Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

发表机构 * Apple（苹果公司）； The University of Hong Kong（香港大学）

AI总结提出UltraCUA基础模型，通过混合动作（融合原始GUI操作与高级工具执行）克服计算机使用智能体仅依赖原始GUI动作的局限性，采用自动化管道、合成数据引擎、混合动作轨迹收集和两阶段训练方法，在OSWorld和WindowsAgentArena上分别实现22%的相对性能提升和21.7%的成功率。

详情

AI中文摘要

计算机使用智能体面临一个根本限制：它们仅依赖原始GUI动作（点击、键入、滚动），导致脆弱的执行链容易发生级联故障。虽然API驱动的智能体通过结构化接口和工具利用丰富的能力，但计算机使用智能体仍然局限于低层视觉交互。我们提出UltraCUA，一种通过混合动作（无缝统一原始GUI操作与高层工具执行）超越这一限制的基础模型。我们的创新基于四个关键进展。首先，一个自动化管道从软件文档和代码仓库中提取并扩展工具能力。其次，一个合成数据引擎生成超过17,000个可验证任务，捕捉真实世界的计算机使用复杂性。第三，全面的混合动作轨迹收集融合了GUI原语和策略性工具调用。第四，一种两阶段训练方法结合了监督微调和在线强化学习，实现了GUI与API之间的智能动作选择。对我们的7B和32B UltraCUA模型的评估揭示了变革性的性能提升。在OSWorld上，UltraCUA平均实现了22%的相对改进，同时执行速度比现有方法快11%。在WindowsAgentArena上的跨域验证展示了鲁棒的泛化能力，成功率达到21.7%，超过了在Windows上训练的基线。混合动作范式被证明至关重要，在减少错误传播的同时提高了执行效率。这项工作建立了一个可扩展的范式，桥接了原始GUI交互与高层工具智能，为多样环境和复杂现实任务提供了更具弹性和适应性的计算机使用智能体。

英文摘要

Computer-use agents face a fundamental limitation: they rely exclusively on primitive GUI actions (click, type, scroll), creating brittle execution chains prone to cascading failures. While API-driven agents harness rich capabilities through structured interfaces and tools, computer-use agents remain constrained to low-level visual interactions. We present UltraCUA, a foundation model that transcends this limitation through hybrid action, seamlessly unifying primitive GUI operations with high-level tool execution. Our innovation rests on four critical advances. First, an automated pipeline extracts and scales tool capabilities from software documentation and code repositories. Second, a synthetic data engine produces 17,000+ verifiable tasks capturing real-world computer-use complexity. Third, comprehensive hybrid action trajectory collection incorporates both GUI primitives and strategic tool calls. Fourth, a two-stage training methodology combines supervised fine-tuning with online reinforcement learning, enabling intelligent action selection between GUI and API. Evaluation with our 7B and 32B UltraCUA models reveals transformative performance gains. On OSWorld, UltraCUA achieves on average a 22% relative improvement while executing 11% faster than existing approaches. Cross-domain validation on WindowsAgentArena demonstrates robust generalization with a 21.7% success rate, surpassing Windows-trained baselines. The hybrid action paradigm proves essential, reducing error propagation while improving execution efficiency. This work establishes a scalable paradigm bridging primitive GUI interactions and high-level tool intelligence, enabling more resilient and adaptable computer-use agents for diverse environments and complex real-world tasks.
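The idea of a hybrid action space, where one policy can emit either a primitive GUI action or a high-level tool call, can be illustrated with a small dispatcher. Everything below (the `GuiAction` and `ToolCall` types, the `execute` function, and the `rename_file` tool) is a hypothetical sketch and does not reflect UltraCUA's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class GuiAction:
    kind: str       # primitive action type: "click", "type", or "scroll"
    argument: str   # e.g. screen coordinates or the text to type


@dataclass
class ToolCall:
    name: str                  # name of a registered high-level tool
    arguments: Dict[str, str]  # keyword arguments passed to the tool


Action = Union[GuiAction, ToolCall]


def execute(action: Action, tools: Dict[str, Callable[..., str]],
            log: List[str]) -> None:
    """Route an action to the tool registry or to the GUI backend (stubbed)."""
    if isinstance(action, ToolCall):
        if action.name not in tools:
            raise KeyError(f"unknown tool: {action.name}")
        log.append(tools[action.name](**action.arguments))
    else:
        # A real agent would drive the GUI here; this stub only records it.
        log.append(f"gui:{action.kind}({action.argument})")


# Hypothetical tool: one call replaces a multi-step chain of clicks and typing.
tools = {"rename_file": lambda src, dst: f"renamed {src}->{dst}"}
log: List[str] = []
execute(ToolCall("rename_file", {"src": "a.txt", "dst": "b.txt"}), tools, log)
execute(GuiAction("click", "120,48"), tools, log)
```

The design point is that a single tool call can stand in for several fragile GUI steps, which is one plausible reason a hybrid space would reduce error propagation.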

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices Inc.（超威半导体公司，AMD）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出 Athena-PRM，一种多模态过程奖励模型，通过利用弱和强完成者之间的预测一致性高效生成高质量过程标签，在仅5000样本下显著提升复杂推理问题的逐步评估性能。

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

详情

AI中文摘要

我们提出了 Athena-PRM，一种多模态过程奖励模型（PRM），旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入，主要因为需要推理步骤的逐步标注。传统的自动标注方法，如蒙特卡洛估计，通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据，我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是，Athena-PRM 在仅5000个样本的情况下，在各种场景和基准测试中展现出卓越的效果。此外，我们还开发了两种有效策略来提升PRM的性能：ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法：测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是，当使用 Qwen2.5-VL-7B 作为策略模型时，Athena-PRM 在 WeMath 上提升了10.2个百分点，在 MathVista 上提升了7.1个百分点（测试时扩展）。此外，Athena-PRM 在 VisualProcessBench 上取得了最先进（SoTA）结果，比之前的 SoTA 高出3.9个F1分数，展示了其准确评估推理步骤正确性的强大能力。另外，利用 Athena-PRM 作为奖励模型，我们通过奖励排序微调开发了 Athena-7B，在五个基准测试上以显著优势超越了基线。

英文摘要

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the need for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling of negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward-ranked fine-tuning. Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets a new state of the art (SoTA) on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward-ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
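The labeling criterion above, keeping a step label only when a weak and a strong completer agree on it, can be sketched as follows. The hard Monte Carlo rule used here (a step counts as correct if any completion from it reaches the right final answer) and the function names are assumptions for illustration; the paper's exact estimator may differ.

```python
from typing import Dict, List


def mc_step_label(rollout_outcomes: List[bool]) -> int:
    """Hard Monte Carlo label for one step.

    Each entry says whether one completion sampled from this step reached
    the correct final answer. The step is labeled 1 if any completion did.
    """
    return int(any(rollout_outcomes))


def consistent_step_labels(weak_rollouts: List[List[bool]],
                           strong_rollouts: List[List[bool]]) -> Dict[int, int]:
    """Keep only the steps on which the weak and strong completers agree."""
    kept: Dict[int, int] = {}
    for index, (weak, strong) in enumerate(zip(weak_rollouts, strong_rollouts)):
        weak_label = mc_step_label(weak)
        strong_label = mc_step_label(strong)
        if weak_label == strong_label:
            kept[index] = weak_label
    return kept
```

The intuition is that a strong completer can sometimes recover from a flawed step, and a weak completer can fail after a correct one, so either estimate alone is noisy. Steps on which both agree are less likely to be mislabeled, at the cost of discarding the disputed steps.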

2512.04085 2026-05-27 cs.CV 版本更新

Unique Lives, Shared World: Learning from Single-Life Videos

独特生活，共享世界：从单个人生视频中学习

Tengda Han, Sayna Ebrahimi, Dilara Gokay, Li Yang Ku, Maks Ovsjanikov, Iva Babukova, Daniel Zoran, Viorica Patraucean, Joao Carreira, Andrew Zisserman, Dima Damen

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结提出“单个人生”学习范式，利用单个人拍摄的自我中心视频通过多视角自监督学习视觉编码器，发现不同人生训练的模型具有高度对齐的几何理解，且学到的表示可泛化到下游任务，与大量网络数据性能相当。

详情

AI中文摘要

我们引入了“单个人生”学习范式，其中我们仅针对一个人拍摄的自我中心视频训练一个独特的视觉模型。我们利用单个人生中自然捕获的多个视角，以自监督方式学习视觉编码器。我们的实验展示了三个关键发现。首先，独立在不同人生上训练的模型发展出高度对齐的几何理解。我们通过在捕获不同人生（包括室内和室外）的不同数据集上训练视觉编码器，并引入一种新的基于交叉注意力的度量来量化不同模型发展的内部表示的功能对齐，来证明这一点。其次，我们展示了单个人生模型学习到可泛化的几何表示，这些表示能有效迁移到下游任务，如未见环境中的深度估计。第三，我们证明，对同一个人一周内最多30小时的数据进行训练，其性能与在30小时多样化网络数据上训练相当，突出了单个人生表示学习的优势。总体而言，我们的结果确立了世界的共享结构既导致了在个人人生上训练的模型的一致性，也为视觉表示学习提供了强大的信号。

英文摘要

We introduce the "single-life" learning paradigm, where we train a distinct vision model exclusively on egocentric videos captured by one individual. We leverage the multiple viewpoints naturally captured within a single life to learn a visual encoder in a self-supervised manner. Our experiments demonstrate three key findings. First, models trained independently on different lives develop a highly aligned geometric understanding. We demonstrate this by training visual encoders on distinct datasets, each capturing a different life, both indoors and outdoors, and by introducing a novel cross-attention-based metric to quantify the functional alignment of the internal representations developed by different models. Second, we show that single-life models learn generalizable geometric representations that effectively transfer to downstream tasks, such as depth estimation, in unseen environments. Third, we demonstrate that training on up to 30 hours from one week of the same person's life leads to comparable performance to training on 30 hours of diverse web data, highlighting the strength of single-life representation learning. Overall, our results establish that the shared structure of the world both leads to consistency in models trained on individual lives and provides a powerful signal for visual representation learning.

2511.01724 2026-05-27 cs.CV cs.LG 版本更新

PRBench: A Standardized Probabilistic Robustness Benchmark

PRBench：标准化概率鲁棒性基准

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

发表机构 * WMG, University of Warwick（沃里克大学WMG学院）； Department of Computer Science, University of Liverpool（利物浦大学计算机科学系）； College of Computer Science, Nankai University（南开大学计算机学院）； School of Computing, National University of Singapore（新加坡国立大学计算学院）

AI总结提出PRBench基准，通过统一评估协议和理论分析，比较对抗训练与概率鲁棒性训练方法在干净准确率、鲁棒性及泛化误差上的表现。

详情

AI中文摘要

深度学习模型因对不可察觉扰动的脆弱性而闻名。现有研究大多集中于对抗鲁棒性（AR），它通过检查确定性对抗样本（AE）的存在性，在最坏情况下评估模型。相比之下，概率鲁棒性（PR）采用统计视角，衡量在随机扰动下预测保持正确的概率。尽管PR被广泛视为AR的实用补充，但专门用于提升PR的训练方法仍相对未被充分探索，尽管已有初步进展。在少数针对PR的训练方法中，我们发现了三个局限性：(i) 不可比较的评估协议；(ii) 尽管AT能带来PR提升的轶事证据，但与强AT基线的比较有限；(iii) 缺乏统一框架来比较这些方法的泛化能力。因此，我们引入了PRBench，这是第一个专门评估不同鲁棒性训练方法在PR提升上的基准。PRBench使用一套全面的指标，包括干净准确率、PR和AR性能、训练效率以及泛化误差（GE），对最常见的AT和针对PR的训练方法进行实证比较。我们还对不同训练方法的PR性能的GE进行了理论分析。PRBench揭示的主要发现包括：在跨不同超参数设置提升AR和PR性能方面，AT方法比针对PR的训练方法更具通用性，而针对PR的训练方法始终产生更低的GE和更高的干净准确率。包含229个训练模型（覆盖7个数据集和10种模型架构）的排行榜公开于 https://wellzline.github.io/PRBenchLeaderboard/。

英文摘要

Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: (i) non-comparable evaluation protocols; (ii) limited comparisons to strong adversarial training (AT) baselines despite anecdotal PR gains from AT; and (iii) no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares the most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 229 trained models across 7 datasets and 10 model architectures is publicly available at https://wellzline.github.io/PRBenchLeaderboard/.
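The definition of probabilistic robustness given above, the probability that a prediction stays correct under stochastic perturbations, lends itself to a simple Monte Carlo estimate. The sketch below samples perturbations uniformly from an L-infinity ball; the perturbation distribution, the sample count, and the toy classifier in the usage note are assumptions for illustration and are not PRBench's evaluation protocol.

```python
import random
from typing import Callable, List


def probabilistic_robustness(predict: Callable[[List[float]], int],
                             x: List[float], label: int, epsilon: float,
                             num_samples: int = 1000, seed: int = 0) -> float:
    """Estimate P[predict(x + delta) == label] for delta uniform in [-eps, eps]^d."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_samples):
        perturbed = [value + rng.uniform(-epsilon, epsilon) for value in x]
        correct += int(predict(perturbed) == label)
    return correct / num_samples
```

As a usage sketch, with the toy classifier `predict = lambda v: int(v[0] > 0)`, an input far from the decision boundary relative to `epsilon` yields an estimate of 1.0, while an input on the boundary yields an estimate near 0.5. Adversarial robustness would instead ask whether any single perturbation in the ball flips the prediction, which is why the two notions can rank models differently.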

2511.14993 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0：图像与视频生成的基础模型系列

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

发表机构 * Kandinsky Lab（Kandinsky 实验室）

AI总结本文介绍Kandinsky 5.0系列模型，通过多阶段训练、自监督微调和强化学习后训练，实现高分辨率图像和10秒视频的高质量生成。

Comments Website: https://kandinskylab.ai/

详情

AI中文摘要

本报告介绍了Kandinsky 5.0，一系列用于高分辨率图像和10秒视频合成的最先进基础模型。该框架包含三个核心模型系列：Kandinsky 5.0 Image Lite——6B参数的图像生成模型系列，Kandinsky 5.0 Video Lite——快速轻量级的2B参数文本到视频和图像到视频模型，以及Kandinsky 5.0 Video Pro——19B参数模型，实现了卓越的视频生成质量。我们全面回顾了数据策展生命周期——包括收集、处理、过滤和聚类——用于多阶段训练流程，该流程涉及广泛的预训练，并融入了质量增强技术，如自监督微调（SFT）和基于强化学习（RL）的后训练。我们还介绍了新颖的架构、训练和推理优化，使Kandinsky 5.0能够在各种任务上实现高生成速度和最先进的性能，如人类评估所示。作为一个大规模、公开可用的生成框架，Kandinsky 5.0充分利用其预训练及后续阶段的全部潜力，以适应广泛的生成应用。我们希望本报告，连同我们开源代码和训练检查点的发布，将大大促进高质量生成模型的研究社区发展和可访问性。

英文摘要

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core lineups of models: Kandinsky 5.0 Image Lite, a lineup of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, a lineup of fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, a lineup of 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (including collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2504.19203 2026-05-27 eess.IV cs.CV 版本更新

Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction

基于MRI的深度学习模型在全膝关节置换预测中的泛化能力改进

Ehsan Karami, Hamid Soltanian-Zadeh

AI总结针对MRI深度学习模型在不同来源数据上泛化性差的问题，提出用实例归一化替代批归一化、数据增强和对比损失的方法，在OAI数据集上显著提升分类性能。

详情

DOI: 10.1109/ICBME68496.2025.11392492
Journal ref: Proceedings of the 2025 32nd National and 10th International Iranian Conference on Biomedical Engineering (ICBME)

AI中文摘要

膝骨关节炎（KOA）是一种常见的关节疾病，会导致疼痛和行动不便。尽管基于MRI的深度学习模型在全膝关节置换（TKR）和疾病进展预测中表现出优越性能，但其泛化能力仍然具有挑战性，尤其是在应用于不同来源的影像数据时。在本研究中，我们证明用实例归一化替代批归一化、使用数据增强以及应用对比损失可以改善泛化能力。在训练和评估中，我们使用了来自骨关节炎倡议（OAI）数据库的MRI数据，将矢状面脂肪抑制中间加权涡轮自旋回波（FS-IW-TSE）图像作为源域，矢状面脂肪抑制三维（3D）双回波稳态（DESS）图像作为目标域。结果表明，通过在基线模型中将批归一化替换为实例归一化，使用全局强度非线性（GIN）增强方法生成增强输入视图，以及在分类损失之外加入监督对比损失以对齐相同标签样本的表征，两个域的分类指标均有统计学显著提升。当使用3D实例归一化时，带有对比损失的GIN方法优于所有评估的单源域泛化方法。比较有无对比损失的GIN（针对两种归一化类型）表明，添加对比损失始终带来更好的性能。

英文摘要

Knee osteoarthritis (KOA) is a common joint disease that causes pain and mobility issues. While MRI-based deep learning models have demonstrated superior performance in predicting total knee replacement (TKR) and disease progression, their generalizability remains challenging, particularly when applied to imaging data from different sources. In this study, we show that replacing batch normalization with instance normalization, using data augmentation, and applying contrastive loss improves generalization. For training and evaluation, we used MRI data from the Osteoarthritis Initiative (OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed three-dimensional (3D) dual-echo in steady state (DESS) images as the target domain. The results demonstrated a statistically significant improvement in classification metrics across both domains by replacing batch normalization with instance normalization in the baseline model, generating augmented input views using the Global Intensity Non-linear (GIN) augmentation method, and incorporating a supervised contrastive loss alongside the classification loss to align representations of samples with the same label. The GIN method with contrastive loss performed better than all evaluated single-source domain generalization methods when using 3D instance normalization. Comparing GIN with and without contrastive loss (for both normalization types) showed that adding contrastive loss consistently led to better performance.
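
To make the normalization swap in this abstract concrete, the sketch below contrasts the two schemes on toy data. It is a minimal stdlib illustration, not the authors' model: the flat lists stand in for single-channel image volumes, the affine parameters and running statistics of real layers are omitted, and the function names are hypothetical.

```python
import math

def instance_norm(sample, eps=1e-5):
    """Normalize one sample (a flat list of intensities) with its own mean and variance."""
    mean = sum(sample) / len(sample)
    var = sum((v - mean) ** 2 for v in sample) / len(sample)
    return [(v - mean) / math.sqrt(var + eps) for v in sample]

def batch_norm(batch, eps=1e-5):
    """Normalize every sample with statistics pooled over the whole batch
    (training-mode behaviour, no affine parameters)."""
    values = [v for sample in batch for v in sample]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[(v - mean) / math.sqrt(var + eps) for v in sample] for sample in batch]

# Two "scans" with the same structure but a different intensity scale and offset,
# standing in for two acquisition protocols.
scan_a = [1.0, 2.0, 3.0, 4.0]
scan_b = [10.0 + 5.0 * v for v in scan_a]

in_a, in_b = instance_norm(scan_a), instance_norm(scan_b)
bn_a, bn_b = batch_norm([scan_a, scan_b])
```

Here `in_a` and `in_b` are nearly identical because each sample is standardized with its own statistics, while `bn_a` and `bn_b` still differ by the intensity offset. This is one plausible reading of why instance normalization helps across imaging sources; the abstract itself reports the effect without giving this explanation.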

2510.17759 2026-05-27 cs.CR cs.CL cs.CV cs.LG stat.ML 版本更新

VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models

VERA-V：用于破解视觉语言模型的变分推断框架

Qilin Liao, Anamika Lochab, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University, USA（美国普渡大学计算机科学系）

AI总结提出VERA-V变分推断框架，通过联合后验分布生成隐蔽的文本-图像对抗输入，以系统性地发现视觉语言模型的多模态漏洞，在多个基准上攻击成功率最高提升53.75%。

Comments 18 pages, 7 Figures,

详情

AI中文摘要

视觉语言模型（VLM）通过视觉推理扩展了大语言模型，但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖脆弱的模板，专注于单一攻击设置，并且仅暴露了漏洞的一小部分。为了解决这些限制，我们引入了VERA-V，一个变分推断框架，将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得能够生成绕过模型防护的隐蔽、耦合的对抗输入。我们训练一个轻量级攻击者来近似后验分布，从而能够高效采样多样化的越狱方法，并提供对漏洞的分布性洞察。VERA-V进一步整合了三种互补策略：（i）基于排版的文本提示，嵌入有害线索；（ii）基于扩散的图像合成，引入对抗信号；（iii）结构化干扰物，分散VLM的注意力。在HarmBench和HADES基准上的实验表明，VERA-V在开源和前沿VLM上均持续优于最先进的基线方法，在GPT-4o上相比最佳基线实现了高达53.75%的攻击成功率（ASR）提升。我们在项目页面提供了代码，地址为：https://github.com/kxwhiowo/VERA-V

英文摘要

Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o. We include the code on the project page available here: https://github.com/kxwhiowo/VERA-V

2510.09606 2026-05-27 cs.CV 版本更新

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

SpaceVista：从毫米到公里的全尺度视觉空间推理

Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue

发表机构 * Multimedia Laboratory, The Chinese University of Hong Kong（香港中文大学多媒体实验室）； Beijing University of Posts（北京邮电大学）； Hong Kong University of Science（香港科技大学）

AI总结本文提出全尺度空间推理解决方案，通过结构化知识系统、尺度感知建模和渐进训练范式，构建SpaceVista-1M数据集（38K视频场景、约1M空间QA对）和SpaceVista-7B模型，在5个基准上展现强泛化能力。

Comments Project Page: https://peiwensun2000.github.io/mm2km/

详情

AI中文摘要

通过在大规模工业数据集上的异常引导预训练推进金属表面缺陷检测

Chuni Liu, Hongjie Li, Jiaqi Du, Yangyang Hou, Qian Sun, Lei Jin, Ke Xu

发表机构 * Collaborative Innovation Center of Steel Technology, University of Science and Technology Beijing（钢铁技术协同创新中心，北京科技大学）

AI总结提出异常引导自监督预训练（AGSSP）方法，通过两阶段框架利用异常先验引导表示学习，在金属表面缺陷检测中显著提升性能，mAP@0.5提升高达10%。

Comments Accepted for publication in Pattern Recognition

详情

DOI: 10.1016/j.patcog.2026.113788
Journal ref: Pattern Recognition, Volume 179, Part C, 2026, 113788

AI中文摘要

预训练-微调范式是金属表面缺陷检测中缓解数据稀缺挑战的关键策略。然而，其实现面临一个关键困境：在ImageNet等自然图像数据集上预训练存在显著的领域差距；同时，由于现有学习目标无法区分复杂背景噪声和纹理中的细微缺陷模式，在领域内工业数据上进行简单的自监督预训练往往效果不佳。为解决这一问题，我们引入了异常引导自监督预训练（AGSSP），这是一种通过异常先验显式引导表示学习的新范式。AGSSP采用两阶段框架：（1）首先通过从异常图中蒸馏知识来预训练模型的主干网络，鼓励网络捕获缺陷显著特征；（2）然后使用从这些图中导出的伪缺陷框预训练检测器，使其与定位任务对齐。为此，我们开发了一种知识增强方法来生成高质量的异常图，并收集了一个包含120,000张图像的大规模工业数据集。此外，我们提供了两个小规模、像素级标注的金属表面缺陷数据集用于验证。大量实验表明，AGSSP在各种设置下均能持续提升性能，与基于ImageNet的模型相比，mAP@0.5提升高达10%，mAP@0.5:0.95提升高达11.4%。所有代码、预训练模型和数据集均可在https://clovermini.github.io/AGSSP-Dev/公开获取。

英文摘要

The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets such as ImageNet faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective due to the inability of existing learning objectives to distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model's backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at https://clovermini.github.io/AGSSP-Dev/.

2509.09977 2026-05-27 cs.CV 版本更新

ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking

ISTASTrack：通过ISTA适配器桥接ANN和SNN用于RGB-事件跟踪

Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng

AI总结提出首个基于Transformer的ANN-SNN混合跟踪器ISTASTrack，利用ISTA适配器双向融合RGB和事件特征，实现高效鲁棒跟踪。

Comments Accepted by IEEE Transactions on Image Processing, DOI: 10.1109/TIP.2026.3694138, 15 pages, 8 figures

详情

AI中文摘要

RGB-事件跟踪已成为视觉目标跟踪中一个有前景的趋势，旨在利用RGB图像和动态尖峰事件的互补优势来提高性能。然而，现有的人工神经网络（ANN）难以充分利用事件流的稀疏和异步特性。最近，结合ANN和脉冲神经网络（SNN）的混合架构研究作为RGB-事件感知中的一种有前途的解决方案出现，但有效融合跨异构范式的特征仍然是一个挑战。在这项工作中，我们提出了ISTASTrack，这是第一个基于Transformer的ANN-SNN混合跟踪器，配备了ISTA适配器用于RGB-事件跟踪。该双分支模型采用视觉Transformer从RGB输入中提取空间上下文，并使用脉冲Transformer从事件流中捕获时空动态。为了弥合ANN和SNN特征之间的模态和范式差距，我们系统地设计了一个基于模型的ISTA适配器，用于两个分支之间的双向特征交互，该适配器通过展开迭代收缩阈值算法从稀疏表示理论推导而来。此外，我们在适配器中引入了一个时间下采样注意力模块，以在潜在空间中对齐多步SNN特征与单步ANN特征，从而改善时间融合。在RGB-事件跟踪基准（如FE240hz、VisEvent、COESOT和FELT）上的实验结果表明，ISTASTrack在保持高能效的同时实现了最先进的性能，突显了混合ANN-SNN设计在鲁棒视觉跟踪中的有效性和实用性。代码已公开在https://github.com/lsying009/ISTASTrack.git。

英文摘要

RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based ANN-SNN hybrid Tracker equipped with ISTA adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
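
For readers unfamiliar with the iterative shrinkage thresholding algorithm named in this abstract, the following is a minimal sketch of the classical, non-learned ISTA for the sparse coding problem min_x 0.5 * ||Ax - y||^2 + lam * ||x||_1. It is not the ISTASTrack adapter: "unfolding" generally means turning each iteration into a network layer with learned parameters, and how the authors do that is not described here. The names `ista`, `soft_threshold`, `lam`, and `step` are chosen for illustration.

```python
def soft_threshold(v, t):
    """Shrinkage operator: move v toward zero by t, clamping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def ista(A, y, lam, step, n_iters=200):
    """Classical ISTA for min_x 0.5 * ||A x - y||^2 + lam * ||x||_1.
    `step` should be at most 1 / L, where L is the largest eigenvalue of A^T A."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        # Residual r = A x - y
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i] for i in range(n_rows)]
        # Gradient of the smooth term: A^T r
        grad = [sum(A[i][j] * r[i] for i in range(n_rows)) for j in range(n_cols)]
        # Gradient step followed by shrinkage
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n_cols)]
    return x

# With an identity dictionary the solution is the soft-thresholded observation:
# small entries are set exactly to zero, large ones shrink by lam.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.05, -2.0]
x_hat = ista(A, y, lam=0.1, step=1.0)
```

The shrinkage step is what produces sparse codes, which is presumably why the algorithm is a natural match for sparse event features.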

2504.08593 2026-05-27 cs.CV cs.AI 版本更新

Hands-On: Segmenting Individual Signs from Continuous Sequences

动手实践：从连续序列中分割单个手势

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

发表机构 * CVSSP, University of Surrey（CVSSP，萨里大学）

AI总结针对连续手语分割难题，提出基于Transformer的架构，利用HaMeR手部特征和3D角度，采用BIO标注方案建模时序动态，在DGS语料库上达到最优性能。

Comments Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition. Code Implementation Released

2508.07996 2026-05-27 cs.CV 版本更新

Structured Relational Reasoning for Group Activity Assessment

结构化关系推理用于群体活动评估

Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg

发表机构 * University of Stuttgart（斯图加特大学）； University of Hildesheim（希尔德斯海姆大学）

AI总结提出ProGraD框架，利用冻结视觉基础模型和轻量级GroupContext Transformer，通过结构化关系推理在单次前向传播中联合推断群体位置、成员关系和活动，仅用10M参数即在Cafe和Social-CAD基准上取得最优性能。

Comments Accepted to CVPR 2026 Workshop (SAUAFG)

详情

AI中文摘要

群体活动检测（GAD）涉及识别视频中的社会群体及其集体行为。视觉基础模型（VFM），如DINOv2，提供优秀的特征，但是在以物体为中心的数据上预训练的。我们发现，将它们简单替换到现有GAD流程中实际上会降低性能，暴露出结构化的群体感知解码才是真正的瓶颈。我们提出了ProGraD，一个基于冻结VFM构建的结构化关系推理框架。其核心是一个轻量级的两层GroupContext Transformer，显式建模演员-群体关联并聚合全局上下文以推断集体行为。可学习的群体提示作为最小条件机制，引导冻结骨干网络朝向社交相关表示，而关系解码器对演员和群体执行核心推理。该设计在单次前向传播中联合推断群体位置、成员关系和活动，仅使用10M可训练参数——不到先前方法的一半。在具有多个并发社交群体的Cafe基准上，ProGraD将Group mAP@1.0提升了6.5%，Group mAP@0.5提升了8.2%。在Social-CAD上，它实现了最先进的社交和成员关系准确性。ProGraD还生成可解释的注意力图，为演员-群体推理提供洞察。

英文摘要

Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DINOv2, offer excellent features but are pretrained on object-centric data. We find that naively substituting them into existing GAD pipelines actually degrades performance, exposing structured group-aware decoding as the true bottleneck. We introduce ProGraD, a structured relational-reasoning framework for GAD built on top of frozen VFMs. At its core is a lightweight two-layer GroupContext Transformer that explicitly models actor-group associations and aggregates global context to infer collective behavior. Learnable group prompts serve as a minimal conditioning mechanism to guide the frozen backbone toward socially relevant representations, while the relational decoder performs the core reasoning over actors and groups. This design jointly infers group locations, memberships, and activities in a single pass using only 10M trainable parameters - less than half of prior methods. On the Cafe benchmark with multiple concurrent social groups, ProGraD improves the state-of-the-art by 6.5% Group mAP@1.0 and 8.2% Group mAP@0.5. On Social-CAD, it achieves state-of-the-art social and membership accuracy. ProGraD further produces interpretable attention maps that provide insights into actor-group reasoning.

2508.00748 2026-05-27 cs.CV cs.AI cs.CR cs.MM 版本更新

Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

真的是你吗？探索逼真说话头像视频中的生物特征验证场景

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

发表机构 * Biometrics and Data Pattern Analytics Lab（生物特征与数据模式分析实验室）

AI总结本文研究在逼真说话头像视频中，利用面部运动模式作为行为生物特征进行身份验证，提出基于图卷积网络的轻量级模型，AUC接近80%。

Comments Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)

详情

DOI: 10.1109/IJCB65343.2025.11411089
Journal ref: 2025 IEEE International Joint Conference on Biometrics (IJCB)

AI中文摘要

逼真说话头像在虚拟会议、游戏和社交平台中越来越常见。这些头像允许更沉浸式的交流，但也引入了严重的安全风险。一个新兴威胁是冒充：攻击者可以窃取用户的头像，保留其外观和声音，使得仅凭视觉或听觉几乎无法检测欺诈性使用。在本文中，我们探讨了在这种头像中介场景中生物特征验证的挑战。我们的主要问题是，当头像的视觉外观是其主人的复制品时，个体的面部运动模式能否作为可靠的行为生物特征来验证其身份。为了回答这个问题，我们引入了一个新的数据集，其中包含使用最先进的一次性头像生成模型GAGAvatar创建的逼真头像视频，包括真实和冒充的头像视频。我们还提出了一种轻量级、可解释的时空图卷积网络架构，具有时间注意力池化，仅使用面部标志点来建模动态面部手势。实验结果表明，面部运动线索能够实现有意义的身份验证，AUC值接近80%。所提出的基准和生物特征系统可供研究社区使用，以引起对基于头像的通信系统中更高级行为生物特征防御的迫切需求的关注。

英文摘要

Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving his appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling, that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

2508.01253 2026-05-27 cs.CV 版本更新

Normal Patch Retinex 稳健算法用于数字显微镜白平衡

Radoslaw Roszczyk, Artur Krupa, Izabella Antoniuk

发表机构 * Faculty of Electrical Engineering（电子工程学院）； Institute of Information Technology（信息技术研究所）

AI总结提出一种基于Normal Patch Retinex的全自动白平衡算法，用于校正数字显微镜彩色图像，实验证明其优于经典算法。

详情

DOI: 10.22630/MGV.2020.29.1.5
Journal ref: Vol. 29 No. 1/4 (2020)

AI中文摘要

在光学显微镜中获取准确彩色、平衡的图像即使对于经验丰富的显微镜操作者也可能是一个挑战。本文提出了一种完全自动的白平衡机制，能够充分校正显微彩色图像。该算法的结果已在200张显微图像数据集上通过实验验证。这些图像包含病理形态学中常用的三种显微标本的扫描图。此外，将所得结果与数字摄影中其他常用的白平衡算法进行了比较。本文应用的算法对于苏木精-荧光桃红-番红染色的显微图像和免疫组织化学染色图像比彩色摄影中使用的经典算法更有效。

英文摘要

The acquisition of accurately coloured, balanced images in an optical microscope can be a challenge even for experienced microscope operators. This article presents an entirely automatic mechanism for balancing the white level that allows the correction of the microscopic colour images adequately. The results of the algorithm have been confirmed experimentally on a set of two hundred microscopic images. The images contained scans of three microscopic specimens commonly used in pathomorphology. Also, the results achieved were compared with other commonly used white balance algorithms in digital photography. The algorithm applied in this work is more effective than the classical algorithms used in colour photography for microscopic images stained with hematoxylin-phloxine-saffron and for immunohistochemical staining images.
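
The abstract compares the proposed method against classical white balance algorithms from digital photography without naming them. As a point of reference, the sketch below shows one such classical baseline, the white-patch (max-RGB) Retinex rule, which rescales each channel so that its brightest value becomes white. This is an assumed example of a baseline and is not the article's Normal Patch Retinex algorithm; the function name and the tuple-based image layout are illustrative.

```python
def white_patch_balance(pixels, white=255.0):
    """Classical white-patch (max-RGB) Retinex: rescale each channel so that
    its brightest value maps to `white`. `pixels` is a list of (r, g, b) tuples."""
    maxima = [max(p[c] for p in pixels) for c in range(3)]
    gains = [white / m if m > 0 else 1.0 for m in maxima]
    return [tuple(min(white, p[c] * gains[c]) for c in range(3)) for p in pixels]

# A tiny image with a yellowish cast: after balancing, the brightest pixel is white
# and the mid-tone pixel becomes a neutral gray.
image = [(200.0, 180.0, 100.0), (100.0, 90.0, 50.0), (20.0, 18.0, 10.0)]
balanced = white_patch_balance(image)
```

A rule like this depends on the scene containing a genuinely white region, which is not guaranteed in stained microscopy specimens.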

2506.17633 2026-05-27 cs.CV cs.AI 版本更新

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

自适应多提示对比网络用于少样本分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore（南洋理工大学计算机学院和数据科学学院，新加坡）

AI总结针对少样本分布外检测问题，提出自适应多提示对比网络（AMCN），通过CLIP学习可学习文本提示和类间/类内分布，实现ID-OOD分离边界自适应。

Comments Published in ICML 2025

详情

AI中文摘要

分布外（OOD）检测旨在区分异常样本，以防止在分布内（ID）数据集上训练的模型产生不可用的输出。大多数OOD检测方法需要大量IID样本进行训练，这严重限制了它们的实际应用。为此，我们针对一个具有挑战性的场景：少样本OOD检测，其中只有少量标记的ID样本可用。因此，少样本OOD检测比传统的OOD检测设置更具挑战性。先前的少样本OOD检测工作忽略了不同类别之间的显著多样性。在本文中，我们提出了一种新颖的网络：自适应多提示对比网络（AMCN），它通过学习类间和类内分布来适应ID-OOD分离边界。为了弥补OOD的缺失和ID图像样本的稀缺，我们利用CLIP连接文本与图像，设计可学习的ID和OOD文本提示。具体来说，我们首先生成自适应提示（可学习ID提示、标签固定OOD提示和标签自适应OOD提示）。然后，我们通过引入类级阈值为每个类生成自适应类边界。最后，我们提出一个提示引导的ID-OOD分离模块来控制ID和OOD提示之间的间隔。实验结果表明，AMCN优于其他最先进的工作。

英文摘要

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

2506.11253 2026-05-27 cs.CV cs.LG 版本更新

Lifting Data-Tracing Machine Unlearning to Knowledge-Tracing for Foundation Models

将数据追踪的机器遗忘提升为基础模型的知识追踪

Yuwen Tan, Boqing Gong

发表机构 * Boston University（波士顿大学）

AI总结本文提出将数据追踪的机器遗忘提升为基础模型的知识追踪，以应对多样化遗忘请求，并更接近人类遗忘机制，通过视觉语言模型案例展示实现范式。

Comments Accepted to TMLR

详情

AI中文摘要

机器遗忘从AI模型中移除特定训练数据点及其影响（例如，当数据所有者撤销其同意允许模型从数据中学习时）。在这篇立场论文中，我们提出将数据追踪的机器遗忘提升为基础模型（FMs）的知识追踪。我们基于实际需求和认知研究的见解支持这一立场。实际上，追踪数据无法满足对FMs的多样化遗忘请求，这些请求可能来自监管机构、企业用户、产品团队等，他们无法访问FMs的大量训练数据。相反，这些相关方更便于针对FMs（不应）拥有的知识或能力提出遗忘请求。认知上，知识追踪遗忘比追踪单个训练数据点更接近人脑的遗忘方式。我们进一步讨论了知识追踪机器遗忘范式中的重大挑战。最后，我们提供了一个关于视觉语言FMs的具体案例研究，以说明遗忘者如何实例化知识追踪机器遗忘范式。代码可在：https://1yuwen.github.io/Knowledge-Tracing-MU-Page 获取。

英文摘要

Machine unlearning removes certain training data points and their influence from AI models (e.g., when a data owner revokes their consent to allow models to learn from the data). In this position paper, we propose to lift data-tracing machine unlearning to knowledge-tracing for foundation models (FMs). We support this position based on practical needs and insights from cognitive studies. Practically, tracing data cannot meet the diverse unlearning requests for FMs, which may be from regulators, enterprise users, product teams, etc., who have no access to FMs' massive training data. Instead, it is convenient for these parties to issue an unlearning request about the knowledge or capability FMs (should not) possess. Cognitively, knowledge-tracing unlearning aligns with how the human brain forgets more closely than tracing individual training data points does. We further discuss the nontrivial challenges in the knowledge-tracing machine unlearning paradigm. Finally, we provide a concrete case study about a vision-language FM to illustrate how an unlearner might instantiate the knowledge-tracing machine unlearning paradigm. Code is available at: https://1yuwen.github.io/Knowledge-Tracing-MU-Page.

2506.07813 2026-05-27 cs.CV cs.AI 版本更新

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

自级联扩散模型用于任意尺度图像超分辨率

Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

发表机构 * Department of Electrical and Computer Engineering (ECE), Seoul National University（电气电子工程系（ECE），首尔国立大学）； Institute of New Media and Communications (INMC) & Interdisciplinary Program in AI (IPAI), Seoul National University（新媒体与通讯研究所（INMC）及人工智能跨学科项目（IPAI），首尔国立大学）

AI总结提出自级联扩散框架CasArbi，通过将任意缩放因子分解为连续小步骤，逐步提升分辨率并保持尺度一致性，在感知和失真指标上优于现有方法。

详情

AI中文摘要

任意尺度图像超分辨率旨在将图像上采样到任意期望分辨率，比传统固定尺度超分辨率提供更大灵活性。最近基于回归或生成模型的方法显示出有希望的结果，但由于其单阶段公式必须同时处理大范围的缩放因子，常常遭受尺度不一致的问题。为了解决这个问题，我们提出了CasArbi，一个用于任意尺度图像超分辨率的自级联扩散框架。CasArbi将不同的缩放因子分解为更小的顺序步骤，逐步提升图像分辨率，并在每一步实现任意尺度的无缝过渡。CasArbi利用坐标条件扩散模型学习连续图像表示，并在推理时采用自一致性指导生成尺度一致的细节。大量实验表明，CasArbi在感知和失真指标上均优于现有方法，并在各种任意尺度超分辨率基准上展现出卓越的尺度一致性。我们的代码可在https://github.com/junseo88/CasArbi获取。

英文摘要

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.
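
The idea of decomposing a scaling factor into smaller sequential steps can be sketched as follows. The abstract does not say how CasArbi chooses its steps, so the schedule below is purely hypothetical: it splits the requested scale into the fewest equal multiplicative steps that each stay at or below an assumed per-step limit `max_step`.

```python
import math

def cascade_steps(total_scale, max_step=2.0):
    """Split `total_scale` into the fewest equal multiplicative steps that
    each stay at or below `max_step` (a hypothetical schedule)."""
    if total_scale <= 1.0:
        return [total_scale]
    # Small tolerance so that exact powers of max_step are not rounded up.
    n_steps = max(1, math.ceil(math.log(total_scale) / math.log(max_step) - 1e-12))
    step = total_scale ** (1.0 / n_steps)
    return [step] * n_steps

steps = cascade_steps(6.0)      # three equal steps, each below 2x
product = math.prod(steps)      # recovers the requested 6x up to rounding
```

In a cascade, a super-resolution model would be applied once per entry in `steps`, so each call only has to handle a modest upsampling factor instead of the full range.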

2505.18603 2026-05-27 cs.AI cs.CV 版本更新

创新性矽肺和肺炎分类：利用图Transformer后验建模与集成技术

Bao Q. Bui, Tien T. T. Nguyen, Duy M. Le, Cong Tran, Cuong Pham

AI总结提出结合图Transformer网络与传统深度神经网络的架构，并采用平衡交叉熵损失函数和集成方法，在自建胸部X光数据集上实现高精度矽肺与肺炎分类。

Comments Withdrawn by the authors because the manuscript contains incomplete and potentially misleading descriptions of the dataset construction and evaluation protocol, particularly in the Dataset and Experimental Setup sections. The work should not be cited or used as an independent reference in its current form

详情

AI中文摘要

本文对矽肺相关肺部炎症的分类与检测进行了全面研究。我们的主要贡献包括：1) 创建了一个名为SVBCX的新策划胸部X光（CXR）图像数据集，该数据集针对不同病原体引起的肺部炎症的细微差别进行了定制，为矽肺和肺炎研究社区提供了宝贵资源；2) 提出了一种新颖的深度学习架构，该架构将图Transformer网络与传统深度神经网络模块相结合，用于有效分类矽肺和肺炎。此外，我们采用平衡交叉熵（BalCE）作为损失函数，以确保不同类别之间的更均匀学习，增强模型辨别肺部状况细微差异的能力。所提出的模型架构和损失函数选择旨在提高炎症检测的准确性和可靠性，特别是在矽肺背景下。此外，我们的研究探索了一种集成方法的有效性，该方法结合了不同模型架构的优势。在构建的数据集上的实验结果表明，与基线模型相比，取得了显著改进。模型集成实现了宏F1分数0.9749，每个类别的AUC ROC分数超过0.99，突显了我们的方法在准确和鲁棒的肺部炎症分类中的有效性。

英文摘要

This paper presents a comprehensive study on the classification and detection of Silicosis-related lung inflammation. Our main contributions include 1) the creation of a newly curated chest X-ray (CXR) image dataset named SVBCX that is tailored to the nuances of lung inflammation caused by distinct agents, providing a valuable resource for silicosis and pneumonia research community; and 2) we propose a novel deep-learning architecture that integrates graph transformer networks alongside a traditional deep neural network module for the effective classification of silicosis and pneumonia. Additionally, we employ the Balanced Cross-Entropy (BalCE) as a loss function to ensure more uniform learning across different classes, enhancing the model's ability to discern subtle differences in lung conditions. The proposed model architecture and loss function selection aim to improve the accuracy and reliability of inflammation detection, particularly in the context of Silicosis. Furthermore, our research explores the efficacy of an ensemble approach that combines the strengths of diverse model architectures. Experimental results on the constructed dataset demonstrate promising outcomes, showcasing substantial enhancements compared to baseline models. The ensemble of models achieves a macro-F1 score of 0.9749 and AUC ROC scores exceeding 0.99 for each class, underscoring the effectiveness of our approach in accurate and robust lung inflammation classification.

2406.03474 2026-05-27 cs.CV 版本更新

AD-H: Language-guided Autonomous Driving with Hierarchical Agents

AD-H：基于分层智能体的语言引导自动驾驶

Zaibin Zhang, Talas Fu, Shiyu Tang, Yuanhang Zhang, Yifan Wang, Lijun Wang, Huchuan Lu

发表机构 * Dalian University of Technology（大连理工大学）

AI总结提出AD-H分层多智能体框架，上层MLLM规划器生成中层驾驶指令，下层轻量控制器执行连续动作，通过规则重建1.15M中层指令对，以3B+350M参数超越7B模型，实现长时域泛化与指令遵循。

详情

AI中文摘要

语言引导的自动驾驶需要弥合高级自然语言指令与低级车辆控制之间巨大的抽象鸿沟。使用单个多模态大语言模型（MLLM）将语言直接映射到动作的端到端方法难以应对这种不匹配，往往无法利用模型的推理能力，并且在用于微调的驾驶数据集分布之外表现出有限的泛化能力。为了解决这个问题，我们提出了AD-H，一个分层多智能体框架，明确地将高级决策与低级车辆执行分开。在上层，基于MLLM的规划器解释自然语言命令和环境上下文，生成连贯的中层驾驶指令。在下层，轻量级控制器将这些中层指令转换为精确、连续的控制动作。这种分解与每个组件的功能优势相一致：规划器专注于语义推理和任务分解，而控制器确保稳定和准确的执行。为了支持这种层次结构下的大规模训练，我们设计了一个基于规则的流水线，从驾驶信号中重建中层命令，产生了115万对分层注释。大量实验表明，尽管AD-H使用的参数更少（即3B加350M，而对比模型为7B），但它仍优于最先进的模型，并实现了卓越的长时域泛化和指令遵循性能。我们在https://github.com/zhangzaibin/AD-H公开了我们的数据和代码。

英文摘要

Language-guided autonomous driving requires bridging a large abstraction gap between high-level natural-language instructions and low-level vehicle control. End-to-end approaches that use a single multimodal large language model (MLLM) to map language directly to actions struggle with this mismatch, often failing to exploit the reasoning capabilities of the model and exhibiting limited generalization beyond the distributions of driving datasets used for fine-tuning. To address this issue, we propose AD-H, a hierarchical multi-agent framework that explicitly separates high-level decision-making from low-level vehicle execution. At the upper level, an MLLM-based planner interprets natural-language commands and environmental context to generate coherent mid-level driving instructions. At the lower level, a lightweight controller converts these mid-level instructions into precise, continuous control actions. This decomposition aligns with the functional strengths of each component: the planner focuses on semantic reasoning and task decomposition, while the controller ensures stable and accurate actuation. To support large-scale training under this hierarchy, we design a rule-based pipeline that reconstructs mid-level commands from driving signals, producing 1.15 million hierarchical annotation pairs. Extensive experiments show that AD-H outperforms state-of-the-art models despite using fewer parameters, namely 3B plus 350M compared with 7B, and achieves superior long-horizon generalization and instruction-following performance. We make our data and code publicly accessible at https://github.com/zhangzaibin/AD-H

2405.16417 2026-05-27 cs.CV 版本更新

CRoFT: Robust Fine-Tuning with Concurrent Optimization for OOD Generalization and Open-Set OOD Detection

Lin Zhu, Yifeng Yang, Qinying Gu, Xinbing Wang, Chenghu Zhou, Nanyang Ye

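Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary: To address distribution shift when fine-tuning vision-language pre-trained models, the paper proposes a unified fine-tuning framework based on minimizing the gradient magnitude of energy scores, improving both closed-set OOD generalization and open-set OOD detection.

Details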

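AI-generated abstract (translated from Chinese)

Recent vision-language pre-trained models (VL-PTMs) have achieved remarkable success on open-vocabulary tasks. However, downstream use cases usually involve further fine-tuning of VL-PTMs, which can distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image style) and semantic shifts (e.g., classes unseen at test time). This highlights the importance of strengthening out-of-distribution (OOD) generalization under covariate shift while detecting unseen classes that arise from semantic shift. A critical but underexplored question therefore arises: how can fine-tuning improve the generalization of VL-PTMs to closed-set OOD data while effectively detecting open-set unseen classes? In this paper, we propose a novel objective function for OOD detection that also helps improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on the training data leads to domain-consistent Hessians of the classification loss, which theoretical analysis reveals to be a strong indicator of OOD generalization. Building on this finding, we develop a unified fine-tuning framework that optimizes both tasks concurrently. Extensive experiments demonstrate the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.

English abstract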

Recent vision-language pre-trained models (VL-PTMs) have shown remarkable success in open-vocabulary tasks. However, downstream use cases often involve further fine-tuning of VL-PTMs, which may distort their general knowledge and impair their ability to handle distribution shifts. In real-world scenarios, machine learning systems inevitably encounter both covariate shifts (e.g., changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of enhancing out-of-distribution (OOD) generalization on covariate shifts and simultaneously detecting semantic-shifted unseen classes. Thus a critical but underexplored question arises: How to improve VL-PTMs' generalization ability to closed-set OOD data, while effectively detecting open-set unseen classes during fine-tuning? In this paper, we propose a novel objective function of OOD detection that also serves to improve OOD generalization. We show that minimizing the gradient magnitude of energy scores on training data leads to domain-consistent Hessians of classification loss, a strong indicator for OOD generalization revealed by theoretical analysis. Based on this finding, we have developed a unified fine-tuning framework that allows for concurrent optimization of both tasks. Extensive experiments have demonstrated the superiority of our method. The code is available at https://github.com/LinLLLL/CRoFT.
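
To make "gradient magnitude of energy scores" concrete, here is a minimal sketch in plain Python. It uses the standard energy score $E(z) = -\log \sum_k \exp(f_k(z))$. The linear classification head $f(z) = Wz$ and the choice to differentiate with respect to the feature vector $z$ are assumptions made for illustration; the abstract does not specify either, and the function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def logits(weights, z):
    """Logits of an assumed linear head: one weight row per class."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def energy(weights, z):
    """Energy score E(z) = -logsumexp(W z)."""
    return -logsumexp(logits(weights, z))


def energy_gradient(weights, z):
    """Analytic gradient dE/dz = -sum_k softmax_k(W z) * W_k."""
    f = logits(weights, z)
    lse = logsumexp(f)
    probs = [math.exp(v - lse) for v in f]
    return [-sum(p * row[j] for p, row in zip(probs, weights))
            for j in range(len(z))]


def energy_gradient_norm(weights, z):
    """Euclidean norm of dE/dz, the quantity a regularizer could penalize."""
    return math.sqrt(sum(g * g for g in energy_gradient(weights, z)))


W = [[1.0, 0.0], [0.0, 1.0]]  # two classes, two feature dimensions
z = [0.0, 0.0]
print(energy(W, z), energy_gradient_norm(W, z))
```

In a training loop, a term like this would be averaged over a batch and added to the classification loss with some weight; how CRoFT combines the terms is described in the paper rather than in the abstract.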

2404.18539 2026-05-27 cs.CV cs.AI Updated version

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

Affiliations: University of Science and Technology Beijing; Beijing Advanced Innovation Center for Materials Genome Engineering; School of Intelligence Science and Technology; Shunde Innovation School; Institute for Advanced Materials and Technology; Key Laboratory of Intelligent Bionic Unmanned Systems; Institute of Materials Intelligent Technology; Liaoning Academy of Materials; School of Materials Science and Technology

AI summary: Proposes the Skea-Topo Aware loss, which improves the topological consistency of boundary segmentation in reticular images through skeleton-aware weighting and a boundary rectified term, improving VI by up to 7 points over 13 methods on three datasets.

Details

DOI: 10.24963/ijcai.2024/121
Journal ref: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 1092-1100, 2024

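AI-generated abstract (translated from Chinese)

Topological consistency plays a key role in boundary segmentation of reticular images, such as cell membrane segmentation in neuron electron microscopy images, grain boundary segmentation in material microscopy images, and road segmentation in aerial images. In these fields, topological changes in the segmentation result seriously affect downstream tasks, and their impact can even exceed that of misalignment of the boundary itself. To improve the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that takes into account the shape of each object and the topological importance of pixels. It consists of two parts. First, a skeleton-aware weighted loss improves segmentation accuracy by using skeletons to better model object geometry. Second, a boundary rectified term uses the foreground and background skeletons of both the ground truth and the prediction to identify and emphasize topologically critical pixels among the prediction errors. Experiments show that, on three different boundary segmentation datasets and under both objective and subjective evaluation, our method improves topological consistency by up to 7 points in the VI metric compared with 13 state-of-the-art methods. The code is available at https://github.com/clovermini/Skea_topo.

English abstract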

Topological consistency plays a crucial role in the task of boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on the downstream tasks, which can even exceed the misalignment of the boundary itself. To enhance the topology accuracy in segmentation results, we propose the Skea-Topo Aware loss, which is a novel loss function that takes into account the shape of each object and topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves the segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topological critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments prove that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.
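
The VI metric cited above is, in segmentation work, usually the variation of information between two partitions: $\mathrm{VI}(A, B) = H(A \mid B) + H(B \mid A)$. The sketch below computes it for two flat lists of per-pixel region labels. It assumes the boundary maps have already been converted into labeled regions (for example by connected-component labeling) and uses natural logarithms; the paper's exact evaluation protocol and scaling may differ, and the function name is hypothetical.

```python
import math
from collections import Counter


def variation_of_information(seg_a, seg_b):
    """Variation of information (in nats) between two labelings.

    seg_a and seg_b are equal-length sequences of region labels,
    one label per pixel. Lower is better; 0 means identical partitions.
    """
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must cover the same pixels")
    n = len(seg_a)
    count_a = Counter(seg_a)
    count_b = Counter(seg_b)
    joint = Counter(zip(seg_a, seg_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # H(B|A) contributes -p_ab * log(p_ab / p_a);
        # H(A|B) contributes -p_ab * log(p_ab / p_b).
        vi -= p_ab * (math.log(p_ab / (count_a[a] / n))
                      + math.log(p_ab / (count_b[b] / n)))
    return vi


ground_truth = [0, 0, 0, 0]   # one region
prediction = [0, 0, 1, 1]     # a spurious boundary splits it in two
print(variation_of_information(ground_truth, prediction))
```

In this toy case a single false boundary splits one region into two equal halves, which is the kind of topological error that a pixel-wise accuracy score would barely register.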

2306.09344 2026-05-27 cs.CV cs.LG Updated version

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

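Affiliations: MIT; Weizmann Institute of Science; Adobe Research

AI summary: The paper proposes the DreamSim metric, trained on synthetic data, which aligns with human perception at the mid-to-high level of image layout, object pose, and semantic content, and outperforms existing metrics on retrieval and reconstruction tasks.

Comments Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim

Details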

DOI: 10.5555/3666122.3668330
Journal ref: Advances in Neural Information Processing Systems 36 (NeurIPS 2023)

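AI-generated abstract (translated from Chinese)

Current perceptual similarity metrics operate at the level of pixels and patches. They compare images in terms of low-level color and texture but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that evaluates images holistically. The first step is to collect a new dataset of human similarity judgments over image pairs that are alike in a variety of ways. The key property of this dataset is that the judgments are nearly automatic and shared by all observers. To achieve this, we use recent text-to-image models to create synthetic pairs perturbed along different dimensions. We observe that popular perceptual metrics cannot explain our new data, so we introduce a new metric, DreamSim, tuned to align better with human perception. We analyze how different visual attributes affect our metric and find that it focuses mainly on foreground objects and semantic content while remaining sensitive to color and layout. Notably, although it is trained on synthetic data, our metric generalizes to real images and achieves strong results on retrieval and reconstruction tasks. Moreover, on these tasks our metric outperforms both previously learned metrics and recent large vision models.

English abstract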

Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.

2312.02694 2026-05-27 cs.CV Updated version

UPOCR: Towards Unified Pixel-Level OCR Interface

Dezhi Peng, Zhenhua Yang, Jiaxin Zhang, Chongyu Liu, Yongxin Shi, Kai Ding, Fengjun Guo, Lianwen Jin

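Affiliations: South China University of Technology; INTSIG Information Co. Ltd; INTSIG-SCUT Joint Lab of Document Image Analysis

AI summary: Proposes UPOCR, a unified pixel-level OCR model based on a ViT encoder-decoder with learnable task prompts, which achieves state-of-the-art performance with a single model on three tasks: text removal, text segmentation, and tampered text detection.

Comments ICML 2024 Version

Details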

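AI-generated abstract (translated from Chinese)

Existing optical character recognition (OCR) methods rely on task-specific designs with different paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders rapid deployment in applications. To this end, we propose UPOCR, a simple yet effective generalist model for a unified pixel-level OCR interface. Specifically, UPOCR unifies the paradigm of different OCR tasks as image-to-image transformation and unifies the architecture as a vision Transformer (ViT)-based encoder-decoder with learnable task prompts. These prompts push the general feature representations extracted by the encoder toward task-specific spaces, giving the decoder task awareness. In addition, model training is unified under the objective of minimizing the difference between the predicted image and the ground-truth image, regardless of the heterogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks: text removal, text segmentation, and tampered text detection. Without bells and whistles, the results show that the proposed method achieves state-of-the-art performance on all three tasks simultaneously with a single unified model, providing valuable strategies and insights for future research on generalist OCR models. Code is available at https://github.com/shannanyinxiang/UPOCR.

English abstract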

Existing optical character recognition (OCR) methods rely on task-specific designs with divergent paradigms, architectures, and training strategies, which significantly increases the complexity of research and maintenance and hinders the fast deployment in applications. To this end, we propose UPOCR, a simple-yet-effective generalist model for Unified Pixel-level OCR interface. Specifically, the UPOCR unifies the paradigm of diverse OCR tasks as image-to-image transformation and the architecture as a vision Transformer (ViT)-based encoder-decoder with learnable task prompts. The prompts push the general feature representations extracted by the encoder towards task-specific spaces, endowing the decoder with task awareness. Moreover, the model training is uniformly aimed at minimizing the discrepancy between the predicted and ground-truth images regardless of the inhomogeneity among tasks. Experiments are conducted on three pixel-level OCR tasks including text removal, text segmentation, and tampered text detection. Without bells and whistles, the experimental results showcase that the proposed method can simultaneously achieve state-of-the-art performance on three tasks with a unified single model, which provides valuable strategies and insights for future research on generalist OCR models. Code is available at https://github.com/shannanyinxiang/UPOCR.
