arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (50 papers)

2606.07585 2026-06-09 cs.CV cs.AI New submission

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

Anderson Augusma

Affiliations: Université Grenoble Alpes; Univ. of Glasgow; Inria; Univ. Paris-Saclay; TU Delft

AI summary: This thesis proposes two multimodal frameworks (cross-attention fusion with frame attention pooling, and a variational encoder multi-decoder) that use collective audio-video signals for group emotion recognition, avoiding individual features and achieving robust performance while preserving privacy.

Comments Doctoral thesis

Abstract

This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.
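The abstract names Frames Attention Pooling (FAP) for temporal aggregation but does not give its formulation. As a rough illustration only, the sketch below shows generic attention pooling over per-frame feature vectors: each frame gets a scalar score, the scores are softmax-normalized over time, and the clip-level vector is the weighted sum. The function name `attention_pool` and the scoring vector `score_weights` are hypothetical and are not taken from the thesis.

```python
import math


def attention_pool(frames, score_weights):
    """Aggregate per-frame feature vectors into one clip-level vector.

    frames: list of equal-length feature vectors, one per frame.
    score_weights: hypothetical learned scoring vector (assumption).
    """
    # One scalar relevance score per frame (dot product with the scoring vector).
    scores = [sum(f * w for f, w in zip(frame, score_weights)) for frame in frames]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of frame features, dimension by dimension.
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Two toy frames; the scoring vector favors the first feature dimension.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [5.0, 0.0])
print(pooled, weights)
```

With an all-zero scoring vector every frame receives the same weight, so the pooling reduces to a plain temporal average.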

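2606.07595 2026-06-09 cs.CV cs.AI cs.IR New submission

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

Affiliations: Nanyang Technological University

AI summary: Proposes the VisualLeakBench benchmark, which evaluates action-boundary propagation failures in which vision-language agents copy sensitive text from images (screenshots, documents, and similar scenes) into tool arguments; at baseline, the PII propagation rate is 78.8% and the unsafe-text propagation rate is 85.5%.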

Abstract

Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.
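A propagation rate of the kind reported above can be sketched as the share of cases whose target string shows up in any tool argument. This is a minimal sketch, not the benchmark's scorer: the case layout (`target`, `tool_calls`, `arguments`) and the case-insensitive substring match are assumptions made for illustration.

```python
def propagated(target, tool_calls):
    """Return True if the target string appears in any argument of any tool call."""
    needle = target.casefold()
    return any(
        needle in str(value).casefold()
        for call in tool_calls
        for value in call.get("arguments", {}).values()
    )


def propagation_rate(cases):
    """Fraction of cases whose target string crossed into tool arguments."""
    hits = sum(propagated(case["target"], case["tool_calls"]) for case in cases)
    return hits / len(cases)


# Hypothetical cases: one leak into a note-taking tool, one run with no tool call.
cases = [
    {
        "target": "jane@example.com",
        "tool_calls": [{"name": "save_note", "arguments": {"text": "Contact: Jane@Example.com"}}],
    },
    {"target": "555-0100", "tool_calls": []},
]
print(propagation_rate(cases))
```

A case with no tool call counts as non-propagating under this measure, which is why a defense that simply suppresses tool use can lower the rate without preserving utility.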

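2606.07613 2026-06-09 cs.CV cs.AI New submission

Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

Affiliations: Faculty of Law, McGill University

AI summary: Studies how well humans and frontier multimodal LLMs distinguish authentic photographs from AI-generated images in civil-dispute scenarios, finds neither reliable, and proposes a response that combines human review, MLLM screening, and provenance authentication.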

Abstract

Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes. We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator. We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.
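To see why 100% specificity alone says little about a detector, the sketch below computes specificity, synthetic-detection rate, and accuracy for a hypothetical detector that labels every image "real", using the dataset's 200 authentic and 1,200 synthetic images. The always-"real" detector is an illustrative extreme and not a result from the paper; the function name `detection_metrics` is likewise an assumption.

```python
def detection_metrics(labels, predictions):
    """Return (specificity, detection_rate, accuracy) for real-vs-synthetic labels."""
    real_preds = [p for l, p in zip(labels, predictions) if l == "real"]
    synth_preds = [p for l, p in zip(labels, predictions) if l == "synthetic"]
    # Specificity: share of authentic images correctly kept as "real".
    specificity = sum(p == "real" for p in real_preds) / len(real_preds)
    # Detection rate (sensitivity): share of synthetic images flagged as "synthetic".
    detection_rate = sum(p == "synthetic" for p in synth_preds) / len(synth_preds)
    accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
    return specificity, detection_rate, accuracy


# SLED-1400 class sizes from the abstract: 200 authentic, 1,200 synthetic.
labels = ["real"] * 200 + ["synthetic"] * 1200
# Hypothetical detector that never flags anything as synthetic.
always_real = ["real"] * len(labels)
print(detection_metrics(labels, always_real))
```

Such a detector never misclassifies an authentic image, yet it catches no synthetic one, so specificity has to be read together with the detection rate.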

2606.07639 2026-06-09 cs.CV cs.AI New submission

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes MOSS-Video-Preview, a two-channel cross-attention architecture that enables real-time video understanding through non-blocking perception and generation, achieving about a 5x speedup in time to first token and 2.7x higher decoding throughput on a single H200.

Abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07641 2026-06-09 cs.CV New submission

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced

AI summary: Studies whether vision-language models can predict the content of an image after a 180° rotation from the original image alone; introduces the RotOutBench benchmark and finds that models can recognize rotated content when shown it but cannot predict it.

Abstract

Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
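The transformation the models are asked to anticipate is simple to state: a 180° in-plane rotation reverses the order of the rows and the order of the cells within each row. The sketch below applies it to a small character grid; the helper name `rotate_180` is hypothetical, and the sketch only moves cells around, it does not model how each individual glyph looks when turned upside down.

```python
def rotate_180(grid):
    """Rotate a 2D grid (list of rows) by 180 degrees in the plane.

    Rows may be lists or strings; both support reversed slicing.
    """
    # Reverse the row order, then reverse each row.
    return [row[::-1] for row in grid[::-1]]


grid = ["ab", "cd"]
print(rotate_180(grid))
```

Applying the rotation twice returns the original grid, which gives a quick sanity check for any implementation.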

2606.07642 2026-06-09 cs.CV cs.CY New submission

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

Affiliations: University of Florida; University of South Florida; University of Illinois Urbana-Champaign

AI summary: Proposes an expert-guided retrieval-augmented framework that uses vision-language models to identify wheelchair accessibility barriers from Google Street View imagery; validation against GPS-derived wheelchair dwell behavior shows that VLM ratings partially align with mobility friction, while fine-grained barrier recognition remains limited.

Abstract

Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS New submission

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Affiliations: Nanyang Technological University

AI summary: Proposes the AVI-Bench benchmark, which evaluates the audio-visual intelligence of omni-modal LLMs through cross-modal tasks across three stages (perception, understanding, and reasoning); introduces AVI-Bench-PriSe to test primitive audio-visual sensation; reveals limitations of current models and builds a four-level AVI taxonomy.

Comments 31 pages, 8 figures, ICML 2026

Abstract

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

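2606.07647 2026-06-09 cs.CV cs.CL cs.LG New submission

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes Token-Level Visual-Sensitivity Steering (TLVS), which extracts token-level steering vectors and adaptively adjusts the steering strength so that hallucinations are suppressed only at critical decoding steps, improving over previous steering methods on multiple benchmarks.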

Abstract

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.

2606.07689 2026-06-09 cs.CV New submission

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Zheng Lian, Hao Wu, Yuan Gao, Xinyu Geng, Xin Wang, Pheng-Ann Heng

Affiliations: CUHK; LIGHTSPEED; PKU; Tongji University; THU; HKUST

AI summary: Proposes Struct-Searcher, a structural agentic workflow grounded in belief revision theory that maintains a multimodal structural graph for conflict-aware deep information seeking, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five backbones.

Abstract

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows are largely aligned with evidence accumulation models, which linearly aggregate evidence and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones, and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.
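The improvements above are stated as relative gains, which differ from gains in percentage points. Assuming the usual definition, (new - old) / old, the sketch below shows the difference on made-up accuracies; the numbers 0.20 and 0.2344 are purely illustrative and do not come from the paper.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


def absolute_improvement(baseline, improved):
    """Gain in raw accuracy (multiply by 100 for percentage points)."""
    return improved - baseline


# Hypothetical accuracies chosen so that the relative gain is 17.2%.
baseline, improved = 0.20, 0.2344
print(relative_improvement(baseline, improved))
print(absolute_improvement(baseline, improved))
```

On a low baseline, a double-digit relative gain can correspond to only a few percentage points of absolute accuracy, so both figures are worth checking when comparing systems.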

2606.07861 2026-06-09 cs.CV cs.AI New submission

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

Affiliations: University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary: Proposes the FineSightBench benchmark, which separates perception tasks from reasoning tasks at scales of 4-48 px; finds that VLM perception saturates around 12 px while reasoning remains limited even at larger scales, exposing fundamental deficiencies in fine-scale visual reasoning.

Comments 25 pages

Abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? As such, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4-48 px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12 px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

2606.07872 2026-06-09 cs.CV New submission

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

Affiliations: Imperial College London

AI summary: Proposes the VisualFLIP benchmark, which uses paired image perturbations to test whether multimodal LLMs truly depend on task-critical visual evidence; finds that correct prediction and evidence dependence can come apart.

Abstract

When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
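The two metrics described above can be sketched directly from their verbal definitions. This is one reading of the abstract, not the authors' evaluation code: pair accuracy counts pairs where both sides are answered correctly, and Collapse Rate is computed here per pair, among pairs with at least one correct side, as the share where the model gives the same non-empty answer to both images. The record fields (`gold_a`, `pred_a`, and so on) are hypothetical.

```python
def pair_metrics(pairs):
    """Return (pair_accuracy, collapse_rate) for a list of perturbation pairs.

    Each pair is a dict with gold_a, gold_b, pred_a, pred_b (hypothetical layout).
    """
    both_correct = 0
    solved_any = 0
    collapsed = 0
    for p in pairs:
        ok_a = p["pred_a"] == p["gold_a"]
        ok_b = p["pred_b"] == p["gold_b"]
        if ok_a and ok_b:
            both_correct += 1
        if ok_a or ok_b:
            solved_any += 1
            # Collapse: the same non-empty answer repeated for both images.
            if p["pred_a"] and p["pred_a"] == p["pred_b"]:
                collapsed += 1
    pair_accuracy = both_correct / len(pairs)
    collapse_rate = collapsed / solved_any if solved_any else 0.0
    return pair_accuracy, collapse_rate


pairs = [
    {"gold_a": "3", "gold_b": "4", "pred_a": "3", "pred_b": "4"},         # both sides solved
    {"gold_a": "yes", "gold_b": "no", "pred_a": "yes", "pred_b": "yes"},  # answer not updated
    {"gold_a": "left", "gold_b": "right", "pred_a": "up", "pred_b": "up"},  # neither solved
]
print(pair_metrics(pairs))
```

Because the gold answer flips within each pair, a repeated answer can be correct on at most one side, so a collapsed pair never counts toward pair accuracy.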

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM New submission

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

Affiliations: School of Computer Science & Tech, Huazhong University of Science and Technology; School of AI and Automation, Huazhong University of Science and Technology

AI summary: Proposes a training-free, two-stage cascaded Video RAG pipeline that decouples semantic retrieval from logical reasoning, targeting cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding.

Comments To be presented at ACL 2026 MAGMAR Workshop (Oral; Retrieval leaderboard No.1)

Abstract

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

2606.07962 2026-06-09 cs.CV New submission

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

Bin Zhu, Yanhao Jia, Kexin Zhao, Jie Wang, Munan Ning, Hao Li, Yuwei Niu, Tanqing Sun, Huangchong Yan, Mingjun Pan, Xinyi Wu, Qishen Yin, Yunyang Ge, Shuai Zhao, Li Yuan

Affiliations: Peking University, Shenzhen Graduate School; Peng Cheng Laboratory; Rabbitpre Intelligence; Nanyang Technological University; Tsinghua University

AI summary: Proposes the ChronoPhyBench benchmark, which conditions on video context and textual captions to force models to reason about physical states, revealing that the physical reasoning ability of current open-source models is still in its infancy.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in open-world reasoning and understanding. However, a critical ambiguity persists: it remains unclear whether these models genuinely synthesize cross-modal information to construct physically grounded reasoning chains, or if they merely exploit strong language priors to mask single-modality reliance, thereby hallucinating advanced multimodal capabilities. Motivated by this, and to rigorously mitigate language modality bias and shortcuts, we propose a novel multimodal Chronological Physical Dynamics Reasoning Benchmark, ChronoPhyBench, which unifies next state prediction with Visual Question Answering (VQA) paradigms by conditioning on historical video context and textual captions to force models to deduce subsequent physical states through both single image selection and the inherently more complex task of multiple frame chronological sorting. Concurrently, we construct a large-scale multimodal reasoning dataset curated using the ChronoPhyBench criteria, comprising over 10,000 long-form videos paired with meticulously annotated captions, totaling 5M tokens. Our experimental evaluations reveal a stark contrast to conclusions drawn by previous benchmarks. The capacity of current open-source models to perform physically grounded multimodal reasoning remains in its infancy. Ultimately, this work seeks to systematically stress-test the reasoning capabilities of multimodal models, quantify hallucination rates, and advance the development of Physical AI, thereby providing the community with a robust and transparent evaluation framework toward Artificial General Intelligence (AGI).

2606.08016 2026-06-09 cs.CV cs.AI cs.CL New submission

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang, Zhexiang Wang, Ziyue Yang, Danyang Zhang, Kunyao Lan, Zihan Zhao, Dingye Liu, Siqi Xiang, Lu Chen, Kai Yu

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institution; Huawei Technologies Ltd.; Nanyang Technological University; Jiangsu Key Lab of Language Computing

AI summary: Proposes IEA, a conversational image editing agent that learns to operate parameterized tools through three-stage multitask training and produces interpretable edit traces; it outperforms baselines on pixel distance and ROUGE-L, and in user studies it ranks best among tool-calling methods for instruction following while surpassing generative methods in perceptual quality.

Comments [CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent

Abstract

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

2606.08031 2026-06-09 cs.CV 新提交

Vision-Language Asymmetry in Bistable Image Captioning

双稳态图像描述中的视觉-语言不对称性

Arohan Agate

发表机构 * Arohan Agate

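AI Summary Using a behavioral baseline over 83 bistable stimuli and a sparse-autoencoder analysis, finds that both interpretations are active simultaneously in the vision tower, yet causal steering flips captions only on default-dominant stimuli, revealing an asymmetry between visual representation and language-side commitment.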

Comments Accepted at ICML 2026 Workshop on Philosophy of Machine Learning

详情
AI中文摘要

维特根斯坦的鸭兔图对视觉-语言模型提出了一个问题:当一个模型对模糊图像进行描述时,模型中的哪一部分决定了对某一方面的承诺?我们通过83个双稳态刺激的3,320次生成行为基线来解决这个问题,该基线在中性提示与强制选择提示下揭示了三种状态(默认主导、强制主导、强制平衡),然后使用我们在LLaVA-1.6-7B实际使用的CLIP层上训练的TopK稀疏自编码器(验证EV 0.93)来探测底层表示。在69个具有每方面特征池的双稳态刺激中,72%(50/69)在视觉塔处显示两个池同时激活,包括12/12的默认主导鸭/兔和7/8的强制平衡年轻/老年。在CLIP层22进行因果干预可以在默认主导刺激上翻转描述(在流畅性保护下兔翻转率为33%),但在任何测试系数下都无法翻转强制平衡年轻/老年的描述,尽管其视觉侧存在叠加。主导瓶颈位于视觉塔下游;视觉侧表示与语言侧承诺之间的差距是“看见”与“看作”区别的经验把手。我们还指出一个方法论注意事项:基于TopK SAE输出的秩统计需要经过结校正的排序以避免无声的行序偏差。

英文摘要

Wittgenstein's duck-rabbit poses a question for vision-language models: when a model captions an ambiguous image, where in the model is the commitment to one aspect made? We address this with a 3,320-generation behavioral baseline over 83 bistable stimuli that surfaces three regimes (default-dominant, force-dominant, force-balanced) under neutral vs forced-choice prompting, then probe the underlying representations using a TopK sparse autoencoder we train on the CLIP layer that LLaVA-1.6-7B actually consumes (validation EV 0.93). Across 69 bistable stimuli with both per-aspect feature pools available, 72% (50/69) show simultaneous activation of both pools at the vision tower, including 12/12 default-dominant duck/rabbit and 7/8 force-balanced young/old. Causal steering at CLIP layer 22 flips captions on default-dominant stimuli (33% rabbit-flip rate under a fluency guard) but cannot flip captions on force-balanced young/old at any tested coefficient, despite their vision-side superposition. The dominance bottleneck lives downstream of the vision tower; the gap between vision-side representation and language-side commitment is an empirical handle on the seeing/seeing-as distinction. We also flag a methodological note: rank-based statistics on TopK SAE outputs require tie-corrected ranking to avoid silent row-order bias.
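The methodological note about tie-corrected ranking can be illustrated with a small example. TopK sparse-autoencoder outputs are mostly exact zeros, so ties are the common case; the sketch below contrasts a naive positional ranking with average (tie-corrected) ranks. The function names and the activation values are hypothetical, and this is not the authors' code.

```python
def ordinal_ranks(values):
    """Naive ranking: ties are broken by position, so the result depends on row order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = float(rank)
    return ranks

def average_ranks(values):
    """Tie-corrected ranking: tied entries share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2.0  # mean of 1-based ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

acts = [0.0, 0.0, 0.7, 0.0, 0.3]   # sparse activations with three tied zeros
naive = ordinal_ranks(acts)        # zeros get ranks 1, 2, 3 purely by position
corrected = average_ranks(acts)    # zeros all share rank 2.0
```

With the naive version, reordering the rows changes which zero gets which rank, and that is how a rank statistic can pick up a silent row-order bias.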

2606.08034 2026-06-09 cs.CV cs.AI cs.CL 新提交

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Sci-Rho:面向STEM问题的多语言视觉基础符号基准

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

发表机构 * Independent Researcher(独立研究员) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Binus University(比努斯大学) Bandung Institute of Technology(万隆理工学院)

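AI Summary Proposes Sci-Rho, a multilingual, visually grounded dynamic benchmark for STEM problems with 4,242 templates and 42,420 instances. Evaluating 17 VLMs reveals a gap between worst-case and average accuracy, and smaller models degrade across languages.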

Comments 22 pages

详情
AI中文摘要

符号基准已成为评估模型在STEM相关问题微小修改下鲁棒性的关键方法。然而,现有符号基准大多局限于数学推理,缺乏视觉基础,且主要以英语为主。在这项工作中,我们引入了Sci-Rho(科学鲁棒性),一个面向视觉基础STEM问题的动态基准,涵盖五个学科和七种语言,包含由领域专家(包括奥林匹克奖牌得主)精心设计的4,242个问题模板(每种语言606个)。每个模板实现为可执行的Python代码,通过改变数值、视觉模式、几何形状、颜色方案和函数类型,生成多样但等价的问题实例,总共产生42,420个实例,每个实例都配有推理步骤和真实解决方案。我们评估了17个最先进的VLM,发现最差情况准确率(定义为模型在每种生成变体上均正确回答的问题模板比例)与平均准确率之间存在明显差距。我们还发现,较小的模型在不同语言上表现出显著的性能下降,而专有模型和较大模型保持鲁棒。步骤级评估反映了相同的趋势,揭示了平均F1与最差情况F1分数之间的显著差距。最后,我们对VLM注意力头的检查显示,图像标记与文本标记的相对注意力分配存在显著的跨语言变化。我们的工作强调了超越静态基准的评估作为衡量VLM质量指标的重要性。

英文摘要

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
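The two metrics contrasted above can be computed as follows. With 42,420 instances over 4,242 templates, each template has 10 generated variations; the toy data below uses 4 per template for brevity. The result layout and template names are hypothetical.

```python
def average_accuracy(results):
    """Fraction of all generated instances answered correctly."""
    flat = [ok for variants in results.values() for ok in variants]
    return sum(flat) / len(flat)

def worst_case_accuracy(results):
    """Fraction of templates answered correctly on every generated variation."""
    return sum(all(variants) for variants in results.values()) / len(results)

# Per-template correctness over generated variations (made-up values).
results = {
    "template_a": [True, True, True, True],
    "template_b": [True, False, True, True],
    "template_c": [True, True, True, False],
}
avg = average_accuracy(results)        # 10 of 12 instances correct
worst = worst_case_accuracy(results)   # 1 of 3 templates fully correct
```

A single failed variation removes the whole template from the worst-case count, which is why the two numbers can diverge sharply even when average accuracy looks high.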

2606.08035 2026-06-09 cs.CV 新提交

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

DyCo-RL: 用于视觉推理的动态跨模态协调

Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu, Minghao Qin, Teng Long, Zheng Liu, Nicu Sebe

发表机构 * University of Trento(特伦托大学) BAAI(北京智源人工智能研究院) Singapore Management University(新加坡管理大学) IQuest Research

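AI Summary Proposes DyCo-RL, which uses the Fisher-Rao geodesic distance to quantify within-modality attention shifts, enabling dynamic cross-modal coordination, and applies alignment-guided advantage reweighting during policy optimization to improve visual reasoning in multimodal large language models.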

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为增强多模态大语言模型(MLLMs)视觉推理的主要范式。然而,现有的RLVR方法主要针对推理结果进行优化,从根本上忽略了生成过程中所需的细粒度跨模态协调。通过token级分析和控制干预,我们揭示了在思维链(CoT)推理过程中,MLLMs经常无法在提取视觉证据和合成文本上下文之间动态交替——这种协调崩溃与推理失败存在因果关系。受这些发现的启发,我们提出了DyCo-RL,它将动态跨模态协调集成到RLVR优化中。具体来说,DyCo-RL使用Fisher-Rao测地距离来度量模态内注意力转移,将token分配到视觉导向或文本导向的功能角色。然后,它评估token实际注意力分配与其分配角色之间的一致性,利用该分数在策略优化期间进行对齐引导的优势重加权。大量实验表明,算法无关的DyCo-RL应用于Qwen2.5-VL-3B/7B时,在涵盖视觉中心和数学推理的七个基准测试中,一致地改进了四种代表性的RLVR算法。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
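For discrete probability distributions, the Fisher-Rao geodesic distance has a closed form, sketched below. How DyCo-RL builds the attention distributions it compares is not stated in the abstract, so the two vectors here are hypothetical normalized attention weights over the same set of positions.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two discrete distributions.

    On the probability simplex this equals 2 * arccos(sum_i sqrt(p_i * q_i)),
    ranging from 0 (identical) to pi (disjoint support).
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))  # guard against rounding outside [0, 1]
    return 2.0 * math.acos(bc)

# Attention over three positions at two consecutive decoding steps (made-up values).
prev_attn = [0.70, 0.20, 0.10]
next_attn = [0.10, 0.20, 0.70]
shift = fisher_rao_distance(prev_attn, next_attn)
```

Unlike KL divergence, this distance is symmetric and bounded, which makes it convenient as a per-token measure of how far the attention pattern has moved.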

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 新提交

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

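AI Summary Proposes the Robust-U1 framework, which combines supervised fine-tuning, reinforcement learning, and multimodal reasoning to give multimodal large language models an explicit visual self-recovery capability, reaching state-of-the-art robustness under real-world and adversarial corruptions.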

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解方面取得了显著成功,但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法,但它们存在局限性:黑盒特征对齐缺乏可解释性,而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题:MLLMs能否自行恢复受损的视觉内容?为此,我们提出Robust-U1,一种新颖框架,赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段:用于初始重建的监督微调、具有双重奖励(像素级SSIM和语义级CLIP相似度)的强化学习以对齐高视觉质量,以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明,Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性,并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实,高质量的视觉恢复直接提升了推理性能,将自恢复确立为鲁棒视觉理解的关键机制。源代码可在https://github.com/jqtangust/Robust-U1获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
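A rough sketch of the dual reward follows. It pairs a pixel-level SSIM term with a semantic-level similarity term. Several simplifications are assumptions made here: SSIM is computed over a single window covering the whole flattened image, the CLIP similarity is stood in for by a cosine similarity between two hypothetical embedding vectors, and the two terms are mixed with a hypothetical weight.

```python
import math

def global_ssim(x, y, dynamic_range=1.0):
    """SSIM computed over one window covering the whole (flattened) image."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dual_reward(recovered, clean, emb_recovered, emb_clean, pixel_weight=0.5):
    """Weighted sum of a pixel-level and a semantic-level term (weighting is hypothetical)."""
    pixel_term = global_ssim(recovered, clean)
    semantic_term = cosine_similarity(emb_recovered, emb_clean)
    return pixel_weight * pixel_term + (1.0 - pixel_weight) * semantic_term

clean = [0.1, 0.5, 0.9, 0.3]   # toy 4-pixel image in [0, 1]
good = [0.1, 0.5, 0.8, 0.3]    # close reconstruction
bad = [0.9, 0.1, 0.2, 0.7]     # poor reconstruction
reward_good = dual_reward(good, clean, [0.9, 0.1], [1.0, 0.0])
reward_bad = dual_reward(bad, clean, [0.1, 0.9], [1.0, 0.0])
```

The pixel term rewards fidelity to fine detail while the semantic term rewards preserving content, so a reconstruction has to do well on both to score highly.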

2606.08126 2026-06-09 cs.CV 新提交

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

一石三鸟:面向多VLM选择、自适应与集成的自适应最优传输

Qiyu Xu, Zhanxuan Hu, Yu Duan, Yonghang Tai, Huafeng Li, Quanxue Gao, Xiangyong Cao

发表机构 * Xi’an Jiaotong University(西安交通大学) Yunnan Normal University(云南师范大学) Xidian University(西安电子科技大学) Kunming University of Science and Technology(昆明理工大学)

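AI Summary Proposes OSTB, a training-free framework that estimates a consensus sample-class structure via self-adaptive optimal transport, jointly addressing model selection, target-domain adaptation, and prediction ensembling across multiple VLMs.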

详情
AI中文摘要

视觉语言模型(VLM)能够从语义类别描述中进行视觉识别,这使得它们在目标标注稀缺或不可用时具有吸引力。然而,大多数部署流程首先选择一个单一的VLM,然后将该模型适应到未标记的目标集。这种单骨干范式隐藏了一个关键假设:所选VLM已经与目标域兼容。在实际的跨域部署中,可能有多个通用和领域专用的VLM是可行的,但没有实例级目标标签可用于识别可靠的模型。因此,部署需要一个耦合的解决方案来进行模型选择、目标适应和预测集成。我们从系统级多VLM的角度重新审视这个问题。我们的核心观察是,上述三个决策依赖于同一个潜在对象:目标集中可信的样本-类别结构。不同的VLM可能编码不同的迁移偏差并产生冲突的预测,但它们的输出仍然可以为估计该结构提供互补证据。我们提出了一石三鸟(OSTB),一个基于自适应最优传输的无训练框架。给定一组冻结的候选VLM,OSTB在不更新VLM参数的情况下估计一个共识的样本到类别传输计划。然后,学习到的传输结构被重用于所有部署目标:通过排序共识计划引起的组合语义和视觉可靠性来进行模型选择;通过拟合传输条件视觉分类器获得目标适应;通过可靠性感知的概率集成实现集成。在自然图像、遥感和医学病理基准上的大量实验表明,OSTB在异构候选池下提高了模型排名、适应稳定性和集成鲁棒性。

英文摘要

Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
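To make the idea of a sample-to-class transport plan concrete, here is a minimal sketch using plain entropic optimal transport solved with Sinkhorn iterations. This is a stand-in, not OSTB's self-adaptive formulation, which the abstract does not specify. The consensus cost (mean negative log-probability across models), the uniform marginals, and the probability tables are all assumptions.

```python
import math

def consensus_cost(prob_tables):
    """Cost of assigning sample i to class j: mean negative log-probability across models."""
    n, m = len(prob_tables[0]), len(prob_tables[0][0])
    return [[-sum(math.log(t[i][j]) for t in prob_tables) / len(prob_tables)
             for j in range(m)] for i in range(n)]

def sinkhorn_plan(cost, row_marginal, col_marginal, epsilon=0.5, iterations=200):
    """Entropic optimal transport via Sinkhorn iterations (plain, not self-adaptive)."""
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n, m = len(cost), len(cost[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Zero-shot class probabilities from two frozen VLMs: 4 samples, 2 classes (made-up values).
vlm_a = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
vlm_b = [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]
cost = consensus_cost([vlm_a, vlm_b])
plan = sinkhorn_plan(cost, [0.25] * 4, [0.5] * 2)
labels = [max(range(2), key=lambda j: row[j]) for row in plan]
```

Each row of `plan` is a soft assignment of one sample to the classes, and the column constraint forces the classes to share the samples according to the assumed class prior.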

2606.08144 2026-06-09 cs.CV 新提交

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

IMAGINE:自适应模式-意象增强组合用于组合视频检索

Jiale Huang, Zixu Li, Zhiwei Chen, Zhiheng Fu, Chunxiao Wang, Yupeng Hu

发表机构 * Shandong University(山东大学) Qilu University of Technology (Shandong Academy of Sciences)(齐鲁工业大学(山东省科学院))

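AI Summary To address the mismatch between the implicit semantics of modification text and explicit visual content in composed video retrieval, proposes the adaptive schema-imagery enhanced compositional network (IMAGINE), which captures implicit concepts with dynamic multimodal prototypes and uses them to modulate visual features, achieving state-of-the-art performance on three benchmarks.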

Comments Accepted by ICMR 2026

详情
AI中文摘要

组合视频检索(CVR)旨在检索与参考视频经修改文本修改后匹配的目标视频。现有方法探索跨模态对应关系时,常假设修改对象直接出现在视频中。然而,修改文本常描述未明确呈现但通过语义相关视觉线索隐含表达的概念(例如,“蛋糕”暗示“生日派对”)。当前方法通常依赖在具体空间中对齐显式特征表示,忽略了关键的潜在关联。为解决此问题,我们提出自适应模式-意象增强组合网络(IMAGINE)。与标准显式匹配不同,IMAGINE通过动态多模态原型具体化隐含语义(称为模式意象)。这些原型捕获共享的潜在概念,自适应地调制视觉特征,有效将隐含引导注入检索过程。通过弥合显式视觉内容与隐含检索意图之间的差距,IMAGINE在三个广泛使用的基准上,在CVR和组合图像检索(CIR)中均达到最先进性能。

英文摘要

Composed Video Retrieval (CVR) is designed to retrieve a target video that matches a reference video modified by a modification text. While existing methods explore cross-modal correspondences, they often assume modified objects appear directly in videos. However, modification texts frequently describe concepts not explicitly presented but implicitly expressed through semantically related visual cues (e.g., "cake" implying "birthday party"). Current approaches typically rely on aligning explicit feature representations within the concrete space, neglecting critical latent associations. To address this, we propose an adaptIve scheMa-ImAGery enhanced composItional NEtwork (IMAGINE). Unlike standard explicit matching, IMAGINE materializes implicit semantics (termed schema imagery) via dynamic multimodal prototypes. These prototypes capture shared latent concepts to adaptively modulate visual features, effectively injecting implicit guidance into the retrieval process. By bridging the gap between explicit visual contents and implicit retrieval intentions, IMAGINE achieves state-of-the-art performance in both CVR and Composed Image Retrieval (CIR) across three widely used benchmarks.

2606.08231 2026-06-09 cs.CV 新提交

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

多模态基础模型中的测试时扩展:生成与推理的综合调查

Cong Wan, Ying He, Zhongzhan Huang, Hefeng Wu

发表机构 * Sun Yat-sen University(中山大学)

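AI Summary Presents the first systematic survey of test-time scaling (TTS) methods for multimodal foundation models, proposing a unified taxonomy (sampling-based, feedback-based, and search-based), summarizing applications and benchmarks, and discussing future directions.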

Comments Accepted by ACL 2026, Findings

详情
AI中文摘要

测试时扩展(TTS)已成为通过在推理过程中动态分配计算资源来增强模型性能的关键研究方向。最近的进展将这一范式应用于多模态基础模型(MFMs),释放了它们在多模态推理和生成方面的潜力。尽管进展迅速,该领域缺乏系统性的调查和统一的理论框架来描绘多模态TTS的发展格局。为填补这一空白,我们首次对MFMs的TTS研究进行了全面综述,提出了一个统一的分类框架,将现有方法归纳为三种不同策略:基于采样的、基于反馈的和基于搜索的方法。我们进一步总结了常用于评估多模态TTS在生成和推理任务中能力的代表性应用和基准。最后,本调查讨论了开放挑战并概述了未来研究方向,为这一快速发展的领域的后续研究提供了系统路线图。

英文摘要

Test-time Scaling (TTS) has emerged as a pivotal research direction for enhancing model performance by dynamically allocating computational resources during inference. Recent advancements have adapted this paradigm to Multimodal Foundation Models (MFMs), unlocking their potential in multimodal reasoning and generation. Despite rapid progress, the field lacks a systematic survey and unified theoretical framework to delineate the developmental landscape of multimodal TTS. To bridge this gap, we present the first comprehensive review of TTS research for MFMs, proposing a unified taxonomic framework that categorizes existing methodologies into three distinct strategies: sampling-based, feedback-based, and search-based approaches. We further summarize representative applications and benchmarks commonly utilized to evaluate multimodal TTS capabilities in generation and reasoning tasks. Finally, this survey discusses open challenges and outlines future research directions, providing a systematic roadmap for subsequent studies in this rapidly evolving field.
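As a small illustration of the sampling-based category, the sketch below shows two widely used selection rules for spending extra inference compute on multiple samples: majority voting over sampled answers and verifier-scored best-of-N. These are generic examples chosen here, not methods attributed to the survey, and the candidate lists and the scoring function are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency: return the most frequent answer among sampled candidates."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Verifier-guided selection: return the candidate with the highest score."""
    return max(candidates, key=score)

# Five sampled final answers to the same visual question (made-up values).
sampled_answers = ["12", "15", "12", "12", "9"]
voted = majority_vote(sampled_answers)

# Three sampled captions ranked by a stand-in scorer (length is a placeholder for a real verifier).
picked = best_of_n(["a cat", "a cat on a red sofa", "a sofa"], score=len)
```

Both rules scale with the number of samples drawn at inference time, which is the compute knob that test-time scaling turns.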

2606.08336 2026-06-09 cs.CV 新提交

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

超越原始信号:作为特权合成数据的未解码生成潜变量

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki

发表机构 * Politecnico di Milano(米兰理工大学) The University of Tokyo(东京大学)

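AI Summary Proposes Direct Latent Augmentation (DLA), which uses undecoded generative latents as privileged information, and Multilayer Explicit Simulated Synesthesia (MESSy), which transfers this dense knowledge to a purely visual student, avoiding the inefficiency of the decode-encode loop.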

详情
AI中文摘要

虽然多模态集成显著提升了计算机视觉模型,但部署它们会带来高昂的推理成本,并且需要稀缺且完美配对的数据集。近期方法通过生成式AI合成缺失模态来解决这一数据瓶颈,但它们引入了一个严重的低效问题:解码-编码循环。具体来说,信息丰富的生成潜变量被解码为噪声原始信号,迫使下游分类器浪费容量重新编码它们。为了绕过这一瓶颈,我们提出直接潜变量增强(DLA),直接利用未解码的生成潜变量作为特权信息。此外,为了将这种密集知识迁移到纯视觉学生模型,我们引入多层显式模拟联觉(MESSy)。MESSy 不使用强制表示匹配(这迫使学生扭曲其原生视觉特征以适应复杂的多模态拓扑),而是使用预测目标来安全地内化这些物理先验。实验结果表明,我们的框架显著优于原始数据增强和传统蒸馏。最终,我们的方法产生了高度准确的单模态学生模型,其具有“联觉”潜变量结构,这些结构本质上与它们从未直接观察到的物理属性对齐。

英文摘要

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.

2606.08464 2026-06-09 cs.CV 新提交

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

TVI-CoT: 面向多模态理解的文本-视觉交错思维链推理

Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

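AI Summary Proposes the TVI-CoT framework, which uses learnable control tokens to dynamically interleave textual reasoning with visual feature access, addressing the inability of multimodal LLMs to access visual features during reasoning, and reports state-of-the-art results among MLLM-based CoT methods on multiple benchmarks.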

Comments ICML2026

详情
AI中文摘要

思维链推理已被证明能有效增强大语言模型的问题解决能力。然而,当应用于多模态大语言模型时,现有的思维链方法存在一个根本性限制:它们在推理过程中完全基于文本进行,无法访问视觉特征。在初始视觉编码后,图像信息变得不可访问,迫使模型仅基于初始描述中捕获的内容进行推理,形成了一种“视觉盲推理”范式,限制了细粒度视觉提取、错误验证和自适应注意力。我们提出了文本-视觉交错思维链(TVI-CoT),这是一个通过可学习控制令牌<THINK>、<LOOK>和<ANSWER>实现文本推理与视觉特征访问显式交错的框架。这些令牌允许在推理和视觉定位之间动态切换,根据不断演化的推理状态关注相关的图像区域。在八个基准上的实验表明,该方法在多模态大语言模型思维链方法中达到了最先进的结果,并且相比基线有显著性能提升:在MMMU上提升6.1%,在MathVerse上提升3.8%,在MathVista上提升3.4%,在ScienceQA上提升3.4%。代码可在https://github.com/hulianyuyy/TVI-CoT获取。

英文摘要

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description, which forms a "vision-blind reasoning" paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through learnable control tokens <THINK>, <LOOK> and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and notable performance boost compared to the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.
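To show what an interleaved trace might look like, the sketch below splits a generated string into segments keyed by the three control tokens. In the paper the tokens steer decoding itself; this post hoc parser is only an illustration, and the trace format and example text are hypothetical.

```python
import re

CONTROL_TOKENS = ("<THINK>", "<LOOK>", "<ANSWER>")

def split_trace(trace):
    """Split a generated trace into (control_token, text) segments in order."""
    pattern = "(" + "|".join(re.escape(tok) for tok in CONTROL_TOKENS) + ")"
    parts = re.split(pattern, trace)
    # parts alternates: leading text, token, text, token, text, ...
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

trace = "<THINK>count the bars<LOOK>left panel of the chart<THINK>there are four<ANSWER>4"
segments = split_trace(trace)
look_steps = [text for tok, text in segments if tok == "<LOOK>"]
```

Each `<LOOK>` segment marks a point where the model would return to the image features rather than continue reasoning from text alone.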

2606.08511 2026-06-09 cs.CV 新提交

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

少看多思:面向高效多模态大语言模型的块级注意力跳过

Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

发表机构 * Xiamen University(厦门大学)

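AI Summary Targeting visual attention saturation in multimodal large language models, proposes the training-free Visual-Skip method, which achieves block-wise sparsity by selectively skipping redundant visual self-attention modules and uses lightweight calibration to select the task-optimal sparse path, reducing computation while retaining 94.16% to 100.31% of performance.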

详情
AI中文摘要

多模态大语言模型(MLLMs)由于长视觉标记序列的自注意力二次计算成本而面临显著的推理瓶颈。然而,我们识别出当前架构中的一个关键低效问题:视觉注意力饱和。我们的分析表明,视觉标记在早期层中迅速建立其空间结构和模态内关系,使得深层中的视觉-视觉自注意力在计算上变得冗余。相反,这些层中的前馈网络(FFNs)对于将视觉特征投影到不断演化的文本语义空间中仍然至关重要。利用这一洞察,我们提出了Visual-Skip(V-Skip),一种无需训练的推理范式,它将空间交互与语义演化解耦。V-Skip不是丢弃标记,而是通过选择性地绕过饱和的视觉自注意力模块来施加块级结构化稀疏性。此外,认识到不同的下游任务需要不同的推理深度,V-Skip采用轻量级、少样本校准来动态路由任务最优的稀疏路径。大量实验表明,V-Skip有效地绕过了冗余的视觉注意力以实现块级稀疏性,在各种MLLMs上保持了94.16%至100.31%的性能保留。最终,我们证明,为了更有效地推理,模型不需要丢弃它们所看到的内容——它们只需要在正确的深度“少看”即可。

英文摘要

Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that varying downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip effectively bypasses redundant vision attention to achieve block-wise sparsity, maintaining a 94.16% to 100.31% performance retention across diverse MLLMs. Ultimately, we prove that to reason more effectively, models do not need to discard what they see -- they simply need to "look less" at the right depth.
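The block-level idea can be sketched with a toy residual block in which visual tokens bypass the attention update but still pass through the feed-forward network. Everything here is a stand-in: tokens are scalars, attention is replaced by mean mixing, and the FFN is an element-wise map. A real implementation would also avoid computing the skipped attention rows, which this toy does not do.

```python
def mean_mixing_attention(tokens):
    """Toy stand-in for self-attention: every token receives the sequence mean."""
    avg = sum(tokens) / len(tokens)
    return [avg for _ in tokens]

def toy_ffn(tokens):
    """Toy stand-in for a feed-forward network: an element-wise map."""
    return [0.5 * t for t in tokens]

def block(tokens, is_visual, skip_visual_attention):
    """One residual block; optionally bypass the attention update for visual tokens."""
    attn_out = mean_mixing_attention(tokens)
    mixed = [
        t if (skip_visual_attention and vis) else t + a
        for t, a, vis in zip(tokens, attn_out, is_visual)
    ]
    # The FFN is always applied, matching the observation that FFNs stay essential.
    return [m + f for m, f in zip(mixed, toy_ffn(mixed))]

tokens = [1.0, 2.0, 3.0]
is_visual = [True, True, False]   # two visual tokens followed by one text token
dense = block(tokens, is_visual, skip_visual_attention=False)
sparse = block(tokens, is_visual, skip_visual_attention=True)
```

The text token is updated identically in both paths, while the visual tokens skip only the mixing step, so no token is discarded.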

2606.08708 2026-06-09 cs.CV 新提交

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

PRPO: 通过令牌级动态优势重塑的感知强化策略优化

Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu

发表机构 * Amap CV Lab, Alibaba Group(阿里巴巴集团高德地图计算机视觉实验室) Peking University(北京大学)

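AI Summary Proposes PRPO, a token-level reinforcement learning framework that identifies pivotal perceptual tokens with the Robust Visual Dependency (RVD) metric and strengthens their learning signal via Perceptual Advantage Reshaping (PAR), with average gains of 23.3% (3B model) and 21.1% (7B model) on seven multimodal reasoning benchmarks.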

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为提升大型视觉语言模型(LVLMs)推理能力的有效范式。然而,现有的RLVR方法主要依赖于轨迹级结果奖励,为所有生成的令牌分配相同的学习信号。这种粗粒度的信用分配从根本上与多模态推理不匹配,因为只有稀疏的子集令牌在因果上基于视觉证据。因此,这些关键的感知令牌受到弱监督,并且常常被语言先验或推理模板令牌淹没。为解决这一局限,我们提出感知强化策略优化(PRPO),一种令牌级强化学习框架,明确识别并强化长程多模态推理轨迹中的关键感知令牌。PRPO引入了鲁棒视觉依赖(RVD),一种原则性度量,用于识别预测既基于视觉又对扰动稳定的令牌,过滤掉脆弱或噪声视觉令牌。基于RVD,我们进一步提出感知优势重塑(PAR),一种令牌级信用分配技术,放大感知信息丰富的令牌,同时为非感知令牌保留稳定梯度。在七个多模态推理基准上的大量实验表明,PRPO在3B和7B模型规模上均持续优于强LVLM基线,分别实现了23.3%和21.1%的平均增益。PRPO以更高的训练效率和更强的跨任务泛化能力达到了最先进的性能。我们的发现强调了细粒度信用分配对于可扩展多模态强化学习的重要性。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
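The contrast between trajectory-level and token-level credit assignment can be sketched as below. The abstract does not give the PAR formula, so the reshaping rule (scaling the shared advantage by one plus a gain times the token's RVD score when that score clears a threshold) is purely an assumption for illustration, as are the parameter names and scores.

```python
def reshape_advantages(trajectory_advantage, rvd_scores, threshold=0.5, gain=1.0):
    """Amplify the shared advantage on tokens with high visual-dependency scores.

    Tokens below the threshold keep the unmodified trajectory-level advantage,
    so their gradient signal is left unchanged (hypothetical rule).
    """
    return [
        trajectory_advantage * (1.0 + gain * s) if s >= threshold else trajectory_advantage
        for s in rvd_scores
    ]

# One trajectory with advantage 0.8 and per-token RVD scores (made-up values).
rvd_scores = [0.1, 0.9, 0.4, 0.6]
token_adv = reshape_advantages(0.8, rvd_scores)
```

Under a plain outcome reward all four tokens would receive 0.8; after reshaping, the two visually grounded tokens carry a larger share of the update.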

2606.08719 2026-06-09 cs.CV 新提交

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

无图像思考:通过在线自我蒸馏内化视觉操作

Yishuo Cai, Jiahui Liu, Yuanxin Liu, Haobo Deng, Linli Yao, Yuhao Zheng, Kun Ouyang, Zhimo Li, Ziyue Wang, Xu Sun, Haoli Bai, Xiaohui Li

发表机构 * State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University(北京大学计算机学院多媒体信息处理国家重点实验室) Central South University(中南大学) University of Science and Technology of China(中国科学技术大学) Peking University(北京大学) Huawei Technologies(华为技术有限公司)

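AI Summary Proposes Imagine-OPD, an on-policy self-distillation framework that internalizes "Thinking with Images" visual reasoning as "Thinking with Imagination", generating internal visual cues without invoking external tools. It achieves the best average performance among compared models while significantly reducing inference overhead.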

详情
AI中文摘要

“用图像思考”已成为细粒度视觉推理的有效范式:通过显式放大相关区域并推理裁剪区域,模型可以访问从单个全局图像中难以恢复的局部证据。然而,这种优势伴随着冗余的工具调用和更长的推理轨迹。此外,当这种行为主要从结果奖励中学习时,产生的中间裁剪或视觉线索可能带有噪声,或者无法忠实地捕获任务相关的视觉证据。在这项工作中,我们探讨是否可以通过“用想象思考”来内化“用图像思考”的推理优势:这是一个内部过程,决定看哪里并想象更仔细检查会揭示什么视觉线索,而无需实际调用工具。我们提出Imagine-OPD,一种在线自我蒸馏框架,其中教师模型在训练期间扮演“用图像思考”推理者的角色:它接收来自标注区域的特权缩放证据视图,并监督模型自身的想象推理轨迹。Imagine-OPD不需要外部教师或高质量的想象演示。在视觉中心基准上的实验表明,Imagine-OPD在比较模型中实现了最佳平均性能,同时与“用图像思考”方法相比显著降低了推理开销。

英文摘要

"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of "Thinking with Images" can be internalized through Thinking with Imagination: an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a "Thinking with Images" reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with "Thinking with Images" methods.

2606.08894 2026-06-09 cs.CV cs.CL 新提交

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

推理视觉语言模型对语义视觉干扰具有鲁棒性吗?

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

发表机构 * University of Manchester(曼彻斯特大学) Marex Imperial College London(帝国理工学院)

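AI Summary To study how reasoning VLMs are affected by semantic visual distractions in real-world settings, proposes the Distract-Bench benchmark, finding that reasoning VLMs are less robust to semantic distractions than to perceptual degradation, and that distractions are often absorbed into the reasoning process and lead to wrong answers.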

详情
AI中文摘要

推理视觉语言模型(VLM)在复杂多模态任务上表现强劲,但可靠的现实应用需要处理比干净、精心策划的基准更混乱的视觉输入。现有工作主要通过输入损坏(如噪声、模糊和天气效果)来评估VLM的可靠性,这些损坏使视觉证据更难感知。这留下了一个关键可靠性失败模式未被充分探索:模型可能正确感知证据,却从看似合理但无关且分散注意力的证据中进行推理,并将此错误传播到最终答案。为填补这一空白,我们引入了Distract-Bench,一个用于评估VLM对语义视觉干扰鲁棒性的基准,定义为添加到输入中、保留真实答案但具有意义且与任务无关的视觉线索。我们全面评估了八个领先的开源和两个闭源VLM,涵盖传统视觉损坏和Distract-Bench。结果表明,Distract-Bench暴露了一种与视觉损坏不同的鲁棒性失败:推理VLM在感知退化下基本跟踪其非推理基础模型,但对语义干扰的鲁棒性始终较低。进一步分析表明,这些干扰常常进入VLM的推理过程,被当作证据,并导致错误答案。总之,这些发现重新定义了推理VLM的鲁棒性评估,将焦点从退化感知转向干扰,以实现可靠的现实世界视觉推理。我们的数据和代码可在https://github.com/Yizheng-Sun/Distract-Bench获取。

英文摘要

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

2606.08908 2026-06-09 cs.CV cs.AI 新提交

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

面向光刻缺陷检测的视觉-语言模型失败感知精炼

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

发表机构 * Hanyang University(汉阳大学) Korea University(高丽大学) Korea Institute of Industrial Technology(韩国生产技术研究院)

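AI Summary Proposes a two-stage vision-language framework: Qwen3-VL is first fine-tuned to detect defects, then a refinement module is trained to correct first-stage errors, improving detection reliability.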

Comments 6 pages, 3 figures

详情
AI中文摘要

半导体光刻检测需要可靠地检测微小图案缺陷,如桥接、毛刺、颈缩和污染。在本研究中,我们提出了一种两阶段视觉-语言框架,结合了初始缺陷检测与预测精炼。在第一阶段,使用LoRA微调Qwen3-VL作为视觉-语言适配器,从光刻图像中预测缺陷数量、缺陷类别和归一化边界框。然而,直接微调仍可能产生常见的测试时错误,包括误报、漏检和错误缺陷类型。为解决此限制,第二阶段使用第一阶段预测失败及其修正标签训练精炼模块,使模型能够审查和修正初始输出。通过从初始适配器失败的案例中学习,精炼过程改善了超越单阶段微调的缺陷推理。

英文摘要

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.
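The data flow between the two stages can be sketched as follows: first-stage predictions that disagree with the labels are collected as training examples for the refinement module. The prediction format (a class name plus a box normalized to [0, 1] in `(x0, y0, x1, y1)` order), the image identifiers, and the function names are assumptions, not details from the paper.

```python
def to_pixels(box, width, height):
    """Convert a box normalized to [0, 1] into integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))

def collect_refinement_examples(predictions, labels):
    """Keep only the images where the first-stage output differs from the label."""
    return [
        {"image_id": image_id, "initial": pred, "corrected": labels[image_id]}
        for image_id, pred in predictions.items()
        if pred != labels[image_id]
    ]

# First-stage outputs and ground truth for three images (made-up values).
predictions = {
    "img1": [("bridge", (0.10, 0.10, 0.20, 0.20))],
    "img2": [("burr", (0.40, 0.40, 0.50, 0.55))],   # wrong defect type
    "img3": [],                                      # missed defect
}
labels = {
    "img1": [("bridge", (0.10, 0.10, 0.20, 0.20))],
    "img2": [("pinch", (0.40, 0.40, 0.50, 0.55))],
    "img3": [("contamination", (0.60, 0.20, 0.70, 0.30))],
}
examples = collect_refinement_examples(predictions, labels)
pixel_box = to_pixels((0.25, 0.5, 0.75, 1.0), width=640, height=480)
```

Exact equality is a deliberately strict failure test; a practical pipeline would more likely match boxes with an overlap threshold before deciding that a prediction failed.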

2606.08948 2026-06-09 cs.CV cs.AI New submission

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

NutriMLLM:用于膳食微量营养素分析的多模态大语言模型

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

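Affiliations * Emory University

AI summary Addresses the unreliability of existing MLLMs for dietary micronutrient estimation: a decade of population-scale dietary recalls is used to generate about 1.1 million image-description-nutrient triplets, and fine-tuning Qwen3-VL and GLM-4.6V-Flash on them yields NutriMLLM, which achieves near-complete coverage of all 65 nutrients on real images, with the largest variant matching or exceeding proprietary models on most nutrients.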

Comments 35 pages, 10 figures, 1 table

AI Chinese abstract

从食物图像中全面估计膳食微量营养素可以改善临床营养护理,但训练此类模型需要将多样化食物与完整营养素谱相关联的大规模多模态数据集。我们首先证明,现有的多模态大语言模型(MLLMs),包括领先的专有模型,在此任务上不可靠。在五个模型家族和四个独立评估基准(ASA24、SNAPMe、FNDDS和NutriBench)上,模型经常弃权或返回统计上不合理的值。为了在没有昂贵专家标注的情况下解决这一差距,我们将十年人口规模的24小时膳食回顾重新用作文本到图像生成的结构化提示。该流程生成了约110万图像-描述-营养素三元组的合成语料库,每个三元组将生成的食品图像与完整的65种营养素标签配对。据我们所知,这是计划在发表后公开发布的最大合成食品图像语料库,具有全面的微量营养素标注。在此语料库上微调Qwen3-VL(2B/4B/8B/30B)和GLM-4.6V-Flash,得到了NutriMLLM,这是第一个专门用于全面膳食微量营养素估计的视觉语言模型家族。我们使用一个四组件框架评估这些模型,该框架分别测量弃权、幻觉、整体可用性和每种营养素的数值准确性。在真实食品图像上,每个NutriMLLM变体在所有65种营养素上实现了近乎完全的覆盖,并且最大的变体在大多数营养素上的准确率匹配或超过了专有基线(GPT-5、Gemini 3和Claude Sonnet 4.5)。这些结果表明,回忆驱动的合成监督可以使基于图像的全面微量营养素估计成为一个可处理的工程问题,并支持膳食评估、个性化营养指导和人口规模的微量营养素监测。

English abstract

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

2606.08959 2026-06-09 cs.CV cs.CL New submission

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

ChinaHeritaQA:面向中国世界遗产地的文化基础视觉问答数据集

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

Affiliations * LMU Munich; FAU Erlangen-Nuremberg; Munich Center for Machine Learning; University of Tübingen & Tübingen AI Center; Sun Yat-sen University; University of Copenhagen; University of Maryland, College Park

AI summary Introduces ChinaHeritaQA, a multimodal benchmark of 2,279 images and 14,133 bilingual multiple-choice questions spanning seven cognitive dimensions, for evaluating the cultural reasoning of vision-language models on World Heritage sites in China.

AI Chinese abstract

我们介绍了ChinaHeritaQA,这是一个多模态基准数据集,用于评估视觉语言模型(VLM)在中国联合国教科文组织世界遗产地上的文化推理能力。该数据集包含2279张野外图像,配以14133个双语(中文/英文)多项选择题对,涵盖七个认知维度,从基本身份识别到历史分期和建筑分析。在联合国教科文组织对齐的本体论指导下,并通过严格的人工注释验证,该数据集确保了语言质量和事实一致性。对最先进VLM的评估显示,虽然顶级模型在平均表现上超过人类,但出现了显著的任务级差异:模型在视觉识别方面表现出色,但在文化基础推理上存在困难。性能也因朝代和地区而异。ChinaHeritaQA揭示了强大的视觉检索能力并不能延伸到文化和历史理解。我们发布该数据集以支持未来关于文化感知多模态学习的研究。

English abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

2606.09033 2026-06-09 cs.CV cs.CL New submission

CRANE: Knowledge Editing for Reasoning MLLMs

CRANE:面向推理多模态大语言模型的知识编辑

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang

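Affiliations * University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), CASIA; Harbin Institute of Technology; Shandong University

AI summary Targets three failure modes of knowledge editing in reasoning MLLMs (structural collapse, cognitive dissonance, and shallow internalization) and proposes CRANE, a retrieval-augmented framework that needs no per-edit parameter modification and reaches high grounded success through a modality-aware dual-library retrieval system and a two-phase training strategy.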

Comments 10 pages, 5 figures

AI Chinese abstract

推理多模态大语言模型(MLLMs)的出现,即在生成答案前产生显式思维链(CoT)推理,为知识编辑带来了新挑战:在传统指标(教师强制准确率高达100%)下看似成功的方法,在检查模型推理过程时可能严重失败(基础成功率低至0%)。我们识别出三种失败模式:(1)结构崩溃,权重修改方法破坏CoT格式;(2)认知失调,模型的推理链基于视觉证据主动拒绝注入的编辑事实;(3)浅层内化,方法在精确查询上成功但在改写或多跳变体上失败。在推理MLLMs上,这些模式相互作用:泛化方法(FT、LoRA)触发格式崩溃,而无深度修改的方法无法泛化。为揭示这些失败,我们提出一种CoT感知评估协议,并构建ReasonEdit-Bench,包含冲突分层、多级探针和多跳可移植性测试。我们提出CRANE,一种检索增强框架,无需逐编辑参数修改。CRANE结合了模态感知双库检索系统和两阶段训练策略:监督微调(SFT)用于结构初始化,随后是带有认知路由奖励的GRPO,训练模型在视觉先验和注入编辑事实之间进行仲裁。在ReasonEdit-Bench上,CRANE在冲突场景中达到96.9%的基础成功率,多跳链中中间实体使用率为96.9%,文本局部性为97.6%,图像局部性编辑独立性为68.1%。在分布外MMEVOKE基准上,CRANE在黄金检索下达到87.0%。

English abstract

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

2606.09109 2026-06-09 cs.CV cs.IR cs.LG New submission

Driving Video Retrieval for Complex Queries with Structured Grounding

面向复杂查询的驾驶视频检索与结构化对齐

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

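Affiliations * NEC Laboratories, America; University of California, Riverside

AI summary Proposes STRIVE-D, which calibrates query rules on weakly labeled in-domain videos and fuses them with vision-language and keyword-based retrieval signals, delivering up to 84% relative improvement in top-1 accuracy for driving video retrieval.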

AI Chinese abstract

大规模视频检索是自动驾驶中数据整理和安全验证的核心,用户不仅希望找到场景,还希望找到诸如切入和急刹车等动态事件。现有的视觉语言和基于关键词的检索方法常常遗漏这些事件,因为相关的运动可能没有在文本中明确描述或通过词汇重叠捕获。基于规则的检索可以更直接地编码此类事件,但它是脆弱的:生成的或手工编写的规则在假设与真实驾驶数据不匹配时常常失败。我们提出了STRIVE-D,一种针对驾驶视频的数据校准检索框架。它使用弱标记的领域内视频来估计查询规则何时可靠,调整与观测数据不匹配的规则,并将校准后的规则分数与视觉语言和基于关键词的检索信号融合。在三个驾驶基准测试中,包括新发布的DrivingDojo上的人工标注事件数据,STRIVE-D相对于最先进方法在top-1准确率上实现了高达84%的相对改进。

English abstract

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
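The idea of estimating when a rule is reliable and then fusing its score with the other retrieval signals can be illustrated with a small sketch. This is one possible reading, not STRIVE-D itself: the precision-style reliability estimate, the linear fusion, and the names `rule_reliability` and `fused_score` are assumptions, since the abstract does not specify the formulas.

```python
def rule_reliability(rule_hits, weak_labels):
    """Share of rule firings that agree with the weak labels (a precision estimate)."""
    fired = [label for hit, label in zip(rule_hits, weak_labels) if hit]
    return sum(fired) / len(fired) if fired else 0.0


def fused_score(rule_score, vlm_score, keyword_score, reliability):
    """Weight the rule score by its reliability; split the remainder evenly.

    Assumption: a simple convex combination stands in for the paper's fusion.
    """
    other = 0.5 * (vlm_score + keyword_score)
    return reliability * rule_score + (1.0 - reliability) * other


# Hypothetical weakly labeled clips: did the "cut-in" rule fire, and the weak label.
hits = [True, True, False, True]
labels = [1, 0, 1, 1]
reliability = rule_reliability(hits, labels)  # 2 of 3 firings agree with labels
score = fused_score(rule_score=1.0, vlm_score=0.2, keyword_score=0.4,
                    reliability=reliability)
```

A rule that rarely agrees with the weak labels gets a low weight, so the ranking falls back on the vision-language and keyword scores.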

2606.09142 2026-06-09 cs.CV cs.AI New submission

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

通过视觉语言模型从自我中心视觉解码行人过街意图

Danya Li, Xiang Su, Yan Feng, Rico Krueger

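Affiliations * Technical University of Denmark; University of Helsinki; Delft University of Technology

AI summary Casts pedestrian crossing-intention prediction as a visual question answering task for vision language models (VLMs); with parameter-efficient fine-tuning and contextual cues such as ego motion, vehicle motion, and eye gaze, the best model achieves a 14.5% accuracy improvement over a transformer baseline on egocentric video, a new state of the art.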

AI Chinese abstract

自我中心视觉提供了人类感知和决策的第一人称视角,但其在交通安全预测方面的潜力尚未得到充分探索。在这项工作中,我们研究从短自我中心视频片段中解码行人过街意图。我们通过将任务表述为封闭式视觉问答(VQA)问题,并利用视觉语言模型(VLM)来预测行人的意图。我们首先在零样本设置下对三个系列的最先进VLM进行了基准测试,发现它们相对于随机猜测有适度提升,但表现出有限的高层次交通推理能力。基于这些发现,我们进一步使用参数高效微调将VLM适应于目标任务。我们的结果表明,微调后的模型显著优于其零样本对应模型,并在专门的基于Transformer的基线基础上实现了9%的准确率提升。最后,我们证明加入额外的上下文线索,包括自我运动、车辆运动和眼动,进一步提高了预测性能。特别是,由眼动和自我运动引导的微调Qwen3-VL-2B模型相比Transformer基线实现了14.5%的准确率提升,为自我中心行人意图解码建立了新的最先进水平。

English abstract

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

2606.09167 2026-06-09 cs.CV New submission

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

视觉-语言引导的高光谱目标跟踪:语义融合与上下文模板更新

Rui Yao, Yuhong Zhang, Kunyang Sun, Hancheng Zhu, Jiaqi Zhao, Zhiwen Shao, Abdulmotaleb El Saddik

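Affiliations * China University of Mining and Technology; University of Ottawa

AI summary Proposes VLHTrack, which mitigates spectral redundancy with a language-guided band selection module, integrates visual and linguistic features with a multimodal fusion module, and handles target deformation with a dynamic template update strategy, outperforming existing methods on HOT2023 and HOT2024.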

Comments 14 pages, 8 figures

AI Chinese abstract

高光谱目标跟踪(HOT)利用高光谱视频(HSV)提供的丰富光谱信息,为目标跟踪提供了巨大潜力。然而,从冗余光谱波段中高效提取和利用光谱信息仍然是一个基本挑战,严重限制了模型泛化能力和跟踪性能。此外,在动态场景中,目标常因遮挡和光照变化等因素出现剧烈外观变化,导致当前帧与模板之间产生大变形,这对现有时序建模方法构成重大挑战。本文提出VLHTrack,一种新颖的高光谱视觉-语言(VL)联合跟踪框架。具体而言,我们引入语言先验,通过设计语言引导波段选择模块(LBSM)来解决光谱冗余的基本挑战。LBSM利用大语言模型(LLM)描述建立语义到光谱的映射,从而减轻冗余并突出判别性光谱特征。随后采用多模态视觉-语言融合模块无缝整合视觉和语言嵌入,利用其互补优势学习连贯的跨模态表示。为解决长序列中的目标形变问题,我们提出通过动态模板更新与Mamba(DTUM)模块实现的动态更新模板特征策略。DTUM利用选择性状态空间建模学习帧间依赖关系以更新模板特征,确保在时间上下文引导下模板特征的高效演化。在HOT2023和HOT2024上的实验表明,VLHTrack优于最先进(SOTA)方法。

English abstract

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template. Such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address the fundamental challenge of spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template-feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update the template features, ensuring efficient template feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.

2606.09290 2026-06-09 cs.CV New submission

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Visual Para-Thinker++:用于视觉推理的单策略多智能体框架

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan

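Affiliations * Zhejiang University; Hunan University; Tianjin University; University of Chinese Academy of Sciences; Shanghai Jiao Tong University; Jilin University

AI summary Proposes Visual Para-Thinker++, in which one shared MLLM policy is instantiated as multiple role-conditioned agents that reason in parallel; combined with Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, it mitigates early perceptual commitment and hallucination in visual reasoning.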

AI Chinese abstract

视觉推理需要整合分布在区域、属性和关系中的证据,这使得单链推理容易产生早期感知承诺和幻觉。我们提出Visual Para-Thinker++,一个单策略多智能体框架,其中共享的MLLM策略被实例化为角色条件的主智能体、工作智能体和总结智能体。主智能体使用固定分配模式分解任务;工作智能体在上下文隔离下并行推理;总结智能体整合所有工作智能体的推理轨迹,而不是对最终标签进行多数投票。共享策略通过多智能体能力注入和角色解耦多智能体优化进行训练,为相应的token片段分配角色特定的奖励和优势,以减少协作角色之间的梯度冲突。原生推理引擎通过共享视觉前缀和KV缓存重用实现高效的多智能体展开。在V*、CountBench、RefCOCO系列和HallusionBench上,Visual Para-Thinker++始终优于单轨迹和推理时并行基线,在幻觉敏感的视觉推理上尤其表现出色。

English abstract

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
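Assigning role-specific rewards and advantages to the matching token segments can be pictured with a short sketch. It is only an illustration under stated assumptions: the group-normalized advantage (in the style of GRPO) is a guess at the estimator, which the abstract does not name, and `role_advantages` and `token_advantages` are hypothetical helpers.

```python
from statistics import mean, pstdev


def role_advantages(rollouts):
    """Normalize rewards separately for each role across a group of rollouts.

    rollouts: list of dicts mapping role name -> scalar reward.
    Assumption: advantage = (reward - group mean) / group std, per role.
    """
    result = [dict() for _ in rollouts]
    for role in rollouts[0]:
        rewards = [r[role] for r in rollouts]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i, value in enumerate(rewards):
            result[i][role] = (value - mu) / sigma if sigma > 0 else 0.0
    return result


def token_advantages(segments, advantages):
    """Broadcast each role's advantage over the tokens that role produced.

    segments: list of (role, n_tokens) in generation order.
    """
    return [advantages[role] for role, n_tokens in segments for _ in range(n_tokens)]


# Two hypothetical rollouts with separate rewards for two of the roles.
group = [{"main": 1.0, "worker": 0.0}, {"main": 0.0, "worker": 1.0}]
advantages = role_advantages(group)
per_token = token_advantages([("main", 2), ("worker", 3)], advantages[0])
```

Because each role is normalized on its own, a good decomposition by the Main Agent is not penalized when a Worker segment in the same rollout scores poorly.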

2606.09393 2026-06-09 cs.CV New submission

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

CapRL++:基于可验证奖励的统一强化学习用于密集图像和视频描述生成

Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin

Affiliations * Tsinghua University; University of Science and Technology of China; Microsoft; Shanghai AI Laboratory; Shanghai Innovation Institute; Alibaba Cloud; The Chinese University of Hong Kong

AI summary Proposes CapRL++, which applies Reinforcement Learning with Verifiable Rewards (RLVR) to multimodal captioning, using the accuracy of a vision-free language model answering questions from the caption as the reward; it improves dense caption quality across more than 20 benchmarks.

Comments 26 pages, 10 figures. Project page: https://github.com/InternLM/CapRL. arXiv admin note: text overlap with arXiv:2509.22647

AI Chinese abstract

图像和视频描述是连接视觉与语言领域的基础任务,在预训练大型视觉语言模型(LVLMs)中发挥关键作用。当前最先进的描述模型通常采用监督微调(SFT)训练,这种范式依赖于昂贵且不可扩展的标注,并常导致模型记忆特定真实答案,限制了其通用性和生成多样化、创造性描述的能力。为克服这些局限,我们提出将可验证奖励的强化学习(RLVR)应用于多模态描述的开放任务。我们引入描述强化学习++(CapRL++),一种新颖的无参考训练框架,通过效用重新定义描述质量:高质量描述应使非视觉语言模型能够准确回答关于相应视觉内容的问题。CapRL++采用解耦的两阶段流程,其中LVLM生成描述,目标奖励来自一个独立的、无视觉的LLM仅基于该描述回答多项选择题的准确率。在超过20个图像和视频基准上的评估表明,CapRL++提升了密集描述质量,并增强了基于描述的预训练在空间和时间理解等任务上的表现。在CapRL++标注的可扩展图像和视频描述数据集上预训练带来了显著的下游收益。此外,在描述质量评估的Prism框架内,使用CapRL++训练的紧凑模型在密集描述性能上可与Qwen2.5-VL-72B和Qwen3-VL-235B-A22B等大得多的模型相媲美。这些结果验证了CapRL++能有效训练模型生成可泛化、高保真的描述,为超越传统SFT的局限奠定了坚实基础。

English abstract

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
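The reward described above, the accuracy of a vision-free model answering multiple-choice questions from the caption alone, can be sketched in a few lines. This is a toy illustration rather than the CapRL++ code: `caption_reward` is a hypothetical name, and `keyword_answerer` is a deliberately crude stand-in for the separate LLM.

```python
def caption_reward(caption, questions, answer_fn):
    """Fraction of multiple-choice questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        predicted = answer_fn(caption, q["question"], q["options"])
        correct += predicted == q["answer"]
    return correct / len(questions)


def keyword_answerer(caption, question, options):
    """Toy stand-in for the vision-free LLM: pick the first option named in the caption."""
    text = caption.lower()
    for label, option in options.items():
        if option.lower() in text:
            return label
    return None  # the caption gives no basis for an answer


# Hypothetical questions about one image.
questions = [
    {"question": "What animal is shown?", "options": {"A": "dog", "B": "cat"}, "answer": "A"},
    {"question": "What color is the ball?", "options": {"A": "green", "B": "blue"}, "answer": "B"},
]
dense = caption_reward("A dog chases a blue ball.", questions, keyword_answerer)
sparse = caption_reward("A dog plays outside.", questions, keyword_answerer)
vague = caption_reward("An animal plays outside.", questions, keyword_answerer)
```

The more answerable detail a caption carries, the higher its reward, and the answering model never sees the image.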

2606.09826 2026-06-09 cs.CV cs.AI New submission

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

OmniGameArena: 一个统一的UE5基准测试,用于具有改进动态的VLM游戏智能体

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

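Affiliations * The University of Hong Kong; LIGHTSPEED; The Chinese University of Hong Kong; Tsinghua University

AI summary Introduces OmniGameArena, a unified benchmark of twelve UE5 games, together with the Improvement Dynamics Curve (IDC), a reflection harness for evaluating VLM agents' cold-start scores, improvement dynamics, and generalization to held-out task variants.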

AI Chinese abstract

视觉语言模型(VLM)智能体越来越多地部署在交互式游戏环境中。然而,针对VLM智能体的游戏基准通常报告每个(智能体,游戏)对的单次首次尝试分数,专注于单智能体单人游戏,并且缺乏统一的协议来评估异构智能体类别(商业VLM、开源VLM和专用游戏策略)在同一水平上。我们通过OmniGameArena填补了这些空白,这是一个包含12个新构建的Unreal Engine 5游戏的实时基准,涵盖单人(7个)、玩家对战(3个)和合作(2个)模式,具有统一的动作接口,以及改进动态曲线(IDC),这是一个智能体反思框架,其中使用工具的反思LLM在多个回合中自主优化有界技能提示。除了冷启动排行榜分数外,IDC还为每个(智能体,游戏)对揭示了两个额外的可观测指标:分数在反思回合中的演变方式,以及学习到的技能在保留任务变体上的表现。我们报告了12个VLM智能体在冷启动排行榜上的这些可观测指标,以及四个顶级智能体在IDC下的表现。

English abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

2606.07651 2026-06-09 cs.LG cs.CV Cross-listed

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

KITE:一种融合文本、图像和知识图谱的三模态假新闻检测Transformer

Kevin Patel, Shashi Bhushan Jha

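Affiliations * Department of Computer Science, University of West Florida

AI summary Proposes KITE, a tri-modal fake news detection framework that jointly models textual, visual, and knowledge representations and integrates them with cross-modal attention, significantly outperforming unimodal and bimodal baselines on benchmark datasets.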

AI Chinese abstract

随着多模态虚假信息日益复杂,无缝融合欺骗性文本、操纵性视觉和事实错误的主张,传统的假新闻检测方法已落后。大多数先前工作侧重于文本-图像融合,或将外部知识仅作为后处理步骤应用,限制了其检测更深层语义不一致的能力。在本文中,我们引入了KITE(知识集成文本-图像编码器),一种三模态假新闻检测框架,联合建模文本、视觉和事实知识表示。KITE利用Roberta [23,14]和CLIP [24]进行语言和视觉编码,同时图注意力网络(GAT)处理从Wikidata检索的结构化事实。KITE在多模态Transformer中使用跨模态注意力[9]来集成文本、视觉和知识特征,帮助理解每种模态如何相互关联。模态特定置信度分数与最终预测一起生成,通过指示哪种输入类型对决策影响最大来提供可解释性。在基准数据集上的评估表明,KITE显著优于单模态和双模态基线,特别是在涉及图像-文本不匹配或与外部知识矛盾的情景中。

English abstract

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it model how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.

2606.08046 2026-06-09 cs.AI cs.CV cs.LG Cross-listed

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

OSMGraphCLIP:从OpenStreetMap图学习全局位置表示

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis

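Affiliations * Harokopio University of Athens; National Technical University of Athens; Vienna University of Technology; National Observatory of Athens

AI summary Proposes OSMGraphCLIP, which learns global location embeddings from heterogeneous OpenStreetMap graphs using a multi-scale graph encoder and contrastive alignment, matching or exceeding satellite-based baselines on most downstream tasks across climate, ecology, and socioeconomic domains.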

AI Chinese abstract

我们提出了OSMGraphCLIP,一种CLIP风格的地理空间表示模型,从免费可用的OpenStreetMap(OSM)数据中学习全局位置嵌入。OSMGraphCLIP将地理环境表示为带类型的OSM特征的异构图,保留了道路、建筑物、土地利用区域和兴趣点之间的拓扑和语义关系。多尺度图编码器捕获细粒度的局部结构和更广泛的景观组成,并通过对比对齐目标监督球谐位置编码器。我们在涵盖气候、生态、社会经济指标、公共卫生、土地覆盖、生物多样性和野火预测等一系列下游地理空间回归和分类任务中评估了OSMGraphCLIP,并表明仅结构化OSM数据就支持跨领域的强全局位置表示。OSMGraphCLIP在大多数基准测试中达到或超过了基于卫星的基线,在社会经济和公共卫生任务中优势最为明显,因为OSM对建成环境的显式语义注释编码了卫星像素只能间接捕获的人类活动模式。在生态和环境任务中,尽管未使用地球观测数据,该模型仍与基于图像的方法保持紧密竞争。定性分析证实,学习到的嵌入连贯地组织了地理空间,仅从地图拓扑中恢复了生物群落边界、城市梯度和热带-温带区别。

English abstract

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical-temperate distinctions from map topology alone.
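A CLIP-style contrastive alignment objective is usually a symmetric InfoNCE loss over paired embeddings. The sketch below shows that standard form in plain Python, under the assumption that the graph and location encoders emit L2-normalized vectors; the temperature value and the function names are illustrative, and the paper's exact objective may differ.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    lse = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - lse for x in row]


def clip_loss(graph_emb, loc_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings.

    Row i of graph_emb and row i of loc_emb describe the same location.
    """
    n = len(graph_emb)
    logits = [
        [sum(a * b for a, b in zip(g, l)) / temperature for l in loc_emb]
        for g in graph_emb
    ]
    # Graph-to-location direction: each row should pick its own column.
    g2l = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Location-to-graph direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    l2g = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (g2l + l2g)


graph = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(graph, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_loss(graph, [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapped pairs give a large loss, which is the signal that pulls the location encoder toward the graph encoder's representation.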

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG Cross-listed

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣:面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

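Affiliations * School of Mechanics and Engineering Science, Peking University; Department of Automation, Tsinghua University

AI summary Addresses the saturation of vision tokens in the deeper layers of multimodal large language models with Dual-Path Vision Token Routing (DPVR-LF), which routes vision tokens at the saturation point into a one-layer trainable branch and fuses them back only at the final layer, preserving performance with about 3% trainable parameters while reducing computation.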

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

AI Chinese abstract

多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层对称Transformer骨干,并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性:图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析,我们观察到视觉令牌倾向于在中间层饱和。具体而言,文本到图像的注意力从第0层的0.68下降到第4层的0.07,并在第18层后稳定在0.04附近,而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发,我们提出了双路径视觉令牌路由(DPVR),一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF(晚期融合)在饱和点将视觉令牌路由到一个单层可训练侧分支,运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播,并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数,DPVR-LF在标准基准上保持了有竞争力的多模态性能,同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

English abstract

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
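The layer-wise text-to-image attention values quoted above could be measured in several ways, and the abstract does not give the definition. One common choice, sketched here as an assumption, is the average share of attention mass that text-token queries place on image-token keys in a given layer (after averaging over heads). The function name and the toy matrix are illustrative.

```python
def text_to_image_attention(attn, image_positions, text_positions):
    """Average attention mass that text queries place on image keys.

    attn: square matrix where attn[q][k] is the attention from query q to
    key k and each row sums to 1 (assumed already averaged over heads).
    """
    image = set(image_positions)
    masses = [
        sum(weight for k, weight in enumerate(attn[q]) if k in image)
        for q in text_positions
    ]
    return sum(masses) / len(masses)


# Toy layer: positions 0-1 are image tokens, positions 2-3 are text tokens.
layer_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
share = text_to_image_attention(layer_attn, image_positions=[0, 1], text_positions=[2, 3])
```

Computing this share for every layer gives a curve of the kind the abstract describes, where a sharp early drop followed by a plateau would mark the saturation point.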

2505.21457 2026-06-09 cs.CV cs.AI Updated version

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

ACTIVE-o3:通过纯强化学习赋予多模态大语言模型主动感知能力

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Mingyu Liu, Zheng Huang, Anzhou Li, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

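Affiliations * University of California, Berkeley

AI summary Proposes ACTIVE-o3, a GRPO-based reinforcement learning framework whose modular sensing-action design and dual-form reward let MLLMs autonomously learn efficient, accurate region-selection strategies, significantly improving active perception on open-world and domain-specific tasks.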

Comments Accepted to ICML 2026. Project page: https://aim-uofa.github.io/ACTIVE-o3

AI Chinese abstract

主动视觉,也称为主动感知,指主动选择观察位置和方式以收集任务相关信息。它是人类和高级具身智能体高效感知与决策的关键组成部分。随着多模态大语言模型(MLLM)成为机器人系统中的核心规划器,缺乏赋予MLLM主动感知能力的方法已成为一个关键缺口。我们首先对基于MLLM的主动感知任务进行了系统定义,并表明GPT-o3的缩放策略可视为一个特例,尽管它存在效率低和区域选择不准确的问题。为解决这些问题,我们提出ACTIVE-o3,一个基于GRPO构建的强化学习框架,赋予MLLM主动感知能力。利用模块化感知-动作设计和双形式奖励,ACTIVE-o3在没有显式区域选择监督的情况下自主学会高效且稳定的区域选择策略。我们进一步建立了一个全面的基准测试,涵盖开放世界任务(包括小目标和密集目标定位)以及领域特定场景(包括遥感、自动驾驶和交互式分割)。实验结果表明,与基线相比,ACTIVE-o3显著增强了主动感知能力。此外,我们表明该框架不仅保留了模型的通用理解能力,还可作为利用感知数据的代理任务,进一步提升在RealWorldQA和MME等基准测试上的性能。

English abstract

Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant information. It is a critical component of efficient perception and decision-making in humans and advanced embodied agents. With the rise of Multimodal Large Language Models (MLLMs) as central planners in robotic systems, the lack of methods for equipping MLLMs with active perception has become a key gap. We first provide a systematic definition of MLLM-based active perception tasks and show that GPT-o3's zoom-in strategy can be viewed as a special case, though it suffers from low efficiency and inaccurate region selection. To address these issues, we propose ACTIVE-o3, a reinforcement learning framework built on GRPO that equips MLLMs with active perception capabilities. Leveraging a modular sensing-action design and a dual-form reward, ACTIVE-o3 autonomously learns efficient and stable region selection strategies without explicit region-selection supervision. We further establish a comprehensive benchmark covering both open-world tasks, including small- and dense-object grounding, and domain-specific scenarios, including remote sensing, autonomous driving, and interactive segmentation. Experimental results demonstrate that ACTIVE-o3 significantly enhances active perception capabilities compared to baselines. Moreover, we show that our framework not only preserves the model's general understanding ability but can also serve as a proxy task for leveraging perception data, further improving performance on benchmarks such as RealWorldQA and MME.

2512.16349 2026-06-09 cs.CV cs.AI 版本更新

Collaborative Edge-to-Server Inference for Vision-Language Models

面向视觉-语言模型的协作式边缘到服务器推理

Soochang Song, Yongjune Kim

发表机构 * Department of Electrical Engineering, Pohang University of Science and Technology (POSTECH)(浦项科技大学(POSTECH)电气工程系)

AI总结 提出一种协作式边缘到服务器推理框架,通过两阶段选择性重传策略,在降低通信成本的同时保持视觉-语言模型的推理精度。

Comments 12 pages, 15 figures, 3 tables

详情
AI中文摘要

我们提出了一种面向视觉-语言模型(VLM)的协作式边缘到服务器推理框架,该框架在保持推理精度的同时降低了通信成本。在典型部署中,边缘设备(客户端)捕获的视觉数据被传输到服务器进行VLM推理。然而,传输全分辨率图像会产生高昂的通信成本。相反,为减轻通信开销而进行的激进缩小或过度压缩可能会丢弃细粒度细节,导致精度下降。为克服这一限制,我们设计了一个通信高效的两阶段框架。在第一阶段,服务器对缩小的缩略图(全局图像)进行推理,并量化输出令牌的最小熵。如果最小熵超过预定义阈值,服务器利用VLM的内部注意力识别感兴趣区域(RoI),并请求边缘设备发送该RoI的细节保留局部图像。然后,服务器通过联合利用全局和局部图像来细化其推理。这种选择性重传策略确保仅额外传输必要的视觉内容。实验结果一致证实,所提出的框架在跨多种VQA基准测试中显著降低了通信开销,同时保持了推理精度。

英文摘要

We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, transmitting full-resolution images incurs high communication cost. Conversely, aggressive downsizing or excessive compression to mitigate communication overhead can discard fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a communication-efficient two-stage framework. In the first stage, the server performs inference on the downsized thumbnail (global image) and quantifies the min-entropy of the output tokens. If the min-entropy exceeds a predefined threshold, the server identifies a region of interest (RoI) using the VLM's internal attention and requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is additionally transmitted. Experimental results consistently confirm that the proposed framework substantially reduces communication overhead while maintaining inference accuracy across diverse VQA benchmarks.
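
The first-stage decision described above can be sketched in a few lines. Min-entropy of a distribution is `-log2(max p)`, so a low value means the most likely token dominates. How the paper aggregates this quantity across output tokens is not stated in the abstract; taking the maximum over tokens is an assumption of this sketch, and the distributions and threshold are hypothetical.

```python
import math

def min_entropy(probs):
    """Min-entropy in bits of one token distribution: -log2(max p)."""
    return -math.log2(max(probs))

def needs_retransmission(token_distributions, threshold_bits):
    """True if the least confident output token exceeds the threshold.

    Aggregating with max over tokens is an assumption, not the paper's rule.
    """
    worst = max(min_entropy(p) for p in token_distributions)
    return worst > threshold_bits

# Hypothetical per-token distributions from inference on the thumbnail.
confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
```

Only when `needs_retransmission` returns true would the server go on to pick a region of interest and request the detail-preserved crop from the edge device.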

2604.02056 2026-06-09 cs.CV 版本更新

COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

COMPASS:通过代理令牌与共享空间实现面向泛在感知的完整多模态融合

Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang

发表机构 * Universität Bern(伯尔尼大学) Xi’an Jiaotong University(西安交通大学) Nanyang Technological University(南洋理工大学)

AI总结 COMPASS通过代理令牌和共享空间实现多模态融合,解决缺失模态导致的信息丢失和融合接口不匹配问题,提升多模态感知的鲁棒性。

详情
AI中文摘要

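多模态感知中的模态缺失不仅造成信息丢失,还会导致融合接口不匹配:在一组规范模态槽位上训练的融合头,在推理时必须面对不断变化的已观测模态子集。我们提出Compass,一种接口完整的融合框架,在预测之前恢复这一规范槽位结构。每个模态被分配一个固定的融合槽位:已观测模态以真实表示填充其槽位,缺失模态则由从已观测源估计得到的目标槽位补全表示填充。针对同一缺失槽位的多个源特定估计被聚合为单一的槽位填充表示,使同一个轻量级融合算子可以应用于任意的模态缺失模式。训练过程采用合成模态掩码、槽位兼容性监督和表示空间稳定化,使补全后的槽位与真实模态表示兼容,并对下游识别有用。在XRF55、MM-Fi和OctoNet上,Compass在多种单模态和多模态缺失设置下提升了鲁棒性,其中包括与插补、蒸馏和翻译式基线的受控比较。这些结果表明,保持融合接口是实现鲁棒多模态感知的一个简单而有效的原则。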

英文摘要

Missing modalities in multimodal sensing cause not only information loss but also a fusion-interface mismatch: a fusion head trained on a canonical set of modality slots must operate on changing observed subsets at inference time. We propose Compass, an interface-complete fusion framework that restores this canonical slot structure before prediction. Each modality is assigned a fixed fusion slot. Observed modalities populate their slots with real representations, while absent modalities are filled with target-slot completion representations estimated from the observed sources. Multiple source-specific estimates for the same missing slot are aggregated into a single slot filler, allowing the same lightweight fusion operator to be applied under arbitrary missing-modality patterns. Training uses synthetic modality masking, slot-compatibility supervision, and representation-space stabilization to make completed slots compatible with real modality representations and useful for downstream recognition. Across XRF55, MM-Fi, and OctoNet, Compass improves robustness under diverse single- and multiple-missing settings, including controlled comparisons against imputation, distillation, and translation-style baselines. These results suggest that preserving the fusion interface is a simple and effective principle for robust multimodal sensing.
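
A minimal sketch of the slot-completion idea, under stated assumptions: every modality has a fixed slot, observed modalities fill their own slot, and each missing slot is filled by aggregating estimates produced from the observed sources. The `estimators` mapping, the slot names, and the use of a plain mean as the aggregator are all hypothetical; the abstract only says the source-specific estimates are aggregated into a single slot filler.

```python
def complete_slots(observed, estimators, slots):
    """Return a representation for every slot, in canonical slot order.

    observed:   {slot_name: vector} for modalities present at inference.
    estimators: {(source_slot, target_slot): callable}, hypothetical
                cross-modal estimators standing in for learned modules.
    Assumes at least one modality is observed.
    """
    filled = {}
    for slot in slots:
        if slot in observed:
            filled[slot] = observed[slot]
            continue
        # One estimate of the missing slot per observed source.
        estimates = [estimators[(src, slot)](vec) for src, vec in observed.items()]
        dim = len(estimates[0])
        # Mean aggregation is an assumption of this sketch.
        filled[slot] = [sum(e[i] for e in estimates) / len(estimates) for i in range(dim)]
    return filled

slots = ["a", "b", "c"]
estimators = {
    ("a", "c"): lambda v: [2 * x for x in v],
    ("b", "c"): lambda v: [x + 1 for x in v],
}
observed = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
filled = complete_slots(observed, estimators, slots)
```

Because the output always contains every slot, the downstream fusion operator sees the same interface regardless of which modalities were missing.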

2604.27476 2026-06-09 cs.CV 版本更新

EdgeFM: Efficient Edge Inference for Vision-Language Models

EdgeFM:面向视觉-语言模型的高效边缘推理

Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

发表机构 * Go Further. AI School of Data Science Fudan University(复旦大学数据科学学院) RUYi Dynamics Co. Ltd(RUYi Dynamics有限公司) Independent Researcher(独立研究者)

AI总结 EdgeFM通过轻量级代理驱动框架,优化边缘部署的视觉-语言模型推理,提升跨平台性能和可移植性,实现比传统工具链更快的推理速度。

Comments Technical Report version

详情
AI中文摘要

视觉-语言模型(VLMs)在边缘工业应用中表现出强大的适用性,但其部署仍严重受制于对确定性低延迟以及资源受限条件下稳定执行的要求。现有框架要么依赖臃肿的通用设计,要么迫使开发者进入不透明、硬件特定的闭源生态系统,导致硬件锁定和较差的跨平台适应性。观察到现代AI代理可以高效搜索和调整配置,为标准LLM算子生成高度优化的低级内核,我们提出EdgeFM,一种轻量级、代理驱动的VLM/LLM推理框架,专为跨平台工业边缘部署设计。EdgeFM通过移除非必要功能来降低单次请求延迟,并将代理调优的内核优化封装为由可重用技能组成的模块化库。通过允许直接调用这些技能而不是等待闭源实现,它有效缩小了长期以来由专有工具链主导的性能差距。该框架原生支持包括x86和NVIDIA Orin SoC在内的主流平台,并实现了首个在国产Horizon Journey平台上的端到端VLA部署,增强了跨平台可移植性。在大多数情况下,其推理性能明显优于传统的供应商特定工具链,在NVIDIA Orin平台上相比TensorRT-Edge-LLM最高实现1.49倍加速。实验结果表明,EdgeFM提供了良好的端到端推理性能,为多样化的边缘工业场景提供了开源、生产级的解决方案。

英文摘要

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in limitation and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to 1.49 times speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

2606.07435 2026-06-09 cs.CV cs.CL 版本更新

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

唇读差距:VSR模型是否像人类唇读者一样感知视觉语音?

Rishabh Jain, Naomi Harte

发表机构 * Sigmedia Group(Sigmedia研究组) School of Engineering(工程学院) Trinity College Dublin(都柏林圣三一学院)

AI总结 通过对比VSR系统与人类在MaFI数据集上的表现,发现模型虽整体准确率更高,但错误模式与人类不同,主要依赖训练数据中的语言线索而非视觉感知。

Comments Accepted at INTERSPEECH 2026

详情
AI中文摘要

视觉语音识别(VSR)模型在基准测试中现已超越人类唇读者,但这样的进步是否建立了类人的视觉语音感知?为探究此问题,我们使用MaFI词级唇读数据集,在词、字符、音素和视位级别上比较了三个VSR系统与人类基线。尽管模型实现了更高的整体准确率,但它们在不同于人类的单词上成功和失败。仅给定少量初始音素的纯文本n-gram基线可与人类唇读相媲美。VSR词级错误始终能更好地通过训练词频而非词的视觉信息量来解释。视位准确率、混淆矩阵以及人类-模型相关性进一步表明,模型在人类认为最难的视位上获益最多,并且对视觉清晰度的依赖性弱得多。我们的工作表明,VSR系统主要依赖训练数据中的语言线索而非视觉感知,未能将视觉特征绑定为有意义的单词。

英文摘要

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

2503.22697 2026-06-09 q-bio.NC cs.AI cs.CV 版本更新

Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing

Brain2Text解码模型揭示视觉语义处理的神经机制

Feihan Feng, Jingxin Nie

发表机构 * Ministry of Education Center for Studies of Psychological Application(教育部心理应用研究中心) Center for Studies of Psychological Application(心理应用研究中心) Key Laboratory of Brain, Cognition and Education Sciences(脑认知与教育科学重点实验室) School of Psychology(心理学学院) Guangdong Key Laboratory of Mental Health and Cognitive Science(广东省心理健康与认知科学重点实验室)

AI总结 提出一种直接从fMRI信号解码自然图像语义描述的深度学习模型,揭示了高级视觉皮层在语义处理中的关键作用,并展示了类别特异性神经表征。

Comments 39 pages, 9 figures

详情
AI中文摘要

从神经活动解码感官体验以重建人类感知的视觉刺激和语义内容,仍然是神经科学和人工智能领域的挑战。尽管当前的脑解码模型取得了显著进展,但在与已建立的神经科学理论的系统整合以及探索潜在神经机制方面仍存在关键差距。在这里,我们提出了一种新颖的框架,直接将fMRI信号解码为所观看自然图像的文本描述。我们的新型深度学习模型在未使用视觉信息训练的情况下,实现了最先进的语义解码性能,生成了捕捉复杂场景核心语义内容的有意义描述。神经解剖学分析揭示了高级视觉皮层(包括MT+复合体、腹侧流视觉皮层和顶下小叶)在视觉语义处理中的关键作用。此外,类别特异性分析展示了语义维度(如生命度和运动)的细微神经表征。这项工作为大脑的语义解码提供了一个更直接和可解释的框架,为探究复杂语义处理的神经基础、完善对分布式语义网络的理解以及潜在开发脑启发语言模型提供了强大的新方法。

英文摘要

Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in neuroscience and artificial intelligence. Despite notable progress in current brain decoding models, a critical gap still persists in their systematic integration with established neuroscientific theories and the exploration of underlying neural mechanisms. Here, we present a novel framework that directly decodes fMRI signals into textual descriptions of viewed natural images. Our novel deep learning model, trained without visual information, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual cortices, including MT+ complex, ventral stream visual cortex, and inferior parietal cortex, in visual semantic processing. Furthermore, category-specific analysis demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This work provides a more direct and interpretable framework to the brain's semantic decoding, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining the understanding of the distributed semantic network, and potentially developing brain-inspired language models.

2504.02983 2026-06-09 cs.CL cs.CV 版本更新

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Hummus:幽默多模态隐喻使用数据集

Xiaoyu Tong, Zhi Zhang, Pia Sommerauer, Martha Lewis, Ekaterina Shutova

发表机构 * ILLC, University of Amsterdam, the Netherlands(阿姆斯特丹大学逻辑、语言与计算研究所,荷兰) Vrije Universiteit Amsterdam, the Netherlands(阿姆斯特丹自由大学,荷兰)

AI总结 提出幽默多模态隐喻数据集Hummus,基于不协调理论(Incongruity Theory)和概念隐喻理论设计标注方案,测试多模态大语言模型在检测和理解幽默多模态隐喻上的表现,发现现有模型仍存在困难。

详情
AI中文摘要

隐喻和幽默有许多共同点,隐喻是最常见的幽默机制之一。本研究关注多模态隐喻的幽默能力,该方向尚未得到学界应有的关注。我们从幽默的不协调理论(Incongruity Theory)、概念隐喻理论以及VU阿姆斯特丹隐喻语料库的标注方案中汲取灵感,开发了一种新的用于图像-标题对中幽默多模态隐喻使用的标注方案。我们创建了幽默多模态隐喻使用数据集Hummus,提供了从《纽约客》标题竞赛语料库中抽取的1000个图像-标题对的专家标注。利用该数据集,我们测试了最先进的多模态大语言模型(MLLMs)在检测和理解幽默多模态隐喻使用方面的能力。实验表明,当前MLLMs在处理幽默多模态隐喻时仍然存在困难,特别是在整合视觉和文本信息方面。我们在github.com/xiaoyuisrain/humorous-multimodal-metaphor-use发布数据集和代码。

英文摘要

Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and developed a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

2507.08064 2026-06-09 cs.MM cs.CV 版本更新

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

PUMA: 基于层剪枝的语言模型,用于具有模态自适应学习的高效统一多模态检索

Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie

发表机构 * Harbin Institute of Technology(哈尔滨工业大学)

AI总结 提出PUMA,通过层剪枝自蒸馏减少MLLM参数,并设计模态自适应对比学习损失(MAC-Loss)提升检索效率,在降低资源消耗的同时保持性能。

详情
AI中文摘要

随着多媒体内容的扩展,现实应用中对统一多模态检索(UMR)的需求日益增加。最近的工作利用多模态大语言模型(MLLM)来解决这一任务。然而,其庞大的参数量导致训练成本高、推理效率低。为此,我们提出PUMA:一种基于层剪枝的语言模型,用于具有模态自适应学习的高效统一多模态检索。我们的方法从结构和学习两个角度改进UMR。(1)在结构上,我们提出层剪枝自蒸馏,通过仅保留浅层来剪枝MLLM,同时从丢弃的深层中蒸馏特征作为教师信号。这减少了参数并保持了表示能力。(2)在学习方面,我们引入模态自适应对比学习损失(MAC-Loss),根据目标模态将批次内负样本分为更难的模态内和更易的模态间两组,分配不同的温度策略以增强学习效率。实验表明,我们的方法在保持强性能的同时显著减少了资源使用。

英文摘要

As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
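
To make the MAC-Loss description more concrete, here is a hedged sketch of a contrastive loss in which in-batch negatives get a different temperature depending on whether they share the target's modality. The exact functional form, the temperature values, and the choice to scale the positive with the intra-modality temperature are assumptions of this sketch; the abstract only states that the two groups of negatives receive different temperature strategies.

```python
import math

def mac_style_loss(pos_sim, neg_sims, neg_modalities, target_modality,
                   tau_intra=0.05, tau_inter=0.1):
    """InfoNCE-like loss with modality-dependent temperatures (illustrative).

    A lower temperature sharpens the penalty on harder intra-modality negatives.
    """
    pos = math.exp(pos_sim / tau_intra)
    denom = pos
    for sim, modality in zip(neg_sims, neg_modalities):
        tau = tau_intra if modality == target_modality else tau_inter
        denom += math.exp(sim / tau)
    return -math.log(pos / denom)

# Hypothetical cosine similarities for one query whose target is an image.
loss_intra = mac_style_loss(0.8, [0.5], ["image"], "image")
loss_inter = mac_style_loss(0.8, [0.5], ["text"], "image")
```

With equal similarity, the negative from the same modality as the target contributes more to the loss than the one from a different modality, which is the qualitative behavior the abstract describes.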

2510.03244 2026-06-09 cs.LG cs.AI cs.CV 版本更新

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

VFEM: 视觉特征赋能的多变量时间序列预测与跨模态融合

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Pengcheng Laboratory(鹏城实验室) Ant Group(蚂蚁集团) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳)) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出VFEM模型,利用预训练大视觉模型通过跨模态注意力融合视觉与时间特征,仅训练7.45%参数即可捕捉跨变量依赖,提升多变量时间序列预测性能。

详情
AI中文摘要

大型时间序列基础模型通常采用通道独立架构来处理不同的数据维度,但这种设计忽略了关键的跨通道依赖关系。同时,现有的跨模态方法主要依赖文本模态,使得视觉模型的空间模式识别能力在时间序列分析中未被充分探索。为了解决这些局限性,我们提出了VFEM,一种利用预训练大视觉模型(LVM)捕获复杂跨变量模式的跨模态预测模型。VFEM将多变量时间序列转换为视觉表示,使LVM能够感知通道独立模型未显式建模的空间关系。通过双分支架构,视觉和时间特征被独立提取,然后通过跨模态注意力融合,使两种模态的互补信息增强预测。通过冻结LVM并仅训练总参数的7.45%,VFEM在多个基准上取得了竞争性能,为多变量时间序列预测提供了新视角。

英文摘要

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

2603.12046 2026-06-09 eess.AS cs.CV cs.SD 版本更新

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Dr. SHAP-AV:通过Shapley归因解码音频-视觉语音识别中的相对模态贡献

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

发表机构 * Imperial College London, UK(帝国理工学院,英国) NatWest AI Research, UK(英国NatWest人工智能研究)

AI总结 本文提出Dr.SHAP-AV框架,通过Shapley值分析音频-视觉语音识别中模态贡献,揭示噪声环境下模型对视觉的依赖及音频贡献的稳定性,推动模态加权机制和Shapley归因作为标准诊断工具。

Comments Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV

详情
AI中文摘要

音频-视觉语音识别(AVSR)利用音频和视觉信息在噪声环境下实现鲁棒识别。然而,模型如何平衡这些模态仍不清楚。我们提出了Dr. SHAP-AV框架,利用Shapley值分析AVSR中的模态贡献。我们在两个基准、六个模型以及不同信噪比(SNR)水平上进行实验,并引入三种分析:全局SHAP(Global SHAP)用于衡量整体模态平衡,生成式SHAP(Generative SHAP)用于刻画解码过程中的贡献动态,时间对齐SHAP(Temporal Alignment SHAP)用于分析输入-输出对应关系。结果表明,模型在噪声下会转向更多依赖视觉,但即使在严重退化的条件下仍保持较高的音频贡献。模态平衡在生成过程中不断演变,时间对齐在噪声下依然成立,SNR是驱动模态权重的主导因素。这些发现揭示了持续存在的音频偏置,促使我们考虑专门的模态加权机制,并将基于Shapley的归因作为AVSR的标准诊断工具。

英文摘要

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
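
For intuition about Shapley attribution between modalities, the two-player case can be computed exactly. The sketch below treats audio and video each as a single player and takes four hypothetical scores (no input, audio only, video only, both) as the value function. This is a simplification for illustration; the paper's analyses presumably attribute at a finer granularity, and the numbers are made up.

```python
def two_modality_shapley(v_none, v_audio, v_video, v_both):
    """Exact Shapley values for a two-player game (audio, video).

    Each value averages the player's marginal gain over both join orders.
    """
    phi_audio = 0.5 * ((v_audio - v_none) + (v_both - v_video))
    phi_video = 0.5 * ((v_video - v_none) + (v_both - v_audio))
    return phi_audio, phi_video

# Hypothetical recognition scores under each input condition.
phi_audio, phi_video = two_modality_shapley(0.0, 0.6, 0.3, 0.9)
```

The two values always sum to `v_both - v_none`, so they can be read as each modality's share of the total gain.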

2. 具身智能、机器人与自动驾驶 46 篇

2606.07626 2026-06-09 cs.CV cs.AI 新提交

Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic

全方位视角:非结构化交通中基于等变特征学习的360度LiDAR感知设计与分析

Pranav Darshan, Raghuveer Narayanan Rajesh, M Uttara Kumari

发表机构 * RV College of Engineering(RV工程学院)

AI总结 针对非结构化城市交通中感知难题,提出结合扇形全景处理与旋转等变稀疏卷积的360度LiDAR感知框架,在印度城市交通数据集上验证了多类别检测性能。

详情
AI中文摘要

密集非结构化城市交通中的感知仍然是自动驾驶的主要挑战,原因是道路使用者种类繁多、频繁遮挡、不规则运动模式以及缺乏标准化的道路布局。尽管基于LiDAR的3D目标检测器在结构化驾驶场景中表现出色,但大多数是为有限视场设置开发和评估的,其在全环绕360度感知下的行为仍不明确。本文研究了用于自动驾驶的360度LiDAR感知流水线,特别关注全景感知、方位角扇形空间处理以及复杂城市场景中的变换等变特征提取。本文提出了一个实用的360度感知框架,将扇形全景处理与旋转等变稀疏卷积相结合,并在一个自定义的Ouster OS0 LiDAR数据集上评估其行为,该数据集收集自多样化的印度城市交通条件。结果显示,多个目标类别的检测总体稳定,其中汽车性能最强(92.02/90.51),公交车为80.53/76.34,卡车为78.59/74.16,而行人(67.45/61.02)、骑自行车者(73.21/69.54)和骑摩托车者(71.20/68.13)得分较低,反映了在密集城市场景中检测更小且更多变的道路使用者的更大难度。

英文摘要

Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users, frequent occlusions, irregular motion patterns, and the lack of standardized road layouts. Although recent LiDAR based 3D object detectors have shown strong performance in structured driving scenarios, most are developed and evaluated for limited field of view settings, and their behavior under full surround 360-degree sensing is still not well understood. This paper studies a 360-degree LiDAR perception pipeline for autonomous driving, with particular attention to panoramic sensing, azimuthal sector wise spatial processing, and transformation equivariant feature extraction in complex urban scenes. The paper presents a practical 360-degree perception framework that combines sector wise panoramic processing with rotation equivariant sparse convolutions and evaluates its behavior on a custom Ouster OS0 LiDAR dataset collected across diverse Indian urban traffic conditions. The results show generally stable detection across several object classes, with the strongest performance for cars at 92.02/90.51, buses at 80.53/76.34, and trucks at 78.59/74.16, while lower scores for pedestrians at 67.45/61.02, cyclists at 73.21/69.54, and motorcyclists at 71.20/68.13 reflect the greater difficulty of detecting smaller and more variable road users in dense urban scenes.
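
The azimuthal sector-wise processing mentioned above starts from a simple geometric step: binning each LiDAR point by its horizontal angle around the sensor. The sketch below shows that binning only. The number of sectors and the point format `(x, y, z)` are assumptions, and the equivariant feature extraction itself is not modeled.

```python
import math

def sector_index(x, y, num_sectors):
    """Index of the azimuth sector containing the point, counted from the +x axis."""
    azimuth = math.atan2(y, x) % (2 * math.pi)  # map to [0, 2*pi)
    width = 2 * math.pi / num_sectors
    # Clamp guards against rounding at the upper boundary.
    return min(int(azimuth / width), num_sectors - 1)

def split_into_sectors(points, num_sectors=8):
    """Group (x, y, z) points into lists, one list per azimuth sector."""
    sectors = [[] for _ in range(num_sectors)]
    for point in points:
        sectors[sector_index(point[0], point[1], num_sectors)].append(point)
    return sectors

# Four hypothetical points, one per quadrant around the sensor.
points = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.2), (-1.0, -1.0, 0.1), (1.0, -1.0, 0.3)]
quadrants = split_into_sectors(points, num_sectors=4)
```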

2606.07895 2026-06-09 cs.CV cs.RO 新提交

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

TBD-VLA: 时序块扩散视觉语言动作模型

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 提出TBD-VLA框架,通过时序块扩散机制实现离散令牌VLA模型的并行动作生成,兼顾时序连贯性与推理速度,在仿真和真实任务中优于先前方法。

详情
AI中文摘要

离散视觉-语言-动作(VLA)模型通常将动作生成建模为离散动作空间上的下一个令牌预测,每个令牌自回归地依赖于先前的上下文。虽然有效,但这种范式会导致高推理延迟,并且很大程度上忽略了动作轨迹中固有的时间结构。最近的工作引入并行解码以提高效率,实现更快的推理,但缺乏建模令牌依赖关系的显式机制。我们提出TBD-VLA,一种基于离散令牌的VLA框架,它结合了块扩散以实现时序动作生成。我们将动作序列划分为时间块,并在每个块内执行掩码离散扩散,同时保持跨块的自回归生成。这种设计统一了时序自回归和并行动作解码,实现了强时序连贯性和改进的推理速度。此外,显式的时序建模通过时序修补实现了动作块(例如实时分块)的异步执行。TBD-VLA在仿真和真实世界的操作任务中显著优于先前的VLA方法,为走向快速、时序感知的离散VLA模型提供了一条可扩展的路径。项目网页:https://tbd-vla.github.io/

英文摘要

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
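
The decoding pattern described above, autoregressive across blocks and parallel within a block, can be sketched with a toy predictor. Everything model-related here is hypothetical: `toy_predict` stands in for the VLA model, and unmasking the most confident positions first is a common masked-decoding heuristic assumed for this sketch, not a detail confirmed by the abstract.

```python
MASK = None

def decode_block(predict, context, block_len, steps):
    """Fill one block by iterative unmasking, most confident positions first."""
    block = [MASK] * block_len
    per_step = -(-block_len // steps)  # ceiling division
    while MASK in block:
        proposals = predict(context, block)  # one (token, confidence) per position
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = proposals[i][0]
    return block

def generate(predict, num_blocks, block_len, steps):
    """Autoregressive over blocks: each block conditions on all earlier blocks."""
    context = []
    for _ in range(num_blocks):
        context.extend(decode_block(predict, context, block_len, steps))
    return context

def toy_predict(context, block):
    """Hypothetical model: token equals absolute position, confidence decays."""
    base = len(context)
    return [(base + i, 1.0 / (1 + i)) for i in range(len(block))]

actions = generate(toy_predict, num_blocks=2, block_len=4, steps=2)
```

With these settings the predictor is called twice per block, so eight action tokens need four model calls instead of the eight that token-by-token decoding would require.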

2606.08242 2026-06-09 cs.CV 新提交

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Light-WAM:基于状态融合动作解码的高效世界动作模型

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

发表机构 * Wuhan University(武汉大学) Shanghai Innovation Institute(上海创新研究院) Southeast University(东南大学) Fudan University(复旦大学) East China Normal University(华东师范大学)

AI总结 提出轻量级世界动作模型Light-WAM,通过紧凑视频骨干和降维潜空间未来视频监督降低训练成本,并引入状态融合动作专家实现高效动作预测,在LIBERO和RoboTwin 2.0上取得良好性能。

详情
AI中文摘要

世界动作模型(WAM)通过将未来预测作为额外训练目标来扩展机器人策略学习,鼓励策略在其表示中编码任务相关的时间结构。当前的WAM通常依赖大规模生成架构,导致高训练成本和推理延迟,难以部署为高效的闭环策略。我们提出Light-WAM,一种轻量级的世界动作模型,用于高效的机器人操作。具体来说,它采用紧凑的视频骨干网络,并在降维的潜空间中进行未来视频监督,降低了视频协同训练的成本,同时保留了其对表示学习的益处。对于动作预测,Light-WAM引入了状态融合动作专家(StateFusionActionExpert),该专家从多个骨干层读取适应后的状态,通过可学习查询池化进行融合,并在单次前向传播中直接预测动作块。这种设计为视频骨干表示与机器人动作之间提供了高效接口,避免了繁重的生成式动作专家。实验表明,Light-WAM在LIBERO上保持强劲性能,在RoboTwin 2.0上实现了可用的多任务性能,同时仅使用0.44B可训练参数。它还实现了72.03ms的推理延迟,峰值GPU内存为4.1GiB,并提高了训练吞吐量。

英文摘要

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.

2606.08277 2026-06-09 cs.CV 新提交

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

自信记忆:具有概率保证的时空记忆不确定性量化

Harry Zhang, Nicolas Gorlo, Luca Carlone

发表机构 * MIT(麻省理工学院)

AI总结 针对机器人长期操作中VLM描述噪声大、视角不一致的问题,提出目标级语义不确定性评分,并集成到UQ-DAAAM系统中,通过主动选择高质量视图和融合多视角描述来降低不确定性,同时提供概率保证。

详情
AI中文摘要

长期机器人操作需要时空记忆来记录环境状态并在下游推理中回忆。场景图和检索增强系统将VLM描述锚定到持久的3D实体,并带有丰富的语义描述。然而,VLM描述存在噪声且视角不一致,现有系统将其视为神谕,没有机制检测不可靠的存储描述。我们引入了多视角VLM记忆的目标级语义不确定性:一种衡量目标中心跨视角语义描述分散度并识别语义未解决目标的分数。然后,我们将不确定性分数集成到一个高级空间语义记忆系统中,称为UQ-DAAAM。UQ-DAAAM利用该分数,在固定查询预算下通过选择高质量视图并将多视角描述融合为单一目标描述,主动优化不确定目标。我们还推导了概率保证,表明更高质量的候选视图(根据我们的方法选择)更有可能降低不确定性。实验表明,不确定性量化可以使具身4D记忆系统更可靠、更有效。特别是在OC-NaVQA基准上,UQ-DAAAM相比基线实现了显著更大的不确定性降低和更好的时空问答性能。

英文摘要

Long-horizon robot operation requires spatio-temporal memory to record the environment state and recall it for downstream reasoning. Scene graphs and retrieval-augmented systems ground VLM descriptions to persistent 3D entities with rich semantic descriptions. However, VLM captions are noisy and viewpoint-inconsistent, and existing systems treat them as an oracle with no mechanism to detect unreliable stored descriptions. We introduce object-level semantic uncertainty for multi-view VLM memory: a score that measures object-centric cross-view semantic scatter of captions and identifies semantically unresolved objects. Then, we include our uncertainty scores in an advanced spatial-semantic memory system, that we dub UQ-DAAAM. UQ-DAAAM uses this score to actively refine uncertain objects under a fixed query budget by selecting high-quality views and fusing the resulting multi-view captions into a single object description. We also derive probabilistic guarantees showing that higher-quality candidate views (as selected by our approach) are more likely to reduce uncertainty. Our experiments show that uncertainty quantification can make embodied 4D memory systems more reliable and more effective. In particular, on the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.
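
One simple way to picture a cross-view semantic scatter score is as the average disagreement between caption embeddings of the same object seen from different views. The sketch below uses one minus the mean pairwise cosine similarity. This particular formula and the toy embeddings are assumptions for illustration; the abstract does not define the score.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_scatter(embeddings):
    """1 - mean pairwise cosine similarity of per-view caption embeddings.

    Returns 0.0 for a single view, where no disagreement can be measured.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical caption embeddings for two objects, three views each.
consistent = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
conflicting = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
```

Objects with a high score would be the candidates for additional views under a fixed query budget.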

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO 新提交

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

FiberTune: 在视觉-语言-动作微调中保留动作纤维视觉残差

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation(河北省认知智能重点实验室,雄安创新研究院) Hebei University of Technology(河北工业大学) Beijing Information Science and Technology University(北京信息科技大学)

AI总结 提出FiberTune,通过在线动作探针过滤动作预测特征方向,对齐教师视觉残差并正则化有效秩,在六个仿真和实物任务中提升VLA策略性能。

Comments Project page: https://fibertune.github.io/

详情
AI中文摘要

动作监督的视觉-语言-动作(VLA)策略微调能有效拟合演示,但只约束那些会改变预测动作的方向,使得在动作等价状态之间保持一致的视觉结构可以自由坍缩。我们将此形式化为沿局部动作纤维的残差视觉坍缩,并提出FiberTune,一种训练时目标,在不增加推理开销的情况下保留教师结构的视觉残差。FiberTune使用在线动作探针估计动作预测特征方向,将这些方向从中间视觉令牌表示中滤除,并将得到的探针过滤残差与冻结的视觉教师对齐,同时正则化其有效秩。在相同训练条件下,FiberTune在跨越两个基准和两种架构(pi_0.5和OpenVLA-OFT)的全部六个受控仿真设置以及物理SO-101拾取放置任务中,均优于仅使用任务损失的微调;代表性提升包括长时域CALVIN ABC-to-D上SR(5)提高10.7个百分点,物理SO-101任务成功率从72.7%提升至78.1%。残差诊断显示,这些增益与探针过滤残差的教师对齐度和有效秩的提高同时出现,与动作纤维的动机一致。

英文摘要

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
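
Two ingredients named in the abstract can be illustrated in isolation: filtering action-predictive directions out of a feature vector, and measuring effective rank. The sketch assumes the probe directions are already orthonormal and uses the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values). Whether the paper uses exactly this definition is not stated in the abstract.

```python
import math

def remove_directions(feature, directions):
    """Subtract the projection of `feature` onto each orthonormal direction."""
    residual = list(feature)
    for d in directions:
        coeff = sum(f * x for f, x in zip(feature, d))
        residual = [r - coeff * x for r, x in zip(residual, d)]
    return residual

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Hypothetical feature and a single action-predictive direction (the first axis).
residual = remove_directions([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0]])
```

A flat singular-value spectrum gives an effective rank equal to the number of values, while a spectrum dominated by one value gives a rank near one, which is the collapse the regularizer is meant to discourage.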

2606.08680 2026-06-09 cs.CV cs.RO 新提交

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

畸变感知的PETR用于混合针孔-鱼眼相机的BEV目标检测

Xiangzhong Liu

发表机构 * fortiss GmbH(fortiss有限公司)

AI总结 针对鱼眼相机径向畸变破坏BEV检测器均匀采样假设的问题,提出DAPETR,通过畸变感知位置编码和双向特征-几何协同调制模块,在KITTI-360基准上优于基线方法,并揭示了学习适应与显式几何重参数化之间的冲突。

Comments 8 pages, 5 figures, accepted at ICRA 2026

详情
AI中文摘要

鱼眼相机因其低成本和高覆盖视野(FOV)而被广泛部署于自动驾驶感知套件中,但其在3D目标检测中的潜力仍未得到充分利用。严重的径向畸变通过违反均匀采样的基本假设,对大多数BEV检测器构成挑战。为弥补这一差距,我们提出了畸变感知PETR(DAPETR),一种专为混合针孔-鱼眼相机设置设计的无投影检测器。DAPETR包含两个关键的学习自适应模块:一个统一的畸变感知位置编码,将图像表示的位置编码与鱼眼几何协调一致;以及一个双向特征-几何协同调制模块,使图像特征和3D位置编码相互适应。在我们转换的KITTI-360基准上的实验中,我们系统地将我们的学习自适应方法与极坐标下的PETR(PolarPETR)进行了比较。我们发现,尽管两种方法都优于基线,但我们的学习模块实现了更优的性能。关键的是,我们发现了两种策略结合时的负面交互,表明学习适应和显式几何重参数化可能冲突。我们的最终DAPETR模型显著推进了鱼眼BEV检测的研究和基准,为除图像校正外的有效畸变感知3D感知设计提供了关键见解。

英文摘要

Fisheye cameras are widely deployed in autonomous driving perception suites for their low cost and full-coverage field of view (FOV), yet their potential remains underleveraged in 3D object detection. Severe radial distortion challenges most BEV detectors by violating the fundamental assumption of uniform sampling. To bridge this gap, we propose Distortion-Aware PETR (DAPETR), a projection-free detector tailored for mixed pinhole-fisheye camera setups. DAPETR incorporates two key learned-adaptive modules: a unified distortion-aware positional embedding that harmonizes positional encodings for image representations with fisheye geometry, and a bidirectional feature-geometry co-modulation module that mutually adapts image features and 3D positional embeddings. In our experiments on a converted KITTI-360 benchmark, we systematically compare our learned adaptive approach against PETR in polar coordinates (PolarPETR). We find that while both methods improve over the baseline, our learned modules achieve superior performance. Crucially, we uncover a negative interaction when combining both strategies, revealing that learned adaptation and explicit geometric reparameterization can conflict. Our final DAPETR model significantly advances the research and benchmark for fisheye BEV detection, providing critical insights into effective distortion-aware 3D perception design other than image rectification.

2606.08684 2026-06-09 cs.CV 新提交

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

BLUE:迈向自动驾驶高效视觉-语言-动作模型中更好的语言使用

George Ling, Lijin Yang, Hao Yang, Zhongzhan Huang

发表机构 * Bosch Research(博世研究院)

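AI Summary Proposes BLUE, a lightweight gating mechanism that decides per frame whether a vision-language-action model should activate language generation, improving performance while delivering a 2.54x inference speedup.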

Comments preprint

详情
AI中文摘要

我们提出BLUE,一种在自动驾驶(AD)的视觉-语言-动作(VLA)模型中实现更好语言使用的极简方法。通过广泛分析,我们发现语言仅在一小部分路线上重要,但在这些路线上,语言可以大幅提升或降低性能。因此,在每一帧生成语言是低效的,因为大部分计算花费在无法从语言中受益的帧上。我们进一步表明,预训练的VLA隐藏状态可能已经编码了语言是否会对给定帧有益,尽管场景复杂度和运动特征本身难以预测这一点。基于这一发现,BLUE在冻结的VLA隐藏状态上训练一个轻量级门控,以决定每帧是激活语言生成还是直接预测动作,无需修改主干网络或额外的人工标注。仅用0.11M参数的门控,BLUE在两个基准测试上均达到新的最优水平,在Bench2Drive上实现76.2%的成功率,在Longest6 v2上获得36的驾驶分数,同时相比主干网络实现2.54倍的推理加速和8.9%的成功率提升。BLUE为高效的语言增强自动驾驶提供了一条实用路径,表明VLA模型可以以极低的成本保留语言的优势。我们的代码、数据、日志和检查点完全公开在https://github.com/George-Ling3/BLUE。

英文摘要

We present BLUE, a minimal method for better language use in vision-language-action (VLA) models for autonomous driving (AD). Through extensive analysis, we reveal that language matters on only a small fraction of routes, but on those routes it can greatly improve or degrade performance. Generating language at every frame is therefore inefficient, since most computation is spent on frames that do not benefit from language. We further show that pretrained VLA hidden states potentially already encode whether language will benefit a given frame, even though scene complexity and kinematic features alone struggle to predict this. Based on this finding, BLUE trains a lightweight gate on frozen VLA hidden states to decide per frame whether to activate language generation or predict actions directly, without modifying the backbone or requiring additional human annotation. With just a 0.11M-parameter gate, BLUE sets a new state of the art on both benchmarks, achieving 76.2% success rate on Bench2Drive and 36 driving score on Longest6 v2, while delivering 2.54x inference speedup and 8.9% success rate improvement over the backbone. BLUE provides a practical path toward efficient language-augmented AD, showing that VLA models can retain the benefits of language at a fraction of the cost. Our code, data, logs and checkpoints are fully available on https://github.com/George-Ling3/BLUE.
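
The per-frame gating idea can be illustrated with a minimal sketch. It assumes a two-layer MLP gate with a sigmoid output and a 0.5 threshold; the abstract only states that the gate is lightweight (0.11M parameters) and reads frozen hidden states, so the architecture, the function names (`gate_probability`, `route_frame`, `count_parameters`), and the toy weights are hypothetical and not the authors' implementation.

```python
import math

def gate_probability(hidden, w1, b1, w2, b2):
    """Hypothetical two-layer gate: frozen hidden state -> P(language helps)."""
    # Hidden layer with ReLU; each row of w1 holds the weights of one hidden unit.
    z = [max(0.0, sum(h * w for h, w in zip(hidden, row)) + b)
         for row, b in zip(w1, b1)]
    # Scalar logit followed by a sigmoid.
    logit = sum(zi * wi for zi, wi in zip(z, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def route_frame(hidden, params, threshold=0.5):
    """Choose the decoding path for one frame (threshold is an assumption)."""
    p = gate_probability(hidden, *params)
    return "language_then_action" if p >= threshold else "action_only"

def count_parameters(params):
    """Number of trainable scalars in the gate."""
    w1, b1, w2, _b2 = params
    return sum(len(row) for row in w1) + len(b1) + len(w2) + 1
```

Only the gate's parameters would be trained in such a setup; the backbone that produces `hidden` stays frozen, which is what keeps the added cost small.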

2606.08844 2026-06-09 cs.CV cs.RO 新提交

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

几何感知鱼眼-激光雷达融合用于低重叠设置下的鲁棒3D目标检测

Xiangzhong Liu, Xihao Wang, Hao Shen

发表机构 * Technical University of Munich(慕尼黑工业大学)

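AI Summary To handle the geometric distortion and low overlap between fisheye cameras and LiDAR in sparse-view setups, proposes a geometry-aware hybrid fusion framework that fuses polar and Cartesian features through a distortion-aware LSS module and a dual-attention correction module, improving detection accuracy on three benchmarks.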

Comments 8 pages, 4 figures, submitted to RA-L

详情
AI中文摘要

随着自主系统从资本密集型的机器人出租车扩展到成本敏感的物流领域,传感器配置越来越优化以实现每单位成本的覆盖范围。一种常见的稀疏视图设置利用双鱼眼摄像头和车顶安装的激光雷达,引入了严重的几何挑战:极端径向畸变、最小重叠以及球面投影与笛卡尔网格之间的错位。BEV融合算法通常在流程早期将图像和点云模态强制统一到笛卡尔网格中,导致广角鱼眼相机出现显著的特征失真和信息丢失。为了解决这个问题,我们提出了一个几何感知混合融合(GA-HF)框架,该框架明确考虑了鱼眼几何和BEV特征失真,其中鱼眼特征通过畸变感知的Lift-Splat-Shoot(LSS)模块提升到极坐标BEV网格中以保留原生角密度,而激光雷达特征在原生笛卡尔空间中处理以实现边界框回归的度量保真度。为了桥接这些异构流,我们引入了一个双注意力扭曲校正模块,该模块在融合前对扭曲的相机特征应用空间和通道注意力,明确抑制低质量外围区域的伪影,同时增强高质量语义线索。GA-HF在三个基准数据集上进行了评估:KITTI-360、Dur360BEV和Fisheye3DOD。据我们所知,这是首个探索激光雷达-鱼眼相机融合的方法。在KITTI-360上,GA-HF相比笛卡尔基线将NDS提高了4.2%;在Dur360BEV上,它超越了仅激光雷达和BEVFusion,同时在几何畸变下显著降低了方向误差;在Fisheye3DOD上,它在所有融合方法中取得了最高的检测分数。

英文摘要

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.
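
To make the polar BEV grid concrete, the sketch below maps a ground-plane point to a (range bin, azimuth bin) cell. The range limit and bin counts are illustrative assumptions, and `polar_cell` is a hypothetical helper; the abstract does not specify the grid parameters used in GA-HF.

```python
import math

def polar_cell(x, y, r_max, n_r, n_theta):
    """Map a Cartesian BEV point (metres) to a polar grid cell.

    Returns (range_bin, azimuth_bin), or None if the point lies outside r_max.
    """
    r = math.hypot(x, y)
    if r >= r_max:
        return None
    # Azimuth in [0, 2*pi), measured counter-clockwise from the +x axis.
    theta = math.atan2(y, x) % (2.0 * math.pi)
    range_bin = int(r / r_max * n_r)
    azimuth_bin = int(theta / (2.0 * math.pi) * n_theta) % n_theta
    return range_bin, azimuth_bin
```

Each azimuth bin covers the same angle at every range, so a sensor with roughly uniform angular resolution fills polar cells evenly, whereas a Cartesian grid would oversample nearby regions and undersample distant ones.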

2606.08860 2026-06-09 cs.CV 新提交

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

面向动态环境中混合自主车辆安全关键速度调节的视觉语言工作区智能

Angel Martinez-Sanchez, Kianna Ng, Wesley Maia, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Yash Tandon, Parthib Roy, Mohan Trivedi, Ross Greer

发表机构 * UC Merced(加州大学默塞德分校) Johns Hopkins(约翰霍普金斯大学) UC San Diego(加州大学圣地亚哥分校)

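AI Summary Proposes a real-time onboard perception pipeline that fuses object detection with semantic verification and hysteresis-based state transitions to recognize temporary work-zone speed limits from visual signage, achieving 96.5% event-level recall and 68.7% event-level precision for work-zone detection on low-cost hardware.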

详情
AI中文摘要

临时工作区限速通过视觉不一致的标志传达,且常缺失于数字地图中,给人类驾驶员和自动驾驶车辆系统带来安全风险。我们提出一种实时车载感知管线,用于检测活动工作区、识别相关临时限速,并输出符合法规的工作区状态和速度值,适用于驾驶员警报或下游自动控制。该系统将目标检测与语义验证以及时间平滑、基于滞后的状态转换相结合,以减少动态场景中的误激活和闪烁,并完全在低成本嵌入式硬件上运行。在ROADWork数据集(490个序列)的标注子集上手动评估,系统实现了工作区内事件级召回率96.5%和事件级精确率68.7%。基于35分钟内部驾驶数据评估的限速识别达到95.45%精确率和53.85%召回率,无错误速度分类,仅有一个误报。这些结果表明了一种实用、可扩展的方法,将工作区速度感知直接建立在车载感知而非地图或基础设施上。我们在GitHub仓库中发布了所提系统管线的源代码:https://github.com/Mi3-Lab/workzone

英文摘要

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing from digital maps, creating safety risks for human drivers and automated vehicle systems. We present a real-time, onboard perception pipeline that detects active work zones, recognizes associated temporary speed limits, and outputs a law-aware work-zone state and speed value suitable for driver alerts or downstream automated control. The system fuses object detections with semantic verification and temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker in dynamic scenes, and runs fully on low-cost embedded hardware. Evaluated manually on an annotated subset of the ROADWork dataset (490 sequences), the system achieves inside-work-zone event-level recall of 96.5% and event-level precision of 68.7%. Speed-limit recognition evaluated on 35 minutes of in-house driving data attains 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive. These results demonstrate a practical, scalable approach for grounding work-zone speed awareness directly in onboard perception rather than maps or infrastructure. We release our source code for the proposed system pipeline on our GitHub repository: https://github.com/Mi3-Lab/workzone
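
A temporally smoothed, hysteresis-based state transition can be sketched as follows. This is a generic illustration that assumes exponential smoothing and two fixed thresholds; the thresholds, the smoothing factor, and the name `hysteresis_states` are hypothetical and are not taken from the released pipeline.

```python
def hysteresis_states(scores, on_thresh=0.7, off_thresh=0.3, alpha=0.5):
    """Turn noisy per-frame work-zone scores in [0, 1] into a stable on/off state.

    The state switches on only when the smoothed score rises to on_thresh and
    switches off only when it falls to off_thresh, which suppresses flicker.
    """
    state = False
    smoothed = 0.0
    states = []
    for score in scores:
        # Exponential moving average of the raw detector score.
        smoothed = alpha * score + (1.0 - alpha) * smoothed
        if not state and smoothed >= on_thresh:
            state = True
        elif state and smoothed <= off_thresh:
            state = False
        states.append(state)
    return states
```

With the defaults, a single dropped detection in the middle of a work zone does not deactivate the state, and a single spurious detection does not activate it.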

2606.09009 2026-06-09 cs.CV 新提交

Scaling by Diversified Experience for Vision-Language-Action Models

通过多样化经验扩展视觉-语言-动作模型

Leiyu Wang, Zhaofengnian Wang, Xueqi Li, Luoyi Fan, Cewu Lu, Nanyang Ye

发表机构 * University of Science and Technology of China(中国科学技术大学)

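AI Summary Proposes SyVLA, which uses an Intention Decoupling algorithm and a similar-sample guided reinforcement learning pipeline to address the coupling of reasoning and control and the instability of policy optimization in vision-language-action models, achieving higher success rates and stronger generalization on real-world robotic tasks.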

Comments ICML 2026, SyVLA

详情
AI中文摘要

视觉-语言-动作模型在现实部署中面临重大挑战,原因在于高层推理与低层控制的纠缠以及策略优化的不稳定性。本文介绍了SyVLA,一种通过多样化经验训练的鲁棒VLA模型。我们提出意图解耦算法,从推理上下文中分离控制相关特征,以及相似样本引导的RL管道,以稳定策略更新并缓解分布偏移。在真实机器人任务和多模态基准上的大量实验表明,与现有方法相比,SyVLA实现了更高的任务成功率和更强的分布外泛化能力,同时有效保留了核心视觉-语言能力。代码和数据集发布在项目页面上。

英文摘要

Vision-Language-Action models face significant challenges in real-world deployment due to the entanglement of high-level reasoning with low-level control, and the instability of policy optimization. In this paper, we introduce SyVLA, a robust VLA model trained with diversified experiences. We propose an Intention Decoupling algorithm to isolate control-relevant features from reasoning contexts and a similar-sample guided RL pipeline to stabilize policy updates and mitigate distribution shift. Extensive experiments on real-world robotic tasks and multi-modal benchmarks demonstrate that SyVLA achieves superior task success rates and stronger out-of-distribution generalization compared to existing methods, while effectively preserving core vision-language capabilities. Code and datasets are released on the project page: https://sy-vla.github.io/

2606.09243 2026-06-09 cs.CV cs.AI 新提交

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

EgoTactile: 从自我中心视频学习日常物体的抓取压力

Yuan Zeng, Yujia Shi, Tiao Tan, Xingting Li, Yaqi Qin, Zongqing Lu, Wenming Yang, Jing-Hao Xue, Qingmin Liao

发表机构 * University of Science and Technology of China(中国科学技术大学)

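AI Summary Proposes the EgoTactile benchmark and EgoPressureDiff, a conditional diffusion framework that estimates full-hand grasp pressure from egocentric video, resolving visual-physical ambiguities and transferring robustly to in-the-wild scenarios.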

Comments Accepted to ICML2026 spotlight

详情
AI中文摘要

从自我中心视频估计全手抓取压力对于沉浸式VR和机器人操作至关重要,然而密集触觉传感通常依赖侵入式硬件。现有的基于视觉的方法主要依赖平面或指尖接触,无法泛化到复杂的3D物体交互。因此,我们引入EgoTactile,一个将自我中心视频与全手压力监督配对用于多样日常物体的基准,并包含裸手迁移子集以实现对自然场景的泛化。利用该基准,我们首先建立EgoPressureFormer作为判别基线。此外,为显式处理部分观测中的不确定性,我们提出EgoPressureDiff,一个条件扩散框架,适配大规模预训练视频扩散骨干。通过将丰富的世界知识先验与物理信息特征修正层结合以注入语义约束,我们的方法有效推断合理的接触模式并解决视觉-物理歧义。大量实验表明,我们的方法在基准上取得优越性能,并具有对野外场景的鲁棒迁移能力。项目页面见https://egotactile.github.io/。

英文摘要

Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.

2606.09273 2026-06-09 cs.CV 新提交

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

EditSSC: 基于无条件扩散模型的可编辑语义占用场景

Fatima Balde, Raoul de Charette, Alexandre Boulch

发表机构 * Inria(法国国家信息与自动化研究所) Valeo.ai(法雷奥人工智能实验室)

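AI Summary Proposes EditSSC, which uses 2D BEV representations and an off-the-shelf latent diffusion network for 3D semantic scene generation with training-free editing, outperforming existing 3D-specific baselines on SemanticKITTI.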

Comments Accepted at CVPR 2026 Workshop

详情
AI中文摘要

3D语义场景生成对于自动驾驶应用至关重要,但大多数方法依赖于复杂的3D专用架构,如三平面编码器和适配的扩散网络,限制了其简单性和编辑能力。我们提出EditSSC,一种使用2D鸟瞰图(BEV)表示和现成潜扩散网络的3D语义场景生成方法,支持编辑。我们的方法将3D语义占用网格重塑为多通道BEV图像,并利用Stable Diffusion的量化自编码器和UNet,仅做最小修改。我们在量化后的潜变量上进行扩散,从而实现了免训练的编辑能力。通过利用码本中的类到码对应关系,我们的方法支持草图引导生成、修补和外推,无需任何重新训练。在SemanticKITTI上,EditSSC在无条件生成方面优于现有的3D专用基线,表明成熟的2D架构可以有效地用于3D场景生成和编辑。

英文摘要

3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and an off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
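
One plausible reading of "reshaping a 3D occupancy grid into a multi-channel BEV image" is to fold the height axis into the channel axis. The sketch below does that on nested lists; the axis order, the use of raw class labels as channel values, and the helper names are assumptions, since the abstract does not describe the exact encoding.

```python
def occupancy_to_bev(grid):
    """Reorder grid[x][y][z] (class labels) into bev[z][x][y].

    Each height slice z becomes one channel of a 2D bird's-eye-view image.
    """
    size_x, size_y, size_z = len(grid), len(grid[0]), len(grid[0][0])
    return [[[grid[x][y][z] for y in range(size_y)]
             for x in range(size_x)]
            for z in range(size_z)]

def bev_to_occupancy(bev):
    """Inverse of occupancy_to_bev: bev[z][x][y] back to grid[x][y][z]."""
    size_z, size_x, size_y = len(bev), len(bev[0]), len(bev[0][0])
    return [[[bev[z][x][y] for z in range(size_z)]
             for y in range(size_y)]
            for x in range(size_x)]
```

Because the mapping is a pure permutation of axes, it is lossless, which is what would allow a 2D image model to be reused for 3D scenes without a 3D-specific encoder.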

2606.09362 2026-06-09 cs.CV cs.LG 新提交

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

零样本语义重识别用于自动驾驶:一项VLM基线研究

Eduardo Borges, Manuel Abreu, Luís Garrote, Urbano J. Nunes

发表机构 * Autonomous Mobile Robot(自主移动机器人) University of Minho(明德大学)

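AI Summary Proposes using vision-language models to generate semantic descriptions for zero-shot re-identification, achieving retrieval performance comparable to a supervised CNN baseline in autonomous-driving scenes while improving interpretability.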

Comments 7 pages

详情
AI中文摘要

自动驾驶中的重识别通常被表述为一个视觉匹配问题,其中车辆、行人和骑自行车者的观测通过学习的外观嵌入在时间、帧或相机视图之间进行关联,通常辅以运动、几何或多模态线索。然而,纯视觉表示可能对视角、遮挡、光照和传感器域变化敏感,限制了其在复杂驾驶场景中的可解释性和鲁棒性。我们提出了一项零样本管道的基线研究,使用视觉-语言模型生成检测到的交通参与者的文本描述,并评估这些描述是否能够支持跨观测的身份匹配。该公式不仅依赖低层次视觉相似性,而是通过结构化语义属性表示每个对象,包括类别、颜色、形状、姿态、可见部分、空间上下文和独特的视觉线索。本研究为自动驾驶场景中基于语言的重识别提供了初始基准,讨论并评估了当前VLM在此任务中的优势和局限性。结果表明,零样本语义描述可以支持有效的对象重识别,实现与监督CNN基线相当的检索性能,同时通过显式身份线索提供更大的可解释性。然而,实验也揭示了重要挑战,包括跨视角的属性不一致以及视觉相似实例之间的细粒度区分有限。

英文摘要

Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.
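
Matching on structured semantic attributes can be illustrated with a simple overlap score. The sketch assumes each observation has already been reduced to a dictionary of attribute strings and uses Jaccard similarity over (attribute, value) pairs; the abstract does not say how descriptions are compared, so the metric and the names `attribute_similarity` and `rank_gallery` are hypothetical.

```python
def attribute_similarity(a, b):
    """Jaccard overlap between two attribute dictionaries."""
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0

def rank_gallery(query, gallery):
    """Return gallery identifiers sorted from most to least similar to the query."""
    return sorted(gallery,
                  key=lambda gid: attribute_similarity(query, gallery[gid]),
                  reverse=True)
```

A score of this kind is directly inspectable, since the shared pairs show why two observations matched, but it also exposes the failure mode noted in the abstract: if the same vehicle is described with different attribute values from different viewpoints, the overlap drops.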

2606.09390 2026-06-09 cs.CV cs.AI cs.RO 新提交

Real-time body pose non-verbal communication with a consistency-based reliability measure

基于一致性可靠性度量的实时身体姿态非语言通信

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

发表机构 * National University of Science and Technology "Politehnica" Bucharest(布加勒斯特理工大学) Simion Stoilow Institute of Mathematics of the Romanian Academy(罗马尼亚科学院西蒙·斯托伊洛数学研究所) NORCE Norwegian Research Centre AS(挪威研究中心)

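AI Summary Studies recognizing communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and reports real-time performance on an embedded GPU.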

详情
AI中文摘要

身体运动在远距离或无法捕捉面部及语音的条件下传达意图。我们研究仅从2D身体姿态识别通信意图。我们认为身体运动是可靠的信号,特别是在需要实时低成本设备上的人-机器人通信场景中,如救援任务。然而,现有资源并未孤立这一信号。情感语料库结合了身体、面部、语音和文本,而骨架动作识别基准标记的是执行的动作而非传达的信息。我们发布了一个包含十种通信意图的全身体姿态真实帧数据集,并将其与其他真实(IPC)和合成(MotionLCM, VEO3.1, Kimodo)数据集进行比较,这些数据集覆盖了不同难度。我们针对能在机器人有限板载硬件上运行的系统。我们基准测试了多种模型,从骨架图分类器到联合运动预测网络,并在嵌入式GPU(NVIDIA Orin Nano)上报告了性能指标和帧率,因为在我们的场景中速度和准确性同样重要。最后,我们展示了模型自身的自回归自一致性可作为无监督可靠性信号。我们给出了一个简短证明,界定了自一致性预测正确的概率,表明该概率随一致步数增加而增长,并识别了自信预测仍可能错误的条件,与行业标准指标进行了基准测试。

英文摘要

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) datasets that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
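
As a rough illustration of a consistency-based reliability signal, the sketch below counts how many of the most recent autoregressive steps agree on the predicted intent and accepts the prediction only after a minimum run. The abstract does not define the measure precisely, so the trailing-run definition, the `min_steps` threshold, and both function names are assumptions rather than the authors' formulation.

```python
def trailing_agreement(labels):
    """Length of the run of identical labels at the end of the sequence."""
    if not labels:
        return 0
    last = labels[-1]
    run = 0
    for label in reversed(labels):
        if label != last:
            break
        run += 1
    return run

def is_reliable(labels, min_steps=3):
    """Accept the latest prediction only after min_steps consistent steps."""
    return trailing_agreement(labels) >= min_steps
```

No ground-truth labels are needed to compute this quantity, which is what makes it usable as an unsupervised signal at deployment time.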

2606.09477 2026-06-09 cs.CV 新提交

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

多相机系统中视觉-惯性相对位姿估计的高效最小求解器

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv

发表机构 * Naval Aviation University(海军航空大学) National University of Defense Technology(国防科技大学)

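AI Summary Proposes two minimal solvers based on IMU priors that need only four point correspondences and reduce multi-camera relative pose estimation to a univariate 6th-degree polynomial, substantially lowering computational complexity and performing well within RANSAC frameworks.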

详情
AI中文摘要

估计多相机系统的相对位姿是计算机视觉中的一个基本问题,在自动驾驶、移动设备和无人机(UAV)中具有关键应用。然而,现有解决方案通常计算复杂度高或依赖过多的点对应,限制了其实际应用。为解决这些限制,我们提出两种高效的最小求解器,利用新颖的参数化来估计多相机系统的相对位姿。第一种求解器利用惯性测量单元(IMU)提供的垂直方向先验,第二种利用IMU提供的旋转轴方向先验。我们的方法仅需四个点对应,并将多相机相对位姿估计问题简化为求解一个单变量6次多项式,相较于通常涉及8次多项式的现有方法有显著改进。这种计算复杂度和对应要求的降低使得我们的求解器在集成到RANSAC框架中时特别有效,展示了在视觉里程计应用中的强大潜力。通过在合成数据和KITTI基准上的严格评估,我们的方法相比最先进算法实现了卓越的计算效率和具有竞争力的精度。

英文摘要

Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
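
Why a smaller minimal sample matters inside RANSAC can be shown with the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier ratio, s the sample size, and p the desired confidence. The inlier ratio and the 6-point comparison below are illustrative assumptions, not figures from the paper.

```python
import math

def ransac_iterations(inlier_ratio, sample_size, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    if not 0.0 < inlier_ratio < 1.0:
        raise ValueError("inlier_ratio must be strictly between 0 and 1")
    # Probability that one random minimal sample contains only inliers.
    p_good = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))
```

At a 50% inlier ratio and 99% confidence, a 4-point solver needs 72 iterations while a hypothetical 6-point solver would need 293, so reducing the correspondence count cuts the number of hypotheses as well as the cost of each solve.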

2606.09634 2026-06-09 cs.CV cs.AI 新提交

ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

ATN3D:面向极端稀疏性的密度感知激光雷达-雷达早期3D目标检测

Debojyoti Biswas, Xianbiao Hu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Tsinghua University(清华大学)

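AI Summary To address early fusion discarding sparsity information and uniform channel supervision favoring dense, near-range samples under sparse long-range sensing, proposes the ATN3D framework, which combines density-aware fusion, occupancy-gated neighborhood aggregation, evidence-conditioned channel self-attention, and a range-aware loss to improve long-range detection on the VoD dataset.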

详情
AI中文摘要

3D目标检测是自动驾驶车辆及更广泛智能交通系统感知的基石。远距离检测因感知证据稀疏而具有挑战性,然而这种“远距离”场景在交通中很常见。尽管在计算机视觉中>30m常被标记为远距离,但在道路上仅提供约1-2秒的感知和决策时间。在这种极端稀疏性下,出现两个核心挑战。首先,早期多模态融合倾向于丢弃稀疏性信息,并从空或错误占用的单元中注入噪声,降低远距离召回率。其次,上下文无关的统一通道监督偏向密集和近距样本,导致远处和小目标优化不足,延迟对远处目标的最早检测。我们提出“Ask The Neighbor”(ATN3D),一种专为稀疏范围条件设计的激光雷达-雷达框架。ATN3D引入:(i) 密度感知早期融合与跨模态门控,根据体素密度/稀疏性和雷达证据调节融合;(ii) 占用门控邻域聚合,使用圆形核仅从可信单元聚合;(iii) 证据条件通道自注意力,根据天气/距离自适应调整通道权重;(iv) 距离感知损失,按距离重新平衡分类和定位,使训练与距离分层评估对齐。在VoD基准的晴朗和雾天条件下,ATN3D超越强基线:晴朗天气mAP提升+3.55%,模拟浓雾下提升+8.41%;对于>30m目标,提升分别为+3.33%(晴朗)和+2.09%(浓雾)。这些结果表明在道路稀疏感知下更早、更可靠的远距离检测。

英文摘要

3D object detection is the backbone of perception for automated vehicles (AVs) and broader intelligent transportation systems applications. Long-range detection is challenging because sensing evidence is sparse; yet this "long-range" scenario is routine in traffic. Although >30 m is often labeled long-range in computer vision, on roadways it affords only approximately 1-2 s for perception and decision-making. Under such extreme sparsity, two core challenges arise. First, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, degrading long-range recall. Second, context-agnostic uniform channel supervision favors dense and near-range samples, leaving far and small objects under-optimized, delaying the earliest detection of distant objects. We propose "Ask The Neighbor" (ATN3D), a LiDAR-Radar framework tailored for sparse-range conditions. ATN3D introduces (i) Density-aware early fusion with cross-modal gating that conditions fusion on per-voxel density/sparsity and Radar evidence, (ii) Occupancy-gated neighborhood aggregation with circular kernels to aggregate only from credible cells, (iii) Evidence-conditioned channel self-attention to adapt channel weights with weather/range, and (iv) a Range-aware loss that re-balances classification and localization by distance, aligning training with distance-stratified evaluation. On the VoD benchmark across clear and foggy conditions, ATN3D surpasses strong baselines: +3.55% mAP in clear weather and +8.41% mAP under simulated heavy fog; for >30 m objects, gains are +3.33% (clear) and +2.09% (heavy fog). These results indicate earlier and more reliable long-range detections under sparse sensing in on-road traffic.
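
The claim that 30 m leaves only about 1-2 s follows from simple kinematics, t = d / v. The sketch below works this out for two illustrative closing speeds; the speeds are assumptions chosen to bracket the stated range and do not come from the paper.

```python
def time_budget_s(distance_m, speed_kmh):
    """Seconds until an object at distance_m is reached at a constant closing speed."""
    speed_ms = speed_kmh / 3.6  # convert km/h to m/s
    return distance_m / speed_ms
```

At 108 km/h (30 m/s) a 30 m range corresponds to about 1 s, and at 54 km/h (15 m/s) to about 2 s, which matches the 1-2 s window quoted in the abstract.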

2606.07813 2026-06-09 cs.RO cs.CV 交叉投稿

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

MinNav:基于光流的极简导航用于主动微型飞行机器人

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket

发表机构 * Worcester Polytechnic Institute(伍斯特理工学院) Perception and Autonomous Robotics (PeAR) Group, Robotics Engineering Department, Worcester Polytechnic Institute(伍斯特理工学院机器人工程系感知与自主机器人(PeAR)实验室)

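AI Summary Proposes the MinNav navigation stack, which uses optical flow and its uncertainty to let tiny aerial robots fly past static and dynamic obstacles and through unknown-shaped gaps without prior knowledge, raising the success rate through active exploration; real-world experiments show a 70% success rate with far less computation than depth-based methods.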

Comments Accepted for publication at ICRA 2026. Link to Project page https://pear.wpi.edu/research/minnav.html

详情
AI中文摘要

使用单目相机进行导航对于微型飞行机器人的自主操作至关重要,因为它在多功能性、成本和精度之间取得了完美平衡。在本文中,我们介绍了MinNav,一个基于光流及其不确定性的导航栈,用于在静态和动态障碍物以及未知形状间隙的场景中飞行,无需任何关于场景组件和/或其位置/顺序的先验知识。我们通过利用机器人的主动性以探索方式移动来寻找障碍物并导航,进一步提高了成功率。我们在多种环境下的许多真实世界实验中成功评估并演示了所提出的方法,包括静态和动态障碍物以及未知形状间隙,总体成功率为70%。据我们所知,这是第一个使用单目相机在无先验知识的情况下解决所有上述导航案例的方案。我们的方法在性能上与基于深度的方法相当,但所需的计算量少几个数量级,并且可以轻松在微型飞行机器人上运行。随附的视频、补充材料、代码和数据集可在https://pear.wpi.edu/research/minnav.html找到。

英文摘要

Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to its perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth-based methods while requiring orders of magnitude less computation, and can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html

2606.08103 2026-06-09 cs.RO cs.CV 交叉投稿

Revisiting Articulated Parts Perception in Robot Manipulation

重新审视机器人操作中的关节部件感知

Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学)

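AI Summary Proposes Geometric Primary Structure (GPS) as a new representation of articulated parts, pairs it with a VR device for efficient annotation, and trains a generalizable model that reaches a 73% manipulation success rate without in-domain fine-tuning.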

Comments CVPR2026

详情
AI中文摘要

我们被各种带有可移动关节部件的物体所包围,例如盒子、把手、门。对关节部件的准确且可泛化的感知对于增强机器人操作能力至关重要。基于这一需求,近期在关节部件感知方面的工作遵循两个主要方向:一类工作使用基于姿态的表示,这需要高人力成本;与此同时,基于可供性的方法通过点跟踪提取未来物体运动,无需额外人工,但受限于低质量数据。在本文中,我们提出了一种新的关节部件表示——几何主结构(GPS),它是部件几何结构的抽象,以平衡可扩展性和质量。为了实现高效且可扩展的数据收集,GPS与便携式虚拟现实(VR)设备集成,只需一分钟即可标注一个物体序列。这种直接的人工标注比估计的可供性提供了更高质量。利用高效的VR-GPS系统,我们收集了6个部件类别下234个物体的41K帧数据,并训练了一个以单张RGB-D物体图像为输入的通用GPS模型。对于物体操作,我们基于GPS预测部署了一个启发式策略。无需任何领域内微调,我们的方法在9个物体的270个初始状态下达到了73%的成功率。我们的代码、数据和可复用工具可在 https://enlighten0707.github.io/gps 获取。

英文摘要

We are surrounded by various objects with movable, articulated parts, e.g., boxes, handles, and doors. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: one line of work uses pose-based representation, which requires high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual efforts, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure to balance scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves a 73% success rate, covering 270 initial states for 9 objects. Our code, data and reusable tool are available at https://enlighten0707.github.io/gps.

2606.08440 2026-06-09 cs.RO cs.CV 交叉投稿

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

GraspFoM:基于3D基础先验的重建驱动机器人抓取

Dongli Wu, Xiaobao Wei, Hao Wang, Qiaochu Dong, Ying Li, Qingpo Wuwu, Ming Lu, Wufan Zhao

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学) The Hong Kong University of Science and Technology(香港科技大学)

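AI Summary Proposes the GraspFoM framework, which uses 3D foundation priors (SAM3D) to build a shared 3D object latent, jointly optimizes reconstruction and grasp pose prediction, and generates continuous multimodal grasps with an anchor-initialized truncated pose-reasoning diffuser, yielding high-fidelity reconstruction and state-of-the-art grasping.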

详情
AI中文摘要

机器人抓取是机器人操作中的基本能力。然而,在部分观测下抓取仍然具有挑战性。可靠的抓取依赖于局部接触线索和物体级3D结构。现有的几何感知抓取方法认识到重建的价值,但通常将几何视为中间预测,而不是可重用的抓取物体先验。在本文中,我们提出了GraspFoM,一个统一的框架,利用3D基础先验(SAM3D)为重建和抓取姿态预测构建共享的3D物体潜变量。基于这个共享的物体潜变量,我们引入了一个锚点初始化的截断姿态推理扩散器,它预测连续且多模态的抓取姿态,而不直接依赖离散的抓取候选。我们进一步通过一个重建感知评分器和残差潜变量更新器来研究重建与抓取之间的相互作用。重建提供基于几何的线索,而抓取监督则使共享的物体潜变量向与抓取相关的可操作性区域细化。GraspFoM联合预测抓取姿态并以网格和3DGS形式重建高保真3D资产。综合实验表明,GraspFoM在重建和抓取上都达到了最先进的结果。值得注意的是,这些改进只需要少量额外的可训练参数。组件消融研究也证明了每个组件的贡献。

英文摘要

Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.

2606.08495 2026-06-09 cs.RO cs.CV 交叉投稿

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

EgoPriMo:面向交互式人形控制的自我中心运动生成

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen

发表机构 * Tianjin University(天津大学) Zhongguancun Academy(中关村学院) Beihang University(北京航空航天大学) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) DeepCybo

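AI Summary Proposes the EgoPriMo framework, which learns whole-body motion priors from egocentric human demonstrations, uses a Triple-stream DiT to jointly model body dynamics, visual context, and text, supports reconstruction, generation, and forecasting, and produces motions executable on a Unitree humanoid robot.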

详情
AI中文摘要

人形机器人需要适应场景上下文、任务要求和用户意图的全身运动。运动跟踪可以再现指定的轨迹,人形机器人视觉-语言-动作系统提供了语义接口,但两者都不能为广泛的全身行为提供可扩展且交互式的先验。我们提出了EgoPriMo(人形机器人自我中心运动先验),一个统一的框架,从自我中心人类演示中学习此类先验。给定自我中心观察和文本提示,EgoPriMo重建、生成和预测基于SMPL的全身运动。语言被用作高级控制信号,而不是完整的运动规范。EgoPriMo的核心是一个三流DiT,它联合建模身体动态、自我中心视觉上下文和文本;任务条件掩码通过同一个检查点路由不同的任务和缺失模态数据。在Nymeria和EgoExo4D上的实验表明,一个检查点在支持重建和预测的同时,改进了自我中心运动生成,优于UniEgoMotion;生成的SMPL运动也可以由Unitree人形控制器执行。这些结果表明了一条从可扩展的自我中心观察到可泛化和交互式人形运动先验的实用路径。

英文摘要

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

2606.08542 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

当视频误读:面向探索性操作痕迹问答的阅读启发式闭环蒸馏

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

发表机构 * Tsinghua University(清华大学) DISCOVER Robotics

AI总结 针对探索性操作中机器人误读视频痕迹的问题,提出闭环痕迹蒸馏方法,通过任务编码代理提取单行自然语言启发式提示,使冻结VLM准确预测最小成功动作链,在模拟和真实机器人任务上提升准确率0.38-0.47。

Comments 16 pages, 4 figures, 4 tables

详情
AI中文摘要

探索性操作往往将看似失败的尝试转化为下一步操作的关键证据。例如,机器人拉动锁住的抽屉失败,只有在开锁后才成功。失败的拉动揭示了潜在前提条件(抽屉被锁住),该条件决定了最小成功动作链(完成任务的最少动作),此处为[开锁,拉抽屉]。正确读取这一痕迹因此成为恢复该链的前提。我们将此设定形式化为探索性操作痕迹问答(EMT-QA):给定来自探索性痕迹的同步视频和本体感觉,预测在探测所揭示的潜在前提条件下的最小成功动作链。然而,即使最先进的VLM和具身多模态LLM也会误读这一证据:它们无法从原始视频、原始本体感觉或它们的组合中可靠地恢复动作链。我们引入闭环痕迹蒸馏,一种使用每任务编码代理检查带标签训练痕迹并蒸馏出关于痕迹的单行自然语言提示(称为蒸馏阅读启发式DRH)的流水线。推理时,不调用代理,不更新模型权重;冻结的VLM接收原始痕迹加上DRH作为提示条目。在三个模拟器和两个真实机器人任务上,DRH将链准确率比最佳原始模态基线提高0.38至0.47。相同的DRH还作为一次性程序分类器的唯一规范,其性能与提示的VLM相当。

英文摘要

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.

2606.08655 2026-06-09 cs.RO cs.CV 交叉投稿

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

PhysGraph:用于感知与推理的物理感知3D场景图

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

发表机构 * Duke University(杜克大学)

AI总结 提出PhysGraph框架,结合符号推理与结构化3D几何,建模杂乱场景中的运动学和物理属性,在语义分割、多物体质量估计和关节预测上达到最优。

详情
AI中文摘要

为了执行广泛的日常任务,机器人需要构建一个语义丰富、物理基础扎实且结构化的3D表示,以支持任务规划和功能预测。然而,现有方法主要关注语义检索,常常忽略物理和运动学因素。尝试建模物理属性的方法通常依赖于狭窄的训练集或单物体建模,限制了跨不同物体类型的可扩展性和泛化能力。为应对这些挑战,我们提出了PhysGraph,一个将符号推理与结构化3D几何相统一的框架,用于建模杂乱场景中的运动学和物理属性。给定RGB-D观测,PhysGraph重建以物体为中心的3D几何,并跨视图关联物体实例。然后,它将物体分解为功能部件,并通过视觉推理推断材料和关节。在合成和真实世界数据集上的评估表明,PhysGraph在语义分割、多物体质量估计和关节预测方面取得了最先进的结果。凭借其简单而有效的设计,PhysGraph生成物理一致且语义结构化的场景图,作为下游任务(如约束感知的3D功能预测和真实到模拟迁移)的结构化3D表示,这两项任务均在我们的实验中得到了验证。

英文摘要

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

2606.08688 2026-06-09 cs.RO cs.CV 交叉投稿

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

PhysAgent: 通过轨迹驱动的多智能体反馈实现基于物理的4D合成自动化

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出PhysAgent,首个模拟器在环的多智能体框架,通过解耦材料与外力、利用视觉基础模型提取轨迹并借助LLM常识推理,实现自动化、物理可信的4D运动合成,显著提升生成多样性与物理准确性。

详情
AI中文摘要

实现完全自动化、物理合理的3D运动合成是图形学和生成式AI的核心目标。然而,配置复杂的环境力场仍然完全依赖人工专家干预,成为大规模模拟数据生成的严重瓶颈。现有自动化方法主要关注材料优化,在应用于更复杂的力场优化空间时表现出严重的模态差距和技术缺陷:朴素的大语言模型缺乏底层模拟反馈,导致严重的物理不准确性,而传统的分数蒸馏采样存在梯度缓慢、陷入局部最优以及数学上无法动态切换离散力场的问题。为此,我们提出PhysAgent,首个模拟器在环的多智能体框架,利用多模态输入实现自动化、基于物理的4D合成。通过将内在材料与外在动力学解耦,PhysAgent利用配备外化力场技能模块的语义智能体掌握模拟规则并生成有效初始化。随后,由轨迹驱动的多智能体反馈驱动的精炼智能体,借助视觉基础模型从渲染帧中提取密集点轨迹。通过将这些显式运动轨迹转换为结构化文本描述符,智能体利用LLM常识推理执行零样本宏观跳跃,有效逃离局部最优并动态切换离散力场。大量实验表明,PhysAgent能够从任意多模态提示快速生成稳定、多样的物理场景,在生成多样性和物理准确性上显著优于现有基线。

英文摘要

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

2606.08962 2026-06-09 cs.LG cs.CV cs.RO 交叉投稿

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

C$^3$ache: 利用跨推理块缓存加速世界动作模型

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

发表机构 * George Mason University(乔治梅森大学) University of Central Florida(中佛罗里达大学)

AI总结 提出C$^3$ache方法,通过跨推理块缓存和重用去噪残差,加速世界动作模型推理,实现高达2.5倍加速且任务成功率几乎无损。

详情
AI中文摘要

世界动作模型(WAM)比标准的视觉-语言-动作(VLA)策略在新型运动和环境中具有更好的泛化能力,因为视频建模目标使其能够从大量未标记视频中学习,而不是依赖稀缺的标记机器人演示。这种泛化能力计算成本高昂。为了完成一个任务,WAM需要运行多个推理块,每个块都需要一个昂贵的去噪过程。现有的加速方法通过在一个块的去噪轨迹内缓存和重用计算来降低这一成本。我们的实证分析揭示了它们忽略的一个重要的冗余来源:块间的冗余。当机器人执行平滑行为时,在给定去噪步骤计算的残差从一个块到下一个块高度相关。我们引入了C$^3$ache,一种无需训练的方法,它在相同去噪步骤的推理块之间缓存和重用这些残差。在基于Fast-WAM骨干的基准测试上的实验表明,C$^3$ache在总墙钟推理时间上实现了高达2.5倍的加速,而任务成功率几乎没有下降。

英文摘要

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
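The cross-chunk reuse described in this abstract can be illustrated with a toy cache keyed by denoising step. This is only a sketch under stated assumptions: the class name `CrossChunkResidualCache`, the scalar state, and the "recompute every k-th chunk" refresh policy are all hypothetical, since the abstract does not say when C$^3$ache refreshes its cached residuals.

```python
from typing import Callable, Dict


class CrossChunkResidualCache:
    """Toy cross-chunk cache (hypothetical helper, not the paper's code)."""

    def __init__(self, refresh_every: int = 2) -> None:
        # Assumed policy: residuals are recomputed on every k-th chunk only.
        self.refresh_every = refresh_every
        # Denoising step index -> residual computed in an earlier chunk.
        self.residuals: Dict[int, float] = {}
        # Number of expensive residual evaluations actually performed.
        self.full_evals = 0

    def denoise_chunk(
        self,
        chunk_idx: int,
        x: float,
        num_steps: int,
        residual_fn: Callable[[float, int], float],
    ) -> float:
        """Run one chunk's denoising loop, reusing residuals at matching steps."""
        recompute = chunk_idx % self.refresh_every == 0
        for step in range(num_steps):
            if recompute or step not in self.residuals:
                self.residuals[step] = residual_fn(x, step)
                self.full_evals += 1
            # Apply the fresh or cached residual for this denoising step.
            x = x + self.residuals[step]
        return x


cache = CrossChunkResidualCache(refresh_every=2)
for chunk in range(4):
    cache.denoise_chunk(chunk, 1.0, 3, lambda x, step: -0.1 * x)
print(cache.full_evals)  # evaluations performed; 4 chunks x 3 steps without a cache
```

In this toy setting, half of the chunks skip the residual computation entirely. Whether that is safe in practice depends on how strongly residuals correlate between consecutive chunks, which is the empirical question the paper studies.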

2606.08992 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

SpaceVLN:具有在线空间认知记忆与推理的零样本视觉与语言导航智能体

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) China Telecom(中国电信) Central South University(中南大学) Jiangsu University(江苏大学)

AI总结 提出SpaceVLN,通过空间认知记忆和任务引导的空间推理,在零样本设置下实现连续环境中的视觉与语言导航,在多个基准上达到最优性能。

Comments 23 pages, 9 figures, 7 tables

详情
AI中文摘要

连续环境中的视觉与语言导航要求智能体理解未见环境的空间结构以遵循语言指令。尽管基础模型为无需任务特定策略训练的零样本导航开辟了有希望的路径,但许多导航器仍依赖局部视觉线索和基于线性历史的推理,忽视了探索区域、穿越路径、地标及其空间关系的空间本质。本文提出SpaceVLN,一种围绕空间认知记忆和任务引导的空间推理构建的导航智能体。具体而言,SpaceVLN引入了一个高效的分阶段闭环框架,其中规划和执行围绕可验证的空间-地标阶段组织。导航过程中,智能体逐步将探索区域抽象为空间航点,并动态维护子任务基础的地标证据,形成层次化的空间认知记忆以进行进度定位和空间关系理解。基于此记忆,Spatial-CoT将任务进度推理与空间感知、分析和预测相结合,实现任务引导的空间推理以用于具身导航。统一阶段接口使SpaceVLN能够在统一的零样本设置下处理视觉与语言导航和目标导向导航,无需任务特定策略训练。在R2R-CE、RxR-CE、GN-Bench和HM3D-OVON上,SpaceVLN实现了最先进的零样本性能,真实机器人部署进一步验证了其适用性。这些结果突显了空间认知记忆和任务引导的空间推理作为更强具身导航智能体的实用基础。

英文摘要

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

2606.09188 2026-06-09 cs.RO cs.CV 交叉投稿

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

单无人机和双无人机仅方位目标定位中的轨迹优化

Zhijian Xiao, Huayu Huang, Bin Li, Yang Shang, Banglei Guan

发表机构 * College of Aerospace Science and Engineering, National University of Defense Technology(国防科技大学航天科学与工程学院) Hunan Key Laboratory of Image Measurement and Visual Navigation(湖南省图像测量与视觉导航重点实验室)

AI总结 提出基于Fisher信息矩阵的轨迹优化方法,通过谱加权目标函数和交叉角正弦项改善观测几何,结合改进粒子群算法,显著降低定位误差。

Comments 16 pages, 13 figures and 6 tables. Submitted to Measurement

详情
AI中文摘要

仅方位目标定位是光学测量中的一个基本问题,在无人机技术中有着广泛的应用。有效的轨迹规划可以建立有利的观测几何,从而提高仅方位无人机系统的目标定位精度。本文提出了一种用于无人机在仅方位目标定位场景中的轨迹优化方法。通过利用Fisher信息矩阵,该方法将几何构型和飞行器机动性动态集成到优化框架中。具体而言,我们引入了一个谱加权FIM目标函数,该函数在退化构型附近提供更好的梯度动力学,使规划器能够快速逃离不良观测条件。对于双无人机场景,引入交叉角正弦项,通过改善视线交叉角来优化三角测量几何,从而防止轨迹聚集。此外,我们提出了一种改进的粒子群优化算法,该算法具有运动模型约束和粒子归一化,以确保轨迹的物理可行性并增强与目标函数的兼容性。仿真结果表明,与传统的基于FIM的方法相比,所提出的方法在单无人机场景中将中位定位误差降低了99.21%,在双无人机配置中实现了69.70%的提升,在远距离机动目标的长时间仅方位目标定位中表现出优越的性能。

英文摘要

Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios and achieves a 69.70% improvement for dual-UAV configurations, exhibiting superior performance in long-duration bearing-only target localization of maneuvering targets at extended ranges.
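To make the role of observation geometry concrete, the sketch below computes the standard 2D bearing-only Fisher Information Matrix for a static target and independent Gaussian bearing noise. It is not the paper's spectrally-weighted objective; the function names and the planar, equal-noise setup are assumptions for illustration. For two sensors at ranges $r_1, r_2$ whose lines of sight intersect at angle $\varphi$, the determinant reduces to $\sin^2\varphi / (\sigma^4 r_1^2 r_2^2)$, which is why a sine term on the intersection angle rewards good triangulation geometry.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bearing_fim(target: Point, sensors: Sequence[Point], sigma: float) -> Tuple[float, float, float]:
    """Return (Fxx, Fxy, Fyy) of the 2D bearing-only FIM w.r.t. the target position.

    Each bearing is atan2(ty - sy, tx - sx) with noise std `sigma` in radians.
    """
    tx, ty = target
    fxx = fxy = fyy = 0.0
    for sx, sy in sensors:
        dx, dy = tx - sx, ty - sy
        r2 = dx * dx + dy * dy
        # Gradient of the bearing with respect to (tx, ty).
        gx, gy = -dy / r2, dx / r2
        fxx += gx * gx
        fxy += gx * gy
        fyy += gy * gy
    scale = 1.0 / (sigma * sigma)
    return fxx * scale, fxy * scale, fyy * scale


def fim_det(fim: Tuple[float, float, float]) -> float:
    """Determinant of the symmetric 2x2 FIM; zero means the geometry is degenerate."""
    fxx, fxy, fyy = fim
    return fxx * fyy - fxy * fxy


# Orthogonal lines of sight versus collinear sensors (degenerate geometry).
good = fim_det(bearing_fim((0.0, 0.0), [(-10.0, 0.0), (0.0, -10.0)], sigma=0.01))
bad = fim_det(bearing_fim((0.0, 0.0), [(-10.0, 0.0), (-20.0, 0.0)], sigma=0.01))
print(good, bad)
```

When both sensors lie on the same line through the target, the determinant collapses to zero. This is the kind of degenerate configuration the abstract says the planner must escape quickly.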

2606.09350 2026-06-09 cs.RO cs.CV 交叉投稿

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

驯服感知抖动:面向可靠运动分类的不确定性感知激光雷达目标检测

Cornelius Schröder, Žygimantas Marcinkus, Markus Lienkamp

发表机构 * Technical University of Munich(慕尼黑工业大学) Institute for Automotive Engineering, Munich Institute of Robotics and Machine Intelligence, School of Engineering and Design(汽车工程研究所,慕尼黑机器人与机器智能研究所,工程与设计学院)

AI总结 提出一种部署友好的策略,通过不确定性估计和统计检验减少静态物体的虚假动态预测,在真实驾驶中显著降低误报和不必要停车。

详情
AI中文摘要

可靠的运动分类对于自动驾驶至关重要,因为对静态物体的错误动态预测可能会级联导致不必要的规划器干预。不稳定的边界框预测会导致跟踪中产生虚假的速度估计和错误预测的轨迹。我们提出了一种部署友好的缓解策略,该策略通过偶然不确定性估计增强3D目标检测器,并在短观测窗口上应用双样本z检验来区分真实运动和抖动。该方法集成到Autoware中,仅需最小改动,并重用现有数据关联以最小化计算开销。实验结果表明,在nuScenes上与速度阈值法性能相当,但在真实道路测试中,虚假动态预测和不必要停车显著减少,这是因为记录数据中存在中间抖动带,而仅基于速度的规则会误分类。这表明,不确定性感知检测和轻量级统计测试可以在噪声更大的真实环境中为自动驾驶带来实际性能提升。

英文摘要

Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
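A two-sample z-test of the kind mentioned in this abstract can be sketched as follows. The setup is an assumption: one position coordinate per detection, two consecutive short windows, and a per-detection variance taken from the detector's aleatoric uncertainty head. The function name `is_moving` and the significance level are hypothetical and not taken from the paper or from Autoware.

```python
import math
from statistics import NormalDist, fmean
from typing import Sequence, Tuple


def is_moving(
    window_a: Sequence[float],
    window_b: Sequence[float],
    var_a: Sequence[float],
    var_b: Sequence[float],
    alpha: float = 0.05,
) -> Tuple[bool, float]:
    """Two-sided two-sample z-test on the mean position of two windows.

    `var_a` / `var_b` hold the predicted measurement variance of each detection.
    Returns (moving?, z statistic).
    """
    mean_a, mean_b = fmean(window_a), fmean(window_b)
    # Variance of a window mean with independent per-sample variances:
    # sum(var_i) / n^2 == mean(var) / n.
    std_err = math.sqrt(fmean(var_a) / len(window_a) + fmean(var_b) / len(window_b))
    z = (mean_b - mean_a) / std_err
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha, z


variances = [0.05 ** 2] * 4  # assumed 5 cm standard deviation per detection
jitter = is_moving([10.00, 10.04, 9.97, 10.02], [10.01, 9.98, 10.03, 9.99], variances, variances)
motion = is_moving([10.00, 10.04, 9.97, 10.02], [10.30, 10.34, 10.41, 10.45], variances, variances)
print(jitter[0], motion[0])
```

Unlike a fixed speed threshold, the decision boundary here scales with the predicted uncertainty, so a noisy detection needs a larger displacement before it is declared dynamic.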

2606.09451 2026-06-09 cs.RO cs.CV cs.LG 交叉投稿

Dense Force Estimation with an Event-based Optical Tactile Sensor

基于事件的光学触觉传感器的稠密力估计

Agis Politis, René Zurbrügg, Valentina Cavinato

发表机构 * Sony Advanced Visual Sensing, Zurich, Switzerland(索尼高级视觉传感公司,苏黎世,瑞士) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出首个利用事件相机重建稠密3D力场的方法,通过事件数据估计表面位移并映射为力,平均误差(0.14N,0.10N,0.93N),工作频率100Hz。

详情
AI中文摘要

人类依赖空间稠密、几何和力感知的触觉反馈以高时间分辨率进行灵巧操作。虽然基于视觉的触觉传感器能够实现稠密力估计,但受限于相机帧率、运动模糊和数据带宽。基于事件的光学触觉传感器具有微秒级时间分辨率和低运动模糊的优点,但现有方法仅限于预测净力。我们提出了首个利用基于事件的光学触觉传感器进行稠密3D力场重建的框架。我们的方法从事件数据估计3D表面位移,并通过逆有限元方法(iFEM)将其映射为力。剪切位移通过所提出的事件标记跟踪算法恢复,而法向位移则由卷积神经网络预测,该网络在收集的同步力-位移-事件数据集上训练。实验表明,该方法能够准确重建物理力,在力范围高达(4N,4N,20N)时,平均绝对误差为(0.14N,0.10N,0.93N),同时以平均100Hz的频率运行。这项工作为在机器人抓取和灵巧操作中实现高频控制的稠密力反馈迈出了第一步。

英文摘要

Humans rely on spatially dense, geometry- and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Element Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.

2606.09569 2026-06-09 cs.RO cs.CV 交叉投稿

Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications

自动驾驶应用中相对位姿估计的高效最小求解器

Tao Li, Liang Liu, Jianli Han, Weimin Lv

发表机构 * College of Aerospace Science and Engineering, Naval Aviation University(海军航空大学航空航天科学与工程学院)

AI总结 提出基于新平移参数化和一阶旋转近似的统一框架,设计三种最小求解器(利用IMU垂直方向、转向旋转轴方向、平面运动假设),减少点对应数量和代数复杂度,在RANSAC中加速假设生成,平衡速度与精度。

详情
AI中文摘要

随着视觉传感系统的进步,计算机视觉在自动驾驶和机器人导航中扮演着越来越重要的角色。多相机系统中的相对位姿估计对于精确的车辆定位和环境感知至关重要,要求高实时性和鲁棒性。然而,现有方法通常涉及高计算成本并严重依赖丰富的特征匹配,限制了它们在时间敏感驾驶场景中的适用性。为解决这些限制,本文引入了一个基于新颖平移参数化和一阶旋转近似的统一框架,用于高效相对位姿估计。在该框架内,我们提出了三种专门为自动驾驶车辆设计的高效最小求解器。第一个求解器集成了惯性测量单元(IMU)的垂直方向先验,第二个在转向操作期间利用旋转轴方向先验,第三个专为平面运动设计——这是结构化道路上地面车辆的现实假设。通过减少最小点对应数量和代数复杂度,我们的方法能够在基于RANSAC的流程中更快地生成假设,提高对实时系统的适用性。在合成数据集和KITTI自动驾驶基准上的大量实验表明,与现有最先进算法相比,所提出的求解器在速度和精度之间实现了有利的平衡。

英文摘要

With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
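The link between minimal sample size and RANSAC speed follows from the standard iteration bound $N = \lceil \log(1-p) / \log(1-w^{s}) \rceil$, where $w$ is the inlier ratio, $s$ the number of correspondences per hypothesis, and $p$ the desired probability of drawing at least one all-inlier sample. The sketch below evaluates this bound. The sample sizes in the loop are illustrative only, since the abstract does not state how many correspondences each proposed solver needs.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int, confidence: float = 0.99) -> int:
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is outlier-free
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


# Illustrative sample sizes (assumed, not the paper's solver sizes).
for s in (2, 3, 5):
    print(s, ransac_iterations(inlier_ratio=0.5, sample_size=s))
```

Because the bound grows roughly as $w^{-s}$, removing even one or two correspondences from the minimal set sharply reduces the number of hypotheses a RANSAC loop has to evaluate at low inlier ratios.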

2606.09615 2026-06-09 cs.RO cs.CV 交叉投稿

DexPIE: Stable Dexterous Policy Improvement from Real-World Experience

DexPIE:基于真实世界经验的稳定灵巧策略改进

Ruizhe Liao, Wenrui Chen, Liangji Zeng, Haoran Lin, Fan Yang, Kailun Yang, Yaonan Wang

发表机构 * Hunan University(湖南大学)

AI总结 提出DexPIE后训练框架,通过灵巧手适配干预系统、多阶段DAgger数据收集、相对动作空间异步推理和连续最优性指标条件化,在三个真实灵巧操作任务上成功率提升37%。

Comments Project website: https://siiuuuuuu.github.io/DexPIE

详情
AI中文摘要

灵巧操作因其高维动作空间和复杂的接触动力学,给模仿学习带来了巨大挑战。纯粹从演示中训练的策略在部署时常常遭受复合误差,并且需要大量专家数据才能达到可靠性能。为了超越演示数据的局限性,本文提出DexPIE,一个通过真实世界部署收集的经验来改进灵巧策略的后训练框架。首先,DexPIE通过灵巧手适配的干预系统和跨初始与中间任务阶段的多阶段DAgger式数据收集,实现了有效的探索覆盖,为准确的策略评估提供了可靠的监督。为了减少后训练 rollout 与演示数据之间的时间噪声,我们引入了相对动作空间中的异步推理,这能更好地将 rollout 数据与演示行为对齐,并允许评论家学习由更一致的基础策略诱导的值函数。最后,DexPIE通过对连续最优性指标进行条件化来改进策略,使策略能够以更细粒度的方式利用数据质量。在三个具有挑战性的真实世界灵巧操作任务中,DexPIE相比基于演示的参考策略实现了37%的成功率提升,优于所有基线方法,并展现出更强的鲁棒性。源代码和数据集将公开提供。

英文摘要

Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.

2606.09811 2026-06-09 cs.RO cs.AI cs.CV 交叉投稿

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

AHA-WAM:异步自适应时域世界-动作建模与观测引导的上下文路由

Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) Baidu AI Cloud(百度智能云) The University of Hong Kong(香港大学)

AI总结 提出AHA-WAM,一种基于双扩散Transformer的异步时域自适应世界-动作模型,通过低频世界规划器和高频动作执行器解耦时序,实现高效闭环控制,在RoboTwin和真实任务上达到SOTA性能。

Comments Project page: https://serene-sivy.github.io/aha-wam/

详情
AI中文摘要

世界-动作模型已成为机器人操作的一种有前景的范式,它联合建模视觉场景动态和动作,将物理先验注入策略学习。然而,现有的世界-动作模型以相同的时间分辨率耦合世界预测和动作执行,迫使世界分支建模近期的帧变化,这些变化是冗余且信息量弱的。我们假设,将世界预测和动作执行严格绑定到相同的时间节奏可能未充分利用视频分支在具身控制中的潜力。因此,我们提出AHA-WAM,一种基于双扩散Transformer(DiT)架构的异步自适应时域世界-动作模型,该模型围绕这种时间不对称性重新组织世界-动作建模。AHA-WAM将视频DiT实例化为一个低频世界规划器,它维护过去观测的滚动键值记忆,并暴露可重用的逐层潜在上下文,编码长时域场景演化;同时,一个高频动作DiT通过逐层联合注意力查询该上下文,以闭环方式执行短动作块。为了支持异步执行,我们引入了自适应时域偏移训练和观测引导的视频-上下文路由(OVCR),它们共同让动作专家利用长时域世界上下文,同时保持对实时执行状态的响应,而无需重新运行视频DiT。在RoboTwin和真实世界操作任务上的实验表明,AHA-WAM无需任何机器人数据预训练即达到最先进性能,在RoboTwin上平均成功率为92.80%,在4个真实世界任务上成功率为78.3%,同时达到24.17 Hz的闭环控制,相比Fast-WAM加速4.59倍。

英文摘要

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
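The temporal decoupling described in this abstract, a slow world planner feeding a fast action executor, can be illustrated with a minimal two-rate loop. This is a scheduling sketch only: `planner` and `actor` are placeholder callables, the fixed `planner_period` is an assumption, and none of the model components (key-value memory, layerwise joint attention, OVCR) are represented.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_async_loop(
    observations: Sequence[Any],
    planner: Callable[[Any], Any],
    actor: Callable[[Any, Any], Any],
    planner_period: int,
) -> Tuple[List[Any], int]:
    """Run the actor every tick and the planner only every `planner_period` ticks."""
    context = None
    planner_calls = 0
    actions: List[Any] = []
    for tick, obs in enumerate(observations):
        if tick % planner_period == 0:
            # Expensive, low-frequency step: refresh the long-horizon context.
            context = planner(obs)
            planner_calls += 1
        # Cheap, high-frequency step: act on the live observation plus cached context.
        actions.append(actor(obs, context))
    return actions, planner_calls


actions, calls = run_async_loop(
    observations=list(range(10)),
    planner=lambda obs: obs * 10,      # stand-in for the video branch
    actor=lambda obs, ctx: (obs, ctx), # stand-in for the action branch
    planner_period=4,
)
print(calls, actions[5])
```

The actor always sees the current observation, so it stays responsive between planner updates even though the context it conditions on may be several ticks old.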

2606.09813 2026-06-09 cs.RO cs.CV 交叉投稿

iMaC: Translating Actions into Motion and Contact Images for Embodied World Models

iMaC: 将动作转化为运动与接触图像用于具身世界模型

Zhenyu Wu, Xiuwei Xu, Yukun Zhou, Yifan Li, Qiuping Deng, Xiaofeng Wang, Zheng Zhu, Bingyao Yu, Ziwei Wang, Jiwen Lu, Haibin Yan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) GigaAI Nanyang Technological University(南洋理工大学)

AI总结 提出iMac框架,将原始视觉图像作为动作表示,通过图像-动作编码器和动态预测器实现高保真未来状态预测和闭环控制,在预测精度、任务成功率和跨场景泛化上优于传统向量动作控制。

Comments Project page: https://imac-wm.github.io/

详情
AI中文摘要

具身世界模型已成为视觉机器人决策和交互环境模拟的关键范式。然而,传统的具身框架依赖于低维结构化动作向量(例如关节角度和末端执行器位姿),这些向量存在表达能力有限、跨不同具身形态泛化能力差以及对复杂物理交互的动态建模不自然等问题。为了解决这些限制,本文提出了iMac(图像作为动作控制),一种新颖的统一控制范式,将原始视觉图像视为具身世界模型的原生动作表示。与传统的显式运动学动作编码不同,iMac将连续的视觉操作表述为基于图像的动作标记,这些标记内在地包含了空间运动意图、交互几何约束和细微的物理动力学。我们构建了一个双分支具身架构,包括图像-动作编码器和动态世界预测器:编码器将目标驱动的视觉图像压缩为紧凑的动作嵌入,而预测器学习以图像动作为条件的环境转移规则,以实现高保真的未来状态预测和闭环具身控制。在公开的具身操作基准和真实机器人场景上进行了大量实验。结果表明,iMac在预测精度、任务成功率和跨场景泛化能力方面优于基于向量的动作控制基线。此外,我们的图像动作设计消除了对人工定义动作空间的依赖,实现了异构具身智能体的灵活通用控制。这项工作为具身世界模型提供了一种创新的视觉-动作视角,为可扩展的机器人感知和操作提供了一种简单而有效的范式。

英文摘要

Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.

2606.09827 2026-06-09 cs.RO cs.CV 交叉投稿

MemoryVLA++: Temporal Modeling via Memory and Imagination in Vision-Language-Action Models

MemoryVLA++:通过记忆与想象在视觉-语言-动作模型中进行时间建模

Hao Shi, Weiye Li, Bin Xie, Yulin Wang, Renping Zhou, Tiancai Wang, Xiangyu Zhang, Ping Luo, Gao Huang

发表机构 * Tsinghua University(清华大学) The University of Hong Kong(香港大学) Dexmal StepFun

AI总结 提出MemoryVLA++框架,通过工作记忆、感知-认知记忆库和想象未来状态的世界模型,实现完整时间建模,在模拟和真实机器人任务上显著提升长时域和依赖记忆与想象的任务性能。

Comments The project is available at https://shihao1895.github.io/MemoryVLA-PP-Web

详情
AI中文摘要

时间建模对于机器人操作至关重要,因为有效控制既需要过去交互的记忆,也需要对未来状态的想象。然而,大多数VLA模型主要依赖当前观测,因此在长时域和时间依赖任务上表现不佳。认知科学表明,人类依赖工作记忆缓冲短期上下文,海马系统保存过去经历的情景记忆,以及内部模型想象可能的未来状态演化。受这些机制启发,我们提出MemoryVLA++,一个完整的时序建模框架,为VLA模型配备记忆和想象能力以进行机器人操作。预训练的VLM将当前观测编码为感知和认知标记,形成工作记忆。这些标记查询感知-认知记忆库以检索相关历史上下文。该记忆库存储来自过去交互的低级细节和高级语义,并通过冗余感知合并进行更新。一个世界模型在去噪潜在空间中想象未来状态,并在记忆引导下整合想象的潜在表示,形成完整的时间感知标记。生成的标记条件化一个扩散动作专家,以预测时间一致的动作序列。我们在5个模拟基准和3类真实机器人任务(涵盖3种机器人)上进行了广泛实验,包括通用操作、长时域时间任务、鲁棒性和泛化性。我们的方法在Libero、SimplerEnv、Mikasa-Robo、Calvin、Libero-Plus以及多样化的真实机器人任务上取得了强劲性能,验证了具有记忆和想象的完整时间建模的有效性。例如,在真实机器人上,在通用、依赖记忆和依赖想象的任务上分别获得了+9%、+26%和+28%的提升。项目页面:https://shihao1895.github.io/MemoryVLA-PP-Web

英文摘要

Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web

2505.03528 2026-06-09 cs.CV 版本更新

Coop-WD: Cooperative Perception with Weighting and Denoising for Robust V2V Communication

Coop-WD:具有加权和去噪的协作感知用于鲁棒V2V通信

Chenguang Liu, Jianjun Chen, Yunfei Chen, Yubei He, Zhuangkun Wei, Hongjian Sun, Haiyan Lu, Qi Hao

发表机构 * Department of Engineering, Durham University(达勒姆大学工程系) Faculty of Engineering and Information Technology, University of Technology, Sydney(悉尼科技大学工程与信息技术学院) Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology(南方科技大学可信自主系统研究院)

AI总结 针对V2V通信损伤对协作感知的影响,提出联合加权与去噪框架Coop-WD,通过自监督对比模型和条件扩散概率模型分层增强特征,并设计高效变体Coop-WD-eco降低计算开销,在各类信道下优于传统方法。

Comments submitted to IEEE Transactions on Intelligent Transportation Systems

详情
AI中文摘要

协作感知通过车对车(V2V)通信利用多辆车的共享信息,在自动驾驶中发挥着重要作用,可缓解单车感知的局限性。现有工作已探索V2V通信损伤对感知精度的影响,但缺乏对不同损伤程度的泛化能力。本文提出一个联合加权与去噪框架Coop-WD,以增强V2V信道损伤下的协作感知。在该框架中,自监督对比模型和条件扩散概率模型被分层用于车辆级和像素级特征增强。本文还提出一个高效变体模型Coop-WD-eco,可选择性地停用去噪以减少处理开销。实验考虑了莱斯(Rician)衰落、非平稳性和时变失真。仿真结果表明,所提出的Coop-WD在所有类型信道中均优于传统基准。通过可视化示例的定性分析进一步证明了所提方法的优越性。所提出的Coop-WD-eco在严重失真下实现了高达50%的计算成本降低,并在信道条件改善时保持相当的精度。

英文摘要

Cooperative perception, leveraging shared information from multiple vehicles via vehicle-to-vehicle (V2V) communication, plays a vital role in autonomous driving to alleviate the limitation of single-vehicle perception. Existing works have explored the effects of V2V communication impairments on perception precision, but they lack generalization to different levels of impairments. In this work, we propose a joint weighting and denoising framework, Coop-WD, to enhance cooperative perception subject to V2V channel impairments. In this framework, the self-supervised contrastive model and the conditional diffusion probabilistic model are adopted hierarchically for vehicle-level and pixel-level feature enhancement. An efficient variant model, Coop-WD-eco, is proposed to selectively deactivate denoising to reduce processing overhead. Rician fading, non-stationarity, and time-varying distortion are considered. Simulation results demonstrate that the proposed Coop-WD outperforms conventional benchmarks in all types of channels. Qualitative analysis with visual examples further proves the superiority of our proposed method. The proposed Coop-WD-eco achieves up to 50% reduction in computational cost under severe distortion while maintaining comparable accuracy as channel conditions improve.
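The channel model named in the abstract can be illustrated with a short simulation. The sketch below is not the authors' simulator: it assumes a flat (single-tap) Rician channel with unit average power, a hypothetical K-factor parameter `k_factor`, additive Gaussian noise, and perfect channel knowledge at the receiver. All function names are illustrative.

```python
import cmath
import math
import random


def rician_gain(k_factor, rng, los_phase=0.0):
    """Draw one complex Rician channel gain with unit average power.

    k_factor is the ratio of line-of-sight power to scattered power.
    """
    los = math.sqrt(k_factor / (k_factor + 1.0)) * cmath.exp(1j * los_phase)
    # Scattered part: circular complex Gaussian with total power 1/(K+1).
    sigma = math.sqrt(1.0 / (2.0 * (k_factor + 1.0)))
    scattered = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return los + scattered


def transmit(features, k_factor, noise_std, rng):
    """Send real-valued features through the channel and equalise them.

    Assumption: the receiver knows each gain exactly (zero-forcing).
    """
    received = []
    for x in features:
        h = rician_gain(k_factor, rng)
        noise = complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
        y = h * x + noise
        received.append((y / h).real)
    return received
```

Lowering `k_factor` toward zero moves the channel toward Rayleigh fading, which is one simple way to sweep impairment severity when testing a denoiser.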

2509.20906 2026-06-09 cs.CV cs.RO 版本更新

Distant Object Localisation from Noisy Image Segmentation Sequences

基于噪声图像分割序列的远距离目标定位

Julius Pesonen, Arno Solin, Eija Honkavaara

发表机构 * Research Council of Finland(芬兰研究理事会) RCF Flagship Forest–Human–Machine Interplay—Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences (UNITE)(RCF旗舰森林-人类-机器交互——构建韧性,重新定义价值网络和赋能有意义体验(UNITE))

AI总结 针对远距离目标定位问题,提出多视图三角测量和粒子滤波两种方法,后者还能提供形状和不确定性估计,结合无人机图像分割与GNSS姿态估计实现可靠野火监测。

详情
AI中文摘要

基于相机测量序列的3D目标定位对于安全关键的监视任务(如基于无人机的野火监测)至关重要。对相机检测到的目标进行定位,通常可以借助专门的传感器配置或3D场景重建来解决。然而,对于远距离目标或受可用计算资源限制的任务,这两种方案都不可行。在本文中,我们表明该任务可以通过多视图三角测量或粒子滤波来解决,后者还能提供形状和不确定性估计。我们使用3D模拟以及基于无人机的图像分割序列(配合基于全球导航卫星系统(GNSS)的相机姿态估计)来研究这些方案。结果表明,将所提出的方法与现有的图像分割模型和无人机携带的计算资源相结合,可以为基于无人机的野火监测提供可靠的系统。所提出的方案与检测方法无关,也便于快速适配类似任务。代码见:https://fgi_nls.gitlab.io/public/distant-localisation

英文摘要

3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
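Multi-view triangulation, one of the two approaches named in the abstract, can be sketched as a least-squares intersection of viewing rays. This is a generic textbook formulation, not the authors' code: it assumes each observation has already been reduced to a camera position and a direction toward the detected object (for example, the centroid of a segmentation mask back-projected with the GNSS-based pose). The helper names are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are nearly parallel; point is not observable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def triangulate(origins, directions):
    """Return the 3D point minimising squared distance to all rays."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(origins, directions):
        norm = sum(v * v for v in d) ** 0.5
        d = [v / norm for v in d]
        for i in range(3):
            for j in range(3):
                # Projector onto the plane orthogonal to the ray direction.
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p
                b[i] += p * c[j]
    return solve3(a, b)
```

For distant objects the rays are close to parallel, so the system becomes ill-conditioned; a particle filter, the second approach in the abstract, additionally represents that uncertainty.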

2602.18020 2026-06-09 cs.CV cs.RO 版本更新

UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

UAOR: 面向视觉-语言-动作模型的不确定性感知观测重注入

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所模式识别新技术实验室) Shanghai Jiao Tong University(上海交通大学) FiveAges(五代)

AI总结 提出UAOR模块,通过动作熵检测不确定性,在语言模型高不确定层重注入观测信息,无需额外训练或数据,提升VLA模型在仿真和真实任务中的性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型(VLM)作为骨干,将图像和指令映射到动作,展现出在可泛化机器人操作中的显著潜力。为了提升性能,现有方法通常引入额外的观测线索(如深度图、点云)或辅助模块(如目标检测器、编码器),以实现更精确和可靠的任务执行,但这些方法通常需要昂贵的数据收集和额外训练。受语言模型中的前馈网络(FFN)可作为“键值记忆”这一发现的启发,我们提出不确定性感知观测重注入(UAOR),一种有效、无需训练且即插即用的VLA模型模块。具体而言,当某一语言模型层表现出较高的不确定性(以动作熵衡量)时,UAOR通过注意力检索将关键观测信息重注入下一层的前馈网络(FFN)。该机制直接在高不确定性层用观测证据增强隐藏状态,从而实现更准确和可靠的动作生成。综合实验表明,我们的方法以极小的开销一致地提升了多种VLA模型在仿真和真实任务中的性能。值得注意的是,UAOR无需额外的观测线索或模块,是现有VLA流程中通用且实用的即插即用组件。项目页面:https://uaor.jiabingyang.cn

英文摘要

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
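The uncertainty trigger described above can be illustrated with a plain entropy computation. The abstract does not say how Action Entropy is obtained at each layer, so this sketch makes an assumption: each layer yields logits over discrete action tokens, and reinjection fires when their Shannon entropy exceeds a hypothetical `threshold`. The function names are illustrative and not taken from the UAOR code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_entropy(logits):
    """Shannon entropy (in nats) of the action-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def should_reinject(logits, threshold):
    """True when the layer is uncertain enough to trigger reinjection."""
    return action_entropy(logits) > threshold
```

A uniform distribution over four tokens gives the maximum entropy ln 4, while a sharply peaked one gives a value near zero, so only the former would cross a threshold of, say, 1.0 nat.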

2603.01613 2026-06-09 cs.CV 版本更新

Uncertainty-Aware Hierarchical Re-Localization in OpenStreetMap via Semantic Alignment

基于语义对齐的OpenStreetMap中不确定性感知分层重定位

Yuchen Zou, Xiao Hu, Lihuang Fang, Yuqing Tang

发表机构 * International Digital Economy Academy(国际数字经济学院) School of Automation Science and Engineering, Xi’an Jiaotong University(西安交通大学自动化科学与工程学院) Department of Electronic and Electrical Engineering, Southern University of Science and Technology(南方科技大学电子与电气工程系)

AI总结 提出不确定性感知分层搜索框架,利用目标级DINO-ViT令牌减少跨模态差异,通过粗FFT相关和不确定性控制的局部细化实现高效定位,在精度和速度上显著优于现有方法。

Comments 7 pages, 4 figures

详情
AI中文摘要

单目重定位使机器人能够从视觉观测中估计相机姿态。然而,许多现有方法依赖密集地图或大型参考图像数据库,面临可扩展性限制和隐私风险。OpenStreetMap(OSM)作为一种轻量级隐私保护地图,提供具有全局可扩展性的语义和几何信息。尽管如此,由于自然图像与OSM之间的跨模态差异以及基于全局地图定位的高成本,OSM定位仍然具有挑战性。在本文中,我们提出了一种具有语义对齐的不确定性感知分层搜索框架,用于OSM中的定位。首先,利用目标级DINO-ViT令牌来减少地面视角观测与OSM向量之间的语义差距。其次,将全局密集匹配分解为粗FFT相关和不确定性控制的局部细化。大量实验表明,我们的方法显著提高了定位精度和速度。在单个数据集上训练时,我们方法的3°方向召回率甚至优于最先进方法的5°召回率。

英文摘要

Monocular re-localization enables robots to estimate camera poses from visual observations. However, many existing methods rely on dense maps or large reference image databases, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight privacy-preserving map, offers semantic and geometric information with global scalability. Nonetheless, OSM localization remains challenging due to cross-modal discrepancies between natural images and OSM, as well as the high cost of global map-based localization. In this paper, we propose an uncertainty-aware hierarchical search framework with semantic alignment for localization in OSM. First, object-centric DINO-ViT tokens are exploited to reduce the semantic gap between ground-view observations and OSM vectors. Second, global dense matching is decomposed into coarse FFT correlation and uncertainty-controlled local refinement. Extensive experiments demonstrate that our method significantly improves localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.
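The coarse stage relies on the correlation theorem: correlating a template against all shifts costs one forward FFT per input plus one inverse FFT. The sketch below shows the idea in one dimension with a small radix-2 FFT; it is a simplification for illustration, since the paper matches 2D features against a map, and every name here is hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(x) must be a power of two."""
    n = len(x)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def best_shift(template, signal):
    """Circular shift of `template` that best matches `signal`."""
    n = len(signal)
    ft = fft([complex(v) for v in template])
    fs = fft([complex(v) for v in signal])
    # Correlation theorem: corr = IFFT(conj(FFT(template)) * FFT(signal)).
    corr = fft([a.conjugate() * b for a, b in zip(ft, fs)], inverse=True)
    scores = [c.real / n for c in corr]
    return max(range(n), key=lambda k: scores[k])
```

The peak of `scores` gives only a coarse location; a finer local search around it would play the role of the refinement stage.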

2605.01799 2026-06-09 cs.CV 版本更新

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

Embody4D: 面向具身4D世界建模的通用数据引擎

Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang, Yuyan Xu, Jiayi Luo, Xiaoqian Cheng, Zhibo Chen

发表机构 * Zhejiang University(浙江大学) Beijing Zhongguancun Academy(北京中关村学院) University of Science and Technology of China(中国科学技术大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Shanghai Jiao Tong University(上海交通大学) Beihang University(北京航空航天大学)

AI总结 提出Embody4D视频到视频世界模型,通过3D感知合成管道、潜在置信度专家调制和交互注意力机制,将单目机器人视频转换为多视角视频,解决具身智能中视角稀疏问题,提升下游规划与学习性能。

详情
AI中文摘要

具身智能体需要鲁棒且全面的3D时空表示来支持空间推理、操作理解和下游决策。然而,现有的机器人数据通常从固定或稀疏的视角捕获,仅提供部分且依赖视角的观察,这限制了多视角感知和跨视角泛化。鉴于在真实环境中收集额外视角的困难,我们提出Embody4D,一种专为具身场景设计的视频到视频世界模型,通过将单目机器人视频转换为来自灵活目标相机视角的新视角视频来弥合这一观察差距。首先,为解决训练数据稀缺问题,我们引入了一种3D感知的组合合成管道,以策划一个异构数据集,该数据集组合了跨具身形态的机器人手臂与多样背景,促进了广泛泛化。其次,为强制几何稳定性,我们设计了一种潜在置信度感知的专家调制策略,该策略估计扭曲潜在先验的可靠性,并自适应地将区域路由到复制、修复或修补专家,以实现时空一致的4D生成。最后,为增强操作保真度,我们引入了一种交互感知注意力机制,该机制明确关注机器人交互区域。大量实验表明,Embody4D在视觉评估基准上达到了最先进的性能,同时模拟和真实机器人实验进一步证明了其作为鲁棒数据引擎的有效性,能够合成高保真、视角一致的视频,赋能下游机器人规划和学习。

英文摘要

Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints in real-world settings, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios to bridge this observation gap by transforming a monocular robot video into novel-view videos from flexible target camera viewpoints. First, to tackle training data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise a latent confidence-aware expert modulation strategy, which estimates the reliability of warped latent priors and adaptively routes regions to copy, repair, or inpaint experts for spatiotemporally consistent 4D generation. Finally, to enhance the fidelity of the manipulation, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments show that Embody4D achieves state-of-the-art performance on visual evaluation benchmarks, while both simulated and real-world robotic experiments further demonstrate its effectiveness as a robust data engine for synthesizing high-fidelity, view-consistent videos that empower downstream robotic planning and learning.

2605.06317 2026-06-09 cs.CV cs.AI 版本更新

NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps

NavOne: 一种基于顶部向下地图的视觉语言导航的一步全局规划

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li, Jie Tang, Xuemiao Xu

发表机构 * South China University of Technology(华南理工大学)

AI总结 本文提出NavOne框架,将视觉语言导航重新表述为预建俯视地图上的一步全局路径规划,在单次前向传递中预测多模态地图上的稠密路径概率,显著提升了规划效率和导航性能。

Comments 10 pages, 7 figures

详情
AI中文摘要

现有的视觉语言导航(VLN)方法通常采用以自我为中心的逐步导航范式,容易产生误差累积,效率也受限。尽管最近的方法尝试利用预建的环境地图,但它们通常依赖于增量更新记忆图或对离散路径候选打分,这限制了连续的空间推理并形成离散化瓶颈。我们提出俯视VLN(TD-VLN),将导航重新表述为预建俯视地图上的一步全局路径规划问题,并构建了新的R2R-TopDown数据集作为支撑。为解决该问题,我们引入NavOne,这是一个统一框架,在单次端到端前向传递中直接预测多模态地图上的稠密路径概率。NavOne采用俯视地图融合器(Top-Down Map Fuser)进行联合多模态地图表示,并扩展了注意力残差(Attention Residuals)以实现空间感知的深度混合。在R2R-TopDown上的大量实验表明,NavOne在基于地图的VLN方法中达到了最先进的性能,其规划阶段速度是现有基于地图的基线方法的8倍、以自我为中心方法的80倍,从而实现了高效的全局导航。

英文摘要

Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.

2605.21854 2026-06-09 cs.CV cs.AI 版本更新

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

CrossVLA: 跨范式后训练和推理优化用于视觉-语言-动作模型

Zhi Liu

发表机构 * Tianjin University(天津大学)

AI总结 本文研究了视觉-语言-动作(VLA)模型的跨范式后训练方法,提出了CrossVLA框架,通过改进的连续动作流匹配估计器、对比LoRA和DoRA参数高效层的性能,并揭示了推理过程中去噪循环对延迟的影响,最终实现了在LIBERO数据集上的显著提升。

Comments Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

详情
AI中文摘要

视觉-语言-动作(VLA)模型已迅速收敛到少数几种架构模式:离散令牌自回归(例如OpenVLA)和连续动作流匹配(例如pi-0.5)。然而,通过直接偏好优化(DPO)进行偏好对齐(语言模型中事实上的标准后训练步骤)几乎只在自回归VLA上被研究过。我们提出CrossVLA,对跨范式VLA后训练进行实证研究。主要有三项贡献:(i)一个替代性的流匹配对数概率估计器,使DPO无需概率流ODE积分即可在连续动作骨干上运行;(ii)对LoRA和DoRA作为VLA DPO参数高效层的直接比较,发现DoRA在LIBERO四个套件上(600次试验,3个随机种子)比OpenVLA SFT平均提升10.4个百分点,各套件分别为Object +20.0、Long-horizon +11.0、Goal +8.0、Spatial +2.7,且在Object套件上种子间方差为零(3个种子均为38/50);(iii)推理时间剖析显示,去噪循环占sample_actions延迟的78.6%,而类似VLA-Cache的前缀K/V缓存的加速上限为21%,并且在我们的基准中,块级和令牌级缓存策略都会使成功率降至0-80%。我们进一步在6000个LIBERO帧上预训练了一个多视角+时间投影头,在同任务检索上达到99.5%的k-NN recall@1(为随机水平的36倍),可用作下游初始化。所有代码、检查点、训练日志和复现脚本均在 https://github.com/lz-googlefycy/vla-lab 公开。

英文摘要

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
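For readers unfamiliar with DPO, the per-pair objective is compact enough to write out. The sketch below implements the standard DPO loss on scalar log-probabilities and is not the CrossVLA code: for a flow-matching backbone, the four inputs would come from whatever surrogate log-probability estimator is used, which the abstract does not specify. The default `beta` is an illustrative assumption.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))))
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference the loss is ln 2; it falls below ln 2 once the policy raises the chosen action's log-probability relative to the rejected one.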

2605.24890 2026-06-09 cs.CV 版本更新

QuoVLA: Quotient Space for Vision-Language-Action Models

QuoVLA:视觉-语言-动作模型的商空间

Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

发表机构 * Department of Automation(自动化系)

AI总结 针对VLA模型预训练VLM潜在表示动作信息不足的观点,提出商空间框架QuoVLA,通过量化模块和双分支设计压缩潜在表示为动作充分表示,在多个基准上提升泛化性能。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通常通过将视觉观察和语言指令映射到连续动作来适配预训练的视觉-语言模型(VLM)以进行机器人控制。现有方法通常采取动作不足的观点,假设预训练的VLM潜在表示要么缺乏直接可用的动作信息,要么应该屏蔽动作学习信号。与这一观点相反,我们的VLA商理论(Quotient Theory for VLA)表明,预训练的VLM潜在表示并非动作不足而是动作充分的:它们已经包含控制所需的信息,但由于区分了诱导相同最优动作行为的提示级变体而仍然过度完备。为了将这一理论付诸实践,我们提出了QuoVLA,一个用于VLA的商空间框架,将预训练的VLM潜在表示压缩为动作充分的表示。具体来说,QuoVLA通过一个量化模块和一个具有相对时间复杂度正则化的双分支设计实例化这一原则,在去除提示级冗余的同时保留动作相关信息。跨多个基准的大量实验表明,QuoVLA实现了强大的性能,在视觉、语言和环境分布偏移下的泛化方面尤其显著提升。我们的代码将公开提供。

英文摘要

Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.

2606.07431 2026-06-09 cs.CV 版本更新

OpenGlass: Ultra-Low-Power On-Device AI Eyewear with Event-based Vision

OpenGlass:基于事件视觉的超低功耗端侧AI眼镜

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

发表机构 * Department of Information Technology and Electrical Engineering, ETH Zürich(信息科技与电气工程系,瑞士联邦理工学院)

AI总结 提出开源智能眼镜平台OpenGlass,采用模块化设计、事件驱动电源管理和GAP9 RISC-V SoC,实现低功耗设备上ML,在LynX数据集上达到83.94%的跨主体手势识别准确率。

详情
AI中文摘要

智能眼镜通过多模态传感器和设备上智能实现无干扰、上下文感知的交互,但受限于紧凑外形下的功耗、内存和计算约束。支持事件视觉和嵌入式ML的开源硬件平台在此规模下很少见。本文介绍了一个开源智能眼镜平台,用于新型传感器和算法的快速原型设计。其模块化设计使用灵活的FPC转接板,支持事件相机和帧相机,无需完全重新设计PCB。硬件-软件协同设计的电源管理系统结合了可配置PMIC和通过nRF5340协调器的事件驱动唤醒,使GAP9 RISC-V SoC在推理之间保持断电。原型从200 mAh电池实现长达11.5小时的连续设备上ML。作为演示,使用来自Prophesee GENX320相机的极性分离事件直方图,在LynX数据集上评估了以自我为中心的手势识别流水线。R(2+1)D在留二受试者交叉验证下达到最佳跨主体准确率83.94%(宏F1=0.781),在GAP9上端到端推理延迟为78.3毫秒。时间增强和去除模糊类别带来了最大增益(+8.9个百分点)。所有硬件设计、固件和模型均开源发布。

英文摘要

Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.5 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 78.3 ms end-to-end inference latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.

2503.14229 2026-06-09 cs.AI cs.CV cs.RO 版本更新

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

HA-VLN 2.0:面向离散与连续环境中动态多人交互的人类感知导航开放基准与排行榜

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出HA-VLN 2.0统一基准,通过标准化任务、HAPS 2.0数据集与模拟器、16844条社会指令基准测试及真实机器人实验,证明显式社会建模提升导航鲁棒性并减少碰撞。

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

详情
AI中文摘要

视觉与语言导航(VLN)主要研究离散或连续空间,很少关注动态拥挤环境。我们提出HA-VLN 2.0,一个引入显式社会感知约束的统一基准。我们的贡献包括:(i)标准化任务和指标,同时捕捉目标准确性和个人空间遵守;(ii)HAPS 2.0数据集和模拟器,建模多人交互、室外环境和更精细的语言-运动对齐;(iii)在16844条社会性指令上的基准测试,揭示领先代理在人类动态和部分可观测性下性能急剧下降;(iv)真实机器人实验验证模拟到现实的迁移,以及一个开放排行榜实现透明比较。结果表明,显式社会建模提高了导航鲁棒性并减少了碰撞,强调了以人为中心方法的必要性。通过发布数据集、模拟器、基线和协议,HA-VLN 2.0为安全、人类感知的导航研究提供了坚实基础。

英文摘要

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

2508.00917 2026-06-09 cs.RO cs.CV cs.LG 版本更新

A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

联网自动驾驶车辆中深度多任务学习综述

Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor(温莎大学电气与计算机工程系) Department of Electrical and Computer Engineering, Queen’s University(女王大学电气与计算机工程系)

AI总结 综述联网自动驾驶车辆中深度多任务学习,涵盖感知、预测、规划、控制及V2X通信与资源管理,分析现有方法优缺点并指出未来方向。

详情
AI中文摘要

联网自动驾驶车辆(CAVs)必须同时执行多个任务,如感知、预测、规划和控制,以确保在复杂环境中安全可靠地导航。此外,通过车联万物(V2X)通信,可以实现CAVs之间的协同感知和驾驶,从而减轻单个车辆的局限性,同时也引入了严格的延迟、可靠性和带宽约束。传统上,任务使用单独的模型处理,这导致部署成本高、计算开销增加以及实现实时性能的挑战。多任务学习(MTL)最近成为一种有前景的解决方案,能够在统一模型中联合学习多个任务,从而提供更高的效率和资源利用率。据我们所知,本综述是首次专注于CAVs中深度MTL的全面回顾。我们首先概述CAVs和MTL以提供基础背景。然后,我们回顾了CAVs关键功能领域的MTL方法,包括感知、预测、规划、控制以及V2X通信和无线电资源管理(RRM)。对于前四个领域,我们将现有工作分为仅单车(车载)和V2X增强协同(多智能体)范式。我们进一步将V2X通信和RRM作为以通信为中心的MTL问题进行讨论。最后,我们讨论了现有方法的优势和局限性,识别了关键研究空白,并提供了旨在推进CAV系统MTL方法的未来研究方向。

英文摘要

Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as perception, prediction, planning, and control, to ensure safe and reliable navigation in complex environments. Moreover, through vehicle-to-everything (V2X) communication, cooperative perception and driving among CAVs can be enabled, thereby mitigating the limitations of individual vehicles, while it also introduces stringent latency, reliability, and bandwidth constraints. Traditionally, tasks are addressed using separate models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focusing on deep MTL in CAVs. We begin with an overview of CAVs and MTL to provide foundational background. Then, we review MTL approaches across key functional domains in CAVs, including perception, prediction, planning, control, as well as V2X communications and radio resource management (RRM). For the first four domains, we categorize existing works under ego vehicle-only (onboard-only) and V2X-enhanced cooperative (multi-agent) paradigms. We further discuss V2X communications and RRM as communication-centric MTL problems. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide future research directions aimed at advancing MTL methodologies for CAV systems.

2512.07998 2026-06-09 cs.RO cs.CV 版本更新

DIJIT: A Robotic Head for an Active Observer

DIJIT: 面向主动观察者的机器人头部

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Kelly Yuan, Markus D. Solbach, Yiqian Liu, Michael Jenkin, John K. Tsotsos

发表机构 * Department of Electrical Engineering and Computer Science, York University(电气与计算机科学系,约克大学)

AI总结 提出DIJIT双目机器人头部,具有9个机械自由度和4个光学自由度,实现类人眼/头运动,用于主动视觉研究,其扫视精度接近人类。

详情
Journal ref
IEEE Robotics and Automation Letters, Vol. 11, No. 6, pp. 7038-7045, June 2026
AI中文摘要

我们提出DIJIT,一种新颖的双目机器人头部,专为作为主动观察者的移动代理设计。DIJIT独特的功能广度使得主动视觉研究以及对类人眼动和头颈运动、它们之间的相互关系及各自对视觉能力贡献的研究成为可能。DIJIT还被用于探索人类视觉利用眼/头运动解决视觉任务的方式与当前计算机视觉方法之间的差异。DIJIT的设计具有九个机械自由度,而相机和镜头提供了额外的四个光学自由度。机械设计的运动范围和速度与人类表现相当。DIJIT达到了人类峰值扫视速度的85%。我们的设计涵盖会聚立体视觉所需的运动范围,即聚散(vergence)、同向运动(version)和眼球旋转(cyclotorsion)。在这里,我们介绍DIJIT及其性能的某些方面。我们还提出了一种新颖的扫视相机运动方法,利用相机方向与电机值之间的直接关系。由此产生的扫视相机运动在准确性上接近人类运动,左相机和右相机的平均误差分别为1.17°和1.14°。

英文摘要

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with 1.17° and 1.14° mean error for the left and right cameras, respectively.

2602.21172 2026-06-09 cs.AI cs.CV 版本更新

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD: 一种无需推理的高数据效率视觉-语言-动作模型

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition Texas A&M University(德克萨斯农工大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出NoRD模型,通过无需推理标注和仅需<60%数据微调,结合Dr. GRPO算法克服难度偏差,实现与现有VLA模型相当的性能,显著降低数据与计算开销。

Comments Accepted to CVPR 2026. Code available at: https://github.com/Applied-Open-Source/nord

详情
AI中文摘要

视觉-语言-动作(VLA)模型以统一的端到端架构取代模块化流水线,推动了自动驾驶的发展。然而,当前的VLA模型面临两项高昂的要求:(1)大规模数据集收集,(2)密集的推理标注。在这项工作中,我们通过NoRD(No Reasoning for Driving,无需推理的驾驶)同时应对这两个挑战。与现有的VLA模型相比,NoRD仅在不到60%的数据上微调且不使用推理标注,即可取得有竞争力的性能,token数量减少为原来的三分之一。我们发现,将标准的组相对策略优化(GRPO)应用于在这种小规模、无推理数据集上训练的策略时,无法带来显著改进。我们表明,这一局限源于难度偏差:在GRPO中,产生高方差rollout的场景,其奖励信号受到不成比例的惩罚。NoRD通过引入Dr. GRPO(一种近期提出的、用于缓解LLM中难度偏差的算法)克服了这一局限。因此,NoRD在Waymo和NAVSIM上仅用一小部分训练数据、且没有推理开销,就取得了有竞争力的性能,从而实现更高效的自动驾驶系统。网站:https://nord-vla-ai.github.io/

英文摘要

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
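The difficulty bias mentioned above can be seen in how group-relative advantages are computed. The sketch below contrasts the two rules under stated assumptions: GRPO is taken to divide mean-centred rewards by the group's (population) standard deviation, and Dr. GRPO, as commonly described, drops that division. Its separate removal of per-response length normalisation is not shown. This is an illustration, not the NoRD training code, and the reward values in the usage note are made up.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Mean-centred rewards divided by the group's standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Mean-centred rewards only; no per-group rescaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]
```

For a low-variance group such as `[0.50, 0.52]` and a high-variance group such as `[0.10, 0.90]`, the GRPO rule yields advantages of roughly ±1 in both cases, so the large reward spread of the second scenario is scaled away. The Dr. GRPO rule keeps ±0.01 versus ±0.4, preserving the stronger signal from the high-variance scenario.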

3. 图像识别、检索与分类 10 篇

2606.07648 2026-06-09 cs.CV cs.AI 新提交

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

AQIFormer:一种基于Transformer的多视角架构用于跨城市空气质量分类

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

发表机构 * IIIT Hyderabad(印度海得拉巴国际信息技术学院)

AI总结 提出AQIFormer,一种基于Transformer的集成架构,通过前后视图融合、天气感知注意力和多任务学习,在跨城市空气质量分类中达到89.96%准确率,比现有方法提升14.96%。

Comments Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

详情
AI中文摘要

空气污染是全球最严峻的环境和公共卫生挑战之一,传统的基于传感器的监测系统面临显著的可扩展性和经济性限制。基于图像的空气质量估计已成为一种有前景的替代方案,利用交通场景中大气污染物的视觉特征。然而,现有方法存在跨城市泛化能力有限以及对多视角信息利用不足的问题。我们提出AQIFormer,一种新颖的基于Transformer的集成架构,通过创新的双视图融合、天气感知注意力机制和全面的多任务学习来解决这些根本性限制。我们的方法独特地将前后交通图像与气象参数相结合,以实现跨不同城市环境的稳健空气质量分类。在包含26,678个同步前后图像对的综合数据集上进行的大量评估表明,该模型性能良好,准确率达到89.96%,比现有最优方法提高了14.96%。最重要的是,我们的模型保持了出色的跨城市泛化能力,在印度那格浦尔收集的独立数据集上达到81.67%的准确率,通过少量样本自适应仅用极少的训练样本,性能下降仅为8.29%。

英文摘要

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates good performance with 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities, achieving 81.67% accuracy on an independent dataset collected in Nagpur, India with only 8.29% performance degradation using few-shot adaptation with minimal training samples.

2606.07766 2026-06-09 cs.CV cs.AI 新提交

Quantum-Enhanced Similarity Measures for Polarimetric Materials Classification

量子增强的极化材料分类相似度度量

Sara Shojaei, Seyed Mohamad Ali Tousi, Emma Bennett, Param Sangani, Ali Shiri Sichani, Ilker Ersoy, Hadi Ali-Akbarpour, Filiz Bunyak, G. N. DeSouza

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出量子-经典混合流水线,将极化材料分类转化为点匹配问题,利用SWAP测试估计嵌入向量保真度,实现竞争性分类精度和开放集判别能力。

AI中文摘要

我们提出了一种用于极化材料分类的量子-经典混合流水线,将其视为点匹配问题。包含偏振光反射的体素立方体用于训练编码器,为立方体的体素生成32维嵌入。在推理时,丢弃编码器头部,将嵌入编码为量子态的概率幅。然后,SWAP测试电路估计查询立方体的每个32D嵌入与锚点立方体数据集之间的保真度。聚合的保真度作为材料相似度分数,具有最高聚合保真度的锚点类别被视为查询材料的类别。我们在一个包含23种材料(每种约800个样本)的数据集上评估了我们的方法,这些材料来自其Mueller矩阵。比较了所提出的量子SWAP测试的点匹配方法和使用最优传输的经典分类器。我们的结果展示了竞争性的分类精度以及开放集判别潜力,使其成为基于NISQ的材料识别的可行途径。

英文摘要

We present a quantum-classical hybrid pipeline for polarimetric material classification that casts the task as a point-matching problem. Voxel cubes, containing polarized light reflections, are used to train an encoder to produce 32-dimensional embeddings for the voxels of the cubes. At inference, the encoder head is discarded and the embeddings are encoded as probability amplitudes of quantum states. Next, a SWAP-test circuit estimates the fidelity between each of the 32D embeddings from the query cube and a dataset of anchor cubes. The aggregated fidelity serves as a material similarity score, and the class of the anchor with the highest aggregated fidelity is taken as the class of the queried material. We evaluate our approach on a dataset of 23 materials ($\approx$800 samples each) derived from their Mueller matrices. The point-matching approaches from the proposed quantum SWAP-test and a classical classifier using Optimal Transport are compared. Our results demonstrate competitive classification accuracy alongside open-set discrimination potential, establishing the approach as a viable path toward NISQ-based material recognition.
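
The fidelity estimate at the core of this pipeline can be illustrated classically. The sketch below is not the authors' implementation: it assumes real-valued embeddings, simulates the SWAP-test ancilla statistics with Python's `random` module instead of running a quantum circuit, and aggregates fidelities with a plain mean, which the abstract does not specify. The names `amplitude_encode`, `swap_test_fidelity`, and `classify` are hypothetical. It uses the standard SWAP-test identity P(0) = 1/2 + F/2, where F = |<a|b>|^2.

```python
import math
import random


def amplitude_encode(x):
    """Scale a real vector to unit norm so its entries can act as amplitudes."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [v / norm for v in x]


def swap_test_fidelity(a, b, shots=None, rng=None):
    """Fidelity |<a|b>|^2 as a SWAP test would report it.

    The ancilla reads 0 with probability p0 = 1/2 + F/2, so F = 2*p0 - 1.
    With shots=None the exact value is returned; otherwise p0 is estimated
    from `shots` simulated ancilla measurements.
    """
    psi, phi = amplitude_encode(a), amplitude_encode(b)
    overlap = sum(p * q for p, q in zip(psi, phi))
    fidelity = min(1.0, overlap * overlap)  # guard against rounding above 1
    if shots is None:
        return fidelity
    rng = rng or random.Random()
    p0 = 0.5 + 0.5 * fidelity
    zeros = sum(1 for _ in range(shots) if rng.random() < p0)
    return max(0.0, 2.0 * zeros / shots - 1.0)


def classify(query_embeddings, anchors, shots=None, rng=None):
    """Pick the anchor class with the highest mean fidelity (mean is an assumption).

    anchors maps a class name to a list of anchor embeddings.
    """
    scores = {}
    for label, anchor_embeddings in anchors.items():
        fidelities = [
            swap_test_fidelity(q, a, shots, rng)
            for q in query_embeddings
            for a in anchor_embeddings
        ]
        scores[label] = sum(fidelities) / len(fidelities)
    best = max(scores, key=scores.get)
    return best, scores
```

A 32-dimensional embedding fits into the amplitudes of five qubits because 2^5 = 32. With a finite number of shots the estimate is noisy, so the shot budget trades circuit runs against the reliability of the similarity score.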

2606.08612 2026-06-09 cs.CV 新提交

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

深度学习时代的面部表情识别:方法、模型、数据集、性能、挑战与未来研究方向的多准则系统综述

Spyridon Georgiou, Aggelos Psiris, Spyridon Evangelatos, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

发表机构 * International Hellenic University(国际希腊大学) University of Thessaly(色萨利大学) Democritus University of Thrace(色雷斯德谟克利特大学) University of Peloponnese(伯罗奔尼撒大学) Harokopio University of Athens(哈罗科皮奥大学)

AI总结 本文系统综述了深度学习面部表情识别的最新进展,提出五阶段演化框架和多准则分类法,分析了七维度的优缺点,并总结了数据集、性能比较及未来挑战。

AI中文摘要

面部表情识别(FER)在过去十年中取得了快速发展,这得益于从手工特征和浅层分类器向深度卷积、注意力机制、视觉语言和基础模型架构的转变,以及大规模野外基准测试的并行增长,这些基准涵盖了分类、维度、复合、微表情、动作单元(AU)和强度估计任务。然而,基于深度学习的FER领域迄今为止仅在狭窄的任务、架构或应用特定轴线上被综述,缺乏对其近期进展的整体、系统组织的描述。本综述通过全面回顾近期基于深度学习的FER,并明确将其与更广泛的面部情感识别(FAR)领域联系起来,填补了这一空白。其主要贡献包括:a) 描述了FER演变为五个不同阶段的过程,从手工特征和经典机器学习到注意力机制、视觉语言和基础模型方法,并给出了每个阶段的关键里程碑工作;b) 一个多准则分类法,沿七个互补轴分析文献:识别任务、输入模态、面部预处理流程、网络架构、学习策略、采集设置和应用领域;c) 按准则进行比较分析,深入洞察每个类别在野外条件下的优势和局限性;d) 按任务组织的公共FER数据集综述,包括其标注方案、模态和评估协议;e) 性能指标汇编以及代表性最先进方法在广泛采用的基准上的按任务定量比较;f) 当前挑战和有前景的未来方向的讨论。

英文摘要

Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: a) A description of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each, b) A multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain, c) A per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions, d) A task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols, e) A compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks, and f) A discussion of current challenges and promising future directions.

2606.08826 2026-06-09 cs.CV astro-ph.GA 新提交

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

使用Inception和残差CNN对Galaxy10 DECals数据集中的星系进行分类

Lanz Anthonee A. Lagman, Prospero C. Naval, Reinabelle C. Reyes

发表机构 * University of the Philippines - Diliman(菲律宾大学迪利曼分校) Department of Computer Science, College of Engineering, University of the Philippines - Diliman(菲律宾大学迪利曼分校工程学院计算机科学系) National Institute of Physics, College of Science, University of the Philippines - Diliman(菲律宾大学迪利曼分校理学院国家物理研究所)

AI总结 本研究比较了ResNet101和InceptionV4在星系形态分类任务上的性能,两者均达到约90%的准确率,其中ResNet101表现更优,表明这两种CNN架构可作为未来巡天星系图像分类的稳健基础。

Comments 4 pages, 3 figures, 2 tables, published in Proceedings of the 42nd Samahang Pisika ng Pilipinas Physics Conference (SPP 2024)

Journal ref
Proc. Samahang Pisika Pilipinas 42, SPP-2024-2E-05 (2024)
AI中文摘要

关于星系形态的图像数据预计在未来几年内将在数量和质量上都有所增加;因此,探索哪些适用于图像分类任务的深度学习架构具有成本效益非常重要。残差网络和Inception网络因其计算效率而成为探索分类卷积神经网络(CNN)的理想选择,这得益于残差连接和并行化Inception模块等技术,使得网络能够更深而不显著增加计算复杂度。在这项工作中,我们分析了ResNet101和InceptionV4在空间增强的Galaxy10 DECals数据集上的性能。保留星系的十类分类,我们修改了每个类别的图像数量。我们发现ResNet101和InceptionV4模型达到了约90%的准确率,与文献中报告的性能相当。在性能指标方面,ResNet101优于InceptionV4。我们的结果表明,这两种CNN架构中的任何一种都可以作为即将到来的巡天中星系图像分类专用管线的稳健基础。

英文摘要

Image data regarding galactic morphology is expected to increase both in quantity and quality for the next foreseeable years; thus it is important to explore which deep learning architectures adapted for image classification tasks are cost-effective. Residual and Inception networks are ideal for exploring classification convolutional neural networks (CNNs) due to their computational efficiency, achieved through techniques such as residual connections and parallelized inception modules, enabling deeper networks without excessively increasing computational complexity. In this work, we analyze the performance of ResNet101 and InceptionV4 on a spatially-augmented Galaxy10 DECals dataset. Retaining the ten-class classification of galaxies, we modify the image count of each class. We find that ResNet101 and InceptionV4 models achieved accuracies of $\sim$ 90%, comparable with reported performance in the literature. In terms of performance metrics, ResNet101 is superior to InceptionV4. Our results indicate that either of these CNN architectures could serve as a robust foundation for specialized pipelines for classification of galaxy images from upcoming surveys.

2606.08918 2026-06-09 cs.CV 新提交

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

当视觉误导时,让位置说话:基于位置注意力机制和大多模态模型的全球图像地理定位方法

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo

发表机构 * Henan Key Laboratory of Cyberspace Situation Awareness(河南省网络空间态势感知重点实验室) Information Engineering University(信息工程大学)

AI总结 提出TransGeoCLIP框架,通过位置注意力机制和大多模态模型,解决视觉相似图像导致的地理定位错误问题,在多个基准上显著提升定位精度。

Comments Submitted to IEEE Transactions on Multimedia in March 2026

AI中文摘要

全球图像地理定位旨在确定图像在全球范围内的拍摄位置。现有方法通常通过将图像与来自不同地理区域的视觉相似场景匹配而导致定位错误,限制了实际应用中的可靠性。为解决此问题,我们提出TransGeoCLIP,一种新颖的基于检索的框架,集成了位置注意力机制和大规模多模态模型(LMMs)。使用带有位置注意力的Transformer编码器对GPS坐标进行编码,TransGeoCLIP能够有效区分视觉相似图像中的地理特征。该框架包括两个阶段:1)检索数据库构建,采用配备位置注意力机制的Transformer对标记的GPS坐标进行编码并增强位置语义,随后通过CLIP实现图像-文本-GPS联合嵌入;2)检索增强推理,利用LMMs从检索到的数据库结果中推断最终图像位置预测。在包括IM2GPS、IM2GPS3k、YFCC4k和YFCC26k在内的多个数据集上的广泛实验结果表明,TransGeoCLIP显著提升了视觉相似图像的定位性能。特别是,街道级定位精度(误差在1公里内)大幅提升,在这些基准上分别超过最先进方法1.5%、1.07%、7.18%和9.75%。

英文摘要

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and then builds a joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. Particularly, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
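
The street-level metric quoted above (share of predictions within 1 km of the true location) can be sketched as follows. This is an illustrative computation, not the authors' evaluation code: it assumes great-circle distance via the haversine formula on a spherical Earth of radius 6371 km, and the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against rounding pushing sqrt(a) slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def accuracy_within(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predicted (lat, lon) pairs within threshold_km of the truth."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and equal in length")
    hits = sum(
        1
        for (plat, plon), (tlat, tlon) in zip(predictions, ground_truth)
        if haversine_km(plat, plon, tlat, tlon) <= threshold_km
    )
    return hits / len(predictions)
```

Raising `threshold_km` gives the coarser city- or region-level variants of the same metric; the 1 km default mirrors the street-level setting named in the abstract.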

2606.09353 2026-06-09 cs.CV cs.AI 新提交

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

超越人类:使用迁移学习的多物种动物面部识别

Maria De Marsico, Anil K. Jain, Annalaura Miglino

发表机构 * Sapienza University of Rome(罗马大学) Michigan State University(密歇根州立大学) University of Salerno(萨莱诺大学)

AI总结 研究利用迁移学习(FaceNet和Vision Transformer)实现多物种动物面部识别,在狗、灵长类和牛数据集上验证,狗识别准确率最高(96.85%),部分场景超越现有方法。

Comments This paper extends the work published in the proceedings of CAIP 2025 conference: 'Adapting to the Wild: From Human Face to Animal Face Recognition' by De Marsico, M., Jain, A. K., Miranda, M., & Orlando, A

AI中文摘要

个体动物识别可用于寻找丢失或被盗的宠物、追踪濒危物种个体以及识别拥挤农场中的动物。目前的识别技术主要使用物理设备(如微芯片),通常不切实际且难以应用。这些可以通过动物面部进行远程识别来替代;如果足够准确,它具有多个优势:非侵入性、可远距离工作、难以伪造,例如在食品工业中用病畜替换健康畜的情况。现有的少数数据集具有足够的每个主体图像并标注了单个动物身份,但不足以训练当前的深度学习架构。我们转而研究迁移学习的可能性,利用预训练网络模型作为骨干。我们的实验比较了专门在大型人脸数据库上训练的FaceNet和在ImageNet(即对象类别)上预训练的Vision Transformer(ViT)。我们使用了三种非常不同的动物的面部数据集:狗、灵长类(狐猴、金丝猴和黑猩猩)和牛。我们报告了结果,并对每个数据集与当前最优(SOTA)专门训练的深度网络进行了比较。三个数据集的捕获条件不同。图像质量(分辨率、运动模糊、不同姿态等)从狗到牛到灵长类依次下降。最佳性能在狗上实现,ViT达到了96.85%的平均验证准确率和84.34%的Rank-1识别率。濒危灵长类的结果仍然令人鼓舞,但性能因动物类别和任务(验证或识别)而异,并不总是优于SOTA。对于牛,ViT结果优于SOTA,而FaceNet仍然具有竞争力。

英文摘要

Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
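
The two reported metrics, verification accuracy and Rank-1 identification rate, measure different things, and a short sketch may make the distinction concrete. It is not the authors' protocol: cosine similarity, a single fixed threshold, and the function names are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_u * norm_v)


def rank1_identification_rate(probes, gallery):
    """Share of probes whose most similar gallery entry has the same identity.

    probes and gallery are lists of (identity, embedding) pairs.
    """
    correct = 0
    for probe_id, probe_emb in probes:
        best_id, _ = max(gallery, key=lambda item: cosine_similarity(probe_emb, item[1]))
        correct += best_id == probe_id
    return correct / len(probes)


def verification_accuracy(pairs, threshold):
    """Share of pairs whose same/different decision at `threshold` is right.

    pairs is a list of (embedding_a, embedding_b, same_identity) triples.
    """
    correct = sum(
        1 for a, b, same in pairs if (cosine_similarity(a, b) >= threshold) == same
    )
    return correct / len(pairs)
```

Identification is a 1:N search over the whole gallery, while verification is a 1:1 decision per pair, which is one reason the two figures for the same model can differ substantially.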

2606.07628 2026-06-09 cs.CY cs.CV 交叉投稿

Frankenstein in the Pipeline: Computational Epistemicide in Facial Recognition

管道中的弗兰肯斯坦:面部识别中的计算性知识灭绝

Nina da Hora

发表机构 * Universidade Estadual de Campinas(坎皮纳斯州立大学) Instituto da Hora

AI总结 本文借鉴玛丽·雪莱的《弗兰肯斯坦》作为方法论框架,分析基于嵌入的面部识别如何通过检测、地标定位、对齐/正面化和嵌入等步骤,逐步将面部简化为数据,实施“计算性知识灭绝”,并论证废除主义作为规范性立场。

Comments Accepted to ACM FAccT 2026. Author's version. 17 pages, 2 figures

AI中文摘要

虽然计算机视觉的优生学根源在批判性技术研究中已有充分记载,但较少关注这种暴力在管道层面实施的操作机制。本文借鉴玛丽·雪莱的《弗兰肯斯坦》,不是作为意外后果的隐喻,而是作为方法的诊断框架:拆解、重构,以及通过制造程序断言其合法性的造物。我认为,基于嵌入的面部识别实施了我所谓的计算性知识灭绝,这是Sueli Carneiro的知识灭绝概念在计算领域的延伸——通过摧毁作为活生生的关系性表面的面部,并授权数值代理作为身份的特权场所。在检测/裁剪、地标定位、对齐/正面化和嵌入过程中,面部逐渐被缩小到可以稳定为数据的部分,产生一个规范的面部作为可读性的条件,以及相应的形式主体作为识别的条件。向量化完成了弗兰肯斯坦式的“缝合”:被解剖的面部被重新组装成一个固定维度的制品,旨在跨数据库和机构流通。然后,我展示了基于距离的相似性和阈值如何将“足够接近”的规范操作化,使识别与标准化密不可分,并使改良主义的“伦理AI”优化在结构上不足。本文最后主张废除主义作为规范性立场:拒绝将向量化身份作为权利和访问的合法基础,并拆除通过可剖析的数据点来治理人类生活的制度冲动。

英文摘要

While the eugenic roots of computer vision are well-documented in critical technology studies, less attention has been paid to the operational mechanisms through which this violence is enacted at the level of the pipeline. This paper employs Mary Shelley's Frankenstein not as a metaphor for unintended consequences, but as a diagnostic framework for method: disassembly, reconstruction, and the production of a creature whose legitimacy is asserted by the procedure that made it. I argue that embedding-based facial recognition enacts what I call computational epistemicide, an extension of Sueli Carneiro's concept of epistemicide to the computational domain - by destroying the face as a living, relational surface and authorizing a numerical proxy as the privileged site of identity. Across detection/cropping, landmarking, alignment/frontalization, and embedding, the face is progressively narrowed to what can be stabilized as data, producing a canonical face as the condition of legibility and a corresponding form-subject as the condition of recognition. Vectorization completes the Frankensteinian "stitching": the dissected face is reassembled into a fixed-dimensional artifact designed to circulate across databases and institutions. I then show how distance-based similarity and thresholding operationalize a norm of "close enough," making recognition inseparable from standardization and rendering reformist "ethical AI" optimization structurally insufficient. The paper concludes by arguing for abolition as a normative stance: refusing vectorized identity as a legitimate basis for rights and access, and dismantling the institutional impulse to govern human life through dissectible data points.

2605.20735 2026-06-09 cs.CV cs.LG 版本更新

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

降低参与IREX的门槛:用于虹膜识别的开源算法、工具包和基准测试

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka

发表机构 * University of Notre Dame(圣母大学)

AI总结 提出两种基于深度学习的开源虹膜识别算法(TripletIris和ArcIris)及其符合IREX X标准的C++实现,并为HDBIF和CRYPTS两种现有方法提供IREX兼容实现,按IREX X协议和多个学术基准评估开源虹膜识别方案。

AI中文摘要

本文提出了两种新的开源虹膜识别算法,提供了Python和符合IREX标准的C++实现,用于提交官方IREX X计划。本研究有两个主要目标:(a)首次根据IREX测试协议评估开源虹膜识别解决方案;(b)提供一个示范性的C++提交,显著促进其他团队的开源方法进入IREX评估。新方法包括两个神经网络,分别使用三元组损失与批量硬三元组挖掘(TripletIris)和ArcFace损失(ArcIris)。本文还提供了两种现有方法的开源IREX兼容C++实现:(a)采用人类显著性驱动核的基于虹膜图像滤波的算法(HDBIF);(b)用于检测和比较Fuchs隐窝的人类可解释算法(CRYPTS)。除了CRYPTS在1:N搜索中面临时间限制外,这些方法已通过官方IREX X评估,并在多个流行学术基准上进行了评估:Quality-Face/Iris Research Ensemble、Warsaw-Biobase Post-Mortem Iris、CASIA-Iris-Thousand-V4、CASIA-Iris-Lamp-V4、IIT Delhi Iris Database、IIITD Contact Lens Iris Database、NDIris3D和Notre Dame Variable Iris Image Quality Release 2。最后,本文还提供了可用于任何新虹膜识别方法的虹膜分割和虹膜圆形近似开源模型。

英文摘要

NIST Iris Exchange (IREX) offers an appealing solution to evaluating new open-source iris recognition algorithms, but it presents high barriers to entry because these algorithms must be written in C++, using a specific API, and adapted to meet strict IREX speed and memory constraints. The main goal of this paper is to lower these barriers and advance open-source iris recognition large-scale evaluations by offering: (a) two new modern deep learning-based open-source iris matchers (ArcIris and TripletIris), along with their C++ IREX X-compliant implementations, which are the first open-source iris recognition methods included into the IREX X leaderboard (and thus IREX-vetted), as well as new segmentation and iris circular approximation models that can be incorporated into any new iris recognition method, and (b) a performance assessment (according to IREX X testing protocols) of all major and currently available open-source iris recognition solutions. The paper also provides Python implementations of the new ArcIris and TripletIris methods and discusses the differences one may encounter between C++ and Python implementations of the same conceptually equivalent approaches. Finally, the paper offers open-source, IREX X-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). In addition to IREX X evaluation results, the paper reports the performance of all methods on major academic benchmarks: Quality-Face/Iris Research Ensemble (Q-FIRE), Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2 (VII-Q-R2).
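
The Chinese abstract describes TripletIris as trained with a triplet loss and batch-hard triplet mining. A minimal sketch of that mining rule follows. It is a generic illustration, not the paper's training code: Euclidean distance, the hinge form, the margin value, and the function names are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    """Mean hinge loss over anchors, using the hardest positive and negative in the batch.

    For each anchor, the hardest positive is the farthest sample with the same
    label and the hardest negative is the closest sample with a different label.
    Anchors lacking either a positive or a negative are skipped.
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, emb)
            for j, (emb, other) in enumerate(zip(embeddings, labels))
            if other == label and j != i
        ]
        negatives = [
            euclidean(anchor, emb)
            for emb, other in zip(embeddings, labels)
            if other != label
        ]
        if not positives or not negatives:
            continue
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero once every identity cluster is separated from the others by at least the margin, so only batches that still contain confusable irises contribute a training signal.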

2606.00706 2026-06-09 cs.CV 版本更新

CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval

CR-JEPA:用于遥感图像检索的跨模态联合嵌入预测学习

Md Aminur Hossain, Ayush V. Patel, Nitant Dube, Biplab Banerjee

发表机构 * Space Applications Centre, Indian Space Research Organisation(印度空间研究组织空间应用中心) Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay(印度理工学院孟买资源工程研究中心)

AI总结 提出CR-JEPA架构,通过模态特定主干、共享Transformer和JEPA预测目标实现跨模态语义对齐与同模态邻域保持,在BEN-14K等数据集上显著提升跨模态检索性能。

Comments 24 pages

AI中文摘要

跨模态遥感图像检索旨在跨异构传感模态检索语义相关的场景。由于配对观测在成像物理、空间分辨率、光谱配置和视觉外观上可能存在显著差异,这仍然具有挑战性。此外,单一目标训练的检索投影可能不足以同时支持跨模态语义对齐和同模态邻域保持。我们提出了CR-JEPA,一种用于双模态遥感检索的跨模态检索联合嵌入预测架构。该模型使用模态特定主干、共享Transformer主干和JEPA风格的预测目标来估计模态内和跨模态的掩码潜在目标特征。受LeJEPA启发,我们对原始检索投影应用素描各向同性高斯正则化以稳定嵌入并缓解崩溃。CR-JEPA进一步采用解耦头设计,包括用于同模态检索的统一检索头和用于跨模态搜索的跨模态检索头。我们在BEN-14K、CBRSIR_VS和DSRSID上评估CR-JEPA。在BEN-14K上,与X-JEPA相比,CR-JEPA将S1到S2检索从61.23%提升至75.82%,S2到S1检索从63.73%提升至75.40%,同时以更少的参数实现了有竞争力的同模态检索。

英文摘要

Cross-modal remote sensing image retrieval aims to retrieve semantically related scenes across heterogeneous sensing modalities. This remains challenging because paired observations may differ substantially in imaging physics, spatial resolution, spectral configuration, and visual appearance. Moreover, a single retrieval projection trained with one objective may be insufficient to jointly support cross-modal semantic alignment and same-modal neighbourhood preservation. We propose CR-JEPA, a Cross-modal Retrieval Joint-Embedding Predictive Architecture for dual-modality remote sensing retrieval. The model uses modality-specific stems, a shared transformer trunk, and JEPA-style predictive objectives to estimate masked latent target features within and across modalities. Inspired by LeJEPA, we apply Sketched Isotropic Gaussian Regularization to raw retrieval projections to stabilize embeddings and mitigate collapse. CR-JEPA further employs a decoupled-head design with a unified retrieval head for same-modal retrieval and a cross-modal retrieval head for cross-modal search. We evaluate CR-JEPA on BEN-14K, CBRSIR_VS, and DSRSID. On BEN-14K, CR-JEPA improves S1 to S2 retrieval from 61.23% to 75.82% and S2 to S1 retrieval from 63.73% to 75.40% over X-JEPA, while also achieving competitive same-modal retrieval with fewer parameters.

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML 版本更新

Vision Hopfield Memory Networks for Image Recognition

用于图像识别的视觉Hopfield记忆网络

Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Faculty of Informatics, Vienna University of Technology(维也纳理工大学信息学院)

AI总结 本文提出了一种受大脑启发的视觉Hopfield记忆网络(V-HMN),通过整合分层记忆机制和迭代细化更新,实现了统一框架下的局部和全局动态建模,提升了可解释性和数据效率。

AI中文摘要

近年来,视觉和多模态基础模型,如Transformer家族和状态空间模型(如Mamba)在图像、文本等领域取得了显著进展。尽管这些架构在经验上取得了成功,但它们与人脑的计算原理仍有很大差距,通常需要大量的训练数据且可解释性有限。在本文中,我们提出了视觉Hopfield记忆网络(V-HMN),一种受大脑启发的基础模型,整合了分层记忆机制和迭代细化更新。具体而言,V-HMN包含局部Hopfield模块,提供图像块级别的关联记忆动态,全局Hopfield模块作为情境调节的事件记忆,以及受预测编码启发的细化规则用于迭代误差校正。通过将这些基于记忆的模块分层组织,V-HMN在一个统一的框架中捕捉了局部和全局动态。记忆检索揭示了输入与存储模式之间的关系,使决策更具可解释性,而存储模式的重用提高了数据效率。这种受大脑启发的设计因此在可解释性和数据效率方面超越了现有的自注意或状态空间方法。我们在公开的计算机视觉基准上进行了广泛的实验,V-HMN在与广泛采用的基础架构竞争的同时,提供了更好的可解释性、更高的数据效率和更强的生物合理性。这些发现突显了V-HMN作为下一代视觉基础模型的潜力,同时为文本和音频等领域的多模态基础模型提供了通用的蓝图,从而将受大脑启发的计算与大规模机器学习联系起来。

英文摘要

Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory mechanisms across layers with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, providing a prototype-based form of interpretability through explicit memory retrieval, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances data efficiency and provides a prototype-based form of interpretability compared to existing self-attention- or state-space-based approaches. We conducted extensive experiments on public image classification benchmarks. V-HMN achieves strong performance on small- and medium-scale benchmarks, and remains competitive with widely adopted backbone architectures on ImageNet despite minimal architectural tuning, while offering improved data efficiency and a prototype-based form of interpretability. These findings highlight the potential of V-HMN as a memory-centric alternative to standard vision backbones, thereby bridging brain-inspired computation with modern machine learning.
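
The associative retrieval that the abstract attributes to the Hopfield modules can be illustrated with the well-known modern (continuous) Hopfield update, in which the new state is a softmax-weighted combination of stored patterns. The sketch assumes this standard rule; the abstract does not state that V-HMN uses exactly this form, and the function names and the `beta` value are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_retrieve(query, patterns, beta=1.0, steps=1):
    """Iterate the modern Hopfield update: state <- sum_i softmax(beta * <p_i, state>)_i * p_i.

    Returns the retrieved state and the final attention weights over the stored
    patterns. The weights show which stored prototypes drove the result.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    state = list(query)
    weights = []
    for _ in range(steps):
        scores = [beta * sum(p * s for p, s in zip(pattern, state)) for pattern in patterns]
        weights = softmax(scores)
        state = [
            sum(w * pattern[d] for w, pattern in zip(weights, patterns))
            for d in range(len(state))
        ]
    return state, weights
```

The returned weights are what makes this kind of retrieval inspectable: a prediction can be traced back to the stored patterns that received the most weight, which matches the prototype-based interpretability the abstract describes.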

4. 目标检测、分割与定位 28 篇

2606.07659 2026-06-09 cs.CV eess.IV 新提交

Real-Time Industrial Defect Detection on Edge Hardware Using Fine-Tuned YOLOv8: A Systematic Benchmark on the NEU Surface Defect Database and MVTec AD with Automotive & Battery Manufacturing Extensions

基于微调YOLOv8的边缘硬件实时工业缺陷检测:在NEU表面缺陷数据库和MVTec AD上的系统基准测试及汽车与电池制造扩展

Emmanuel Ezeji Somtochukwu, Nitesh Rijal

发表机构 * Zema AI Labs(Zema AI实验室)

AI总结 提出Industrial-YOLO框架,基于微调YOLOv8,通过TensorRT和OpenVINO加速,在边缘硬件上实现超过120 FPS的实时缺陷检测,mAP达98.5%,并在汽车装配线验证零延迟性能。

Comments 11 pages, 4 figures, 7 tables. Includes edge optimization framework (TensorRT/OpenVINO) and industrial hardware benchmark analysis

AI中文摘要

自动化表面缺陷检测对于确保高速制造环境中的严格质量控制至关重要。虽然深度学习模型提供了显著的准确性,但在资源受限的边缘硬件上部署而不引入显著延迟仍然是一个持续的挑战。本文提出了Industrial-YOLO,一个基于微调YOLOv8架构的边缘优化框架,专门为实时工业缺陷检测设计。我们利用NEU表面缺陷数据库(用于钢板)和MVTec AD数据集进行系统基准测试,并补充了代表真实世界结构异常(划痕、凹坑和夹杂物)的定制汽车制造扩展。为了弥合算法复杂性和边缘硬件约束之间的差距,通过TensorRT和OpenVINO加速引擎引入了目标特定的优化。实验结果表明,Industrial-YOLO在NVIDIA Jetson Orin平台上实现了超过120 FPS的高速推理速度,同时保持了98.5%的卓越平均精度(mAP)。所提出的框架在直接部署到活跃的汽车装配线上时,展示了高度鲁棒、零延迟的性能,为下一代自动光学检测(AOI)系统提供了可扩展的蓝图。

英文摘要

Automated surface defect detection is critical for ensuring rigorous quality control in high-speed manufacturing environments. While deep learning models offer remarkable accuracy, deploying them on resource-constrained edge hardware without introducing significant latency remains a persistent challenge. This paper presents Industrial-YOLO, an edge-optimized framework built upon a fine-tuned YOLOv8 architecture specifically engineered for real-time industrial defect detection. We conduct a systematic benchmark utilizing the NEU surface defect database for steel sheets and the MVTec AD dataset, supplemented with custom automotive manufacturing extensions representing real-world structural anomalies (scratches, pits, and inclusions). To bridge the gap between algorithmic complexity and edge hardware constraints, target-specific optimizations are introduced via TensorRT and OpenVINO acceleration engines. Experimental results demonstrate that Industrial-YOLO achieves a high-velocity inference speed exceeding 120 FPS on the NVIDIA Jetson Orin platform while maintaining an exceptional mean Average Precision (mAP) of 98.5%. The proposed framework showcases highly robust, zero-latency performance when deployed directly onto an active automotive assembly line, offering a scalable blueprint for next-generation automated optical inspection (AOI) systems.

2606.07756 2026-06-09 cs.CV cs.RO 新提交

DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features

DroneDAR: 使用单目视觉和边界框特征的长距离无人机距离估计

Knut Peterson, Zaid Mayers, David Han

发表机构 * iMaPLe Research Lab, Drexel University(德雷塞尔大学iMaPLe研究实验室)

AI总结 针对长距离小无人机距离估计的挑战,提出DroneDAR模型,结合卷积骨干网络和轻量级门控机制融合边界框特征,分析骨干容量、裁剪分辨率和回归损失对性能的影响,并探讨远距离失效模式。

Comments 6 pages, 5 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)

AI中文摘要

在长距离图像中准确估计小型无人机的距离对于跟踪和态势感知至关重要,但由于极端的目标尺度变化、背景杂波和噪声视觉线索,这仍然具有挑战性。本文研究了使用图像裁剪和边界框几何进行单目无人机距离估计,这是一种实际设置,其中检测器提供候选无人机区域,模型从外观和框派生特征预测距离。我们评估了一个Droneranger风格的基线,并引入了一个新的DroneDAR(无人机检测与测距)模型,该模型通过轻量级门控机制将卷积骨干网络与显式边界框线索相结合。实验分析了骨干网络容量、裁剪分辨率和回归损失函数如何影响不同距离范围内的性能。我们进一步研究了远距离下的常见失效模式,包括对边界框噪声的敏感性和裁剪中纹理细节的减少。结果为设计和训练在真实远距离条件下保持鲁棒性的距离估计器提供了指导,并指出了在无人机仅占据几个像素时提高可靠性的方向。

英文摘要

Accurate distance estimation for small drones in long-range imagery is important for tracking and situational awareness, yet remains challenging due to extreme target scale variation, background clutter, and noisy visual cues. This paper studies monocular drone distance estimation using image crops together with bounding-box geometry, a practical setting in which a detector provides a candidate drone region and the model predicts range from appearance and box-derived features. We evaluate a Droneranger-style baseline, and introduce a new DroneDAR (Drone Detection And Ranging) model that combines a convolutional backbone with explicit bounding-box cues through a lightweight gating mechanism. Experiments analyze how backbone capacity, crop resolution, and regression loss functions affect performance across distance regimes. We further examine common failure modes at long distances, including sensitivity to bounding-box noise and reduced texture detail in the crop. The results provide guidance for designing and training range estimators that remain robust under real-world long-range conditions and highlight directions for improving reliability when drones occupy only a few pixels.
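
The sensitivity to bounding-box noise mentioned above can be seen from simple pinhole-camera geometry. The sketch below is not the DroneDAR model, which learns range from crops and box features; it assumes an ideal pinhole camera and a known physical drone width, and all names and numbers are hypothetical.

```python
def range_from_box_width(focal_length_px, object_width_m, box_width_px):
    """Pinhole estimate: range = focal length (px) * object width (m) / box width (px)."""
    if box_width_px <= 0:
        raise ValueError("box width must be positive")
    return focal_length_px * object_width_m / box_width_px


def range_error_from_pixel_noise(focal_length_px, object_width_m, box_width_px, noise_px=1.0):
    """Change in estimated range (m) when the box comes out noise_px narrower."""
    nominal = range_from_box_width(focal_length_px, object_width_m, box_width_px)
    perturbed = range_from_box_width(focal_length_px, object_width_m, box_width_px - noise_px)
    return perturbed - nominal


# Hypothetical camera (2000 px focal length) and drone (0.5 m wide).
near_error = range_error_from_pixel_noise(2000.0, 0.5, 50.0)  # drone spans 50 px, at 20 m
far_error = range_error_from_pixel_noise(2000.0, 0.5, 5.0)    # drone spans 5 px, at 200 m
```

Under these assumptions a one-pixel box error shifts the estimate by well under a metre at 20 m but by 50 m at 200 m, which is consistent with the abstract's point that reliability is hardest when the drone occupies only a few pixels.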

2606.08001 2026-06-09 cs.CV 新提交

Learning a Semantic Calibration Network for Open-Vocabulary Semantic Segmentation

学习语义校准网络用于开放词汇语义分割

Yang Sun, Tao Wang, Anastasia Ioannou, Ge Xu

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出语义校准网络(SCN),通过类消歧和logits融合模块显式建模类间语义相关性,在保持CLIP泛化能力的同时提升分割性能。

Comments Paper accepted by 11th International Conference on Intelligent Computing and Signal Processing (ICSP 2026)

AI中文摘要

语义图像分割为每个像素分配预定义的类别标签,近期取得了显著进展。开放词汇分割(OVS)将分割任务从固定集合扩展到开放集合,使得能够基于任意文本输入(如类别名称或描述)识别和分割新概念。本文提出了一种新颖的语义校准网络(SCN)用于开放词汇语义分割。与先前专注于特征聚合或简单微调预训练模型的方法不同,SCN通过显式建模类间语义相关性来细化掩码分类过程,旨在增强模型的判别能力,同时有效保留预训练CLIP模型的泛化能力。具体而言,SCN包含两个核心组件:类消歧(CD)和logits融合(LF)。首先,利用交叉注意力机制将文本嵌入转换为视觉感知的伪文本嵌入,以推导出增强的相似度分数,补充原始的掩码-文本相似度分数。随后,类消歧模块通过残差架构捕获隐式的类间依赖关系,有效解决语义歧义。最后,logits融合模块动态整合多方面的语义证据,确保模型在保持CLIP固有泛化能力的同时实现稳健的语义共识。在主流基准上的综合实验结果表明,与最先进算法相比,所提方法取得了显著的性能提升。

英文摘要

Semantic image segmentation, which assigns a predefined category label to each pixel, has achieved significant progress lately. Open-Vocabulary Segmentation (OVS) extends the segmentation task from a fixed set to an open set, enabling the identification and segmentation of novel concepts based on arbitrary text inputs, such as category names or descriptions. In this paper, we propose a novel Semantic Calibration Network (SCN) for open-vocabulary semantic segmentation. Different from prior approaches that focus on feature aggregation or simple fine-tuning of pre-trained models, SCN refines the mask classification process by explicitly modeling the semantic correlations between classes, aiming to enhance the model's discriminative power while effectively preserving the generalization abilities of the pre-trained CLIP model. Specifically, SCN comprises two core components: Class Disambiguation (CD) and Logits Fusion (LF). First, a cross-attention mechanism is utilized to transform the text embeddings into visually aware pseudo-text embeddings, in order to derive an enhanced similarity score that complements the original mask-text similarity score. Subsequently, the Class Disambiguation module captures implicit inter-class dependencies through a residual architecture to effectively resolve semantic ambiguities. Finally, the Logits Fusion module dynamically integrates multifaceted semantic evidence to ensure that the model achieves a robust semantic consensus while maintaining CLIP's inherent generalization capability. Comprehensive experimental results on mainstream benchmarks demonstrate that the proposed method achieves significant performance improvements compared to state-of-the-art algorithms.

2606.08002 2026-06-09 cs.CV 新提交

Aqua Boundary-Saliency Attention Module for Lightweight Underwater Salient Instance Segmentation Detection Transformer

Aqua边界显著性注意力模块:用于轻量级水下显著实例分割检测Transformer

M. Fazri Nizar, Julian Supardi, Muhammad Naufal Rachmatullah

发表机构 * Universitas Sriwijaya(斯里维贾亚大学)

AI总结 提出轻量级水下显著实例分割检测Transformer(LUSIS-DETR),通过Aqua边界显著性注意力模块嵌入水下先验线索,在四个数据集上达到领先性能,并在NVIDIA T4 GPU上实现4.31-6.34毫秒延迟。

Comments This work has been submitted to the IEEE for possible publication

AI中文摘要

水下实例分割融合了像素级掩码预测和实例级判别,用于海洋资源勘探、生态监测和水下机器人感知。最近的基于提示和辅助模态的方法提高了掩码质量,但它们对大型基础模型、提示生成或额外模态估计的依赖使高效部署复杂化。本文介绍了轻量级水下显著实例分割检测Transformer(LUSIS-DETR),这是一个紧凑的检测Transformer框架,围绕Aqua边界显著性注意力模块(AquaBSAM)构建。AquaBSAM通过有界残差调制将水下边界、对比度、衰减、色度、暗通道和中心先验线索嵌入到DINOv2初始化的多尺度特征中,而辅助掩码监督和小目标复制粘贴仅在训练中使用。在四个最新的水下实例分割数据集UIIS、UIIS10K、USIS10K和USIS16K上的广泛评估表明,在类别感知和显著实例协议下,该方法相对于先前最先进的工作具有竞争力的领先性能。在NVIDIA T4图形处理单元(GPU)上的TensorRT半精度(FP16)基准测试实现了4.31-6.34毫秒(ms)的延迟,支持在可复现的设置下进行实时推理。

英文摘要

Underwater instance segmentation integrates pixel-level mask prediction and instance-level discrimination for marine resource exploration, ecological monitoring, and underwater robotic perception. Recent prompt-based and auxiliary-modality methods improve mask quality, but their reliance on large foundation models, prompt generation, or extra modality estimation complicates efficient deployment. This work introduces Lightweight Underwater Salient Instance Segmentation Detection Transformer (LUSIS-DETR), a compact detection-transformer framework built around the Aqua Boundary-Saliency Attention Module (AquaBSAM). AquaBSAM embeds underwater boundary, contrast, attenuation, chroma, dark-channel, and center-prior cues into DINOv2-initialized multi-scale features through bounded residual modulation, while auxiliary mask supervision and small-object copy-paste are training-only. Extensive evaluation on four recent underwater instance segmentation datasets, UIIS, UIIS10K, USIS10K, and USIS16K, shows competitively leading performance against previous state-of-the-art works across category-aware and salient-instance protocols. TensorRT half-precision (FP16) benchmarking on an NVIDIA T4 graphics processing unit (GPU) achieves 4.31-6.34 milliseconds (ms) latency, supporting real-time inference under an accessible reproduction setting.

2606.08866 2026-06-09 cs.CV 新提交

Generalizing Geometry-Guided Mamba as a Plug-and-Play Context Module for CNN-based Semantic Segmentation

泛化几何引导Mamba作为CNN语义分割的即插即用上下文模块

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Tamkang University(淡江大学)

AI总结 将几何引导的Mamba(G-Mamba)作为即插即用的上下文聚合模块,替代六种CNN分割网络的上下文头,在Cityscapes上以少量额外计算量获得一致的mIoU提升。

详情
AI中文摘要

基于CNN的语义分割网络通常依赖上下文头(如ASPP、PPM或注意力模块)来扩大感受野。这些头有效但可能引入大量计算、内存开销或边界泄漏。本文重新审视DGM-Net中的方向几何Mamba(G-Mamba),并将其作为即插即用的上下文聚合模块,而非全新的分割架构。关键思想是将几何引导注入选择性扫描过程,使长程特征传播能够由边界和向心流线索调制。我们替换了六种代表性CNN分割模型(包括DeepLabV3+、DANet、CCNet、PSPNet、PSANet和OCRNet)的原始上下文头,同时保持ResNet-101骨干网络不变。在Cityscapes上的结果表明,在$1024\times1024$分辨率下,仅增加适度的额外GFLOPs即可获得一致的mIoU提升,表明几何引导的SSM模块可以作为传统CNN上下文头的实用替代或增强。

英文摘要

CNN-based semantic segmentation networks usually rely on context heads such as ASPP, PPM, or attention modules to enlarge the receptive field. These heads are effective but may introduce heavy computation, memory cost, or boundary leakage. This paper revisits Directional Geometric Mamba (G-Mamba) from DGM-Net and studies it as a plug-and-play context aggregation module rather than a complete new segmentation architecture. The key idea is to inject geometric guidance into the selective scan process, allowing long-range feature propagation to be modulated by boundary and centripetal-flow cues. We replace the original context heads of six representative CNN segmentation models, including DeepLabV3+, DANet, CCNet, PSPNet, PSANet, and OCRNet, while keeping the ResNet-101 backbone unchanged. Results on Cityscapes show consistent mIoU gains with only moderate extra GFLOPs at $1024\times1024$ resolution, suggesting that geometry-guided SSM modules can serve as practical alternatives or enhancements to conventional CNN context heads.

2606.08906 2026-06-09 cs.CV 新提交

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

DifferSeg: 通过差分感知与频率引导实现多样化的多模态二值分割

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao

发表机构 * School of Artificial Intelligence, Jiangxi Normal University(江西师范大学人工智能学院) Institute of AI Education, East China Normal University(华东师范大学人工智能教育研究所) Yale School of Medicine, Yale University(耶鲁大学医学院)

AI总结 提出DifferSeg框架,通过差分感知融合模块自适应对齐多模态特征,并设计频率引导解码器平衡高低频表示,在29个公开数据集上超越67种方法。

详情
AI中文摘要

在许多二值分割任务中,大多数多模态方法依赖于固定的特征拼接进行跨模态交互,以及由低频语义主导的简单解码器设计。然而,它们忽略了两个关键挑战:一是缺乏处理模态差异和互补性的自适应机制,二是缺少平衡高低频表示的高效解码策略。在这项工作中,我们提出了一个简单而通用的多模态二值分割框架,称为DifferSeg,以同时解决这两个问题。借助差分感知融合(DPF)模块,DifferSeg使用可学习的差分算子自适应地对齐多模态特征,并通过残差融合增强其互补性,有效缓解模态不匹配和融合冗余。此外,我们设计了一个频率引导解码器(FGD),构建跨频率交互和多路径上采样,以保持细节高频结构与语义低频表示之间的一致性,确保细粒度边界恢复和噪声抑制。得益于这些设计,DifferSeg可以轻松泛化到各种二值分割任务,包括自然和医学模态。无需额外技巧,它在涉及18个下游任务的29个公开数据集上持续超越67种最先进方法,展示了卓越的泛化能力和分割精度。代码和预训练模型将在链接处提供。

英文摘要

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be available at the Link.

2606.08920 2026-06-09 cs.CV cs.AI 新提交

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

PolyBuild: 一种从高分辨率遥感图像中提取多边形建筑物轮廓的端到端方法

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

发表机构 * China University of Petroleum (East China)(中国石油大学(华东)) South Surveying&Mapping Instrument Co.,Ltd.(南方测绘仪器有限公司) China Railway Design Corporation(中国铁路设计集团有限公司)

AI总结 提出端到端方法PolyBuild,通过初始轮廓生成模块和轮廓优化模块直接从遥感图像提取矢量多边形建筑物轮廓,无需后处理,性能优于现有方法。

Comments Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)

详情
AI中文摘要

从高分辨率遥感图像中提取建筑物多边形轮廓是各种地图应用的基本任务。然而,不同的成像条件和复杂的建筑结构使得自动轮廓提取极具挑战性。主流的建筑物提取方法通常依赖于像素级分割,随后进行多个后处理步骤以生成建筑物轮廓,这计算量大且容易出错。在本文中,我们提出了一种名为PolyBuild的端到端方法,该方法可以直接从高分辨率遥感图像中提取建筑物矢量多边形,无需任何后处理操作。该方法利用两个主要模块:初始轮廓生成模块(ICGM)和轮廓优化模块(COM)。ICGM通过利用每个建筑物实例的拼接子区域中心特征来生成初始建筑物轮廓。它通过生成边界框并使用四个子区域的中心特征来表示每个建筑物,同时进行目标检测和初始轮廓提取。轮廓优化模块(COM)通过在基于Transformer的解码器中迭代集成卷积神经网络(CNN)特征和轮廓位置信息,进一步细化生成的建筑物轮廓。混合CNN-Transformer架构有效捕获建筑物轮廓内的局部和全局空间关系,确保高质量的边界描绘。在三个建筑物数据集上进行了大量实验以评估PolyBuild的性能。结果表明,PolyBuild显著优于最先进的方法,包括基于掩码和基于轮廓的方法。

英文摘要

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches for building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which can directly extract building vector polygons from high-resolution remote sensing images without the need for any post-processing operations. The proposed method leverages two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM is designed to generate an initial building contour by utilizing concatenated sub-region center features for each building instance. It performs simultaneous object detection and initial contour extraction by generating bounding boxes and using the center features of four sub-regions to represent each building. The COM further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.

2606.08980 2026-06-09 cs.CV 新提交

EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation

EPS3D: 端到端前馈式3D全景分割

Runsong Zhu, Jiaxin Guo, Xiaoyang Guo, Zhengzhe Liu, Ka-Hei Hui, Wei Yin, Kai Chen, Wei Chen, Weiqiang Ren, Yunhui Liu, Pheng-Ann Heng, Chi-Wing Fu

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出端到端前馈框架EPS3D,通过蒸馏训练和多视图图像预测3D感知特征,结合互增强模块实现语义-实例一致性,在Replica上语义mIoU提升13%,每场景仅需1秒。

Comments ICML 2026. The code is publicly available at https://github.com/Runsong123/EPS3D

详情
AI中文摘要

本文介绍了EPS3D,一种用于开放词汇3D全景分割的新型端到端前馈框架。与依赖额外预处理的现有方法不同,我们设计了一种端到端架构,采用基于蒸馏的训练策略,在多样化的3D场景中从多视图图像预测3D感知的语义和实例特征,提高了3D一致性并避免了错误累积。我们进一步提出了一个互增强模块,以强制实现固有的语义-实例一致性。通过在实例内对齐语义(Ins2Sem)和利用语义指导细化实例特征(Sem2Ins),我们实现了更连贯的3D场景理解。最终,EPS3D在两个基准测试上优于最先进的基线(例如,在Replica上语义mIoU提升13%),且效率高(例如,每场景1秒),支持机器人操作和3D场景编辑等任务。

英文摘要

This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods relying on additional preprocessing, we design an end-to-end architecture, with a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. Ultimately, EPS3D outperforms SOTA baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.

2606.09081 2026-06-09 cs.CV 新提交

Edge-Constrained UAV Small-Object Detection with P2 Enhancement and Quantum-Inspired Lightweight Structure Search

边缘约束下基于P2增强和量子启发轻量级结构搜索的无人机小目标检测

Wuming Lei, Yanbin Gao, Mingyan Sun, Xiaobin Li, Xuechen Liang

发表机构 * East China Jiaotong University(华东交通大学)

AI总结 针对无人机边缘部署,结合P2高分辨率检测分支与量子启发进化算法搜索轻量级结构,在VisDrone上显著提升小目标检测精度。

详情
AI中文摘要

无人机目标检测需要紧凑的检测器,在机载计算和内存限制下保留小目标细节。轻量级网络中的重复下采样削弱了浅层空间信息,而手动添加注意力或融合模块可能增加成本且收益不稳定。本研究在边缘部署约束下分析YOLOX-Nano,结合P2高分辨率检测分支与量子启发进化算法(QIEA)进行轻量级结构筛选。搜索空间由轻量级优先级和任务特异性定义,评估同时考虑精度、浮点运算数(FLOPs)、延迟、内存消耗和召回率。在VisDrone上,P2分支使AP$_{\mathrm{small}}$比YOLOX-Nano基线提升31.10%。与类似模型大小的NanoDet-Plus相比,YOLOX-Nano+-P2在AP$_{50:95}$上提升17.5%,在AP$_{\mathrm{small}}$上提升44.9%。QIEA选择的候选者获得最高Recall$_{50}$,但+P2在完整训练后仍是最强的AP导向变体。对Random-best、GA-best和SA/QUBO-best候选者进行完整的100轮验证进一步表明,代理排名不一定转化为最终的AP$_{50:95}$。这些结果支持将P2作为主要的小目标增强路径,并将QIEA作为候选筛选和精度-成本分析的轻量级工具。源代码、配置文件、诊断脚本和总结结果可在 https://github.com/Ming23233/UAV-QIEA-Edge-Detection 获取。

英文摘要

Unmanned aerial vehicle (UAV) object detection requires compact detectors that retain small-object details under onboard computation and memory constraints. Repeated downsampling in lightweight networks weakens shallow spatial information, while manually adding attention or fusion modules may increase cost without stable gains. This study analyzes YOLOX-Nano under edge-deployment constraints by combining a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm (QIEA) for lightweight structure screening. The search space is defined by lightweight priority and task specificity, and the evaluation jointly considers accuracy, floating-point operations (FLOPs), latency, memory consumption, and recall. On VisDrone, the P2 branch increases AP$_{\mathrm{small}}$ by 31.10% over the YOLOX-Nano baseline. Compared with NanoDet-Plus with similar model size, YOLOX-Nano+-P2 improves AP$_{50:95}$ by 17.5% and AP$_{\mathrm{small}}$ by 44.9%. The QIEA-selected candidate obtains the highest Recall$_{50}$, but +P2 remains the strongest AP-oriented variant after full training. Full 100-epoch verification of Random-best, GA-best, and SA/QUBO-best candidates further shows that proxy rankings do not necessarily transfer to final AP$_{50:95}$. These results support using P2 as the main small-object enhancement path and QIEA as a lightweight tool for candidate screening and accuracy-cost analysis. The source code, configuration files, diagnostic scripts, and summarized results are available at https://github.com/Ming23233/UAV-QIEA-Edge-Detection

2606.09143 2026-06-09 cs.CV 新提交

CAMF-Det: Closure-Aware Multimodal Fusion for LiDAR-Camera 3D Object Detection on UAV Platforms

CAMF-Det: 面向无人机平台的激光雷达-相机闭合感知多模态融合3D目标检测

Yanze Jiang, Yanfeng Gu, Xian Li

发表机构 * School of Electronics and Information Engineering, Harbin Institute of Technology(哈尔滨工业大学电子与信息工程学院)

AI总结 针对无人机俯视场景中树冠遮挡导致的多模态信息退化问题,提出基于比尔-朗伯定律的闭合感知融合框架CAMF-Det,通过显式建模双模态遮挡强度并注入检测流程,在自建数据集上实现困难级别mAP_BEV提升9.43%和4.88%。

详情
AI中文摘要

基于激光雷达和相机的多模态3D目标检测在地面车辆场景中表现出色,但尚未在无人机平台上得到探索。在无人机俯视场景中,以树冠为主的频繁地面物体遮挡导致空间变化和模态依赖的信息退化。现有的多模态融合框架既未显式建模这种地面物体遮挡,也未将遮挡感知嵌入检测流程,限制了其在遮挡无人机场景中的性能。为应对这些挑战,我们提出CAMF-Det,一种面向无人机平台的激光雷达-相机3D目标检测的闭合感知多模态融合框架,该框架通过物理启发式建模导出双模态遮挡强度,并将其作为先验嵌入整个检测流程。首先,双模态闭合建模模块通过比尔-朗伯启发式公式和建筑物掩码校正,离线为两种模态显式构建遮挡强度真值。其次,以这些真值图为监督,双模态预测网络在单帧推理下将离线建模结果转换为在线遮挡强度预测。第三,将真值和预测的遮挡强度注入数据增强、特征编码、多模态融合和检测头,实现在空间变化和模态依赖信息退化下的自适应检测。在两个自建的基于无人机的多模态数据集SI3D-DI和SI3D-DII上的实验表明,CAMF-Det在所有难度级别上均达到最佳性能,困难级别的mAP$_{\mathrm{BEV}}$分别比最佳竞争方法提升9.43%和4.88%。这些结果证实了显式遮挡先验建模和利用对于无人机场景中鲁棒多模态3D检测的有效性。

英文摘要

Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
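The abstract does not give the Beer-Lambert-inspired formulation itself, so the following is only a minimal sketch of the underlying physical law, under stated assumptions. It treats occlusion intensity as the complement of Beer-Lambert transmittance, `1 - exp(-k * L)`. The function name `occlusion_intensity`, the attenuation coefficient `attenuation_per_m`, and the canopy path length `path_length_m` are hypothetical names chosen for illustration and are not taken from CAMF-Det.

```python
import math


def occlusion_intensity(path_length_m: float, attenuation_per_m: float) -> float:
    """Return 1 - exp(-k * L), the complement of Beer-Lambert transmittance.

    Hypothetical illustration: L is the distance a ray travels through canopy
    and k is an assumed attenuation coefficient; neither comes from the paper.
    """
    if path_length_m < 0 or attenuation_per_m < 0:
        raise ValueError("path length and attenuation must be non-negative")
    transmittance = math.exp(-attenuation_per_m * path_length_m)
    return 1.0 - transmittance


# No canopy on the ray gives zero occlusion; longer paths give more occlusion.
thin = occlusion_intensity(0.5, 0.8)
thick = occlusion_intensity(3.0, 0.8)
```

Under this form the value stays in [0, 1) and grows monotonically with canopy thickness, which is the kind of bounded, spatially varying prior the abstract describes injecting into the pipeline.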

2606.09162 2026-06-09 cs.CV 新提交

Zero-Parameter Geometric Gating for Temporally Stable Low-Altitude UAV Video Semantic Segmentation

用于低空无人机视频语义分割的零参数几何门控以实现时间稳定性

Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Juanfan Wu, Mingxuan Cui, Yufeng Wang

发表机构 * Beihang University(北京航空航天大学) Northeastern University(东北大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Beijing Institute of Technology(北京理工大学)

AI总结 提出零参数几何门控,利用RANSAC单应性内点比率在16x16网格上路由区域,结合语义相似性传播实现时间稳定分割,在合成UAVid上提升mIoU达4.91%。

详情
AI中文摘要

低空无人机的视频语义分割需要时间一致性,但密集光流在主导航拍图像的平面区域中引入了空间结构化噪声。我们提出了一种零参数几何门控,它利用$16\times16$空间网格上的RANSAC单应性内点比率,在通过语义相似性传播融合之前,将每个区域路由到单应性或光流扭曲。该门控不需要学习参数(仅对RANSAC统计量进行中值阈值二值决策),为冻结的骨干网络仅增加了211K可训练参数(SSP融合层)。在合成UAVid上,该方法在两种架构(SegFormer-b2和Hiera-S+UPerNet)上比基础模型实现了+4.24至4.91%的mIoU改进。机制诊断表明,平面区域中的光流残差在空间上自相关(Moran's I = 0.32,$p < 0.001$),预测边界不稳定性(Spearman $\rho = 0.66$),并且刚性化在单应性有效区域中将时间一致性从62%恢复到92%(+29.5pp)。

英文摘要

Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical flow warp before fusion via Semantic Similarity Propagation (SSP). The gate requires no learned parameters, only a median-threshold binary decision on RANSAC statistics, so the method adds only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves a +4.24 to 4.91% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62% to 92% (+29.5pp) in homography-valid regions.
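The median-threshold routing described above can be sketched in a few lines of NumPy. This is an illustrative reading of the abstract and not the authors' implementation. The function names `geometric_gate` and `route_warps` are hypothetical, the rule that cells with an above-median inlier ratio go to the homography warp is an assumption, and the per-cell inlier ratios are taken as given inputs.

```python
import numpy as np


def geometric_gate(inlier_ratio: np.ndarray) -> np.ndarray:
    """Return a boolean grid; True routes a cell to the homography warp.

    Assumption: cells whose RANSAC inlier ratio exceeds the grid median are
    treated as planar enough for a homography. No learned parameters are used.
    """
    return inlier_ratio > np.median(inlier_ratio)


def route_warps(gate: np.ndarray, homography_warp: np.ndarray,
                flow_warp: np.ndarray) -> np.ndarray:
    """Select, per grid cell, one of two warped maps of shape (H, W, C).

    Assumes H and W are exact multiples of the grid size.
    """
    height, width = homography_warp.shape[:2]
    grid_h, grid_w = gate.shape
    # Expand each grid cell to its block of pixels by repetition.
    mask = np.repeat(np.repeat(gate, height // grid_h, axis=0),
                     width // grid_w, axis=1)
    return np.where(mask[..., None], homography_warp, flow_warp)


# Toy input: a 16x16 grid of inlier ratios and two dummy warped feature maps.
ratios = np.arange(256, dtype=np.float64).reshape(16, 16) / 255.0
gate = geometric_gate(ratios)
routed = route_warps(gate, np.ones((32, 32, 3)), np.zeros((32, 32, 3)))
```

Because the threshold is the median of the grid itself, roughly half of the cells are routed to each warp on every frame, which is one reason such a gate needs no tuning or training.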

2606.09245 2026-06-09 cs.CV cs.AI 新提交

Proposal Refinement for Few-Shot Object Detection

用于少样本目标检测的提议细化

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen

发表机构 * State Key Laboratory of Integrated Services Networks, Xidian University(西安电子科技大学综合业务网理论及关键技术国家重点实验室)

AI总结 针对少样本检测中区域提议在基类和新类间分布不均的问题,提出分阶段提议细化方法,通过基类训练阶段的细化损失和微调阶段的细化分支重新平衡提议分布,在基准上提升1%~6%且不增加推理时间。

详情
AI中文摘要

近年来,少样本目标检测引起了广泛关注。一些优秀的算法已被提出以处理这一任务。然而,这些算法大多依赖于少样本分类的性能。与以往尝试不同,我们的工作聚焦于新类和基类之间区域提议分布不均的问题。为了缓解这种不平衡分布,我们针对不同训练阶段提出了提议细化方法。具体而言,在基类训练阶段设计了细化损失以增强模型对新类的敏感性,在微调阶段引入了细化分支作为RPN(区域提议网络)的辅助分支以生成更多新类提议。通过重新平衡提议分布,所提方法在现有基准上比基线方法提高了约1%~6%,且不增加任何推理时间。通过大量实验,我们证明了为少样本目标检测任务建立了一种新的最先进方法。

英文摘要

Few-shot object detection has gained wide attention in recent years. Some excellent algorithms have been proposed to handle this task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the problem of unbalanced distribution of region proposals between the novel classes and the base classes. In order to alleviate this unbalanced distribution, we propose a proposal refinement approach for different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the sensitivity of the model to novel classes, and a refinement branch is introduced as an auxiliary branch for the RPN (Region Proposal Network) to generate more novel proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1% to 6% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach establishes a new state of the art for the few-shot object detection task.

2606.09262 2026-06-09 cs.CV 新提交

See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning

看得更多,匹配更好:用于双视图对应学习的多源特征融合

Xiaojie Li, Xin Jiang, Luanyuan Dai, Jinnan Yang, Yongdong Zhang, Zechao Li

发表机构 * Nanjing University of Science and Technology(南京理工大学) People’s Daily Online(人民网) University of Science and Technology of China(中国科学技术大学)

AI总结 提出TriMatch框架,融合几何、纹理语义和结构语义特征,通过语义引导调制和层次细化,提升重复结构等场景下的对应点鉴别能力。

Comments Correspondence Learning, Multi-Source Feature Fusion, Outlier Removal, Camera Pose Estimation

详情
AI中文摘要

双视图对应学习旨在通过利用图像对中真假对应点的内在差异来区分内点和外点。现有方法主要依赖于基于坐标的几何一致性。然而,在包含重复结构、无纹理区域或局部相似几何模式的场景中,它们常常难以处理伪一致的外点。为了解决这一限制,我们提出了TriMatch,一个用于双视图对应学习的多源特征融合框架,由特征提取和特征细化两部分组成。在特征提取中,TriMatch联合提取几何、纹理语义和结构语义特征,为对应点判别提供互补证据。为了弥合语义特征与几何特征之间的差距,纹理和结构语义特征分别通过专用的纹理-几何对齐和结构-几何对齐模块与几何特征对齐。我们进一步引入了语义引导的对应点调制模块,该模块利用语义信息调制几何特征,以抑制几何上合理但语义上不一致的对应点。在特征细化中,层次化语义增强的对应点细化策略逐步建模对应点依赖关系并重新校准多上下文特征响应,从而实现更可靠的内点-外点判别。大量实验证明了TriMatch的有效性、鲁棒性和泛化能力。

英文摘要

Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.

2606.09303 2026-06-09 cs.CV 新提交

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

再思考:通过候选发现与比较推理进行分割

Xinyan Gao, Haoran Hao, Xiangyu Yue

发表机构 * The Chinese University of Hong Kong(香港中文大学) Nanjing University(南京大学)

AI总结 提出两阶段框架Rea2Seg,先基于注意力图生成候选掩码,再用多模态大语言模型推理评分,将分割转化为候选发现与判别选择,并引入新基准ReasonSeg-SGDR全面评估感知、定位与推理能力。

Comments Project page: https://snowball521.github.io/Rea2Seg-Project/

详情
AI中文摘要

预训练基础模型的快速发展使得更通用的图像分割成为可能。多模态大语言模型(MLLMs)已被广泛探索用于需要高级推理的复杂查询的图像分割。尽管取得了有希望的进展,现有方法通常受限于有限的训练数据以及MLLMs与掩码生成模块之间的差距。为了更好地将MLLMs的感知和推理能力迁移到复杂的基于推理的分割任务,我们提出了一个两阶段框架Rea2Seg用于掩码生成和选择。具体来说,该框架首先基于分割MLLM的注意力图识别潜在区域作为候选掩码。然后,它利用MLLM对问题和候选掩码进行推理,并为每个掩码分配分数。最终的分割结果通过对候选掩码重新排序并选择最高分的掩码获得,将图像分割重新表述为候选发现后跟判别性掩码选择。我们还注意到,现有基准中的大部分问题集中在常识推理上,这些问题通常不需要完全的联合视觉观察和推理。为了解决这个问题,我们引入了一个名为ReasonSeg-SGDR的新基准,该基准在多个维度上全面评估模型的感知、定位和推理能力,包括判别性识别、空间推理、几何推理和多步推理,并带有细粒度的掩码生成。此外,我们收集训练数据以增强MLLMs联合理解多模态查询和候选掩码的能力,并通过推理分配分数。在提出的基准和ReasonSeg上的实验结果表明了统一掩码生成和选择框架的有效性。

英文摘要

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.

2606.09360 2026-06-09 cs.CV 新提交

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

ExDet: 基于跨模态外推与校正的开放域开放词汇检测

Yupeng Zhang, Yuzhong Feng, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University(天津大学智能与计算学部) Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology(深圳理工大学计算机科学与人工智能学院) School of Artificial Intelligence, Nanchang University(南昌大学人工智能学院)

AI总结 提出ExDet框架,通过文本引导外推(TGE)和检测器兼容校正(DCR)模块,无需额外训练即可增强开放域开放词汇检测的跨类别和跨域泛化能力,在多个基准上取得最优性能。

详情
AI中文摘要

开放域开放词汇检测(ODOVD)要求检测器泛化到新类别和未见过的域,比开放词汇检测更具挑战性。现有方法通常从头训练开放词汇检测器与域泛化模块,导致训练成本高。我们提出ExDet,一种轻量级类别-域协同泛化框架,用于增强现有检测器的跨类别和跨域泛化能力。ExDet由文本引导外推(TGE)、轻量级检测器兼容校正(DCR)模块和ExRPN组成。具体地,TGE利用视觉-语言模型(VLM)的DeltaSpace属性,从文本推断类别和域感知的代理视觉原型。DCR以无需检测器训练和无需真实数据的方式从TGE生成的原型中学习,并在推理时插入分类头之后,将表示校正为与检测器兼容的源域视觉分布,从而增强对新类别和未见域目标的分类。ExRPN通过结合语义相似度与RPN置信度重新校准提议分数,提高对新颖和域偏移目标的召回率,同时为后续分类和DCR提供更好支持。ExDet在OD-LVIS、OV-LVIS、Objects365和MSOSB上达到最优性能。

英文摘要

Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. We propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.

2606.09367 2026-06-09 cs.CV 新提交

RT-SDGOD: Real-Time Single-Domain Generalized Object Detection

RT-SDGOD: 实时单域泛化目标检测

Yupeng Zhang, Fangzhuo Gao, Ruize Han, Wei Feng, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University(天津大学智能与计算学部) Key Research Center for Surface Monitoring and Analysis of Relics, State Administration of Cultural Heritage(国家文物局文物表面监测与分析重点研究中心) Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology(深圳理工大学计算机科学与人工智能学院)

AI总结 针对实时检测器在域偏移下漏检严重的问题,提出多证据协同建模框架RT-SDGDet,通过一对多监督、证据多样性学习和双视图一致性学习提升泛化能力,且无额外推理开销。

详情
AI中文摘要

在严格实时约束下的实际部署中,天气和成像变化会导致显著的分布偏移,严重降低检测器性能。单域泛化目标检测旨在缓解这一问题,但现有方法很少在问题表述层面研究实时检测器在受限推理预算下的泛化能力。为此,我们引入实时单域泛化目标检测(RT-SDGOD),专注于实时检测器如何仅通过训练时表示学习,在零额外推理开销下实现跨域泛化。我们观察到,在域偏移下,基于DETR的实时检测器主要通过漏检增加而退化,根源在于目标级判别证据有限且不稳定。基于此,我们提出RT-SDGDet,一种用于RT-SDGOD的多证据协同建模框架。核心思想是使同一目标的多个查询协同覆盖更充分的判别证据,同时保持跨视图的证据建模稳定性。具体而言,我们使用一对多(O2M)监督构建稳定的目标特定查询组,并进一步设计判别证据多样性学习(DEDL)和双视图证据一致性学习(DvECL),分别扩展目标级证据覆盖范围和改善外观扰动下的证据稳定性。由于所有组件仅在训练时引入,我们的方法不产生额外推理开销。大量实验表明,所提方法在多个未见目标域上取得了比现有方法更好的泛化性能。

英文摘要

In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate, at the level of problem formulation, the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more sufficient discriminative evidence while maintaining the stability of such evidence modeling across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.

2606.09474 2026-06-09 cs.CV 新提交

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

通过开放词汇语义仲裁实现无需训练的通用少样本分割

Silas Kwabla Gah, Ebenezer Owusu

发表机构 * University of Ghana(加纳大学)

AI总结 提出Open-V框架,通过推理时协调冻结的语义先验(SAM3 PCS与K-shot CLIP支持质心)实现无需训练的通用少样本分割,在多个基准上超越有监督方法。

详情
AI中文摘要

通用少样本语义分割(GFSS)传统上被视为表示学习问题,需要任务特定的适应来从有限的支持样本中引入新类别。然而,最近的基础模型已经展现出强大的开放词汇识别和分割能力,这提出了一个不同的问题:能否通过推理时协调冻结的语义先验而不是参数适应来解决GFSS?我们通过Open-V回答了这个问题,这是一个无需训练的GFSS框架,它结合了Segment Anything (SAM3) 可提示概念分割(PCS)与K-shot CLIP支持质心,通过校准的逐像素语义仲裁。Open-V不引入任何可训练组件,并在推理时支持任意语义类别。除了分割性能,我们的研究还贡献了三个更广泛的发现。首先,我们表明支持信息可以通过推理时语义基础来整合,并且其贡献随着基础模型文本先验在标签不相交词汇表上的减弱而增加。其次,我们识别了基础模型分割中的可重复性混淆,证明了预处理和评估空间的不匹配会无声地扭曲报告的性能。最后,我们在PASCAL-5i、COCO-20i和ADE-OW上验证了Open-V,表明无需训练的基础模型先验协调在常规GFSS和开放词汇评估设置中都能泛化。在PASCAL-5i(1-shot)上,Open-V达到了基础/新类/调和mIoU分别为78.4/77.5/77.9,无需GFSS特定训练,超越了最强有监督基线+17.7 HM。

英文摘要

Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9 without GFSS-specific training, surpassing the strongest trained baseline by +17.7 HM.
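As a quick arithmetic check on the reported numbers, the harmonic value can be reproduced from the base and novel scores, assuming "harmonic mIoU" (HM) means the usual harmonic mean of base-class and novel-class mIoU. That definition is the common GFSS convention and is assumed here, not stated in the abstract. The helper name `harmonic_mean` is illustrative.

```python
def harmonic_mean(base_miou: float, novel_miou: float) -> float:
    """Harmonic mean of base-class and novel-class mIoU (assumed HM definition)."""
    return 2.0 * base_miou * novel_miou / (base_miou + novel_miou)


# Reported PASCAL-5i (1-shot) scores from the abstract: base 78.4, novel 77.5.
hm = harmonic_mean(78.4, 77.5)
print(round(hm, 1))  # prints 77.9, matching the reported harmonic mIoU
```

The harmonic mean penalizes imbalance between the two scores, so a value this close to both inputs indicates that base and novel performance are nearly equal.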

2606.09670 2026-06-09 cs.CV cs.AI 新提交

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

视觉提示结合基于特征重建的双教师监督异常检测

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

发表机构 * IBM Research Europe Zurich(IBM欧洲研究院苏黎世分院)

AI总结 针对异常检测在真实场景中因物体尺度、视角等变化失效的问题,提出视觉提示管道、解冻教师模型和扩散生成数据增强,在AeBAD数据集上提升3.5个百分点。

详情
AI中文摘要

最近的异常检测方法在成熟数据集(如MVTec)上取得了完美的检测和分割分数。然而,当基本假设(如一致的物体尺度、视角、背景、光照和居中放置)被违反时,许多方法面临挑战。这些变化使得异常检测方法在许多真实场景中无法使用。为了解决这些限制,我们引入了三个关键贡献:(1)一个视觉提示管道,通过前景-背景掩码隔离物体;(2)一种在师生模型中解冻教师以提高领域适应性的机制;(3)一种利用扩散生成合成图像的数据增强策略,以增强异常检测性能。通过使用掩码多尺度重建(MMR)模型作为骨干,我们在具有挑战性的AeBAD数据集上比之前的最先进方法提高了3.5个百分点。

英文摘要

Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods face challenges when foundational assumptions - such as consistent object scale, viewpoint, background, illumination, and centered placement - are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. We achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset by using the Masked Multiscale Reconstruction (MMR) model as our backbone.

2606.09679 2026-06-09 cs.CV 新提交

SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

SoccerNet 2026 以球员为中心的球类动作定位:FOOTPASS 基线的重训练与后处理扩展

Parthsarthi Rawat

发表机构 * GameChanger by Dick’s Sporting Goods(迪克体育用品的GameChanger)

AI总结 针对足球广播中八类动作的球员-动作-时间预测任务,在FOOTPASS基线上提出梯度检查点、GNN与DST融合、平方根频率类别加权和后处理流水线四项扩展,在测试集和挑战集上分别达到0.548和0.446的Macro F1。

Comments CVPR 2026 SoccerNet Player Centric Ball Action Spotting Challenge, Rank 7

详情
AI中文摘要

我们描述了针对SoccerNet 2026以球员为中心的球类动作定位挑战赛的系统,该挑战要求预测广播足球中八类动作的谁、做什么以及何时发生。基于三个FOOTPASS基线[1](TAAD、TAAD+GNN和TAAD+DST),我们贡献了四个扩展:(1)梯度检查点,使得在单个GPU上能够对整个骨干网络进行微调;(2)将GNN logits融合到DST编码器中,将基于图的战术上下文与每个球员的视觉特征相结合;(3)平方根频率类别加权,以解决训练数据中213:1的传球与抢断不平衡问题;(4)一个后处理流水线,包括每类logit门控、时间帧细化、球衣重新分配和双模型集成。我们的系统在测试集上达到0.548 Macro F1,在挑战集上(服务器评估)达到0.446。

英文摘要

We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires predicting who performs which action and when, across eight classes in broadcast soccer. Building on the three FOOTPASS baselines [1] (TAAD, TAAD+GNN, and TAAD+DST), we contribute four extensions: (1) gradient checkpointing to enable full-backbone fine-tuning on a single GPU; (2) fusion of GNN logits into the DST encoder, combining graph-based tactical context with per-player visual features; (3) square-root frequency class weighting to address the 213:1 pass-to-tackle imbalance in the training data; and (4) a post-processing pipeline comprising per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble. Our system achieves 0.548 Macro F1 on the test set and 0.446 on the challenge set (server evaluation).
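To make extension (3) concrete, here is a minimal sketch of square-root frequency class weighting. It assumes each class weight is proportional to 1/sqrt(count) and is then normalized to a mean of 1. The exact normalization used by the authors is not stated, and the counts below are hypothetical values chosen only to reproduce the 213:1 ratio.

```python
import math


def sqrt_frequency_weights(class_counts):
    """Return per-class loss weights proportional to 1 / sqrt(count).

    Weights are rescaled so that their mean is 1 (an assumption made here
    to keep the overall loss scale roughly unchanged).
    """
    raw = {name: 1.0 / math.sqrt(count) for name, count in class_counts.items()}
    mean_weight = sum(raw.values()) / len(raw)
    return {name: weight / mean_weight for name, weight in raw.items()}


# Hypothetical counts with the 213:1 pass-to-tackle ratio from the abstract.
counts = {"pass": 213_000, "tackle": 1_000}
weights = sqrt_frequency_weights(counts)
ratio = weights["tackle"] / weights["pass"]  # equals sqrt(213), about 14.6
```

Compared with plain inverse-frequency weighting, which would up-weight tackles by a factor of 213, the square root compresses the factor to roughly 14.6. This softens the correction and reduces the risk of over-predicting rare classes.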

2606.09772 2026-06-09 cs.CV 新提交

SemDINO: A DINOv3-Driven Network for Cross-Temporal Semantic Alignment in Change Detection

SemDINO: 一种基于DINOv3的跨时间语义对齐变化检测网络

Xinyu Tong, Meihua Zhou, Jinxiao Sun, Yingjie Tang, Lei Wang

发表机构 * Xinjiang Institute of Ecology and Geography, Chinese Academy of Sciences(中国科学院新疆生态与地理研究所) University of Chinese Academy of Sciences(中国科学院大学) School of Computer Science, Xiangtan University(湘潭大学计算机科学学院) College of Information and Communication Engineering, Harbin Engineering University(哈尔滨工程大学信息与通信工程学院)

AI总结 提出SemDINO网络,通过双分支编码器、多尺度时序交互、语义净化与变化增强模块,解决语义变化检测中跨时间对齐不足、多尺度表示弱及伪变化鲁棒性差的问题。

详情
AI中文摘要

语义变化检测(SCD)旨在同时定位土地覆盖变化并识别转变前后的语义类别。然而,现有方法存在跨时间对齐不足、多尺度表示弱以及对光照、季节和配准噪声引起的伪变化鲁棒性差的问题。为了解决这些问题,我们提出了一种名为SemDINO的新型端到端语义变化检测网络,它将双分支编码器、多尺度时序交互、语义净化、变化增强和解耦多任务预测集成到一个统一框架中。具体来说,我们构建了一个双分支编码器,通过门控金字塔融合将CNN骨干网络和冻结的DINOv3特征相结合,实现丰富的多尺度语义表示。然后,提出了一种多尺度时序双向变换器交互(M-TBTT)模块,以实现全局跨时间特征对齐和信息交互。为了进一步增强真实变化并抑制伪变化,我们协同引入了语义净化(SCP)、双向变化增强(BiChangeEnhance)和多尺度变化增强(MCE)模块。最后,设计了一个多分支CD预测头,用于联合输出二值变化掩码、双时相语义图和边缘约束。在公开遥感CD数据集上的大量实验表明,SemDINO在复杂干扰因素场景下,相比最先进方法取得了优越的性能和泛化能力。

英文摘要

Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify semantic categories before and after transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. To address these issues, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone and frozen DINOv3 features via gated pyramid fusion, enabling rich multi-scale semantic representation. Then, a multi-scale temporal bidirectional transformer interaction (M-TBTT) module is proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-variations, we jointly introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules. Finally, a multi-branch CD prediction head is designed to jointly output a binary change mask, bi-temporal semantic maps, and an edge constraint. Extensive experiments on public remote sensing CD datasets demonstrate that SemDINO achieves superior performance and generalization ability against state-of-the-art methods, especially in complex scenarios with interference factors.

2504.00375 2026-06-09 cs.CV 版本更新

CamoSAM2: SAM2-oriented Prompt Auto-Refinement for Video Camouflaged Object Detection

CamoSAM2: 面向SAM2的提示自动精化用于视频伪装目标检测

Xin Zhang, Keren Fu, Qijun Zhao

发表机构 * National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University(合成视觉基础科学国家重点实验室,四川大学) College of Computer Science, Sichuan University(计算机学院,四川大学)

AI总结 提出CamoSAM2框架,通过运动-外观提示诱导器和自适应多提示精化策略,自动生成并优化SAM2的提示,显著提升视频伪装目标检测性能。

Comments 13 pages, 8 figures

详情
AI中文摘要

Segment Anything Model 2 (SAM2) 是一种提示引导的视频基础模型,在视频目标分割中表现出色,引起了社区的广泛关注。由于伪装目标与其周围环境高度相似,即使人眼也难以区分,因此SAM2在现实场景中自动分割的应用面临伪装感知和可靠提示生成的挑战。为了解决这些问题,我们提出了CamoSAM2,一个运动-外观提示诱导器(MAPI)和精化框架,用于自动生成和精化SAM2的提示,从而在VCOD任务中实现高质量的自动检测和分割。首先,我们引入了一个提示诱导器,它同时整合运动和外观线索来检测伪装目标,比现有方法提供更准确的初始预测。其次,我们提出了一种针对SAM2的基于视频的自适应多提示精化(AMPR)策略,旨在减轻初始粗糙掩码中的提示错误,并进一步生成良好的提示。具体来说,我们引入了一个新颖的三步过程,通过伪装目标确定、关键提示帧选择和多提示形成来生成可靠的提示。在两个基准数据集上进行的大量实验表明,我们提出的模型CamoSAM2显著优于现有最先进的方法,在mIoU指标上分别提高了8.0%和10.1%。此外,与当前的VCOD模型相比,我们的方法实现了最快的推理速度。

英文摘要

The Segment Anything Model 2 (SAM2), a prompt-guided video foundation model, has performed remarkably in video object segmentation, drawing significant attention in the community. Due to the high similarity between camouflaged objects and their surroundings, which makes them difficult to distinguish even by the human eye, the application of SAM2 for automated segmentation in real-world scenarios faces challenges in camouflage perception and reliable prompt generation. To address these issues, we propose CamoSAM2, a motion-appearance prompt inducer (MAPI) and refinement framework to automatically generate and refine prompts for SAM2, enabling high-quality automatic detection and segmentation in the video camouflaged object detection (VCOD) task. Initially, we introduce a prompt inducer that simultaneously integrates motion and appearance cues to detect camouflaged objects, delivering more accurate initial predictions than existing methods. Subsequently, we propose a video-based adaptive multi-prompts refinement (AMPR) strategy tailored for SAM2, aimed at mitigating prompt error in initial coarse masks and further producing good prompts. Specifically, we introduce a novel three-step process to generate reliable prompts by camouflaged object determination, pivotal prompt frame selection, and multi-prompts formation. Extensive experiments conducted on two benchmark datasets demonstrate that our proposed model, CamoSAM2, significantly outperforms existing state-of-the-art methods, achieving increases of 8.0% and 10.1% in the mIoU metric. Additionally, our method achieves the fastest inference speed compared to current VCOD models.

2509.10334 2026-06-09 cs.CV cs.AI cs.LG 版本更新

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

I-Segmenter: 用于高效语义分割的纯整数视觉Transformer

Jordan Sassoon, Michal Szczepanski, Martyna Poreba

发表机构 * CEA, France(法国原子能委员会)

AI总结 提出I-Segmenter,首个全整数ViT分割框架,通过整数运算替换、λ-ShiftGELU激活函数及解码器优化,在保持精度前提下显著降低模型大小和推理延迟。

Comments Accepted by the Journal of Systems Architecture

详情
AI中文摘要

视觉Transformer(ViT)最近在语义分割中取得了强劲的结果,但由于其高内存占用和计算成本,在资源受限设备上的部署仍然有限。量化提供了一种提高效率的有效策略,但基于ViT的分割模型在低精度下非常脆弱,因为量化误差会在深度编码器-解码器流水线中累积。我们引入了I-Segmenter,这是第一个完全纯整数的ViT分割框架。基于Segmenter架构,I-Segmenter系统地将浮点运算替换为纯整数对应运算。为了进一步稳定训练和推理,我们提出了λ-ShiftGELU,一种新颖的激活函数,它减轻了均匀量化在处理长尾激活分布时的局限性。此外,我们移除了L2归一化层,并将解码器中的双线性插值替换为最近邻上采样,确保整个计算图都是纯整数执行。大量实验表明,I-Segmenter在合理精度范围内(平均5.1%)达到其FP32基线的精度,同时将模型大小减少高达3.8倍,并通过优化的运行时实现高达1.2倍的推理加速。值得注意的是,即使在单张校准图像的一次性PTQ中,I-Segmenter也能提供有竞争力的精度,凸显了其在实际部署中的实用性。

英文摘要

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose λ-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
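The swap from bilinear interpolation to nearest neighbor upsampling matters because bilinear blending multiplies by fractional weights, while nearest neighbor only copies existing values. The sketch below illustrates that point on a small grid of quantized logits. It is an illustration written for this digest, not the authors' kernel, and the function name and sample values are hypothetical.

```python
def nearest_neighbor_upsample(grid, scale):
    """Upsample a 2D grid of integers by an integer factor.

    Only integer index arithmetic (floor division) is used, so every output
    value is a copy of an input value and no floating-point math occurs.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    out_height = len(grid) * scale
    out_width = len(grid[0]) * scale
    return [
        [grid[y // scale][x // scale] for x in range(out_width)]
        for y in range(out_height)
    ]


# Hypothetical 2x2 map of quantized (integer) logits, upsampled by 2.
logits = [[3, -7], [12, 0]]
upsampled = nearest_neighbor_upsample(logits, 2)
# upsampled == [[3, 3, -7, -7], [3, 3, -7, -7], [12, 12, 0, 0], [12, 12, 0, 0]]
```

Because the output contains only values already present in the input, the quantization scale of the tensor is unchanged and no requantization step is needed after upsampling.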

2509.17078 2026-06-09 cs.CV 版本更新

Enhanced Detection of Tiny Objects in Aerial Images

航拍图像中微小目标的增强检测

Kihyun Kim, Michalis Lazarou, Tania Stathaki

发表机构 * Dept. of Electrical & Electronic Engineering, Imperial College London(帝国理工学院电气与电子工程系) Centre for Vision, Speech and Signal Processing, University of Surrey(萨里大学视觉、语音与信号处理中心)

AI总结 针对YOLOv8在航拍图像中检测微小目标性能不足的问题,提出四种增强策略,并设计MoonNet管道,通过集成多种注意力模块提升检测精度,在微小目标基准上达到最优性能。

Comments Accepted at IEEE ICIP 2026

详情
AI中文摘要

虽然像YOLOv8这样的单阶段检测器训练速度快,但作为权衡,它们在小目标检测上往往表现不佳。在航拍图像中检测微小目标时,由于目标分辨率低和背景杂乱,这一问题变得更加关键。为了解决这个问题,我们引入了四种增强策略(输入图像分辨率调整、数据增强、注意力机制以及注意力模块的替代门控函数),这些策略可以轻松地在YOLOv8上实现。我们证明,增大图像尺寸和适当使用数据增强可以带来性能提升。此外,我们设计了一个混合正交神经模块网络(MoonNet)管道,该管道由多个注意力模块增强的CNN组成。两个著名的注意力模块,Squeeze-and-Excitation(SE)块和卷积块注意力模块(CBAM),被集成到YOLOv8的主干中,形成了MoonNet设计,与原始YOLOv8主干和单一类型注意力模块增强的主干相比,MoonNet主干获得了改进的检测精度。MoonNet通过与YOLC模型集成,在微小目标基准上实现了最先进的性能,进一步证明了其适应性和潜力。我们的代码可在以下网址获取:https://github.com/Kihyun11/MoonNet

英文摘要

While one-stage detectors like YOLOv8 offer fast training speed, they often underperform on small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce four enhancement strategies (input image resolution adjustment, data augmentation, attention mechanisms, and an alternative gating function for attention modules) that can be easily implemented on YOLOv8. We demonstrate that image size enlargement and the proper use of augmentation can improve detection performance. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of multiple attention-module-augmented CNNs. Two well-known attention modules, Squeeze-and-Excitation (SE) Block and Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 to form the MoonNet design, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8 backbone and single-type attention-module-augmented backbones. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our code is available at: https://github.com/Kihyun11/MoonNet

2511.06644 2026-06-09 cs.CV 版本更新

UniADC: A Unified Framework for Anomaly Detection and Classification

UniADC:统一异常检测与分类框架

Ximiao Zhang, Min Xu, Zheng Zhang, Yap-Peng Tan, Xiuzhuang Zhou

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) China University of Mining(中国矿业大学) VinUniversity(文理大学)

AI总结 提出UniADC模型,通过无训练可控修复网络和隐式正态判别器,同时实现异常区域检测与类别识别,在少样本甚至零样本下超越现有方法。

详情
AI中文摘要

在本文中,我们引入了一个称为统一异常检测与分类的新任务,旨在同时检测图像中的异常区域并识别其具体类别。现有方法通常将异常检测和分类视为独立任务,从而忽略了它们的内在关联并限制了信息共享,导致性能次优。为了解决这个问题,我们提出了UniADC,一个设计用于在仅有少量甚至没有异常图像的情况下有效执行这两项任务的模型。具体来说,UniADC由两个关键组件组成:一个无需训练的可控修复网络和一个隐式正态判别器。修复网络可以通过在异常先验指导下修复正常区域来合成特定类别的异常图像,并且还可以修复少量异常样本以扩充可用异常数据。隐式正态判别器通过隐式建模正常状态来解决正常与异常像素分布不平衡的严峻挑战,通过将细粒度图像特征与异常类别嵌入对齐来实现精确的异常检测和分类。我们在四个异常检测与分类数据集(包括MVTec-FS、MTD、WFDD和Real-IAD)上进行了大量实验,结果表明UniADC在异常检测、定位和分类方面始终优于现有方法。代码可在以下网址获取:https://github.com/cnulab/UniADC。

英文摘要

In this paper, we introduce a novel task termed unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlations and limiting information sharing, which results in suboptimal performance. To address this, we propose UniADC, a model designed to effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free Controllable Inpainting Network and an Implicit-Normal Discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The implicit-normal discriminator addresses the severe challenge of the imbalance between normal and anomalous pixel distributions by implicitly modeling the normal state, achieving precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on four anomaly detection and classification datasets, including MVTec-FS, MTD, WFDD and Real-IAD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.

2512.03470 2026-06-09 cs.CV 版本更新

STGBD-Net: Spatio-temporal Gradient Basis Decomposition Network for Infrared Small Target Detection

STGBD-Net:用于红外小目标检测的时空梯度基分解网络

Chen Hu, Mingyu Zhou, Shuai Yuan, Hongbo Hu, Zhenming Peng, Tian Pu, Xiying Li

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University(中山大学智能系统工程学院) School of Information and Communication Engineering and the Laboratory of Imaging Detection and Intelligent Perception, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院和成像检测与智能感知实验室) School of Instrument Science and Opto-Electronics Engineering, Hefei University of Technology(合肥工业大学仪器科学与光电工程学院)

AI总结 针对红外小目标检测中弱目标易被背景杂波淹没的问题,提出基于基分解理论的梯度基分解模块(GDM),将归一化梯度特征作为基向量重构新特征,结合轻量级U-Net实现单帧与多帧检测,在多个基准上达到SOTA性能。

详情
AI中文摘要

红外小目标检测(IRSTD)的一个关键挑战是弱目标信号响应容易被强背景杂波掩盖,经常导致漏检。虽然传统的基于梯度的方法试图捕捉精细细节,但其鲁棒性受到多方向梯度特征静态融合的限制。在本文中,我们从基分解理论的角度重新思考特征融合,并提出一种新颖的框架,将该过程重构为显式且自适应的分解与重建范式。具体而言,我们引入了基分解模块(BDM)及其专门变体——梯度分解模块(GDM),用于IRSTD。GDM将归一化梯度特征视为基向量来重建新特征,从而保持细节结构并突出红外小目标。通过将GDM集成到轻量级的三阶段U-Net中,我们开发了两种统一架构:用于单帧检测的空间梯度基分解网络和用于多帧场景的时空梯度基分解网络。大量实验表明,我们的网络在多个基准上达到了最先进的性能,在检测精度和计算效率之间提供了优越的平衡。我们的代码将在以下网址公开:https://github.com/greekinRoma/IRSTD_HC_Platform。

英文摘要

A key challenge in infrared small target detection (IRSTD) is that weak target signal responses are easily obscured by strong background clutter, frequently resulting in missed detections. While traditional gradient-based methods attempt to capture fine details, their robustness is limited by the static fusion of multi-directional gradient features. In this paper, we rethink feature fusion from the perspective of Basis Decomposition Theory and propose a novel framework that reformulates the process into an explicit and adaptive decomposition-and-reconstruction paradigm. Specifically, we introduce the Basis Decomposition Module (BDM) and its specialized variant, the Gradient Decomposition Module (GDM) for IRSTD. GDMs treat the normalized gradient features as basis vectors to reconstruct a new feature, thereby maintaining detailed structures and highlighting infrared small targets. By integrating GDMs into a lightweight three-stage U-Net, we develop two unified architectures: the Spatial Gradient Basis Decomposition Network for single-frame detection and the Spatio-temporal Gradient Basis Decomposition Network for multi-frame scenarios. Extensive experiments demonstrate that our networks achieve state-of-the-art (SOTA) performance across multiple benchmarks, offering a superior balance between detection accuracy and computational efficiency. Our codes will be made public at: https://github.com/greekinRoma/IRSTD_HC_Platform.

2512.13869 2026-06-09 cs.CV 版本更新

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

基于扩散模型的UAV人类检测粗到细层次对齐

Wenda Li, Meng Wu, Liangzhao Chen, Sungmin Eum, Heesung Kwon, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Coarse-to-Fine Hierarchical Alignment框架,通过合成数据转换缩小领域差距,提升UAV人类检测精度,实验显示mAP50提升14.1%。

详情
AI中文摘要

训练目标检测器需要大量任务特定标注,但UAV人类检测因目标分布不断变化和标注图像稀缺而难以实现。为此,本文引入Coarse-to-Fine Hierarchical Alignment(CFHA),一个基于扩散模型的三阶段框架,旨在将合成数据转换为UAV人类检测数据,缩小领域差距并保持原始合成标签。CFHA通过三个模块显式解耦全局风格和局部内容领域的差异:(1)全局风格迁移——扩散模型通过少量真实参考集对齐合成图像的颜色、光照和纹理统计至真实风格;(2)局部细化——超分辨率扩散模型用于细化小物体的精细和逼真细节,如人类实例,保持形状和边界完整性;(3)幻觉消除——过滤掉与真实数据不匹配的人类实例,使人类外观更接近目标分布。在公开的UAV Sim2Real检测基准上进行的广泛实验表明,本文方法显著优于非转换基线。具体而言,本文方法在Semantic-Drone基准上mAP50提升达+14.1%。消融研究证实了全局和局部阶段的互补作用,并突显了层次对齐的重要性。代码已发布在https://github.com/liwd190019/CFHA。

英文摘要

Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer -- a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement -- a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for the small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal -- a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our methods significantly improve the detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to +14.1 improvement of mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.

2602.10858 2026-06-09 cs.CV 版本更新

Hyperspectral Smoke Segmentation via Mixture of Prototypes

基于原型混合的高光谱烟雾分割

Lujian Yao, Haitao Zhao, Xianghai Kong, Yuhan Xu

发表机构 * Automation Department, School of Information Science and Engineering, East China University of Science and Technology(华东理工大学信息科学与工程学院自动化系)

AI总结 针对烟雾分割中光谱信息不足、云干扰和半透明区域问题,提出首个高光谱烟雾分割数据集HSSDataset,并设计原型混合网络(MoP),通过波段分离、原型光谱表示和双阶段路由器实现自适应波段加权,在高低光谱模态上均取得优异性能。

Comments 31 pages, 14 figures

详情
AI中文摘要

烟雾分割对于野火管理和工业安全应用至关重要。传统的可见光方法由于光谱信息不足而面临局限性,特别是在处理云干扰和半透明烟雾区域时。为了解决这些挑战,我们引入高光谱成像进行烟雾分割,并提出了第一个高光谱烟雾分割数据集(HSSDataset),该数据集使用多对一标注协议,从20个真实场景的超过18,000帧中收集了精心标注的样本。然而,不同的光谱波段在空间区域上表现出不同的判别能力,因此需要自适应的波段加权策略。我们将此分解为三个技术挑战:光谱交互污染、有限的光谱模式建模和复杂的加权路由器问题。我们提出了一种原型混合(MoP)网络,包括:(1) 波段分离(BS)用于光谱隔离,(2) 基于原型的光谱表示(PSR)用于多样化模式,以及(3) 双阶段路由器(DSR)用于自适应空间感知波段加权。我们进一步构建了一个包含RGB-红外图像的多光谱数据集(MSSDataset)。大量实验验证了该方法在高光谱和多光谱模态上的优越性能,为基于光谱的烟雾分割建立了新的范式。

英文摘要

Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotation protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) band split (BS) for spectral isolation, (2) prototype-based spectral representation (PSR) for diverse patterns, and (3) dual-stage router (DSR) for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.

2602.20551 2026-06-09 cs.CV 版本更新

CAD-Prompted SAM3: Geometry-Conditioned Instance Segmentation for Industrial Objects

CAD-Prompted SAM3: 用于工业物体的几何条件实例分割

Zhenran Tang, Rohan Nagabhirava, Changliu Liu

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

AI总结 本文提出基于CAD模型的SAM3分割框架,通过几何条件进行实例分割,解决工业环境中难以用语言或外观描述的对象问题。

详情
AI中文摘要

基于自然语言的分割方法受限于语言表达能力,在制造和3D打印环境中常遇到难以描述的实例特定对象。尽管图像示例提供替代方案,但它们主要编码外观线索如颜色和纹理,与部件的几何身份无关。在工业环境中,单一组件可能由不同材料、表面处理或颜色生产,使基于外观的提示不可靠。相反,此类对象通常由精确的CAD模型定义,捕捉其标准几何形状。我们提出基于SAM3的CAD提示分割框架,使用CAD模型的多视图渲染作为提示输入。渲染的视图提供独立于表面外观的几何条件。模型通过在模拟中生成的网格渲染数据进行训练,涵盖多样化的视角和场景上下文。我们的方法实现了单阶段CAD提示掩码预测,将可提示分割扩展到无法仅通过语言或外观描述的对象。

英文摘要

Verbal-prompted segmentation is inherently limited by the expressiveness of natural language and struggles with uncommon, instance-specific, or difficult-to-describe objects: scenarios frequently encountered in manufacturing and 3D printing environments. While image exemplars provide an alternative, they primarily encode appearance cues such as color and texture, which are often unrelated to a part's geometric identity. In industrial settings, a single component may be produced in different materials, finishes, or colors, making appearance-based prompting unreliable. In contrast, such objects are typically defined by precise CAD models that capture their canonical geometry. We propose a CAD-prompted segmentation framework built on SAM3 that uses canonical multi-view renderings of a CAD model as prompt input. The rendered views provide geometry-based conditioning independent of surface appearance. The model is trained using synthetic data generated from mesh renderings in simulation under diverse viewpoints and scene contexts. Our approach enables single-stage, CAD-prompted mask prediction, extending promptable segmentation to objects that cannot be robustly described by language or appearance alone.

5. 视频理解与时序视觉 21 篇

2606.07669 2026-06-09 cs.CV cs.AI 新提交

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

MemoVAD: 边缘计算场景下基于动态语义记忆的资源高效视频异常检测

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

发表机构 * Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University(北京师范大学大数据云边智能协同教育部工程研究中心)

AI总结 提出MemoVAD边缘-云协同框架,通过不确定性感知门控策略选择性调用云端视觉语言模型,并设计动态语义记忆缓存原型,在降低通信开销的同时提升视频异常检测性能。

Comments Accepted by IJCAI2026

详情
AI中文摘要

在真实监控场景中部署视频异常检测(VAD)面临着对高层语义的需求以确保有效性,与边缘设备有限计算资源之间的根本矛盾。视觉语言模型(VLM)提供了丰富的开放词汇语义,但其延迟和计算成本阻碍了设备端部署。为解决这一挑战,我们提出MemoVAD,一种边缘-云协同框架,选择性地将VLM语义融入流式VAD。MemoVAD在边缘端使用轻量级检测器和因果时序上下文编码器(TCE)建模时序依赖,运行大部分推理。具体而言,我们引入基于主观逻辑的不确定性感知门控(UAG)策略,以建模感知不确定性,并仅对高不确定性和语义新颖的片段查询云端VLM。此外,设计动态语义记忆(DSM)缓存经VLM验证的原型以实现高效检索,使边缘模型通过语义适配器逐步融入VLM级语义。在真实边缘设备上对UCF-Crime和XD-Violence数据集的实验表明,MemoVAD在显著降低通信开销的同时,超越了当前最优性能。

英文摘要

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. In addition, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets on a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
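A gating rule of the kind described above might look like the following sketch. It assumes the common evidential formulation of Subjective Logic, where K classes with non-negative evidence e_k give an uncertainty mass u = K / S with S = sum(e_k + 1), and it treats a clip as novel when its embedding is not close to any cached prototype. The thresholds, function names, and the cosine-similarity novelty test are all assumptions for illustration; the abstract does not specify them.

```python
import math


def subjective_uncertainty(evidence):
    """Uncertainty mass u = K / S, with S = sum(e_k + 1) over K classes."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)
    return num_classes / strength


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_query_cloud(evidence, clip_embedding, memory,
                       u_threshold=0.5, sim_threshold=0.8):
    """Query the cloud VLM only for uncertain AND semantically novel clips.

    `memory` is a list of cached prototype embeddings; an empty memory means
    every clip counts as novel. Both thresholds are hypothetical.
    """
    uncertain = subjective_uncertainty(evidence) > u_threshold
    novel = all(cosine_similarity(clip_embedding, proto) < sim_threshold
                for proto in memory)
    return uncertain and novel
```

With no evidence at all the uncertainty is 1, and it shrinks toward 0 as evidence accumulates, so confident edge predictions never trigger a cloud call. Clips that match a cached prototype are also handled locally, which is where the communication savings would come from in such a design.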

2606.08133 2026-06-09 cs.CV 新提交

Gravity-guided Contact Dynamics Estimation from 3D Human Motions

重力引导的3D人体运动接触动力学估计

Cuong Le, Urs Waldmann, Bastian Wandt, Mårten Wadenbäck

发表机构 * Linköping University(林雪平大学)

AI总结 提出GraCE模型,利用人体质心与重力分布,从3D运动数据中准确估计地面接触力与压力分布,优于现有方法。

Comments 14 pages, under submission

详情
AI中文摘要

作用于人体的地面接触力对于生物力学研究或运动表现分析至关重要。先前的方法依赖测力台或压力垫来收集地面接触动力学,限制了其在严格控制环境下的适用性。一个更具扩展性的解决方案是直接从运动捕捉数据估计动力学。近期方法仅根据身体与地面之间的垂直距离粗略估计地面接触动力学,无法捕捉所有接触点的复杂压力分布。为此,我们提出GraCE——重力引导的接触动力学估计,一种新颖的全身接触动力学模型,利用身体质量分布和重力的真实影响来估计人体运动。我们使用人体的重心,基于其与身体的相对距离来估计地面接触。每个接触点上的作用力通过预测的接触概率与根据质心轨迹计算的总外力的乘积来估计。我们在GroundLink数据集上的地面反作用力估计和MOYO数据集上的详细接触压力预测中优于相关工作。代码将在接收后公开。

英文摘要

Ground contact forces acting on the human body are crucial for biomechanics studies or sport performance analysis. Prior methods rely on force plates or pressure mats to collect ground contact dynamics, limiting their applicability to carefully controlled settings. A more scalable solution is to estimate the dynamics directly from motion capture data. Recent approaches only roughly estimate the ground contact dynamics from the vertical distance between the body and the ground plane, which cannot capture the complex pressure distribution of all contact points. To this end, we propose GraCE -- Gravity-guided Contact Dynamics Estimation, a novel full-body contact dynamics model for human motions using a realistic influence of body mass distribution and gravity. We use the human's center of gravity to estimate the ground contacts based on its relative distance to the human body. The applied force on each contact is estimated via the product of predicted contact probabilities and the total exterior force computed from the center of mass trajectory. We outperform related work on the GroundLink dataset for ground reaction force estimation, and on the MOYO dataset for detailed contact pressure prediction. The code will be published upon acceptance.
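The force computation described above can be sketched from Newton's second law. The code assumes the total exterior force is F = m(a - g), with the center-of-mass acceleration a obtained by a central finite difference, and that per-contact forces are obtained by normalizing the contact probabilities so they sum to the total. The normalization, the z-up convention, and all names are assumptions made for this sketch, not details confirmed by the abstract.

```python
GRAVITY = (0.0, 0.0, -9.81)  # m/s^2, assuming z points up


def com_acceleration(p_prev, p_curr, p_next, dt):
    """Central finite-difference acceleration of the center of mass."""
    return tuple((a - 2.0 * b + c) / (dt * dt)
                 for a, b, c in zip(p_prev, p_curr, p_next))


def total_external_force(mass, acceleration, gravity=GRAVITY):
    """Newton's second law: m*a = F_ext + m*g, so F_ext = m*(a - g)."""
    return tuple(mass * (a - g) for a, g in zip(acceleration, gravity))


def per_contact_forces(contact_probs, total_force):
    """Split the total force across contacts in proportion to probability."""
    prob_sum = sum(contact_probs)
    if prob_sum == 0.0:
        # No predicted contact (e.g. a flight phase): no ground force.
        return [(0.0, 0.0, 0.0) for _ in contact_probs]
    return [tuple(p / prob_sum * f for f in total_force)
            for p in contact_probs]


# A 70 kg person standing still: the center of mass does not move.
com = (0.0, 0.0, 1.0)
accel = com_acceleration(com, com, com, dt=1.0 / 30.0)
force = total_external_force(70.0, accel)
left_foot, right_foot = per_contact_forces([0.9, 0.9], force)
```

In the standing case the acceleration is zero, so the ground must supply a vertical force equal to body weight (70 × 9.81 = 686.7 N), and two equally likely foot contacts each receive half of it.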

2606.08566 2026-06-09 cs.CV 新提交

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

通过细粒度情感-原因对提取实现精确的情感归因视频字幕生成

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu, Yongdong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院) Harbin Institute of Technology (Weihai)(哈尔滨工业大学(威海))

AI总结 提出细粒度情感-原因对提取框架,通过概念感知视觉语义分解和视觉引导情感可解释学习,提升情感视频字幕的准确性和丰富性。

详情
AI中文摘要

情感视频字幕生成(EVC)是一项具有挑战性的任务,旨在为视频生成事实准确且情感丰富的描述。现有的EVC方法利用整体视觉特征挖掘全局情感线索,然后聚合多模态特征以指导情感字幕生成,这忽略了EVC任务的关键特性。视觉情感是由特定的动机原因引发的,这些原因通常只隐含在核心视频片段中。整体挖掘带来了显著的信息冗余和不准确的情感线索。因此,细粒度的视觉原因提取对情感感知和情感归因字幕生成都有促进作用。为此,我们提出了一种用于情感归因视频字幕生成的细粒度情感-原因对提取框架。具体来说,我们通过两轮学习成对的情感和原因特征:1)我们提出了一种概念感知的视觉语义分解模块,通过探索场景、对象和运动概念来增强视觉特征。此外,为了增强情感特征,我们提出了一种视觉引导的情感可解释学习模块,该模块利用视觉时间动态指导情感细化,并通过可靠的VAD向量约束增强可解释的细化过程。2)我们通过在细化前后交叉耦合视觉和情感特征来实现情感-原因对提取,并利用对比损失实现语义强制对齐。总体而言,我们的方法优化了视频的复杂语义理解和情感感知,从而在情感字幕生成中取得了有前景的性能。在三个具有挑战性的数据集上进行的大量实验证明了我们的方法和每个提出模块的优越性,例如,在EVC-MSVD数据集上,BLEU-2和ROUGE-L分别取得了+4.4%和+5.4%的最佳性能。

英文摘要

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues, and then aggregate multimodal features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task. Visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. The holistic mining brings significant information redundancy and inaccurate emotional cues. Thus, fine-grained visual cause extraction has a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. Besides, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics, and augments the interpretable refinement process by reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage contrastive loss to achieve semantic forced alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., achieving the best performances with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.

2606.08615 2026-06-09 cs.CV cs.CL 新提交

Harnessing Streaming Video in the Wild

利用野外流式视频

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院) JD.COM(京东)

AI总结 提出Streaming Harness系统,通过Streaming-Train-248K数据集和训练目标,使视觉语言模型具备主动交互、长期记忆和实时处理能力,并构建Streaming-Eval基准评估流式视频理解。

AI中文摘要

视觉语言模型(VLM)在视频通话助手、实时评论和具身机器人等应用中越来越需要处理无界视频流。理想的流式系统应支持主动交互、长期记忆和实时处理,同时基于能够处理各种野外流式任务的VLM骨干。然而,现有VLM在离线视频理解方面表现出色,但在流式能力上有所欠缺,并且缺乏用于流式部署的专用基础设施。我们在三个方面解决这一差距。(i) 对于骨干能力,我们构建了Streaming-Train-248K,一个流式数据集,配以新颖的训练目标,用于使VLM适应流式交互和理解。(ii) 对于实际部署,我们引入了Streaming Harness,一个即插即用系统,赋予任何VLM三种核心能力:主动交互(每秒响应决策)、长期记忆(12小时上下文保留)和实时处理(亚秒级延迟)。(iii) 为了推动社区在流式能力方面的持续进步,我们设计了Streaming-Eval,一个反映模型在各种野外场景中能力的基准。大量实验表明,我们的方法在流式视频理解所需的所有核心能力上均取得了一致的提升。我们将开源我们的数据、代码和基准,以推动社区从离线视频理解向可部署的流式智能的转变。

英文摘要

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08780 2026-06-09 cs.CV 新提交

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

超越一致性:在零样本视频编辑中保留时间结构

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu

发表机构 * Anhui University(安徽大学) University of Surrey(萨里大学) University of Warwick(华威大学)

AI总结 提出一种零样本视频编辑方法,通过自适应分割视频片段、选取锚帧和令牌合并策略,首次显式保留源视频的时间结构,平衡编辑保真度与计算效率。

AI中文摘要

现有的零样本视频编辑方法依赖预训练的扩散模型,成功实现了空间控制和基本的时间一致性,但根本上未能保留视频的原始时间结构。这一区别至关重要:时间一致性确保视觉平滑,而时间结构决定了视频的高层叙事、节奏和语义流。没有这种保留,编辑输出(尤其是具有复杂语义变化的长视频)在叙事上变得不连贯,语义模糊。为了解决这一局限性,我们提出了一种新颖的零样本编辑方法,首次明确关注保留源视频的时间结构。我们通过基于特征相似性自适应地将视频分割成语义不同的片段,并为每个片段选择一个代表性的锚帧来实现这一点。为了增强片段内保真度和计算效率,我们设计了一种片段自适应的令牌合并策略,利用锚帧的语义主导性来稳定编辑。此外,我们采用交替组合策略,确保片段间无缝过渡,同时保持语义区分。大量实验表明,我们的方法达到了最先进的结果,成功平衡了原始时间结构的保留与计算效率,为零样本视频编辑保真度设立了新基准。

英文摘要

Existing zero-shot video editing methods rely on pre-trained diffusion models; they achieve spatial control and basic temporal consistency but fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy which leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.
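
The clip partitioning and anchor selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity cut rule, the `threshold` value, and the "closest to the clip mean" anchor criterion are all assumptions made for illustration, and the per-frame feature vectors are toy data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def partition_clips(features, threshold=0.8):
    """Start a new clip whenever adjacent-frame similarity drops below
    the (assumed) threshold. Returns a list of frame-index lists."""
    if not features:
        return []
    clips, current = [], [0]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            clips.append(current)
            current = []
        current.append(i)
    clips.append(current)
    return clips

def select_anchor(features, clip):
    """Pick the frame whose feature is most similar to the clip mean
    (an assumed notion of 'representative')."""
    dim = len(features[clip[0]])
    mean = [sum(features[i][d] for i in clip) / len(clip) for d in range(dim)]
    return max(clip, key=lambda i: cosine(features[i], mean))

# Toy per-frame features: three frames of one scene, then two of another.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.1, 0.9]]
clips = partition_clips(features)
anchors = [select_anchor(features, c) for c in clips]
```

With these toy features the cut falls between frames 2 and 3, so each semantic segment gets its own anchor frame.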

2606.09064 2026-06-09 cs.CV cs.AI 新提交

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

看得更多,思考更深:面向长视频理解的查询扩展视觉证据与答案线索引导反思

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

发表机构 * Baidu Inc.(百度公司) Harbin Institute of Technology(哈尔滨工业大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出CoVER框架,通过动态收集查询扩展视觉证据和答案特定视觉反馈验证草稿答案,实现从答案中心生成到证据中心和视觉可验证推理的转变,在长视频理解任务上超越同规模模型及部分闭源模型。

AI中文摘要

近期视频大语言模型(Video-LLMs)的进展使得长视频理解任务成为可能。然而,现有方法仍面临两个关键限制:证据获取通常依赖单一搜索意图,且答案生成缺乏有效的视觉反馈机制。为解决这些限制,我们提出了CoVER,一个用于长视频理解的综合视觉证据与反思框架。CoVER使Video-LLMs能够通过动态收集查询扩展视觉证据来“看得更多”,并通过使用有效的答案特定视觉反馈验证草稿答案来“思考更深”。这些机制共同将长视频理解从以答案为中心的生成转变为以证据为中心且可视觉验证的推理。实验结果表明,CoVER-7B在相同参数规模下显著优于其他模型,甚至在特定指标上超越了最先进的闭源模型。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and Think Deeper by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

2606.09181 2026-06-09 cs.CV cs.LG 新提交

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

用于视频问答中细粒度证据分离的反事实推理

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li, Keisuke Fujii

发表机构 * School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China(电子科技大学光电科学与工程学院)

AI总结 提出反事实推理框架CREDiT,通过结构因果模型将视频问答中的跨模态表示分解为因果和非因果成分,在独立性约束下进行特征级因果干预,提升答案准确性和推理可靠性。

Comments 10 pages, 6 figures

AI中文摘要

近期视频多模态模型的进展显著提升了视频问答性能。然而,这些系统往往依赖于虚假的统计相关性而非与答案相关的因果证据,导致推理不忠实且脆弱,尤其在复杂真实场景中。现有方法要么依赖跨模态相关性、昂贵的精心策划的训练资源,要么依赖不充分的因果假设和约束,且通常操作在时间区间级别。因此,它们未能明确地将因果视觉线索与混杂因素分离,且提供的细粒度证据定位有限。为解决此问题,我们提出了一种用于细粒度证据分离的反事实推理框架(CREDiT)。CREDiT使用结构因果模型形式化视频问答过程,并在独立性和最小性约束下学习明确分解为因果和非因果成分的跨模态表示。为促进忠实的分离,我们引入特征级因果干预,构建近似因果效应同时抑制非因果相关性的反事实输入。在NExT-GQA、SportsQA和SPORTU-video上的大量实验表明,CREDiT在通用和复杂体育场景中均能持续提升答案准确性和推理可靠性,从而构建更可信的视频问答系统。

英文摘要

Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods either rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.

2606.09248 2026-06-09 cs.CV 新提交

Temporal-Aware Reasoning Optimization for Video Temporal Grounding

面向视频时间定位的时间感知推理优化

Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TaRO框架,通过构造性推理探索和时间敏感性奖励,增强多模态大模型在视频时间定位中的时间感知推理能力,实现最先进性能。

Comments Accepted by ICML 2026

AI中文摘要

多模态大语言模型(MLLMs)在结合强化学习生成推理路径的视频时间定位任务中取得了显著进展。然而,现有模型常产生浅层推理,对精确时间定位的指导有限。这一限制源于(1)低效的随机探索和(2)仅关注答案正确性而忽略推理质量的奖励函数。为解决这些问题,我们提出TaRO(时间感知推理优化),一个明确增强模型时间思考能力的框架。首先,我们引入构造性推理探索,利用预生成的密集描述构建基于显式视觉线索和时间戳的推理路径,实现高质量时间感知推理的高效探索。其次,为评估推理质量,我们设计了时间敏感性奖励。高质量推理应锚定于特定事件和时间戳。如果思考中的事件边界被破坏,该推理应失效,导致推理路径的logit下降。我们利用这一下降作为推理质量的评判。最后,TaRO遵循渐进式课程,从利用该奖励选择更好的构造推理路径开始,演变为自由探索阶段,模型自主生成有效推理。实验表明,TaRO在VTG基准上达到最先进性能。代码见https://github.com/oceanflowlab/TaRO。

英文摘要

Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.
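
The idea behind the Temporal-Sensitivity Reward can be sketched as a difference of likelihoods: score a reasoning path once on the original video and once after the referenced event boundary is disrupted, and treat the drop as the reward. The abstract does not give the exact formula, so the mean-log-probability difference below, the function names, and the toy numbers are assumptions for illustration only.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability of a reasoning path."""
    return sum(token_logprobs) / len(token_logprobs)

def temporal_sensitivity_reward(orig_logprobs, disrupted_logprobs):
    """Positive when the path becomes less likely once the event
    boundary it refers to is disrupted (assumed formulation)."""
    return mean_logprob(orig_logprobs) - mean_logprob(disrupted_logprobs)

def select_best_path(candidates):
    """candidates: list of (path_text, orig_logprobs, disrupted_logprobs).
    Returns the text of the path with the largest reward."""
    best = max(candidates, key=lambda c: temporal_sensitivity_reward(c[1], c[2]))
    return best[0]

# Hypothetical scores: the grounded path depends on the video timing,
# the generic path is almost unaffected by the disruption.
grounded = ("door opens at 12s, then ...", [-0.2, -0.3, -0.1], [-1.2, -1.5, -0.9])
generic = ("the video shows a room ...", [-0.4, -0.5], [-0.45, -0.5])
grounded_reward = temporal_sensitivity_reward(grounded[1], grounded[2])
generic_reward = temporal_sensitivity_reward(generic[1], generic[2])
chosen = select_best_path([grounded, generic])
```

Under this reading, a path that merely restates generic content earns almost no reward, which matches the abstract's argument that good reasoning should be anchored to specific events and timestamps.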

2606.09261 2026-06-09 cs.CV 新提交

Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition

自监督学习至关重要:一种用于微手势识别的简单集成方案

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

发表机构 * Hefei University of Technology(合肥工业大学) United Arab Emirates University(阿拉伯联合酋长国大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院) Anhui Evolution Technology Co., Ltd.(安徽进化科技有限公司) Nanyang Technological University(南洋理工大学) University College London(伦敦大学学院) The University of Sydney(悉尼大学) Beijing QBoson Quantum Technology Co., Ltd.(北京玻色量子科技有限公司)

AI总结 提出一种集成自监督RGB模型与监督多流模型的框架,在MiGA挑战赛微手势分类赛道取得第一名,通过自监督预训练提升性能,在iMiGUE测试集上达到74.419%的top-1准确率。

AI中文摘要

在本文中,我们介绍了XInsight Lab在IJCAI 2026第四届MiGA挑战赛微手势分类赛道中的解决方案,该方案排名第一并取得了新的最先进结果。我们提出了一种多模态集成框架,将基于自监督的RGB模型与先前解决方案中的监督多流模型相结合。自监督RGB模型通过掩码视频建模在12万个未标注片段上进行预训练,然后在iMiGUE上微调。这一简单而有效的RGB基线在iMiGUE测试集上达到了69.224%的top-1准确率,展示了从域内未标注视频中学习可迁移表示的好处。通过将该模型作为互补分支加入,最终集成模型达到了74.419%的top-1准确率,比之前的最先进结果高出1.206个百分点。在iMiGUE上的实验结果,包括对集成策略的消融研究,验证了自监督RGB表示学习在微手势识别中的有效性。

英文摘要

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
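
Adding a complementary branch to an ensemble is commonly done by late fusion of class probabilities. The abstract does not state which fusion rule the team used, so the weighted average below is an assumed, generic variant; the branch names, weights, and probabilities are made up.

```python
def fuse_scores(branch_probs, weights):
    """Weighted average of per-class probabilities across branches."""
    total = sum(weights)
    num_classes = len(branch_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, branch_probs)) / total
        for c in range(num_classes)
    ]

def predict(branch_probs, weights):
    """Index of the class with the highest fused probability."""
    fused = fuse_scores(branch_probs, weights)
    return max(range(len(fused)), key=fused.__getitem__)

# Hypothetical outputs of two branches on one clip (3 classes).
rgb_ssl = [0.5, 0.3, 0.2]    # self-supervised RGB branch
skeleton = [0.2, 0.6, 0.2]   # a supervised stream from a prior solution
equal_vote = predict([rgb_ssl, skeleton], [1.0, 1.0])
rgb_heavy = predict([rgb_ssl, skeleton], [3.0, 1.0])
```

The two calls show why the branch weights matter: the same pair of predictions yields a different class once the RGB branch is trusted more.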

2606.09542 2026-06-09 cs.CV 新提交

A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation

一种用于零样本交通事故预警的VideoMAE-v2方法

Siyuan Li, Xiaoyang Bi, Mengshi Qi

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出基于VideoMAE-v2的框架,通过滑动窗口协议和逐帧预测头,在零样本设置下从粗粒度标注数据泛化到未知行车记录仪视频,实现交通事故预警。

AI中文摘要

交通事故预警——在行车记录仪视频的每一帧预测即将发生碰撞的可能性——对于安全至关重要,但难以规模化,因为为每个部署场景收集域内标注的事故视频成本过高。我们在零样本设置下研究此任务,即没有目标域训练数据可用:模型必须仅从公开的二元标注驾驶事故数据集中学习,并泛化到未见过的行车记录仪视频。我们提出一个框架,通过将VideoMAE-v2骨干网络与滑动窗口协议下的逐帧预测头相结合,弥合帧级时间风险估计任务与粗粒度标注二元事故数据集之间的差距。我们的方法在2026年CVPR@AUTOPILOT零样本交通事故预警竞赛中获得第二名。代码可在https://github.com/TimeSouth/zero-shot-taa-solution获取。

英文摘要

Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
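
A sliding-window protocol with a per-frame head has to map window-level outputs back onto the video timeline. The sketch below shows one plausible way to do that, averaging scores where windows overlap. The window length, stride, tail handling, and averaging rule are assumptions, not details taken from the paper.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges, end exclusive. A final window is
    appended so that the tail of the video is always covered."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]

def aggregate(num_frames, windows, window_scores):
    """Average per-frame risk scores over all windows covering a frame."""
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    for (start, _end), scores in zip(windows, window_scores):
        for offset, score in enumerate(scores):
            sums[start + offset] += score
            counts[start + offset] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Toy example: a 6-frame video, 4-frame windows, stride 2.
windows = sliding_windows(6, 4, 2)
head_outputs = [[0.1, 0.1, 0.2, 0.4], [0.4, 0.6, 0.8, 0.9]]  # made-up scores
risk = aggregate(6, windows, head_outputs)
```

Frames 2 and 3 are covered by both windows, so their final risk is the mean of the two predictions.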

2606.09547 2026-06-09 cs.CV cs.LG 新提交

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

AI总结 提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力,并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments Qualcomm Interactive Cooking: Ego-MC-Bench -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections and Ego-CoMist -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes

AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.09641 2026-06-09 cs.CV 新提交

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

MAVIS: 通过结构化视频理解实现多智能体视频检索

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo

发表机构 * School of Computing and Information Technology, Great Bay University(大湾区大学计算机与信息技术学院) College of Computer Science, Nankai University(南开大学计算机学院) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Graduate School of Information Science and Technology, The University of Tokyo(东京大学信息科学与技术研究生院)

AI总结 提出多智能体框架MAVIS,通过结构化语义库解析视频,利用逻辑感知辩论机制协作推理,无需全库扫描和微调即可实现高效视频检索。

AI中文摘要

视频检索的主流范式依赖于基于嵌入的全库扫描,这种方法存在固有的计算低效以及信息密集视频与稀疏文本查询之间的语义不对称问题。为弥合这一差距,我们引入了MAVIS,一种新颖的多智能体框架,将检索重新构想为协作推理而非暴力搜索。MAVIS首先通过将原始视频解析为结构化语义库来弥合粒度不匹配,从而实现显式的属性级索引。在检索过程中,规划器将复杂的用户意图分解为原子子任务,分派专门的智能体独立提名候选。关键的是,MAVIS采用带有严格否决协议的逻辑感知辩论机制,智能体协作修剪逻辑不匹配,以识别紧凑的“有争议”候选集进行细粒度验证。这种智能体工作流有效避免了全库遍历的低效。在MSR-VTT、MSVD和ActivityNet上的大量实验表明,MAVIS在无需任务特定微调的情况下实现了有竞争力的性能,为传统的双编码器方法提供了可扩展且可解释的替代方案。

英文摘要

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
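
The nominate-then-veto flow in this abstract can be pictured with a small set-based sketch. How MAVIS actually runs its debate is not specified here, so the rules below (any single veto removes a candidate; candidates without unanimous support count as contested) and all agent names and video ids are assumptions for illustration.

```python
def debate(nominations, vetoes):
    """nominations: agent name -> set of nominated video ids.
    vetoes: agent name -> set of video ids that agent rejects.
    Returns (agreed, contested) lists of surviving candidates."""
    vetoed = set().union(*vetoes.values())
    votes = {}
    for ids in nominations.values():
        for vid in ids:
            if vid not in vetoed:  # strict veto: one objection is enough
                votes[vid] = votes.get(vid, 0) + 1
    num_agents = len(nominations)
    agreed = sorted(v for v, c in votes.items() if c == num_agents)
    contested = sorted(v for v, c in votes.items() if c < num_agents)
    return agreed, contested

# Hypothetical attribute agents and their nominations.
nominations = {
    "object": {"v1", "v2", "v3"},
    "action": {"v2", "v3", "v4"},
    "scene": {"v2", "v3"},
}
vetoes = {"action": {"v3"}}
agreed, contested = debate(nominations, vetoes)
```

Only the contested list would need the expensive fine-grained check, which is the efficiency argument the abstract makes against scanning the full library.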

2606.09646 2026-06-09 cs.CV cs.AI cs.LG 新提交

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

视频基础模型是否理解直觉物理?逐层探测分析

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 通过冻结特征探测,研究预训练视频基础模型在直觉物理信息上的编码能力,发现V-JEPA表现最佳,物理信息在中后期层最易获取,且时序破坏显著降低性能。

AI中文摘要

我们研究预训练视频基础模型是否在其冻结表示中编码直觉物理信息,以及该信息如何随模型家族、层和探测类型变化。通过在IntPhys2和Minimal Video Pairs (MVP)上进行冻结特征探测,我们比较了预测联合嵌入模型(V-JEPA)、掩码重建模型(VideoMAE)和基于扩散的视频生成器(LTX-Video)。V-JEPA在基准测试中取得最强整体结果,尤其是在建模时序动态的探测器中,而VideoMAE仍具竞争力,LTX-Video恢复较弱但非平凡的信号。逐层分析表明,物理相关信息在早期层最弱,在中后期深度最易获取;时序控制表明,打乱帧顺序显著降低性能,尤其是在MVP上。综合来看,这些结果表明直觉物理知识在预训练视频表示中可靠地出现,但其可获取性强烈依赖于预训练范式、表示深度和读出机制。

英文摘要

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

2606.07577 2026-06-09 cs.AI cs.CV cs.SD eess.AS 交叉投稿

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem: 面向流式音视频大语言模型的扰动感知记忆压缩

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

发表机构 * Tsinghua University(清华大学) ByteDance(字节跳动) Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 提出OmniMem,一种针对音视频LLM的流式记忆压缩框架,通过模态感知分配和扰动感知选择压缩KV缓存,在保持长视频理解的同时减少内存,在多个基准上提升2-4%准确率。

Comments Code: https://github.com/bytedance/SALMONN/tree/omni_mem

AI中文摘要

音视频大语言模型(LLMs)在长视频理解方面具有强大潜力,但其长视频推理从根本上受到视频令牌和键值(KV)缓存线性增长的制约。我们提出OmniMem,一种专为音视频LLMs设计的内存高效流式框架。与将所有令牌统一处理的现有压缩方法不同,OmniMem引入了一种模态感知的内存分配策略,分别管理视觉和音频上下文,解决了两种模态之间的严重令牌不平衡问题。OmniMem进一步通过扰动感知的内存选择保留信息丰富且非冗余的KV状态,实现紧凑内存而不牺牲长程理解。为了在现实部署约束下加强压缩,我们还探索了预算感知微调,鼓励模型将有用信息整合到保留内存中。在VideoMME Long、LVBench和LVOmniBench上使用video-SALMONN 2+和Qwen-2.5-Omni的实验表明,在相同内存预算下,OmniMem始终比强训练无关压缩基线提高2-4%的绝对准确率,微调后额外提高1-2%。

英文摘要

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.
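
Modality-aware allocation can be read as splitting a fixed KV-cache budget between visual and audio tokens before selecting which entries to keep. The sketch below assumes a fixed split ratio and takes importance scores as given; the perturbation-aware scoring itself is not described in the abstract and is not reproduced here, so the scores are placeholders.

```python
def allocate_budget(total_budget, visual_ratio):
    """Split the total number of retained KV entries between modalities."""
    visual = round(total_budget * visual_ratio)
    return visual, total_budget - visual

def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring entries, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def compress(visual_scores, audio_scores, total_budget, visual_ratio=0.5):
    """Select retained entries separately per modality, so the much larger
    visual stream cannot crowd out audio context."""
    visual_budget, audio_budget = allocate_budget(total_budget, visual_ratio)
    return keep_top(visual_scores, visual_budget), keep_top(audio_scores, audio_budget)

# Placeholder importance scores for 6 visual and 3 audio tokens.
visual_scores = [0.9, 0.1, 0.4, 0.8, 0.2, 0.7]
audio_scores = [0.3, 0.6, 0.5]
kept_visual, kept_audio = compress(visual_scores, audio_scores, total_budget=4)
```

A single global top-4 over these scores would keep only visual tokens; the per-modality budget is what guarantees the audio entries survive.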

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 交叉投稿

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时:诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University(杜克大学)

AI总结 研究多模态大语言模型在视频理解中检测缺失答案的能力,发现模型倾向于选择干扰项而非识别无正确答案,时间推理任务中问题更严重,链式思维提示虽提升检测率但仍不理想。

Comments Under review

AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展,但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究,其中正确答案被故意排除在候选集之外,而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为:带有“以上皆非”选项的多选题、带有检测指令的开放式生成,以及没有任何指导的标准评估。在多种模型和基准测试中,我们发现多模态大语言模型压倒性地选择合理的干扰项,而不是检测到缺失答案。这种失败在时间推理任务中更为明显,并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略,发现虽然它显著提高了检测率,但性能仍不令人满意,这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败,并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.
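
The first evaluation setting, a multiple-choice question with the correct answer removed and a "None of the Above" option appended, is easy to make concrete. The item format, field names, and scoring helper below are hypothetical; the benchmark's real data schema is not given in the abstract.

```python
import string

NOTA = "None of the Above"

def make_absent_item(question, options, correct_index):
    """Drop the correct option and append NOTA, which becomes the only
    acceptable answer for the modified item."""
    remaining = [o for i, o in enumerate(options) if i != correct_index]
    remaining.append(NOTA)
    labels = string.ascii_uppercase
    return {
        "question": question,
        "options": {labels[i]: o for i, o in enumerate(remaining)},
        "answer": labels[len(remaining) - 1],
    }

def detection_rate(predictions, items):
    """Fraction of items where the model chose the NOTA label."""
    hits = sum(1 for p, item in zip(predictions, items) if p == item["answer"])
    return hits / len(items)

item = make_absent_item(
    "What does the person do after opening the door?",
    ["runs", "jumps", "sits", "waves"],
    correct_index=1,
)
rate = detection_rate(["D", "A"], [item, item])
```

A model that picks a distractor such as "A" here is exhibiting exactly the failure the study measures.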

2506.20588 2026-06-09 cs.CV 版本更新

TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

TRIM:一种最大化时间相对信息和代表性的自监督视频摘要框架

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

发表机构 * Pompeu Fabra University(庞培法布拉大学) Universitat Autònoma de Barcelona(巴塞罗那自治大学)

AI总结 TRIM框架通过自监督学习实现高效视频摘要,无需注意力机制等复杂结构,优于现有无监督方法并挑战传统复杂架构。

AI中文摘要

随着视频内容的普及,视频摘要和亮点提取成为关键研究领域。然而,许多先进方法依赖监督标注或注意力模型,计算成本高且在分布变化时表现不稳定。我们提出一种新颖的自监督视频摘要模型,无需注意力、RNN或Transformer,通过马尔可夫过程驱动的损失度量和两阶段自监督学习范式,实现性能与效率的平衡。TRIM在SUMME和TVSUM数据集上达到最佳性能,超越所有现有无监督方法,并与最佳监督模型相当,展示了高效无标注架构的潜力,为更通用的视频摘要技术铺平道路,并挑战现有复杂架构的依赖。

英文摘要

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights as a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

2510.20182 2026-06-09 cs.CV 版本更新

PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation

PEDRA: 评估视频生成中行人动态的真实性

Aaron Appelle, Jerome P. Lynch

发表机构 * Duke University(杜克大学)

AI总结 提出PEDRA评估协议,通过重建鸟瞰轨迹等方法,测试文本/图像到视频模型生成多行人交互场景的真实性,发现现有模型虽具备先验但存在行人合并消失等物理不一致问题。

Comments Accepted to CVPR 2026

AI中文摘要

行人模拟传统上依赖于专家调整的手工模型,这限制了可扩展性和泛化性。与此同时,大规模视频生成模型已在各种场景中实现了高视觉真实感,激发了探索其作为通用世界模拟器潜力的兴趣。现有基准主要评估单主体真实性,而非包含多个交互人物的场景,使得生成视频中多智能体动态的合理性未经测试。我们提出一个严格的评估协议,用于基准测试文本到视频(T2V)和图像到视频(I2V)模型作为行人动态的隐式模拟器。对于I2V,我们利用已有数据集的起始帧,以便与真实视频进行直接比较;而对于T2V,我们设计了一个涵盖不同人群密度和交互类型的提示集。一个关键组成部分是一种无需已知相机参数即可从像素空间重建二维鸟瞰轨迹的方法。我们的分析表明,领先模型对合理的多智能体行为具有有效的先验,尽管合并和消失行人等问题揭示了其物理一致性的局限性。

英文摘要

Pedestrian simulation traditionally relies on expert-tuned, hand-crafted models that limit scalability and generalization. Meanwhile, large-scale video generation models have achieved high visual realism across diverse settings, motivating exploration of their potential as general-purpose world simulators. Existing benchmarks primarily assess single-subject realism rather than scenes with multiple interacting people, leaving the plausibility of multi-agent dynamics in generated videos untested. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable direct comparison with ground truth videos, while for T2V we design a prompt suite covering varied crowd densities and interaction types. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis shows that leading models exhibit effective priors for plausible multi-agent behavior, though issues such as merging and disappearing pedestrians reveal limits to their physical consistency.

2511.14143 2026-06-09 cs.CV cs.AI 版本更新

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

SMART: 基于音频增强多模态大模型的镜头感知视频时刻检索

An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang

发表机构 * Department of Computer Science, University at Albany - SUNY(纽约州立大学奥尔巴尼分校计算机科学系) School of Software & Microelectronics, Peking University(北京大学软件与微电子学院) Nanjing University(南京大学) Xiamen University(厦门大学) Department of Mathematics and Statistics, University at Albany - SUNY(纽约州立大学奥尔巴尼分校数学与统计学系)

AI总结 提出SMART框架,融合音频与视觉特征,利用镜头感知令牌压缩技术,在多模态大模型基础上实现视频时刻检索,在Charades-STA和QVHighlights上取得显著提升。

AI中文摘要

视频时刻检索是视频理解中的一项任务,旨在根据自然语言查询在未裁剪视频中定位特定时间片段。尽管近年来利用传统技术和多模态大模型在视频时刻检索方面取得了进展,但大多数现有方法仍依赖于粗粒度的时间理解和单一的视觉模态,限制了在复杂视频上的性能。为了解决这一问题,我们引入了镜头感知多模态音频增强时间片段检索(SMART),这是一个基于多模态大模型的框架,它整合了音频线索并利用了镜头级别的时间结构。SMART通过结合音频和视觉特征来丰富多模态表示,同时应用镜头感知令牌压缩,该技术选择性地保留每个镜头内的高信息令牌,以减少冗余并保留细粒度的时间细节。我们还优化了提示设计,以更好地利用视听线索。在Charades-STA和QVHighlights上的评估表明,SMART相比最先进的方法取得了显著改进,包括在Charades-STA上R1@0.5提升1.61%,R1@0.7提升2.59%。

英文摘要

Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLM), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and 2.59% gain in R1@0.7 on Charades-STA.
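
The R1@0.5 and R1@0.7 figures quoted above are recall values for the top-1 predicted segment at a temporal IoU threshold. A minimal sketch of that standard metric follows; the predicted and ground-truth segments are made-up examples, not results from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return hits / len(gts)

# Made-up top-1 predictions and ground-truth moments for three queries.
preds = [(2.0, 7.0), (0.0, 5.0), (10.0, 20.0)]
gts = [(3.0, 9.0), (4.0, 10.0), (10.0, 18.0)]
r1_at_05 = recall_at_1(preds, gts, 0.5)
r1_at_07 = recall_at_1(preds, gts, 0.7)
```

The three IoU values here are roughly 0.57, 0.10, and 0.80, so two predictions count at the 0.5 threshold and only one at 0.7.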

2603.04125 2026-06-09 cs.CV Version update

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

基于特征残差判别的小样本开放集动作识别基线研究与基准

Stefano Berti, Giulia Pasquale, Lorenzo Natale

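Affiliations * Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Genoa, Italy

AI summary To address the limits of few-shot action recognition in open-set scenarios, proposes an architectural extension based on a Feature-Residual Discriminator; across five datasets it improves unknown-class rejection without loss of closed-set accuracy and sets a new benchmark.

Details
AI Chinese abstract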

小样本动作识别(FS-AR)已显示出有希望的结果,但常受限于闭集假设,在现实开放集场景中失效。虽然小样本开放集(FSOS)识别在图像领域已很成熟,但其在时空视频数据上的扩展仍未被充分探索。为解决此问题,我们提出基于特征残差判别器(FR-Disc)的架构扩展,将先前在骨骼数据上的工作适配到更复杂的视频领域。在五个数据集上的大量实验表明,虽然常见的开放集技术仅提供边际增益,但我们的FR-Disc显著增强了未知类拒绝能力,且不损害闭集精度,为FSOS-AR设立了新的最先进水平。项目网站、代码和基准可在 https://hsp-iit.github.io/fsosar/ 获取。

English abstract

Few-Shot Action Recognition (FS-AR) has shown promising results but is often limited by a closed-set assumption that fails in real-world open-set scenarios. While Few-Shot Open-Set (FSOS) recognition is well-established for images, its extension to spatio-temporal video data remains underexplored. To address this, we propose an architectural extension based on a Feature-Residual Discriminator (FR-Disc), adapting previous work on skeletal data to the more complex video domain. Extensive experiments on five datasets demonstrate that while common open-set techniques provide only marginal gains, our FR-Disc significantly enhances unknown rejection capabilities without compromising closed-set accuracy, setting a new state-of-the-art for FSOS-AR. The project website, code, and benchmark are available at: https://hsp-iit.github.io/fsosar/.

2603.27493 2026-06-09 cs.CV Version update

Fully Spiking Neural Networks with Target Awareness for Energy-Efficient UAV Tracking

具有目标意识的全脉冲神经网络用于节能无人机跟踪

Pengzhi Zhong, Jiwei Mo, Dan Zeng, Feixiang He, Shuiwang Li

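Affiliations * College of Computer Science and Engineering, Guilin University of Technology; School of Artificial Intelligence, Sun Yat-sen University; School of Electronic Information, Central South University

AI summary Proposes STATrack, a fully spiking neural network framework for UAV visual tracking from RGB input. It introduces an adaptive mutual information maximization mechanism and a dynamic weighting strategy to better preserve target semantics and suppress background interference; experiments show efficient tracking at low energy consumption.

Details
AI Chinese abstract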

脉冲神经网络(SNNs)以其事件驱动计算和低功耗特性,在无人机(UAVs)上的能量高效视觉跟踪中展现出巨大潜力。然而,现有基于SNN的跟踪器通常依赖成本高昂的事件相机,限制了其在标准RGB相机UAV平台上的应用。为解决这一限制,我们提出了STATrack,一种仅使用RGB输入的全脉冲神经网络框架用于无人机视觉跟踪。据我们所知,这是首次探索全脉冲神经网络用于基于RGB的无人机视觉跟踪。为缓解脉冲离散化导致的目标语义退化以及减少无人机场景中的背景干扰,我们引入了自适应互信息最大化(AMIM)机制。AMIM最大化模板输入与其深层目标感知特征之间的互信息,促使脉冲骨干网络保留判别性目标语义。此外,设计了一种样本难度感知的动态加权策略,以自适应地调整训练过程中的互信息约束。在四个广泛使用的无人机跟踪基准上的大量实验表明,STATrack在低理论能耗下实现了最先进的跟踪性能,突显了其在能量受限的无人机应用中的潜力。

English abstract

Spiking Neural Networks (SNNs), characterized by their event-driven computation and low power consumption, have shown great potential for energy-efficient visual tracking on unmanned aerial vehicles (UAVs). However, existing SNN-based trackers often rely on costly event cameras, which limits their deployment on standard RGB-camera UAV platforms. To address this limitation, we propose STATrack, a fully spiking neural network framework for UAV visual tracking using only RGB inputs. To the best of our knowledge, this is the first study to explore fully spiking neural networks for RGB-based UAV visual tracking. To alleviate target semantic degradation caused by spike discretization and reduce background interference in UAV scenes, we introduce an Adaptive Mutual Information Maximization (AMIM) mechanism. AMIM maximizes the mutual information between template inputs and their deep target-aware features, encouraging the spiking backbone to preserve discriminative target semantics. In addition, a sample-difficulty-aware dynamic weighting strategy is designed to adaptively adjust the mutual information constraint during training. Extensive experiments on four widely used UAV tracking benchmarks demonstrate that STATrack achieves state-of-the-art tracking performance with low theoretical energy consumption, highlighting its potential for energy-constrained UAV applications.

2503.08703 2026-06-09 cs.NE cs.CV Version update

SDTrack: A Baseline for Event-based Tracking via Spiking Neural Networks

SDTrack: 基于脉冲神经网络的事件驱动跟踪基线

Yimeng Shan, Zhenbang Ren, Haodi Wu, Wenjie Wei, Rui-Jie Zhu, Shuai Wang, Dehao Zhang, Yichen Xiao, Jieyuan Zhang, Kexin Shi, Jingzhinan Wang, Jason K. Eshraghian, Haicheng Qu, Malu Zhang

Affiliations * University of Electronic Science and Technology of China; Shenzhen Loop Area Institute; Liaoning Technical University; University of California, Santa Cruz

AI summary Proposes SDTrack, the first Transformer-based, fully spike-driven tracking pipeline, which aggregates event streams with a Global Trajectory Prompt method to achieve low-energy, high-accuracy tracking.

Comments 10 pages, 8 figures, 4 tables

Details
AI Chinese abstract

事件相机提供了优越的时间分辨率、动态范围、能效和像素带宽。脉冲神经网络(SNNs)通过离散脉冲信号自然补充事件数据,使其成为事件驱动跟踪的理想选择。然而,当前结合人工神经网络(ANNs)和SNNs的方法存在次优架构,损害了能效并限制了跟踪性能。为了解决这些限制,我们提出了首个基于Transformer的脉冲驱动跟踪(SDTrack)流水线。它包含一种称为全局轨迹提示(GTP)的新型事件帧聚合方法和一个基于Transformer的跟踪器。GTP方法有效捕获全局轨迹信息,并将其与事件流聚合到事件帧中,以增强时空表示。基于Transformer的跟踪器包括一个完全脉冲驱动的SNN骨干网络和一个简单的跟踪头。SDTrack流水线端到端运行,无需数据增强或后处理。大量实验表明,我们的SDTrack-Tiny版本仅用19.61M参数和8.16mJ能耗即可实现竞争性精度,而Base版本在三个数据集上达到了最先进的精度。我们的工作为未来的神经形态视觉研究奠定了坚实基础。

English abstract

Event cameras provide superior temporal resolution, dynamic range, energy efficiency, and pixel bandwidth. Spiking Neural Networks (SNNs) naturally complement event data through discrete spike signals, making them ideal for event-based tracking. However, current approaches combining Artificial Neural Networks (ANNs) and SNNs suffer from suboptimal architectures that compromise energy efficiency and limit tracking performance. To address these limitations, we propose the first Transformer-based Spike-Driven Tracking (SDTrack) pipeline. It incorporates a novel event frame aggregation method called Global Trajectory Prompt (GTP) and a Transformer-based tracker. The GTP method effectively captures global trajectory information and aggregates it with event streams into event frames to enhance spatiotemporal representation. The Transformer-based tracker comprises a fully spike-driven SNN backbone and a simple tracking head. The SDTrack pipeline operates end-to-end without data augmentation or post-processing. Extensive experiments demonstrate that our SDTrack-Tiny pipeline achieves competitive accuracy with only 19.61M parameters and 8.16 mJ energy consumption, while our Base version achieves state-of-the-art accuracy across three datasets. Our work establishes a solid foundation for future neuromorphic vision research.

6. Generative Vision and World Models 44 papers

2606.07636 2026-06-09 cs.CV cs.CL cs.MA New submission

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Crayotter: 用于长视频编辑的可追踪多智能体工作流

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu

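Affiliations * University of Science and Technology of China

AI summary Proposes Crayotter, an open-source multimodal multi-agent system whose three-phase workflow (material preparation, artifact-based editing research, tool-grounded execution) makes long-form video editing traceable and selectively revisable; it outperforms the baselines in human evaluation.

Comments 11 pages, 5 figures

Details
AI Chinese abstract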

从异构素材编辑长视频不仅需要选择片段:智能体必须在材料准备、时间线构建、后期制作和修订过程中保持叙事意图,同时留下足够的证据以诊断失败。我们提出Crayotter,一个用于提示驱动视频编辑的开源多模态多智能体系统。Crayotter将制作组织为三个阶段:覆盖感知的材料准备、基于工件的编辑研究以及工具驱动的时间线执行。每个阶段外化可检查的工件,包括覆盖报告、多模态分析、编辑蓝图、工具调用和中间渲染。这些工件使编辑运行可追踪,并允许诊断和选择性修订失败的片段,而无需完全重启。我们在23个编辑主题上评估Crayotter,与CapCut-Mate和CutClaw进行比较。在人类评估下,Crayotter的平均得分为3.40/5,而两个基线分别为2.44和1.70,在主题对齐、叙事连贯性和编辑流畅性方面持续提升。我们还描述了一个可重放的轨迹模式和可验证的奖励设计,为这些工作流未来的策略优化做准备。代码、轨迹和示例可在 https://github.com/idwts/Crayotter 公开获取。

English abstract

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

2606.07638 2026-06-09 cs.CV cs.AI New submission

Anchor-Conditioned Compositional Control for Landscape Image Generation

基于锚点条件的景观图像生成组合控制

Gadha Lekshmi P, Govind Arun, Rohith Syam, Ahmed Elgammal

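Affiliations * Rutgers University–New Brunswick; University of Maryland–College Park; University of Technology Sydney

AI summary Proposes an anchor-conditioned finetuning framework that injects a four-dimensional compositional anchor vector through a decoupled cross-attention mechanism to provide compositional control in landscape image generation, achieving the best horizon detection rate and rule-of-thirds alignment among the compared variants.

Comments Accepted to the International Conference on Computational Creativity, ICCC 2026

Details
AI Chinese abstract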

图像生成模型虽然被广泛用作创意工具,但对摄影师和视觉艺术家常规执行的组合控制类型支持有限。本文提出了一个用于景观图像生成的锚点条件微调框架的早期结果,其中从训练图像中提取四维组合锚点向量,并通过带有傅里叶编码和三路分类器自由引导丢弃的解耦交叉注意力机制注入扩散模型。与基线和三个消融变体的定量评估表明,所提出的架构实现了最高的水平线检测率0.850和最高的三分法对齐度0.817。类别特定的消融进一步表明,在组合同质场景子集上训练相比混合训练可将水平线偏差降低多达40%。这确立了组合控制精度是类别依赖的。

English abstract

Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographers and visual artists routinely exercise. This paper presents early results on an anchor-conditioned finetuning framework for landscape image generation, in which a four-dimensional compositional anchor vector is extracted from training images and injected into a diffusion model via a decoupled cross-attention mechanism with Fourier encoding and three-way classifier-free guidance dropout. Quantitative evaluation against a baseline and three ablation variants shows that the proposed architecture achieves the highest horizon detection rate of 0.850 and the highest rule-of-thirds alignment of 0.817. A category-specific ablation further demonstrates that training on compositionally homogeneous scene subsets reduces horizon deviation by up to 40 percent compared to mixed training. This establishes that compositional control precision is category dependent.
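The abstract mentions Fourier encoding of a four-dimensional anchor vector and a three-way classifier-free guidance dropout without giving details. The sketch below shows one plausible reading, under stated assumptions: the frequency schedule `2^k * pi`, the number of frequencies, the dropout probabilities, and both function names are hypothetical and are not taken from the paper.

```python
import math
import random


def fourier_encode(anchor, num_freqs=4):
    """Map each anchor component to sin/cos features at doubling frequencies.

    Assumption: components are normalized to [0, 1]; the paper's actual
    frequency schedule is not stated in the abstract.
    """
    feats = []
    for value in anchor:
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * value
            feats.extend([math.sin(angle), math.cos(angle)])
    return feats


def sample_condition_mask(rng, p_drop_text=0.1, p_drop_anchor=0.1, p_drop_both=0.1):
    """Return (keep_text, keep_anchor) for one training sample.

    Hypothetical three-way scheme: drop both conditions, drop only the text,
    or drop only the anchor, so each branch can be guided separately later.
    """
    u = rng.random()
    if u < p_drop_both:
        return (False, False)
    if u < p_drop_both + p_drop_text:
        return (False, True)
    if u < p_drop_both + p_drop_text + p_drop_anchor:
        return (True, False)
    return (True, True)


if __name__ == "__main__":
    anchor = [0.5, 0.3, 0.2, 0.7]  # made-up four-dimensional anchor
    print(len(fourier_encode(anchor)))  # 4 components * 4 freqs * 2 = 32
    print(sample_condition_mask(random.Random(0)))
```

Encoding low-dimensional scalars with sinusoids is a common way to let a network resolve small differences in the input; whether the authors use this exact form is not stated in the listing.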

2606.07649 2026-06-09 cs.CV cs.AI New submission

ViMax: Agentic Video Generation

ViMax: 智能体视频生成

Lingxuan Huang, Sizhe He, Hengji Zhou, Liqiang Nie, Lianghao Xia, Chao Huang

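Affiliations * The University of Hong Kong; South China University of Technology; Harbin Institute of Technology, Shenzhen

AI summary Proposes the ViMax framework, which generates long-form video through multi-agent collaboration, using a hierarchical narrative engine and a visual consistency mechanism to maintain narrative coherence and visual consistency.

Comments 20 pages, 13 figures

Details
AI Chinese abstract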

长视频生成需要系统的叙事规划和视觉一致性,而当前的短视频方法无法提供。现有方法生成孤立的序列,缺乏叙事结构,并且缺乏跨场景保持角色和环境一致性的机制。我们提出ViMax,一个智能体视频生成框架,通过协调的多智能体协作来解决视频创作问题,其中专门的组件协商叙事决策、视觉连续性和制作质量。我们的框架采用分层叙事引擎,结合检索增强生成以实现全局故事连贯性,以及依赖感知的视觉一致性机制,跨时间边界跟踪角色和环境状态,同时VLM引导的智能体持续监控和优化叙事连贯性和视觉保真度。该框架支持协调的智能体协作以生成扩展的叙事内容,在多场景时间线上保持叙事完整性和视觉连贯性。

English abstract

Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Existing methods generate isolated sequences without narrative structure and lack mechanisms for maintaining character and environmental consistency across scenes. We present ViMax, an agentic video generation framework that addresses video creation through coordinated multi-agent collaboration where specialized components negotiate narrative decisions, visual continuity, and production quality. Our framework employs a hierarchical narrative engine with retrieval-augmented generation for global story coherence and a dependency-aware visual consistency mechanism that tracks character and environmental states across temporal boundaries, while VLM-guided agents continuously monitor and refine both narrative coherence and visual fidelity. The framework enables coordinated agent collaboration to generate extended narrative content. This maintains both storytelling integrity and visual coherence across multi-scene timelines.

2606.07935 2026-06-09 cs.CV New submission

REACT 2026: The Fourth Multiple Appropriate Facial Reaction Generation Challenge: Personalised MAFRG and Appropriate EEG Reaction Prediction

REACT 2026:第四届多适切面部反应生成挑战赛:个性化MAFRG与适切脑电反应预测

Siyang Song, Micol Spitale, Zijian Wu, Xiangyu Kong, Cheng Luo, Cristina Palmero, German Barquero, Sergio Escalera, Michel Valstar, Mohamed Daoudi, Fabien Ringeval, Andrew Howes, Elisabeth Andre, Hatice Gunes

Affiliations * University of Exeter; Politecnico di Milano; Nanjing University of Science and Technology; King Abdullah University of Science and Technology; King's College London; Universitat de Barcelona; University of Nottingham; IMT Nord Europe; Université Grenoble Alpes; University of Augsburg; University of Cambridge

AI summary Presents the REACT 2026 challenge, which encourages the development of machine learning models that generate personalised, appropriate, diverse, realistic and synchronised human facial reactions, and introduces personality labels and EEG recordings to explore a new one-to-many personalised facial reaction generation setting.

Comments arXiv admin note: text overlap with arXiv:2505.17223

Details
AI Chinese abstract

在二元交互中,多种人类面部反应可能适合回应每个说话者行为。继REACT 2023、2024和2025挑战系列成功举办后,针对多适切面部反应生成(MAFRG)问题,已开发了一系列生成式深度学习模型。今年,我们提出REACT 2026挑战赛,鼓励开发和基准测试机器学习(ML)模型,这些模型能够生成由特定人类倾听者表达的多个个性化、适切、多样、真实且同步的人类风格面部反应,以回应每个给定的说话者行为。作为挑战的关键,我们持续向挑战参与者提供REACT 2025引入的MARS数据集,并额外提供个体层面的五大人格标签和脑电记录。这引入了一种新的结合人类表达行为、情感和神经生理信号的一对多个性化面部反应生成设置,这在当前的二元交互建模中仍很大程度上未被探索。本文还介绍了挑战指南和四个提议的子挑战的新基线:离线通用和个性化MAFRG,以及在线通用和个性化MAFRG,这些基线公开于https://github.com/reactmultimodalchallenge/baseline_react2026。

English abstract

In dyadic interactions, various human facial reactions could be appropriate responses to each speaker behaviour. Following the successful organisation of the REACT 2023, 2024 and 2025 challenge series, a body of generative deep learning (DL) models has been developed for the problem of multiple appropriate facial reaction generation (MAFRG). This year, we propose the REACT 2026 challenge, encouraging the development and benchmarking of Machine Learning (ML) models that can generate multiple personalised, appropriate, diverse, realistic and synchronised human-style facial reactions expressed by a specific human listener in response to each given speaker behaviour. As a key element of the challenge, we continue to provide participants with the MARS dataset introduced by REACT 2025, and additionally provide individual-level Big-Five personality labels and EEG recordings. This introduces a new one-to-many personalised facial reaction generation setting combining human expressive behavioural, affective and neurophysiological signals, which remains largely unexplored in current dyadic interaction modelling. This paper also presents the challenge guidelines and new baselines for the four proposed sub-challenges: offline generic and personalised MAFRG, and online generic and personalised MAFRG. The baselines are publicly available at https://github.com/reactmultimodalchallenge/baseline_react2026.

2606.07967 2026-06-09 cs.CV New submission

DisCo: World Models with Discrete Camera Motion Control

DisCo: 具有离散相机运动控制的世界模型

Hongrui Huang, Junke Wang, Quanhao Li, Yu-Gang Jiang, Zuxuan Wu

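Affiliations * Fudan University

AI summary Proposes DisCo, which conditions on discrete action primitives instead of continuous camera trajectories to address action representation entanglement in controllable video generation and improve the reliability of action following, and introduces the DisCoBench benchmark.

Details
AI Chinese abstract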

可控视频世界模型旨在实现交互式世界探索,模型必须在保持视觉质量和时间一致性的同时忠实地执行明确的动作命令。然而,现有大多数方法依赖连续相机轨迹作为动作条件,这通常导致不可靠的动作跟随,尤其是在复杂运动序列下。在这项工作中,我们识别出动作表示纠缠是可控视频生成的关键瓶颈,并表明连续相机表示导致不同运动模式之间的高特征相似性,降低了动作可控性。基于这一见解,我们提出了DisCo,一种可控视频世界模型,它将生成条件约束在一组紧凑的离散动作原语上,以提高动作可分离性。我们进一步引入了DisCoBench,一个用于评估模型在短期、长期和高度动态探索场景中能力的综合基准。大量实验表明,DisCo在保持视觉质量的同时实现了显著更可靠的动作跟随。

English abstract

Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

2606.08091 2026-06-09 cs.CV New submission

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

VideoWeaver: 评估与进化智能体长视频生成技能

Jianhui Wei, Jie Tan, Hengchuan Zhu, Xiaotian Zhang, Yan Zhang, Ziyi Chen, Daoan Zhang, Wei Xu, Zuozhu Liu

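Affiliations * Zhejiang University; ByteDance

AI summary Proposes the VideoWeaver framework, in which agents autonomously compose foundation skills to generate videos; designs an agent-as-judge that evaluates both process and outcome, and improves generation quality through a skill evolution algorithm.

Details
AI Chinese abstract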

最近的智能体框架如Claude Code、Codex和OpenClaw在工具使用和编排方面表现强劲,但它们能否处理长视频生成这一长时多模态任务仍待探索。与早期手工设计管线的视频智能体不同,这些框架可以构建和优化自己的工作流程。我们提出VideoWeaver,一个评估和进化长视频生成技能的智能体框架和基准测试,其中智能体通过将基础技能组合成自己的工作流程(而非遵循预定义管线)将单个指令转化为长视频。该基准测试包含16个任务类别和285个案例,参考信息涵盖文本、图像、音频、视频及其组合。由于错误可能出现在任何阶段而不仅仅是最终视频,我们提出一种智能体裁判,它检查执行轨迹和最终视频,并将其评分基于元数据和中间文件等证据。利用这一反馈,我们进一步设计了一种技能进化算法,用于优化和合并智能体的技能。在多个框架和模型上,我们发现显式的组合技能比单独使用基础技能更能改善生成过程,技能进化进一步提高了输出质量,并且不同框架和模型选择之间的性能差异显著。所提出的智能体裁判也与人类判断高度一致,尤其是在过程指标上。代码和数据集可在https://github.com/JianhuiWei7/VideoWeaver获取。

English abstract

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent's skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset are available at https://github.com/JianhuiWei7/VideoWeaver.

2606.08150 2026-06-09 cs.CV New submission

Property-Informed Diffusion-Based Text-to-Microstructure Generation

基于属性信息的扩散模型文本到微结构生成

Bingxuan Dai, Hongsong Wang, Jie Gui

Affiliations * School of Cyber Science and Engineering, Southeast University; School of Computer Science and Engineering, Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education; Purple Mountain Laboratories; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education

AI summary Proposes a property-informed diffusion network that generates 3D microstructures directly from textual descriptions, using contrastive text-structure alignment and test-time reward-guided alignment to keep generated structures semantically and physically plausible.

Comments Published in CVPR 2026. Code is at: https://github.com/hongsong-wang/PropDiff-TMG

Details
AI Chinese abstract

设计满足预期功能的3D超材料微结构仍然是一个重大挑战,因为它通常需要领域专业知识、迭代模拟和大量手动调整。现有的基于期望目标属性自动生成微结构的逆向设计工作往往受限于设计多样性不足,并在确保生成结构的物理可行性方面面临挑战。为解决这一问题,提出了一种属性信息驱动的扩散网络,能够直接从文本描述生成3D微结构。与传统的属性条件方法不同,我们的方法利用文本输入中丰富的语义和物理属性指导,支持多样化的结构合成。为了强制生成结构与目标文本提示之间的一致性,采用了双重对齐策略,包括对比文本-结构对齐和测试时奖励引导对齐。实验结果表明,该模型能够在广泛材料类别中生成语义有意义且物理上合理的结构。我们的方法在交互式微结构设计方面具有良好潜力,并为结合语言接口与逆向材料发现开辟了新方向。代码可在 https://github.com/hongsong-wang/PropDiff-TMG 获取。

English abstract

Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

2606.08260 2026-06-09 cs.CV New submission

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

TIDE: 任务隔离扩散模型用于统一视频编辑与生成

Qi Liu, Gang Yue, Mingyu Yin, Lisai Zhang, Yidi Wu, Yaole Wang, Yaohui Wang, Chang Yao, Jingyuan Chen, Lin Ma

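Affiliations * Zhejiang University; Bilibili Inc.

AI summary Proposes TIDE, a unified framework that uses per-token task embeddings and a dual-path conditioning mechanism to support instruction-based editing, reference-guided editing, and multi-reference generation, reaching state-of-the-art performance with multi-task progressive training.

Details
AI Chinese abstract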

扩散Transformer的最新进展推动了视频生成和编辑的快速发展,但这些能力仍由独立的、任务特定的模型处理。构建支持多种视频任务的统一框架仍然是一个开放挑战:现有的统一尝试要么需要专用的辅助编码器,要么缺乏区分异构条件令牌的显式机制,当视觉条件的数量和类型因任务而异时难以应对。我们提出TIDE,一个统一框架,集成了基于指令的编辑、参考引导编辑和多参考生成。其核心是,我们引入了逐令牌任务嵌入,为每个输入令牌分配一个任务特定标识符,使模型能够显式区分目标、源和参考令牌。为了同时捕捉高层语义理解和细粒度结构保真度,我们设计了一种双路径条件方案,将视觉语言模型与VAE潜在路径耦合以提供互补信号。我们进一步设计了一种多任务渐进训练策略,逐步引入复杂度递增的任务,有效协调不同目标,并实现跨异构任务分布的平滑泛化。在多个视频编辑和生成基准上的大量实验表明,TIDE在所有评估任务上均达到了最先进的性能。我们的项目页面可在https://LittleWork123.github.io/tide获取。

English abstract

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.

2606.08302 2026-06-09 cs.CV New submission

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

HACK++:面向高效视觉自回归建模的更有效的头部感知键值压缩

Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

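Affiliations * Shanghai Jiao Tong University; Rakuten; Tsinghua University

AI summary To address the high compute and memory overhead caused by cross-scale KV caches in VAR models, proposes HACK++, a training-free head-aware compression framework that classifies head types offline and allocates budgets adaptively, maintaining near-lossless generation under very low cache budgets.

Details
AI Chinese abstract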

视觉自回归(VAR)模型采用下一尺度预测范式,以显著更少的解码步骤实现高质量生成。然而,现有VAR模型由于跨尺度键值(KV)缓存的累积,面临严重的注意力复杂度和内存开销。本文通过将KV缓存压缩引入下一尺度范式来应对这一挑战。我们首先深入分析VAR注意力,观察到注意力头可以稳定地分为两个功能不同的类别:上下文头关注保持语义一致性,而结构头保持空间连贯性。它们的功能差异使得现有的一刀切压缩方法在VAR模型上表现不佳。我们进一步发现,两种头部类型对历史尺度的依赖程度不同,且这种依赖在不同层和生成步骤中发生变化,这要求自适应的缓存预算分配。为解决这些问题,我们提出HACK++,一种针对VAR模型的无训练头部感知键值压缩框架。通过一次性离线校准,HACK++分类头部类型并推导头部特定先验。在推理时,它将注意力与缓存压缩在独立预算下解耦,在压缩累积缓存时采用更激进的策略,通过模式特定策略和依赖感知预算分配来限制当前尺度的注意力成本。在多个VAR模型上进行的广泛实验,涵盖文本到图像、类别条件和统一理解与生成任务,验证了HACK++的有效性和泛化能力。例如,在Infinity-2B/8B上,HACK++在仅30%注意力预算和10%缓存预算下保持近无损生成,即使在1%缓存预算下也保持稳健。

English abstract

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.
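To make the idea of a per-head cache budget concrete, here is a minimal sketch of budgeted KV-cache pruning in which each head keeps only a fraction of its cached tokens. The selection rule used here (keep the tokens with the highest accumulated attention score) is a generic heuristic chosen for illustration; it is an assumption and not the pattern-specific strategy HACK++ actually uses. The function names and example scores are hypothetical.

```python
def keep_top_tokens(attn_scores, budget_ratio):
    """Return indices of cached tokens to keep, in their original order."""
    if not 0.0 < budget_ratio <= 1.0:
        raise ValueError("budget_ratio must be in (0, 1]")
    n_keep = max(1, int(len(attn_scores) * budget_ratio))
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def compress_heads(scores_per_head, budget_per_head):
    """Apply a separate budget to each head (e.g. by head type)."""
    return {head: keep_top_tokens(scores, budget_per_head[head])
            for head, scores in scores_per_head.items()}


if __name__ == "__main__":
    # Made-up accumulated attention scores over five cached tokens.
    scores = {
        "contextual": [0.1, 0.5, 0.2, 0.9, 0.05],
        "structural": [0.3, 0.1, 0.4, 0.2, 0.6],
    }
    budgets = {"contextual": 0.4, "structural": 0.2}
    print(compress_heads(scores, budgets))
```

Giving different head types different budgets is the point the abstract argues for; how those budgets are derived from the offline calibration is described only in the paper.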

2606.08492 2026-06-09 cs.CV cs.AI New submission

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

眼见为实:基于视觉锚点的提示重写对齐用于文本到图像生成

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu, Xuhang Chen, Tianrun Chen, Siwei Ma

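Affiliations * Peking University; Tencent; Dalian University of Technology; Nanyang Technological University; University of Cambridge; Zhejiang University

AI summary Proposes the FaithRewriter framework, which uses a multimodal large language model to generate intermediate visual cues, combines them with a large language model to produce visually grounded enhanced prompts, and distills these into a small model to narrow the gap between user intent and generated images.

Details
AI Chinese abstract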

尽管文本到图像(T2I)模型具有令人印象深刻的能力,但由于用户提示的简洁性和模糊性,意图-生成差距往往持续存在。现有方法主要优化提示的流畅性和可读性。然而,增强过程仍然缺乏视觉基础。因此,重写器可能过度推断缺失的细节,导致意图-生成差距。为了解决这一限制,我们提出了FaithRewriter,一种用于T2I生成的新型提示增强框架。具体来说,FaithRewriter首先利用多模态MLLM从原始提示生成图像作为中间视觉线索。然后将该线索与提示结合,输入大规模LLM,生成视觉锚定的增强,更好地反映预期内容在图像中应如何呈现。最后,将这些增强蒸馏到小规模LLM中以便高效部署,增强其生成有效T2I提示的能力。实验表明,与强基线相比,FaithRewriter生成的提示更忠实于用户意图且视觉上更合理,有助于缩小意图-生成差距。

English abstract

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, widening the intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal large language model (MLLM) to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.

2606.08514 2026-06-09 cs.CV New submission

OmniTryOn: Video Try-On Anything at Once!

OmniTryOn: 一次性视频试穿任意物品!

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping

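Affiliations * Xi’an Jiaotong University

AI summary Proposes the OmniTryOn framework, which uses a First Frame Wearable Cache and Spatiotemporally Consistent RoPE to perform multi-object video try-on in a single pass without external priors, significantly outperforming existing methods on TryAny-Bench.

Details
AI Chinese abstract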

尽管视频虚拟试穿(VVT)取得了显著进展,现有方法仍存在两个基本局限:首先,它们仅限于单件衣物迁移,使得同时进行多物品试穿极不实用;其次,它们严重依赖显式外部先验(如衣物掩码),不可避免地破坏了关键的物理动态并降低了视觉质量。为弥补这一差距,本文提出了新颖的“任意试穿”任务,旨在一次推理过程中将多种可穿戴物品同时迁移到视频中的人物身上。为了支持并标准化这一范式,我们引入了TryAny-Bench,一个包含配对视频数据集和定制评估协议的综合基准。此外,我们提出了OmniTryOn,一个无外部先验的生成框架,用于解决该任务。具体而言,OmniTryOn采用首帧可穿戴缓存策略,通过初始视频帧直接为生成过程提供多样化的可穿戴物品。为保持一致性,我们提出了时空一致RoPE(STC-RoPE),它固有地建立了稳健的时空锚点,以严格保留复杂的人体运动和背景动态。通过提出的渐进式试穿(GTO)训练策略进行优化,我们的模型逐步掌握了稳健的多物品合成。在TryAny-Bench上的大量实验表明,OmniTryOn显著优于现有的专用视频虚拟试穿模型和通用视频编辑基线,为“任意试穿”任务建立了强大的新标准。我们的数据集、代码和模型可在https://github.com/xcltql666/OminTryOn获取。

English abstract

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.

2606.08672 2026-06-09 cs.CV cs.LG New submission

Learning to Solve Generative ODEs Beyond the Linear Span

学习求解生成式常微分方程:超越线性跨度

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

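AI summary Diffusion and flow generative models need many ODE solver steps. The paper proposes SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator, enabling few-step sampling without adding model NFEs and reaching state-of-the-art results on several tasks.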

Comments 12 pages, 7 figures

详情
AI中文摘要

扩散和流生成模型通过积分学习到的ODE进行采样,但高质量采样仍需要大量连续的模型评估。求解器学习通过调整标量系数、时间步长或两者来降低这一成本,同时保持骨干模型固定。在这项工作中,我们识别出该更新族中的一个结构瓶颈:每一步仍然受限于跨度。由于标量系数更新位于缓冲速度评估的跨度内,它只能拟合跨度内的分量,而任何跨度外的残差无法通过标量重组单独达到。我们提出SpanLift,一种轻量神经求解器,它用空间残差算子增强标量系数更新。SpanLift将固定的基础求解器作为跨度内先验,并在状态和速度缓冲上学习一个空间残差算子。该算子通过端点教师匹配训练,保留预训练的骨干,且不增加模型NFE。实验表明,学习到的校正跨基础求解器迁移,且主要位于跨度外。在像素空间扩散、潜流匹配和降水临近预报中,SpanLift实现了最先进的少步采样。仅用3个NFE,它将CIFAR-10的FID从8.16提升到5.69,ImageNet的FID从17.37提升到11.83。

英文摘要

Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFE, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.
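The span limitation described in this abstract can be illustrated with a small numerical sketch. It is not the SpanLift method: the vectors, the helper `project_onto_span`, and the least-squares fit are assumptions chosen for illustration. Given two buffered velocity vectors, the best scalar-coefficient update is the projection of the target step onto their span; what remains is orthogonal to both vectors and cannot be reached by re-weighting them.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_span(target, v1, v2):
    """Least-squares fit target ~ c1*v1 + c2*v2 (hypothetical helper).

    Solves the 2x2 normal equations with Cramer's rule and returns the
    coefficients, the in-span component, and the out-of-span residual.
    Assumes v1 and v2 are linearly independent.
    """
    g11, g12, g22 = dot(v1, v1), dot(v1, v2), dot(v2, v2)
    b1, b2 = dot(v1, target), dot(v2, target)
    det = g11 * g22 - g12 * g12
    c1 = (b1 * g22 - g12 * b2) / det
    c2 = (g11 * b2 - g12 * b1) / det
    in_span = [c1 * a + c2 * b for a, b in zip(v1, v2)]
    residual = [t - s for t, s in zip(target, in_span)]
    return (c1, c2), in_span, residual


# Two buffered velocity evaluations (illustrative values, not from the paper).
v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1.0, 0.0]
# The step an ideal solver would take; its third component lies outside the span.
target = [2.0, 3.0, 0.5]

coeffs, in_span, residual = project_onto_span(target, v1, v2)
print("coefficients:", coeffs)
print("in-span part:", in_span)
print("out-of-span residual:", residual)
```

In this toy case the residual is nonzero only along the third axis, which neither velocity touches. Any correction for it has to come from something other than scalar recombination, which is the role the abstract assigns to the spatial residual operator.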

2606.08788 2026-06-09 cs.CV 新提交

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

MaskAlign: 面向高效扩散训练的令牌子集表示对齐

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Kuaishou Technology(快手科技) University of Chinese Academy of Sciences(中国科学院大学)

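AI summary Targets the token-level information mismatch that arises when diffusion models are aligned with pretrained vision-model representations. Proposes MaskAlign, which applies alignment to randomly sampled token subsets and adds a pre-mask token mixing block to reduce information loss, aiming to improve training efficiency and generation quality.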

详情
AI中文摘要

与预训练视觉模型的表示对齐最近显示出加速扩散Transformer训练的潜力。通过将中间扩散特征与来自自监督视觉编码器的干净图像表示对齐,现有方法提高了收敛速度和生成质量。然而,这种对齐也引入了一个非平凡的约束:扩散模型处理噪声输入,其可用信息随时间步变化,而参考特征是从干净图像中提取的。在本文中,我们从令牌级角度重新审视这种不匹配。我们发现,在全令牌表示对齐下,具有较大对齐梯度范数的令牌表现出稳定的空间偏好,这表明对齐目标并非均匀影响所有令牌,可能鼓励模型依赖完整的干净图像令牌集。为了解决这个问题,我们提出了MaskAlign,一种令牌子集表示对齐方法,在训练期间对随机采样的令牌子集应用对齐。通过在不同迭代中向模型暴露不同的令牌子集,MaskAlign减少了表示对齐对完整令牌集的依赖,并鼓励在令牌子集扰动下更稳定的对齐行为。为了缓解直接丢弃令牌导致的信息损失,我们进一步引入了一个轻量级的预掩码令牌混合块,在掩码之前跨令牌共享信息。

英文摘要

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
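A minimal sketch of token-subset alignment, under stated assumptions: the abstract does not specify the loss, so a negative cosine similarity (common in representation-alignment work) is assumed here, and `token_subset_alignment_loss`, `keep_ratio`, and the toy features are hypothetical names and values. The pre-mask token mixing block is not modeled.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def token_subset_alignment_loss(diff_feats, ref_feats, keep_ratio, rng):
    """Align only a random subset of tokens (illustrative, not the paper's code).

    diff_feats: per-token features from the diffusion model.
    ref_feats:  per-token features from a clean-image encoder.
    Returns the mean negative cosine similarity over the sampled tokens
    and the sorted indices that were sampled.
    """
    n = len(diff_feats)
    k = max(1, int(round(keep_ratio * n)))
    idx = rng.sample(range(n), k)  # a fresh subset on every call
    loss = -sum(cosine(diff_feats[i], ref_feats[i]) for i in idx) / k
    return loss, sorted(idx)


rng = random.Random(0)
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
loss, idx = token_subset_alignment_loss(feats, feats, keep_ratio=0.5, rng=rng)
print("sampled tokens:", idx, "loss:", loss)
```

Because a new subset is drawn on each call, successive training iterations would see different tokens, which is the behavior the abstract describes.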

2606.08833 2026-06-09 cs.CV 新提交

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

CSFlow: 将流匹配与人类对比敏感度对齐

Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus(马克斯·普朗克信息学研究所,萨尔兰信息学园区)

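AI summary Proposes the CSFlow weighting scheme, which aligns the human contrast sensitivity function with the iterative denoising steps of flow matching, building on the soft autoregressive structure that arises in Fourier space. It improves the visual realism of generated images, lowering FID by 4.7% and raising Inception Score by 2.2%.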

详情
AI中文摘要

我们引入了对比敏感流(CSFlow),这是一种将人眼的对比敏感度函数(CSF)与流匹配的迭代去噪步骤联系起来的加权方案。由于真实世界图像将信号集中在低空间频率,这些分量在连续扩散过程中比高频分量更早达到高信噪比。当使用扩散或流匹配模型生成图像时,这会在傅里叶空间中诱导一种软自回归结构,其中粗略的图像内容在精细细节之前稳定。同时,人类视觉系统对空间频率的敏感度不均:极低和极高的频率需要显著更高的对比度才能被感知。我们首次通过两个贡献将这些观察结果融合在一起:(1)一个估计每个反向流区间生成哪些频率的度量,以及(2)通过将每个噪声级别生成的频率与人类对比敏感度对齐获得的时间步权重。我们通过实验验证了我们的贡献,表明这些权重可以通过仅推理时间步修改或短时微调,将FID降低4.7%,Inception Score提高2.2%,GenEval分数提高2.5%,从而改善生成性能。定性上,我们发现我们的CSFlow权重导致生成的图像具有更好的视觉真实感和更少的卡通外观。

英文摘要

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach a high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We merge these observations for the first time through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally, showing that these weights can improve generative performance by lowering FID by 4.7%, increasing Inception Score by 2.2%, and improving GenEval scores by 2.5% using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and a less cartoonish appearance of generated images.
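The claim that low frequencies reach a high signal-to-noise ratio first can be checked with a toy model. This is a sketch under assumptions, not the paper's metric: it assumes an image power spectrum proportional to 1/f², unit-variance white noise, and the linear interpolation x_t = (1 - t)·x_0 + t·ε with t = 1 as pure noise. The function names are hypothetical.

```python
import math


def power_spectrum(f, amplitude=1.0):
    """Assumed natural-image power spectrum, proportional to 1/f^2."""
    return amplitude / (f * f)


def snr(f, t, amplitude=1.0):
    """Per-frequency SNR of x_t = (1 - t) * x_0 + t * noise, white unit noise."""
    return ((1.0 - t) ** 2 * power_spectrum(f, amplitude)) / (t ** 2)


def resolved_cutoff(t, amplitude=1.0):
    """Highest frequency whose SNR is at least 1 at noise level t (0 < t < 1)."""
    return math.sqrt(amplitude) * (1.0 - t) / t


# As t decreases from 1 (noise) toward 0 (data), the cutoff moves upward:
# coarse content is resolved before fine detail.
for t in (0.9, 0.5, 0.25, 0.1):
    print(f"t={t:.2f}  cutoff frequency={resolved_cutoff(t):.3f}")
```

Under these assumptions the cutoff grows monotonically as noise is removed, which is one simple way to see the coarse-to-fine ordering in Fourier space that the abstract describes.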

2606.08847 2026-06-09 cs.CV cs.AI cs.LG 新提交

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

BLM-SGAN: 用于语义-空间文本到图像生成的双向语言建模

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Egypt(MSA大学计算机科学学院,埃及)

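AI summary Proposes BLM-SGAN, which uses BERT's bidirectional attention to capture long-range dependencies and to address vanishing gradients and sequential-processing limits in GAN-based text-to-image generation; reports state-of-the-art results on bird image generation.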

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

详情
Journal ref
Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025
AI中文摘要

尽管从文本描述生成图像取得了成功,但在自然语言处理(NLP)和计算机视觉(CV)等领域仍面临难以克服的挑战。文本到图像(T2I)模型的最新进展,特别是那些利用生成对抗网络(GAN)的模型,显著提高了跨领域合成逼真图像的能力。然而,现有的基于GAN的T2I模型仍然面临关键挑战,例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题,我们引入了BLM-SGAN,一种新颖的模型,它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能,Inception Score(IS)为5.45 +/- 0.08,超过了多个竞争模型,如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取:https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

英文摘要

Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

2606.09056 2026-06-09 cs.CV cs.LG 新提交

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

MilliVid: 用于视频生成中长程一致性的分层潜变量

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究所)

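AI summary Proposes coarse-to-fine rollout in a multi-scale token space: a pre-trained hierarchical autoencoder compresses each frame into multiple levels of tokens, and a video diffusion model is trained to generate them, preserving long-range consistency in geometry and object permanence at reduced compute cost.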

Comments Ishaan Preetam Chandratreya and David Charatan contributed equally. Project page: https://davidcharatan.com/millivid/

详情
AI中文摘要

视频生成模型已变得日益强大,但长程一致性仍然难以实现,因为即使只有几十帧也需要不切实际的长Transformer序列长度。我们表明,通过在多尺度token空间内使用粗到细展开生成视频,可以缓解这一问题。我们的方法很简单:首先,预训练一个自编码器,将每一帧压缩成一个token层次结构,层级范围从典型的潜变量分辨率到每帧仅几个token。最粗糙的层级捕获最重要的信息,如场景布局和语义,而更细的层级添加高频外观和纹理。然后,我们训练一个视频扩散模型,使用粗到细展开生成这些token。通过仔细控制在每个展开步骤中生成帧并用作上下文的细节级别,我们能够保持几何和物体持久性的长程一致性,同时在感知上不太相关的细节的长程一致性上花费更少的计算。我们使用一个自定义的长Minecraft视频数据集验证了这种方法,与现有基线相比,它产生了明显更一致的展开结果。

英文摘要

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

2606.09156 2026-06-09 cs.CV 新提交

OmniGen-AR: AutoRegressive Any-to-Image Generation

OmniGen-AR: 自回归任意到图像生成

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究所) Shanghai Collaborative Innovation Center of Intelligent Visual Computing(上海智能视觉计算协同创新中心) Bytedance Seed(字节跳动Seed) The University of Hong Kong(香港大学)

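AI summary Proposes OmniGen-AR, a unified autoregressive framework that uses a shared visual tokenizer and Disentangled Causal Attention to support text, spatial-signal, and visual-context conditions, achieving state-of-the-art or competitive results on multiple benchmarks.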

Comments Accepted by NeurIPS

详情
AI中文摘要

自回归(AR)模型在视觉生成中展现出强大潜力,以简单的架构和优化目标实现了优越性能。然而,现有方法通常局限于单一模态条件(如文本),限制了其在需要从多种控制信号合成图像的现实场景中的应用。在这项工作中,我们提出了OmniGen-AR,一个统一的任意到图像生成的自回归框架。通过共享视觉分词器将各种视觉条件离散化,并使用文本分词器处理文本提示,OmniGen-AR在单个模型中支持广泛的条件输入,包括文本(文本到图像生成)、空间信号(分割到图像和深度到图像)以及视觉上下文(图像编辑、帧预测和文本到视频生成)。为了减轻条件令牌到内容令牌的信息泄露风险,我们引入了解耦因果注意力(DCA),它将全序列因果掩码分离为条件因果注意力和内容因果注意力。这作为训练时的正则化器,不影响推理时的标准下一个令牌预测。通过这种设计,OmniGen-AR在多个基准上取得了新的最先进或至少具有竞争力的结果,例如在GenEval上达到0.63,在VBench上达到80.02,展示了其在灵活和高保真视觉生成方面的有效性。

英文摘要

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.

2606.09187 2026-06-09 cs.CV 新提交

CP4D: Compositional Physics-aware 4D Scene Generation

CP4D: 组合式物理感知4D场景生成

Hanxin Zhu, Cong Wang, Tianyu He, Long Chen, Xin Jin, Chen Gao, Zhibo Chen

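AI summary Proposes the CP4D paradigm, which composes a static environment with physically grounded dynamic objects and generates trajectories by combining a physics simulator with a video diffusion model, producing high-fidelity, physically consistent 4D scenes.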

详情
AI中文摘要

4D生成(即动态3D生成)因其强大的时空建模能力,近年来成为快速发展的研究前沿。然而,尽管取得了显著进展,现有方法通常无法捕捉底层物理原理,导致结果在物理上不一致且视觉上不真实。为了克服这一限制,我们提出了CP4D,一种新的范式,用于生成忠实遵循复杂物理动力学的逼真4D场景。受现实世界场景的组合性质启发,其中不变的静态背景与动态、物理合理的前景共存,CP4D将4D生成重新表述为静态3D环境与物理基础动态对象的集成。在此基础上,我们的框架遵循三阶段流程:1) 首先,我们利用预训练的专家模型分别生成环境和前景对象的高保真3D表示。2) 随后,为了为这些对象生成物理合理的轨迹和真实的交互,我们提出了一种混合运动合成策略,该策略整合了来自物理模拟器的先验知识与视频扩散模型中嵌入的常识。3) 最后,我们开发了一种自动组合机制,将静态环境和动态对象无缝融合成连贯、物理一致的4D场景。大量实验表明,CP4D能够生成具有高视觉保真度、强物理合理性和细粒度可控性的可探索、可交互的4D场景,显著优于现有方法。项目页面:https://anonymous.4open.science/w/CP4D/。

英文摘要

4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research frontier due to its powerful spatiotemporal modeling capabilities. However, despite notable advances, existing approaches typically fail to capture the underlying physical principles, producing results that are both physically inconsistent and visually implausible. To overcome this limitation, we present CP4D, a novel paradigm for photorealistic 4D scene synthesis with faithful adherence to complex physical dynamics. Drawing inspiration from the compositional nature of real-world scenes, where immutable static backgrounds coexist with dynamic, physically plausible foregrounds, CP4D reformulates 4D generation as the integration of a static 3D environment with physically grounded dynamic objects. On this basis, our framework follows a three-stage pipeline: 1) Firstly, we leverage pre-trained expert models to generate high-fidelity 3D representations of the environment and foreground objects respectively. 2) Subsequently, to produce physically plausible trajectories and realistic interactions for these objects, we propose a hybrid motion synthesis strategy that integrates priors from physical simulators with the common sense embedded in video diffusion models. 3) Finally, we develop an automated composition mechanism that seamlessly fuses the static environment and dynamic objects into coherent, physically consistent 4D scenes. Extensive experiments demonstrate that CP4D can generate explorable and interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods. The project page: https://anonymous.4open.science/w/CP4D/.

2606.09507 2026-06-09 cs.CV 新提交

Prisma-World: Camera-Controllable Multi-Agent Video World Model

Prisma-World: 相机可控的多智能体视频世界模型

Huiqiang Sun, Zhan Peng, Size Wu, Kun Wang, Kang Liao, Dianyi Wang, Xingyu Zeng, Sheng Jin, Yangguang Li, Zhiguo Cao, Ziwei Liu, Wei Li

发表机构 * School of AIA, HUST(华中科技大学人工智能与自动化学院) S-Lab, NTU(南洋理工大学S-Lab) SenseTime Research(商汤科技研究院) FDU(复旦大学) SUAT(深圳理工大学) HKU(香港大学) CUHK(香港中文大学)

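AI summary Proposes Prisma-World, which achieves cross-view consistency in multi-agent video generation through a joint geometry-aware denoising process and supports flexible agent counts and camera control.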

Comments Project page: https://huiqiang-sun.github.io/prisma-world/

详情
AI中文摘要

视频世界模型在生成可控视觉体验方面取得了快速进展,但大多数模型仍从单一观察者模拟世界。将此类模型扩展到多个智能体面临一个核心挑战:如果每个智能体的未来状态是独立生成的,重叠视角可能会实例化同一场景的不同版本,导致智能体间的物体、布局和外观不一致。传统的相机条件控制单个轨迹,但并未显式耦合在共享场景几何下应一致的视图生成。我们引入了Prisma-World,一个相机可控的多智能体世界模型,它将多智能体生成形式化为一个联合几何感知去噪过程,以实现跨视角一致性。Prisma-World在一个全注意力序列中处理所有智能体视频,使用多智能体RoPE设计来区分智能体身份同时保持同步的时间坐标,并将相对相机几何注入注意力中,使重叠视角偏向共享场景证据。为了进一步增强多视角一致性并提升全局空间感知,我们通过重叠衰减课程训练范式以及小地图条件结构指导来增强我们的框架。为了促进多智能体模型的训练和评估,我们引入了PrismaDataset,这是一个大规模UE5数据集,包含跨多样场景的全景采集、可组合的多智能体视角组(具有灵活的智能体数量和复杂的相机轨迹),以及用于一致性训练和评估的精确相机/动作标注。实验表明,单个Prisma-World模型可以生成高保真度的多智能体视频,具有灵活的智能体数量、相机可控性、改进的跨视角一致性以及在小地图引导下的空间定位。

英文摘要

Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.

2606.09803 2026-06-09 cs.CV cs.GR cs.LG 新提交

Echo-Memory: A Controlled Study of Memory in Action World Models

Echo-Memory:动作世界模型中记忆的受控研究

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy

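AI summary Proposes Echo-Memory, a controlled-variable study of memory mechanisms in action-conditioned world models, and finds that raw context capacity and block-wise state-space recurrence matter most for open-domain return.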

Comments 28 pages, 9 figures. Code: https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory

详情
AI中文摘要

我们提出Echo-Memory,对动作条件世界模型中的记忆机制进行受控研究。这些模型从第一帧、文本提示和相机动作序列生成多段视频,但其核心失败往往是记忆而非局部图像合成:当相机离开并返回时,场景或显著物体可能悄然改变。现有记忆设计难以比较,因为增益与骨干网络、训练、检索和评估差异纠缠在一起。Echo-Memory固定了动作到视频的接口,仅改变生成器存储和读取历史的方式。在共享的视频扩散骨干网络、优化器、相机动作表示、采样器和评估流程下,我们比较了原始上下文、基于压缩的记忆、具有不同读取路径的空间摘要以及状态空间递归。这种匹配矩阵分离了四个通常混淆的轴:容量、压缩、读取和递归。我们还通过三个分支协议评估记忆:重放质量、域内循环重访和开放域返回探测。这些分支通常不一致,表明重放保真度不足以作为记忆世界的代理。得出三个发现。原始上下文是一个强大的容量基线,它对开放域返回的改善远大于对重放指标的改善。紧凑性不能免费替代容量:激进的空间记忆和混合压缩记忆会丢失返回所需的显著证据。最后,块状状态空间递归是我们矩阵中最强的开放域返回机制,表明隐式记忆的结构与是否使用记忆同样重要。这些结果为在孤立的重放指标之外研究动作世界模型中的记忆提供了一个紧凑的协议。

英文摘要

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

2606.09816 2026-06-09 cs.CV cs.AI math.PR 新提交

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

PTL-Diffusion: 具有周期终端定律的流形感知扩散

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Cambridge(剑桥大学) University of Oxford(牛津大学) Harvard University(哈佛大学) MIT(麻省理工学院) University of Washington(华盛顿大学)

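AI summary Proposes PTL-Diffusion, whose forward noising process converges to a periodic family of Gaussian terminal laws rather than a single distribution, embedding phase structure explicitly and improving distribution matching on low-dimensional manifolds, with lower errors on point-cloud and face datasets.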

详情
AI中文摘要

标准扩散模型通常使用单一时间齐次高斯终端分布作为生成的参考律。虽然这一选择在分析上方便且经验上有效,但对于集中在低维流形附近的数据,它提供的显式结构很少,其中数据分布的不同区域可能对应于不同的局部几何或语义因素。因此,反向模型必须几乎完全从非结构化的终端参考分布中恢复流形级别的结构。我们提出PTL-Diffusion,一种概念验证的扩散框架,其前向噪声过程收敛到一个非常数的周期高斯终端族,而不是单一不变律。与相位条件DDPM不同(其中相位信息仅进入去噪网络,而前向过程保持不变),PTL-Diffusion将相位结构直接嵌入前向噪声动力学中。所提出的构造仍然接近标准去噪扩散模型:对于周期强迫的Ornstein-Uhlenbeck型前向过程,我们推导出闭合形式的前向边际分布、极限周期高斯终端族以及显式高斯反向后验,从而支持标准噪声预测训练。我们还引入了一个不变平均正则化项,通过平均周期参考律耦合相位条件反向动力学。在环面和圆柱点云基准以及Olivetti人脸数据集上的实验表明,PTL-Diffusion在匹配的DDPM基线上改善了流形级别的分布匹配,减少了相位条件误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的方向,同时激励更具表现力的相位构造和更大规模的评估。

英文摘要

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
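To make the idea of a periodic terminal law concrete, here is a one-dimensional sketch. It assumes a specific forcing that the abstract does not state: dX = -θ(X - a·cos(ωt)) dt + σ dW. For this assumed process the mean obeys a linear ODE whose long-time solution is periodic rather than constant, while the variance matches the unforced Ornstein-Uhlenbeck case. The function names and parameter values are hypothetical; the code compares a numerical integration of the mean against the closed form.

```python
import math


def periodic_mean(t, theta, a, omega):
    """Limiting periodic mean for the assumed forcing a*cos(omega*t)."""
    num = theta * a * (theta * math.cos(omega * t) + omega * math.sin(omega * t))
    return num / (theta ** 2 + omega ** 2)


def marginal_variance(t, theta, sigma):
    """Variance of X_t given a fixed X_0; independent of the forcing."""
    return sigma ** 2 * (1.0 - math.exp(-2.0 * theta * t)) / (2.0 * theta)


def integrate_mean(m0, t_end, theta, a, omega, steps):
    """RK4 integration of dm/dt = -theta * (m - a*cos(omega*t))."""
    def drift(t, m):
        return -theta * (m - a * math.cos(omega * t))

    h = t_end / steps
    m, t = m0, 0.0
    for _ in range(steps):
        k1 = drift(t, m)
        k2 = drift(t + h / 2, m + h * k1 / 2)
        k3 = drift(t + h / 2, m + h * k2 / 2)
        k4 = drift(t + h, m + h * k3)
        m += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return m


theta, a, omega, sigma = 1.0, 2.0, 2.0, 0.5  # illustrative values only
t_end = 20.0
numeric = integrate_mean(m0=3.0, t_end=t_end, theta=theta, a=a, omega=omega, steps=4000)
closed = periodic_mean(t_end, theta, a, omega)
print("numeric mean:", numeric)
print("closed-form periodic mean:", closed)
print("variance at t_end:", marginal_variance(t_end, theta, sigma))
```

After the transient has decayed, the terminal Gaussian depends on the phase ωt through its mean, so there is a family of terminal laws indexed by phase rather than one invariant law.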

2606.09828 2026-06-09 cs.CV 新提交

Latent Spatial Memory for Video World Models

视频世界模型的潜在空间记忆

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

发表机构 * Zhejiang University(浙江大学) Microsoft Research(微软研究院) Adelaide University(阿德莱德大学) Monash University(莫纳什大学)

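AI summary Proposes Mirage, a latent spatial memory framework that builds and queries a 3D cache directly in the diffusion latent space, avoiding pixel-space reconstruction; it reports up to 10.57× faster video generation and a 55× smaller memory footprint.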

Comments Project Page: https://aka.ms/latent-spatial-memory, Code: https://github.com/microsoft/LatentSpatialMemory

详情
AI中文摘要

在生成帧之间保持3D空间一致性的视频世界模型通常依赖于在RGB空间中构建的显式点云记忆。这种设计既计算昂贵(需要重复渲染和VAE编码),又固有地有损(因为通过像素空间的往返会丢弃学习到的潜在表示的丰富特征)。在本文中,我们为视频世界模型引入了潜在空间记忆,这是一种持久化的3D缓存,直接在扩散潜在空间中存储场景信息,避免了像素空间重建。在此基础上,我们提出了Mirage,一种潜在空间的空间记忆框架,通过深度引导的反投影将潜在令牌提升到3D来构建记忆,并通过直接潜在空间扭曲合成新视图来查询记忆。这种统一的公式消除了像素空间重建的信息损失以及重复编码和渲染的计算负担。实验表明,相对于显式3D基线,潜在空间记忆实现了高达10.57倍的端到端视频生成加速和55倍的内存占用减少。利用扩散模型的几何先验,Mirage在WorldScore上达到了最先进的性能,并在RealEstate10K上实现了强大的重建质量。

英文摘要

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
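Depth-guided back-projection can be sketched with a standard pinhole camera model. The code below is an illustration under assumptions, not Mirage's implementation: it treats each latent token as sitting at the center of a `stride` × `stride` pixel patch, uses hypothetical intrinsics, and stays in the camera frame (no camera-to-world pose). Projecting the lifted points back to the image is the basic operation behind warping to a new view.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth into camera space."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def lift_latent_tokens(depth_map, stride, fx, fy, cx, cy):
    """Attach a 3D position to each latent token (hypothetical helper).

    depth_map[i][j] is the depth assigned to the token in row i, column j.
    Each token is placed at the pixel center of its stride x stride patch.
    """
    lifted = {}
    for i, row in enumerate(depth_map):
        for j, depth in enumerate(row):
            u = (j + 0.5) * stride
            v = (i + 0.5) * stride
            lifted[(i, j)] = unproject(u, v, depth, fx, fy, cx, cy)
    return lifted


# A 2x2 latent grid over a 16x16 image, all tokens at depth 2 (toy values).
depth_map = [[2.0, 2.0], [2.0, 2.0]]
points = lift_latent_tokens(depth_map, stride=8, fx=16.0, fy=16.0, cx=8.0, cy=8.0)
for index, point in sorted(points.items()):
    print(index, point, "->", project(point, 16.0, 16.0, 8.0, 8.0))
```

In a full system each 3D point would carry its latent feature vector and be transformed by the relative camera pose before re-projection; both steps are omitted here.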

2606.08309 2026-06-09 cs.LG cs.CV 交叉投稿

Where the Score Lives: A Wavelet View of Diffusion

分数函数所在之处:扩散的小波视角

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

发表机构 * The Kempner Institute for the Study of Natural and Artificial Intelligence(肯普纳自然与人工智能研究所) Harvard University(哈佛大学)

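AI summary Proposes a parameterization of the score function in a 2D orthogonal wavelet basis; a moment-based analysis of the data distribution exposes the inductive biases of different architectures and helps explain how score networks interact with the data distribution in diffusion models.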

Comments 20 pages, 12 figures, AISTATS 2026

详情
Journal ref
Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300
AI中文摘要

基于分数的生成模型在过去十年中在生成多样化视觉上合理的图像方面取得了显著成功。在扩散建模中,包括CNN、U-Net和Transformer在内的多种架构被用作分数近似网络;然而,迄今为止,关于这些架构选择如何影响生成行为的了解相对较少。在这项工作中,为了提供对此领域的见解,我们提出了一种使用二维正交小波基展开的分数函数的解析可解参数化。特别地,我们根据数据分布的矩推导出可解释的最优分数函数。我们利用这种参数化提供了一种与架构无关的、基于矩的分析,揭示了数据分布的哪些属性对去噪最为重要。我们的分数机器足够灵活,可以部分模仿多种架构(包括U-Net和CNN)的相关归纳偏差,朝着理解不同分数架构为何表现出不同生成行为迈出了一步。由于我们的分数函数可以根据数据矩解析求解,我们可以开始理解数据分布如何与分数网络相互作用,从而产生我们在扩散模型中观察到的行为。

英文摘要

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

2606.08841 2026-06-09 cs.AI cs.CV 交叉投稿

ZIPP: Zero-shot Image Personalization from Personas

ZIPP:基于人物画像的零样本图像个性化生成

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

发表机构 * Adobe Media and Data Science Research (MDSR)(Adobe媒体与数据科学研究(MDSR)) IIIT-Delhi(德里英德拉普拉斯塔信息技术学院) SUNY at Buffalo(纽约州立大学布法罗分校)

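AI summary Proposes ZIPP, which uses an LLM to rewrite prompts from natural-language personas for zero-shot personalized image generation, with no user data or fine-tuning; introduces the ZIPBench benchmark and reports gains of 13-20% across multiple evaluations.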

详情
AI中文摘要

文本到图像扩散模型越来越多地部署在开放式创意环境中,但其输出仍然缺乏个性,优化的是整体审美而非个人品味。人类偏好是多元化的:一位喜欢柔和、怀旧肖像的用户可能偏爱充满活力的街头摄影,而另一位则倾向于梦幻的电影美学。现有方法需要密集的交互历史或逐用户微调,在冷启动场景中失败,并将上下文相关的偏好压缩为静态表示。我们提出了基于人物画像的零样本图像个性化生成(ZIPP),该方法以自然语言人物画像(用户身份和审美偏好的简洁描述符)为条件生成图像,无需任何用户特定数据或权重更新。ZIPP使用LLM从给定人物画像的角度重写提示词,引导扩散模型输出个性化结果。为了大规模挖掘人物画像,我们在一个包含2200万用户的Reddit交互图上训练了一个归纳式图注意力网络,采用双对比目标将图结构与视觉行为对齐,然后通过多模态大语言模型将学习到的表示转化为自然语言人物画像。我们引入了ZIPBench,这是首个零样本个性化基准,包含1500名用户、图挖掘的人物画像和4万张生成图像。在四个基准和涵盖五个模型家族的14个LLM上,人物画像条件化带来一致的性能提升(13-20%),前沿模型受益最大。在少样本设置中,ZIPP匹配或超过了基于每用户100多个示例微调的基线。ZIPP实现了最低的偏好分布散度(CMMD 0.16 vs 0.55),且经IPF归一化的人口统计评估表明,它显著减少了现有方法中存在的子群体偏差。人工评估证实,与通用生成相比胜率为79%,与所有微调基线相比胜率为58-65%。

英文摘要

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

2410.21747 2026-06-09 cs.CV 版本更新

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT-2:用于运动生成与理解的通用运动-语言模型

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Dan Xu, Shixiang Tang

发表机构 * Tsinghua University(清华大学) The University of Sydney(悉尼大学) University of Science and Technology of China(中国科学技术大学) The Chinese University of Hong Kong(香港中文大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Intime Department Store(Intime百货) Deepeleph HKUST(香港科技大学)

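AI Summary Proposes MotionGPT-2, a unified Large Motion-Language Model that uses a pre-trained large language model to support multimodal control conditions and handle multiple tasks such as motion generation, captioning, and completion, and introduces Part-Aware VQVAE for fine-grained representation of body and hand motion.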

详情
AI中文摘要

近年来,从描述性文本生成逼真的人体运动受到了显著的研究关注,这得益于数字内容创作等新兴需求。尽管取得了令人印象深刻的进展,现有方法通常受限于有限的控制模态、任务特异性,并且仅关注身体运动。在本文中,我们提出了MotionGPT-2,一种统一的大规模运动-语言模型(LMLM),以解决这些局限性。MotionGPT-2通过预训练的大语言模型(LLM)支持多种运动相关任务和多模态控制条件。它将多模态输入(如文本和单帧姿态)量化为离散的、LLM可解释的标记,无缝集成到LLM的词汇表中。这些标记随后被组织成统一的提示,通过预训练-微调范式引导LLM生成运动输出。我们还展示了所提出的MotionGPT-2通过创新的运动离散化框架Part-Aware VQVAE,能够高度适应具有挑战性的3D整体运动生成任务,该框架确保了身体和手部运动的细粒度表示。大量实验和可视化验证了我们方法的有效性,展示了MotionGPT-2在运动生成、运动描述和广义运动补全任务中的适应性。

英文摘要

Generating lifelike human motions from descriptive texts has attracted remarkable research attention in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and an exclusive focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
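
The quantization of continuous inputs into discrete tokens can be illustrated with a nearest-neighbor codebook lookup, the core operation of a generic VQ-VAE. This is a minimal sketch and not the paper's Part-Aware VQVAE: the toy codebooks, the 2-D features, the function names, and the `<body_k>` / `<hand_k>` token naming are all assumptions made for illustration.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize_motion(body_feats, hand_feats, body_codebook, hand_codebook):
    """Map per-frame body and hand features to discrete token strings.

    Using one codebook per body part is an assumption here; it only mirrors
    the general idea of part-aware discretization.
    """
    tokens = []
    for body, hand in zip(body_feats, hand_feats):
        tokens.append(f"<body_{nearest_code(body, body_codebook)}>")
        tokens.append(f"<hand_{nearest_code(hand, hand_codebook)}>")
    return tokens


# Toy 2-D codebooks with hypothetical values.
BODY_CODEBOOK = [(0.0, 0.0), (1.0, 1.0)]
HAND_CODEBOOK = [(0.0, 1.0), (1.0, 0.0)]

tokens = quantize_motion(
    body_feats=[(0.1, -0.1), (0.9, 1.2)],
    hand_feats=[(0.2, 0.8), (0.7, 0.1)],
    body_codebook=BODY_CODEBOOK,
    hand_codebook=HAND_CODEBOOK,
)
print(tokens)
```

In a setup like this, the resulting token strings would be added to the language model's vocabulary so that motion and text can share one prompt.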

2508.07011 2026-06-09 cs.CV cs.GR 版本更新

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

HiMat: 基于DiT的超高分辨率SVBRDF生成

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

发表机构 * College of Computer Science, Nankai University(南开大学计算机科学学院) Adobe Research(Adobe研究) NVIDIA Nanjing University(南京大学)

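AI Summary Proposes HiMat, a framework that uses a diffusion transformer with linear attention to generate 4K SVBRDFs in a high-compression latent space, with a CrossStitch module that maintains cross-map consistency, enabling efficient and diverse ultra-high-resolution material generation.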

详情
AI中文摘要

创建超高分辨率空间变化双向反射分布函数(SVBRDF)对于逼真的3D内容创作至关重要,以忠实呈现近距离渲染所需的精细表面细节。然而,实现4K生成面临两个关键挑战:(1)需要以全分辨率合成多个反射图,这增加了像素预算并带来了高昂的内存和计算成本;(2)需要在4K下保持跨图的强像素级对齐,这在适配为RGB图像域设计的预训练模型时尤其困难。我们引入了HiMat,一个专为高效且多样化的4K SVBRDF生成量身定制的基于扩散的框架。为解决第一个挑战,HiMat通过DC-AE在高压缩潜空间中进行生成,并采用具有线性注意力的预训练扩散Transformer来提高每图效率。为解决第二个挑战,我们提出了CrossStitch,一个轻量级卷积模块,在不增加全局注意力成本的情况下强制跨图一致性。我们的实验表明,与先前方法相比,HiMat实现了高保真度的4K SVBRDF生成,具有卓越的效率、结构一致性和多样性。除了材质,我们的框架还推广到相关应用,如本征分解。

英文摘要

Creating ultra-high-resolution spatially varying bidirectional reflectance distribution functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

2509.24531 2026-06-09 cs.CV 版本更新

Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis

扩散桥还是流匹配?一个统一框架与比较分析

Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

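AI Summary Unifies Diffusion Bridge and Flow Matching through Stochastic Optimal Control and Optimal Transport theory, proves that Diffusion Bridge has a lower cost function and more stable trajectories, and validates the theory experimentally on tasks such as image restoration.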

详情
AI中文摘要

扩散桥和流匹配在任意分布之间的变换中都展示了令人信服的经验性能。然而,关于哪种方法通常更优仍存在困惑,并且它们建模假设和实际实现中的显著差异阻碍了对其相对优势的统一理论解释。我们首次为这两种模型提供了统一的理论和实验验证。我们通过随机最优控制的视角重新构建了它们的框架,并证明了扩散桥的成本函数更低,引导系统走向更稳定和自然的轨迹。同时,从最优传输的角度来看,当训练数据规模减少时,流匹配的插值系数 $t$ 和 $1-t$ 变得越来越无效。为了证实这些理论主张,我们提出了一种基于潜在Transformer的新型强大架构用于扩散桥,并实现了具有相同结构的流匹配模型,以便在各种实验中进行公平的性能比较。我们在图像恢复、图像翻译和风格迁移任务上进行了全面的实验,系统性地改变了分布差异(不同难度)和训练数据规模。广泛的经验结果与我们的理论预测完全一致,并使我们能够描绘这两种模型各自的优缺点。我们的代码可在 https://github.com/zhukaizhen/diffusion_bridge_flow_matching 获取。

英文摘要

Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Restoration, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at https://github.com/zhukaizhen/diffusion_bridge_flow_matching.
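
For readers unfamiliar with the interpolation coefficients $t$ and $1-t$ mentioned in the abstract, the following sketch shows the linear path commonly used in Flow Matching. It assumes the convention $x_t = (1-t)\,x_0 + t\,x_1$ with a constant velocity target $x_1 - x_0$; the function names and toy values are hypothetical, and nothing here reproduces the paper's architecture or its Diffusion Bridge formulation.

```python
def interpolate(x0, x1, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 between paired samples."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


x0 = [0.0, 2.0]    # sample from the source distribution (toy values)
x1 = [4.0, -2.0]   # paired sample from the target distribution (toy values)

midpoint = interpolate(x0, x1, 0.5)
velocity = target_velocity(x0, x1)
print(midpoint, velocity)
```

A velocity network would be regressed onto `velocity` at points such as `midpoint`. The path is fully determined by the two endpoints, so under this convention every training signal comes from the sampled pairs.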

2510.05356 2026-06-09 cs.CV cs.LG 版本更新

Mitigating Diffusion Model Hallucinations with Dynamic Guidance

通过动态引导缓解扩散模型幻觉

Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

发表机构 * Stony Brook University(石溪大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

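AI Summary To address hallucinations in diffusion models caused by over-smoothing of the score function, proposes Dynamic Guidance, which selectively sharpens the score function along pre-determined directions while preserving valid semantic variations, substantially reducing hallucinations.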

Comments Project page: https://cvlab-stonybrook.github.io/DynamicGuidance/

详情
AI中文摘要

扩散模型中的幻觉是指样本出现结构不一致性,这通常是由于学习到的分数函数过度平滑,导致数据分布模式之间的插值。由于语义插值通常是有益的且有助于样本多样性,我们认为需要一种细致且有针对性的解决方案来处理扩散模型幻觉。在这项工作中,我们引入了动态引导,通过仅沿已知会导致伪影的预定方向选择性锐化分数函数来缓解幻觉,同时保留有效的语义变化。这种锐化可以使用预定的类别或语义一致的聚类(在数据分布上形成伪类)来执行。后者允许将动态引导原则性地扩展到文本到图像生成,其中我们选择模式以对应文本描述中细粒度的上下文差异。据我们所知,这是第一种在生成时而非通过事后过滤来解决幻觉的方法。动态引导在受控和自然图像数据集上均显著减少了幻觉,大幅优于基线方法。

英文摘要

Hallucinations in diffusion models are samples with structural inconsistencies that can emerge due to the excessive smoothing of the learned score function, which in turn leads to interpolations between modes of the data distribution. Since semantic interpolations are often desirable and contribute to sample diversity, we believe that a nuanced and targeted solution is required to address diffusion model hallucinations. In this work, we introduce Dynamic Guidance, which mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. This sharpening can be performed using either pre-determined classes or semantically coherent clusters that form pseudo-classes over the data distribution. The latter allows for a principled extension of Dynamic Guidance to text-to-image generation, where we select modes to correspond to fine-grained contextual differences in textual descriptions. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

2601.03256 2026-06-09 cs.CV 版本更新

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Muses: 无需训练的设计、组合与生成不存在的幻想3D生物

Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang

发表机构 * Nanjing University(南京大学) China Agricultural University(中国农业大学)

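AI Summary Proposes Muses, the first training-free feed-forward method for fantasy 3D creature generation, which uses a 3D skeleton for structure-aware design, composition, and generation and achieves state-of-the-art visual fidelity and text alignment.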

Comments Project page: https://luhexiao.github.io/Muses.github.io/

详情
AI中文摘要

我们提出Muses,这是首个在前馈范式中无需训练即可生成奇幻3D生物的方法。以往依赖部件感知优化、手动组装或2D图像生成的方法,由于复杂的部件级操作和有限的域外生成挑战,往往产生不真实或不连贯的3D资产。相比之下,Muses利用3D骨架(生物形态的基本表示)来明确且合理地组合多样元素。这种骨架基础将3D内容创作形式化为一个结构感知的设计、组合与生成流水线。Muses首先通过图约束推理构建一个具有连贯布局和比例的创意组合3D骨架。然后,该骨架在结构化潜在空间内引导基于体素的组装过程,整合来自不同对象的区域。最后,在骨架条件下应用图像引导的外观建模,为组装形状生成风格一致且和谐的纹理。大量实验证明,Muses在视觉保真度和文本描述对齐方面达到了最先进的性能,并在灵活的3D对象编辑方面具有潜力。项目页面:https://luhexiao.github.io/Muses.github.io/。

英文摘要

We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and show its potential for flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.

2601.18585 2026-06-09 cs.CV cs.GR 版本更新

GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization

GimmBO: 基于贝叶斯优化的交互式生成图像模型合并

Chenxi Liu, Selena Ling, Alec Jacobson

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所)

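AI Summary To address the difficulty of choosing weights when merging diffusion model adapters, proposes GimmBO, a framework that uses Preferential Bayesian Optimization for interactive exploration, with a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces.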

Comments Accepted at SIGGRAPH NA 2026

详情
AI中文摘要

基于微调的适配被广泛用于定制扩散图像生成,导致大量社区创建的适配器集合,这些适配器捕捉不同的主题和风格。源自同一基础模型的适配器可以通过权重合并,从而在广阔且连续的设计空间中合成新的视觉结果。为了探索这一空间,当前工作流依赖于手动滑块调优,这种方法扩展性差且使得权重选择困难,即使候选集限制在20-30个适配器。我们提出GimmBO,通过偏好贝叶斯优化(PBO)支持图像生成中适配器合并的交互式探索。受实际使用中的观察(包括稀疏性和受限权重范围)启发,我们引入了一个两阶段BO后端,提高了高维空间中的采样效率和收敛性。我们通过模拟用户和用户研究评估了我们的方法,展示了改进的收敛性、高成功率以及相对于BO和线搜索基线的持续增益,并通过几个扩展进一步展示了框架的灵活性。

英文摘要

Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.
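
To make the design space concrete, the sketch below shows one common way adapters derived from the same base model are merged with weights: a weighted sum of per-adapter weight deltas added to the base parameters. This is an illustrative assumption rather than GimmBO's implementation; `merge_adapters` and the toy vectors are hypothetical, and real adapters operate on full weight tensors, not short lists.

```python
def merge_adapters(base, deltas, weights):
    """Return base + sum_i weights[i] * deltas[i], computed element-wise."""
    if len(deltas) != len(weights):
        raise ValueError("need exactly one weight per adapter")
    merged = list(base)
    for delta, w in zip(deltas, weights):
        for j, d in enumerate(delta):
            merged[j] += w * d
    return merged


base = [1.0, 1.0, 1.0]              # flattened base-model parameters (toy)
deltas = [
    [1.0, 0.0, 0.0],                # delta contributed by adapter A (toy)
    [0.0, 2.0, 0.0],                # delta contributed by adapter B (toy)
]
merged = merge_adapters(base, deltas, weights=[0.5, 0.25])
print(merged)
```

With 20-30 adapters the `weights` vector has 20-30 dimensions, which is the space a user would otherwise explore by hand with one slider per adapter.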

2601.23286 2026-06-09 cs.CV cs.AI cs.LG 版本更新

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

VideoGPA: 通过几何先验知识蒸馏实现3D一致的视频生成

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

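AI Summary VideoGPA improves the 3D consistency of video generation by distilling geometry priors, using a data-efficient self-supervised framework to guide video diffusion models and significantly improving temporal stability, geometric plausibility, and motion coherence.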

Comments 8 pages, 5 figures, ICML 2026

详情
AI中文摘要

尽管最近的视频扩散模型(VDMs)能产生视觉上令人印象深刻的结果,但它们在保持3D结构一致性方面存在根本性困难,常导致物体变形或空间漂移。我们假设这些失败是因为标准去噪目标缺乏显式的几何一致性激励。为此,我们引入VideoGPA(视频几何偏好对齐),一种数据高效的自监督框架,利用几何基础模型自动推导密集偏好信号,通过直接偏好优化(DPO)引导VDMs。该方法有效将生成分布引导至内在3D一致性,而无需人工标注。VideoGPA通过最少的偏好对显著提升了时间稳定性、几何合理性与运动一致性,在大量实验中一致优于最先进基线。

英文摘要

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
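
As background on the preference objective named in the abstract, here is a minimal sketch of the standard DPO loss for a single preference pair, written in terms of log-probabilities under the trained model and a frozen reference model. It is the generic formulation, not VideoGPA's training code: how log-likelihoods are obtained for a video diffusion model, and the value of `beta`, are assumptions left outside this sketch.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin compares how much the
    trained model has moved toward the preferred sample relative to the
    reference model, versus the rejected sample.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)        # model identical to reference
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)    # model favors the preferred sample
reversed_pref = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(neutral, aligned, reversed_pref)
```

The loss falls below its neutral value only when the trained model raises the preferred sample's likelihood relative to the rejected one. In the setting described above, that preference would come from a geometry foundation model rather than from human labels.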

2602.07345 2026-06-09 cs.CV cs.LG 版本更新

Optimizing Few-Step Generation with Adaptive Matching Distillation

自适应匹配蒸馏优化少步生成

Lichen Bai, Zikai Zhou, Shitong Shao, Wenliang Zhong, Shuo Yang, Shuo Chen, Bojun Chen, Zeke Xie

发表机构 * xLeaF Lab, The Hong Kong University of Science(xLeaF实验室,香港科学与技术大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学,深圳) School of Intelligence Science(智能科学学院)

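AI Summary Proposes Adaptive Matching Distillation (AMD), which uses reward proxies to detect and escape Forbidden Zones and combines structural signal decomposition with Repulsive Landscape Sharpening to improve the sample fidelity and training robustness of few-step generative models.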

Comments 25 pages, 15 figures, 11 tables

详情
AI中文摘要

分布匹配蒸馏(DMD)是一种强大的加速范式,但其稳定性常在禁止区域(真实教师提供不可靠指导而虚假教师施加不足排斥力的区域)中受到损害。在这项工作中,我们提出了一个统一的优化框架,将先前的方法重新解释为避免这些受损区域的隐式策略。基于这一见解,我们引入了自适应匹配蒸馏(AMD),一种利用奖励代理显式检测和逃离禁止区域的自我纠正机制。AMD通过结构信号分解动态优先考虑纠正梯度,并引入排斥景观锐化以强制执行陡峭的能量屏障,防止失败模式崩溃。在图像和视频生成任务(如SDXL、Wan2.1)以及严格基准测试(如VBench、GenEval)上的大量实验表明,AMD显著提高了样本保真度和训练鲁棒性。例如,AMD将SDXL上的HPSv2分数从30.64提升至31.25,优于最先进的基线。这些发现验证了在禁止区域内显式纠正优化轨迹对于推动少步生成模型性能上限至关重要。

英文摘要

Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

2604.00903 2026-06-09 cs.CV 版本更新

IDDM: Identity-Decoupled Personalized Diffusion Models with a Tunable Privacy-Utility Trade-off

IDDM: 具有可调隐私-效用权衡的身份解耦个性化扩散模型

Linyan Dai, Xinwei Zhang, Haoyang Li, Qingqing Ye, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

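AI Summary Proposes IDDM, which integrates identity decoupling into the personalization pipeline to support authorized personalization while reducing the identity linkability of public generations, with a tunable privacy-utility trade-off.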

详情
AI中文摘要

个性化文本到图像扩散模型(例如DreamBooth、LoRA)使用户能够从少量参考照片合成高质量的肖像用于社交表达。然而,一旦这些生成内容在社交媒体平台(如Instagram、Facebook)上分享,它们可通过人脸识别系统与真实用户关联,从而实现身份跟踪和画像。现有防御措施主要采用反个性化策略,通过破坏模型微调来保护公开发布的参考照片。虽然对未经授权的个性化有效,但未解决另一种实际场景:当个性化被授权时,但公开输出仍泄露身份信息。为此,我们引入新的防御设置,称为模型侧输出免疫,目标是生成支持授权个性化的模型,同时减少公开生成的身份可链接性,并通过可调控制隐私-效用权衡来满足多样化的隐私需求。为此,我们提出身份解耦个性化扩散模型(IDDM),一种模型侧防御,将身份解耦整合到个性化流程中。具体而言,IDDM采用交替过程,交替进行短个性化更新和身份解耦数据优化,使用两阶段计划来平衡身份可链接性抑制和生成效用。在多个数据集、多样提示和最先进的人脸识别系统上的广泛实验表明,IDDM一致地降低了身份可链接性,同时保持高质量的个性化生成。

英文摘要

Personalized text-to-image diffusion models (e.g., DreamBooth, LoRA) enable users to synthesize high-fidelity avatars from a few reference photos for social expression. However, once these generations are shared on social media platforms (e.g., Instagram, Facebook), they can be linked to the real user via face recognition systems, enabling identity tracking and profiling. Existing defenses mainly follow an anti-personalization strategy that protects publicly released reference photos by disrupting model fine-tuning. While effective against unauthorized personalization, they do not address another practical setting in which personalization is authorized, but the resulting public outputs still leak identity information. To address this problem, we introduce a new defense setting, termed model-side output immunization, whose goal is to produce a personalized model that supports authorized personalization while reducing the identity linkability of public generations, with tunable control over the privacy-utility trade-off to accommodate diverse privacy needs. To this end, we propose Identity-Decoupled personalized Diffusion Models (IDDM), a model-side defense that integrates identity decoupling into the personalization pipeline. Concretely, IDDM follows an alternating procedure that interleaves short personalization updates with identity-decoupled data optimization, using a two-stage schedule to balance identity linkability suppression and generation utility. Extensive experiments across multiple datasets, diverse prompts, and state-of-the-art face recognition systems show that IDDM consistently reduces identity linkability while preserving high-quality personalized generation.

2605.00273 2026-06-09 cs.CV cs.AI 版本更新

When Do Diffusion Models Learn to Generate Multiple Objects?

扩散模型何时学会生成多个物体?

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

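AI Summary Investigates the limitations of diffusion models in multi-object generation, finding that scene complexity matters more than concept imbalance and that counting is especially hard to learn in low-data regimes.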

Comments ICML2026

详情
AI中文摘要

文本到图像的扩散模型实现了出色的视觉保真度,却在多物体生成中仍不可靠。尽管有大量实证证据表明这些失败,但其根本原因仍不清楚。我们首先探讨这种限制有多大源于数据本身。为了区分数据影响,我们考虑了不同数据集大小下的两种模式:(1)概念泛化,其中每个单独的概念在训练期间可能在不平衡的数据分布下被观察到;(2)组合泛化,其中特定的概念组合被系统性地排除。为了研究这些模式,我们引入了mosaic(多物体空间关系、属性、计数),一种受控的数据集生成框架。通过在mosaic上训练扩散模型,我们发现场景复杂性起主导作用,而非概念不平衡,并且在低数据模式中计数尤为难以学习。此外,随着训练过程中排除更多概念组合,组合泛化能力会崩溃。这些发现突显了扩散模型的根本限制,并促使更强的归纳偏见和数据设计以实现稳健的多物体组合生成。

英文摘要

Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity plays a dominant role rather than concept imbalance, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.

2605.02439 2026-06-09 cs.CV cs.LG 版本更新

Anomaly-Preference Image Generation

异常偏好图像生成

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

发表机构 * Nanjing University of Science(南京理工大学) Beijing Normal University, Beijing, China(北京师范大学) China Academy of Space Technology, Beijing, China(中国航天科技集团)

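AI Summary Proposes a new anomaly generation method that uses an implicit preference alignment mechanism and a Time-Aware Capacity Allocation module to improve the realism and diversity of generated images; experiments show it outperforms existing methods on both.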

Comments Accepted by ICML 2026

详情
AI中文摘要

从有限数据中合成逼真且多样的异常样本对于鲁棒模型泛化至关重要。然而,现有方法难以平衡保真度和多样性,通常受分布不匹配和过拟合的阻碍。为缓解这一问题,我们引入了异常偏好优化,一种将异常生成重新表述为偏好学习问题的新范式。我们的方法核心是隐式偏好对齐机制,利用真实异常作为正例参考,直接从去噪轨迹偏差中推导优化信号,而无需昂贵的人工标注。此外,我们提出了一个时间感知能力分配模块,动态地沿扩散时间线分配模型能力,在高噪声阶段优先考虑结构多样性,在低噪声阶段增强细粒度保真度。在推理过程中,分层采样策略调节保真度与对齐的权衡,实现对生成过程的精确控制。大量实验表明,该方法显著优于现有基线,实现了真实性和多样性方面的最先进性能。

英文摘要

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

2605.15466 2026-06-09 cs.CV 版本更新

Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction

以实体为中心的世界模型:交互感知的掩码用于因果视频预测

Santosh Kumar Paidi

发表机构 * Genentech, Inc.(基因泰克公司)

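AI Summary Proposes IA-JEPA, which uses a self-supervised motion-centric masking strategy to prioritize physical interactions, improving accuracy on causal reasoning tasks, and shows that the approach generalizes to real-world human actions and physical puzzles.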

Comments 12 pages, 4 figures

详情
AI中文摘要

从未标记视频中学习预测性世界模型是人工智能的基础挑战。尽管联合嵌入预测架构(JEPA)在语义分类中设定了新基准,但它们往往缺乏物理感知,无法捕捉下游推理所需的因果动态。我们假设这源于标准的基于块的掩码策略,这些策略优先考虑视觉纹理而非罕见但信息丰富的运动事件。我们提出交互感知JEPA(IA-JEPA),利用自监督的运动中心掩码策略,优先考虑物理交互。通过专门针对碰撞或动量转移的实体,我们迫使架构重建潜在轨迹而非静态背景特征。在CLEVRER基准上评估,IA-JEPA在因果推理任务中达到14.26%的准确率,显著高于标准块掩码基线的3.22%。关键的是,我们证明IA-JEPA通过诱导更高熵、更具判别性的潜在空间(+10%熵增)打破了标准自监督的“静态偏见”,并线性化物理能量(R²=0.43)。我们展示这种交互偏见可推广到真实世界的人类动作(Something-Something V2)和零样本物理谜题(PHYRE-Lite)。我们的结果提供了一条可扩展的、完全自监督的路径,以构建开始内部化物理世界因果结构的基础世界模型。

英文摘要

Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the "static bias" of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.
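
The contrast between random patch masking and motion-centric masking can be sketched with a simple heuristic: score each patch by how much it changes between two frames and mask the highest-scoring patches. The abstract does not say how interacting entities are detected, so the frame-difference score used here is purely an assumption, as are the function names and the toy frames.

```python
def patch_motion_scores(frame_a, frame_b, patch):
    """Mean absolute pixel difference for each non-overlapping patch."""
    rows, cols = len(frame_a), len(frame_a[0])
    scores = {}
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            diffs = [
                abs(frame_a[r][c] - frame_b[r][c])
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            scores[(r0 // patch, c0 // patch)] = sum(diffs) / len(diffs)
    return scores


def select_masked_patches(scores, k):
    """Pick the k patch indices with the largest motion score."""
    ranked = sorted(scores, key=lambda idx: scores[idx], reverse=True)
    return set(ranked[:k])


# Two 4x4 grayscale frames; only the bottom-right region changes.
frame_a = [[0] * 4 for _ in range(4)]
frame_b = [row[:] for row in frame_a]
frame_b[2][2] = 8
frame_b[3][3] = 4

scores = patch_motion_scores(frame_a, frame_b, patch=2)
masked = select_masked_patches(scores, k=1)
print(scores, masked)
```

Masking the moving region forces a predictor to reconstruct what changed instead of copying static background, which is the intuition behind prioritizing kinematic events over visual texture.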

2605.24892 2026-06-09 cs.CV 版本更新

X-Foresight: A Joint Vision-Action Causal Forecasting Network via Predictive World Modeling

X-Foresight:一种通过预测世界建模的联合视觉-动作因果预测网络

Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu, Yuan Lin, Junhong Zhou, Ruixin Liu, Willow Yang, Yutong Zheng, Zhenli Zhang, Sean Li, Chaoda Zheng, Boyang Wang, Tenglong, Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu

发表机构 * PWM Team(PWM团队) XPeng Inc.(XPeng公司)

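AI Summary Proposes X-Foresight, which integrates a predictive world model directly into the VLA architecture and uses a long-horizon chunk-wise auto-regressive strategy with curriculum learning to jointly learn world modeling and real-time action control, addressing low-entropy redundancy in video prediction and the difficulty of modeling long-horizon causality.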

详情
AI中文摘要

物理世界知识主要存在于视频中。赋予视觉-语言-动作(VLA)模型此类知识对于安全且可泛化的规划至关重要。预测世界建模通过从过去观测预测未来视频,使VLA能够内化物理动态和长程因果关系。然而,朴素的下一帧预测面临两个挑战:1)与语义上不同的文本标记不同,视频标记是低熵且冗余的,导致预测退化为琐碎的外推;2)世界建模存在时间困境:密集预测捕捉瞬时动态,但无法高效建模长程因果。为有效学习世界知识,我们引入X-Foresight,一种直接集成到VLA架构中的预测世界模型,以联合学习世界建模和实时动作控制。其核心是一种长程分块自回归策略,该策略解决了上述两个挑战:通过预测语义上遥远的块而非相邻帧,它避免了琐碎的外推,同时保留密集的块内帧用于瞬时动态和稀疏的块间过渡用于长程因果。课程学习计划逐步扩展预测范围并稳定长程训练。为有效捕捉长程因果,我们提出时间重要性采样,将监督集中于由自我运动和行为信号识别的安全关键块。我们进一步将逼真合成委托给基于扩散的多视图渲染器,以改善逼真外观。大量实验表明,X-Foresight在规划性能上显著优于VLA基线,同时保持强大的生成保真度,为世界知识驱动的自主系统建立了稳健的范式。

英文摘要

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving photorealistic appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.

2605.26108 2026-06-09 cs.CV 版本更新

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

通过奖励倾斜分布匹配增强少步生成器

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

发表机构 * Tencent Hunyuan(腾讯混元) Hong Kong University of Science and Technology(香港科技大学) Westlake University(西湖大学)

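AI Summary Proposes Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning and achieves state-of-the-art text-to-image generation with only 4 inference steps.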

Comments Code and models are available at https://github.com/Harahan/RTDMD

详情
AI中文摘要

近期少步扩散蒸馏的进展实现了高效图像生成,但将这些模型与人类偏好对齐仍具挑战。我们提出奖励倾斜分布匹配蒸馏(RTDMD),一个两阶段框架,将分布匹配蒸馏与奖励引导的强化学习统一用于少步流生成器。我们证明,最小化到奖励倾斜教师分布的KL散度自然分解为分布匹配项和奖励最大化项。在第一阶段,我们引入环境一致分布匹配蒸馏(AC-DMD),它执行子区间分布匹配,并用一致性正则化增强假分数目标,帮助假分数模型在有限更新下跟踪变化的生成器分布。在第二阶段,我们联合优化两项:对于奖励最大化项,我们推导出一个混合策略梯度,将GRPO风格的估计器用于随机中间过渡,与通过确定性最后步骤的直接奖励反向传播相结合,并进一步引入步骤子集GRPO(SubGRPO)以降低方差。在SD3、SD3.5和FLUX.2上的实验表明,RTDMD在偏好、美学和组合指标上仅用4步推理就建立了新的最先进结果,超越了先前的少步文本到图像生成方法。代码和模型见https://github.com/Harahan/RTDMD。

英文摘要

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
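
The decomposition claimed in the abstract can be checked numerically on a toy discrete distribution. The sketch assumes the usual exponential tilt, $p^*(x) \propto p(x)\exp(r(x)/\beta)$, for which $\mathrm{KL}(q\,\|\,p^*) = \mathrm{KL}(q\,\|\,p) - \mathbb{E}_q[r]/\beta + \log Z$. The abstract does not state the exact form of the tilt, so this form, the temperature `beta`, and all numbers below are assumptions.

```python
import math


def kl(q, p):
    """KL divergence KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def tilt(p, rewards, beta):
    """Return the reward-tilted distribution and its normalizer Z."""
    unnorm = [pi * math.exp(ri / beta) for pi, ri in zip(p, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z


teacher = [0.5, 0.3, 0.2]     # teacher distribution p (toy)
rewards = [0.0, 1.0, 2.0]     # reward r(x) for each outcome (toy)
generator = [0.2, 0.3, 0.5]   # few-step generator distribution q (toy)
beta = 1.0

tilted, z = tilt(teacher, rewards, beta)
lhs = kl(generator, tilted)

matching_term = kl(generator, teacher)                       # distribution matching
expected_reward = sum(q * r for q, r in zip(generator, rewards))
rhs = matching_term - expected_reward / beta + math.log(z)   # log Z is constant in q
print(lhs, rhs)
```

Because `log(z)` does not depend on the generator, minimizing the left-hand side amounts to minimizing the matching term while maximizing expected reward, which is the split into two terms described above.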

2605.31158 2026-06-09 cs.CV cs.LG 版本更新

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Light Interaction:交互式视频世界模型的免训练推理加速

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

发表机构 * Zhejiang University(浙江大学) NVIDIA

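AI Summary To address the high inference cost of interactive video world models, proposes Light Interaction, a training-free acceleration framework that achieves up to 2.59x speedup through adaptive context management, denoising cache acceleration, and 3D block sparse attention.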

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

详情
AI中文摘要

交互式视频世界模型根据用户控制的相机运动逐块生成视频,支持实时游戏模拟、虚拟场景导航和具身AI训练等应用。然而,由于上下文记忆增长、二次注意力复杂度和重复去噪步骤,扩展到长交互轨迹的成本过高。我们提出Light Interaction,一种用于交互式视频世界模型的免训练推理加速框架。我们的关键洞察是,交互自然支持轨迹依赖的自适应计算:在探索新区域时可丢弃检索到的空间记忆,根据局部潜在动态调整时间上下文,当相机重新访问熟悉区域时可重用早期步骤的模型输出。基于此洞察,Light Interaction结合了自适应上下文管理、去噪缓存加速以及硬件-软件协同设计的3D块稀疏注意力(融合Triton内核)。在HY-WorldPlay和Matrix-Game-3.0上的评估表明,Light Interaction在无需模型重训练的情况下实现了最高2.59倍加速,同时保持有竞争力的视觉质量。

英文摘要

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.
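
The idea of reusing early-step outputs when the camera revisits a familiar region can be pictured as a pose-keyed cache. The following is a minimal sketch under assumptions: the class `PoseKeyedCache`, the Euclidean pose distance, the `radius` threshold, and the `run_model` callback are all hypothetical and are not taken from the Light Interaction code.

```python
import math


class PoseKeyedCache:
    """Hypothetical cache that treats a nearby camera pose as a revisit."""

    def __init__(self, radius):
        self.radius = radius  # largest pose distance that still counts as a revisit
        self.entries = []     # list of (pose, cached_output) pairs

    def lookup(self, pose):
        # Return the output stored for the nearest pose within radius, else None.
        best, best_dist = None, self.radius
        for stored_pose, output in self.entries:
            dist = math.dist(pose, stored_pose)
            if dist <= best_dist:
                best, best_dist = output, dist
        return best

    def store(self, pose, output):
        self.entries.append((tuple(pose), output))


def early_step_output(pose, cache, run_model):
    """Run the expensive early denoising step only on a cache miss."""
    cached = cache.lookup(pose)
    if cached is not None:
        return cached, True
    output = run_model(pose)
    cache.store(pose, output)
    return output, False
```

A real system would need a criterion that also accounts for camera orientation and scene change; the single distance threshold here only illustrates the control flow.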

2606.00094 2026-06-09 cs.CV cs.AI 版本更新

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

显式建模数据流形几何的扩散图像生成

Duoduo Xue, Zhiyu Zhu, Junhui Hou

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 提出MIND框架,通过将离散补丁标记化集成到连续扩散模型的得分函数中显式建模流形几何,结合离散标记的结构量化能力和连续扩散的并行生成灵活性,在ImageNet 256×256上显著降低FID。

详情
AI中文摘要

图像生成模型旨在从底层数据流形中采样数据点,这需要学习并解码一个密集、低维且紧凑的参数化空间。为此,我们提出了数据流形感知图像扩散模型(MIND),一种通过将离散补丁标记化集成到连续扩散模型的得分函数中来显式建模流形几何的新框架。该方法成功利用了离散标记的结构量化能力和连续扩散的并行生成灵活性。此外,我们通过一种新颖的软top-$k$聚合机制实现了端到端可微训练,并引入了双分支高频特征嵌入层以缓解Transformer主干网络在低维输入上的谱偏差。进一步地,在推理阶段,我们设计了一种多阶段过渡采样方案,根据时间步动态调整采样方案。在ImageNet 256×256上的大量实验证明了MIND的有效性。经过80个epoch的训练,我们的基础模型在无引导情况下实现了22.73的FID,几乎将原始DiT-B/2基线的43.47 FID减半。与基线DiT和SiT相比,所提方法平均分别降低了15.95和9.06的FID。对于ImageNet-256×256上的引导图像生成,所提MIND-B仅用130M参数就实现了2.06的FID,超过了具有3.1B参数的LlamaGen-3B。所提MIND-XL具有715M参数,进一步将FID降低至1.95。我们的MIND为基于扩散的图像生成引入了全新视角,为该领域的未来研究和创新铺平了道路。代码将公开提供。

英文摘要

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling strategy based on the timestep. Extensive experiments on ImageNet 256×256 demonstrate the effectiveness of MIND. After 80 epochs of training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the DiT and SiT baselines, respectively. For guided image generation on ImageNet 256×256, the proposed MIND-B, with only 130M parameters, achieves an FID of 2.06, surpassing LlamaGen-3B with 3.1B parameters. MIND-XL, with 715M parameters, further reduces the FID to 1.95. MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.
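
The abstract does not spell out the soft top-$k$ aggregation, so the following is only a sketch of one plausible reading: pick the k codebook entries closest to a patch feature and blend them with softmax weights, so the output stays differentiable in the weights. The function name, the squared-distance score, and the `temperature` parameter are assumptions made for illustration.

```python
import math


def soft_top_k_aggregate(query, codebook, k, temperature=1.0):
    """Blend the k nearest codebook entries with softmax weights.

    Hypothetical helper: the score, temperature, and blending rule are
    assumptions, not the formulation used by MIND.
    """
    # Score each entry by negative squared Euclidean distance to the query.
    scores = [-sum((q - c) ** 2 for q, c in zip(query, entry)) for entry in codebook]
    top = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores, shifted by the max for stability.
    max_score = max(scores[i] for i in top)
    exps = [math.exp((scores[i] - max_score) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of the selected entries.
    blended = [
        sum(w * codebook[i][d] for w, i in zip(weights, top))
        for d in range(len(query))
    ]
    return blended, dict(zip(top, weights))
```

With `k=1` this reduces to hard nearest-neighbour quantization; larger `k` spreads the weight over several tokens.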

2606.05816 2026-06-09 cs.CV cs.AI 版本更新

Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

基于LLM提示翻译和LoRA微调的韩语日记文本情感感知图像生成

Jihun Cho, Soo-Yeon Jeong, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种情感感知文本到图像流水线,利用Qwen3-8B识别短日记中的隐含情感,并通过LoRA微调Stable Diffusion 3.5 Medium生成儿童手绘风格图像,同时探讨情感触发词的影响及CLIP Score作为评估指标的局限性。

Comments 4 pages, 4 figures, 2 tables, MITA 2026

详情
Journal ref
Proc. Int. Conf. Multimedia, Information Technology and its Applications (MITA), 2026
AI中文摘要

T2I模型无法有效捕捉包括日记在内的各类文本中的情感,因为它们主要关注视觉对象相关模式而非上下文情感理解。本文提出一种情感感知文本到图像流水线,从短韩语日记条目生成儿童手绘风格图像。该流水线采用Qwen3-8B识别短日记中的隐含情感,并使用基于情感触发词在儿童绘画图像上通过LoRA微调的Stable Diffusion 3.5 Medium进行图像生成。此外,本文通过实验检验情感触发词对生成图像的影响,并讨论CLIP Score作为情感感知图像生成评估指标的局限性。

英文摘要

T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates images in a children's hand-drawn style from short Korean diary entries. The proposed pipeline employs Qwen3-8B to recognise implicit sentiment in short diaries, and uses Stable Diffusion 3.5 Medium, fine-tuned with LoRA on children's drawing images with emotion-based trigger words, for image generation. Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.
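
To see why CLIP Score may be a weak proxy for emotion, it helps to recall what it measures. A minimal sketch, assuming the commonly cited definition (a rescaled, clipped cosine similarity between an image embedding and a text embedding, with weight w = 2.5); the embeddings here are plain lists standing in for CLIP encoder outputs, and some toolkits use a different scale factor:

```python
import math


def clip_score(image_emb, text_emb, w=2.5):
    """Rescaled cosine similarity, clipped at zero.

    Assumes the embeddings were already produced by a CLIP image and text
    encoder; w = 2.5 follows the commonly cited definition.
    """
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / norm, 0.0)
```

The score only reflects how well the image matches the prompt text in embedding space, so two images with the same objects but different emotional tone can score almost identically.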

2503.08434 2026-06-09 cs.GR cs.CV 版本更新

Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models

Bokeh Diffusion:文本到图像扩散模型中的散焦模糊控制

Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University(S实验室,南洋理工大学)

AI总结 提出Bokeh Diffusion框架,通过物理散焦模糊参数条件化扩散模型,结合混合训练流程和接地自注意力机制,实现场景一致的散景模糊控制,支持双向模糊强度调整和真实图像编辑。

Comments SIGGRAPH Asia 2025. Project page: https://atfortes.github.io/projects/bokeh-diffusion/

详情
AI中文摘要

近年来,大规模文本到图像模型的进展通过从文本提示生成视觉上引人入胜的输出,彻底改变了创意领域;然而,传统摄影通过光圈等相机设置精确控制景深以塑造视觉美学,而当前的扩散模型通常依赖提示工程来模拟此类效果。这种方法往往导致粗略的近似,并意外地改变场景内容。在这项工作中,我们提出了Bokeh Diffusion,一个场景一致的散景控制框架,它显式地将扩散模型条件化在一个物理散焦模糊参数上。为了克服在不同相机设置下捕获的配对真实世界图像的稀缺性,我们引入了一个混合训练流程,将野外图像与合成模糊增强对齐,提供多样化的场景和主体,以及监督学习以分离图像内容与镜头模糊。我们框架的核心是接地自注意力机制,该机制在同一场景的不同散景水平的图像对上训练,使得模糊强度可以在保持底层场景的同时双向调整。大量实验表明,我们的方法实现了灵活的、类似镜头的模糊控制,支持通过反演进行真实图像编辑等下游应用,并在Stable Diffusion和FLUX架构上有效泛化。

英文摘要

Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics (such as depth of field via aperture), current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is our grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.
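
The abstract does not define its physical defocus blur parameter. For orientation, the classic thin-lens circle of confusion shows how aperture, focus distance, and subject distance combine into a blur size. This is a sketch of that textbook formula only, and the function and argument names are illustrative:

```python
def circle_of_confusion(focal_length_mm, f_number, focus_dist_mm, subject_dist_mm):
    """Thin-lens blur-circle diameter (in mm on the sensor).

    Textbook formula, not necessarily the parameter used by Bokeh Diffusion.
    """
    aperture = focal_length_mm / f_number  # aperture diameter
    magnification = focal_length_mm / (focus_dist_mm - focal_length_mm)
    defocus = abs(subject_dist_mm - focus_dist_mm) / subject_dist_mm
    return aperture * magnification * defocus
```

A subject at the focus distance gives zero blur, and opening the aperture (a smaller f-number) scales the blur up proportionally, which is the behaviour a lens-like control would be expected to reproduce.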

2605.27852 2026-06-09 cs.GR cs.CV 版本更新

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

ClothTransformer: 用于可扩展布料模拟的统一潜空间变换器

Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University, Singapore(新加坡南洋理工大学S-Lab) Feeling AI University of Oxford(牛津大学) Nanyang Technological University(南洋理工大学) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出ClothTransformer,通过将布料模拟重构为潜空间中的自回归序列建模,使用统一Transformer架构处理多种场景,实现比现有方法低约4-9倍的误差,并引入可扩展的潜空间表示和无穿透数据集。

详情
AI中文摘要

统一且可扩展的变换器最近在建模传统上与计算机图形学相关的多种现象(如3D视觉效果、渲染过程和视频中的运动)方面取得了显著成功。在这项工作中,我们进一步研究现代变换器技术是否能够应对布料模拟这一挑战性任务。为此,我们提出了ClothTransformer,这是一个将布料模拟重构为在学习的潜空间中进行自回归序列建模的框架。现有的神经布料模拟器大多专用于单一场景,与网格离散化内在耦合,并且缺乏鲁棒的碰撞处理。我们的方法通过三个贡献解决了这些局限性:(1)一个统一的变换器架构,在单一模型下处理多种场景——身体驱动的服装、机器人操作和自由落体碰撞——并在所有场景中实现比先前最先进方法低约4-9倍的误差;(2)一个可扩展的潜空间公式,将任意分辨率的网格压缩为固定大小的潜令牌集,使得时间动态计算独立于网格分辨率;(3)一个覆盖所有三种设置的高保真无穿透数据集(约493.4k帧),该数据集支持可微分的连续碰撞检测(CCD)模块以抑制穿透伪影。

英文摘要

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios (body-driven garments, robotic manipulation, and free-fall collisions) under a single model and achieves approximately 4-9× lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/

2606.06497 2026-06-09 cs.GR cs.CV cs.HC 版本更新

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

实时注意力弯曲:视频扩散变换器的粒度交互式网络弯曲

Adam Cole, Rebecca Fiebrink, Mick Grierson

发表机构 * Creative Computing Institute(创意计算研究所) University of the Arts London(伦敦艺术大学)

AI总结 提出实时注意力弯曲工具,通过操纵视频扩散变换器的自注意力、交叉注意力及前馈网络,实现逐层、逐步、逐令牌的交互式生成控制,增强艺术家的创作代理与模型材料亲密性。

Comments 5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026

详情
AI中文摘要

生成式视频模型已实现显著的视觉保真度,但其仅提示的界面提供了薄弱的创作代理,并使得艺术家无法了解模型的物质过程。我们提出了实时注意力弯曲,这是一种将网络弯曲实践扩展到视频扩散变换器(DiT)全深度并使其进入实时交互式生成的工具。作为DayDream Scope生态系统中的插件构建,并封装了开源实时Wan管道,该工具将自注意力、交叉注意力和前馈网络暴露为可独立操作的面,目标可细化到单个扩散步骤、DiT层、提示令牌和隐藏神经元。实时操作的即时性提供了我们所谓的与模型的“物质亲密性”:对特定层和神经元如何塑造生成视频的响应式、近乎机械的感觉。我们将该工具定位为同时作为对变换器内部结构的XAIxArts探针,以及用于发现模型默认表示空间之外的美学的表达性乐器。

英文摘要

Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

7. 3D视觉、点云与空间智能 30 篇

2606.07670 2026-06-09 cs.CV cs.AI 新提交

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

液态神经网络作为动态3D高斯泼溅的即插即用连续时间变形场

Mingzhao Li, Arghya Pal, Guan Yuan Tan

发表机构 * Monash University(莫纳什大学)

AI总结 提出用液态神经网络(LNN)的闭式连续时间(CfC)单元替代MLP,构建显式连续时间变形场,在动态场景重建中匹配或超越MLP基线,尤其擅长高频关节运动。

详情
AI中文摘要

可变形3D高斯泼溅(D-3DGS)通过一个位置编码的MLP(以帧时间t为输入)变形一组规范3D高斯,从单目视频重建动态场景。尽管拟合连续变量,但MLP在架构中不耦合任意两个t值,实际上预测离散的逐帧偏移,使得时间平滑性仅作为优化的副产品出现。我们将变形场重新设计为一组闭式连续时间(CfC)单元,即液态神经网络(LNN),它是液态时间常数ODE的闭式解,同时保留D-3DGS管道的其他部分。每个单元暴露一个sigmoid时间门,在两个候选隐藏状态之间插值,将学习到的对t的平滑响应嵌入损失景观,无需调用任何数值求解器。在八个D-NeRF和七个NeRF-DS场景上,液态场在总体上匹配或超过MLP基线,其最大增益集中在具有最高频关节运动的场景上。结果是一种近乎零摩擦的架构设计,将离散的MLP变形场转变为t的显式连续时间函数。

英文摘要

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positionally encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN) that is the closed-form solution of the Liquid Time-constant ODE, while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes, the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
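
The sigmoidal time gate described above can be sketched in a few lines. This assumes the commonly published CfC form, in which a gate sigma(-f * t) blends two candidate states g and h; in a real cell f, g, and h come from learned network heads, whereas here they are passed in as plain lists for illustration:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cfc_gate(f, g, h, t):
    """Elementwise CfC blend: sigma(-f*t) * g + (1 - sigma(-f*t)) * h.

    f, g, h are illustrative stand-ins for the outputs of learned heads.
    """
    out = []
    for f_i, g_i, h_i in zip(f, g, h):
        gate = sigmoid(-f_i * t)  # smooth in t, no ODE solver needed
        out.append(gate * g_i + (1.0 - gate) * h_i)
    return out
```

Because the gate is a smooth function of t, nearby timestamps necessarily produce nearby states, which is the coupling across time that a per-frame MLP does not build in.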

2606.07907 2026-06-09 cs.CV cs.AI 新提交

3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning

基于匹配学习的改进顶点分布的3D口腔建模

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm


AI总结 针对3D口腔重建中预测顶点分布不均的问题,提出结合匈牙利匹配过滤与排斥损失的改进损失函数,使顶点分布更均匀,虽精度略降但有效缓解了聚集现象。

Comments 5 pages, 7 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情
AI中文摘要

在我们之前的工作中,提出了一个基于深度学习的3D口内重建框架。该模型直接从十张固定角度的口内图像预测显式3D点云坐标,采用MobileNetV2和多头注意力进行多视图特征融合,并使用L1损失和倒角距离的组合作为损失函数。尽管模型达到了77.49%的准确率,但预测顶点倾向于集中在真实值的高密度区域,而其他区域大部分未被覆盖。在本文中,提出了一种改进的损失函数来解决这一局限性。引入了带过滤的匈牙利匹配和排斥损失,以强制重建模型上的顶点分布更加均匀。所提出的模型达到了68.02%的准确率,数值上低于之前的模型。然而,先前工作中观察到的顶点聚集问题得到了显著缓解,预测顶点在整个重建表面上分布更加均匀。

英文摘要

In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D point cloud coordinates from ten fixed-angle intraoral images, employing MobileNetV2 and Multi-head Attention for multi-view feature fusion, with a combined L1 Loss and Chamfer Distance as the loss function. Although the model achieved an accuracy of 77.49%, predicted vertices tended to concentrate in high-density regions of the ground truth, leaving other regions largely uncovered. In this paper, an improved loss function is proposed to address this limitation. Hungarian matching with filtering and Repulsion Loss are introduced to enforce more uniform vertex distribution across the reconstructed model. The proposed model achieves an accuracy of 68.02%, which is numerically lower than the previous model. However, the vertex clustering issue observed in the prior work is substantially alleviated, with predicted vertices distributed more evenly across the entire reconstructed surface.
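
The three loss ingredients named above can be illustrated on toy point sets. This is a sketch under assumptions: the paper's filtering step is omitted, the hinge form of the repulsion term and its `radius` are guesses, and the one-to-one matching is brute-forced over permutations, which is only feasible for a handful of points (a practical implementation would more likely call an assignment solver such as SciPy's `linear_sum_assignment`).

```python
import math
from itertools import permutations


def chamfer_distance(pred, target):
    """Symmetric mean nearest-neighbour distance between two point sets."""
    def one_way(a, b):
        return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    return one_way(pred, target) + one_way(target, pred)


def optimal_matching(pred, target):
    """Brute-force one-to-one matching that minimises total distance.

    Stand-in for Hungarian matching; assumes len(pred) == len(target).
    """
    best = min(
        permutations(range(len(target))),
        key=lambda perm: sum(math.dist(pred[i], target[j]) for i, j in enumerate(perm)),
    )
    return list(enumerate(best))


def repulsion_loss(points, radius):
    """Hinge penalty on every pair of points closer than radius (assumed form)."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += max(0.0, radius - math.dist(points[i], points[j]))
    return total
```

Nearest-neighbour terms let many predictions share one target, which is consistent with the clustering described above. A one-to-one matching forces each prediction onto a distinct target, and the repulsion term penalises predictions that crowd together.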

2606.07932 2026-06-09 cs.CV cs.GR cs.MM eess.IV math.OC 新提交

LEGS: Laplacian-Enhanced Gaussian Splatting with a Nonlinear Weighted Loss

LEGS: 拉普拉斯增强的高斯泼溅与非线性加权损失

Yongfei Guo, Qizhou Huo, Xuan Sun, Yuanhao Gong

发表机构 * Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences(中国科学院长春光学精密机械与物理研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出LEGS方法,利用二阶拉普拉斯结构引导和非线性权重函数改进高斯泼溅的损失函数,在保持渲染管线不变的情况下提升结构感知优化,在Tanks&Temples和Mip-NeRF360数据集上PSNR最高提升1.68 dB。

详情
AI中文摘要

3D高斯泼溅(3DGS)已成为辐射场重建和实时新视角合成的高效显式表示方法。然而,其标准光度损失对平坦区域和结构丰富区域处理相似,这可能限制尖锐轮廓和精细细节的恢复。边缘引导高斯泼溅(EGGS)通过边缘引导加权提高了结构感知能力,但主要依赖一阶梯度响应和线性加权。本文提出LEGS,一种具有非线性加权损失的拉普拉斯增强高斯泼溅方法。LEGS用二阶拉普拉斯结构引导取代一阶梯度引导,并通过非线性响应-权重函数将归一化拉普拉斯响应映射为逐像素权重。所提出的损失改进了结构感知的高斯优化,同时保持原始3DGS渲染管线不变。在完整Tanks&Temples和Mip-NeRF360数据集上的实验表明,LEGS相比3DGS的峰值信噪比(PSNR)最高提升1.68 dB,相比EGGS最高提升0.52 dB。将所提出的二阶非线性加权策略集成到FastGS和FasterGS中,PSNR进一步提升最高1.69 dB,证明了其作为高斯泼溅管线的通用损失级扩展的有效性,在AR/VR、沉浸式可视化和实时3D内容生成中具有潜在应用。

英文摘要

3D Gaussian Splatting (3DGS) has become an efficient explicit representation for radiance field reconstruction and real-time novel view synthesis. However, its standard photometric loss treats flat and structure-rich regions similarly, which may limit the recovery of sharp contours and fine details. Edge-Guided Gaussian Splatting (EGGS) improves structure awareness through edge-guided weighting, but mainly relies on first-order gradient responses and linear weighting. In this paper, we propose LEGS, a Laplacian-Enhanced Gaussian Splatting method with a nonlinearly weighted loss. LEGS replaces first-order gradient guidance with second-order Laplacian structural guidance and maps the normalized Laplacian response into pixel-wise weights through nonlinear response-to-weight functions. The proposed loss improves structure-aware Gaussian optimization while keeping the original 3DGS rendering pipeline unchanged. Experiments on the full Tanks&Temples and Mip-NeRF360 datasets show that LEGS improves peak signal-to-noise ratio (PSNR) by up to 1.68 dB over 3DGS and up to 0.52 dB over EGGS. Incorporating the proposed second-order nonlinear weighting strategy into FastGS and FasterGS further improves PSNR by up to 1.69 dB, demonstrating its effectiveness as a general loss-level extension for Gaussian Splatting pipelines with potential applications in AR/VR, immersive visualization, and real-time 3D content generation.
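
The weighting scheme described above can be sketched on a tiny grayscale image stored as nested lists. The 4-neighbour Laplacian is standard, but the power-law mapping with `gamma` and the additive `base` weight are assumptions standing in for the paper's unspecified nonlinear response-to-weight functions:

```python
def laplacian_response(img):
    """Absolute 4-neighbour Laplacian with replicated borders."""
    h, w = len(img), len(img[0])

    def px(y, x):
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [
        [abs(px(y - 1, x) + px(y + 1, x) + px(y, x - 1) + px(y, x + 1) - 4 * px(y, x))
         for x in range(w)]
        for y in range(h)
    ]


def nonlinear_weights(response, gamma=0.5, base=1.0):
    """Map the normalised response to per-pixel weights (assumed power law)."""
    peak = max(max(row) for row in response)
    if peak == 0:
        return [[base] * len(row) for row in response]
    return [[base + (v / peak) ** gamma for v in row] for row in response]


def weighted_l1(render, target, weights):
    """Weighted mean absolute error between a render and its target."""
    total = sum(
        w * abs(r - t)
        for r_row, t_row, w_row in zip(render, target, weights)
        for r, t, w in zip(r_row, t_row, w_row)
    )
    return total / sum(w for w_row in weights for w in w_row)
```

Because the weights only rescale the per-pixel loss, a scheme like this leaves the rasterizer untouched, in line with the claim that the rendering pipeline is unchanged.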

2606.07938 2026-06-09 cs.CV cs.MM eess.IV 新提交

DAL-PCQA: Enabling Distortion-Level and Language-Driven Reasoning for Point Cloud Quality Assessment

DAL-PCQA:实现点云质量评估的失真级别与语言驱动推理

Swarna Chakraborty, Gabriel De Castro Araújo, Syeda Tasmi Faria, Marcelo M. Carvalho, Mylene C. Q. Farias

发表机构 * University of Brasília(巴西利亚大学)

AI总结 提出DAL-PCQA数据集,通过多级失真标签、质量类别和自然语言描述,结合零样本与微调多模态模型,实现可解释的点云质量评估。

Comments Accepted at Qomex 2026

详情
AI中文摘要

点云质量评估(PCQA)方法通常预测标量平均意见分数(MOS),量化整体感知退化但不揭示其原因。相比之下,人类观察者自然地以特定失真(如模糊、颜色偏移、点密度变化、缺失区域和几何变形)进行推理。为弥合这一差距,我们引入了DAL-PCQA,一个用于PCQA的失真感知、语言标注数据集。DAL-PCQA用多级失真严重性标签、离散质量类别和与人类感知对齐的结构化自然语言描述增强了基准点云。我们定义了一个涵盖光度学和几何学伪影的点云特定失真分类法。统计分析揭示了不同失真类型和质量级别的特征退化模式。为评估这些标注的实用性,我们比较了用于生成感知质量描述的零样本和微调多模态模型。实验表明,失真感知监督显著提高了与真实描述的词法和语义对齐。通过实现可解释的失真级别推理,DAL-PCQA促进了语言驱动的、可解释的点云质量评估。该数据集公开于https://github.com/swarna96/DAL-PCQA。

英文摘要

Point Cloud Quality Assessment (PCQA) methods typically predict scalar Mean Opinion Scores (MOS), which quantify overall perceptual degradation but do not reveal its causes. In contrast, human observers naturally reason in terms of specific distortions such as blur, color shifts, point density changes, missing regions, and geometric deformations. To close this gap, we introduce DAL-PCQA, a distortion-aware, language-annotated dataset for PCQA. DAL-PCQA augments benchmark point clouds with multi-level distortion severity labels, discrete quality categories, and structured natural language descriptions aligned with human perception. We define a point-cloud-specific distortion taxonomy that covers both photometric and geometric artifacts. Statistical analysis reveals characteristic degradation patterns across distortion types and quality levels. To assess the utility of these annotations, we compare zero-shot and fine-tuned multimodal models for generating perceptual quality descriptions. Experiments show that distortion-aware supervision substantially improves lexical and semantic alignment with ground-truth descriptions. By enabling interpretable, distortion-level reasoning, DAL-PCQA facilitates language-driven, explainable point cloud quality assessment. The dataset is publicly available at https://github.com/swarna96/DAL-PCQA.

2606.08014 2026-06-09 cs.CV cs.AI 新提交

GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence

GVC-Seg: 基于几何视觉对应的免训练3D实例分割

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng

发表机构 * Victoria University of Wellington(惠灵顿维多利亚大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Southern University of Science and Technology(南方科技大学)

AI总结 提出GVC-Seg,一种免训练的3D实例分割方法,通过几何与视觉特征对应消除多模型集成中的置信度偏差,在多个基准上达到最优性能。

Comments 10 pages, 5 figures

详情
AI中文摘要

点云数据中的精确3D实例分割对于机器视觉应用至关重要。最近的研究利用多个预训练基础模型生成3D提案,然后应用提案聚合方法,显著提升了性能。然而,由于不同分割模型之间置信度水平的固有差异,它们通常会产生次优结果,导致偏向于置信度更高的模型。这种偏差本质上是模型依赖的,并受到数据预处理技术和训练策略等因素的影响。为了解决这一偏差,我们提出了一种新颖的、免训练的3D实例分割方法,通过几何视觉对应(GVC-Seg)来利用3D几何线索与2D视觉线索之间的对应关系,以减轻置信度偏差。此外,在实例掩码生成和实例语义推理过程中,分别引入了3D提案生成模块和掩码感知的CLIP特征提取模块。通过这种方式,GVC-Seg增强了提案质量评估,确保了不同模型之间的无偏集成学习。大量实验表明,我们的方法在多个具有挑战性的基准上达到了最先进的性能,同时在开放词汇语义分割设置中也展现出强大的潜力。

英文摘要

Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.

2606.08205 2026-06-09 cs.CV 新提交

Empowering Feed-Forward Reconstruction Models with Metric Scale via Satellite Images

利用卫星图像赋予前馈重建模型度量尺度

Xianghui Ze, Yongjian Luo, Mengjun Chao, Zhenbo Song, Jianfeng Lu, Yujiao Shi

发表机构 * Nanjing University of Science and Technology(南京理工大学) ShanghaiTech University(上海科技大学)

AI总结 提出卫星引导框架,通过双向交叉视图交互利用卫星图像作为全局度量参考,解决前馈3D重建中的尺度模糊问题,实现度量深度估计、点云重建和相机定位。

详情
AI中文摘要

前馈3D重建模型最近在多样场景中展现出强大的泛化能力,但大多数模型仅能恢复未知全局尺度下的几何结构。这种尺度模糊限制了它们在需要环境度量理解的应用中的使用。现有的度量重建方法通常依赖于大规模度量标注或精确的相机标定,这在许多实际场景中成本高昂或不可靠。我们提出了一种卫星引导框架,用于解决前馈3D重建中的尺度模糊问题。关键思想是利用现成的卫星图像作为全局度量参考。给定粗略的相机姿态,我们的方法检索局部卫星图像块,并通过双向交叉视图交互将其与前馈重建主干集成。通过强制重建场景与卫星参考之间的一致性,模型推断绝对尺度、细化场景几何并在度量坐标系中估计相机姿态。在KITTI、nuScenes和Oxford RobotCar上的实验表明,该方法在度量深度估计、多视角点云重建和跨视角相机定位方面取得了一致改进,同时保持了跨数据集和地理区域的强泛化能力。

英文摘要

Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet most of them recover geometry only up to an unknown global scale. This scale ambiguity limits their use in applications that require metric understanding of the environment. Existing metric reconstruction methods commonly rely on large-scale metric annotations or accurate camera calibration, both of which are costly or unreliable in many real-world settings. We propose a satellite-guided framework for resolving scale ambiguity in feed-forward 3D reconstruction. The key idea is to use readily available satellite imagery as a global metric reference. Given a coarse camera pose, our method retrieves a local satellite patch and integrates it with a feed-forward reconstruction backbone through bidirectional cross-view interaction. By enforcing consistency between the reconstructed scene and the satellite reference, the model infers absolute scale, refines scene geometry, and estimates camera pose in a metric coordinate frame. Experiments on KITTI, nuScenes, and Oxford RobotCar show consistent improvements in metric depth estimation, multi-view point-cloud reconstruction, and cross-view camera localization, while preserving strong generalization across datasets and geographic regions.
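
To make "up to an unknown global scale" concrete: if a few metric distances were known, a single scalar would map the reconstruction into metres. The sketch below fits that scalar by least squares. It illustrates the ambiguity only and is not the paper's method, which infers scale through learned cross-view interaction with a satellite patch; the function name and inputs are hypothetical.

```python
def least_squares_scale(pred, metric):
    """Scalar s minimising sum((s * p - m)^2) over paired distances.

    pred:   up-to-scale distances from a reconstruction (hypothetical input)
    metric: the same distances in metres from some reference
    """
    num = sum(p * m for p, m in zip(pred, metric))
    den = sum(p * p for p in pred)
    return num / den
```

For example, predictions `[1, 2, 4]` against metric references `[2.5, 5, 10]` give a scale of 2.5.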

2606.08284 2026-06-09 cs.CV cs.RO 新提交

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

G2G:利用组内几何进行组间姿态估计

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

发表机构 * State Key Laboratory of Industrial Control and Technology, Zhejiang University(浙江大学工业控制技术国家重点实验室) Zhejiang Humanoid Robot Innovation Center Co., Ltd.(浙江人形机器人创新中心有限公司) School of Information Science and Engineering, Hangzhou Normal University(杭州师范大学信息科学与工程学院)

AI总结 提出G2G方法,通过冻结多视图基础模型并添加三个轻量可训练模块(感知器重采样器、跨组桥接模块和多帧姿态头),仅利用相对姿态监督实现组间6-DoF姿态估计,在四个数据集上达到SOTA。

详情
AI中文摘要

恢复两个图像组之间的相对6-DoF姿态是跨序列重定位和多相机阵列里程计的基础。每个组通过视觉里程计或相机阵列标定携带已知的组内几何,预训练的多视图骨干网络已经将这种几何融合到视觉特征中。然而,当前模型将所有视图视为非结构化集合,缺少跨组推理这一关键环节。我们提出G2G,该方法保持基础模型完全冻结,并添加三个轻量可训练模块来桥接两个组:感知器重采样器、带有合并自注意力的跨组桥接模块以及多帧姿态头。可训练部分总计约32M参数,不到完整模型的6%,且仅由相对姿态监督。在四个数据集(涵盖室内外仿真、真实世界跨季节采集以及零样本仿真到真实迁移)上,G2G在两个任务上都达到了最先进的精度,而每个基线都使用其完整的原始监督进行重新训练。代码可在https://github.com/WeiYuFei0217/G2G获取。

英文摘要

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

2606.08957 2026-06-09 cs.CV 新提交

Rethinking 3D Shape Generation: Diffusion over Superquadrics

重新思考3D形状生成:超二次曲面上的扩散

Zhiyang Liu, Wanze Li, Yuwei Wu, Chengran Yuan, Jiawei Sun, Rui Zheng, Marcelo H Ang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出将扩散模型从高基数几何表示转移到紧凑的超二次曲面参数上,以降低计算和内存成本,并支持无分辨率点云解码、部件级编辑和约束设计,实现高效生成。

Comments Accepted to ICML2026

详情
AI中文摘要

扩散模型推动了3D形状生成的发展,但大多数方法仍然在高基数空间(如体素/SDF网格、网格模型或点云)中进行去噪,计算和内存开销大,难以在更高分辨率和更强可控性方面扩展。我们重新思考扩散表示,提出将扩散从密集几何转移到紧凑的几何基元,将每个形状表示为少量超二次曲面。我们不操作数千到数百万个几何表示值,而是利用7KB的超二次曲面参数(姿态、大小和形状),大幅降低扩散状态维度和每步计算/内存。我们的超二次曲面扩散通过支持更广泛的能力(如无分辨率点云解码、部件级编辑和基于约束的设计)提高了可扩展性,并在点云解码后在标准基准上实现了具有竞争力的表面保真度和分布性能,同时在大多数条件下每个形状的生成时间在0.6秒以内。

英文摘要

Diffusion models have advanced 3D shape generation, yet most methods still denoise in high-cardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set of superquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.
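
For readers unfamiliar with the primitive, a superquadric in its canonical frame is fully described by three sizes and two shape exponents through the standard inside-outside function. A minimal sketch follows; pose is omitted, and the function and argument names are illustrative rather than taken from the paper:

```python
def superquadric_inside_outside(point, size, shape):
    """Standard superquadric inside-outside function in the canonical frame.

    Returns F < 1 inside, F == 1 on the surface, F > 1 outside.
    size  = (a1, a2, a3) semi-axis lengths
    shape = (e1, e2) exponents; (1, 1) is an ellipsoid, small values are box-like
    """
    x, y, z = (abs(c) for c in point)
    a1, a2, a3 = size
    e1, e2 = shape
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)
```

Five numbers plus a rigid pose per primitive is what keeps the diffusion state small compared with voxel grids or dense point clouds, and the implicit form means surface points can be sampled at any density after generation.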

2606.09034 2026-06-09 cs.CV 新提交

Leveraging NeRF-Rendered Images for 3D Gaussian Splatting

利用NeRF渲染图像进行3D高斯泼溅

Mizuki Morikawa, Yuta Shimizu, Chunyu Li, Yusuke Monno, Masatoshi Okutomi

AI总结 提出利用NeRF渲染图像辅助3DGS训练,通过去除瞬态物体和生成鸟瞰视图,结合扩散增强,在保持3DGS速度的同时提升街景渲染质量。

Comments ICIP 2026

详情
AI中文摘要

神经辐射场(NeRF)和3D高斯泼溅(3DGS)是两种主流的新视角合成方法。它们通常表现出互补的性能,即3DGS渲染速度更快,而NeRF渲染质量更高。受此启发,我们提出利用NeRF渲染的图像来辅助3DGS。具体来说,我们针对街景场景,利用预训练的街景专用NeRF方法为目标3DGS方法生成训练图像。在我们的3DGS训练中,NeRF渲染的图像用于去除街景输入视图中的瞬态物体,并生成鸟瞰视图作为额外视图,从而将NeRF的高质量渲染继承到3DGS中。我们进一步引入基于扩散的图像增强,以提高额外视图的图像质量。在一个人工合成数据集和两个真实数据集上的实验结果表明,我们提出的方法在保持3DGS速度和NeRF质量的同时,改善了街景渲染效果。

英文摘要

Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view synthesis. They often show complementary performance: 3DGS renders faster, while NeRF renders at higher quality. Motivated by this, we propose leveraging NeRF-rendered images for 3DGS. Specifically, we target street scenes and utilize a pre-trained street-specific NeRF method to produce training images for a target 3DGS method. In our 3DGS training, NeRF-rendered images are used to remove transient objects in street-level input views and to generate bird's-eye views as additional views, so that 3DGS inherits the higher-quality rendering of NeRF. We further incorporate a diffusion-based image enhancement to improve the image quality of the additional views. Experimental results on one synthetic and two real datasets demonstrate that our proposed method improves street-scene rendering while preserving the speed of 3DGS and the quality of NeRF.

2606.09074 2026-06-09 cs.CV 新提交

REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance

REFINE: 通过无渲染的基元重要性实现超高效的3D高斯泼溅剪枝

Zhang Chen, Shuai Wan, Mengting Yu, Fuzheng Yang, Junhui Hou

发表机构 * Northwestern Polytechnical University(西北工业大学) Xidian University(西安电子科技大学) City University of Hong Kong(香港城市大学)

AI总结 提出REFINE框架,利用无渲染的基元重要性度量(基于解析近似的Hessian场)实现3D高斯泼溅的高效剪枝,在保持渲染质量的同时将剪枝计算复杂度降低3000倍。

详情
AI中文摘要

现有的3D高斯泼溅(3DGS)剪枝方法要么导致严重的质量下降,要么带来过高的计算开销。本文提出REFINE,一个高度加速的3DGS剪枝框架,其核心是一种新颖的无渲染基元重要性度量。我们的方法利用解析近似、渲染感知的Hessian场来量化移除单个基元所导致的预期感知误差。通过建模可见性、投影几何和内容自适应超参数的联合调制,我们完全绕过了昂贵的正向渲染过程,推导出一个各向异性的感知权重场,作为基元重要性的高保真代理。在多个基准数据集上的大量实验表明,REFINE在保持极具竞争力的渲染质量的同时,与最先进的剪枝方法相比,实现了前所未有的3000倍剪枝相关计算复杂度降低。

英文摘要

Existing pruning methods for 3D Gaussian splatting (3DGS) suffer from either severe quality degradation or prohibitive computational overhead. In this paper, we propose REFINE, a highly accelerated 3DGS pruning framework centered on a novel rendering-free primitive importance metric. Our approach leverages an analytically approximated, rendering-aware Hessian field to quantify the expected perceptual error induced by the removal of individual primitives. By modeling the joint modulation of visibility, projection geometry, and the content-adaptive hyperparameter, we entirely bypass costly forward rendering passes and derive an anisotropic perceptual weight field that serves as a high-fidelity proxy for primitive importance. Extensive experiments across multiple benchmark datasets demonstrate that REFINE maintains highly competitive rendering quality while achieving an unprecedented 3,000× reduction in pruning-related computational complexity compared to state-of-the-art pruning methods.

2606.09123 2026-06-09 cs.CV cs.AI New submission

An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification

一种增强的几何-光谱特征学习框架用于机载多光谱点云分类

Xian Li, Yanfeng Gu, Aleksandra Pižurica

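AI summary To address the high-dimensional heterogeneous features, unbalanced samples, and inter-class spectral similarity of airborne multispectral point clouds, proposes an attention-based two-stream feature fusion framework that combines a residual attention fusion block with a joint loss function for accurate land-cover classification.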

详情
AI中文摘要

多光谱点云由三维空间-光谱信息组成,对于精确的土地覆盖分类具有巨大潜力。然而,分类模型的表示能力受到机载多光谱点云固有的高维异构空间-光谱信息、不平衡的样本分布和类间光谱相似性的限制。我们构建了两个多光谱点云数据集,并提出了一种基于注意力的增强几何-光谱特征学习框架用于机载多光谱点云分类。我们模型的一个关键组件是一种带有注意力机制的双流特征融合方法,该方法增强了来自高维异构多光谱点云的空间-光谱特征的表示能力。第一流旨在提取带有融合自注意力的位置编码全局光谱特征,第二流包括多核点卷积和特征聚合注意力以提取光谱引导的几何特征。然后,我们开发了一个残差注意力融合块,以整合来自两个并行流的最具信息量的几何-光谱特征。这项工作的另一个重要贡献是一个联合损失函数,以提高对不平衡和类间相似样本的学习能力。在两个机载多光谱点云数据集上的实验结果表明,与最先进的方法相比,所提方法具有有效性。此外,本文使用的代码和数据集将在https://github.com/HITlixian/TGRS_GSFF免费提供。

英文摘要

Multispectral point clouds (MPCs) carry 3D spatial-spectral information and hold great potential for accurate land-cover classification. However, the representation power of classification models is limited by the high-dimensional, heterogeneous spatial-spectral information, unbalanced sample distribution, and inter-class spectral similarity inherent to airborne MPCs. We build two MPC datasets and propose an enhanced attention-based geometric-spectral feature learning framework for airborne MPC classification. A key component of our model is a two-stream feature fusion method with attention mechanisms, which enhances the representation capability of spatial-spectral features from high-dimensional heterogeneous MPCs. The first stream extracts position-encoded global spectral features with fusion self-attention, and the second stream combines multikernel point convolution with feature aggregation attention to extract spectral-guided geometric features. We then develop a residual attention fusion block to integrate the most informative geometric-spectral features from the two parallel streams. Another important contribution of this work is a joint loss function that improves learning on unbalanced and inter-class similar samples. Experimental results on two airborne MPC datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art methods. The code and datasets used in this paper will be made freely available at https://github.com/HITlixian/TGRS_GSFF.

2606.09139 2026-06-09 cs.CV New submission

A Geometric Framework for Absolute Pose and Velocity Estimation with Event Cameras

事件相机的绝对位姿与速度估计的几何框架

Zibin Liu, Shunkun Liang, Banglei Guan, Yang Shang, Qifeng Yu, Ji Zhao

Affiliations National University of Defense Technology; independent researcher

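AI summary Proposes estimating both the absolute pose and the velocity of an event camera from geometric constraints between 3D lines and the events they trigger, using linear and polynomial solvers that need as few as three correspondences and outperform existing methods in accuracy and efficiency.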

详情
AI中文摘要

尽管基于事件的运动估计取得了快速进展,当前的几何方法主要关注速度估计。然而,对于机器人导航和增强现实等关键应用同样至关重要的绝对位姿估计仍相对未被充分探索。因此,从事件流中同时恢复绝对位姿和速度仍然是一个开放且具有挑战性的问题。为弥补这一空白,我们提出了一种几何框架,通过利用场景中的3D直线及其触发的事件来估计绝对位姿和速度。该框架的核心是两个关键几何约束:3D直线与其对应事件平面的法向量之间的正交性,以及事件与其关联直线的2D投影之间的共线性。基于这些约束,我们提出了用于绝对位姿估计的线性求解器和多项式求解器。前者能够高效计算,而后者为旋转提供了全局最优解。对于速度估计,我们开发了一个高效的线性求解器和一个更精确的基于优化的求解器,以恢复角速度和线速度。值得注意的是,我们的方法最少需要三个事件-直线对应关系即可独立确定6自由度绝对位姿或速度。在仿真和真实世界数据集上的大量实验表明,我们的方法达到了最先进的性能,与现有方法相比,在精度和计算效率上都有显著提升。演示代码公开于 https://github.com/Zibin6/EventPoseVelocity。

英文摘要

Despite the rapid advancements in event-based motion estimation, current geometric methods primarily focus on velocity estimation. However, absolute pose estimation, which is equally crucial for key applications such as robotic navigation and augmented reality, remains relatively underexplored. Consequently, the simultaneous recovery of absolute pose and velocity from event streams remains an open and challenging problem. To address this gap, we propose a geometric framework for absolute pose and velocity estimation by leveraging 3D lines in the scene and the events they trigger. At the core of the framework lie two key geometric constraints: the orthogonality between a 3D line and the normal vector of its corresponding event plane, and the collinearity of an event with the 2D projection of its associated line. Based on these constraints, we present both linear and polynomial solvers for absolute pose estimation. The former enables efficient computation, while the latter provides a globally optimal solution for rotation. For velocity estimation, we develop an efficient linear solver and a more accurate optimization-based solver to recover both angular and linear velocities. Notably, our methods require a minimum of three event-line correspondences to determine the 6-DoF absolute pose or velocities independently. Extensive experiments in simulation and on real-world datasets demonstrate that our methods achieve state-of-the-art performance, with significant improvements in accuracy and computational efficiency compared to existing methods. The demo code is publicly available at https://github.com/Zibin6/EventPoseVelocity.
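
The two constraints named in the abstract can be illustrated with a small numerical check. This is a minimal sketch, not the authors' solver: it assumes the "event plane" is the plane through the camera center and the 3D line at a single instant, ignores camera motion during the event window, and uses a hypothetical pose (`R`, `t`) and line (`P`, `d`) chosen only for illustration.

```python
import math


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    return [dot(row, v) for row in M]


def rot_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Hypothetical world-to-camera pose: X_cam = R @ X_world + t
R = rot_z(0.3)
t = [0.1, -0.2, 0.5]

# Hypothetical 3D line in the world frame: a point P and a direction d
P = [0.5, 0.2, 4.0]
d = [1.0, 0.5, 0.2]


def to_camera(X):
    return [a + b for a, b in zip(matvec(R, X), t)]


def point_on_line(s):
    return [p + s * di for p, di in zip(P, d)]


# Normal of the plane spanned by the camera center and the line (camera frame)
A = to_camera(point_on_line(0.0))
B = to_camera(point_on_line(2.0))
n = cross(A, B)

# A synthetic "event": the normalized image projection of a point on the line
Q = to_camera(point_on_line(0.7))
event = [Q[0] / Q[2], Q[1] / Q[2], 1.0]

# Constraint 1: the rotated line direction is orthogonal to the plane normal
orthogonality = dot(n, matvec(R, d))
# Constraint 2: the event lies on the 2D projection of the line
collinearity = dot(n, event)
```

Both residuals vanish up to floating-point error for the true pose, which is what makes them usable as equations in the unknown rotation and translation.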

2606.09218 2026-06-09 cs.CV New submission

Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM

全自由度运动估计的最小求解器:基于异步差分SfM

Shuo Pan, Banglei Guan, Bin Li, Zhenbao Yu, Zibin Liu, Zi Wang, Yang Shang, Qifeng Yu

Affiliations College of Aerospace Science and Engineering, National University of Defense Technology; The Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation

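AI summary Proposes estimating full-DoF egomotion directly from asynchronous optical flow by decoupling the differential epipolar constraint, jointly recovering angular and linear velocity from at least five points, and introduces the first algebraic minimal 5-point solver for this formulation along with an accelerated variant.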

详情
AI中文摘要

作为一种仿生智能传感器,事件相机以其高时间分辨率、低延迟和低功耗为特点,为时空信息的智能感知和视觉运动估计引入了新范式。然而,其异步数据流对传统的同步帧算法提出了重大挑战。为了解决这些挑战,本文提出了一种新颖的框架,直接从异步光流进行全自由度(DoF)自运动估计,特别针对角速度和线速度的联合恢复。我们将差分对极约束解耦为不同的角速度和线速度分量,并推导出其异步数据的公式。基于该公式,开发了一种优化算法,利用至少五个点实现全自由度自运动估计。此外,通过对旋转动力学应用一阶近似,我们将约束方程转化为多项式形式,从而得到了该公式的第一个代数最小5点求解器。为了确保高速场景下的实时性能,我们还提出了一种通过截断高阶角速度项实现的加速求解器。在合成和真实数据集上的广泛评估表明,异步方法优于传统的同步方法,特别是在对时空噪声的准确性和鲁棒性方面。我们相信,这项工作为高速机器人应用中高效且准确的连续时间运动估计奠定了关键基础。

英文摘要

Event cameras are bio-inspired intelligent sensors with high temporal resolution, low latency, and minimal power consumption, and they have introduced a new paradigm for the perception of spatiotemporal information and for visual motion estimation. However, their asynchronous data streams present significant challenges to traditional synchronous, frame-based algorithms. To address these challenges, this paper presents a novel framework for full degree of freedom (DoF) egomotion estimation directly from asynchronous optical flow, specifically targeting the joint recovery of angular and linear velocities. We decouple the differential epipolar constraint into distinct angular and linear velocity components, and derive its formulation for asynchronous data. Based on this formulation, we develop an optimization algorithm that enables full-DoF egomotion estimation from at least five points. Furthermore, by applying a first-order approximation to rotational dynamics, we transform the constraint equations into a polynomial form, resulting in the first algebraic minimal 5-point solver for this formulation. To ensure real-time performance in high-speed scenarios, we additionally propose an accelerated solver obtained by truncating high-order angular velocity terms. Extensive evaluations on both synthetic and real-world datasets demonstrate that the asynchronous approach outperforms traditional synchronous methods, particularly in accuracy and in robustness to spatiotemporal noise. We believe that this work establishes a critical foundation for efficient and accurate continuous-time motion estimation in high-speed robotics applications.
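
For readers unfamiliar with the differential epipolar constraint, the sketch below checks its classical synchronous form numerically. It assumes the common convention that a point moves in the camera frame as `X_dot = omega x X + v`; the asynchronous, decoupled formulation of the paper is not reproduced here, and all numeric values are illustrative.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def differential_epipolar_residual(x, x_dot, v, omega):
    """Classical constraint: x_dot^T [v]x x + x^T [omega]x [v]x x = 0."""
    v_cross_x = cross(v, x)
    return dot(x_dot, v_cross_x) + dot(x, cross(omega, v_cross_x))


# Illustrative angular velocity, linear velocity, and 3D point (camera frame)
omega = [0.02, -0.01, 0.03]
v = [0.5, 0.1, -0.2]
X = [1.0, -0.5, 4.0]

# Point velocity under the assumed rigid-motion model
X_dot = [c + vi for c, vi in zip(cross(omega, X), v)]

# Normalized image point x = X / Z and its optical flow d/dt (X / Z)
Z, Z_dot = X[2], X_dot[2]
x = [X[0] / Z, X[1] / Z, 1.0]
x_dot = [X_dot[i] / Z - X[i] * Z_dot / Z ** 2 for i in range(3)]

true_residual = differential_epipolar_residual(x, x_dot, v, omega)
wrong_residual = differential_epipolar_residual(x, x_dot, [0.0, 1.0, 0.0], omega)
```

The residual is zero for the true velocities and nonzero for a wrong linear velocity. The constraint is homogeneous in `v`, so one point gives one equation on five unknowns (three for `omega`, two for the direction of `v`), which is consistent with a minimal solver needing five points.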

2606.09246 2026-06-09 cs.CV New submission

SOMA: From Surface Observations to Muscle Anatomy

SOMA:从表面观察到肌肉解剖

Eduardo Alvarado, Emily Kim, Gerrit Nolte, Friedemann Runte, Mario Botsch, Marc Habermann, Christian Theobalt

Affiliations Max Planck Institute for Informatics, Saarland Informatics Campus; TU Dortmund University

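AI summary Proposes SOMA, a model that infers spatio-temporal muscle behavior from surface signals captured by multi-view RGB cameras, and introduces the SKIM dataset; presented as the first attempt to recover muscle deformations from multi-view RGB data, it offers a scalable, low-cost route to anatomically grounded animation.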

详情
AI中文摘要

随着对逼真虚拟人类的需求日益增长,参数化人体模型已成为现代医学、体育和娱乐应用的基石。然而,大多数这些模型固有地存在局限性:它们仅捕捉皮肤的3D表面,无法洞察产生运动的复杂生物力学结构。随着更多应用向生物力学扩展,对超越皮肤的虚拟人类模型的需求日益明显。传统的软组织模拟(如FEM)准确但不可扩展,且对于大多数常见应用而言计算成本过高。或者,现有的生物力学工具可以模拟肌肉力和激活,但不模拟外部形状的变化,限制了激活与实际可观察解剖结构之间的相关性。这激发了一个新的逆向研究问题:直接从可见的表面观测(即从皮肤,从而从姿态)恢复肌肉变形。在这项工作中,我们提出了SOMA(从表面观察到肌肉解剖),一个从使用RGB相机获得的表面信号推断时空肌肉行为的个体特定模型,以及SKIM,一个个体特定的软组织变形数据集。据我们所知,这是首次尝试从多视角RGB数据恢复肌肉变形的方法。我们展示了我们的方法如何提供解剖学基础的动画,而无需传统模拟的复杂性,从而提供可扩展且成本效益高的解决方案。数据和代码已公开。

英文摘要

With the growing demand for realistic virtual humans, parametric body models have become a cornerstone of modern medicine, sports, and entertainment applications. However, most of these models are inherently limited: they only capture the 3D surface of the skin, offering no insight into the complex biomechanical structures that generate motion. As more applications expand towards biomechanics, the need for virtual human models that go beyond the skin has become increasingly evident. Traditional soft-tissue simulations, such as FEM, are accurate but non-scalable and too computationally expensive for most common applications. Alternatively, existing biomechanical tools can simulate muscular forces and activations, but do not model changes in external shape, restricting how activations correlate with actual observable anatomy. This motivates a novel inverse research problem: recovering muscle deformations directly from visible surface observations (i.e., from the skin, and thus the pose). In this work, we present SOMA (from Surface Observations to Muscle Anatomy), a person-specific model that infers spatio-temporal muscle behavior from surface signals obtained using RGB cameras, and SKIM, a subject-specific soft-tissue deformation dataset. To the best of our knowledge, this is the first method that attempts to recover muscle deformations from multi-view RGB data. We show how our method provides anatomically grounded animations without the complexity of traditional simulations, leading to a scalable and cost-effective solution. Data and code are available.

2606.09294 2026-06-09 cs.CV New submission

Virtual-point-based Solutions to Handle Generalized Absolute Pose Problem

基于虚拟点的广义绝对位姿问题求解方法

Bin Li, Banglei Guan, Shunkun Liang, Yang Shang

Affiliations National University of Defense Technology; Hunan Institute of Advanced Technology

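AI summary For the generalized PnP problem in multi-camera systems, proposes a virtual-point formulation that turns standard PnP solvers into generalized pose solvers, and derives three solvers based on Cayley, quaternion, and rotation-matrix parameterizations that improve on existing methods in accuracy, global optimality, and efficiency.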

详情
AI中文摘要

多相机系统因其宽视场、灵活性和容错性在机器人和自主导航中日益普及。然而,现有的PnP求解器无法处理多个投影中心。本文引入一种虚拟点公式化方法,桥接了标准PnP与广义位姿问题,实现了将现有PnP求解器转化为广义位姿求解器的统一流程。基于该框架,我们推导了三种基于虚拟点的广义位姿求解器,即VGPc、VGPq和VGPr,分别利用Cayley、四元数和旋转矩阵参数化。大量实验表明,所提出的求解器继承了原始PnP算法的精度和效率,同时显著优于现有的广义求解器。具体而言,VGPc在异方差噪声条件下实现了更高的估计精度,VGPq保持了全局最优性,而VGPr在精度不降低的情况下提供了优越的计算效率。

英文摘要

Multi-camera systems are increasingly adopted in robotics and autonomous navigation for their wide field of view, flexibility, and fault tolerance. Nevertheless, existing PnP solvers fail to handle multiple projection centers. This paper introduces a virtual point formulation that bridges the standard PnP and generalized pose problems, enabling a unified pipeline that transforms existing PnP solvers into generalized pose solvers. Based on this framework, we derive three Virtual-point-based Generalized Pose solvers, namely VGPc, VGPq, and VGPr, leveraging Cayley, quaternion, and rotation-matrix parameterizations, respectively. Extensive experiments demonstrate that the proposed solvers inherit the accuracy and efficiency of original PnP algorithms while significantly outperforming existing generalized solvers. Specifically, VGPc achieves higher estimation accuracy under heteroscedastic noise conditions, VGPq maintains global optimality, whereas VGPr provides superior computational efficiency without accuracy degradation.

2606.09738 2026-06-09 cs.CV New submission

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

HDSL:一种用于结构化3D室内场景生成和基于LLM智能体局部编辑的层次化领域特定语言

Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu, ZhengXiao He, Yu Meng, Xin Yang, Wenyuan Jiang, Zhi Wang

Affiliations SIGS, Tsinghua University; Nankai University; University of Arizona; Zhejiang University; ETH Zurich

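AI summary Proposes HDSL, a language that represents indoor scenes as a tree, and combines LLM-agent generation, multimodal retrieval, and force-directed layout optimization for structured scene generation and localized editing, substantially improving object coverage and editing efficiency.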

详情
AI中文摘要

文本驱动的室内场景生成与编辑需要一种语言模型既能生成又能修改的中间表示。现有的基于LLM的系统通常依赖场景图或全局约束列表,这些表示虽然紧凑但未能充分指定局部几何结构,使得基于指令的编辑难以定位。我们将此问题视为结构化程序生成和局部程序修复,并提出层次化描述性场景语言(HDSL),一种用于结构化3D室内场景的XML/CSS风格领域特定语言。HDSL将房间、区域、对象和支持表面表示为带有局部坐标的树,使得复杂场景更易于递归规划和检索编辑。我们的流程使用LLM智能体生成带有边界验证的HDSL子树,通过多模态资产检索将非虚拟节点具体化,并应用力导向布局优化来修复边界和碰撞错误。对于编辑,层次化检索增强生成(HRAG)检索相关子树,要求LLM仅重写该局部上下文,并通过确定性三路合并将结果合并回去。在我们复现的基准测试中,HDSL在对象覆盖率、文本-场景对齐和生成时间上优于完整的文本到场景基线,同时在几何指标上与最近的仅布局复现方法保持竞争力;对于编辑,HRAG将令牌使用量减少5.22倍,运行时间减少6.19倍,为所有八对编辑生成有效的DSL,并更好地保留无关的场景对象。

英文摘要

Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by $5.22\times$ and runtime by $6.19\times$, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
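
To make the idea of a scene tree with local coordinates concrete, here is a minimal sketch. It does not use HDSL syntax: the `Node` class, the field names, and the example scene are hypothetical, and rotations are ignored so that a world position is just the sum of offsets along the path from the root.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str            # e.g. "room", "region", "object" (hypothetical labels)
    offset: tuple        # position relative to the parent node
    children: list = field(default_factory=list)


def world_positions(node, origin=(0.0, 0.0, 0.0)):
    """Accumulate local offsets down the tree to get world positions."""
    position = tuple(o + d for o, d in zip(origin, node.offset))
    result = {node.name: position}
    for child in node.children:
        result.update(world_positions(child, position))
    return result


def find_subtree(node, name):
    """Depth-first lookup of the subtree rooted at the named node."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find_subtree(child, name)
        if hit is not None:
            return hit
    return None


lamp = Node("lamp", "object", (0.5, 0.25, 0.75))
desk = Node("desk", "object", (1.0, 2.0, 0.0), [lamp])
study = Node("study_area", "region", (3.0, 0.0, 0.0), [desk])
room = Node("bedroom", "room", (0.0, 0.0, 0.0), [study])

before = world_positions(room)

# A localized edit: move only the desk; its child lamp follows automatically.
find_subtree(room, "desk").offset = (1.5, 2.0, 0.0)
after = world_positions(room)
```

Because the lamp is stored relative to the desk, the edit touches one node and leaves every node outside that subtree unchanged, which is the property that makes subtree-level retrieval and rewriting attractive.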

2606.09794 2026-06-09 cs.CV cs.GR New submission

Beyond Spherical Harmonics: Rethinking Appearance Models for Radiance Reconstruction

超越球谐函数:重新思考辐射重建的外观模型

Ewa Miazga, Jorge Condor, Piotr Didyk

Affiliations École Polytechnique Fédérale de Lausanne; Università della Svizzera Italiana

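AI summary Systematically evaluates a range of spherical functions and proposes the Normalized Anisotropic Spherical Gabor function, a compact representation that models high-frequency appearance effects efficiently, giving up to five times better memory efficiency and higher quality in radiance-field reconstruction.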

Comments 19 pages, 11 figures

详情
AI中文摘要

视角相关的外观建模在新视角合成与重建中仍是一个具有挑战性的问题。准确表示复杂的角度效应通常需要大量的内存和计算资源。对于新的基于学习的方法,常见做法是依赖球谐函数(SH)。然而,捕捉镜面反射等高频率现象需要高阶展开,这会增加内存使用和计算成本。因此,大多数方法采用低阶SH,这限制了建模复杂视角相关效应的能力,导致表示过于平滑或漫反射。为解决这些限制,我们系统评估了场景重建中多种球面函数。其中一些函数在本文中首次被引入图形学和计算机视觉领域。基于实验洞察,我们提出了一种新的球面公式——归一化各向异性球面Gabor函数,它能够在保持紧凑表示的同时高效建模和学习高频外观效果。与现有方法相比,我们的函数在重建如闪光等视角相关现象时实现了更高质量,同时内存效率提高五倍,且评估更高效。我们在辐射场重建任务中验证了其性能。

英文摘要

View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from these experiments, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, which enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.
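
The memory argument against high-order SH follows from a simple count: an expansion up to degree L has (L + 1)^2 basis functions per color channel. The sketch below tabulates this for RGB; the function name is illustrative, and the count says nothing about the storage cost of the proposed Gabor function.

```python
def sh_coefficients(degree, channels=3):
    """Number of SH coefficients stored per primitive up to a given degree."""
    return channels * (degree + 1) ** 2


# Coefficient count per primitive for degrees 0 through 5
table = {degree: sh_coefficients(degree) for degree in range(6)}
```

Degree 3, a common choice in 3D Gaussian splatting implementations, already costs 48 values per primitive, and each additional degree adds a growing number of coefficients, which is why most methods stay at low order.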

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM Cross-list

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

Affiliations Southern University of Science and Technology; SpatialTemporal AI

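AI summary Proposes the Conceptual-Adjacent Scene Graph Pruner (CAPruner), which estimates relation importance by combining fuzzy semantic relevance with spatial proximity, selects critical relations in a task-specific context, avoids relation-level annotation, and substantially improves the spatial reasoning of large language models on 3D vision-language tasks.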

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)最近被应用于3D视觉语言(3D-VL)任务,这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系,但在完整图上进行推理会导致高昂的令牌成本和计算效率低下,因此需要剪枝。现有的剪枝方法主要依赖空间邻近性,常常移除任务相关的关系,从而削弱可靠的空间推理。为了解决这些局限性,我们推导出场景图剪枝的一个关键要求:保留与特定3D-VL任务最相关的空间关系。在此洞察指导下,我们提出了概念相邻场景图剪枝器(CAPruner)。CAPruner将模糊语义相关性与空间邻近性相结合,以估计关系的重要性,从而能够在任务特定上下文中选择关键关系。此外,为了避免昂贵的关系级标注,CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明,CAPruner有效保留了空间推理所必需的关系,从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07791 2026-06-09 cs.GR cs.CV cs.IR Cross-list

Frequency-Scale Saliency for Spectral Descriptor Analysis in 3D Shape Retrieval

频率尺度显著性用于三维形状检索中的谱描述符分析

Jianru Shen

Affiliations University of Montana

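AI summary Proposes a frequency-scale saliency framework that uses ablation to quantify the retrieval contribution of each descriptor scale interval; finds that short scales dominate performance while long scales are harmful, and that saliency-weighted retrieval improves mAP by 0.156 on hard categories.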

Comments Accepted at Computer Graphics International (CGI) 2026

详情
AI中文摘要

经典谱描述符如热核签名和波核签名广泛用于非刚性三维形状检索,但其失效模式仍不明确。我们提出一个频率尺度显著性框架,通过消融量化每个描述符尺度区间对检索级别的贡献,从而审计这些描述符。我们引入类谱指纹来表征类别级别的尺度依赖性,并表明描述符在类别对之间的相似性与检索失败显著相关,Spearman相关系数为0.479。在SHREC'11上的实验表明,短尺度主导检索性能而长尺度有害,HKS和WKS表现出不同的尺度依赖性模式,且显著性加权检索在困难类别上将mAP提升了0.156,交叉验证和随机权重控制证实该提升是稳定的,并非由任意重新加权导致。

英文摘要

Classical spectral descriptors such as the Heat Kernel Signature and Wave Kernel Signature are widely used for non-rigid 3D shape retrieval, yet their failure modes remain poorly understood. We present a frequency-scale saliency framework that audits these descriptors by quantifying the retrieval-level contribution of each descriptor scale interval through ablation. We introduce class spectral fingerprints to characterize category-level scale dependence, and show that descriptor similarity between class pairs is substantially correlated with retrieval failure, with a Spearman correlation of 0.479. Experiments on SHREC'11 demonstrate that short scales dominate retrieval performance while long scales are harmful, that HKS and WKS exhibit distinct scale dependence patterns, and that saliency-weighted retrieval improves mAP on hard categories by 0.156, with cross-fold and random-weight controls confirming that the gain is stable and not due to arbitrary reweighting.
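
As background for the scale analysis, the Heat Kernel Signature at vertex x and time scale t is the sum over eigenpairs of exp(-lambda_i t) * phi_i(x)^2. The sketch below computes it with NumPy on a small path graph, using the graph Laplacian as a stand-in for the mesh Laplace-Beltrami operator; the graph, the time scales, and the function name are assumptions for illustration, and the saliency framework itself is not implemented.

```python
import numpy as np


def heat_kernel_signature(laplacian, times):
    """Return an array of shape (num_vertices, num_times)."""
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    # decay[i, k] = exp(-lambda_i * t_k)
    decay = np.exp(-np.outer(eigenvalues, times))
    # Sum over eigenpairs: phi_i(x)^2 weighted by the decay at each scale
    return (eigenvectors ** 2) @ decay


# Path graph with 5 vertices as a toy "shape"
n = 5
adjacency = np.zeros((n, n))
for i in range(n - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency

times = np.array([0.1, 1.0, 10.0])   # short, medium, and long scales
hks = heat_kernel_signature(laplacian, times)
```

Each column of `hks` corresponds to one time scale, so ablating a scale interval in this toy setting amounts to dropping or down-weighting the matching columns before comparing descriptors.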

2606.08041 2026-06-09 cs.GR cs.CV Cross-list

Wispy to Voluminous: Prior-free Multi-view Capture of Strand-level Facial Hair

从稀疏到浓密:无先验的多视角面部毛发级联重建

Jaeseong Lee, Giljoo Nam, Adrian Jarabo, Carlos Aliaga

Affiliations KAIST; Meta Codec Avatar Lab; Meta Reality Labs Research

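AI summary Proposes a pipeline that automatically reconstructs facial hair (beard, eyebrows, etc.) from multi-view images by converting an unstructured 3D Gaussian representation into explicit curve-based strands, resolving geometric ambiguities to achieve high-fidelity strand reconstruction.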

Comments 27 pages, 16 figures, supplementary included

详情
AI中文摘要

面部毛发是个人身份的一个决定性特征,但仍然是数字头像的关键瓶颈。最近的体积方法实现了照片级真实感,但将毛发烘焙到面部几何中,阻碍了可编辑性,并且无法解析稀疏的、发丝状结构。同时,头皮毛发重建方法针对密集的毛发体积,无法适应面部毛发稀疏、空间变化的特性。我们提出了一种管线,从多视角图像自动重建面部毛发——胡须、髭须、睫毛和眉毛,将无结构的3D高斯表示转换为显式的基于曲线的发丝表示。我们分四个阶段解决几何歧义:(i)优化由跟踪头部几何约束的3D高斯,以强制早期光线终止并抑制次表面噪声;(ii)追踪连续发丝,对频繁交叉和极端曲率具有鲁棒性;(iii)将发丝接地到表面,并通过物理动机的先验解决根尖歧义;(iv)通过光度优化下的不透明度驱动密度控制来细化重建。据我们所知,这是第一个从3D高斯表示重建高保真面部毛发发丝的方法。恢复的发丝忠实地保留了面部毛发特征的朝向和稀疏模式,并生成可直接用于下游生产任务的资产,包括面部动画和物理模拟、几何梳理和转移、外观编辑以及基于物理的渲染。

英文摘要

Facial hair is a defining trait of personal identity, yet remains a critical bottleneck for digital avatars. Recent volumetric methods achieve photorealism but bake hair into the underlying face geometry, preventing editability and failing to resolve sparse, strand-like structures. Meanwhile, scalp-hair reconstruction methods target dense hair volumes and do not transfer to the sparse, spatially-varying nature of facial hair. We present a pipeline that automatically reconstructs facial hair -- beard, mustache, lashes, and brows -- from multi-view images, converting an unstructured 3D Gaussian representation into an explicit curve-based strand representation. We resolve geometric ambiguities in four stages: (i) optimizing 3D Gaussians constrained by tracked head geometry to enforce early ray termination and suppress sub-surface noise; (ii) tracing continuous strands robust to frequent crossings and extreme curvature; (iii) grounding strands to the surface and resolving root-tip ambiguity via a physically-motivated prior; and (iv) refining the reconstruction through opacity-driven density control under photometric optimization. To our knowledge, this is the first method to reconstruct high-fidelity facial hair strands from a 3D Gaussian representation. The recovered strands faithfully preserve the orientation and sparsity patterns characteristic of facial hair, and yield assets immediately suitable for downstream production tasks, including facial animation and physical simulation, geometric grooming and transfer, appearance editing, and physics-based rendering.

2606.08043 2026-06-09 cs.GR cs.CV Cross-list

OmniFaceRig: Fully Automatic Inner-Mouth-Aware Face Rigging Across Diverse 3D Character Topologies

OmniFaceRig: 跨多种3D角色拓扑的全自动内口感知面部绑定

Chao Wang, Guangyao Ma, John Doublestein, Junming Chen, Yiming Lin, Zhaoen Su, Xiaomin Luo, Shiyang Cheng, Jie Shen, Doug Roble, Dilin Wang, Yilei Li, Rakesh Ranjan

Affiliations Reality Labs, Meta

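AI summary Proposes OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface mesh into a FACS rig with inner-mouth geometry, supporting human, humanoid, and various animal topologies without manual annotation or user-provided templates.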

详情
AI中文摘要

面部绑定——创建基于FACS的混合形状以及内口几何(牙齿、牙龈和舌头)——仍然是3D角色制作中的主要瓶颈。现有流程仍需要大量设计工作,特别是手动地标标注、每个角色的模板调整和内口放置。我们提出OmniFaceRig,一个全自动端到端管道,将静态表面仅3D角色网格(无预建模口腔)转换为内口感知的FACS绑定,包含多达155个混合形状、程序化拟合的牙齿、牙龈和舌头,以及重新打包的UV/纹理。OmniFaceRig支持多种拓扑——人类、人形、长吻动物(如狗、狼、狐狸)和短吻动物(如猫、熊、兔子、老虎)——无需手动地标、无需用户提供模板、无需每个资产的设置。该管道结合了混合VLM+CV可绑定性检查、多模型面部解析、密集关键点驱动的模板配准、程序化内口构建以及碰撞感知的混合形状迁移。对于非人类角色,OmniFaceRig选择拓扑特定的面部和内口模板,并使用碰撞感知的内口拟合来减少牙齿-面部交叉,而无需用户暴露于类别特定的调整。我们还公开发布了Omni-Bench,一个包含1000个双足3D角色的免费基准数据集,带有FACS面部混合形状和内口几何,涵盖人类、人形、猫、狗和其他动物。实验表明,在筛选后的Omni-Bench输入上,最终绑定成功率很高,分割集成几乎实现了完全的面部检测召回,以及可靠的内口放置和低穿透率。总之,OmniFaceRig为从静态生成的角色到动画就绪的面部绑定提供了一条自动化路径,适用于人类和非人类拓扑。

英文摘要

Facial rigging, i.e., creating FACS-based blendshapes together with inner-mouth geometry (teeth, gums, and tongue), remains a major bottleneck in 3D character production. Existing pipelines still require substantial designer effort, especially for manual landmark annotation, per-character template adjustment, and inner-mouth placement. We present OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface-only 3D character mesh, with no pre-modeled oral cavity, into an inner-mouth-aware FACS rig with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. OmniFaceRig supports diverse topologies (humans, humanoids, long-muzzled animals such as dogs, wolves, and foxes, and short-muzzled animals such as cats, bears, rabbits, and tigers) with no manual landmarks, no user-provided templates, and no per-asset setup. The pipeline combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer. For non-human characters, OmniFaceRig selects topology-specific face and inner-mouth templates and uses collision-aware inner-mouth fitting to reduce teeth-face intersections without exposing users to category-specific tuning. We also publicly release Omni-Bench, a freely available benchmark dataset of 1,000 biped 3D characters with FACS facial blendshapes and inner-mouth geometry, spanning humans, humanoids, cats, dogs, and other animals. Experiments show high final rigging success on screened Omni-Bench inputs, nearly complete face detection recall from the segmentation ensemble, and reliable inner-mouth placement with low penetration. Together, OmniFaceRig provides an automatic path from static generated characters to animation-ready facial rigs across both human and non-human topologies.

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR Cross-list

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱:基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

Affiliations Technical University of Berlin; Fraunhofer FOKUS

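AI summary Studies whether large language models (LLMs) can map 3D scene objects to ontology classes zero-shot, with no training; reaches 90-96% accuracy on a kitchen scene with descriptive object names and shows that semantic cues are the key signal.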

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情
AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典,这些字典脆弱且无法跨资产泛化。我们研究大语言模型(LLM)是否能够自动化通用场景描述(USD)场景的接地步骤,作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景(125个对象)中,LLM在描述性名称下达到90-96%的精确匹配准确率,在缩写名称下达到49-89%,显著优于字典和嵌入基线。在完全不透明名称下,上下文增强提示可恢复高达48%的准确率。特征消融表明,LLM主要利用场景图中的语义线索(兄弟名称和父路径);匿名化这些线索将准确率降至0-6%,而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with the SOMA-HOME ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48% accuracy. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2508.05950 2026-06-09 cs.CV cs.AI Version update

CLONE: A 3DGS-Based Closed-Loop Differentiable Optimization Framework for Single-Image Normal Estimation

CLONE: 基于3DGS的闭环可微优化框架用于单图像法线估计

Yanxing Liang, Yinghui Wang, Wei Li, Tao Yan, Jiaxing Shen

Affiliations School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; School of Data Science, Lingnan University, Hong Kong, China

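AI summary Proposes CLONE, which parameterizes the scene with 3D Gaussian splatting, derives continuous, differentiable normals from covariance eigen-decomposition, and jointly optimizes them with a differentiable illumination model and a one-step deterministic diffusion-inspired refinement network under a unified reprojection objective, yielding geometrically consistent single-image normal estimation without ground-truth normal supervision.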

详情
AI中文摘要

我们提出CLONE,一个基于3DGS的闭环可微优化框架,用于单图像法线估计。核心思想是构建一个“图像-几何-图像”一致性循环,统一并联合约束两种范式的局限性:判别式方法依赖显式监督而缺乏跨域几何约束,生成式方法虽有强生成先验但缺乏稳定的可微优化路径。具体地,我们首先采用3D高斯泼溅显式参数化场景,并通过协方差特征分解导出连续可微的表面法线,为几何建模提供解析梯度路径。然后,我们引入一个带有可学习光调制核的可微光照模型,建立表面法线与图像辐射之间的连续映射,使重投影误差直接监督底层3D几何。此外,为补偿高斯表示在局部细节表达上的不足,我们设计了一个一步确定性扩散启发的精化网络,在保持端到端可微性的同时增强局部几何细节。引入跨域门控融合机制以协调全局几何一致性和局部细节重建。最后,所有组件在统一的重投影目标下联合优化,形成闭环且稳定的梯度传播路径。这使得无需真值法线监督即可有效约束多解空间并改善几何一致性。

英文摘要

We propose CLONE, a 3DGS-based Closed-Loop differentiable Optimization framework for single-image Normal Estimation. The core idea is to construct an "image-geometry-image" consistency loop that unifies and jointly constrains the limitations of both paradigms: the reliance on explicit supervision without cross-domain geometric constraints in discriminative methods, and the absence of stable differentiable optimization pathways in generative methods despite strong generative priors. Specifically, we first employ 3D Gaussian Splatting to explicitly parameterize the scene and derive continuous and differentiable surface normals via covariance eigen-decomposition, providing an analytical gradient pathway for geometric modeling. We then introduce a differentiable illumination model with a learnable light modulation kernel to establish a continuous mapping between surface normals and image radiance, enabling reprojection errors to directly supervise the underlying 3D geometry. Furthermore, to compensate for the limited local detail expressiveness of Gaussian representations, we design a one-step deterministic diffusion-inspired refinement network, which enhances local geometric details while preserving end-to-end differentiability. A cross-domain gating fusion mechanism is introduced to coordinate global geometric consistency and local detail reconstruction. Finally, all components are jointly optimized under a unified reprojection objective, forming a closed-loop and stable gradient propagation pathway. This enables effective constraint of the multi-solution space and improved geometric consistency without requiring ground-truth normal supervision.
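
The step of deriving a normal from a Gaussian's covariance can be sketched as follows. This assumes the common convention that the normal is the eigenvector with the smallest eigenvalue (the axis along which the Gaussian is thinnest); whether CLONE uses exactly this rule is not stated in the abstract, and the rotation and scales below are illustrative.

```python
import numpy as np


def gaussian_normal(covariance):
    """Unit eigenvector of the smallest eigenvalue (sign is ambiguous)."""
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    return eigenvectors[:, 0]


# Build a flat, disc-like Gaussian: covariance = R diag(s^2) R^T
theta = 0.4
c, s = np.cos(theta), np.sin(theta)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, c, -s],
              [0.0, s, c]])          # rotation about the x axis
scales = np.array([1.0, 0.8, 0.05])  # thin along the third local axis
covariance = R @ np.diag(scales ** 2) @ R.T

normal = gaussian_normal(covariance)
expected = R[:, 2]                    # the thin axis in world coordinates
alignment = abs(float(normal @ expected))
```

The eigenvector is defined only up to sign, so pipelines typically flip it to face the camera. Gradients through an eigen-decomposition also become ill-conditioned when two eigenvalues are nearly equal, which is worth keeping in mind for near-spherical Gaussians.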

2603.21511 2026-06-09 cs.CV Version update

Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

回到点:探索用于零样本3D异常检测的点语言模型

Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan

Affiliations Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan, China; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan, China

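AI summary Proposes BTP, a framework that aligns 3D point-cloud features with text embeddings to improve zero-shot 3D anomaly detection, with experiments reporting strong results on Real3D-AD and Anomaly-ShapeNet.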

Comments Corrected several numerical entries due to a reporting error; the corrected values do not affect the main conclusions

详情
AI中文摘要

零样本(ZS)3D异常检测对于可靠工业检测至关重要,因为它可以在不需目标类别训练数据的情况下检测和定位缺陷。现有方法将3D点云转换为2D图像,并利用预训练的视觉-语言模型(VLMs)进行异常检测。然而,这种策略不可避免地丢弃了几何细节,并对局部异常表现出有限的敏感性。在本文中,我们重新审视内在的3D表示,并探索预训练点语言模型(PLMs)在ZS 3D异常检测中的潜力。我们提出了BTP(Back To Point),一种新的框架,能够有效对齐3D点云和文本嵌入。具体而言,BTP将多粒度补丁特征与文本表示对齐,用于局部异常检测,同时结合几何描述符以增强对结构异常的敏感性。此外,我们引入了一种联合表示学习策略,利用辅助点云数据以提高鲁棒性并丰富异常语义。在Real3D-AD和Anomaly-ShapeNet上的大量实验表明,BTP在ZS 3D异常检测中实现了优越的性能。代码将在https://github.com/wistful-8029/BTP-3DAD上提供。

英文摘要

Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.

2604.04554 2026-06-09 cs.CV cs.RO 版本更新

Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

基于关系对极图的鲁棒相对相机姿态估计

Prateeth Rao, Sachit Rao

发表机构 * International Institute of Information Technology(国际信息科技研究所)

AI总结 本文提出基于对极对应图的关系推断方法,用于估计相对相机姿态,通过图操作估计旋转、平移和本质矩阵,提升对密集噪声和大基线变化的鲁棒性。

Comments 21 pages, 11 figures, 11 Tables, Submitted to IJCV

详情
AI中文摘要

视觉同步定位与建图(VSLAM)的关键组成部分是利用匹配的关键点估计相对相机姿态。噪声对应关系给准确估计带来了挑战。经典方法依赖于随机假设采样和迭代估计,而基于学习的方法通常缺乏显式的几何结构。在本文中,我们将相对姿态估计重新表述为对极对应图上的关系推断问题,其中匹配的关键点是节点,相邻的节点通过边连接。修剪、消息传递和池化等图操作用于估计四元数旋转、平移向量和本质矩阵(EM)。损失函数包含(i)与真值(GT)的$\mathcal{L}_2$差异,(ii)估计EM与GT EM之差的Frobenius范数,(iii)奇异值差异,(iv)航向角差异,(v)尺度差异;最小化该损失即可得到图像对之间的相对姿态。匹配采用稠密的无检测器方法LoFTR。在室内和室外基准测试中,相比经典方法和学习引导的方法,该方法对密集噪声和大基线变化表现出更强的鲁棒性,突显了全局关系共识的有效性。

英文摘要

A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
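The geometric quantities named in this abstract can be illustrated with a small pure-Python sketch. It builds an essential matrix E = [t]x R from a quaternion and a translation, evaluates the Frobenius-norm difference used as loss term (ii), and checks the epipolar constraint x2^T E x1 = 0 for a pair of normalized image points. The function names are hypothetical and chosen for illustration; this is not the authors' implementation, and the graph network itself is not modelled here.

```python
import math

def quat_to_rot(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (normalizes q first)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def skew(t):
    """Cross-product (skew-symmetric) matrix [t]x of a 3-vector."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def essential(q, t):
    """Essential matrix E = [t]x R for relative rotation q and translation t."""
    return matmul(skew(t), quat_to_rot(q))

def frobenius_diff(a, b):
    """Frobenius norm of the difference between two 3x3 matrices."""
    return math.sqrt(sum((a[i][j] - b[i][j]) ** 2
                         for i in range(3) for j in range(3)))

def epipolar_residual(E, x1, x2):
    """x2^T E x1 for normalized homogeneous image points; zero for a perfect match."""
    Ex1 = [sum(E[i][k] * x1[k] for k in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))
```

For a noise-free correspondence the residual vanishes up to rounding, and for a unit-length translation the Frobenius norm of E equals the square root of 2, because the two nonzero singular values of an essential matrix are equal.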

2604.16512 2026-06-09 cs.CV cs.CG cs.GR cs.LG cs.NA math.NA 版本更新

Medial Axis Aware Learning of Signed Distance Functions

面向中轴线的符号距离函数学习

Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf

发表机构 * Institute for Numerical Simulation, University of Bonn(数值模拟研究所,波恩大学)

AI总结 本文提出一种新的变分方法,用于计算高精度的全局符号距离函数,通过高阶变分公式考虑梯度的跳跃集,以提高计算精度。

详情
AI中文摘要

我们提出了一种新的变分方法,用于计算给定点云的高精度全局符号距离函数(SDF)。为此,通过高阶变分公式显式考虑SDF梯度的跳跃集,即表面的中轴线,该公式强制在远离此不连续集的方向上沿梯度方向线性增长。Eikonal方程和SDF的零水平集被作为约束条件。为了使该变分问题具有计算可行性,采用了一种相场近似方法,属于Ambrosio-Tortorelli类型。相关的相场函数隐式地描述了中轴线。该方法用于由无向点云表示的表面,使用神经网络近似SDF和相场函数。实验表明,该方法在近场和全局范围内均具有较高的准确性。定量和定性比较表明,所提出的方法具有优势。

英文摘要

We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
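As a minimal illustration of the two facts this abstract relies on, the sketch below uses the exact SDF of a circle (an assumed toy shape, not taken from the paper). Away from the centre the gradient has unit norm, so the eikonal equation holds. At the centre, which is the medial axis of a circle, the gradient is discontinuous and a central difference returns zero. The function names are hypothetical.

```python
import math

def sdf_circle(x, y, r=1.0):
    """Exact signed distance to a circle of radius r centred at the origin."""
    return math.hypot(x, y) - r

def grad_norm(f, x, y, h=1e-5):
    """Norm of the central-difference gradient of f at (x, y)."""
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return math.hypot(gx, gy)

# Eikonal residual | |grad u| - 1 | at a regular point and on the medial axis.
regular = abs(grad_norm(sdf_circle, 0.3, 0.4) - 1.0)
on_axis = abs(grad_norm(sdf_circle, 0.0, 0.0) - 1.0)
```

The residual is close to zero at the regular point and close to one at the centre, which is why a method that enforces the eikonal equation everywhere needs to treat the medial axis separately.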

2604.25781 2026-06-09 cs.CV cs.GR 版本更新

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Sketch2Arti:基于草图的CAD物体关节建模

Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

发表机构 * University of Edinburgh(爱丁堡大学) Tsinghua University(清华大学) University of British Columbia(不列颠哥伦比亚大学)

AI总结 提出Sketch2Arti系统,通过用户从选定视角绘制的2D草图,自动发现CAD模型中的可动部件并预测其运动参数,支持复杂物体的多关节迭代建模,无需类别信息且可推广到多样物体。

Comments Project page: https://arlo-yang.github.io/Sketch2Arti

详情
AI中文摘要

关节建模旨在推断3D物体的可动部件及其运动参数,实现交互式动画、模拟和形状编辑。本文提出Sketch2Arti,首个基于草图的CAD物体关节建模系统。我们的关键观察是,设计师通过轻量级草图(如箭头和笔画)自然地传达关节意图,指示部件应如何移动,但将这些草图转化为关节3D模型仍主要依赖手动操作。Sketch2Arti通过允许用户从选定视角绘制简单2D草图来指定关节,弥合了这一差距。给定CAD模型和用户草图,我们的方法自动发现对应的可动部件并预测其运动参数,支持对复杂物体进行多个关节的迭代建模,并实现精细控制。重要的是,Sketch2Arti以类别无关的方式训练,无需物体类别信息,从而对现有关节数据集之外的多样物体具有强泛化能力。此外,对于缺乏内部结构的壳体模型,Sketch2Arti支持由用户草图引导的可控内部补全,生成与现有几何和预测运动约束一致的合理内部组件。综合实验和用户评估证明了Sketch2Arti的有效性、可控性和泛化性。代码、数据集和原型系统见https://arlo-yang.github.io/Sketch2Arti。

英文摘要

Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti.

2605.01171 2026-06-09 cs.CV cs.LG 版本更新

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

CADFit:基于混合优化的精确网格到CAD程序生成

Ghadi Nehme, Eamon Whalen, Faez Ahmed

发表机构 * Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA(麻省理工学院机械工程系)

AI总结 提出CADFit框架,通过基于几何反馈的增量拟合和验证参数化操作,从网格中恢复复杂可编辑的CAD构造序列,在多个基准上优于现有方法,并显著降低无效比率。

详情
AI中文摘要

尽管最近取得了进展,但从几何输入(如网格或点云)恢复参数化CAD构造序列仍然是设计和制造的关键挑战,因为现有的CAD重建和生成方法主要局限于难以编辑的格式(如网格或Breps)或可编辑的简单草图-拉伸流水线和低复杂度数据集。我们引入了CADFit,一个基于混合优化的CAD重建框架,通过使用几何反馈增量拟合和验证参数化操作,从网格中恢复复杂、可编辑的CAD构造序列。我们的方法的特点是将重建公式化为对结构化CAD程序的IoU驱动优化,并支持丰富的操作集,包括拉伸、旋转、圆角和倒角。在多个CAD基准上的实验表明,CADFit在体积交并比和倒角距离方面优于最先进的网格到CAD方法,同时显著降低了重建CAD程序的无效比率,特别是对于复杂设计。我们进一步提出了一个多模态流水线,通过将基于图像的几何重建与CADFit相结合,实现从图像端到端重建CAD构造序列。通过实现更高复杂度CAD模型的精确重建,CADFit为生成更丰富的数据集和推进未来基于学习的CAD逆向工程方法提供了实用基础。代码可在:https://github.com/ghadinehme/CADFit 获取。

英文摘要

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing: existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
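The idea of IoU-driven incremental fitting can be sketched on voxel sets. The example below is a simplified stand-in, not CADFit itself: shapes are sets of integer voxel coordinates, every candidate operation is an additive box, and a candidate is accepted only if it raises the IoU with the target. All names (`voxel_iou`, `box_voxels`, `pick_best_operation`) are hypothetical.

```python
def voxel_iou(a, b):
    """Volumetric IoU of two voxel sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def box_voxels(x0, y0, z0, x1, y1, z1):
    """Voxels of an axis-aligned box covering [x0, x1) x [y0, y1) x [z0, z1)."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}

def pick_best_operation(target, current, candidates):
    """Greedy step: return the additive candidate that most improves IoU.

    Returns (None, current_iou) when no candidate improves on the current
    shape, which plays the role of the validation step.
    """
    best_name, best_iou = None, voxel_iou(current, target)
    for name, voxels in candidates.items():
        iou = voxel_iou(current | voxels, target)
        if iou > best_iou:
            best_name, best_iou = name, iou
    return best_name, best_iou
```

A real system would evaluate parametric operations such as extrusions and fillets on CAD geometry; the greedy accept-if-better loop is the only part this sketch is meant to convey.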

2605.01320 2026-06-09 cs.CV 版本更新

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

PACE:用于学习LiDAR点云压缩的后因果熵建模

Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

发表机构 * School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China.(信息科学与技术学院,杭州师范大学,杭州,中国) School of Electronic Science and Engineering, Nanjing University, Nanjing, China(电子科学与工程学院,南京大学,南京,中国)

AI总结 PACE通过非因果骨干网络和轻量级预测器提升LiDAR点云压缩效率,在自回归模式下将解码延迟降低90%以上,并取得显著的BD-BR节省。

详情
AI中文摘要

LiDAR点云压缩对自动驾驶系统处理高分辨率传感器数据至关重要。尽管基于八叉树结构的学习型熵建模能获得高压缩增益,但面临两个关键瓶颈:1)因果、多阶段上下文建模导致延迟过高,解码时尤为突出;2)性能-延迟权衡缺乏弹性,使单一模型难以适应变化的约束。这些限制源于上下文聚合骨干与概率预测之间的紧密耦合。为此,我们提出PACE,一种新的框架,将祖先上下文聚合重新表述为非因果骨干,并将因果性限制在轻量级、阶段可扩展的预测器中,消除重复的骨干执行并减少计算开销。预测器支持任意数量的预测阶段,使模型能够无缝适应多样化的性能-延迟权衡,而无需重新加载参数。实验表明,PACE在压缩效率上达到了新的最先进水平,实现显著的BD-BR节省,并在自回归模式下将解码延迟降低超过90%,使其在实际应用中具有吸引力。

英文摘要

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

2606.07419 2026-06-09 cs.CV 版本更新

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

DisPOSE: 投影多随机扩散用于自监督多视图3D人体姿态估计

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

发表机构 * Imperial College London(帝国理工学院) Technical University of Munich(慕尼黑工业大学)

AI总结 提出DisPOSE框架,将多视图人员分配问题建模为多随机张量空间上的生成扩散过程,通过可微Sinkhorn投影和超图卷积解码器实现自监督3D人体姿态估计,在标准数据集和手术室遮挡场景中表现优异。

详情
AI中文摘要

从不同摄像机视角恢复多个个体的3D人体姿态是分析交互行为的基本瓶颈。现有的自监督方法利用3D姿态的合成目录;然而,由于分布偏移,这导致在真实场景中泛化能力差。因此,我们引入了DisPOSE,一个自监督框架,将固有的离散多视图人员分配问题近似为多随机张量空间上的生成扩散过程。通过在去噪过程中采用可微的Sinkhorn投影,模型学会基于2D图像先验引导解决方案走向有效且可行的分配。然后,使用超图卷积解码器对定位个体的完整3D骨架进行回归,该解码器显式建模跨多个视图的关系结构和关节。所提出的方法在标准数据集上优于当前最先进的自监督方法,并在一个包含手术室高度遮挡场景的新基准上展示了强大的性能。我们的基于扩散的定位展示了高标签效率,仅使用10%的伪标签就能保持99%的性能。值得注意的是,在保持可微性的同时解耦分配和根回归组件,使得DisPOSE几乎对不同摄像机布置不敏感。

英文摘要

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
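The Sinkhorn projection mentioned here can be sketched in its simplest two-dimensional form. The example below alternately normalizes the rows and columns of a strictly positive square matrix until it is approximately doubly stochastic. The paper works with polystochastic tensors across several views and a differentiable implementation; this pure-Python matrix version is an assumed simplification for illustration only, and `sinkhorn` is a hypothetical name.

```python
def sinkhorn(matrix, iters=200):
    """Alternate row and column normalization of a positive square matrix.

    Returns a new matrix whose rows and columns each sum to approximately 1,
    i.e. a soft (relaxed) assignment between two sets of the same size.
    """
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    for _ in range(iters):
        # Row normalization: every row sums to 1.
        for i in range(n_rows):
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        # Column normalization: every column sums to 1.
        for j in range(n_cols):
            s = sum(m[i][j] for i in range(n_rows))
            for i in range(n_rows):
                m[i][j] /= s
    return m

# Example: affinities between three detections in view A and three in view B.
soft_assignment = sinkhorn([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
```

Each step is a division by a sum, so the same procedure is differentiable when written in an autodiff framework, which is what allows it to sit inside a denoising loop.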

8. 医学影像与生物视觉 44 篇

2606.07590 2026-06-09 cs.CV cs.AI 新提交

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

SlideCheck: 通过数据集分布引导病理基础模型的自监督预训练

Mingyi He, Xinyi Guo, Xitong Ling, Weiming Chen, Jiawen Li, Lianghui Zhu, Minxi Ouyang, Mingxi Fu, Yizhi Wang, Tian Guan

发表机构 * Beijing University of Chemical Technology(北京化工大学) South China Normal University(华南师范大学) Tsinghua University(清华大学)

AI总结 提出SlideCheck工具,利用冻结病理基础模型的特征,通过双头MLP评分异常和恶性证据,引导自监督预训练数据筛选,实验表明数据分布影响模型下游性能。

Comments 9 pages, 2 figures, 4 tables

详情
AI中文摘要

病理基础模型在大量WSI衍生补丁流上进行预训练,而数据构建过程中的监督通常是切片级别、稀疏或异质的。这种不匹配使得理解和控制哪些生物模式进入预训练数据变得困难。我们提出SlideCheck,一个轻量级的预训练数据引导工具,建立在冻结的病理基础模型补丁特征之上。SlideCheck并非作为独立的补丁诊断模型,而是提供明确的异常和恶性评分,用于组织、过滤和审计病理预训练数据。SlideCheck使用双头MLP分别建模广泛的异常形态和恶性证据。正则化的特征空间评分器为补丁级证据估计提供监督锚点,而评分-注意力一致性将补丁评分与WSI级别的MIL注意力结合,挖掘高置信度伪标签。然后使用相同的评分构建广泛阳性ViT预训练子集,其中如果异常或恶性证据超过阈值,则选择补丁。实验表明,SlideCheck定义的数据分布影响自监督ViT预训练的下游行为,表明生物组成是病理基础模型开发中的重要可控因素。精心策划的子集可以接近全数据性能,表明明确评分的补丁池可能支持更高效和可审计的预训练数据构建。这些发现将SlideCheck定位为数据引导和审计层,用于将大型未分化补丁池转化为可控和可重用的预训练数据集。

英文摘要

Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features. Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels. The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.

2606.07633 2026-06-09 cs.CV cs.AI 新提交

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

AMN:一种用于细胞核分割的具有边界和不确定性建模的自适应多尺度融合网络

Spoorthi M, Suja Palaniswamy

发表机构 * Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI总结 提出AMN双编码器分割框架,融合Swin Transformer和ResNet-50特征金字塔,通过门控机制动态加权,结合多目标损失,在CoNIC基准上平均Dice 0.82,F1 0.68,优于八种基线模型。

详情
AI中文摘要

组织病理学图像中细胞核亚型的准确分类对于下游任务(包括肿瘤分级、免疫浸润量化和预后预测)至关重要。现有方法孤立地依赖卷积或基于Transformer的编码器,限制了它们同时捕捉细粒度局部纹理和长程空间上下文的能力。我们提出了AMN(自适应多尺度细胞核网络),一种双编码器分割框架,联合利用Swin Transformer和ResNet-50特征金字塔,通过学习的逐通道门控机制动态权衡每个编码器在每个尺度的贡献。AMN使用多目标损失进行训练,该损失结合了类别加权焦点损失、具有正像素强调的边界感知损失以及一种新颖的不确定性调制分类项,用于抑制过度自信的错误预测。在涵盖七个细胞核类别的CoNIC基准上评估,AMN实现了平均Dice 0.82和平均F1 0.68,在诊断上具有挑战性的淋巴细胞类别上F1为0.67。AMN优于八种基线模型,包括纯CNN、纯Transformer和最近的混合架构:U-Net、ResU-Net、DeepLabV3+、SegNet、ViT-Small、HmsU-Net、ConvFormer-UNet和BEFUnet。在MoNuSeg上的跨数据集评估证明了无需重新训练的强泛化能力,验证了所学表示的领域鲁棒性。

英文摘要

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.

2606.07635 2026-06-09 cs.CV cs.AI 新提交

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign: 用于MCI分析的动态与结构性神经影像的分层多模态融合

Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)人工智能学院智能科学与工程学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University(深圳大学医学部生物医学工程学院广东省生物医学测量与超声成像重点实验室) Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences(广西壮族自治区人民医院放射科,广西医学科学院) Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital(华中科技大学协和深圳医院(深圳市第六人民医院)) School of Basic Medical Sciences, Shenzhen University(深圳大学基础医学院) Egypt-Japan University of Science and Technology (E-JUST)(埃及日本科技大学) School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School(深圳大学医学部生物医学工程学院,国家地方联合医学超声关键技术工程实验室,广东省生物医学测量与超声成像重点实验室)

AI总结 提出NeuroAlign框架,通过双模态分层对齐和双域分层交互融合fMRI与DTI特征,实现MCI/SCD检测,并设计无梯度归因方法SAM进行特征分析。

详情
AI中文摘要

功能磁共振成像(fMRI)和弥散张量成像(DTI)的多模态神经影像融合为认知障碍分析提供了互补信息,但仍面临异构特征空间和表示不对齐的挑战。我们提出NeuroAlign,一个用于结构化多模态融合的分层框架。它引入了(1)双模态分层对齐(DMHA),该模块建模多尺度动态连接并对齐动态-静态和功能-结构嵌入;以及(2)双域分层交互(DDHI),该模块实现连接级和区域级特征之间的细粒度调制和全局交互。为了支持特征级检查,我们设计了协同激活映射(SAM),一种针对DFC、SFC、ALFF和FA的无梯度、面向标记的归因方法。在GUTCM、ADNI和OASIS数据集上通过五折验证评估,NeuroAlign在MCI/SCD检测中取得了竞争性结果,并展示了初步的跨数据集可迁移性。归因分析揭示了模态特异性和部分一致的脑区模式,为多模态表示分析提供了模型驱动的证据。

英文摘要

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07658 2026-06-09 cs.CV cs.LG 新提交

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

神经外科医生需要看到的:用于脑肿瘤手术中脑移位补偿的超声合成术中MRI

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain(西班牙巴利亚多利德里奥·奥尔特加大学医院神经外科神经血管科) Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain(西班牙巴利亚多利德生物医学成像与计算分析专业组(GEIBAC),巴利亚多利德生物健康研究所(IBioVALL))

AI总结 提出一种端到端流水线,通过融合术前MRI、术中超声生成的合成MRI及锚定该合成图像的可变形配准,生成术前成像空间中的全脑MRI体积,以补偿脑移位,为神经导航提供类似MRI的术中视野更新。

详情
AI中文摘要

最大安全切除是胶质瘤手术的主要目标。硬脑膜打开后,神经导航引导会因脑移位而逐渐退化。术中MRI可以补偿,但需要专用基础设施且很少可用,而术中超声(ioUS)廉价、可重复且与常规工作流程兼容。将ioUS与术前MRI结合的导航系统通常依赖刚性配准;即使是可变形多模态配准也受限于超声散斑对比度、窄视野以及无法表示术前扫描中不存在的结构,最关键的是切除腔和残余肿瘤。我们提出一个端到端流水线,通过合并术前MRI、从ioUS生成的合成MRI以及锚定在该合成图像上的可变形配准,生成术前成像空间中的全脑MRI体积。它集成了一个2.5D残差变换器合成骨干(ResViT-2.5D)和一个两阶段配准,将NiftyReg与合成锚定的SynthMorph阶段耦合,直接对原始扫描仪输入进行操作。在切除后的ReMIND队列上,ResViT-2.5D生成的合成图像在结构、强度和感知指标上与术中T2紧密匹配。在14名受试者的215个专家标志点上,合成锚定配准将平均目标配准误差从6.27毫米降低到5.86毫米,与强大的经典NiftyReg基线(5.85毫米)相当,同时为每个受试者产生微分同胚变形场。贡献不在于配准精度的提高,而在于集成的体积本身,它在超声视野内反映了术中切除后的状态。这为外科医生提供了手术视野的类似MRI的更新,并有可能集成到手术导航工作流程中。

英文摘要

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
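The mean target registration error (TRE) reported above is, in its usual definition, the average Euclidean distance between corresponding landmarks after registration. A minimal sketch follows, assuming landmarks are given as (x, y, z) tuples in millimetres; the function name and the sample coordinates are made up for illustration and are not data from the study.

```python
import math

def mean_tre(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance (same unit as the input) between paired landmarks."""
    if len(fixed_landmarks) != len(warped_landmarks):
        raise ValueError("landmark lists must have the same length")
    distances = [math.dist(p, q)
                 for p, q in zip(fixed_landmarks, warped_landmarks)]
    return sum(distances) / len(distances)

# Hypothetical landmark pairs (mm): one is 5 mm off, the other matches exactly.
fixed = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
warped = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
example_tre = mean_tre(fixed, warped)
```

Because TRE is an average over landmarks, two methods with nearly identical means, such as the 5.86 mm and 5.85 mm figures quoted here, can still differ in where the error is concentrated.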

2606.07674 2026-06-09 cs.CV q-bio.NC 新提交

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

多动性运动障碍同步表型分析:基于常规视频、无标记姿态估计和表格基础模型的跨队列儿科迁移研究

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

发表机构 * Lausanne University Hospital (CHUV)(洛桑大学医院) University of Lausanne (UNIL)(洛桑大学) Institut du Neurone(神经元研究所) Clinique Beau Soleil(博索莱伊诊所) Institut Mutualiste Montpelliérain(蒙彼利埃互助研究所) Military University Hospital of Sfax(斯法克斯军事大学医院) University of Edinburgh(爱丁堡大学) Hospital Sant Joan de Déu(圣琼德迪乌医院) European Reference Network for Rare Neurological Diseases (ERN-RND)(欧洲罕见神经系统疾病参考网络) Instituto de Salud Carlos III(卡洛斯三世健康研究所) CHU Montpellier(蒙彼利埃大学医院) Umeå University(于默奥大学) University Hospital Lausanne(洛桑大学医院) Ecole Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) British Columbia Children’s Hospital(不列颠哥伦比亚儿童医院)

AI总结 提出结合无标记姿态估计、运动学描述符和预训练基础模型的视频框架,在成人数据上训练后迁移至儿科队列,经轻量校准后实现多种多动性运动障碍现象的同时检测。

详情
AI中文摘要

目的:开发并外部测试一个基于视频的框架,用于同时检测多动性运动障碍的各类现象:肌张力障碍、震颤、肌阵挛、舞蹈症、手足徐动症、投掷症、刻板动作和抽动,使用常规临床记录,并明确测试从成人到儿科人群的外部跨队列迁移。方法:在这项概念验证研究中,该框架结合了无标记姿态估计、运动学描述符和预训练基础模型。在21名确诊多动性运动障碍的成人和4名健康对照(按标准化方案评估)上开发了共享预测骨干。外部验证在一个独立队列上进行:一个真实世界的儿科样本(n=12,单基因复合型运动障碍)。对于外部数据集,骨干网络未经重新训练直接部署;轻量校准仅调整最终受试者级别的决策步骤,使用由临床医生选择的小标记子集(代表队列表型范围)。结果:在临床医生选择的子集上对决策层进行本地校准后,在保留的儿科患者(n=7)上性能持续提升:汉明准确率从0.804提高到0.839,Jaccard指数从0.548提高到0.633。当评估限制在临床医生一致性更高的现象时,校准后的性能得以保持,Jaccard指数进一步提高(汉明准确率0.9,Jaccard指数0.786),表明增益并非依赖于最不可靠的标签。

英文摘要

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
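The two multi-label metrics reported above can be computed as in the sketch below. It assumes each subject has a binary vector with one entry per phenomenology, that Hamming accuracy is the fraction of label slots predicted correctly, and that the Jaccard index is averaged per subject. The abstract does not state which averaging the authors used, so the sample-wise mean here is an assumption, and the function names and toy labels are hypothetical.

```python
def hamming_accuracy(y_true, y_pred):
    """Fraction of (subject, label) slots where prediction equals ground truth."""
    total = sum(len(row) for row in y_true)
    correct = sum(t == p
                  for row_t, row_p in zip(y_true, y_pred)
                  for t, p in zip(row_t, row_p))
    return correct / total

def jaccard_index(y_true, y_pred):
    """Sample-averaged Jaccard index: |true AND pred| / |true OR pred| per subject."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        scores.append(inter / union if union else 1.0)  # both empty counts as a match
    return sum(scores) / len(scores)

# Toy example: 2 subjects, 4 candidate phenomenologies each.
truth = [[1, 0, 1, 0], [0, 1, 0, 0]]
pred = [[1, 0, 0, 0], [0, 1, 1, 0]]
example_hamming = hamming_accuracy(truth, pred)
example_jaccard = jaccard_index(truth, pred)
```

Hamming accuracy rewards correctly predicted absences as well as presences, while the Jaccard index only looks at labels that are present in either vector, which is why the Jaccard values in the abstract are lower than the Hamming values.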

2606.07775 2026-06-09 cs.CV 新提交

DALE-CT: Depth-Aware Foundation Models for Computed Tomography

DALE-CT: 用于计算机断层扫描的深度感知基础模型

Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, Caroline N. Leach, Emily B. Collier, V. K. Cody Bumgardner

发表机构 * University of Kentucky(肯塔基大学)

AI总结 提出DALE-CT,一种基于LeJEPA的2D切片模型,通过3D深度感知预训练(利用解剖掩膜和异常标注)提升表示质量,在CT多异常检测中达到与3D视觉语言模型近似的性能。

Comments 9 pages, 2 figures

详情
AI中文摘要

自监督学习(SSL)的最新突破,如潜在欧几里得联合嵌入预测架构(LeJEPA),以及视觉编码器与语言模型集成的成功,推动了计算机断层扫描(CT)中对适应性强、高容量视觉编码器的需求。在这项工作中,我们探索了基于2D切片的架构作为处理体积CT数据的原生3D模型的灵活替代方案。使用CT-RATE数据集,我们从头开始训练了DALE-CT(深度感知潜在欧几里得计算机断层扫描),这是一个完全使用LeJEPA构建的2D模型系列,并将其与持续预训练的DINOv2基线进行了比较。为了提高表示质量,我们开发了一种新颖的3D深度感知预训练策略,该策略由来自自动解剖掩膜和人工标注异常的双重辅助监督密集支持。在使用多实例学习(MIL)进行多异常检测的线性探测评估下,该双监督模型(DALE-CT-2S)的冻结主干实现了0.833的宏AUROC。这一性能表明,从头开始使用显著更少的数据且无需文本监督,即可达到与最先进的3D视觉语言模型近乎相当的水平。为确保可重复性,所有训练代码、评估脚本和模型权重均已公开。

英文摘要

Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.

2606.08364 2026-06-09 cs.CV cs.AI 新提交

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

基于自监督视觉Transformer的CBCT颞下颌关节骨关节炎检测

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

发表机构 * Herman Ostrow School of Dentistry, University of Southern California(南加州大学赫尔曼·奥斯特罗牙科学院) Viterbi School of Engineering, University of Southern California(南加州大学维特比工程学院)

AI总结 研究DINO系列自监督ViT在CBCT颞下颌关节骨关节炎检测中的迁移性能,发现部分解冻最后两个Transformer块可将AUC从0.671提升至0.902,表明适应策略比骨干选择更重要。

详情
AI中文摘要

颞下颌关节骨关节炎(TMJ OA)是一种常见的退行性疾病,其骨性改变在锥形束CT(CBCT)上通常很细微,使得自动检测具有挑战性。我们研究了DINO系列自监督视觉Transformer——DINOv1、DINOv2、DINOv2+reg和RAD-DINO(一种放射学预训练变体)——迁移到CBCT的效果,询问需要多少以及何种骨干适应。我们提出了一种简单的基于切片的流程,使用视觉Transformer(ViT)骨干:轴向CBCT切片由冻结或部分适应的ViT逐切片编码,并通过基于注意力的多实例学习(MIL)聚合,用于患者级别的二分类OA/正常分类。通过在多源CBCT数据集上对解冻策略和聚合设计进行系统消融,我们发现部分解冻最后两个Transformer块是决定性因素,将AUC从0.671(完全冻结的DINOv2)提高到0.902。这优于DINOv1(0.867)、DINOv2+reg(0.774)和有监督的ImageNet ViT-B/16基线(0.843)。我们的结果为在低数据医学影像设置中适应DINO系列基础模型提供了实用指导,表明适应策略比骨干选择本身更能驱动性能。

英文摘要

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
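Attention-based MIL aggregation of per-slice embeddings into one patient-level vector can be sketched as follows. The code uses a common formulation in which each slice embedding h_k receives a score w^T tanh(V h_k), the scores are passed through a softmax, and the patient embedding is the weighted sum. Whether the paper uses exactly this form is an assumption, and `V`, `w`, and the function name are illustrative placeholders for learned parameters.

```python
import math

def attention_mil_pool(embeddings, V, w):
    """Pool slice embeddings into one vector with softmax attention weights.

    embeddings: list of K slice vectors, each of length D
    V:          list of H rows, each of length D (projection matrix)
    w:          list of length H (scoring vector)
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in embeddings:
        hidden = [math.tanh(sum(v_row[d] * h[d] for d in range(len(h))))
                  for v_row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over the K slices.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, embeddings))
              for d in range(dim)]
    return pooled, weights
```

In the setting described here the slice embeddings would come from the frozen or partially unfrozen ViT, and only the small attention head and classifier need to be trained from scratch.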

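2606.08404 2026-06-09 cs.CV New submission

Geometry-Driven Flow Analysis of Brain Sulcal Pattern

脑沟模式的几何驱动流分析

Moo K. Chung, Luigi Maccotta, Aaron Struck

Affiliations: GitHub

AI summary: Proposes a Poisson-equation-based geometry-driven flow framework that models cortical folding from mean curvature, producing a smooth potential field whose gradient defines a physical flux, for analyzing cortical structural abnormalities in juvenile myoclonic epilepsy.

Details
AI Chinese abstract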

皮层折叠反映了协调的神经发育过程,并日益被认为是神经系统疾病的敏感标志。然而,现有大多数分析依赖于间接的标量摘要,并未明确建模折叠几何本身。在青少年肌阵挛癫痫(JME)中,一种常见的遗传性癫痫,皮层异常通常是微妙的、空间分布的,并且难以使用传统的形态测量指标检测。我们引入了一个基于泊松方程的框架,将皮层折叠建模为源自皮层流形上平均曲率的几何驱动流。通过将折叠模式视为静态的源-汇结构,所提出的方法产生了一个光滑的、全局平衡的势场,其表面梯度定义了物理上可解释的通量。该框架能够对脑沟-脑回折叠组织进行空间连贯的分析,并为JME中几何驱动的皮层结构提供了原则性的表示。

English abstract

Cortical folding reflects coordinated neurodevelopmental processes and is increasingly recognized as a sensitive marker of neurological disease. However, most existing analyses rely on indirect scalar summaries that do not explicitly model folding geometry itself. In juvenile myoclonic epilepsy (JME), a common genetic epilepsy, cortical abnormalities are often subtle, spatially distributed, and difficult to detect using conventional morphometric measures. We introduce a Poisson-equation-based framework that models cortical folding as a geometry-driven flow derived from mean curvature on the cortical manifold. By treating folding patterns as a stationary source-sink structure, the proposed approach yields a smooth, globally balanced potential field whose surface gradient defines a physically interpretable flux. This framework enables spatially coherent analysis of sulcal-gyral folding organization and provides a principled representation of geometry-driven cortical structure in JME.

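2606.08420 2026-06-09 cs.CV New submission

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

CheXanatomy: 面向胸部X光片的解剖感知视觉-语言建模

Sergios Gatidis, Curtis Langlotz, Christian Bluethgen

Affiliations: Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University; Department of Radiology, Stanford University

AI summary: Proposes CheXanatomy, a framework that injects anatomical knowledge into a pretrained vision-language model through autoregressive token-space supervision to perform anatomical segmentation; on synthetic and real radiographs it performs comparably to U-Net and improves robustness under domain shift as well as sample efficiency.

Details
AI Chinese abstract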

在大规模图像-文本对上预训练的视觉-语言模型(VLM)表现出强大的图像级理解能力,但主要针对全局对齐进行优化,并未显式编码细粒度解剖结构,限制了其在分割等空间精确任务中的适用性。我们提出CheXanatomy,一个通过自回归令牌空间监督将显式解剖知识融入预训练VLM的框架。该模型无需添加任务特定的解码器头,而是通过下一个令牌预测训练生成解剖分割掩码。为了实现可扩展的监督,我们从CT体积合成逼真的胸部X光片,并前向投影CT分割标签以获得解剖一致的2D掩码。我们在合成和真实胸部X光片上评估该方法,与U-Net基线进行比较,包括模型规模、输入分辨率和视觉编码器微调的消融实验。自回归解剖监督在分布内实现了与专用卷积模型相当的性能,并在向真实CXR数据的域迁移下表现出改进的几何鲁棒性。此外,在有限监督下适应新定位任务时,解剖预训练模型展现出更好的样本效率。更大的模型和更高的输入图像分辨率提升了性能,而视觉编码器微调效果有限。这些结果表明,将解剖结构直接嵌入生成目标促进了空间有根据的表征,并支持解剖感知的医学视觉-语言建模。

English abstract

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.
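
The idea of forward-projecting CT segmentation labels into 2D masks can be illustrated with a toy parallel projection. This is only a sketch under simplifying assumptions: the abstract does not describe the projection geometry, so a parallel projection along the depth axis stands in for it here, and the function name `project_labels` and the label ids are hypothetical.

```python
def project_labels(volume, label):
    """Project a 3D label volume onto 2D along the first (depth) axis.

    volume: nested list indexed as volume[depth][row][col] of integer labels.
    Returns a 2D mask with 1 wherever any voxel along the ray carries `label`.
    """
    depth = len(volume)
    rows = len(volume[0])
    cols = len(volume[0][0])
    return [
        [int(any(volume[z][r][c] == label for z in range(depth))) for c in range(cols)]
        for r in range(rows)
    ]

# Toy 2 x 2 x 3 label volume with two hypothetical structure ids (1 and 2).
ct_labels = [
    [[0, 0, 0], [0, 1, 0]],
    [[0, 2, 0], [0, 1, 0]],
]
mask_1 = project_labels(ct_labels, label=1)
mask_2 = project_labels(ct_labels, label=2)
```

Because structures overlap along the projection direction, each label gets its own 2D mask rather than one shared label map.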

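2606.08421 2026-06-09 cs.CV New submission

Segmentation-Assisted Brain MRI Synthesis with Cross-Image Multi-Contrast Feature Memory Bank Retrieval Augmentation

基于跨图像多对比度特征记忆库检索增强的分割辅助脑MRI合成

Wenwei Huang, Jia Wei, Jianlong Zhou

Affiliations: South China University of Technology; University of Technology Sydney

AI summary: Proposes a segmentation-assisted closed-loop generative adversarial framework that uses an auxiliary segmentation branch and a dual-bank retrieval augmentation strategy to improve synthesis fidelity in tumor regions of multi-contrast brain MRI.

Details
AI Chinese abstract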

多对比度脑MRI提供互补的软组织特征,有助于疾病的筛查和诊断。然而,有限的扫描时间、图像损坏和各种成像协议常常导致多对比度图像不完整。虽然当前方法在图像合成方面表现出色,但它们通常难以合成关键的肿瘤区域,并且无法有效利用多对比度脑MRI中的上下文信息。为了解决这个问题,我们提出了一种以合成为中心、分割辅助的闭环框架,结合检索增强合成。我们的方法整体采用生成对抗架构,旨在通过单一模型从任何可用对比度的组合中合成缺失的对比度。为了显式捕获肿瘤语义并将合成聚焦于肿瘤区域,我们添加了一个辅助分割分支,该分支预测肿瘤掩膜并将其作为语义条件反馈给合成分支,从而在模型中学习肿瘤感知表示并提高合成保真度。此外,我们提出了一种双库检索增强策略。它动态查询两个外部知识库,即用于关键肿瘤上下文的肿瘤掩膜记忆库和用于全局风格信息的跨图像对比度特征记忆库,以增强合成。在两个公开的多对比度磁共振脑数据集:BraTs2020和UCSF-BMSR上验证,所提出的方法在处理医学脑图像合成任务方面有效,并且与先前方法相比表现出优越的性能。代码可在 https://github.com/iBizzard/SSCF.git 获取。

English abstract

Multi-contrast brain MRI provides complementary soft-tissue characteristics that aid in the screening and diagnosis of diseases. However, limited scanning time, image corruption, and various imaging protocols often result in incomplete multi-contrast images. While current approaches excel in image synthesis, they often struggle to synthesize critical tumor regions and to exploit contextual information in multi-contrast brain MRI effectively. To address this issue, we propose a synthesis-centric, segmentation-assisted closed-loop framework with retrieval-augmented synthesis. Overall, our method adopts a generative adversarial architecture that aims to synthesize missing contrasts from any combination of available ones with a single model. To explicitly capture tumor semantics and focus synthesis on tumor regions, we add an auxiliary segmentation branch that predicts tumor masks and feeds them back as semantic conditioning in the synthesis branch, thereby learning tumor-aware representations in the model and improving synthesis fidelity. Furthermore, we propose a dual-bank retrieval augmentation strategy. It dynamically queries two external knowledge bases, namely a tumor-mask memory bank for crucial tumor context and a cross-image contrast feature memory bank for global style information, to augment synthesis. Validated on two public multi-contrast brain MRI datasets, BraTS2020 and UCSF-BMSR, the proposed method is effective for brain image synthesis and shows superior performance compared to previous methods. Code is available at https://github.com/iBizzard/SSCF.git

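2606.08641 2026-06-09 cs.CV New submission

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

可学习的令牌稀疏化用于高效十亿像素全切片图像推理

Jingzhi Chen, Landi He, Zhuo Chen, Shawn Young, Lijian Xu

Affiliations: Shenzhen University of Advanced Technology

AI summary: Addresses the excessive number of whole slide image tokens in vision language models with a learnable sparsification method trained via the SparseLearn component and a differentiable Soft Top-K operator; inference keeps only 32 tokens and reaches 73.32% accuracy on SlideBench (TCGA).

Details
AI Chinese abstract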

在视觉语言模型中处理十亿像素全切片图像面临的主要困难是视觉令牌数量过多。现有解决方案通常依赖于无需训练的空间下采样或启发式剪枝策略,这些方法往往会丢弃细微但具有临床意义的模式,因为病理证据在组织中不规则地分布。为了克服这一限制,我们将全切片图像中的令牌减少重新定义为可训练的稀疏化问题,使模型能够学习最优选择策略,而不是遵循固定的启发式规则。我们提出了一种解耦路由架构。为了在训练过程中通过不可微的剪枝操作实现梯度传播,我们引入了一个名为SparseLearn的组件。该组件使用一个方差保持的噪声门,通过可微分的Soft Top-K算子调节每个补丁的信息流,并配合一个对角注意力去噪器,在不泄露空间信息的情况下恢复受扰动的表示。在推理时,SparseLearn模块被完全丢弃,训练好的评分器应用确定性的Hard Top-K算子,仅保留得分最高的32个令牌,不产生额外计算。通过将视觉序列压缩到仅32个令牌的稀疏集合(仅占原始长度的0.78%),我们的框架在SlideBench(TCGA)上实现了73.32%的总体准确率,持续优于基于采样的基线和通用视觉语言模型。在SlideBench(BCNB)和WSI VQA*上也展示了强大的零样本泛化能力。通过解决视觉上下文瓶颈并防止稀疏诊断证据的稀释,这项工作为端到端的十亿像素全切片图像推理提供了一种高效范式。

English abstract

The processing of gigapixel whole slide images within vision language models faces a major difficulty due to an excessive number of visual tokens. Existing solutions typically rely on spatial downsampling or heuristic pruning strategies that operate without training, and these methods often discard subtle but clinically meaningful patterns because pathological evidence is scattered irregularly across the tissue. To overcome this limitation, we reformulate token reduction in whole slide images as a trainable sparsification problem, allowing the model to learn an optimal selection strategy instead of following fixed heuristics. We propose a decoupled routing architecture. To enable gradient propagation through the nondifferentiable pruning operation during training, we introduce a component called SparseLearn. This component uses a variance-preserving noise gate that regulates the information flow of each patch via a differentiable Soft Top-K operator, together with a diagonal attention denoiser that recovers perturbed representations without leaking spatial information. At inference time, the SparseLearn module is entirely discarded, and the trained scorer applies a deterministic Hard Top-K operator to keep only the 32 highest-scoring tokens, incurring no extra computation. By compressing the visual sequence down to a sparse set of just 32 tokens, which represents as little as 0.78% of the original length, our framework achieves 73.32% overall accuracy on SlideBench (TCGA), consistently surpassing sampling-based baselines and general-purpose vision language models. It also demonstrates strong zero-shot generalization on SlideBench (BCNB) and WSI VQA*. By resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence, this work provides a highly efficient paradigm for end-to-end gigapixel whole slide image reasoning.
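
The inference-time selection and the 0.78% figure can be made concrete with a short sketch. It is illustrative only: the function name `hard_top_k` and the toy scores are hypothetical, and the original sequence length of 4,096 is an assumption inferred from 32 / 0.0078 rather than a number stated in the abstract.

```python
def hard_top_k(scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Compression arithmetic: 32 kept tokens out of an assumed 4,096.
kept = 32
original_len = 4096  # assumption, inferred from the 0.78% figure
ratio_percent = 100 * kept / original_len

# Toy scorer output for five tokens; keep the best two.
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
selected = hard_top_k(scores, k=2)
```

Sorting the kept indices preserves the original token order, which matters if positional information is attached to the tokens downstream.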

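2606.08670 2026-06-09 cs.CV New submission

WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis

WaveDiT: 面向高效3D脑MRI合成的分布感知小波流匹配

Danilo Danese, Angela Lombardi, Giuseppe Fasano, Matteo Attimonelli, Tommaso Di Noia

Affiliations: Politecnico di Bari; Sapienza University of Rome

AI summary: Proposes WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar discrete wavelet transform; it combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling based on higher-order wavelet statistics, enables efficient full-resolution 3D brain MRI synthesis on a single GPU, and outperforms existing methods in distribution alignment and downstream tasks.

Comments Provisionally accepted at MICCAI 2026

Details
AI Chinese abstract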

大型且人口统计学平衡的数据集对于可靠的神经影像生物标志物至关重要。全分辨率3D脑MRI合成可以支持该场景下的数据增强,但现有方法要么在体积尺度上产生高昂的计算成本,要么依赖可能有损解剖细节的潜在压缩。因此,实用的3D生成增强通常需要专门的计算基础设施。我们提出WaveDiT,一种在3D Haar离散小波变换系数空间中运行的条件流匹配框架。该模型将分解的空间-深度注意力与从高阶小波统计中导出的带状异方差不确定性建模相结合。预测的对数方差直接集成到流目标和条件路径中,实现了与解剖细节的重尾和输入相关方差结构一致的适应性精度。该公式支持在单个现代GPU上,在实用的内存和时间约束下进行全分辨率3D合成。在多站点队列上的评估表明,与基于扩散、潜在和小波的基线相比,生成的MRI分布与真实MRI分布的对齐程度有所提高,同时下游脑年龄预测和区域级解剖一致性也得到了增强。代码可在https://github.com/sisinflab/WaveDiT获取。

English abstract

Large and demographically balanced datasets are essential for reliable neuroimaging biomarkers. Full-resolution 3D brain MRI synthesis can support data augmentation in this setting, but existing approaches either incur prohibitive computational cost at volumetric scale or rely on lossy latent compression that may compromise anatomical detail. As a result, practical 3D generative augmentation often requires specialized compute infrastructure. We propose WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar Discrete Wavelet Transform. The model combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling derived from higher-order wavelet statistics. Predicted log-variance is integrated directly into both the flow objective and conditioning pathway, enabling adaptive precision consistent with the heavy-tailed and input-dependent variance structure of anatomical detail. This formulation supports full-resolution 3D synthesis under practical memory and time constraints on a single modern GPU. Evaluation on a multi-site cohort demonstrates improved alignment between generated and real MRI distributions, together with enhanced downstream brain age prediction and region-level anatomical agreement relative to diffusion, latent, and wavelet-based baselines. Code is available at https://github.com/sisinflab/WaveDiT
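
For readers unfamiliar with the Haar coefficient space mentioned in this abstract, a one-level 1D Haar transform is sketched below. It illustrates the standard orthonormal Haar transform only, not WaveDiT itself; a 3D transform applies the same averaging and differencing step along each of the three axes, yielding eight sub-bands per level. The function names are hypothetical.

```python
import math

def haar_dwt_1d(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = 1 / math.sqrt(2)
    # Low-pass band: scaled pairwise sums. High-pass band: scaled pairwise differences.
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt_1d(approx, detail):
    """Invert haar_dwt_1d, recovering the original samples."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * s, (a - d) * s])
    return out

signal = [4.0, 2.0, 5.0, 5.0]
approx, detail = haar_dwt_1d(signal)
reconstructed = haar_idwt_1d(approx, detail)
```

The transform is exactly invertible and energy-preserving, which is why a generative model can work in the coefficient space without the information loss of a learned latent compressor.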

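2606.08687 2026-06-09 cs.CV New submission

Shift-Dependent Asymmetry: Orthogonal Inverse Low-Rank Adaptation for Federated Medical Segmentation

移位依赖的不对称性:面向联邦医学分割的正交逆低秩适应

Xingyue Zhao, Wenke Huang, Linghao Zhuang, Haoran Wu, Anwen Jiang, Zhifeng Wang, Wenwen He, Ming Feng, Mang Ye, Bo Xu

Affiliations: University of Science and Technology of China

AI summary: Targets the encoder-decoder asymmetry in federated medical segmentation (the encoder is dominated by appearance shifts, the decoder by supervision variations) and proposes Inverse Asymmetric Tuning (IAT), which personalizes module-specific components and adds a subspace orthogonality regularizer to prevent leakage, improving cross-site generalization.

Comments Accepted by ICML 2026

Details
AI Chinese abstract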

低秩适应(LoRA)能够实现医学图像分割基础模型的高效联邦微调。然而,大多数联邦LoRA方法采用统一的聚合规则,这在医学分割的编码器-解码器不对称性下被打破:编码器受外观移位主导,而解码器受监督变化主导。这种不匹配将共享解剖结构与站点特定偏差纠缠在一起,损害了泛化能力。为解决这一问题,我们提出逆不对称调优(IAT)。IAT通过个性化编码器中模块特定组件以吸收外观移位,以及解码器中模块特定组件以适应站点依赖的监督,同时保留用于可迁移共识的共享路径,从而将适应与异质性来源对齐。然而,在LoRA的双线性参数化下,仅靠结构分离是不够的,因为乘法耦合仍可能导致站点特定更新泄漏到共享方向。因此,我们引入子空间正交正则化器,在有效更新空间中惩罚共享-局部共线性,从而在不增加额外通信的情况下减轻泄漏。实验表明,与强联邦LoRA和参数高效联邦学习基线相比,该方法取得了持续改进。

English abstract

Low-Rank Adaptation (LoRA) enables efficient federated fine-tuning of segmentation foundation models for medical imaging. However, most federated LoRA methods adopt a uniform aggregation rule, which breaks under the encoder-decoder asymmetry in medical segmentation: the encoder is dominated by appearance shifts, while the decoder is dominated by supervision variations. This mismatch entangles shared anatomy with site-specific biases and harms generalization. To address this, we propose Inverse Asymmetric Tuning (IAT). IAT aligns adaptation with heterogeneity sources by personalizing module-specific components in the encoder to absorb appearance shifts and in the decoder to accommodate site-dependent supervision, while retaining a shared pathway for transferable consensus. However, structural separation alone is insufficient under LoRA's bilinear parameterization, where multiplicative coupling can still cause site-specific updates to leak into the shared direction. We therefore introduce a Subspace Orthogonality Regularizer that penalizes shared-local collinearity in the effective update space, mitigating leakage without extra communication. Experiments show consistent improvements over strong federated LoRA and parameter-efficient FL baselines.
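
One way to read "penalizing shared-local collinearity in the effective update space" is as a squared cosine similarity between the two effective LoRA updates. The sketch below is an assumption made for illustration, not the paper's regularizer: the abstract gives no formula, and the names `collinearity_penalty`, `b_shared`, `a_shared`, `b_local`, and `a_local` are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def collinearity_penalty(b_shared, a_shared, b_local, a_local, eps=1e-12):
    """Squared cosine similarity between the effective updates B_s A_s and B_l A_l."""
    dw_shared = matmul(b_shared, a_shared)
    dw_local = matmul(b_local, a_local)
    flat_s = [v for row in dw_shared for v in row]
    flat_l = [v for row in dw_local for v in row]
    dot = sum(x * y for x, y in zip(flat_s, flat_l))
    norm_s = sum(x * x for x in flat_s)
    norm_l = sum(y * y for y in flat_l)
    return dot * dot / (norm_s * norm_l + eps)

# Rank-1 toy factors for a 2 x 2 weight matrix.
b_shared, a_shared = [[1.0], [0.0]], [[1.0, 0.0]]
b_local, a_local = [[0.0], [1.0]], [[0.0, 1.0]]
orthogonal = collinearity_penalty(b_shared, a_shared, b_local, a_local)
collinear = collinearity_penalty(b_shared, a_shared, b_shared, a_shared)
```

Measuring the penalty on the products rather than on the individual factors reflects the abstract's point that the bilinear parameterization couples the factors multiplicatively. Each client can evaluate such a term locally, which is consistent with the claim of no extra communication.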

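2606.08742 2026-06-09 cs.CV New submission

AUCp: Pseudo-AUC for Inference Model Selection with Unlabeled Validation Data in Abnormality Detection

AUCp: 用于异常检测中无标注验证数据的推理模型选择的伪AUC

Md Mahfuzur Rahman Siddiquee, Fazle Rafsani, Jay Shah, Teresa Wu, Catherine D Chong, Todd J Schwedt, Baoxin Li

Affiliations: arXiv

AI summary: Proposes the AUCp metric, which selects the best inference model for unsupervised and self-supervised abnormality detection methods without a labeled validation set by computing a pseudo-AUC that treats all test-set samples as abnormal; theory and experiments show it outperforms conventional metrics.

Details
Journal ref
IEEE Transactions on Medical Imaging (Early Access), 2026
AI Chinese abstract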

异常检测是医学图像分析中一项关键但具有挑战性的任务。通过学习仅重构正常数据来区分异常与正常数据,减少了对标注数据集的依赖。然而,许多研究即使是无监督的,也依赖标注验证集从多次训练迭代中选择最佳推理模型。对于许多疾病,标注数据不可用且获取耗时。为解决此问题,提出了AUCp——一种支持无监督和自监督方法异常检测的新指标。它不通过评估重构图像的真实性来选择最佳推理模型,而是关注实际检测性能,且无需标注测试集。假设测试集中所有未标注样本的伪真实标签为异常/阳性,并使用传统AUC计算,得到AUCp分数。给定一个包含大量正常样本的代表性训练集,我们通过数学和实证证据表明,使用AUCp分数进行模型选择在无监督和自监督方法中比传统指标更能改善疾病检测。使用两种无监督方法进行神经系统疾病检测以及在不同数据集上的自监督方法,我们的结果表明AUCp分数有效识别最佳推理模型,显著增强异常和疾病检测。相应实现可在https://github.com/mahfuzmohammad/AUCp获取。

English abstract

Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalities from normal data by learning to reconstruct normal-only data alleviates the reliance on labeled datasets. However, many studies, even if unsupervised, rely on a labeled validation set to select the best model for inference from multiple training iterations. For many diseases, labeled data are unavailable and time-consuming to obtain. To address this, we propose AUCp, a novel metric that supports abnormality detection for unsupervised and self-supervised methods. Instead of evaluating the realism of reconstructed images to select the best model for inference, it focuses on actual detection performance without requiring an annotated test set. AUCp scores are derived by assuming the pseudo ground truth of all unannotated samples in the test set to be abnormal/positive and applying the traditional AUC calculation. Given a large and representative training set of normal samples, we show mathematical and empirical evidence that model selection using AUCp scores improves disease detection for unsupervised and self-supervised methods over conventional metrics. Using two unsupervised methods for neurologic disease detection and self-supervised methods on diverse datasets, our results demonstrate that the AUCp score effectively identifies the optimal model for inference, significantly enhancing abnormality and disease detection. The corresponding implementations are available at https://github.com/mahfuzmohammad/AUCp.
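
The pseudo-AUC idea can be sketched as follows. This is a hedged illustration, not the released implementation: it assumes the negative class comes from held-out known-normal samples (the abstract does not say where the negatives come from), and the function names, checkpoint names, and anomaly scores are hypothetical.

```python
def auc(neg_scores, pos_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def select_checkpoint(checkpoint_scores):
    """Pick the checkpoint with the highest pseudo-AUC.

    checkpoint_scores maps a checkpoint name to (normal_scores, unlabeled_scores),
    where every unlabeled sample is pseudo-labeled as abnormal/positive.
    """
    aucp = {
        name: auc(normal, unlabeled)
        for name, (normal, unlabeled) in checkpoint_scores.items()
    }
    best = max(aucp, key=aucp.get)
    return best, aucp

# Toy anomaly scores for two hypothetical checkpoints.
checkpoint_scores = {
    "epoch_10": ([0.1, 0.2, 0.3], [0.2, 0.4, 0.6]),
    "epoch_20": ([0.1, 0.15, 0.2], [0.5, 0.7, 0.9]),
}
best, aucp = select_checkpoint(checkpoint_scores)
```

Since some unlabeled samples are in fact normal, the pseudo-AUC is a biased estimate of the true AUC. The abstract's claim is that it still identifies the best checkpoint, not that it matches the true AUC value.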

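2606.08745 2026-06-09 cs.CV New submission

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

染色感知的小波正则化用于组织病理学中的即时对抗净化

Zhe Li, Bernhard Kainz

Affiliations: FAU Erlangen-Nürnberg

AI summary: Proposes Stain-Aware Wavelet Regularization (SAWR), which uses multi-level wavelet-domain regularization based on the Haar transform to hierarchically separate adversarial perturbations from diagnostic structural information and extends it to histological channels for stain-specific frequency regulation, improving adversarial robustness by up to 10.69% within an instant purification framework.

Comments 14 pages, 4 figures

Details
AI Chinese abstract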

深度学习在计算病理学流程中已变得普遍,支持癌症筛查和数字病理学分析等任务。然而,神经网络对对抗扰动的敏感性引发了临床实践中可靠部署的安全问题。在组织病理学图像中,由于难以区分高频对抗噪声与细微且具有诊断意义的组织结构,这一挑战更加严峻。为解决此问题,我们提出染色感知小波正则化(SAWR),一种利用基于Haar变换的多级小波域正则化的对抗净化框架,以分层方式将对抗扰动与诊断结构信息分离。该频谱约束进一步扩展到单个组织学通道,实现与苏木精和伊红的生物学特性一致的染色特异性频率调节。当集成到即时净化框架中时,SAWR将对抗鲁棒性相对于基线方法提升高达10.69%,同时在对抗扰动下保持纹理和频谱保真度。

English abstract

Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer screening and digital pathology analysis. However, the susceptibility of neural networks to adversarial perturbations raises safety concerns for reliable deployment in clinical practice. In histopathological images, this challenge is exacerbated by the difficulty of distinguishing high-frequency adversarial noise from subtle and diagnostically relevant tissue structures. To address this issue, we propose Stain-Aware Wavelet Regularization (SAWR), an adversarial purification framework that leverages multi-level wavelet-domain regularization based on the Haar transform to hierarchically disentangle adversarial perturbations from diagnostic structural information. This spectral constraint is further extended to individual histological channels, enabling stain-specific frequency regulation consistent with the biological properties of Hematoxylin and Eosin. When integrated into an instant purification framework, SAWR improves adversarial robustness by up to 10.69% over the baseline approach, while maintaining texture and spectral fidelity under adversarial perturbations.

2606.08751 2026-06-09 cs.CV New submission

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

少即是多:通过全局-局部轨迹缩减实现低计数PET去噪的3D扩散模型免训练加速框架

Yuhan Liu, Scott M. Leonard, Marlee Crews, Muhannad Fadhel, Jinkui Hao, Tianqi Chen, Ryan J. Avery, Bo Zhou

Affiliations: Northwestern University; Hefei University of Technology

AI summary: Proposes a training-free global-local skipping strategy that initializes the reverse process from an intermediate step via a noise-consistent transformation and reuses U-Net features, accelerating 3D diffusion model denoising while improving reconstruction quality.

Comments 19 pages, 10 figures, 5 tables

Details
AI Chinese abstract

PET中的准确定量和摄取测量对于评估疾病进展和支持临床决策至关重要。虽然高计数PET提供了可靠的图像质量,但相关的辐射剂量和长时间采集仍然是重要的临床问题,促使采用低计数协议。基于扩散模型的方法在将低计数PET恢复至接近高计数质量方面显示出巨大潜力,但其迭代采样过程在应用于高分辨率3D PET体积时变得极其昂贵,导致显著的推理延迟,限制了实际临床部署。为了解决这些挑战,我们提出了一种免训练的全局-局部跳跃策略,该策略加速了基于扩散模型的3D PET去噪,同时提高了重建质量。所提出的方法即插即用,可直接应用于预训练扩散模型,无需重新训练或修改架构。具体而言,我们引入了:(i) 全局去噪步骤跳跃策略,通过使用低计数输入的噪声一致变换从中间去噪步骤初始化反向扩散过程,大幅减少所需的去噪步骤数;(ii) 局部特征重用捷径,在相邻去噪步骤间重用缓慢变化的高级U-Net特征,进一步减少每步计算量同时保持图像保真度。我们在来自内部和公共数据集的多种PET示踪剂上评估了所提出的方法,包括18F-FDG PET、68Ga-DOTATATE PET和18F-PSMA PET,结果显示相对于全步骤基线,实现了超过一个数量级的一致加速以及改进或相当的重建性能。盲法读者研究进一步证实了增强的临床信心和感知诊断质量。

English abstract

Accurate quantification and uptake measurement in PET are critical for assessing disease progression and supporting clinical decision-making. While high-count PET provides reliable image quality, the associated radiation dose and prolonged acquisition remain significant clinical concerns, motivating the adoption of low-count protocols. Diffusion-model-based methods have demonstrated strong potential for restoring low-count PET to near high-count quality, but their iterative sampling procedure becomes prohibitively expensive when applied to high-resolution 3D PET volumes, introducing substantial inference latency that limits practical clinical deployment. To address these challenges, we propose a training-free Global-Local Skipping Strategy that accelerates diffusion model-based 3D PET denoising while simultaneously improving reconstruction quality. The proposed method is plug-and-play and directly applicable to pre-trained diffusion models without retraining or architectural modification. Specifically, we introduce: (i) a global denoising step skipping strategy that initializes the reverse diffusion process from an intermediate denoising step using a noise-consistent transformation of the low-count input, substantially reducing the number of required denoising steps; and (ii) a local feature reuse shortcut that reuses slowly-varying high-level U-Net features across neighboring denoising steps, further reducing per-step computation while preserving image fidelity. We evaluate the proposed approach on multiple PET tracers from in-house and public datasets, including 18F-FDG PET, 68Ga-DOTATATE PET, and 18F-PSMA PET, demonstrating consistent acceleration of over an order of magnitude alongside improved or comparable reconstruction performance relative to the full-step baseline. Blinded reader studies further confirm enhanced clinical confidence and perceived diagnostic quality.
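
The global step-skipping idea, starting the reverse process from an intermediate step instead of pure noise, can be sketched with the standard DDPM forward-noising formula. Note the substitution: the abstract does not specify its "noise-consistent transformation", so the usual closed-form noising of the low-count input stands in for it here. The linear beta schedule, the step counts, and all names are assumptions for illustration.

```python
import math
import random

def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta) over the first t steps of a linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def init_from_low_count(low_count, t_start, num_steps, rng):
    """Noise the low-count input to the level of step t_start (DDPM closed form)."""
    ab = alpha_bar(t_start, num_steps)
    return [
        math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
        for x in low_count
    ]

num_steps = 1000  # hypothetical length of the full reverse chain
t_start = 100     # hypothetical intermediate starting step
rng = random.Random(0)
x_t = init_from_low_count([0.5, 1.0, 0.2], t_start, num_steps, rng)

# Only t_start reverse steps remain instead of num_steps.
step_reduction = num_steps / t_start
```

The local feature-reuse shortcut described in the abstract would further reduce the cost of each remaining step by caching slowly varying U-Net features; it is not shown here.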

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM New submission

A multi-agent system for spine MRI report generation from multi-sequence imaging

基于多序列影像的脊柱MRI报告生成多智能体系统

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

Affiliations: University of Washington; Peking University; University of Wisconsin–Madison; New York University; University of Washington Medical Center

AI summary: Proposes SpineAgent, a multi-agent framework that uses a multi-sequence foundation model to integrate information from T1, T2, and other sequences for spine MRI report generation, pathology localization, and image-report retrieval, performing strongly in cross-manufacturer and cross-cohort evaluation.

Details
AI Chinese abstract

脊柱病理是全球疼痛和残疾的主要原因之一。脊柱MRI是临床评估的核心,但其解读仍然复杂且耗时,需要整合多个成像序列和解剖区域的信息。尽管自动化MRI分析最近取得了进展,但如何有效结合多序列数据同时保留序列特异性诊断信息仍是一个开放挑战。本文提出SpineAgent,一个基于多序列基础模型的脊柱MRI报告生成多智能体框架,该模型在来自32,047名患者和453,683个MRI系列(总计13,441,191张MRI切片)的常规临床数据上训练。为了适应不同模态的序列,我们首先分别在T1和T2加权序列上预训练两个基于DINOv3的编码器。然后,我们引入一种持续训练策略,学习一个合成器,利用T1和T2编码器嵌入其他序列的图像,生成整合MRI序列间各种信号的患者级嵌入。利用这些嵌入,SpineAgent实现了最先进的性能,并在跨制造商和跨队列评估中展现出强大的泛化能力。除了分类,SpineAgent通过识别与发现相关的切片和分割病理区域实现病理定位。它还支持多模态图像-报告检索,为可扩展和可解释的MRI报告生成提供了坚实基础。我们进一步将这些经过验证的SpineAgent能力集成到37个专门智能体中。最后,我们将它们的输出作为结构化标记,整合到一个端到端训练用于报告生成的医疗报告智能体中。通过自动指标和五位放射科医生的专家评估,SpineAgent在脊柱MRI报告生成中取得了领先性能。

English abstract

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse sequence types, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing a patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

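2606.09140 2026-06-09 cs.CV New submission

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

DiffSight-Former:建模结构差异和时间动态用于青光眼进展预测

Yi Huang, Lei Bi, Jinman Kim

Affiliations: The University of Sydney; Shanghai Jiao Tong University

AI summary: Proposes the DiffSight-Former framework, which predicts glaucoma progression from sequential fundus images through time-variant feature extraction, multi-structure difference modeling, and a time-aware Transformer, achieving 91.54% AUC and 92.16% sensitivity on SIGF and 87.48% average accuracy on GRAPE.

Comments 12 pages, 6 figures

Details
AI Chinese abstract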

青光眼是全球不可逆失明的主要原因,从眼底图像早期检测对于有效疾病管理至关重要。虽然深度学习在眼底图像分析中取得了有希望的性能,但现有方法大多依赖单时间点图像,未能捕捉与疾病进展相关的纵向结构和血管变化。临床随访期间获取的序列眼底图像提供了宝贵的时间信息;然而,当前的序列模型通常难以检测微妙的早期进展信号,并且常依赖固定长度输入或已患青光眼图像的诊断线索,限制了其在早期预测中的临床实用性。为解决这些限制,我们提出了DiffSight-Former,一个从序列眼底图像预测青光眼进展的框架。它包含一个基于眼底专用基础模型的时间变异特征提取模块,以获得稳健的解剖表示。引入多结构差异建模模块来量化视盘/杯区域和视网膜血管中与进展相关的变化。这些表示与时间间隔嵌入集成,并由时间感知Transformer处理,以建模疾病进展并估计未来青光眼发作的概率。在两个纵向数据集SIGF(405个序列)和GRAPE(263个序列)上进行了实验。在SIGF上,DiffSight-Former在进展预测中达到了91.54%的AUC和92.16%的灵敏度。在GRAPE上,它在三个临床视野进展标准上平均准确率达到87.48%。与现有方法相比,DiffSight-Former在不同时间设置下表现出强大的性能和鲁棒性,突显了其在纵向青光眼监测和早期风险预测中的潜力。

English abstract

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.

2606.09249 2026-06-09 cs.CV New submission

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

MAGIS:基于证据的多智能体推理用于可解释的斜视临床决策

Xikai Tang, Yifan Wang, Jiafan Zhuang, Li Luo, Jinming Guo, Xiaoling Xie, Jiacheng Liu, Peiwei Wei, Lihao Zhong, Xiaoli Kang, Jie Cen, Guangqiang Yin, Kunliang Qiu, Ce Zheng, Zhun Fan

Affiliations: School of Information and Software Engineering, University of Electronic Science and Technology of China; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong; School of Artificial Intelligence, Guangzhou City Polytechnic; Medical College, Shantou University; College of Engineering, Shantou University; Department of Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine; Shenzhen Loop Area Institute

AI summary: Proposes the MAGIS framework, which turns strabismus diagnosis from black-box generation into structured reasoning through multi-agent collaboration, a dual-evidence constrained context, and evidence-based corrective verification; on a fine-grained strabismus benchmark it raises the weighted F1 score from 72.0% to 91.3% and substantially improves the clinical reliability of diagnostic reports.

Details
AI Chinese abstract

斜视是一种常见的眼部疾病,需要细粒度亚型诊断以制定个性化治疗方案。然而,现有的深度学习方法主要提供诊断预测,缺乏透明推理;而近期的大视觉语言模型(LVLMs)虽然在联合图像理解和报告生成方面有前景,但在这种对证据敏感且规则驱动的医学任务中极易产生幻觉。为解决这些问题,我们提出了MAGIS,一个基于证据的多智能体可解释斜视诊断推理框架。MAGIS将黑箱端到端生成转变为结构化的诊断过程,包括候选假设生成、双重证据约束上下文、基于证据的纠正验证和报告生成。具体而言,我们引入了双重证据约束上下文(DECC)机制,将来自九个注视方位照片的视觉证据和基于证据的临床诊断规则联合组织成约束上下文,以实现可靠的诊断推理。我们进一步开发了基于证据的纠正验证(EBCV)机制,验证当前诊断假设是否得到视觉证据、基于热图的视觉线索和基于证据的临床诊断规则的支持。当检测到不一致时,触发假设修正。在细粒度斜视基准上的实验表明,MAGIS不仅显著优于其他最先进的诊断系统,将加权F1分数从72.0%提高到91.3%,而且大幅提升了生成诊断报告的临床可靠性(一致性、对齐性和完整性)。这些结果表明,MAGIS为构建准确、基于证据且临床可解释的斜视诊断系统提供了有效解决方案。

English abstract

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.

2606.09253 2026-06-09 cs.CV physics.med-ph New submission

A practical probabilistic framework for deformable image registration uncertainty in radiotherapy dose propagation

一种实用的概率框架用于放射治疗剂量传播中的可变形图像配准不确定性

Stefan Heldmann, Sven Kuckertz, Nasim Givehchi, Thomas Coradi, Mikel Byrne, Ben Archibald-Heeren, Nils Papenberg

Affiliations: Fraunhofer Institute for Digital Medicine MEVIS; Varian, a Siemens Healthineers company; Icon Group

AI summary: Proposes a lightweight probabilistic framework that models deformable image registration uncertainty through a local certainty map, propagates that uncertainty to dose statistics and dose-volume histograms, and shows in a prostate radiotherapy case how certainty-map design affects the results.

Details
AI Chinese abstract

可变形图像配准(DIR)广泛应用于放射治疗中的剂量传播和累积,但底层变形的不确定性会显著影响临床相关的剂量估计。我们提出了一种实用的概率框架,用于将DIR不确定性传播到体素级剂量统计和剂量体积直方图(DVH)。该方法将每个体素的映射对应关系建模为由透明的局部确定性图控制的随机变量,该确定性图可通过简单的安全边界、结构边界不匹配或结构保守的不确定性值来定义。这产生了可解释的量,如剂量概率、期望剂量、置信区间和诱导的DVH包络。该框架设计为轻量级且可解释:它避免了复杂的生物力学或基于集成的不确定性模型,而是强调简单的参数化、计算可行性和透明的剂量指标。我们进一步引入了一种结构导向的内/外策略作为可选优化,将映射概率限制在解剖学上合理的目标区域。该方法在前列腺放疗案例研究中得到验证,并用于比较不同的确定性图策略和概率核。实验表明,确定性图设计对结果剂量和DVH不确定性边界的影响比特定核选择更强,而内/外策略的额外收益在案例中依赖于具体情况且效果有限。总体而言,所提出的框架提供了一种透明的方式,将DIR不确定性纳入放射治疗剂量评估,并研究建模选择如何影响传播的剂量指标。

英文摘要

Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but uncertainty in the underlying deformation can substantially affect clinically relevant dose estimates. We present a practical probabilistic framework for propagating DIR uncertainty to voxel-wise dose statistics and dose-volume histograms (DVHs). The method models the mapped correspondence at each voxel as a random variable governed by a transparent local certainty map that can be defined by simple safety margins, structure-boundary mismatch, or structure-wise conservative uncertainty values. This yields interpretable quantities such as dose probabilities, expected dose, confidence bounds, and induced DVH envelopes. The framework is designed to remain lightweight and interpretable: it avoids complex biomechanical or ensemble-based uncertainty models and instead emphasizes simple parameterization, computational feasibility, and transparent dose metrics. We further introduce a structure-guided in/out strategy as an optional refinement that restricts mapping probabilities to anatomically plausible target regions. The approach is demonstrated on a prostate radiotherapy case study and used to compare different certainty-map strategies and probability kernels. The experiments show that the certainty-map design has a stronger effect on resulting dose and DVH uncertainty bounds than the specific kernel choice, while the additional benefit of the in/out strategy is case-dependent and modest in the present example. Overall, the proposed framework provides a transparent way to incorporate DIR uncertainty into radiotherapy dose assessment and to study how modelling choices affect propagated dose metrics.
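
The voxel-wise quantities named in this abstract (expected dose, confidence bounds) can be illustrated with a minimal sketch, assuming a 1D dose profile and a discrete Gaussian probability kernel. The kernel form, the `sigma` parameter (a stand-in for a local uncertainty value derived from the certainty map), and all function names are hypothetical and are not taken from the paper.

```python
import math

def mapping_probabilities(center, sigma, n):
    """Discrete Gaussian weights over voxel indices 0..n-1, normalised to sum to 1.
    sigma is a hypothetical stand-in for the local uncertainty at this voxel;
    sigma <= 0 means the mapping is treated as fully certain."""
    if sigma <= 0:
        return [1.0 if i == center else 0.0 for i in range(n)]
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_dose(dose, probs):
    """Probability-weighted mean of the dose values the voxel could map to."""
    return sum(d * p for d, p in zip(dose, probs))

def dose_quantile(dose, probs, q):
    """Smallest dose d with P(dose <= d) >= q; used for lower/upper bounds."""
    pairs = sorted(zip(dose, probs))
    cumulative = 0.0
    for d, p in pairs:
        cumulative += p
        if cumulative >= q - 1e-12:
            return d
    return pairs[-1][0]

dose_profile = [0.0, 10.0, 20.0, 30.0, 40.0]   # dose along one axis (Gy)
probs = mapping_probabilities(center=2, sigma=1.0, n=len(dose_profile))
mean_dose = expected_dose(dose_profile, probs)
bounds = (dose_quantile(dose_profile, probs, 0.05),
          dose_quantile(dose_profile, probs, 0.95))
```

A larger `sigma` spreads probability over more neighbouring voxels and widens the bounds, which is one simple way to see why the certainty-map design can matter more than the kernel shape.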

2606.09378 2026-06-09 cs.CV 新提交

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Echo-DM: 通过条件潜在扩散和区域感知融合去除超声标记

Zhiwei Wang, Tao Huang, Wentao Jiang, Muyi Li, Jianxin Liu, Jian Chen, Jie Zou, Yong Luo, Bo Du, Jing Zhang

发表机构 * School of Computer Science, Wuhan University, China(武汉大学计算机学院) The Central Hospital of Wuhan, China(武汉市中心医院) School of Computer Science, Hubei University of Technology, China(湖北工业大学计算机学院)

AI总结 提出Echo-DM框架,结合条件潜在扩散和区域感知融合,在无掩码条件下有效去除超声图像中的人工标记,同时保持解剖结构保真度。

Comments 18 pages, 4 figures

详情
AI中文摘要

临床超声图像通常包含人工标记,如测量卡尺和文字,以辅助诊断解释和比较。然而,这些标记可能在下游自动分析中引入捷径偏差,促使深度学习模型依赖标记相关线索而非临床有意义的解剖结构。现有的标记去除方法要么依赖于掩码且易受错误传播影响,要么是无掩码的确定性修复器,可能过度平滑超声纹理并扰动未受影响的背景区域。为应对这些挑战,我们提出了Echo-DM,一个通过条件潜在扩散和区域感知融合进行超声标记去除的框架。Echo-DM遵循通用的编码器-扩散-解码器流水线,其中基于DiT的条件潜在扩散网络执行全局修复,区域感知融合模块在端到端无掩码推理下强制执行保留感知的图像空间细化。基于这一固定核心设计,我们进一步分别用基于VAE和基于RAE的潜在模块实例化了Echo-DM-V和Echo-DM-R,这表明Echo-DM架构与多种潜在模块实例化兼容。在Echo-PAIR(一个大规模配对临床超声数据集)上的大量实验表明,与代表性的两阶段基线相比,Echo-DM具有优越的标记去除能力和强大的解剖保真度,同时在部署设置中提供了有利的质量-效率权衡。数据、代码和模型将在https://github.com/MiliLab/Echo-DM发布。

英文摘要

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality--efficiency trade-offs across deployment settings. Data, code and models will be released at https://github.com/MiliLab/Echo-DM.

2606.09400 2026-06-09 cs.CV 新提交

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

vesselFM-CT:在CT图像中分割所有血管以实现系统级心血管分析

Bastian Wittmann, Chinmay Prabhakar, Suprosanna Shit, Bjoern Menze

发表机构 * Department of Quantitative Biomedicine, University of Zurich(苏黎世大学定量生物医学系)

AI总结 提出vesselFM-CT模型,通过迭代多步训练和TubeLoss损失函数,实现CT图像中从大血管到微小肠系膜血管的全分割,优于基线方法,支持系统级心血管分析。

详情
AI中文摘要

人体血管网络中的血管在半径、长度、拓扑特性和分支模式上表现出剧烈的结构变化。这种异质性,加上位置特定的解剖背景变化,对稳健、大规模地分析整个心血管系统构成了重大挑战。因此,大多数研究集中在血管网络的狭窄孤立部分。虽然这些针对性研究提供了有价值的见解,但它们本质上限制了评估血管网络整体系统健康和功能完整性的能力。在这项工作中,我们旨在弥合这一差距,以推进临床诊断和我们对血管生理学的基本理解。我们提出了在CT图像中分割所有血管的任务,范围从心血管系统最大的组成部分到微小的肠系膜血管。为此,我们引入了vesselFM-CT,这是第一个能够稳健分割3D CT图像中所有血管的模型。vesselFM-CT通过迭代多步过程进行训练,并优化我们提出的TubeLoss损失函数,有效解决了心血管系统固有的异质性。我们证明vesselFM-CT优于所有基线,并能够从CT图像中自动精确提取心血管系统,从而解锁广泛的临床和技术视角,包括自动疾病分类和合成CT图像生成。

英文摘要

The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.

2606.09453 2026-06-09 cs.CV 新提交

GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer

GD-MIL:用于前列腺癌多模态生化复发预测的等级解耦多实例学习

Dasari Naga Raju

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对前列腺癌生化复发预测,提出GD-MIL方法,通过梯度反转等级对抗训练实现H&E全切片图像特征与Gleason等级解耦,融合临床变量后C-index达0.704,显著优于临床基线。

详情
AI中文摘要

根治性前列腺切除术后的生化复发(BCR)是前列腺癌的关键终点,但风险分层几乎完全依赖于以Gleason等级为主的变量。H&E全切片图像(WSI)是否携带超出等级之外的预后信号,以及多实例学习(MIL)能否恢复这些信号,仍无定论。一个关键障碍是许多流程在评估折上选择模型检查点,人为提高了一致性指数。我们在TCGA-PRAD(487名患者,101个BCR事件)上构建了严格的基准测试,使用严格的折外评分,在五个种子重复的五折交叉验证上进行评估。MIL聚合器(ABMIL、CLAM、TransMIL、PatchGCN)的选择影响很小(UNI2-h的C指数为0.61-0.64),而特征提取器是主导因素(ResNet50为0.566,病理基础模型高达0.639)。基于等级、分期和年龄的临床Cox模型达到0.687;没有纯影像模型显著优于它(p > 0.10)。我们引入了等级解耦MIL(GD-MIL),一种门控注意力MIL编码器,通过梯度反转等级对抗训练,使切片表示在与临床变量晚期融合之前对Gleason等级保持不变。GD-MIL实现了C指数0.704,显著优于临床基线(delta-c = +0.029,p = 0.0005)和最佳纯影像模型(delta-c = +0.062,p = 0.039),表明H&E形态学包含与等级互补的预后信息。中位风险分层在BCR无生存期上产生log-rank p < 0.0001的分离(五年约20% vs 约70%)。

英文摘要

Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learning (MIL) can recover it, remains unsettled. A key obstacle is that many pipelines select model checkpoints on the evaluation fold, artificially inflating concordance. We construct a rigorous benchmark on TCGA-PRAD (487 patients, 101 BCR events) using strict out-of-fold scoring over five-fold cross-validation repeated across five seeds. The choice of MIL aggregator (ABMIL, CLAM, TransMIL, PatchGCN) has little effect (C-index 0.61-0.64 with UNI2-h), while the feature extractor is the dominant factor (ResNet50 0.566 versus pathology foundation models up to 0.639). A clinical Cox model on grade, stage, and age reaches 0.687; no imaging-only model significantly outperforms it (p > 0.10). We introduce Grade-Disentangled MIL (GD-MIL), a gated-attention MIL encoder trained with a gradient-reversal grade adversary that encourages the slide representation to be invariant to Gleason grade before late fusion with clinical variables. GD-MIL achieves C-index 0.704, significantly outperforming both the clinical baseline (delta-c = +0.029, p = 0.0005) and the best imaging-only model (delta-c = +0.062, p = 0.039), suggesting H&E morphology contains prognostic information complementary to grade. A median risk split yields log-rank p < 0.0001 separation in BCR-free survival (~20% vs ~70% at five years).
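
For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C-index in plain Python. It assumes right-censored data and a simplified treatment of tied event times (such pairs are skipped); the function name and the toy data are illustrative and do not come from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C: among comparable pairs, the fraction in which the patient
    with the earlier observed event also has the higher predicted risk.
    Ties in predicted risk count as 0.5."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly
            # before j's follow-up time (event or censoring).
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy example: follow-up in months, 1 = recurrence observed, 0 = censored.
times = [12, 24, 36, 48]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [4.0, 3.0, 2.0, 1.0])
reversed_order = concordance_index(times, events, [1.0, 2.0, 3.0, 4.0])
uninformative = concordance_index(times, events, [1.0, 1.0, 1.0, 1.0])
```

Under strict out-of-fold scoring, the `risks` passed to such a function would come only from models that never saw the scored patient during training or checkpoint selection.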

2606.09699 2026-06-09 cs.CV 新提交

Cranio-Diff: Diffusion-based Cross-domain Craniofacial Reconstruction with 2D X-ray Skull Guidance and Structural Identity Constraints

Cranio-Diff: 基于扩散的跨域颅面重建,利用二维X射线颅骨引导和结构身份约束

Ravi Shankar Prasad, Naresh Gurjar, Shashank Baghel, Chirag, Dinesh Singh

发表机构 * Indian Institute of Technology Mandi(印度理工学院曼迪分校) CSVTU Bhilai(恰蒂斯加尔邦斯瓦米·维韦卡南达技术大学比莱分校)

AI总结 提出Cranio-Diff扩散框架,通过ControlNet的颅骨条件结构引导和生物特征文本条件,从2D X射线颅骨图像重建跨模态人脸,解决结构身份对齐问题,在120名受试者的颅面数据集上优于现有方法。

Comments 14 pages, 7 figures, BMVC 2026 conference

详情
AI中文摘要

最先进的生成模型,如CycleGAN、Pix2Pix和扩散模型,在人脸生成任务中表现出色。然而,在从颅骨(X射线)到人脸(光学)域的跨模态颅面重建中,由于跨模态结构身份对齐不匹配,它们无法有效捕获跨模态语义信息。为解决此问题,我们提出Cranio-Diff,一种基于扩散的框架,用于从2D X射线颅骨图像进行跨域颅面重建。该方法通过ControlNet集成颅骨条件结构引导和生物特征文本条件,生成与给定颅骨在语义和结构上更对齐的人脸。所提出的Cranio-Diff方法在从120名受试者的侧位和正位X射线扫描获得的颅面数据集上进行了评估。为实现受控评估,每张人脸图像在三个年龄组(25、45、65)和三个BMI变化(-10%、基线、+10%)下合成,共产生4320个配对样本。据我们所知,这是唯一具有此规模的X射线-人脸数据集。大量实验表明,所提方法在生成图像质量和检索任务上均优于近期现有方法。最后,为评估所提方法的性能,我们使用FID、IS、SSIM、LPIPS、PSNR和ArcFace分数评估了生成图像的质量。此外,使用recall@k、mAP@k和MRR@k评估了检索性能。获得的实验结果表明,所提方法可作为法医调查中的辅助工具。

英文摘要

State-of-the-art generative models such as CycleGAN, Pix2Pix, and diffusion models have demonstrated remarkable performance in face generation. However, they fail to effectively capture cross-modality semantic information in craniofacial reconstruction when translating from the skull (X-ray) to the face (optical) domain, due to a mismatch in the alignment of structural identity across modalities. To address this issue, we propose Cranio-Diff, a diffusion-based framework for cross-domain craniofacial reconstruction from 2D X-ray skull images. The proposed approach integrates skull-conditioned structural guidance through ControlNet with biometric text conditioning to generate a face that is more semantically and structurally aligned with the given skull. Cranio-Diff is evaluated on a skull-face dataset obtained from X-ray scans of 120 subjects in lateral and frontal views. To enable controlled evaluation, each face image is synthesised across three age groups (25, 45, 65) and three BMI variations (-10%, baseline, and +10%), yielding 4320 paired samples. To the best of our knowledge, this is the only X-ray-face dataset of this magnitude. Extensive experiments show that the proposed method outperforms recent approaches in both generated image quality and retrieval. Generated image quality is evaluated using FID, IS, SSIM, LPIPS, PSNR, and ArcFace score, and retrieval performance is evaluated using recall@k, mAP@k, and MRR@k. The experimental results suggest that the proposed method can serve as an auxiliary tool in forensic investigations.

2606.07717 2026-06-09 eess.IV cs.AI cs.CV 交叉投稿

Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps

多平面2D-U-Net分割3D-CT腹部器官,辅以空间出现图

Daria Kern, Negar Chabi, Souraj Adhikary, Andre Mastmeyer

发表机构 * Glasgow Caledonian University School of Science & Engineering(格拉斯哥卡里多尼亚大学科学与工程学院) Jade University of Applied Sciences Department of Engineering & Medical Technology(雅德应用科学大学工程与医疗技术系)

AI总结 提出轻量级2D-U-Net框架,结合粗到细分割、多平面预测和模糊3D空间图,在80个CT扫描中使Dice系数提升约4%。

Comments 11 pages, 9 figures, 1 table, http://www.wscg.eu/

详情
AI中文摘要

本工作提出一个基于2D-U-Net的轻量级框架,用于在大视野3D CT扫描中分割五个腹部器官。该方法结合了粗到细分割、来自多个解剖平面的预测以及额外的模糊3D空间图,这些空间图提供解剖位置线索以提高分割精度。我们结合了由空间出现图增强的多平面2D-U-Net模型。该方法包括两个主要阶段。首先,通过使用2D-U-Net轴向遍历整个扫描并确定5个目标腹部器官的x-y-z最小和最大范围来检测腹部感兴趣区域。其次,我们在前一阶段的边界内使用空间出现图来增强我们的多平面2D-U-Net架构。该方法在来自各种公共来源的80个CT扫描上进行评估。结果显示,与未使用空间出现图训练的相同模型相比,Dice系数最大提升约4%。

英文摘要

This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The method combines coarse-to-fine segmentation, predictions from multiple anatomical planes, and additional fuzzy 3D spatial maps that provide anatomical location cues to improve segmentation accuracy. We combine multi-planar 2D-U-Net models augmented by a spatial occurrence map. The approach involves two main stages. First, the abdominal volume of interest is detected by traversing the whole scan axially with a 2D-U-Net and determining the minimum and maximum x-y-z extents of the five abdominal organs of interest. Second, we use spatial occurrence maps to enhance our multi-planar 2D-U-Net architecture inside the bounds from the former stage. The method is evaluated on 80 CT scans from various public sources. The results show Dice improvements of about 4% at maximum compared to the same model trained without spatial occurrence maps.
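
The first stage described above reduces to finding the smallest axis-aligned box that contains all coarse organ predictions. A minimal sketch follows, assuming the coarse segmentation is available as a nested list `mask[z][y][x]`; the function name and data layout are illustrative assumptions, not the authors' implementation.

```python
def bounding_extents(mask):
    """mask[z][y][x] is truthy wherever any organ of interest was predicted.
    Returns ((zmin, zmax), (ymin, ymax), (xmin, xmax)), or None if the mask is empty."""
    coords = [(z, y, x)
              for z, plane in enumerate(mask)
              for y, row in enumerate(plane)
              for x, value in enumerate(row)
              if value]
    if not coords:
        return None
    zs, ys, xs = zip(*coords)
    return (min(zs), max(zs)), (min(ys), max(ys)), (min(xs), max(xs))

# Toy 3x3x3 volume with two foreground voxels.
volume = [[[0] * 3 for _ in range(3)] for _ in range(3)]
volume[1][0][2] = 1
volume[2][1][1] = 1
extents = bounding_extents(volume)
```

The second-stage models would then only need to process the sub-volume inside these extents, typically with a small safety margin added on each side.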

2606.08712 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

SNR-ST-Mix: 基于样本特异性邻域回归混合增强的空间转录组学深度神经网络插补

Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou

发表机构 * Northwestern University(西北大学) Yale University(耶鲁大学)

AI总结 针对空间转录组数据噪声大、分辨率低的问题,提出SNR-ST-Mix数据增强框架,通过空间邻域约束和表达相似性加权混合生成生物合理的合成样本,提升深度神经网络插补性能。

Comments 19 pages, 4 figures, 3 tables

详情
AI中文摘要

目的:空间转录组学(ST)能够在组织背景下测量基因表达。然而,这些测量通常噪声大、分辨率低且采样稀疏,限制了精细空间结构的恢复。深度神经网络已成为从组织学进行表达插补的强大工具,但其性能仍受限于有限的样本量和缺乏生物学信息的增强。大多数现有的学习增强策略是为分类任务而非回归任务设计的,忽略了空间和转录组关系,导致生物上不合理的插值,阻碍了预测性能。方法:为解决这些限制,我们提出SNR-ST-Mix,一种专门为ST数据设计的几何和表达感知数据增强框架。它将混合限制在点的k个最近空间邻域内,并基于表达相似性自适应加权插值系数,生成保留局部生物结构同时确保空间平滑性的增强样本。这种双重条件化产生合成样本,扩展了有效训练流形,促进了泛化,并在样本特异性训练下增强了预测稳定性。结果:使用各种组织类型的大量实验表明,SNR-ST-Mix在不需要架构更改或额外计算的情况下,始终优于传统增强方法。结论:SNR-ST-Mix为空间转录组学回归任务提供了一种有效且生物学原理的增强策略。通过显式利用空间几何和转录组相似性,它扩展了有效训练流形,并在不增加模型复杂度的情况下提高了预测性能。

英文摘要

Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most of the existing augmentation strategies for learning are designed for classification tasks rather than regression, which neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.
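
The two constraints named in the Approach paragraph (mixing only within a spot's k nearest spatial neighbours, and weighting by expression similarity) can be sketched as follows. This is a loose illustration under stated assumptions: the softmax weighting, the Beta-distributed mixing coefficient, the `temperature` and `alpha` parameters, and all function names are hypothetical choices, not the formulation used in the paper.

```python
import math
import random

def knn_indices(coords, i, k):
    """Indices of the k spots closest to spot i in tissue coordinates (excluding i)."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]

def similarity_weights(expr, i, neighbors, temperature=1.0):
    """Softmax over negative expression distance: similar neighbours get more weight."""
    scores = [-math.dist(expr[i], expr[j]) / temperature for j in neighbors]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_spot(features, expr, coords, i, k=2, alpha=0.4, rng=random):
    """Mix spot i with one spatial neighbour, chosen in proportion to expression
    similarity. Returns (mixed_features, mixed_expression, partner_index, lam)."""
    neighbors = knn_indices(coords, i, k)
    weights = similarity_weights(expr, i, neighbors)
    j = rng.choices(neighbors, weights=weights, k=1)[0]
    lam = rng.betavariate(alpha, alpha)  # standard mixup coefficient (assumed)

    def blend(a, b):
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    return blend(features[i], features[j]), blend(expr[i], expr[j]), j, lam

coords = [(0, 0), (1, 0), (0, 1), (10, 10)]          # spot positions
expr = [[1.0, 0.0], [1.0, 0.1], [0.0, 5.0], [9.0, 9.0]]  # target expression
feats = [[0.0], [1.0], [2.0], [3.0]]                 # image-derived features
neighbors = knn_indices(coords, 0, 2)
weights = similarity_weights(expr, 0, neighbors)
mixed_f, mixed_e, partner, lam = mix_spot(feats, expr, coords, 0, rng=random.Random(0))
```

Because the distant spot at `(10, 10)` is never among the nearest neighbours of spot 0, it cannot be selected as a mixing partner, which is the property that keeps interpolations spatially local.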

2505.07573 2026-06-09 cs.CV cs.AI 版本更新

Robust Renal Mass Segmentation on CT: A Validation Study of an AI-Based Framework

基于CT的肾脏肿块鲁棒分割:AI框架的验证研究

Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering

发表机构 * Department of Medical Imaging, Radboudumc, Nijmegen, The Netherlands(医学影像部门,Radboudumc,尼姆维根,荷兰) Department of Radiology, Charité - Universitätsmedizin Berlin, Berlin, Germany(放射科,Charité - 大学医学中心柏林,柏林,德国) Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany(神经放射科,Charité - 大学医学中心柏林,柏林,德国) Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany(诊断和介入放射科,Klinikum rechts der Isar,TUM大学医院,慕尼黑技术大学,慕尼黑,德国) Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, TUM University Hospital, Technical University of Munich, Munich, Germany(心血管放射学和核医学部,德国心脏中心,TUM大学医院,慕尼黑技术大学,慕尼黑,德国) Fraunhofer MEVIS, Bremen, Germany(Fraunhofer MEVIS,不莱梅,德国)

AI总结 提出Renal-Net,基于nnU-Net和公开数据训练,在CT图像上实现肾脏肿块分割,验证显示优于现有模型且鲁棒性强。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:012. 23 pages, 12 figures

详情
Journal ref
Machine.Learning.for.Biomedical.Imaging. 2026 (2026)
AI中文摘要

肾脏肿块分割在临床工作流中具有重要潜力,尤其是在需要定量评估的场景中。肾脏体积可作为肾脏疾病的重要生物标志物,其体积变化与肾功能直接相关。目前,临床实践常依赖主观视觉评估来评价肾脏大小和肾脏病变(包括肿瘤和囊肿),这些病变通常根据直径、体积和解剖位置进行分期。为了支持更客观和可重复的方法,本研究旨在开发一个鲁棒且经过充分验证的肾脏肿块分割算法,命名为Renal-Net。我们使用公开可用的训练数据集,并利用最先进的医学图像分割框架nnU-Net。使用专有和公开测试数据集进行验证,分割性能通过Dice系数和95百分位Hausdorff距离量化。此外,我们根据患者性别、年龄、CT对比相和肿瘤组织学亚型分析亚组鲁棒性。我们的结果表明,仅使用公开数据训练的分割算法能有效泛化到外部测试集,并在所有测试数据集上优于现有最先进模型。亚组分析显示一致的高性能,表明强鲁棒性和可靠性。开发的算法和相关代码可在以下网址公开获取:https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation。

英文摘要

Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and kidney lesions, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated renal mass segmentation algorithm, named Renal-Net. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
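
The Dice coefficient used here to quantify segmentation performance is defined as twice the overlap divided by the sum of the two mask sizes. A minimal sketch on flattened binary masks follows; the function name is illustrative, and returning 1.0 when both masks are empty is an assumed convention rather than something stated in the abstract.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if size == 0 else 2.0 * intersection / size

half_overlap = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
identical = dice_coefficient([0, 1, 1], [0, 1, 1])
```

Dice is insensitive to where boundary errors occur, which is why it is commonly paired with a distance-based measure such as the 95th percentile Hausdorff distance.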

2508.20734 2026-06-09 cs.CV 版本更新

CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network

CardioMorphNet: 使用形状引导的贝叶斯循环深度网络进行心脏运动预测

Reza Akbari Movahed, Abuzar Rezaee, Arezoo Zakeri, Colin Berry, Edmond S. L. Ho, Ali Gooya

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出CardioMorphNet,一种基于循环变分自编码器和贝叶斯公式的3D心脏形状引导可变形配准框架,通过递归配准分割图避免强度相似性损失,在心脏运动估计中优于现有方法,并具有更低的不确定性。

Comments Published in Medical Image Analysis. Updated to match the final published version

详情
Journal ref
Medical Image Analysis, vol. 113, p. 104149, 2026
AI中文摘要

从电影心脏磁共振(CMR)图像中准确估计心脏运动对于评估心脏功能和检测其异常至关重要。现有方法通常难以准确捕捉心脏运动,因为它们依赖于基于强度的图像配准相似性损失,可能忽略心脏解剖区域。为了解决这个问题,我们提出了CardioMorphNet,一个用于使用短轴(SAX)CMR图像进行3D心脏形状引导可变形配准的循环贝叶斯深度学习框架。它采用循环变分自编码器来建模心脏周期中的时空依赖性,以及两个用于双心室分割和运动估计的后验模型。从贝叶斯公式导出的损失函数通过递归配准分割图来引导框架关注解剖区域,而不使用基于强度的图像配准相似性损失,同时利用顺序SAX体积和时空特征。贝叶斯建模还使得能够计算估计运动场的不确定性图。通过在UK Biobank和M&M数据集上验证,将扭曲的掩模形状与真实掩模进行比较,CardioMorphNet在心脏运动估计中表现出优越的性能,优于最先进的方法。不确定性评估表明,与其他基于概率的心脏配准方法相比,它在心脏区域估计的运动场上产生更低的不确定性值,表明其预测具有更高的置信度。此外,临床指标提取评估显示,CardioMorphNet比其他方法更准确地估计临床指标。

英文摘要

Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to accurately capture heart motion because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies across the cardiac cycle, along with two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables the computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank and M&M datasets by comparing warped mask shapes with ground-truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions. In addition, the clinical indices extraction assessment shows that CardioMorphNet estimates the clinical indices more accurately than other approaches.

2509.15017 2026-06-09 cs.CV 版本更新

No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation

不遗漏任何模态:通过知识蒸馏适应缺失模态的脑肿瘤分割

Shenghao Zhu, Yifei Chen, Weihong Chen, Shuo Jiang, Guanyu Zhou, Yuanhan Wang, Feiwei Qin, Changmiao Wang, Qiyuan Tian

发表机构 * Medical Image Analysis(医学影像分析)

AI总结 提出AdaMM框架,利用知识蒸馏和三个协同模块处理多模态MRI中模态缺失问题,在多个数据集上显著提升分割精度和鲁棒性。

Comments 51 pages, 11 figures

详情
AI中文摘要

准确的脑肿瘤分割对于术前评估和个性化治疗至关重要。多模态MRI因其能够捕捉不同序列中互补的肿瘤特征而被广泛使用。然而,在临床实践中,模态缺失很常见,限制了依赖完整输入的现有深度学习方法的鲁棒性和泛化能力,尤其是在非主导模态组合下。为了解决这个问题,我们提出了AdaMM,一个针对缺失模态场景定制的多模态脑肿瘤分割框架,以知识蒸馏为核心,由三个协同模块组成。图引导自适应细化模块显式建模通用特征与模态特定特征之间的语义关联,增强对模态缺失的适应性。双瓶颈蒸馏模块通过全局风格匹配和对抗特征对齐,将结构和纹理知识从教师模型转移到学生模型。病变存在引导可靠性模块通过辅助分类任务预测病变类型的先验概率,有效抑制不完整输入下的假阳性。在Pretreat-MetsToBrain-Masks和BraTS 2018、2024数据集上的大量实验表明,AdaMM始终优于现有方法,在单模态和弱模态配置下表现出更优的分割精度和鲁棒性。此外,我们对六类缺失模态策略进行了系统评估,支持知识蒸馏的优越性,并为方法选择和未来研究提供了实用指导。我们的源代码可在以下网址获取:https://github.com/Quanato607/AdaMM。

英文摘要

Accurate brain tumor segmentation is essential for preoperative evaluation and personalized treatment. Multi-modal MRI is widely used due to its ability to capture complementary tumor features across different sequences. However, in clinical practice, missing modalities are common, limiting the robustness and generalizability of existing deep learning methods that rely on complete inputs, especially under non-dominant modality combinations. To address this, we propose AdaMM, a multi-modal brain tumor segmentation framework tailored for missing-modality scenarios, centered on knowledge distillation and composed of three synergistic modules. The Graph-guided Adaptive Refinement Module explicitly models semantic associations between generalizable and modality-specific features, enhancing adaptability to modality absence. The Bi-Bottleneck Distillation Module transfers structural and textural knowledge from teacher to student models via global style matching and adversarial feature alignment. The Lesion-Presence-Guided Reliability Module predicts prior probabilities of lesion types through an auxiliary classification task, effectively suppressing false positives under incomplete inputs. Extensive experiments on the Pretreat-MetsToBrain-Masks and BraTS 2018, 2024 datasets demonstrate that AdaMM consistently outperforms existing methods, exhibiting superior segmentation accuracy and robustness, particularly in single-modality and weak-modality configurations. In addition, we conduct a systematic evaluation of six categories of missing-modality strategies, supporting the superiority of knowledge distillation and offering practical guidance for method selection and future research. Our source code is available at https://github.com/Quanato607/AdaMM.

2511.18454 2026-06-09 cs.CV cs.AI 版本更新

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

AttnRegDeepLab: 一种用于可解释胚胎碎片分级的双阶段解耦框架

Ming-Jhe Lee, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee

发表机构 * Department of Electrical Engineering(电气工程系) AI Research Center(人工智能研究中心) National Taiwan Ocean University(国立台湾海洋大学) Department of Obstetrics, Gynecology(妇产科部) Gynecology, CSMU Hospital, Taichung, Taiwan(台中市立医院妇产科)

AI总结 提出AttnRegDeepLab框架,通过双分支多任务学习、注意力门控、多尺度回归头和两阶段解耦训练,实现胚胎碎片分级的高精度与可解释性。

Comments 6 pages, 5 figures

详情
AI中文摘要

胚胎碎片是评估体外受精(IVF)发育潜力的关键形态学指标。然而,手动分级主观且低效,而现有的深度学习解决方案往往缺乏临床可解释性,或在分割区域估计中遭受累积误差。为了解决这些问题,本研究提出了AttnRegDeepLab(注意力引导回归DeepLab),一种以双分支多任务学习(MTL)为特征的框架。通过将注意力门集成到其跳跃连接中,修改了原始的DeepLabV3+解码器,显式抑制细胞质噪声以保留轮廓细节。此外,引入了一个多尺度回归头,并采用特征注入机制将全局分级先验传播到分割任务中,纠正系统量化误差。提出了一种两阶段解耦训练策略来解决MTL中的梯度冲突。同时,设计了一种基于范围的损失以利用弱标记数据。我们的方法在保持出色分割精度(Dice系数=0.729)的同时实现了稳健的分级精度,这与可能以牺牲轮廓完整性为代价最小化分级误差的端到端方法形成对比。这项工作提供了一种在视觉保真度和量化精度之间取得平衡的临床可解释解决方案。

英文摘要

Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models struggle with clinical interpretability and segmentation errors. We propose AttnRegDeepLab, a Multi-Task Learning (MTL) framework designed to solve these challenges. The model enhances a DeepLabV3+ decoder with Attention Gates to filter out cytoplasmic noise and retain sharp contour details. It also introduces a Multi-Scale Regression Head with Feature Injection, guiding the segmentation process with global grading priors to eliminate systematic area estimation errors. Based on a two-stage decoupled training strategy and a range-based loss for weakly labeled data, our method resolves MTL gradient conflicts. AttnRegDeepLab yields high grading precision and excellent segmentation quality (Dice coefficient = 0.729), avoiding the trade-off between contour integrity and grading accuracy seen under standard joint optimization. This provides a reliable, clinically interpretable tool balancing visual and quantitative accuracy.
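
The abstract mentions a range-based loss for weakly labeled data without giving its form. One plausible reading, sketched here purely as an assumption, is that a weak label specifies an interval of acceptable fragmentation percentages and the loss only penalises predictions that fall outside it. The squared penalty and the function name are hypothetical.

```python
def range_loss(pred, low, high):
    """Zero when pred lies inside [low, high]; otherwise the squared
    distance to the nearest bound. Assumed form, not the paper's definition."""
    if low > high:
        raise ValueError("low must not exceed high")
    if pred < low:
        return (low - pred) ** 2
    if pred > high:
        return (pred - high) ** 2
    return 0.0

# A weak label saying fragmentation is between 10% and 25%.
inside = range_loss(15.0, 10.0, 25.0)
below = range_loss(5.0, 10.0, 25.0)
above = range_loss(30.0, 10.0, 25.0)
```

A loss of this shape lets coarse grade labels supervise a continuous regression head without forcing it toward an arbitrary point inside the grade interval.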

2511.18676 2026-06-09 cs.CV cs.AI 版本更新

MedVision: Benchmarking Quantitative Medical Image Analysis

MedVision:定量医学图像分析的基准测试

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

发表机构 * University of Edinburgh(爱丁堡大学) Queen Mary University of London(伦敦大学玛丽女王学院)

AI总结 针对当前医学视觉语言模型缺乏定量推理能力的问题,提出MedVision数据集和基准,涵盖22个公共数据集、3080万图像-标注对,通过监督和强化微调显著提升检测、肿瘤/病变大小估计和角度/距离测量性能。

Comments 22 pages, 13 figures, 14 tables

详情
AI中文摘要

当前医学领域的视觉-语言模型(VLM)主要用于分类问答(如“这是正常还是异常?”)或定性描述任务。然而,临床决策通常依赖于定量评估,例如测量肿瘤大小或关节角度,医生据此得出自己的诊断结论。这种定量推理能力在现有VLM中尚未得到充分探索和支持。在这项工作中,我们引入了MedVision,这是一个专门设计用于评估和改进VLM在定量医学图像分析中的大规模数据集和基准。MedVision涵盖22个公共数据集,涉及多种解剖结构和模态,包含3080万个图像-标注对。我们聚焦于三个代表性的定量任务:(1)解剖结构和异常检测,(2)肿瘤/病变(T/L)大小估计,以及(3)角度/距离(A/D)测量。我们表明,当前现成的VLM在这些任务上表现不佳。然而,在MedVision上进行监督和强化微调显著提升了检测、T/L估计和A/D测量的性能。MedVision为开发具有稳健定量推理能力的医学图像分析VLM奠定了基础。

英文摘要

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. We show that current off-the-shelf VLMs perform poorly on these tasks. However, supervised and reinforcement fine-tuning on MedVision significantly enhances performance across detection, T/L estimation, and A/D measurement. MedVision provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging.

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE:基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile(智利天主教大学) CENIA iHEALTH KAUST(阿卜杜拉国王科技大学)

AI总结 提出CURE框架,通过课程学习动态调整多任务训练,提升医学报告生成的视觉接地准确性和事实一致性,无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289
AI中文摘要

医学视觉语言模型可以自动生成放射学报告,但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐,导致不可靠或弱接地的预测。我们提出CURE,一个错误感知的课程学习框架,无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上,使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样,强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU,报告质量提高了+0.192 CXRFEScore,并将幻觉减少了18.6%。CURE是一个数据高效的框架,增强了接地准确性和报告可靠性。代码可从 https://github.com/PabloMessina/CURE 获取,模型权重可从 https://huggingface.co/pamessina/medgemma-4b-it-cure 获取。

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
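
The idea of adjusting sampling based on model performance can be sketched with a simple reweighting rule: groups on which the model currently scores worse are drawn more often. The sketch below is an assumption-laden illustration, not CURE's actual procedure; the grouping by task, the `1 - score` weighting, the `floor` parameter, and the function names are all hypothetical.

```python
import random

def sampling_weights(scores, floor=0.05):
    """Map per-group validation scores in [0, 1] to sampling probabilities
    that favour weaker groups. floor keeps every group reachable."""
    raw = {name: max(1.0 - score, floor) for name, score in scores.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}

def sample_group(scores, rng=random):
    """Draw one group name according to the error-aware probabilities."""
    weights = sampling_weights(scores)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

current_scores = {
    "phrase_grounding": 0.9,
    "grounded_report": 0.6,
    "anatomy_grounded_report": 0.5,
}
probs = sampling_weights(current_scores)
next_group = sample_group(current_scores, rng=random.Random(0))
```

Recomputing the weights after each evaluation round is what would make such a curriculum dynamic: as a group improves, its share of the training batches shrinks.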

2601.20503 2026-06-09 cs.CV cs.AI 版本更新

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

用于FLAIR MRI中白质高信号和卒中病变分割的部分标注数据集训练策略比较评估

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 本研究系统评估了六种利用部分标注数据训练联合分割白质高信号和缺血性卒中病变模型的策略,发现伪标签法最有效,可提升模型性能并支持大规模临床研究。

详情
AI中文摘要

白质高信号(WMH)和缺血性卒中病变(ISL)是脑小血管疾病(SVD)的关键影像生物标志物,可在磁共振成像(MRI)上检测到。开发稳健的深度学习模型来自动分割和区分这些病理仍然具有挑战性。具体而言,WMH和ISL常在同一受试者中共存,并在液体衰减反转恢复(FLAIR)序列上表现为视觉上混淆的高信号,使其精确勾画复杂化。为了解决完全标注队列稀缺的问题,我们系统评估了六种使用部分标注数据训练联合WMH和ISL分割模型的可行策略。我们汇集了私有和公开数据集,构建了一个包含2,052个MRI体积的大规模队列,其中分别有1,341和1,152个体积包含WMH和ISL的真实标注。我们的分析表明,多种策略有效利用部分标注数据提升整体模型性能,其中伪标签法是最有效的方法。该模型表现出一致的WMH分割策略,并成功检测到大多数FLAIR阳性的ISL。这些发现证明了使用部分标注数据开发可靠自动分割工具的可行性,可支持持续的SVD监测和大规模临床研究中的高通量生物标志物提取。

英文摘要

White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) detectable on magnetic resonance imaging (MRI). The development of robust deep learning models to automatically segment and differentiate these pathologies remains challenging. Specifically, WMH and ISL frequently co-occur within the same subject and present as visually confounding hyperintensities on fluid-attenuated inversion recovery (FLAIR) sequences, complicating their accurate delineation. To address the scarcity of fully annotated cohorts, we systematically evaluated six accessible strategies for training a joint WMH and ISL segmentation model using partially labelled data. We aggregated privately held and publicly available datasets to curate a large-scale cohort of 2,052 MRI volumes, of which 1,341 and 1,152 volumes contained ground truth annotations for WMH and ISL, respectively. Our analysis indicates that multiple strategies effectively leverage partially labelled data to enhance overall model performance, with pseudolabelling emerging as the most effective approach. This model exhibited a consistent WMH segmentation policy and successfully detected the majority of FLAIR-positive ISL. These findings demonstrate the viability of using partially labelled data to develop reliable automated segmentation tools, which can support ongoing SVD monitoring and high-throughput biomarker extraction for large-scale clinical research.

2602.17337 2026-06-09 cs.CV 版本更新

Polaffini: A feature-based approach for robust affine and polyaffine image registration

Polaffini:一种基于特征的鲁棒仿射和多仿射(polyaffine)图像配准方法

Antoine Legouhy, Cosimo Campo, Ross Callaghan, Hojjat Azadbakht, Hui Zhang

发表机构 * Hawkes Institute & Department of Computer Science, University College London, London, UK(英国伦敦,伦敦大学学院霍克斯研究所及计算机科学系) Institut Pasteur, Université Paris Cité, Unité de Neuroanatomie Appliquée et Théorique(巴斯德研究所,巴黎西岱大学,应用与理论神经解剖学单元) AINOSTICS ltd., Manchester, UK(英国曼彻斯特,AINOSTICS有限公司)

AI总结 提出Polaffini框架,利用深度学习分割模型得到的解剖区域质心作为对应点,通过闭式解实现全局和局部仿射匹配,生成平滑度可调、从仿射到多仿射(polyaffine)的变换,在结构对齐和下游非线性配准初始化上优于常用的基于强度的方法。

Comments associated github repo: https://github.com/CIG-UCL/polaffini

详情
AI中文摘要

在这项工作中,我们提出了Polaffini,一个稳健且通用的、以解剖结构为依据的配准框架。医学图像配准目前以基于强度的方法为主,这些方法依赖于对齐质量的替代度量。相比之下,基于特征的方法通过识别明确的解剖对应关系进行配准,理论上更为理想,但由于难以可靠地提取特征而基本不再受青睐。然而,得益于深度学习的近期进展,这些困难现已在很大程度上得到克服:预训练的分割模型能够即时提供可靠、精细的解剖结构勾画。我们旨在证明这些进展可用于构建新的、以解剖结构为依据的图像配准算法。为此,我们提出Polaffini,它以一种特别简单的方式从这些分割区域中获得具有一一对应关系的解剖特征点:提取各区域的质心。利用这些特征点,可以通过闭式解高效地完成全局和局部仿射匹配。这些匹配结果用于生成平滑度可调、从仿射到多仿射(polyaffine)的整体变换。多仿射变换的自由度可远多于仿射变换,从而实现更精细的对齐,并且它们嵌入对数-欧几里得框架,保证了微分同胚性质。Polaffini既可用于独立配准,也可作为后续非线性配准的预对齐;我们将其与常用的基于强度的配准技术进行了对比评估。结果表明,Polaffini在结构对齐方面优于对比方法,并为下游非线性配准提供了更好的初始化。Polaffini快速、稳健且准确,特别适合集成到医学图像处理流程中。

英文摘要

In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
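As a rough illustration of what "affine matching via closed-form solutions" from matched centroids can look like, the sketch below fits a single global affine map by ordinary least squares. It is a generic textbook estimator, not code from the Polaffini repository; the function name `fit_affine` is hypothetical, and the actual method may weight points or combine local fits in the log-Euclidean framework in ways not shown here.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map with dst ≈ A @ src + t, from matched points.

    src, dst: arrays of shape (N, d); row i of src corresponds to row i of dst
    (e.g. centroids of the same anatomical label in two images).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Homogeneous coordinates: a trailing 1 lets the translation be solved jointly.
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)  # shape (d + 1, d)
    A = params[:-1].T  # linear part, shape (d, d)
    t = params[-1]     # translation, shape (d,)
    return A, t
```

With at least d + 1 non-degenerate correspondences in d dimensions the solution is unique; with more points it is the least-squares compromise, which is part of why many segmented regions help robustness.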

2602.22919 2026-06-09 cs.CV 版本更新

Chain of Flow: ECG-Conditioned 4D Cardiac Cine Generation from Patient-Specific Anatomical Anchor

流动链:基于患者特定解剖锚点的ECG条件4D心脏电影生成

Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang

发表机构 * School of Engineering, College of Engineering and Physical Sciences, University of Birmingham(伯明翰大学工程与物理科学学院工程学院) William Harvey Research Institute, NIHR Barts Biomedical Research Centre, Queen Mary University London(伦敦玛丽女王大学威廉·哈维研究所,NIHR巴茨生物医学研究中心) Barts Heart Centre, St Bartholomew’s Hospital, Barts Health NHS Trust(巴茨健康NHS信托,圣巴塞洛缪医院巴茨心脏中心) Division of Cardiology, Johns Hopkins University School of Medicine(约翰斯·霍普金斯大学医学院心脏病科)

AI总结 提出Chain of Flow (COF)框架,利用患者特定MRI和当前ECG生成4D心脏电影,在UK Biobank上实现高图像保真度和下游功能性能。

详情
AI中文摘要

心脏电影磁共振成像(MRI)是心脏功能评估的核心,然而在分析时未必能直接获得完整的当前电影序列。我们提出Chain of Flow(COF),一个以心电图(ECG)为条件的框架,结合患者特定MRI和当前ECG,生成针对个体受试者的4D心脏电影。在UK Biobank数据集上,COF在同次就诊的共享可评估基准上实现了较高的图像级保真度和面向功能的下游性能。多切片和多分辨率分析表明,在整个短轴堆栈和不同采集分辨率下,结构生成质量保持稳定。针对重采样输入MRI相位的受控相位鲁棒性分析进一步表明,当目标MRI相位未被直接观测到时,同次就诊的代理证据支持使用患者特定MRI加当前ECG。跨次就诊路线提供了探索性的序列证据,其中面向当前状态的感兴趣区域读出增益最为明显。疾病类别功能审计和病例级容积轨迹证据审查进一步界定了当前“患者特定MRI加ECG”方案在解剖感知的下游心脏分析中保持稳定的范围。代码见 https://anonymous.4open.science/r/COF-paper-release-C88B 。

英文摘要

Cardiac cine magnetic resonance imaging (MRI) is central to functional cardiac assessment, yet a full current cine sequence may not always be directly available at the point of analysis. We introduce Chain of Flow (COF), an electrocardiography (ECG)-conditioned framework that combines patient-specific MRI and current ECG for subject-specific 4D cardiac cine generation. On the UK Biobank dataset, COF achieves strong image-level fidelity and downstream function-oriented performance on a shared same-visit evaluable benchmark. Multi-slice and multi-resolution analyses indicate stable structural generation quality across the short-axis stack and heterogeneous acquisition resolutions. Controlled phase-robustness analyses across resampled input MRI phases further provide same-visit proxy support for patient-specific MRI plus current ECG when a target MRI phase is not directly observed. A cross-visit route provides exploratory serial evidence, with the clearest gains in current-facing region-of-interest readout. Disease-category functional audits and case-level volume-trajectory evidence review further delineate where the current patient-specific MRI plus ECG formulation remains stable for anatomy-aware downstream cardiac analysis. Code is available at https://anonymous.4open.science/r/COF-paper-release-C88B.

2603.24388 2026-06-09 cs.CV 版本更新

Causal Transfer in Medical Image Analysis

医学图像分析中的因果迁移

Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 本文探讨了医学图像分析中因果迁移方法,整合因果推理与跨域表示学习,以解决领域偏移问题,提升临床AI的鲁棒性和泛化能力。

详情
AI中文摘要

医学成像模型在跨医院、扫描仪、人群或成像协议部署时,常因领域偏移而失效,限制了其临床可靠性。迁移学习和领域适应从统计角度处理此类偏移,但它们往往依赖于在条件变化时会失效的虚假相关性。相比之下,因果推断提供了一种有原则的方法,用于识别在不同环境中保持稳定的不变机制。本综述介绍并系统梳理了医学图像分析中的因果迁移学习(CTL)。该范式将因果推理与跨域表示学习相结合,以实现稳健且可泛化的临床AI。我们将领域偏移视为因果问题,并分析结构因果模型、不变风险最小化和反事实推理如何嵌入迁移学习流程。我们考察了涵盖分类、分割、重建、异常检测和多模态成像的研究,并按任务、偏移类型和因果假设对其进行归类。我们提出了一个统一的分类法,将因果框架与迁移机制联系起来。我们进一步总结了数据集、基准测试和实证收益,重点说明因果迁移在何时以及为何优于基于相关性的领域适应。最后,我们讨论了CTL如何在多机构和联邦学习场景中支持公平性、鲁棒性和可信部署,并概述了实现临床可靠的医学成像AI所面临的开放挑战和研究方向。

英文摘要

Medical imaging models frequently fail when deployed across hospitals, scanners, populations, or imaging protocols due to domain shift, limiting their clinical reliability. While transfer learning and domain adaptation address such shifts statistically, they often rely on spurious correlations that break under changing conditions. On the other hand, causal inference provides a principled way to identify invariant mechanisms that remain stable across environments. This survey introduces and systematises Causal Transfer Learning (CTL) for medical image analysis. This paradigm integrates causal reasoning with cross-domain representation learning to enable robust and generalisable clinical AI. We frame domain shift as a causal problem and analyse how structural causal models, invariant risk minimisation, and counterfactual reasoning can be embedded within transfer learning pipelines. We review studies spanning classification, segmentation, reconstruction, anomaly detection, and multimodal imaging, and organise them by task, shift type, and causal assumption. A unified taxonomy is proposed that connects causal frameworks and transfer mechanisms. We further summarise datasets, benchmarks, and empirical gains, highlighting when and why causal transfer outperforms correlation-based domain adaptation. Finally, we discuss how CTL supports fairness, robustness, and trustworthy deployment in multi-institutional and federated settings, and outline open challenges and research directions for clinically reliable medical imaging AI.

2603.29495 2026-06-09 cs.CV cs.ET cs.HC 版本更新

All-in-One Augmented Reality Guided Head and Neck Tumor Resection

一体化增强现实引导的头颈部肿瘤切除

Yue Yang, Matthieu Chabanas, Carrie Reale, Annie Benson, Jason Slagle, Matthew Weinger, Michael Topf, Jie Ying Wu

发表机构 * Vanderbilt Institute for Surgery and Engineering(范德比尔特外科与工程研究所) Vanderbilt University Medical Center(范德比尔特大学医学中心)

AI总结 本文提出了一种一体化增强现实系统,利用HoloLens 2的深度感应和全自动无标记表面配准,将切除标本上的阳性切缘重新定位到切除床并原位显示;在体模研究中,切缘重新定位的中位误差由口头指导的14.2 mm降至3.2 mm。

详情
AI中文摘要

头颈部鳞状细胞癌中阳性切缘很常见,但术中再切除往往不够精确,因为切缘位置通常由病理科口头传达。我们提出了一种一体化增强现实(AR)系统,利用HoloLens 2的深度感应和全自动无标记表面配准,将切除标本上的阳性切缘重新定位到切除床并进行原位可视化。在一项有六名医学受训人员参与的硅胶体模研究中,无标记配准的目标配准误差与基于标记的基线相当(中位数1.8 mm vs. 1.7 mm;最大值<4 mm)。在切缘重新定位任务中,AR引导将误差从口头指导的水平(中位数14.2 mm)降低到几毫米(中位数3.2 mm),所有AR定位误差均在5 mm以内。这些结果支持无标记AR切缘引导用于提高术中再切除精度的可行性。

英文摘要

Positive margins are common in head and neck squamous cell carcinoma, yet intraoperative re-resection is often imprecise because margin locations are typically communicated verbally from pathology. We present an all-in-one augmented reality (AR) system that relocalizes positive margins from a resected specimen to the resection bed and visualizes them in situ using HoloLens 2 depth sensing and fully automated markerless surface registration. In a silicone phantom study with six medical trainees, markerless registration achieved target registration errors comparable to a marker-based baseline (median 1.8 mm vs. 1.7 mm; maximum < 4 mm). In a margin relocalization task, AR guidance reduced error from verbal guidance (median 14.2 mm) to a few millimeters (median 3.2 mm), with all AR localizations within 5 mm error. These results support the feasibility of markerless AR margin guidance for more precise intraoperative re-excision.

2604.23435 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Knee-xRAI:一种用于膝骨关节炎Kellgren-Lawrence自动分级的可解释AI框架

Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief

发表机构 * Orthopaedic Department, Faculty of Medicine UIN Syarif Hidayatullah Jakarta(雅加达沙里夫·希达亚图拉国立伊斯兰大学医学院骨科) Informatics Engineering, Institut Teknologi Sepuluh Nopember(十一月十日理工学院信息工程系) Information Technology, Universitas Muhammadiyah Yogyakarta(日惹穆罕默迪亚大学信息技术系) Industrial and Systems Engineering, King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学工业与系统工程系)

AI总结 本文提出Knee-xRAI框架,模拟临床放射学流程,分别测量关节间隙狭窄(JSN)、骨赘和软骨下骨硬化,再通过XGBoost-SHAP分类器和ConvNeXt混合预测器给出可解释的KL分级,并在8,260张源自OAI的X线片上进行了评估。

Comments 8 pages, 5 figures

详情
AI中文摘要

在X线平片上对膝骨关节炎(KOA)进行分级,不同阅片者之间的可重复性较差。Kellgren-Lawrence(KL)分级上相差一级,就可能改变手术决策,或使患者从保守治疗转为关节内注射。同时,表现优于人类阅片者的深度学习模型往往无法解释其决策。我们提出了Knee-xRAI,一个模仿临床放射学工作流程、对分级过程进行分解的流程:它分别测量关节间隙狭窄(JSN)、骨赘和软骨下骨硬化,然后将这些发现组合成可解释的KL分级。具体而言,U-Net++架构通过轮廓分割量化JSN,SE-ResNet-50多任务网络按OARSI量表对各解剖部位的骨赘进行评分,混合纹理-CNN模型对硬化进行二分类检测。该流程产生一个50维特征向量,分别由XGBoost-SHAP分类器(路径A,用于审计)和ConvNeXt混合预测器(路径B,用于部署)进行评估。在8,260张源自OAI的X线片上,JSN模块的Dice得分为0.8909,mJSW的ICC为0.8674。路径A的QWK为0.6294,AUC为0.8046,证实结构化特征向量携带了可观的诊断信号。路径B的QWK为0.8436,AUC为0.9017。SHAP分析表明JSN是主导特征,骨赘带来稳定的增量,硬化的贡献很小。移除JSN证据会使KL3-KL4的召回率大幅下降,而早期等级不受影响,这与KL诊断标准一致。Knee-xRAI使每个预测都建立在可审计的放射学测量结果链之上,在诊疗现场提供临床透明度。

英文摘要

Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-Lawrence (KL) scale can alter surgical management or redirect a patient from conservative therapy to intra-articular injection. Meanwhile, deep learning models that outperform human readers often offer no explanation for their decisions. We present Knee-xRAI, a pipeline that decomposes the grading process by mimicking clinical radiological workflows. It independently measures joint space narrowing (JSN), osteophytes, and subchondral sclerosis, then combines these findings into an explainable KL grade. Specifically, a U-Net++ architecture quantifies JSN via contour segmentation, an SE-ResNet-50 multi-task network grades osteophytes per anatomical site on the OARSI scale, and a hybrid texture-CNN detects binary sclerosis. This pipeline yields a 50-dimensional feature vector evaluated via an XGBoost-SHAP classifier (Path A, audit) and a ConvNeXt hybrid predictor (Path B, deployed). On 8,260 OAI-derived radiographs, the JSN module achieved a Dice score of 0.8909 and an mJSW ICC of 0.8674. Path A reached a QWK of 0.6294 and an AUC of 0.8046, confirming the structured feature vector carries substantial diagnostic signal. Path B achieved a QWK of 0.8436 and an AUC of 0.9017. SHAP analysis identifies JSN as the dominant feature, with osteophytes adding a consistent increment and sclerosis contributing marginally. Removing JSN evidence collapses KL3-KL4 recall while early grades remain intact, aligning with the KL diagnostic criteria. Knee-xRAI grounds every prediction in an auditable chain of measured radiographic findings, providing clinical transparency at the point of care.
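The QWK figures quoted above refer to quadratic weighted kappa, which penalises a disagreement by the squared distance between grades, so confusing KL1 with KL4 costs far more than confusing KL1 with KL2. The sketch below is a plain-Python reference computation under the standard definition; it is not the authors' evaluation code, and the function name is hypothetical.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa for integer grades in 0..n_classes-1 (KL uses 0..4)."""
    n = len(y_true)
    # Observed confusion matrix: rows are reference grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    # Undefined (division by zero) if every rating falls in one single grade.
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, which gives some context for the gap between Path A (0.6294) and Path B (0.8436).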

2605.11314 2026-06-09 cs.CV cs.AI 版本更新

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

在异质性儿科临床队列中,基于单视角视频的3D无标记运动学量化Rodda和Graham步态分类

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

发表机构 * Department of Biomedical Informatics, Emory University(埃默里大学生物医学信息学系) Shriners Children’s(施莱纳儿童医院) The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology(埃默里大学和佐治亚理工学院华莱士·H·库尔特生物医学工程系)

AI总结 本文提出了一种基于单视角视频的无标记步态分析方法,用于量化罗达和格雷厄姆步态分类中的膝踝z分数,从而在资源有限的临床环境中实现可扩展的客观步态评估。

Comments 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

详情
AI中文摘要

脑瘫(CP)是一种运动神经障碍,是儿童中最常见的终身身体残疾原因。大约75%的脑瘫儿童能够行走,准确的步态评估对于保持行走功能至关重要,这种功能在四分之一到一半的脑瘫成人中在中年时会恶化。罗达和格雷厄姆分类系统利用来自3D仪器化步态分析(3D-IGA)的踝关节和膝关节z分数来量化矢状面步态偏差,但3D-IGA成本高且仅限于专业中心,而观察性评估仅显示中等的评分者间一致性。我们开发了一种无标记步态分析流程,可以直接从单视角临床步态视频中量化罗达和格雷厄姆膝踝z分数。在1,058个双侧肢体样本(来自152名儿童的529次试验,其中88名男性,63名女性,年龄12.1±4.0岁,60种不同的主要诊断,脑瘫最为常见,n=54)中,矢状面模型在膝关节z分数上达到R²=0.80±0.02和CCC=0.89±0.02,踝关节z分数上达到R²=0.57±0.02和CCC=0.72±0.02,与3D-IGA相比。二元筛查用于过量膝关节屈曲的AUROC=0.88,正确识别了83%的受影响儿童,应用罗达和格雷厄姆规则得到7类准确率为43±1%,宏AUROC=0.78±0.01,踝关节预测误差仍然是主要瓶颈。除了横断面筛查外,连续z分数支持跨访问的纵向轨迹跟踪,为监测疾病进展和治疗反应提供定量基础,这在观察性量表中是无法实现的。这些结果证明了基于视频的z分数估计、过量屈曲筛查和纵向轨迹跟踪在资源有限的临床环境中实现可扩展、客观步态评估的可行性。

英文摘要

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieves AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yields $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.
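The abstract reports both $R^2$ and CCC. Assuming CCC here means Lin's concordance correlation coefficient, it differs from a plain correlation in that it also penalises systematic offset and scale differences between the video-based estimate and the 3D-IGA reference. A minimal sketch follows, using population (1/n) variances; the function name is hypothetical and the paper's exact estimator is not stated.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference in the denominator penalises systematic bias.
    return 2 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)
```

For example, predictions that are perfectly correlated with the reference but shifted by a constant still score below 1, which is the behaviour one wants when z-scores feed threshold-based classification rules.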

2606.00967 2026-06-09 cs.CV 版本更新

MedSyn2: Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts

MedSyn2:通过文本和语义定义的分割提示灵活控制3D CT生成

Weicheng Dai, Chenyu Wang, Binxu Li, Shantanu Ghosh, Afrooz Zandifar, Christina LeBedis, Kayhan Batmanghelich

发表机构 * Boston University School of Engineering(波士顿大学工程学院) Stanford University(斯坦福大学) University of Pittsburgh Medical Center(匹兹堡大学医学中心) Boston University School of Medicine(波士顿大学医学院)

AI总结 提出一种灵活的多模态框架,通过文本和可选分割提示控制3D CT生成,实现高分辨率、解剖一致且可控的体数据生成。

详情
AI中文摘要

体积医学图像的生成模型在医学成像中有许多应用,从数据增强到作为逆问题的先验。对于这些应用,生成具有强可控性的高分辨率3D图像至关重要,但仍极具挑战性。现有方法通常通过放射学报告作为文本提示或通过完整图像分割来控制生成。基于文本的提示虽然灵活,但对异常的位置、形状和边界的空间控制有限。相比之下,基于分割的方法接收精确的空间指导,但需要全器官标注,具有限制性。在这项工作中,我们提出了一种灵活的多模态框架,用于可控体积图像生成,支持来自放射学报告和分割提示(两者均为可选)的输入。我们的方法允许用户提供特定解剖结构或异常的分割,而无需全器官标注。分割掩膜的语义含义通过附带的文本描述指定,从而形成高度灵活且可扩展的条件机制。我们开发了一种基于改进扩散变换器的内存高效架构,该架构联合处理图像和分割标记。该模型进一步结合了门控注意力,以有效关注长放射学报告。实验表明,我们的方法实现了最先进的感知和语义分数(例如,平均FID相对改进24%),生成高分辨率解剖一致的CT体积,并在用于数据增强时提高了数据效率。放射科医生的评估进一步证实了生成图像与真实医学图像之间的强一致性。

英文摘要

Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.

2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV 版本更新

A Vision-language Framework for Comparative Reasoning in Radiology

放射学中比较推理的视觉语言框架

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Department of Radiology, Renmin Hospital of Wuhan University(武汉大学人民医院放射科) Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University(上海交通大学附属第六人民医院)

AI总结 提出一个实体感知的跨图像推理框架,通过构建大规模比较影像数据集MedReCo-DB和开发MedReCo及MedReCo-VLM模型,实现了参考病例检索和时间比较解读,显著提升了放射学比较推理性能。

详情
AI中文摘要

医学影像人工智能在孤立图像解读方面取得了强劲性能,但仍与放射学实践存在较大差距,因为诊断和随访依赖于对先前研究和类似参考病例的比较。本文我们将放射学比较形式化为一个实体感知的跨图像推理问题,并引入一个支持参考病例检索和时间比较解读的框架。我们构建了MedReCo-DB,这是一个从常规图像-报告对中派生的大规模比较影像资源,包含来自八个机构、四个国家、七种成像模态的超过16万名患者的69万余张图像。报告被分解为解剖结构、异常发现和病理状况,为实体条件检索和比较视觉问答提供监督。利用该资源,我们开发了MedReCo,一个用于可控检索临床类似病例的实体感知视觉编码器,以及MedReCo-VLM,一个用于生成性解读间隔变化的视觉语言扩展。在内部、外部和跨中心评估中,MedReCo在所有12个内部检索设置中实现了最高的Recall@1,并将外部检索平均提高了6.0个百分点。在临床易混淆的鉴别组中,它始终优于最强的基线。MedReCo-VLM在所有比较生成评估中取得了最佳性能,并在胸部X光片上将纵向随访准确性提高了14.5-46.5个百分点,在CT上提高了13.0-27.9个百分点。这些发现表明,实体感知的比较推理可以从常规临床数据中大规模学习,并可能为医学影像AI提供更符合临床的基础。

英文摘要

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
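Recall@1, the retrieval metric cited above, is commonly computed as the fraction of queries whose top-ranked result is a relevant case. The sketch below follows that common convention; the paper's exact relevance criterion (for instance, how entity matches are judged) is not given in the abstract, and the function name and toy identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k=1):
    """Fraction of queries with at least one relevant item among the top-k results.

    ranked_ids: one ranked list of retrieved case IDs per query.
    relevant_ids: one set of relevant case IDs per query, in the same order.
    """
    hits = 0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        if any(item in relevant for item in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)
```

Raising `k` can only keep the score equal or increase it, so Recall@1 is the strictest setting of this metric.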

2406.19749 2026-06-09 eess.IV cs.CV 版本更新

SPIRONet: Spatial-Frequency Learning and Graph-based Channel Interaction Network for Vessel Segmentation

SPIRONet:用于血管分割的空间-频率学习与基于图的通道交互网络

De-Xing Huang, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, Zeng-Guang Hou

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出SPIRONet,通过双空间-频率编码器、交叉注意力融合和基于图的通道交互模块,解决低信噪比、细小血管和强干扰下的血管分割难题,在五个数据集上取得最优性能。

Comments Accepted by Biomedical Signal Processing and Control. 15 Pages, 9 Figures, 13 Tables

详情
AI中文摘要

自动血管分割在下一代手术机器人介入导航系统的发展中起着关键作用。然而,当前方法在低信噪比、细小或纤细血管以及强干扰等具有挑战性的术中条件下,分割性能仍不理想。本研究提出了一种新颖的空间-频率学习与基于图的通道交互网络(SPIRONet)来解决上述问题。针对低信噪比血管外观和细小或纤细分支,采用了双空间-频率编码器,其中频率编码器捕获受局部噪声波动影响较小的全局血管连续性,而空间编码器保留精细的血管细节。进一步引入了交叉注意力融合模块,以自适应地整合这种互补的空间和频率信息。此外,为了抑制非目标血管和类血管结构的干扰,设计了基于图的通道交互模块来建模通道间的相关性,增强一致的血管相关响应,同时抑制任务无关的激活。在五个具有挑战性的数据集上的大量实验结果表明,与现有方法相比,所提方法取得了有竞争力且持续强劲的性能。例如,在CADSA、CAXF、DCA1、XCAD和ARCADE上,SPIRONet分别比最强竞争方法实现了+0.87%、+0.52%、+0.23%、+1.39%和+2.22%的IoU提升。此外,SPIRONet在512x512输入尺寸下实现了21 FPS的推理速度,满足介入场景(6-12 FPS)的实时要求。这些有希望的结果表明SPIRONet在介入导航系统中集成的潜力。代码见 https://github.com/Dxhuang-CASIA/SPIRONet 。

英文摘要

Automatic vessel segmentation plays a pivotal role in the development of next-generation interventional navigation systems for surgical robotics. However, current approaches still suffer from suboptimal segmentation performance under challenging intraoperative conditions, such as low-signal-to-noise ratio (SNR), small or slender vessels, and strong interference. In this study, a novel spatial-frequency learning and graph-based channel interaction network (SPIRONet) is proposed to address the above issues. To address low-SNR vessel appearance and small or slender branches, dual spatial-frequency encoders are utilized, where the frequency encoder captures global vessel continuity that is less affected by local noise fluctuations, while the spatial encoder preserves fine vessel details. A cross-attention fusion module is further introduced to adaptively integrate this complementary spatial and frequency information. Moreover, to suppress interference from non-target vessels and vessel-like structures, a graph-based channel interaction module is designed to model channel-wise correlations, enhancing consistent vessel-related responses while suppressing task-irrelevant activations. Extensive experimental results on five challenging datasets demonstrate that the proposed method achieves competitive and consistently strong performance compared with existing methods. For example, SPIRONet achieves IoU improvements of +0.87%, +0.52%, +0.23%, +1.39%, and +2.22% over the strongest competing methods on CADSA, CAXF, DCA1, XCAD, and ARCADE, respectively. Moreover, SPIRONet achieves an inference speed of 21 FPS with a 512x512 input size, meeting the real-time requirements of interventional scenarios (6-12 FPS). These promising results indicate SPIRONet's potential for integration into interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.
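For readers less familiar with the IoU metric behind the reported gains, the following sketch computes it for a pair of binary masks. It is a generic definition written in plain Python, not code from the SPIRONet repository, and the convention for two empty masks is an assumption.

```python
def binary_iou(pred, target):
    """Intersection over union of two binary masks (equally shaped nested lists of 0/1)."""
    intersection = 0
    union = 0
    for row_pred, row_target in zip(pred, target):
        for p, t in zip(row_pred, row_target):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union
```

Because thin vessels occupy few pixels, a handful of missed pixels moves IoU noticeably, which is one reason sub-percent differences between methods are still reported.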

2501.11755 2026-06-09 eess.IV cs.CV 版本更新

A generalizable 3D framework and model for self-supervised learning in medical imaging

一种通用的3D框架和模型用于医学影像中的自监督学习

Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G. Krishnan, Anne L. Martel, Maged Goubran

发表机构 * Department of Medical Biophysics, University of Toronto(多伦多大学医学生物物理学系) Department of Computer Science, University of Toronto(多伦多大学计算机科学系) Institute for Aerospace Studies, University of Toronto(多伦多大学航空航天研究所) Physical Sciences Platform, Sunnybrook Research Institute(森尼布鲁克研究所物理科学平台) Vector Institute, Toronto(多伦多向量研究所) Department of Laboratory Medicine and Pathobiology, University of Toronto(多伦多大学检验医学与病理生物学系) Hurvitz Brain Sciences, Sunnybrook Health Sciences Centre(森尼布鲁克健康科学中心Hurvitz脑科学项目) Harquail Centre for Neuromodulation, Sunnybrook Health Sciences Centre(森尼布鲁克健康科学中心Harquail神经调控中心)

AI总结 本文提出3DINO方法,基于大规模多模态数据集预训练出通用医学影像模型3DINO-ViT,验证其在多种医学影像分割和分类任务中的泛化能力,优于现有方法。

Comments Published in npj Digital Medicine

详情
AI中文摘要

当前用于3D医学影像的自监督学习方法依赖简单的前置任务(pretext)设计以及特定器官或模态的数据集,限制了其泛化性和可扩展性。我们提出了3DINO,一种适配3D数据集的前沿自监督学习(SSL)方法,并用它在一个规模极大的多模态、多器官数据集上预训练了通用医学影像模型3DINO-ViT,该数据集包含来自10多个器官的约10万例3D医学影像扫描。我们通过在众多医学影像分割和分类任务上的大量实验验证了3DINO-ViT。结果表明,3DINO-ViT能够跨模态和器官泛化,包括分布外的任务和数据集,并在大多数评估指标和标注数据集规模下优于现有最先进方法。我们的3DINO框架和3DINO-ViT将公开提供,以支持3D基础模型的研究,或针对广泛的医学影像应用做进一步微调。

英文摘要

Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.

2511.18493 2026-06-09 eess.IV cs.AI cs.CV 版本更新

SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

SAGE:用于自适应组织病理图像分割的形状自适应门控专家

Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Thi-Ngoc-Truc Nguyen, Nhat Ho

发表机构 * University of Science, VNU-HCM(胡志明市国家大学自然科学大学) Trivita AI University of Technology, VNU-HCM(胡志明市国家大学理工大学) Michigan State University, USA(美国密歇根州立大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 SAGE通过动态专家路由框架提升异构视觉网络中细胞形态变化的适应性,实现高精度分割与稳健泛化。

Comments Accepted to CVPR 2026 (Findings Track). Project Page: https://oxyzgiahuy.github.io/sage/

详情
AI中文摘要

细胞大小和形状的显著差异所体现的细胞异质性,仍是在十亿像素级全切片图像(WSI)上进行计算机辅助癌症检测的主要障碍。当前的CNN-Transformer混合模型使用带固定路由的静态计算图,这会带来额外计算,并使模型难以适应输入的变化。我们提出形状自适应门控专家(SAGE),一个输入自适应框架,可在异构视觉网络中实现动态专家路由。SAGE通过带层次门控的双路径设计和形状自适应枢纽(SA-Hub),将静态骨干网络重构为动态路由的专家架构,其中SA-Hub用于协调卷积模块与Transformer模块之间的特征表示。以ConvNeXt和Vision Transformer UNet实现的SAGE(SAGE-ConvNeXt+ViT-UNet)在EBHI上达到95.23%的Dice分数,在GlaS Test A和Test B上分别达到92.78%和91.42%的DSC,在DigestPath上达到91.26%的WSI级DSC,并通过自适应地平衡局部细化与全局上下文,在分布偏移下表现出稳健的泛化能力。SAGE为视觉网络中的动态专家路由奠定了可扩展的基础,从而促进灵活的视觉推理。项目页面:https://oxyzgiahuy.github.io/sage/

英文摘要

The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing. This leads to extra computation and makes it harder to adapt to changes in input. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23% on EBHI, DSC scores of 92.78% and 91.42% on GlaS Test A and Test B, respectively, and 91.26% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning. Project page: https://oxyzgiahuy.github.io/sage/

9. Document Images, OCR, and Chart Understanding 10 papers

2606.07558 2026-06-09 cs.CV cs.AI cs.DL 新提交

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

基于百年跨度扫描文档档案微调的页面图像分类器,用于进一步的内容特定处理

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

发表机构 * Institute of Formal and Applied Linguistics, Charles University MFF(查尔斯大学数学与物理学院形式与应用语言学研究所) Institute of Archaeology, Czech Academy of Sciences(捷克科学院考古研究所)

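AI summary: Because manual sorting is infeasible when digitizing historical documents, the authors propose an automatic page-image classification system based on visual content type (text, tables, graphics). Fine-tuned deep networks reach near-perfect accuracy (RegNetY-16GF: 99.16%), and the models, dataset, and code are publicly released.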

Comments 29 pages, 19 figures, 13 tables. arXiv admin note: text overlap with arXiv:2507.21114

详情
AI中文摘要

目的:人文学科的数字化项目产生了大量、异构的历史文档档案,使得手动分类在大规模下不切实际。本工作解决基于视觉内容类型——文本、表格和图形——对扫描页面图像进行分类的自动化系统需求,从而支持内容特定的下游处理,如光学字符识别(OCR)或结构化数据提取。方法:开发了一个图像分类系统,并在来自百年历史的捷克考古档案的超过48,000张带注释的历史页面图像数据集上进行评估,通过四个连续的注释阶段和领域专家审查进行优化。使用手工制作的图像特征建立了随机森林分类器基线。随后,微调并比较了深度学习架构:卷积神经网络(EfficientNetV2、RegNetY)、视觉和文档图像变换器(ViT、DiT)以及多模态CLIP模型。与领域专家合作设计了11类标签方案,并通过五折交叉验证进行评估。结果:基于特征的基线实现了约75%的准确率。微调的CNN和变换器显著优于基线,RegNetY-16GF在保留测试集上达到99.16%的Top-1准确率,ViT-large达到99.12%。CLIP ViT-B/16通过优化文本描述达到99.14%的准确率。结论:仅图像模型,特别是RegNetY-16GF,实现了近乎完美的分类准确率,并在649,508张未标注档案页面上产生一致标签,模型间一致性超过90%。微调的CLIP尽管在测试集上具有竞争力,但在未标注数据上与仅图像模型的一致性低于65%,因此不太适合部署。最终模型、注释数据集和软件均以开源许可证公开提供。

英文摘要

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
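
The inter-model agreement mentioned in the conclusion can be illustrated with a small sketch. It assumes agreement is the fraction of pages on which two models output the same label; the abstract does not spell out the exact definition. The model names and labels below are placeholders, not the paper's 11 categories.

```python
from itertools import combinations


def pairwise_agreement(predictions):
    """Map each pair of models to the fraction of pages where their labels match.

    predictions: dict of model name -> list of labels, all in the same page order.
    """
    result = {}
    for a, b in combinations(sorted(predictions), 2):
        labels_a, labels_b = predictions[a], predictions[b]
        if len(labels_a) != len(labels_b):
            raise ValueError("all models must label the same pages")
        matches = sum(x == y for x, y in zip(labels_a, labels_b))
        result[(a, b)] = matches / len(labels_a)
    return result


# Hypothetical predictions on five unlabeled pages.
predictions = {
    "cnn": ["text", "table", "photo", "text", "table"],
    "vit": ["text", "table", "photo", "text", "text"],
    "clip": ["text", "photo", "photo", "table", "text"],
}
agreement = pairwise_agreement(predictions)
print(agreement[("cnn", "vit")])  # 4 of 5 pages match
```

A low agreement rate between one model and the rest, as reported for the fine-tuned CLIP model, can flag a deployment risk even when test-set accuracy looks competitive.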

2606.07661 2026-06-09 cs.CV cs.DL 新提交

PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing

PereStruct: 面向鲁棒历史文档解析的多模态语义组装

Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

Affiliations * IGIC RAS; Yandex Cloud; National University of Science and Technology MISIS; Nekrasov Central Universal Scientific Library

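AI summary: To parse the complex multi-column layouts of historical newspapers, the authors combine a fine-tuned YOLO detector with a multimodal semantic assembly module, reaching an F1 of 0.904 on block-to-article mapping and a BLEU of about 0.96, well above general-purpose vision-language models.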

Comments Code and data available at https://github.com/makSShandybo/PereStruct

详情
AI中文摘要

解析具有复杂非标准布局的历史文档仍是大规模档案数字化的基本瓶颈。与现代排版不同,历史报纸存在严重的物理退化和高度不规则的页面结构,即使最先进的视觉语言模型也难以应对,呈现出严重的分布外挑战。我们通过一个专门为解析历史报纸(具有特别复杂多栏布局的文档)设计的自动化流程来弥补这一差距。我们的方法结合了用于布局分析和块检测的微调YOLO架构(在1,426张完全人工标注的扫描页面上训练),以及一个新颖的语义组装模块,该模块通过联合建模基于TF-IDF的词法语义相似性、来自微调YOLO的视觉嵌入以及几何布局约束来重构文章。这种多模态集成实现了最先进的性能,在块到文章映射上取得了0.904的F1分数。值得注意的是,与视觉语言模型(Qwen3.6-35B-A3B和Qwen3.6-Plus)的端到端评估表明,PereStruct实现了显著更高的保真度(BLEU约0.96 vs 0.34),验证了模块化架构在通用VLM难以处理的复杂历史布局上表现出色。为了支持可重复性并推动该领域的研究,我们发布了包含599张标注页面的训练语料库和包含93张页面(具有专家验证的真实块到文章映射)的精选PereStruct基准。该框架为复杂档案材料的高保真数字化和语义重建奠定了坚实基础。

英文摘要

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.
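
To make the idea of jointly scoring lexical and geometric cues concrete, here is a rough sketch, not the PereStruct implementation. It scores a candidate link between two text blocks as a weighted sum of TF-IDF cosine similarity and horizontal overlap. The weights, the smoothed IDF formula, and the overlap measure are all assumptions, and the visual-embedding term described in the abstract is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / length) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def horizontal_overlap(a, b):
    """Overlap of two boxes (x0, y0, x1, y1) along x, relative to the narrower box."""
    overlap = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    narrower = min(a[2] - a[0], b[2] - b[0])
    return overlap / narrower if narrower > 0 else 0.0


def link_score(i, j, vectors, boxes, w_text=0.6, w_geo=0.4):
    """Hypothetical combined score; the weights are illustrative only."""
    return w_text * cosine(vectors[i], vectors[j]) + w_geo * horizontal_overlap(boxes[i], boxes[j])


# Made-up blocks: 0 and 1 share a column and a topic, 2 is an unrelated neighbour.
texts = [
    "city council votes on new bridge",
    "the bridge vote divided the council",
    "weather forecast rain expected tomorrow",
]
boxes = [(0, 0, 100, 50), (0, 60, 100, 120), (120, 0, 220, 50)]
vectors = tfidf_vectors(texts)
same_column = link_score(0, 1, vectors, boxes)
other_column = link_score(0, 2, vectors, boxes)
print(same_column > other_column)
```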

2606.08858 2026-06-09 cs.CV cs.AI 新提交

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

基于深度神经网络的手写表单智能字符识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA) Offenburg University(奥芬堡大学机器学习与分析研究所(IMLA))

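AI summary: Proposes a handwritten character recognition method in which a deep neural network performs detection and classification as a single task, trained on artificially generated data; it reaches an 88.28% recognition rate on real exam data.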

Comments Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures

详情
Journal ref
In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94
AI中文摘要

手写表单的自动处理仍然是一项具有挑战性的任务,其中手写字符的检测和后续分类是关键步骤。我们描述了一种新颖的方法,其中两个步骤——检测和分类——通过深度神经网络在一个任务中执行。因此,训练数据不是手动标注的,而是从基础表单和现有数据集中人工制造的。可以证明,这种单任务方法优于最先进的双任务方法。当前研究专注于手写拉丁字母,并使用EMNIST数据集。然而,该数据集存在局限性,需要进一步定制。最后,在从笔试中获得的真实数据上达到了88.28%的整体识别率。

英文摘要

The automatic processing of handwritten forms remains a challenging task, wherein detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach in which both steps (detection and classification) are executed in one task through a deep neural network. Therefore, training data is not annotated by hand but generated artificially from the underlying forms and already existing datasets. It can be demonstrated that this single-task approach is superior to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST dataset. However, limitations were identified with this dataset, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.09446 2026-06-09 cs.CV 新提交

Leveraging Morphology for Historical Script Metrological Analysis

利用形态学进行历史手稿计量分析

Malamatenia Vlachou Efstathiou, Raphaël Baena, Dominique Stutzmann, Mathieu Aubry

发表机构 * LIGM, École des Ponts et Chaussées, IP Paris, CNRS, France(LIGM,国立桥路学校,巴黎理工学院,法国国家科学研究中心,法国) Institut de Recherche et d’Histoire des Textes, Paris, Île-de-France, France(文本研究与历史研究所,巴黎,法兰西岛,法国)

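AI summary: Proposes a transformer-based detection architecture with a prototype-based line reconstruction module that learns character prototypes from line-level transcriptions, enabling scalable and meaningful paleographic measurements, and shows that these measurements can distinguish graphical profiles and reveal subtle variations.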

详情
AI中文摘要

手写文本识别的进展使得历史文献的大规模转录成为可能,但仍为古文字学(历史手稿研究)提供有限的可解释视觉测量。本文的主要见解是,形态学手稿分析,特别是从行级转录中学习字符原型的能力,能够定义可扩展、有意义且稳定的古文字测量。更精确地说,我们利用基于Transformer的检测架构和基于原型的线重建模块来学习原型字符及其出现、变形和定位。我们的贡献有两方面。首先,我们引入了一种深度架构和学习方法,仅通过行级转录监督即可实现高效的字符建模,显著改进了可学习打字机基线,并实现了准确的字符边界框预测,释放了其在古文字测量中的潜力。其次,我们介绍并展示了由我们的架构实现的字符、双字母组和图形单元之间间距的自动测量的古文字相关性。为了演示,我们将巴黎手稿BnF fr. 2813(14世纪末由查理五世委托,由四名抄写员抄写)的注释扩展到160页。我们可视化这些页面上的测量结果,显示它们不仅使我们能够区分图形轮廓,还能发现和分析细微变化。这个案例研究概述了我们方法的可扩展性及其在所需训练数据方面的节俭性,因为单列文本就足以对160页中的每一页进行计算。数据和代码公开于:https://malamatenia.github.io/morphology4metrology-analysis。

英文摘要

Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.
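
Once character bounding boxes are available, measurements of the kind described above reduce to simple geometry. The sketch below is an illustration under assumed conventions, not the authors' code: each box is a hypothetical `(character, x0, x1)` triple on a single text line, and the two measurements shown (gaps between neighbours and mean width per character) are only examples of what could be computed.

```python
from collections import defaultdict


def inter_character_gaps(boxes):
    """Horizontal gaps between consecutive boxes; negative values mean overlap."""
    ordered = sorted(boxes, key=lambda box: box[1])
    return [nxt[1] - cur[2] for cur, nxt in zip(ordered, ordered[1:])]


def mean_width_per_character(boxes):
    """Average box width for each character label."""
    widths = defaultdict(list)
    for char, x0, x1 in boxes:
        widths[char].append(x1 - x0)
    return {char: sum(values) / len(values) for char, values in widths.items()}


# Made-up boxes for one line, in pixels.
line = [("a", 0, 10), ("b", 14, 22), ("a", 25, 37)]
print(inter_character_gaps(line))
print(mean_width_per_character(line))
```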

2606.09479 2026-06-09 cs.CV cs.DL 新提交

Optical Music Recognition for Real-World Manuscripts with Synthetic Data

基于合成数据的真实世界手稿光学音乐识别

Jiří Mayer, Martina Dvořáková, Vojtěch Dvořák, Markéta Herzánová Vlková, Filip Bím, Pavel Pecina, Samuel Šomorjai, Petr Žabička, Jan Hajič

发表机构 * Institute of Formal and Applied Linguistics, Charles University(查尔斯大学形式与应用语言学研究所) Moravian Library(摩拉维亚图书馆)

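AI summary: For recognizing real-world manuscripts with complex piano notation in a resource-constrained setting, the authors use synthetic manuscript images for domain adaptation, which significantly improves performance and avoids expensive fine-grained annotation.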

Comments Accepted for publication at the ICDAR 2026 conference

详情
AI中文摘要

光学音乐识别(OMR)在模型设计方面取得了重大进展,端到端方法现在能够识别所有复杂程度的符号。然而,这一进展的影响受到可用训练数据集视觉领域的限制,这些数据集大多是数字原生的。图书馆和其他遗产机构中现有的大量乐谱收藏主要包含手稿,其视觉领域高度多样且不同,因此现有的OMR系统在现实世界中应用时失败。这些机构通常资源受限,因此无法期望大规模领域内数据集。我们在资源受限场景下为具有复杂钢琴符号的真实世界手稿提供了第一个基线。使用细粒度音乐符号图(MuNG)注释和Smashcima合成工具,我们随后表明,虽然领域内数据的一些直接转录仍然是必要的,但使用合成音乐手稿图像进行域自适应带来了显著的改进。此外,所使用的符号不需要是领域内的,因此可以避免昂贵的细粒度注释。因此,我们将OMR更接近其既定目标之一:保护和推广音乐文化遗产。

英文摘要

Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided. We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.

2606.09788 2026-06-09 cs.CV 新提交

POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

POTATR: 一种用于页面级表格提取的轻量级图像到图模型

Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin, Tayyibah Khanam, Maury Courtland

发表机构 * Kensho Technologies

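AI summary: Proposes POTATR, a lightweight image-to-graph model with 29M parameters that outperforms frontier models on page-level table extraction (GriTS_Con of 0.964) while running over 130× faster at roughly 300× lower cost, with spatially grounded output.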

Comments 16 pages, split from PubTables-v2 paper

详情
AI中文摘要

大规模文档处理需要上下文感知的表格提取(TE),既准确又高效。然而,当前方法需要数十亿参数、数百个自回归步骤或昂贵的API推理。受此启发,我们引入了页面对象表格Transformer(POTATR),这是一个轻量级的29M参数图像到图模型,扩展了表格Transformer(TATR)用于上下文感知的页面级TE。在PubTables-v2单页面基准测试中,POTATR超越了所有测试模型(包括前沿MLLM),实现了0.964的$\textrm{GriTS}_\textrm{Con}$,同时运行速度提高130倍以上,成本降低约300倍。此外,POTATR的输出是空间可解释的:每个识别元素都有一个边界框,支持视觉验证和几何文本分配。因此,POTATR在执行统一的页面级TE的同时,可以与其他模型组合,通过外部OCR扩展到扫描文档,并通过跨页面合并等技术扩展到全文档TE。代码和模型将发布。

英文摘要

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark, including frontier MLLMs, achieving a GriTS_Con of 0.964 while running over 130× faster at roughly 300× lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.
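
Geometric text assignment, as mentioned in the abstract, can be sketched as follows. This is one plausible reading rather than POTATR's actual procedure: each OCR word goes to the table cell that covers the largest share of the word's box. The 0.5 threshold and the `(x0, y0, x1, y1)` box format are assumptions.

```python
def intersection_area(a, b):
    """Area shared by two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def assign_words_to_cells(words, cells, min_overlap=0.5):
    """Assign each (text, box) word to the cell covering most of its area.

    Words whose best coverage is below min_overlap are left unassigned.
    """
    assigned = {cell_id: [] for cell_id in cells}
    for text, box in words:
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area <= 0:
            continue
        best_id, best_fraction = None, 0.0
        for cell_id, cell_box in cells.items():
            fraction = intersection_area(box, cell_box) / area
            if fraction > best_fraction:
                best_id, best_fraction = cell_id, fraction
        if best_id is not None and best_fraction >= min_overlap:
            assigned[best_id].append(text)
    return assigned


# Made-up cell boxes (as a table model might predict) and OCR words.
cells = {"r0c0": (0, 0, 50, 20), "r0c1": (50, 0, 100, 20)}
words = [
    ("Revenue", (5, 5, 45, 15)),
    ("42", (60, 5, 75, 15)),
    ("stray", (200, 200, 220, 210)),  # outside the table, stays unassigned
]
table = assign_words_to_cells(words, cells)
print(table)
```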

2606.08770 2026-06-09 cs.CL cs.AI cs.CV cs.LG 交叉投稿

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

TeamHerald@CHIPSAL 2026:基于Transformer架构和集成学习的尼泊尔语模因仇恨言论检测与情感分析

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal

发表机构 * Herald College Kathmandu(加德满都赫尔德学院)

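AI summary: To handle code-mixing and scarce resources in Nepali memes, the authors extract text with OCR and model it with Transformer architectures. Hard and soft voting ensembles behave differently on binary and multi-class tasks; soft voting yields a 15.8% relative improvement in Macro F1 on the three-class sentiment task.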

Comments Accepted at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026) at LREC 2026

详情
AI中文摘要

尼泊尔语互联网模因的分析因频繁的代码混合和缺乏已建立的基线资源而变得复杂。虽然模因本质上结合了视觉和文本元素,但本研究侧重于以文本为中心的方法,通过OCR层提取嵌入文本,并使用基于Transformer的架构进行建模。我们评估了六种不同的模型,并研究了硬投票和软投票集成策略在两项任务中的比较效果:二分类仇恨言论检测和三分类情感分析。实验结果表明,独立的仅解码器模型在二分类任务中取得了最高性能,而软投票集成在多类情感任务中表现最佳,相比最强的独立基线,Macro F1分数相对提升了15.8%。这些发现表明,集成策略在二分类和多类任务中表现不同,突出了选择适合分类目标的聚合方法的重要性。

英文摘要

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.
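
The difference between the two ensemble strategies is easiest to see on a small example. The sketch below is illustrative only; the class names and probabilities are made up. Hard voting counts each model's top label, while soft voting averages the predicted probabilities before picking a label, so a single confident model can outweigh two marginal ones.

```python
from collections import Counter


def hard_vote(prob_vectors, classes):
    """Majority vote over each model's top label (ties go to the label seen first)."""
    top_labels = [classes[max(range(len(classes)), key=p.__getitem__)] for p in prob_vectors]
    return Counter(top_labels).most_common(1)[0][0]


def soft_vote(prob_vectors, classes):
    """Pick the label with the highest mean probability across models."""
    n_models = len(prob_vectors)
    mean_probs = [sum(p[i] for p in prob_vectors) / n_models for i in range(len(classes))]
    return classes[max(range(len(classes)), key=mean_probs.__getitem__)]


classes = ["negative", "neutral", "positive"]  # hypothetical label names
probs = [
    [0.45, 0.40, 0.15],  # model 1: marginally negative
    [0.45, 0.40, 0.15],  # model 2: marginally negative
    [0.05, 0.90, 0.05],  # model 3: confidently neutral
]
print(hard_vote(probs, classes))  # negative
print(soft_vote(probs, classes))  # neutral
```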

2606.08855 2026-06-09 cs.AI cs.CV cs.CY 交叉投稿

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

高等教育中的混合电子评估:纸质笔试的半自动评分

Hartwig Grabowski, Michael Canz

Affiliations * Institute for Machine Learning and Analytics, Hochschule Offenburg; Hochschule Offenburg

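AI summary: To address the limitations of fully and partially digital e-assessment in summative examinations, the paper proposes a hybrid approach that keeps paper-based, problem-oriented tasks and enables semi-automated grading through structured answer formats and handwritten character recognition. Vision-capable large language models combined with two-pass validation are used to improve validity, fairness, and scalability.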

Comments 15 pages, 6 figures

详情
AI中文摘要

本文考察了完全数字化和部分数字化电子评估方法在高等教育总结性考试中的局限性。分析聚焦于封闭式问题格式导致的教学狭窄化,以及在大学生群体中尤为突出的组织、技术和法律约束。作为替代方案,本文提出了一种混合电子评估方法,该方法保留纸质、问题导向的考试任务,同时实现半自动评分。评估相关的中间结果以结构化答案格式编码,由学生手写输入,随后从表格字段中捕获。核心的技术瓶颈是在现实考试条件下可靠识别手写字符。最近的视觉大语言模型,结合两遍验证原则和与标准答案的比对,可以减少误分类,从而提高总结性评估的有效性、公平性和可扩展性。

英文摘要

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.
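
The abstract does not detail the two-pass validation principle, so the following is only one possible reading, sketched under explicit assumptions: each answer field is recognized twice independently, fields where the two readings disagree are routed to a human grader, and agreeing readings are compared against the solution key. The function names, status strings, and sample data are hypothetical.

```python
def grade_field(first_pass, second_pass, solution):
    """Grade one handwritten answer field from two independent recognition passes."""
    first, second = first_pass.strip(), second_pass.strip()
    if first != second:
        # The passes disagree, so the reading is treated as unreliable.
        return "manual_review"
    return "correct" if first == solution.strip() else "incorrect"


def grade_sheet(passes_1, passes_2, key):
    """Apply grade_field to every field listed in the solution key."""
    return {field: grade_field(passes_1[field], passes_2[field], key[field]) for field in key}


# Made-up recognition results for three answer fields.
key = {"q1": "42", "q2": "B", "q3": "7"}
pass_1 = {"q1": "42", "q2": "B", "q3": "1"}
pass_2 = {"q1": "42", "q2": "8", "q3": "1"}
result = grade_sheet(pass_1, pass_2, key)
print(result)
```

Sending disagreements to a human rather than guessing keeps recognition errors from silently changing a grade, at the cost of some manual work.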

2509.09501 2026-06-09 cs.CV 版本更新

Region-Wise Correspondence Prediction between Manga Line Art Images

漫画线条艺术图像的区域级对应预测

Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui

发表机构 * The University of Tokyo(东京大学) CyberAgent, Inc.(CyberAgent公司)

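AI summary: Proposes a Transformer-based framework, trained on large-scale automatically generated region correspondences, that predicts region-wise correspondence between manga line art images without annotations, reaching 78.4-84.4% region-level accuracy.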

Comments Accepted to CVPR 2026

详情
AI中文摘要

理解漫画线条艺术图像之间的区域级对应是高级漫画处理的基础,支持线条艺术着色和中间帧生成等下游任务。与包含丰富视觉线索的自然图像不同,漫画线条艺术仅由稀疏的黑白笔画组成,这使得确定图像间哪些区域对应具有挑战性。在这项工作中,我们引入了一个新任务:预测原始漫画线条艺术图像之间的区域级对应,无需任何标注。为了解决这个问题,我们提出了一个基于Transformer的框架,在大规模自动生成的区域对应上进行训练。该模型学会抑制噪声匹配并加强一致的结构关系,从而在图像内部和图像之间实现鲁棒的块级特征对齐。在推理过程中,我们的方法通过边缘感知聚类和区域匹配来分割每个线条艺术并建立连贯的区域级对应。我们构建了人工标注的基准用于评估,跨多个数据集的实验显示了高块级准确率和强区域级对应性能,区域级准确率达到78.4-84.4%。这些结果凸显了我们的方法在真实漫画和动画应用中的潜力。

英文摘要

Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.

2501.06659 2026-06-09 cs.DB cs.CV 版本更新

Visual Template Inference for Data Extraction from Documents

文档数据提取的视觉模板推断

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran

发表机构 * UC Berkeley(加州大学伯克利分校)

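AI summary: Proposes TWIX, a tool that infers a document's visual template in order to extract structured data efficiently and at low cost; it outperforms existing methods in precision and recall and is substantially faster and cheaper on large datasets.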

详情
AI中文摘要

许多模板化文档是根据结构化数据按照视觉模板程序化生成的。这类文档包括发票、税务文件、财务报告和采购订单。从这些文档中有效提取数据对于支持下游分析任务至关重要。当前的数据提取工具通常难以处理复杂的文档布局,在大数据集上会产生高延迟和/或高成本,并且需要大量人力。我们的工具TWIX的关键洞察是推断用于生成此类文档的底层模板,然后提取数据,而不是直接从文档中提取。为此,TWIX首先通过利用字段的一致位置模式(例如,同一模板中的两个字段在多个记录中反复以固定距离共现)来推断底层字段,如表格部分的列或共置键值对中的键。然后,TWIX通过强制视觉约束(例如,对于表格区域,垂直对齐表格行与其列标题;对于键值对,水平对齐键与其值)将这些字段组装成模板。最后,TWIX使用这个推断出的模板以低成本从模板化文档中准确高效地提取数据。在一个包含34个多样化真实世界数据集的基准测试中,TWIX在精度和召回率上比最先进的结构化数据提取工具(Evaporate、Textract和Azure Document Intelligence)以及基于视觉的大语言模型(如GPT-4-Vision)高出25%以上。另一个包含30个大数据集的基准测试展示了TWIX的可扩展性:对于从超过2000页的大型文档集合中提取数据,它比最具竞争力的对比工具快520倍,便宜3786倍。

英文摘要

Many templatized documents are programmatically generated from structured data following a visual template. Such documents include invoices, tax documents, financial reports, and purchase orders. Effective data extraction from these documents is crucial to support downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and require significant human effort. The key insight of our tool, TWIX, is to infer the underlying template used to create such documents, and then extract the data, rather than extracting directly from documents. To do so, TWIX first infers the underlying fields, such as columns of tabular portions or keys in co-located key-value pairs, by leveraging their consistent location patterns (e.g., two fields in the same template repeatedly co-occur within a fixed distance apart across multiple records). TWIX then assembles these fields into a template by enforcing visual constraints, such as vertically aligning table rows with their column headers for tabular regions, and horizontally aligning keys with their values for key-value pairs. TWIX then uses this inferred template to accurately and efficiently extract data from templatized documents at a low cost. On one benchmark with 34 diverse real-world datasets, TWIX outperforms state-of-the-art structured data extraction tools (Evaporate, Textract, and Azure Document Intelligence), and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. Another benchmark with 30 large datasets demonstrates TWIX's scalability: it is 520X faster and 3,786X cheaper than the most competitive compared tool, for extracting data from large document collections with over 2000 pages.
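
The location-pattern idea behind field inference can be sketched in a few lines. This is a simplified illustration, not TWIX itself: it assumes each record is a mapping from phrase text to an `(x, y)` position, that a phrase appears at most once per record, and that a pair of phrases counts as a field pair when it recurs in several records at the same relative offset. The tolerance, the minimum record count, and the sample data are assumptions.

```python
from itertools import combinations


def stable_field_pairs(records, min_records=2, tol=1.0):
    """Find phrase pairs that recur across records at a consistent relative offset."""
    offsets = {}
    for record in records:
        for a, b in combinations(sorted(record), 2):
            (ax, ay), (bx, by) = record[a], record[b]
            offsets.setdefault((a, b), []).append((bx - ax, by - ay))
    stable = []
    for pair, seen in offsets.items():
        if len(seen) < min_records:
            continue  # value phrases change between records and rarely repeat
        dx0, dy0 = seen[0]
        if all(abs(dx - dx0) <= tol and abs(dy - dy0) <= tol for dx, dy in seen):
            stable.append(pair)
    return sorted(stable)


# Two made-up invoice records: keys repeat at fixed offsets, values differ.
records = [
    {"Invoice No": (10, 10), "Date": (10, 30), "A-1001": (120, 10), "2024-01-05": (120, 30)},
    {"Invoice No": (10, 210), "Date": (10, 230), "A-1002": (120, 210), "2024-02-11": (120, 230)},
]
fields = stable_field_pairs(records)
print(fields)
```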

10. Low-Level Vision, Computational Imaging, and Image Enhancement 15 papers

2606.07985 2026-06-09 cs.CV cs.CL 新提交

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

FMRFusion: 面向异质图像融合的频率感知多视图表示学习

Tao Zhou, Yunlong Liu, Qinghui Chen, Zekai Zhang, Minlong Sun, Changlin Bian, Dagang Li, Wenmin Wang, Jinglin Zhang

发表机构 * Shandong University(山东大学) Macau University of Science and Technology(澳门科技大学)

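AI summary: Proposes FMRFusion, a network for infrared and visible image fusion that combines a multi-scale structural perception module, bilinear frequency decomposition, and cross-view complementary interaction with flow-matching refinement; it performs especially well in nighttime scenes.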

详情
AI中文摘要

红外与可见光图像融合旨在生成保留重要目标信息和详细纹理的复合图像,整合两种异质模态。以往的图像融合方法通常采用单模块堆叠方式从两种模态中提取特征,然而这些方法可能导致对其独特特征的学习不完整,从而限制融合效果并在真实异质数据场景中降低鲁棒性。为解决这些问题,我们提出FMRFusion,一种用于异质图像融合的频率感知多视图表示学习网络。引入多尺度结构感知模块以有效捕捉判别性结构,提取细粒度局部结构和关键上下文信息。采用双线性频率分解机制将特征分离为高频和低频分量,实现不同频率域中局部细节和全局表示的联合建模。此外,融入跨视图互补交互以显式建模和融合反射光信息与辐射强度响应之间的互补特性,促进有效的跨视图交互。我们通过流匹配进一步改善融合结果的质量,通过学习从粗数据到高质量表示的变换逐步细化融合特征。在多个基准数据集上进行的大量实验表明,FMRFusion在一系列融合任务中实现了优越且一致的性能,尤其在夜间场景中表现突出。

英文摘要

Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting the fusion effectiveness and constraining robustness in real-world heterogeneous data scenarios. To address these challenges, we propose FMRFusion, a frequency-aware multi-view representation learning network for heterogeneous image fusion. A Multi-Scale Structural Perception Module is introduced to effectively capture discriminative structures, extracting fine-grained local structures and essential contextual information. A bilinear frequency decomposition mechanism is employed to separate features into high-frequency and low-frequency components, enabling joint modeling of local details and global representations across different frequency domains. Moreover, a Cross-View Complementary Interaction is incorporated to explicitly model and fuse the complementary characteristics between reflected light information and radiative intensity responses, facilitating effective cross-view interaction. We further improve the performance of the fused results by flow matching, which progressively refines the fused features by learning the transformation from coarse data to high-quality representations. Extensive experiments conducted on multiple benchmark datasets demonstrate that FMRFusion achieves superior and consistent performance across a range of fusion tasks, especially in nighttime scenarios.

2606.08324 2026-06-09 cs.CV cs.AI 新提交

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

基于集合的Transformer用于远距离长波红外高光谱成像中的大气补偿

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon

Affiliations * Department of Computer Science, Universidad Industrial de Santander, Bucaramanga

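AI summary: Proposes a lightweight set-based deep learning framework that uses radiance measurements taken at different standoff ranges to jointly estimate transmittance, atmospheric path radiance, and downwelling radiance, achieving low spectral distortion on a MODTRAN-generated dataset.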

Comments Accepted at the IGARSS 2026 conference

详情
AI中文摘要

被动长波红外(LWIR)高光谱成像在远距离几何下依赖于大气吸收和发射以及反射辐射,因此大气补偿对于获取目标信息至关重要。尽管其重要性,但由于实际和建模困难,这一补偿在很大程度上被忽视。在本文中,我们提出了一种轻量级基于集合的深度学习框架,该框架将不同远距离范围收集的多个辐射测量作为输入,并联合估计透射率、大气路径辐射和共享的下行辐射光谱。我们使用稀疏自编码器分析学习到的表示,并观察到尽管缺乏位置监督,几个潜在特征确实在测试数据的地理一致子集上激活。在MODTRAN生成的远距离LWIR数据集上的实验表明,所有估计产品的光谱畸变较低。数据集和代码公开于:https://factral.co/SAE-LWIR/

英文摘要

Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, which makes atmospheric compensation essential for recovering information about a target of interest. Despite its importance, this compensation has been largely overlooked because of its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/

2606.08535 2026-06-09 cs.CV 新提交

NGram-MoSE: Efficient Remote Sensing Super-Resolution via N-Gram Context and Mixture-of-Experts

NGram-MoSE:基于N-Gram上下文和混合专家模型的高效遥感超分辨率

Yun-Hsuan Huang, Trong-An Bui, Chih-Hung Chuang

发表机构 * National Science and Technology Council (NSTC), Taiwan(台湾国家科学与技术委员会)

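AI summary: Proposes NGram-MoSE, a lightweight Transformer architecture that uses N-Gram Context Injection to strengthen local consistency and a sparsely activated Mixture-of-Experts feed-forward design to limit computation, giving efficient and robust texture reconstruction for remote sensing super-resolution.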

详情
AI中文摘要

环境监测和灾害管理的遥感应用经常受到时空权衡的限制:具有精细空间细节的图像通常获取频率较低,而时间上更可用的观测通常更粗糙。单图像超分辨率提供了一种实用的方法,可以在不改变获取计划的情况下增强粗糙图像,然而许多基于Transformer的SR模型仍然计算成本高昂,并且可能对有限或地理偏倚的训练数据敏感,这降低了在分布外条件下的鲁棒性。本文提出了NGram-MoSE,一种轻量级Transformer架构,旨在提高效率和纹理连续性。NGram-MoSE引入了N-Gram上下文注入以增强跨窗口局部一致性并减轻窗口边界伪影,并采用了混合专家(MoE)前馈设计,通过稀疏激活扩展容量而不成比例地增加推理成本。在地理上不相交的OOD测试集上的实验表明,NGram-MoSE实现了31.68 dB的PSNR,同时相对于重型Transformer参考模型将FLOPs减少了14倍。在滑坡分割基准上的下游评估进一步表明,将退化的输入恢复到检测器训练尺度可提高性能,在mAP@50上比双三次上采样绝对提高了4.47%,并且在尺度外推下表现出更强的跨尺度一致性。这些结果表明,NGram-MoSE为需要鲁棒泛化的资源受限遥感流水线提供了一个有效的SR模块。

英文摘要

Remote sensing applications for environmental monitoring and disaster management are frequently constrained by a spatial-temporal trade-off: imagery with fine spatial detail is often acquired less frequently, whereas more temporally available observations are typically coarser. Single-image super-resolution provides a practical means to enhance coarse imagery without changing acquisition schedules, yet many Transformer-based SR models remain computationally expensive and can be sensitive to limited or geographically biased training data, which degrades robustness under out-of-distribution conditions. This paper presents NGram-MoSE, a lightweight Transformer architecture designed to improve both efficiency and texture continuity. NGram-MoSE introduces N-Gram Context Injection to strengthen cross-window local consistency and mitigate window-boundary artifacts, and incorporates a Mixture-of-Experts (MoE) feed-forward design to scale capacity through sparse activation without proportional growth in inference cost. Experiments on a geographically disjoint OOD test set show that NGram-MoSE achieves 31.68 dB PSNR while reducing FLOPs by 14× relative to a heavyweight Transformer reference. Downstream evaluation on a landslide segmentation benchmark further demonstrates that restoring degraded inputs to the detector training scale improves performance, yielding a 4.47% absolute gain in mAP@50 over bicubic upsampling, and exhibits stronger cross-scale consistency under scale extrapolation. These results indicate that NGram-MoSE provides an effective SR module for resource-constrained remote sensing pipelines requiring robust generalization.
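
Sparse activation is what lets a Mixture-of-Experts layer add capacity without a matching rise in inference cost: only the top-scoring experts run for each input. The sketch below shows generic top-k routing on a scalar feature and is not the NGram-MoSE layer. In a real model the gate logits come from a learned projection of the input; here they are passed in directly, and the toy experts are placeholders.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, gate_logits, experts, k=1):
    """Run only the k highest-scoring experts and mix their outputs.

    Returns the mixed output and the indices of the experts that were evaluated.
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    active = ranked[:k]
    weights = softmax([gate_logits[i] for i in active])
    output = sum(w * experts[i](x) for w, i in zip(weights, active))
    return output, active


# Three toy experts; with k=1 only one of them is evaluated per input.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
gate_logits = [0.1, 2.0, 1.0]  # hypothetical router scores for this input
out_top1, used_top1 = moe_forward(3.0, gate_logits, experts, k=1)
out_top2, used_top2 = moe_forward(3.0, gate_logits, experts, k=2)
print(out_top1, used_top1)
```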

2606.09029 2026-06-09 cs.CV 新提交

Frequency Decoupled Framework for Screen Content Image Super-Resolution

面向屏幕内容图像超分辨率的频率解耦框架

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng

发表机构 * Anhui University(安徽大学)

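AI summary: Proposes a frequency decoupled framework (FDF) that uses amplitude-phase decomposition and bespoke implicit representations to jointly exploit periodic patterns and coherent context for screen content image super-resolution, reaching state-of-the-art performance on multiple datasets.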

Comments 13pages;11figures

详情
AI中文摘要

基于隐式神经表示的方法在屏幕内容图像超分辨率(SCISR)中表现出优越性能。然而,它们忽略了固有的频率特性,导致性能次优。我们提出一种频率解耦框架(FDF),从相量角度重新思考SCISR,通过捕获振幅中的结构化能量和相位中的关系连续性,并利用定制的隐式表示联合利用它们,以忠实恢复屏幕内容图像(SCI)的规则纹理和全局配置。振幅-相位分解网络(APFN)首先将图像分离为振幅和相位流,其中振幅聚类模块(ACM)将稀疏但高能量的振幅响应组织成代表性原型以提取周期模式,而相位一致性自注意力(PCSA)通过连续一致性传播逐步增强配置。振荡-非谐隐式拟合网络(OAIF-Net)集成周期性和连贯隐式表示,以有效利用SCI中嵌入的周期模式和连贯上下文。实验结果表明,FDF在四个公共SCI数据集上的多个尺度上实现了最先进的SCISR性能。消融实验进一步证明了每个组件在提取和利用周期模式与连贯上下文方面的有效性。

英文摘要

Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR). However, they overlook the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective by capturing structured energy in amplitude and relational continuity in phase, and jointly exploiting them with bespoke implicit representations to faithfully recover the regular textures and global configuration of Screen Content Images (SCIs). The Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams: the Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. The Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) then integrates periodic and coherent implicit representations for efficient exploitation of the periodic patterns and coherent context embedded in SCIs. Experimental results show that FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.
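
The amplitude and phase view used above can be illustrated with a plain discrete Fourier transform. The sketch below is a one-dimensional toy, not the APFN: it splits a signal into amplitude and phase, shows that a periodic pattern appears as a sparse, high-energy amplitude peak, and checks that the two streams together recover the signal. The sample signal is made up.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(signal):
    """Return the amplitude and phase of each frequency component."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def recombine(amplitude, phase):
    """Rebuild a real signal from its amplitude and phase streams."""
    spectrum = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [value.real for value in idft(spectrum)]


# A periodic toy signal: two full sine cycles over eight samples.
signal = [0, 1, 0, -1, 0, 1, 0, -1]
amplitude, phase = split_amplitude_phase(signal)
restored = recombine(amplitude, phase)
print([round(a, 6) for a in amplitude])
```

In this toy case the energy concentrates in the two bins that correspond to the repeating pattern, which is the kind of sparse amplitude response the abstract describes.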

2606.09110 2026-06-09 cs.CV 新提交

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

HDRAgent: 一种用于多曝光HDR成像的智能体框架

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang, Qingsen Yan

发表机构 * School of Computer Science, Northwestern Polytechnical University(西北工业大学计算机学院) Shenzhen Research Institute, Northwestern Polytechnical University(西北工业大学深圳研究院) Zhejiang University(浙江大学) Camera Group, DJI(大疆相机部门)

AI总结 提出首个智能体驱动的HDR成像框架HDRAgent,通过细粒度上下文知识匹配、感知-失真反馈机制和智能体引导的生成对齐策略,自适应选择重建策略,减少复杂动态场景中的鬼影和局部伪影。

详情
AI中文摘要

大多数现有的多曝光HDR方法遵循固定的前馈重建范式,使其在复杂动态场景中容易产生鬼影伪影。为了解决这个问题,我们提出了HDRAgent,这是第一个用于HDR成像的智能体驱动框架,它根据当前场景条件自适应地选择重建策略。具体来说,为了提供场景特定的先验知识,我们引入了一个细粒度上下文知识匹配(FCM)模块。该模块利用多模态大语言模型(MLLM)衍生的场景感知来检索相关的历史案例和工具知识,并将它们组织成结构化证据,用于基于MLLM的自适应工具调度。此外,我们提出了一种感知-失真反馈机制,将执行后的质量评估和伪影诊断转化为结构化反馈,并累积到历史记忆中,以帮助后续的上下文知识细化和策略选择。此外,考虑到极端运动可能使对齐方法失效,我们设计了一种智能体引导的生成对齐策略,该策略使用基于MLLM的动态区域解析,在参考帧引导下重建非参考帧中的不可靠内容。实验表明,HDRAgent有效减少了鬼影和局部伪影,同时实现了具有竞争力或更优的客观性能和视觉质量。

英文摘要

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception--distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.

2606.09250 2026-06-09 cs.CV 新提交

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

LiteVSR: 冻结扩散变换器的轻量级自适应用于视频超分辨率

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LiteVSR,利用流匹配原理,通过轻量级状态感知适配器在冻结扩散变换器上实现视频超分辨率,仅需11.25%可训练参数和12 GPU小时训练。

详情
AI中文摘要

将大规模预训练视频生成器适应于新领域的视频超分辨率(VSR)在计算上仍然昂贵。将生成重新表述为直接从低质量到高质量映射的方法偏离了原始生成形式,需要大量微调。ControlNet风格的适配器在现代扩散变换器下失去效率,因为缺少编码器-解码器层次结构迫使复制整个骨干网络。我们观察到流匹配为跨域VSR适应提供了一种原则性替代方案。通过预测所有时间步上的恒定速度场,适应任务简化为学习固定的注入模式,而不是时变变换。基于这一见解,我们提出了LiteVSR,一个极简框架,使用完全冻结的扩散变换器和轻量级状态感知适配器执行VSR。该适配器采用双流架构,从低质量输入中提取静态结构线索,从中间去噪状态中提取动态线索,通过时间依赖的交叉注意力对齐它们,使得随着去噪进行,从结构对齐到纹理细化的自适应过渡成为可能。LiteVSR在仅使用11.25%可训练参数和单个A100上12 GPU小时的训练下实现了有竞争力的恢复质量,同时保持了快速采样(低至单步)的兼容性。

英文摘要

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers because the absence of an encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while remaining compatible with fast sampling (down to a single step).
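
The "constant velocity field" remark in the LiteVSR abstract can be made concrete with a small sketch. It assumes the linear-interpolation (rectified-flow style) path between a noise sample and a data sample, which the abstract does not spell out. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)  # noise sample (path start, t = 0)
x1 = rng.standard_normal(4)  # data sample (path end, t = 1)

def interpolate(t):
    """Point on the straight path between x0 and x1."""
    return (1.0 - t) * x0 + t * x1

# Regression target for the velocity network: the same vector at every t.
velocity = x1 - x0

# Finite-difference check that the path velocity does not depend on t.
fd_velocity = (interpolate(0.7) - interpolate(0.2)) / 0.5

# With the exact velocity, a single Euler step of size 1 reaches x1.
one_step = x0 + 1.0 * velocity
```

Because the target does not change with `t` on this path, a single large integration step is consistent with the training objective, which fits the abstract's note about single-step sampling.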

2606.09516 2026-06-09 cs.CV 新提交

SwiftVR: Real-Time One-Step Generative Video Restoration

SwiftVR:实时一步生成式视频恢复

Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li

发表机构 * State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau(澳门大学智慧城市物联网国家重点实验室) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室)

AI总结 提出SwiftVR,一种流式一步生成式视频恢复框架,通过因果分块协议、无掩码移位窗口自注意力和轻量级恢复感知自编码器,在消费级GPU上实现实时高清视频恢复。

详情
AI中文摘要

实时视频恢复(VR)用于直播流,需要在严格的每帧延迟约束下输出高分辨率结果。现有的一步扩散式VR模型由于两个主要瓶颈难以部署在消费级GPU上:高分辨率下的二次空间注意力以及大型视频自编码器的延迟-内存开销。我们提出SwiftVR,一种流式一步生成式VR框架,在因果分块协议下减少这两个瓶颈。对于注意力,无掩码移位窗口自注意力通过确定性索引将每个空间窗口聚合成密集张量,所有注意力调用都在密集缩放点积注意力路径上,无需掩码、循环移位、填充或硬件特定的稀疏核。由于SwiftVR仅使用标准密集SDPA调用,训练好的模型无需重新训练或自定义核即可迁移到消费级GPU。对于自编码,轻量级恢复感知自编码器在保持重建质量的同时实现快速分块解码。在单个H100上,SwiftVR在2560x1440分辨率下维持31 FPS,在3840x2160下维持14 FPS,而所有对比的扩散式VR基线在4K下均超出内存限制。在消费级RTX 5090上,SwiftVR在1920x1080下达到26 FPS。据我们所知,SwiftVR是首个在消费级GPU上实现实时1080p流媒体的生成式VR模型,同时以更低的推理成本获得强大的无参考感知质量。项目地址:https://h-oliday.github.io/SwiftVR。

英文摘要

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.
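
Gathering spatial windows into a dense tensor by deterministic indexing, as the SwiftVR abstract describes, might look roughly like the following. This sketch assumes windows that tile the feature map exactly, uses identity query/key/value projections, and omits the shift. The abstract does not say how SwiftVR handles shifted windows at the borders, so none of that is modelled here. Function names are hypothetical.

```python
import numpy as np

def window_indices(height, width, win):
    """Flat token indices for each non-overlapping win x win window."""
    grid = np.arange(height * width).reshape(height, width)
    grid = grid.reshape(height // win, win, width // win, win)
    return grid.transpose(0, 2, 1, 3).reshape(-1, win * win)

def dense_window_attention(tokens, idx):
    """Self-attention inside each window using only dense matmuls."""
    x = tokens[idx]                                   # (windows, win*win, C)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = np.empty_like(tokens)
    out[idx] = weights @ x                            # scatter back to token order
    return out

idx = window_indices(4, 4, 2)
tokens = np.ones((16, 8))
out = dense_window_attention(tokens, idx)
```

The index table is computed once per resolution, so every attention call sees a regular `(windows, tokens, channels)` batch with no mask tensor.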

2606.09608 2026-06-09 cs.CV 新提交

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

TUDSR: 用于更高超分辨率的两次上采样扩散

Zhiqiang Wu, Yitong Dong, Xian Wei

发表机构 * East China Normal University(华东师范大学) Zhejiang University(浙江大学)

AI总结 提出TUDSR框架,通过两阶段训练(R分辨率和NR分辨率)结合循环分块策略,在SD2.1基础上实现1024²和2048²高分辨率图像超分辨率,显著优于现有方法。

详情
AI中文摘要

基于扩散的生成模型在真实世界图像超分辨率(SR)中取得了显著成功。通过分块扩散技术,这些模型可以生成超出其原生支持分辨率的高分辨率图像。然而,这种高分辨率(例如2048²)输出的质量通常仍然极差,主要归因于我们考虑的两个因素:图像上采样比率(例如×8)超过模型原生支持的上采样比率(例如×4),以及模型的原生支持分辨率。在实践中,训练原生高分辨率模型需要更大的架构,这会导致显著的计算开销和GPU内存成本,使其在资源有限的设备上难以实现。因此,我们提出了TUDSR,一种用于更高超分辨率的两次上采样扩散框架。TUDSR框架主要包括两个阶段:第一阶段在R分辨率下训练,第二阶段引入基于循环分块的训练策略在NR分辨率下训练。每个阶段采用包含生成器和判别器的单步GAN架构。基于SD2.1-base,我们开发了TUDSR-S,在多个基准测试中取得了最先进的性能。大量实验进一步表明,TUDSR-S在1024²甚至2048²分辨率下生成高质量图像,显著优于现有方法。代码可在https://github.com/wuer5/TUDSR获取。

英文摘要

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., $2048^2$) outputs often remains extremely poor, primarily due to two factors: the image upsampling ratio (e.g., $\times8$) exceeding the model's native-supported upsampling ratio (e.g., $\times4$), and the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at the resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

2606.09792 2026-06-09 cs.CV 新提交

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

探测器有限读出下非相干成像分类的端到端优化

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić

发表机构 * Research Laboratory of Electronics, Massachusetts Institute of Technology(麻省理工学院电子研究实验室) Department of Physics, Massachusetts Institute of Technology(麻省理工学院物理系)

AI总结 针对探测器有限读出场景,通过端到端优化相位掩模提升非相干成像分类性能,理论证明全读出下无增益,有限读出下通过增强类可分性实现显著改进。

详情
AI中文摘要

光学前端(如超表面)和神经网络后端的端到端联合优化已广泛应用于成像任务,但缺乏一个形式化框架来描述此类系统何时以及为何优于传统透镜成像。本文聚焦于分类这一核心成像任务,探究端到端优化非相干成像相位掩模何时能提升性能。我们发现,这些增益主要出现在探测器读出受限的情况下,而在全读出下则有限。在后一种情况下,我们证明没有非相干相位掩模能超过探测器测量与类别标签之间的理想信道互信息;传统聚焦透镜接近这一上限,联合优化无实证增益。当探测器读出受限时(通过粗空间采样或有限测量次数),优化光学系统可通过增加探测器测量中的类可分性来显著提升分类性能。这些增益在低探测器噪声下最大,并随噪声增大而减小,因为光学系统在信号到达探测器前塑造信号,但无法去除之后添加的噪声。该优势还取决于任务的光谱结构:当类别判别内容集中在比类内变化更低的空间频率时,协同设计帮助最大。我们开发了一个理论框架来形式化这些区别,并在合成数据和标准基准(MNIST、FashionMNIST、SVHN)上测试其预测。

英文摘要

End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).

2606.07675 2026-06-09 eess.IV cs.CV cs.LG 交叉投稿

The Need for Neural ISP in the Small-Pixel Era: How Shrinking Pixels Push Optics to the Limit and Neural Restoration Pushes Back

小像素时代对神经ISP的需求:像素缩小将光学推向极限,神经恢复则逆势而上

Jingxi Li, Neerja Aggarwal, Laurent Gudemann, Shivansh Rao, Vishal Vinod, Tom E. Bishop, Ziv Attar

发表机构 * Glass Imaging Inc(玻璃成像公司)

AI总结 针对智能手机小像素长焦模块中光学像差限制分辨率的问题,提出基于学习的神经ISP恢复图像,在0.35微米像素下实现2.5-3倍分辨率提升,表明神经ISP可替代复杂光学设计。

详情
AI中文摘要

智能手机长焦摄像头正接近“长焦物理墙”:随着像素间距缩小至亚0.5微米,光学系统仍受几何像差限制,导致分辨率收益递减。传统图像信号处理器(ISP)无法消除这些像差,因为它们通过局部、分阶段处理运行,没有明确的点扩散函数(PSF)模型。我们展示了基于学习的神经ISP用于图像恢复,通过训练底层退化,逆转了分阶段流水线无法处理的问题,将小像素设计转化为净优势。我们通过一个代表性长焦模块的受控模拟进行研究,评估了五种配置(0.35--0.75微米像素间距)。光圈按比例缩放以保持每像素信噪比和衍射光斑尺寸固定,从而隔离几何像差和空间采样。传统ISP随像素减小仅适度改进,而神经ISP显著扩展:在0.35微米时,其MTF50(垂直)达到745 cycles/mm,比传统ISP分辨率提升2.5-3倍,LPIPS从0.244显著改善至0.151,而传统结果保持相对平坦。在低信噪比扩展中(0.35微米下每帧15 dB突发),多帧神经ISP恢复的性能接近亮光单帧基线,而多帧传统ISP没有显示出有意义的改进——表明小像素下的传统流水线受限于未校正的PSF模糊而非噪声。这些结果指向一种设计理念:神经ISP通过校正残余光学像差而非要求日益复杂的光学系统,实现高分辨率长焦模块。

英文摘要

Smartphone telephoto cameras are approaching a "telephoto physics wall": as pixel pitches shrink toward sub-0.5 micron, the optics remain limited by geometric aberrations, leading to diminishing returns on resolution. Traditional Image Signal Processors (ISPs) cannot eliminate these aberrations, because they operate through local, stage-wise processing with no explicit model of the underlying point spread function (PSF). We demonstrate how a learning-based Neural ISP for image restoration, trained on the underlying degradations, inverts what stage-wise pipelines cannot, turning small-pixel designs into a net advantage. We investigate this through a controlled simulation of a representative telephoto module, evaluating five configurations (0.35--0.75 micron pixel pitch). The aperture is scaled proportionally to keep per-pixel SNR and diffraction spot size fixed, thereby isolating geometric aberration and spatial sampling. While the traditional ISP improves only modestly with smaller pixels, the Neural ISP scales substantially: at 0.35 micron it reaches 745 cycles/mm MTF50 (vertical), a 2.5--3x resolution improvement over the traditional ISP, and LPIPS improves significantly from 0.244 to 0.151 while traditional results stay comparatively flat. In a low-SNR extension (15 dB per-frame bursts at 0.35 micron), a multi-frame Neural ISP recovers performance close to the bright-light single-frame baseline, whereas a multi-frame traditional ISP shows no meaningful improvement -- indicating that traditional pipelines at small pixels are bottlenecked by uncorrected PSF blur rather than by noise. These results point to a design philosophy in which Neural ISPs enable high-resolution telephoto modules by correcting residual optical aberrations rather than requiring increasingly complex optics.
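
To put the reported 745 cycles/mm MTF50 in context, the sampling (Nyquist) limit for a given pixel pitch is 1 / (2 x pitch). The short calculation below assumes square pixels and uses only the two endpoint pitches quoted in the abstract. The helper name is hypothetical.

```python
def nyquist_cycles_per_mm(pixel_pitch_um):
    """Sampling-limited spatial frequency for a given pixel pitch."""
    pitch_mm = pixel_pitch_um / 1000.0
    return 1.0 / (2.0 * pitch_mm)

nyquist_small = nyquist_cycles_per_mm(0.35)  # smallest pitch in the study
nyquist_large = nyquist_cycles_per_mm(0.75)  # largest pitch in the study

# Reported MTF50 at 0.35 micron, expressed as a fraction of Nyquist.
mtf50_fraction = 745.0 / nyquist_small
```

Under these assumptions the reported MTF50 is a little over half the Nyquist frequency at 0.35 micron, and it lies above the Nyquist limit of a 0.75 micron sensor, so that figure is only reachable with the smaller pixels.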

2606.07896 2026-06-09 physics.optics cs.CV 交叉投稿

Beyond the Thin-Layer Limit: Differentiable Volumetric Training for Visible-Range Diffractive Neural Networks

超越薄层极限:可见光衍射神经网络的微分体积训练

Dineth Jayakody, Dushan N. Wadduwage

发表机构 * Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA(计算机科学系,欧道明大学,诺福克,VA 23529,美国) School of Data Science, Old Dominion University, Norfolk, VA 23529, USA(数据科学学院,欧道明大学,诺福克,VA 23529,美国) Department of Physics, Old Dominion University, Norfolk, VA 23529, USA(物理系,欧道明大学,诺福克,VA 23529,美国)

AI总结 针对可见光衍射神经网络因薄层近似导致性能不佳的问题,提出可微光束传播层,将每个衍射元件建模为有限厚度体积,显著降低设计-器件失配,FDTD验证将分类准确率从50%提升至90%。

详情
AI中文摘要

衍射深度神经网络(D2NN)有望为机器视觉提供微型化、低功耗、光速的光学前端,然而最成熟的演示仍停留在太赫兹波段,由易于制备的毫米尺度神经元构建。将D2NN推广到几乎所有视觉流水线工作的可见光波段,长期以来归因于纳米尺度神经元的制备困难;但即使近期进展消除了这一障碍,与太赫兹对应物匹配的可见光D2NN仍遥不可及。我们识别出真正的障碍是几乎所有D2NN训练所依赖的薄层近似,它将每个衍射层视为无限薄的掩模。失败的原因并非通常假设的短波长,而是可见光波段使用的低折射率材料(n约1.3-1.5)需要足够厚的浮雕结构,使得层内衍射和相位积累变得显著。为克服这一问题,我们引入可微光束传播($\partial$BPM)层,将每个元件建模为有限厚度体积,并在训练过程中通过其传播光,保持与制备兼容的高度图端到端可训练,无需全波仿真在环。在MNIST、Fashion-MNIST和CIFAR-100分类及成像任务中,$\partial$BPM训练显著降低了设计-器件失配,全波FDTD验证将分类准确率从50%提升至90%,无需重新优化。因此,$\partial$BPM层为高效光学神经网络优化与制备一致的衍射设计之间提供了可扩展的、物理感知的桥梁。

英文摘要

Diffractive deep neural networks (D2NNs) promise miniaturized, power-efficient, light-speed optical front-ends for machine vision, yet the most mature demonstrations remain in the terahertz regime, built from readily fabricated millimeter-scale neurons. Translating D2NNs to the visible range, where nearly all vision pipelines operate, was long blamed on the difficulty of fabricating nanoscale neurons; but even after recent advances removed that barrier, visible-range D2NNs matching their terahertz counterparts remain out of reach. We identify the true obstacle as the thin-layer approximation underlying nearly all D2NN training, which treats each diffractive layer as an infinitely thin mask. It fails not because of the short wavelength, as is commonly assumed, but because the low-refractive-index materials (n approximately 1.3-1.5) used at visible wavelengths require relief structures thick enough that intra-layer diffraction and phase accumulation become significant. To overcome this, we introduce a differentiable beam-propagation ($\partial$BPM) layer that models each element as a finite-thickness volume and propagates light through it during training, keeping the fabrication-compatible height map end-to-end trainable without full-wave simulation in the loop. Across MNIST, Fashion-MNIST, and CIFAR-100 classification and imaging, $\partial$BPM training substantially reduces the design-to-device mismatch, and full-wave FDTD validation raises classification accuracy from 50% to 90% without re-optimization. The $\partial$BPM layer thus offers a scalable, physics-aware bridge between efficient optical neural-network optimization and fabrication-consistent diffractive design.
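
A short worked relation shows why low-index materials force thick relief structures. It assumes a relief of height $h$ and refractive index $n$ surrounded by air at normal incidence, and it uses the thin-element phase delay itself, so it only estimates the required thickness and does not model intra-layer diffraction.

```latex
\[
\phi(h) = \frac{2\pi}{\lambda}\,(n - 1)\,h
\quad\Longrightarrow\quad
h_{2\pi} = \frac{\lambda}{n - 1},
\qquad
h_{2\pi}\big|_{n = 1.5} = 2\lambda,
\qquad
h_{2\pi}\big|_{n = 1.3} \approx 3.3\,\lambda .
\]
```

For the index range quoted in the abstract, full $2\pi$ phase coverage therefore needs relief heights of roughly two to three wavelengths, far from an infinitely thin mask.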

2606.08370 2026-06-09 eess.IV cs.CV 交叉投稿

Programmable Silicon Retina on Pixel Processor Array

可编程硅视网膜在像素处理器阵列上的实现

Maciej Lewandowski, Prince Philip, Alexandre Marcireau, Chetan Singh Thakur, André van Schaik, Piotr Dudek

发表机构 * Department of Electrical and Electronic Engineering, University of Manchester, UK(电气与电子工程系,曼彻斯特大学,英国) Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore, India(电子系统工程系,印度科学研究院,班加罗尔,印度) Department of Computer Science, University of Manchester, UK(计算机科学系,曼彻斯特大学,英国)

AI总结 在SCAMP-5像素处理器阵列上首次实现多级硅视网膜模型,通过空间滤波和增益控制等生物启发处理,在视频显著性预测中损失降低13%,事件率减少约47%。

详情
AI中文摘要

标准动态视觉传感器通过检测时间对比度变化来近似视网膜处理,提供高速度和高动态范围。在这项工作中,我们探讨了加入额外的生物启发处理阶段——特别是空间滤波和增益控制——是否能为某些下游任务(如显著性预测)带来优势。我们首次在SCAMP-5像素处理器阵列上实现了多级硅视网膜模型,并提供了基于GPU的仿真框架。我们在视频强度重建和视频显著性预测上评估了模型性能。虽然生物启发模型在重建绝对强度帧方面效果较差,但与标准DVS事件表示相比,它在显著性预测损失上降低了13%,同时事件率减少了约47%。这些实验使用了一个轻量级的约10万参数的FireNet风格网络,该网络从基于事件的重建调整为显著性预测。这些结果表明,硅视网膜的“信息蒸馏”机制可以为下游神经网络实现更高效的表示,特别是在带宽受限的边缘应用中。

英文摘要

Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages - specifically spatial filtering and gain control - can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13% reduction in saliency prediction loss in comparison to standard DVS event representation, while reducing the event rate by approximately 47%. These results are obtained using a lightweight FireNet-style network with approximately 100k parameters, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina's "information distillation" mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.
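
The "temporal contrast" baseline that the abstract compares against can be sketched as a simple frame-based event generator. This is a simplified model with an assumed threshold, it emits at most one event per pixel per frame, and it does not include the spatial filtering or gain control stages of the Silicon Retina model. Names and parameter values are illustrative.

```python
import numpy as np

def dvs_events(frames, threshold=0.2, eps=1e-3):
    """Emit (t, y, x, polarity) when log intensity changes by >= threshold."""
    log_ref = np.log(frames[0] + eps)       # per-pixel reference level
    events = []
    for t in range(1, len(frames)):
        log_now = np.log(frames[t] + eps)
        diff = log_now - log_ref
        fired = np.abs(diff) >= threshold
        for y, x in zip(*np.nonzero(fired)):
            polarity = 1 if diff[y, x] > 0 else -1
            events.append((t, int(y), int(x), polarity))
        # Pixels that fired reset their reference to the current level.
        log_ref = np.where(fired, log_now, log_ref)
    return events

# One pixel doubles in brightness at frame 1 and then stays constant.
frames = np.ones((3, 2, 2))
frames[1:, 0, 0] = 2.0
events = dvs_events(frames)
```

Static pixels produce no events in this model, which is why event rate is a natural bandwidth measure for the comparison reported in the abstract.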

2404.01948 2026-06-09 cs.CV 版本更新

Quantifying Noise of Dynamic Vision Sensor

量化动态视觉传感器的噪声

Evgeny V. Votyakov, Alessandro Artusi

发表机构 * DeepCamera MRG CYENS Centre of Excellence(CYENS卓越中心) Nicosia, Cyprus(塞浦路斯尼科西亚)

AI总结 本文提出基于去趋势波动分析的新型技术,用于量化动态视觉传感器背景噪声,解决无真值(ground truth)情况下噪声与信号区分难题,并展示最优去噪滤波器参数的确定方法。

Comments 5 pages, 4 figures, submitted to the IEEE Signal Processing Letters

详情
AI中文摘要

动态视觉传感器(DVS)的特点是存在大量背景活动(BA)噪声,该噪声与原始(干净的)传感器信号混合在一起。信号的动态特性以及实际应用中缺乏真值(ground truth),使得标准图像处理技术难以区分噪声与干净信号。本文提出一种基于去趋势波动分析(DFA)的新技术,用于表征BA噪声。该技术可用于解决现有的两个DVS问题:如何在无真值情况下定量表征噪声和信号,以及如何推导最优去噪滤波器参数。后一问题的解决方案在常用的真实移动汽车数据集上得到了演示。

英文摘要

Dynamic vision sensors (DVS) are characterized by a large amount of background activity (BA) noise, which is mixed with the original (clean) sensor signal. The dynamic nature of the signal and the absence of ground truth in practical applications make it difficult to distinguish noise from the clean sensor signal using standard image processing techniques. In this letter, a new technique derived from Detrended Fluctuation Analysis (DFA) is presented to characterise BA noise. The proposed technique can be used to address two existing DVS issues: how to quantitatively characterise noise and signal without ground truth, and how to derive optimal denoising filter parameters. The solution to the latter problem is demonstrated on the popular real moving-car dataset.
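
For readers unfamiliar with Detrended Fluctuation Analysis, a minimal first-order DFA is sketched below. How the letter maps DVS events to a 1-D series is not stated in the abstract, so the input here is just a generic sequence (for example, event counts per time bin would be one assumption). In theory the scaling exponent is 0.5 for uncorrelated noise and 1.5 for its running sum.

```python
import numpy as np

def dfa_exponent(signal, scales):
    """First-order DFA: slope of log F(s) versus log s."""
    profile = np.cumsum(signal - np.mean(signal))
    fluctuations = []
    for scale in scales:
        n_windows = len(profile) // scale
        segments = profile[: n_windows * scale].reshape(n_windows, scale)
        t = np.arange(scale)
        # Fit and remove a straight line in every window at once.
        coeffs = np.polyfit(t, segments.T, 1)          # shape (2, n_windows)
        trend = np.outer(t, coeffs[0]) + coeffs[1]     # shape (scale, n_windows)
        residual = segments.T - trend
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return slope

rng = np.random.default_rng(0)
white = rng.standard_normal(8192)
scales = [16, 32, 64, 128, 256]

alpha_white = dfa_exponent(white, scales)             # uncorrelated noise
alpha_walk = dfa_exponent(np.cumsum(white), scales)   # strongly correlated
```

A gap between the exponents of noise-like and signal-like series is what makes a DFA-based score usable without ground truth.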

2406.07318 2026-06-09 cs.CV cs.AR eess.IV 版本更新

Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

嵌入式图卷积网络用于SoC FPGA上的实时事件数据处理

Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Andrea Pinna, Tomasz Kryjak

发表机构 * University of Warsaw(华沙大学) Politechnika Warszawska(华沙理工大学)

AI总结 提出一种针对事件相机的硬件感知图卷积网络EFGCN,在SoC FPGA上实现实时处理,模型大小相比AEGNN降低100倍,精度仅下降2.9%。

详情
Journal ref Journal of Systems Architecture, Volume 177, August 2026, 103850
AI中文摘要

事件相机的使用代表了解决传统视频系统限制的重要且快速发展的趋势。特别是在汽车领域,这些相机因其低延迟和低功耗而集成到嵌入式实时系统中具有重要意义。确保事件处理所需吞吐量和延迟的一种有效方法是利用图卷积网络(GCNs)。在本研究中,我们引入了一种定制的EFGCN(基于事件的FPGA加速图卷积网络),该网络采用了一系列针对PointNetConv(一种用于点云处理的图卷积)的硬件感知优化。所提出的技术相比该领域最新工作之一——异步基于事件的GNN(AEGNN),模型大小减少了高达100倍,而精度下降相对较小(N-Caltech101分类任务下降2.9%,N-Cars分类任务下降2.2%),从而遵循了TinyML趋势。我们在ZCU104 SoC FPGA平台上实现了EFGCN,无需任何片外外部存储器资源,实现了每秒1330万事件(MEPS)的吞吐量和低延迟的实时部分异步处理。在多个基于事件的分类基准测试中,我们的方法在提供每事件最先进的计算效率、小模型大小以及高可扩展性、可定制性和资源效率的同时,实现了具有竞争力的精度。我们将软件和硬件源代码发布在开放存储库中:https://github.com/vision-agh/gcnn-dvs-fpga。

英文摘要

The utilisation of event cameras represents an important and swiftly evolving trend aimed at addressing the constraints of traditional video systems. Particularly within the automotive domain, these cameras find significant relevance for their integration into embedded real-time systems due to lower latency and power consumption. One effective approach to ensure the necessary throughput and latency for event processing is through the utilisation of graph convolutional networks (GCNs). In this study, we introduce a custom EFGCN (Event-based FPGA-accelerated Graph Convolutional Network) designed with a series of hardware-aware optimisations tailored for PointNetConv, a graph convolution designed for point cloud processing. The proposed techniques result in up to 100-fold reduction in model size compared to Asynchronous Event-based GNN (AEGNN), one of the most recent works in the field, with a relatively small decrease in accuracy (2.9% for the N-Caltech101 classification task, 2.2% for the N-Cars classification task), thus following the TinyML trend. We implemented EFGCN on a ZCU104 SoC FPGA platform without any off-chip external memory resources, achieving a throughput of 13.3 million events per second (MEPS) and real-time partially asynchronous processing with low latency. Across multiple event-based classification benchmarks, our approach achieves competitive accuracy while providing state-of-the-art computational efficiency per event, small model size, and high scalability, customisability and resource efficiency. We publish both software and hardware source code in an open repository: https://github.com/vision-agh/gcnn-dvs-fpga.

2605.22208 2026-06-09 cs.CV 版本更新

EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning

EvoIR-Agent: 通过经验驱动学习实现自进化图像修复智能体

Kailin Zhuang, Jiawei Wu, Zhi Jin

AI总结 本文提出EvoIR-Agent,通过经验驱动学习解决图像修复中经验不足导致的规划失败问题,通过构建分层经验池和自进化机制提升修复性能和效率,实验表明其在全参考指标上表现优异,且在性能与效率之间取得显著平衡。

Comments Temporarily withdrawn for institutional clearance and compliance review. A revised version will be uploaded once the process is finalized

详情
AI中文摘要

多模态大语言模型(MLLM)驱动的图像修复智能体在退化耦合场景中表现出色,能够灵活选择工具并确定去除顺序。然而,其零样本规划在缺乏经验时往往失效,需要通过大量试错来获得满意结果。目前有两种方法用于解决此问题,但存在矛盾:基于训练的方法将内在经验嵌入参数中,实现高推理效率但缺乏对新工具或退化的兼容性。相比之下,基于免训练的方法利用显式经验存储以提高兼容性,但仍因经验存储方式简单而存在试错开销。为解决此矛盾,本文提出EvoIR-Agent,首先系统地制定了免训练图像修复智能体的经验组件。随后构建了分层经验池,能够为多样化的工具和去除顺序提供粗到细的指导。此外,引入了自进化机制,通过积累的记录更新池,从而大大提高了性能和效率。大量实验表明,EvoIR-Agent在全参考指标上取得了显著领先,并在性能与效率之间实现了显著的帕累托最优平衡。

英文摘要

Multimodal Large Language Model (MLLM)-driven image restoration agents demonstrate effectiveness in degradation coupling scenarios by flexibly selecting tools and determining removal orders. However, their zero-shot planning often fails without experience, necessitating severe trial-and-error overhead to achieve satisfactory outcomes. Currently, two paradigms are employed to address this issue, yet a dilemma persists: training-based methods embed intrinsic experience into parameters, achieving high inference efficiency but lacking compatibility with new tools or degradations. In contrast, training-free methods utilize explicit experience storage for compatibility but still incur trial-and-error overhead due to naive experience. To resolve the dilemma, we propose EvoIR-Agent, which first systematically formulates the experience components of a training-free image restoration agent. Subsequently, a hierarchical experience pool is constructed, which enables coarse-to-fine guidance for diverse tools and removal orders. Furthermore, a self-evolving mechanism is introduced to update the pool from scratch using accumulated records, thereby greatly improving performance and efficiency. Extensive experiments reveal that EvoIR-Agent achieves a significant lead in full-reference metrics and yields a remarkable Pareto-optimal balance between performance and efficiency compared to state-of-the-art methods.

11. 鲁棒性、安全、隐私与可信视觉 16 篇

2606.07593 2026-06-09 cs.CV cs.AI 新提交

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

视觉Transformer对抗微调的机制分析

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 通过机制分析研究对抗微调对视觉Transformer在扰动和常规图像上性能的影响,发现微调仅改善特定类型扰动,未改变稀疏表示。

详情
AI中文摘要

图像分类模型在高风险现实场景中的广泛应用要求模型对输入图像的轻微扰动(如模糊或锐化)具有鲁棒性。尽管视觉Transformer(ViT)在现代多模态模型(如视觉-语言模型(VLM)和视觉-语言-动作(VLA)模型)中扮演着不可或缺的角色,但在鲁棒性设置中它们缺乏关注。在这项工作中,我们通过机制视角分析了对抗微调(一种提高模型对图像扰动鲁棒性的流行方法)对ViT在扰动和常规图像上性能的影响。我们在低频和高频图像损坏上对抗训练ViT,并试图通过检查模型的注意力机制、内部表示和知识演化来解释下游模型性能的变化。总体而言,我们的结果表明,虽然对带有常见损坏的输入进行微调提高了模型在新损坏数据实例上的性能和确定性,但这些改进不会转移到训练中未见过的其他类别损坏。此外,尽管观察到各层视觉注意力和知识演化的变化,我们发现对抗训练并未导致ViT学习的稀疏表示发生根本性变化。

英文摘要

The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight disturbances or perturbations, such as blurring or sharpening, in the input images. While vision transformers (ViTs) play an integral role in many modern-day multi-modal models like Vision-Language-Models (VLMs) and Vision-Language-Action (VLA) models, they have received little attention in robustness settings. In this work, we analyze the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images through a mechanistic lens. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen during training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.

2606.07620 2026-06-09 cs.CV cs.AI cs.DC cs.LG 新提交

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

SENTRY: 视觉Transformer在软错误下的统计可靠性分析

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

发表机构 * Brandenburg University of Technology Cottbus-Senftenberg(勃兰登堡工业大学) Tallinn University of Technology(塔林理工大学) Zanjan University(赞詹大学)

AI总结 提出基于有限总体抽样的统计故障注入框架,仅需数千样本即可在99%置信度下以1%误差界估计故障率,将实验成本降低高达10700倍,并揭示ViT中归一化层和关键指数位是脆弱性热点。

详情
AI中文摘要

随着视觉Transformer在自动驾驶和医学成像等安全关键领域的应用增长,确保其抵抗软错误的可靠性至关重要。尽管ViT提供了最先进的准确性,但其庞大的参数数量使得穷举故障注入不可行。为弥补这一差距,本文提出一个统计故障注入框架,利用有限总体抽样理论提供形式化的可靠性保证。我们证明,无论模型规模如何,仅需数千个样本即可在99%置信度下将故障率限制在1%的误差界内。与穷举方法相比,该方法将实验成本降低高达10700倍,同时保留跨架构组件定位脆弱性的能力。通过对ViT-Tiny和ViT-Small等不同架构的广泛评估,我们揭示了高度非均匀的可靠性景观。结果表明,虽然只有3%的FP32位翻转导致故障,但其中绝大多数事件导致灾难性的精度崩溃。具体脆弱性被定位到归一化层和IEEE-754格式中的关键指数位,为设计加固的、边缘部署的ViT架构提供了数学基础和可操作的见解。

英文摘要

With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
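
The claim that the required number of injections barely depends on model scale follows from the standard finite-population sample-size formula, n = N / (1 + e^2 (N - 1) / (t^2 p (1 - p))). The sketch below evaluates it for hypothetical fault-space sizes. The abstract does not state the values of N or p used in the paper, so every number chosen here is an assumption.

```python
import math

def fault_injection_sample_size(population, margin, t_value, p=0.5):
    """Finite-population sample size for estimating a proportion.

    population: total number of possible faults (N)
    margin:     error margin (e), e.g. 0.01 for 1%
    t_value:    normal quantile for the confidence level (about 2.576 for 99%)
    p:          assumed failure proportion; 0.5 is the most conservative choice
    """
    spread = t_value ** 2 * p * (1.0 - p)
    return math.ceil(population / (1.0 + margin ** 2 * (population - 1) / spread))

T_99 = 2.576  # two-sided 99% confidence

# Conservative case (p = 0.5) for two very different hypothetical fault spaces.
n_billion = fault_injection_sample_size(10 ** 9, 0.01, T_99)
n_trillion = fault_injection_sample_size(10 ** 12, 0.01, T_99)

# A smaller assumed failure proportion reduces the sample size further.
n_low_p = fault_injection_sample_size(10 ** 9, 0.01, T_99, p=0.03)
```

Once N is much larger than n, the result is governed almost entirely by the margin, the confidence level, and the assumed p, which is why the cost stays flat as models grow.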

2606.07660 2026-06-09 cs.CV cs.LG 新提交

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

我们是否需要教基础模型什么是生成图像?基于解析谱自适应的无梯度生成伪影检测

Qiaoyu Chen, Bing Zhang

发表机构 * Harbin University of Commerce(哈尔滨商业大学)

AI总结 提出无梯度方法,将生成伪影检测重构为分布外异常度量问题,通过解析解耦统计与语义偏差,在零样本设置下显著优于梯度优化方法。

详情
AI中文摘要

通过基于梯度的更新来适应基础模型以检测生成伪影会损害其内在表示。在有限样本上优化时,模型会过拟合到局部领域捷径。在专门数据上微调大量权重会引入错误的归纳偏差,在高维特征空间中引起可测量的 $\mathcal{L}_2$ 范数扰动——我们将这一现象形式化为锚点漂移。非线性激活放大了这种漂移,损害了跨未见领域的零样本伪造检测。我们提出了一种无梯度方法,将检测从二分类重新定义为分布外(OOD)异常度量问题。将冻结的基础模型视为稳定的坐标系,通过解析解耦统计和语义偏差,在真实视觉流形上建立一个绝对的自然锚点,该锚点源自注意力加权的空间矩和感知不一致性的正交投影。在极端零样本设置下(在面部伪造上训练,在通用文本到图像生成上测试),我们的方法显著优于梯度优化范式。无反向传播的前向传递和线性求解器实现了硬件无关、边缘可部署的校准,延迟极低。此外,Sherman-Morrison公式使得能够针对新型攻击进行即时在线学习,并通过协方差增量传输实现隐私保护的联邦协作。

英文摘要

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space -- a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.
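
The Sherman-Morrison formula mentioned at the end of the abstract updates a matrix inverse after a rank-one change without refactorising. A minimal sketch follows, assuming the detector maintains an inverse of a regularised feature covariance (the abstract does not describe the exact matrix). Variable names are illustrative.

```python
import numpy as np

def sherman_morrison_update(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(d^2) operations."""
    a_inv_u = a_inv @ u
    v_a_inv = v @ a_inv
    denom = 1.0 + v @ a_inv_u
    return a_inv - np.outer(a_inv_u, v_a_inv) / denom

rng = np.random.default_rng(0)
a = 2.0 * np.eye(4)                    # stand-in for a regularised covariance
a_inv = np.linalg.inv(a)
feature = rng.standard_normal(4)       # feature vector of one new sample

# Incorporate the new sample as a rank-one update.
updated = sherman_morrison_update(a_inv, feature, feature)
reference = np.linalg.inv(a + np.outer(feature, feature))
```

Each new sample costs a few matrix-vector products and needs no gradient, which matches the "gradient-free" framing of the abstract.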

2606.07102 2026-06-09 cs.CV cs.AI 新提交

GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

GP-Adapter: 基于高斯过程的CLIP适配器用于少样本分布外检测

Taisei Saito, Koretaka Ogata, Takafumi Hiroi

AI总结 提出GP-Adapter,一种无需训练的框架,通过高斯过程不确定性建模增强CLIP,用于少样本分类和分布外检测,无需微调骨干网络,仅依赖少量缓存和轻量超参数选择。

Comments 8 pages, 6 figures, Accepted at IJCNN 2026

详情
AI中文摘要

我们提出GP-Adapter,一种无需训练的框架,通过高斯过程(GP)不确定性建模增强CLIP(对比语言-图像预训练),用于少样本分类和分布外(OOD)检测。虽然CLIP实现了强大的零样本识别,但它产生确定性的相似度分数,并提供有限的不确定性信息,这在分布偏移和数据稀缺情况下至关重要。GP-Adapter在冻结的CLIP嵌入之上,使用图像特征的RBF核和文本提示的线性核构建模态特定、类别级的一类GP,并融合它们的预测统计量,以生成方差感知的置信度分数用于OOD检测。该方法无需微调CLIP骨干网络,仅依赖于少量$K$样本缓存和轻量超参数选择,内存成本为$O(CK^2)$,其中$C$为类别数,$K$为样本数。在ImageNet和多个OOD基准上的实验表明,GP-Adapter提供了具有竞争力的少样本性能,并且在与提示学习基线结合时持续改进OOD检测,突出了基于GP的不确定性建模与提示学习之间的互补性。总体而言,我们的结果表明,将概率推理与大型预训练视觉-语言模型集成可以提高低数据和分布偏移场景下的可靠性。代码可在 https://github.com/tms-byte/GP-Adapter 获取。

英文摘要

We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter
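
To make the "variance-aware" idea concrete, here is a minimal sketch of the predictive variance of a GP with an RBF kernel built on a small per-class cache. It is an assumption-laden illustration, not the GP-Adapter code: the length scale, noise level, and the random unit vectors standing in for CLIP embeddings are all hypothetical, and the text-side linear kernel and the fusion step are omitted.

```python
import numpy as np

def rbf_kernel(X, Y, length_scale=1.0):
    """RBF kernel matrix between rows of X and rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predictive_variance(X_cache, X_query, length_scale=1.0, noise=1e-3):
    """Posterior variance k(x,x) - k_*^T (K + noise I)^{-1} k_* for each query."""
    K = rbf_kernel(X_cache, X_cache, length_scale) + noise * np.eye(len(X_cache))
    K_star = rbf_kernel(X_query, X_cache, length_scale)   # shape (Q, K)
    solved = np.linalg.solve(K, K_star.T)                 # shape (K, Q)
    return 1.0 - np.sum(K_star * solved.T, axis=1)        # k(x, x) = 1 for the RBF kernel

rng = np.random.default_rng(0)
dim, K = 16, 4                                            # K-shot cache for one class
center = np.zeros(dim)
center[0] = 1.0
cache = center + 0.05 * rng.normal(size=(K, dim))
cache /= np.linalg.norm(cache, axis=1, keepdims=True)     # unit-norm, like CLIP features

near = cache.mean(axis=0)
near /= np.linalg.norm(near)                              # query close to the cache
far = -center                                             # query far from the cache
var = gp_predictive_variance(cache, np.stack([near, far]))
```

Each class stores one K x K Gram matrix, which is where the $O(CK^2)$ memory cost quoted in the abstract comes from. A query far from the cache gets a variance close to the prior value of 1, which is the signal an OOD score can build on.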

2606.08121 2026-06-09 cs.CV 新提交

Trustworthy Visual Predicates for Robust Manipulation Understanding under Degradation

可信视觉谓词用于退化条件下的鲁棒操作理解

Fatemeh Ziaeetabar

发表机构 * University of Tehran(德黑兰大学)

AI总结 提出谓词级可靠性框架,通过结构化谓词词汇表、置信度感知估计和可靠性度量,分析模糊、遮挡等退化对操作理解中视觉谓词的影响,实验表明接触敏感和动态谓词更脆弱。

详情
AI中文摘要

操作理解需要可靠的关联证据,如接触、支撑、包含、运动耦合、抓取、释放和主动手参与。尽管这些视觉谓词广泛用于事件链、图基和神经符号模型,但它们在视觉退化下的可靠性很少被直接分析。本文引入了一个谓词级可靠性框架,用于在模糊、遮挡、光照变化、低分辨率、帧丢失和检测噪声下实现鲁棒的操作理解。该框架定义了结构化谓词词汇表、置信度感知的谓词估计以及用于谓词保持、退化敏感性、时间一致性、置信度加权稳定性和下游影响的可靠性度量。在受控操作视频和公共自我中心或双手数据集(包括VISOR/EPIC-KITCHENS、H2O和ARCTIC)上的实验表明,谓词失败是结构化的而非均匀的。静态空间谓词相对稳健,而接触敏感、动态和派生谓词(如抓取和释放)更脆弱。在严重退化下,检测噪声、遮挡和帧丢失导致最强的可靠性损失。下游分析表明,退化谓词将操作理解准确率从0.89降至0.58,而在中等退化下去除置信度加权将准确率从0.74降至0.64。这些结果表明,谓词可靠性在视觉感知和结构化操作推理之间提供了一个诊断层。

英文摘要

Manipulation understanding requires reliable relational evidence, such as contact, support, containment, motion coupling, grasp, release, and active-hand involvement. Although these visual predicates are widely used in event-chain, graph-based, and neuro-symbolic models, their reliability under visual degradation is rarely analyzed directly. This paper introduces a predicate-level reliability framework for robust manipulation understanding under blur, occlusion, illumination change, low resolution, frame dropping, and detection noise. The framework defines a structured predicate vocabulary, confidence-aware predicate estimation, and reliability metrics for predicate preservation, degradation sensitivity, temporal consistency, confidence-weighted stability, and downstream impact. Experiments on controlled manipulation videos and public egocentric or bimanual datasets, including VISOR/EPIC-KITCHENS, H2O, and ARCTIC, show that predicate failures are structured rather than uniform. Static spatial predicates remain comparatively robust, whereas contact-sensitive, dynamic, and derived predicates such as grasp and release are more fragile. Under severe degradation, detection noise, occlusion, and frame dropping cause the strongest reliability losses. Downstream analysis shows that degraded predicates reduce manipulation-understanding accuracy from 0.89 to 0.58, while removing confidence weighting under moderate degradation reduces accuracy from 0.74 to 0.64. These results show that predicate reliability provides a diagnostic layer between visual perception and structured manipulation reasoning.

2606.08634 2026-06-09 cs.CV 新提交

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

SSAFE: 通过冻结视觉编码器实现简单而强大的AI生成图像检测

Seunghyun Lee, Byoungkwon Kim, Jaehyun Nam, Kyungmin Lee, Jinwoo Shin

发表机构 * KAIST(韩国科学技术院) Google Cloud AI(谷歌云AI)

AI总结 本文发现冻结的多模态视觉编码器在嵌入空间中自然分离真实与合成图像,通过线性分类器即可实现强检测性能,并提出一种表示感知的数据策展策略,仅用10K图像训练,在多个基准上表现优异。

Comments Preprint. 22 pages, 10 figures, supplementary material included

详情
AI中文摘要

生成模型的快速发展模糊了合成图像与真实图像之间的界限,产生了对可靠深度伪造检测的迫切需求。然而,大多数现有方法依赖于大规模的真实-伪造数据集,随着新生成器的不断涌现,这些数据集越来越难以维护。在这项工作中,我们研究了图像真实性信息在多大程度上已经编码在现代多模态视觉表示中。我们发现,冻结的多模态编码器在其嵌入空间中自然分离真实图像和合成图像,使得简单的线性分类器无需特定任务微调即可实现强性能。受此观察启发,我们开发了一种表示感知的数据策展策略,选择一组紧凑的代表性生成器进行训练。由此产生的训练集仅包含10K张图像,而AIGIBench为288K张,OpenFake为400万张,同时提高了对未见生成器和分布偏移的鲁棒性。我们还引入了RealWorldBench,这是一个包含现代相机照片、当代库存图像以及近期商业生成器输出的基准。在多个基准上的实验表明,将冻结的多模态表示与精心策展的训练数据相结合,为AI生成图像检测提供了一种简单而有效的方法。

英文摘要

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real-fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.

2606.08864 2026-06-09 cs.CV cs.LG 新提交

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

CHROMA: 通过通道间色彩空间相关性检测AI生成图像

Juan Pablo Sotelo, Marina Gardella, Pablo Musé

发表机构 * Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay(乌拉圭共和国大学工程学院电气工程研究所) Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, Gif-sur-Yvette, 91190 France(巴黎萨克雷大学,巴黎萨克雷高等师范学校,法国国家科学研究中心,博雷利中心)

AI总结 提出利用通道间色彩相关性作为轻量级取证线索,通过增强RGB输入与相关性图,使用固定CNN骨干网络在有限计算预算下训练,有效区分真实与AI生成图像,并提升对未知生成器的鲁棒性。

Comments This manuscript has been accepted for publication at the 28th International Conference on Pattern Recognition (ICPR 2026). The final published version will appear in the Springer LNCS proceedings

详情
AI中文摘要

扩散模型和大规模生成模型的快速普及使得区分合成图像与真实照片越来越具有挑战性。尽管已有自动检测器被提出,但它们对未见生成器的泛化能力仍然脆弱。为解决这一局限,我们研究了通道间色彩相关性,这是一种轻量级且未被充分利用的取证线索。我们首先证明,LPIPS(一种广泛使用的感知度量)对选择性改变不同色彩空间参数化下通道依赖性的扰动表现出不一致的响应,表明跨通道统计量并不受常见感知训练目标的统一约束。受此启发,我们分析了多个色彩空间中成对通道间相关性特征的分布。我们的分析揭示了这些分布中系统性的、生成器特定的差异,其中RGB和Lab色彩空间提供了真实图像与生成图像之间最明显的分离。基于此,我们引入了Chroma,一种AI生成图像检测器,它用通道间相关性图增强标准RGB输入,并采用在适度计算预算下训练的固定CNN骨干网络。我们在单生成器训练和有限多生成器监督机制(仅从额外生成器获取少量样本)下评估其鲁棒性。在标准基准协议下,相关性增强的输入改善了真实与生成图像的区分能力和鲁棒性,在保持简单架构和训练过程的同时,性能与最新检测器相当。代码可在https://github.com/JPSoteloSilva/CHROMA获取。

英文摘要

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA
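
The abstract describes augmenting RGB inputs with inter-channel correlation maps but does not give their exact definition. The sketch below assumes one plausible form: a Pearson correlation per non-overlapping patch for each channel pair. The patch size, the pairing order, and the function name are hypothetical choices, not taken from the CHROMA repository.

```python
import numpy as np

def correlation_map(img, patch=8, eps=1e-8):
    """Patch-wise Pearson correlation for channel pairs (0,1), (0,2), (1,2).

    img: H x W x 3 array. Returns an (H//patch) x (W//patch) x 3 map.
    """
    h, w, _ = img.shape
    hp, wp = h // patch, w // patch
    x = img[:hp * patch, :wp * patch].astype(np.float64)
    # Group pixels by patch: (hp, wp, patch*patch, 3)
    x = x.reshape(hp, patch, wp, patch, 3).transpose(0, 2, 1, 3, 4)
    x = x.reshape(hp, wp, patch * patch, 3)
    x = x - x.mean(axis=2, keepdims=True)
    std = np.sqrt((x**2).mean(axis=2)) + eps              # (hp, wp, 3)
    pairs = [(0, 1), (0, 2), (1, 2)]
    maps = [(x[..., a] * x[..., b]).mean(axis=2) / (std[..., a] * std[..., b])
            for a, b in pairs]
    return np.stack(maps, axis=-1)

rng = np.random.default_rng(0)
gray = rng.random((32, 32, 1)).repeat(3, axis=2)          # identical channels
indep = rng.random((32, 32, 3))                           # independent channels
gray_map = correlation_map(gray)
indep_map = correlation_map(indep)
```

Identical channels give correlations near 1 in every patch, while independent channels give values scattered around 0. A detector along the lines the abstract describes could concatenate such maps (upsampled to image size) with the RGB input.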

2606.09511 2026-06-09 cs.CV 新提交

Securing Self-supervised Data Curation for Foundation Models Robustness

保障自监督数据筛选的安全性以提升基础模型鲁棒性

Sandeep Gupta, Roberto Passerone

发表机构 * Queen's University Belfast(贝尔法斯特女王大学) University of Trento(特伦托大学)

AI总结 针对自监督数据筛选面临的数据投毒风险,提出基于ImageBind和传统分类器的主动防御机制PDD,在多种攻击下有效检测中毒数据,SVM-PDD表现最优。

Comments 22 pages

详情
AI中文摘要

自监督数据筛选为扩展和提升机器学习模型的泛化能力提供了一条途径。通过利用自监督学习(SSL)进行数据筛选,可以有效满足基础模型对大规模训练数据集的需求。SSL极大地减轻了与标注和人工数据集筛选相关的成本,同时最小化了对人工监督的需求。然而,必须严格检查SSL筛选数据集的完整性,因为依赖匿名且未经审查的外部来源会显著增加数据投毒的风险。在本文中,我们提出了一种中毒数据检测器(PDD),这是一种主动防御机制,旨在在基础模型训练之前确保SSL筛选数据集的完整性。PDD使用预训练的ImageBind模型与传统分类器(包括随机森林(RF)、k近邻(KNN)、朴素贝叶斯(NB)和支持向量机(SVM))的组合进行设计。我们使用来自三个不同数据集的176,200张图像以及三种不同的对抗攻击(涵盖分布内和分布外场景)严格评估了PDD。值得注意的是,SVM-PDD在分布内(Set3-Set5)和分布外(TrueFace和140K RealFace)数据集上均实现了优越的性能。我们的设计表现出强大的可扩展性,并通过集成方法实现了新对抗攻击检测器的快速集成。

英文摘要

Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL greatly alleviates the costs associated with annotation and manual dataset curation while minimizing the need for human oversight. However, the integrity of SSL-curated datasets must be rigorously checked, as reliance on anonymous and unvetted external sources can substantially increase the risk of data poisoning. In this paper, we propose a Poisoned Data Detector (PDD), an active defense mechanism designed to ensure the integrity of SSL-curated datasets prior to foundation model training. PDDs are designed using a combination of the pretrained ImageBind model and traditional classifiers, including Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). We rigorously evaluated PDDs using 176,200 images from three diverse datasets and three different adversarial attacks encompassing both in-distribution and out-of-distribution scenarios. Notably, SVM-PDD achieves superior performance for both in-distribution (Set3-Set5) and out-of-distribution (TrueFace and 140K RealFace) datasets. Our design demonstrates strong scalability and enables the rapid integration of new adversarial attack detectors through an ensemble approach.

2606.09536 2026-06-09 cs.CV 新提交

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

基于Hadamard编码输出表示的对抗攻击与干扰检测用于目标检测和语义分割

Lucas Görnhardt, Timo Bartels, Niklas Schwarz, Tim Fingscheidt

发表机构 * Technische Universität Braunschweig(布伦瑞克工业大学)

AI总结 针对传统one-hot编码导致模型校准差、易受攻击的问题,提出Hadamard编码输出表示,通过优化解码过程实现最优类概率,利用预测不一致性检测对抗攻击和干扰,在单次检测中达到SOTA性能。

详情
AI中文摘要

传统的one-hot编码通常导致模型校准不佳,在攻击下过于自信,并使基于熵的检测算法失效。先前的图像分类工作表明,Hadamard编码输出表示可以提高对抗鲁棒性。然而,将Hadamard码集成到语义分割中的尝试在平均交并比性能上远落后于最先进模型。对于目标检测,此类输出编码尚未被研究。此外,现有技术没有解决内在的码字不一致性,也没有实际利用内在的码字冗余。因此,我们首先推导了一种新的Hadamard码字解码过程,以得到最优的类概率,通过使用到概率单纯形的投影来解决底层优化问题。其次,我们的优化提供了预测不一致性的度量。第三,我们首次展示了如何利用这些不一致性进行对抗攻击和干扰检测。第四,我们引入了HadamardNet,这是一个采用Hadamard码作为语义分割和目标检测模型及任务的输出表示的框架。我们在干扰和对抗攻击上进行了全面评估,在仅单次检测中实现了两项任务的最先进扰动检测性能,同时在干净数据上提供了等效或接近的参考性能。

英文摘要

Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and letting entropy-based detection algorithms fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, attempts to integrate Hadamard codes into semantic segmentation fall far behind state-of-the-art models in mean intersection-over-union performance. Regarding object detection, such output encodings have not yet been investigated at all. Further, no prior art addressed intrinsic codeword inconsistencies or actually exploited intrinsic codeword redundancy. Accordingly, we first derive a novel decoding procedure for Hadamard codewords towards optimal class-wise probabilities, solving the underlying optimization problem by using the projection onto the probability simplex. Second, our optimization delivers a measure of prediction inconsistency. Third, we are the first to show how to exploit these inconsistencies for adversarial attack and disturbance detection. Fourth, we introduce HadamardNet, a framework employing Hadamard codes as output representations for semantic segmentation and object detection models and tasks. We conduct a comprehensive evaluation both on disturbances and adversarial attacks, achieving state-of-the-art perturbation detection performance for both tasks in only a single detection pass, while delivering equivalent or close-by reference performance on clean data.
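
The decoding step mentioned in the abstract (class probabilities via projection onto the probability simplex, plus an inconsistency measure) can be sketched as follows. This is one reading under stated assumptions, not the HadamardNet derivation: class codewords are taken as rows of a Sylvester Hadamard matrix, decoding is posed as least squares over the simplex, and the residual norm serves as the inconsistency score. Because the rows are orthogonal, that least-squares problem reduces to projecting the correlation scores onto the simplex.

```python
import numpy as np

def sylvester_hadamard(n):
    """Hadamard matrix of order n (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def project_to_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def decode(z, codebook):
    """Decode a soft codeword z into class probabilities and a residual."""
    n = codebook.shape[1]
    scores = codebook @ z / n                 # correlation with each class codeword
    p = project_to_simplex(scores)
    residual = float(np.linalg.norm(codebook.T @ p - z) / np.sqrt(n))
    return p, residual

codebook = sylvester_hadamard(8)[:5]          # 5 classes, codeword length 8
z_clean = codebook[2].copy()
z_noisy = z_clean.copy()
z_noisy[:3] *= -1.0                           # flip three code bits
p_clean, r_clean = decode(z_clean, codebook)
p_noisy, r_noisy = decode(z_noisy, codebook)
```

A clean codeword decodes to a one-hot distribution with zero residual. A corrupted one cannot be written as a convex combination of codewords, so its residual is strictly positive, which is the kind of signal the abstract proposes to threshold for attack and disturbance detection.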

2606.09746 2026-06-09 cs.CV cs.AI cs.LG 新提交

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

时空神经网络的混合鲁棒性验证

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 针对3D CNN在视频和体素输入中的鲁棒性验证,提出时空约束建模和STBP框架,实现精确闭式传播与可扩展近似,在相同扰动预算下认证鲁棒准确率达到现有验证方法的1.7倍。

Comments Accepted at the 9th International Symposium on AI Verification (SAIV 2026)

详情
AI中文摘要

随着人工智能越来越多地部署在安全关键系统中,为底层模型提供形式化的鲁棒性保证至关重要。现有的验证方法要么依赖过于保守的近似,要么产生难以承受的计算成本。例如,在视频设置中使用 $\ell_p$ 范数扰动编码了对手可以在每个视频帧中注入噪声的信念。实际上,对抗性扰动表现出结构化的时空相关性,被约束在低维、语义上有意义的子空间中。在这项工作中,我们研究了处理视频和体素输入的3D CNN的鲁棒性验证,针对动作识别(UCF-101)、自动驾驶(Udacity)和医学成像(MedMNIST)中的应用,通过将对抗强度建模为时空约束——攻击者可以修改一组连续帧中的子集或补丁——来利用关于对抗强度的现实假设。我们证明,建模现实约束能够实现更紧的近似。我们引入了时空边界传播(STBP),这是一个验证框架,它计算第一卷积层的精确闭式表征,并通过可扩展的近似传播认证边界。计算精确闭式为第一卷积层提供了最紧的边界。因此,我们在网络的其余部分使用近似方法。为了推动该领域的进一步发展,我们提出了ST-Bench,一个用于自动驾驶和活动识别的验证基准,以系统评估可验证的鲁棒性。与现有的基于验证的方法相比,STBP在相同的扰动预算下提供了更强的鲁棒性保证,并显著提高了可扩展性,实现了1.7倍更高的认证鲁棒准确率。

英文摘要

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer. Thus, we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
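
Why a frame-restricted threat model yields tighter first-layer bounds can be seen with a small worked example. The sketch assumes the first convolution is written as a matrix `W` acting on the flattened clip and that the attacker may change only masked inputs by at most `eps` each; the paper's actual closed form may differ, and all names and sizes here are hypothetical.

```python
import numpy as np

def affine_bounds_masked(W, b, x0, eps, mask):
    """Exact per-output bounds of W x + b when |x_i - x0_i| <= eps on masked inputs only."""
    center = W @ x0 + b
    radius = eps * (np.abs(W) @ mask.astype(np.float64))
    return center - radius, center + radius

rng = np.random.default_rng(0)
T, P = 4, 6                                  # 4 frames of 6 pixels each, flattened
W = rng.normal(size=(5, T * P))              # stand-in for a first conv layer as a matrix
b = rng.normal(size=5)
x0 = rng.random(T * P)
eps = 0.1

one_frame = np.zeros((T, P))
one_frame[1] = 1.0                           # attacker may touch frame 1 only
lo_sub, hi_sub = affine_bounds_masked(W, b, x0, eps, one_frame.ravel())
lo_full, hi_full = affine_bounds_masked(W, b, x0, eps, np.ones(T * P))

# The upper bound of output 0 is attained by a sign-aligned perturbation on the masked frame.
worst = eps * np.sign(W[0]) * one_frame.ravel()
attained = float(W[0] @ (x0 + worst) + b[0])
```

For an affine layer these interval bounds are exact per output, so no slack is introduced before the first nonlinearity. Restricting the mask to one frame shrinks every interval relative to the all-frames case, which matches the abstract's argument that realistic constraints permit tighter certification.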

2606.07650 2026-06-09 cs.CR cs.CV cs.NI 交叉投稿

Detecting Aimbot Cheaters in MOGs

检测多人在线游戏中的自瞄作弊者

Salman Shaikh, Tao Ni, Marc Dacier

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出PATCH主动防御策略,通过对抗性补丁作为游戏蜜标,触发作弊者目标检测模型,实现检测或干扰,在定制Unreal Engine游戏中白盒检测率超90%,跨模型迁移率达60-90%。

详情
AI中文摘要

多人在线游戏已成为娱乐行业数十亿美元的产业。然而,作弊者的存在破坏了诚实玩家的体验,并贬低了游戏开发者的努力,因为它直接影响玩家留存率、竞技完整性、游戏的合法性和可信度,以及最重要的整体收入流。在各种作弊技术中,视觉自瞄作弊是一种新兴威胁。它们使用计算机视觉模型从客户端屏幕截图中检测对手,而不是访问游戏内存,这使得商业内核级反作弊解决方案完全无法检测。在本文中,我们介绍了PATCH,一种新颖的主动防御策略,该策略部署对抗性补丁作为游戏中的蜜标,以减轻视觉自瞄作弊者的存在。我们的方法侧重于故意触发作弊者的目标检测模型,从而实现直接检测,或通过在其视口上泛滥补丁使作弊者无法进行游戏。我们在各种标准上评估了我们的方法;分析了不同补丁大小的有效性、补丁对不同屏幕分辨率的可扩展性、对不同视觉自瞄作弊配置的有效性,并探索了各种YOLO模型以评估补丁的可迁移性。在定制的Unreal Engine游戏上的评估显示,在几乎所有补丁大小的白盒场景中,检测率超过90%,并且使用较大补丁时,跨模型迁移率达到60%至90%。我们进一步在商业MOG《堡垒之夜》上验证了我们的方法,展示了现实世界的适用性。

英文摘要

Multiplayer Online Games (MOGs) have become a multibillion-dollar industry in the entertainment sector. However, the presence of cheaters undermines the experience of honest players and devalues the effort of game developers, as it directly affects player retention, competitive integrity, the legitimacy and trustworthiness of a game, and, most importantly, the overall revenue streams. Among various cheating techniques, visual aimbots represent an emerging threat. They use computer vision models to detect opponents from client screen captures rather than accessing game memory, making them completely undetectable by commercial kernel-level anti-cheat solutions. In this paper, we introduce PATCH, a novel proactive defense strategy that deploys adversarial patches as in-game honeytokens to mitigate the presence of visual aimbot cheaters. Our approach centers on deliberately triggering the cheaters' object detection model, enabling either direct detection or rendering the game unplayable for the cheater via patch flooding on their viewport. We evaluate our approach on various criteria, analyzing the effectiveness of different patch sizes, the scalability of patches to different screen resolutions, and the efficacy against diverse visual aimbot cheat configurations; we also explore various YOLO models to assess patch transferability. Evaluation on a custom Unreal Engine game demonstrates over 90 percent detection rate in white-box scenarios for almost all patch sizes, and reaches 60 to 90 percent cross-model transferability with larger patches. We further validate our approach on Fortnite, a commercial MOG, demonstrating real-world applicability.

2507.09092 2026-06-09 cs.CV cs.LG 版本更新

Analysis of Information Theory for Explainable AI

可解释人工智能的信息论分析

Ram S Iyer

发表机构 * Rajiv Gandhi Institute of Petroleum Technology(拉吉夫·甘地石油技术学院)

AI总结 提出基于互信息的激活映射方法MI CAM,通过特征图与输入图像的互信息加权生成显著性可视化,并以反事实分析验证其因果解释,性能与现有最先进方法相当,并在部分定性和定量指标上优于其中一些方法。

详情
AI中文摘要

随着机器视觉在医疗和自动化电厂等关键日常需求中的介入,卷积神经网络的内部机制以及网络提供特定推理的原因引起了关注。本文提出了一种新颖的基于激活映射的事后视觉解释方法,称为MI CAM。与之前基于类激活映射的方法不同,MI CAM通过每个特征图与输入图像的互信息对其进行加权,生成显著性可视化,最终结果由权重和激活图的线性组合产生。它还通过反事实分析验证了因果解释的生成。我们旨在展示MI CAM在模型推理过程中实现的视觉表现和无偏解释。我们的方法与所有最先进的方法相当,但在定性和定量度量上尤其优于其中一些方法。

英文摘要

With the intervention of machine vision in our crucial day-to-day necessities, including healthcare and automated power plants, attention has been drawn to the internal mechanisms of convolutional neural networks and to the reasons why a network produces specific inferences. This paper proposes a novel post-hoc visual explanation method called MI CAM based on activation mapping. Differing from previous class activation mapping based approaches, MI CAM produces saliency visualizations by weighting each feature map by its mutual information with the input image, and the final result is generated by a linear combination of weights and activation maps. It also produces causal interpretations, as validated with the help of counterfactual analysis. We aim to exhibit the visual performance and unbiased justifications for the model inferencing procedure achieved by MI CAM. Our approach performs on par with all state-of-the-art methods and outperforms some of them on qualitative and quantitative measures.
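
The weighting scheme described above (each feature map weighted by its mutual information with the input, then a linear combination) can be sketched as below. The histogram-based MI estimator, the bin count, and the assumption that feature maps are already resized to the image resolution are illustrative choices, not details taken from the paper.

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram estimate of the mutual information (in nats) between two arrays."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def mi_weighted_saliency(feature_maps, image_gray):
    """Weight each (H, W) feature map by its MI with the image and sum them."""
    weights = np.array([mutual_information(f, image_gray) for f in feature_maps])
    saliency = np.tensordot(weights, feature_maps, axes=1)
    return saliency, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32))                  # hypothetical grayscale input
fmap_related = image.copy()                   # a map that tracks the input exactly
fmap_noise = rng.random((32, 32))             # a map unrelated to the input
feature_maps = np.stack([fmap_related, fmap_noise])
saliency, weights = mi_weighted_saliency(feature_maps, image)
```

The map that carries information about the input receives a much larger weight than the unrelated one. Histogram MI estimates are biased upward on small samples, so the bin count matters in practice.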

2602.05175 2026-06-09 cs.CV 版本更新

Enhancing Adversarial Robustness with Signed Distance Fields for Harmonizing Geometric Invariance and Texture

利用符号距离场增强对抗鲁棒性以协调几何不变性与纹理

Zhe Li, Bernhard Kainz

发表机构 * Department AIBE, FAU Erlangen-Nürnberg(AIBE部门,埃朗根-纽伦堡大学)

AI总结 提出GeoTexPuri框架,通过符号距离场将离散图像掩码转化为连续空间场,在训练中融合几何与纹理特征,实现高效对抗净化,在ImageNet上取得84.79%干净准确率和83.52%鲁棒准确率。

Comments 14 pages, 6 figures

详情
AI中文摘要

深度神经网络在视觉识别中表现出色,但仍极易受到难以察觉的对抗攻击。现有的防御策略如对抗训练和基于扩散的净化已取得显著进展,但常受限于高计算成本、信息丢失和推理延迟。为解决这些挑战,我们提出了一种几何与纹理平衡净化(GeoTexPuri)框架,通过协调不变的几何结构与纹理特征来增强对抗鲁棒性。具体而言,该框架通过符号距离场(SDF)将离散图像掩码转化为连续空间场,在训练阶段融入密集几何引导。这一过程建立了稳定的结构锚点,使模型免受局部像素噪声干扰。通过多流训练目标,模型学会内化净化后的表示,有效将语义纹理线索与这些底层几何不变量对齐。在ImageNet上的大量实验证明了我们方法的有效性。GeoTexPuri在AutoAttack下实现了84.79%的干净准确率和83.52%的鲁棒准确率。关键在于,GeoTexPuri在推理时作为确定性分类器运行,仅需输入图像,无需任何辅助几何模块或额外计算成本,从而为实时应用提供了可扩展且高效的解决方案。

英文摘要

Deep neural networks demonstrate impressive performance in visual recognition but remain highly vulnerable to imperceptible adversarial attacks. Existing defense strategies such as adversarial training and diffusion-based purification have achieved significant progress but are frequently constrained by high computational cost, information loss, and inference latency. To address these challenges, we propose a Geometric and Texture balancing Purification (GeoTexPuri) framework that enhances adversarial robustness by harmonizing invariant geometric structures with textural features. Specifically, the framework integrates dense geometric guidance into the training phase by transforming discrete image masks into continuous spatial fields via Signed Distance Fields (SDF). This process establishes stable structural anchors that shield the model from local pixel noise. Through a multi-stream training objective, the model learns to internalize purified representations that effectively align semantic textural cues with these underlying geometric invariants. Extensive experiments on ImageNet demonstrate the efficacy of our approach. GeoTexPuri achieves 84.79% clean accuracy and 83.52% robust accuracy under AutoAttack. Crucially, GeoTexPuri functions as a deterministic classifier during inference, requiring only the input image without any auxiliary geometric modules or additional computational costs, thereby ensuring a scalable and efficient solution for real-time applications.
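
Turning a discrete mask into a continuous signed distance field, the transformation the abstract relies on, can be illustrated with a brute-force computation. This is a small-image sketch under assumed conventions (negative inside the mask, positive outside, pixel-centre distances); the paper's sign convention and implementation are not stated in the abstract.

```python
import numpy as np

def signed_distance_field(mask):
    """Brute-force SDF of a binary mask: negative inside, positive outside.

    Distances are measured between pixel centres to the nearest pixel of the
    opposite class. Cost is O(N^2) in the number of pixels, so small masks only.
    """
    mask = np.asarray(mask, dtype=bool)
    ys, xs = np.indices(mask.shape)
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    flat = mask.ravel()
    inside, outside = coords[flat], coords[~flat]
    if len(inside) == 0 or len(outside) == 0:
        raise ValueError("mask must contain both foreground and background")
    d_to_inside = np.min(np.linalg.norm(coords[:, None] - inside[None], axis=2), axis=1)
    d_to_outside = np.min(np.linalg.norm(coords[:, None] - outside[None], axis=2), axis=1)
    return np.where(flat, -d_to_outside, d_to_inside).reshape(mask.shape)

mask = np.zeros((9, 9), dtype=bool)
mask[3:6, 3:6] = True                         # 3 x 3 square in the centre
sdf = signed_distance_field(mask)
```

Unlike the mask, the field changes smoothly with distance from the object boundary, so a small pixel-level perturbation moves its values only slightly. That is the property the abstract refers to as a stable structural anchor.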

2603.23916 2026-06-09 cs.CV cs.AI 版本更新

DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning

DecepGPT: 基于多文化数据集和鲁棒多模态学习的模式驱动欺骗检测

Jiajian Huang, Dongliang Zhu, Zitong YU, Hui Ma, Jiayu Zhang, Chunmei Zhu, Xiaochun Cao

发表机构 * Great Bay University(大湾区大学) Wuhan University(武汉大学) Sun Yat-sen University(中山大学)

AI总结 本文提出DecepGPT,通过构建包含结构化线索描述和推理链的推理数据集,发布多文化数据集T4-Deception,并提出SICS和DMC模块,实现多模态欺骗检测的鲁棒学习,实验表明其在领域内和跨领域场景中均取得最佳性能。

Comments 17 pages, 11 figures, 12 tables

详情
AI中文摘要

多模态欺骗检测旨在通过分析音频视觉线索来识别欺骗行为,用于刑侦和安全领域。在高风险环境中,调查人员需要可验证的证据将音频视觉线索与最终决策联系起来,并且需要在不同领域和文化背景下可靠地泛化。然而,现有基准仅提供二元标签而无中间推理线索。数据集也较小,场景覆盖有限,导致捷径学习。我们通过三个贡献解决这些问题:首先,我们通过增强现有基准并添加结构化线索级描述和推理链来构建推理数据集,使模型输出可审计报告。其次,我们发布T4-Deception,一个基于统一的『To Tell The Truth』电视格式在四个国家实施的多文化数据集。该数据集包含1695个样本,是目前最大的非实验室欺骗检测数据集。第三,我们提出两个模块,以在小数据条件下实现鲁棒学习。Stabilized Individuality-Commonality Synergy (SICS) 通过结合可学习的全局先验与样本自适应残差,优化多模态表示,随后通过极性感知调整双向校准表示。Distilled Modality Consistency (DMC) 通过知识蒸馏将模态特定预测与融合的多模态预测对齐,以防止单模态捷径学习。在三个已建立的基准和我们新的数据集上的实验表明,我们的方法在领域内和跨领域场景中均取得最佳性能,同时在不同文化背景下表现出优越的迁移能力。数据集和代码将被发布。

英文摘要

Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-stakes settings, investigators need verifiable evidence connecting audiovisual cues to final decisions, along with reliable generalization across domains and cultural contexts. However, existing benchmarks provide only binary labels without intermediate reasoning cues. Datasets are also small with limited scenario coverage, leading to shortcut learning. We address these issues through three contributions. First, we construct reasoning datasets by augmenting existing benchmarks with structured cue-level descriptions and reasoning chains, enabling models to output auditable reports. Second, we release T4-Deception, a multicultural dataset based on the unified "To Tell The Truth" television format implemented across four countries. With 1695 samples, it is the largest non-laboratory deception detection dataset. Third, we propose two modules for robust learning under small-data conditions. Stabilized Individuality-Commonality Synergy (SICS) refines multimodal representations by synergizing learnable global priors with sample-adaptive residuals, followed by a polarity-aware adjustment that bi-directionally recalibrates representations. Distilled Modality Consistency (DMC) aligns modality-specific predictions with the fused multimodal predictions via knowledge distillation to prevent unimodal shortcut learning. Experiments on three established benchmarks and our novel dataset demonstrate that our method achieves state-of-the-art performance in both in-domain and cross-domain scenarios, while exhibiting superior transferability across diverse cultural contexts. The datasets and code will be released.
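
The DMC idea of aligning modality-specific predictions with the fused prediction via distillation can be sketched with a temperature-scaled KL term. The choice of the fused output as teacher, the KL direction, the temperature, and the two-modality toy logits are assumptions for illustration; the abstract does not specify them.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q)."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def modality_consistency_loss(fused_logits, unimodal_logits, temperature=2.0):
    """Mean KL from the fused (teacher) distribution to each unimodal (student) one."""
    teacher = softmax(fused_logits, temperature)
    per_modality = [kl_divergence(teacher, softmax(s, temperature)).mean()
                    for s in unimodal_logits]
    return float(temperature**2 * np.mean(per_modality))

fused = np.array([[2.0, -1.0], [0.5, 1.5]])   # toy logits: 2 samples, truthful vs deceptive
audio = np.array([[1.0, 0.0], [0.0, 1.0]])
video = np.array([[-1.0, 2.0], [1.0, 0.0]])
loss_mismatch = modality_consistency_loss(fused, [audio, video])
loss_match = modality_consistency_loss(fused, [fused, fused])
```

The loss vanishes when every modality already agrees with the fused prediction and grows as a single modality drifts toward its own shortcut. In a training setup the teacher would typically be treated as a constant for this term.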

2403.06013 2026-06-09 cs.LG cs.CV 版本更新

Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape

分类鲁棒性与解释鲁棒性真的强相关吗?基于输入损失景观的分析

Tiejin Chen, Wenwang Huang, Linsey Pang, Dongsheng Luo, Hua Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文质疑分类鲁棒性与解释鲁棒性强相关的传统观点,通过聚类评估解释鲁棒性,并提出调整解释损失景观的训练方法,发现两者并不强相关。

详情
AI中文摘要

本文深入探讨深度学习鲁棒性的关键领域,挑战了图像分类系统中分类鲁棒性和解释鲁棒性固有相关的传统观点。通过一种利用聚类高效评估解释鲁棒性的新颖评估方法,我们证明增强解释鲁棒性并不一定会使输入损失景观相对于解释损失变得平坦——这与平坦的损失景观指示更好的分类鲁棒性相反。为了深入探究这一矛盾,我们提出了一种开创性的训练方法,旨在调整相对于解释损失的损失景观。通过这种新的训练方法,我们发现尽管这种调整可以影响解释的鲁棒性,但它们对分类的鲁棒性没有影响。这些发现不仅挑战了两种鲁棒性之间强相关的主流假设,而且为理解损失景观与解释损失之间的关系开辟了新的途径。

英文摘要

This paper delves into the critical area of deep learning robustness, challenging the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss, contrary to the view that flattened loss landscapes indicate better classification robustness. To deeply investigate this contradiction, a groundbreaking training method designed to adjust the loss landscape with respect to explanation loss is proposed. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not have an influence on the robustness of classification. These findings not only challenge the prevailing assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding the relationship between loss landscape and explanation loss.

2503.18314 2026-06-09 cs.LG cs.AI cs.CV 版本更新

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

LoTUS:带有不确定性风味的大规模机器遗忘

Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Centre for Research & Technology Hellas(希腊研究与技术中心) Archimedes/Athena RC(阿基米德/雅典娜研究中心)

AI总结 提出LoTUS方法,通过平滑预测概率至信息论界限来消除训练样本影响,避免从头重训练,在Transformer和ResNet18模型上超越现有方法,并引入RF-JSD指标用于实际评估。

Comments Accepted as a main conference paper at CVPR 2025 (https://cvpr.thecvf.com/virtual/2025/poster/33292)

详情
AI中文摘要

我们提出了LoTUS,一种新颖的机器遗忘(MU)方法,它消除了预训练模型中训练样本的影响,避免了从头开始重新训练。LoTUS将模型的预测概率平滑到信息论界限,减轻了因数据记忆导致的过度自信。我们在Transformer和ResNet18模型上,针对五个公共数据集,与八个基线方法进行了评估。除了已有的MU基准测试,我们还在ImageNet1k(一个大规模数据集,其中重新训练不切实际)上评估了遗忘效果,模拟了真实世界条件。此外,我们引入了新颖的无重训练杰森-香农散度(RF-JSD)指标,以便在真实世界条件下进行评估。实验结果表明,LoTUS在效率和有效性方面均优于最先进的方法。代码:https://github.com/cspartalis/LoTUS

英文摘要

We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.
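
The RF-JSD metric is built on the Jensen-Shannon divergence. The abstract does not say which distributions RF-JSD compares, so the sketch below shows only the underlying divergence between two discrete distributions; the function name and inputs are illustrative.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD (in nats) between two discrete distributions; bounded by ln 2."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        nz = a > 0                            # terms with a = 0 contribute 0
        return np.sum(a[nz] * np.log(a[nz] / b[nz]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

same = jensen_shannon_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
disjoint = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
forward = jensen_shannon_divergence([0.6, 0.4], [0.1, 0.9])
backward = jensen_shannon_divergence([0.1, 0.9], [0.6, 0.4])
```

JSD is symmetric and finite even when the supports differ, which makes it convenient for comparing predictive distributions without a retrained reference model's exact outputs.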

12. 数据集、基准、评测与训练方法 54 篇

2606.07640 2026-06-09 cs.CV cs.AI cs.LG 新提交

No Free Lunch for Synthetic Images under Data Scarcity Conditions

数据稀缺条件下的合成图像没有免费午餐

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

发表机构 * Universidad Politécnica de Madrid(马德里理工大学) Universidad de Alcalá(阿尔卡拉大学)

AI总结 研究数据稀缺和隐私敏感条件下合成数据的保真度、隐私和效用权衡,提出联合评估框架,比较VAE、GAN和DDPM在三个图像数据集上的表现,发现GAN和DDPM在差分隐私下更鲁棒。

详情
AI中文摘要

本研究探讨了在数据稀缺和隐私敏感条件下,合成数据生成中保真度、隐私和效用之间的权衡。我们提出了一个联合评估这三个维度的框架,并将其应用于三种广泛使用的生成模型:VAE、GAN和DDPM。评估涵盖三个图像数据集:MNIST、OCTMNIST和OrganAMNIST,包括通用和医学成像领域。在训练过程中引入差分隐私机制时,三种模型的行为出现了显著差异。GAN和DDPM表现出更强的鲁棒性,在一系列噪声水平下保持较高的保真度和下游效用,而VAE随着隐私约束的增加而更快地退化。本研究强调了深度生成模型多维评估的重要性,并指出应用隐私技术时它们的行为存在显著差异。

英文摘要

This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcity and privacy sensitivity. We propose an evaluation framework that jointly assesses these three dimensions and apply it to three widely used generative models, VAE, GAN, and DDPM. The evaluation spans three image datasets, MNIST, OCTMNIST, and OrganAMNIST, encompassing both general-purpose and medical imaging domains. Notable differences arise between the three models in their behaviour when differential privacy mechanisms are introduced during training. GAN and DDPM demonstrate greater robustness, maintaining higher fidelity and downstream utility across a range of noise levels, while VAE degrades more rapidly as privacy constraints increase. This study highlights the importance of a multidimensional evaluation of deep generative models, also noting that their behaviour significantly differs when privacy techniques are applied.

2606.07645 2026-06-09 cs.CV cs.AI 新提交

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

FineGen:基于VLM的多智能体框架用于细粒度图像-文本数据集构建

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

发表机构 * Shenzhen Polytechnic University(深圳职业技术大学) Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area(粤港澳大湾区应用人工智能研究所) Shenzhen University(深圳大学)

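AI summary: Proposes the FineGen framework, which automatically builds fine-grained datasets containing hard negatives through a Generation-Verification-Correction pipeline with closed-loop feedback; constructs FineGen-100K from ImageNet and improves hard-sample accuracy by 14.4%.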

Comments 15 pages, 2 figures, conference

详情
AI中文摘要

当前视觉-语言数据集中硬负样本的稀缺严重阻碍了细粒度感知。为此,我们提出FineGen,一种基于VLM的多智能体框架,用于自动化数据集构建。通过采用协作的生成-验证-校正流水线及闭环反馈机制,FineGen确保合成的硬负样本在语义上有效且与视觉内容严格矛盾。将其应用于ImageNet,我们构建了FineGen-100K,一个包含超过147,000个属性特定硬负样本的分层数据集,正负样本比严格为1:10。广泛评估证实了96.7%的属性有效性。关键的是,在FG-OVD基准上的下游验证表明,在FineGen-100K上微调后,硬样本准确率大幅提升14.4%,显著优于现有最先进方法。

英文摘要

The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, we propose FineGen, a VLM-based Multi-Agent framework for automated dataset construction. By employing a collaborative Generation-Verification-Correction pipeline with a closed-loop feedback mechanism, FineGen ensures synthesized hard negatives are semantically valid yet strictly contradictory to visual content. Applying this to ImageNet, we construct FineGen-100K, a hierarchical dataset containing over 147,000 attribute-specific hard negatives with a rigorous 1:10 positive-to-negative ratio. Extensive evaluations confirm a 96.7% attribute validity rate. Crucially, downstream validation on the FG-OVD benchmark shows that fine-tuning on FineGen-100K yields a substantial +14.4% accuracy improvement on hard samples, significantly outperforming state-of-the-art methods.

2606.07646 2026-06-09 cs.CV cs.AI 新提交

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME:从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

发表机构 * MAIS, IACAS(中国科学院自动化研究所多模态人工智能系统实验室)

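AI summary: Proposes DOME, a domain encoder that uses vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank, enabling zero-shot explicit domain modeling that outperforms complex TTA methods on multiple benchmarks.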

详情
AI中文摘要

测试时自适应(TTA)旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布,忽略了真实世界域迁移的多维性和样本特异性,导致自适应脆弱。我们提出DOME,一种有效的域编码器,以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示,将域参数化为分布变量,并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型,即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能,超越了复杂的TTA方法。我们的结果表明,鲁棒的自适应并非源于复杂的自适应算法,而是源于显式的、结构化的域表示。

英文摘要

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
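The "basic entropy-minimization TTA strategy" mentioned above adapts a model by lowering the Shannon entropy of its own predictions on unlabeled test data. The sketch below only computes that objective for a batch of logits. It does not reproduce DOME's domain encoder or the gradient update, and all names and inputs are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution for one sample."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(batch_logits):
    """Batch objective that entropy-minimization TTA would try to reduce."""
    return sum(prediction_entropy(l) for l in batch_logits) / len(batch_logits)

if __name__ == "__main__":
    # Hypothetical logits for two unlabeled test samples.
    batch = [[2.0, 0.5, 0.1], [0.3, 0.2, 0.1]]
    print(mean_entropy(batch))
```

In a real pipeline this quantity would be the loss that is backpropagated into a small set of parameters (often normalization layers) at test time.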

2606.07653 2026-06-09 cs.CV cs.AI 新提交

A Dataset for Dynamic Human Preferences for Vision Language Models

面向视觉语言模型的动态人类偏好数据集

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

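AI summary: Proposes a benchmark for evaluating how well vision language models understand dynamic human preferences, generates a dataset with variations in image dependence through an automated pipeline, and evaluates existing models.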

详情
AI中文摘要

鉴于视觉语言模型(VLM)在人机交互场景中的广泛应用,评估这些模型适应不同用户实时偏好的能力变得重要。尽管近年来引入了越来越多的视觉语言基准,但它们主要侧重于评估静态能力和从大量训练数据中学习的一般偏好。本文引入了一个新的基准,用于评估VLM理解动态人类偏好的能力,即在推理时通过上下文传递的偏好。我们提供了一个自动化管道来生成该基准,包含图像依赖变化、动态多模态人类偏好数据集,并对最新模型在新基准上的表现进行了评估。

英文摘要

Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important to evaluate how well these models can adapt to the real-time preferences of different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human preferences, i.e., preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations in image dependence, a dynamic multimodal human-preference dataset, and evaluations of state-of-the-art models on the new benchmark.

2606.07654 2026-06-09 cs.CV cs.AI 新提交

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

MM-Matryoshka:通过二维多模态套娃训练框架实现预算弹性视觉文档检索

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Alibaba Cloud Computing(阿里云计算) Hong Kong University of Science and Technology(香港科技大学)

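AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a visual document retriever choose its budget elastically along vector dimension and encoder depth, without training separate models for different budgets.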

详情
AI中文摘要

多向量视觉文档检索器通过深度视觉语言模型(VLM)为每个页面生成多个向量,实现强大的细粒度匹配,但这种设计在存储和计算开销上导致部署成本高昂。现有效率技术通常只优化预算的一部分,使得多模态检索器缺乏统一的方法来权衡精度与向量宽度和编码器深度。因此,我们提出MM-Matryoshka,一种用于预算弹性视觉文档检索(VDR)的二维套娃训练框架,使ColPali风格的多向量检索在维度和层两个方向上实现弹性。在推理时,单个检索器可以选择二维可调预算,无需为不同预算训练独立模型。通过在多个代表性骨干网络上的全面实验,我们证明MM-Matryoshka在显著降低存储和计算开销的同时,保留了比直接截断基线高得多的质量,从而为高效VDR提供了稳健的预算弹性。

英文摘要

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computation. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy against both vector width and encoder depth. We therefore propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR) that makes ColPali-style multi-vector retrieval elastic along both the dimension and layer axes. At inference time, a single retriever can pick any budget on this 2D grid without training separate models for different budgets. Comprehensive experiments across multiple representative backbones show that MM-Matryoshka retains significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, offering robust budget elasticity for efficient VDR.
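To make the "vector width" axis concrete, the sketch below scores a query against a page with late-interaction MaxSim (the scoring style used by ColPali-like retrievers) after keeping only the first `dim` coordinates of every vector. Re-normalizing after truncation is an assumption, the encoder-depth axis is not shown, and the function names are hypothetical rather than taken from the paper.

```python
import math

def truncate_and_normalize(vectors, dim):
    """Keep the first `dim` coordinates of each vector and L2-normalize them."""
    out = []
    for v in vectors:
        head = list(v[:dim])
        norm = math.sqrt(sum(x * x for x in head))
        out.append([x / norm for x in head] if norm > 0 else head)
    return out

def maxsim_score(query_vecs, page_vecs):
    """Late interaction: for each query vector take its best-matching page vector."""
    return sum(
        max(sum(qi * di for qi, di in zip(q, d)) for d in page_vecs)
        for q in query_vecs
    )

def score_at_budget(query_vecs, page_vecs, dim):
    """MaxSim computed on vectors truncated to a chosen width budget."""
    return maxsim_score(
        truncate_and_normalize(query_vecs, dim),
        truncate_and_normalize(page_vecs, dim),
    )

if __name__ == "__main__":
    # Hypothetical 4-dimensional token and patch embeddings.
    query = [[0.9, 0.1, 0.3, 0.2], [0.2, 0.8, 0.1, 0.4]]
    page = [[1.0, 0.0, 0.2, 0.1], [0.1, 0.9, 0.3, 0.3], [0.5, 0.5, 0.5, 0.5]]
    for dim in (2, 4):
        print(dim, score_at_budget(query, page, dim))
```

Direct truncation of a model trained only at full width is the baseline the abstract compares against; Matryoshka-style training is meant to make the leading coordinates useful on their own.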

2606.07687 2026-06-09 cs.CV cs.AI 新提交

What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

什么使视频世界模型潜在空间与动作相关:预测优于重建

Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院)

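AI summary: Using a unified probe-based evaluation, finds that action-relevant structure is driven mainly by temporal video pretraining rather than pixel reconstruction fidelity, with video-pretrained self-supervised encoders achieving the best Pareto trade-off between visual fidelity and action prediction.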

详情
AI中文摘要

视频世界模型越来越多地用于提供预测性视觉表示,但尚不清楚哪些预训练信号在其潜在空间中诱导出与动作相关的结构。我们通过跨多种编码器家族的统一探针评估来研究这个问题,包括仅图像自监督、带或不带潜在预测的视频预训练、基于重建的自编码器、扩散模型以及捷径强制动力学模型。使用共同的逆动力学探针目标,我们发现动作相关结构主要由时间视频预训练驱动,而非像素重建保真度:具有强像素解码质量的模型可能表现出接近零的动作可恢复性,而视频预训练的自监督编码器在视觉保真度和动作预测之间始终实现最佳帕累托权衡。比较V-JEPA和VideoMAE进一步表明,大部分收益来自自然视频时间上下文,特征级潜在预测提供了较小的额外收益。这些趋势在机器人基准测试中转移,尽管CALVIN显示静态环境任务可以通过允许强图像先验来部分掩盖时间结构的重要性。最后,逆动力学监督显著提高了对视觉损坏的鲁棒性,表明动作感知目标正则化了潜在几何,超越了干净环境性能。我们的结果确定时间预测结构——而非重建保真度——是动作相关视频表示的主要成分。

英文摘要

Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.

2606.07708 2026-06-09 cs.CV cs.AI 新提交

Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization

跨视角城市交通数据集:用于单目鸟瞰图定位的无人机监督地面真值

Prakhar Bhardwaj, Simone Weikl, Kilian Mang, Elia Jonas Sandtner

发表机构 * OTH Regensburg(雷根斯堡应用技术大学)

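AI summary: Presents a cross-view urban traffic dataset built from synchronized bicycle-view and drone-view videos, supporting cross-view identity matching and bird's-eye-view prediction, with identity-level alignment and standardized evaluation.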

详情
AI中文摘要

我们介绍了一个从真实城市交叉口同步的自行车视角视频和无人机航拍视频构建的跨视角城市交通感知数据集和基准。该基准针对两个关联任务:街景和无人机视角目标轨迹之间的跨视角身份匹配,以及利用空中监督的自我到鸟瞰图预测。与先前的城市驾驶和V2X数据集相比,我们的基准提供了跨截然不同视角的身份级对齐,以及标准化评估、标注工具和基线实现。这一设置源于以交叉口为中心的交通分析,其中身份保持、局部交互和全局空间结构必须跨视角联合推理。我们在轨迹和帧级别评估方法,包括跨视角ID精确率/召回率/IDF1、近远分解、时间稳定性和一致性指标。我们还提供了基于楔形的跨视角匹配以及三种BEV预测基线(逆透视映射、MonoLayout风格学习基线和回归基线)的基线结果。结果表明该基准可行但具有挑战性:跨视角匹配实现了高召回率,但仍受过度分配和时间不一致性的限制,而自我到BEV预测受益于空中监督,但在轻量级单目感知下远未饱和。我们希望该基准能支持跨视角感知、城市场景对齐和自我到全局交通理解的未来研究。

英文摘要

We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird's-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near--far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.

2606.07882 2026-06-09 cs.CV cs.AI 新提交

The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders

跨架构基板:现代视觉编码器的领域超越、校准存活的几何不变量

Yousef Radwan

发表机构 * KAUST(阿卜杜拉国王科技大学)

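AI summary: Finds that after training, the top 16 principal directions of modern vision encoders converge to the same 16-dimensional geometric object (the cross-architecture substrate), which transports across visual domains, survives calibration, and is applied to label-free transferability filtering, domain detection, low-shot probing, and teacher-free distillation.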

Comments 14 pages, 2 figures. 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

详情
AI中文摘要

不同的视觉神经网络——训练用于分类、对比、重建或将图像与文本匹配——应该具有相应不同的内部表示。我们报告它们并非如此。训练后,十三个现代视觉编码器内部的前十六个主变化方向收敛到同一个十六维几何对象。我们称之为跨架构基板,并使用PCA、中心核对齐(CKA)和Pang 2026校准进行研究。该基板在四个视觉领域(自然照片、医学CT、卫星、显微镜)上以中位数Procrustes-CKA 0.679传输,在八个领域(增加素描、深度、热红外、天文学)上为0.604,每对>0.40。它在全局(7.4倍判别vs MAE分离,n=13,394)和局部(4.82-5.30,p<10^{-44})上经受住Pang校准。它不是像素统计(0.263),不是Gabor特征(0.31),不是随机投影(0.041),并且在训练的前10%中出现,而准确率持续上升。我们提供了四个应用:一个无标签迁移性过滤器,优于LogME(快3倍,+0.15 Kendall-tau);一个四路领域检测器(99.6%准确率);一个冻结低样本探测器(16维在每类N=50标签时比768维DINOv2高3.78个百分点);以及一个无教师蒸馏辅助,匹配训练教师KD在33对上(10%标签分数时峰值增益7.56个百分点)。该基板不跨模态,不帮助跨范式蒸馏,也不预测迁移质量(与迁移准确率的rho=0.08)。

英文摘要

Different vision neural networks -- trained to classify, contrast, reconstruct, or match images to text -- should have correspondingly different internal representations. We report that they do not. After training, the top sixteen principal directions of variation inside thirteen modern vision encoders converge to the same sixteen-dimensional geometric object. We call this the cross-architecture substrate and study it with PCA, centred kernel alignment (CKA), and Pang 2026 calibration. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at median Procrustes-CKA 0.679, and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, every pair >0.40. It survives Pang calibration globally (7.4x disc-vs-MAE separation, n=13,394) and locally (4.82-5.30, p<10^{-44}). It is not pixel statistics (0.263), not Gabor features (0.31), not a random projection (0.041), and emerges in the first 10% of training while accuracy keeps climbing. We deliver four applications: a label-free transferability filter beating LogME (3x faster, +0.15 Kendall-tau); a four-way domain detector (99.6% accuracy); a frozen low-shot probe (16 dims beat 768-dim DINOv2 by 3.78pp at N=50 labels per class); and a teacher-free distillation auxiliary matching trained-teacher KD on 33 pairs (7.56pp peak gain at 10% label fraction). The substrate does not cross modalities, does not help cross-paradigm distillation, and does not predict transfer quality (rho=0.08 against transfer accuracy).
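Centred kernel alignment, one of the tools named above, compares two sets of features extracted from the same inputs. The sketch below implements plain linear CKA in pure Python. The Procrustes-CKA variant and the calibration step reported in the abstract are not reproduced, and the toy feature matrices are hypothetical.

```python
import math

def center_columns(m):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_sq_frobenius(a, b):
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            s = sum(x * y for x, y in zip(col_a, col_b))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA between two feature matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_sq_frobenius(yc, xc)
    denominator = math.sqrt(cross_sq_frobenius(xc, xc) * cross_sq_frobenius(yc, yc))
    return numerator / denominator

if __name__ == "__main__":
    # Hypothetical features of four inputs from two encoders.
    feats_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
    feats_b = [[0.5, 1.0, 0.0], [1.0, 0.2, 0.3], [0.1, 2.0, 0.1], [0.9, 1.1, 0.2]]
    print(linear_cka(feats_a, feats_b))
```

Linear CKA is invariant to rotations and uniform rescaling of either feature space, which is why it can compare encoders whose coordinates have no shared meaning.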

2606.07891 2026-06-09 cs.CV 新提交

C3VD-DEFCOL: A Deformable Colonoscopy Dataset with Time-Resolved 3D Ground Truth and Realistic Appearance

C3VD-DEFCOL:具有时间分辨三维真实地面真值和逼真外观的可变形结肠镜数据集

Ethan Luk, Mayank V. Golhar, Anthony Song, Raúl Iranzo, Víctor M. Batlle, Lalithkumar Seenivasan, José M. M. Montiel, Nicholas J. Durr

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Universidad de Zaragoza(萨拉戈萨大学)

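AI summary: Presents the C3VD-DEFCOL framework and dataset, which uses simulated peristaltic deformation and realistic texture rendering to provide an evaluation platform with time-resolved ground truth for deformable colonoscopy 3D reconstruction.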

详情
AI中文摘要

三维重建可通过估计黏膜覆盖范围并在筛查期间提醒临床医生遗漏区域来改进结肠镜检查。然而,由于当前没有数据集同时提供逼真的体内外观和密集的时间分辨三维地面真值(尤其在非刚性变形下),算法开发受到限制。我们提出C3VD-DEFCOL,一个用于评估可变形结肠镜重建的框架和数据集,具有配对的几何和逼真纹理。从C3VD/C3VDv2结肠网格和相机轨迹出发,我们生成结肠表面的受控变形,包括蠕动波和中心线运动,并渲染每帧深度、表面法线、光流、相机姿态和时间戳三维网格。然后,我们使用渲染的几何(主要是深度)来条件化一个基于LTX-2.3的模拟到真实翻译模型,该模型生成具有体内样黏膜颜色、纹理、血管和镜面外观的RGB片段,同时保留底层三维场景结构。所得数据集包含来自11个独特结肠网格几何的110个视频,具有不同的相机轨迹、外观和参数化变形模式,包括三个蠕动严重程度级别作为受控评估轴。我们使用外观真实性、几何一致性和时间一致性指标评估生成的视频,并利用配对地面真值对可变形三维重建中的下游任务——姿态估计进行基准测试。实验表明,姿态估计误差随变形严重程度增加而增加,提供了现有体内数据集无法实现的受控压力测试。总体而言,C3VD-DEFCOL被设计为一个可重复的定量评估平台,用于测试可变形三维重建算法,旨在缩小合成数据集与体内结肠镜之间的领域差距。

英文摘要

3D reconstruction could improve colonoscopy by estimating mucosal coverage and alerting clinicians to missed regions during screening. However, algorithm development is limited as no current datasets provide both a realistic in vivo appearance and dense, time-resolved 3D ground truth, especially under non-rigid deformation. We present C3VD-DEFCOL, a framework and dataset for evaluating deformable colonoscopy reconstruction with paired geometry and realistic texture. Starting from C3VD/C3VDv2 colon meshes and camera trajectories, we generate controlled deformations of the colon surface, including peristaltic waves and centerline motion, and render per-frame depth, surface normals, optical flow, camera poses, and time-stamped 3D meshes. We then use the rendered geometry, primarily depth, to condition an LTX-2.3-based sim-to-real translation model that produces RGB clips with in vivo-like mucosal color, texture, vasculature, and specular appearance while preserving the underlying 3D scene structure. The resulting dataset contains 110 videos from 11 unique colon mesh geometries, with varying camera trajectories, appearances, and parameterized deformation regimes, including three peristaltic severity levels that serve as controlled evaluation axes. We evaluate the generated videos using appearance realism, geometric consistency, and temporal consistency metrics, and use the paired ground truth to benchmark the downstream task of pose estimation in deformable 3D reconstruction. Our experiments show how pose estimation error increases with increasing deformation severity, providing a controlled stress test that is not possible with existing in vivo datasets. Overall, C3VD-DEFCOL is designed as a reproducible, quantitative evaluation platform for testing deformable 3D reconstruction algorithms, with the goal of reducing the domain gap between synthetic datasets and in vivo colonoscopy.

2606.08033 2026-06-09 cs.CV cs.LG 新提交

Balancing Real and Synthetic Data for CNN-based Masonry Crack Detection

基于CNN的砌体裂缝检测中真实与合成数据的平衡

Mattia Forlesi, Alfonso Esposito, Ivan Zyrianoff, Alessandro Marzani, Marco Di Felice

发表机构 * University of Bologna(博洛尼亚大学)

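AI summary: To address the shortage of real data for masonry crack detection, supplements training with synthetic data; by varying the real-to-synthetic ratio, finds that 20% real data plus synthetic data matches or even exceeds training on real data alone.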

详情
AI中文摘要

裂缝是建筑健康的关键指标,早期识别对于防止有害损害至关重要。深度学习(DL)的进展,特别是卷积神经网络(CNN),已实现可扩展的自动裂缝检测解决方案。然而,CNN性能高度依赖于大规模多样化数据集的可用性,这对于砌体等复杂表面尤其具有挑战性。收集足够的真实数据耗时,而公开数据集可能不充分。为解决这一限制,我们探索生成合成裂缝数据,以补充真实数据并提高训练效果。真实数据集由从博洛尼亚及周边地区建筑收集的砌体裂缝图像组成。相比之下,合成数据集使用裂缝叠加工具生成,该工具以受控方向和位置向背景图像添加裂缝。使用真实数据集训练多种DL架构,以确定最佳性能模型(InceptionV4),用于生成数据的实验。通过改变真实与合成数据的比例,在InceptionV4上测试了六种训练场景,并在由真实图像组成的测试集上使用F1分数和平均交并比(mIoU)指标进行评估。结果表明,在合成数据上训练加上少量20%真实数据,可获得与仅使用真实数据训练相当的结果。此外,20/80(合成/真实)场景实现了76%的F1分数和80%的平均IoU,优于纯真实情况。可以看出,该方法展示了合成数据在减少收集工作同时提高裂缝检测准确性的潜力。

英文摘要

Cracks are a critical indicator of building health, and early-stage identification is fundamental to preventing harmful damage. Advances in deep learning (DL), particularly convolutional neural networks (CNNs), have enabled scalable solutions for automated crack detection. However, CNN performance strongly depends on the availability of large and diverse datasets, which is particularly challenging for complex surfaces such as masonry. Collecting sufficient real data is time-consuming, while publicly available datasets may not be adequate. To address this limitation, we explored generating synthetic crack data, which complements real data and improves training effectiveness. The real dataset consists of masonry crack images collected from buildings in Bologna and surrounding areas. In contrast, the synthetic dataset was generated using a crack overlay tool that adds cracks to background images in a controlled orientation and placement. The real dataset was used to train several DL architectures in order to identify the best-performing model (InceptionV4), which was then employed for experiments with generated data. Six training scenarios were tested with InceptionV4 by varying the ratio of real and synthetic data, with evaluation performed on a test set composed of real images using the F1-score and mean Intersection over Union (mIoU) metrics. Results show that training on synthetic data plus a modest addition of 20% real data achieves results comparable to training on real data only. Moreover, the 20/80 scenario (synthetic/real) achieved a 76% F1-score and 80% mean IoU, outperforming the real-only case. The method demonstrates the potential of synthetic data to reduce collection efforts while enhancing crack detection accuracy.
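The two metrics used for evaluation above can be written down in a few lines. The sketch assumes binary labels (1 for crack, 0 for background) flattened into plain lists; whether the study computes them per pixel or per image is not stated in the abstract, and the example labels are made up.

```python
def confusion_counts(pred, target, positive=1):
    """Return (TP, FP, FN) for one class over paired label lists."""
    tp = sum(1 for p, t in zip(pred, target) if p == positive and t == positive)
    fp = sum(1 for p, t in zip(pred, target) if p == positive and t != positive)
    fn = sum(1 for p, t in zip(pred, target) if p != positive and t == positive)
    return tp, fp, fn

def f1_score(pred, target, positive=1):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs."""
    tp, fp, fn = confusion_counts(pred, target, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def iou(pred, target, cls):
    """Intersection over Union for one class: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, target, cls)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def mean_iou(pred, target, classes=(0, 1)):
    """Unweighted mean of the per-class IoU values."""
    return sum(iou(pred, target, c) for c in classes) / len(classes)

if __name__ == "__main__":
    # Hypothetical predictions and ground truth for six samples.
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 0, 1, 1]
    print(f1_score(pred, target), mean_iou(pred, target))
```

Because mIoU averages over both classes while F1 here covers only the crack class, the two numbers can differ noticeably on imbalanced data.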

2606.08123 2026-06-09 cs.CV cs.AI 新提交

Human-Centered Benchmarking of Driver Monitoring Models

以人为中心的驾驶员监控模型基准测试

Ruben Dario Florez-Zela

发表机构 * Universidad Nacional de San Agustin de Arequipa (UNSA)(圣奥古斯丁国立大学(UNSA))

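AI summary: To address the limits of evaluating driver monitoring models by classification accuracy alone, proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates accuracy, explainability, efficiency, and robustness; finds that each model has its own advantage on the Pareto frontier, but aggregate ranking can hide critical weaknesses.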

Comments 9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring

详情
AI中文摘要

基于视觉的驾驶员监控系统越来越多地部署在安全关键的智能交通环境中,但它们几乎总是仅根据分类精度进行比较。本文认为精度不足以表征模型在实际部署中的适用性,并提出了以人为中心的基准测试框架(HCBF),该框架从四个维度评估模型:精度、可解释性、效率和鲁棒性。该框架应用于四种代表性的轻量级架构:MobileNetV3、ShuffleNetV2、EfficientNet-B0和DeiT-Tiny,在MRL眼睛数据集上进行眼睛状态分类。虽然这些模型在干净数据集上的精度几乎无法区分,但每个模型恰好在一个维度上领先,并且所有四个模型都位于帕累托前沿。在三种面向部署的权重场景下计算的人为中心得分将ShuffleNetV2排在首位。然而,这个聚合胜出者在传感器噪声下保留了不到一半的性能,并且将闭眼分类为睁眼而失败,而Transformer则保持鲁棒。这些发现表明,聚合排名可能掩盖在操作上具有决定性的维度特定漏洞,强调了多维、以人为中心评估的价值。

英文摘要

Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
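The observation that every model sits on the Pareto frontier while a weighted score still crowns one winner can be illustrated with a small sketch. The model names, the scores, and the equal weights below are invented for illustration and are not the paper's measurements or its Human-Centered Score formula.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores):
    """Names of models that no other model dominates (higher is better)."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )

def weighted_score(s, weights):
    """Simple weighted sum over the evaluation dimensions."""
    return sum(w * x for w, x in zip(weights, s))

if __name__ == "__main__":
    # Hypothetical (accuracy, explainability, efficiency, robustness) scores.
    scores = {
        "model_a": (0.99, 0.50, 0.60, 0.40),
        "model_b": (0.98, 0.70, 0.50, 0.45),
        "model_c": (0.98, 0.55, 0.90, 0.30),
        "model_d": (0.97, 0.60, 0.40, 0.95),
        "model_e": (0.96, 0.50, 0.40, 0.30),
    }
    weights = (0.25, 0.25, 0.25, 0.25)
    print(pareto_frontier(scores))
    print(max(scores, key=lambda n: weighted_score(scores[n], weights)))
```

A single aggregate ranking collapses the frontier to one name, which is exactly where a dimension-specific weakness can go unnoticed.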

2606.08132 2026-06-09 cs.CV cs.LG 新提交

Phase Marginalization for Patch-Grid Instability in Vision Transformers

视觉Transformer中补丁网格不稳定性的相位边缘化

Oğuzhan Ercan

发表机构 * Scientific and Technological Research Council of Türkiye(土耳其科学技术研究委员会)

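AI summary: Proposes Phase Marginalization, which evaluates structured patch-grid phases, inverse-aligns the dense outputs, and aggregates them in the original image coordinate system to mitigate the prediction instability caused by patch-grid phase in Vision Transformers, improving segmentation, depth, and matching performance without training.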

Comments 13 pages, 1 figure, 9 tables

详情
AI中文摘要

视觉Transformer在固定的补丁网格上操作,这可能导致密集预测中相位依赖的不稳定性:改变补丁划分会改变像素可用的令牌证据,尤其是在边界附近。我们将补丁网格相位形式化为一个干扰变量,并提出相位边缘化,一种事后边缘化方法,该方法评估结构化的补丁网格相位,逆对齐密集输出,并在原始图像坐标系中聚合它们。中心变体,K=4的均匀相位边缘化,无需训练,并在测量的分割、深度和局部匹配设置上优于规范的K=1基线。在受控的Cityscapes实验中,均匀相位边缘化相比基于通用移位的四次前向测试时增强(TTA)提供了适度的计算匹配优势(在最强测试的通用行上平均交并比提高0.31)。一项扩展研究进一步表明,K=4是一个实用的成本-精度权衡:K=8基本不变,K=16在更高延迟下增加很少精度。这些结果将补丁网格相位定位为可测量的干扰变量,并将相位边缘化定位为密集ViT预测的简单诊断和事后边缘化基线。

英文摘要

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
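The evaluate, inverse-align, and aggregate steps described above can be sketched on a toy example. Everything specific here is an assumption: the phases are cyclic shifts by half a patch, the "model" is a stand-in that replaces each patch with its mean, and aggregation is a plain average. The paper's actual phase set, padding, and aggregation rule may differ.

```python
def roll2d(img, dy, dx):
    """Cyclically shift a 2D list so that out[r][c] = img[r - dy][c - dx]."""
    h, w = len(img), len(img[0])
    return [[img[(r - dy) % h][(c - dx) % w] for c in range(w)] for r in range(h)]

def patch_mean_model(img, patch=2):
    """Toy dense predictor: every pixel gets the mean of its fixed-grid patch."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r0 in range(0, h, patch):
        for c0 in range(0, w, patch):
            cells = [(r, c)
                     for r in range(r0, min(r0 + patch, h))
                     for c in range(c0, min(c0 + patch, w))]
            mean = sum(img[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out

def phase_marginalize(img, model, phases):
    """Run the model at each grid phase, undo the shift, and average the outputs."""
    h, w = len(img), len(img[0])
    acc = [[0.0] * w for _ in range(h)]
    for dy, dx in phases:
        aligned = roll2d(model(roll2d(img, dy, dx)), -dy, -dx)
        for r in range(h):
            for c in range(w):
                acc[r][c] += aligned[r][c]
    k = len(phases)
    return [[v / k for v in row] for row in acc]

# K = 4 phases for a patch size of 2 (an assumed layout).
PHASES = [(0, 0), (0, 1), (1, 0), (1, 1)]

if __name__ == "__main__":
    edge = [[0, 1, 1, 1] for _ in range(4)]  # vertical edge off the patch boundary
    print(patch_mean_model(edge)[0])
    print(phase_marginalize(edge, patch_mean_model, PHASES)[0])
```

With a real ViT, `model` would be a forward pass returning a dense map, and the cost grows linearly with the number of phases, which matches the K = 4 versus K = 16 trade-off discussed in the abstract.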

2606.08156 2026-06-09 cs.CV cs.AI 新提交

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

RAPID: 逐层冗余感知剪枝与重要性驱动的令牌合并以实现高效ViT

Kyumin Choi, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

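AI summary: Proposes RAPID, a framework that adapts the token reduction strategy to ViT network depth: redundancy-similarity-aware pruning in shallow-to-middle layers and importance-similarity-aware merging in deeper layers, achieving a better accuracy-compression Pareto frontier on ImageNet-1K.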

Comments 7 pages, 2 figures

详情
AI中文摘要

视觉Transformer(ViT)取得了强大性能,但由于二次自注意力复杂度而遭受高计算成本。尽管令牌缩减技术(如剪枝和合并)缓解了这一问题,但它们通常忽略了表示在网络深度上的演化。我们提出RAPID,一种深度感知的令牌缩减框架,可根据令牌表示的逐层特征自适应调整缩减策略。主要方法贡献是一种分叉策略:在浅层到中层,RAPID采用冗余相似度感知剪枝度量来消除过度表示的局部模式。当特征在更深层过渡到全局语义概念时,框架转向重要性相似度感知合并机制。该阶段利用分类(CLS)令牌注意力权重来保护语义关键令牌,同时融合不太重要但相似的邻居。在ImageNet-1K上使用ViT和DeiT架构的实验验证表明,与ToMe和ToFu等即插即用基线相比,RAPID建立了更优的精度-压缩帕累托前沿。RAPID在激进压缩场景下尤其鲁棒,在极端缩减率下比ToMe准确率高出4.29%。我们的框架提供了一种免训练模板,通过将缩减策略与层次化特征演化对齐来优化视觉模型。

英文摘要

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.
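The deeper-layer idea above, protecting tokens with high CLS attention and fusing less important tokens into similar neighbors, can be sketched in a simplified form. This is not RAPID's actual metric or merging rule: the importance score, the cosine-similarity matching, and the mean-pooling below are all assumptions chosen for brevity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def merge_least_important(tokens, cls_attention, num_merge):
    """Fuse the num_merge lowest-attention tokens into their most similar kept token.

    Requires 0 <= num_merge < len(tokens). Kept tokens stay in original order.
    """
    order = sorted(range(len(tokens)), key=lambda i: cls_attention[i])
    to_merge = order[:num_merge]
    keep = sorted(order[num_merge:])
    groups = {i: [tokens[i]] for i in keep}
    for j in to_merge:
        best = max(keep, key=lambda i: cosine(tokens[j], tokens[i]))
        groups[best].append(tokens[j])
    # Mean-pool every group into a single output token.
    return [
        [sum(vals) / len(groups[i]) for vals in zip(*groups[i])]
        for i in keep
    ]

if __name__ == "__main__":
    # Hypothetical 2-dimensional tokens with CLS attention weights.
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    attention = [0.4, 0.1, 0.4, 0.1]
    print(merge_least_important(tokens, attention, num_merge=2))
```

Merging keeps some information from the removed tokens, whereas pruning discards it, which is one plausible reason to prefer pruning early and merging late.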

2606.08164 2026-06-09 cs.CV 新提交

How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models

MRI预处理需要多少才够?脑MRI基础模型的成本效用研究

Jiangshuan Pang, Wangyang Tang, Jing Yan, Zhixuan Cheng, Youzhe He, Zhenkun Zhuang, Tao Zhou, Shiping Liu

发表机构 * University of the Chinese Academy of Sciences(中国科学院大学) BGI Research(华大研究院)

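AI summary: Compares how preprocessing levels P0-P7 affect self-supervised 3D MRI pretraining and finds that stronger preprocessing is not always better: P2 is the lowest-cost feasible level, stronger preprocessing brings only limited gains on specific tasks, and those gains can be recovered downstream.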

详情
AI中文摘要

MRI预处理定义了脑MRI基础模型看到的输入分布,但它通常被视为常规数据清理而非建模选择。我们询问对于自监督3D MRI预训练,多少预处理值得其计算成本。保持语料库、3D ViT骨干网络、掩码协议和下游评估不变,我们在20,000个异质脑MRI体积上比较了用于掩码自编码(MAE)和联合嵌入预测学习(JEPA)的分级P0-P7预处理谱,然后将编码器迁移到IDH预测、MCI分类、脑年龄回归和GLI/PED肿瘤分割。结果不支持简单的“越多越好”规则。P0/P1数值不稳定,使P2成为成本最低的可行级别;超过P2,选择最佳可行预处理级别仅使MAE的聚合效用提高3.4个百分点,JEPA提高1.8个百分点,且大多数配对增益在统计上未解决。更强的预处理仅在选定场景中有益:IDH略有改善,AGE和GLI/PED通常在P2附近或最佳,而MCI显示出最清晰的P7经验增益。跨级别MCI迁移进一步表明,大部分P7优势可以通过在下游应用更强的预处理来恢复,而不需要在预训练全程使用P7。这些发现将MRI预处理重新定义为一种下游感知的成本效用决策,而非默认的升级流水线。代码可在https://github.com/PangJiangShuan/PreBrain获取。

英文摘要

MRI preprocessing defines the input distribution seen by brain MRI foundation models, yet it is usually treated as routine data cleaning rather than a modeling choice. We ask how much preprocessing is worth its computational cost for self-supervised 3D MRI pretraining. Keeping the corpus, 3D ViT backbone, masking protocol, and downstream evaluations fixed, we compare a graded P0-P7 preprocessing spectrum for masked autoencoding (MAE) and joint-embedding predictive learning (JEPA) on 20,000 heterogeneous brain MRI volumes, then transfer the encoders to IDH prediction, MCI classification, brain age regression, and GLI/PED tumor segmentation. The results do not support a simple "more is better" rule. P0/P1 are numerically unstable, making P2 the lowest-cost feasible level; beyond P2, choosing the best feasible preprocessing level improves aggregate utility by only 3.4 percentage points for MAE and 1.8 percentage points for JEPA, with most paired gains statistically unresolved. Stronger preprocessing is beneficial only in selected regimes: IDH improves modestly, AGE and GLI/PED are often near or best at P2, and MCI shows the clearest empirical P7 gain. Cross-level MCI transfer further shows that much of the P7 advantage can be recovered by applying stronger preprocessing downstream, without requiring P7 throughout pretraining. These findings recast MRI preprocessing as a downstream-aware cost-utility decision rather than a default escalation pipeline. Code is available at https://github.com/PangJiangShuan/PreBrain.

2606.08332 2026-06-09 cs.CV 新提交

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

SMI: 基于互信息启发的依赖优化的高效自监督学习

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) Universitat Autònoma de Barcelona(巴塞罗那自治大学)

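AI summary: Proposes SMI, which optimizes self-supervised learning on a sample-level dependency matrix via a nonlinear transformation, achieving competitive performance on ImageNet with ResNet-50 at lower computational complexity and improving transfer performance on low-resource tasks.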

详情
AI中文摘要

自监督学习(SSL)已经取得了显著的表示学习性能,但许多现有方法依赖于大批量大小、内存库、动量编码器或全局同步机制,这些机制大大增加了计算成本和训练复杂度。在这项工作中,我们提出了语义互信息(SMI),一种轻量级的自监督目标,它源于高斯假设下互信息启发的依赖公式。与在高维特征相关矩阵上操作的传统相关匹配目标不同,SMI通过成对相关性的非线性变换在样本级依赖矩阵上进行优化。这种公式引入了独特的优化动态,强调强依赖的语义对,同时保持表示多样性。在ImageNet上使用ResNet-50骨干网络的实验结果表明,SMI在实现与最先进的SSL方法相当的线性评估性能的同时,显著降低了计算复杂度。在多个低资源基准上,SMI持续改善了Barlow Twins的迁移性能,特别是在细粒度数据集上。此外,对优化动态和表示几何的分析表明,对齐-冗余平衡得到改善,特征多样性增加,语义表示更加空间局部化。这些结果表明,非线性依赖优化为传统的基于相关的自监督学习目标提供了一种有效且计算高效的替代方案。

英文摘要

Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing methods rely on large batch sizes, memory banks, momentum encoders, or global synchronization mechanisms that substantially increase computational cost and training complexity. In this work, we propose Semantic Mutual Information (SMI), a lightweight self-supervised objective derived from a mutual-information-inspired dependency formulation under Gaussian assumptions. Unlike conventional correlation matching objectives that operate on high-dimensional feature correlation matrices, SMI performs optimization on a sample-level dependency matrix through a nonlinear transformation of pairwise correlations. This formulation induces distinct optimization dynamics that emphasize strongly dependent semantic pairs while maintaining representation diversity. Experimental results on ImageNet using a ResNet-50 backbone demonstrate that SMI achieves competitive linear evaluation performance relative to state-of-the-art SSL approaches while substantially reducing computational complexity. Across multiple low-resource benchmarks, SMI consistently improves transfer performance over Barlow Twins, particularly on fine-grained datasets. Furthermore, analyses of optimization dynamics and representation geometry suggest improved alignment--redundancy balance, greater feature diversity, and more spatially localized semantic representations. These results indicate that nonlinear dependency optimization provides an effective and computationally efficient alternative to conventional correlation-based self-supervised learning objectives.
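For context, the standard closed form for the mutual information of two jointly Gaussian scalar variables with correlation coefficient $\rho$ is given below. The abstract does not state the SMI objective, so reading this identity as the "nonlinear transformation of pairwise correlations" is an assumption, not the authors' formula.

```latex
I(X;Y) = -\tfrac{1}{2}\,\log\!\left(1 - \rho^{2}\right), \qquad |\rho| < 1
```

The expression is zero at $\rho = 0$ and grows without bound as $|\rho| \to 1$, so any objective built on it would weight strongly correlated pairs far more heavily than a linear correlation term would.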

2606.08572 2026-06-09 cs.CV 新提交

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

OmniCap-IF:全视频字幕遵循指令能力的基准测试与改进

Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li, Hanyan Bian, Yichi Ren, Yize Zhang, Han Wang, Haowen Chen, Junze Li, Jiaqi Wang, Yiyang Hu, Zhuze Xu, Zijie Zhang, Jiaheng Liu

发表机构 * NJU-LINK Team, Nanjing University(南京大学 NJU-LINK 团队) Kling Team, Kuaishou Technology(快手科技 Kling 团队)

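AI summary: Presents OmniCap-IF, the first instruction-following benchmark for omni-modal captioning, which evaluates 50 constraint types on format and content correctness, reveals a format-content trade-off, and contributes the 54K instruction-tuning dataset OmniCap-IF-54K and the OmniCaptioner-IF model.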

详情
AI中文摘要

虽然全模态大语言模型(OLLMs)在联合处理音频和视觉流方面展示了令人印象深刻的能力,但它们严格遵循复杂、多方面的用户指令的能力在很大程度上仍未得到探索。现有基准主要关注整体视频理解或纯文本指令遵循,未能捕捉模态与用户约束之间的复杂交互。为填补这一空白,我们引入了OmniCap-IF,这是首个专门设计用于评估全模态字幕中指令遵循能力的综合基准。OmniCap-IF包含一个系统框架,从格式正确性和内容正确性两个维度评估字幕。我们的基准涵盖了纯视觉、纯音频和音视频模态中的50种不同约束类型,同时整合了时间定位以评估时空精度。对1,920个高质量样本上主流模型的广泛评估揭示了显著的性能差异。此外,我们的分析揭示了一个关键的“格式-内容权衡”,表明增加格式复杂性直接降低了模型的全模态推理能力。最后,为推进该领域,我们整理了一个54K的指令微调数据集OmniCap-IF-54K,并提出了OmniCaptioner-IF,该模型在复杂指令遵循和通用全模态字幕性能方面均取得了显著改进。

英文摘要

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

2606.08795 2026-06-09 cs.CV 新提交

PairWise Image Finder: An Open-source Tool for Finding Visually Aligned Street-Level Image Pairs for Urban Perception Studies

PairWise Image Finder: 用于城市感知研究的视觉对齐街景图像对查找开源工具

Jussi Torkko

发表机构 * Digital Geography Lab, Department of Geosciences and Geography, University of Helsinki(赫尔辛基大学地球科学与地理系数字地理实验室)

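AI summary Presents the PairWise image finder, a tool that combines feature detection and matching with semantic segmentation masks to quantify the visual alignment of images from different time periods. It reports the share of matched features, their distance and coverage, and semantic-mask alignment, so users can filter for high-quality image pairs in longitudinal change studies and reduce manual effort.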

Comments 6 pages, two figures, github repo link near the end

详情
AI中文摘要

变化检测和场景识别技术已广泛应用于街景图像(SVI)以理解跨年场景的变化。然而,仅凭元数据往往不足以可靠地找到视觉对齐的图像对。本研究介绍了PairWise图像查找器,该工具集成了特征检测和匹配,并辅以语义分割掩码来量化不同时期两幅图像的视觉对齐度。该工具输出匹配关键特征的比例、匹配特征距离和覆盖率以及语义掩码的对齐度,使用户能够根据对齐质量和用例过滤图像对。从该工具导出的视觉对齐对可用于准确研究显式的纵向变化,并有助于减少感知研究中的人工工作量。通过比较纵向变化展示了该工具的可用性,强调了量化变化时视角的重要性。所提出的方法为研究人员和利益相关者提供了一个可扩展的开源工具,用于查找用于城市分析、感知及相关应用的高质量图像对。

英文摘要

Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to understand how scenes change over the years. However, metadata alone is often insufficient to reliably find visually aligned image pairs. This study introduces the PairWise image finder, a tool that integrates feature detection and matching, supported by semantic segmentation masks, to quantify the visual alignment of two images from different time periods. The tool outputs the share of matched key features, the matched feature distance and coverage, and the alignment of semantic masks, which enables the user to filter image pairs depending on the alignment quality and use case. The visually aligned pairs derived from the tool can be used to accurately study explicit longitudinal change and help reduce manual effort for perception studies. The usability of the tool is demonstrated through a comparison of longitudinal changes, highlighting the importance of perspective when quantifying changes. The proposed method provides a scalable and open tool for researchers and stakeholders to find high-quality image pairs for urban analysis, perception and related applications.

2606.09028 2026-06-09 cs.CV cs.AI cs.RO 新提交

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

ATM:用于诊断和改进潜在世界模型的动作一致性转移矩阵

Jiaheng Chen

发表机构 * School of Software, Northeastern University(东北大学软件学院)

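AI summary Proposes ATM, a matrix that uses lightweight probes to compare action information in real and predicted latent transitions, diagnosing world-model quality without a simulator. Also introduces AITS, which uses action identifiability as a training signal to improve downstream planning.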

Comments 13 pages, 3 figures, 6 tables

详情
AI中文摘要

潜在世界模型越来越多地用于控制和目标条件规划,但评估其学习到的表示是否对规划有用通常需要与CEM等规划器耦合的慢速模拟器评估。这种评估是黑盒且依赖于模型复杂度的:在相同协议下,不同世界模型每个检查点可能需要几分钟到几小时。在这项工作中,我们提出了ATM,一个动作一致性转移矩阵,用于诊断潜在转移是否保留了与规划相关的动作语义。ATM通过轻量级事后探针比较真实编码转移和模型预测转移中的动作信息,生成一个可解释的矩阵,揭示表示质量、转移域不一致性和失败模式,而无需模拟器 rollout。它还可以折叠成一个简单的筛选分数,用于跨检查点、变体和世界模型的内部任务排名。当真实成功差距显著时,ATM实现了高度可靠的成对排名,同时将分钟到小时的CEM评估减少到秒级的转移分析,在我们的设置中实现了超过100倍的加速。我们进一步引入了AITS,表明动作可识别性不仅具有诊断作用,而且是一种有用的训练信号,可以在不改变规划器的情况下改进下游规划。

英文摘要

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.
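The pairwise-ranking claim can be made concrete with a small sketch. It assumes each checkpoint has a scalar screening score and a measured planning success rate, and it skips pairs whose success gap is below a threshold, mirroring the "non-trivial gap" condition. The function name and the `min_gap` value are hypothetical; the abstract does not give the exact protocol.

```python
from itertools import combinations


def pairwise_ranking_accuracy(scores, success, min_gap=0.1):
    """Fraction of checkpoint pairs that score and success rate order the same way.

    Pairs whose success rates differ by less than min_gap are ignored.
    """
    agree = total = 0
    for i, j in combinations(range(len(scores)), 2):
        gap = success[i] - success[j]
        if abs(gap) < min_gap:
            continue  # success gap too small to count as a meaningful comparison
        total += 1
        if (scores[i] - scores[j]) * gap > 0:
            agree += 1
    return agree / total if total else float("nan")
```

A value near 1.0 means the cheap score could stand in for planner-coupled evaluation when choosing between checkpoints with clearly different success rates.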

2606.09180 2026-06-09 cs.CV 新提交

Claude Code-Driving Scenario Mining for the Argoverse 2 Challenge

Claude Code驱动的Argoverse 2挑战赛场景挖掘

Wei Deng, Caoshengzhe Xue, Shuaikun Liu, Zhaohong Liu, Mengshi Qi, Huadong Ma

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

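AI summary Proposes a four-stage pipeline for the Argoverse 2 Scenario Mining Challenge: autonomous code generation with Claude Code, iterative training-set screening, semantic code review, and scene-level verification.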

详情
AI中文摘要

我们提交了参加CVPR 2026 Argoverse 2场景挖掘挑战赛的作品。我们的系统使用四阶段管道：(1) 由GLM 5.1驱动的Claude Code代理进行自主代码生成，(2) 使用时间戳平衡准确率阈值0.8进行迭代训练集筛选以策划少样本示例，(3) 由单独的Claude Code会话进行语义代码审查，以及(4) Qwen3-VL场景级验证以过滤误报。我们报告了在Argoverse 2测试集上的结果。

英文摘要

We present our submission to the CVPR 2026 Argoverse 2 Scenario Mining Challenge. Our system uses a four-stage pipeline: (1) autonomous code generation via a Claude Code agent powered by GLM 5.1, (2) iterative training set screening with a Timestamp Balanced Accuracy threshold of 0.8 to curate few-shot examples, (3) semantic code review by a separate Claude Code session, and (4) Qwen3-VL scene-level verification to filter false positives. We report results on the Argoverse 2 test set.
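Stage (2) can be illustrated with a short sketch. It assumes that the metric is ordinary balanced accuracy (the mean of true-positive and true-negative rates) computed over per-timestamp binary labels, and that a generated program is kept as a few-shot example when it reaches the 0.8 threshold. The challenge's exact metric definition and the data layout used here are assumptions.

```python
def balanced_accuracy(pred, truth):
    """Mean of per-class recall over binary per-timestamp labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    pos = sum(1 for t in truth if t)
    neg = len(truth) - pos
    # Assumption: average only over classes that actually occur in the ground truth.
    rates = []
    if pos:
        rates.append(tp / pos)
    if neg:
        rates.append(tn / neg)
    return sum(rates) / len(rates)


def screen(candidates, threshold=0.8):
    """Keep names of candidates whose predictions reach the threshold.

    candidates: iterable of (name, pred, truth) tuples (hypothetical layout).
    """
    return [name for name, pred, truth in candidates
            if balanced_accuracy(pred, truth) >= threshold]
```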

2606.09219 2026-06-09 cs.CV astro-ph.IM 新提交

Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline

天文图像中的半监督源检测:新基准与强基线

Longhan Feng, Zihuang Cao, Ali Luo, Yuanhao Guo, Shuilian Yao, Yixin Guo, Qi Jia, Yu Liu

发表机构 * School of Software Dalian University of Technology(大连理工大学软件学院) National Astronomical Observatories Chinese Academy of Sciences(中国科学院国家天文台) Research Institute of Highway Ministry of Transport(交通运输部公路科学研究院)

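AI summary Addresses source detection in astronomical images with the LAMOST-DET benchmark dataset and Nova Teacher, a semi-supervised framework that combines source light enhancement, confidence-guided pseudo-supervision, and cross-view complementary mining to detect dense sources from sparse annotations, improving mAP by 4.04% and 5.22% under two semi-supervised settings.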

详情
AI中文摘要

在现代观测天文学中,源检测是准确定位和识别恒星源的基石,对于恒星种群合成和宇宙学参数估计等研究至关重要。然而,天文图像的特征,包括高密度、点扩散函数效应和低信噪比,对最新的先进目标检测器提出了重大挑战。此外,由于在天文图像中标注密集、微小和暗弱的源存在显著困难,全监督检测方法几乎不实用。为了解决天文数据集的稀缺性,我们引入了一个新的综合基准(LAMOST-DET),包含18,400张天文图像和728,898个源实例。在该数据集上,我们进一步设计了一个新颖的半监督学习框架,称为Nova Teacher,能够在稀疏标注下有效检测密集源。它集成了光源增强模块、置信度引导的伪监督和跨视图互补挖掘,采用双教师范式。在LAMOST-DET上的大量实验表明,Nova Teacher在两种半监督设置下分别比之前的竞争者持续提高4.04%和5.22%的mAP。此外,我们的方法在自然图像数据集上与其他检测器竞争,验证了其在不同场景下的泛化能力。源代码可在https://github.com/AcWiz/NovaTeacher获取。

英文摘要

Source detection is a cornerstone of modern observational astronomy, used to localize and identify stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high source density, point spread function effects, and low signal-to-noise ratios, pose significant challenges for state-of-the-art object detectors. Moreover, fully supervised detection methods are hardly practical because dense, small, and faint sources are very difficult to annotate. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET), comprising 18,400 astronomical images and 728,898 source instances. Building on this dataset, we further devise a novel semi-supervised learning framework, Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that Nova Teacher consistently outperforms previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method is competitive with other detectors on a natural image dataset, validating its ability to generalize to other scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.

2606.09368 2026-06-09 cs.CV cs.AI 新提交

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

PhysScene:用于物理实验科学视觉推理的场景图数据集

Minghao Zou, Qingtian Zeng, Shangkun Liu, Yanda Meng, Guanghui Yue, Baoquan Zhao, Abdulmotaleb El Saddik, Wei Zhou

发表机构 * Cardiff University(卡迪夫大学) Shandong University of Science and Technology(山东科技大学) University of Exeter(埃克塞特大学) Shenzhen University(深圳大学) Sun Yat-sen University(中山大学) University of Ottawa(渥太华大学)

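AI summary Introduces PhysScene, the first scene graph dataset for physics experiments. Its high relation density, strong semantic constraints, and structured experimental setups support reasoning about logical dependencies beyond spatial co-occurrence in scientific visual reasoning.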

详情
AI中文摘要

场景图通过建模对象及其成对关系,提供视觉场景的结构化表示。尽管最近取得了进展,现有数据集主要关注通用自然场景,领域特定和功能导向的场景仍未被充分探索。这一限制阻碍了科学实验场景中关系推理的评估,进而阻碍了此类场景中智能监控、分析及相关应用的发展。为填补这一空白,我们引入了PhysScene,这是首个针对物理实验的场景图数据集。PhysScene涵盖了实验环境中特有的仪器、结构化实验装置和功能关系,使得推理能够超越空间共现,扩展到逻辑依赖。PhysScene不追求大规模数据,而是聚焦于实验场景中的强语义约束和高关系密度,为现有场景解析算法带来新挑战,同时提供进一步改进的机会。广泛的分析和实验表明,PhysScene补充了现有基准,并为推进科学视觉推理建立了有价值的测试平台。该数据集公开于https://github.com/ZMH-SDUST/PhysScene。

英文摘要

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.

2606.09495 2026-06-09 cs.CV 新提交

ContextShift: A Controlled Benchmark for Context Dependence in Object Detection

ContextShift: 目标检测中上下文依赖性的受控基准

Dan Zlotnikov, Alex Lazarovich, Ohad Ben-Shahar

发表机构 * Ben-Gurion University of the Negev(内盖夫本-古里安大学)

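AI summary Introduces the ContextShift benchmark, which systematically manipulates object-context relationships through geometric transformations and background substitution. Detector degradation shows up mainly as more missed detections and fewer predictions, statistical co-occurrence does not correlate linearly with effective visual context, and context-aware augmentation improves robustness.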

详情
AI中文摘要

现代目标检测器在标准基准上表现强劲,但其对上下文变化的鲁棒性仍未被充分理解。先前的评估主要依赖于在非受控分布偏移上的平均精度等聚合指标,这可能会掩盖上下文变化下性能下降的真实情况。我们提出了ContextShift,一个受控基准,它在保持物体外观的同时系统地操纵物体-上下文关系。基于COCO 2017,它通过几何变换以及合成和自然背景替换,将上下文作为独立变量分离出来,包括基于归一化点互信息(NPMI)的连续兼容性轴。在多种检测器架构中,我们观察到一致的退化模式:假阴性最多增加227%,预测数量最多减少44%,而假阳性保持稳定或下降。这种抑制行为无法被平均精度等聚合指标捕捉,这些指标可能掩盖显著的召回率损失和预测动态变化。进一步分析表明,退化更多是由有效检测候选的形成减少而非置信度降低所驱动。此外,沿统计兼容性轴的性能是非单调的,在中间NPMI处达到峰值,并向两端退化,表明统计共现与有效视觉上下文并非线性相关。最后,我们展示了上下文感知增强提高了鲁棒性:每个增强变体在原始和操纵的测试图像上都优于仅使用数据集的基线,通过在训练期间暴露模型于物体-上下文解耦,部分恢复了因预测抑制失败而损失的性能。

英文摘要

Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object--context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object--context decoupling during training.
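For readers unfamiliar with the compatibility axis, the standard definition of normalized pointwise mutual information is NPMI(x, y) = ln(p(x, y) / (p(x) p(y))) / (-ln p(x, y)), which ranges from -1 (never co-occur) through 0 (independent) to 1 (always co-occur). The sketch below computes it from probabilities; how the benchmark estimates those probabilities for object and background categories is not stated in the abstract, so the inputs here are assumed to be given.

```python
import math


def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized pointwise mutual information in [-1, 1]."""
    if p_xy <= 0.0:
        return -1.0  # the pair never co-occurs
    if p_xy >= 1.0:
        raise ValueError("NPMI is undefined when p_xy == 1")
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)
```

For example, an object and a background that each appear in half of the images and are independent of one another would score 0, which sits in the intermediate range where the abstract reports peak detector performance.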

2606.07568 2026-06-09 cs.HC cs.AI cs.CV cs.LG physics.data-an 交叉投稿

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

行为克隆在科学数据标注中的系统研究

Ishaan Singh Chandok, Core Francisco Park

发表机构 * GitHub

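AI summary Targets the human effort spent on verification and correction in scientific data annotation with a behavioral cloning framework of 9 synthetic tasks that simulate expert strategies. Key findings: skills emerge hierarchically, multi-task pretraining enables efficient fine-tuning, and models learn a shared internal representation of mistakes.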

Comments ICML 2026 Oral

详情
AI中文摘要

科学数据标注,例如视频中动物追踪或神经重建的校对,仍然受限于“最后一公里”问题:即使有强大的自动化,验证和校正仍需大量人力。标准方法训练模型直接预测标注,丢弃了专家如何导航、点击、验证和校正的丰富监督信息。我们引入了一个研究科学标注上行为克隆的框架:9个合成任务配以合成标注,模拟真实人类策略,包括探索、错误校正和战略决策。我们的实验揭示了若干发现。首先,技能层次化出现:模型先学习GUI机制,再学习任务关键决策,且比训练数据犯更少错误,同时保留在错误发生时校正的能力。其次,在多任务行为克隆上扩展模型表明,在我们的规模范围内,更大的模型数据效率更高。第三,多任务预训练能够高效微调至新任务,而从零开始训练则完全失败。第四,线性探针揭示模型内部表示标注过程的潜在变量,如任务阶段和数据位置;有趣的是,我们发现一个跨不同标注任务泛化的共享错误表示。总体而言,我们的框架建立了系统基准并识别了关键瓶颈,为将行为克隆扩展到真实世界科学数据标注奠定了基础。

英文摘要

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

2606.07618 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

ScaleSweep: 通过块尺度初始化实现LLM的精确NVFP4训练后量化

Li Lin, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所)

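AI summary Proposes ScaleSweep, which improves scale initialization for NVFP4 quantization by sweeping feasible block-scale candidates and selecting the one that minimizes a target objective. Bounds on the sweep range are derived theoretically, and experiments on Llama and Qwen models show better quantization performance and a narrower gap to full precision.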

Comments under review

详情
AI中文摘要

NVFP4是一种最近引入的硬件支持的FP4格式,通过细粒度块尺度提高了4位量化的保真度。然而,现有的NVFP4尺度初始化方法仍然主要依赖于AbsMax初始化,这与最优解之间存在明显差距。为了解决这个问题,我们提出了ScaleSweep,一种简单高效的尺度优化方法,它扫描可行的块尺度候选,并选择最小化目标函数的候选。我们进一步提供了NVFP4量化的理论分析,并推导了在原始张量与量化重建张量之间的均方误差(MSE)和加权均方误差(WMSE)下所需扫描范围的上下界。所提出的界限大幅减少了扫描空间,同时保留了最优候选,使得与基线量化算子相比开销可忽略。在Llama和Qwen模型上的实验表明,ScaleSweep持续优于现有的初始化方法,并进一步缩小了与全精度的差距。特别是在对权重、激活、KV缓存和查询状态进行激进的全端到端量化时,ScaleSweep保留了超过93%的全精度性能。

英文摘要

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
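The sweep idea can be sketched for a single block as follows. The code rounds each value to the nearest FP4 (E2M1) magnitude after dividing by a candidate scale, and picks the candidate with the lowest reconstruction MSE. It is a simplified illustration under stated assumptions: the candidate factors are hypothetical rather than the bounds derived in the paper, and the block scale is kept in full precision instead of being stored in a low-precision format as real NVFP4 kernels do.

```python
import math

# Representable magnitudes of the 4-bit E2M1 floating-point format.
FP4_E2M1 = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, scale):
    """Round each value to the nearest FP4 magnitude times scale (saturating at 6)."""
    out = []
    for x in block:
        mag = min(FP4_E2M1, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def absmax_scale(block):
    """Baseline: map the largest magnitude in the block to the FP4 maximum (6)."""
    m = max(abs(x) for x in block)
    return m / 6.0 if m > 0 else 1.0


def sweep_scale(block, factors=(0.7, 0.8, 0.9, 1.0)):
    """Try shrunken versions of the AbsMax scale and keep the lowest-MSE one."""
    base = absmax_scale(block)
    return min((base * f for f in factors),
               key=lambda s: mse(block, quantize_block(block, s)))
```

Because the factor 1.0 is among the candidates, the swept scale is never worse than AbsMax on the MSE objective; smaller factors trade some clipping of the largest value for finer resolution on the rest of the block.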

2606.07780 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

土地覆盖与洪水类型控制基于卫星的洪水测绘在不同全球洪水事件中的检测极限

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心) Space and Earth Science Data Analysis(空间与地球科学数据分析) NASA Marshall Space Flight Center(NASA马歇尔太空飞行中心)

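AI summary Evaluates satellite flood mapping with the Prithvi-EO-2.0 model across 19 global flood events. Detection accuracy depends on land cover and flood type: cropland and riverine floods are detected comparatively well, while detection under tree cover and in built-up areas is near zero.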

详情
AI中文摘要

洪水是最具破坏性的自然灾害之一,在气候变化下其频率增加使得基于卫星的淹没测绘对灾害响应至关重要。基于卫星档案预训练的地理空间基础模型提供了地理可迁移性,但其在多样、未见事件中的操作可靠性尚未被表征。在此,我们在跨越六大洲、八个气候带和六种洪水机制的19个分布外洪水事件(2017-2025年)中部署Prithvi-EO-2.0,并针对两个独立参考产品进行验证。检测精度共同依赖于土地覆盖和洪水类型,农田产生最高一致性(IoU=52%),河流事件检测最强(F1=0.69),而树木覆盖和建成区显示近乎零检测(IoU=4%),无论洪水机制如何。双参考验证揭示,明显的模型误差部分反映了参考产品之间的定义不一致而非检测失败。迭代流水线测试识别出23种故障模式,其中流水线工程在初始误差中占主导地位,超过模型容量。这些发现为操作卫星洪水测绘建立了环境依赖的检测边界。

英文摘要

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
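The two agreement metrics quoted above have standard definitions for binary masks: IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN). The sketch below computes both from flattened flood masks; treating two empty masks as perfect agreement is an assumption of this example, not something stated in the abstract.

```python
def iou_and_f1(pred, ref):
    """IoU and F1 for two flattened binary flood masks of equal length."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    if tp + fp + fn == 0:
        return 1.0, 1.0  # assumption: no flood in either mask counts as agreement
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1
```

The two metrics are linked by F1 = 2 * IoU / (1 + IoU), so F1 is always at least as large as IoU for the same pair of masks, which matters when comparing numbers reported with different metrics.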

2606.08204 2026-06-09 cs.LG cs.CV 交叉投稿

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

具有层次和空间局部性先验的神经场分词

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

发表机构 * Zuse Institute Berlin (ZIB)(柏林祖斯研究所) Cartesia AI Technische Universität Berlin(柏林工业大学)

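AI summary Proposes LH-NeF, a framework that uses hierarchy and locality priors to learn general-purpose tokenized representations of continuous signals. Replacing meta-learning with feed-forward encoding cuts memory use by 42x and supports 133x larger batches, while matching or exceeding baselines on images, 3D shapes, and climate fields.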

详情
AI中文摘要

神经场将数据参数化为从坐标到值的函数,为跨模态表示学习提供统一框架。现有方法以每样本元学习为主,由于内存密集的内循环优化而扩展性差。自然的替代方案——前馈编码——通常引入模态特定假设,牺牲了神经场学习的通用性。我们认为局部性和层次性是学习场表示的有用先验,可以在不损害模态无关性的情况下注入。我们提出LH-NeF,一个学习连续信号通用分词表示的框架。保持局部性的层次编码器将原始坐标-值场观测映射到结构化分词,训练期间从中重建场。通过用单次前向传播替代元学习的内循环,LH-NeF比最强的模态无关基线少用42倍内存,支持133倍更大的批次。在图像、3D形状和气候场上,我们的学习表示在重建和下游任务上匹配或超过模态无关、模态特定和专用生成神经场基线的性能。

英文摘要

Neural fields parameterize data as functions from coordinates to values, providing a unified framework for representation learning across modalities. Existing approaches are dominated by per-sample meta-learning, which scales poorly due to memory-intensive inner-loop optimization. The natural alternative -- feed-forward encoding -- typically introduces modality-specific assumptions, sacrificing the generality that makes learning with neural fields attractive. We argue that locality and hierarchy are useful priors for learning field representations that can be injected without compromising modality-agnosticism. We propose LH-NeF, a framework to learn general-purpose tokenized representations of continuous signals. A locality-preserving hierarchical encoder maps raw coordinate-value field observations to structured tokens, from which the field is reconstructed during training. By replacing meta-learning's inner loop with a single forward pass, LH-NeF uses 42× less memory and supports 133× larger batches than the strongest modality-agnostic baseline. Across images, 3D shapes, and climate fields, our learned representations match or exceed performance of modality-agnostic, modality-specific, and specialized generative neural field baselines on both reconstruction and downstream tasks.

2606.08437 2026-06-09 eess.IV cs.CV 交叉投稿

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

X-Palm: 用于跨域掌纹认证的配对多光谱到智能手机数据集

Jamal Seyedmohammadi, Pai Chet Ng, Angelo Genovese, Zhixiang Chi, Jeannie Lee, Konstantinos N. Plataniotis

发表机构 * Singapore Institute of Technology(新加坡科技学院) Università degli Studi di Milano(米兰大学) University of Toronto(多伦多大学)

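AI summary Addresses the domain gap between controlled enrollment and unconstrained authentication in palmprint recognition with X-Palm, the first paired-identity multispectral-to-smartphone cross-domain dataset. It contains 6,006 images covering large modality and environmental shifts. Existing models degrade severely on it, while models trained on X-Palm are robust across domains.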

详情
AI中文摘要

掌纹模态提供了一种保护隐私的生物识别解决方案,但其部署受到受控注册与非约束认证之间域差距的阻碍。现有数据集大多局限于受控设置,无法捕捉真实环境的复合变异性。在本文中,我们介绍了X-Palm,一个跨域数据集,包含来自103名个体(206只手)的6006张掌纹图像。据我们所知,X-Palm是第一个提供新颖的配对身份采集的掌纹数据集,专门设计用于弥合可靠受控多光谱注册与非约束移动认证之间的差距,同时涵盖广泛的野外变异性。与现有专注于单一或少数变化的数据集不同,X-Palm通过捕获两个不同域中身份的配对数据来解决实际部署中遇到的大规模模态和环境变化:(1)使用我们定制开发的扫描仪进行受控多光谱掌纹设置,以及(2)参与者驱动的非约束智能手机掌纹设置,同时包含硬件、手部姿势、光照、背景、相机到手距离、视角和手掌表面条件(例如湿度和遮挡)的变化。我们对12个SOTA模型的广泛基准测试表明,现有方法在受控数据上表现良好,但在X-Palm上性能严重下降。相反,在X-Palm上训练的模型在跨域中表现出一致的鲁棒性,使X-Palm成为训练模型以实现真实世界跨域泛化的宝贵资源。数据访问说明和相关基准测试代码公开于:https://github.com/X-Palm/X-Palm-2026

英文摘要

The palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domain gap between controlled enrollment and unconstrained authentication. Existing datasets are largely restricted to controlled setups and fail to capture the compound variability of real-world environments. In this paper, we introduce X-Palm, a cross-domain dataset comprising 6,006 palm images from 103 individuals (206 hands). To the best of our knowledge, X-Palm is the first palmprint dataset with paired-identity acquisition specifically designed to bridge the gap between reliably controlled multispectral enrollment and unconstrained mobile authentication while encompassing a broad spectrum of in-the-wild variability. Unlike existing datasets that focus on one or a few variations, X-Palm addresses the large modality and environmental shifts encountered in practical deployments by capturing paired data for each identity across two distinct domains: (1) a controlled multispectral palmprint setting using our custom-developed scanner, and (2) an unconstrained, participant-driven smartphone palmprint setting with simultaneous variations in hardware, hand pose, illumination, background, camera-to-hand distance, perspective, and palm surface conditions (e.g., moisture and occlusions). Our extensive benchmarks of 12 SOTA models reveal that while existing methods achieve high performance on controlled data, they suffer severe performance collapse on X-Palm. Conversely, models trained on X-Palm demonstrate consistent robustness across domains, positioning X-Palm as a valuable resource for training models toward real-world, cross-domain generalization. Data access instructions and the related benchmarking code are publicly available at: https://github.com/X-Palm/X-Palm-2026

2606.08574 2026-06-09 cs.LG cs.CV 交叉投稿

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

OrderDP:一种理论上保证无损的动态数据剪枝框架

Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学) Guangzhou Nanfang College(广州南方学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Xiangtan University(湘潭大学) University of Utah(犹他大学)

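AI summary Proposes OrderDP, a framework that draws a random subset and then selects the top-q samples, giving training that is unbiased with respect to a surrogate loss. It comes with convergence and generalization guarantees and cuts training cost by over 40% on CIFAR and ImageNet while keeping accuracy competitive.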

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

数据剪枝(DP)作为一种常被提及的减轻训练负担的策略,根据定义明确的剪枝方法减少训练样本数量,同时力求实现近乎无损的性能。然而,现有方法通常选择信息量大的样本,与全数据集训练相比可能导致有偏的梯度估计。此外,这种偏差及其对最终性能的影响分析仍不明确。为解决这些问题,我们提出OrderDP,一个即插即用的框架,旨在获得稳定、无偏且近乎无损的训练加速,并具有理论保证。具体而言,OrderDP首先随机选择一个子集,然后选择前$q$个样本,其中相对于代理损失建立无偏性。这确保了OrderDP在代理目标方面进行无偏训练。我们进一步建立了收敛性和泛化性分析,阐明了OrderDP如何影响最优性能,并在保证最终性能的同时实现良好控制的加速。实验上,我们在CIFAR-10、CIFAR-100和ImageNet-1K上对OrderDP与全面基线进行了评估,展示了具有竞争力的精度、稳定的收敛和精确的控制——所有这些都通过更简单的设计和更快的运行时间实现,同时将训练成本降低超过40%。我们的方法兼具强性能和计算效率,为数据高效学习提供了一个稳健且易于适应的工具。代码公开于https://github.com/shengze-xu/OrderDP。

英文摘要

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control -- all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.
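The two-step selection described above can be sketched in a few lines. The example assumes that samples are ranked by their current per-sample loss, which the abstract does not state explicitly, and it covers only the selection step, not the surrogate loss or the theoretical analysis. All names and parameters are illustrative.

```python
import random


def select_top_q(losses, subset_size, q, rng):
    """Draw a uniform random subset of indices, then keep the q with highest loss."""
    subset = rng.sample(range(len(losses)), subset_size)
    # Assumption: "top-q" means the q largest per-sample losses within the subset.
    subset.sort(key=lambda i: losses[i], reverse=True)
    return subset[:q]
```

A call such as `select_top_q(losses, subset_size=64, q=32, rng=random.Random(0))` would train on half of a randomly drawn subset per step. The random first stage is what distinguishes this from deterministic hardest-sample selection: every sample keeps a nonzero chance of being considered.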

2606.09059 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

Stage-1 Controls the Entropy Regime, Not the Outcome

Stage-1 控制熵状态,而非最终结果

Jianxiong Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

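AI summary Studies the role of Stage-1 (SFT or OPD) in two-stage post-training through small-data experiments, finding that it mainly affects the policy entropy regime and has limited effect on final performance.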

详情
AI中文摘要

两阶段后训练——Stage-1 热启动(监督微调 SFT 或在线策略蒸馏 OPD)后接 Stage-2 强化学习(RL)——越来越多地用于视觉语言模型(VLM)。我们使用 Qwen2.5-VL-7B 和同模态 72B VLM 教师进行 OPD,在小数据研究中探究 Stage-1 实际控制什么。首先,三种热启动在 Geometry3K 内部验证集上达到狭窄的 53%–54% 区间,与近期专门方法报告的窄范围一致;该设置几乎没有证据表明 Stage-1 改变了域内终点。其次,匹配配方、早停的 SFT 在域外 MathVista 上提升了 +2.1 点,逆转了过训练变体的 -9.5 点下降。最明显的区别是熵状态:OPD 进入 RL 时的策略熵显著高于任一 SFT 初始化,且这种分离在可用轨迹中持续可见。在域内初始化时,OPD 还具有更高的答案多样性和 pass@16(比 SFT 高 +2.0 到 +5.2 点),尽管问题级自举区间显示较小的对比具有不确定性。RL 后优势消失(终点 pass@16 值在 1.1 点以内),在 MathVista 上也是如此(六个模型在 1.2 点以内)。因此,我们的贡献是一个有界的实证刻画:在此设置中,Stage-1 与熵状态强相关,但下游收益小、局部化,且不能证明 OPD 是更好的 RL 热启动。

英文摘要

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow 53-54% band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by +2.1 points, reversing the -9.5-point drop of an over-trained variant. The clearest difference is the entropy regime: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 (+2.0 to +5.2 points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within 1.1 points) and on MathVista (six models within 1.2 points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
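The abstract reports pass@16 without saying how it is computed. A common choice, shown here as an assumption rather than the author's method, is the unbiased estimator that takes n sampled answers per problem, of which c are correct, and returns the probability that at least one of k answers drawn without replacement is correct: 1 - C(n - c, k) / C(n, k).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct answers."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over problems gives the benchmark-level figure. Since pass@k rewards having any correct sample, a higher-entropy policy with more diverse answers can score higher on it even when single-sample accuracy is similar.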

2606.09091 2026-06-09 cs.LG cs.CV 交叉投稿

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

稳定基于策略的蒸馏用于多模态大语言模型推理的全局归一化

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

发表机构 * OPPO AI Center(OPPO AI中心)

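AI summary Addresses gradient instability caused by outlier states in on-policy distillation with Globally Normalized Distillation Policy Optimization (GNDPO), which stabilizes optimization by converting KL scores into batch-level relative advantages and improves training robustness and performance on multimodal reasoning tasks.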

详情
AI中文摘要

基于策略的蒸馏(OPD)最近成为一种重要的后训练范式。通过使用更强的教师模型为采样轨迹提供密集、细粒度的监督,OPD相比依赖稀疏二元或基于结果的环境反馈的可验证奖励强化学习(RLVR)具有明显优势。然而,朴素的token级蒸馏可能因异常状态中的幅度不匹配而遭受梯度不稳定性。为了解决这个问题,我们提出了全局归一化蒸馏策略优化(GNDPO),这是一种实用方法,通过将原始KL分数转化为批次级相对优势来稳定优化。这种归一化有效缓解了梯度爆炸,同时保留了token级指导的优势。实验结果表明,GNDPO在多模态推理任务中显著提高了训练鲁棒性和下游性能。代码已发布在 https://github.com/OPPO-Mente-Lab/GNDPO。

英文摘要

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
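One plausible reading of "batch-level relative advantages" is a z-score taken over every token in the batch, so that a single outlier KL value cannot dominate the gradient scale. The sketch below implements that reading only; the sign convention (lower KL to the teacher gives a higher advantage), the epsilon, and the function name are assumptions, and the released GNDPO code may differ.

```python
import math


def batch_relative_advantages(kl_scores, eps=1e-8):
    """Normalize per-token KL scores across the whole batch.

    kl_scores: list of sequences, each a list of per-token KL values.
    Returns advantages with the same nesting, roughly zero-mean and unit-variance.
    """
    flat = [s for seq in kl_scores for s in seq]
    mean = sum(flat) / len(flat)
    std = math.sqrt(sum((s - mean) ** 2 for s in flat) / len(flat))
    # Assumption: tokens closer to the teacher (lower KL) get positive advantage.
    return [[-(s - mean) / (std + eps) for s in seq] for seq in kl_scores]
```

With raw scores, one token with KL = 50 next to tokens near 0.1 would contribute a gradient hundreds of times larger than its neighbours; after normalization its advantage is bounded by the batch statistics.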

2606.09169 2026-06-09 cs.AI cs.CV cs.MM 交叉投稿

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

IMUG-Bench:交错理解与生成的统一多模态模型基准

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Huawei(华为)

AI总结 提出IMUG-Bench基准,用于评估统一多模态模型在多轮交错图文对话中的理解与生成能力,包含3113样本和12034交互轮次,揭示了生成侧暴露偏差,并探索了测试时扩展策略。

详情
AI中文摘要

近年来,统一多模态模型(UMMs)出现,支持在单一框架内同时进行理解和生成。掌握动态、多轮交错图文对话是UMMs在实际应用中的关键任务。然而,现有基准未能评估这一重要任务,因为它们通常局限于单轮或静态设置,并且通常忽略多轮交互中的暴露偏差。为弥补这一差距,我们提出IMUG-Bench,一个用于UMMs多轮交错图文对话的综合基准,联合评估其理解和生成能力。我们的IMUG-Bench包含三类:静态空间、时间因果和混合,涵盖3113个样本和12034个交互轮次。它还包括动态理解问题,从而支持更能反映真实多轮交互场景的评估。在IMUG-Bench上进行的大规模实验系统评估了主流开源和闭源UMMs,揭示了它们的能力边界和失败模式,并发现了多轮交互中生成侧的显著暴露偏差。我们进一步探索了几种测试时扩展策略,包括思维链、自我验证和最佳N采样,这些策略有效提高了生成准确性并减轻了生成任务中的暴露偏差。这些发现为增强未来UMMs的鲁棒性和多轮交互能力提供了见解。

英文摘要

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

2606.09644 2026-06-09 cs.CL cs.CV 交叉投稿

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

答案从何而来?面向自动驾驶的多视角MLLMs中视角级视觉证据识别基准

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对多视角自动驾驶场景,提出一个基准测试,评估多模态大模型在视觉问答中识别支持性相机视角的能力,包含122个冲突中心问题对,并区分视角选择与答案正确性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉推理基准测试中取得了强劲结果,但仅凭答案准确性并不能表明模型是否依赖了正确的视觉证据。这一差距在用于自动驾驶的多视角驾驶场景中尤为重要,因为模型可能产生看似合理的答案,却将其归因于错误的相机视角。我们引入了一个多视角视觉问答基准,用于评估证据来源识别:给定六个同步的NuScenes视角和一个问题,模型必须识别支持性的相机视角并回答问题。该基准包含来自73个场景的122个冲突中心问答对,涵盖因果关系、反事实推理和意图预测。视角标签由自动冲突挖掘流程提出,并由标注者手动验证。我们评估了三种设置:相机视角选择、给定黄金视角的Oracle问答,以及模型在一次前向中同时选择视角并回答的联合预测。答案以多项选择和自由形式两种格式进行评估,使用精确匹配处理结构化预测,并使用LLM评判器处理自由形式回答。通过明确分离视觉来源识别与答案正确性,该基准揭示了仅凭答案评估无法发现的接地失败案例。

英文摘要

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
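
To make the separation between view identification and answer correctness concrete, here is a small scoring sketch for the joint-prediction setting. The record fields (`pred_view`, `gold_view`, `pred_answer`, `gold_answer`) and the sample records are hypothetical; the camera names follow the usual nuScenes channel naming, and exact match is used for both fields, as the abstract describes for structured predictions.

```python
def grounding_report(records):
    """Report view accuracy, answer accuracy, and how often they disagree."""
    n = len(records)
    view_ok = [r["pred_view"] == r["gold_view"] for r in records]
    ans_ok = [r["pred_answer"] == r["gold_answer"] for r in records]
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(ans_ok) / n,
        # Both the camera view and the answer are correct.
        "joint_acc": sum(v and a for v, a in zip(view_ok, ans_ok)) / n,
        # Correct answer attributed to the wrong camera: a grounding failure
        # that answer-only evaluation would count as a success.
        "right_answer_wrong_view": sum(a and not v for v, a in zip(view_ok, ans_ok)) / n,
    }


# Hypothetical multiple-choice predictions.
records = [
    {"pred_view": "CAM_FRONT", "gold_view": "CAM_FRONT", "pred_answer": "A", "gold_answer": "A"},
    {"pred_view": "CAM_BACK", "gold_view": "CAM_FRONT_LEFT", "pred_answer": "B", "gold_answer": "B"},
    {"pred_view": "CAM_BACK", "gold_view": "CAM_BACK", "pred_answer": "C", "gold_answer": "D"},
    {"pred_view": "CAM_FRONT_RIGHT", "gold_view": "CAM_BACK_LEFT", "pred_answer": "A", "gold_answer": "B"},
]
report = grounding_report(records)
```

In this toy set the answer accuracy is 0.5, but only half of those correct answers are grounded in the right view, which is the kind of gap the benchmark is designed to expose.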

2606.09718 2026-06-09 cs.LG cs.CV 交叉投稿

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

通过自监督原则评估扩散模型的表示空间

Xiao Li, Yixuan Jia, Zekai Zhang, Xiang Li, Lianghe Shi, Jinxin Zhou, Zhihui Zhu, Liyue Shen, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 受自监督学习启发,提出基于Fisher信息的度量ICR,分解特征为不变和残差成分,用于联合评估扩散模型的表示与生成能力,发现中间噪声水平下不变性最强且分类性能最佳,ICR可敏感检测训练中的记忆化。

Comments First two authors contributed equally. Accepted at ICML 2026

详情
AI中文摘要

扩散模型已展现出卓越的生成能力,并成为强大的自监督表示学习器,但这两种能力之间的联系仍较少被探索。受自监督学习(SSL)启发,我们引入了一个框架,用于联合评估扩散模型的表示和生成能力。具体地,我们将特征分解为不变成分和残差成分,并推导出不变污染比(ICR),这是一种基于Fisher的度量,用于量化残差变化在特征空间中对不变信号的污染程度。我们利用该框架分析扩散模型的判别和生成行为。在表示方面,我们发现不变性在中间噪声水平达到峰值,同时该水平也产生最佳的下游分类性能。在生成方面,我们研究了在数据有限情况下训练如何从真正的泛化过渡到记忆化,并表明ICR可作为早期学习的敏感训练时指标:沿Fisher方向增加的残差能量标志着记忆化的开始,该指标仅从训练特征即可检测,无需外部评估器或保留测试集。总体而言,我们的结果表明,扩散模型可以通过其学习表示的几何结构从自监督视角进行监控。

英文摘要

Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.

2410.00713 2026-06-09 cs.CV 版本更新

RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

RAD:面向机器人观测的真实异常检测数据集与基准

Kaichen Zhou, Xinhai Chang, Taewhan Kim, Jiadong Zhang, Yang Cao, Chufei Peng, Fangneng Zhan, Hao Zhao, Hao Dong, Kai Ming Ting, Ye Zhu

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) Great Bay University(大湾区大学) Harvard University(哈佛大学) Tsinghua University(清华大学) Nanjing University(南京大学) Deakin University(迪肯大学)

AI总结 提出RAD数据集,包含13类日常物体和4种缺陷,从60多个机器人视角在非受控光照下采集,用于评估2D特征、3D重建和视觉语言模型在姿态无关的异常检测中的表现,发现2D方法优于3D和VLM方法。

详情
AI中文摘要

异常检测是机器人感知和工业检测的核心能力,然而现有大多数基准是在固定视角和稳定光照的受控条件下收集的,未能反映实际部署场景。我们提出RAD(真实异常检测),一个由机器人捕获的多视角数据集,旨在强调姿态变化、反射材料和视角依赖的缺陷可见性。RAD涵盖13个日常物体类别和四种真实缺陷类型——划痕、缺失、污渍和挤压——在非受控光照下从每个物体超过60个机器人视角捕获。我们在姿态无关的设置下对多种最先进方法进行基准测试,包括基于2D特征的方法、3D重建流水线和视觉语言模型(VLM)。令人惊讶的是,我们发现成熟的2D特征嵌入方法在图像级别上始终优于最近的3D和基于VLM的方法,而在像素级别定位上性能差距缩小。我们的分析表明,反射表面、几何对称性和稀疏的视角覆盖从根本上限制了当前基于几何和零样本的方法。RAD为机器人异常检测建立了一个具有挑战性和现实性的基准,突出了超出受控实验室环境的关键开放问题。

英文摘要

Anomaly detection is a core capability for robotic perception and industrial inspection, yet most existing benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios. We introduce RAD (Realistic Anomaly Detection), a robot-captured, multi-view dataset designed to stress pose variation, reflective materials, and viewpoint-dependent defect visibility. RAD covers 13 everyday object categories and four realistic defect types--scratched, missing, stained, and squeezed--captured from over 60 robot viewpoints per object under uncontrolled lighting. We benchmark a wide range of state-of-the-art approaches, including 2D feature-based methods, 3D reconstruction pipelines, and vision-language models (VLMs), under a pose-agnostic setting. Surprisingly, we find that mature 2D feature-embedding methods consistently outperform recent 3D and VLM-based approaches at the image level, while the performance gap narrows for pixel-level localization. Our analysis reveals that reflective surfaces, geometric symmetry, and sparse viewpoint coverage fundamentally limit current geometry-based and zero-shot methods. RAD establishes a challenging and realistic benchmark for robotic anomaly detection, highlighting critical open problems beyond controlled laboratory settings.

2505.13225 2026-06-09 cs.CV 版本更新

CoSeP: Complementary Separability Pruning via Class-Separability Clustering

CoSeP:基于类别可分性聚类的互补可分离性剪枝

David Levin, Gonen Singer

发表机构 * Faculty of Engineering Bar-Ilan University(巴伊兰大学工程学院)

AI总结 提出CoSeP方法,通过类别可分性空间中的互补性建模和自动剪枝率选择,在多种网络和数据集上实现精度提升或持平,并降低计算量。

Comments Major revision and extension of arXiv:2505.13225

详情
AI中文摘要

神经网络剪枝旨在压缩模型以实现高效部署,但仍存在两个基本挑战。首先,许多方法依赖每个组件的重要性分数,独立选择滤波器或神经元,忽略了冗余性:保留的集合可能包含多个捕捉相似判别模式的组件,而完全遗漏其他组件。其次,确定每层剪枝率通常需要手动、特定于架构的调整,且没有原则性的停止准则。我们提出CoSeP(互补可分离性剪枝)来解决这两个问题。CoSeP不是孤立地评分组件,而是通过Jeffries-Matusita距离计算每个组件在所有类别对上的类别可分性轮廓来表示该组件。这定义了一个可分性空间,其中邻近的组件可能冗余,而远离的组件捕捉互补信息。CoSeP在该空间中选择一个紧凑的代表集:通过k-medoids聚类对组件进行分组,使用平均简化轮廓评估候选子集大小,并通过拐点检测准则自动确定保留多少个组件。在CIFAR-10、CIFAR-100和ImageNet-1K上,针对ResNet、VGG、MobileNet和DenseNet架构,CoSeP在减少FLOPs的同时匹配或提高了精度,实测推理时间减少高达20%。例如,在ResNet-50/ImageNet-1K上实现了+0.66%的top-1准确率提升,同时FLOPs减少2.30倍;在VGG-16/CIFAR-10上实现了0.37%的准确率提升,FLOPs减少2.59倍。这些结果表明,在类别可分性空间中建模互补性为剪枝提供了一种有效且原则性的方法。

英文摘要

Neural network pruning aims to compress models for efficient deployment, yet two fundamental challenges remain. First, many methods rely on per-component importance scores, selecting filters or neurons independently and ignoring redundancy: the retained set may include multiple components capturing similar discriminative patterns while missing others entirely. Second, determining per-layer pruning ratios typically requires manual, architecture-specific tuning with no principled stopping criterion. We propose CoSeP (Complementary Separability Pruning) to address both issues. Rather than scoring components in isolation, CoSeP represents each component by its class-separability profile across all class pairs, computed via Jeffries--Matusita distances. This defines a separability space in which nearby components are potentially redundant and distant components capture complementary information. CoSeP selects a compact set of representatives in this space: components are grouped via k-medoids clustering, candidate subset sizes are evaluated using the Mean Simplified Silhouette, and a knee-detection criterion automatically determines how many components to retain. Across CIFAR-10, CIFAR-100, and ImageNet-1K, on ResNet, VGG, MobileNet, and DenseNet architectures, CoSeP matches or improves accuracy while reducing FLOPs, with measured wall-clock inference-time reductions of up to 20%. For example, it achieves a +0.66% top-1 accuracy gain with 2.30x FLOPs reduction on ResNet-50/ImageNet-1K, and a 0.37% gain with 2.59x FLOPs reduction on VGG-16/CIFAR-10. These results demonstrate that modeling complementarity in class-separability space provides an effective and principled approach to pruning.
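
The class-separability profile can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each component's activations are summarized per class by a univariate Gaussian, the Jeffries-Matusita distance uses the common convention JM = 2(1 - exp(-B)) with B the Bhattacharyya distance (range 0 to 2), and the names `jm_distance`, `separability_profile`, and the sample activations are hypothetical.

```python
import math
from itertools import combinations
from statistics import fmean, pvariance


def jm_distance(m1, v1, m2, v2):
    """Jeffries-Matusita distance between two univariate Gaussians (0 to 2)."""
    # Bhattacharyya distance for N(m1, v1) and N(m2, v2).
    b = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    b += 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return 2.0 * (1.0 - math.exp(-b))


def separability_profile(acts_by_class, eps=1e-6):
    """Vector of JM distances over all class pairs for one component."""
    # eps keeps variances strictly positive for near-constant activations.
    stats = {c: (fmean(a), pvariance(a) + eps) for c, a in sorted(acts_by_class.items())}
    return [jm_distance(*stats[a], *stats[b]) for a, b in combinations(stats, 2)]


# Hypothetical mean activations of one filter on samples of three classes.
acts = {"cat": [0.10, 0.20, 0.15], "dog": [0.90, 1.00, 0.95], "car": [0.12, 0.18, 0.16]}
profile = separability_profile(acts)  # pair order: (car, cat), (car, dog), (cat, dog)
```

Components whose profiles lie close together separate the same class pairs and are candidates for redundancy; the clustering and knee-detection steps described in the abstract would operate on these profile vectors.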

2509.09151 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Video Understanding by Design: How Datasets Shape Video Models

通过设计理解视频:数据集如何塑造视频模型

Lei Wang, Syuan-Hao Li, Piotr Koniusz, Yongsheng Gao

发表机构 * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University(格里菲斯大学工程与建筑环境学院电气与电子工程系) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院)

AI总结 本文从数据集视角出发,提出统一框架连接数据集结构、归纳偏差与架构设计,分析数据集特性如何驱动视频理解架构创新,并讨论不同数据体制下的表征偏差。

Comments Research report

详情
AI中文摘要

视频理解研究因日益多样化的数据集和更强大的模型架构而快速发展。现有综述通常按任务、基准或模型家族组织进展,但对特定架构为何出现并成功提供的见解有限。本综述认为,视频理解的演进根本上由数据集结构塑造。我们提出一个以数据集为中心的视角,在统一框架内连接数据集结构、归纳偏差和架构设计。我们表明,不同数据集要求模型捕获特定的不变性和能力,例如对视角变化的鲁棒性、对时间顺序的敏感性、长程依赖推理、关系交互和跨模态对齐。这些需求自然产生归纳偏差,即有利于特定推理和泛化模式的架构假设。从这一视角看,里程碑式架构,包括双流网络、3D CNN、时序模型、Transformer、基于图的方法和多模态基础模型,可理解为对演进数据集所带来挑战的架构响应。基于此框架,我们系统分析了数据集特性如何塑造视频理解任务中的架构创新,并讨论了不同数据体制引发的表征偏差。通过将数据集、归纳偏差和架构统一为一个连贯视角,本综述既提供了对领域演进的回顾性解释,也提供了通向通用视频理解系统的前瞻性路线图。代码和数据集诱导偏差的动态视频可视化见 https://time.griffith.edu.au/paper-sites/video-understanding/。

英文摘要

Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys typically organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded. In this survey, we argue that the evolution of video understanding is fundamentally shaped by dataset structure. We present a dataset-centric perspective that connects dataset structure, inductive biases, and architectural design within a unified framework. We show that different datasets require models to capture specific invariances and capabilities, such as robustness to viewpoint changes, sensitivity to temporal ordering, reasoning over long-range dependencies, relational interactions, and cross-modal alignment. These requirements naturally give rise to inductive biases, i.e., architectural assumptions that favor particular patterns of reasoning and generalization. From this perspective, milestone architectures, including two-stream networks, 3D CNNs, temporal models, transformers, graph-based methods, and multimodal foundation models, can be understood as architectural responses to the challenges posed by evolving datasets. Building on this framework, we systematically analyze how dataset characteristics have shaped architectural innovation across video understanding tasks and discuss the representational biases induced by different data regimes. By unifying datasets, inductive biases, and architectures into a coherent perspective, this survey offers both a retrospective explanation of the field's evolution and a forward-looking roadmap toward general-purpose video understanding systems. Code and dynamic video visualizations of dataset-induced biases are available at https://time.griffith.edu.au/paper-sites/video-understanding/.

2602.05845 2026-06-09 cs.CV 版本更新

Self-Supervised Learning with a Multi-Task Latent Space Objective

基于多任务潜在空间目标的自监督学习

Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans

发表机构 * ESAT-PSI, KU Leuven, Belgium(比利时鲁汶大学ESAT-PSI实验室) VIB.AI, KU Leuven, Belgium(比利时鲁汶大学VIB.AI) CVL, ETH Zürich, Switzerland(瑞士苏黎世联邦理工学院CVL实验室) INSAIT, Sofia University, Bulgaria(保加利亚索菲亚大学INSAIT研究所) TRACE vzw(TRACE非营利组织)

AI总结 提出自预测孪生SSL的多任务公式,通过为每种空间变换分配专用预测器解决多裁剪训练失败问题,提升线性评估3.8-4%,并引入非对称裁剪视图实现语义修复预训练。

详情
AI中文摘要

我们提出了自预测孪生SSL的多任务公式,其中每个空间变换定义了一个不同的潜在空间对齐任务,由共享编码器上的专用预测器解决。这一视角直接解释了BYOL、SimSiam和MoCo v3等自预测方法中多裁剪训练长期存在的失败原因:共享预测器被迫同时解决异构对齐任务,导致优化不稳定。为每种视图类型分配一个预测器解决了这种干扰,跨框架实现了3.8-4%的线性评估提升。该视角还提出了一种通过引入额外空间变换作为互补任务来丰富预训练的原则性方法。我们通过引入非对称裁剪视图来证明这一点,其中掩码在线视图与完整目标对齐,形成语义修复目标。所得框架稳定、与骨干网络无关,并持续提升ResNet和ViT模型在ImageNet和COCO上的性能。

英文摘要

We propose a multi-task formulation of self-predictive Siamese SSL in which each spatial transformation defines a distinct latent-space alignment task, solved by a dedicated predictor over a shared encoder. This perspective directly explains a long-standing failure of multi-crop training in self-predictive methods such as BYOL, SimSiam, and MoCo v3: a shared predictor is forced to solve heterogeneous alignment tasks simultaneously, leading to unstable optimization. Assigning one predictor per view type resolves this interference, unlocking linear evaluation gains of 3.8-4\% across frameworks. This perspective also suggests a principled way to enrich pre-training by introducing additional spatial transformations as complementary tasks. We demonstrate this by introducing asymmetric cutout views, in which a masked online view is aligned with a complete target, forming a semantic inpainting objective. The resulting framework is stable, backbone-agnostic, and consistently improves the performance of ResNet and ViT models on ImageNet and COCO.

2602.24181 2026-06-09 cs.CV cs.AI 版本更新

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

混合饮食使DINO成为杂食视觉编码器

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

发表机构 * Google DeepMind(谷歌DeepMind) University College London(伦敦大学学院)

AI总结 针对DINOv2等预训练视觉编码器在不同视觉模态间特征对齐差的问题,提出杂食视觉编码器,通过后训练框架学习模态无关特征空间,实现跨模态鲁棒理解。

Comments CVPR 2026 Highlight

详情
AI中文摘要

预训练的视觉编码器(如DINOv2)在单模态任务上表现出色。然而,我们观察到它们的特征在不同视觉模态之间对齐不佳。例如,同一场景的RGB图像及其对应深度图的特征嵌入,其余弦相似度与两个随机不相关图像几乎相同。为了解决这个问题,我们提出了杂食视觉编码器,一种学习模态无关特征空间的后训练框架。我们通过双重目标微调编码器:首先,最大化同一场景不同模态之间的特征对齐;其次,一个蒸馏目标,将学习到的表示锚定到完全冻结的教师模型。由此产生的学生编码器通过为给定场景生成更一致的嵌入(无论输入模态是RGB、深度、分割等)而变得“杂食”。这种方法在保留原始基础模型的判别语义的同时,实现了鲁棒的跨模态理解。杂食模型权重可在以下网址获取:https://github.com/google-deepmind/representations4d。

英文摘要

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their features are poorly aligned across different visual modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a post-training framework that learns a modality-agnostic feature space. We fine-tune the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to a fully frozen teacher. The resulting student encoder becomes "omnivorous" by producing more consistent embeddings for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model. Omnivorous model weights are available at https://github.com/google-deepmind/representations4d.
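
The dual objective can be sketched in a few lines. The abstract does not give the loss forms, so this sketch assumes both terms are cosine distances and that they are combined with a hypothetical weight `distill_weight`; the function names and toy embeddings are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Alignment term across modalities plus a distillation anchor to the teacher."""
    # Pull the student's embeddings of the same scene together across modalities.
    align = 1.0 - cosine(student_rgb, student_depth)
    # Keep the student's RGB embedding close to the frozen teacher's embedding.
    distill = 1.0 - cosine(student_rgb, teacher_rgb)
    return align + distill_weight * distill


# Toy embeddings: the depth embedding is orthogonal to the RGB one before training.
loss_before = dual_objective([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
loss_after = dual_objective([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

The distillation term is what prevents the trivial solution in which all modalities collapse to one uninformative embedding: the student can only lower the alignment term while staying near the teacher's feature space.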

2603.14342 2026-06-09 cs.CV cs.AI 版本更新

AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

AgroOmni:一个大规模多视角农业数据集用于跨尺度多模态推理

Jiarui Zhang, Junqi Hu, Zurong Mai, Yang Liu, Yuhang Chen, Shuohong Lou, Henglian Huang, Hong Cheng, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

发表机构 * Sun Yat-sen University(中山大学) Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) HuanTian Wisdom Technology Co., Ltd.(慧天智慧科技有限公司) China Agricultural University(中国农业大学) Southwest Jiaotong University(西南交通大学) National Supercomputing Center in Shenzhen(深圳国家超算中心) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出AgroOmni数据集,通过288K视觉问答对覆盖56个专业任务类别,解决多视角跨尺度农业多模态推理中的尺度偏差问题,提出AgroNVILA模型在AgroMind基准上达到62.32%的SOTA成绩。

详情
AI中文摘要

现代农业数据来源于多样化的平台,涵盖多个空间尺度,从地面级近距离摄影到无人机(UAV)航空观测和卫星遥感图像。因此,农业多模态推理需要强大的跨尺度空间理解。然而,由于缺乏多视角农业基准数据集,现有多模态大语言模型(MLLMs)表现出严重的地面级偏差,导致农业感知任务中出现尺度混淆进而语义崩溃,例如将农田图像误认为墙壁或地板。为此,我们引入AgroOmni,一个大规模多视角训练语料库,包含288K个视觉问答对,覆盖14种任务类型下的56个专业任务类别,旨在捕捉现代精准农业中的多样化尺度。基于此数据集,我们提出AgroNVILA,其在AgroMind基准上取得62.32%的新SOTA成绩(比GPT-5.2高15.03%),有效缓解了多视角跨尺度差距,实现了整体农业理解。在AgMMU上的诊断评估进一步通过受限的零样本性能揭示了宏观先验与微观诊断之间的固有异质性。同时,即使最小的微调也使AgroNVILA在AgMMU上实现了显著的性能提升,有力地证明了AgroOmni赋予它的泛化能力。完整的训练脚本已公开在 https://anonymous.4open.science/r/AgroOmni-6510。

英文摘要

Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit severe ground-level bias, which leads to scale confusion then semantic collapse in agricultural perception tasks, such as misinterpreting farmland imagery as walls or floors. To address this, we introduce AgroOmni, a large-scale multi-view training corpus with 288K Visual Question Answering pairs covering 56 specialized task categories across 14 task types, designed to capture diverse scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, which achieves a new state-of-the-art of 62.32% on the AgroMind benchmark (+15.03% over GPT-5.2), effectively mitigating the multi-view cross-scale gap for holistic agricultural understanding. Diagnostic evaluations on AgMMU further reveal an inherent heterogeneity between macro-priors and micro-diagnostics through constrained zero-shot performance. Meanwhile, even minimal fine-tuning leads to a dramatic performance gain of AgroNVILA on AgMMU, strongly demonstrating its generalization capability empowered by AgroOmni. Full training scripts are publicly available at https://anonymous.4open.science/r/AgroOmni-6510.

2603.25726 2026-06-09 cs.CV 版本更新

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

AnyHand:一个大规模合成数据集用于RGB(-D)手姿态估计

Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

发表机构 * University of California, San Diego(加州大学圣迭戈分校) Lambda, Inc(Lambda公司) Imperial College London(伦敦帝国理工学院) Nanyang Technological University(南洋理工大学)

AI总结 AnyHand通过提供大规模RGB-D图像和丰富的几何标注,提升了3D手姿态估计的性能,证明了数据多样性和质量对模型效果的重要性。

详情
AI中文摘要

我们介绍了AnyHand,一个大规模合成数据集,旨在推动3D手姿态估计的前沿。尽管近期基于基础方法的工作表明,扩大训练数据显著提高了手姿态估计,但现有现实数据集在覆盖范围上有限,且先前合成数据集很少能在大规模下同时提供遮挡、手臂细节和对齐的深度信息。为解决这一瓶颈,我们提出的AnyHand包含250万张单手和410万张手-物体交互的RGB-D图像,具有丰富的几何标注。我们展示了在现有RGB基线的原始训练数据配方中加入AnyHand可显著提升多个基准(FreiHAND和HO-3D)的性能,即使在保持架构和训练方案不变的情况下。结合对训练数据规模和组成设置的广泛消融分析,这些结果表明,对于推动手姿态估计的发展,训练数据的多样性和质量与规模同样关键。我们进一步在附录中检验了AnyHand对齐深度图的实用性,显示使用AnyHand扩展RGB-D监督可使现有RGB基线的轻量深度融合变体超越先前的RGB-D方法。

英文摘要

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation. While recent works with foundation approaches have shown that scaling training data markedly improves hand pose estimation, existing real-world datasets are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our proposed AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. We show that extending the original training data recipes of existing RGB baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architectures and training schemes fixed. Together with extensive ablations on the scale and composition of the training data setups, these results suggest that training data diversity and quality are as critical as scale for advancing hand pose estimation. We further examine the utility of AnyHand's aligned depth maps in the appendix, showing that scaling RGB-D supervision with AnyHand allows a lightweight depth-fusion variant of existing RGB baselines to outperform prior RGB-D methods.

2603.26763 2026-06-09 cs.CV cs.MM eess.IV 版本更新

A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

面向各种计算机视觉任务的相机原生谈话头视频数据集

Babak Naderi, Ross Cutler, Nabakumar Singh Khongbantabam

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出一个包含847个谈话头视频的数据集,用于评估视频压缩、超分辨率和质量评估等任务,展示了其在实时通信中的应用价值。

详情
AI中文摘要

谈话头视频是实时通信中的主要内容类型,但该领域视频处理研究的公开数据集仍然稀缺且信号保真度有限。本文开源了一个相机原生数据集,包含847段谈话头录像(约212分钟),每段时长15秒,由805名参与者使用446种不同的消费级网络摄像头在各自的自然环境中录制。所有录像均使用FFV1无损编码器存储,保留相机原生信号(24.4%为未压缩,75.6%为MJPEG编码),不引入额外的有损处理。每段录像都标注了平均意见分数(MOS)和十个感知质量标记,这些标记共同解释了64.4%的MOS方差。从该数据集中,我们按分层方式挑选出120个片段组成基准子集,涵盖三种内容条件:原始、背景虚化和背景替换。在四个数据集和四个编码器(H.264、H.265、H.266和AV1)上的编码效率评估显示,H.266相对于H.264的VMAF BD-rate节省高达-71.3%,编码器×数据集(η_p² = 0.112)和编码器×内容条件(η_p² = 0.149)的交互作用显著,表明内容类型和背景处理都会影响压缩效率。使用四个超分辨率模型的初步评估证实,数据集显著影响绝对性能,但模型排名保持不变,说明其适用范围不限于编码器基准测试。该数据集的规模是此前最大的谈话头网络摄像头数据集的5倍(847 vs. 160个片段),并具有无损信号保真度,为实时通信中视频压缩、超分辨率、质量评估和增强模型的基准测试提供了资源。

英文摘要

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to $-71.3\%$ (H.266) relative to H.264, with significant encoder$\times$dataset ($η_p^2 = .112$) and encoder$\times$content condition ($η_p^2 = .149$) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5$\times$ the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.

2604.10999 2026-06-09 cs.CV 版本更新

TraversalBench: Challenging Paths to Follow for Vision Language Models

TraversalBench: 为视觉语言模型设计的复杂路径挑战测试集

Clara Petrova, Zhuo Chen, Marin Soljačić

发表机构 * Massachusetts Institute of Technology, Department of Physics(麻省理工学院物理系) Massachusetts Institute of Technology, Institute for Data, Systems, and Society(麻省理工学院数据、系统与社会研究所) NSF AI Institute for Artificial Intelligence and Fundamental Interactions(国家科学基金会人工智能与基本相互作用AI研究所)

AI总结 本文提出TraversalBench,一个用于评估视觉语言模型复杂视觉路径跟随能力的受控基准测试集,发现自相交是主要困难来源,揭示了模型在路径感知上的局限性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态基准测试中表现优异,但其遵循复杂视觉路径的能力尚未得到充分测试。我们引入TraversalBench,一个用于精确视觉路径遍历的受控基准测试集。每个实例包含一条带有唯一起始标记和带标签顶点的连续折线;模型必须按从起点到终点的顺序恢复沿途遇到的顶点序列。该基准测试在自相交次数、曲折度、顶点数量和附近干扰线等因素上保持平衡,同时限制对OCR、世界知识或开放式规划的依赖。我们发现自相交是主要困难来源。首次交叉分析将失败定位到交叉点:性能在第一次交叉之前保持稳定,而在模型必须判断正确的延续方向时急剧下降。附近干扰线的影响较弱但会累积,辅助的阅读顺序基准则揭示了一致的从左到右偏向。这些结果共同刻画了VLMs如何感知视觉路径以及在其上如何失败。最后,我们将TraversalBench定位为面向VLMs的持续、精确视觉定位基准系列中的一项新贡献。代码、基准测试数据和渲染示例可在 https://github.com/clarapetrova/traversalbench 获取。

英文摘要

Vision-language models (VLMs) perform strongly on multimodal benchmarks, but their ability to follow complex visual paths remains under-tested. We introduce TraversalBench, a controlled benchmark for exact visual path traversal. Each instance contains a continuous polyline with a unique start marker and labeled vertices; models must recover the ordered sequence encountered from start to finish. The benchmark balances self-intersection count, tortuosity, vertex count, and nearby confounding lines while limiting reliance on OCR, world knowledge, or open-ended planning. We find that self-intersections are the dominant source of difficulty. A first-crossing analysis localizes failures to crossing points: performance is stable before the first crossing, then drops sharply when the model must resolve the correct continuation. Nearby confounders have weaker but compounding effects, and an auxiliary reading-order benchmark reveals a consistent left-to-right bias. Together, these results characterize how VLMs perceive and fail on visual paths. Finally, we position TraversalBench as a new contribution to the growing line of sustained and precise visual grounding benchmarks for VLMs. Code, benchmark data, and rendered examples are available at https://github.com/clarapetrova/traversalbench.
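
Since self-intersection count is one of the balanced factors, a short sketch of how it could be measured for a polyline may help. This is a generic orientation-test implementation, not the benchmark's own code; it counts only proper crossings between non-adjacent segments, and the function names and sample polylines are hypothetical.

```python
def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p); positive means a left turn."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def _properly_cross(a, b, c, d):
    """True if segment ab and segment cd cross at a single interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_self_intersections(points):
    """Count crossings between non-adjacent segments of a polyline."""
    segments = list(zip(points, points[1:]))
    count = 0
    for i in range(len(segments)):
        # Skip j = i + 1: adjacent segments share a vertex by construction.
        for j in range(i + 2, len(segments)):
            if _properly_cross(*segments[i], *segments[j]):
                count += 1
    return count


crossing_path = [(0, 0), (4, 4), (4, 0), (0, 4)]  # third segment crosses the first
simple_path = [(0, 0), (1, 0), (2, 1), (3, 0)]    # no crossings
```

A generator that bins instances by this count can then hold the other factors (vertex count, tortuosity, confounders) roughly fixed within each bin.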

2604.22482 2026-06-09 cs.CV cs.GR 版本更新

Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond

Holo360D: 一个大规模真实世界数据集,具有连续轨迹,用于推进全景3D重建及更广泛领域

Jing Ou, Zidong Cao, Yinrui Ren, Zhuoxiao Li, Jinjing Zhu, Tongyan Hua, Shuai Zhang, Hui Xiong, Wufan Zhao

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China Normal University(华南师范大学)

AI总结 本文提出Holo360D数据集,包含109495张全景图像及注册点云、网格和对齐相机姿态,通过连续轨迹和高完整性深度图提升全景3D重建性能,建立新基准并提供有效微调策略。

Comments Datasets Link: https://github.com/Jou719/Holo360D

详情
AI中文摘要

尽管前馈式3D重建模型发展迅速,但在全景图像上仍因球面畸变而性能下降。现有全景3D数据集多由固定在离散位置的360相机采集,导致轨迹不连续。本文提出Holo360D数据集,包含109,495张全景图像及注册点云、网格和对齐相机姿态。Holo360D是首个大规模提供连续全景序列和高完整性深度图的数据集。原始数据由3D激光扫描仪与360相机采集,随后通过在线和离线SLAM系统处理。为提升3D数据质量,提出针对360数据集的后处理流程,包括几何去噪、网格孔洞填补和区域特定重网格化。最后,通过在Holo360D上微调3D重建模型建立新基准,提供有效微调策略的关键见解。实验结果表明,Holo360D提供更优的训练信号,为推进全景3D重建模型提供全面基准。数据集和代码将公开发布。

英文摘要

While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.

2604.23066 2026-06-09 cs.CV 版本更新

Urban Flood Observations: A hand-labeled training and validation dataset of post-flood inundation

城市洪水观测:一个手标注的训练和验证数据集,用于洪水后淹没区域

Rohit Mukherjee, Hannah K. Friedrich, Beth Tellman, Ariful Islam, Zhijie Zhang, Jonathan Giezendanner, Upmanu Lall, Venkataraman Lakshmi

发表机构 * Pacific Northwest National Laboratory(太平洋西北国家实验室) University of Arizona(亚利桑那大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Utah State University(犹他州立大学) Massachusetts Institute of Technology(麻省理工学院) Columbia University(哥伦比亚大学) University of Virginia(弗吉尼亚大学)

AI总结 本文提出UFO数据集,用于复杂城市环境中从卫星图像中映射洪水淹没区域,通过手标注数据集验证了分割模型,达到77.3的平均IoU,并评估了两种常用水体产品。

Comments 15 pages, 8 figures

详情
AI中文摘要

城市洪水影响全球生命和基础设施。从卫星图像中映射复杂城市环境中的淹没区域仍然具有挑战性,由于空间分辨率有限、获取频率低和云层覆盖。我们提出了Urban Flood Observations (UFO),一个全球性的手标注数据集,包含2017至2021年间14次洪水事件中的215张图像芯片(1024x1024像素),源自3米的PlanetScope影像。每张芯片被标注为'淹没'(所有可见水面,包括洪水水和原有水面(永久或季节性))和'非淹没'。通过留一事件法交叉验证训练分割模型,达到77.3的平均交并比(IoU)。我们还利用UFO评估了两种广泛使用的水体产品,即基于Sentinel-1的NASA IMPACT模型和Google的10米Dynamic World水类,分别得到44.1和48.1的IoU。UFO公开可用,以支持城市淹没区域映射方法的发展和验证。

英文摘要

Urban flooding affects lives and infrastructure worldwide. Mapping inundation in complex urban environments from satellite imagery remains challenging due to limited spatial resolution, infrequent acquisitions, and cloud cover. We present Urban Flood Observations (UFO), a global, hand-labeled dataset of post-flood inundation in diverse urban settings. UFO comprises 215 image chips (1024 by 1024 pixels) from 14 flood events between 2017 and 2021, derived from 3 m PlanetScope imagery. Each chip is annotated with two classes: 'inundated' (all visible surface water, including floodwater and pre-existing water bodies (permanent or seasonal)) and 'non-inundated'. To demonstrate the dataset's utility, we trained a segmentation model using leave-one-event-out cross-validation, achieving a mean Intersection over Union (IoU) of 77.3. We also used UFO to evaluate two widely used surface water products, the Sentinel-1-based NASA IMPACT model and Google's 10 m Dynamic World water class, which yielded IoUs of 44.1 and 48.1, respectively. UFO is publicly available to support the development and validation of urban inundation mapping methods.
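The evaluation described in the abstract above rests on two standard pieces: Intersection over Union for a binary "inundated" mask, and leave-one-event-out cross-validation. The following is a minimal sketch of both, not the authors' code. The function names and the nested-list mask layout are assumptions made for illustration, and the abstract does not say how the reported mean IoU is averaged (per class, per chip, or per event).

```python
from typing import Iterator, List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # rows of 0/1 labels, 1 = inundated


def binary_iou(pred: Mask, truth: Mask) -> float:
    """IoU of the positive class for two equally sized 0/1 masks."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention (an assumption): two empty masks count as a perfect match.
    return inter / union if union else 1.0


def leave_one_event_out(
    chip_events: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_event, train_indices, test_indices), one fold per event."""
    for event in sorted(set(chip_events)):
        train = [i for i, e in enumerate(chip_events) if e != event]
        test = [i for i, e in enumerate(chip_events) if e == event]
        yield event, train, test
```

Holding out a whole flood event per fold, rather than random chips, keeps chips from the same event out of both the training and test sets, which is presumably why the authors chose this protocol.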

2605.05136 2026-06-09 cs.CV 版本更新

CPCANet: Deep Unfolding Common Principal Component Analysis for Domain Generalization

CPCANet:基于共同主成分分析的深度展开方法用于领域泛化

Yu-Hsi Chen, Abd-Krim Seghouane

发表机构 * The University of Melbourne(墨尔本大学)

AI总结 本文提出CPCANet,通过深度展开Flury-Gautschi算法实现共同主成分分析,提升领域泛化性能,在四个基准测试中达到零样本转移的最先进水平。

Comments 9 pages, 5 tables

详情
AI中文摘要

领域泛化(DG)旨在学习在分布外转移下仍具鲁棒性的表示,并有效推广到未见目标领域。尽管最近的不变学习策略和架构进步已取得良好性能,但通过二阶统计显式发现结构化的领域不变子空间仍被忽视。本文提出CPCANet,一种基于共同主成分分析(CPCA)的新型框架,将迭代的Flury-Gautschi(FG)算法展开为完全可微的神经层。该方法将CPCA的统计特性整合到端到端可训练框架中,强制在不同领域间发现共享子空间,同时保持可解释性。在四个标准DG基准测试中,CPCANet在零样本转移中达到最先进性能。此外,CPCANet架构无关,无需特定数据集调优,提供了一种简单高效的鲁棒表示学习方法以应对分布偏移。代码可在https://github.com/wish44165/CPCANet获取。

英文摘要

Domain Generalization (DG) aims to learn representations that remain robust under out-of-distribution (OOD) shifts and generalize effectively to unseen target domains. While recent invariant learning strategies and architectural advances have achieved strong performance, explicitly discovering a structured domain-invariant subspace through second-order statistics remains underexplored. In this work, we propose CPCANet, a novel framework grounded in Common Principal Component Analysis (CPCA), which unrolls the iterative Flury-Gautschi (FG) algorithm into fully differentiable neural layers. This approach integrates the statistical properties of CPCA into an end-to-end trainable framework, enforcing the discovery of a shared subspace across diverse domains while preserving interpretability. Experiments on four standard DG benchmarks demonstrate that CPCANet achieves state-of-the-art (SOTA) performance in zero-shot transfer. Moreover, CPCANet is architecture-agnostic and requires no dataset-specific tuning, providing a simple and efficient approach to learning robust representations under distribution shift. Code is available at https://github.com/wish44165/CPCANet.

2605.10376 2026-06-09 cs.CV 版本更新

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

SleepWalk:一种三层压力测试基准,用于指导的视觉-语言导航

Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

发表机构 * Indian AI Research Organization (IAIRO)(印度人工智能研究组织) ĸragya Lab, BITS Pilani Goa(BITS Pilani Goa 的 ĸragya 实验室) University of Dhaka(达卡大学) Delhi Technological University(德里技术大学) Apple(苹果公司) Meta

AI总结 本文提出SleepWalk基准,用于评估基于指令的轨迹预测,针对局部化、交互导向的具身推理,揭示当前VLM在空间推理中的系统性失败。

详情
AI中文摘要

视觉-语言模型(VLMs)在多模态感知和语言理解方面迅速发展,但尚不清楚它们是否能可靠地将语言接地为在3D数字环境中空间一致且可能执行的动作。我们引入SleepWalk,一种评估基于指令的轨迹预测的基准,该基准生成自文本场景描述并过滤以确保可导航性。与以往以长距离探索房间为中心的导航基准不同,SleepWalk针对局部化、以交互为中心的具身推理:给定渲染的视觉观察和自然语言指令,模型必须预测一个尊重场景几何、避免碰撞并终止在动作兼容位置的轨迹。该基准涵盖多样化的室内和室外环境,并将任务分为三层空间和时间难度,使在增加的组合复杂性下对接地进行细粒度分析成为可能。使用标准化的点评估评估协议,我们评估了三种前沿VLMs在2,472个经过精心挑选的3D环境中,每个场景有九条指令。结果揭示了在遮挡、交互约束和多步指令下的系统性失败:随着任务难度等级的增加,性能下降。总体而言,当前VLMs可以生成在空间上一致、可能执行且与预期动作一致的轨迹。通过在受控且可扩展的设置中暴露失败,SleepWalk为推进基于接地的多模态推理、具身规划、视觉-语言导航和3D环境中的动作能力代理提供了关键基准。

英文摘要

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increases. In general, current VLMs can somewhat produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

2606.00793 2026-06-09 cs.CV 版本更新

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

MBench: 视频世界模型记忆能力的综合基准

Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, Hao Liu, Chen Li, Jing Lyu, Yueqi Duan

发表机构 * Tsinghua University(清华大学) WeChat Vision, Tencent Inc.(微信视觉,腾讯公司) Peking University(北京大学)

AI总结 提出MBench基准,通过实体一致性、环境一致性和因果一致性三个核心维度及其12个子维度,系统评估视频世界模型的长期记忆能力,并揭示现有方法在长期状态保持上的关键局限。

Comments Project Page: https://peanutup.github.io/MBench-project/

详情
AI中文摘要

近期基于视频的世界模型在合成高保真视觉序列方面展现了前所未有的能力。然而，在视觉上合理的视频生成与世界模型的功能要求之间仍存在根本差距，特别是在长时间跨度内维持稳定且合理的内部状态方面。现有基准主要强调视觉质量、运动一致性和文本-视频对齐，但很大程度上忽略了记忆——世界模型在长期跨度和复杂交互中保持一致性的核心能力。为解决这一差距，我们提出了MBench，一个专门用于量化和评估视频世界模型记忆能力的综合基准。我们系统地将视频世界模型的记忆能力分解为三个层次化且互补的核心维度：实体一致性、环境一致性和因果一致性，这些维度进一步细化为12个可量化的子维度，以全面表征长期记忆。我们的基准基于严格策划的真实拍摄长视频，并通过基于规则的量化矩阵和VLM进行评估，以实现客观且全面的一致性评估。对主流最先进视频世界模型的广泛评估揭示了现有方法在长期状态保持方面的关键系统性局限，为推进该领域提供了标准化基准和明确的研究方向。

英文摘要

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative matrices and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

2606.04409 2026-06-09 cs.CV cs.AI cs.LG 版本更新

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

数据规模、模型复杂度和输入模态对视觉泛化影响的实证研究

Yidi Zhouluo

发表机构 * School of Medical Information and Artificial Intelligence, Shandong First Medical University(医学信息与人工智能学院,山东第一医科大学)

AI总结 通过一维非线性函数和CIFAR数据集实验,实证分析数据规模、模型复杂度和输入模态对视觉泛化性能的影响。

Comments 12 pages, 9 figures, 4 tables

详情
AI中文摘要

现代深度神经网络通常具有较大的参数规模和非线性层次结构,在计算机视觉中取得了强劲性能。然而,其泛化性能的来源仍然难以用传统统计学习理论解释。在可能影响视觉泛化的因素中,数据规模、模型复杂度和输入模态是基础且可控的变量。本研究实证分析了这三个因素如何影响模型泛化性能。具体而言,在初步实验中,我们构建了一维非线性函数,并改变训练样本数量和多项式次数,以观察数据规模和模型复杂度对模型性能的影响。在主要实验中,我们比较了CIFAR-10和CIFAR-100上不同训练数据规模、模型架构和输入模态下的模型性能。实验结果表明,增加训练数据规模持续改善泛化性能,而模型复杂度的变化并未带来稳定提升。此外,去除颜色信息会降低模型性能,而梯度、边缘和小波等显式先验特征在不同模型架构上的效果不一致。总体而言,本研究提供了数据规模、模型复杂度、输入模态与视觉泛化性能之间关系的实证分析。代码和实验日志见:https://github.com/zlyd-CV/DeepLearning-Empirical-Studies。

英文摘要

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.
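The preliminary experiment in the abstract above (vary the number of training samples and the polynomial degree on a one-dimensional nonlinear function) can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' code: the target function `sin(pi * x)`, the noise level, the sample sizes, and all function names are hypothetical choices.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def fit_polynomial(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares coefficients c[0..degree] via the normal equations.

    Fine for small degrees on [-1, 1]; the normal equations become
    ill-conditioned for high degrees. Needs more distinct xs than degree.
    """
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the Vandermonde design matrix A.
    aug = [
        [sum(x ** (r + c) for x in xs) for c in range(n)]
        + [sum(y * x ** r for x, y in zip(xs, ys))]
        for r in range(n)
    ]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def mse(coeffs: Sequence[float], xs: Sequence[float], ys: Sequence[float]) -> float:
    """Mean squared error of the polynomial on (xs, ys)."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)


def sweep(sample_sizes: Sequence[int], degrees: Sequence[int],
          noise: float = 0.1, seed: int = 0) -> Dict[Tuple[int, int], float]:
    """Test MSE for every (training size, degree) pair on a noisy sine target."""
    rng = random.Random(seed)
    test_x = [i / 50 - 1 for i in range(101)]
    test_y = [math.sin(math.pi * x) for x in test_x]
    results = {}
    for n in sample_sizes:
        train_x = [rng.uniform(-1, 1) for _ in range(n)]
        train_y = [math.sin(math.pi * x) + rng.gauss(0, noise) for x in train_x]
        for d in degrees:
            results[(n, d)] = mse(fit_polynomial(train_x, train_y, d), test_x, test_y)
    return results
```

Calling `sweep([20, 200], [1, 3, 7])` returns a dictionary of test errors keyed by `(n, degree)`, which can be tabulated to compare the effect of data scale against the effect of model complexity.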

2505.19662 2026-06-09 cs.AI cs.CV 版本更新

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena:面向真实作业任务的代理AI基准测试

Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang

发表机构 * Fujitsu Limited(富士通株式会社) Fujitsu Research of America(富士通美国研究部) Carnegie Mellon University(卡内基梅隆大学) Master’s Student, The University of Tokyo(东京大学硕士研究生) Agent Research Collective(代理研究集体)

AI总结 本文提出FieldWorkArena,用于评估代理AI在真实制造业和零售环境中的性能,通过现场采集的数据和实地访谈设计任务,验证多模态大语言模型的评估可行性。

Comments 27 pages, 10 figures, 7 tables [ICPR 2026 Accepted] Changes from previous version: added supplemental material

详情
AI中文摘要

本文介绍FieldWorkArena,一个针对真实世界作业任务的代理AI基准测试平台。随着对代理AI的需求增加,此类系统旨在检测和记录安全隐患、程序违规等关键事件。与大多数专注于模拟或数字环境的基准测试不同,我们的工作解决了在真实世界中评估代理的挑战。本文改进了之前的评估函数,以评估代理AI在多样化真实任务中的性能。数据集包含工厂、仓库和零售现场采集的图像和视频。任务通过与现场工人和管理人员的访谈精心设计。评估结果证实,考虑多模态大语言模型(如GPT-4o)特性进行性能评估是可行的。此外,本研究确定了所提新评估方法的有效性和局限性。完整数据集和评估程序可在网站(https://en-documents.research.global.fujitsu.com/fieldworkarena/)上公开获取。

英文摘要

This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such systems are being built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real world. In this paper, we improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises images and videos captured on site in factories, warehouses, and retail stores. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of multimodal LLMs (MLLMs) such as GPT-4o is feasible. Furthermore, this study identifies both the effectiveness and limitations of the proposed new evaluation methodology. The complete dataset and evaluation program are publicly accessible on the website (https://en-documents.research.global.fujitsu.com/fieldworkarena/).

2601.04498 2026-06-09 cs.LG cs.CV 版本更新

IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

IGenBench:文本到信息图生成可靠性基准测试

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) UESTC University of Virginia(弗吉尼亚大学) HKUST(GZ)(香港科技大学(广州)) Cornell University(康奈尔大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学)

AI总结 提出IGENBENCH基准,包含30种信息图类型和600个测试用例,通过多模态大语言模型分解为10类原子问题评估10种T2I模型,发现数据完整性等维度是普遍瓶颈。

详情
AI中文摘要

信息图是结合数据可视化与文本和插图元素的复合视觉制品，用于传达信息。虽然最近的文本到图像（T2I）模型可以生成美观的图像，但它们在生成信息图方面的可靠性仍不清楚。生成的信息图可能乍看正确，但包含容易被忽视的问题，例如扭曲的数据编码或错误的文本内容。我们提出了IGENBENCH，这是第一个评估文本到信息图生成可靠性的基准，包含跨越30种信息图类型的600个精心设计的测试用例。我们设计了一个自动评估框架，将可靠性验证分解为基于10种问题类型的原子是否问题。我们使用多模态大语言模型（MLLM）验证每个问题，得到问题级准确率（Q-ACC）和信息图级准确率（I-ACC）。我们在IGENBENCH上全面评估了10个最先进的T2I模型。我们的系统分析揭示了未来模型开发的关键见解：（i）三级性能层次，顶级模型的Q-ACC为0.90，但I-ACC仅为0.49；（ii）数据相关维度成为普遍瓶颈（例如，数据完整性：0.21）；（iii）所有模型实现端到端正确性的挑战。我们在https://igen-bench.vercel.app/发布IGENBENCH。

英文摘要

Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.
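The gap between the two metrics reported above (Q-ACC of 0.90 against I-ACC of 0.49 for the top model) is easier to see with a small calculation. The sketch below assumes that I-ACC counts an infographic as correct only when every one of its yes/no questions passes; the abstract does not spell out the definition, so treat that rule, the function names, and the toy data as assumptions.

```python
from typing import Dict, List


def q_acc(results: Dict[str, List[bool]]) -> float:
    """Question-level accuracy: share of all yes/no checks that pass."""
    answers = [ok for checks in results.values() for ok in checks]
    return sum(answers) / len(answers)


def i_acc(results: Dict[str, List[bool]]) -> float:
    """Infographic-level accuracy: share of infographics with no failed check.

    Assumed definition: one failed question makes the whole infographic wrong.
    """
    return sum(all(checks) for checks in results.values()) / len(results)


# Hypothetical verdicts for three generated infographics, four questions each.
toy_results = {
    "bar_chart": [True, True, True, True],
    "timeline": [True, True, True, False],
    "pie_chart": [True, True, False, True],
}
```

On `toy_results`, ten of twelve questions pass but only one of three infographics is fully correct, so a single overlooked error per image is enough to pull I-ACC far below Q-ACC.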

2605.16223 2026-06-09 cs.GR cs.AI cs.CV 版本更新

Evaluating Design Video Generation: Metrics for Compositional Fidelity

评估设计视频生成:构成保真度的度量标准

Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

发表机构 * Lica World(Lica世界) San Francisco, United States of America(美国旧金山) ICML’26 Workshop on Human-AI Co-Creativity, Seoul, South Korea(ICML’26 人类-人工智能协同创作研讨会,韩国首尔)

AI总结 本文提出一个自动化评估框架,用于评估设计动画中布局、动作正确性、时间质量和内容保真度,以替代主观人类评估,为该领域提供统一基准。

Comments ICML 2026 Workshop on Human-AI Co-Creativity

详情
AI中文摘要

生成视频模型越来越多地用于设计动画任务,但该领域缺乏标准化评估框架。与自然视频生成不同,设计动画施加了结构化约束:特定组件需以规定类型、方向、速度和时间进行动画,而非动画区域必须保持稳定,布局结构必须保持。本文提供了一个全面自动化的评估框架,从四个维度组织:布局保真度、动作正确性、时间质量及内容保真度。这消除了对主观人类评估的依赖,并为该领域建立了一个共同的基准。我们在此发布代码和数据集:https://github.com/purvanshi/lica-bench。

英文摘要

Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field. We release the code and dataset here: https://github.com/purvanshi/lica-bench.

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF 版本更新

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

基于元学习的成本效益模型评估

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

发表机构 * Griffith University(格里菲斯大学) Edith Cowan University(埃迪斯科文大学) The University of Queensland(昆士兰大学)

AI总结 提出MetaEvaluator,一种基于元学习的模型无关框架,通过参考模型池实现无标签数据上的快速、准确且成本效益高的新模型评估。

Comments Accepted by KDD 2026

详情
AI中文摘要

机器学习的快速发展产生了不断扩展的模型生态系统,使得在未见过的未标记数据上验证新发布模型的可靠性变得越来越具有挑战性。传统的评估流程依赖于昂贵的标注、重复的微调或无法跨模型家族迁移的狭窄假设。我们提出了MetaEvaluator,一个成本效益高、模型无关的框架,用于快速、无标签地评估跨不同架构和模态的未见模型。MetaEvaluator利用参考模型池上的元学习来获得可迁移的初始化,从而能够准确评估新模型,同时将成本分摊到整个池中,并消除了每个模型重新训练的需要。据我们所知,这是第一个能够在完全未标记数据集上评估新模型的模型无关框架。大量实验表明,与传统方法相比,MetaEvaluator以显著降低的成本产生稳定且准确的性能估计,使得在未标记数据上对新出现的模型进行可扩展的基准测试变得实用。

英文摘要

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2606.04920 2026-06-09 cs.LG cs.CV 版本更新

Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

通过特征对齐与缩放实现多域和长尾量化

Ting-An Chen, Chin-Yuan Yeh, De-Nian Yang

发表机构 * Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan(台湾大学电机工程学研究所) Institute of Information Science, Academia Sinica, Taiwan(中央研究院资讯科学研究所) Graduate Institute of Communication Engineering, National Taiwan University, Taiwan(台湾大学电信工程学研究所) Institute of Information Science and the Research Center for Information Technology Innovation, Academia Sinica, Taiwan(中央研究院资讯科学研究所及资讯科技创新研究中心)

AI总结 提出EmaQ和EmaQ-LT方法,通过CDF投影对齐域分布、敏感度加权聚合稳定多域量化,并引入类别条件方差缩放和置信度调整缓解长尾问题,在多种基准上实现低比特量化下的强性能。

详情
AI中文摘要

量化深度神经网络对于在资源受限设备上进行高效推理至关重要。然而,现有大多数方法针对单域和类别平衡数据设计,忽略了存在域偏移或严重类别不平衡的实际场景。我们通过高效多域对齐量化(EmaQ)解决这些挑战,该方法通过基于CDF的投影对齐域分布,并使用敏感度感知权重聚合来稳定多域量化。我们进一步将EmaQ扩展到EmaQ-LT用于长尾量化,通过引入类别条件方差缩放和基于置信度的logit调整来缓解多数类过度自信。理论分析建立了收敛保证,并激励了所提出的敏感度和缩放机制。在标准、多域(Office-31、Digits)和长尾(SynDigits-LT、CIFAR-10-LT、CIFAR-100-LT)基准上的实验表明,EmaQ和EmaQ-LT在域偏移和类别不平衡下实现了强大的低比特性能。

英文摘要

Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.

2606.05872 2026-06-09 cs.AI cs.CV 版本更新

Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

基于熵的AI智能体评估:一种测量行为模式的轻量级框架

Olasimbo Ayodeji Arigbabu

发表机构 * Olasimbo Ayodeji Arigbabu(奥拉西姆波·阿里加布)

AI总结 提出一种基于熵的轻量级评估框架(EEA),通过动作熵、轨迹熵、工具熵、信息增益、探索效率和鲁棒性熵等指标,从决策过程结构角度补充传统任务成功率等评估方法。

Comments 6 pages, 2 Tables

详情
AI中文摘要

AI智能体通常使用任务成功率、奖励、延迟和成本进行评估。这些指标很有用,但常常忽略了智能体行为的重要方面:智能体是否过度探索、是否过于僵化地重复自身、是否有效使用工具、是否随时间减少不确定性、或者在多次运行中保持鲁棒性。本文提出基于熵的AI智能体评估(EEA),一种通过熵来测量智能体行为的轻量级框架。EEA不将智能仅视为最终任务完成,而是研究智能体决策过程的结构。该框架引入了动作熵、轨迹熵、工具熵、信息增益、探索效率和鲁棒性熵。这些指标旨在补充而非取代传统评估方法。我们还提供了一个实用的Python实现,旨在与LangChain、Google ADK、自定义智能体循环以及存储的可观测性轨迹等智能体框架集成。

英文摘要

AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
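As a rough illustration of what an entropy-based behavior metric looks like, the sketch below computes the Shannon entropy of the empirical distribution of actions (or tool names) in a logged trace. It is not the paper's implementation: the abstract does not give the exact definitions of action or tool entropy, so the formula choice, the function names, and the trace format are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(items: Iterable[Hashable], base: float = 2.0) -> float:
    """Entropy of the empirical distribution of items (bits when base=2)."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p, base)
    return entropy


def normalized_entropy(items: Iterable[Hashable]) -> float:
    """Entropy divided by its maximum, log2(number of distinct items)."""
    items = list(items)
    distinct = len(set(items))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(items) / math.log2(distinct)
```

A trace that calls the same tool every step scores 0, while a trace spread evenly over four tools scores 2 bits. Read alongside task success, a very low value may indicate rigid repetition and a very high value may indicate unfocused exploration.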

13. 其他/综合视觉 30 篇

2606.09111 2026-06-09 cs.CV 新提交

Illumination-Invariant Anomaly Detection for Sub-Canopy UAV Multispectral Point Clouds

林冠下无人机多光谱点云的照度不变异常检测

Likun Chen, Yanfeng Gu, Xian Li

发表机构 * School of Electronics and Information Engineering(电子信息工程学院) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对植被阴影导致的严重光照异质性,提出无先验异常检测框架,通过太阳角度估计与照度一致稀疏表示,有效分离暗目标与真实阴影,提升复杂森林环境中异常与背景的可分离性。

Comments 5 pages, 8 figures

详情
AI中文摘要

无人机多光谱点云为林冠下目标检测提供了高维空间-光谱数据,但其有效性因植被阴影造成的严重光照异质性而显著降低。为此,我们提出一种无先验异常检测框架,能够稳健处理光照变化。首先,将太阳角度估计公式化为逆优化问题。通过将光谱指数与光线追踪模型耦合,该策略实现了无先验阴影提取,无需依赖飞行元数据,有效区分暗目标与真实阴影。其次,为减轻光谱失真,引入一种照度一致稀疏表示机制。与标准重建方法不同,我们严格从共享相同光照状态的邻域构建背景字典。该约束有效解耦光谱反射率与光照变化,确保目标仅由物理一致的背景点表示。实验结果表明,所提方法在复杂森林环境中显著提高了异常与背景的可分离性,性能优于最先进的基线方法。该框架特别适用于识别伪装军事目标、绘制倒木树干以及发现隐藏在茂密植被下的考古遗址。

英文摘要

Unmanned Aerial Vehicle (UAV) multispectral point clouds (MPC) provide high-dimensional spatial-spectral data for sub-canopy target detection; however, their efficacy is significantly compromised by severe illumination heterogeneity caused by vegetation shadows. To address this, we propose a prior-free anomaly detection framework capable of robustly handling lighting variations. First, we formulate solar angle estimation as an inverse optimization problem. By coupling spectral indices with a ray-tracing model, this strategy achieves Prior-Free Shadow Extraction without relying on flight metadata, effectively distinguishing dark objects from true shadows. Second, to mitigate spectral distortions, we introduce an Illumination-Consistent Sparse Representation mechanism. Unlike standard reconstruction methods, we construct a background dictionary strictly from neighbors sharing the same illumination state. This constraint effectively disentangles spectral reflectance from lighting variations, ensuring that targets are represented solely by physically consistent background points. Experimental results indicate that the proposed method significantly improves the separability between anomalies and background in complex forest environments, demonstrating superior performance over state-of-the-art baselines. This framework is particularly suited for identifying camouflaged military targets, mapping fallen tree trunks, and uncovering archaeological ruins hidden beneath dense foliage.

2606.09208 2026-06-09 cs.CV 新提交

Event-driven dynamic trajectories reconstruction and measurement of mechanical parameters for fragments

碎片事件驱动的动态轨迹重建与机械参数测量

Haoyang Li, Banglei Guan, Muxi Zha, Yifei Bian, Minzu Liang, Yang Shang, Qifeng Yu

发表机构 * School of Aeronautics, Northwestern Polytechnical University(航空学院,西北工业大学) College of Aerospace Science and Engineering, National University of Defense Technology(航空科学与工程学院,国防科技大学) College of Science, National University of Defense Technology(科学学院,国防科技大学)

AI总结 针对弹头爆炸中高速碎片因强闪光和烟雾难以准确测量机械参数的问题,提出事件驱动方法,利用事件相机的高时间分辨率和动态范围,结合多几何约束与概率模型重建3D轨迹并计算速度与动能。

Comments 33 pages, 11 figures

详情
AI中文摘要

在弹头爆炸过程中,会产生高密度、高速且相互遮挡的碎片。它们的机械参数(位置、速度、动能)直接决定了弹头碎片场的杀伤力。然而,爆炸场景中的高强度闪光和烟雾严重阻碍了这些机械参数的准确获取。为应对这一挑战,本文融合实验力学方法,提出了一种事件驱动的碎片动态轨迹重建及其机械参数测量方法。作为一种新型类脑视觉传感器,事件相机提供微秒级时间分辨率和高动态范围的光照变化感知,克服了强闪光干扰下高速目标难以准确测量的难题。该方法构建了多事件相机视觉系统,采用三种几何约束:时间相关极线约束寻找潜在匹配事件点对,三焦张量线约束和局部单应约束消除误匹配。建立了综合概率模型,通过熵权法确定每个约束概率的权重,以定量过滤误匹配。通过空间线线相交和非线性优化实现3D轨迹重建。最后,基于重建轨迹计算碎片的速度和动能。该方法为弹头碎片场的机械损伤评估和战术防护设计提供了可靠的技术支持。

英文摘要

During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mechanical parameters (position, velocity, kinetic energy) directly determine the lethality of the warhead fragment field. However, high-intensity flash and smoke in detonation scenarios severely hinder the accurate acquisition of these mechanical parameters. To address this challenge, this paper integrates experimental mechanics approaches and presents an event-driven method for reconstructing the dynamic trajectories of fragments and measuring their mechanical parameters. As a novel brain-inspired visual sensor, event cameras offer microsecond-level temporal resolution and high dynamic range lighting change perception, overcoming the difficulty of accurately measuring high-speed targets under strong flash interference. The method constructs a multi-event-camera vision system, adopting three geometric constraints: time-correlated epipolar constraint to find potential matching event point pairs, and trifocal tensor line constraint plus local homography constraint to eliminate mismatches. A comprehensive probability model is established, with entropy weight method determining the weight of each constraint's probability to quantitatively filter mismatches. 3D trajectory reconstruction is achieved via spatial line-line intersection and nonlinear optimization. Finally, the velocity and kinetic energy of the fragments are calculated based on the reconstructed trajectory. This method provides reliable technical support for the mechanical damage evaluation of warhead fragment fields and the tactical protection design.
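The abstract above uses the entropy weight method to weight the probabilities from its three geometric constraints. The textbook form of that method is sketched below: a criterion whose scores vary more across candidates has lower entropy and receives a larger weight. How the paper builds its score matrix is not described in the abstract, so the input layout (rows are candidate matches, columns are constraints) and the function name are assumptions.

```python
import math
from typing import List, Sequence


def entropy_weights(scores: Sequence[Sequence[float]]) -> List[float]:
    """Entropy weight method.

    scores[i][j] is the non-negative score of candidate i under criterion j.
    Returns one weight per criterion; the weights sum to 1.
    """
    n = len(scores)
    m = len(scores[0])
    if n < 2:
        return [1.0 / m] * m  # entropy is undefined for a single candidate
    divergences = []
    for j in range(m):
        column = [row[j] for row in scores]
        total = sum(column)
        if total == 0:
            divergences.append(0.0)  # an all-zero criterion carries no information
            continue
        probs = [v / total for v in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        # Clamp tiny negative values caused by floating-point rounding.
        divergences.append(max(0.0, 1.0 - entropy))
    norm = sum(divergences)
    if norm == 0:
        return [1.0 / m] * m  # all criteria equally uninformative
    return [d / norm for d in divergences]
```

A combined score for each candidate can then be formed as the weighted sum of its per-constraint probabilities, with a threshold on that sum used to reject likely mismatches.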

2606.09383 2026-06-09 cs.CV 新提交

An Opticalmechanics Framework for Dynamic Estimation of Multibody Systems

多体系统动态估计的光力学框架

Banglei Guan, Xuanyu Bai, Qingquan Chen, Zibin Liu, Dongcai Tan, Zhenbao Yu, Yang Shang, Qifeng Yu

发表机构 * National University of Defense Technology(国防科技大学)

AI总结 提出光力学运动-动力学集成框架,通过图像测量运动学量结合遗传算法优化,实现无接触关节力矩估计,实验验证腕关节力矩误差0.46 Nm。

Comments 10 pages, 12 figures

详情
AI中文摘要

传统的人体动力学分析通常受限于接触力/力矩传感器和受控实验室环境。为解决此问题,本研究提出了一种用于多体系统的光力学运动-动力学集成估计框架。具体而言,建立约束多体模型描述系统动力学,同时将图像测量的运动学量作为动态估计的非接触输入。然后通过基于遗传算法的优化,最小化模型预测与图像测量运动学量之间的差异,识别未知关节力矩。在气浮平台上的实验验证表明,与传感器测量相比,从图像数据估计的腕关节力矩平均绝对误差为0.46 Nm。在前向预测测试中,模型预测的角速度相对于图像测量结果的平均绝对误差为0.006 rad/s。本研究展示了在直接力/力矩测量困难的情况下,结合图像测量和力学建模进行非接触动态估计的潜力。

英文摘要

Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque sensors and controlled laboratory environments. To address this issue, this study proposes an opticalmechanics kinematic-dynamic integrated estimation framework for multibody systems. Specifically, a constrained multibody model is established to describe the system dynamics, while image-measured kinematic quantities are used as non-contact inputs for dynamic estimation. The unknown joint torque is then identified through a genetic-algorithm-based optimization by minimizing the discrepancy between model-predicted and image-measured kinematic quantities. Experimental validation on an air-bearing platform showed that the wrist joint torque estimated from image data achieved a mean absolute error of 0.46 Nm compared with sensor measurements. In the forward prediction test, the model-predicted angular velocity achieved a mean absolute error of 0.006 rad/s relative to the image-measured results. This study demonstrates the potential of combining image measurement and mechanical modeling for non-contact dynamic estimation in scenarios where direct force and torque measurement is difficult.

2606.07541 2026-06-09 cs.HC cs.AI cs.CV cs.CY cs.MM 交叉投稿

Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation

多模态大语言模型作为视频研究中的合成参与者:一项评估

Prabal Shrestha, Bohan Jiang, Haoning Xue, Huan Liu, Xinyi Zhou

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究评估多模态大语言模型在视频感知任务中模拟人类主观评分的表现,发现模型存在偏差且与人类一致性有限。

Comments Accepted to SocialLLM @ ICWSM 2026

详情
AI中文摘要

多模态大语言模型在视频理解和推理等客观任务上表现出色。然而,它们能否近似主观人类反应仍不清楚,因为主观反应不仅依赖于内容理解,还依赖于个体的社会背景。为填补这一空白,我们评估了MLLMs作为合成参与者在一项新兴任务中的表现:评估对短视频的感知感官参与度。基于感知信息感官价值框架,我们使用17项量表(测量情绪唤醒、戏剧冲击和新奇性)比较了招募的人类参与者和基于档案条件的MLLM模拟(n=673)的评分。我们发现,即使领先的MLLMs(Gemini 3 Flash和Qwen 3 Omni)与人类参与者的一致性也有限。这些模型在评分分布中表现出明显的向下均值偏移和中心趋势偏差。它们既引入又扁平化了子群体差异,同时对参与者档案的敏感性不一致。提示策略对这些指标的影响不同,适度改善某些方面同时恶化其他方面。这些结果突显了开发MLLMs作为视频研究中合成参与者的挑战与机遇。数据和代码:https://github.com/MINDLab25/mllm-human-simulation-eval

英文摘要

Multimodal large language models (MLLMs) have shown strong performance on objective tasks such as video understanding and reasoning. However, it remains unclear whether they can approximate subjective human responses, which depend not only on content comprehension but also on individuals' social contexts. To address this gap, we evaluate MLLMs as synthetic participants in an emerging task: assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, we compare ratings from recruited human participants and profile-conditioned MLLM simulations (n=673) using a 17-item scale measuring emotional arousal, dramatic impact, and novelty. We find that even leading MLLMs (Gemini 3 Flash and Qwen 3 Omni) show limited agreement with human participants. The models exhibit distinct downward mean-shift and central-tendency biases in their rating distributions. They both introduce and flatten subgroup differences, while showing inconsistent sensitivity to participant profiles. Prompting strategies affect these metrics differently, modestly improving some aspects while worsening others. These results highlight both the challenges and opportunities of developing MLLMs as synthetic participants in video-based research. Data and code: https://github.com/MINDLab25/mllm-human-simulation-eval
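The two biases named in the abstract above, a downward mean shift and a central-tendency bias, can be checked with simple summary statistics. The sketch below is one plausible operationalization rather than the authors' analysis: it reports the difference in mean rating and the ratio of rating spread between model and human responses. The function names and the toy ratings are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def mean_shift(human: Sequence[float], model: Sequence[float]) -> float:
    """Model mean minus human mean; negative means the model rates lower."""
    return mean(model) - mean(human)


def spread_ratio(human: Sequence[float], model: Sequence[float]) -> float:
    """Model spread over human spread; values below 1 suggest ratings
    compressed toward the middle of the scale."""
    return pstdev(model) / pstdev(human)


# Hypothetical ratings on a 7-point scale for the same five videos.
human_ratings = [1, 3, 5, 7, 4]
model_ratings = [3, 3, 4, 4, 3]
```

For the toy data, `mean_shift` is negative and `spread_ratio` is well below 1, the pattern the abstract describes. A fuller analysis would compare distributions per item and per participant subgroup.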

2606.07599 2026-06-09 cs.LG cs.AI cs.CV 交叉投稿

DiffOR: A Unified Continuous Generative Framework for Universal Ordinal Regression

DiffOR:一种用于通用序数回归的统一连续生成框架

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Kuaishou Technology(快手科技) Shanghai University of Finance and Economics(上海财经大学) Tongji University(同济大学)

AI总结 提出DiffOR框架,将序数回归建模为连续生成任务,利用扩散模型通过迭代去噪恢复连续序数值,并设计双解耦策略(多尺度增量聚合与动态去噪感知)保留序数拓扑,在12个基准上超越现有方法。

Comments Accepted at KDD 2026

AI中文摘要

序数回归(OR)旨在预测具有内在顺序的目标值,支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成,现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分,无法捕捉序数数据固有的非平稳语义转换。在本文中,我们提出了一种新范式,将OR形式化为连续生成序数回归任务。在该新范式下,我们引入了DiffOR,一个统一的框架,利用扩散模型通过迭代去噪恢复连续序数值,从而能够动态学习软语义转换。为了显式保留序数拓扑,我们设计了一种双解耦策略:在空间上,多尺度增量聚合将目标分解为层次化的连续增量;在时间上,动态去噪感知将去噪步骤与特征频率同步,确保稳健的从粗到细的细化。理论上,我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性,建立了一个新标准,展示了作为通用序数回归通用解决方案的强大潜力。

英文摘要

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07655 2026-06-09 eess.SP cs.CR cs.CV 交叉投稿

FADRW: A Feature-Aware Modulated and Dynamically Reweighted Loss for Few-Shot Linguistic Steganalysis

FADRW:一种面向少样本语言隐写分析的特征感知调制与动态重加权损失

Shuo Liu, Xianghong Lin, Yukun Wei, Zhongliang Yang

发表机构 * International School, Beijing University of Posts and Telecommunications(北京邮电大学国际学院) School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络安全学院)

AI总结 针对语言隐写检测中类别极度不平衡和特征边缘化问题,提出FADRW损失函数,通过动态重加权和特征感知调制提升少样本隐写分析性能。

Comments Accepted by IEEE Signal Processing Letters

AI中文摘要

社交媒体平台的普及为恶意语言隐写提供了便利,带来了显著的安全风险。然而,模型训练中的两个基本问题严重阻碍了检测。首先,极端类别不平衡(隐写样本不足1%)导致强烈的决策偏差。其次,生成式隐写的不可见性使其特征与正常文本几乎无法区分;这种相似性加上其极端稀有性,导致严重的特征边缘化,微弱的隐写信号被完全淹没。为了直接应对这些优化层面的挑战,我们提出了FADRW(特征感知调制与动态重加权损失),一种专为少样本隐写分析设计的新型损失函数框架。FADRW采用动态重加权逐步抵消决策偏差,并通过特征感知调制模块在结构上重塑特征空间,通过增强这些细微特征的可分离性来防止特征边缘化。在来自三个真实社交平台的数据集上进行的大量实验表明,FADRW显著优于最先进的方法,尤其是在具有挑战性的少样本隐写样本场景中。

英文摘要

The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. However, detection is severely hampered by two fundamental issues during model training. Firstly, extreme class imbalance (less than 1% steganographic samples) induces a strong decision bias. Secondly, the invisibility of generative steganography means its features are nearly indistinguishable from benign text; this similarity, compounded by their extreme rarity, leads to severe feature marginalization, where faint steganographic signals are completely overwhelmed. To directly address these optimization-level challenges, we propose FADRW (Feature-Aware Modulated and Dynamically Reweighted Loss), a novel loss function framework engineered for few-shot steganalysis. FADRW employs Dynamic Reweighting to progressively counteract decision bias, and a Feature-Aware Modulation module to structurally reshape the feature space, preventing feature marginalization by enhancing the separability of these subtle features. Extensive experiments on datasets from three real-world social platforms demonstrate that FADRW significantly outperforms state-of-the-art methods, particularly in the challenging few-shot steganographic sample scenario.

2606.07718 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

评估AI代理在神经科学数据到发现流程中的案例研究

Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, Kristin Branson

发表机构 * Cornell University(康奈尔大学) HHMI Janelia Research Campus(霍华德·休斯医学研究所珍利亚研究园区)

AI总结 本研究评估通用编码代理在果蝇光遗传学数据到发现流程中的表现,发现代理能解决单个阶段任务,但端到端流程仍超出其能力,主要挑战包括缺乏预定义迭代标准和科学判断能力。

AI中文摘要

代理型AI工具为自动化科学研究流程中的软件开发瓶颈提供了有希望的路径,特别是对于那些需要领域专家花费数天到数月构建的阶段,科学家关心的是正确性和鲁棒性,而非实现细节。我们针对果蝇光遗传学数据到发现流程,对通用编码代理进行了实证研究。我们在比现有基准大得多的任务、数量级更大的数据集以及基于领域专家标准的评估标准上评估代理。我们表明,代理可以解决几个单独的流程阶段,这表明阶段级自动化是可行的。通过分析代理的代码迭代,我们发现当没有预定义的标准可供迭代时,它们最困难,此时它们必须利用自己的科学判断来评估当前解决方案,这是一个关键开放挑战。与科学实践相呼应,它们有时尝试对中间输出进行视觉检查以进行自我评估,但大多未能正确解释所见或据此采取行动。正确解决端到端流程需要将所有流程阶段的成功串联起来,这超出了代理当前的能力。我们识别出现有基准中基本缺失的挑战,包括计算资源管理和对大型保留数据集的泛化。最后,我们提炼出构建科学任务和针对开放问题的严格评估标准的原则。

英文摘要

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents' code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents' current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.

2606.07949 2026-06-09 q-bio.PE cs.CV eess.IV 交叉投稿

Feasibility to detect rapid change and disappearance of seagrass: Lessons from nearly 80 years of vegetation change in the Ako, Seto Inland Sea, Japan

检测海草快速变化和消失的可行性:来自日本濑户内海Ako近80年植被变化的教训

Takehisa Yamakita, Yoji Igarashi, Akira Eto, Ken Ishida, Masaaki Iiyama

发表机构 * Japan Agency for Marine-Earth Science and Technology (JAMSTEC)(日本海洋地球科学技术机构) The University of Tokyo(东京大学) Tokyo University of Marine Science and Technology(东京海洋大学) Shiga University(滋贺大学)

AI总结 本研究利用近80年的航拍和卫星影像,结合YOLO深度学习分割,分析了日本Ako潮滩海草床的长期动态,发现2025年Zostera marina在一年内几乎完全消失,表明这是一次由夏季水温升高驱动的快速生态系统转变,并提出了改进海草监测指标的建议。

AI中文摘要

本研究分析了日本濑户内海的Ako潮滩,该地的大叶藻(Zostera marina)在2025年一年内几乎全部消失。利用1940年代以来的航拍照片、高分辨率卫星影像、GRUS图像(2.5-5米)以及每月Sentinel-2合成图像(10米),我们重建了约80年的海草分布。基于深度学习的YOLO分割在这些数据集上实现了高精度(总体精度≥0.9);尽管无法区分物种,但模型捕捉了植被面积的主要时间动态。长期平均海草面积为6.8公顷,但数值波动很大,从1974年的3.5公顷到1989年的41.3公顷,除2025年的0.2公顷外。2019年至2026年的Sentinel-2合成图像显示出明显的季节性,植被在初夏增加,秋季开始减少。然而,2025年夏季后面积急剧下降,并在2025-2026年整个冬季保持异常低值。我们的结果表明,2025年的事件并非正常波动,而是一次快速生态系统转变,涉及优势冠层物种的丧失,最可能的原因是区域夏季水温升高。这些发现对海草基本海洋变量(EOVs)和TNFD对齐的自然相关披露中使用的自然状态(SoN)指标也有影响。与森林不同,海草草甸需要更精细的时间分辨率,因为显著的季节性和突然崩溃都会强烈影响面积指标。因此,除了先前指出的物种级分类精度等问题外,我们建议:(1)基线应在最长的可用记录上定义并进行生态学论证;(2)在年际比较前应用季节性标准化;(3)将面积异常极端的年份标记出来,而非用作参考点。

英文摘要

This study analyses the Ako tidal flat in the Seto Inland Sea, Japan, where nearly all Zostera marina disappeared within a single year in 2025. Using aerial photographs from the 1940s onward, high-resolution satellite imagery, GRUS images (2.5-5 m), and monthly Sentinel-2 composites (10 m), we reconstructed approximately 80 years of seagrass distribution. YOLO-based deep learning segmentation achieved high accuracy (overall accuracy >= 0.9) across these datasets; although species could not be discriminated, the models captured the major temporal dynamics in vegetation area. The long-term mean seagrass area was 6.8 ha, but values fluctuated widely, from 3.5 ha in 1974 to 41.3 ha in 1989, apart from the exceptional low of 0.2 ha in 2025. Sentinel-2 composites from 2019 to 2026 revealed clear seasonality, with vegetation increasing in early summer and declining from autumn. In 2025, however, the area decreased sharply after summer and remained anomalously low throughout the winter of 2025-2026. Our results indicate that the 2025 event was not a normal fluctuation but a rapid ecosystem shift involving the loss of the dominant canopy-forming species, most plausibly driven by regionally elevated summer water temperatures. The findings also have implications for seagrass Essential Ocean Variables (EOVs) and the State of Nature (SoN) metrics used in TNFD-aligned nature-related disclosures. Unlike forests, seagrass meadows require finer temporal resolution because both pronounced seasonality and abrupt collapse strongly influence area-based indicators. Therefore, in addition to previously noted issues such as species-level classification accuracy, we recommend that (1) baselines be defined over the longest available record and justified ecologically, (2) seasonal standardization be applied before inter-annual comparisons, and (3) years with extreme area anomalies be flagged rather than used as reference points.
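Recommendations (2) and (3) above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' procedure: the abstract does not say how seasonal standardization or anomaly flagging should be computed, so the ratio-to-monthly-climatology index, the median-absolute-deviation rule, the threshold of 2.0, and the toy area values are all hypothetical choices.

```python
from statistics import mean, median

def monthly_climatology(obs):
    """Mean area per calendar month over the full record. obs maps (year, month) -> area in ha."""
    by_month = {}
    for (_, month), area in obs.items():
        by_month.setdefault(month, []).append(area)
    return {month: mean(values) for month, values in by_month.items()}

def standardized_annual_index(obs):
    """Average, per year, of each observation divided by its calendar-month climatology."""
    clim = monthly_climatology(obs)
    by_year = {}
    for (year, month), area in obs.items():
        by_year.setdefault(year, []).append(area / clim[month])
    return {year: mean(ratios) for year, ratios in by_year.items()}

def flag_anomalous_years(obs, threshold=2.0):
    """Years whose index lies more than `threshold` median absolute deviations from the median."""
    index = standardized_annual_index(obs)
    centre = median(index.values())
    mad = median(abs(value - centre) for value in index.values())
    if mad == 0:
        return []
    return sorted(y for y, value in index.items() if abs(value - centre) / mad > threshold)

# Toy areas in hectares for June and November, invented for illustration only.
toy = {
    (2021, 6): 10.0, (2021, 11): 5.0,
    (2022, 6): 12.0, (2022, 11): 6.0,
    (2023, 6): 8.0,  (2023, 11): 4.0,
    (2024, 6): 10.0, (2024, 11): 5.0,
    (2025, 6): 10.0, (2025, 11): 0.5,
}
print(flag_anomalous_years(toy))
```

Dividing by the calendar-month climatology removes the summer-versus-autumn difference before years are compared, and the flagged years can then be excluded when a baseline is chosen.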

2606.08258 2026-06-09 cs.GR cs.CV cs.LG 交叉投稿

MS-COOT: Comparing Morse-Smale Complexes with Co-Optimal Transport

MS-COOT: 用共最优传输比较Morse-Smale复形

Guangyu Meng, Mingzhe Li, Erin Wolf Chambers

发表机构 * Department of Computer Science and Engineering, University of Notre Dame(圣母大学计算机科学与工程系)

AI总结 提出MS-COOT距离,将Morse-Smale复形表示为超图,通过共最优传输联合匹配临界点和区域,实现区域级结构比较,在分类等任务中优于图方法。

AI中文摘要

理解和比较标量场中的结构是科学可视化的核心挑战,应用范围从特征分析到时间和结构比较。Morse-Smale (MS) 复形通过将标量场分解为由梯度流诱导的区域提供了自然表示。然而,现有方法通常依赖于基于图的表示,捕获临界点之间的关系而丢弃区域级结构。在这项工作中,我们将MS复形表示为超图,其中临界点构成节点,区域定义超边。我们引入MS-COOT,一种共最优传输距离,联合计算临界点和区域之间的对应关系。这种公式化使得在基于距离的框架内能够进行显式的区域到区域匹配,从而识别诸如分裂和合并等区域级事件。我们使用领域特定组件实例化该框架,包括编码临界点-区域关系的超网络函数、强调拓扑显著特征的基于持久性的概率度量,以及包含临界点属性的样本代价项。我们在涵盖2D模拟、3D曲面网格和体积数据的五个数据集上评估MS-COOT。我们的结果表明,MS-COOT捕获了基于图的距离未反映的区域级结构变化,同时在分类和分辨率判别等下游任务中实现了强性能。

英文摘要

Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, with applications ranging from feature analysis to temporal and structural comparison. The Morse-Smale (MS) complex provides a natural representation by decomposing a scalar field into regions induced by gradient flow. However, existing approaches typically rely on graph-based representations, capturing relationships between critical points while discarding region-level structure. In this work, we represent the MS complex as a hypergraph, where critical points form nodes and regions define hyperedges. We introduce MS-COOT, a co-optimal transport distance that jointly computes correspondences between critical points and regions. This formulation enables explicit region-to-region matching within a distance-based framework, allowing identification of region-level events such as splitting and merging. We instantiate this framework with domain-specific components, including a hypernetwork function encoding critical point-region relationships, persistence-based probability measures that emphasize topologically significant features, and a sample cost term that incorporates critical point attributes. We evaluate MS-COOT on five datasets spanning 2D simulations, 3D surface meshes, and volumetric data. Our results show that MS-COOT captures region-level structural changes that are not reflected by graph-based distances, while achieving strong performance in downstream tasks such as classification and resolution discrimination.

2606.08469 2026-06-09 cs.GR cs.CV 交叉投稿

OctaOctree Neural Radiosity for Real-time Glossy Material Rendering

OctaOctree神经辐射度用于实时光泽材质渲染

Jierui Ren, Haojie Jin, Bo Pang, Meng Gai, Fei Zhu, Yisong Chen, Sheng Li

发表机构 * Peking University(北京大学)

AI总结 提出OctaOctree表示,通过空间自适应八叉树耦合八面体方向图,高效编码高频出射辐射分布,实现单次网络查询的实时高质量全局光照。

Comments 11 pages, 9 figures

AI中文摘要

建模高频出射辐射分布仍然是全局光照中的基本挑战,尤其是对于光泽和镜面材质。现有的基于神经的辐射缓存方法通常依赖于位置特征编码或空间组织的缓存,这使得在不增加模型复杂度或采样成本的情况下难以表示尖锐的方向辐射变化。为了应对这一挑战,我们提出了OctaOctree,一种用于全局光照的高效空间-角度辐射表示。OctaOctree在3D空间中使用自适应八叉树组织出射辐射,并将每个空间节点与一个八面体方向图关联。通过将空间层次与方向依赖存储耦合,我们的表示为局部光照和可见性变化分配精细的空间分辨率,同时使用更粗糙的空间层次和更丰富的角度分辨率来捕捉光泽和镜面辐射分布。这种设计直接将反射感知的空间-角度先验嵌入辐射表示中,减轻了神经网络或重建模块仅从位置特征恢复高频视角依赖效应的负担。因此,OctaOctree为从漫反射互反射到尖锐光泽反射的广泛间接光照效应提供了紧凑且富有表现力的神经编码。实验表明,我们的方法在主交点处通过单次网络查询产生高质量、方向感知的全局光照,与基线神经辐射度和辐射缓存方法相比,实现了更好的保真度和实时性能。

英文摘要

Modeling high-frequency outgoing radiance distributions remains a fundamental challenge in global illumination, especially for glossy and specular materials. Existing neural-based radiance caching methods commonly rely on positional feature encodings or spatially organized caches, which makes it difficult to represent sharp directional radiance variations without increasing the model complexity or sampling cost. To address this challenge, we propose OctaOctree, an efficient spatial-angular radiance representation for global illumination. OctaOctree organizes outgoing radiance with an adaptive octree in 3D space, and associates each spatial node with an octahedral directional map. By coupling the spatial hierarchy with direction-dependent storage, our representation allocates fine spatial resolution to local illumination and visibility changes, while using coarser spatial levels with richer angular resolution to capture glossy and specular radiance distributions. This design embeds a reflectance-aware spatial-angular prior directly into the radiance representation, reducing the burden on neural networks or reconstruction modules to recover high-frequency view-dependent effects from positional features alone. As a result, OctaOctree provides a compact and expressive neural encoding for a wide range of indirect illumination effects, from diffuse interreflection to sharp glossy reflections. Experiments demonstrate that our method produces high-quality, direction-aware global illumination with a single network query at primary intersections, achieving improved fidelity and real-time performance compared with baseline neural radiosity and radiance caching approaches.
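The "octahedral directional map" attached to each octree node relies on a mapping between unit directions and a 2D square. The sketch below shows the standard octahedral parameterization as commonly used in graphics; whether OctaOctree uses exactly this layout, and how it resolves texels or interpolates, is not stated in the abstract, so treat the function names and conventions here as assumptions.

```python
import math

def _sign(x):
    """Sign that treats zero as positive, so folded coordinates stay well defined."""
    return 1.0 if x >= 0.0 else -1.0

def octa_encode(direction):
    """Map a non-zero 3D direction to (u, v) in [-1, 1]^2."""
    x, y, z = direction
    norm1 = abs(x) + abs(y) + abs(z)
    u, v = x / norm1, y / norm1
    if z < 0.0:
        # Fold the lower hemisphere outward into the corners of the square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return u, v

def octa_decode(uv):
    """Inverse mapping: (u, v) in [-1, 1]^2 back to a unit direction."""
    u, v = uv
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        # Undo the fold for points that belong to the lower hemisphere.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    length = math.sqrt(u * u + v * v + z * z)
    return u / length, v / length, z / length

# Round trip for a downward-pointing unit direction.
d = (0.6, 0.0, -0.8)
print(octa_decode(octa_encode(d)))
```

Because the square can be stored as a small 2D grid, each spatial node can hold direction-dependent values at whatever angular resolution that level is given.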

2606.08652 2026-06-09 astro-ph.SR cs.AI cs.CV 交叉投稿

Reconstructing Synthetic SDO/AIA 193 Å EUV Images from He I 10830 Å Observations with Diffusion Model Translator

利用扩散模型翻译器从He I 10830 Å观测重建合成SDO/AIA 193 Å EUV图像

Marco Marena, Qin Li, Haimin Wang, Haodi Jiang, Prajwal Shah, Bo Shen

发表机构 * Department of Mechanical and Industrial Engineering, New Jersey Institute of Technology(新泽西理工学院机械与工业工程系) Department of Physics, New Jersey Institute of Technology(新泽西理工学院物理系) Department of Computer Science, Sam Houston State University(萨姆·休斯顿州立大学计算机科学系) Department of Computer Science, New Jersey Institute of Technology(新泽西理工学院计算机科学系) Department of Data Science, New Jersey Institute of Technology(新泽西理工学院数据科学系)

AI总结 提出基于扩散的日冕洞感知翻译模型(CH-aware DMT),从He I图像重建AIA 193 Å EUV图像,在测试集上保持全盘EUV形态(CC=0.92)和日冕洞结构(CC=0.84),并通过历史数据验证其物理合理性。

AI中文摘要

常规的全盘EUV成像仅在现代时期(如SOHO和SDO)才可用。为了将EUV日冕背景扩展到更早时期,我们利用了数十年的全盘He I观测数据,其吸收受日冕辐照度和磁拓扑调制,并被广泛用作开放场区域的代理。我们提出了一种基于扩散的条件图像翻译框架——日冕洞感知扩散模型翻译器(CH-aware DMT),从He I输入重建合成SDO/AIA 193 Å EUV图像。该模型在2011-2015年时间对齐的SOLIS He I和AIA 193 Å配对数据上训练,采用基于月份的划分:1-10月用于训练,11月用于验证,12月用于测试。在保留的测试集上,重建结果保留了主要的全盘EUV形态(CC=0.92),并恢复了与日冕洞相关的低强度结构(CC=0.84)。我们进一步通过以下方式评估历史适用性:(1)比较2005-2015年间重建的AIA 193 Å形态与SOHO/EIT 195 Å;(2)比较从KPVT He I输入生成的重建AIA 193 Å图像与Yohkoh/SXT软X射线观测;(3)评估长期重建的盘积分发射统计量与观测EUV序列及独立太阳活动代理(1974-2015年的太阳黑子数和F10.7射电通量)的关系。这些结果表明,以He I为条件的CH-aware DMT可以为历史研究提供物理上合理的合成AIA 193 Å日冕代理,支持在直接EUV成像可用之前对大规模日冕演化进行数十年尺度的分析。

英文摘要

Routine full-disk EUV imaging has been available only in the modern era of missions such as SOHO and SDO. To extend EUV coronal context into earlier periods, we leverage the multi-decade availability of full-disk He I observations, whose absorption is modulated by coronal irradiance and magnetic topology and is widely used as a proxy for open-field regions. We present a diffusion-based conditional image translation framework, Coronal Hole-aware Diffusion Model Translator (CH-aware DMT), to reconstruct synthetic SDO/AIA 193 Å EUV images from He I inputs. The model is trained on temporally co-aligned SOLIS He I and AIA 193 Å pairs spanning 2011-2015 using a month-based split, where January-October are used for training, November for validation, and December for testing. On the held-out test set, the reconstructions preserve dominant full-disk EUV morphology (CC=0.92) and recover CH-related low-intensity structure (CC=0.84). We further assess historical applicability by (1) comparing reconstructed AIA 193 Å morphology with SOHO/EIT 195 Å over 2005-2015; (2) comparing reconstructed AIA 193 Å images generated from KPVT He I inputs against Yohkoh/SXT soft X-ray observations; and (3) evaluating long-term reconstructed disk-integrated emission statistics against observational EUV series and independent solar activity proxies (sunspot number and F10.7 radio flux over 1974-2015). These results indicate that CH-aware DMT conditioned on He I can provide a physically plausible synthetic AIA 193 Å coronal proxy for historical studies, supporting multi-decade analyses of large-scale coronal evolution before direct EUV imaging was available.
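The month-based split described above is easy to state in code. The sketch below assigns each image pair to a partition by the calendar month of its observation date; it also includes a plain Pearson correlation, on the assumption (not confirmed by the abstract) that "CC" denotes a Pearson correlation coefficient between pixel intensities. Function names are illustrative.

```python
import math
from datetime import date

def month_split(obs_date):
    """January-October -> train, November -> val, December -> test, in every year."""
    if obs_date.month <= 10:
        return "train"
    if obs_date.month == 11:
        return "val"
    return "test"

def pearson_cc(a, b):
    """Pearson correlation between two equally long sequences of pixel values."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

print(month_split(date(2013, 11, 5)))
```

Splitting by month keeps near-identical consecutive images out of different partitions within the same month, which a random per-image split would not.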

2606.08728 2026-06-09 cs.AI cs.CL cs.CV cs.LG 交叉投稿

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

人工智能数学推理:语言模型、神经符号系统与验证发现的综合综述

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文综述了数学推理领域从早期规则系统到当代推理模型、多智能体系统及验证发现工作流的演变,沿非正式推理、形式推理、数学发现及推理技术四轴组织,并评估了基准测试、失败模式及未来方向。

Comments Under review, 47 pages, 14 figures, 22 tables

AI中文摘要

数学推理长期以来一直是机器智能的严格测试;在过去十年中,它已从NLP中的一个边缘问题发展为最重要的人工智能前沿之一。本综述对该领域的演变进行了统一阐述,从早期基于规则的数学文字题(MWP)求解器和模板驱动的几何系统,到神经表达式生成和LLM提示,再到当代推理模型、多智能体系统、神经符号定理证明器和验证发现工作流。我们沿四个轴组织该领域:(i) 文本和图表上的非正式推理,涵盖MWP求解、多模态几何和VLM;(ii) 证明助手中的形式推理,包括自动形式化、策略预测、编译器引导修复和证明搜索;(iii) 数学发现,其中系统提出构造、改进界限或协助攻克开放问题;以及(iv) 推理时和训练时技术,包括CoT提示、工具使用、过程奖励模型和RLVR,这些技术日益将生成与验证联系起来。我们编目了涵盖小学算术、竞赛数学、几何、形式证明、多模态和多语言推理以及专家评估的主要基准,并考察了基准饱和、污染、报告不一致以及pass@1、多数投票和验证器辅助pass@k之间的区别。我们批判性地评估了失败模式:扰动下的脆弱性、奖励黑客、多模态基础失败、脆弱的形式化以及大规模推理计算的能源成本。借鉴在职数学家的近期观点,我们确定了未来方向,集中于验证发现工作流、推理效率以及使AI辅助形式化广泛可用的基础设施。配套材料:https://github.com/Starscream-11813/awesome-AI4Math。

英文摘要

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@k. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
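The distinction between pass@1, majority voting, and pass@k is easier to see side by side. The sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct, together with a plain majority vote over final answers. It is an illustration only; the survey's own reporting conventions may differ, and the function names are assumptions.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn without replacement from n is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers):
    """Most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

# With 10 samples of which 3 are correct, pass@1 is 0.3 while pass@5 is much higher.
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
print(majority_vote(["4", "5", "4"]))  # prints 4
```

pass@k credits a problem if any sample is right, whereas majority voting must commit to a single answer, so the two numbers are not interchangeable when results are compared across papers.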

2509.11485 2026-06-09 cond-mat.mtrl-sci cs.CV 版本更新

Geometric Analysis of Magnetic Labyrinthine Stripe Evolution via Deep Learning Segmentation

通过深度学习分割进行磁性迷宫条纹演化的几何分析

Vinícius Yu Okubo, Kotaro Shimizu, B. S. Shivaram, Gia-Wei Chern, Hae Yong Kim

发表机构 * Dept. Electronic Systems Engineering, Polytechnic School, University of São Paulo(圣保罗大学电子系统工程系) Department of Applied Physics, The University of Tokyo(东京大学应用物理系) Department of Physics, University of Virginia(弗吉尼亚大学物理系)

AI总结 研究通过深度学习分割和几何分析,量化磁性条纹的局部结构演化,揭示两种演化模式与磁场极性的关系。

Comments 17 pages, 15 figures. This manuscript will be submitted to the Journal of Magnetism and Magnetic Materials and is not yet under review

AI中文摘要

迷宫状条纹图案在许多物理系统中普遍存在,但由于缺乏长程有序,对其进行定量表征颇具挑战。我们研究了掺铋钇铁石榴石(Bi:YIG)薄膜中此类图案在磁场退火协议下的演化。我们使用包含加性高斯白噪声和Simplex噪声在内的合成退化来训练U-Net深度学习模型,使其即使在存在噪声和遮挡的情况下也能稳健地分割实验磁光图像。基于此分割,我们开发了基于骨架化、图映射和样条拟合的几何分析流程,通过长度和曲率测量量化局部条纹传播。将该框架应用于来自12次退火协议试验的444张图像,我们分析了从“淬火”状态到更平行、更一致的“退火”状态的转变,并识别出两种与磁场极性相关的不同演化模式(类型A和类型B)。我们的结果对磁性条纹图案的几何和拓扑性质进行了定量分析,为其局部结构演化提供了新的见解,并为分析复杂迷宫系统建立了一种通用工具。

英文摘要

Labyrinthine stripe patterns are common in many physical systems, yet their lack of long-range order makes quantitative characterization challenging. We investigate the evolution of such patterns in bismuth-doped yttrium iron garnet (Bi:YIG) films subjected to a magnetic field annealing protocol. A U-Net deep learning model, trained with synthetic degradations including additive white Gaussian and Simplex noise, enables robust segmentation of experimental magneto-optical images despite noise and occlusions. Building on this segmentation, we develop a geometric analysis pipeline based on skeletonization, graph mapping, and spline fitting, which quantifies local stripe propagation through length and curvature measurements. Applying this framework to 444 images from 12 annealing protocol trials, we analyze the transition from the "quenched" state to a more parallel and coherent "annealed" state, and identify two distinct evolution modes (Type A and Type B) linked to field polarity. Our results provide a quantitative analysis of geometric and topological properties in magnetic stripe patterns, offer new insights into their local structural evolution, and establish a general tool for analyzing complex labyrinthine systems.
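The length and curvature measurements mentioned above can be illustrated on a skeleton path given as an ordered list of 2D points. The paper fits splines; the sketch below swaps in a simpler discrete stand-in, the Menger curvature of consecutive point triples (the reciprocal of the radius of the circle through three points). Function names are illustrative, and the choice of estimator is an assumption.

```python
import math

def path_length(points):
    """Total length of a polyline given as a list of (x, y) points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def menger_curvature(p, q, r):
    """Curvature of the circle through p, q, r: 4 * triangle area / product of side lengths."""
    twice_area = abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))
    sides = math.dist(p, q) * math.dist(q, r) * math.dist(r, p)
    if sides == 0.0:
        return 0.0  # repeated points: curvature is undefined, report zero
    return 2.0 * twice_area / sides

def local_curvatures(points):
    """Curvature estimate at every interior vertex of the polyline."""
    return [menger_curvature(a, b, c) for a, b, c in zip(points, points[1:], points[2:])]

# Three points on a circle of radius 5 give curvature 1/5.
print(menger_curvature((5, 0), (0, 5), (-5, 0)))
```

On a pixel-level skeleton the triples would normally be taken several pixels apart, since adjacent pixels give very noisy estimates; a fitted spline, as in the paper, avoids that issue.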

2512.07355 2026-06-09 cs.AI cs.CV cs.LG 版本更新

A Geometric Unification of Concept Learning with Concept Cones

概念学习与概念锥的几何统一

Alexandre Rocchi, Thomas Fel, Gianni Franchi

发表机构 * AMIAD Kempner Institute, Harvard University(哈佛大学肯普纳研究所)

AI总结 通过共享几何框架(概念锥)统一监督式概念瓶颈模型与无监督稀疏自编码器,提出包含关系度量评估概念对齐,并发现稀疏性与扩展因子的最佳平衡点。

Comments 33 pages

AI中文摘要

两种可解释性传统并行发展但很少相互交流:概念瓶颈模型(CBM)规定概念应该是什么,而稀疏自编码器(SAE)发现哪些概念涌现。CBM使用监督将激活与人类标记的概念对齐,而SAE依赖稀疏编码来揭示涌现概念。我们证明两种范式实例化相同的几何结构:每个范式学习激活空间中的一组线性方向,其非负组合形成概念锥。因此,监督和无监督方法的不同不在于种类,而在于如何选择这个锥。基于这一观点,我们提出了两种范式之间的操作桥梁。CBM提供人类定义的参考几何,而SAE可以通过其学习的锥在多大程度上近似或包含CBM的锥来评估。这种包含框架产生了量化指标,将归纳偏差(如SAE类型、稀疏性或扩展比)与合理概念的涌现联系起来。使用这些指标,我们发现了稀疏性和扩展因子的“最佳点”,该点最大化与CBM概念的几何和语义对齐。总体而言,我们的工作通过共享的几何框架统一了监督和无监督的概念发现,提供了原则性指标来衡量SAE进展,并评估发现的概念与合理的人类概念的对齐程度。

英文摘要

Two traditions of interpretability have evolved side by side but have seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases (such as SAE type, sparsity, or expansion ratio) to the emergence of plausible concepts. Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts. Note on terminology: we adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction, being selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold.
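The cone picture above suggests a simple numerical check: a direction lies in a concept cone exactly when it can be written as a nonnegative combination of the cone's generating directions. The sketch below projects each reference (CBM-style) direction onto the cone spanned by a set of learned (SAE-style) directions using projected gradient descent for nonnegative least squares, and reports how much of each direction the cone captures. The score defined here is a hypothetical illustration, not one of the paper's metrics.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_cone(directions, target, steps=500):
    """Closest point to `target` in {sum_i a_i * d_i with a_i >= 0}, by projected gradient descent."""
    coeffs = [0.0] * len(directions)
    step = 1.0 / sum(dot(d, d) for d in directions)  # safe step: 1 / (upper bound on curvature)
    dims = range(len(target))
    for _ in range(steps):
        recon = [sum(c * d[j] for c, d in zip(coeffs, directions)) for j in dims]
        resid = [r - t for r, t in zip(recon, target)]
        # Gradient step on the squared error, then clip to keep coefficients nonnegative.
        coeffs = [max(0.0, c - step * dot(d, resid)) for c, d in zip(coeffs, directions)]
    return [sum(c * d[j] for c, d in zip(coeffs, directions)) for j in dims]

def containment_score(learned_dirs, reference_dirs):
    """Mean of 1 - (distance to cone / norm) over reference directions; 1.0 means fully contained."""
    scores = []
    for t in reference_dirs:
        gap = math.dist(project_onto_cone(learned_dirs, t), t)
        scores.append(1.0 - gap / math.sqrt(dot(t, t)))
    return sum(scores) / len(scores)

# The cone spanned by the two axes is the first quadrant: (1, 1) is inside, (1, -1) is not.
print(containment_score([(1.0, 0.0), (0.0, 1.0)], [(1.0, 1.0), (1.0, -1.0)]))
```

For real activation spaces a dedicated nonnegative least-squares solver would be faster, but the fixed-step loop keeps the example dependency-free.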

2603.13679 2026-06-09 cs.HC cs.CV 版本更新

Toward Scalable Co-located Practical Learning: Assisting with Computer Vision and Multimodal Analytics

迈向可扩展的同地实践学习:借助计算机视觉与多模态分析提供支持

Xinyu Li, Linxuan Zhao, Yueqiao Jin, Yuchen Liu, Jin Zhou, Roberto Martinez-Maldonado, Dragan Gasevic, Lixiang Yan

发表机构 * Centre for Learning Analytics at Monash(莫纳什大学学习分析中心) Monash University(莫纳什大学) Department of Civil and Environmental Engineering(土木与环境工程系) School of Education(教育学院) The University of Hong Kong(香港大学)

AI总结 本研究评估了固定摄像头管道在重复护理模拟中的效果,通过多阶段源到目标适应提升行为检测精度,并利用行为轨迹分析提升模拟 debriefing 的可检索性。

AI中文摘要

同地实践学习会在患者周围的可见动作、任务资源和房间区域中留下证据,但这些痕迹通常只能通过现场观察或事后视频回看来获取。固定广角视频可以减轻传感负担,但复盘流程要做的不只是检测行为:它必须在摄像头位置发生小幅偏移后仍能保持检测效果,将检测器得出的行为轨迹与指导员标注的结果相关联,并保留房间区域上下文。本研究在重复进行的护理模拟中评估了一条固定摄像头流程。使用统一的六代码分类法,我们在同一房间的两个侧视数据源上测试了YOLO26的仅目标域训练和两阶段源域到目标域适应。随后,我们将51个由指导员标注的会话中的检测结果转换为逐秒的行为轨迹和行为-区域轨迹,用于速率分析、有序网络分析、转换网络分析和序列分析。两阶段适应将2021年目标视图的平均mAP50从0.815提升到0.848,将规模较小的2022年目标视图的平均mAP50从0.690提升到0.855;在平衡目标配额N = 22时,2022年模型达到0.850 mAP50。在基于检测器所得行为轨迹的分析中,较高的手机使用是低任务表现会话的特征。区域标签改变了对患者互动的解释:在表现较高的会话中,主要患者护理区的互动更强,而在表现较低的会话中,次要区域的互动更强。有序网络和转换网络模型表明,有序的房间区域关系在行为频率之外提供了额外信息,其中最强的任务表现分类器使用了区域特征和共同在场特征。最终得到的轨迹最适合用于可检索的模拟复盘:指导员查看检测到的时刻,而不是接收自动评估分数。

英文摘要

Co-located practical learning leaves evidence in visible actions around patients, task resources and room zones, but these traces are often recovered through live observation or retrospective video review. Fixed wide-angle video could reduce sensing burden, yet a debriefing pipeline must do more than detect behaviours: it must maintain detection after small camera-position shifts, relate the detector-derived behaviour trace to instructor-labelled outcomes and preserve room-zone context. This study evaluates a fixed-camera pipeline in repeated nursing simulation. Using a harmonised six-code taxonomy, we tested YOLO26 target-only training and two-stage source-to-target adaptation across two same-room side-view data sources. We then converted detections from 51 instructor-labelled sessions into one-second behaviour and behaviour-zone traces for rate, ordered-network, transition-network and sequence analyses. Two-stage adaptation improved mean mAP50 from 0.815 to 0.848 for the 2021 target view and from 0.690 to 0.855 for the smaller 2022 target view; with a balanced target quota of N = 22, the 2022 model reached 0.850 mAP50. In the detector-derived behaviour trace analyses, higher phone use characterised low task-performance sessions. Zone labels changed the interpretation of patient interaction: primary patient-care-zone interaction was stronger in higher-performance sessions, while secondary-zone interaction was stronger in lower-performance sessions. Ordered and transition network models showed that ordered room-zone relations contributed beyond behaviour frequency, with the strongest task-performance classifier using zoned and co-presence features. The resulting trace is most appropriate for searchable simulation debriefing, where instructors inspect detected moments rather than receive automated assessment scores.
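The step from raw detections to one-second behaviour traces and transition counts can be sketched as follows. This is a simplified illustration rather than the study's pipeline: the behaviour labels are placeholders (the six-code taxonomy is not listed in the abstract), and taking the most frequent label per second and skipping empty seconds are assumptions.

```python
from collections import Counter

def per_second_trace(detections, duration_s):
    """Collapse (timestamp_seconds, label) detections into one label per second (None if empty)."""
    buckets = [Counter() for _ in range(duration_s)]
    for timestamp, label in detections:
        second = int(timestamp)
        if 0 <= second < duration_s:
            buckets[second][label] += 1
    return [bucket.most_common(1)[0][0] if bucket else None for bucket in buckets]

def transition_counts(trace):
    """Count changes from one behaviour to a different one, skipping seconds with no detection."""
    counts = Counter()
    previous = None
    for label in trace:
        if label is None:
            continue
        if previous is not None and label != previous:
            counts[(previous, label)] += 1
        previous = label
    return counts

# Placeholder labels and timestamps, invented for illustration.
detections = [(0.2, "patient"), (0.7, "patient"), (1.1, "phone"), (3.5, "patient")]
trace = per_second_trace(detections, duration_s=4)
print(trace)
print(transition_counts(trace))
```

A behaviour-zone trace would follow the same pattern with (behaviour, zone) pairs as labels, and the transition counts are the raw material for a transition-network analysis.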

2605.00358 2026-06-09 cs.CL cs.CV 版本更新

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

从反向分摊到前向回放:重新审视LLM参数编辑中的目标构造

Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee

发表机构 * University of Cambridge(剑桥大学) University of Edinburgh(爱丁堡大学)

AI总结 本文重新审视LLM参数编辑中的目标构造,提出一种更简洁的替代方法,以前向传播取代反向分摊(backward spreading),提高各层目标隐藏状态的准确性和相互兼容性。

Comments ICML 2026, code: https://github.com/jugechengzi/FE

AI中文摘要

LLM参数编辑方法通常先在某一目标层计算理想的目标隐藏状态(称为锚点),再将该目标向量分摊到此前的多个层(通常称为反向分摊,backward spreading)以进行协同编辑。尽管这一做法已被长期广泛使用,其理论基础尚未得到系统研究。本文首先系统研究其基础,以明确其能力边界、实际注意事项和潜在失败模式。然后,我们提出一种简单而优雅的替代方案,用前向传播取代反向分摊:不在最后一个编辑层优化目标,而是在第一个编辑层优化锚点,再将其前向传播,从而为后续所有编辑层获得准确且相互兼容的目标隐藏状态。该方法的计算复杂度与现有方法相同,同时能产生更准确的逐层目标。我们的方法很简单,既不干扰初始目标隐藏状态的计算,也不干扰后续编辑流程的其他组件,因此可使广泛的LLM参数编辑方法受益。

英文摘要

LLM parameter editing methods commonly rely on computing an ideal target hidden state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although this practice has long been widely used, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden states for all subsequent editing layers. This approach has the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple and does not interfere with either the computation of the initial target hidden state or any other component of the subsequent editing pipeline, and it can thus benefit a wide range of LLM parameter editing methods.

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出VESTA框架,通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验,提升视觉语言模型在复杂统计建模任务上的性能。

AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤,但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型(VLM)来迭代地提出和优化统计模型,但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制,我们引入了VESTA:基于统计工具代理的视觉探索,这是一个框架,为VLM配备了一个动态增长的探索工具包,通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同,VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据,这些工具会累积在模型的上下文中,并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线:无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估,我们引入了DAWN(自动工作流和数值建模数据集),这是一个针对分布拟合和时间序列建模的基准,具有不同的难度等级,并最终涉及真实世界的天文学任务,包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线,在复杂和特定领域的任务上取得了最大的收益。我们进一步表明,动态生成的工具比现有视觉工具创建系统生成的工具复杂得多,每个函数覆盖更多的诊断类别,并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2606.06076 2026-06-09 cs.AI cs.CV 版本更新

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

通过模态差距感知自蒸馏从符号状态学习视觉空间规划

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

发表机构 * Tsinghua University(清华大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出MGSD两阶段框架,通过冷启动接地和特权教师蒸馏弥合视觉与符号规划之间的模态差距,在视觉规划基准上显著提升性能。

Comments 17 pages, preprint

详情
AI中文摘要

尽管视觉-语言模型在通用多模态理解方面表现出色,但在视觉空间规划上仍存在困难。我们将其归因于感知-推理模态差距:视觉规划要求模型从像素中推断潜在状态结构,然后对恢复的结构进行推理以产生有效动作,而符号规划直接利用显式对象和约束。这造成了视觉状态恢复和多步规划的双重瓶颈。为解决此问题,我们提出MGSD,一种两阶段模态差距感知自蒸馏框架。首先,冷启动接地阶段为视觉学生模型配备可靠的状态表示,最小化早期感知噪声。其次,特权教师通过在线策略蒸馏转移规划能力,使用显式符号状态监督学生自身的视觉 rollout 前缀。关键在于,符号数据仅在训练期间使用,推理完全基于视觉。在视觉规划基准上的实验表明,MGSD在4B和8B骨干网络上均持续提升视觉规划性能,宏观平均值分别提高19.3%和18.4%。所得模型缩小了与符号输入上限的差距,而消融和诊断实验证实改进来自视觉状态恢复和最优路径推理。这些结果表明,模态差距感知自蒸馏不仅改善了模型感知可行动状态的方式,也改善了它们在推断结构上进行规划的能力。代码见 https://github.com/Oranger-l/MGSD。

英文摘要

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

2604.24583 2026-06-09 cs.CV 版本更新

Improving Vision-language Models with Perception-centric Process Reward Models

通过以感知为中心的过程奖励模型改进视觉-语言模型

Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Bytedance(字节跳动) University of California, San Diego(加州大学圣地亚哥分校) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出Perceval模型,通过token级错误定位提升视觉-语言模型的推理能力,通过感知驱动的监督策略实现细粒度训练与推理优化,实验显示在多个领域基准上显著提升性能。

Comments 8 pages

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33099-33109
AI中文摘要

近期带可验证奖励的强化学习(RLVR)的进展显著提升了视觉-语言模型(VLMs)的复杂推理能力。然而,其结果级监督过于粗略,无法诊断和纠正推理链中的错误。为此,我们提出了Perceval,一种过程奖励模型(PRM),能够实现token级错误定位:它从响应中提取与图像相关的声明,并逐一与图像中的视觉证据进行比较,最终返回包含感知错误的声明。Perceval使用感知密集的监督训练数据进行训练,随后我们将其整合到RL训练过程中以训练策略模型。具体而言,与采用序列级优势的传统GRPO相比,我们通过对Perceval识别出的幻觉片段施加针对性惩罚来应用token级优势,从而实现细粒度监督信号。除了增强训练过程外,Perceval还能在推理阶段协助VLMs。使用Perceval,可以截断模型响应中的错误部分,然后让模型直接重新生成响应或诱导模型反思其先前输出。此过程可以多次重复以实现测试时扩展。实验显示,在多个领域的基准上,多个经RL训练的推理VLMs均获得显著提升,突显了以感知为中心的监督作为通用策略的潜力。对于测试时扩展,它相比其他策略(如多数投票)也展示了一致的性能提升。我们的代码和数据将在https://github.com/RUCAIBox/Perceval上公开发布。

英文摘要

Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from the response and compares them one by one with the visual evidence in the image, ultimately returning the claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
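
The contrast between sequence-level and token-level advantages described above can be illustrated with a minimal sketch. It assumes a GRPO-style group-normalized reward and a fixed per-token penalty; the function names, the `(start, end)` span format, and the `penalty` value are hypothetical and are not taken from the Perceval paper or code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Sequence-level advantage: z-score of each response's reward within its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def token_level_advantages(seq_advantage, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast one sequence-level advantage to every token, then subtract a
    penalty on tokens inside spans flagged as perceptual errors.

    flagged_spans: list of (start, end) token indices, end exclusive
    (a hypothetical format chosen for this sketch).
    """
    adv = np.full(num_tokens, seq_advantage, dtype=float)
    for start, end in flagged_spans:
        adv[start:end] -= penalty
    return adv

# Four sampled responses to one prompt, with binary outcome rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
seq_adv = group_relative_advantages(rewards)

# The first response is correct overall, but tokens 2..3 were flagged as hallucinated.
tok_adv = token_level_advantages(seq_adv[0], num_tokens=6,
                                 flagged_spans=[(2, 4)], penalty=0.5)
print(tok_adv)
```

In this sketch the flagged tokens receive a lower advantage than the rest of the same response, so the policy gradient discourages the hallucinated span without discarding credit for the correct final answer.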

2604.17488 2026-06-09 cs.CV 版本更新

AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

AutoVQA-G:用于自动视觉问答与接地标注的自我改进代理框架

Rongsheng Hu, Runwei Guan, Yicheng Di, Jiayu Bao, Yuan Liu

发表机构 * School of Artificial Intelligence(人工智能学院)

AI总结 本文提出AutoVQA-G框架,通过迭代优化流程提升视觉问答接地标注的准确性,优于现有多模态LLM,为构建高质量数据促进更稳健的视觉语言模型训练提供新方法。

Comments Accepted at IEEE ICASSP 2026. 5 pages, 5 figures. Code available at https://github.com/rohnson1999/AutoVQA-G

详情
Journal ref
Proc. 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12312-12316, 2026
AI中文摘要

手动标注高质量的视觉问答与接地(VQA-G)数据集对于推动视觉语言模型(VLMs)的发展至关重要,但难以扩展。现有自动化方法常受限于两个关键问题:(1)由于模型幻觉导致的数据一致性差;(2)基于简单启发法的脆弱验证机制。为解决这些限制,我们引入了AutoVQA-G,一种自我改进的代理框架,用于自动化VQA-G标注。AutoVQA-G采用迭代细化循环,其中一致性评估模块使用链式推理(CoT)进行细粒度视觉验证。基于此反馈,一个记忆增强的提示优化代理分析失败样本的批评,逐步优化生成提示。我们的实验表明,AutoVQA-G生成的VQA-G数据集在视觉接地准确性上优于领先的多模态LLM,为创建高质量数据以促进更稳健的VLM训练和评估提供有前景的方法。代码:https://github.com/rohnson1999/AutoVQA-G

英文摘要

Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G

2507.18967 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

利用深度学习进行水下垃圾检测:YOLOv7到YOLOv10与Faster R-CNN的性能比较

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology(斯里兰卡信息技术学院计算学院) Faculty of Information Technology and Communication Sciences, Tampere University(坦佩雷大学信息技术与通信科学学院) Computing Centre, Faculty of Engineering, University of Peradeniya(佩拉德尼亚大学工程学院计算中心)

AI总结 本文在涵盖低能见度和不同深度等条件的数据集上比较了YOLOv7到YOLOv10及Faster R-CNN的水下垃圾检测性能,发现YOLOv8表现最佳,mAP达80.9%。

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

详情
Journal ref
Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)
AI中文摘要

水下污染是当今最严重的环境问题之一,全球海洋、河流和景观中发现大量垃圾。准确检测这些垃圾对废物管理、环境监测和缓解策略至关重要。本文研究了五种先进的物体识别算法,包括YOLO模型(YOLOv7、YOLOv8、YOLOv9、YOLOv10)和Faster R-CNN,以确定哪种模型在水下环境中识别垃圾最有效。这些模型在一个包含十五个类别、涵盖低能见度和不同深度等多种条件的大型数据集上进行了充分的训练和测试。结果显示,YOLOv8表现最佳,mAP为80.9%。作者将这一性能优势归因于YOLOv8的架构,其包含改进的无锚机制和自监督学习,从而在各种环境中实现更精确和高效的识别。这些发现突显了YOLOv8模型在全球抗污染斗争中的潜力,可提高水下清理作业的检测能力和可扩展性。

英文摘要

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely the YOLO (You Only Look Once) models YOLOv7, YOLOv8, YOLOv9, and YOLOv10, and the Faster Region-based Convolutional Neural Network (Faster R-CNN), to identify which model was most effective at recognizing waste materials in underwater environments. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. Among these models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%. This increased performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.

2603.24942 2026-06-09 cs.CV 版本更新

BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

BiFM:双向流匹配用于少步图像编辑与生成

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

发表机构 * Australian National University(澳大利亚国立大学) Data61-CSIRO

AI总结 BiFM通过双向流匹配框架在单一模型中联合学习生成与反演过程,解决少步采样中正向过程近似差的问题,提升图像编辑质量与通用性。

Comments Accepted in CVPR2026

详情
AI中文摘要

最近的扩散和流匹配模型通过迭代采样逐步去除噪声,在图像生成和编辑方面展现出强大能力。虽然这使得面向语义保持编辑的灵活反演成为可能,但少步采样对正向过程的近似较差,导致编辑质量下降。现有少步反演方法通常依赖预训练生成器和辅助模块,限制了在不同架构间的可扩展性和泛化能力。为了解决这些问题,我们提出了BiFM(双向流匹配),一个统一的框架,能够在单一模型中联合学习生成和反演。BiFM直接估计“图像→噪声”和“噪声→图像”两个方向的平均速度场,并受共享的瞬时速度场约束,该瞬时速度场由预定义的调度或预训练的多步扩散模型导出。此外,BiFM引入了一种新的训练策略,采用连续时间间隔监督,并通过双向一致性目标和轻量级时间间隔嵌入加以稳定。这种双向公式还支持一步反演,并可无缝集成到流行的扩散和流匹配骨干中。在多样化的图像编辑和生成任务中,BiFM一致优于现有的少步方法,实现了更优越的性能和可编辑性。

英文摘要

Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.
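
For readers unfamiliar with the term, an average velocity over a time interval is commonly defined as the time average of the instantaneous velocity along the trajectory. The notation below ($z_\tau$, $v$, $u$, $r$, $t$) is assumed for illustration and is not taken from the paper; BiFM's exact parameterization and time convention may differ.

```latex
% Assumed notation: z_\tau is the state at time \tau, v the instantaneous velocity,
% u the average velocity over the interval [r, t].
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau, \qquad r < t

% A single step across the whole interval, in either direction:
z_r = z_t - (t - r)\, u(z_t, r, t), \qquad z_t = z_r + (t - r)\, u(z_t, r, t)
```

Under this definition, an accurate estimate of $u$ lets a sampler cross the interval in one step, which is why average-velocity models suit few-step generation and, when learned in both directions, few-step inversion.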

2501.15505 2026-06-09 cs.RO cs.CV cs.HC 版本更新

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

揭示iMarkers的潜力:用于高级机器人的隐形标志物

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠性与信任跨学科中心(SnT),卢森堡大学) Faculty of Science, Technology, and Medicine, University of Luxembourg(科学、技术与医学学院,卢森堡大学) Department of Physics & Materials Science, University of Luxembourg(物理与材料科学系,卢森堡大学) Institute for Advanced Studies, University of Luxembourg(先进研究学院,卢森堡大学)

AI总结 本文提出iMarkers,一种隐形标志物,可被机器人和AR设备检测,解决了传统标志物影响视觉美观的问题,展示了其在机器人应用中的灵活性和有效性。

Comments 19 pages, 10 figures, 4 tables

详情
AI中文摘要

标志物在机器人导航、物体识别和场景理解中被广泛应用。尽管为机器人和增强现实(AR)应用提供了显著优势,但它们通常会破坏环境的视觉美观,因为它们对人类可见,因此不适合许多日常使用场景。为了解决这一差距,本文提出了iMarkers,即创新的、不显眼的标志物,仅能被机器人和配备适当传感器和检测算法的AR设备检测。这些标志物在生产中具有高度灵活性,允许根据各种需求定制其可见范围和编码算法。本文还介绍了用于检测iMarkers的硬件设计和开源软件算法,突显了其在检测和识别阶段的适应性和鲁棒性。大量评估已证明iMarkers相对于传统(印刷)和混合标志物的有效性,并确认了其在多样化机器人场景中的适用性。

英文摘要

Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.

2511.10500 2026-06-09 cs.CV 版本更新

Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

可学习总变分与Lambda映射用于低剂量CT去噪

Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

发表机构 * University of Michigan(密歇根大学)

AI总结 本文提出可学习总变分框架,通过结合展开的总变分求解器与LambdaNet预测像素级正则化图,实现空间自适应平滑,实验显示在低剂量CT去噪中优于传统TV和FBP+U-Net。

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)
AI中文摘要

尽管总变分(TV)在噪声抑制和边缘保持方面表现出色,但其对标量正则化参数的依赖限制了适应性。在本研究中,我们提出了一种可学习总变分(LTV)框架,将展开的TV求解器与预测像素级正则化图的LambdaNet相结合。所提出的框架端到端训练以优化重建和正则化,实现空间自适应平滑。在DeepLesion数据集上使用现实LoDoPaB-CT模拟实验表明,LTV在低剂量CT去噪中优于传统TV和FBP+U-Net,实现了最高+3.7 dB PSNR和8%的相对SSIM改进。LTV为低剂量CT去噪提供了可解释的替代方案,而非黑箱CNN。

英文摘要

While Total Variation (TV) excels in noise reduction and edge preservation, its reliance on a scalar regularization parameter limits adaptivity. In this study, we present a Learnable Total Variation (LTV) framework coupling an unrolled TV solver with a LambdaNet that predicts a per-pixel regularization map. The proposed framework is trained end-to-end to optimize reconstruction and regularization jointly, yielding spatially adaptive smoothing. Experiments on the DeepLesion dataset, using realistic LoDoPaB-CT simulation, show consistent gains over classical TV and FBP+U-Net, achieving up to +3.7 dB PSNR and 8% relative SSIM improvement. LTV provides an interpretable alternative to black-box CNNs for low-dose CT denoising.
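
The difference between a scalar regularization parameter and a per-pixel regularization map can be sketched with a spatially weighted TV energy. This is a minimal illustration assuming an anisotropic forward-difference discretization; the function names are hypothetical, and the paper's unrolled solver and LambdaNet are not reproduced here.

```python
import numpy as np

def weighted_tv(image, lam):
    """Anisotropic total variation with per-pixel weights.

    image, lam: 2D arrays of the same shape. A constant lam recovers
    classical TV with a single scalar regularization parameter.
    """
    dx = np.abs(np.diff(image, axis=1))  # horizontal differences, shape (H, W-1)
    dy = np.abs(np.diff(image, axis=0))  # vertical differences, shape (H-1, W)
    return float((lam[:, :-1] * dx).sum() + (lam[:-1, :] * dy).sum())

def denoising_objective(x, noisy, lam):
    """Data fidelity plus spatially weighted TV (the energy a TV solver would minimize)."""
    return 0.5 * float(((x - noisy) ** 2).sum()) + weighted_tv(x, lam)

# A 2x2 image with one vertical edge.
img = np.array([[0.0, 1.0],
                [0.0, 1.0]])

uniform = np.ones_like(img)          # scalar lambda = 1 everywhere
adaptive = np.array([[2.0, 1.0],     # stronger smoothing in the top row,
                     [0.5, 1.0]])    # weaker smoothing in the bottom row

print(weighted_tv(img, uniform))     # classical TV
print(weighted_tv(img, adaptive))    # spatially adaptive TV
```

A learned map plays the role of `adaptive`: small weights near edges preserve detail, while large weights in flat regions suppress noise more strongly.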

2511.14639 2026-06-09 cs.CV 版本更新

SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology

SLAM-AGS:计算细胞学中基于自适应梯度手术的玻片标签感知多任务预训练

Marco Acerbis, Swarnadip Chatterjee, Christophe Avenel, Joakim Lindblad

发表机构 * University of Cambridge(剑桥大学)

AI总结 SLAM-AGS通过联合优化弱监督相似性和自监督对比性目标,提升下游任务性能,并利用自适应梯度手术解决任务梯度冲突,实现在低见证率下的稳定预训练和更优表现。

Comments 5 pages, 2 figures, Submitted to ISBI2026

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)
AI中文摘要

计算细胞学面临两个主要挑战:i) 实例级标签不可靠且获取成本高昂,ii) 见证率极低。我们提出SLAM-AGS,一种玻片标签感知(Slide-Label-Aware)的多任务预训练框架,联合优化 (i) 玻片阴性图块上的弱监督相似性目标和 (ii) 玻片阳性图块上的自监督对比目标,从而在下游任务中获得更强的性能。为稳定学习,我们应用自适应梯度手术来处理相互冲突的任务梯度并防止模型崩溃。我们将预训练的编码器整合到基于注意力的多实例学习聚合器中,用于袋级预测以及通过注意力引导检索袋中最异常的实例。在一个公开可用的骨髓细胞学数据集上,模拟见证率从10%降至0.5%,SLAM-AGS在袋级F1分数和Top 400阳性细胞检索上均优于其他预训练方法,且在低见证率下增益最大,表明解决梯度干扰能够实现稳定的预训练和更好的下游任务性能。为促进可重复性,我们将完整的实现和评估框架开源:https://github.com/Ace95/SLAM-AGS。

英文摘要

Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.
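
The idea of resolving conflicting task gradients can be sketched with a PCGrad-style projection, used here as a stand-in: when two task gradients point in opposing directions, the conflicting component is removed before they are summed. The adaptive rule used by SLAM-AGS is not described in the abstract, so the function names and the exact projection below are assumptions for illustration only.

```python
import numpy as np

def project_conflicting(g_a, g_b):
    """If g_a conflicts with g_b (negative dot product), remove g_a's component along g_b."""
    dot = float(np.dot(g_a, g_b))
    if dot < 0.0:
        g_a = g_a - dot / (float(np.dot(g_b, g_b)) + 1e-12) * g_b
    return g_a

def combine_task_gradients(g_weak, g_ssl):
    """Sum the two task gradients after projecting away mutual conflicts."""
    return project_conflicting(g_weak, g_ssl) + project_conflicting(g_ssl, g_weak)

# Flattened gradients of the two pretraining objectives (toy values).
g_weak = np.array([1.0, 0.0])    # weakly supervised similarity objective
g_ssl = np.array([-1.0, 1.0])    # self-supervised contrastive objective

print(project_conflicting(g_weak, g_ssl))     # no longer opposes g_ssl
print(combine_task_gradients(g_weak, g_ssl))  # update direction used by the optimizer
```

After projection, the combined update no longer decreases one objective at the direct expense of the other, which is the intuition behind using gradient surgery to stabilize multi-task pretraining.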

2510.13381 2026-06-09 cs.CV cs.GR 版本更新

Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

利用2D先验和SDF引导进行动态城市场景渲染

Siddharth Tourani, Jayaram Reddy, Akash Kumbar, Satyajit Tourani, Nishant Goyal, Madhava Krishna, N. Dinesh Reddy, Muhammad Haris Khan

发表机构 * IIIT Hyderabad(海得拉巴国际信息技术学院) MBZUAI(穆罕默德·本·扎耶德人工智能大学) University of Heidelberg(海德堡大学) VLM Run IIT Kharagpur(印度理工学院克勒格布尔分校)

AI总结 本文提出结合2D对象无关先验与SDF表示的方法,用于动态城市场景渲染,无需LiDAR数据,提升几何精度和变形建模能力。

Comments Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/

详情
Journal ref
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
AI中文摘要

动态场景渲染与重建在计算机视觉和增强现实领域至关重要。基于3D高斯泼溅(3DGS)的方法已能准确建模动态城市场景,但需要相机和LiDAR数据、真实3D分割及运动数据。本文探讨结合2D深度和点跟踪先验与SDF表示是否能降低这些要求。我们提出一种将SDF与3DGS结合的新方法,通过融合两者优势,提升物体表示的鲁棒性。统一优化框架增强了3DGS的几何精度,并改进了SDF内的变形建模,实现更适应和精确的表示。实验表明,即使没有LiDAR数据,该方法在渲染指标上也达到最先进的性能。当结合LiDAR时,方法在不同物体类别上的重建和生成新视角方面进一步提升,无需真实3D运动标注。此外,该方法支持多种场景编辑任务,包括场景分解和场景合成。

英文摘要

Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of object-agnostic 2D priors, in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates SDFs with 3DGS to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics on urban scenes even without LiDAR data. When incorporating LiDAR, our approach improves further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and scene composition.

2501.03957 2026-06-09 cs.HC cs.CV 版本更新

Vision Language Models as Values Detectors

视觉语言模型作为价值检测器

Giulio Antonio Abbo, Tony Belpaeme

发表机构 * IDLab-AIRO, Ghent University – imec, Belgium(IDLab-AIRO、根特大学 – imec、比利时)

AI总结 本文研究了先进LLM与人类标注者在家庭环境场景中检测相关元素的对齐情况,发现LLaVA 34B表现最佳但仍需改进,表明LLM在检测图像中价值元素方面有潜力。

Comments 13 pages, 2 figures

详情
Journal ref
Value Engineering in Artificial Intelligence (VALE 2024) (LNAI,volume 15356)
AI中文摘要

大型语言模型整合文本和视觉输入,为解释复杂数据提供了新可能。尽管其能生成连贯且上下文相关的文本,但其与人类感知在识别图像中相关元素的对齐仍需探索。本文研究了最先进的LLM与人类标注者在家庭环境场景中检测相关元素的对齐情况。我们创建了十二张描绘不同家庭场景的图像,并邀请十四名标注者识别每张图像中的关键元素。然后将这些人类响应与五个不同LLM的输出进行比较,包括GPT-4o和四个LLaVA变体。我们的发现显示对齐程度各异,LLaVA 34B表现最佳但得分仍低。然而,结果分析表明这些模型在检测图像中价值元素方面有潜力,表明通过改进训练和优化提示,LLM可增强社交机器人、辅助技术和人机交互的应用,提供更深入的见解和更相关的响应。

英文摘要

Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.

2411.18385 2026-06-09 cs.LG cs.CV stat.ML 版本更新

Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

基于高效二阶优化的联邦学习中的不确定性与个性化

Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai

发表机构 * Department of Computer Science and Engineering, IIT Kanpur, India(印度理工学院坎普尔分校计算机科学与工程系)

AI总结 本文提出一种高效的联邦学习方法,利用二阶优化减少计算和通信成本,同时保留贝叶斯方法的不确定性与个性化优势。

详情
Journal ref
Transactions on Machine Learning Research (TMLR), 2025
AI中文摘要

联邦学习(FL)已发展为一种有前景的方法,用于在不同客户端上协作学习分布式和异质数据,而无需数据离开客户端。最近的FL研究倡导采用贝叶斯方法,因为它提供了一种系统的方法来考虑模型和预测不确定性,通过学习客户端和/或服务器模型的后验分布。此外,贝叶斯FL自然能够实现个性化,以处理不同客户端上的数据异质性,通过让每个客户端学习其独特的个性化模型。特别是,层次贝叶斯方法使所有客户端都能学习其个性化模型,同时通过服务器提供的先验分布考虑共同点。然而,尽管有这些优势,贝叶斯方法在FL中可能计算成本高且通信成本高,因为需要计算和发送后验分布。我们提出了一种新的贝叶斯FL方法,采用高效的二阶优化方法,其计算成本与Adam等一阶优化方法相似,同时提供贝叶斯方法的多种优势(例如不确定性、个性化),并且在标准和个性化FL设置中都比最先进的贝叶斯FL方法更高效和准确。我们的方法在预测准确性和不确定性估计方面优于基线方法,包括基于优化和贝叶斯FL的方法。

英文摘要

Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

2101.01060 2026-06-09 cs.CV cs.AI cs.MM 版本更新

Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming

视频直播中通过无关人脸跟踪与像素化实现个人隐私保护

Jizhe Zhou, Chi-Man Pun

发表机构 * IEEE

AI总结 本文提出FPVLS方法,通过帧到视频的双阶段结构实现视频直播中的自动隐私过滤,解决目标漂移、计算效率和过度像素化问题。

详情
Journal ref
IEEE Transactions on Information Forensics and Security, 16, 1088-1103 (2020)
AI中文摘要

截至目前,旨在保护隐私的像素化任务仍然劳动密集且尚未被深入研究。随着视频直播的普及,建立在线直播中的面部像素化机制已成为紧迫需求。本文开发了一种名为视频直播中的面部像素化(FPVLS)的新方法,以在非约束直播活动中自动进行个人隐私过滤。简单地应用多面部跟踪器会遇到目标漂移、计算效率和过度像素化的问题。因此,为了快速准确地对无关人员的面部进行像素化,FPVLS采用帧到视频的双阶段结构。在单帧上,FPVLS利用基于图像的面部检测和嵌入网络生成面部向量。在原始轨迹生成阶段,所提出的定位增量近邻传播(PIAP)聚类算法利用面部向量和位置信息,快速关联跨帧的同一人的面部。这样逐帧累积的原始轨迹在视频级别上可能是间断且不可靠的。因此,我们进一步引入轨迹细化阶段,该阶段将提案网络与基于经验似然比(ELR)统计量的两样本检验相结合,以细化原始轨迹。最后在细化后的轨迹上应用高斯滤波器实现像素化。在我们收集的视频直播数据集上,FPVLS获得了令人满意的准确性和实时效率,并抑制了过度像素化问题。

英文摘要

To date, pixelation for privacy protection remains labor-intensive and understudied. With the prevalence of video live streaming, an online face pixelation mechanism that operates during streaming is urgently needed. In this paper, we develop a new method called Face Pixelation in Video Live Streaming (FPVLS) to provide automatic personal privacy filtering during unconstrained streaming activities. Simply applying multi-face trackers runs into problems of target drifting, computing efficiency, and over-pixelation. Therefore, for fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure with two core stages. On individual frames, FPVLS utilizes image-based face detection and embedding networks to yield face vectors. In the raw trajectory generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm leverages face vectors and positional information to quickly associate the same person's faces across frames. Such frame-wise accumulated raw trajectories are likely to be intermittent and unreliable at the video level. Hence, we further introduce a trajectory refinement stage that merges a proposal network with a two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is applied along the refined trajectories for final pixelation. On the video live streaming dataset we collected, FPVLS achieves satisfactory accuracy and real-time efficiency, and curbs the over-pixelation problem.

1909.02747 2026-06-09 eess.IV cs.CV cs.LG stat.ML 版本更新

Eelgrass beds and oyster farming at a lagoon before and after the Great East Japan Earthquake 2011: potential to apply deep learning at a coastal area

2011年东日本大地震前后某潟湖的鳗草床与牡蛎养殖:在沿海地区应用深度学习的潜力

Takehisa Yamakita

发表机构 * Marine Biodiversity and Environmental Assessment Research Center (BioEnv)(海洋生物多样性与环境评估研究中心)

AI总结 本文通过比较手动勾勒、简单图像分割和深度学习图像变换,研究了日本宫城县万石浦海草床、沙地和牡蛎养殖筏的自动土地覆盖分类,展示了深度学习在地震后沿海地区空间模式提取中的潜力。

详情
AI中文摘要

针对沿海地区的自动土地覆盖分类,目前案例研究较少。本文通过比较手动勾勒、简单图像分割和基于深度学习的图像变换,测试了对日本宫城县万石浦海草床、沙地和牡蛎养殖筏的提取,并利用提取结果分析地震和海啸前后的变化。图像变换方法的输出分辨率最佳,在独立测试数据上用随机点评估时,植被分类准确率超过69%。牡蛎养殖筏的分布由分割模型检测得到。基于手动勾勒和图像变换结果的地震前后变化评估显示,沙地面积增加而植被减少;牡蛎养殖的减少仅由分割模型检测到。这些结果表明了在地震和海啸后提取这些要素空间模式的潜力。

英文摘要

There are few case studies of automatic land cover classification in coastal areas. Here, I test the extraction of seagrass beds, sandy areas, and oyster farming rafts at Mangoku-ura Lagoon, Miyagi, Japan, by comparing manual tracing, simple image segmentation, and image transformation using deep learning. The results were used to extract the changes before and after the earthquake and tsunami. The output resolution was best for the image transformation method, which showed more than 69% accuracy for vegetation classification in an assessment using random points on independent test data. The distribution of oyster farming rafts was detected by the segmentation model. Assessment of the change before and after the earthquake using the manual tracing and image transformation results revealed an increase in sand area and a decrease in vegetation. The decrease in oyster farming was detected only by the segmentation model. These results demonstrate the potential to extract the spatial pattern of these elements after an earthquake and tsunami. Index Terms: Great East Japan Earthquake of 2011, Land use land cover (LULC), Zosteraceae seagrass, cultured oyster, deep learning, Mangoku Bay