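arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.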

2606.07585 2026-06-09 cs.CV cs.AI New submission

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

Anderson Augusma

Affiliations: Université Grenoble Alpes; Univ. of Glasgow; Inria; Univ. Paris-Saclay; TU Delft

AI summary: This thesis proposes two multimodal frameworks (cross-attention fusion with frame attention pooling, and a variational encoder multi-decoder) that use collective audio-video signals for group emotion recognition, avoiding individual features and achieving robust performance while preserving privacy.

Comments: Doctoral thesis

Abstract

This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognition methods that rely on individual-level cues such as face, gaze, or voice analysis, this work uses collective audio-video signals to infer emotions at the group level, reducing risks of individual monitoring and surveillance. Two complementary frameworks are proposed. The first is a cross-attention multimodal architecture for audio-video fusion, combined with Frames Attention Pooling (FAP) for temporal aggregation. It is supported by synthetic data augmentation and validated through ablation studies, demonstrating robustness in real-world GER conditions. The second framework, Variational Encoder Multi-Decoder (VE-MD), learns a shared latent space for emotion classification and structural representation prediction, including body and face cues. Two decoding strategies, DETR-based and heatmap-based, are explored to analyze the role of structural representations in group and individual settings. The thesis makes three main contributions: it clarifies the role of multimodality and structural cues in group-level affective computing; introduces two architectures for privacy-preserving multimodal GER; and shows that competitive performance can be achieved without using individual features as input data.
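
The temporal aggregation step mentioned above can be illustrated with a minimal sketch of attention pooling over frames. The abstract does not specify how FAP is built, so the scoring vector `w`, the function name `frames_attention_pool`, and the softmax-weighted sum are assumptions chosen for illustration, not the thesis's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frames_attention_pool(frames, w):
    """Pool T frame embeddings (each of length D) into one clip embedding.

    frames: list of T lists of length D (hypothetical per-frame features).
    w: list of length D, a hypothetical learned scoring vector.
    Returns the pooled embedding and the per-frame attention weights.
    """
    # One scalar relevance score per frame (dot product with w).
    scores = [sum(f_d * w_d for f_d, w_d in zip(frame, w)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of frame embeddings, one dimension at a time.
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# A zero scoring vector gives every frame the same weight (plain averaging).
pooled, weights = frames_attention_pool(frames, w=[0.0, 0.0])
```

With a non-zero `w`, frames whose features align with `w` get larger weights, which is the property that distinguishes attention pooling from mean pooling.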

2606.07595 2026-06-09 cs.CV cs.AI cs.IR New submission

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

Affiliations: Nanyang Technological University

AI summary: Introduces the VisualLeakBench benchmark to evaluate action-boundary propagation failures, in which vision-language agents copy sensitive text from images such as screenshots and documents into tool arguments; at baseline, PII is propagated in 78.8% of cases and rendered unsafe text in 85.5%.

Abstract

Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.
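
A propagation rate of the kind reported above can be sketched as the share of cases whose target string shows up in any tool argument. The abstract does not state the benchmark's matching rule, so the exact-substring test, the function name `propagation_rate`, the dictionary keys, and the sample data below are all hypothetical.

```python
def propagation_rate(cases):
    """Fraction of cases whose target string appears in any tool argument.

    cases: list of dicts with hypothetical keys
      'target'    -> the sensitive string rendered in the image
      'tool_args' -> list of argument strings the agent passed to tools
    """
    if not cases:
        return 0.0
    hits = sum(
        any(case["target"] in arg for arg in case["tool_args"])
        for case in cases
    )
    return hits / len(cases)

# Made-up cases for illustration; none come from the benchmark.
cases = [
    {"target": "555-0100", "tool_args": ["Call back at 555-0100"]},
    {"target": "555-0101", "tool_args": ["Customer asked for a callback"]},
    {"target": "555-0102", "tool_args": []},  # the agent made no tool call
    {"target": "555-0103", "tool_args": ["note", "phone: 555-0103"]},
]
rate = propagation_rate(cases)
```

A case with no tool call counts as non-propagating here, which mirrors the abstract's caveat that a defensive prompt can lower the rate by suppressing tool use rather than by preserving utility.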

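2606.07613 2026-06-09 cs.CV cs.AI New submission

Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

Affiliations: Faculty of Law, McGill University

AI summary: Studies how well humans and frontier multimodal large language models distinguish authentic photographs from AI-generated images in civil-dispute scenarios, finds that neither is reliable, and proposes a response combining human review, MLLM screening, and provenance authentication.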

Abstract

Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes. We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator. We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.
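
The two MLLM figures above measure different things: specificity is computed over authentic images only, and the detection rate over synthetic images only. A short sketch makes the distinction concrete. The counts below are illustrative placeholders, not figures from the paper, and the function names are hypothetical.

```python
def specificity(true_neg, false_pos):
    """Share of authentic images correctly labeled authentic."""
    return true_neg / (true_neg + false_pos)

def detection_rate(true_pos, false_neg):
    """Share of synthetic images correctly flagged (sensitivity)."""
    return true_pos / (true_pos + false_neg)

# Illustrative counts only: a detector that never flags an authentic image
# but catches few synthetic ones.
spec = specificity(true_neg=200, false_pos=0)
det = detection_rate(true_pos=12, false_neg=188)
balanced = (spec + det) / 2
```

A detector that almost always answers "authentic" reaches perfect specificity while its balanced accuracy stays near one half, which is why an "authentic" verdict from such a detector carries little evidential weight.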

2606.07639 2026-06-09 cs.CV cs.AI New submission

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes MOSS-Video-Preview, a two-channel cross-attention architecture that enables real-time video understanding through non-blocking perception and generation, achieving about a 5x speedup in time to first token and 2.7x higher decoding throughput on a single H200.

Abstract

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07641 2026-06-09 cs.CV New submission

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Lexin Wang, Shenghua Liu, Yiwei Wang, Jiafeng Guo, Xueqi Cheng

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced

AI summary: Studies whether vision-language models can predict the content of a 180° rotated image from the original alone, introduces the RotOutBench benchmark, and finds that models can recognize rotated content when it is shown but cannot predict it.

Abstract

Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations. A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string. Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
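
The transformation the models are asked to anticipate is simple to state: a 180° in-plane rotation reverses the order of the rows and the order of the entries within each row. The sketch below applies it to a small character grid; the grid stands in for pixel positions and is an illustrative assumption, not the benchmark's data format.

```python
def rotate_180(grid):
    """Rotate a 2D grid (list of rows) by 180 degrees in-plane."""
    # Reverse the row order, then reverse each row.
    return [row[::-1] for row in grid[::-1]]

grid = [list("ab"), list("cd")]
rotated = rotate_180(grid)  # [['d', 'c'], ['b', 'a']]
```

The grid only tracks where each cell ends up. In a real text image every glyph is also turned upside down, so the rotated content may read as a different string or as no legible string at all.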

2606.07642 2026-06-09 cs.CV cs.CY New submission

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

Dongdong Wang, Alina Hagen, Isabelle Gatmaitan, Hao Zhou, Yiwen Dong, Shabboo Valipoor, Vivian W. H. Wong, Lingyao Li

Affiliations: University of Florida; University of South Florida; University of Illinois Urbana-Champaign

AI summary: Proposes an expert-guided retrieval-augmented framework that uses vision-language models to identify wheelchair accessibility barriers in Google Street View imagery; validation against GPS-derived wheelchair dwell behavior shows that VLM ratings partially align with mobility friction, while fine-grained barrier recognition remains limited.

Abstract

Assessing built-environment interaction, such as wheelchair accessibility, is difficult because real-world mobility is shaped by distributed, context-dependent, and temporary barriers that are hard to capture at scale. To support scalable assessment, this paper examines whether vision-language models (VLMs) can identify accessibility barriers from Google Street View (GSV) imagery. We propose an expert-guided retrieval-augmented framework that combines GSV images, ADA-informed guidance, and expert-derived rubrics to evaluate accessibility dimensions. We collect a campus-scale dataset at the University of Florida, linking 407 unique GSV locations with GPS-derived wheelchair dwell behavior as a mobility-friction signal. Results show that VLM ratings are both negatively correlated and distributionally similar with dwell time, indicating partial but consistent alignment with a behavioral proxy for mobility friction. Visual cue analysis shows that certain environmental objects, such as curb ramps and crosswalks, are associated with higher VLM accessibility scores, while alignment remains limited for subtle surface conditions, transient obstructions, and viewpoint-dependent barriers. Overall, our findings show the potential of expert-guided VLMs for scalable accessibility assessment aligning with sensor-derived indicators of real-world wheelchair navigation.

2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS New submission

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Affiliations: Nanyang Technological University

AI summary: Introduces the AVI-Bench benchmark, which evaluates the audio-visual intelligence of Omni-MLLMs through cross-modal tasks across three stages (perception, understanding, and reasoning), adds AVI-Bench-PriSe to test primitive audio-visual sensation, reveals limitations of current models, and builds a four-level AVI taxonomy.

Comments: 31 pages, 8 figures, ICML 2026

Abstract

Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/

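2606.07647 2026-06-09 cs.CV cs.CL cs.LG New submission

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes Token-Level Visual-Sensitivity Steering (TLVS), which extracts token-level steering vectors and adaptively adjusts the steering strength to suppress hallucinations only at critical decoding steps, improving on prior steering methods across multiple benchmarks.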

Abstract

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.

2606.07689 2026-06-09 cs.CV New submission

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

Fan Zhang, Vireo Zhang, Shengju Qian, Haoxuan Li, Zheng Lian, Hao Wu, Yuan Gao, Xinyu Geng, Xin Wang, Pheng-Ann Heng

Affiliations: CUHK; LIGHTSPEED; PKU; Tongji University; THU; HKUST

AI summary: Proposes Struct-Searcher, a structural agentic workflow grounded in belief revision theory that maintains a multimodal structural graph for conflict-aware deep information seeking, with an average relative accuracy improvement of 17.2% on BrowseComp-VL across five backbones.

Abstract

Deep research agents have attracted increasing attention for their ability to collect large-scale online information to acquire target knowledge, with recent efforts shifting from purely text-based information seeking to multimodal settings. However, existing agentic workflows are largely aligned with evidence accumulation models, which linearly aggregate evidence and lack principled mechanisms for handling contradictory information across heterogeneous modalities. To this end, we propose Struct-Searcher, a structural agentic workflow grounded in belief revision theory that explicitly maintains an evolving multimodal structural graph throughout the reasoning process, enabling effective conflict-aware multimodal deep information seeking. Extensive experiments across multiple benchmark datasets and backbone models demonstrate that Struct-Searcher is (1) plug-and-play and model-agnostic, yielding an average relative accuracy improvement of 17.2% on BrowseComp-VL across five different backbones; and (2) top-performing, consistently outperforming state-of-the-art vision-language models (VLMs) and deep research agents, with relative accuracy improvements of 3.7% on MM-BrowseComp, 1.5% on HLE-VL, and 0.7% on BrowseComp-VL over the second-best competing approach.

2606.07861 2026-06-09 cs.CV cs.AI New submission

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

Affiliations: University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary: Introduces the FineSightBench benchmark, which separates perception from reasoning tasks at scales of 4-48 px, and finds that VLM perception saturates around 12 px while reasoning remains limited even at larger scales, exposing fundamental deficiencies in fine-scale visual reasoning.

Comments: 25 pages

Abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? To answer this, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4-48 px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12 px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

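2606.07872 2026-06-09 cs.CV New submission

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

Affiliations: Imperial College London

AI summary: Introduces the VisualFLIP benchmark, which uses paired image perturbations to test whether multimodal large language models truly depend on task-critical visual evidence, and finds that correct predictions and evidence dependence can come apart.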

Abstract

When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
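
The two metrics described above can be sketched in a few lines. This follows one reading of the abstract: pair accuracy is the share of pairs with both sides correct, and Collapse Rate is computed over pairs with at least one side correct, counting those where the same non-empty answer is given for both images. The field names (`pred_a`, `gold_a`, and so on) and the sample pairs are hypothetical, and the paper's exact definitions may differ.

```python
def pair_metrics(pairs):
    """Return (pair_accuracy, collapse_rate) for a list of prediction pairs.

    Each pair is a dict with hypothetical keys pred_a, pred_b, gold_a, gold_b,
    where gold_a != gold_b because the gold answer flips within a pair.
    """
    solved_both = 0
    solved_any = 0
    collapsed = 0
    for p in pairs:
        ok_a = p["pred_a"] == p["gold_a"]
        ok_b = p["pred_b"] == p["gold_b"]
        if ok_a and ok_b:
            solved_both += 1
        if ok_a or ok_b:
            solved_any += 1
            # Collapse: the same non-empty answer repeated for both images.
            if p["pred_a"] != "" and p["pred_a"] == p["pred_b"]:
                collapsed += 1
    pair_acc = solved_both / len(pairs) if pairs else 0.0
    collapse_rate = collapsed / solved_any if solved_any else 0.0
    return pair_acc, collapse_rate

# Made-up predictions for illustration.
pairs = [
    {"pred_a": "3", "pred_b": "4", "gold_a": "3", "gold_b": "4"},           # both solved
    {"pred_a": "3", "pred_b": "3", "gold_a": "3", "gold_b": "4"},           # collapsed
    {"pred_a": "5", "pred_b": "5", "gold_a": "3", "gold_b": "4"},           # neither solved
    {"pred_a": "red", "pred_b": "blue", "gold_a": "red", "gold_b": "green"},  # one solved, updated
]
pair_acc, collapse_rate = pair_metrics(pairs)
```

Because the gold answer flips within a pair, a collapsed pair is necessarily correct on exactly one side, so single-image accuracy can stay high while pair accuracy drops.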

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM New submission

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

Affiliations: School of Computer Science & Tech, Huazhong University of Science and Technology; School of AI and Automation, Huazhong University of Science and Technology

AI summary: Proposes a training-free, two-stage cascaded Video RAG pipeline that decouples semantic retrieval from logical reasoning, targeting cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding.

Comments: To be presented at ACL 2026 MAGMAR Workshop (Oral; Retrieval leaderboard No.1)

Abstract

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

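2606.07962 2026-06-09 cs.CV New submission

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Hangui Lin, Yan Shu, Zhengyang Liang, Chi Liu, Xiangrui Liu, Minghao Qin, Teng Long, Zheng Liu, Nicu Sebe

Affiliations: University of Trento; BAAI; Singapore Management University; IQuest Research

AI summary: Proposes DyCo-RL, which quantifies within-modality attention shifts with the Fisher-Rao geodesic distance to achieve dynamic cross-modal coordination and applies alignment-guided advantage reweighting during policy optimization, improving the visual reasoning of multimodal large language models.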

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
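
For two categorical distributions p and q, the Fisher-Rao geodesic distance has the closed form 2·arccos(Σ √(p_i·q_i)), ranging from 0 for identical distributions to π for distributions with disjoint support (some texts omit the factor of 2). The sketch below computes it for two attention distributions. How DyCo-RL normalizes attention within a modality and which decoding steps it compares are not given in the abstract, so the inputs here are illustrative assumptions.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    p, q: lists of non-negative floats of equal length, each summing to 1.
    """
    # Bhattacharyya coefficient; clamp to [0, 1] to guard against rounding.
    bc = sum(math.sqrt(p_i * q_i) for p_i, q_i in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)

# Hypothetical attention over three visual tokens at two decoding steps.
prev_step = [0.7, 0.2, 0.1]
next_step = [0.1, 0.2, 0.7]
shift = fisher_rao_distance(prev_step, next_step)
```

A large value of `shift` indicates that attention moved to different tokens between the two steps, which is the kind of signal the abstract describes using to assign tokens to functional roles.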

2606.08063 2026-06-09 cs.CV cs.AI cs.CL New submission

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

Affiliations: University of Science and Technology of China

AI summary: Proposes the Robust-U1 framework, which equips multimodal large language models with explicit visual self-recovery through supervised fine-tuning, reinforcement learning, and multimodal reasoning, achieving state-of-the-art robustness under real-world and adversarial corruptions.

Comments: Accepted by ICML 2026

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.
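
The dual reward can be pictured as a weighted sum of a pixel-level term and a semantic-level term. The sketch below is a simplification under stated assumptions: SSIM is computed from global image statistics rather than local windows, plain embedding vectors stand in for CLIP features, and the equal weighting `w_pixel=0.5` is a placeholder. None of these choices are taken from the paper or its code.

```python
import math

def global_ssim(x, y, data_range=1.0):
    """SSIM from global statistics over two equal-length pixel lists.

    A single-window simplification; standard SSIM averages local windows.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2  # usual stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def dual_reward(rec_pixels, ref_pixels, rec_emb, ref_emb, w_pixel=0.5):
    """Hypothetical weighted sum of the pixel-level and semantic-level terms."""
    pixel_term = global_ssim(rec_pixels, ref_pixels)
    semantic_term = cosine_similarity(rec_emb, ref_emb)
    return w_pixel * pixel_term + (1.0 - w_pixel) * semantic_term

reference = [0.1, 0.5, 0.9, 0.3]
recovered = [0.2, 0.4, 0.8, 0.3]
reward = dual_reward(recovered, reference, [1.0, 0.0], [0.8, 0.6])
```

Pairing the two terms means a reconstruction is rewarded for matching the reference both in pixel structure and in semantic content, rather than in only one of the two.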

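2606.08126 2026-06-09 cs.CV New submission

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki

Affiliations: Politecnico di Milano; The University of Tokyo

AI summary: Proposes Direct Latent Augmentation (DLA), which uses undecoded generative latents as privileged information, and Multilayer Explicit Simulated Synesthesia (MESSy), which transfers this dense knowledge to a purely visual student model, avoiding the inefficiency of the decode-encode loop.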

Abstract

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.

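2606.08464 2026-06-09 cs.CV New submission

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Lianyu Hu, Xiaoyu Ma, Zeqin Liao, Yang Liu

Affiliations: University of Science and Technology of China

AI summary: Proposes the TVI-CoT framework, which uses learnable control tokens to dynamically interleave textual reasoning with visual feature access, addressing the inability of multimodal LLMs to access visual features during reasoning; reports state-of-the-art results on multiple benchmarks.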

Comments ICML2026

Abstract

Chain-of-thought (CoT) reasoning has proven effective for enhancing problem-solving in large language models. However, when applied to multimodal LLMs (MLLMs), existing CoT approaches suffer from a fundamental limitation: they perform reasoning entirely in text without accessing visual features during the reasoning process. After initial visual encoding, image information becomes inaccessible, forcing models to reason based solely on whatever was captured in the initial description, which forms a "vision-blind reasoning" paradigm that limits fine-grained visual extraction, error verification, and adaptive attention. We propose Text-Visual Interleaved Chain-of-Thought (TVI-CoT), a framework that enables explicit interleaving of textual reasoning and visual feature access through the learnable control tokens <THINK>, <LOOK>, and <ANSWER>. These tokens allow dynamic switching between reasoning and visual grounding, attending to relevant image regions conditioned on the evolving reasoning state. Experiments on eight benchmarks demonstrate state-of-the-art results among MLLM-based CoT methods and a notable performance boost over the baseline: +6.1% on MMMU, +3.8% on MathVerse, +3.4% on MathVista, and +3.4% on ScienceQA. Code is available at https://github.com/hulianyuyy/TVI-CoT.
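
To make the interleaving concrete, the sketch below splits a generated trace into segments at the three control tokens named in the abstract. It only illustrates the trace structure: the function name, the segment format, and the idea that a <LOOK> segment carries a textual region hint are assumptions, and the actual visual feature access happens inside the model rather than in a parser like this.

```python
import re

CONTROL_TOKENS = ("<THINK>", "<LOOK>", "<ANSWER>")


def split_trace(trace):
    """Split a trace into (mode, text) segments at the control tokens."""
    # The capture group keeps the control tokens in the split result.
    parts = re.split(r"(<THINK>|<LOOK>|<ANSWER>)", trace)
    segments, mode = [], None
    for part in parts:
        if part in CONTROL_TOKENS:
            mode = part.strip("<>").lower()  # switch the current mode
        elif part.strip() and mode is not None:
            segments.append((mode, part.strip()))
    return segments
```

For example, a trace that thinks, looks, thinks again, and then answers yields four segments in that order, which a decoder loop could use to decide when to re-attend to image features.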

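2606.08511 2026-06-09 cs.CV New submission

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

Jie Ma, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Affiliations: Xiamen University

AI summary: Targets visual attention saturation in multimodal LLMs with the training-free Visual-Skip method, which selectively skips redundant visual self-attention modules to obtain block-wise sparsity and uses lightweight calibration to dynamically choose the best sparsity path, substantially cutting compute while preserving performance.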

Abstract

Multimodal Large Language Models (MLLMs) face a significant inference bottleneck due to the quadratic computational cost of self-attention over long visual token sequences. However, we identify a critical inefficiency in current architectures: Visual Attention Saturation. Our analysis reveals that visual tokens rapidly establish their spatial structure and intra-modal relationships in early layers, rendering visual-to-visual self-attention in deeper layers computationally redundant. Conversely, Feed-Forward Networks (FFNs) in these layers remain essential for projecting visual features into the evolving textual semantic space. Leveraging this insight, we present Visual-Skip (V-Skip), a training-free inference paradigm that decouples spatial interaction from semantic evolution. Rather than discarding tokens, V-Skip imposes block-wise structured sparsity by selectively bypassing saturated visual self-attention modules. Furthermore, recognizing that varying downstream tasks demand distinct reasoning depths, V-Skip employs a lightweight, few-shot calibration to dynamically route the task-optimal sparsity path. Extensive experiments demonstrate that V-Skip effectively bypasses redundant vision attention to achieve block-wise sparsity, maintaining a 94.16% to 100.31% performance retention across diverse MLLMs. Ultimately, we prove that to reason more effectively, models do not need to discard what they see -- they simply need to "look less" at the right depth.
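
The core idea of bypassing attention while keeping the feed-forward path can be sketched with a toy layer loop. This is a simplified illustration under stated assumptions: the layers are plain Python callables, the skip set is given by hand rather than chosen by calibration, and the toy does not distinguish visual tokens from text tokens as the method described above does.

```python
def mean_mix(hidden):
    """Toy stand-in for self-attention: every position receives the mean."""
    m = sum(hidden) / len(hidden)
    return [m] * len(hidden)


def half_ffn(hidden):
    """Toy stand-in for a feed-forward network."""
    return [0.5 * h for h in hidden]


def run_layers(hidden, layers, skip_attention):
    """Apply (attn_fn, ffn_fn) layers, bypassing attention at flagged indices."""
    attn_calls = 0
    for idx, (attn_fn, ffn_fn) in enumerate(layers):
        if idx not in skip_attention:
            # Residual attention update, executed only for unskipped layers.
            hidden = [h + a for h, a in zip(hidden, attn_fn(hidden))]
            attn_calls += 1
        # The residual FFN update always runs.
        hidden = [h + f for h, f in zip(hidden, ffn_fn(hidden))]
    return hidden, attn_calls
```

Skipping the last two of four layers halves the number of attention calls in this toy, while every FFN still runs.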

2606.08708 2026-06-09 cs.CV New submission

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Qiming Li, Tianlun Li, Xiaolong Cheng, Hangyu Li, Ruiyan Gong, Kangning Niu, Kaitao Jiang, Mu Xu

Affiliations: Amap CV Lab, Alibaba Group; Peking University

AI summary: Proposes PRPO, a token-level reinforcement learning framework that identifies pivotal perceptual tokens with the Robust Visual Dependency (RVD) metric and strengthens their learning signal through Perceptual Advantage Reshaping (PAR), with average gains of 23.3% (3B model) and 21.1% (7B model) on seven multimodal reasoning benchmarks.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective paradigm for improving the reasoning capability of Large Vision-Language Models (LVLMs). However, existing RLVR methods primarily rely on trajectory-level outcome rewards, which assign identical learning signals across all generated tokens. This coarse-grained credit assignment is fundamentally mismatched to multimodal reasoning, where only a sparse subset of tokens is causally grounded in visual evidence. Consequently, these pivotal perceptual tokens receive weak supervision and are often overwhelmed by language priors or reasoning-template tokens. To address this limitation, we propose Perception-Reinforced Policy Optimization (PRPO), a token-level reinforcement learning framework that explicitly identifies and reinforces pivotal perceptual tokens within long-horizon multimodal reasoning trajectories. PRPO introduces Robust Visual Dependency (RVD), a principled metric that identifies tokens whose predictions are both visually grounded and perturbation-stable, filtering out brittle or noisy visual tokens. Based on RVD, we further propose Perceptual Advantage Reshaping (PAR), a token-level credit assignment technique that amplifies perceptually informative tokens while preserving stable gradients for non-perceptual tokens. Extensive experiments on seven multimodal reasoning benchmarks demonstrate that PRPO consistently outperforms strong LVLM baselines across both 3B and 7B model scales, achieving average gains of 23.3% and 21.1%, respectively. PRPO achieves state-of-the-art performance with improved training efficiency and stronger cross-task generalization. Our findings highlight the importance of fine-grained credit assignment for scalable multimodal reinforcement learning.
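
The contrast between trajectory-level and token-level credit assignment can be sketched as follows. The abstract does not give the RVD or PAR formulas, so everything here is a hypothetical stand-in: the per-token score is taken as the product of a visual-dependency value and a perturbation-stability value, and tokens above a threshold `tau` have their advantage scaled up by `alpha`.

```python
def reshape_advantages(advantage, dependency, stability, alpha=1.0, tau=0.5):
    """Turn one trajectory-level advantage into per-token advantages.

    dependency[i]: how strongly token i depends on the image (assumed in [0, 1]).
    stability[i]:  how stable that dependency is under perturbation (assumed in [0, 1]).
    """
    reshaped = []
    for dep, stab in zip(dependency, stability):
        score = dep * stab  # hypothetical stand-in for an RVD-style score
        # Amplify perceptual tokens; leave all other tokens at the base advantage.
        scale = 1.0 + alpha * score if score >= tau else 1.0
        reshaped.append(advantage * scale)
    return reshaped
```

With a base advantage of 2.0, a token that is both visually grounded and stable receives a larger update, while a brittle or non-perceptual token keeps the original value.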

2606.08719 2026-06-09 cs.CV New submission

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

Yishuo Cai, Jiahui Liu, Yuanxin Liu, Haobo Deng, Linli Yao, Yuhao Zheng, Kun Ouyang, Zhimo Li, Ziyue Wang, Xu Sun, Haoli Bai, Xiaohui Li

Affiliations: State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; Central South University; University of Science and Technology of China; Peking University; Huawei Technologies

AI summary: Proposes the Imagine-OPD framework, which uses on-policy self-distillation to internalize the visual reasoning ability of "Thinking with Images" as "Thinking with Imagination", generating internal visual cues without calling external tools and substantially lowering inference overhead while preserving performance.

Abstract

"Thinking with Images" has emerged as an effective paradigm for fine-grained visual reasoning: by explicitly zooming into relevant regions and reasoning over crops, models can access local evidence that is difficult to recover from a single global image. However, this benefit comes with redundant tool invocations and longer inference traces. Moreover, when such behaviors are learned mainly from outcome reward, the resulting intermediate crops or visual cues can be noisy or fail to faithfully capture task-relevant visual evidence. In this work, we ask whether the reasoning benefits of "Thinking with Images" can be internalized through "Thinking with Imagination": an internal process that decides where to look and imagines what visual cues closer inspection would reveal without actually invoking tools. We propose Imagine-OPD, an on-policy self-distillation framework in which a teacher plays the role of a "Thinking with Images" reasoner during training: it receives privileged zoomed evidence views derived from annotated regions, and supervises the model's own imagination reasoning trajectories. Imagine-OPD does not require an external teacher or high-quality imagination demonstrations. Experiments on vision-centric benchmarks show that Imagine-OPD achieves the best average performance among compared models while significantly reducing inference overhead compared with "Thinking with Images" methods.
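
One common way to express this kind of self-distillation is a per-token divergence between the teacher's distribution (the same model given privileged zoomed views) and the student's distribution (the model without them), evaluated along a trajectory sampled from the student. The sketch below assumes that form; the abstract does not state the loss, so the KL direction, the averaging, and the function names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distill_loss(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over one sampled trajectory.

    Each argument is a list of per-token logit rows of equal shape.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p, q = softmax(t_row), softmax(s_row)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(teacher_logits)
```

The loss is zero when the student already matches the privileged teacher at every token and grows as their next-token distributions diverge.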

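2606.08894 2026-06-09 cs.CV cs.CL New submission

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

Affiliations: University of Manchester; Marex; Imperial College London

AI summary: Addresses the susceptibility of reasoning VLMs to semantic visual distractions in real-world scenes by proposing the Distract-Bench benchmark; finds that reasoning VLMs are less robust to semantic distractions than to perceptual degradation, and that distractions are often absorbed into the reasoning process, leading to wrong answers.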

Abstract

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing work mainly evaluates the reliability of VLMs through input corruptions, such as noise, blur, and weather effects, which make visual evidence harder to perceive. This leaves a critical failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.
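
Because the distractions preserve the ground-truth answer, robustness can be measured by scoring the same questions with and without them. The sketch below shows one plausible way to do so; the relative-drop metric and the function names are assumptions, not the benchmark's published protocol.

```python
def accuracy(preds, golds):
    """Fraction of predictions that equal the gold answers."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def robustness_gap(clean_preds, distracted_preds, golds):
    """Return clean accuracy, distracted accuracy, and the relative drop."""
    clean = accuracy(clean_preds, golds)
    distracted = accuracy(distracted_preds, golds)
    # The relative drop is measured against clean accuracy; 0.0 if nothing was right.
    drop = (clean - distracted) / clean if clean else 0.0
    return clean, distracted, drop
```

Comparing this drop for a reasoning model and its non-reasoning base model, on both corrupted and distracted inputs, would mirror the comparison described in the abstract.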

2606.08908 2026-06-09 cs.CV cs.AI New submission

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

Affiliations: Hanyang University; Korea University; Korea Institute of Industrial Technology

AI summary: Proposes a two-stage vision-language framework that first fine-tunes Qwen3-VL to detect defects and then trains a refinement module to correct first-stage errors, improving detection reliability.

Comments 6 pages, 3 figures

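2606.08948 2026-06-09 cs.CV cs.AI New submission

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

Affiliations: Emory University

AI summary: Addresses the unreliability of existing MLLMs for dietary micronutrient estimation by using a decade of population-scale dietary recalls to generate about 1.1 million image-description-nutrient triplets and fine-tuning Qwen3-VL and GLM-4.6V-Flash into NutriMLLM, which covers all 65 nutrients on real images with accuracy matching or exceeding proprietary models.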

Comments 35 pages, 10 figures, 1 table

Abstract

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.
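
The four-component evaluation (abstention, hallucination, overall usability, and per-nutrient numerical accuracy) can be sketched for a single image as below. The definitions are assumptions made for illustration: abstention is a missing or `None` value, hallucination is an answered value outside a plausibility range, usability is the share of nutrients with an answered and plausible value, and accuracy is the mean absolute percentage error over usable values. The nutrient names and ranges are hypothetical.

```python
def evaluate(pred, truth, plausible):
    """Score one prediction; pred keys are assumed to be a subset of truth keys.

    pred:      nutrient -> predicted value, or None when the model abstains
    truth:     nutrient -> reference value
    plausible: nutrient -> (low, high) range treated as statistically plausible
    """
    n = len(truth)
    answered = {k: v for k, v in pred.items() if v is not None}
    usable = {k: v for k, v in answered.items()
              if plausible[k][0] <= v <= plausible[k][1]}
    # Absolute percentage error, skipping nutrients whose reference value is zero.
    ape = [abs(v - truth[k]) / truth[k] for k, v in usable.items() if truth[k] != 0]
    return {
        "abstention": (n - len(answered)) / n,
        "hallucination": (len(answered) - len(usable)) / len(answered) if answered else 0.0,
        "usability": len(usable) / n,
        "mape": sum(ape) / len(ape) if ape else None,
    }
```

Reporting the components separately keeps a model that abstains often from looking accurate simply because it answers only the easy nutrients.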

2606.08959 2026-06-09 cs.CV cs.CL New submission

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

Affiliations: LMU Munich; FAU Erlangen-Nuremberg; Munich Center for Machine Learning; University of Tübingen & Tübingen AI Center; Sun Yat-sen University; University of Copenhagen; University of Maryland, College Park

AI summary: Introduces ChinaHeritaQA, a multimodal benchmark dataset with 2,279 images and 14,133 bilingual multiple-choice questions spanning seven cognitive dimensions, for evaluating the cultural reasoning of vision-language models on World Heritage sites in China.

Abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

2606.09033 2026-06-09 cs.CV cs.CL New submission

CRANE: Knowledge Editing for Reasoning MLLMs

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang

Affiliations: University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), CASIA; Harbin Institute of Technology; Shandong University

AI summary: Identifies three failure modes in knowledge editing for reasoning multimodal LLMs (structural collapse, cognitive dissonance, and shallow internalization) and proposes CRANE, a retrieval-augmented framework that needs no per-edit parameter modification and achieves high success rates through a modality-aware dual-library retrieval system and a two-phase training strategy.

Comments 10 pages, 5 figures

Abstract

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

2606.09109 2026-06-09 cs.CV cs.IR cs.LG New submission

Driving Video Retrieval for Complex Queries with Structured Grounding

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

Affiliations: NEC Laboratories, America; University of California, Riverside

AI summary: Proposes the STRIVE-D framework, which calibrates rules with weakly labeled in-domain videos and fuses them with vision-language and keyword retrieval signals, achieving up to 84% relative improvement in top-1 accuracy for driving video retrieval.

Abstract

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
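
The calibrate-then-fuse idea can be sketched in two small functions. Both are hypothetical stand-ins rather than the STRIVE-D formulation: reliability is estimated as the precision of a rule's firings against weak labels, and the fused score interpolates between the rule score and the average of the vision-language and keyword scores.

```python
def rule_reliability(rule_hits, weak_labels):
    """Precision of a rule's firings against weak in-domain labels (0 or 1)."""
    fired = [label for hit, label in zip(rule_hits, weak_labels) if hit]
    return sum(fired) / len(fired) if fired else 0.0


def fuse(rule_score, vlm_score, keyword_score, reliability):
    """Trust the rule in proportion to its estimated reliability."""
    fallback = 0.5 * (vlm_score + keyword_score)
    return reliability * rule_score + (1.0 - reliability) * fallback
```

A rule that rarely agrees with the weak labels contributes little, so retrieval falls back on the vision-language and keyword signals for that query.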

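2606.09142 2026-06-09 cs.CV cs.AI New submission

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Danya Li, Xiang Su, Yan Feng, Rico Krueger

Affiliations: Technical University of Denmark; University of Helsinki; Delft University of Technology

AI summary: Uses vision language models (VLMs) to cast pedestrian crossing intention prediction as a visual question answering task; parameter-efficient fine-tuning combined with contextual cues such as ego motion, vehicle motion, and eye gaze yields a 14.5% accuracy improvement on egocentric video, setting a new state of the art.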

Abstract

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.
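
Casting intent prediction as closed-ended VQA amounts to building a fixed-choice prompt, optionally with contextual cues, and mapping the model's reply back to a label. The sketch below shows that wrapper only; the prompt wording, the cue names, and the answer format are all hypothetical, and the video frames would be passed to the VLM separately.

```python
import re


def build_prompt(cues=None):
    """Compose a closed-ended question, optionally prefixed by context cues."""
    lines = ["You are viewing a short first-person video clip recorded by a pedestrian."]
    for name, value in (cues or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("Question: Does the pedestrian intend to cross the street?")
    lines.append("Answer with exactly one option: (A) Yes (B) No")
    return "\n".join(lines)


def parse_answer(reply):
    """Map a reply starting with option A or B to True/False, else None."""
    # The lookahead rejects words such as "BECAUSE" that merely start with A or B.
    match = re.match(r"\(?([AB])\)?(?![A-Z])", reply.strip().upper())
    if match is None:
        return None
    return match.group(1) == "A"
```

Calling `build_prompt({"Ego motion": "walking toward the curb"})` adds one cue line before the question, which is one simple way to inject cues such as ego motion or eye gaze as text.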

2606.09167 2026-06-09 cs.CV New submission

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Rui Yao, Yuhong Zhang, Kunyang Sun, Hancheng Zhu, Jiaqi Zhao, Zhiwen Shao, Abdulmotaleb El Saddik

Affiliations: China University of Mining and Technology; University of Ottawa

AI summary: Proposes the VLHTrack framework, which mitigates spectral redundancy with a language-guided band selection module, integrates visual and linguistic features with a multimodal fusion module, and handles target deformation with a dynamic template update strategy, outperforming existing methods on HOT2023/2024.

Comments 14 pages, 8 figures

Abstract

Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template, and such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic template-feature update strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update the template features, ensuring that they evolve efficiently under the guidance of temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.

2606.09290 2026-06-09 cs.CV New submission

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan

Affiliations: Zhejiang University; Hunan University; Tianjin University; University of Chinese Academy of Sciences; Shanghai Jiao Tong University; Jilin University

AI summary: Proposes the Visual Para-Thinker++ framework, in which a shared MLLM policy is instantiated as multiple role agents that reason in parallel; combined with Multi-Agent Capability Injection and role-decoupled optimization, it mitigates early perceptual commitment and hallucination in visual reasoning.

Abstract

Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.

2606.09393 2026-06-09 cs.CV New submission

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

Penghui Yang, Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin

Affiliations: Tsinghua University; University of Science and Technology of China; Microsoft; Shanghai AI Laboratory; Shanghai Innovation Institute; Alibaba Cloud; The Chinese University of Hong Kong

AI summary: Proposes the CapRL++ framework, which optimizes multimodal captioning with Reinforcement Learning with Verifiable Rewards (RLVR), using the accuracy of a non-visual language model at answering questions as the reward to improve dense caption quality, and surpasses conventional supervised fine-tuning on more than 20 benchmarks.

Comments 26 pages, 10 figures. Project page: https://github.com/InternLM/CapRL. arXiv admin note: text overlap with arXiv:2509.22647

详情

AI中文摘要

图像和视频描述是连接视觉与语言领域的基础任务，在预训练大型视觉语言模型（LVLMs）中发挥关键作用。当前最先进的描述模型通常采用监督微调（SFT）训练，这种范式依赖于昂贵且不可扩展的标注，并常导致模型记忆特定真实答案，限制了其通用性和生成多样化、创造性描述的能力。为克服这些局限，我们提出将可验证奖励的强化学习（RLVR）应用于多模态描述的开放任务。我们引入描述强化学习++（CapRL++），一种新颖的无参考训练框架，通过效用重新定义描述质量：高质量描述应使非视觉语言模型能够准确回答关于相应视觉内容的问题。CapRL++采用解耦的两阶段流程，其中LVLM生成描述，目标奖励来自一个独立的、无视觉的LLM仅基于该描述回答多项选择题的准确率。在超过20个图像和视频基准上的评估表明，CapRL++提升了密集描述质量，并增强了基于描述的预训练在空间和时间理解等任务上的表现。在CapRL++标注的可扩展图像和视频描述数据集上预训练带来了显著的下游收益。此外，在描述质量评估的Prism框架内，使用CapRL++训练的紧凑模型在密集描述性能上可与Qwen2.5-VL-72B和Qwen3-VL-235B-A22B等大得多的模型相媲美。这些结果验证了CapRL++能有效训练模型生成可泛化、高保真的描述，为超越传统SFT的局限奠定了坚实基础。

英文摘要

Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
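The reward described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical question format and replaces the vision-free LLM with a toy keyword matcher (`keyword_answerer`); neither the function names nor the data layout come from the paper, and the real system presumably calls an actual language model.

```python
from typing import Callable, Dict, List

# Hypothetical question format (an assumption, not from the paper):
# {"question": str, "options": {label: text}, "answer": label}
Question = Dict[str, object]


def caption_reward(
    caption: str,
    questions: List[Question],
    answer_fn: Callable[[str, str, Dict[str, str]], str],
) -> float:
    """Fraction of multiple-choice questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = 0
    for q in questions:
        # The answerer never sees the image, only the caption text.
        predicted = answer_fn(caption, q["question"], q["options"])
        if predicted == q["answer"]:
            correct += 1
    return correct / len(questions)


def keyword_answerer(caption: str, question: str, options: Dict[str, str]) -> str:
    """Toy stand-in for the vision-free LLM: pick the first option found in the caption."""
    for label, text in options.items():
        if text.lower() in caption.lower():
            return label
    return next(iter(options))  # fall back to the first option


questions = [
    {"question": "Which animal is shown?", "options": {"A": "dog", "B": "cat"}, "answer": "B"},
    {"question": "What color is the sofa?", "options": {"A": "red", "B": "blue"}, "answer": "A"},
]
reward = caption_reward("A cat sleeps on a blue sofa.", questions, keyword_answerer)
```

Because the reward depends only on whether the caption supports correct answers, no reference caption is needed, which is the sense in which the abstract calls the framework reference-free.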

2606.09826 2026-06-09 cs.CV cs.AI new submission

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin, Shengju Qian, Yuqi Liu, Yi-Hua Huang, Yiyu Wang, Wei Huang, Yitang Li, Fan Zhang, Zeyu Hu, Lingting Zhu, Xin Wang, Xiaojuan Qi

Affiliations: The University of Hong Kong; LIGHTSPEED; The Chinese University of Hong Kong; Tsinghua University

AI summary: Proposes OmniGameArena, a unified benchmark of 12 UE5 games, together with the Improvement Dynamics Curve (IDC), which uses a reflection mechanism to evaluate VLM agents' cold-start scores, improvement dynamics, and generalization.

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

2606.07651 2026-06-09 cs.LG cs.CV cross-list

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

Kevin Patel, Shashi Bhushan Jha

Affiliations: Department of Computer Science, University of West Florida

AI summary: Proposes KITE, a tri-modal fake news detection framework that jointly models textual, visual, and knowledge representations and integrates their features with cross-modal attention, significantly outperforming unimodal and bimodal baselines on benchmark datasets.

Abstract

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it capture how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.

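2606.08046 2026-06-09 cs.AI cs.CV cs.LG cross-list

COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang

Affiliations: Universität Bern; Xi’an Jiaotong University; Nanyang Technological University

AI summary: COMPASS achieves multimodal fusion through proxy tokens and a shared space, addressing the information loss and fusion-interface mismatch caused by missing modalities and improving the robustness of multimodal sensing.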

2604.27476 2026-06-09 cs.CV replacement

EdgeFM: Efficient Edge Inference for Vision-Language Models

Mengling Deng, Yuanpeng Chen, Sheng Yang, Wei Tao, Wenhai Zhang, Hui Song, Linyuanhao Qin, Kai Zhao, Xiaojun Ye, Shanhui Mo, Jingli Fan, Shuang Zhang, Bei Liu, Tiankun Zhao, Xiangjing An

Affiliations: Go Further. AI; School of Data Science Fudan University; RUYi Dynamics Co. Ltd; Independent Researcher

AI summary: EdgeFM is a lightweight, agent-driven framework that optimizes inference for edge-deployed vision-language models, improving cross-platform performance and portability and achieving faster inference than conventional toolchains.

Comments Technique Report version

Abstract

Vision-language models (VLMs) have demonstrated strong applicability in edge industrial applications, yet their deployment remains severely constrained by requirements for deterministic low latency and stable execution under resource limitations. Existing frameworks either rely on bloated general-purpose designs or force developers into opaque, hardware-specific closed-source ecosystems, leading to hardware lock-in and poor cross-platform adaptability. Observing that modern AI agents can efficiently search and tune configurations to generate highly optimized low-level kernels for standard LLM operators, we propose EdgeFM, a lightweight, agent-driven VLM/LLM inference framework tailored for cross-platform industrial edge deployment. EdgeFM removes non-essential features to reduce single-request latency, and encapsulates agent-tuned kernel optimizations as a modular library of reusable skills. By allowing direct invocation of these skills rather than waiting for closed-source implementations, it effectively closes the performance gap long dominated by proprietary toolchains. The framework natively supports mainstream platforms including x86 and NVIDIA Orin SoCs, and represents the first end-to-end VLA deployment on the domestic Horizon Journey platform, enhancing cross-platform portability. In most cases, it yields clearly better inference performance than conventional vendor-specific toolchains, achieving up to a 1.49x speedup over TensorRT-Edge-LLM on the NVIDIA Orin platform. Experimental results show that EdgeFM delivers favorable end-to-end inference performance, providing an open-source, production-grade solution for diverse edge industrial scenarios.

2606.07435 2026-06-09 cs.CV cs.CL replacement

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Rishabh Jain, Naomi Harte

Affiliations: Sigmedia Group; School of Engineering; Trinity College Dublin

AI summary: Compares VSR systems with humans on the MaFI dataset and finds that, although the models achieve higher overall accuracy, their error patterns differ from those of humans, relying mainly on language cues from training data rather than visual perception.

Comments Accepted at INTERSPEECH 2026

Abstract

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

2503.22697 2026-06-09 q-bio.NC cs.AI cs.CV replacement

Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing

Feihan Feng, Jingxin Nie

Affiliations: Ministry of Education Center for Studies of Psychological Application; Center for Studies of Psychological Application; Key Laboratory of Brain, Cognition and Education Sciences; School of Psychology; Guangdong Key Laboratory of Mental Health and Cognitive Science

AI summary: Proposes a deep learning model that decodes semantic descriptions of natural images directly from fMRI signals, revealing the key role of higher-level visual cortices in semantic processing and demonstrating category-specific neural representations.

Comments 39 pages, 9 figures

Abstract

Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in neuroscience and artificial intelligence. Despite notable progress in current brain decoding models, a critical gap persists in their systematic integration with established neuroscientific theories and the exploration of underlying neural mechanisms. Here, we present a novel framework that directly decodes fMRI signals into textual descriptions of viewed natural images. Our deep learning model, trained without visual information, achieves state-of-the-art semantic decoding performance, generating meaningful captions that capture the core semantic content of complex scenes. Neuroanatomical analysis reveals the critical role of higher-level visual cortices, including MT+ complex, ventral stream visual cortex, and inferior parietal cortex, in visual semantic processing. Furthermore, category-specific analysis demonstrates nuanced neural representations for semantic dimensions like animacy and motion. This work provides a more direct and interpretable framework for semantic decoding from brain activity, offering a powerful new methodology for probing the neural basis of complex semantic processing, refining the understanding of the distributed semantic network, and potentially developing brain-inspired language models.

2504.02983 2026-06-09 cs.CL cs.CV replacement

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Xiaoyu Tong, Zhi Zhang, Pia Sommerauer, Martha Lewis, Ekaterina Shutova

Affiliations: ILLC, University of Amsterdam, the Netherlands; Vrije Universiteit Amsterdam, the Netherlands

AI summary: Introduces Hummus, a dataset of humorous multimodal metaphor use with an annotation scheme grounded in Incongruity Theory and Conceptual Metaphor Theory, and tests multimodal large language models on detecting and understanding humorous multimodal metaphors, finding that current models still struggle.

Abstract

Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus to develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

2507.08064 2026-06-09 cs.MM cs.CV replacement

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

Yibo Lyu, Rui Shao, Gongwei Chen, Yijie Zhu, Weili Guan, Liqiang Nie

Affiliations: Harbin Institute of Technology

AI summary: Proposes PUMA, which reduces MLLM parameters through layer-pruned self-distillation and designs a Modality-Adaptive Contrastive Learning Loss (MAC-Loss) to improve retrieval efficiency, maintaining performance while lowering resource consumption.

Abstract

As multimedia content expands, the demand for unified multimodal retrieval (UMR) in real-world applications increases. Recent work leverages multimodal large language models (MLLMs) to tackle this task. However, their large parameter size results in high training costs and low inference efficiency. To address this, we propose PUMA: a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves UMR from both structural and learning perspectives. (1) Structurally, we propose Layer-Pruned Self-Distillation, which prunes MLLMs by keeping only shallow layers while distilling features from dropped deep layers as teacher signals. This reduces parameters and preserves representation capability. (2) On the learning side, we introduce Modality-Adaptive Contrastive Learning Loss (MAC-Loss), which separates in-batch negatives into harder intra-modality and easier inter-modality groups based on the target modality, assigning different temperature strategies to enhance learning efficiency. Experiments show our method significantly reduces resource usage while maintaining strong performance.
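The idea of treating intra-modality and inter-modality negatives differently can be sketched as an InfoNCE-style loss with a per-candidate temperature. This is only one plausible reading of the abstract, not the paper's MAC-Loss: the function name, the rule that same-modality candidates get a sharper temperature, and the values 0.05 and 0.1 are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mac_style_loss(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    modalities: List[str],
    pos_index: int,
    tau_intra: float = 0.05,  # assumed temperature for candidates in the target modality
    tau_inter: float = 0.1,   # assumed temperature for candidates in other modalities
) -> float:
    """Contrastive loss whose temperature depends on each candidate's modality group."""
    target_modality = modalities[pos_index]
    logits = []
    for cand, modality in zip(candidates, modalities):
        tau = tau_intra if modality == target_modality else tau_inter
        logits.append(cosine(query, cand) / tau)
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[pos_index]


query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.6, 0.8], [0.6, 0.8]]
loss_mixed = mac_style_loss(query, candidates, ["image", "image", "text"], pos_index=0)
loss_all_intra = mac_style_loss(query, candidates, ["image", "image", "image"], pos_index=0)
```

With identical similarities, the negative that shares the target modality contributes more to the loss than the cross-modality one, so `loss_all_intra` exceeds `loss_mixed`. That is the effect a sharper intra-modality temperature is meant to produce in this sketch.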

2510.03244 2026-06-09 cs.LG cs.AI cs.CV replacement

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory; Ant Group; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); University of Pennsylvania

AI summary: Proposes VFEM, which uses a pre-trained large vision model and fuses visual and temporal features via cross-modal attention. Training only 7.45% of the parameters, it captures cross-variable dependencies and improves multivariate time series forecasting.

Abstract

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

2603.12046 2026-06-09 eess.AS cs.CV cs.SD replacement

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

Umberto Cappellazzo, Stavros Petridis, Maja Pantic

Affiliations: Imperial College London, UK; NatWest AI Research, UK

AI summary: Presents Dr. SHAP-AV, a framework that uses Shapley values to analyze modality contributions in audio-visual speech recognition, revealing models' reliance on vision in noisy conditions and the stability of audio contributions, and motivating modality-weighting mechanisms and Shapley attribution as a standard diagnostic tool.

Comments Accepted to INTERSPEECH 2026 [Long Paper track]. Project website: https://umbertocappellazzo.github.io/Dr-SHAP-AV

Abstract

Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
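For readers unfamiliar with Shapley attribution, the textbook exact computation for a small set of players is sketched below, with the two modalities as players. The scores in `toy_scores` are invented for illustration, and the abstract does not say how the paper defines or estimates its value function, so this shows only the general definition, not the Dr. SHAP-AV procedure.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal gain from adding player p to the current coalition.
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Invented recognition scores for each subset of available modalities.
toy_scores = {
    frozenset(): 0.0,
    frozenset({"audio"}): 0.6,
    frozenset({"video"}): 0.2,
    frozenset({"audio", "video"}): 0.9,
}
contributions = shapley_values(["audio", "video"], toy_scores.__getitem__)
```

By construction the contributions sum to the score of the full audio-visual input minus the score of the empty input, which is what makes them usable as relative modality shares.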

2606.07626 2026-06-09 cs.CV cs.AI new submission

Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic

Pranav Darshan, Raghuveer Narayanan Rajesh, M Uttara Kumari

Affiliations: RV College of Engineering

AI summary: Addressing perception in unstructured urban traffic, proposes a 360-degree LiDAR perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions, and validates multi-class detection performance on an Indian urban traffic dataset.

Abstract

Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users, frequent occlusions, irregular motion patterns, and the lack of standardized road layouts. Although recent LiDAR-based 3D object detectors have shown strong performance in structured driving scenarios, most are developed and evaluated for limited field-of-view settings, and their behavior under full-surround 360-degree sensing is still not well understood. This paper studies a 360-degree LiDAR perception pipeline for autonomous driving, with particular attention to panoramic sensing, azimuthal sector-wise spatial processing, and transformation-equivariant feature extraction in complex urban scenes. The paper presents a practical 360-degree perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions and evaluates its behavior on a custom Ouster OS0 LiDAR dataset collected across diverse Indian urban traffic conditions. The results show generally stable detection across several object classes, with the strongest performance for cars at 92.02/90.51, buses at 80.53/76.34, and trucks at 78.59/74.16, while lower scores for pedestrians at 67.45/61.02, cyclists at 73.21/69.54, and motorcyclists at 71.20/68.13 reflect the greater difficulty of detecting smaller and more variable road users in dense urban scenes.
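One simple way to picture azimuthal sector-wise processing is to bin each LiDAR point by its horizontal angle around the sensor. The sketch below is an assumption about what such a partition could look like; the abstract does not specify the sector count, the angular origin, or how sectors are processed afterwards, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the sensor frame


def sector_index(x: float, y: float, num_sectors: int) -> int:
    """Map a point's azimuth (angle in the x-y plane) to one of num_sectors equal bins."""
    azimuth = math.atan2(y, x) % (2.0 * math.pi)  # wrap into [0, 2*pi)
    width = 2.0 * math.pi / num_sectors
    # Clamp guards against floating-point rounding at the upper edge.
    return min(int(azimuth / width), num_sectors - 1)


def group_by_sector(points: Sequence[Point], num_sectors: int) -> List[List[Point]]:
    """Split a 360-degree point cloud into per-sector point lists."""
    sectors: List[List[Point]] = [[] for _ in range(num_sectors)]
    for point in points:
        sectors[sector_index(point[0], point[1], num_sectors)].append(point)
    return sectors


cloud = [(1.0, 0.1, 0.0), (-0.1, 1.0, 0.0), (-1.0, -0.1, 0.0), (0.1, -1.0, 0.0)]
quadrants = group_by_sector(cloud, num_sectors=4)
```

Each sector can then be handed to the same detector, which is where rotation-equivariant features help: a sector rotated about the vertical axis should produce correspondingly rotated features rather than unrelated ones.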

2606.07895 2026-06-09 cs.CV cs.RO new submission

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Sung-Wook Lee, Xuhui Kang, Yen-Ling Kuo

Affiliations: University of Virginia

AI summary: Proposes TBD-VLA, which uses a temporal block diffusion mechanism to enable parallel action generation in discrete-token VLA models, balancing temporal coherence with inference speed and outperforming prior methods on simulated and real-world tasks.

Abstract

Discrete Vision-Language-Action (VLA) models typically formulate action generation as next-token prediction over discretized action spaces, conditioning each token autoregressively on prior context. While effective, this paradigm incurs high inference latency and largely ignores the temporal structure inherent in action trajectories. Recent efforts introduce parallel decoding to improve efficiency, enabling faster inference, but lack explicit mechanisms for modeling token dependencies. We introduce TBD-VLA, a discrete token-based VLA framework that incorporates block diffusion to enable temporal action generation. We partition action sequences into temporal blocks and perform masked discrete diffusion within each block, while maintaining autoregressive generation across blocks. This design unifies temporal autoregression and parallel action decoding, achieving both strong temporal coherence and improved inference speed. In addition, the explicit temporal modeling enables asynchronous execution of action chunks (e.g., Real-Time Chunking) via temporal in-painting. TBD-VLA significantly outperforms prior VLA approaches in both simulation and real-world manipulation tasks, offering a scalable path toward fast, temporally aware, discrete VLA models. Project webpage: https://tbd-vla.github.io/
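The decoding scheme in this abstract (parallel masked denoising inside a block, autoregression across blocks) can be sketched as a control loop. The confidence-ranked unmasking schedule is an assumption borrowed from common masked-decoding practice, and `toy_predictor` is a hypothetical stand-in for the model; nothing here reflects the authors' actual implementation.

```python
from typing import Callable, List, Tuple

MASK = -1  # placeholder id for a not-yet-decoded action token

# predict_fn(history, block) returns one (token, confidence) proposal per block position.
Predictor = Callable[[List[int], List[int]], List[Tuple[int, float]]]


def decode_blocks(
    num_blocks: int,
    block_size: int,
    steps_per_block: int,
    predict_fn: Predictor,
) -> List[int]:
    """Unmask tokens in parallel within each block; generate blocks one after another."""
    per_step = -(-block_size // steps_per_block)  # ceiling division
    history: List[int] = []
    for _ in range(num_blocks):
        block = [MASK] * block_size
        while MASK in block:
            proposals = predict_fn(history, block)
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Commit the most confident masked positions first (assumed schedule).
            masked.sort(key=lambda i: proposals[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = proposals[i][0]
        history.extend(block)  # later blocks condition on finished ones
    return history


def toy_predictor(history: List[int], block: List[int]) -> List[Tuple[int, float]]:
    """Hypothetical model: proposes consecutive ids, with earlier positions more confident."""
    return [(len(history) + i, 1.0 / (1 + i)) for i in range(len(block))]


actions = decode_blocks(num_blocks=2, block_size=4, steps_per_block=2, predict_fn=toy_predictor)
```

In this toy run, eight tokens are produced with four predictor calls instead of eight, which illustrates where the speedup over token-by-token decoding comes from while each block still depends on the blocks before it.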

2606.08242 2026-06-09 cs.CV new submission

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

Affiliations: Wuhan University; Shanghai Innovation Institute; Southeast University; Fudan University; East China Normal University

AI summary: Proposes Light-WAM, a lightweight World Action Model that lowers training cost with a compact video backbone and future-video supervision in a downsampled latent space, and introduces a state-fusion action expert for efficient action prediction, achieving good performance on LIBERO and RoboTwin 2.0.

Abstract

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.

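2606.08277 2026-06-09 cs.CV new submission

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Angel Martinez-Sanchez, Kianna Ng, Wesley Maia, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Yash Tandon, Parthib Roy, Mohan Trivedi, Ross Greer

Affiliations: UC Merced; Johns Hopkins; UC San Diego

AI summary: Proposes a real-time onboard perception pipeline that recognizes temporary work-zone speed limits from visual signs by fusing object detection with semantic verification and applying hysteresis-based state transitions, achieving 96.5% recall and 68.7% precision on low-cost hardware.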

Real-Time Nonverbal Communication from Body Pose with a Consistency-Based Reliability Measure

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

Affiliations: National University of Science and Technology "Politehnica" Bucharest; Simion Stoilow Institute of Mathematics of the Romanian Academy; NORCE Norwegian Research Centre AS

AI summary: Studies the recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.

Abstract

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) datasets that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
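A self-consistency signal of the kind described here could be computed by counting how many consecutive steps the model has kept the same predicted intent. The sketch below is one possible reading under that assumption; the function names, the per-step label list, and the threshold rule are hypothetical, and the paper's actual definition and probability bound are not reproduced.

```python
from typing import List


def consistent_run_length(predictions: List[str]) -> int:
    """Number of most recent consecutive steps that agree with the latest prediction."""
    if not predictions:
        return 0
    run = 1
    for previous in reversed(predictions[:-1]):
        if previous != predictions[-1]:
            break
        run += 1
    return run


def is_reliable(predictions: List[str], min_run: int) -> bool:
    """Accept the latest prediction only after it has been stable for min_run steps."""
    return consistent_run_length(predictions) >= min_run


step_labels = ["wave", "stop", "stop", "stop"]
run = consistent_run_length(step_labels)
accepted = is_reliable(step_labels, min_run=3)
```

No ground-truth labels are needed to evaluate this check, which is why such a signal counts as unsupervised. A long run can still be wrong if the model is consistently mistaken, matching the abstract's caveat about confident but false predictions.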

2606.09477 2026-06-09 cs.CV new submission

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

Tao Li, Zhenbao Yu, Banglei Guan, Jianli Han, Weimin Lv

Affiliations: Naval Aviation University; National University of Defense Technology

AI summary: Proposes two minimal solvers based on IMU priors that require only four point correspondences and reduce the multi-camera relative pose problem to a univariate 6th-degree polynomial, substantially lowering computational complexity and performing well within RANSAC frameworks.

Abstract

Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
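
Why a smaller minimal sample helps inside RANSAC can be sketched with the standard iteration-count formula N = log(1 - p) / log(1 - w^s), where s is the sample size, w the inlier ratio, and p the desired confidence. The 4-point sample size comes from the abstract; the 6-point comparison and the inlier ratio of 0.5 are illustrative assumptions, not figures from the paper.

```python
import math


def ransac_iterations(sample_size, inlier_ratio, confidence=0.99):
    """Iterations needed to draw at least one all-inlier sample with the given confidence."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_clean = inlier_ratio ** sample_size  # chance that one sample is all inliers
    if p_clean >= 1.0:
        return 1
    # log1p keeps the denominator accurate when p_clean is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_clean))


for size in (4, 6):  # 4 is from the abstract; 6 is a hypothetical larger solver
    print(size, ransac_iterations(size, inlier_ratio=0.5))
```

The count grows quickly with the sample size, so a solver that needs fewer correspondences needs fewer hypotheses for the same confidence.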

2606.09634 2026-06-09 cs.CV cs.AI New submission

GraspFoM: Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

Dongli Wu, Xiaobao Wei, Hao Wang, Qiaochu Dong, Ying Li, Qingpo Wuwu, Ming Lu, Wufan Zhao

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Peking University; The Hong Kong University of Science and Technology

AI summary: Proposes GraspFoM, a framework that uses 3D foundation priors (SAM3D) to build a shared 3D object latent, jointly optimizes reconstruction and grasp pose prediction, and generates continuous, multimodal grasps with an anchor-initialized truncated pose-reasoning diffuser, achieving high-fidelity reconstruction and optimal grasps.

English abstract

Robotic grasping is a fundamental capability in robotic manipulation. Yet grasping remains challenging under partial observations. Reliable grasping depends on both local contact cues and object-level 3D structure. Existing geometry-aware grasping methods recognize the value of reconstruction, but they typically treat geometry as an intermediate prediction rather than a reusable object prior for grasping. In this paper, we present GraspFoM, a unified framework that leverages 3D foundation priors (SAM3D) to build a shared 3D object latent for both reconstruction and grasp pose prediction. Built on this shared object latent, we introduce an anchor-initialized truncated pose-reasoning diffuser that predicts continuous and multimodal grasp poses without directly relying on discrete grasp candidates. We further investigate the interaction between reconstruction and grasping through a reconstruction-aware scorer and a residual latent updater. Reconstruction provides grounded geometric cues, while grasp supervision refines the shared object latent toward grasp-relevant affordances. GraspFoM jointly predicts grasp poses and reconstructs high-fidelity 3D assets in mesh and 3DGS forms. Comprehensive experiments demonstrate that GraspFoM achieves state-of-the-art results on both reconstruction and grasping. Notably, these improvements require only a small number of additional trainable parameters. Component-wise ablation studies also demonstrate the contribution of each component.

2606.08495 2026-06-09 cs.RO cs.CV Cross-list

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

Haoyang Ge, Peng Ren, Yukun Shi, Cong Huang, Kun Li, Kai Chen

Affiliations: Tianjin University; Zhongguancun Academy; Beihang University; Zhongguancun Institute of Artificial Intelligence; DeepCybo

AI summary: Proposes EgoPriMo, a framework that learns whole-body motion priors from egocentric human demonstrations, uses a triple-stream DiT to jointly model body dynamics, visual context, and text, supports reconstruction, generation, and forecasting, and runs on a Unitree humanoid robot.

English abstract

Humanoid robots require whole-body motions that adapt to scene context, task requirements, and user intent. Motion tracking reproduces specified trajectories, and humanoid vision-language-action systems provide semantic interfaces, but neither offers a scalable and interactive prior for broad full-body behavior. We introduce EgoPriMo (Egocentric Motion Prior for Humanoid Robots), a unified framework that learns such priors from egocentric human demonstrations. Given egocentric observations and a text prompt, EgoPriMo reconstructs, generates, and forecasts SMPL-based full-body motion. Language is used as a high-level control signal rather than a complete motion specification. At the core of EgoPriMo is a Triple-stream DiT that jointly models body dynamics, egocentric visual context, and text; task-conditioning masks route different tasks and missing-modality data through the same checkpoint. Experiments on Nymeria and EgoExo4D show that one checkpoint improves egocentric motion generation over UniEgoMotion while supporting reconstruction and forecasting; the generated SMPL motions can also be executed by a Unitree humanoid controller. These results indicate a practical path from scalable egocentric observations to generalizable and interactive humanoid motion priors.

2606.08542 2026-06-09 cs.RO cs.AI cs.CV Cross-list

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

Haizhou Ge, Yufei Jia, Yue Li, Zhixing Chen, Lu Shi, Lei Han, Guyue Zhou, Ruqi Huang

Affiliations: Tsinghua University; DISCOVER Robotics

AI summary: Addresses models misreading video traces in exploratory manipulation with Closed-Loop Trace Distillation, in which a per-task coding agent extracts a one-line natural-language heuristic prompt so that a frozen VLM accurately predicts the minimal-success action chain, raising accuracy by 0.38 to 0.47 on simulated and real-robot tasks.

Comments 16 pages, 4 figures, 4 tables

English abstract

Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a locked cabinet drawer, fails, and only succeeds after opening the lock. The failed pull reveals a latent precondition (the drawer is locked) that determines the minimal-success action chain (the fewest actions that complete the task), here [lock-open, drawer-pull]. Correctly reading this trace is therefore the prerequisite for recovering that chain. We formalize this setting as Exploratory Manipulation Trace QA (EMT-QA): given synchronized video and proprioception from an exploratory trace, predict the minimal-success action chain under the latent precondition revealed by the probe. However, even state-of-the-art VLMs and embodied multimodal LLMs misread this evidence: they do not reliably recover the chain from raw video, raw proprioception, or their combination. We introduce Closed-Loop Trace Distillation, a pipeline that uses a per-task coding agent to inspect labeled training traces and distill a one-line natural-language prompt over the trace, which we call the Distilled Reading Heuristic (DRH). At inference, no agent is invoked and no model weights are updated; a frozen VLM receives the raw trace plus the DRH as a prompt entry. Across three simulator and two real-robot tasks, the DRH improves chain accuracy by +0.38 to +0.47 over the best raw-modality baseline. The same DRH also serves as the sole specification for one-shot programmatic classifiers that match the prompted VLM.

2606.08655 2026-06-09 cs.RO cs.CV Cross-list

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

Affiliations: Duke University

AI summary: Proposes PhysGraph, a framework that combines symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes, achieving state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction.

English abstract

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

2606.08688 2026-06-09 cs.RO cs.CV Cross-list

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

Chunji Lv, Jiaxi Ye, Yuchen Jiang, Rexar Lin, Changsheng Li

Affiliations: Beijing Institute of Technology

AI summary: Proposes PhysAgent, the first simulator-in-the-loop multi-agent framework, which decouples materials from external forces, extracts trajectories with vision foundation models, and applies LLM commonsense reasoning to automate physically plausible 4D motion synthesis, significantly improving generation diversity and physical accuracy.

English abstract

Achieving fully automated, physically plausible 3D motion synthesis is a core objective in graphics and generative AI. However, configuring complex environmental force fields still relies entirely on manual expert intervention, creating a severe bottleneck for large-scale simulation data generation. Existing automated methods primarily focus on material optimization and exhibit severe modality gaps and technical flaws when applied to the vastly more complex force field optimization space: naive Large Language Models (LLMs) lack underlying simulation feedback, causing severe physical inaccuracies, while traditional Score Distillation Sampling (SDS) suffers from sluggish gradients, local optima entrapment, and a mathematical inability to dynamically switch discrete force fields. To address this, we propose PhysAgent, the first simulator-in-the-loop multi-agent framework that leverages multimodal inputs for automated, physically grounded 4D synthesis. By decoupling intrinsic materials from extrinsic dynamics, PhysAgent utilizes a Semantic Agent equipped with an externalized Force Field Skill module to master simulation rules and generate valid initializations. Subsequently, the Refine Agents, driven by Trajectory-Grounded Multi-Agent Feedback, leverage vision foundation models to extract dense point trajectories from rendered frames. By converting these explicit motion trajectories into structured textual descriptors, the agent harnesses LLM commonsense reasoning to execute zero-shot macroscopic leaps, effectively escaping local optima and dynamically switching discrete force fields. Extensive experiments demonstrate that PhysAgent rapidly generates stable, diverse physical scenes from arbitrary multimodal prompts, significantly outperforming existing baselines in both generation diversity and physical accuracy.

2606.08962 2026-06-09 cs.LG cs.CV cs.RO Cross-list

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

Cornelius Schröder, Žygimantas Marcinkus, Markus Lienkamp

Affiliations: Technical University of Munich; Institute for Automotive Engineering, Munich Institute of Robotics and Machine Intelligence, School of Engineering and Design

AI summary: Proposes a deployment-friendly strategy that uses uncertainty estimation and statistical testing to reduce false dynamic predictions for static objects, significantly lowering false alarms and unnecessary stops in real-world driving.

English abstract

Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
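
The two-sample z-test mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it compares the mean 1D position in an early and a late window, treats each detection's predicted (aleatoric) variance as independent, and uses a hypothetical critical value of 1.96. The names `window_stats`, `z_statistic`, and `is_moving` are invented for the example.

```python
import math


def window_stats(window):
    """Mean position and variance of that mean for [(position, variance), ...]."""
    n = len(window)
    mean = sum(position for position, _ in window) / n
    variance_of_mean = sum(variance for _, variance in window) / (n * n)
    return mean, variance_of_mean


def z_statistic(early, late):
    """Two-sample z statistic for the shift in mean position between windows."""
    mean_a, var_a = window_stats(early)
    mean_b, var_b = window_stats(late)
    return (mean_b - mean_a) / math.sqrt(var_a + var_b)


def is_moving(early, late, z_critical=1.96):
    """Flag motion only when the shift exceeds what the predicted noise explains."""
    return abs(z_statistic(early, late)) > z_critical


# Hypothetical detections: (position in metres, predicted variance in m^2).
jitter_early = [(10.0, 0.04), (10.1, 0.04), (9.9, 0.04)]
jitter_late = [(10.05, 0.04), (9.95, 0.04), (10.1, 0.04)]
moving_late = [(10.8, 0.04), (10.9, 0.04), (11.0, 0.04)]
print(is_moving(jitter_early, jitter_late), is_moving(jitter_early, moving_late))
```

Unlike a fixed speed threshold, the test scales with the detector's own uncertainty, so a noisy box that wobbles within its predicted variance is not reported as dynamic.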

2606.09451 2026-06-09 cs.RO cs.CV cs.LG Cross-list

Dense Force Estimation with an Event-based Optical Tactile Sensor

Agis Politis, René Zurbrügg, Valentina Cavinato

Affiliations: Sony Advanced Visual Sensing, Zurich, Switzerland; ETH Zürich

AI summary: Proposes the first method to reconstruct dense 3D force fields with an event camera, estimating surface displacements from event data and mapping them to forces, with a mean error of (0.14 N, 0.10 N, 0.93 N) at an operating rate of 100 Hz.

English abstract

Humans rely on spatially dense, geometry- and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Element Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.

2606.09569 2026-06-09 cs.RO cs.CV Cross-list

Distant Object Localisation from Noisy Image Segmentation Sequences

Julius Pesonen, Arno Solin, Eija Honkavaara

Affiliations: Research Council of Finland; RCF Flagship Forest–Human–Machine Interplay—Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences (UNITE)

AI summary: For distant object localisation, proposes two approaches, multi-view triangulation and particle filtering, the latter also providing shape and uncertainty estimates; combined with drone image segmentation and GNSS-based pose estimates, they enable reliable wildfire monitoring.

English abstract

3D object localisation based on a sequence of camera measurements is essential for safety-critical surveillance tasks, such as drone-based wildfire monitoring. Localisation of objects detected with a camera can typically be solved with specialised sensor configurations or 3D scene reconstruction. However, in the context of distant objects or tasks limited by the amount of available computational resources, neither solution is feasible. In this paper, we show that the task can be solved with either multi-view triangulation or particle filters, with the latter also providing shape and uncertainty estimates. We studied the solutions using 3D simulation and drone-based image segmentation sequences with global navigation satellite system (GNSS) based camera pose estimates. The results suggest that combining the proposed methods with pre-existing image segmentation models and drone-carried computational resources yields a reliable system for drone-based wildfire monitoring. The proposed solutions are independent of the detection method, also enabling quick adaptation to similar tasks. Code is available at https://fgi_nls.gitlab.io/public/distant-localisation
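
Multi-view triangulation, one of the two approaches named in this abstract, can be sketched as a least-squares intersection of viewing rays. This is a generic textbook formulation, not the authors' code: it assumes each observation is reduced to a camera position (for example from GNSS) and a direction towards the segmented object, and `triangulate` and `solve3` are hypothetical helper names.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; the point is not observable")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                for c in range(col, 4):
                    M[r][c] -= factor * M[col][c]
    return [M[i][3] / M[i][i] for i in range(3)]


def triangulate(origins, directions):
    """Point minimising the summed squared distance to all rays."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        length = math.sqrt(sum(c * c for c in d))
        d = [c / length for c in d]
        for i in range(3):
            for j in range(3):
                # (I - d d^T) projects onto the plane orthogonal to the ray.
                m = (1.0 if i == j else 0.0) - d[i] * d[j]
                A[i][j] += m
                b[i] += m * o[j]
    return solve3(A, b)


# Hypothetical drone positions observing a target at (10, 5, 0).
target = [10.0, 5.0, 0.0]
cameras = [[0.0, 0.0, 20.0], [30.0, 0.0, 20.0], [0.0, 40.0, 25.0]]
rays = [[t - c for t, c in zip(target, cam)] for cam in cameras]
estimate = triangulate(cameras, rays)
print(estimate)
```

With noisy segmentation the rays no longer meet in one point, and the same solve returns the closest point in the least-squares sense; it does not provide the shape and uncertainty estimates that the abstract attributes to the particle filter.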

2602.18020 2026-06-09 cs.CV cs.RO Version update

UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences; Shanghai Jiao Tong University; FiveAges

AI summary: Proposes the UAOR module, which detects uncertainty via action entropy and reinjects observation information at high-uncertainty language-model layers, improving VLA models on simulated and real-world tasks without extra training or data.

English abstract

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
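
The entropy-gated trigger described in this abstract can be sketched in a few lines. The abstract does not say how Action Entropy is computed, so this assumes it is the Shannon entropy of a softmax over per-layer action logits; the threshold value and the names `action_entropy` and `should_reinject` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_entropy(logits):
    """Shannon entropy (in nats) of the action distribution implied by the logits."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def should_reinject(logits, threshold):
    """Gate: reinject observation features only when the layer is uncertain."""
    return action_entropy(logits) > threshold


confident_layer = [10.0, 0.0, 0.0, 0.0]  # one action dominates
uncertain_layer = [0.0, 0.0, 0.0, 0.0]   # uniform over four actions
print(should_reinject(confident_layer, 1.0), should_reinject(uncertain_layer, 1.0))
```

Only the gating decision is shown; the attention-based retrieval into the next layer's FFN is specific to the method and is not reproduced here.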

2603.01613 2026-06-09 cs.CV Version update

Uncertainty-Aware Hierarchical Re-Localization in OpenStreetMap via Semantic Alignment

Yuchen Zou, Xiao Hu, Lihuang Fang, Yuqing Tang

Affiliations: International Digital Economy Academy; School of Automation Science and Engineering, Xi’an Jiaotong University; Department of Electronic and Electrical Engineering, Southern University of Science and Technology

AI summary: Proposes an uncertainty-aware hierarchical search framework that uses object-centric DINO-ViT tokens to narrow the cross-modal gap and achieves efficient localization through coarse FFT correlation and uncertainty-controlled local refinement, significantly outperforming existing methods in accuracy and speed.

Comments 7 pages, 4 figures

English abstract

Monocular re-localization enables robots to estimate camera poses from visual observations. However, many existing methods rely on dense maps or large reference image databases, which face scalability limitations and privacy risks. OpenStreetMap (OSM), as a lightweight privacy-preserving map, offers semantic and geometric information with global scalability. Nonetheless, OSM localization remains challenging due to cross-modal discrepancies between natural images and OSM, as well as the high cost of global map-based localization. In this paper, we propose an uncertainty-aware hierarchical search framework with semantic alignment for localization in OSM. First, object-centric DINO-ViT tokens are exploited to reduce the semantic gap between ground-view observations and OSM vectors. Second, global dense matching is decomposed into coarse FFT correlation and uncertainty-controlled local refinement. Extensive experiments demonstrate that our method significantly improves localization accuracy and speed. When trained on a single dataset, the 3° orientation recall of our method even outperforms the 5° recall of state-of-the-art methods.

2605.01799 2026-06-09 cs.CV Version update

QuoVLA: Quotient Space for Vision-Language-Action Models

Xuan Wang, Yinan Wu, Haoran Duan, Jungong Han

Affiliations: Department of Automation

AI summary: Challenging the view that pretrained VLM latents in VLA models lack action information, proposes QuoVLA, a quotient-space framework that compresses the latents into action-sufficient representations via a quantization module and a dual-branch design, improving generalization on multiple benchmarks.

English abstract

Vision-Language-Action (VLA) models commonly adapt pretrained Vision-Language Models (VLMs) to robot control by mapping visual observations and language instructions to continuous actions. Existing approaches typically take an action-insufficiency view, assuming that pretrained VLM latents either lack directly usable action information or should be shielded from action-learning signals. Against this view, our Quotient Theory for VLA shows that pretrained VLM latents are not action-insufficient but action-sufficient: they already contain the information needed for control, yet remain overcomplete by distinguishing prompt-level variations that induce the same optimal action behavior. To operationalize this theory, we propose QuoVLA, a quotient-space framework for VLA that compresses pretrained VLM latents into action-sufficient representations. Specifically, QuoVLA instantiates this principle with a quantization module and a dual-branch design with relative temporal-complexity regularization, preserving action-relevant information while removing prompt-level redundancy. Extensive experiments across multiple benchmarks demonstrate that QuoVLA achieves strong performance, with particularly notable improvements in generalization under visual, linguistic, and environmental distribution shifts. Our code will be made publicly available.

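2606.07431 2026-06-09 cs.CV Version update

OpenGlass: Ultra-Low-Power On-Device AI Eyewear with Event-based Vision

OpenGlass: Open-Source Smart Glasses for On-Device Event-Based Gesture Recognition

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

Affiliations: Department of Information Technology and Electrical Engineering, ETH Zürich

AI summary: Proposes OpenGlass, an open-source smart glasses platform with a modular design, event-driven power management, and a GAP9 RISC-V SoC for low-power on-device ML, reaching 83.94% cross-subject gesture recognition accuracy on the LynX dataset.

AI abstract

Smart glasses enable unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but are limited by power, memory, and compute constraints in a compact form factor. Open-source hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work presents an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without a full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.8 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out cross-validation, with an end-to-end latency of 33.9 ms on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 percentage points). All hardware designs, firmware, and models are released as open source.

English abstract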

Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.5 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 78.3 ms end-to-end inference latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.
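
A polarity-separated event histogram, the input representation named in this abstract, can be sketched as two per-pixel count grids, one for each event polarity. The `(x, y, polarity)` tuple layout, the tiny sensor size, and the function name `polarity_histogram` are assumptions made for illustration; the platform's actual event format is not given in the abstract.

```python
def polarity_histogram(events, width, height):
    """Count events per pixel in two channels: 0 = negative, 1 = positive polarity."""
    hist = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, polarity in events:
        if 0 <= x < width and 0 <= y < height:  # drop events outside the sensor
            channel = 1 if polarity > 0 else 0
            hist[channel][y][x] += 1
    return hist


# Hypothetical events on a 4x4 sensor; the last one falls outside and is ignored.
events = [(0, 0, 1), (0, 0, 1), (1, 2, 0), (5, 5, 1)]
hist = polarity_histogram(events, width=4, height=4)
print(hist[1][0][0], hist[0][2][1])
```

Stacking such histograms over consecutive time windows would give a video-like tensor of the kind a spatiotemporal network such as R(2+1)D consumes, though the windowing used in the paper is not stated here.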

2503.14229 2026-06-09 cs.AI cs.CV cs.RO Version update

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

Affiliations: University of Science and Technology of China

AI summary: Proposes HA-VLN 2.0, a unified benchmark that, through standardized tasks, the HAPS 2.0 dataset and simulators, benchmarking on 16,844 social instructions, and real-robot experiments, shows that explicit social modeling improves navigation robustness and reduces collisions.

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

English abstract

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

2508.00917 2026-06-09 cs.RO cs.CV cs.LG Version update

A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

Affiliations: Department of Electrical and Computer Engineering, University of Windsor; Department of Electrical and Computer Engineering, Queen’s University

AI summary: Surveys deep multi-task learning in connected autonomous vehicles, covering perception, prediction, planning, control, and V2X communication and resource management, analyzing the strengths and weaknesses of existing methods and pointing out future directions.

DOI: 10.1109/COMST.2026.3699223

English abstract

Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as perception, prediction, planning, and control, to ensure safe and reliable navigation in complex environments. Moreover, through vehicle-to-everything (V2X) communication, cooperative perception and driving among CAVs can be enabled, thereby mitigating the limitations of individual vehicles, while it also introduces stringent latency, reliability, and bandwidth constraints. Traditionally, tasks are addressed using separate models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focusing on deep MTL in CAVs. We begin with an overview of CAVs and MTL to provide foundational background. Then, we review MTL approaches across key functional domains in CAVs, including perception, prediction, planning, control, as well as V2X communications and radio resource management (RRM). For the first four domains, we categorize existing works under ego vehicle-only (onboard-only) and V2X-enhanced cooperative (multi-agent) paradigms. We further discuss V2X communications and RRM as communication-centric MTL problems. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide future research directions aimed at advancing MTL methodologies for CAV systems.

2512.07998 2026-06-09 cs.RO cs.CV 版本更新

DIJIT: A Robotic Head for an Active Observer

DIJIT: 面向主动观察者的机器人头部

Mostafa Kamali Tabrizi, Mingshi Chi, Bir Bikram Dey, Kelly Yuan, Markus D. Solbach, Yiqian Liu, Michael Jenkin, John K. Tsotsos

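Affiliations * Department of Electrical Engineering and Computer Science, York University

AI summary: Presents DIJIT, a binocular robotic head with nine mechanical and four optical degrees of freedom that reproduces human-like eye and head movements for active vision research; its saccade accuracy approaches that of humans.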

详情

DOI: 10.1109/LRA.2026.3682980
Journal ref: IEEE Robotics and Automation Letters, Vol. 11, No. 6, pp. 7038-7045, June 2026

AI中文摘要

我们提出DIJIT，一种新颖的双目机器人头部，专为作为主动观察者的移动智能体设计。DIJIT独特的功能广度使得主动视觉研究，以及对类人眼部和头颈运动、它们之间的相互关系及各自对视觉能力贡献的研究成为可能。DIJIT还被用于探索人类视觉如何利用眼/头运动解决视觉任务与当前计算机视觉方法之间的差异。DIJIT的设计具有九个机械自由度，而相机和镜头提供了额外的四个光学自由度。机械设计的运动范围和速度与人类性能相当。DIJIT达到了人类峰值扫视速度的85%。我们的设计包括会聚立体视觉所需的运动范围，即聚散运动（vergence）、同向运动（version）和眼球旋转扭转（cyclotorsion）。在这里，我们介绍DIJIT及其性能的某些方面。我们还提出了一种新颖的扫视相机运动方法，利用相机方向与电机值之间的直接关系。由此产生的扫视相机运动在准确性上接近人类运动，左相机和右相机的平均误差分别为1.17°和1.14°。

英文摘要

We present DIJIT, a novel binocular robotic head expressly designed for mobile agents that behave as active observers. DIJIT's unique breadth of functionality enables active vision research and the study of human-like eye and head-neck motions, their interrelationships, and how each contributes to visual ability. DIJIT is also being used to explore the differences between how human vision employs eye/head movements to solve visual tasks and current computer vision methods. DIJIT's design features nine mechanical degrees of freedom, while the cameras and lenses provide an additional four optical degrees of freedom. The ranges and speeds of the mechanical design are comparable to human performance. DIJIT attains 85% of the peak human saccade speed. Our design includes the ranges of motion required for convergent stereo, namely, vergence, version, and cyclotorsion. Here, we present DIJIT and some aspects of its performance. We also present a novel method for saccadic camera movements, using a direct relationship between camera orientation and motor values. The resulting saccadic camera movements are close to human movements in terms of their accuracy, with 1.17° and 1.14° mean error for the left and right cameras, respectively.

2602.21172 2026-06-09 cs.AI cs.CV 版本更新

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD: 一种无需推理的高数据效率视觉-语言-动作模型

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

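Affiliations * Applied Intuition; Texas A&M University; UC Berkeley

AI summary: Proposes NoRD, a model fine-tuned on <60% of the data with no reasoning annotations that uses the Dr. GRPO algorithm to overcome difficulty bias, achieving performance comparable to existing VLA models while markedly reducing data and compute costs.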

Comments Accepted to CVPR 2026. Code available at: https://github.com/Applied-Open-Source/nord

详情

AI中文摘要

视觉-语言-动作（VLA）模型通过统一的端到端架构取代模块化流水线，推动了自动驾驶的发展。然而，当前的VLA模型面临两个昂贵的要求：（1）大规模数据集收集，（2）密集的推理标注。在这项工作中，我们通过NoRD（无需推理驾驶）解决了这两个挑战。与现有的VLA模型相比，NoRD在仅使用<60%的数据且无需推理标注的情况下实现了竞争性能，从而减少了3倍的token数量。我们发现，当将标准组相对策略优化（GRPO）应用于在这种小规模、无推理数据集上训练的策略时，它未能产生显著的改进。我们表明，这种限制源于难度偏差，它不成比例地惩罚了GRPO中产生高方差rollout的场景的奖励信号。NoRD通过引入Dr. GRPO（一种旨在减轻LLM中难度偏差的最新算法）克服了这一限制。因此，NoRD在Waymo和NAVSIM上以极少的训练数据和零推理开销实现了竞争性能，从而实现了更高效的自主系统。网站：https://nord-vla-ai.github.io/

英文摘要

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data with no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
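
The difficulty bias mentioned in this abstract can be illustrated with the commonly described forms of the two advantage estimators. The sketch below is not the paper's training code: the function names and reward values are made up for illustration, and only the standard-deviation term is shown (Dr. GRPO, as usually described, also drops a response-length normalization that is omitted here).

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: centre the group's rewards, then divide by the group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantages: centre only, with no division by the group std."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Hypothetical rollout rewards for two driving scenarios (illustrative values only).
low_variance_group = [0.50, 0.52, 0.48, 0.50]   # rollouts barely differ
high_variance_group = [0.0, 1.0, 0.0, 1.0]      # rollouts differ a lot

for name, group in [("low-variance", low_variance_group),
                    ("high-variance", high_variance_group)]:
    print(name,
          "GRPO:", np.round(grpo_advantages(group), 3),
          "Dr. GRPO:", np.round(dr_grpo_advantages(group), 3))
```

With the std division, a 0.02 reward gap in the low-variance group yields a larger advantage than a 0.5 gap in the high-variance group, so the high-variance scenario is relatively down-weighted; without it, the advantages keep the original reward scale.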

2606.07648 2026-06-09 cs.CV cs.AI 新提交

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

AQIFormer：一种基于Transformer的多视角架构用于跨城市空气质量分类

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

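Affiliations * IIIT Hyderabad

AI summary: Proposes AQIFormer, a transformer-based ensemble architecture that uses front-rear view fusion, weather-aware attention, and multi-task learning to reach 89.96% accuracy in cross-city air quality classification, a 14.96% improvement over existing methods.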

Comments Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

详情

DOI: 10.1145/3774521.3774577

AI中文摘要

空气污染是全球最严峻的环境和公共卫生挑战之一，传统的基于传感器的监测系统面临显著的可扩展性和经济性限制。基于图像的空气质量估计已成为一种有前景的替代方案，利用交通场景中大气污染物的视觉特征。然而，现有方法存在跨城市泛化能力有限以及对多视角信息利用不足的问题。我们提出AQIFormer，一种新颖的基于Transformer的集成架构，通过创新的双视图融合、天气感知注意力机制和全面的多任务学习来解决这些根本性限制。我们的方法独特地将前后交通图像与气象参数相结合，以实现跨不同城市环境的稳健空气质量分类。在包含26,678个同步前后图像对的综合数据集上进行的大量评估表明，该模型性能良好，准确率达到89.96%，比现有最优方法提高了14.96%。最重要的是，我们的模型保持了出色的跨城市泛化能力，在印度那格浦尔收集的独立数据集上达到81.67%的准确率，通过少量样本自适应仅用极少的训练样本，性能下降仅为8.29%。

英文摘要

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates good performance with 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities, achieving 81.67% accuracy on an independent dataset collected in Nagpur, India with only 8.29% performance degradation using few-shot adaptation with minimal training samples.

2606.07766 2026-06-09 cs.CV cs.AI 新提交

Quantum-Enhanced Similarity Measures for Polarimetric Materials Classification

量子增强的极化材料分类相似度度量

Sara Shojaei, Seyed Mohamad Ali Tousi, Emma Bennett, Param Sangani, Ali Shiri Sichani, Ilker Ersoy, Hadi Ali-Akbarpour, Filiz Bunyak, G. N. DeSouza

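Affiliations * University of Cambridge

AI summary: Proposes a quantum-classical hybrid pipeline that casts polarimetric material classification as a point-matching problem and uses the SWAP test to estimate the fidelity between embedding vectors, achieving competitive classification accuracy and open-set discrimination ability.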

详情

AI中文摘要

我们提出了一种用于极化材料分类的量子-经典混合流水线，将其视为点匹配问题。包含偏振光反射的体素立方体用于训练编码器，为立方体的体素生成32维嵌入。在推理时，丢弃编码器头部，将嵌入编码为量子态的概率幅。然后，SWAP测试电路估计查询立方体的每个32D嵌入与锚点立方体数据集之间的保真度。聚合的保真度作为材料相似度分数，具有最高聚合保真度的锚点类别被视为查询材料的类别。我们在一个包含23种材料（每种约800个样本）的数据集上评估了我们的方法，这些材料来自其Mueller矩阵。比较了所提出的量子SWAP测试的点匹配方法和使用最优传输的经典分类器。我们的结果展示了竞争性的分类精度以及开放集判别潜力，使其成为基于NISQ的材料识别的可行途径。

英文摘要

We present a quantum-classical hybrid pipeline for polarimetric material classification that casts this as a point-matching problem. Voxel cubes, containing polarized light reflections, are used to train an encoder to produce 32-dimensional embeddings for the voxels of the cubes. At inference, the encoder head is discarded and the embeddings are encoded as probability amplitudes of quantum states. Next, a SWAP-test circuit estimates the fidelity between each of the 32D embeddings from the query cube and a dataset of anchor cubes. The aggregated fidelity serves as a material similarity score, and the class of the anchor with the highest aggregated fidelity is taken as the class of the queried material. We evaluate our approach on a dataset of 23 materials (≈800 samples each) derived from their Mueller matrices. The point-matching approaches from the proposed quantum SWAP-test and a classical classifier using Optimal Transport are compared. Our results demonstrate competitive classification accuracy alongside open-set discrimination potential, establishing it as a viable path toward NISQ-based material recognition.
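
The fidelity estimate described in this abstract rests on a standard property of the SWAP test: the ancilla is measured as 0 with probability (1 + F) / 2, where F = |⟨a|b⟩|² is the fidelity of the two input states. The sketch below simulates that sampling classically with NumPy; it is not the authors' circuit, and the function names, shot count, and random embeddings are assumptions for illustration. A 32-dimensional embedding would occupy 5 qubits per register under amplitude encoding.

```python
import numpy as np

def amplitude_encode(x):
    """Normalise a real vector so its entries can serve as state amplitudes."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return x / norm

def fidelity(a, b):
    """Exact fidelity |<a|b>|^2 between two amplitude-encoded real vectors."""
    return float(np.dot(amplitude_encode(a), amplitude_encode(b)) ** 2)

def swap_test_estimate(a, b, shots=4096, seed=0):
    """Simulate SWAP-test sampling, where P(ancilla = 0) = (1 + F) / 2."""
    p0 = min(1.0, 0.5 * (1.0 + fidelity(a, b)))  # clip against rounding above 1
    rng = np.random.default_rng(seed)
    zeros = rng.binomial(shots, p0)               # number of shots reading 0
    return 2.0 * zeros / shots - 1.0              # invert P(0) to recover F

# Hypothetical 32-D embeddings standing in for a query voxel and an anchor voxel.
rng = np.random.default_rng(1)
query, anchor = rng.normal(size=32), rng.normal(size=32)
print("exact fidelity:  ", fidelity(query, anchor))
print("sampled estimate:", swap_test_estimate(query, anchor))
```

Because the estimate comes from a finite number of shots, its error shrinks roughly as one over the square root of the shot count, which is one reason aggregating fidelities over many voxels can help.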

2606.08612 2026-06-09 cs.CV 新提交

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

深度学习时代的面部表情识别：方法、模型、数据集、性能、挑战与未来研究方向的多准则系统综述

Spyridon Georgiou, Aggelos Psiris, Spyridon Evangelatos, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

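Affiliations * International Hellenic University; University of Thessaly; Democritus University of Thrace; University of Peloponnese; Harokopio University of Athens

AI summary: A systematic review of recent advances in deep learning-based facial expression recognition that proposes a five-phase evolution framework and a multi-criteria taxonomy, analyzes strengths and limitations along seven axes, and summarizes datasets, performance comparisons, and future challenges.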

详情

AI中文摘要

面部表情识别（FER）在过去十年中取得了快速发展，这得益于从手工特征和浅层分类器向深度卷积、注意力机制、视觉语言和基础模型架构的转变，以及大规模野外基准测试的并行增长，这些基准涵盖了分类、维度、复合、微表情、动作单元（AU）和强度估计任务。然而，基于深度学习的FER领域迄今为止仅在狭窄的任务、架构或应用特定轴线上被综述，缺乏对其近期进展的整体、系统组织的描述。本综述通过全面回顾近期基于深度学习的FER，并明确将其与更广泛的面部情感识别（FAR）领域联系起来，填补了这一空白。其主要贡献包括：a) 描述了FER演变为五个不同阶段的过程，从手工特征和经典机器学习到注意力机制、视觉语言和基础模型方法，并给出了每个阶段的关键里程碑工作；b) 一个多准则分类法，沿七个互补轴分析文献：识别任务、输入模态、面部预处理流程、网络架构、学习策略、采集设置和应用领域；c) 按准则进行比较分析，深入洞察每个类别在野外条件下的优势和局限性；d) 按任务组织的公共FER数据集综述，包括其标注方案、模态和评估协议；e) 性能指标汇编以及代表性最先进方法在广泛采用的基准上的按任务定量比较；f) 当前挑战和有前景的未来方向的讨论。

英文摘要

Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: a) A description of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each, b) A multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain, c) A per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions, d) A task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols, e) A compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks, and f) A discussion of current challenges and promising future directions.

2606.08826 2026-06-09 cs.CV astro-ph.GA 新提交

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

使用Inception和残差CNN对Galaxy10 DECals数据集中的星系进行分类

Lanz Anthonee A. Lagman, Prospero C. Naval, Reinabelle C. Reyes

Affiliations * University of the Philippines - Diliman; Department of Computer Science, College of Engineering, University of the Philippines - Diliman; National Institute of Physics, College of Science, University of the Philippines - Diliman

AI summary: Compares ResNet101 and InceptionV4 on galaxy morphology classification; both reach about 90% accuracy, with ResNet101 performing better, suggesting that either CNN architecture could serve as a robust foundation for classifying galaxy images from future surveys.

Comments 4 pages, 3 figures, 2 tables, published in Proceedings of the 42nd Samahang Pisika ng Pilipinas Physics Conference (SPP 2024)

详情

Journal ref: Proc. Samahang Pisika Pilipinas 42, SPP-2024-2E-05 (2024)

AI中文摘要

关于星系形态的图像数据预计在未来几年内将在数量和质量上都有所增加；因此，探索哪些适用于图像分类任务的深度学习架构具有成本效益非常重要。残差网络和Inception网络因其计算效率而成为探索分类卷积神经网络（CNN）的理想选择，这得益于残差连接和并行化Inception模块等技术，使得网络能够更深而不显著增加计算复杂度。在这项工作中，我们分析了ResNet101和InceptionV4在空间增强的Galaxy10 DECals数据集上的性能。保留星系的十类分类，我们修改了每个类别的图像数量。我们发现ResNet101和InceptionV4模型达到了约90%的准确率，与文献中报告的性能相当。在性能指标方面，ResNet101优于InceptionV4。我们的结果表明，这两种CNN架构中的任何一种都可以作为即将到来的巡天中星系图像分类专用管线的稳健基础。

英文摘要

Image data regarding galactic morphology is expected to increase in both quantity and quality in the coming years; thus it is important to explore which deep learning architectures adapted for image classification tasks are cost-effective. Residual and Inception networks are ideal for exploring classification convolutional neural networks (CNNs) due to their computational efficiency, achieved through techniques such as residual connections and parallelized inception modules, enabling deeper networks without excessively increasing computational complexity. In this work, we analyze the performance of ResNet101 and InceptionV4 on a spatially-augmented Galaxy10 DECals dataset. Retaining the ten-class classification of galaxies, we modify the image count of each class. We find that ResNet101 and InceptionV4 models achieved accuracies of ~90%, comparable with reported performance in the literature. In terms of performance metrics, ResNet101 is superior to InceptionV4. Our results indicate that either of these CNN architectures could serve as a robust foundation for specialized pipelines for classification of galaxy images from upcoming surveys.

2606.08918 2026-06-09 cs.CV 新提交

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

当视觉误导时，让位置说话：基于位置注意力机制和大多模态模型的全球图像地理定位方法

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo

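Affiliations * Henan Key Laboratory of Cyberspace Situation Awareness; Information Engineering University

AI summary: Proposes TransGeoCLIP, a framework that uses a location attention mechanism and large multimodal models to address geo-localization errors caused by visually similar images, significantly improving localization accuracy on multiple benchmarks.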

Comments Submitted to IEEE Transactions on Multimedia in March 2026

详情

AI中文摘要

全球图像地理定位旨在确定图像在全球范围内的拍摄位置。现有方法通常通过将图像与来自不同地理区域的视觉相似场景匹配而导致定位错误，限制了实际应用中的可靠性。为解决此问题，我们提出TransGeoCLIP，一种新颖的基于检索的框架，集成了位置注意力机制和大规模多模态模型（LMMs）。使用带有位置注意力的Transformer编码器对GPS坐标进行编码，TransGeoCLIP能够有效区分视觉相似图像中的地理特征。该框架包括两个阶段：1）检索数据库构建，采用配备位置注意力机制的Transformer对标记的GPS坐标进行编码并增强位置语义，随后通过CLIP实现图像-文本-GPS联合嵌入；2）检索增强推理，利用LMMs从检索到的数据库结果中推断最终图像位置预测。在包括IM2GPS、IM2GPS3k、YFCC4k和YFCC26k在内的多个数据集上的广泛实验结果表明，TransGeoCLIP显著提升了视觉相似图像的定位性能。特别是，街道级定位精度（误差在1公里内）大幅提升，在这些基准上分别超过最先进方法1.5%、1.07%、7.18%和9.75%。

英文摘要

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). Using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) Retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and subsequently enables joint image-text-GPS embedding through CLIP; 2) Retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
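
The street-level metric quoted in this abstract (accuracy within 1 km error) is typically computed from the great-circle distance between predicted and true coordinates. The sketch below shows one common way to do that with the haversine formula; the function names, the Earth-radius constant, and the sample coordinates are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def accuracy_within(predictions, ground_truth, threshold_km=1.0):
    """Fraction of predictions whose error is at most threshold_km."""
    hits = sum(
        haversine_km(*pred, *true) <= threshold_km
        for pred, true in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)

# Hypothetical predictions: the first is a few hundred metres off, the second about 111 km off.
preds = [(48.8584, 2.2945), (40.0, -74.0)]
truth = [(48.8600, 2.2950), (41.0, -74.0)]
print(accuracy_within(preds, truth, threshold_km=1.0))
```

The same helper can be reused with larger thresholds to report accuracy at coarser scales.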

2606.09353 2026-06-09 cs.CV cs.AI 新提交

Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning

超越人类：使用迁移学习的多物种动物面部识别

Maria De Marsico, Anil K. Jain, Annalaura Miglino

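Affiliations * Sapienza University of Rome; Michigan State University; University of Salerno

AI summary: Studies transfer learning (FaceNet and Vision Transformer) for multispecies animal face recognition, validated on dog, primate, and cattle datasets; dogs are recognized most accurately (96.85% mean verification accuracy), and in some settings the approach surpasses existing methods.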

Comments This paper extends the work published in the proceedings of CAIP 2025 conference: 'Adapting to the Wild: From Human Face to Animal Face Recognition' by De Marsico, M., Jain, A. K., Miranda, M., & Orlando, A

详情

AI中文摘要

个体动物识别可用于寻找丢失或被盗的宠物、追踪濒危物种个体以及识别拥挤农场中的动物。目前的识别技术主要使用物理设备（如微芯片），通常不切实际且难以应用。这些可以通过动物面部进行远程识别来替代；如果足够准确，它具有多个优势：非侵入性、可远距离工作、难以伪造，例如在食品工业中用病畜替换健康畜的情况。现有的少数数据集具有足够的每个主体图像并标注了单个动物身份，但不足以训练当前的深度学习架构。我们转而研究迁移学习的可能性，利用预训练网络模型作为骨干。我们的实验比较了专门在大型人脸数据库上训练的FaceNet和在ImageNet（即对象类别）上预训练的Vision Transformer（ViT）。我们使用了三种非常不同的动物的面部数据集：狗、灵长类（狐猴、金丝猴和黑猩猩）和牛。我们报告了结果，并对每个数据集与当前最优（SOTA）专门训练的深度网络进行了比较。三个数据集的捕获条件不同。图像质量（分辨率、运动模糊、不同姿态等）从狗到牛到灵长类依次下降。最佳性能在狗上实现，ViT达到了96.85%的平均验证准确率和84.34%的Rank-1识别率。濒危灵长类的结果仍然令人鼓舞，但性能因动物类别和任务（验证或识别）而异，并不总是优于SOTA。对于牛，ViT结果优于SOTA，而FaceNet仍然具有竞争力。

英文摘要

Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.

2606.07628 2026-06-09 cs.CY cs.CV 交叉投稿

Frankenstein in the Pipeline: Computational Epistemicide in Facial Recognition

管道中的弗兰肯斯坦：面部识别中的计算性知识灭绝

Nina da Hora

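Affiliations * Universidade Estadual de Campinas; Instituto da Hora

AI summary: Drawing on Mary Shelley's Frankenstein as a methodological framework, the paper analyzes how embedding-based facial recognition progressively reduces the face to data through detection, landmarking, alignment/frontalization, and embedding, enacting "computational epistemicide", and argues for abolition as a normative stance.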

Comments Accepted to ACM FAccT 2026. Author's version. 17 pages, 2 figures

详情

DOI: 10.1145/3805689.3812284

AI中文摘要

虽然计算机视觉的优生学根源在批判性技术研究中已有充分记载，但较少关注这种暴力在管道层面实施的操作机制。本文借鉴玛丽·雪莱的《弗兰肯斯坦》，不是作为意外后果的隐喻，而是作为方法的诊断框架：拆解、重构，以及通过制造程序断言其合法性的造物。我认为，基于嵌入的面部识别实施了我所谓的计算性知识灭绝，这是Sueli Carneiro的知识灭绝概念在计算领域的延伸——通过摧毁作为活生生的关系性表面的面部，并授权数值代理作为身份的特权场所。在检测/裁剪、地标定位、对齐/正面化和嵌入过程中，面部逐渐被缩小到可以稳定为数据的部分，产生一个规范的面部作为可读性的条件，以及相应的形式主体作为识别的条件。向量化完成了弗兰肯斯坦式的“缝合”：被解剖的面部被重新组装成一个固定维度的制品，旨在跨数据库和机构流通。然后，我展示了基于距离的相似性和阈值如何将“足够接近”的规范操作化，使识别与标准化密不可分，并使改良主义的“伦理AI”优化在结构上不足。本文最后主张废除主义作为规范性立场：拒绝将向量化身份作为权利和访问的合法基础，并拆除通过可剖析的数据点来治理人类生活的制度冲动。

英文摘要

While the eugenic roots of computer vision are well-documented in critical technology studies, less attention has been paid to the operational mechanisms through which this violence is enacted at the level of the pipeline. This paper employs Mary Shelley's Frankenstein not as a metaphor for unintended consequences, but as a diagnostic framework for method: disassembly, reconstruction, and the production of a creature whose legitimacy is asserted by the procedure that made it. I argue that embedding-based facial recognition enacts what I call computational epistemicide (an extension of Sueli Carneiro's concept of epistemicide to the computational domain) by destroying the face as a living, relational surface and authorizing a numerical proxy as the privileged site of identity. Across detection/cropping, landmarking, alignment/frontalization, and embedding, the face is progressively narrowed to what can be stabilized as data, producing a canonical face as the condition of legibility and a corresponding form-subject as the condition of recognition. Vectorization completes the Frankensteinian "stitching": the dissected face is reassembled into a fixed-dimensional artifact designed to circulate across databases and institutions. I then show how distance-based similarity and thresholding operationalize a norm of "close enough," making recognition inseparable from standardization and rendering reformist "ethical AI" optimization structurally insufficient. The paper concludes by arguing for abolition as a normative stance: refusing vectorized identity as a legitimate basis for rights and access, and dismantling the institutional impulse to govern human life through dissectible data points.

2605.20735 2026-06-09 cs.CV cs.LG 版本更新

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

降低参与IREX的门槛：用于虹膜识别的开源算法、工具包和基准测试

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka

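Affiliations * University of Notre Dame

AI summary: Proposes two new open-source iris recognition algorithms, both neural networks, trained with triplet loss and batch-hard triplet mining (TripletIris) and with ArcFace loss (ArcIris), and provides Python and IREX-compliant C++ implementations for submission to the official IREX X program. The work aims to evaluate open-source iris recognition solutions under IREX testing protocols for the first time and to offer a model C++ submission that makes it much easier for other teams' open-source methods to enter IREX evaluation. It also provides open-source, IREX-compatible C++ implementations of two existing methods: an iris image filtering-based algorithm using human saliency-driven kernels (HDBIF) and a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). Apart from CRYPTS, which faces time constraints in 1:N search, the methods have passed the official IREX X evaluation and were evaluated on several popular academic benchmarks. Finally, the paper provides open-source iris segmentation and circle estimation models that can be used in any new iris recognition method.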

详情

AI中文摘要

本文提出了两种新的开源虹膜识别算法，提供了Python和符合IREX标准的C++实现，用于提交官方IREX X计划。本研究有两个主要目标：（a）首次根据IREX测试协议评估开源虹膜识别解决方案；（b）提供一个模型C++提交，显著促进其他团队的开源方法进入IREX评估。新方法包括两个神经网络，分别使用三元组损失与批量硬三元组挖掘（TripletIris）和ArcFace损失（ArcIris）。本文还提供了两种现有方法的开源IREX兼容C++实现：（a）基于虹膜图像滤波、利用人类显著性驱动滤波核的算法（HDBIF）；（b）用于检测和比较Fuchs隐窝（Fuchs' crypts）的人类可解释算法（CRYPTS）。除了CRYPTS在1:N搜索中面临时间限制外，这些方法已通过官方IREX X评估，并在多个流行学术基准上进行了评估：Quality-Face/Iris Research Ensemble、Warsaw-Biobase Post-Mortem Iris、CASIA-Iris-Thousand-V4、CASIA-Iris-Lamp-V4、IIT Delhi Iris Database、IIITD Contact Lens Iris Database、NDIris3D和Notre Dame Variable Iris Image Quality Release 2。最后，本文还提供了可用于任何新虹膜识别方法的虹膜分割和圆圈估计开源模型。

英文摘要

NIST Iris Exchange (IREX) offers an appealing solution to evaluating new open-source iris recognition algorithms, but it presents high barriers to entry because these algorithms must be written in C++, using a specific API, and adapted to meet strict IREX speed and memory constraints. The main goal of this paper is to lower these barriers and advance open-source iris recognition large-scale evaluations by offering: (a) two new modern deep learning-based open-source iris matchers (ArcIris and TripletIris), along with their C++ IREX X-compliant implementations, which are the first open-source iris recognition methods included into the IREX X leaderboard (and thus IREX-vetted), as well as new segmentation and iris circular approximation models that can be incorporated into any new iris recognition method, and (b) a performance assessment (according to IREX X testing protocols) of all major and currently available open-source iris recognition solutions. The paper also provides Python implementations of the new ArcIris and TripletIris methods and discusses the differences one may encounter between C++ and Python implementations of the same conceptually equivalent approaches. Finally, the paper offers open-source, IREX X-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). In addition to IREX X evaluation results, the paper reports the performance of all methods on major academic benchmarks: Quality-Face/Iris Research Ensemble (Q-FIRE), Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2 (VII-Q-R2).
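
The AI-generated abstract for this entry describes TripletIris as trained with a triplet loss and batch-hard triplet mining. A generic NumPy sketch of that loss follows; it is not the authors' implementation, and the function name, margin, distance metric, and toy embeddings are assumptions chosen for illustration.

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """Mean over anchors of max(0, margin + hardest positive - hardest negative)."""
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    same = y[:, None] == y[None, :]
    pos_mask = same & ~np.eye(len(y), dtype=bool)  # same identity, excluding self
    neg_mask = ~same                               # different identity
    # Hardest positive: farthest same-identity sample in the batch.
    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    # Hardest negative: closest sample of any other identity.
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    # Only anchors with at least one positive and one negative contribute.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0
    losses = np.maximum(0.0, margin + hardest_pos[valid] - hardest_neg[valid])
    return float(losses.mean())

# Toy 2-D embeddings for four samples.
emb = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
print(batch_hard_triplet_loss(emb, [0, 0, 1, 1], margin=0.5))  # identities well separated
print(batch_hard_triplet_loss(emb, [0, 1, 0, 1], margin=0.5))  # identities interleaved
```

Mining the hardest pairs inside each batch avoids enumerating all triplets while still focusing the gradient on the cases the embedding currently gets wrong.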

2606.00706 2026-06-09 cs.CV 版本更新

CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval

CR-JEPA：用于遥感图像检索的跨模态联合嵌入预测学习

Md Aminur Hossain, Ayush V. Patel, Nitant Dube, Biplab Banerjee

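Affiliations * Space Applications Centre, Indian Space Research Organisation; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay

AI summary: Proposes CR-JEPA, an architecture that combines modality-specific stems, a shared transformer trunk, and JEPA predictive objectives to achieve cross-modal semantic alignment and same-modal neighbourhood preservation, significantly improving cross-modal retrieval on BEN-14K and other datasets.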

Comments 24 pages

详情

AI中文摘要

跨模态遥感图像检索旨在跨异构传感模态检索语义相关的场景。由于配对观测在成像物理、空间分辨率、光谱配置和视觉外观上可能存在显著差异，这仍然具有挑战性。此外，单一目标训练的检索投影可能不足以同时支持跨模态语义对齐和同模态邻域保持。我们提出了CR-JEPA，一种用于双模态遥感检索的跨模态检索联合嵌入预测架构。该模型使用模态特定主干、共享Transformer主干和JEPA风格的预测目标来估计模态内和跨模态的掩码潜在目标特征。受LeJEPA启发，我们对原始检索投影应用素描各向同性高斯正则化以稳定嵌入并缓解崩溃。CR-JEPA进一步采用解耦头设计，包括用于同模态检索的统一检索头和用于跨模态搜索的跨模态检索头。我们在BEN-14K、CBRSIR_VS和DSRSID上评估CR-JEPA。在BEN-14K上，与X-JEPA相比，CR-JEPA将S1到S2检索从61.23%提升至75.82%，S2到S1检索从63.73%提升至75.40%，同时以更少的参数实现了有竞争力的同模态检索。

英文摘要

Cross-modal remote sensing image retrieval aims to retrieve semantically related scenes across heterogeneous sensing modalities. This remains challenging because paired observations may differ substantially in imaging physics, spatial resolution, spectral configuration, and visual appearance. Moreover, a single retrieval projection trained with one objective may be insufficient to jointly support cross-modal semantic alignment and same-modal neighbourhood preservation. We propose CR-JEPA, a Cross-modal Retrieval Joint-Embedding Predictive Architecture for dual-modality remote sensing retrieval. The model uses modality-specific stems, a shared transformer trunk, and JEPA-style predictive objectives to estimate masked latent target features within and across modalities. Inspired by LeJEPA, we apply Sketched Isotropic Gaussian Regularization to raw retrieval projections to stabilize embeddings and mitigate collapse. CR-JEPA further employs a decoupled-head design with a unified retrieval head for same-modal retrieval and a cross-modal retrieval head for cross-modal search. We evaluate CR-JEPA on BEN-14K, CBRSIR_VS, and DSRSID. On BEN-14K, CR-JEPA improves S1 to S2 retrieval from 61.23% to 75.82% and S2 to S1 retrieval from 63.73% to 75.40% over X-JEPA, while also achieving competitive same-modal retrieval with fewer parameters.

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML 版本更新

Generalizing Geometry-Guided Mamba as a Plug-and-Play Context Module for CNN Semantic Segmentation

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

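Affiliations * Tamkang University

AI summary: Uses geometry-guided Mamba (G-Mamba) as a plug-and-play context aggregation module to replace the context heads of six CNN segmentation networks, obtaining consistent mIoU gains on Cityscapes with a small amount of extra computation.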

详情

AI中文摘要

基于CNN的语义分割网络通常依赖上下文头（如ASPP、PPM或注意力模块）来扩大感受野。这些头有效但可能引入大量计算、内存开销或边界泄漏。本文重新审视DGM-Net中的方向几何Mamba（G-Mamba），并将其作为即插即用的上下文聚合模块，而非全新的分割架构。关键思想是将几何引导注入选择性扫描过程，使长程特征传播能够由边界和向心流线索调制。我们替换了六种代表性CNN分割模型（包括DeepLabV3+、DANet、CCNet、PSPNet、PSANet和OCRNet）的原始上下文头，同时保持ResNet-101骨干网络不变。在Cityscapes上的结果表明，在$1024\times1024$分辨率下，仅增加适度的额外GFLOPs即可获得一致的mIoU提升，表明几何引导的SSM模块可以作为传统CNN上下文头的实用替代或增强。

英文摘要

CNN-based semantic segmentation networks usually rely on context heads such as ASPP, PPM, or attention modules to enlarge the receptive field. These heads are effective but may introduce heavy computation, memory cost, or boundary leakage. This paper revisits Directional Geometric Mamba (G-Mamba) from DGM-Net and studies it as a plug-and-play context aggregation module rather than a complete new segmentation architecture. The key idea is to inject geometric guidance into the selective scan process, allowing long-range feature propagation to be modulated by boundary and centripetal-flow cues. We replace the original context heads of six representative CNN segmentation models, including DeepLabV3+, DANet, CCNet, PSPNet, PSANet, and OCRNet, while keeping the ResNet-101 backbone unchanged. Results on Cityscapes show consistent mIoU gains with only moderate extra GFLOPs at $1024\times1024$ resolution, suggesting that geometry-guided SSM modules can serve as practical alternatives or enhancements to conventional CNN context heads.

2606.08906 2026-06-09 cs.CV 新提交

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

DifferSeg: 通过差分感知与频率引导实现多样化的多模态二值分割

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao

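Affiliations * School of Artificial Intelligence, Jiangxi Normal University; Institute of AI Education, East China Normal University; Yale School of Medicine, Yale University

AI summary: Proposes the DifferSeg framework, which adaptively aligns multimodal features with a differential perception fusion module and balances high- and low-frequency representations with a frequency-guided decoder, surpassing 67 methods on 29 public datasets.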

详情

AI中文摘要

在许多二值分割任务中，大多数多模态方法依赖于固定的特征拼接进行跨模态交互，以及由低频语义主导的简单解码器设计。然而，它们忽略了两个关键挑战：一是缺乏处理模态差异和互补性的自适应机制，二是缺少平衡高低频表示的高效解码策略。在这项工作中，我们提出了一个简单而通用的多模态二值分割框架，称为DifferSeg，以同时解决这两个问题。借助差分感知融合（DPF）模块，DifferSeg使用可学习的差分算子自适应地对齐多模态特征，并通过残差融合增强其互补性，有效缓解模态不匹配和融合冗余。此外，我们设计了一个频率引导解码器（FGD），构建跨频率交互和多路径上采样，以保持细节高频结构与语义低频表示之间的一致性，确保细粒度边界恢复和噪声抑制。得益于这些设计，DifferSeg可以轻松泛化到各种二值分割任务，包括自然和医学模态。无需额外技巧，它在涉及18个下游任务的29个公开数据集上持续超越67种最先进方法，展示了卓越的泛化能力和分割精度。代码和预训练模型将在链接处提供。

英文摘要

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be available at the Link.


2606.08920 2026-06-09 cs.CV cs.AI 新提交

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

PolyBuild: 一种从高分辨率遥感图像中提取多边形建筑物轮廓的端到端方法

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

发表机构 * China University of Petroleum (East China)（中国石油大学（华东））； South Surveying&Mapping Instrument Co.,Ltd.（南方测绘仪器有限公司）； China Railway Design Corporation（中国铁路设计集团有限公司）

AI总结提出端到端方法PolyBuild，通过初始轮廓生成模块和轮廓优化模块直接从遥感图像提取矢量多边形建筑物轮廓，无需后处理，性能优于现有方法。

Comments Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)


AI中文摘要

从高分辨率遥感图像中提取建筑物多边形轮廓是各种地图应用的基本任务。然而，不同的成像条件和复杂的建筑结构使得自动轮廓提取极具挑战性。主流的建筑物提取方法通常依赖于像素级分割，随后进行多个后处理步骤以生成建筑物轮廓，这计算量大且容易出错。在本文中，我们提出了一种名为PolyBuild的端到端方法，该方法可以直接从高分辨率遥感图像中提取建筑物矢量多边形，无需任何后处理操作。该方法利用两个主要模块：初始轮廓生成模块（ICGM）和轮廓优化模块（COM）。ICGM通过利用每个建筑物实例的拼接子区域中心特征来生成初始建筑物轮廓。它通过生成边界框并使用四个子区域的中心特征来表示每个建筑物，同时进行目标检测和初始轮廓提取。轮廓优化模块（COM）通过在基于Transformer的解码器中迭代集成卷积神经网络（CNN）特征和轮廓位置信息，进一步细化生成的建筑物轮廓。混合CNN-Transformer架构有效捕获建筑物轮廓内的局部和全局空间关系，确保高质量的边界描绘。在三个建筑物数据集上进行了大量实验以评估PolyBuild的性能。结果表明，PolyBuild显著优于最先进的方法，包括基于掩码和基于轮廓的方法。

看得更多，匹配更好：用于双视图对应学习的多源特征融合

Xiaojie Li, Xin Jiang, Luanyuan Dai, Jinnan Yang, Yongdong Zhang, Zechao Li

发表机构 * Nanjing University of Science and Technology（南京理工大学）； People’s Daily Online（人民网）； University of Science and Technology of China（中国科学技术大学）

AI总结提出TriMatch框架，融合几何、纹理语义和结构语义特征，通过语义引导调制和层次细化，提升重复结构等场景下的对应点鉴别能力。

Comments Correspondence Learning, Multi-Source Feature Fusion, Outlier Removal, Camera Pose Estimation


AI中文摘要

双视图对应学习旨在通过利用图像对中真假对应点的内在差异来区分内点和外点。现有方法主要依赖于基于坐标的几何一致性。然而，在包含重复结构、无纹理区域或局部相似几何模式的场景中，它们常常难以处理伪一致的外点。为了解决这一限制，我们提出了TriMatch，一个用于双视图对应学习的多源特征融合框架，由特征提取和特征细化两部分组成。在特征提取中，TriMatch联合提取几何、纹理语义和结构语义特征，为对应点判别提供互补证据。为了弥合语义特征与几何特征之间的差距，纹理和结构语义特征分别通过专用的纹理-几何对齐和结构-几何对齐模块与几何特征对齐。我们进一步引入了语义引导的对应点调制模块，该模块利用语义信息调制几何特征，以抑制几何上合理但语义上不一致的对应点。在特征细化中，层次化语义增强的对应点细化策略逐步建模对应点依赖关系并重新校准多上下文特征响应，从而实现更可靠的内点-外点判别。大量实验证明了TriMatch的有效性、鲁棒性和泛化能力。

英文摘要

Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.


2606.09303 2026-06-09 cs.CV 新提交

Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

再思考：通过候选发现与比较推理进行分割

Xinyan Gao, Haoran Hao, Xiangyu Yue

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Nanjing University（南京大学）

AI总结提出两阶段框架Rea2Seg，先基于注意力图生成候选掩码，再用多模态大语言模型推理评分，将分割转化为候选发现与判别选择，并引入新基准ReasonSeg-SGDR全面评估感知、定位与推理能力。

Comments Project page: https://snowball521.github.io/Rea2Seg-Project/


AI中文摘要

预训练基础模型的快速发展使得更通用的图像分割成为可能。多模态大语言模型（MLLMs）已被广泛探索用于需要高级推理的复杂查询的图像分割。尽管取得了有希望的进展，现有方法通常受限于有限的训练数据以及MLLMs与掩码生成模块之间的差距。为了更好地将MLLMs的感知和推理能力迁移到复杂的基于推理的分割任务，我们提出了一个两阶段框架Rea2Seg用于掩码生成和选择。具体来说，该框架首先基于分割MLLM的注意力图识别潜在区域作为候选掩码。然后，它利用MLLM对问题和候选掩码进行推理，并为每个掩码分配分数。最终的分割结果通过对候选掩码重新排序并选择最高分的掩码获得，将图像分割重新表述为候选发现后跟判别性掩码选择。我们还注意到，现有基准中的大部分问题集中在常识推理上，这些问题通常不需要完全的联合视觉观察和推理。为了解决这个问题，我们引入了一个名为ReasonSeg-SGDR的新基准，该基准在多个维度上全面评估模型的感知、定位和推理能力，包括判别性识别、空间推理、几何推理和多步推理，并带有细粒度的掩码生成。此外，我们收集训练数据以增强MLLMs联合理解多模态查询和候选掩码的能力，并通过推理分配分数。在提出的基准和ReasonSeg上的实验结果表明了统一掩码生成和选择框架的有效性。

英文摘要

The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.
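
The final selection step described in the abstract (rerank the candidates, keep the highest-scoring mask) can be illustrated with a minimal sketch. The `Candidate` type, the mask identifiers, and the scores below are hypothetical placeholders, not the authors' code or data; in the paper the scores come from an MLLM reasoning over the question and the candidate masks.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    mask_id: str   # hypothetical identifier for a candidate mask
    score: float   # score assigned by the reasoning model (assumed range [0, 1])


def select_mask(candidates: List[Candidate]) -> Candidate:
    """Rerank candidates by score (descending) and return the top one."""
    if not candidates:
        raise ValueError("no candidate masks were discovered")
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[0]


# Illustrative values only.
candidates = [
    Candidate("region_a", 0.31),
    Candidate("region_b", 0.87),
    Candidate("region_c", 0.55),
]
best = select_mask(candidates)
print(best.mask_id)
```

Framing segmentation as selection means the second stage only has to discriminate among a handful of proposals rather than produce a mask from scratch.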


2606.09360 2026-06-09 cs.CV 新提交

ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification

ExDet: 基于跨模态外推与校正的开放域开放词汇检测

Yupeng Zhang, Yuzhong Feng, Ruize Han, Zhiwei Chen, Wei Feng, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University（天津大学智能与计算学部）； Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology（深圳理工大学计算机科学与人工智能学院）； School of Artificial Intelligence, Nanchang University（南昌大学人工智能学院）

AI总结提出ExDet框架，通过文本引导外推（TGE）和检测器兼容校正（DCR）模块，无需额外训练即可增强开放域开放词汇检测的跨类别和跨域泛化能力，在多个基准上取得最优性能。


AI中文摘要

开放域开放词汇检测（ODOVD）要求检测器泛化到新类别和未见过的域，比开放词汇检测更具挑战性。现有方法通常从头训练开放词汇检测器与域泛化模块，导致训练成本高。我们提出ExDet，一种轻量级类别-域协同泛化框架，用于增强现有检测器的跨类别和跨域泛化能力。ExDet由文本引导外推（TGE）、轻量级检测器兼容校正（DCR）模块和ExRPN组成。具体地，TGE利用视觉-语言模型（VLM）的DeltaSpace属性，从文本推断类别和域感知的代理视觉原型。DCR以无需检测器训练和无需真实数据的方式从TGE生成的原型中学习，并在推理时插入分类头之后，将表示校正为与检测器兼容的源域视觉分布，从而增强对新类别和未见域目标的分类。ExRPN通过结合语义相似度与RPN置信度重新校准提议分数，提高对新颖和域偏移目标的召回率，同时为后续分类和DCR提供更好支持。ExDet在OD-LVIS、OV-LVIS、Objects365和MSOSB上达到最优性能。

英文摘要

Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. We propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.
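
The abstract says ExRPN combines semantic similarity with RPN confidence but does not give the formula. The sketch below assumes one plausible choice, a weighted geometric mean with a hypothetical mixing parameter `alpha`; it is an illustration of the idea, not ExDet's actual rule.

```python
def recalibrate(rpn_conf: float, semantic_sim: float, alpha: float = 0.5) -> float:
    """Assumed rule: weighted geometric mean of two scores in [0, 1].

    alpha = 0 keeps the RPN confidence unchanged; alpha = 1 uses only
    the semantic similarity. Both the rule and alpha are assumptions.
    """
    if not (0.0 <= rpn_conf <= 1.0 and 0.0 <= semantic_sim <= 1.0):
        raise ValueError("scores must lie in [0, 1]")
    return (rpn_conf ** (1.0 - alpha)) * (semantic_sim ** alpha)


# Illustrative proposals: (RPN confidence, semantic similarity to the query text).
confident_but_off_topic = recalibrate(0.9, 0.2)
weak_but_on_topic = recalibrate(0.4, 0.8)
print(weak_but_on_topic > confident_but_off_topic)
```

Under this assumed rule, a low-confidence proposal that matches the text query can outrank a high-confidence one that does not, which is the kind of effect that would raise recall on novel or domain-shifted objects.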


2606.09367 2026-06-09 cs.CV 新提交

RT-SDGOD: Real-Time Single-Domain Generalized Object Detection

RT-SDGOD: 实时单域泛化目标检测

Yupeng Zhang, Fangzhuo Gao, Ruize Han, Wei Feng, Liang Wan

发表机构 * College of Intelligence and Computing, Tianjin University（天津大学智能与计算学部）； Key Research Center for Surface Monitoring and Analysis of Relics, State Administration of Cultural Heritage（国家文物局文物表面监测与分析重点研究中心）； Faculty of Computer Science and Artificial Intelligence, Shenzhen University of Advanced Technology（深圳理工大学计算机科学与人工智能学院）

AI总结针对实时检测器在域偏移下漏检严重的问题，提出多证据协同建模框架RT-SDGDet，通过一对多监督、证据多样性学习和双视图一致性学习提升泛化能力，且无额外推理开销。


AI中文摘要

在严格实时约束下的实际部署中，天气和成像变化会导致显著的分布偏移，严重降低检测器性能。单域泛化目标检测旨在缓解这一问题，但现有方法很少在问题表述层面研究实时检测器在受限推理预算下的泛化能力。为此，我们引入实时单域泛化目标检测（RT-SDGOD），专注于实时检测器如何仅通过训练时表示学习，在零额外推理开销下实现跨域泛化。我们观察到，在域偏移下，基于DETR的实时检测器主要通过漏检增加而退化，根源在于目标级判别证据有限且不稳定。基于此，我们提出RT-SDGDet，一种用于RT-SDGOD的多证据协同建模框架。核心思想是使同一目标的多个查询协同覆盖更充分的判别证据，同时保持跨视图的证据建模稳定性。具体而言，我们使用一对多（O2M）监督构建稳定的目标特定查询组，并进一步设计判别证据多样性学习（DEDL）和双视图证据一致性学习（DvECL），分别扩展目标级证据覆盖范围和改善外观扰动下的证据稳定性。由于所有组件仅在训练时引入，我们的方法不产生额外推理开销。大量实验表明，所提方法在多个未见目标域上取得了比现有方法更好的泛化性能。

英文摘要

In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate, at the level of problem formulation, the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more discriminative evidence while keeping the evidence modeling stable across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.


2606.09474 2026-06-09 cs.CV 新提交

Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration

通过开放词汇语义仲裁实现无需训练的通用少样本分割

Silas Kwabla Gah, Ebenezer Owusu

发表机构 * University of Ghana（加纳大学）

AI总结提出Open-V框架，通过推理时协调冻结的语义先验（SAM3 PCS与K-shot CLIP支持质心）实现无需训练的通用少样本分割，在多个基准上超越有监督方法。


AI中文摘要

通用少样本语义分割（GFSS）传统上被视为表示学习问题，需要任务特定的适应来从有限的支持样本中引入新类别。然而，最近的基础模型已经展现出强大的开放词汇识别和分割能力，这提出了一个不同的问题：能否通过推理时协调冻结的语义先验而不是参数适应来解决GFSS？我们通过Open-V回答了这个问题，这是一个无需训练的GFSS框架，它结合了Segment Anything (SAM3) 可提示概念分割（PCS）与K-shot CLIP支持质心，通过校准的逐像素语义仲裁。Open-V不引入任何可训练组件，并在推理时支持任意语义类别。除了分割性能，我们的研究还贡献了三个更广泛的发现。首先，我们表明支持信息可以通过推理时语义基础来整合，并且其贡献随着基础模型文本先验在标签不相交词汇表上的减弱而增加。其次，我们识别了基础模型分割中的可重复性混淆，证明了预处理和评估空间的不匹配会无声地扭曲报告的性能。最后，我们在PASCAL-5i、COCO-20i和ADE-OW上验证了Open-V，表明无需训练的基础模型先验协调在常规GFSS和开放词汇评估设置中都能泛化。在PASCAL-5i（1-shot）上，Open-V达到了基础/新类/调和mIoU分别为78.4/77.5/77.9，无需GFSS特定训练，超越了最强有监督基线+17.7 HM。

英文摘要

Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9 without GFSS-specific training, surpassing the strongest trained baseline by +17.7 HM.
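
The harmonic figure reported above can be checked from the base and novel scores, assuming "harmonic mIoU" (HM) denotes the usual harmonic mean of base-class and novel-class mIoU, as is conventional in GFSS evaluation. A short arithmetic sketch:

```python
def harmonic_mean(base_miou: float, novel_miou: float) -> float:
    """Harmonic mean of base-class and novel-class mIoU."""
    return 2.0 * base_miou * novel_miou / (base_miou + novel_miou)


# Values taken from the abstract (PASCAL-5i, 1-shot).
hm = harmonic_mean(78.4, 77.5)
print(round(hm, 1))  # → 77.9
```

The harmonic mean penalizes imbalance between base and novel performance, so a value this close to both inputs indicates the two are nearly balanced.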


2606.09670 2026-06-09 cs.CV cs.AI 新提交

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

视觉提示结合基于特征重建的双教师监督异常检测

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

发表机构 * IBM Research Europe Zurich（IBM欧洲研究院苏黎世分院）

AI总结针对异常检测在真实场景中因物体尺度、视角等变化失效的问题，提出视觉提示管道、解冻教师模型和扩散生成数据增强，在AeBAD数据集上提升3.5个百分点。


AI中文摘要

最近的异常检测方法在成熟数据集（如MVTec）上取得了完美的检测和分割分数。然而，当基本假设（如一致的物体尺度、视角、背景、光照和居中放置）被违反时，许多方法面临挑战。这些变化使得异常检测方法在许多真实场景中无法使用。为了解决这些限制，我们引入了三个关键贡献：（1）一个视觉提示管道，通过前景-背景掩码隔离物体；（2）一种在师生模型中解冻教师以提高领域适应性的机制；（3）一种利用扩散生成合成图像的数据增强策略，以增强异常检测性能。通过使用掩码多尺度重建（MMR）模型作为骨干，我们在具有挑战性的AeBAD数据集上比之前的最先进方法提高了3.5个百分点。

英文摘要

Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets such as MVTec. However, many of these methods face challenges when foundational assumptions, such as consistent object scale, viewpoint, background, illumination, and centered placement, are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. Using the Masked Multiscale Reconstruction (MMR) model as our backbone, we achieve a 3.5 percentage point improvement over the previous state of the art on the challenging AeBAD dataset.


2606.09679 2026-06-09 cs.CV 新提交

SoccerNet 2026 Player-Centric Ball-Action Spotting: Retraining and Post-Processing Extensions to the FOOTPASS Baselines

SoccerNet 2026 以球员为中心的球类动作定位：FOOTPASS 基线的重训练与后处理扩展

Parthsarthi Rawat

发表机构 * GameChanger by Dick’s Sporting Goods（迪克体育用品的GameChanger）

AI总结针对足球广播中八类动作的球员-动作-时间预测任务，在FOOTPASS基线上提出梯度检查点、GNN与DST融合、平方根频率类别加权和后处理流水线四项扩展，在测试集和挑战集上分别达到0.548和0.446的Macro F1。

Comments CVPR 2026 SoccerNet Player Centric Ball Action Spotting Challenge, Rank 7
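
The summary above names "square-root frequency class weighting" (平方根频率类别加权) without giving a formula. A minimal sketch under the assumption that each class weight is proportional to the inverse square root of its sample count, normalized to mean 1; the class names and counts are hypothetical, not taken from the FOOTPASS data.

```python
import math
from typing import Dict


def sqrt_frequency_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed scheme: weight_c ∝ 1 / sqrt(count_c), rescaled to mean 1."""
    if any(n <= 0 for n in counts.values()):
        raise ValueError("every class needs at least one sample")
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {c: w / mean for c, w in raw.items()}


# Hypothetical action counts for illustration only.
weights = sqrt_frequency_weights({"pass": 10000, "shot": 100})
print(weights)
```

Compared with plain inverse-frequency weighting, the square root softens the correction: here a 100× rarer class receives a 10× larger weight rather than a 100× larger one.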

2606.09772 2026-06-09 cs.CV 新提交

UniADC：统一异常检测与分类框架

Ximiao Zhang, Min Xu, Zheng Zhang, Yap-Peng Tan, Xiuzhuang Zhou

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； China University of Mining（中国矿业大学）； VinUniversity（文理大学）

AI总结提出UniADC模型，通过无训练可控修复网络和隐式正态判别器，同时实现异常区域检测与类别识别，在少样本甚至零样本下超越现有方法。


AI中文摘要

在本文中，我们引入了一个称为统一异常检测与分类的新任务，旨在同时检测图像中的异常区域并识别其具体类别。现有方法通常将异常检测和分类视为独立任务，从而忽略了它们的内在关联并限制了信息共享，导致性能次优。为了解决这个问题，我们提出了UniADC，一个设计用于在仅有少量甚至没有异常图像的情况下有效执行这两项任务的模型。具体来说，UniADC由两个关键组件组成：一个无需训练的可控修复网络和一个隐式正态判别器。修复网络可以通过在异常先验指导下修复正常区域来合成特定类别的异常图像，并且还可以修复少量异常样本以扩充可用异常数据。隐式正态判别器通过隐式建模正常状态来解决正常与异常像素分布不平衡的严峻挑战，通过将细粒度图像特征与异常类别嵌入对齐来实现精确的异常检测和分类。我们在四个异常检测与分类数据集（包括MVTec-FS、MTD、WFDD和Real-IAD）上进行了大量实验，结果表明UniADC在异常检测、定位和分类方面始终优于现有方法。代码可在 https://github.com/cnulab/UniADC 获取。

英文摘要

In this paper, we introduce a novel task termed unified anomaly detection and classification, which aims to simultaneously detect anomalous regions in images and identify their specific categories. Existing methods typically treat anomaly detection and classification as separate tasks, thereby neglecting their inherent correlations and limiting information sharing, which results in suboptimal performance. To address this, we propose UniADC, a model designed to effectively perform both tasks with only a few or even no anomaly images. Specifically, UniADC consists of two key components: a training-free Controllable Inpainting Network and an Implicit-Normal Discriminator. The inpainting network can synthesize anomaly images of specific categories by repainting normal regions guided by anomaly priors, and can also repaint few-shot anomaly samples to augment the available anomaly data. The implicit-normal discriminator addresses the severe challenge of the imbalance between normal and anomalous pixel distributions by implicitly modeling the normal state, achieving precise anomaly detection and classification by aligning fine-grained image features with anomaly-category embeddings. We conduct extensive experiments on four anomaly detection and classification datasets, including MVTec-FS, MTD, WFDD and Real-IAD, and the results demonstrate that UniADC consistently outperforms existing methods in anomaly detection, localization, and classification. The code is available at https://github.com/cnulab/UniADC.


2512.03470 2026-06-09 cs.CV 版本更新

STGBD-Net: Spatio-temporal Gradient Basis Decomposition Network for Infrared Small Target Detection

STGBD-Net：用于红外小目标检测的时空梯度基分解网络

Chen Hu, Mingyu Zhou, Shuai Yuan, Hongbo Hu, Zhenming Peng, Tian Pu, Xiying Li

发表机构 * School of Intelligent Systems Engineering, Sun Yat-sen University（中山大学智能系统工程学院）； School of Information and Communication Engineering and the Laboratory of Imaging Detection and Intelligent Perception, University of Electronic Science and Technology of China（电子科技大学信息与通信工程学院和成像检测与智能感知实验室）； School of Instrument Science and Opto-Electronics Engineering, Hefei University of Technology（合肥工业大学仪器科学与光电工程学院）

AI总结针对红外小目标检测中弱目标易被背景杂波淹没的问题，提出基于基分解理论的梯度基分解模块（GDM），将归一化梯度特征作为基向量重构新特征，结合轻量级U-Net实现单帧与多帧检测，在多个基准上达到SOTA性能。


AI中文摘要

红外小目标检测（IRSTD）的一个关键挑战是弱目标信号响应容易被强背景杂波掩盖，经常导致漏检。虽然传统的基于梯度的方法试图捕捉精细细节，但其鲁棒性受到多方向梯度特征静态融合的限制。在本文中，我们从基分解理论的角度重新思考特征融合，并提出一种新颖的框架，将该过程重构为显式且自适应的分解与重建范式。具体而言，我们引入了基分解模块（BDM）及其专门变体——梯度分解模块（GDM），用于IRSTD。GDM将归一化梯度特征视为基向量来重建新特征，从而保持细节结构并突出红外小目标。通过将GDM集成到轻量级的三阶段U-Net中，我们开发了两种统一架构：用于单帧检测的空间梯度基分解网络和用于多帧场景的时空梯度基分解网络。大量实验表明，我们的网络在多个基准上达到了最先进的性能，在检测精度和计算效率之间提供了优越的平衡。我们的代码将在 https://github.com/greekinRoma/IRSTD_HC_Platform 公开。

英文摘要

A key challenge in infrared small target detection (IRSTD) is that weak target signal responses are easily obscured by strong background clutter, frequently resulting in missed detections. While traditional gradient-based methods attempt to capture fine details, their robustness is limited by the static fusion of multi-directional gradient features. In this paper, we rethink feature fusion from the perspective of Basis Decomposition Theory and propose a novel framework that reformulates the process into an explicit and adaptive decomposition-and-reconstruction paradigm. Specifically, we introduce the Basis Decomposition Module (BDM) and its specialized variant, the Gradient Decomposition Module (GDM) for IRSTD. GDMs treat the normalized gradient features as basis vectors to reconstruct a new feature, thereby maintaining detailed structures and highlighting infrared small targets. By integrating GDMs into a lightweight three-stage U-Net, we develop two unified architectures: the Spatial Gradient Basis Decomposition Network for single-frame detection and the Spatio-temporal Gradient Basis Decomposition Network for multi-frame scenarios. Extensive experiments demonstrate that our networks achieve state-of-the-art (SOTA) performance across multiple benchmarks, offering a superior balance between detection accuracy and computational efficiency. Our codes will be made public at: https://github.com/greekinRoma/IRSTD_HC_Platform.


2512.13869 2026-06-09 cs.CV 版本更新

Coarse-to-Fine Hierarchical Alignment for UAV-based Human Detection using Diffusion Models

基于扩散模型的UAV人类检测粗到细层次对齐

Wenda Li, Meng Wu, Liangzhao Chen, Sungmin Eum, Heesung Kwon, Qing Qu

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出Coarse-to-Fine Hierarchical Alignment框架，通过合成数据转换缩小领域差距，提升UAV人类检测精度，实验显示mAP50提升14.1%。


AI中文摘要

训练目标检测器需要大量任务特定标注，但UAV人类检测因目标分布不断变化和标注图像稀缺而难以实现。为此，本文引入Coarse-to-Fine Hierarchical Alignment（CFHA），一个基于扩散模型的三阶段框架，旨在将合成数据转换为UAV人类检测数据，缩小领域差距并保持原始合成标签。CFHA通过三个模块显式解耦全局风格和局部内容领域的差异：（1）全局风格迁移——扩散模型通过少量真实参考集对齐合成图像的颜色、光照和纹理统计至真实风格；（2）局部细化——超分辨率扩散模型用于细化小物体的精细和逼真细节，如人类实例，保持形状和边界完整性；（3）幻觉消除——过滤掉与真实数据不匹配的人类实例，使人类外观更接近目标分布。在公开的UAV Sim2Real检测基准上进行的广泛实验表明，本文方法显著优于非转换基线。具体而言，本文方法在Semantic-Drone基准上mAP50提升达+14.1%。消融研究证实了全局和局部阶段的互补作用，并突显了层次对齐的重要性。代码已发布在https://github.com/liwd190019/CFHA。

英文摘要

Training object detectors demands extensive, task-specific annotations, yet this requirement becomes impractical in UAV-based human detection due to constantly shifting target distributions and the scarcity of labeled images. As a remedy, synthetic simulators are adopted to generate annotated data at a low annotation cost. However, the domain gap between synthetic and real images hinders the model from being effectively applied to the target domain. Accordingly, we introduce Coarse-to-Fine Hierarchical Alignment (CFHA), a three-stage diffusion-based framework designed to transform synthetic data for UAV-based human detection, narrowing the domain gap while preserving the original synthetic labels. CFHA explicitly decouples global style and local content domain discrepancies and bridges those gaps using three modules: (1) Global Style Transfer: a diffusion model aligns color, illumination, and texture statistics of synthetic images to the realistic style, using only a small real reference set; (2) Local Refinement: a super-resolution diffusion model is used to facilitate fine-grained and photorealistic details for small objects, such as human instances, preserving shape and boundary integrity; (3) Hallucination Removal: a module that filters out human instances whose visual attributes do not align with real-world data to make the human appearance closer to the target distribution. Extensive experiments on public UAV Sim2Real detection benchmarks demonstrate that our method significantly improves detection accuracy compared to the non-transformed baselines. Specifically, our method achieves up to a +14.1 improvement in mAP50 on the Semantic-Drone benchmark. Ablation studies confirm the complementary roles of the global and local stages and highlight the importance of hierarchical alignment. The code is released at https://github.com/liwd190019/CFHA.


2602.10858 2026-06-09 cs.CV 版本更新

Hyperspectral Smoke Segmentation via Mixture of Prototypes

基于原型混合的高光谱烟雾分割

Lujian Yao, Haitao Zhao, Xianghai Kong, Yuhan Xu

发表机构 * Automation Department, School of Information Science and Engineering, East China University of Science and Technology（华东理工大学信息科学与工程学院自动化系）

AI总结针对烟雾分割中光谱信息不足、云干扰和半透明区域问题，提出首个高光谱烟雾分割数据集HSSDataset，并设计原型混合网络（MoP），通过波段分离、原型光谱表示和双阶段路由器实现自适应波段加权，在高光谱和多光谱模态上均取得优异性能。

Comments 31 pages, 14 figures


AI中文摘要

烟雾分割对于野火管理和工业安全应用至关重要。传统的可见光方法由于光谱信息不足而面临局限性，特别是在处理云干扰和半透明烟雾区域时。为了解决这些挑战，我们引入高光谱成像进行烟雾分割，并提出了第一个高光谱烟雾分割数据集（HSSDataset），该数据集使用多对一标注协议，从20个真实场景的超过18,000帧中收集了精心标注的样本。然而，不同的光谱波段在空间区域上表现出不同的判别能力，因此需要自适应的波段加权策略。我们将此分解为三个技术挑战：光谱交互污染、有限的光谱模式建模和复杂的加权路由器问题。我们提出了一种原型混合（MoP）网络，包括：(1) 波段分离（BS）用于光谱隔离，(2) 基于原型的光谱表示（PSR）用于多样化模式，以及(3) 双阶段路由器（DSR）用于自适应空间感知波段加权。我们进一步构建了一个包含RGB-红外图像的多光谱数据集（MSSDataset）。大量实验验证了该方法在高光谱和多光谱模态上的优越性能，为基于光谱的烟雾分割建立了新的范式。

英文摘要

Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods are limited by insufficient spectral information and struggle in particular with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotation protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) band split (BS) for spectral isolation, (2) prototype-based spectral representation (PSR) for diverse patterns, and (3) dual-stage router (DSR) for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.


2602.20551 2026-06-09 cs.CV 版本更新

通过细粒度情感-原因对提取实现精确的情感归因视频字幕生成

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu, Yongdong Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）； Harbin Institute of Technology (Weihai)（哈尔滨工业大学（威海））

AI总结提出细粒度情感-原因对提取框架，通过概念感知视觉语义分解和视觉引导情感可解释学习，提升情感视频字幕的准确性和丰富性。


AI中文摘要

情感视频字幕生成（EVC）是一项具有挑战性的任务，旨在为视频生成事实准确且情感丰富的描述。现有的EVC方法利用整体视觉特征挖掘全局情感线索，然后聚合多模态特征以指导情感字幕生成，这忽略了EVC任务的关键特性。视觉情感是由特定的动机原因引发的，这些原因通常只隐含在核心视频片段中。整体挖掘带来了显著的信息冗余和不准确的情感线索。因此，细粒度的视觉原因提取对情感感知和情感归因字幕生成都有促进作用。为此，我们提出了一种用于情感归因视频字幕生成的细粒度情感-原因对提取框架。具体来说，我们通过两轮学习成对的情感和原因特征：1）我们提出了一种概念感知的视觉语义分解模块，通过探索场景、对象和运动概念来增强视觉特征。此外，为了增强情感特征，我们提出了一种视觉引导的情感可解释学习模块，该模块利用视觉时间动态指导情感细化，并通过可靠的VAD向量约束增强可解释的细化过程。2）我们通过在细化前后交叉耦合视觉和情感特征来实现情感-原因对提取，并利用对比损失实现语义强制对齐。总体而言，我们的方法优化了视频的复杂语义理解和情感感知，从而在情感字幕生成中取得了有前景的性能。在三个具有挑战性的数据集上进行的大量实验证明了我们的方法和每个提出模块的优越性，例如，在EVC-MSVD数据集上，BLEU-2和ROUGE-L分别取得了+4.4%和+5.4%的最佳性能。

英文摘要

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues, and then aggregate multimodal features to guide the emotional caption generation, which ignores the critical characteristic of the EVC task. Visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. The holistic mining brings significant information redundancy and inaccurate emotional cues. Thus, fine-grained visual cause extraction has a facilitative effect on both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. Besides, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics, and augments the interpretable refinement process by reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage contrastive loss to achieve semantic forced alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to a promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, e.g., achieving the best performances with +4.4% and +5.4% w.r.t. BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.


2606.08615 2026-06-09 cs.CV cs.CL 新提交

Harnessing Streaming Video in the Wild

利用野外流式视频

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络空间安全学院）； JD.COM（京东）

AI总结提出Streaming Harness系统，通过Streaming-Train-248K数据集和训练目标，使视觉语言模型具备主动交互、长期记忆和实时处理能力，并构建Streaming-Eval基准评估流式视频理解。

AI中文摘要

视觉语言模型（VLM）在视频通话助手、实时评论和具身机器人等应用中越来越需要处理无界视频流。理想的流式系统应支持主动交互、长期记忆和实时处理，同时基于能够处理各种野外流式任务的VLM骨干。然而，现有VLM在离线视频理解方面表现出色，但在流式能力上有所欠缺，并且缺乏用于流式部署的专用基础设施。我们在三个方面解决这一差距。(i) 对于骨干能力，我们构建了Streaming-Train-248K，一个流式数据集，配以新颖的训练目标，用于使VLM适应流式交互和理解。(ii) 对于实际部署，我们引入了Streaming Harness，一个即插即用系统，赋予任何VLM三种核心能力：主动交互（每秒响应决策）、长期记忆（12小时上下文保留）和实时处理（亚秒级延迟）。(iii) 为了推动社区在流式能力方面的持续进步，我们设计了Streaming-Eval，一个反映模型在各种野外场景中能力的基准。大量实验表明，我们的方法在流式视频理解所需的所有核心能力上均取得了一致的提升。我们将开源我们的数据、代码和基准，以推动社区从离线视频理解向可部署的流式智能的转变。

英文摘要

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08780 2026-06-09 cs.CV 新提交

Beyond Consistency: Preserving Temporal Structure in Zero-Shot Video Editing

超越一致性：在零样本视频编辑中保留时间结构

Deyin Liu, Yisheng Ding, Zhe Jin, Xiatian Zhu, Anjan Dutta, Lin Wu

发表机构 * Anhui University（安徽大学）； University of Surrey（萨里大学）； University of Warwick（华威大学）

AI总结提出一种零样本视频编辑方法，通过自适应分割视频片段、选取锚帧和令牌合并策略，首次显式保留源视频的时间结构，平衡编辑保真度与计算效率。

AI中文摘要

现有的零样本视频编辑方法依赖预训练的扩散模型，成功实现了空间控制和基本的时间一致性，但根本上未能保留视频的原始时间结构。这一区别至关重要：时间一致性确保视觉平滑，而时间结构决定了视频的高层叙事、节奏和语义流。没有这种保留，编辑输出（尤其是具有复杂语义变化的长视频）在叙事上变得不连贯，语义模糊。为了解决这一局限性，我们提出了一种新颖的零样本编辑方法，首次明确关注保留源视频的时间结构。我们通过基于特征相似性自适应地将视频分割成语义不同的片段，并为每个片段选择一个代表性的锚帧来实现这一点。为了增强片段内保真度和计算效率，我们设计了一种片段自适应的令牌合并策略，利用锚帧的语义主导性来稳定编辑。此外，我们采用交替组合策略，确保片段间无缝过渡，同时保持语义区分。大量实验表明，我们的方法达到了最先进的结果，成功平衡了原始时间结构的保留与计算效率，为零样本视频编辑保真度设立了新基准。

英文摘要

Existing zero-shot video editing methods rely on pre-trained diffusion models. They achieve spatial control and basic temporal consistency but fundamentally fail to preserve the video's original temporal structure. This distinction is critical: temporal consistency ensures visual smoothness, but temporal structure dictates the video's high-level narrative, rhythm, and semantic flow. Without this preservation, the edited output, especially for long videos with complex semantic variations, becomes narratively incoherent and semantically ambiguous. To address this limitation, we introduce a novel zero-shot editing approach that, for the first time, explicitly focuses on preserving the source video's temporal structure. We achieve this by adaptively partitioning the video into semantically distinct clips based on feature similarity and selecting a representative anchor frame for each clip. To enhance both intra-clip fidelity and computational efficiency, we design a clip-adaptive token merging strategy that leverages the anchor's semantic dominance to stabilize the editing. Furthermore, we employ an alternating combination strategy that ensures seamless inter-clip transitions while maintaining semantic distinction. Extensive experiments demonstrate that our method achieves state-of-the-art results, successfully balancing the preservation of original temporal structure with computational efficiency, and setting a new benchmark for zero-shot video editing fidelity.
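
Editor's illustration (not from the paper): the abstract describes partitioning a video into clips by feature similarity and picking one anchor frame per clip, but gives no algorithmic detail. The sketch below is one plausible reading under stated assumptions: per-frame feature vectors are already available, a clip boundary is placed wherever cosine similarity between consecutive frames falls below a fixed `threshold`, and the anchor is the frame closest to the clip's mean feature. The function names, the threshold value, and the anchor criterion are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def partition_clips(features, threshold=0.8):
    """Group frame indices into clips; start a new clip when the similarity
    between consecutive frames drops below `threshold` (assumed criterion)."""
    if not features:
        return []
    clips = [[0]]
    for i in range(1, len(features)):
        if cosine(features[i - 1], features[i]) < threshold:
            clips.append([i])      # semantic change: open a new clip
        else:
            clips[-1].append(i)    # same content: extend the current clip
    return clips

def select_anchor(features, clip):
    """Return the index of the frame most similar to the clip's mean feature
    (assumed definition of a 'representative' anchor)."""
    dim = len(features[clip[0]])
    mean = [sum(features[i][d] for i in clip) / len(clip) for d in range(dim)]
    return max(clip, key=lambda i: cosine(features[i], mean))

# Toy usage: three similar frames followed by two frames of different content.
frames = [[1.0, 0.0], [0.8, 0.2], [0.6, 0.4], [0.0, 1.0], [0.0, 1.0]]
clips = partition_clips(frames)
anchors = [select_anchor(frames, c) for c in clips]
```

A fixed threshold is the simplest choice; the paper's "adaptive" partitioning presumably uses something data-dependent, which this sketch does not attempt to reproduce.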

2606.09064 2026-06-09 cs.CV cs.AI 新提交

自监督学习至关重要：一种用于微手势识别的简单集成方案

Tingyi Liu, Kun Li, Fei Wang, Junjie Chen, Zhiliang Wu, Jihao Gu, Haixu Liu, Dan Guo

发表机构 * Hefei University of Technology（合肥工业大学）； United Arab Emirates University（阿拉伯联合酋长国大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Center（合肥综合性国家科学中心人工智能研究院）； Anhui Evolution Technology Co., Ltd.（安徽进化科技有限公司）； Nanyang Technological University（南洋理工大学）； University College London（伦敦大学学院）； The University of Sydney（悉尼大学）； Beijing QBoson Quantum Technology Co., Ltd.（北京量子芯科技有限公司）

AI总结提出一种集成自监督RGB模型与监督多流模型的框架，在MiGA挑战赛微手势分类赛道取得第一名，通过自监督预训练提升性能，在iMiGUE测试集上达到74.419%的top-1准确率。

AI中文摘要

在本文中，我们介绍了XInsight Lab在IJCAI 2026第四届MiGA挑战赛微手势分类赛道中的解决方案，该方案排名第一并取得了新的最先进结果。我们提出了一种多模态集成框架，将基于自监督的RGB模型与先前解决方案中的监督多流模型相结合。自监督RGB模型通过掩码视频建模在12万个未标注片段上进行预训练，然后在iMiGUE上微调。这一简单而有效的RGB基线在iMiGUE测试集上达到了69.224%的top-1准确率，展示了从域内未标注视频中学习可迁移表示的好处。通过将该模型作为互补分支加入，最终集成模型达到了74.419%的top-1准确率，比之前的最先进结果高出1.206个百分点。在iMiGUE上的实验结果，包括对集成策略的消融研究，验证了自监督RGB表示学习在微手势识别中的有效性。

英文摘要

In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.

2606.09542 2026-06-09 cs.CV 新提交

A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation

一种用于零样本交通事故预警的VideoMAE-v2方法

Siyuan Li, Xiaoyang Bi, Mengshi Qi

发表机构 * State Key Laboratory of Networking and Switching Technology（网络与交换技术国家重点实验室）； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出基于VideoMAE-v2的框架，通过滑动窗口协议和逐帧预测头，在零样本设置下从粗粒度标注数据泛化到未知行车记录仪视频，实现交通事故预警。

AI中文摘要

交通事故预警——在行车记录仪视频的每一帧预测即将发生碰撞的可能性——对于安全至关重要，但难以规模化，因为为每个部署场景收集域内标注的事故视频成本过高。我们在零样本设置下研究此任务，即没有目标域训练数据可用：模型必须仅从公开的二元标注驾驶事故数据集中学习，并泛化到未见过的行车记录仪视频。我们提出一个框架，通过将VideoMAE-v2骨干网络与滑动窗口协议下的逐帧预测头相结合，弥合帧级时间风险估计任务与粗粒度标注二元事故数据集之间的差距。我们的方法在2026年CVPR@AUTOPILOT零样本交通事故预警竞赛中获得第二名。代码可在https://github.com/TimeSouth/zero-shot-taa-solution获取。

英文摘要

Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
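
Editor's illustration (not from the paper): the abstract mentions a per-frame prediction head applied under a sliding-window protocol without describing the aggregation. A minimal sketch of one common way to turn window-level outputs into a per-frame risk curve is shown below. It assumes a hypothetical callable `score_window(start, end)` that returns one risk score per frame in `[start, end)` (standing in for the backbone plus head), and it averages the scores of overlapping windows. The window length, stride, and averaging rule are assumptions.

```python
def sliding_window_scores(num_frames, window, stride, score_window):
    """Average per-frame scores over all sliding windows that cover each frame.

    `score_window(start, end)` is a hypothetical stand-in for the model:
    it must return a list of (end - start) floats.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in 1..window so every frame is covered")
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    last_start = max(num_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)  # extra window so the tail frames are covered
    for start in starts:
        end = min(start + window, num_frames)
        for offset, value in enumerate(score_window(start, end)):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]

# Toy usage: a fake scorer that labels every frame with its window's start index.
fake_scorer = lambda start, end: [float(start)] * (end - start)
per_frame = sliding_window_scores(6, window=4, stride=2, score_window=fake_scorer)
```

Averaging smooths the curve across window boundaries; taking the most recent window's score instead would be the causal alternative for online use.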

2606.09547 2026-06-09 cs.CV cs.LG 新提交

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预：视频大语言模型能否在错误发生时即时纠正？

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research（高通人工智能研究院）； York University（约克大学）； Vector Institute for AI（向量人工智能研究所）

AI总结提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力，并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments Qualcomm Interactive Cooking: Ego-MC-Bench -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections and Ego-CoMist -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes

AI中文摘要

学习日常技能（如烹饪一道菜）越来越依赖于教学媒体，例如在线视频。这为使用视频（和多模态）大语言模型（LLMs）作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是，它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力，我们引入了Ego-MC-Bench（错误纠正），这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明，Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集，但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制，我们还引入了Ego-CoMist，这是一个反事实合成数据集，通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明，在Ego-CoMist上进行微调可以带来性能提升，特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.09641 2026-06-09 cs.CV 新提交

MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding

MAVIS: 通过结构化视频理解实现多智能体视频检索

Jie Zhang, Qilang Ye, Hao Zhou, Haochen Liang, Fei Luo

发表机构 * School of Computing and Information Technology, Great Bay University（大湾区大学计算机与信息技术学院）； College of Computer Science, Nankai University（南开大学计算机学院）； Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Graduate School of Information Science and Technology, The University of Tokyo（东京大学信息科学与技术研究生院）

AI总结提出多智能体框架MAVIS，通过结构化语义库解析视频，利用逻辑感知辩论机制协作推理，无需全库扫描和微调即可实现高效视频检索。

AI中文摘要

视频检索的主流范式依赖于基于嵌入的全库扫描，这种方法存在固有的计算低效以及信息密集视频与稀疏文本查询之间的语义不对称问题。为弥合这一差距，我们引入了MAVIS，一种新颖的多智能体框架，将检索重新构想为协作推理而非暴力搜索。MAVIS首先通过将原始视频解析为结构化语义库来弥合粒度不匹配，从而实现显式的属性级索引。在检索过程中，规划器将复杂的用户意图分解为原子子任务，分派专门的智能体独立提名候选。关键的是，MAVIS采用带有严格否决协议的逻辑感知辩论机制，智能体协作修剪逻辑不匹配，以识别紧凑的“有争议”候选集进行细粒度验证。这种智能体工作流有效避免了全库遍历的低效。在MSR-VTT、MSVD和ActivityNet上的大量实验表明，MAVIS在无需任务特定微调的情况下实现了有竞争力的性能，为传统的双编码器方法提供了可扩展且可解释的替代方案。

英文摘要

The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.

2606.09646 2026-06-09 cs.CV cs.AI cs.LG 新提交

TRIM：一种最大化时间相对信息和代表性的自监督视频摘要框架

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

发表机构 * Pompeu Fabra University（庞培法布拉大学）； Universitat Autònoma de Barcelona（巴塞罗那自治大学）

AI总结 TRIM框架通过自监督学习实现高效视频摘要，无需注意力机制等复杂结构，优于现有无监督方法并挑战传统复杂架构。

AI中文摘要

随着视频内容的普及，视频摘要和亮点提取成为关键研究领域。然而，许多先进方法依赖监督标注或注意力模型，计算成本高且在分布变化时表现不稳定。我们提出一种新颖的自监督视频摘要模型，无需注意力、RNN或Transformer，通过马尔可夫过程驱动的损失度量和两阶段自监督学习范式，实现性能与效率的平衡。TRIM在SUMME和TVSUM数据集上达到最佳性能，超越所有现有无监督方法，并与最佳监督模型相当，展示了高效无标注架构的潜力，为更通用的视频摘要技术铺平道路，并挑战现有复杂架构的依赖。

英文摘要

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have elevated video summarization and video highlights to a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle in the face of distribution shifts that hinder cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SUMME and TVSUM datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential for efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.

2510.20182 2026-06-09 cs.CV 版本更新

PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation

PEDRA: 评估视频生成中行人动态的真实性

Aaron Appelle, Jerome P. Lynch

发表机构 * Duke University（杜克大学）

AI总结提出PEDRA评估协议，通过重建鸟瞰轨迹等方法，测试文本/图像到视频模型生成多行人交互场景的真实性，发现现有模型虽具备先验但存在行人合并消失等物理不一致问题。

Comments Accepted to CVPR 2026

AI中文摘要

行人模拟传统上依赖于专家调整的手工模型，这限制了可扩展性和泛化性。与此同时，大规模视频生成模型已在各种场景中实现了高视觉真实感，激发了探索其作为通用世界模拟器潜力的兴趣。现有基准主要评估单主体真实性，而非包含多个交互人物的场景，使得生成视频中多智能体动态的合理性未经测试。我们提出一个严格的评估协议，用于基准测试文本到视频（T2V）和图像到视频（I2V）模型作为行人动态的隐式模拟器。对于I2V，我们利用已有数据集的起始帧，以便与真实视频进行直接比较；而对于T2V，我们设计了一个涵盖不同人群密度和交互类型的提示集。一个关键组成部分是一种无需已知相机参数即可从像素空间重建二维鸟瞰轨迹的方法。我们的分析表明，领先模型对合理的多智能体行为具有有效的先验，尽管合并和消失行人等问题揭示了其物理一致性的局限性。

英文摘要

Pedestrian simulation traditionally relies on expert-tuned, hand-crafted models that limit scalability and generalization. Meanwhile, large-scale video generation models have achieved high visual realism across diverse settings, motivating exploration of their potential as general-purpose world simulators. Existing benchmarks primarily assess single-subject realism rather than scenes with multiple interacting people, leaving the plausibility of multi-agent dynamics in generated videos untested. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable direct comparison with ground truth videos, while for T2V we design a prompt suite covering varied crowd densities and interaction types. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis shows that leading models exhibit effective priors for plausible multi-agent behavior, though issues such as merging and disappearing pedestrians reveal limits to their physical consistency.

2511.14143 2026-06-09 cs.CV cs.AI 版本更新

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

SMART: 基于音频增强多模态大模型的镜头感知视频时刻检索

An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang

发表机构 * Department of Computer Science, University at Albany - SUNY（纽约州立大学奥尔巴尼分校计算机科学系）； School of Software & Microelectronics, Peking University（北京大学软件与微电子学院）； Nanjing University（南京大学）； Xiamen University（厦门大学）； Department of Mathematics and Statistics, University at Albany - SUNY（纽约州立大学奥尔巴尼分校数学与统计学系）

AI总结提出SMART框架，融合音频与视觉特征，利用镜头感知令牌压缩技术，在多模态大模型基础上实现视频时刻检索，在Charades-STA和QVHighlights上取得显著提升。

AI中文摘要

视频时刻检索是视频理解中的一项任务，旨在根据自然语言查询在未裁剪视频中定位特定时间片段。尽管近年来利用传统技术和多模态大模型在视频时刻检索方面取得了进展，但大多数现有方法仍依赖于粗粒度的时间理解和单一的视觉模态，限制了在复杂视频上的性能。为了解决这一问题，我们引入了镜头感知多模态音频增强时间片段检索（SMART），这是一个基于多模态大模型的框架，它整合了音频线索并利用了镜头级别的时间结构。SMART通过结合音频和视觉特征来丰富多模态表示，同时应用镜头感知令牌压缩，该技术选择性地保留每个镜头内的高信息令牌，以减少冗余并保留细粒度的时间细节。我们还优化了提示设计，以更好地利用视听线索。在Charades-STA和QVHighlights上的评估表明，SMART相比最先进的方法取得了显著改进，包括在Charades-STA上R1@0.5提升1.61%，R1@0.7提升2.59%。

英文摘要

Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLMs), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and a 2.59% gain in R1@0.7 on Charades-STA.
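
Editor's illustration (not from the paper): the R1@0.5 and R1@0.7 figures quoted above are the standard moment-retrieval metric "Recall@1 at a temporal IoU threshold", i.e. the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with IoU at or above the threshold. A minimal sketch of that computation follows; the function names are hypothetical, and whether the comparison is `>=` or `>` can differ between evaluation scripts.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

def recall_at_1(top1_predictions, ground_truths, iou_threshold):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Toy usage: the first prediction has IoU 0.6, the second has IoU 1/3.
predictions = [(4.0, 10.0), (5.0, 15.0)]
targets = [(0.0, 10.0), (10.0, 20.0)]
r1_at_05 = recall_at_1(predictions, targets, 0.5)
r1_at_07 = recall_at_1(predictions, targets, 0.7)
```

The stricter 0.7 threshold rewards tight boundaries, which is why gains at R1@0.7 are often read as evidence of finer temporal localization.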

2603.04125 2026-06-09 cs.CV 版本更新

A Baseline Study and Benchmark for Few-Shot Open-Set Action Recognition with Feature Residual Discrimination

基于特征残差判别的小样本开放集动作识别基线研究与基准

Stefano Berti, Giulia Pasquale, Lorenzo Natale

发表机构 * Humanoid Sensing and Perception, Istituto Italiano di Tecnologia, Genoa, Italy（意大利理工学院人形机器人传感与感知实验室，意大利热那亚）

AI总结针对小样本动作识别在开放集场景下的不足，提出基于特征残差判别器的架构扩展，在五个数据集上实现未知类拒绝能力提升且不损失闭集精度，设立新基准。

2603.27493 2026-06-09 cs.CV 版本更新

DisCo: 具有离散相机运动控制的世界模型

Hongrui Huang, Junke Wang, Quanhao Li, Yu-Gang Jiang, Zuxuan Wu

发表机构 * Fudan University（复旦大学）

AI总结提出DisCo，通过离散动作原语替代连续相机轨迹作为条件，解决可控视频生成中动作表示纠缠问题，提升动作跟随可靠性，并引入DisCoBench基准。

AI中文摘要

可控视频世界模型旨在实现交互式世界探索，模型必须在保持视觉质量和时间一致性的同时忠实地执行明确的动作命令。然而，现有大多数方法依赖连续相机轨迹作为动作条件，这通常导致不可靠的动作跟随，尤其是在复杂运动序列下。在这项工作中，我们识别出动作表示纠缠是可控视频生成的关键瓶颈，并表明连续相机表示导致不同运动模式之间的高特征相似性，降低了动作可控性。基于这一见解，我们提出了DisCo，一种可控视频世界模型，它将生成条件约束在一组紧凑的离散动作原语上，以提高动作可分离性。我们进一步引入了DisCoBench，一个用于评估模型在短期、长期和高度动态探索场景中能力的综合基准。大量实验表明，DisCo在保持视觉质量的同时实现了显著更可靠的动作跟随。

英文摘要

Controllable video world models target interactive world exploration, where models must faithfully execute explicit action commands while preserving visual quality and temporal coherence. However, most existing approaches rely on continuous camera trajectories as action conditions, which often lead to unreliable action following, especially under complex motion sequences. In this work, we identify action representation entanglement as a key bottleneck in controllable video generation, and show that continuous camera representations lead to high feature similarity across distinct motion patterns, degrading action controllability. Based on this insight, we propose DisCo, a controllable video world model that conditions generation on a compact set of discrete action primitives to improve action separability. We further introduce DisCoBench, a comprehensive benchmark for evaluating the ability of models in short-term, long-horizon, and highly dynamic exploration scenarios. Extensive experiments demonstrate that DisCo achieves significantly more reliable action following while preserving visual quality.

2606.08091 2026-06-09 cs.CV 新提交

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

VideoWeaver: 评估与进化智能体长视频生成技能

Jianhui Wei, Jie Tan, Hengchuan Zhu, Xiaotian Zhang, Yan Zhang, Ziyi Chen, Daoan Zhang, Wei Xu, Zuozhu Liu

发表机构 * Zhejiang University（浙江大学）； ByteDance（字节跳动）

AI总结提出VideoWeaver框架，让智能体自主组合基础技能生成视频，并设计智能体裁判评估过程与结果，通过技能进化算法提升生成质量。

AI中文摘要

最近的智能体框架如Claude Code、Codex和OpenClaw在工具使用和编排方面表现强劲，但它们能否处理长视频生成这一长时多模态任务仍待探索。与早期手工设计管线的视频智能体不同，这些框架可以构建和优化自己的工作流程。我们提出VideoWeaver，一个评估和进化长视频生成技能的智能体框架和基准测试，其中智能体通过将基础技能组合成自己的工作流程（而非遵循预定义管线）将单个指令转化为长视频。该基准测试包含16个任务类别和285个案例，参考信息涵盖文本、图像、音频、视频及其组合。由于错误可能出现在任何阶段而不仅仅是最终视频，我们提出一种智能体裁判，它检查执行轨迹和最终视频，并将其评分基于元数据和中间文件等证据。利用这一反馈，我们进一步设计了一种技能进化算法，用于优化和合并智能体的技能。在多个框架和模型上，我们发现显式的组合技能比单独使用基础技能更能改善生成过程，技能进化进一步提高了输出质量，并且不同框架和模型选择之间的性能差异显著。所提出的智能体裁判也与人类判断高度一致，尤其是在过程指标上。代码和数据集可在https://github.com/JianhuiWei7/VideoWeaver获取。

英文摘要

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent's skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset are available at https://github.com/JianhuiWei7/VideoWeaver

2606.08150 2026-06-09 cs.CV 新提交

Property-Informed Diffusion-Based Text-to-Microstructure Generation

基于属性信息的扩散模型文本到微结构生成

Bingxuan Dai, Hongsong Wang, Jie Gui

发表机构 * School of Cyber Science and Engineering, Southeast University（东南大学网络空间安全学院）； School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education（教育部新一代人工智能技术及其跨学科应用重点实验室（东南大学））； Purple Mountain Laboratories（紫金山实验室）； Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education（教育部区块链应用监管工程研究中心（东南大学））

AI总结提出一种属性信息驱动的扩散网络，从文本描述直接生成3D微结构，通过对比文本-结构对齐和测试时奖励引导对齐确保生成结构的语义和物理可行性。

Comments Published in CVPR2026, Code is at: https://github.com/hongsong-wang/PropDiff-TMG

AI中文摘要

设计满足预期功能的3D超材料微结构仍然是一个重大挑战，因为它通常需要领域专业知识、迭代模拟和大量手动调整。现有的基于期望目标属性自动生成微结构的逆向设计工作往往受限于设计多样性不足，并在确保生成结构的物理可行性方面面临挑战。为解决这一问题，提出了一种属性信息驱动的扩散网络，能够直接从文本描述生成3D微结构。与传统的属性条件方法不同，我们的方法利用文本输入中丰富的语义和物理属性指导，支持多样化的结构合成。为了强制生成结构与目标文本提示之间的一致性，采用了双重对齐策略，包括对比文本-结构对齐和测试时奖励引导对齐。实验结果表明，该模型能够在广泛材料类别中生成语义有意义且物理上合理的结构。我们的方法在交互式微结构设计方面具有良好潜力，并为结合语言接口与逆向材料发现开辟了新方向。代码可在 https://github.com/hongsong-wang/PropDiff-TMG 获取。

英文摘要

Designing 3D metamaterial microstructures that meet the intended functions remains a major challenge, as it typically requires domain expertise, iterative simulations, and extensive manual tuning. Existing work on inverse design that automatically generates microstructures based on desired target properties often suffers from limited design diversity and faces challenges in ensuring the physical feasibility of the generated structures. To address this issue, a property-informed diffusion-based network is proposed that enables the generation of 3D microstructures directly from textual descriptions. Unlike traditional property conditioning methods, our approach leverages rich guidance in terms of semantics and physical properties in the text input to support diverse structure synthesis. To enforce consistency between the generated structures and the target textual prompts, a dual alignment strategy is adopted, including contrastive text-structure alignment and test-time reward-guided alignment. Experimental results show that the model is capable of generating semantically meaningful and physically plausible structures across a wide range of material categories. Our approach has good potential for interactive microstructure design and opens up new directions for combining language-based interfaces with inverse material discovery. Code is available at: https://github.com/hongsong-wang/PropDiff-TMG

2606.08260 2026-06-09 cs.CV 新提交

TIDE: Task-Isolated Diffusion for Unified Video Editing and Generation

TIDE: 任务隔离扩散模型用于统一视频编辑与生成

Qi Liu, Gang Yue, Mingyu Yin, Lisai Zhang, Yidi Wu, Yaole Wang, Yaohui Wang, Chang Yao, Jingyuan Chen, Lin Ma

发表机构 * Zhejiang University（浙江大学）； Bilibili Inc.（哔哩哔哩股份有限公司）

AI总结提出TIDE统一框架，通过逐token任务嵌入和双路径条件机制，实现指令编辑、参考编辑和多参考生成，在多任务渐进训练下达到SOTA性能。

AI中文摘要

扩散Transformer的最新进展推动了视频生成和编辑的快速发展，但这些能力仍由独立的、任务特定的模型处理。构建支持多种视频任务的统一框架仍然是一个开放挑战：现有的统一尝试要么需要专用的辅助编码器，要么缺乏区分异构条件令牌的显式机制，当视觉条件的数量和类型因任务而异时难以应对。我们提出TIDE，一个统一框架，集成了基于指令的编辑、参考引导编辑和多参考生成。其核心是，我们引入了逐令牌任务嵌入，为每个输入令牌分配一个任务特定标识符，使模型能够显式区分目标、源和参考令牌。为了同时捕捉高层语义理解和细粒度结构保真度，我们设计了一种双路径条件方案，将视觉语言模型与VAE潜在路径耦合以提供互补信号。我们进一步设计了一种多任务渐进训练策略，逐步引入复杂度递增的任务，有效协调不同目标，并实现跨异构任务分布的平滑泛化。在多个视频编辑和生成基准上的大量实验表明，TIDE在所有评估任务上均达到了最先进的性能。我们的项目页面可在https://LittleWork123.github.io/tide获取。

英文摘要

Recent advances in Diffusion Transformers have driven rapid progress in video generation and editing, yet these capabilities are still handled by separate, task-specific models. Building a unified framework that supports diverse video tasks remains an open challenge: existing unified attempts either require dedicated auxiliary encoders or lack explicit mechanisms to distinguish heterogeneous conditioning tokens, struggling when the number and type of visual conditions vary across tasks. We propose TIDE, a unified framework that integrates instruction-based editing, reference-guided editing, and multi-reference generation. At its core, we introduce per-token task embeddings that assign each input token a task-specific identifier, enabling the model to explicitly disambiguate target, source, and reference tokens. To simultaneously capture high-level semantic understanding and fine-grained structural fidelity, we design a dual-path conditioning scheme that couples a vision-language model with a VAE latent path for complementary signals. We further devise a multi-task progressive training strategy that incrementally introduces tasks of increasing complexity, effectively harmonizing diverse objectives and enabling smooth generalization across heterogeneous task distributions. Extensive experiments on multiple video editing and generation benchmarks demonstrate that TIDE achieves state-of-the-art performance across all evaluated tasks. Our project page is available at https://LittleWork123.github.io/tide.

2606.08302 2026-06-09 cs.CV 新提交

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

HACK++：面向高效视觉自回归建模的更有效的头部感知键值压缩

Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Rakuten（乐天）； Tsinghua University（清华大学）

AI总结针对VAR模型跨尺度KV缓存导致的高计算和内存开销，提出无训练头部感知压缩框架HACK++，通过离线分类头部类型和自适应预算分配，在极低缓存预算下保持近无损生成。

AI中文摘要

视觉自回归（VAR）模型采用下一尺度预测范式，以显著更少的解码步骤实现高质量生成。然而，现有VAR模型由于跨尺度键值（KV）缓存的累积，面临严重的注意力复杂度和内存开销。本文通过将KV缓存压缩引入下一尺度范式来应对这一挑战。我们首先深入分析VAR注意力，观察到注意力头可以稳定地分为两个功能不同的类别：上下文头关注保持语义一致性，而结构头保持空间连贯性。它们的功能差异使得现有的一刀切压缩方法在VAR模型上表现不佳。我们进一步发现，两种头部类型对历史尺度的依赖程度不同，且这种依赖在不同层和生成步骤中发生变化，这要求自适应的缓存预算分配。为解决这些问题，我们提出HACK++，一种针对VAR模型的无训练头部感知键值压缩框架。通过一次性离线校准，HACK++分类头部类型并推导头部特定先验。在推理时，它将注意力与缓存压缩在独立预算下解耦，在压缩累积缓存时采用更激进的策略，通过模式特定策略和依赖感知预算分配来限制当前尺度的注意力成本。在多个VAR模型上进行的广泛实验，涵盖文本到图像、类别条件和统一理解与生成任务，验证了HACK++的有效性和泛化能力。例如，在Infinity-2B/8B上，HACK++在仅30%注意力预算和10%缓存预算下保持近无损生成，即使在1%缓存预算下也保持稳健。

英文摘要

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

2606.08492 2026-06-09 cs.CV cs.AI 新提交

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

眼见为实：基于视觉锚点的提示重写对齐用于文本到图像生成

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu, Xuhang Chen, Tianrun Chen, Siwei Ma

发表机构 * Peking University（北京大学）； Tencent（腾讯）； Dalian University of Technology（大连理工大学）； Nanyang Technological University（南洋理工大学）； University of Cambridge（剑桥大学）； Zhejiang University（浙江大学）

AI总结提出FaithRewriter框架，利用多模态大模型生成中间视觉线索，结合大语言模型生成视觉锚定的增强提示，再蒸馏至小模型，以缩小用户意图与生成图像之间的差距。

详情

AI中文摘要

尽管文本到图像（T2I）模型具有令人印象深刻的能力，但由于用户提示的简洁性和模糊性，意图-生成差距往往持续存在。现有方法主要优化提示的流畅性和可读性。然而，增强过程仍然缺乏视觉基础。因此，重写器可能过度推断缺失的细节，导致意图-生成差距。为了解决这一限制，我们提出了FaithRewriter，一种用于T2I生成的新型提示增强框架。具体来说，FaithRewriter首先利用多模态MLLM从原始提示生成图像作为中间视觉线索。然后将该线索与提示结合，输入大规模LLM，生成视觉锚定的增强，更好地反映预期内容在图像中应如何呈现。最后，将这些增强蒸馏到小规模LLM中以便高效部署，增强其生成有效T2I提示的能力。实验表明，与强基线相比，FaithRewriter生成的提示更忠实于用户意图且视觉上更合理，有助于缩小意图-生成差距。

英文摘要

Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding. As a result, the rewriter may over-infer missing details, causing an intent-generation gap. To address this limitation, we propose FaithRewriter, a novel prompt-enhancement framework for T2I generation. Specifically, FaithRewriter first leverages a multimodal MLLM to generate an image from the original prompt as an intermediate visual cue. This cue is then combined with the prompt and fed into a large-scale LLM to produce visually grounded augmentations that better reflect how the intended content should appear in images. Finally, these augmentations are distilled into a small-scale LLM for efficient deployment, enhancing its ability to generate effective T2I prompts. Experiments show that FaithRewriter yields prompts that are more faithful to the user intent and more visually plausible than strong baselines, helping narrow the intent-generation gap.


2606.08514 2026-06-09 cs.CV 新提交

OmniTryOn: Video Try-On Anything at Once!

OmniTryOn: 一次性视频试穿任意物品！

Changliang Xia, Chengyou Jia, Minnan Luo, Zhuohang Dang, Xin Shen, Bowen Ping

发表机构 * Xi’an Jiaotong University（西安交通大学）

AI总结提出OmniTryOn框架，通过首帧可穿戴缓存和时空一致RoPE，实现无外部先验的一次性视频多物品试穿，在TryAny-Bench上显著优于现有方法。

详情

AI中文摘要

尽管视频虚拟试穿（VVT）取得了显著进展，现有方法仍存在两个基本局限：首先，它们仅限于单件衣物迁移，使得同时进行多物品试穿极不实用；其次，它们严重依赖显式外部先验（如衣物掩码），不可避免地破坏了关键的物理动态并降低了视觉质量。为弥补这一差距，本文提出了新颖的“任意试穿”任务，旨在一次推理过程中将多种可穿戴物品同时迁移到视频中的人物身上。为了支持并标准化这一范式，我们引入了TryAny-Bench，一个包含配对视频数据集和定制评估协议的综合基准。此外，我们提出了OmniTryOn，一个无外部先验的生成框架，用于解决该任务。具体而言，OmniTryOn采用首帧可穿戴缓存策略，通过初始视频帧直接为生成过程提供多样化的可穿戴物品。为保持一致性，我们提出了时空一致RoPE（STC-RoPE），它固有地建立了稳健的时空锚点，以严格保留复杂的人体运动和背景动态。通过提出的渐进式试穿（GTO）训练策略进行优化，我们的模型逐步掌握了稳健的多物品合成。在TryAny-Bench上的大量实验表明，OmniTryOn显著优于现有的专用视频虚拟试穿模型和通用视频编辑基线，为“任意试穿”任务建立了强大的新标准。我们的数据集、代码和模型可在https://github.com/xcltql666/OminTryOn获取。

英文摘要

Although video virtual try-on (VVT) has achieved significant progress, existing methods still exhibit two fundamental limitations: first, they are restricted to single-garment transfer, rendering simultaneous multi-object try-on highly impractical; second, their heavy reliance on explicit external priors (e.g., garment masks) inevitably destroys crucial physical dynamics and degrades visual quality. To bridge this gap, this paper proposes the novel Try-On Anything task, which aims to simultaneously transfer diverse wearable objects onto a person in a video in a single inference pass. To support and standardize this paradigm, we introduce TryAny-Bench, a comprehensive benchmark encompassing a paired video dataset alongside a tailored evaluation protocol. Furthermore, we present OmniTryOn, an external-prior-free generative framework designed to tackle this task. Specifically, OmniTryOn employs a First Frame Wearable Cache strategy, which directly provides diverse wearable objects for the generation process through the initial video frame. To maintain consistency, we propose the Spatiotemporally Consistent RoPE (STC-RoPE), which inherently establishes robust spatiotemporal anchors to strictly preserve complex human motions and background dynamics. Optimized by the proposed Gradual Try-On (GTO) training strategy, our model progressively masters robust multi-object synthesis. Extensive experiments on TryAny-Bench demonstrate that OmniTryOn significantly outperforms existing specialized video virtual try-on models and general video editing baselines, establishing a powerful new standard for the Try-On Anything task. Our dataset, code, and models are available at https://github.com/xcltql666/OminTryOn.


2606.08672 2026-06-09 cs.CV cs.LG 新提交

Learning to Solve Generative ODEs Beyond the Linear Span

学习求解生成式常微分方程：超越线性跨度

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim

发表机构 * Korea University（高丽大学）； KAIST（韩国科学技术院）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）

AI总结针对扩散和流生成模型中ODE求解器步数多的问题，提出SpanLift轻量神经求解器，通过空间残差算子增强标量系数更新，实现少步采样且不增加模型NFE，在多个任务上达到最先进性能。

Comments 12 pages, 7 figures

详情

AI中文摘要

扩散和流生成模型通过积分学习到的ODE进行采样，但高质量采样仍需要大量连续的模型评估。求解器学习通过调整标量系数、时间步长或两者来降低这一成本，同时保持骨干模型固定。在这项工作中，我们识别出该更新族中的一个结构瓶颈：每一步仍然受限于跨度。由于标量系数更新位于缓冲速度评估的跨度内，它只能拟合跨度内的分量，而任何跨度外的残差无法通过标量重组单独达到。我们提出SpanLift，一种轻量神经求解器，它用空间残差算子增强标量系数更新。SpanLift将固定的基础求解器作为跨度内先验，并在状态和速度缓冲上学习一个空间残差算子。该算子通过端点教师匹配训练，保留预训练的骨干，且不增加模型NFE。实验表明，学习到的校正跨基础求解器迁移，且主要位于跨度外。在像素空间扩散、潜流匹配和降水临近预报中，SpanLift实现了最先进的少步采样。仅用3个NFE，它将CIFAR-10的FID从8.16提升到5.69，ImageNet的FID从17.37提升到11.83。

英文摘要

Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFE, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.
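The span limitation described in this abstract can be illustrated with a small linear-algebra sketch. It is not the SpanLift solver itself: the toy vectors, the function names, and the "ideal update" are assumptions made for illustration. The sketch only shows that a scalar recombination of buffered velocities can reproduce the in-span component of a target step, while the out-of-span residual is orthogonal to every buffered velocity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that add no new direction
            basis.append([wi / norm for wi in w])
    return basis

def split_by_span(target, velocity_buffer):
    """Split `target` into the part reachable by scalar recombination of the
    buffered velocities (in-span) and the remainder (out-of-span)."""
    in_span = [0.0] * len(target)
    for q in orthonormal_basis(velocity_buffer):
        c = dot(target, q)
        in_span = [s + c * qi for s, qi in zip(in_span, q)]
    out_of_span = [t - s for t, s in zip(target, in_span)]
    return in_span, out_of_span

# Toy 4-dimensional state with two buffered velocity evaluations (hypothetical).
velocity_buffer = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
ideal_update = [2.0, 3.0, 0.5, -0.5]  # hypothetical "teacher" step
in_span, out_of_span = split_by_span(ideal_update, velocity_buffer)
print(in_span, out_of_span)
```

No choice of scalar coefficients changes the out-of-span part, which is the gap the abstract says the learned spatial residual operator is meant to address.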


2606.08788 2026-06-09 cs.CV 新提交

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

MaskAlign: 面向高效扩散训练的令牌子集表示对齐

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Kuaishou Technology（快手科技）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结针对扩散模型与预训练视觉模型表示对齐中令牌级信息不匹配问题，提出MaskAlign方法，通过随机采样令牌子集进行对齐，并引入预掩码令牌混合块减少信息损失，提升训练效率和生成质量。

详情

AI中文摘要

与预训练视觉模型的表示对齐最近显示出加速扩散Transformer训练的潜力。通过将中间扩散特征与来自自监督视觉编码器的干净图像表示对齐，现有方法提高了收敛速度和生成质量。然而，这种对齐也引入了一个非平凡的约束：扩散模型处理噪声输入，其可用信息随时间步变化，而参考特征是从干净图像中提取的。在本文中，我们从令牌级角度重新审视这种不匹配。我们发现，在全令牌表示对齐下，具有较大对齐梯度范数的令牌表现出稳定的空间偏好，这表明对齐目标并非均匀影响所有令牌，可能鼓励模型依赖完整的干净图像令牌集。为了解决这个问题，我们提出了MaskAlign，一种令牌子集表示对齐方法，在训练期间对随机采样的令牌子集应用对齐。通过在不同迭代中向模型暴露不同的令牌子集，MaskAlign减少了表示对齐对完整令牌集的依赖，并鼓励在令牌子集扰动下更稳定的对齐行为。为了缓解直接丢弃令牌导致的信息损失，我们进一步引入了一个轻量级的预掩码令牌混合块，在掩码之前跨令牌共享信息。

英文摘要

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
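A minimal sketch of token-subset alignment, under stated assumptions: the abstract does not give the loss, so a cosine-similarity alignment loss is assumed here, the diffusion and reference features are assumed to share a dimension already, and the pre-mask token mixing block is not modeled. The names `subset_alignment_loss` and `keep_ratio` are hypothetical.

```python
import math
import random

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def subset_alignment_loss(diffusion_tokens, reference_tokens, keep_ratio, rng):
    """Average (1 - cosine similarity) over a random subset of token positions.

    A fresh subset is drawn on every call, so different training iterations
    align different tokens instead of always aligning the full token set.
    """
    n = len(diffusion_tokens)
    k = max(1, round(keep_ratio * n))
    kept = sorted(rng.sample(range(n), k))  # positions that receive the loss
    loss = sum(
        1.0 - cosine(diffusion_tokens[i], reference_tokens[i]) for i in kept
    ) / k
    return loss, kept

# Toy example: 8 tokens with 4-dimensional features (random stand-ins).
demo_rng = random.Random(0)
noisy_features = [[demo_rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
clean_features = [[demo_rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
demo_loss, demo_kept = subset_alignment_loss(
    noisy_features, clean_features, keep_ratio=0.5, rng=demo_rng
)
print(demo_kept, demo_loss)
```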


2606.08833 2026-06-09 cs.CV 新提交

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

CSFlow: 将流匹配与人类对比敏感度对齐

Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus（马克斯·普朗克信息学研究所，萨尔兰信息学园区）

AI总结提出CSFlow加权方案，通过将人类对比敏感度函数与流匹配的迭代去噪步骤对齐，在傅里叶空间中引入软自回归结构，提升生成图像的视觉真实感，FID降低4.7%，Inception Score提升2.2%。

详情

AI中文摘要

我们引入了对比敏感流（CSFlow），这是一种将人眼的对比敏感度函数（CSF）与流匹配的迭代去噪步骤联系起来的加权方案。由于真实世界图像将信号集中在低空间频率，这些分量在连续扩散过程中比高频分量更早达到高信噪比。当使用扩散或流匹配模型生成图像时，这会在傅里叶空间中诱导一种软自回归结构，其中粗略的图像内容在精细细节之前稳定。同时，人类视觉系统对空间频率的敏感度不均：极低和极高的频率需要显著更高的对比度才能被感知。我们首次通过两个贡献将这些观察结果融合在一起：（1）一个估计每个反向流区间生成哪些频率的度量，以及（2）通过将每个噪声级别生成的频率与人类对比敏感度对齐获得的时间步权重。我们通过实验验证了我们的贡献，表明这些权重可以通过仅推理时间步修改或短时微调，将FID降低4.7%，Inception Score提高2.2%，GenEval分数提高2.5%，从而改善生成性能。定性上，我们发现我们的CSFlow权重导致生成的图像具有更好的视觉真实感和更少的卡通外观。

英文摘要

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We merge these observations for the first time through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally, showing that these weights can improve generative performance, lowering FID by 4.7%, increasing Inception Score by 2.2%, and improving GenEval scores by 2.5%, using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and a less cartoonish appearance in generated images.
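The claim that low frequencies stabilize first can be checked with a back-of-the-envelope model. Everything specific below is an assumption, not the paper's metric: a linear interpolation path x_t = (1 - t) * data + t * noise with t = 1 as pure noise, white unit-power noise, and a 1/f^2 image power spectrum with a hypothetical `amplitude`. The contrast sensitivity weighting itself is not modeled.

```python
import math

def snr(freq, t, amplitude=100.0):
    """Per-frequency SNR on the assumed path x_t = (1 - t) * data + t * noise."""
    signal_power = amplitude / freq ** 2  # assumed 1/f^2 image spectrum
    return ((1.0 - t) ** 2 * signal_power) / (t ** 2)

def crossover_time(freq, amplitude=100.0):
    """Time t at which SNR equals 1 for this frequency (closed form)."""
    root = math.sqrt(amplitude)
    return root / (root + freq)

# Sampling runs from t = 1 (noise) down to t = 0 (data), so a larger
# crossover time means the frequency emerges from the noise earlier.
for freq in (1, 4, 16, 64):
    print(freq, round(crossover_time(freq), 3))
```

Under these assumptions the crossover time decreases monotonically with frequency, which matches the coarse-before-fine ordering described in the abstract.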


2606.08847 2026-06-09 cs.CV cs.AI cs.LG 新提交

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

BLM-SGAN: 用于语义-空间文本到图像生成的双向语言建模

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Egypt（MSA大学计算机科学学院，埃及）

AI总结提出BLM-SGAN模型，利用BERT的双向注意力机制捕获长程依赖，解决GAN在文本到图像生成中的梯度消失和序列处理限制，在鸟类图像生成上达到SOTA。

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

详情

DOI: 10.1007/978-3-031-91351-8_5
Journal ref: Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025

AI中文摘要

尽管从文本描述生成图像取得了成功，但在自然语言处理（NLP）和计算机视觉（CV）等领域仍面临难以克服的挑战。文本到图像（T2I）模型的最新进展，特别是那些利用生成对抗网络（GAN）的模型，显著提高了跨领域合成逼真图像的能力。然而，现有的基于GAN的T2I模型仍然面临关键挑战，例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题，我们引入了BLM-SGAN，一种新颖的模型，它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能，Inception Score（IS）为5.45 +/- 0.08，超过了多个竞争模型，如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取：https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

英文摘要

Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.


2606.09056 2026-06-09 cs.CV cs.LG 新提交

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

MilliVid: 用于视频生成中长程一致性的分层潜变量

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Toyota Research Institute（丰田研究所）

AI总结提出一种多尺度token空间的粗到细展开方法，通过预训练层次化自编码器压缩帧为多层token，并训练视频扩散模型生成这些token，在保持几何和物体持久性长程一致性的同时降低计算开销。

Comments Ishaan Preetam Chandratreya and David Charatan contributed equally. Project page: https://davidcharatan.com/millivid/

详情

AI中文摘要

视频生成模型已变得日益强大，但长程一致性仍然难以实现，因为即使只有几十帧也需要不切实际的长Transformer序列长度。我们表明，通过在多尺度token空间内使用粗到细展开生成视频，可以缓解这一问题。我们的方法很简单：首先，预训练一个自编码器，将每一帧压缩成一个token层次结构，层级范围从典型的潜变量分辨率到每帧仅几个token。最粗糙的层级捕获最重要的信息，如场景布局和语义，而更细的层级添加高频外观和纹理。然后，我们训练一个视频扩散模型，使用粗到细展开生成这些token。通过仔细控制在每个展开步骤中生成帧并用作上下文的细节级别，我们能够保持几何和物体持久性的长程一致性，同时在感知上不太相关的细节的长程一致性上花费更少的计算。我们使用一个自定义的长Minecraft视频数据集验证了这种方法，与现有基线相比，它产生了明显更一致的展开结果。

英文摘要

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.


2606.09156 2026-06-09 cs.CV 新提交

OmniGen-AR: AutoRegressive Any-to-Image Generation

OmniGen-AR: 自回归任意到图像生成

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University（复旦大学可信具身人工智能研究所）； Shanghai Collaborative Innovation Center of Intelligent Visual Computing（上海智能视觉计算协同创新中心）； Bytedance Seed（字节跳动Seed）； The University of Hong Kong（香港大学）

AI总结提出统一自回归框架OmniGen-AR，通过共享视觉分词器和解耦因果注意力，支持文本、空间信号和视觉上下文等多种条件输入，在多项基准上达到最优或竞争性能。

Comments Accepted by NeurIPS

详情

AI中文摘要

自回归（AR）模型在视觉生成中展现出强大潜力，以简单的架构和优化目标实现了优越性能。然而，现有方法通常局限于单一模态条件（如文本），限制了其在需要从多种控制信号合成图像的现实场景中的应用。在这项工作中，我们提出了OmniGen-AR，一个统一的任意到图像生成的自回归框架。通过共享视觉分词器将各种视觉条件离散化，并使用文本分词器处理文本提示，OmniGen-AR在单个模型中支持广泛的条件输入，包括文本（文本到图像生成）、空间信号（分割到图像和深度到图像）以及视觉上下文（图像编辑、帧预测和文本到视频生成）。为了减轻条件令牌到内容令牌的信息泄露风险，我们引入了解耦因果注意力（DCA），它将全序列因果掩码分离为条件因果注意力和内容因果注意力。这作为训练时的正则化器，不影响推理时的标准下一个令牌预测。通过这种设计，OmniGen-AR在多个基准上取得了新的最先进或至少具有竞争力的结果，例如在GenEval上达到0.63，在VBench上达到80.02，展示了其在灵活和高保真视觉生成方面的有效性。

英文摘要

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.


PTL-Diffusion: 具有周期终端定律的流形感知扩散

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； University of Cambridge（剑桥大学）； University of Oxford（牛津大学）； Harvard University（哈佛大学）； MIT（麻省理工学院）； University of Washington（华盛顿大学）

AI总结提出PTL-Diffusion，通过将前向噪声过程收敛到周期高斯终端族而非单一分布，显式嵌入相位结构，改善低维流形上的分布匹配，在点云和人脸数据集上降低误差。

详情

AI中文摘要

标准扩散模型通常使用单一时间齐次高斯终端分布作为生成的参考律。虽然这一选择在分析上方便且经验上有效，但对于集中在低维流形附近的数据，它提供的显式结构很少，其中数据分布的不同区域可能对应于不同的局部几何或语义因素。因此，反向模型必须几乎完全从非结构化的终端参考分布中恢复流形级别的结构。我们提出PTL-Diffusion，一种概念验证的扩散框架，其前向噪声过程收敛到一个非常数的周期高斯终端族，而不是单一不变律。与相位条件DDPM不同（其中相位信息仅进入去噪网络，而前向过程保持不变），PTL-Diffusion将相位结构直接嵌入前向噪声动力学中。所提出的构造仍然接近标准去噪扩散模型：对于周期强迫的Ornstein-Uhlenbeck型前向过程，我们推导出闭合形式的前向边际分布、极限周期高斯终端族以及显式高斯反向后验，从而支持标准噪声预测训练。我们还引入了一个不变平均正则化项，通过平均周期参考律耦合相位条件反向动力学。在环面和圆柱点云基准以及Olivetti人脸数据集上的实验表明，PTL-Diffusion在匹配的DDPM基线上改善了流形级别的分布匹配，减少了相位条件误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的方向，同时激励更具表现力的相位构造和更大规模的评估。

英文摘要

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
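To make the idea of a periodic Gaussian terminal family concrete, here is a one-dimensional sketch. It assumes a specific forward process, dX = -theta * (X - amp * sin(omega * t)) dt + sigma dW, which is only one example of a periodically forced Ornstein-Uhlenbeck process and may differ from the paper's construction. The sinusoidal forcing and all parameter names are assumptions. For this process the long-time law is Gaussian with a mean that keeps oscillating in time and a constant variance sigma^2 / (2 * theta).

```python
import math

def periodic_mean(t, theta, amp, omega):
    """Long-time (periodic) mean of the assumed forced OU process."""
    scale = amp * theta / (theta ** 2 + omega ** 2)
    return scale * (theta * math.sin(omega * t) - omega * math.cos(omega * t))

def stationary_variance(theta, sigma):
    """Long-time variance; the forcing shifts the mean but not the variance."""
    return sigma ** 2 / (2.0 * theta)

def integrate_mean(x0, t_end, theta, amp, omega, steps=100_000):
    """Euler integration of the mean ODE dm/dt = -theta * (m - amp*sin(omega*t))."""
    dt = t_end / steps
    mean = x0
    for i in range(steps):
        t = i * dt
        mean += -theta * (mean - amp * math.sin(omega * t)) * dt
    return mean

# After the transient exp(-theta * t) has decayed, the numerically integrated
# mean should agree with the closed-form periodic mean, whatever x0 was.
numeric = integrate_mean(x0=5.0, t_end=20.0, theta=1.0, amp=2.0, omega=3.0)
closed_form = periodic_mean(20.0, theta=1.0, amp=2.0, omega=3.0)
print(numeric, closed_form, stationary_variance(theta=1.0, sigma=1.0))
```

The terminal law therefore depends on the phase of t within one period, which is the sense in which the reference is a family of Gaussians and not a single invariant law.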


2606.09828 2026-06-09 cs.CV 新提交

Latent Spatial Memory for Video World Models

视频世界模型的潜在空间记忆

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen, Zeyu Zhang, Yefei He, Zicheng Duan, Donny Y. Chen, Yuqing Yang, Bohan Zhuang

发表机构 * Zhejiang University（浙江大学）； Microsoft Research（微软研究院）； Adelaide University（阿德莱德大学）； Monash University（莫纳什大学）

AI总结提出潜在空间记忆框架Mirage，通过在扩散潜在空间中直接构建和查询3D缓存，避免像素空间重建，实现高效视频生成，速度提升10.57倍，内存减少55倍。

Comments Project Page: https://aka.ms/latent-spatial-memory, Code: https://github.com/microsoft/LatentSpatialMemory

详情

AI中文摘要

在生成帧之间保持3D空间一致性的视频世界模型通常依赖于在RGB空间中构建的显式点云记忆。这种设计既计算昂贵（需要重复渲染和VAE编码），又固有地有损（因为通过像素空间的往返会丢弃学习到的潜在表示的丰富特征）。在本文中，我们为视频世界模型引入了潜在空间记忆，这是一种持久化的3D缓存，直接在扩散潜在空间中存储场景信息，避免了像素空间重建。在此基础上，我们提出了Mirage，一种潜在空间中的空间记忆框架，通过深度引导的反投影将潜在令牌提升到3D来构建记忆，并通过直接潜在空间扭曲合成新视图来查询记忆。这种统一的公式消除了像素空间重建的信息损失以及重复编码和渲染的计算负担。实验表明，相对于显式3D基线，潜在空间记忆实现了高达10.57倍的端到端视频生成加速和55倍的内存占用减少。利用扩散模型的几何先验，Mirage在WorldScore上达到了最先进的性能，并在RealEstate10K上实现了强大的重建质量。

英文摘要

Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
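Depth-guided back-projection can be sketched with the standard pinhole camera model. This is a generic illustration, not the Mirage implementation: it assumes camera intrinsics already rescaled to the latent grid, one depth value per latent token, token centers at half-integer grid coordinates, and points expressed in the camera frame. The function names are hypothetical.

```python
def back_project(latent_tokens, depth, fx, fy, cx, cy):
    """Lift an H x W grid of latent tokens to camera-space 3D points.

    Returns a list of ((x, y, z), token) pairs, one per grid cell.
    """
    points = []
    for row, (token_row, depth_row) in enumerate(zip(latent_tokens, depth)):
        for col, (token, z) in enumerate(zip(token_row, depth_row)):
            u, v = col + 0.5, row + 0.5  # token centre in latent-grid units
            x = (u - cx) / fx * z
            y = (v - cy) / fy * z
            points.append(((x, y, z), token))
    return points

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point back to grid coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

# Toy 2 x 2 latent grid; each "token" is a short feature vector.
tokens = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
depth = [[3.0, 3.0], [2.0, 4.0]]
memory = back_project(tokens, depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(memory[0])
```

Querying such a memory from a new viewpoint would additionally apply the relative camera pose before `project`; that step is omitted here.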


2606.08309 2026-06-09 cs.LG cs.CV 交叉投稿

Where the Score Lives: A Wavelet View of Diffusion

分数函数所在之处：扩散的小波视角

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

发表机构 * The Kempner Institute for the Study of Natural and Artificial Intelligence（肯普纳自然与人工智能研究所）； Harvard University（哈佛大学）

AI总结提出基于二维正交小波基的分数函数参数化，通过数据分布矩分析揭示不同架构的归纳偏差，解释扩散模型中分数网络与数据分布的相互作用。

Comments 20 pages, 12 figures, AISTATS 2026

详情

Journal ref: Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300

AI中文摘要

基于分数的生成模型在过去十年中在生成多样化视觉上合理的图像方面取得了显著成功。在扩散建模中，包括CNN、U-Net和Transformer在内的多种架构被用作分数近似网络；然而，迄今为止，关于这些架构选择如何影响生成行为的了解相对较少。在这项工作中，为了提供对此领域的见解，我们提出了一种使用二维正交小波基展开的分数函数的解析可解参数化。特别地，我们根据数据分布的矩推导出可解释的最优分数函数。我们利用这种参数化提供了一种与架构无关的、基于矩的分析，揭示了数据分布的哪些属性对去噪最为重要。我们的分数机器足够灵活，可以部分模仿多种架构（包括U-Net和CNN）的相关归纳偏差，朝着理解不同分数架构为何表现出不同生成行为迈出了一步。由于我们的分数函数可以根据数据矩解析求解，我们可以开始理解数据分布如何与分数网络相互作用，从而产生我们在扩散模型中观察到的行为。

英文摘要

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.


2606.08841 2026-06-09 cs.AI cs.CV 交叉投稿

ZIPP:Zero-shot Image Personalization from Personas

ZIPP：基于人物画像的零样本图像个性化生成

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

发表机构 * Adobe Media and Data Science Research (MDSR)（Adobe媒体与数据科学研究（MDSR））； IIIT-Delhi（德里印度理工学院）； SUNY at Buffalo（纽约州立大学布法罗分校）

AI总结提出ZIPP方法，利用自然语言人物画像通过LLM改写提示词实现零样本图像个性化生成，无需用户数据或微调；引入ZIPBench基准，在多个评测中取得13-20%的提升。

详情

AI中文摘要

文本到图像扩散模型越来越多地部署在开放式创意环境中，但其输出仍然缺乏个性，优化的是整体审美而非个人品味。人类偏好是多元化的：一位喜欢柔和、怀旧肖像的用户可能偏爱充满活力的街头摄影，而另一位则倾向于梦幻的电影美学。现有方法需要密集的交互历史或逐用户微调，在冷启动场景中失败，并将上下文相关的偏好压缩为静态表示。我们提出了基于人物画像的零样本图像个性化生成（ZIPP），该方法以自然语言人物画像（用户身份和审美偏好的简洁描述符）为条件生成图像，无需任何用户特定数据或权重更新。ZIPP使用LLM从给定人物画像的角度重写提示词，引导扩散模型输出个性化结果。为了大规模挖掘人物画像，我们在一个包含2200万用户的Reddit交互图上训练了一个归纳式图注意力网络，采用双对比目标将图结构与视觉行为对齐，然后通过多模态大语言模型将学习到的表示转化为自然语言人物画像。我们引入了ZIPBench，这是首个零样本个性化基准，包含1500名用户、图挖掘的人物画像和4万张生成图像。在四个基准和涵盖五个模型家族的14个LLM上，人物画像条件化带来一致的性能提升（13-20%），前沿模型受益最大。在少样本设置中，ZIPP匹配或超过了基于每用户100多个示例微调的基线。ZIPP实现了最低的偏好分布散度（CMMD 0.16 vs 0.55），且经IPF归一化的人口统计评估表明，它显著减少了现有方法中存在的子群体偏差。人工评估证实，与通用生成相比胜率为79%，与所有微调基线相比胜率为58-65%。

英文摘要

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.


2410.21747 2026-06-09 cs.CV 版本更新

MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

MotionGPT-2：用于运动生成与理解的通用运动-语言模型

Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Dan Xu, Shixiang Tang

发表机构 * Tsinghua University（清华大学）； The University of Sydney（悉尼大学）； University of Science and Technology of China（中国科学技术大学）； The Chinese University of Hong Kong（香港中文大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Intime Department Store（Intime百货）； Deepeleph ； HKUST（香港科技大学）

AI总结提出MotionGPT-2，一种统一的大规模运动-语言模型，通过预训练大语言模型支持多模态控制条件，实现运动生成、描述和补全等多种任务，并引入Part-Aware VQVAE实现细粒度身体和手部运动表示。

详情

AI中文摘要

近年来，从描述性文本生成逼真的人体运动受到了显著的研究关注，这得益于数字人等新兴需求。尽管取得了令人印象深刻的进展，现有方法通常受限于有限的控制模态、任务特异性，并且仅关注身体运动表示。在本文中，我们提出了MotionGPT-2，一种统一的大规模运动-语言模型（LMLM），以解决这些局限性。MotionGPT-2通过预训练的大语言模型（LLM）支持多种运动相关任务和多模态控制条件。它将多模态输入（如文本和单帧姿态）量化为离散的、LLM可解释的标记，无缝集成到LLM的词汇表中。这些标记随后被组织成统一的提示，通过预训练-微调范式引导LLM生成运动输出。我们还展示了所提出的MotionGPT-2通过创新的运动离散化框架Part-Aware VQVAE，能够高度适应具有挑战性的3D整体运动生成任务，该框架确保了身体和手部运动的细粒度表示。大量实验和可视化验证了我们方法的有效性，展示了MotionGPT-2在运动生成、运动描述和广义运动补全任务中的适应性。

英文摘要

Generating lifelike human motions from descriptive text has attracted remarkable research attention in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
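The quantization step, turning continuous motion features into tokens that extend an LLM vocabulary, can be sketched as a nearest-codebook lookup followed by an index offset. This is generic vector quantization, not the paper's Part-Aware VQVAE: the codebook, the feature vectors, the vocabulary size, and the function names are all hypothetical, and the separate treatment of body and hand parts is not modeled.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared L2 distance)."""
    best, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best

def motion_to_token_ids(features, codebook, text_vocab_size):
    """Map per-frame features to token ids placed after the text vocabulary."""
    return [text_vocab_size + nearest_code(f, codebook) for f in features]

# Toy codebook with three 2-dimensional codes and three frame features.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frame_features = [[0.1, -0.1], [4.0, 6.0], [0.9, 1.2]]
token_ids = motion_to_token_ids(frame_features, codebook, text_vocab_size=32000)
print(token_ids)
```

Offsetting by the text vocabulary size keeps motion ids disjoint from text ids, so both kinds of token can share one embedding table and one prompt sequence.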


2508.07011 2026-06-09 cs.CV cs.GR 版本更新

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

HiMat: 基于DiT的超高分辨率SVBRDF生成

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

发表机构 * College of Computer Science, Nankai University（南开大学计算机科学学院）； Adobe Research（Adobe研究）； NVIDIA ； Nanjing University（南京大学）

AI总结提出HiMat框架，利用扩散Transformer和线性注意力在高压缩潜空间生成4K SVBRDF，并通过CrossStitch模块保持跨图一致性，实现高效、多样化的超高分辨率材质生成。

详情

AI中文摘要

创建超高分辨率空间变化双向反射分布函数（SVBRDF）对于逼真的3D内容创作至关重要，以忠实呈现近距离渲染所需的精细表面细节。然而，实现4K生成面临两个关键挑战：（1）需要以全分辨率合成多个反射图，这增加了像素预算并带来了高昂的内存和计算成本；（2）需要在4K下保持跨图的强像素级对齐，这在适配为RGB图像域设计的预训练模型时尤其困难。我们引入了HiMat，一个专为高效且多样化的4K SVBRDF生成量身定制的基于扩散的框架。为解决第一个挑战，HiMat通过DC-AE在高压缩潜空间中进行生成，并采用具有线性注意力的预训练扩散Transformer来提高每图效率。为解决第二个挑战，我们提出了CrossStitch，一个轻量级卷积模块，在不增加全局注意力成本的情况下强制跨图一致性。我们的实验表明，与先前方法相比，HiMat实现了高保真度的4K SVBRDF生成，具有卓越的效率、结构一致性和多样性。除了材质，我们的框架还推广到相关应用，如本征分解。

英文摘要

Creating ultra-high-resolution spatially varying bidirectional reflectance distribution functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

2509.24531 2026-06-09 cs.CV 版本更新

VideoGPA: 通过几何先验知识蒸馏实现3D一致的视频生成

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结 VideoGPA通过几何先验知识蒸馏提升视频生成的3D一致性，利用数据高效的自监督框架引导视频扩散模型，显著增强时间稳定性、几何合理性与运动一致性。

Comments 8 pages, 5 figures, ICML 2026

2602.07345 2026-06-09 cs.CV cs.LG 版本更新

Optimizing Few-Step Generation with Adaptive Matching Distillation

自适应匹配蒸馏优化少步生成

Lichen Bai, Zikai Zhou, Shitong Shao, Wenliang Zhong, Shuo Yang, Shuo Chen, Bojun Chen, Zeke Xie

发表机构 * xLeaF Lab, The Hong Kong University of Science（xLeaF实验室，香港科学与技术大学）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学，深圳）； School of Intelligence Science（智能科学学院）

AI总结提出自适应匹配蒸馏（AMD），通过奖励代理检测并逃离禁止区域，结合结构信号分解和排斥景观锐化，提升少步生成模型的样本保真度和训练鲁棒性。

Comments 25 pages, 15 figures, 11 tables

AI中文摘要

分布匹配蒸馏（DMD）是一种强大的加速范式，但其稳定性常在禁止区域（真实教师提供不可靠指导而虚假教师施加不足排斥力的区域）中受到损害。在这项工作中，我们提出了一个统一的优化框架，将先前的方法重新解释为避免这些受损区域的隐式策略。基于这一见解，我们引入了自适应匹配蒸馏（AMD），一种利用奖励代理显式检测和逃离禁止区域的自我纠正机制。AMD通过结构信号分解动态优先考虑纠正梯度，并引入排斥景观锐化以强制执行陡峭的能量屏障，防止失败模式崩溃。在图像和视频生成任务（如SDXL、Wan2.1）以及严格基准测试（如VBench、GenEval）上的大量实验表明，AMD显著提高了样本保真度和训练鲁棒性。例如，AMD将SDXL上的HPSv2分数从30.64提升至31.25，优于最先进的基线。这些发现验证了在禁止区域内显式纠正优化轨迹对于推动少步生成模型性能上限至关重要。

英文摘要

Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in Forbidden Zones, regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

2604.00903 2026-06-09 cs.CV 版本更新

X-Foresight：一种通过预测世界建模的联合视觉-动作因果预测网络

Baolu Li, Jingyu Qian, Rui Guo, Yilun Chen, Hanpeng Liu, Yuan Lin, Junhong Zhou, Ruixin Liu, Willow Yang, Yutong Zheng, Zhenli Zhang, Sean Li, Chaoda Zheng, Boyang Wang, Tenglong, Gu, Zhuangzhuang Ding, Pengkun Zheng, Yu Zhang, Xianming Liu

发表机构 * PWM Team（PWM团队）； XPeng Inc.（XPeng公司）

AI总结提出X-Foresight，一种将预测世界模型直接集成到VLA架构中的方法，通过长程分块自回归策略和课程学习，联合学习世界建模与实时动作控制，以解决视频预测中的低熵冗余和长程因果建模难题。

AI中文摘要

物理世界知识主要存在于视频中。赋予视觉-语言-动作（VLA）模型此类知识对于安全且可泛化的规划至关重要。预测世界建模通过从过去观测预测未来视频，使VLA能够内化物理动态和长程因果关系。然而，朴素的下一帧预测面临两个挑战：1）与语义上不同的文本标记不同，视频标记是低熵且冗余的，导致预测退化为琐碎的外推；2）世界建模存在时间困境：密集预测捕捉瞬时动态，但无法高效建模长程因果。为有效学习世界知识，我们引入X-Foresight，一种直接集成到VLA架构中的预测世界模型，以联合学习世界建模和实时动作控制。其核心是一种长程分块自回归策略，该策略解决了上述两个挑战：通过预测语义上遥远的块而非相邻帧，它避免了琐碎的外推，同时保留密集的块内帧用于瞬时动态和稀疏的块间过渡用于长程因果。课程学习计划逐步扩展预测范围并稳定长程训练。为有效捕捉长程因果，我们提出时间重要性采样，将监督集中于由自我运动和行为信号识别的安全关键块。我们进一步将逼真合成委托给基于扩散的多视图渲染器，以改善逼真外观。大量实验表明，X-Foresight在规划性能上显著优于VLA基线，同时保持强大的生成保真度，为世界知识驱动的自主系统建立了稳健的范式。

英文摘要

Physical world knowledge resides mainly in videos. Equipping Vision-Language-Action (VLA) models with such knowledge is fundamental for safe and generalizable planning. Predictive world modeling enables VLA to internalize physical dynamics and long-term causality by predicting future video from past observations. However, naive next-frame prediction faces two challenges: 1) unlike semantically distinct text tokens, video tokens are low-entropy and redundant, causing prediction to degenerate into trivial extrapolation; and 2) world modeling poses a temporal dilemma: dense prediction captures instantaneous dynamics, but cannot efficiently model long-horizon causality. To learn world knowledge effectively, we introduce X-Foresight, a predictive world model integrated directly into the VLA architecture to jointly learn world modeling and real-time action control. At its core lies a long-horizon chunk-wise auto-regressive strategy that addresses both challenges: by predicting semantically distant chunks rather than adjacent frames, it escapes trivial extrapolation, while preserving dense intra-chunk frames for instantaneous dynamics and sparse inter-chunk transitions for long-term causality. A curriculum learning schedule progressively extends prediction horizons and stabilizes long-horizon training. To capture long-term causality effectively, we present temporal importance sampling, which concentrates supervision on safety-critical chunks identified by ego-motion and behavioral signals. We further delegate photorealistic synthesis to a diffusion-based multi-view renderer, improving photorealistic appearance. Comprehensive experiments demonstrate that X-Foresight significantly outperforms VLA baselines in planning performance while maintaining strong generative fidelity, establishing a robust paradigm for world-knowledge-driven autonomous systems.

2605.26108 2026-06-09 cs.CV 版本更新

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

通过奖励倾斜分布匹配增强少步生成器

Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

发表机构 * Tencent Hunyuan（腾讯混元）； Hong Kong University of Science and Technology（香港科技大学）； Westlake University（西湖大学）

AI总结提出奖励倾斜分布匹配蒸馏（RTDMD）两阶段框架，结合分布匹配蒸馏与奖励引导强化学习，在仅4步推理下实现文本到图像生成的最新性能。

Comments Code and models are available at https://github.com/Harahan/RTDMD

AI中文摘要

近期少步扩散蒸馏的进展实现了高效图像生成，但将这些模型与人类偏好对齐仍具挑战。我们提出奖励倾斜分布匹配蒸馏（RTDMD），一个两阶段框架，将分布匹配蒸馏与奖励引导的强化学习统一用于少步流生成器。我们证明，最小化到奖励倾斜教师分布的KL散度自然分解为分布匹配项和奖励最大化项。在第一阶段，我们引入环境一致分布匹配蒸馏（AC-DMD），它执行子区间分布匹配，并用一致性正则化增强假分数目标，帮助假分数模型在有限更新下跟踪变化的生成器分布。在第二阶段，我们联合优化两项：对于奖励最大化项，我们推导出一个混合策略梯度，将GRPO风格的估计器用于随机中间过渡，与通过确定性最后步骤的直接奖励反向传播相结合，并进一步引入步骤子集GRPO（SubGRPO）以降低方差。在SD3、SD3.5和FLUX.2上的实验表明，RTDMD在偏好、美学和组合指标上仅用4步推理就建立了新的最先进结果，超越了先前的少步文本到图像生成方法。代码和模型见https://github.com/Harahan/RTDMD。

英文摘要

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
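
The decomposition claimed in this abstract follows from a standard identity for exponentially tilted distributions. The derivation below is a sketch under assumed notation that does not appear in the abstract: $p_T$ for the teacher distribution, $p_\theta$ for the generator, $r$ for the reward, $\beta > 0$ for a temperature, and $Z$ for the normalizer. The paper's exact parameterization may differ.

```latex
% Reward-tilted teacher (assumed form):
p^{*}(x) = \frac{1}{Z}\, p_T(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \int p_T(x)\, \exp\!\big(r(x)/\beta\big)\, dx .

% KL divergence from the generator to the tilted teacher:
\mathrm{KL}\big(p_\theta \,\|\, p^{*}\big)
  = \mathbb{E}_{x \sim p_\theta}\!\left[ \log p_\theta(x) - \log p_T(x) - \frac{r(x)}{\beta} \right] + \log Z
  = \underbrace{\mathrm{KL}\big(p_\theta \,\|\, p_T\big)}_{\text{distribution matching}}
    \; - \; \underbrace{\frac{1}{\beta}\, \mathbb{E}_{x \sim p_\theta}\big[r(x)\big]}_{\text{reward maximization}}
    \; + \; \log Z .
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side is equivalent to matching the teacher while maximizing expected reward, with $\beta$ setting the trade-off.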

2605.31158 2026-06-09 cs.CV cs.LG 版本更新

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

光交互：交互式视频世界模型的免训练推理加速

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

发表机构 * Zhejiang University（浙江大学）； NVIDIA

AI总结针对交互式视频世界模型推理成本高的问题，提出免训练加速框架Light Interaction，通过自适应上下文管理、去噪缓存加速和3D块稀疏注意力实现最高2.59倍加速。

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

AI中文摘要

交互式视频世界模型根据用户控制的相机运动逐块生成视频，支持实时游戏模拟、虚拟场景导航和具身AI训练等应用。然而，由于上下文记忆增长、二次注意力复杂度和重复去噪步骤，扩展到长交互轨迹的成本过高。我们提出Light Interaction，一种用于交互式视频世界模型的免训练推理加速框架。我们的关键洞察是，交互自然支持轨迹依赖的自适应计算：在探索新区域时可丢弃检索到的空间记忆，根据局部潜在动态调整时间上下文，当相机重新访问熟悉区域时可重用早期步骤的模型输出。基于此洞察，Light Interaction结合了自适应上下文管理、去噪缓存加速以及硬件-软件协同设计的3D块稀疏注意力（融合Triton内核）。在HY-WorldPlay和Matrix-Game-3.0上的评估表明，Light Interaction在无需模型重训练的情况下实现了最高2.59倍加速，同时保持有竞争力的视觉质量。

英文摘要

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2606.00094 2026-06-09 cs.CV cs.AI 版本更新

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

显式建模数据流形几何的扩散图像生成

Duoduo Xue, Zhiyu Zhu, Junhui Hou

发表机构 * City University of Hong Kong（香港城市大学）

AI总结提出MIND框架，通过将离散补丁标记化集成到连续扩散模型的得分函数中显式建模流形几何，结合离散标记的结构量化能力和连续扩散的并行生成灵活性，在ImageNet 256×256上显著降低FID。

AI中文摘要

图像生成模型旨在从底层数据流形中采样数据点，这需要学习并解码一个密集、低维且紧凑的参数化空间。为此，我们提出了数据流形感知图像扩散模型（MIND），一种通过将离散补丁标记化集成到连续扩散模型的得分函数中来显式建模流形几何的新框架。该方法成功利用了离散标记的结构量化能力和连续扩散的并行生成灵活性。此外，我们通过一种新颖的软top-$k$聚合机制实现了端到端可微训练，并引入了双分支高频特征嵌入层以缓解Transformer主干网络在低维输入上的谱偏差。进一步地，在推理阶段，我们设计了一种多阶段过渡采样方案，根据时间步动态调整采样方案。在ImageNet 256×256上的大量实验证明了MIND的有效性。经过80个epoch的训练，我们的基础模型在无引导情况下实现了22.73的FID，几乎将原始DiT-B/2基线的43.47 FID减半。与基线DiT和SiT相比，所提方法平均分别降低了15.95和9.06的FID。对于ImageNet-256×256上的引导图像生成，所提MIND-B仅用130M参数就实现了2.06的FID，超过了具有3.1B参数的LlamaGen-3B。所提MIND-XL具有715M参数，进一步将FID降低至1.95。我们的MIND为基于扩散的图像生成引入了全新视角，为该领域的未来研究和创新铺平了道路。代码将公开提供。

英文摘要

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256$\times$256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet-256$\times$256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, surpassing the LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.
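
The abstract does not define its soft top-$k$ aggregation, so the following is only one plausible reading: score a query against a codebook, keep the $k$ best entries, and return their softmax-weighted average so that the output varies smoothly with the scores. The function name, the dot-product scoring, and the `temperature` parameter are all assumptions, not the paper's formulation.

```python
import math

def soft_topk_aggregate(query, codebook, k=2, temperature=1.0):
    """Softmax-weighted average of the k codebook entries most similar to `query`.

    Returns (aggregated_vector, selected_indices). Illustrative only.
    """
    # Dot-product similarity between the query and every code vector.
    scores = [sum(q * c for q, c in zip(query, code)) for code in codebook]
    # Indices of the k highest-scoring codes, best first.
    top = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected codes.
    m = max(scores[i] for i in top)
    weights = [math.exp((scores[i] - m) / temperature) for i in top]
    z = sum(weights)
    out = [0.0] * len(query)
    for w, i in zip(weights, top):
        for d in range(len(query)):
            out[d] += (w / z) * codebook[i][d]
    return out, top

codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
vec, picked = soft_topk_aggregate((2.0, 1.0), codebook, k=2)
print(picked)  # [0, 1]
print(vec)     # convex combination of the two selected codes
```

With `k=1` the function reduces to a hard nearest-code lookup, which is the non-differentiable case a soft variant is meant to relax.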

2606.05816 2026-06-09 cs.CV cs.AI 版本更新

Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

基于LLM提示翻译和LoRA微调的韩语日记文本情感感知图像生成

Jihun Cho, Soo-Yeon Jeong, Sun-Young Ihm

发表机构 * KAIST（韩国科学技术院）

AI总结提出一种情感感知文本到图像流水线，利用Qwen3-8B识别短日记中的隐含情感，并通过LoRA微调Stable Diffusion 3.5 Medium生成儿童手绘风格图像，同时探讨情感触发词的影响及CLIP Score作为评估指标的局限性。

Comments 4 pages, 4 figures, 2 tables, MITA 2026

2503.08434 2026-06-09 cs.GR cs.CV 版本更新

Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models

Bokeh Diffusion：文本到图像扩散模型中的散焦模糊控制

Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University（S实验室，南洋理工大学）

AI总结提出Bokeh Diffusion框架，通过物理散焦模糊参数条件化扩散模型，结合混合训练流程和接地自注意力机制，实现场景一致的散景模糊控制，支持双向模糊强度调整和真实图像编辑。

Comments SIGGRAPH Asia 2025. Project page: https://atfortes.github.io/projects/bokeh-diffusion/

DOI: 10.1145/3757377.3763906

AI中文摘要

近年来，大规模文本到图像模型的进展通过从文本提示生成视觉上引人入胜的输出，彻底改变了创意领域；然而，传统摄影通过光圈等相机设置精确控制景深以塑造视觉美学，而当前的扩散模型通常依赖提示工程来模拟此类效果。这种方法往往导致粗略的近似，并意外地改变场景内容。在这项工作中，我们提出了Bokeh Diffusion，一个场景一致的散景控制框架，它显式地将扩散模型条件化在一个物理散焦模糊参数上。为了克服在不同相机设置下捕获的配对真实世界图像的稀缺性，我们引入了一个混合训练流程，将野外图像与合成模糊增强对齐，提供多样化的场景和主体，以及监督学习以分离图像内容与镜头模糊。我们框架的核心是接地自注意力机制，该机制在同一场景的不同散景水平的图像对上训练，使得模糊强度可以在保持底层场景的同时双向调整。大量实验表明，我们的方法实现了灵活的、类似镜头的模糊控制，支持通过反演进行真实图像编辑等下游应用，并在Stable Diffusion和FLUX架构上有效泛化。

英文摘要

Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics - such as depth-of-field via aperture - current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently alters the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations, providing diverse scenes and subjects as well as supervision to learn the separation of image content from lens blur. Central to our framework is our grounded self-attention mechanism, trained on image pairs with different bokeh levels of the same scene, which enables blur strength to be adjusted in both directions while preserving the underlying scene. Extensive experiments demonstrate that our approach enables flexible, lens-like blur control, supports downstream applications such as real image editing via inversion, and generalizes effectively across both Stable Diffusion and FLUX architectures.

2605.27852 2026-06-09 cs.GR cs.CV 版本更新

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

ClothTransformer: 用于可扩展布料模拟的统一潜空间变换器

Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University, Singapore（新加坡南洋理工大学S实验室，南洋理工大学，新加坡）； Feeling AI ； University of Oxford（牛津大学）； Nanyang Technological University（南洋理工大学）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出ClothTransformer，通过将布料模拟重构为潜空间中的自回归序列建模，使用统一Transformer架构处理多种场景，实现比现有方法低4-9倍的误差，并引入可扩展潜空间公式和无穿透数据集。

AI中文摘要

统一且可扩展的变换器最近在建模传统上与计算机图形学相关的多种现象（如3D视觉效果、渲染过程和视频中的运动）方面取得了显著成功。在这项工作中，我们进一步研究现代变换器技术是否能够应对布料模拟这一挑战性任务。为此，我们提出了ClothTransformer，这是一个将布料模拟重构为在学习的潜空间中进行自回归序列建模的框架。现有的神经布料模拟器大多专用于单一场景，与网格离散化内在耦合，并且缺乏鲁棒的碰撞处理。我们的方法通过三个贡献解决了这些局限性：（1）一个统一的变换器架构，在单一模型下处理多种场景——身体驱动的服装、机器人操作和自由落体碰撞——并在所有场景中实现比先前最先进方法低约4-9倍的误差；（2）一个可扩展的潜空间公式，将任意分辨率的网格压缩为固定大小的潜令牌集，使得时间动态计算独立于网格分辨率；（3）一个覆盖所有三种设置的高保真无穿透数据集（约493.4k帧），该数据集支持可微分的连续碰撞检测（CCD）模块以抑制穿透伪影。

英文摘要

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios -- body-driven garments, robotic manipulation, and free-fall collisions -- under a single model and achieves approximately $4$--$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts. Project Page: https://yucrazing.github.io/clothtransformer/

2606.06497 2026-06-09 cs.GR cs.CV cs.HC 版本更新

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

实时注意力弯曲：视频扩散变换器的粒度交互式网络弯曲

Adam Cole, Rebecca Fiebrink, Mick Grierson

发表机构 * Creative Computing Institute（创意计算研究所）； University of the Arts London（伦敦艺术大学）

AI总结提出实时注意力弯曲工具，通过操纵视频扩散变换器的自注意力、交叉注意力及前馈网络，实现逐层、逐步、逐令牌的交互式生成控制，增强艺术家的创作代理与模型材料亲密性。

Comments 5 pages, 4 figures. Accepted to ACM Creativity & Cognition XAIxArts Workshop 2026

AI中文摘要

生成式视频模型已实现显著的视觉保真度，但其仅提示的界面提供了薄弱的创作代理，并使得艺术家无法了解模型的物质过程。我们提出了实时注意力弯曲，这是一种将网络弯曲实践扩展到视频扩散变换器（DiT）全深度并使其进入实时交互式生成的工具。作为DayDream Scope生态系统中的插件构建，并封装了开源实时Wan管道，该工具将自注意力、交叉注意力和前馈网络暴露为可独立操作的面，目标可细化到单个扩散步骤、DiT层、提示令牌和隐藏神经元。实时操作的即时性提供了我们所谓的与模型的“物质亲密性”：对特定层和神经元如何塑造生成视频的响应式、近乎机械的感觉。我们将该工具定位为同时作为对变换器内部结构的XAIxArts探针，以及用于发现模型默认表示空间之外的美学的表达性乐器。

英文摘要

Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope ecosystem and wrapping open-source real-time Wan pipelines, the tool exposes self-attention, cross-attention, and the feed-forward network as independently manipulable surfaces, with targeting down to individual diffusion steps, DiT layers, prompt tokens, and hidden neurons. The immediacy of live manipulation affords what we call "material intimacy" with the model: a responsive, near-mechanistic feel for how specific layers and neurons shape generated video. We position the tool as simultaneously an XAIxArts probe into transformer internals and an expressive instrument for discovering aesthetics outside the model's default representational space.

2606.07670 2026-06-09 cs.CV cs.AI 新提交

Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting

液态神经网络作为动态3D高斯泼溅的即插即用连续时间变形场

Mingzhao Li, Arghya Pal, Guan Yuan Tan

发表机构 * Monash University（莫纳什大学）

AI总结提出用液态神经网络（LNN）的闭式连续时间（CfC）单元替代MLP，构建显式连续时间变形场，在动态场景重建中匹配或超越MLP基线，尤其擅长高频关节运动。

AI中文摘要

可变形3D高斯泼溅（D-3DGS）通过一个位置编码的MLP（以帧时间t为输入）变形一组规范3D高斯，从单目视频重建动态场景。尽管拟合连续变量，但MLP在架构中不耦合任意两个t值，实际上预测离散的逐帧偏移，使得时间平滑性仅作为优化的副产品出现。我们将变形场重新设计为一组闭式连续时间（CfC）单元，即液态神经网络（LNN），它是液态时间常数ODE的闭式解，同时保留D-3DGS管道的其他部分。每个单元暴露一个sigmoid时间门，在两个候选隐藏状态之间插值，将学习到的对t的平滑响应嵌入损失景观，无需调用任何数值求解器。在八个D-NeRF和七个NeRF-DS场景上，液态场在总体上匹配或超过MLP基线，其最大增益集中在具有最高频关节运动的场景上。结果是一种近乎零摩擦的架构设计，将离散的MLP变形场转变为t的显式连续时间函数。

英文摘要

Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians through a positional-encoded MLP of frame time t. Although fitted to a continuous variable, the MLP couples no two values of t in its architecture and effectively predicts discrete per-frame offsets, leaving temporal smoothness to emerge only as a byproduct of optimisation. We redesign the deformation field as a stack of Closed-form Continuous-time (CfC) cells, a Liquid Neural Network (LNN), that is the closed-form solution of the Liquid Time-constant ODE while preserving every other part of the D-3DGS pipeline. Each cell exposes a sigmoidal time gate that interpolates between two candidate hidden states, baking a learned smooth response to t into the loss landscape without invoking any numerical solver. On the eight D-NeRF and seven NeRF-DS scenes the liquid field matches or exceeds the MLP baseline in aggregate, with its largest gains concentrated on the scenes with the most high-frequency articulated motion. The result is a near-zero-friction architectural design that turns the discrete MLP deformation field into an explicit continuous-time function of t.
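
The sigmoidal time gate described here matches the published CfC update, in which a gate $\sigma(-f\,t)$ blends two candidate states $g$ and $h$. The scalar sketch below shows only that gating arithmetic. In a real cell, `f`, `g`, and `h` would be outputs of learned network heads; here they are plain numbers chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cfc_gate(f, g, h, t):
    """Closed-form continuous-time blend of two candidate states.

    gate = sigmoid(-f * t); output = gate * g + (1 - gate) * h.
    f, g, h stand in for learned head outputs (assumed values here).
    """
    gate = sigmoid(-f * t)
    return gate * g + (1.0 - gate) * h

# With f > 0 the state moves smoothly from the midpoint of g and h toward h as t grows.
for t in (0.0, 0.5, 1.0, 5.0):
    print(t, round(cfc_gate(f=1.0, g=0.0, h=1.0, t=t), 4))
```

Because the output is an explicit smooth function of `t`, nearby timestamps yield nearby states by construction, which is the property the abstract contrasts with per-frame MLP offsets.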

2606.07907 2026-06-09 cs.CV cs.AI 新提交

3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning

基于匹配学习的改进顶点分布的3D口腔建模

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

AI总结针对3D口腔重建中预测顶点分布不均的问题，提出结合匈牙利匹配过滤与排斥损失的改进损失函数，使顶点分布更均匀，虽精度略降但有效缓解了聚集现象。

Comments 5 pages, 7 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

AI中文摘要

在我们之前的工作中，提出了一个基于深度学习的3D口内重建框架。该模型直接从十张固定角度的口内图像预测显式3D点云坐标，采用MobileNetV2和多头注意力进行多视图特征融合，并使用L1损失和倒角距离的组合作为损失函数。尽管模型达到了77.49%的准确率，但预测顶点倾向于集中在真实值的高密度区域，而其他区域大部分未被覆盖。在本文中，提出了一种改进的损失函数来解决这一局限性。引入了带过滤的匈牙利匹配和排斥损失，以强制重建模型上的顶点分布更加均匀。所提出的模型达到了68.02%的准确率，数值上低于之前的模型。然而，先前工作中观察到的顶点聚集问题得到了显著缓解，预测顶点在整个重建表面上分布更加均匀。

英文摘要

In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D point cloud coordinates from ten fixed-angle intraoral images, employing MobileNetV2 and Multi-head Attention for multi-view feature fusion, with a combined L1 Loss and Chamfer Distance as the loss function. Although the model achieved an accuracy of 77.49%, predicted vertices tended to concentrate in high-density regions of the ground truth, leaving other regions largely uncovered. In this paper, an improved loss function is proposed to address this limitation. Hungarian matching with filtering and Repulsion Loss are introduced to enforce more uniform vertex distribution across the reconstructed model. The proposed model achieves an accuracy of 68.02%, which is numerically lower than the previous model. However, the vertex clustering issue observed in the prior work is substantially alleviated, with predicted vertices distributed more evenly across the entire reconstructed surface.
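
A toy comparison suggests why one-to-one matching and a repulsion term discourage the clustering described above. This is a sketch under simplifying assumptions, not the paper's loss: the matching is found by brute force over permutations (a stand-in for the Hungarian algorithm, feasible only for tiny equal-sized sets), the filtering step is omitted, and the repulsion term is a simple hinge on pairwise distance with a made-up `radius`.

```python
import itertools
import math

def chamfer(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    p2g = sum(min(math.dist(p, g) for g in gt) for p in pred) / len(pred)
    g2p = sum(min(math.dist(g, p) for p in pred) for g in gt) / len(gt)
    return p2g + g2p

def one_to_one_cost(pred, gt):
    """Mean distance under the best one-to-one assignment (requires len(pred) == len(gt))."""
    best = min(
        sum(math.dist(p, g) for p, g in zip(pred, perm))
        for perm in itertools.permutations(gt)
    )
    return best / len(pred)

def repulsion(pred, radius=0.5):
    """Hinge penalty on every pair of predicted points closer than `radius`."""
    return sum(
        max(0.0, radius - math.dist(a, b))
        for a, b in itertools.combinations(pred, 2)
    )

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
clustered = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]  # two points collapsed, one GT point uncovered
spread = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

print(chamfer(clustered, gt), one_to_one_cost(clustered, gt), repulsion(clustered))
print(chamfer(spread, gt), one_to_one_cost(spread, gt), repulsion(spread))
```

In the clustered case, Chamfer charges nothing for the duplicated point because its nearest ground-truth neighbour is at distance zero. The one-to-one cost forces each prediction to claim a distinct target, and the repulsion term penalizes the collapsed pair directly.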

2606.07932 2026-06-09 cs.CV cs.GR cs.MM eess.IV math.OC 新提交

利用卫星图像赋予前馈重建模型度量尺度

Xianghui Ze, Yongjian Luo, Mengjun Chao, Zhenbo Song, Jianfeng Lu, Yujiao Shi

发表机构 * Nanjing University of Science and Technology（南京理工大学）； ShanghaiTech University（上海科技大学）

AI总结提出卫星引导框架，通过双向交叉视图交互利用卫星图像作为全局度量参考，解决前馈3D重建中的尺度模糊问题，实现度量深度估计、点云重建和相机定位。

AI中文摘要

前馈3D重建模型最近在多样场景中展现出强大的泛化能力，但大多数模型仅能恢复未知全局尺度下的几何结构。这种尺度模糊限制了它们在需要环境度量理解的应用中的使用。现有的度量重建方法通常依赖于大规模度量标注或精确的相机标定，这在许多实际场景中成本高昂或不可靠。我们提出了一种卫星引导框架，用于解决前馈3D重建中的尺度模糊问题。关键思想是利用现成的卫星图像作为全局度量参考。给定粗略的相机姿态，我们的方法检索局部卫星图像块，并通过双向交叉视图交互将其与前馈重建主干集成。通过强制重建场景与卫星参考之间的一致性，模型推断绝对尺度、细化场景几何并在度量坐标系中估计相机姿态。在KITTI、nuScenes和Oxford RobotCar上的实验表明，该方法在度量深度估计、多视角点云重建和跨视角相机定位方面取得了一致改进，同时保持了跨数据集和地理区域的强泛化能力。

英文摘要

Feed-forward 3D reconstruction models have recently shown strong generalization across diverse scenes, yet most of them recover geometry only up to an unknown global scale. This scale ambiguity limits their use in applications that require metric understanding of the environment. Existing metric reconstruction methods commonly rely on large-scale metric annotations or accurate camera calibration, both of which are costly or unreliable in many real-world settings. We propose a satellite-guided framework for resolving scale ambiguity in feed-forward 3D reconstruction. The key idea is to use readily available satellite imagery as a global metric reference. Given a coarse camera pose, our method retrieves a local satellite patch and integrates it with a feed-forward reconstruction backbone through bidirectional cross-view interaction. By enforcing consistency between the reconstructed scene and the satellite reference, the model infers absolute scale, refines scene geometry, and estimates camera pose in a metric coordinate frame. Experiments on KITTI, nuScenes, and Oxford RobotCar show consistent improvements in metric depth estimation, multi-view point-cloud reconstruction, and cross-view camera localization, while preserving strong generalization across datasets and geographic regions.
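
The scale ambiguity itself is easy to state numerically: an up-to-scale reconstruction differs from the metric one by a single unknown factor, which can be recovered once any metric reference is available. The sketch below fits that factor in closed form by least squares. It illustrates the ambiguity only; it is not the paper's learned cross-view method, and the function name and depth values are made up.

```python
def fit_scale(pred_depths, ref_depths):
    """Least-squares scale s minimizing sum((s * pred - ref)^2).

    Closed form: s = sum(pred * ref) / sum(pred^2).
    """
    num = sum(p * r for p, r in zip(pred_depths, ref_depths))
    den = sum(p * p for p in pred_depths)
    if den == 0.0:
        raise ValueError("predicted depths are all zero; scale is undefined")
    return num / den

pred = [1.0, 2.0, 4.0]   # up-to-scale depths from a reconstruction model
ref = [2.5, 5.0, 10.0]   # metric depths from some reference (hypothetical values)

s = fit_scale(pred, ref)
print(s)                      # 2.5
print([s * p for p in pred])  # [2.5, 5.0, 10.0]
```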

2606.08284 2026-06-09 cs.CV cs.RO 新提交

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

G2G：利用组内几何进行组间姿态估计

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

发表机构 * State Key Laboratory of Industrial Control and Technology, Zhejiang University（浙江大学工业控制技术国家重点实验室）； Zhejiang Humanoid Robot Innovation Center Co., Ltd.（浙江人形机器人创新中心有限公司）； School of Information Science and Engineering, Hangzhou Normal University（杭州师范大学信息科学与工程学院）

AI总结提出G2G方法，通过冻结多视图基础模型并添加三个轻量可训练模块（感知器重采样器、跨组桥接模块和多帧姿态头），仅利用相对姿态监督实现组间6-DoF姿态估计，在四个数据集上达到SOTA。

AI中文摘要

恢复两个图像组之间的相对6-DoF姿态是跨序列重定位和多相机刚性里程计的基础。每个组通过视觉里程计或刚性校准携带已知的组内几何，预训练的多视图骨干网络已经将这种几何融合到视觉特征中。然而，当前模型将所有视图视为非结构化集合，缺少跨组推理的关键环节。我们提出G2G，该方法保持基础模型完全冻结，并添加三个轻量可训练模块来桥接两个组：感知器重采样器、带有合并自注意力的跨组桥接模块以及多帧姿态头。可训练部分总计约32M参数，不到完整模型的6%，且仅由相对姿态监督。在四个数据集（涵盖室内外仿真、真实世界跨季节采集以及零样本仿真到真实迁移）上，G2G在两个任务上都达到了最先进的精度，而每个基线都使用其完整的原始监督进行重新训练。代码可在https://github.com/WeiYuFei0217/G2G获取。

英文摘要

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

2606.08957 2026-06-09 cs.CV 新提交

Rethinking 3D Shape Generation: Diffusion over Superquadrics

重新思考3D形状生成：超二次曲面上的扩散

Zhiyang Liu, Wanze Li, Yuwei Wu, Chengran Yuan, Jiawei Sun, Rui Zheng, Marcelo H Ang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出将扩散模型从高基数几何表示转移到紧凑的超二次曲面参数上，以降低计算和内存成本，并支持无分辨率点云解码、部件级编辑和约束设计，实现高效生成。

Comments Accepted to ICML2026

AI中文摘要

扩散模型推动了3D形状生成的发展，但大多数方法仍然在高基数空间（如体素/SDF网格、网格或点云）中进行去噪，这计算和内存密集，难以在更高分辨率和更强可控性方面扩展。我们重新思考扩散表示，提出将扩散从密集几何转移到紧凑的几何基元，将每个形状表示为少量超二次曲面。我们不操作数千到数百万个几何表示值，而是利用7KB的超二次曲面参数（姿态、大小和形状），大幅降低扩散状态维度和每步计算/内存。我们的超二次曲面扩散通过支持更广泛的能力（如无分辨率点云解码、部件级编辑和基于约束的设计）提高了可扩展性，并在点云解码后在标准基准上实现了具有竞争力的表面保真度和分布性能，同时在大多数条件下每个形状的生成时间仅为0.6秒。

英文摘要

Diffusion models have advanced 3D shape generation, yet most methods still denoise in high-cardinality spaces (e.g., voxel/SDF grids, meshes, or point clouds), which is computationally and memory intensive and makes it difficult to scale in terms of both higher resolution and stronger controllability. We rethink the diffusion representation and propose to move diffusion from dense geometry to compact geometric primitives, representing each shape as a small set of superquadrics. Instead of operating on thousands to millions of geometric representation values, we leverage 7KB superquadric parameters (pose, size, and shape), drastically reducing diffusion-state dimensionality and per-step compute/memory. Our diffusion-over-superquadrics improves scalability by supporting broader capabilities (e.g., resolution-free point-cloud decoding, part-level editing, and constraint-based design) and achieving competitive surface-fidelity and distributional performance on standard benchmarks after point-cloud decoding, while enabling efficient generation within 0.6s per shape for most conditions.
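
For readers unfamiliar with the primitive, a superquadric in its canonical frame is defined by three size parameters and two shape exponents through the standard inside-outside function. The sketch below evaluates that function. It omits the pose, which would first transform a query point into the primitive's local frame. The parameter names `a1, a2, a3, e1, e2` follow common convention and are not taken from the paper.

```python
def superquadric_F(x, y, z, a1, a2, a3, e1, e2):
    """Inside-outside function of a superquadric in its local frame.

    F < 1: point is inside; F == 1: on the surface; F > 1: outside.
    a1, a2, a3 are axis sizes; e1, e2 control roundness (e1 = e2 = 1 gives an ellipsoid).
    """
    xy = abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + abs(z / a3) ** (2.0 / e1)

# Unit sphere: e1 = e2 = 1 and all sizes equal to 1.
print(superquadric_F(0.0, 0.0, 0.0, 1, 1, 1, 1.0, 1.0))  # 0.0 (centre, inside)
print(superquadric_F(1.0, 0.0, 0.0, 1, 1, 1, 1.0, 1.0))  # 1.0 (on the surface)
print(superquadric_F(1.0, 1.0, 1.0, 1, 1, 1, 1.0, 1.0))  # 3.0 (outside)
```

Small exponents push the shape toward a box, which is why a handful of such primitives can approximate many man-made objects with very few parameters.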

2606.09034 2026-06-09 cs.CV 新提交

Leveraging NeRF-Rendered Images for 3D Gaussian Splatting

利用NeRF渲染图像进行3D高斯泼溅

Mizuki Morikawa, Yuta Shimizu, Chunyu Li, Yusuke Monno, Masatoshi Okutomi

AI总结提出利用NeRF渲染图像辅助3DGS训练，通过去除瞬态物体和生成鸟瞰视图，结合扩散增强，在保持3DGS速度的同时提升街景渲染质量。

Comments ICIP 2026

AI中文摘要

神经辐射场（NeRF）和3D高斯泼溅（3DGS）是两种主流的新视角合成方法。它们通常表现出互补的性能，即3DGS渲染速度更快，而NeRF渲染质量更高。受此启发，我们提出利用NeRF渲染的图像来辅助3DGS。具体来说，我们针对街景场景，利用预训练的街景专用NeRF方法为目标3DGS方法生成训练图像。在我们的3DGS训练中，NeRF渲染的图像用于去除街景输入视图中的瞬态物体，并生成鸟瞰视图作为额外视图，从而将NeRF的高质量渲染继承到3DGS中。我们进一步引入基于扩散的图像增强，以提高额外视图的图像质量。在一个人工合成数据集和两个真实数据集上的实验结果表明，我们提出的方法在保持3DGS速度和NeRF质量的同时，改善了街景渲染效果。

英文摘要

Neural radiance field (NeRF) and 3D Gaussian splatting (3DGS) are two mainstream approaches for novel view synthesis. They often show complementary performance, i.e., 3DGS demonstrating faster rendering speed and NeRF demonstrating higher rendering quality. Motivated by this, we propose leveraging NeRF-rendered images for 3DGS. Specifically, we target street scenes and utilize a pre-trained street-specific NeRF method to produce training images for a target 3DGS method. In our 3DGS training, NeRF-rendered images are used to remove transient objects in street-level input views and to generate bird's-eye views as additional views, inheriting the higher-quality rendering of NeRF into 3DGS. We further incorporate a diffusion-based image enhancement to improve the image quality of the additional views. Experimental results on one synthetic and two real datasets demonstrate that our proposed method improves street-scene rendering while preserving the speed of 3DGS and the quality of NeRF.

2606.09074 2026-06-09 cs.CV 新提交

REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance

REFINE: 通过无渲染的基元重要性实现超高效的3D高斯泼溅剪枝

Zhang Chen, Shuai Wan, Mengting Yu, Fuzheng Yang, Junhui Hou

发表机构 * Northwestern Polytechnical University（西北工业大学）； Xidian University（西安电子科技大学）； City University of Hong Kong（香港城市大学）

AI总结提出REFINE框架，利用无渲染的基元重要性度量（基于解析近似的Hessian场）实现3D高斯泼溅的高效剪枝，在保持渲染质量的同时将剪枝计算复杂度降低3000倍。

AI中文摘要

现有的3D高斯泼溅（3DGS）剪枝方法要么导致严重的质量下降，要么带来过高的计算开销。本文提出REFINE，一个高度加速的3DGS剪枝框架，其核心是一种新颖的无渲染基元重要性度量。我们的方法利用解析近似、渲染感知的Hessian场来量化移除单个基元所导致的预期感知误差。通过建模可见性、投影几何和内容自适应超参数的联合调制，我们完全绕过了昂贵的正向渲染过程，推导出一个各向异性的感知权重场，作为基元重要性的高保真代理。在多个基准数据集上的大量实验表明，REFINE在保持极具竞争力的渲染质量的同时，与最先进的剪枝方法相比，实现了前所未有的3000倍剪枝相关计算复杂度降低。

英文摘要

Existing pruning methods for 3D Gaussian splatting (3DGS) suffer from either severe quality degradation or prohibitive computational overhead. In this paper, we propose REFINE, a highly accelerated 3DGS pruning framework centered on a novel rendering-free primitive importance metric. Our approach leverages an analytically approximated, rendering-aware Hessian field to quantify the expected perceptual error induced by the removal of individual primitives. By modeling the joint modulation of visibility, projection geometry and the content adaptive hyperparameter, we entirely bypass costly forward rendering passes and derive an anisotropic perceptual weight field that serves as a high-fidelity proxy for primitive importance. Extensive experiments across multiple benchmark datasets demonstrate that REFINE maintains highly competitive rendering quality while achieving an unprecedented 3,000× reduction in pruning-related computational complexity compared to state-of-the-art pruning methods.

2606.09123 2026-06-09 cs.CV cs.AI 新提交

超越球谐函数：重新思考辐射重建的外观模型

Ewa Miazga, Jorge Condor, Piotr Didyk

发表机构 * École Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）； Università della Svizzera Italiana（意大利语区瑞士大学）

AI总结本文系统评估多种球面函数，提出归一化各向异性球面Gabor函数，以紧凑表示高效建模高频外观效果，在辐射场重建中实现五倍内存节省和更优质量。

Comments 19 pages, 11 figures

AI中文摘要

视角相关的外观建模在新视角合成与重建中仍是一个具有挑战性的问题。准确表示复杂的角度效应通常需要大量的内存和计算资源。对于新的基于学习的方法，常见做法是依赖球谐函数（SH）。然而，捕捉镜面反射等高频率现象需要高阶展开，这会增加内存使用和计算成本。因此，大多数方法采用低阶SH，这限制了建模复杂视角相关效应的能力，导致表示过于平滑或漫反射。为解决这些限制，我们系统评估了场景重建中多种球面函数。其中一些函数在本文中首次被引入图形学和计算机视觉领域。基于实验洞察，我们提出了一种新的球面公式——归一化各向异性球面Gabor函数，它能够在保持紧凑表示的同时高效建模和学习高频外观效果。与现有方法相比，我们的函数在重建如闪光等视角相关现象时实现了更高质量，同时内存效率提高五倍，且评估更高效。我们在辐射场重建任务中验证了其性能。

英文摘要

View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from these experiments, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, that enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.
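
The memory cost of high-order SH follows from a simple count: an expansion up to degree L has (L + 1)^2 basis functions, each needing one coefficient per color channel. The short sketch below tabulates this growth; the helper name `sh_coefficient_count` is illustrative and not taken from the paper.

```python
def sh_coefficient_count(degree, channels=3):
    """Number of SH coefficients for all bands up to `degree`, per primitive."""
    return channels * (degree + 1) ** 2


# Coefficients per primitive for RGB appearance at degrees 0 through 4.
counts = {degree: sh_coefficient_count(degree) for degree in range(5)}
# counts == {0: 3, 1: 12, 2: 27, 3: 48, 4: 75}
```

The quadratic growth per primitive is why low-order SH is the common default, at the price of smoothing out sharp view-dependent effects.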

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM 交叉投稿

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

发表机构 * Southern University of Science and Technology（南方科技大学）； SpatialTemporal AI（时空人工智能）

AI总结提出概念相邻场景图剪枝器(CAPruner)，通过融合模糊语义相关性和空间邻近性估计关系重要性，在任务特定上下文中选择关键关系，避免关系级标注，显著提升大语言模型在3D视觉语言任务上的空间推理性能。

Comments Accepted by ACL 2026 Main Conference

详情

AI中文摘要

大型语言模型（LLMs）最近被应用于3D视觉语言（3D-VL）任务，这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系，但在完整图上进行推理会导致高昂的令牌成本和计算效率低下，因此需要剪枝。现有的剪枝方法主要依赖空间邻近性，常常移除任务相关的关系，从而削弱可靠的空间推理。为了解决这些局限性，我们推导出场景图剪枝的一个关键要求：保留与特定3D-VL任务最相关的空间关系。在此洞察指导下，我们提出了概念相邻场景图剪枝器（CAPruner）。CAPruner将模糊语义相关性与空间邻近性相结合，以估计关系的重要性，从而能够在任务特定上下文中选择关键关系。此外，为了避免昂贵的关系级标注，CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明，CAPruner有效保留了空间推理所必需的关系，从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.
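
To make the idea of blending semantic relevance with spatial proximity concrete, here is a toy sketch. It is not CAPruner's scoring function: the linear blend, the exponential distance decay, and the names `Relation`, `relation_score`, `alpha`, and `sigma` are all assumptions chosen for illustration.

```python
import math
from typing import List, NamedTuple


class Relation(NamedTuple):
    name: str
    semantic_relevance: float  # assumed to lie in [0, 1], relative to the query
    distance_m: float          # distance between the two objects, in meters


def relation_score(rel: Relation, alpha: float = 0.5, sigma: float = 1.0) -> float:
    """Blend relevance and proximity; alpha=0 means proximity only (hypothetical form)."""
    proximity = math.exp(-rel.distance_m / sigma)
    return alpha * rel.semantic_relevance + (1.0 - alpha) * proximity


def prune(relations: List[Relation], keep: int, alpha: float = 0.5) -> List[str]:
    """Keep the `keep` highest-scoring relations and return their names."""
    ranked = sorted(relations, key=lambda r: relation_score(r, alpha), reverse=True)
    return [r.name for r in ranked[:keep]]


relations = [
    Relation("lamp above desk", semantic_relevance=0.9, distance_m=3.0),
    Relation("cup next-to plate", semantic_relevance=0.1, distance_m=0.2),
    Relation("chair near wall", semantic_relevance=0.2, distance_m=2.0),
]
proximity_only = prune(relations, keep=2, alpha=0.0)
blended = prune(relations, keep=2, alpha=0.5)
```

In this toy scene, proximity-only pruning drops the distant but query-relevant relation, while the blended score retains it, which is the failure mode the abstract attributes to purely spatial pruning.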

2606.07791 2026-06-09 cs.GR cs.CV cs.IR 交叉投稿

Frequency-Scale Saliency for Spectral Descriptor Analysis in 3D Shape Retrieval

频率尺度显著性用于三维形状检索中的谱描述符分析

Jianru Shen

发表机构 * University of Montana（蒙大拿大学）

AI总结提出频率尺度显著性框架，通过消融量化描述符尺度区间的检索贡献，发现短尺度主导性能而长尺度有害，加权检索在困难类别上提升mAP 0.156。

Comments Accepted at Computer Graphics International (CGI) 2026

2606.08041 2026-06-09 cs.GR cs.CV 交叉投稿

Wispy to Voluminous: Prior-free Multi-view Capture of Strand-level Facial Hair

从稀疏到浓密：无先验的发丝级面部毛发多视角捕获

Jaeseong Lee, Giljoo Nam, Adrian Jarabo, Carlos Aliaga

发表机构 * KAIST（韩国科学技术院）； Meta Codec Avatar Lab（Meta 编码人像实验室）； Meta Reality Labs Research（Meta 现实实验室研究）

AI总结提出从多视角图像自动重建面部毛发（胡须、眉毛等）的管线，将无结构3D高斯表示转换为显式曲线发丝，解决几何歧义，实现高保真发丝重建。

Comments 27 pages, 16 figures, supplementary included

AI中文摘要

面部毛发是个人身份的一个决定性特征，但仍然是数字头像的关键瓶颈。最近的体积方法实现了照片级真实感，但将毛发烘焙到面部几何中，阻碍了可编辑性，并且无法解析稀疏的、发丝状结构。同时，头皮毛发重建方法针对密集的毛发体积，无法适应面部毛发稀疏、空间变化的特性。我们提出了一种管线，从多视角图像自动重建面部毛发——胡须、髭须、睫毛和眉毛，将无结构的3D高斯表示转换为显式的基于曲线的发丝表示。我们分四个阶段解决几何歧义：（i）优化由跟踪头部几何约束的3D高斯，以强制早期光线终止并抑制次表面噪声；（ii）追踪连续发丝，对频繁交叉和极端曲率具有鲁棒性；（iii）将发丝接地到表面，并通过物理动机的先验解决根尖歧义；（iv）通过光度优化下的不透明度驱动密度控制来细化重建。据我们所知，这是第一个从3D高斯表示重建高保真面部毛发发丝的方法。恢复的发丝忠实地保留了面部毛发特征的朝向和稀疏模式，并生成可直接用于下游生产任务的资产，包括面部动画和物理模拟、几何梳理和转移、外观编辑以及基于物理的渲染。

英文摘要

Facial hair is a defining trait of personal identity, yet remains a critical bottleneck for digital avatars. Recent volumetric methods achieve photorealism but bake hair into the underlying face geometry, preventing editability and failing to resolve sparse, strand-like structures. Meanwhile, scalp-hair reconstruction methods target dense hair volumes and do not transfer to the sparse, spatially-varying nature of facial hair. We present a pipeline that automatically reconstructs facial hair -- beard, mustache, lashes, and brows -- from multi-view images, converting an unstructured 3D Gaussian representation into an explicit curve-based strand representation. We resolve geometric ambiguities in four stages: (i) optimizing 3D Gaussians constrained by tracked head geometry to enforce early ray termination and suppress sub-surface noise; (ii) tracing continuous strands robust to frequent crossings and extreme curvature; (iii) grounding strands to the surface and resolving root-tip ambiguity via a physically-motivated prior; and (iv) refining the reconstruction through opacity-driven density control under photometric optimization. To our knowledge, this is the first method to reconstruct high-fidelity facial hair strands from a 3D Gaussian representation. The recovered strands faithfully preserve the orientation and sparsity patterns characteristic of facial hair, and yield assets immediately suitable for downstream production tasks, including facial animation and physical simulation, geometric grooming and transfer, appearance editing, and physics-based rendering.

2606.08043 2026-06-09 cs.GR cs.CV 交叉投稿

OmniFaceRig: Fully Automatic Inner-Mouth-Aware Face Rigging Across Diverse 3D Character Topologies

OmniFaceRig: 跨多种3D角色拓扑的全自动内口感知面部绑定

Chao Wang, Guangyao Ma, John Doublestein, Junming Chen, Yiming Lin, Zhaoen Su, Xiaomin Luo, Shiyang Cheng, Jie Shen, Doug Roble, Dilin Wang, Yilei Li, Rakesh Ranjan

发表机构 * Reality Labs, Meta（Meta现实实验室）

AI总结提出全自动端到端管道OmniFaceRig，将静态表面网格转换为含内口几何的FACS绑定，支持人类、人形及多种动物拓扑，无需手动标注或模板。

AI中文摘要

面部绑定——创建基于FACS的混合形状以及内口几何（牙齿、牙龈和舌头）——仍然是3D角色制作中的主要瓶颈。现有流程仍需要大量设计工作，特别是手动地标标注、每个角色的模板调整和内口放置。我们提出OmniFaceRig，一个全自动端到端管道，将静态表面仅3D角色网格（无预建模口腔）转换为内口感知的FACS绑定，包含多达155个混合形状、程序化拟合的牙齿、牙龈和舌头，以及重新打包的UV/纹理。OmniFaceRig支持多种拓扑——人类、人形、长吻动物（如狗、狼、狐狸）和短吻动物（如猫、熊、兔子、老虎）——无需手动地标、无需用户提供模板、无需每个资产的设置。该管道结合了混合VLM+CV可绑定性检查、多模型面部解析、密集关键点驱动的模板配准、程序化内口构建以及碰撞感知的混合形状迁移。对于非人类角色，OmniFaceRig选择拓扑特定的面部和内口模板，并使用碰撞感知的内口拟合来减少牙齿-面部交叉，而无需用户暴露于类别特定的调整。我们还公开发布了Omni-Bench，一个包含1000个双足3D角色的免费基准数据集，带有FACS面部混合形状和内口几何，涵盖人类、人形、猫、狗和其他动物。实验表明，在筛选后的Omni-Bench输入上，最终绑定成功率很高，分割集成几乎实现了完全的面部检测召回，以及可靠的内口放置和低穿透率。总之，OmniFaceRig为从静态生成的角色到动画就绪的面部绑定提供了一条自动化路径，适用于人类和非人类拓扑。

英文摘要

Facial rigging - creating FACS-based blendshapes together with inner-mouth geometry (teeth, gums, and tongue) - remains a major bottleneck in 3D character production. Existing pipelines still require substantial designer effort, especially for manual landmark annotation, per-character template adjustment, and inner-mouth placement. We present OmniFaceRig, a fully automatic end-to-end pipeline that converts a static surface-only 3D character mesh, with no pre-modeled oral cavity, into an inner-mouth-aware FACS rig with up to 155 blendshapes, procedurally fitted teeth, gums, and tongue, and re-packed UV/texture. OmniFaceRig supports diverse topologies - humans, humanoids, long-muzzled animals (e.g., dogs, wolves, foxes), and short-muzzled animals (e.g., cats, bears, rabbits, tigers) - with no manual landmarks, no user-provided templates, and no per-asset setup. The pipeline combines hybrid VLM+CV riggability checking, multi-model face parsing, dense keypoint-driven template registration, procedural inner-mouth construction, and collision-aware blendshape transfer. For non-human characters, OmniFaceRig selects topology-specific face and inner-mouth templates and uses collision-aware inner-mouth fitting to reduce teeth-face intersections without exposing users to category-specific tuning. We also publicly release Omni-Bench, a freely available benchmark dataset of 1,000 biped 3D characters with FACS facial blendshapes and inner-mouth geometry, spanning humans, humanoids, cats, dogs, and other animals. Experiments show high final rigging success on screened Omni-Bench inputs, nearly complete face detection recall from the segmentation ensemble and reliable inner-mouth placement with low penetration. Together, OmniFaceRig provides an automatic path from static generated characters to animation-ready facial rigs across both human and non-human topologies.

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 交叉投稿

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱：基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin（柏林工业大学）； Fraunhofer FOKUS（弗劳恩霍夫开放通信系统研究所）

AI总结研究利用大语言模型（LLM）零样本地将3D场景对象自动映射到本体类别，无需训练，在厨房场景中达到90-96%准确率，并揭示语义线索是关键。

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要，但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典，这些字典脆弱且无法跨资产泛化。我们研究大语言模型（LLM）是否能够自动化通用场景描述（USD）场景的接地步骤，作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景（125个对象）中，LLM在描述性名称下达到90-96%的精确匹配准确率，在缩写名称下达到49-89%，显著优于字典和嵌入基线。在完全不透明名称下，上下文增强提示可恢复高达48%的准确率。特征消融表明，LLM主要利用场景图中的语义线索（兄弟名称和父路径）；匿名化这些线索将准确率降至0-6%，而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2508.05950 2026-06-09 cs.CV cs.AI 版本更新

面向中轴线的符号距离函数学习

Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf

发表机构 * Institute for Numerical Simulation, University of Bonn（数值模拟研究所，波恩大学）

AI总结本文提出一种新的变分方法，用于计算高精度的全局符号距离函数，通过高阶变分公式考虑梯度的跳跃集，以提高计算精度。

AI中文摘要

我们提出了一种新的变分方法，用于计算给定点云的高精度全局符号距离函数（SDF）。为此，通过高阶变分公式显式考虑SDF梯度的跳跃集，即表面的中轴线，该公式强制在远离此不连续集的方向上沿梯度方向线性增长。Eikonal方程和SDF的零水平集被作为约束条件。为了使该变分问题具有计算可行性，采用了一种相场近似方法，属于Ambrosio-Tortorelli类型。相关的相场函数隐式地描述了中轴线。该方法用于由无向点云表示的表面，使用神经网络近似SDF和相场函数。实验表明，该方法在近场和全局范围内均具有较高的准确性。定量和定性比较表明，所提出的方法具有优势。

英文摘要

We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.
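
A one-dimensional toy case shows why the medial axis matters for the eikonal constraint. The sketch below is an illustration only, not the paper's method: it uses the exact SDF of the interval [-1, 1] and a central finite difference, and the function names are hypothetical.

```python
def sdf_interval(x, half_width=1.0):
    """Exact signed distance to the interval [-half_width, half_width]."""
    return abs(x) - half_width


def central_gradient(f, x, h=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Away from x = 0 the eikonal equation |u'| = 1 holds.
grad_right = central_gradient(sdf_interval, 0.5)
grad_left = central_gradient(sdf_interval, -0.5)

# At x = 0 (the medial axis of the interval) the gradient jumps from -1 to +1,
# so a smooth approximation cannot satisfy |u'| = 1 there.
grad_axis = central_gradient(sdf_interval, 0.0)
```

The gradient is -1 on the left and +1 on the right, while the symmetric difference at the midpoint returns 0, which is the kind of discontinuity set the abstract proposes to model explicitly.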

2604.25781 2026-06-09 cs.CV cs.GR 版本更新

Sketch2Arti: Sketch-based Articulation Modeling of CAD Objects

Sketch2Arti：基于草图的CAD物体关节建模

Yi Yang, Hao Pan, Yijing Cui, Alla Sheffer, Changjian Li

发表机构 * University of Edinburgh（爱丁堡大学）； Tsinghua University（清华大学）； University of British Columbia（不列颠哥伦比亚大学）

AI总结提出Sketch2Arti系统，通过用户从选定视角绘制的2D草图，自动发现CAD模型中的可动部件并预测其运动参数，支持复杂物体的多关节迭代建模，无需类别信息且可推广到多样物体。

Comments Project page: https://arlo-yang.github.io/Sketch2Arti

AI中文摘要

关节建模旨在推断3D物体的可动部件及其运动参数，实现交互式动画、模拟和形状编辑。本文提出Sketch2Arti，首个基于草图的CAD物体关节建模系统。我们的关键观察是，设计师通过轻量级草图（如箭头和笔画）自然地传达关节意图，指示部件应如何移动，但将这些草图转化为关节3D模型仍主要依赖手动操作。Sketch2Arti通过允许用户从选定视角绘制简单2D草图来指定关节，弥合了这一差距。给定CAD模型和用户草图，我们的方法自动发现对应的可动部件并预测其运动参数，支持对复杂物体进行多个关节的迭代建模，并实现精细控制。重要的是，Sketch2Arti以类别无关的方式训练，无需物体类别信息，从而对现有关节数据集之外的多样物体具有强泛化能力。此外，对于缺乏内部结构的壳体模型，Sketch2Arti支持由用户草图引导的可控内部补全，生成与现有几何和预测运动约束一致的合理内部组件。综合实验和用户评估证明了Sketch2Arti的有效性、可控性和泛化性。代码、数据集和原型系统见https://arlo-yang.github.io/Sketch2Arti。

英文摘要

Articulation modeling aims to infer movable parts and their motion parameters for a 3D object, enabling interactive animation, simulation, and shape editing. In this paper, we present Sketch2Arti, the first sketch-based articulation modeling system for CAD objects. Our key observation is that designers naturally communicate articulation intent through lightweight sketches (e.g., arrows and strokes) that indicate how parts should move, yet translating such sketches into articulated 3D models remains largely manual. Sketch2Arti bridges this gap by enabling users to specify articulation through simple 2D sketches drawn from a chosen viewpoint. Given a CAD model and user sketches, our approach automatically discovers the corresponding movable parts and predicts their motion parameters, allowing iterative modeling of multiple articulations on complex objects with fine-grained control. Importantly, Sketch2Arti is trained in a category-agnostic manner without requiring object category information, leading to strong generalization to diverse objects beyond existing articulation datasets. Moreover, for shell models lacking interior structures, Sketch2Arti supports controllable internal completion guided by user sketches, generating plausible internal components consistent with the existing geometry and predicted motion constraints. Comprehensive experiments and user evaluations demonstrate the effectiveness, controllability, and generalization of Sketch2Arti. The code, dataset, and the prototype system are at https://arlo-yang.github.io/Sketch2Arti.

2605.01171 2026-06-09 cs.CV cs.LG 版本更新

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

CADFit：基于混合优化的精确网格到CAD程序生成

Ghadi Nehme, Eamon Whalen, Faez Ahmed

发表机构 * Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA（麻省理工学院机械工程系）

AI总结提出CADFit框架，通过基于几何反馈的增量拟合和验证参数化操作，从网格中恢复复杂可编辑的CAD构造序列，在多个基准上优于现有方法，并显著降低无效比率。

AI中文摘要

尽管最近取得了进展，但从几何输入（如网格或点云）恢复参数化CAD构造序列仍然是设计和制造的关键挑战，因为现有的CAD重建和生成方法主要局限于难以编辑的格式（如网格或Breps）或可编辑的简单草图-拉伸流水线和低复杂度数据集。我们引入了CADFit，一个基于混合优化的CAD重建框架，通过使用几何反馈增量拟合和验证参数化操作，从网格中恢复复杂、可编辑的CAD构造序列。我们的方法的特点是将重建公式化为对结构化CAD程序的IoU驱动优化，并支持丰富的操作集，包括拉伸、旋转、圆角和倒角。在多个CAD基准上的实验表明，CADFit在体积交并比和倒角距离方面优于最先进的网格到CAD方法，同时显著降低了重建CAD程序的无效比率，特别是对于复杂设计。我们进一步提出了一个多模态流水线，通过将基于图像的几何重建与CADFit相结合，实现从图像端到端重建CAD构造序列。通过实现更高复杂度CAD模型的精确重建，CADFit为生成更丰富的数据集和推进未来基于学习的CAD逆向工程方法提供了实用基础。代码可在：https://github.com/ghadinehme/CADFit 获取。

英文摘要

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, remains a key challenge for design and manufacturing: existing CAD reconstruction and generation methods are largely restricted either to difficult-to-edit formats such as meshes or B-reps, or to editable but simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
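
The notion of IoU-driven fitting can be sketched on a voxel grid. This is a deliberately tiny stand-in, not CADFit's optimizer: it assumes axis-aligned boxes anchored at the origin, a single free parameter (extrusion height), and exhaustive search over candidates. All function names are hypothetical.

```python
from itertools import product


def voxelize_box(size):
    """Occupied voxel indices of an axis-aligned box anchored at the origin."""
    sx, sy, sz = size
    return set(product(range(sx), range(sy), range(sz)))


def volumetric_iou(a, b):
    """Intersection-over-Union of two voxel sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fit_extrusion_height(target, footprint, candidate_heights):
    """Pick the extrusion height whose voxelization maximizes IoU with the target."""
    return max(
        candidate_heights,
        key=lambda h: volumetric_iou(voxelize_box((*footprint, h)), target),
    )


target = voxelize_box((4, 4, 4))
half_height_iou = volumetric_iou(voxelize_box((4, 4, 2)), target)
best_height = fit_extrusion_height(target, footprint=(4, 4), candidate_heights=range(1, 7))
```

A real pipeline would score many operation types and continuous parameters; the point here is only that IoU against the target geometry gives a direct acceptance signal for each candidate operation.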

2605.01320 2026-06-09 cs.CV 版本更新

PACE: Post-Causal Entropy Modeling for Learned LiDAR Point Cloud Compression

PACE：用于学习LiDAR点云压缩的后因果熵建模

Jiahao Zhu, Kang You, Dandan Ding, Zhan Ma

发表机构 * School of Information Science and Technology, Hangzhou Normal University, Hangzhou, China.（信息科学与技术学院，杭州师范大学，杭州，中国）； School of Electronic Science and Engineering, Nanjing University, Nanjing, China（电子科学与工程学院，南京大学，南京，中国）

AI总结PACE通过非因果骨干网络和轻量级预测器提升LiDAR点云压缩效率，在取得显著BD-BR节省的同时，将自回归模式下的解码延迟降低90%以上。

AI中文摘要

LiDAR点云压缩对自主系统处理高分辨率传感器产生的海量数据至关重要。尽管基于八叉树结构的学习型熵建模能获得高压缩增益，但面临两个关键瓶颈：1）因果、多阶段上下文建模导致延迟过高，解码时尤为突出；2）性能-延迟权衡僵化，使单一模型难以适应变化的约束。这些限制源于上下文聚合骨干与概率预测之间的紧密耦合。为此，我们提出PACE，一种新的框架，将祖先上下文聚合重新表述为非因果骨干，并将因果性限制在轻量级、阶段可扩展的预测器中，消除重复的骨干执行并减少计算开销。预测器支持任意数量的预测阶段，使模型能够无缝适应多样化的性能-延迟权衡，而无需重新加载参数。实验表明，PACE在压缩效率上达到新的最先进水平，实现显著的BD-BR节省，并在自回归模式下将解码延迟降低超过90%，使其在实际应用中具有吸引力。

英文摘要

LiDAR point cloud compression is vital for autonomous systems to handle massive data from high-resolution sensors. While learned entropy modeling built upon octree structures yields high compression gains, it faces two critical bottlenecks: 1) prohibitive latency, particularly during decoding, caused by causal, multi-stage context modeling; and 2) a rigid performance-latency trade-off, preventing a single model from adapting to varying constraints. These limitations stem from the tight coupling between the context aggregation backbone and probability prediction. To address this, we propose PACE, a new framework that reformulates ancestral context aggregation as a non-causal backbone and confines causality to a lightweight, stage-scalable predictor, eliminating repetitive backbone executions and reducing computational overhead. The predictor supports an arbitrary number of prediction stages, enabling seamless adaptation across diverse performance-latency trade-offs without reloading parameters. Experiments demonstrate that PACE sets a new state-of-the-art in compression efficiency, achieving notable BD-BR savings and reducing decoding latency by over 90% in autoregressive mode, making it attractive for practical applications.

2606.07419 2026-06-09 cs.CV 版本更新

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

DisPOSE: 投影多随机扩散用于自监督多视图3D人体姿态估计

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

发表机构 * Imperial College London（伦敦帝国学院）； Technical University of Munich（慕尼黑技术大学）

AI总结提出DisPOSE框架，将多视图人员分配问题建模为多随机张量空间上的生成扩散过程，通过可微Sinkhorn投影和超图卷积解码器实现自监督3D人体姿态估计，在标准数据集和手术室遮挡场景中表现优异。

AI中文摘要

从不同摄像机视角恢复多个个体的3D人体姿态是分析交互行为的基本瓶颈。现有的自监督方法利用3D姿态的合成目录；然而，由于分布偏移，这导致在真实场景中泛化能力差。因此，我们引入了DisPOSE，一个自监督框架，将固有的离散多视图人员分配问题近似为多随机张量空间上的生成扩散过程。通过在去噪过程中采用可微的Sinkhorn投影，模型学会基于2D图像先验引导解决方案走向有效且可行的分配。然后，使用超图卷积解码器对定位个体的完整3D骨架进行回归，该解码器显式建模跨多个视图的关系结构和关节。所提出的方法在标准数据集上优于当前最先进的自监督方法，并在一个包含手术室高度遮挡场景的新基准上展示了强大的性能。我们的基于扩散的定位展示了高标签效率，仅使用10%的伪标签就能保持99%的性能。值得注意的是，在保持可微性的同时解耦分配和根回归组件，使得DisPOSE几乎对不同摄像机布置不敏感。

英文摘要

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
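
The Sinkhorn projection mentioned above can be illustrated in its simplest form, the two-view (matrix) case. This sketch is not the DisPOSE implementation: the paper works with polystochastic tensors over multiple views inside a diffusion model, whereas the code below only alternates row and column normalization on a small positive matrix. Each step is a division by a sum, which is why the procedure is differentiable when written in an autodiff framework.

```python
def sinkhorn(matrix, iterations=200):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n_cols = len(m[0])
    for _ in range(iterations):
        # Row normalization: every row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column normalization: every column sums to 1.
        col_sums = [sum(row[j] for row in m) for j in range(n_cols)]
        m = [[row[j] / col_sums[j] for j in range(n_cols)] for row in m]
    return m


# Hypothetical affinities between 3 detections in view A (rows) and view B (columns).
affinity = [
    [4.0, 1.0, 1.0],
    [1.0, 3.0, 1.0],
    [1.0, 1.0, 5.0],
]
soft_assignment = sinkhorn(affinity)
hard_assignment = [row.index(max(row)) for row in soft_assignment]
```

The result is approximately doubly stochastic, so each detection's assignment mass is spread over candidates in the other view while respecting the one-to-one structure of a valid matching.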

2606.07590 2026-06-09 cs.CV cs.AI 新提交

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

SlideCheck: 通过数据集分布引导病理基础模型的自监督预训练

Mingyi He, Xinyi Guo, Xitong Ling, Weiming Chen, Jiawen Li, Lianghui Zhu, Minxi Ouyang, Mingxi Fu, Yizhi Wang, Tian Guan

发表机构 * Beijing University of Chemical Technology（北京化工大学）； South China Normal University（华南师范大学）； Tsinghua University（清华大学）

AI总结提出SlideCheck工具，利用冻结病理基础模型的特征，通过双头MLP评分异常和恶性证据，引导自监督预训练数据筛选，实验表明数据分布影响模型下游性能。

Comments 9 pages, 2 figures, 4 tables

AI中文摘要

病理基础模型在大量WSI衍生补丁流上进行预训练，而数据构建过程中的监督通常是切片级别、稀疏或异质的。这种不匹配使得理解和控制哪些生物模式进入预训练数据变得困难。我们提出SlideCheck，一个轻量级的预训练数据引导工具，建立在冻结的病理基础模型补丁特征之上。SlideCheck并非作为独立的补丁诊断模型，而是提供明确的异常和恶性评分，用于组织、过滤和审计病理预训练数据。SlideCheck使用双头MLP分别建模广泛的异常形态和恶性证据。正则化的特征空间评分器为补丁级证据估计提供监督锚点，而评分-注意力一致性将补丁评分与WSI级别的MIL注意力结合，挖掘高置信度伪标签。然后使用相同的评分构建广泛阳性ViT预训练子集，其中如果异常或恶性证据超过阈值，则选择补丁。实验表明，SlideCheck定义的数据分布影响自监督ViT预训练的下游行为，表明生物组成是病理基础模型开发中的重要可控因素。精心策划的子集可以接近全数据性能，表明明确评分的补丁池可能支持更高效和可审计的预训练数据构建。这些发现将SlideCheck定位为数据引导和审计层，用于将大型未分化补丁池转化为可控和可重用的预训练数据集。

英文摘要

Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features. Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels. The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.
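
The broad-positive selection rule stated above (keep a patch if either score exceeds a threshold) is simple enough to sketch directly. The dictionary keys, the shared threshold value of 0.5, and the function name are assumptions made for illustration; the abstract does not specify them.

```python
def select_broad_positive(patches, threshold=0.5):
    """Keep patches whose abnormality OR malignancy score exceeds the threshold."""
    return [
        p for p in patches
        if p["abnormality"] > threshold or p["malignancy"] > threshold
    ]


# Hypothetical scored patches from a single slide.
patches = [
    {"id": "p0", "abnormality": 0.9, "malignancy": 0.1},
    {"id": "p1", "abnormality": 0.2, "malignancy": 0.7},
    {"id": "p2", "abnormality": 0.3, "malignancy": 0.4},
]
selected_ids = [p["id"] for p in select_broad_positive(patches)]
```

Using a logical OR makes the subset deliberately inclusive: a patch with abnormal but non-malignant morphology is still retained for pretraining.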

2606.07633 2026-06-09 cs.CV cs.AI 新提交

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

AMN：一种用于细胞核分割的具有边界和不确定性建模的自适应多尺度融合网络

Spoorthi M, Suja Palaniswamy

发表机构 * Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI总结提出AMN双编码器分割框架，融合Swin Transformer和ResNet-50特征金字塔，通过门控机制动态加权，结合多目标损失，在CoNIC基准上平均Dice 0.82，F1 0.68，优于八种基线模型。

AI中文摘要

组织病理学图像中细胞核亚型的准确分类对于下游任务（包括肿瘤分级、免疫浸润量化和预后预测）至关重要。现有方法孤立地依赖卷积或基于Transformer的编码器，限制了它们同时捕捉细粒度局部纹理和长程空间上下文的能力。我们提出了AMN（自适应多尺度细胞核网络），一种双编码器分割框架，联合利用Swin Transformer和ResNet-50特征金字塔，通过学习的逐通道门控机制动态权衡每个编码器在每个尺度的贡献。AMN使用多目标损失进行训练，该损失结合了类别加权焦点损失、具有正像素强调的边界感知损失以及一种新颖的不确定性调制分类项，用于抑制过度自信的错误预测。在涵盖七个细胞核类别的CoNIC基准上评估，AMN实现了平均Dice 0.82和平均F1 0.68，在诊断上具有挑战性的淋巴细胞类别上F1为0.67。AMN优于八种基线模型，包括纯CNN、纯Transformer和最近的混合架构：U-Net、ResU-Net、DeepLabV3+、SegNet、ViT-Small、HmsU-Net、ConvFormer-UNet和BEFUnet。在MoNuSeg上的跨数据集评估证明了无需重新训练的强泛化能力，验证了所学表示的领域鲁棒性。

英文摘要

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.
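
One common way to realize per-channel gating between two encoders is a sigmoid gate that forms a convex combination of their features. The sketch below assumes that form; the abstract does not give AMN's exact gate, so the formula and the names `gated_fusion` and `gate_logits` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(transformer_feat, cnn_feat, gate_logits):
    """Fuse two per-channel feature vectors with one gate per channel.

    A gate near 1 favors the transformer feature, near 0 the CNN feature.
    """
    fused = []
    for t, c, logit in zip(transformer_feat, cnn_feat, gate_logits):
        g = sigmoid(logit)
        fused.append(g * t + (1.0 - g) * c)
    return fused


# With zero logits every gate is 0.5, so the fusion is a plain average.
averaged = gated_fusion([2.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```

In a trained network the logits would themselves be predicted from the features at each scale, so the balance between local texture and long-range context can vary by channel and resolution.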

2606.07635 2026-06-09 cs.CV cs.AI 新提交

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign: 用于MCI分析的动态与结构性神经影像的分层多模态融合

Xiongri Shen, Zhenxi Song, Jiaqi wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）； School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳）人工智能学院智能科学与工程学院）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University（深圳大学医学部生物医学工程学院广东省生物医学测量与超声成像重点实验室）； Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences（广西壮族自治区人民医院放射科，广西医学科学院）； Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital（华中科技大学协和深圳医院（深圳市第六人民医院））； School of Basic Medical Sciences, Shenzhen University（深圳大学基础医学院）； Egypt-Japan University of Science and Technology (E-JUST)（埃及日本科技大学）； School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School（深圳大学医学部生物医学工程学院，国家地方联合医学超声关键技术工程实验室，广东省生物医学测量与超声成像重点实验室）

AI总结提出NeuroAlign框架，通过双模态分层对齐和双域分层交互融合fMRI与DTI特征，实现MCI/SCD检测，并设计无梯度归因方法SAM进行特征分析。

AI中文摘要

功能磁共振成像（fMRI）和弥散张量成像（DTI）的多模态神经影像融合为认知障碍分析提供了互补信息，但仍面临异构特征空间和表示不对齐的挑战。我们提出NeuroAlign，一个用于结构化多模态融合的分层框架。它引入了（1）双模态分层对齐（DMHA），该模块建模多尺度动态连接并对齐动态-静态和功能-结构嵌入；以及（2）双域分层交互（DDHI），该模块实现连接级和区域级特征之间的细粒度调制和全局交互。为了支持特征级检查，我们设计了协同激活映射（SAM），一种针对DFC、SFC、ALFF和FA的无梯度、面向标记的归因方法。在GUTCM、ADNI和OASIS数据集上通过五折验证评估，NeuroAlign在MCI/SCD检测中取得了竞争性结果，并展示了初步的跨数据集可迁移性。归因分析揭示了模态特异性和部分一致的脑区模式，为多模态表示分析提供了模型驱动的证据。

英文摘要

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.
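
As background for the DFC and SFC markers named above, the sketch below computes static and dynamic functional connectivity from regional time series. The abstract does not say how these are computed in NeuroAlign; sliding-window Pearson correlation is a common choice and is assumed here, and `sliding_window_fc`, `window`, and `stride` are hypothetical names.

```python
import numpy as np

def sliding_window_fc(ts, window, stride):
    """Dynamic FC from a (T, R) array of R regional time series.

    Returns an array of shape (n_windows, R, R), one correlation
    matrix per window (assumed sliding-window Pearson correlation).
    """
    n_timepoints = ts.shape[0]
    starts = range(0, n_timepoints - window + 1, stride)
    return np.stack([np.corrcoef(ts[s:s + window].T) for s in starts])

rng = np.random.default_rng(0)
ts = rng.normal(size=(120, 5))                      # toy data: 120 time points, 5 regions
dfc = sliding_window_fc(ts, window=30, stride=10)   # dynamic FC, one matrix per window
sfc = np.corrcoef(ts.T)                             # static FC over the full scan
```

Repeating the call with several window lengths would be one simple way to obtain a multi-scale view of dynamic connectivity.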

2606.07658 2026-06-09 cs.CV cs.LG 新提交

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

神经外科医生需要看到的：用于脑肿瘤手术中脑移位补偿的超声合成术中MRI

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain（西班牙巴利亚多利德里奥·奥尔特加大学医院神经外科神经血管科）； Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain（西班牙巴利亚多利德生物医学研究与计算分析专业组(GEIBAC)，巴利亚多利德生物健康研究所(IBioVALL)）

AI总结提出一种端到端流水线，通过融合术前MRI、术中超声生成的合成MRI及锚定该合成图像的可变形配准，生成术前成像空间中的全脑MRI体积，以补偿脑移位，为神经导航提供类似MRI的术中视野更新。

AI中文摘要

最大安全切除是胶质瘤手术的主要目标。硬脑膜打开后，神经导航引导会因脑移位而逐渐退化。术中MRI可以补偿，但需要专用基础设施且很少可用，而术中超声（ioUS）廉价、可重复且与常规工作流程兼容。将ioUS与术前MRI结合的导航系统通常依赖刚性配准；即使是可变形多模态配准也受限于超声散斑对比度、窄视野以及无法表示术前扫描中不存在的结构，最关键的是切除腔和残余肿瘤。我们提出一个端到端流水线，通过合并术前MRI、从ioUS生成的合成MRI以及锚定在该合成图像上的可变形配准，生成术前成像空间中的全脑MRI体积。它集成了一个2.5D残差变换器合成骨干（ResViT-2.5D）和一个两阶段配准，将NiftyReg与合成锚定的SynthMorph阶段耦合，直接对原始扫描仪输入进行操作。在切除后的ReMIND队列上，ResViT-2.5D生成的合成图像在结构、强度和感知指标上与术中T2紧密匹配。在14名受试者的215个专家标志点上，合成锚定配准将平均目标配准误差从6.27毫米降低到5.86毫米，与强大的经典NiftyReg基线（5.85毫米）相当，同时为每个受试者产生微分同胚变形场。贡献不在于配准精度的提高，而在于集成的体积本身，它在超声视野内反映了术中切除后的状态。这为外科医生提供了手术视野的类似MRI的更新，并有可能集成到手术导航工作流程中。

英文摘要

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
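
The target registration error quoted above is, in its usual definition, the mean Euclidean distance between paired landmarks after registration. The sketch below shows that computation on made-up coordinates; the function name `mean_tre` and the landmark values are illustrative and not taken from the paper.

```python
import numpy as np

def mean_tre(fixed_pts, warped_pts):
    """Mean Euclidean distance between paired landmarks, each of shape (N, 3).

    Units follow the input coordinates (mm if the points are in mm).
    """
    return float(np.linalg.norm(fixed_pts - warped_pts, axis=1).mean())

# Two toy landmark pairs with distances of 5 mm and 1 mm.
fixed = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
warped = np.array([[3.0, 4.0, 0.0], [10.0, 0.0, 1.0]])
tre = mean_tre(fixed, warped)
```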

2606.07674 2026-06-09 cs.CV q-bio.NC 新提交

Simultaneous hyperkinetic movement disorders phenotyping: a cross-cohort pediatric transfer study using routine videos, markerless pose estimation and a tabular foundation model

多动性运动障碍的同步表型分析：基于常规视频、无标记姿态估计和表格基础模型的跨队列儿科迁移研究

Laura Cif, Diane Demailly, Zohra Souei, Muhammad Mushhood Ur Rehman, Juan Dario Ortigoza Escobar, Mayté Castro Jiménez, Cécile A. Hubsch, Sophie Huby, Morgan Dornadic, Gun-Marie Hariz, Eduardo M. Moraud, Jocelyne Bloch, Gabriella A. Horvath, Xavier Vasques

发表机构 * Lausanne University Hospital (CHUV)（洛桑大学医院）； University of Lausanne (UNIL)（洛桑大学）； Institut du Neurone（神经元研究所）； Clinique Beau Soleil（博索莱伊诊所）； Institut Mutualiste Montpelliérain（蒙彼利埃互助研究所）； Military University Hospital of Sfax（斯法克斯军事大学医院）； University of Edinburgh（爱丁堡大学）； Hospital Sant Joan de Déu（圣琼德迪乌医院）； European Reference Network for Rare Neurological Diseases (ERN-RND)（欧洲罕见神经系统疾病参考网络）； Instituto de Salud Carlos III（卡洛斯三世健康研究所）； CHU Montpellier（蒙彼利埃大学医院）； Umeå University（于默奥大学）； University Hospital Lausanne（洛桑大学医院）； Ecole Polytechnique Fédérale de Lausanne（洛桑联邦理工学院）； British Columbia Children’s Hospital（不列颠哥伦比亚儿童医院）

AI总结提出结合无标记姿态估计、运动学描述符和预训练基础模型的视频框架，在成人数据上训练后迁移至儿科队列，经轻量校准后实现多种多动性运动障碍现象的同时检测。

AI中文摘要

目的：开发并外部测试一个基于视频的框架，用于同时检测多动性运动障碍现象：肌张力障碍、震颤、肌阵挛、舞蹈症、手足徐动症、投掷症、刻板动作和抽动，使用常规临床记录，并明确测试从成人到儿科人群的外部跨队列迁移。方法：在这项概念验证研究中，该框架结合了无标记姿态估计、运动学描述符和预训练基础模型。在21名确诊多动性运动障碍的成人和4名健康对照（按标准化方案评估）上开发了共享预测骨干。外部验证在一个独立的外部队列上进行：一个真实世界的儿科样本（n=12，单基因联合型运动障碍）。对于外部数据集，骨干网络未经重新训练直接部署；轻量校准仅调整最终受试者级别的决策步骤，使用由临床医生选择的小标记子集（代表队列表型范围）。结果：在临床医生选择的子集上对决策层进行本地校准后，在保留的儿科患者（n=7）上性能持续提升：汉明准确率从0.804提高到0.839，Jaccard指数从0.548提高到0.633。当评估限制在临床医生一致性更高的现象时，校准后的性能得以保持，Jaccard指数进一步提高（汉明准确率0.9，Jaccard指数0.786），表明增益并非依赖于最不可靠的标签。

英文摘要

Objective: To develop and externally test a video-based framework for simultaneous detection of hyperkinetic movement disorder (MD) phenomenologies (dystonia, tremor, myoclonus, chorea, athetosis, ballismus, stereotypies, and tics) using routine clinical recordings, with explicit testing of external, cross-cohort transfer from adult to pediatric populations. Methods: In this proof-of-concept study, the framework combines markerless pose estimation, kinematic descriptors, and a pretrained foundation model. A shared predictive backbone was developed on 21 adults with confirmed hyperkinetic MDs and 4 healthy controls assessed under a standardized protocol. External validation was performed on an independent external cohort: a real-world pediatric sample (n=12, monogenic combined MDs). For the external dataset, the backbone was deployed without retraining; lightweight calibration adjusted only the final subject-level decision step using a small labeled subset of patients selected by clinicians as representative of the cohort's phenotypic range. Results: After local calibration of the decision layer on the clinician-selected subset, performance improved consistently on the held-out pediatric patients (n=7): Hamming accuracy rose from 0.804 to 0.839 and the Jaccard index from 0.548 to 0.633. This calibrated performance was preserved, and the Jaccard index further improved, when the evaluation was restricted to the phenomenologies with more definite clinician agreement (Hamming accuracy 0.9, Jaccard index 0.786), indicating that the gains did not rest on the least-reliable labels.
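
For readers unfamiliar with the two multi-label metrics reported above, the sketch below computes them on a toy example. The abstract does not state the averaging scheme; per-subject (sample-averaged) Jaccard is assumed, as is the convention that a subject with no true and no predicted labels scores 1. The label matrices are invented for illustration.

```python
import numpy as np

def hamming_accuracy(y_true, y_pred):
    """Fraction of (subject, label) entries predicted correctly."""
    return float((y_true == y_pred).mean())

def mean_jaccard(y_true, y_pred):
    """Per-subject intersection over union, averaged across subjects."""
    inter = np.logical_and(y_true, y_pred).sum(axis=1)
    union = np.logical_or(y_true, y_pred).sum(axis=1)
    # Assumed convention: an empty union (nothing true, nothing predicted) scores 1.
    scores = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(scores.mean())

# Rows are subjects, columns are phenomenologies (toy values).
y_true = np.array([[1, 1, 0, 0], [0, 1, 0, 1]], dtype=bool)
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1]], dtype=bool)
ham = hamming_accuracy(y_true, y_pred)   # 7 of 8 entries match
jac = mean_jaccard(y_true, y_pred)       # (1/2 + 2/2) / 2
```

Hamming accuracy rewards every correctly absent label, so it tends to run higher than the Jaccard index when most phenomenologies are absent in most subjects.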

2606.07775 2026-06-09 cs.CV 新提交

DALE-CT: Depth-Aware Foundation Models for Computed Tomography

DALE-CT: 用于计算机断层扫描的深度感知基础模型

Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, Caroline N. Leach, Emily B. Collier, V. K. Cody Bumgardner

发表机构 * University of Kentucky（肯塔基大学）

AI总结提出DALE-CT，一种基于LeJEPA的2D切片模型，通过3D深度感知预训练（利用解剖掩膜和异常标注）提升表示质量，在CT多异常检测中达到与3D视觉语言模型近似的性能。

Comments 9 pages, 2 figures

AI中文摘要

自监督学习（SSL）的最新突破，如潜在欧几里得联合嵌入预测架构（LeJEPA），以及视觉编码器与语言模型集成的成功，推动了计算机断层扫描（CT）中对适应性强、高容量视觉编码器的需求。在这项工作中，我们探索了基于2D切片的架构作为处理体积CT数据的原生3D模型的灵活替代方案。使用CT-RATE数据集，我们从头开始训练了DALE-CT（深度感知潜在欧几里得计算机断层扫描），这是一个完全使用LeJEPA构建的2D模型系列，并将其与持续预训练的DINOv2基线进行了比较。为了提高表示质量，我们开发了一种新颖的3D深度感知预训练策略，该策略由来自自动解剖掩膜和人工标注异常的双重辅助监督密集支持。在使用多实例学习（MIL）进行多异常检测的线性探测评估下，该双监督模型（DALE-CT-2S）的冻结主干实现了0.833的宏AUROC。这一性能表明，从头开始使用显著更少的数据且无需文本监督，即可达到与最先进的3D视觉语言模型近乎相当的水平。为确保可重复性，所有训练代码、评估脚本和模型权重均已公开。

英文摘要

Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.

2606.08364 2026-06-09 cs.CV cs.AI 新提交

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

基于自监督视觉Transformer的CBCT颞下颌关节骨关节炎检测

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

发表机构 * Herman Ostrow School of Dentistry, University of Southern California（南加州大学赫尔曼·奥斯特罗牙科学院）； Viterbi School of Engineering, University of Southern California（南加州大学维特比工程学院）

AI总结研究DINO系列自监督ViT在CBCT颞下颌关节骨关节炎检测中的迁移性能，发现部分解冻最后两个Transformer块可将AUC从0.671提升至0.902，表明适应策略比骨干选择更重要。

AI中文摘要

颞下颌关节骨关节炎（TMJ OA）是一种常见的退行性疾病，其骨性改变在锥形束CT（CBCT）上通常很细微，使得自动检测具有挑战性。我们研究了DINO系列自监督视觉Transformer——DINOv1、DINOv2、DINOv2+reg和RAD-DINO（一种放射学预训练变体）——迁移到CBCT的效果，询问需要多少以及何种骨干适应。我们提出了一种简单的基于切片的流程，使用视觉Transformer（ViT）骨干：轴向CBCT切片由冻结或部分适应的ViT逐切片编码，并通过基于注意力的多实例学习（MIL）聚合，用于患者级别的二分类OA/正常分类。通过在多源CBCT数据集上对解冻策略和聚合设计进行系统消融，我们发现部分解冻最后两个Transformer块是决定性因素，将AUC从0.671（完全冻结的DINOv2）提高到0.902。这优于DINOv1（0.867）、DINOv2+reg（0.774）和有监督的ImageNet ViT-B/16基线（0.843）。我们的结果为在低数据医学影像设置中适应DINO系列基础模型提供了实用指导，表明适应策略比骨干选择本身更能驱动性能。

英文摘要

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
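
The attention-based MIL aggregation step can be sketched as follows. The abstract does not specify the attention form; the widely used tanh attention pooling is assumed here, and the parameter names `V` and `w` as well as the random embeddings are hypothetical stand-ins for learned weights and ViT slice features.

```python
import numpy as np

def attention_mil_pool(slice_emb, V, w):
    """Pool N slice embeddings (N, D) into one patient embedding (D,).

    Assumed scoring: score_i = w . tanh(V^T e_i), softmax over slices.
    V has shape (D, H) and w has shape (H,).
    """
    scores = np.tanh(slice_emb @ V) @ w
    scores = scores - scores.max()               # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ slice_emb, weights

rng = np.random.default_rng(0)
slice_emb = rng.normal(size=(6, 4))   # 6 axial slices, 4-dim embeddings (toy sizes)
V = rng.normal(size=(4, 3))
w = rng.normal(size=3)
patient_emb, weights = attention_mil_pool(slice_emb, V, w)
```

The patient embedding would then feed a binary OA/Normal classifier, and the weights indicate which slices contributed most to the decision.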

2606.08404 2026-06-09 cs.CV 新提交

Geometry-Driven Flow Analysis of Brain Sulcal Pattern

脑沟模式的几何驱动流分析

Moo K. Chung, Luigi Maccotta, Aaron Struck

发表机构 * GitHub

AI总结提出基于泊松方程的几何驱动流框架，通过平均曲率建模皮层折叠，生成光滑势场梯度定义物理通量，用于分析青少年肌阵挛癫痫的皮层结构异常。

AI中文摘要

皮层折叠反映了协调的神经发育过程，并日益被认为是神经系统疾病的敏感标志。然而，现有大多数分析依赖于间接的标量摘要，并未明确建模折叠几何本身。在青少年肌阵挛癫痫（JME）中，一种常见的遗传性癫痫，皮层异常通常是微妙的、空间分布的，并且难以使用传统的形态测量指标检测。我们引入了一个基于泊松方程的框架，将皮层折叠建模为源自皮层流形上平均曲率的几何驱动流。通过将折叠模式视为静态的源-汇结构，所提出的方法产生了一个光滑的、全局平衡的势场，其表面梯度定义了物理上可解释的通量。该框架能够对脑沟-脑回折叠组织进行空间连贯的分析，并为JME中几何驱动的皮层结构提供了原则性的表示。

英文摘要

Cortical folding reflects coordinated neurodevelopmental processes and is increasingly recognized as a sensitive marker of neurological disease. However, most existing analyses rely on indirect scalar summaries that do not explicitly model folding geometry itself. In juvenile myoclonic epilepsy (JME), a common genetic epilepsy, cortical abnormalities are often subtle, spatially distributed, and difficult to detect using conventional morphometric measures. We introduce a Poisson-equation-based framework that models cortical folding as a geometry-driven flow derived from mean curvature on the cortical manifold. By treating folding patterns as a stationary source-sink structure, the proposed approach yields a smooth, globally balanced potential field whose surface gradient defines a physically interpretable flux. This framework enables spatially coherent analysis of sulcal-gyral folding organization and provides a principled representation of geometry-driven cortical structure in JME.
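
A toy discrete analogue of the Poisson formulation is sketched below. It replaces the cortical surface with a closed ring of vertices and the surface Laplacian with a graph Laplacian, both simplifications made for illustration. Removing the mean of the curvature so that sources and sinks balance, and the sign convention L phi = f with L = D - A, are assumptions; the abstract does not give the discretization.

```python
import numpy as np

def ring_laplacian(n):
    """Graph Laplacian L = D - A of a closed ring with n vertices."""
    lap = 2.0 * np.eye(n)
    idx = np.arange(n)
    lap[idx, (idx + 1) % n] = -1.0
    lap[idx, (idx - 1) % n] = -1.0
    return lap

def solve_potential(lap, curvature):
    """Solve lap @ phi = f, where f is the curvature with its mean removed."""
    f = curvature - curvature.mean()                 # balanced sources and sinks
    phi = np.linalg.lstsq(lap, f, rcond=None)[0]     # minimum-norm solution
    return phi, f

n = 16
theta = 2 * np.pi * np.arange(n) / n
curvature = 0.3 + np.cos(2 * theta)          # toy alternating "gyri" and "sulci"
lap = ring_laplacian(n)
phi, f = solve_potential(lap, curvature)
flux = phi[(np.arange(n) + 1) % n] - phi     # potential difference along each edge
```

The Laplacian of a closed domain has a constant null space, so the equation is solvable only for a zero-mean source, which is one way to read the "globally balanced" potential field in the abstract.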

2606.08420 2026-06-09 cs.CV 新提交

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

CheXanatomy: 面向胸部X光片的解剖感知视觉-语言建模

Sergios Gatidis, Curtis Langlotz, Christian Bluethgen

发表机构 * Stanford Center for Artificial Intelligence in Medicine and Imaging, Stanford University（斯坦福大学医学与影像人工智能中心）； Department of Radiology, Stanford University（斯坦福大学放射学系）

AI总结提出CheXanatomy框架，通过自回归令牌空间监督将解剖知识融入预训练视觉-语言模型，实现解剖分割，在合成和真实X光片上性能媲美U-Net，并提升域迁移鲁棒性和样本效率。

AI中文摘要

在大规模图像-文本对上预训练的视觉-语言模型（VLM）表现出强大的图像级理解能力，但主要针对全局对齐进行优化，并未显式编码细粒度解剖结构，限制了其在分割等空间精确任务中的适用性。我们提出CheXanatomy，一个通过自回归令牌空间监督将显式解剖知识融入预训练VLM的框架。该模型无需添加任务特定的解码器头，而是通过下一个令牌预测训练生成解剖分割掩码。为了实现可扩展的监督，我们从CT体积合成逼真的胸部X光片，并前向投影CT分割标签以获得解剖一致的2D掩码。我们在合成和真实胸部X光片上评估该方法，与U-Net基线进行比较，包括模型规模、输入分辨率和视觉编码器微调的消融实验。自回归解剖监督在分布内实现了与专用卷积模型相当的性能，并在向真实CXR数据的域迁移下表现出改进的几何鲁棒性。此外，在有限监督下适应新定位任务时，解剖预训练模型展现出更好的样本效率。更大的模型和更高的输入图像分辨率提升了性能，而视觉编码器微调效果有限。这些结果表明，将解剖结构直接嵌入生成目标促进了空间有根据的表征，并支持解剖感知的医学视觉-语言建模。

英文摘要

Vision-language models (VLMs) pretrained on large-scale image-text pairs demonstrate strong image-level understanding, but are primarily optimized for global alignment and do not explicitly encode fine-grained anatomical structure, limiting their suitability for spatially precise tasks such as segmentation. We introduce CheXanatomy, a framework that integrates explicit anatomical knowledge into a pretrained VLM through autoregressive token-space supervision. Instead of adding task-specific decoder heads, the model is trained to generate anatomical segmentation masks via next-token prediction. To enable scalable supervision, we synthesize realistic chest radiographs from CT volumes and forward-project CT segmentation labels to obtain anatomically consistent 2D masks. We evaluate the approach on synthetic and real chest radiographs against a U-Net baseline, including ablations on model scale, input resolution, and vision encoder fine-tuning. Autoregressive anatomical supervision achieves performance comparable to specialized convolutional models in-distribution and demonstrates improved geometric robustness under domain shift to real CXR data. In addition, anatomy-pretrained models exhibit improved sample efficiency when adapting to novel localization tasks under limited supervision. Larger models and higher input image resolution improve performance, while vision encoder fine-tuning has limited effect. These results show that embedding anatomical structure directly into the generative objective promotes spatially grounded representations and supports anatomy-aware medical vision-language modeling.
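
The idea of forward-projecting CT labels into 2D masks can be shown with a deliberately simplified sketch. It assumes a parallel-beam projection along one array axis, whereas realistic radiograph synthesis would model cone-beam geometry and X-ray attenuation; the function names and the toy volume are hypothetical.

```python
import numpy as np

def project_labels(label_vol, label_id, axis=1):
    """2D mask: a pixel is on if any voxel along the ray carries label_id."""
    return (label_vol == label_id).any(axis=axis)

def project_intensity(ct_vol, axis=1):
    """Crude radiograph stand-in: mean intensity along each parallel ray."""
    return ct_vol.mean(axis=axis)

label_vol = np.zeros((4, 4, 4), dtype=int)
label_vol[1, 2, 3] = 2                       # a single voxel of structure 2
mask = project_labels(label_vol, label_id=2)

ct_vol = np.ones((4, 4, 4))
image = project_intensity(ct_vol)
```

Because the image and the mask are projected through the same geometry, the 2D labels stay aligned with the synthetic radiograph by construction.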

2606.08421 2026-06-09 cs.CV 新提交

Segmentation-Assisted Brain MRI Synthesis with Cross-Image Multi-Contrast Feature Memory Bank Retrieval Augmentation

基于跨图像多对比度特征记忆库检索增强的分割辅助脑MRI合成

Wenwei Huang, Jia Wei, Jianlong Zhou

发表机构 * South China University of Technology（华南理工大学）； University of Technology Sydney（悉尼科技大学）

AI总结提出分割辅助的闭环生成对抗框架，通过辅助分割分支和双库检索增强策略，提高多对比度脑MRI中肿瘤区域的合成保真度。

AI中文摘要

多对比度脑MRI提供互补的软组织特征，有助于疾病的筛查和诊断。然而，有限的扫描时间、图像损坏和各种成像协议常常导致多对比度图像不完整。虽然当前方法在图像合成方面表现出色，但它们通常难以合成关键的肿瘤区域，并且无法有效利用多对比度脑MRI中的上下文信息。为了解决这个问题，我们提出了一种以合成为中心、分割辅助的闭环框架，结合检索增强合成。我们的方法整体采用生成对抗架构，旨在通过单一模型从任何可用对比度的组合中合成缺失的对比度。为了显式捕获肿瘤语义并将合成聚焦于肿瘤区域，我们添加了一个辅助分割分支，该分支预测肿瘤掩膜并将其作为语义条件反馈给合成分支，从而在模型中学习肿瘤感知表示并提高合成保真度。此外，我们提出了一种双库检索增强策略。它动态查询两个外部知识库，即用于关键肿瘤上下文的肿瘤掩膜记忆库和用于全局风格信息的跨图像对比度特征记忆库，以增强合成。在两个公开的多对比度磁共振脑数据集：BraTs2020和UCSF-BMSR上验证，所提出的方法在处理医学脑图像合成任务方面有效，并且与先前方法相比表现出优越的性能。代码可在 https://github.com/iBizzard/SSCF.git 获取。

英文摘要

Multi-contrast brain MRI provides complementary soft-tissue characteristics that aid in the screening and diagnosis of diseases. However, limited scanning time, image corruption, and varying imaging protocols often result in incomplete multi-contrast images. While current approaches excel in image synthesis, they often struggle to synthesize critical tumor regions and to exploit contextual information in multi-contrast brain MRI effectively. To address this issue, we propose a synthesis-centric, segmentation-assisted closed-loop framework with retrieval-augmented synthesis. Our method adopts a generative adversarial architecture that aims to synthesize missing contrasts from any combination of available ones with a single model. To explicitly capture tumor semantics and focus synthesis on tumor regions, we add an auxiliary segmentation branch that predicts tumor masks and feeds them back as semantic conditioning to the synthesis branch, thereby learning tumor-aware representations and improving synthesis fidelity. Furthermore, we propose a dual-bank retrieval augmentation strategy. It dynamically queries two external knowledge bases, namely a tumor-mask memory bank for crucial tumor context and a cross-image contrast feature memory bank for global style information, to augment synthesis. Validated on two public multi-contrast brain MRI datasets, BraTs2020 and UCSF-BMSR, the proposed method is effective for brain image synthesis and shows superior performance compared to previous methods. Code is available at: https://github.com/iBizzard/SSCF.git

2606.08641 2026-06-09 cs.CV 新提交

AUCp: 用于异常检测中无标注验证数据的推理模型选择的伪AUC

Md Mahfuzur Rahman Siddiquee, Fazle Rafsani, Jay Shah, Teresa Wu, Catherine D Chong, Todd J Schwedt, Baoxin Li

发表机构 * arXiv

AI总结提出AUCp指标，无需标注验证集即可为无监督/自监督异常检测方法选择最优推理模型，通过将测试集所有样本视为异常计算伪AUC，理论及实验证明其优于传统指标。

DOI: 10.1109/TMI.2026.3684946
Journal ref: IEEE Transactions on Medical Imaging (Early Access), 2026

AI中文摘要

异常检测是医学图像分析中一项关键但具有挑战性的任务。通过学习仅重构正常数据来区分异常与正常数据，减少了对标注数据集的依赖。然而，许多研究即使是无监督的，也依赖标注验证集从多次训练迭代中选择最佳推理模型。对于许多疾病，标注数据不可用且获取耗时。为解决此问题，提出了AUCp——一种支持无监督和自监督方法异常检测的新指标。它不通过评估重构图像的真实性来选择最佳推理模型，而是关注实际检测性能，且无需标注测试集。假设测试集中所有未标注样本的伪真实标签为异常/阳性，并使用传统AUC计算，得到AUCp分数。给定一个包含大量正常样本的代表性训练集，我们通过数学和实证证据表明，使用AUCp分数进行模型选择在无监督和自监督方法中比传统指标更能改善疾病检测。使用两种无监督方法进行神经系统疾病检测以及在不同数据集上的自监督方法，我们的结果表明AUCp分数有效识别最佳推理模型，显著增强异常和疾病检测。相应实现可在https://github.com/mahfuzmohammad/AUCp获取。

英文摘要

Abnormality detection is a crucial yet challenging task in medical image analysis. Distinguishing abnormalities from normal data by learning to reconstruct normal-only data alleviates the reliance on labeled datasets. However, many studies, even if unsupervised, rely on a labeled validation set to select the best model for inference from multiple training iterations. For many diseases, labeled data are unavailable and time-consuming to obtain. To address this, we propose AUCp, a novel metric that supports abnormality detection for unsupervised and self-supervised methods. Instead of evaluating the realism of reconstructed images to select the best model for inference, it focuses on actual detection performance without requiring an annotated test set. AUCp scores are derived by assigning all unannotated samples in the test set the pseudo ground truth abnormal/positive and applying the traditional AUC calculation. Given a large and representative training set of normal samples, we show mathematical and empirical evidence that model selection using AUCp scores improves disease detection for unsupervised and self-supervised methods over conventional metrics. Using two unsupervised methods for neurologic disease detection and self-supervised methods on diverse datasets, our results demonstrate that the AUCp score effectively identifies the optimal model for inference, significantly enhancing abnormality and disease detection. The corresponding implementations are available at https://github.com/mahfuzmohammad/AUCp.
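
A minimal sketch of the pseudo-AUC idea follows. The abstract states that every unannotated test sample is pseudo-labeled abnormal; that the negative class consists of anomaly scores from known-normal samples is an assumption made here, not something the abstract confirms. This is not the authors' implementation (see their repository for that), and all names and scores are illustrative.

```python
import numpy as np

def auc(scores_neg, scores_pos):
    """Probability that a positive outscores a negative (ties count 0.5)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def auc_p(scores_normal, scores_unlabeled):
    """Pseudo-AUC: every unlabeled test sample is treated as abnormal.

    Assumption: scores_normal come from samples known to be normal.
    """
    return auc(scores_normal, scores_unlabeled)

# Toy anomaly scores from one checkpoint (higher means more abnormal).
normal = np.array([0.1, 0.2, 0.3])
unlabeled = np.array([0.15, 0.8, 0.9, 0.25])   # unknown mix of normal and abnormal
score = auc_p(normal, unlabeled)               # 9 of 12 pairs ranked correctly
```

For model selection, one would compute this score for every saved checkpoint and keep the checkpoint with the highest value.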

2606.08745 2026-06-09 cs.CV 新提交

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

染色感知的小波正则化用于组织病理学中的即时对抗净化

Zhe Li, Bernhard Kainz

发表机构 * FAU Erlangen-Nürnberg（埃尔朗根-纽伦堡大学）

AI总结提出染色感知小波正则化（SAWR），利用Haar变换的多级小波域正则化分层分离对抗扰动与诊断结构信息，并扩展到组织学通道实现染色特异性频率调节，在即时净化框架中将对抗鲁棒性提升高达10.69%。

Comments 14 pages, 4 figures

AI中文摘要

深度学习在计算病理学流程中已变得普遍，支持癌症筛查和数字病理学分析等任务。然而，神经网络对对抗扰动的敏感性引发了临床实践中可靠部署的安全问题。在组织病理学图像中，由于难以区分高频对抗噪声与细微且具有诊断意义的组织结构，这一挑战更加严峻。为解决此问题，我们提出染色感知小波正则化（SAWR），一种利用基于Haar变换的多级小波域正则化的对抗净化框架，以分层方式将对抗扰动与诊断结构信息分离。该频谱约束进一步扩展到单个组织学通道，实现与苏木精和伊红的生物学特性一致的染色特异性频率调节。当集成到即时净化框架中时，SAWR将对抗鲁棒性相对于基线方法提升高达10.69%，同时在对抗扰动下保持纹理和频谱保真度。

英文摘要

Deep learning has become prevalent in computational pathology pipelines that support tasks such as cancer screening and digital pathology analysis. However, the susceptibility of neural networks to adversarial perturbations raises safety concerns for reliable deployment in clinical practice. In histopathological images, this challenge is exacerbated by the difficulty of distinguishing high-frequency adversarial noise from subtle and diagnostically relevant tissue structures. To address this issue, we propose Stain-Aware Wavelet Regularization (SAWR), an adversarial purification framework that leverages multi-level wavelet-domain regularization based on the Haar transform to hierarchically disentangle adversarial perturbations from diagnostic structural information. This spectral constraint is further extended to individual histological channels, enabling stain-specific frequency regulation consistent with the biological properties of Hematoxylin and Eosin. When integrated into an instant purification framework, SAWR improves adversarial robustness by up to 10.69% over the baseline approach, while maintaining texture and spectral fidelity under adversarial perturbations.
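
To make the wavelet-domain idea concrete, the sketch below implements an orthonormal 2D Haar transform and a multi-level penalty on the detail subbands. The L1 penalty is an assumption, since the abstract does not state the regularizer, and the stain separation step (applying this per Hematoxylin and Eosin channel) is omitted. Function names are hypothetical.

```python
import numpy as np

def haar2d(img):
    """One level of the orthonormal 2D Haar transform (even-sized input)."""
    avg = (img[0::2, :] + img[1::2, :]) / np.sqrt(2)   # sums of adjacent rows
    dif = (img[0::2, :] - img[1::2, :]) / np.sqrt(2)   # differences of adjacent rows
    approx = (avg[:, 0::2] + avg[:, 1::2]) / np.sqrt(2)
    detail_a = (avg[:, 0::2] - avg[:, 1::2]) / np.sqrt(2)
    detail_b = (dif[:, 0::2] + dif[:, 1::2]) / np.sqrt(2)
    detail_c = (dif[:, 0::2] - dif[:, 1::2]) / np.sqrt(2)
    return approx, detail_a, detail_b, detail_c

def detail_penalty(img, levels=2):
    """Sum of L1 norms of the detail subbands over several levels."""
    total = 0.0
    for _ in range(levels):
        img, d_a, d_b, d_c = haar2d(img)   # recurse on the approximation band
        total += np.abs(d_a).sum() + np.abs(d_b).sum() + np.abs(d_c).sum()
    return float(total)

rng = np.random.default_rng(0)
channel = rng.random((8, 8))              # stand-in for one stain channel
penalty = detail_penalty(channel, levels=2)
```

High-frequency perturbations raise the detail-band energy, so a purifier that minimizes such a penalty, subject to staying close to the input, is pushed toward smoother outputs; how SAWR balances this against diagnostic detail is described in the paper, not here.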

2606.08751 2026-06-09 cs.CV 新提交

Less Is More: Training-Free Acceleration Framework of 3D Diffusion Models for Low-Count PET Denoising via Global-Local Trajectory Reduction

少即是多：通过全局-局部轨迹缩减实现低计数PET去噪的3D扩散模型免训练加速框架

Yuhan Liu, Scott M. Leonard, Marlee Crews, Muhannad Fadhel, Jinkui Hao, Tianqi Chen, Ryan J. Avery, Bo Zhou

发表机构 * Northwestern University（西北大学）； Hefei University of Technology（合肥工业大学）

AI总结提出一种免训练的全局-局部跳跃策略，通过噪声一致变换初始化中间步骤和重用U-Net特征，在加速3D扩散模型去噪的同时提升重建质量。

Comments 19 pages, 10 figures, 5 tables

AI中文摘要

PET中的准确定量和摄取测量对于评估疾病进展和支持临床决策至关重要。虽然高计数PET提供了可靠的图像质量，但相关的辐射剂量和长时间采集仍然是重要的临床问题，促使采用低计数协议。基于扩散模型的方法在将低计数PET恢复至接近高计数质量方面显示出巨大潜力，但其迭代采样过程在应用于高分辨率3D PET体积时变得极其昂贵，导致显著的推理延迟，限制了实际临床部署。为了解决这些挑战，我们提出了一种免训练的全局-局部跳跃策略，该策略加速了基于扩散模型的3D PET去噪，同时提高了重建质量。所提出的方法即插即用，可直接应用于预训练扩散模型，无需重新训练或修改架构。具体而言，我们引入了：(i) 全局去噪步骤跳跃策略，通过使用低计数输入的噪声一致变换从中间去噪步骤初始化反向扩散过程，大幅减少所需的去噪步骤数；(ii) 局部特征重用捷径，在相邻去噪步骤间重用缓慢变化的高级U-Net特征，进一步减少每步计算量同时保持图像保真度。我们在来自内部和公共数据集的多种PET示踪剂上评估了所提出的方法，包括18F-FDG PET、68Ga-DOTATATE PET和18F-PSMA PET，结果显示相对于全步骤基线，实现了超过一个数量级的一致加速以及改进或相当的重建性能。盲法读者研究进一步证实了增强的临床信心和感知诊断质量。

英文摘要

Accurate quantification and uptake measurement in PET are critical for assessing disease progression and supporting clinical decision-making. While high-count PET provides reliable image quality, the associated radiation dose and prolonged acquisition remain significant clinical concerns, motivating the adoption of low-count protocols. Diffusion-model-based methods have demonstrated strong potential for restoring low-count PET to near high-count quality, but their iterative sampling procedure becomes prohibitively expensive when applied to high-resolution 3D PET volumes, introducing substantial inference latency that limits practical clinical deployment. To address these challenges, we propose a training-free Global-Local Skipping Strategy that accelerates diffusion model-based 3D PET denoising while simultaneously improving reconstruction quality. The proposed method is plug-and-play and directly applicable to pre-trained diffusion models without retraining or architectural modification. Specifically, we introduce: (i) a global denoising step skipping strategy that initializes the reverse diffusion process from an intermediate denoising step using a noise-consistent transformation of the low-count input, substantially reducing the number of required denoising steps; and (ii) a local feature reuse shortcut that reuses slowly-varying high-level U-Net features across neighboring denoising steps, further reducing per-step computation while preserving image fidelity. We evaluate the proposed approach on multiple PET tracers from in-house and public datasets, including 18F-FDG PET, 68Ga-DOTATATE PET, and 18F-PSMA PET, demonstrating consistent acceleration of over an order of magnitude alongside improved or comparable reconstruction performance relative to the full-step baseline. Blinded reader studies further confirm enhanced clinical confidence and perceived diagnostic quality.
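
The global step-skipping idea in (i) can be illustrated by starting the reverse process from an intermediate step instead of pure noise. The abstract does not define its noise-consistent transformation; the sketch below substitutes the standard DDPM forward marginal applied to the low-count input, which is an assumption, as are the linear noise schedule and the skip point `t_start`.

```python
import numpy as np

def init_from_intermediate(low_count, alpha_bar_t, rng):
    """Noise the low-count volume to the level of step t (DDPM forward marginal).

    Stand-in for the paper's noise-consistent transformation.
    """
    noise = rng.normal(size=low_count.shape)
    return np.sqrt(alpha_bar_t) * low_count + np.sqrt(1.0 - alpha_bar_t) * noise

betas = np.linspace(1e-4, 0.02, 1000)       # illustrative linear schedule
alpha_bar = np.cumprod(1.0 - betas)
t_start = 200                                # hypothetical skip point
rng = np.random.default_rng(0)
low_count = rng.random((4, 8, 8))            # toy 3D volume
x_t = init_from_intermediate(low_count, alpha_bar[t_start], rng)

# Denoising from index t_start down to 0 takes t_start + 1 steps instead of 1000.
steps_saved = len(betas) - (t_start + 1)
```

The local feature reuse in (ii) would further cut the cost of each remaining step by caching slowly varying U-Net features between neighboring steps; that part is not sketched here.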

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM 新提交

A multi-agent system for spine MRI report generation from multi-sequence imaging

基于多序列影像的脊柱MRI报告生成多智能体系统

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

发表机构 * University of Washington（华盛顿大学）； Peking University（北京大学）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； New York University（纽约大学）； University of Washington Medical Center（华盛顿大学医学中心）

AI总结提出SpineAgent多智能体框架，利用多序列基础模型整合T1/T2等序列信息，实现脊柱MRI报告生成、病理定位和图文检索，在跨厂商和跨队列评估中表现优异。

AI中文摘要

脊柱病理是全球疼痛和残疾的主要原因之一。脊柱MRI是临床评估的核心，但其解读仍然复杂且耗时，需要整合多个成像序列和解剖区域的信息。尽管自动化MRI分析最近取得了进展，但如何有效结合多序列数据同时保留序列特异性诊断信息仍是一个开放挑战。本文提出SpineAgent，一个基于多序列基础模型的脊柱MRI报告生成多智能体框架，该模型在来自32,047名患者和453,683个MRI系列（总计13,441,191张MRI切片）的常规临床数据上训练。为了适应不同模态的序列，我们首先分别在T1和T2加权序列上预训练两个基于DINOv3的编码器。然后，我们引入一种持续训练策略，学习一个合成器，利用T1和T2编码器嵌入其他序列的图像，生成整合MRI序列间各种信号的患者级嵌入。利用这些嵌入，SpineAgent实现了最先进的性能，并在跨制造商和跨队列评估中展现出强大的泛化能力。除了分类，SpineAgent通过识别与发现相关的切片和分割病理区域实现病理定位。它还支持多模态图像-报告检索，为可扩展和可解释的MRI报告生成提供了坚实基础。我们进一步将这些经过验证的SpineAgent能力集成到37个专门智能体中。最后，我们将它们的输出作为结构化标记，整合到一个端到端训练用于报告生成的医疗报告智能体中。通过自动指标和五位放射科医生的专家评估，SpineAgent在脊柱MRI报告生成中取得了领先性能。

英文摘要

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse sequence types, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing a patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

2606.09140 2026-06-09 cs.CV 新提交

DiffSight-Former: Modeling Structural Differences and Temporal Dynamics for Glaucoma Progression Prediction

DiffSight-Former：建模结构差异和时间动态用于青光眼进展预测

Yi Huang, Lei Bi, Jinman Kim

发表机构 * The University of Sydney（悉尼大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出DiffSight-Former框架，通过时间变异特征提取、多结构差异建模和时间感知Transformer，从序列眼底图像中预测青光眼进展，在SIGF和GRAPE数据集上取得高AUC和灵敏度。

Comments 12 pages, 6 figures

AI中文摘要

青光眼是全球不可逆失明的主要原因，从眼底图像早期检测对于有效疾病管理至关重要。虽然深度学习在眼底图像分析中取得了有希望的性能，但现有方法大多依赖单时间点图像，未能捕捉与疾病进展相关的纵向结构和血管变化。临床随访期间获取的序列眼底图像提供了宝贵的时间信息；然而，当前的序列模型通常难以检测微妙的早期进展信号，并且常依赖固定长度输入或已患青光眼图像的诊断线索，限制了其在早期预测中的临床实用性。为解决这些限制，我们提出了DiffSight-Former，一个从序列眼底图像预测青光眼进展的框架。它包含一个基于眼底专用基础模型的时间变异特征提取模块，以获得稳健的解剖表示。引入多结构差异建模模块来量化视盘/杯区域和视网膜血管中与进展相关的变化。这些表示与时间间隔嵌入集成，并由时间感知Transformer处理，以建模疾病进展并估计未来青光眼发作的概率。在两个纵向数据集SIGF（405个序列）和GRAPE（263个序列）上进行了实验。在SIGF上，DiffSight-Former在进展预测中达到了91.54%的AUC和92.16%的灵敏度。在GRAPE上，它在三个临床视野进展标准上平均准确率达到87.48%。与现有方法相比，DiffSight-Former在不同时间设置下表现出强大的性能和鲁棒性，突显了其在纵向青光眼监测和早期风险预测中的潜力。

英文摘要

Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.


2606.09249 2026-06-09 cs.CV 新提交

MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making

MAGIS：基于证据的多智能体推理用于可解释的斜视临床决策

Xikai Tang, Yifan Wang, Jiafan Zhuang, Li Luo, Jinming Guo, Xiaoling Xie, Jiacheng Liu, Peiwei Wei, Lihao Zhong, Xiaoli Kang, Jie Cen, Guangqiang Yin, Kunliang Qiu, Ce Zheng, Zhun Fan

发表机构 * School of Information and Software Engineering, University of Electronic Science and Technology of China（电子科技大学信息与软件工程学院）； Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China（电子科技大学深圳高等研究院）； Joint Shantou International Eye Center of Shantou University and The Chinese University of Hong Kong（汕头大学·香港中文大学联合汕头国际眼科中心）； School of Artificial Intelligence, Guangzhou City Polytechnic（广州城市职业学院人工智能学院）； Medical College, Shantou University（汕头大学医学院）； College of Engineering, Shantou University（汕头大学工学院）； Department of Ophthalmology, Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine（上海交通大学医学院附属新华医院眼科）； Shenzhen Loop Area Institute（深圳环路区域研究所）

AI总结提出MAGIS框架，通过多智能体协作、双重证据约束上下文和基于证据的纠正验证机制，将斜视诊断从黑箱生成转变为结构化推理，在细粒度斜视基准上将加权F1分数从72.0%提升至91.3%，并显著提高诊断报告的临床可靠性。

详情

AI中文摘要

斜视是一种常见的眼部疾病，需要细粒度亚型诊断以制定个性化治疗方案。然而，现有的深度学习方法主要提供诊断预测，缺乏透明推理；而近期的大视觉语言模型（LVLMs）虽然在联合图像理解和报告生成方面有前景，但在这种对证据敏感且规则驱动的医学任务中极易产生幻觉。为解决这些问题，我们提出了MAGIS，一个基于证据的多智能体可解释斜视诊断推理框架。MAGIS将黑箱端到端生成转变为结构化的诊断过程，包括候选假设生成、双重证据约束上下文、基于证据的纠正验证和报告生成。具体而言，我们引入了双重证据约束上下文（DECC）机制，将来自九个注视方位照片的视觉证据和基于证据的临床诊断规则联合组织成约束上下文，以实现可靠的诊断推理。我们进一步开发了基于证据的纠正验证（EBCV）机制，验证当前诊断假设是否得到视觉证据、基于热图的视觉线索和基于证据的临床诊断规则的支持。当检测到不一致时，触发假设修正。在细粒度斜视基准上的实验表明，MAGIS不仅显著优于其他最先进的诊断系统，将加权F1分数从72.0%提高到91.3%，而且大幅提升了生成诊断报告的临床可靠性（一致性、对齐性和完整性）。这些结果表明，MAGIS为构建准确、基于证据且临床可解释的斜视诊断系统提供了有效解决方案。

英文摘要

Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.
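The headline metric above is a weighted F1 score. As a minimal sketch of how such a score is usually computed (per-class F1 averaged with class-support weights), the following may help; the subtype labels are hypothetical placeholders, not the classes of the MAGIS benchmark, and the function is not taken from the authors' code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of samples
    return score


# Hypothetical subtype labels, for illustration only.
truth = ["eso", "eso", "exo", "exo", "exo", "hyper"]
preds = ["eso", "exo", "exo", "exo", "exo", "hyper"]
print(round(weighted_f1(truth, preds), 4))
```

Weighting by support keeps frequent subtypes from being drowned out by rare ones, which is presumably why it is reported for a fine-grained, imbalanced benchmark.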


2606.09253 2026-06-09 cs.CV physics.med-ph 新提交

A practical probabilistic framework for deformable image registration uncertainty in radiotherapy dose propagation

一种实用的概率框架用于放射治疗剂量传播中的可变形图像配准不确定性

Stefan Heldmann, Sven Kuckertz, Nasim Givehchi, Thomas Coradi, Mikel Byrne, Ben Archibald-Heeren, Nils Papenberg

发表机构 * Fraunhofer Institute for Digital Medicine MEVIS（弗劳恩霍夫数字医学研究所MEVIS）； Varian, a Siemens Healthineers company（Varian公司）； Icon Group（Icon集团）

AI总结提出一种轻量级概率框架，通过局部确定性图建模可变形图像配准不确定性，实现剂量统计和剂量体积直方图的不确定性传播，并在前列腺放疗案例中验证了确定性图设计对结果的影响。

详情

AI中文摘要

可变形图像配准（DIR）广泛应用于放射治疗中的剂量传播和累积，但底层变形的不确定性会显著影响临床相关的剂量估计。我们提出了一种实用的概率框架，用于将DIR不确定性传播到体素级剂量统计和剂量体积直方图（DVH）。该方法将每个体素的映射对应关系建模为由透明的局部确定性图控制的随机变量，该确定性图可通过简单的安全边界、结构边界不匹配或结构保守的不确定性值来定义。这产生了可解释的量，如剂量概率、期望剂量、置信区间和诱导的DVH包络。该框架设计为轻量级且可解释：它避免了复杂的生物力学或基于集成的不确定性模型，而是强调简单的参数化、计算可行性和透明的剂量指标。我们进一步引入了一种结构导向的内/外策略作为可选优化，将映射概率限制在解剖学上合理的目标区域。该方法在前列腺放疗案例研究中得到验证，并用于比较不同的确定性图策略和概率核。实验表明，确定性图设计对结果剂量和DVH不确定性边界的影响比特定核选择更强，而内/外策略的额外收益在案例中依赖于具体情况且效果有限。总体而言，所提出的框架提供了一种透明的方式，将DIR不确定性纳入放射治疗剂量评估，并研究建模选择如何影响传播的剂量指标。

英文摘要

Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but uncertainty in the underlying deformation can substantially affect clinically relevant dose estimates. We present a practical probabilistic framework for propagating DIR uncertainty to voxel-wise dose statistics and dose-volume histograms (DVHs). The method models the mapped correspondence at each voxel as a random variable governed by a transparent local certainty map that can be defined by simple safety margins, structure-boundary mismatch, or structure-wise conservative uncertainty values. This yields interpretable quantities such as dose probabilities, expected dose, confidence bounds, and induced DVH envelopes. The framework is designed to remain lightweight and interpretable: it avoids complex biomechanical or ensemble-based uncertainty models and instead emphasizes simple parameterization, computational feasibility, and transparent dose metrics. We further introduce a structure-guided in/out strategy as an optional refinement that restricts mapping probabilities to anatomically plausible target regions. The approach is demonstrated on a prostate radiotherapy case study and used to compare different certainty-map strategies and probability kernels. The experiments show that the certainty-map design has a stronger effect on resulting dose and DVH uncertainty bounds than the specific kernel choice, while the additional benefit of the in/out strategy is case-dependent and modest in the present example. Overall, the proposed framework provides a transparent way to incorporate DIR uncertainty into radiotherapy dose assessment and to study how modelling choices affect propagated dose metrics.
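To make the quantities named in the abstract concrete, here is a rough sketch of voxel-wise expected dose, a dose probability, and a cumulative DVH. It assumes each voxel's uncertain mapping has already been reduced to a short list of (probability, dose) candidates; that representation, the function names, and the numbers are illustrative assumptions rather than the paper's formulation.

```python
def expected_dose(candidates):
    """Expected dose for one voxel.

    candidates: list of (probability, dose_gy) pairs whose probabilities sum to 1.
    """
    return sum(p * d for p, d in candidates)


def prob_dose_at_least(candidates, threshold_gy):
    """Probability that the voxel receives at least threshold_gy."""
    return sum(p for p, d in candidates if d >= threshold_gy)


def cumulative_dvh(doses, thresholds_gy):
    """Fraction of voxels receiving at least each threshold dose."""
    n = len(doses)
    return [sum(1 for d in doses if d >= t) / n for t in thresholds_gy]


# Three hypothetical voxels of one structure (values are made up).
voxels = [
    [(0.8, 60.0), (0.2, 50.0)],
    [(0.5, 40.0), (0.5, 20.0)],
    [(1.0, 10.0)],
]
mean_doses = [expected_dose(v) for v in voxels]
dvh = cumulative_dvh(mean_doses, [0.0, 25.0, 50.0])
```

A DVH envelope could then be obtained by repeating `cumulative_dvh` on lower and upper confidence bounds of the voxel doses instead of the expectation.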


2606.09378 2026-06-09 cs.CV 新提交

Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion

Echo-DM: 通过条件潜在扩散和区域感知融合去除超声标记

Zhiwei Wang, Tao Huang, Wentao Jiang, Muyi Li, Jianxin Liu, Jian Chen, Jie Zou, Yong Luo, Bo Du, Jing Zhang

发表机构 * School of Computer Science, Wuhan University, China（武汉大学计算机学院）； The Central Hospital of Wuhan, China（武汉市中心医院）； School of Computer Science, Hubei University of Technology, China（湖北工业大学计算机学院）

AI总结提出Echo-DM框架，结合条件潜在扩散和区域感知融合，在无掩码条件下有效去除超声图像中的人工标记，同时保持解剖结构保真度。

Comments 18 pages, 4 figures

详情

AI中文摘要

临床超声图像通常包含人工标记，如测量卡尺和文字，以辅助诊断解释和比较。然而，这些标记可能在下游自动分析中引入捷径偏差，促使深度学习模型依赖标记相关线索而非临床有意义的解剖结构。现有的标记去除方法要么依赖于掩码且易受错误传播影响，要么是无掩码的确定性修复器，可能过度平滑超声纹理并扰动未受影响的背景区域。为应对这些挑战，我们提出了Echo-DM，一个通过条件潜在扩散和区域感知融合进行超声标记去除的框架。Echo-DM遵循通用的编码器-扩散-解码器流水线，其中基于DiT的条件潜在扩散网络执行全局修复，区域感知融合模块在端到端无掩码推理下强制执行保留感知的图像空间细化。基于这一固定核心设计，我们进一步分别用基于VAE和基于RAE的潜在模块实例化了Echo-DM-V和Echo-DM-R，这表明Echo-DM架构与多种潜在模块实例化兼容。在Echo-PAIR（一个大规模配对临床超声数据集）上的大量实验表明，与代表性的两阶段基线相比，Echo-DM具有优越的标记去除能力和强大的解剖保真度，同时在部署设置中提供了有利的质量-效率权衡。数据、代码和模型将在https://github.com/MiliLab/Echo-DM发布。

英文摘要

Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality-efficiency trade-offs across deployment settings. Data, code and models will be released at https://github.com/MiliLab/Echo-DM.


2606.09400 2026-06-09 cs.CV 新提交

vesselFM-CT: Segmenting All Blood Vessels in CT Images for System-Level Cardiovascular Analysis

vesselFM-CT：在CT图像中分割所有血管以实现系统级心血管分析

Bastian Wittmann, Chinmay Prabhakar, Suprosanna Shit, Bjoern Menze

发表机构 * Department of Quantitative Biomedicine, University of Zurich（苏黎世大学定量生物医学系）

AI总结提出vesselFM-CT模型，通过迭代多步训练和TubeLoss损失函数，实现CT图像中从大血管到微小肠系膜血管的全分割，优于基线方法，支持系统级心血管分析。

详情

AI中文摘要

人体血管网络中的血管在半径、长度、拓扑特性和分支模式上表现出剧烈的结构变化。这种异质性，加上位置特定的解剖背景变化，对稳健、大规模地分析整个心血管系统构成了重大挑战。因此，大多数研究集中在血管网络的狭窄孤立部分。虽然这些针对性研究提供了有价值的见解，但它们本质上限制了评估血管网络整体系统健康和功能完整性的能力。在这项工作中，我们旨在弥合这一差距，以推进临床诊断和我们对血管生理学的基本理解。我们提出了在CT图像中分割所有血管的任务，范围从心血管系统最大的组成部分到微小的肠系膜血管。为此，我们引入了vesselFM-CT，这是第一个能够稳健分割3D CT图像中所有血管的模型。vesselFM-CT通过迭代多步过程进行训练，并优化我们提出的TubeLoss损失函数，有效解决了心血管系统固有的异质性。我们证明vesselFM-CT优于所有基线，并能够从CT图像中自动精确提取心血管系统，从而解锁广泛的临床和技术视角，包括自动疾病分类和合成CT图像生成。

英文摘要

The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.


2606.09453 2026-06-09 cs.CV 新提交

基于CT的肾脏肿块鲁棒分割：AI框架的验证研究

Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering

发表机构 * Department of Medical Imaging, Radboudumc, Nijmegen, The Netherlands（荷兰奈梅亨拉德堡德大学医学中心医学影像科）； Department of Radiology, Charité - Universitätsmedizin Berlin, Berlin, Germany（德国柏林夏里特医学院放射科）； Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany（德国柏林夏里特医学院神经放射科）； Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany（德国慕尼黑工业大学附属伊萨尔河右岸医院诊断与介入放射科）； Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, TUM University Hospital, Technical University of Munich, Munich, Germany（德国慕尼黑工业大学附属德国心脏中心心血管放射与核医学科）； Fraunhofer MEVIS, Bremen, Germany（德国不来梅弗劳恩霍夫MEVIS研究所）

AI总结提出Renal-Net，基于nnU-Net和公开数据训练，在CT图像上实现肾脏肿块分割，验证显示优于现有模型且鲁棒性强。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:012. 23 pages, 12 figures

详情

DOI: 10.59275/j.melba.2026-67g5
Journal ref: Machine.Learning.for.Biomedical.Imaging. 2026 (2026)

AI中文摘要

肾脏肿块分割在改进临床工作流方面具有重要潜力，尤其是在需要定量评估的场景中。肾脏体积可作为肾脏疾病的重要生物标志物，其体积变化与肾功能直接相关。目前，临床实践常依赖主观视觉评估来评价肾脏大小和肾脏病变（包括肿瘤和囊肿），这些病变通常根据直径、体积和解剖位置进行分期。为了支持更客观和可重复的方法，本研究旨在开发一个鲁棒且经过充分验证的肾脏肿块分割算法，命名为Renal-Net。我们使用公开可用的训练数据集，并利用最先进的医学图像分割框架nnU-Net。使用专有和公开测试数据集进行验证，分割性能通过Dice系数和第95百分位Hausdorff距离量化。此外，我们根据患者性别、年龄、CT对比相和肿瘤组织学亚型分析亚组鲁棒性。我们的结果表明，仅使用公开数据训练的分割算法能有效泛化到外部测试集，并在所有测试数据集上优于现有最先进模型。亚组分析显示一致的高性能，表明强鲁棒性和可靠性。开发的算法和相关代码可在 https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation 公开获取。

英文摘要

Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and kidney lesions, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated renal mass segmentation algorithm, named Renal-Net. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
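Segmentation quality above is reported with the Dice coefficient. A minimal sketch of that metric on flattened binary masks follows; it is a generic illustration, not the evaluation code of Renal-Net or nnU-Net, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size


# Toy 4-voxel masks: one voxel overlaps, each mask has two foreground voxels.
print(dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

Dice rewards volumetric overlap, while the 95th percentile Hausdorff distance reported alongside it penalises boundary outliers, so the two are usually read together.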


2508.20734 2026-06-09 cs.CV 版本更新

CardioMorphNet: Cardiac Motion Prediction Using a Shape-Guided Bayesian Recurrent Deep Network

CardioMorphNet: 使用形状引导的贝叶斯循环深度网络进行心脏运动预测

Reza Akbari Movahed, Abuzar Rezaee, Arezoo Zakeri, Colin Berry, Edmond S. L. Ho, Ali Gooya

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出CardioMorphNet，一种基于循环变分自编码器和贝叶斯公式的3D心脏形状引导可变形配准框架，通过递归配准分割图避免强度相似性损失，在心脏运动估计中优于现有方法，并具有更低的不确定性。

Comments Published in Medical Image Analysis. Updated to match the final published version

详情

DOI: 10.1016/j.media.2026.104149
Journal ref: Medical Image Analysis, vol. 113, p. 104149, 2026

AI中文摘要

从电影心脏磁共振（CMR）图像中准确估计心脏运动对于评估心脏功能和检测其异常至关重要。现有方法通常难以准确捕捉心脏运动，因为它们依赖于基于强度的图像配准相似性损失，可能忽略心脏解剖区域。为了解决这个问题，我们提出了CardioMorphNet，一个用于使用短轴（SAX）CMR图像进行3D心脏形状引导可变形配准的循环贝叶斯深度学习框架。它采用循环变分自编码器来建模心脏周期中的时空依赖性，以及两个用于双心室分割和运动估计的后验模型。从贝叶斯公式导出的损失函数通过递归配准分割图来引导框架关注解剖区域，而不使用基于强度的图像配准相似性损失，同时利用顺序SAX体积和时空特征。贝叶斯建模还使得能够计算估计运动场的不确定性图。通过在UK Biobank和M&M数据集上验证，将扭曲的掩模形状与真实掩模进行比较，CardioMorphNet在心脏运动估计中表现出优越的性能，优于最先进的方法。不确定性评估表明，与其他基于概率的心脏配准方法相比，它在心脏区域估计的运动场上产生更低的不确定性值，表明其预测具有更高的置信度。此外，临床指标提取评估显示，CardioMorphNet比其他方法更准确地估计临床指标。

英文摘要

Accurate cardiac motion estimation from cine cardiac magnetic resonance (CMR) images is vital for assessing cardiac function and detecting its abnormalities. Existing methods often struggle to accurately capture heart motion because they rely on intensity-based image registration similarity losses that may overlook cardiac anatomical regions. To address this, we propose CardioMorphNet, a recurrent Bayesian deep learning framework for 3D cardiac shape-guided deformable registration using short-axis (SAX) CMR images. It employs a recurrent variational autoencoder to model spatio-temporal dependencies across the cardiac cycle, along with two posterior models for bi-ventricular segmentation and motion estimation. The derived loss function from the Bayesian formulation guides the framework to focus on anatomical regions by recursively registering segmentation maps without using intensity-based image registration similarity loss, while leveraging sequential SAX volumes and spatio-temporal features. The Bayesian modelling also enables the computation of uncertainty maps for the estimated motion fields. Validated on the UK Biobank and M&M datasets by comparing warped mask shapes with ground-truth masks, CardioMorphNet demonstrates superior performance in cardiac motion estimation, outperforming state-of-the-art methods. Uncertainty assessment shows that it also yields lower uncertainty values for estimated motion fields in the cardiac region compared with other probabilistic-based cardiac registration methods, indicating higher confidence in its predictions. In addition, the clinical indices extraction assessment shows that CardioMorphNet estimates the clinical indices more accurately than other approaches.


2509.15017 2026-06-09 cs.CV 版本更新

No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation

不遗漏任何模态：通过知识蒸馏适应缺失模态的脑肿瘤分割

Shenghao Zhu, Yifei Chen, Weihong Chen, Shuo Jiang, Guanyu Zhou, Yuanhan Wang, Feiwei Qin, Changmiao Wang, Qiyuan Tian

发表机构 * Medical Image Analysis（医学影像分析）

AI总结提出AdaMM框架，利用知识蒸馏和三个协同模块处理多模态MRI中模态缺失问题，在多个数据集上显著提升分割精度和鲁棒性。

Comments 51 pages, 11 figures

详情

DOI: 10.1016/j.media.2026.104108

AI中文摘要

准确的脑肿瘤分割对于术前评估和个性化治疗至关重要。多模态MRI因其能够捕捉不同序列中互补的肿瘤特征而被广泛使用。然而，在临床实践中，模态缺失很常见，限制了依赖完整输入的现有深度学习方法的鲁棒性和泛化能力，尤其是在非主导模态组合下。为了解决这个问题，我们提出了AdaMM，一个针对缺失模态场景定制的多模态脑肿瘤分割框架，以知识蒸馏为核心，由三个协同模块组成。图引导自适应细化模块显式建模通用特征与模态特定特征之间的语义关联，增强对模态缺失的适应性。双瓶颈蒸馏模块通过全局风格匹配和对抗特征对齐，将结构和纹理知识从教师模型转移到学生模型。病变存在引导可靠性模块通过辅助分类任务预测病变类型的先验概率，有效抑制不完整输入下的假阳性。在Pretreat-MetsToBrain-Masks和BraTS 2018、2024数据集上的大量实验表明，AdaMM始终优于现有方法，尤其在单模态和弱模态配置下表现出更优的分割精度和鲁棒性。此外，我们对六类缺失模态策略进行了系统评估，支持知识蒸馏的优越性，并为方法选择和未来研究提供了实用指导。我们的源代码可在 https://github.com/Quanato607/AdaMM 获取。

英文摘要

Accurate brain tumor segmentation is essential for preoperative evaluation and personalized treatment. Multi-modal MRI is widely used due to its ability to capture complementary tumor features across different sequences. However, in clinical practice, missing modalities are common, limiting the robustness and generalizability of existing deep learning methods that rely on complete inputs, especially under non-dominant modality combinations. To address this, we propose AdaMM, a multi-modal brain tumor segmentation framework tailored for missing-modality scenarios, centered on knowledge distillation and composed of three synergistic modules. The Graph-guided Adaptive Refinement Module explicitly models semantic associations between generalizable and modality-specific features, enhancing adaptability to modality absence. The Bi-Bottleneck Distillation Module transfers structural and textural knowledge from teacher to student models via global style matching and adversarial feature alignment. The Lesion-Presence-Guided Reliability Module predicts prior probabilities of lesion types through an auxiliary classification task, effectively suppressing false positives under incomplete inputs. Extensive experiments on the Pretreat-MetsToBrain-Masks, BraTS 2018, and BraTS 2024 datasets demonstrate that AdaMM consistently outperforms existing methods, exhibiting superior segmentation accuracy and robustness, particularly in single-modality and weak-modality configurations. In addition, we conduct a systematic evaluation of six categories of missing-modality strategies, supporting the superiority of knowledge distillation and offering practical guidance for method selection and future research. Our source code is available at https://github.com/Quanato607/AdaMM.


2511.18454 2026-06-09 cs.CV cs.AI 版本更新

AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading

AttnRegDeepLab: 一种用于可解释胚胎碎片分级的双阶段解耦框架

Ming-Jhe Lee, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee

发表机构 * Department of Electrical Engineering（电气工程系）； AI Research Center（人工智能研究中心）； National Taiwan Ocean University（国立台湾海洋大学）； Department of Obstetrics, Gynecology（妇产科部）； Gynecology, CSMU Hospital, Taichung, Taiwan（台湾台中中山医学大学附设医院妇产科）

AI总结提出AttnRegDeepLab框架，通过双分支多任务学习、注意力门控、多尺度回归头和两阶段解耦训练，实现胚胎碎片分级的高精度与可解释性。

Comments 6 pages, 5 figures

详情

AI中文摘要

胚胎碎片是评估体外受精（IVF）发育潜力的关键形态学指标。然而，手动分级主观且低效，而现有的深度学习解决方案往往缺乏临床可解释性，或在分割区域估计中遭受累积误差。为了解决这些问题，本研究提出了AttnRegDeepLab（注意力引导回归DeepLab），一种以双分支多任务学习（MTL）为特征的框架。通过将注意力门集成到其跳跃连接中，修改了原始的DeepLabV3+解码器，显式抑制细胞质噪声以保留轮廓细节。此外，引入了一个多尺度回归头，并采用特征注入机制将全局分级先验传播到分割任务中，纠正系统量化误差。提出了一种两阶段解耦训练策略来解决MTL中的梯度冲突。同时，设计了一种基于范围的损失以利用弱标记数据。我们的方法在保持出色分割精度（Dice系数=0.729）的同时实现了稳健的分级精度，这与可能以牺牲轮廓完整性为代价最小化分级误差的端到端方法形成对比。这项工作提供了一种在视觉保真度和量化精度之间取得平衡的临床可解释解决方案。

英文摘要

Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models struggle with clinical interpretability and segmentation errors. We propose AttnRegDeepLab, a Multi-Task Learning (MTL) framework designed to solve these challenges. The model enhances a DeepLabV3+ decoder with Attention Gates to filter out cytoplasmic noise and retain sharp contour details. It also introduces a Multi-Scale Regression Head with Feature Injection, guiding the segmentation process with global grading priors to eliminate systematic area estimation errors. Based on a two-stage decoupled training strategy and a range-based loss for weakly labeled data, our method resolves MTL gradient conflicts. AttnRegDeepLab yields high grading precision and excellent segmentation quality (Dice coefficient = 0.729), avoiding the trade-off between contour integrity and grading accuracy seen under standard joint optimization. This provides a reliable, clinically interpretable tool balancing visual and quantitative accuracy.
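The abstract mentions a range-based loss for weakly labeled data but does not give its form. One plausible reading, sketched below purely as an assumption, is that a weak label supplies only an interval for the fragmentation percentage, and the prediction is penalised by its squared distance to that interval (zero inside it). The function name and the example interval are hypothetical.

```python
def range_loss(pred, low, high):
    """Squared distance from pred to the interval [low, high]; zero inside it.

    Assumed form only: the paper's actual loss may differ.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if pred < low:
        return (low - pred) ** 2
    if pred > high:
        return (pred - high) ** 2
    return 0.0


# Hypothetical weak label: fragmentation somewhere between 10% and 25%.
print(range_loss(15.0, 10.0, 25.0))  # inside the range, no penalty
print(range_loss(30.0, 10.0, 25.0))  # 5 points above the upper bound
```

A loss of this shape lets coarse grade intervals supervise the regression head without forcing it toward an arbitrary point inside the interval.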


2511.18676 2026-06-09 cs.CV cs.AI 版本更新

MedVision: Benchmarking Quantitative Medical Image Analysis

MedVision：定量医学图像分析的基准测试

Yongcheng Yao, Yongshuo Zong, Raman Dutt, Yongxin Yang, Sotirios A Tsaftaris, Timothy Hospedales

发表机构 * University of Edinburgh（爱丁堡大学）； Queen Mary University of London（伦敦大学玛丽女王学院）

AI总结针对当前医学视觉语言模型缺乏定量推理能力的问题，提出MedVision数据集和基准，涵盖22个公共数据集、3080万图像-标注对，通过监督和强化微调显著提升检测、肿瘤/病变大小估计和角度/距离测量性能。

Comments 22 pages, 13 figures, 14 tables

详情

AI中文摘要

当前医学领域的视觉-语言模型（VLM）主要用于分类问答（如“这是正常还是异常？”）或定性描述任务。然而，临床决策通常依赖于定量评估，例如测量肿瘤大小或关节角度，医生据此得出自己的诊断结论。这种定量推理能力在现有VLM中尚未得到充分探索和支持。在这项工作中，我们引入了MedVision，这是一个专门设计用于评估和改进VLM在定量医学图像分析中的大规模数据集和基准。MedVision涵盖22个公共数据集，涉及多种解剖结构和模态，包含3080万个图像-标注对。我们聚焦于三个代表性的定量任务：（1）解剖结构和异常检测，（2）肿瘤/病变（T/L）大小估计，以及（3）角度/距离（A/D）测量。我们表明，当前现成的VLM在这些任务上表现不佳。然而，在MedVision上进行监督和强化微调显著提升了检测、T/L估计和A/D测量的性能。MedVision为开发具有稳健定量推理能力的医学图像分析VLM奠定了基础。

英文摘要

Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnormal?") or qualitative descriptive tasks. However, clinical decision-making often relies on quantitative assessments, such as measuring the size of a tumor or the angle of a joint, from which physicians draw their own diagnostic conclusions. This quantitative reasoning capability remains underexplored and poorly supported in existing VLMs. In this work, we introduce MedVision, a large-scale dataset and benchmark specifically designed to evaluate and improve VLMs on quantitative medical image analysis. MedVision spans 22 public datasets covering diverse anatomies and modalities, with 30.8 million image-annotation pairs. We focus on three representative quantitative tasks: (1) detection of anatomical structures and abnormalities, (2) tumor/lesion (T/L) size estimation, and (3) angle/distance (A/D) measurement. We show that current off-the-shelf VLMs perform poorly on these tasks. However, supervised and reinforcement fine-tuning on MedVision significantly enhances performance across detection, T/L estimation, and A/D measurement. MedVision provides a foundation for developing VLMs with robust quantitative reasoning capabilities in medical imaging.


2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE：基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile（智利天主教大学）； CENIA ； iHEALTH ； KAUST（阿卜杜拉国王科技大学）

AI总结提出CURE框架，通过课程学习动态调整多任务训练，提升医学报告生成的视觉接地准确性和事实一致性，无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289

AI中文摘要

医学视觉语言模型可以自动生成放射学报告，但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐，导致不可靠或弱接地的预测。我们提出CURE，一个错误感知的课程学习框架，无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上，使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样，强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU，报告质量提高了+0.192 CXRFEScore，并将幻觉减少了18.6%。CURE是一个数据高效的框架，增强了接地准确性和报告可靠性。代码可在 https://github.com/PabloMessina/CURE 获取，模型权重可在 https://huggingface.co/pamessina/medgemma-4b-it-cure 获取。

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
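The abstract describes sampling that is adjusted according to model performance so that harder samples are emphasised. A minimal sketch of that general idea is given below; the weighting rule (one minus a per-task score, plus a floor), the task names, and the scores are all illustrative assumptions, not the schedule used by CURE.

```python
import random


def sampling_weights(scores, floor=0.05):
    """Turn per-item scores in [0, 1] into sampling probabilities.

    Lower score means harder item and therefore higher weight; the floor keeps
    already-solved items from disappearing from training entirely.
    """
    raw = [max(1.0 - s, 0.0) + floor for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def draw_batch(items, scores, k, seed=0):
    """Draw k items with replacement, favouring low-scoring (hard) ones."""
    rng = random.Random(seed)
    return rng.choices(items, weights=sampling_weights(scores), k=k)


# Hypothetical tasks with hypothetical validation scores.
tasks = ["phrase_grounding", "grounded_report", "anatomy_grounded_report"]
scores = [0.9, 0.5, 0.1]
weights = sampling_weights(scores)
batch = draw_batch(tasks, scores, k=8)
```

In practice the scores would be refreshed periodically during training, so the sampling distribution follows the model's current errors rather than a fixed difficulty ranking.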


2601.20503 2026-06-09 cs.CV cs.AI 版本更新

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

使用部分标注数据集训练策略的比较评估：FLAIR MRI中白质高信号和卒中病变分割

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

发表机构 * University of Edinburgh（爱丁堡大学）

AI总结本研究系统评估了六种利用部分标注数据训练联合分割白质高信号和缺血性卒中病变模型的策略，发现伪标签法最有效，可提升模型性能并支持大规模临床研究。

详情

AI中文摘要

白质高信号（WMH）和缺血性卒中病变（ISL）是脑小血管疾病（SVD）的关键影像生物标志物，可在磁共振成像（MRI）上检测到。开发稳健的深度学习模型来自动分割和区分这些病理仍然具有挑战性。具体而言，WMH和ISL常在同一受试者中共存，并在液体衰减反转恢复（FLAIR）序列上表现为视觉上混淆的高信号，使其精确勾画复杂化。为了解决完全标注队列稀缺的问题，我们系统评估了六种使用部分标注数据训练联合WMH和ISL分割模型的可行策略。我们汇集了私有和公开数据集，构建了一个包含2,052个MRI体积的大规模队列，其中分别有1,341和1,152个体积包含WMH和ISL的真实标注。我们的分析表明，多种策略有效利用部分标注数据提升整体模型性能，其中伪标签法是最有效的方法。该模型表现出一致的WMH分割策略，并成功检测到大多数FLAIR阳性的ISL。这些发现证明了使用部分标注数据开发可靠自动分割工具的可行性，可支持持续的SVD监测和大规模临床研究中的高通量生物标志物提取。

英文摘要

White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) detectable on magnetic resonance imaging (MRI). The development of robust deep learning models to automatically segment and differentiate these pathologies remains challenging. Specifically, WMH and ISL frequently co-occur within the same subject and present as visually confounding hyperintensities on fluid-attenuated inversion recovery (FLAIR) sequences, complicating their accurate delineation. To address the scarcity of fully annotated cohorts, we systematically evaluated six accessible strategies for training a joint WMH and ISL segmentation model using partially labelled data. We aggregated privately held and publicly available datasets to curate a large-scale cohort of 2,052 MRI volumes, of which 1,341 and 1,152 volumes contained ground truth annotations for WMH and ISL, respectively. Our analysis indicates that multiple strategies effectively leverage partially labelled data to enhance overall model performance, with pseudolabelling emerging as the most effective approach. This model exhibited a consistent WMH segmentation policy and successfully detected the majority of FLAIR-positive ISL. These findings demonstrate the viability of using partially labelled data to develop reliable automated segmentation tools, which can support ongoing SVD monitoring and high-throughput biomarker extraction for large-scale clinical research.


2602.17337 2026-06-09 cs.CV 版本更新

Polaffini: A feature-based approach for robust affine and polyaffine image registration

Polaffini: 一种基于特征的鲁棒仿射和多仿射图像配准方法

Antoine Legouhy, Cosimo Campo, Ross Callaghan, Hojjat Azadbakht, Hui Zhang

发表机构 * Hawkes Institute & Department of Computer Science, University College London, London, UK（英国伦敦，伦敦大学学院霍克斯研究所及计算机科学系）； Institut Pasteur, Université Paris Cité, Unité de Neuroanatomie Appliquée et Théorique（巴斯德研究所，巴黎西岱大学，应用与理论神经解剖学单元）； AINOSTICS ltd., Manchester, UK（AINOSTICS有限公司，曼彻斯特，英国）

AI总结提出Polaffini框架，利用深度学习分割模型提取解剖对应点，通过闭式解实现全局和局部仿射匹配，生成从仿射到多仿射的可调平滑变换，在结构对齐和下游非线性配准初始化上优于传统方法。

Comments associated github repo: https://github.com/CIG-UCL/polaffini

详情

AI中文摘要

在这项工作中，我们提出了Polaffini，一个稳健且通用的解剖学基础配准框架。医学图像配准主要由基于强度的配准方法主导，这些方法依赖于对齐质量的替代度量。相比之下，基于特征的方法通过识别明确的解剖对应点进行操作，理论上更理想，但由于难以可靠地提取特征而已基本失宠。然而，得益于深度学习的近期进展，这些挑战现已在很大程度上被克服，预训练的分割模型能够即时提供可靠、精细的解剖描绘。我们旨在证明这些进展可用于创建新的解剖学基础图像配准算法。为此，我们提出Polaffini，它从这些分割区域中以特别简单的方式获得具有一一对应关系的解剖学基础特征点：提取它们的质心。这些特征点通过闭式解实现高效的全局和局部仿射匹配。这些匹配用于生成从仿射到多仿射的整体变换，并具有可调平滑度。多仿射变换比仿射变换具有更多的自由度，允许更精细的对齐，并且它们在对数-欧几里得框架中的嵌入确保了微分同胚性质。Polaffini既可用于独立配准，也可作为后续非线性配准的预对齐，我们将其与流行的基于强度的配准技术进行了比较评估。结果表明，Polaffini在结构对齐方面优于竞争方法，并为下游非线性配准提供了改进的初始化。Polaffini快速、稳健且准确，使其特别适合集成到医学图像处理流程中。

英文摘要

In this work we present Polaffini, a robust and versatile framework for anatomically grounded registration. Medical image registration is dominated by intensity-based registration methods that rely on surrogate measures of alignment quality. In contrast, feature-based approaches that operate by identifying explicit anatomical correspondences, while more desirable in theory, have largely fallen out of favor due to the challenges of reliably extracting features. However, such challenges are now significantly overcome thanks to recent advances in deep learning, which provide pre-trained segmentation models capable of instantly delivering reliable, fine-grained anatomical delineations. We aim to demonstrate that these advances can be leveraged to create new anatomically-grounded image registration algorithms. To this end, we propose Polaffini, which obtains, from these segmented regions, anatomically grounded feature points with 1-to-1 correspondence in a particularly simple way: extracting their centroids. These enable efficient global and local affine matching via closed-form solutions. Those are used to produce an overall transformation ranging from affine to polyaffine with tunable smoothness. Polyaffine transformations can have many more degrees of freedom than affine ones allowing for finer alignment, and their embedding in the log-Euclidean framework ensures diffeomorphic properties. Polaffini has applications both for standalone registration and as pre-alignment for subsequent non-linear registration, and we evaluate it against popular intensity-based registration techniques. Results demonstrate that Polaffini outperforms competing methods in terms of structural alignment and provides improved initialisation for downstream non-linear registration. Polaffini is fast, robust, and accurate, making it particularly well-suited for integration into medical image processing pipelines.
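The core recipe in the abstract (region centroids as matched feature points, then a closed-form affine fit) can be sketched in a few lines. The version below is a simplified 2D illustration using ordinary least squares; it covers only a single global affine fit, not the local fits or the log-Euclidean polyaffine fusion, and it is not the implementation from the Polaffini repository.

```python
def centroids(label_map):
    """Return {label: (row, col) centroid} for every nonzero label in a 2D grid."""
    sums = {}
    for r, row in enumerate(label_map):
        for c, lab in enumerate(row):
            if lab:
                sr, sc, n = sums.get(lab, (0.0, 0.0, 0))
                sums[lab] = (sr + r, sc + c, n + 1)
    return {lab: (sr / n, sc / n) for lab, (sr, sc, n) in sums.items()}


def fit_affine_2d(src, dst):
    """Least-squares A (2x2) and t such that dst ≈ A·src + t for matched 2D points."""
    n = len(src)
    mx = [sum(p[i] for p in src) / n for i in range(2)]
    my = [sum(q[i] for q in dst) / n for i in range(2)]
    # Accumulate S_xx = Σ x x^T and S_yx = Σ y x^T on mean-centred coordinates.
    sxx = [[0.0, 0.0], [0.0, 0.0]]
    syx = [[0.0, 0.0], [0.0, 0.0]]
    for p, q in zip(src, dst):
        x = [p[0] - mx[0], p[1] - mx[1]]
        y = [q[0] - my[0], q[1] - my[1]]
        for i in range(2):
            for j in range(2):
                sxx[i][j] += x[i] * x[j]
                syx[i][j] += y[i] * x[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear; affine fit is not unique")
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    # Closed form: A = S_yx · S_xx^{-1}, then t = mean(dst) - A · mean(src).
    a = [[sum(syx[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    t = [my[i] - sum(a[i][j] * mx[j] for j in range(2)) for i in range(2)]
    return a, t


# Toy label map with two regions, and a known transform to recover.
toy_centroids = centroids([[1, 1, 0], [0, 0, 2], [0, 0, 2]])
src_pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst_pts = [(1, -1), (3, -1), (1, 2), (3, 2)]  # scale by (2, 3), shift by (1, -1)
A, t = fit_affine_2d(src_pts, dst_pts)
```

Because the correspondences come from shared segmentation labels, no iterative optimisation of an intensity similarity is needed for this step, which is what makes the matching fast.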

2602.22919 2026-06-09 cs.CV 版本更新

Chain of Flow: ECG-Conditioned 4D Cardiac Cine Generation from Patient-Specific Anatomical Anchor

流动链：基于患者特定解剖锚点的ECG条件4D心脏电影生成

Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang

发表机构 * School of Engineering, College of Engineering and Physical Sciences, University of Birmingham（英国伯明翰大学工程学院）； William Harvey Research Institute, NIHR Barts Biomedical Research Centre, Queen Mary University London（伦敦玛丽女王大学威廉·哈维研究所）； Barts Heart Centre, St Bartholomew’s Hospital, Barts Health NHS Trust（巴茨心脏中心，圣巴塞洛缪医院，巴茨健康NHS信托）； Division of Cardiology, Johns Hopkins University School of Medicine（约翰霍普金斯大学医学院心脏病科）

AI总结提出Chain of Flow (COF)框架，利用患者特定MRI和当前ECG生成4D心脏电影，在UK Biobank上实现高图像保真度和下游功能性能。

AI中文摘要

心脏电影磁共振成像（MRI）是功能性心脏评估的核心，然而在分析时可能无法直接获得完整的当前电影序列。我们引入了流动链（COF），这是一个心电图（ECG）条件框架，结合患者特定MRI和当前ECG，用于生成特定于受试者的4D心脏电影。在UK Biobank数据集上，COF在共享同次就诊可评估基准上实现了强图像级保真度和下游功能导向性能。多切片和多分辨率分析表明，在短轴堆叠和异质采集分辨率上，结构生成质量稳定。跨重采样输入MRI相位的受控相位鲁棒性分析进一步提供了同次就诊代理支持，当目标MRI相位未直接观察到时，使用患者特定MRI加当前ECG。跨次就诊路线提供了探索性序列证据，在当前面向感兴趣区域读出中增益最明显。疾病类别功能审计、病例级容积轨迹证据审查进一步描绘了当前患者特定MRI加ECG公式在解剖感知下游心脏分析中保持稳定的情况。代码可在 https://anonymous.4open.science/r/COF-paper-release-C88B 获取。

英文摘要

Cardiac cine magnetic resonance imaging (MRI) is central to functional cardiac assessment, yet a full current cine sequence may not always be directly available at the point of analysis. We introduce Chain of Flow (COF), an electrocardiography (ECG)-conditioned framework that combines patient-specific MRI and current ECG for subject-specific 4D cardiac cine generation. On the UK Biobank dataset, COF achieves strong image-level fidelity and downstream function-oriented performance on a shared same-visit evaluable benchmark. Multi-slice and multi-resolution analyses indicate stable structural generation quality across the short-axis stack and heterogeneous acquisition resolutions. Controlled phase-robustness analyses across resampled input MRI phases further provide same-visit proxy support for patient-specific MRI plus current ECG when a target MRI phase is not directly observed. A cross-visit route provides exploratory serial evidence, with the clearest gains in current-facing region-of-interest readout. Disease-category functional audits, case-level volume-trajectory evidence review further delineate where the current patient-specific MRI plus ECG formulation remains stable for anatomy-aware downstream cardiac analysis. Code is available at https://anonymous.4open.science/r/COF-paper-release-C88B.

2603.24388 2026-06-09 cs.CV 版本更新

Causal Transfer in Medical Image Analysis

医学图像分析中的因果迁移

Mohammed M. Abdelsamea, Daniel Tweneboah Anyimadu, Tasneem Selim, Saif Alzubi, Lei Zhang, Ahmed Karam Eldaly, Xujiong Ye

发表机构 * University of Waterloo（滑铁卢大学）

AI总结本文探讨了医学图像分析中因果迁移方法，整合因果推理与跨域表示学习，以解决领域偏移问题，提升临床AI的鲁棒性和泛化能力。

通过文本和语义定义的分割提示灵活控制3D CT生成

Weicheng Dai, Chenyu Wang, Binxu Li, Shantanu Ghosh, Afrooz Zandifar, Christina LeBedis, Kayhan Batmanghelich

发表机构 * Boston University School of Engineering（波士顿大学工程学院）； Stanford University（斯坦福大学）； University of Pittsburgh Medical Center（匹兹堡大学医学中心）； Boston University School of Medicine（波士顿大学医学院）

AI总结提出一种灵活的多模态框架，通过文本和可选分割提示控制3D CT生成，实现高分辨率、解剖一致且可控的体数据生成。

AI中文摘要

体积医学图像的生成模型在医学成像中有许多应用，从数据增强到作为逆问题的先验。对于这些应用，生成具有强可控性的高分辨率3D图像至关重要，但仍极具挑战性。现有方法通常通过放射学报告作为文本提示或通过完整图像分割来控制生成。基于文本的提示虽然灵活，但对异常的位置、形状和边界的空间控制有限。相比之下，基于分割的方法接收精确的空间指导，但需要全器官标注，具有限制性。在这项工作中，我们提出了一种灵活的多模态框架，用于可控体积图像生成，支持来自放射学报告和分割提示（两者均为可选）的输入。我们的方法允许用户提供特定解剖结构或异常的分割，而无需全器官标注。分割掩膜的语义含义通过附带的文本描述指定，从而形成高度灵活且可扩展的条件机制。我们开发了一种基于改进扩散变换器的内存高效架构，该架构联合处理图像和分割标记。该模型进一步结合了门控注意力，以有效关注长放射学报告。实验表明，我们的方法实现了最先进的感知和语义分数（例如，平均FID相对改进24%），生成高分辨率解剖一致的CT体积，并在用于数据增强时提高了数据效率。放射科医生的评估进一步证实了生成图像与真实医学图像之间的强一致性。

英文摘要

Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists' evaluation further confirms strong alignment between generated and real medical images.

2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV 版本更新

A Vision-language Framework for Comparative Reasoning in Radiology

放射学中比较推理的视觉语言框架

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

发表机构 * University of Science and Technology of China（中国科学技术大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； School of Artificial Intelligence, Shanghai Jiao Tong University（上海交通大学人工智能学院）； Department of Biomedical Informatics, Harvard Medical School（哈佛医学院生物医学信息学系）； Department of Radiology, Renmin Hospital of Wuhan University（武汉大学人民医院放射科）； Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University（上海交通大学附属第六人民医院）

AI总结提出一个实体感知的跨图像推理框架，通过构建大规模比较影像数据集MedReCo-DB和开发MedReCo及MedReCo-VLM模型，实现了参考病例检索和时间比较解读，显著提升了放射学比较推理性能。

AI中文摘要

医学影像人工智能在孤立图像解读方面取得了强劲性能，但仍与放射学实践存在较大差距，因为诊断和随访依赖于对先前研究和类似参考病例的比较。本文我们将放射学比较形式化为一个实体感知的跨图像推理问题，并引入一个支持参考病例检索和时间比较解读的框架。我们构建了MedReCo-DB，这是一个从常规图像-报告对中派生的大规模比较影像资源，包含来自八个机构、四个国家、七种成像模态的超过16万名患者的69万余张图像。报告被分解为解剖结构、异常发现和病理状况，为实体条件检索和比较视觉问答提供监督。利用该资源，我们开发了MedReCo，一个用于可控检索临床类似病例的实体感知视觉编码器，以及MedReCo-VLM，一个用于生成性解读间隔变化的视觉语言扩展。在内部、外部和跨中心评估中，MedReCo在所有12个内部检索设置中实现了最高的Recall@1，并将外部检索平均提高了6.0个百分点。在临床易混淆的鉴别组中，它始终优于最强的基线。MedReCo-VLM在所有比较生成评估中取得了最佳性能，并在胸部X光片上将纵向随访准确性提高了14.5-46.5个百分点，在CT上提高了13.0-27.9个百分点。这些发现表明，实体感知的比较推理可以从常规临床数据中大规模学习，并可能为医学影像AI提供更符合临床的基础。

英文摘要

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

2406.19749 2026-06-09 eess.IV cs.CV 版本更新

SPIRONet: Spatial-Frequency Learning and Graph-based Channel Interaction Network for Vessel Segmentation

SPIRONet：用于血管分割的空间-频率学习与基于图的通道交互网络

De-Xing Huang, Xiao-Hu Zhou, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Mei-Jiang Gui, Hao Li, Tian-Yu Xiang, Bo-Xian Yao, Zeng-Guang Hou

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences（多模态人工智能系统国家重点实验室，自动化研究所，中国科学院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）

AI总结提出SPIRONet，通过双空间-频率编码器、交叉注意力融合和基于图的通道交互模块，解决低信噪比、细小血管和强干扰下的血管分割难题，在五个数据集上取得最优性能。

Comments Accepted by Biomedical Signal Processing and Control. 15 Pages, 9 Figures, 13 Tables

AI中文摘要

自动血管分割在下一代手术机器人介入导航系统的发展中起着关键作用。然而，当前方法在低信噪比、细小或纤细血管以及强干扰等具有挑战性的术中条件下，分割性能仍不理想。本研究提出了一种新颖的空间-频率学习与基于图的通道交互网络（SPIRONet）来解决上述问题。针对低信噪比血管外观和细小或纤细分支，采用了双空间-频率编码器，其中频率编码器捕获受局部噪声波动影响较小的全局血管连续性，而空间编码器保留精细的血管细节。进一步引入了交叉注意力融合模块，以自适应地整合这种互补的空间和频率信息。此外，为了抑制非目标血管和类血管结构的干扰，设计了基于图的通道交互模块来建模通道间的相关性，增强一致的血管相关响应，同时抑制任务无关的激活。在五个具有挑战性的数据集上的大量实验结果表明，与现有方法相比，所提方法取得了有竞争力且持续强劲的性能。例如，在CADSA、CAXF、DCA1、XCAD和ARCADE上，SPIRONet分别比最强竞争方法实现了+0.87%、+0.52%、+0.23%、+1.39%和+2.22%的IoU提升。此外，SPIRONet在512x512输入尺寸下实现了21 FPS的推理速度，满足介入场景（6-12 FPS）的实时要求。这些有希望的结果表明SPIRONet在介入导航系统中集成的潜力。代码可在 https://github.com/Dxhuang-CASIA/SPIRONet 获取。

英文摘要

Automatic vessel segmentation plays a pivotal role in the development of next-generation interventional navigation systems for surgical robotics. However, current approaches still suffer from suboptimal segmentation performance under challenging intraoperative conditions, such as low-signal-to-noise ratio (SNR), small or slender vessels, and strong interference. In this study, a novel spatial-frequency learning and graph-based channel interaction network (SPIRONet) is proposed to address the above issues. To address low-SNR vessel appearance and small or slender branches, dual spatial-frequency encoders are utilized, where the frequency encoder captures global vessel continuity that is less affected by local noise fluctuations, while the spatial encoder preserves fine vessel details. A cross-attention fusion module is further introduced to adaptively integrate this complementary spatial and frequency information. Moreover, to suppress interference from non-target vessels and vessel-like structures, a graph-based channel interaction module is designed to model channel-wise correlations, enhancing consistent vessel-related responses while suppressing task-irrelevant activations. Extensive experimental results on five challenging datasets demonstrate that the proposed method achieves competitive and consistently strong performance compared with existing methods. For example, SPIRONet achieves IoU improvements of +0.87%, +0.52%, +0.23%, +1.39%, and +2.22% over the strongest competing methods on CADSA, CAXF, DCA1, XCAD, and ARCADE, respectively. Moreover, SPIRONet achieves an inference speed of 21 FPS with a 512x512 input size, meeting the real-time requirements of interventional scenarios (6-12 FPS). These promising results indicate SPIRONet's potential for integration into interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.
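
The IoU figures quoted above compare a predicted vessel mask with a reference mask. As a hedged illustration of the metric only (not of SPIRONet itself), the sketch below computes IoU and the closely related Dice score for two binary masks; the function name `dice_and_iou` and the toy masks are assumptions made for this example.

```python
import numpy as np

def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    union = total - inter
    # Convention assumed here: two empty masks count as a perfect match.
    dice = 2.0 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return float(dice), float(iou)

# Toy masks: 2 overlapping pixels, 3 foreground pixels in each mask.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
dice, iou = dice_and_iou(pred_mask, true_mask)
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so a method ranked higher on one is ranked higher on the other for a single image.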

2501.11755 2026-06-09 eess.IV cs.CV 版本更新

A generalizable 3D framework and model for self-supervised learning in medical imaging

一种通用的3D框架和模型用于医学影像中的自监督学习

Tony Xu, Sepehr Hosseini, Chris Anderson, Anthony Rinaldi, Rahul G. Krishnan, Anne L. Martel, Maged Goubran

发表机构 * Department of Medical Biophysics, University of Toronto（多伦多大学医学生物物理学系）； Department of Computer Science, University of Toronto（多伦多大学计算机科学系）； Institute for Aerospace Studies, University of Toronto（多伦多大学航空航天研究所）； Physical Sciences Platform, Sunnybrook Research Institute（新宁研究所物理科学平台）； Vector Institute, Toronto（多伦多向量研究所）； Department of Laboratory Medicine and Pathobiology, University of Toronto（多伦多大学实验室医学与病理学系）； Hurvitz Brain Sciences, Sunnybrook Health Sciences Centre（新宁健康科学中心Hurvitz脑科学）； Harquail Centre for Neuromodulation, Sunnybrook Health Sciences Centre（新宁健康科学中心Harquail神经调制中心）

AI总结本文提出3DINO方法，基于大规模多模态数据集预训练出通用医学影像模型3DINO-ViT，验证其在多种医学影像分割和分类任务中的泛化能力，优于现有方法。

Comments Published in npj Digital Medicine

DOI: 10.1038/s41746-025-02035-w

AI中文摘要

当前3D医学影像自监督学习方法依赖简单的预设任务和特定器官或模态的数据集，限制了其通用性和扩展性。我们提出了3DINO，一种针对3D数据集的先进自监督学习方法，并在包含超过10个器官的10万例3D医学影像扫描的多模态数据集上预训练了3DINO-ViT。我们通过广泛的实验验证了3DINO-ViT在多种医学影像分割和分类任务中的性能。结果表明，3DINO-ViT能够跨模态和器官泛化，包括在分布外任务和数据集上表现优异，在大多数评估指标和标注数据集大小上均优于现有方法。我们的3DINO框架和3DINO-ViT将被公开，以促进3D基础模型的研究或进一步微调用于广泛医学影像应用。

英文摘要

Current self-supervised learning methods for 3D medical imaging rely on simple pretext formulations and organ- or modality-specific datasets, limiting their generalizability and scalability. We present 3DINO, a cutting-edge SSL method adapted to 3D datasets, and use it to pretrain 3DINO-ViT: a general-purpose medical imaging model, on an exceptionally large, multimodal, and multi-organ dataset of ~100,000 3D medical imaging scans from over 10 organs. We validate 3DINO-ViT using extensive experiments on numerous medical imaging segmentation and classification tasks. Our results demonstrate that 3DINO-ViT generalizes across modalities and organs, including out-of-distribution tasks and datasets, outperforming state-of-the-art methods on the majority of evaluation metrics and labeled dataset sizes. Our 3DINO framework and 3DINO-ViT will be made available to enable research on 3D foundation models or further finetuning for a wide range of medical imaging applications.

2511.18493 2026-06-09 eess.IV cs.AI cs.CV 版本更新

SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation

SAGE：适应性组织病理图像分割的形状自适应门控专家

Gia Huy Thai, Hoang-Nguyen Vu, Anh-Minh Phan, Quang-Thinh Ly, Thi-Ngoc-Truc Nguyen, Nhat Ho

发表机构 * University of Science, VNU-HCM（越南国家大学科学学院）； Trivita AI ； University of Technology, VNU-HCM（越南国家大学技术学院）； Michigan State University, USA（美国密歇根州立大学）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结 SAGE通过动态专家路由框架提升异构视觉网络中细胞形态变化的适应性，实现高精度分割与稳健泛化。

Comments Accepted to CVPR 2026 (Findings Track). Project Page: https://oxyzgiahuy.github.io/sage/

AI中文摘要

由于细胞异质性，细胞大小和形状的显著差异仍然是在吉像素全切片图像（WSI）上进行计算机辅助癌症检测的主要障碍。当前的CNN-Transformer混合模型使用静态计算图和固定路由，导致额外计算并难以适应输入变化。我们提出形状自适应门控专家（SAGE），一种输入自适应框架，可在异构视觉网络中实现动态专家路由。SAGE通过带有层次门控的双路径设计，以及协调卷积模块与Transformer模块特征表示的形状适应枢纽（SA-Hub），将静态骨干网络重新配置为动态路由专家架构。以ConvNeXt和Vision Transformer UNet实现的SAGE（SAGE-ConvNeXt+ViT-UNet）在EBHI上达到95.23%的Dice分数，在GlaS Test A和Test B上分别达到92.78%和91.42%的DSC分数，并在DigestPath上达到WSI级别91.26%的DSC分数，同时通过自适应平衡局部细化和全局上下文，在分布偏移下表现出稳健的泛化能力。SAGE为视觉网络中的动态专家路由建立了可扩展的基础，从而促进灵活的视觉推理。项目页面：https://oxyzgiahuy.github.io/sage/

英文摘要

The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Whole Slide Images (WSIs), due to cellular heterogeneity. Current CNN-Transformer hybrids use static computation graphs with fixed routing. This leads to extra computation and makes it harder to adapt to changes in input. We propose Shape-Adapting Gated Experts (SAGE), an input-adaptive framework that enables dynamic expert routing in heterogeneous visual networks. SAGE reconfigures static backbones into dynamically routed expert architectures via a dual-path design with hierarchical gating and a Shape-Adapting Hub (SA-Hub) that harmonizes feature representations across convolutional and transformer modules. Embodied as SAGE with ConvNeXt and Vision Transformer UNet (SAGE-ConvNeXt+ViT-UNet), our model achieves a Dice score of 95.23% on EBHI, DSC scores of 92.78% and 91.42% on GlaS Test A and Test B, respectively, and 91.26% DSC at the WSI level on DigestPath, while exhibiting robust generalization under distribution shifts by adaptively balancing local refinement and global context. SAGE establishes a scalable foundation for dynamic expert routing in visual networks, thereby facilitating flexible visual reasoning. Project page: https://oxyzgiahuy.github.io/sage/

2606.07558 2026-06-09 cs.CV cs.AI cs.DL 新提交

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

基于百年跨度扫描文档档案微调的页面图像分类器，用于进一步的内容特定处理

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

发表机构 * Institute of Formal and Applied Linguistics, Charles University MFF（查理大学数学与物理学院形式与应用语言学研究所）； Institute of Archaeology, Czech Academy of Sciences（捷克科学院考古研究所）

AI总结针对历史文档数字化中手动分类不可行的问题，提出基于视觉内容类型（文本、表格、图形）的自动页面图像分类系统，采用微调深度网络（RegNetY-16GF达99.16%准确率）实现近完美分类，并公开模型、数据集和代码。

Comments 29 pages, 19 figures, 13 tables. arXiv admin note: text overlap with arXiv:2507.21114

AI中文摘要

目的：人文学科的数字化项目产生了大量、异构的历史文档档案，使得手动分类在大规模下不切实际。本工作解决基于视觉内容类型——文本、表格和图形——对扫描页面图像进行分类的自动化系统需求，从而支持内容特定的下游处理，如光学字符识别（OCR）或结构化数据提取。方法：开发了一个图像分类系统，并在来自百年历史的捷克考古档案的超过48,000张带注释的历史页面图像数据集上进行评估，通过四个连续的注释阶段和领域专家审查进行优化。使用手工制作的图像特征建立了随机森林分类器基线。随后，微调并比较了深度学习架构：卷积神经网络（EfficientNetV2、RegNetY）、视觉和文档图像变换器（ViT、DiT）以及多模态CLIP模型。与领域专家合作设计了11类标签方案，并通过五折交叉验证进行评估。结果：基于特征的基线实现了约75%的准确率。微调的CNN和变换器显著优于基线，RegNetY-16GF在保留测试集上达到99.16%的Top-1准确率，ViT-large达到99.12%。CLIP ViT-B/16通过优化文本描述达到99.14%的准确率。结论：仅图像模型，特别是RegNetY-16GF，实现了近乎完美的分类准确率，并在649,508张未标注档案页面上产生一致标签，模型间一致性超过90%。微调的CLIP尽管在测试集上具有竞争力，但在未标注数据上与仅图像模型的一致性低于65%，因此不太适合部署。最终模型、注释数据集和软件均以开源许可证公开提供。

英文摘要

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.
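
The inter-model agreement reported for the unlabeled pages can be read as the share of pages on which two classifiers output the same label. The sketch below shows that calculation under this assumption; the function name `agreement` and the category names are hypothetical and do not come from the paper's 11-category scheme.

```python
def agreement(labels_a, labels_b):
    """Fraction of items on which two models predict the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must label the same pages")
    if not labels_a:
        raise ValueError("at least one page is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical predictions of two models on four unlabeled pages.
model_a = ["TEXT", "TABLE", "PHOTO", "TEXT"]
model_b = ["TEXT", "TABLE", "DRAW", "TEXT"]
rate = agreement(model_a, model_b)  # 3 of 4 pages match
```

Agreement needs no ground truth, which is why it can be measured on the 649,508 unlabeled pages; it indicates consistency between models rather than correctness.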

2606.07661 2026-06-09 cs.CV cs.DL 新提交

PereStruct: Multimodal Semantic Assembly for Robust Historical Document Parsing

PereStruct: 面向鲁棒历史文档解析的多模态语义组装

Maksim Shandybo, Ivan Bespalov, Daniil Yefimov, Marina Kosheleva, Alexander Loukianov

发表机构 * IGIC RAS（俄罗斯科学院信息传输问题研究所）； Yandex Cloud ； National University of Science and Technology MISIS（莫斯科国立钢铁合金学院）； Nekrasov Central Universal Scientific Library（涅克拉索夫中央综合科学图书馆）

AI总结针对历史报纸复杂多栏布局的解析难题，提出结合微调YOLO与语义组装模块的多模态方法，在块到文章映射上F1达0.904，BLEU约0.96，显著优于通用视觉语言模型。

Comments Code and data available at https://github.com/makSShandybo/PereStruct

AI中文摘要

解析具有复杂非标准布局的历史文档仍是大规模档案数字化的基本瓶颈。与现代排版不同，历史报纸存在严重的物理退化和高度不规则的页面结构，即使最先进的视觉语言模型也难以应对，呈现出严重的分布外挑战。我们通过一个专门为解析历史报纸（具有特别复杂多栏布局的文档）设计的自动化流程来弥补这一差距。我们的方法结合了用于布局分析和块检测的微调YOLO架构（在1,426张完全人工标注的扫描页面上训练），以及一个新颖的语义组装模块，该模块通过联合建模基于TF-IDF的词法语义相似性、来自微调YOLO的视觉嵌入以及几何布局约束来重构文章。这种多模态集成实现了最先进的性能，在块到文章映射上取得了0.904的F1分数。值得注意的是，与视觉语言模型（Qwen3.6-35B-A3B和Qwen3.6-Plus）的端到端评估表明，PereStruct实现了显著更高的保真度（BLEU约0.96 vs 0.34），验证了模块化架构在通用VLM难以处理的复杂历史布局上表现出色。为了支持可重复性并推动该领域的研究，我们发布了包含599张标注页面的训练语料库和包含93张页面（具有专家验证的真实块到文章映射）的精选PereStruct基准。该框架为复杂档案材料的高保真数字化和语义重建奠定了坚实基础。

英文摘要

Parsing historical documents with complex, non-standard layouts remains a fundamental bottleneck in large-scale archival digitization. Unlike modern typography, historical newspapers exhibit severe physical degradation and highly irregular page structures that confound even state-of-the-art vision-language models, presenting severe out-of-distribution challenges. We address this gap with an automated pipeline specifically designed for parsing historical newspapers, documents characterized by particularly intricate multi-column layouts. Our approach combines a fine-tuned YOLO architecture for layout analysis and block detection, trained on 1,426 fully human-annotated scanned pages, with a novel semantic assembly module that reconstructs articles by jointly modeling lexical-semantic similarity via TF-IDF, visual embeddings from our fine-tuned YOLO, and geometric layout constraints. This multi-modal integration yields state-of-the-art performance, achieving an F1 score of 0.904 on block-to-article mapping. Notably, end-to-end evaluation against vision-language models (Qwen3.6-35B-A3B and Qwen3.6-Plus) demonstrates that PereStruct achieves substantially higher fidelity (BLEU approximately 0.96 vs 0.34), validating that modular architectures excel where generic VLMs fail on complex historical layouts. To support reproducibility and advance research in this domain, we release both the training corpus of 599 annotated pages and a curated PereStruct benchmark of 93 pages with expert-verified ground-truth block-to-article mappings. This framework establishes a robust foundation for high-fidelity digitization and semantic reconstruction of complex archival materials.
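
Of the three signals the assembly module combines, the lexical one is the easiest to illustrate. The sketch below computes TF-IDF vectors and cosine similarity between text blocks in plain Python; the whitespace tokenizer, the smoothed IDF formula, and the function names are assumptions for this example, and the visual and geometric terms described in the abstract are not modelled.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict token -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many blocks each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up OCR blocks: the first two plausibly belong to the same article.
blocks = ["the flood damaged the bridge",
          "bridge repairs after the flood",
          "football match results"]
vecs = tfidf_vectors(blocks)
same_article = cosine(vecs[0], vecs[1])
other_article = cosine(vecs[0], vecs[2])
```

Blocks that share rare vocabulary score higher, which gives a grouping cue even when column layout alone is ambiguous.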

2606.08858 2026-06-09 cs.CV cs.AI 新提交

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

基于深度神经网络的手写表单智能字符识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA) Offenburg University（奥芬堡大学机器学习与分析研究所（IMLA））

AI总结提出一种通过深度神经网络将检测与分类合并为单一任务的手写字符识别方法，利用人工合成训练数据，在真实考试数据上达到88.28%的识别率。

Comments Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures

DOI: 10.1007/978-3-031-42532-5_6
Journal ref: In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94

AI中文摘要

手写表单的自动处理仍然是一项具有挑战性的任务，其中手写字符的检测和后续分类是关键步骤。我们描述了一种新颖的方法，其中两个步骤——检测和分类——通过深度神经网络在一个任务中执行。因此，训练数据不是手动标注的，而是从基础表单和现有数据集中人工制造的。可以证明，这种单任务方法优于最先进的双任务方法。当前研究专注于手写拉丁字母，并使用EMNIST数据集。然而，该数据集存在局限性，需要进一步定制。最后，在从笔试中获得的真实数据上达到了88.28%的整体识别率。

英文摘要

The automatic processing of handwritten forms remains a challenging task, wherein detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach, in which both steps, detection and classification, are executed in one task through a deep neural network. Therefore, training data is not annotated by hand, but manufactured artificially from the underlying forms and already existing datasets. It can be demonstrated that this single-task approach is superior in comparison to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST data set. However, limitations were identified with this data set, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.09446 2026-06-09 cs.CV 新提交

文档数据提取的视觉模板推断

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, Aditya G. Parameswaran

发表机构 * UC Berkeley（加州大学伯克利分校）

AI总结提出TWIX工具，通过推断文档的视觉模板来高效、低成本地提取结构化数据，在精度和召回率上优于现有方法，并实现大规模数据集上的显著加速和降本。

AI中文摘要

许多模板化文档是根据结构化数据按照视觉模板程序化生成的。这类文档包括发票、税务文件、财务报告和采购订单。从这些文档中有效提取数据对于支持下游分析任务至关重要。当前的数据提取工具通常难以处理复杂的文档布局，在大数据集上会产生高延迟和/或高成本，并且需要大量人力。我们的工具TWIX的关键洞察是推断用于生成此类文档的底层模板，然后提取数据，而不是直接从文档中提取。为此，TWIX首先通过利用字段的一致位置模式（例如，同一模板中的两个字段在多个记录中反复以固定距离共现）来推断底层字段，如表格部分的列或共置键值对中的键。然后，TWIX通过强制视觉约束（例如，对于表格区域，垂直对齐表格行与其列标题；对于键值对，水平对齐键与其值）将这些字段组装成模板。最后，TWIX使用这个推断出的模板以低成本从模板化文档中准确高效地提取数据。在一个包含34个多样化真实世界数据集的基准测试中，TWIX在精度和召回率上比最先进的结构化数据提取工具（Evaporate、Textract和Azure Document Intelligence）以及基于视觉的大语言模型（如GPT-4-Vision）高出25%以上。另一个包含30个大数据集的基准测试展示了TWIX的可扩展性：对于从超过2000页的大型文档集合中提取数据，它比最具竞争力的对比工具快520倍，便宜3786倍。

英文摘要

Many templatized documents are programmatically generated from structured data following a visual template. Such documents include invoices, tax documents, financial reports, and purchase orders. Effective data extraction from these documents is crucial to support downstream analytical tasks. Current data extraction tools often struggle with complex document layouts, incur high latency and/or cost on large datasets, and require significant human effort. The key insight of our tool, TWIX, is to infer the underlying template used to create such documents, and then extract the data, rather than extracting directly from documents. To do so, TWIX first infers the underlying fields, such as columns of tabular portions or keys in co-located key-value pairs, by leveraging their consistent location patterns (e.g., two fields in the same template repeatedly co-occur within a fixed distance apart across multiple records). TWIX then assembles these fields into a template by enforcing visual constraints, such as vertically aligning table rows with their column headers for tabular regions, and horizontally aligning keys with their values for key-value pairs. TWIX then uses this inferred template to accurately and efficiently extract data from templatized documents at a low cost. On one benchmark with 34 diverse real-world datasets, TWIX outperforms state-of-the-art structured data extraction tools (Evaporate, Textract, and Azure Document Intelligence), and vision-based LLMs like GPT-4-Vision, by over 25% in precision and recall. Another benchmark with 30 large datasets demonstrates TWIX's scalability: it is 520X faster and 3,786X cheaper than the most competitive compared tool, for extracting data from large document collections with over 2000 pages.
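
The template-first idea can be sketched in a heavily simplified form: phrases that recur across most records are treated as template fields, and each field is paired with the nearest non-field phrase to its right on the same line. This is an illustration under stated assumptions, not TWIX's algorithm; the record format `(text, x, y)`, the thresholds, and both function names are hypothetical.

```python
from collections import Counter

def infer_fields(records, min_support=0.8):
    """Phrases present in at least min_support of the records are fields."""
    counts = Counter()
    for rec in records:
        counts.update({text for text, _x, _y in rec})
    need = min_support * len(records)
    return {text for text, n in counts.items() if n >= need}

def extract_key_values(record, fields, y_tol=2.0):
    """Pair each field with the closest value to its right on the same line."""
    out = {}
    for key, kx, ky in record:
        if key not in fields:
            continue
        candidates = [(x - kx, text) for text, x, y in record
                      if text not in fields and abs(y - ky) <= y_tol and x > kx]
        if candidates:
            out[key] = min(candidates)[1]  # smallest horizontal gap wins
    return out

# Two made-up records produced from the same visual template.
rec1 = [("Invoice No", 10, 10), ("A-17", 80, 10),
        ("Total", 10, 30), ("42.00", 80, 30)]
rec2 = [("Invoice No", 10, 10), ("B-03", 80, 11),
        ("Total", 10, 30), ("13.50", 80, 30)]
fields = infer_fields([rec1, rec2])
row1 = extract_key_values(rec1, fields)
row2 = extract_key_values(rec2, fields)
```

Once the fields are known, extraction reduces to cheap geometric lookups per record, which is consistent with the speed and cost advantages the abstract reports over per-page model calls.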

2606.07985 2026-06-09 cs.CV cs.CL 新提交

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

FMRFusion: 面向异质图像融合的频率感知多视图表示学习

Tao Zhoua, Yunlong Liu, Qinghui Chen, Zekai Zhang, Minlong Sun, Changlin Biana, Dagang Li, Wenmin Wang, Jinglin Zhang

发表机构 * Shandong University（山东大学）； Macau University of Science and Technology（澳门科技大学）

AI总结提出FMRFusion网络，通过多尺度结构感知模块、双线性频率分解和跨视图互补交互，结合流匹配优化，实现红外与可见光图像融合，在夜间场景表现优异。

AI中文摘要

红外与可见光图像融合旨在生成保留重要目标信息和详细纹理的复合图像，整合两种异质模态。以往的图像融合方法通常采用单模块堆叠方式从两种模态中提取特征，然而这些方法可能导致对其独特特征的学习不完整，从而限制融合效果并在真实异质数据场景中降低鲁棒性。为解决这些问题，我们提出FMRFusion，一种用于异质图像融合的频率感知多视图表示学习网络。引入多尺度结构感知模块以有效捕捉判别性结构，提取细粒度局部结构和关键上下文信息。采用双线性频率分解机制将特征分离为高频和低频分量，实现不同频率域中局部细节和全局表示的联合建模。此外，融入跨视图互补交互以显式建模和融合反射光信息与辐射强度响应之间的互补特性，促进有效的跨视图交互。我们通过流匹配进一步改善融合结果的质量，通过学习从粗数据到高质量表示的变换逐步细化融合特征。在多个基准数据集上进行的大量实验表明，FMRFusion在一系列融合任务中实现了优越且一致的性能，尤其在夜间场景中表现突出。

英文摘要

Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting the fusion effectiveness and constraining robustness in real-world heterogeneous data scenarios. To address these challenges, we propose FMRFusion, a frequency-aware multi-view representation learning network for heterogeneous image fusion. A Multi-Scale Structural Perception Module is introduced to effectively capture discriminative structures, extracting fine-grained local structures and essential contextual information. A bilinear frequency decomposition mechanism is employed to separate features into high-frequency and low-frequency components, enabling joint modeling of local details and global representations across different frequency domains. Moreover, a Cross-View Complementary Interaction is incorporated to explicitly model and fuse the complementary characteristics between reflected light information and radiative intensity responses, facilitating effective cross-view interaction. We further improve the performance of the fused results by flow matching, which progressively refines the fused features by learning the transformation from coarse data to high-quality representations. Extensive experiments conducted on multiple benchmark datasets demonstrate that FMRFusion achieves superior and consistent performance across a range of fusion tasks, especially in nighttime scenarios.

2606.08324 2026-06-09 cs.CV cs.AI 新提交

Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging

基于集合的Transformer用于远距离长波红外高光谱成像中的大气补偿

Fabian Perez, Nicolas Quintero, Jeferson Acevedo, Hoover Rueda-Chacon

发表机构 * Department of Computer Science, Universidad Industrial de Santander Bucaramanga（桑坦德工业大学计算机科学系，布卡拉曼加）

AI总结提出一种轻量级基于集合的深度学习框架，利用不同距离的辐射测量联合估计透射率、大气路径辐射和下行辐射，在MODTRAN数据集上实现低光谱畸变。

Comments IGARSS 2026 accepted paper conference

AI中文摘要

被动长波红外（LWIR）高光谱成像在远距离几何下依赖于大气吸收和发射以及反射辐射，因此大气补偿对于获取目标信息至关重要。尽管其重要性，但由于实际和建模困难，这一补偿在很大程度上被忽视。在本文中，我们提出了一种轻量级基于集合的深度学习框架，该框架将不同远距离范围收集的多个辐射测量作为输入，并联合估计透射率、大气路径辐射和共享的下行辐射光谱。我们使用稀疏自编码器分析学习到的表示，并观察到尽管缺乏位置监督，几个潜在特征确实在测试数据的地理一致子集上激活。在MODTRAN生成的远距离LWIR数据集上的实验表明，所有估计产品的光谱畸变较低。数据集和代码公开于：https://factral.co/SAE-LWIR/

英文摘要

Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well as reflected radiance, thus making atmospheric compensation essential to get knowledge of a target of interest. Despite its importance, this compensation has been largely overlooked due to its practical and modeling difficulty. In this paper, we present a lightweight set-based deep learning framework that takes multiple radiance measurements, collected at different standoff ranges, as input and jointly estimates transmittance, atmospheric path radiance, and a shared downwelling spectrum. We analyze the learned representation with a sparse autoencoder and observe that several latent features do activate on geographically coherent subsets of the test data despite the absence of location supervision. Experiments on a MODTRAN-generated standoff LWIR dataset demonstrate low spectral distortion across all estimated products. The dataset and code are publicly available at: https://factral.co/SAE-LWIR/
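
For orientation, the three quantities the framework estimates appear together in a commonly used LWIR at-sensor radiance model. The form below is that standard textbook formulation, written for measurement i of N standoff ranges; the notation is an assumption for this note and is not taken from the paper.

```latex
% Assumed notation: \tau_i = transmittance and L_{p,i} = path radiance at range i,
% L_d = downwelling radiance shared by all ranges, \varepsilon = surface emissivity,
% B(\lambda, T) = Planck blackbody radiance at surface temperature T.
L_i(\lambda) = \tau_i(\lambda)\left[\varepsilon(\lambda)\,B(\lambda, T)
  + \bigl(1 - \varepsilon(\lambda)\bigr)\,L_d(\lambda)\right] + L_{p,i}(\lambda),
  \qquad i = 1, \dots, N
```

Under this model the transmittance and path radiance change with range while the downwelling term does not, which is one way to read the abstract's choice of a set of measurements sharing a single downwelling spectrum.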

2606.08535 2026-06-09 cs.CV 新提交

NGram-MoSE: Efficient Remote Sensing Super-Resolution via N-Gram Context and Mixture-of-Experts

NGram-MoSE：基于N-Gram上下文和混合专家模型的高效遥感超分辨率

Yun-Hsuan Huang, Trong-An Bui, Chih-Hung Chuang

发表机构 * National Science and Technology Council (NSTC), Taiwan（台湾国家科学与技术委员会）

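AI summary: Proposes NGram-MoSE, a lightweight Transformer architecture that strengthens local consistency through N-Gram context injection and reduces computation through a sparsely activated Mixture-of-Experts feed-forward design, giving efficient and robust texture reconstruction for remote sensing super-resolution.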

AI中文摘要

环境监测和灾害管理的遥感应用经常受到时空权衡的限制：具有精细空间细节的图像通常获取频率较低，而时间上更可用的观测通常更粗糙。单图像超分辨率提供了一种实用的方法，可以在不改变获取计划的情况下增强粗糙图像，然而许多基于Transformer的SR模型仍然计算成本高昂，并且可能对有限或地理偏倚的训练数据敏感，这降低了在分布外条件下的鲁棒性。本文提出了NGram-MoSE，一种轻量级Transformer架构，旨在提高效率和纹理连续性。NGram-MoSE引入了N-Gram上下文注入以增强跨窗口局部一致性并减轻窗口边界伪影，并采用了混合专家（MoE）前馈设计，通过稀疏激活扩展容量而不成比例地增加推理成本。在地理上不相交的OOD测试集上的实验表明，NGram-MoSE实现了31.68 dB的PSNR，同时相对于重型Transformer参考模型将FLOPs减少了14倍。在滑坡分割基准上的下游评估进一步表明，将退化的输入恢复到检测器训练尺度可提高性能，在mAP@50上比双三次上采样绝对提高了4.47%，并且在尺度外推下表现出更强的跨尺度一致性。这些结果表明，NGram-MoSE为需要鲁棒泛化的资源受限遥感流水线提供了一个有效的SR模块。

英文摘要

Remote sensing applications for environmental monitoring and disaster management are frequently constrained by a spatial-temporal trade-off: imagery with fine spatial detail is often acquired less frequently, whereas more temporally available observations are typically coarser. Single-image super-resolution provides a practical means to enhance coarse imagery without changing acquisition schedules, yet many Transformer-based SR models remain computationally expensive and can be sensitive to limited or geographically biased training data, which degrades robustness under out-of-distribution conditions. This paper presents NGram-MoSE, a lightweight Transformer architecture designed to improve both efficiency and texture continuity. NGram-MoSE introduces N-Gram Context Injection to strengthen cross-window local consistency and mitigate window-boundary artifacts, and incorporates a Mixture-of-Experts (MoE) feed-forward design to scale capacity through sparse activation without proportional growth in inference cost. Experiments on a geographically disjoint OOD test set show that NGram-MoSE achieves 31.68 dB PSNR while reducing FLOPs by 14× relative to a heavyweight Transformer reference. Downstream evaluation on a landslide segmentation benchmark further demonstrates that restoring degraded inputs to the detector training scale improves performance, yielding a 4.47% absolute gain in mAP@50 over bicubic upsampling, and exhibits stronger cross-scale consistency under scale extrapolation. These results indicate that NGram-MoSE provides an effective SR module for resource-constrained remote sensing pipelines requiring robust generalization.
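
The idea of scaling capacity through sparse activation can be illustrated with a generic top-k Mixture-of-Experts feed-forward layer. This is a minimal NumPy sketch of standard top-k routing, not the NGram-MoSE implementation; the name `moe_ffn`, the ReLU experts, and all shapes are assumptions made for illustration.

```python
import numpy as np

def moe_ffn(x, gate_w, experts, k=1):
    """Sparse MoE feed-forward layer.

    x:       (tokens, d) input features
    gate_w:  (d, n_experts) router weights
    experts: list of (w1, w2) pairs, w1: (d, hidden), w2: (hidden, d)
    Each token is processed only by its k highest-scoring experts.
    """
    logits = x @ gate_w
    topk = np.argsort(-logits, axis=1)[:, :k]            # chosen expert ids per token
    sel = np.take_along_axis(logits, topk, axis=1)
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over the chosen experts
    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        for slot in range(k):
            mask = topk[:, slot] == e                    # tokens routed to expert e
            if mask.any():
                hidden = np.maximum(x[mask] @ w1, 0.0)   # ReLU feed-forward
                out[mask] += weights[mask, slot][:, None] * (hidden @ w2)
    return out, topk
```

With k fixed, adding experts increases the parameter count while the per-token compute stays roughly constant, which is the property the abstract refers to as growth in capacity without proportional growth in inference cost.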

2606.09029 2026-06-09 cs.CV 新提交

Frequency Decoupled Framework for Screen Content Image Super-Resolution

面向屏幕内容图像超分辨率的频率解耦框架

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng

发表机构 * Anhui University（安徽大学）

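AI summary: Proposes a frequency decoupled framework (FDF) that uses amplitude-phase factorization and bespoke implicit representations to jointly exploit periodic patterns and coherent context for screen content image super-resolution, reaching state-of-the-art performance on several datasets.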

Comments 13 pages, 11 figures

AI中文摘要

基于隐式神经表示的方法在屏幕内容图像超分辨率（SCISR）中表现出优越性能。然而，它们忽略了固有的频率特性，导致性能次优。我们提出一种频率解耦框架（FDF），从相量角度重新思考SCISR，通过捕获振幅中的结构化能量和相位中的关系连续性，并利用定制的隐式表示联合利用它们，以忠实恢复屏幕内容图像（SCI）的规则纹理和全局配置。振幅-相位分解网络（APFN）首先将图像分离为振幅和相位流，其中振幅聚类模块（ACM）将稀疏但高能量的振幅响应组织成代表性原型以提取周期模式，而相位一致性自注意力（PCSA）通过连续一致性传播逐步增强配置。振荡-非谐隐式拟合网络（OAIF-Net）集成周期性和连贯隐式表示，以有效利用SCI中嵌入的周期模式和连贯上下文。实验结果表明，FDF在四个公共SCI数据集上的多个尺度上实现了最先进的SCISR性能。消融实验进一步证明了每个组件在提取和利用周期模式与连贯上下文方面的有效性。

英文摘要

Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR). However, they overlook the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective by capturing structured energy in amplitude and relational continuity in phase, and jointly exploiting them with bespoke implicit representations to faithfully recover the regular textures and global configuration of Screen Content Images (SCIs). The Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams, where the Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. The Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) then integrates periodic and coherent implicit representations for efficient exploitation of the periodic patterns and coherent context embedded in SCIs. Experimental results show that FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.
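
The amplitude/phase split the abstract builds on can be sketched with a plain 2D Fourier transform. The snippet assumes the decomposition is the usual magnitude and angle of the FFT; the paper's APFN is a learned network and may differ, and the function names here are hypothetical.

```python
import numpy as np

def amplitude_phase(img):
    """Split an image into Fourier amplitude and phase."""
    spectrum = np.fft.fft2(img)
    return np.abs(spectrum), np.angle(spectrum)

def recombine(amplitude, phase):
    """Rebuild the image from its amplitude and phase streams."""
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * phase)))

# A periodic stripe pattern, similar in spirit to regular screen-content texture:
# 4 cycles across a 32-pixel width.
cols = np.arange(32)
stripes = np.tile(np.sin(2 * np.pi * 4 * cols / 32), (32, 1))
amp, pha = amplitude_phase(stripes)
peak_count = int((amp > 1.0).sum())   # number of high-energy frequency bins
```

For this stripe image almost all of the energy sits in two frequency bins, which illustrates the "sparse yet high-energy amplitude responses" that periodic patterns produce.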

2606.09110 2026-06-09 cs.CV 新提交

HDRAgent: An Agentic Framework for Multi-Exposure HDR Imaging

HDRAgent: 一种用于多曝光HDR成像的智能体框架

Weiyu Zhou, Tao Hu, Yijian Wang, Xiaogang Xu, Ruixing Wang, Qingsen Yan

发表机构 * School of Computer Science, Northwestern Polytechnical University（西北工业大学计算机学院）； Shenzhen Research Institute, Northwestern Polytechnical University（西北工业大学深圳研究院）； Zhejiang University（浙江大学）； Camera Group, DJI（大疆相机部门）

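AI summary: Proposes HDRAgent, the first agent-driven HDR imaging framework, which adaptively selects reconstruction strategies through fine-grained contextual knowledge matching, a perception-distortion feedback mechanism, and an agent-guided generative alignment strategy, reducing ghosting and local artifacts in complex dynamic scenes.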

AI中文摘要

大多数现有的多曝光HDR方法遵循固定的前馈重建范式，使其在复杂动态场景中容易产生鬼影伪影。为了解决这个问题，我们提出了HDRAgent，这是第一个用于HDR成像的智能体驱动框架，它根据当前场景条件自适应地选择重建策略。具体来说，为了提供场景特定的先验知识，我们引入了一个细粒度上下文知识匹配（FCM）模块。该模块利用多模态大语言模型（MLLM）衍生的场景感知来检索相关的历史案例和工具知识，并将它们组织成结构化证据，用于基于MLLM的自适应工具调度。此外，我们提出了一种感知-失真反馈机制，将执行后的质量评估和伪影诊断转化为结构化反馈，并累积到历史记忆中，以帮助后续的上下文知识细化和策略选择。此外，考虑到极端运动可能使对齐方法失效，我们设计了一种智能体引导的生成对齐策略，该策略使用基于MLLM的动态区域解析，在参考帧引导下重建非参考帧中的不可靠内容。实验表明，HDRAgent有效减少了鬼影和局部伪影，同时实现了具有竞争力或更优的客观性能和视觉质量。

英文摘要

Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception--distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.

2606.09250 2026-06-09 cs.CV 新提交

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

LiteVSR: 冻结扩散变换器的轻量级自适应用于视频超分辨率

Yu Cao, Ziquan Liu, Zhensong Zhang, Jiankang Deng, Shaogang Gong, Jifei Song

发表机构 * University of Science and Technology of China（中国科学技术大学）

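AI summary: Proposes LiteVSR, which builds on the flow-matching principle and performs video super-resolution on a frozen Diffusion Transformer through a lightweight State-Aware Adapter, requiring only 11.25% trainable parameters and 12 GPU-hours of training.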

AI中文摘要

将大规模预训练视频生成器适应于新领域的视频超分辨率（VSR）在计算上仍然昂贵。将生成重新表述为直接从低质量到高质量映射的方法偏离了原始生成形式，需要大量微调。ControlNet风格的适配器在现代扩散变换器下失去效率，因为缺少编码器-解码器层次结构迫使复制整个骨干网络。我们观察到流匹配为跨域VSR适应提供了一种原则性替代方案。通过预测所有时间步上的恒定速度场，适应任务简化为学习固定的注入模式，而不是时变变换。基于这一见解，我们提出了LiteVSR，一个极简框架，使用完全冻结的扩散变换器和轻量级状态感知适配器执行VSR。该适配器采用双流架构，从低质量输入中提取静态结构线索，从中间去噪状态中提取动态线索，通过时间依赖的交叉注意力对齐它们，使得随着去噪进行，从结构对齐到纹理细化的自适应过渡成为可能。LiteVSR在仅使用11.25%可训练参数和单个A100上12 GPU小时的训练下实现了有竞争力的恢复质量，同时保持了快速采样（低至单步）的兼容性。

英文摘要

Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers since the absence of encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while maintaining fast sampling (down to a single step) compatibility.
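
The remark about a constant velocity field can be made concrete with the linear-path form of flow matching, where x_t = (1 − t)·x0 + t·x1 and the regression target x1 − x0 does not depend on t. The sketch assumes this standard formulation; the paper's exact parameterization is not stated in the abstract, and the function names are hypothetical.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Flow-matching regression target; identical at every timestep t."""
    return x1 - x0

def euler_sample(x0, velocity_fn, steps=1):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

Because the ideal velocity is the same at every timestep, a single Euler step with a perfect predictor already lands on x1, which is consistent with the abstract's note that sampling remains compatible with very few steps.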

2606.09516 2026-06-09 cs.CV 新提交

SwiftVR: Real-Time One-Step Generative Video Restoration

SwiftVR：实时一步生成式视频恢复

Jiaqi Yan, Xiangyu Chen, Xinlin Zhong, Haibin Huang, Chi Zhang, Jie Liu, Jiantao Zhou, Xuelong Li

发表机构 * State Key Laboratory of Internet of Things for Smart City, Department of Computer and Information Science, University of Macau（澳门大学智慧城市物联网国家重点实验室）； Institute of Artificial Intelligence (TeleAI), China Telecom（中国电信人工智能研究院（TeleAI））； State Key Laboratory for Novel Software Technology, Nanjing University（南京大学计算机软件新技术国家重点实验室）

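AI summary: Proposes SwiftVR, a streaming one-step generative video restoration framework that achieves real-time high-definition video restoration on consumer-grade GPUs through a causal chunk-wise protocol, mask-free shifted-window self-attention, and a lightweight Restoration-aware Autoencoder.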

AI中文摘要

实时视频恢复（VR）用于直播流，需要在严格的每帧延迟约束下输出高分辨率结果。现有的一步扩散式VR模型由于两个主要瓶颈难以部署在消费级GPU上：高分辨率下的二次空间注意力以及大型视频自编码器的延迟-内存开销。我们提出SwiftVR，一种流式一步生成式VR框架，在因果分块协议下减少这两个瓶颈。对于注意力，无掩码移位窗口自注意力通过确定性索引将每个空间窗口聚合成密集张量，所有注意力调用都在密集缩放点积注意力路径上，无需掩码、循环移位、填充或硬件特定的稀疏核。由于SwiftVR仅使用标准密集SDPA调用，训练好的模型无需重新训练或自定义核即可迁移到消费级GPU。对于自编码，轻量级恢复感知自编码器在保持重建质量的同时实现快速分块解码。在单个H100上，SwiftVR在2560x1440分辨率下维持31 FPS，在3840x2160下维持14 FPS，而所有对比的扩散式VR基线在4K下均超出内存限制。在消费级RTX 5090上，SwiftVR在1920x1080下达到26 FPS。据我们所知，SwiftVR是首个在消费级GPU上实现实时1080p流媒体的生成式VR模型，同时以更低的推理成本获得强大的无参考感知质量。项目地址：https://h-oliday.github.io/SwiftVR。

英文摘要

Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. Project is available at https://h-oliday.github.io/SwiftVR.
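
Gathering each spatial window into a dense tensor by indexing, then running ordinary dense attention per window, can be sketched as follows. This is a simplified NumPy illustration of plain (unshifted) window attention with identity projections; it does not reproduce SwiftVR's shifted-window scheme, and `window_index` and `windowed_attention` are hypothetical names.

```python
import numpy as np

def window_index(h, w, win):
    """Indices of shape (num_windows, win*win) into a flattened h*w token grid."""
    grid = np.arange(h * w).reshape(h, w)
    blocks = grid.reshape(h // win, win, w // win, win).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, win * win)

def windowed_attention(x, idx):
    """Dense softmax attention inside each window.

    x: (h*w, d) tokens; idx: output of window_index. No masks are needed
    because every window is a fully dense (win*win, d) block.
    """
    q = k = v = x[idx]                                   # (num_windows, win*win, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = np.empty_like(x)
    out[idx] = attn @ v                                  # scatter back to grid order
    return out
```

For N tokens and windows of w tokens each, this costs on the order of N·w score entries instead of N² for global attention, which is the quadratic bottleneck the abstract mentions.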

2606.09608 2026-06-09 cs.CV 新提交

TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution

TUDSR: 用于更高超分辨率的两次上采样扩散

Zhiqiang Wu, Yitong Dong, Xian Wei

发表机构 * East China Normal University（华东师范大学）； Zhejiang University（浙江大学）

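AI summary: Proposes the TUDSR framework, which combines two-stage training (at R-resolution and NR-resolution) with a looped chunk-based strategy to achieve 1024² and 2048² image super-resolution on top of SD2.1-base, significantly outperforming existing methods.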

AI中文摘要

基于扩散的生成模型在真实世界图像超分辨率（SR）中取得了显著成功。通过分块扩散技术，这些模型可以生成超出其原生支持分辨率的高分辨率图像。然而，这种高分辨率（例如2048²）输出的质量通常仍然极差，主要归因于我们考虑的两个因素：图像上采样比率（例如×8）超过模型原生支持的上采样比率（例如×4），以及模型的原生支持分辨率。在实践中，训练原生高分辨率模型需要更大的架构，这会导致显著的计算开销和GPU内存成本，使其在资源有限的设备上难以实现。因此，我们提出了TUDSR，一种用于更高超分辨率的两次上采样扩散框架。TUDSR框架主要包括两个阶段：第一阶段在R分辨率下训练，第二阶段引入基于循环分块的训练策略在NR分辨率下训练。每个阶段采用包含生成器和判别器的单步GAN架构。基于SD2.1-base，我们开发了TUDSR-S，在多个基准测试中取得了最先进的性能。大量实验进一步表明，TUDSR-S在1024²甚至2048²分辨率下生成高质量图像，显著优于现有方法。代码可在https://github.com/wuer5/TUDSR获取。

英文摘要

Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., 2048²) outputs often remains extremely poor, which we attribute primarily to two factors: the image upsampling ratio (e.g., ×8) exceeding the model's native-supported upsampling ratio (e.g., ×4), and the limited native-supported resolution of the model. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making it impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at R-resolution, and the second introduces a looped chunk-based training strategy at NR-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at resolutions of 1024² and even 2048², significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.

2606.09792 2026-06-09 cs.CV 新提交

End-to-End Optimization of Incoherent Imaging for Classification Under Detector-Limited Readout

探测器有限读出下非相干成像分类的端到端优化

Archer Wang, Joshua Chen, Sachin Vaidya, Marin Soljačić

发表机构 * Research Laboratory of Electronics, Massachusetts Institute of Technology（麻省理工学院电子研究实验室）； Department of Physics, Massachusetts Institute of Technology（麻省理工学院物理系）

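AI summary: For detector-limited readout, end-to-end optimization of a phase mask improves incoherent-imaging classification; the paper proves an information-theoretic ceiling under full readout, where joint optimization yields no empirical gain, and shows substantial improvement under limited readout through increased class separability.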

AI中文摘要

光学前端（如超表面）和神经网络后端的端到端联合优化已广泛应用于成像任务，但缺乏一个形式化框架来描述此类系统何时以及为何优于传统透镜成像。本文聚焦于分类这一核心成像任务，探究端到端优化非相干成像相位掩模何时能提升性能。我们发现，这些增益主要出现在探测器读出受限的情况下，而在全读出下则有限。在后一种情况下，我们证明没有非相干相位掩模能超过探测器测量与类别标签之间的理想信道互信息；传统聚焦透镜接近这一上限，联合优化无实证增益。当探测器读出受限时（通过粗空间采样或有限测量次数），优化光学系统可通过增加探测器测量中的类可分性来显著提升分类性能。这些增益在低探测器噪声下最大，并随噪声增大而减小，因为光学系统在信号到达探测器前塑造信号，但无法去除之后添加的噪声。该优势还取决于任务的光谱结构：当类别判别内容集中在比类内变化更低的空间频率时，协同设计帮助最大。我们开发了一个理论框架来形式化这些区别，并在合成数据和标准基准（MNIST、FashionMNIST、SVHN）上测试其预测。

英文摘要

End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).
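
The measurement pipeline discussed in the abstract (phase mask, incoherent image formation, constrained readout) can be sketched with standard Fourier optics: the incoherent point spread function is the squared magnitude of the Fourier transform of the pupil function, and the detector image is the scene intensity convolved with that PSF. The code below is a generic illustration under those textbook assumptions, with circular convolution and simple pixel binning standing in for coarse spatial sampling; none of the names come from the paper.

```python
import numpy as np

def incoherent_psf(phase_mask, aperture):
    """Normalized incoherent PSF of a pupil with the given phase profile."""
    pupil = aperture * np.exp(1j * phase_mask)
    field = np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil)))
    psf = np.abs(field) ** 2
    return psf / psf.sum()

def image_through(scene, psf):
    """Incoherent imaging: intensity convolved with the PSF (circular, via FFT)."""
    otf = np.fft.fft2(np.fft.ifftshift(psf))
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * otf))

def coarse_readout(img, factor):
    """Constrained readout: sum factor x factor blocks into one detector pixel."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))
```

In an end-to-end setup, `phase_mask` would be the trainable optical parameter and a classifier would consume the output of `coarse_readout`; the optics can reshape the signal before binning but cannot undo noise added after it, as the abstract points out.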

2606.07675 2026-06-09 eess.IV cs.CV cs.LG 交叉投稿

The Need for Neural ISP in the Small-Pixel Era: How Shrinking Pixels Push Optics to the Limit and Neural Restoration Pushes Back

小像素时代对神经ISP的需求：像素缩小将光学推向极限，神经恢复则逆势而上

Jingxi Li, Neerja Aggarwal, Laurent Gudemann, Shivansh Rao, Vishal Vinod, Tom E. Bishop, Ziv Attar

发表机构 * Glass Imaging Inc（玻璃成像公司）

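AI summary: Addresses optical aberrations that limit resolution in small-pixel smartphone telephoto modules by restoring images with a learning-based Neural ISP, which yields a 2.5-3x resolution improvement at 0.35 micron pixel pitch and suggests that Neural ISPs can stand in for increasingly complex optical designs.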

AI中文摘要

智能手机长焦摄像头正接近“长焦物理墙”：随着像素间距缩小至亚0.5微米，光学系统仍受几何像差限制，导致分辨率收益递减。传统图像信号处理器（ISP）无法消除这些像差，因为它们通过局部、分阶段处理运行，没有明确的点扩散函数（PSF）模型。我们展示了基于学习的神经ISP用于图像恢复，通过训练底层退化，逆转了分阶段流水线无法处理的问题，将小像素设计转化为净优势。我们通过一个代表性长焦模块的受控模拟进行研究，评估了五种配置（0.35--0.75微米像素间距）。光圈按比例缩放以保持每像素信噪比和衍射光斑尺寸固定，从而隔离几何像差和空间采样。传统ISP随像素减小仅适度改进，而神经ISP显著扩展：在0.35微米时，其MTF50（垂直）达到745 cycles/mm，比传统ISP分辨率提升2.5-3倍，LPIPS从0.244显著改善至0.151，而传统结果保持相对平坦。在低信噪比扩展中（0.35微米下每帧15 dB突发），多帧神经ISP恢复的性能接近亮光单帧基线，而多帧传统ISP没有显示出有意义的改进——表明小像素下的传统流水线受限于未校正的PSF模糊而非噪声。这些结果指向一种设计理念：神经ISP通过校正残余光学像差而非要求日益复杂的光学系统，实现高分辨率长焦模块。

英文摘要

Smartphone telephoto cameras are approaching a "telephoto physics wall": as pixel pitches shrink toward sub-0.5 micron, the optics remain limited by geometric aberrations, leading to diminishing returns on resolution. Traditional Image Signal Processors (ISPs) cannot eliminate these aberrations, because they operate through local, stage-wise processing with no explicit model of the underlying point spread function (PSF). We demonstrate how a learning-based Neural ISP for image restoration, trained on the underlying degradations, inverts what stage-wise pipelines cannot, turning small-pixel designs into a net advantage. We investigate this through a controlled simulation of a representative telephoto module, evaluating five configurations (0.35-0.75 micron pixel pitch). The aperture is scaled proportionally to keep per-pixel SNR and diffraction spot size fixed, thereby isolating geometric aberration and spatial sampling. While the traditional ISP improves only modestly with smaller pixels, the Neural ISP scales substantially: at 0.35 micron it reaches 745 cycles/mm MTF50 (vertical), a 2.5-3x resolution improvement over the traditional ISP, and LPIPS improves significantly from 0.244 to 0.151 while traditional results stay comparatively flat. In a low-SNR extension (15 dB per-frame bursts at 0.35 micron), a multi-frame Neural ISP recovers performance close to the bright-light single-frame baseline, whereas a multi-frame traditional ISP shows no meaningful improvement -- indicating that traditional pipelines at small pixels are bottlenecked by uncorrected PSF blur rather than by noise. These results point to a design philosophy in which Neural ISPs enable high-resolution telephoto modules by correcting residual optical aberrations rather than requiring increasingly complex optics.
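
To put the reported 745 cycles/mm in context, the sensor's Nyquist frequency follows directly from the pixel pitch p as 1/(2p). The short calculation below uses only that sampling relation and the pitch values quoted in the abstract; the function name is hypothetical.

```python
def nyquist_cycles_per_mm(pitch_um):
    """Sampling (Nyquist) limit of a sensor with the given pixel pitch."""
    pitch_mm = pitch_um / 1000.0
    return 1.0 / (2.0 * pitch_mm)

nyquist_small = nyquist_cycles_per_mm(0.35)   # smallest pitch studied
nyquist_large = nyquist_cycles_per_mm(0.75)   # largest pitch studied
fraction_of_nyquist = 745.0 / nyquist_small   # reported MTF50 relative to the limit
```

The 0.35 micron pitch gives a limit of about 1429 cycles/mm, so the reported MTF50 sits at roughly half of Nyquist, while the 0.75 micron pitch tops out near 667 cycles/mm, below the reported figure, so that resolution is only reachable with the smaller pixels.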

2606.07896 2026-06-09 physics.optics cs.CV 交叉投稿

Beyond the Thin-Layer Limit: Differentiable Volumetric Training for Visible-Range Diffractive Neural Networks

超越薄层极限：面向可见光衍射神经网络的可微体积训练

Dineth Jayakody, Dushan N. Wadduwage

发表机构 * Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA（计算机科学系，老道明大学，诺福克，VA 23529，美国）； School of Data Science, Old Dominion University, Norfolk, VA 23529, USA（数据科学学院，老道明大学，诺福克，VA 23529，美国）； Department of Physics, Old Dominion University, Norfolk, VA 23529, USA（物理系，老道明大学，诺福克，VA 23529，美国）

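AI summary: Addresses the poor performance of visible-range diffractive neural networks caused by the thin-layer approximation by proposing a differentiable beam-propagation layer that models each diffractive element as a finite-thickness volume, substantially reducing design-to-device mismatch; under FDTD validation, classification accuracy rises from 50% to 90%.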

AI中文摘要

衍射深度神经网络（D2NN）有望为机器视觉提供微型化、低功耗、光速的光学前端，然而最成熟的演示仍停留在太赫兹波段，由易于制备的毫米尺度神经元构建。将D2NN推广到几乎所有视觉流水线工作的可见光波段，长期以来归因于纳米尺度神经元的制备困难；但即使近期进展消除了这一障碍，与太赫兹对应物匹配的可见光D2NN仍遥不可及。我们识别出真正的障碍是几乎所有D2NN训练所依赖的薄层近似，它将每个衍射层视为无限薄的掩模。失败的原因并非通常假设的短波长，而是可见光波段使用的低折射率材料（n约1.3-1.5）需要足够厚的浮雕结构，使得层内衍射和相位积累变得显著。为克服这一问题，我们引入可微光束传播（$\partial$BPM）层，将每个元件建模为有限厚度体积，并在训练过程中通过其传播光，保持与制备兼容的高度图端到端可训练，无需全波仿真在环。在MNIST、Fashion-MNIST和CIFAR-100分类及成像任务中，$\partial$BPM训练显著降低了设计-器件失配，全波FDTD验证将分类准确率从50%提升至90%，无需重新优化。因此，$\partial$BPM层为高效光学神经网络优化与制备一致的衍射设计之间提供了可扩展的、物理感知的桥梁。

英文摘要

Diffractive deep neural networks (D2NNs) promise miniaturized, power-efficient, light-speed optical front-ends for machine vision, yet the most mature demonstrations remain in the terahertz regime, built from readily fabricated millimeter-scale neurons. Translating D2NNs to the visible range, where nearly all vision pipelines operate, was long blamed on the difficulty of fabricating nanoscale neurons; but even after recent advances removed that barrier, visible-range D2NNs matching their terahertz counterparts remain out of reach. We identify the true obstacle as the thin-layer approximation underlying nearly all D2NN training, which treats each diffractive layer as an infinitely thin mask. It fails not because of the short wavelength, as is commonly assumed, but because the low-refractive-index materials (n approximately 1.3-1.5) used at visible wavelengths require relief structures thick enough that intra-layer diffraction and phase accumulation become significant. To overcome this, we introduce a differentiable beam-propagation ($\partial$BPM) layer that models each element as a finite-thickness volume and propagates light through it during training, keeping the fabrication-compatible height map end-to-end trainable without full-wave simulation in the loop. Across MNIST, Fashion-MNIST, and CIFAR-100 classification and imaging, $\partial$BPM training substantially reduces the design-to-device mismatch, and full-wave FDTD validation raises classification accuracy from 50% to 90% without re-optimization. The $\partial$BPM layer thus offers a scalable, physics-aware bridge between efficient optical neural-network optimization and fabrication-consistent diffractive design.
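
The link between low refractive index and thick relief structures follows from the thin-element phase relation φ = 2π(n − n_ambient)·h/λ: a lower index contrast needs more height for the same phase delay. The numbers below are an illustration only; the 0.55 micron wavelength, the air ambient, and the high-index comparison value are assumptions, not values from the paper.

```python
import math

def height_for_phase(phase_rad, wavelength_um, n_material, n_ambient=1.0):
    """Relief height (micrometres) that produces the requested phase delay."""
    return phase_rad * wavelength_um / (2 * math.pi * (n_material - n_ambient))

wavelength = 0.55                                                 # assumed green light, micrometres
h_low = height_for_phase(2 * math.pi, wavelength, 1.3)            # low end of the quoted index range
h_mid = height_for_phase(2 * math.pi, wavelength, 1.5)            # high end of the quoted index range
h_high_index = height_for_phase(2 * math.pi, wavelength, 3.5)     # hypothetical high-index material
```

Under these assumptions a full 2π phase range needs about 1.8 micrometres of relief at n = 1.3 and 1.1 micrometres at n = 1.5, that is, two to three wavelengths of propagation inside the layer, compared with less than half a wavelength for the hypothetical n = 3.5 material.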

2606.08370 2026-06-09 eess.IV cs.CV 交叉投稿

Programmable Silicon Retina on Pixel Processor Array

可编程硅视网膜在像素处理器阵列上的实现

Maciej Lewandowski, Prince Philip, Alexandre Marcireau, Chetan Singh Thakur, André van Schaik, Piotr Dudek

发表机构 * Department of Electrical and Electronic Engineering, University of Manchester, UK（电气与电子工程系，曼彻斯特大学，英国）； Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore, India（电子系统工程系，印度科学研究院，班加罗尔，印度）； Department of Computer Science, University of Manchester, UK（计算机科学系，曼彻斯特大学，英国）

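AI summary: Presents the first implementation of a multi-stage silicon retina model on the SCAMP-5 Pixel Processor Array; with biologically inspired processing such as spatial filtering and gain control, saliency prediction loss falls by 13% and the event rate by about 47%.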

AI中文摘要

标准动态视觉传感器通过检测时间对比度变化来近似视网膜处理，提供高速度和高动态范围。在这项工作中，我们探讨了加入额外的生物启发处理阶段——特别是空间滤波和增益控制——是否能为某些下游任务（如显著性预测）带来优势。我们首次在SCAMP-5像素处理器阵列上实现了多级硅视网膜模型，并提供了基于GPU的仿真框架。我们在视频强度重建和视频显著性预测上评估了模型性能。虽然生物启发模型在重建绝对强度帧方面效果较差，但与标准DVS事件表示相比，它在显著性预测损失上降低了13%，同时事件率减少了约47%。这些实验使用了一个轻量级的约10万参数的FireNet风格网络，该网络从基于事件的重建调整为显著性预测。这些结果表明，硅视网膜的“信息蒸馏”机制可以为下游神经网络实现更高效的表示，特别是在带宽受限的边缘应用中。

英文摘要

Standard dynamic vision sensors approximate retinal processing by detecting temporal contrast changes, offering high speed and high dynamic range. In this work, we explore whether incorporating additional biologically inspired processing stages - specifically spatial filtering and gain control - can offer advantages for certain downstream tasks such as saliency prediction. We present the first implementation of a multi-stage Silicon Retina model on the SCAMP-5 Pixel Processor Array, along with a GPU-based simulation framework. We evaluate the performance of our model on Video Intensity Reconstruction and Video Saliency Prediction. While the bio-inspired model is less effective at reconstructing absolute intensity frames, it achieves a 13% reduction in saliency prediction loss in comparison to the standard DVS event representation, while reducing the event rate by approximately 47%. These results are obtained using a lightweight FireNet-style network with approximately 100k parameters, adapted from event-based reconstruction to saliency prediction. These results suggest that the silicon retina's "information distillation" mechanism can achieve a more efficient representation for downstream neural networks, particularly in bandwidth-constrained edge applications.
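
The baseline behavior described in the first sentence, events triggered by temporal contrast changes, can be sketched with a simple frame-based event simulator: a pixel fires when its log intensity departs from a stored reference by more than a threshold. This is a generic simplification (at most one event per pixel per frame, no noise or refractory period), not the paper's silicon retina model; the threshold and names are assumptions.

```python
import numpy as np

def dvs_events(frames, threshold=0.2, eps=1e-3):
    """Simulate DVS-style events from an intensity video.

    frames: (T, H, W) array of non-negative intensities.
    Returns a list of (t, y, x, polarity) tuples, polarity +1 (ON) or -1 (OFF).
    """
    ref = np.log(frames[0] + eps)             # per-pixel reference log intensity
    events = []
    for t in range(1, len(frames)):
        logi = np.log(frames[t] + eps)
        diff = logi - ref
        for polarity, mask in ((1, diff >= threshold), (-1, diff <= -threshold)):
            ys, xs = np.nonzero(mask)
            events.extend((t, int(y), int(x), polarity) for y, x in zip(ys, xs))
            ref[mask] = logi[mask]            # reset reference where an event fired
    return events
```

Additional stages such as spatial filtering or gain control would act on the signal before this thresholding step, which is where a reduction in event rate like the one reported would come from.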

2404.01948 2026-06-09 cs.CV 版本更新

Quantifying Noise of Dynamic Vision Sensor

量化动态视觉传感器的噪声

Evgeny V. Votyakov, Alessandro Artusi

发表机构 * DeepCamera MRG ； CYENS Centre of Excellence（CYENS卓越中心）； Nicosia, Cyprus（塞浦路斯尼科西亚）

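AI summary: Proposes a new technique based on detrended fluctuation analysis to quantify the background noise of dynamic vision sensors, addressing the difficulty of separating noise from signal when no ground truth is available, and shows how to determine optimal denoising filter parameters.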

Comments 5 pages, 4 figures, submitted to the IEEE Signal Processing Letters

2406.07318 2026-06-09 cs.CV cs.AR eess.IV 版本更新

Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs

嵌入式图卷积网络用于SoC FPGA上的实时事件数据处理

Kamil Jeziorek, Piotr Wzorek, Krzysztof Blachut, Andrea Pinna, Tomasz Kryjak

发表机构 * University of Warsaw（华沙大学）； Politechnika Warszawska（华沙理工大学）

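AI summary: Proposes EFGCN, a hardware-aware graph convolutional network for event cameras that runs in real time on SoC FPGAs, with a model up to 100 times smaller than AEGNN and an accuracy drop of 2.9% on N-Caltech101 (2.2% on N-Cars).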

DOI: 10.1016/j.sysarc.2026.103850
Journal ref: Journal of Systems Architecture, Volume 177, August 2026, 103850

AI中文摘要

事件相机的使用代表了解决传统视频系统限制的重要且快速发展的趋势。特别是在汽车领域，这些相机因其低延迟和低功耗而集成到嵌入式实时系统中具有重要意义。确保事件处理所需吞吐量和延迟的一种有效方法是利用图卷积网络（GCNs）。在本研究中，我们引入了一种定制的EFGCN（基于事件的FPGA加速图卷积网络），该网络采用了一系列针对PointNetConv（一种用于点云处理的图卷积）的硬件感知优化。所提出的技术相比该领域最新工作之一——异步基于事件的GNN（AEGNN），模型大小减少了高达100倍，而精度下降相对较小（N-Caltech101分类任务下降2.9%，N-Cars分类任务下降2.2%），从而遵循了TinyML趋势。我们在ZCU104 SoC FPGA平台上实现了EFGCN，无需任何片外外部存储器资源，实现了每秒1330万事件（MEPS）的吞吐量和低延迟的实时部分异步处理。在多个基于事件的分类基准测试中，我们的方法在提供每事件最先进的计算效率、小模型大小以及高可扩展性、可定制性和资源效率的同时，实现了具有竞争力的精度。我们将软件和硬件源代码发布在开放存储库中：https://github.com/vision-agh/gcnn-dvs-fpga。

英文摘要

The utilisation of event cameras represents an important and swiftly evolving trend aimed at addressing the constraints of traditional video systems. Particularly within the automotive domain, these cameras find significant relevance for their integration into embedded real-time systems due to lower latency and power consumption. One effective approach to ensure the necessary throughput and latency for event processing is through the utilisation of graph convolutional networks (GCNs). In this study, we introduce a custom EFGCN (Event-based FPGA-accelerated Graph Convolutional Network) designed with a series of hardware-aware optimisations tailored for PointNetConv, a graph convolution designed for point cloud processing. The proposed techniques result in an up to 100-fold reduction in model size compared to Asynchronous Event-based GNN (AEGNN), one of the most recent works in the field, with a relatively small decrease in accuracy (2.9% for the N-Caltech101 classification task, 2.2% for the N-Cars classification task), thus following the TinyML trend. We implemented EFGCN on a ZCU104 SoC FPGA platform without any off-chip external memory resources, achieving a throughput of 13.3 million events per second (MEPS) and real-time partially asynchronous processing with low latency. Across multiple event-based classification benchmarks, our approach achieves competitive accuracy while providing state-of-the-art computational efficiency per event, small model size, and high scalability, customisability and resource efficiency. We publish both software and hardware source code in an open repository: https://github.com/vision-agh/gcnn-dvs-fpga.

2605.22208 2026-06-09 cs.CV 版本更新

EvoIR-Agent: Self-Evolving Image Restoration Agentic System via Experience-Driven Learning

EvoIR-Agent: 通过经验驱动学习实现自进化图像修复智能体

Kailin Zhuang, Jiawei Wu, Zhi Jin

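AI summary: Proposes EvoIR-Agent, which tackles planning failures caused by insufficient experience in image restoration through experience-driven learning; a hierarchical experience pool and a self-evolving mechanism improve restoration performance and efficiency, and experiments show strong full-reference metrics and a favorable balance between performance and efficiency.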

Comments Temporarily withdrawn for institutional clearance and compliance review. A revised version will be uploaded once the process is finalized

AI中文摘要

多模态大语言模型（MLLM）驱动的图像修复智能体在退化耦合场景中表现出色，能够灵活选择工具并确定去除顺序。然而，其零样本规划在缺乏经验时往往失效，需要通过大量试错来获得满意结果。目前有两种方法用于解决此问题，但存在矛盾：基于训练的方法将内在经验嵌入参数中，实现高推理效率但缺乏对新工具或退化的兼容性。相比之下，基于免训练的方法利用显式经验存储以提高兼容性，但仍因经验存储方式简单而存在试错开销。为解决此矛盾，本文提出EvoIR-Agent，首先系统地制定了免训练图像修复智能体的经验组件。随后构建了分层经验池，能够为多样化的工具和去除顺序提供粗到细的指导。此外，引入了自进化机制，通过积累的记录更新池，从而大大提高了性能和效率。大量实验表明，EvoIR-Agent在全参考指标上取得了显著领先，并在性能与效率之间实现了显著的帕累托最优平衡。

英文摘要

Multimodal Large Language Model (MLLM)-driven image restoration agents are effective in coupled-degradation scenarios because they flexibly select tools and determine removal orders. However, their zero-shot planning often fails without experience, incurring severe trial-and-error overhead before satisfactory outcomes are reached. Two paradigms currently address this issue, yet a dilemma persists: training-based methods embed intrinsic experience into parameters, achieving high inference efficiency but lacking compatibility with new tools or degradations. In contrast, training-free methods use explicit experience storage for compatibility but still incur trial-and-error overhead because the stored experience is naive. To resolve the dilemma, we propose EvoIR-Agent, which first systematically formulates the experience components of a training-free image restoration agent. Subsequently, a hierarchical experience pool is constructed, which enables coarse-to-fine guidance for diverse tools and removal orders. Furthermore, a self-evolving mechanism is introduced to update the pool from scratch using accumulated records, thereby greatly improving performance and efficiency. Extensive experiments reveal that EvoIR-Agent achieves a significant lead in full-reference metrics and yields a remarkable Pareto-optimal balance between performance and efficiency compared with state-of-the-art methods.

2606.07593 2026-06-09 cs.CV cs.AI 新提交

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

视觉Transformer对抗微调的机制分析

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

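AI summary: Studies, through mechanistic analysis, how adversarial fine-tuning affects Vision Transformer performance on perturbed and regular images, finding that fine-tuning improves robustness only to the corruption types seen in training and does not fundamentally change the learned sparse representations.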

AI中文摘要

图像分类模型在高风险现实场景中的广泛应用要求模型对输入图像的轻微扰动（如模糊或锐化）具有鲁棒性。尽管视觉Transformer（ViT）在现代多模态模型（如视觉-语言模型（VLM）和视觉-语言-动作（VLA）模型）中扮演着不可或缺的角色，但在鲁棒性设置中它们缺乏关注。在这项工作中，我们通过机制视角分析了对抗微调（一种提高模型对图像扰动鲁棒性的流行方法）对ViT在扰动和常规图像上性能的影响。我们在低频和高频图像损坏上对抗训练ViT，并试图通过检查模型的注意力机制、内部表示和知识演化来解释下游模型性能的变化。总体而言，我们的结果表明，虽然对带有常见损坏的输入进行微调提高了模型在新损坏数据实例上的性能和确定性，但这些改进不会转移到训练中未见过的其他类别损坏。此外，尽管观察到各层视觉注意力和知识演化的变化，我们发现对抗训练并未导致ViT学习的稀疏表示发生根本性变化。

英文摘要

The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight perturbations of the input images, such as blurring or sharpening. While vision transformers (ViTs) play an integral role in many modern multimodal models such as Vision-Language Models (VLMs) and Vision-Language-Action (VLA) models, they have received little attention in the robustness setting. In this work, we analyze, through a mechanistic lens, the effects of adversarial fine-tuning, a popular method for improving model robustness to image perturbations, on a ViT's performance on perturbed and regular images. We adversarially train a ViT on low-frequency and high-frequency image corruptions, and attempt to explain changes in downstream model performance through an examination of the model's attention mechanisms, internal representations, and knowledge evolution. Overall, our results suggest that, while fine-tuning on inputs with common corruptions improves model performance and certainty on new instances of corrupted data, these improvements do not transfer to other classes of corruptions not seen in training. Additionally, despite observing changes in visual attention and knowledge evolution across layers, we found that adversarial training did not lead to fundamental changes in the sparse representations learned by ViTs.


2606.07620 2026-06-09 cs.CV cs.AI cs.DC cs.LG 新提交

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

SENTRY: 视觉Transformer在软错误下的统计可靠性分析

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

发表机构 * Brandenburg University of Technology Cottbus-Senftenberg（勃兰登堡工业大学）； Tallinn University of Technology（塔林理工大学）； Zanjan University（赞詹大学）

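AI summary: Proposes a statistical fault injection framework based on finite-population sampling that estimates failure rates within a 1% margin at 99% confidence from only a few thousand samples, cuts experimental cost by up to 10,700 times, and identifies normalization layers and critical exponent bits as vulnerability hotspots in ViTs.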

AI中文摘要

随着视觉Transformer在自动驾驶和医学成像等安全关键领域的应用增长，确保其抵抗软错误的可靠性至关重要。尽管ViT提供了最先进的准确性，但其庞大的参数数量使得穷举故障注入不可行。为弥补这一差距，本文提出一个统计故障注入框架，利用有限总体抽样理论提供形式化的可靠性保证。我们证明，无论模型规模如何，仅需数千个样本即可在99%置信度下将故障率限制在1%的误差界内。与穷举方法相比，该方法将实验成本降低高达10700倍，同时保留跨架构组件定位脆弱性的能力。通过对ViT-Tiny和ViT-Small等不同架构的广泛评估，我们揭示了高度非均匀的可靠性景观。结果表明，虽然只有3%的FP32位翻转导致故障，但其中绝大多数事件导致灾难性的精度崩溃。具体脆弱性被定位到归一化层和IEEE-754格式中的关键指数位，为设计加固的、边缘部署的ViT架构提供了数学基础和可操作的见解。

英文摘要

With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
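The sampling argument can be illustrated with the textbook finite-population sample-size formula, $n = N / \left(1 + e^2 (N-1) / (t^2 p (1-p))\right)$, where $N$ is the number of possible faults, $e$ the margin, $t$ the normal quantile for the chosen confidence, and $p$ the assumed failure proportion. This is a sketch under assumptions: the abstract does not state the paper's exact estimator, the worst-case default `p=0.5` is a conservative choice, and the 5-million-parameter model is hypothetical. The second helper shows what a single FP32 bit flip does to a value.

```python
import math
import struct
from statistics import NormalDist


def sample_size(population, margin=0.01, confidence=0.99, p=0.5):
    """Faults to inject so the estimated failure rate is within +/- margin."""
    t = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided normal quantile
    n = population / (1 + margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p)))
    return math.ceil(n)


def flip_fp32_bit(value, bit):
    """Flip one bit of an FP32 value (0-22 mantissa, 23-30 exponent, 31 sign)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


# Hypothetical model: 5 million FP32 weights, one candidate fault per bit.
fault_space = 5_000_000 * 32
needed = sample_size(fault_space)

# Flipping the top exponent bit of 1.0 yields infinity; the sign bit yields -1.0.
exponent_hit = flip_fp32_bit(1.0, 30)
sign_hit = flip_fp32_bit(1.0, 31)
```

Because the formula saturates as $N$ grows, the required sample count barely changes with model size, and a smaller assumed `p` lowers it further. The exponent-bit example hints at why a small set of bit positions can dominate catastrophic failures.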


2606.07660 2026-06-09 cs.CV cs.LG 新提交

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

我们是否需要教基础模型什么是生成图像？基于解析谱自适应的无梯度生成伪影检测

Qiaoyu Chen, Bing Zhang

发表机构 * Harbin University of Commerce（哈尔滨商业大学）

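AI summary: Proposes a gradient-free method that recasts generative artifact detection as an out-of-distribution anomaly measurement problem; by analytically decoupling statistical and semantic deviations, it significantly outperforms gradient-optimized methods in a zero-shot setting.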

AI中文摘要

通过基于梯度的更新来适应基础模型以检测生成伪影会损害其内在表示。在有限样本上优化时，模型会过拟合到局部领域捷径。在专门数据上微调大量权重会引入错误的归纳偏差，在高维特征空间中引起可测量的 $\mathcal{L}_2$ 范数扰动——我们将这一现象形式化为锚点漂移。非线性激活放大了这种漂移，损害了跨未见领域的零样本伪造检测。我们提出了一种无梯度方法，将检测从二分类重新定义为分布外（OOD）异常度量问题。将冻结的基础模型视为稳定的坐标系，通过解析解耦统计和语义偏差，在真实视觉流形上建立一个绝对的自然锚点，该锚点源自注意力加权的空间矩和感知不一致性的正交投影。在极端零样本设置下（在面部伪造上训练，在通用文本到图像生成上测试），我们的方法显著优于梯度优化范式。无反向传播的前向传递和线性求解器实现了硬件无关、边缘可部署的校准，延迟极低。此外，Sherman-Morrison公式使得能够针对新型攻击进行即时在线学习，并通过协方差增量传输实现隐私保护的联邦协作。

英文摘要

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space -- a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.
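The Sherman-Morrison formula mentioned above updates a known inverse after a rank-one change: $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$. A minimal pure-Python sketch follows; how the paper applies it to its covariance statistics is not specified in the abstract, so the function name and the toy matrices are illustrative assumptions.

```python
def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^-1 given A^-1, without re-inverting A."""
    n = len(u)
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    vt_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[i] * a_inv_u[i] for i in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("the rank-one update makes the matrix singular")
    return [[a_inv[i][j] - a_inv_u[i] * vt_a_inv[j] / denom for j in range(n)]
            for i in range(n)]


# Toy case: A = I, new sample x = [1, 0], so A + x x^T = diag(2, 1).
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sherman_morrison(identity, [1.0, 0.0], [1.0, 0.0])
```

Each update costs $O(d^2)$ rather than the $O(d^3)$ of a fresh inversion, which is what makes per-sample online updates cheap.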


2606.07102 2026-06-09 cs.CV cs.AI 新提交

GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

GP-Adapter: 基于高斯过程的CLIP适配器用于少样本分布外检测

Taisei Saito, Koretaka Ogata, Takafumi Hiroi

发表机构 * st Taisei Saito（第一作者）； nd Koretaka Ogata（第二作者）； rd Takafumi Hiroi（第三作者）

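AI summary: Proposes GP-Adapter, a training-free framework that augments CLIP with Gaussian process uncertainty modeling for few-shot classification and out-of-distribution detection; it needs no backbone fine-tuning and relies only on a small cache and lightweight hyperparameter selection.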

Comments 8 pages, 6 figures, Accepted at IJCNN 2026

AI中文摘要

我们提出GP-Adapter，一种无需训练的框架，通过高斯过程（GP）不确定性建模增强CLIP（对比语言-图像预训练），用于少样本分类和分布外（OOD）检测。虽然CLIP实现了强大的零样本识别，但它产生确定性的相似度分数，并提供有限的不确定性信息，这在分布偏移和数据稀缺情况下至关重要。GP-Adapter在冻结的CLIP嵌入之上，使用图像特征的RBF核和文本提示的线性核构建模态特定、类别级的一类GP，并融合它们的预测统计量，以生成方差感知的置信度分数用于OOD检测。该方法无需微调CLIP骨干网络，仅依赖于少量$K$样本缓存和轻量超参数选择，内存成本为$O(CK^2)$，其中$C$为类别数，$K$为样本数。在ImageNet和多个OOD基准上的实验表明，GP-Adapter提供了具有竞争力的少样本性能，并且在与提示学习基线结合时持续改进OOD检测，突出了基于GP的不确定性建模与提示学习之间的互补性。总体而言，我们的结果表明，将概率推理与大型预训练视觉-语言模型集成可以提高低数据和分布偏移场景下的可靠性。代码可在该https URL获取。

英文摘要

We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter
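To make the idea of a class-wise one-class GP concrete, here is a minimal sketch of one common formulation: a zero-mean GP with an RBF kernel regressed onto all-ones targets over a class's $K$ cached embeddings. It is an assumption-laden illustration, not the authors' implementation: the 2-D "embeddings", the noise level, and the function names are hypothetical, and the fusion with the text-side linear-kernel GP is omitted.

```python
import math


def rbf(x, y, lengthscale=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * lengthscale ** 2))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def one_class_gp(cache, query, noise=1e-2):
    """Predictive mean and variance for one class's K-shot cache."""
    k = len(cache)
    # K x K Gram matrix per class: the source of the O(C K^2) memory cost.
    gram = [[rbf(cache[i], cache[j]) + (noise if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    k_star = [rbf(query, x) for x in cache]
    mean = sum(ks * a for ks, a in zip(k_star, solve(gram, [1.0] * k)))
    var = rbf(query, query) - sum(
        ks * w for ks, w in zip(k_star, solve(gram, k_star)))
    return mean, var


cache = [[0.0, 0.0], [1.0, 0.0]]                      # hypothetical 2-shot cache
mean_near, var_near = one_class_gp(cache, [0.0, 0.0])
mean_far, var_far = one_class_gp(cache, [10.0, 10.0])
```

A query close to the cache gets a mean near 1 and low variance, while a distant query gets a mean near 0 and variance near the prior, which is the signal a variance-aware OOD score can exploit.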


2606.08121 2026-06-09 cs.CV 新提交

基于Hadamard编码输出表示的对抗攻击与干扰检测用于目标检测和语义分割

Lucas Görnhardt, Timo Bartels, Niklas Schwarz, Tim Fingscheidt

发表机构 * Technische Universität Braunschweig（Braunschweig 技术大学）

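AI summary: To address the poor calibration and attack vulnerability caused by conventional one-hot encodings, proposes Hadamard-coded output representations with an optimized decoding procedure that yields optimal class probabilities, and uses prediction inconsistencies to detect adversarial attacks and disturbances, reaching state-of-the-art detection performance in a single detection pass.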

AI中文摘要

传统的one-hot编码通常导致模型校准不佳，在攻击下过于自信，并使基于熵的检测算法失效。先前的图像分类工作表明，Hadamard编码输出表示可以提高对抗鲁棒性。然而，将Hadamard码集成到语义分割中的尝试在平均交并比性能上远落后于最先进模型。对于目标检测，此类输出编码尚未被研究。此外，现有技术没有解决内在的码字不一致性，也没有实际利用内在的码字冗余。因此，我们首先推导了一种新的Hadamard码字解码过程，以得到最优的类概率，通过使用到概率单纯形的投影来解决底层优化问题。其次，我们的优化提供了预测不一致性的度量。第三，我们首次展示了如何利用这些不一致性进行对抗攻击和干扰检测。第四，我们引入了HadamardNet，这是一个采用Hadamard码作为语义分割和目标检测模型及任务的输出表示的框架。我们在干扰和对抗攻击上进行了全面评估，在仅单次检测中实现了两项任务的最先进扰动检测性能，同时在干净数据上提供了等效或接近的参考性能。

英文摘要

Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and letting entropy-based detection algorithms fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, attempts to integrate Hadamard codes into semantic segmentation fall far behind state-of-the-art models in mean intersection-over-union performance. Regarding object detection, such output encodings have not yet been investigated at all. Further, no prior art addressed intrinsic codeword inconsistencies or actually exploited intrinsic codeword redundancy. Accordingly, we first derive a novel decoding procedure for Hadamard codewords towards optimal class-wise probabilities, solving the underlying optimization problem by using the projection onto the probability simplex. Second, our optimization delivers a measure of prediction inconsistency. Third, we are the first to show how to exploit these inconsistencies for adversarial attack and disturbance detection. Fourth, we introduce HadamardNet, a framework employing Hadamard codes as output representations for semantic segmentation and object detection models and tasks. We conduct a comprehensive evaluation both on disturbances and adversarial attacks, achieving state-of-the-art perturbation detection performance for both tasks in only a single detection pass, while delivering equivalent or close-by reference performance on clean data.
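Two ingredients named in the abstract can be sketched independently of the paper: Hadamard codewords (here via the Sylvester construction) and the Euclidean projection onto the probability simplex (here the standard sort-and-threshold algorithm). The `decode` step, which correlates the output with each codeword before projecting, is an illustrative assumption; the abstract does not spell out the authors' optimization problem.

```python
def hadamard(order):
    """Sylvester construction; order must be a power of two."""
    h = [[1]]
    while len(h) < order:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:          # holds for a prefix; the last hit sets theta
            theta = t
    return [max(x - theta, 0.0) for x in v]


def decode(output, codebook):
    """Correlate an output vector with every codeword, then project."""
    n = len(output)
    scores = [sum(o * c for o, c in zip(output, code)) / n for code in codebook]
    return project_to_simplex(scores)


codebook = hadamard(4)                       # one codeword per class
clean = decode(codebook[2], codebook)        # noiseless output for class 2
noisy = decode([0.9, 1.2, -0.7, -1.1], codebook)
```

Because Hadamard rows are mutually orthogonal, a clean codeword decodes to a one-hot distribution, and a noisy output still decodes to a valid probability vector.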


2606.09746 2026-06-09 cs.CV cs.AI cs.LG 新提交

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

时空神经网络的混合鲁棒性验证

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

发表机构 * Imperial College London（伦敦帝国学院）

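AI summary: For robustness verification of 3D CNNs on video and volumetric inputs, proposes spatio-temporal constraint modelling and the STBP framework, combining exact closed-form propagation with scalable approximations and achieving 1.7x higher certified robust accuracy on benchmarks such as UCF-101.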

Comments Accepted at the 9th International Symposium on AI Verification (SAIV 2026)

AI中文摘要

随着人工智能越来越多地部署在安全关键系统中，为底层模型提供形式化的鲁棒性保证至关重要。现有的验证方法要么依赖过于保守的近似，要么产生难以承受的计算成本。例如，在视频设置中使用lp-范数扰动编码了对手可以在每个视频帧中注入噪声的信念。实际上，对抗性扰动表现出结构化的时空相关性，被约束在低维、语义上有意义的子空间中。在这项工作中，我们研究了处理视频和体素输入的3D CNN的鲁棒性验证，针对动作识别（UCF-101）、自动驾驶（Udacity）和医学成像（MedMNIST）中的应用，通过将对抗强度建模为时空约束——攻击者可以修改一组连续帧中的子集或补丁——来利用关于对抗强度的现实假设。我们证明，建模现实约束能够实现更紧的近似。我们引入了时空边界传播（STBP），这是一个验证框架，它计算第一卷积层的精确闭式表征，并通过可扩展的近似传播认证边界。计算精确闭式为第一卷积层提供了最紧的边界。因此，我们在网络的其余部分使用近似方法。为了推动该领域的进一步发展，我们提出了ST-Bench，一个用于自动驾驶和活动识别的验证基准，以系统评估可验证的鲁棒性。与现有的基于验证的方法相比，STBP在相同的扰动预算下提供了更强的鲁棒性保证，并显著提高了可扩展性，实现了1.7倍更高的认证鲁棒准确率。

英文摘要

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST), exploiting realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer. Thus, we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
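Why a restricted threat model tightens bounds can be seen on a single affine output, which is what one convolution output is. The sketch below is not STBP: it is plain interval arithmetic under the assumption that each perturbable input may move by at most `eps`, with hypothetical weights over three frames.

```python
def affine_bounds(weights, bias, x, eps, perturbed):
    """Exact output interval of w.x + b when only the inputs whose index
    is in `perturbed` may each move by at most +/- eps."""
    center = sum(w * xi for w, xi in zip(weights, x)) + bias
    radius = eps * sum(abs(w) for i, w in enumerate(weights) if i in perturbed)
    return center - radius, center + radius


weights = [1.0, -2.0, 3.0]      # one temporal filter tap per frame (hypothetical)
frames = [1.0, 1.0, 1.0]

every_frame = affine_bounds(weights, 0.0, frames, 0.1, {0, 1, 2})
one_frame = affine_bounds(weights, 0.0, frames, 0.1, {1})
```

With every frame perturbable the interval is roughly [1.4, 2.6]; restricting the attacker to frame 1 shrinks it to roughly [1.8, 2.2]. For a box-shaped input set this interval is exact for an affine map, which is consistent with the abstract's point that the first layer admits a closed-form treatment.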


2606.07650 2026-06-09 cs.CR cs.CV cs.NI 交叉投稿

Detecting Aimbot Cheaters in MOGs

检测多人在线游戏中的自瞄作弊者

Salman Shaikh, Tao Ni, Marc Dacier

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

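AI summary: Proposes PATCH, a proactive defense that deploys adversarial patches as in-game honeytokens to trigger cheaters' object detection models, enabling detection or disruption; on a custom Unreal Engine game it achieves over 90% white-box detection and 60-90% cross-model transferability.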

AI中文摘要

多人在线游戏已成为娱乐行业数十亿美元的产业。然而，作弊者的存在破坏了诚实玩家的体验，并贬低了游戏开发者的努力，因为它直接影响玩家留存率、竞技完整性、游戏的合法性和可信度，以及最重要的整体收入流。在各种作弊技术中，视觉自瞄作弊是一种新兴威胁。它们使用计算机视觉模型从客户端屏幕截图中检测对手，而不是访问游戏内存，这使得商业内核级反作弊解决方案完全无法检测。在本文中，我们介绍了PATCH，一种新颖的主动防御策略，该策略部署对抗性补丁作为游戏中的蜜标，以减轻视觉自瞄作弊者的存在。我们的方法侧重于故意触发作弊者的目标检测模型，从而实现直接检测，或通过在其视口上泛滥补丁使作弊者无法进行游戏。我们在各种标准上评估了我们的方法；分析了不同补丁大小的有效性、补丁对不同屏幕分辨率的可扩展性、对不同视觉自瞄作弊配置的有效性，并探索了各种YOLO模型以评估补丁的可迁移性。在定制的Unreal Engine游戏上的评估显示，在几乎所有补丁大小的白盒场景中，检测率超过90%，并且使用较大补丁时，跨模型迁移率达到60%至90%。我们进一步在商业MOG《堡垒之夜》上验证了我们的方法，展示了现实世界的适用性。

英文摘要

Multiplayer Online Games have become a multibillion dollar industry in the entertainment sector. However, the presence of cheaters undermines the experience of honest players and devalues the effort of game developers, as it directly affects player retention, competitive integrity, the legitimacy and trustworthiness of a game, and most importantly the overall revenue streams. Among various cheating techniques, visual aimbots represent an emerging threat. They use computer vision models to detect opponents from client screen captures rather than accessing game memory, making them completely undetectable by commercial kernel level anti cheat solutions. In this paper, we introduce PATCH, a novel proactive defense strategy that deploys adversarial patches as in game honeytokens to mitigate the presence of visual aimbot cheaters. Our approach centers on deliberately triggering the cheaters' object detection model, enabling either direct detection, or rendering the game unplayable for the cheater via patch flooding on their viewport. We evaluate our approach on various criteria; analyzing the effectiveness of different patch sizes, scalability of patches to different screen resolutions, efficacy against diverse visual aimbot cheat configurations and also explore various YOLO models to assess patch transferability. Evaluation on a custom Unreal Engine game demonstrates over 90 percent detection rate in white box scenarios for almost all patch sizes, and reaches 60 to 90 percent cross model transferability with larger patches. We further validate our approach on Fortnite, a commercial MOG, demonstrating real world applicability.


2507.09092 2026-06-09 cs.CV cs.LG 版本更新

分类鲁棒性与解释鲁棒性真的强相关吗？基于输入损失景观的分析

Tiejin Chen, Wenwang Huang, Linsey Pang, Dongsheng Luo, Hua Wei

发表机构 * University of Science and Technology of China（中国科学技术大学）

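AI summary: Questions the conventional view that classification robustness and explanation robustness are strongly correlated; it evaluates explanation robustness via clustering, proposes a training method that adjusts the explanation loss landscape, and finds that the two are not strongly correlated.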

AI中文摘要

本文深入探讨深度学习鲁棒性的关键领域，挑战了图像分类系统中分类鲁棒性和解释鲁棒性固有相关的传统观点。通过一种利用聚类高效评估解释鲁棒性的新颖评估方法，我们证明增强解释鲁棒性并不一定会使输入损失景观相对于解释损失变得平坦——这与平坦的损失景观指示更好的分类鲁棒性相反。为了深入探究这一矛盾，我们提出了一种开创性的训练方法，旨在调整相对于解释损失的损失景观。通过这种新的训练方法，我们发现尽管这种调整可以影响解释的鲁棒性，但它们对分类的鲁棒性没有影响。这些发现不仅挑战了两种鲁棒性之间强相关的主流假设，而且为理解损失景观与解释损失之间的关系开辟了新的途径。

英文摘要

This paper delves into the critical area of deep learning robustness, challenging the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss - contrary to flattened loss landscapes indicating better classification robustness. To deeply investigate this contradiction, a groundbreaking training method designed to adjust the loss landscape with respect to explanation loss is proposed. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not have an influence on the robustness of classification. These findings not only challenge the prevailing assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding the relationship between loss landscape and explanation loss.


2503.18314 2026-06-09 cs.LG cs.AI cs.CV 版本更新

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

LoTUS：带有不确定性风味的大规模机器遗忘

Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Centre for Research & Technology Hellas（希腊研究中心与技术中心）； Archimedes/Athena RC（阿基米德/雅典娜研究中心）

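AI summary: Proposes LoTUS, which removes the influence of training samples by smoothing prediction probabilities up to an information-theoretic bound, avoiding retraining from scratch; it outperforms existing methods on Transformer and ResNet18 models and introduces the RF-JSD metric for practical evaluation.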

Comments Accepted as a main conference paper at CVPR 2025 (https://cvpr.thecvf.com/virtual/2025/poster/33292)

2606.07640 2026-06-09 cs.CV cs.AI cs.LG 新提交

No Free Lunch for Synthetic Images under Data Scarcity Conditions

数据稀缺条件下合成图像的无免费午餐定理

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

发表机构 * Universidad Politécnica de Madrid（马德里理工大学）； Universidad de Alcalá（阿尔卡拉大学）

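AI summary: Studies the trade-offs among fidelity, privacy, and utility of synthetic data under data-scarce, privacy-sensitive conditions; it proposes a joint evaluation framework and compares VAE, GAN, and DDPM on three image datasets, finding GAN and DDPM more robust under differential privacy.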

2606.07645 2026-06-09 cs.CV cs.AI 新提交

FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction

FineGen：基于VLM的多智能体框架用于细粒度图像-文本数据集构建

Chang Kong, Yuebing Li, Peng Mo, Haigang Zhang, Qiuming Luo

发表机构 * Shenzhen Polytechnic University（深圳职业技术大学）； Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong Macao Greater Bay Area（粤港澳大湾区应用人工智能研究所）； Shenzhen University（深圳大学）

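AI summary: Proposes the FineGen framework, which automatically builds fine-grained datasets containing hard negatives through a generate-verify-correct pipeline with closed-loop feedback; it constructs FineGen-100K on ImageNet and improves hard-sample accuracy by 14.4%.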

Comments 15 pages, 2 figures, conference

2606.07646 2026-06-09 cs.CV cs.AI 新提交

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME：从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

发表机构 * MAIS, IACAS（中国科学院自动化研究所多模态人工智能系统实验室）

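AI summary: Proposes DOME, a domain encoder that extracts dense continuous representations via vision-language pretraining, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank, enabling zero-shot explicit domain modelling that outperforms complex TTA methods on multiple benchmarks.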

AI中文摘要

测试时自适应（TTA）旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布，忽略了真实世界域迁移的多维性和样本特异性，导致自适应脆弱。我们提出DOME，一种有效的域编码器，以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示，将域参数化为分布变量，并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型，即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能，超越了复杂的TTA方法。我们的结果表明，鲁棒的自适应并非源于复杂的自适应算法，而是源于显式的、结构化的域表示。

英文摘要

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.
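The "basic entropy-minimization TTA strategy" referred to above can be illustrated on a single prediction. For $H = -\sum_i p_i \log p_i$ with $p = \mathrm{softmax}(z)$, the gradient is $\partial H / \partial z_i = -p_i(\log p_i + H)$. The sketch below takes one gradient step directly on the logits, which is a simplifying assumption: a real TTA method would update model parameters, and DOME's domain cues are not modelled here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_gradient(logits):
    """dH/dz_i = -p_i * (log p_i + H), with H the entropy of softmax(z)."""
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


logits = [1.0, 0.5, 0.0]        # an uncertain prediction on an unlabeled sample
step = 0.5                      # hypothetical learning rate
adapted = [z - step * g for z, g in zip(logits, entropy_gradient(logits))]

before = entropy(softmax(logits))
after = entropy(softmax(adapted))
```

The step needs no label: it only sharpens the model's own prediction, which is why the objective works on unlabeled streaming data and also why it can reinforce confident mistakes.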


2606.07653 2026-06-09 cs.CV cs.AI 新提交

A Dataset for Dynamic Human Preferences for Vision Language Models

面向视觉语言模型的动态人类偏好数据集

Hannah Gao, Dylan Hadfield-Menell, Rachel Ma

发表机构 * Massachusetts Institute of Technology（麻省理工学院）

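AI summary: Proposes a benchmark for evaluating how well vision-language models understand dynamic human preferences, with a dataset of image-dependent changes generated by an automated pipeline, and evaluates existing models.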

2606.07654 2026-06-09 cs.CV cs.AI 新提交

MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework

MM-Matryoshka：通过二维多模态套娃训练框架实现预算弹性视觉文档检索

Haowen Xiang, Yibo Yan, Jiahao Huo, Yu Huang, Yi Cao, Mingdong Ou, Xuming Hu

发表机构 * Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Alibaba Cloud Computing（阿里云计算）； Hong Kong University of Science and Technology（香港科技大学）

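AI summary: Proposes MM-Matryoshka, a 2D Matryoshka training framework that lets a visual document retriever choose its budget elastically along vector dimension and encoder depth, without training separate models for different budgets.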

AI中文摘要

多向量视觉文档检索器通过深度视觉语言模型（VLM）为每个页面生成多个向量，实现强大的细粒度匹配，但这种设计在存储和计算开销上导致部署成本高昂。现有效率技术通常只优化预算的一部分，使得多模态检索器缺乏统一的方法来权衡精度与向量宽度和编码器深度。因此，我们提出MM-Matryoshka，一种用于预算弹性视觉文档检索（VDR）的二维套娃训练框架，使ColPali风格的多向量检索在维度和层两个方向上实现弹性。在推理时，单个检索器可以选择二维可调预算，无需为不同预算训练独立模型。通过在多个代表性骨干网络上的全面实验，我们证明MM-Matryoshka在显著降低存储和计算开销的同时，保留了比直接截断基线高得多的质量，从而为高效VDR提供了稳健的预算弹性。

英文摘要

Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Vision-Language Models (VLMs), but this design makes deployment expensive in both storage and computational overhead. Existing efficiency techniques usually optimize only part of this budget, leaving multimodal retrievers without a unified way to trade accuracy for both vector width and encoder depth. Therefore, we propose MM-Matryoshka, a 2D Matryoshka training framework for budget-elastic Visual Document Retrieval (VDR), enabling ColPali-style multi-vector retrieval elastic along both dimension and layer. At inference time, a single retriever can select a 2D selectable budget without training separate models for different budgets. Through comprehensive experiments across multiple representative backbones, we demonstrate that by retaining significantly higher quality than direct truncation baselines while substantially reducing storage and computational overhead, MM-Matryoshka can offer robust budget elasticity for efficient VDR.
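The width axis of the budget can be sketched as prefix truncation of each vector followed by late-interaction (MaxSim) scoring, the scoring style used by ColPali-like retrievers. This is a hedged illustration only: the toy 4-D vectors and function names are hypothetical, the encoder-depth axis is not shown, and the truncation here corresponds to the naive baseline unless the model was trained so that prefixes remain useful.

```python
import math


def truncate(vec, dim):
    """Keep the first `dim` coordinates and re-normalise to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


def maxsim(query_vecs, page_vecs):
    """Each query vector takes its best-matching page vector; sum the maxima."""
    return sum(
        max(sum(q * p for q, p in zip(qv, pv)) for pv in page_vecs)
        for qv in query_vecs
    )


def score(query_vecs, page_vecs, dim):
    return maxsim([truncate(q, dim) for q in query_vecs],
                  [truncate(p, dim) for p in page_vecs])


query = [[0.6, 0.8, 0.0, 0.0]]
page = [[0.6, 0.8, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

full_width = score(query, page, dim=4)
half_width = score(query, page, dim=2)    # half the storage per vector
```

In this toy case the match survives halving the width because the informative coordinates sit in the prefix, which is the property Matryoshka-style training aims to encourage.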


2606.07687 2026-06-09 cs.CV cs.AI 新提交

What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction

什么使视频世界模型潜在空间与动作相关：预测优于重建

Jewon Yeom, Hanseul Kim, Jeongjae Park, Sungmok Jung, Jaejin Lee, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University（首尔大学数据科学研究生院）

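AI summary: Using a unified probe-based evaluation, finds that action-relevant structure is driven mainly by temporal video pretraining rather than pixel reconstruction fidelity, with video-pretrained self-supervised encoders achieving the best Pareto trade-off between visual fidelity and action prediction.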

AI中文摘要

视频世界模型越来越多地用于提供预测性视觉表示，但尚不清楚哪些预训练信号在其潜在空间中诱导出与动作相关的结构。我们通过跨多种编码器家族的统一探针评估来研究这个问题，包括仅图像自监督、带或不带潜在预测的视频预训练、基于重建的自编码器、扩散模型以及捷径强制动力学模型。使用共同的逆动力学探针目标，我们发现动作相关结构主要由时间视频预训练驱动，而非像素重建保真度：具有强像素解码质量的模型可能表现出接近零的动作可恢复性，而视频预训练的自监督编码器在视觉保真度和动作预测之间始终实现最佳帕累托权衡。比较V-JEPA和VideoMAE进一步表明，大部分收益来自自然视频时间上下文，特征级潜在预测提供了较小的额外收益。这些趋势在机器人基准测试中转移，尽管CALVIN显示静态环境任务可以通过允许强图像先验来部分掩盖时间结构的重要性。最后，逆动力学监督显著提高了对视觉损坏的鲁棒性，表明动作感知目标正则化了潜在几何，超越了干净环境性能。我们的结果确定时间预测结构——而非重建保真度——是动作相关视频表示的主要成分。

英文摘要

Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals induce action-relevant structure in their latent spaces. We study this question through a unified probe-based evaluation across diverse encoder families, including image-only self-supervision, video pretraining with and without latent prediction, reconstruction-based autoencoders, diffusion models, and shortcut-forcing dynamics models. Using a common inverse-dynamics probing objective, we find that action-relevant structure is driven primarily by temporal video pretraining rather than pixel reconstruction fidelity: models with strong pixel decoding quality can exhibit near-zero action recoverability, while video-pretrained self-supervised encoders consistently achieve the best Pareto trade-off between visual fidelity and action prediction. Comparing V-JEPA and VideoMAE further shows that most gains arise from natural-video temporal context, with feature-level latent prediction providing a smaller additional benefit. These trends transfer across robotic benchmarks, though CALVIN reveals that static-environment tasks can partially mask the importance of temporal structure by allowing strong image priors to suffice. Finally, inverse-dynamics supervision substantially improves robustness to visual corruption, suggesting that action-aware objectives regularize latent geometry beyond clean-setting performance. Our results identify temporal predictive structure -- not reconstruction fidelity -- as the primary ingredient underlying action-relevant video representations.


2606.07708 2026-06-09 cs.CV cs.AI 新提交

Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization

跨视角城市交通数据集：用于单目鸟瞰图定位的无人机监督地面真值

Prakhar Bhardwaj, Simone Weikl, Kilian Mang, Elia Jonas Sandtner

发表机构 * OTH Regensburg（雷根斯堡应用技术大学）

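AI summary: Presents a cross-view urban traffic dataset built from synchronized bicycle-view and drone-view videos, supporting cross-view identity matching and bird's-eye-view prediction, with identity-level alignment and standardized evaluation.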

AI中文摘要

我们介绍了一个从真实城市交叉口同步的自行车视角视频和无人机航拍视频构建的跨视角城市交通感知数据集和基准。该基准针对两个关联任务：街景和无人机视角目标轨迹之间的跨视角身份匹配，以及利用空中监督的自我到鸟瞰图预测。与先前的城市驾驶和V2X数据集相比，我们的基准提供了跨截然不同视角的身份级对齐，以及标准化评估、标注工具和基线实现。这一设置源于以交叉口为中心的交通分析，其中身份保持、局部交互和全局空间结构必须跨视角联合推理。我们在轨迹和帧级别评估方法，包括跨视角ID精确率/召回率/IDF1、近远分解、时间稳定性和一致性指标。我们还提供了基于楔形的跨视角匹配以及三种BEV预测基线（逆透视映射、MonoLayout风格学习基线和回归基线）的基线结果。结果表明该基准可行但具有挑战性：跨视角匹配实现了高召回率，但仍受过度分配和时间不一致性的限制，而自我到BEV预测受益于空中监督，但在轻量级单目感知下远未饱和。我们希望该基准能支持跨视角感知、城市场景对齐和自我到全局交通理解的未来研究。

英文摘要

We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial drone videos recorded at real urban intersections. The benchmark targets two linked tasks: cross-view identity matching between street-view and drone-view object tracks, and ego-to-bird's-eye-view prediction using aerial supervision. In contrast to prior urban driving and V2X datasets, our benchmark provides identity-level alignment across radically different viewpoints together with standardized evaluation, annotation tooling, and baseline implementations. This setting is motivated by intersection-centric traffic analysis, where identity preservation, local interactions, and global spatial structure must be reasoned about jointly across views. We evaluate methods at both the track and frame levels, including cross-view ID precision/recall/IDF1, near--far breakdowns, temporal stability, and consistency metrics. We also provide baseline results for wedge-based cross-view matching and for three BEV prediction baselines: inverse perspective mapping, a MonoLayout-style learned baseline, and a regression baseline. The results show that the benchmark is feasible but challenging: cross-view matching achieves strong recall yet remains limited by over-assignment and temporal inconsistency, while ego-to-BEV prediction benefits from aerial supervision but remains far from saturated under lightweight monocular sensing. We hope that this benchmark will support future research on cross-view perception, urban scene alignment, and ego-to-global traffic understanding.
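The ID precision, recall, and IDF1 metrics listed above follow the usual identity-based definitions, with $\mathrm{IDF1} = 2\,\mathrm{IDTP} / (2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN})$. A small sketch with made-up counts follows; how the benchmark derives these counts from cross-view track pairs is not described in the abstract, so the numbers are purely hypothetical.

```python
def id_metrics(idtp, idfp, idfn):
    """Identity precision, recall, and IDF1 from identity-level counts."""
    precision = idtp / (idtp + idfp) if idtp + idfp else 0.0
    recall = idtp / (idtp + idfn) if idtp + idfn else 0.0
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn) if idtp else 0.0
    return precision, recall, idf1


# Hypothetical matcher that finds most true pairs but over-assigns identities.
precision, recall, idf1 = id_metrics(idtp=90, idfp=60, idfn=10)
```

Here recall is 0.9 while precision is only 0.6, giving an IDF1 of 0.72, the same pattern of strong recall limited by over-assignment that the abstract reports qualitatively.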


2606.07882 2026-06-09 cs.CV cs.AI 新提交

The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders

跨架构基板：现代视觉编码器的领域超越、校准存活的几何不变量

Yousef Radwan

发表机构 * KAUST（阿卜杜拉国王科技大学）

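AI summary: Finds that after training, the top 16 principal directions of modern vision encoders converge to the same 16-dimensional geometric object (the cross-architecture substrate), which transports across visual domains, survives calibration, and is applied to label-free transferability filtering, domain detection, low-shot probing, and teacher-free distillation.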

Comments 14 pages, 2 figures. 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

AI中文摘要

不同的视觉神经网络——训练用于分类、对比、重建或将图像与文本匹配——应该具有相应不同的内部表示。我们报告它们并非如此。训练后，十三个现代视觉编码器内部的前十六个主变化方向收敛到同一个十六维几何对象。我们称之为跨架构基板，并使用PCA、中心核对齐（CKA）和Pang 2026校准进行研究。该基板在四个视觉领域（自然照片、医学CT、卫星、显微镜）上以中位数Procrustes-CKA 0.679传输，在八个领域（增加素描、深度、热红外、天文学）上为0.604，每对>0.40。它在全局（7.4倍判别vs MAE分离，n=13,394）和局部（4.82-5.30，p<10^{-44}）上经受住Pang校准。它不是像素统计（0.263），不是Gabor特征（0.31），不是随机投影（0.041），并且在训练的前10%中出现，而准确率持续上升。我们提供了四个应用：一个无标签迁移性过滤器，优于LogME（快3倍，+0.15 Kendall-tau）；一个四路领域检测器（99.6%准确率）；一个冻结低样本探测器（16维在每类N=50标签时比768维DINOv2高3.78个百分点）；以及一个无教师蒸馏辅助，匹配训练教师KD在33对上（10%标签分数时峰值增益7.56个百分点）。该基板不跨模态，不帮助跨范式蒸馏，也不预测迁移质量（与迁移准确率的rho=0.08）。

英文摘要

Different vision neural networks -- trained to classify, contrast, reconstruct, or match images to text -- should have correspondingly different internal representations. We report that they do not. After training, the top sixteen principal directions of variation inside thirteen modern vision encoders converge to the same sixteen-dimensional geometric object. We call this the cross-architecture substrate and study it with PCA, centred kernel alignment (CKA), and Pang 2026 calibration. The substrate transports across four visual domains (natural photographs, medical CT, satellite, microscopy) at median Procrustes-CKA 0.679, and across eight domains (adding sketches, depth, thermal infrared, astronomy) at 0.604, every pair >0.40. It survives Pang calibration globally (7.4x disc-vs-MAE separation, n=13,394) and locally (4.82-5.30, p<10^{-44}). It is not pixel statistics (0.263), not Gabor features (0.31), not a random projection (0.041), and emerges in the first 10% of training while accuracy keeps climbing. We deliver four applications: a label-free transferability filter beating LogME (3x faster, +0.15 Kendall-tau); a four-way domain detector (99.6% accuracy); a frozen low-shot probe (16 dims beat 768-dim DINOv2 by 3.78pp at N=50 labels per class); and a teacher-free distillation auxiliary matching trained-teacher KD on 33 pairs (7.56pp peak gain at 10% label fraction). The substrate does not cross modalities, does not help cross-paradigm distillation, and does not predict transfer quality (rho=0.08 against transfer accuracy).
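For readers unfamiliar with centred kernel alignment, the plain linear variant compares two feature matrices computed on the same inputs as $\mathrm{CKA}(X, Y) = \lVert Y^\top X \rVert_F^2 / (\lVert X^\top X \rVert_F \, \lVert Y^\top Y \rVert_F)$ after column centring. The sketch below implements only this standard form; the "Procrustes-CKA" variant and the calibration used in the paper are not reproduced, and the toy features are hypothetical.

```python
def center(x):
    """Subtract each column's mean."""
    cols = len(x[0])
    means = [sum(row[j] for row in x) / len(x) for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in x]


def cross_frobenius_sq(x, y):
    """Squared Frobenius norm of x^T y for row-aligned matrices x and y."""
    total = 0.0
    for i in range(len(x[0])):
        for j in range(len(y[0])):
            total += sum(rx[i] * ry[j] for rx, ry in zip(x, y)) ** 2
    return total


def linear_cka(x, y):
    x, y = center(x), center(y)
    return cross_frobenius_sq(x, y) / (
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y)) ** 0.5


features = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[-2.0 * b, 2.0 * a] for a, b in features]   # rotate 90 degrees, scale by 2
unrelated = [[1.0], [-1.0], [-1.0], [1.0]]

same_geometry = linear_cka(features, rotated)
different_geometry = linear_cka(features, unrelated)
```

Linear CKA is invariant to rotations and isotropic scaling of either representation, so two encoders can score near 1 even when their raw coordinates look nothing alike, which is the kind of agreement a cross-architecture comparison needs to detect.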


2606.07891 2026-06-09 cs.CV 新提交

C3VD-DEFCOL: A Deformable Colonoscopy Dataset with Time-Resolved 3D Ground Truth and Realistic Appearance

C3VD-DEFCOL：具有时间分辨三维真实地面真值和逼真外观的可变形结肠镜数据集

Ethan Luk, Mayank V. Golhar, Anthony Song, Raúl Iranzo, Víctor M. Batlle, Lalithkumar Seenivasan, José M. M. Montiel, Nicholas J. Durr

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； Universidad de Zaragoza（萨拉戈萨大学）

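AI summary: Presents the C3VD-DEFCOL framework and dataset, which simulate peristaltic deformation and render realistic textures to provide an evaluation platform with time-resolved ground truth for deformable colonoscopy 3D reconstruction.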

AI中文摘要

三维重建可通过估计黏膜覆盖范围并在筛查期间提醒临床医生遗漏区域来改进结肠镜检查。然而，由于当前没有数据集同时提供逼真的体内外观和密集的时间分辨三维地面真值（尤其在非刚性变形下），算法开发受到限制。我们提出C3VD-DEFCOL，一个用于评估可变形结肠镜重建的框架和数据集，具有配对的几何和逼真纹理。从C3VD/C3VDv2结肠网格和相机轨迹出发，我们生成结肠表面的受控变形，包括蠕动波和中心线运动，并渲染每帧深度、表面法线、光流、相机姿态和时间戳三维网格。然后，我们使用渲染的几何（主要是深度）来条件化一个基于LTX-2.3的模拟到真实翻译模型，该模型生成具有体内样黏膜颜色、纹理、血管和镜面外观的RGB片段，同时保留底层三维场景结构。所得数据集包含来自11个独特结肠网格几何的110个视频，具有不同的相机轨迹、外观和参数化变形模式，包括三个蠕动严重程度级别作为受控评估轴。我们使用外观真实性、几何一致性和时间一致性指标评估生成的视频，并利用配对地面真值对可变形三维重建中的下游任务——姿态估计进行基准测试。实验表明，姿态估计误差随变形严重程度增加而增加，提供了现有体内数据集无法实现的受控压力测试。总体而言，C3VD-DEFCOL被设计为一个可重复的定量评估平台，用于测试可变形三维重建算法，旨在缩小合成数据集与体内结肠镜之间的领域差距。

英文摘要

3D reconstruction could improve colonoscopy by estimating mucosal coverage and alerting clinicians to missed regions during screening. However, algorithm development is limited as no current datasets provide both a realistic in vivo appearance and dense, time-resolved 3D ground truth, especially under non-rigid deformation. We present C3VD-DEFCOL, a framework and dataset for evaluating deformable colonoscopy reconstruction with paired geometry and realistic texture. Starting from C3VD/C3VDv2 colon meshes and camera trajectories, we generate controlled deformations of the colon surface, including peristaltic waves and centerline motion, and render per-frame depth, surface normals, optical flow, camera poses, and time-stamped 3D meshes. We then use the rendered geometry, primarily depth, to condition an LTX-2.3-based sim-to-real translation model that produces RGB clips with in vivo-like mucosal color, texture, vasculature, and specular appearance while preserving the underlying 3D scene structure. The resulting dataset contains 110 videos from 11 unique colon mesh geometries, with varying camera trajectories, appearances, and parameterized deformation regimes, including three peristaltic severity levels that serve as controlled evaluation axes. We evaluate the generated videos using appearance realism, geometric consistency, and temporal consistency metrics, and use the paired ground truth to benchmark the downstream task of pose estimation in deformable 3D reconstruction. Our experiments show how pose estimation error increases with increasing deformation severity, providing a controlled stress test that is not possible with existing in vivo datasets. Overall, C3VD-DEFCOL is designed as a reproducible, quantitative evaluation platform for testing deformable 3D reconstruction algorithms, with the goal of reducing the domain gap between synthetic datasets and in vivo colonoscopy.


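2606.08033 2026-06-09 cs.CV cs.LG New submission

How Much MRI Preprocessing Is Enough? A Cost-Utility Study for Brain MRI Foundation Models

Jiangshuan Pang, Wangyang Tang, Jing Yan, Zhixuan Cheng, Youzhe He, Zhenkun Zhuang, Tao Zhou, Shiping Liu

Affiliations: University of the Chinese Academy of Sciences; BGI Research

AI summary: Comparing preprocessing levels P0-P7 for self-supervised 3D MRI pretraining, this study finds that stronger preprocessing is not always better: P2 is the lowest-cost feasible level, stronger preprocessing yields only limited gains on specific tasks, and much of that gain can be recovered by preprocessing downstream.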

Abstract

MRI preprocessing defines the input distribution seen by brain MRI foundation models, yet it is usually treated as routine data cleaning rather than a modeling choice. We ask how much preprocessing is worth its computational cost for self-supervised 3D MRI pretraining. Keeping the corpus, 3D ViT backbone, masking protocol, and downstream evaluations fixed, we compare a graded P0-P7 preprocessing spectrum for masked autoencoding (MAE) and joint-embedding predictive learning (JEPA) on 20,000 heterogeneous brain MRI volumes, then transfer the encoders to IDH prediction, MCI classification, brain age regression, and GLI/PED tumor segmentation. The results do not support a simple "more is better" rule. P0/P1 are numerically unstable, making P2 the lowest-cost feasible level; beyond P2, choosing the best feasible preprocessing level improves aggregate utility by only 3.4 percentage points for MAE and 1.8 percentage points for JEPA, with most paired gains statistically unresolved. Stronger preprocessing is beneficial only in selected regimes: IDH improves modestly, AGE and GLI/PED are often near or best at P2, and MCI shows the clearest empirical P7 gain. Cross-level MCI transfer further shows that much of the P7 advantage can be recovered by applying stronger preprocessing downstream, without requiring P7 throughout pretraining. These findings recast MRI preprocessing as a downstream-aware cost-utility decision rather than a default escalation pipeline. Code is available at https://github.com/PangJiangShuan/PreBrain.


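2606.08332 2026-06-09 cs.CV New submission

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Affiliations: Universitat Pompeu Fabra; Universitat Autònoma de Barcelona

AI summary: Proposes SMI, a self-supervised objective that optimizes a sample-level dependency matrix obtained by a nonlinear transformation of pairwise correlations; with a ResNet-50 on ImageNet it reaches competitive performance at lower computational complexity and improves transfer performance on low-resource tasks.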

Abstract

Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing methods rely on large batch sizes, memory banks, momentum encoders, or global synchronization mechanisms that substantially increase computational cost and training complexity. In this work, we propose Semantic Mutual Information (SMI), a lightweight self-supervised objective derived from a mutual-information-inspired dependency formulation under Gaussian assumptions. Unlike conventional correlation matching objectives that operate on high-dimensional feature correlation matrices, SMI performs optimization on a sample-level dependency matrix through a nonlinear transformation of pairwise correlations. This formulation induces distinct optimization dynamics that emphasize strongly dependent semantic pairs while maintaining representation diversity. Experimental results on ImageNet using a ResNet-50 backbone demonstrate that SMI achieves competitive linear evaluation performance relative to state-of-the-art SSL approaches while substantially reducing computational complexity. Across multiple low-resource benchmarks, SMI consistently improves transfer performance over Barlow Twins, particularly on fine-grained datasets. Furthermore, analyses of optimization dynamics and representation geometry suggest improved alignment-redundancy balance, greater feature diversity, and more spatially localized semantic representations. These results indicate that nonlinear dependency optimization provides an effective and computationally efficient alternative to conventional correlation-based self-supervised learning objectives.
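
A minimal sketch of the sample-level dependency matrix, not the paper's objective. The abstract does not give the loss, so the code assumes the nonlinear transformation is the Gaussian mutual-information formula I(rho) = -0.5 * ln(1 - rho^2) applied to Pearson correlations between pairs of sample embeddings. All function names and the weighting in `toy_loss` are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length embedding vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v + 1e-12)


def gaussian_mi(rho, eps=1e-6):
    """Mutual information (nats) of two jointly Gaussian scalars with correlation rho.

    The argument of the log is clamped at eps so that |rho| near 1 stays finite.
    """
    return -0.5 * math.log(max(1.0 - rho * rho, eps))


def dependency_matrix(view_a, view_b):
    """N x N matrix; entry (i, j) relates sample i of view A to sample j of view B."""
    return [[gaussian_mi(pearson(a, b)) for b in view_b] for a in view_a]


def toy_loss(view_a, view_b, off_diag_weight=1.0):
    """Hypothetical loss: reward matched pairs (diagonal), penalize the rest.

    Assumes at least two samples per view.
    """
    m = dependency_matrix(view_a, view_b)
    n = len(m)
    on_diag = sum(m[i][i] for i in range(n)) / n
    off_diag = sum(m[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return -on_diag + off_diag_weight * off_diag
```

Because this transform depends on rho squared, strongly anti-correlated pairs score as high as strongly correlated ones, and values near |rho| = 1 dominate the matrix. That is one way a nonlinear transform can emphasize strongly dependent pairs; whether SMI does it this way is not stated in the abstract.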


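2606.08572 2026-06-09 cs.CV New submission

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

Ishaan Singh Chandok, Core Francisco Park

Affiliations: GitHub

AI summary: To address the time-consuming human verification and correction in scientific data annotation, proposes a behavioral cloning framework with 9 synthetic tasks that simulate expert strategies; key findings include hierarchical skill acquisition, efficient fine-tuning after multi-task pretraining, and a shared internal representation of mistakes.

Comments: ICML 2026 Oral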

Abstract

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.


2606.07618 2026-06-09 cs.LG cs.AI cs.CV Cross-listing

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

Li Lin, Xiaojun Wan

Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: Proposes ScaleSweep, which improves scale initialization for NVFP4 quantization by sweeping feasible block-scale candidates and selecting the one that minimizes a target objective; derives theoretical bounds on the sweep range; improves quantization performance on Llama and Qwen models and narrows the gap to full precision.

Comments: Under review

Abstract

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
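
The sweep can be illustrated with a small sketch under simplifying assumptions: a block is quantized to the FP4 E2M1 magnitudes with a real-valued scale (the FP8 rounding of NVFP4 block scales is ignored), the objective is plain MSE, and the candidate `ratios` are hypothetical placeholders rather than the bounds derived in the paper.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block, scale):
    """Round each value to the nearest scale * level, keeping its sign."""
    out = []
    for x in block:
        level = min(FP4_LEVELS, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(level * scale, x))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def absmax_scale(block):
    """AbsMax initialization: map the largest magnitude onto the top level."""
    return max(abs(x) for x in block) / FP4_LEVELS[-1]


def sweep_scale(block, ratios):
    """Try base * r for each r in ratios and keep the scale with lowest MSE."""
    base = absmax_scale(block)
    if base == 0.0:
        return 1.0  # all-zero block: any positive scale reconstructs it exactly
    return min((base * r for r in ratios),
               key=lambda s: mse(block, quantize_block(block, s)))
```

If `ratios` contains 1.0, the swept scale can never do worse than AbsMax on the chosen objective, because the AbsMax scale is itself one of the candidates.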


2606.07780 2026-06-09 cs.AI cs.CV cs.LG Cross-listing

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

Affiliations: Earth System Science Center, University of Alabama in Huntsville; Space and Earth Science Data Analysis; NASA Marshall Space Flight Center

AI summary: Evaluates satellite flood mapping with the Prithvi-EO-2.0 model across 19 global flood events and finds that detection accuracy depends on land cover and flood type: cropland and riverine floods are detected comparatively well, while detection in tree-covered and built-up areas is near zero.

Abstract

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
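
For reference, the two agreement scores quoted above can be computed from binary flood masks as follows. This is a generic sketch of the standard definitions, not the authors' evaluation pipeline; masks are flat lists of 0/1 values, and returning 1.0 when both masks are empty is an assumed convention.

```python
def confusion(pred, ref):
    """Count true positives, false positives, and false negatives."""
    tp = sum(1 for p, r in zip(pred, ref) if p and r)
    fp = sum(1 for p, r in zip(pred, ref) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref) if not p and r)
    return tp, fp, fn


def iou(pred, ref):
    """Intersection over union of the flooded class."""
    tp, fp, fn = confusion(pred, ref)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


def f1(pred, ref):
    """F1 score (Dice coefficient) of the flooded class."""
    tp, fp, fn = confusion(pred, ref)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0
```

On a single pair of masks the two scores are linked by F1 = 2 * IoU / (1 + IoU), so an IoU of 52% corresponds to an F1 of about 0.68 for that pair. The identity only holds when both scores are computed on the same masks.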


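2606.08204 2026-06-09 cs.LG cs.CV Cross-listing

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

Affiliations: Zuse Institute Berlin (ZIB); Cartesia AI; Technische Universität Berlin

AI summary: Proposes LH-NeF, a framework that uses hierarchy and locality priors to learn general-purpose tokenized representations of continuous signals; replacing meta-learning with feed-forward encoding cuts memory use by 42× and allows 133× larger batches, matching or exceeding multiple baselines on images, 3D shapes, and climate fields.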

Abstract

Neural fields parameterize data as functions from coordinates to values, providing a unified framework for representation learning across modalities. Existing approaches are dominated by per-sample meta-learning, which scales poorly due to memory-intensive inner-loop optimization. The natural alternative, feed-forward encoding, typically introduces modality-specific assumptions, sacrificing the generality that makes learning with neural fields attractive. We argue that locality and hierarchy are useful priors for learning field representations that can be injected without compromising modality-agnosticism. We propose LH-NeF, a framework to learn general-purpose tokenized representations of continuous signals. A locality-preserving hierarchical encoder maps raw coordinate-value field observations to structured tokens, from which the field is reconstructed during training. By replacing meta-learning's inner loop with a single forward pass, LH-NeF uses 42× less memory and supports 133× larger batches than the strongest modality-agnostic baseline. Across images, 3D shapes, and climate fields, our learned representations match or exceed performance of modality-agnostic, modality-specific, and specialized generative neural field baselines on both reconstruction and downstream tasks.


2606.08437 2026-06-09 eess.IV cs.CV Cross-listing

X-Palm: Paired Multispectral-to-Smartphone Dataset for Cross-Domain Palmprint Authentication

Jamal Seyedmohammadi, Pai Chet Ng, Angelo Genovese, Zhixiang Chi, Jeannie Lee, Konstantinos N. Plataniotis

Affiliations: Singapore Institute of Technology; Università degli Studi di Milano; University of Toronto

AI summary: To address the domain gap between controlled enrollment and unconstrained authentication in palmprint recognition, introduces X-Palm, the first paired-identity multispectral-to-smartphone cross-domain dataset, with 6,006 images covering large modality and environmental shifts; experiments show that existing models degrade severely on it, while models trained on X-Palm are robust across domains.

Abstract

The palmprint modality offers a privacy-preserving biometric solution, yet its deployment is hindered by the domain gap between controlled enrollment and unconstrained authentication. Existing datasets are largely restricted to controlled setups and fail to capture the compound variability of real-world environments. In this paper, we introduce X-Palm, a cross-domain dataset comprising 6,006 palm images from 103 individuals (206 hands). To the best of our knowledge, X-Palm is the first palmprint dataset providing novel paired-identity acquisition specifically designed to bridge the gap between reliably controlled multispectral enrollment and unconstrained mobile authentication while encompassing a broad spectrum of in-the-wild variability. Unlike existing datasets that focus on one or a few variations, X-Palm addresses the massive modality and environmental shifts encountered in practical deployments by capturing paired data for identities across two distinct domains: (1) a controlled Multispectral Palmprint setting using our custom-developed scanner, and (2) an unconstrained smartphone palmprint setting that is participant-driven, incorporating simultaneous variations in hardware, hand pose, illumination, background, camera-to-hand distance, perspective, and palm surface conditions (e.g., moisture and occlusions). Our extensive benchmarks of 12 SOTA models reveal that while existing methods achieve high performance on controlled data, they experience severe performance collapse on X-Palm. Conversely, models trained on X-Palm demonstrate consistent robustness across domains, positioning X-Palm as a valuable resource for training models towards real-world, cross-domain generalization. Data access instructions and the related benchmarking code are publicly available at: https://github.com/X-Palm/X-Palm-2026


2606.08574 2026-06-09 cs.LG cs.CV Cross-listing

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng

Affiliations: The Chinese University of Hong Kong; Beijing Normal-Hong Kong Baptist University; Guangzhou Nanfang College; Institute of Automation, Chinese Academy of Sciences; Xiangtan University; University of Utah

AI summary: Proposes OrderDP, a framework that combines random subset sampling with top-q sample selection to obtain gradient estimates that are unbiased with respect to a surrogate loss, with theoretical convergence and generalization guarantees; on CIFAR and ImageNet it cuts training cost by over 40% while maintaining accuracy.

Comments: Published as a conference paper at ICLR 2026

Journal ref: International Conference on Learning Representations (ICLR), 2026

Abstract

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control -- all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.
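
The two-step selection described above can be sketched as follows. The ranking criterion is an assumption (per-sample loss, largest first), since the abstract does not say what the top-q ordering is based on, and `select_top_q` is a hypothetical helper, not the released OrderDP code.

```python
import random


def select_top_q(losses, subset_size, q, rng):
    """Draw a uniform random subset of sample indices, then keep the q
    indices in that subset with the largest loss.

    Returns (subset, kept) so the caller can inspect both steps.
    """
    subset = rng.sample(range(len(losses)), subset_size)
    ranked = sorted(subset, key=lambda i: losses[i], reverse=True)
    return subset, ranked[:q]


if __name__ == "__main__":
    rng = random.Random(0)
    losses = [0.1 * i for i in range(20)]  # placeholder per-sample losses
    subset, kept = select_top_q(losses, subset_size=8, q=3, rng=rng)
    print("random subset:", subset)
    print("kept for this step:", kept)
```

In this sketch, setting `subset_size` to the dataset size recovers plain loss-ranked top-q selection, and setting `q` equal to `subset_size` recovers uniform random sampling. The random first step is what separates the scheme from purely loss-ranked pruning.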


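2606.09059 2026-06-09 cs.LG cs.AI cs.CV Cross-listing

Stage-1 Controls the Entropy Regime, Not the Outcome

Jianxiong Shen

Affiliations: University of California, Berkeley

AI summary: Through small-data experiments, this paper studies the role of Stage-1 (SFT or OPD) in two-stage post-training and finds that it mainly affects the policy's entropy regime while having limited effect on final performance.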

Abstract

Two-stage post-training, a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL), is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow 53-54% band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by +2.1 points, reversing the -9.5-point drop of an over-trained variant. The clearest difference is the entropy regime: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 (+2.0 to +5.2 points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within 1.1 points) and on MathVista (six models within 1.2 points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.
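
The two quantities compared in the abstract can be made concrete with a short sketch. It assumes the common definitions: Shannon entropy of a next-token distribution, and the widely used unbiased pass@k estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The abstract does not state how either was computed, so treat both functions as illustrations.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution; zero terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples with c correct ones, is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when n - c < k, which gives pass@k = 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

With n = k = 16 the estimator reduces to asking whether any of the 16 samples is correct, which is why pass@16 tends to reward answer diversity more than single-sample accuracy does.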


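2606.09091 2026-06-09 cs.LG cs.CV Cross-listing

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

Affiliations: OPPO AI Center

AI summary: To address gradient instability caused by outlier states in on-policy distillation, proposes Globally Normalized Distillation Policy Optimization (GNDPO), which stabilizes optimization by converting KL scores into batch-level relative advantages, improving training robustness and performance on multimodal reasoning tasks.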

Abstract

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
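
One way to read "batch-level relative advantages" is a z-score over every token score in the batch. The sketch below assumes exactly that; the sign convention, the epsilon, and the function name are hypothetical, and the released GNDPO code may differ.

```python
import statistics


def batch_normalize(token_scores, eps=1e-8):
    """Normalize per-token scores with the mean and standard deviation of the
    whole batch (all sequences pooled), keeping the nested shape.

    token_scores: list of sequences, each a list of raw per-token scores.
    """
    flat = [s for seq in token_scores for s in seq]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)  # population standard deviation
    return [[(s - mean) / (std + eps) for s in seq] for seq in token_scores]


if __name__ == "__main__":
    raw = [[0.1, 0.2], [0.15, 50.0]]  # one outlier token score
    print(batch_normalize(raw))
```

In this toy batch the outlier 50.0 is hundreds of times larger than the other scores before normalization and maps to about 1.73 afterwards, which shows how a shared batch statistic bounds the influence of a single extreme token.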


2606.09169 2026-06-09 cs.AI cs.CV cs.MM Cross-listing

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

Affiliations: Zhejiang University; The University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Huawei

AI summary: Proposes IMUG-Bench, a benchmark for evaluating the understanding and generation capabilities of unified multimodal models in multi-turn interleaved image-text dialogue; it contains 3,113 samples and 12,034 interaction turns, reveals exposure bias on the generation side, and explores test-time scaling strategies.

Abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.


2606.09644 2026-06-09 cs.CL cs.CV Cross-listing

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

Affiliations: University of Waterloo

AI summary: For multi-view autonomous driving scenes, proposes a benchmark that evaluates whether multimodal large language models can identify the supporting camera view in visual question answering; it contains 122 conflict-centric question-answer pairs and separates view selection from answer correctness.

Abstract

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.


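2606.09718 2026-06-09 cs.LG cs.CV Cross-listing

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

Xiao Li, Yixuan Jia, Zekai Zhang, Xiang Li, Lianghe Shi, Jinxin Zhou, Zhihui Zhu, Liyue Shen, Qing Qu

Affiliations: University of Science and Technology of China

AI summary: Inspired by self-supervised learning, proposes ICR, a Fisher-based metric that decomposes features into invariant and residual components to jointly evaluate the representation and generation capabilities of diffusion models; invariance is strongest and classification performance best at intermediate noise levels, and ICR is a sensitive indicator of memorization during training.

Comments: First two authors contributed equally. Accepted at ICML 2026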

Abstract

Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.


2410.00713 2026-06-09 cs.CV Updated version

Self-Supervised Learning with Multi-Task Latent-Space Objectives

Pierre-François De Plaen, Abhishek Jha, Luc Van Gool, Tinne Tuytelaars, Marc Proesmans

Affiliations: ESAT-PSI, KU Leuven, Belgium; VIB.AI, KU Leuven, Belgium; CVL, ETH Zürich, Switzerland; INSAIT, Sofia University, Bulgaria; TRACE vzw

AI summary: Proposes a multi-task formulation of self-predictive Siamese SSL that resolves the failure of multi-crop training by assigning a dedicated predictor to each spatial transformation, improving linear evaluation by 3.8-4%, and introduces asymmetric cutout views that add a semantic inpainting objective to pre-training.

Abstract

We propose a multi-task formulation of self-predictive Siamese SSL in which each spatial transformation defines a distinct latent-space alignment task, solved by a dedicated predictor over a shared encoder. This perspective directly explains a long-standing failure of multi-crop training in self-predictive methods such as BYOL, SimSiam, and MoCo v3: a shared predictor is forced to solve heterogeneous alignment tasks simultaneously, leading to unstable optimization. Assigning one predictor per view type resolves this interference, unlocking linear evaluation gains of 3.8-4% across frameworks. This perspective also suggests a principled way to enrich pre-training by introducing additional spatial transformations as complementary tasks. We demonstrate this by introducing asymmetric cutout views, in which a masked online view is aligned with a complete target, forming a semantic inpainting objective. The resulting framework is stable, backbone-agnostic, and consistently improves the performance of ResNet and ViT models on ImageNet and COCO.

2602.24181 2026-06-09 cs.CV cs.AI 版本更新

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

混合饮食使DINO成为杂食视觉编码器

Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra

发表机构 * Google DeepMind（谷歌DeepMind）； University College London（伦敦大学学院）

AI总结针对DINOv2等预训练视觉编码器在不同视觉模态间特征对齐差的问题，提出杂食视觉编码器，通过后训练框架学习模态无关特征空间，实现跨模态鲁棒理解。

Comments CVPR 2026 Highlight

AI中文摘要

预训练的视觉编码器（如DINOv2）在单模态任务上表现出色。然而，我们观察到它们的特征在不同视觉模态之间对齐不佳。例如，同一场景的RGB图像及其对应深度图的特征嵌入，其余弦相似度与两个随机不相关图像几乎相同。为了解决这个问题，我们提出了杂食视觉编码器，一种学习模态无关特征空间的后训练框架。我们通过双重目标微调编码器：首先，最大化同一场景不同模态之间的特征对齐；其次，一个蒸馏目标，将学习到的表示锚定到完全冻结的教师模型。由此产生的学生编码器通过为给定场景生成更一致的嵌入（无论输入模态是RGB、深度、分割等）而变得“杂食”。这种方法在保留原始基础模型的判别语义的同时，实现了鲁棒的跨模态理解。杂食模型权重可在以下网址获取：https://github.com/google-deepmind/representations4d。

英文摘要

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their features are poorly aligned across different visual modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a post-training framework that learns a modality-agnostic feature space. We fine-tune the encoder with a dual objective: first, to maximize the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to a fully frozen teacher. The resulting student encoder becomes "omnivorous" by producing more consistent embeddings for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model. Omnivorous model weights are available at https://github.com/google-deepmind/representations4d.
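
The dual objective described above (cross-modal alignment plus distillation to a frozen teacher) can be sketched with plain cosine similarity. This is a minimal illustration under stated assumptions, not the paper's loss: using `1 - cosine` for both terms, the `distill_weight` parameter, and the function names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Hypothetical loss: alignment term plus weighted distillation term.

    alignment:    pulls the student's RGB and depth embeddings of one scene together
    distillation: keeps the student's RGB embedding close to the frozen teacher's
    """
    alignment = 1.0 - cosine_similarity(student_rgb, student_depth)
    distillation = 1.0 - cosine_similarity(student_rgb, teacher_rgb)
    return alignment + distill_weight * distillation
```

With orthogonal RGB and depth embeddings the alignment term is at 1.0, which corresponds to the "no better than two unrelated images" situation the abstract describes; the distillation term is what stops the student from collapsing away from the teacher's semantics while that gap is closed.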

2603.14342 2026-06-09 cs.CV cs.AI 版本更新

AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning

AgroOmni：一个大规模多视角农业数据集用于跨尺度多模态推理

Jiarui Zhang, Junqi Hu, Zurong Mai, Yang Liu, Yuhang Chen, Shuohong Lou, Henglian Huang, Hong Cheng, Lingyuan Zhao, Jianxi Huang, Yutong Lu, Haohuan Fu, Juepeng Zheng

发表机构 * Sun Yat-sen University（中山大学）； Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）； HuanTian Wisdom Technology Co., Ltd.（慧天智慧科技有限公司）； China Agricultural University（中国农业大学）； Southwest Jiaotong University（西南交通大学）； National Supercomputing Center in Shenzhen（深圳国家超算中心）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文提出AgroOmni数据集，通过288K视觉问答对覆盖56个专业任务类别，解决多视角跨尺度农业多模态推理中的尺度偏差问题，提出AgroNVILA模型在AgroMind基准上达到62.32%的SOTA成绩。

AI中文摘要

现代农业数据来源于多样化的平台，涵盖多个空间尺度，从地面级近距离摄影到无人机（UAV）航空观测和卫星遥感图像。因此，农业多模态推理需要强大的跨尺度空间理解。然而，由于缺乏多视角农业基准数据集，现有多模态大语言模型（MLLMs）表现出严重的地面级偏差，导致农业感知任务中出现尺度混淆和语义崩溃，例如将农田图像误认为墙壁或地板。为此，我们引入AgroOmni，一个大规模多视角训练语料库，包含288K个视觉问答对，覆盖56个专业任务类别，跨14种任务类型，旨在捕捉现代农业精准农业中的多样化尺度。基于此数据集，我们提出AgroNVILA，其在AgroMind基准上达到62.32%的最新SOTA成绩（比GPT-5.2高15.03%），有效缓解了多视角跨尺度差距，实现了整体农业理解。对AgMMU的诊断评估进一步揭示了宏观先验与微观诊断之间的固有异质性，通过受约束的零样本性能。同时，即使最小的微调也使AgroNVILA在AgMMU上实现了显著的性能提升，强有力地证明了其由AgroOmni赋能的泛化能力。完整的训练脚本已公开在https://anonymous.4open.science/r/AgroOmni-6510。

英文摘要

Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photography to Unmanned Aerial Vehicle (UAV) aerial observation and satellite remote sensing imagery. Accordingly, agricultural multimodal reasoning demands robust cross-scale spatial understanding. However, due to the lack of multi-view agricultural benchmark datasets, existing multimodal large language models (MLLMs) exhibit severe ground-level bias, which leads to scale confusion and then semantic collapse in agricultural perception tasks, such as misinterpreting farmland imagery as walls or floors. To address this, we introduce AgroOmni, a large-scale multi-view training corpus with 288K Visual Question Answering pairs covering 56 specialized task categories across 14 task types, designed to capture diverse scales in modern precision agriculture. Built on this dataset, we propose AgroNVILA, which achieves a new state-of-the-art of 62.32% on the AgroMind benchmark (+15.03% over GPT-5.2), effectively mitigating the multi-view cross-scale gap for holistic agricultural understanding. Diagnostic evaluations on AgMMU further reveal an inherent heterogeneity between macro-priors and micro-diagnostics through constrained zero-shot performance. Meanwhile, even minimal fine-tuning leads to a dramatic performance gain of AgroNVILA on AgMMU, strongly demonstrating its generalization capability empowered by AgroOmni. Full training scripts are publicly available at https://anonymous.4open.science/r/AgroOmni-6510.

2603.25726 2026-06-09 cs.CV 版本更新

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

AnyHand：一个大规模合成数据集用于RGB（-D）手姿态估计

Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

发表机构 * University of California, San Diego（加州大学圣迭戈分校）； Lambda, Inc（Lambda公司）； Imperial College London（伦敦帝国理工学院）； Nanyang Technological University（南洋理工大学）

AI总结 AnyHand通过提供大规模RGB-D图像和丰富的几何标注，提升了3D手姿态估计的性能，证明了数据多样性和质量对模型效果的重要性。

AI中文摘要

我们介绍了AnyHand，一个大规模合成数据集，旨在推动3D手姿态估计的前沿。尽管近期基于基础模型方法的工作表明，扩大训练数据显著提高了手姿态估计，但现有真实世界数据集在覆盖范围上有限，且先前合成数据集很少能同时大规模提供遮挡、手臂细节和对齐的深度信息。为解决这一瓶颈，我们提出的AnyHand包含250万张单手和410万张手-物体交互的RGB-D图像，具有丰富的几何标注。我们展示了用AnyHand扩展现有RGB基线的原始训练数据配方可显著提升多个基准（FreiHAND和HO-3D）的性能，即使在保持架构和训练方案不变的情况下。结合对训练数据规模和组成设置的广泛消融分析，这些结果表明，训练数据的多样性和质量与规模一样关键，对于推动手姿态估计的发展至关重要。我们进一步在附录中检验了AnyHand对齐深度图的实用性，显示使用AnyHand扩展RGB-D监督可使现有RGB基线的轻量深度融合变体超越先前的RGB-D方法。

英文摘要

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation. While recent works with foundation approaches have shown that scaling training data markedly improves hand pose estimation, existing real-world datasets are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our proposed AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. We show that extending the original training data recipes of existing RGB baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architectures and training schemes fixed. Together with extensive ablations on the scale and composition of the training data setups, these results suggest that training data diversity and quality are as critical as scale for advancing hand pose estimation. We further examine the utility of AnyHand's aligned depth maps in the appendix, showing that scaling RGB-D supervision with AnyHand allows a lightweight depth-fusion variant of existing RGB baselines to outperform prior RGB-D methods.

2603.26763 2026-06-09 cs.CV cs.MM eess.IV 版本更新

A Camera-Native Talking-Head Video Dataset for Various Computer Vision Tasks

面向各种计算机视觉任务的相机原生谈话头视频数据集

Babak Naderi, Ross Cutler, Nabakumar Singh Khongbantabam

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文提出一个包含847个谈话头视频的数据集，用于评估视频压缩、超分辨率和质量评估等任务，展示了其在实时通信中的应用价值。

AI中文摘要

谈话头视频是实时通信中的主要内容类型，但该领域视频处理研究的公开数据集仍然稀缺且信号保真度有限。本文开源了一个包含847个谈话头视频的相机原生数据集（约212分钟），每个视频持续15秒，由805名参与者使用446个不同的消费级网络摄像头设备在自然环境中录制。所有视频均使用FFV1无损编码器存储，保留相机原生信号——未压缩（24.4%）或MJPEG编码（75.6%）——而不进行额外的有损处理。每个视频都标注了平均意见分数（MOS）和十个感知质量标记，共同解释了64.4%的MOS方差。从该数据集中，我们挑选出一个包含120个视频片段的分层基准子集，分为三种内容条件：原始、背景虚化和背景替换。在四个数据集和四个编码器（H.264、H.265、H.266和AV1）上的编码效率评估显示，H.266相对于H.264的VMAF BD-rate节省高达-71.3%，编码器×数据集（η_p² = 0.112）和编码器×内容条件（η_p² = 0.149）的交互显著，表明内容类型和背景处理都会影响压缩效率。使用四个超分辨率模型的初步评估显示，该数据集显著影响绝对性能，但保持模型排名，证明其在编码器基准测试之外的应用价值。该数据集的规模是现有最大谈话头摄像头数据集的5倍（847 vs. 160个视频），具有无损信号保真度，为视频压缩、超分辨率、质量评估和增强模型的实时通信基准测试提供了资源。

英文摘要

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a camera-native dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to -71.3% (H.266) relative to H.264, with significant encoder×dataset (η_p² = .112) and encoder×content condition (η_p² = .149) interactions, demonstrating that both content type and background processing affect compression efficiency. A preliminary super-resolution evaluation with four SR models confirms that the dataset significantly affects absolute performance while preserving model rankings, demonstrating applicability beyond codec benchmarking. The dataset offers 5× the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for benchmarking video compression, super-resolution, quality assessment, and enhancement models in real-time communication.
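
The headline figures in this abstract can be checked with a little arithmetic, and the reported effect sizes follow the standard partial eta squared definition, SS_effect / (SS_effect + SS_error). The sketch below is only a sanity check: the function names are hypothetical, and the sums of squares passed to `partial_eta_squared` in any usage are invented placeholders, since the abstract does not report them.

```python
def total_minutes(num_clips, seconds_per_clip):
    """Total footage length in minutes for equally long clips."""
    return num_clips * seconds_per_clip / 60

def scale_ratio(new_size, prior_size):
    """How many times larger the new dataset is than the prior one."""
    return new_size / prior_size

def partial_eta_squared(ss_effect, ss_error):
    """Standard partial eta squared: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)

# 847 clips of 15 s each is 211.75 minutes, consistent with "approximately 212 minutes".
minutes = total_minutes(847, 15)

# 847 / 160 is about 5.29, consistent with the "5x" scale claim.
ratio = scale_ratio(847, 160)
```

For reading the effect sizes: an interaction with η_p² = .112 accounts for roughly 11% of the variance that remains once the other modelled effects are set aside.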

2604.10999 2026-06-09 cs.CV 版本更新

SleepWalk：一种三层压力测试基准，用于指导的视觉-语言导航

Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das

发表机构 * Indian AI Research Organization (IAIRO)（印度人工智能研究组织）； Pragya Lab, BITS Pilani Goa（BITS Pilani Goa 的 Pragya 实验室）； University of Dhaka（达卡大学）； Delhi Technological University（德里技术大学）； Apple（苹果公司）； Meta

AI总结本文提出SleepWalk基准，用于评估基于指令的轨迹预测，针对局部化、交互导向的具身推理，揭示当前VLM在空间推理中的系统性失败。

AI中文摘要

视觉-语言模型（VLMs）在多模态感知和语言理解方面迅速发展，但尚不清楚它们是否能可靠地将语言接地为在3D数字环境中空间一致且可能执行的动作。我们引入SleepWalk，一个评估单场景3D世界中基于指令的轨迹预测的基准，这些场景由文本场景描述生成，并经过可导航性筛选。与以往以跨房间长距离探索为中心的导航基准不同，SleepWalk针对局部化、以交互为中心的具身推理：给定渲染的视觉观察和自然语言指令，模型必须预测一个尊重场景几何、避免碰撞并终止在动作兼容位置的轨迹。该基准涵盖多样化的室内和室外环境，并将任务分为三层空间和时间难度，使在不断增加的组合复杂性下对接地进行细粒度分析成为可能。使用标准化的逐点评判式评估协议，我们在2,472个精心筛选的3D环境上评估了三种前沿VLMs，每个场景有九条指令。结果揭示了接地空间推理中的系统性失败，尤其是在遮挡、交互约束和多步指令下：随着任务难度等级的增加，性能下降。总体而言，当前VLMs只能在一定程度上生成同时具备空间一致性、可执行性并与预期动作一致的轨迹。通过在受控且可扩展的设置中暴露失败，SleepWalk为推进基于接地的多模态推理、具身规划、视觉-语言导航和3D环境中具备行动能力的代理提供了关键基准。

英文摘要

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as the difficulty level of the tasks increases. In general, current VLMs are only partly able to produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

2606.00793 2026-06-09 cs.CV 版本更新

MBench: A Comprehensive Benchmark on Memory Capability for Video World Models

MBench: 视频世界模型记忆能力的综合基准

Shengjun Zhang, Zhang Zhang, Simin Huang, Zhenyu Tang, Hanyang Wang, Chensheng Dai, Min Chen, Yifan Li, Yuxin Li, Yingjie Chen, Hao Liu, Chen Li, Jing Lyu, Yueqi Duan

发表机构 * Tsinghua University（清华大学）； WeChat Vision, Tencent Inc.（微信视觉，腾讯公司）； Peking University（北京大学）

AI总结提出MBench基准，通过实体一致性、环境一致性和因果一致性三个核心维度及其12个子维度，系统评估视频世界模型的长期记忆能力，并揭示现有方法在长期状态保持上的关键局限。

Comments Project Page: https://peanutup.github.io/MBench-project/

AI中文摘要

近期基于视频的世界模型在合成高保真视觉序列方面展现了前所未有的能力。然而，在视觉上合理的视频生成与世界模型的功能要求之间仍存在根本差距，特别是在长时间跨度内维持稳定且合理的内部状态方面。现有基准主要强调视觉质量、运动一致性和文本-视频对齐，但很大程度上忽略了记忆——世界模型在长期跨度和复杂交互中保持一致性的核心能力。为解决这一差距，我们提出了MBench，一个专门用于量化和评估视频世界模型记忆能力的综合基准。我们系统地将视频世界模型的记忆能力分解为三个层次化且互补的核心维度：实体一致性、环境一致性和因果一致性，这些维度进一步细化为12个可量化的子维度，以全面表征长期记忆。我们的基准基于严格策划的真实拍摄长视频，并通过基于规则的量化指标和VLM进行评估，以实现客观且全面的一致性评估。对主流最先进视频世界模型的广泛评估揭示了现有方法在长期状态保持方面的关键系统性局限，为推进该领域提供了标准化基准和明确的研究方向。

英文摘要

Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present MBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos and is evaluated with rule-based quantitative metrics and a VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.

2606.04409 2026-06-09 cs.CV cs.AI cs.LG 版本更新

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

数据规模、模型复杂度和输入模态对视觉泛化影响的实证研究

Yidi Zhouluo

发表机构 * School of Medical Information and Artificial Intelligence, Shandong First Medical University（医学信息与人工智能学院，山东第一医科大学）

AI总结通过一维非线性函数和CIFAR数据集实验，实证分析数据规模、模型复杂度和输入模态对视觉泛化性能的影响。

Comments 12 pages, 9 figures, 4 tables

AI中文摘要

现代深度神经网络通常具有较大的参数规模和非线性层次结构，在计算机视觉中取得了强劲性能。然而，其泛化性能的来源仍然难以用传统统计学习理论解释。在可能影响视觉泛化的因素中，数据规模、模型复杂度和输入模态是基础且可控的变量。本研究实证分析了这三个因素如何影响模型泛化性能。具体而言，在初步实验中，我们构建了一维非线性函数，并改变训练样本数量和多项式次数，以观察数据规模和模型复杂度对模型性能的影响。在主要实验中，我们比较了CIFAR-10和CIFAR-100上不同训练数据规模、模型架构和输入模态下的模型性能。实验结果表明，增加训练数据规模持续改善泛化性能，而模型复杂度的变化并未带来稳定提升。此外，去除颜色信息会降低模型性能，而梯度、边缘和小波等显式先验特征在不同模型架构上的效果不一致。总体而言，本研究提供了数据规模、模型复杂度、输入模态与视觉泛化性能之间关系的实证分析。代码和实验日志见：https://github.com/zlyd-CV/DeepLearning-Empirical-Studies。

英文摘要

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.

2505.19662 2026-06-09 cs.AI cs.CV 版本更新

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

FieldWorkArena：面向真实作业任务的代理AI基准测试

Jun Takahashi, Atsunori Moteki, Akiyoshi Uchida, Shoichi Masui, Fan Yang, Kanji Uchino, Yueqi Song, Yonatan Bisk, Graham Neubig, Ikuo Kusajima, Yasuto Watanabe, Hiroyuki Ishida, Koki Nakagawa, Shan Jiang

发表机构 * Fujitsu Limited（富士通株式会社）； Fujitsu Research of America（富士通美国研究部）； Carnegie Mellon University（卡内基梅隆大学）； Master’s Student, The University of Tokyo（东京大学硕士研究生）； Agent Research Collective（代理研究集体）

AI总结本文提出FieldWorkArena，用于评估代理AI在真实制造业和零售环境中的性能，通过现场采集的数据和实地访谈设计任务，验证多模态大语言模型的评估可行性。

Comments 27 pages, 10 figures, 7 tables [ICPR 2026 Accepted] Changes from previous version: added supplemental material

AI中文摘要

本文介绍FieldWorkArena，一个针对真实世界作业任务的代理AI基准测试平台。随着对代理AI的需求增加，此类系统旨在检测和记录安全隐患、程序违规等关键事件。与大多数专注于模拟或数字环境的基准测试不同，我们的工作解决了在真实世界中评估代理的挑战。本文改进了之前的评估函数，以评估代理AI在多样化真实任务中的性能。数据集包含工厂、仓库和零售现场采集的图像和视频。任务通过与现场工人和管理人员的访谈精心设计。评估结果证实，考虑多模态大语言模型（如GPT-4o）特性进行性能评估是可行的。此外，本研究确定了所提新评估方法的有效性和局限性。完整数据集和评估程序可在网站（https://en-documents.research.global.fujitsu.com/fieldworkarena/）上公开获取。

英文摘要

This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for agentic AI, such agents are being built to detect and document safety hazards, procedural violations, and other critical incidents across real-world manufacturing and retail environments. Whereas most agentic AI benchmarks focus on performance in simulated or digital environments, our work addresses the fundamental challenge of evaluating agents in the real world. In this paper, we improve the evaluation function from previous methods to assess the performance of agentic AI in diverse real-world tasks. Our dataset comprises images and videos captured on site in factories, warehouses, and retail stores. Tasks were meticulously developed through interviews with site workers and managers. Evaluation results confirmed that performance evaluation considering the characteristics of multimodal LLMs (MLLMs) such as GPT-4o is feasible. Furthermore, this study identifies both the effectiveness and limitations of the proposed new evaluation methodology. The complete dataset and evaluation program are publicly accessible on the website (https://en-documents.research.global.fujitsu.com/fieldworkarena/).

2601.04498 2026-06-09 cs.LG cs.CV 版本更新

IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

IGenBench：文本到信息图生成可靠性基准测试

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen

发表机构 * State Key Lab of CAD&CG, Zhejiang University（浙江大学CAD与CG国家重点实验室）； UESTC ； University of Virginia（弗吉尼亚大学）； HKUST(GZ)（香港科技大学（广州））； Cornell University（康奈尔大学）； Zhejiang University（浙江大学）； National University of Singapore（新加坡国立大学）

AI总结提出IGENBENCH基准，包含30种信息图类型和600个测试用例，通过多模态大语言模型分解为10类原子问题评估10种T2I模型，发现数据完整性等维度是普遍瓶颈。

AI中文摘要

信息图是结合数据可视化与文本和插图元素的复合视觉制品，用于传达信息。虽然最近的文本到图像（T2I）模型可以生成美观的图像，但它们在生成信息图方面的可靠性仍不清楚。生成的信息图可能乍看正确，但包含容易被忽视的问题，例如扭曲的数据编码或错误的文本内容。我们提出了IGENBENCH，这是第一个评估文本到信息图生成可靠性的基准，包含跨越30种信息图类型的600个精心设计的测试用例。我们设计了一个自动评估框架，将可靠性验证分解为基于10种问题类型的原子是否问题。我们使用多模态大语言模型（MLLM）验证每个问题，得到问题级准确率（Q-ACC）和信息图级准确率（I-ACC）。我们在IGENBENCH上全面评估了10个最先进的T2I模型。我们的系统分析揭示了未来模型开发的关键见解：（i）三级性能层次，顶级模型的Q-ACC为0.90，但I-ACC仅为0.49；（ii）数据相关维度成为普遍瓶颈（例如，数据完整性：0.21）；（iii）所有模型实现端到端正确性的挑战。我们在https://igen-bench.vercel.app/发布IGENBENCH。

英文摘要

Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.
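
The gap between a Q-ACC of 0.90 and an I-ACC of 0.49 is easier to see with a small example. The sketch below assumes, as the abstract's wording suggests but does not spell out, that Q-ACC is the fraction of individual yes/no questions passed and that an infographic counts toward I-ACC only when every one of its questions passes. The function names and the sample results are hypothetical.

```python
def q_acc(results):
    """Question-level accuracy: fraction of all yes/no checks that passed.

    results: list with one entry per infographic, each a list of booleans.
    """
    answers = [ok for questions in results for ok in questions]
    return sum(answers) / len(answers)

def i_acc(results):
    """Infographic-level accuracy (assumed): fraction with every check passed."""
    return sum(all(questions) for questions in results) / len(results)

# Invented example: four infographics, ten checks each.
sample = [
    [True] * 10,                # fully correct
    [True] * 9 + [False],       # one overlooked error
    [True] * 8 + [False] * 2,   # two errors
    [True] * 10,                # fully correct
]
```

On this invented sample, 37 of 40 checks pass (Q-ACC 0.925) while only 2 of 4 infographics are fully correct (I-ACC 0.5): a single failed check is enough to sink an otherwise correct infographic, so the infographic-level score falls much faster than the question-level one.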

2605.16223 2026-06-09 cs.GR cs.AI cs.CV 版本更新

Evaluating Design Video Generation: Metrics for Compositional Fidelity

评估设计视频生成：构成保真度的度量标准

Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

发表机构 * Lica World（Lica世界）； San Francisco, United States of America（美国旧金山）； ICML’26 Workshop on Human-AI Co-Creativity, Seoul, South Korea（ICML’26 人类-人工智能协同创作研讨会，韩国首尔）

AI总结本文提出一个自动化评估框架，用于评估设计动画中布局、动作正确性、时间质量和内容保真度，以替代主观人类评估，为该领域提供统一基准。

Comments ICML 2026 Workshop on Human-AI Co-Creativity

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF 版本更新

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

基于元学习的成本效益模型评估

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

发表机构 * Griffith University（格里菲斯大学）； Edith Cowan University（埃迪斯科文大学）； The University of Queensland（昆士兰大学）

AI总结提出MetaEvaluator，一种基于元学习的模型无关框架，通过参考模型池实现无标签数据上的快速、准确且成本效益高的新模型评估。

Comments Accepted by KDD 2026

AI中文摘要

机器学习的快速发展产生了不断扩展的模型生态系统，使得在未见过的未标记数据上验证新发布模型的可靠性变得越来越具有挑战性。传统的评估流程依赖于昂贵的标注、重复的微调或无法跨模型家族迁移的狭窄假设。我们提出了MetaEvaluator，一个成本效益高、模型无关的框架，用于快速、无标签地评估跨不同架构和模态的未见模型。MetaEvaluator利用参考模型池上的元学习来获得可迁移的初始化，从而能够准确评估新模型，同时将成本分摊到整个池中，并消除了每个模型重新训练的需要。据我们所知，这是第一个能够在完全未标记数据集上评估新模型的模型无关框架。大量实验表明，与传统方法相比，MetaEvaluator以显著降低的成本产生稳定且准确的性能估计，使得在未标记数据上对新出现的模型进行可扩展的基准测试变得实用。

英文摘要

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2606.04920 2026-06-09 cs.LG cs.CV 版本更新

DiffOR：一种统一的连续生成框架用于通用序数回归

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University（复旦大学）； Kuaishou Technology（快手科技）； Shanghai University of Finance and Economics（上海财经大学）； Tongji University（同济大学）

AI总结提出DiffOR框架，将序数回归建模为连续生成任务，利用扩散模型通过迭代去噪恢复连续序数值，并设计双解耦策略（多尺度增量聚合与动态去噪感知）保留序数拓扑，在12个基准上超越现有方法。

Comments Accepted at KDD 2026

DOI: 10.1145/3770855.3818149

AI中文摘要

序数回归（OR）旨在预测具有内在顺序的目标值，支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成，现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分，无法捕捉序数数据固有的非平稳语义转换。在本文中，我们提出了一种新范式，将OR形式化为连续生成序数回归任务。在该新范式下，我们引入了DiffOR，一个统一的框架，利用扩散模型通过迭代去噪恢复连续序数值，从而能够动态学习软语义转换。为了显式保留序数拓扑，我们设计了一种双解耦策略：在空间上，多尺度增量聚合将目标分解为层次化的连续增量；在时间上，动态去噪感知将去噪步骤与特征频率同步，确保稳健的从粗到细的细化。理论上，我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性，建立了一个新标准，展示了作为通用序数回归通用解决方案的强大潜力。

英文摘要

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07655 2026-06-09 eess.SP cs.CR cs.CV 交叉投稿

FADRW: A Feature-Aware Modulated and Dynamically Reweighted Loss for Few-Shot Linguistic Steganalysis

FADRW：一种面向少样本语言隐写分析的特征感知调制与动态重加权损失

Shuo Liu, Xianghong Lin, Yukun Wei, Zhongliang Yang

发表机构 * International School, Beijing University of Posts and Telecommunications（北京邮电大学国际学院）； School of Cyberspace Security, Beijing University of Posts and Telecommunications（北京邮电大学网络安全学院）

AI总结针对语言隐写检测中类别极度不平衡和特征边缘化问题，提出FADRW损失函数，通过动态重加权和特征感知调制提升少样本隐写分析性能。

Comments Accepted by IEEE Signal Processing Letters

DOI: 10.1109/LSP.2026.3700567

AI中文摘要

社交媒体平台的普及为恶意语言隐写提供了便利，带来了显著的安全风险。然而，模型训练中的两个基本问题严重阻碍了检测。首先，极端类别不平衡（隐写样本不足1%）导致强烈的决策偏差。其次，生成式隐写的不可见性使其特征与正常文本几乎无法区分；这种相似性加上其极端稀有性，导致严重的特征边缘化，微弱的隐写信号被完全淹没。为了直接应对这些优化层面的挑战，我们提出了FADRW（特征感知调制与动态重加权损失），一种专为少样本隐写分析设计的新型损失函数框架。FADRW采用动态重加权逐步抵消决策偏差，并通过特征感知调制模块在结构上重塑特征空间，通过增强这些细微特征的可分离性来防止特征边缘化。在来自三个真实社交平台的数据集上进行的大量实验表明，FADRW显著优于最先进的方法，尤其是在具有挑战性的少样本隐写样本场景中。

英文摘要

The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. However, detection is severely hampered by two fundamental issues during model training. Firstly, extreme class imbalance (less than 1% steganographic samples) induces a strong decision bias. Secondly, the invisibility of generative steganography means its features are nearly indistinguishable from benign text; this similarity, compounded by their extreme rarity, leads to severe feature marginalization, where faint steganographic signals are completely overwhelmed. To directly address these optimization-level challenges, we propose FADRW (Feature-Aware Modulated and Dynamically Reweighted Loss), a novel loss function framework engineered for few-shot steganalysis. FADRW employs Dynamic Reweighting to progressively counteract decision bias, and a Feature-Aware Modulation module to structurally reshape the feature space, preventing feature marginalization by enhancing the separability of these subtle features. Extensive experiments on datasets from three real-world social platforms demonstrate that FADRW significantly outperforms state-of-the-art methods, particularly in the challenging few-shot steganographic sample scenario.

2606.07718 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

评估AI代理在神经科学数据到发现流程中的案例研究

Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, Kristin Branson

发表机构 * Cornell University（康奈尔大学）； HHMI Janelia Research Campus（霍华德·休斯医学研究所珍利亚研究园区）

AI总结本研究评估通用编码代理在果蝇光遗传学数据到发现流程中的表现，发现代理能解决单个阶段任务，但端到端流程仍超出其能力，主要挑战包括缺乏预定义迭代标准和科学判断能力。

AI中文摘要

代理型AI工具为自动化科学研究流程中的软件开发瓶颈提供了有希望的路径，特别是对于那些需要领域专家花费数天到数月构建的阶段，科学家关心的是正确性和鲁棒性，而非实现细节。我们针对果蝇光遗传学数据到发现流程，对通用编码代理进行了实证研究。我们在比现有基准大得多的任务、数量级更大的数据集以及基于领域专家标准的评估标准上评估代理。我们表明，代理可以解决几个单独的流程阶段，这表明阶段级自动化是可行的。通过分析代理的代码迭代，我们发现当没有预定义的标准可供迭代时，它们最困难，此时它们必须利用自己的科学判断来评估当前解决方案，这是一个关键开放挑战。与科学实践相呼应，它们有时尝试对中间输出进行视觉检查以进行自我评估，但大多未能正确解释所见或据此采取行动。正确解决端到端流程需要将所有流程阶段的成功串联起来，这超出了代理当前的能力。我们识别出现有基准中基本缺失的挑战，包括计算资源管理和对大型保留数据集的泛化。最后，我们提炼出构建科学任务和针对开放问题的严格评估标准的原则。

英文摘要

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents' code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents' current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.


2606.07949 2026-06-09 q-bio.PE cs.CV eess.IV 交叉投稿

Feasibility to detect rapid change and disappearance of seagrass: Lessons from nearly 80 years of vegetation change in the Ako, Seto Inland Sea, Japan

检测海草快速变化和消失的可行性：来自日本濑户内海Ako近80年植被变化的教训

Takehisa Yamakita, Yoji Igarashi, Akira Eto, Ken Ishida, Masaaki Iiyama

发表机构 * Japan Agency for Marine-Earth Science and Technology (JAMSTEC)（日本海洋地球科学技术机构）； The University of Tokyo（东京大学）； Tokyo University of Marine Science and Technology（东京海洋大学）； Shiga University（滋贺大学）

AI总结本研究利用近80年的航拍和卫星影像，结合YOLO深度学习分割，分析了日本Ako潮滩海草床的长期动态，发现2025年Zostera marina在一年内几乎完全消失，表明这是一次由夏季水温升高驱动的快速生态系统转变，并提出了改进海草监测指标的建议。

详情

AI中文摘要

本研究分析了日本濑户内海的Ako潮滩，该地的大叶藻（Zostera marina）在2025年一年内几乎全部消失。利用1940年代以来的航拍照片、高分辨率卫星影像、GRUS图像（2.5-5米）以及每月Sentinel-2合成图像（10米），我们重建了约80年的海草分布。基于深度学习的YOLO分割在这些数据集上实现了高精度（总体精度≥0.9）；尽管无法区分物种，但模型捕捉了植被面积的主要时间动态。长期平均海草面积为6.8公顷，但数值波动很大，从1974年的3.5公顷到1989年的41.3公顷，除2025年的0.2公顷外。2019年至2026年的Sentinel-2合成图像显示出明显的季节性，植被在初夏增加，秋季开始减少。然而，2025年夏季后面积急剧下降，并在2025-2026年整个冬季保持异常低值。我们的结果表明，2025年的事件并非正常波动，而是一次快速生态系统转变，涉及优势冠层物种的丧失，最可能的原因是区域夏季水温升高。这些发现对海草基本海洋变量（EOVs）和TNFD对齐的自然相关披露中使用的自然状态（SoN）指标也有影响。与森林不同，海草草甸需要更精细的时间分辨率，因为显著的季节性和突然崩溃都会强烈影响面积指标。因此，除了先前指出的物种级分类精度等问题外，我们建议：（1）基线应在最长的可用记录上定义并进行生态学论证；（2）在年际比较前应用季节性标准化；（3）将面积异常极端的年份标记出来，而非用作参考点。

英文摘要

This study analyses the Ako tidal flat in the Seto Inland Sea, Japan, where nearly all Zostera marina disappeared within a single year in 2025. Using aerial photographs from the 1940s onward, high-resolution satellite imagery, GRUS images (2.5-5 m), and monthly Sentinel-2 composites (10 m), we reconstructed approximately 80 years of seagrass distribution. Deep-learning segmentation with YOLO achieved high accuracy (overall accuracy >= 0.9) across these datasets; although species could not be discriminated, the models captured the major temporal dynamics in vegetation area. The long-term mean seagrass area was 6.8 ha, but values fluctuated widely, from 3.5 ha in 1974 to 41.3 ha in 1989, apart from the exceptional 0.2 ha recorded in 2025. Sentinel-2 composites from 2019 to 2026 revealed clear seasonality, with vegetation increasing in early summer and declining from autumn. In 2025, however, the area decreased sharply after summer and remained anomalously low throughout the winter of 2025-2026. Our results indicate that the 2025 event was not a normal fluctuation but a rapid ecosystem shift involving the loss of the dominant canopy-forming species, most plausibly driven by regionally elevated summer water temperatures. The findings also have implications for seagrass Essential Ocean Variables (EOVs) and the State of Nature (SoN) metrics used in TNFD-aligned nature-related disclosures. Unlike forests, seagrass meadows require finer temporal resolution because both pronounced seasonality and abrupt collapse strongly influence area-based indicators. Therefore, in addition to previously noted issues such as species-level classification accuracy, we recommend that (1) baselines be defined over the longest available record and justified ecologically, (2) seasonal standardization be applied before inter-annual comparisons, and (3) years with extreme area anomalies be flagged rather than used as reference points.
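The three recommendations at the end of the abstract above can be illustrated with a small sketch. It assumes monthly area estimates stored as `(year, month, area_ha)` tuples. The function names, the ratio-to-baseline standardization, and the flagging thresholds are illustrative choices, not the authors' procedure, and the numbers are made up.

```python
from statistics import mean

def monthly_baseline(records, baseline_years):
    """Mean area per calendar month over the chosen baseline years.

    records: list of (year, month, area_ha) tuples.
    """
    by_month = {}
    for year, month, area in records:
        if year in baseline_years:
            by_month.setdefault(month, []).append(area)
    return {month: mean(values) for month, values in by_month.items()}

def seasonal_ratios(records, baseline):
    """Divide each observation by the baseline mean of the same calendar month."""
    return [(y, m, a / baseline[m]) for y, m, a in records if m in baseline]

def flag_extreme_years(ratios, low=0.25, high=4.0):
    """Years with any monthly ratio outside [low, high]; thresholds are arbitrary."""
    return sorted({y for y, _, r in ratios if not low <= r <= high})

# Made-up values for illustration only, not data from the study.
records = [
    (2023, 6, 8.0), (2023, 12, 4.0),
    (2024, 6, 10.0), (2024, 12, 6.0),
    (2025, 6, 9.0), (2025, 12, 0.2),
]
baseline = monthly_baseline(records, baseline_years={2023, 2024})
ratios = seasonal_ratios(records, baseline)
flagged = flag_extreme_years(ratios)
```

Comparing each month only against the same calendar month keeps the normal summer peak and autumn decline from being mistaken for an inter-annual change, and a flagged year can then be left out of the baseline.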


2606.08258 2026-06-09 cs.GR cs.CV cs.LG 交叉投稿

MS-COOT: Comparing Morse-Smale Complexes with Co-Optimal Transport

MS-COOT: 用共最优传输比较Morse-Smale复形

Guangyu Meng, Mingzhe Li, Erin Wolf Chambers

发表机构 * Department of Computer Science and Engineering, University of Notre Dame（圣母大学计算机科学与工程系）

AI总结提出MS-COOT距离，将Morse-Smale复形表示为超图，通过共最优传输联合匹配临界点和区域，实现区域级结构比较，在分类等任务中优于图方法。

详情

AI中文摘要

理解和比较标量场中的结构是科学可视化的核心挑战，应用范围从特征分析到时间和结构比较。Morse-Smale (MS) 复形通过将标量场分解为由梯度流诱导的区域提供了自然表示。然而，现有方法通常依赖于基于图的表示，捕获临界点之间的关系而丢弃区域级结构。在这项工作中，我们将MS复形表示为超图，其中临界点构成节点，区域定义超边。我们引入MS-COOT，一种共最优传输距离，联合计算临界点和区域之间的对应关系。这种公式化使得在基于距离的框架内能够进行显式的区域到区域匹配，从而识别诸如分裂和合并等区域级事件。我们使用领域特定组件实例化该框架，包括编码临界点-区域关系的超网络函数、强调拓扑显著特征的基于持久性的概率度量，以及包含临界点属性的样本代价项。我们在涵盖2D模拟、3D曲面网格和体积数据的五个数据集上评估MS-COOT。我们的结果表明，MS-COOT捕获了基于图的距离未反映的区域级结构变化，同时在分类和分辨率判别等下游任务中实现了强性能。

英文摘要

Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, with applications ranging from feature analysis to temporal and structural comparison. The Morse-Smale (MS) complex provides a natural representation by decomposing a scalar field into regions induced by gradient flow. However, existing approaches typically rely on graph-based representations, capturing relationships between critical points while discarding region-level structure. In this work, we represent the MS complex as a hypergraph, where critical points form nodes and regions define hyperedges. We introduce MS-COOT, a co-optimal transport distance that jointly computes correspondences between critical points and regions. This formulation enables explicit region-to-region matching within a distance-based framework, allowing identification of region-level events such as splitting and merging. We instantiate this framework with domain-specific components, including a hypernetwork function encoding critical point-region relationships, persistence-based probability measures that emphasize topologically significant features, and a sample cost term that incorporates critical point attributes. We evaluate MS-COOT on five datasets spanning 2D simulations, 3D surface meshes, and volumetric data. Our results show that MS-COOT captures region-level structural changes that are not reflected by graph-based distances, while achieving strong performance in downstream tasks such as classification and resolution discrimination.
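To make the hypergraph view concrete, the sketch below evaluates the standard co-optimal transport objective for two small hypergraphs given as node-by-hyperedge matrices, with both couplings held fixed. The binary incidence matrices stand in for the paper's hypernetwork function, and the uniform couplings stand in for its persistence-based measures; both are simplifying assumptions, and no solver is shown.

```python
def coot_cost(X, Y, pi_nodes, pi_edges):
    """Co-optimal transport objective for fixed couplings.

    X: n x m matrix (nodes x hyperedges) of the first hypergraph.
    Y: n' x m' matrix of the second hypergraph.
    pi_nodes: n x n' coupling between nodes (critical points).
    pi_edges: m x m' coupling between hyperedges (regions).
    """
    total = 0.0
    for i, row_x in enumerate(X):
        for j, row_y in enumerate(Y):
            w_nodes = pi_nodes[i][j]
            if w_nodes == 0.0:
                continue
            for k, x_ik in enumerate(row_x):
                for l, y_jl in enumerate(row_y):
                    total += (x_ik - y_jl) ** 2 * w_nodes * pi_edges[k][l]
    return total

# Toy hypergraph: 3 critical points, 2 regions (binary incidence).
X = [[1, 0], [1, 1], [0, 1]]
# Same structure with the two regions listed in the opposite order.
Y = [[0, 1], [1, 1], [1, 0]]

third, half = 1.0 / 3.0, 0.5
pi_nodes = [[third, 0.0, 0.0], [0.0, third, 0.0], [0.0, 0.0, third]]
edges_identity = [[half, 0.0], [0.0, half]]
edges_swapped = [[0.0, half], [half, 0.0]]

cost_identity = coot_cost(X, Y, pi_nodes, edges_identity)
cost_swapped = coot_cost(X, Y, pi_nodes, edges_swapped)
```

The swapped region coupling gives zero cost while the identity coupling does not, which is the sense in which the region coupling itself records a region-to-region match. A full method would alternate between optimizing the two couplings.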


2606.08469 2026-06-09 cs.GR cs.CV 交叉投稿

OctaOctree Neural Radiosity for Real-time Glossy Material Rendering

OctaOctree神经辐射度用于实时光泽材质渲染

Jierui Ren, Haojie Jin, Bo Pang, Meng Gai, Fei Zhu, Yisong Chen, Sheng Li

发表机构 * Peking University（北京大学）

AI总结提出OctaOctree表示，通过空间自适应八叉树耦合八面体方向图，高效编码高频出射辐射分布，实现单次网络查询的实时高质量全局光照。

Comments 11 pages, 9 figures

详情

AI中文摘要

建模高频出射辐射分布仍然是全局光照中的基本挑战，尤其是对于光泽和镜面材质。现有的基于神经的辐射缓存方法通常依赖于位置特征编码或空间组织的缓存，这使得在不增加模型复杂度或采样成本的情况下难以表示尖锐的方向辐射变化。为了应对这一挑战，我们提出了OctaOctree，一种用于全局光照的高效空间-角度辐射表示。OctaOctree在3D空间中使用自适应八叉树组织出射辐射，并将每个空间节点与一个八面体方向图关联。通过将空间层次与方向依赖存储耦合，我们的表示为局部光照和可见性变化分配精细的空间分辨率，同时使用更粗糙的空间层次和更丰富的角度分辨率来捕捉光泽和镜面辐射分布。这种设计直接将反射感知的空间-角度先验嵌入辐射表示中，减轻了神经网络或重建模块仅从位置特征恢复高频视角依赖效应的负担。因此，OctaOctree为从漫反射互反射到尖锐光泽反射的广泛间接光照效应提供了紧凑且富有表现力的神经编码。实验表明，我们的方法在主交点处通过单次网络查询产生高质量、方向感知的全局光照，与基线神经辐射度和辐射缓存方法相比，实现了更好的保真度和实时性能。

英文摘要

Modeling high-frequency outgoing radiance distributions remains a fundamental challenge in global illumination, especially for glossy and specular materials. Existing neural radiance caching methods commonly rely on positional feature encodings or spatially organized caches, which makes it difficult to represent sharp directional radiance variations without increasing model complexity or sampling cost. To address this challenge, we propose OctaOctree, an efficient spatial-angular radiance representation for global illumination. OctaOctree organizes outgoing radiance with an adaptive octree in 3D space, and associates each spatial node with an octahedral directional map. By coupling the spatial hierarchy with direction-dependent storage, our representation allocates fine spatial resolution to local illumination and visibility changes, while using coarser spatial levels with richer angular resolution to capture glossy and specular radiance distributions. This design embeds a reflectance-aware spatial-angular prior directly into the radiance representation, reducing the burden on neural networks or reconstruction modules to recover high-frequency view-dependent effects from positional features alone. As a result, OctaOctree provides a compact and expressive neural encoding for a wide range of indirect illumination effects, from diffuse interreflection to sharp glossy reflections. Experiments demonstrate that our method produces high-quality, direction-aware global illumination with a single network query at primary intersections, achieving improved fidelity and real-time performance compared with baseline neural radiosity and radiance caching approaches.
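The abstract above does not spell out the octahedral directional map, so the sketch below shows the widely used octahedral parameterization of the unit sphere, which is presumably close to what each octree node stores. The function names and the texel lookup are illustrative assumptions, not the paper's implementation.

```python
import math

def _sign(v):
    # Sign that treats zero as positive, so the fold below is well defined.
    return 1.0 if v >= 0.0 else -1.0

def oct_encode(d):
    """Map a non-zero 3D direction to (u, v) in [-1, 1]^2."""
    x, y, z = d
    s = abs(x) + abs(y) + abs(z)
    u, v = x / s, y / s
    if z < 0.0:
        # Fold the lower hemisphere over the diagonals of the square.
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return u, v

def oct_decode(u, v):
    """Inverse of oct_encode; returns a unit-length direction."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    n = math.sqrt(u * u + v * v + z * z)
    return u / n, v / n, z / n

def texel_index(u, v, res):
    """Integer texel of a res x res directional map that contains (u, v)."""
    i = min(res - 1, int((u * 0.5 + 0.5) * res))
    j = min(res - 1, int((v * 0.5 + 0.5) * res))
    return i, j

direction = (0.6, 0.0, -0.8)
uv = oct_encode(direction)
restored = oct_decode(*uv)
```

A higher `res` on a node gives finer angular resolution, which is the knob the abstract describes trading against spatial depth for glossy and specular radiance.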


2606.08652 2026-06-09 astro-ph.SR cs.AI cs.CV 交叉投稿

迈向可扩展的协同实践学习：协助计算机视觉和多模态分析

Xinyu Li, Linxuan Zhao, Yueqiao Jin, Yuchen Liu, Jin Zhou, Roberto Martinez-Maldonado, Dragan Gasevic, Lixiang Yan

发表机构 * Centre for Learning Analytics at Monash（莫纳什大学学习分析中心）； Monash University（莫纳什大学）； Department of Civil and Environmental Engineering（土木与环境工程系）； School of Education（教育学院）； The University of Hong Kong（香港大学）

AI总结本研究评估了固定摄像头流程在重复护理模拟中的效果，通过两阶段源到目标适应提升行为检测精度，并利用检测器得到的行为轨迹和区域信息支持可检索的模拟复盘（debriefing）。

详情

AI中文摘要

协同实践学习会在患者周围的可见动作、任务资源和房间区域中留下证据，但这些痕迹通常只能通过现场观察或事后视频回看来获取。固定广角视频可以减轻传感负担，但用于复盘（debriefing）的流程不能只做行为检测：它还必须在摄像头位置发生小幅变化后保持检测能力，将检测器得到的行为轨迹与指导员标注的结果相关联，并保留房间区域上下文。本研究在重复进行的护理模拟中评估了一个固定摄像头流程。基于统一的六类编码体系，我们在两个同一房间、侧视角的数据源上测试了YOLO26仅目标域训练和两阶段源到目标适应。随后，我们将51个指导员标注会话的检测结果转换为逐秒的行为轨迹和行为-区域轨迹，用于速率、有序网络、转换网络和序列分析。两阶段适应将2021目标视角的平均mAP50从0.815提升到0.848，将较小的2022目标视角的平均mAP50从0.690提升到0.855；在平衡的目标配额$N = 22$下，2022模型达到0.850 mAP50。在基于检测器行为轨迹的分析中，较多的手机使用是低任务表现会话的特征。区域标签改变了对患者互动的解读：在表现较好的会话中，主要患者护理区域的互动更强，而在表现较差的会话中，次要区域的互动更强。有序网络和转换网络模型表明，有序的房间区域关系提供了行为频率之外的信息，表现最好的任务表现分类器使用了区域特征和共现（co-presence）特征。最终得到的轨迹最适合用于可检索的模拟复盘，即由指导员查看检测到的时刻，而不是接收自动评估分数。

英文摘要

Co-located practical learning leaves evidence in visible actions around patients, task resources and room zones, but these traces are often recovered through live observation or retrospective video review. Fixed wide-angle video could reduce sensing burden, yet a debriefing pipeline must do more than detect behaviours: it must maintain detection after small camera-position shifts, relate the detector-derived behaviour trace to instructor-labelled outcomes and preserve room-zone context. This study evaluates a fixed-camera pipeline in repeated nursing simulation. Using a harmonised six-code taxonomy, we tested YOLO26 target-only training and two-stage source-to-target adaptation across two same-room side-view data sources. We then converted detections from 51 instructor-labelled sessions into one-second behaviour and behaviour-zone traces for rate, ordered-network, transition-network and sequence analyses. Two-stage adaptation improved mean mAP50 from 0.815 to 0.848 for the 2021 target view and from 0.690 to 0.855 for the smaller 2022 target view; with a balanced target quota of $N = 22$, the 2022 model reached 0.850 mAP50. In the detector-derived behaviour trace analyses, higher phone use characterised low task-performance sessions. Zone labels changed the interpretation of patient interaction: primary patient-care-zone interaction was stronger in higher-performance sessions, while secondary-zone interaction was stronger in lower-performance sessions. Ordered and transition network models showed that ordered room-zone relations contributed beyond behaviour frequency, with the strongest task-performance classifier using zoned and co-presence features. The resulting trace is most appropriate for searchable simulation debriefing, where instructors inspect detected moments rather than receive automated assessment scores.
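The conversion from detections to one-second traces described above can be sketched as follows. The behaviour codes, the majority rule within each second, and the function names are hypothetical; the abstract does not specify how the authors aggregate detections.

```python
from collections import Counter

def per_second_trace(detections, duration_s):
    """Collapse detections into one label per second.

    detections: iterable of (timestamp_s, behaviour_code).
    Returns the most frequent code in each second, or None if nothing was detected.
    """
    buckets = [Counter() for _ in range(duration_s)]
    for t, code in detections:
        sec = int(t)
        if 0 <= sec < duration_s:
            buckets[sec][code] += 1
    return [b.most_common(1)[0][0] if b else None for b in buckets]

def transition_counts(trace):
    """Count changes between consecutive, different, non-empty labels."""
    labels = [x for x in trace if x is not None]
    return Counter((a, b) for a, b in zip(labels, labels[1:]) if a != b)

def rate_per_minute(trace, code):
    """Seconds labelled with `code`, scaled to a per-minute rate."""
    return 60.0 * sum(1 for x in trace if x == code) / len(trace)

# Hypothetical detections for a 4-second clip.
detections = [(0.1, "patient"), (0.5, "patient"), (0.9, "phone"),
              (1.2, "phone"), (3.4, "patient")]
trace = per_second_trace(detections, duration_s=4)
transitions = transition_counts(trace)
phone_rate = rate_per_minute(trace, "phone")
```

Appending a zone label to each code (for example `"patient@primary"`) would yield the behaviour-zone variant of the same trace without changing any of these functions.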


2605.00358 2026-06-09 cs.CL cs.CV 版本更新

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

从后向分摊到正向回放：重新审视LLM参数编辑中的目标构造

Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee

发表机构 * University of Cambridge（剑桥大学）； University of Edinburgh（爱丁堡大学）

AI总结本文重新审视LLM参数编辑中的目标构造，提出一种更简洁的替代方法，用前向传播取代后向分摊（backward spreading），提高各编辑层目标隐藏状态的准确性和相互兼容性。

Comments ICML 2026, code: https://github.com/jugechengzi/FE

详情

AI中文摘要

LLM参数编辑方法通常先在目标层计算理想的目标隐藏状态（称为锚点），再将该目标向量分摊到前面的多个层（通常称为后向分摊，backward spreading），以实现协同编辑。尽管这种做法已被长期广泛使用，其理论基础尚未得到系统研究。本文首先对其基础进行系统研究，以厘清其能力边界、实际考虑和潜在失败模式。然后，我们提出一种简单优雅的替代方案，用前向传播取代后向分摊：不再在最后一个编辑层优化目标，而是在第一个编辑层优化锚点，然后将其向前传播，为后续所有编辑层获得准确且相互兼容的目标隐藏状态。该方法的计算复杂度与现有方法相同，同时得到更准确的逐层目标。我们的方法简单，不干扰初始目标隐藏状态的计算，也不影响后续编辑流程的其他组件，因此可使多种LLM参数编辑方法受益。

英文摘要

LLM parameter editing methods commonly compute an ideal target hidden state at a target layer (referred to as the anchor point) and distribute the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although this practice has long been widely used, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden states for all subsequent editing layers. This approach has the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple and does not interfere with either the computation of the initial target hidden state or any other component of the subsequent editing pipeline, and thus benefits a wide range of LLM parameter editing methods.
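The forward-propagation idea can be illustrated with a toy sketch in which each layer is a plain function on a vector. Real transformer blocks also depend on the surrounding tokens, so the `layers` list, `make_layer`, and `forward_replay` are hypothetical stand-ins, not the authors' code.

```python
def make_layer(scale):
    """Toy layer: multiplies every component by a constant."""
    def layer(h):
        return [scale * x for x in h]
    return layer

def forward_replay(layers, anchor, first_edit, last_edit):
    """Propagate an anchor optimized at `first_edit` through the later layers.

    layers[idx] maps the output of layer idx - 1 to the output of layer idx.
    Returns {layer_index: target hidden state} for first_edit..last_edit.
    """
    targets = {first_edit: list(anchor)}
    h = list(anchor)
    for idx in range(first_edit + 1, last_edit + 1):
        h = layers[idx](h)
        targets[idx] = h
    return targets

layers = [make_layer(k) for k in (1.0, 2.0, 3.0, 4.0)]
anchor = [1.0, -1.0]  # assumed to be already optimized at layer 1
targets = forward_replay(layers, anchor, first_edit=1, last_edit=3)
```

Because every later target is produced by the layers themselves, the targets are mutually consistent by construction, which is the compatibility property the abstract emphasizes.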


2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； New York University（纽约大学）

AI总结提出VESTA框架，通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验，提升视觉语言模型在复杂统计建模任务上的性能。

详情

AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤，但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型（VLM）来迭代地提出和优化统计模型，但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制，我们引入了VESTA：基于统计工具代理的视觉探索，这是一个框架，为VLM配备了一个动态增长的探索工具包，通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同，VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据，这些工具会累积在模型的上下文中，并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线：无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估，我们引入了DAWN（自动工作流和数值建模数据集），这是一个针对分布拟合和时间序列建模的基准，具有不同的难度等级，并最终涉及真实世界的天文学任务，包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线，在复杂和特定领域的任务上取得了最大的收益。我们进一步表明，动态生成的工具比现有视觉工具创建系统生成的工具复杂得多，每个函数覆盖更多的诊断类别，并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.


2606.06076 2026-06-09 cs.AI cs.CV 版本更新

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

通过模态差距感知自蒸馏从符号状态学习视觉空间规划

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

发表机构 * Tsinghua University（清华大学）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结提出MGSD两阶段框架，通过冷启动接地和特权教师蒸馏弥合视觉与符号规划之间的模态差距，在视觉规划基准上显著提升性能。

Comments 17 pages, preprint

详情

AI中文摘要

尽管视觉-语言模型在通用多模态理解方面表现出色，但在视觉空间规划上仍存在困难。我们将其归因于感知-推理模态差距：视觉规划要求模型从像素中推断潜在状态结构，然后对恢复的结构进行推理以产生有效动作，而符号规划直接利用显式对象和约束。这造成了视觉状态恢复和多步规划的双重瓶颈。为解决此问题，我们提出MGSD，一种两阶段模态差距感知自蒸馏框架。首先，冷启动接地阶段为视觉学生模型配备可靠的状态表示，最小化早期感知噪声。其次，特权教师通过在线策略蒸馏转移规划能力，使用显式符号状态监督学生自身的视觉 rollout 前缀。关键在于，符号数据仅在训练期间使用，推理完全基于视觉。在视觉规划基准上的实验表明，MGSD在4B和8B骨干网络上均持续提升视觉规划性能，宏观平均值分别提高19.3%和18.4%。所得模型缩小了与符号输入上限的差距，而消融和诊断实验证实改进来自视觉状态恢复和最优路径推理。这些结果表明，模态差距感知自蒸馏不仅改善了模型感知可行动状态的方式，也改善了它们在推断结构上进行规划的能力。代码见 https://github.com/Oranger-l/MGSD。

英文摘要

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.


2604.24583 2026-06-09 cs.CV 版本更新

Improving Vision-language Models with Perception-centric Process Reward Models

通过以感知为中心的过程奖励模型改进视觉-语言模型

Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； Bytedance（字节跳动）； University of California, San Diego（加州大学圣地亚哥分校）； The Hong Kong University of Science and Technology（香港科技大学）

AI总结本文提出Perceval模型，通过token级错误定位提升视觉-语言模型的推理能力，通过感知驱动的监督策略实现细粒度训练与推理优化，实验显示在多个领域基准上显著提升性能。

Comments 8 pages

详情

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33099-33109

AI中文摘要

近期，带可验证奖励的强化学习（RLVR）的进展显著提升了视觉-语言模型（VLMs）的复杂推理能力。然而，其结果级监督过于粗略，无法诊断和纠正推理链中的错误。为此，我们提出了Perceval，一种支持token级错误定位的过程奖励模型（PRM）：它从回答中提取与图像相关的声明，逐一与图像中的视觉证据进行比较，最终返回包含感知错误的声明。Perceval使用感知密集的监督训练数据进行训练，随后被整合到RL训练过程中以训练策略模型。具体而言，传统GRPO使用序列级优势，而我们对Perceval识别出的幻觉片段施加针对性惩罚，从而使用token级优势，提供细粒度的监督信号。除了增强训练过程外，Perceval还能在推理阶段协助VLMs：借助Perceval，可以截断模型回答中的错误部分，然后让模型直接重新生成回答，或引导模型反思其先前输出。此过程可以重复多次以实现测试时扩展。实验表明，该方法在多个经RL训练的推理VLM上、在不同领域的基准上均带来显著提升，突显了以感知为中心的监督作为通用策略的潜力。在测试时扩展方面，它相比多数投票等其他策略也表现出一致的性能提升。我们的代码和数据将在 https://github.com/RUCAIBox/Perceval 公开发布。

英文摘要

Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding: it extracts image-related claims from the response and compares them one by one with the visual evidence in the image, ultimately returning the claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
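The difference between sequence-level and token-level advantages can be shown with a minimal sketch. The group normalization follows the usual GRPO recipe. The fixed `penalty` subtracted inside flagged spans is an assumption, since the abstract does not give the exact form of the penalty, and the span indices stand in for Perceval's output.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Sequence-level advantages: rewards normalized within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(seq_adv, num_tokens, flagged_spans, penalty=1.0):
    """Broadcast a sequence advantage to tokens, then penalize flagged spans.

    flagged_spans: list of (start, end) token indices, end exclusive, standing in
    for the spans a process reward model marks as perceptual errors.
    """
    adv = [seq_adv] * num_tokens
    for start, end in flagged_spans:
        for t in range(max(0, start), min(num_tokens, end)):
            adv[t] = seq_adv - penalty
    return adv

seq_advs = group_advantages([1.0, 0.0, 1.0, 0.0])
per_token = token_advantages(1.0, num_tokens=5, flagged_spans=[(1, 3)], penalty=0.5)
```

With this shape, a response that reaches the right answer can still receive a lower learning signal on exactly the tokens that misdescribe the image.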


2604.17488 2026-06-09 cs.CV 版本更新

AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

AutoVQA-G：用于自动视觉问答与接地标注的自我改进代理框架

Rongsheng Hu, Runwei Guan, Yicheng Di, Jiayu Bao, Yuan Liu

发表机构 * School of Artificial Intelligence（人工智能学院）

AI总结本文提出AutoVQA-G框架，通过迭代优化流程提升视觉问答接地标注的准确性，优于现有多模态LLM，为构建高质量数据促进更稳健的视觉语言模型训练提供新方法。

Comments Accepted at IEEE ICASSP 2026. 5 pages, 5 figures. Code available at https://github.com/rohnson1999/AutoVQA-G

详情

DOI: 10.1109/ICASSP55912.2026.11462754
Journal ref: Proc. 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12312-12316, 2026

AI中文摘要

手动标注高质量的视觉问答与接地（VQA-G）数据集对于推动视觉语言模型（VLMs）的发展至关重要，但难以扩展。现有自动化方法常受限于两个关键问题：（1）由于模型幻觉导致的数据一致性差；（2）基于简单启发法的脆弱验证机制。为解决这些限制，我们引入了AutoVQA-G，一种自我改进的代理框架，用于自动化VQA-G标注。AutoVQA-G采用迭代细化循环，其中一致性评估模块使用链式推理（CoT）进行细粒度视觉验证。基于此反馈，一个记忆增强的提示优化代理分析失败样本的批评，逐步优化生成提示。我们的实验表明，AutoVQA-G生成的VQA-G数据集在视觉接地准确性上优于领先的多模态LLM，为创建高质量数据以促进更稳健的VLM训练和评估提供有前景的方法。代码：https://github.com/rohnson1999/AutoVQA-G

英文摘要

Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G


2507.18967 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

利用深度学习进行水下垃圾检测：YOLOv7到YOLOv10与Faster R-CNN的性能比较

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology（斯里兰卡信息技术学院计算学院）； Faculty of Information Technology and Communication Sciences, Tampere University（坦佩雷大学信息技术与通信科学学院）； Computing Centre, Faculty of Engineering, University of Peradeniya（佩拉德尼亚大学工程学院计算中心）

AI总结本文比较了YOLOv7到YOLOv10及Faster R-CNN在水下垃圾检测中的性能，在涵盖低能见度和不同深度等条件的十五类数据集上，YOLOv8表现最佳，mAP达80.9%。

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

详情

Journal ref: Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)

AI中文摘要

水下污染是当今最严重的环境问题之一，世界各地的海洋、河流和各类环境中都存在大量垃圾。准确检测这些垃圾对废物管理、环境监测和缓解策略至关重要。本文研究了五种先进的目标检测算法，即四个YOLO模型（YOLOv7、YOLOv8、YOLOv9、YOLOv10）和Faster R-CNN，以确定哪种模型在水下环境中识别物体最有效。这些模型在一个包含十五个类别、涵盖低能见度和不同深度等多种条件的大型数据集上进行了充分的训练和测试。结果显示，YOLOv8表现最佳，mAP为80.9%。作者将这一性能提升归因于YOLOv8的架构，其包含改进的无锚机制和自监督学习，从而在各种环境中实现更精确、更高效的识别。这些发现突显了YOLOv8模型在全球抗污染斗争中的潜力，可提升水下清理作业的检测能力和可扩展性。

英文摘要

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object detection algorithms, namely four YOLO (You Only Look Once) models (YOLOv7, YOLOv8, YOLOv9, and YOLOv10) and the Faster Region-based Convolutional Neural Network (Faster R-CNN), to identify which model is most effective at recognizing materials in underwater environments. The models were trained and tested on a large dataset containing fifteen classes under diverse conditions, such as low visibility and variable depths. Among these models, YOLOv8 performed best, with a mean Average Precision (mAP) of 80.9%. This performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.


2603.24942 2026-06-09 cs.CV 版本更新

BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

BiFM：双向流匹配用于少步图像编辑与生成

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

发表机构 * Australian National University（澳大利亚国立大学）； Data61-CSIRO

AI总结 BiFM通过双向流匹配框架在单一模型中联合学习生成与反演（inversion）过程，解决少步采样中正向过程近似差的问题，提升图像编辑质量与通用性。

Comments Accepted in CVPR2026

详情

AI中文摘要

最近的扩散和流匹配模型通过迭代采样逐步去除噪声，在图像生成和编辑方面展现出强大能力。这使得灵活的反演（inversion）可用于保持语义的编辑，但在少步采样设置下，正向过程近似较差，导致编辑质量下降。现有的少步反演方法通常依赖预训练生成器和辅助模块，限制了在不同架构间的可扩展性和泛化能力。为了解决这些问题，我们提出了BiFM（双向流匹配），一个在单一模型中联合学习生成和反演的统一框架。BiFM直接估计“图像→噪声”和“噪声→图像”两个方向的平均速度场，并受共享的瞬时速度场约束，该瞬时速度场由预定义的调度或预训练的多步扩散模型导出。此外，BiFM引入了一种使用连续时间间隔监督的新训练策略，并通过双向一致性目标和轻量级时间间隔嵌入加以稳定。这种双向形式还支持一步反演，并可无缝集成到流行的扩散和流匹配骨干中。在多样化的图像编辑和生成任务中，BiFM一致优于现有的少步方法，实现了更优的性能和可编辑性。

英文摘要

Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.


2501.15505 2026-06-09 cs.RO cs.CV cs.HC 版本更新

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

揭示iMarkers的潜力：用于高级机器人的隐形标志物

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg（自动化与机器人研究组，安全、可靠性与信任跨学科中心（SnT），卢森堡大学）； Faculty of Science, Technology, and Medicine, University of Luxembourg（科学、技术与医学学院，卢森堡大学）； Department of Physics & Materials Science, University of Luxembourg（物理与材料科学系，卢森堡大学）； Institute for Advanced Studies, University of Luxembourg（先进研究学院，卢森堡大学）

AI总结本文提出iMarkers，一种隐形标志物，可被机器人和AR设备检测，解决了传统标志物影响视觉美观的问题，展示了其在机器人应用中的灵活性和有效性。

Comments 19 pages, 10 figures, 4 tables

详情

AI中文摘要

标志物在机器人导航、物体识别和场景理解中被广泛应用。尽管为机器人和增强现实（AR）应用提供了显著优势，但它们通常会破坏环境的视觉美观，因为它们对人类可见，因此不适合许多日常使用场景。为了解决这一差距，本文提出了iMarkers，即创新的、不显眼的标志物，仅能被机器人和配备适当传感器和检测算法的AR设备检测。这些标志物在生产中具有高度灵活性，允许根据各种需求定制其可见范围和编码算法。本文还介绍了用于检测iMarkers的硬件设计和开源软件算法，突显了其在检测和识别阶段的适应性和鲁棒性。大量评估已证明iMarkers相对于传统（印刷）和混合标志物的有效性，并确认了其在多样化机器人场景中的适用性。

英文摘要

Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.


2511.10500 2026-06-09 cs.CV 版本更新

Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

可学习总变分与Lambda映射用于低剂量CT去噪

Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

发表机构 * University of Michigan（密歇根大学）

AI总结本文提出可学习总变分框架，通过结合展开的总变分求解器与LambdaNet预测像素级正则化图，实现空间自适应平滑，实验显示在低剂量CT去噪中优于传统TV和FBP+U-Net。

详情

DOI: 10.1109/ISBI61048.2026.11515960
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

AI中文摘要

尽管总变分（TV）在噪声抑制和边缘保持方面表现出色，但其对标量正则化参数的依赖限制了适应性。在本研究中，我们提出了一种可学习总变分（LTV）框架，将展开的TV求解器与预测像素级正则化图的LambdaNet相结合。所提出的框架端到端训练以优化重建和正则化，实现空间自适应平滑。在DeepLesion数据集上使用现实LoDoPaB-CT模拟实验表明，LTV在低剂量CT去噪中优于传统TV和FBP+U-Net，实现了最高+3.7 dB PSNR和8%的相对SSIM改进。LTV为低剂量CT去噪提供了可解释的替代方案，而非黑箱CNN。

英文摘要

While Total Variation (TV) excels in noise reduction and edge preservation, its reliance on a scalar regularization parameter limits adaptivity. In this study, we present a Learnable Total Variation (LTV) framework coupling an unrolled TV solver with a LambdaNet that predicts a per-pixel regularization map. The proposed framework is trained end-to-end to optimize reconstruction and regularization jointly, yielding spatially adaptive smoothing. Experiments on the DeepLesion dataset, using realistic LoDoPaB-CT simulation, show consistent gains over classical TV and FBP+U-Net, achieving up to +3.7 dB PSNR and 8% relative SSIM improvement. LTV provides an interpretable alternative to black-box CNNs for low-dose CT denoising.
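A minimal 1D sketch shows how a per-sample regularization map changes TV denoising. It unrolls a fixed number of gradient steps on a smoothed TV objective. The hand-set `lam` list stands in for the LambdaNet output, and the smoothing constant, step size, and gradient-descent solver are illustrative assumptions, not the paper's unrolled solver.

```python
import math

def total_variation(x):
    """Sum of absolute differences between neighbouring samples."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))

def unrolled_tv_denoise(y, lam, steps=50, step_size=0.02, eps=1e-2):
    """Gradient steps on 0.5*||x - y||^2 + sum_i lam[i]*sqrt((x[i+1]-x[i])^2 + eps).

    lam[i] weights the difference between samples i and i + 1, so the
    smoothing strength can vary along the signal (the last entry is unused).
    """
    x = list(y)
    n = len(x)
    for _ in range(steps):
        grad = [xi - yi for xi, yi in zip(x, y)]  # data-fidelity term
        for i in range(n - 1):
            d = x[i + 1] - x[i]
            g = lam[i] * d / math.sqrt(d * d + eps)  # smoothed TV term
            grad[i] -= g
            grad[i + 1] += g
        x = [xi - step_size * gi for xi, gi in zip(x, grad)]
    return x

noisy = [0.0, 1.0, 0.0, 1.0, 0.0]
uniform = unrolled_tv_denoise(noisy, [0.5] * 5)
untouched = unrolled_tv_denoise(noisy, [0.0] * 5)
adaptive = unrolled_tv_denoise(noisy, [0.5, 0.5, 0.0, 0.0, 0.0])
```

In `adaptive`, only the left part of the signal is smoothed while the samples whose neighbouring weights are zero are returned unchanged, which is the spatial adaptivity a scalar regularization parameter cannot provide.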

2511.14639 2026-06-09 cs.CV Updated version

SLAM-AGS: Slide-Label Aware Multi-Task Pretraining Using Adaptive Gradient Surgery in Computational Cytology

Marco Acerbis, Swarnadip Chatterjee, Christophe Avenel, Joakim Lindblad

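Affiliations * University of Cambridge

AI summary: SLAM-AGS jointly optimizes a weakly supervised similarity objective and a self-supervised contrastive objective to improve downstream performance, and uses Adaptive Gradient Surgery to resolve conflicting task gradients, giving stable pretraining and better results at low witness rates.

Comments 5 pages, 2 figures, Submitted to ISBI2026

Details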

DOI: 10.1109/ISBI61048.2026.11515859
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

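AI abstract (translated from Chinese)

Computational cytology faces two main challenges: (i) instance-level labels are unreliable and costly to obtain, and (ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes a weakly supervised similarity objective on slide-negative patches and a self-supervised contrastive objective on slide-positive patches, yielding stronger downstream performance. To stabilize learning, we apply Adaptive Gradient Surgery to resolve conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-score and Top 400 positive cell retrieval over other pretraining methods, particularly at low witness rates, showing that resolving gradient interference enables stable pretraining and better downstream performance. To support reproducibility, the complete implementation and evaluation framework are shared as open source: https://github.com/Ace95/SLAM-AGS.

English abstract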

Computational cytology faces two major challenges: i) instance-level labels are unreliable and prohibitively costly to obtain, ii) witness rates are extremely low. We propose SLAM-AGS, a Slide-Label-Aware Multitask pretraining framework that jointly optimizes (i) a weakly supervised similarity objective on slide-negative patches and (ii) a self-supervised contrastive objective on slide-positive patches, yielding stronger performance on downstream tasks. To stabilize learning, we apply Adaptive Gradient Surgery to tackle conflicting task gradients and prevent model collapse. We integrate the pretrained encoder into an attention-based Multiple Instance Learning aggregator for bag-level prediction and attention-guided retrieval of the most abnormal instances in a bag. On a publicly available bone-marrow cytology dataset, with simulated witness rates from 10% down to 0.5%, SLAM-AGS improves bag-level F1-Score and Top 400 positive cell retrieval over other pretraining methods, with the largest gains at low witness rates, showing that resolving gradient interference enables stable pretraining and better performance on downstream tasks. To facilitate reproducibility, we share our complete implementation and evaluation framework as open source: https://github.com/Ace95/SLAM-AGS.
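
One common way to resolve conflicting task gradients is a PCGrad-style projection, sketched below in plain Python. This is a stand-in technique, not the paper's Adaptive Gradient Surgery, whose details the abstract does not give; `project_conflict` and `combine_gradients` are hypothetical names, and the two-element gradients are toy values.

```python
def dot(a, b):
    """Inner product of two equal-length gradient vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g, other):
    """Remove from g its component along `other` when the two conflict.

    Two gradients conflict when their inner product is negative.
    Non-conflicting gradients are returned unchanged.
    """
    d = dot(g, other)
    if d >= 0:
        return list(g)
    scale = d / dot(other, other)  # other is nonzero here because d < 0
    return [x - scale * y for x, y in zip(g, other)]


def combine_gradients(g_a, g_b):
    """Project each task gradient against the other, then sum them."""
    p_a = project_conflict(g_a, g_b)
    p_b = project_conflict(g_b, g_a)
    return [x + y for x, y in zip(p_a, p_b)]


# Toy gradients for two objectives that pull in partly opposite directions.
g_similarity = [1.0, 0.0]
g_contrastive = [-1.0, 1.0]

update = combine_gradients(g_similarity, g_contrastive)
```

After projection, each task gradient is orthogonal to the other task's original gradient, so neither update step directly undoes the other.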

2510.13381 2026-06-09 cs.CV cs.GR Updated version

Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

Siddharth Tourani, Jayaram Reddy, Akash Kumbar, Satyajit Tourani, Nishant Goyal, Madhava Krishna, N. Dinesh Reddy, Muhammad Haris Khan

Affiliations * IIIT Hyderabad; MBZUAI; University of Heidelberg; VLM Run; IIT Kharagpur

AI summary: This paper combines 2D object-agnostic priors with an SDF representation for dynamic urban scene rendering, working without LiDAR data while improving geometric accuracy and deformation modeling.

Comments Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/

Details

DOI: 10.1109/ICCV51701.2025.02698
Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

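AI abstract (translated from Chinese)

Dynamic scene rendering and reconstruction are crucial in computer vision and augmented reality. Methods based on 3D Gaussian Splatting (3DGS) can model dynamic urban scenes accurately, but they require camera and LiDAR data, ground-truth 3D segmentations, and motion data. This paper explores whether combining 2D depth and point-tracking priors with an SDF representation can relax these requirements. We propose a new approach that integrates SDFs with 3DGS, drawing on the strengths of both to make object representations more robust. A unified optimization framework improves the geometric accuracy of 3DGS and the deformation modeling within the SDF, giving a more adaptable and precise representation. Experiments show that the method achieves state-of-the-art rendering metrics even without LiDAR data. With LiDAR, it further improves reconstruction and novel view synthesis across object categories, without ground-truth 3D motion annotations. The method also supports scene editing tasks, including scene decomposition and scene composition.

English abstract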

Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS), have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improved further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition, and scene composition.

2501.03957 2026-06-09 cs.HC cs.CV Updated version

Vision Language Models as Values Detectors

Giulio Antonio Abbo, Tony Belpaeme

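Affiliations * IDLab-AIRO, Ghent University – imec, Belgium

AI summary: This paper studies how well state-of-the-art LLMs align with human annotators in detecting relevant elements in home environment scenarios; LLaVA 34B performs best but still needs improvement, and the results suggest LLMs have potential for detecting value-laden elements in images.

Comments 13 pages, 2 figures

Details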

DOI: 10.1007/978-3-031-85463-7_5
Journal ref: Value Engineering in Artificial Intelligence (VALE 2024) (LNAI,volume 15356)

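AI abstract (translated from Chinese)

Large language models that integrate textual and visual inputs open new possibilities for interpreting complex data. Although they can generate coherent, contextually relevant text, how well they align with human perception in identifying relevant elements in images still needs exploration. This paper studies the alignment between state-of-the-art LLMs and human annotators in detecting relevant elements in home environment scenarios. We created twelve images depicting different domestic scenarios and asked fourteen annotators to identify the key element in each image. These human responses were then compared with the outputs of five LLMs, including GPT-4o and four LLaVA variants. The findings show varying degrees of alignment, with LLaVA 34B performing best but still scoring low. Even so, an analysis of the results points to the models' potential for detecting value-laden elements in images, suggesting that with improved training and refined prompts LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by giving deeper insights and more relevant responses.

English abstract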

Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.

2411.18385 2026-06-09 cs.LG cs.CV stat.ML Updated version

Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai

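Affiliations * Department of Computer Science and Engineering, IIT Kanpur, India

AI summary: This paper proposes an efficient federated learning method that uses second-order optimization to cut computation and communication costs while keeping the uncertainty and personalization benefits of Bayesian approaches.

Details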

Journal ref: Transactions on Machine Learning Research (TMLR), 2025

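AI abstract (translated from Chinese)

Federated Learning (FL) has emerged as a promising way to learn collaboratively from decentralized, heterogeneous data held by different clients without the data leaving those clients. Recent FL work advocates a Bayesian approach because it offers a principled way to account for model and predictive uncertainty by learning posterior distributions for the client and/or server models. Bayesian FL also naturally enables personalization, handling data heterogeneity across clients by letting each client learn its own personalized model. In particular, the hierarchical Bayesian approach lets all clients learn personalized models while accounting for commonalities through a prior distribution provided by the server. Despite these advantages, Bayesian approaches to FL can be costly in both computation and communication, because posterior distributions must be computed and sent. We propose a new Bayesian FL method based on efficient second-order optimization whose computational cost is similar to that of first-order methods such as Adam, while providing the benefits of the Bayesian approach (e.g., uncertainty, personalization) and being more efficient and accurate than state-of-the-art Bayesian FL methods in both standard and personalized FL settings. Our method outperforms the baselines, which include both optimization-based and Bayesian FL methods, in predictive accuracy and uncertainty estimation.

English abstract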

Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

2101.01060 2026-06-09 cs.CV cs.AI cs.MM Updated version

Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming

Jizhe Zhou, Chi-Man Pun

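Affiliations * IEEE

AI summary: This paper proposes FPVLS, which uses a two-stage frame-to-video structure to automate privacy filtering in video live streaming, addressing target drifting, computing efficiency, and over-pixelation.

Details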

DOI: 10.1109/TIFS.2020.3029913
Journal ref: IEEE Transactions on Information Forensics and Security, 16, 1088-1103 (2020)

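AI abstract (translated from Chinese)

To date, pixelation for privacy protection remains labor-intensive and understudied. With the spread of video live streaming, an online face pixelation mechanism for live streams has become an urgent need. This paper develops a new method, Face Pixelation in Video Live Streaming (FPVLS), to generate personal privacy filtering automatically during unconstrained streaming activities. Simply applying multi-face trackers runs into target drifting, limited computing efficiency, and over-pixelation. To pixelate the faces of irrelevant people quickly and accurately, FPVLS therefore adopts a two-stage frame-to-video structure. On individual frames, FPVLS uses image-based face detection and embedding networks to produce face vectors. In the raw trajectory generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm uses face vectors and position information to quickly associate the same person's faces across frames. Raw trajectories accumulated frame by frame can be intermittent and unreliable at the video level. A trajectory refinement stage therefore combines a proposal network with a two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is applied to the refined trajectories for the final pixelation. On the video live streaming dataset we collected, FPVLS achieves satisfying accuracy and real-time efficiency, and keeps over-pixelation in check.

English abstract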

To date, the privacy-protection intended pixelation tasks are still labor-intensive and yet to be studied. With the prevailing of video live streaming, establishing an online face pixelation mechanism during streaming is an urgency. In this paper, we develop a new method called Face Pixelation in Video Live Streaming (FPVLS) to generate automatic personal privacy filtering during unconstrained streaming activities. Simply applying multi-face trackers will encounter problems in target drifting, computing efficiency, and over-pixelation. Therefore, for fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages. On individual frames, FPVLS utilizes image-based face detection and embedding networks to yield face vectors. In the raw trajectories generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm leverages face vectors and positioned information to quickly associate the same person's faces across frames. Such frame-wise accumulated raw trajectories are likely to be intermittent and unreliable on video level. Hence, we further introduce the trajectory refinement stage that merges a proposal network with the two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is laid on the refined trajectories for final pixelation. On the video live streaming dataset we collected, FPVLS obtains satisfying accuracy, real-time efficiency, and contains the over-pixelation problems.

1909.02747 2026-06-09 eess.IV cs.CV cs.LG stat.ML Updated version

Eelgrass beds and oyster farming at a lagoon before and after the Great East Japan Earthquake 2011: potential to apply deep learning at a coastal area

Takehisa Yamakita

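Affiliations * Marine Biodiversity and Environmental Assessment Research Center (BioEnv)

AI summary: This paper compares manual tracing, simple image segmentation, and deep-learning image transformation for automatic land cover classification of seagrass beds, sandy areas, and oyster farming rafts at Mangoku-ura Lagoon, Miyagi, Japan, showing the potential of deep learning for extracting spatial patterns in a coastal area after the earthquake.

Details

DOI: 10.1109/IGARSS.2019.8900354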

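AI abstract (translated from Chinese)

This paper studies automatic land cover classification of seagrass beds, sandy areas, and oyster farming rafts at Mangoku-ura Lagoon, Miyagi, Japan, by comparing manual tracing, simple image segmentation, and deep-learning image transformation. The image transformation method gave the best output resolution and reached more than 69% accuracy for vegetation classification, assessed with random points on independent test data. The distribution of oyster farming rafts was detected by the segmentation model. Comparing the manual tracing and image transformation results before and after the earthquake showed an increase in sandy area and a decrease in vegetation. The segmentation model detected only the decrease in oyster farming. These results demonstrate the potential of deep learning for extracting spatial patterns after an earthquake and tsunami.

English abstract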

There is a small number of case studies of automatic land cover classification on the coastal area. Here, I test extraction of seagrass beds, sandy area, oyster farming rafts at Mangoku-ura Lagoon, Miyagi, Japan by comparing manual tracing, simple image segmentation, and image transformation using deep learning. The result was used to extract the changes before and after the earthquake and tsunami. The output resolution was best in the image transformation method, which showed more than 69% accuracy for vegetation classification by an assessment using random points on independent test data. The distribution of oyster farming rafts was detected by the segmentation model. Assessment of the change before and after the earthquake by the manual tracing and image transformation result revealed increase of sand area and decrease of the vegetation. By the segmentation model only the decrease of the oyster farming was detected. These results demonstrate the potential to extract the spatial pattern of these elements after an earthquake and tsunami. Index Terms: Great East Japan Earthquake of 2011, Land use land cover (LULC), Zosteracea seagrass, cultured oyster, deep learning, Mangoku Bay

1. Multimodal and Vision-Language Models (50 papers)

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention

Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models

Do VLMs See What Sensors Feel? A Scalable Expert-Guided Design for Wheelchair Accessibility Assessment from Street View

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

Struct-Searcher: Agentic Structural Thinking Advances Multimodal Deep Information Seeking

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

Vision-Language Asymmetry in Bistable Image Captioning

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

Test-Time Scaling in Multimodal Foundation Models: A Comprehensive Survey of Generation and Reasoning

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

TVI-CoT: Text-Visual Interleaved Chain-of-Thought Reasoning for Multimodal Understanding

Look Less, Reason More: Block-wise Attention Skipping for Efficient Multimodal LLMs

PRPO: Perception-Reinforced Policy Optimization via Token-Level Dynamic Advantage Reshaping

Thinking Without Images: Internalizing Visual Manipulation with On-Policy Self-Distillation

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

CRANE: Knowledge Editing for Reasoning MLLMs

Driving Video Retrieval for Complex Queries with Structured Grounding

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

Vision-Language Guided Hyperspectral Object Tracking via Semantics Fusion and Contextual Template Updating

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

Collaborative Edge-to-Server Inference for Vision-Language Models

COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing

EdgeFM: Efficient Edge Inference for Vision-Language Models

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing

Hummus: A Dataset of Humorous Multimodal Metaphor Use

PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition

2. Embodied AI, Robotics, and Autonomous Driving (46 papers)

Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic

TBD-VLA: Temporal Block Diffusion Vision Language Action Model

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Distortion-Aware PETR for BEV Object Detection with Mixed Pinhole-Fisheye Cameras

BLUE: Toward Better Language Use in Efficient Vision-Language-Action Models for Autonomous Driving

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Scaling by Diversified Experience for Vision-Language-Action Models

EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video

EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

Real-time body pose non-verbal communication with a consistency-based reliability measure

Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems

ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

Revisiting Articulated Parts Perception in Robot Manipulation

GraspFoM: Towards Reconstruction-Driven Robotic Grasping with 3D Foundation Priors

EgoPriMo: Egocentric Motion Generation for Interactive Humanoid Control

When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

PhysAgent: Automating Physics-Based 4D Synthesis via Trajectory-Grounded Multi-Agent Feedback

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

Trajectory Optimization in Single and Dual-UAV Bearing-Only Target Localization

Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification

Dense Force Estimation with an Event-based Optical Tactile Sensor